Getting Table Info From Page Using Python and BeautifulSoup - python

The page I am trying to get info from is https://www.pro-football-reference.com/teams/crd/2017_roster.htm.
I'm trying to get all the information from the "Roster" table, but for some reason I can't get it through BeautifulSoup. I've tried soup.find("div", {'id': 'div_games_played_team'}) but it doesn't work. When I look at the page's HTML I can see the table inside a very large comment and in a regular div. How can I use BeautifulSoup to get the information from this table?

You don't need Selenium. What you can do (and you correctly identified it) is pull out the comments and then parse the table from within them.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
url = 'https://www.pro-football-reference.com/teams/crd/2017_roster.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except ValueError as e:
            print(e)
            continue
Output:
print(tables[0].head().to_string())
No. Player Age Pos G GS Wt Ht College/Univ BirthDate Yrs AV Drafted (tm/rnd/yr) Salary
0 54.0 Bryson Albright 23.0 NaN 7 0.0 245.0 6-5 Miami (OH) 3/15/1994 1 0.0 NaN $246,177
1 36.0 Budda Baker*+ 21.0 ss 16 7.0 195.0 5-10 Washington 1/10/1996 Rook 9.0 Arizona Cardinals / 2nd / 36th pick / 2017 $465,000
2 64.0 Khalif Barnes 35.0 NaN 3 0.0 320.0 6-6 Washington 4/21/1982 12 0.0 Jacksonville Jaguars / 2nd / 52nd pick / 2005 $176,471
3 41.0 Antoine Bethea 33.0 db 15 6.0 206.0 5-11 Howard 7/27/1984 11 4.0 Indianapolis Colts / 6th / 207th pick / 2006 $2,000,000
4 28.0 Justin Bethel 27.0 rcb 16 6.0 200.0 6-0 Presbyterian 6/17/1990 5 3.0 Arizona Cardinals / 6th / 177th pick / 2012 $2,000,000
....

The tag you are trying to scrape is dynamically generated by JavaScript. You are most likely using requests to scrape your HTML. Unfortunately, requests will not run JavaScript because it pulls all the HTML in as raw text. BeautifulSoup cannot find the tag because it was never generated within your scraping program.
I recommend using Selenium. It's not a perfect solution - just the best one for your problem. The Selenium WebDriver will execute the JavaScript to generate the page's HTML. Then you can use BeautifulSoup to parse whatever it is that you are after. See Selenium with Python for further help on how to get started.
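If you go this route, a minimal sketch might look like the following. It assumes chromedriver is on your PATH, and the table id games_played_team is only inferred from the div id in the question, so verify it against the page source:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.pro-football-reference.com/teams/crd/2017_roster.htm')
soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the rendered DOM
driver.quit()

table = soup.find('table', {'id': 'games_played_team'})  # id inferred from the question's div id; may differ
if table is not None:
    roster = pd.read_html(str(table))[0]
    print(roster.head())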

Related

beautiful soup find_all() not returning all elements

I am trying to scrape this website using bs4. Using inspect on a particular car ad tile, I figured out what I need to scrape in order to get the title & the link to the car's page.
I am making use of the find_all() function of the bs4 library but the issue is that it's not scraping the required info of all the cars. It returns only info of about 21, whereas it's clearly visible on the website that there are about 2410 cars.
The relevant code:
from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen
import re
import requests
url = 'https://www.cardekho.com/used-cars+in+bangalore'
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = bs(webpage,"html.parser")
tags = page_soup.find_all("div","title")
print(len(tags))
How can I get info on all of the cars present on the page?
P.S. - Want to point out just one thing: all the cars aren't displayed at once. More car info gets loaded as you scroll down. Could it be because of that? Not sure.
Ok, I've written up a sample code to show you how it can be done. Although the site has a convenient api that we can leverage, the first page is not available through the api but is embedded in a script tag in the html code, which requires additional processing to extract. After that it is simply a matter of getting the json data from the api, parsing it into python dictionaries and appending the car entries to a list. The link to the api can be found by inspecting network activity in Chrome or Firefox while scrolling the site.
from bs4 import BeautifulSoup
import re
import json
from subprocess import check_output
import requests
import time
from tqdm import tqdm  # tqdm just implements a progress bar, https://pypi.org/project/tqdm/

cars = []  # create an empty list to which we will append the car dicts from the json data
url = 'https://www.cardekho.com/used-cars+in+bangalore'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content.decode('utf-8'), "html.parser")

# find the section with the json data: look for a script tag of type application/ld+json
# and take the next sibling, which is the one with the data we need (see the page source)
s = soup.find('script', {"type": "application/ld+json"}).next_sibling

# strip the text of unnecessary strings and wrap it so node prints the json, taken from:
# https://stackoverflow.com/questions/54991571/extract-json-from-html-script-tag-with-beautifulsoup-in-python/54992015#54992015
js = 'window = {};\n' + s.text.strip() + ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'
with open('temp.js', 'w') as f:  # save the string to a javascript file
    f.write(js)
data_site = json.loads(check_output(['node', 'temp.js']))  # execute the file with node and load the printed json

for i in data_site['items']:  # iterate over the dict and append all cars to the empty list 'cars'
    cars.append(i)

for page in tqdm(range(20, data_site['total_count'], 20)):  # 'pagefrom' in the api call is 20, 40, 60, etc., so create a range and loop over it
    r = requests.get(f"https://www.cardekho.com/api/v1/usedcar/search?&cityId=105&connectoid=&lang_code=en&regionId=0&searchstring=used-cars%2Bin%2Bbangalore&pagefrom={page}&sortby=updated_date&sortorder=asc&mink=0&maxk=200000&dealer_id=&regCityNames=&regStateNames=", headers={'User-Agent': 'Mozilla/5.0'})
    data = r.json()
    for i in data['data']['cars']:  # append the cars from this page to 'cars'
        cars.append(i)
    time.sleep(5)  # wait a few seconds to avoid overloading the site
This will result in cars being a list of dictionaries. The car names can be found in the vid key, and the urls are present in the vlink key.
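For example, a quick sketch of pulling those two fields out of the list (prefixing the relative vlink paths with the site's domain is an assumption on my part):
names = [car.get('vid') for car in cars]
links = ['https://www.cardekho.com' + car.get('vlink', '') for car in cars]  # vlink paths are relative
print(names[:5])
print(links[:5])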
You can load it into a pandas dataframe to explore the data:
import pandas as pd
df = pd.DataFrame(cars)
df.head() will output (I omitted the images column for readability):
Columns: loc, myear, bt, ft, km, it, pi, pn, pu, dvn, ic, ucid, sid, ip, oem, model, vid, city, vlink, p_numeric, webp_image, position, pageNo, centralVariantId, isExpiredModel, modelId, isGenuine, is_ftc, seller_location, utype, views, tmGaadiStore, cls
Rows (index followed by each listing's values; empty cells are dropped):
0 | Koramangala | 2014 | SUV | Diesel | 30,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2206305_1614944913.jpg | 9.9 Lakh | Mahindra XUV500 W6 2WD | 13 | 3019084 | 9509A09F1673FE2566DF59EC54AAC05B | 1 | Mahindra | Mahindra XUV500 | Mahindra XUV500 2011-2015 W6 2WD | Bangalore | /used-car-details/used-Mahindra-XUV500-2011-2015-W6-2WD-cars-Bangalore_9509A09F1673FE2566DF59EC54AAC05B.htm | 990000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2206305_1614944913.webp | 1 | 1 | 3822 | True | 570 | 0 | 0 | {'address': 'BDA Complex, 100 Feet Rd, 3rd Block, Koramangala 3 Block, Koramangala, Bengaluru, Karnataka 560034, Bangalore', 'lat': 12.931, 'lng': 77.6228} | Dealer | 235 | False
1 | Marathahalli Colony | 2017 | SUV | Petrol | 30,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2203506_1614754307.jpeg | 7.85 Lakh | Ford Ecosport 1.5 Petrol Trend BSIV | 14 | 3015331 | 2C0E4C4E543D4792C1C3186B361F718B | 1 | Ford | Ford Ecosport | Ford Ecosport 2015-2021 1.5 Petrol Trend BSIV | Bangalore | /used-car-details/used-Ford-Ecosport-2015-2021-1.5-Petrol-Trend-BSIV-cars-Bangalore_2C0E4C4E543D4792C1C3186B361F718B.htm | 785000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2203506_1614754307.webp | 2 | 1 | 6086 | True | 175 | 0 | 0 | {'address': '2, Varthur Rd, Ayyappa Layout, Chandra Layout, Marathahalli, Bengaluru, Karnataka 560037, Marathahalli Colony, Bangalore', 'lat': 12.956727624875453, 'lng': 77.70174980163576} | Dealer | 495 | False
2 | Yelahanka | 2020 | SUV | Diesel | 13,969 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/usedcar_11_276591614316705_1614316747.jpg | 41 Lakh | Toyota Fortuner 2.8 4WD AT | 12 | 3007934 | BBC13FB62DF6840097AA62DDEA05BB04 | 1 | Toyota | Toyota Fortuner | Toyota Fortuner 2016-2021 2.8 4WD AT | Bangalore | /used-car-details/used-Toyota-Fortuner-2016-2021-2.8-4WD-AT-cars-Bangalore_BBC13FB62DF6840097AA62DDEA05BB04.htm | 4100000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/usedcar_11_276591614316705_1614316747.webp | 3 | 1 | 7618 | True | 364 | 0 | 0 | {'address': 'Sonnappanahalli Kempegowda Intl Airport Road Jala Uttarahalli Hobli, Yelahanka, Bangalore, Karnataka 560064', 'lat': 13.1518821, 'lng': 77.6220694} | Dealer | 516 | False
3 | Byatarayanapura | 2017 | Sedans | Diesel | 18,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2202297_1615013237.jpg | 35 Lakh | Mercedes-Benz E-Class E250 CDI Avantgarde | 15 | 3013606 | 4553943A967049D873712AFFA5F65A56 | 1 | Mercedes-Benz | Mercedes-Benz E-Class | Mercedes-Benz E-Class 2009-2012 E250 CDI Avantgarde | Bangalore | /used-car-details/used-Mercedes-Benz-E-Class-2009-2012-E250-CDI-Avantgarde-cars-Bangalore_4553943A967049D873712AFFA5F65A56.htm | 3500000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2202297_1615013237.webp | 4 | 1 | 4611 | True | 674 | 0 | 0 | {'address': 'NO 19, Near Traffic Signal, Byatanarayanapura, International Airport Road, Byatarayanapura, Bangalore, Karnataka 560085', 'lat': 13.0669588, 'lng': 77.5928756} | Dealer | 414 | False
4 | nan | 2015 | Sedans | Diesel | 80,000 | 0 | https://stimg.cardekho.com/pwa/img/noimage.svg | 12.5 Lakh | Skoda Octavia Elegance 2.0 TDI AT | 1 | 3002709 | 156E5F2317C0A3A3BF8C03FFC35D404C | 1 | Skoda | Skoda Octavia | Skoda Octavia 2013-2017 Elegance 2.0 TDI AT | Bangalore | /used-car-details/used-Skoda-Octavia-2013-2017-Elegance-2.0-TDI-AT-cars-Bangalore_156E5F2317C0A3A3BF8C03FFC35D404C.htm | 1250000 | 5 | 1 | 3092 | True | 947 | 0 | 0 | {'lat': 0, 'lng': 0} | Individual | 332 | False
Or if you wish to explode the dict in seller_location to columns, you can load it with df = pd.json_normalize(cars).
You can save all data to a csv file: df.to_csv('output.csv')

Reading tables with Pandas and Selenium

I am trying to figure out an elegant way to scrape tables from a website. However, when running the script below I am getting a "ValueError: No tables found" error.
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome(executable_path=r'C:...\chromedriver.exe')
driver.implicitly_wait(30)
driver.get("https://www.gallop.co.za/#meeting#20201125#3")
df_list=pd.read_html(driver.find_element_by_id("eventTab_4").get_attribute('outerHTML'))
When I look at the site elements, I notice that this kind of code works if the <table> tag lies neatly within the <div id="...">. However, in this case, I think the code is not working for the following reasons:
There is a <div> within a <div>, and only then the <table> tag.
The site uses JavaScript to render the tables.
Grateful for advice on how to pull the tables for all races. That is, there are several tables which are made visible as the user clicks on each tab (race). I need to extract all of them into separate dataframes.
from selenium import webdriver
import time
import pandas as pd

pd.set_option('display.max_columns', None)

driver = webdriver.Chrome(executable_path='C:/bin/chromedriver.exe')
driver.get("https://www.gallop.co.za/#meeting#20201125#3")
time.sleep(5)

tab = driver.find_element_by_id('tabs')  # all tabs are here
li_list = tab.find_elements_by_tag_name('li')  # they are in "li" elements
a_list = []
for li in li_list[1:]:  # the first tab has nothing, so we skip it
    a = li.find_element_by_tag_name('a')  # extract the "a" element from the "li"
    a_list.append(a)

df = []
for a in a_list:
    a.click()  # next tab
    time.sleep(8)  # tables take some time to load fully
    page = driver.page_source  # get the HTML of the new tab's page
    source = pd.read_html(page)
    table = source[1]  # get the 2nd table
    df.append(table)
print(df)
Output
[ Silk No Horse Unnamed: 3 ACS SH CFR Trainer \
0 NaN 1 JEM ROCK NaN 4bC A NaN Eric Sands
1 NaN 2 MAISON MERCI NaN 3chC A NaN Brett Crawford
2 NaN 3 WORDSWORTH NaN 3bC AB NaN Greg Ennion
3 NaN 4 FOUND THE DREAM NaN 3bG A NaN Adam Marcus
4 NaN 5 IZAPHA NaN 3bG A NaN Andre Nel
5 NaN 6 JACKBEQUICK NaN 3grG A NaN Glen Kotzen
6 NaN 7 MHLABENI NaN 3chG A NaN Eric Sands
7 NaN 8 ORLOV NaN 3bG A NaN Adam Marcus
8 NaN 9 T'CHALLA NaN 3bC A NaN Justin Snaith
9 NaN 10 WEST COAST WONDER NaN 3bG A NaN Piet Steyn
Jockey Wgt MR Dr Odds Last 3 Runs
0 D Dillon 60 0 7 NaN NaN
... (output continues)
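If you want to keep track of which dataframe belongs to which race, one option (just a sketch, assuming a single combined frame keyed by tab position is acceptable) is to concatenate the list with keys:
import pandas as pd

# df is the list of per-race dataframes built above; the keys label each race by tab position
combined = pd.concat(df, keys=range(1, len(df) + 1), names=['race', 'row'])
print(combined.head())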

Cannot Scrape Fantasy Table Using Python

I'm trying to scrape fantasy player data from the following site: http://www.fplstatistics.co.uk/. The table appears upon opening the site, but it's not visible when I scrape the site.
I tried the following:
import requests as rq
from bs4 import BeautifulSoup
fplStatsPage = rq.get('http://www.fplstatistics.co.uk')
fplStatsPageSoup = BeautifulSoup(fplStatsPage.text, 'html.parser')
fplStatsPageSoup
And the table was nowhere to be seen. What came back in place of where the table should be is:
<div>
The 'Player Data' is out of date.
<br/> <br/>
You need to refresh the web page.
<br/> <br/>
Press F5 or hit <i class="fa fa-refresh"></i>
</div>
This message appears on the site whenever the table is updated.
I then looked at the developer tools to see if I can find the URL from where the table data is retrieved, but I had no luck. Probably because I don't know how to read the developer tools well.
I then tried to refresh the page as the above message says using Selenium:
from selenium import webdriver
import time
chromeDriverPath = '/Users/SplitShiftKing/Downloads/Software/chromedriver'
driver = webdriver.Chrome(chromeDriverPath)
driver.get('http://www.fplstatistics.co.uk')
driver.refresh()
#To give site enough time to refresh
time.sleep(15)
html = driver.page_source
fplStatsPageSoup = BeautifulSoup(html, 'html.parser')
fplStatsPageSoup
The output was the same as before. The table appears on the site but not in the output.
Assistance would be appreciated. I've looked at similar questions on Stack Overflow, but I haven't been able to find a solution.
Why not go straight to the source that pulls in that data? The only thing you need to work out is the column names, but this gets you all the data in one request and without using Selenium:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
s = requests.Session()
url = 'http://www.fplstatistics.co.uk/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Mobile Safari/537.36'}
response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if '"iselRow"' in script.text:
        iselRowVal = re.search(r'"value":(.+?)}\);}', script.text).group(1).strip()

url = 'http://www.fplstatistics.co.uk/Home/AjaxPricesFHandler'
payload = {
    'iselRow': iselRowVal,
    '_': ''}

jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData['aaData'])
Output:
print (df.head(5).to_string())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 Mustafi Arsenal D A 0.3 5.2 £5.2m 0 --- 110 -95.6 -95.6 -1 -1 Mustafi Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
1 Bellerín Arsenal D I 0.3 5.4 £5.4m 0 --- 54024 2.6 2.6 -2 -2 Bellerin Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
2 Kolasinac Arsenal D I 0.6 5.2 £5.2m 0 --- 5464 -13.9 -13.9 -2 -2 Kolasinac Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
3 Maitland-Niles Arsenal D A 2.6 4.6 £4.6m 0 --- 11924 -39.0 -39.0 -2 -2 Maitland-Niles Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
4 Sokratis Arsenal D S 1.5 4.9 £4.9m 0 --- 19709 -29.4 -29.4 -2 -2 Sokratis Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
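Once you have worked out what each numbered column holds, you can rename just the ones you care about. A small sketch; the names below are guesses based on the printed output, not the site's real field names:
df = df.rename(columns={0: 'player', 1: 'club', 2: 'position', 6: 'price'})  # column names are guesses
print(df.head())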
By requesting driver.page_source you're cancelling out any benefit you get from Selenium: the page source does not contain the table you want. That table is updated dynamically via JavaScript after the page has loaded. You need to retrieve the table using methods on your driver, rather than using BeautifulSoup. For example:
>>> from selenium import webdriver
>>> d = webdriver.Chrome()
>>> d.get('http://www.fplstatistics.co.uk')
>>> table = d.find_element_by_id('myDataTable')
>>> print('\n'.join(x.text for x in table.find_elements_by_tag_name('tr')))
Name
Club
Pos
Status
%Owned
Price
Chgs
Unlocks
Delta
Target
Kelly Crystal Palace D A 30.7 £4.3m 0 --- 0
101.0
Rico Bournemouth D A 14.6 £4.3m 0 --- 0
100.9
Baldock Sheffield Utd D A 7.1 £4.8m 0 --- 88 99.8
Rashford Man Utd F A 26.4 £9.0m 0 --- 794 98.6
Son Spurs M A 21.6 £10.0m 0 --- 833 98.5
Henderson Sheffield Utd G A 7.8 £4.7m 0 --- 860 98.4
Grealish Aston Villa M A 8.9 £6.1m 0 --- 1088 98.0
Kane Spurs F A 19.3 £10.9m 0 --- 3961 92.9
Reid West Ham D A 4.6 £3.9m 0 --- 4029 92.7
Richarlison Everton M A 7.7 £7.8m 0 --- 5405 90.3

bs4 find table by id, returning 'None'

Not sure why this isn't working :( I'm able to pull other tables from this page, just not this one.
import requests
from bs4 import BeautifulSoup as soup
url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
headers={'User-Agent': 'Mozilla/5.0'})
page = soup(url.content, 'html')
table = page.find('table', id='team_and_opponent')
print(table)
Appreciate the help.
The page is dynamic. So you have 2 options in this case.
Side note: if you see <table> tags, don't use BeautifulSoup directly; pandas can do that work for you (it actually uses bs4 under the hood) via pd.read_html().
1) Use selenium to first render the page, and THEN you can use BeautifulSoup to pull out the <table> tags
2) Those tables are within the comment tags in the html. You can use BeautifulSoup to pull out the comments, then just grab the ones with 'table'.
I chose option 2.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
url = 'https://www.basketball-reference.com/teams/BOS/2018.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except ValueError:
            continue
I don't know which particular table you want, but they are all there in the list of tables.
Output:
print (tables[1])
Unnamed: 0 G MP FG FGA ... STL BLK TOV PF PTS
0 Team 82.0 19805 3141 6975 ... 604 373 1149 1618 8529
1 Team/G NaN 241.5 38.3 85.1 ... 7.4 4.5 14.0 19.7 104.0
2 Lg Rank NaN 12 25 25 ... 23 18 15 17 20
3 Year/Year NaN 0.3% -0.9% -0.0% ... -2.1% 9.7% 5.6% -4.0% -3.7%
4 Opponent 82.0 19805 3066 6973 ... 594 364 1159 1571 8235
5 Opponent/G NaN 241.5 37.4 85.0 ... 7.2 4.4 14.1 19.2 100.4
6 Lg Rank NaN 12 3 12 ... 7 6 19 9 3
7 Year/Year NaN 0.3% -3.2% -0.9% ... -4.7% -14.4% 1.6% -5.6% -4.7%
[8 rows x 24 columns]
or
print (tables[18])
Rk Unnamed: 1 Salary
0 1 Gordon Hayward $29,727,900
1 2 Al Horford $27,734,405
2 3 Kyrie Irving $18,868,625
3 4 Jayson Tatum $5,645,400
4 5 Greg Monroe $5,000,000
5 6 Marcus Morris $5,000,000
6 7 Jaylen Brown $4,956,480
7 8 Marcus Smart $4,538,020
8 9 Aron Baynes $4,328,000
9 10 Guerschon Yabusele $2,247,480
10 11 Terry Rozier $1,988,520
11 12 Shane Larkin $1,471,382
12 13 Semi Ojeleye $1,291,892
13 14 Abdel Nader $1,167,333
14 15 Daniel Theis $815,615
15 16 Demetrius Jackson $92,858
16 17 Jarell Eddie $83,129
17 18 Xavier Silas $74,159
18 19 Jonathan Gibson $44,495
19 20 Jabari Bird $0
20 21 Kadeem Allen $0
There is no table with the id team_and_opponent in that page; rather, there is a span tag with this id. You can get results by changing the id.
This data is most likely loaded dynamically (via JavaScript).
You should take a look here: Web-scraping JavaScript page with Python.
For that you can use Selenium or requests-html, which supports JavaScript.
import requests
import bs4
url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
                   headers={'User-Agent': 'Mozilla/5.0'})
soup = bs4.BeautifulSoup(url.text, "lxml")
page = soup.select(".table_outer_container")
for i in page:
    print(i.text)
you will get your desired output
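If you would rather let the page's JavaScript run without a full Selenium setup, a rough sketch with requests-html (it downloads a headless Chromium the first time render() is called) could look like this; the team_and_opponent id is taken from the question and may still need adjusting against the rendered page:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.basketball-reference.com/teams/BOS/2018.html')
r.html.render()  # executes the page's JavaScript in headless Chromium
table = r.html.find('#team_and_opponent', first=True)  # id assumed from the question
if table is not None:
    print(table.text)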

How can I scrape data from a url within a target url and amend everything to a single data frame in python?

I am trying to scrape data points from one webpage (A), but then scrape data from each individual data point's own webpage and combine all of the data into a single data frame for easy viewing.
This is for a daily data frame with four columns: Team, Pitcher, ERA, WHIP. The ERA and WHIP are found within the specific pitcher's url. For the data below, I have managed to scrape the team name as well as the starting pitcher name and organized both into a data frame (albeit incorrectly).
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
targetUrl = 'http://www.baseball-reference.com/previews/'
targetUrl_response = requests.get(targetUrl, timeout=5)
soup = BeautifulSoup(targetUrl_response.content, "html.parser")
teams = []
pitchers = []
for i in soup.find_all('tr'):
    if i.find_all('strong'):
        for link in i.find_all('strong'):
            if not re.findall(r'MLB Debut', link.text):
                teams.append(link.text)
    if i.find_all('a'):
        for link in i.find_all('a'):
            if not re.findall(r'Preview', link.text):
                pitchers.append(link.text)
print(df)
I'd like to add code to follow each pitcher's webpage, scrape the ERA and WHIP, then append that data to the same data frame as the team and pitcher name. Is this even possible?
Output so far:
0
Aaron Sanchez TOR
CC Sabathia NYY
Steven Matz NYM
Zach Eflin PHI
Lucas Giolito CHW
Eduardo Rodriguez BOS
Brad Keller KCR
Adam Plutko CLE
Julio Teheran ATL
Jon Lester CHC
Clayton Kershaw LAD
Zack Greinke ARI
Jon Gray COL
Drew Pomeranz SFG
A few things off the bat (see what I did there :-) ): the sports-reference.com pages are dynamic. You're able to pull SOME of the tables straightforwardly, but if there are multiple tables, you'll find them under comment tags within the html source. So that might be an issue later if you want more data from the page.
The second thing is I notice you are pulling <tr> tags, which means there are <table> tags, and pandas can do the heavy work for you as opposed to iterating through with bs4. It's a simple pd.read_html() function. HOWEVER, it won't pull out those links, just strictly the text. So in this case, iterating with BeautifulSoup is the way to go (I'm just mentioning it for future reference).
There's still more work to do, as a couple of the guys didn't have links or didn't return an ERA or WHIP. You'll also have to account for the fact that if a guy is traded or changes leagues, there might be multiple ERAs for the same 2019 season. But this should get you going:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
targetUrl = 'http://www.baseball-reference.com/previews/'
targetUrl_response = requests.get(targetUrl, timeout=5)
soup = BeautifulSoup(targetUrl_response.content, "html.parser")
teams = []
pitchers = []
era_list = []
whip_list = []
for i in soup.find_all('tr'):
    if i.find_all('strong'):
        for link in i.find_all('strong'):
            if not re.findall(r'MLB Debut', link.text):
                teams.append(link.text)
    if i.find_all('a'):
        for link in i.find_all('a'):
            if not re.findall(r'Preview', link.text):
                try:
                    url_link = link['href']
                    pitcher_table = pd.read_html(url_link)[0]
                    pitcher_table = pitcher_table[(pitcher_table['Year'] == '2019') & (pitcher_table['Lg'].isin(['AL', 'NL']))]
                    era = round(pitcher_table.iloc[0]['ERA'], 2)
                    whip = round(pitcher_table.iloc[0]['WHIP'], 2)
                except Exception:  # a few pitchers have no link or no 2019 line
                    era = 'N/A'
                    whip = 'N/A'
                pitchers.append(link.text)
                era_list.append(era)
                whip_list.append(whip)
                print('%s\tERA: %s\tWHIP: %s' % (link.text, era, whip))

df = pd.DataFrame(list(zip(pitchers, teams, era_list, whip_list)), columns=['Pitcher', 'Team', 'ERA', 'WHIP'])
print (df)
Output:
print (df)
Pitcher Team ERA WHIP
0 Walker Lockett NYM 23.14 2.57
1 Jake Arrieta PHI 4.12 1.38
2 Logan Allen SDP 0 0.71
3 Jimmy Yacabonis BAL 4.7 1.44
4 Clayton Richard TOR 7.46 1.74
5 Glenn Sparkman KCR 3.62 1.25
6 Shane Bieber CLE 3.86 1.08
7 Carson Fulmer CHW 6.35 1.94
8 David Price BOS 3.39 1.1
9 Jesse Chavez TEX N/A N/A
10 Jordan Zimmermann DET 6.03 1.37
11 Max Scherzer WSN 2.62 1.06
12 Trevor Richards MIA 3.54 1.25
13 Max Fried ATL 4.03 1.34
14 Adbert Alzolay CHC 2.25 0.75
15 Marco Gonzales SEA 4.38 1.37
16 Zach Davies MIL 3.06 1.36
17 Trevor Williams PIT 4.12 1.19
18 Gerrit Cole HOU 3.54 1.02
19 Blake Snell TBR 4.4 1.24
20 Kyle Gibson MIN 4.18 1.25
21 Chris Bassitt OAK 3.64 1.17
22 Jack Flaherty STL 4.24 1.18
23 Ross Stripling LAD 3.08 1.17
24 Robbie Ray ARI 3.87 1.34
25 Chi Chi Gonzalez COL N/A N/A
26 Madison Bumgarner SFG 4.28 1.24
27 Tyler Mahle CIN 4.17 1.2
28 Andrew Heaney LAA 5.68 1.14
