Cannot Scrape Fantasy Table Using Python

I'm trying to scrape fantasy player data from the following site: http://www.fplstatistics.co.uk/. The table appears upon opening the site, but it's not visible when I scrape the site.
I tried the following:
import requests as rq
from bs4 import BeautifulSoup
fplStatsPage = rq.get('http://www.fplstatistics.co.uk')
fplStatsPageSoup = BeautifulSoup(fplStatsPage.text, 'html.parser')
fplStatsPageSoup
And the table was nowhere to be seen. What came back in place of where the table should be is:
<div>
The 'Player Data' is out of date.
<br/> <br/>
You need to refresh the web page.
<br/> <br/>
Press F5 or hit <i class="fa fa-refresh"></i>
</div>
This message appears on the site whenever the table is updated.
I then looked at the developer tools to see if I can find the URL from where the table data is retrieved, but I had no luck. Probably because I don't know how to read the developer tools well.
I then tried to refresh the page as the above message says using Selenium:
from selenium import webdriver
import time
chromeDriverPath = '/Users/SplitShiftKing/Downloads/Software/chromedriver'
driver = webdriver.Chrome(chromeDriverPath)
driver.get('http://www.fplstatistics.co.uk')
driver.refresh()
#To give site enough time to refresh
time.sleep(15)
html = driver.page_source
fplStatsPageSoup = BeautifulSoup(html, 'html.parser')
fplStatsPageSoup
The output was the same as before. The table appears on the site but not in the output.
Assistance would be appreciated. I've looked at similar questions on Stack Overflow, but I haven't been able to find a solution.

Why not go straight to the source that pulls in that data? The only thing you need to work out is the column names, but this gets you all the data in one request and without using Selenium:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
s = requests.Session()
url = 'http://www.fplstatistics.co.uk/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Mobile Safari/537.36'}
response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if '"iselRow"' in script.text:
        iselRowVal = re.search('"value":(.+?)}\);}', script.text).group(1).strip()

url = 'http://www.fplstatistics.co.uk/Home/AjaxPricesFHandler'
payload = {
    'iselRow': iselRowVal,
    '_': ''}

jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData['aaData'])
Output:
print (df.head(5).to_string())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 Mustafi Arsenal D A 0.3 5.2 £5.2m 0 --- 110 -95.6 -95.6 -1 -1 Mustafi Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
1 Bellerín Arsenal D I 0.3 5.4 £5.4m 0 --- 54024 2.6 2.6 -2 -2 Bellerin Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
2 Kolasinac Arsenal D I 0.6 5.2 £5.2m 0 --- 5464 -13.9 -13.9 -2 -2 Kolasinac Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
3 Maitland-Niles Arsenal D A 2.6 4.6 £4.6m 0 --- 11924 -39.0 -39.0 -2 -2 Maitland-Niles Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
4 Sokratis Arsenal D S 1.5 4.9 £4.9m 0 --- 19709 -29.4 -29.4 -2 -2 Sokratis Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)

By requesting driver.page_source you're cancelling out any benefit you get from Selenium: the page source does not contain the table you want. That table is updated dynamically via JavaScript after the page has loaded. You need to retrieve the table using methods on your driver rather than using BeautifulSoup. For example:
>>> from selenium import webdriver
>>> d = webdriver.Chrome()
>>> d.get('http://www.fplstatistics.co.uk')
>>> table = d.find_element_by_id('myDataTable')
>>> print('\n'.join(x.text for x in table.find_elements_by_tag_name('tr')))
Name
Club
Pos
Status
%Owned
Price
Chgs
Unlocks
Delta
Target
Kelly Crystal Palace D A 30.7 £4.3m 0 --- 0
101.0
Rico Bournemouth D A 14.6 £4.3m 0 --- 0
100.9
Baldock Sheffield Utd D A 7.1 £4.8m 0 --- 88 99.8
Rashford Man Utd F A 26.4 £9.0m 0 --- 794 98.6
Son Spurs M A 21.6 £10.0m 0 --- 833 98.5
Henderson Sheffield Utd G A 7.8 £4.7m 0 --- 860 98.4
Grealish Aston Villa M A 8.9 £6.1m 0 --- 1088 98.0
Kane Spurs F A 19.3 £10.9m 0 --- 3961 92.9
Reid West Ham D A 4.6 £3.9m 0 --- 4029 92.7
Richarlison Everton M A 7.7 £7.8m 0 --- 5405 90.3
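If the table takes a moment to populate, an explicit wait is more reliable than a fixed sleep. A minimal sketch, assuming the same myDataTable id used above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

d = webdriver.Chrome()
d.get('http://www.fplstatistics.co.uk')

# wait up to 15 seconds for the table element to appear in the DOM
table = WebDriverWait(d, 15).until(
    EC.presence_of_element_located((By.ID, 'myDataTable'))
)
print(len(table.find_elements_by_tag_name('tr')))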

Related

Unable to load tables with "Load more" options in a website using Python

I need to scrape the full table from this site, which has a "Load more" option.
As of now, when I'm scraping, I only get the rows that show up by default when the page loads.
import pandas as pd
import requests
from six.moves import urllib
URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"
header = {'Accept-Language': "en-US,en;q=0.9",
          'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                        "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
          }
resp2 = requests.get(url=URL2, headers=header).text
tables2 = pd.read_html(resp2)
overview_table2= tables2[0]
overview_table2
    Player Name            Team              Matches  Goals  Time Played  Unnamed: 5
0   Jorge Pereyra Diaz     Mumbai City       9        6      538 Mins     NaN
1   Cleiton Silva          SC East Bengal    8        5      707 Mins     NaN
2   Abdenasser El Khayati  Chennaiyin FC     5        4      231 Mins     NaN
3   Lallianzuala Chhangte  Mumbai City       9        4      737 Mins     NaN
4   Nandhakumar Sekar      Odisha            8        4      673 Mins     NaN
5   Ivan Kalyuzhnyi        Kerala Blasters   7        4      428 Mins     NaN
6   Bipin Singh            Mumbai City       9        4      806 Mins     NaN
7   Noah Sadaoui           Goa               8        4      489 Mins     NaN
8   Diego Mauricio         Odisha            8        3      526 Mins     NaN
9   Pedro Martin           Odisha            8        3      263 Mins     NaN
10  Dimitri Petratos       ATK Mohun Bagan   6        3      517 Mins     NaN
11  Petar Sliskovic        Chennaiyin FC     8        3      662 Mins     NaN
12  Holicharan Narzary     Hyderabad         9        3      705 Mins     NaN
13  Dimitrios Diamantakos  Kerala Blasters   7        3      529 Mins     NaN
14  Alberto Noguera        Mumbai City       9        3      371 Mins     NaN
15  Jerry Mawihmingthanga  Odisha            8        3      611 Mins     NaN
16  Hugo Boumous           ATK Mohun Bagan   7        2      580 Mins     NaN
17  Javi Hernandez         Bengaluru         6        2      397 Mins     NaN
18  Borja Herrera          Hyderabad         9        2      314 Mins     NaN
19  Mohammad Yasir         Hyderabad         9        2      777 Mins     NaN
20  Load More....          Load More....     Load More....   Load More....  Load More....  Load More....
But I need the full table, including the data under "Load more". Please help.
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}


def main(url):
    params = {
        "action": "stats",
        "league_id": "750",
        "limit": "300",
        "offset": "0",
        "part": "leagues",
        "season_id": "2022",
        "section": "football",
        "stats_type": "player",
        "tab": "overview"
    }
    r = requests.get(url, headers=headers, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = [(x['title'], *[i.get_text(strip=True) for i in x.find_all_next('td', limit=4)])
            for x in soup.select('a.player_link')]
    df = pd.DataFrame(
        goal, columns=['Name', 'Team', 'Matches', 'Goals', 'Time Played'])
    print(df)


main('https://www.mykhel.com/src/index.php')
Output:
Name Team Matches Goals Time Played
0 Jorge Pereyra Diaz Mumbai City 9 6 538 Mins
1 Cleiton Silva SC East Bengal 8 5 707 Mins
2 Abdenasser El Khayati Chennaiyin FC 5 4 231 Mins
3 Lallianzuala Chhangte Mumbai City 9 4 737 Mins
4 Nandhakumar Sekar Odisha 8 4 673 Mins
.. ... ... ... ... ...
268 Sarthak Golui SC East Bengal 6 0 402 Mins
269 Ivan Gonzalez SC East Bengal 8 0 683 Mins
270 Michael Jakobsen NorthEast United 8 0 676 Mins
271 Pratik Chowdhary Jamshedpur FC 6 0 495 Mins
272 Chungnunga Lal SC East Bengal 8 0 720 Mins
[273 rows x 5 columns]
This is a dynamically loaded page, so you cannot parse all of the contents without hitting a button.
Well… maybe you can with XHR or something like that; maybe someone will contribute such an answer here.
I'll stick to working with dynamically loaded pages using the Selenium browser automation suite.
Installation
To get started, you'll need to install selenium bindings:
pip install selenium
You seem to already have BeautifulSoup, but for anyone who comes across this answer later: we'll also need it, together with html5lib, to parse the table later on:
pip install html5lib BeautifulSoup4
Now, for Selenium to work you'll need a driver installed for the browser of your choice. To get a driver you may use Selenium Manager, driver-management software, or download the driver manually. The first two options are fairly new; I've had my manually downloaded drivers for ages, so I'll stick with them. I'll duplicate the download links here:
Chrome: https://sites.google.com/chromium.org/driver/
Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Opera: https://github.com/operasoftware/operachromiumdriver/releases
You can use any browser, e.g. Brave Browser, Yandex Browser, basically any Chromium-based browser of your choice, or even the Tor Browser 😮
Anyway, that's a bit outside the scope of this answer; just keep in mind that for any browser and its family you'll need a driver.
I'll stick with Firefox, so you need Firefox installed and the driver placed somewhere. The best option is to add that folder to your PATH variable.
If you choose a Chromium-based browser, you'll have to match the driver strictly to your Chrome version. As for Firefox, I have a pretty old geckodriver 0.29.1 and it works like a charm with the latest browser update.
Hands on
import pandas as pd
from selenium import webdriver

URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"

driver = webdriver.Firefox()
driver.get(URL2)

element = driver.find_element_by_xpath("//a[text()=' Load More.... ']")
while element.is_displayed():
    driver.execute_script("arguments[0].click();", element)

table = driver.find_element_by_css_selector('table')
tables2 = pd.read_html(table.get_attribute('outerHTML'))
driver.close()

overview_table2 = tables2[0].dropna(how='all').dropna(axis='columns', how='all')
overview_table2 = overview_table2.drop_duplicates().reset_index(drop=True)
overview_table2
We only need pandas for our resulting table and selenium for web automation.
URL2 — the same variable you used.
driver = webdriver.Firefox() — here we instantiate Firefox and a browser window opens. This is where the Selenium magic happens.
Note: If you decided to skip adding the driver folder to your PATH variable, you can reference the driver location directly (for the second form, Service is imported from selenium.webdriver.chrome.service), e.g.:
webdriver.Firefox(r"C:\WebDriver\bin")
webdriver.Chrome(service=Service(executable_path="/path/to/chromedriver"))
driver.get(URL2) — open the desired page
element = driver.find_element_by_xpath("//a[text()=' Load More.... ']")
Using an XPath selector we find the link that has the same text as your 20th row.
With that element stored, we keep clicking it until it disappears.
It would be simpler to just use element.click(), but that results in an error; more on that in another Stack Overflow question.
Then we assign the table variable to the corresponding element.
tables2 — I left this weird variable name as it was in your question.
Here we get outerHTML, as innerHTML would give only the contents of the <table> tag, not the tag itself.
We should not forget to .close() our driver as we don't need it anymore.
As a result of the HTML parsing there will be a list, just like the one in your question. Here I drop the unnamed column and the last empty row.
The resulting overview_table2 looks like:
     Player Name            Team              Matches  Goals  Time Played
0    Jorge Pereyra Diaz     Mumbai City       9.0      6.0    538 Mins
1    Cleiton Silva          SC East Bengal    8.0      5.0    707 Mins
2    Abdenasser El Khayati  Chennaiyin FC     5.0      4.0    231 Mins
...  ...                    ...               ...      ...    ...
270  Michael Jakobsen       NorthEast United  8.0      0.0    676 Mins
271  Pratik Chowdhary       Jamshedpur FC     6.0      0.0    495 Mins
272  Chungnunga Lal         SC East Bengal    8.0      0.0    720 Mins
Side note
Job done. As a further improvement you may play with different browsers and try headless mode, a mode in which the browser does not open on your desktop but instead runs silently in the background.
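For instance, a headless Firefox session might be set up like this (a sketch; the exact option spelling can vary between Selenium releases):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # run the browser without opening a window
driver = webdriver.Firefox(options=options)
driver.get("https://www.mykhel.com/football/indian-super-league-player-stats-l750/")
print(driver.title)
driver.quit()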

Scraping tbody data -- trouble

I'm new to this, and I'm struggling to understand why nothing seems to be working.
Basically, all I want to do is scrape the table data (play by play) from a list of ncaa.com links (sample below)
https://stats.ncaa.org/game/play_by_play/12465
https://stats.ncaa.org/game/play_by_play/12755
https://stats.ncaa.org/game/play_by_play/12640
https://stats.ncaa.org/game/play_by_play/12290
For context, I got these links by scraping HREF tags from a different list of links (which contained every NCAA team's game schedule).
I've struggled through a lot of this, but there's been an answer somewhere...
The inspector makes it seem like the play-by-play (table) data is inside a tbody tag, or at least I think so.
I've tried a script as simple as this (which works for other websites)
import pandas as pd
df = pd.read_html(
'https://stats.ncaa.org/game/play_by_play/13592')[0]
print(df)
But it still didn't work for this site... I read a bit about using lxml.parser instead of html.parser (as in the code below), but that's also not working. I thought this was my best chance at getting the tables from multiple links at once:
import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://stats.ncaa.org/game/play_by_play/12564',
    'https://stats.ncaa.org/game/play_by_play/13592'
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml.parser')
    for profile in soup.find_all('a'):
        profile = profile.get('tbody')
        profiles.append(profile)

# print(profiles)

for p in profiles:
    print(p)
Any thoughts as to what is unique about this site/what could be the issue would be greatly appreciated.
That website checks whether the request comes from a bot or a browser, so you need to update the requests headers with a real user-agent.
Each page has 8 tables. The code below will go through each url you mentioned above and print out all tables. You can review them, see which one you need, etc:
import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

urls = [
    'https://stats.ncaa.org/game/play_by_play/12465',
    'https://stats.ncaa.org/game/play_by_play/12755',
    'https://stats.ncaa.org/game/play_by_play/12640',
    'https://stats.ncaa.org/game/play_by_play/12290',
]

s = requests.Session()
s.headers.update(headers)

for url in urls:
    r = s.get(url)
    dfs = pd.read_html(r.text)
    len(dfs)
    for df in dfs:
        print(df)
        print('___________')
Response:
0 1 2 3
0 NaN 1st Half 2nd Half Total
1 UTRGV 23 19 42
2 UTEP 26 34 60
___________
0 1
0 Game Date: 01/02/2009
1 Location: El Paso, Texas (Don Haskins Center)
2 Attendance: 8413
___________
0 1
0 Officials: John Higgins, Duke Edsall, Quinton, Reece
___________
0 1
0 1st Half 1 2
___________
0 1 2 \
0 Time UTRGV Score
1 19:45 NaN 0-0
2 19:45 NaN 0-0
3 19:33 NaN 0-0
4 19:33 Emmanuel Jones Defensive Rebound 0-0
.. ... ... ...
163 00:11 NaN 23-25
164 00:11 NaN 23-26
165 00:00 Emmanuel Jones missed Two Point Jumper 23-26
166 00:00 NaN 23-26
167 End of 1st Half End of 1st Half End of 1st Half
3
0 UTEP
1 Arnett Moultrie missed Two Point Jumper
2 Julyan Stone Offensive Rebound
3 Stefon Jackson missed Three Point Jumper
4 NaN
.. ...
163 Stefon Jackson made Free Throw
164 Stefon Jackson made Free Throw
165 NaN
166 Julyan Stone Defensive Rebound
167 End of 1st Half
[168 rows x 4 columns]
___________
0 1
0 2nd Half 1 2
[..]
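If you only need the play-by-play tables themselves, you could filter the parsed list before printing. A hedged sketch with a hypothetical helper name, assuming (as in the output above) that those tables have four columns and start with a 'Time' header row:

import pandas as pd

def play_by_play_only(dfs):
    # keep only the wide tables whose first cell is the 'Time' header,
    # then stack the halves into one frame
    keep = [df for df in dfs if df.shape[1] == 4 and str(df.iloc[0, 0]) == 'Time']
    return pd.concat(keep, ignore_index=True)

# usage inside the loop above:
# pbp = play_by_play_only(dfs)
# print(pbp)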

Getting Table Info From Page Using Python and BeautifulSoup

The page I am trying to get info from is https://www.pro-football-reference.com/teams/crd/2017_roster.htm.
I'm trying to get all the information from the "Roster" table, but for some reason I can't get it through BeautifulSoup. I've tried soup.find("div", {'id': 'div_games_played_team'}) but it doesn't work. When I look at the page's HTML I can see the table inside a very large comment and in a regular div. How can I use BeautifulSoup to get the information from this table?
You don't need Selenium. What you can do (and you correctly identified it) is pull out the comments and then parse the table from within them.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd

url = 'https://www.pro-football-reference.com/teams/crd/2017_roster.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except ValueError as e:
            print(e)
            continue
Output:
print(tables[0].head().to_string())
No. Player Age Pos G GS Wt Ht College/Univ BirthDate Yrs AV Drafted (tm/rnd/yr) Salary
0 54.0 Bryson Albright 23.0 NaN 7 0.0 245.0 6-5 Miami (OH) 3/15/1994 1 0.0 NaN $246,177
1 36.0 Budda Baker*+ 21.0 ss 16 7.0 195.0 5-10 Washington 1/10/1996 Rook 9.0 Arizona Cardinals / 2nd / 36th pick / 2017 $465,000
2 64.0 Khalif Barnes 35.0 NaN 3 0.0 320.0 6-6 Washington 4/21/1982 12 0.0 Jacksonville Jaguars / 2nd / 52nd pick / 2005 $176,471
3 41.0 Antoine Bethea 33.0 db 15 6.0 206.0 5-11 Howard 7/27/1984 11 4.0 Indianapolis Colts / 6th / 207th pick / 2006 $2,000,000
4 28.0 Justin Bethel 27.0 rcb 16 6.0 200.0 6-0 Presbyterian 6/17/1990 5 3.0 Arizona Cardinals / 6th / 177th pick / 2012 $2,000,000
....
The tag you are trying to scrape is dynamically generated by JavaScript. You are most likely using requests to scrape your HTML. Unfortunately, requests will not run JavaScript because it pulls all the HTML in as raw text. BeautifulSoup cannot find the tag because it was never generated within your scraping program.
I recommend using Selenium. It's not a perfect solution, just the best one for your problem. The Selenium WebDriver will execute the JavaScript to generate the page's HTML. Then you can use BeautifulSoup to parse whatever it is you are after. See Selenium with Python for help getting started.
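For example, a minimal sketch along those lines (assuming chromedriver is available on your PATH; once the JavaScript has run, the previously commented-out tables are part of the DOM, so pandas can read them from the rendered source):

from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome()
driver.get('https://www.pro-football-reference.com/teams/crd/2017_roster.htm')

# read every table from the fully rendered page
tables = pd.read_html(driver.page_source)
driver.quit()

print(len(tables))
print(tables[-1].head())  # inspect the list to find the Roster table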

bs4 find table by id, returning 'None'

Not sure why this isn't working :( I'm able to pull other tables from this page, just not this one.
import requests
from bs4 import BeautifulSoup as soup
url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
headers={'User-Agent': 'Mozilla/5.0'})
page = soup(url.content, 'html')
table = page.find('table', id='team_and_opponent')
print(table)
Appreciate the help.
The page is dynamic. So you have 2 options in this case.
Side note: if you see <table> tags, don't use BeautifulSoup by hand; pandas can do that work for you (and it actually uses bs4 under the hood) via pd.read_html().
1) Use Selenium to first render the page, and THEN you can use BeautifulSoup to pull out the <table> tags (a rough sketch of this is shown after this list).
2) Those tables are within the comment tags in the html. You can use BeautifulSoup to pull out the comments, then just grab the ones with 'table'.
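For completeness, option 1 might look roughly like this (a sketch I did not go with here; it assumes chromedriver is on your PATH and that the rendered page exposes a table with the id you were after, which you would confirm in the developer tools):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.basketball-reference.com/teams/BOS/2018.html')

# once the JavaScript has run, the table is in the DOM and BeautifulSoup can find it
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find('table', id='team_and_opponent')  # id assumed from the question
driver.quit()
print(table is not None)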
I chose option 2.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd

url = 'https://www.basketball-reference.com/teams/BOS/2018.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except:
            continue
I don't know which particular table you want, but they are there in the list of tables.
Output:
print (tables[1])
Unnamed: 0 G MP FG FGA ... STL BLK TOV PF PTS
0 Team 82.0 19805 3141 6975 ... 604 373 1149 1618 8529
1 Team/G NaN 241.5 38.3 85.1 ... 7.4 4.5 14.0 19.7 104.0
2 Lg Rank NaN 12 25 25 ... 23 18 15 17 20
3 Year/Year NaN 0.3% -0.9% -0.0% ... -2.1% 9.7% 5.6% -4.0% -3.7%
4 Opponent 82.0 19805 3066 6973 ... 594 364 1159 1571 8235
5 Opponent/G NaN 241.5 37.4 85.0 ... 7.2 4.4 14.1 19.2 100.4
6 Lg Rank NaN 12 3 12 ... 7 6 19 9 3
7 Year/Year NaN 0.3% -3.2% -0.9% ... -4.7% -14.4% 1.6% -5.6% -4.7%
[8 rows x 24 columns]
or
print (tables[18])
Rk Unnamed: 1 Salary
0 1 Gordon Hayward $29,727,900
1 2 Al Horford $27,734,405
2 3 Kyrie Irving $18,868,625
3 4 Jayson Tatum $5,645,400
4 5 Greg Monroe $5,000,000
5 6 Marcus Morris $5,000,000
6 7 Jaylen Brown $4,956,480
7 8 Marcus Smart $4,538,020
8 9 Aron Baynes $4,328,000
9 10 Guerschon Yabusele $2,247,480
10 11 Terry Rozier $1,988,520
11 12 Shane Larkin $1,471,382
12 13 Semi Ojeleye $1,291,892
13 14 Abdel Nader $1,167,333
14 15 Daniel Theis $815,615
15 16 Demetrius Jackson $92,858
16 17 Jarell Eddie $83,129
17 18 Xavier Silas $74,159
18 19 Jonathan Gibson $44,495
19 20 Jabari Bird $0
20 21 Kadeem Allen $0
There is no table with the id team_and_opponent in that page; rather, there is a span tag with this id. You can get results by changing the id.
This data is probably loaded dynamically (via JavaScript).
You should take a look here: Web-scraping JavaScript page with Python
For that you can use Selenium or requests-html, which support JavaScript.
import requests
import bs4

url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
                   headers={'User-Agent': 'Mozilla/5.0'})
soup = bs4.BeautifulSoup(url.text, "lxml")
page = soup.select(".table_outer_container")
for i in page:
    print(i.text)
You will get your desired output.
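If you'd rather try the requests-html route mentioned above, a minimal sketch (assuming the requests-html package is installed; its render() call downloads a Chromium build on first use):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.basketball-reference.com/teams/BOS/2018.html")
r.html.render()  # executes the page's JavaScript
table = r.html.find('#team_and_opponent', first=True)
print(table.text if table else 'table not found')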

How can I scrape data from a url within a target url and amend everything to a single data frame in python?

I am trying to scrape data points from one webpage (A), but then scrape data from each individual data point's own webpage and combine all of the data into a single data frame for easy viewing.
This is for a daily data frame with four columns: Team, Pitcher, ERA, WHIP. The ERA and WHIP are found within the specific pitcher's url. For the data below, I have managed to scrape the team name as well as the starting pitcher name and organized both into a data frame (albeit incorrectly).
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

targetUrl = 'http://www.baseball-reference.com/previews/'
targetUrl_response = requests.get(targetUrl, timeout=5)
soup = BeautifulSoup(targetUrl_response.content, "html.parser")

teams = []
pitchers = []
for i in soup.find_all('tr'):
    if i.find_all('strong'):
        for link in i.find_all('strong'):
            if not re.findall(r'MLB Debut', link.text):
                teams.append(link.text)
    if i.find_all('a'):
        for link in i.find_all('a'):
            if not re.findall(r'Preview', link.text):
                pitchers.append(link.text)

print(df)
I'd like to add code to follow each pitcher's webpage, scrape the ERA and WHIP, then amend the data to the same data frame as team and pitcher name. Is this even possible?
Output so far:
0
Aaron Sanchez TOR
CC Sabathia NYY
Steven Matz NYM
Zach Eflin PHI
Lucas Giolito CHW
Eduardo Rodriguez BOS
Brad Keller KCR
Adam Plutko CLE
Julio Teheran ATL
Jon Lester CHC
Clayton Kershaw LAD
Zack Greinke ARI
Jon Gray COL
Drew Pomeranz SFG
A few things off the bat (see what I did there :-) ): the sports-reference.com pages are dynamic. You're able to pull SOME of the tables straightforwardly, but if there are multiple tables, you'll find them under comment tags within the HTML source. So that might be an issue later if you want more data from the page.
The second thing I notice is that you are pulling <tr> tags, which means there are <table> tags, and pandas can do the heavy work for you as opposed to iterating through with bs4. It's a simple pd.read_html() function. HOWEVER, it won't pull out those links, just strictly the text, so in this case iterating with BeautifulSoup is the way to go (I'm just mentioning it for future reference).
There's still more work to do, as a couple of the guys didn't have links or didn't return an ERA or WHIP. You'll also have to account for a guy being traded or changing leagues; there might be multiple ERAs for the same 2019 season. But this should get you going:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

targetUrl = 'http://www.baseball-reference.com/previews/'
targetUrl_response = requests.get(targetUrl, timeout=5)
soup = BeautifulSoup(targetUrl_response.content, "html.parser")

teams = []
pitchers = []
era_list = []
whip_list = []
for i in soup.find_all('tr'):
    if i.find_all('strong'):
        for link in i.find_all('strong'):
            if not re.findall(r'MLB Debut', link.text):
                teams.append(link.text)
    if i.find_all('a'):
        for link in i.find_all('a'):
            if not re.findall(r'Preview', link.text):
                try:
                    url_link = link['href']
                    pitcher_table = pd.read_html(url_link)[0]
                    pitcher_table = pitcher_table[(pitcher_table['Year'] == '2019') & (pitcher_table['Lg'].isin(['AL', 'NL']))]
                    era = round(pitcher_table.iloc[0]['ERA'], 2)
                    whip = round(pitcher_table.iloc[0]['WHIP'], 2)
                except:
                    era = 'N/A'
                    whip = 'N/A'
                pitchers.append(link.text)
                era_list.append(era)
                whip_list.append(whip)
                print('%s\tERA: %s\tWHIP: %s' % (link.text, era, whip))

df = pd.DataFrame(list(zip(pitchers, teams, era_list, whip_list)), columns=['Pitcher', 'Team', 'ERA', 'WHIP'])
print(df)
Output:
print (df)
Pitcher Team ERA WHIP
0 Walker Lockett NYM 23.14 2.57
1 Jake Arrieta PHI 4.12 1.38
2 Logan Allen SDP 0 0.71
3 Jimmy Yacabonis BAL 4.7 1.44
4 Clayton Richard TOR 7.46 1.74
5 Glenn Sparkman KCR 3.62 1.25
6 Shane Bieber CLE 3.86 1.08
7 Carson Fulmer CHW 6.35 1.94
8 David Price BOS 3.39 1.1
9 Jesse Chavez TEX N/A N/A
10 Jordan Zimmermann DET 6.03 1.37
11 Max Scherzer WSN 2.62 1.06
12 Trevor Richards MIA 3.54 1.25
13 Max Fried ATL 4.03 1.34
14 Adbert Alzolay CHC 2.25 0.75
15 Marco Gonzales SEA 4.38 1.37
16 Zach Davies MIL 3.06 1.36
17 Trevor Williams PIT 4.12 1.19
18 Gerrit Cole HOU 3.54 1.02
19 Blake Snell TBR 4.4 1.24
20 Kyle Gibson MIN 4.18 1.25
21 Chris Bassitt OAK 3.64 1.17
22 Jack Flaherty STL 4.24 1.18
23 Ross Stripling LAD 3.08 1.17
24 Robbie Ray ARI 3.87 1.34
25 Chi Chi Gonzalez COL N/A N/A
26 Madison Bumgarner SFG 4.28 1.24
27 Tyler Mahle CIN 4.17 1.2
28 Andrew Heaney LAA 5.68 1.14
