Not sure why this isn't working :( I'm able to pull other tables from this page, just not this one.
import requests
from bs4 import BeautifulSoup as soup
url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
headers={'User-Agent': 'Mozilla/5.0'})
page = soup(url.content, 'html')
table = page.find('table', id='team_and_opponent')
print(table)
Appreciate the help.
The page is dynamic. So you have 2 options in this case.
Side note: If you see <table> tags, don't use BeautifulSoup, pandas can do that work for you (and it actually uses bs4 under the hood) by using pd.read_html()
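As a quick illustration of that side note, pd.read_html() will happily parse any string or file-like object containing <table> markup; the snippet below uses a tiny made-up table rather than the real page:

```python
import io

import pandas as pd

# A made-up HTML snippet just to illustrate the idea -- any page with
# <table> tags works the same way.
html = """
<table>
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>Kyrie Irving</td><td>24.4</td></tr>
  <tr><td>Jayson Tatum</td><td>13.9</td></tr>
</table>
"""

# read_html parses every <table> it finds and returns a list of DataFrames
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(df)
```

On the real page you pass the response text instead of the literal string, and you get one DataFrame per table.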
1) Use selenium to first render the page, and THEN you can use BeautifulSoup to pull out the <table> tags
2) Those tables are within the comment tags in the html. You can use BeautifulSoup to pull out the comments, then just grab the ones with 'table'.
I chose option 2.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
url = 'https://www.basketball-reference.com/teams/BOS/2018.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except ValueError:
            # comment mentioned 'table' but contained no parseable table
            continue
I don't know which particular table you want, but they are there in the list of tables
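If you know a column name that the table you want should contain, you can also pick it out of the list programmatically instead of by index. A minimal sketch (the two stand-in DataFrames here are invented for illustration):

```python
import pandas as pd

# Stand-in DataFrames -- in the real script these come from the `tables`
# list built in the loop above.
tables = [
    pd.DataFrame({"Rk": [1], "Salary": ["$29,727,900"]}),
    pd.DataFrame({"G": [82.0], "PTS": [8529]}),
]

# Pick the first table that has the column you care about
wanted = next(df for df in tables if "Salary" in df.columns)
print(wanted)
```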
Output:
print (tables[1])
Unnamed: 0 G MP FG FGA ... STL BLK TOV PF PTS
0 Team 82.0 19805 3141 6975 ... 604 373 1149 1618 8529
1 Team/G NaN 241.5 38.3 85.1 ... 7.4 4.5 14.0 19.7 104.0
2 Lg Rank NaN 12 25 25 ... 23 18 15 17 20
3 Year/Year NaN 0.3% -0.9% -0.0% ... -2.1% 9.7% 5.6% -4.0% -3.7%
4 Opponent 82.0 19805 3066 6973 ... 594 364 1159 1571 8235
5 Opponent/G NaN 241.5 37.4 85.0 ... 7.2 4.4 14.1 19.2 100.4
6 Lg Rank NaN 12 3 12 ... 7 6 19 9 3
7 Year/Year NaN 0.3% -3.2% -0.9% ... -4.7% -14.4% 1.6% -5.6% -4.7%
[8 rows x 24 columns]
or
print (tables[18])
Rk Unnamed: 1 Salary
0 1 Gordon Hayward $29,727,900
1 2 Al Horford $27,734,405
2 3 Kyrie Irving $18,868,625
3 4 Jayson Tatum $5,645,400
4 5 Greg Monroe $5,000,000
5 6 Marcus Morris $5,000,000
6 7 Jaylen Brown $4,956,480
7 8 Marcus Smart $4,538,020
8 9 Aron Baynes $4,328,000
9 10 Guerschon Yabusele $2,247,480
10 11 Terry Rozier $1,988,520
11 12 Shane Larkin $1,471,382
12 13 Semi Ojeleye $1,291,892
13 14 Abdel Nader $1,167,333
14 15 Daniel Theis $815,615
15 16 Demetrius Jackson $92,858
16 17 Jarell Eddie $83,129
17 18 Xavier Silas $74,159
18 19 Jonathan Gibson $44,495
19 20 Jabari Bird $0
20 21 Kadeem Allen $0
There is no table with id team_and_opponent on that page; rather, there is a span tag with this id. You can get results by changing the id.
This data is loaded dynamically (via JavaScript).
You should take a look here: Web-scraping JavaScript page with Python
For that you can use Selenium, or requests-html, which supports JavaScript
import requests
import bs4

url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
                   headers={'User-Agent': 'Mozilla/5.0'})
soup = bs4.BeautifulSoup(url.text, "lxml")
page = soup.select(".table_outer_container")
for i in page:
    print(i.text)
and you will get your desired output.
Need to scrape the full table from this site with "Load more" option.
As of now, when I'm scraping, I only get the table that shows up by default when the page loads.
import pandas as pd
import requests
from six.moves import urllib
URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"
header = {'Accept-Language': "en-US,en;q=0.9",
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}
resp2 = requests.get(url=URL2, headers=header).text
tables2 = pd.read_html(resp2)
overview_table2= tables2[0]
overview_table2
    Player Name            Team             Matches        Goals          Time Played    Unnamed: 5
0   Jorge Pereyra Diaz     Mumbai City      9              6              538 Mins       NaN
1   Cleiton Silva          SC East Bengal   8              5              707 Mins       NaN
2   Abdenasser El Khayati  Chennaiyin FC    5              4              231 Mins       NaN
3   Lallianzuala Chhangte  Mumbai City      9              4              737 Mins       NaN
4   Nandhakumar Sekar      Odisha           8              4              673 Mins       NaN
5   Ivan Kalyuzhnyi        Kerala Blasters  7              4              428 Mins       NaN
6   Bipin Singh            Mumbai City      9              4              806 Mins       NaN
7   Noah Sadaoui           Goa              8              4              489 Mins       NaN
8   Diego Mauricio         Odisha           8              3              526 Mins       NaN
9   Pedro Martin           Odisha           8              3              263 Mins       NaN
10  Dimitri Petratos       ATK Mohun Bagan  6              3              517 Mins       NaN
11  Petar Sliskovic        Chennaiyin FC    8              3              662 Mins       NaN
12  Holicharan Narzary     Hyderabad        9              3              705 Mins       NaN
13  Dimitrios Diamantakos  Kerala Blasters  7              3              529 Mins       NaN
14  Alberto Noguera        Mumbai City      9              3              371 Mins       NaN
15  Jerry Mawihmingthanga  Odisha           8              3              611 Mins       NaN
16  Hugo Boumous           ATK Mohun Bagan  7              2              580 Mins       NaN
17  Javi Hernandez         Bengaluru        6              2              397 Mins       NaN
18  Borja Herrera          Hyderabad        9              2              314 Mins       NaN
19  Mohammad Yasir         Hyderabad        9              2              777 Mins       NaN
20  Load More....          Load More....    Load More....  Load More....  Load More....  Load More....
But I need the full table , including the data under "Load more", please help.
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}


def main(url):
    params = {
        "action": "stats",
        "league_id": "750",
        "limit": "300",
        "offset": "0",
        "part": "leagues",
        "season_id": "2022",
        "section": "football",
        "stats_type": "player",
        "tab": "overview"
    }
    r = requests.get(url, headers=headers, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = [(x['title'], *[i.get_text(strip=True) for i in x.find_all_next('td', limit=4)])
            for x in soup.select('a.player_link')]
    df = pd.DataFrame(
        goal, columns=['Name', 'Team', 'Matches', 'Goals', 'Time Played'])
    print(df)


main('https://www.mykhel.com/src/index.php')
Output:
Name Team Matches Goals Time Played
0 Jorge Pereyra Diaz Mumbai City 9 6 538 Mins
1 Cleiton Silva SC East Bengal 8 5 707 Mins
2 Abdenasser El Khayati Chennaiyin FC 5 4 231 Mins
3 Lallianzuala Chhangte Mumbai City 9 4 737 Mins
4 Nandhakumar Sekar Odisha 8 4 673 Mins
.. ... ... ... ... ...
268 Sarthak Golui SC East Bengal 6 0 402 Mins
269 Ivan Gonzalez SC East Bengal 8 0 683 Mins
270 Michael Jakobsen NorthEast United 8 0 676 Mins
271 Pratik Chowdhary Jamshedpur FC 6 0 495 Mins
272 Chungnunga Lal SC East Bengal 8 0 720 Mins
[273 rows x 5 columns]
This is a dynamically loaded page, so you cannot parse all of its contents without hitting that button.
Well… maybe you can with XHR or something like that; maybe someone will contribute such an answer here.
I'll stick to working with dynamically loaded pages using the Selenium browser automation suite.
Installation
To get started, you'll need to install selenium bindings:
pip install selenium
You seem to already have BeautifulSoup, but for anyone who might come across this answer: we'll also need it and html5lib later to parse the table:
pip install html5lib BeautifulSoup4
Now, for selenium to work you'll need a driver installed for the browser of your choice. To get the drivers you may use Selenium Manager, driver management software, or download the drivers manually. The first two options are fairly new; I've had my manually downloaded drivers for ages, so I'll stick with them. I'll duplicate the download links here:
Browser    Link to driver download
Chrome     https://sites.google.com/chromium.org/driver/
Edge       https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox    https://github.com/mozilla/geckodriver/releases
Safari     https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Opera      https://github.com/operasoftware/operachromiumdriver/releases
You can use any browser, e.g. Brave browser, Yandex Browser, basically any Chromium-based browser of your choice, or even Tor browser 😮
Anyway, that's a bit out of this answer's scope; just keep in mind that for any browser and its family you'll need a driver.
I'll stick with Firefox. Hence you need Firefox installed and the driver placed somewhere. The best option would be to add that folder to the PATH variable.
If you choose a Chromium-based browser, the driver version has to match your installed browser version exactly. As for Firefox, I have a pretty old geckodriver 0.29.1 and it works like a charm with the latest browser update.
Hands on
import pandas as pd
from selenium import webdriver

URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"

driver = webdriver.Firefox()
driver.get(URL2)

element = driver.find_element_by_xpath("//a[text()=' Load More.... ']")
while element.is_displayed():
    driver.execute_script("arguments[0].click();", element)

table = driver.find_element_by_css_selector('table')
tables2 = pd.read_html(table.get_attribute('outerHTML'))
driver.close()

overview_table2 = tables2[0].dropna(how='all').dropna(axis='columns', how='all')
overview_table2 = overview_table2.drop_duplicates().reset_index(drop=True)
overview_table2
We only need pandas for our resulting table and selenium for web automation.
URL2 — is the same variable you used
driver = webdriver.Firefox() — here we instantiate Firefox and the browser will get opened. This is where selenium magic will happen.
Note: If you decided to skip adding the driver to your PATH variable, you can reference the driver location directly, e.g.:
webdriver.Firefox(executable_path=r"C:\WebDriver\bin\geckodriver.exe")
webdriver.Chrome(service=Service(executable_path="/path/to/chromedriver"))
driver.get(URL2) — open the desired page
element = driver.find_element_by_xpath("//a[text()=' Load More.... ']")
Using xpath selector we find a link that has the same text as your 20th row.
With that stored element we click it all the time till it disappears.
It would be more sensible and easy to just use element.click(), but it results in an error. More info in another Stack Overflow question.
Assign table variable with a corresponding element.
tables2: I left this weird variable name as it was in your question.
Here we get outerHTML, as innerHTML would render the contents of the <table> tag, but not the tag itself.
We should not forget to .close() our driver as we don't need it anymore.
As a result of the HTML parsing there will be a list just like the one in the question. I drop the unnamed column and the last empty row.
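The outerHTML vs. innerHTML distinction can be checked offline with BeautifulSoup, where str(tag) plays the role of outerHTML and tag.decode_contents() the role of innerHTML (toy snippet, not the real page):

```python
from bs4 import BeautifulSoup

snippet = "<table><tr><td>42</td></tr></table>"
tag = BeautifulSoup(snippet, "html.parser").find("table")

outer = str(tag)               # includes the <table> tag itself
inner = tag.decode_contents()  # only what is inside the tag

print(outer)  # <table><tr><td>42</td></tr></table>
print(inner)  # <tr><td>42</td></tr>
```

pd.read_html needs the wrapping <table> tag to find the table at all, which is why outerHTML is the right choice here.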
The resulting overview_table2 looks like:
     Player Name            Team              Matches  Goals  Time Played
0    Jorge Pereyra Diaz     Mumbai City       9.0      6.0    538 Mins
1    Cleiton Silva          SC East Bengal    8.0      5.0    707 Mins
2    Abdenasser El Khayati  Chennaiyin FC     5.0      4.0    231 Mins
...  ...                    ...               ...      ...    ...
270  Michael Jakobsen       NorthEast United  8.0      0.0    676 Mins
271  Pratik Chowdhary       Jamshedpur FC     6.0      0.0    495 Mins
272  Chungnunga Lal         SC East Bengal    8.0      0.0    720 Mins
Side note
Job done. As a further improvement you may play with different browsers and try headless mode, a mode where the browser does not open in your desktop environment but rather runs silently in the background.
https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A
I want to scrape the Team Stats, such as Possession and Shots on Target, and also what's below, like Fouls, Corners...
What I have now is very overcomplicated code, basically stripping and splitting this string multiple times to grab the values I want.
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# getting a general info dataframe with all matches
championship_url = 'https://fbref.com/en/comps/24/1495/schedule/2016-Serie-A-Scores-and-Fixtures'
data = requests.get(championship_url)
time.sleep(3)
matches = pd.read_html(data.text, match="Resultados e Calendários")[0]

# putting stats info in each match entry (this is an example match to test)
match_url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
data = requests.get(match_url)
time.sleep(3)
soup = BeautifulSoup(data.text, features='lxml')

# ID the match to merge later on
home_team = soup.find("h1").text.split()[0]
round_week = float(soup.find("div", {'id': 'content'}).text.split()[18].strip(')'))

# collecting stats (kept as raw text for the string splitting below)
stats = soup.find("div", {"id": "team_stats"}).text  # first part of stats, with the progress bars
stats_extra = soup.find("div", {"id": "team_stats_extra"}).text.split()[2:]  # second part

all_stats = {'posse_casa':[], 'posse_fora':[], 'chutestotais_casa':[], 'chutestotais_fora':[],
             'acertopasses_casa':[], 'acertopasses_fora':[], 'chutesgol_casa':[], 'chutesgol_fora':[],
             'faltas_casa':[], 'faltas_fora':[], 'escanteios_casa':[], 'escanteios_fora':[],
             'cruzamentos_casa':[], 'cruzamentos_fora':[], 'contatos_casa':[], 'contatos_fora':[],
             'botedef_casa':[], 'botedef_fora':[], 'aereo_casa':[], 'aereo_fora':[],
             'defesas_casa':[], 'defesas_fora':[], 'impedimento_casa':[], 'impedimento_fora':[],
             'tirometa_casa':[], 'tirometa_fora':[], 'lateral_casa':[], 'lateral_fora':[],
             'bolalonga_casa':[], 'bolalonga_fora':[], 'Em casa':[home_team], 'Sem':[round_week]}

# not gonna copy everything, but it is kinda like this for each stat
# stats = '\nEstatísticas do time\n\n\nCoritiba \n\n\n\t\n\n\n\n\n\n\n\n\n\n Cuiabá\n\nPosse\n\n\n\n42%\n\n\n\n\n\n58%\n\n\n\n\nChutes ao gol\n\n\n\n2 of 4\xa0—\xa050%\n\n\n\n\n\n0%\xa0—\xa00 of 8\n\n\n\n\nDefesas\n\n\n\n0 of 0\xa0—\xa0%\n\n\n\n\n\n50%\xa0—\xa01 of 2\n\n\n\n\nCartões\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

# first grabbing 42% possession
all_stats['posse_casa'] = stats.replace('\n', '').replace('\t', '')[20:].split('Posse')[1][:5].split('%')[0]
# grabbing 58% possession
all_stats['posse_fora'] = stats.replace('\n', '').replace('\t', '')[20:].split('Posse')[1][:5].split('%')[1]

all_stats_df = pd.DataFrame.from_dict(all_stats)
championship_data = matches.merge(all_stats_df, on=['Em casa', 'Sem'])
There are a lot of stats in that dict because in previous championship years FBref has all those stats; only in the current-year championship are there just 12 of them to fill. I intend to run the code over 5-6 different years, so I made a version with all the stats, and for current-year games I intend to fill with nothing when a stat isn't on the page to scrape.
You can get Fouls, Corners and Offsides and 7 tables worth of data from that page with the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
coritiba_fouls = soup.find('div', string='Fouls').previous_sibling.text.strip()
cuiaba_fouls = soup.find('div', string='Fouls').next_sibling.text.strip()
coritiba_corners = soup.find('div', string='Corners').previous_sibling.text.strip()
cuiaba_corners = soup.find('div', string='Corners').next_sibling.text.strip()
coritiba_offsides = soup.find('div', string='Offsides').previous_sibling.text.strip()
cuiaba_offsides = soup.find('div', string='Offsides').next_sibling.text.strip()
print('Coritiba Fouls: ' + coritiba_fouls, 'Cuiaba Fouls: ' + cuiaba_fouls)
print('Coritiba Corners: ' + coritiba_corners, 'Cuiaba Corners: ' + cuiaba_corners)
print('Coritiba Offsides: ' + coritiba_offsides, 'Cuiaba Offsides: ' + cuiaba_offsides)
dfs = pd.read_html(r.text)
print('Number of tables: ' + str(len(dfs)))
for df in dfs:
    print(df)
    print('___________')
This will print in the terminal:
Coritiba Fouls: 16 Cuiaba Fouls: 12
Coritiba Corners: 4 Cuiaba Corners: 4
Coritiba Offsides: 0 Cuiaba Offsides: 1
Number of tables: 7
Coritiba (4-2-3-1) Coritiba (4-2-3-1).1
0 23 Alex Muralha
1 2 Matheus Alexandre
2 3 Henrique
3 4 Luciano Castán
4 6 Egídio Pereira Júnior
5 9 Léo Gamalho
6 11 Alef Manga
7 25 Bernanrdo Lemes
8 78 Régis
9 97 Valdemir
10 98 Igor Paixão
11 Bench Bench
12 21 Rafael William
13 5 Guillermo de los Santos
14 15 Matías Galarza
15 16 Natanael
16 18 Guilherme Biro
17 19 Thonny Anderson
18 28 Pablo Javier García
19 32 Bruno Gomes
20 44 Márcio Silva
21 52 Adrián Martínez
22 75 Luiz Gabriel
23 88 Hugo
___________
Cuiabá (4-1-4-1) Cuiabá (4-1-4-1).1
0 1 Walter
1 2 João Lucas
2 3 Joaquim
3 4 Marllon Borges
4 5 Camilo
5 6 Igor Cariús
6 7 Alesson
7 8 João Pedro Pepê
8 9 Valdívia
9 10 Rodriguinho Marinho
10 11 Rafael Gava
11 Bench Bench
12 12 João Carlos
13 13 Daniel Guedes
14 14 Paulão
15 15 Marcão Silva
16 16 Cristian Rivas
17 17 Gabriel Pirani
18 18 Jenison
19 19 André
20 20 Kelvin Osorio
21 21 Jonathan Cafu
22 22 André Luis
23 23 Felipe Marques
___________
Coritiba Cuiabá
Possession Possession
0 42% 58%
1 Shots on Target Shots on Target
2 2 of 4 — 50% 0% — 0 of 8
3 Saves Saves
4 0 of 0 — % 50% — 1 of 2
5 Cards Cards
6 NaN NaN
_____________
[....]
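The sibling navigation used in this answer can be sketched offline; the snippet below is invented to mimic the value-label-value layout of the team stats block:

```python
from bs4 import BeautifulSoup

# Invented snippet mimicking the layout: home value, stat label, away value
html = "<div><div>16</div><div>Fouls</div><div>12</div></div>"
soup = BeautifulSoup(html, "html.parser")

label = soup.find("div", string="Fouls")
home = label.previous_sibling.text.strip()  # value just before the label
away = label.next_sibling.text.strip()      # value just after the label
print(home, away)  # 16 12
```

Because find(string=...) matches on the tag's text, you can look up any stat by its label and read its neighbours, instead of counting offsets into a split string.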
When I go to scrape https://www.onthesnow.com/epic-pass/skireport for the names of all the ski resorts listed, I'm running into an issue where some of the ski resorts don't show up in my output. Here's my current code:
import requests
url = "https://www.onthesnow.com/epic-pass/skireport"
response = requests.get(url)
response.text
The current output gives all resorts up to Mont Sainte Anne, but then it skips to the resorts at the bottom of the webpage under "closed resorts". I notice that when you scroll down the webpage in a browser, the missing resort names only load once you reach them. How do I make my requests.get() call obtain all of the HTML, even the HTML that still needs to load?
The data you see is loaded from an external URL in JSON form. To load it, you can use this example:
import json
import requests
url = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/1?limit=999"
data = requests.get(url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for i, d in enumerate(data["data"], 1):
    print(i, d["title"])
Prints:
1 Beaver Creek
2 Breckenridge
3 Brides les Bains
4 Courchevel
5 Crested Butte Mountain Resort
6 Fernie Alpine
7 Folgàrida - Marilléva
8 Heavenly
9 Keystone
10 Kicking Horse
11 Kimberley
12 Kirkwood
13 La Tania
14 Les Menuires
15 Madonna di Campiglio
16 Meribel
17 Mont Sainte Anne
18 Nakiska Ski Area
19 Nendaz
20 Northstar California
21 Okemo Mountain Resort
22 Orelle
23 Park City
24 Pontedilegno - Tonale
25 Saint Martin de Belleville
26 Snowbasin
27 Stevens Pass Resort
28 Stoneham
29 Stowe Mountain
30 Sun Valley
31 Thyon 2000
32 Vail
33 Val Thorens
34 Verbier
35 Veysonnaz
36 Whistler Blackcomb
The page I am trying to get info from is https://www.pro-football-reference.com/teams/crd/2017_roster.htm.
I'm trying to get all the information from the "Roster" table, but for some reason I can't get it through BeautifulSoup. I've tried soup.find("div", {'id': 'div_games_played_team'}) but it doesn't work. When I look at the page's HTML I can see the table inside a very large comment and in a regular div. How can I use BeautifulSoup to get the information from this table?
You don't need Selenium. What you can do (and you correctly identified it) is pull out the comments, and then parse the table from within them.
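The core trick can be sketched offline first; the div id and table contents in this snippet are made up just to mirror how the site hides tables inside HTML comments:

```python
import pandas as pd
from bs4 import BeautifulSoup, Comment

# Made-up HTML mimicking a table hidden inside an HTML comment
html = """
<div id="div_games_played_team">
  <!-- <table><tr><th>Player</th></tr><tr><td>Budda Baker</td></tr></table> -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Comment nodes are string subclasses, so we filter all strings by type
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
# Each matching comment is itself a string of HTML that read_html can parse
dfs = [pd.read_html(c)[0] for c in comments if "table" in c]
print(dfs[0])
```

The full version against the real page follows the same shape: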
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
url = 'https://www.pro-football-reference.com/teams/crd/2017_roster.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except ValueError as e:
            print(e)
            continue
Output:
print(tables[0].head().to_string())
No. Player Age Pos G GS Wt Ht College/Univ BirthDate Yrs AV Drafted (tm/rnd/yr) Salary
0 54.0 Bryson Albright 23.0 NaN 7 0.0 245.0 6-5 Miami (OH) 3/15/1994 1 0.0 NaN $246,177
1 36.0 Budda Baker*+ 21.0 ss 16 7.0 195.0 5-10 Washington 1/10/1996 Rook 9.0 Arizona Cardinals / 2nd / 36th pick / 2017 $465,000
2 64.0 Khalif Barnes 35.0 NaN 3 0.0 320.0 6-6 Washington 4/21/1982 12 0.0 Jacksonville Jaguars / 2nd / 52nd pick / 2005 $176,471
3 41.0 Antoine Bethea 33.0 db 15 6.0 206.0 5-11 Howard 7/27/1984 11 4.0 Indianapolis Colts / 6th / 207th pick / 2006 $2,000,000
4 28.0 Justin Bethel 27.0 rcb 16 6.0 200.0 6-0 Presbyterian 6/17/1990 5 3.0 Arizona Cardinals / 6th / 177th pick / 2012 $2,000,000
....
The tag you are trying to scrape is dynamically generated by JavaScript. You are most likely using requests to scrape your HTML. Unfortunately requests will not run JavaScript because it pulls all the HTML in as raw text. BeautifulSoup can not find the tag because it was never generated within your scraping program.
I recommend using Selenium. It's not a perfect solution - just the best one for your problem. The Selenium WebDriver will execute the JavaScript to generate the page's HTML. Then you can use BeautifulSoup to parse whatever it is that you are after. See Selenium with Python for further help on how to get started.
I'm trying to scrape fantasy player data from the following site: http://www.fplstatistics.co.uk/. The table appears upon opening the site, but it's not visible when I scrape the site.
I tried the following:
import requests as rq
from bs4 import BeautifulSoup
fplStatsPage = rq.get('http://www.fplstatistics.co.uk')
fplStatsPageSoup = BeautifulSoup(fplStatsPage.text, 'html.parser')
fplStatsPageSoup
And the table was nowhere to be seen. What came back in place of where the table should be is:
<div>
The 'Player Data' is out of date.
<br/> <br/>
You need to refresh the web page.
<br/> <br/>
Press F5 or hit <i class="fa fa-refresh"></i>
</div>
This message appears on the site whenever the table is updated.
I then looked at the developer tools to see if I can find the URL from where the table data is retrieved, but I had no luck. Probably because I don't know how to read the developer tools well.
I then tried to refresh the page as the above message says using Selenium:
from selenium import webdriver
import time
chromeDriverPath = '/Users/SplitShiftKing/Downloads/Software/chromedriver'
driver = webdriver.Chrome(chromeDriverPath)
driver.get('http://www.fplstatistics.co.uk')
driver.refresh()
#To give site enough time to refresh
time.sleep(15)
html = driver.page_source
fplStatsPageSoup = BeautifulSoup(html, 'html.parser')
fplStatsPageSoup
The output was the same as before. The table appears on the site but not in the output.
Assistance would be appreciated. I've looked at similar questions on overflow, but I haven't been able to find a solution.
Why not go straight to the source that pulls in that data? The only thing you need to work out is the column names, but this gets you all the data in one request and without using selenium:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
s = requests.Session()
url = 'http://www.fplstatistics.co.uk/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Mobile Safari/537.36'}
response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if '"iselRow"' in script.text:
        iselRowVal = re.search(r'"value":(.+?)}\);}', script.text).group(1).strip()
url = 'http://www.fplstatistics.co.uk/Home/AjaxPricesFHandler'
payload = {
'iselRow': iselRowVal,
'_': ''}
jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData['aaData'])
Output:
print (df.head(5).to_string())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 Mustafi Arsenal D A 0.3 5.2 £5.2m 0 --- 110 -95.6 -95.6 -1 -1 Mustafi Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
1 Bellerín Arsenal D I 0.3 5.4 £5.4m 0 --- 54024 2.6 2.6 -2 -2 Bellerin Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
2 Kolasinac Arsenal D I 0.6 5.2 £5.2m 0 --- 5464 -13.9 -13.9 -2 -2 Kolasinac Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
3 Maitland-Niles Arsenal D A 2.6 4.6 £4.6m 0 --- 11924 -39.0 -39.0 -2 -2 Maitland-Niles Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
4 Sokratis Arsenal D S 1.5 4.9 £4.9m 0 --- 19709 -29.4 -29.4 -2 -2 Sokratis Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
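The regex step above, which digs the iselRow value out of an inline script, can be sketched offline; the script blob below is invented to mirror the pattern on the page:

```python
import re

# Invented stand-in for the inline JS that embeds the iselRow value
script_text = 'somesetup({"name":"iselRow","value":123456});}'

# Lazily capture everything between "value": and the closing });}
iselRowVal = re.search(r'"value":(.+?)}\);}', script_text).group(1).strip()
print(iselRowVal)  # 123456
```

Once extracted, the value is sent as the iselRow query parameter to the AjaxPricesFHandler endpoint, exactly as in the code above.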
By requesting driver.page_source you're cancelling out any benefit you get from Selenium: the page source does not contain the table you want. That table is updated dynamically via JavaScript after the page has loaded. You need to retrieve the table using methods on your driver, rather than using BeautifulSoup. For example:
>>> from selenium import webdriver
>>> d = webdriver.Chrome()
>>> d.get('http://www.fplstatistics.co.uk')
>>> table = d.find_element_by_id('myDataTable')
>>> print('\n'.join(x.text for x in table.find_elements_by_tag_name('tr')))
Name
Club
Pos
Status
%Owned
Price
Chgs
Unlocks
Delta
Target
Kelly Crystal Palace D A 30.7 £4.3m 0 --- 0
101.0
Rico Bournemouth D A 14.6 £4.3m 0 --- 0
100.9
Baldock Sheffield Utd D A 7.1 £4.8m 0 --- 88 99.8
Rashford Man Utd F A 26.4 £9.0m 0 --- 794 98.6
Son Spurs M A 21.6 £10.0m 0 --- 833 98.5
Henderson Sheffield Utd G A 7.8 £4.7m 0 --- 860 98.4
Grealish Aston Villa M A 8.9 £6.1m 0 --- 1088 98.0
Kane Spurs F A 19.3 £10.9m 0 --- 3961 92.9
Reid West Ham D A 4.6 £3.9m 0 --- 4029 92.7
Richarlison Everton M A 7.7 £7.8m 0 --- 5405 90.3