Scraping tbody data -- trouble - python

I'm new to this, and I'm struggling to understand why nothing seems to be working.
Basically, all I want to do is scrape the table data (play by play) from a list of stats.ncaa.org links (sample below):
https://stats.ncaa.org/game/play_by_play/12465
https://stats.ncaa.org/game/play_by_play/12755
https://stats.ncaa.org/game/play_by_play/12640
https://stats.ncaa.org/game/play_by_play/12290
For context, I got these links by scraping HREF tags from a different list of links (which contained every NCAA team's game schedule).
I've struggled through a lot of this, but so far there has always been an answer somewhere...
The inspector makes it seem like the play-by-play (table) data lives in a tbody tag, or at least I think so?
I've tried a script as simple as this (which works for other websites):
import pandas as pd

df = pd.read_html('https://stats.ncaa.org/game/play_by_play/13592')[0]
print(df)
But it still didn't work for this site... I read a bit about using lxml instead of html.parser (as in the code below), but that's not working either -- I thought this was my best chance at getting the tables from multiple links at once:
import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://stats.ncaa.org/game/play_by_play/12564',
    'https://stats.ncaa.org/game/play_by_play/13592'
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    for profile in soup.find_all('a'):
        profile = profile.get('tbody')
        profiles.append(profile)
# print(profiles)
for p in profiles:
    print(p)
Any thoughts as to what is unique about this site/what could be the issue would be greatly appreciated.

That website checks whether the request comes from a bot or a browser, so you need to update the requests headers with a real browser user-agent.
Each page has 8 tables. The code below goes through each URL you mentioned above and prints out all the tables; you can review them and see which ones you need:
import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
urls = [
    'https://stats.ncaa.org/game/play_by_play/12465',
    'https://stats.ncaa.org/game/play_by_play/12755',
    'https://stats.ncaa.org/game/play_by_play/12640',
    'https://stats.ncaa.org/game/play_by_play/12290',
]

s = requests.Session()
s.headers.update(headers)

for url in urls:
    r = s.get(url)
    dfs = pd.read_html(r.text)
    len(dfs)
    for df in dfs:
        print(df)
        print('___________')
Response:
0 1 2 3
0 NaN 1st Half 2nd Half Total
1 UTRGV 23 19 42
2 UTEP 26 34 60
___________
0 1
0 Game Date: 01/02/2009
1 Location: El Paso, Texas (Don Haskins Center)
2 Attendance: 8413
___________
0 1
0 Officials: John Higgins, Duke Edsall, Quinton, Reece
___________
0 1
0 1st Half 1 2
___________
0 1 2 \
0 Time UTRGV Score
1 19:45 NaN 0-0
2 19:45 NaN 0-0
3 19:33 NaN 0-0
4 19:33 Emmanuel Jones Defensive Rebound 0-0
.. ... ... ...
163 00:11 NaN 23-25
164 00:11 NaN 23-26
165 00:00 Emmanuel Jones missed Two Point Jumper 23-26
166 00:00 NaN 23-26
167 End of 1st Half End of 1st Half End of 1st Half
3
0 UTEP
1 Arnett Moultrie missed Two Point Jumper
2 Julyan Stone Offensive Rebound
3 Stefon Jackson missed Three Point Jumper
4 NaN
.. ...
163 Stefon Jackson made Free Throw
164 Stefon Jackson made Free Throw
165 NaN
166 Julyan Stone Defensive Rebound
167 End of 1st Half
[168 rows x 4 columns]
___________
0 1
0 2nd Half 1 2
[..]
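If you only want the play-by-play tables (and not the score summary or game details), here is a minimal sketch. It assumes, based on the output above, that the play-by-play tables are the four-column ones whose first cell is "Time"; adjust the filter if the layout differs:
import pandas as pd
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

url = 'https://stats.ncaa.org/game/play_by_play/12465'
r = requests.get(url, headers=headers)
dfs = pd.read_html(r.text)

# keep only the tables that look like play-by-play: 4 columns, first cell "Time"
# (an assumption based on the printed output above)
pbp = [df for df in dfs if df.shape[1] == 4 and str(df.iloc[0, 0]).strip() == 'Time']

# stack both halves into a single frame
plays = pd.concat(pbp, ignore_index=True)
print(plays)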

Related

Unable to load tables with "Load more" options in a website using Python

Need to scrape the full table from this site, which has a "Load more" option.
As of now when I'm scraping, I only get the rows that show up by default when the page loads.
import pandas as pd
import requests
from six.moves import urllib
URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"
header = {'Accept-Language': "en-US,en;q=0.9",
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}
resp2 = requests.get(url=URL2, headers=header).text
tables2 = pd.read_html(resp2)
overview_table2= tables2[0]
overview_table2
    Player Name             Team             Matches  Goals  Time Played    Unnamed: 5
0   Jorge Pereyra Diaz      Mumbai City      9        6      538 Mins       NaN
1   Cleiton Silva           SC East Bengal   8        5      707 Mins       NaN
2   Abdenasser El Khayati   Chennaiyin FC    5        4      231 Mins       NaN
3   Lallianzuala Chhangte   Mumbai City      9        4      737 Mins       NaN
4   Nandhakumar Sekar       Odisha           8        4      673 Mins       NaN
5   Ivan Kalyuzhnyi         Kerala Blasters  7        4      428 Mins       NaN
6   Bipin Singh             Mumbai City      9        4      806 Mins       NaN
7   Noah Sadaoui            Goa              8        4      489 Mins       NaN
8   Diego Mauricio          Odisha           8        3      526 Mins       NaN
9   Pedro Martin            Odisha           8        3      263 Mins       NaN
10  Dimitri Petratos        ATK Mohun Bagan  6        3      517 Mins       NaN
11  Petar Sliskovic         Chennaiyin FC    8        3      662 Mins       NaN
12  Holicharan Narzary      Hyderabad        9        3      705 Mins       NaN
13  Dimitrios Diamantakos   Kerala Blasters  7        3      529 Mins       NaN
14  Alberto Noguera         Mumbai City      9        3      371 Mins       NaN
15  Jerry Mawihmingthanga   Odisha           8        3      611 Mins       NaN
16  Hugo Boumous            ATK Mohun Bagan  7        2      580 Mins       NaN
17  Javi Hernandez          Bengaluru        6        2      397 Mins       NaN
18  Borja Herrera           Hyderabad        9        2      314 Mins       NaN
19  Mohammad Yasir          Hyderabad        9        2      777 Mins       NaN
20  Load More....           Load More....    Load More....   Load More....  Load More....  Load More....
But I need the full table, including the data under "Load more". Please help.
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}


def main(url):
    params = {
        "action": "stats",
        "league_id": "750",
        "limit": "300",
        "offset": "0",
        "part": "leagues",
        "season_id": "2022",
        "section": "football",
        "stats_type": "player",
        "tab": "overview"
    }
    r = requests.get(url, headers=headers, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = [(x['title'], *[i.get_text(strip=True) for i in x.find_all_next('td', limit=4)])
            for x in soup.select('a.player_link')]
    df = pd.DataFrame(
        goal, columns=['Name', 'Team', 'Matches', 'Goals', 'Time Played'])
    print(df)


main('https://www.mykhel.com/src/index.php')
Output:
Name Team Matches Goals Time Played
0 Jorge Pereyra Diaz Mumbai City 9 6 538 Mins
1 Cleiton Silva SC East Bengal 8 5 707 Mins
2 Abdenasser El Khayati Chennaiyin FC 5 4 231 Mins
3 Lallianzuala Chhangte Mumbai City 9 4 737 Mins
4 Nandhakumar Sekar Odisha 8 4 673 Mins
.. ... ... ... ... ...
268 Sarthak Golui SC East Bengal 6 0 402 Mins
269 Ivan Gonzalez SC East Bengal 8 0 683 Mins
270 Michael Jakobsen NorthEast United 8 0 676 Mins
271 Pratik Chowdhary Jamshedpur FC 6 0 495 Mins
272 Chungnunga Lal SC East Bengal 8 0 720 Mins
[273 rows x 5 columns]
This is a dynamically loaded page, so you cannot parse all the contents without clicking a button.
Well… maybe you can with XHR or something like that; maybe someone will contribute such an answer here.
I'll stick to handling dynamically loaded pages with the Selenium browser automation suite.
Installation
To get started, you'll need to install selenium bindings:
pip install selenium
You seem to have BeautifulSoup already, but for anyone who comes across this answer: we'll also need it and html5lib later to parse the table:
pip install html5lib BeautifulSoup4
Now, for Selenium to work you'll need a driver installed for a browser of your choice. To get the drivers you may use Selenium Manager, driver management software, or download the drivers manually. The first two options are fairly new; I've had manually downloaded drivers for ages, so I'll stick to them. Here are the download links:
Chrome: https://sites.google.com/chromium.org/driver/
Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Opera: https://github.com/operasoftware/operachromiumdriver/releases
You can use any browser, e.g. Brave browser, Yandex Browser, basically any Chromium based browser of your choice or even Tor browser 😮
Anyway, that's a bit outside the scope of this answer; just keep in mind that for any browser family you'll need a matching driver.
I'll stick with Firefox, so you'll need Firefox installed and the driver placed somewhere; the best option is to add that folder to your PATH variable.
If you choose Chromium, the driver version has to match your Chrome browser version exactly. As for Firefox, I have a pretty old geckodriver 0.29.1 and it works like a charm with the latest Firefox.
Hands on
import pandas as pd
from selenium import webdriver

URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"

driver = webdriver.Firefox()
driver.get(URL2)

element = driver.find_element_by_xpath("//a[text()=' Load More.... ']")
while element.is_displayed():
    driver.execute_script("arguments[0].click();", element)

table = driver.find_element_by_css_selector('table')
tables2 = pd.read_html(table.get_attribute('outerHTML'))
driver.close()

overview_table2 = tables2[0].dropna(how='all').dropna(axis='columns', how='all')
overview_table2 = overview_table2.drop_duplicates().reset_index(drop=True)
overview_table2
We only need pandas for our resulting table and selenium for web automation.
URL2 is the same variable you used.
driver = webdriver.Firefox() instantiates Firefox and opens the browser window; this is where the Selenium magic happens.
Note: if you decided to skip adding the driver folder to your PATH variable, you can reference the driver location directly, e.g.:
webdriver.Firefox(executable_path=r"C:\WebDriver\bin\geckodriver.exe")  # Selenium 3 style
webdriver.Chrome(service=Service(executable_path="/path/to/chromedriver"))  # Selenium 4 style; needs: from selenium.webdriver.chrome.service import Service
driver.get(URL2) — open the desired page
element = driver.find_element_by_xpath("//a[text()=' Load More.... ']")
Using an XPath selector we find the link that has the same text as your 20th row.
With that stored element we keep clicking the link until it disappears.
It would be simpler to just use element.click(), but that results in an error; there is more info in another Stack Overflow question (a wait-based alternative is sketched after this list).
Assign the table variable to the corresponding element.
tables2: I left this odd variable name as it is in your question.
Here we get outerHTML, as innerHTML would render the contents of the <table> tag but not the tag itself.
We should not forget to .close() our driver, as we don't need it anymore.
The result of the HTML parsing is a list of tables, just like in your question. Here I drop the unnamed column and the last empty row.
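For reference, a wait-based variant of that click loop might look like the sketch below (this is not what the answer above runs, just a common alternative); it re-locates the link on every pass and stops once it can no longer be clicked:
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)
while True:
    try:
        # re-locate the link each iteration so we never hold a stale reference
        link = wait.until(EC.element_to_be_clickable(
            (By.XPATH, "//a[text()=' Load More.... ']")))
    except TimeoutException:
        break  # link is gone or no longer clickable: everything is loaded
    # the JavaScript click is kept because a plain link.click() was reported to fail on this page
    driver.execute_script("arguments[0].click();", link)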
The resulting overview_table2 looks like:
     Player Name             Team              Matches  Goals  Time Played
0    Jorge Pereyra Diaz      Mumbai City       9.0      6.0    538 Mins
1    Cleiton Silva           SC East Bengal    8.0      5.0    707 Mins
2    Abdenasser El Khayati   Chennaiyin FC     5.0      4.0    231 Mins
...  ...                     ...               ...      ...    ...
270  Michael Jakobsen        NorthEast United  8.0      0.0    676 Mins
271  Pratik Chowdhary        Jamshedpur FC     6.0      0.0    495 Mins
272  Chungnunga Lal          SC East Bengal    8.0      0.0    720 Mins
Side note
Job done. As a further improvement you may play with different browsers and try headless mode, a mode in which the browser does not open in your desktop environment but runs silently in the background.
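A minimal headless sketch with Firefox could look like this (option handling differs slightly between Selenium versions, so treat it as a starting point):
from selenium import webdriver

URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"

options = webdriver.FirefoxOptions()
options.add_argument("-headless")  # run Firefox without opening a window

driver = webdriver.Firefox(options=options)
driver.get(URL2)
# ... same clicking and table extraction as above ...
driver.quit()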

Python Limit time to run pandas read_html

I am trying to limit the time for running dfs = pd.read_html(str(response.text)). Once it runs for more than 5 seconds, it should stop for that url and move on to the next url. I did not find a timeout argument in pd.read_html, so how can I do that?
from bs4 import BeautifulSoup
import re
import requests
import os
import time
from pandas import DataFrame
import pandas as pd
from urllib.request import urlopen

headers = {'User-Agent': 'regsre#jh.edu'}
urls = {'https://www.sec.gov/Archives/edgar/data/1058307/0001493152-21-003451.txt',
        'https://www.sec.gov/Archives/edgar/data/1064722/0001760319-21-000006.txt'}

for url in urls:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    time.sleep(0.1)
    dfs = pd.read_html(str(response.text))
    print(url)
    for item in dfs:
        try:
            Operation = (item[0].apply(str).str.contains('Revenue') | item[0].apply(str).str.contains('profit'))
            if Operation.empty:
                pass
            if Operation.any():
                Operation_sheet = item
            if not Operation.any():
                CashFlows = (item[0].apply(str).str.contains('income') | item[0].apply(str).str.contains('loss'))
                if CashFlows.any():
                    Operation_sheet = item
                if not CashFlows.any():
                    pass
I'm not certain what the issue is, but pandas seems to get overwhelmed by this file. If we utilize BeautifulSoup to instead search for tables, prettify them, and pass those to pd.read_html(), then it seems to be able to handle things just fine.
from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 'regsre#jh.edu'}
url = 'https://www.sec.gov/Archives/edgar/data/1064722/0001760319-21-000006.txt'

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text)

dfs = []
for table in soup.find_all('table'):
    dfs.extend(pd.read_html(table.prettify()))

# Printing the first few:
for df in dfs[0:3]:
    print(df, '\n')
0 1 2 3 4
0 Nevada NaN 4813 NaN 65-0783722
1 (State or other jurisdiction of NaN (Primary Standard Industrial NaN (I.R.S. Employer
2 incorporation or organization) NaN Classification Code Number) NaN Identification Number)
0
0 Ralph V. De Martino, Esq.
1 Alec Orudjev, Esq.
2 Schiff Hardin LLP
3 901 K Street, NW, Suite 700
4 Washington, DC 20001
5 Phone (202) 778-6400
6 Fax: (202) 778-6460
0 1
0 Large accelerated filer [ ] Accelerated filer [ ]
1 NaN NaN
2 Non-accelerated filer [X] Smaller reporting company [X]
3 NaN NaN
4 NaN Emerging growth company [ ]
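If you still want a hard per-URL time limit on top of this (pd.read_html has no timeout of its own), one way, sketched here and not part of the answer above, is to run the parse in a separate process and terminate it after 5 seconds:
import multiprocessing as mp

import pandas as pd
import requests

headers = {'User-Agent': 'regsre#jh.edu'}
urls = ['https://www.sec.gov/Archives/edgar/data/1058307/0001493152-21-003451.txt',
        'https://www.sec.gov/Archives/edgar/data/1064722/0001760319-21-000006.txt']

if __name__ == '__main__':  # needed on platforms that spawn worker processes
    for url in urls:
        text = requests.get(url, headers=headers).text
        pool = mp.Pool(processes=1)
        job = pool.apply_async(pd.read_html, (text,))
        try:
            dfs = job.get(timeout=5)  # give the parse at most 5 seconds
            print(url, 'parsed', len(dfs), 'tables')
        except mp.TimeoutError:
            print(url, 'skipped: parsing took longer than 5 seconds')
        finally:
            pool.terminate()  # kill the worker so we really move on to the next url
            pool.join()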

beautiful soup find_all() not returning all elements

I am trying to scrape this website using bs4. Using inspect on a particular car ad tile, I figured out what I need to scrape in order to get the title and the link to the car's page.
I am making use of the find_all() function of the bs4 library, but the issue is that it's not scraping the required info for all the cars. It returns info for only about 21 cars, whereas it's clearly visible on the website that there are about 2410 cars.
The relevant code:
from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen
import re
import requests
url = 'https://www.cardekho.com/used-cars+in+bangalore'
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = bs(webpage,"html.parser")
tags = page_soup.find_all("div","title")
print(len(tags))
How do I get info on all of the cars present on the page?
P.S. I want to point out just one thing: all the cars aren't displayed at once; more car info gets loaded as you scroll down. Could it be because of that? Not sure.
OK, I've written up some sample code to show you how it can be done. Although the site has a convenient API that we can leverage, the first page is not available through the API; it is embedded in a script tag in the HTML code, which requires additional processing to extract. After that it is simply a matter of getting the JSON data from the API, parsing it into Python dictionaries, and appending the car entries to a list. The link to the API can be found by inspecting network activity in Chrome or Firefox while scrolling the site.
from bs4 import BeautifulSoup
import re
import json
from subprocess import check_output
import requests
import time
from tqdm import tqdm  # tqdm just implements a progress bar, https://pypi.org/project/tqdm/

cars = []  # empty list to which we will append the car dicts from the json data

url = 'https://www.cardekho.com/used-cars+in+bangalore'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content.decode('utf-8'), "html.parser")

# Find the section with the json data: look for a script tag with the
# application/ld+json type and take the next tag, which is the one with
# the data we need (see the page source code).
s = soup.find('script', {"type": "application/ld+json"}).next_sibling

# Strip the text of unnecessary strings and prepare the json for loading, taken from:
# https://stackoverflow.com/questions/54991571/extract-json-from-html-script-tag-with-beautifulsoup-in-python/54992015#54992015
js = 'window = {};\n' + s.text.strip() + ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'

with open('temp.js', 'w') as f:  # save the string to a javascript file
    f.write(js)

# Execute the file with node, which returns the json data that is then loaded with json.loads.
data_site = json.loads(check_output(['node', 'temp.js']))

for i in data_site['items']:  # iterate over the dict and append all cars to the list 'cars'
    cars.append(i)

# 'pagefrom' in the api call is 20, 40, 60, etc., so create a range and loop over it
for page in tqdm(range(20, data_site['total_count'], 20)):
    r = requests.get(f"https://www.cardekho.com/api/v1/usedcar/search?&cityId=105&connectoid=&lang_code=en&regionId=0&searchstring=used-cars%2Bin%2Bbangalore&pagefrom={page}&sortby=updated_date&sortorder=asc&mink=0&maxk=200000&dealer_id=&regCityNames=&regStateNames=", headers={'User-Agent': 'Mozilla/5.0'})
    data = r.json()
    for i in data['data']['cars']:  # append all cars from this page as well
        cars.append(i)
    time.sleep(5)  # wait a few seconds to avoid overloading the site
This will result in cars being a list of dictionaries. The car names can be found in the vid key, and the urls are present in the vlink key.
You can load it into a pandas dataframe to explore the data:
import pandas as pd
df = pd.DataFrame(cars)
df.head() will output (I omitted the images column for readability):
Columns: loc | myear | bt | ft | km | it | pi | pn | pu | dvn | ic | ucid | sid | ip | oem | model | vid | city | vlink | p_numeric | webp_image | position | pageNo | centralVariantId | isExpiredModel | modelId | isGenuine | is_ftc | seller_location | utype | views | tmGaadiStore | cls
0 | Koramangala | 2014 | SUV | Diesel | 30,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2206305_1614944913.jpg | 9.9 Lakh | Mahindra XUV500 W6 2WD | 13 | 3019084 | 9509A09F1673FE2566DF59EC54AAC05B | 1 | Mahindra | Mahindra XUV500 | Mahindra XUV500 2011-2015 W6 2WD | Bangalore | /used-car-details/used-Mahindra-XUV500-2011-2015-W6-2WD-cars-Bangalore_9509A09F1673FE2566DF59EC54AAC05B.htm | 990000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2206305_1614944913.webp | 1 | 1 | 3822 | True | 570 | 0 | 0 | {'address': 'BDA Complex, 100 Feet Rd, 3rd Block, Koramangala 3 Block, Koramangala, Bengaluru, Karnataka 560034, Bangalore', 'lat': 12.931, 'lng': 77.6228} | Dealer | 235 | False
1 | Marathahalli Colony | 2017 | SUV | Petrol | 30,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2203506_1614754307.jpeg | 7.85 Lakh | Ford Ecosport 1.5 Petrol Trend BSIV | 14 | 3015331 | 2C0E4C4E543D4792C1C3186B361F718B | 1 | Ford | Ford Ecosport | Ford Ecosport 2015-2021 1.5 Petrol Trend BSIV | Bangalore | /used-car-details/used-Ford-Ecosport-2015-2021-1.5-Petrol-Trend-BSIV-cars-Bangalore_2C0E4C4E543D4792C1C3186B361F718B.htm | 785000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2203506_1614754307.webp | 2 | 1 | 6086 | True | 175 | 0 | 0 | {'address': '2, Varthur Rd, Ayyappa Layout, Chandra Layout, Marathahalli, Bengaluru, Karnataka 560037, Marathahalli Colony, Bangalore', 'lat': 12.956727624875453, 'lng': 77.70174980163576} | Dealer | 495 | False
2 | Yelahanka | 2020 | SUV | Diesel | 13,969 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/usedcar_11_276591614316705_1614316747.jpg | 41 Lakh | Toyota Fortuner 2.8 4WD AT | 12 | 3007934 | BBC13FB62DF6840097AA62DDEA05BB04 | 1 | Toyota | Toyota Fortuner | Toyota Fortuner 2016-2021 2.8 4WD AT | Bangalore | /used-car-details/used-Toyota-Fortuner-2016-2021-2.8-4WD-AT-cars-Bangalore_BBC13FB62DF6840097AA62DDEA05BB04.htm | 4100000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/usedcar_11_276591614316705_1614316747.webp | 3 | 1 | 7618 | True | 364 | 0 | 0 | {'address': 'Sonnappanahalli Kempegowda Intl Airport Road Jala Uttarahalli Hobli, Yelahanka, Bangalore, Karnataka 560064', 'lat': 13.1518821, 'lng': 77.6220694} | Dealer | 516 | False
3 | Byatarayanapura | 2017 | Sedans | Diesel | 18,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2202297_1615013237.jpg | 35 Lakh | Mercedes-Benz E-Class E250 CDI Avantgarde | 15 | 3013606 | 4553943A967049D873712AFFA5F65A56 | 1 | Mercedes-Benz | Mercedes-Benz E-Class | Mercedes-Benz E-Class 2009-2012 E250 CDI Avantgarde | Bangalore | /used-car-details/used-Mercedes-Benz-E-Class-2009-2012-E250-CDI-Avantgarde-cars-Bangalore_4553943A967049D873712AFFA5F65A56.htm | 3500000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2202297_1615013237.webp | 4 | 1 | 4611 | True | 674 | 0 | 0 | {'address': 'NO 19, Near Traffic Signal, Byatanarayanapura, International Airport Road, Byatarayanapura, Bangalore, Karnataka 560085', 'lat': 13.0669588, 'lng': 77.5928756} | Dealer | 414 | False
4 | nan | 2015 | Sedans | Diesel | 80,000 | 0 | https://stimg.cardekho.com/pwa/img/noimage.svg | 12.5 Lakh | Skoda Octavia Elegance 2.0 TDI AT | 1 | 3002709 | 156E5F2317C0A3A3BF8C03FFC35D404C | 1 | Skoda | Skoda Octavia | Skoda Octavia 2013-2017 Elegance 2.0 TDI AT | Bangalore | /used-car-details/used-Skoda-Octavia-2013-2017-Elegance-2.0-TDI-AT-cars-Bangalore_156E5F2317C0A3A3BF8C03FFC35D404C.htm | 1250000 | 5 | 1 | 3092 | True | 947 | 0 | 0 | {'lat': 0, 'lng': 0} | Individual | 332 | False
Or if you wish to explode the dict in seller_location to columns, you can load it with df = pd.json_normalize(cars).
You can save all data to a csv file: df.to_csv('output.csv')
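And since the original goal was just the title and link for each car, here is a quick sketch using the vid and vlink keys mentioned above (prefixing the relative links with the cardekho.com domain is my assumption):
listings = df[['vid', 'vlink']].rename(columns={'vid': 'title', 'vlink': 'link'})
listings['link'] = 'https://www.cardekho.com' + listings['link']
print(listings.head())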

Beautifulsoup doesn't return the whole html seen in inspect

I'm trying to parse the html of a live sport results website, but my code doesn't return every span tag there is on the site. I saw under inspect that all the matches are in span tags, but my code can't seem to find anything from the website apart from the footer or header. I also tried with the divs, and those didn't work either. I'm new to this and kind of lost; this is my code, could someone help me?
I left the first part of the for loop for more clarity.
#Creating the urls for the different dates
my_url='https://www.livescore.com/en/football/{}'.format(d1)
print(my_url)
today=date.today()-timedelta(days=i)
d1 = today.strftime("%Y-%m-%d/")
#Opening up the connection and grabbing the html
uClient=uReq(my_url)
page_html=uClient.read()
uClient.close()
#HTML parser
page_soup=soup(page_html,"html.parser")
spans=page_soup.findAll("span")
matches=page_soup.findAll("div", {"class":"LiveRow-w0tngo-0 styled__Root-sc-2sc0sh-0 styled__FootballRoot-sc-2sc0sh-4 eAwOMF"})
print(spans)
The page is dynamic and rendered by JS. When you do a request, you are getting the static html response before it's rendered. There are a few things you could do to work with this situation:
Use something like Selenium, which simulates browser operations. It'll open a browser, go to the site, and allow the site to render the page. Once the page is rendered, you THEN can get the html of that page, which will have the data. It works, but takes longer to process since it literally simulates the process as you would do it manually.
Use the requests-HTML package, which also allows the page to be rendered (I have not tried this package before as it conflicts with my IDE, Spyder). This would be similar to Selenium, without the browser actually opening. It's essentially the requests package, but with javascript support.
See if the data (in the static html response) is embedded in the <script> tags in json format. Sometimes you'll find it there, but it takes a little work to pull it out and conform/manipulate it into a valid json format that can be read in using json.loads().
Find if there is an api of some sort (checking XHR) and fetch the data directly from there.
The best option is always #4 if it's available. Why? Because the data will be consistently structured. Even if the website changes its structure or its css changes (which would change the html you parse), the underlying data feeding into it will rarely change structure. This site does have an api to access the data:
import requests
import datetime

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'}

dates_list = ['20210214', '20210215', '20210216']
for dateStr in dates_list:
    url = f'https://prod-public-api.livescore.com/v1/api/react/date/soccer/{dateStr}/0.00'
    dateStr_alpha = datetime.datetime.strptime(dateStr, '%Y%m%d').strftime('%B %d')

    response = requests.get(url, headers=headers).json()
    stages = response['Stages']
    for stage in stages:
        location = stage['Cnm']
        stageName = stage['Snm']
        events = stage['Events']

        print('\n\n%s - %s\t%s' % (location, stageName, dateStr_alpha))
        print('*' * 50)
        for event in events:
            outcome = event['Eps']
            team1Name = event['T1'][0]['Nm']
            if 'Tr1' in event.keys():
                team1Goals = event['Tr1']
            else:
                team1Goals = '?'
            team2Name = event['T2'][0]['Nm']
            if 'Tr2' in event.keys():
                team2Goals = event['Tr2']
            else:
                team2Goals = '?'
            print('%s\t%s %s - %s %s' % (outcome, team1Name, team1Goals, team2Name, team2Goals))
Output:
England - Premier League February 15
********************************************************************************
FT West Ham United 3 - Sheffield United 0
FT Chelsea 2 - Newcastle United 0
Spain - LaLiga Santander February 15
********************************************************************************
FT Cadiz 0 - Athletic Bilbao 4
Germany - Bundesliga February 15
********************************************************************************
FT Bayern Munich 3 - Arminia Bielefeld 3
Italy - Serie A February 15
********************************************************************************
FT Hellas Verona 2 - Parma Calcio 1913 1
Portugal - Primeira Liga February 15
********************************************************************************
FT Sporting CP 2 - Pacos de Ferreira 0
Belgium - Jupiler League February 15
********************************************************************************
FT Gent 4 - Royal Excel Mouscron 0
Belgium - First Division B February 15
********************************************************************************
FT Westerlo 1 - Lommel 1
Turkey - Super Lig February 15
********************************************************************************
FT Genclerbirligi 0 - Besiktas 3
FT Antalyaspor 1 - Yeni Malatyaspor 1
Brazil - Serie A February 15
********************************************************************************
FT Gremio 1 - Sao Paulo 2
FT Ceara 1 - Fluminense 3
FT Sport Recife 0 - Bragantino 0
Italy - Serie B February 15
********************************************************************************
FT Cosenza 2 - Reggina 2
France - Ligue 2 February 15
********************************************************************************
FT Sochaux 2 - Valenciennes 0
FT Toulouse 3 - AC Ajaccio 0
Spain - LaLiga Smartbank February 15
********************************************************************************
FT Castellon 1 - Fuenlabrada 2
FT Real Oviedo 3 - Lugo 1
...
Uganda - Super League February 16
********************************************************************************
FT Busoga United FC 1 - Bright Stars FC 1
FT Kitara FC 0 - Mbarara City 1
FT Kyetume 2 - Vipers SC 2
FT UPDF FC 0 - Onduparaka FC 1
FT Uganda Police 2 - BUL FC 0
Uruguay - Primera División: Clausura February 16
********************************************************************************
FT Boston River 0 - Montevideo City Torque 3
International - Friendlies Women February 16
********************************************************************************
FT Guatemala 3 - Panama 1
Africa - Africa Cup Of Nations U20: Group C February 16
********************************************************************************
FT Ghana U20 4 - Tanzania U20 0
FT Gambia U20 0 - Morocco U20 1
Brazil - Amazonense: Group A February 16
********************************************************************************
Postp. Manaus FC ? - Penarol AC AM ?
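If you'd rather end up with a table than printed lines, roughly the same loop can collect the rows into a pandas DataFrame instead; this is a sketch reusing the endpoint and fields shown above:
import datetime
import requests
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'}

rows = []
for dateStr in ['20210214', '20210215', '20210216']:
    url = f'https://prod-public-api.livescore.com/v1/api/react/date/soccer/{dateStr}/0.00'
    day = datetime.datetime.strptime(dateStr, '%Y%m%d').strftime('%B %d')
    data = requests.get(url, headers=headers).json()
    for stage in data['Stages']:
        for event in stage['Events']:
            rows.append({
                'date': day,
                'competition': '%s - %s' % (stage['Cnm'], stage['Snm']),
                'status': event['Eps'],
                'home': event['T1'][0]['Nm'],
                'home_goals': event.get('Tr1', '?'),
                'away': event['T2'][0]['Nm'],
                'away_goals': event.get('Tr2', '?'),
            })

df = pd.DataFrame(rows)
print(df.head())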
Now assuming you have the correct class to scrape, a simple loop would work:
for i in soup.find_all("div", {"class": "LiveRow-w0tngo-0 styled__Root-sc-2sc0sh-0 styled__FootballRoot-sc-2sc0sh-4 eAwOMF"}):
    print(i)
Or add it into a list:
teams = []
for i in soup.find_all("div", {"class": "LiveRow-w0tngo-0 styled__Root-sc-2sc0sh-0 styled__FootballRoot-sc-2sc0sh-4 eAwOMF"}):
    teams.append(i.text)
print(teams)
If this does not work, run some tests to see if you are actually scraping the correct things, e.g. print a single element.
Also, in your code I see that you are printing "spans" and not "matches"; this could also be a problem with your code.
You can also look at this post, which further explains how to do this.

Cannot Scrape Fantasy Table Using Python

I'm trying to scrape fantasy player data from the following site: http://www.fplstatistics.co.uk/. The table appears upon opening the site, but it's not visible when I scrape the site.
I tried the following:
import requests as rq
from bs4 import BeautifulSoup
fplStatsPage = rq.get('http://www.fplstatistics.co.uk')
fplStatsPageSoup = BeautifulSoup(fplStatsPage.text, 'html.parser')
fplStatsPageSoup
And the table was nowhere to be seen. What came back in place of where the table should be is:
<div>
The 'Player Data' is out of date.
<br/> <br/>
You need to refresh the web page.
<br/> <br/>
Press F5 or hit <i class="fa fa-refresh"></i>
</div>
This message appears on the site whenever the table is updated.
I then looked at the developer tools to see if I can find the URL from where the table data is retrieved, but I had no luck. Probably because I don't know how to read the developer tools well.
I then tried to refresh the page as the above message says using Selenium:
from selenium import webdriver
import time
chromeDriverPath = '/Users/SplitShiftKing/Downloads/Software/chromedriver'
driver = webdriver.Chrome(chromeDriverPath)
driver.get('http://www.fplstatistics.co.uk')
driver.refresh()
#To give site enough time to refresh
time.sleep(15)
html = driver.page_source
fplStatsPageSoup = BeautifulSoup(html, 'html.parser')
fplStatsPageSoup
The output was the same as before. The table appears on the site but not in the output.
Assistance would be appreciated. I've looked at similar questions on Stack Overflow, but I haven't been able to find a solution.
Why not go straight to the source that pulls in that data? The only thing you need to work out is the column names, but this gets you all the data in one request and without using Selenium:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

s = requests.Session()
url = 'http://www.fplstatistics.co.uk/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Mobile Safari/537.36'}

response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script')
for script in scripts:
    if '"iselRow"' in script.text:
        iselRowVal = re.search('"value":(.+?)}\);}', script.text).group(1).strip()

url = 'http://www.fplstatistics.co.uk/Home/AjaxPricesFHandler'
payload = {
    'iselRow': iselRowVal,
    '_': ''}

jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData['aaData'])
Output:
print (df.head(5).to_string())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 Mustafi Arsenal D A 0.3 5.2 £5.2m 0 --- 110 -95.6 -95.6 -1 -1 Mustafi Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
1 Bellerín Arsenal D I 0.3 5.4 £5.4m 0 --- 54024 2.6 2.6 -2 -2 Bellerin Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
2 Kolasinac Arsenal D I 0.6 5.2 £5.2m 0 --- 5464 -13.9 -13.9 -2 -2 Kolasinac Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
3 Maitland-Niles Arsenal D A 2.6 4.6 £4.6m 0 --- 11924 -39.0 -39.0 -2 -2 Maitland-Niles Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
4 Sokratis Arsenal D S 1.5 4.9 £4.9m 0 --- 19709 -29.4 -29.4 -2 -2 Sokratis Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
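The column names are not part of the JSON response, so you have to assign them yourself. The names below are guesses based on the table headers shown in the other answer, not something the API provides; the remaining columns are left numeric:
df = df.rename(columns={0: 'Name', 1: 'Club', 2: 'Pos', 3: 'Status', 4: '%Owned', 5: 'Price'})
df.to_csv('fpl_prices.csv', index=False)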
By requesting driver.page_source you're cancelling out any benefit you get from Selenium: the page source does not contain the table you want. That table is updated dynamically via Javascript after the page has loaded. You need to retrieve the table using methods on your driver, rather than using BeautifulSoup. For example:
>>> from selenium import webdriver
>>> d = webdriver.Chrome()
>>> d.get('http://www.fplstatistics.co.uk')
>>> table = d.find_element_by_id('myDataTable')
>>> print('\n'.join(x.text for x in table.find_elements_by_tag_name('tr')))
Name
Club
Pos
Status
%Owned
Price
Chgs
Unlocks
Delta
Target
Kelly Crystal Palace D A 30.7 £4.3m 0 --- 0
101.0
Rico Bournemouth D A 14.6 £4.3m 0 --- 0
100.9
Baldock Sheffield Utd D A 7.1 £4.8m 0 --- 88 99.8
Rashford Man Utd F A 26.4 £9.0m 0 --- 794 98.6
Son Spurs M A 21.6 £10.0m 0 --- 833 98.5
Henderson Sheffield Utd G A 7.8 £4.7m 0 --- 860 98.4
Grealish Aston Villa M A 8.9 £6.1m 0 --- 1088 98.0
Kane Spurs F A 19.3 £10.9m 0 --- 3961 92.9
Reid West Ham D A 4.6 £3.9m 0 --- 4029 92.7
Richarlison Everton M A 7.7 £7.8m 0 --- 5405 90.3
