Hi, I'm doing some web scraping of NBA data in Python on this page. Some elements of basketball-reference are easy to scrape, but this one is giving me some trouble with my lack of Python knowledge.
I'm able to grab the data and the column headers I want, but I end up with 2 lists of data that I need to combine by their index (I think?) so that index 0 of player_injury_info lines up with index 0 of player_names, etc., which I don't know how to do.
Below I've pasted some code that you can follow along with.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime, timezone, timedelta
url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
soup = BeautifulSoup(html)
# this correctly gives me the 4 column headers i want (Player, Team, Update, Description)
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
# 2 lists - player_injury_info and player_names - that need to be combined
rows = soup.findAll('tr')
player_injury_info = [[td.getText() for td in rows[i].findAll('td')]
                      for i in range(len(rows))]
player_injury_info = player_injury_info[1:]  # removing the first element because I don't need it
player_names = [[th.getText() for th in rows[i].findAll('th')]
                for i in range(len(rows))]
player_names = player_names[1:]  # removing the first element because I don't need it
### joining the lists in the correct order - the part I don't know how to do
player_list = player_names.append(player_injury_info)
### this should give me the data frame I want, if I can get player_injury_info into the right format
injury_data = pd.DataFrame(player_injury_info, columns = headers)
There might be an easier way to scrape the data into a single list / data frame? Or maybe it's fine to just join the 2 lists together like I'm trying to do. But if anybody was able to follow along and can offer a solution, I'd appreciate the help!
Let pandas do the parsing of the table for you:
import pandas as pd
url = "https://www.basketball-reference.com/friv/injuries.fcgi"
injury_data = pd.read_html(url)[0]
Output:
print(injury_data)
Player ... Description
0 Onyeka Okongwu ... Out (Shoulder) - The Hawks announced that Okon...
1 Jaylen Brown ... Out (Wrist) - The Celtics announced that Brown...
2 Coby White ... Out (Shoulder) - The Bulls announced that Whit...
3 Taurean Prince ... Out (Ankle) - The Cavaliers announced F Taurea...
4 Jamal Murray ... Out (Knee) - Murray is recovering from a torn ...
5 Klay Thompson ... Out (Right Achilles) - Thompson is on track to...
6 James Wiseman ... Out (Knee) - Wiseman is on track to be ready b...
7 T.J. Warren ... Out (Foot) - Warren underwent foot surgery and...
8 Serge Ibaka ... Out (Back) - The Clippers announced Serge Ibak...
9 Kawhi Leonard ... Out (Knee) - The Clippers announced Kawhi Leon...
10 Victor Oladipo ... Out (Knee) - Oladipo could be cleared for full...
11 Donte DiVincenzo ... Out (Foot) - DiVincenzo suffered a tendon inju...
12 Jarrett Culver ... Out (Ankle) - The Timberwolves announced Culve...
13 Markelle Fultz ... Out (Knee) - Fultz will miss the rest of the s...
14 Jonathan Isaac ... Out (Knee) - Isaac is making progress with his...
15 Dario Šarić ... Out (Knee) - The Suns announced that Sario has...
16 Zach Collins ... Out (Ankle) - The Blazers announced that Colli...
17 Pascal Siakam ... Out (Shoulder) - The Raptors announced Pascal ...
18 Deni Avdija ... Out (Leg) - The Wizards announced that Avdija ...
19 Thomas Bryant ... Out (Left knee) - The Wizards announced that B...
[20 rows x 4 columns]
But if you were to iterate it yourself, I'd simply get the rows (<tr> tags), then get the player name from each row's <a> tag and combine it with that row's <td> tags. Then create your dataframe from the list of those:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
trs = soup.findAll('tr')[1:]  # skip the header row
rows = []
for tr in trs:
    player_name = tr.find('a').text  # the player name lives in the row's <a> tag
    data = [player_name] + [x.text for x in tr.find_all('td')]
    rows.append(data)
injury_data = pd.DataFrame(rows, columns=headers)
I think you want this (a list of tuples), using zip:
players = ["joe", "bill"]
injuries = ["tooth-ache", "mental break"]
list(zip(players, injuries))
Result:
[('joe', 'tooth-ache'), ('bill', 'mental break')]
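Applied to your scrape, it would look something like this (a sketch, assuming player_names and player_injury_info are the row-aligned lists from your code, where each name is a one-element list and each info entry a three-element list):
import pandas as pd

# zip pairs the i-th name with the i-th injury info; concatenating each
# pair of sub-lists gives one full four-column row matching your headers
combined_rows = [name + info for name, info in zip(player_names, player_injury_info)]
injury_data = pd.DataFrame(combined_rows, columns=headers)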
I am trying to learn web scraping in Python for a project, using Beautiful Soup, by doing the following:
Scraping the Kansas City Chiefs' active roster, with each player's name and the college they attended. This is the url used: https://www.chiefs.com/team/players-roster/.
After running it, I get an error saying "IndexError: list index out of range".
I don't know if the classes I set are wrong. Help would be appreciated.
import requests
from bs4 import BeautifulSoup
url = "https://www.chiefs.com/team/players-roster/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
        print(player_name, player_university)
TL;DR: Two issues to solve: (1) indexing, (2) HTML element-queries
Indexing
The Python Index Operator is represented by opening and closing square brackets: []. The syntax, however, requires you to put a number inside the brackets.
Example:
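A plain list illustrates the operator (this example is mine, not from your script):
cells = ['a', 'b', 'c']
cells[0]    # 'a' - indices start at 0
cells[2]    # 'c' - the third and last element
cells[-1]   # 'c' - negative indices count from the end
cells[7]    # IndexError: list index out of range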
So [7] applies indexing to the preceding iterable (here: all found tds) to get the element with index 7. In Python, indices are 0-based: they start with 0 for the first element.
In your statement, the iterable is the list of all found <td> HTML elements with the specified classes, and indexing with [7] asks for the 8th element:
row.find_all('td', class_='sorter-lastname selected')[7]
How to avoid index errors?
Are you sure there are any td elements found in the row?
If some are found, can we guarantee that there are always at least 8?
In this case, there were apparently fewer than 8 elements.
That's why Python raised an IndexError, e.g. at line 15 of the given script:
Traceback (most recent call last):
File "<stdin>", line 15, in <module>
player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
IndexError: list index out of range
Better test on length before indexing:
import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        print(f"person row: {row}")  # debug-print helps to fix element-query
        player_name = row.find('td', class_='sorter-lastname selected"')
        cells = row.find_all('td', class_='sorter-lastname selected')
        player_university = None  # define a default to avoid NameError
        if len(cells) > 7:  # test on minimum length of 8 for index 7
            player_university = cells[7].text
        print(player_name, player_university)
Element-queries
Once the index was fixed, the queried names still returned empty results, as None, None.
We need to debug (hence the added print inside the loop) and adjust the queries:
(1) for the university-name:
If you follow RJ's answer and choose the last cell without any class condition, then a negative index like -1 counts from the back, i.e. here: the last cell. For that, the number of cells only needs to be greater than 0.
(2) for the player-name:
It appears to be in the first cell (also with a CSS class for sorting), nested either in a link title (<a .. title="Player Name">) or in a following sibling, as the inner text of span > a.
CSS selectors
You may use CSS selectors for that, with bs4's select or select_one functions. Then you can select a path like td > ? > ? > a and get the title.
Note: the ? placeholders are left as a challenging exercise for you.
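As a generic sketch of the select_one pattern (the selector path here is a placeholder, not the exercise's solution):
# hypothetical selector - replace the path with what your inspector shows
link = row.select_one('td a')
if link is not None:
    player_name = link.get('title') or link.get_text(strip=True)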
💡️ Tip: most browsers have an inspector: right-click on the element (e.g. the player name), choose "Inspect element", and an HTML source view opens with the element selected. Right-click again to "Copy" the element as a "CSS selector".
Further Reading
About indexing, and the magic of negative numbers like [-1]:
AskPython: Indexing in Python - A Complete Beginners Guide
And a bit further, about slicing:
Real Python: Indexing and Slicing
Research on Beautiful Soup here:
Using BeautifulSoup to extract the title of a link
Get text with BeautifulSoup CSS Selector
I couldn't find a td with class sorter-lastname selected in the source code. You basically need the last td in each row, so this would do:
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td')[-1].text
PS. Scraping tables is extremely easy in pandas:
import pandas as pd

df = pd.read_html('https://www.chiefs.com/team/players-roster/')
df[0].to_csv('output.csv')
It may take a bit longer, but the output is impressive; for example, print(df[0]):
Player # Pos HT WT Age Exp College
0 Josh Pederson NaN TE 6-5 235 24 R Louisiana-Monroe
1 Brandin Dandridge NaN WR 5-10 180 25 R Missouri Western
2 Justin Watson NaN WR 6-3 215 25 4 Pennsylvania
3 Jonathan Woodard NaN DE 6-5 271 28 3 Central Arkansas
4 Andrew Billings NaN DT 6-1 311 26 5 Baylor
.. ... ... .. ... ... ... .. ...
84 James Winchester 41.0 LS 6-3 242 32 7 Oklahoma
85 Travis Kelce 87.0 TE 6-5 256 32 9 Cincinnati
86 Marcus Kemp 85.0 WR 6-4 208 26 4 Hawaii
87 Chris Jones 95.0 DT 6-6 298 27 6 Mississippi State
88 Harrison Butker 7.0 K 6-4 196 26 5 Georgia Tech
[89 rows x 8 columns]
I've been learning the basics of Python for a short while and thought I'd try to put something together, but I appear to have hit a stumbling block (despite looking just about everywhere to see where I may be going wrong).
I'm trying to grab a table, e.g. from here: https://www.oddschecker.com/horse-racing/2020-09-10-chelmsford-city/20:30/winner
Now, I realize that the table isn't laid out the way a typical HTML table would be, so trying to grab it with Pandas wouldn't yield results. I therefore delved into BeautifulSoup to try and get a result.
It seems all the data I would need is within the class 'diff-row evTabRow bc' and therefore wrote the following:
import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.oddschecker.com/horse-racing/2020-09-10-haydock/14:00/winner')
soup = BeautifulSoup(url.content, 'lxml')
table = soup.find_all("tr", class_="diff-row evTabRow bc")
This seems to put each horse, and all the corresponding data I'd need for it, into a list. Within this list, I'd only need certain bits, i.e. "data-name" for the horse name and "data-odig" for the current odds.
I thought there might be some way I could then extract the data from the list to build a list of lists, and then construct a data frame in Pandas, but I may be going about this all wrong.
You can access any of the <tr> attributes through the tag's .attrs property.
Once you have table, loop over each entry, pull out the attributes you want as a list of dicts. Then initialize a Pandas data frame with the resulting list.
import pandas as pd

horse_attrs = list()
for entry in table:
    attrs = dict(name=entry.attrs['data-bname'], dig=entry.attrs['data-best-dig'])
    horse_attrs.append(attrs)
df = pd.DataFrame(horse_attrs)
df
name dig
0 Las Farras 9999
1 Heat Miami 9999
2 Martin Beck 9999
3 Litran 9999
4 Ritmo Capanga 9999
5 Perfect Score 9999
6 Simplemente Tuyo 9999
7 Anpacai 9999
8 Colt Fast 9999
9 Cacharpari 9999
10 Don Leparc 9999
11 Curioso Seattle 9999
12 Golpe Final 9999
13 El Acosador 9999
Notes:
The url you provided didn't work for me, but this similar one did: https://www.oddschecker.com/horse-racing/palermo-arg/21:00/winner
I didn't see the exact attributes (data-name and data-odig) you mentioned, so I used ones with similar names. I don't know enough about horse racing to know if these are useful, but the method in this answer should allow you to choose any of the attributes that are available.
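If you want to check which attributes a row actually carries before choosing (a quick sanity check, assuming table from your code is non-empty):
# .attrs is a plain dict of a tag's attributes - print one row's dict to see the available keys
print(table[0].attrs)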
The data you are looking for is both in the row tag <tr> and in the cell tags <td>.
The issue is that not all of the <td>'s are useful, so you have to skip those.
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = requests.get('https://www.oddschecker.com/horse-racing/thirsk/13:00/winner')
soup = BeautifulSoup(url.content, 'lxml')
rows = soup.find_all("tr", class_="diff-row evTabRow bc")

my_data = []
for row in rows:
    horse = row.attrs['data-bname']
    for td in row:
        if td.attrs['class'][0] != 'np':
            continue  # skip cells whose first class isn't 'np'
        bookie = td['data-bk']
        odds = td['data-odig']
        my_data.append(dict(
            horse=horse,
            bookie=bookie,
            odds=odds
        ))
df = pd.DataFrame(my_data)
print(df)
This will give you what you are looking for:
horse bookie odds
0 Just Frank B3 3.75
1 Just Frank SK 4.33
2 Just Frank WH 4.33
3 Just Frank EE 4.33
4 Just Frank FB 4.2
.. ... ... ...
268 Tommy R RZ 29
269 Tommy R SX 26
270 Tommy R BF 10.8
271 Tommy R MK 41
272 Tommy R MA 98
[273 rows x 3 columns]
When web-scraping, you can take the approach of extracting your data into variables row by row:
l = []
for thing in elements:
    var1 = ...  # however you extract it
    var2 = ...
    l.append({'column1_name': var1, 'column2_name': var2})
df = pd.DataFrame(l)
How you select the data out of the HTML element is up to you (consider selecting td?).
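For instance, pulling the first and last cell of each row might look like this (a sketch; treating elements as a list of <tr> Tags is an assumption):
import pandas as pd

l = []
for thing in elements:
    tds = thing.find_all('td')           # all cells in this row
    var1 = tds[0].get_text(strip=True)   # text of the first cell
    var2 = tds[-1].get_text(strip=True)  # text of the last cell
    l.append({'column1_name': var1, 'column2_name': var2})
df = pd.DataFrame(l)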
I'm attempting to extract a series of tables from an HTML document and append a new column holding a constant value taken from a tag being used as a header; the idea is then to turn this new three-column table into a dataframe. Below is the code I've come up with so far. I.e., each table would have a third column where all the row values equal either AGO, DPK, ATK, or PMS, depending on which header precedes that series of tables. I would be grateful for any help, as I'm new to Python and HTML. Thanks a mill!
import pandas as pd
from bs4 import BeautifulSoup
from robobrowser import RoboBrowser
br = RoboBrowser()
br.open("https://oilpriceng.net/03-09-2019")
table = br.find_all('td', class_='vc_table_cell')
for element in table:
    data = element.find('span', class_='vc_table_content')
    prod_name = br.find_all('strong')
    ago = prod_name[0].text
    dpk = prod_name[1].text
    atk = prod_name[2].text
    pms = prod_name[3].text
    if br.find('strong').text == ago:
        data.append(ago.text)
    elif br.find('strong').text == dpk:
        data.append(dpk.text)
    elif br.find('strong').text == atk:
        data.append(atk.text)
    elif br.find('strong').text == pms:
        data.append(pms.text)
    print(data.text)
df = pd.DataFrame(data)
The result I'm hoping for is to go from this:
AGO
Enterprise Price
Coy A $0.5/L
Coy B $0.6/L
Coy C $0.7/L
to the new table below, as a dataframe in Pandas:
Enterprise Price Product
Coy A $0.5/L AGO
Coy B $0.6/L AGO
Coy C $0.7/L AGO
and to repeat the same thing for other tables with DPK, ATK and PMS information
I hope I understood your question right. This script will scrape all the tables found on the page into a dataframe and save it to a CSV file:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://oilpriceng.net/03-09-2019/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data, last = {'Enterprise':[], 'Price':[], 'Product':[]}, ''
for tag in soup.select('h1 strong, tr:has(td.vc_table_cell)'):
    if tag.name == 'strong':
        last = tag.get_text(strip=True)
    else:
        a, b = tag.select('td')
        a, b = a.get_text(strip=True), b.get_text(strip=True)
        if a and b != 'DEPOT PRICE':
            data['Enterprise'].append(a)
            data['Price'].append(b)
            data['Product'].append(last)
df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv')
Prints:
Enterprise Price Product
0 AVIDOR PH ₦190.0 AGO
1 SHORELINK AGO
2 BULK STRATEGIC PH ₦190.0 AGO
3 TSL AGO
4 MASTERS AGO
.. ... ... ...
165 CHIPET ₦132.0 PMS
166 BOND PMS
167 RAIN OIL PMS
168 MENJ ₦133.0 PMS
169 NIPCO ₦ 2,9000,000 LPG
[170 rows x 3 columns]
The same data is written to data.csv (the original answer included a LibreOffice screenshot of the file, not shown here).
I'm just trying to scrape data from a Wikipedia table into a pandas dataframe.
I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'xml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print(Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)
df
And it returns only the borough...
Thanks
You may be overthinking the problem, if you only want the script to pull one table from the page. One import, one line, no loops:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, header=0)[0]
df.head()
Postcode Borough Neighbourhood
0 M1A Not assigned Not assigned
1 M2A Not assigned Not assigned
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Harbourfront
You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:
import pandas
import requests
from bs4 import BeautifulSoup
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text,'xml')
table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()] # to filter out bad rows
Then:
>>> df.head()
PostalCode Borough Neighbourhood
1 M1A Not assigned Not assigned
2 M2A Not assigned Not assigned
3 M3A North York Parkwoods
4 M4A North York Victoria Village
5 M5A Downtown Toronto Harbourfront
Basedig provides a platform to download Wikipedia tables as Excel, CSV or JSON files directly. Here is a link to the Wikipedia source: https://www.basedig.com/wikipedia/
If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you.
Hope this helps
Trying to scrape all of the player names and fantasy info for the players listed on this site. I can find the table absolutely fine, but the trouble starts when I try to iterate over the entire table. Here's the code I've written so far:
from bs4 import BeautifulSoup
from urllib.request import urlopen
nfl = 'http://www.fantasypros.com/nfl/adp/overall.php'
html = urlopen(nfl)
soup = BeautifulSoup(html.read(), "lxml")
table = soup.find('tbody').find_next('tbody')
playername = table.find('td').find_next('td')
for row in table:
    print(playername)
Expected output:
Adrian Peterson MIN, 5
Le'Veon Bell PIT, 11
and so on and so forth for the rest of the players on the chart.
Actual output:
Adrian Peterson MIN, 5
Adrian Peterson MIN, 5
Adrian Peterson MIN, 5
and so on for over 400 iterations.
Where is my for loop going wrong?
You need to make the search in the context of each particular row:
for row in table:
    print(row.find('td').find_next('td'))
Though, I would approach the problem differently. The desired table has an id:
table = soup.find('table', id="data")
for row in table.find_all("tr")[1:]:  # skipping header row
    cells = row.find_all("td")
    print(cells[0].text, cells[1].find('a').text)
Prints:
(u'1', u'Adrian Peterson')
(u'2', u"Le'Veon Bell")
(u'3', u'Eddie Lacy')
(u'4', u'Jamaal Charles')
(u'5', u'Marshawn Lynch')
...