Web scraping table data using beautiful soup - python

I am trying to learn web scraping in Python for a project using Beautiful Soup by doing the following:
Scraping the Kansas City Chiefs active roster, pairing each player's name with the college they attended. This is the URL used: https://www.chiefs.com/team/players-roster/.
When I run the script, I get an error saying "IndexError: list index out of range".
I don't know if the classes I set are wrong. Help would be appreciated.
import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
        print(player_name, player_university)

TL;DR: Two issues to solve: (1) indexing, (2) HTML element-queries
Indexing
The Python index operator is represented by opening and closing square brackets: []. The syntax requires you to put an integer inside the brackets.
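Example, a minimal illustration with a made-up list (not from the original post):
fruits = ['apple', 'banana', 'cherry']
print(fruits[0])  # 'apple': index 0 is the first element
print(fruits[2])  # 'cherry': index 2 is the third element
print(fruits[7])  # IndexError: list index out of range - there is no 8th element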
So [7] applies indexing to the preceding iterable (all found tds) to get the element with index 7. Python indices are 0-based, so they start with 0 for the first element.
In your statement, the list of all <td> cells found with the given classes is the iterable, and indexing with [7] asks for its 8th element:
row.find_all('td', class_='sorter-lastname selected')[7]
How to avoid index errors?
Are you sure there are any td elements found in the row?
If some are found, can we guarantee that there are always at least 8?
In this case, there were apparently fewer than 8 elements.
That's why Python raised an IndexError, e.g. at line 15 of the given script:
Traceback (most recent call last):
File "<stdin>", line 15, in <module>
player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
IndexError: list index out of range
Better test on length before indexing:
import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        print(f"person row: {row}")  # debug-print helps to fix element-query
        player_name = row.find('td', class_='sorter-lastname selected"')
        cells = row.find_all('td', class_='sorter-lastname selected')
        player_university = None  # define a default to avoid NameError
        if len(cells) > 7:  # test on minimum length of 8 for index 7
            player_university = cells[7].text
        print(player_name, player_university)
Element-queries
Once the index was fixed, the queried names still returned empty results as None, None.
We need to debug (hence the print I added inside the loop) and adjust the queries:
(1) for the university-name:
If you follow RJ's answer and choose the last cell without any class condition, a negative index like -1 counts from the back, so -1 means the last cell. For this to work, the number of cells must be at least 1, i.e. greater than 0.
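A minimal sketch of that safer last-cell lookup (variable names follow the script above; the length guard is an addition to avoid empty rows):
cells = row.find_all('td')  # all cells, without any class condition
if cells:  # guard: at least one cell in the row
    player_university = cells[-1].text  # -1 selects the last cell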
(2) for the player-name:
It appears to be in the first cell (which also has a CSS class used for sorting), nested either in a link title (<a .. title="Player Name">) or in a following sibling as the inner text of span > a.
CSS selectors
You may use CSS selectors for that with bs4's select or select_one functions. Then you can select a path like td > ? > ? > a and get its title.
(Note: the ? placeholders are left as a challenging exercise for you.)
💡️ Tip: most browsers have an inspector (right-click on the element, e.g. the player-name, and choose "Inspect element"); an HTML source view opens with the element selected. Right-click again to "Copy" the element as a "CSS selector".
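As a generic sketch of the select_one pattern (the selector here is only illustrative, not the solution to the exercise above):
link = row.select_one('td a[title]')  # placeholder selector: any cell-link carrying a title attribute
if link:
    player_name = link['title']  # read the title attribute of the matched <a>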
Further Reading
About indexing, and the magic of negative numbers like [-1]:
AskPython: Indexing in Python - A Complete Beginners Guide
.. a bit further, about slicing:
Real Python: Indexing and Slicing
Research on Beautiful Soup here:
Using BeautifulSoup to extract the title of a link
Get text with BeautifulSoup CSS Selector

I couldn't find a td with class sorter-lastname selected in the source code. You basically need the last td in each row, so this would do:
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td')[-1].text
PS. scraping tables is extremely easy in pandas:
import pandas as pd

df = pd.read_html('https://www.chiefs.com/team/players-roster/')
df[0].to_csv('output.csv')
It may take a bit longer, but the output is impressive; for example, print(df[0]):
Player # Pos HT WT Age Exp College
0 Josh Pederson NaN TE 6-5 235 24 R Louisiana-Monroe
1 Brandin Dandridge NaN WR 5-10 180 25 R Missouri Western
2 Justin Watson NaN WR 6-3 215 25 4 Pennsylvania
3 Jonathan Woodard NaN DE 6-5 271 28 3 Central Arkansas
4 Andrew Billings NaN DT 6-1 311 26 5 Baylor
.. ... ... .. ... ... ... .. ...
84 James Winchester 41.0 LS 6-3 242 32 7 Oklahoma
85 Travis Kelce 87.0 TE 6-5 256 32 9 Cincinnati
86 Marcus Kemp 85.0 WR 6-4 208 26 4 Hawaii
87 Chris Jones 95.0 DT 6-6 298 27 6 Mississippi State
88 Harrison Butker 7.0 K 6-4 196 26 5 Georgia Tech
[89 rows x 8 columns]

Appending elements of a list into a multi-dimensional list

Hi, I'm doing some web scraping with NBA data in Python on this page. Some elements of basketball-reference are easy to scrape, but this one is giving me some trouble with my lack of Python knowledge.
I'm able to grab the data and the column headers I want, but I end up with two lists of data that I need to combine by their index (I think?), so that index 0 of player_injury_info lines up with index 0 of player_names, etc., which I don't know how to do.
Below I've pasted some code that you can follow along.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime, timezone, timedelta

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
soup = BeautifulSoup(html)

# this correctly gives me the 4 column headers i want (Player, Team, Update, Description)
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]

# 2 lists - player_injury_info and player_names. they need to be combined.
rows = soup.findAll('tr')
player_injury_info = [[td.getText() for td in rows[i].findAll('td')]
                      for i in range(len(rows))]
player_injury_info = player_injury_info[1:]  # removing first element bc dont need it

player_names = [[th.getText() for th in rows[i].findAll('th')]
                for i in range(len(rows))]
player_names = player_names[1:]  # removing first element bc dont need it

### joining the lists in the correct order - the part i dont know how to do
player_list = player_names.append(player_injury_info)

### this should give me the data frame i want if i can get player_injury_info into the right format.
injury_data = pd.DataFrame(player_injury_info, columns = headers)
There might be an easier way to web scrape the data into one single list / data frame? Or maybe it's fine to just join the two lists together like I'm trying to do. But if anybody was able to follow along and can offer a solution, I'd appreciate the help!
Let pandas do the parsing of the table for you.
import pandas as pd
url = "https://www.basketball-reference.com/friv/injuries.fcgi"
injury_data = pd.read_html(url)[0]
Output:
print(injury_data)
Player ... Description
0 Onyeka Okongwu ... Out (Shoulder) - The Hawks announced that Okon...
1 Jaylen Brown ... Out (Wrist) - The Celtics announced that Brown...
2 Coby White ... Out (Shoulder) - The Bulls announced that Whit...
3 Taurean Prince ... Out (Ankle) - The Cavaliers announced F Taurea...
4 Jamal Murray ... Out (Knee) - Murray is recovering from a torn ...
5 Klay Thompson ... Out (Right Achilles) - Thompson is on track to...
6 James Wiseman ... Out (Knee) - Wiseman is on track to be ready b...
7 T.J. Warren ... Out (Foot) - Warren underwent foot surgery and...
8 Serge Ibaka ... Out (Back) - The Clippers announced Serge Ibak...
9 Kawhi Leonard ... Out (Knee) - The Clippers announced Kawhi Leon...
10 Victor Oladipo ... Out (Knee) - Oladipo could be cleared for full...
11 Donte DiVincenzo ... Out (Foot) - DiVincenzo suffered a tendon inju...
12 Jarrett Culver ... Out (Ankle) - The Timberwolves announced Culve...
13 Markelle Fultz ... Out (Knee) - Fultz will miss the rest of the s...
14 Jonathan Isaac ... Out (Knee) - Isaac is making progress with his...
15 Dario Šarić ... Out (Knee) - The Suns announced that Sario has...
16 Zach Collins ... Out (Ankle) - The Blazers announced that Colli...
17 Pascal Siakam ... Out (Shoulder) - The Raptors announced Pascal ...
18 Deni Avdija ... Out (Leg) - The Wizards announced that Avdija ...
19 Thomas Bryant ... Out (Left knee) - The Wizards announced that B...
[20 rows x 4 columns]
But if you were to iterate it yourself, I'd simply get at the rows (<tr> tags), then get the player name in the <a> tag, and combine it with that row's <td> tags. Then create your dataframe from the list of those:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime, timezone, timedelta

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
soup = BeautifulSoup(html)

headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]

trs = soup.findAll('tr')[1:]
rows = []
for tr in trs:
    player_name = tr.find('a').text
    data = [player_name] + [x.text for x in tr.find_all('td')]
    rows.append(data)

injury_data = pd.DataFrame(rows, columns = headers)
I think you want this (a list of tuples), using zip:
players = ["joe", "bill"]
injuries = ["tooth-ache", "mental break"]
list(zip(players, injuries))
Result:
[('joe', 'tooth-ache'), ('bill', 'mental break')]
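Applied to the question's lists, that might look as follows (a sketch: player_names and player_injury_info are lists of lists, so each zipped pair is concatenated into one row; assumes pandas is imported as pd):
# hypothetical combination of the two parallel lists, row by row
combined = [name + info for name, info in zip(player_names, player_injury_info)]
injury_data = pd.DataFrame(combined, columns=headers)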

Python BeautifulSoup filter data while parsing a URL

I'm trying to parse these on a daily basis, before market open, and I successfully get the list. But now I want to add an additional filter for "Strong buy" and "Volume" > 5,000,000 from the underlying URL data: https://www.tradingview.com/markets/stocks-usa/market-movers-gainers/
Full code below:
import requests
from bs4 import BeautifulSoup

url = "https://www.tradingview.com/markets/stocks-usa/market-movers-gainers/"
siteinfo = requests.get(url)
i = 0
content = siteinfo.content
html = content
parsed_html = BeautifulSoup(html, features="lxml")
doneList = []
for link in parsed_html.find_all('a'):
    a = link.get('href')
    if "symbol" in str(a) and "-" in str(a):
        if i < 25:
            i += 1
        else:
            x = a.split("-")
            x = x[1].split("/")
            doneList.append(x[0])
            i += 1
print(doneList)
In this particular case, you're probably better off using pandas w/ multiple conditions and a filter:
import pandas as pd

url = 'https://www.tradingview.com/markets/stocks-usa/market-movers-gainers/'
df = pd.read_html(url)[0]

# create a helper function as a filter - it returns a series of boolean values
def filter_out(row):
    # Unnamed: 4 is the buy recommendation and the next one is volume
    if 'Strong' in row['Unnamed: 4'] and 'M' in row['Unnamed: 5']:
        # since you're using a 5M volume as condition, you have to check for its existence:
        if int(row['Unnamed: 5'].split('.')[0]) > 5:
            return True
        else:
            return False
    else:
        return False

# use the boolean values to filter the dataframe:
bulls = df.apply(filter_out, axis=1)
df[bulls]
Output (pardon the formatting):
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 M MRIN Marin Software Incorporated 7.50 96.85% 3.69 Strong Buy 263.387M 41.786M — -1.59 162.00 Technology Services
2 NTLA Intellia Therapeutics, Inc. 133.43 50.21% 44.60 Strong Buy 21.740M 6.054B — -2.46 312.00 Health Technology
3 A AUUD Auddia Inc. 5.89 43.66% 1.79 Strong Buy 36.281M 46.296M — — 11.00 Technology Services
etc. You can then change column names or do other processing.
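For instance, a quick way to give the unnamed columns readable labels (these names are guesses read off the output above, not from the site; adjust to the actual table):
df.columns = ['Ticker', 'Last', 'Chg %', 'Chg', 'Rating', 'Volume',
              'Mkt Cap', 'P/E', 'EPS', 'Employees', 'Sector']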
EDIT:
To get only the tickers of these companies use:
ticks = df[bulls]['Unnamed: 0'].to_list()
for tick in ticks:
    print(tick.split(' ')[-2])
Output:
MRIN
NTLA
AUUD
WTT
etc.

Extracting data from list in Python, after BeautifulSoup scrape, and creating Pandas table

I've been learning the basics of Python for a short while, and thought I'd go ahead and try to put something together, but appear to have hit a stumbling block (despite looking just about everywhere to see where I may be going wrong).
I'm trying to grab a table, e.g. from here: https://www.oddschecker.com/horse-racing/2020-09-10-chelmsford-city/20:30/winner
Now I realize that the table isn't set out the way a typical HTML table would be, and therefore trying to grab it with pandas wouldn't yield results. I therefore delved into BeautifulSoup to try to get a result.
It seems all the data I would need is within the class 'diff-row evTabRow bc' and therefore wrote the following:
import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.oddschecker.com/horse-racing/2020-09-10-haydock/14:00/winner')
soup = BeautifulSoup(url.content, 'lxml')
table = soup.find_all("tr", class_="diff-row evTabRow bc")
This seems to put each horse, with all the corresponding data I'd need for it, into a list. Within this list, I'd only need certain bits, i.e. "data-name" for the horse name, and "data-odig" for the current odds.
I thought there may be some way I could then extract the data from the list to build a list of lists, and then construct a data frame in Pandas, but I may be going about this all wrong.
You can access any of the <tr> attributes with the BeautifulSoup object .attrs property.
Once you have table, loop over each entry, pull out the attributes you want as a list of dicts. Then initialize a Pandas data frame with the resulting list.
horse_attrs = list()
for entry in table:
    attrs = dict(name=entry.attrs['data-bname'], dig=entry.attrs['data-best-dig'])
    horse_attrs.append(attrs)

df = pd.DataFrame(horse_attrs)
df
name dig
0 Las Farras 9999
1 Heat Miami 9999
2 Martin Beck 9999
3 Litran 9999
4 Ritmo Capanga 9999
5 Perfect Score 9999
6 Simplemente Tuyo 9999
7 Anpacai 9999
8 Colt Fast 9999
9 Cacharpari 9999
10 Don Leparc 9999
11 Curioso Seattle 9999
12 Golpe Final 9999
13 El Acosador 9999
Notes:
The url you provided didn't work for me, but this similar one did: https://www.oddschecker.com/horse-racing/palermo-arg/21:00/winner
I didn't see the exact attributes (data-name and data-odig) you mentioned, so I used ones with similar names. I don't know enough about horse racing to know if these are useful, but the method in this answer should allow you to choose any of the attributes that are available.
The data you are looking for is both in the row tag <tr> and in the cell tags <td>.
The issue is that not all of the <td>'s are useful, so you have to skip those.
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = requests.get('https://www.oddschecker.com/horse-racing/thirsk/13:00/winner')
soup = BeautifulSoup(url.content, 'lxml')
rows = soup.find_all("tr", class_="diff-row evTabRow bc")

my_data = []
for row in rows:
    horse = row.attrs['data-bname']
    for td in row:
        if td.attrs['class'][0] != 'np':
            continue  # skip
        bookie = td['data-bk']
        odds = td['data-odig']
        my_data.append(dict(
            horse = horse,
            bookie = bookie,
            odds = odds
        ))

df = pd.DataFrame(my_data)
print(df)
This will give you what you are looking for:
horse bookie odds
0 Just Frank B3 3.75
1 Just Frank SK 4.33
2 Just Frank WH 4.33
3 Just Frank EE 4.33
4 Just Frank FB 4.2
.. ... ... ...
268 Tommy R RZ 29
269 Tommy R SX 26
270 Tommy R BF 10.8
271 Tommy R MK 41
272 Tommy R MA 98
[273 rows x 3 columns]
If web scraping, you can take the approach of storing your data in variables as you extract it:
l = []
for thing in elements:
    var1 = ...  # however you extract it
    var2 = ...
    l.append({'column1_name': var1, 'column2_name': var2})

df = pd.DataFrame(l)
How you select the data out of the HTML element is up to you (consider selecting td?).

If-condition is not executed in a for-loop when scraping data from kworb.net

I need to collect data on the countries where artists are streamed most frequently on Spotify. To do that, I am using this source that contains a list of 10,000 artists.
So the aim of my code is to create a table with two columns:
artist name;
country where the artist is streamed the most.
I wrote a code (see below) that gets this information from each artist's personal page (here is an example for Drake). An artist's name is taken from the title of the page, and the country code from the table column heading that is preceded by the column titled "Global". For some artists, there is no column titled "Global", and I need to account for this condition. And here is where my problem comes in.
I am using the following if-condition:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[4].text
else:
Country = soup2.find_all('table')[0].find_all('th')[5].text
country.append(Country)
But only the first condition is ever executed, where the code extracts the text from the 4th column. Alternatively, I tried the reverse condition:
if "<th>Global</th>" in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[5].text
else:
Country = soup2.find_all('table')[0].find_all('th')[4].text
country.append(Country)
But the code still extracts the text from the 4th column, even though I want it to extract the 5th column when the 4th column is titled "Global".
This reproducible code runs on a subset of artists: some for whom there is a column titled "Global" (e.g. LANY) and some for whom there is none (e.g. Henrique & Diego) (#391 to #395 as of June 16, 2019):
from time import sleep
from random import randint
from requests import get
from bs4 import BeautifulSoup as bs
import pandas as pd

response1 = get('https://kworb.net/spotify/artists.html', headers = headers)
soup1 = bs(response1.text, 'html.parser')
table = soup1.find_all('table')[0]
rows = table.find_all('tr')[391:396]  # selected subset of the 10,000 artists

artist = []
country = []

for row in rows:
    artist_url = row.find('a')['href']
    response2 = get('https://kworb.net/spotify/' + artist_url)
    sleep(randint(8, 15))
    soup2 = bs(response2.text, 'html.parser')
    Artist = soup2.find('title').text[:-24]
    artist.append(Artist)
    if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):  # problem suspected in this if-condition
        Country = soup2.find_all('table')[0].find_all('th')[4].text
    else:
        Country = soup2.find_all('table')[0].find_all('th')[5].text
    country.append(Country)

df = pd.DataFrame({'Artist': artist,
                   'Country': country})
print(df)
As a result, I get the following:
Artist Country
0 YNW Melly Global
1 Henrique & Diego BR
2 LANY Global
3 Parson James Global
4 ANAVITÃRIA BR
While the correct output, as of June 16, 2019, should be:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR
I suspect the if-condition for the variable Country is wrong. I would appreciate any help with regard to that.
You're comparing bs4 Tag objects with a string.
You first need to get the text from each found object, then compare it with the string:
replace:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
with:
# get text options from html
found_options = [item.text for item in soup2.find_all('table')[0].find_all('th')]
if "Global" not in found_options:
Output:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR

Full HTML is not being parsed with BeautifulSoup - is this because of dynamic HTML?

I'm trying to scrape the table on this page.
I can see from the browser debugger that the table I want is there in the HTML; e.g. you can see Peptide Name:
I wrote this code to extract this table:
for i in range(1001, 1003):
    # try:
    res = requests.get("https://webs.iiitd.edu.in/raghava/antitbpdb/display.php?details=" + str(i))
    soup = BeautifulSoup(res.content, 'html.parser')
    table = soup.find_all('table')
    print(table)
But the output that is printed is:
[<table bgcolor="#DAD5BF" border="1" cellpadding="5" width="970"><tr><td align="center">\n\t This page displays user query in tabular form.\n</td></tr>\n</table>, <table width="970px"><tr><td align="center"><br/><font color="black" size="5px">1001 details</font><br/></td></tr></table>]
Can someone explain why the find_all is not finding all of the tables (and specifically the table I want) and how I can fix this?
Not sure why it's not showing.
Since it's a table too, I just went ahead and used pandas' .read_html:
import pandas as pd
url = 'https://webs.iiitd.edu.in/raghava/antitbpdb/display.php?details=antitb_1001'
tables = pd.read_html(url)
table = tables[-1]
Output:
print (table)
0 1
0 Primary information NaN
1 ID antitb_1001
2 Peptide Name Polydim-I
3 Sequence AVAGEKLWLLPHLLKMLLTPTP
4 N-terminal Modification Free
5 C-terminal Modification Free
6 Chemical Modification None
7 Linear/ Cyclic Linear
8 Length 22
9 Chirality L
10 Nature Amphipathic
11 Source Natural
12 Origin Isolated from the venom of the Neotropical was...
13 Species Mycobacterium abscessus subsp. massiliense
14 Strain Mycobacterium abscessus subsp. massiliense iso...
15 Inhibition Concentartion MIC = 60.8 μg/mL
16 In vitro/In vivo Both
17 Cell Line Peritoneal macrophages, J774 macrophages cells...
18 Inhibition Concentartion Treatment of infected macrophages with 7.6 μg...
19 Cytotoxicity Non-cytotoxic, 10% cytotoxicity on J774 cells ...
20 In vivo Model 6 to 8 weeks old BALB/c and IFN-γKO (Knockout...
21 Lethal Dose 2 mg/kg/mLW shows 90% reduction in bacterial load
22 Immune Response NaN
23 Mechanism of Action Cell wall disruption
24 Target Cell wall
25 Combination Therapy None
26 Other Activities NaN
27 Pubmed ID 26930596
28 Year of Publication 2016
29 3-D Structure View in Jmol or Download Structure
FYI (if you want to know the root cause of your issue): the target table has invalid markup:
<table class ="tab" cellpadding= "5" ... STYLE="border-spacing: 0px;border-style: line ;
<tr bgcolor="#DAD5BF"></tr>
Note that the opening tag is never closed: <table ... (it should be <table ...>), and also the ancestor is a <div> while the closing tag is </p>.
That's why BeautifulSoup doesn't recognize this as a table, and thus it's not returned by soup.find_all('table').
However, modern browsers have built-in heuristics to "fix" broken tags, so in the browser the table doesn't look "broken": a closing </div> is added to the ancestor div, while the p tag is transformed into an empty node <p></p>.
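A possible workaround (an assumption worth testing, not part of the original answer): a lenient parser such as html5lib repairs broken markup much the way browsers do, so the table may become visible to find_all:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://webs.iiitd.edu.in/raghava/antitbpdb/display.php?details=antitb_1001")
# html5lib (pip install html5lib) rebuilds the DOM the way browsers do,
# closing unclosed tags; whether it recovers this particular table is untested here
soup = BeautifulSoup(res.content, 'html5lib')
print(len(soup.find_all('table')))  # hopefully more tables than with 'html.parser'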
