Iterate through an entire table with BeautifulSoup?

Trying to scrape all of the player names and fantasy info for the players listed on this site. I can find the table absolutely fine, but the trouble starts when I try to iterate over the entire table. Here's the code I've written so far:
from bs4 import BeautifulSoup
from urllib.request import urlopen

nfl = 'http://www.fantasypros.com/nfl/adp/overall.php'
html = urlopen(nfl)
soup = BeautifulSoup(html.read(), "lxml")

table = soup.find('tbody').find_next('tbody')
playername = table.find('td').find_next('td')
for row in table:
    print(playername)
Expected output:
Adrian Peterson MIN, 5
Le'Veon Bell PIT, 11
and so on and so forth for the rest of the players on the chart.
Actual output:
Adrian Peterson MIN, 5
Adrian Peterson MIN, 5
Adrian Peterson MIN, 5
and so on for over 400 iterations.
Where is my for loop going wrong?

You are finding the player name once, before the loop, and then printing that same tag on every iteration. You need to make the search in the context of each particular row:
for row in table:
    print(row.find('td').find_next('td'))
Though, I would approach the problem differently. The desired table has an id:
table = soup.find('table', id="data")
for row in table.find_all("tr")[1:]:  # skip the header row
    cells = row.find_all("td")
    print(cells[0].text, cells[1].find('a').text)
Prints:
(u'1', u'Adrian Peterson')
(u'2', u"Le'Veon Bell")
(u'3', u'Eddie Lacy')
(u'4', u'Jamaal Charles')
(u'5', u'Marshawn Lynch')
...
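To get all the way to the question's expected output ("Adrian Peterson MIN, 5"), a minimal sketch along these lines could work. The cell positions are assumptions: the second cell is taken to hold "Player Name TEAM" and the last cell the ADP value, and the page's markup may have changed since the question was asked:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.fantasypros.com/nfl/adp/overall.php'
soup = BeautifulSoup(urlopen(url).read(), 'lxml')

table = soup.find('table', id='data')
for row in table.find_all('tr')[1:]:  # skip the header row
    cells = [td.get_text(' ', strip=True) for td in row.find_all('td')]
    if not cells:  # guard against rows without <td> cells
        continue
    # Assumed layout: cells[1] = "Player Name TEAM", cells[-1] = ADP
    print('{}, {}'.format(cells[1], cells[-1]))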

Related

How to scrape wikipedia if you already have all the urls?

I have a table of information below that contains 1000 of these entries; the sum of the first column is approximately 90,000.
counts   genus species
14149    Marchantia polymorpha
9345     Physcomitrium patens
7744     Selaginella moellendorffii
5389     Picea sitchensis
For each of the 1000 entries I would like to search the Wikipedia page and extract the grouping of the organism.
For example:
I look up Marchantia polymorpha on Wikipedia and find its page. On the rightmost side of the page is the scientific classification of the species. I would like to extract the value for Kingdom, i.e. Plantae in this case (amongst other ranks).
At the end of searching and extracting I would like a table that looks like this:
counts   genus species                kingdom
14149    Marchantia polymorpha        plantae
9345     Physcomitrium patens         plantae
7744     Selaginella moellendorffii   plantae
5389     Picea sitchensis             fungi
So that I can count the total number of entries belonging to each kingdom.
Collecting the URLs won't be difficult, since the base URL is the same and the target page simply appends the genus_species name. For example, the URLs for the list above would be:
https://en.wikipedia.org/wiki/Marchantia_polymorpha
https://en.wikipedia.org/wiki/Physcomitrium_patens
https://en.wikipedia.org/wiki/Selaginella_moellendorffii
https://en.wikipedia.org/wiki/Picea_sitchensis
I am not 100% certain that a Wikipedia page exists for each genus, but the majority do based on manual searches I've done previously.
You can use this, but I am not sure if it will work on all Wikipedia sites:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Marchantia_polymorpha'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

# The classification sits in a table row:
tr = soup.find_all('tr')
for t in tr:
    if 'kingdom' in t.text.lower():
        text = t.text
        break

text.replace('\n', '').split(':')[1]
>>> 'Plantae'

# Or, since both 'Kingdom' and 'Plantae' are in a <td>:
td = soup.find_all('td')
for i, t in enumerate(td):
    if 'kingdom' in t.text.lower():
        kingdom = td[i + 1].text.rstrip()  # the cell after 'Kingdom' (i + 1) holds 'Plantae'
        print(kingdom)
>>> 'Plantae'
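To extend this to all 1000 entries, a sketch of the full loop might look like the following. The input list, the one-second delay, and the 'unknown' fallback are assumptions, and pages whose infobox differs from this layout will simply be recorded as unknown:

import time
from collections import Counter

import requests
from bs4 import BeautifulSoup

# Hypothetical input: (count, 'Genus species') pairs from the question's table.
species = [
    (14149, 'Marchantia polymorpha'),
    (9345, 'Physcomitrium patens'),
    (7744, 'Selaginella moellendorffii'),
    (5389, 'Picea sitchensis'),
]

kingdom_counts = Counter()
for count, name in species:
    url = 'https://en.wikipedia.org/wiki/' + name.replace(' ', '_')
    res = requests.get(url)
    kingdom = 'unknown'  # fallback if the page is missing or has no Kingdom row
    if res.ok:
        soup = BeautifulSoup(res.content, 'html.parser')
        td = soup.find_all('td')
        for i, t in enumerate(td):
            if 'kingdom' in t.text.lower() and i + 1 < len(td):
                kingdom = td[i + 1].text.strip()
                break
    kingdom_counts[kingdom] += count  # sum the first column per kingdom
    time.sleep(1)  # be polite to Wikipedia

print(kingdom_counts)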

If-condition is not executed in a for-loop when scraping data from kworb.net

I need to collect data on the countries where artists are streamed most frequently on Spotify. To do that, I am using this source, which contains a list of 10,000 artists.
So the aim of my code is to create a table with two columns:
artist name;
country where the artist is streamed the most.
I wrote the code below, which gets this information from each artist's personal page (here is an example for Drake). An artist's name is taken from the page title, and the country code from the table column heading that follows the column titled "Global". For some artists there is no column titled "Global", and I need to account for this condition. And here is where my problem comes in.
I am using the following if-condition:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[4].text
else:
Country = soup2.find_all('table')[0].find_all('th')[5].text
country.append(Country)
But only the first branch is executed, and the code extracts the text from the 4th column. Alternatively, I tried the reverse condition:
if "<th>Global</th>" in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[5].text
else:
Country = soup2.find_all('table')[0].find_all('th')[4].text
country.append(Country)
But the code still extracts the text from the 4th column, even though I want it to extract the 5th column when the 4th is titled "Global".
This reproducible code is run for a subset of the 10,000 artists, some of whom have a column titled "Global" (e.g. LANY) and some of whom do not (e.g. Henrique & Diego) (#391 to #395 as of June 16, 2019):
from time import sleep
from random import randint
from requests import get
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}  # not defined in the original snippet

response1 = get('https://kworb.net/spotify/artists.html', headers=headers)
soup1 = bs(response1.text, 'html.parser')
table = soup1.find_all('table')[0]
rows = table.find_all('tr')[391:396]  # selected subset of the 10,000 artists

artist = []
country = []
for row in rows:
    artist_url = row.find('a')['href']
    response2 = get('https://kworb.net/spotify/' + artist_url)
    sleep(randint(8, 15))
    soup2 = bs(response2.text, 'html.parser')
    Artist = soup2.find('title').text[:-24]
    artist.append(Artist)
    if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):  # problem suspected in this if-condition
        Country = soup2.find_all('table')[0].find_all('th')[4].text
    else:
        Country = soup2.find_all('table')[0].find_all('th')[5].text
    country.append(Country)

df = pd.DataFrame({'Artist': artist,
                   'Country': country})
print(df)
As a result, I get the following:
Artist Country
0 YNW Melly Global
1 Henrique & Diego BR
2 LANY Global
3 Parson James Global
4 ANAVITÓRIA BR
The correct output, as of June 16, 2019, should instead be:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÓRIA BR
I suspect the if-condition that sets the Country variable is wrong. I would appreciate any help with that.
You are comparing bs4 Tag objects with a string. You need to get the text from each found object first, and then compare it with the string:
replace:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
with:
# get the text of each header cell
found_options = [item.text for item in soup2.find_all('table')[0].find_all('th')]
if "Global" not in found_options:
Output:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÓRIA BR
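Applied inside the question's loop, the fix might look like this (a sketch reusing the question's variable names and the column indices from the original code):

# inside the for-loop, after soup2 has been built:
header_texts = [th.text for th in soup2.find_all('table')[0].find_all('th')]
if 'Global' not in header_texts:
    Country = header_texts[4]  # no 'Global' column: country is the 5th header
else:
    Country = header_texts[5]  # 'Global' present: country is the header after it
country.append(Country)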

scraping data from wikipedia table

I'm just trying to scrape data from a Wikipedia table into a pandas DataFrame.
I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'xml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print (Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)
df
And it returns only the borough...
Thanks
You may be overthinking the problem if you only want the script to pull one table from the page. One import, one line, no loops:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, header=0)[0]
df.head()
Postcode Borough Neighbourhood
0 M1A Not assigned Not assigned
1 M2A Not assigned Not assigned
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Harbourfront
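Note that pd.read_html returns a list of DataFrames, one per <table> element on the page, which is why the [0] index is needed; it also requires an HTML parser such as lxml or html5lib to be installed.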
You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:
import pandas
import requests
from bs4 import BeautifulSoup

website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text, 'xml')

table = soup.find('table', {'class': 'wikitable sortable'})
table_rows = table.find_all('tr')

data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])

df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()]  # to filter out bad rows
then
>>> df.head()
PostalCode Borough Neighbourhood
1 M1A Not assigned Not assigned
2 M2A Not assigned Not assigned
3 M3A North York Parkwoods
4 M4A North York Victoria Village
5 M5A Downtown Toronto Harbourfront
Basedig provides a platform to download Wikipedia tables as Excel, CSV or JSON files directly. Here is a link to the Wikipedia source: https://www.basedig.com/wikipedia/
If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you.

Scraping wikipedia table to pandas data frame

I need to scrape a Wikipedia table into a pandas DataFrame and create three columns: PostalCode, Borough, and Neighborhoods.
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Here is the code that I have used:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = [ ]
for link in links:
    Neighbourhood.append(link.get('title'))
print (Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighborhood'] = Neighbourhood
df
And it returns this:
(PostalCode, Borough, Neighborhood)
0 North York
1 Parkwoods
2 North York
3 Victoria Village
4 Downtown Toronto
5 Harbourfront (Toronto)
6 Downtown Toronto
7 Regent Park
8 North York
I can't figure out how to pick up the postcode and the neighbourhood from the wikipedia table.
Thank you
pandas allows you to do it in one line of code:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
Please provide the error message. By looking at it, first you have df['Neighbourhoods'] = Neighbourhoods, where your list has the name Neighborhoods.
You have two small errors:
df = pd.dataframe() should be df = pd.DataFrame([])
You also misspelled Neighborhoods as Neighbourhoods the second time.
You might also need to change soup = BeautifulSoup(website_url,'lxml') to soup = BeautifulSoup(website_url,'xml'), but we can't help you more without knowing your exact error message.
Instead of using
df = pd.dataframe()
df['Neighbourhoods'] = Neighbourhoods
you can use
df['Neighbourhoods'] = pd.Series(Neighbourhoods)
This would solve your error. You can add further columns the same way with pd.Series(listname), or you can pass a list of lists containing PostalCode, Borough, and Neighborhoods:
df = pd.DataFrame(list_of_lists)
It looks like you're only picking up one of the columns here:
links = My_table.findAll('a')
You should be looking for 'tr' rather than 'a', as that signifies a new row in the table.
You should then use a for loop to populate a list of lists; this code should work:
values = My_table.find_all('tr')  # 'values' was undefined in the original snippet
v = []
for tr in values:
    td = tr.find_all('td')
    row = [i.text for i in td]
    v.append(row)
df = pd.DataFrame.from_records(v)
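As a follow-up (a sketch; the column names are taken from the question, and it assumes each data row has exactly three cells), you could drop the empty record produced by the header row and name the columns:

v = [row for row in v if row]  # the header <tr> has no <td> cells, so its record is empty
df = pd.DataFrame.from_records(v, columns=['PostalCode', 'Borough', 'Neighbourhood'])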

Python scraping from website. selecting TR elements based on multiple class attrs

I am scraping from the following page: https://kenpom.com/index.php?y=2018
The page shows a list of every Division 1 college basketball team. Each row is for one team. I want to assign every team row to a variable called "teams". The problem is that after every 40 teams there are two header rows that I do not want to include. These rows are unique in that they have a class of "thead1" or "thead2". The rows that I want to grab have a class of None or "bold-bottom". So essentially I need to iterate through every tr element in that table and grab any that has a class of None or "bold-bottom". My attempt below does not work: it returns a count of 35 when it should be 353.
import requests
from bs4 import BeautifulSoup

url = 'https://kenpom.com/index.php?y=2018'
r = requests.get(url).text
soup = BeautifulSoup(r, 'lxml')
table = soup.find('table', {'id': 'ratings-table'}).tbody
teams = table.findAll('tr', attrs={'class': (None or 'bold-bottom')})
print(len(teams))
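The likely culprit is the filter itself: (None or 'bold-bottom') is evaluated by Python before BeautifulSoup ever sees it, and it reduces to just 'bold-bottom', so only the 35 bold-bottom rows are matched. A minimal sketch of one way to keep both classless rows and 'bold-bottom' rows, assuming the page structure described in the question:

import requests
from bs4 import BeautifulSoup

url = 'https://kenpom.com/index.php?y=2018'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
table = soup.find('table', {'id': 'ratings-table'}).tbody

# (None or 'bold-bottom') reduces to 'bold-bottom', which is why the
# original code only found 35 rows. Filter explicitly instead:
teams = [tr for tr in table.find_all('tr')
         if tr.get('class') is None or 'bold-bottom' in tr.get('class')]
print(len(teams))  # counts the team rows, skipping thead1/thead2

Passing a function as the class_ argument to find_all (e.g. class_=lambda c: c is None or c == 'bold-bottom') should also work, since BeautifulSoup calls it with None for tags that have no class attribute.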
