I'm just trying to scrape data from a Wikipedia table into a pandas DataFrame.
I need to reproduce three columns: Postcode, Borough, and Neighbourhood.
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'xml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print (Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)
df
And it returns only the borough...
Thanks
You may be overthinking the problem if you only want to pull one table from the page. One import, one line, no loops:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, header=0)[0]
df.head()
Postcode Borough Neighbourhood
0 M1A Not assigned Not assigned
1 M2A Not assigned Not assigned
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Harbourfront
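If the page ever grows more tables, you can also pin down the right one with read_html's match argument (a regex tested against each table's text) instead of relying on the [0] index. A small variation, assuming the target table contains the text 'Postcode':
df = pd.read_html(url, match='Postcode', header=0)[0]  # match filters the tables before indexing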
You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:
import pandas
import requests
from bs4 import BeautifulSoup
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text,'xml')
table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()]  # filter out bad rows, e.g. the header row, which yields no <td> cells
Then:
>>> df.head()
PostalCode Borough Neighbourhood
1 M1A Not assigned Not assigned
2 M2A Not assigned Not assigned
3 M3A North York Parkwoods
4 M4A North York Victoria Village
5 M5A Downtown Toronto Harbourfront
Basedig provides a platform to download Wikipedia tables as Excel, CSV or JSON files directly. Here is a link to their Wikipedia section: https://www.basedig.com/wikipedia/
If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you.
Hope this helps
I'm attempting to extract a series of tables from an HTML document and append a new column whose constant value comes from a tag used as a header; the idea is then to turn each new three-column table into a dataframe. Below is the code I've come up with so far. That is, each table would have a third column where all the row values equal AGO, DPK, ATK, or PMS, depending on which header precedes that series of tables. I would be grateful for any help, as I'm new to Python and HTML. Thanks a mill!
import pandas as pd
from bs4 import BeautifulSoup
from robobrowser import RoboBrowser
br = RoboBrowser()
br.open("https://oilpriceng.net/03-09-2019")
table = br.find_all('td', class_='vc_table_cell')
for element in table:
    data = element.find('span', class_='vc_table_content')
    prod_name = br.find_all('strong')
    ago = prod_name[0].text
    dpk = prod_name[1].text
    atk = prod_name[2].text
    pms = prod_name[3].text
    if br.find('strong').text == ago:
        data.append(ago.text)
    elif br.find('strong').text == dpk:
        data.append(dpk.text)
    elif br.find('strong').text == atk:
        data.append(atk.text)
    elif br.find('strong').text == pms:
        data.append(pms.text)
    print(data.text)
df = pd.DataFrame(data)
The result I'm hoping for is to go from this:
AGO
Enterprise Price
Coy A $0.5/L
Coy B $0.6/L
Coy C $0.7/L
to the new table below as a dataframe in Pandas
Enterprise Price Product
Coy A $0.5/L AGO
Coy B $0.6/L AGO
Coy C $0.7/L AGO
and to repeat the same thing for other tables with DPK, ATK and PMS information
I hope I understood your question right. This script scrapes all the tables found on the page into a dataframe and saves it to a CSV file:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://oilpriceng.net/03-09-2019/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data, last = {'Enterprise':[], 'Price':[], 'Product':[]}, ''
for tag in soup.select('h1 strong, tr:has(td.vc_table_cell)'):
    if tag.name == 'strong':
        # remember the most recent product header (AGO, DPK, ATK, PMS, ...)
        last = tag.get_text(strip=True)
    else:
        a, b = tag.select('td')
        a, b = a.get_text(strip=True), b.get_text(strip=True)
        if a and b != 'DEPOT PRICE':  # skip empty cells and the header row
            data['Enterprise'].append(a)
            data['Price'].append(b)
            data['Product'].append(last)
df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv')
Prints:
Enterprise Price Product
0 AVIDOR PH ₦190.0 AGO
1 SHORELINK AGO
2 BULK STRATEGIC PH ₦190.0 AGO
3 TSL AGO
4 MASTERS AGO
.. ... ... ...
165 CHIPET ₦132.0 PMS
166 BOND PMS
167 RAIN OIL PMS
168 MENJ ₦133.0 PMS
169 NIPCO ₦ 2,9000,000 LPG
[170 rows x 3 columns]
The data.csv output (LibreOffice screenshot omitted).
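Note: the tr:has(td.vc_table_cell) selector relies on the soupsieve engine bundled with BeautifulSoup 4.7+; older versions of BeautifulSoup do not support :has() in select().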
I need to collect data on the countries where artists are streamed most frequently on Spotify. To do that, I am using this source, which contains a list of 10,000 artists.
So the aim of my code is to create a table with two columns:
artist name;
country where the artist is streamed the most.
I wrote the code below, which gets this information from each artist's personal page (here is an example for Drake). An artist's name is taken from the page title, and the country code from the table column heading that follows the column titled "Global". For some artists there is no column titled "Global", and I need to account for this condition. And here is where my problem comes in.
I am using the following if-condition:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[4].text
else:
Country = soup2.find_all('table')[0].find_all('th')[5].text
country.append(Country)
But only the first condition is executed, where the code extracts the text from the 4th column. Alternatively, I tried the reverse condition:
if "<th>Global</th>" in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[5].text
else:
Country = soup2.find_all('table')[0].find_all('th')[4].text
country.append(Country)
But the code still extracts the text from the 4th column, even though I want it to extract it from the 5th column when the 4th column is titled "Global".
This reproducible code is run for a subset of artists, some of whom have a column titled "Global" (e.g. LANY) and some of whom do not (e.g. Henrique & Diego) (#391 to #395 as of June 16, 2019):
from time import sleep
from random import randint
from requests import get
from bs4 import BeautifulSoup as bs
import pandas as pd
response1 = get('https://kworb.net/spotify/artists.html')
soup1 = bs(response1.text, 'html.parser')
table = soup1.find_all('table')[0]
rows = table.find_all('tr')[391:396]  # selected subset of the 10,000 artists
artist = []
country = []
for row in rows:
    artist_url = row.find('a')['href']
    response2 = get('https://kworb.net/spotify/' + artist_url)
    sleep(randint(8, 15))
    soup2 = bs(response2.text, 'html.parser')
    Artist = soup2.find('title').text[:-24]
    artist.append(Artist)
    if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):  # problem suspected in this if-condition
        Country = soup2.find_all('table')[0].find_all('th')[4].text
    else:
        Country = soup2.find_all('table')[0].find_all('th')[5].text
    country.append(Country)
df = pd.DataFrame({'Artist': artist,
                   'Country': country})
print(df)
As a result, I get the following:
Artist Country
0 YNW Melly Global
1 Henrique & Diego BR
2 LANY Global
3 Parson James Global
4 ANAVITÃRIA BR
While the actual output, as of June 16, 2019, should be:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR
I suspect the if-condition used to set Country is wrong. I would appreciate any help with this.
You are comparing bs4 Tag objects with a string. You first need to get the text from each found object, then compare it with the string:
replace:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
with:
# get text options from html
found_options = [item.text for item in soup2.find_all('table')[0].find_all('th')]
if "Global" not in found_options:
Output:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR
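For context, a minimal sketch of the corrected block inside your loop, using the same variables as in the question:
found_options = [item.text for item in soup2.find_all('table')[0].find_all('th')]
if "Global" not in found_options:
    Country = found_options[4]
else:
    Country = found_options[5]
country.append(Country)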
I need to scrape a Wikipedia table into a pandas DataFrame and create three columns: PostalCode, Borough, and Neighborhood.
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Here is the code that I have used:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print (Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighborhood'] = Neighbourhood
df
And it returns this:
(PostalCode, Borough, Neighborhood)
0 North York
1 Parkwoods
2 North York
3 Victoria Village
4 Downtown Toronto
5 Harbourfront (Toronto)
6 Downtown Toronto
7 Regent Park
8 North York
I can't figure out how to pick up the postcode and the neighbourhood from the Wikipedia table.
Thank you
pandas allows you to do it in one line of code:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
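Since you want the columns named PostalCode, Borough, and Neighborhood, you can rename them right after reading. A minimal follow-up, assuming the table comes back with exactly three columns:
df.columns = ['PostalCode', 'Borough', 'Neighborhood']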
Please provide the error message. From a first look, you have df['Neighbourhoods'] = Neighbourhoods, while your list is named Neighborhoods.
You have two small errors:
df = pd.dataframe() should be df = pd.DataFrame([])
You also misspelled Neighborhoods as Neighbourhoods the second time.
You might also need to change soup = BeautifulSoup(website_url,'lxml') to soup = BeautifulSoup(website_url,'xml'), but we can't help you more without knowing your exact error message.
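Putting those fixes together, the tail of your script would look something like this (a sketch, assuming the Neighbourhood list built in your scraping loop):
import pandas as pd
df = pd.DataFrame([])  # DataFrame, not dataframe
df['Neighbourhood'] = pd.Series(Neighbourhood)  # the name must match the list you actually built
df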
Instead of using
df = pd.dataframe()
df['Neighbourhoods'] = Neighbourhoods
You can use
df['Neighbourhoods'] = pd.Series(Neighbourhoods)
This should resolve your error. You can add more columns the same way with pd.Series(listname), or you can pass a list of lists containing PostalCode, Borough, and Neighborhood values using this code:
df = pd.DataFrame(list_of_lists)
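For instance, a minimal sketch of the list-of-lists route, using rows from the same Wikipedia table:
list_of_lists = [['M3A', 'North York', 'Parkwoods'],
                 ['M4A', 'North York', 'Victoria Village']]
df = pd.DataFrame(list_of_lists, columns=['PostalCode', 'Borough', 'Neighbourhood'])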
It looks like you're only picking up one of the columns here:
links = My_table.findAll('a')
You should be looking for 'tr' rather than 'a' as that signifies a new row in the table.
You should then use a for loop to populate a list of lists; something like this should work:
values = My_table.find_all('tr')  # the rows to iterate over
v = []
for tr in values:
    td = tr.find_all('td')
    row = [i.text for i in td]
    v.append(row)
df = pd.DataFrame.from_records(v)
I am having some trouble creating a pandas df from lists I generate while scraping data from the web. Here I am using beautifulsoup to pull a few pieces of information about local farms from localharvest.org (farm name, city, and description). I am able to scrape the data effectively, creating a list of objects on each pass. The trouble I'm having is outputting these lists into a tabular df.
My complete code is as follows:
import requests
from bs4 import BeautifulSoup
import pandas
url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("div", {'class': 'membercell'})
fname = []
fcity = []
fdesc = []
for item in data:
    name = item.contents[1].text
    fname.append(name)
    city = item.contents[3].text
    fcity.append(city)
    desc = item.find_all("div", {'class': 'short-desc'})[0].text
    fdesc.append(desc)
df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})
print (df)
df.to_csv('farmdata.csv')
Interestingly, print(df) shows that all three lists have been passed to the dataframe. But the resulting CSV output contains only a single column of values (fcity), with the fname and fdesc column labels present. Interestingly, if I do something crazy like force tab-delimited output with df.to_csv('farmdata.csv', sep='\t'), I get a single column with jumbled output, but it appears to at least be passing the other elements of the dataframe.
Thanks in advance for any input.
Try stripping out the newline and space characters:
import requests
from bs4 import BeautifulSoup
import pandas
url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("div", {'class': 'membercell'})
fname = []
fcity = []
fdesc = []
for item in data:
    name = item.contents[1].text.split()
    fname.append(' '.join(name))
    city = item.contents[3].text.split()
    fcity.append(' '.join(city))
    desc = item.find_all("div", {'class': 'short-desc'})[0].text.split()
    fdesc.append(' '.join(desc))
df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})
print (df)
df.to_csv('farmdata.csv')
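The .split() / ' '.join() combination collapses embedded newlines as well as the surrounding whitespace; newlines inside the scraped fields are the most likely reason the CSV looked like a single jumbled column when opened in a spreadsheet.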
Instead of keeping a separate list for each piece of information you scrape for each farm entity, consider using a list of dictionaries, e.g.:
[{'name': 'farm1', 'city': 'San Jose', ...},
 {'name': 'farm2', 'city': 'Oakland', ...}]
Now you can call pandas.DataFrame.from_dict() on the list of dicts defined above (or simply pass it to the pandas.DataFrame constructor).
Pandas method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html
An answer that might describe this solution in more detail: Convert Python dict into a dataframe
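A minimal sketch of that approach, reusing the scraping loop from the question:
farms = []  # one dict per farm
for item in data:
    farms.append({
        'fname': item.contents[1].text.strip(),
        'fcity': item.contents[3].text.strip(),
        'fdesc': item.find_all("div", {'class': 'short-desc'})[0].text.strip(),
    })
df = pandas.DataFrame(farms)  # a list of dicts can be passed straight to the constructor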
It works for me:
# Strip surrounding whitespace, then truncate each field to its first 20 characters
df['fname'] = df['fname'].str.strip().str.slice(start=0, stop=20)
df['fdesc'] = df['fdesc'].str.strip().str.slice(start=0, stop=20)
df.to_csv('farmdata.csv')
df
fcity fdesc fname
0 South Portland, ME Gromaine Farm is pro Gromaine Farm
1 Newport, ME We are a diversified Parker Family Farm
2 Unity, ME The Buckle Farm is a The Buckle Farm
3 Kenduskeag, ME Visit wiseacresfarm. Wise Acres Farm
4 Winterport, ME Winter Cove Farm is Winter Cove Farm
5 Albion, ME MISTY BROOK FARM off Misty Brook Farm
6 Dover-Foxcroft, ME We want you to becom Ripley Farm
7 Madison, ME Hide and Go Peep Far Hide and Go Peep Far
8 Etna, ME Fail Better Farm is Fail Better Farm
9 Pittsfield, ME We are a family farm Snakeroot Organic Fa
Maybe you had a lot of extra whitespace that was misinterpreted around the default delimiter (,), and the fcity values themselves contain commas, which affected the column ordering.
Trying to scrape all of the player names and fantasy info on the players listed on this site. I can find the table absolutely fine, but the trouble starts when I try and iterate over the entire table. Here's the code I've written so far:
from bs4 import BeautifulSoup
from urllib.request import urlopen
nfl = 'http://www.fantasypros.com/nfl/adp/overall.php'
html = urlopen(nfl)
soup = BeautifulSoup(html.read(), "lxml")
table = soup.find('tbody').find_next('tbody')
playername = table.find('td').find_next('td')
for row in table:
    print(playername)
Expected output:
Adrian Peterson MIN, 5
Le'Veon Bell PIT, 11
and so on and so forth for the rest of the players on the chart.
Actual output:
Adrian Peterson MIN, 5
Adrian Peterson MIN, 5
Adrian Peterson MIN, 5
and so on for over 400 iterations.
Where is my for loop going wrong?
You need to make the search in the context of a particular row:
for row in table:
    print(row.find('td').find_next('td'))
Though, I would approach the problem differently. The desired table has an id:
table = soup.find('table', id="data")
for row in table.find_all("tr")[1:]:  # skipping the header row
    cells = row.find_all("td")
    print(cells[0].text, cells[1].find('a').text)
Prints (under Python 2; Python 3 would print the plain strings without the u prefix and parentheses):
(u'1', u'Adrian Peterson')
(u'2', u"Le'Veon Bell")
(u'3', u'Eddie Lacy')
(u'4', u'Jamaal Charles')
(u'5', u'Marshawn Lynch')
...