I need to scrape a Wikipedia table into a pandas DataFrame and create three columns: PostalCode, Borough, and Neighborhood.
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Here is the code that I have used:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = [ ]
for link in links:
    Neighbourhood.append(link.get('title'))
print (Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighborhood'] = Neighbourhood
df
And it returns that:
(PostalCode, Borough, Neighborhood)
0 North York
1 Parkwoods
2 North York
3 Victoria Village
4 Downtown Toronto
5 Harbourfront (Toronto)
6 Downtown Toronto
7 Regent Park
8 North York
I can't figure out how to pick up the postcode and the neighbourhood from the wikipedia table.
Thank you
pandas allows you to do this in one line of code:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
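read_html returns a list of every table it finds on the page, so the [0] picks the first one. If you then want the exact column names from your question, a small follow-up sketch (assuming the table's headers are Postcode, Borough and Neighbourhood, as shown in the other answers here):
# rename the page's headers to the names used in the question
df = df.rename(columns={'Postcode': 'PostalCode', 'Neighbourhood': 'Neighborhood'})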
Provide the error message.
Looking at it, you first have df['Neighbourhoods'] = Neighbourhoods, but your list is named Neighborhoods.
You have two small errors:
df = pd.dataframe() should be df = pd.DataFrame([])
You also misspelled Neighborhoods as Neighbourhoods the second time.
You might also need to change soup = BeautifulSoup(website_url,'lxml') to soup = BeautifulSoup(website_url,'xml'), but we can't help you more without knowing your exact error message.
Instead of using
df = pd.dataframe()
df['Neighbourhoods'] = Neighbourhoods
You can use
df['Neighbourhoods'] = pd.Series(Neighbourhoods)
This would resolve your error. You can add new columns the same way using pd.Series(listname), or you can pass a list of lists containing PostalCode, Borough, and Neighborhoods using this code:
df = pd.DataFrame(list_of_lists)
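For example, a minimal sketch with a couple of made-up rows:
import pandas as pd

# each inner list is one row: [PostalCode, Borough, Neighborhood]
list_of_lists = [['M3A', 'North York', 'Parkwoods'],
                 ['M4A', 'North York', 'Victoria Village']]
df = pd.DataFrame(list_of_lists, columns=['PostalCode', 'Borough', 'Neighborhood'])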
It looks like you're only picking up one of the columns here:
links = My_table.findAll('a')
You should be looking for 'tr' rather than 'a' as that signifies a new row in the table.
You should then use a for loop to populate a list of lists, this code should work:
values = My_table.find_all('tr')
v = []
for tr in values:
    td = tr.find_all('td')
    row = [i.text for i in td]
    v.append(row)
df = pd.DataFrame.from_records(v)
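The header row only contains th cells, so it comes through as an empty record; a short follow-up sketch (assuming each data row has exactly the three cells Postcode, Borough and Neighbourhood) to drop it and attach column names:
v = [row for row in v if row]  # drop rows with no td cells (the header row)
df = pd.DataFrame.from_records(v, columns=['PostalCode', 'Borough', 'Neighbourhood'])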
import pandas as pd
import random
data = pd.read_csv("file.csv")
print(data)
Country City
0 German Berlin
1 France Paris
random_country = random.choice(data['Country'])
How do I get the corresponding city name in a quick way please?
Instead of using the country name you retrieved to search the dataframe again, it would be more efficient to extract the city at the same time. This can be achieved with the pandas.DataFrame.sample method:
random_entry = data.sample(1)
random_country = random_entry['Country']
random_city = random_entry['City']
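Note that random_entry['Country'] here is a one-element Series rather than a plain string; if you want scalar values, a small sketch:
random_entry = data.sample(1)                     # one random row
random_country = random_entry['Country'].iloc[0]  # plain string
random_city = random_entry['City'].iloc[0]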
Try
idx = random.choice(data.index)
random_country = data.loc[idx, 'Country']
random_city = data.loc[idx, 'City']
I need to collect data on the countries where artists are streamed most frequently on Spotify. To do that, I am using this source, which contains a list of 10,000 artists.
So the aim of my code is to create a table with two columns:
artist name;
country where the artist is streamed the most.
I wrote the code below, which gets this information from each artist's personal page (here is an example for Drake). An artist's name is taken from the title of the page, and the country code from the heading of the table column that follows the column titled "Global". For some artists, there is no column titled "Global", and I need to account for this condition. And here is where my problem comes in.
I am using the following if-condition:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[4].text
else:
Country = soup2.find_all('table')[0].find_all('th')[5].text
country.append(Country)
But only the first branch is ever executed, so the code always extracts the text from the 4th column. Alternatively, I tried the reverse condition:
if "<th>Global</th>" in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[5].text
else:
Country = soup2.find_all('table')[0].find_all('th')[4].text
country.append(Country)
But the code still extracts the text from the 4th column, even though I want it to extract from the 5th column when the 4th column is titled "Global".
This reproducible code is run for a subset of artists, some of whom have a column titled "Global" (e.g. LANY) and some of whom do not (e.g. Henrique & Diego) (#391 to #395 as of June 16, 2019):
from time import sleep
from random import randint
from requests import get
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}  # request headers; any browser-like User-Agent works

response1 = get('https://kworb.net/spotify/artists.html', headers=headers)
soup1 = bs(response1.text, 'html.parser')

table = soup1.find_all('table')[0]
rows = table.find_all('tr')[391:396]  # selected subset of the 10,000 artists

artist = []
country = []

for row in rows:
    artist_url = row.find('a')['href']
    response2 = get('https://kworb.net/spotify/' + artist_url)
    sleep(randint(8, 15))
    soup2 = bs(response2.text, 'html.parser')

    Artist = soup2.find('title').text[:-24]
    artist.append(Artist)

    if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):  # problem suspected in this if-condition
        Country = soup2.find_all('table')[0].find_all('th')[4].text
    else:
        Country = soup2.find_all('table')[0].find_all('th')[5].text
    country.append(Country)

df = pd.DataFrame({'Artist': artist,
                   'Country': country})

print(df)
As a result, I get the following:
Artist Country
0 YNW Melly Global
1 Henrique & Diego BR
2 LANY Global
3 Parson James Global
4 ANAVITÃRIA BR
While the actual output, as of June 16, 2019, should be:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR
I suspect the if-condition for the variable country is wrong. I would appreciate any help with this.
You are comparing bs4 objects with a string.
You first need to get the text from each found object and then compare it with the string:
replace:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
with:
# get text options from html
found_options = [item.text for item in soup2.find_all('table')[0].find_all('th')]
if "Global" not in found_options:
Output:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR
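Putting it together, the relevant part of the loop could look like this (a sketch based on the fix above; the indices 4 and 5 are taken from the question's code):
ths = soup2.find_all('table')[0].find_all('th')
found_options = [item.text for item in ths]  # header texts
if "Global" not in found_options:
    Country = ths[4].text  # no Global column: country heading is the 5th th
else:
    Country = ths[5].text  # Global present: country heading is the 6th th
country.append(Country)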
I'm just trying to scrape data from a Wikipedia table into a pandas DataFrame.
I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'xml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print (Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)
df
And it returns only the borough...
Thanks
You may be overthinking the problem if you only want the script to pull one table from the page. One import, one line, no loops:
import pandas as pd
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df=pd.read_html(url, header=0)[0]
df.head()
Postcode Borough Neighbourhood
0 M1A Not assigned Not assigned
1 M2A Not assigned Not assigned
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Harbourfront
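If the page ever carries more than one table and you want to be explicit about which one to pull, read_html can also match on the table's attributes; a sketch:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# select the table by its class instead of relying on it being first in the list
df = pd.read_html(url, attrs={'class': 'wikitable sortable'}, header=0)[0]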
You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:
import pandas
import requests
from bs4 import BeautifulSoup
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text,'xml')
table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()] # to filter out bad rows
then
>>> df.head()
PostalCode Borough Neighbourhood
1 M1A Not assigned Not assigned
2 M2A Not assigned Not assigned
3 M3A North York Parkwoods
4 M4A North York Victoria Village
5 M5A Downtown Toronto Harbourfront
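If you also want a clean 0-based index after filtering out that header row, a small addition:
df = df.reset_index(drop=True)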
Basedig provides a platform for downloading Wikipedia tables as Excel, CSV, or JSON files directly. Here is a link to their Wikipedia section: https://www.basedig.com/wikipedia/
If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you.
Hope this helps
I am trying to scrape the stats for 2017/2018 NHL skaters. I have started on the code, but I am running into issues parsing the data and exporting it to Excel.
Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml.html import fromstring
import pandas as pd
#connect to url
url = "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"
#remove HTML comment markup
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)
#setting up excel columns
columns = ("names", "gp", "g", "s", "team")
df = pd.DataFrame(columns=columns)
#attempt at parsing data while using loop
for nhl, skater_row in enumerate(tree.xpath('//table[contains(@class,"stats_table")]/tr')):
    names = pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text
    gp = skater_row.xpath('.//td[@data-stat="games_played"]/text()')[0]
    g = skater_row.xpath('.//td[@data-stat="goals"]/text()')[0]
    s = skater_row.xpath('.//td[@data-stat="shots"]/text()')[0]
    try:
        team = skater_row.xpath('.//td[@data-stat="team_id"]/a')[0].text
    # create pandas dataframe to export data to excel
    df.loc[nhl] = (names, team, gp, g, s)

#write data to excel
writer = pd.ExcelWriter('NHL skater.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
Can someone please explain how to parse this data? Are there any tips you have to help write the Xpath so I can loop through the data?
I am having trouble writing the line:
for nhl, skater_row in enumerate(tree.xpath...
How did you find the Xpath? Did you use Xpath Finder or Xpath Helper?
Also, I ran into an error with the line:
df.loc[nhl] = (names, team, gp, g, s)
It shows an invalid syntax error for df.
I am new to web scraping and have no prior experience coding. Any help would be greatly appreciated. Thanks in advance for your time!
If you still want to stick with XPath and get only the required data instead of filtering the complete data, you can try the below:
for row in tree.xpath('//table[@id="stats"]/tbody/tr[not(@class="thead")]'):
    name = row.xpath('.//td[@data-stat="player"]')[0].text_content()
    gp = row.xpath('.//td[@data-stat="games_played"]')[0].text_content()
    g = row.xpath('.//td[@data-stat="goals"]')[0].text_content()
    s = row.xpath('.//td[@data-stat="shots"]')[0].text_content()
    team = row.xpath('.//td[@data-stat="team_id"]')[0].text_content()
    print(name, gp, g, s, team)
Output of print(name, gp, g, s, team):
Justin Abdelkader 75 13 110 DET
Pontus Aberg 53 4 70 TOT
Pontus Aberg 37 2 39 NSH
Pontus Aberg 16 2 31 EDM
Noel Acciari 60 10 66 BOS
Kenny Agostino 5 0 11 BOS
Sebastian Aho 78 29 200 CAR
...
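To collect those values into a DataFrame and write them to Excel, as in the original code, one possible sketch (column names and file name taken from the question; this assumes openpyxl or another Excel writer is installed):
import pandas as pd

records = []
for row in tree.xpath('//table[@id="stats"]/tbody/tr[not(@class="thead")]'):
    records.append({
        'names': row.xpath('.//td[@data-stat="player"]')[0].text_content(),
        'gp': row.xpath('.//td[@data-stat="games_played"]')[0].text_content(),
        'g': row.xpath('.//td[@data-stat="goals"]')[0].text_content(),
        's': row.xpath('.//td[@data-stat="shots"]')[0].text_content(),
        'team': row.xpath('.//td[@data-stat="team_id"]')[0].text_content(),
    })

df = pd.DataFrame(records, columns=['names', 'gp', 'g', 's', 'team'])
df.to_excel('NHL skater.xlsx', sheet_name='Sheet1', index=False)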
IIUC, it can be done like this with BeautifulSoup and pandas read_html:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.hockey-reference.com/leagues/NHL_2018_skaters.html'
pg = requests.get(url)
bsf = BeautifulSoup(pg.content, 'html5lib')
tables = bsf.findAll('table', attrs={'id':'stats'})
dfs = pd.read_html(tables[0].prettify())
df = dfs[0]
The resulting dataframe will have all the columns in the table; use pandas to filter down to the columns that are required.
# Filter only columns 1, 3 and 5; all other required columns can be selected similarly.
dff = df[df.columns[[1, 3, 5]]]
I am having some trouble creating a pandas df from lists I generate while scraping data from the web. Here I am using BeautifulSoup to pull a few pieces of information about local farms from localharvest.org (farm name, city, and description). I am able to scrape the data effectively, creating a list of objects on each pass. The trouble I'm having is outputting these lists into a tabular df.
My complete code is as follows:
import requests
from bs4 import BeautifulSoup
import pandas
url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("div", {'class': 'membercell'})
fname = []
fcity = []
fdesc = []
for item in data:
    name = item.contents[1].text
    fname.append(name)
    city = item.contents[3].text
    fcity.append(city)
    desc = item.find_all("div", {'class': 'short-desc'})[0].text
    fdesc.append(desc)
df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})
print (df)
df.to_csv('farmdata.csv')
Interestingly, the print(df) call shows that all three lists have been passed to the dataframe. But the resulting .CSV output contains only a single column of values (fcity), with the fname and fdesc column labels present. Interestingly, if I do something crazy like try to force tab-delimited output with df.to_csv('farmdata.csv', sep='\t'), I get a single column with jumbled output, but it appears to at least be passing the other elements of the dataframe.
Thanks in advance for any input.
Try stripping out the newline and space characters:
import requests
from bs4 import BeautifulSoup
import pandas
url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("div", {'class': 'membercell'})
fname = []
fcity = []
fdesc = []
for item in data:
    name = item.contents[1].text.split()
    fname.append(' '.join(name))
    city = item.contents[3].text.split()
    fcity.append(' '.join(city))
    desc = item.find_all("div", {'class': 'short-desc'})[0].text.split()
    fdesc.append(' '.join(desc))
df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})
print (df)
df.to_csv('farmdata.csv')
Consider, instead of keeping separate lists of the information for each farm entity that you scrape, using a list of dictionaries, or a dict of dicts, e.g.:
[{name:farm1, city: San Jose... etc},
{name: farm2, city: Oakland...etc}]
Now you can call pandas.DataFrame.from_dict() on the list of dicts defined above.
Pandas method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html
An answer that might describe this solution in more detail: Convert Python dict into a dataframe
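A minimal sketch of that approach for this scraper (it reuses the data variable from the question; the plain DataFrame constructor also accepts a list of dicts directly):
farms = []
for item in data:
    farms.append({
        'fname': item.contents[1].text.strip(),
        'fcity': item.contents[3].text.strip(),
        'fdesc': item.find_all("div", {'class': 'short-desc'})[0].text.strip(),
    })

df = pandas.DataFrame(farms)  # one row per farm, columns from the dict keys
df.to_csv('farmdata.csv', index=False)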
It works for me:
# Strip surrounding whitespace and truncate each value to 20 characters
df['fname'] = df['fname'].str.strip().str.slice(start=0, stop=20)
df['fdesc'] = df['fdesc'].str.strip().str.slice(start=0, stop=20)
df.to_csv('farmdata.csv')
df
fcity fdesc fname
0 South Portland, ME Gromaine Farm is pro Gromaine Farm
1 Newport, ME We are a diversified Parker Family Farm
2 Unity, ME The Buckle Farm is a The Buckle Farm
3 Kenduskeag, ME Visit wiseacresfarm. Wise Acres Farm
4 Winterport, ME Winter Cove Farm is Winter Cove Farm
5 Albion, ME MISTY BROOK FARM off Misty Brook Farm
6 Dover-Foxcroft, ME We want you to becom Ripley Farm
7 Madison, ME Hide and Go Peep Far Hide and Go Peep Far
8 Etna, ME Fail Better Farm is Fail Better Farm
9 Pittsfield, ME We are a family farm Snakeroot Organic Fa
Maybe you had a lot of empty space that was misinterpreted by the default delimiter (,), and the fcity column was picked up because its values contain commas themselves, which affected the column ordering.