How to scrape NHL skater stats using XPath? - python

I am trying to scrape the stats for 2017/2018 NHL skaters. I have started on the code but I am running into issues parsing the data and writing it to Excel.
Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml.html import fromstring
import pandas as pd

#connect to url
url = "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"

#remove HTML comment markup
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)

#setting up excel columns
columns = ("names", "gp", "g", "s", "team")
df = pd.DataFrame(columns=columns)

#attempt at parsing data while using loop
for nhl, skater_row in enumerate(tree.xpath('//table[contains(@class,"stats_table")]/tr')):
    names = pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text
    gp = skater_row.xpath('.//td[@data-stat="games_played"]/text()')[0]
    g = skater_row.xpath('.//td[@data-stat="goals"]/text()')[0]
    s = skater_row.xpath('.//td[@data-stat="shots"]/text()')[0]
    try:
        team = skater_row.xpath('.//td[@data-stat="team_id"]/a')[0].text
    # create pandas dataframe to export data to excel
    df.loc[nhl] = (names, team, gp, g, s)

#write data to excel
writer = pd.ExcelWriter('NHL skater.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
Can someone please explain how to parse this data? Are there any tips you have to help write the XPath so I can loop through the data?
I am having trouble writing the line:
for nhl, skater_row in enumerate(tree.xpath...
How did you find the XPath? Did you use XPath Finder or XPath Helper?
Also, I ran into an error with the line:
df.loc[nhl] = (names, team, gp, g, s)
It shows an invalid syntax error for df.
I am new to web scraping and have no prior coding experience. Any help would be greatly appreciated. Thanks in advance for your time!

If you still want to stick to XPath and fetch only the required data instead of filtering the complete table, you can try the below (reusing the tree from your code):
for row in tree.xpath('//table[@id="stats"]/tbody/tr[not(@class="thead")]'):
    name = row.xpath('.//td[@data-stat="player"]')[0].text_content()
    gp = row.xpath('.//td[@data-stat="games_played"]')[0].text_content()
    g = row.xpath('.//td[@data-stat="goals"]')[0].text_content()
    s = row.xpath('.//td[@data-stat="shots"]')[0].text_content()
    team = row.xpath('.//td[@data-stat="team_id"]')[0].text_content()
    print(name, gp, g, s, team)
Output of print(name, gp, g, s, team):
Justin Abdelkader 75 13 110 DET
Pontus Aberg 53 4 70 TOT
Pontus Aberg 37 2 39 NSH
Pontus Aberg 16 2 31 EDM
Noel Acciari 60 10 66 BOS
Kenny Agostino 5 0 11 BOS
Sebastian Aho 78 29 200 CAR
...
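To round this out toward the original goal of writing to Excel, here is a minimal end-to-end sketch (my own assembly, not the answerer's exact code): it reuses the comment-stripping trick from the question, collects one tuple per row, and writes the frame to the 'NHL skater.xlsx' file named in the question.

#import modules
from urllib.request import urlopen
from lxml.html import fromstring
import pandas as pd

#connect to the url and strip the HTML comment markup that hides the table
url = "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"
content = str(urlopen(url).read())
tree = fromstring(content.replace("-->", "").replace("<!--", ""))

#helper: read one stat cell's text from a table row
def cell(row, stat):
    return row.xpath('.//td[@data-stat="%s"]' % stat)[0].text_content()

#collect one tuple per skater row, skipping the repeated header rows
rows = []
for row in tree.xpath('//table[@id="stats"]/tbody/tr[not(@class="thead")]'):
    rows.append((cell(row, "player"), cell(row, "games_played"),
                 cell(row, "goals"), cell(row, "shots"), cell(row, "team_id")))

#build the dataframe once and write it to excel
df = pd.DataFrame(rows, columns=("names", "gp", "g", "s", "team"))
df.to_excel("NHL skater.xlsx", sheet_name="Sheet1", index=False)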

IIUC, it can be done like this with BeautifulSoup and pandas read_html:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.hockey-reference.com/leagues/NHL_2018_skaters.html'
pg = requests.get(url)
bsf = BeautifulSoup(pg.content, 'html5lib')
tables = bsf.findAll('table', attrs={'id': 'stats'})
dfs = pd.read_html(tables[0].prettify())
df = dfs[0]
The resulting dataframe will have all the columns of the table; use pandas to filter out just the columns that are required.
# Filters only columns 1, 3 and 5; all required columns can be filtered similarly.
dff = df[df.columns[[1, 3, 5]]]
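If you would rather select by name than by position, inspect the parsed labels first. A sketch; the labels below are an assumption, since read_html may return a MultiIndex for this table's two header rows:

# See what read_html actually produced before selecting by name;
# hockey-reference tables often carry a two-row header.
print(df.columns)

# Hypothetical example, assuming single-level labels:
dff = df[['Player', 'GP', 'G']]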

Related

How to extract multiple tables from one PDF file using Pandas and tabula-py

Can someone help me extract multiple tables from ONE PDF file? I have 5 pages; every page has a table with the same header columns, for example:
Example of the table on every page:
student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4
I want to extract all these tables into one dataframe. First I did:
df = tabula.read_pdf(file_path, pages='all', multiple_tables=True)
But I got a messy output that looks like this:
[student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4 ,student Score Rang
Maxim 43 34
Nourah 93 5]
So I edited my code like this:
import pandas as pd
import tabula

file_path = "filePath.pdf"
# read my file page by page
df1 = tabula.read_pdf(file_path, pages=1, multiple_tables=True)
df2 = tabula.read_pdf(file_path, pages=2, multiple_tables=True)
df3 = tabula.read_pdf(file_path, pages=3, multiple_tables=True)
df4 = tabula.read_pdf(file_path, pages=4, multiple_tables=True)
df5 = tabula.read_pdf(file_path, pages=5, multiple_tables=True)
It gives me a dataframe for each table, but I don't know how to regroup them into one single dataframe, and is there any other solution to avoid repeating those lines of code?
According to the documentation of tabula, read_pdf returns a list when passed the multiple_tables=True option.
Thus, you can use pandas.concat on its output to concatenate the dataframes:
df = pd.concat(tabula.read_pdf(file_path, pages='all', multiple_tables=True))
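If you also want the combined frame renumbered 0..n-1 rather than repeating each page's index, pass ignore_index; a small variation on the line above, with the file path from the question:

import pandas as pd
import tabula

file_path = "filePath.pdf"  # path from the question
dfs = tabula.read_pdf(file_path, pages='all', multiple_tables=True)
df = pd.concat(dfs, ignore_index=True)  # renumber rows 0..n-1 across pages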

I can't correctly visualize a JSON dataframe from an API

I am currently trying to read some data from a public API. It has different output formats (JSON, CSV, TXT, among others); you just change the label in the URL (/json, /csv, /txt ...). The URLs are as follows:
https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/
https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/
...
My problem is that when trying to import it into a pandas dataframe, it doesn't read the data correctly. I am trying the following alternatives:
import pandas as pd
import requests
from pandas import json_normalize  # json_normalize needs to be imported (pandas >= 1.0)

url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/'
r = requests.get(url)
rjson = r.json()
df = json_normalize(rjson)
df['periods']
I also tried reading the data in CSV format:
import pandas as pd
import requests
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/'
collisions = pd.read_csv(url, sep='<br>')
collisions.head()
But I don't get good results; the dataframe cannot be visualized correctly, since the 'periods' column is grouped with all the values.
The output is displayed incorrectly: all the data appears as columns [screenshot omitted].
Here is an example of how the data should display: [screenshot omitted]
What alternative do you recommend trying?
Thank you in advance for your time and help!
I will be attentive to your answers, regards!
For CSV you can use StringIO from the io package. The API separates rows with <br> tags, so the code below replaces them with newlines before parsing:
In [20]: import requests
In [21]: res = requests.get("https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/")
In [22]: import pandas as pd
In [23]: import io
In [24]: df = pd.read_csv(io.StringIO(res.text.strip().replace("<br>","\n")), engine='python')
In [25]: df
Out[25]:
Mes/Año Tipo de cambio - promedio del periodo (S/ por US$) - Bancario - Promedio
0 Jul.2018 3.276595
1 Ago.2018 3.288071
2 Sep.2018 3.311325
3 Oct.2018 3.333909
4 Nov.2018 3.374675
5 Dic.2018 3.364026
6 Ene.2019 3.343864
7 Feb.2019 3.321475
8 Mar.2019 3.304690
9 Abr.2019 3.303825
10 May.2019 3.332364
11 Jun.2019 3.325650
12 Jul.2019 3.290214
13 Ago.2019 3.377560
14 Sep.2019 3.357357
15 Oct.2019 3.359762
16 Nov.2019 3.371700
17 Dic.2019 3.355190
18 Ene.2020 3.327364
19 Feb.2020 3.390350
20 Mar.2020 3.491364
21 Abr.2020 3.397500
22 May.2020 3.421150
23 Jun.2020 3.470167
Sorry, I couldn't find the link for reading JSON with multiple objects inside it. The thing is, we can't use json.load/loads for this kind of format, so we have to use raw_decode() instead.
This code should work:
import pandas as pd
import json
import urllib.request as ur
from pprint import pprint

d = json.JSONDecoder()
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/'

# reading and transforming json into a list of dictionaries
data = []
with ur.urlopen(url) as json_file:
    x = json_file.read().decode()  # decode to convert a bytes string into a normal string
    while True:
        try:
            j, n = d.raw_decode(x)
        except ValueError:
            break
        #print(j)
        data.append(j)
        x = x[n:]
#pprint(data)

# creating a list of dictionaries to convert into a dataframe
clean_list = []
for period in data[0]['periods']:
    dict_data = {
        "month_year": period['name'],
        "value": period['values'][0],
    }
    clean_list.append(dict_data)
#print(clean_list)

#pd.options.display.width = 0
df = pd.DataFrame(clean_list)
print(df)
Result:
month_year value
0 Jul.2018 3.27659523809524
1 Ago.2018 3.28807142857143
2 Sep.2018 3.311325
3 Oct.2018 3.33390909090909
4 Nov.2018 3.374675
5 Dic.2018 3.36402631578947
6 Ene.2019 3.34386363636364
7 Feb.2019 3.321475
8 Mar.2019 3.30469047619048
9 Abr.2019 3.303825
10 May.2019 3.33236363636364
11 Jun.2019 3.32565
12 Jul.2019 3.29021428571428
13 Ago.2019 3.37756
14 Sep.2019 3.35735714285714
15 Oct.2019 3.3597619047619
16 Nov.2019 3.3717
17 Dic.2019 3.35519047619048
18 Ene.2020 3.32736363636364
19 Feb.2020 3.39035
20 Mar.2020 3.49136363636364
21 Abr.2020 3.3975
22 May.2020 3.42115
23 Jun.2020 3.47016666666667
If I somehow find the link again, I'll edit my answer or add a comment.
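For reference, here is a tiny standalone illustration (with made-up data, not the API's) of what raw_decode() does in the loop above: it parses one JSON object at a time and reports where it stopped, which is what lets the loop walk a string of concatenated objects that json.loads() would reject.

import json

d = json.JSONDecoder()
s = '{"a": 1}{"b": 2}'  # two concatenated objects; json.loads(s) raises an error

objs, idx = [], 0
while idx < len(s):
    obj, end = d.raw_decode(s, idx)  # returns (parsed object, index just past it)
    objs.append(obj)
    idx = end

print(objs)  # [{'a': 1}, {'b': 2}]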

How do you extract an HTML table and add a new column with constant values from an earlier <strong> tag?

I'm attempting to extract a series of tables from an HTML document and append a new column with a constant value taken from a <strong> tag being used as a header; the idea is then to make this new three-column table a dataframe. I.e. each table would have a third column where all the row values equal either AGO, DPK, ATK, or PMS, depending on which header precedes the series of tables. Below is the code I've come up with so far. I'd be grateful for any help as I'm new to Python and HTML. Thanks a mill!
import pandas as pd
from bs4 import BeautifulSoup
from robobrowser import RoboBrowser

br = RoboBrowser()
br.open("https://oilpriceng.net/03-09-2019")
table = br.find_all('td', class_='vc_table_cell')
for element in table:
    data = element.find('span', class_='vc_table_content')
    prod_name = br.find_all('strong')
    ago = prod_name[0].text
    dpk = prod_name[1].text
    atk = prod_name[2].text
    pms = prod_name[3].text
    if br.find('strong').text == ago:
        data.append(ago.text)
    elif br.find('strong').text == dpk:
        data.append(dpk.text)
    elif br.find('strong').text == atk:
        data.append(atk.text)
    elif br.find('strong').text == pms:
        data.append(pms.text)
    print(data.text)
df = pd.DataFrame(data)
The result I'm hoping for is to go from this:
AGO
Enterprise  Price
Coy A       $0.5/L
Coy B       $0.6/L
Coy C       $0.7/L
to the new table below, as a dataframe in pandas:
Enterprise  Price   Product
Coy A       $0.5/L  AGO
Coy B       $0.6/L  AGO
Coy C       $0.7/L  AGO
and to repeat the same thing for the other tables with DPK, ATK and PMS information.
I hope I understood your question right. This script will scrape all tables found on the page into one dataframe and save it to a CSV file. (Note: the tr:has(...) selector requires BeautifulSoup 4.7+.)
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://oilpriceng.net/03-09-2019/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data, last = {'Enterprise': [], 'Price': [], 'Product': []}, ''
for tag in soup.select('h1 strong, tr:has(td.vc_table_cell)'):
    if tag.name == 'strong':
        last = tag.get_text(strip=True)
    else:
        a, b = tag.select('td')
        a, b = a.get_text(strip=True), b.get_text(strip=True)
        if a and b != 'DEPOT PRICE':
            data['Enterprise'].append(a)
            data['Price'].append(b)
            data['Product'].append(last)

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv')
Prints:
Enterprise Price Product
0 AVIDOR PH ₦190.0 AGO
1 SHORELINK AGO
2 BULK STRATEGIC PH ₦190.0 AGO
3 TSL AGO
4 MASTERS AGO
.. ... ... ...
165 CHIPET ₦132.0 PMS
166 BOND PMS
167 RAIN OIL PMS
168 MENJ ₦133.0 PMS
169 NIPCO ₦ 2,9000,000 LPG
[170 rows x 3 columns]
The resulting data.csv opened in LibreOffice (screenshot omitted).

Adding headers to a table I have scraped

I have been following an online tutorial, but rather than using the tutorial data, which comes with headers, I want to use the code below.
The problem I have is that my table has no headers, so it is using the first row as the header. How can I set defined headers of "Ride" and "Queue Time"?
Thanks
import requests
import lxml.html as lh
import pandas as pd

url = 'http://www.ridetimes.co.uk/'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

col = []
i = 0
# For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
print(col)
How about trying this:
>>> pd.DataFrame(col, columns=["Ride", "Queue Time"])
               Ride Queue Time
0  Spinball Whizzer         []
1            0 mins         []
If I am correct, then this is the answer.
Use pandas to get the table, then just assign the column names:
import pandas as pd
url='http://www.ridetimes.co.uk/'
df = pd.read_html(url)[0]
df.columns = ['Ride', 'Queue Time']
Output:
print (df)
Ride Queue Time
0 Spinball Whizzer 0 mins
1 Nemesis 5 mins
2 Oblivion 5 mins
3 Wicker Man 5 mins
4 The Smiler 10 mins
5 Rita 20 mins
6 TH13TEEN 25 mins
7 Galactica Currently Unavailable
8 Enterprise Currently Unavailable
Consider using the same source the page itself uses to update values, which returns JSON. You add a random number to the URL to prevent cached results from being served. This covers all group types, not just Thrill.
import requests
import random
import pandas as pd

i = random.randint(1, 1000000000000000000)
r = requests.get('http://ridetimes.co.uk/queue-times-new.php?r=' + str(i)).json()  # to prevent cached results being served
df = pd.DataFrame([(item['ride'], item['time']) for item in r], columns=['Ride', 'Queue Time'])
print(df)
If you want only the Thrill group, then amend to this line:
df = pd.DataFrame([(item['ride'], item['time']) for item in r if item['group'] == 'Thrill'], columns=['Ride', 'Queue Time'])
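A minor variation on the same idea, sketched with the endpoint and JSON keys from the answer above: requests can append the cache-busting parameter for you via params, which avoids the manual string concatenation:

import random
import requests
import pandas as pd

# requests builds the '?r=...' query string from params
r = requests.get('http://ridetimes.co.uk/queue-times-new.php',
                 params={'r': random.randint(1, 10**18)}).json()
df = pd.DataFrame([(item['ride'], item['time']) for item in r],
                  columns=['Ride', 'Queue Time'])
print(df)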

Scraping wikipedia table to pandas data frame

I need to scrape a Wikipedia table into a pandas data frame and create three columns: PostalCode, Borough, and Neighborhood.
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Here is the code that I have used:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url, 'lxml')
print(soup.prettify())

My_table = soup.find('table', {'class': 'wikitable sortable'})
My_table

links = My_table.findAll('a')
links

Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print(Neighbourhood)

import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighborhood'] = Neighbourhood
df
And it returns this:
(PostalCode, Borough, Neighborhood)
0 North York
1 Parkwoods
2 North York
3 Victoria Village
4 Downtown Toronto
5 Harbourfront (Toronto)
6 Downtown Toronto
7 Regent Park
8 North York
I can't figure out how to pick up the postcode and the neighbourhood from the Wikipedia table.
Thank you
pandas allows you to do it in one line of code:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
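To get the three column names the question asks for, you can then rename them; a sketch that assumes the page's table still has exactly three columns (headed Postcode / Borough / Neighbourhood):

import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df.columns = ['PostalCode', 'Borough', 'Neighborhood']  # assumes a three-column table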
Provide the error message.
By looking at it, first you have df['Neighbourhoods'] = Neighbourhoods where your list has the name Neighborhoods.
You have two small errors:
df = pd.dataframe() should be df = pd.DataFrame([])
You also misspelled Neighborhoods as Neighbourhoods the second time.
You might also need to change soup = BeautifulSoup(website_url,'lxml') to soup = BeautifulSoup(website_url,'xml'), but we can't help you more without knowing your exact error message.
Instead of using
df = pd.dataframe()
df['Neighbourhoods'] = Neighbourhoods
You can use
df['Neighbourhoods'] = pd.Series(Neighbourhoods)
This would solve your error, and you can add new columns similarly using pd.Series(listname). Or you can give a list of lists containing PostalCode, Borough, and Neighborhoods using this code:
df = pd.DataFrame(list_of_lists)
It looks like you're only picking up one of the columns here:
links = My_table.findAll('a')
You should be looking for 'tr' rather than 'a' as that signifies a new row in the table.
You should then use a for loop to populate a list of lists; this code should work:
values = My_table.findAll('tr')  # 'values' was undefined; reuse My_table from the question
v = []
for tr in values:
    td = tr.find_all('td')
    row = [i.text for i in td]
    v.append(row)
df = pd.DataFrame.from_records(v)
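To finish with the three named columns the question asks for, you could then drop the empty record produced by the header row (its cells are <th>, not <td>, so it yields an empty list) and label the rest; a sketch, assuming the table still has exactly three columns:

# The header row yields an empty list, so keep only rows with three cells
rows = [row for row in v if len(row) == 3]
df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])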
