Adding headers to a table I have scraped - python

I have been following an online tutorial, but rather than using the tutorial's data (which comes with headers) I want to use the code below. The problem is that my table has no headers, so the first row is being used as the header. How can I set defined headers of "Ride" and "Queue Time"?
Thanks
import requests
import lxml.html as lh
import pandas as pd

url = 'http://www.ridetimes.co.uk/'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

col = []
i = 0
#For each element of the first row, store its text (header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
print(col)

How about trying this:
>>> pd.DataFrame(col, columns=["Ride", "Queue Time"])
               Ride Queue Time
0  Spinball Whizzer         []
1            0 mins         []
If I am correct, then this is the answer.

Use pandas to get the table, then just assign the column names:
import pandas as pd
url='http://www.ridetimes.co.uk/'
df = pd.read_html(url)[0]
df.columns = ['Ride', 'Queue Time']
Output:
print(df)
                Ride             Queue Time
0   Spinball Whizzer                 0 mins
1            Nemesis                 5 mins
2           Oblivion                 5 mins
3         Wicker Man                 5 mins
4         The Smiler                10 mins
5               Rita                20 mins
6           TH13TEEN                25 mins
7          Galactica  Currently Unavailable
8         Enterprise  Currently Unavailable
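If the page ever grows more than one table, read_html's match argument can pin down the right one. A sketch, where 'mins' is simply a string that happens to appear in the queue-time cells above:

import pandas as pd

url = 'http://www.ridetimes.co.uk/'
# match keeps only tables whose text contains the given string or regex
df = pd.read_html(url, match='mins')[0]
df.columns = ['Ride', 'Queue Time']
print(df.head())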

Consider using the same source the page itself uses to update its values, which returns JSON. Adding a random number to the URL prevents cached results from being served. This returns all group types, not just Thrill.
import requests
import random
import pandas as pd

i = random.randint(1, 1000000000000000000)
r = requests.get('http://ridetimes.co.uk/queue-times-new.php?r=' + str(i)).json()  #to prevent cached results being served
df = pd.DataFrame([(item['ride'], item['time']) for item in r], columns=['Ride', 'Queue Time'])
print(df)
If you want only the Thrill group, amend that line to:
df = pd.DataFrame([(item['ride'], item['time']) for item in r if item['group'] == 'Thrill'], columns=['Ride', 'Queue Time'])
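As a small variation, requests can append the cache-busting parameter itself via params, which avoids the manual string concatenation (same endpoint and JSON shape as above):

import requests
import random
import pandas as pd

# let requests build the query string; the random value still defeats caching
params = {'r': random.randint(1, 10**18)}
r = requests.get('http://ridetimes.co.uk/queue-times-new.php', params=params).json()
df = pd.DataFrame([(item['ride'], item['time']) for item in r], columns=['Ride', 'Queue Time'])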

Related

I can't correctly visualize a json dataframe from api

I am currently trying to read some data from a public API. It can serve the data in different formats (JSON, CSV, TXT, among others); you just change the label in the URL (/json, /csv, /txt, ...). The URLs are as follows:
https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/
https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/
...
My problem is that when I try to import it into a pandas DataFrame, the data isn't read correctly. I am trying the following alternatives:
import pandas as pd
import requests

url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/'
r = requests.get(url)
rjson = r.json()
df = pd.json_normalize(rjson)  # json_normalize comes from the pandas namespace
df['periods']
I also tried reading the data in CSV format:
import pandas as pd
import requests
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/'
collisions = pd.read_csv(url, sep='<br>')
collisions.head()
But I don't get good results: the dataframe isn't displayed correctly, since the 'periods' column ends up grouped with all of the values.
[Screenshots of the incorrect output and of the correctly displayed data omitted.]
What alternative do you recommend trying?
Thank you in advance for your time and help! I will be attentive to your answers, regards!
For CSV you can use StringIO from the io package:
In [20]: import requests
In [21]: res = requests.get("https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/")
In [22]: import pandas as pd
In [23]: import io
In [24]: df = pd.read_csv(io.StringIO(res.text.strip().replace("<br>","\n")), engine='python')
In [25]: df
Out[25]:
Mes/Año Tipo de cambio - promedio del periodo (S/ por US$) - Bancario - Promedio
0 Jul.2018 3.276595
1 Ago.2018 3.288071
2 Sep.2018 3.311325
3 Oct.2018 3.333909
4 Nov.2018 3.374675
5 Dic.2018 3.364026
6 Ene.2019 3.343864
7 Feb.2019 3.321475
8 Mar.2019 3.304690
9 Abr.2019 3.303825
10 May.2019 3.332364
11 Jun.2019 3.325650
12 Jul.2019 3.290214
13 Ago.2019 3.377560
14 Sep.2019 3.357357
15 Oct.2019 3.359762
16 Nov.2019 3.371700
17 Dic.2019 3.355190
18 Ene.2020 3.327364
19 Feb.2020 3.390350
20 Mar.2020 3.491364
21 Abr.2020 3.397500
22 May.2020 3.421150
23 Jun.2020 3.470167
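The verbose Spanish header can then be shortened before further work (a small usage note; the new names are arbitrary):

# rename the two columns to compact labels
df.columns = ['month_year', 'exchange_rate']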
Sorry, I couldn't find the link for reading JSON with multiple objects inside it. The thing is, we can't use json.load/loads for this kind of format, so we have to use raw_decode() instead.
This code should work:
import pandas as pd
import json
import urllib.request as ur

d = json.JSONDecoder()
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/'

#read the response and transform the json into a list of dictionaries
data = []
with ur.urlopen(url) as json_file:
    x = json_file.read().decode()  # decode to convert the bytes string into a normal string
while True:
    try:
        j, n = d.raw_decode(x)
    except ValueError:
        break
    data.append(j)
    x = x[n:]

#build a list of flat dictionaries to convert into a dataframe
clean_list = []
for period in data[0]['periods']:
    dict_data = {
        "month_year": period['name'],
        "value": period['values'][0],
    }
    clean_list.append(dict_data)

df = pd.DataFrame(clean_list)
print(df)
Result:
month_year value
0 Jul.2018 3.27659523809524
1 Ago.2018 3.28807142857143
2 Sep.2018 3.311325
3 Oct.2018 3.33390909090909
4 Nov.2018 3.374675
5 Dic.2018 3.36402631578947
6 Ene.2019 3.34386363636364
7 Feb.2019 3.321475
8 Mar.2019 3.30469047619048
9 Abr.2019 3.303825
10 May.2019 3.33236363636364
11 Jun.2019 3.32565
12 Jul.2019 3.29021428571428
13 Ago.2019 3.37756
14 Sep.2019 3.35735714285714
15 Oct.2019 3.3597619047619
16 Nov.2019 3.3717
17 Dic.2019 3.35519047619048
18 Ene.2020 3.32736363636364
19 Feb.2020 3.39035
20 Mar.2020 3.49136363636364
21 Abr.2020 3.3975
22 May.2020 3.42115
23 Jun.2020 3.47016666666667
If I somehow find the link again, I'll edit my answer.
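For comparison, once the data list above is built, the asker's original json_normalize idea also works on the nested periods records (a sketch assuming pandas >= 1.0, where json_normalize lives on the top-level pd namespace):

import pandas as pd

# flatten the 'periods' records; 'values' arrives as a one-element list per row
df = pd.json_normalize(data[0]['periods'])
df['value'] = df['values'].str[0]
df = df[['name', 'value']].rename(columns={'name': 'month_year'})
print(df)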

Unable to store output in a customized manner in dataframe

I've created a script in Python to parse some URLs and store them in a dataframe. My script can do it; however, it doesn't do it the way I expect.
I've tried with:
import requests
from bs4 import BeautifulSoup
import pandas as pd

base = 'http://opml.radiotime.com/Search.ashx?query=kroq'
linklist = []

r = requests.get(base)
soup = BeautifulSoup(r.text, "xml")
for item in soup.select("outline[type='audio'][URL]"):
    find_match = base.split("=")[-1].lower()
    if find_match in item['text'].lower():
        linklist.append(item['URL'])

df = pd.DataFrame(linklist, columns=[find_match])
print(df)
Current output:
0 http://opml.radiotime.com/Tune.ashx?id=s35105
1 http://opml.radiotime.com/Tune.ashx?id=s26581
2 http://opml.radiotime.com/Tune.ashx?id=t122458...
3 http://opml.radiotime.com/Tune.ashx?id=t132149...
4 http://opml.radiotime.com/Tune.ashx?id=t131867...
5 http://opml.radiotime.com/Tune.ashx?id=t120569...
6 http://opml.radiotime.com/Tune.ashx?id=t125126...
7 http://opml.radiotime.com/Tune.ashx?id=t131068...
8 http://cdn-cms.tunein.com/service/Audio/nostre...
9 http://cdn-cms.tunein.com/service/Audio/notcom...
Expected output (I'd like to drop the indices as well, if possible):
0 http://opml.radiotime.com/Tune.ashx?id=s35105
1 http://opml.radiotime.com/Tune.ashx?id=s26581
2 http://opml.radiotime.com/Tune.ashx?id=t122458
3 http://opml.radiotime.com/Tune.ashx?id=t132149
4 http://opml.radiotime.com/Tune.ashx?id=t131867
5 http://opml.radiotime.com/Tune.ashx?id=t120569
6 http://opml.radiotime.com/Tune.ashx?id=t125126
7 http://opml.radiotime.com/Tune.ashx?id=t131068
8 http://cdn-cms.tunein.com/service/Audio/nostre
9 http://cdn-cms.tunein.com/service/Audio/notcom
You can left-align the text. To get rid of the index, drop it when writing to CSV:
df.style.set_properties(**{'text-align': 'left'})
df.to_csv(r'Data.csv', sep=',', encoding='utf-8-sig', index=False)
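If you also want the index gone when printing to the console rather than to CSV, to_string takes the same flag (a small usage note):

# print the frame without the index column
print(df.to_string(index=False))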

How to scrape NHL skater stats using Xpath?

I am trying to scrape the stats for 2017/2018 NHL skaters. I have started on the code, but I am running into issues parsing the data and printing it to Excel.
Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml.html import fromstring
import pandas as pd

#connect to url
url = "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"

#remove HTML comment markup
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)

#setting up excel columns
columns = ("names", "gp", "g", "s", "team")
df = pd.DataFrame(columns=columns)

#attempt at parsing data while using loop
for nhl, skater_row in enumerate(tree.xpath('//table[contains(@class,"stats_table")]/tr')):
    names = pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text
    gp = skater_row.xpath('.//td[@data-stat="games_played"]/text()')[0]
    g = skater_row.xpath('.//td[@data-stat="goals"]/text()')[0]
    s = skater_row.xpath('.//td[@data-stat="shots"]/text()')[0]
    try:
        team = skater_row.xpath('.//td[@data-stat="team_id"]/a')[0].text
        # create pandas dataframe to export data to excel
        df.loc[nhl] = (names, team, gp, g, s)

#write data to excel
writer = pd.ExcelWriter('NHL skater.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
Can someone please explain how to parse this data? Do you have any tips to help write the XPath so I can loop through the data?
I am having trouble writing the line:
for nhl, skater_row in enumerate(tree.xpath...
How did you find the XPath? Did you use XPath Finder or XPath Helper?
Also, I ran into an error with the line:
df.loc[nhl] = (names, team, gp, g, s)
It shows invalid syntax for df.
I am new to web scraping and have no prior experience coding. Any help would be greatly appreciated. Thanks in advance for your time!
If you still want to stick with XPath and fetch only the required data instead of filtering the complete data afterwards, you can try the below:
for row in tree.xpath('//table[@id="stats"]/tbody/tr[not(@class="thead")]'):
    name = row.xpath('.//td[@data-stat="player"]')[0].text_content()
    gp = row.xpath('.//td[@data-stat="games_played"]')[0].text_content()
    g = row.xpath('.//td[@data-stat="goals"]')[0].text_content()
    s = row.xpath('.//td[@data-stat="shots"]')[0].text_content()
    team = row.xpath('.//td[@data-stat="team_id"]')[0].text_content()
Output of print(name, gp, g, s, team):
Justin Abdelkader 75 13 110 DET
Pontus Aberg 53 4 70 TOT
Pontus Aberg 37 2 39 NSH
Pontus Aberg 16 2 31 EDM
Noel Acciari 60 10 66 BOS
Kenny Agostino 5 0 11 BOS
Sebastian Aho 78 29 200 CAR
...
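Putting this together with the comment-stripping step from the question gives a complete sketch (table id and data-stat attributes as in the loop above; writing to Excel assumes openpyxl or xlsxwriter is installed):

from urllib.request import urlopen
from lxml.html import fromstring
import pandas as pd

url = "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"
# hockey-reference hides some tables inside HTML comments, so strip the markers
content = urlopen(url).read().decode()
tree = fromstring(content.replace("<!--", "").replace("-->", ""))

fields = ["player", "games_played", "goals", "shots", "team_id"]
rows = []
for row in tree.xpath('//table[@id="stats"]/tbody/tr[not(@class="thead")]'):
    rows.append([row.xpath('.//td[@data-stat="%s"]' % f)[0].text_content()
                 for f in fields])

df = pd.DataFrame(rows, columns=["names", "gp", "g", "s", "team"])
df.to_excel("NHL skater.xlsx", sheet_name="Sheet1", index=False)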
IIUC: it can be done like this with BeautifulSoup and pandas read_html:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.hockey-reference.com/leagues/NHL_2018_skaters.html'
pg = requests.get(url)
bsf = BeautifulSoup(pg.content, 'html5lib')
tables = bsf.findAll('table', attrs={'id': 'stats'})
dfs = pd.read_html(tables[0].prettify())
df = dfs[0]
The resulting dataframe will have all the columns in the table; use pandas to filter down to the columns that are required.
#Filter only columns 1, 3 and 5; all required columns can be selected the same way.
dff = df[df.columns[[1, 3, 5]]]
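Positional selection is brittle if the site ever reorders its columns; printing df.columns first shows exactly which labels read_html produced (their shape depends on the page's header rows):

# inspect the parsed header before picking columns by position
print(df.columns.tolist())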

Extracting many URLs in a python dataframe

I have a dataframe which contains text including one or more URL(s) :
user_id  text
1        blabla... http://amazon.com ...blabla
1        blabla... http://nasa.com ...blabla
2        blabla... https://google.com ...blabla ...https://yahoo.com ...blabla
2        blabla... https://fnac.com ...blabla ...
3        blabla....
I want to transform this dataframe into a count of URLs per user_id:
user_id  count_URL
1        2
2        3
3        0
Is there a simple way to perform this task in Python?
My code so far:
URL = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
for i in range(data.shape[0]):
    for j in range(0, 8):
        URL.iloc[i, j] = re.findall("(?P<url>https?://[^\s]+)", str(data.iloc[i]))
Thank you,
Lionel
In general, the definition of a URL is much more complex than what you have in your example. Unless you are sure you have very simple URLs, you should look up a good pattern.
import re
URLPATTERN = r'(https?://\S+)' # Lousy, but...
First, extract the URLs from each string and count them:
df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
Next, group the counts by user id:
df.groupby('user_id')['urlcount'].sum()
#user_id
#1 2
#2 3
#3 0
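As a variation, pandas' vectorized string methods can do the counting without the apply/findall detour (a sketch reusing URLPATTERN from above):

# count regex matches per row, then aggregate per user
df['urlcount'] = df['text'].str.count(URLPATTERN)
print(df.groupby('user_id')['urlcount'].sum())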
Here is another way to do it:
#read data
import pandas as pd
data = pd.read_csv("data.csv")

#divide data into URL and user_id and cast each to a pandas DataFrame
URL = pd.DataFrame(data.loc[:, "text"].values)
user_id = pd.DataFrame(data.loc[:, "user_id"].values)

#count the number of appearances of "http" in each row of the data
sub = "http"
count_URL = []
for val in URL.iterrows():
    counter = val[1][0].count(sub)
    count_URL.append(counter)

#list to DataFrame
count_URL = pd.DataFrame(count_URL)

#concatenate the two dataframes and apply the code of @DyZ to group by and count the number of urls
finalDF = pd.concat([user_id, count_URL], axis=1)
finalDF.columns = ["user_id", "urlcount"]
data = finalDF.groupby('user_id').sum()['urlcount']
print(data.head())

webscraping data and use pandas read_html to convert it to dataframe and merge the dataset together

I'm kinda new to Python and having some problems with my code. I'd appreciate any suggestions on what I'm trying to do.
import pandas as pd
from bs4 import BeautifulSoup
import requests

def trade():
    tickers = ["AAPL", "AMZN", "INTC", "MSFT", "SNAP"]
    for ticker in tickers:
        url = "http://finance.yahoo.com/quote/%s?p=%s" % (ticker, ticker)
        res = requests.get(url)
        soup = BeautifulSoup(res.content, 'lxml')
        table = soup.find_all('table')[0]
        df = pd.read_html(str(table))
        print("DF")
        print(df)
        df_string = str(df)
        print(df_string)
        print(type(df_string))
When I look at df, it displays a string like this:
[ 0 1
0 Previous Close 167.37
1 Open 169.79
2 Bid 157.23 x 300
3 Ask 157.29 x 500
4 Day's Range 169.00 - 173.09
5 52 Week Range 134.84 - 180.10
6 Volume 51124085
7 Avg. Volume 33251246]
What I want to do is store each ticker's table in a dataframe and merge them together like below. Or, if there is a way, I could turn it into a dictionary so that I can access the values more easily.
AAPL AMZN INTC MSFT SNAP
Previous Close
Open
Bid
Ask
Day's Range
Volume
Avg. Volume
For now, there are two problems I face:
1. How to turn the table into a dataframe and/or dictionary after using pd.read_html(str(table)).
2. How to store each ticker's results separately so they can eventually be merged. I know how to use a for loop to read them one by one, but I don't know how to store them that way.
I'd do it like this:
import pandas as pd
from bs4 import BeautifulSoup
import requests

def fetch(t):
    url = f'http://finance.yahoo.com/quote/{t}?p={t}'
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table')[0]
    labels, data = pd.read_html(str(table))[0].values.T
    #                                       ^
    #                   What you were missing:
    #                   pd.read_html returned a list of 1 dataframe
    return pd.Series(data, labels, name=t)
tickers = ["AAPL","AMZN", "INTC", "MSFT", "SNAP"]
df = pd.concat(map(fetch, tickers), axis=1)
AAPL AMZN INTC MSFT SNAP
Previous Close 167.37 1451.05 45.38 90.81 19.56
Open 169.79 1466.89 45.88 91.21 19.66
Bid 157.23 x 300 1,373.50 x 100 43.29 x 100 86.19 x 100 19.01 x 300
Ask 157.29 x 500 1,376.00 x 200 43.43 x 800 86.30 x 500 19.04 x 500
Day's Range 169.00 - 173.09 1,436.84 - 1,468.94 44.95 - 45.99 90.62 - 92.72 19.58 - 20.57
52 Week Range 134.84 - 180.10 833.50 - 1,498.00 33.23 - 50.85 63.62 - 96.07 11.28 - 29.44
Volume 51129225 5650685 23536349 27823161 40733751
Avg. Volume 33251246 4689059 34001746 28064627 25027667
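A quick usage example against the merged frame (labels taken from the output above):

# one metric across all tickers
print(df.loc['Previous Close'])
# one ticker's full quote table
print(df['AAPL'])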
