Python: Get HTML table data by XPath

I feel that extracting data from HTML tables is extremely difficult and requires a custom build for each site. I would very much like to be proved wrong here.
Is there a simple Pythonic way to extract strings and numbers out of a website by just using the URL and the XPath of the table of interest?
Example:
url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
xpath_str = '//*[@id="sortabletable"]'
I once had a script that could fetch data from this site, but I lost it. As I recall, it was using the tag '' and some string logic; not very pretty.
I know that sites like ThingSpeak can do these things.

There is a fairly general pattern which you could use to parse many, though not
all, tables.
import lxml.html as LH
import requests
import pandas as pd

def text(elt):
    return elt.text_content().replace(u'\xa0', u' ')

url = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url)
root = LH.fromstring(r.content)

for table in root.xpath('//table[@id="sortabletable"]'):
    header = [text(th) for th in table.xpath('//th')]            # 1
    data = [[text(td) for td in tr.xpath('td')]
            for tr in table.xpath('//tr')]                       # 2
    data = [row for row in data if len(row) == len(header)]      # 3
    data = pd.DataFrame(data, columns=header)                    # 4
    print(data)
You can use table.xpath('//th') to find the column names.
table.xpath('//tr') returns the rows, and for each row, tr.xpath('td')
returns the element representing one "cell" of the table.
Sometimes you may need to filter out certain rows, such as in this case, rows
with fewer values than the header.
What you do with the data (a list of lists) is up to you. Here I use Pandas for presentation only:
Pris Adresse Tidspunkt
0 8.04 Brovejen 18 5500 Middelfart 3 min 38 sek
1 7.88 Hovedvejen 11 5500 Middelfart 4 min 52 sek
2 7.88 Assensvej 105 5500 Middelfart 5 min 56 sek
3 8.23 Ejby Industrivej 111 2600 Glostrup 6 min 28 sek
4 8.15 Park Alle 125 2605 Brøndby 25 min 21 sek
5 8.09 Sletvej 36 8310 Tranbjerg J 25 min 34 sek
6 8.24 Vindinggård Center 29 7100 Vejle 27 min 6 sek
7 7.99 * Søndergade 116 8620 Kjellerup 31 min 27 sek
8 7.99 * Gertrud Rasks Vej 1 9210 Aalborg SØ 31 min 27 sek
9 7.99 * Sorøvej 13 4200 Slagelse 31 min 27 sek
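As an aside (not part of the answer above), when a table is this well-formed, pandas.read_html can often do the whole extraction in one call. A minimal sketch, assuming the page still serves a table with this id:
import requests
import pandas as pd

r = requests.get('http://www.fdmbenzinpriser.dk/searchprices/5/')
# read_html returns a list of DataFrames; attrs narrows the match to the table with this id
tables = pd.read_html(r.text, attrs={'id': 'sortabletable'})
print(tables[0])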

If you mean all the text:
from bs4 import BeautifulSoup
import requests

url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url_str).content
print([x.text for x in BeautifulSoup(r, 'html.parser').find_all("table", attrs={"id": "sortabletable"})])
['Pris\nAdresse\nTidspunkt\n\n\n\n\n* Denne pris er indberettet af selskabet Indberet pris\n\n\n\n\n\n\xa08.24\n\xa0Gladsaxe Møllevej 33 2860 Søborg\n7 min 4 sek \n\n\n\n\xa08.89\n\xa0Frederikssundsvej 356 2700 Brønshøj\n9 min 10 sek \n\n\n\n\xa07.98\n\xa0Gartnerivej 1 7500 Holstebro\n14 min 25 sek \n\n\n\n\xa07.99 *\n\xa0Søndergade 116 8620 Kjellerup\n15 min 7 sek \n\n\n\n\xa07.99 *\n\xa0Gertrud Rasks Vej 1 9210 Aalborg SØ\n15 min 7 sek \n\n\n\n\xa07.99 *\n\xa0Sorøvej 13 4200 Slagelse\n15 min 7 sek \n\n\n\n\xa08.08 *\n\xa0Tørholmsvej 95 9800 Hjørring\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Nordvej 6 9900 Frederikshavn\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Skelmosevej 89 6980 Tim\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Højgårdsvej 2 4000 Roskilde\n15 min 7 sek']
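If you want the rows and cells separately rather than one blob of text, a sketch along the same lines (still assuming the sortabletable id, and reusing r from above):
soup = BeautifulSoup(r, 'html.parser')
table = soup.find("table", attrs={"id": "sortabletable"})
# one list of cell texts per row; rows without <td> (e.g. the header row) come out empty
rows = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in table.find_all("tr")]
rows = [row for row in rows if row]
print(rows)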

Related

Cannot scrape some table using Pandas

I'm more than a noob in Python, and I'm trying to get some tables from this page:
https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html
Using pandas and the command pd.read_html I'm able to get most of them, but not the "Line Score" and the "Four Factors". If I print all the tables (there are 19), these two are missing; inspecting with Chrome they seem to be tables, and I also get them when importing from the web with Excel.
What am I missing here?
Any help appreciated, thanks!
If you look at the page source (not by inspecting), you'd see those tables are within the comments of the HTML. You can either a) edit the HTML string, remove the <!-- and --> markers, and then let pandas parse it, or b) use bs4 to pull out the comments and then parse the tables that way.
I'll show you both options:
Option 1: Remove the comment tags from the page source
import requests
import pandas as pd
url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
response = requests.get(url).text.replace("<!--","").replace("-->","")
dfs = pd.read_html(response, header=1)
Output:
You can see you now have 21 tables, with the 4th and 5th tables the ones in question.
print(len(dfs))
for each in dfs[3:5]:
    print('\n\n', each, '\n')
21
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
Option 2: Pull out comments with bs4
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')

dfs = pd.read_html(url, header=1)  # the regular (non-commented) tables

comments = data.find_all(string=lambda text: isinstance(text, Comment))
other_tables = []
for each in comments:
    if '<table' in str(each):
        try:
            other_tables.append(pd.read_html(str(each), header=1)[0])
        except ValueError:  # the comment contains no parseable table
            continue
Output:
for each in other_tables:
    print(each, '\n')
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
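If you want everything in one list, you could simply combine the two results from above:
all_tables = dfs + other_tables  # the visible tables plus the ones recovered from comments
print(len(all_tables))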

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
I can't seem to use this as a regular dataframe: the datetime is messed up, since it doesn't show the month but the last day of the month, the station name is only a level of the index rather than a full column, and the mean value doesn't have a column name at all. This isn't a dataframe but a pandas.core.series.Series, and converting it with the .to_frame() method doesn't give me what I'm after either. I don't get this part.
I found that, in order to return a normal dataframe, I should use
as_index = False
in the groupby call. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index = False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
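If you later need a real timestamp instead of a Period (e.g. for plotting), you can convert the Date column back; a small sketch, using the df produced above:
df['Date'] = df['Date'].dt.to_timestamp()  # first day of each month as a Timestamp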

pd.concat keys to separate column

I have several tables imported from an Excel file:
df = pd.read_excel(ffile, 'Constraints', header = None, names = range(13))
table_names = ['A', ...., 'W']
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}
This is the first time I've tried to read multiple tables from a single sheet, so I'm not sure if this is the best manner. If printed like this:
for k, v in tables.items():
    print("table:", k)
    print(v)
    print()
The output is:
table: A
0 1 2 ... 10 11 12
2 Sxxxxxx Dxxx 21 20 ... 22 19 22
3 Rxxx Sxxxx / Lxxx Cxxxxxxxxxxx 7 7 ... 7 7 7
4 AVG Sxxxx per xxx # xx% Pxxxxxxxxxxxx 5 X 5.95 5.95 ... 5.95 5.95 5.95
...
...
...
table: W
0 1 2 ... 10 11 12
6 Sxxxxxx Dxxx 21 20 ... 22 19 22
7 Rxxx Sxxxx / Lxxx Cxxxxxxxxxxx 30 30 ... 30 30 30
8 AVG Sxxxx per xxx # xx% Pxxxxxxxxxxxx 5 x 28.5 28.5 ... 28.5 28.5 28.5
I tried to combine them all into one DataFrame using dfa = pd.DataFrame(tables['A'])
for each table, and then using fdf = pd.concat([dfa,...,dwf], keys =['A', ... 'W']).
The keys are hierarchically placed, but the autonumbered index column inserts itself after the keys and before the first column:
0 1 2 ... 10 11 12
A 2 Sxxxxxx Dxxx 21 20 ... 22 19 22
3 Rxxx Sxxxx / Lxxx Cxxxxxxxxxxx 7 7 ... 7 7 7
4 AVG Sxxxx per xxx # xx% Pxxxxxxxxxxxx 5 X 5.95 5.95 ... 5.95 5.95 5.95
I would like to convert the keys to an actual column and switch places with the pandas numbered index, but I'm not sure how to do that. I've tried pd.reset_index() in various configurations, but am wondering if I maybe constructed the tables wrong in the first place?
If any of this information is not necessary, please let me know and I will remove it. I'm trying to follow the MCV guidelines and am not sure how much people need to know.
After you get your tables, just do
pd.concat(tables)
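If you then want the keys as an actual column rather than an index level, one possible follow-up (a sketch, assuming tables is the dict built in the question):
fdf = pd.concat(tables, names=['table'])  # the dict keys become the first index level, named 'table'
fdf = fdf.reset_index(level='table')      # move that level into a regular column
fdf = fdf.reset_index(drop=True)          # optional: replace the leftover row labels with 0..n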

Pandas groupby two columns and only keep records satisfying condition based on count

Trying to filter out a number of actions a user has done if the number of actions reaches a threshold.
Here is the data set: (Only Few records)
user_id,session_id,item_id,rating,length,time
123,36,28,3.5,6243.0,2015-03-07 22:44:40
123,36,29,2.5,4884.0,2015-03-07 22:44:14
123,36,30,3.5,6846.0,2015-03-07 22:44:28
123,36,54,6.5,10281.0,2015-03-07 22:43:56
123,36,61,3.5,7639.0,2015-03-07 22:43:44
123,36,62,7.5,18640.0,2015-03-07 22:43:34
123,36,63,8.5,7189.0,2015-03-07 22:44:06
123,36,97,2.5,7627.0,2015-03-07 22:42:53
123,36,98,4.5,9000.0,2015-03-07 22:43:04
123,36,99,7.5,7514.0,2015-03-07 22:43:13
223,63,30,8.0,5412.0,2015-03-22 01:42:10
123,36,30,5.5,8046.0,2015-03-07 22:42:05
223,63,32,8.5,4872.0,2015-03-22 01:42:03
123,36,32,7.5,11914.0,2015-03-07 22:41:54
225,63,35,7.5,6491.0,2015-03-22 01:42:19
123,36,35,5.5,7202.0,2015-03-07 22:42:15
123,36,36,6.5,6806.0,2015-03-07 22:42:43
123,36,37,2.5,6810.0,2015-03-07 22:42:34
225,63,41,5.0,15026.0,2015-03-22 01:42:37
225,63,45,6.5,8532.0,2015-03-07 22:42:25
I can groupby the data using user_id and session_id and get a count of items a user has rated in a session:
df.groupby(['user_id', 'session_id']).agg({'item_id':'count'}).rename(columns={'item_id': 'count'})
List of items that user has rated in a session can be obtained:
df.groupby(['user_id','session_id'])['item_id'].apply(list)
The goal is the following: if a user has rated more than 3 items in a session, I want to pick only the first three items (keep only the first three per user per session) from the original data frame. Maybe use the time to sort the items?
I first tried to find which sessions contain more than 3 items, but I'm struggling to go beyond that:
df.groupby(['user_id', 'session_id'])['item_id'].apply(
    lambda x: (x > 3).count())
Example: from the original df, user 123 should keep only the first three records belonging to session 36.
It seems like you want to use groupby with head:
In [8]: df.groupby([df.user_id, df.session_id]).head(3)
Out[8]:
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
10 223 63 30 8.0 5412.0 2015-03-22 01:42:10
12 223 63 32 8.5 4872.0 2015-03-22 01:42:03
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
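Since the question mentions possibly using the time to define "first", one variation (my assumption, not part of the answer above) is to sort before taking the head:
res = df.sort_values('time').groupby(['user_id', 'session_id']).head(3)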
One way is to use sort_values followed by groupby.cumcount. A method I find useful is to extract any series or MultiIndex data before applying any filtering.
The below example filters for minimum user_id / session_id combination of 3 items and only takes the first 3 in each group.
sizes = df.groupby(['user_id', 'session_id']).size()
counter = df.groupby(['user_id', 'session_id']).cumcount() + 1 # counting begins at 0
indices = df.set_index(['user_id', 'session_id']).index
df = df.sort_values('time')
res = df[(indices.map(sizes.get) >= 3) & (counter <=3)]
print(res)
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25

Using pandas in Python I am trying to group data into price ranges

Here is the code I am running. It creates a bar plot, but I would like to group together values within $5 of each other for each bar in the graph. The bar graph currently shows all 50 values as individual bars and makes the data nearly unreadable. Is a histogram a better option? Also, bdf is the bids and adf is the asks.
import gdax
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

s = 'sequence'
b = 'bids'
a = 'asks'

public_client = gdax.PublicClient()
o = public_client.get_product_order_book('BTC-USD', level=2)
df = pd.DataFrame(o)
bdf = pd.DataFrame(o[b], columns=['price', 'size', 'null'], dtype='float')
adf = pd.DataFrame(o[a], columns=['price', 'size', 'null'], dtype='float')
del bdf['null']
bdf.plot.bar(x='price', y='size')
plt.show()
pause = input('pause')
Here is an example of the data I receive as a DataFrame object.
price size
0 11390.99 13.686618
1 11389.40 0.002000
2 11389.00 0.090700
3 11386.53 0.060000
4 11385.26 0.010000
5 11385.20 0.453700
6 11381.33 0.006257
7 11380.06 0.011100
8 11380.00 0.001000
9 11378.61 0.729421
10 11378.60 0.159554
11 11375.00 0.012971
12 11374.00 0.297197
13 11373.82 0.005000
14 11373.72 0.661006
15 11373.39 0.001758
16 11373.00 1.000000
17 11370.00 0.082399
18 11367.22 1.002000
19 11366.90 0.010000
20 11364.67 1.000000
21 11364.65 6.900000
22 11364.37 0.002000
23 11361.23 0.250000
24 11361.22 0.058760
25 11360.89 0.001760
26 11360.00 0.026000
27 11358.82 0.900000
28 11358.30 0.020000
29 11355.83 0.002000
30 11355.15 1.000000
31 11354.72 8.900000
32 11354.41 0.250000
33 11353.00 0.002000
34 11352.88 1.313130
35 11352.19 0.510000
36 11350.00 1.650228
37 11349.90 0.477500
38 11348.41 0.001762
39 11347.43 0.900000
40 11347.18 0.874096
41 11345.42 7.800000
42 11343.21 1.700000
43 11343.02 0.001754
44 11341.73 0.900000
45 11341.62 0.002000
46 11341.00 0.024900
47 11340.00 0.400830
48 11339.77 0.002946
49 11337.00 0.050000
Is pandas the best way to manipulate this data?
Not sure if I understand correctly, but if you want to sum the bid sizes in $5 buckets, here is how you can do it:
> df["size"].groupby((df["price"]//5)*5).sum()
price
11335.0 0.052946
11340.0 3.029484
11345.0 10.053358
11350.0 12.625358
11355.0 1.922000
11360.0 8.238520
11365.0 1.012000
11370.0 2.047360
11375.0 0.901946
11380.0 0.018357
11385.0 0.616400
11390.0 13.686618
Name: size, dtype: float64
You can use cut here:
df['bin']=pd.cut(df.price,bins=3)
df.groupby('bin')['size'].sum().plot(kind='bar')
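If you want bins that are exactly $5 wide rather than a fixed number of bins, you could pass explicit edges to pd.cut; a sketch, assuming the bids frame bdf from the question:
import numpy as np
low = np.floor(bdf['price'].min() / 5) * 5         # round the lowest price down to a multiple of 5
edges = np.arange(low, bdf['price'].max() + 5, 5)  # $5-wide bin edges covering the whole range
bdf.groupby(pd.cut(bdf['price'], bins=edges, include_lowest=True))['size'].sum().plot(kind='bar')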
