I'm little more than a noob in Python, and I'm trying to get some tables from this page:
https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html
Using pandas and pd.read_html I'm able to get most of them, but not the "Line Score" and the "Four Factors". If I print all the tables (there are 19), these two are missing. Inspecting with Chrome they do seem to be tables, and I can also get them in Excel by importing from the web.
What am I missing here?
Any help appreciated, thanks!
If you look at the page source (not by inspecting), you'd see those tables are inside HTML comments. You can either a) edit the HTML string and remove the <!-- and --> markers, then let pandas parse it, or b) use bs4 to pull out the comments, then parse the tables that way.
I'll show you both options:
Option 1: Remove the comment tags from the page source
import requests
import pandas as pd

url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'

# Strip the comment markers so the hidden tables become plain HTML
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
dfs = pd.read_html(response, header=1)
You can see you now have 21 tables, with the 4th and 5th being the ones in question:
print(len(dfs))
for each in dfs[3:5]:
    print('\n\n', each, '\n')
Output:
21
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
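One caveat: on newer pandas (2.1+), passing a literal HTML string to pd.read_html is deprecated, so you may need to wrap the response in StringIO first:
from io import StringIO

dfs = pd.read_html(StringIO(response), header=1)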
Option 2: Pull out comments with bs4
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'

result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')

# The 19 regular (non-commented) tables, for completeness
dfs = pd.read_html(url, header=1)

# Find every comment node, then parse any comment that contains a table
comments = data.find_all(string=lambda text: isinstance(text, Comment))
other_tables = []
for each in comments:
    if '<table' in str(each):
        try:
            other_tables.append(pd.read_html(str(each), header=1)[0])
        except ValueError:  # comment mentions '<table' but has no parseable table
            continue
for each in other_tables:
    print(each, '\n')
Output:
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
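Option 1 showed the page holds 21 tables in total (the 19 visible ones plus the 2 in comments), so if it's more convenient you can stitch the two results into one list:
all_tables = dfs + other_tables
print(len(all_tables))  # should be 21 here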
As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues. Using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe. The datetime is also messed up: it doesn't show the month but the last day of each month. The station name is a single index level rather than a full column, and the mean values have no column name at all. This isn't a dataframe but a pandas.core.series.Series, and converting it with the .to_frame() method gives me a DataFrame that still has the same problems. I don't get this part.
I found that in order to return a normal dataframe, I can use
as_index=False
in the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
   Station_Name     Date  Value
0             2  2006-01   14.6
1            45  2006-12   38.2
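For reference, a minimal self-contained version of the same idea (column names taken from the question); note the Date column comes back as a monthly Period, which you can convert with .astype(str) if you need plain strings:
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2006-01-03', '2006-01-04', '2006-12-23', '2006-12-24']),
    'Value': [18, 12, 47, 46],
    'Station_Name': [2, 2, 45, 45],
})

# Group by station and by calendar month, then bring the index levels back as columns
out = (df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value']
         .mean()
         .reset_index())
print(out)
#    Station_Name     Date  Value
# 0             2  2006-01   15.0
# 1            45  2006-12   46.5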
I have several tables imported from an Excel file:
df = pd.read_excel(ffile, 'Constraints', header = None, names = range(13))
table_names = ['A', ...., 'W']
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}
This is the first time I've tried to read multiple tables from a single sheet, so I'm not sure if this is the best approach. If printed like this:
for k,v in tables.items():
print("table:", k)
print(v)
print()
The output is:
table: A
0 1 2 ... 10 11 12
2 Sxxxxxx Dxxx 21 20 ... 22 19 22
3 Rxxx Sxxxx / Lxxx Cxxxxxxxxxxx 7 7 ... 7 7 7
4 AVG Sxxxx per xxx # xx% Pxxxxxxxxxxxx 5 X 5.95 5.95 ... 5.95 5.95 5.95
...
...
...
table: W
0 1 2 ... 10 11 12
6 Sxxxxxx Dxxx 21 20 ... 22 19 22
7 Rxxx Sxxxx / Lxxx Cxxxxxxxxxxx 30 30 ... 30 30 30
8 AVG Sxxxx per xxx # xx% Pxxxxxxxxxxxx 5 x 28.5 28.5 ... 28.5 28.5 28.5
I tried to combine them all into one DataFrame using dfa = pd.DataFrame(tables['A']) for each table, and then fdf = pd.concat([dfa, ..., dfw], keys=['A', ..., 'W']). The keys are placed hierarchically, but the auto-numbered index column inserts itself after the keys and before the first column:
0 1 2 ... 10 11 12
A 2 Sxxxxxx Dxxx 21 20 ... 22 19 22
3 Rxxx Sxxxx / Lxxx Cxxxxxxxxxxx 7 7 ... 7 7 7
4 AVG Sxxxx per xxx # xx% Pxxxxxxxxxxxx 5 X 5.95 5.95 ... 5.95 5.95 5.95
I would like to convert the keys to an actual column and switch places with the pandas numbered index, but I'm not sure how to do that. I've tried .reset_index() in various configurations, but am wondering if I maybe constructed the tables wrong in the first place?
If any of this information is not necessary, please let me know and I will remove it. I'm trying to follow the MCVE guidelines and am not sure how much people need to know.
After you get your tables, just do:
pd.concat(tables)
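And since you wanted the keys as an actual column rather than an index level: name the outer level when concatenating, then promote it with reset_index. A small sketch with placeholder table contents:
import pandas as pd

tables = {
    'A': pd.DataFrame({0: ['x', 'y'], 1: [21, 7]}),
    'W': pd.DataFrame({0: ['x', 'y'], 1: [21, 30]}),
}

# The dict keys become the outer index level, here named 'table'
fdf = pd.concat(tables, names=['table'])

# Turn that level into a regular column, keeping the original row index
fdf = fdf.reset_index(level='table')
print(fdf)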
I'm trying to filter the actions a user has done once the number of actions reaches a threshold.
Here is the data set (only a few records):
user_id,session_id,item_id,rating,length,time
123,36,28,3.5,6243.0,2015-03-07 22:44:40
123,36,29,2.5,4884.0,2015-03-07 22:44:14
123,36,30,3.5,6846.0,2015-03-07 22:44:28
123,36,54,6.5,10281.0,2015-03-07 22:43:56
123,36,61,3.5,7639.0,2015-03-07 22:43:44
123,36,62,7.5,18640.0,2015-03-07 22:43:34
123,36,63,8.5,7189.0,2015-03-07 22:44:06
123,36,97,2.5,7627.0,2015-03-07 22:42:53
123,36,98,4.5,9000.0,2015-03-07 22:43:04
123,36,99,7.5,7514.0,2015-03-07 22:43:13
223,63,30,8.0,5412.0,2015-03-22 01:42:10
123,36,30,5.5,8046.0,2015-03-07 22:42:05
223,63,32,8.5,4872.0,2015-03-22 01:42:03
123,36,32,7.5,11914.0,2015-03-07 22:41:54
225,63,35,7.5,6491.0,2015-03-22 01:42:19
123,36,35,5.5,7202.0,2015-03-07 22:42:15
123,36,36,6.5,6806.0,2015-03-07 22:42:43
123,36,37,2.5,6810.0,2015-03-07 22:42:34
225,63,41,5.0,15026.0,2015-03-22 01:42:37
225,63,45,6.5,8532.0,2015-03-07 22:42:25
I can groupby the data using user_id and session_id and get a count of items a user has rated in a session:
df.groupby(['user_id', 'session_id']).agg({'item_id':'count'}).rename(columns={'item_id': 'count'})
The list of items a user has rated in a session can be obtained with:
df.groupby(['user_id','session_id'])['item_id'].apply(list)
The goal: if a user has rated more than 3 items in a session, I want to keep only the first three items per user per session from the original data frame. Maybe use the time to sort the items?
I first tried to find which sessions contain more than 3 items, but I'm struggling to go beyond that:
df.groupby(['user_id', 'session_id'])['item_id'].apply(
    lambda x: (x > 3).count())
Example: from the original df, user 123 should end up with the first three records belonging to session 36.
It seems like you want to use groupby with head:
In [8]: df.groupby([df.user_id, df.session_id]).head(3)
Out[8]:
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
10 223 63 30 8.0 5412.0 2015-03-22 01:42:10
12 223 63 32 8.5 4872.0 2015-03-22 01:42:03
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
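One variant worth noting, since the question mentions using time to sort the items: if you want the earliest three per group rather than the first three in row order, sort before grouping:
df.sort_values('time').groupby(['user_id', 'session_id']).head(3)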
One way is to use groupby.cumcount. A method I find useful is to extract any Series or MultiIndex data before applying any filtering.
The example below keeps only user_id / session_id combinations with at least 3 items, and takes the first 3 rows in each group. (It uses the original row order, which matches your example; if you want the earliest three by timestamp, sort with df.sort_values('time') before computing the counter.)
sizes = df.groupby(['user_id', 'session_id']).size()
counter = df.groupby(['user_id', 'session_id']).cumcount() + 1  # counting begins at 0
indices = df.set_index(['user_id', 'session_id']).index

res = df[(indices.map(sizes.get) >= 3) & (counter <= 3)]
print(res)
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
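An equivalent spelling uses groupby.filter, which reads closer to the stated threshold logic: drop groups with fewer than 3 items, then keep the first 3 rows of each remaining group. A sketch under the same assumptions as above, not from the original answer:
res = (df.groupby(['user_id', 'session_id'])
         .filter(lambda g: len(g) >= 3)
         .groupby(['user_id', 'session_id'])
         .head(3))
print(res)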
Here is the code I am running. It creates a bar plot, but I would like to group together values within $5 of each other for each bar in the graph. The bar graph currently shows all 50 values as individual bars, which makes the data nearly unreadable. Is a histogram a better option? Also, bdf is the bids and adf is the asks.
import gdax
import pandas as pd
import matplotlib.pyplot as plt

b = 'bids'
a = 'asks'

public_client = gdax.PublicClient()
o = public_client.get_product_order_book('BTC-USD', level=2)

# Level-2 book entries are [price, size, num_orders]; the third column is dropped below
bdf = pd.DataFrame(o[b], columns=['price', 'size', 'null'], dtype='float')
adf = pd.DataFrame(o[a], columns=['price', 'size', 'null'], dtype='float')  # asks

del bdf['null']
bdf.plot.bar(x='price', y='size')
plt.show()
pause = input('pause')
Here is an example of the data I receive as a DataFrame object.
price size
0 11390.99 13.686618
1 11389.40 0.002000
2 11389.00 0.090700
3 11386.53 0.060000
4 11385.26 0.010000
5 11385.20 0.453700
6 11381.33 0.006257
7 11380.06 0.011100
8 11380.00 0.001000
9 11378.61 0.729421
10 11378.60 0.159554
11 11375.00 0.012971
12 11374.00 0.297197
13 11373.82 0.005000
14 11373.72 0.661006
15 11373.39 0.001758
16 11373.00 1.000000
17 11370.00 0.082399
18 11367.22 1.002000
19 11366.90 0.010000
20 11364.67 1.000000
21 11364.65 6.900000
22 11364.37 0.002000
23 11361.23 0.250000
24 11361.22 0.058760
25 11360.89 0.001760
26 11360.00 0.026000
27 11358.82 0.900000
28 11358.30 0.020000
29 11355.83 0.002000
30 11355.15 1.000000
31 11354.72 8.900000
32 11354.41 0.250000
33 11353.00 0.002000
34 11352.88 1.313130
35 11352.19 0.510000
36 11350.00 1.650228
37 11349.90 0.477500
38 11348.41 0.001762
39 11347.43 0.900000
40 11347.18 0.874096
41 11345.42 7.800000
42 11343.21 1.700000
43 11343.02 0.001754
44 11341.73 0.900000
45 11341.62 0.002000
46 11341.00 0.024900
47 11340.00 0.400830
48 11339.77 0.002946
49 11337.00 0.050000
Is pandas the best way to manipulate this data?
Not sure if I understand correctly, but if you want to total the sizes in $5 price steps, here is how you can do it:
df["size"].groupby((df["price"] // 5) * 5).sum()
price
11335.0 0.052946
11340.0 3.029484
11345.0 10.053358
11350.0 12.625358
11355.0 1.922000
11360.0 8.238520
11365.0 1.012000
11370.0 2.047360
11375.0 0.901946
11380.0 0.018357
11385.0 0.616400
11390.0 13.686618
Name: size, dtype: float64
You can use cut here:
df['bin'] = pd.cut(df.price, bins=3)
df.groupby('bin')['size'].sum().plot(kind='bar')
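If you want bins that are exactly $5 wide rather than three equal-width bins, you can pass explicit edges to pd.cut. A sketch, assuming df holds the price and size columns shown above:
import numpy as np
import pandas as pd

lo = np.floor(df['price'].min() / 5) * 5
hi = np.ceil(df['price'].max() / 5) * 5
edges = np.arange(lo, hi + 5, 5)  # $5-wide bin edges

df['bin'] = pd.cut(df['price'], bins=edges)
df.groupby('bin')['size'].sum().plot(kind='bar')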