Get value from one column as a variable for subtraction - Python

I have a DataFrame with XYs and distances. What I am trying to do is store the distance as a variable whenever X or Y has a value greater than 0, and subtract it from the following distances.
Here is a sample df:
dist x y
0 12.93 99.23
200 0 0
400 0 0
600 0 0
800 0 0
1000 12.46 99.14
1200 0 0
1400 0 0
1600 0 0
1800 0 0
2000 12.01 99.07
and this is the new df:
dist x y
0 12.93 99.23
200 0 0
400 0 0
600 0 0
800 0 0
0 12.46 99.14
200 0 0
400 0 0
600 0 0
800 0 0
2000 12.01 99.07
The last value doesn't matter, but technically it would be 0.
The idea is that at every known XY, the distance resets to 0, and that row's original distance is subtracted from every row until the next known XY.
In the above example the distances are round numbers, but in reality they could be values like
132.05
19.999
1539.65
and so on
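For reference, a minimal sketch that reconstructs the sample frame, so the answers below can be run directly (column names taken from the question):
import pandas as pd

df = pd.DataFrame({
    'dist': [0, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    'x': [12.93, 0, 0, 0, 0, 12.46, 0, 0, 0, 0, 12.01],
    'y': [99.23, 0, 0, 0, 0, 99.14, 0, 0, 0, 0, 99.07],
})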

Check with groupby + transform: build group keys from a running count of the known-XY rows, then subtract each group's first distance.
df.dist -= df.groupby(df.x.ne(0).cumsum())['dist'].transform('first')
df
Out[769]:
dist x y
0 0 12.93 99.23
1 200 0.00 0.00
2 400 0.00 0.00
3 600 0.00 0.00
4 800 0.00 0.00
5 0 12.46 99.14
6 200 0.00 0.00
7 400 0.00 0.00
8 600 0.00 0.00
9 800 0.00 0.00
10 0 12.01 99.07
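For reference, the intermediate grouper this produces on the sample (note it keys off x alone, which assumes x and y are always nonzero together, as in the question's data):
# Running count of rows where x != 0; each known XY starts a new group.
grouper = df.x.ne(0).cumsum()
print(grouper.tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3]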

You can use groupby and apply, using a custom grouper calculated as follows:
grouper = (df['x'].ne(0) | df['y'].ne(0)).cumsum()
df['dist'].groupby(grouper).apply(lambda x: x - x.values[0])
0 0
1 200
2 400
3 600
4 800
5 0
6 200
7 400
8 600
9 800
10 0
Name: dist, dtype: int64
Where,
grouper
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
10 3
dtype: int64
The idea is to group each known XY row together with the zero rows that follow it, then subtract the group's first distance from every row in the group.

With where + ffill
df['dist'] = df.dist - df.where(df.x.gt(0) | df.y.gt(0)).dist.ffill()
dist x y
0 0.0 12.93 99.23
1 200.0 0.00 0.00
2 400.0 0.00 0.00
3 600.0 0.00 0.00
4 800.0 0.00 0.00
5 0.0 12.46 99.14
6 200.0 0.00 0.00
7 400.0 0.00 0.00
8 600.0 0.00 0.00
9 800.0 0.00 0.00
10 0.0 12.01 99.07
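To see why this works, here is a sketch of the intermediate step on the sample: the mask keeps dist only on known-XY rows, and ffill carries each of those distances forward.
base = df.where(df.x.gt(0) | df.y.gt(0)).dist.ffill()
print(base.tolist())
# [0.0, 0.0, 0.0, 0.0, 0.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 2000.0]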

Related

Unable to scrape 2nd table from Fbref.com

I would like to scrape the 2nd table on the page at https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard on Google Colab, but pd.read_html("https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard") only gives me the first table.
Please help me understand where I am going wrong.
[Screenshot: snippet of the page]
This is one way to read that data. FBref wraps every table after the first in HTML comments, so stripping the comment markers before parsing makes them visible to read_html:
import pandas as pd
import requests
url = 'https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard'
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
df = pd.read_html(response, header=1)[2]
print(df)
Result in terminal:
Rk Player Nation Pos Squad Age Born MP Starts Min 90s Gls Ast G-PK PK PKatt CrdY CrdR Gls.1 Ast.1 G+A G-PK.1 G+A-PK Matches
0 1 Sahal Abdul Samad in IND MF Kerala Blasters 24 1997 20 19 1443 16.0 5 1 5 0 0 0 0 0.31 0.06 0.37 0.31 0.37 Matches
1 2 Ayush Adhikari in IND MF Kerala Blasters 21 2000 14 6 540 6.0 0 0 0 0 0 3 1 0.00 0.00 0.00 0.00 0.00 Matches
2 3 Gani Ahammed Nigam in IND FW NorthEast Utd 23 1998 6 0 66 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
3 4 Airam es ESP FW Goa 33 1987 13 8 751 8.3 6 1 5 1 2 0 0 0.72 0.12 0.84 0.60 0.72 Matches
4 5 Alex br BRA MF Jamshedpur 32 1988 20 12 1118 12.4 1 4 1 0 0 2 0 0.08 0.32 0.40 0.08 0.40 Matches
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
302 292 João Victor br BRA MF Hyderabad FC 32 1988 18 18 1590 17.7 5 1 3 2 2 3 0 0.28 0.06 0.34 0.17 0.23 Matches
303 293 David Williams au AUS FW Mohun Bagan 33 1988 15 6 602 6.7 4 1 4 0 1 2 0 0.60 0.15 0.75 0.60 0.75 Matches
304 294 Banana Yaya cm CMR DF Bengaluru 30 1991 5 2 229 2.5 0 1 0 0 0 1 0 0.00 0.39 0.39 0.00 0.39 Matches
305 295 Joe Zoherliana in IND DF NorthEast Utd 22 1999 9 6 677 7.5 0 1 0 0 0 0 0 0.00 0.13 0.13 0.00 0.13 Matches
306 296 Mark Zothanpuia in IND MF Hyderabad FC 19 2002 3 0 63 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
307 rows × 24 columns
Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
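One caveat worth noting: on recent pandas versions (2.1+), passing a literal HTML string to read_html is deprecated, so the same approach would wrap the markup in StringIO — a sketch:
import pandas as pd
import requests
from io import StringIO

url = 'https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard'
# Strip the comment markers that hide the secondary tables, then parse.
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
df = pd.read_html(StringIO(html), header=1)[2]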

Shifting a column in a multiindex dataframe with missing dates

I'd like to shift a column in a multiindex dataframe in order to fit a regression model with a lagged independent variable. As my time series has missing dates, I only want values shifted from known previous days. The df looks like this:
cost
ID day
1 31.01.2020 0
1 03.02.2020 0
1 04.02.2020 0.12
1 05.02.2020 0
1 06.02.2020 0
1 07.02.2020 0.08
1 10.02.2020 0
1 11.02.2020 0
1 12.02.2020 0.03
1 13.02.2020 0.1
1 14.02.2020 0
The desired output would look like this:
cost cost_lag
ID day
1 31.01.2020 0 NaN
1 03.02.2020 0 NaN
1 04.02.2020 0.12 0
1 05.02.2020 0 0.12
1 06.02.2020 0 0
1 07.02.2020 0.08 0
1 10.02.2020 0 NaN
1 11.02.2020 0 0
1 12.02.2020 0.03 0
1 13.02.2020 0.1 0.03
1 14.02.2020 0 0.1
Based on this answer to a similar question I've tried the following:
df['cost_lag'] = df.groupby(['id'])['cost'].shift(1)[df.reset_index().day == df.reset_index().day.shift(1) + datetime.timedelta(days=1)]
But that results in an error message I don't understand:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
I've also tried to fill the missing dates following an approach suggested here:
ams_spend_ranking_df = ams_spend_ranking_df.index.get_level_values(1).apply(lambda x: datetime.datetime(x, 1, 1))
again resulting in an error message which does not enlighten me:
AttributeError: 'DatetimeIndex' object has no attribute 'apply'
Long story short: how can I shift the cost column by 1 day and add NaNs if I don't have data on the previous day?
You can add all missing datetimes with DataFrameGroupBy.resample and Resampler.asfreq:
df1 = df.reset_index(level=0).groupby(['ID'])['cost'].resample('d').asfreq()
print (df1)
ID day
1 2020-01-31 0.00
2020-02-01 NaN
2020-02-02 NaN
2020-02-03 0.00
2020-02-04 0.12
2020-02-05 0.00
2020-02-06 0.00
2020-02-07 0.08
2020-02-08 NaN
2020-02-09 NaN
2020-02-10 0.00
2020-02-11 0.00
2020-02-12 0.03
2020-02-13 0.10
2020-02-14 0.00
Name: cost, dtype: float64
Then your solution with DataFrameGroupBy.shift works as needed:
df['cost_lag'] = df1.groupby('ID').shift(1)
print (df)
cost cost_lag
ID day
1 2020-01-31 0.00 NaN
2020-02-03 0.00 NaN
2020-02-04 0.12 0.00
2020-02-05 0.00 0.12
2020-02-06 0.00 0.00
2020-02-07 0.08 0.00
2020-02-10 0.00 NaN
2020-02-11 0.00 0.00
2020-02-12 0.03 0.00
2020-02-13 0.10 0.03
2020-02-14 0.00 0.10
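A self-contained sketch of the whole pipeline, assuming the day level is parsed into real datetimes first (resample requires a DatetimeIndex; the %d.%m.%Y format is taken from the question's dates):
import pandas as pd

days = pd.to_datetime(
    ['31.01.2020', '03.02.2020', '04.02.2020', '05.02.2020',
     '06.02.2020', '07.02.2020', '10.02.2020', '11.02.2020',
     '12.02.2020', '13.02.2020', '14.02.2020'], format='%d.%m.%Y')
df = pd.DataFrame({'ID': 1, 'day': days,
                   'cost': [0, 0, 0.12, 0, 0, 0.08, 0, 0, 0.03, 0.1, 0]})
df = df.set_index(['ID', 'day'])

# Fill the calendar gaps per ID, shift by one real day, align back.
df1 = df.reset_index(level=0).groupby('ID')['cost'].resample('D').asfreq()
df['cost_lag'] = df1.groupby('ID').shift(1)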

Pandas: find the n lowest values every m rows

I have a dataframe with a counter column, increasing by 1 every 24 rows, and a value column, like below.
value counter
0 0.00 1
1 0.00 1
2 0.00 1
3 0.00 1
4 0.00 1
5 0.00 1
6 0.00 1
7 0.00 1
8 55.00 1
9 90.00 1
10 49.27 1
11 51.80 1
12 49.06 1
13 43.46 1
14 45.96 1
15 43.95 1
16 45.00 1
17 43.97 1
18 42.00 1
19 41.14 1
20 43.92 1
21 51.74 1
22 40.85 1
23 0.00 2
24 0.00 2
25 0.00 2
26 0.00 2
27 0.00 2
28 0.00 2
29 0.00 2
... ... ...
187 82.38 9
188 66.89 9
189 59.83 9
190 52.46 9
191 40.48 9
192 28.87 9
193 41.90 9
194 42.56 9
195 40.93 9
196 40.02 9
197 36.54 9
198 33.70 9
199 38.99 9
200 46.10 9
201 44.82 9
202 0.00 9
203 0.00 9
204 0.00 9
205 0.00 9
206 0.00 9
207 0.00 10
208 0.00 10
209 0.00 10
210 74.69 10
211 89.20 10
212 74.59 10
213 55.11 10
214 58.39 10
215 40.81 10
216 45.06 10
I would like to know if there is a way to create a third column with the 4 lowest values in each group of rows where the counter has the same value. See below an example for the first group, with counter=1:
value counter value 2
0 0.00 1 0.00
1 0.00 1 0.00
2 0.00 1 0.00
3 0.00 1 0.00
4 0.00 1 0.00
5 0.00 1 0.00
6 0.00 1 0.00
7 0.00 1 0.00
8 55.00 1 0.00
9 90.00 1 0.00
10 49.27 1 0.00
11 51.80 1 0.00
12 49.06 1 0.00
13 43.46 1 43.46
14 45.96 1 0.00
15 43.95 1 0.00
16 45.00 1 0.00
17 43.97 1 0.00
18 42.00 1 42.00
19 41.14 1 41.14
20 43.92 1 0.00
21 51.74 1 0.00
22 40.85 1 40.85
I know about functions like nsmallest(n, 'column'), but I don't know how to limit them to the counter grouping.
Any idea? Thank you in advance!
I think you need to first filter out rows with 0 in value, sort by sort_values, take the 4 smallest values per group with GroupBy.head, and finally reindex to fill 0 for the unmatched rows:
df['value 2'] = (df[df['value'] != 0]
                 .sort_values('value')
                 .groupby('counter')['value'].head(4)
                 .reindex(df.index, fill_value=0))
print (df)
value counter value 2
0 0.00 1 0.00
1 0.00 1 0.00
2 0.00 1 0.00
3 0.00 1 0.00
4 0.00 1 0.00
5 0.00 1 0.00
6 0.00 1 0.00
7 0.00 1 0.00
8 55.00 1 0.00
9 90.00 1 0.00
10 49.27 1 0.00
11 51.80 1 0.00
12 49.06 1 0.00
13 43.46 1 43.46
14 45.96 1 0.00
15 43.95 1 0.00
16 45.00 1 0.00
17 43.97 1 0.00
18 42.00 1 42.00
19 41.14 1 41.14
20 43.92 1 0.00
21 51.74 1 0.00
22 40.85 1 40.85
23 0.00 2 0.00
24 0.00 2 0.00
25 0.00 2 0.00
26 0.00 2 0.00
27 0.00 2 0.00
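Since the question asks about nsmallest, here is an equivalent sketch built on GroupBy.nsmallest (the helper names nonzero and keep_idx are mine, not from the answer):
# Index labels of the 4 smallest non-zero values within each counter group.
nonzero = df[df['value'] != 0]
keep_idx = nonzero.groupby('counter')['value'].nsmallest(4).index.get_level_values(1)

# Keep those values, set everything else to 0.
df['value 2'] = df['value'].where(df.index.isin(keep_idx), 0)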

How to create a CSV with multiple spacing from pandas?

I have a pandas dataframe and want to output a text file whose columns are separated by differing amounts of spacing, as input to another model. How can I do that?
A sample of the OUTPUT text file is as follows (each column in the text file corresponds to a column in df):
SO HOUREMIS 92 5 1 1 MC12 386.91 389.8 11.45
SO HOUREMIS 92 5 1 1 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED2 322.00 397.4 13.00
SO HOUREMIS 92 5 1 1 HL2 25.55 464.3 7.46
SO HOUREMIS 92 5 1 1 WC1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 WC2 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC12 405.35 389.3 11.54
SO HOUREMIS 92 5 1 2 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED2 319.90 396.3 13.00
After referring to this post, I found the solution:
import numpy as np

fmt = '%0s %+1s %+1s %+2s %+2s %+2s %+6s %+15s'
np.savetxt('test.txt', data.values[0:10], fmt=fmt)
I can format each column and specify the spacing and alignment.
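An alternative sketch without numpy, using plain % formatting row by row (the field widths here are assumptions chosen to resemble the sample output, and data is assumed to be the question's DataFrame with ten columns):
# Right-align each field to a fixed width; adjust the widths per column.
fmt = '%s %s %2s %s %s %s %-4s %8s %7s %6s\n'
with open('test.txt', 'w') as f:
    for row in data.itertuples(index=False):
        f.write(fmt % tuple(row))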

Fill zero values for combinations of unique multi-index values after groupby

To better explain my problem, let's pretend I have a shop with 3 unique customers, and my dataframe contains every purchase by those customers, with weekday, name, and price paid.
name price weekday
0 Paul 18.44 0
1 Micky 0.70 0
2 Sarah 0.59 0
3 Sarah 0.27 1
4 Paul 3.45 2
5 Sarah 14.03 2
6 Paul 17.21 3
7 Micky 5.35 3
8 Sarah 0.49 4
9 Micky 17.00 4
10 Paul 2.62 4
11 Micky 17.61 5
12 Micky 10.63 6
The information I would like to get is the average price per unique customer per weekday. What I often do in similar situations is group by several columns with sum and then take the average of a subset of the columns.
df = df.groupby(['name','weekday']).sum()
price
name weekday
Micky 0 0.70
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
2 3.45
3 17.21
4 2.62
Sarah 0 0.59
1 0.27
2 14.03
4 0.49
df = df.groupby(['weekday']).mean()
price
weekday
0 6.576667
1 0.270000
2 8.740000
3 11.280000
4 6.703333
5 17.610000
6 10.630000
Of course this only works if every unique customer has at least one purchase per day.
Is there an elegant way to get a zero value for all combinations of unique index values that have no sum after the first groupby?
So far my solutions have been either to reindex on a MultiIndex created from the unique values of the grouped columns, or the unstack-fillna-stack combination, but neither really satisfies me.
Appreciate your help!
IIUC, let's use unstack and fillna then stack:
df_out = df.groupby(['name','weekday']).sum().unstack().fillna(0).stack()
Output:
price
name weekday
Micky 0 0.70
1 0.00
2 0.00
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
1 0.00
2 3.45
3 17.21
4 2.62
5 0.00
6 0.00
Sarah 0 0.59
1 0.27
2 14.03
3 0.00
4 0.49
5 0.00
6 0.00
And,
df_out.groupby('weekday').mean()
Output:
price
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
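For completeness, the reindex approach the question itself mentions looks roughly like this (a sketch; from_product assumes you want all 7 weekdays represented even if a day appears for no customer):
import pandas as pd

full_index = pd.MultiIndex.from_product(
    [df['name'].unique(), range(7)], names=['name', 'weekday'])
df_out = (df.groupby(['name', 'weekday']).sum()
            .reindex(full_index, fill_value=0))
df_out.groupby('weekday').mean()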
I think you can use pivot_table to do all the steps at once. I'm not exactly sure what you want, but the default aggregation for pivot_table is the mean; you can change it to 'sum'.
df1 = df.pivot_table(index='name', columns='weekday', values='price',
                     fill_value=0, aggfunc='sum')
weekday 0 1 2 3 4 5 6
name
Micky 0.70 0.00 0.00 5.35 17.00 17.61 10.63
Paul 18.44 0.00 3.45 17.21 2.62 0.00 0.00
Sarah 0.59 0.27 14.03 0.00 0.49 0.00 0.00
And then take the mean of each column.
df1.mean()
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
