Create column that sums the last x occurrences of another column - python

I'm trying to create a new column, let's call it "HomeForm", that is the sum of the last 5 values of "FTHG" for each of the entries in the "HomeTeam" column.
Say for Team 0, the idea would be to populate the cell on the new column with the sum of the last 5 values of "FTHG" that correspond to Team 0. The table is ordered by date.
How can it be done in Python?
HomeTeam FTHG HomeForm
Date
136 0 4
135 2 0
135 4 2
135 5 0
135 6 1
135 13 0
135 17 3
135 18 1
134 11 4
134 12 0
128 1 0
128 3 0
128 8 2
128 9 1
128 13 3
128 14 1
128 15 0
127 7 1
127 16 1
126 10 1
Thanks.

You'll groupby on HomeTeam and perform a rolling sum, with a minimum of 1 period and a maximum of 5.
First, define a function -
def f(x):
    return x.shift().rolling(window=5, min_periods=1).sum()
This function performs the rolling sum of the previous 5 games (hence the shift). Pass this function to dfGroupBy.transform -
df['HomeForm'] = df.groupby('HomeTeam', sort=False).FTHG.transform(f)
df
HomeTeam FTHG HomeForm
Date
136 0 4 NaN
135 2 0 NaN
135 4 2 NaN
135 5 0 NaN
135 6 1 NaN
135 13 0 NaN
135 17 3 NaN
135 18 1 NaN
134 11 4 NaN
134 12 0 NaN
128 1 0 NaN
128 3 0 NaN
128 8 2 NaN
128 9 1 NaN
128 13 3 0.0
128 14 1 NaN
128 15 0 NaN
127 7 1 NaN
127 16 1 NaN
126 10 1 NaN
If needed, fill the NaNs with zeros and convert to integer -
df['HomeForm'] = df['HomeForm'].fillna(0).astype(int)
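An equivalent one-liner (a sketch folding the same shift-and-roll logic into a lambda, with the fillna step chained on) -
df['HomeForm'] = (df.groupby('HomeTeam', sort=False).FTHG
                    .transform(lambda x: x.shift().rolling(window=5, min_periods=1).sum())
                    .fillna(0)
                    .astype(int))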

Related

Average of every x rows with a step size of y per each subset using pandas

I have a pandas data frame like this:
Subset Position Value
1 1 2
1 10 3
1 15 0.285714
1 43 1
1 48 0
1 89 2
1 132 2
1 152 0.285714
1 189 0.133333
1 200 0
2 1 0.133333
2 10 0
2 15 2
2 33 2
2 36 0.285714
2 72 2
2 132 0.133333
2 152 0.133333
2 220 3
2 250 8
2 350 6
2 750 0
How can I get the mean of values for every "x" rows with a step size of "y" per subset in pandas?
For example, mean of every 5 rows (step size =2) for value column in each subset like this:
Subset Start_position End_position Mean
1 1 48 1.2571428
1 15 132 1.0571428
1 48 189 0.8838094
2 1 36 0.8838094
2 15 132 1.2838094
2 36 220 1.110476
2 132 350 3.4533332
Is this what you were looking for:
import pandas as pd

df = pd.DataFrame({'Subset': [1]*10 + [2]*12,
                   'Position': [1,10,15,43,48,89,132,152,189,200,
                                1,10,15,33,36,72,132,152,220,250,350,750],
                   'Value': [2,3,.285714,1,0,2,2,.285714,.133333,0,
                             .133333,0,2,2,.285714,2,.133333,.133333,3,8,6,0]})

window = 5
step_size = 2
rows = []
for subset in df.Subset.unique():
    subset_df = df[df.Subset == subset].reset_index(drop=True)
    for i in range(0, len(subset_df), step_size):   # iterate the subset, not the whole df
        window_rows = subset_df.iloc[i:i + window]
        if len(window_rows) < window:               # skip incomplete trailing windows
            continue
        rows.append({'Subset': window_rows.Subset.iloc[0],
                     'Start_position': window_rows.Position.iloc[0],
                     'End_position': window_rows.Position.iloc[-1],
                     'Mean': window_rows.Value.mean()})

# DataFrame.append was removed in pandas 2.0, so collect dicts and build the frame once
averaged_df = pd.DataFrame(rows, columns=['Subset', 'Start_position', 'End_position', 'Mean'])
Some notes about the code:
It assumes all subsets are in order in the original df (1,1,2,1,2,2 will behave as if it were 1,1,1,2,2,2)
If the rows left over in a subset are fewer than a full window, they are skipped (e.g. Subset 1, Start_position 132, End_position 200, Mean 0.60476 is not included)
A version-specific answer, using pandas.api.indexers.FixedForwardWindowIndexer, introduced in pandas 1.1.0:
>>> window=5
>>> step=2
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window)
>>> df2 = df.join(df.Position.shift(-(window-1)), lsuffix='_start', rsuffix='_end')
>>> df2 = df2.assign(Mean=df2.pop('Value').rolling(window=indexer).mean()).iloc[::step]
>>> df2 = df2[df2.Position_start.lt(df2.Position_end)].dropna()
>>> df2['Position_end'] = df2['Position_end'].astype(int)
>>> df2
Subset Position_start Position_end Mean
0 1 1 48 1.257143
2 1 15 132 1.057143
4 1 48 189 0.883809
10 2 1 36 0.883809
12 2 15 132 1.283809
14 2 36 220 1.110476
16 2 132 350 3.453333
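If the windows must never cross subset boundaries, a hedged alternative (my own sketch, not from either answer above) applies the same fixed-window logic per group; it assumes the df built in the first answer, and window_means is a hypothetical helper name:
import pandas as pd

def window_means(g, window=5, step=2):
    # slide a fixed window over one subset, stepping by `step` rows
    rows = []
    for i in range(0, len(g) - window + 1, step):
        chunk = g.iloc[i:i + window]
        rows.append({'Subset': chunk['Subset'].iloc[0],
                     'Start_position': chunk['Position'].iloc[0],
                     'End_position': chunk['Position'].iloc[-1],
                     'Mean': chunk['Value'].mean()})
    return pd.DataFrame(rows)

result = pd.concat([window_means(g) for _, g in df.groupby('Subset', sort=False)],
                   ignore_index=True)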

find the maximum value for each streak of numbers in another column in pandas

I have a dataframe like this :
df = pd.DataFrame({'dir': [1,1,1,1,0,0,1,1,1,0], 'price':np.random.randint(100,200,10)})
dir price
0 1 100
1 1 150
2 1 190
3 1 194
4 0 152
5 0 151
6 1 131
7 1 168
8 1 112
9 0 193
and I want a new column that shows the maximum price within each streak where dir is 1, resetting (to NaN) where dir is 0.
My desired outcome looks like this:
dir price max
0 1 100 194
1 1 150 194
2 1 190 194
3 1 194 194
4 0 152 NaN
5 0 151 NaN
6 1 131 168
7 1 168 168
8 1 112 168
9 0 193 NaN
Use transform with max for filtered rows:
#get unique groups for consecutive values
g = df['dir'].ne(df['dir'].shift()).cumsum()
#filter only 1
m = df['dir'] == 1
df['max'] = df[m].groupby(g)['price'].transform('max')
print(df)
dir price max
0 1 100 194.0
1 1 150 194.0
2 1 190 194.0
3 1 194 194.0
4 0 152 NaN
5 0 151 NaN
6 1 131 168.0
7 1 168 168.0
8 1 112 168.0
9 0 193 NaN
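A hedged variant of the same idea (my sketch, not from the answer): instead of filtering the frame, mask the prices where dir is 0 with where and take the grouped max; all-NaN groups come back as NaN automatically. This reuses the g defined above:
# mask out dir == 0 prices, then take the max within each consecutive-run group
df['max'] = df['price'].where(df['dir'] == 1).groupby(g).transform('max')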

Pandas - Sum up previous values of a column

New to pandas, I'm trying to sum up all previous values of a column. In SQL I did this by joining the table to itself, so I've been taking the same approach in pandas, but having some issues.
Original Data Frame
TeamName PlayerCount Goals CalMonth
0 A 25 126 1
1 A 25 100 2
2 A 25 156 3
3 B 22 205 1
4 B 30 300 2
5 B 28 189 3
Code
prev_month = np.where(df3['CalMonth'] == 12, df3['CalMonth'] - 11, df3['CalMonth'] + 1)
df4 = pd.merge(df3, df3, how='left', left_on=['TeamName','CalMonth'], right_on=['TeamName', prev_month])
print(df4.head(20))
Output
  TeamName  PlayerCount_x  Goals_x  CalMonth_x  PlayerCount_y  Goals_y  CalMonth_y
0        A             25      126           1            NaN      NaN         NaN
1        A             25      100           2             25      126           1
2        A             25      156           3             25      100           2
3        B             22      205           1             22      NaN         NaN
4        B             22      300           2             22      205           1
5        B             22      189           3             22      100           2
The output is what I had in mind. What I want now is to create a YTD column that sums up all Goals from previous months. Here are my desired results (it can either include the current month or not; that can be handled in an additional step):
  TeamName  PlayerCount_x  Goals_x  CalMonth_x  PlayerCount_y  Goals_y  CalMonth_y  Goals_YTD
0        A             25      126           1            NaN      NaN         NaN        NaN
1        A             25      100           2             25      126           1        126
2        A             25      156           3             25      100           2        226
3        B             22      205           1             22      NaN         NaN        NaN
4        B             22      300           2             22      205           1        205
5        B             22      189           3             22      100           2        305
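No answer is shown in the thread, but for a running total the self-join isn't strictly needed. A hedged sketch: sort by month and take a grouped cumulative sum, shifted one row so the current month is excluded (drop the shift to include it):
import pandas as pd

df3 = pd.DataFrame({'TeamName': ['A','A','A','B','B','B'],
                    'PlayerCount': [25,25,25,22,30,28],
                    'Goals': [126,100,156,205,300,189],
                    'CalMonth': [1,2,3,1,2,3]})

df3 = df3.sort_values(['TeamName','CalMonth'])
# sum of Goals over all previous months within each team; month 1 comes out NaN
df3['Goals_YTD'] = (df3.groupby('TeamName')['Goals']
                       .transform(lambda s: s.cumsum().shift()))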

How to find out if there was a weekend between days?

I have two data frames: one represents when an order was placed and when it arrived, while the other represents the working days of the shop.
Days are given as days of the year, e.g. 32 = 1st February.
orders = DataFrame({'placed':[100,103,104,105,108,109], 'arrived':[103,104,105,106,111,111]})
Out[25]:
arrived placed
0 103 100
1 104 103
2 105 104
3 106 105
4 111 108
5 111 109
calendar = DataFrame({'day':['100','101','102','103','104','105','106','107','108','109','110','111','112','113','114','115','116','117','118','119','120'], 'closed':[0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0]})
Out[21]:
closed day
0 0 100
1 1 101
2 1 102
3 0 103
4 0 104
5 0 105
6 0 106
7 0 107
8 1 108
9 1 109
10 0 110
11 0 111
12 0 112
13 0 113
14 0 114
15 1 115
16 1 116
17 0 117
18 0 118
19 0 119
20 0 120
What I want to do is compute the difference between placed and arrived
x = orders['arrived'] - orders['placed']
Out[24]:
0 3
1 1
2 1
3 1
4 3
5 2
dtype: int64
and subtract one for each day between placed and arrived (inclusive) on which the shop was closed.
E.g. in the first row the order is placed on day 100 and arrives on day 103, so the days involved are 100, 101, 102 and 103, and the difference between 103 and 100 is 3. However, since the shop is closed on days 101 and 102, I want to subtract 1 for each of them, giving 3 - 1 - 1 = 1. Finally, I want to append this result to the orders df.
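No answer is shown here either; a hedged sketch of one approach, reusing the orders and calendar frames above (the open_days helper and the duration column name are my own):
# the calendar's 'day' column holds strings in the question, so cast it first
calendar['day'] = calendar['day'].astype(int)
closed_days = set(calendar.loc[calendar['closed'] == 1, 'day'])

def open_days(row):
    # raw difference minus one for each closed day in the inclusive range
    closed = sum(d in closed_days for d in range(row['placed'], row['arrived'] + 1))
    return row['arrived'] - row['placed'] - closed

# e.g. the first order: 103 - 100 = 3, minus closed days 101 and 102 -> 1
orders['duration'] = orders.apply(open_days, axis=1)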

Counting grouped data with missing values in pandas dataframe

I am trying to do something like this, but on a much larger dataframe (called Clean):
import pandas as pd
from numpy import nan as NaN
from pandas import DataFrame

d = {'rx': [1,1,1,1,2.1,2.1,2.1,2.1],
     'vals': [NaN,10,10,20,NaN,10,20,20]}
df = DataFrame(d)
arrays = [df.rx, df.vals]
index = pd.MultiIndex.from_arrays(arrays, names=['rx','vals'])
df.index = index
Hist = df.groupby(level=('rx','vals'))
Hist.count('vals')
This seems to work just fine, but when I run the same concept on even a subset of the Clean dataframe (substituting a column 'LagBin' for 'vals') I get an error:
df1=DataFrame(data=Clean,columns=('rx','LagBin'))
df1=df1.head(n=20)
arrays = [df1.rx,df1.LagBin]
index = pd.MultiIndex.from_arrays(arrays, names = ['rx','LagBin'])
df1.index = index
Hist=df1.groupby(level=('rx','LagBin'))
Hist.count('LagBin')
Specifically, the Hist.count('LagBin') produces a value error:
ValueError: Cannot convert NA to integer
I have looked at the data structure and that all seems exactly the same.
Here is the data that produces the error:
rx LagBin rx LagBin
139.1 NaN 139.1 NaN
139.1 0 139.1 0
139.1 0 139.1 0
139.1 0 139.1 0
141.1 NaN 141.1 NaN
141.1 10 141.1 10
141.1 20 141.1 20
193 NaN 193 NaN
193 50 193 50
193 20 193 20
193 3600 193 3600
193 50 193 50
193 0 193 0
193 20 193 20
193 10 193 10
193 110 193 110
193 80 193 80
193 460 193 460
193 30 193 30
193 0 193 0
while the original routine that works produces this:
rx vals rx vals
1 NaN 1 NaN
1 10 1 10
1 10 1 10
1 20 1 20
2.1 NaN 2.1 NaN
2.1 10 2.1 10
2.1 20 2.1 20
2.1 20 2.1 20
What is different about these datasets that produces this error?
If I'm understanding your question correctly I believe what you want is:
Hist.agg(len).dropna()
The full code implementation looks like this:
import pandas as pd
from numpy import nan

d = {'rx': [139.1,139.1,139.1,139.1,141.1,141.1,141.1,193,193,193,193,193,193,193,193,193,193,193,193,193],
     'vals': [nan,0,0,0,nan,10,20,nan,50,20,3600,50,0,20,10,110,80,460,30,0]}
df = pd.DataFrame(d)
arrays = [df.rx, df.vals]
index = pd.MultiIndex.from_arrays(arrays, names=['rx','vals'])
df.index = index
Hist = df.groupby(level=('rx','vals'))
print(Hist.agg(len).dropna())
Where df looks like:
rx vals
rx vals
139.1 NaN 139.1 NaN
0 139.1 0
0 139.1 0
0 139.1 0
141.1 NaN 141.1 NaN
10 141.1 10
20 141.1 20
193.0 NaN 193.0 NaN
50 193.0 50
20 193.0 20
3600 193.0 3600
50 193.0 50
0 193.0 0
20 193.0 20
10 193.0 10
110 193.0 110
80 193.0 80
460 193.0 460
30 193.0 30
0 193.0 0
And the line Hist.agg(len).dropna() looks like:
rx vals
rx vals
139.1 0 3 3
141.1 10 1 1
20 1 1
193.0 0 2 2
10 1 1
20 2 2
30 1 1
50 2 2
80 1 1
110 1 1
460 1 1
3600 1 1
That looks right. I have been tinkering with groupby and came up with this solution, which seems more elegant and does not require explicitly dealing with the NaNs:
df1 = DataFrame(data=Clean, columns=('rx','LagBin'))
df1 = df1.head(n=20)
LagCount = df1["rx"].groupby((df1["rx"], df1["LagBin"])).count().reset_index(name="Count")
print(LagCount)
which gives me:
rx LagBin Count
0 139.1 0 3
1 141.1 10 1
2 141.1 20 1
3 193.0 0 2
4 193.0 10 1
5 193.0 20 2
6 193.0 30 1
7 193.0 50 2
8 193.0 80 1
9 193.0 110 1
10 193.0 460 1
11 193.0 3600 1
I like this better, because I retain values as values and not indices, which I assume will make life easier later for plotting.
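For what it's worth, a hedged modern equivalent (my sketch, not from the thread): grouping on the columns directly and calling .size() yields the same counts, because groupby drops NaN group keys by default (pandas 1.1+ accepts dropna=False to keep them):
import numpy as np
import pandas as pd

df = pd.DataFrame({'rx': [1,1,1,1,2.1,2.1,2.1,2.1],
                   'vals': [np.nan,10,10,20,np.nan,10,20,20]})

# NaN keys are excluded by default, matching the dropna() in the answer above
counts = df.groupby(['rx','vals']).size().reset_index(name='Count')
print(counts)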
