Pandas groupby multiple columns with rolling date offset - How?

I am trying to do a rolling sum across partitioned data based on a moving 2 business day window. It feels like it should be both easy and widely used, but the solution is beyond me.
#generate sample data
import pandas as pd
import numpy as np
import datetime
vals = [-4,17,-4,-16,2,20,3,10,-17,-8,-21,2,0,-11,16,-24,-10,-21,5,12,14,9,-15,-15]
grp = ['X']*6 + ['Y'] * 6 + ['X']*6 + ['Y'] * 6
typ = ['foo']*12+['bar']*12
dat = ['19/01/18','19/01/18','22/01/18','22/01/18','23/01/18','24/01/18'] * 4
#create dataframe with sample data
df = pd.DataFrame({'group': grp,'type':typ,'value':vals,'date':dat})
df.date = pd.to_datetime(df.date, format='%d/%m/%y')  # dates are day-first
df.head(12)
gives the following (only the first 12 rows are shown):
date group type value
0 19/01/2018 X foo -4
1 19/01/2018 X foo 17
2 22/01/2018 X foo -4
3 22/01/2018 X foo -16
4 23/01/2018 X foo 2
5 24/01/2018 X foo 20
6 19/01/2018 Y foo 3
7 19/01/2018 Y foo 10
8 22/01/2018 Y foo -17
9 22/01/2018 Y foo -8
10 23/01/2018 Y foo -21
11 24/01/2018 Y foo 2
The desired results are (all rows shown here):
date group type 2BD Sum
1 19/01/2018 X foo 13
2 22/01/2018 X foo -7
3 23/01/2018 X foo -18
4 24/01/2018 X foo 22
5 19/01/2018 Y foo 13
6 22/01/2018 Y foo -12
7 23/01/2018 Y foo -46
8 24/01/2018 Y foo -19
9 19/01/2018 X bar -11
10 22/01/2018 X bar -19
11 23/01/2018 X bar -18
12 24/01/2018 X bar -31
13 19/01/2018 Y bar 17
14 22/01/2018 Y bar 40
15 23/01/2018 Y bar 8
16 24/01/2018 Y bar -30
I have viewed this question and tried
(df.groupby(['group','type']).rolling('2d', on='date').agg({'value': 'sum'})
   .reset_index()
   .groupby(['group','type','date']).agg({'value': 'sum'})
   .reset_index())
This would work fine if 'value' were always positive, but that is not the case here. I have tried many other approaches that raised errors, which I can list if that would help. Can anyone help?

I expected the following to work:
g = lambda ts: ts.rolling('2B', on='date')['value'].sum()
df.groupby(['group', 'type']).apply(g)
However, I get an error because a business day is not a fixed frequency, so it cannot be used as a time-based rolling window.
This brings me to the following solution, which is admittedly uglier:
# total value per business day, per (group, type)
value_per_bday = lambda df: df.resample('B', on='date')['value'].sum()
df = df.groupby(['group', 'type']).apply(value_per_bday).stack()
# rolling 2-business-day sum within each (group, type)
value_2_bdays = lambda x: x.rolling(2, min_periods=1).sum()
df = df.groupby(axis=0, level=['group', 'type']).apply(value_2_bdays)
It may read better as a function; your pick.
def resample_and_sum(x):
    x = x.resample('B', on='date')['value'].sum()
    x = x.rolling(2, min_periods=1).sum()
    return x

df = df.groupby(['group', 'type']).apply(resample_and_sum).stack()
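On newer pandas the same two steps can also be written as one chain. A minimal sketch, assuming pandas >= 1.0, where groupby(...).rolling(...) prepends the group keys to the result index (hence the droplevel at the end):
daily = (df.set_index('date')
           .groupby(['group', 'type'])['value']
           .resample('B')
           .sum())                    # total value per business day
result = (daily.groupby(level=['group', 'type'])
               .rolling(2, min_periods=1)
               .sum()                 # 2-business-day rolling sum
               .droplevel([0, 1]))    # drop the duplicated group keys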

Related

How to iteratively add a column to a dataframe X based on the values of a separated dataframe y with Pandas?

I am struggling with this problem.
These are my initial matrices:
columnsx = {'X1':[6,11,17,3,12],'X2':[1,2,10,24,18],'X3':[8,14,9,15,7], 'X4':[22,4,20,16,5],'X5':[19,21,13,23,25]}
columnsy = {'y1':[0,1,1,2,0],'y2':[1,0,0,2,1]}
X = pd.DataFrame(columnsx)
y = pd.DataFrame(columnsy)
This is the final result I am aiming for. It adds a column to X (called X_i) containing the name of the y column whose value is > 0, so it keeps only the positive values of y (y > 0) and collapses the two y columns into one.
columnsx = {'X1':[11,17,3,6,3,12],'X2':[2,10,24,1,24,18],'X3':[14,9,15,8,15,7],
'X4':[4,20,16,22,16,5],'X5':[21,13,23,19,23,25], 'X_i':['y1','y1','y1','y2','y2','y2']}
columnsy = {'y':[1,1,2,1,2,1]}
X = pd.DataFrame(columnsx)
y = pd.DataFrame(columnsy)
Use DataFrame.melt on the combined frame (df = X.join(y)), keeping the X columns as identifiers and filtering for positive y values:
new_df = (df.melt(df.columns[df.columns.str.contains('X')],
var_name='X_y', value_name='y')
.loc[lambda df: df['y'].gt(0)])
print(new_df)
Output
X1 X2 X3 X4 X5 X_y y
1 11 2 14 4 21 y1 1
2 17 10 9 20 13 y1 1
3 3 24 15 16 23 y1 2
5 6 1 8 22 19 y2 1
8 3 24 15 16 23 y2 2
9 12 18 7 5 25 y2 1
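If you then need the separate X / y pair from the desired output, a possible follow-up (the rename to X_i matches the question's target column name):
new_df = new_df.reset_index(drop=True).rename(columns={'X_y': 'X_i'})
X = new_df.drop(columns='y')
y = new_df[['y']]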

Function for bar plot in python

I am trying to write a function for a bar plot like the one shown below, for Category and Group based on the Index. The function has to split the X and Y values of Index separately and plot the graphs for Category and Group.
Index Group Category Population
X A 5 12
X A 5 34
Y B 5 23
Y B 5 34
Y B 6 33
X A 6 44
Y C 7 12
X C 7 23
Y A 8 12
Y A 8 4
X B 8 56
Y B 9 67
X B 10 23
Y A 8 45
X C 9 34
X C 9 56
Here, Men and Women correspond to the Index values X and Y in my case.
I have tried many different approaches but have not been able to solve this. Any help would be much appreciated.
Not sure if this is what you are looking for, but it's the easiest way to plot multi-indices IMO:
df["Index"] = df["Index"].map({"X":"Male", "Y": "Female"})
df_ = df.groupby(["Group","Category","Index"]).mean().unstack()
df_.plot.bar()
This will give you a grouped bar chart of the mean Population for each (Group, Category) pair, with separate Male and Female bars. [bar plot image]
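Since the question asks specifically for a function, here is a minimal sketch wrapping the same steps (the function and argument names are illustrative, not from the question; assumes matplotlib is available):
import matplotlib.pyplot as plt

def plot_population(df, index_labels=('Male', 'Female')):
    # Bar plot of mean Population per (Group, Category), split by Index.
    data = df.copy()
    data['Index'] = data['Index'].map(dict(zip(['X', 'Y'], index_labels)))
    means = (data.groupby(['Group', 'Category', 'Index'])['Population']
                 .mean()
                 .unstack())          # one column per Index label
    ax = means.plot.bar()
    ax.set_ylabel('mean Population')
    plt.tight_layout()
    plt.show()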

Groupby on condition and calculate sum of subgroups

Here is my data:
import numpy as np
import pandas as pd
z = pd.DataFrame({'a':[1,1,1,2,2,3,3],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})
z
a b c
0 1 3 10
1 1 4 11
2 1 5 12
3 2 6 13
4 2 7 14
5 3 8 15
6 3 9 16
Question:
How can I do a calculation on selected elements of each subgroup? For example, for each group I want to extract every element in column 'c' whose corresponding element in column 'b' is strictly between 4 and 9, and sum them all.
Here is the code I wrote (it runs, but I cannot get the correct result):
gbz = z.groupby('a')
# For displaying the groups:
gbz.apply(lambda x: print(x))
list = []
def f(x):
    list_new = []
    for row in range(0, len(x)):
        if (x.iloc[row, 0] > 4 and x.iloc[row, 0] < 9):
            list_new.append(x.iloc[row, 1])
    list.append(sum(list_new))
results = gbz.apply(f)
The expected output is something like this:
a c
0 1 12
1 2 27
2 3 15
It might just be easiest to change the order of operations and filter against your criteria first; the filter result does not change after the groupby.
z.query('4 < b < 9').groupby('a', as_index=False).c.sum()
which yields
a c
0 1 12
1 2 27
2 3 15
Use
In [2379]: z[z.b.between(4, 9, inclusive=False)].groupby('a', as_index=False).c.sum()
Out[2379]:
a c
0 1 12
1 2 27
2 3 15
Or
In [2384]: z[(4 < z.b) & (z.b < 9)].groupby('a', as_index=False).c.sum()
Out[2384]:
a c
0 1 12
1 2 27
2 3 15
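Note that in pandas >= 1.3 the boolean inclusive argument to between is deprecated in favor of the strings 'both', 'neither', 'left', or 'right', so the between variant would be written as:
z[z.b.between(4, 9, inclusive='neither')].groupby('a', as_index=False).c.sum()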
You could also groupby first.
z = z.groupby('a').apply(lambda x: x.loc[x['b']\
.between(4, 9, inclusive=False), 'c'].sum()).reset_index(name='c')
z
a c
0 1 12
1 2 27
2 3 15
Or you can use
z.groupby('a').apply(lambda x : sum(x.loc[(x['b']>4)&(x['b']<9),'c']))\
.reset_index(name='c')
Out[775]:
a c
0 1 12
1 2 27
2 3 15

Round float columns in pandas dataframe

I have got the following pandas data frame
Y X id WP_NER
0 35.973496 -2.734554 1 WP_01
1 35.592138 -2.903913 2 WP_02
2 35.329853 -3.391070 3 WP_03
3 35.392608 -3.928513 4 WP_04
4 35.579265 -3.942995 5 WP_05
5 35.519728 -3.408771 6 WP_06
6 35.759485 -3.078903 7 WP_07
I'd like to round the Y and X columns using pandas.
How can I do that?
You can now use round directly on the DataFrame.
Option 1
In [661]: df.round({'Y': 2, 'X': 2})
Out[661]:
Y X id WP_NER
0 35.97 -2.73 1 WP_01
1 35.59 -2.90 2 WP_02
2 35.33 -3.39 3 WP_03
3 35.39 -3.93 4 WP_04
4 35.58 -3.94 5 WP_05
5 35.52 -3.41 6 WP_06
6 35.76 -3.08 7 WP_07
Option 2
In [662]: cols = ['Y', 'X']
In [663]: df[cols] = df[cols].round(2)
In [664]: df
Out[664]:
Y X id WP_NER
0 35.97 -2.73 1 WP_01
1 35.59 -2.90 2 WP_02
2 35.33 -3.39 3 WP_03
3 35.39 -3.93 4 WP_04
4 35.58 -3.94 5 WP_05
5 35.52 -3.41 6 WP_06
6 35.76 -3.08 7 WP_07
You can apply round:
In [142]:
df[['Y','X']].apply(pd.Series.round)
Out[142]:
Y X
0 36 -3
1 36 -3
2 35 -3
3 35 -4
4 36 -4
5 36 -3
6 36 -3
If you want to apply to a specific number of places:
In [143]:
df[['Y','X']].apply(lambda x: pd.Series.round(x, 3))
Out[143]:
Y X
0 35.973 -2.735
1 35.592 -2.904
2 35.330 -3.391
3 35.393 -3.929
4 35.579 -3.943
5 35.520 -3.409
6 35.759 -3.079
EDIT
You can assign the above back to the columns you want to modify, like the following:
In [144]:
df[['Y','X']] = df[['Y','X']].apply(lambda x: pd.Series.round(x, 3))
df
Out[144]:
Y X id WP_NER
0 35.973 -2.735 1 WP_01
1 35.592 -2.904 2 WP_02
2 35.330 -3.391 3 WP_03
3 35.393 -3.929 4 WP_04
4 35.579 -3.943 5 WP_05
5 35.520 -3.409 6 WP_06
6 35.759 -3.079 7 WP_07
round is smart enough to operate only on float columns and leave the rest untouched, so the simplest solution is just:
df = df.round(2)
You can also do the following:
df['column_name'] = df['column_name'].apply(lambda x: round(x, 2) if isinstance(x, float) else x)
This also checks whether each cell value is a float and returns the value unchanged if it is not, which is useful because a cell can contain a string or NaN.
You can also first check which columns are of type float, then round those columns:
import math

for col in df.select_dtypes(include=['float']).columns:
    df[col] = df[col].apply(lambda x: x if math.isnan(x) else round(x, 1))
This also guards against trying to round NaN values via the math.isnan(x) check.
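A vectorized alternative sketch, without apply (Series.round passes NaN through unchanged, so no explicit NaN check is needed):
float_cols = df.select_dtypes(include=['float']).columns
df[float_cols] = df[float_cols].round(1)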

Pandas : expanding_apply with groupby

I'm trying to get the expanding mean of the 3 largest values:
import pandas as pd
import numpy as np
np.random.seed(seed=10)
df = pd.DataFrame({'ID': ['foo', 'bar'] * 10,
                   'ORDER': np.arange(20),
                   'VAL': np.random.randn(20)})
df = df.sort(columns=['ID', 'ORDER'])
I have tried the expanding_apply function:
pd.expanding_apply(df['VAL'],lambda x : np.mean(((np.sort(np.array(x)))[-3:])))
It works, but over all my IDs at once, and I need it for each ID separately, so I tried something with groupby, with no success.
I have tried:
df['AVG_MAX3']= df.groupby('ID')['VAL'].apply(pd.expanding_apply(lambda x : np.mean(((np.sort(np.array(x)))[-3:]))))
My expanding mean has to restart from 0 for each ID.
How can I do that? Any suggestions?
Desired output:
ID ORDER VAL exp_mean
bar 1 0.715278974 0.715278974
bar 3 -0.00838385 0.353447562
bar 5 -0.720085561 -0.004396812
bar 7 0.108548526 0.27181455
bar 9 -0.174600211 0.27181455
bar 11 1.203037374 0.675621625
bar 13 1.028274078 0.982196809
bar 15 0.445137613 0.982196809
bar 17 0.135136878 0.982196809
bar 19 -1.079804886 0.982196809
foo 0 1.331586504 1.331586504
foo 2 -1.545400292 -0.106906894
foo 4 0.621335974 0.135840729
foo 6 0.265511586 0.739478021
foo 8 0.004291431 0.739478021
foo 10 0.43302619 0.795316223
foo 12 -0.965065671 0.795316223
foo 14 0.22863013 0.795316223
foo 16 -1.136602212 0.795316223
foo 18 1.484537002 1.145819827
You're close, but you're missing the first argument in pd.expanding_apply when you're calling it in the groupby operation. I pulled your expanding mean into a separate function to make it a little clearer.
In [158]: def expanding_max_mean(x, size=3):
     ...:     return np.mean(np.sort(np.array(x))[-size:])

In [158]: df['exp_mean'] = df.groupby('ID')['VAL'].apply(lambda x: pd.expanding_apply(x, expanding_max_mean))
In [159]: df
Out[159]:
ID ORDER VAL exp_mean
1 bar 1 0.715279 0.715279
3 bar 3 -0.008384 0.353448
5 bar 5 -0.720086 -0.004397
7 bar 7 0.108549 0.271815
9 bar 9 -0.174600 0.271815
11 bar 11 1.203037 0.675622
13 bar 13 1.028274 0.982197
15 bar 15 0.445138 0.982197
17 bar 17 0.135137 0.982197
19 bar 19 -1.079805 0.982197
0 foo 0 1.331587 1.331587
2 foo 2 -1.545400 -0.106907
4 foo 4 0.621336 0.135841
6 foo 6 0.265512 0.739478
8 foo 8 0.004291 0.739478
10 foo 10 0.433026 0.795316
12 foo 12 -0.965066 0.795316
14 foo 14 0.228630 0.795316
16 foo 16 -1.136602 0.795316
18 foo 18 1.484537 1.145820
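Note that pd.expanding_apply (and df.sort) were removed in later pandas releases. A sketch of the modern equivalent using the .expanding() accessor (assumes pandas >= 0.23 for the raw argument):
df = df.sort_values(['ID', 'ORDER'])   # df.sort was removed; use sort_values
df['exp_mean'] = (df.groupby('ID')['VAL']
                    .transform(lambda s: s.expanding()
                                          .apply(lambda x: np.mean(np.sort(x)[-3:]),
                                                 raw=True)))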
