How do I update pandas DataFrames with calculations done group-wise?

Take the following table:
df = pd.DataFrame({'a':[1,1,2,2], 'b':[1,2,3,4], 'c':[10,20,30,40]})
print(df.to_string())
   a  b   c
0  1  1  10
1  1  2  20
2  2  3  30
3  2  4  40
I would like the following result:
result = pd.DataFrame({'a':[1,1,2,2], 'b':[1,2,3,4], 'c':[10,20,30,40], 'group_avg':[13.5,13.5,31.5,31.5]})
print(result.to_string())
   a  b   c  group_avg
0  1  1  10       13.5
1  1  2  20       13.5
2  2  3  30       31.5
3  2  4  40       31.5
That is, group_avg is computed by taking c - b and then averaging the result group-wise, grouping on a.
Is there a nice way of doing this, or do I have to go the roundabout way of creating a new difference column, grouping by a, getting the average, then joining the result on the original table?
What if I want to apply an arbitrary function which takes 2 series, but I want to apply it group-wise?

Try using assign to create a temporary column holding c - b, then groupby with transform:
df['group_avg'] = df.assign(avg=df.c - df.b)\
                    .groupby('a')['avg'].transform('mean')
Output:
   a  b   c  group_avg
0  1  1  10       13.5
1  1  2  20       13.5
2  2  3  30       31.5
3  2  4  40       31.5

Because the mean is linear, the mean of the differences equals the difference of the means, so we can take the group-wise means first and then subtract:
df.join(df.groupby('a').mean().eval('c - b').rename('avg'), on='a')
   a  b   c   avg
0  1  1  10  13.5
1  1  2  20  13.5
2  2  3  30  31.5
3  2  4  40  31.5
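As for the follow-up question about applying an arbitrary function of two Series group-wise, the general-purpose tool is groupby.apply. A minimal sketch, where some_func and group_stat are hypothetical names standing in for your own function and output column:
def some_func(s1, s2):
    # hypothetical example: any reduction of two aligned Series
    return (s1 - s2).mean()

# compute one value per group, then broadcast it back onto the original rows
per_group = df.groupby('a').apply(lambda g: some_func(g['c'], g['b']))
df['group_stat'] = df['a'].map(per_group)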

Related

Pandas: Apply weights to another column, for certain ids only

Let's take this sample dataframe and this list of ids:
df=pd.DataFrame({'Id':['A','A','A','B','C','C','D','D'], 'Weight':[50,20,30,1,2,8,3,2], 'Value':[100,100,100,10,20,20,30,30]})
  Id  Weight  Value
0  A      50    100
1  A      20    100
2  A      30    100
3  B       1     10
4  C       2     20
5  C       8     20
6  D       3     30
7  D       2     30
L = ['A','C']
The Value column holds the same value for every row of a given id in the Id column. For the specific ids in L, I would like to apply the weights from the Weight column to the Value column. I am currently doing it the following way, but it is extremely slow on my real, much larger dataframe:
for i in L:
    df.loc[df["Id"]==i, "Value"] = (df.loc[df["Id"]==i, "Value"] * df.loc[df["Id"]==i, "Weight"] /
                                    df[df["Id"]==i]["Weight"].sum())
How could I do that efficiently?
Expected output:
  Id  Weight  Value
0  A      50     50
1  A      20     20
2  A      30     30
3  B       1     10
4  C       2      4
5  C       8     16
6  D       3     30
7  D       2     30
The idea is to work only on the rows filtered with Series.isin, and to use GroupBy.transform with 'sum' to get per-group sums broadcast back to the same length as the filtered DataFrame:
L = ['A','C']
m = df['Id'].isin(L)
df1 = df[m].copy()
s = df1.groupby('Id')['Weight'].transform('sum')
df.loc[m, 'Value'] = df1['Value'].mul(df1['Weight']).div(s)
print(df)
  Id  Weight  Value
0  A      50   50.0
1  A      20   20.0
2  A      30   30.0
3  B       1   10.0
4  C       2    4.0
5  C       8   16.0
6  D       3   30.0
7  D       2   30.0
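A slightly more compact variant of the same idea, skipping the intermediate copy (a sketch; it should behave identically to the code above):
m = df['Id'].isin(L)
s = df[m].groupby('Id')['Weight'].transform('sum')
df.loc[m, 'Value'] = df.loc[m, 'Value'] * df.loc[m, 'Weight'] / s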

How to create a new boolean column that processes information from the previous n rows

Given a dataframe df, I would like to generate a new variable/column for each row based on the values in the previous n rows (for example previous 3).
For example, given the following:
INPUT
 A  B     C
10  2  59.4
53  3  71.5
32  2  70.4
24  3  82.1
Calculation for D: if, among the current row of C and the previous 3 rows of C, there are 2 or more cells > 70, then 1, else 0.
OUTPUT
 A  B     C  D
10  2  59.4  0
53  3  71.5  0
32  2  70.4  1
24  3  82.1  1
How should I do it in pandas?
IIUC, you should use rolling and build your logic in the apply:
window = 3
df.C.rolling(window).apply(lambda s: 1 if (s>=70).size >= 2 else 0)
0 NaN
1 NaN
2 1.0
3 1.0
You can also use fillna to turn the NaNs into 0:
.fillna(0)
0 0.0
1 0.0
2 1.0
3 1.0
I think @RafaelC's answer is the right approach. I'm adding an answer to (a) provide better example data that covers the edge cases and (b) adjust @RafaelC's syntax slightly. In particular:
min_periods=1 allows early rows, whose positions are smaller than the window, to be non-NaN
window=4 considers the current entry plus the previous 3
sum() instead of size counts only the True values
Updated code:
window = 4
df.C.rolling(window, min_periods=1).apply(lambda x: (x>70).sum()>=2)
Data:
 A  B     C
10  2  59.4
53  3  71.5
32  2  70.4
24  3  82.1
11  4  10.1
10  5   1.0
12  3   2.3
13  2   1.1
99  9  70.2
12  9  80.0
Expected output according to OP rules:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 0.0
8 0.0
9 1.0
Name: C, dtype: float64
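If you want the result written back as the integer column D from the question, a small follow-up (assuming the same df as above):
df['D'] = (df.C.rolling(4, min_periods=1)
               .apply(lambda x: (x > 70).sum() >= 2)
               .astype(int))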

How do I aggregate rows with an upper bound on column value?

I have a pd.DataFrame I'd like to transform:
   id  values  days  time  value_per_day
0   1      15    15     1              1
1   1      20     5     2              4
2   1      12    12     3              1
I'd like to aggregate these into equal buckets of 10 days. Since days at time 1 is larger than 10, the excess should spill into the next bucket, making the value/day of the 2nd bucket an average of the 1st and 2nd rows.
Here is the resulting output, where (values, 0) = 15*(10/15) = 10 and (values, 1) = (15 - 10) + 20 = 25:
   id  values  days  value_per_day
0   1      10    10            1.0
1   1      25    10            2.5
2   1      10    10            1.0
3   1       2     2            1.0
I've tried pd.Grouper:
df.set_index('days').groupby([pd.Grouper(freq='10D', label='right'), 'id']).agg({'values': 'mean'})
Out[146]:
             values
days    id
5 days  1        16
15 days 1        10
But I'm clearly using it incorrectly.
csv for convenience:
id,values,days,time
1,10,15,1
1,20,5,2
1,12,12,3
Note: this is a computationally expensive solution.
import numpy as np

# expand each row into one row per day, label each day with a 10-day bucket,
# then average value_per_day within each bucket
newdf = df.reindex(df.index.repeat(df.days))
v = np.arange(sum(df.days)) // 10
dd = pd.DataFrame({'value_per_day': newdf.groupby(v).value_per_day.mean(),
                   'days': np.bincount(v)})
dd
Out[102]:
   days  value_per_day
0    10            1.0
1    10            2.5
2    10            1.0
3     2            1.0
dd.assign(value=dd.days*dd.value_per_day)
Out[103]:
   days  value_per_day  value
0    10            1.0   10.0
1    10            2.5   25.0
2    10            1.0   10.0
3     2            1.0    2.0
I did not include the groupby on id here; if you need that for your real data, you can loop over df.groupby('id') and apply the steps above within each group, as sketched below.
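A minimal sketch of that per-id loop, assuming the same column names (id, days, value_per_day) as in the question:
import numpy as np
import pandas as pd

pieces = []
for cust_id, g in df.groupby('id'):
    expanded = g.reindex(g.index.repeat(g.days))   # one row per day of the original row
    bucket = np.arange(g.days.sum()) // 10         # 10-day bucket label for each day
    piece = pd.DataFrame({'days': np.bincount(bucket),
                          'value_per_day': expanded.groupby(bucket).value_per_day.mean()})
    piece['id'] = cust_id
    pieces.append(piece)

result = pd.concat(pieces, ignore_index=True)
result['values'] = result.days * result.value_per_day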

Pandas: multiply starting value for one column through each value in another within group

I have a starting value and some future expected growth rates for a number of customers.
Here is a simple sample dataframe:
df = pd.DataFrame([['A',1,10,np.nan],['A',2,10,1.2],['A',3,10,1.15],
                   ['B',1,20,np.nan],['B',2,20,1.05],['B',3,20,1.2]],
                  columns=['Cust','Period','startingValue','Growth'])
print(df)
  Cust  Period  startingValue  Growth
0    A       1             10     NaN
1    A       2             10    1.20
2    A       3             10    1.15
3    B       1             20     NaN
4    B       2             20    1.05
5    B       3             20    1.20
For each Cust, I want to multiply the starting value by the growth rate, then carry that value forward to the next period. I could do this with groupby-apply or an ugly for loop, but I'm hoping there's some faster vectorized method for doing this. I had hoped there would be some .fill() magic, where you could multiply by another column as it fills downwards. Here's what the output should look like:
  Cust  Period  startingValue  Growth  Pred_val
0    A       1             10     NaN      10.0
1    A       2             10    1.20      12.0
2    A       3             10    1.15      13.8
3    B       1             20     NaN      20.0
4    B       2             20    1.05      21.0
5    B       3             20    1.20      25.2
Thoughts?
You can do a cumulative product using the cumprod function:
df['Pred_val'] = df.Growth.fillna(1).groupby(df.Cust).cumprod()*df.startingValue
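To see why this works: fillna(1) turns the missing first-period growth into a multiplicative identity, and the per-group cumulative product carries every growth factor forward; multiplying by startingValue then gives the prediction. The intermediate factors look roughly like this (illustrative, same df as above):
df.Growth.fillna(1).groupby(df.Cust).cumprod()
# 0    1.00
# 1    1.20
# 2    1.38
# 3    1.00
# 4    1.05
# 5    1.26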

Iterate through the rows of a dataframe and reassign minimum values by group

I am working with a dataframe that looks like this.
   id  time  diff
0   0    34   nan
1   0    36     2
2   1    43     7
3   1    55    12
4   1    59     4
5   2     2   -57
6   2    10     8
What is an efficient way to find the minimum values of 'time' by id and then set 'diff' to nan at those minimums? I am looking for a solution that results in:
   id  time  diff
0   0    34   nan
1   0    36     2
2   1    43   nan
3   1    55    12
4   1    59     4
5   2     2   nan
6   2    10     8
groupby('id') and use idxmin to find the locations of the minimum values of 'time'. Finally, use loc to assign np.nan:
df.loc[df.groupby('id').time.idxmin(), 'diff'] = np.nan
df
You can group time by id and compute a boolean vector that is True where the time is the minimum within its group and False otherwise, then use that vector to assign NaN to the corresponding rows:
import numpy as np
import pandas as pd
df.loc[df.groupby('id')['time'].apply(lambda g: g == min(g)), "diff"] = np.nan
df
#    id  time  diff
# 0   0    34   NaN
# 1   0    36   2.0
# 2   1    43   NaN
# 3   1    55  12.0
# 4   1    59   4.0
# 5   2     2   NaN
# 6   2    10   8.0
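The same mask can also be built without apply by comparing time to the group-wise minimum via transform (a sketch; same df and same result as above):
df.loc[df['time'] == df.groupby('id')['time'].transform('min'), 'diff'] = np.nan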
