I have a dataframe which looks like this:
pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
'order_start': [1,2,3,1,2,3,1,2,3,1],
'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})
Out[40]:
category order_start time
0 1 1 1
1 1 2 4
2 1 3 3
3 2 1 6
4 2 2 8
5 2 3 17
6 3 1 14
7 3 2 12
8 3 3 13
9 4 1 16
I would like to create a new column which contains the mean of the previous times of the same category. How can I create it?
The new column should look like this:
pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
'order_start': [1,2,3,1,2,3,1,2,3,1],
'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16],
'mean': [np.nan, 1, 2.5, np.nan, 6, 7, np.nan, 14, 13, np.nan]})
Out[41]:
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0 = 1 / 1
2 1 3 3 2.5 = (4+1)/2
3 2 1 6 NaN
4 2 2 8 6.0 = 6 / 1
5 2 3 17 7.0 = (8+6) / 2
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
Note: If it is the first time, the mean should be NaN.
EDIT: as stated by cs95, my question is not really the same as this one, since here an expanding window is required.
"create a new column which contains the mean of the previous times of the same category" sounds like a good use case for GroupBy.expanding (and a shift):
df['mean'] = (df.groupby('category')['time']
                .apply(lambda x: x.shift().expanding().mean()))
df
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0
2 1 3 3 2.5
3 2 1 6 NaN
4 2 2 8 6.0
5 2 3 17 7.0
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
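As a side note, if you want to guarantee that the result comes back aligned to the original index (rather than relying on how apply stitches the groups back together), a transform-based variant of the same logic should also work (a sketch, not part of the original answer):
# Sketch: same shift + expanding mean, but via transform, which always returns
# a result aligned to df's index.
df['mean'] = (df.groupby('category')['time']
                .transform(lambda x: x.shift().expanding().mean()))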
Another way to calculate this without the apply is to chain two groupby calls:
df['mean'] = (df.groupby('category')['time']
                .shift()
                .groupby(df['category'])
                .expanding()
                .mean()
                .to_numpy())  # drops the (category, index) MultiIndex; use `.values` instead for pandas < 0.24
df
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0
2 1 3 3 2.5
3 2 1 6 NaN
4 2 2 8 6.0
5 2 3 17 7.0
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
In terms of performance, it really depends on the number and size of your groups.
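If you want to check on your own data, a small illustrative timing harness is sketched below; the two functions simply wrap the snippets above, and the actual numbers will depend on your frame:
import timeit

def with_apply(df):
    return df.groupby('category')['time'].apply(
        lambda x: x.shift().expanding().mean())

def without_apply(df):
    return (df.groupby('category')['time'].shift()
              .groupby(df['category'])
              .expanding().mean()
              .to_numpy())

# results depend on the number and size of the groups
print(timeit.timeit(lambda: with_apply(df), number=100))
print(timeit.timeit(lambda: without_apply(df), number=100))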
Inspired by my answer here, one can define a function first:
def mean_previous(df, Category, Order, Var):
    # Order the dataframe first
    df.sort_values([Category, Order], inplace=True)

    # Calculate the ordinary grouped cumulative sum
    # and then subtract the grouped cumulative sum of the last order
    csp = df.groupby(Category)[Var].cumsum() - df.groupby([Category, Order])[Var].cumsum()

    # Calculate the ordinary grouped cumulative count
    # and then subtract the grouped cumulative count of the last order
    ccp = df.groupby(Category)[Var].cumcount() - df.groupby([Category, Order]).cumcount()

    return csp / ccp
And the desired column is:
df['mean'] = mean_previous(df, 'category', 'order_start', 'time')
Performance-wise, I believe it's very fast, since it only uses vectorized cumulative operations and avoids a Python-level function call per group.
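As a quick sanity check (a sketch, run against the example frame from the question), the resulting column matches the expected values listed above:
print(df['mean'].tolist())
# [nan, 1.0, 2.5, nan, 6.0, 7.0, nan, 14.0, 13.0, nan]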
Related
Given a dataframe df: pd.DataFrame and a subset selected_indexes of indexes from df.index, how can I resample df with the max operator applied to each interval (selected_indexes[i], selected_indexes[i+1]]?
For example, given a dataframe:
col
0 5
1 0
2 3
3 3
4 7
5 9
6 3
7 5
8 2
9 4
Given the selection selected_indexes = [0, 5, 6, 9] and applying the maximum to the col column over each interval (keeping the end point and excluding the starting point), we should get:
col
0 5
5 9
6 3
9 5
For example, line 9 was made with max(5, 2, 4) from lines 7, 8, 9, which fall in the interval (6, 9].
new interpretation
selected_indexes = [0, 5, 6, 9]
group = (df.index.to_series().shift()   # make groups
           .isin(selected_indexes)      # based on
           .cumsum()                    # previous indices
        )
# get max per group
out = df.groupby(group).max().set_axis(selected_indexes)
# or for many aggregations (see comments):
out = (df.groupby(group).agg({'col1': 'max', 'col2': 'min'})
         .set_axis(selected_indexes)
      )
Output:
col
0 5
5 9
6 3
9 5
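To see how the grouping works, it can help to print the intermediate group series for the df and selected_indexes above (an illustrative sketch); each selected index closes a group:
group = (df.index.to_series().shift()
           .isin(selected_indexes)
           .cumsum())
print(group.tolist())
# [0, 1, 1, 1, 1, 1, 2, 3, 3, 3]
# group 0 = index 0, group 1 = (0, 5], group 2 = (5, 6], group 3 = (6, 9]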
previous interpretation of the question
You likely need a rolling.max, not resample:
out = df.loc[selected_indexes].rolling(3, center=True).max()
Or, if you want the ±1 to apply to the data before selection:
out = df.rolling(3, center=True).max().loc[selected_indexes]
Example:
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 10)})
selected_indexes = [1, 2, 3, 5, 6, 8, 9]
print(df)
col
0 5
1 0
2 3
3 3
4 7
5 9
6 3
7 5
8 2
9 4
out = df.loc[selected_indexes].rolling(3, center=True).max()
print(out)
col
1 NaN
2 3.0
3 9.0
5 9.0
6 9.0
8 4.0
9 NaN
out2 = df.rolling(3, center=True).max().loc[selected_indexes]
print(out2)
col
1 5.0
2 3.0
3 7.0
5 9.0
6 9.0
8 5.0
9 NaN
Let's say input was
d = {'col1': [1,2,3,4,5,6,7,8,9,10],
'col2': [1,2,3,4,5,6,7,8,9,10],
'col3': [1,2,3,4,5,6,7,8,9,10],
'offset': [1,2,3,1,2,3,1,2,3,1]}
df = pd.DataFrame(data=d)
I want to create an additional column that looks like this:
df['output'] = [1, 4, 9, 4, 10, 18, 7, 16, 27, 10]
Basically, each number in offset tells you the number of columns to sum over (with col1 as the reference point).
Is there a vectorized way to do this without iterating through each value in offset?
You can use np.select. To use it, create the sums over the first 1, 2, 3, ... columns (as needed) as the possible choices, and create a boolean mask for each value in the offset column as the possible conditions.
# get all possible values from offset
lOffset = df['offset'].unique()
# get the result with np.select
df['output'] = np.select(
    # create a mask for each value in offset
    condlist=[df['offset'].eq(i) for i in lOffset],
    # create the sum over the number of columns per offset value
    choicelist=[df.iloc[:, :i].sum(axis=1) for i in lOffset]
)
print(df)
# col1 col2 col3 offset output
# 0 1 1 1 1 1
# 1 2 2 2 2 4
# 2 3 3 3 3 9
# 3 4 4 4 1 4
# 4 5 5 5 2 10
# 5 6 6 6 3 18
# 6 7 7 7 1 7
# 7 8 8 8 2 16
# 8 9 9 9 3 27
# 9 10 10 10 1 10
Note: this assumes that your offset column is the last one.
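If offset is not guaranteed to be the last column, a small variation (a sketch, reusing lOffset from the snippet above) is to restrict the sums to the value columns explicitly:
# Sketch: select only the colN columns so the position of `offset` no longer matters.
value_cols = df.filter(like='col')
df['output'] = np.select(
    condlist=[df['offset'].eq(i) for i in lOffset],
    choicelist=[value_cols.iloc[:, :i].sum(axis=1) for i in lOffset]
)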
It can be done with pd.crosstab: mask the 0s to NaN and back-fill along the columns, which leaves a 1 in every column that needs to be included in the sum:
mask = (pd.crosstab(df.index, df.offset)   # one indicator column per offset value
          .mask(lambda x: x == 0)          # 0 -> NaN
          .bfill(axis=1)                   # back-fill so every column <= offset becomes 1
          .values == 1)
df['new'] = df.filter(like='col').where(mask).sum(axis=1)
Out[710]:
0 1.0
1 4.0
2 9.0
3 4.0
4 10.0
5 18.0
6 7.0
7 16.0
8 27.0
9 10.0
dtype: float64
I have a one column dataframe which looks like this:
Neive Bayes
0 8.322087e-07
1 3.213342e-24
2 4.474122e-28
3 2.230054e-16
4 3.957606e-29
5 9.999992e-01
6 3.254807e-13
7 8.836033e-18
8 1.222642e-09
9 6.825381e-03
10 5.275194e-07
11 2.224289e-06
12 2.259303e-09
13 2.014053e-09
14 1.755933e-05
15 1.889681e-04
16 9.929193e-01
17 4.599619e-05
18 6.944654e-01
19 5.377576e-05
I want to pivot it to wide format, but with specific intervals. The first 9 rows should make up the 9 columns of the first row, and this pattern should continue until the final table has 9 columns and about 9 times fewer rows than now. How would I achieve this?
Using pivot_table:
df.pivot_table(columns=df.index % 9, index=df.index // 9, values='Neive Bayes')
0 1 2 3 4 \
0 8.322087e-07 3.213342e-24 4.474122e-28 2.230054e-16 3.957606e-29
1 6.825381e-03 5.275194e-07 2.224289e-06 2.259303e-09 2.014053e-09
2 6.944654e-01 5.377576e-05 NaN NaN NaN
5 6 7 8
0 0.999999 3.254807e-13 8.836033e-18 1.222642e-09
1 0.000018 1.889681e-04 9.929193e-01 4.599619e-05
2 NaN NaN NaN NaN
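The trick here is that df.index // 9 labels which output row each value belongs to and df.index % 9 labels which output column. A quick illustration (assuming a 20-row frame like the one above):
import numpy as np

idx = np.arange(20)
print((idx // 9).tolist())  # nine 0s, nine 1s, then two 2s -> output row
print((idx % 9).tolist())   # 0..8, 0..8, 0, 1              -> output column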
Construct a MultiIndex, then set_index and unstack:
iix = pd.MultiIndex.from_arrays([np.arange(df.shape[0]) // 9,
                                 np.arange(df.shape[0]) % 9])
df_wide = df.set_index(iix)['Neive Bayes'].unstack()
Out[204]:
0 1 2 3 4 \
0 8.322087e-07 3.213342e-24 4.474122e-28 2.230054e-16 3.957606e-29
1 6.825381e-03 5.275194e-07 2.224289e-06 2.259303e-09 2.014053e-09
2 6.944654e-01 5.377576e-05 NaN NaN NaN
5 6 7 8
0 0.999999 3.254807e-13 8.836033e-18 1.222642e-09
1 0.000018 1.889681e-04 9.929193e-01 4.599619e-05
2 NaN NaN NaN NaN
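Another way (a sketch, not from the original answers) is to pad the values up to a multiple of 9 with NaN and reshape with numpy:
import numpy as np
import pandas as pd

n = -(-len(df) // 9) * 9                 # next multiple of 9 (ceiling division)
vals = np.full(n, np.nan)
vals[:len(df)] = df['Neive Bayes'].to_numpy()
df_wide = pd.DataFrame(vals.reshape(-1, 9))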
I have a dataframe that I'd like to calculate expanding mean over one column (quiz_score), but need to group by two different columns (userid and week). The data looks like this:
data = {"userid": ['1','1','1','1','1','1','1','1', '2','2','2','2','2','2','2','2'],\
"week": [1,1,2,2,3,3,4,4, 1,2,2,3,3,4,4,5],\
"quiz_score": [12, 14, 14, 15, 9, 15, 11, 14, 15, 14, 15, 13, 15, 10, 14, 14]}
>>> df = pd.DataFrame(data, columns = ['userid', 'week', 'quiz_score'])
>>> df
userid week quiz_score
0 1 1 12
1 1 1 14
2 1 2 14
3 1 2 15
4 1 3 9
5 1 3 15
6 1 4 11
7 1 4 14
8 2 1 15
9 2 2 14
10 2 2 15
11 2 3 13
12 2 3 15
13 2 4 10
14 2 4 14
15 2 5 14
I need to calculate expanding means by userid over each week; that is, for each user each week, I need their average quiz score over the preceding weeks. I know that a solution will involve using shift() and pd.expanding_mean() or .expanding().mean() in some form, but I've been unable to get the grouping and shifting correct. Even when I try without shifting, the results aren't grouped properly and seem to be an expanding mean across all rows, as if there were no grouping at all:
df.groupby(['userid', 'week']).apply(pd.expanding_mean).reset_index()
To be clear, the correct result would look like this:
userid week expanding_mean_quiz_score
0 1 1 NA
1 1 2 13
2 1 3 13.75
3 1 4 13.166666
4 1 5 13
5 1 6 13
6 2 1 NA
7 2 2 15
8 2 3 14.666666
9 2 4 14.4
10 2 5 13.714
11 2 6 13.75
Note that the expanding_mean_quiz_score for each user/week is the mean of the scores for that user across all previous weeks.
Thanks for your help, I've never used expanding_mean() and am stumped here.
You can group by userid and week and keep track of the total score and count for those groupings. Then take the cumulative sum of both per user, shifted by one week so the current week is excluded. Finally, get the desired column by dividing the two accumulations.
a=df.groupby(['userid', 'week'])['quiz_score'].agg(('sum', 'count'))
a = a.reindex(pd.MultiIndex.from_product([['1', '2'], range(1,7)], names=['userid', 'week']))
b = a.groupby(level=0).cumsum().groupby(level=0).shift(1)
b['em_quiz_score'] = b['sum'] / b['count']
c = b.reset_index().drop(['count', 'sum'], axis=1)
d = c.groupby('userid').fillna(method='ffill')
d['userid'] = c['userid']
d = d[['userid', 'week', 'em_quiz_score']]
userid week em_quiz_score
0 1 1 NaN
1 1 2 13.000000
2 1 3 13.750000
3 1 4 13.166667
4 1 5 13.000000
5 1 6 13.000000
6 2 1 NaN
7 2 2 15.000000
8 2 3 14.666667
9 2 4 14.400000
10 2 5 13.714286
11 2 6 13.750000
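A quick sanity check (a sketch against the sample data): for user '1' in week 3, the expanding mean should be the average of that user's week-1 and week-2 scores:
check = df.loc[(df['userid'] == '1') & (df['week'] < 3), 'quiz_score'].mean()
print(check)  # (12 + 14 + 14 + 15) / 4 = 13.75, matching the table above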