Pandas groupby cumulative sum - python
I would like to add a cumulative sum column to my Pandas dataframe so that:
name  day        no
Jack  Monday     10
Jack  Tuesday    20
Jack  Tuesday    10
Jack  Wednesday  50
Jill  Monday     40
Jill  Wednesday  110
becomes:
Jack | Monday | 10 | 10
Jack | Tuesday | 30 | 40
Jack | Wednesday | 50 | 90
Jill | Monday | 40 | 40
Jill | Wednesday | 110 | 150
I tried various combos of df.groupby and df.agg(lambda x: cumsum(x)) to no avail.
This should do it; you need groupby() twice:
df.groupby(['name', 'day']).sum() \
.groupby(level=0).cumsum().reset_index()
Explanation:
print(df)
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
# sum per name/day
print( df.groupby(['name', 'day']).sum() )
no
name day
Jack Monday 10
Tuesday 30
Wednesday 50
Jill Monday 40
Wednesday 110
# cumulative sum per name (level 0 of the index)
print( df.groupby(['name', 'day']).sum() \
.groupby(level=0).cumsum() )
no
name day
Jack Monday 10
Tuesday 40
Wednesday 90
Jill Monday 40
Wednesday 150
The dataframe resulting from the first sum is indexed by 'name' and by 'day'. You can see it by printing
df.groupby(['name', 'day']).sum().index
When computing the cumulative sum, you want to do so by 'name', corresponding to the first index (level 0).
Finally, use reset_index to have the names repeated.
df.groupby(['name', 'day']).sum().groupby(level=0).cumsum().reset_index()
name day no
0 Jack Monday 10
1 Jack Tuesday 40
2 Jack Wednesday 90
3 Jill Monday 40
4 Jill Wednesday 150
Modification to @Dmitry's answer. This is simpler and works in pandas 0.19.0:
print(df)
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
df['no_csum'] = df.groupby(['name'])['no'].cumsum()
print(df)
name day no no_csum
0 Jack Monday 10 10
1 Jack Tuesday 20 30
2 Jack Tuesday 10 40
3 Jack Wednesday 50 90
4 Jill Monday 40 40
5 Jill Wednesday 110 150
This works in pandas 0.16.2
In[23]: print df
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
In[24]: df['no_cumulative'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
In[25]: print df
name day no no_cumulative
0 Jack Monday 10 10
1 Jack Tuesday 20 30
2 Jack Tuesday 10 40
3 Jack Wednesday 50 90
4 Jill Monday 40 40
5 Jill Wednesday 110 150
You should use
df['cum_no'] = df.no.cumsum()
http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.cumsum.html
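Note that df.no.cumsum() runs over the entire column, ignoring the name groups. A small sketch (reusing the question's pre-aggregated data) contrasting it with the grouped version:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
    'day': ['Monday', 'Tuesday', 'Wednesday', 'Monday', 'Wednesday'],
    'no': [10, 30, 50, 40, 110],
})

# Plain cumsum: one running total for the whole column,
# so Jill's rows continue from Jack's total
df['cum_no'] = df['no'].cumsum()                          # 10, 40, 90, 130, 240

# Grouped cumsum: the running total restarts for each name
df['cum_no_grouped'] = df.groupby('name')['no'].cumsum()  # 10, 40, 90, 40, 150

print(df)
```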
Another way of doing it
import pandas as pd
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
                   'C2' : [1,2,3,4,5]})
df['cumsum'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.cumsum())
df
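As a quick check of the snippet above: transform preserves the original index, so the result aligns row-for-row with df and the running total restarts at each change of C1 (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'],
                   'C2': [1, 2, 3, 4, 5]})

# transform returns a result aligned with df's index,
# so it can be assigned straight back as a column
df['cumsum'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.cumsum())

print(df['cumsum'].tolist())  # [1, 3, 6, 4, 9]
```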
Instead of df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum()
(see above) you could also do a df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()
df.groupby(by=['name','day']).sum() is essentially just aggregating duplicate pairs and moving both key columns into a MultiIndex
as_index=False means you do not need to call reset_index afterwards
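A sketch checking that the set_index route and the double-groupby route agree; note this assumes the (name, day) pairs are already unique — with duplicate pairs the set_index route skips the summing step, so it keeps one row per input row:

```python
import pandas as pd

# Question's data, pre-aggregated so each (name, day) pair is unique
df = pd.DataFrame({
    'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
    'day': ['Monday', 'Tuesday', 'Wednesday', 'Monday', 'Wednesday'],
    'no': [10, 30, 50, 40, 110],
})

a = (df.groupby(by=['name', 'day']).sum()
       .groupby(level=[0]).cumsum()
       .reset_index())

b = (df.set_index(['name', 'day'])
       .groupby(level=0, as_index=False).cumsum()
       .reset_index())

print(a.equals(b))       # True
print(a['no'].tolist())  # [10, 40, 90, 40, 150]
```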
data.csv:
name,day,no
Jack,Monday,10
Jack,Tuesday,20
Jack,Tuesday,10
Jack,Wednesday,50
Jill,Monday,40
Jill,Wednesday,110
Code:
import numpy as np
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
df = df.groupby(['name', 'day'])['no'].sum().reset_index()
print(df)
df['cumsum'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
print(df)
Output:
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
name day no
0 Jack Monday 10
1 Jack Tuesday 30
2 Jack Wednesday 50
3 Jill Monday 40
4 Jill Wednesday 110
name day no cumsum
0 Jack Monday 10 10
1 Jack Tuesday 30 40
2 Jack Wednesday 50 90
3 Jill Monday 40 40
4 Jill Wednesday 110 150
As of version 1.0, pandas has a new API for window functions.
specifically, what was achieved earlier with
df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
or
df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()
now becomes
df.groupby(['name'])['no'].expanding().sum()
I find it more intuitive than groupby + level operations for all window-related functions,
although learning to use groupby is still useful for general-purpose work.
see docs:
https://pandas.pydata.org/docs/user_guide/window.html
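One caveat with the expanding() form (my observation, not from the original answer): the result carries the group key as an extra index level and comes back as float, so the extra level has to be dropped before assigning it back to the frame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
    'day': ['Monday', 'Tuesday', 'Wednesday', 'Monday', 'Wednesday'],
    'no': [10, 30, 50, 40, 110],
})

s = df.groupby(['name'])['no'].expanding().sum()
print(s.index.nlevels)  # 2: ('name', original row label)

# Drop the group-key level so the result aligns with df's index again
df['no_csum'] = s.reset_index(level=0, drop=True)
print(df['no_csum'].tolist())  # [10.0, 40.0, 90.0, 40.0, 150.0]
```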
If you want to write a one-liner (perhaps you want to pass the methods into a pipeline), you can do so by first setting the as_index parameter of the groupby method to False, so that the aggregation step returns a dataframe, and then using assign() to attach a new column to it (the cumulative sum for each person).
These chained methods return a new dataframe, so you'll need to assign it to a variable (e.g. agg_df) to be able to use it later on.
agg_df = (
# aggregate df by name and day
df.groupby(['name','day'], as_index=False)['no'].sum()
.assign(
# assign the cumulative sum of each name as a new column
cumulative_sum=lambda x: x.groupby('name')['no'].cumsum()
)
)
Related
Add new column with time only in Dataframe Pandas
So I have this DataFrame that has been scraped from a website. What I want to achieve is to add a new column with only the time in it, as HH:mm, each time it scrapes new data. But the code I use gives date and time:

import pandas as pd
data = {'Name':['Tom', 'nick', 'krish', 'jack','Rick', 'John'],
        'Age':[20, 21, 19, 18, 26, 23]}
df = pd.DataFrame(data)
df['time'] = pd.date_range('11/5/2020', periods = 6, freq ='2H')
df

Data I have:

    Name  Age                time
0    Tom   20 2020-11-05 01:00:00
1   nick   21 2020-11-05 01:00:00
2  krish   19 2020-11-05 01:00:00
3   jack   18 2020-11-05 01:00:00
4   Rick   26 2020-11-05 01:00:00
5   John   23 2020-11-05 01:00:00

Result I want:

    Name  Age   time
0    Tom   20  01:00
1   nick   21  01:00
2  krish   19  01:00
3   jack   18  01:00
4   Rick   26  01:00
5   John   23  01:00

As the data is scraped every 15 minutes, I want to append the new data with the time, without the need to add the header again, like below:

    Name  Age   time
0    Tom   20  01:00
1   nick   21  01:00
2  krish   19  01:00
3   jack   18  01:00
4   Rick   26  01:00
5   John   23  01:00
0    Tom   20  01:15
1   nick   21  01:15
2  krish   19  01:15
3   jack   18  01:15
4   Rick   26  01:15
5   John   23  01:15
If I understand correctly, here is what you want:

lst_date_range = pd.date_range('11/5/2020', periods = 6, freq ='2H')
df['time'] = [date_time.strftime("%H:%M") for date_time in lst_date_range]

Using .strftime("%H:%M") lets you select only the hour and minute.

From:
   Name  Age                time
0   Tom   20 2020-11-05 01:00:00

To:
   Name  Age   time
0   Tom   20  01:00

But I do not understand what you are asking about appending without the need to add a header; please explain more if you need help there.
How to add values from two dataframes according to the persons in first column? Python Pandas
I have two dataframes:

df1
   Name  2010  2011
0  Jack    25    35
1  Jill    15    20

df2
    Name  2010  2011
0  Berry    45    25
1   Jack     5    10

I want to create a third dataframe by adding the values in these dataframes.

Desired output:

df3
    Name  2010  2011
0   Jack    30    45   # add the values from df1 and df2
1   Jill    15    20
2  Berry    45    25

I have used this code:

df1.add(df2)
Concat both dfs and do a groupby and sum:

print (pd.concat([df1, df2]).groupby("Name", as_index=False).sum())

    Name  2010  2011
0  Berry    45    25
1   Jack    30    45
2   Jill    15    20
Compare averages of a values corresponding to 2 different dates?
I have a table like this:

Date         Student  Average (for that date)
17 Jan 2020  Alex     40
18 Jan 2020  Alex     50
19 Jan 2020  Alex     80
20 Jan 2020  Alex     70
17 Jan 2020  Jeff     10
18 Jan 2020  Jeff     50
19 Jan 2020  Jeff     80
20 Jan 2020  Jeff     60

I want to add a column for high and low. The column should be "high" as long as the student's average score for the current date is greater than 90% of the previous day's average. My comparison would look something like this:

avg(score, current date) > 0.90 * avg(score, previous day)

I can't figure out how to incorporate the date part into my formula, so that it compares the current day's average to the previous date's average. I am working with Pandas, so I was wondering if there is a way to do this.
IIUC,

df['Previous Day'] = df.sort_values('Date').groupby('Student')['Average'].shift()*.90
df['Indicator'] = np.where(df['Average']>df['Previous Day'],'High','Low')
df

Output:

        Date Student  Average  Previous Day Indicator
0 2020-01-17    Alex       40           NaN       Low
1 2020-01-18    Alex       50          36.0      High
2 2020-01-19    Alex       80          45.0      High
3 2020-01-20    Alex       70          72.0       Low
4 2020-01-17    Jeff       10           NaN       Low
5 2020-01-18    Jeff       50           9.0      High
6 2020-01-19    Jeff       80          45.0      High
7 2020-01-20    Jeff       60          72.0       Low
Pandas Python Groupby Cummulative Sum Reverse
I have found Pandas groupby cumulative sum and found it very useful. However, I would like to determine how to calculate a reverse cumulative sum. The link suggests the following:

df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum()

In order to reverse sum, I tried slicing the data, but it fails:

df.groupby(by=['name','day']).ix[::-1, 'no'].sum().groupby(level=[0]).cumsum()

Jack | Monday    | 10 | 90
Jack | Tuesday   | 30 | 80
Jack | Wednesday | 50 | 50
Jill | Monday    | 40 | 80
Jill | Wednesday | 40 | 40

EDIT: Based on feedback, I tried to implement the code and make the dataframe larger:

import pandas as pd
df = pd.DataFrame(
    {'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
     'surname' : ['Jones','Jones','Jones','Smith','Smith'],
     'car' : ['VW','Mazda','VW','Merc','Merc'],
     'country' : ['UK','US','UK','EU','EU'],
     'year' : [1980,1980,1980,1980,1980],
     'day': ['Monday', 'Tuesday','Wednesday','Monday','Wednesday'],
     'date': ['2016-02-31','2016-01-31','2016-01-31','2016-01-31','2016-01-31'],
     'no': [10,30,50,40,40],
     'qty' : [100,500,200,433,222]})

I then try and group on a number of columns, but it fails to apply the grouping:

df = df.groupby(by=['name','surname','car','country','year','day','date']).sum().iloc[::-1].groupby(level=[0]).cumsum().iloc[::-1].reset_index()

Why is this the case? I expect Jack Jones with car Mazda to be a separate cumulative quantity from Jack Jones with a VW.
You can use double iloc:

df = df.groupby(by=['name','day']).sum().iloc[::-1].groupby(level=[0]).cumsum().iloc[::-1]
print (df)

                 no
name day
Jack Monday      90
     Tuesday     80
     Wednesday   50
Jill Monday      80
     Wednesday   40

For adding the result as another column the solution is simpler:

df = df.groupby(by=['name','day']).sum()
df['new'] = df.iloc[::-1].groupby(level=[0]).cumsum()
print (df)

                 no  new
name day
Jack Monday      10   90
     Tuesday     30   80
     Wednesday   50   50
Jill Monday      40   80
     Wednesday   40   40

EDIT: The problem is in the second groupby — you need to pass more levels. level=[0,1,2] means group by the first (name), second (surname) and third (car) index levels.

df1 = (df.groupby(by=['name','surname','car','country','year','day','date'])
         .sum())
print (df1)

                                                       no  qty
name surname car    country year day       date
Jack Jones   Mazda  US      1980 Tuesday   2016-01-31  30  500
             VW     UK      1980 Monday    2016-02-31  10  100
                                 Wednesday 2016-01-31  50  200
Jill Smith   Merc   EU      1980 Monday    2016-01-31  40  433
                                 Wednesday 2016-01-31  40  222

df2 = (df.groupby(by=['name','surname','car','country','year','day','date'])
         .sum()
         .iloc[::-1]
         .groupby(level=[0,1,2])
         .cumsum()
         .iloc[::-1]
         .reset_index())
print (df2)

   name surname    car country  year        day        date  no  qty
0  Jack   Jones  Mazda      US  1980    Tuesday  2016-01-31  30  500
1  Jack   Jones     VW      UK  1980     Monday  2016-02-31  60  300
2  Jack   Jones     VW      UK  1980  Wednesday  2016-01-31  50  200
3  Jill   Smith   Merc      EU  1980     Monday  2016-01-31  80  655
4  Jill   Smith   Merc      EU  1980  Wednesday  2016-01-31  40  222

Or it is possible to select the levels by name — see the groupby enhancements in 0.20.1+:

df2 = (df.groupby(by=['name','surname','car','country','year','day','date'])
         .sum()
         .iloc[::-1]
         .groupby(['name','surname','car'])
         .cumsum()
         .iloc[::-1]
         .reset_index())
print (df2)

   name surname    car country  year        day        date  no  qty
0  Jack   Jones  Mazda      US  1980    Tuesday  2016-01-31  30  500
1  Jack   Jones     VW      UK  1980     Monday  2016-02-31  60  300
2  Jack   Jones     VW      UK  1980  Wednesday  2016-01-31  50  200
3  Jill   Smith   Merc      EU  1980     Monday  2016-01-31  80  655
4  Jill   Smith   Merc      EU  1980  Wednesday  2016-01-31  40  222
Python - Pandas - Unroll / Remove Cumulative Sum
I have a data frame like the following (specific data below, this is generic). The no column gives me a cumulative sum:

                 no
name day
Jack Monday      10
     Tuesday     40
     Wednesday   90
Jill Monday      40
     Wednesday  150

I want to "unroll" the cumulative sum to give me something like this:

print df
   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   30
2  Jack  Wednesday   50
3  Jill     Monday   40
4  Jill  Wednesday  110

In essence, I'd like to do something like the following, but in reverse: Pandas groupby cumulative sum
If I understand correctly you can do the following:

In [103]: df.groupby(level=0).diff().fillna(df).reset_index()
Out[103]:
   name        day     no
0  Jack     Monday   10.0
1  Jack    Tuesday   30.0
2  Jack  Wednesday   50.0
3  Jill     Monday   40.0
4  Jill  Wednesday  110.0

So: groupby the first index level, call diff to calculate the inter-row differences per group, fill the NaN values with the original df values, and call reset_index.
Here's a method based on zip. It pairs each value with the one before it (offset by 1) and subtracts the two; note it ignores the name groups, so unlike the groupby + diff approach it only works within a single group:

[n - p for n, p in zip(df['no'], [0] + list(df['no'][:-1]))]