I have a data frame like the following (specific data below, this is generic). The no column gives me a cumulative sum:
                no
name day
Jack Monday     10
     Tuesday    40
     Wednesday  90
Jill Monday     40
     Wednesday 150
I want to "unroll" the cumulative sum to give me something like this:
print df
name day no
0 Jack Monday 10
1 Jack Tuesday 30
2 Jack Wednesday 50
3 Jill Monday 40
4 Jill Wednesday 110
In essence, I'd like to do something like the following, but in reverse:
Pandas groupby cumulative sum
If I understand correctly you can do the following:
In [103]:
df.groupby(level=0).diff().fillna(df).reset_index()
Out[103]:
name day no
0 Jack Monday 10.0
1 Jack Tuesday 30.0
2 Jack Wednesday 50.0
3 Jill Monday 40.0
4 Jill Wednesday 110.0
So: group by the first index level, call diff to calculate the inter-row differences per group, fill the NaN values with the original df values, and call reset_index.
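For a quick end-to-end run, here is a minimal, self-contained sketch of that approach; the cumulative frame is rebuilt from the question's data as a name/day MultiIndex (the construction itself is just for illustration):

import pandas as pd

# Rebuild the cumulative-sum frame from the question, indexed by name and day.
df = pd.DataFrame(
    {'no': [10, 40, 90, 40, 150]},
    index=pd.MultiIndex.from_tuples(
        [('Jack', 'Monday'), ('Jack', 'Tuesday'), ('Jack', 'Wednesday'),
         ('Jill', 'Monday'), ('Jill', 'Wednesday')],
        names=['name', 'day']))

# Per-name differences undo the cumulative sum; fillna restores the first row of each group.
print(df.groupby(level=0).diff().fillna(df).reset_index())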
Here's a method based on zip. It creates two sequences, the second offset by one, and takes the difference between the two (note that this is a plain pairwise difference, so it does not reset at group boundaries):
nos = df['no'].tolist()
[n - prev for n, prev in zip(nos, [0] + nos[:-1])]
Related
I'm very new to Python and I have two dataframes. I'm trying to match the "Names" (i.e. the columns of dataframe 1) with the rows of dataframe 2 and collect the value for the year 2022, with the hoped-for output looking like Dataframe 3. I've tried looking through other queries but haven't found anything to help; any help would be greatly appreciated!
Dataframe 1 - Money:
Date  Alex  Rob  Kev  Ben
2022    29   45   65   12
2021    11   32   11   19
2019    45   12   22   76

Dataframe 2:
Name
James
Alex
Carl
Rob
Kev

Dataframe 3 (desired):
Name   Amount
James
Alex       29
Carl
Rob        45
Kev        65
There are many different ways to achieve this.
One option is using map:
s = df1.set_index('Date').loc[2022]
df2['Amount'] = df2['Name'].map(s)
output:
Name Amount
0 James NaN
1 Alex 29.0
2 Carl NaN
3 Rob 45.0
4 Kev 65.0
Another option is using merge:
s = df1.set_index('Date').loc[2022]
df3 = df2.merge(s.rename('Amount'), left_on='Name', right_index=True, how='left')
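For reference, a minimal, self-contained sketch of both options; df1 and df2 are rebuilt here from the question's tables purely for illustration:

import pandas as pd

df1 = pd.DataFrame({'Date': [2022, 2021, 2019],
                    'Alex': [29, 11, 45],
                    'Rob': [45, 32, 12],
                    'Kev': [65, 11, 22],
                    'Ben': [12, 19, 76]})
df2 = pd.DataFrame({'Name': ['James', 'Alex', 'Carl', 'Rob', 'Kev']})

# Option 1: map the 2022 row (a Series indexed by name) onto df2['Name'].
s = df1.set_index('Date').loc[2022]
df2['Amount'] = df2['Name'].map(s)

# Option 2: left-merge the same Series, keeping every name in df2
# (df2[['Name']] is used here only because df2 already gained an Amount column above).
df3 = df2[['Name']].merge(s.rename('Amount'), left_on='Name', right_index=True, how='left')
print(df3)

Both options leave NaN for names such as James and Carl that have no matching column in df1.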
I have a data frame (df) with columns named date, Year, Month, Day, Hour and Energy. It is a multi-year time series that I want to convert into an averaged single-year time series (8760 points, i.e. 365 * 24), where the column Energy_Mean holds the averaged value.
df is
date Year Month Day Hour Energy
1/1/1999 0:00 1999 Jan 1 1 45.0
1/1/1999 1:00 1999 Jan 1 2 73.5
1/1/1999 2:00 1999 Jan 1 3 82.4
1/1/1999 3:00 1999 Jan 1 4 90.0
1/1/1999 4:00 1999 Jan 1 5 72.2
.
.
.
12/31/1999 23:00 1999 Dec 12 24 77.0
.
.
.
12/31/2019 23:00 2019 Dec 12 24 84.3
Task is to convert it into an averaged form as shown below:
Month Day Hour Energy_Mean
Jan 1 1 22.45
Jan 1 2 73.5
Jan 1 3 57.4
Jan 1 4 88.0
Jan 1 5 33.2
.
.
.
Dec 31 24 77.0
I'm trying to figure out whether pivot_table or groupby is the better pandas method for converting the time series into an 8760-row data frame. Additionally, I want the output sorted by calendar month, NOT alphabetically: Jan, Feb, March, April rather than April, August, ...
My Code is:
p50_8760 = df.groupby(['Month', 'Day', 'Hour'])['Energy'].mean()
df_p50_8760 = p50_8760.to_frame()
The output does not have column names, nor does it contain the expected 8760 data points.
According to the response in the SO question Pandas: group by and Pivot table difference, pivot_table and groupby are both equally well suited here, as they only differ in the shape of the result.
So pick that one that you find easier to work with.
For my example I will use a pivot_table.
In order to sort by month number instead of alphabetically by name, I add an additional column 'Month_ind'. You could do the mapping by hand, of course; since we already have a datetime column present, I chose to let Pandas do this step.
The numerical column 'Month_ind' can then be used to sort in the end:
df = pd.read_csv('data/multi_year_ts.csv')
df['date'] = pd.to_datetime(df['date']) # convert column to datetime
df['Month_ind'] = df['date'].map(lambda e: e.month)
pivot = pd.pivot_table(df, index=['Month_ind', 'Day', 'Hour'], columns=['Year'], values=['Energy'])
print(pivot.sort_values('Month_ind'))
Result:
Energy
Year 1999 2005 2007 2019
Month_ind Day Hour
1 1 1 45.0 60.4 55.2 NaN
2 73.5 NaN NaN NaN
3 82.4 NaN NaN NaN
4 90.0 NaN NaN NaN
5 72.2 NaN NaN NaN
12 12 24 77.0 NaN NaN 84.3
Note that the values are not correct (and mostly NaN) as I only had a very small test sample.
To get the mean value for a specific Hour on a given date over all years, transpose the pivot first:
print(pivot.T.mean())
Final Result:
Month_ind Day Hour
1 1 1 53.533333
2 73.500000
3 82.400000
4 90.000000
5 72.200000
12 12 24 80.650000
dtype: float64
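If you prefer groupby over pivot_table, here is a sketch along the same lines; it assumes the same CSV and the same numeric Month_ind helper, and produces a single Energy_Mean column sorted by calendar month:

import pandas as pd

df = pd.read_csv('data/multi_year_ts.csv')
df['date'] = pd.to_datetime(df['date'])
df['Month_ind'] = df['date'].dt.month   # numeric month, used only for sorting

p50_8760 = (df.groupby(['Month_ind', 'Month', 'Day', 'Hour'], as_index=False)['Energy']
              .mean()
              .rename(columns={'Energy': 'Energy_Mean'})
              .sort_values(['Month_ind', 'Day', 'Hour'])
              .drop(columns='Month_ind'))
print(p50_8760)   # one row per (Month, Day, Hour), i.e. up to 8760 rows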
I have a table like this:
Date Student Average(for that date)
17 Jan 2020 Alex 40
18 Jan 2020 Alex 50
19 Jan 2020 Alex 80
20 Jan 2020 Alex 70
17 Jan 2020 Jeff 10
18 Jan 2020 Jeff 50
19 Jan 2020 Jeff 80
20 Jan 2020 Jeff 60
I want to add a column for high and low. The logic for that column should be that a row is High as long as the student's average score for today's date is greater than 90% of their average for the previous day, and Low otherwise.
So my comparison would look something like this:
Low when: avg(score, current date) < 0.9 * avg(score, previous date)
I can't figure out how to incorporate the date part into my formula, so that it compares the current day's average to the average of the previous date.
I am working with Pandas, so I was wondering if there is a way to do this in it.
If I understand correctly (IIUC):
import numpy as np

df['Previous Day'] = df.sort_values('Date').groupby('Student')['Average'].shift() * 0.90
df['Indicator'] = np.where(df['Average'] > df['Previous Day'], 'High', 'Low')
df
Output:
Date Student Average Previous Day Indicator
0 2020-01-17 Alex 40 NaN Low
1 2020-01-18 Alex 50 36.0 High
2 2020-01-19 Alex 80 45.0 High
3 2020-01-20 Alex 70 72.0 Low
4 2020-01-17 Jeff 10 NaN Low
5 2020-01-18 Jeff 50 9.0 High
6 2020-01-19 Jeff 80 45.0 High
7 2020-01-20 Jeff 60 72.0 Low
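The output above shows the Date column already parsed as datetimes. If your dates are still strings like '17 Jan 2020', here is a minimal sketch (rebuilding the sample data from the question) that converts them first so the sort is chronological rather than lexicographic:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['17 Jan 2020', '18 Jan 2020', '19 Jan 2020', '20 Jan 2020'] * 2,
    'Student': ['Alex'] * 4 + ['Jeff'] * 4,
    'Average': [40, 50, 80, 70, 10, 50, 80, 60]})

# Parse the string dates so sort_values('Date') orders them chronologically.
df['Date'] = pd.to_datetime(df['Date'])

# 90% of the previous day's average, per student, then the High/Low flag.
df['Previous Day'] = df.sort_values('Date').groupby('Student')['Average'].shift() * 0.90
df['Indicator'] = np.where(df['Average'] > df['Previous Day'], 'High', 'Low')
print(df)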
I came across Pandas groupby cumulative sum and found it very useful. However, I would like to work out how to calculate a reverse cumulative sum.
The link suggests the following.
df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum()
In order to reverse the sum, I tried slicing the data, but it fails:
df.groupby(by=['name','day']).ix[::-1, 'no'].sum().groupby(level=[0]).cumsum()
The desired output is:
Jack | Monday | 10 | 90
Jack | Tuesday | 30 | 80
Jack | Wednesday | 50 | 50
Jill | Monday | 40 | 80
Jill | Wednesday | 40 | 40
EDIT:
Based on feedback, I tried to implement the code and make the dataframe larger:
import pandas as pd
df = pd.DataFrame(
    {'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
     'surname': ['Jones', 'Jones', 'Jones', 'Smith', 'Smith'],
     'car': ['VW', 'Mazda', 'VW', 'Merc', 'Merc'],
     'country': ['UK', 'US', 'UK', 'EU', 'EU'],
     'year': [1980, 1980, 1980, 1980, 1980],
     'day': ['Monday', 'Tuesday', 'Wednesday', 'Monday', 'Wednesday'],
     'date': ['2016-02-31', '2016-01-31', '2016-01-31', '2016-01-31', '2016-01-31'],
     'no': [10, 30, 50, 40, 40],
     'qty': [100, 500, 200, 433, 222]})
I then try to group on a number of columns, but it fails to apply the grouping as I expect.
df = df.groupby(by=['name','surname','car','country','year','day','date']).sum().iloc[::-1].groupby(level=[0]).cumsum().iloc[::-1].reset_index()
Why is this the case? I expect Jack Jones with the Mazda to be a separate cumulative quantity from Jack Jones with the VW.
You can use double iloc:
df = df.groupby(by=['name','day']).sum().iloc[::-1].groupby(level=[0]).cumsum().iloc[::-1]
print (df)
no
name day
Jack Monday 90
Tuesday 80
Wednesday 50
Jill Monday 80
Wednesday 40
For adding the result as a new column, the solution simplifies to:
df = df.groupby(by=['name','day']).sum()
df['new'] = df.iloc[::-1].groupby(level=[0]).cumsum()
print (df)
no new
name day
Jack Monday 10 90
Tuesday 30 80
Wednesday 50 50
Jill Monday 40 80
Wednesday 40 40
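As a side note (not part of the answer above): when each name/day pair already appears only once, the same reversed running total can be computed directly on the flat frame, because column assignment aligns on the index:

import pandas as pd

df = pd.DataFrame({'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
                   'day': ['Monday', 'Tuesday', 'Wednesday', 'Monday', 'Wednesday'],
                   'no': [10, 30, 50, 40, 40]})

# Reverse the rows, take the running total per name, and let index alignment
# put the values back in the original row order.
df['new'] = df[::-1].groupby('name')['no'].cumsum()
print(df)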
EDIT:
The problem is in the second groupby: you need to pass more levels - level=[0,1,2] means grouping by the first (name), second (surname) and third (car) index levels.
df1 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum())
print (df1)
no qty
name surname car country year day date
Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
VW UK 1980 Monday 2016-02-31 10 100
Wednesday 2016-01-31 50 200
Jill Smith Merc EU 1980 Monday 2016-01-31 40 433
Wednesday 2016-01-31 40 222
df2 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum()
.iloc[::-1]
.groupby(level=[0,1,2])
.cumsum()
.iloc[::-1]
.reset_index())
print (df2)
name surname car country year day date no qty
0 Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
1 Jack Jones VW UK 1980 Monday 2016-02-31 60 300
2 Jack Jones VW UK 1980 Wednesday 2016-01-31 50 200
3 Jill Smith Merc EU 1980 Monday 2016-01-31 80 655
4 Jill Smith Merc EU 1980 Wednesday 2016-01-31 40 222
Or it is possible to select the levels by name - see the groupby enhancements in pandas 0.20.1+:
df2 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum()
.iloc[::-1]
.groupby(['name','surname','car'])
.cumsum()
.iloc[::-1]
.reset_index())
print (df2)
name surname car country year day date no qty
0 Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
1 Jack Jones VW UK 1980 Monday 2016-02-31 60 300
2 Jack Jones VW UK 1980 Wednesday 2016-01-31 50 200
3 Jill Smith Merc EU 1980 Monday 2016-01-31 80 655
4 Jill Smith Merc EU 1980 Wednesday 2016-01-31 40 222
I would like to add a cumulative sum column to my Pandas dataframe so that:
name  day        no
Jack  Monday     10
Jack  Tuesday    20
Jack  Tuesday    10
Jack  Wednesday  50
Jill  Monday     40
Jill  Wednesday  110
becomes:
Jack | Monday | 10 | 10
Jack | Tuesday | 30 | 40
Jack | Wednesday | 50 | 90
Jill | Monday | 40 | 40
Jill | Wednesday | 110 | 150
I tried various combos of df.groupby and df.agg(lambda x: cumsum(x)) to no avail.
This should do it; you need groupby() twice:
df.groupby(['name', 'day']).sum() \
.groupby(level=0).cumsum().reset_index()
Explanation:
print(df)
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
# sum per name/day
print( df.groupby(['name', 'day']).sum() )
no
name day
Jack Monday 10
Tuesday 30
Wednesday 50
Jill Monday 40
Wednesday 110
# cumulative sum per name/day
print( df.groupby(['name', 'day']).sum() \
.groupby(level=0).cumsum() )
no
name day
Jack Monday 10
Tuesday 40
Wednesday 90
Jill Monday 40
Wednesday 150
The dataframe resulting from the first sum is indexed by 'name' and by 'day'. You can see it by printing
df.groupby(['name', 'day']).sum().index
When computing the cumulative sum, you want to do so by 'name', corresponding to the first index (level 0).
Finally, use reset_index to have the names repeated.
df.groupby(['name', 'day']).sum().groupby(level=0).cumsum().reset_index()
name day no
0 Jack Monday 10
1 Jack Tuesday 40
2 Jack Wednesday 90
3 Jill Monday 40
4 Jill Wednesday 150
A modification to @Dmitry's answer. This is simpler and works in pandas 0.19.0:
print(df)
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
df['no_csum'] = df.groupby(['name'])['no'].cumsum()
print(df)
name day no no_csum
0 Jack Monday 10 10
1 Jack Tuesday 20 30
2 Jack Tuesday 10 40
3 Jack Wednesday 50 90
4 Jill Monday 40 40
5 Jill Wednesday 110 150
This works in pandas 0.16.2
In[23]: print df
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
In[24]: df['no_cumulative'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
In[25]: print df
name day no no_cumulative
0 Jack Monday 10 10
1 Jack Tuesday 20 30
2 Jack Tuesday 10 40
3 Jack Wednesday 50 90
4 Jill Monday 40 40
5 Jill Wednesday 110 150
You can also use a plain cumulative sum over the whole no column (note that this does not restart for each name):
df['cum_no'] = df.no.cumsum()
http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.cumsum.html
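For comparison, a quick sketch on a cut-down version of the sample data, showing the difference: a plain cumsum() keeps running across names, while the grouped version restarts for each name:

import pandas as pd

df = pd.DataFrame({'name': ['Jack', 'Jack', 'Jill'],
                   'day': ['Monday', 'Tuesday', 'Monday'],
                   'no': [10, 30, 40]})

df['cum_no'] = df.no.cumsum()                      # 10, 40, 80 - runs over the whole column
df['no_csum'] = df.groupby('name')['no'].cumsum()  # 10, 40, 40 - restarts for each name
print(df)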
Another way of doing it
import pandas as pd
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
'C2' : [1,2,3,4,5]})
df['cumsum'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.cumsum())
df
Instead of df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum()
(see above) you could also do a df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()
df.groupby(by=['name','day']).sum() is effectively just moving both columns into a MultiIndex (as long as each name/day pair appears only once)
as_index=False means you do not need to call reset_index afterwards
data.csv:
name,day,no
Jack,Monday,10
Jack,Tuesday,20
Jack,Tuesday,10
Jack,Wednesday,50
Jill,Monday,40
Jill,Wednesday,110
Code:
import numpy as np
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
df = df.groupby(['name', 'day'])['no'].sum().reset_index()
print(df)
df['cumsum'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
print(df)
Output:
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
name day no
0 Jack Monday 10
1 Jack Tuesday 30
2 Jack Wednesday 50
3 Jill Monday 40
4 Jill Wednesday 110
name day no cumsum
0 Jack Monday 10 10
1 Jack Tuesday 30 40
2 Jack Wednesday 50 90
3 Jill Monday 40 40
4 Jill Wednesday 110 150
As of version 1.0, pandas has a new API for window functions.
Specifically, what was achieved earlier with
df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
or
df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()
now becomes
df.groupby(['name'])['no'].expanding().sum()
I find it more intuitive than groupby + level operations for all window-related functions, although learning to use groupby is still useful for general purposes.
See the docs:
https://pandas.pydata.org/docs/user_guide/window.html
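One practical note (an addition, not from the answer above): the expanding result comes back indexed by name plus the original row index, so to attach it as a column you can drop the group level first:

import pandas as pd

df = pd.DataFrame({'name': ['Jack', 'Jack', 'Jill'],
                   'day': ['Monday', 'Tuesday', 'Monday'],
                   'no': [10, 30, 40]})

# expanding().sum() returns a Series with a (name, original index) MultiIndex;
# dropping the 'name' level aligns it back onto df's index.
df['no_csum'] = df.groupby('name')['no'].expanding().sum().droplevel('name')
print(df)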
If you want to write a one-liner (perhaps to pass the methods into a pipeline), you can do so by first setting the as_index parameter of groupby to False, so that the aggregation step returns a dataframe, and then using assign() to add a new column to it (the cumulative sum for each person).
These chained methods return a new dataframe, so you'll need to assign it to a variable (e.g. agg_df) to be able to use it later on.
agg_df = (
# aggregate df by name and day
df.groupby(['name','day'], as_index=False)['no'].sum()
.assign(
# assign the cumulative sum of each name as a new column
cumulative_sum=lambda x: x.groupby('name')['no'].cumsum()
)
)