Pandas: cumulative sum every n rows - python

I have a dataframe with a column "date" of dtype M8[ns] and another column "Expected_response". Then there is a column "cumulative_expected" which should hold the cumulative sum of Expected_response over the rows that share the same date. The dataframe has a row for each second of the month, like below:
date Expected_response cumulative_expected
0 2018-03-01 0.270 0.270
1 2018-03-01 0.260 0.530
2 2018-03-01 0.240 0.770
3 2018-03-01 0.224 0.994
4 2018-03-01 0.204 1.198
5 2018-03-01 0.194 1.392
6 2018-03-01 0.190 1.582
... ... ... ...
2678395 2018-03-31 0.164 -7533.464
2678396 2018-03-31 0.164 -7533.300
2678397 2018-03-31 0.160 -7533.140
2678398 2018-03-31 0.154 -7532.986
2678399 2018-03-31 0.150 -7532.836
As you can see there is an error: the cumulative sum does not recognise the change of date, so it does not restart each time the date changes.
The code is:
df['cumulative_expected']=df.groupby(df['date']!=df['date'])['Expected_response'].cumsum()
Maybe an option could be to create a counter that increases by 1 every 86400 rows (the number of seconds in a day) and then group by that counter. But I don't know how to do it.
Is there any other solution?
Thank you in advance.

There is a default index, so you can use floor division:
df['cumulative_expected'] = df['Expected_response'].groupby(df.index // 86400).cumsum()
The general solution is to create an np.arange and use floor division:
import numpy as np

arr = np.arange(len(df)) // 86400
df['cumulative_expected'] = df['Expected_response'].groupby(arr).cumsum()
Your solution should be changed to compare the date with its shifted values and build group ids with cumsum:
s = (df['date']!=df['date'].shift()).cumsum()
df['cumulative_expected'] = df['Expected_response'].groupby(s).cumsum()
Test with modified sample data:
print (df)
date Expected_response
0 2018-03-01 0.270
1 2018-03-01 0.260
2 2018-03-02 0.240
3 2018-03-02 0.224
4 2018-03-02 0.204
5 2018-03-01 0.194
6 2018-03-01 0.190
s = (df['date']!=df['date'].shift()).cumsum()
print (s)
0 1
1 1
2 2
3 2
4 2
5 3
6 3
Name: date, dtype: int32
df['cumulative_expected'] = df['Expected_response'].groupby(s).cumsum()
print (df)
date Expected_response cumulative_expected
0 2018-03-01 0.270 0.270
1 2018-03-01 0.260 0.530
2 2018-03-02 0.240 0.240
3 2018-03-02 0.224 0.464
4 2018-03-02 0.204 0.668
5 2018-03-01 0.194 0.194
6 2018-03-01 0.190 0.384

You can take the first difference of the date using diff to see where the changes occur, and use this as a reference for the cumulative sum.
Here I use a slightly modified df to show how it works:
print(df)
date Expected_response
0 2018-03-01 0.270
1 2018-03-01 0.260
2 2018-03-01 0.240
3 2018-03-01 0.224
4 2018-03-02 0.204
5 2018-03-02 0.194
6 2018-03-02 0.190
df['change'] = df.date.diff().abs().fillna(0).cumsum()
print(df)
date Expected_response change
0 2018-03-01 0.270 0 days
1 2018-03-01 0.260 0 days
2 2018-03-01 0.240 0 days
3 2018-03-01 0.224 0 days
4 2018-03-02 0.204 1 days
5 2018-03-02 0.194 1 days
6 2018-03-02 0.190 1 days
df['cumulative_expected'] = df.groupby('change')['Expected_response'].cumsum()
print(df.drop(['change'], axis = 1))
date Expected_response cumulative_expected
0 2018-03-01 0.270 0.270
1 2018-03-01 0.260 0.530
2 2018-03-01 0.240 0.770
3 2018-03-01 0.224 0.994
4 2018-03-02 0.204 0.204
5 2018-03-02 0.194 0.398
6 2018-03-02 0.190 0.588

Related

How to get the average of a time frame in pandas

I have a large CSV file as below:
dd hh v.amm v.alc v.no2 v.cmo aqi
t
2018-11-03 00:00:00 3 0 0.390 0.490 1.280 1.760 2.560
2018-11-03 00:01:00 3 0 0.390 0.490 1.280 1.760 2.560
2018-11-03 00:02:00 3 0 0.380 0.460 1.300 1.610 2.500
2018-11-03 00:03:00 3 0 0.380 0.450 1.310 1.600 2.490
...
2018-11-28 23:56:00 28 23 0.670 0.560 1.100 1.870 2.940
2018-11-28 23:57:00 28 23 0.660 0.570 1.100 1.990 2.950
2018-11-28 23:58:00 28 23 0.660 0.570 1.100 1.990 2.950
2018-11-28 23:59:00 28 23 0.650 0.530 1.130 1.880 2.870
[37440 rows x 7 columns]
I'd like to take the average of 60 minutes to obtain hourly data. The final data would look something like this:
dd hh v.amm v.alc v.no2 v.cmo aqi
t
2018-11-03 00:00:00 3 0 0.390 0.490 1.280 1.760 2.560
2018-11-03 01:00:00 3 1 0.390 0.490 1.280 1.760 2.560
2018-11-03 02:00:00 3 2 0.380 0.460 1.300 1.610 2.500
2018-11-03 03:00:00 3 3 0.380 0.450 1.310 1.600 2.490
I tried
print (df['v.amm'].resample('60Min').mean())
t
2018-11-03 00:00:00 0.357
2018-11-03 01:00:00 0.354
2018-11-03 02:00:00 0.369
2018-11-03 03:00:00 0.384
but this doesn't seem efficient, as it only handles one specific column at a time and drops the headings as well.

How to restrict time difference to same day?

I have a dataframe like as shown below
df1 = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-03 12:59:00',
               '2173-04-03 13:14:00', '2173-04-03 13:37:00', '2173-04-04 11:30:00',
               '2173-04-05 16:00:00', '2173-04-05 22:00:00', '2173-04-06 04:00:00',
               '2173-04-06 04:30:00', '2173-04-06 08:00:00']
})
I would like to create another column called tdiff that holds the time difference between consecutive rows.
This is what I tried:
df1['time_1'] = pd.to_datetime(df1['time_1'])
df1['time_2'] = df1['time_1'].shift(-1)
df1['tdiff'] = (df1['time_2'] - df1['time_1']).dt.total_seconds() / 3600
But this produces the output shown below. As you can see, it subtracts from the next date. Instead, I would like to restrict the time difference to the same day. Ex: if Jan 15th 20:00:00 is the last record for that day, then I expect tdiff to be 4:00:00 (24:00:00 - 20:00:00).
I understand this happens because I am shifting the time values to subtract, so the highlighted rows pick up records from the next date. But is there a way to avoid this and calculate the time difference only between records of the same day?
I expect my output to be like this. Here NaN should be replaced by the time remaining until the end of the current date (23:59:00); if you check the difference, you will get the idea.
Is there any existing method or pandas function that can help us do this datewise timedelta? How can I shift the values datewise?
IIUC, you can use:
s = pd.to_timedelta(24, unit='h') - (df1.time_1 - df1.time_1.dt.normalize())
df1['tdiff'] = df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s)
# in hours:
# df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s).dt.total_seconds() / 3600
subject_id time_1 tdiff
0 1 2173-04-03 12:35:00 00:15:00
1 1 2173-04-03 12:50:00 00:09:00
2 1 2173-04-03 12:59:00 00:15:00
3 1 2173-04-03 13:14:00 00:23:00
4 1 2173-04-03 13:37:00 10:23:00
5 1 2173-04-04 11:30:00 12:30:00
6 1 2173-04-05 16:00:00 06:00:00
7 1 2173-04-05 22:00:00 02:00:00
8 1 2173-04-06 04:00:00 00:30:00
9 1 2173-04-06 04:30:00 03:30:00
10 1 2173-04-06 08:00:00 16:00:00
You could use Series.where and dt.ceil to decide whether to subtract from time_2 or from midnight of time_1:
sameDayOrMidnight = df.time_2.where(df.time_1.dt.date==df.time_2.dt.date, df.time_1.dt.ceil(freq='1d'))
df['tdiff'] = (sameDayOrMidnight - df.time_1).dt.total_seconds() / 3600
result:
subject_id time_1 time_2 tdiff
0 1 2173-04-03 12:35:00 2173-04-03 12:50:00 0.250000
1 1 2173-04-03 12:50:00 2173-04-03 12:59:00 0.150000
2 1 2173-04-03 12:59:00 2173-04-03 13:14:00 0.250000
3 1 2173-04-03 13:14:00 2173-04-03 13:37:00 0.383333
4 1 2173-04-03 13:37:00 2173-04-04 11:30:00 10.383333
5 1 2173-04-04 11:30:00 2173-04-05 16:00:00 12.500000
6 1 2173-04-05 16:00:00 2173-04-05 22:00:00 6.000000
7 1 2173-04-05 22:00:00 2173-04-06 04:00:00 2.000000
8 1 2173-04-06 04:00:00 2173-04-06 04:30:00 0.500000
9 1 2173-04-06 04:30:00 2173-04-06 08:00:00 3.500000
10 1 2173-04-06 08:00:00 NaT 16.000000

Python: Resampling and forward-filling to most recent month

I have a large data set, which contains, for each property_id, a cost for every month that a cost was incurred, as per the below dataset.
property_id period amount
1 2016-07-01 105908.20
1 2016-08-01 0.00
2 2016-08-01 114759.40
3 2014-05-01 -934.00
3 2014-06-01 -845.95
3 2017-12-01 92175.77
4 2015-09-01 -1859.75
4 2015-12-01 1859.75
4 2017-12-01 130105.00
5 2014-07-01 -6929.58
I would like to create a cumulative sum, grouped by property_id, and carry it forward each month, from the first month of that property_id through to the most recent full month.
I've tried the below, where I resample by property_id and try to forward-fill, but it gives an error:
cost = cost.groupby['property_id'].apply(lambda x: x.set_index('period').resample('M').fillna(method='pad'))
TypeError: 'method' object is not subscriptable
Example output below:
> property_id period amount
> 1 2016-07-01 105908.20
> 1 2016-08-01 105908.20
> 1 2016-09-01 105908.20
> 1 2016-10-01 105908.20
> ...
> 1 2019-07-01 105908.20
> 2 2016-08-01 114759.40
> 2 2016-09-01 114759.40
> 2 2016-10-01 114759.40
> ...
> 2 2019-07-01 114759.40
> 3 2014-05-01 -934.00
> 3 2014-06-01 -1779.95
> 3 2014-07-01 -1779.95
> 3 2014-08-01 -1779.95
> ...
> 3 2017-12-01 90395.82
> 3 2018-01-01 90395.82
> 3 2018-02-01 90395.82
> 3 2018-03-01 90395.82
> ...
> 3 2019-07-01 90395.82
> 4 2015-09-01 -1859.75
> 4 2015-10-01 -1859.75
> 4 2015-11-01 -1859.75
> 4 2015-12-01 0
> 4 2016-01-01 0
> ...
> 4 2017-11-01 0
> 4 2017-12-01 130105.00
> 4 2018-01-01 130105.00
> ...
> 4 2019-07-01 130105.00
> 5 2014-07-01 -6929.58
> 5 2014-08-01 -6929.58
> 5 2014-09-01 -6929.58
> ...
> 5 2019-07-01 -6929.58
Any help would be great.
Thanks!
Create a DatetimeIndex first and then use groupby with resample:
df['period'] = pd.to_datetime(df['period'])
df1 = df.set_index('period').groupby('property_id').resample('M').pad()
#alternative
#df1 = df.set_index('period').groupby('property_id').resample('M').ffill()
print (df1)
property_id amount
property_id period
1 2016-07-31 1 105908.20
2016-08-31 1 0.00
2 2016-08-31 2 114759.40
3 2014-05-31 3 -934.00
2014-06-30 3 -845.95
... ...
4 2017-09-30 4 1859.75
2017-10-31 4 1859.75
2017-11-30 4 1859.75
2017-12-31 4 130105.00
5 2014-07-31 5 -6929.58
[76 rows x 2 columns]
EDIT: The idea is to create a new DataFrame from the last value of each property_id, assign the month by condition, then append it to the original and use the solution above:
df['period'] = pd.to_datetime(df['period'])
df = df.sort_values(['property_id','period'])
last = pd.to_datetime('now').floor('d')
nextday = (last + pd.Timedelta(1, 'd')).day
orig_month = last.to_period('m').to_timestamp()
before_month = (last.to_period('m') - 1).to_timestamp()
last = orig_month if nextday == 1 else before_month
print (last)
2019-07-01 00:00:00
df1 = df.drop_duplicates('property_id', keep='last').assign(period=last)
print (df1)
property_id period amount
1 1 2019-07-01 0.00
2 2 2019-07-01 114759.40
5 3 2019-07-01 92175.77
8 4 2019-07-01 130105.00
9 5 2019-07-01 -6929.58
df = pd.concat([df, df1])
df1 = (df.set_index('period')
.groupby('property_id')['amount']
.resample('MS')
.asfreq(fill_value=0)
.groupby(level=0)
.cumsum())
print (df1)
property_id period
1 2016-07-01 105908.20
2016-08-01 105908.20
2016-09-01 105908.20
2016-10-01 105908.20
2016-11-01 105908.20
...
5 2019-03-01 -394986.06
2019-04-01 -401915.64
2019-05-01 -408845.22
2019-06-01 -415774.80
2019-07-01 -422704.38
Name: amount, Length: 244, dtype: float64

Pandas reshaping functions

To add to the many excellent examples of this, I'm trying to reshape my data into the format I want.
I currently have data indexed by customer, purchase category and date, with observations for each intra-day time period across the columns:
I want to aggregate by purchase category, and reshape so that my data is indexed by date and time, while customers appear across the columns.
What's the simplest way to achieve this?
In text form, the original data looks like this:
Customer Purchase Category date       00:30  01:00  01:30
1        A                 01/07/2012 1.25   1.25   1.25
1        B                 01/07/2012 0.855  0.786  0.604
1        C                 01/07/2012 0      0      0
1        A                 02/07/2012 1.25   1.25   1.125
1        B                 02/07/2012 0.309  0.082  0.059
1        C                 02/07/2012 0      0      0
2        A                 01/07/2012 0      0      0
2        B                 01/07/2012 0.167  0.108  0.119
2        C                 01/07/2012 0      0      0
2        A                 02/07/2012 0      0      0
2        B                 02/07/2012 0.11   0.109  0.123
I think you need groupby with sum aggregation, then reshape by stack and unstack. Last, pop the column level_1, add it to date and convert with to_datetime:
print (df)
Customer Purchase Category date 00:30 01:00 01:30
0 1 A 01/07/2012 1.250 1.250 1.250
1 1 B 01/07/2012 0.855 0.786 0.604
2 1 C 01/07/2012 0.000 0.000 0.000
3 1 A 02/07/2012 1.250 1.250 1.125
4 1 B 02/07/2012 0.309 0.082 0.059
5 1 C 02/07/2012 0.000 0.000 0.000
6 2 A 01/07/2012 0.000 0.000 0.000
7 2 B 01/07/2012 0.167 0.108 0.119
8 2 C 01/07/2012 0.000 0.000 0.000
9 2 A 02/07/2012 0.000 0.000 0.000
10 2 B 02/07/2012 0.110 0.109 0.123
df1 = df.groupby(['Customer','date']).sum().stack().unstack(0).reset_index()
df1.date = pd.to_datetime(df1.date + df1.pop('level_1'), format='%d/%m/%Y%H:%M')
print (df1)
Customer date 1 2
0 2012-07-01 00:30:00 2.105 0.167
1 2012-07-01 01:00:00 2.036 0.108
2 2012-07-01 01:30:00 1.854 0.119
3 2012-07-02 00:30:00 1.559 0.110
4 2012-07-02 01:00:00 1.332 0.109
5 2012-07-02 01:30:00 1.184 0.123

Add timedelta data within a group in pandas dataframe

I am working on a dataframe in pandas with four columns of user_id, time_stamp1, time_stamp2, and interval. Time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up interval values for each user_id in the dataframe and I tried to calculate it in many ways as:
1)df["duration"]= df.groupby('user_id')['interval'].apply (lambda x: x.sum())
2)df ["duration"]= df.groupby('user_id').aggregate (np.sum)
3)df ["duration"]= df.groupby('user_id').agg (np.sum)
but none of them work and the value of the duration will be NaT after running the codes.
UPDATE: you can use the transform() method:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...
