Last 10 minutes per day - python

I'm trying to work out how much of a business's transaction volume is done in the last 10 minutes of each day.
The data I have is the following:
DF_Q
Out[97]:
LongTime
2016-01-04 09:30:00 35077034
2016-01-04 09:30:11 1119
2016-01-04 09:30:21 12295250
2016-01-04 09:30:23 1387856
2016-01-04 09:30:40 877954
...
2016-05-27 15:59:53 16986
2016-05-27 15:59:58 50080165
2016-05-27 15:59:59 17097260
Name: Volume, dtype: int64
I first resample that series to 10-minute intervals and obtain:
DF_Qmin = DF_Q.resample('10min').sum()
DF_Qmin
Out[102]:
LongTime
2016-01-04 09:30:00 3.202500e+05
2016-01-04 09:40:00 1.192028e+08
2016-01-04 09:50:00 6.156090e+07
2016-01-04 10:00:00 1.289250e+09
...
2016-05-27 15:20:00 1.035539e+09
2016-05-27 15:30:00 1.489631e+09
2016-05-27 15:40:00 2.228257e+09
2016-05-27 15:50:00 5.352179e+09
Freq: 10T, Name: Volume, dtype: float64
Then I build a pivot table, which I save as an Excel file, and manually pick out each day's last 10-minute volume:
2016-01-04 16:50:00 3.693279e+09
2016-01-05 16:50:00 2.158429e+09
...
2016-05-26 15:50:00 1.256878e+08
2016-05-27 15:50:00 6.521489e+09
Is it possible to do this without Excel, or without iterating over each day?

I think you need to group by date and aggregate with last. Then use rename_axis (new in pandas 0.18.0) and reset_index:
#if need column LongTime
DF_Qmin = DF_Qmin.reset_index()
print (DF_Qmin.groupby(DF_Qmin.LongTime.dt.date).last())
Sample:
import pandas as pd
DF_Qmin = pd.Series({pd.Timestamp('2016-01-04 09:30:00'): 320250.0, pd.Timestamp('2016-01-04 09:50:00'): 61560900.0, pd.Timestamp('2016-05-27 15:40:00'): 2228257000.0, pd.Timestamp('2016-01-04 09:40:00'): 119202800.0, pd.Timestamp('2016-05-27 15:30:00'): 1489631000.0, pd.Timestamp('2016-01-04 10:00:00'): 1289250000.0, pd.Timestamp('2016-05-27 15:50:00'): 5352179000.0, pd.Timestamp('2016-05-27 15:20:00'): 1035539000.0}, name='Volume')
DF_Qmin.index.name = 'LongTime'
print (DF_Qmin)
LongTime
2016-01-04 09:30:00 3.202500e+05
2016-01-04 09:40:00 1.192028e+08
2016-01-04 09:50:00 6.156090e+07
2016-01-04 10:00:00 1.289250e+09
2016-05-27 15:20:00 1.035539e+09
2016-05-27 15:30:00 1.489631e+09
2016-05-27 15:40:00 2.228257e+09
2016-05-27 15:50:00 5.352179e+09
Name: Volume, dtype: float64
DF_Qmin = DF_Qmin.reset_index()
print (DF_Qmin)
LongTime Volume
0 2016-01-04 09:30:00 3.202500e+05
1 2016-01-04 09:40:00 1.192028e+08
2 2016-01-04 09:50:00 6.156090e+07
3 2016-01-04 10:00:00 1.289250e+09
4 2016-05-27 15:20:00 1.035539e+09
5 2016-05-27 15:30:00 1.489631e+09
6 2016-05-27 15:40:00 2.228257e+09
7 2016-05-27 15:50:00 5.352179e+09
print (DF_Qmin.groupby(DF_Qmin.LongTime.dt.date)
.last()
.rename_axis('Date')
.reset_index())
Date LongTime Volume
0 2016-01-04 2016-01-04 10:00:00 1.289250e+09
1 2016-05-27 2016-05-27 15:50:00 5.352179e+09
If the last time is not needed:
print (DF_Qmin.groupby(DF_Qmin.index.date)
.last()
.rename_axis('Date')
.reset_index())
Date Volume
0 2016-01-04 1.289250e+09
1 2016-05-27 5.352179e+09
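If you would rather skip the intermediate resample, here is a minimal sketch working directly on the raw Series DF_Q, assuming each day's "last 10 minutes" means the 10 minutes up to that day's final timestamp (which also copes with the varying closing times in your desired output):
def last_10min_sum(day):
    # day is the sub-Series for one calendar date
    cutoff = day.index.max() - pd.Timedelta('10min')
    return day[day.index > cutoff].sum()

print (DF_Q.groupby(DF_Q.index.date).apply(last_10min_sum))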

After resampling your Series/DataFrame you can do it this way (using .loc, since .ix has been removed from newer pandas):
DF_Qmin.loc[DF_Qmin.index.minute == 50]

Add a column with the hourly difference of the Datetime Index [duplicate]

I have a DataFrame with a DatetimeIndex and I need to create a column that contains the time difference between consecutive rows of the index, expressed in hours. This is what I have:
Datetime Numbers
2020-11-27 08:30:00 1
2020-11-27 13:00:00 2
2020-11-27 15:15:00 3
2020-11-27 20:45:00 4
2020-11-28 08:45:00 5
2020-11-28 10:45:00 6
2020-12-01 04:00:00 7
2020-12-01 08:15:00 8
2020-12-01 12:45:00 9
2020-12-01 14:45:00 10
2020-12-01 17:15:00 11
...
This is what I need:
Datetime Numbers Delta
2020-11-27 08:30:00 1 NaN
2020-11-27 13:00:00 2 4.5
2020-11-27 15:15:00 3 2.25
2020-11-27 20:45:00 4 5.5
2020-11-28 08:45:00 5 12
2020-11-28 10:45:00 6 2
2020-12-01 04:00:00 7 65.25
2020-12-01 08:15:00 8 4.25
2020-12-01 12:45:00 9 4.5
2020-12-01 14:45:00 10 2
2020-12-01 17:15:00 11 2.5
...
The DataFrame has thousands of rows, so I can't use a "for" loop. Thanks in advance!
EDIT: I found a solution:
import numpy as np

df = df.reset_index()
df['Time'] = df['Datetime'].astype(np.int64) // 10**9
df['Delta'] = df['Time'].diff() / 3600
df.drop(columns=['Time'], inplace=True)
df.set_index('Datetime', inplace=True)
I assume that Datetime is set as index:
df.reset_index(inplace=True)
df['Delta'] = df['Datetime'].diff().dt.total_seconds()/3600
df.set_index('Datetime', inplace=True)
OUTPUT:
Numbers Delta
Datetime
2020-11-27 08:30:00 1 NaN
2020-11-27 13:00:00 2 4.50
2020-11-27 15:15:00 3 2.25
2020-11-27 20:45:00 4 5.50
2020-11-28 08:45:00 5 12.00
2020-11-28 10:45:00 6 2.00
2020-12-01 04:00:00 7 65.25
2020-12-01 08:15:00 8 4.25
2020-12-01 12:45:00 9 4.50
2020-12-01 14:45:00 10 2.00
2020-12-01 17:15:00 11 2.50
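For what it's worth, the round trip through reset_index can also be avoided; a minimal sketch assuming Datetime stays as the index:
df['Delta'] = df.index.to_series().diff().dt.total_seconds() / 3600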

how to sort by english date format not american pandas .sort()

symb dates
4 BLK 01/03/2014 09:00:00
0 BBR 02/06/2014 09:00:00
21 HZ 02/06/2014 09:00:00
24 OMNI 02/07/2014 09:00:00
31 NOTE 03/04/2014 09:00:00
65 AMP 03/04/2016 09:00:00
40 RBY 04/07/2014 09:00:00
Here's a sample of the output from df.sort('date').
As you can see, it uses the days as the months and vice versa. Any idea how to fix this?
You can use pandas.to_datetime with the format argument, and then sort it.
>> df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S')
>> df.sort('date')
date symb
0 2014-01-03 09:00:00 BLK
1 2014-02-06 09:00:00 BBR
2 2014-02-06 09:00:00 HZ
3 2014-02-07 09:00:00 OMNI
4 2014-03-04 09:00:00 NOTE
6 2014-04-07 09:00:00 RBY
5 2016-03-04 09:00:00 AMP
You can use to_datetime and, for sorting, sort_values:
#format mm/dd/YYYY
df['dates'] = pd.to_datetime(df['dates'])
print (df.sort_values('dates'))
symb dates
4 BLK 2014-01-03 09:00:00
0 BBR 2014-02-06 09:00:00
21 HZ 2014-02-06 09:00:00
24 OMNI 2014-02-07 09:00:00
31 NOTE 2014-03-04 09:00:00
40 RBY 2014-04-07 09:00:00
65 AMP 2016-03-04 09:00:00
#format dd/mm/YYYY
df['dates'] = pd.to_datetime(df['dates'], dayfirst=True)
print (df.sort_values('dates'))
symb dates
4 BLK 2014-03-01 09:00:00
31 NOTE 2014-04-03 09:00:00
0 BBR 2014-06-02 09:00:00
21 HZ 2014-06-02 09:00:00
24 OMNI 2014-07-02 09:00:00
40 RBY 2014-07-04 09:00:00
65 AMP 2016-04-03 09:00:00
Another solution is to use the parse_dates parameter in read_csv; if the format is dd/mm/YYYY, add dayfirst=True:
import pandas as pd
import numpy as np
from io import StringIO
temp=u"""symb,dates
BLK,01/03/2014 09:00:00
BBR,02/06/2014 09:00:00
HZ,02/06/2014 09:00:00
OMNI,02/07/2014 09:00:00
NOTE,03/04/2014 09:00:00
AMP,03/04/2016 09:00:00
RBY,04/07/2014 09:00:00"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), parse_dates=['dates'])
print (df)
symb dates
0 BLK 2014-01-03 09:00:00
1 BBR 2014-02-06 09:00:00
2 HZ 2014-02-06 09:00:00
3 OMNI 2014-02-07 09:00:00
4 NOTE 2014-03-04 09:00:00
5 AMP 2016-03-04 09:00:00
6 RBY 2014-04-07 09:00:00
print (df.dtypes)
symb object
dates datetime64[ns]
dtype: object
print (df.sort_values('dates'))
symb dates
0 BLK 2014-01-03 09:00:00
1 BBR 2014-02-06 09:00:00
2 HZ 2014-02-06 09:00:00
3 OMNI 2014-02-07 09:00:00
4 NOTE 2014-03-04 09:00:00
6 RBY 2014-04-07 09:00:00
5 AMP 2016-03-04 09:00:00
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), parse_dates=['dates'], dayfirst=True)
print (df)
symb dates
0 BLK 2014-03-01 09:00:00
1 BBR 2014-06-02 09:00:00
2 HZ 2014-06-02 09:00:00
3 OMNI 2014-07-02 09:00:00
4 NOTE 2014-04-03 09:00:00
5 AMP 2016-04-03 09:00:00
6 RBY 2014-07-04 09:00:00
print (df.dtypes)
symb object
dates datetime64[ns]
dtype: object
print (df.sort_values('dates'))
symb dates
0 BLK 2014-03-01 09:00:00
4 NOTE 2014-04-03 09:00:00
1 BBR 2014-06-02 09:00:00
2 HZ 2014-06-02 09:00:00
3 OMNI 2014-07-02 09:00:00
6 RBY 2014-07-04 09:00:00
5 AMP 2016-04-03 09:00:00
I am not sure how you are getting the data, but if you are importing it from some source such as a CSV you could use pandas.read_csv and set parse_dates=True. The question is: what is the type of the dates column? You can easily change the values to date-like objects using dateutil.parser.parse. For example,
import pandas
import dateutil
data = {'symb': ['BLK', 'BBR', 'HZ', 'OMNI', 'NOTE', 'AMP', 'RBY'],
'dates': ['01/03/2014 09:00:00', '02/06/2014 09:00:00', '02/06/2014 09:00:00',
'02/07/2014 09:00:00', '03/04/2014 09:00:00', '03/04/2016 09:00:00',
'04/07/2014 09:00:00']}
df = pandas.DataFrame.from_dict(data)
df.dates = df.dates.apply(dateutil.parser.parse)
print(df.to_string())
# OUTPUT
# 0 2014-01-03 09:00:00 BLK
# 1 2014-02-06 09:00:00 BBR
# 2 2014-02-06 09:00:00 HZ
# 3 2014-02-07 09:00:00 OMNI
# 4 2014-03-04 09:00:00 NOTE
# 5 2016-03-04 09:00:00 AMP
# 6 2014-04-07 09:00:00 RBY
This gets you the ISO 8601 format, which may be preferable to the dd/mm/yyyy format, but if you must have that format you can use the code recommended by @umutto.
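For completeness, a short sketch assuming a recent pandas release, where DataFrame.sort was removed in favor of sort_values and an explicit format string avoids any day/month guessing:
# assuming the dd/mm/YYYY HH:MM:SS layout described in the question
df['dates'] = pd.to_datetime(df['dates'], format='%d/%m/%Y %H:%M:%S')
print (df.sort_values('dates'))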

add timedelta data within a group in pandas dataframe

I am working on a dataframe in pandas with four columns of user_id, time_stamp1, time_stamp2, and interval. Time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up the interval values for each user_id in the dataframe, and I tried to calculate it in several ways:
1) df["duration"] = df.groupby('user_id')['interval'].apply(lambda x: x.sum())
2) df["duration"] = df.groupby('user_id').aggregate(np.sum)
3) df["duration"] = df.groupby('user_id').agg(np.sum)
but none of them works; the duration values come out as NaT after running the code.
UPDATE: you can use the transform() method:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...
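One extra note on why the original attempts produced NaT: the per-user sums come back indexed by user_id, so assigning them straight into a column aligns on the row index instead. A minimal sketch (with hypothetical toy data) of mapping the totals back explicitly, as an alternative to transform:
import pandas as pd

# hypothetical miniature frame with only the two relevant columns
df = pd.DataFrame({'user_id': [1, 1, 2],
                   'interval': pd.to_timedelta(['1 days', '2 days', '3 days'])})

totals = df.groupby('user_id')['interval'].sum()   # indexed by user_id
df['duration'] = df['user_id'].map(totals)         # broadcast back per row
print (df)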

rolling_sum on business day and return new dataframe with date as index

I have such a DataFrame:
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-02 00:00:00 2
2016-01-02 12:00:00 3
2016-01-03 00:00:00 4
2016-01-03 12:00:00 5
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
The reason I single out 2016-01-02 00:00:00 to 2016-01-03 12:00:00 is that those two days are a weekend.
So here is what I wish to do:
I wish to rolling_sum with window = 2 business days.
For example, I wish to sum
A
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
and then sum (we skip any non-business days)
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
And the result is
A
2016-01-01 Nan
2016-01-04 14
2016-01-05 30
How can I achieve that?
I tried rolling_sum(df, window=2, freq=BDay(1)), but it seems to just pick one row from each day rather than summing the two rows (00:00 and 12:00) within the same day.
You could first select only business days, resample to (business) daily frequency for the remaining data points and sum, and then apply rolling_sum:
Starting with some sample data:
df = pd.DataFrame(data={'A': np.random.randint(0, 10, 500)}, index=pd.date_range(datetime(2016,1,1), freq='6H', periods=500))
A
2016-01-01 00:00:00 6
2016-01-01 06:00:00 9
2016-01-01 12:00:00 3
2016-01-01 18:00:00 9
2016-01-02 00:00:00 7
2016-01-02 06:00:00 5
2016-01-02 12:00:00 8
2016-01-02 18:00:00 6
2016-01-03 00:00:00 2
2016-01-03 06:00:00 0
2016-01-03 12:00:00 0
2016-01-03 18:00:00 0
2016-01-04 00:00:00 5
2016-01-04 06:00:00 4
2016-01-04 12:00:00 1
2016-01-04 18:00:00 4
2016-01-05 00:00:00 6
2016-01-05 06:00:00 9
2016-01-05 12:00:00 7
2016-01-05 18:00:00 2
....
First select the values on business days:
tsdays = df.index.values.astype('<M8[D]')
bdays = pd.bdate_range(tsdays[0], tsdays[-1]).values.astype('<M8[D]')
df = df[np.in1d(tsdays, bdays)]
Then apply rolling_sum() to the resampled data, where each value represents the sum for an individual business day:
pd.rolling_sum(df.resample('B', how='sum'), window=2)
to get:
A
2016-01-01 NaN
2016-01-04 41
2016-01-05 38
2016-01-06 56
2016-01-07 52
2016-01-08 37
See also [here] for the type conversion and [this question] for the business-day extraction.
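On a newer pandas, where pd.rolling_sum and the how= argument of resample are gone, a minimal sketch of the same idea with the method-style API (a simple weekday mask stands in for bdate_range here):
bdays = df.index.dayofweek < 5                 # keep Monday-Friday rows only
print (df[bdays].resample('B').sum().rolling(window=2).sum())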

How to resample a df with datetime index to exactly n equally sized periods?

I've got a large dataframe with a datetime index and need to resample data to exactly 10 equally sized periods.
So far, I've tried finding the first and last dates to determine the total number of days in the data, dividing that by 10 to determine the size of each period, and then resampling using that number of days, e.g.:
first = df.reset_index().timesubmit.min()
last = df.reset_index().timesubmit.max()
periodsize = str((last-first).days/10) + 'D'
df.resample(periodsize,how='sum')
This doesn't guarantee exactly 10 periods in the df after resampling, since the periodsize is a rounded-down int. Using a float doesn't work in the resampling. It seems that either there's something simple I'm missing here, or I'm attacking the problem all wrong.
import numpy as np
import pandas as pd
n = 10
nrows = 33
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
print(df)
# 0
# 2000-01-01 1
# 2000-01-02 1
# ...
# 2000-02-01 1
# 2000-02-02 1
first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
secs = int((last-first).total_seconds()//n)
periodsize = '{:d}S'.format(secs)
result = df.resample(periodsize, how='sum')
print('\n{}'.format(result))
assert len(result) == n
yields
0
2000-01-01 00:00:00 4
2000-01-04 07:12:00 3
2000-01-07 14:24:00 3
2000-01-10 21:36:00 4
2000-01-14 04:48:00 3
2000-01-17 12:00:00 3
2000-01-20 19:12:00 4
2000-01-24 02:24:00 3
2000-01-27 09:36:00 3
2000-01-30 16:48:00 3
The values in the 0-column indicate the number of rows that were aggregated, since the original DataFrame was filled with values of 1. The pattern of 4's and 3's is about as even as you can get since 33 rows can not be evenly grouped into 10 groups.
Explanation: Consider this simpler DataFrame:
n = 2
nrows = 5
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
# 0
# 2000-01-01 1
# 2000-01-02 1
# 2000-01-03 1
# 2000-01-04 1
# 2000-01-05 1
Using df.resample('2D', how='sum') gives the wrong number of groups
In [366]: df.resample('2D', how='sum')
Out[366]:
0
2000-01-01 2
2000-01-03 2
2000-01-05 1
Using df.resample('3D', how='sum') gives the right number of groups, but the
second group starts at 2000-01-04 which does not evenly divide the DataFrame
into two equally-spaced groups:
In [367]: df.resample('3D', how='sum')
Out[367]:
0
2000-01-01 3
2000-01-04 2
To do better, we need to work at a finer time resolution than in days. Since Timedeltas have a total_seconds method, let's work in seconds. So for the example above, the desired frequency string would be
In [374]: df.resample('216000S', how='sum')
Out[374]:
0
2000-01-01 00:00:00 3
2000-01-03 12:00:00 2
since there are 216000*2 seconds in 5 days:
In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1S'))/2
Out[373]: 216000.0
Okay, so now all we need is a way to generalize this. We'll want the minimum and maximum dates in the index:
first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
We add an extra day because it makes the difference in days come out right. In
the example above, there are only 4 days between the Timestamps for 2000-01-05
and 2000-01-01:
In [377]: (pd.Timestamp('2000-01-05')-pd.Timestamp('2000-01-01')).days
Out[377]: 4
But as we can see in the worked example, the DataFrame has 5 rows representing 5
days. So it makes sense that we need to add an extra day.
Now we can compute the correct number of seconds in each equally-spaced group with:
secs = int((last-first).total_seconds()//n)
Here is one way to ensure equal-sized sub-periods: use np.linspace() on the pd.Timedelta values and then classify each observation into bins using pd.cut.
import pandas as pd
import numpy as np
# generate artificial data
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8H'))
Out[87]:
A B
2015-01-01 00:00:00 1.7641 0.4002
2015-01-01 08:00:00 0.9787 2.2409
2015-01-01 16:00:00 1.8676 -0.9773
2015-01-02 00:00:00 0.9501 -0.1514
2015-01-02 08:00:00 -0.1032 0.4106
2015-01-02 16:00:00 0.1440 1.4543
2015-01-03 00:00:00 0.7610 0.1217
2015-01-03 08:00:00 0.4439 0.3337
2015-01-03 16:00:00 1.4941 -0.2052
2015-01-04 00:00:00 0.3131 -0.8541
2015-01-04 08:00:00 -2.5530 0.6536
2015-01-04 16:00:00 0.8644 -0.7422
2015-01-05 00:00:00 2.2698 -1.4544
2015-01-05 08:00:00 0.0458 -0.1872
2015-01-05 16:00:00 1.5328 1.4694
... ... ...
2015-01-29 08:00:00 0.9209 0.3187
2015-01-29 16:00:00 0.8568 -0.6510
2015-01-30 00:00:00 -1.0342 0.6816
2015-01-30 08:00:00 -0.8034 -0.6895
2015-01-30 16:00:00 -0.4555 0.0175
2015-01-31 00:00:00 -0.3540 -1.3750
2015-01-31 08:00:00 -0.6436 -2.2234
2015-01-31 16:00:00 0.6252 -1.6021
2015-02-01 00:00:00 -1.1044 0.0522
2015-02-01 08:00:00 -0.7396 1.5430
2015-02-01 16:00:00 -1.2929 0.2671
2015-02-02 00:00:00 -0.0393 -1.1681
2015-02-02 08:00:00 0.5233 -0.1715
2015-02-02 16:00:00 0.7718 0.8235
2015-02-03 00:00:00 2.1632 1.3365
[100 rows x 2 columns]
# cutoff points: 10 equal-sized groups require 11 points
# measured as timedeltas in hours
time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
n = 10
ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n+1)
# labels, time index
time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta)+'h') for time_delta in ts_cutoff])
# create a categorical reference variable
df['start_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[:-1])
# for clarity, reassign labels using end-period index
df['end_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[1:])
Out[89]:
A B start_time_index end_time_index
2015-01-01 00:00:00 1.7641 0.4002 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 08:00:00 0.9787 2.2409 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 16:00:00 1.8676 -0.9773 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 00:00:00 0.9501 -0.1514 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 08:00:00 -0.1032 0.4106 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 16:00:00 0.1440 1.4543 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 00:00:00 0.7610 0.1217 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 08:00:00 0.4439 0.3337 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 16:00:00 1.4941 -0.2052 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 00:00:00 0.3131 -0.8541 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 08:00:00 -2.5530 0.6536 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-04 16:00:00 0.8644 -0.7422 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 00:00:00 2.2698 -1.4544 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 08:00:00 0.0458 -0.1872 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 16:00:00 1.5328 1.4694 2015-01-04 07:12:00 2015-01-07 14:24:00
... ... ... ... ...
2015-01-29 08:00:00 0.9209 0.3187 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-29 16:00:00 0.8568 -0.6510 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 00:00:00 -1.0342 0.6816 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 08:00:00 -0.8034 -0.6895 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 16:00:00 -0.4555 0.0175 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-31 00:00:00 -0.3540 -1.3750 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 08:00:00 -0.6436 -2.2234 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 16:00:00 0.6252 -1.6021 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 00:00:00 -1.1044 0.0522 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 08:00:00 -0.7396 1.5430 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 16:00:00 -1.2929 0.2671 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 00:00:00 -0.0393 -1.1681 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 08:00:00 0.5233 -0.1715 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 16:00:00 0.7718 0.8235 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-03 00:00:00 2.1632 1.3365 2015-01-30 16:48:00 2015-02-03 00:00:00
[100 rows x 4 columns]
df.groupby('start_time_index').agg('sum')
Out[90]:
A B
start_time_index
2015-01-01 00:00:00 8.6133 2.7734
2015-01-04 07:12:00 1.9220 -0.8069
2015-01-07 14:24:00 -8.1334 0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00 1.1957 7.2285
2015-01-17 12:00:00 3.2485 6.6841
2015-01-20 19:12:00 -0.8903 2.2802
2015-01-24 02:24:00 -2.1025 1.3800
2015-01-27 09:36:00 -1.1017 1.3108
2015-01-30 16:48:00 -0.0902 -2.5178
Another, potentially shorter, way to do this is to specify your sampling freq as the time delta. But the problem, as shown below, is that it delivers 11 sub-samples instead of 10. I believe the reason is that resample implements a left-inclusive/right-exclusive (or left-exclusive/right-inclusive) sub-sampling scheme, so the very last obs at '2015-02-03 00:00:00' is considered a separate group. If we use pd.cut to do it ourselves, we can specify include_lowest=True so that it gives us exactly 10 sub-samples rather than 11 (see the sketch after the output below).
n = 10
time_delta_str = str((df.index[-1] - df.index[0]) / (pd.Timedelta('1s') * n)) + 's'
df.resample(pd.Timedelta(time_delta_str), how='sum')
Out[114]:
A B
2015-01-01 00:00:00 8.6133 2.7734
2015-01-04 07:12:00 1.9220 -0.8069
2015-01-07 14:24:00 -8.1334 0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00 1.1957 7.2285
2015-01-17 12:00:00 3.2485 6.6841
2015-01-20 19:12:00 -0.8903 2.2802
2015-01-24 02:24:00 -2.1025 1.3800
2015-01-27 09:36:00 -1.1017 1.3108
2015-01-30 16:48:00 -2.2534 -3.8543
2015-02-03 00:00:00 2.1632 1.3365
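As a concrete illustration of the include_lowest point above, a minimal sketch reusing the n and time_delta_in_hours objects defined earlier: passing the explicit edges to pd.cut with include_lowest=True keeps the first observation in the first bin and yields exactly n groups:
edges = np.linspace(0, time_delta_in_hours[-1], n + 1)    # n+1 cut points
bins = pd.cut(time_delta_in_hours, bins=edges, include_lowest=True)
print (df[['A', 'B']].groupby(bins).sum())                # exactly n groups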
