I have a DataFrame in pandas like this
1. 2013-10-09 09:00:05
2. 2013-10-09 09:01:00
3. 2013-10-09 09:02:00
4. ............
5. ............
6. ............
7. 2013-10-10 09:15:05
8. 2013-10-10 09:16:00
9. 2013-10-10 09:17:00
I would like to reduce the size of the DataFrame by averaging every 5 minutes of data into one data point, like this
1. 2013-10-09 09:05:00
2. 2013-10-09 09:10:00
3. 2013-10-09 09:15:00
Can someone help me with this?
you may want to look at pandas resample:
df['Data'].resample('5Min').mean()
In old versions of pandas this was spelled df['Data'].resample('5Min', how='mean'), with how='mean' as the default, but the how argument has since been removed in favor of calling the aggregation method directly.
For example:
>>> rng = pd.date_range('1/1/2012', periods=10, freq='Min')
>>> df = pd.DataFrame({'Data':np.random.randint(0, 500, len(rng))}, index=rng)
>>> df
Data
2012-01-01 00:00:00 488
2012-01-01 00:01:00 172
2012-01-01 00:02:00 276
2012-01-01 00:03:00 5
2012-01-01 00:04:00 233
2012-01-01 00:05:00 266
2012-01-01 00:06:00 103
2012-01-01 00:07:00 40
2012-01-01 00:08:00 274
2012-01-01 00:09:00 494
>>>
>>> df['Data'].resample('5Min').mean()
2012-01-01 00:00:00 234.8
2012-01-01 00:05:00 235.4
You can find more examples in the pandas resampling documentation.
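If the timestamps live in a regular column rather than in the index, resample also takes an on= argument. A minimal sketch, assuming the column is named 'timestamp' and the values are in a 'Data' column as above:
df.resample('5Min', on='timestamp')['Data'].mean()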
Related
I have a dataframe that contains some NaT values.
Date Value
6312957 2012-01-01 23:58:00 -49
6312958 2012-01-01 23:59:00 -49
6312959 NaT -48
6312960 2012-01-02 00:01:00 -47
6312961 2012-01-02 00:02:00 -46
I tried to replace these NaT values by adding a minute to the previous entry.
indices_of_NAT = np.flatnonzero(pd.isna(df.loc[:, "Date"]))
df.loc[indices_of_NAT, "Date"] = df.loc[indices_of_NAT - 1, "Date"] + pd.Timedelta(minutes=1)
This produces the correct timestamps and indices, which I checked manually. The only problem is that they don't replace the NaT values for whatever reason. I wonder if something goes wrong with the indexing in my last line of code. Is there something obvious I am missing?
You can fillna with the shifted values + 1 min:
df['Date'] = df['Date'].fillna(df['Date'].shift().add(pd.Timedelta('1min')))
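As for why your original assignment doesn't work: np.flatnonzero returns positions while df.loc selects by label (your index starts at 6312957), and on top of that the assignment aligns the right-hand Series on its index labels, so the values computed for rows indices_of_NAT - 1 never land on rows indices_of_NAT. Stripping the labels from the right-hand side sidesteps the alignment. A minimal sketch on a small frame with a default RangeIndex:
import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2012-01-01 23:58:00",
                                           "2012-01-01 23:59:00",
                                           None,
                                           "2012-01-02 00:01:00"]),
                   "Value": [-49, -49, -48, -47]})

idx = np.flatnonzero(df["Date"].isna())
# .to_numpy() drops the index labels, so the values are written positionally
df.iloc[idx, df.columns.get_loc("Date")] = (
    df["Date"].iloc[idx - 1] + pd.Timedelta(minutes=1)
).to_numpy()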
Another method is to interpolate. For this you need to temporarily convert the dates to numbers. This way you can fill more than one gap, the increment is calculated automatically, and there are many nice interpolation methods (see the interpolate documentation):
df['Date'] = pd.to_datetime(pd.to_numeric(df['Date'])
                            .mask(df['Date'].isna())
                            .interpolate('linear'))
Example:
Date Value shift interpolate
0 2012-01-01 23:58:00 -49 2012-01-01 23:58:00 2012-01-01 23:58:00
1 2012-01-01 23:59:00 -49 2012-01-01 23:59:00 2012-01-01 23:59:00
2 NaT -48 2012-01-02 00:00:00 2012-01-02 00:00:00
3 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
4 NaT -48 2012-01-02 00:02:00 2012-01-02 00:01:20
5 NaT -48 NaT 2012-01-02 00:01:40
6 2012-01-02 00:02:00 -46 2012-01-02 00:02:00 2012-01-02 00:02:00
Use Series.fillna with the shifted values plus 1 minute:
df['Date'] = df['Date'].fillna(df['Date'].shift() + pd.Timedelta(minutes=1))
Or forward fill the missing values and add 1 minute:
df['Date'] = df['Date'].fillna(df['Date'].ffill() + pd.Timedelta(minutes=1))
You can see the difference with other data: shift() only fills a NaT that directly follows a valid timestamp, while ffill() also fills runs of consecutive NaT values (compare row 6312963):
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = df['Date'].fillna(df['Date'].shift() + pd.Timedelta(minutes=1))
df['Date2'] = df['Date'].fillna(df['Date'].ffill() + pd.Timedelta(minutes=1))
print (df)
Date Value Date1 Date2
6312957 2012-01-01 23:58:00 -49 2012-01-01 23:58:00 2012-01-01 23:58:00
6312958 2012-01-01 23:59:00 -49 2012-01-01 23:59:00 2012-01-01 23:59:00
6312959 NaT -48 2012-01-02 00:00:00 2012-01-02 00:00:00
6312960 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
6312961 2012-01-02 00:02:00 -46 2012-01-02 00:02:00 2012-01-02 00:02:00
6312962 NaT -47 2012-01-02 00:03:00 2012-01-02 00:03:00
6312963 NaT -47 NaT 2012-01-02 00:03:00
6312967 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
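Both one-liners share a limitation visible in the rows above: shift() leaves the second of two consecutive NaT values unfilled, and ffill() gives both the same timestamp. If each consecutive NaT should get its own increment (+1 min, +2 min, ...), here is a sketch of one way, where na and run are hypothetical helper names:
na = df['Date'].isna()
# position of each row within its run of consecutive NaT values (0 for valid rows)
run = na.astype(int).groupby((~na).cumsum()).cumsum()
df['Date'] = df['Date'].ffill() + pd.to_timedelta(run, unit='min')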
Given a dataframe of timestamp data, I would like to compute the median of a certain variable over the past 4-6 days.
The median over the past 1-3 days can be computed with pandas.DataFrame.rolling, but I couldn't find how to use rolling to compute the median over the past 4-6 days.
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='6H')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
np.random.seed(1)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
Data looks like this. In my real data, there are gaps in time and maybe more data points in one day.
timestamp var
0 2011-01-01 00:00:00 1.624345
1 2011-01-01 06:00:00 -0.611756
2 2011-01-01 12:00:00 -0.528172
3 2011-01-01 18:00:00 -1.072969
4 2011-01-02 00:00:00 0.865408
5 2011-01-02 06:00:00 -2.301539
6 2011-01-02 12:00:00 1.744812
7 2011-01-02 18:00:00 -0.761207
8 2011-01-03 00:00:00 0.319039
9 2011-01-03 06:00:00 -0.249370
10 2011-01-03 12:00:00 1.462108
Desired output:
timestamp var past4d-6d_var_median
0 2011-01-01 00:00:00 1.624345 NaN # no data in past 4-6 days
1 2011-01-01 06:00:00 -0.611756 NaN # no data in past 4-6 days
2 2011-01-01 12:00:00 -0.528172 NaN # no data in past 4-6 days
3 2011-01-01 18:00:00 -1.072969 NaN # no data in past 4-6 days
4 2011-01-02 00:00:00 0.865408 NaN # no data in past 4-6 days
5 2011-01-02 06:00:00 -2.301539 NaN # no data in past 4-6 days
6 2011-01-02 12:00:00 1.744812 NaN # no data in past 4-6 days
7 2011-01-02 18:00:00 -0.761207 NaN # no data in past 4-6 days
8 2011-01-03 00:00:00 0.319039 NaN # no data in past 4-6 days
9 2011-01-03 06:00:00 -0.249370 NaN # no data in past 4-6 days
10 2011-01-03 12:00:00 1.462108 NaN # no data in past 4-6 days
11 2011-01-03 18:00:00 -2.060141 NaN # no data in past 4-6 days
12 2011-01-04 00:00:00 -0.322417 NaN # no data in past 4-6 days
13 2011-01-04 06:00:00 -0.384054 NaN # no data in past 4-6 days
14 2011-01-04 12:00:00 1.133769 NaN # no data in past 4-6 days
15 2011-01-04 18:00:00 -1.099891 NaN # no data in past 4-6 days
16 2011-01-05 00:00:00 -0.172428 NaN # only 4 data in past 4-6 days
17 2011-01-05 06:00:00 -0.877858 -0.528172
18 2011-01-05 12:00:00 0.042214 -0.569964
19 2011-01-05 18:00:00 0.582815 -0.528172
20 2011-01-06 00:00:00 -1.100619 -0.569964
21 2011-01-06 06:00:00 1.144724 -0.528172
22 2011-01-06 12:00:00 0.901591 -0.388771
23 2011-01-06 18:00:00 0.502494 -0.249370
My current code:
def findPastVar2(df, var='var', window=3, method='median'):
    # window = number of past days
    for i in range(len(df)):
        pastVar2 = df[var].loc[(df['timestamp'] - df['timestamp'].loc[i] < datetime.timedelta(days=-window)) &
                               (df['timestamp'] - df['timestamp'].loc[i] >= datetime.timedelta(days=-window*2))]
        if pastVar2.shape[0] >= 5:  # at least 5 data points
            if method == 'median':
                df.loc[i, 'past{}d-{}d_{}_median'.format(window+1, window*2, var)] = np.median(pastVar2.values)
    return df
Current speed:
In [35]: %timeit df2 = findPastVar2(df)
1 loop, best of 3: 821 ms per loop
I edited the post so that I can clearly show my expected output with at least 5 data points. I've set the random seed so that everyone should get the same input and be able to reproduce the same output. As far as I know, simple rolling and shift do not work when there are multiple data points in the same day.
here we go:
df.set_index('timestamp', inplace=True)
# 3-day rolling median, pushed 4 days forward in time so that each row
# sees the median of data from roughly 4-6 days in its past
df['var'] = df['var'].rolling('3D', min_periods=3).median().shift(freq=pd.Timedelta('4d')).shift(-1)
df['var']
Out[55]:
timestamp
2011-01-01 00:00:00 NaN
2011-01-01 06:00:00 NaN
2011-01-01 12:00:00 NaN
2011-01-01 18:00:00 NaN
2011-01-02 00:00:00 NaN
2011-01-02 06:00:00 NaN
2011-01-02 12:00:00 NaN
2011-01-02 18:00:00 NaN
2011-01-03 00:00:00 NaN
2011-01-03 06:00:00 NaN
2011-01-03 12:00:00 NaN
2011-01-03 18:00:00 NaN
2011-01-04 00:00:00 NaN
2011-01-04 06:00:00 NaN
2011-01-04 12:00:00 NaN
2011-01-04 18:00:00 NaN
2011-01-05 00:00:00 NaN
2011-01-05 06:00:00 -0.528172
2011-01-05 12:00:00 -0.569964
2011-01-05 18:00:00 -0.528172
2011-01-06 00:00:00 -0.569964
2011-01-06 06:00:00 -0.528172
2011-01-06 12:00:00 -0.569964
2011-01-06 18:00:00 -0.528172
2011-01-07 00:00:00 -0.388771
2011-01-07 06:00:00 -0.249370
2011-01-07 12:00:00 -0.388771
Set up this way, each row of an irregular time series gets a window of a different width, which requires an iterative approach like the one you started. But if we make the timestamps the index:
# setup the df:
df = pd.DataFrame(index = pd.date_range('1/1/2011', periods=100, freq='12H'))
df['var'] = np.random.randn(len(df))
In this case I chose an interval of every 12 hours, but it could be whatever is available, even irregular. Using a modified function with a window for the median, along with an offset (here, a positive Delta looks backwards), gives you the flexibility you wanted:
def GetMedian(df, var='var', window='2D', Delta='3D'):
    for Ti in df.index:
        Vals = df[(df.index < Ti - pd.Timedelta(Delta)) &
                  (df.index > Ti - pd.Timedelta(Delta) - pd.Timedelta(window))]
        df.loc[Ti, 'Medians'] = Vals[var].median()
    return df
This runs substantially faster:
%timeit GetMedian(df)
84.8 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The min_periods should be 2 instead of 5, because the window size itself should not be counted in: 5 - 3 = 2.
import pandas as pd
import numpy as np
import datetime
np.random.seed(1) # set random seed for easier comparison
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
def first():
    df['past4d-6d_var_median'] = [np.nan]*3 + df['var'].rolling(window=3, min_periods=2).median()[:-3].tolist()
    return df
%timeit -n1000 first()
1000 loops, best of 3: 6.23 ms per loop
My first try didn't use shift(), but then I saw Noobie's answer.
I made the following version with shift(), which is much faster than the previous one.
def test():
    df['past4d-6d_var_median'] = df['var'].rolling(window=3, min_periods=2).median().shift(3)
    return df
%timeit -n1000 test()
1000 loops, best of 3: 1.66 ms per loop
The second one is around 4 times as fast as the first one.
These two functions create the same result, which looks like this:
df2 = test()
df2
timestamp var past4d-6d_var_median
0 2011-01-01 00:00:00 1.624345 NaN
1 2011-01-02 00:00:00 -0.611756 NaN
2 2011-01-03 00:00:00 -0.528172 NaN
3 2011-01-04 00:00:00 -1.072969 NaN
4 2011-01-05 00:00:00 0.865408 0.506294
5 2011-01-06 00:00:00 -2.301539 -0.528172
6 2011-01-07 00:00:00 1.744812 -0.611756
... ... ... ...
93 2011-04-04 00:00:00 -0.638730 1.129484
94 2011-04-05 00:00:00 0.423494 1.129484
95 2011-04-06 00:00:00 0.077340 0.185156
96 2011-04-07 00:00:00 -0.343854 -0.375285
97 2011-04-08 00:00:00 0.043597 -0.375285
98 2011-04-09 00:00:00 -0.620001 0.077340
99 2011-04-10 00:00:00 0.698032 0.077340
I have a DataFrame with data similar to the following
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta
df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
                       for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5-minute period over the course of a 24-hour day, without considering which day those values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
    aggregates.append(
        (
            start_time,
            df.between_time(start_time.time(),
                            (start_time + timedelta(minutes=5)).time(),
                            inclusive='left').value.mean()
        )
    )
    start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to 5-minute intervals, then group on the time attribute and aggregate the mean.
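If you prefer group keys that sort and plot naturally, an equivalent sketch groups by the time of day expressed as a timedelta instead of datetime.time objects (tod is a hypothetical name, assuming the same df):
tod = (df.index - df.index.normalize()).floor('5min')  # time since midnight, floored
df.groupby(tod).mean()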
I've got a large dataframe with a datetime index and need to resample data to exactly 10 equally sized periods.
So far, I've tried finding the first and last dates to determine the total number of days in the data, dividing that by 10 to determine the size of each period, and then resampling using that number of days, e.g.:
first = df.reset_index().timesubmit.min()
last = df.reset_index().timesubmit.max()
periodsize = str((last-first).days/10) + 'D'
df.resample(periodsize).sum()
This doesn't guarantee exactly 10 periods in the df after resampling, since the period size is a rounded-down int. Using a float doesn't work in the resampling. It seems that either there's something simple I'm missing here, or I'm attacking the problem all wrong.
import numpy as np
import pandas as pd
n = 10
nrows = 33
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
print(df)
# 0
# 2000-01-01 1
# 2000-01-02 1
# ...
# 2000-02-01 1
# 2000-02-02 1
first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
secs = int((last-first).total_seconds()//n)
periodsize = '{:d}S'.format(secs)
result = df.resample(periodsize).sum()
print('\n{}'.format(result))
assert len(result) == n
yields
0
2000-01-01 00:00:00 4
2000-01-04 07:12:00 3
2000-01-07 14:24:00 3
2000-01-10 21:36:00 4
2000-01-14 04:48:00 3
2000-01-17 12:00:00 3
2000-01-20 19:12:00 4
2000-01-24 02:24:00 3
2000-01-27 09:36:00 3
2000-01-30 16:48:00 3
The values in the 0-column indicate the number of rows that were aggregated, since the original DataFrame was filled with values of 1. The pattern of 4's and 3's is about as even as you can get since 33 rows can not be evenly grouped into 10 groups.
Explanation: Consider this simpler DataFrame:
n = 2
nrows = 5
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
# 0
# 2000-01-01 1
# 2000-01-02 1
# 2000-01-03 1
# 2000-01-04 1
# 2000-01-05 1
Using df.resample('2D').sum() gives the wrong number of groups:
In [366]: df.resample('2D').sum()
Out[366]:
0
2000-01-01 2
2000-01-03 2
2000-01-05 1
Using df.resample('3D').sum() gives the right number of groups, but the
second group starts at 2000-01-04, which does not evenly divide the DataFrame
into two equally-spaced groups:
In [367]: df.resample('3D').sum()
Out[367]:
0
2000-01-01 3
2000-01-04 2
To do better, we need to work at a finer time resolution than in days. Since Timedeltas have a total_seconds method, let's work in seconds. So for the example above, the desired frequency string would be
In [374]: df.resample('216000S').sum()
Out[374]:
0
2000-01-01 00:00:00 3
2000-01-03 12:00:00 2
since there are 216000*2 seconds in 5 days:
In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1S'))/2
Out[373]: 216000.0
Okay, so now all we need is a way to generalize this. We'll want the minimum and maximum dates in the index:
first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
We add an extra day because it makes the difference in days come out right. In
the example above, there are only 4 days between the Timestamps 2000-01-01 and
2000-01-05:
In [377]: (pd.Timestamp('2000-01-05')-pd.Timestamp('2000-01-01')).days
Out[377]: 4
But as we can see in the worked example, the DataFrame has 5 rows representing 5
days. So it makes sense that we need to add an extra day.
Now we can compute the correct number of seconds in each equally-spaced group with:
secs = int((last-first).total_seconds()//n)
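Putting the pieces together (these are the same closing lines as in the full example at the top):
periodsize = '{:d}S'.format(secs)
result = df.resample(periodsize).sum()
assert len(result) == n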
Here is one way to ensure equal-size sub-periods: use np.linspace() on the elapsed time to build cutoff points, then classify each observation into a bin using pd.cut.
import pandas as pd
import numpy as np
# generate artificial data
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8H'))
Out[87]:
A B
2015-01-01 00:00:00 1.7641 0.4002
2015-01-01 08:00:00 0.9787 2.2409
2015-01-01 16:00:00 1.8676 -0.9773
2015-01-02 00:00:00 0.9501 -0.1514
2015-01-02 08:00:00 -0.1032 0.4106
2015-01-02 16:00:00 0.1440 1.4543
2015-01-03 00:00:00 0.7610 0.1217
2015-01-03 08:00:00 0.4439 0.3337
2015-01-03 16:00:00 1.4941 -0.2052
2015-01-04 00:00:00 0.3131 -0.8541
2015-01-04 08:00:00 -2.5530 0.6536
2015-01-04 16:00:00 0.8644 -0.7422
2015-01-05 00:00:00 2.2698 -1.4544
2015-01-05 08:00:00 0.0458 -0.1872
2015-01-05 16:00:00 1.5328 1.4694
... ... ...
2015-01-29 08:00:00 0.9209 0.3187
2015-01-29 16:00:00 0.8568 -0.6510
2015-01-30 00:00:00 -1.0342 0.6816
2015-01-30 08:00:00 -0.8034 -0.6895
2015-01-30 16:00:00 -0.4555 0.0175
2015-01-31 00:00:00 -0.3540 -1.3750
2015-01-31 08:00:00 -0.6436 -2.2234
2015-01-31 16:00:00 0.6252 -1.6021
2015-02-01 00:00:00 -1.1044 0.0522
2015-02-01 08:00:00 -0.7396 1.5430
2015-02-01 16:00:00 -1.2929 0.2671
2015-02-02 00:00:00 -0.0393 -1.1681
2015-02-02 08:00:00 0.5233 -0.1715
2015-02-02 16:00:00 0.7718 0.8235
2015-02-03 00:00:00 2.1632 1.3365
[100 rows x 2 columns]
# cutoff points: 10 equal-size groups require 11 points
# measure elapsed time as a multiple of 1 hour
time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
n = 10
ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n+1)
# labels, time index
time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta)+'h') for time_delta in ts_cutoff])
# create a categorical reference variables
df['start_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[:-1])
# for clarity, reassign labels using end-period index
df['end_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[1:])
Out[89]:
A B start_time_index end_time_index
2015-01-01 00:00:00 1.7641 0.4002 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 08:00:00 0.9787 2.2409 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 16:00:00 1.8676 -0.9773 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 00:00:00 0.9501 -0.1514 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 08:00:00 -0.1032 0.4106 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 16:00:00 0.1440 1.4543 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 00:00:00 0.7610 0.1217 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 08:00:00 0.4439 0.3337 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 16:00:00 1.4941 -0.2052 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 00:00:00 0.3131 -0.8541 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 08:00:00 -2.5530 0.6536 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-04 16:00:00 0.8644 -0.7422 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 00:00:00 2.2698 -1.4544 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 08:00:00 0.0458 -0.1872 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 16:00:00 1.5328 1.4694 2015-01-04 07:12:00 2015-01-07 14:24:00
... ... ... ... ...
2015-01-29 08:00:00 0.9209 0.3187 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-29 16:00:00 0.8568 -0.6510 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 00:00:00 -1.0342 0.6816 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 08:00:00 -0.8034 -0.6895 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 16:00:00 -0.4555 0.0175 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-31 00:00:00 -0.3540 -1.3750 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 08:00:00 -0.6436 -2.2234 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 16:00:00 0.6252 -1.6021 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 00:00:00 -1.1044 0.0522 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 08:00:00 -0.7396 1.5430 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 16:00:00 -1.2929 0.2671 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 00:00:00 -0.0393 -1.1681 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 08:00:00 0.5233 -0.1715 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 16:00:00 0.7718 0.8235 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-03 00:00:00 2.1632 1.3365 2015-01-30 16:48:00 2015-02-03 00:00:00
[100 rows x 4 columns]
df.groupby('start_time_index').agg('sum')
Out[90]:
A B
start_time_index
2015-01-01 00:00:00 8.6133 2.7734
2015-01-04 07:12:00 1.9220 -0.8069
2015-01-07 14:24:00 -8.1334 0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00 1.1957 7.2285
2015-01-17 12:00:00 3.2485 6.6841
2015-01-20 19:12:00 -0.8903 2.2802
2015-01-24 02:24:00 -2.1025 1.3800
2015-01-27 09:36:00 -1.1017 1.3108
2015-01-30 16:48:00 -0.0902 -2.5178
Another, potentially shorter, way to do this is to specify your sampling frequency as the time delta. But the problem, as shown below, is that it delivers 11 sub-samples instead of 10. I believe the reason is that resample implements a left-inclusive/right-exclusive (or left-exclusive/right-inclusive) sub-sampling scheme, so the very last observation at '2015-02-03 00:00:00' is considered a separate group. If we use pd.cut to do it ourselves, we can specify include_lowest=True so that it gives us exactly 10 sub-samples rather than 11.
n = 10
time_delta_str = str((df.index[-1] - df.index[0]) / (pd.Timedelta('1s') * n)) + 's'
df.resample(pd.Timedelta(time_delta_str)).sum()
Out[114]:
A B
2015-01-01 00:00:00 8.6133 2.7734
2015-01-04 07:12:00 1.9220 -0.8069
2015-01-07 14:24:00 -8.1334 0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00 1.1957 7.2285
2015-01-17 12:00:00 3.2485 6.6841
2015-01-20 19:12:00 -0.8903 2.2802
2015-01-24 02:24:00 -2.1025 1.3800
2015-01-27 09:36:00 -1.1017 1.3108
2015-01-30 16:48:00 -2.2534 -3.8543
2015-02-03 00:00:00 2.1632 1.3365
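For comparison, a sketch of the pd.cut route with explicit bin edges: include_lowest=True keeps the very first observation inside the first bin, so exactly n groups come out (edges and groups10 are hypothetical names; time_delta_in_hours and n are defined above):
edges = np.linspace(0, time_delta_in_hours[-1], n + 1)
groups10 = pd.cut(time_delta_in_hours, bins=edges, include_lowest=True)
df[['A', 'B']].groupby(groups10).sum()  # exactly 10 groups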
My pandas DataFrame looks something like this
1. 2013-10-09 09:00:05
2. 2013-10-09 09:05:00
3. 2013-10-09 10:00:00
4. ............
5. ............
6. ............
7. 2013-10-10 09:00:05
8. 2013-10-10 09:05:00
9. 2013-10-10 10:00:00
I want the data lying between hours 9 and 10. If anyone has worked on something like this, it would be really helpful.
In [7]: index = pd.date_range('20131009 08:30', '20131010 10:05', freq='5T')
In [8]: df = pd.DataFrame(np.random.randn(len(index), 2), columns=list('AB'), index=index)
In [9]: df
Out[9]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 308 entries, 2013-10-09 08:30:00 to 2013-10-10 10:05:00
Freq: 5T
Data columns (total 2 columns):
A 308 non-null values
B 308 non-null values
dtypes: float64(2)
In [10]: df.between_time('9:00','10:00')
Out[10]:
A B
2013-10-09 09:00:00 -0.664639 1.597453
2013-10-09 09:05:00 1.197290 -0.500621
2013-10-09 09:10:00 1.470186 -0.963553
2013-10-09 09:15:00 0.181314 -0.242415
2013-10-09 09:20:00 0.969427 -1.156609
2013-10-09 09:25:00 0.261473 0.413926
2013-10-09 09:30:00 -0.003698 0.054953
2013-10-09 09:35:00 0.418147 -0.417291
2013-10-09 09:40:00 0.413565 -1.096234
2013-10-09 09:45:00 0.460293 1.200277
2013-10-09 09:50:00 -0.702444 -0.041597
2013-10-09 09:55:00 0.548385 -0.832382
2013-10-09 10:00:00 -0.526582 0.758378
2013-10-10 09:00:00 0.926738 0.178204
2013-10-10 09:05:00 -1.178534 0.184205
2013-10-10 09:10:00 1.408258 0.948526
2013-10-10 09:15:00 0.523318 0.327390
2013-10-10 09:20:00 -0.193174 0.863294
2013-10-10 09:25:00 1.355610 -2.160864
2013-10-10 09:30:00 1.930622 0.174683
2013-10-10 09:35:00 0.273551 0.870682
2013-10-10 09:40:00 0.974756 -0.327763
2013-10-10 09:45:00 1.808285 0.080267
2013-10-10 09:50:00 0.842119 0.368689
2013-10-10 09:55:00 1.065585 0.802003
2013-10-10 10:00:00 -0.324894 0.781885
Make new columns for the time components by splitting your original column. Use the code below to split the time into hours, minutes, and seconds:
df[['h','m','s']] = df['Time'].astype(str).str.split(':', expand=True).astype(int)
Once you are done with that, select the data by filtering on the hour:
df9to10 = df[df['h'].between(9, 10, inclusive='both')]
And it's dynamic if you want a period other than between 9 and 10. Note that filtering on the hour alone keeps everything up to 10:59:59; use between_time if the range should end exactly at 10:00.
Another method that uses query. Tested with Python 3.9.
import pandas as pd
from pandas import Timestamp
from datetime import time
df = pd.DataFrame({"timestamp":
[Timestamp("2017-01-03 09:30:00.049"), Timestamp("2017-01-03 09:30:00.049"),
Timestamp("2017-12-29 16:12:34.214"), Timestamp("2017-12-29 16:17:19.006")]})
df["time"] = df.timestamp.dt.time
start_time = time(9,20,0)
end_time = time(10,0,0)
df_times = df.query("time >= @start_time and time <= @end_time")
In:
timestamp
2017-01-03 09:30:00.049
2017-01-03 09:30:00.049
2017-12-29 16:12:34.214
2017-12-29 16:17:19.006
Out:
timestamp time
2017-01-03 09:30:00.049 09:30:00.049000
2017-01-03 09:30:00.049 09:30:00.049000
As a bonus, arbitrarily complex expressions can be used within a query, e.g. selecting everything within two separate time ranges (this is impossible with between_time).
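For instance, a sketch selecting two separate windows in one expression, reusing df and the time column from above (the afternoon bounds are made up for illustration):
afternoon_start, afternoon_end = time(16, 0, 0), time(16, 30, 0)
df.query("(@start_time <= time <= @end_time) or "
         "(@afternoon_start <= time <= @afternoon_end)")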
Assuming your original dataframe is called "df" and your time column is called "time", this would work, where start_time and end_time correspond to the time interval you'd like:
>>> df_new = df[(df['time'] > start_time) & (df['time'] < end_time)]