How do I resample a pandas Series using values around the hour - python

I have time series data recorded at a 10-minute frequency and want to average the values at one-hour intervals. For each hour, I want to take the 3 values before the hour, the value on the hour, and the 2 values after it, average them, and assign the result to the exact hour timestamp.
For example, I have the series
index = pd.date_range('2000-01-01T00:30:00', periods=63, freq='10min')
series = pd.Series(range(63), index=index)
series
2000-01-01 00:30:00 0
2000-01-01 00:40:00 1
2000-01-01 00:50:00 2
2000-01-01 01:00:00 3
2000-01-01 01:10:00 4
2000-01-01 01:20:00 5
2000-01-01 01:30:00 6
2000-01-01 01:40:00 7
2000-01-01 01:50:00 8
2000-01-01 02:00:00 9
2000-01-01 02:10:00 10
..
2000-01-01 08:50:00 50
2000-01-01 09:00:00 51
2000-01-01 09:10:00 52
2000-01-01 09:20:00 53
2000-01-01 09:30:00 54
2000-01-01 09:40:00 55
2000-01-01 09:50:00 56
2000-01-01 10:00:00 57
2000-01-01 10:10:00 58
2000-01-01 10:20:00 59
2000-01-01 10:30:00 60
2000-01-01 10:40:00 61
2000-01-01 10:50:00 62
Freq: 10T, Length: 63, dtype: int64
So, if I do
series.resample('1H').mean()
2000-01-01 00:00:00 1.0
2000-01-01 01:00:00 5.5
2000-01-01 02:00:00 11.5
2000-01-01 03:00:00 17.5
2000-01-01 04:00:00 23.5
2000-01-01 05:00:00 29.5
2000-01-01 06:00:00 35.5
2000-01-01 07:00:00 41.5
2000-01-01 08:00:00 47.5
2000-01-01 09:00:00 53.5
2000-01-01 10:00:00 59.5
Freq: H, dtype: float64
the first value is the average of 0, 1, 2, and assigned to hour 0, the second the average of the values for 1:00:00 to 1:50:00 assigned to 1:00:00 and so on.
What I would like to have is the first average centered at 1:00:00 calculated using values from 00:30:00 through 01:20:00, the second centered at 02:00:00 calculated from 01:30:00 to 02:20:00 and so on...
What will be the best way to do that?
Thanks!

You should be able to do that with:
series.index = series.index - pd.Timedelta(30, unit='m')
series_grouped_mean = series.groupby(pd.Grouper(freq='60min')).mean()
series_grouped_mean.index = series_grouped_mean.index + pd.Timedelta(60, unit='m')
series_grouped_mean
I got:
2000-01-01 01:00:00 2.5
2000-01-01 02:00:00 8.5
2000-01-01 03:00:00 14.5
2000-01-01 04:00:00 20.5
2000-01-01 05:00:00 26.5
2000-01-01 06:00:00 32.5
2000-01-01 07:00:00 38.5
2000-01-01 08:00:00 44.5
2000-01-01 09:00:00 50.5
2000-01-01 10:00:00 56.5
2000-01-01 11:00:00 61.0
Freq: H, dtype: float64
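A possible alternative that leaves the original index untouched is sketched below. It assumes pandas 1.1 or later, where resample accepts an offset argument; the name centered is only illustrative. The bins are moved to start at hh:30, and the left-edge labels are then pushed forward by 30 minutes so they land on the hour.

```python
import pandas as pd

index = pd.date_range('2000-01-01T00:30:00', periods=63, freq='10min')
series = pd.Series(range(63), index=index)

# Start each hourly bin at hh:30 instead of hh:00 (offset= assumes pandas >= 1.1)
centered = series.resample('60min', offset='30min').mean()

# Bins are labelled by their left edge (hh:30); move each label to the following hour
centered.index = centered.index + pd.Timedelta(minutes=30)
print(centered)
```

Unlike the groupby version, this does not modify series.index in place, so series can still be used afterwards with its original timestamps.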

Related

resampling and appending to same dataframe

I have a dataframe which I want to resample, appending the results to the original dataframe as a new column.
What I have:
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
series
time value
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
What I want:
time value mean_resampled
2000-01-01 00:00:00 0. 2
2000-01-01 00:01:00 1. NaN
2000-01-01 00:02:00 2. NaN
2000-01-01 00:03:00 3. NaN
2000-01-01 00:04:00 4. NaN
2000-01-01 00:05:00 5. 6.5
2000-01-01 00:06:00 6. NaN
2000-01-01 00:07:00 7. NaN
2000-01-01 00:08:00 8. NaN
Note: resampling frequency is '5T'
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index, name='values')
sample = series.resample('5T').mean() # create a sample at some frequency
df = series.to_frame() # convert series to frame
df.loc[sample.index.values, 'mean_resampled'] = sample # use loc to assign new values
values mean_resampled
2000-01-01 00:00:00 0 2.0
2000-01-01 00:01:00 1 NaN
2000-01-01 00:02:00 2 NaN
2000-01-01 00:03:00 3 NaN
2000-01-01 00:04:00 4 NaN
2000-01-01 00:05:00 5 6.5
2000-01-01 00:06:00 6 NaN
2000-01-01 00:07:00 7 NaN
2000-01-01 00:08:00 8 NaN
Use resample to compute the mean and concat to merge your Series with new values.
>>> pd.concat([series, series.resample('5T').mean()], axis=1) \
.rename(columns={0: 'value', 1: 'mean_resampled'})
value mean_resampled
2000-01-01 00:00:00 0 2.0
2000-01-01 00:01:00 1 NaN
2000-01-01 00:02:00 2 NaN
2000-01-01 00:03:00 3 NaN
2000-01-01 00:04:00 4 NaN
2000-01-01 00:05:00 5 6.5
2000-01-01 00:06:00 6 NaN
2000-01-01 00:07:00 7 NaN
2000-01-01 00:08:00 8 NaN
If you have a DataFrame instead of a Series in your real case, you just have to add a new column:
>>> df['mean_resampled'] = df.resample('5T').mean()
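When the DataFrame has more than one column, assigning a whole resampled frame to a single column becomes ambiguous. A minimal sketch, assuming a hypothetical column named 'value', that selects the column explicitly before resampling:

```python
import pandas as pd

index = pd.date_range('2000-01-01', periods=9, freq='min')
df = pd.DataFrame({'value': range(9)}, index=index)

# Resample one column only; the assignment aligns on the index,
# so rows that are not bin labels are filled with NaN
df['mean_resampled'] = df['value'].resample('5min').mean()
print(df)
```

The result has values only at 00:00 and 00:05, as in the desired output above.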

While resampling, put NaN in the resulting value if there are missing values in the source interval

import pandas as pd
import numpy as np
rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts
2000-01-01 00:00:00 0.0
2000-01-01 00:01:00 1.0
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:00 4.0
2000-01-01 00:05:00 5.0
2000-01-01 00:06:00 6.0
2000-01-01 00:07:00 7.0
2000-01-01 00:08:00 8.0
2000-01-01 00:09:00 9.0
2000-01-01 00:10:00 10.0
2000-01-01 00:11:00 11.0
Freq: T, dtype: float64
ts.resample("5min").agg(pd.Series.sum, skipna=False)
2000-01-01 00:00:00 NaN
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 21.0
Freq: 5T, dtype: float64
So far so good. The problem is that in the last interval (00:10-00:15) it does output a value, because there's no NaN there. But I don't want it to, because some values are missing.
I could use min_count=5, but this wouldn't always work (e.g. if I'm aggregating daily to monthly, there's a variable number of source steps in each target step—some months have 28 days, some 29, some 30, some 31).
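You can reindex your time series so that it covers the whole of the last interval, using some simple arithmetic, and then resample as before.
For example,
freq = 5
add = freq - ts.index[-1].minute % freq  # minutes from the last timestamp to the start of the next interval
new_ts = ts.reindex(pd.date_range(ts.index[0],
                                  ts.index[-1] + pd.Timedelta(minutes=add-1),
                                  freq='T'))
new_ts.resample("5min").agg(pd.Series.sum, skipna=False)
which outputs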
2000-01-01 00:00:00 NaN
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 NaN
Freq: 5T, dtype: float64
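The same padding idea can be carried over to the daily-to-monthly case mentioned in the question, where the number of source steps per bin varies. The sketch below is only an illustration on made-up data (the names daily, full and monthly are assumptions): it extends the index to the last day of the final month, so the incomplete month contains NaN and its sum becomes NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data that stops partway through February 2000
days = pd.date_range("2000-01-01", "2000-02-10", freq="D")
daily = pd.Series(1.0, index=days)

# Pad the index so it covers every day of the first and last months
start = daily.index[0].to_period("M").start_time
end = daily.index[-1].to_period("M").end_time.normalize()
full = daily.reindex(pd.date_range(start, end, freq="D"))

# Sum per calendar month; any NaN (missing day) makes the whole month NaN
monthly = full.groupby(full.index.to_period("M")).apply(lambda s: s.sum(skipna=False))
print(monthly)
```

Grouping on a monthly period avoids having to know how many days each month has, since the padded index already contains every day.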

(Pandas) Different way of combining two dataframes

I was wondering if there's a better approach to combining two dataframes than what I did below.
import numpy as np
import pandas as pd
# create random data sets
N = 50
df = pd.DataFrame({'date': pd.date_range('2000-1-1', periods=N, freq='H'),
'value': np.random.random(N)})
index = pd.DatetimeIndex(df['date'])
peak_time = df.iloc[index.indexer_between_time('7:00','9:00')]
lunch_time = df.iloc[index.indexer_between_time('12:00','14:00')]
comb_data = pd.concat([peak_time, lunch_time], ignore_index=True)
Is there a way to combine two ranges when using between_time using logical operator?
I have to use that to make a new column in df called 'isPeak' where 1 is written when it's in range between 7:00 ~ 9:00 and also 12:00 ~ 14:00 and 0 if not.
For me, np.union1d works:
import numpy as np
idx = np.union1d(index.indexer_between_time('7:00','9:00'),
index.indexer_between_time('12:00','14:00'))
comb_data = df.iloc[idx]
print (comb_data)
date value
7 2000-01-01 07:00:00 0.760627
8 2000-01-01 08:00:00 0.236474
9 2000-01-01 09:00:00 0.626146
12 2000-01-01 12:00:00 0.625335
13 2000-01-01 13:00:00 0.793105
14 2000-01-01 14:00:00 0.706873
31 2000-01-02 07:00:00 0.113688
32 2000-01-02 08:00:00 0.035565
33 2000-01-02 09:00:00 0.230603
36 2000-01-02 12:00:00 0.423155
37 2000-01-02 13:00:00 0.947584
38 2000-01-02 14:00:00 0.226181
Alternative with numpy.r_:
idx = np.r_[index.indexer_between_time('7:00','9:00'),
index.indexer_between_time('12:00','14:00')]
comb_data = df.iloc[idx]
print (comb_data)
date value
7 2000-01-01 07:00:00 0.760627
8 2000-01-01 08:00:00 0.236474
9 2000-01-01 09:00:00 0.626146
31 2000-01-02 07:00:00 0.113688
32 2000-01-02 08:00:00 0.035565
33 2000-01-02 09:00:00 0.230603
12 2000-01-01 12:00:00 0.625335
13 2000-01-01 13:00:00 0.793105
14 2000-01-01 14:00:00 0.706873
36 2000-01-02 12:00:00 0.423155
37 2000-01-02 13:00:00 0.947584
38 2000-01-02 14:00:00 0.226181
Pure pandas solution with Index.union and convert array to index:
idx = (pd.Index(index.indexer_between_time('7:00','9:00'))
.union(pd.Index(index.indexer_between_time('12:00','14:00'))))
comb_data = df.iloc[idx]
print (comb_data)
date value
7 2000-01-01 07:00:00 0.760627
8 2000-01-01 08:00:00 0.236474
9 2000-01-01 09:00:00 0.626146
12 2000-01-01 12:00:00 0.625335
13 2000-01-01 13:00:00 0.793105
14 2000-01-01 14:00:00 0.706873
31 2000-01-02 07:00:00 0.113688
32 2000-01-02 08:00:00 0.035565
33 2000-01-02 09:00:00 0.230603
36 2000-01-02 12:00:00 0.423155
37 2000-01-02 13:00:00 0.947584
38 2000-01-02 14:00:00 0.226181
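For the 'isPeak' column mentioned in the question, the combined positions can be turned into a 0/1 flag. This is a minimal sketch under the question's setup; using np.isin for the membership test is my own choice here, not something taken from the answers above.

```python
import numpy as np
import pandas as pd

N = 50
df = pd.DataFrame({'date': pd.date_range('2000-01-01', periods=N, freq='60min'),
                   'value': np.random.random(N)})
index = pd.DatetimeIndex(df['date'])

# Integer positions of rows that fall in either time window (both ends inclusive)
idx = np.union1d(index.indexer_between_time('7:00', '9:00'),
                 index.indexer_between_time('12:00', '14:00'))

# 1 where the row position is in idx, 0 otherwise
df['isPeak'] = np.isin(np.arange(len(df)), idx).astype(int)
print(df[df['isPeak'] == 1])
```

Because the flag is built from positions rather than labels, it does not depend on the DataFrame having a default integer index.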

Resample pandas time series to have the end time stamp for the bin name?

I generate a sample 5 minute time series:
index = pd.date_range('1/1/2000', periods=10, freq='5T')
data=range(10)
ser = pd.Series(data, index=index)
What it looks like:
2000-01-01 00:00:00 0.0
2000-01-01 00:05:00 1.0
2000-01-01 00:10:00 2.0
2000-01-01 00:15:00 3.0
2000-01-01 00:20:00 4.0
2000-01-01 00:25:00 5.0
2000-01-01 00:30:00 6.0
2000-01-01 00:35:00 7.0
2000-01-01 00:40:00 8.0
2000-01-01 00:45:00 9.0
Freq: 5T, dtype: float64
What I need
I would like to turn this time series into a 15 minute one and have each 15 minute value be the mean of the 5 minute values observed in that 15 min period, i.e.
2000-01-01 00:15:00 2.0 # i.e. mean(1, 2, 3)
2000-01-01 00:30:00 5.0 # i.e. mean(4, 5, 6)
2000-01-01 00:45:00 8.0 # i.e. mean(7, 8, 9)
Things I tried
If I resample this data into 15 minute buckets and call mean i get:
ser.resample('15T').mean()
2000-01-01 00:00:00 1.0
2000-01-01 00:15:00 4.0
2000-01-01 00:30:00 7.0
2000-01-01 00:45:00 9.0
which is not computing the means I want. If I add closed='right' to the resample call I get closer to the values I want but the timestamps are not right.
ser.resample('15T', closed='right').mean()
1999-12-31 23:45:00 0.0
2000-01-01 00:00:00 2.0
2000-01-01 00:15:00 5.0
2000-01-01 00:30:00 8.0
Freq: 15T, dtype: float64
Any suggestions?
You can use the label argument in resample:
ser.resample('15T', label='right', closed='right').mean()
This shifts the label from the left edge (the default) to the right edge of the resampled window. It is more succinct than my somewhat clunky comment.
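A small runnable check of the label/closed combination on the question's data is sketched here. Note that the result still contains a leading bin labelled 00:00, which holds only the first observation; dropping it with iloc[1:] is one option if it is not wanted, though that is an assumption about what the asker needs.

```python
import pandas as pd

index = pd.date_range('2000-01-01', periods=10, freq='5min')
ser = pd.Series(range(10), index=index)

# Right-closed bins, labelled by their right edge
result = ser.resample('15min', label='right', closed='right').mean()
print(result)

# The first bin (23:45, 00:00] contains only the 00:00 value; drop it if unwanted
trimmed = result.iloc[1:]
print(trimmed)
```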

Grouping dates by 5 minute periods irrespective of day

I have a DataFrame with data similar to the following
import datetime
from datetime import timedelta
import numpy as np
import pandas as pd
df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5 minute period over the course of a 24 hour day - without considering what day those values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
    aggregates.append(
        (
            start_time,
            df.between_time(start_time.time(),
                            (start_time + timedelta(minutes=5)).time(),
                            include_end=False).value.mean()
        )
    )
    start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to 5-minute intervals, group on the time-of-day component of the floored timestamps, and aggregate with the mean.
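To see that the grouping really ignores the date, here is a small deterministic sketch. The data is made up for illustration (two days, with constant values 1.0 and 3.0, each timestamp shifted by 90 seconds), so every 5-minute slot should average to 2.0.

```python
import numpy as np
import pandas as pd

# Two days of 5-minute data: day one is all 1.0, day two is all 3.0
idx = pd.date_range('2016-01-02', periods=2 * 288, freq='5min')
df = pd.DataFrame({'value': np.repeat([1.0, 3.0], 288)}, index=idx)

# Shift timestamps off the 5-minute grid, similar to the jitter in the question
df.index = df.index + pd.Timedelta(seconds=90)

# Floor back to the grid, then group on time of day only
by_time = df.groupby(df.index.floor('5min').time).mean()
print(by_time.head())
```

Each of the 288 groups combines one row from each day, which is why the mean comes out as 2.0 throughout.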
