Problems resampling pandas time series with time gap between indices - python

I want to resample the pandas series
import pandas as pd
index_1 = pd.date_range('1/1/2000', periods=4, freq='T')
index_2 = pd.date_range('1/2/2000', periods=3, freq='T')
series = pd.Series(range(4), index=index_1)
series = pd.concat([series, pd.Series(range(3), index=index_2)])
print(series)
>>>2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 00:02:00 2
such that the resulting Series contains only every second entry, i.e.
>>>2000-01-01 00:00:00 0
2000-01-01 00:02:00 2
2000-01-02 00:00:00 0
2000-01-02 00:02:00 2
using the (poorly documented) resample method of pandas in the following way:
resampled_series = series.resample('2T', closed='right').mean()
print(resampled_series)
I get
>>>1999-12-31 23:58:00 0.0
2000-01-01 00:00:00 1.5
2000-01-01 00:02:00 3.0
2000-01-01 00:04:00 NaN
2000-01-01 00:56:00 NaN
...
2000-01-01 23:54:00 NaN
2000-01-01 23:56:00 NaN
2000-01-01 23:58:00 0.0
2000-01-02 00:00:00 1.5
2000-01-02 00:02:00 3.0
Why does it start 2 minutes earlier than the original series? Why does it contain all the time steps in between, which are not contained in the original series? How can I get my desired result?

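resample() is not the right function for your purpose: it builds a regular grid of bins over the whole time span and aggregates the values that fall into each bin, which is why the empty steps in between show up as NaN.
To keep only the rows at even minutes, try this: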
series[series.index.minute % 2 == 0]
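A minimal, self-contained sketch of the mask approach next to the resample call from the question, assuming a recent pandas (hence `pd.concat` and the `'min'` alias in place of `append` and `'T'`). With `closed='right'` the first timestamp, 00:00, belongs to the bin (23:58, 00:00], and that bin is labelled by its left edge, which is where the earlier start comes from.

```python
import pandas as pd

index_1 = pd.date_range("2000-01-01", periods=4, freq="min")
index_2 = pd.date_range("2000-01-02", periods=3, freq="min")
series = pd.concat([
    pd.Series(range(4), index=index_1),
    pd.Series(range(3), index=index_2),
])

# Boolean mask: keep only the rows whose timestamp falls on an even minute.
every_second = series[series.index.minute % 2 == 0]

# For contrast: resample lays a regular 2-minute grid over the whole span,
# so the gap between the two days is filled with NaN bins.
resampled = series.resample("2min", closed="right").mean()

print(every_second)
print(resampled.index[0])  # label of the first bin
```

The mask only selects existing rows, so no new timestamps are created and no aggregation takes place.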

Related

resampling and appending to same dataframe

I have a dataframe that I want to resample, appending the results to the original dataframe as a new column.
What I have:
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
series
time value
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
What I want:
time value mean_resampled
2000-01-01 00:00:00 0. 2
2000-01-01 00:01:00 1. NaN
2000-01-01 00:02:00 2. NaN
2000-01-01 00:03:00 3. NaN
2000-01-01 00:04:00 4. NaN
2000-01-01 00:05:00 5. 6.5
2000-01-01 00:06:00 6. NaN
2000-01-01 00:07:00 7. NaN
2000-01-01 00:08:00 8. NaN
Note: resampling frequency is '5T'
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index, name='values')
sample = series.resample('5T').mean() # create a sample at some frequency
df = series.to_frame() # convert series to frame
df.loc[sample.index.values, 'mean_resampled'] = sample # use loc to assign new values
values mean_resampled
2000-01-01 00:00:00 0 2.0
2000-01-01 00:01:00 1 NaN
2000-01-01 00:02:00 2 NaN
2000-01-01 00:03:00 3 NaN
2000-01-01 00:04:00 4 NaN
2000-01-01 00:05:00 5 6.5
2000-01-01 00:06:00 6 NaN
2000-01-01 00:07:00 7 NaN
2000-01-01 00:08:00 8 NaN
Use resample to compute the mean and concat to merge your Series with new values.
>>> pd.concat([series, series.resample('5T').mean()], axis=1) \
.rename(columns={0: 'value', 1: 'mean_resampled'})
value mean_resampled
2000-01-01 00:00:00 0 2.0
2000-01-01 00:01:00 1 NaN
2000-01-01 00:02:00 2 NaN
2000-01-01 00:03:00 3 NaN
2000-01-01 00:04:00 4 NaN
2000-01-01 00:05:00 5 6.5
2000-01-01 00:06:00 6 NaN
2000-01-01 00:07:00 7 NaN
2000-01-01 00:08:00 8 NaN
If you have a DataFrame instead of a Series in your real case, you just have to add a new column, selecting the column you want to aggregate (assumed here to be called 'value'):
>>> df['mean_resampled'] = df['value'].resample('5T').mean()
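A small runnable sketch of the DataFrame case, assuming a single data column named `value` (the name is only illustrative). Column assignment aligns on the index, so only the rows whose timestamp matches a bin label receive a value.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
df = pd.DataFrame({"value": range(9)}, index=index)

# 5-minute means, labelled by the left edge of each bin (00:00 and 00:05).
means = df["value"].resample("5min").mean()

# Assignment aligns on the index: rows without a matching label get NaN.
df["mean_resampled"] = means

print(df)
```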

While resampling, put NaN in the resulting value if there are missing values in the source interval

import pandas as pd
import numpy as np
rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts
2000-01-01 00:00:00 0.0
2000-01-01 00:01:00 1.0
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:00 4.0
2000-01-01 00:05:00 5.0
2000-01-01 00:06:00 6.0
2000-01-01 00:07:00 7.0
2000-01-01 00:08:00 8.0
2000-01-01 00:09:00 9.0
2000-01-01 00:10:00 10.0
2000-01-01 00:11:00 11.0
Freq: T, dtype: float64
ts.resample("5min").agg(pd.Series.sum, skipna=False)
2000-01-01 00:00:00 NaN
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 21.0
Freq: 5T, dtype: float64
So far so good. The problem is that in the last interval (00:10-00:15) it does output a value, because there's no NaN there. But I don't want it to, because some values are missing.
I could use min_count=5, but this wouldn't always work (e.g. if I'm aggregating daily to monthly, there's a variable number of source steps in each target step—some months have 28 days, some 29, some 30, some 31).
You can reindex your time series using some simple arithmetic logic.
For example,
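freq = 5
# minutes needed to pad the last timestamp up to the end of its 5-minute bin
add = freq - ts.index[-1].minute % freq
new_ts = ts.reindex(pd.date_range(ts.index[0],
ts.index[-1] + pd.Timedelta(minutes=add-1),
freq='T'))
new_ts.resample("5min").agg(pd.Series.sum, skipna=False)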
which outputs
2000-01-01 00:00:00 NaN
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 NaN
Freq: 5T, dtype: float64
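The same idea as a self-contained sketch. It is a variant rather than the exact code above: it assumes the bin edges can be found with `Timestamp.floor`, and it uses a lambda so that `skipna=False` is passed explicitly. Padding the index out to the end of the last bin turns the absent steps into NaN, which the non-skipping sum then propagates.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts["2000-01-01 00:02"] = np.nan

# Pad the index so that the first and last 5-minute bins are complete.
start = ts.index[0].floor("5min")
end = ts.index[-1].floor("5min") + pd.Timedelta(minutes=4)
full = ts.reindex(pd.date_range(start, end, freq="min"))

# Any NaN inside a bin (original or introduced by padding) makes the sum NaN.
result = full.resample("5min").agg(lambda s: s.sum(skipna=False))

print(result)
```

For daily-to-monthly aggregation the same pattern should carry over by padding to the first and last day of the months involved, so the varying month length needs no special handling.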

While resampling, put NaN in the resulting value if there are some NaN values in the source interval

Example:
import pandas as pd
import numpy as np
rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts
2000-01-01 00:00:00 0.0
2000-01-01 00:01:00 1.0
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:00 4.0
2000-01-01 00:05:00 5.0
2000-01-01 00:06:00 6.0
2000-01-01 00:07:00 7.0
2000-01-01 00:08:00 8.0
2000-01-01 00:09:00 9.0
2000-01-01 00:10:00 10.0
2000-01-01 00:11:00 11.0
Freq: T, dtype: float64
ts.resample("5min").sum()
2000-01-01 00:00:00 8.0
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 21.0
Freq: 5T, dtype: float64
In the above example, it extracts the sum of the interval 00:00-00:05 as if the missing value was zero. What I want is for it to produce result NaN in 00:00.
Or, maybe I'd like for it to be OK if there's one missing value in the interval, but NaN if there are two missing values in the interval.
How can I do these?
For one or more NaN values:
ts.resample('5min').agg(pd.Series.sum, skipna=False)
For a minimum of 2 non-NaN values:
ts.resample('5min').agg(pd.Series.sum, min_count=2)
For a maximum of 2 NaN values, it seems trickier:
ts.resample('5min').apply(lambda x: x.sum() if x.isnull().sum() <= 2 else np.nan)
You might expect ts.resample('5min').sum(skipna=False) to work in the same way as ts.sum(skipna=False), but the implementations are not consistent.
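The three variants side by side, as a sketch on the example series. The names `strict`, `at_least_two` and `at_most_one_nan` are only illustrative, and a lambda is used for the first one so that `skipna=False` is passed explicitly to `Series.sum`.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts["2000-01-01 00:02"] = np.nan

bins = ts.resample("5min")

# NaN as soon as a bin contains at least one NaN.
strict = bins.agg(lambda s: s.sum(skipna=False))

# NaN unless a bin has at least two non-NaN values.
at_least_two = bins.sum(min_count=2)

# Tolerate one NaN per bin, but return NaN for two or more.
at_most_one_nan = bins.apply(lambda s: s.sum() if s.isna().sum() <= 1 else np.nan)

print(pd.DataFrame({"strict": strict,
                    "at_least_two": at_least_two,
                    "at_most_one_nan": at_most_one_nan}))
```

On this data only the first variant yields NaN for the 00:00 bin, because that bin has a single missing value.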

Resample pandas time series to have the end time stamp for the bin name?

I generate a sample 5 minute time series:
index = pd.date_range('1/1/2000', periods=10, freq='5T')
data=range(10)
ser = pd.Series(data, index=index)
What it looks like:
2000-01-01 00:00:00 0.0
2000-01-01 00:05:00 1.0
2000-01-01 00:10:00 2.0
2000-01-01 00:15:00 3.0
2000-01-01 00:20:00 4.0
2000-01-01 00:25:00 5.0
2000-01-01 00:30:00 6.0
2000-01-01 00:35:00 7.0
2000-01-01 00:40:00 8.0
2000-01-01 00:45:00 9.0
Freq: 5T, dtype: float64
What I need
I would like to turn this time series into a 15 minute one and have each 15 minute value be the mean of the 5 minute values observed in that 15 min period, i.e.
2000-01-01 00:15:00 2.0 # i.e. mean(1, 2, 3)
2000-01-01 00:30:00 5.0 # i.e. mean(4, 5, 6)
2000-01-01 00:45:00 8.0 # i.e. mean(7, 8, 9)
Things I tried
If I resample this data into 15 minute buckets and call mean I get:
ser.resample('15T').mean()
2000-01-01 00:00:00 1.0
2000-01-01 00:15:00 4.0
2000-01-01 00:30:00 7.0
2000-01-01 00:45:00 9.0
which is not computing the means I want. If I add closed='right' to the resample call I get closer to the values I want but the timestamps are not right.
ser.resample('15T', closed='right').mean()
1999-12-31 23:45:00 0.0
2000-01-01 00:00:00 2.0
2000-01-01 00:15:00 5.0
2000-01-01 00:30:00 8.0
Freq: 15T, dtype: float64
Any suggestions?
You can use the label argument of resample:
ser.resample('15T', label='right', closed='right').mean()
This shifts the label from the left edge (the default) to the right edge of each resampled window. It is more succinct than my somewhat clunky comment.
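A runnable sketch of that call on the sample series, using the `'min'` alias. Note that the very first observation (00:00) ends up alone in the bin (23:45, 00:00], so the result has one extra leading row compared with the desired output; dropping it with `iloc[1:]` is an assumption about what is wanted here.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=10, freq="5min")
ser = pd.Series(range(10), index=index)

# Bins are (00:00, 00:15], (00:15, 00:30], ... and carry their right edge as label.
right = ser.resample("15min", label="right", closed="right").mean()

# The first bin holds only the 00:00 sample; drop it to match the desired output.
wanted = right.iloc[1:]

print(right)
print(wanted)
```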

Python pandas resampling instantaneous hourly data to daily timestep including 00:00 of next day

Let's say I have a series of instantaneous temperature measurements (i.e. they capture the temperature at an exact moment in time).
index = pd.date_range('1/1/2000', periods=9, freq='6H')
series = pd.Series(range(9), index=index)
series
Out[130]:
2000-01-01 00:00:00 0
2000-01-01 06:00:00 1
2000-01-01 12:00:00 2
2000-01-01 18:00:00 3
2000-01-02 00:00:00 4
2000-01-02 06:00:00 5
2000-01-02 12:00:00 6
2000-01-02 18:00:00 7
2000-01-03 00:00:00 8
Freq: 6H, dtype: int64
I want to get an average of the daily temperature. The problem is that I want to include 00:00:00 from both the current day and the next day in the average for the current day. For example, I want to average 2000-01-01 00:00:00 to 2000-01-02 00:00:00 inclusive. The pandas resample function will not include 2000-01-02 in the bin because it's a different day.
I would imagine this situation comes up often when dealing with instantaneous measurements that need to be resampled. What's the solution?
setup
index = pd.date_range('1/1/2000', periods=9, freq='6H')
series = pd.Series(range(9), index=index)
series
2000-01-01 00:00:00 0
2000-01-01 06:00:00 1
2000-01-01 12:00:00 2
2000-01-01 18:00:00 3
2000-01-02 00:00:00 4
2000-01-02 06:00:00 5
2000-01-02 12:00:00 6
2000-01-02 18:00:00 7
2000-01-03 00:00:00 8
Freq: 6H, dtype: int64
solution
series.rolling(5).mean().resample('D').first()
2000-01-01 NaN
2000-01-02 2.0
2000-01-03 6.0
Freq: D, dtype: float64
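In the result above each mean is labelled with the day on which its window ends, so the 2.0 shown for 2000-01-02 is the average over 2000-01-01 00:00 through 2000-01-02 00:00. If the label should be the day the window starts, one option (a sketch, assuming regular 6-hourly data with no gaps) is to shift the index back by a day:

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="6h")
series = pd.Series(range(9), index=index)

# A window of 5 samples spans midnight to midnight inclusive; the value at
# each midnight is the mean of the 24 hours that end there.
daily = series.rolling(5).mean().resample("D").first()

# Move each label back one day so it names the day being averaged,
# then drop the leading entry that has no complete window.
by_start_day = daily.shift(-1, freq="D").dropna()

print(by_start_day)
```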
