I have a dataframe that I want to resample, appending the result to the original dataframe as a new column.
What I have:
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
series
time value
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
What I want:
time value mean_resampled
2000-01-01 00:00:00 0. 2
2000-01-01 00:01:00 1. NaN
2000-01-01 00:02:00 2. NaN
2000-01-01 00:03:00 3. NaN
2000-01-01 00:04:00 4. NaN
2000-01-01 00:05:00 5. 6.5
2000-01-01 00:06:00 6. NaN
2000-01-01 00:07:00 7. NaN
2000-01-01 00:08:00 8. NaN
Note: resampling frequency is '5T'
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index, name='values')
sample = series.resample('5T').mean() # create a sample at some frequency
df = series.to_frame() # convert series to frame
df.loc[sample.index.values, 'mean_resampled'] = sample # use loc to assign new values
values mean_resampled
2000-01-01 00:00:00 0 2.0
2000-01-01 00:01:00 1 NaN
2000-01-01 00:02:00 2 NaN
2000-01-01 00:03:00 3 NaN
2000-01-01 00:04:00 4 NaN
2000-01-01 00:05:00 5 6.5
2000-01-01 00:06:00 6 NaN
2000-01-01 00:07:00 7 NaN
2000-01-01 00:08:00 8 NaN
Use resample to compute the mean and concat to merge your Series with new values.
>>> pd.concat([series, series.resample('5T').mean()], axis=1) \
.rename(columns={0: 'value', 1: 'mean_resampled'})
value mean_resampled
2000-01-01 00:00:00 0 2.0
2000-01-01 00:01:00 1 NaN
2000-01-01 00:02:00 2 NaN
2000-01-01 00:03:00 3 NaN
2000-01-01 00:04:00 4 NaN
2000-01-01 00:05:00 5 6.5
2000-01-01 00:06:00 6 NaN
2000-01-01 00:07:00 7 NaN
2000-01-01 00:08:00 8 NaN
If your real data is a DataFrame rather than a Series, just add a new column, resampling the column you want to average (here called 'value'):
>>> df['mean_resampled'] = df['value'].resample('5T').mean()
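A minimal, self-contained sketch of the DataFrame case, assuming a single column named "value" (the frame and column name are illustrative, not from the question). Assigning the resampled Series relies on pandas aligning it on the index, so rows without a resampled timestamp end up as NaN.

```python
import pandas as pd

# Hypothetical one-column frame; the column name "value" is an assumption.
index = pd.date_range("2000-01-01", periods=9, freq="min")
df = pd.DataFrame({"value": range(9)}, index=index)

# Resample a single column so the result is a Series. On assignment pandas
# aligns it with df's index; timestamps that are not bin labels become NaN.
df["mean_resampled"] = df["value"].resample("5min").mean()
print(df)
```

Only the bin labels (00:00 and 00:05) receive a value here, which matches the layout asked for in the question.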
import pandas as pd
import numpy as np
rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts
2000-01-01 00:00:00 0.0
2000-01-01 00:01:00 1.0
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:00 4.0
2000-01-01 00:05:00 5.0
2000-01-01 00:06:00 6.0
2000-01-01 00:07:00 7.0
2000-01-01 00:08:00 8.0
2000-01-01 00:09:00 9.0
2000-01-01 00:10:00 10.0
2000-01-01 00:11:00 11.0
Freq: T, dtype: float64
ts.resample("5min").agg(pd.Series.sum, skipna=False)
2000-01-01 00:00:00 NaN
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 21.0
Freq: 5T, dtype: float64
So far so good. The problem is that in the last interval (00:10-00:15) it does output a value, because there's no NaN there. But I don't want it to, because some values are missing.
I could use min_count=5, but this wouldn't always work (e.g. if I'm aggregating daily to monthly, there's a variable number of source steps in each target step—some months have 28 days, some 29, some 30, some 31).
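You can reindex your time series so that it covers the whole of the last interval, using some simple arithmetic. The padded timestamps are filled with NaN, so the last sum becomes NaN as well.
For example,
freq = 5
add = freq - ts.index[-1].minute % freq
new_ts = ts.reindex(pd.date_range(ts.index[0],
                                  ts.index[-1] + pd.Timedelta(minutes=add-1),
                                  freq='T'))
new_ts.resample('5min').agg(pd.Series.sum, skipna=False)
which outputs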
2000-01-01 00:00:00 NaN
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 NaN
Freq: 5T, dtype: float64
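Here is a runnable sketch of the reindex approach, assuming minute data and a 5-minute target that lines up with the hour. The names `last`, `full_index`, `padded` and `result` are mine, and I use a lambda with `skipna=False` in place of `pd.Series.sum` purely as a matter of taste.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts.iloc[2] = np.nan  # the missing value at 00:02

freq = 5  # target bin width in minutes (assumed to divide the hour evenly)
last = ts.index[-1]
# Minutes from the last timestamp to the start of the next bin;
# add - 1 therefore lands on the final minute inside the current bin.
add = freq - last.minute % freq
full_index = pd.date_range(ts.index[0], last + pd.Timedelta(minutes=add - 1), freq="min")

# Reindexing pads the incomplete last bin with NaN rows.
padded = ts.reindex(full_index)

# Any NaN in a bin now makes that bin's sum NaN.
result = padded.resample("5min").apply(lambda x: x.sum(skipna=False))
print(result)
```

This only pads the end of the series; gaps in the middle would need the same reindex against a complete range, which `pd.date_range(ts.index[0], ...)` already provides here.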
Example:
import pandas as pd
import numpy as np
rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts
2000-01-01 00:00:00 0.0
2000-01-01 00:01:00 1.0
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:00 4.0
2000-01-01 00:05:00 5.0
2000-01-01 00:06:00 6.0
2000-01-01 00:07:00 7.0
2000-01-01 00:08:00 8.0
2000-01-01 00:09:00 9.0
2000-01-01 00:10:00 10.0
2000-01-01 00:11:00 11.0
Freq: T, dtype: float64
ts.resample("5min").sum()
2000-01-01 00:00:00 5.0
2000-01-01 00:05:00 30.0
2000-01-01 00:10:00 30.0
Freq: 5T, dtype: float64
In the above example, the sum for the interval 00:00-00:05 is computed as if the missing value were zero. What I want is for the result at 00:00 to be NaN.
Or maybe I'd like it to be OK if there's one missing value in the interval, but NaN if there are two missing values in the interval.
How can I do either of these?
To get NaN whenever an interval contains one or more NaN values:
ts.resample('5min').agg(pd.Series.sum, skipna=False)
To require a minimum of 2 non-NaN values:
ts.resample('5min').agg(pd.Series.sum, min_count=2)
Allowing a maximum of 2 NaN values seems trickier:
ts.resample('5min').apply(lambda x: x.sum() if x.isnull().sum() <= 2 else np.nan)
You might expect ts.resample('5min').sum(skipna=False) to work in the same way as ts.sum(skipna=False), but the implementations are not consistent.
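To make the NaN threshold explicit, the lambda can be wrapped in a small helper. This is only a sketch: `sum_allowing_nans` and `max_nans` are names I made up, not pandas API.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts.iloc[2] = np.nan  # one missing value in the first 5-minute bin


def sum_allowing_nans(series, rule, max_nans):
    """Sum each bin; return NaN for bins holding more than max_nans NaNs."""
    return series.resample(rule).apply(
        lambda x: x.sum() if x.isna().sum() <= max_nans else np.nan
    )


strict = sum_allowing_nans(ts, "5min", max_nans=0)   # any NaN spoils the bin
lenient = sum_allowing_nans(ts, "5min", max_nans=1)  # one NaN is tolerated

# min_count counts valid values instead: a bin needs at least 5 of them here.
by_count = ts.resample("5min").sum(min_count=5)
print(strict, lenient, by_count, sep="\n\n")
```

Note that the helper counts NaNs that are present in the data, while `min_count` counts valid values, so the two behave differently for a bin that is simply short, such as the last one here.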
I generate a sample 5 minute time series:
index = pd.date_range('1/1/2000', periods=10, freq='5T')
data=range(10)
ser = pd.Series(data, index=index)
What it looks like:
2000-01-01 00:00:00 0.0
2000-01-01 00:05:00 1.0
2000-01-01 00:10:00 2.0
2000-01-01 00:15:00 3.0
2000-01-01 00:20:00 4.0
2000-01-01 00:25:00 5.0
2000-01-01 00:30:00 6.0
2000-01-01 00:35:00 7.0
2000-01-01 00:40:00 8.0
2000-01-01 00:45:00 9.0
Freq: 5T, dtype: float64
What I need
I would like to turn this time series into a 15 minute one and have each 15 minute value be the mean of the 5 minute values observed in that 15 min period, i.e.
2000-01-01 00:15:00 2.0 # i.e. mean(1, 2, 3)
2000-01-01 00:30:00 5.0 # i.e. mean(4, 5, 6)
2000-01-01 00:45:00 8.0 # i.e. mean(7, 8, 9)
Things I tried
If I resample this data into 15 minute buckets and call mean I get:
ser.resample('15T').mean()
2000-01-01 00:00:00 1.0
2000-01-01 00:15:00 4.0
2000-01-01 00:30:00 7.0
2000-01-01 00:45:00 9.0
which is not computing the means I want. If I add closed='right' to the resample call I get closer to the values I want but the timestamps are not right.
ser.resample('15T', closed='right').mean()
1999-12-31 23:45:00 0.0
2000-01-01 00:00:00 2.0
2000-01-01 00:15:00 5.0
2000-01-01 00:30:00 8.0
Freq: 15T, dtype: float64
Any suggestions?
You can use the label argument in resample:
ser.resample('15T', label='right', closed='right').mean()
This shifts the label from the left edge (the default) to the right edge of the resampled window. It is more succinct than my somewhat clunky comment.
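A quick self-contained check of this, using the sample series from the question (the variable name `right` is mine):

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=10, freq="5min")
ser = pd.Series(range(10), index=index, dtype=float)

# closed="right": each bin is (start, end], so 00:15 belongs to the bin
# ending at 00:15. label="right": the bin is named after its right edge.
right = ser.resample("15min", label="right", closed="right").mean()
print(right)
```

The first bin, labelled 00:00, holds only the 00:00 observation, so you may want to drop it if you need full 15 minute windows only.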
I have a series:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='0.9S')
series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00.000 0
2000-01-01 00:00:00.900 1
2000-01-01 00:00:01.800 2
2000-01-01 00:00:02.700 3
2000-01-01 00:00:03.600 4
2000-01-01 00:00:04.500 5
2000-01-01 00:00:05.400 6
2000-01-01 00:00:06.300 7
2000-01-01 00:00:07.200 8
Freq: 900L, dtype: int64
Now I get
>>> series.resample(rule='0.5S').head(100)
2000-01-01 00:00:00.000 0.0
2000-01-01 00:00:00.500 1.0
2000-01-01 00:00:01.000 NaN
2000-01-01 00:00:01.500 2.0
2000-01-01 00:00:02.000 NaN
2000-01-01 00:00:02.500 3.0
2000-01-01 00:00:03.000 NaN
2000-01-01 00:00:03.500 4.0
2000-01-01 00:00:04.000 NaN
2000-01-01 00:00:04.500 5.0
2000-01-01 00:00:05.000 6.0
2000-01-01 00:00:05.500 NaN
2000-01-01 00:00:06.000 7.0
2000-01-01 00:00:06.500 NaN
2000-01-01 00:00:07.000 8.0
Freq: 500L, dtype: float64
as I would expect, but I get
>>> series.resample(rule='0.5S').interpolate(method='linear')
2000-01-01 00:00:00.000 0.000000
2000-01-01 00:00:00.500 0.555556
2000-01-01 00:00:01.000 1.111111
2000-01-01 00:00:01.500 1.666667
2000-01-01 00:00:02.000 2.222222
2000-01-01 00:00:02.500 2.777778
2000-01-01 00:00:03.000 3.333333
2000-01-01 00:00:03.500 3.888889
2000-01-01 00:00:04.000 4.444444
2000-01-01 00:00:04.500 5.000000
2000-01-01 00:00:05.000 5.000000
2000-01-01 00:00:05.500 5.000000
2000-01-01 00:00:06.000 5.000000
2000-01-01 00:00:06.500 5.000000
2000-01-01 00:00:07.000 5.000000
Freq: 500L, dtype: float64
whereas I would have expected the last value to still be 8.0, and the value at the 6.5 second timestamp to still be 7.0. What's up with that?
One way of getting this at least partially right is to use mean() in between the two methods (for real data the results are not great, and I had better success with scipy's interp1d):
>>> series.resample(rule='0.5S').mean().interpolate(method='linear')
2000-01-01 00:00:00.000 0.0
2000-01-01 00:00:00.500 1.0
2000-01-01 00:00:01.000 1.5
2000-01-01 00:00:01.500 2.0
2000-01-01 00:00:02.000 2.5
2000-01-01 00:00:02.500 3.0
2000-01-01 00:00:03.000 3.5
2000-01-01 00:00:03.500 4.0
2000-01-01 00:00:04.000 4.5
2000-01-01 00:00:04.500 5.0
2000-01-01 00:00:05.000 6.0
2000-01-01 00:00:05.500 6.5
2000-01-01 00:00:06.000 7.0
2000-01-01 00:00:06.500 7.5
2000-01-01 00:00:07.000 8.0
Freq: 500L, dtype: float64
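If you want values interpolated at the new timestamps from the original samples, rather than bucket means that are then interpolated, one alternative is to reindex onto the union of both indexes and interpolate by time. This is a different technique from the mean() approach above, sketched under the assumption that linear-in-time interpolation is what you are after; `target`, `combined` and `interpolated` are illustrative names.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="900ms")
series = pd.Series(range(9), index=index, dtype=float)

# The timestamps we want values for: every 0.5 s within the observed range.
target = pd.date_range(index[0], index[-1], freq="500ms")

# Keep the original samples, add the target timestamps as NaN rows,
# interpolate using the actual time gaps, then keep only the targets.
combined = series.reindex(series.index.union(target))
interpolated = combined.interpolate(method="time").reindex(target)
print(interpolated)
```

Because the original points stay in the series during interpolation, the result keeps rising past 5.0 instead of flattening out.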
I want to resample the pandas series
import pandas as pd
index_1 = pd.date_range('1/1/2000', periods=4, freq='T')
index_2 = pd.date_range('1/2/2000', periods=3, freq='T')
series = pd.Series(range(4), index=index_1)
series = pd.concat([series, pd.Series(range(3), index=index_2)])  # Series.append was removed in pandas 2.0
print(series)
>>>2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 00:02:00 2
such that the resulting Series only contains every second entry, i.e.
>>>2000-01-01 00:00:00 0
2000-01-01 00:02:00 2
2000-01-02 00:00:00 0
2000-01-02 00:02:00 2
using the (poorly documented) resample method of pandas in the following way:
resampled_series = series.resample('2T', closed='right').mean()  # older pandas versions applied mean() implicitly
print(resampled_series)
I get
>>>1999-12-31 23:58:00 0.0
2000-01-01 00:00:00 1.5
2000-01-01 00:02:00 3.0
2000-01-01 00:04:00 NaN
2000-01-01 00:56:00 NaN
...
2000-01-01 23:54:00 NaN
2000-01-01 23:56:00 NaN
2000-01-01 23:58:00 0.0
2000-01-02 00:00:00 1.5
2000-01-02 00:02:00 3.0
Why does it start 2 minutes earlier than the original series? Why does it contain all the time steps in between, which are not contained in the original series? How can I get my desired result?
resample() is not the right function for your purpose.
Try this:
series[series.index.minute % 2 == 0]
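For completeness, a runnable version of the boolean-mask approach, next to a purely positional alternative. The `iloc[::2]` line is my addition and only agrees with the mask because every day in this sample starts on an even minute and has no gaps.

```python
import pandas as pd

index_1 = pd.date_range("2000-01-01", periods=4, freq="min")
index_2 = pd.date_range("2000-01-02", periods=3, freq="min")
series = pd.concat([
    pd.Series(range(4), index=index_1),
    pd.Series(range(3), index=index_2),
])

# Keep rows whose timestamp falls on an even minute; no new rows are created.
even_minutes = series[series.index.minute % 2 == 0]

# Positional alternative: every second row, regardless of its timestamp.
every_second = series.iloc[::2]
print(even_minutes)
```

The mask selects by timestamp and the slice selects by position, so pick whichever matches what "every second entry" means for your data.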