Time-based .rolling() fails with group by - python

Here's a code snippet from Pandas issue #13966:
import pandas as pd
import numpy as np

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})
Fails:
df.groupby('A').rolling('4s', on='B').C.mean()
ValueError: B must be monotonic
Per the issue linked above, this seems to be a bug. Does anyone have a good workaround?

Set B as the index first so that Groupby.resample can be used on it:
df.set_index('B', inplace=True)
Group by A and resample at one-second frequency. Since resample cannot be chained directly with rolling here, use ffill(limit=0) (a forward fill limited to 0 values, so it only materialises the regular grid and introduces no new values).
Now apply rolling with a window size of 4 rows (equivalent to the 4s window, since the data now sits on a regular 1-second grid) and take its mean along the C column, as shown:
for _, grp in df.groupby('A'):
    print (grp.resample('s').ffill(limit=0).rolling(4)['C'].mean().head(10))  # remove head() to see all rows
Resulting output:
B
2016-01-01 09:30:00 NaN
2016-01-01 09:30:01 NaN
2016-01-01 09:30:02 NaN
2016-01-01 09:30:03 1.5
2016-01-01 09:30:04 2.5
2016-01-01 09:30:05 3.5
2016-01-01 09:30:06 4.5
2016-01-01 09:30:07 5.5
2016-01-01 09:30:08 6.5
2016-01-01 09:30:09 7.5
Freq: S, Name: C, dtype: float64
B
2016-01-01 09:30:00 NaN
2016-01-01 09:30:01 NaN
2016-01-01 09:30:02 NaN
2016-01-01 09:30:03 21.5
2016-01-01 09:30:04 22.5
2016-01-01 09:30:05 23.5
2016-01-01 09:30:06 24.5
2016-01-01 09:30:07 25.5
2016-01-01 09:30:08 26.5
2016-01-01 09:30:09 27.5
Freq: S, Name: C, dtype: float64
B
2016-01-01 09:30:12 NaN
2016-01-01 09:30:13 NaN
2016-01-01 09:30:14 NaN
2016-01-01 09:30:15 33.5
2016-01-01 09:30:16 34.5
2016-01-01 09:30:17 35.5
2016-01-01 09:30:18 36.5
2016-01-01 09:30:19 37.5
Freq: S, Name: C, dtype: float64
TL;DR
Use groupby.apply as a workaround instead after setting the index appropriately:
# tested in version - 0.19.1
df.groupby('A').apply(lambda grp: grp.resample('s').ffill(limit=0).rolling(4)['C'].mean())
(Or)
# Tested in OP's version - 0.19.0
df.groupby('A').apply(lambda grp: grp.resample('s').ffill().rolling(4)['C'].mean())
Both work.
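If you want to keep the true 4-second time window instead of converting to a fixed 4-row window on a resampled grid, a rough sketch (same idea, but using plain Series.rolling inside apply, which sidesteps the grouped-rolling bug) would be:
# sketch: per-group time-based 4s window; B is monotonic within each group
df.groupby('A').apply(lambda grp: grp.set_index('B')['C'].rolling('4s').mean())
This returns one value per original row, indexed by (A, B).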

Alternatively, sort by B first so that the column is monotonic overall; the time-based rolling then works directly:
>>> df.sort_values('B').set_index('B').groupby('A').rolling('4s').C.mean()

Related

how to calculate percent change rate on a multi-date dataframe elegantly?

I have a dataframe whose index is a datetime. It contains a single column, price:
In [9]: df = pd.DataFrame({'price': [3, 5, 6, 10, 11]},
   ...:                   index=pd.to_datetime(['2016-01-01 14:58:00', '2016-01-01 14:58:00',
   ...:                                         '2016-01-01 14:58:00', '2016-01-02 09:30:00',
   ...:                                         '2016-01-02 09:31:00']))
In [10]: df
Out[10]:
price
2016-01-01 14:58:00 3
2016-01-01 14:58:00 5
2016-01-01 14:58:00 6
2016-01-02 09:30:00 10
2016-01-02 09:31:00 11
I want to calculate the next return (the price percent-change rate over the next interval).
A DataFrame has a pct_change() function that can calculate the change rate:
In [12]: df['price'].pct_change().shift(-1)
Out[12]:
2016-01-01 14:58:00 0.666667
2016-01-01 14:58:00 0.200000
2016-01-01 14:58:00 0.666667
2016-01-02 09:30:00 0.100000
2016-01-02 09:31:00 NaN
Name: price, dtype: float64
But I want the cross-date element to be NaN.
That is, the last value on 2016-01-01 14:58:00 should be NaN, because it is calculated using the next day's data (2016-01-02 09:30:00).
the expected output:
2016-01-01 14:58:00 0.666667
2016-01-01 14:58:00 0.200000
2016-01-01 14:58:00 NaN
2016-01-02 09:30:00 0.100000
2016-01-02 09:31:00 NaN
Name: price, dtype: float64
I can build a mask to filter those out, but I don't think that solution is elegant enough. Are there any suggestions?
You can group by DatetimeIndex.date and use GroupBy.apply:
s1 = df.groupby(df.index.date)['price'].apply(lambda x: x.pct_change().shift(-1))
print (s1)
2016-01-01 14:58:00 0.666667
2016-01-01 14:58:00 0.200000
2016-01-01 14:58:00 NaN
2016-01-02 09:30:00 0.100000
2016-01-02 09:31:00 NaN
Name: price, dtype: float64
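If you would rather avoid apply, a rough equivalent (a sketch, assuming the same df as above) chains a grouped pct_change with a grouped shift, so nothing leaks across dates:
days = df.index.date
s2 = df['price'].groupby(days).pct_change().groupby(days).shift(-1)
print (s2)
This gives the same result as s1 above: the last row of each day becomes NaN because the shift stops at the date boundary.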

Pandas fill in missing date within each group with information in the previous row

Similar question to this one, but with some modifications:
Instead of filling in missing dates for each group between the min and max date of the entire column, we should only fill in the dates between the min and the max of that group, and output a dataframe with only the last row per date in each group.
Reproducible example:
x = pd.DataFrame({'dt': ['2016-01-01', '2016-01-03', '2016-01-04',
                         '2016-01-01', '2016-01-01', '2016-01-04'],
                  'amount': [10.0, 30.0, 40.0, 78.0, 80.0, 82.0],
                  'sub_id': [1, 1, 1, 2, 2, 2]})
Visually:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-03 1 30.0
2 2016-01-04 1 40.0
3 2017-01-01 2 78.0
4 2017-01-01 2 80.0
5 2017-01-04 2 82.0
Output I need:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-02 1 10.0
2 2016-01-03 1 30.0
3 2016-01-04 1 40.0
4 2017-01-01 2 80.0
5 2017-01-02 2 80.0
6 2017-01-03 2 80.0
7 2017-01-04 2 82.0
We are grouping by dt and sub_id. As you can see, for sub_id=1 a row was added for 2016-01-02 and its amount was imputed as 10.0, since the previous row was 10.0 (assume the data is sorted beforehand to enable this). For sub_id=2, rows were added for 2017-01-02 and 2017-01-03 with amount 80.0, as that was the last row before those dates. The first row for 2017-01-01 was also deleted, because we only want to keep the last row for each date and sub_id.
Looking for the most efficient way to do this as the real data has millions of rows. I have a current method using lambda functions and applying them across groups of sub_id but I feel like we could do better.
Thanks!
Getting the dtype of the date column right first, of course:
x.dt = pd.to_datetime(x.dt)
Then this:
cols = ['dt', 'sub_id']
pd.concat([
    d.asfreq('D').ffill(downcast='infer')
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt').groupby('sub_id')
]).reset_index()
dt amount sub_id
0 2016-01-01 10 1
1 2016-01-02 10 1
2 2016-01-03 30 1
3 2016-01-04 40 1
4 2016-01-01 80 2
5 2016-01-02 80 2
6 2016-01-03 80 2
7 2016-01-04 82 2
Using resample with groupby:
x.dt = pd.to_datetime(x.dt)
x.set_index('dt').groupby('sub_id').apply(lambda g: g.resample('D').max().ffill()).reset_index(level=1)
Out[265]:
dt amount sub_id
sub_id
1 2016-01-01 10.0 1.0
1 2016-01-02 10.0 1.0
1 2016-01-03 30.0 1.0
1 2016-01-04 40.0 1.0
2 2016-01-01 80.0 2.0
2 2016-01-02 80.0 2.0
2 2016-01-03 80.0 2.0
2 2016-01-04 82.0 2.0
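A small variation on the resample idea (a sketch, not benchmarked): since the requirement is the last row per date rather than the largest amount, last() is arguably a closer fit than max():
x.dt = pd.to_datetime(x.dt)
out = (x.set_index('dt')
        .groupby('sub_id')['amount']
        .apply(lambda s: s.resample('D').last().ffill())
        .rename('amount')
        .reset_index())
This produces the same eight rows as above for this data, but it would also pick the last amount of the day even when a larger amount happened to come earlier in that day.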
Use asfreq & groupby:
First convert dt to datetime and drop the duplicates.
Then, for each sub_id group, use asfreq('D', method='ffill') to generate the missing dates and impute the amounts.
Finally, reset_index on the amount column, as there is a duplicate sub_id column as well as one in the index.
x.dt = pd.to_datetime(x.dt)
x.drop_duplicates(
    ['dt', 'sub_id'], keep='last'
).groupby('sub_id').apply(
    lambda x: x.set_index('dt').asfreq('D', method='ffill')
).amount.reset_index()
# output:
sub_id dt amount
0 1 2016-01-01 10.0
1 1 2016-01-02 10.0
2 1 2016-01-03 30.0
3 1 2016-01-04 40.0
4 2 2016-01-01 80.0
5 2 2016-01-02 80.0
6 2 2016-01-03 80.0
7 2 2016-01-04 82.0
The below works for me and seems pretty efficient, but I can't say whether it is efficient enough. It does avoid lambdas, though.
I called your data df.
Create a base_df with the entire date / sub_id grid:
import pandas as pd
from itertools import product

base_grid = product(pd.date_range(df['dt'].min(), df['dt'].max(), freq='D'),
                    list(range(df['sub_id'].min(), df['sub_id'].max() + 1, 1)))
base_df = pd.DataFrame(list(base_grid), columns=['dt', 'sub_id'])
Get the max value per dt / sub_id from df:
max_value_df = df.loc[df.groupby(['dt', 'sub_id'])['amount'].idxmax()]
max_value_df['dt'] = max_value_df['dt'].apply(pd.Timestamp)
Merge base_df on the max values:
merged_df = base_df.merge(max_value_df, how='left', on=['dt', 'sub_id'])
Sort and forward fill the maximal value:
merged_df = merged_df.sort_values(by=['sub_id', 'dt', 'amount'], ascending=True)
merged_df['amount'] = merged_df.groupby(['sub_id'])['amount'].fillna(method='ffill')
Result:
dt sub_id amount
0 2016-01-01 1 10.0
2 2016-01-02 1 10.0
4 2016-01-03 1 30.0
6 2016-01-04 1 40.0
1 2016-01-01 2 80.0
3 2016-01-02 2 80.0
5 2016-01-03 2 80.0
7 2016-01-04 2 82.0
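One caveat with the full-grid approach (my observation, not part of the original answer): the grid spans the global min/max dates, so a sub_id whose own dates cover a shorter span picks up extra forward-filled rows. A sketch to trim those afterwards, reusing the names above:
# keep only rows inside each sub_id's own [min(dt), max(dt)] span
dt = pd.to_datetime(df['dt'])
spans = dt.groupby(df['sub_id']).agg(['min', 'max'])
merged_df = merged_df.join(spans, on='sub_id')
keep = merged_df['dt'].between(merged_df['min'], merged_df['max'])
merged_df = merged_df.loc[keep].drop(columns=['min', 'max'])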

adding new column to pandas multi-index dataframe using rolling().max()

I have the following dataframe with multi-index:
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})
df = df.set_index(["B", "A"])
Now I want to create a new column containing the maximum of the last two values of C within each group of index level A. I tried the following:
df.loc[:,"D"] = df.groupby(level="A").rolling(2).max()
But it only produces NaN for the new column ("D"), since the order of the grouped dataframe's index differs from the original dataframe's.
How can I solve this? I prefer to stay away from stacking/unstacking, swaplevel/sortlevel, join or concat since I have a big dataframe and these operations tend to be quite time consuming.
You need reset_index with the drop parameter to remove the first level of the MultiIndex:
df['D'] = df.groupby(level="A")['C'].rolling(2).max().reset_index(level=0, drop=True)
print (df)
C D
B A
2016-01-01 09:30:00 1 0 NaN
2016-01-01 09:30:01 1 1 1.0
2016-01-01 09:30:02 1 2 2.0
2016-01-01 09:30:03 1 3 3.0
2016-01-01 09:30:04 1 4 4.0
2016-01-01 09:30:05 1 5 5.0
2016-01-01 09:30:06 1 6 6.0
2016-01-01 09:30:07 1 7 7.0
2016-01-01 09:30:08 1 8 8.0
2016-01-01 09:30:09 1 9 9.0
2016-01-01 09:30:10 1 10 10.0
2016-01-01 09:30:11 1 11 11.0
2016-01-01 09:30:12 1 12 12.0
2016-01-01 09:30:13 1 13 13.0
2016-01-01 09:30:14 1 14 14.0
2016-01-01 09:30:15 1 15 15.0
2016-01-01 09:30:16 1 16 16.0
2016-01-01 09:30:17 1 17 17.0
2016-01-01 09:30:18 1 18 18.0
2016-01-01 09:30:19 1 19 19.0
2016-01-01 09:30:00 2 20 NaN
2016-01-01 09:30:01 2 21 21.0
...
...
because:
print (df.groupby(level="A")['C'].rolling(2).max())
A B A
1 2016-01-01 09:30:00 1 NaN
2016-01-01 09:30:01 1 1.0
2016-01-01 09:30:02 1 2.0
2016-01-01 09:30:03 1 3.0
2016-01-01 09:30:04 1 4.0
2016-01-01 09:30:05 1 5.0
2016-01-01 09:30:06 1 6.0
2016-01-01 09:30:07 1 7.0
2016-01-01 09:30:08 1 8.0
2016-01-01 09:30:09 1 9.0
2016-01-01 09:30:10 1 10.0
2016-01-01 09:30:11 1 11.0
2016-01-01 09:30:12 1 12.0
2016-01-01 09:30:13 1 13.0
2016-01-01 09:30:14 1 14.0
2016-01-01 09:30:15 1 15.0
2016-01-01 09:30:16 1 16.0
2016-01-01 09:30:17 1 17.0
2016-01-01 09:30:18 1 18.0
2016-01-01 09:30:19 1 19.0
2 2016-01-01 09:30:00 2 NaN
2016-01-01 09:30:01 2 21.0
...
...
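On newer pandas versions, an equivalent and slightly terser spelling is to drop the prepended group level with droplevel (a sketch; it does the same thing as reset_index(level=0, drop=True) above):
df['D'] = df.groupby(level="A")['C'].rolling(2).max().droplevel(0)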

Pandas - Rolling slope calculation

How can I calculate the slope of each column's rolling(window=60) values, stepped by 5?
I'd like a value every 5 minutes, and I don't need results for every record.
Here's a sample dataframe and the desired result:
df
Time A ... N
2016-01-01 00:00 1.2 ... 4.2
2016-01-01 00:01 1.2 ... 4.0
2016-01-01 00:02 1.2 ... 4.5
2016-01-01 00:03 1.5 ... 4.2
2016-01-01 00:04 1.1 ... 4.6
2016-01-01 00:05 1.6 ... 4.1
2016-01-01 00:06 1.7 ... 4.3
2016-01-01 00:07 1.8 ... 4.5
2016-01-01 00:08 1.1 ... 4.1
2016-01-01 00:09 1.5 ... 4.1
2016-01-01 00:10 1.6 ... 4.1
....
result
Time A ... N
2016-01-01 00:04 xxx ... xxx
2016-01-01 00:09 xxx ... xxx
2016-01-01 00:14 xxx ... xxx
...
Can the df.rolling function be applied to this problem?
It's fine if NaN is in the window, meaning the subset could be shorter than 60.
It seems that what you want is rolling with a specific step size.
However, according to the documentation of pandas, step size is currently not supported in rolling.
If the data size is not too large, just perform rolling on all data and select the results using indexing.
Here's a sample dataset. For simplicity, the time column is represented using integers.
data = pd.DataFrame(np.random.rand(500, 1) * 10, columns=['a'])
a
0 8.714074
1 0.985467
2 9.101299
3 4.598044
4 4.193559
.. ...
495 9.736984
496 2.447377
497 5.209420
498 2.698441
499 3.438271
Then, roll and calculate slopes,
def calc_slope(x):
    slope = np.polyfit(range(len(x)), x, 1)[0]
    return slope

# set min_periods=2 to allow subsets shorter than 60
# use [4::5] to select the results you need
result = data.rolling(60, min_periods=2).apply(calc_slope)[4::5]
The result will be,
a
4 -0.542845
9 0.084953
14 0.155297
19 -0.048813
24 -0.011947
.. ...
479 -0.004792
484 -0.003714
489 0.022448
494 0.037301
499 0.027189
Or, you can refer to this post. The first answer provides a numpy way to achieve this:
step size in pandas.DataFrame.rolling
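For what it's worth, newer pandas releases (1.5 and later, an assumption worth checking against your installed version) add a step argument to fixed-window rolling, which only evaluates every step-th row; note that evaluation starts at the first row, so the positions differ slightly from the [4::5] selection above:
# requires pandas >= 1.5; evaluates the window only at every 5th row
result = data.rolling(60, min_periods=2, step=5).apply(calc_slope)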
Try this:
windows = df.groupby("Time")["A"].rolling(60)
df['out'] = windows.apply(lambda x: np.polyfit(range(60), x, 1)[0], raw=True).values
You could use pandas resample. Note that to use this, you need an index with time values:
df.index = pd.to_datetime(df.Time)
print(df)
result = df.resample('5Min').bfill()
print(result)
Time A N
Time
2016-01-01 00:00:00 2016-01-01 00:00 1.2 4.2
2016-01-01 00:01:00 2016-01-01 00:01 1.2 4.0
2016-01-01 00:02:00 2016-01-01 00:02 1.2 4.5
2016-01-01 00:03:00 2016-01-01 00:03 1.5 4.2
2016-01-01 00:04:00 2016-01-01 00:04 1.1 4.6
2016-01-01 00:05:00 2016-01-01 00:05 1.6 4.1
2016-01-01 00:06:00 2016-01-01 00:06 1.7 4.3
2016-01-01 00:07:00 2016-01-01 00:07 1.8 4.5
2016-01-01 00:08:00 2016-01-01 00:08 1.1 4.1
2016-01-01 00:09:00 2016-01-01 00:09 1.5 4.1
2016-01-01 00:10:00 2016-01-01 00:10 1.6 4.1
2016-01-01 00:15:00 2016-01-01 00:15 1.6 4.1
Output:
Time A N
Time
2016-01-01 00:00:00 2016-01-01 00:00 1.2 4.2
2016-01-01 00:05:00 2016-01-01 00:05 1.6 4.1
2016-01-01 00:10:00 2016-01-01 00:10 1.6 4.1
2016-01-01 00:15:00 2016-01-01 00:15 1.6 4.1
I use:
df['slope_I'] = df['I'].rolling('600s').apply(lambda x: (x[-1]-x[0])/600)
where the slope has units of 1/seconds.
The first 600s of the result will probably be empty; you may want to fill them with zeros or with the mean.
The first number in the slope column is the slope of the line that goes from the first row inside the window to the last, and so on as the window rolls.
Best regards.
For other answer seekers: here is another solution, where the time intervals do not need to be equally spaced.
df.A.diff(60)/df.Time.diff(60).dt.total_seconds()
This line takes the difference between the current row and the row sixty rows back, and divides it by the time difference between those same rows.
When you only want every fifth record, the next line should work (note the parentheses, so the slicing happens after the division):
(df.A.diff(60)/df.Time.diff(60).dt.total_seconds())[4::5]
Note: every row is still calculated; only the 5-stepped series is returned.
doc pandas diff: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html

Truncating hourly Pandas Series to be full days only

I'm using Pandas 0.17.1 and I oftentimes encounter hourly Series data that contains partial days. It does not seem that there is any functionality built into pandas that permits you to discard values that correspond to incomplete segments of a coarser date offset on the boundaries of the Series data (I would only like to discard partial data that exist at the beginning and/or the end of the Series).
My intuition, given the above, is that I would have to code something up to abstract the criterion (e.g. groupby with a count aggregation, discard hours in days with < 24 hours):
>> hist_data.groupby(lambda x: x.date()).agg('count')
2007-01-01 23
2007-01-02 24
...
An example of desired behavior:
>> hourly_data
2016-01-01 04:00:00 0.603820
2016-01-01 05:00:00 0.806696
2016-01-01 06:00:00 0.938521
2016-01-01 07:00:00 0.781834
2016-01-01 08:00:00 0.154952
...
2016-01-03 22:00:00 0.082177
2016-01-03 23:00:00 0.753210
2016-01-04 00:00:00 0.458402
2016-01-04 01:00:00 0.649496
2016-01-04 02:00:00 0.525321
2016-01-04 03:00:00 0.242605
Freq: H, dtype: float64
>> remove_partial_boundary_data(hourly_data)
2016-01-02 00:00:00 0.833063
2016-01-02 01:00:00 0.131586
2016-01-02 02:00:00 0.876609
2016-01-02 03:00:00 0.319436
2016-01-02 04:00:00 0.056246
...
2016-01-03 20:00:00 0.405725
2016-01-03 21:00:00 0.541096
2016-01-03 22:00:00 0.082177
2016-01-03 23:00:00 0.753210
Freq: H, dtype: float64
However, if my timezone is anything other than UTC (timezone-aware), the approach suggested above seems fraught with peril because counts of hours on DST transition days would be either 23 or 25.
Does anyone know of a clever or built-in way to handle this?
You can do this with a groupby and filter on groups that are not complete. To check for completeness, I first reindexed the data and then checked if there are NaN values:
In [10]: hourly_data = pd.Series(np.random.randn(72), index=pd.date_range('2016-01-01 04:00', periods=72, freq='H'))
In [11]: new_idx = pd.date_range(hourly_data.index[0].date(), hourly_data.index[-1].date() + pd.Timedelta('1 day'), freq='H')
In [12]: hourly_data.reindex(new_idx)
Out[12]:
2016-01-01 00:00:00 NaN
2016-01-01 01:00:00 NaN
2016-01-01 02:00:00 NaN
2016-01-01 03:00:00 NaN
2016-01-01 04:00:00 -0.941332
2016-01-01 05:00:00 1.802739
2016-01-01 06:00:00 0.798968
2016-01-01 07:00:00 -0.444979
...
2016-01-04 17:00:00 NaN
2016-01-04 18:00:00 NaN
2016-01-04 19:00:00 NaN
2016-01-04 20:00:00 NaN
2016-01-04 21:00:00 NaN
2016-01-04 22:00:00 NaN
2016-01-04 23:00:00 NaN
2016-01-05 00:00:00 NaN
Freq: H, dtype: float64
This results in a timeseries that includes all the hours of the dates present in the original series. We can then check whether a date is complete by looking for NaN values on that date (this should also work across DST transitions), and filter on that criterion:
In [13]: hourly_data.reindex(new_idx).groupby(lambda x: x.date()).filter(lambda x: x.isnull().sum() == 0)
Out[13]:
2016-01-02 00:00:00 -1.231445
2016-01-02 01:00:00 2.371690
2016-01-02 02:00:00 -0.695448
2016-01-02 03:00:00 0.745308
2016-01-02 04:00:00 0.814579
2016-01-02 05:00:00 1.345674
2016-01-02 06:00:00 -1.491470
2016-01-02 07:00:00 0.407182
...
2016-01-03 16:00:00 -0.742151
2016-01-03 17:00:00 0.677229
2016-01-03 18:00:00 0.832271
2016-01-03 19:00:00 -0.183729
2016-01-03 20:00:00 1.938594
2016-01-03 21:00:00 -0.816370
2016-01-03 22:00:00 1.745757
2016-01-03 23:00:00 0.223487
Freq: H, dtype: float64
ORIGINAL ANSWER
You can do this with resample by providing a custom function, and in that function you can then specify that NaN values should not be skipped.
Short answer:
hist_data.resample('D', how=lambda x: x.mean(skipna=False))
if the missing hours are already present as NaNs. Otherwise, you can first resample it to a regular hourly series:
hist_data.resample('H').resample('D', how=lambda x: x.mean(skipna=False))
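On modern pandas, where the how= keyword has been removed, a rough equivalent of the short answer (assuming hist_data is an hourly Series with the missing hours already present as NaN) is:
hist_data.resample('D').apply(lambda x: x.mean(skipna=False))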
Long answer with an example. With some dummy data (and I insert a NaN in one of the days):
In [77]: hist_data = pd.Series(np.random.randn(72), index=pd.date_range('2016-01-01', periods=72, freq='H'))
In [78]: hist_data
Out[78]:
2016-01-01 00:00:00 -0.717624
2016-01-01 01:00:00 0.029151
2016-01-01 02:00:00 0.535843
...
2016-01-03 21:00:00 0.659923
2016-01-03 22:00:00 -1.085640
2016-01-03 23:00:00 0.571347
Freq: H, dtype: float64
In [80]: hist_data.iloc[30] = np.nan
With count you can see that there is indeed one missing value for the second day:
In [81]: hist_data.resample('D', how='count')
Out[81]:
2016-01-01 24
2016-01-02 23
2016-01-03 24
Freq: D, dtype: int64
By default, 'mean' will ignore this NaN value:
In [83]: hist_data.resample('D', how='mean')
Out[83]:
2016-01-01 0.106537
2016-01-02 -0.112774
2016-01-03 -0.292248
Freq: D, dtype: float64
But you can change this behaviour with the skipna keyword argument:
In [82]: hist_data.resample('D', how=lambda x: x.mean(skipna=False))
Out[82]:
2016-01-01 0.106537
2016-01-02 NaN
2016-01-03 -0.292248
Freq: D, dtype: float64
