how to calculate percent change rate on a multi-date dataframe elegantly? - python

I have a dataframe whose index is a datetime. It contains one column, price:
In [9]: df = pd.DataFrame({'price':[3,5,6,10,11]}, index=pd.to_datetime(['2016-01-01 14:58:00',
'2016-01-01 14:58:00', '2016-01-01 14:58:00', '2016-01-02 09:30:00', '2016-01-02 09:31:00']))
...:
In [10]: df
Out[10]:
price
2016-01-01 14:58:00 3
2016-01-01 14:58:00 5
2016-01-01 14:58:00 6
2016-01-02 09:30:00 10
2016-01-02 09:31:00 11
I want to calculate the next return (the price percent change rate over some time interval).
DataFrame has a pct_change() function that can calculate the change rate:
In [12]: df['price'].pct_change().shift(-1)
Out[12]:
2016-01-01 14:58:00 0.666667
2016-01-01 14:58:00 0.200000
2016-01-01 14:58:00 0.666667
2016-01-02 09:30:00 0.100000
2016-01-02 09:31:00 NaN
Name: price, dtype: float64
But I want the cross-date elements to be NaN,
which means I want df['pct_change'].loc['2016-01-01 14:58:00'] (the last row of that day) to be NaN, because it is calculated using the next day's data (2016-01-02 09:30:00).
The expected output:
2016-01-01 14:58:00 0.666667
2016-01-01 14:58:00 0.200000
2016-01-01 14:58:00 NaN
2016-01-02 09:30:00 0.100000
2016-01-02 09:31:00 NaN
Name: price, dtype: float64
I can make a mask to filter those out, but I don't think that solution is elegant enough. Are there any better suggestions?

You can use GroupBy.apply by DatetimeIndex.date:
s1 = df.groupby(df.index.date)['price'].apply(lambda x: x.pct_change().shift(-1))
print (s1)
2016-01-01 14:58:00 0.666667
2016-01-01 14:58:00 0.200000
2016-01-01 14:58:00 NaN
2016-01-02 09:30:00 0.100000
2016-01-02 09:31:00 NaN
Name: price, dtype: float64
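If you prefer to avoid the groupby, a possible mask-based alternative (a sketch of my own, not from the answer above): compute the shifted pct_change once and blank out the rows whose next observation falls on a different calendar date.
import numpy as np

nxt = df['price'].pct_change().shift(-1)                 # next-interval return
dates = df.index.date                                    # calendar date of each row
same_day = np.append(dates[:-1] == dates[1:], False)     # True where the next row is the same day
s1 = nxt.where(same_day)                                 # cross-date (and final) rows become NaN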

Time-based .rolling() fails with group by

Here's a code snippet from Pandas Issue #13966
import numpy as np
import pandas as pd

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})
Fails:
df.groupby('A').rolling('4s', on='B').C.mean()
ValueError: B must be monotonic
Per the issue linked above, this seems to be a bug. Does anyone have a good workaround?
Set B as the index first so that the GroupBy.resample method can be used on it:
df.set_index('B', inplace=True)
Group by A and resample at one-second frequency. As resample cannot be chained directly with rolling here, use ffill with limit=0 (forward fill, but with the limit set to 0 so nothing is actually filled).
Now use the rolling function with a window size of 4 (four one-second bins, matching the 4s window) and take its mean along the C column, as shown:
for _, grp in df.groupby('A'):
    print (grp.resample('s').ffill(limit=0).rolling(4)['C'].mean().head(10))  # remove head() to see all rows
Resulting output obtained:
B
2016-01-01 09:30:00 NaN
2016-01-01 09:30:01 NaN
2016-01-01 09:30:02 NaN
2016-01-01 09:30:03 1.5
2016-01-01 09:30:04 2.5
2016-01-01 09:30:05 3.5
2016-01-01 09:30:06 4.5
2016-01-01 09:30:07 5.5
2016-01-01 09:30:08 6.5
2016-01-01 09:30:09 7.5
Freq: S, Name: C, dtype: float64
B
2016-01-01 09:30:00 NaN
2016-01-01 09:30:01 NaN
2016-01-01 09:30:02 NaN
2016-01-01 09:30:03 21.5
2016-01-01 09:30:04 22.5
2016-01-01 09:30:05 23.5
2016-01-01 09:30:06 24.5
2016-01-01 09:30:07 25.5
2016-01-01 09:30:08 26.5
2016-01-01 09:30:09 27.5
Freq: S, Name: C, dtype: float64
B
2016-01-01 09:30:12 NaN
2016-01-01 09:30:13 NaN
2016-01-01 09:30:14 NaN
2016-01-01 09:30:15 33.5
2016-01-01 09:30:16 34.5
2016-01-01 09:30:17 35.5
2016-01-01 09:30:18 36.5
2016-01-01 09:30:19 37.5
Freq: S, Name: C, dtype: float64
TL;DR
Use groupby.apply as a workaround instead after setting the index appropriately:
# tested in version - 0.19.1
df.groupby('A').apply(lambda grp: grp.resample('s').ffill(limit=0).rolling(4)['C'].mean())
(Or)
# Tested in OP's version - 0.19.0
df.groupby('A').apply(lambda grp: grp.resample('s').ffill().rolling(4)['C'].mean())
Both work.
Another one-liner workaround: sorting by B first makes the whole column monotonic, so the check passes:
>>> df.sort_values('B').set_index('B').groupby('A').rolling('4s').C.mean()
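Roughly the same per-group computation can also be written with groupby.apply (a sketch of my own; within each group B is monotonic, so the time-based rolling window is valid there):
# starting from the original df, with B still a column
out = (df.set_index('B')
         .groupby('A')['C']
         .apply(lambda g: g.rolling('4s').mean()))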

Last 10 minutes per day

I'm trying to obtain how much of a business's transaction volume is done in the last 10 minutes of each day.
The data I have is the following:
DF_Q
Out[97]:
LongTime
2016-01-04 09:30:00 35077034
2016-01-04 09:30:11 1119
2016-01-04 09:30:21 12295250
2016-01-04 09:30:23 1387856
2016-01-04 09:30:40 877954
...
2016-05-27 15:59:53 16986
2016-05-27 15:59:58 50080165
2016-05-27 15:59:59 17097260
Name: Volume, dtype: int64
I first resample that series to 10-minute intervals and obtain:
DF_Qmin = DF_Q.resample('10min').sum()
DF_Qmin
Out[102]:
LongTime
2016-01-04 09:30:00 3.202500e+05
2016-01-04 09:40:00 1.192028e+08
2016-01-04 09:50:00 6.156090e+07
2016-01-04 10:00:00 1.289250e+09
...
2016-05-27 15:20:00 1.035539e+09
2016-05-27 15:30:00 1.489631e+09
2016-05-27 15:40:00 2.228257e+09
2016-05-27 15:50:00 5.352179e+09
Freq: 10T, Name: Volume, dtype: float64
And then I build a pivot table, which I save as an Excel file and from which I manually obtain each day's last-10-minute volume:
2016-01-04 16:50:00 3.693279e+09
2016-01-05 16:50:00 2.158429e+09
...
2016-05-26 15:50:00 1.256878e+08
2016-05-27 15:50:00 6.521489e+09
Is it possible to do this without Excel, or without iterating over each day?
I think you need to groupby by date and aggregate with last. Then use rename_axis (new in pandas 0.18.0) and reset_index:
#if need column LongTime
DF_Qmin = DF_Qmin.reset_index()
print (DF_Qmin.groupby(DF_Qmin.LongTime.dt.date).last())
Sample:
import pandas as pd
DF_Qmin = pd.Series({pd.Timestamp('2016-01-04 09:30:00'): 320250.0, pd.Timestamp('2016-01-04 09:50:00'): 61560900.0, pd.Timestamp('2016-05-27 15:40:00'): 2228257000.0, pd.Timestamp('2016-01-04 09:40:00'): 119202800.0, pd.Timestamp('2016-05-27 15:30:00'): 1489631000.0, pd.Timestamp('2016-01-04 10:00:00'): 1289250000.0, pd.Timestamp('2016-05-27 15:50:00'): 5352179000.0, pd.Timestamp('2016-05-27 15:20:00'): 1035539000.0}, name='Volume')
DF_Qmin.index.name = 'LongTime'
print (DF_Qmin)
LongTime
2016-01-04 09:30:00 3.202500e+05
2016-01-04 09:40:00 1.192028e+08
2016-01-04 09:50:00 6.156090e+07
2016-01-04 10:00:00 1.289250e+09
2016-05-27 15:20:00 1.035539e+09
2016-05-27 15:30:00 1.489631e+09
2016-05-27 15:40:00 2.228257e+09
2016-05-27 15:50:00 5.352179e+09
Name: Volume, dtype: float64
DF_Qmin = DF_Qmin.reset_index()
print (DF_Qmin)
LongTime Volume
0 2016-01-04 09:30:00 3.202500e+05
1 2016-01-04 09:40:00 1.192028e+08
2 2016-01-04 09:50:00 6.156090e+07
3 2016-01-04 10:00:00 1.289250e+09
4 2016-05-27 15:20:00 1.035539e+09
5 2016-05-27 15:30:00 1.489631e+09
6 2016-05-27 15:40:00 2.228257e+09
7 2016-05-27 15:50:00 5.352179e+09
print (DF_Qmin.groupby(DF_Qmin.LongTime.dt.date)
.last()
.rename_axis('Date')
.reset_index())
Date LongTime Volume
0 2016-01-04 2016-01-04 10:00:00 1.289250e+09
1 2016-05-27 2016-05-27 15:50:00 5.352179e+09
If the last time is not necessary:
print (DF_Qmin.groupby(DF_Qmin.index.date)
.last()
.rename_axis('Date')
.reset_index())
Date Volume
0 2016-01-04 1.289250e+09
1 2016-05-27 5.352179e+09
After resampling your Series/DataFrame you can do it this way (.ix is deprecated in later pandas; .loc works the same here):
DF_Qmin.loc[DF_Qmin.index.minute == 50]
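A variant sketch (my own, not from the answers above): after the 10-minute resample the empty overnight bins are NaN (in the pandas of that era; in newer pandas pass min_count=1 to .sum() to keep them NaN), so a daily resample taking the last valid bucket gives the per-day result directly:
DF_Qmin = DF_Q.resample('10min').sum()                  # empty bins are NaN here
last_per_day = DF_Qmin.resample('D').last().dropna()    # last non-NaN 10-minute bucket of each day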

rolling_sum on business day and return new dataframe with date as index

I have such a DataFrame:
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-02 00:00:00 2
2016-01-02 12:00:00 3
2016-01-03 00:00:00 4
2016-01-03 12:00:00 5
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
The reason I call out 2016-01-02 00:00:00 through 2016-01-03 12:00:00 is that those two days are a weekend.
So here is what I wish to do:
I wish to compute a rolling_sum with a window of 2 business days.
For example, I wish to sum
A
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
and then sum (we skip any non-business days)
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
And the result is
A
2016-01-01 NaN
2016-01-04 14
2016-01-05 30
How can I achieve that?
I tried rolling_sum(df, window=2, freq=BDay(1)), but it seems to just pick one row per day rather than summing the two rows (00:00 and 12:00) within the same day.
You could first select only business days, resample to (business) daily frequency for the remaining data points and sum, and then apply rolling_sum:
Starting with some sample data:
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame(data={'A': np.random.randint(0, 10, 500)},
                  index=pd.date_range(datetime(2016, 1, 1), freq='6H', periods=500))
A
2016-01-01 00:00:00 6
2016-01-01 06:00:00 9
2016-01-01 12:00:00 3
2016-01-01 18:00:00 9
2016-01-02 00:00:00 7
2016-01-02 06:00:00 5
2016-01-02 12:00:00 8
2016-01-02 18:00:00 6
2016-01-03 00:00:00 2
2016-01-03 06:00:00 0
2016-01-03 12:00:00 0
2016-01-03 18:00:00 0
2016-01-04 00:00:00 5
2016-01-04 06:00:00 4
2016-01-04 12:00:00 1
2016-01-04 18:00:00 4
2016-01-05 00:00:00 6
2016-01-05 06:00:00 9
2016-01-05 12:00:00 7
2016-01-05 18:00:00 2
....
First select the values on business days:
tsdays = df.index.values.astype('<M8[D]')
bdays = pd.bdate_range(tsdays[0], tsdays[-1]).values.astype('<M8[D]')
df = df[np.in1d(tsdays, bdays)]
Then apply rolling_sum() to the resampled data, where each value represents the sum for an individual business day:
pd.rolling_sum(df.resample('B', how='sum'), window=2)
to get:
A
2016-01-01 NaN
2016-01-04 41
2016-01-05 38
2016-01-06 56
2016-01-07 52
2016-01-08 37
See also [here] for the type conversion and [this question] for the business day extraction.
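Note that pd.rolling_sum and resample(..., how=...) were removed in later pandas. A rough modern translation of the same steps (my sketch, not part of the original answer):
import numpy as np
import pandas as pd

# keep business days only, then aggregate per business day and roll over a 2-day window
bdays = pd.bdate_range(df.index[0].normalize(), df.index[-1].normalize())
df_bd = df[np.isin(df.index.normalize(), bdays)]
result = df_bd.resample('B').sum().rolling(window=2).sum()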

Truncating hourly Pandas Series to be full days only

I'm using Pandas 0.17.1 and I often encounter hourly Series data that contains partial days. There does not seem to be any functionality built into pandas that lets you discard values corresponding to incomplete segments of a coarser date offset at the boundaries of the Series (I would only like to discard partial data that exists at the beginning and/or the end of the Series).
My intuition, given the above, is that I would have to code something up to abstract the criterion (e.g. groupby with a count aggregation, discard hours in days with < 24 hours):
>> hist_data.groupby(lambda x: x.date()).agg('count')
2007-01-01 23
2007-01-02 24
...
An example of desired behavior:
>> hourly_data
2016-01-01 04:00:00 0.603820
2016-01-01 05:00:00 0.806696
2016-01-01 06:00:00 0.938521
2016-01-01 07:00:00 0.781834
2016-01-01 08:00:00 0.154952
...
2016-01-03 22:00:00 0.082177
2016-01-03 23:00:00 0.753210
2016-01-04 00:00:00 0.458402
2016-01-04 01:00:00 0.649496
2016-01-04 02:00:00 0.525321
2016-01-04 03:00:00 0.242605
Freq: H, dtype: float64
>> remove_partial_boundary_data(hourly_data)
2016-01-02 00:00:00 0.833063
2016-01-02 01:00:00 0.131586
2016-01-02 02:00:00 0.876609
2016-01-02 03:00:00 0.319436
2016-01-02 04:00:00 0.056246
...
2016-01-03 20:00:00 0.405725
2016-01-03 21:00:00 0.541096
2016-01-03 22:00:00 0.082177
2016-01-03 23:00:00 0.753210
Freq: H, dtype: float64
However, if my timezone is anything other than UTC (timezone-aware), the approach suggested above seems fraught with peril because counts of hours on DST transition days would be either 23 or 25.
Does anyone know of a clever or built-in way to handle this?
You can do this with a groupby and by filtering out the groups that are not complete. To check for completeness, I first reindex the data and then check whether there are NaN values:
In [10]: hourly_data = pd.Series(np.random.randn(72), index=pd.date_range('2016-01-01 04:00', periods=72, freq='H'))
In [11]: new_idx = pd.date_range(hourly_data.index[0].date(), hourly_data.index[-1].date() + pd.Timedelta('1 day'), freq='H')
In [12]: hourly_data.reindex(new_idx)
Out[12]:
2016-01-01 00:00:00 NaN
2016-01-01 01:00:00 NaN
2016-01-01 02:00:00 NaN
2016-01-01 03:00:00 NaN
2016-01-01 04:00:00 -0.941332
2016-01-01 05:00:00 1.802739
2016-01-01 06:00:00 0.798968
2016-01-01 07:00:00 -0.444979
...
2016-01-04 17:00:00 NaN
2016-01-04 18:00:00 NaN
2016-01-04 19:00:00 NaN
2016-01-04 20:00:00 NaN
2016-01-04 21:00:00 NaN
2016-01-04 22:00:00 NaN
2016-01-04 23:00:00 NaN
2016-01-05 00:00:00 NaN
Freq: H, dtype: float64
This resulted in a timeseries that includes all hours of the dates that are present in the timeseries. This way, we can check if a date was complete by checking if there are NaN values for that date (this method should work for DST transitions), and filter with this criterion:
In [13]: hourly_data.reindex(new_idx).groupby(lambda x: x.date()).filter(lambda x: x.isnull().sum() == 0)
Out[13]:
2016-01-02 00:00:00 -1.231445
2016-01-02 01:00:00 2.371690
2016-01-02 02:00:00 -0.695448
2016-01-02 03:00:00 0.745308
2016-01-02 04:00:00 0.814579
2016-01-02 05:00:00 1.345674
2016-01-02 06:00:00 -1.491470
2016-01-02 07:00:00 0.407182
...
2016-01-03 16:00:00 -0.742151
2016-01-03 17:00:00 0.677229
2016-01-03 18:00:00 0.832271
2016-01-03 19:00:00 -0.183729
2016-01-03 20:00:00 1.938594
2016-01-03 21:00:00 -0.816370
2016-01-03 22:00:00 1.745757
2016-01-03 23:00:00 0.223487
Freq: H, dtype: float64
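If interior gaps should be kept and only the boundary days trimmed (which is what the question literally asks for), here is a rough sketch of my own, assuming a regular hourly index in which 00:00 and 23:00 always exist (DST shifts happen in the early morning, so this should also hold on transition days):
def remove_partial_boundary_data(s):
    start, end = s.index[0], s.index[-1]
    if start != start.normalize():                     # first day does not start at midnight
        start = (start + pd.Timedelta(days=1)).normalize()
    if end.hour != 23:                                 # last day does not run through 23:00
        end = end.normalize() - pd.Timedelta(hours=1)
    return s.loc[start:end]

remove_partial_boundary_data(hourly_data)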
ORIGINAL ANSWER
You can do this with resample by providing a custom function, and in that function you can then specify that NaN values should not be skipped.
Short answer:
hist_data.resample('D', how=lambda x: x.mean(skipna=False))
if the missing hours are already present as NaNs. Otherwise, you can first resample it to a regular hourly series:
hist_data.resample('H').resample('D', how=lambda x: x.mean(skipna=False))
Long answer with an example. With some dummy data (and I insert a NaN in one of the days):
In [77]: hist_data = pd.Series(np.random.randn(72), index=pd.date_range('2016-01-01', periods=72, freq='H'))
In [78]: hist_data
Out[78]:
2016-01-01 00:00:00 -0.717624
2016-01-01 01:00:00 0.029151
2016-01-01 02:00:00 0.535843
...
2016-01-03 21:00:00 0.659923
2016-01-03 22:00:00 -1.085640
2016-01-03 23:00:00 0.571347
Freq: H, dtype: float64
In [80]: hist_data.iloc[30] = np.nan
With count you can see that there is indeed one missing value for the second day:
In [81]: hist_data.resample('D', how='count')
Out[81]:
2016-01-01 24
2016-01-02 23
2016-01-03 24
Freq: D, dtype: int64
By default, 'mean' will ignore this NaN value:
In [83]: hist_data.resample('D', how='mean')
Out[83]:
2016-01-01 0.106537
2016-01-02 -0.112774
2016-01-03 -0.292248
Freq: D, dtype: float64
But you can change this behaviour with the skipna keyword argument:
In [82]: hist_data.resample('D', how=lambda x: x.mean(skipna=False))
Out[82]:
2016-01-01 0.106537
2016-01-02 NaN
2016-01-03 -0.292248
Freq: D, dtype: float64
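For reference, resample(..., how=...) was removed in later pandas; with the method-based API the same idea can be sketched (my translation, not the original answer) as:
hist_data.resample('D').apply(lambda x: x.mean(skipna=False))
# and, if the missing hours are not yet present as NaNs:
hist_data.resample('H').asfreq().resample('D').apply(lambda x: x.mean(skipna=False))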

python pandas date_range when importing with read_csv

I have a .csv file with some data in the following format:
1.69511909, 0.57561167, 0.31437427, 0.35458831, 0.15841189, 0.28239582, -0.18180907, 1.34761404, -1.5059083, 1.29246638
-1.66764664, 0.1488095, 1.03832221, -0.35229205, 1.35705861, -1.56747104, -0.36783851, -0.57636948, 0.9854391, 1.63031066
0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838, -0.08557089, 1.71855596, -0.61235066, -0.32520105, 1.54162629
Every line corresponds to a specific day, and every record in a line corresponds to a specific hour of that day.
Is there a convenient way of importing the data with read_csv such that everything is correctly indexed, i.e. the importing function distinguishes different days (lines) and hours within days (separate records in a line)?
Something like this. Note that I couldn't copy your full string for some reason, so my dataset is cut off.
Read in the string (it reads as a DataFrame because mine had newlines in it), then coerce it to a Series.
In [23]: s = pd.read_csv(StringIO(data)).values
In [24]: s
Out[24]:
array([[-1.66764664, 0.1488095 , 1.03832221, -0.35229205, 1.35705861,
-1.56747104, -0.36783851, -0.57636948, 0.9854391 , 1.63031066],
[ 0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838,
-0.08557089, 1.71855596, nan, nan, nan]])
In [25]: s = Series(pd.read_csv(StringIO(data)).values.ravel())
In [26]: s
Out[26]:
0 -1.667647
1 0.148810
2 1.038322
3 -0.352292
4 1.357059
5 -1.567471
6 -0.367839
7 -0.576369
8 0.985439
9 1.630311
10 0.877638
11 0.607572
12 0.649083
13 -0.683577
14 0.334998
15 -0.085571
16 1.718556
17 NaN
18 NaN
19 NaN
dtype: float64
Just set the index directly. Note that you are solely responsible for alignment; it is VERY easy to be off by one.
In [27]: s.index = pd.date_range('20130101',freq='H',periods=len(s))
In [28]: s
Out[28]:
2013-01-01 00:00:00 -1.667647
2013-01-01 01:00:00 0.148810
2013-01-01 02:00:00 1.038322
2013-01-01 03:00:00 -0.352292
2013-01-01 04:00:00 1.357059
2013-01-01 05:00:00 -1.567471
2013-01-01 06:00:00 -0.367839
2013-01-01 07:00:00 -0.576369
2013-01-01 08:00:00 0.985439
2013-01-01 09:00:00 1.630311
2013-01-01 10:00:00 0.877638
2013-01-01 11:00:00 0.607572
2013-01-01 12:00:00 0.649083
2013-01-01 13:00:00 -0.683577
2013-01-01 14:00:00 0.334998
2013-01-01 15:00:00 -0.085571
2013-01-01 16:00:00 1.718556
2013-01-01 17:00:00 NaN
2013-01-01 18:00:00 NaN
2013-01-01 19:00:00 NaN
Freq: H, dtype: float64
First just read in the DataFrame:
df = pd.read_csv(file_name, sep=',\s+', header=None)
Then set the index to be the dates and the columns to be the hours*
df.index = pd.date_range('2012-01-01', freq='D', periods=len(df))
from pandas.tseries.offsets import Hour
df.columns = [Hour(7+t) for t in df.columns]
In [5]: df
Out[5]:
<7 Hours> <8 Hours> <9 Hours> <10 Hours> <11 Hours> <12 Hours> <13 Hours> <14 Hours> <15 Hours> <16 Hours>
2012-01-01 1.695119 0.575612 0.314374 0.354588 0.158412 0.282396 -0.181809 1.347614 -1.505908 1.292466
2012-01-02 -1.667647 0.148810 1.038322 -0.352292 1.357059 -1.567471 -0.367839 -0.576369 0.985439 1.630311
2012-01-03 0.877638 0.607572 0.649083 -0.683577 0.334998 -0.085571 1.718556 -0.612351 -0.325201 1.541626
Then stack it and combine the Date and Hour levels of the MultiIndex into timestamps:
s = df.stack()
s.index = [x[0]+x[1] for x in s.index]
In [8]: s
Out[8]:
2012-01-01 07:00:00 1.695119
2012-01-01 08:00:00 0.575612
2012-01-01 09:00:00 0.314374
2012-01-01 10:00:00 0.354588
2012-01-01 11:00:00 0.158412
2012-01-01 12:00:00 0.282396
2012-01-01 13:00:00 -0.181809
2012-01-01 14:00:00 1.347614
2012-01-01 15:00:00 -1.505908
2012-01-01 16:00:00 1.292466
2012-01-02 07:00:00 -1.667647
2012-01-02 08:00:00 0.148810
...
* You can use different offsets, see here, e.g. Minute, Second.
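Putting the two answers together, a consolidated sketch (the file name, the 2012-01-01 start date and the 07:00 first hour are assumptions carried over from the example above):
import pandas as pd

df = pd.read_csv('data.csv', sep=r',\s*', header=None, engine='python')
days = pd.date_range('2012-01-01', freq='D', periods=len(df))        # one row per day
idx = pd.DatetimeIndex([d + pd.Timedelta(hours=7 + h)                # one record per hour, starting at 07:00
                        for d in days for h in range(df.shape[1])])
s = pd.Series(df.values.ravel(), index=idx)                          # row-major: day by day, hour by hour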
