Pandas - Resample when not multiple of frequency? - python

I have a time series of weekly observations spanning 1204 days.
I want to resample it on a 365D basis (by summing), but the series runs across 3.29 * 365D, which is not a multiple of 365D.
By default, resample returns 4 rows.
Here is the raw data:
DATE
2012-08-12 15350.0
2012-08-19 11204.0
2012-08-26 11795.0
2012-09-02 15160.0
2012-09-09 9991.0
2012-09-16 12337.0
2012-09-23 10721.0
2012-09-30 9952.0
2012-10-07 11903.0
2012-10-14 8537.0
...
2015-09-27 14234.0
2015-10-04 17917.0
2015-10-11 13610.0
2015-10-18 8716.0
2015-10-25 15191.0
2015-11-01 8925.0
2015-11-08 13306.0
2015-11-15 8884.0
2015-11-22 11527.0
2015-11-29 6859.0
df.index.max() - df.index.min()
Timedelta('1204 days 00:00:00')
If I apply:
df.resample('365D').sum()
I get:
DATE
2012-08-12 536310.0
2013-08-12 555016.0
2014-08-12 569548.0
2015-08-12 245942.0
Freq: 365D, dtype: float64
It seems like the last bin is the one covering less than 365 days.
How do I force resample to exclude it from the result?

df.resample('365D') starts binning at the lowest day in the index, so the last bin will almost always not cover a full 365 days. Just skip it:
df.resample('365D').sum()[:-1]
You can also consider sampling by start/end of the year
df.resample('A').sum()
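As a sketch of the trimming approach, here is a stand-in weekly series over the same 1204-day span as the question (the values are made up, only the shape matters):

```python
import pandas as pd

# stand-in data: weekly observations over the same 1204-day span
idx = pd.date_range('2012-08-12', '2015-11-29', freq='7D')
s = pd.Series(10000.0, index=idx)

full = s.resample('365D').sum()   # 4 bins; the last one is partial
trimmed = full[:-1]               # keep only the complete 365-day bins
```

This reproduces the 4-bin result from the question and drops the trailing partial bin.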

Related

How to calculate a rolling 52 week high with missing dates and grouping by company ticker (complex data frame)?

I have a large dataset with daily stock prices and company codes. I need to compute the 52 week high for each stock at every point in time, based on the previous 52 weeks. The problem is that some of the companies are missing data for some periods, so if I use a fixed window size for a rolling max the results are not correct.
First I tried this:
df['52wh'] = df["PRC"].groupby(df['id']).shift(1).rolling(253).max()
However, this doesn't work since it does not take into account the dates but only the previous 253 entries.
I also tried this:
df['date'] = pd.to_datetime(df['date'])
df['52wh'] = df.set_index('date').groupby('id').rolling(window=365, freq='D', min_periods=1).max()['PRC']
But this gives me this error:
ValueError: cannot handle a non-unique multi-index!
I am thinking maybe a rolling function with get bounds could work but I don't know how to write a good one.
Here is an example of what the data frame looks like:
date id PRC
0 2010-01-09 10158 11.87
1 2010-01-10 10158 12.30
2 2010-01-11 10158 12.37
3 2010-01-12 10158 12.89
4 2010-02-08 10158 10.13
... ... ... ...
495711 2018-12-12 93188 14.48
495712 2018-12-13 93188 14.48
495713 2018-12-14 93188 14.48
495714 2018-12-17 93188 14.48
495715 2018-12-18 93188 NaN
Can someone help? Thanks in advance guys! :)

How to group pandas dataframe by unequal time interval or string value

I have a dataframe with 5 minute time granularity. Currently I group the df into whole days and read the min / max values from two columns:
df.groupby(pd.Grouper(key='Date', freq='1D')).agg({'Low':[np.min],'High':[np.max] })
Now, instead of getting the whole day, I need to boil the dataframe down to a split day, with unequal intervals. Let's say 7:00 to 15:00 and 15:00 to 22:00.
How could I do it? freq only accepts equal intervals.
I also have a column with value 'A' for the first part of the day, and 'B' for the second part of the day, in case it's easier to group.
Date High Low Session
0 2019-06-20 07:00:00 2927.50 2926.75 A
1 2019-06-20 07:05:00 2927.50 2927.00 A
2 2019-06-20 07:10:00 2927.25 2926.50 A
3 2019-06-20 07:15:00 2926.75 2926.25 A
4 2019-06-20 07:20:00 2926.75 2926.00 A
You can use your Session column
df = df.groupby([df.Date.dt.date, 'Session']).agg({'Low':'min', 'High':'max'})
Or you can make your own with pd.cut (note right=False, so that hour 7 lands in the first bin and hour 15 in the second):
df = (
    df.groupby([df.Date.dt.date,
                pd.cut(df.Date.dt.hour, bins=[7, 15, 22], right=False,
                       labels=['7-15', '15-22'])])
      .agg({'Low': 'min', 'High': 'max'})
)

Pandas time series resample, binning seems off

I was answering another question here with something about pandas I thought I knew, time series resampling, when I noticed this odd binning.
Let's say I have a dataframe with a daily date range index and a column I want to resample and sum on.
index = pd.date_range(start="1/1/2018", end="31/12/2018")
df = pd.DataFrame(np.random.randint(100, size=len(index)),
columns=["sales"], index=index)
>>> df.head()
sales
2018-01-01 66
2018-01-02 18
2018-01-03 45
2018-01-04 92
2018-01-05 76
Now if I resample by one month, everything looks fine:
>>> df.resample("1M").sum()
sales
2018-01-31 1507
2018-02-28 1186
2018-03-31 1382
[...]
2018-11-30 1342
2018-12-31 1337
If I try to resample by more months, though, the binning starts to look off. This is particularly evident with 6M:
df.resample("6M").sum()
sales
2018-01-31 1507
2018-07-31 8393
2019-01-31 7283
The first bin covers just one month of data, and the last bin extends one month into the future. Maybe I have to set closed="left" to get the proper limits:
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9054
2019-06-30 39
Now I have an extra bin in 2019 containing data from 2018-12-31...
Is this working properly? Am I missing an option I should set?
EDIT: here's the output I would expect resampling one year in six month intervals, first interval spanning from Jan 1st to Jun 30, second interval spanning from Jul 1st to Dec 31.
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9093 # 9054 + 39
Note that there's also some doubt here about what is happening with the June 30 data: does it go in the first bin, as I would expect, or the second? With the last bin it's evident, but the same is probably happening in all the bins.
The M time offset alias implies month end frequency.
What you need is 6MS which is an alias for month start frequency:
df.resample('6MS').sum()
resulting in
sales
2018-01-01 8130
2018-07-01 9563
2019-01-01 0
Also df.groupby(pd.Grouper(freq='6MS')).sum() can be used interchangeably.
For extra clarity you can compare ranges directly:
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6M')
DatetimeIndex(['2018-01-31', '2018-07-31'], dtype='datetime64[ns]', freq='6M')
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6MS')
DatetimeIndex(['2018-01-01', '2018-07-01'], dtype='datetime64[ns]', freq='6MS')
Add np.random.seed(365) so both outputs can be compared.
print(df.resample("6M", kind='period').sum())
sales
2018-01 8794
2018-07 9033
would this work for you?
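To double-check that 6MS produces the calendar split the question expects, the resampled sums can be compared against explicit half-year slices (a small verification sketch using the seeded setup from above):

```python
import numpy as np
import pandas as pd

np.random.seed(365)
index = pd.date_range(start="2018-01-01", end="2018-12-31")
df = pd.DataFrame({"sales": np.random.randint(100, size=len(index))},
                  index=index)

out = df.resample("6MS")["sales"].sum()

# the month-start bins line up exactly with Jan-Jun and Jul-Dec
first_half = df.loc["2018-01-01":"2018-06-30", "sales"].sum()
second_half = df.loc["2018-07-01":"2018-12-31", "sales"].sum()
```

The first two resampled bins match the half-year sums, confirming the month-start anchoring.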

Adding values with same frequency in time series with pandas

I have a pandas DataFrame indexed by a pandas.core.indexes.datetimes.DatetimeIndex and I'd like to add new values starting from the last date in the series. I need each newly inserted value to land on the next date, at daily frequency.
Example:
TotalOrders
Date
2013-12-29 3756
2013-12-30 6222
2013-12-31 4918
I'd like to insert, say, 5000 and have it automatically assigned to 2014-01-01, and so on for the following values. What would be the best way to do that?
Example:
TotalOrders
Date
2013-12-29 3756
2013-12-30 6222
2013-12-31 4918
2014-01-01 5000
Use loc with DateOffset:
df.loc[df.index.max()+pd.DateOffset(1)] = 5000
TotalOrders
Date
2013-12-29 3756
2013-12-30 6222
2013-12-31 4918
2014-01-01 5000
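The same loc pattern extends to appending several values in sequence, since each assignment bumps the max date by one day (a sketch with made-up values beyond the first 5000):

```python
import pandas as pd

df = pd.DataFrame({'TotalOrders': [3756, 6222, 4918]},
                  index=pd.to_datetime(['2013-12-29', '2013-12-30', '2013-12-31']))
df.index.name = 'Date'

# each new value lands one day after the current last date
for value in [5000, 5100, 4800]:
    df.loc[df.index.max() + pd.DateOffset(days=1)] = value
```

After the loop the index runs through 2014-01-03 at daily frequency.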

Pandas DataFrame.resample monthly offset from particular day of month

I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame, even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row, as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import DateOffset
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate that so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DateTimeIndex....
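No accepted answer is included here, but the first step, a trailing one-month mean pegged to each row's own date, can be sketched as a brute-force half-open window per row. It reproduces the PreviousMonthMean column in the expected table above, at O(n^2) cost:

```python
import pandas as pd

rng = pd.date_range('2017-01-03', periods=20, freq='8D')
df = pd.DataFrame({'x': rng.day}, index=rng)
df['MonthPrior'] = df.index - pd.DateOffset(months=1)

# mean of x over the half-open window [MonthPrior, row date), pegged to
# each row's own date instead of the start of the frame
df['PreviousMonthMean'] = [
    df.loc[(df.index >= start) & (df.index < end), 'x'].mean()
    for start, end in zip(df['MonthPrior'], df.index)
]
```

A row with no observations in its trailing month, like the first one, gets NaN, matching the expected output.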
