Pandas weekly resampling - python

I have a dataframe with daily market data (OHLCV) and am resampling it to weekly.
My specific requirement is that the weekly dataframe's index labels must be the index labels of the first day of that week, whose data is present in the daily dataframe.
For example, in July 2022, the trading week beginning 4th July (for US stocks) should be labelled 5th July, since 4th July was a holiday and not found in the daily dataframe, and the first date in that week found in the daily dataframe is 5th July.
The usual weekly resampling offset aliases and anchored offsets do not seem to have such an option.
I can achieve my requirement specifically for US stocks by importing USFederalHolidayCalendar from pandas.tseries.holiday and then using
bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
dfw.index = dfw.index.map(lambda idx: bday_us.rollforward(idx))
where dfw is the already resampled weekly dataframe with W-MON as option.
However, this would mean that I'd have to use different trading calendars for each different exchange/market, which I'd very much like to avoid.
Any pointers on how to do this simply so that the index label in the weekly dataframe is the index label of the first day of that week available in the daily dataframe would be much appreciated.

You want to group all days by calendar week (Mon-Sun), then aggregate the data, and use the first observed date as the index, correct?
If so, W-MON is not suitable: its weeks end on Monday, so each group would run from Tuesday through Monday. With W-SUN instead, you group by calendar week and each group is labelled with its Sunday. You can then call first on the date column to get the first observed date in each week and replace the index with that result.
This is possible with either groupby or resample:
import numpy as np
import pandas as pd
# simulate daily data, drop a monday
date_range = pd.bdate_range(start='2022-06-06',end='2022-07-31')
date_range = date_range[~(date_range=='2022-07-04')]
# simulate data
df = pd.DataFrame(data = {
'date': date_range,
'return': np.random.random(size=len(date_range))
})
# resample with groupby
g = df.groupby([pd.Grouper(key='date', freq='W-SUN')])
result_groupby = g[['return']].mean() # example aggregation method
result_groupby['date_first_observed'] = g['date'].first()
result_groupby['date_last_observed'] = g['date'].last()
result_groupby.set_index('date_first_observed', inplace=True)
# resample with resample
df.index = df['date']
g = df.resample('W-SUN')
result_resample = g[['return']].mean() # example aggregation method
result_resample['date_first_observed'] = g['date'].first()
result_resample['date_last_observed'] = g['date'].last()
result_resample.set_index('date_first_observed', inplace=True)
This gives
>>> result_groupby
                       return date_last_observed
date_first_observed
2022-06-06           0.704949         2022-06-10
2022-06-13           0.460946         2022-06-17
2022-06-20           0.578682         2022-06-24
2022-06-27           0.361004         2022-07-01
2022-07-05           0.692309         2022-07-08
2022-07-11           0.569810         2022-07-15
2022-07-18           0.435222         2022-07-22
2022-07-25           0.454765         2022-07-29
>>> result_resample
                       return date_last_observed
date_first_observed
2022-06-06           0.704949         2022-06-10
2022-06-13           0.460946         2022-06-17
2022-06-20           0.578682         2022-06-24
2022-06-27           0.361004         2022-07-01
2022-07-05           0.692309         2022-07-08
2022-07-11           0.569810         2022-07-15
2022-07-18           0.435222         2022-07-22
2022-07-25           0.454765         2022-07-29
One row shows 2022-07-05 (Tuesday) instead of 2022-07-04 (Monday).
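Since the question is about OHLCV data, the same idea can be sketched with a per-column aggregation. This is only a minimal sketch under assumptions: the column names open, high, low, close and volume and the toy values are hypothetical, and the daily frame is assumed to have a DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Hypothetical daily OHLCV data; 2022-07-04 (a US holiday) is dropped
dates = pd.bdate_range("2022-06-27", "2022-07-15")
dates = dates[dates != pd.Timestamp("2022-07-04")]
n = len(dates)
daily = pd.DataFrame(
    {
        "open": np.arange(n, dtype=float),
        "high": np.arange(n, dtype=float) + 2,
        "low": np.arange(n, dtype=float) - 1,
        "close": np.arange(n, dtype=float) + 1,
        "volume": np.full(n, 100),
    },
    index=dates,
)

# First date actually present in each Mon-Sun week
first_seen = daily.index.to_series().resample("W-SUN").first()

# Usual OHLCV aggregation rules, one per column
weekly = daily.resample("W-SUN").agg(
    {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
)

# Relabel each week with its first observed date instead of the Sunday
weekly.index = pd.DatetimeIndex(first_seen.to_numpy(), name="date")
```

No trading calendar is involved: the label comes only from the dates present in the daily data. A week with no rows at all would get a NaT label and would need to be dropped separately.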

Related

Pandas resampling data with bigger interval than a whole index range

Situation
I have the following pandas time series data:
date        predicted1
2001-03-13  0.994756
2005-08-22  0.551661
2000-05-07  0.001396
I need to handle the case of resampling into an interval longer than the span of the data (about 5 years), e.g. 10 years:
sample = data.set_index(pd.DatetimeIndex(data['date'])).drop('date', axis=1)['predicted1']
sample.resample('10Y').sum()
I get the following:
date
2000-12-31    0.001396
2010-12-31    1.546418
So the resampling function groups the data for the first year separately from the other years.
Question
How can I group all the data into a single 10-year interval? I would like to get something like this:
date
2000-12-31    1.5478132011506138
You can change the origin, the closed side and the label in resample:
sample.resample('10Y', origin=sample.index.min(), closed='left', label='left').sum()
Output:
date
1999-12-31 1.547813
Freq: 10A-DEC, Name: predicted1, dtype: float64
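As an alternative that does not depend on how resample places its bin edges, the years can be bucketed by hand. This is a different technique from the resample call above, shown only as a sketch; the variable names bucket and totals are hypothetical, and each 10-year bucket is assumed to start at the earliest year in the data.

```python
import pandas as pd

# Data from the question
data = pd.DataFrame(
    {
        "date": ["2001-03-13", "2005-08-22", "2000-05-07"],
        "predicted1": [0.994756, 0.551661, 0.001396],
    }
)
sample = data.set_index(pd.DatetimeIndex(data["date"]))["predicted1"]

# Assign every row to a 10-year bucket that starts at the earliest year
years = sample.index.year.to_numpy()
start_year = years.min()
bucket = start_year + (years - start_year) // 10 * 10

# One total per bucket, labelled by the bucket's first year
totals = sample.groupby(bucket).sum()
```

Here all three rows fall into the bucket starting in 2000, so totals has a single entry holding the sum of the three values.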

How to fill missing observations in time series data

I have a hypothetical time series data frame with some missing observations (the assumption is that the data frame should contain every date in the year with a corresponding value). As the head and tail output shows, some dates and their values are missing (30th Jan and 29th Dec). There may be many more such gaps in the data frame, sometimes spanning more than one consecutive date.
Is there a way to detect the missing dates, insert them into the data frame, and fill the corresponding values with a rolling average over a one-week window (this would naturally increase the number of rows of the data frame)? Any input is appreciated.
df.head(3)
date value
0 2020-01-28 25
1 2020-01-29 32
2 2020-01-31 45
df.tail(3)
date value
3 2020-12-28 24
4 2020-12-30 35
5 2020-12-31 37
df.dtypes
date object
value int64
dtype: object
Create a DatetimeIndex, then use DataFrame.asfreq with rolling and mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('D').rolling('7D').mean()
If you need every date in the year, use:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()
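Note that calling rolling('7D').mean() on the whole frame replaces every value with its rolling mean, not only the inserted ones. If the observed values should stay as they are, one option is to use the rolling mean only to fill the gaps. The following is a minimal sketch using the head of the sample data; it assumes a trailing 7-day window (the current day and the six days before it), which is what a time-based rolling window gives by default.

```python
import pandas as pd

# Head of the sample data from the question (2020-01-30 is missing)
df = pd.DataFrame(
    {"date": ["2020-01-28", "2020-01-29", "2020-01-31"], "value": [25, 32, 45]}
)
df["date"] = pd.to_datetime(df["date"])

# Insert the missing calendar days as NaN rows
full = df.set_index("date").asfreq("D")

# Trailing 7-day mean; NaN rows are skipped when averaging
rolling_mean = full["value"].rolling("7D", min_periods=1).mean()

# Keep observed values and fill only the gaps
full["value"] = full["value"].fillna(rolling_mean)
```

With this input, 2020-01-30 is filled with the mean of the two preceding observations, and the three observed values are unchanged.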

Pandas Manipulating Freq for Business Day DateRange

I am trying to add a set of common date-related columns to my data frame. My approach is to build these columns from the pandas .date_range() method, which gives me the date range for my dataframe.
While I can use attributes like .index.day or .index.weekday_name for general date columns, I would like to add a business day column based on the date_range I constructed, but I am not sure whether I can use the freq alias 'B' or whether I need to create a new date range.
Further, I would like to exclude from the business days a list of holiday dates that I have.
Here is my setup:
Holiday table
holiday_table = holiday_table.set_index('date')
holiday_table_dates = holiday_table.index.to_list() # ['2019-12-31', etc..]
Base Date Table
data_date_range = pd.date_range(start=date_range_start, end=date_range_end)
df = pd.DataFrame({'date': data_date_range}).set_index('date')
df['day_index'] = df.index.day
# Weekday Name
df['weekday_name'] = df.index.weekday_name
# Business day
df['business_day'] = data_date_range.freq("B")
Error at df['business_day'] = data_date_range.freq("B"):
---> 13 df['business_day'] = data_date_range.freq("B")
ApplyTypeError: Unhandled type: str
OK, I think I understand your question now. You are looking to create a new column of working business days (excluding your custom holidays). In my example I just used the regular US holidays from pandas; you already have your holidays as a list in holiday_table_dates, but you should still be able to follow the general layout of my example for your specific use. I also assumed that you are OK with boolean values for your business_day column:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as h_cal
# sample data
data_date_range = pd.date_range(start='1/1/2019', end='12/31/2019')
df = pd.DataFrame({'date': data_date_range}).set_index('date')
df['day_index'] = df.index.day
# Weekday Name
df['weekday_name'] = df.index.day_name()  # .weekday_name was removed in pandas 1.0
# this is just a sample using US holidays
hday = h_cal().holidays(df.index.min(), df.index.max())
# b is the same date range as above, just with the freq set to business days
b = pd.date_range(start='1/1/2019', end='12/31/2019', freq='B')
# find all the working business day where b is not a holiday
bday = b[~b.isin(hday)]
# create a boolean col where the date index is in your custom business day we just created
df['bday'] = df.index.isin(bday)
            day_index weekday_name   bday
date
2019-01-01          1      Tuesday  False
2019-01-02          2    Wednesday   True
2019-01-03          3     Thursday   True
2019-01-04          4       Friday   True
2019-01-05          5     Saturday  False
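With a custom holiday list, the same flag can also be built from pd.offsets.CustomBusinessDay, which accepts the holidays directly. This is a sketch under assumptions: the contents of holiday_table_dates are made up here, and the column names is_business_day and business_day_number are hypothetical. It also adds a running count of business days, in case a number rather than a boolean is wanted.

```python
import pandas as pd

# Hypothetical stand-in for the holiday list from the question
holiday_table_dates = ["2019-01-01"]

days = pd.date_range("2019-01-01", "2019-01-07")
df = pd.DataFrame(index=days)

# Business days (Mon-Fri) with the custom holidays excluded
cbd = pd.offsets.CustomBusinessDay(holidays=holiday_table_dates)
business_days = pd.date_range(days.min(), days.max(), freq=cbd)

# Boolean flag per calendar day
df["is_business_day"] = df.index.isin(business_days)

# Running business-day number; NaN on weekends and holidays
df["business_day_number"] = (
    df["is_business_day"].cumsum().where(df["is_business_day"])
)
```

In this one-week range, 2019-01-01 is excluded as a holiday and the weekend of 5 and 6 January is skipped, so the numbered days are 2, 3, 4 and 7 January.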

pandas Grouper changes date value

Based on this thread: Pandas Subset of a Time Series Without Resampling
The goal is to return the latest date in a month (with a value), and return that value.
Sample code:
Date CumReturn
3/31/2017 1
4/3/2017 .99
5/31/2017 1.022
4/4/2017 100
4/28/2017 1.012
5/1/2017 1.011
6/30/2017 1.033
import pandas as pd
df = pd.read_clipboard(parse_dates = ['Date'])
df = df.set_index('Date')  # set_index returns a new frame, so assign it back
df
I thought this would work:
df.groupby(pd.Grouper(freq = 'M')).max()
But it returns the dates corresponding to the highest values (CumReturn), rather than the max dates in the index.
df.groupby(pd.Grouper(freq = 'M')).last()
However, the output shows that the last day in April is chosen, rather than the latest day in the df. pandas assigns the value from April 28 to April 30, and returns this df:
            CumReturn
Date
2017-03-31      1.000
2017-04-30      1.012
2017-05-31      1.022
2017-06-30      1.033
What causes this behavior? I assume pandas is just picking the latest date in each month, but that seems odd since those dates aren't present in the original data.
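The month-end labels come from the grouper itself: pd.Grouper(freq='M') builds one bin per month and labels each bin with the month-end date, whether or not that date occurs in the data. One way to keep the dates that were actually observed is to group by calendar month and take the last row of each group. This is a minimal sketch, assuming the frame is sorted by date; the names month and last_rows are hypothetical.

```python
import pandas as pd

# Sample data from the question, indexed and sorted by date
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            [
                "2017-03-31", "2017-04-03", "2017-05-31", "2017-04-04",
                "2017-04-28", "2017-05-01", "2017-06-30",
            ]
        ),
        "CumReturn": [1, 0.99, 1.022, 100, 1.012, 1.011, 1.033],
    }
).set_index("Date").sort_index()

# Group by calendar month; tail(1) keeps the original row and its real date
month = df.index.to_period("M")
last_rows = df.groupby(month).tail(1)
```

Because tail returns rows of the original frame, April is represented by 2017-04-28 rather than by the bin label 2017-04-30.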

Filling higher-frequency windows when upsampling with pandas

I am converting low-frequency data to a higher frequency with pandas (for instance monthly to daily). When making this conversion, I would like the resulting higher-frequency index to span the entire low-frequency window. For example, suppose I have a monthly series, like so:
import numpy as np
from pandas import *
data = np.random.randn(2)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s
2012-01-31 0
2012-02-29 1
Now, I convert it to daily frequency:
s.resample('D').asfreq()
2012-01-31 0
2012-02-01 NaN
2012-02-02 NaN
2012-02-03 NaN
...
2012-02-27 NaN
2012-02-28 NaN
2012-02-29 1
Notice how the resulting output goes from 2012-01-31 to 2012-02-29. But what I really want is days from 2012-01-01 to 2012-02-29, so that the daily index "fills" the entire January month, even if 2012-01-31 is still the only non-NaN observation in that month.
I'm also curious whether there are built-in methods that give more control over how the higher-frequency period is filled with the lower-frequency values. In the monthly-to-daily example, the default is to fill in just the last day of each month; if I use a PeriodIndex to index my series, I can also use s.resample('D', convention='start') to have only the first observation filled in. However, I would also like options to fill every day in the month with the monthly value, and to fill every day with the daily average (the monthly value divided by the number of days in the month).
Note that basic backfill and forward fill would not be sufficient to fill every daily observation in the month with the monthly value. For example, if the monthly series runs from January to March but the February value is NaN, then a forward fill would carry the January values into February, which is not desired.
How about this?
s.reindex(date_range(start=s.index[0].replace(day=1), end=s.index[-1], freq='D'))
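For the other two fill options asked about (every day gets the monthly value, or the monthly value divided by the number of days in the month), one possible sketch maps each day back to its month through a PeriodIndex. It assumes the monthly series is indexed by monthly periods; the names filled and daily_avg and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical monthly series indexed by monthly periods
s = pd.Series([31.0, 58.0], index=pd.period_range("2012-01", periods=2, freq="M"))

# Every calendar day from the start of the first month to the end of the last
days = pd.date_range(s.index[0].start_time, s.index[-1].end_time.normalize(), freq="D")

# Look up each day's month in the monthly series
month_of_day = days.to_period("M")
filled = pd.Series(s.reindex(month_of_day).to_numpy(), index=days)

# Monthly value spread evenly over the days of its month
daily_avg = filled / days.days_in_month.to_numpy()
```

Because each day is looked up by its own month, a month whose value is NaN stays NaN for all of its days instead of inheriting the previous month's value, which is the problem with a plain forward fill.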
