I am converting low-frequency data to a higher frequency with pandas (for instance monthly to daily). When making this conversion, I would like the resulting higher-frequency index to span the entire low-frequency window. For example, suppose I have a monthly series, like so:
import numpy as np
from pandas import *
data = np.random.randn(2)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s
2012-01-31 0
2012-02-29 1
Now, I convert it to daily frequency:
s.resample('D')
2012-01-31 0
2012-02-01 NaN
2012-02-02 NaN
2012-02-03 NaN
...
2012-02-27 NaN
2012-02-28 NaN
2012-02-29 1
Notice that the resulting output runs from 2012-01-31 to 2012-02-29. What I really want is days from 2012-01-01 to 2012-02-29, so that the daily index "fills" the entire month of January, even if 2012-01-31 is still the only non-NaN observation in that month.
I'm also curious if there are built-in methods that give more control over how the higher-frequency period is filled with the lower frequency values. In the monthly to daily example, the default is to fill in just the last day of each month; if I use a PeriodIndex to index my series I can also s.resample('D', convention='start') to have only the first observation filled in. However, I also would like options to fill every day in the month with the monthly value, and to fill every day with the daily average (the monthly value divided by the number of days in the month).
Note that basic backfill and forward fill would not be sufficient to fill every daily observation in the month with the monthly value. For example, if the monthly series runs from January to March but the February value is NaN, then a forward fill would carry the January values into February, which is not desired.
How about this?
s.reindex(date_range(start=s.index[0].replace(day=1), end=s.index[-1], freq='D'))
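For the other two options in the question (every day carrying the monthly value, and every day carrying the daily average), here is a minimal sketch. It assumes the monthly series is indexed by a PeriodIndex; the names monthly, days, every_day and daily_avg and the sample values are made up for illustration. Each day looks up its own month, so a month that is NaN stays NaN instead of inheriting the previous month's value.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data; February is deliberately missing.
monthly = pd.Series([3.1, np.nan, 6.2],
                    index=pd.period_range('2012-01', periods=3, freq='M'))

# Daily index from the first day of the first month
# to the last day of the last month.
days = pd.date_range(monthly.index[0].start_time,
                     monthly.index[-1].end_time.normalize(), freq='D')

# Look up each day's own month, so a missing month stays NaN
# (this is a lookup, not a forward fill).
every_day = pd.Series(monthly.reindex(days.to_period('M')).to_numpy(),
                      index=days)

# Monthly value divided by the number of days in that month.
daily_avg = every_day / days.days_in_month.to_numpy()
```

With this layout, every_day holds 3.1 on each January day and NaN on each February day, and daily_avg holds 3.1 / 31 in January.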
I have a dataframe with daily market data (OHLCV) and am resampling it to weekly.
My specific requirement is that the weekly dataframe's index labels must be the index labels of the first day of that week, whose data is present in the daily dataframe.
For example, in July 2022, the trading week beginning 4th July (for US stocks) should be labelled 5th July, since 4th July was a holiday and not found in the daily dataframe, and the first date in that week found in the daily dataframe is 5th July.
The usual weekly resampling offset aliases and anchored offsets do not seem to have such an option.
I can achieve my requirement specifically for US stocks by importing USFederalHolidayCalendar from pandas.tseries.holiday and then using
bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
dfw.index = dfw.index.map(lambda idx: bday_us.rollforward(idx))
where dfw is the already resampled weekly dataframe with W-MON as option.
However, this would mean that I'd have to use different trading calendars for each different exchange/market, which I'd very much like to avoid.
Any pointers on how to do this simply so that the index label in the weekly dataframe is the index label of the first day of that week available in the daily dataframe would be much appreciated.
You want to group all days by calendar week (Mon-Sun), then aggregate the data, and use the first observed date as the index, correct?
If so, W-MON is not applicable because it groups dates from Tuesday through Monday. With W-SUN instead, you group by calendar week, and the index label is the Sunday. You can then apply first to the date column to obtain the first observed date in each week and replace the index with that result.
This is possible with either groupby or resample:
import numpy as np
import pandas as pd
# simulate daily data, drop a monday
date_range = pd.bdate_range(start='2022-06-06',end='2022-07-31')
date_range = date_range[~(date_range=='2022-07-04')]
# simulate data
df = pd.DataFrame(data = {
'date': date_range,
'return': np.random.random(size=len(date_range))
})
# resample with groupby
g = df.groupby([pd.Grouper(key='date', freq='W-SUN')])
result_groupby = g[['return']].mean() # example aggregation method
result_groupby['date_first_observed'] = g['date'].first()
result_groupby['date_last_observed'] = g['date'].last()
result_groupby.set_index('date_first_observed', inplace=True)
# resample with resample
df.index = df['date']
g = df.resample('W-SUN')
result_resample = g[['return']].mean() # example aggregation method
result_resample['date_first_observed'] = g['date'].first()
result_resample['date_last_observed'] = g['date'].last()
result_resample.set_index('date_first_observed', inplace=True)
This gives
>>> result_groupby
return date_last_observed
date_first_observed
2022-06-06 0.704949 2022-06-10
2022-06-13 0.460946 2022-06-17
2022-06-20 0.578682 2022-06-24
2022-06-27 0.361004 2022-07-01
2022-07-05 0.692309 2022-07-08
2022-07-11 0.569810 2022-07-15
2022-07-18 0.435222 2022-07-22
2022-07-25 0.454765 2022-07-29
>>> result_resample
return date_last_observed
date_first_observed
2022-06-06 0.704949 2022-06-10
2022-06-13 0.460946 2022-06-17
2022-06-20 0.578682 2022-06-24
2022-06-27 0.361004 2022-07-01
2022-07-05 0.692309 2022-07-08
2022-07-11 0.569810 2022-07-15
2022-07-18 0.435222 2022-07-22
2022-07-25 0.454765 2022-07-29
One row shows 2022-07-05 (Tuesday) instead of 2022-07-04 (Monday).
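Since the question is about OHLCV data, here is a sketch of the same idea with a typical OHLCV aggregation. The column names, the synthetic prices and the variable names are assumptions for illustration only. It groups by a Mon-Sun weekly period, so no trading calendar is needed, and labels each week with the first date actually present in the daily data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily OHLCV data with 4th July missing (holiday).
days = pd.bdate_range('2022-06-27', '2022-07-15')
days = days[days != pd.Timestamp('2022-07-04')]
close = 100.0 + np.arange(len(days))
daily = pd.DataFrame({'Open': close, 'High': close + 1, 'Low': close - 1,
                      'Close': close, 'Volume': 1000}, index=days)

# One Mon-Sun period per row; rows in the same week share a period.
week = daily.index.to_period('W-SUN')

# Usual OHLCV aggregation per week.
weekly = daily.groupby(week).agg({'Open': 'first', 'High': 'max',
                                  'Low': 'min', 'Close': 'last',
                                  'Volume': 'sum'})

# Relabel each week with the first date observed in the daily data.
first_seen = daily.index.to_series().groupby(week).first()
weekly.index = pd.DatetimeIndex(first_seen.to_numpy(), name='first_observed')
```

Here the second weekly row is labelled 2022-07-05 rather than 2022-07-04, because the label comes from the data and not from a calendar.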
Situation
I have the following pandas time series data:
date        predicted1
2001-03-13  0.994756
2005-08-22  0.551661
2000-05-07  0.001396
I need to handle resampling into intervals longer than 5 years, e.g. 10 years:
sample = data.set_index(pd.DatetimeIndex(data['date'])).drop('date', axis=1)['predicted1']
sample.resample('10Y').sum()
I get the following:
date
2000-12-31    0.001396
2010-12-31    1.546418
So the resampling function puts the data for the first year in one group and the data for the other years in a separate group.
Question
How can I group all the data into a single 10-year interval? I want to get something like this:
date
2000-12-31    1.5478132011506138
You can change the origin, the closed side, and the label in resample:
sample.resample('10Y', origin=sample.index.min(), closed='left', label='left').sum()
Output:
date
1999-12-31 1.547813
Freq: 10A-DEC, Name: predicted1, dtype: float64
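As an alternative that does not depend on how resample anchors yearly bins, here is a sketch that buckets the rows by 10-year block counted from the earliest year in the data. This is a different technique from the resample call above, and the names start_year, bucket_start and totals are made up for illustration.

```python
import pandas as pd

# The three observations from the question.
sample = pd.Series(
    [0.994756, 0.551661, 0.001396],
    index=pd.to_datetime(['2001-03-13', '2005-08-22', '2000-05-07']),
    name='predicted1',
)

# First year of the 10-year block each row falls into,
# counting blocks from the earliest year in the data.
start_year = sample.index.year.min()
bucket_start = start_year + (sample.index.year - start_year) // 10 * 10

# One total per 10-year block, labelled by the block's first year.
totals = sample.groupby(bucket_start).sum()
```

For this data all three rows land in the block starting in 2000, so totals has a single entry. The label is an integer year rather than a timestamp.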
I have a hypothetical time series data frame with some missing observations (the assumption is that the data frame should contain every date in the year with a corresponding value). As the head and tail output shows, certain dates and their values are missing (30th Jan and 29th Dec). There would be many more such gaps in the data frame, sometimes spanning more than one consecutive date.
Is there a way to detect the missing dates, insert them into the data frame, and fill the corresponding values with a rolling average over a one-week window (this would naturally increase the number of rows)? Any input is appreciated.
df.head(3)
date value
0 2020-01-28 25
1 2020-01-29 32
2 2020-01-31 45
df.tail(3)
date value
3 2020-12-28 24
4 2020-12-30 35
5 2020-12-31 37
df.dtypes
date object
value int64
dtype: object
Create a DatetimeIndex, then use DataFrame.asfreq with rolling and mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('d').rolling('7D').mean()
If you need all dates in the year, use:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()
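Note that chaining rolling and mean as above replaces every value with its 7-day average, including the observed ones. If only the inserted dates should be filled, one possible variant is sketched below, assuming a trailing 7-day window is what is meant by a one-week rolling average. The names full and rolling_mean are made up for illustration.

```python
import pandas as pd

# The head of the data from the question; 2020-01-30 is missing.
df = pd.DataFrame({'date': ['2020-01-28', '2020-01-29', '2020-01-31'],
                   'value': [25, 32, 45]})
df['date'] = pd.to_datetime(df['date'])

# Insert the missing dates as NaN rows.
full = df.set_index('date').asfreq('D')

# Trailing 7-day mean; NaN rows are skipped when averaging.
rolling_mean = full['value'].rolling('7D', min_periods=1).mean()

# Keep observed values, fill only the gaps with the rolling mean.
full['value'] = full['value'].fillna(rolling_mean)
```

Here 2020-01-30 is filled with 28.5, the mean of 25 and 32, while the three observed values are left unchanged.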
Trying to get better at using Python and pandas...
I have some stock market data. I have added a day_of_week column (Monday, Tuesday, Wednesday, Thursday, Friday) based on the "Date" column; there are also Open, High, Low and Close columns for each day, and I've added a Pct_change column. What I want to do is print out, then plot, the Pct_change for all the Mondays, or whichever day I like. Then I want to plot the average Pct_change for each weekday in my data as a histogram-style chart, with Monday, Tuesday etc. along the X axis at the bottom and their average Pct_change on the Y axis.
So somehow I need to figure out: if the day_of_week column == "Monday", return Pct_change? But I'm assuming there's a better way with pandas, with .loc or something, and I can't figure it out :(
The data is indexed by just an integer, then a Date, O,H,L,C, Volume, Range, Pct_change, day_of_week columns etc, below is my output from data.head()
Date Open High Low Settle Volume \
0 2017-07-20 12493.0 12567.0 12381.0 12422.0 94966.0
1 2017-07-19 12446.5 12481.5 12408.0 12432.5 68435.0
2 2017-07-18 12554.0 12569.0 12373.5 12425.5 96933.0
3 2017-07-17 12646.5 12668.0 12531.0 12587.0 65648.0
4 2017-07-14 12642.5 12658.5 12567.0 12611.0 59074.0
Prev. Day Open Interest day_of_week Range < MR > MR Pct_change
0 151217.0 Thursday 186.0 NaN 186.0 -0.568318
1 148249.0 Wednesday 73.5 73.5 NaN -0.112481
2 154485.0 Tuesday 195.5 NaN 195.5 -1.023578
3 145445.0 Monday 137.0 137.0 NaN -0.470486
4 144704.0 Friday 91.5 91.5 NaN -0.249160
Hope someone can help/point me in the right direction, thanks!!
I think you need groupby with the mean aggregation and then Series.plot.bar:
df.groupby('day_of_week')['Pct_change'].mean().plot.bar()
To control the order of the bars, you can use an ordered categorical:
cats = ['Monday','Tuesday','Wednesday','Thursday','Friday']
df['day_of_week'] = pd.Categorical(df['day_of_week'], categories=cats, ordered=True)
df.groupby('day_of_week')['Pct_change'].mean().plot.bar()
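For the first part of the question (pulling out the Pct_change for one weekday), a boolean mask with .loc is probably the simplest route. A small sketch using the five rows from data.head(); the names mondays, order and avg are made up, and reindex is used here as an alternative way to order the weekdays.

```python
import pandas as pd

# The five rows shown in data.head(), reduced to the two relevant columns.
df = pd.DataFrame({
    'day_of_week': ['Thursday', 'Wednesday', 'Tuesday', 'Monday', 'Friday'],
    'Pct_change': [-0.568318, -0.112481, -1.023578, -0.470486, -0.249160],
})

# Select Pct_change for one weekday with a boolean mask.
mondays = df.loc[df['day_of_week'] == 'Monday', 'Pct_change']

# Average per weekday, put into calendar order with reindex.
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
avg = df.groupby('day_of_week')['Pct_change'].mean().reindex(order)

# avg.plot.bar() would then draw the bar chart (requires matplotlib).
```

The same mask works for any other day by changing the string, and mondays.plot() would plot just those values.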
I'm running into problems when taking lower-frequency time-series in pandas, such as monthly or quarterly data, and upsampling it to a weekly frequency. For example,
data = np.arange(3, dtype=np.float64)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s.resample('W-SUN')
results in a series filled with NaN everywhere. Basically the same thing happens if I do:
s.reindex(DatetimeIndex(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN'))
If s were indexed with a PeriodIndex instead I would get an error: ValueError: Frequency M cannot be resampled to <1 Week: kwds={'weekday': 6}, weekday=6>
I can understand why this might happen, as the weekly dates don't exactly align with the monthly dates, and weeks can overlap months. However, I would like to implement some simple rules to handle this anyway. In particular, (1) set the last week ending in the month to the monthly value, (2) set the first week ending in the month to the monthly value, or (3) set all the weeks ending in the month to the monthly value. What might be an approach to accomplish that? I can imagine wanting to extend this to bi-weekly data as well.
EDIT: An example of what I would ideally like the output of case (1) to be would be:
2012-01-01 NaN
2012-01-08 NaN
2012-01-15 NaN
2012-01-22 NaN
2012-01-29 0
2012-02-05 NaN
2012-02-12 NaN
2012-02-19 NaN
2012-02-26 1
2012-03-04 NaN
2012-03-11 NaN
2012-03-18 NaN
2012-03-25 2
I opened a GitHub issue regarding your question; the relevant feature needs to be added to pandas.
Case 3 is achievable directly via fill_method:
In [25]: s
Out[25]:
2012-01-31 0
2012-02-29 1
2012-03-31 2
Freq: M
In [26]: s.resample('W', fill_method='ffill')
Out[26]:
2012-02-05 0
2012-02-12 0
2012-02-19 0
2012-02-26 0
2012-03-04 1
2012-03-11 1
2012-03-18 1
2012-03-25 1
2012-04-01 2
Freq: W-SUN
But for others you'll have to do some contorting right now that will hopefully be remedied by the github issue before the next release.
Also it looks like you want the upcoming 'span' resampling convention as well that will upsample from the start of the first period to the end of the last period. I'm not sure there is an easy way to anchor the start/end points for a DatetimeIndex but it should at least be there for PeriodIndex.
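In the meantime, the three rules from the question can be built by hand. The sketch below is one possible workaround, not the feature from the GitHub issue. It assumes the monthly series is indexed by a PeriodIndex and maps every Sunday week-ending date to the month it falls in; the variable names are made up for illustration.

```python
import pandas as pd

# The monthly series from the question, indexed by monthly periods.
monthly = pd.Series([0.0, 1.0, 2.0],
                    index=pd.period_range('2012-01', periods=3, freq='M'))

# Every Sunday between the start of the first month
# and the end of the last month.
weeks = pd.date_range(monthly.index[0].start_time,
                      monthly.index[-1].end_time, freq='W-SUN')
week_month = weeks.to_period('M')

# Case 3: every week ending in the month gets the monthly value.
all_weeks = pd.Series(monthly.reindex(week_month).to_numpy(), index=weeks)

# Case 1: only the last week ending in each month keeps the value.
last_week = all_weeks.where(~week_month.duplicated(keep='last'))

# Case 2: only the first week ending in each month keeps the value.
first_week = all_weeks.where(~week_month.duplicated(keep='first'))
```

For this data last_week is non-NaN only on 2012-01-29, 2012-02-26 and 2012-03-25, which matches the desired output for case (1). Changing the freq passed to date_range (for example to '2W-SUN') should extend this to bi-weekly dates, though I have not checked that case.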