I'm running into problems when taking lower-frequency time-series in pandas, such as monthly or quarterly data, and upsampling it to a weekly frequency. For example,
data = np.arange(3, dtype=np.float64)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s.resample('W-SUN')
results in a series filled with NaN everywhere. Basically the same thing happens if I do:
s.reindex(DatetimeIndex(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN'))
If s were indexed with a PeriodIndex instead I would get an error: ValueError: Frequency M cannot be resampled to <1 Week: kwds={'weekday': 6}, weekday=6>
I can understand why this might happen, as the weekly dates don't exactly align with the monthly dates, and weeks can overlap months. However, I would like to implement some simple rules to handle this anyway. In particular, (1) set the last week ending in the month to the monthly value, (2) set the first week ending in the month to the monthly value, or (3) set all the weeks ending in the month to the monthly value. What might be an approach to accomplish that? I can imagine wanting to extend this to bi-weekly data as well.
EDIT: An example of what I would ideally like the output of case (1) to be would be:
2012-01-01 NaN
2012-01-08 NaN
2012-01-15 NaN
2012-01-22 NaN
2012-01-29 0
2012-02-05 NaN
2012-02-12 NaN
2012-02-19 NaN
2012-02-26 1
2012-03-04 NaN
2012-03-11 NaN
2012-03-18 NaN
2012-03-25 2
I made a github issue regarding your question. Need to add the relevant feature to pandas.
Case 3 is achievable directly via fill_method:
In [25]: s
Out[25]:
2012-01-31 0
2012-02-29 1
2012-03-31 2
Freq: M
In [26]: s.resample('W', fill_method='ffill')
Out[26]:
2012-02-05 0
2012-02-12 0
2012-02-19 0
2012-02-26 0
2012-03-04 1
2012-03-11 1
2012-03-18 1
2012-03-25 1
2012-04-01 2
Freq: W-SUN
But for others you'll have to do some contorting right now that will hopefully be remedied by the github issue before the next release.
Also it looks like you want the upcoming 'span' resampling convention as well that will upsample from the start of the first period to the end of the last period. I'm not sure there is an easy way to anchor the start/end points for a DatetimeIndex but it should at least be there for PeriodIndex.
Related
I am looking for a way to check the frequency of dates in a column. I have a date with a frequency of every week, but sometimes there is a hurdle of 2 or 3 weeks, and the pd.infer_freq method returns NaN.
My data:
2022-01-01
2022-01-08
2022-01-23
2022-01-30
Your sample data is too small for pd.infer_freq to be able to infer the frequencies. You could find the most common time difference between consecutive days and use that to infer the frequency -
s = pd.Series(dates)
print((s - s.shift(1)).mode())
Output
0 7 days
dtype: timedelta64[ns]
(using Python 3.5.6)
Hi,
am trying to prepare a dataframe for being used with zipline with a customized trading_calendar. Let's say that I imported relevant timeseries data from 2:15am till 21:58pm only business days, that info will be treated by zipline later with frequency="minute"... so data needs to be accurate.
Something like this:
de = OrderedDict()
de['DE30'] = pd.read_csv("DE.30.1m.csv",header=None)
de['DE30']['date'] = pd.to_datetime(de['DE30']['time'],utc=True)
de['DE30'].set_index('date', inplace=True)
de['DE30'] = de['DE30'].resample(rule="min").mean()
de['DE30'].rename(columns={2:'open',3:'high',4:'low',5:'close',6:'volume'},inplace=True)
de['DE30'].loc[datetime(2020,3,30,21,57):].head()
date open high low close volume
2020-03-30 21:57:00+00:00 9846.4 9848.9 9839.4 9843.4 62.0
2020-03-30 21:58:00+00:00 9842.9 9842.9 9840.9 9840.9 2.0
2020-03-30 21:59:00+00:00 NaN NaN NaN NaN NaN
2020-03-30 22:00:00+00:00 NaN NaN NaN NaN NaN
2020-03-30 22:01:00+00:00 NaN NaN NaN NaN NaN
The tricky thing here is that: if I resample the full timeseries -like before- then I won't miss any "minute" in my original information (i.e. in case my .CSV does not have some important rows by mistake), however I will have a tedious task to remove all rows between 21:59pm and 2:15am (no trading hours) as well as weekends and holidays info. How to do this in a 'simple way'..?
If I do not resample it, I will pass to zipline a panel that contains only the correct trading hours and business days (zipline will work with a pre-defined customized trading_calendar that I registered for that) however I can obtain errors like this:
KeyError: 'the label [2020-03-31 05:37:00+00:00] is not in the [index]'
because the row with the minute 05:37 for that day is not in the original .csv
Option 1) Is it possible to check only if there are 'missing' minutes in an specific timeframe without resampling? or..
Option 2) in case I decide to resample, is there an easy way to clean rows with undesired times, weekends, etc?
trying to get better at using Python and pandas...
I have some stock market data, I have added a day_of_week column(Monday, Tuesday, Wednesday, Thursday, Friday) based on the "Date" columnn, obviously theres Open, High, Low and Close columns as well for each day, and I've also added and pct-chance column, now what I want to do is print out, then plot the pct-change for all the Monday's, or whichever day I like, then I want to plot the average pct-change in a histogram style for each week day in my data, so it'll look like a histogram chart with Monday,Tuesday etc. along the X axis on the bottom, with their average pct-change data on the Y.
So somehow I need to figure out if day_of_week column == "Monday" return pct-change? But I'm assuming there's a better way with pandas somehow, with .loc or something but I can't figure it out :(
The data is indexed by just an integer, then a Date, O,H,L,C, Volume, Range, Pct_change, day_of_week columns etc, below is my output from data.head()
Date Open High Low Settle Volume \
0 2017-07-20 12493.0 12567.0 12381.0 12422.0 94966.0
1 2017-07-19 12446.5 12481.5 12408.0 12432.5 68435.0
2 2017-07-18 12554.0 12569.0 12373.5 12425.5 96933.0
3 2017-07-17 12646.5 12668.0 12531.0 12587.0 65648.0
4 2017-07-14 12642.5 12658.5 12567.0 12611.0 59074.0
Prev. Day Open Interest day_of_week Range < MR > MR Pct_change
0 151217.0 Thursday 186.0 NaN 186.0 -0.568318
1 148249.0 Wednesday 73.5 73.5 NaN -0.112481
2 154485.0 Tuesday 195.5 NaN 195.5 -1.023578
3 145445.0 Monday 137.0 137.0 NaN -0.470486
4 144704.0 Friday 91.5 91.5 NaN -0.249160
Hope someone can help/point me in the right direction, thanks!!
I think you need groupby with aggregate mean and then DataFrame.plot.bar:
df.groupby('day_of_week')['Pct_change'].mean().plot.bar()
For ordering is possible use ordered categorical:
cats = ['Monday','Tuesday','Wednesday','Thursday','Friday']
df['day_of_week'] = df['day_of_week'].astype('category', categories=cats, ordered=True)
df.groupby('day_of_week')['Pct_change'].mean().plot.bar()
I have a pandas DataFrame like this.
As you can see, the data corresponds to end of month data. The problem is that the end of month date is not the same for all the columns. ( The underlying reason is that the last trading day of the month does not always coincide with the end of the month. )
Currently, the end of 2016 January have two rows "2016-01-29" and "2016-01-31." It should be just one row. For example, the end of 2016 January should just be 451.1473 1951.218 1401.093 for Index A, Index B and Index C.
Another point is that even though each row almost always corresponds to the end of monthly data, the data might not be nice enough and can conceivably include the middle of the month data for a random columns. In that case, I don't want to make any adjustment so that any prior data collection error would be caught.
What is the most efficient way to achieve this goal.
EDIT:
Index A Index B Index C
DATE
2015-03-31 2067.89 1535.07 229.1
2015-04-30 2085.51 1543 229.4
2015-05-29 2107.39 NaN NaN
2015-05-31 NaN 1550.39 229.1
2015-06-30 2063.11 1534.96 229
2015-07-31 2103.84 NaN 228.8
2015-08-31 1972.18 1464.32 NaN
2015-09-30 1920.03 1416.84 227.5
2015-10-30 2079.36 NaN NaN
2015-10-31 NaN 1448.39 227.7
2015-11-30 2080.41 1421.6 227.6
2015-12-31 2043.94 1408.33 227.5
2016-01-29 1940.24 NaN NaN
2016-01-31 NaN 1354.66 227.5
2016-02-29 1932.23 1355.42 227.3
So, in this case, I need to combine rows at the end of 2015-05, 2015-10, 2016-01. However, rows at 2015-07 and 2015-08 simply does not have data. So, in this case, I would like to leave 2015-07 and 2015-08 as NaN while I like to merge the end of month rows at 2015-05, 2015-10, 2016-01. Hopefully, this provides more insight to what I am trying to do.
You can use:
df = df.groupby(pd.TimeGrouper('M')).fillna(method='ffill')
df = df.resample(rule='M', how='last')
to create a new DateTimeIndex ending on the last day of the months and sample the last available data point for each months. fillna() ensures that, for columns with of missing data for the last available date, you use the prior available value.
I am converting low-frequency data to a higher frequency with pandas (for instance monthly to daily). When making this conversion, I would like the resulting higher-frequency index to span the entire low-frequency window. For example, suppose I have a monthly series, like so:
import numpy as np
from pandas import *
data = np.random.randn(2)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s
2012-01-31 0
2012-02-29 1
Now, I convert it to daily frequency:
s.resample('D')
2012-01-31 0
2012-02-01 NaN
2012-02-02 NaN
2012-02-03 NaN
...
2012-02-27 NaN
2012-02-28 NaN
2012-02-29 1
Notice how the resulting output goes from 2012-01-31 to 2012-02-29. But what I really want is days from 2011-01-01 to 2012-02-29, so that the daily index "fills" the entire January month, even if 2012-01-31 is still the only non-NaN observation in that month.
I'm also curious if there are built-in methods that give more control over how the higher-frequency period is filled with the lower frequency values. In the monthly to daily example, the default is to fill in just the last day of each month; if I use a PeriodIndex to index my series I can also s.resample('D', convention='start') to have only the first observation filled in. However, I also would like options to fill every day in the month with the monthly value, and to fill every day with the daily average (the monthly value divided by the number of days in the month).
Note that basic backfill and forward fill would not be sufficient to fill every daily observation in the month with the monthly value. For example, if the monthly series runs from January to March but the February value is NaN, then a forward fill would carry the January values into February, which is not desired.
How about this?
s.reindex(DatetimeIndex(start=s.index[0].replace(day=1), end=s.index[-1], freq='D'))