Converting Monthly Data to Daily Data For Regression Problem - python

I have two series: monthly CPI (consumer price inflation) data, and the daily close price of the EUR/USD exchange rate. The issue I am having is converting the monthly CPI data into daily values so that I can combine the two series into one dataframe. Since I am using this for a regression machine learning (ML) task, namely XGBoost, I am unsure of the correct way to do this. I've read about a few methods: interpolation (e.g. Chow-Lin or cubic spline), Kalman filtering, manipulating the datetime index so it is daily and then using ffill() to fill in the resulting NaNs (see the sketch after the output below), and training separate models (which I don't want to do).
I really don't know the correct procedure, especially since temporal disaggregation is a major concern for model accuracy. Here is the code:
from datetime import datetime
import pandas as pd
import pandas_datareader.data as pdr
import yfinance as yf
eurusd = yf.download("EURUSD=X", start=datetime(2000, 1, 1), end=datetime(2021, 7, 21))["Close"] # daily
cpi = pdr.FredReader("CPALTT01USM657N", start=datetime(2000, 1, 1), end=datetime(2021, 7, 21)).read() # monthly
The output of the data is as follows:
(N.B.: I will eventually slice the dataframe to keep data from around 2008 onwards, both for feature engineering and so the ML model trains on more recent data.)
Date
2003-12-01 1.196501
2003-12-02 1.208897
2003-12-03 1.212298
2003-12-04 1.208094
2003-12-05 1.218695
...
2021-07-15 1.183334
2021-07-16 1.181181
2021-07-19 1.181401
2021-07-20 1.179384
2021-07-21 1.178411
Name: Close, Length: 4554, dtype: float64
CPALTT01USM657N
DATE
2000-01-01 0.297089
2000-02-01 0.592417
2000-03-01 0.824499
2000-04-01 0.058411
2000-05-01 0.116754
... ...
2021-01-01 0.425378
2021-02-01 0.547438
2021-03-01 0.708327
2021-04-01 0.821891
2021-05-01 0.801711
[257 rows x 1 columns]
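For reference, the reindex-and-ffill approach I mentioned would look something like this (just a sketch; I don't know whether it is statistically sound as a disaggregation method):
daily_index = eurusd.index.union(cpi.index)
cpi_daily = cpi.reindex(daily_index).ffill()  # carry each monthly CPI print forward
data = pd.concat([eurusd, cpi_daily], axis=1).dropna()  # one row per trading day
# Caveat: this treats the CPI value as known from the 1st of its month,
# which leaks future information relative to the actual release date.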
Really appreciate all the help that I can get! Many thanks.

Related

How to work around the date range limit in Pandas for plotting?

Sorry if this question has been asked before, but I can't seem to find one that describes my current issue.
Basically, I have a large climate dataset that is not bound to "real" dates. The dataset starts at "year one" and goes to "year 9999". The dates are stored as strings such as Jan-01, Feb-01, Mar-01, etc., where the number indicates the year. When trying to convert this column to datetime objects, I get an out-of-range error. (My reading suggests this is due to a 64-bit limit on the possible datetime timestamps.)
What is a good way to work around this problem and process the date information so I can effectively plot the associated data against these dates over this ~10,000-year period?
Thanks
The cftime library was created specifically for this purpose, and xarray has a convenient xr.cftime_range function that makes creating such a range easy:
In [3]: import xarray as xr, pandas as pd
In [4]: date_range = xr.cftime_range('0001-01-01', '9999-01-01', freq='D')
In [5]: type(date_range)
Out[5]: xarray.coding.cftimeindex.CFTimeIndex
This creates a CFTimeIndex object which plays nicely with pandas:
In [8]: df = pd.DataFrame({"date": date_range, "vals": range(len(date_range))})
In [9]: df
Out[9]:
                        date     vals
0        0001-01-01 00:00:00        0
1        0001-01-02 00:00:00        1
2        0001-01-03 00:00:00        2
3        0001-01-04 00:00:00        3
4        0001-01-05 00:00:00        4
...                      ...      ...
3651692  9998-12-28 00:00:00  3651692
3651693  9998-12-29 00:00:00  3651693
3651694  9998-12-30 00:00:00  3651694
3651695  9998-12-31 00:00:00  3651695
3651696  9999-01-01 00:00:00  3651696
[3651697 rows x 2 columns]
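To plot against these dates, wrapping the data in an xarray object is probably the easiest route; as far as I know, matplotlib needs the nc-time-axis package installed before it can draw cftime coordinates. A minimal sketch, assuming that package is available:
# hypothetical example: plot dummy data over the first year of the range
da = xr.DataArray(range(365), coords={"date": date_range[:365]}, dims="date")
da.plot()  # requires nc-time-axis for the cftime x-axis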

Problem with statsmodels ARIMA out-of-sample forecast

I am developing an ARIMA model to forecast oil exports. Please see the code below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
import stats_can
# Get and preprocess data
df = stats_can.sc.vectors_to_df("v1001826820", periods=200)
df.rename(columns={"v1001826820": "oil_exports"}, inplace=True)
df.index.rename('date', inplace=True)
df["log_oil_exports"] = df.apply(np.log)
# Plot logarithm of oil exports
df.log_oil_exports.plot.line()
# Run ADF test on levels
adfuller(df.log_oil_exports, regression='ct', autolag='AIC')
# Plot first difference of log oil exports
df.log_oil_exports.diff().dropna().plot.line()
# Run ADF test on first difference
adfuller(df.log_oil_exports.diff().dropna(), regression='nc', autolag='AIC')
# Fit ARIMA(1,1,2) model
oil_export_model = ARIMA(endog=df.log_oil_exports, order=(1, 1, 2), trend='t')
oil_export_model_fit = oil_export_model.fit()
oil_export_model_fit.summary()
# Print last 5 observations from the training data
df.log_oil_exports.tail()
# Print last 5 fitted values
oil_export_model_fit.predict()[-5:]
# Make an out-of-sample forecast
oil_export_model_fit.forecast()
The last five observations from the training dataset are:
2021-04-01 8.902456
2021-05-01 8.896834
2021-06-01 9.132130
2021-07-01 9.101674
2021-08-01 9.182167
Name: log_oil_exports, dtype: float64
The last 5 fitted values are:
2021-04-01 8.957241
2021-05-01 8.890469
2021-06-01 8.925859
2021-07-01 9.113322
2021-08-01 9.098928
Freq: MS, dtype: float64
The first out-of-sample prediction is
2021-09-01 7.708736
Freq: MS, dtype: float64
To tell the truth, I cannot understand how the first out-of-sample prediction is 7.7087 when the preceding five observed and fitted values are all close to 9. Am I missing something obvious here? I would appreciate any help or suggestions.
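(A diagnostic sketch that may help narrow this down; these are standard statsmodels results methods, though whether the 't' trend term is the culprit is only a guess:)
pred = oil_export_model_fit.get_forecast(steps=5)
print(pred.predicted_mean)   # out-of-sample mean forecasts
print(pred.conf_int())       # confidence intervals around them
print(oil_export_model_fit.params)  # inspect the fitted trend coefficient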

Reading daily time-series using pandas and re-sampling to monthly

I am very new to Python. I usually use scikits.timeseries to process time-series data, and now I would like to use pandas (e.g. read_csv) to do the same as the code shown below. I used the read_csv manual to read the file, but I don't know how to convert the daily time series to a monthly time series.
The input is one column of daily data from 2002-01-01 to 2011-12-31, so the length is 3652. The output should be one column of monthly data from 2002-01 to 2011-12, so the length is 120.
import numpy as np
import pandas as pd
import scikits.timeseries as ts
stgSim = ts.time_series(np.loadtxt('examp.txt', delimiter=',', skiprows=1,
                                   usecols=[37]),
                        start_date='2002-01-01',
                        freq='d')
v4 = ts.time_series(np.random.rand(3652), start_date='2002-01-01', freq='d')
startD = stgSim.date_to_index(v4.start_date)
stgSim = stgSim[startD:]
stgSimAnMonth = stgSim.convert(freq='m', func=np.ma.mean)
Are you asking for resample which converts daily data to monthly data?
Say
rng = np.random.RandomState(42)  # set a random seed so that the result is repeatable
ts = pd.Series(data=rng.rand(100),
               index=pd.date_range('2018/01/01', periods=100, freq='D'))
mts = ts.resample('M').mean()  # resample (convert) to monthly data
ts is like
2018-01-01 0.374540
2018-01-02 0.950714
2018-01-03 0.731994
...
2018-04-08 0.427541
2018-04-09 0.025419
2018-04-10 0.107891
Now you should have mts like
2018-01-31 0.444047
2018-02-28 0.498545
2018-03-31 0.477100
2018-04-30 0.450325
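Putting it together with the file from the question, a pandas-only pipeline might look like this (a sketch assuming examp.txt has one header row and the series of interest is in column 37, as in the scikits.timeseries code above):
vals = pd.read_csv('examp.txt', usecols=[37]).iloc[:, 0]  # read the daily column
stgSim = pd.Series(vals.values,
                   index=pd.date_range('2002-01-01', periods=len(vals), freq='D'))
monthly = stgSim.resample('M').mean()  # 120 monthly means for 2002-01 .. 2011-12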

Interpolate (upsample) non-equispaced time series into equispaced with pandas 0.18.0rc1

I want to interpolate (upsample) a non-equispaced time series to obtain an equispaced time series.
Currently I am doing it in the following way:
1. Take the original time series.
2. Create a new time series with NaN values at each 30-second interval (using resample('30S').asfreq()).
3. Concat the original time series and the new time series.
4. Sort the time series to restore the order of times (this I do not like; sorting has O(n log n) complexity).
5. Interpolate.
6. Remove the original points from the time series.
Is there a simpler way with pandas 0.18.0rc1? In MATLAB, for example, you take the original time series and pass the new times as a parameter to the interpolate() function to receive values at the desired times.
Note that the times of the original time series might not be a subset of the times of the desired time series.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
values = [271238, 329285, 50, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:04',
                             '2015-01-04 08:37:05',
                             '2015-01-04 08:41:07',
                             '2015-01-04 08:43:05',
                             '2015-01-04 08:49:05'])
ts = pd.Series(values, index=timestamps)
ts
ts[ts == -1] = np.nan
newFreq = ts.resample('60S').asfreq()
new = pd.concat([ts, newFreq]).sort_index()
new = new.interpolate(method='time')
ts.plot(marker='o')
new.plot(marker='+', markersize=15)
new[newFreq.index].plot(marker='.')
lines, labels = plt.gca().get_legend_handles_labels()
labels = ['original values (non-equispaced)',
          'original + interpolated at new frequency (non-equispaced)',
          'interpolated values without original values (equispaced!)']
plt.legend(lines, labels, loc='best')
plt.show()
There have been several requests for a simpler way to interpolate at desired values (I'll edit in links later, but search the issue tracker for interpolate issues). So in the future there will be an easier way.
For now you can write the option a bit more cleanly as
In [9]: (ts.reindex(ts.index | newFreq.index)
.interpolate(method='time')
.loc[newFreq.index])
Out[9]:
2015-01-04 08:29:00 NaN
2015-01-04 08:30:00 277996.070686
2015-01-04 08:31:00 285236.860707
2015-01-04 08:32:00 292477.650728
2015-01-04 08:33:00 299718.440748
...
2015-01-04 08:45:00 261362.402778
2015-01-04 08:46:00 261937.569444
2015-01-04 08:47:00 262512.736111
2015-01-04 08:48:00 263087.902778
2015-01-04 08:49:00 263663.069444
Freq: 60S, dtype: float64
This still involves all the steps you listed above, but unioning the indexes is cleaner than concatenating and dropping.
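On more recent pandas versions, the operator union can be written with the explicit Index.union method instead; a sketch under that assumption:
new_times = ts.index.union(newFreq.index)  # same union, method form
interp = (ts.reindex(new_times)
            .interpolate(method='time')
            .loc[newFreq.index])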

Upsampling to weekly data in pandas

I'm running into problems when taking a lower-frequency time series in pandas, such as monthly or quarterly data, and upsampling it to weekly frequency. For example,
import numpy as np
from pandas import Series, DatetimeIndex, date_range
data = np.arange(3, dtype=np.float64)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s.resample('W-SUN')
results in a series filled with NaN everywhere. Basically the same thing happens if I do:
s.reindex(DatetimeIndex(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN'))
If s were indexed with a PeriodIndex instead I would get an error: ValueError: Frequency M cannot be resampled to <1 Week: kwds={'weekday': 6}, weekday=6>
I can understand why this might happen, as the weekly dates don't exactly align with the monthly dates and weeks can overlap months. However, I would like to implement some simple rules to handle this anyway. In particular:
(1) set the last week ending in the month to the monthly value,
(2) set the first week ending in the month to the monthly value, or
(3) set all the weeks ending in the month to the monthly value.
What might be an approach to accomplish that? I can imagine wanting to extend this to bi-weekly data as well.
EDIT: An example of what I would ideally like the output of case (1) to be would be:
2012-01-01 NaN
2012-01-08 NaN
2012-01-15 NaN
2012-01-22 NaN
2012-01-29 0
2012-02-05 NaN
2012-02-12 NaN
2012-02-19 NaN
2012-02-26 1
2012-03-04 NaN
2012-03-11 NaN
2012-03-18 NaN
2012-03-25 2
I made a github issue regarding your question. Need to add the relevant feature to pandas.
Case 3 is achievable directly via fill_method:
In [25]: s
Out[25]:
2012-01-31 0
2012-02-29 1
2012-03-31 2
Freq: M
In [26]: s.resample('W', fill_method='ffill')
Out[26]:
2012-02-05 0
2012-02-12 0
2012-02-19 0
2012-02-26 0
2012-03-04 1
2012-03-11 1
2012-03-18 1
2012-03-25 1
2012-04-01 2
Freq: W-SUN
But for others you'll have to do some contorting right now that will hopefully be remedied by the github issue before the next release.
Also it looks like you want the upcoming 'span' resampling convention as well that will upsample from the start of the first period to the end of the last period. I'm not sure there is an easy way to anchor the start/end points for a DatetimeIndex but it should at least be there for PeriodIndex.
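For case (1), here is one approach that should work on current pandas (a sketch, not taken from the answer above): build the weekly index, find the last week ending inside each month, and assign the monthly values there.
import numpy as np
import pandas as pd
s = pd.Series(np.arange(3, dtype=np.float64),
              index=pd.date_range('2012-01-01', periods=3, freq='M'))
weeks = pd.date_range('2012-01-01', '2012-03-25', freq='W-SUN')  # mirrors the ideal output above
out = pd.Series(np.nan, index=weeks)
# last weekly date falling inside each calendar month
last_week = out.index.to_series().groupby(out.index.to_period('M')).max()
# place each monthly value on that month's last weekly date
out.loc[last_week.reindex(s.index.to_period('M')).values] = s.values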
