I am developing an ARIMA model to forecast oil exports. Please see the code below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
import stats_can
# Get and preprocess data
df = stats_can.sc.vectors_to_df("v1001826820", periods=200)
df.rename(columns={"v1001826820": "oil_exports"}, inplace=True)
df.index.rename('date', inplace=True)
df["log_oil_exports"] = np.log(df["oil_exports"])
# Plot logarithm of oil exports
df.log_oil_exports.plot.line()
# Run ADF test on levels
adfuller(df.log_oil_exports, regression='ct', autolag='AIC')
# Plot first difference of log oil exports
df.log_oil_exports.diff().dropna().plot.line()
# Run ADF test on first difference
adfuller(df.log_oil_exports.diff().dropna(), regression='nc', autolag='AIC')
# Fit ARIMA(1,1,2) model
oil_export_model = ARIMA(endog=df.log_oil_exports, order=(1, 1, 2), trend='t')
oil_export_model_fit = oil_export_model.fit()
oil_export_model_fit.summary()
# Print last 5 observations from the training data
df.log_oil_exports.tail()
# Print last 5 fitted values
oil_export_model_fit.predict()[-5:]
# Make an out-of-sample forecast
oil_export_model_fit.forecast()
The last five observations from the training dataset are:
2021-04-01 8.902456
2021-05-01 8.896834
2021-06-01 9.132130
2021-07-01 9.101674
2021-08-01 9.182167
Name: log_oil_exports, dtype: float64
The last 5 fitted values are:
2021-04-01 8.957241
2021-05-01 8.890469
2021-06-01 8.925859
2021-07-01 9.113322
2021-08-01 9.098928
Freq: MS, dtype: float64
The first out-of-sample prediction is
2021-09-01 7.708736
Freq: MS, dtype: float64
To tell the truth, I cannot understand how the first out-of-sample prediction can be 7.7087 when the last five observed and fitted values are all close to 9. Am I missing something obvious here? I would appreciate any help or suggestions.
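To put the gap in perspective, the logged values can be converted back to a ratio of levels. This is a minimal sketch that uses only the numbers printed above; the variable names are illustrative and not part of the original code.

```python
import numpy as np

last_observed_log = 9.182167   # 2021-08-01, last training observation above
first_forecast_log = 7.708736  # 2021-09-01, value returned by forecast() above

# A difference of logs is the log of the ratio of the underlying levels
ratio = np.exp(first_forecast_log - last_observed_log)
pct_change = (ratio - 1) * 100

print(f"forecast level / last observed level = {ratio:.3f}")
print(f"implied one-month change = {pct_change:.1f}%")
```

In levels, the forecast is less than a quarter of the last observed value, which is far outside the month-to-month movement in the five observations shown.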
I am training an LSTM to predict requests. The input data is split by day, like so:
import datetime as dt
import numpy as np
import pandas as pd
start = dt.date(day=1, month=1, year=2022)
end = dt.date(day=8, month=1, year=2022)
training_data = pd.DataFrame([0, 0, 1, 0, 0, 2, 3], index=np.arange(start, end))
training_data
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
2022-01-06 2
2022-01-07 3
However, for my use case, I only need to predict weekly number of requests, not daily. So in this case, I would only need to predict 6 requests for the week starting on 2022-01-01.
I understand that I could resample the training data to also be on a weekly basis. But would it be possible to train on data split up by day and predict on weeks? My hypothesis is that the daily timesteps might help the model training, but my worry is that the alternate timescale here might mess up the training of the LSTM. I also am not sure exactly how normalization would work.
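For reference, the weekly-resampling route mentioned above can be sketched in pandas. This covers only the aggregation step, not the LSTM itself; the column name `requests` and the 7-day bins anchored at the first date are assumptions made for the illustration.

```python
import pandas as pd

# Daily request counts from the example above
index = pd.date_range("2022-01-01", periods=7, freq="D")
training_data = pd.DataFrame({"requests": [0, 0, 1, 0, 0, 2, 3]}, index=index)

# Sum the daily counts into 7-day bins that start on the first date
weekly = training_data["requests"].resample("7D").sum()
print(weekly)
```

The single weekly bin holds the 6 requests for the week starting on 2022-01-01. One possible way to keep the daily detail is to feed each 7-day window to the model as an input sequence and use this weekly sum as its target.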
I have two series: monthly CPI (consumer price inflation) data and the daily close price of the EUR/USD exchange rate. My problem is converting the monthly CPI data to daily frequency so that I can combine the two series in one dataframe. I am using this for a regression machine learning (ML) task with XGBoost, and I am unsure about the correct way to do it. I have read about several approaches: interpolation (e.g. Chow-Lin or a cubic spline), a Kalman filter, reindexing to a daily datetime index and filling the resulting NaN values with ffill(), and training separate models (which I don't want to do).
I really don't know the correct procedure, especially since temporal disaggregation is a major concern for model accuracy. Here is the code:
from datetime import datetime
import pandas as pd
import pandas_datareader.data as pdr
import yfinance as yf
eurusd = yf.download("EURUSD=X", start=datetime(2000, 1, 1), end=datetime(2021, 7, 21))["Close"] # daily
cpi = pdr.FredReader("CPALTT01USM657N", start=datetime(2000, 1, 1), end=datetime(2021, 7, 21)).read() # monthly
The output of the data is as follows:
(N.B: I will eventually slice the dataframe and have data from around 2008 onwards for the sake of feature engineering and for more recent data for the ML model)
Date
2003-12-01 1.196501
2003-12-02 1.208897
2003-12-03 1.212298
2003-12-04 1.208094
2003-12-05 1.218695
...
2021-07-15 1.183334
2021-07-16 1.181181
2021-07-19 1.181401
2021-07-20 1.179384
2021-07-21 1.178411
Name: Close, Length: 4554, dtype: float64
CPALTT01USM657N
DATE
2000-01-01 0.297089
2000-02-01 0.592417
2000-03-01 0.824499
2000-04-01 0.058411
2000-05-01 0.116754
... ...
2021-01-01 0.425378
2021-02-01 0.547438
2021-03-01 0.708327
2021-04-01 0.821891
2021-05-01 0.801711
[257 rows x 1 columns]
Really appreciate all the help that I can get! Many thanks.
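Of the options listed in the question, the forward-fill route is the simplest to sketch. The example below is a minimal illustration, not a recommendation of the "correct" method: the CPI values are copied from the output above, while the daily series is a dummy placeholder on business days (not real prices) so that the snippet runs without a download.

```python
import pandas as pd

# Monthly CPI values copied from the output above
cpi = pd.Series(
    [0.708327, 0.821891, 0.801711],
    index=pd.to_datetime(["2021-03-01", "2021-04-01", "2021-05-01"]),
    name="cpi",
)

# Placeholder daily series on business days (dummy values, not real prices)
days = pd.bdate_range("2021-03-01", "2021-05-31")
eurusd = pd.Series(1.0, index=days, name="eurusd")

# Forward-fill: each day takes the most recent monthly value dated on or before it
cpi_daily = cpi.reindex(days, method="ffill")
combined = pd.concat([eurusd, cpi_daily], axis=1)
print(combined.loc["2021-04-29":"2021-05-04"])
```

Forward-filling never uses a value dated later than the daily row. Because a month's CPI figure is usually published some weeks after the month it describes, it may also be necessary to shift the monthly index by the release lag before filling, to avoid look-ahead in the ML features.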
real = MAVP(close, periods, minperiod=2, maxperiod=30, matype=0)
I am trying to use this method, but it raises an error because of the periods parameter.
How can I use this method with a DataFrame like this?
The periods parameter must be passed as an array of the periods you want. I downloaded the Apple stock price from Yahoo Finance and computed a moving average over the whole date range I downloaded.
import yfinance as yf
data = yf.download("AAPL", start="2020-01-01", end="2021-01-01")
data.head()
Open High Low Close Adj Close Volume
Date
2020-01-02 74.059998 75.150002 73.797501 75.087502 74.333511 135480400
2020-01-03 74.287498 75.144997 74.125000 74.357498 73.610840 146322800
2020-01-06 73.447502 74.989998 73.187500 74.949997 74.197395 118387200
2020-01-07 74.959999 75.224998 74.370003 74.597504 73.848442 108872000
2020-01-08 74.290001 76.110001 74.290001 75.797501 75.036385 132079200
import talib as ta
import numpy as np
data.reset_index(drop=False,inplace=True)
periods = data.Date
real = ta.MAVP(data.Close, periods, minperiod=2, maxperiod=30, matype=0)
real
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
248 131.465004
249 134.330002
250 135.779999
251 134.294998
252 133.205002
Length: 253, dtype: float64
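To make the role of the periods array concrete, here is a pure-NumPy sketch of a moving average whose window length is given per row. It is not TA-Lib's implementation: the function name `variable_period_sma`, the clamping of each period into `[minperiod, maxperiod]`, and the warm-up of `maxperiod - 1` NaN rows are assumptions based on the parameter names and on the leading NaN values in the output above.

```python
import numpy as np

def variable_period_sma(close, periods, minperiod=2, maxperiod=30):
    """Simple moving average whose window length can differ per row (illustrative)."""
    close = np.asarray(close, dtype=float)
    # Clamp every requested period into [minperiod, maxperiod]
    periods = np.clip(np.asarray(periods, dtype=float), minperiod, maxperiod).astype(int)
    out = np.full(close.shape, np.nan)
    # Start once maxperiod observations are available, so every window fits
    for i in range(maxperiod - 1, len(close)):
        p = periods[i]
        out[i] = close[i - p + 1 : i + 1].mean()
    return out

close = np.arange(40, dtype=float)             # dummy prices 0, 1, ..., 39
periods = np.where(np.arange(40) < 35, 2, 5)   # 2-row windows, then 5-row windows
real = variable_period_sma(close, periods)
print(real[28:])
```

The point of the sketch is that `periods` is a numeric array with one window length per price, the same length as `close`.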
Typically when we have a data frame we split it into train and test. For example, imagine my data frame is something like this:
> df.head()
Date y wind temperature
1 2019-10-03 00:00:00 33 12 15
2 2019-10-03 01:00:00 10 5 6
3 2019-10-03 02:00:00 39 6 5
4 2019-10-03 03:00:00 60 13 4
5 2019-10-03 04:00:00 21 3 7
I want to predict y based on the wind and temperature. We then do a split something like this:
# Chronological split: rows up to split_date are used for training
df_train = df.loc[df.index <= split_date].copy()
df_test = df.loc[df.index > split_date].copy()
X_train, y_train = df_train[['wind', 'temperature']], df_train['y']
X_test, y_test = df_test[['wind', 'temperature']], df_test['y']
model.fit(X_train, y_train)
We then predict on the test data. However, this uses the wind and temperature features from the test data frame. If I want to predict tomorrow's (unknown) y without knowing tomorrow's hourly temperature and wind, does the method no longer work (for LSTM or XGBoost, for example)?
The way you train your model, each row is considered an independent sample, regardless of the order, i.e. what values are observed earlier or later. If you have reason to believe that the chronological order is relevant to predicting y from wind speed and temperature you will need to change your model.
You could try, for example, adding another column with the wind speed and temperature values from one hour earlier (shift by one row) or, if you believe that y might depend on the weekday, computing the weekday from the date and adding it as an input feature.
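The two suggestions above (lagged columns and a weekday feature) might look like this in pandas. This is a minimal sketch on the five rows shown in the question; the new column names `wind_lag1`, `temperature_lag1` and `weekday` are illustrative.

```python
import pandas as pd

# First rows of the example frame from the question
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-10-03 00:00:00", "2019-10-03 01:00:00", "2019-10-03 02:00:00",
        "2019-10-03 03:00:00", "2019-10-03 04:00:00",
    ]),
    "y": [33, 10, 39, 60, 21],
    "wind": [12, 5, 6, 13, 3],
    "temperature": [15, 6, 5, 4, 7],
})

# Values observed one hour earlier (shift down by one row)
df["wind_lag1"] = df["wind"].shift(1)
df["temperature_lag1"] = df["temperature"].shift(1)

# Day of the week as an integer, Monday=0
df["weekday"] = df["Date"].dt.dayofweek

# The first row has no previous hour, so drop it before fitting
features = df.dropna()
print(features)
```

Lagged columns only use information that is already known at prediction time, so a model trained on them does not need tomorrow's wind and temperature. For a forecast a full day ahead, the shift would have to be at least as long as the forecast horizon (24 rows for hourly data).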
I am very new to Python. I usually use scikits.timeseries to process time-series data. Now I would like to use pandas (e.g. read_csv) to do the same thing as the code shown below. I followed the read_csv documentation to read the file, but I don't know how to convert the daily time series to a monthly time series.
The input is one column of daily data from 2002-01-01 to 2011-12-31, so its length is 3652. The output should be one column of monthly data from 2002-01 to 2011-12, so its length is 120.
import numpy as np
import pandas as pd
import scikits.timeseries as ts
stgSim = ts.time_series(np.loadtxt('examp.txt', delimiter=',', skiprows=1,
                                   usecols=[37]),
                        start_date='2002-01-01',
                        freq='d')
v4 = ts.time_series(np.random.rand(3652),start_date='2002-01-01',freq='d')
startD = stgSim.date_to_index(v4.start_date)
stgSim = stgSim[startD:]
stgSimAnMonth = stgSim.convert(freq='m',func=np.ma.mean)
Are you looking for resample, which converts daily data to monthly data? For example:
rng = np.random.RandomState(42) # set a random seed so that result is repeatable
ts = pd.Series(data=rng.rand(100),
               index=pd.date_range('2018/01/01', periods=100, freq='D'))
mts = ts.resample('M').mean() # resample (convert) to monthly data
ts looks like
2018-01-01 0.374540
2018-01-02 0.950714
2018-01-03 0.731994
...
2018-04-08 0.427541
2018-04-09 0.025419
2018-04-10 0.107891
mts should then look like
2018-01-31 0.444047
2018-02-28 0.498545
2018-03-31 0.477100
2018-04-30 0.450325
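Applied to the date range in the question, the same idea can be sketched as follows. The random values are a stand-in for the column that would come from read_csv, and grouping by `to_period("M")` is an alternative to `resample` chosen here because it labels each result by month (2002-01, 2002-02, ...) rather than by month-end date.

```python
import numpy as np
import pandas as pd

# Stand-in for the daily column (random values; the real data would come from read_csv)
days = pd.date_range("2002-01-01", "2011-12-31", freq="D")
rng = np.random.RandomState(0)
daily = pd.Series(rng.rand(len(days)), index=days)

# Group the daily values by calendar month and average each group
monthly = daily.groupby(daily.index.to_period("M")).mean()

print(len(daily), len(monthly))
print(monthly.head(3))
```

The lengths match the 3652 daily values and 120 monthly values described in the question.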