I am training an LSTM to predict request counts. The input data is split by day, like so:
import datetime as dt
import numpy as np
import pandas as pd
start = dt.date(day=1, month=1, year=2022)
end = dt.date(day=8, month=1, year=2022)
training_data = pd.DataFrame([0, 0, 1, 0, 0, 2, 3], index=np.arange(start, end))
training_data
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
2022-01-06 2
2022-01-07 3
However, for my use case, I only need to predict weekly number of requests, not daily. So in this case, I would only need to predict 6 requests for the week starting on 2022-01-01.
I understand that I could resample the training data to also be on a weekly basis. But would it be possible to train on data split up by day and predict on weeks? My hypothesis is that the daily timesteps might help the model learn, but my worry is that mixing timescales might interfere with training the LSTM. I am also not sure exactly how normalization would work in that case.
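For concreteness, here is a minimal sketch of the setup I have in mind: daily values as the input sequence, with the target being the sum over the following week (the 7-day window and the summing are my assumptions, not settled choices):

# Stand-in daily history (my toy frame above is too short to window).
rng = np.random.default_rng(0)
daily = rng.poisson(2.0, size=70).astype(float)

window = 7  # one week of daily timesteps per sample (my assumption)
X, y = [], []
for i in range(len(daily) - 2 * window + 1):
    X.append(daily[i : i + window])                      # one week of daily inputs
    y.append(daily[i + window : i + 2 * window].sum())   # next week's request total

X = np.array(X)[..., np.newaxis]  # (samples, timesteps, features), as an LSTM expects
y = np.array(y)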
I am developing an ARIMA model to forecast oil exports. Please see the code below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
import stats_can
# Get and preprocess data
df = stats_can.sc.vectors_to_df("v1001826820", periods=200)
df.rename(columns={"v1001826820": "oil_exports"}, inplace=True)
df.index.rename('date', inplace=True)
df["log_oil_exports"] = df.apply(np.log)
# Plot logarithm of oil exports
df.log_oil_exports.plot.line()
# Run ADF test on levels
adfuller(df.log_oil_exports, regression='ct', autolag='AIC')
# Plot first difference of log oil exports
df.log_oil_exports.diff().dropna().plot.line()
# Run ADF test on first difference
adfuller(df.log_oil_exports.diff().dropna(), regression='n', autolag='AIC')
# Fit ARIMA(1,1,2) model
oil_export_model = ARIMA(endog=df.log_oil_exports, order=(1, 1, 2), trend='t')
oil_export_model_fit = oil_export_model.fit()
oil_export_model_fit.summary()
# Print last 5 observations from the training data
df.log_oil_exports.tail()
# Print last 5 fitted values
oil_export_model_fit.predict()[-5:]
# Make an out-of-sample forecast
oil_export_model_fit.forecast()
The last five observations from the training dataset are:
2021-04-01 8.902456
2021-05-01 8.896834
2021-06-01 9.132130
2021-07-01 9.101674
2021-08-01 9.182167
Name: log_oil_exports, dtype: float64
The last 5 fitted values are:
2021-04-01 8.957241
2021-05-01 8.890469
2021-06-01 8.925859
2021-07-01 9.113322
2021-08-01 9.098928
Freq: MS, dtype: float64
The first out-of-sample prediction is:
2021-09-01 7.708736
Freq: MS, dtype: float64
To tell the truth, I cannot understand how the first out-of-sample prediction is 7.7087 when the preceding five observed and fitted values are all close to 9. Am I missing something obvious here? I would appreciate any help/suggestions.
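For reference, the forecast can also be inspected together with its confidence interval (a sketch using statsmodels' get_forecast; the 5-step horizon is arbitrary):

# Point forecasts on the log scale plus 95% confidence intervals.
forecast = oil_export_model_fit.get_forecast(steps=5)
print(forecast.predicted_mean)
print(forecast.conf_int())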
I have two series: monthly CPI (consumer price inflation) data and the daily close price of the EUR/USD exchange rate. The issue I am having is converting the monthly CPI data into a daily series so that I can combine the two into one dataframe. As I am using this for a regression machine learning (ML) task, namely XGBoost, I am unsure about the correct way to do it. I've read about a couple of methods: interpolation (e.g. Chow-Lin or cubic splines), using a Kalman filter, manipulating the datetime index so the monthly values sit on a daily index full of NaNs and then using ffill() to fill them in, and training separate models (which I don't want to do).
I really don't know the correct procedure, especially since temporal disaggregation has a major bearing on model accuracy. Here is the code:
from datetime import datetime
import pandas as pd
import pandas_datareader.data as pdr
import yfinance as yf
eurusd = yf.download("EURUSD=X", start=datetime(2000, 1, 1), end=datetime(2021, 7, 21))["Close"] # daily
cpi = pdr.FredReader("CPALTT01USM657N", start=datetime(2000, 1, 1), end=datetime(2021, 7, 21)).read() # monthly
The output of the data is as follows:
(N.B.: I will eventually slice the dataframe to keep data from around 2008 onwards, for the sake of feature engineering and to give the ML model more recent data.)
Date
2003-12-01 1.196501
2003-12-02 1.208897
2003-12-03 1.212298
2003-12-04 1.208094
2003-12-05 1.218695
...
2021-07-15 1.183334
2021-07-16 1.181181
2021-07-19 1.181401
2021-07-20 1.179384
2021-07-21 1.178411
Name: Close, Length: 4554, dtype: float64
CPALTT01USM657N
DATE
2000-01-01 0.297089
2000-02-01 0.592417
2000-03-01 0.824499
2000-04-01 0.058411
2000-05-01 0.116754
... ...
2021-01-01 0.425378
2021-02-01 0.547438
2021-03-01 0.708327
2021-04-01 0.821891
2021-05-01 0.801711
[257 rows x 1 columns]
Really appreciate all the help that I can get! Many thanks.
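To make the ffill variant above concrete, here is a sketch of the reindex-and-forward-fill approach, which simply holds each monthly CPI value constant across the days of its month (whether that is an acceptable disaggregation for the model is exactly what I am unsure about):

# Put the monthly CPI on the daily EUR/USD index, forward-filling within each month.
cpi_daily = cpi.reindex(eurusd.index, method='ffill')
df = pd.concat([eurusd, cpi_daily], axis=1).dropna()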
Typically when we have a data frame we split it into train and test. For example, imagine my data frame is something like this:
> df.head()
Date y wind temperature
1 2019-10-03 00:00:00 33 12 15
2 2019-10-03 01:00:00 10 5 6
3 2019-10-03 02:00:00 39 6 5
4 2019-10-03 03:00:00 60 13 4
5 2019-10-03 04:00:00 21 3 7
I want to predict y based on the wind and temperature. We then do a split something like this:
df = df.set_index('Date')  # make the timestamp the index so the date comparison works
df_train = df.loc[df.index <= split_date].copy()  # split_date: a chosen cut-off timestamp
df_test = df.loc[df.index > split_date].copy()
X1 = df_train[['wind', 'temperature']]
y1 = df_train['y']
X2 = df_test[['wind', 'temperature']]
y2 = df_test['y']
X_train, y_train = X1, y1
X_test, y_test = X2, y2
model.fit(X_train, y_train)  # model: any regressor, e.g. XGBoost
We then predict on our test data. However, this uses the wind and temperature features from the test data frame. If I want to predict tomorrow's (unknown) y without knowing tomorrow's hourly temperature and wind, does the method no longer work? (For LSTM or XGBoost, for example.)
The way you train your model, each row is treated as an independent sample, regardless of order, i.e. of which values were observed earlier or later. If you have reason to believe that the chronological order is relevant to predicting y from wind speed and temperature, you will need to change your model.
You could, for example, add another column with the wind speed and temperature values from one hour earlier (shift them by one row), or, if you believe that y might depend on the weekday, compute the weekday from the date and add it as an input feature, as shown in the sketch below.
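A minimal sketch of both ideas (assuming hourly data and the column names from the example above; the Date column was set as the index in the split code):

# Lagged weather features: the values observed one hour earlier.
df['wind_lag1'] = df['wind'].shift(1)
df['temperature_lag1'] = df['temperature'].shift(1)

# Calendar feature: day of the week (0 = Monday) from the timestamp index.
df['weekday'] = df.index.weekday

df = df.dropna()  # the first row has no lagged values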
I have a dataset with 84 Monthly Sales (from 01/2013 to 12/2019) - just months, not days.
Month 01 | Sale 1
Month 02 | Sale 2
Month 03 | Sale 3
.... | ...
Month 84 | Sale 84
From the visualization it looks like the model fits very well, but I need to check it properly.
What I understood is that Prophet's cross-validation does not support months, so what I did was convert everything to days (although there is no day information in my original df).
I wanted to fit my model on the first five years (60 months) and leave the two remaining years (24 months) to see how well the model predicts.
So I did something like:
from prophet.diagnostics import cross_validation

cv_results = cross_validation(model=prophet, initial='1825 days', period='30 days', horizon='60 days')
Does this make sense?
I have not understood the concept of cutoff dates and forecast periods.
I struggled with this for a while as well, but here is how it works. The initial model will be trained on the first 1,825 days of data. It will forecast the next 60 days of data (because horizon is set to 60). The model will then be trained on the initial period plus the period (1,825 + 30 days in this case) and forecast the next 60 days. It will continue like this, adding another 30 days to the training data and then forecasting the next 60, until there is no longer enough data to do this.
In summary, period is how much data to add to the training data set in every iteration of cross-validation, and horizon is how far out it will forecast.
Here is the setup:
There are 700 days of data in total
Initial is 365 days
The period is 10 days
The horizon is 20 days
On the 1st iteration, it will train on days 1-365 and will forecast on days 366 to 385.
On the 2nd iteration, it will train on days 1-375 (the training window expands by the period) and will forecast on days 376 to 395, etc.
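Translated into code, the setup above looks something like this (a sketch; m stands for a Prophet model fitted on a dataframe with Prophet's usual ds/y columns):

from prophet import Prophet
from prophet.diagnostics import cross_validation

m = Prophet()
m.fit(df)  # df has columns 'ds' (date) and 'y' (value), per Prophet's convention

# 700 days of data: train on 365 days first, add 10 days per fold,
# and forecast 20 days ahead at each cutoff.
cv_results = cross_validation(m, initial='365 days', period='10 days', horizon='20 days')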
I am very new to Python. I usually use scikits.timeseries to process time-series data. Now I would like to use pandas (e.g. read_csv) to do the same as the code shown below. I used the read_csv documentation to read the file, but I don't know how to convert the daily time series to a monthly time series.
The input is one column of daily data from 2002-01-01 to 2011-12-31, so the length is 3652. The output should be one column of monthly data from 2002-01 to 2011-12, so the length is 120.
import numpy as np
import pandas as pd
import scikits.timeseries as ts

stgSim = ts.time_series(np.loadtxt('examp.txt', delimiter=',', skiprows=1, usecols=[37]),
                        start_date='2002-01-01', freq='d')
v4 = ts.time_series(np.random.rand(3652), start_date='2002-01-01', freq='d')
startD = stgSim.date_to_index(v4.start_date)
stgSim = stgSim[startD:]
stgSimAnMonth = stgSim.convert(freq='m', func=np.ma.mean)
Are you asking for resample, which converts daily data to monthly data?
Say

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)  # set a random seed so the result is repeatable
ts = pd.Series(data=rng.rand(100),
               index=pd.date_range('2018/01/01', periods=100, freq='D'))
mts = ts.resample('M').mean()  # resample (convert) the daily series to monthly means
ts is like
2018-01-01 0.374540
2018-01-02 0.950714
2018-01-03 0.731994
...
2018-04-08 0.427541
2018-04-09 0.025419
2018-04-10 0.107891
Now you should have mts like
2018-01-31 0.444047
2018-02-28 0.498545
2018-03-31 0.477100
2018-04-30 0.450325