SARIMAX out of sample forecast with exogenous data

SARIMAX out of sample forecast with exogenous data - python

I am working on a timeseries analysis with SARIMAX and have been really struggling with it.
I think I have successfully fit a model and used it to make predictions; however, I don't know how to make out of sample forecast with exogenous data.
I may be doing the whole thing wrong so I have included my steps below with some sample data;
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pandas import datetime
import statsmodels.api as sm
# Defining Sample data
df = pd.DataFrame({'date':['2019-01-01','2019-01-02','2019-01-03',
'2019-01-04','2019-01-05','2019-01-06',
'2019-01-07','2019-01-08','2019-01-09',
'2019-01-10','2019-01-11','2019-01-12'],
'price':[78,60,62,64,66,68,70,72,74,76,78,80],
'factor1':[178,287,152,294,155,245,168,276,165,275,178,221]
})
# Changing index to datetime
df['date'] = pd.to_datetime(df['date'], errors='ignore', format='%Y%m%d')
select_dates = df.set_index(['date'])
df = df.set_index('date')
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True)
df.dropna(inplace=True)
# Splitting Data into test and training sets manually
train = df.loc['2019-01-01':'2019-01-09']
test = df.loc['2019-01-10':'2019-01-12']
# setting index to datetime for test and train datasets
train.index = pd.DatetimeIndex(train.index).to_period('D')
test.index = pd.DatetimeIndex(test.index).to_period('D')
# Defining and fitting the model with training data for endogenous and exogenous data
model=sm.tsa.statespace.SARIMAX(train['price'],
order=(0, 0, 0),
seasonal_order=(0, 0, 0,12),
exog=train.iloc[:,1:],
time_varying_regression=True,
mle_regression=False)
model_1= model.fit(disp=False)
# Defining exogenous data for testing
exog_test=test.iloc[:,1:]
# Forecasting out of sample data with exogenous data
forecast = model_1.forecast(3, exog=exog_test)
so my problem is really with the last line, what do I do if I want more than 3 steps?

I would attempt to answer this question as it mainly relates to the type of data and documentation about statsmodels package.
As per the documentation the 'steps' are an integer, the number of steps to forecast from the end of the sample. That also means if you are interested in getting more than three steps you need to provide additional array data for training and TESTING data (note - both).
(https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html)
(https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAXResults.forecast.html)
Here are two errors I get when I increase step size by one:
ValueError: cannot reshape array of size 3 into shape (4,1)
Provided exogenous values are not of the appropriate shape. Required (4, 1), got (3, 1).
ValueError: the number of rows in the exogenous variable does not match the number of time periods you're asking it to predict
With that said simply expanding the testing set works well and gets you additional forecasts here is the code that works and the working notebook link:
https://colab.research.google.com/drive/1o9KXAe61EKH6bDI-FJO3qXzlWjz9IHHw?usp=sharing
import pandas as pd
import numpy as np
# from sklearn.model_selection import train_test_split
# why import this if you want to do tran/test manually?
from pandas import datetime
# Defining Sample data
df=pd.DataFrame({'date':['2019-01-01','2019-01-02','2019-01-03',
'2019-01-04','2019-01-05','2019-01-06',
'2019-01-07','2019-01-08','2019-01-09',
'2019-01-10','2019-01-11','2019-01-12'],
'price':[78,60,62,64,66,68,70,72,74,76,78,80],
'factor1':[178,287,152,294,155,245,168,276,165,275,178,221]
})
# Changing index to datetime
df['date'] = pd.to_datetime(df['date'], errors='ignore', format='%Y%m%d')
select_dates = df.set_index(['date'])
df = df.set_index('date')
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True)
df.dropna(inplace=True)
# Splitting Data into test and training sets manually
train=df.loc['2019-01-01':'2019-01-09']
# I made a change here #CHANGED 10 to 09 so one more month got added
# that means my input array is now 4,1 (if you add a column array is - )
# (4,2)
# I can give any step from -4,0,4 (integral)
test=df.loc['2019-01-09':'2019-01-12']
# setting index to datetime for test and train datasets
train.index = pd.DatetimeIndex(train.index).to_period('D')
test.index = pd.DatetimeIndex(test.index).to_period('D')
# Defining and fitting the model with training data for endogenous and exogenous data
import statsmodels.api as sm
model=sm.tsa.statespace.SARIMAX(train['price'],
order=(0, 0, 0),
seasonal_order=(0, 0, 0,12),
exog=train.iloc[:,1:],
time_varying_regression=True,
mle_regression=False)
model_1= model.fit(disp=False)
# Defining exogenous data for testing
exog_test=test.iloc[:,1:]
# Forcasting out of sample data with exogenous data
forecast = model_1.forecast(4, exog=exog_test)

Related

Why is the mean in my model almost the same for all values?

I am currently in the process of creating a deep learning model to forecast stock prices. I am using AutoGluon since I am relatively new to deep learning/ML.
My pandas dataframe consists of the historical daily close price of about 1800 different stocks. This is about 8 million rows total. Here is how I have preprocessed my data:
import pandas as pd
import matplotlib.pyplot as plt
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor
from datetime import datetime, timedelta, time
import random
import numpy as np
df = pd.read_csv(
"/content/drive/MyDrive/stock_data/training_data_formatted.csv",
parse_dates=["Date"],
)
values_to_remove = ['Date', 'Close', 'ID']
df["Date"] = pd.to_datetime(df["Date"], errors="coerce", dayfirst=True)
df = df[~df['Date'].isin(values_to_remove) & ~df['ID'].isin(values_to_remove) & ~df['Close'].isin(values_to_remove)]
df = df.drop(columns=['Unnamed: 0'])
#add time and microseconds to make a unique index since stocks have overlapping dates.
df['Date'] = df['Date'].apply(lambda x: x.replace(hour=12, minute=0, second=0))
df['Date'] = pd.to_datetime(df['Date'])
microseconds = df.groupby(['Date']).cumcount()
df['Date'] += np.array(microseconds, dtype='m8[us]')
ts_dataframe = TimeSeriesDataFrame.from_data_frame(
df,
id_column="ID", # column that contains unique ID of each time series
timestamp_column="Date",
)
#add frequency so model understands
def forward_fill_missing(ts_dataframe: TimeSeriesDataFrame, freq="D") -> TimeSeriesDataFrame:
original_index = ts_dataframe.index.get_level_values("timestamp")
start = original_index.min()
end = original_index.max()
filled_index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
return ts_dataframe.droplevel("item_id").reindex(filled_index, method="ffill")
ts_dataframe = ts_dataframe.groupby("item_id").apply(forward_fill_missing)
I first trained the model on best_quality and with no hyperparameter adjustment. This trained all the available models on Autogluon and I used a weighted ensemble of the models. However, the mean of my predictions was simply a straight line (I assume this has to do with overfitting). Here is the code:
prediction_length = 30
test_data = ts_dataframe
# last prediction_length timesteps of each time series are excluded, akin to `x[:-48]`
train_data = ts_dataframe.slice_by_timestep(None, -prediction_length)
predictor = TimeSeriesPredictor(
path="/content/drive/MyDrive/TrainedModel",
target="Close",
prediction_length=prediction_length,
eval_metric="MAPE",
)
predictor.fit(
train_data,
presets="best_quality",
num_gpus=1
)
I then decided to only use only DeepAR from AutoGluon with configured hyperparameters. I have tried a different bunch of configurations, however, I don't get a model which gives even a halfway decent forecast. This is the configuration I last did:
predictor.fit(
train_data,
hyperparameters={
"DeepAR": {
"num_layers": 5,
"hidden_size": 100,
"learning_rate": 0.1,
"batch_size": 128,
"num_batches_per_epoch": 80,
"epochs": 300
},
},
num_gpus=1
)
This configuration gives slightly better results than the weighted ensemble, however it's still an almost linear mean which doesn't quite capture the ups and downs in the trend.
I don't know whether this is because of the way I'm preprocessing the data or because I'm doing something extremely wrong in configuring the hyper params.
I should note that to test the model, I am passing it the entire historical data of just one stock.
I'm relatively new to ML, so I apologize if I've made an extremely rookie error, but I'd really appreciate it if someone could guide me as to where I'm going wrong.
If there's something I'd left out, or any more info I could give to make things clearer, please let me know.
Thanks!

Forecasting time series with multiple seasonaliy by using auto_arima(SARIMAX) and Fourier terms

I am trying to forecast a time series in Python by using auto_arima and adding Fourier terms as exogenous features. The data come from kaggle's Store item demand forecasting challenge. It consists of a long format time series for 10 stores and 50 items resulting in 500 time series stacked on top of each other. The specificity of this time series is that it has daily data with weekly and annual seasonalities.
In order to capture these two levels of seasonality I first used TBATS as recommended by Rob J Hyndman in Forecasting with daily data which worked pretty well actually.
I also followed this medium article posted by the creator of TBATS python library who compared it with SARIMAX + Fourier terms (also recommended by Hyndman).
But now, when I tried to use the second approach with pmdarima's auto_arima and Fourier terms as exogenous features, I get unexpected results.
In the following code, I only used the train.csv file that I split into train and test data (last year used for forecasting) and set the maximum order of Fourier terms K = 2.
My problem is that I obtain a smoothed forecast (see Image below) that do not seem to capture the weekly seasonality which is different from the result at the end of this article.
Is there something wrong with my code ?
Complete code :
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Upload the data that consist in a long format time series of multiple TS stacked on top of each other
# There are 10 (stores) * 50 (items) = 500 time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales
# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 2)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])
# Split the time series as well as exogenous features data into train and test splits
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test = y_prime.iloc[(len(y_prime)-365):] # last year for testing
exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]
# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, exogenous=exog_to_train, seasonal=True, m=7)
# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast , index = pd.date_range(start='2017-01-01', end= '2017-12-31'))
# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()
Thanks in advance for your answers !

Here's the answer in case someone's interested.
Thanks again Flavia Giammarino.
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Upload the data that consists long format time series of multiple TS stacked on top of each other
# There are 10 (stores) * 50 (items) time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales
# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 1)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])
# Split the time series as well as exogenous features data into train and test splits
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test = y_prime.iloc[(len(y_prime)-365):] # last year for testing
exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]
# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, D=1, exogenous=exog_to_train, seasonal=True, m=7)
# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast , index = pd.date_range(start='2017-01-01', end= '2017-12-31'))
# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()

any workaround to do forward forecasting for estimating time series in python?

I want to make forward forecasting for monthly times series of air pollution data such as what would be 3~6 months ahead of estimation on air pollution index. I tried scikit-learn models for forecasting and fitting data to the model works fine. But what I wanted to do is making a forward period estimate such as what would be 6 months ahead of the air pollution output index is going to be. In my current attempt, I could able to train the model by using scikit-learn. But I don't know how that forward forecasting can be done in python. To make a forward period estimate, what should I do? Can anyone suggest a possible workaround to do this? Any idea?
my attempt
import pandas as pd
from sklearn.preprocessing StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import BayesianRidge
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url, parse_dates=['dates'])
df.drop(columns=['Unnamed: 0'], inplace=True)
resultsDict={}
predictionsDict={}
split_date ='2017-12-01'
df_training = df.loc[df.index <= split_date]
df_test = df.loc[df.index > split_date]
df_tr = df_training.drop(['pollution_index'],axis=1)
df_te = df_test.drop(['pollution_index'],axis=1)
scaler = StandardScaler()
scaler.fit(df_tr)
X_train = scaler.transform(df_tr)
y_train = df_training['pollution_index']
X_test = scaler.transform(df_te)
y_test = df_test['pollution_index']
X_train_df = pd.DataFrame(X_train,columns=df_tr.columns)
X_test_df = pd.DataFrame(X_test,columns=df_te.columns)
reg = linear_model.BayesianRidge()
reg.fit(X_train, y_train)
yhat = reg.predict(X_test)
resultsDict['BayesianRidge'] = accuracy_score(df_test['pollution_index'], yhat)
new update 2
this is my attempt using ARMA model
from statsmodels.tsa.arima_model import ARIMA
index = len(df_training)
yhat = list()
for t in tqdm(range(len(df_test['pollution_index']))):
temp_train = df[:len(df_training)+t]
model = ARMA(temp_train['pollution_index'], order=(1, 1))
model_fit = model.fit(disp=False)
predictions = model_fit.predict(start=len(temp_train), end=len(temp_train), dynamic=False)
yhat = yhat + [predictions]
yhat = pd.concat(yhat)
resultsDict['ARMA'] = evaluate(df_test['pollution_index'], yhat.values)
but this can't help me to make forward forecasting of estimating my time series data. what I want to do is, what would be 3~6 months ahead of estimated values of pollution_index. Can anyone suggest me a possible workaround to do this? How to overcome the limitation of my current attempt? What should I do? Can anyone suggest me a better way of doing this? Any thoughts?
update: goal
for the clarification, I am not expecting which model or approach works best, but what I am trying to figure it out is, how to make reliable forward forecasting for given time series (pollution index), how should I correct my current attempt if it is not efficient and not ready to do forward period estimation. Can anyone suggest any possible way to do this?
update-desired output
here is my sketch desired forecasting plot that I want to make:

In order to obtain your desired output, I think you need to use a model that can return the standard deviation in the predicted value. Therefore, I adopt Gaussian process regression. From the code you provided in your post, I don't see how this is a time series forecasting task, so in my solution below, I also treat this task as a usual regression task.
First, prepare the data
import pandas
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# drop the date column
df_training.drop(columns=['date'],axis=1,inplace=True)
df_test.drop(columns=['date'],axis=1,inplace=True)
y_train = df_training['pollution_index']
y_test = df_test['pollution_index']
df_training.drop(['pollution_index'],axis=1)
df_test.drop(['pollution_index'],axis=1)
scaler = StandardScaler()
scaler.fit(df_training)
X_train = scaler.transform(df_training)
X_test = scaler.transform(df_test)
X_train_df = pd.DataFrame(X_train,columns=df_training.columns)
X_test_df = pd.DataFrame(X_test,columns=df_test.columns)
with the dataframes prepared above, you can train a GaussianProcessRegressor and make predictions by
gpr = GaussianProcessRegressor(normalize_y=True).fit(X_train_df,y_train)
pred,std = gpr.predict(X_test_df,return_std=True)
in which std is an array of standard deviations in the predicted values. Then, you can plot the data by
import numpy as np
from matplotlib import pyplot as plt
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 225
# plot the training data
ax.plot(y_train.index[plot_start:],y_train.values[plot_start:],'navy',marker='o',label='observed')
# plot the test data
ax.plot(y_test.index,y_test.values,'navy',marker='o')
ax.plot(y_test.index,pred,'darkgreen',marker='o',label='pred')
sigma = np.sqrt(std)
ax.fill(np.concatenate([y_test.index,y_test.index[::-1]]),
np.concatenate([pred-1.960*sigma,(pred+1.9600*sigma)[::-1]]),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
ax.legend(loc='upper left',prop={'size':16})
the output plot looks like
UPDATE
I thought pollution_index is something that can be predicted by 'dew', 'temp', 'press', 'wnd_spd', 'rain'. If you want a one-step ahead forecasting, here is what you can do
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_model import ARIMA
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# extract the relevant info
train_date,train_polltidx = df_training['date'].values,df_training['pollution_index'].values
test_date,test_polltidx = df_test['date'].values,df_test['pollution_index'].values
# train an ARIMA model
model = ARIMA(train_polltidx,order=(1,1,1))
model_fit = model.fit(disp=0)
# you can predict as many as you want, here I only predict len(test_dat.index) days
forecast,stderr,conf = model_fit.forecast(len(test_date))
# plot the result
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 225
# plot the training data
plt.plot(train_date[plot_start:],train_polltidx[plot_start:],'navy',marker='o',label='observed')
# plot the test data
plt.plot(test_date,test_polltidx,'navy',marker='o')
plt.plot(test_date,forecast,'darkgreen',marker='o',label='pred')
# ax.errorbar(np.arange(len(pred)),pred,std,fmt='r')
plt.fill(np.concatenate([test_date,test_date[::-1]]),
np.concatenate((conf[:,0],conf[:,1][::-1])),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.set_xlim([df_training['date'].values[plot_start],df_test['date'].values[-1]])
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
The output figure is
Clearly, the prediction is very bad, because I haven't done any preprocessing to the training data.
UPDATE 2
Since I'm not familiar with ARIMA, I implement one-step forecasting using GaussianProcessRegressor with the help of this wonderful post.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.preprocessing import StandardScaler
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# extract the relevant info
train_date,train_polltidx = df_training['date'].values,df_training['pollution_index'].values[:,None]
test_date,test_polltidx = df_test['date'].values,df_test['pollution_index'].values[:,None]
# preprocessing
scalar = StandardScaler()
scalar.fit(train_polltidx)
train_polltidx = scalar.transform(train_polltidx)
test_polltidx = scalar.transform(test_polltidx)
def series_to_supervised(data,n_in,n_out):
df = pd.DataFrame(data)
cols = list()
for i in range(n_in,0,-1): cols.append(df.shift(i))
for i in range(0, n_out): cols.append(df.shift(-i))
agg = pd.concat(cols,axis=1)
agg.dropna(inplace=True)
return agg.values
months_look_back = 1
# train
pollt_series = series_to_supervised(train_polltidx,months_look_back,1)
x_train,y_train = pollt_series[:,:months_look_back],pollt_series[:,-1]
# test
pollt_series = series_to_supervised(test_polltidx,months_look_back,1)
x_test,y_test = pollt_series[:,:months_look_back],pollt_series[:,-1]
print("The first %i months in the test set won't be predicted." % months_look_back)
def walk_forward_validation(x_train,y_train,x_test,y_test):
predictions = []
history_x = x_train.tolist()
history_y = y_train.tolist()
for rep,target in zip(x_test,y_test):
# train model
gpr = GaussianProcessRegressor(alpha=1e-4,normalize_y=False).fit(history_x,history_y)
pred,std = gpr.predict([rep],return_std=True)
predictions.append([pred,std])
history_x.append(rep)
history_y.append(target)
return predictions
predictions = walk_forward_validation(x_train,y_train,x_test,y_test)
pred_test,pred_std = zip(*predictions)
# put back
pred_test = scalar.inverse_transform(pred_test)
pred_std = scalar.inverse_transform(pred_std)
train_polltidx = scalar.inverse_transform(train_polltidx)
test_polltidx = scalar.inverse_transform(test_polltidx)
# plot the result
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 100
# plot the training data
plt.plot(train_date[plot_start:],train_polltidx[plot_start:],'navy',marker='o',label='observed')
# plot the test data
plt.plot(test_date[months_look_back:],test_polltidx[months_look_back:],'navy',marker='o')
plt.plot(test_date[months_look_back:],pred_test,'darkgreen',marker='o',label='pred')
sigma = np.sqrt(pred_std)
ax.fill(np.concatenate([test_date[months_look_back:],test_date[months_look_back:][::-1]]),
np.concatenate([pred_test-1.960*sigma,(pred_test+1.9600*sigma)[::-1]]),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.set_xlim([df_training['date'].values[plot_start],df_test['date'].values[-1]])
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
The idea of this script is to cast the time series forecasting task into a supervised regression task. The plot_start is a parameter that controls from which year we want to plot, clearly plot_start cannot be greater than the length of the training data. The output figure of the script is
as you can see, the first month in the test dataset is not predicted, because we need to look back one month to make a prediction.
In order to further make predictions about unseen data, based on this post on CV site, you can train a new model using the predicted value from the last step, therefore, here is how you can do it
unseen_dates = pd.date_range(test_date[-1],periods=180,freq='D').values
all_data = series_to_supervised(df['pollution_index'].values,months_look_back,months_to_predict)
def predict_unseen(unseen_dates,all_data,days_look_back):
predictions = []
history_x = all_data[:,:days_look_back].tolist()
history_y = all_data[:,-1].tolist()
inds = np.arange(unseen_dates.shape[0])
for ind in inds:
# train model
gpr = GaussianProcessRegressor(alpha=1e-2,normalize_y=False).fit(history_x,history_y)
rep = np.array(history_y[-days_look_back:]).reshape(days_look_back,1)
pred,std = gpr.predict(rep,return_std=True)
predictions.append([pred,std])
history_x.append(history_y[-days_look_back:])
history_y.append(pred)
return predictions
predictions = predict_unseen(unseen_dates,all_data,days_look_back=1)
pred_test,pred_std = zip(*predictions)
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 100
# plot the test data
plt.plot(unseen_dates,pred_test,'navy',marker='o')
sigma = np.sqrt(pred_std)
ax.fill(np.concatenate([unseen_dates,unseen_dates[::-1]]),
np.concatenate([pred_test-1.960*sigma,(pred_test+1.9600*sigma)[::-1]]),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
One very important thing to note: The timestep of the real data is a month, using such data to make predictions about days may not be correct.

The model you have built links what you are trying to model, 'pollution_index', to some input variables, in your case ['dew', 'temp', 'press', 'wnd_spd', 'rain']. So to predict pollution_index into the future using your model, at the high level, you need to estimate what these variables would be over the next 3-6 months, and then run your model on that. Practically, you need to come up with something that looks like X_test but has your projections for these variables for the future, and then call:
yhat = reg.predict(X_test)
... to produce the model estimate of where the pollution_index will be. Hope this makes sense. This gives you a "mechanical" ability to use your model for prediction.
For example, following up on your main example where reg is BayesianRidge() that you fit, we would do the following:
import sys
from io import StringIO
import matplotlib.pyplot as plt
# Here we load your predictions for input variables
# I stubbed it with some random data
df_predict_data = StringIO(
"""
date,dew,temp,press,wnd_spd,rain
2021-01-01,59,28,16,0.78,98.7
2021-02-01,68,32,18,0.79,46.1
2021-03-01,75,34,20,0.81,91.5
2021-04-01,63,31,16,0.83,19.1
2021-05-01,74,38,19,0.83,21.8
2021-06-01,65,32,17,0.85,35.4
""")
df_predict = pd.read_csv(df_predict_data, index_col = 'date')
# scale it using the same scaler you used in training
X_predict = scaler.transform(df_predict)
# predict pollution_index
y_predict = reg.predict(X_predict)
# plot it
plt.plot(df_predict.index, y_predict, '.-')
So we get this:
Whether the linear regression you built is a good model for such prediction is a completely different question. As #Sergey Bushmanov mentioned there is vast literature on forecasting and what models are best for this or that, and this thread is probably not the right place to debate that aspect of your question.

Insufficient degrees of freedom to estimate

My degree of freedom is smaller than the number of rows in the dataset. Why do I have the error "Insufficient degrees of freedom to estimate". What can I do to resolve this error?
I have tried to reduce the value in differenced = difference(X,11), but it still shows the error.
dataset, validation = series[0:split_point], series[split_point:]
print('Dataset %d, Validation %d' % (len(dataset), len(validation)))
dataset.to_csv('dataset.csv')
validation.to_csv('validation.csv')
from pandas import Series
from statsmodels.tsa.arima_model import ARIMA
import numpy
# load dataset
series = Series.from_csv('dataset.csv', header=None)
series = series.iloc[1:]
series.head()
series.shape
from pandas import Series
from statsmodels.tsa.arima_model import ARIMA
import numpy
# create a differenced series
def difference(dataset, interval=1):
diff = list()
for i in range(interval+1, len(dataset)):
value = int(dataset[i]) - int(dataset[i - interval])
diff.append(value)
return numpy.array(diff)
# load dataset
series = Series.from_csv('dataset.csv', header=None)
# seasonal difference
X = series.values
differenced = difference(X,11)
# fit model
model = ARIMA(differenced, order=(7,0,1))
model_fit = model.fit(disp=0)
# print summary of fit model
print(model_fit.summary())
The shape is (17,)

After differencing, you are left with 6 observations (17 - 11 = 6). That's not enough for an ARIMA(7, 0, 1).
With that little data, you are unlikely to get good forecasting performance with any model, but if you must, then I would recommend something much simpler, like ARIMA(1, 0, 0) or an exponential smoothing model.

Python SciKitLearn and Pandas categoric data

I'm working on multivariable regression from a csv, predicting crop performance based on multiple factors. Some of my columns are numerical and meaningful. Others are numerical and categorical, or strings and categorical (for instance, crop variety, or plot code or whatever.) How do I teach Python to use them? I've found One Hot Encoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) but don't really understand how to apply it here.
My code so far:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('filepath.csv')
df.drop(df[df['LabeledDataColumn'].isnull()].index.tolist(),inplace=True)
scale = StandardScaler()
pd.options.mode.chained_assignment = None # default='warn'
X = df[['inputColumn1', 'inputColumn2', ...,'inputColumn20']]
y = df['LabeledDataColumn']
X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']] = scale.fit_transform(X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']].as_matrix())
#print (X)
est = sm.OLS(y, X).fit()
est.summary()

You could use the get_dummies function pandas provides and convert the categorical values.
Something like this..
predictor = pd.concat([data.get(['numerical_column_1','numerical_column_2','label']),
pd.get_dummies(data['categorical_column1'], prefix='Categorical_col1'),
pd.get_dummies(data['categorical_column2'], prefix='categorical_col2'),
axis=1)
then you could get the outcome/label column by doing
outcome = predictor['label']
del predictor['label']
Then call the model on the data doing
est = sm.OLS(outcome, predictor).fit()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

SARIMAX out of sample forecast with exogenous data - python

Related

Why is the mean in my model almost the same for all values?

Forecasting time series with multiple seasonaliy by using auto_arima(SARIMAX) and Fourier terms

any workaround to do forward forecasting for estimating time series in python?

Insufficient degrees of freedom to estimate

Python SciKitLearn and Pandas categoric data

Categories

Resources