I am working on a timeseries analysis with SARIMAX and have been really struggling with it.
I think I have successfully fit a model and used it to make predictions; however, I don't know how to make out-of-sample forecasts with exogenous data.
I may be doing the whole thing wrong, so I have included my steps below with some sample data:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# from pandas import datetime  # unused, and removed in recent pandas versions
import statsmodels.api as sm
# Defining Sample data
df = pd.DataFrame({'date':['2019-01-01','2019-01-02','2019-01-03',
'2019-01-04','2019-01-05','2019-01-06',
'2019-01-07','2019-01-08','2019-01-09',
'2019-01-10','2019-01-11','2019-01-12'],
'price':[78,60,62,64,66,68,70,72,74,76,78,80],
'factor1':[178,287,152,294,155,245,168,276,165,275,178,221]
})
# Changing index to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
select_dates = df.set_index(['date'])
df = df.set_index('date')
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True)
df.dropna(inplace=True)
# Splitting Data into test and training sets manually
train = df.loc['2019-01-01':'2019-01-09']
test = df.loc['2019-01-10':'2019-01-12']
# setting index to datetime for test and train datasets
train.index = pd.DatetimeIndex(train.index).to_period('D')
test.index = pd.DatetimeIndex(test.index).to_period('D')
# Defining and fitting the model with training data for endogenous and exogenous data
model=sm.tsa.statespace.SARIMAX(train['price'],
order=(0, 0, 0),
seasonal_order=(0, 0, 0,12),
exog=train.iloc[:,1:],
time_varying_regression=True,
mle_regression=False)
model_1= model.fit(disp=False)
# Defining exogenous data for testing
exog_test=test.iloc[:,1:]
# Forecasting out of sample data with exogenous data
forecast = model_1.forecast(3, exog=exog_test)
So my problem is really with the last line: what do I do if I want more than 3 steps?
I will attempt to answer this, as it mainly comes down to the shape of your data and what the statsmodels documentation requires.
Per the documentation, steps is an integer: the number of steps to forecast from the end of the sample. Because the model was fit with exogenous regressors, forecast also needs an exog array with exactly one row per forecast step, so asking for more steps means supplying more rows of exogenous (test) data.
(https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html)
(https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAXResults.forecast.html)
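For example, to get five steps out of the model fit above you would have to pass five future rows of factor1. A minimal sketch (the factor1 values below are made up purely for illustration):
# hypothetical future values of factor1, one row per forecast step
future_exog = pd.DataFrame({'factor1': [200, 210, 190, 205, 215]})
forecast_5 = model_1.forecast(steps=5, exog=future_exog)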
Here are the two errors you get if you increase the step count by one without adding a matching row of exogenous data:
ValueError: cannot reshape array of size 3 into shape (4,1)
Provided exogenous values are not of the appropriate shape. Required (4, 1), got (3, 1).
ValueError: the number of rows in the exogenous variable does not match the number of time periods you're asking it to predict
With that said, simply expanding the testing set works and gets you the additional forecasts. Here is the code that works, and a link to a working notebook:
https://colab.research.google.com/drive/1o9KXAe61EKH6bDI-FJO3qXzlWjz9IHHw?usp=sharing
import pandas as pd
import numpy as np
# from sklearn.model_selection import train_test_split
# why import this if you want to do train/test manually?
# from pandas import datetime  # unused, and removed in recent pandas versions
# Defining Sample data
df=pd.DataFrame({'date':['2019-01-01','2019-01-02','2019-01-03',
'2019-01-04','2019-01-05','2019-01-06',
'2019-01-07','2019-01-08','2019-01-09',
'2019-01-10','2019-01-11','2019-01-12'],
'price':[78,60,62,64,66,68,70,72,74,76,78,80],
'factor1':[178,287,152,294,155,245,168,276,165,275,178,221]
})
# Changing index to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
select_dates = df.set_index(['date'])
df = df.set_index('date')
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True)
df.dropna(inplace=True)
# Splitting Data into test and training sets manually
train = df.loc['2019-01-01':'2019-01-09']
# CHANGED the start of the test window from 2019-01-10 to 2019-01-09, so one more
# day is included: the exogenous test array is now (4, 1) instead of (3, 1)
# (it would be (4, 2) if you added another exogenous column), which means
# forecast() can now be asked for up to 4 steps
test = df.loc['2019-01-09':'2019-01-12']
# setting index to datetime for test and train datasets
train.index = pd.DatetimeIndex(train.index).to_period('D')
test.index = pd.DatetimeIndex(test.index).to_period('D')
# Defining and fitting the model with training data for endogenous and exogenous data
import statsmodels.api as sm
model=sm.tsa.statespace.SARIMAX(train['price'],
order=(0, 0, 0),
seasonal_order=(0, 0, 0,12),
exog=train.iloc[:,1:],
time_varying_regression=True,
mle_regression=False)
model_1= model.fit(disp=False)
# Defining exogenous data for testing
exog_test=test.iloc[:,1:]
# Forecasting out of sample data with exogenous data
forecast = model_1.forecast(4, exog=exog_test)
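If you also want confidence intervals around those four steps, get_forecast works the same way (a sketch; the same one-exog-row-per-step rule applies):
pred = model_1.get_forecast(steps=4, exog=exog_test)
print(pred.predicted_mean)  # point forecasts, same values as forecast() above
print(pred.conf_int())      # lower/upper bounds for each forecast step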
a = [-0.10266667, 0.02666667, 0.016, 0.06666667, 0.08266667]
b = [5.12, 26.81, 58.82, 100.04, 148.08]
The result of SLOPE(a, b) in Excel is 0.001062.
How can I get the same result in Python that I get by using SLOPE in Excel?
Here you go.
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([5.12,26.81,58.82,100.04,148.08]).reshape((-1, 1))
y = np.array([-0.10266667,0.02666667,0.016 ,0.06666667,0.08266667])
model = LinearRegression().fit(x, y)
print(model.coef_)
# methods and attributes available
print(dir(model))
In Excel, SLOPE's arguments are in the order known_ys, known_xs. I used the names y and x here so the correspondence is more obvious.
The reshape just turns x into a list of lists (a 2-D column), which is what LinearRegression requires; y just needs to be a 1-D list. The fitted model has many other methods and attributes available; see dir(model).
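If you would rather not pull in scikit-learn just for a slope, numpy gives the same coefficient (a quick cross-check using the same x and y as above):
slope = np.polyfit(x.ravel(), y, 1)[0]
print(slope)  # matches Excel's SLOPE(a, b)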
I wanted to create my own transformer using scikit-learn's FunctionTransformer and followed their example as a dry run. It worked, but then I wanted to take the inverse of that transformation just to see the end result. However, when I tried inverse_transform, it returned the same thing as the transformation. How do I get the original values? I ask this because I plan on using this transformation to transform a target variable, then make predictions. Those predictions will then need to be inverse-transformed.
As a side bar, should I fit on y_train and transform on my y_test? Or can I transform y all at once?
My transformer:
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
import random

randomlist = []
for i in range(0, 100):
    n = random.randint(1, 100)
    randomlist.append(n)
y = pd.Series(randomlist)
y_train = y[:80]
y_test = y[80:]
target_trans = FunctionTransformer(np.log, validate=True, check_inverse = True)
logy_train = target_trans.fit_transform(y_train.values.reshape(-1,1))
logy_test = target_trans.transform(y_test.values.reshape(-1,1))
target_trans.inverse_transform(y_train.values.reshape(-1,1))
Within FunctionTransformer() you not only need to set check_inverse=True, you also need to supply the actual inverse function via inverse_func.
So for the above,
target_trans = FunctionTransformer(np.log, inverse_func=np.exp,
                                   validate=True, check_inverse=True)
which yields the desired result.
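With the inverse function in place, applying inverse_transform to the transformed values (rather than to y_train itself, as in the original snippet) recovers the original numbers. A minimal sketch reusing the variables from the question:
logy_train = target_trans.fit_transform(y_train.values.reshape(-1, 1))
recovered = target_trans.inverse_transform(logy_train)
# recovered is exp(log(y_train)), i.e. the original values up to floating point error
print(np.allclose(recovered, y_train.values.reshape(-1, 1)))  # True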
My degrees of freedom are smaller than the number of rows in the dataset. Why do I get the error "Insufficient degrees of freedom to estimate"? What can I do to resolve it?
I have tried reducing the interval in differenced = difference(X, 11), but it still shows the error.
dataset, validation = series[0:split_point], series[split_point:]
print('Dataset %d, Validation %d' % (len(dataset), len(validation)))
dataset.to_csv('dataset.csv')
validation.to_csv('validation.csv')
from pandas import Series
from statsmodels.tsa.arima_model import ARIMA
import numpy
# load dataset
series = Series.from_csv('dataset.csv', header=None)
series = series.iloc[1:]
series.head()
series.shape
from pandas import Series
from statsmodels.tsa.arima_model import ARIMA
import numpy
# create a differenced series
def difference(dataset, interval=1):
    diff = list()
    for i in range(interval+1, len(dataset)):
        value = int(dataset[i]) - int(dataset[i - interval])
        diff.append(value)
    return numpy.array(diff)
# load dataset
series = Series.from_csv('dataset.csv', header=None)
# seasonal difference
X = series.values
differenced = difference(X,11)
# fit model
model = ARIMA(differenced, order=(7,0,1))
model_fit = model.fit(disp=0)
# print summary of fit model
print(model_fit.summary())
The shape is (17,)
After differencing, you are left with 6 observations (17 - 11 = 6). That's not enough for an ARIMA(7, 0, 1).
With that little data, you are unlikely to get good forecasting performance with any model, but if you must, then I would recommend something much simpler, like ARIMA(1, 0, 0) or an exponential smoothing model.
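As a rough sketch of the simpler route, using the current statsmodels.tsa.arima.model.ARIMA instead of the deprecated arima_model module (differenced is assumed to come from your code above):
from statsmodels.tsa.arima.model import ARIMA

# a much smaller model that a handful of observations can actually support
model = ARIMA(differenced, order=(1, 0, 0))
model_fit = model.fit()
print(model_fit.summary())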
Basically, I am trying to run a regression on a dataframe without an intercept, so I set fit_intercept to False, yet the following code yields parameters that include an intercept. Anyone have an idea why this may be the case?
model2 = smf.ols('Y ~ X', data=df_final)
result2 = model2.fit(cov_type = 'HAC', cov_kwds = {'maxlags':5}, fit_intercept= False)
result2.params
Intercept 0.032649
X 0.014521
dtype: float64
When running an OLS model using a formula, an intercept is added by default. One way to omit the intercept term is to add -1 to the formula:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({'X': np.random.randint(0, 100, size=20),
'Y': np.random.randint(0, 100, size=20)})
model = smf.ols('Y ~ X - 1', data=df)
result = model.fit()
The fitted model now only contains a single parameter (for X):
X 0.691876
dtype: float64
If you're not using the formula API, then the OLS model doesn't include an intercept, so you don't need to worry about removing it (in that case you would need to explicitly add a constant to your data if you want one).
I'm not sure where you got the fit_intercept parameter from, as I can't find any reference to it in the statsmodels documentation or source code. Maybe you're thinking of linear regression in scikit-learn, which does use a fit_intercept parameter to control the intercept.
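For completeness, here is what the non-formula route looks like (a small sketch with made-up data; sm.add_constant is only needed if you do want an intercept):
import numpy as np
import statsmodels.api as sm

X = np.random.randint(0, 100, size=20)
Y = np.random.randint(0, 100, size=20)

# no intercept: sm.OLS does not add one unless you do it yourself
result_no_const = sm.OLS(Y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 5})
print(result_no_const.params)  # a single coefficient for X

# with an intercept: explicitly add a constant column
result_const = sm.OLS(Y, sm.add_constant(X)).fit()
print(result_const.params)     # const plus the coefficient for X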