How to forecast out of sample with AutoRegression from statsmodels? - python

I have time-series sales data. First I group the sales by year. Then I want to forecast the sales for the years 2021, 2022 and 2023. I have data from the year 2000.
My question is similar to this one; however, I want an answer on how to make forecasts outside of the training index.
from statsmodels.tsa.ar_model import AutoReg

model = AutoReg(grp, lags=5)
model_fit = model.fit()
predictions = model_fit.predict(start=len(grp), end=len(grp)+3, dynamic=False)
If I do this, the results are:
2021-12-31 NaN
2022-12-31 NaN
2023-12-31 NaN
2024-12-31 NaN
I can make it work if I set the end variable to len(grp)-1, but that means I am making predictions for data inside my sample; I want to make predictions for the future.
The dynamic attribute seems salient: the documentation says it represents the index at which the predictions start using dynamically calculated lag values.

The index of the time series did not have its freq set, even though the index was annual.
grp.index.freq = "Y"
did the job for me. There also does not appear to be any problem when the time series has a numeric index.
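A minimal sketch of the whole fix (assuming annual data for 2000-2020 in a hypothetical Series named grp; the values here are made up):
import numpy as np
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

idx = pd.date_range("2000-12-31", periods=21, freq="Y")  # year-end stamps, 2000-2020
grp = pd.Series(np.random.rand(21) * 100, index=idx)
grp.index.freq = "Y"  # without an explicit freq, out-of-sample predictions come back NaN

model_fit = AutoReg(grp, lags=5).fit()
# start=len(grp) is the first out-of-sample step; end=len(grp)+2 covers 2021-2023
print(model_fit.predict(start=len(grp), end=len(grp) + 2))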


Applying statsforecast implementation of expanding window cross-validation to multiple time series with varying lengths

I am looking to assess the accuracy of different classical time series forecasting models by implementing expanding window cross-validation with statsforecast on a time-series dataset with many unique IDs whose lengths vary between 1 and 48 months. I would like to forecast the seven months after the ending month of each window and assess the accuracy with some error metric (e.g., sMAPE). There is potentially seasonality and trend in the different time series, so I would like to capture these in the cross-validation process as well. However, I am having difficulty fully understanding the different parameters (step_size, n_windows, test_size) in the package's cross-validation function.
Could someone advise me in setting up the right parameters? Is what I'm looking for even feasible with the function provided in the package? How do I decide the best value for step_size, test_size and n_windows?
For reference, my data looks like this:
df =
         unique_id          ds       y
0           111111  2000-01-01       9
1           111111  2000-02-01       9
2           111111  2000-03-01      10
3           111111  2000-04-01       4
...            ...         ...     ...
999999      111269  2003-10-01   32532
1000000     111269  2003-11-01       0
1000001     111269  2003-12-01  984214
And to be explicit, the history for individual unique_ids can vary (i.e., the length of the time series is unequal between unique_ids.)
I have already instantiated my StatsForecast object with the requisite models:
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, Naive

sf = StatsForecast(
    df=df,
    models=[AutoARIMA(season_length=12), AutoETS(error_type='zzz'), Naive()],
    freq='MS',
    n_jobs=-1,
    fallback_model=Naive()
)
Then, I call the cross_validation method:
results_cv = sf.cross_validation(
    h=7,  # predict each of the next seven months
    step_size=?,
    n_windows=?
)
I have tried an assortment of parameter values for step_size and n_windows together, and also just for test_size alone (e.g., 7 because I want to compare the last 7 months of actuals and forecasts in each window), but I'm always left with the following error:
ValueError: could not broadcast input array from shape (y,) into shape (z,)
I expect the end result to look similar to the data-frame presented in the statsforecast tutorial:
(See the screenshot in the GitHub example, or scroll down to 'crossvaldation_df.head()'.)
Any pointers would be greatly appreciated. Thank you!
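For what it's worth, the parameters typically relate like this: each window forecasts h steps past its cutoff, consecutive cutoffs are step_size apart, and n_windows sets how many cutoffs are evaluated, so the total held-out span is h + (n_windows - 1) * step_size. A sketch with illustrative values (not a definitive fix; series shorter than the held-out span plus the models' minimum training length will still fail and may need to be filtered out first):

results_cv = sf.cross_validation(
    h=7,          # forecast 7 months from each cutoff
    step_size=7,  # move the cutoff forward 7 months between windows
    n_windows=3,  # 3 cutoffs; held-out span = 7 + (3 - 1) * 7 = 21 months
)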

Get back original prediction from logged and differenced time-series data in python

I am doing time-series forecasting to predict future orders. Since the data was non-stationary, I applied a log transform and first differencing. Then I trained an ARIMA model using the order values I got from auto_arima on the log-differenced data. I used the last 30 days for testing and the rest for training. The predicted values come back in the logged and differenced format. I also predicted the next 30 days, which are not present in the dataset, and those predictions are in the same format. How do I get back the original scale in both cases, given that I don't have the actual data for the future 30 days?
This can be done conveniently using numpy. The reverse operation involves taking the cumsum() and then the exp().
You can do:
Algorithm reference courtesy @Divakar
import numpy as np

series = df["Number of Bookings"]
new_series = np.log(series).diff()
# restore the zeroth value, since the diff() operation loses the first one
new_series.iloc[0] = np.log(series.iloc[0])
result = np.exp(new_series.cumsum())
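The same idea applies to out-of-sample forecasts (a sketch, assuming a hypothetical preds Series holding the model's log-differenced predictions for the future 30 days): anchor the cumulative sum at the last observed log value, then exponentiate.

last_log = np.log(series.iloc[-1])                 # last observed value, in log space
future_orders = np.exp(last_log + preds.cumsum())  # back on the original scale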

Python rolling forecast update lags

I would like to implement an OLS regression with sklearn.linear_model.LinearRegression.
I have a time series with 100 data points and the corresponding explanatory data.
My overall goal is to forecast the next 6 weeks. In x_train I have multiple explanatory variables for my output, for example the date, the month and, amongst others, my lag variables. I checked the significance of the lags with the autocorrelation plot, and it showed that lags 1-12 are significant, so 12 of my explanatory variables are these 12 lags.
I have created 12 lag variables:
lag_variables = pd.concat([data.shift(1), data.shift(2), ..., data.shift(12)], axis=1)
These lag variables are 12 columns in my x_train dataframe, from which I choose my features.
However, if I want to forecast 6 data points, I need to update my lags.
For example, for week 5 I don't have lags 1-4, because I don't know them at the time of my forecast. For week 1 I know lags 1-12, for week 2 I know lags 2-12, and so on.
Python now gives me an error, since my x_train contains NaN values where the lags are unknown.
So my idea was to forecast each week individually: first I forecast week 1. Then, after I have forecasted week 1, I have a value for week 1 and can update my lags for week 2 with it. I am basically doing a rolling forecast with lags.
I have tried the following:
hist_x_train = [x for x in x_train]
hist_x_test = [x for x in x_test]
hist_y_train = [x for x in y_train]
hist_y_test = [x for x in y_test]
predictions = list()
for t in range(len(x_test)):
    model = best_params  # grid search before the loop chooses the best parameters
    model.fit(hist_x_train, hist_y_train)
    y_pred = model.predict(hist_x_test[t])
    predictions.append(y_pred)
    hist_x_train.append(hist_x_test[t])
    hist_y_train.append(y_pred)
    print(y_test[t])  # this is my actual real value for that period; I don't
                      # know how to update this in x_train, so it shows up as a lag
Technically, the rolling forecast works like that if I don't have any lags in my x_train, but I don't know how to deal with the lags.
Can anyone please help me?
Thanks a lot!
Updating the training data with the predictions results in cascading errors, which makes the next predictions almost constant or zero after some time; this should be avoided. Instead, forecast the next 6 weeks in one shot, or use multiple models (6 models here) to predict the next 6 weeks: arrange your y labels (y_1, y_2, ..., y_6) such that y_1 is lagged by 1 week, y_2 by 2 weeks, and so on, and train them accordingly. A sketch of this direct strategy follows.
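A minimal sketch of the direct multi-step approach (assuming data is a weekly pandas Series; the 12 lag features and the variable names are illustrative, not taken from the question):

import pandas as pd
from sklearn.linear_model import LinearRegression

H = 6     # forecast horizon in weeks
LAGS = 12

# Lag features lag_1 ... lag_12, all observable at forecast time.
X = pd.concat({f"lag_{k}": data.shift(k) for k in range(1, LAGS + 1)}, axis=1)

models = {}
for h in range(1, H + 1):
    y_h = data.shift(-h)  # target h weeks ahead
    valid = X.notna().all(axis=1) & y_h.notna()
    models[h] = LinearRegression().fit(X[valid], y_h[valid])

# Forecast the next 6 weeks from the most recent fully observed feature row,
# without ever feeding a prediction back in as a lag.
last_row = X.iloc[[-1]]
forecasts = {h: m.predict(last_row)[0] for h, m in models.items()}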

fbprophet yearly seasonality values too high

I have recently started using the fbprophet package in python. I have monthly data for the last 2 years and am forecasting the next 9 months.
Since I have monthly data, I have only included yearly seasonality (Prophet(yearly_seasonality = True)).
When I plot the components, the trend values seem fine; however, the yearly seasonality values are too high, and I don't understand why.
The seasonality shows a +300 increase or a -200 decrease, yet in the actual graph that does not happen in any month in the past. What can I do to correct this?
The code used is as follows:
from fbprophet import Prophet

m = Prophet(yearly_seasonality=True)
m.fit(df_bu_country1)
future = m.make_future_dataframe(periods=9, freq='M')
forecast = m.predict(future)
m.plot(forecast)
m.plot_components(forecast)
There is no seasonality at all in your data. For there to be yearly seasonality, you should have a pattern that repeats year after year, but the shape of your time series from 10/2015 to 10/2016 is completely different from the shape between 10/2016 and 10/2017. So forcing a yearly seasonality is going to give you strange results; you should switch it off (i.e. just use Prophet's default settings).
There is an inconsistency in the seasonality factor of your data; there seems to be a little yearly seasonality between 2017-04 and 2018-10. The first answer is absolutely true, but in case you feel there is some seasonality, you can reduce its impact by lowering the Fourier order it uses.
https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#fourier-order-for-seasonalities
This page explains how to do so; the default Fourier order is 10, and reducing the value dampens its effect (a sketch follows). Try it; I hope it helps you.
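For example, a sketch of dampening the yearly component with a lower Fourier order (the value 3 is illustrative):

# Replace the built-in yearly term with a smoother custom one.
m = Prophet(yearly_seasonality=False)
m.add_seasonality(name='yearly', period=365.25, fourier_order=3)  # default order is 10
m.fit(df_bu_country1)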

Python ARIMA model, predicted values are shifted

I am new to the Python ARIMA implementation. I have data at 15-minute frequency for a few months. In my attempt to follow the Box-Jenkins method to fit a time series model, I ran into an issue towards the end. The ACF-PACF graphs for the time series (ts) and the differenced series (ts_diff) are given. I used ARIMA(5,1,2), and finally I plotted the fitted values (green) and the original values (blue). As you can see from the figure, there is a clear shift (by one) in the values. What am I doing wrong?
Is the prediction bad? Any insight will be helpful.
This is a standard property of one-step ahead prediction or forecasting.
The information used for the forecast is the history up to and including the previous period. A peak, for example, at a period will affect the forecast for the next period, but cannot influence the forecast for the peak period. This makes the forecasts appear shifted in the plot.
A two-step ahead forecast would give the impression of a shift by two periods.
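One way to see this in the numbers (a sketch, assuming ts and the fitted result model_fit from the follow-up code below): shifting the one-step-ahead predictions back by one period should line them up with the observations.

import pandas as pd

pred = model_fit.predict(typ='levels')  # one-step-ahead predictions, in levels
side_by_side = pd.concat([ts, pred, pred.shift(-1)], axis=1,
                         keys=['original', 'predicted', 'shifted_back'])
# 'predicted' trails the peaks by one step; 'shifted_back' lines up with them
print(side_by_side.head(10))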
Just to confirm, I am doing this right then? Here is the code I used.
from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(ts, order=(5, 1, 2))
model_fit = model.fit()
results_ARIMA = model_fit.predict(typ='levels')
concatenated = pd.concat([ts, results_ARIMA], axis=1, keys=['original', 'predicted'])
concatenated.head(10)
                     original  predicted
login_time
1970-01-01 20:00:00         2        NaN
1970-01-01 20:15:00         6   2.000186
1970-01-01 20:30:00         9   4.552971
1970-01-01 20:45:00         7   7.118973
1970-01-01 21:00:00         1   7.099769
1970-01-01 21:15:00         4   3.624975
1970-01-01 21:30:00         0   3.867454
1970-01-01 21:45:00         4   1.618120
1970-01-01 22:00:00         9   2.997275
1970-01-01 22:15:00         8   6.300015
In the model you specify, (5, 1, 2), you set d = 1. This means that you are differencing the data by 1 or, in other words, transforming your entire range of time-related observations so as to reduce the residuals of the fitted model.
Sometimes, setting d to 1 will result in an ACF/PACF plot with fewer and/or less dramatic spikes (i.e. less extreme residuals). In such cases, if you use the fitted model to predict future values, your predictions will deviate less dramatically from the observations if you apply differencing.
Differencing is accomplished through Y_diff(t) = Y(t) - Y(t-d), where Y(t) refers to the observed value at time index t and d refers to the order of differencing you apply. For example, differencing [2, 6, 9, 7] with d = 1 gives [4, 3, -2]. When you use differencing, your entire range of observations effectively shifts to the right, and you lose some data at the left edge of your time series; how many time points you lose depends on the order of differencing d. This is where your observed shift comes from.
This page may offer a more elaborate explanation (make sure to click around a bit and explore the other pages on there if you want a treatment of the whole process of fitting an ARIMA model).
Hope this helps (or at least puts your mind at ease about the shift)!
Best,
Evert
