Time series forecasting using statsmodels - python

I am trying to forecast a year's worth of values for a time series (ts) with an ARIMA model, but I can't get sensible forecasted values: the predictions are on a different scale (the last value in the dataset is 339, while the predicted values are very small). I am not sure what to change in the code. I tried changing fill_value to a different value, but I don't know if that is the proper method.
I suppose this might also have something to do with this line:
predictions_ARIMA_log = pd.Series(ts_log.ix[0], index=ts_log.index)
Is there a way to extend the index to cover forecasted values?
The code is below:
ts_log = np.log(ts)
ts_log_diff = ts_log - ts_log.shift()
model = ARIMA(ts_log, order=(2, 1, 2))
results_ARIMA = model.fit(disp=-1)
plt.plot(ts_log_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-ts_log_diff)**2))
predictions_ARIMA_diff = pd.Series(results_ARIMA.predict('1949-02-01','1961-12-01'), copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(ts_log.ix[0], index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.plot(ts)
plt.plot(predictions_ARIMA)
plt.title('RMSE: %.4f'% np.sqrt(sum((predictions_ARIMA-ts)**2)/len(ts)))
Here is how the results look: the first row is the last observed value, and the predicted values begin at 1961-01-01.
1960-12-01 339.216967
1961-01-01 3.111950
1961-02-01 3.295407
1961-03-01 3.540066
1961-04-01 3.789093
1961-05-01 3.980322
1961-06-01 4.068641
1961-07-01 4.045327
1961-08-01 3.939715
1961-09-01 3.802622
1961-10-01 3.684713
1961-11-01 3.622262
1961-12-01 3.632668

Related

Training a DeepAR time series model with monthly data. Can only train on Daily

I am training a DeepAR model (arXiv) in Jupyter Notebook. I am following this tutorial.
I create a collection of time series (concat_df), as needed by the DeepAR method:
Each row is a time series. This collection is used to train the DeepAR model. The input format expected by DeepAR is a list of series, so I create this from the above data frame:
time_series = []
for index, row in concat_df.iterrows():
    time_series.append(row)
With this list of time series I then set the freq, prediction_length and context_length (note I am setting frequency to Daily in this first example):
freq = "D"
prediction_length = 3
context_length = 3
...as well as the T-Naught, Data Length, Number of Time Series and Period:
t0 = concat_df.columns[0]
data_length = concat_df.shape[1]
num_ts = concat_df.shape[0]
period = 12
I create the training set:
time_series_training = []
for ts in time_series:
    time_series_training.append(ts[:-prediction_length])
..and visualize this with the test set:
time_series[0].plot(label="test")
time_series_training[0].plot(label="train", ls=":")
plt.legend()
plt.show()
So far so good, everything seems to agree with the tutorial.
I then use the remaining code to invoke the deployed model as explained in the referenced article:
list_of_df = predictor.predict(time_series_training[:5], content_type="application/json")
BUT, if I change the frequency to monthly (freq = "M") I get the following error:
ValueError: Units 'M', 'Y', and 'y' are no longer supported, as they do not represent unambiguous timedelta values durations.
Why would monthly data not be accepted? How can I train monthly data on DeepAR? Is there a way to specify some daily equivalent to monthly data?
This appears to be some kind of Pandas error, as shown here.
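As a rough illustration of where a message like this can come from in pandas itself (this does not show how DeepAR or the tutorial code handles frequencies, which is not visible here): a month has no fixed length, so pandas rejects it as a timedelta unit, while calendar-aware offsets handle months fine. The sketch assumes a reasonably recent pandas version; the dates are arbitrary.

```python
import pandas as pd

# A month is not a fixed-length duration, so pandas refuses it as a
# timedelta unit. The exact wording of the message may vary by version.
try:
    pd.to_timedelta(1, unit="M")
    timedelta_error = None
except ValueError as err:
    timedelta_error = str(err)
print(timedelta_error)

# Calendar-aware offsets step through months without ambiguity.
start = pd.Timestamp("2020-01-31")
next_month_end = start + pd.offsets.MonthEnd(1)
monthly_index = pd.date_range(start, periods=3, freq=pd.offsets.MonthEnd())
print(list(monthly_index))
```

If the helper code around the model adds a timedelta built from freq, replacing that arithmetic with an offset or a date_range is one direction to investigate.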

Forecasting time series with multiple seasonality by using auto_arima (SARIMAX) and Fourier terms

I am trying to forecast a time series in Python by using auto_arima and adding Fourier terms as exogenous features. The data come from Kaggle's Store Item Demand Forecasting Challenge. They consist of a long-format time series for 10 stores and 50 items, resulting in 500 time series stacked on top of each other. The distinctive feature of this time series is that it has daily data with weekly and annual seasonalities.
In order to capture these two levels of seasonality I first used TBATS as recommended by Rob J Hyndman in Forecasting with daily data which worked pretty well actually.
I also followed this medium article posted by the creator of TBATS python library who compared it with SARIMAX + Fourier terms (also recommended by Hyndman).
But now, when I tried to use the second approach with pmdarima's auto_arima and Fourier terms as exogenous features, I get unexpected results.
In the following code, I only used the train.csv file that I split into train and test data (last year used for forecasting) and set the maximum order of Fourier terms K = 2.
My problem is that I obtain a smoothed forecast (see image below) that does not seem to capture the weekly seasonality, unlike the result at the end of this article.
Is there something wrong with my code?
Complete code :
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Load the data, which consist of a long-format time series of multiple TS stacked on top of each other
# There are 10 (stores) * 50 (items) = 500 time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales
# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 2)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])
# Split the time series as well as exogenous features data into train and test splits
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test = y_prime.iloc[(len(y_prime)-365):] # last year for testing
exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]
# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, exogenous=exog_to_train, seasonal=True, m=7)
# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast , index = pd.date_range(start='2017-01-01', end= '2017-12-31'))
# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()
Thanks in advance for your answers !
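Here's the answer in case someone's interested. Compared with the code above, it uses a single Fourier term (K = 1 instead of 2) and forces seasonal differencing in auto_arima (D=1).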
Thanks again Flavia Giammarino.
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Load the data, which consist of a long-format time series of multiple TS stacked on top of each other
# There are 10 (stores) * 50 (items) time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales
# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 1)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])
# Split the time series as well as exogenous features data into train and test splits
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test = y_prime.iloc[(len(y_prime)-365):] # last year for testing
exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]
# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, D=1, exogenous=exog_to_train, seasonal=True, m=7)
# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast , index = pd.date_range(start='2017-01-01', end= '2017-12-31'))
# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()
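For readers who want to see what the Fourier terms look like without pmdarima, here is a minimal sketch of the usual sin/cos construction for a given period and number of harmonics K. The helper name fourier_terms and the column names are hypothetical; pmdarima's FourierFeaturizer uses its own column naming and may differ in details such as the time offset.

```python
import numpy as np
import pandas as pd

def fourier_terms(n_obs, period, k):
    """Return sin/cos pairs for harmonics 1..k of the given period."""
    t = np.arange(1, n_obs + 1)
    cols = {}
    for j in range(1, k + 1):
        cols[f"sin_{j}"] = np.sin(2 * np.pi * j * t / period)
        cols[f"cos_{j}"] = np.cos(2 * np.pi * j * t / period)
    return pd.DataFrame(cols)

# Two years of daily observations, annual period, one harmonic (K = 1).
exog_manual = fourier_terms(n_obs=730, period=365.25, k=1)
print(exog_manual.head())
```

Each extra harmonic adds two columns, so K controls how wiggly the annual pattern is allowed to be.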

Predicting values using an array in python [duplicate]

I am looking for a solution to the following problem. I am hoping to use a regression model to predict some values. I can currently use it to predict a single value i.e. the model will predict the y-value for the x-value equal to 10 for example. What I would like to do is use an array as the input value. I am hoping I can get the model to create a new array of predicted values when the second array is used as the input value. Is this possible? I have attached some code hoping that this helps.
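import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures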
poly_reg = PolynomialFeatures(degree=3)
wspoly = poly_reg.fit_transform(ws)
lin_reg2 = LinearRegression()
lin_reg2.fit(wspoly,elec)
plt.figure(figsize = (25,15))
x_grid = np.arange(min(ws), max(ws), 0.1)
x_grid = x_grid.reshape(len(x_grid), 1)
plt.scatter(ws, elec, color = 'red')
plt.plot(x_grid, lin_reg2.predict(poly_reg.fit_transform(x_grid)), color = 'blue')
plt.title("Polynomial Regression of Wind Speed & Electricity Generated")
plt.xlabel('Wind Speed (m/s)')
plt.ylabel('Electricity Generated (kWh)')
plt.show()
Output from the above code showing the polynomial regression model
prediction = lin_reg2.predict(poly_reg.fit_transform([[10]]))
Start out by creating a function to return the prediction. The function will look like this.
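def my_func(x_value):
    # predict the y-value for a single x-value with the fitted model
    y_value = lin_reg2.predict(poly_reg.transform([[x_value]]))[0]
    return y_value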
Then you can create an array of predictions like this.
y_values = [my_func(x) for x in list_of_x_values]
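As an alternative to looping over single values, scikit-learn's predict accepts a 2D array with one row per input, so a whole array can be predicted in one call. The sketch below uses synthetic stand-ins for ws and elec (hypothetical data, not the asker's) and reuses the names poly_reg and lin_reg2 from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-ins for ws (wind speed) and elec (electricity generated).
ws = np.linspace(0, 20, 50).reshape(-1, 1)
elec = 2.0 * ws.ravel() ** 3 - ws.ravel() + 5.0

# Fit the degree-3 polynomial regression as in the question.
poly_reg = PolynomialFeatures(degree=3)
lin_reg2 = LinearRegression().fit(poly_reg.fit_transform(ws), elec)

# Reshape the new inputs to a column (one row per value) and predict them all at once.
new_ws = np.array([5.0, 10.0, 15.0]).reshape(-1, 1)
predictions = lin_reg2.predict(poly_reg.transform(new_ws))
print(predictions)
```

The result is a 1D array with one prediction per row of new_ws.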

Linear Regression prediction by Date in python

I have converted the date into numerical values, but I am stuck on the next step: how do I prepare the data for prediction? How do I use the date for prediction in Python? How do I count the Eventhappen attribute? Please guide me and improve my code where it does not make sense. Below is my code:
#Here is Dataset
date Eventhappen
2016-01-14 A
2016-01-15 C
2016-01-16 B
2016-01-17 A
2016-01-18 C
2016-02-18 B
#Converting Date into Numerical Value
df['Dispatch_Date_Time'] = pd.to_datetime(df['Dispatch_Date_Time'])
df.set_index('Dispatch_Date_Time', inplace=True)
df.sort_index(inplace=True)
df['month'] = df.index.month
df['year'] = df.index.year
df['day'] = df.index.day
df['eventhappen'] = 1
#Preparing the data
X = df[['year']]
y = df['eventhappen']
#Training the Algorithm
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#Making the Predictions
y_pred = regressor.predict(X_test)
#Plotting the Least Square Line
sns.pairplot(df, x_vars=['year'], y_vars='eventhappen', height=7, aspect=0.7, kind='reg')
There is a lot of confusion in your code, for me at least. The column names in the dataset are not the same as the ones used in the processing. You have two scenarios to consider:
SN-A: If you want to predict which event will happen on some future date, the target column 'Eventhappen' is categorical, so you have a multi-class classification task, not a regression one. You should encode your target column, split your dataset with a train/test split, and then fit a classifier to predict the event on some future date.
SN-B: If you want to predict the number of events happening on some future date, then you are on the right track, but you need a numerical column to predict, which is the count. That means this line of code should not assign a constant:
df['eventhappen'] = 1
When you have it, you should consider some timeseries techniques (power transformation, lags...), then split into train/test datasets and finally implement/evaluate your regressor model.
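A minimal sketch of the count target for scenario SN-B, assuming several events can share a date. The rows below are hypothetical (the sample in the question has one event per date), and event_count and lag_1 are illustrative column names.

```python
import pandas as pd

# Hypothetical event log with repeated dates.
df = pd.DataFrame({
    "date": ["2016-01-14", "2016-01-14", "2016-01-15",
             "2016-01-16", "2016-01-16", "2016-01-16"],
    "Eventhappen": ["A", "C", "B", "A", "C", "B"],
})
df["date"] = pd.to_datetime(df["date"])

# Number of events per day: the numeric target for a regressor.
daily_counts = df.groupby("date").size().rename("event_count")

# A one-day lag as a simple time-series feature.
features = daily_counts.to_frame()
features["lag_1"] = features["event_count"].shift(1)
print(features)
```

The first row has no lag value, so it would normally be dropped before the train/test split.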
Use this function to extract the features needed from the date column and then use them directly in your machine learning model. You can also encode the cyclic features, which gives the model the ability to extract cyclic insights from the data.
def transform_col_date(data, date_col):
    '''
    data : DataFrame (your dataset).
    date_col : String (name of the date column).
    '''
    data_ = data.copy()
    data_.reset_index(inplace=True)
    data_[date_col] = pd.to_datetime(data_[date_col])
    data_['day'] = data_[date_col].dt.day
    data_['month'] = data_[date_col].dt.month
    data_['dayofweek'] = data_[date_col].dt.dayofweek
    data_['dayofyear'] = data_[date_col].dt.dayofyear
    data_['quarter'] = data_[date_col].dt.quarter
    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar() replaces it
    data_['weekofyear'] = data_[date_col].dt.isocalendar().week
    data_['year'] = data_[date_col].dt.year
    return data_
#in your case
data = transform_col_date(df, 'date')
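The cyclic encoding mentioned above is commonly done with a sine/cosine pair, so that, for example, December and January end up close together. This is a small sketch; encode_cyclic and the generated column names are hypothetical helpers, not part of pandas.

```python
import numpy as np
import pandas as pd

def encode_cyclic(data, col, period):
    """Add <col>_sin and <col>_cos columns for a feature that wraps at `period`."""
    out = data.copy()
    angle = 2 * np.pi * out[col] / period
    out[col + "_sin"] = np.sin(angle)
    out[col + "_cos"] = np.cos(angle)
    return out

# Months wrap around after 12.
dates = pd.DataFrame({"month": [1, 6, 12]})
encoded = encode_cyclic(dates, "month", 12)
print(encoded)
```

The same helper can be applied to dayofweek (period 7) or dayofyear (period 365 or 366).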

Pandas/Statsmodel OLS predicting future values

I've been trying to get a prediction for future values in a model I've created. I have tried both OLS in pandas and statsmodels. Here is what I have in statsmodels:
import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred
The length of the array returned is equal to the number of records in my original dataframe but the values are not the same. When I do the following using pandas I get no values returned.
from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict
(Note that there is no .fit function for OLS in pandas.) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodels? I realize I must not be using .predict properly, and I've read about the many other problems people have had, but they do not seem to apply to my case.
Edit: I believe 'endog' as defined is incorrect; I should be passing the values for which I want to predict, so I've created a date range of 12 periods past the last recorded value. But I am still missing something, as I am getting the error:
matrices are not aligned
Edit: here is a snippet of the data; the last column of numbers is the date delta, which is the difference in months from the first date:
month monthly_data monthly_data_smoothed5 monthly_data_smoothed8 monthly_data_smoothed12 monthly_data_smoothed3 date_delta
0 2011-01-31 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 0.000000
1 2011-02-28 3.776706e+11 3.750759e+11 3.748327e+11 3.746975e+11 3.755084e+11 0.919937
2 2011-03-31 4.547079e+11 4.127964e+11 4.083554e+11 4.059256e+11 4.207653e+11 1.938438
3 2011-04-30 4.688370e+11 4.360748e+11 4.295531e+11 4.257843e+11 4.464035e+11 2.924085
I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. Fixing it in your code would look something like this:
dframe = pd.read_clipboard() # your sample data
dframe['intercept'] = 1
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']
smresults = sm.OLS(y, X).fit()
dframe['pred'] = smresults.predict()
Also, for what it's worth, I think the statsmodels formula API is much nicer to work with when dealing with DataFrames, and it adds an intercept by default (add a - 1 to the formula to remove it). See below; it should give the same answer.
import statsmodels.formula.api as smf
smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
dframe['pred'] = smresults.predict()
Edit:
To predict future values, just pass new data to .predict(). For example, using the first model:
In [165]: smresults.predict(pd.DataFrame({'intercept': 1,
                                          'date_delta': [0.5, 0.75, 1.0]}))
Out[165]: array([ 2.03927604e+11, 2.95182280e+11, 3.86436955e+11])
On the intercept: there's nothing encoded in the number 1; it's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull the value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:
X = sm.add_constant(X)
