Forecasting with statsmodels

I have a .csv file containing a 5-year time series with hourly resolution (commodity prices). Based on the historical data, I want to create a forecast of the prices for the 6th year.
I have read a couple of articles on the web about this type of procedure, and I basically based my code on the code posted there, since my knowledge of both Python (especially statsmodels) and statistics is limited at best.
Those are the links, for those who are interested:
http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/
http://www.johnwittenauer.net/a-simple-time-series-analysis-of-the-sp-500-index/
First of all, here is a sample of the .csv file. In this case the data is displayed with monthly resolution; it is not real data, just randomly chosen numbers to give an example (I hope one year is enough to develop a forecast for the 2nd year; if not, the full csv file is available):
Price
2011-01-31 32.21
2011-02-28 28.32
2011-03-31 27.12
2011-04-30 29.56
2011-05-31 31.98
2011-06-30 26.25
2011-07-31 24.75
2011-08-31 25.56
2011-09-30 26.68
2011-10-31 29.12
2011-11-30 33.87
2011-12-31 35.45
My current progress is as follows:
After reading the input file and setting the date column as the datetime index, the following script was used to develop a forecast for the available data:
model = sm.tsa.ARIMA(df['Price'].iloc[1:], order=(1, 0, 0))
results = model.fit(disp=-1)
df['Forecast'] = results.fittedvalues
df[['Price', 'Forecast']].plot(figsize=(16, 12))
which gives the following output:
Now, as I said, I have no statistics skills and little to no idea how I got to this output (basically, changing the order argument in the first line changes the output), but the 'actual' forecast looks quite good and I would like to extend it for another year (2016).
In order to do that, additional rows are created in the dataframe, as follows:
start = datetime.datetime.strptime("2016-01-01", "%Y-%m-%d")
date_list = pd.date_range('2016-01-01', freq='1D', periods=366)
future = pd.DataFrame(index=date_list, columns= df.columns)
data = pd.concat([df, future])
Finally, when I use the .predict function of statsmodels:
data['Forecast'] = results.predict(start = 1825, end = 2192, dynamic= True)
data[['Price', 'Forecast']].plot(figsize=(12, 8))
what I get as forecast is a straight line (see below), which doesn't seem at all like a forecast. Moreover, if I extend the range, which now is from the 1825th to 2192nd day (year of 2016), to the whole 6 year timespan, the forecast line is a straight line for the entire period (2011-2016).
I have also tried to use the 'statsmodels.tsa.statespace.sarimax.SARIMAX.predict' method, which accounts for seasonal variation (which makes sense in this case), but I get an error saying that 'module' has no attribute 'SARIMAX'. But this is a secondary problem; I will go into more detail if needed.
Somewhere I am losing grip and I have no idea where. Thanks for reading. Cheers!

ARIMA(1,0,0) is a one-period autoregressive model, so it follows this formula:
r_t = phi_0 + phi_1 * r_(t-1) + a_t
What that means is that the value in time period t is equal to some constant (phi_0), plus a coefficient determined by fitting the ARMA model (phi_1) multiplied by the value in the prior period r_(t-1), plus a white noise error term (a_t).
Your model only has a memory of 1 period, so the current prediction is entirely determined by the 1 value of the prior period. It's not a very complex model; it's not doing anything fancy with all the prior values. It's just taking yesterday's price, multiplying it by some value and adding a constant. You should expect it to quickly go to equilibrium and then stay there forever.
The reason why the forecast in the top picture looks so good is that it is just showing you hundreds of 1 period forecasts that are starting fresh with each new period. It's not showing a long period prediction like you probably think it is.
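A minimal sketch of that difference, in plain Python and with made-up coefficients (phi_0 = 3.0 and phi_1 = 0.9 are assumptions for illustration, not values fitted to your data). One-step predictions restart from each observed price, while a multi-step forecast feeds its own output back in and settles near phi_0 / (1 - phi_1):

```python
# Hypothetical AR(1) coefficients; in practice they come from the fitted model.
phi_0 = 3.0  # constant
phi_1 = 0.9  # autoregressive coefficient, must satisfy |phi_1| < 1


def one_step_predictions(series, phi_0, phi_1):
    """Predict each value from the *observed* previous value."""
    return [phi_0 + phi_1 * previous for previous in series[:-1]]


def multi_step_forecast(last_value, steps, phi_0, phi_1):
    """Forecast recursively, feeding each prediction back in as the next input."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = phi_0 + phi_1 * value
        forecasts.append(value)
    return forecasts


prices = [32.21, 28.32, 27.12, 29.56, 31.98, 26.25]
in_sample = one_step_predictions(prices, phi_0, phi_1)
future = multi_step_forecast(prices[-1], 200, phi_0, phi_1)
equilibrium = phi_0 / (1 - phi_1)

print("first forecast:", future[0])
print("200th forecast:", future[-1])
print("equilibrium:   ", equilibrium)
```

The in-sample list moves with the data because every entry sees a fresh observation; the future list only ever sees the last observed price, so it flattens out.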
Looking at the link you sent:
http://www.johnwittenauer.net/a-simple-time-series-analysis-of-the-sp-500-index/
read the section where he discusses why this model doesn't give you what you want.
"So at first glance it seems like this model is doing pretty well. But although it appears like the forecasts are really close (the lines are almost indistinguishable after all), remember that we used the un-differenced series! The index only fluctuates a small percentage day-to-day relative to the total absolute value. What we really want is to predict the first difference, or the day-to-day moves. We can either re-run the model using the differenced series, or add an "I" term to the ARIMA model (resulting in a (1, 1, 0) model) which should accomplish the same thing. Let's try using the differenced series."
To do what you're trying to do, you'll need to do more research into these models and figure out how to format your data, and what model will be appropriate. The most important thing is knowing what information you believe is contained in the data you're feeding into the model. What your model currently is trying to do is say, "Today the price is $45. What will the price be tomorrow?" That's it. It doesn't have any information about momentum, volatility, etc. That's not much to go off.

It sounds like you are using an older version of statsmodels that does not support SARIMAX. You'll want to install the latest released version, 0.8.0; see http://statsmodels.sourceforge.net/devel/install.html.
I'm using Anaconda and installed via pip.
pip install -U statsmodels
The results class from the SARIMAX model has a number of useful methods, including forecast.
data['Forecast'] = results.forecast(100)
This will use your model to forecast 100 steps into the future.

Try setting dynamic=False when predicting.

Related

Create a ML model with tensorflow that predicts a values at any given time range at hourly intervals

I am pretty new to ML and completely new to creating my own models. I have gone through TensorFlow's time-series forecasting tutorial and other LSTM time series examples on how to predict with multivariate inputs. After trying multiple examples, I think I realized that this is not what I want to achieve.
My problem involves a dataset that is in hourly intervals and includes 4 different variables, with the possibility of more in the future. One column is the datetime. I want to train a model with data that can range from one month to many years. I will then create an input set that includes a few of the variables used during training, with at least one of them missing, which is the one I want the model to predict.
For example, if I had 2 years' worth of solar panel data and weather conditions from Jan-12-2016 to Jan-24-2018, I would like to train the model on that and then have it predict the solar panel data from May-05-2021 to May-09-2021 given the weather conditions during that date range. So in my case I am not necessarily forecasting but using existing data to point at a certain day of any year given the conditions of that day at each hour. This would mean I should be able to go back in time as well.
Is this possible to achieve using TensorFlow, and if so, are there any resources or tutorials I should be looking at?

See Best practice for encoding datetime in machine learning.
Relevant variables for solar power seem to be hour-of-day and day-of-year. (I don't think separating into month and day-of-month is useful, as they are part of the same natural cycle; if you had data on the spending habits of people who get paid monthly, then it would make sense to model the month cycle, but the Sun doesn't do anything particular monthly.) Divide them by hours-in-day and days-in-year respectively to get them to [0,1), multiply by 2*PI to get to a circle, and create a sine column and a cosine column for each of them. This gives you four features that capture the cyclic nature of day and year.
CyclicalFeatures discussed in the link is a nice convenience, but you should still pre-process your day-of-year into a [0,1) interval manually, as otherwise you can't get it to handle leap years correctly - some years days-in-year are 365 and some years 366, and CyclicalFeatures only accepts a single max value per column.
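As a rough sketch of the encoding described above, using only the standard library (the function name cyclical_time_features is made up for this example, and dividing by the actual length of each year is one possible way to deal with leap years):

```python
import calendar
import math
from datetime import datetime


def cyclical_time_features(ts):
    """Return (hour_sin, hour_cos, year_sin, year_cos) for a datetime."""
    # Fraction of the day that has elapsed, in [0, 1).
    day_fraction = (ts.hour + ts.minute / 60 + ts.second / 3600) / 24
    # Fraction of the year that has elapsed, in [0, 1). Dividing by the
    # real number of days in this year keeps leap years on the same circle.
    days_in_year = 366 if calendar.isleap(ts.year) else 365
    day_of_year = ts.timetuple().tm_yday  # 1 for January 1st
    year_fraction = (day_of_year - 1 + day_fraction) / days_in_year

    hour_angle = 2 * math.pi * day_fraction
    year_angle = 2 * math.pi * year_fraction
    return (
        math.sin(hour_angle),
        math.cos(hour_angle),
        math.sin(year_angle),
        math.cos(year_angle),
    )


features = cyclical_time_features(datetime(2021, 5, 5, 6, 0))
print(features)
```

These four columns would then sit next to the weather variables in the model input, in place of the raw datetime.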

How to build a prediction model which takes input in date_time format?

I have been working on predicting water usage on a weekly basis. I have the starting day of every week in one column and the water consumed in another column. I want my model to predict in such a way that I give the input in date-time format, like 21-01-2021 (say), to the predict() function. Which model should I use, and how can I achieve this?
I've previously tried the ARIMA model from time series analysis.

Most ML/DL algorithms use floating point input values. That said, and based on most of the datasets that I've seen, you should do some data transformation and compute a time delta (you'd see something like TimeDT). That's done by setting a base date (the first date that appears in your training data) and computing each row's delta from it based on your criteria (seconds, hours or days elapsed, etc.).
TL;DR
As I understood it, you're computing based on the day of the week (please correct me if I'm wrong), so your time delta would be daily, restarting each week? The most appropriate approach in that case is based on the calendar: decompose the date and add two new features, week_of_year and day_of_week.
Is week_of_year important? Well, in summer there might be a tendency to consume more water; that's something your dataset can tell you.
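A small sketch of that decomposition with the standard library only. The function name and the base date "04-01-2021" are hypothetical; in practice the base date would be the first date in your training data, and the resulting numbers are what you would pass to the model instead of the date string:

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"  # matches inputs such as "21-01-2021"


def calendar_features(date_string, base_date_string):
    """Turn a date string into numeric features a regressor can consume."""
    date = datetime.strptime(date_string, DATE_FORMAT)
    base = datetime.strptime(base_date_string, DATE_FORMAT)
    # isocalendar() gives (ISO year, ISO week number, ISO weekday 1=Mon..7=Sun).
    _iso_year, iso_week, iso_weekday = date.isocalendar()
    return {
        "days_since_base": (date - base).days,
        "week_of_year": iso_week,
        "day_of_week": iso_weekday,
    }


features = calendar_features("21-01-2021", "04-01-2021")
print(features)
```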

What is my dataframe's x value when using sklearn RandomForestRegressor?

I'm working on a big data project for my school project. My dataset looks like this: https://github.com/gindeleo/climate/blob/master/GlobalTemperatures.csv
I'm trying to predict the next values of "LandAverageTemperature".
I've asked another question about this topic earlier (it's here: How to predict correctly in sklearn RandomForestRegressor?), but I couldn't get any answer to it. After not getting anything from my first question and then failing for another day, I've decided to start from scratch.
Right now, I want to know which value in my dataset is "x" so that I can make the prediction correctly. I read that y is the dependent variable that I want to predict and x is the independent variable that I should use as a "predictor" to help the prediction process. In that case my y variable is "LandAverageTemperature". I don't know what the x value is. I was using date values for x at first, but I'm not sure that is right at the moment.
And if I shouldn't use RandomForestRegressor or sklearn (I started this project with Spark) for this dataset, please let me know. Thanks in advance.

You only have one variable (LandAverageTemperature), so obviously that's what you're going to use. What you're looking for is the df.shift() function, which shifts your values. With this function, you'll be able to add columns of past values to your dataframe. You will then be able to use t 1 month/day ago, t 2 months/days ago, etc, as predictors of another day/month's temperature.
You can use it like this:
for i in range(1, 15):
    # Column 'T-i' holds the temperature from i periods earlier.
    df.loc[:, 'T-%s' % i] = df.loc[:, 'LandAverageTemperature'].shift(i)
Your columns will then be temperature, and temperature at T-1, T-2, for up to 14 time periods.
For your question about what is a proper model for time series forecasting, it would be off-topic for this site, but many resources exist on https://stats.stackexchange.com.

In the general case you can use all data columns except your target column as the X feature matrix. But in your case there are several complications:
You have missing (empty) data in most of the columns for many years. You can exclude such rows/years from the training data, or exclude the columns with missing data (which will be almost all of your columns, and that's not good).
A regression model can't use date fields directly; you should translate the date field into some numerical field(s), "months past first observation", for example. Something like (year-1750)*12 + month. And/or you can have year and month in separate columns (that's better if you have some "seasonality" in your data).
You have sequential time data here, so maybe you should not use simple regression. Use ARIMA/SARIMA/SARIMAX or other so-called time-series models, which predict the target sequentially, one value after another, month after month in your case. It's a hard topic to learn, but you should definitely take a look at time series because you will need it some time in the future, if not today.
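To illustrate the date translation from the second point, here is a minimal sketch (the function name is made up, and 1750 is simply the base year used in the formula above):

```python
from datetime import date

BASE_YEAR = 1750  # base year from the formula above


def date_to_features(d):
    """Translate a date into numeric columns a regression model can use."""
    # A monotonically increasing counter that can capture a long-term trend.
    months_elapsed = (d.year - BASE_YEAR) * 12 + d.month
    # Year and month kept separately so the model can pick up seasonality.
    return [months_elapsed, d.year, d.month]


rows = [date_to_features(d) for d in (date(1750, 1, 1), date(2015, 12, 1))]
print(rows)
```

Each inner list would become one row of X, with LandAverageTemperature for the same date as the corresponding y value.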

ARIMA order for trend

I am trying to fit an ARIMA model. I have 3 months of data, and it shows a count (float) for every minute. Which order should I pass to arima.fit()?
I need to predict for every minute.

A basic ARIMA(p,d,q) model isn't going to work with your data.
Your data violates the assumptions of ARIMA, one of which is that the parameters have to be consistent over time.
I see clusters of 5 spikes, so I'm presuming that you have 5 busy workdays and quiet weekends. Basic ARIMA isn't going to know the difference between a weekday and weekend, so it probably won't give you useful results.
There is such a thing as a SARIMA (Seasonal Autoregressive Integrated Moving Average). That would be useful if you were dealing with daily data points, but not really suitable for minute data either.
I would suggest that you try to filter your data so that it excludes evenings and weekends. Then you might be dealing with a dataset that has consistent parameters. If you're using python, you could try using pyramid's auto_arima() function on your filtered time series data (s) to have it try to auto-find the best parameters p, d, q.
It also does a lot of the statistical tests you might want to look into for this type of analysis. I actually don't always agree with the auto_arima's parameter choices, but it's a start.
model = pyramid.arima.auto_arima(s)
print(model.summary())
http://pyramid-arima.readthedocs.io/en/latest/_submodules/arima.html
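The filtering step suggested above could look roughly like this in plain Python. The 9:00 to 17:00 window is an assumption, since the actual busy hours have to be read off your data, and the readings here are synthetic placeholders:

```python
from datetime import datetime, timedelta


def is_business_time(ts, start_hour=9, end_hour=17):
    """True for Monday-Friday timestamps inside the assumed working hours."""
    return ts.weekday() < 5 and start_hour <= ts.hour < end_hour


# One synthetic week of per-minute readings, starting on a Monday.
start = datetime(2018, 1, 1)
readings = [
    (start + timedelta(minutes=m), float(m % 60)) for m in range(7 * 24 * 60)
]

# Keep only weekday, daytime observations before fitting anything.
filtered = [(ts, value) for ts, value in readings if is_business_time(ts)]
print(len(readings), "->", len(filtered))
```

The values from the filtered list are what would be passed on as the series s.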

1) Is your data even suitable for a Box-Jenkins model (ARIMA)?
2) I see a higher mean at the end. There is clear seasonality in the data. ARIMA will fail. Please try seasonal models: SARIMA, as rightly suggested by try; Prophet is another beautiful algorithm by Facebook for seasonality (R implementation):
https://machinelearningstories.blogspot.com/2017/05/facebooks-phophet-model-for-forecasting.html
3) Don't rely on ARIMA alone. Try other time series algorithms such as STL, BTS, TBATS (structural time series), hybrid models, etc. Let me know if you want information on packages for these in R.

numeric value for time series data with frequency every 2 minutes

I have a dataset covering two months in which I get a reading every 2 minutes. The statsmodels seasonal_decompose method asks for a numeric value for the frequency. What would the numeric value be for such data, and what is the proper method for calculating freq for this kind of time series?

You need to identify frequency of the seasonality yourself. Often this is done using knowledge of the dataset or by visually inspecting a partial autocorrelation plot, which the statsmodels library provides:
statsmodels - partial autocorrelation
If the data has an hourly seasonality to it, you might see a significant partial autocorrelation at lag 30 (as there are 30 data points between this hour's first 2 minutes and last hour's first 2 minutes). I'm assuming statsmodels would expect this value; I'm assuming that if you had monthly data it would expect a 12, or if you had daily data, it would expect a 7 for the weekly shape, etc.
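The arithmetic behind that value is just the length of the seasonal cycle divided by the sampling interval. A quick sketch (the helper name is made up; as far as I know, newer statsmodels releases call this argument period rather than freq):

```python
from datetime import timedelta


def observations_per_cycle(cycle, sampling_interval):
    """Number of readings in one seasonal cycle."""
    count, remainder = divmod(cycle, sampling_interval)
    if remainder:
        raise ValueError("cycle is not a whole multiple of the sampling interval")
    return count


interval = timedelta(minutes=2)
candidates = {
    "hourly": observations_per_cycle(timedelta(hours=1), interval),
    "daily": observations_per_cycle(timedelta(days=1), interval),
    "weekly": observations_per_cycle(timedelta(weeks=1), interval),
}
print(candidates)
```

Which of these candidates to use still depends on which seasonality actually shows up in the data.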
It sounds like you have multiple seasonalities to consider judging by your other post. You might see a significant lag that corresponds to the same 2 minutes in the previous hours, previous days, and/or the previous weeks. This approach to seasonal decomposition is considered to be naive and only addresses 1 seasonality as described in its documentation:
Seasonal Decomposition
If you would like to continue down the seasonal decomposition path, you could try the double seasonal model that Facebook released relatively recently. It is specifically designed to work well with daily data, where it models intra-year and intra-week seasonalities. Perhaps it can be adapted to your problem.
fbprophet
The shortcoming of seasonal decomposition models is that they do not capture how a season might change through time. For example, the characteristics of a week of electricity demand in the summer are much different from those in the winter. This method will determine an average seasonal pattern and leave the remaining information in the residuals. So given that your characteristics vary by day of week (mentioned in your other post), this won't capture that.
If you'd like to send me your data, I'd be interested in taking a look. Given my experience, you've jumped into the deep end of time-series forecasting that doesn't necessarily have an easy to use off-the-shelf solution. If you do provide it, please also clarify what your objective is:
Are you attempting to forecast ahead, and if so, how many 2 minute intervals?
Do you require confidence intervals, monte carlo results, or neither?
How do you measure accuracy of the model's performance? How 'good' does it need to be?
