I followed this guide (https://alkaline-ml.com/pmdarima/usecases/stocks.html) on pmdarima's auto_arima and applied it to my own data.
So I created a model, fitted it with the training data and made a forecast with the test data.
Now I want to make a prediction for x days into the future.
Can I use the same model or should I create a new model on the whole data set?
On other websites (Medium) I have always seen that the whole data set is used. So why do you have to create a test set and a training set?
I have tried both methods and forecast x days ahead, but got very different results.
When you create your model, you need some way to evaluate how well it performs, and that's why you split the data into training and test sets. Once you have found a model with sufficient performance (in the case of ARIMA, a suitable order), you can retrain it on the entire data set and make predictions for the future.
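For example, a minimal sketch with pmdarima (assuming y is your full series as a 1-D numpy array; the 30-point holdout and the 15-day horizon are placeholders):

import pmdarima as pm
import numpy as np

train, test = y[:-30], y[-30:]  # hold out the last 30 points

# search for a suitable order on the training data only
model = pm.auto_arima(train, seasonal=False, suppress_warnings=True)
test_forecast = model.predict(n_periods=len(test))
print("test MAE:", np.mean(np.abs(test_forecast - test)))

# once the order looks good, refit that order on the entire series
final_model = pm.ARIMA(order=model.order).fit(y)
future = final_model.predict(n_periods=15)  # forecast 15 days ahead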
Related
I've been trying to create a stateful LSTM model with keras, and I pretty much figured out the training part, but I don't get the predicting part.
So, let's imagine that we have 10000 time-series datapoints. We use the first 9000 for training and the other 1000 for testing. As we train, we set the window length to 2 and slide the window forward, setting the input (X) to the first datapoint of each window and the output (y) to the second.
And as we train, the model converges because of its stateful nature. Finally, we finish training.
Now, we are left with a model, and some test data. The problem begins here. We test the first datapoint.
It returns a guessed value. Nice.
We test the second datapoint of the test set.
We get an output. But the problem is that because we are using a stateful model and only feed in one value as input, the only way the model can figure out the next value is from its memory of the previous time series.
But since we didn't train the model on the first datapoint of the test set, the time series is broken, and the model will treat the second datapoint of the test set as if it were the first!
So, my question is:
does Keras take care of this and automatically train the network as it's predicting?
or do I have to train the net as I am predicting?
or is there some other reason that lets me keep predicting without training the model further?
A stateful LSTM will retain information in its cells as you predict. If you were to take any random point in the train or test dataset and repeatedly predict on it, your answer would change each time, because the model keeps seeing this data and uses it every time it predicts. The only way to get a repeatable answer is to call reset_states().
You should be calling reset_states() after each training epoch, and when you save the model, those cells should be empty. Then if you want to start predicting on the test set, you can predict on the last n training points (without saving the values anywhere), then start saving values once you get to your first test point.
It is often good practice to seed the model before prediction. If I want to evaluate on test_set[10:20,:], I can let the model predict on test_set[:10,:] first to seed the model then start saving my predicted values once I get to the range I am interested in.
To address the further-training question: you do not need to train the model further in order to predict. Training is only for tuning the model's weights. Look into this blog for more information on stateful vs. stateless LSTMs.
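A minimal sketch of the seeding idea described above, assuming model is a stateful Keras LSTM built with batch_input_shape=(1, 1, 1) and that train and test are 1-D arrays of already-scaled values:

import numpy as np

model.reset_states()  # start from an empty cell state

# warm up the state on the last 10 training points, discarding the outputs
for x in train[-10:]:
    model.predict(np.array(x).reshape(1, 1, 1), verbose=0)

# start saving predictions once we reach the test range
preds = []
for x in test:
    preds.append(model.predict(np.array(x).reshape(1, 1, 1), verbose=0)[0, 0])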
Say I have created a random forest regression model using the train/test data available to me.
This contains feature scaling and categorical data encoding.
Now, if I get a new dataset on a new day and I need to use this model to predict the outcome of this new dataset and compare it with the new dataset's outcome that I have, do I need to apply feature scaling and categorical data encoding to this dataset as well?
For example: on day 1, I have 10K rows with 6 features and 1 label -- a regression problem.
I built a model using this.
On day 2, I get 2K rows with same features and label but of course the data within it would be different.
Now, I first want to use this model on the day 2 data to predict what the label should be according to my model.
Secondly, using this result, I want to compare the model's outcome against the original day 2 labels that I have.
So in order to do this, when I pass the day 2 features to the model as the test set, do I need to first apply feature scaling and categorical data encoding to them?
This is partly about making predictions and validating them against the received data in order to assess that data's quality.
You always need to pass data to the model in the format it expects. If the model has been trained on scaled, encoded, ... data, you need to perform all of those transformations every time you push new data into the trained model (for whatever reason).
The easiest solution is to use sklearn's Pipeline to create a pipeline with all those transformations included, and then use it, instead of the model itself, to make predictions for new entries, so that all those transformations are applied automatically.
Example - automatically applying StandardScaler's scaling before passing data into the model:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# then
pipe.fit(X_train, y_train)   # the scaler is fitted on the training data only
pipe.score(X_test, y_test)   # the same scaling is applied before scoring
pipe.predict(X_new)          # ...and before predicting on new entries
The same holds for the dependent variable. If you scaled it before you trained your model, you will need to scale the new values as well, or apply the inverse operation to the model's output before comparing it with the original dependent variable values.
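A minimal sketch of that inverse operation, assuming a hypothetical fitted model and a StandardScaler applied to the target during training:

from sklearn.preprocessing import StandardScaler

y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))
# ... train the model on (X_train, y_train_scaled) ...

# bring the model's output back to the original units before comparing
y_pred = y_scaler.inverse_transform(model.predict(X_new).reshape(-1, 1))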
Using the following guide, I've made an sklearn regression model for doing time series forecasting. I'm able to use the model to get predictions on a set of test data where I have the timestamps as well as the independent variable data, since the model just takes those variables and gives the output labels as predictions.
However, I'm not sure how, or even if, I can use this model to do out-of-sample predictions, where I only have a future timestamp and none of the independent variable data that goes with it. Is there some sort of recursive method where the model can use data from a test set, make a prediction, then use the prediction and the data to make the next prediction, and so on? Thanks!
Yes, but it depends on whether you want to do single-step or multi-step forecasts.
For single-step forecasts, as you describe, use the last available window of your data as input to the prediction function; this returns the forecast for the first step ahead.
For multi-step forecasts, you have three options:
Direct: Fit one regressor for each step ahead and let each fitted regressor make a prediction with the last available window,
Recursive: Use the last available window to make the first step prediction, then use that prediction to roll the window forward and predict again (see the sketch after this list).
DirRec: A combination of the above strategies, where instead of rolling the window you expand it with the previously predicted value; note, however, that this requires fitting the regressors accordingly.
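A minimal sketch of the recursive strategy, assuming model is a fitted sklearn regressor trained on lag windows of length window, and history is a 1-D array ending at the last observed value:

import numpy as np

def recursive_forecast(model, history, window, steps):
    values = list(history[-window:])
    preds = []
    for _ in range(steps):
        x = np.array(values[-window:]).reshape(1, -1)  # current window
        yhat = model.predict(x)[0]                     # one step ahead
        preds.append(yhat)
        values.append(yhat)  # roll the window forward with the prediction
    return np.array(preds)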
You can find more details in:
Bontempi, Gianluca, Souhaib Ben Taieb, and Yann-Aël Le Borgne. "Machine learning strategies for time series forecasting." European business intelligence summer school. Springer, Berlin, Heidelberg, 2012.
Also note that you have to be careful to appropriately evaluate your model. The train and test sets are not independent in this setting, as they represent measurements at subsequent time points of the same variable. So you have to account for the potential auto-correlation.
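One way to respect this dependence is to evaluate with sklearn's TimeSeriesSplit, which never tests on points that precede the training fold (X and y here are assumed to be lag-window features and targets, and model some estimator):

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])         # train only on earlier points
    print(model.score(X[test_idx], y[test_idx]))  # evaluate on later points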
from sklearn.model_selection import train_test_split
I have a question about the train_test_split function from sklearn. First, why do we split the data? And where do we get the testing data from? Do we just chop the data in half and use some of it to train and some of it to test? That doesn't make sense to me, since the data is already filled in with labels. If it is filled in, then what are we predicting now? I need help!
First, why do we split the data?
We split the data to isolate a portion of it for validation purposes. We use the non-isolated portion to fit the algorithm, then test the algorithm against the isolated portion.
where do we get the testing data from?
The testing data is actually part of your original dataset.
Do we just chop the data in half?
Not exactly in half; we usually hold out 20-40% of the data for testing.
If it is filled in, then what are we predicting now?
You are actually not trying to predict the result directly. You are training the algorithm to fit the training set and using the testing set to see how accurate the algorithm is.
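A minimal sketch, assuming a hypothetical feature matrix X, label vector y, and some estimator model:

from sklearn.model_selection import train_test_split

# hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model.fit(X_train, y_train)             # learn from the training portion
accuracy = model.score(X_test, y_test)  # evaluate on the unseen portion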
Your original dataset should be split up into training and testing data. For example, 80% of the data could be used for training and 20% for testing. The data is split so that there is held-out data on which to evaluate how well the model performs on unseen data.
Training: This data is used to build your model. E.g. finding the optimal coefficients in a Linear Regression model, or using the CART algorithm to create a Decision Tree.
Testing: This data is used to see how the model performs on unseen data, as it would in a real world situation. This data should be left completely unseen until you would like to test your model to evaluate performance.
Extra notes on validation data:
To tune your model (for example, finding the best max_depth value for a decision tree), the training data should be further split into training and validation data. K-Folds Cross Validation can be used here: the model is trained on the training folds and evaluated on the validation fold, and this is repeated multiple times across the folds.
The results (e.g. MSE, F1, etc.) on each fold can then be used to tune the hyperparameters. Using cross-validation to tune the hyperparameters ensures that the model does not overfit to the test data.
Once the model is tuned, it can then be applied to the test data.
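A minimal sketch of that workflow with sklearn (the estimator and parameter grid are illustrative assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# tune max_depth with 5-fold cross-validation on the training data only
grid = GridSearchCV(DecisionTreeRegressor(),
                    param_grid={'max_depth': [2, 4, 6, 8]},
                    cv=5)
grid.fit(X_train, y_train)

# only after tuning is the held-out test set touched
print(grid.best_params_, grid.score(X_test, y_test))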
I have a data set of length 1400.
I performed all the preprocessing steps and trained an LSTM model using Keras in Python, in an attempt to predict future points. My trained model learns well.
After compiling the model, I added a new random test set. My intent is to predict the unseen future, for example the next 15 days, which are not included in the training data. When I add test data with a constant value, my predictions become constant.
But when I add the real values that I am trying to predict, my model fits this test data well.
So how can I handle this?
Why do the model's predictions change depending on the test data?
How can I predict the next 15 days that are not included in my training set?
How can I predict the unseen future?
If LSTM models only work on known train and test sets, why should I use them?