Using the following guide, I've made an sklearn regression model for time series forecasting. I'm able to use the model to get predictions on a set of test data where I have the timestamps as well as the independent variable data, since the model just takes those variables and gives the output labels as predictions.
However, I'm not sure how, or even whether, I can use this model to make out-of-sample predictions, where I only have a future timestamp and none of the independent variable data that goes with it. Is there some sort of recursive method where the model can use data from a test set, make a prediction, then use the prediction and the data to make the next prediction, and so on? Thanks!
Yes, but it depends on whether you want to do single-step or multi-step forecasts.
For single-step forecasts, as you describe, use the last available window of your data as input to the prediction function; this returns the forecasted value one step ahead.
For multi-step forecasts, you have three options:
Direct: Fit one regressor for each step ahead and let each fitted regressor make a prediction with the last available window,
Recursive: Use the last available window to make the first-step prediction, then use that prediction to roll the window forward and predict again (see the sketch after this list).
DirRec: A combination of the above strategies, where instead of rolling the window you expand it with the previously predicted value; note, however, that this requires fitting the regressors accordingly.
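Below is a minimal sketch of the recursive strategy, assuming a fitted sklearn regressor (here just called model) that was trained on lagged windows of length window; all names are illustrative:

import numpy as np

# Recursive multi-step forecasting: predict one step, then roll the
# window forward with the prediction and repeat.
def recursive_forecast(model, history, window, horizon):
    history = list(history[-window:])                    # last available window
    predictions = []
    for _ in range(horizon):
        x = np.array(history[-window:]).reshape(1, -1)   # shape (1, window)
        y_hat = model.predict(x)[0]                      # one-step-ahead forecast
        predictions.append(y_hat)
        history.append(y_hat)                            # roll the window
    return predictions

With horizon=1 this reduces to the single-step case described above.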
You can find more details in:
Bontempi, Gianluca, Souhaib Ben Taieb, and Yann-Aël Le Borgne. "Machine learning strategies for time series forecasting." European Business Intelligence Summer School. Springer, Berlin, Heidelberg, 2012.
Also note that you have to be careful to evaluate your model appropriately. The train and test sets are not independent in this setting, as they represent measurements of the same variable at subsequent time points, so you have to account for the potential autocorrelation, e.g. by using time-ordered splits rather than random ones.
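One way to do this with sklearn is TimeSeriesSplit, which keeps every validation fold strictly after its training fold; a minimal sketch with stand-in data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # stand-in for lagged-window features
y = rng.normal(size=200)             # stand-in for the target series

tscv = TimeSeriesSplit(n_splits=5)   # splits respect temporal order
for train_idx, test_idx in tscv.split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    print(f"fold MAE: {mae:.3f}")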
I followed this guide (https://alkaline-ml.com/pmdarima/usecases/stocks.html) with pmdarima's auto_arima and applied it to my own data.
So I created a model, fitted it with the training data and made a forecast with the test data.
Now I want to make a prediction for x days into the future.
Can I use the same model or should I create a new model on the whole data set?
On other websites (Medium) I have always seen that the whole data set is used. So why do you have to create a test set and a training set?
I have tried both methods and forecast x days ahead, but got very different results.
When you create your model, you need some way to evaluate how well it performs; that's why you split the data into training and test sets. Once you have found a model (in the case of ARIMA, the order) with sufficient performance, you can train it on the entire data set and make predictions for the future.
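A minimal sketch of that workflow with pmdarima, using a synthetic stand-in series and x = 7 purely for illustration:

import numpy as np
import pmdarima as pm

series = np.cumsum(np.random.randn(200))       # stand-in for your data
train, test = series[:-30], series[-30:]

# 1) Use the train/test split only to pick and validate the order.
model = pm.auto_arima(train, seasonal=False, suppress_warnings=True)
holdout = model.predict(n_periods=len(test))   # evaluate against `test`

# 2) Once the order is settled, refit it on the whole series and
#    forecast x days past the last observation.
final = pm.ARIMA(order=model.order).fit(series)
future = final.predict(n_periods=7)

This also helps explain the different results you saw: the model refitted on all the data forecasts from the end of the full series, with more recent information, while the train-only model forecasts from the end of the training period.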
I'm implementing LightGBM (Python) into a continuous learning pipeline. My goal is to train an initial model and update the model (e.g. every day) with newly available data.
Most examples load an already trained model and apply train() once again:
updated_model = lightgbm.train(params=last_model_params, train_set=new_data, init_model=last_model)
However, I'm wondering if this is actually the correct way to approach continuous learning within the LightGBM library, since the number of fitted trees (num_trees()) grows by n_estimators with every application of train(). To my understanding, a model update should take an initial model definition (under a given set of model parameters) and refine it without ever growing the number of trees or the size of the model definition.
I find the documentation regarding train(), update() and refit() not particularly helpful. What would be considered the right approach to implement continuous learning with LightGBM?
In lightgbm (the Python package for LightGBM), the entrypoints you've mentioned do have different purposes.
The main lightgbm model object is a Booster. A fitted Booster is produced by training on input data. Given an initial trained Booster...
Booster.refit() does not change the structure of an already-trained model. It just updates the leaf counts and leaf values based on the new data. It will not add any trees to the model.
Booster.update() will perform exactly 1 additional round of gradient boosting on an existing Booster. It will add at most 1 tree to the model.
train() with an init_model will perform gradient boosting for num_iterations additional rounds. It also allows for lots of other functionality, like custom callbacks (e.g. to change the learning rate from iteration-to-iteration) and early stopping (to stop adding trees if performance on a validation set fails to improve). It will add up to num_iterations trees to the model.
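A small sketch contrasting the three entrypoints on synthetic stand-in data (all parameter values are illustrative):

import numpy as np
import lightgbm

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(500, 10)), rng.normal(size=500)
X_new, y_new = rng.normal(size=(100, 10)), rng.normal(size=100)

params = {"objective": "regression", "num_leaves": 7, "verbose": -1}
booster = lightgbm.train(params, lightgbm.Dataset(X_old, label=y_old),
                         num_boost_round=50,
                         keep_training_booster=True)  # so update() can keep boosting

# refit(): same tree structure, leaf values recomputed on new data -> still 50 trees.
refitted = booster.refit(X_new, y_new)

# update(): exactly one more boosting round on the Booster's training data.
booster.update()

# train(init_model=...): up to num_boost_round extra trees fit on the new data.
continued = lightgbm.train(params, lightgbm.Dataset(X_new, label=y_new),
                           num_boost_round=10, init_model=booster)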
What would be considered the right approach to implement continuous learning with LightGBM?
There are trade-offs involved in this choice and no one of these is the globally "right" way to achieve the goal "modify an existing model based on newly-arrived data".
Booster.refit() is the only one of these approaches that meets your definition of "refine [the model] without ever growing the amount of trees/size of the model definition". But it could lead to drastic changes in the predictions produced by the model, especially if the batch of newly-arrived data is much smaller than the original training data, or if the distribution of the target is very different.
Booster.update() is the simplest interface for this, but a single iteration might not be enough to get most of the information from the newly-arrived data into the model. For example, if you're using fairly shallow trees (say, num_leaves=7) and a very small learning rate, even newly-arrived data that is very different from the original training data might not change the model's predictions by much.
train(init_model=previous_model) is the most flexible and powerful option, but it also introduces more parameters and choices. If you choose to use it, pay attention to the parameters num_iterations and learning_rate. Lower values of these parameters will decrease the impact of newly-arrived data on the trained model; higher values will allow a larger change to the model. Finding the right balance between those is a concern for your evaluation framework.
I've been trying to create a stateful LSTM model with keras, and I pretty much figured out the training part, but I don't get the predicting part.
So, let's imagine that we had 10000 time-series datapoints. We use the first 9000 for training and the other 1000 for testing. As we start training, we set the window length to 2 and slide the window forward, setting the input (X) to the first datapoint and the output (y) to the second datapoint.
And as we train, the model converges because of its stateful nature. Finally we finish training.
Now, we are left with a model, and some test data. The problem begins here. We test the first datapoint.
It returns a guessed value. Nice.
We test the second datapoint of the test set.
We get an output. But the problem is that because we were using a stateful model and we only gave one value as input, the only way the model can figure out the next value is from its memory of the previous time steps.
But since we didn't train on the first datapoint of the test set, the time series is broken, and the model will treat the second datapoint of the test set as if it were the first!
So, my question is,
does keras take care of this and automatically train the network as it's predicting?
or do I have to train the net as I am predicting?
or is there some other reason that enables me to just keep predicting without training the model farther?
For a stateful LSTM, it will retain information in its cells as you predict. If you were to take any random point in the train or test dataset and repeatedly predict on it, your answer would change each time, because the model keeps seeing this data and uses it every time it predicts. The only way to get a repeatable answer is to call reset_states().
You should be calling reset_states() after each training epoch, and when you save the model, those cells should be empty. Then if you want to start predicting on the test set, you can predict on the last n training points (without saving the values anywhere), then start saving values once you get to your first test point.
It is often good practice to seed the model before prediction. If I want to evaluate on test_set[10:20,:], I can let the model predict on test_set[:10,:] first to seed the model, then start saving my predicted values once I get to the range I am interested in.
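A minimal sketch of that seeding pattern, assuming a TF2-style Keras setup; the model definition and data here are hypothetical stand-ins for the ones in the question:

import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.LSTM(32, stateful=True, batch_input_shape=(1, 1, 1)),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")
test_set = np.random.randn(1000).astype("float32")  # stand-in test data

model.reset_states()  # start from an empty cell state
# Seed on the points before the range of interest; discard these outputs.
for value in test_set[:10]:
    model.predict(np.reshape(value, (1, 1, 1)), verbose=0)
# The cell state now reflects test_set[:10]; keep predictions from here on.
predictions = [model.predict(np.reshape(v, (1, 1, 1)), verbose=0)[0, 0]
               for v in test_set[10:20]]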
To address the further-training question: you do not need to train the model further in order to predict. Training only tunes the model's weights. Look into this blog for more information on stateful vs. stateless LSTMs.
I am trying to get the confidence intervals from an XGBoost saved model in a .tar.gz file that is created using python XGBoost library.
The problem is that the model has already been fitted, and I don't have the training data any more; I just have inference or serving data to predict on. All the examples that I found entail using training and test data to create either quantile regression models or bagged models, but I don't think I have the chance to do that.
Why your desired approach will not work
I assume we are talking about regression here. Given a regression model that you cannot modify, I think you will not be able to achieve your desired result using only the given model. The model was trained to calculate a continuous value that approximates some objective value (i.e., its true value) based on some given input. Nothing more.
Possible solution
The only workaround I can think of would be to train two more models. These models' training goal would be to predict the quality of the output of your given model: one would calculate the upper bound of a given (i.e., predefined by you at training time) confidence interval and the other the lower bound. This would probably involve a lot of feature engineering; one would likely want to find features that correlate with the prediction quality of the original model.
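As one possible realization, here is a sketch using quantile regression for the two bound models (sklearn's GradientBoostingRegressor here; any quantile-capable regressor would do). It assumes you can obtain some labeled data to train the bound models on, which this approach requires in any case; all data below is a synthetic stand-in:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))         # stand-in features (these could
y = X[:, 0] + rng.normal(size=1000)    # include the original model's prediction)

lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

# At serving time these bracket the original model's point prediction
# with an approximate 90% interval.
x_new = rng.normal(size=(1, 5))
interval = (lower.predict(x_new)[0], upper.predict(x_new)[0])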
I am predicting stock prices using support vector regression. I have trained with some values, but every time I predict I also have to train on the new value (online learning). So I have passed the values to train inside the loop after predicting.
# inside the loop
# ... prediction ...
clf.fit(testx[i], testy[i])
So when I call the fit function every time, how does SVR training work internally based on one input?
clf.fit is not incremental. Unfortunately, you have to pass all the previous training points in addition to the new instance to re-train a new model that benefits from the new data points.
This is a limitation of the SMO algorithm implemented by the libsvm library used internally in the sklearn.svm.SVR class.
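A sketch of what this implies in practice: accumulate every point you have seen and re-fit on the full history at each step (names follow the question's testx/testy; the data is a stand-in):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
testx = rng.normal(size=(50, 3))   # stand-in feature rows
testy = rng.normal(size=50)        # stand-in targets

clf = SVR()
seen_X, seen_y, predictions = [], [], []
for i in range(len(testx)):
    if seen_X:  # need at least one stored point before fitting
        clf.fit(np.array(seen_X), np.array(seen_y))
        predictions.append(clf.predict(testx[i].reshape(1, -1))[0])
    seen_X.append(testx[i])        # add the new point, re-train next step
    seen_y.append(testy[i])

Note that this re-training grows increasingly expensive over time; if true incremental updates are essential, a model supporting partial_fit (e.g. sklearn's SGDRegressor) is a different but workable trade-off.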