This is a univariate time series prediction problem. As the code below shows, I divide the initial data into a training set (trainX) and a test set (testX), then I create an LSTM network with Keras and train the model on the training set. However, to get predictions I have to supply the test values, so my question is: why do I have to predict at all when I already know the true values, i.e. the test set? What I actually want is to predict values at future times. If I have misunderstood LSTM networks, please tell me.
Thank you!
from keras.models import Sequential
from keras.layers import LSTM, Dense

# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)

# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
Since we don't have future values available while training the model, we divide the data into train and test sets and treat the test set as if it were the future. We train our model on the train set (usually together with a validation set), and after the model is trained, we test it on the test set to check the model's performance.
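Note that for a time series this split should be chronological, not shuffled. A minimal sketch, assuming dataset is a NumPy array holding the series:

# keep the last 20% of the series as the held-out "future"
split = int(len(dataset) * 0.8)
train, test = dataset[:split], dataset[split:]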
why do I have to predict at all when I already know the true values, i.e. the test set? What I actually want is to predict values at future times.
In ML, we give the model test data X and it returns Y. In the case of time series this can mislead a beginner a bit, because the input is X and the output is apparently X as well. The difference here is that we feed in old values of the time series as X, and the output Y is a value of the same time series, but at a point we are predicting in the future (the same idea can be applied to the present or even the past), as you have correctly identified.
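To make that concrete, here is a minimal sketch of how a series is commonly turned into supervised (X, y) pairs with a look-back window (the helper name is illustrative, not from the original post):

import numpy as np

# X is a window of past values, y is the value right after that window
def create_dataset(series, look_back=1):
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back])
    return np.array(X), np.array(y)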
(P.S.: I would recommend starting with simple regression and then moving on to LSTMs etc., if your goal is to learn machine learning.)
I think the correct term in this context is 'Forecasting'.
A good explanation is: after you train and test your model with the data you already had (as others have said here before me), you want to predict future data, which is, I think, the truly interesting thing about recurrent networks.
So in order to do this, you start by predicting the value one step after the final date in your original dataset, using the model (which was trained on this past data). Once you have predicted that value, you do the same thing again, but now including the last predicted value as input, and so on.
The fact that you are using predictions to make further predictions means it is much harder to get good results, so it is common to try to predict only short ranges of time.
The exact code you need to do this may vary, but I think this is the core concept; see the sketch below.
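As a rough sketch of that loop, assuming a model trained on inputs shaped (samples, 1, look_back) as in the question above:

import numpy as np

window = trainX[-1].copy()                  # last known window, shape (1, look_back)
forecast = []
for _ in range(15):                         # predict 15 steps into the future
    next_val = model.predict(window[np.newaxis, ...])[0, 0]
    forecast.append(next_val)
    window = np.roll(window, -1, axis=1)    # drop the oldest value
    window[0, -1] = next_val                # append the new prediction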
In the link below, in the last part, where a forecast is performed, the author shows the code and an explanation of how he did it.
https://towardsdatascience.com/time-series-forecasting-with-recurrent-neural-networks-74674e289816
I guess that's it.
Related
I'm done training my LSTM model and it produced a really good result. But I am wondering: can it only predict values in the x_test dataset? I usually write model.predict(x_test). But what if I want to predict the actual future, which hasn't happened yet for the testing set? What should I do?
I am trying to build a machine learning model which predicts a single number from a series of numbers. I am using an LSTM model with Tensorflow.
You can imagine my dataset to look something like this:
Index   x data                       y data
0       np.array(shape (10000,1))    numpy.float32
1       np.array(shape (10000,1))    numpy.float32
2       np.array(shape (10000,1))    numpy.float32
...     ...                          ...
56      np.array(shape (10000,1))    numpy.float32
Simply put, I just want my model to predict a number (y data) from a sequence of numbers (x data).
For example like this:
array([3.59280851, 3.60459062, 3.60459062, ...]) => 2.8989773
array([3.54752101, 3.56740332, 3.56740332, ...]) => 3.0893357
...
x and y data
From my x data I created a numpy array x_train which I want to use to train the network.
Because I am using an LSTM network, x_train should be of shape (samples, time_steps, features).
I reshaped my x_train array to (57, 10000, 1), because I have 57 samples, each of length 10000, containing a single number per time step.
The y data was created similarly and has shape (57, 1) because, once again, I have 57 samples, each containing a single number as the desired y output.
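For illustration, building arrays with those shapes could look like this (x_list and y_list are hypothetical names for the raw data):

import numpy as np

# (samples, time_steps, features) = (57, 10000, 1)
x_train = np.stack(x_list).reshape(57, 10000, 1).astype(np.float32)
y_train = np.array(y_list, dtype=np.float32).reshape(57, 1)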
Current model attempt
My model summary looks like this:
The model was compiled with model.compile(loss="mse", optimizer="adam") so my loss function is simply the mean squared error and as an optimizer I'm using Adam.
Current results
Training of the model works fine and I can see that the loss and validation loss decreases after some epochs.
The actual problem occurs when I want to predict some data y_verify from some data x_verify.
I do this after the training is finished to determine how well the model is trained.
In the following example I simply used the data I used for training to determine how well the model is trained (I know about overfitting and that verifying with the training set is not the right way of doing it, but that is not the problem I want to demonstrate right now).
In the following graph you can see the y data I provided to the model in blue.
The orange line is the result of calling model.predict(x_verify) where x_verify is of the same shape as x_train.
I also calculated the mean absolute percentage error (MAPE) between my prediction and the actual data, and it came out to around 4%, which is not bad given that I only trained for 40 epochs. But this result is still not helpful at all, because as you can see in the graph above, the curves do not match at all.
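(For reference, the MAPE I computed is just the mean relative error; a minimal version, assuming y_true and y_pred arrays:)

import numpy as np

# mean absolute percentage error, in percent
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100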
Question:
What is going on here?
Am I using an incorrect loss function?
Why does it seem like the model tries to predict a single value for all samples, rather than predicting a different value for each sample as it's supposed to?
Ideally the prediction should be the y data which I provided so the curves should look the same (more or less).
Do you have any ideas?
Thanks! :)
After some back and forth in the comments, I'll give my best assessment of your questions:
What is going on here?
Very complex (too many layers deep) model with very little data, trained for too few epochs on non-normalized data (credit to Muhammad in his answer). The biggest issue, as far as I can tell, is the number of training epochs.
Am I using an incorrect loss function?
MSE is an appropriate loss function for a regression task.
Why does it seem like the model tries to predict a single value for all samples rather than predicting a different value for all samples like it's supposed to be? Ideally the prediction should be the y data which I provided so the curves should look the same (more or less). Do you have any ideas?
Too few training epochs is the biggest contributor, as far as I can tell.
Based on the Colab notebook that Luca shared:
30 Epochs, no normalization
Way off target, with flat predictions (though I can't reproduce predictions as flat as the ones Luca posted).
30 Epochs, with normalization
Worse off.
2000(!) epochs, no normalization
Okay, now the predictions are at least in the ballpark
2000 epochs, with normalization
And now the model seems to be starting to figure things out, like we'd hope it should. Granted, this is training on the 11 samples that were cobbled together in the notebook, so it's naturally going to overfit. We're just happy to see it learn something.
2000 epochs, normalization, different loss
Never be afraid to try out different losses, as some may be better suited than others. Not knowing the domain of this task, I'm just trying out mean_absolute_error instead of mean_squared_error.
Caution! Don't compare loss values between different losses. They're not on the same scale.
2000 epochs, normalization, larger learning rate
Okay, so it's taking a long time to learn. Can I nudge it along a little faster? Sure, up the learning rate of the optimizer, and it'll get you to where you're going faster. Here, we up it by a factor of 5.
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.005))
You could even employ a learning rate scheduler that starts big and slowly diminishes it over the course of epochs.
import tensorflow as tf

def scheduler(epoch, lr):
    # keep the learning rate constant for the first 400 epochs, then decay it
    if epoch < 400:
        return lr
    else:
        return lr * tf.math.exp(-0.01)

lrs = tf.keras.callbacks.LearningRateScheduler(scheduler)
history = model.fit(x=x_train, y=y_train, epochs=1000, callbacks=[lrs])
Hope this all helps!
From the notebook it seems you are not scaling your data. You should normalize or standardize your data before training your model.
https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/
You can also add a normalization layer in Keras: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization
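A minimal sketch of that layer in use (the LSTM/Dense layers here are placeholders standing in for the model from the question):

import tensorflow as tf

# learn mean and variance from the training data, then normalize inputs in-model
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(x_train)

model = tf.keras.Sequential([
    norm,
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])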
I just wanted to post a quick update.
First of all, this is my current result:
I am absolutely happy, that I was finally able to achieve what I wanted to. At least to some extent.
There were some steps I had to take to achieve this result:
Normalization
Training for 500-1000 epochs
Most importantly: Reducing the amount of time steps to 1000
In the end my thought of "the more data, the better" was a huge misconception. I was not able to achieve such results with 10000 time steps per sample AT ALL. So I'm glad that I just gave 1000 a shot.
Thank you all very much for your answers!
I will try to further improve my model with your suggestions :)
I think it would be helpful to change the loss to Huber loss and perhaps change the optimizer to SGD. Then, first try to find the best learning rate using a callback (a learning rate schedule), because of the small dataset, and also normalize or standardize the data before training the model.
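A sketch of that suggested setup (the learning rate is just a starting value to tune with the schedule):

import tensorflow as tf

# Huber loss is less sensitive to outliers than plain MSE
model.compile(loss=tf.keras.losses.Huber(),
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01))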
I've been trying to create a stateful LSTM model with keras, and I pretty much figured out the training part, but I don't get the predicting part.
So, let's imagine that we had 10000 time-series datapoints. We use the first 9000 for training and the remaining 1000 for testing. As we start training, we set the window length to 2 and slide the window forward, setting the input (X) to the first datapoint and the output (y) to the second datapoint.
And as we train, the model converges thanks to its stateful nature. Finally we finish training.
Now, we are left with a model, and some test data. The problem begins here. We test the first datapoint.
It returns a guessed value. Nice.
We test the second datapoint of the test set.
We get an output. But the problem is that, because we are using a stateful model and we only gave one value as input, the only way the model can figure out the next value is from its memory of the previous time series.
But since we didn't train on the first datapoint of the test set, the time series is broken, and the model will treat the second datapoint of the test set as if it were the first!
So, my question is,
does Keras take care of this and automatically train the network as it's predicting?
or do I have to train the net as I am predicting?
or is there some other reason that lets me just keep predicting without training the model further?
For a stateful LSTM, it will retain information in its cells as you predict. If you were to take any random point in the train or test dataset and repeatedly predict on it, your answer would change each time, because the model keeps seeing this data and uses it every time it predicts. The only way to get a repeatable answer is to call reset_states().
You should be calling reset_states() after each training epoch, and when you save the model, those cells should be empty. Then if you want to start predicting on the test set, you can predict on the last n training points (without saving the values anywhere), then start saving values once you get to your first test point.
It is often good practice to seed the model before prediction. If I want to evaluate on test_set[10:20,:], I can let the model predict on test_set[:10,:] first to seed the model then start saving my predicted values once I get to the range I am interested in.
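A minimal sketch of that seeding pattern (assuming a stateful model and a test_set shaped like the training inputs):

model.reset_states()                        # clear state left over from training
_ = model.predict(test_set[:10, :])         # seed: run but discard these outputs
preds = model.predict(test_set[10:20, :])   # now keep the predictions we care about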
To address the further-training question: you do not need to train the model further in order to predict. Training is only for tuning the model's weights. Look into this blog for more information on stateful vs. stateless LSTMs.
A practice problem, based on whether (and with what accuracy/probability) an Uber ride gets completed after being ordered, has the following features:
Available Drivers    int64
Placed Time          float64
Response Distance    float64
Car Type             int32
Day Of Week          int64
Response Delay       float64
Order Completion     int32    [target]
My approach has been to use tf.Keras Sequential to predict the target. Here's what it looks like:
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=input_shape),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

adam_optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
binary_crossentropy_loss = tf.keras.losses.BinaryCrossentropy()

model.compile(optimizer=adam_optimizer,
              loss=binary_crossentropy_loss,
              metrics=['accuracy'])

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=ES_PATIENCE)

history = model.fit(train_dataset, validation_data=validation_dataset, epochs=EPOCHS, verbose=2,
                    callbacks=[early_stop])
I normalize the data like this (note that train_data is a dataframe):
train_data = tf.keras.utils.normalize(train_data)
And then for predicting,
predictions = model.predict_proba(prediction_dataset, batch_size=None)
Training results:
loss: 0.3506 - accuracy: 0.8817 - val_loss: 0.3493 - val_accuracy: 0.8773
But this still gives me poor-quality probabilities for the corresponding occurrences. Is this the wrong approach?
What approach would you suggest for a problem like this, and am I doing it completely wrong? Are neural networks a bad idea for this solution? Thanks a lot!
As you framed the problem, this is a classic machine learning classification problem.
Given N features(independent variables) you have to predict 1(one) dependent variable.
The way in which you constructed the neural network is good.
Since you have a binary classification problem, the sigmoid activation is the correct one.
With respect to the complexity of your model (number of layers, number of neurons per layer) it depends very much on your dataset.
If you have a comprehensive dataset with a lot of features and a lot of examples (an example is a row in the dataframe with X1, X2, X3, ..., Y, where the Xs are the features and Y is the dependent variable), your model can vary in complexity.
If you have a small dataset with a few features, a small model is recommended. Always begin with a small model.
If you run into the issue of underfitting (poor accuracy on the training set and also on the validation and test set), you can gradually increase the complexity of the model (add more layers, add more neurons per layer).
If you run into the issue of overfitting, implementing regularisation techniques may help (Dropout, L1/L2 Regularisation, Noise Addition, Data Augmentation).
What you have to take into consideration is that, if you have a small dataset, a classical machine learning algorithm could outperform the deep learning model. This happens because neural networks are very 'hungry': compared to classical machine learning models, they require much more data in order to work properly. You could choose SVM / kernel SVM / Random Forest / XGBoost and other similar algorithms.
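For instance, a classical baseline might look like this (a sketch; X and y are hypothetical arrays of your features and target, and the hyperparameters are placeholders):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# predicted probability that a ride gets completed
proba = clf.predict_proba(X_test)[:, 1]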
EDIT!
'Whether or not, and with what accuracy/probability' automatically splits the problem into two parts, not just a simple classification one.
What I would personally do is the following. Since probabilities lie between 0% and 100%, if you had probability as a feature in your X columns (which you don't), then, depending on the number of data points (rows) you have, you could assign a label to each probability band: 1 for (0%, 25%], 2 for (25%, 50%], 3 for (50%, 75%], 4 for (75%, 100%]. But that depends exclusively on having that prior probability information (i.e. probability as a feature). Then, if you inferred label 3, you would know the probability range of the ride being completed.
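That binning could be written as follows (a sketch; prob is a hypothetical array of probabilities in [0, 1]):

import numpy as np

# map probabilities to labels 1..4 by 25% bands
labels = np.digitize(prob, bins=[0.25, 0.50, 0.75]) + 1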
Otherwise, you cannot frame your current problem as both a classification and a probability one.
I hope that I have given you an introductory insight. Happy coding.
If you are doing classification, you may want to look into ensemble methods (forests, boosts, etc.)
If you are calculating probability, you may want to look into probabilistic graphical models (Bayesian networks, etc.)
I have a dataset of length 1400.
I did all the preprocessing and trained an LSTM model using Keras in Python, in an attempt to predict future points. My trained model learns well.
After compiling the model, I added a new random test set. My intent is to predict the unseen future, for example 15 days, that is not included in the training data. When I add test data with a constant value, my predictions become constant.
But when I add the real values that I am trying to predict, my model fits this test data well.
So how can I handle this?
Why does the model's prediction change depending on the test data?
How can I predict the next 15 days that are not included in my training set ?
How can I predict the unseen future?
If LSTM models only work on known train and test sets, why should I use them?