I have figured out how to train an LSTM using just values, but what would the data look like if I wanted to include the time? Perhaps input dimension of 2, with time as epoch seconds and normalized values? There may be time gaps in the data and I want the training to reflect that.
Assuming I only want to periodically train the LSTM, since this is an expensive operation, how would you predict values in the future with a gap between the last training time and the first predicted time? For example, let's say I trained the LSTM 3 days ago, but now I want to predict the values for the next day.
All my work so far is based on this article: http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/. But it doesn't cover these kinds of questions.
I think you can handle this situation when constructing your training set, at least if the time delay between the last value (in the input sequence) and the value to predict is fixed.
Let X_train have dimension (nb_samples, timesteps, input_dim) and y_train have dimension (nb_samples, output_dim). Let x be one training input sample. It corresponds to a multivariate time series with dimension (timesteps, input_dim). Its corresponding output is y with dimension (output_dim).
In y you put the value to predict, which can be 3 days after the last value in x; the LSTM "should" grasp the temporal dependency. So if the time delay between the last value in the input and the value to predict is fixed, this should work.
That was the case for such a problem: https://challengedata.ens.fr/en/challenge/9/prediction_of_transaction_volumes_in_financial_markets.html
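To make the fixed-delay idea concrete, here is a minimal NumPy sketch of how such a training set could be built. The helper name make_fixed_gap_samples and the parameters timesteps and gap are made up for illustration, and it assumes the series is sampled at regular intervals, so "3 days later" becomes a fixed number of steps.

```python
import numpy as np

def make_fixed_gap_samples(series, timesteps, gap):
    """Build (X, y) where each y lies `gap` steps after the last value of its input window."""
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series[:, None]  # treat a 1-D series as a single feature
    n_samples = len(series) - timesteps - gap + 1
    if n_samples <= 0:
        raise ValueError("series too short for the requested window and gap")
    # X: (nb_samples, timesteps, input_dim)
    X = np.stack([series[i:i + timesteps] for i in range(n_samples)])
    # y: (nb_samples, output_dim), taken `gap` steps after the window's last index
    y = np.stack([series[i + timesteps - 1 + gap] for i in range(n_samples)])
    return X, y

values = np.arange(20)  # toy series 0, 1, ..., 19
X, y = make_fixed_gap_samples(values, timesteps=4, gap=3)
print(X.shape, y.shape)
print(X[0].ravel(), y[0])
```

Here the first window covers indices 0 to 3 and its target is the value at index 6. If the gap is not constant, this construction does not apply as-is.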
I've taken a quick course in neural networks to better understand them and now I'm trying them out for myself in R. I'm following this documentation of Keras.
The way I understand what is happening:
We are inputting a series of images and transforming these images to numerical matrices based on the arrangement of the pixels and colors in those pixels. We then build a neural network model to learn the pattern of these arrangements, depending on the classification (0 to 9). We then use the model to predict which class an image belongs to. I'll be honest and admit I'm not entirely sure what y_train and x_train is. I simply see it as one training and one validation set so I'm not sure what the difference between x and y is.
My question:
I've followed the steps to the T and the model runs fine and the predictions look like they do in the documentation. Ultimately, the prediction looks like this:
I take this to mean that observation 1 in x_test is predicted to be a category 7.
However, looking at x_test it looks like this:
There is a 0 in every column and row, also if I scroll further down. This is where I get confused. I'm also not sure how I view the original images to view for myself how well they are predicting them. I would eventually like to draw a number myself in paint or so and then see if the model can predict it, but for that I need to first understand what is going on. I feel I am close but I just need a little nudge!
I think if you read more about the input and output layer's dimensions, that would help.
In your example:
Input layer:
A single training example of image has two dimensions 28*28, which is then converted to a single vector of dimension 784. This acts as the input layer for the neural network.
So for m training examples, your input layer will have dimensions (m, 784). By analogy with traditional ML systems, you can imagine that each pixel of an image is converted into a feature (x1, x2, ..., x784), and your training set is a dataframe with m rows and 784 columns, which is then fed into the neural network to compute y_hat = f(x1, x2, x3, ..., x784).
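The flattening step can be sketched in NumPy as follows. This is only an illustration in Python rather than R, with a made-up stand-in array instead of the real MNIST data.

```python
import numpy as np

m = 3  # number of example images (arbitrary for this sketch)
# Stand-in for m grayscale images of 28 x 28 pixels
images = np.arange(m * 28 * 28).reshape(m, 28, 28)

# Each image becomes one row of 784 pixel features
flat = images.reshape(m, 28 * 28)
print(flat.shape)

# The operation is reversible, which is why the original image can still be viewed
restored = flat.reshape(m, 28, 28)
print(np.array_equal(restored, images))
```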
Output layer:
As an output for our neural network, we want it to predict which number it is from 0 to 9. So for a single training example, the output layer has dimension 10, representing each number from 0 to 9 and for n testing examples the output layer would be a matrix with dimension n*10.
Our y is a vector of length n, something like [1,7,8,2,...], containing the true value for each testing example. To match the dimension of the output layer, y is converted using one-hot encoding. Imagine a length 10 vector whose positions stand for the digits 0 to 9: the number 7 is represented by putting a 1 in the 8th position and zeros everywhere else, i.e. [0,0,0,0,0,0,0,1,0,0].
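A small NumPy sketch of one-hot encoding for the digit labels (the helper one_hot is written here for illustration; Keras ships its own utility for this):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels into rows with a single 1 at the label's index."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=int)
    encoded[np.arange(labels.size), labels] = 1
    return encoded

y = [1, 7, 8, 2]
Y = one_hot(y)
print(Y.shape)
print(Y[1])

# argmax undoes the encoding; the same trick turns a prediction row into a digit
decoded = Y.argmax(axis=1)
print(decoded)
```

Because the classes start at 0, the label 7 sits at index 7 of its row.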
So in your question, if you wish to see the original image, you should be able to see it before reshaping the training examples with something like image(mnist$test$x[1, , ])
Hope this helps!!
y_train are the labels and x_train is the training data, i.e. the images in this example. You need to use some kind of plotting library to plot the x values. In this example you are probably not expected to input your own drawings; if you want to, you would need to preprocess them in the same way as the MNIST images and then pass them to the model.
I am using Keras on some data. Here are the details:
8,000 customers, each customer has a varying number of time steps, ranging from 2 to 41. So I am using zero padding to ensure all customers have 41 time steps. All 8,000 customers have 2 features and the data comes with multiclass labels, 0-4. Each timestep has a label.
After training the model, in the test part of the process I'd like to feed in the features and labels for timesteps 1-40, then have it predict the label in the 41st timestep. Does anyone know if this is possible? I've found Keras to be somewhat of a black box in interpreting what it is actually predicting (e.g. when it gives an accuracy score, what is this the accuracy of? What is it trying to predict: the last timestep label or all timestep labels?).
Is there a particular sub-division of model that should be used within sequential Keras LSTM models? I've read 'A many-to-one model (f(...)) produces one output (y(t)) value after receiving multiple input values (X(t), X(t+1), ...). ' (Brownlee 2017). However, it doesn't seem to make accommodation for the fact that my input is Xt & Yt for all time steps except the last one that I want to predict. I'm not sure how I would set up my code to instruct the model to predict the last timestep (that I have the data for but then I want to compare the predicted category with the actual category).
To predict the next timestep for each feature you would want your final Dense layer to be the same width as the number of features:
model.add(Dense(n_features))
There's a good example of a similar problem here under Multiple Parallel Series
https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
The accuracy is just a metric for measuring the effectiveness of your model. For accuracy, it's correct_predictions / total_predictions
https://keras.io/metrics/
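As a rough illustration of that ratio for multiclass output, here is a NumPy sketch (the function name categorical_accuracy and the toy numbers are made up; this is not the Keras implementation):

```python
import numpy as np

def categorical_accuracy(y_true, y_prob):
    """Fraction of predictions whose highest-scoring class equals the true class."""
    y_true = np.asarray(y_true)
    predicted = np.asarray(y_prob).argmax(axis=-1)
    return float((predicted == y_true).mean())

y_true = np.array([0, 2, 1, 4])
y_prob = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],  # predicts 0 (correct)
    [0.10, 0.10, 0.60, 0.10, 0.10],  # predicts 2 (correct)
    [0.50, 0.30, 0.10, 0.05, 0.05],  # predicts 0 (wrong, true class is 1)
    [0.10, 0.10, 0.10, 0.10, 0.60],  # predicts 4 (correct)
])
acc = categorical_accuracy(y_true, y_prob)
print(acc)
```

As far as I understand, if the model outputs one prediction per timestep, the same ratio is taken over every timestep rather than only the last one, so what the score measures depends on how the output layer is set up.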
I am trying to train an LSTM, followed by a Dense layer, in Keras with numerical input sequences of different lengths. The numbers are in the range [1,13]. Each of those sequences ends with the same number, 13 in my case.
I train the controller on a few sequences, use the trained model to generate a few more sequences with the same properties, add them to the training set and train the LSTM again. As this loop goes on, the LSTM predictions start converging towards the final value of each sequence.
The sequences are padded to a certain maximum length. As a result the x_train data is of the size (None, max_len-1) and the y_train data is the categorical encoding of the last element of each input sequence. In this case, every element in the y_train data is the same (the one-hot encoded vector for the number 13).
Is the way input and output data is structured the reason for this skewing of predictions?
Is there a way to work around it?
I am following this tutorial LSTM and I wonder how to map this to a multi-time series input. I have a dataset of several time-series and I want to predict for each time series the future. I don't know how to scale LSTM to several time-series.
The aim is to avoid building one model per time series, as I have 40k time series.
Thank you
Process one by one
Just do exactly the same in a loop like this:
for epoch in range(numberOfEpochs):
    for sequence in yourSequences:
        model.reset_states()
        #1 - do the entire training for this sequence (1 epoch only)
        #you may use "model.train_on_batch" to avoid some overhead in "fit"
        #or 2 - do the entire predictions for this sequence
Process all together
Just pack the series in the first dimension of the input. No change is necessary in the model
When defining the input shape, use batch_shape=(number_of_time_series,length,features) on an Input layer or batch_input_shape=(number_of_time_series,length,features) on the first layer. (You may need a smaller batch size, because 40K is too much.)
Make sure to use shuffle=False in every training command.
If your batch is not 40k, make sure to process the entire length (the entire training or prediction) of each batch, then you use model.reset_states() and start a new group of sequences.
batch_size = ....
for epoch in range(numberOfEpochs):
    firstSeq = 0
    lastSeq = firstSeq + batch_size
    while lastSeq <= len(sequences):
        model.reset_states()
        batch = sequences[firstSeq:lastSeq]
        #train the entire batch (one epoch only)
        #or predict for the entire batch
        firstSeq += batch_size
        lastSeq += batch_size
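Note that the loop above silently skips trailing sequences that do not fill a whole batch, which is what you want when the batch size is fixed in the input shape. A minimal sketch that makes this choice explicit, without any Keras calls (iter_batches and drop_remainder are names invented for this example):

```python
def iter_batches(sequences, batch_size, drop_remainder=True):
    """Yield consecutive groups of sequences, optionally skipping a short last group."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            break  # a fixed batch size cannot accept a partial batch
        yield batch

sequences = list(range(10))  # stand-in for 10 time series
full_only = list(iter_batches(sequences, batch_size=4))
with_tail = list(iter_batches(sequences, batch_size=4, drop_remainder=False))
print(full_only)
print(with_tail)
```

Inside the loop you would reset the states before each group and then train or predict on it, as in the outline above.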
Since you are using separate time series, I don't think keeping stateful=True is a good idea.
Actually, your problem is closer to the 'generic' use of LSTMs.
Try to concatenate your series in a 2D array where each row corresponds to one series. Then reshape your data to (number_of_series, timesteps (the length of a single series), 1) and feed it to your network.
Depending on the length of your series, you may need to read this: https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/
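In NumPy the reshape described above could look like this (toy numbers, assuming all series have the same length):

```python
import numpy as np

# 3 series of 4 timesteps each, one series per row
series_2d = np.array([
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [5, 6, 7, 8],
], dtype=float)

# Add a trailing feature axis: (number_of_series, timesteps, 1)
X = series_2d[..., np.newaxis]
print(X.shape)
```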
The real potential of LSTM models for time series forecasting can be exploited by building a global model using all the time series, instead of a univariate model per series, which ignores any cross-series information available across your time series.
We implement the use case you are referring to by introducing a 'Moving Window Approach' strategy that involves modeling a multiple input and output mapping, where you can pool time series that have different lengths. A more detailed discussion of this strategy is given in section 3.4 of our paper [1]. Here, you basically produce multiple input and output tuples for the given set of time series you have and then pool them together for LSTM training. This works even if your time series have different lengths.
[1] https://arxiv.org/pdf/1710.03222.pdf
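A rough sketch of the pooling idea only; it does not reproduce the preprocessing in the paper, and the names moving_windows, pool_windows, n_in and n_out are assumptions made for this example:

```python
import numpy as np

def moving_windows(series, n_in, n_out):
    """Slide over one series, producing (input window, output window) tuples."""
    series = np.asarray(series, dtype=float)
    inputs, outputs = [], []
    for start in range(len(series) - n_in - n_out + 1):
        inputs.append(series[start:start + n_in])
        outputs.append(series[start + n_in:start + n_in + n_out])
    return inputs, outputs

def pool_windows(all_series, n_in, n_out):
    """Pool the tuples of every series into one training set."""
    X, Y = [], []
    for series in all_series:
        inputs, outputs = moving_windows(series, n_in, n_out)
        X.extend(inputs)
        Y.extend(outputs)
    return np.array(X), np.array(Y)

# Three series of different lengths; the last one is too short to give any window
all_series = [np.arange(10), np.arange(100, 106), np.arange(3)]
X, Y = pool_windows(all_series, n_in=3, n_out=2)
print(X.shape, Y.shape)
```

Because every tuple has the same window sizes, series of different lengths simply contribute a different number of rows.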
Sources
There are several sources out there explaining stateful / stateless LSTMs and the role of batch_size which I've read already. I'll refer to them later in my post:
[1] https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/
[2] https://machinelearningmastery.com/stateful-stateless-lstm-time-series-forecasting-python/
[3] http://philipperemy.github.io/keras-stateful-lstm/
[4] https://machinelearningmastery.com/use-different-batch-sizes-training-predicting-python-keras/
And also other SO threads like Understanding Keras LSTMs and Keras - stateful vs stateless LSTMs which didn't fully explain what I'm looking for however.
My Problem
I am still not sure what is the correct approach for my task regarding statefulness and determining batch_size.
I have about 1000 independent time series (samples) that have a length of about 600 days (timesteps) each (actually variable length, but I thought about trimming the data to a constant timeframe) with 8 features (or input_dim) for each timestep (some of the features are identical to every sample, some individual per sample).
Input shape = (1000, 600, 8)
One of the features is the one I want to predict, while the others are (supposed to be) supportive for the prediction of this one “master feature”. I will do that for each of the 1000 time series. What would be the best strategy to model this problem?
Output shape = (1000, 600, 1)
What is a Batch?
From [4]:
Keras uses fast symbolic mathematical libraries as a backend, such as TensorFlow and Theano.
A downside of using these libraries is that the shape and size of your data must be defined once up front and held constant regardless of whether you are training your network or making predictions.
[…]
This does become a problem when you wish to make fewer predictions than the batch size. For example, you may get the best results with a large batch size, but are required to make predictions for one observation at a time on something like a time series or sequence problem.
This sounds to me like a “batch” would be splitting the data along the timesteps-dimension.
However, [3] states that:
Said differently, whenever you train or test your LSTM, you first have to build your input matrix X of shape nb_samples, timesteps, input_dim where your batch size divides nb_samples. For instance, if nb_samples=1024 and batch_size=64, it means that your model will receive blocks of 64 samples, compute each output (whatever the number of timesteps is for every sample), average the gradients and propagate it to update the parameters vector.
When looking deeper into the examples of [1] and [4], Jason is always splitting his time series to several samples that only contain 1 timestep (the predecessor that in his example fully determines the next element in the sequence). So I think the batches are really split along the samples-axis. (However his approach of time series splitting doesn’t make sense to me for a long-term dependency problem.)
Conclusion
So let’s say I pick batch_size=10, that means during one epoch the weights are updated 1000 / 10 = 100 times with 10 randomly picked, complete time series containing 600 x 8 values, and when I later want to make predictions with the model, I’ll always have to feed it batches of 10 complete time series (or use solution 3 from [4], copying the weights to a new model with different batch_size).
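The arithmetic in that conclusion can be checked with a few lines of NumPy (zeros stand in for the real data; the shapes are the ones from the question):

```python
import math
import numpy as np

n_samples, timesteps, features = 1000, 600, 8
batch_size = 10

# One weight update per batch, so this many updates per epoch
updates_per_epoch = math.ceil(n_samples / batch_size)
print(updates_per_epoch)

# Batches are cut along the samples axis; each sample keeps all its timesteps
X = np.zeros((n_samples, timesteps, features), dtype=np.float32)
first_batch = X[:batch_size]
print(first_batch.shape)
```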
So I think I understand the principles of batch_size; however, I still don't know what a good value for batch_size would be or how to determine it.
Statefulness
The Keras documentation tells us
You can set RNN layers to be 'stateful', which means that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch.
If I’m splitting my time series into several samples (like in the examples of [1] and [4]) so that the dependencies I’d like to model span across several batches, or the batch-spanning samples are otherwise correlated with each other, I may need a stateful net, otherwise not. Is that a correct and complete conclusion?
So for my problem I suppose I won’t need a stateful net. I’d build my training data as a 3D array of the shape (samples, timesteps, features) and then call model.fit with a batch_size yet to determine. Sample code could look like:
model = Sequential()
model.add(LSTM(32, input_shape=(600, 8))) # (timesteps, features)
model.add(LSTM(32))
model.add(LSTM(32))
model.add(LSTM(32))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2)
Let me explain it via an example:
So let's say you have the following series: 1,2,3,4,5,6,...,100. You have to decide how many timesteps your LSTM will learn from, and reshape your data accordingly, as shown below.
If you decide time_steps = 5, you have to reshape your time series as a matrix of samples in this way:
1,2,3,4,5 -> sample1
2,3,4,5,6 -> sample2
3,4,5,6,7 -> sample3
etc...
By doing so, you will end up with a matrix of shape (96 samples x 5 timesteps).
This matrix should be reshaped to (96 x 5 x 1), telling Keras that you have just 1 time series. If you have more time series in parallel (as in your case), you do the same operation on each time series, so you will end up with n matrices (one for each time series), each of shape (96 samples x 5 timesteps).
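For a single series, the windowing and reshape could be sketched like this in NumPy:

```python
import numpy as np

series = np.arange(1, 101)  # 1, 2, ..., 100
time_steps = 5

# One row per window: [1..5], [2..6], ..., [96..100]
n_windows = len(series) - time_steps + 1
windows = np.stack([series[i:i + time_steps] for i in range(n_windows)])
print(windows.shape)

# Add the feature axis expected by an LSTM layer: (samples, timesteps, 1)
X = windows.reshape(n_windows, time_steps, 1)
print(X.shape)
```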
For the sake of argument, let's say you have 3 time series. You should concatenate all three matrices into one single tensor of shape (96 samples x 5 timeSteps x 3 timeSeries). The first layer of your LSTM for this example would be:
model = Sequential()
model.add(LSTM(32, input_shape=(5, 3)))
The 32 as first parameter is totally up to you. It means that at each point in time, your 3 time series will become 32 different variables as output space. It is easier to think of each time step as a fully connected layer with 3 inputs and 32 outputs, but with a different computation than FC layers.
If you are stacking multiple LSTM layers, use the return_sequences=True parameter, so the layer will output the whole predicted sequence rather than just the last value.
Your target should be the next value in the series you want to predict.
Putting it all together, let's say you have the following time series:
Time series 1 (master): 1,2,3,4,5,6,..., 100
Time series 2 (support): 2,4,6,8,10,12,..., 200
Time series 3 (support): 3,6,9,12,15,18,..., 300
Create the input and target tensor
x -> y
1,2,3,4,5 -> 6
2,3,4,5,6 -> 7
3,4,5,6,7 -> 8
Reformat the rest of the time series in the same way, but forget about their targets since you don't want to predict those series.
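Here is one way the input and target tensors for the three example series could be built with NumPy. It is a sketch only; note that once a target is attached, the last window has no "next value", so there are 95 samples rather than 96.

```python
import numpy as np

master = np.arange(1, 101)  # time series 1 (master): 1..100
data = np.stack([master, 2 * master, 3 * master], axis=-1)  # (100 timesteps, 3 series)

time_steps = 5
n_samples = len(data) - time_steps  # 95 windows that still have a next value

# X: (samples, timesteps, series); every window carries all three series
X = np.stack([data[i:i + time_steps] for i in range(n_samples)])
# y: the next value of the master series only
y = data[time_steps:, 0]

print(X.shape, y.shape)
print(X[0, :, 0], y[0])
```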
Create your model
model = Sequential()
model.add(LSTM(32, input_shape=(5, 3), return_sequences=True)) # Input is shape (5 timesteps x 3 timeseries), output is shape (5 timesteps x 32 variables) because return_sequences = True
model.add(LSTM(8)) # output is shape (1 timesteps x 8 variables) because return_sequences = False
model.add(Dense(1, activation='linear')) # output is (1 timestep x 1 output unit on dense layer). It is compared to the target variable.
Compile it and train. A good batch size is 32. The batch size is the number of samples your sample matrices are split into for faster computation. Just don't use stateful.