I am fairly new to TensorFlow and Keras and have a question.
I want to do time series prediction using an LSTM layer, with some modifications. I started with the example given in the TensorFlow tutorial:
import tensorflow as tf

def build_LSTM(neurons, batch_size, history_size, features):
    model = tf.keras.models.Sequential()
    # stateful=True needs a fixed batch size, hence batch_input_shape
    model.add(tf.keras.layers.LSTM(neurons,
                                   batch_input_shape=(batch_size, history_size, features),
                                   stateful=True))
    model.add(tf.keras.layers.Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
In its current state, the model from the example takes input of the form (observations, time steps, features) and returns a single number (the prediction for the next time step).
What I want to do is use return_sequences=True in the LSTM layer.
Is it correct that this returns a tensor T of shape (time steps, features)?
Is there a way to transfer this tensor from one step (let's say observation = 1) to the next step (observation = 2)? I guess the corresponding graph would look like this:
To answer your first question: is it correct that this returns a tensor T of shape (time steps, features)?
Almost. With return_sequences=True the layer returns one output vector per time step, so for each observation the output has shape (time steps, neurons). The last dimension is the number of LSTM units, not the number of input features.
Your second question: is there a way to transfer this tensor from one step (let's say observation = 1) to the next step (observation = 2)?
This is harder to answer. Technically, the LSTM layer computes each time step in turn and feeds its current state back to itself as the initial state for the next time step, until it has processed the whole sequence. With return_sequences=True it then returns the output tensor from question 1. So if you want to use this tensor for further computation, for example to sum the outputs of all odd time steps, that is possible. Moreover, if you want to pass the last state on to the next batch of input, you can do that with the stateful=True argument.
However, if you want to feed the output of the previous time step into the current time step (something like closed-loop control), regardless of the given model, you need to create your own recurrent cell and use it with the RNN layer: custom_model = RNN(custom_recurrent_cell, return_sequences=True).
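To make the difference between the two output modes concrete, here is a minimal NumPy sketch. It uses a toy tanh RNN rather than a real LSTM, and the function name, weights and sizes are assumptions made up for illustration. It only shows how the state is fed back step by step and what return_sequences changes about the result.

```python
import numpy as np

def simple_rnn(x, w_in, w_rec, return_sequences=False):
    """Toy tanh RNN (not an LSTM) over one sample x of shape (steps, features)."""
    units = w_rec.shape[0]
    state = np.zeros(units)            # the initial state is all zeros
    outputs = []
    for x_t in x:                      # one iteration per time step
        # the state computed at the previous step is fed back in here
        state = np.tanh(x_t @ w_in + state @ w_rec)
        outputs.append(state)
    return np.stack(outputs) if return_sequences else outputs[-1]

rng = np.random.default_rng(0)
steps, features, units = 4, 3, 5       # illustrative sizes
x = rng.normal(size=(steps, features))
w_in = rng.normal(size=(features, units))
w_rec = rng.normal(size=(units, units))

all_steps = simple_rnn(x, w_in, w_rec, return_sequences=True)
last_step = simple_rnn(x, w_in, w_rec, return_sequences=False)
print(all_steps.shape)   # (4, 5), i.e. (time steps, units)
print(last_step.shape)   # (5,)
```

The last row of all_steps is the same vector that the return_sequences=False call returns.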
Sources
There are several sources out there explaining stateful / stateless LSTMs and the role of batch_size which I've read already. I'll refer to them later in my post:
[1] https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/
[2] https://machinelearningmastery.com/stateful-stateless-lstm-time-series-forecasting-python/
[3] http://philipperemy.github.io/keras-stateful-lstm/
[4] https://machinelearningmastery.com/use-different-batch-sizes-training-predicting-python-keras/
I also read other SO threads such as Understanding Keras LSTMs and Keras - stateful vs stateless LSTMs, which however didn't fully explain what I'm looking for.
My Problem
I am still not sure what is the correct approach for my task regarding statefulness and determining batch_size.
I have about 1000 independent time series (samples), each about 600 days (timesteps) long (actually of variable length, but I thought about trimming the data to a constant timeframe), with 8 features (or input_dim) per timestep (some of the features are identical for every sample, some are individual per sample).
Input shape = (1000, 600, 8)
One of the features is the one I want to predict, while the others are (supposed to be) supportive for the prediction of this one “master feature”. I will do that for each of the 1000 time series. What would be the best strategy to model this problem?
Output shape = (1000, 600, 1)
What is a Batch?
From [4]:
Keras uses fast symbolic mathematical libraries as a backend, such as TensorFlow and Theano.
A downside of using these libraries is that the shape and size of your data must be defined once up front and held constant regardless of whether you are training your network or making predictions.
[…]
This does become a problem when you wish to make fewer predictions than the batch size. For example, you may get the best results with a large batch size, but are required to make predictions for one observation at a time on something like a time series or sequence problem.
This sounds to me like a “batch” would be splitting the data along the timesteps-dimension.
However, [3] states that:
Said differently, whenever you train or test your LSTM, you first have to build your input matrix X of shape nb_samples, timesteps, input_dim where your batch size divides nb_samples. For instance, if nb_samples=1024 and batch_size=64, it means that your model will receive blocks of 64 samples, compute each output (whatever the number of timesteps is for every sample), average the gradients and propagate it to update the parameters vector.
When looking deeper into the examples of [1] and [4], Jason is always splitting his time series to several samples that only contain 1 timestep (the predecessor that in his example fully determines the next element in the sequence). So I think the batches are really split along the samples-axis. (However his approach of time series splitting doesn’t make sense to me for a long-term dependency problem.)
Conclusion
So let’s say I pick batch_size=10, that means during one epoch the weights are updated 1000 / 10 = 100 times with 10 randomly picked, complete time series containing 600 x 8 values, and when I later want to make predictions with the model, I’ll always have to feed it batches of 10 complete time series (or use solution 3 from [4], copying the weights to a new model with different batch_size).
I think I understand the principles of batch_size now. However, I still don't know what a good value for batch_size would be or how to determine it.
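The arithmetic in this conclusion can be checked with a small sketch. The helper names are hypothetical and are not part of Keras. The slices are consecutive for readability, whereas model.fit shuffles the samples by default.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one epoch (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

def batch_slices(n_samples, batch_size):
    """Index ranges that cut the samples axis into consecutive batches."""
    return [(start, min(start + batch_size, n_samples))
            for start in range(0, n_samples, batch_size)]

print(updates_per_epoch(1000, 10))   # 100
print(batch_slices(25, 10))          # [(0, 10), (10, 20), (20, 25)]
```

Each slice selects whole time series, so with data of shape (1000, 600, 8) a batch of 10 has shape (10, 600, 8).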
Statefulness
The Keras documentation tells us:
You can set RNN layers to be 'stateful', which means that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch.
If I’m splitting my time series into several samples (like in the examples of [1] and [4]) so that the dependencies I’d like to model span across several batches, or the batch-spanning samples are otherwise correlated with each other, I may need a stateful net, otherwise not. Is that a correct and complete conclusion?
So for my problem I suppose I won't need a stateful net. I'd build my training data as a 3D array of shape (samples, timesteps, features) and then call model.fit with a batch_size that is yet to be determined. Sample code could look like:
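from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(600, 8), return_sequences=True))  # (timesteps, features)
# every LSTM that feeds another LSTM has to return the full sequence
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(32))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2)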
Let me explain it via an example:
So let's say you have the following series: 1,2,3,4,5,6,...,100. You have to decide how many timesteps your LSTM will learn from, and reshape your data accordingly.
If you decide on time_steps = 5, you have to reshape your time series into a matrix of samples in this way:
1,2,3,4,5 -> sample1
2,3,4,5,6 -> sample2
3,4,5,6,7 -> sample3
etc...
By doing so, you will end up with a matrix of shape (96 samples x 5 timesteps).
This matrix should be reshaped to (96 x 5 x 1), indicating to Keras that you have just 1 time series. If you have more time series in parallel (as in your case), you do the same operation on each time series, so you will end up with n matrices (one for each time series), each of shape (96 samples x 5 timesteps).
For the sake of argument, let's say you have 3 time series. You should concatenate all three matrices into one single tensor of shape (96 samples x 5 timesteps x 3 time series). The first layer of your LSTM for this example would be:
model = Sequential()
model.add(LSTM(32, input_shape=(5, 3)))
The 32 as the first parameter is totally up to you. It means that at each point in time, your 3 time series will become 32 different variables in the output space. It is easiest to think of each time step as a fully connected layer with 3 inputs and 32 outputs, but with a different computation than FC layers.
If you are stacking multiple LSTM layers, use the return_sequences=True parameter, so that the layer outputs the whole predicted sequence rather than just the last value.
Your target should be the next value in the series you want to predict.
Putting it all together, let's say you have the following time series:
Time series 1 (master): 1,2,3,4,5,6,..., 100
Time series 2 (support): 2,4,6,8,10,12,..., 200
Time series 3 (support): 3,6,9,12,15,18,..., 300
Create the input and target tensor
x -> y
1,2,3,4,5 -> 6
2,3,4,5,6 -> 7
3,4,5,6,7 -> 8
Reformat the other time series in the same way, but ignore their targets, since you don't want to predict those series.
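The windowing described above can be sketched with NumPy as follows. The helper name make_windows is an assumption for illustration. Because every window here needs a next value as its target, 100 points with 5 time steps give 95 input/target pairs rather than 96 windows.

```python
import numpy as np

def make_windows(series, time_steps, target_column=0):
    """series: array of shape (length, n_series).
    Returns X of shape (samples, time_steps, n_series) and y of shape (samples,)."""
    n_samples = len(series) - time_steps    # each window needs a "next value"
    X = np.stack([series[i:i + time_steps] for i in range(n_samples)])
    y = series[time_steps:, target_column]  # next value of the master series
    return X, y

master = np.arange(1, 101)                                    # 1, 2, ..., 100
series = np.stack([master, 2 * master, 3 * master], axis=1)   # shape (100, 3)

X, y = make_windows(series, time_steps=5)
print(X.shape, y.shape)   # (95, 5, 3) (95,)
print(X[0, :, 0], y[0])   # [1 2 3 4 5] 6
```

X can be passed directly to a first layer declared with input_shape=(5, 3).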
Create your model
model = Sequential()
model.add(LSTM(32, input_shape=(5, 3), return_sequences=True))  # input is (5 timesteps x 3 time series), output is (5 timesteps x 32 variables) because return_sequences=True
model.add(LSTM(8))  # output is (8 variables) for the last time step only, because return_sequences=False
model.add(Dense(1, activation='linear'))  # output is (1 output unit of the dense layer); it is compared to the target variable
Compile it and train. A batch size of 32 is a reasonable starting point. The batch size is the number of samples your sample matrix is split into per chunk for faster computation. Just don't use stateful=True here.
I am trying to reconcile my understanding of LSTMs, as described in this post by Christopher Olah, with their implementation in Keras. I am following the blog written by Jason Brownlee for the Keras tutorial. What I am mainly confused about is:
The reshaping of the data series into [samples, time steps, features] and,
The stateful LSTMs
Let's concentrate on the above two questions with reference to the code pasted below:
# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], look_back, 1))
testX = numpy.reshape(testX, (testX.shape[0], look_back, 1))
########################
# The IMPORTANT BIT
##########################
# create and fit the LSTM network
batch_size = 1
model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, look_back, 1), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
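for i in range(100):
    # one pass over the data per iteration (this argument was called nb_epoch in Keras 1)
    model.fit(trainX, trainY, epochs=1, batch_size=batch_size, verbose=2, shuffle=False)
    # clear the carried-over state before the next pass
    model.reset_states()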
Note: create_dataset takes a sequence of length N and returns an array of N - look_back elements, each of which is a sequence of length look_back.
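Since create_dataset itself is not shown, here is a hypothetical stand-in that matches the description in the note. The tutorial's own helper may handle the end of the sequence differently. It also shows the reshape to [samples, time steps, features] on a short sequence.

```python
import numpy as np

def create_dataset_sketch(sequence, look_back):
    """Hypothetical stand-in for create_dataset: X[i] holds look_back
    consecutive values and Y[i] is the value that follows them."""
    sequence = np.asarray(sequence, dtype=float)
    n = len(sequence) - look_back
    X = np.array([sequence[i:i + look_back] for i in range(n)])
    Y = sequence[look_back:]
    return X, Y

X, Y = create_dataset_sketch(range(10), look_back=3)
X3d = X.reshape(X.shape[0], 3, 1)    # [samples, time steps, features]
print(X.shape, X3d.shape, Y.shape)   # (7, 3) (7, 3, 1) (7,)
```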
What is Time Steps and Features?
As can be seen, trainX is a 3-D array with time steps and features being the last two dimensions respectively (3 and 1 in this particular code). With respect to the image below, does this mean that we are considering the many to one case, where the number of pink boxes is 3? Or does it literally mean the chain length is 3 (i.e. only 3 green boxes considered)?
Does the features argument become relevant when we consider multivariate series? e.g. modelling two financial stocks simultaneously?
Stateful LSTMs
Do stateful LSTMs mean that we save the cell memory values between runs of batches? If this is the case, batch_size is one and the memory is reset between the training runs, so what was the point of saying that it was stateful? I'm guessing this is related to the fact that the training data is not shuffled, but I'm not sure how.
Any thoughts?
Image reference: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Edit 1:
I'm a bit confused about @van's comment about the red and green boxes being equal. So just to confirm, do the following API calls correspond to the unrolled diagrams? Note the second diagram in particular (batch_size was arbitrarily chosen):
Edit 2:
For people who have done Udacity's deep learning course and are still confused about the time_step argument, look at the following discussion: https://discussions.udacity.com/t/rnn-lstm-use-implementation/163169
Update:
It turns out model.add(TimeDistributed(Dense(vocab_len))) was what I was looking for. Here is an example: https://github.com/sachinruk/ShakespeareBot
Update2:
I have summarised most of my understanding of LSTMs here: https://www.youtube.com/watch?v=ywinX5wgdEU
As a complement to the accepted answer, this answer shows keras behaviors and how to achieve each picture.
General Keras behavior
The standard keras internal processing is always a many to many as in the following picture (where I used features=2, pressure and temperature, just as an example):
In this image, I increased the number of steps to 5, to avoid confusion with the other dimensions.
For this example:
We have N oil tanks
We spent 5 hours taking measurements hourly (time steps)
We measured two features:
Pressure P
Temperature T
Our input array should then be something shaped as (N,5,2):
        [     Step1      Step2      Step3      Step4      Step5
Tank A:    [[Pa1,Ta1], [Pa2,Ta2], [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B:    [[Pb1,Tb1], [Pb2,Tb2], [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
  ....
Tank N:    [[Pn1,Tn1], [Pn2,Tn2], [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
        ]
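As a small sketch of how such an array could be assembled with NumPy: the per-feature arrays and their values are made up for illustration, and N is set to 3.

```python
import numpy as np

n_tanks, steps = 3, 5                                            # illustrative sizes
pressure = np.arange(n_tanks * steps).reshape(n_tanks, steps)   # shape (N, 5)
temperature = 100 + pressure                                     # shape (N, 5)

# put the two measurements side by side on a new last axis
data = np.stack([pressure, temperature], axis=-1)
print(data.shape)    # (3, 5, 2)
print(data[0, 1])    # tank A, step 2: pressure 1, temperature 101
```

The indexing order is data[tank, step, feature], matching (N,5,2).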
Inputs for sliding windows
Often, LSTM layers are supposed to process the entire sequences. Dividing windows may not be the best idea. The layer has internal states about how a sequence is evolving as it steps forward. Windows eliminate the possibility of learning long sequences, limiting all sequences to the window size.
With windows, each window is part of a long original sequence, but Keras will see each of them as an independent sequence:
          [     Step1    Step2    Step3    Step4    Step5
Window A:    [[P1,T1], [P2,T2], [P3,T3], [P4,T4], [P5,T5]],
Window B:    [[P2,T2], [P3,T3], [P4,T4], [P5,T5], [P6,T6]],
Window C:    [[P3,T3], [P4,T4], [P5,T5], [P6,T6], [P7,T7]],
  ....
          ]
Notice that in this case, you initially have only one sequence, but you're dividing it into many sequences to create windows.
The concept of "what is a sequence" is abstract. The important parts are:
you can have batches with many individual sequences
what makes the sequences be sequences is that they evolve in steps (usually time steps)
Achieving each case with "single layers"
Achieving standard many to many:
You can achieve many to many with a simple LSTM layer, using return_sequences=True:
outputs = LSTM(units, return_sequences=True)(inputs)
#output_shape -> (batch_size, steps, units)
Achieving many to one:
Using the exact same layer, keras will do the exact same internal processing, but when you use return_sequences=False (or simply omit this argument), keras will automatically discard all steps before the last one:
outputs = LSTM(units)(inputs)
#output_shape -> (batch_size, units) --> steps were discarded, only the last was returned
Achieving one to many
Now, this is not supported by keras LSTM layers alone. You will have to create your own strategy to multiply the steps. There are two good approaches:
Create a constant multi-step input by repeating a tensor
Use stateful=True to recurrently take the output of one step and serve it as the input of the next step (needs output_features == input_features)
One to many with repeat vector
In order to fit to keras standard behavior, we need inputs in steps, so, we simply repeat the inputs for the length we want:
outputs = RepeatVector(steps)(inputs) #where inputs is (batch,features)
outputs = LSTM(units,return_sequences=True)(outputs)
#output_shape -> (batch_size, steps, units)
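What the repeat step does to the shapes can be sketched in NumPy. The function below is an illustrative equivalent written for this answer, not the Keras layer itself.

```python
import numpy as np

def repeat_vector(inputs, steps):
    """NumPy equivalent of the shape change made by RepeatVector:
    (batch, features) -> (batch, steps, features)."""
    return np.repeat(inputs[:, np.newaxis, :], steps, axis=1)

inputs = np.array([[1.0, 2.0],
                   [3.0, 4.0]])          # (batch=2, features=2)
repeated = repeat_vector(inputs, steps=3)
print(repeated.shape)    # (2, 3, 2)
```

Every step of repeated holds a copy of the same input vector, which is why the LSTM that follows receives a constant input at each step.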
Understanding stateful = True
Now comes one of the possible usages of stateful=True (besides avoiding loading data that can't fit into your computer's memory at once).
Stateful allows us to input "parts" of the sequences in stages. The difference is:
In stateful=False, the second batch contains whole new sequences, independent from the first batch
In stateful=True, the second batch continues the first batch, extending the same sequences.
It's like dividing the sequences in windows too, with these two main differences:
these windows do not overlap!
stateful=True will see these windows connected as a single long sequence
In stateful=True, every new batch will be interpreted as continuing the previous batch (until you call model.reset_states()).
Sequence 1 in batch 2 will continue sequence 1 in batch 1.
Sequence 2 in batch 2 will continue sequence 2 in batch 1.
Sequence n in batch 2 will continue sequence n in batch 1.
Example of inputs, batch 1 contains steps 1 and 2, batch 2 contains steps 3 to 5:
                  BATCH 1                             BATCH 2
        [     Step1      Step2        |    [    Step3      Step4      Step5
Tank A:    [[Pa1,Ta1], [Pa2,Ta2],     |       [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B:    [[Pb1,Tb1], [Pb2,Tb2],     |       [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
  ....                                |
Tank N:    [[Pn1,Tn1], [Pn2,Tn2],     |       [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
        ]                                  ]
Notice the alignment of tanks in batch 1 and batch 2! That's why we need shuffle=False (unless we are using only one sequence, of course).
You can have any number of batches, indefinitely. (For variable lengths in each batch, use input_shape=(None,features).)
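The effect of carrying the state from batch 1 into batch 2 can be demonstrated with a toy example. This is a NumPy sketch with a simple tanh RNN, not an LSTM and not Keras code, and all names and sizes are assumptions. Processing steps 1-2 and then steps 3-5 with the state carried over gives the same outputs as processing all 5 steps at once, while restarting from a zero state does not.

```python
import numpy as np

def rnn_chunk(x, w_in, w_rec, state):
    """Toy tanh RNN (not an LSTM). x: (batch, steps, features).
    Returns the outputs for every step and the final state."""
    outputs = []
    for t in range(x.shape[1]):
        state = np.tanh(x[:, t] @ w_in + state @ w_rec)
        outputs.append(state)
    return np.stack(outputs, axis=1), state

rng = np.random.default_rng(0)
n_tanks, steps, features, units = 3, 5, 2, 4     # illustrative sizes
x = rng.normal(size=(n_tanks, steps, features))
w_in = rng.normal(size=(features, units))
w_rec = rng.normal(size=(units, units))
zeros = np.zeros((n_tanks, units))

# whole sequences in one go
full, _ = rnn_chunk(x, w_in, w_rec, zeros)

# "stateful": batch 1 = steps 1-2, batch 2 = steps 3-5, state carried over
out1, state = rnn_chunk(x[:, :2], w_in, w_rec, zeros)
out2, _ = rnn_chunk(x[:, 2:], w_in, w_rec, state)
stateful = np.concatenate([out1, out2], axis=1)

# "stateless": batch 2 starts again from a zero state
out2_reset, _ = rnn_chunk(x[:, 2:], w_in, w_rec, zeros)

print(np.allclose(full, stateful))            # True
print(np.allclose(full[:, 2:], out2_reset))   # False
```

Row i of the carried state belongs to tank i, which is why the order of the tanks must stay the same from one batch to the next.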
One to many with stateful=True
For our case here, we are going to use only 1 step per batch, because we want to get one output step and make it be an input.
Please notice that the behavior in the picture is not "caused by" stateful=True. We will force that behavior in a manual loop below. In this example, stateful=True is what "allows" us to stop the sequence, manipulate what we want, and continue from where we stopped.
Honestly, the repeat approach is probably a better choice for this case. But since we're looking into stateful=True, this is a good example. The best way to use this is the next "many to many" case.
Layer:
outputs = LSTM(units=features,
               stateful=True,
               return_sequences=True,  # just to keep a nice output shape even with length 1
               input_shape=(None, features))(inputs)
# units = features because we want to use the outputs as inputs
# None because we want variable length
# output_shape -> (batch_size, steps, units)
Now, we're going to need a manual loop for predictions:
input_data = someDataWithShape((batch, 1, features))  # placeholder for your own data

# important: we're starting new sequences, not continuing old ones
model.reset_states()

output_sequence = []
last_step = input_data
for i in range(steps_to_predict):  # assuming steps_to_predict is the number of steps to generate
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step

# end of the sequences
model.reset_states()
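The loop above can be tried without a trained network by swapping in a stub. StubModel below is a hypothetical stand-in that only mimics the predict / reset_states interface. Its arithmetic is made up and steps_to_predict is assumed to be an integer. The sketch also shows one way to join the collected steps into a single array.

```python
import numpy as np

class StubModel:
    """Hypothetical stand-in for a stateful Keras model; it only mimics
    the predict / reset_states interface used in the loop above."""
    def __init__(self):
        self.state = 0.0

    def reset_states(self):
        self.state = 0.0

    def predict(self, step):
        # step: (batch, 1, features); keep a running sum as the "state"
        self.state = self.state + step
        return 0.5 * self.state

model = StubModel()
steps_to_predict = 4                  # assumed to be an int
last_step = np.ones((2, 1, 3))        # (batch, 1, features)

model.reset_states()                  # start new sequences
output_sequence = []
for _ in range(steps_to_predict):
    last_step = model.predict(last_step)   # the output becomes the next input
    output_sequence.append(last_step)
model.reset_states()                  # end of the sequences

# join the single-step outputs along the time axis
predicted = np.concatenate(output_sequence, axis=1)
print(predicted.shape)    # (2, 4, 3)
```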
Many to many with stateful=True
Now, here, we get a very nice application: given an input sequence, try to predict its future unknown steps.
We're using the same method as in the "one to many" above, with the difference that:
we will use the sequence itself to be the target data, one step ahead
we know part of the sequence (so we discard this part of the results).
Layer (same as above):
outputs = LSTM(units=features,
               stateful=True,
               return_sequences=True,
               input_shape=(None, features))(inputs)
# units = features because we want to use the outputs as inputs
# None because we want variable length
# output_shape -> (batch_size, steps, units)
Training:
We are going to train our model to predict the next step of the sequences:
totalSequences = someSequencesShaped((batch, steps, features))
# batch size is usually 1 in these cases (often you have only one Tank in the example)

X = totalSequences[:, :-1]  # the entire known sequence, except the last step
Y = totalSequences[:, 1:]   # one step ahead of X

# loop for resetting states at the start/end of the sequences:
for epoch in range(epochs):
    model.reset_states()
    model.train_on_batch(X, Y)
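The one-step-ahead split of X and Y is easy to verify on a tiny made-up sequence (one sequence, 6 steps, 1 feature):

```python
import numpy as np

# one sequence, 6 steps, 1 feature: values 10, 11, ..., 15
total_sequences = np.arange(10, 16, dtype=float).reshape(1, 6, 1)

X = total_sequences[:, :-1]   # steps 1..5
Y = total_sequences[:, 1:]    # steps 2..6, i.e. X shifted one step ahead

print(X[0, :, 0])   # [10. 11. 12. 13. 14.]
print(Y[0, :, 0])   # [11. 12. 13. 14. 15.]
```

At every position t, Y[:, t] is the value that follows X[:, t] in the original sequence.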
Predicting:
The first stage of our predicting involves "adjusting the states". That's why we're going to predict the entire sequence again, even if we already know this part of it:
model.reset_states() #starting a new sequence
predicted = model.predict(totalSequences)
firstNewStep = predicted[:,-1:] #the last step of the predictions is the first future step
Now we go to the loop as in the one to many case. But don't reset states here! We want the model to know in which step of the sequence it is (and it knows it's at the first new step because of the prediction we just made above).
output_sequence = [firstNewStep]
last_step = firstNewStep
for i in range(steps_to_predict):  # assuming steps_to_predict is the number of further steps to generate
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step

# end of the sequences
model.reset_states()
This approach was used in these answers and file:
Predicting a multiple forward time step of a time series using LSTM
how to use the Keras model to forecast for future dates or events?
https://github.com/danmoller/TestRepo/blob/master/TestBookLSTM.ipynb
Achieving complex configurations
In all examples above, I showed the behavior of "one layer".
You can, of course, stack many layers on top of each other, not necessarily all following the same pattern, and create your own models.
One interesting example that has been appearing is the "autoencoder" that has a "many to one encoder" followed by a "one to many" decoder:
Encoder:
inputs = Input((steps,features))
#a few many to many layers:
outputs = LSTM(hidden1,return_sequences=True)(inputs)
outputs = LSTM(hidden2,return_sequences=True)(outputs)
#many to one layer:
outputs = LSTM(hidden3)(outputs)
encoder = Model(inputs,outputs)
Decoder:
Using the "repeat" method:
inputs = Input((hidden3,))
#repeat to make one to many:
outputs = RepeatVector(steps)(inputs)
#a few many to many layers:
outputs = LSTM(hidden4,return_sequences=True)(outputs)
#last layer
outputs = LSTM(features,return_sequences=True)(outputs)
decoder = Model(inputs,outputs)
Autoencoder:
inputs = Input((steps,features))
outputs = encoder(inputs)
outputs = decoder(outputs)
autoencoder = Model(inputs,outputs)
Train with fit(X,X)
Additional explanations
If you want details about how steps are calculated in LSTMs, or details about the stateful=True cases above, you can read more in this answer: Doubts regarding `Understanding Keras LSTMs`
First of all, you chose great tutorials (1, 2) to start with.
What time steps mean: time_steps == 3 in X.shape (which describes the shape of the data) means there are three pink boxes. Since in Keras each step requires an input, the number of green boxes should usually equal the number of red boxes, unless you hack the structure.
Many to many vs. many to one: in Keras, there is a return_sequences parameter when you're initializing an LSTM, GRU or SimpleRNN. When return_sequences is False (the default), it is many to one as shown in the picture. Its return shape is (batch_size, hidden_unit_length), which represents the last state. When return_sequences is True, it is many to many. Its return shape is (batch_size, time_step, hidden_unit_length).
Does the features argument become relevant: the features argument means "how big is your red box", i.e. the input dimension at each step. If you want to predict from, say, 8 kinds of market information, then you can generate your data with features == 8.
Stateful: you can look up the source code. When initializing the state, if stateful == True, the state from the previous batch is used as the initial state; otherwise a new state is generated. I haven't turned on stateful yet. However, I disagree that batch_size can only be 1 when stateful == True.
Currently, you generate your data from data you have already collected. Imagine your stock information arrives as a stream: rather than waiting a day to collect the whole sequence, you would like to generate input data online while training/predicting with the network. If you have 400 stocks sharing the same network, then you can set batch_size == 400.
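When the last RNN layer has return_sequences=True, its output has one vector per time step, so the Dense layer that follows has to be applied at every time step. Instead of a plain Dense layer, wrap it in TimeDistributed.
Here is an example piece of code that might help others.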
import keras
from keras.layers import TimeDistributed

words = keras.layers.Input(batch_shape=(None, self.maxSequenceLength), name="input")

# Build a matrix of size vocabularySize x EmbeddingDimension
# where each row corresponds to a "word embedding" vector.
# This layer replaces each word-id with a word-vector of size EmbeddingDimension.
embeddings = keras.layers.Embedding(self.vocabularySize, self.EmbeddingDimension,
                                    name="embeddings")(words)

# Pass the word-vectors to the first GRU layer.
# We are setting the hidden-state size to 512.
# The output will be batchSize x maxSequenceLength x hiddenStateSize
hiddenStates = keras.layers.GRU(512, return_sequences=True,
                                input_shape=(self.maxSequenceLength,
                                             self.EmbeddingDimension),
                                name="rnn")(embeddings)
hiddenStates2 = keras.layers.GRU(128, return_sequences=True,
                                 input_shape=(self.maxSequenceLength, self.EmbeddingDimension),
                                 name="rnn2")(hiddenStates)

# Apply the same Dense layer and softmax at every time step.
denseOutput = TimeDistributed(keras.layers.Dense(self.vocabularySize),
                              name="linear")(hiddenStates2)
predictions = TimeDistributed(keras.layers.Activation("softmax"),
                              name="softmax")(denseOutput)

# Build the computational graph by specifying the input and output of the network.
model = keras.models.Model(inputs=words, outputs=predictions)

# model.compile(loss='kullback_leibler_divergence',
# Note: newer Keras releases name the lr argument learning_rate.
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.Adam(lr=0.009,
                                              beta_1=0.9,
                                              beta_2=0.999,
                                              epsilon=None,
                                              decay=0.01,
                                              amsgrad=False))
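To illustrate what applying a Dense layer at every time step amounts to, here is a NumPy sketch. The weights and sizes are made up, and this is not the Keras implementation. The same weight matrix is used at each step, and the result is identical to a single broadcasted matrix product over the last axis.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, units, n_out = 2, 4, 3, 5            # illustrative sizes
hidden = rng.normal(size=(batch, steps, units))    # RNN output with return_sequences=True
W = rng.normal(size=(units, n_out))                # one weight matrix shared by all steps
b = rng.normal(size=n_out)

# apply the same dense transform separately at each time step
per_step = np.stack([hidden[:, t] @ W + b for t in range(steps)], axis=1)

# the same computation written as one broadcasted matrix product
all_at_once = hidden @ W + b

print(per_step.shape)                       # (2, 4, 5)
print(np.allclose(per_step, all_at_once))   # True
```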
Refer to this blog for more details: Animated RNN, LSTM and GRU.
The figure below gives you a better view of an LSTM. It's an LSTM cell.
As you can see, X has 3 features (green circles), so the input of this cell is a vector of dimension 3, and the hidden state has 2 units (red circles), so the output of this cell (and also the cell state) is a vector of dimension 2.
An example of one LSTM layer with 3 timesteps (3 LSTM cells) is shown in the figure below:
Note: a model can have multiple LSTM layers.
Now I use Daniel Möller's example again for better understanding:
We have 10 oil tanks. For each of them we measure 2 features, temperature and pressure, once per hour, 5 times.
Now the parameters are:
batch_size = number of samples used in one forward/backward pass (default=32). For example, if you have 1000 samples and set batch_size to 100, the model needs 10 iterations to pass all of the samples through the network once (1 epoch). The higher the batch size, the more memory you'll need. Because the number of samples in this example is low, we set batch_size equal to the number of samples, 10.
timesteps = 5
features = 2
units = a positive integer that determines the dimension of the hidden state and cell state, in other words the number of values passed on to the next LSTM cell. It can be chosen arbitrarily or empirically based on the features and timesteps. Using more units gives the model more capacity, which can improve accuracy, but it also increases computation time and may cause overfitting.
input_shape = (batch_size, timesteps, features) = (10,5,2)
output_shape:
(batch_size, timesteps, units) if return_sequences=True
(batch_size, units) if return_sequences=False
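The two output-shape rules can be written as a small helper. The function name is hypothetical, and units=4 is an arbitrary choice for the oil-tank example.

```python
def lstm_output_shape(batch_size, timesteps, units, return_sequences=False):
    """Output shape of a single LSTM layer, following the two rules above."""
    if return_sequences:
        return (batch_size, timesteps, units)
    return (batch_size, units)

# oil-tank example: 10 tanks, 5 hourly measurements; units=4 is an arbitrary choice
print(lstm_output_shape(10, 5, 4, return_sequences=True))    # (10, 5, 4)
print(lstm_output_shape(10, 5, 4, return_sequences=False))   # (10, 4)
```

Note that the number of input features (2 here) does not appear in the output shape. Only the batch size, the number of time steps and the number of units do.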