Understanding Keras LSTMs: Role of Batch-size and Statefulness


Sources
There are several sources out there explaining stateful / stateless LSTMs and the role of batch_size which I've read already. I'll refer to them later in my post:
[1] https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/
[2] https://machinelearningmastery.com/stateful-stateless-lstm-time-series-forecasting-python/
[3] http://philipperemy.github.io/keras-stateful-lstm/
[4] https://machinelearningmastery.com/use-different-batch-sizes-training-predicting-python-keras/
I have also read other SO threads, such as "Understanding Keras LSTMs" and "Keras - stateful vs stateless LSTMs", but they did not fully answer my question.
My Problem
I am still not sure what the correct approach is for my task regarding statefulness and the choice of batch_size.
I have about 1000 independent time series (samples), each with a length of about 600 days (timesteps) and 8 features (or input_dim) per timestep. The lengths actually vary, but I thought about trimming the data to a constant timeframe. Some of the features are identical for every sample, some are individual per sample.
Input shape = (1000, 600, 8)
One of the features is the one I want to predict, while the others are (supposed to be) supportive for the prediction of this one “master feature”. I will do that for each of the 1000 time series. What would be the best strategy to model this problem?
Output shape = (1000, 600, 1)
What is a Batch?
From [4]:
Keras uses fast symbolic mathematical libraries as a backend, such as TensorFlow and Theano.
A downside of using these libraries is that the shape and size of your data must be defined once up front and held constant regardless of whether you are training your network or making predictions.
[…]
This does become a problem when you wish to make fewer predictions than the batch size. For example, you may get the best results with a large batch size, but are required to make predictions for one observation at a time on something like a time series or sequence problem.
This sounds to me like a “batch” would be splitting the data along the timesteps-dimension.
However, [3] states that:
Said differently, whenever you train or test your LSTM, you first have to build your input matrix X of shape nb_samples, timesteps, input_dim where your batch size divides nb_samples. For instance, if nb_samples=1024 and batch_size=64, it means that your model will receive blocks of 64 samples, compute each output (whatever the number of timesteps is for every sample), average the gradients and propagate it to update the parameters vector.
When looking deeper into the examples of [1] and [4], Jason is always splitting his time series to several samples that only contain 1 timestep (the predecessor that in his example fully determines the next element in the sequence). So I think the batches are really split along the samples-axis. (However his approach of time series splitting doesn’t make sense to me for a long-term dependency problem.)
Conclusion
So let’s say I pick batch_size=10, that means during one epoch the weights are updated 1000 / 10 = 100 times with 10 randomly picked, complete time series containing 600 x 8 values, and when I later want to make predictions with the model, I’ll always have to feed it batches of 10 complete time series (or use solution 3 from [4], copying the weights to a new model with different batch_size).
I now understand the principle of batch_size, but I still don't know what a good value would be or how to determine it.
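To make the arithmetic above concrete, here is a minimal NumPy sketch (not Keras itself) that cuts an array of the shape from the question into batches along the samples axis. The zero-filled array and the names `X`, `batches` and `updates_per_epoch` are placeholders chosen for illustration. Keras additionally shuffles the samples each epoch by default, which this sketch leaves out.

```python
import math
import numpy as np

n_samples, timesteps, features = 1000, 600, 8
batch_size = 10

X = np.zeros((n_samples, timesteps, features))  # placeholder data

# One weight update per batch, so this is the number of updates per epoch.
updates_per_epoch = math.ceil(n_samples / batch_size)

# Batches are slices along axis 0 (samples); every batch keeps all timesteps.
batches = [X[i:i + batch_size] for i in range(0, n_samples, batch_size)]

print(updates_per_epoch)   # 100
print(batches[0].shape)    # (10, 600, 8)
```

Each batch holds 10 complete time series, which matches the reading of [3] rather than a split along the timesteps dimension.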
Statefulness
The Keras documentation states:
You can set RNN layers to be 'stateful', which means that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch.
If I’m splitting my time series into several samples (like in the examples of [1] and [4]) so that the dependencies I’d like to model span across several batches, or the batch-spanning samples are otherwise correlated with each other, I may need a stateful net, otherwise not. Is that a correct and complete conclusion?
So for my problem I suppose I won’t need a stateful net. I’d build my training data as a 3D array of shape (samples, timesteps, features) and then call model.fit with a batch_size that is yet to be determined. Sample code could look like:
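from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
model = Sequential()
model.add(LSTM(32, input_shape=(600, 8), return_sequences=True)) # (timesteps, features)
model.add(LSTM(32, return_sequences=True)) # a stacked LSTM needs 3D input, so the layer before it must return the full sequence
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(32, return_sequences=True)) # keep all 600 timesteps to match the (1000, 600, 1) target
model.add(Dense(1, activation='linear')) # applied at every timestep -> (batch, 600, 1)
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2) # X: (1000, 600, 8), y: (1000, 600, 1)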

Let me explain it via an example:
So let's say you have the following series: 1,2,3,4,5,6,...,100. You have to decide how many timesteps your LSTM will learn from, and reshape your data accordingly, as shown below.
If you decide on time_steps = 5, you have to reshape your time series into a matrix of samples like this:
1,2,3,4,5 -> sample1
2,3,4,5,6 -> sample2
3,4,5,6,7 -> sample3
etc...
By doing so, you will end up with a matrix of shape (96 samples x 5 timesteps).
This matrix should be reshaped to (96 x 5 x 1), which tells Keras that you have just 1 time series. If you have more time series in parallel (as in your case), you do the same operation on each time series, so you will end up with n matrices (one for each time series), each of shape (96 samples x 5 timesteps).
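The windowing step for a single series can be sketched in NumPy as follows. This assumes NumPy 1.20 or newer for `sliding_window_view`; the variable names are illustrative only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(1, 101)   # 1, 2, ..., 100
time_steps = 5

# Every run of 5 consecutive values becomes one sample.
windows = sliding_window_view(series, time_steps)   # shape (96, 5)

# Add the trailing axis that says "1 time series" (1 feature per timestep).
X = windows[..., np.newaxis]                        # shape (96, 5, 1)

print(windows[0], windows[-1])   # [1 2 3 4 5] [ 96  97  98  99 100]
print(X.shape)                   # (96, 5, 1)
```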
For the sake of argument, let's say you have 3 time series. You should concatenate all three matrices into one single tensor of shape (96 samples x 5 timesteps x 3 time series). The first layer of your LSTM for this example would be:
model = Sequential()
model.add(LSTM(32, input_shape=(5, 3)))
The 32 as the first parameter is totally up to you. It means that at each point in time, your 3 time series are mapped to an output space of 32 variables. It is easiest to think of each time step as a fully connected layer with 3 inputs and 32 outputs, but with a different computation than FC layers.
If you are stacking multiple LSTM layers, use the return_sequences=True parameter, so that the layer outputs the whole sequence rather than just the last value.
Your target should be the next value in the series you want to predict.
Putting it all together, let's say you have the following time series:
Time series 1 (master): 1,2,3,4,5,6,..., 100
Time series 2 (support): 2,4,6,8,10,12,..., 200
Time series 3 (support): 3,6,9,12,15,18,..., 300
Create the input and target tensor
x -> y
1,2,3,4,5 -> 6
2,3,4,5,6 -> 7
3,4,5,6,7 -> 8
Reformat the remaining time series in the same way, but without targets, since you don't want to predict those series.
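Building the input and target tensors for the three example series could look like the sketch below. `make_windows` is a hypothetical helper written for this illustration, not a Keras function. Note that pairing each window with a next-value target leaves 95 samples rather than 96, because the last window (96..100) has no following value.

```python
import numpy as np

def make_windows(data, time_steps, target_col=0):
    """data: (n_points, n_series) -> X: (n, time_steps, n_series), y: (n, 1)."""
    n = len(data) - time_steps   # each window needs a following value as its target
    X = np.stack([data[i:i + time_steps] for i in range(n)])
    y = data[time_steps:, target_col].reshape(-1, 1)
    return X, y

master = np.arange(1, 101)
# Columns: master, support (x2), support (x3) -> shape (100, 3)
data = np.stack([master, 2 * master, 3 * master], axis=1)

X, y = make_windows(data, time_steps=5)
print(X.shape, y.shape)   # (95, 5, 3) (95, 1)
print(X[0, :, 0], y[0])   # [1 2 3 4 5] [6]
```

Only the master series (column 0) supplies targets; the support series appear in `X` as extra features.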
Create your model
model = Sequential()
model.add(LSTM(32, input_shape=(5, 3), return_sequences=True)) # input: (5 timesteps x 3 time series); output: (5 timesteps x 32 variables) because return_sequences=True
model.add(LSTM(8)) # output: 8 variables from the last timestep only, because return_sequences=False
model.add(Dense(1, activation='linear')) # output: 1 value per sample, which is compared to the target variable
Compile it and train. A batch size of 32 is a reasonable starting point. The batch size is the number of samples your sample matrix is split into per weight update, for faster computation. Just don't use stateful.
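On statefulness: a stateful layer carries the final state of one batch over as the initial state of the next, which only matters when one long sequence has been cut into consecutive chunks. The NumPy sketch below uses a toy tanh RNN to show the difference. It is not an LSTM and not Keras code; the random weights stand in for trained ones and all names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_units = 3, 4
W_in = 0.5 * rng.normal(size=(n_features, n_units))   # input weights (random stand-ins)
W_state = 0.5 * rng.normal(size=(n_units, n_units))   # recurrent weights

def run_rnn(x, h):
    """Run a toy tanh RNN over x of shape (timesteps, n_features), starting from state h."""
    for x_t in x:
        h = np.tanh(x_t @ W_in + h @ W_state)
    return h

x = rng.normal(size=(10, n_features))   # one sequence of 10 timesteps
h0 = np.zeros(n_units)

full = run_rnn(x, h0)                   # whole sequence in one go

# "Stateful": the state after the first chunk seeds the second chunk.
carried = run_rnn(x[7:], run_rnn(x[:7], h0))

# "Stateless": the state is reset to zero before the second chunk.
reset = run_rnn(x[7:], h0)

print(np.allclose(full, carried))   # True
print(np.allclose(full, reset))
```

In the question's setup every sample already contains its complete 600-step series, so there is nothing to carry between batches; in Keras the corresponding switch would be `stateful=True` on the recurrent layer.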

Related

Predicting Fibonacci Using LSTM RNN

New to neural nets so please correct my syntax.
I'm trying to create an LSTM RNN that will predict the Fibonacci sequence. When I run the code below, the loss remains incredibly high (around 35339663592701874176).
Why does the shape of the input have to be (batch_size, timesteps, input_dim)? In my example I have 100 data entries so that'd be my batch_size, and the Fibonacci sequence takes in 2 inputs so that'd be input_dim but what would timesteps be in this case? 1?
Shouldn't the units of the LSTM be 1? If I'm understanding correctly, the "units" are just the number of hidden state nodes in the LSTM. So in theory, each of the 2 inputs would have a "1" coefficient weight towards that hidden state after training.
Would an RNN be a suitable model for this problem? When I've looked online, most people like to use the Fibonacci sequence as an example to explain how RNN's work.
Thanks for the help!
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Create Training Data
xs = [[[1, 1]]]
ys = []
i = 0
while i < 100:
    ys.append([xs[i][0][0] + xs[i][0][1]])
    xs.append([[xs[i][0][1], ys[-1][0]]])
    i = i + 1
del xs[-1]  # drop the last input, which has no matching target
xs = np.array(xs, dtype=float)
ys = np.array(ys, dtype=float)
# Create Model
model = keras.Sequential()
model.add(layers.LSTM(1, input_shape=(1, 2)))
model.add(layers.Dense(1))
model.compile(optimizer="adam", loss="mean_absolute_error", metrics=[ 'accuracy' ])
# Train
model.fit(xs, ys, epochs=100000)

You can't feed a NN data where some of the values are 10^21 times as large as some of the others and expect it to work, it just doesn't happen.
You're not doing anything here that actually calls for LSTM (or any RNN), you're not actually using the time dimension, and you're basically just trying to learn addition. Maybe you meant to do something different (like input digits as a sequence, or have the output run for multiple timesteps and give you several values of the sequence), but that's not what you're doing, and it's unclear what you want.
The number of units is your memory/processing capacity. Each unit of an RNN is able to receive values from all of the units in the previous timestep. One unit alone can't do anything interesting, especially with no layer before it to preprocess the data.
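Both points can be checked without any neural network. The sketch below rebuilds the (a, b) -> a + b pairs from the question using exact Python integers, prints the spread of the raw values, and then fits ordinary least squares on the first 20 pairs only. Restricting the fit to 20 pairs is an assumption made here to keep the problem numerically well behaved in float64.

```python
import numpy as np

# Rebuild the (a, b) -> a + b pairs from the question with exact integers.
pairs, targets = [], []
a, b = 1, 1
for _ in range(100):
    pairs.append((a, b))
    targets.append(a + b)
    a, b = b, a + b

# Spread of the raw inputs: the largest is more than 10^20 times the smallest.
spread = pairs[-1][1] / pairs[0][0]
print(f"{spread:.1e}")

# On a numerically tame subset, a plain linear fit recovers the
# "add the two inputs" rule, with no recurrence involved.
X = np.array(pairs[:20], dtype=float)
y = np.array(targets[:20], dtype=float)
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(weights, 6))   # close to [1, 1]
```

A spread like this is what makes the unscaled data unusable for gradient-based training, and the linear fit shows that the task as posed reduces to learning addition.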

Trying to understand keras SimpleRNN

I have medical longitudinal data that I am doing research on.
To start with, I am working with 4000 sample rows, each with 3 time steps (3 columns) corresponding to the size of a bone measured in 3 different months.
I am done with the basic model. Now I want to make sure that my understanding of the model is correct.
model = Sequential()
model.add(layers.SimpleRNN(units=10, input_shape=(3,1),use_bias=True,bias_initializer='zeros',activation="relu",kernel_initializer="random_uniform"))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='sgd')
model.summary()
model.fit(trainX,train_op, epochs=100, batch_size=50, verbose=2)
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
These are my doubts about this model:
Here return_sequences is False, so shouldn't I get only the last output from the RNN layer? Why is the output of the RNN layer of shape (None, 10)? I assumed it should be (sample size, 1).
My logic below is flawed, but I need help resolving it:
Units corresponds to output units. Initially my guess was that since there are 3 time steps there have to be 3 output units, but I was surprised that the model works even if I give units=128, 10 or 1. How and why does this happen? This question, along with the one above, confuses me further.
input_shape corresponds to [sample size, number of time steps, features]. Here, I am measuring 1 bone size over 3 time periods. Is my understanding correct when I say the input shape is (sample size, 3, 1)? Moreover, I am confused about how numpy represents 3D arrays. It seems that to get the required dimensions I need to provide the input as (#features, observations/sample_size, timesteps). Do I have to reshape my inputs according to how numpy represents 3D arrays, or should I leave them as they are?
Moreover, how can I build a model if I have different sets of features measured over different time frames, or a varying number of time steps? How can I incorporate that into the above model?

Yes, you get the last output, which is a 10-dimensional vector, not a one-dimensional vector, so getting shape (samples, 10) is correct.
The number of units has nothing to do with the number of timesteps; the number of timesteps is how many times the neurons are applied recurrently, so it's orthogonal to the number of features or units.
Yes, the shape of your inputs should be (samples, 3, 1) and the input_shape should be (3, 1); all of this is correct in your code. I am not sure what you mean by "how numpy represents 3d array": the shape is clear, and numpy does not modify input shapes.
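For illustration, turning a 2D table of 4000 rows by 3 monthly measurements into the 3D array the layer expects only requires adding a trailing features axis. The random `table` below is a placeholder for the real data.

```python
import numpy as np

n_samples, n_timesteps = 4000, 3
table = np.random.rand(n_samples, n_timesteps)   # placeholder for the real measurements

# (samples, timesteps) -> (samples, timesteps, features) with 1 feature
trainX = table.reshape(n_samples, n_timesteps, 1)

print(trainX.shape)      # (4000, 3, 1)
print(trainX[0, :, 0])   # the 3 monthly measurements of the first sample
```

The axis order stays (samples, timesteps, features); nothing has to be rearranged to suit numpy.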

How to use Multivariate time-series prediction with Keras, when multiple samples are used

As the title states, I am doing multivariate time-series prediction. I have some experience with this situation and was able to successfully set up and train a working model in TF Keras.
However, I did not know the 'proper' way to handle having multiple unrelated time-series samples. I have about 8000 unique sample 'blocks' with anywhere from 800 time steps to 30,000 time steps per sample. Of course I couldn't concatenate them all into one single time series because the first points of sample 2 are not related in time with the last points of sample 1.
Thus my solution was to fit each sample individually in a loop (at great inefficiency).
My new idea: can/should I pad the start of each sample with a number of empty time steps equal to the RNN's look-back, and then concatenate the padded samples into one time series? This would mean that the first time step has look-back data of mostly 0's, which sounds like another 'hack' for my problem and not the right way to do it.

The main challenge is in 800 vs. 30,000 timesteps, but nothing you can't do.
Model design: group sequences into chunks - for example, 30 sequences of 800-to-900 timesteps, padded, then 60 sequences of 900-to-1000, etc. - don't have to be contiguous (i.e. next can be 1200-to-1500)
Input shape: (samples, timesteps, channels) - or equivalently, (sequences, timesteps, features)
Layers: Conv1D and/or RNNs - e.g. GRU, LSTM. Each can handle variable timesteps
Concatenation: don't do it. If each of your sequences is independent, then each must be fed along dimension 0 in Keras - the batch or samples dimension. If they are dependent, e.g. multivariate timeseries, like many channels in a signal - then feed them along the channels dimension (dim 2). But never concatenate along the timesteps dimension, as it implies causal continuity where none exists.
Stateful RNNs: can help in processing long sequences - info on how they work here
RNN capability: is limited w.r.t. long sequences, and 800 is already in danger zone even for LSTMs; I'd suggest dimensionality reduction via either autoencoders or CNNs w/ strides > 1 at input, then feeding their outputs to RNNs.
RNN training: is difficult. Long train times, hyperparameter sensitivity, vanishing gradients - but, with proper regularization, they can be powerful. More info here
Zero-padding: before/after/both - debatable, can read about it, but probably stay clear from "both" as learning to ignore paddings is easier with one locality; I personally use "before"
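RNN variant: use CuDNNLSTM or CuDNNGRU whenever possible, as they can be around 10x faster (in TF 2.x, the standard LSTM and GRU layers pick the cuDNN kernel automatically on GPU when left at their default activation and recurrent settings)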
Note: "samples" above, and in machine learning, refers to independent examples / observations, rather than measured signal datapoints (which would be referred to as timesteps).
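The grouping and "before" padding described in the points above can be sketched in NumPy as follows. `pad_before` is a hypothetical helper written for this illustration and the sequence lengths are invented; Keras ships a `pad_sequences` utility that also pads at the front by default.

```python
import numpy as np

def pad_before(sequences, value=0.0):
    """Left-pad sequences of shape (timesteps_i, channels) to a common length."""
    max_len = max(len(s) for s in sequences)
    channels = sequences[0].shape[1]
    batch = np.full((len(sequences), max_len, channels), value, dtype=float)
    for i, s in enumerate(sequences):
        batch[i, max_len - len(s):] = s   # real data sits at the end
    return batch

# One group of independent sequences with similar lengths, 2 channels each.
rng = np.random.default_rng(0)
group = [rng.normal(size=(n, 2)) for n in (800, 850, 900)]

batch = pad_before(group)
print(batch.shape)   # (3, 900, 2)
```

Each sequence occupies its own row along dimension 0, so no sequence is ever joined to another along the timesteps axis.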
Below is minimal code for what a timeseries-suited model could look like:
from tensorflow.keras.layers import Input, Conv1D, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import numpy as np
def make_data(batch_shape): # dummy data
    return (np.random.randn(*batch_shape),
            np.random.randint(0, 2, (batch_shape[0], 1)))
def make_model(batch_shape): # example model
    ipt = Input(batch_shape=batch_shape)
    x = Conv1D(filters=16, kernel_size=10, strides=2, padding='valid')(ipt)
    x = LSTM(units=16)(x)
    out = Dense(1, activation='sigmoid')(x) # assuming binary classification
    model = Model(ipt, out)
    model.compile(Adam(learning_rate=1e-3), 'binary_crossentropy') # `lr` is the older, deprecated argument name
    return model
batch_shape = (32, 100, 16) # 32 samples, 100 timesteps, 16 channels
x, y = make_data(batch_shape)
model = make_model(batch_shape)
model.train_on_batch(x, y)
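The strided Conv1D at the input is what shortens the sequence before it reaches the LSTM. A small sketch of the standard output-length formula for padding='valid' (no dilation) shows by how much; the helper name is made up for this illustration.

```python
def conv1d_valid_length(timesteps, kernel_size, strides):
    """Output timesteps of a 1D convolution with padding='valid' (no dilation)."""
    return (timesteps - kernel_size) // strides + 1

print(conv1d_valid_length(100, kernel_size=10, strides=2))   # 46
print(conv1d_valid_length(800, kernel_size=10, strides=2))   # 396
```

With the settings from the example model, the LSTM sees 46 steps instead of 100, and an 800-step sequence would be cut to 396.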

The Diagram explanation of the LSTM Network

I am working with LSTM for my time series forecasting problem. I have the following network:
model = Sequential()
model.add(LSTM(units=300, activation=activation, input_shape=(20, 1))) # `activation` is a variable defined elsewhere
model.add(Dense(20))
My forecasting problem is to forecast the next 20 time steps looking back the last 20 time steps. So, for each iteration, I have an input shape like (x_t-20...x_t) and forecast the next (x_t+1...x_t+20). For the hidden layer, I use 300 hidden units.
As an LSTM is different from a simple feed-forward neural network, I cannot understand how those 300 hidden units are used for the LSTM cells and how the output comes out. Are there 20 LSTM cells and 300 units for each cell? How is the output generated from these cells? As I described above, I have 20 time steps to predict; are all of these steps generated from the last LSTM cell? I have no idea. Can someone give a diagram example of this kind of network structure?

Regarding these questions,
I cannot understand how those 300 hidden units are used for the LSTM cells and how the output comes out. Are there 20 LSTM cells and 300 units for each cell? How is the output generated from these cells?
It is simpler to consider the LSTM layer you have defined as a single block. This diagram is heavily borrowed from Francois Chollet's Deep Learning with Python book:
In your model, input shape is defined as (20,1), so you have 20 time-steps of size 1. For a moment, consider that the output Dense layer is not present.
model = Sequential()
model.add(LSTM(300, input_shape=(20,1)))
model.summary()
lstm_7 (LSTM) (None, 300) 362400
The output shape of the LSTM layer is (None, 300), which means that each sample produces an output vector of size 300.
output = model.predict(np.zeros((1, 20, 1)))
print(output.shape)
(1, 300)
input (1,20,1) => batch size = 1, time-steps = 20, input-feature-size = 1.
output (1, 300) => batch size = 1, output-feature-size = 300
Keras recurrently ran the LSTM for 20 time-steps and generated an output of size (300). In the diagram above, this is Output t+19.
Now, if you add the Dense layer after LSTM, the output will be of size 20 which is straightforward.
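The parameter count of 362400 in the summary can be reproduced by hand, which also shows that the number of timesteps (20) plays no part in it. The helper functions below are illustrative, assuming a standard LSTM layer with biases.

```python
def lstm_param_count(units, input_dim):
    """Trainable parameters of a standard LSTM layer with biases.

    Four gates/transforms, each with an input kernel (input_dim x units),
    a recurrent kernel (units x units) and a bias (units).
    """
    return 4 * (input_dim * units + units * units + units)

def dense_param_count(units, input_dim):
    """Weights (input_dim x units) plus one bias per unit."""
    return input_dim * units + units

print(lstm_param_count(units=300, input_dim=1))     # 362400
print(dense_param_count(units=20, input_dim=300))   # 6020
```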

To understand LSTMs, I'd recommend first spending a few minutes to understand 'plain vanilla' RNNs, as LSTMs are just a more complex version of that. I'll try to describe what's happening in your network if it was a basic RNN.
You are training a single set of weights that are repeatedly used for each time step (t-20,...,t). The first weight (let's say W1) is for inputs. One by one, each of x_t-20,...,x_t is multiplied by W1, then a non-linear activation function is applied - same as any NN forward pass.
The difference with RNNs is the presence of a separate 'state' (note: not a trained weight), that can start off random or zero, and carries information about your sequence across time steps. There's another weight for the state (W2). So starting at the first time step t-20, the initial state is multiplied by W2 and an activation function applied.
So at timestep t-20 we have the output from W1 (on inputs) and W2 (on state). We can combine these outputs at each timestep, and use it to generate the state to pass to the next timestep, i.e. t-19. Because the state has to be calculated at each timestep and passed to the next, these calculations have to happen sequentially starting from t-20. To generate our desired output, we can take each output state across all timesteps - or only take the output at the final timestep. As return_sequences=False by default in Keras, you are only using the output at the final timestep, which then goes into your dense layer.
The weight W1 needs to have one dimension equal to the dimension of each timestep input x_t-20... for matrix multiplication to work. This dimension is 1 in your case, as each of the 20 inputs is a 1d vector (or number), which is multiplied by W1. However, we're free to set the second dimension of the weights as we please - 300 in your case. So W1 is of size 1x300, and is multiplied 20 times, once for each timestep. W2 acts on the state rather than on the input, so it is of size 300x300.
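The flow described above can be written out as a short NumPy sketch of a basic tanh RNN. This is not an LSTM and not what Keras runs internally; the random weights stand in for trained ones, and the bias `b` and all variable names are assumptions added for this illustration. The input weights are 1x300 and the state weights 300x300, so the state keeps its size of 300 from step to step.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, input_dim, units = 20, 1, 300

W1 = rng.normal(scale=0.1, size=(input_dim, units))   # input weights, 1 x 300
W2 = rng.normal(scale=0.1, size=(units, units))       # state weights, 300 x 300
b = np.zeros(units)

x = rng.normal(size=(timesteps, input_dim))   # one sample with 20 timesteps
state = np.zeros(units)                       # initial state, not a trained weight

outputs = []
for x_t in x:                                     # strictly sequential
    state = np.tanh(x_t @ W1 + state @ W2 + b)    # same weights at every timestep
    outputs.append(state)

last_output = outputs[-1]         # what return_sequences=False passes on
all_outputs = np.stack(outputs)   # what return_sequences=True would pass on

print(last_output.shape)   # (300,)
print(all_outputs.shape)   # (20, 300)
```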
This lecture will take you through the basic flow diagram of RNNs that I described above, all the way to more advanced stuff which you can skip. This is a famous explanation of LSTMs if you want to make the leap from basic RNNs to LSTMs, which you may not need to do - there are just more complicated weights and states.