Proper way to feed time-series data to stateful LSTM?


Let's suppose I have a sequence of integers:
0,1,2, ..
and want to predict the next integer given the last 3 integers, e.g.:
[0,1,2]->3, [3,4,5]->6, etc
Suppose I setup my model like so:
batch_size=1
time_steps=3
model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, time_steps, 1), stateful=True))
model.add(Dense(1))
It is my understanding that model has the following structure (please excuse the crude drawing):
First Question: is my understanding correct?
Note I have drawn the previous states C_{t-1}, h_{t-1} entering the picture as this is exposed when specifying stateful=True. In this simple "next integer prediction" problem, the performance should improve by providing this extra information (as long as the previous state results from the previous 3 integers).
This brings me to my main question: It seems the standard practice (for example see this blog post and the TimeseriesGenerator keras preprocessing utility), is to feed a staggered set of inputs to the model during training.
For example:
batch0: [[0, 1, 2]]
batch1: [[1, 2, 3]]
batch2: [[2, 3, 4]]
etc
This has me confused because it seems this requires the output of the 1st LSTM cell (corresponding to the 1st time step). See this figure:
From the tensorflow docs:
stateful: Boolean (default False). If True, the last state for each
sample at index i in a batch will be used as initial state for the
sample of index i in the following batch.
it seems this "internal" state isn't available and all that is available is the final state. See this figure:
So, if my understanding is correct (which it's clearly not), shouldn't we be feeding non-overlapped windows of samples to the model when using stateful=True? E.g.:
batch0: [[0, 1, 2]]
batch1: [[3, 4, 5]]
batch2: [[6, 7, 8]]
etc

The answer is: depends on problem at hand. For your case of one-step prediction - yes, you can, but you don't have to. But whether you do or not will significantly impact learning.
Batch vs. sample mechanism ("see AI" = see "additional info" section)
All models treat samples as independent examples; a batch of 32 samples is like feeding 1 sample at a time, 32 times (with differences - see AI). From model's perspective, data is split into the batch dimension, batch_shape[0], and the features dimensions, batch_shape[1:] - the two "don't talk." The only relation between the two is via the gradient (see AI).
Overlap vs no-overlap batch
Perhaps the best approach to understand it is information-based. I'll begin with timeseries binary classification, then tie it to prediction: suppose you have 10-minute EEG recordings, 240000 timesteps each. Task: seizure or non-seizure?
As 240k is too much for an RNN to handle, we use CNN for dimensionality reduction
We have the option to use "sliding windows" - i.e. feed a subsegment at a time; let's use 54k
Take 10 samples, shape (240000, 1). How to feed?
(1) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[54000:108000] ...
(2) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[1:54001] ...
Which of the two above do you take? If (2), your neural net will never confuse a seizure for a non-seizure for those 10 samples. But it'll also be clueless about any other sample. I.e., it will massively overfit, because the information it sees per iteration barely differs (1/54000 = 0.0019%) - so you're basically feeding it the same batch several times in a row. Now suppose (3):
(10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[27000:81000] ...
A lot more reasonable; now our windows have a 50% overlap, rather than 99.998%.
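To make the overlap options concrete, here is a minimal NumPy sketch. The helper name sliding_windows and the toy signal are my own assumptions, not part of any library; the stride controls the overlap (overlap fraction = 1 - stride / window).

```python
import numpy as np

def sliding_windows(signal, window, stride):
    """Return a (num_windows, window, 1) array of windows taken every `stride` steps."""
    starts = range(0, len(signal) - window + 1, stride)
    return np.stack([signal[s:s + window] for s in starts])[..., np.newaxis]

signal = np.arange(240_000, dtype=np.float32)   # stand-in for one EEG recording
window = 54_000

no_overlap = sliding_windows(signal, window, stride=window)         # no overlap
half_overlap = sliding_windows(signal, window, stride=window // 2)  # 50% overlap

print(no_overlap.shape)    # (4, 54000, 1)
print(half_overlap.shape)  # (7, 54000, 1)
```

With stride=1 you would get the 99.998% overlap case, where consecutive windows are almost identical.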
Prediction: overlap bad?
If you are doing a one-step prediction, the information landscape is now changed:
Chances are, your sequence length is faaar from 240000, so overlaps of any kind don't suffer the "same batch several times" effect
Prediction fundamentally differs from classification in that, the labels (next timestep) differ for every subsample you feed; classification uses one for the entire sequence
This dramatically changes your loss function, and what is 'good practice' for minimizing it:
A predictor must be robust to its initial sample, especially for LSTM - so we train for every such "start" by sliding the sequence as you have shown
Since labels differ timestep-to-timestep, the loss function changes substantially timestep-to-timestep, so risks of overfitting are far less
What should I do?
First, make sure you understand this entire post, as nothing here's really "optional." Then, here's the key about overlap vs no-overlap, per batch:
One sample shifted: model learns to better predict one step ahead for each starting step - meaning: (1) LSTM's robust against initial cell state; (2) LSTM predicts well for any step ahead given X steps behind
Many samples, shifted in later batch: model less likely to 'memorize' train set and overfit
Your goal: balance the two; 1's main edge over 2 is:
2 can handicap the model by making it forget seen samples
1 allows model to extract better quality features by examining the sample over several starts and ends (labels), and averaging the gradient accordingly
Should I ever use (2) in prediction?
If your sequence lengths are very long and you can afford to "slide window" w/ ~50% its length, maybe, but depends on the nature of data: signals (EEG)? Yes. Stocks, weather? Doubt it.
Many-to-many prediction: more common to see (2), in large part for longer sequences.
LSTM stateful: may actually be entirely useless for your problem.
Stateful is used when LSTM can't process the entire sequence at once, so it's "split up" - or when different gradients are desired from backpropagation. With former, the idea is - LSTM considers former sequence in its assessment of latter:
t0=seq[0:50]; t1=seq[50:100] makes sense; t0 logically leads to t1
seq[0:50] --> seq[1:51] makes no sense; t1 doesn't causally derive from t0
In other words: do not overlap in stateful in separate batches. Same batch is OK, as again, independence - no "state" between the samples.
When to use stateful: when LSTM benefits from considering previous batch in its assessment of the next. This can include one-step predictions, but only if you can't feed the entire seq at once:
Desired: 100 timesteps. Can do: 50. So we set up t0, t1 as in above's first bullet.
Problem: not straightforward to implement programmatically. You'll need to find a way to feed to LSTM while not applying gradients - e.g. freezing weights or setting lr = 0.
When and how does LSTM "pass states" in stateful?
When: only batch-to-batch; samples are entirely independent
How: in Keras, only batch-sample to batch-sample: stateful=True requires you to specify batch_shape instead of input_shape - because, Keras builds batch_size separate states of the LSTM at compiling
Per above, you cannot do this:
# sampleNM = sample N at timestep(s) M
batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample21, sample41, sample11, sample31]
This implies 21 causally follows 10 - and will wreck training. Instead do:
batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample11, sample21, sample31, sample41]
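A rough sketch of how such aligned, non-overlapping batches could be built with NumPy. The generator stateful_batches and the toy sequences are hypothetical names for illustration; the point is only that row i of every batch comes from sequence i, and each batch starts where the previous one ended.

```python
import numpy as np

def stateful_batches(sequences, window):
    """Yield batches of shape (num_sequences, window, 1); row i always holds sequence i."""
    num_steps = sequences.shape[1]
    for start in range(0, num_steps - window + 1, window):  # stride == window: no overlap
        yield sequences[:, start:start + window, np.newaxis]

# Four toy sequences of 6 timesteps each; sequence n counts up from 100 * n.
sequences = np.stack([100 * n + np.arange(6) for n in range(1, 5)])

batches = list(stateful_batches(sequences, window=3))

# Row i of batch k+1 begins right where row i of batch k ended.
for prev, nxt in zip(batches, batches[1:]):
    assert np.all(nxt[:, 0, 0] == prev[:, -1, 0] + 1)

print(batches[0][:, :, 0])
print(batches[1][:, :, 0])
```

Feeding these in order (no shuffling) keeps the carried state meaningful for every row.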
Batch vs. sample: additional info
A "batch" is a set of samples - 1 or greater (assume always latter for this answer). Three approaches to iterate over data: Batch Gradient Descent (entire dataset at once), Stochastic GD (one sample at a time), and Minibatch GD (in-between). (In practice, however, we call the last SGD also and only distinguish vs BGD - assume it so for this answer.) Differences:
SGD never actually optimizes the train set's loss function - only its 'approximations'; every batch is a subset of the entire dataset, and the gradients computed only pertain to minimizing loss of that batch. The greater the batch size, the better its loss function resembles that of the train set.
Above can extend to fitting batch vs. sample: a sample is an approximation of the batch - or, a poorer approximation of the dataset
First fitting 16 samples and then 16 more is not the same as fitting 32 at once - since weights are updated in-between, so model outputs for the latter half will change
The main reason for picking SGD over BGD is not, in fact, computational limitations - but that it's superior, most of the time. Explained simply: a lot easier to overfit with BGD, and SGD converges to better solutions on test data by exploring a more diverse loss space.
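The point about 16 + 16 vs. 32 samples can be checked on a toy problem. This is only a sketch with a linear model and plain gradient descent; the data, learning rate and helper grad are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=32)

def grad(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

lr = 0.1
w0 = np.zeros(3)

# One update on all 32 samples.
w_full = w0 - lr * grad(w0, X, y)

# Two updates on 16 samples each; the second gradient is taken at the updated weights.
w_half = w0 - lr * grad(w0, X[:16], y[:16])
w_half = w_half - lr * grad(w_half, X[16:], y[16:])

print(w_full)
print(w_half)
print(np.allclose(w_full, w_half))
```

The second half-batch sees weights that have already moved, so the two procedures end up at different points.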

Related

What will happen if I use batch normalization but set batch_size=1? [duplicate]

What will happen when I use batch normalization but set batch_size = 1?
Because I am using 3D medical images as training dataset, the batch size can only be set to 1 because of GPU limitation. Normally, I know, when batch_size = 1, variance will be 0. And (x-mean)/variance will lead to error because of division by 0.
But why did errors not occur when I set batch_size = 1? Why my network was trained as good as I expected? Could anyone explain it?
Some people argued that:
The ZeroDivisionError may not be encountered because of two cases. First, the exception is caught in a try catch block. Second, a small rational number is added ( 1e-19 ) to the variance term so that it is never zero.
But some people disagree. They said that:
You should calculate mean and std across all pixels in the images of the batch. (So even batch_size = 1, there are still a lot of pixels in the batch. So the reason why batch_size=1 can still work is not because of 1e-19)
I have checked the Pytorch source code, and from the code I think the latter one is right.
Does anyone have different opinion???

variance will be 0
No, it won't; BatchNormalization computes statistics only with respect to a single axis (usually the channels axis, =-1 (last) by default); every other axis is collapsed, i.e. summed over for averaging; details below.
More importantly, however, unless you can explicitly justify it, I advise against using BatchNormalization with batch_size=1; there are strong theoretical reasons against it, and multiple publications have shown BN performance degrade for batch_size under 32, and severely for <=8. In a nutshell, batch statistics "averaged" over a single sample vary greatly sample-to-sample (high variance), and BN mechanisms don't work as intended.
Small mini-batch alternatives: Batch Renormalization -- Layer Normalization -- Weight Normalization
Implementation details: from source code:
reduction_axes = list(range(len(input_shape)))
del reduction_axes[self.axis]
Eventually, tf.nn.moments is called with axes=reduction_axes, which reduces over those axes to compute the mean and variance. Then, in the TensorFlow backend, mean and variance are passed to tf.nn.batch_normalization to return train- or inference-normalized inputs.
In other words, if your input is (batch_size, height, width, depth, channels), or (1, height, width, depth, channels), then BN will run calculations over the 1, height, width, and depth dimensions.
Can variance ever be zero? - yes, if every single datapoint for any given channel slice (along every dimension) is the same. But this should be near-impossible for real data.
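To see this numerically, here is a NumPy sketch that mimics the reduction (it is not the Keras implementation, just the same arithmetic on random data; the input shape is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8, 8, 8, 4))   # (batch, height, width, depth, channels)

channel_axis = -1
reduction_axes = list(range(x.ndim))
del reduction_axes[channel_axis]        # -> [0, 1, 2, 3]

mean = x.mean(axis=tuple(reduction_axes))
var = x.var(axis=tuple(reduction_axes))

epsilon = 1e-3
x_norm = (x - mean) / np.sqrt(var + epsilon)

print(mean.shape, var.shape)  # (4,) (4,)
print(var)                    # one variance per channel, none of them zero
```

Even with batch_size=1, each channel's statistics are computed over 8*8*8 = 512 values.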
Other answers: first one is misleading:
a small rational number is added (1e-19) to the variance
This doesn't happen in computing variance, but it is added to variance when normalizing; nonetheless, it is rarely necessary, as variance is far from zero. Also, the epsilon term is actually defaulted to 1e-3 by Keras; it serves roles in regularizing, beyond mere avoiding zero-division.
Update: I failed to address an important piece of intuition with suspecting variance to be 0; indeed, the batch statistics variance is zero, since there is only one statistic - but the "statistic" itself concerns the mean & variance of the channel + spatial dimensions. In other words, the variance of the mean & variance (of the single train sample) is zero, but the mean & variance themselves aren't.

when batch_size = 1, variance will be 0
No, because when you compute mean and variance for BN (for example using tf.nn.moments) you will be computing it over axis [0, 1, 2] (assuming you have NHWC tensor channels order).
From "Group Normalization" paper:
https://arxiv.org/pdf/1803.08494.pdf
With batch_size=1 batch normalization is equal to instance normalization and it can be helpful in some tasks.
But if you are using some sort of encoder-decoder and in some layer you have a tensor with a spatial size of 1x1, it will be a problem: each channel has only one value, the mean equals that value, and so BN will zero out the information.
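A tiny NumPy sketch of that failure case (the values and the epsilon of 1e-3 are assumptions for illustration):

```python
import numpy as np

x = np.array([[[[3.0, -1.0, 7.5]]]])      # shape (1, 1, 1, 3): NHWC, batch 1, spatial 1x1

mean = x.mean(axis=(0, 1, 2))              # equals the single value in each channel
var = x.var(axis=(0, 1, 2))                # exactly zero for each channel

x_norm = (x - mean) / np.sqrt(var + 1e-3)  # epsilon keeps the division defined
print(x_norm)                              # every entry is 0: the input is erased
```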

I have machine learning data with binary features. How can I force an autoencoder to return binary data?

I have a dataset of the following form: A series of M observations of N-dimensional data. In order to obtain latent factors from this data, I wish to make a single hidden-layer autoencoder trained on this data. Every dimension of a single observation is either a 0 or a 1. But the keras Model returns floats. Is there a way to add a layer to enforce a 0 or 1 as output?
I tried using a simple keras Model to solve this problem. It claims good accuracy on the data, but when looking at the raw data it predicts the 0's correctly and often completely ignores the 1's.
n_nodes = 50
input_1 = tf.keras.layers.Input(shape=(x_train.shape[1],))
x = tf.keras.layers.Dense(n_nodes, activation='relu')(input_1)
output_1 = tf.keras.layers.Dense(x_train.shape[1], activation='sigmoid')(x)
model = tf.keras.models.Model(input_1, output_1)
my_optimizer = tf.keras.optimizers.RMSprop()
my_optimizer.lr = 0.002
model.compile(optimizer=my_optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10000)
predictions = model.predict(x_test)
These observations I then validate by looking at all experiments and seeing if a large (>0.1) value is returned for the elements which are 1. The performance is very poor on the 1's.
I have seen that the loss converges around 10000 epochs. However, the autoencoder fails to properly predict almost all 1's in the data set. Even when setting the width of the hidden layer to be identical to the dimensionality of the data (n_nodes = x_train.shape[1]) the autoencoder still gives bad performance, even worsening if I increase the width of the hidden layer.

[0, 1] outputs should generally be rounded such that >=0.5 rounds to 1 when outputting a final prediction and <0.5 rounds to 0. However your labels should be float values {0.0, 1.0} for the loss function (which I expect they are already). You can compute accuracy by rounding the outputs and comparing to your binary labels to count errors for {0, 1}, but they must be in continuous form [0.0, 1.0] for the loss and gradient calculations to work.
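As a sketch of that evaluation step (the output values below are made up; in practice they would come from model.predict):

```python
import numpy as np

labels = np.array([[0.0, 1.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0, 0.0]])
outputs = np.array([[0.10, 0.80, 0.40, 0.30],    # hypothetical sigmoid outputs
                    [0.60, 0.20, 0.50, 0.05]])

predictions = (outputs >= 0.5).astype(np.float32)  # >= 0.5 rounds up to 1
accuracy = (predictions == labels).mean()
recall_on_ones = predictions[labels == 1].mean()   # fraction of true 1's recovered

print(predictions)
print(accuracy, recall_on_ones)
```

I use an explicit >= 0.5 comparison rather than np.round, because np.round sends exactly 0.5 to 0. Looking at the recall on the 1's separately is useful when 0's dominate, since plain accuracy can look good while the 1's are ignored.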
If you are doing all of that (and it does appear that things are set up correctly in your code), there might be a number of reasons for poor performance:
Your dense, "constriction" layer should be significantly smaller than your input. In making it smaller you are forcing the auto-encoder to learn a representative form of the input that can be used to produce the output. This representative form is likely to generalize well. If you increase the size of your hidden layer the network will have much more capacity to memorize the inputs.
You might have many more 0 values than 1 values. If this is the case, then in the absence of actual learning the network could get stuck just predicting 0 as a "best guess" because that's "usually right". This is a harder problem to tackle. You might consider multiplying the loss by a vector of labels * eta + 1; this would effectively increase the learning rate of the 1 labels. Example: your labels are [0, 1, 0], eta is a hyper-parameter value >1, let's say eta=2.0. Then labels * eta + 1 = [1.0, 3.0, 1.0], which scales up the gradient signal for 1 values by increasing the loss for only the 1's. This isn't a bulletproof method of increasing the importance of the 1's class, but it's something simple to try. If it makes any improvement then follow up on this line of reasoning in more detail.
You have 1 hidden layer, which limits the relationships the network can model; you might try 3 hidden layers to add more non-linearity. Your center layer should be fairly small, try something like 5 or 10 neurons; it should squeeze the data into a fairly tight constriction point to extract a general-purpose representation.
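The label-weighting idea from the second point could look roughly like this in NumPy. This is a sketch, not a Keras loss; the function weighted_bce and its arguments are hypothetical names, and the same arithmetic would have to be rewritten with backend ops to be used in training.

```python
import numpy as np

def weighted_bce(labels, outputs, eta=2.0, tiny=1e-7):
    """Binary cross-entropy where entries labelled 1 are weighted by (eta + 1)."""
    outputs = np.clip(outputs, tiny, 1.0 - tiny)        # avoid log(0)
    bce = -(labels * np.log(outputs) + (1.0 - labels) * np.log(1.0 - outputs))
    weights = labels * eta + 1.0                        # [0, 1, 0] -> [1, 3, 1] for eta = 2
    return (weights * bce).mean()

labels = np.array([0.0, 1.0, 0.0])
always_zero = np.array([0.05, 0.05, 0.05])   # a model that ignores the 1's

print(weighted_bce(labels, always_zero, eta=0.0))  # unweighted loss
print(weighted_bce(labels, always_zero, eta=2.0))  # missing the 1 is penalized more
```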

LSTM - time series predictions

I am following this tutorial LSTM and I wonder how to map this to a multi-time series input. I have a dataset of several time-series and I want to predict for each time series the future. I don't know how to scale LSTM to several time-series.
The aim is to avoid building a separate model for each time series, as I have 40k time series.
Thank you

Process one by one
Just do exactly the same in a loop like this:
for epoch in range(numberOfEpochs):
    for sequence in yourSequences:
        model.reset_states()
        # 1 - do the entire training for this sequence (1 epoch only)
        #     you may use "model.train_on_batch" to avoid some overhead in "fit"
        # or 2 - do the entire predictions for this sequence
Process all together
Just pack the series in the first dimension of the input. No change is necessary in the model
When defining the input shape, use batch_shape=(number_of_time_series,length,features) in an Input layer or batch_input_shape=(number_of_time_series,length,features) in the first layer. (You may need a smaller batch size, because 40K is too much)
Make sure to use shuffle=False in every training command.
If your batch is not 40k, make sure to process the entire length (the entire training or prediction) of each batch, then you use model.reset_states() and start a new group of sequences.
batch_size = ....
for epoch in range(numberOfEpochs):
    firstSeq = 0
    lastSeq = firstSeq + batch_size
    while lastSeq <= len(sequences):
        model.reset_states()
        batch = sequences[firstSeq:lastSeq]
        # train the entire batch (one epoch only)
        # or predict for the entire batch
        firstSeq += batch_size
        lastSeq += batch_size

Since you are using separate time series, I don't think keeping stateful=True is a good idea.
Actually, your problem is closer to the 'generic' use of LSTMs.
Try to concatenate your series in a 2D array where each row corresponds to one series. Then reshape your data to (number_of_series, timesteps (the length of a single series), 1) and feed it to your network.
Depending on the length of your series, you may need to read this: https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/
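The reshape itself is a one-liner; a small sketch with made-up data (5 series of 20 timesteps, both numbers are assumptions):

```python
import numpy as np

# Hypothetical data: 5 series, each 20 timesteps long, one row per series.
series_2d = np.arange(5 * 20, dtype=np.float32).reshape(5, 20)

# Add a trailing feature axis: (number_of_series, timesteps, 1).
lstm_input = series_2d.reshape(series_2d.shape[0], series_2d.shape[1], 1)

print(lstm_input.shape)  # (5, 20, 1)
```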

The real potential of LSTM models for time series forecasting can be exploited by building a global model using all the time series, instead of a univariate model, which ignores any cross-series information available across your time series.
We implement the use case you are referring to by introducing a 'Moving Window Approach' strategy that involves modeling a multiple input and output mapping, where you can pool time series that have different lengths. A more detailed discussion of this strategy is given in section 3.4 of our paper [1]. Here, you basically produce multiple input and output tuples for the given set of time series and then pool them together for LSTM training. This works even if your time series have different lengths.
[1] https://arxiv.org/pdf/1710.03222.pdf
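A rough sketch of the pooling idea, not the paper's implementation: the helper moving_windows, the window sizes n_in and n_out, and the toy series are all assumptions for illustration.

```python
import numpy as np

def moving_windows(series, n_in, n_out):
    """Cut one series into (input window, output window) pairs, sliding one step at a time."""
    inputs, outputs = [], []
    for start in range(len(series) - n_in - n_out + 1):
        inputs.append(series[start:start + n_in])
        outputs.append(series[start + n_in:start + n_in + n_out])
    return inputs, outputs

# Hypothetical series of different lengths.
all_series = [np.arange(10.0), np.arange(100.0, 107.0), np.arange(200.0, 206.0)]

pooled_x, pooled_y = [], []
for series in all_series:
    x, y = moving_windows(series, n_in=4, n_out=2)
    pooled_x.extend(x)
    pooled_y.extend(y)

X = np.array(pooled_x)[..., np.newaxis]   # (total_windows, n_in, 1)
Y = np.array(pooled_y)                    # (total_windows, n_out)
print(X.shape, Y.shape)                   # (8, 4, 1) (8, 2)
```

Because every window has the same size, series of length 10, 7 and 6 all end up in the same training arrays.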

Understanding Keras LSTMs

I am trying to reconcile my understanding of LSTMs, as described in this post by Christopher Olah, with their implementation in Keras. I am following the blog written by Jason Brownlee for the Keras tutorial. What I am mainly confused about is,
The reshaping of the data series into [samples, time steps, features] and,
The stateful LSTMs
Lets concentrate on the above two questions with reference to the code pasted below:
# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], look_back, 1))
testX = numpy.reshape(testX, (testX.shape[0], look_back, 1))
########################
# The IMPORTANT BIT
##########################
# create and fit the LSTM network
batch_size = 1
model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, look_back, 1), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
for i in range(100):
    model.fit(trainX, trainY, nb_epoch=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
Note: create_dataset takes a sequence of length N and returns a N-look_back array of which each element is a look_back length sequence.
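For readers without the tutorial at hand, a function matching that description could look like the sketch below. This is my own reconstruction from the note above, not the tutorial's code, and the exact number of rows the tutorial's version returns may differ.

```python
import numpy as np

def create_dataset(sequence, look_back):
    """Return (X, Y): X[i] is sequence[i:i+look_back], Y[i] is the value that follows it."""
    X = [sequence[i:i + look_back] for i in range(len(sequence) - look_back)]
    Y = [sequence[i + look_back] for i in range(len(sequence) - look_back)]
    return np.array(X), np.array(Y)

sequence = np.arange(10.0)
X, Y = create_dataset(sequence, look_back=3)
print(X.shape, Y.shape)  # (7, 3) (7,)
print(X[0], Y[0])        # [0. 1. 2.] 3.0
```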
What is Time Steps and Features?
As can be seen TrainX is a 3-D array with Time_steps and Feature being the last two dimensions respectively (3 and 1 in this particular code). With respect to the image below, does this mean that we are considering the many to one case, where the number of pink boxes are 3? Or does it literally mean the chain length is 3 (i.e. only 3 green boxes considered).
Does the features argument become relevant when we consider multivariate series? e.g. modelling two financial stocks simultaneously?
Stateful LSTMs
Does stateful LSTMs mean that we save the cell memory values between runs of batches? If this is the case, batch_size is one, and the memory is reset between the training runs so what was the point of saying that it was stateful. I'm guessing this is related to the fact that training data is not shuffled, but I'm not sure how.
Any thoughts?
Image reference: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Edit 1:
A bit confused about @van's comment about the red and green boxes being equal. So just to confirm, do the following API calls correspond to the unrolled diagrams? Especially noting the second diagram (batch_size was arbitrarily chosen):
Edit 2:
For people who have done Udacity's deep learning course and still confused about the time_step argument, look at the following discussion: https://discussions.udacity.com/t/rnn-lstm-use-implementation/163169
Update:
It turns out model.add(TimeDistributed(Dense(vocab_len))) was what I was looking for. Here is an example: https://github.com/sachinruk/ShakespeareBot
Update2:
I have summarised most of my understanding of LSTMs here: https://www.youtube.com/watch?v=ywinX5wgdEU

As a complement to the accepted answer, this answer shows keras behaviors and how to achieve each picture.
General Keras behavior
The standard keras internal processing is always a many to many as in the following picture (where I used features=2, pressure and temperature, just as an example):
In this image, I increased the number of steps to 5, to avoid confusion with the other dimensions.
For this example:
We have N oil tanks
We spent 5 hours taking measures hourly (time steps)
We measured two features:
Pressure P
Temperature T
Our input array should then be something shaped as (N,5,2):
[ Step1 Step2 Step3 Step4 Step5
Tank A: [[Pa1,Ta1], [Pa2,Ta2], [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B: [[Pb1,Tb1], [Pb2,Tb2], [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
....
Tank N: [[Pn1,Tn1], [Pn2,Tn2], [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
]
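If pressure and temperature are stored separately, one way to get the (N,5,2) layout is to stack them on a new last axis. A small sketch with random stand-in readings (the array names and N=3 are assumptions):

```python
import numpy as np

n_tanks, n_steps = 3, 5
rng = np.random.default_rng(0)
pressure = rng.uniform(1.0, 2.0, size=(n_tanks, n_steps))        # hypothetical readings
temperature = rng.uniform(20.0, 30.0, size=(n_tanks, n_steps))

# Stack the two features on a new last axis: (tanks, steps, features).
inputs = np.stack([pressure, temperature], axis=-1)

print(inputs.shape)   # (3, 5, 2)
print(inputs[0, 0])   # [pressure, temperature] of tank A at step 1
```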
Inputs for sliding windows
Often, LSTM layers are supposed to process the entire sequences. Dividing windows may not be the best idea. The layer has internal states about how a sequence is evolving as it steps forward. Windows eliminate the possibility of learning long sequences, limiting all sequences to the window size.
In windows, each window is part of a long original sequence, but by Keras they will be seen each as an independent sequence:
[ Step1 Step2 Step3 Step4 Step5
Window A: [[P1,T1], [P2,T2], [P3,T3], [P4,T4], [P5,T5]],
Window B: [[P2,T2], [P3,T3], [P4,T4], [P5,T5], [P6,T6]],
Window C: [[P3,T3], [P4,T4], [P5,T5], [P6,T6], [P7,T7]],
....
]
Notice that in this case, you have initially only one sequence, but you're dividing it in many sequences to create windows.
The concept of "what is a sequence" is abstract. The important parts are:
you can have batches with many individual sequences
what makes the sequences be sequences is that they evolve in steps (usually time steps)
Achieving each case with "single layers"
Achieving standard many to many:
You can achieve many to many with a simple LSTM layer, using return_sequences=True:
outputs = LSTM(units, return_sequences=True)(inputs)
#output_shape -> (batch_size, steps, units)
Achieving many to one:
Using the exact same layer, keras will do the exact same internal preprocessing, but when you use return_sequences=False (or simply ignore this argument), keras will automatically discard the steps previous to the last:
outputs = LSTM(units)(inputs)
#output_shape -> (batch_size, units) --> steps were discarded, only the last was returned
Achieving one to many
Now, this is not supported by keras LSTM layers alone. You will have to create your own strategy to multiply the steps. There are two good approaches:
Create a constant multi-step input by repeating a tensor
Use a stateful=True to recurrently take the output of one step and serve it as the input of the next step (needs output_features == input_features)
One to many with repeat vector
In order to fit to keras standard behavior, we need inputs in steps, so, we simply repeat the inputs for the length we want:
outputs = RepeatVector(steps)(inputs) #where inputs is (batch,features)
outputs = LSTM(units,return_sequences=True)(outputs)
#output_shape -> (batch_size, steps, units)
Understanding stateful = True
Now comes one of the possible usages of stateful=True (besides avoiding loading data that can't fit your computer's memory at once)
Stateful allows us to input "parts" of the sequences in stages. The difference is:
In stateful=False, the second batch contains whole new sequences, independent from the first batch
In stateful=True, the second batch continues the first batch, extending the same sequences.
It's like dividing the sequences in windows too, with these two main differences:
these windows do not overlap!!
stateful=True will see these windows connected as a single long sequence
In stateful=True, every new batch will be interpreted as continuing the previous batch (until you call model.reset_states()).
Sequence 1 in batch 2 will continue sequence 1 in batch 1.
Sequence 2 in batch 2 will continue sequence 2 in batch 1.
Sequence n in batch 2 will continue sequence n in batch 1.
Example of inputs, batch 1 contains steps 1 and 2, batch 2 contains steps 3 to 5:
BATCH 1 BATCH 2
[ Step1 Step2 | [ Step3 Step4 Step5
Tank A: [[Pa1,Ta1], [Pa2,Ta2], | [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B: [[Pb1,Tb1], [Pb2,Tb2], | [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
.... |
Tank N: [[Pn1,Tn1], [Pn2,Tn2], | [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
] ]
Notice the alignment of tanks in batch 1 and batch 2! That's why we need shuffle=False (unless we are using only one sequence, of course).
You can have any number of batches, indefinitely. (For having variable lengths in each batch, use input_shape=(None,features).)
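To illustrate why the carried state matters, here is a sketch with a plain tanh recurrence in NumPy standing in for the LSTM (it is not Keras code; the weights, sizes and the helper run_rnn are assumptions). Processing steps 1-2 and then steps 3-5 with the state carried over gives the same final state as processing all 5 steps at once; resetting in between does not.

```python
import numpy as np

rng = np.random.default_rng(0)
units, features = 4, 2
W = rng.normal(scale=0.5, size=(features, units))   # input weights
U = rng.normal(scale=0.5, size=(units, units))      # recurrent weights

def run_rnn(batch, state):
    """Run a plain tanh recurrence over the time axis; return the final state."""
    for t in range(batch.shape[1]):
        state = np.tanh(batch[:, t] @ W + state @ U)
    return state

tanks = rng.normal(size=(3, 5, features))           # (tanks, steps, features)
zeros = np.zeros((3, units))

whole = run_rnn(tanks, zeros)                       # all 5 steps in one call

state = run_rnn(tanks[:, :2], zeros)                # batch 1: steps 1 and 2
chunked = run_rnn(tanks[:, 2:], state)              # batch 2: steps 3 to 5, state carried over

reset = run_rnn(tanks[:, 2:], zeros)                # batch 2 without the carried state

print(np.allclose(whole, chunked), np.allclose(whole, reset))
```

Row i of the carried state belongs to tank i, which is why the tanks must stay in the same order in every batch.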
One to many with stateful=True
For our case here, we are going to use only 1 step per batch, because we want to get one output step and make it be an input.
Please notice that the behavior in the picture is not "caused by" stateful=True. We will force that behavior in a manual loop below. In this example, stateful=True is what "allows" us to stop the sequence, manipulate what we want, and continue from where we stopped.
Honestly, the repeat approach is probably a better choice for this case. But since we're looking into stateful=True, this is a good example. The best way to use this is the next "many to many" case.
Layer:
outputs = LSTM(units=features,
               stateful=True,
               return_sequences=True, #just to keep a nice output shape even with length 1
               input_shape=(None,features))(inputs)
#units = features because we want to use the outputs as inputs
#None because we want variable length
#output_shape -> (batch_size, steps, units)
Now, we're going to need a manual loop for predictions:
input_data = someDataWithShape((batch, 1, features))
#important, we're starting new sequences, not continuing old ones:
model.reset_states()
output_sequence = []
last_step = input_data
for i in range(steps_to_predict):  #steps_to_predict: number of future steps (an integer)
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step
#end of the sequences
model.reset_states()
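The loop can be tried without Keras by swapping in a stand-in object. ToyStatefulModel below is entirely hypothetical (its update rule is arbitrary); it only mimics the two things the loop relies on: predict() keeps state between calls, and reset_states() clears it.

```python
import numpy as np

class ToyStatefulModel:
    """Stand-in for a stateful model: keeps a running state between predict() calls."""
    def __init__(self, features):
        self.state = np.zeros(features)

    def reset_states(self):
        self.state = np.zeros_like(self.state)

    def predict(self, step):                 # step: (batch, 1, features)
        self.state = 0.5 * self.state + step[:, -1]
        return self.state[:, np.newaxis, :]  # (batch, 1, features)

model = ToyStatefulModel(features=2)
input_data = np.ones((3, 1, 2))

model.reset_states()                         # start new sequences
output_sequence = []
last_step = input_data
for _ in range(4):                           # number of steps to predict
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step                     # the output becomes the next input
model.reset_states()

outputs = np.concatenate(output_sequence, axis=1)
print(outputs.shape)   # (3, 4, 2)
```

Concatenating along axis 1 turns the list of single-step outputs into one (batch, steps, features) array.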
Many to many with stateful=True
Now, here, we get a very nice application: given an input sequence, try to predict its future unknown steps.
We're using the same method as in the "one to many" above, with the difference that:
we will use the sequence itself to be the target data, one step ahead
we know part of the sequence (so we discard this part of the results).
Layer (same as above):
outputs = LSTM(units=features,
               stateful=True,
               return_sequences=True,
               input_shape=(None,features))(inputs)
#units = features because we want to use the outputs as inputs
#None because we want variable length
#output_shape -> (batch_size, steps, units)
Training:
We are going to train our model to predict the next step of the sequences:
totalSequences = someSequencesShaped((batch, steps, features))
#batch size is usually 1 in these cases (often you have only one Tank in the example)
X = totalSequences[:,:-1] #the entire known sequence, except the last step
Y = totalSequences[:,1:] #one step ahead of X
#loop for resetting states at the start/end of the sequences:
for epoch in range(epochs):
    model.reset_states()
    model.train_on_batch(X,Y)
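A quick NumPy check of the one-step-ahead slicing used for X and Y (the toy array is an assumption):

```python
import numpy as np

total_sequences = np.arange(12.0).reshape(1, 6, 2)   # (batch, steps, features)

X = total_sequences[:, :-1]   # steps 1..5
Y = total_sequences[:, 1:]    # steps 2..6, i.e. X shifted one step ahead

print(X.shape, Y.shape)                      # (1, 5, 2) (1, 5, 2)
print(np.array_equal(X[:, 1:], Y[:, :-1]))   # True: Y[t] is the step that follows X[t]
```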
Predicting:
The first stage of our predicting involves "adjusting the states". That's why we're going to predict the entire sequence again, even if we already know this part of it:
model.reset_states() #starting a new sequence
predicted = model.predict(totalSequences)
firstNewStep = predicted[:,-1:] #the last step of the predictions is the first future step
Now we go to the loop as in the one to many case. But don't reset states here! We want the model to know in which step of the sequence it is (and it knows it's at the first new step because of the prediction we just made above).
output_sequence = [firstNewStep]
last_step = firstNewStep
for i in range(steps_to_predict):  #steps_to_predict: number of future steps (an integer)
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step
#end of the sequences
model.reset_states()
This approach was used in these answers and file:
Predicting a multiple forward time step of a time series using LSTM
how to use the Keras model to forecast for future dates or events?
https://github.com/danmoller/TestRepo/blob/master/TestBookLSTM.ipynb
Achieving complex configurations
In all examples above, I showed the behavior of "one layer".
You can, of course, stack many layers on top of each other, not necessarily all following the same pattern, and create your own models.
One interesting example that has been appearing is the "autoencoder" that has a "many to one encoder" followed by a "one to many" decoder:
Encoder:
inputs = Input((steps,features))
#a few many to many layers:
outputs = LSTM(hidden1,return_sequences=True)(inputs)
outputs = LSTM(hidden2,return_sequences=True)(outputs)
#many to one layer:
outputs = LSTM(hidden3)(outputs)
encoder = Model(inputs,outputs)
Decoder:
Using the "repeat" method;
inputs = Input((hidden3,))
#repeat to make one to many:
outputs = RepeatVector(steps)(inputs)
#a few many to many layers:
outputs = LSTM(hidden4,return_sequences=True)(outputs)
#last layer
outputs = LSTM(features,return_sequences=True)(outputs)
decoder = Model(inputs,outputs)
Autoencoder:
inputs = Input((steps,features))
outputs = encoder(inputs)
outputs = decoder(outputs)
autoencoder = Model(inputs,outputs)
Train with fit(X,X)
Additional explanations
If you want details about how steps are calculated in LSTMs, or details about the stateful=True cases above, you can read more in this answer: Doubts regarding `Understanding Keras LSTMs`

First of all, you chose great tutorials (1, 2) to start with.
What Time-step means: Time-steps==3 in X.shape (describing the data shape) means there are three pink boxes. Since in Keras each step requires an input, the number of green boxes should usually equal the number of red boxes, unless you hack the structure.
many to many vs. many to one: In Keras, there is a return_sequences parameter when you initialize LSTM, GRU or SimpleRNN. When return_sequences is False (the default), it is many to one, as shown in the picture. Its return shape is (batch_size, hidden_unit_length), which represents the last state. When return_sequences is True, it is many to many. Its return shape is (batch_size, time_step, hidden_unit_length).
Does the features argument become relevant: The features argument means "how big is your red box", i.e. the input dimension at each step. If you want to predict from, say, 8 kinds of market information, then you can generate your data with feature==8.
Stateful: You can look up the source code. When initializing the state, if stateful==True, the state left by the previous training batch is used as the initial state; otherwise a new state is generated. I haven't turned on stateful yet. However, I disagree that the batch_size can only be 1 when stateful==True.
Currently, you generate your data from data you have already collected. Imagine your stock information arrives as a stream: rather than waiting a day to collect the whole sequence, you would like to generate input data online while training or predicting with the network. If you have 400 stocks sharing the same network, then you can set batch_size==400.
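As a rough sketch of what batch_size > 1 implies for stateful layers: row i of every batch has to continue the same stream as row i of the previous batch, because that is the row whose state is carried over. The helper below is hypothetical and uses plain Python lists. Real Keras input would also need a trailing features axis.

```python
def stateful_batches(streams, time_steps):
    """Yield batches in which row i always holds the next window of stream i."""
    n_windows = min(len(stream) for stream in streams) // time_steps
    for w in range(n_windows):
        start = w * time_steps
        # Non-overlapping windows: each batch picks up where the last one ended.
        yield [stream[start:start + time_steps] for stream in streams]


# Two toy "stocks" (made-up values), so batch_size == 2 here.
streams = [list(range(0, 6)), list(range(100, 106))]
batches = list(stateful_batches(streams, time_steps=3))
for batch in batches:
    print(batch)
```

With 400 streams the same function would yield batches of 400 rows, one per stock.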

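When you have return_sequences=True in the last RNN layer, the output has one vector per timestep, so wrap the following Dense layer in TimeDistributed to apply it at every step. (Recent Keras versions also apply a plain Dense layer to the last axis of a 3D input, so there the wrapper mainly makes the intent explicit.)
Here is an example piece of code that might help others.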
import keras

words = keras.layers.Input(batch_shape=(None, self.maxSequenceLength), name="input")
# Build a matrix of size vocabularySize x EmbeddingDimension
# where each row corresponds to a "word embedding" vector.
# This layer replaces each word-id with a word-vector of size EmbeddingDimension.
embeddings = keras.layers.Embedding(self.vocabularySize, self.EmbeddingDimension,
                                    name="embeddings")(words)
# Pass the word-vectors to the first GRU layer.
# We are setting the hidden-state size to 512.
# The output will be batchSize x maxSequenceLength x hiddenStateSize
hiddenStates = keras.layers.GRU(512, return_sequences=True,
                                input_shape=(self.maxSequenceLength,
                                             self.EmbeddingDimension),
                                name="rnn")(embeddings)
hiddenStates2 = keras.layers.GRU(128, return_sequences=True,
                                 input_shape=(self.maxSequenceLength, self.EmbeddingDimension),
                                 name="rnn2")(hiddenStates)
denseOutput = keras.layers.TimeDistributed(keras.layers.Dense(self.vocabularySize),
                                           name="linear")(hiddenStates2)
predictions = keras.layers.TimeDistributed(keras.layers.Activation("softmax"),
                                           name="softmax")(denseOutput)
# Build the computational graph by specifying the input and output of the network.
model = keras.models.Model(inputs=words, outputs=predictions)
# Alternative loss: 'kullback_leibler_divergence'
# The optimizer argument names below follow the Keras 2.x API this was written for.
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.Adam(lr=0.009,
                                              beta_1=0.9,
                                              beta_2=0.999,
                                              epsilon=None,
                                              decay=0.01,
                                              amsgrad=False))
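What the TimeDistributed wrapper does in the model above can be sketched without Keras: the same layer, with the same weights, is applied independently to every timestep. The functions and numbers below are hypothetical and only illustrate the idea. They are not Keras code.

```python
def dense(vector, weights, bias):
    """One dense layer; weights has one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def time_distributed(layer, sequence):
    """Apply the same layer (same weights) to every timestep."""
    return [layer(step) for step in sequence]


weights = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # 3 output units, 2 inputs
bias = [0.0, 0.0, 0.5]
sequence = [[1.0, 2.0], [3.0, 4.0]]             # 2 timesteps, 2 features

outputs = time_distributed(lambda step: dense(step, weights, bias), sequence)
print(outputs)
```

The output keeps the time axis: 2 timesteps in, 2 timesteps out, each mapped from 2 features to 3 units.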

Refer to this blog for more details: Animated RNN, LSTM and GRU.
The figure below gives you a better view of LSTM. It's an LSTM cell.
As you can see, X has 3 features (green circles), so the input of this cell is a vector of dimension 3, and the hidden state has 2 units (red circles), so the output of this cell (and also the cell state) is a vector of dimension 2.
An example of one LSTM layer with 3 timesteps (3 LSTM cells) is shown in the figure below:
Note: a model can have multiple LSTM layers.
Now I use Daniel Möller's example again for better understanding:
We have 10 oil tanks. For each of them we measure 2 features, temperature and pressure, once every hour, 5 times.
Now the parameters are:
batch_size = number of samples used in one forward/backward pass (default=32) --> for example, if you have 1000 samples and you set batch_size to 100, then the model will take 10 iterations to pass all of the samples through the network once (1 epoch). The higher the batch size, the more memory you'll need. Because the number of samples in this example is low, we set batch_size equal to the number of samples = 10
timesteps = 5
features = 2
units = a positive integer that determines the dimension of the hidden state and the cell state, in other words the size of the state vectors passed to the next LSTM cell. It can be chosen arbitrarily or empirically based on the features and timesteps. Using more units gives the model more capacity, which can improve accuracy, but it also increases computation time and may cause overfitting.
input_shape = (batch_size, timesteps, features) = (10,5,2)
output_shape:
(batch_size, timesteps, units) if return_sequences=True
(batch_size, units) if return_sequences=False
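The arithmetic behind these parameters can be checked with a few lines of plain Python. This is a sketch under stated assumptions: units = 3 is a hypothetical choice, and the parameter count assumes the standard LSTM layout with four weight blocks (input, forget and output gates plus the candidate cell update), each with an input kernel, a recurrent kernel and a bias.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to pass every sample through once."""
    return math.ceil(n_samples / batch_size)


def lstm_output_shape(batch_size, timesteps, units, return_sequences):
    if return_sequences:
        return (batch_size, timesteps, units)
    return (batch_size, units)


def lstm_param_count(features, units):
    """Trainable weights of one LSTM layer with biases (assumed layout)."""
    return 4 * (units * (features + units) + units)


batch_size, timesteps, features = 10, 5, 2  # the oil tank example
units = 3                                   # hypothetical choice

print(iterations_per_epoch(1000, 100))
print(lstm_output_shape(batch_size, timesteps, units, return_sequences=True))
print(lstm_output_shape(batch_size, timesteps, units, return_sequences=False))
print(lstm_param_count(features, units))
```

The parameter count grows roughly with the square of units, which is one reason more units cost more computation.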

Why does Theano throw NaNs when I use dropouts?

I am training a simple feed-forward model with 3 or 4 hidden layers and dropouts between each (hidden layer + non linearity) combination.
Sometimes after a few epochs (about 10-11) the model starts outputting Infs and NaNs as the error of the NLL and the accuracy falls to 0.0%. This problem does not happen when I do not use dropouts. Is this a known issue with dropouts in Theano? The way I implement dropouts is:
def drop(self, input):
    mask = self.theano_rng.binomial(n=1, p=self.p, size=input.shape, dtype=theano.config.floatX)
    return input * mask
where input is the feature-vector on which we want to apply dropouts.
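A plain-Python sketch of what this mask does may help when reading the numbers below. It assumes the usual meaning of binomial with n=1, where p is the probability of drawing a 1. Under that reading, self.p is the fraction of units kept rather than dropped. The function and variable names here are illustrative and are not Theano's API.

```python
import random


def drop(values, p, rng):
    """Multiply each value by a 0/1 mask drawn with P(mask = 1) = p."""
    mask = [1.0 if rng.random() < p else 0.0 for _ in values]
    return [v * m for v, m in zip(values, mask)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
kept_fraction = sum(drop([1.0] * 10000, 0.7, rng)) / 10000
print(kept_fraction)  # close to 0.7: about 70% of the units survive
```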
I have also observed that the occurrence of NaNs happens earlier if the dropout probability (self.p) is higher. p = 0.5 would cause NaNs to occur around epoch 1 or 2 but p = 0.7 would cause NaNs to occur around epoch 10 or 11.
Also the occurrence of NaNs happens only when hidden layer sizes are large. For example (800,700,700) gives NaNs whereas (500,500,500) does not.

In my experience, NaNs during training usually come from one of two problems:
First, a mathematical error, e.g. the log of a negative value. It can happen when you use log() in your loss function.
Second, a value becomes too big for the floating-point type to represent.
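Both failure modes can be reproduced with ordinary Python floats. This is only a sketch of the arithmetic and not a diagnosis of the Theano model. Note that plain Python raises an error for log of a negative number, whereas array libraries such as NumPy return nan instead. The clipped_log helper is a hypothetical guard that is commonly used, not something taken from this answer.

```python
import math

overflowed = 1e308 * 10              # too large for a 64-bit float: becomes inf
nan_value = overflowed - overflowed  # inf - inf is undefined: becomes nan

try:
    math.log(-1.0)                   # plain Python raises instead of returning nan
    log_raised = False
except ValueError:
    log_raised = True


def clipped_log(p, eps=1e-12):
    """Hypothetical guard: keep the argument of log strictly positive."""
    return math.log(max(p, eps))


print(overflowed, nan_value, log_raised, clipped_log(0.0))
```

Once a single inf or nan enters the loss, it propagates through every later computation that touches it.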
In your case, judging from your observations, I think it's the second case. Your loss value may become too big for Python to handle. Try initializing with smaller weights when you enlarge your network, or use a different initialization scheme such as the ones described by Glorot (2010) or He (2015). Hope it helps.
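As a rough illustration of how those schemes scale the weights with layer size, here is a plain-Python sketch of the Glorot uniform limit, sqrt(6 / (fan_in + fan_out)), and the He normal standard deviation, sqrt(2 / fan_in). The function names are made up for this example. In Theano you would draw the actual weight matrices with NumPy.

```python
import math
import random


def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn uniformly from [-limit, limit]."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_normal_std(fan_in):
    """Weights are drawn from a normal distribution with this std."""
    return math.sqrt(2.0 / fan_in)


def init_glorot_uniform(fan_in, fan_out, rng):
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


# Layer sizes from the question: larger layers get smaller weights.
print(glorot_uniform_limit(800, 700))
print(glorot_uniform_limit(500, 500))
print(he_normal_std(800))

weights = init_glorot_uniform(5, 4, random.Random(0))  # tiny demo matrix
```

Both rules shrink the initial weights as the layers grow, which is the adjustment suggested above for the (800,700,700) network.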
