Processing Text for Classification with Keras - python

I'm trying to train a basic text classification NN using Keras. I downloaded 12,500 positive and 12,500 negative movie reviews from a website. However, I'm having trouble processing the data into something Keras can use.
First, I open the 25000 text files and store each file into an array. I then run each array (one positive and one negative) through this function:
from keras.preprocessing.text import text_to_word_sequence, hashing_trick

def process_for_model(textArray):
    '''
    Given a 2D array of the form:
    [[fileLines1], [fileLines2], ..., [fileLinesN]]
    converts the text into integers.
    '''
    result = []
    for file_ in textArray:
        inner = []
        for line in file_:
            # estimate the line's vocabulary size, then hash each word into that range
            length = len(set(text_to_word_sequence(line)))
            inner.append(hashing_trick(line, round(length * 1.3), hash_function='md5'))
        result.append(inner)
    return result
With the purpose of converting the words into numbers to get them close to something a Keras model can use.
I then append the converted numbers into a single array, along with appending a 0 or 1 to another array as labels:
training_labels = []
train_batches = []

for i in range(len(positive_encoded)):
    train_batches.append(positive_encoded[i])
    training_labels.append([0])

for i in range(len(negative_encoded)):
    train_batches.append(negative_encoded[i])
    training_labels.append([1])
And finally I convert each array to a np array:
train_batches = array(train_batches)
training_labels = array(training_labels)
However, I'm not really sure where to go from here. Each review is, I believe, 168 words. I don't know how to create an appropriate model for this data, or how to properly scale all the numbers to be between 0 and 1 using sklearn.
The things I am most confused about are: how many layers I should have, how many neurons each layer should have, and how many input dimensions the first layer should have.
Should I be taking another approach entirely?

Here is quite a good tutorial with Keras and this dataset: https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/

You can also use Keras official tutorial for text classification.
It basically downloads 50k reviews from the IMDB set, equally balanced (half positive, half negative). They split (randomly) half for training, half for testing, and take 10k (40%) of the training examples as a validation set.
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
The reviews are already in their word-dictionary representation (i.e. each review is an array of numbers). The total dictionary has about 80k+ words, but they only use the top 10k most frequent words (all the other words in a particular review are mapped to a special token - unknown ('<UNK>')).
(In the tutorial they create a reversed word dictionary - for the sake of showing you the original reviews. But it's not important.)
Each review is capped at 256 words, so they pre-process each review and pad it with 0 (the <PAD> token) in case it's shorter. (Padding is done post, i.e. at the end.)
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
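The fit call further down uses partial_x_train, partial_y_train, x_val and y_val; based on the split described above (10k of the training examples held out for validation), they are presumably created along these lines:
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]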
Their NN architecture consists of 4 layers:
Input Embedding layer: takes a batch of reviews, each a 256-element vector whose values are in [0, 10,000), and learns a 16-dimensional vector (for each word) to represent them.
Global Average Pooling layer: average over all the words (16-D representation) in a review, and gives you a single 16 dimensional vector to represent the whole review.
Fully connected dense layer of 16 nodes - the 'vanilla' NN layer. They chose a ReLu activation function.
An output layer of 1 node: with a sigmoid activation function - gives a number from 0 to 1 which represents the confidence it's a positive/negative review.
Here is the code for it:
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
Then they fit the model and run it:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
In summary - they chose to compress what could have been a 10k-dimensional vector down to only 16 dimensions, and then run a single dense-layer NN - with which they got pretty good results (~87% accuracy).
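To see how well the trained model generalizes, you can evaluate it on the held-out test set afterwards (a minimal sketch following the same flow):
# evaluate on the padded test set; returns [loss, accuracy] for the metrics compiled above
results = model.evaluate(test_data, test_labels, verbose=0)
print(results)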

Related

How do I predict on more than one batch from a Tensorflow Dataset, using .predict_on_batch?

As the question says, I can only predict from my model with model.predict_on_batch(). Keras tries to concatenate everything together if I use model.predict() and that doesn't work.
For my application (a sequence to sequence model) it is faster to do grouping on the fly. But even if I had done it in Pandas and then only used Dataset for the padded batch, .predict() still shouldn't work?
If I can get predict_on_batch to work then that's what works. But I can only predict on the first batch of the Dataset. How do I get predictions for the rest? I can't loop over the Dataset, I can't consume it...
Here's a smaller code example. The group is the same as the labels, but in the real world they are obviously two different things. There are 3 classes, a maximum of 2 values in a sequence, and 2 rows of data per batch. There are a lot of comments, and I nicked parts of the windowing from somewhere on Stack Overflow. I hope it is fairly legible to most.
If you have any other suggestions on how to improve the code, please comment. But no, that's not what my model looks like at all. So suggestions for that part probably aren't helpful.
EDIT: Tensorflow version 2.1.0
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Bidirectional, Masking, Input, Dense, GRU
import random
import numpy as np
random.seed(100)
# input data
feature = list(range(3, 14))
# shuffle data
random.shuffle(feature)
# make label from feature data, +1 because we are padding with zero
label = [feat // 5 +1 for feat in feature]
group = label[:]
# random.shuffle(group)
max_group = 2
batch_size = 2
print('Data:')
print(*zip(group, feature, label), sep='\n')
# make dataset from data arrays
ds = tf.data.Dataset.zip((tf.data.Dataset.from_tensor_slices({'group': group, 'feature': feature}),
                          tf.data.Dataset.from_tensor_slices({'label': label})))
# group by window
ds = ds.apply(tf.data.experimental.group_by_window(
    # use group as key (you may have to use tf.reshape(x['group'], []) instead of tf.cast)
    key_func=lambda x, y: tf.cast(x['group'], tf.int64),
    # convert each window to a batch
    reduce_func=lambda _, window: window.batch(max_group),
    # use batch size as window size
    window_size=max_group))
# shuffle at most 100k rows, but commented out because we don't want to predict on shuffled data
# ds = ds.shuffle(int(1e5))
ds = ds.padded_batch(batch_size,
                     padded_shapes=({s: (None,) for s in ['group', 'feature']},
                                    {s: (None,) for s in ['label']}))
# show dataset contents
print('Result:')
for element in ds:
    print(element)
# Keras matches the name in the input to the tensor names in the first part of ds
inp = Input(shape=(None,), name='feature')
# RNNs require an additional rank, even if it is a degenerate dimension
duck = tf.expand_dims(inp, axis=-1)
rnn = GRU(32, return_sequences=True)(duck)
# again Keras matches names
out = Dense(max(label)+1, activation='softmax', name='label')(rnn)
model = Model(inputs=inp, outputs=out)
model.summary()
model.compile(loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(ds, epochs=3)
model.predict_on_batch(ds)
You can iterate over the dataset, like so, remembering what is "x" and what is "y" in typical notation:
for item in ds:
    xi, yi = item
    pi = model.predict_on_batch(xi)
    print(xi["group"].shape, pi.shape)
Of course, this predicts on each dataset element (i.e. each padded batch) individually. Otherwise you'd have to define the batches yourself, by batching matching shapes together, since the batch size itself is allowed to be variable.
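If you want all the predictions gathered in one place, a minimal sketch (variable names are my own) is to collect the per-batch results in a list while iterating:
# collect predictions batch by batch; shapes may differ between batches,
# so keep them in a list instead of concatenating blindly
all_predictions = []
for xi, yi in ds:
    all_predictions.append(model.predict_on_batch(xi))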

Correctly structuring text data for text generation with Tensorflow model

I am trying to train my model to generate sentences no longer than 210 characters. From what I have read, I have only seen training on 'continuous' text, like a book. However, I am trying to train my model on single sentences.
I'm pretty new to TensorFlow and ML, so right now I am able to train my model, but it generates garbage - seemingly random text. I have 10,000 sentences, so I think I have sufficient data.
Overview of my data
Structure [['SENTENCE'], ['SENTENCE2']...]
Data Prep
tokenizer = keras.preprocessing.text.Tokenizer(num_words=209, lower=False, char_level=True, filters='#$%&()*+-<=>#[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(df['title'].values)
df['encoded_with_keras'] = tokenizer.texts_to_sequences(df['title'].values)
dataset = df['encoded_with_keras'].values
dataset = tf.keras.preprocessing.sequence.pad_sequences(dataset, padding='post')
dataset = dataset.flatten()
dataset = tf.data.Dataset.from_tensor_slices(dataset)
sequences = dataset.batch(seq_len+1, drop_remainder=True)
def create_seq_targets(seq):
    input_txt = seq[:-1]
    target_txt = seq[1:]
    return input_txt, target_txt
dataset = sequences.map(create_seq_targets)
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)
Model
def create_model(vocab_size, embed_dim, rnn_neurons, batch_size):
    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, batch_input_shape=[batch_size, None],
                        input_length=209, mask_zero=True))
    model.add(LSTM(rnn_neurons, return_sequences=True, stateful=True))
    model.add(Dropout(0.2))
    model.add(Dense(258, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(optimizer='adam', loss="sparse_categorical_crossentropy")
    return model
When I give the model a sequence to start from, I get back absolute nonsense, and eventually the model predicts a 0, which is not in the char_index mapping.
Edit
Text Generation
epochs = 2
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

model = create_model(vocab_size=vocab_size,
                     embed_dim=embed_dim,
                     rnn_neurons=rnn_neurons,
                     batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
def generate_text(model, start_string):
    num_generate = 200
    input_eval = [char_2_index[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    temperature = 1

    # model.reset_states()
    for i in range(num_generate):
        print(text_generated)
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        print(predicted_id)
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(index_2_char[predicted_id])
    return (start_string + ''.join(text_generated))
There are a few things that must be changed at first sight.
The Tokenizer must have num_words = vocab_size.
At first glance (I didn't analyse it deeply), I can't imagine why you're flattening your dataset and taking slices if it's probably correctly structured already.
You cannot use stateful=True unless you want "batch 2 to be a sequel of batch 1"; you have individual sentences, so use stateful=False. (Unless you are training correctly with manual training loops and resetting states for each batch, which is unnecessary trouble in the training phase.)
What you need to check visually:
Input data must have format like:
[
[1,2,3,6,10,4,10, ...up to sentence length - 1...],
[5,6,3,6,7,3,11,... up to sentence length - 1...],
.... up to number of sentences ...
]
Output data must then be:
[
[2,3,6,10,4,10,15 ...], #equal to input data, shifted by 1
[6,3,6,7,3,11,13, ...],
...
]
Print a few rows of them to check if they're correctly preprocessed as intended.
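For instance, assuming the padded sentences from pad_sequences are stored in a 2D array (the name padded_sentences is my own), the shifted input/output pairs described above could be built like this:
# hypothetical sketch: build the shifted input/target pairs from the padded sentences
import numpy as np

padded = np.asarray(padded_sentences)   # shape (num_sentences, max_sentence_length)
input_data = padded[:, :-1]             # every sentence, except its last token
output_data = padded[:, 1:]             # the same sentences, shifted one step ahead

print(input_data[:3])
print(output_data[:3])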
Training will then be easy:
model.fit(input_data, output_data, epochs=....)
Yes, your model will predict zeros, because you have zeros in your data; that's not weird - you did a pad_sequences.
You can interpret a zero as a "sentence end" in this case, since you did 'post' padding. When your model gives you a zero, it has decided that the sentence it's generating should end at that point - if it was well trained, it will probably continue outputting zeros for that sentence from then on.
Generating new sentences
This part is more complex: you need to rebuild the model, now with stateful=True, and transfer the weights from the trained model to this new model.
Before anything, call model.reset_states().
You will need to manually feed a batch with shape (number_of_sentences=batch_size, 1). This will be the "first character" of each of the sentences it will generate. The output will be the "second character" of each sentence.
Get this output and feed the model with it. It will generate the "third character" of each sentence. And so on.
When all outputs are zero, all sentences are fully generated and you can stop the loop.
Call model.reset_states() again before trying to generate a new batch of sentences.
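A minimal sketch of that generation loop (gen_model, seed_ids and max_length are my own placeholder names, and I use greedy selection where sampling would also work):
import numpy as np

gen_model.reset_states()                 # start fresh sequences
current = seed_ids                       # shape (batch_size, 1): first character id of each sentence
generated = [current]

for _ in range(max_length):
    probs = gen_model.predict(current)   # (batch_size, 1, vocab_size)
    current = probs.argmax(axis=-1)      # next character id per sentence, shape (batch_size, 1)
    if (current == 0).all():             # every sentence has produced the "end" (pad) token
        break
    generated.append(current)

gen_model.reset_states()                 # reset before generating a new batch of sentences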
You can find examples of this kind of predicting here: https://stackoverflow.com/a/50235563/2097240

How to use Multivariate time-series prediction with Keras, when multiple samples are used

As the title states, I am doing multivariate time-series prediction. I have some experience with this situation and was able to successfully setup and train a working model in TF Keras.
However, I did not know the 'proper' way to handle having multiple unrelated time-series samples. I have about 8000 unique sample 'blocks' with anywhere from 800 time steps to 30,000 time steps per sample. Of course I couldn't concatenate them all into one single time series because the first points of sample 2 are not related in time with the last points of sample 1.
Thus my solution was to fit each sample individually in a loop (at great inefficiency).
My new idea is: can/should I pad the start of each sample with empty time-steps equal to the amount of look-back for the RNN, and then concatenate the padded samples into one time series? This would mean the first time-step would have look-back data of mostly 0's, which sounds like another 'hack' for my problem and not the right way to do it.
The main challenge is in 800 vs. 30,000 timesteps, but nothing you can't do.
Model design: group sequences into chunks - for example, 30 sequences of 800-to-900 timesteps, padded, then 60 sequences of 900-to-1000, etc. - they don't have to be contiguous (i.e. the next can be 1200-to-1500); a short bucketing sketch follows below this list
Input shape: (samples, timesteps, channels) - or equivalently, (sequences, timesteps, features)
Layers: Conv1D and/or RNNs - e.g. GRU, LSTM. Each can handle variable timesteps
Concatenation: don't do it. If each of your sequences is independent, then each must be fed along dimension 0 in Keras - the batch or samples dimension. If they are dependent, e.g. a multivariate timeseries, like many channels in a signal - then feed them along the channels dimension (dim 2). But never concatenate along the timeseries dimension, as it implies causal continuity where none exists.
Stateful RNNs: can help in processing long sequences - info on how they work here
RNN capability: is limited w.r.t. long sequences, and 800 is already in danger zone even for LSTMs; I'd suggest dimensionality reduction via either autoencoders or CNNs w/ strides > 1 at input, then feeding their outputs to RNNs.
RNN training: is difficult. Long train times, hyperparameter sensitivity, vanishing gradients - but, with proper regularization, they can be powerful. More info here
Zero-padding: before/after/both - debatable, can read about it, but probably stay clear from "both" as learning to ignore paddings is easier with one locality; I personally use "before"
RNN variant: use CuDNNLSTM or CuDNNGRU whenever possible, as they are 10x faster
Note: "samples" above, and in machine learning, refers to independent examples / observations, rather than measured signal datapoints (which would be referred to as timesteps).
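A minimal sketch of the chunk-grouping and "before" padding described above (variable names are my own; samples is assumed to be a list of arrays of shape (timesteps_i, channels)):
import numpy as np

def pad_front(seq, length):
    # zero-pad at the start ('before'), as suggested above
    pad = np.zeros((length - len(seq),) + seq.shape[1:], dtype=seq.dtype)
    return np.concatenate([pad, seq], axis=0)

buckets = {}
for s in samples:
    key = int(np.ceil(len(s) / 100.0) * 100)   # round length up to the nearest 100 timesteps
    buckets.setdefault(key, []).append(pad_front(s, key))

# each bucket becomes a (n_samples, key, channels) array that can be fed to the model separately
batches = {k: np.stack(v) for k, v in buckets.items()}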
Below is a minimal code for what a timeseries-suited model would look like:
from tensorflow.keras.layers import Input, Conv1D, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import numpy as np
def make_data(batch_shape):  # dummy data
    return (np.random.randn(*batch_shape),
            np.random.randint(0, 2, (batch_shape[0], 1)))

def make_model(batch_shape):  # example model
    ipt = Input(batch_shape=batch_shape)
    x = Conv1D(filters=16, kernel_size=10, strides=2, padding='valid')(ipt)
    x = LSTM(units=16)(x)
    out = Dense(1, activation='sigmoid')(x)  # assuming binary classification
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-3), 'binary_crossentropy')
    return model
batch_shape = (32, 100, 16) # 32 samples, 100 timesteps, 16 channels
x, y = make_data(batch_shape)
model = make_model(batch_shape)
model.train_on_batch(x, y)

Using tensorflow or keras to build a NN model by feeding 'pairwise' samples

I'm trying to implement a NN model with pairwise samples. Details are shown in follows:
Original data:
X_org with shape of (100, 50) for example, namely 100 samples with 50 features.
Y_org with shape of (100, 1).
Processing these original data for real training:
Select 2 samples from X_org randomly (so we have 100*99/2 such combinations) to form a new 'pairwise' sample, and the prediction target, namely the new y label is the subtraction of the two corresponding y_org labels (Y_org_sample1 - Y_org_sample2). Now we have new X_train and Y_train.
I need a NN model (DNN, CNN, LSTM, whatever ...) into which I can pass the first sub-sample of one pairwise sample from X_train and get one result, then do the same for the second sub-sample. By taking the difference of the two results, I get the prediction for this pairwise sample, which is then compared with the corresponding Y label from Y_train.
Overall, I need to train a model (update the weights) after feeding it a 'pairwise' sample (two successive sub-samples). The reason I don't choose a 'two-arm' model (e.g. merging two arms with xxx.sub()) is that I will only feed one sub-sample during the test process; in the end I will just use the model to predict on a single sub-sample.
So I will use the data from X_train during the training step, while using X_org-like data during the test step. It looks a bit complex.
It looks like TensorFlow would be more feasible for this task, but if Keras also works, please kindly share your ideas.
You can first create a model that will take only one X_org-like element:
#create a model the way you like it, it can be Functional API or Sequential, no problem
xOrgModel = createAModelForXOrgData(...)
Now, let's create a second model, this time necessarily with the functional API, that works with both inputs:
from keras.models import Model
from keras.layers import Input, Subtract
input1 = Input(shapeOfInput)
input2 = Input(shapeOfInput)
output1 = xOrgModel(input1)
output2 = xOrgModel(input2)
output = Subtract()([output1,output2])
pairWiseModel = Model([input1,input2],output)
Now you have two models: xOrgModel and pairWiseModel. You can use any of them depending on the task you are doing at the moment.
Both models are sharing their weights. This means that you can train any of them and the other will be updated as well.
Using the pairwise model
First, organize your data in two separate arrays. (Because our model uses two inputs)
import numpy as np

L = len(X_org)
x1 = []
x2 = []
y = []

for i in range(L):
    for j in range(i+1, L):
        x1.append(X_org[i])
        x2.append(X_org[j])
        y.append(Y_org[i] - Y_org[j])

x1 = np.array(x1)
x2 = np.array(x2)
y = np.array(y)
Train and predict with a list of inputs:
pairWiseModel.fit([x1,x2],y,...)
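And since the weights are shared, at test time you can call the single-input model directly on X_org-like data (a sketch; X_test here is an assumed placeholder):
# predict on single sub-samples with the one-input model;
# it uses the weights learned while training pairWiseModel
single_predictions = xOrgModel.predict(X_test)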

Understanding Keras LSTMs

I am trying to reconcile my understanding of LSTMs, as pointed out in this post by Christopher Olah, with their implementation in Keras. I am following the blog written by Jason Brownlee for the Keras tutorial. What I am mainly confused about is:
The reshaping of the data series into [samples, time steps, features] and,
The stateful LSTMs
Lets concentrate on the above two questions with reference to the code pasted below:
# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], look_back, 1))
testX = numpy.reshape(testX, (testX.shape[0], look_back, 1))
########################
# The IMPORTANT BIT
##########################
# create and fit the LSTM network
batch_size = 1
model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, look_back, 1), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
for i in range(100):
    model.fit(trainX, trainY, nb_epoch=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
Note: create_dataset takes a sequence of length N and returns an array of N - look_back elements, each of which is a look_back-length sequence.
What is Time Steps and Features?
As can be seen, trainX is a 3-D array with time steps and features being the last two dimensions respectively (3 and 1 in this particular code). With respect to the image below, does this mean that we are considering the many-to-one case, where the number of pink boxes is 3? Or does it literally mean the chain length is 3 (i.e. only 3 green boxes are considered)?
Does the features argument become relevant when we consider multivariate series? e.g. modelling two financial stocks simultaneously?
Stateful LSTMs
Do stateful LSTMs mean that we save the cell memory values between runs of batches? If this is the case, batch_size is one, and the memory is reset between the training runs, so what was the point of saying that it was stateful? I'm guessing this is related to the fact that the training data is not shuffled, but I'm not sure how.
Any thoughts?
Image reference: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Edit 1:
A bit confused about @van's comment about the red and green boxes being equal. So just to confirm, do the following API calls correspond to the unrolled diagrams? Especially noting the second diagram (batch_size was arbitrarily chosen):
Edit 2:
For people who have done Udacity's deep learning course and still confused about the time_step argument, look at the following discussion: https://discussions.udacity.com/t/rnn-lstm-use-implementation/163169
Update:
It turns out model.add(TimeDistributed(Dense(vocab_len))) was what I was looking for. Here is an example: https://github.com/sachinruk/ShakespeareBot
Update2:
I have summarised most of my understanding of LSTMs here: https://www.youtube.com/watch?v=ywinX5wgdEU
As a complement to the accepted answer, this answer shows keras behaviors and how to achieve each picture.
General Keras behavior
The standard keras internal processing is always a many to many as in the following picture (where I used features=2, pressure and temperature, just as an example):
In this image, I increased the number of steps to 5, to avoid confusion with the other dimensions.
For this example:
We have N oil tanks
We spent 5 hours taking measures hourly (time steps)
We measured two features:
Pressure P
Temperature T
Our input array should then be something shaped as (N,5,2):
[ Step1 Step2 Step3 Step4 Step5
Tank A: [[Pa1,Ta1], [Pa2,Ta2], [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B: [[Pb1,Tb1], [Pb2,Tb2], [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
....
Tank N: [[Pn1,Tn1], [Pn2,Tn2], [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
]
Inputs for sliding windows
Often, LSTM layers are supposed to process the entire sequences. Dividing them into windows may not be the best idea. The layer has internal states about how a sequence is evolving as it steps forward. Windows eliminate the possibility of learning long sequences, limiting all sequences to the window size.
With windows, each window is part of a long original sequence, but Keras will see each of them as an independent sequence:
[ Step1 Step2 Step3 Step4 Step5
Window A: [[P1,T1], [P2,T2], [P3,T3], [P4,T4], [P5,T5]],
Window B: [[P2,T2], [P3,T3], [P4,T4], [P5,T5], [P6,T6]],
Window C: [[P3,T3], [P4,T4], [P5,T5], [P6,T6], [P7,T7]],
....
]
Notice that in this case, you have initially only one sequence, but you're dividing it in many sequences to create windows.
The concept of "what is a sequence" is abstract. The important parts are:
you can have batches with many individual sequences
what makes the sequences be sequences is that they evolve in steps (usually time steps)
Achieving each case with "single layers"
Achieving standard many to many:
You can achieve many to many with a simple LSTM layer, using return_sequences=True:
outputs = LSTM(units, return_sequences=True)(inputs)
#output_shape -> (batch_size, steps, units)
Achieving many to one:
Using the exact same layer, keras will do the exact same internal preprocessing, but when you use return_sequences=False (or simply ignore this argument), keras will automatically discard the steps previous to the last:
outputs = LSTM(units)(inputs)
#output_shape -> (batch_size, units) --> steps were discarded, only the last was returned
Achieving one to many
Now, this is not supported by Keras LSTM layers alone. You will have to create your own strategy to multiply the steps. There are two good approaches:
Create a constant multi-step input by repeating a tensor
Use a stateful=True to recurrently take the output of one step and serve it as the input of the next step (needs output_features == input_features)
One to many with repeat vector
In order to fit to keras standard behavior, we need inputs in steps, so, we simply repeat the inputs for the length we want:
outputs = RepeatVector(steps)(inputs) #where inputs is (batch,features)
outputs = LSTM(units,return_sequences=True)(outputs)
#output_shape -> (batch_size, steps, units)
Understanding stateful = True
Now comes one of the possible usages of stateful=True (besides avoiding loading data that can't fit your computer's memory at once)
Stateful allows us to input "parts" of the sequences in stages. The difference is:
In stateful=False, the second batch contains whole new sequences, independent from the first batch
In stateful=True, the second batch continues the first batch, extending the same sequences.
It's like dividing the sequences in windows too, with these two main differences:
these windows do not overlap!!
stateful=True will see these windows connected as a single long sequence
In stateful=True, every new batch will be interpreted as continuing the previous batch (until you call model.reset_states()).
Sequence 1 in batch 2 will continue sequence 1 in batch 1.
Sequence 2 in batch 2 will continue sequence 2 in batch 1.
Sequence n in batch 2 will continue sequence n in batch 1.
Example of inputs, batch 1 contains steps 1 and 2, batch 2 contains steps 3 to 5:
BATCH 1 BATCH 2
[ Step1 Step2 | [ Step3 Step4 Step5
Tank A: [[Pa1,Ta1], [Pa2,Ta2], | [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B: [[Pb1,Tb1], [Pb2,Tb2], | [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
.... |
Tank N: [[Pn1,Tn1], [Pn2,Tn2], | [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
] ]
Notice the alignment of tanks in batch 1 and batch 2! That's why we need shuffle=False (unless we are using only one sequence, of course).
You can have any number of batches, indefinitely. (For variable lengths in each batch, use input_shape=(None, features).)
One to many with stateful=True
For our case here, we are going to use only 1 step per batch, because we want to get one output step and make it be an input.
Please notice that the behavior in the picture is not "caused by" stateful=True. We will force that behavior in a manual loop below. In this example, stateful=True is what "allows" us to stop the sequence, manipulate what we want, and continue from where we stopped.
Honestly, the repeat approach is probably a better choice for this case. But since we're looking into stateful=True, this is a good example. The best way to use this is the next "many to many" case.
Layer:
outputs = LSTM(units=features,
               stateful=True,
               return_sequences=True,  # just to keep a nice output shape even with length 1
               input_shape=(None, features))(inputs)
# units = features because we want to use the outputs as inputs
# None because we want variable length
# output_shape -> (batch_size, steps, units)
Now, we're going to need a manual loop for predictions:
input_data = someDataWithShape((batch, 1, features))

# important, we're starting new sequences, not continuing old ones:
model.reset_states()

output_sequence = []
last_step = input_data
for i in steps_to_predict:
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step

# end of the sequences
model.reset_states()
Many to many with stateful=True
Now, here, we get a very nice application: given an input sequence, try to predict its future unknown steps.
We're using the same method as in the "one to many" above, with the difference that:
we will use the sequence itself to be the target data, one step ahead
we know part of the sequence (so we discard this part of the results).
Layer (same as above):
outputs = LSTM(units=features,
               stateful=True,
               return_sequences=True,
               input_shape=(None, features))(inputs)
# units = features because we want to use the outputs as inputs
# None because we want variable length
# output_shape -> (batch_size, steps, units)
Training:
We are going to train our model to predict the next step of the sequences:
totalSequences = someSequencesShaped((batch, steps, features))
# batch size is usually 1 in these cases (often you have only one Tank in the example)

X = totalSequences[:, :-1]  # the entire known sequence, except the last step
Y = totalSequences[:, 1:]   # one step ahead of X

# loop for resetting states at the start/end of the sequences:
for epoch in range(epochs):
    model.reset_states()
    model.train_on_batch(X, Y)
Predicting:
The first stage of our predicting involves "adjusting the states". That's why we're going to predict the entire sequence again, even though we already know this part of it:
model.reset_states() #starting a new sequence
predicted = model.predict(totalSequences)
firstNewStep = predicted[:,-1:] #the last step of the predictions is the first future step
Now we go to the loop as in the one-to-many case. But don't reset states here! We want the model to know which step of the sequence it is at (and it knows it's at the first new step because of the prediction we just made above):
output_sequence = [firstNewStep]
last_step = firstNewStep
for i in steps_to_predict:
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step

# end of the sequences
model.reset_states()
This approach was used in these answers and file:
Predicting a multiple forward time step of a time series using LSTM
how to use the Keras model to forecast for future dates or events?
https://github.com/danmoller/TestRepo/blob/master/TestBookLSTM.ipynb
Achieving complex configurations
In all examples above, I showed the behavior of "one layer".
You can, of course, stack many layers on top of each other, not necessarily all following the same pattern, and create your own models.
One interesting example that has been appearing is the "autoencoder" that has a "many to one encoder" followed by a "one to many" decoder:
Encoder:
inputs = Input((steps,features))
#a few many to many layers:
outputs = LSTM(hidden1,return_sequences=True)(inputs)
outputs = LSTM(hidden2,return_sequences=True)(outputs)
#many to one layer:
outputs = LSTM(hidden3)(outputs)
encoder = Model(inputs,outputs)
Decoder:
Using the "repeat" method:
inputs = Input((hidden3,))
#repeat to make one to many:
outputs = RepeatVector(steps)(inputs)
#a few many to many layers:
outputs = LSTM(hidden4,return_sequences=True)(outputs)
#last layer
outputs = LSTM(features,return_sequences=True)(outputs)
decoder = Model(inputs,outputs)
Autoencoder:
inputs = Input((steps,features))
outputs = encoder(inputs)
outputs = decoder(outputs)
autoencoder = Model(inputs,outputs)
Train with fit(X,X)
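A minimal sketch of that training call (the optimizer, loss and hyperparameters here are assumptions, not part of the original answer):
autoencoder.compile(optimizer='adam', loss='mse')
# X has shape (samples, steps, features); the autoencoder learns to reconstruct it
autoencoder.fit(X, X, epochs=100, batch_size=32)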
Additional explanations
If you want details about how steps are calculated in LSTMs, or details about the stateful=True cases above, you can read more in this answer: Doubts regarding `Understanding Keras LSTMs`
First of all, you chose great tutorials (1, 2) to start.
What time-step means: Time-steps==3 in X.shape (describing the data shape) means there are three pink boxes. Since in Keras each step requires an input, the number of green boxes should usually equal the number of red boxes, unless you hack the structure.
many to many vs. many to one: In Keras, there is a return_sequences parameter when you initialize LSTM, GRU, or SimpleRNN. When return_sequences is False (the default), then it is many to one as shown in the picture. Its return shape is (batch_size, hidden_unit_length), which represents the last state. When return_sequences is True, then it is many to many. Its return shape is (batch_size, time_step, hidden_unit_length).
Does the features argument become relevant: The features argument means "how big is your red box", i.e. what the input dimension is at each step. If you want to predict from, say, 8 kinds of market information, then you can generate your data with features==8.
Stateful: You can look up the source code. When initializing the state, if stateful==True, then the state from the last training will be used as the initial state; otherwise it will generate a new state. I haven't turned on stateful yet. However, I disagree that the batch_size can only be 1 when stateful==True.
Currently, you generate your data from collected data. Imagine your stock information is coming in as a stream: rather than waiting for a day to collect all sequential data, you would like to generate input data online while training/predicting with the network. If you have 400 stocks sharing the same network, then you can set batch_size==400.
When you have return_sequences=True in the last RNN layer, you cannot use a simple Dense layer; use TimeDistributed instead.
Here is an example piece of code this might help others.
words = keras.layers.Input(batch_shape=(None, self.maxSequenceLength), name="input")

# Build a matrix of size vocabularySize x EmbeddingDimension
# where each row corresponds to a "word embedding" vector.
# This layer will replace each word-id with a word-vector of size EmbeddingDimension.
embeddings = keras.layers.embeddings.Embedding(self.vocabularySize, self.EmbeddingDimension,
                                               name="embeddings")(words)

# Pass the word-vectors to the RNN (GRU) layer.
# We are setting the hidden-state size to 512.
# The output will be batchSize x maxSequenceLength x hiddenStateSize
hiddenStates = keras.layers.GRU(512, return_sequences=True,
                                input_shape=(self.maxSequenceLength,
                                             self.EmbeddingDimension),
                                name="rnn")(embeddings)
hiddenStates2 = keras.layers.GRU(128, return_sequences=True,
                                 input_shape=(self.maxSequenceLength, self.EmbeddingDimension),
                                 name="rnn2")(hiddenStates)

denseOutput = TimeDistributed(keras.layers.Dense(self.vocabularySize),
                              name="linear")(hiddenStates2)
predictions = TimeDistributed(keras.layers.Activation("softmax"),
                              name="softmax")(denseOutput)

# Build the computational graph by specifying the input and output of the network.
model = keras.models.Model(inputs=words, outputs=predictions)

# model.compile(loss='kullback_leibler_divergence',
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.Adam(lr=0.009,
                                              beta_1=0.9,
                                              beta_2=0.999,
                                              epsilon=None,
                                              decay=0.01,
                                              amsgrad=False))
Refer this blog for more details Animated RNN, LSTM and GRU.
The figure below gives you a better view of an LSTM. It's an LSTM cell.
As you can see, X has 3 features (green circles), so the input of this cell is a vector of dimension 3, and the hidden state has 2 units (red circles), so the output of this cell (and also the cell state) is a vector of dimension 2.
An example of one LSTM layer with 3 timesteps (3 LSTM cells) is shown in the figure below:
** A model can have multiple LSTM layers.
Now I use Daniel Möller's example again for better understanding:
We have 10 oil tanks. For each of them we measure 2 features (temperature and pressure) once an hour, 5 times.
Now the parameters are:
batch_size = the number of samples used in one forward/backward pass (default=32) --> for example, if you have 1000 samples and you set the batch_size to 100, then the model will take 10 iterations to pass all of the samples once through the network (1 epoch). The higher the batch size, the more memory you'll need. Because the number of samples in this example is low, we set batch_size equal to all of the samples = 10
timesteps = 5
features = 2
units = a positive integer that determines the dimension of the hidden state and cell state, or in other words the size of the output passed to the next LSTM cell. It can be chosen arbitrarily or empirically, based on the features and timesteps. Using more units will generally result in more accuracy and also more computation time, but it may cause overfitting.
input_shape = (batch_size, timesteps, features) = (10,5,2)
output_shape:
(batch_size, timesteps, units) if return_sequences=True
(batch_size, units) if return_sequences=False
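A minimal sketch to verify those output shapes (the layer size of 4 units is an arbitrary assumption):
import tensorflow as tf

x = tf.random.normal((10, 5, 2))  # (batch_size, timesteps, features) = (10, 5, 2)

seq_out = tf.keras.layers.LSTM(4, return_sequences=True)(x)
last_out = tf.keras.layers.LSTM(4, return_sequences=False)(x)

print(seq_out.shape)   # (10, 5, 4) -> (batch_size, timesteps, units)
print(last_out.shape)  # (10, 4)    -> (batch_size, units)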
