I'm learning how to build a seq2seq model based on this TensorFlow 2 NMT tutorial, and I'm trying to expand upon it by stacking multiple RNN layers for the encoder and decoder. However, I'm having trouble retrieving the output which corresponds to the hidden state of the encoder.
Here's my code for building the stacked bidirectional GRUCell layers in the encoder:
# Encoder initializer
def __init__(self, n_layers, dropout, ...):
    ...
    gru_cells = [layers.GRUCell(units,
                                recurrent_initializer='glorot_uniform',
                                dropout=dropout)
                 for _ in range(n_layers)]
    self.gru = layers.Bidirectional(layers.RNN(gru_cells,
                                               return_sequences=True,
                                               return_state=True))
Assuming the above is correct, I then call the layer I created:
# Encoder call method
def call(self, inputs, state):
    ...
    list_outputs = self.gru(inputs, initial_state=state)
    print(len(list_outputs))  # test
list_outputs has length 3 when n_layers = 1, which is the expected behavior according to this SO post. When I increase n_layers by one, the number of outputs increases by two, which I presume are the forward and backward final states of the new layer. So 2 layers -> 5 outputs, 3 layers -> 7 outputs, etc. However, I can't figure out which output corresponds to which layer and which direction.
Ultimately what I'd like to know is: how can I get the forward and reverse final states of the last layer in this stacked bidirectional RNN? If I understand the seq2seq model correctly, they make up the hidden state that is passed to the decoder.
After digging through the TensorFlow source code for the RNN and Bidirectional classes, my best guess for the output format of a stacked bidirectional RNN layer is the following list of length 1 + 2n, where n is the number of stacked layers:
[0] the output sequence, with the forward and backward outputs concatenated along the last axis
[1 : n + 1] the final states of the forward layers, from first to last
[n + 1 :] the final states of the backward layers, from first to last
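If that guess is right, the states I need for the decoder could be pulled out like this (just a sketch, assuming n_layers is stored as self.n_layers in __init__ and tensorflow is imported as tf):
# inside Encoder.call, assuming the output layout guessed above
outputs = self.gru(inputs, initial_state=state)
sequence_output = outputs[0]                       # (batch, time, 2 * units)
forward_states = outputs[1 : 1 + self.n_layers]    # first -> last forward layer
backward_states = outputs[1 + self.n_layers :]     # first -> last backward layer

# final hidden state handed to the decoder: last layer, both directions
decoder_state = tf.concat([forward_states[-1], backward_states[-1]], axis=-1)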
In the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" there's a code fragment I can't wrap my head around. Up to that chapter, the models only used layers explicitly, be it in a sequential or a functional fashion. But in Chapter 16 there's this:
import tensorflow_addons as tfa
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]
sampler = tfa.seq2seq.sampler.TrainingSampler()
decoder_cell = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(
decoder_embeddings, initial_state=encoder_state,
sequence_length=sequence_lengths)
Y_proba = tf.nn.softmax(final_outputs.rnn_output)
model = keras.models.Model(
inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
outputs=[Y_proba])
And then he just runs the model in a standard way:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
X = np.random.randint(100, size=10*1000).reshape(1000, 10)
Y = np.random.randint(100, size=15*1000).reshape(1000, 15)
X_decoder = np.c_[np.zeros((1000, 1)), Y[:, :-1]]
seq_lengths = np.full([1000], 15)
history = model.fit([X, X_decoder, seq_lengths], Y, epochs=2)
I have trouble understanding the code starting at the line where the Embedding layer is created. The author creates an Embedding layer which he immediately calls on encoder_inputs and decoder_inputs; then he does basically the same with the LSTM layer, which he calls on the previously created encoder_embeddings, and the tensors returned by this operation are used in the code slightly below. What I don't get here is how those tensors are trained. It looks like he's not using the layers that create them in the model, but if so, then how come the embeddings are learned and the whole model converges?
To understand this overall flow, you must understand how things work under the hood. TensorFlow uses graph execution when building the model. When you pass [encoder_inputs, decoder_inputs, sequence_lengths] as the inputs and [Y_proba] as the output, the model doesn't immediately start training; first it builds the model. "Building" here means that a computational graph is constructed and stored; model.compile() then configures the loss and optimizer that will drive this graph.
Let me explain further. Suppose I want to compute c = a + b and then e = c * d with a computational graph. TensorFlow first creates one node per operation, an addition node and a multiplication node, with a, b and d as the inputs and e as the final output.
This is exactly what TensorFlow does when you pass your inputs and outputs: it traces the operations between them into a graph. That graph describes the forward pass, and the very same graph is then walked in reverse to compute the gradients during the backward pass.
So the computational graph is built first, and the same computational graph is then used to compute both the forward pass and the backward pass.
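Here is a minimal sketch of that idea using tf.GradientTape (the variable values are just illustrative):
import tensorflow as tf

a = tf.Variable(2.0)
b = tf.Variable(3.0)
d = tf.Variable(1.5)

with tf.GradientTape() as tape:
    c = a + b    # first node of the graph (forward pass)
    e = c * d    # second node, the "output"

# the backward pass walks the same graph in reverse
grads = tape.gradient(e, [a, b, d])
print([g.numpy() for g in grads])   # de/da = d, de/db = d, de/dd = c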
The same thing happens with the computational graph of your complete model: the model has to work out how Y_proba is produced. The graph consists of the operations and tensors created between the inputs and outputs. The Embedding layer is created and applied to the inputs encoder_inputs and decoder_inputs to obtain encoder_embeddings and decoder_embeddings, respectively. The LSTM layer is applied to encoder_embeddings to produce encoder_outputs, state_h, and state_c. These tensors are then passed as inputs to the BasicDecoder layer, which combines the decoder_embeddings, encoder_state (constructed from state_h and state_c), and sequence_lengths to produce final_outputs, final_state, and final_sequence_lengths. Finally, the softmax function is applied to the rnn_output of final_outputs to produce the final output Y_proba.
All the entities mentioned in the paragraph above become intermediate nodes in that computational graph.
So the graph starts at the inputs and works its way down to Y_proba. While the graph is being built, the weights of the model and the other parameters are also initialized. The graph is made once, and it is then cheap to compute the forward pass and the backward pass with it.
How are these layers trained and optimized for convergence?
When you specify inputs=[encoder_inputs, decoder_inputs, sequence_lengths] and outputs=[Y_proba], the model knows the intermediate layers that are used to compute Y_proba from encoder_inputs, decoder_inputs and sequence_lengths. These intermediate layers are the Embedding and LSTM layers, as well as the TrainingSampler, LSTMCell, Dense and BasicDecoder layers. They are automatically included in the computation graph of the model, allowing the optimizer to update their parameters during training.
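You can check this yourself after building the model: every one of those layers contributes weights to the model's trainable variables, and those are exactly what the optimizer updates during model.fit() (a sketch, assuming model refers to the Model built above):
# list the tracked layers and their weight shapes
for layer in model.layers:
    print(layer.name, [w.shape for w in layer.weights])

# Embedding + LSTM + decoder cell + output Dense weights all appear here
print(len(model.trainable_variables))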
This is an example of the Keras Functional API. In this style of defining a model, you write the blueprints first, and then use it later. Think of it like wiring a circuit: while you're connecting things, there's no electricity flowing through them (the electricity corresponds to data in our metaphor). Later, you turn on the power source, and electricity flows through.
This is how the Functional API works as well. First, let's read the last line:
model = keras.models.Model(
inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
outputs=[Y_proba])
This says, "Hey Keras, I need a model whose inputs are encoder_inputs, decoder_inputs, and sequence_lengths, and they will eventually produce Y_proba." The details of how they produce this output are defined above. Let's look at the specific lines you're having trouble with:
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
The first of these says, "Keras, give me a layer that will produce embeddings." embeddings is a layer object. The next two lines are the wiring we talked about: you can connect layers preemptively, before data flows through them; that's the crux of the Functional API. So the second line says, "Keras, encoder_inputs, which is an Input, will connect to (and go through) the Embedding layer I just created, and the result of that will be in a variable I call encoder_embeddings."
The rest of the code follows the same logic: you're connecting the wires together, before you compile your model and eventually fit it with the data.
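As a minimal sketch of that wiring idea outside of the seq2seq example (the layer sizes here are made up for illustration):
from tensorflow import keras

inputs = keras.layers.Input(shape=(16,))                      # blueprint only, no data yet
hidden = keras.layers.Dense(8, activation="relu")(inputs)     # wire: inputs -> Dense
outputs = keras.layers.Dense(1)(hidden)                       # wire: Dense -> output
model = keras.models.Model(inputs=inputs, outputs=outputs)    # close the circuit

# data only flows later, e.g. in model.fit(...) or model.predict(...)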
Specifically, what spurred this question is the return_sequences argument of TensorFlow's version of an LSTM layer.
The docs say:
Boolean. Whether to return the last output in the output sequence, or the full sequence. Default: False.
I've seen some implementations, especially autoencoders that use this argument to strip everything but the last element in the output sequence as the output of the 'encoder' half of the autoencoder.
Below are three different implementations. I'd like to understand the reasons behind the differences, as they seem like very large differences, yet all call themselves the same thing.
Example 1 (TensorFlow):
This implementation strips away all outputs of the LSTM except the last element of the sequence, and then repeats that element some number of times to reconstruct the sequence:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

# n_in, n_out: input and output sequence lengths
model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(n_in, 1)))
# Decoder below
model.add(RepeatVector(n_out))
model.add(LSTM(100, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
When looking at implementations of autoencoders in PyTorch, I don't see authors doing this. Instead they use the entire output of the LSTM for the encoder (sometimes followed by a dense layer and sometimes not).
Example 1 (PyTorch):
This implementation trains an embedding BEFORE an LSTM layer is applied... It seems to almost defeat the idea of an LSTM based auto-encoder... The sequence is already encoded by the time it hits the LSTM layer.
class EncoderLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1, drop_prob=0):
        super(EncoderLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, n_layers, dropout=drop_prob, batch_first=True)

    def forward(self, inputs, hidden):
        # Embed input words
        embedded = self.embedding(inputs)
        # Pass the embedded word vectors into LSTM and return all outputs
        output, hidden = self.lstm(embedded, hidden)
        return output, hidden
Example 2 (PyTorch):
This example encoder first expands the input with one LSTM layer, then does its compression via a second LSTM layer with a smaller number of hidden nodes. Besides the expansion, this seems in line with this paper I found: https://arxiv.org/pdf/1607.00148.pdf
However, in this implementation's decoder there is no final dense layer. The decoding happens through a second LSTM layer that expands the encoding back to the same dimension as the original input. See it here. This is not in line with the paper (although I don't know whether the paper is authoritative or not).
class Encoder(nn.Module):
    def __init__(self, seq_len, n_features, embedding_dim=64):
        super(Encoder, self).__init__()
        self.seq_len, self.n_features = seq_len, n_features
        self.embedding_dim, self.hidden_dim = embedding_dim, 2 * embedding_dim
        self.rnn1 = nn.LSTM(
            input_size=n_features,
            hidden_size=self.hidden_dim,
            num_layers=1,
            batch_first=True
        )
        self.rnn2 = nn.LSTM(
            input_size=self.hidden_dim,
            hidden_size=embedding_dim,
            num_layers=1,
            batch_first=True
        )

    def forward(self, x):
        x = x.reshape((1, self.seq_len, self.n_features))
        x, (_, _) = self.rnn1(x)
        x, (hidden_n, _) = self.rnn2(x)
        return hidden_n.reshape((self.n_features, self.embedding_dim))
Question:
I'm wondering about this discrepancy in implementations. The difference seems quite large. Are all of these valid ways to accomplish the same thing? Or are some of these misguided attempts at a "real" LSTM autoencoder?
There is no official or correct way of designing the architecture of an LSTM based autoencoder... The only specifics the name provides is that the model should be an Autoencoder and that it should use an LSTM layer somewhere.
The implementations you found are each different and unique on their own even though they could be used for the same task.
Let's describe them:
TF implementation:
It assumes the input has only one channel, meaning that each element in the sequence is just a number and that the data has already been preprocessed.
The default behaviour of the LSTM layer in Keras/TF is to output only the last output of the LSTM; you can make it output all the output steps with the return_sequences parameter (see the shape sketch after this list).
In this case the input data has been shrunk to (batch_size, LSTM_units).
Consider that the last output of an LSTM is of course a function of the previous outputs (specifically if it is a stateful LSTM).
It applies a Dense(1) in the last layer in order to get the same shape as the input.
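Here is a small sketch of what return_sequences changes in terms of output shapes (the sizes are arbitrary):
import numpy as np
from tensorflow.keras.layers import LSTM

x = np.random.rand(4, 10, 1).astype("float32")     # (batch, timesteps, channels)

last_only = LSTM(100)(x)                            # default: return_sequences=False
full_seq = LSTM(100, return_sequences=True)(x)      # keep every timestep

print(last_only.shape)   # (4, 100)     -> what the TF encoder above keeps
print(full_seq.shape)    # (4, 10, 100)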
PyTorch 1:
They apply an embedding to the input before it is fed to the LSTM.
This is standard practice and it helps, for example, to transform each input element into vector form (see word2vec, for example, where in a text sequence each word, which isn't a vector, is mapped into a vector space). It is only a preprocessing step, so that the data has a more meaningful form.
This does not defeat the idea of the LSTM autoencoder, because the embedding is applied independently to each element of the input sequence, so the sequence is not already encoded when it enters the LSTM layer; see the sketch below.
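A tiny sketch of that point: the embedding maps every position of the sequence to a vector independently, so the sequence structure is untouched (the sizes are made up):
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1000, embedding_dim=8)
tokens = torch.tensor([[3, 17, 42, 7]])   # (batch=1, seq_len=4) integer ids
vectors = emb(tokens)                     # (1, 4, 8): one vector per element, no mixing across positions
print(vectors.shape)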
PyTorch 2:
In this case the input shape is not (seq_len, 1) as in the first TF example, so the decoder doesn't need a Dense layer afterwards. The author used a number of units in the LSTM layer equal to the input shape.
In the end you choose the architecture of your model depending on the data you want to train on, specifically: the nature (text, audio, images), the input shape, the amount of data you have and so on...
I've been trying to teach myself the basics of RNNs with a personal project in PyTorch. I want to produce a simple network that is able to predict the next character in a sequence (the idea mainly comes from this article http://karpathy.github.io/2015/05/21/rnn-effectiveness/ but I wanted to do most of the work myself).
My idea is this: I take a batch of B input sequences of size n (numpy arrays of n integers), one-hot encode them and pass them through my network, which is composed of several LSTM layers, one fully connected layer and one softmax unit.
I then compare the output to the target sequences which are the input sequences shifted one step ahead.
My issue is that when I include the softmax layer, the output is the same every single epoch for every single batch. When I don't include it, the network seems to learn appropriately. I can't figure out what's wrong.
My implementation is the following:
class Model(nn.Module):
    def __init__(self, one_hot_length, dropout_prob, num_units, num_layers):
        super().__init__()
        self.num_units = num_units  # needed in forward_pass for the reshape
        self.LSTM = nn.LSTM(one_hot_length, num_units, num_layers, batch_first=True, dropout=dropout_prob)
        self.dropout = nn.Dropout(dropout_prob)
        self.fully_connected = nn.Linear(num_units, one_hot_length)
        self.softmax = nn.Softmax(dim=1)
        # dim = 1 as the tensor is of shape (batch_size*seq_length, one_hot_length) when entering the softmax unit

    def forward_pass(self, input_seq, hc_states):
        output, hc_states = self.LSTM(input_seq, hc_states)
        output = output.view(-1, self.num_units)
        output = self.fully_connected(output)
        # I simply comment out the next line when I run the network without the softmax layer
        output = self.softmax(output)
        return output, hc_states
one_hot_length is the size of my character dictionary (~200, also the size of a one-hot encoded vector);
num_units is the number of hidden units in an LSTM cell, and num_layers the number of LSTM layers in the network.
The inside of the training loop (simplified) goes as follows:
input, target = next_batches(data, batch_pointer)
input = nn.functional.one_hot(input, num_classes=one_hot_length).float()
for state in hc_states:
    state.detach_()
optimizer.zero_grad()
output, states = net.forward_pass(input, hc_states)
loss = nn.CrossEntropyLoss()(output, target.view(-1))
loss.backward()
nn.utils.clip_grad_norm_(net.parameters(), MaxGradNorm)
optimizer.step()
Here hc_states is a tuple holding the hidden-state tensor and the cell-state tensor, input is a tensor of size (B, n, one_hot_length), and target is of size (B, n).
I'm training on a really small dataset (sentences in a .txt file of about 400 KB) just to tune my code. I did 4 different runs with different parameters, and each time the outcome was the same: the network doesn't learn at all when it has the softmax layer, and trains somewhat appropriately without it.
I don't think it is an issue with tensors shapes as I'm almost sure I checked everything.
My understanding of my problem is that I'm trying to do classification, and that the usual approach is to put a softmax unit at the end to get "probabilities" for each character, but clearly this isn't right.
Any ideas to help me?
I'm also fairly new to PyTorch and RNNs, so I apologize in advance if my architecture/implementation is some kind of monstrosity to a knowledgeable person. Feel free to correct me, and thanks in advance.
I am training an autoencoder constructed using the Sequential API in Keras. I'd like to create separate models that implement the encoding and decoding functions. I know from examples how to do this with the functional API, but I can't find an example of how it's done with the Sequential API. The following sample code is my starting point:
input_dim = 2904
encoding_dim = 4
hidden_dim = 128
# instantiate model
autoencoder = Sequential()
# 1st hidden layer
autoencoder.add(Dense(hidden_dim, input_dim=input_dim, use_bias=False))
autoencoder.add(BatchNormalization())
autoencoder.add(Activation('elu'))
autoencoder.add(Dropout(0.5))
# encoding layer
autoencoder.add(Dense(encoding_dim, use_bias=False))
autoencoder.add(BatchNormalization())
autoencoder.add(Activation('elu'))
# autoencoder.add(Dropout(0.5))
# 2nd hidden layer
autoencoder.add(Dense(hidden_dim, use_bias=False))
autoencoder.add(BatchNormalization())
autoencoder.add(Activation('elu'))
autoencoder.add(Dropout(0.5))
# output layer
autoencoder.add(Dense(input_dim))
I realize I can select individual layers using autoencoder.layers[i], but I don't know how to associate a new model with a range of such layers. I naively tried the following:
encoder = Sequential()
for i in range(0, 7):
    encoder.add(autoencoder.layers[i])

decoder = Sequential()
for i in range(7, 12):
    decoder.add(autoencoder.layers[i])

print(encoder.summary())
print(decoder.summary())
which seemingly worked for the encoder part (a valid summary was shown), but the decoder part generated an error:
This model has not yet been built. Build the model first by calling build() or calling fit() with some data. Or specify input_shape or batch_input_shape in the first layer for automatic build.
Since the input shape of a middle layer (i.e. here I am referring to autoencoder.layers[7]) is not explicitly set, when you add it to another model as the first layer, that model will not be built automatically (i.e. the building process involves constructing the weight tensors for the layers in the model). Therefore, you need to call the build method explicitly and set the input shape:
decoder.build(input_shape=(None, encoding_dim)) # note that batch axis must be included
As a side note, there is no need to call print on model.summary(), since it would print the result by itself.
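Putting it together, here is a sketch of the full split (the indices follow the code in the question; double-check them against autoencoder.summary()):
encoder = Sequential(autoencoder.layers[:7])      # up to and including the bottleneck activation
decoder = Sequential(autoencoder.layers[7:])      # the remaining layers
decoder.build(input_shape=(None, encoding_dim))   # batch axis must be included

encoder.summary()
decoder.summary()

# both sub-models share their weights with the trained autoencoder,
# so decoder(encoder(x)) reproduces autoencoder(x)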
Another way which also works:
input_img = Input(shape=(encoding_dim,))
previous_layer = input_img
for i in range(bottleneck_layer, len(autoencoder.layers)):  # bottleneck_layer = index of the bottleneck layer + 1, i.e. the first decoder layer (7 in the example above)
    next_layer = autoencoder.layers[i](previous_layer)
    previous_layer = next_layer
decoder = Model(input_img, next_layer)
I've trained the following model for some timeseries in Keras:
input_layer = Input(batch_shape=(56, 3864))
first_layer = Dense(24, input_dim=28, activation='relu',
activity_regularizer=None,
kernel_regularizer=None)(input_layer)
first_layer = Dropout(0.3)(first_layer)
second_layer = Dense(12, activation='relu')(first_layer)
second_layer = Dropout(0.3)(second_layer)
out = Dense(56)(second_layer)
model_1 = Model(input_layer, out)
Then I defined a new model with the trained layers of model_1 and added dropout layers with a different rate, drp, to it:
input_2 = Input(batch_shape=(56, 3864))
first_dense_layer = model_1.layers[1](input_2)
first_dropout_layer = model_1.layers[2](first_dense_layer)
new_dropout = Dropout(drp)(first_dropout_layer)
snd_dense_layer = model_1.layers[3](new_dropout)
snd_dropout_layer = model_1.layers[4](snd_dense_layer)
new_dropout_2 = Dropout(drp)(snd_dropout_layer)
output = model_1.layers[5](new_dropout_2)
model_2 = Model(input_2, output)
Then I'm getting the prediction results of these two models as follow:
result_1 = model_1.predict(test_data, batch_size=56)
result_2 = model_2.predict(test_data, batch_size=56)
I was expecting to get completely different results, because the second model has new dropout layers and these two models are different (IMO), but that's not the case. Both generate the same result. Why is that happening?
As I mentioned in the comments, the Dropout layer is turned off in the inference phase (i.e. test mode), so when you use model.predict() the Dropout layers are not active. However, if you would like to have a model that uses Dropout both in the training and inference phases, you can pass the training argument when calling the layer, as suggested by François Chollet:
# ...
new_dropout = Dropout(drp)(first_dropout_layer, training=True)
# ...
Alternatively, if you have already trained your model and now want to use it in inference mode while keeping the Dropout layers (and possibly other layers which behave differently in the training/inference phases, such as BatchNormalization) active, you can define a backend function that takes the model's inputs as well as the Keras learning phase:
from keras import backend as K
func = K.function(model.inputs + [K.learning_phase()], model.outputs)
# to use it pass 1 to set the learning phase to training mode
outputs = func([input_arrays] + [1.])
Your question has a simple solution in the latest versions of TensorFlow: you can set the training argument of the call method to True.
You can run code like the following:
model(input, training=True)
With training=True, TensorFlow keeps the Dropout layer active during inference.
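One common use of this is so-called Monte Carlo dropout, where you average several stochastic forward passes (a sketch, assuming test_data matches the fixed batch shape of 56 used above):
import numpy as np

preds = np.stack([model_2(test_data, training=True).numpy() for _ in range(20)])
mean_pred = preds.mean(axis=0)   # averaged prediction
std_pred = preds.std(axis=0)     # spread across passes, a rough uncertainty estimate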
As there are already some working code solutions above, I will simply add a few more details regarding dropout during inference to prevent confusion.
Based on the original paper, Dropout layers randomly turn off neuron nodes (setting their activations to zero) during training to reduce overfitting. However, once we finish training and start testing the model, we do not 'touch' any neurons, so all the units contribute to the decision when inferencing. This makes each node's total input larger than it was during training, when some of the incoming units were dropped. To prevent this, a scaling factor is applied to balance the network: to be more precise, if a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p during the prediction stage.
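A tiny numpy sketch of why that scaling keeps the expected activation roughly the same (the keep probability and values here are arbitrary):
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                   # probability of keeping a unit
activations = rng.random(100_000)

# training: each unit is kept with probability p, dropped otherwise
train_signal = activations * (rng.random(activations.shape) < p)

# prediction (paper-style): keep every unit but scale by p
test_signal = activations * p

print(train_signal.mean(), test_signal.mean())   # both are close to activations.mean() * p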