Encoder-decoder RNN model in TensorFlow - python

I am implementing an encoder-decoder model using a bidirectional RNN for both the encoder and the decoder. Because the bidirectional RNN on the encoder side has already been created, along with the weights and vectors associated with it, I get the following error when I try to create another instance on the decoder side:
ValueError: Variable bidirectional_rnn/fw/gru_cell/w_ru already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope?
I tried defining each within its own name_scope, as below, but to no avail:
def enc(message, weights, biases):
    message = tf.unstack(message, timesteps_enc, 1)
    fw_cell = rnn.GRUBlockCell(num_hidden_enc)
    bw_cell = rnn.GRUBlockCell(num_hidden_enc)
    with tf.name_scope("encoder"):
        outputs, _, _ = rnn.static_bidirectional_rnn(fw_cell, bw_cell, message, dtype=tf.float32)
    return tf.matmul(outputs[-1], weights) + biases

def dec(codeword, weights, biases):
    codeword = tf.expand_dims(codeword, axis=2)
    codeword = tf.unstack(codeword, timesteps_dec, 1)
    fw_cell = rnn.GRUBlockCell(num_hidden_dec)
    bw_cell = rnn.GRUBlockCell(num_hidden_dec)
    with tf.name_scope("decoder"):
        outputs, _, _ = rnn.static_bidirectional_rnn(fw_cell, bw_cell, codeword, dtype=tf.float32)
    return tf.matmul(outputs[-1], weights) + biases
Can someone please hint at what I am doing wrong?

Just putting it as an answer:
Try exchanging name_scope for variable_scope. I'm not sure if it is still valid, but in older versions of TF the use of name_scope was not encouraged. From your variable name bidirectional_rnn/fw/gru_cell/w_ru you can see that the scope is not applied to the variables.
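For example, a minimal sketch of that change for the encoder function from the question (TF 1.x with tensorflow.contrib.rnn; timesteps_enc and num_hidden_enc are assumed to be defined elsewhere, as in the question). The decoder gets the same treatment with tf.variable_scope("decoder"):

import tensorflow as tf
from tensorflow.contrib import rnn

def enc(message, weights, biases):
    message = tf.unstack(message, timesteps_enc, 1)
    fw_cell = rnn.GRUBlockCell(num_hidden_enc)
    bw_cell = rnn.GRUBlockCell(num_hidden_enc)
    # variable_scope (unlike name_scope) prefixes the variable names, so the
    # encoder weights live under "encoder/bidirectional_rnn/..." and no longer
    # collide with the decoder's variables.
    with tf.variable_scope("encoder"):
        outputs, _, _ = rnn.static_bidirectional_rnn(
            fw_cell, bw_cell, message, dtype=tf.float32)
    return tf.matmul(outputs[-1], weights) + biases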

One thing is that you cannot create variables with the same name in the same scope, so changing name_scope to variable_scope will fix the error during training.
The other thing is that such a model cannot work as an encoder-decoder model, because the decoder RNN cannot be bidirectional. You do have the entire target sequences at training time, but at inference time you generate the target left to right. This means you only have the left context for the forward RNN; you don't have the right context the backward RNN would need.


How are keras tensors connected to layers that create them

In the book "Machine Learning with scikit-learn and Tensorflow" there's a code fragment I can't wrap my head around. Until that chapter, their models were only explicitly using layers - be it in a sequential fashion, or functional. But in the chapter 16, there's this:
import tensorflow_addons as tfa
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]
sampler = tfa.seq2seq.sampler.TrainingSampler()
decoder_cell = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
                                                 output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(
    decoder_embeddings, initial_state=encoder_state,
    sequence_length=sequence_lengths)
Y_proba = tf.nn.softmax(final_outputs.rnn_output)
model = keras.models.Model(
    inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
    outputs=[Y_proba])
And then he just runs the model in a standard way:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
X = np.random.randint(100, size=10*1000).reshape(1000, 10)
Y = np.random.randint(100, size=15*1000).reshape(1000, 15)
X_decoder = np.c_[np.zeros((1000, 1)), Y[:, :-1]]
seq_lengths = np.full([1000], 15)
history = model.fit([X, X_decoder, seq_lengths], Y, epochs=2)
I have trouble understanding the code starting at line 7. The author creates an Embedding layer and immediately calls it on encoder_inputs and decoder_inputs; then he does basically the same with the LSTM layer, calling it on the previously created encoder_embeddings, and the tensors returned by these calls are used further down in the code. What I don't get is how those tensors are trained. It looks like he's not adding the layers that create them to the model, so how come the embeddings are learned and the whole model converges?
To understand the overall flow, you need to understand how things are built under the hood. TensorFlow uses graph execution when building the model. When you pass [encoder_inputs, decoder_inputs, sequence_lengths] as the inputs and [Y_proba] as the output, the model doesn't immediately start training; it first builds the model. What does "builds" mean here? It means a computational graph is constructed and stored. model.compile() does this for you: it builds the computational graph.
Let me explain further. Suppose I want to compute c = a + b and then e = c * d using a computational graph. TensorFlow first creates a node for each of these operations. What does that look like? See the picture below.
As you can see in the picture above, TensorFlow does exactly the same thing when you pass your inputs and outputs. That picture depicts the forward pass; the same graph is then reused by TensorFlow for the backward pass, as shown in the figure below.
So the computational graph is built first, and then that same graph is used to compute both the forward pass and the backward pass.
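As a toy illustration (my own sketch, not from the original answer), the same idea can be seen with tf.function in TF 2: the operations are traced into a graph once, and that graph is reused for every call.

import tensorflow as tf

@tf.function  # traced once into a graph, then reused on every call
def forward(a, b, d):
    c = a + b   # first node
    e = c * d   # second node
    return e

print(forward(tf.constant(1.0), tf.constant(2.0), tf.constant(3.0)))  # -> 9.0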
The same thing happens with your complete model: TensorFlow builds its computational graph. But how, in your case specifically? The graph consists of the operations and tensors created between the inputs and outputs. The Embedding layer is created and applied to the inputs encoder_inputs and decoder_inputs to obtain encoder_embeddings and decoder_embeddings, respectively. The LSTM layer is applied to encoder_embeddings to produce encoder_outputs, state_h, and state_c. These tensors are then passed as inputs to the BasicDecoder layer, which combines the decoder_embeddings, encoder_state (constructed from state_h and state_c), and sequence_lengths to produce final_outputs, final_state, and final_sequence_lengths. Finally, the softmax function is applied to the rnn_output of final_outputs to produce the final output Y_proba.
All the tensors mentioned in the paragraph above become intermediate nodes in the computational graph.
So the graph starts with the inputs and works its way down to Y_proba. While the graph is being built, the weights of the model and other parameters are also initialized. The graph is made once, and it is then straightforward to run the forward and backward passes through it.
How are these layers trained and optimized for convergence?
When you specify inputs=[encoder_inputs, decoder_inputs, sequence_lengths] and outputs=[Y_proba], the model knows the intermediate layers that are used to compute Y_proba from encoder_inputs, decoder_inputs and sequence_lengths. These intermediate layers are the Embedding and LSTM layers, as well as the TrainingSampler, LSTMCell, Dense and BasicDecoder layers. They are automatically included in the computation graph of the model, allowing the optimizer to update their parameters during training.
This is an example of the Keras Functional API. In this style of defining a model, you write the blueprints first, and then use it later. Think of it like wiring a circuit: while you're connecting things, there's no electricity flowing through them (the electricity corresponds to data in our metaphor). Later, you turn on the power source, and electricity flows through.
This is how the Functional API works as well. First, let's read the last line:
model = keras.models.Model(
    inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
    outputs=[Y_proba])
This says "Hey Keras, I need a model whose inputs are encoder_inputs, decoder_inputs, and sequence_lengths, and they will eventually produce Y_proba. The details of how they will produce this output is defined above. Let's look at the specific lines you're having trouble with:
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
The first of these says, "Keras, give me a layer that will produce embeddings." embeddings is a layer object. The next two lines are the wiring we talked about: you can connect layers before any data flows through them; that's the crux of the Functional API. So the second line says, "Keras, encoder_inputs, which is an Input, will connect to (and go through) the Embedding layer I just created, and the result of that will be in a variable I call encoder_embeddings."
The rest of the code follows the same logic: you're connecting the wires together, before you compile your model and eventually fit it with the data.
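To make the wiring metaphor concrete, here is a minimal self-contained sketch of the same pattern (toy layer sizes and random data, not from the book):

import numpy as np
from tensorflow import keras

# Wiring: no data flows yet, we are only describing how tensors connect.
inp = keras.layers.Input(shape=(4,))
hidden = keras.layers.Dense(8, activation="relu")(inp)  # symbolic tensor
out = keras.layers.Dense(1)(hidden)                     # symbolic tensor

model = keras.Model(inputs=inp, outputs=out)
model.compile(loss="mse", optimizer="adam")

# Only now does data actually flow through the wired layers,
# and the Dense weights get trained.
model.fit(np.random.rand(32, 4), np.random.rand(32, 1), epochs=1, verbose=0)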

Extract intermediate variable from a custom Tensorflow/Keras layer during inference (TF 2.0)

A bit of background:
I've implemented an NLP classification model, mostly using the Keras functional API in TensorFlow 2.0. The model architecture is a pretty straightforward LSTM network with the addition of an Attention layer between the LSTM and the Dense output layer. The Attention layer comes from this Kaggle kernel (starting around line 51).
I wrapped the trained model in a simple Flask app and get reasonably accurate predictions. In addition to predicting a class for a specific input I also output the value of the attention weight vector "a" from the aforementioned Attention layer so I can visualize the weights applied to the input sequence.
My current method of extracting the attention weights variable works, but seems incredibly inefficient as I'm predicting the output class and then manually calculating the attention vector using an intermediate Keras model. In the Flask app, inference looks something like this:
# Load the trained model
model = tf.keras.models.load_model('saved_model.h5')
# Extract the trained weights and biases of the trained attention layer
attention_weights = model.get_layer('attention').get_weights()
# Create an intermediate model that outputs the activations of the LSTM layer
intermediate_model = tf.keras.Model(inputs=model.input, outputs=model.get_layer('bi-lstm').output)
# Predict the output class using the trained model
model_score = model.predict(input)
# Obtain LSTM activations by predicting the output again using the intermediate model
lstm_activations = intermediate_model.predict(input)
# Use the intermediate LSTM activations and the trained model attention layer weights and biases to calculate the attention vector.
# Maths from the custom Attention Layer (heavily modified for the sake of brevity)
eij = tf.keras.backend.dot(lstm_activations, attention_weights)
a = tf.keras.backend.exp(eij)
attention_vector = a
I think I should be able to include the attention vector as part of the model output, but I'm struggling with figuring out how to accomplish this. Ideally I'd extract the attention vector from the custom attention layer in a single forward pass rather than extracting the various intermediate model values and calculating a second time.
For example:
model_score = model.predict(input)
model_score[0] # The predicted class label or probability
model_score[1] # The attention vector, a
I think I'm missing some basic knowledge around how Tensorflow/Keras throw variables around and when/how I can access those values to include as model output. Any advice would be appreciated.
After a little more research I've managed to cobble together a working solution. I'll summarize here for any future weary internet travelers that come across this post.
The first clues came from this github thread. The attention layer defined there seems to build on the attention layer in the previously mentioned Kaggle kernel. The github user adds a return_attention flag to the layer init which, when enabled, includes the attention vector in addition to the weighted RNN output vector in the layer output.
I also added a get_config function suggested by this user in the same github thread which enables us to save and reload trained models. I had to add the return_attention flag to get_config, otherwise TF would throw a list iteration error when trying to load a saved model with return_attention=True.
With those changes made, the model definition needed to be updated to capture the additional layer outputs.
inputs = Input(shape=(max_sequence_length,))
lstm = Bidirectional(LSTM(lstm1_units, return_sequences=True))(inputs)
# Added 'attention_vector' to capture the second layer output
attention, attention_vector = Attention(max_sequence_length, return_attention=True)(lstm)
x = Dense(dense_units, activation="softmax")(attention)
The final, and most important piece of the puzzle came from this Stackoverflow answer. The method described there allows us to output multiple results while only optimizing on one of them. The code changes are subtle, but very important. I've added comments below in the spots I made changes to implement this functionality.
model = Model(
    inputs=inputs,
    outputs=[x, attention_vector]  # Original value: outputs=x
)
model.compile(
    loss=['categorical_crossentropy', None],  # Original value: loss='categorical_crossentropy'
    optimizer=optimizer,
    metrics=[BinaryAccuracy(name='accuracy')])
With those changes in place, I retrained the model and voila! The output of model.predict() is now a list containing the score and its associated attention vector.
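In case it helps, a small usage sketch of what that looks like (test_inputs here is a hypothetical batch of preprocessed inputs, not from my code):

# model.predict returns one array per output, in the order given in
# outputs=[x, attention_vector] above.
class_probs, attention_vectors = model.predict(test_inputs)

predicted_label = class_probs[0].argmax()      # class for the first example
attention_for_example = attention_vectors[0]   # one weight per input timestep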
The results of the change were pretty dramatic. Running inference on 10k examples took about 20 minutes using this new method. The old method utilizing intermediate models took ~33 minutes to perform inference on the same dataset.
And for anyone that's interested, here is my modified Attention layer:
from tensorflow.python.keras.layers import Layer
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras import backend as K
class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, return_attention=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')
        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)
        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        self.return_attention = return_attention
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None
        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                              K.reshape(self.W, (features_dim, 1))), (-1, step_dim))
        if self.bias:
            eij += self.b
        eij = K.tanh(eij)
        a = K.exp(eij)
        if mask is not None:
            a *= K.cast(mask, K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)
        weighted_input = x * a
        result = K.sum(weighted_input, axis=1)
        if self.return_attention:
            return [result, a]
        return result

    def compute_output_shape(self, input_shape):
        if self.return_attention:
            return [(input_shape[0], self.features_dim),
                    (input_shape[0], input_shape[1])]
        else:
            return input_shape[0], self.features_dim

    def get_config(self):
        config = {
            'step_dim': self.step_dim,
            'W_regularizer': regularizers.serialize(self.W_regularizer),
            'b_regularizer': regularizers.serialize(self.b_regularizer),
            'W_constraint': constraints.serialize(self.W_constraint),
            'b_constraint': constraints.serialize(self.b_constraint),
            'bias': self.bias,
            'return_attention': self.return_attention
        }
        base_config = super(Attention, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
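One related note (a sketch, reusing the 'saved_model.h5' file name from the question): because the layer now implements get_config, a saved model can be reloaded, as long as the custom class is passed via custom_objects:

import tensorflow as tf

model = tf.keras.models.load_model(
    'saved_model.h5',
    custom_objects={'Attention': Attention})  # map the saved layer name to the class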

Adding Dropout to testing/inference phase

I've trained the following model for some timeseries in Keras:
input_layer = Input(batch_shape=(56, 3864))
first_layer = Dense(24, input_dim=28, activation='relu',
                    activity_regularizer=None,
                    kernel_regularizer=None)(input_layer)
first_layer = Dropout(0.3)(first_layer)
second_layer = Dense(12, activation='relu')(first_layer)
second_layer = Dropout(0.3)(second_layer)
out = Dense(56)(second_layer)
model_1 = Model(input_layer, out)
Then I defined a new model with the trained layers of model_1 and added dropout layers with a different rate, drp, to it:
input_2 = Input(batch_shape=(56, 3864))
first_dense_layer = model_1.layers[1](input_2)
first_dropout_layer = model_1.layers[2](first_dense_layer)
new_dropout = Dropout(drp)(first_dropout_layer)
snd_dense_layer = model_1.layers[3](new_dropout)
snd_dropout_layer = model_1.layers[4](snd_dense_layer)
new_dropout_2 = Dropout(drp)(snd_dropout_layer)
output = model_1.layers[5](new_dropout_2)
model_2 = Model(input_2, output)
Then I'm getting the prediction results of these two models as follow:
result_1 = model_1.predict(test_data, batch_size=56)
result_2 = model_2.predict(test_data, batch_size=56)
I was expecting to get completely different results, because the second model has new dropout layers and these two models are different (IMO), but that's not the case. Both generate the same result. Why is that happening?
As I mentioned in the comments, the Dropout layer is turned off in the inference phase (i.e. test mode), so when you use model.predict() the Dropout layers are not active. However, if you would like a model that uses Dropout in both the training and inference phases, you can pass the training argument when calling the layer, as suggested by François Chollet:
# ...
new_dropout = Dropout(drp)(first_dropout_layer, training=True)
# ...
Alternatively, if you have already trained your model and now want to use it in inference mode while keeping the Dropout layers (and possibly other layers that behave differently in training and inference, such as BatchNormalization) active, you can define a backend function that takes the model's inputs as well as the Keras learning phase:
from keras import backend as K
func = K.function(model.inputs + [K.learning_phase()], model.outputs)
# to use it pass 1 to set the learning phase to training mode
outputs = func([input_arrays] + [1.])
Your question has a simple solution in the latest version of TensorFlow: you can set the training argument of the call method to True. You can run code like this:
model(input, training=True)
With training=True, TensorFlow keeps the Dropout layers active even when you run the model for inference.
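A short sketch of what that looks like with the question's model_2 and test_data (TF 2.x eager execution assumed): each call re-samples the dropout masks, so repeated predictions differ.

import numpy as np

preds_a = model_2(test_data, training=True).numpy()
preds_b = model_2(test_data, training=True).numpy()

# With dropout active, the two stochastic predictions will generally differ.
print(np.allclose(preds_a, preds_b))  # typically False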
As there are already some working code solutions above, I will simply add a few more details regarding dropout during inference to prevent confusion.
Based on the original paper, Dropout layers play the role of randomly turning off (zeroing the outputs of) neurons during training to reduce overfitting. However, once we finish training and start testing the model, we do not 'touch' any neurons; every unit contributes to the decision at inference time. Because all units are now active, the summed input to the next layer would be larger than what it saw during training, when Dropout was in use. To prevent this, a scaling factor is applied to keep the network balanced. To be more precise, if a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at prediction time.
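A tiny NumPy sketch (my own illustration, not tied to the question's model) of why that scaling works: keeping a unit with probability p during training and multiplying its output by p at test time leaves the expected activation seen by the next layer unchanged.

import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.7                    # unit retained with probability p during training
activations = np.ones(100_000)  # the same unit's activation over many steps

# Training time: the unit is randomly zeroed with probability 1 - p.
train_time = activations * (rng.random(activations.shape) < p_keep)

# Test time: the unit is always kept, but its contribution is scaled by p.
test_time = activations * p_keep

print(train_time.mean(), test_time.mean())  # both close to 0.7

(Keras implements the equivalent "inverted" variant: it scales the surviving activations by 1/(1 - rate) during training instead, so nothing needs to be rescaled at inference.)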

TF Graph does not correspond to the code

I am trying to create a very simple neural network that reads in information with the shape 1x2048 and produces a classification for two categories (object or not object). The graph structure, however, deviates from what I believe I have coded. The dense layers should be included in the scope of "inner_layer" and should be receiving their input from the "input" placeholder. Instead, TF seems to be treating them as independent layers which do not receive any information from "input".
Also, when trying to use tensorboard summaries, I get an error telling me that I have not provided inputs for the apparent placeholders of the dense layers. When omitting tensorboard, everything works as I expect based on the code.
I have spent a lot of time trying to find the problem, but I think I must be overlooking something very basic.
The graph I get in tensorboard is on this image.
Which corresponds to the following code:
tf.reset_default_graph()
keep_prob = 0.5

# Graph structure
## Placeholders for input
with tf.name_scope('input'):
    x_ = tf.placeholder(tf.float32, shape=[None, transfer_values_train.shape[1]], name="input1")
    y_ = tf.placeholder(tf.float32, shape=[None, num_classes], name="labels")

## Dense layer one with 2048 nodes
with tf.name_scope('inner_layers'):
    first_layer = tf.layers.dense(x_, units=2048, activation=tf.nn.relu, name="first_dense")
    dropout_layer = tf.nn.dropout(first_layer, keep_prob, name="dropout_layer")
    # readout layer, without softmax
    y_conv = tf.layers.dense(dropout_layer, units=2, activation=tf.nn.relu, name="second_dense")

# Evaluation and training
with tf.name_scope('cross_entropy'):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_, logits=y_conv),
                                   name="cross_entropy_layer")

with tf.name_scope('trainer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

with tf.name_scope('accuracy'):
    prediction = tf.argmax(y_conv, axis=1)
    correct_prediction = tf.equal(prediction, tf.argmax(y_, axis=1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Does anyone have an idea why the graph is so different from what you would expect based on the code?
The graph rendering in tensorboard may be a bit confusing (initially), but it's correct. Take a look at this picture where I've left only the inner_layers part of your graph:
You may notice that:
The first_dense and second_dense are actually name scopes themselves (generated by the tf.layers.dense function; see also this question).
Their input/output tensors are inside the inner_layers scope and wire correctly to the dropout_layer. Inside each of the dense layers live the corresponding linear ops: MatMul, BiasAdd, Relu.
Both scopes also include their variables (a kernel and a bias each), which are shown separately from inner_layers. The variable nodes encapsulate the ops related specifically to the variable, such as read, assign, and initialize. The linear ops in first_dense depend on the variable ops of first_dense, and likewise for second_dense.
The reason for this separation is that in distributed settings the variables are managed by a different task called the parameter server. It is usually run on a different device (CPU as opposed to GPU), sometimes even on a different machine. In other words, for TensorFlow, variable management is by design separate from matrix computation.
Having said that, I'd love to see a mode in TensorFlow that would not split the scope into variables and ops and would keep them coupled.
Other than this, the graph perfectly matches the code.
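As an aside (a sketch, not something the answer requires): if you would still like the kernel and bias variables to share the inner_layers prefix, you can replace tf.name_scope with tf.variable_scope for that block, since variables created by tf.layers.dense respect variable_scope but ignore name_scope. This reuses x_ and keep_prob from the question's code.

# TF 1.x sketch: the variables become inner_layers/first_dense/kernel, etc.,
# so TensorBoard groups them under the inner_layers node as well.
with tf.variable_scope('inner_layers'):
    first_layer = tf.layers.dense(x_, units=2048, activation=tf.nn.relu,
                                  name="first_dense")
    dropout_layer = tf.nn.dropout(first_layer, keep_prob, name="dropout_layer")
    y_conv = tf.layers.dense(dropout_layer, units=2, activation=tf.nn.relu,
                             name="second_dense")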

variable scope issue in Tensorflow

def biLSTM(data, n_steps):
    n_hidden = 24
    data = tf.transpose(data, [1, 0, 2])
    # Reshape to (n_steps*batch_size, n_input)
    data = tf.reshape(data, [-1, 300])
    # Split to get a list of 'n_steps' tensors of shape (batch_size, n_input)
    data = tf.split(0, n_steps, data)
    lstm_fw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
    # Backward direction cell
    lstm_bw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
    outputs, _, _ = tf.nn.bidirectional_rnn(lstm_fw_cell, lstm_bw_cell, data, dtype=tf.float32)
    return outputs, n_hidden
In my code I am calling this function twice to create 2 bidirectional LSTMs. Then I got the problem of reusing variables.
ValueError: Variable lstm/BiRNN_FW/BasicLSTMCell/Linear/Matrix already exists, disallowed. Did you mean to set reuse=True in VarScope?
To resolve this, I wrapped the LSTM definition in the function in with tf.variable_scope('lstm', reuse=True) as scope:
This led to a new issue:
ValueError: Variable lstm/BiRNN_FW/BasicLSTMCell/Linear/Matrix does not exist, disallowed. Did you mean to set reuse=None in VarScope?
Please help with a solution to this.
When you create a BasicLSTMCell(), it creates all the required weights and biases to implement an LSTM cell under the hood. All of these variables are assigned names automatically. If you call the function more than once within the same scope, you get the error you see. Since your question seems to state that you want to create two separate LSTM cells, you do not want to reuse the variables; you want to create them in separate scopes. You can do this in two different ways (I haven't actually tried to run this code, but it should work). You can call your function from within a unique scope:
def biLSTM(data, n_steps):
    ... blah ...

with tf.variable_scope('LSTM1'):
    outputs, hidden = biLSTM(data, steps)

with tf.variable_scope('LSTM2'):
    outputs, hidden = biLSTM(data, steps)
or you can pass a unique scope name to the function and use the scope inside
def biLSTM(data, n_steps, layer_name):
    ... blah...
    with tf.variable_scope(layer_name) as scope:
        lstm_fw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
        lstm_bw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
        outputs, _, _ = tf.nn.bidirectional_rnn(lstm_fw_cell, lstm_bw_cell, data, dtype=tf.float32)
    return outputs, n_hidden

l1 = biLSTM(data, steps, 'layer1')
l2 = biLSTM(data, steps, 'layer2')
It is up to your coding sensibilities which approach to choose, they are functionally pretty much the same.
I also had a similar problem, though I was using the Keras implementation with a pretrained ResNet50 model.
It worked for me when I updated the TensorFlow version using the following command:
conda update -f -c conda-forge tensorflow
and used
from keras import backend as K
K.clear_session()
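For context, a sketch of how that call is typically used (note the parentheses; clear_session is a function), assuming a standalone-Keras ResNet50 workflow like the one described:

from keras import backend as K
from keras.applications.resnet50 import ResNet50

K.clear_session()                     # discard the old graph and its variable names
model = ResNet50(weights='imagenet')  # rebuild the model in a fresh graph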
