I need to convert the following model to TF2. get_outputs() applies several convolution layers, then flat and dense layers and produces some outputs.
with tf.compat.v1.variable_scope(scope_name, reuse=tf.compat.v1.AUTO_REUSE):
outputs = get_outputs()
The following lines apply an exponential moving average to the trainable variables.
params = tf.compat.v1.trainable_variables(scope_name)
ema = tf.train.ExponentialMovingAverage(alpha)
ema_apply_op = ema.apply(params)
def custom_getter(getter, *args, **kwargs):
return ema.average(getter(*args, **kwargs))
This is an identical model which outputs the same outputs after somehow using the custom getter defined above.
with tf.compat.v1.variable_scope(
scope_name, reuse=True, custom_getter=custom_getter
):
avg_outputs = get_outputs()
And after each training iteration:
with tf.control_dependencies([opt_op]): # opt_op being the result of optimizer.apply_gradients()
grouped = tf.group(ema_apply_op)
And then grouped is run in the default session. I need some way to convert this to TF2. I tried the following and I'm not sure whether this works or not, for some reason the translated to TF2 model (keras) does not converge. I'm trying to find out why by eliminating potential sources causing the issue.
model = create_keras_model() # Keras model, results in the same outputs by TF1 model.
avg_model = create_keras_model() # Identical model
And after each training iteration, I do the following:
avg_weights = []
for w1, w2 in zip(model.trainable_variables, avg_model.trainable_variables):
avg_weights.append(alpha * w2 + (1 - decay) * w1)
avg_model.set_weights(avg_weights)
Due to model convergence issues, I cannot determine whether this works or not. I also tried replicating the same steps using tf.train.ExponentialMovingAverage however, it doesn't do anything in TF2. When ema.apply() and ema.average() are called on the training variables, nothing happens, their values remain unchanged.
Related
A bit of background:
I've implemented an NLP classification model using mostly Keras functional model bits of Tensorflow 2.0. The model architecture is a pretty straightforward LSTM network with the addition of an Attention layer between the LSTM and the Dense output layer. The Attention layer comes from this Kaggle kernel (starting around line 51).
I wrapped the trained model in a simple Flask app and get reasonably accurate predictions. In addition to predicting a class for a specific input I also output the value of the attention weight vector "a" from the aforementioned Attention layer so I can visualize the weights applied to the input sequence.
My current method of extracting the attention weights variable works, but seems incredibly inefficient as I'm predicting the output class and then manually calculating the attention vector using an intermediate Keras model. In the Flask app, inference looks something like this:
# Load the trained model
model = tf.keras.models.load_model('saved_model.h5')
# Extract the trained weights and biases of the trained attention layer
attention_weights = model.get_layer('attention').get_weights()
# Create an intermediate model that outputs the activations of the LSTM layer
intermediate_model = tf.keras.Model(inputs=model.input, outputs=model.get_layer('bi-lstm').output)
# Predict the output class using the trained model
model_score = model.predict(input)
# Obtain LSTM activations by predicting the output again using the intermediate model
lstm_activations = intermediate_model.predict(input)
# Use the intermediate LSTM activations and the trained model attention layer weights and biases to calculate the attention vector.
# Maths from the custom Attention Layer (heavily modified for the sake of brevity)
eij = tf.keras.backend.dot(lstm_activations, attention_weights)
a = tf.keras.backend.exp(eij)
attention_vector = a
I think I should be able to include the attention vector as part of the model output, but I'm struggling with figuring out how to accomplish this. Ideally I'd extract the attention vector from the custom attention layer in a single forward pass rather than extracting the various intermediate model values and calculating a second time.
For example:
model_score = model.predict(input)
model_score[0] # The predicted class label or probability
model_score[1] # The attention vector, a
I think I'm missing some basic knowledge around how Tensorflow/Keras throw variables around and when/how I can access those values to include as model output. Any advice would be appreciated.
After a little more research I've managed to cobble together a working solution. I'll summarize here for any future weary internet travelers that come across this post.
The first clues came from this github thread. The attention layer defined there seems to build on the attention layer in the previously mentioned Kaggle kernel. The github user adds a return_attention flag to the layer init which, when enabled, includes the attention vector in addition to the weighted RNN output vector in the layer output.
I also added a get_config function suggested by this user in the same github thread which enables us to save and reload trained models. I had to add the return_attention flag to get_config, otherwise TF would throw a list iteration error when trying to load a saved model with return_attention=True.
With those changes made, the model definition needed to be updated to capture the additional layer outputs.
inputs = Input(shape=(max_sequence_length,))
lstm = Bidirectional(LSTM(lstm1_units, return_sequences=True))(inputs)
# Added 'attention_vector' to capture the second layer output
attention, attention_vector = Attention(max_sequence_length, return_attention=True)(lstm)
x = Dense(dense_units, activation="softmax")(attention)
The final, and most important piece of the puzzle came from this Stackoverflow answer. The method described there allows us to output multiple results while only optimizing on one of them. The code changes are subtle, but very important. I've added comments below in the spots I made changes to implement this functionality.
model = Model(
inputs=inputs,
outputs=[x, attention_vector] # Original value: outputs=x
)
model.compile(
loss=['categorical_crossentropy', None], # Original value: loss='categorical_crossentropy'
optimizer=optimizer,
metrics=[BinaryAccuracy(name='accuracy')])
With those changes in place, I retrained the model and voila! The output of model.predict() is now a list containing the score and its associated attention vector.
The results of the change were pretty dramatic. Running inference on 10k examples took about 20 minutes using this new method. The old method utilizing intermediate models took ~33 minutes to perform inference on the same dataset.
And for anyone that's interested, here is my modified Attention layer:
from tensorflow.python.keras.layers import Layer
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras import backend as K
class Attention(Layer):
def __init__(self, step_dim,
W_regularizer=None, b_regularizer=None,
W_constraint=None, b_constraint=None,
bias=True, return_attention=True, **kwargs):
self.supports_masking = True
self.init = initializers.get('glorot_uniform')
self.W_regularizer = regularizers.get(W_regularizer)
self.b_regularizer = regularizers.get(b_regularizer)
self.W_constraint = constraints.get(W_constraint)
self.b_constraint = constraints.get(b_constraint)
self.bias = bias
self.step_dim = step_dim
self.features_dim = 0
self.return_attention = return_attention
super(Attention, self).__init__(**kwargs)
def build(self, input_shape):
assert len(input_shape) == 3
self.W = self.add_weight(shape=(input_shape[-1],),
initializer=self.init,
name='{}_W'.format(self.name),
regularizer=self.W_regularizer,
constraint=self.W_constraint)
self.features_dim = input_shape[-1]
if self.bias:
self.b = self.add_weight(shape=(input_shape[1],),
initializer='zero',
name='{}_b'.format(self.name),
regularizer=self.b_regularizer,
constraint=self.b_constraint)
else:
self.b = None
self.built = True
def compute_mask(self, input, input_mask=None):
return None
def call(self, x, mask=None):
features_dim = self.features_dim
step_dim = self.step_dim
eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
K.reshape(self.W, (features_dim, 1))), (-1, step_dim))
if self.bias:
eij += self.b
eij = K.tanh(eij)
a = K.exp(eij)
if mask is not None:
a *= K.cast(mask, K.floatx())
a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
a = K.expand_dims(a)
weighted_input = x * a
result = K.sum(weighted_input, axis=1)
if self.return_attention:
return [result, a]
return result
def compute_output_shape(self, input_shape):
if self.return_attention:
return [(input_shape[0], self.features_dim),
(input_shape[0], input_shape[1])]
else:
return input_shape[0], self.features_dim
def get_config(self):
config = {
'step_dim': self.step_dim,
'W_regularizer': regularizers.serialize(self.W_regularizer),
'b_regularizer': regularizers.serialize(self.b_regularizer),
'W_constraint': constraints.serialize(self.W_constraint),
'b_constraint': constraints.serialize(self.b_constraint),
'bias': self.bias,
'return_attention': self.return_attention
}
base_config = super(Attention, self).get_config()
return dict(list(base_config.items()) + list(config.items()))
I have a very large tensorflow graph, and two sets of variables: A and B. I create two optimizers:
learning_rate = 1e-3
optimizer1 = tf.train.AdamOptimizer(learning_rate).minimize(loss_1, var_list=var_list_1)
optimizer2 = tf.train.AdamOptimizer(learning_rate).minimize(loss_2, var_list=var_list_2)
The goal here is to iteratively optimize variables 1 and variables 2. The weights from variables 2 are used in the computation of loss 1, but they're not trainable when optimizing loss 1. Meanwhile, the weights from variables 1 are not used in optimizing loss 2 (I would say this is a key asymmetry).
I am finding, weirdly, that this optimization for optimizer2 is much, much slower (2x) than if I were to just optimize that part of the graph by itself. I'm not running any summaries.
Why would this phenomenon happen? How could I fix it? I can provide more details if necessary.
I am guessing that this is a generative adversarial network given by the relation between the losses and the parameters. It seems that the first group of parameters are the generative model and the second group make up the detector model.
If my guesses are correct, then that would mean that the second model is using the output of the first model as its input. Admittedly, I am much more informed about PyTorch than TF. There is a comment which I believe is saying that the first model could be included in the second graph. I also think this is true. I would implement something similar to the following. The most important part is just creating a copy of the generated_tensor with no graph:
// An arbitrary label
label = torch.Tensor(1.0)
// Treat GenerativeModel as the model with the first list of Variables/parameters
generated_tensor = GenerativeModel(random_input_tensor)
// Treat DetectorModel as the model with the second list of Variables/parameters
detector_prediction = DetectorModel(generated_tensor)
generated_tensor_copy = torch.tensor(generated_tensor, requires_grad=False)
detector_prediction_copy = DetectorModel(generated_tensor_copy)
//This is for optimizing the first model, but it has the second model in its graph
// which is necessary.
loss1 = loss_func1(detector_prediction, label)
// This is for optimizing the second model. It will not have the first model in its graph
loss2 = loss_func2(detector_prediction_copy, label)
I hope this is helpful. If anyone knows how to do this in TF, that would probably be very invaluable.
I'm following the code of a coursera assignment which implements a NER tagger using a bidirectional LSTM.
But I'm not able to understand how the embedding matrix is being updated. In the following code, build_layers has a variable embedding_matrix_variable which acts an input the the LSTM. However it's not getting updated anywhere.
Can you help me understand how embeddings are being trained?
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
embedding_matrix_variable = tf.Variable(initial_embedding_matrix, name='embedding_matrix', dtype=tf.float32)
forward_cell = tf.nn.rnn_cell.DropoutWrapper(
tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
input_keep_prob=self.dropout_ph,
output_keep_prob=self.dropout_ph,
state_keep_prob=self.dropout_ph
)
backward_cell = tf.nn.rnn_cell.DropoutWrapper(
tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
input_keep_prob=self.dropout_ph,
output_keep_prob=self.dropout_ph,
state_keep_prob=self.dropout_ph
)
embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)
(rnn_output_fw, rnn_output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
cell_fw=forward_cell, cell_bw=backward_cell,
dtype=tf.float32,
inputs=embeddings,
sequence_length=self.lengths
)
rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)
self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)
def compute_loss(self, n_tags, PAD_index):
"""Computes masked cross-entopy loss with logits."""
ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
loss_tensor = tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_tags_one_hot, logits=self.logits)
mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
self.loss = tf.reduce_mean(tf.reduce_sum(tf.multiply(loss_tensor, mask), axis=-1) / tf.reduce_sum(mask, axis=-1))
In TensorFlow, variables are not usually updated directly (i.e. by manually setting them to a certain value), but rather they are trained using an optimization algorithm and automatic differentiation.
When you define a tf.Variable, you are adding a node (that maintains a state) to the computational graph. At training time, if the loss node depends on the state of the variable that you defined, TensorFlow will compute the gradient of the loss function with respect to that variable by automatically following the chain rule through the computational graph. Then, the optimization algorithm will make use of the computed gradients to update the values of the trainable variables that took part in the computation of the loss.
Concretely, the code that you provide builds a TensorFlow graph in which the loss self.loss depends on the weights in embedding_matrix_variable (i.e. there is a path between these nodes in the graph), so TensorFlow will compute the gradient with respect to this variable, and the optimizer will update its values when minimizing the loss. It might be useful to inspect the TensorFlow graph using TensorBoard.
I am trying to create a very simple neural network reading in information with the shape 1x2048 and to create a classification for two categories (object or not object). The graph structure however, deviates from what I believe to have coded. The dense layers should be included in the scope of "inner_layer" and should be receiving their input from the "input" placeholder. Instead, TF seems to be treating them as independent layers which do not receive any information from "input".
Also, when using trying to use tensorboard summaries I get an error telling me that I have not mentioned inserting inputs for the apparent placeholders of the dense layers. When omitting tensorboard, everything works as I expected it based on the code.
I have spent a lot of time trying to find the problem but I think I must be overlooking an something very basic.
The graph I get in tensorboard is on this image.
Which corresponds to the following code:
tf.reset_default_graph()
keep_prob = 0.5
# Graph Strcuture
## Placeholders for input
with tf.name_scope('input'):
x_ = tf.placeholder(tf.float32, shape = [None, transfer_values_train.shape[1]], name = "input1")
y_ = tf.placeholder(tf.float32, shape = [None, num_classes], name = "labels")
## Dense Layer one with 2048 nodes
with tf.name_scope('inner_layers'):
first_layer = tf.layers.dense(x_, units = 2048, activation=tf.nn.relu, name = "first_dense")
dropout_layer = tf.nn.dropout(first_layer, keep_prob, name = "dropout_layer")
#readout layer, without softmax
y_conv = tf.layers.dense(dropout_layer, units = 2, activation=tf.nn.relu, name = "second_dense")
# Evaluation and training
with tf.name_scope('cross_entropy'):
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels = y_ , logits = y_conv),
name = "cross_entropy_layer")
with tf.name_scope('trainer'):
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
with tf.name_scope('accuracy'):
prediction = tf.argmax(y_conv, axis = 1)
correct_prediction = tf.equal(prediction, tf.argmax(y_, axis = 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Does anyone have an idea why the graph is so different from what you would expect based on the code?
The graph rendering in tensorboard may be a bit confusing (initially), but it's correct. Take a look at this picture where I've left only the inner_layers part of your graph:
You may notice that:
The first_dense and second_dense are actually the name scopes themselves (generated by tf.layers.dense function; see also this question).
Their input/output tensors are inside the inner_layers scope and wire correctly to the dropout_layer. Here, in each of dense layers, live the corresponding linear ops: MatMul, BiasAdd, Relu.
Both scopes also include the variables (kernel and bias each), that are shown separately from inner_layers. They encapsulate the ops related specifically to variable, such as read, assign, initialize, etc. The linear ops in first_dense depend on the variable ops of first_dense, and second_dense likewise.
The reason for this separation is that in distributed settings the variables are manages by a different task called parameter server. It's usually run on a different device (CPU as opposed to GPU), sometimes even on a different machine. In other words, for tensorflow the variable management is by design different from matrix computation.
Having said that, I'd love to see a mode in tensorflow that would not split the scope into variables and ops and keep them coupled.
Other than this the graph perfectly matches the code.
My training loop is like following,
::pseudocode
define the graph
define session
get_model()
get_optimizers()
for i in range(epoch):
for j in range(num_of_batches):
x,y = get_value()
sess.run() # runs a MLP based on some parameter
sess.run() # runs similar MLP based on some different parameter
---normalize some weight of MLP---
Now I am stuck in the portion of "normalize some weight of MLP"
How should I define my normalization code snippet in my model class? I have tried the following way,
W_trans = tf.Variable(
identity,
name="trans",
dtype=tf.float32)
self.theta_M.append(W_trans)
W_trans = tf.norm(W_trans)
I have also tried to incorporate this by tf.assign() but it can not be used if I want to optimized my model based on the parameter W_trans.