I'm following the code of a Coursera assignment that implements an NER tagger using a bidirectional LSTM.
But I can't work out how the embedding matrix is being updated. In the following code, build_layers has a variable embedding_matrix_variable which acts as an input to the LSTM. However, it does not appear to be updated anywhere.
Can you help me understand how embeddings are being trained?
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
    initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
    embedding_matrix_variable = tf.Variable(initial_embedding_matrix, name='embedding_matrix', dtype=tf.float32)
    forward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph
    )
    backward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph
    )
    embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)
    (rnn_output_fw, rnn_output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw=forward_cell, cell_bw=backward_cell,
        dtype=tf.float32,
        inputs=embeddings,
        sequence_length=self.lengths
    )
    rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)
    self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)
def compute_loss(self, n_tags, PAD_index):
    """Computes masked cross-entropy loss with logits."""
    ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_tags_one_hot, logits=self.logits)
    mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
    self.loss = tf.reduce_mean(tf.reduce_sum(tf.multiply(loss_tensor, mask), axis=-1) / tf.reduce_sum(mask, axis=-1))
In TensorFlow, variables are not usually updated directly (i.e. by manually setting them to a certain value), but rather they are trained using an optimization algorithm and automatic differentiation.
When you define a tf.Variable, you are adding a node (that maintains a state) to the computational graph. At training time, if the loss node depends on the state of the variable that you defined, TensorFlow will compute the gradient of the loss function with respect to that variable by automatically following the chain rule through the computational graph. Then, the optimization algorithm will make use of the computed gradients to update the values of the trainable variables that took part in the computation of the loss.
Concretely, the code that you provide builds a TensorFlow graph in which the loss self.loss depends on the weights in embedding_matrix_variable (i.e. there is a path between these nodes in the graph), so TensorFlow will compute the gradient with respect to this variable, and the optimizer will update its values when minimizing the loss. It might be useful to inspect the TensorFlow graph using TensorBoard.
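To make that dependency concrete, here is a minimal pure-Python sketch of what an optimizer step effectively does to an embedding matrix. It is not the assignment's code and does not use TensorFlow. The toy loss (sum of squares of the looked-up rows), the learning rate, and the names lookup, loss_fn, grad_wrt_embedding and sgd_step are all assumptions made for illustration. Only the rows that were looked up affect the loss, so only those rows receive a nonzero gradient and change.

```python
def lookup(embedding, token_ids):
    """Toy stand-in for an embedding lookup: select rows by index."""
    return [embedding[i] for i in token_ids]


def loss_fn(embedding, token_ids):
    """Toy loss (an assumption): sum of squared entries of the looked-up rows."""
    return sum(v * v for row in lookup(embedding, token_ids) for v in row)


def grad_wrt_embedding(embedding, token_ids):
    """Analytic gradient of loss_fn with respect to every embedding entry.

    Rows that were never looked up do not influence the loss, so their
    gradient stays zero. A row looked up k times accumulates k contributions.
    """
    grad = [[0.0] * len(row) for row in embedding]
    for i in token_ids:
        for j, v in enumerate(embedding[i]):
            grad[i][j] += 2.0 * v
    return grad


def sgd_step(embedding, token_ids, lr=0.1):
    """One plain gradient-descent update; returns a new matrix."""
    grad = grad_wrt_embedding(embedding, token_ids)
    return [[v - lr * g for v, g in zip(row, grad_row)]
            for row, grad_row in zip(embedding, grad)]


embedding = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # vocabulary of 3, dimension 2
token_ids = [0, 2]                                 # this batch uses rows 0 and 2
updated = sgd_step(embedding, token_ids)
```

In this sketch row 1 is left untouched because no token in the batch refers to it, while rows 0 and 2 move in the direction that lowers the loss. In the real model the gradient arrives through the LSTM and the dense layer rather than from a hand-written formula, but the update to the embedding rows follows the same pattern.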
Related
I am trying to implement a neural network in PyTorch to solve an ordinary differential equation (ODE). The network architecture is straightforward. It is just a feed-forward neural network with n inputs and outputs and k layers.
class PINN(torch.nn.Module):
    def __init__(self,n):
        super().__init__()
        # Layers
        self.L1=torch.nn.Linear(1,n)
        self.L2=torch.nn.Linear(n,n)
        self.L3=torch.nn.Linear(n,1)
        # Activation functions
        self.t=torch.nn.Tanh()
        self.r=torch.nn.ReLU()
    def forward(self,x):
        a1=self.r(self.L1(x))
        a2=self.r(self.L2(a1))
        a3=self.r(self.L3(a2))+x
        return a3
I want to minimize the loss between the gradient of the output of my neural network and the right-hand side of the ODE. I have chosen to work with a mean-squared error loss. I know that PyTorch includes a built-in MSE loss. However, I defined my own loss function since I have to pass in the gradient of a tensor.
def ODELoss(x,y,x0):
    # Number of collocation points to sample
    n=len(x)
    # Initialize the loss to zero
    loss=torch.tensor(0.,requires_grad=True)
    # Loop over the "data". Technically, this is an unsupervised problem.
    # The "data" are points sampled on the domain which are then evaluated
    # according to the ODE.
    for (xx,yy) in zip(x,y):
        xx=torch.tensor([[xx]],requires_grad=True)
        yy=torch.tensor([[yy]])
        g(xx,x0).backward()
        dg=xx.grad.clone().requires_grad_(True)
        loss=loss+(dg-yy)**2
    loss=loss/n
    return loss
Here, g(x) is called the universal predictor. It is used in the literature to account for the initial condition(s).
def g(x,x0):
    return x*model(x)+x0
This doesn't seem to work because it seems like I am not passing in the gradient of the output correctly. Can anyone give me some guidance on how to do this?
I want to visualize the patterns that a given feature map in a CNN has learned (in this example I'm using VGG16). To do so, I create a random image, feed it through the network up to the desired convolutional layer, choose the feature map, and compute the gradients with respect to the input. The idea is to change the input in a way that maximizes the activation of the desired feature map. Using TensorFlow 2.0, I have a GradientTape that follows the function and then computes the gradient. However, the gradient comes back as None. Why is it unable to compute the gradient?
import tensorflow as tf
import matplotlib.pyplot as plt
import time
import numpy as np
from tensorflow.keras.applications import vgg16
class maxFeatureMap():
    def __init__(self, model):
        self.model = model
        self.optimizer = tf.keras.optimizers.Adam()
    def getNumLayers(self, layer_name):
        for layer in self.model.layers:
            if layer.name == layer_name:
                weights = layer.get_weights()
                num = weights[1].shape[0]
                return ("There are {} feature maps in {}".format(num, layer_name))
    def getGradient(self, layer, feature_map):
        pic = vgg16.preprocess_input(np.random.uniform(size=(1,96,96,3))) ## Creates values between 0 and 1
        pic = tf.convert_to_tensor(pic)
        model = tf.keras.Model(inputs=self.model.inputs,
                               outputs=self.model.layers[layer].output)
        with tf.GradientTape() as tape:
            ## predicts the output of the model and only chooses the feature_map indicated
            predictions = model.predict(pic, steps=1)[0][:,:,feature_map]
            loss = tf.reduce_mean(predictions)
        print(loss)
        gradients = tape.gradient(loss, pic[0])
        print(gradients)
        self.optimizer.apply_gradients(zip(gradients, pic))
model = vgg16.VGG16(weights='imagenet', include_top=False)
x = maxFeatureMap(model)
x.getGradient(1, 24)
This is a common pitfall with GradientTape; the tape only traces tensors that are set to be "watched" and by default tapes will watch only trainable variables (meaning tf.Variable objects created with trainable=True). To watch the pic tensor, you should add tape.watch(pic) as the very first line inside the tape context.
Also, I'm not sure if the indexing (pic[0]) will work, so you might want to remove that -- since pic has just one entry in the first dimension it shouldn't matter anyway.
Furthermore, you cannot use model.predict because this returns a numpy array, which basically "destroys" the computation graph chain so gradients won't be backpropagated. You should simply use the model as a callable, i.e. predictions = model(pic).
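The point about the "destroyed" chain can be illustrated without TensorFlow. The sketch below is a toy reverse-mode autodiff scalar, written only for this explanation. The class Value and everything in it are hypothetical and are not how GradientTape is implemented. It shows that a result built entirely from tracked objects can be differentiated, whereas pulling out the raw number and wrapping it again (analogous to model.predict handing back a NumPy array) leaves no path back to the input.

```python
class Value:
    """Tiny reverse-mode autodiff scalar (a toy, not TensorFlow)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        # Each entry is (parent_node, local derivative d(self)/d(parent)).
        self.parents = parents

    def __mul__(self, other):
        return Value(self.data * other.data,
                     parents=((self, other.data), (other, self.data)))

    def __add__(self, other):
        return Value(self.data + other.data,
                     parents=((self, 1.0), (other, 1.0)))

    def backward(self):
        """Accumulate d(self)/d(node) into .grad for every reachable node."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Reverse topological order: a node's grad is final before it is used.
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local


w = Value(3.0)

# Connected: the loss is built from Value objects all the way through.
x = Value(2.0)
loss = x * w + x              # d(loss)/dx = w + 1
loss.backward()
connected_grad = x.grad

# Disconnected: extracting the raw float and re-wrapping it creates a new
# leaf node that has no recorded link back to x2.
x2 = Value(2.0)
hidden = x2 * w
detached = Value(hidden.data)  # the history is lost here
loss2 = detached + detached
loss2.backward()
disconnected_grad = x2.grad
```

In this toy the disconnected input simply keeps a gradient of zero. TensorFlow reports the same situation by returning None from tape.gradient, which is the symptom described in the question.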
Did you define your own loss function? Did you convert a tensor to a NumPy array in your loss function?
As a beginner, I ran into the same problem:
When using tape.gradient(loss, variables), the result was None because I converted a tensor to a NumPy array in my own loss function. It seems to be a silly but common mistake for beginners.
FYI: When GradientTape is not working, the cause may also be a TensorFlow issue. As part of troubleshooting, it is worth checking the TensorFlow GitHub issue tracker to see whether the functions you are using have known problems, for example:
Gradients do not exist for variables after tf.concat(). #37726.
Given a TensorFlow tf.while_loop, how can I calculate the gradient of x_out with respect to all weights of the network for each time step?
network_input = tf.placeholder(tf.float32, [None])
steps = tf.constant(0.0)
weight_0 = tf.Variable(1.0)
layer_1 = network_input * weight_0
def condition(steps, x):
    return steps <= 5
def loop(steps, x_in):
    weight_1 = tf.Variable(1.0)
    x_out = x_in * weight_1
    steps += 1
    return [steps, x_out]
_, x_final = tf.while_loop(
    condition,
    loop,
    [steps, layer_1]
)
Some notes
In my network the condition is dynamic. Different runs are going to execute the while loop a different number of times.
Calling tf.gradients(x, tf.trainable_variables()) crashes with AttributeError: 'WhileContext' object has no attribute 'pred'. It seems like the only possibility to use tf.gradients within the loop is to calculate the gradient with respect to weight_1 and the current value of x_in / time step only without backpropagating through time.
In each time step, the network is going to output a probability distribution over actions. The gradients are then needed for a policy gradient implementation.
You can't ever call tf.gradients inside tf.while_loop in TensorFlow, based on this and this. I found this out the hard way when I was trying to build conjugate gradient descent entirely inside the TensorFlow graph.
But if I understand your model correctly, you could make your own version of an RNNCell and wrap it in a tf.nn.dynamic_rnn, but the actual cell implementation will be a little complex since you need to evaluate a condition dynamically at runtime.
For starters, you can take a look at TensorFlow's dynamic_rnn code here.
Alternatively, dynamic graphs have never been TensorFlow's strong suit, so consider using other frameworks like PyTorch, or try out eager execution and see if that helps.
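As a framework-free illustration of the dynamic-loop idea, the sketch below carries the derivative along with the value through an ordinary Python loop. This is forward-mode accumulation, a different technique from calling tf.gradients, and it is not a drop-in replacement for the question's graph. It assumes the loop body multiplies by one shared scalar weight and that the starting value does not depend on that weight. The function name and the stopping rule are hypothetical.

```python
def run_loop_with_gradients(x0, weight, max_steps):
    """Iterate x <- x * weight and record d(x_t)/d(weight) at every step."""
    x, dx_dw = x0, 0.0          # assumption: x0 does not depend on weight
    per_step = []
    step = 0
    while step < max_steps:     # any condition evaluated at runtime could go here
        # Product rule: d(x * w)/dw = (dx/dw) * w + x
        dx_dw = dx_dw * weight + x
        x = x * weight
        step += 1
        per_step.append((step, x, dx_dw))
    return per_step


history = run_loop_with_gradients(x0=2.0, weight=3.0, max_steps=3)
```

Each entry of history holds the step index, the value x_t, and its derivative with respect to the weight at that step. For this body they agree with the closed form t * x0 * weight**(t - 1). A policy-gradient setup would need the derivative of a log-probability rather than of x_t itself, which this sketch does not attempt.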
My training loop looks like the following:
::pseudocode
define the graph
define session
get_model()
get_optimizers()
for i in range(epoch):
    for j in range(num_of_batches):
        x, y = get_value()
        sess.run()  # runs an MLP based on some parameter
        sess.run()  # runs a similar MLP based on a different parameter
        ---normalize some weight of MLP---
Now I am stuck on the "normalize some weight of MLP" step.
How should I define the normalization code in my model class? I have tried the following:
W_trans = tf.Variable(
    identity,
    name="trans",
    dtype=tf.float32)
self.theta_M.append(W_trans)
W_trans = tf.norm(W_trans)
I have also tried to do this with tf.assign(), but that cannot be used if I want to optimize my model with respect to the parameter W_trans.
This thread comes close: What is the purpose of weights and biases in tensorflow word2vec example?
But I am still missing something from my interpretation of this: https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
From what I understand, you feed the network the indices of target and context words from your dictionary.
_, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
average_loss += loss_val
The batch inputs are then looked up to return the vectors that are randomly generated at the beginning
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
Then an optimizer adjusts the weights and biases to best predict the label as opposed to num_sampled random alternatives
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
My questions are as follows:
Where does the embeddings variable get updated? It appears to me that I could get the final result either by running the index of a word through the neural network, or by just taking the final_embeddings vectors and using those. But I do not understand where embeddings is ever changed from its random initialization.
If I were to draw this computation graph, what would it look like (or better yet, what is the best way to actually do so)?
Is this running all of the context/target pairs in the batch at once? Or one by one?
Embeddings: embeddings is a variable. It gets updated every time you do backprop (that is, each time you run the optimizer op together with the loss).
Graph: Did you try saving the graph and displaying it in TensorBoard? Is this what you're looking for?
Batching: At least in the example you linked, the author does batch processing using the function at line 96. https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/word2vec/word2vec_basic.py#L96
Please correct me if I misunderstood your question.
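To picture what "batch processing" means for the third question, here is a simplified pure-Python sketch of turning a sequence of word indices into (target, context) pairs and grouping them into batches. It is not the tutorial's batching function, which samples context words differently. The function names, the window size and the batch size are assumptions chosen for illustration.

```python
def skipgram_pairs(token_ids, window):
    """Return (target, context) index pairs within `window` positions."""
    pairs = []
    for pos, target in enumerate(token_ids):
        lo = max(0, pos - window)
        hi = min(len(token_ids), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                pairs.append((target, token_ids[ctx_pos]))
    return pairs


def batches(pairs, batch_size):
    """Split pairs into consecutive chunks; each chunk feeds one training step."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


token_ids = [10, 11, 12, 13]                 # hypothetical word indices
pairs = skipgram_pairs(token_ids, window=1)
steps = batches(pairs, batch_size=4)
```

Each element of steps corresponds to one session.run call in the question's training loop: all pairs in that chunk are fed at once, and the optimizer applies a single update computed from the whole chunk.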