Given a TensorFlow tf.while_loop, how can I calculate the gradient of x_out with respect to all weights of the network for each time step?
network_input = tf.placeholder(tf.float32, [None])
steps = tf.constant(0.0)
weight_0 = tf.Variable(1.0)
layer_1 = network_input * weight_0
def condition(steps, x):
    return steps <= 5

def loop(steps, x_in):
    weight_1 = tf.Variable(1.0)
    x_out = x_in * weight_1
    steps += 1
    return [steps, x_out]

_, x_final = tf.while_loop(
    condition,
    loop,
    [steps, layer_1]
)
Some notes
In my network the condition is dynamic; different runs will execute the while loop a different number of times.
Calling tf.gradients(x, tf.trainable_variables()) crashes with AttributeError: 'WhileContext' object has no attribute 'pred'. It seems the only way to use tf.gradients within the loop is to compute the gradient with respect to weight_1 and the current value of x_in at that time step only, without backpropagating through time.
In each time step, the network is going to output a probability distribution over actions. The gradients are then needed for a policy gradient implementation.
You can't ever call tf.gradients inside tf.while_loop in TensorFlow, based on this and this; I found this out the hard way when I was trying to implement conjugate gradient descent entirely inside the TensorFlow graph.
If I understand your model correctly, though, you could make your own version of an RNNCell and wrap it in tf.nn.dynamic_rnn. The actual cell implementation will be a little complex, since you need to evaluate a condition dynamically at runtime.
For starters, you can take a look at Tensorflow's dynamic_rnn code here.
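As a rough sketch of that idea (TF 1.x; the cell below and its single weight are purely illustrative, not the exact cell you would need), the per-step weight lives inside the cell, tf.nn.dynamic_rnn handles the dynamic length via sequence_length, and tf.gradients can then backpropagate through time from outside the loop:
import tensorflow as tf

class MyCell(tf.nn.rnn_cell.RNNCell):
    """Illustrative cell: combines the input with the carried state and scales it by a trainable weight."""
    def __init__(self, num_units):
        super(MyCell, self).__init__()
        self._num_units = num_units

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def build(self, inputs_shape):
        # one weight reused at every time step (like weight_1 in the question)
        self.weight_1 = self.add_variable("weight_1", shape=[], initializer=tf.ones_initializer())
        self.built = True

    def call(self, inputs, state):
        new_state = (state + inputs) * self.weight_1
        return new_state, new_state

cell = MyCell(num_units=1)
inputs = tf.placeholder(tf.float32, [None, None, 1])   # [batch, max_time, features]
lengths = tf.placeholder(tf.int32, [None])              # per-example number of steps
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=lengths, dtype=tf.float32)

# gradients w.r.t. all weights, backpropagated through the dynamic-length loop
grads = tf.gradients(outputs, tf.trainable_variables())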
Alternatively, dynamic graphs have never been TensorFlow's strong suit, so consider using other frameworks like PyTorch, or try out eager execution and see if that helps.
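For comparison, here is a minimal PyTorch sketch (purely illustrative, mirroring the toy graph above) showing that a plain Python while loop with a runtime condition backpropagates without any special machinery:
import torch

weight_0 = torch.ones(1, requires_grad=True)
weight_1 = torch.ones(1, requires_grad=True)

network_input = torch.tensor([2.0])
x = network_input * weight_0

steps = 0
while steps <= 5:       # the condition can depend on x or anything else computed at runtime
    x = x * weight_1
    steps += 1

x.sum().backward()      # backpropagates through every executed iteration
print(weight_0.grad, weight_1.grad)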
Related
Given a neural network with weights theta and inputs x, I am interested in calculating the partial derivatives of the neural network's output w.r.t. x, so that I can use the result when training the weights theta with a loss that depends both on the output and on these partial derivatives. I figured out how to calculate the partial derivatives following this post. I also found this post that explains how to use sympy to achieve something similar; however, adapting it to a neural network context within pytorch seems like a huge amount of work and a recipe for very slow code.
Thus, I tried something different, which failed. As a minimal example, I created a function (standing in for my neural network)
theta = torch.ones([3], requires_grad=True, dtype=torch.float32)
def trainable_function(time):
    return theta[0]*time**3 + theta[1]*time**2 + theta[2]*time
Then, I defined a second function to give me partial derivatives:
def trainable_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = trainable_function(deriv_time)
    gradient = torch.autograd.grad(fun_value, deriv_time, create_graph=True, retain_graph=True)
    deriv_time.requires_grad = False
    return gradient
Given some noisy observations of the derivatives, I now try to train theta. For simplicity, I create a loss that only depends on the derivatives. In this minimal example, the derivatives are used directly as observations, not as regularization, to avoid complicated loss functions that are besides the point.
def objective(train_times, observations):
    predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times]))
    return torch.sum((predictions - observations)**2)
optimizer = Adam([theta], lr=0.1)
for iteration in range(200):
    optimizer.zero_grad()
    loss = objective(data_times, noisy_targets)
    loss.backward()
    optimizer.step()
Unfortunately, when running this code, I get the error
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I suppose that when calculating the partial derivatives the way I do, I do not really create a computational graph through which autodiff could differentiate. Thus, the connection to the parameters theta somehow gets lost, and it now looks to the optimizer as if the loss is completely independent of theta. However, I could be totally wrong.
Does anyone know how to fix this?
Is it possible to include this type of derivatives in the loss function in pytorch?
And if so, what would be the most pytorch-style way of doing this?
Many thanks for your help and advice, it is much appreciated.
For completeness:
To run the above code, some training data needs to be generated. I used the following code, which works perfectly and has been tested against the analytical derivatives:
true_a = 1
true_b = 1
true_c = 1
def true_function(time):
    return true_a*time**3 + true_b*time**2 + true_c*time

def true_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = true_function(deriv_time)
    return torch.autograd.grad(fun_value, deriv_time)
data_times = torch.linspace(0, 1, 500)
true_targets = torch.squeeze(torch.tensor([true_derivative(a) for a in data_times]))
noisy_targets = torch.tensor(true_targets) + torch.randn_like(true_targets)*0.1
Your approach to the problem appears overly complicated.
I believe that what you're trying to achieve is within reach in PyTorch.
I include here a simple code snippet that I believe showcases what you would like to do:
import torch
import torch.nn as nn
# Data and Function
torch.manual_seed(0)
input_dim = 1
output_dim = 2
n = 10 # batchsize
simple_function = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid())
t = (torch.arange(n).float() / n).view(n, 1)
x = torch.randn(n, output_dim)
t.requires_grad = True
# Actual computation
xhat = simple_function(t)
jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
grad = jac[torch.arange(n),:,torch.arange(n),0]
loss = (x - xhat).pow(2).sum() + grad.pow(2).sum()
loss.backward()
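For what it's worth, a quick shape check (an illustrative addition continuing the snippet above, not part of the original answer) makes the indexing clearer: the Jacobian of a map from (n, 1) inputs to (n, 2) outputs has shape (n, 2, n, 1), and the fancy indexing keeps only the per-sample block d xhat[i] / d t[i]:
print(jac.shape)    # torch.Size([10, 2, 10, 1])
print(grad.shape)   # torch.Size([10, 2]), one gradient row per sample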
I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less than or equal to 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, TensorFlow (and Keras) doesn't allow you to add hard constraints to your model.
You have to convert your invariant (norm <= 1) into a penalty function that is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, 0)
total_loss = mse + norm_loss
Look at the docs of tf.where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal to one, this part of the loss is simply 0 and no gradient is produced.
But this can be very hard to optimize: your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000 * norm_loss. Be very careful with this, it can make optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target))
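Putting the penalty idea together, here is a sketch of how this could look as a custom Keras loss (the function name and penalty_weight are made up for illustration; plug it into whatever model you compile):
import tensorflow as tf

def mse_with_norm_penalty(y_true, y_pred, penalty_weight=10.0):
    """MSE plus a penalty that is zero while ||y_pred|| <= 1."""
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    y_norm = tf.norm(y_pred, axis=-1)
    norm_loss = tf.reduce_mean(tf.where(y_norm > 1.0, y_norm, tf.zeros_like(y_norm)))
    return mse + penalty_weight * norm_loss

# model.compile(optimizer='adam', loss=mse_with_norm_penalty)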
I want to implement C-MWP as described here: https://arxiv.org/pdf/1608.00507.pdf in keras/tensorflow.
This involves modifying the way backprop is performed. The new gradient is a function of the bottom activation responses, the weight parameters, and the gradients of the layer above.
As a start, I was looking at the way keras-vis is doing modified backprop:
def _register_guided_gradient(name):
    if name not in ops._gradient_registry._registry:
        @tf.RegisterGradient(name)
        def _guided_backprop(op, grad):
            dtype = op.outputs[0].dtype
            gate_g = tf.cast(grad > 0., dtype)
            gate_y = tf.cast(op.outputs[0] > 0, dtype)
            return gate_y * gate_g * grad
However, to implement C-MWP I need access to the weights of the layer on which the backprop is performed. Is it possible to access the weights within the @tf.RegisterGradient(name) function? Or am I on the wrong path?
The gradient computation in TF is fundamentally per-operation. If the operation whose gradient you want to change is performed on the weights, or at least the weights are not far from it in the operation graph, you can try finding the weights tensor by walking the graph inside your custom gradient. For example, say you have something like
x = tf.get_variable(...)
y = 5.0 * x
tf.gradients(y, x)
You can get to the variable tensor (more precisely, the tensor produced by the variable reading operation) with something like
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1]
    ...
If the weights are not immediate inputs, but you know how to get to them, you can walk the graph a bit using something like:
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1].op.inputs[0].op.inputs[2]
    ...
You should understand that this solution is very hacky. If you control the forward pass, you might prefer to define a custom gradient just for the subgraph you care about. You can see how to do that in How to register a custom gradient for an operation composed of tf operations, How Can I Define Only the Gradient for a Tensorflow Subgraph?, and https://www.tensorflow.org/api_docs/python/tf/Graph#gradient_override_map
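As a hedged sketch of the gradient_override_map route (TF 1.x; the name "CustomMatMulGrad" and the tiny graph are illustrative), the custom gradient receives the weight tensor as op.inputs[1] and is free to modify the returned gradients using it:
import tensorflow as tf

@tf.RegisterGradient("CustomMatMulGrad")
def _custom_matmul_grad(op, grad):
    x = op.inputs[0]   # bottom activations
    w = op.inputs[1]   # the layer's weights, available inside the gradient
    grad_x = tf.matmul(grad, w, transpose_b=True)   # standard MatMul gradients, which you
    grad_w = tf.matmul(x, grad, transpose_a=True)   # can now rescale or gate using x, w, grad
    return grad_x, grad_w

g = tf.get_default_graph()
x = tf.placeholder(tf.float32, [None, 3])
w = tf.get_variable("w", shape=[3, 2])
with g.gradient_override_map({"MatMul": "CustomMatMulGrad"}):
    y = tf.matmul(x, w)
dy_dx, dy_dw = tf.gradients(tf.reduce_sum(y), [x, w])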
Due to limited GPU memory, I want to update my weights only after every two training steps. Specifically, the network should first run the first batch of inputs and save the loss, then run the next batch, average the two losses, and update the weights once. This is like the average_loss option in Caffe, for example fcn-berkeley. Also, how should the batch-norm update ops be handled in this case?
Easy, just use tf.reduce_mean(input_tensor).
TF documentation: reduce_mean
and in your case, it would be:
loss = tf.concat([loss1, loss2], axis=0)
final_loss = tf.reduce_mean(loss, axis=0)
Please check this thread for correct info on Caffe's average_loss.
You should be able to compute an averaged loss by subclassing LoggingTensorHook in a way like
import logging
import numpy as np
import tensorflow as tf

class MyLoggingTensorHook(tf.train.LoggingTensorHook):

    # set every_n_iter to 2 if you want to average the last 2 losses
    def __init__(self, tensors, every_n_iter):
        super().__init__(tensors=tensors, every_n_iter=every_n_iter)
        # keep track of previous losses
        self.losses = []

    def after_run(self, run_context, run_values):
        _ = run_context
        # assuming you have a tag like 'average_loss'
        # as the name of your loss tensor
        for tag in self._tag_order:
            if 'average_loss' in tag:
                self.losses.append(run_values.results[tag])
        if self._should_trigger:
            self._log_tensors(run_values.results)
        self._iter_count += 1

    def _log_tensors(self, tensor_values):
        original = np.get_printoptions()
        np.set_printoptions(suppress=True)
        logging.info("%s = %s" % ('average_loss', np.mean(self.losses)))
        np.set_printoptions(**original)
        self.losses = []
and attach it to an estimator's train method or use a TrainSpec.
You should be able to compute the gradients of your variables normally in every step, but apply them only every N steps by conditioning on a global_step variable that tracks your current iteration (you should have initialized this variable in your graph with something like global_step = tf.train.get_or_create_global_step()). Please see the usage of compute_gradients and apply_gradients for this; a sketch follows below.
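A rough sketch of that idea (TF 1.x; the toy model and the accumulator names are illustrative): compute and accumulate gradients on every batch, and only apply the averaged gradients every second batch:
import tensorflow as tf

# toy model standing in for your network
x = tf.placeholder(tf.float32, [None, 4])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(pred - y))

global_step = tf.train.get_or_create_global_step()
optimizer = tf.train.AdamOptimizer(1e-3)
tvars = tf.trainable_variables()
grads_and_vars = optimizer.compute_gradients(loss, tvars)

# one non-trainable accumulator per variable
accum_vars = [tf.Variable(tf.zeros_like(v.initialized_value()), trainable=False) for v in tvars]
accum_ops = [a.assign_add(g) for a, (g, _) in zip(accum_vars, grads_and_vars)]
zero_ops = [a.assign(tf.zeros_like(a)) for a in accum_vars]

# average over the two accumulated batches, then apply once
apply_op = optimizer.apply_gradients(
    [(a / 2.0, v) for a, v in zip(accum_vars, tvars)], global_step=global_step)

# training loop: run accum_ops on every batch; every second batch run
# sess.run(apply_op) followed by sess.run(zero_ops)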
I'm following the code of a coursera assignment which implements a NER tagger using a bidirectional LSTM.
But I'm not able to understand how the embedding matrix is being updated. In the following code, build_layers has a variable embedding_matrix_variable which acts as an input to the LSTM. However, it's not getting updated anywhere.
Can you help me understand how embeddings are being trained?
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
    initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
    embedding_matrix_variable = tf.Variable(initial_embedding_matrix, name='embedding_matrix', dtype=tf.float32)

    forward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph
    )
    backward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph
    )

    embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)
    (rnn_output_fw, rnn_output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw=forward_cell, cell_bw=backward_cell,
        dtype=tf.float32,
        inputs=embeddings,
        sequence_length=self.lengths
    )
    rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)
    self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)

def compute_loss(self, n_tags, PAD_index):
    """Computes masked cross-entropy loss with logits."""
    ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_tags_one_hot, logits=self.logits)
    mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
    self.loss = tf.reduce_mean(tf.reduce_sum(tf.multiply(loss_tensor, mask), axis=-1) / tf.reduce_sum(mask, axis=-1))
In TensorFlow, variables are not usually updated directly (i.e. by manually setting them to a certain value), but rather they are trained using an optimization algorithm and automatic differentiation.
When you define a tf.Variable, you are adding a node (that maintains a state) to the computational graph. At training time, if the loss node depends on the state of the variable that you defined, TensorFlow will compute the gradient of the loss function with respect to that variable by automatically following the chain rule through the computational graph. Then, the optimization algorithm will make use of the computed gradients to update the values of the trainable variables that took part in the computation of the loss.
Concretely, the code that you provide builds a TensorFlow graph in which the loss self.loss depends on the weights in embedding_matrix_variable (i.e. there is a path between these nodes in the graph), so TensorFlow will compute the gradient with respect to this variable, and the optimizer will update its values when minimizing the loss. It might be useful to inspect the TensorFlow graph using TensorBoard.
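As a quick illustration (a sketch reusing self.loss from the code above and assuming an optimizer such as Adam), you can verify that embedding_matrix receives a gradient and is updated by the training op:
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(self.loss)   # updates every trainable variable, the embedding matrix included

# inspect which variables get a gradient from the loss
for grad, var in optimizer.compute_gradients(self.loss):
    print(var.name, 'receives a gradient:', grad is not None)   # True for 'embedding_matrix:0'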