Tensorflow - Access weights while doing backprop

I want to implement C-MWP, as described in https://arxiv.org/pdf/1608.00507.pdf, in Keras/TensorFlow.
This involves modifying the way backprop is performed. The new gradient is a function of the bottom activation responses, the weight parameters, and the gradients of the layer above.
As a start, I was looking at the way keras-vis is doing modified backprop:
def _register_guided_gradient(name):
    if name not in ops._gradient_registry._registry:
        @tf.RegisterGradient(name)
        def _guided_backprop(op, grad):
            dtype = op.outputs[0].dtype
            gate_g = tf.cast(grad > 0., dtype)
            gate_y = tf.cast(op.outputs[0] > 0, dtype)
            return gate_y * gate_g * grad
However, to implement C-MWP I need access to the weights of the layer on which the backprop is performed. Is it possible to access the weights within the @tf.RegisterGradient(name) function? Or am I on the wrong path?

The gradient computation in TF is fundamentally per-operation. If the operation whose gradient you want to change is performed on the weights, or at least the weights are not far from it in the operation graph, you can try finding the weights tensor by walking the graph inside your custom gradient. For example, say you have something like
x = tf.get_variable(...)
y = 5.0 * x
tf.gradients(y, x)
You can get to the variable tensor (more precisely, the tensor produced by the variable reading operation) with something like
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1]
    ...
If the weights are not immediate inputs, but you know how to get to them, you can walk the graph a bit using something like:
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1].op.inputs[0].op.inputs[2]
    ...
You should understand that this solution is very hacky. If you control the forward pass, you might prefer to define a custom gradient just for the subgraph you care about. You can see how to do that in How to register a custom gradient for an operation composed of tf operations, How Can I Define Only the Gradient for a Tensorflow Subgraph?, and https://www.tensorflow.org/api_docs/python/tf/Graph#gradient_override_map
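To make this concrete, here is a minimal, self-contained sketch in the TF1-style graph API (tf.compat.v1 in TF2). It overrides the gradient of the Mul op via gradient_override_map and reads the variable's value from op.inputs inside the registered function. The gating rule is an arbitrary toy modification for illustration, not C-MWP itself:
import tensorflow as tf

@tf.RegisterGradient("WeightGatedMul")
def _weight_gated_mul_grad(op, grad):
    # For y = a * w, op.inputs[0] is `a` and op.inputs[1] is the tensor
    # produced by reading the variable w
    a, w = op.inputs[0], op.inputs[1]
    gate = tf.cast(w > 0., grad.dtype)  # toy rule: pass gradient only where w > 0
    return grad * w * gate, grad * a * gate  # one gradient per input of Mul

g = tf.get_default_graph()
x = tf.get_variable("x", initializer=2.0)
with g.gradient_override_map({"Mul": "WeightGatedMul"}):
    y = 5.0 * x
dx = tf.gradients(y, x)[0]  # computed with the overridden gradient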

Related

Tensorflow 2.2: custom gradient for a vector valued input/output custom layer

I am using TensorFlow 2.2 for my research and would like to implement a custom layer that takes a vector (tensor) input and outputs a vector (tensor). My input/output relation is complicated, and I need to create a function that computes both the forward pass and the gradients. I came across the custom_gradient function, which does the job. Unfortunately, it is totally unclear to me how to use it for vector or tensor inputs and outputs. In particular, I have no idea how I need to return the Jacobian matrix or some form thereof.
As a simple example, let us say that my custom layer computes the vector-matrix product of the input A and the weights W. This is how I would approach the problem (skipping the steps to initialize weights, build, etc. for simplicity).
@tf.custom_gradient
def custom_op(A, W):  # A is of size (#samples, length of input)
    result = tf.matmul(A, W)  # I compute the output tensor.
    def custom_grad(dy):
        grad = ...  # I have no idea what exactly grad stores, mathematically speaking
        return grad
    return result, custom_grad

class CustomLayer(tf.keras.layers.Layer):
    def __init__(self):
        super(CustomLayer, self).__init__()

    def call(self, A):
        return custom_op(A, self.W)  # assuming self.W are the weights
Any help would be greatly appreciated. Thanks!
I don't think you have to define your own custom_grad function if you don't need customized gradient computation.
I would use tf.GradientTape outside of the layer call. The gradient tape watches the trainable variables and the operations performed on them, and tape.gradient then computes the gradients on its own. The following code snippet originates from this tutorial: https://www.tensorflow.org/guide/autodiff.
layer = CustomLayer()
x = tf.constant([[1., 2., 3.]])
with tf.GradientTape() as tape:
    # Forward pass
    y = layer(x)
    loss = tf.reduce_mean(y**2)
# Calculate gradients with respect to every trainable variable
grad = tape.gradient(loss, layer.trainable_variables)
You can apply the computed gradients by calling the apply_gradients function on an optimizer instance to train your model. Note that in TF2 the optimizer lives under tf.keras.optimizers and apply_gradients expects (gradient, variable) pairs:
optimizer = tf.keras.optimizers.Adam(learning_rate)
optimizer.apply_gradients(zip(grad, layer.trainable_variables))
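For completeness, in case a custom gradient is actually needed for the matmul example: custom_grad does not return the full Jacobian. It receives dy, the gradient of the loss with respect to result (same shape as result), and must return one vector-Jacobian product per input, each with that input's shape. A hypothetical sketch mirroring the question's custom_op:
import tensorflow as tf

@tf.custom_gradient
def custom_matmul(A, W):
    result = tf.matmul(A, W)
    def custom_grad(dy):
        # dy has the shape of result; return dL/dA and dL/dW
        grad_A = tf.matmul(dy, W, transpose_b=True)  # dy @ W^T, shape of A
        grad_W = tf.matmul(A, dy, transpose_a=True)  # A^T @ dy, shape of W
        return grad_A, grad_W
    return result, custom_grad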

Constrain parameters to be -1, 0 or 1 in neural network in pytorch

I want to constrain the parameters of an intermediate layer in a neural network to prefer discrete values: -1, 0, or 1. The idea is to add a custom objective function that would increase the loss if the parameters take any other value. Note that, I want to constrain parameters of a particular layer, not all layers.
How can I implement this in pytorch? I want to add this custom loss to the total loss in the training loop, something like this:
custom_loss = constrain_parameters_to_be_discrete
loss = other_loss + custom_loss
Maybe using a Dirichlet prior might help; any pointers on this?
Extending upon @Shai's answer and mixing it with this answer, one can do it more simply via a custom layer into which you pass your specific layer.
First, the derivative of torch.abs(x**2 - torch.abs(x)), taken from WolframAlpha (check here), is placed inside the regularize function.
Now the Constrainer layer:
class Constrainer(torch.nn.Module):
    def __init__(self, module, weight_decay=1.0):
        super().__init__()
        self.module = module
        self.weight_decay = weight_decay
        # Backward hook is registered on the specified module
        self.hook = self.module.register_full_backward_hook(self._weight_decay_hook)
        # Not working with grad accumulation; check the original answer and
        # the pointers there if that's needed

    def _weight_decay_hook(self, *_):
        for parameter in self.module.parameters():
            parameter.grad = self.regularize(parameter)

    def regularize(self, parameter):
        # Derivative of the regularization term created by @Shai
        sgn = torch.sign(parameter)
        return self.weight_decay * (
            (sgn - 2 * parameter) * torch.sign(1 - parameter * sgn)
        )

    def forward(self, *args, **kwargs):
        # Simply forward args and kwargs to the wrapped module
        return self.module(*args, **kwargs)
Usage is really simple (adjust the weight_decay hyperparameter if you need more or less force on the params):
constrained_layer = Constrainer(torch.nn.Linear(20, 10), weight_decay=0.1)
Now you don't have to worry about different loss functions and can use your model normally.
You can use the loss function:
def custom_loss_function(x):
    loss = torch.abs(x**2 - torch.abs(x))
    return loss.mean()
A plot of the proposed loss for a single element shows that it is zero for x = {-1, 0, 1} and positive otherwise.
Note that if you want to apply this loss to the weights of a specific layer, then your x here are the weights, not the activations of the layer.
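For example, assuming a hypothetical setup where the constrained layer is a torch.nn.Linear, the penalty is applied to that layer's weight tensor only:
import torch

layer = torch.nn.Linear(20, 10)      # the layer whose parameters we constrain
x = torch.randn(8, 20)
other_loss = layer(x).pow(2).mean()  # stand-in for the task loss
loss = other_loss + custom_loss_function(layer.weight)
loss.backward()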

pyTorch gradient becomes none when dividing by scalar

Consider the following code block:
import torch

n = 10
x = torch.ones(n, requires_grad=True) / n
y = torch.rand(n)
z = torch.sum(x * y)
z.backward()
print(x.grad)  # results in None
print(y)
As written, x.grad is None. However, if I change the definition of x by removing the scalar division (x = torch.ones(n, requires_grad=True)), then I indeed get a non-None gradient that is equal to y.
I've googled a bunch looking for this issue, and I think it reflects something fundamental that I don't understand about how the computational graph in torch works. I'd love some clarification. Thanks!
When you set x to a tensor divided by some scalar, x is no longer what is called a "leaf" tensor in PyTorch. A leaf tensor is a tensor at the beginning of the computation graph (a DAG whose nodes represent objects such as tensors and whose edges represent mathematical operations). More specifically, it is a tensor that was not created by a computational operation tracked by the autograd engine.
In your example - torch.ones(n, requires_grad=True) is a leaf tensor, but you can't access it directly in your code.
The reasoning behind not keeping the grad for non-leaf tensors is that typically, when you train a network, the weights and biases are leaf tensors and they are what we need the gradient for.
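You can check this distinction directly via the is_leaf attribute; a small illustration:
import torch

n = 10
a = torch.ones(n, requires_grad=True)  # leaf: created directly by the user
x = a / n                              # non-leaf: produced by a tracked operation
print(a.is_leaf, x.is_leaf)            # True False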
If you want to access the gradients of a non-leaf tensor, you should call the retain_grad function, which means in your code you should add:
x.retain_grad()
after the assignment to x.
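With that one line added, the question's snippet produces the expected gradient:
import torch

n = 10
x = torch.ones(n, requires_grad=True) / n
x.retain_grad()  # keep the gradient on this non-leaf tensor
y = torch.rand(n)
z = torch.sum(x * y)
z.backward()
print(x.grad)  # now equal to y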
It is true that you need to retain the grad. However, the easiest correction for this issue is using the torch.div() function.

Compute gradients for each time step of tf.while_loop

Given a TensorFlow tf.while_loop, how can I calculate the gradient of x_out with respect to all weights of the network for each time step?
network_input = tf.placeholder(tf.float32, [None])
steps = tf.constant(0.0)
weight_0 = tf.Variable(1.0)
layer_1 = network_input * weight_0

def condition(steps, x):
    return steps <= 5

def loop(steps, x_in):
    weight_1 = tf.Variable(1.0)
    x_out = x_in * weight_1
    steps += 1
    return [steps, x_out]

_, x_final = tf.while_loop(
    condition,
    loop,
    [steps, layer_1]
)
Some notes
In my network the condition is dynamic; different runs execute the while loop a different number of times.
Calling tf.gradients(x, tf.trainable_variables()) crashes with AttributeError: 'WhileContext' object has no attribute 'pred'. It seems the only possibility of using tf.gradients within the loop is to calculate the gradient with respect to weight_1 and the current value of x_in at the current time step only, without backpropagating through time.
In each time step, the network is going to output a probability distribution over actions. The gradients are then needed for a policy gradient implementation.
You can't call tf.gradients inside tf.while_loop in TensorFlow, based on this and this. I found this out the hard way when I was trying to build conjugate gradient descent entirely inside the TensorFlow graph.
But if I understand your model correctly, you could make your own version of an RNNCell and wrap it in tf.nn.dynamic_rnn, though the actual cell implementation will be a little complex, since you need to evaluate a condition dynamically at runtime.
For starters, you can take a look at Tensorflow's dynamic_rnn code here.
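A hypothetical skeleton of that approach (TF1 API), using the question's x_out = x_in * weight_1 as the cell body; note this fixes the number of steps via the input length, so the dynamic stopping condition would still need separate handling (e.g. via the sequence_length argument):
import tensorflow as tf

class MulCell(tf.nn.rnn_cell.RNNCell):
    @property
    def state_size(self):
        return 1

    @property
    def output_size(self):
        return 1

    def call(self, inputs, state):
        # Loop body: x_out = x_in * weight_1
        weight_1 = tf.get_variable("weight_1", [], initializer=tf.ones_initializer())
        x_out = state * weight_1
        return x_out, x_out  # emit the per-step value and carry it as state

cell = MulCell()
inputs = tf.zeros([1, 6, 1])  # dummy inputs, used only to fix 6 time steps
layer_1 = tf.ones([1, 1])     # initial state, analogous to layer_1 above
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, initial_state=layer_1)
# outputs[:, t, :] is x_out at step t, so gradients can be taken per step:
step_grads = [tf.gradients(outputs[:, t, :], tf.trainable_variables())
              for t in range(6)]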
Alternatively, dynamic graphs have never been TensorFlow's strong suit, so consider using other frameworks like PyTorch, or try out eager execution and see if that helps.
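If you do switch, per-step gradients in PyTorch are straightforward because the loop is ordinary Python; a hypothetical sketch of the same model:
import torch

network_input = torch.randn(4)
weight_0 = torch.ones(1, requires_grad=True)
weight_1 = torch.ones(1, requires_grad=True)

x = network_input * weight_0
per_step_grads = []
step = 0
while step <= 5:  # dynamic condition, evaluated in plain Python
    x = x * weight_1
    # Gradient of this step's output with respect to all weights
    grads = torch.autograd.grad(x.sum(), [weight_0, weight_1], retain_graph=True)
    per_step_grads.append(grads)
    step += 1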

Replacing gradients of Tensorflow function with output of another function

I am using a function consisting of compound Tensorflow operations. However, instead of letting Tensorflow automatically compute its derivatives with respect to one of the inputs, I would like to replace the gradients with a different computation on the same input. Moreover, some of the calculation is shared between the forward and backward pass. For example:
def func(in1, in2):
    # do something with inputs using only tf operations
    shared_rep = tf.op1(tf.op2(tf.op3(in1, in2)))  # same computation for both forward and gradient pass
    # return output of forward computation
    return tf.op4(shared_rep)

def func_grad(in1, in2):
    shared_rep = tf.op1(tf.op2(tf.op3(in1, in2)))
    # explicitly calculate gradients with respect to in1, with the intention of
    # replacing the gradients computed by Tensorflow
    mygrad1 = tf.op5(tf.op6(shared_rep))
    return mygrad1

in1 = tf.Variable([1., 2., 3.])  # float values so gradients are defined
in2 = tf.Variable([2.5, 0.01])
func_val = func(in1, in2)
my_grad1 = func_grad(in1, in2)
tf_grad1 = tf.gradients(func_val, in1)

with tf.Session() as sess:
    # would like tf_grad1 to equal my_grad1
    val, my1, tf1 = sess.run([func_val, my_grad1, tf_grad1])
    tf.assert_equal(my1, tf1)
NOTE: This is similar to question How to replace or modify gradient? with one key difference: I am not interested in Tensorflow computing gradients of a different function in the backward pass; rather I would like to supply the gradients myself based on alternate tensorflow operations on the input.
I am trying to use the ideas proposed in the solution to the above question and in the following post, that is, using tf.RegisterGradient and gradient_override_map to override the gradient of the identity function wrapping the forward function.
This fails because inside the registered alternate grad for identity, I have no access to the input to func_grad:
#tf.RegisterGradient("CustomGrad")
def alternate_identity_grad(op, grad):
# op.inputs[0] is the output of func(in1,in2)
# grad is of no use, because I would like to replace it with func_grad(in1,in2)
g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
out_grad = tf.identity(input, name="Identity")
EDIT After additional research, I believe this question is similar to the following question. I managed to obtain the desired solution by combining gradient_override_map with the hack suggested here.
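For reference, newer TensorFlow versions provide tf.custom_gradient, which sidesteps the identity hack entirely and lets the backward function close over shared_rep so the shared computation is built only once. A hypothetical sketch, with tf.op1 through tf.op6 standing in for the real operations:
@tf.custom_gradient
def func_with_custom_grad(in1, in2):
    shared_rep = tf.op1(tf.op2(tf.op3(in1, in2)))  # built once, reused below
    result = tf.op4(shared_rep)
    def grad(dy):
        mygrad1 = tf.op5(tf.op6(shared_rep))  # replaces TF's gradient w.r.t. in1
        return dy * mygrad1, tf.zeros_like(in2)  # zero gradient for in2 in this sketch
    return result, grad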
