Replacing gradients of Tensorflow function with output of another function - python

I am using a function consisting of compound Tensorflow operations. However, instead of letting Tensorflow automatically compute its derivatives with respect to one of the inputs, I would like to replace the gradients with a different computation on the same input. Moreover, some of the calculation is shared between the forward and backward pass. For example:
def func(in1, in2):
    # do something with the inputs using only tf operations
    shared_rep = tf.op1(tf.op2(tf.op3(in1, in2)))  # same computation for both the forward and the gradient pass
    # return the output of the forward computation
    return tf.op4(shared_rep)

def func_grad(in1, in2):
    shared_rep = tf.op1(tf.op2(tf.op3(in1, in2)))
    # explicitly calculate the gradient with respect to in1, intended to replace the gradient computed by Tensorflow
    mygrad1 = tf.op5(tf.op6(shared_rep))
    return mygrad1
in1 = tf.Variable([1., 2., 3.])  # float dtype, so gradients are defined
in2 = tf.Variable([2.5, 0.01])
func_val = func(in1, in2)
my_grad1 = func_grad(in1, in2)
tf_grad1 = tf.gradients(func_val, in1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # would like tf_grad1 to equal my_grad1
    val, my1, tf1 = sess.run([func_val, my_grad1, tf_grad1])
    np.testing.assert_array_equal(my1, tf1[0])  # tf.gradients returns a list
NOTE: This is similar to question How to replace or modify gradient? with one key difference: I am not interested in Tensorflow computing gradients of a different function in the backward pass; rather I would like to supply the gradients myself based on alternate tensorflow operations on the input.
I am trying to use the ideas proposed in the solution to the above question and in the following post, that is, using tf.RegisterGradient and gradient_override_map to override the gradient of the identity function wrapping the forward function.
This fails because inside the registered alternate grad for identity, I have no access to the input to func_grad:
#tf.RegisterGradient("CustomGrad")
def alternate_identity_grad(op, grad):
# op.inputs[0] is the output of func(in1,in2)
# grad is of no use, because I would like to replace it with func_grad(in1,in2)
g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
out_grad = tf.identity(input, name="Identity")
EDIT After additional research, I believe this question is similar to the following question. I managed to obtain the desired solution by combining gradient_override_map with the hack suggested here.
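For completeness, here is a sketch of the kind of solution this converges to, written with tf.custom_gradient (available in later 1.x releases) instead of the identity hack. The concrete ops below stand in for the placeholder tf.op1…tf.op6 above, so treat it as an illustration of the pattern, not the exact computation:

import tensorflow as tf

@tf.custom_gradient
def func_with_replaced_grad(in1, in2):
    # stand-in for tf.op1(tf.op2(tf.op3(in1, in2))), shared by both passes
    shared_rep = tf.tanh(in1 * in2)
    out = tf.reduce_sum(shared_rep)  # stand-in for tf.op4

    def grad(dy):
        # supply our own gradient w.r.t. in1 instead of Tensorflow's autodiff,
        # reusing shared_rep from the forward pass
        mygrad1 = tf.exp(shared_rep)  # stand-in for tf.op5(tf.op6(shared_rep))
        return dy * mygrad1, tf.zeros_like(in2)  # gradient for in2 not modelled here

    return out, grad

in1 = tf.constant([1.0, 2.0, 3.0])
in2 = tf.constant([2.0, 0.5, 1.0])
func_val = func_with_replaced_grad(in1, in2)
tf_grad1 = tf.gradients(func_val, in1)  # now evaluates grad() above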

Related

Tensorflow 2.3 / Keras - Gradient Function for a custom upsampling layer?

I wrote a function that performs upsampling in a different way from Upsampling2D. I wrapped this function in tf.py_function, and wrapped that in a Lambda layer to avoid the "ValueError: Unknown graph. Aborting." error.
However, py_function needs a gradient function to work, because otherwise it returns None during backpropagation, which surfaces as:
Invalid argument: Inputs to operation gradient_tape/SegNet/activation_23/ReluGrad of type ReluGrad must have the same size and shape. Input 0: [] != input 1: [16,16,32,32]
conv_24 = Activation("relu")(conv_24)
unpool_5 = Lambda(unpoolWrapper)(conv_24)
unpool_5 = Lambda(adjustShape, arguments={'oldshape': conv_24.shape})(unpool_5)
conv_25 = Conv2D(16, (5, 5), kernel_initializer='he_normal', padding='same',
                 data_format='channels_first')(unpool_5)
The problem is inside the first Lambda:
def unpoolWrapper(tensor):
    return py_func(unpool, [tensor], Tout=tf.float32)
I'm following this code to fix the gradient problem. However, I'm inexperienced and don't know how the gradient is computed for this kind of layer; it just performs an upsampling operation, with no parameters to learn. Can anybody suggest a gradient function for this? Thanks
import numpy as np
import tensorflow as tf

def py_func(func, inp, Tout, grad=None):
    # Need to generate a unique name to avoid duplicates:
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, 1E+8))
    tf.RegisterGradient(rnd_name)(grad)  # see _MySquareGrad for a grad example
    g = tf.compat.v1.get_default_graph()
    with g.gradient_override_map({"PyFunc": rnd_name}):
        # tf.compat.v1.py_func creates an op of type "PyFunc", which the
        # override above matches; tf.py_function would create "EagerPyFunc".
        return tf.compat.v1.py_func(func, inp, Tout)
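As for the gradient function itself: assuming unpool performs plain 2× nearest-neighbour upsampling (each input pixel copied to a 2×2 output block, channels_first layout as in your Conv2D), the backward pass just has to sum the incoming gradient over each 2×2 block, since every input pixel contributed equally to its block. A sketch under those assumptions:

import tensorflow as tf

def unpool_grad(op, grad):
    # grad arrives with the upsampled shape (N, C, 2H, 2W); each input pixel
    # fed a 2x2 output block, so its gradient is the sum over that block.
    grad_nhwc = tf.transpose(grad, [0, 2, 3, 1])  # NCHW -> NHWC (CPU pooling wants NHWC)
    block_sum = tf.nn.avg_pool2d(grad_nhwc, ksize=2, strides=2,
                                 padding='VALID') * 4.0  # mean * 4 = sum over 2x2
    return tf.transpose(block_sum, [0, 3, 1, 2])  # back to NCHW

# then hook it into the helper above:
# return py_func(unpool, [tensor], Tout=tf.float32, grad=unpool_grad)

If your unpool uses a different scale factor or interpolation, the reduction has to change accordingly (e.g. sum over k×k blocks for factor k).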

Tensorflow - Access weights while doing backprop

I want to implement C-MWP as described here: https://arxiv.org/pdf/1608.00507.pdf in keras/tensorflow.
This involves modifying the way backprop is performed. The new gradient is a function of the bottom activation responses, the weight parameters, and the gradients of the layer above.
As a start, I was looking at the way keras-vis is doing modified backprop:
from tensorflow.python.framework import ops

def _register_guided_gradient(name):
    if name not in ops._gradient_registry._registry:
        @tf.RegisterGradient(name)
        def _guided_backprop(op, grad):
            dtype = op.outputs[0].dtype
            gate_g = tf.cast(grad > 0., dtype)
            gate_y = tf.cast(op.outputs[0] > 0, dtype)
            return gate_y * gate_g * grad
However, to implement C-MWP I need access to the weights of the layer on which the backprop is performed. Is it possible to access the weights within the @tf.RegisterGradient(name) function? Or am I on the wrong path?
The gradient computation in TF is fundamentally per-operation. If the operation whose gradient you want to change is performed on the weights, or at least the weights are not far from it in the operation graph, you can try finding the weights tensor by walking the graph inside your custom gradient. For example, say you have something like
x = tf.get_variable(...)
y = 5.0 * x
tf.gradients(y, x)
You can get to the variable tensor (more precisely, the tensor produced by the variable reading operation) with something like
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1]
    ...
If the weights are not immediate inputs, but you know how to get to them, you can walk the graph a bit using something like:
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1].op.inputs[0].op.inputs[2]
    ...
You should understand that this solution is very hacky. If you control the forward pass, you might want to define a custom gradient just for the subgraph you care about. You can see how you can do that in How to register a custom gradient for a operation composed of tf operations
and How Can I Define Only the Gradient for a Tensorflow Subgraph? and https://www.tensorflow.org/api_docs/python/tf/Graph#gradient_override_map
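A runnable sketch of the pattern described above (TF1-style graph mode; the registry name and the toy graph are arbitrary), overriding Mul's gradient so the custom rule can read the weight tensor from op.inputs:

import tensorflow as tf

@tf.RegisterGradient("MulAccessWeights")  # arbitrary registry name
def _my_grad(op, grad):
    const, weights = op.inputs[0], op.inputs[1]  # op.inputs[1] is the variable read
    # standard Mul gradients, but `weights` is available here for a modified rule
    return grad * weights, grad * const

g = tf.get_default_graph()
x = tf.get_variable('x', initializer=3.0)
with g.gradient_override_map({"Mul": "MulAccessWeights"}):
    y = 5.0 * x
dx, = tf.gradients(y, x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(dx))  # 5.0, produced by the custom gradient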

Compute gradients for each time step of tf.while_loop

Given a TensorFlow tf.while_loop, how can I calculate the gradient of x_out with respect to all weights of the network for each time step?
network_input = tf.placeholder(tf.float32, [None])
steps = tf.constant(0.0)
weight_0 = tf.Variable(1.0)
layer_1 = network_input * weight_0

def condition(steps, x):
    return steps <= 5

def loop(steps, x_in):
    weight_1 = tf.Variable(1.0)
    x_out = x_in * weight_1
    steps += 1
    return [steps, x_out]

_, x_final = tf.while_loop(
    condition,
    loop,
    [steps, layer_1]
)
Some notes
In my network the condition is dynamic. Different runs are going to run the while loop a different number of times.
Calling tf.gradients(x, tf.trainable_variables()) crashes with AttributeError: 'WhileContext' object has no attribute 'pred'. It seems the only way to use tf.gradients within the loop is to calculate the gradient with respect to weight_1 and the current value of x_in at that time step only, without backpropagating through time.
In each time step, the network is going to output a probability distribution over actions. The gradients are then needed for a policy gradient implementation.
You can't call tf.gradients inside tf.while_loop in Tensorflow, based on this and this. I found this out the hard way when I was trying to build conjugate gradient descent entirely inside the Tensorflow graph.
But if I understand your model correctly, you could build your own version of an RNNCell and wrap it in tf.nn.dynamic_rnn, though the actual cell implementation will be a little complex since you need to evaluate a condition dynamically at runtime.
For starters, you can take a look at Tensorflow's dynamic_rnn code here.
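A minimal sketch of that suggestion, with a fixed number of steps for simplicity (a truly dynamic condition would go through dynamic_rnn's sequence_length argument); the cell reproduces the x_out = x_in * weight_1 body from the question:

import tensorflow as tf

class MulCell(tf.nn.rnn_cell.RNNCell):
    @property
    def state_size(self):
        return 1

    @property
    def output_size(self):
        return 1

    def build(self, _):
        self.weight_1 = self.add_weight('weight_1', shape=[],
                                        initializer=tf.ones_initializer())
        self.built = True

    def call(self, inputs, state):
        x_out = state * self.weight_1
        return x_out, x_out  # emit the state so every step is differentiable

network_input = tf.placeholder(tf.float32, [None, 1])
weight_0 = tf.Variable(1.0)
layer_1 = network_input * weight_0  # initial state, shape [batch, 1]

steps = 6  # the question's loop runs while steps <= 5
dummy_inputs = tf.zeros([tf.shape(network_input)[0], steps, 1])
outputs, _ = tf.nn.dynamic_rnn(MulCell(), dummy_inputs, initial_state=layer_1)

# dynamic_rnn exposes every step's output, so per-step gradients now work:
step_grads = [tf.gradients(outputs[:, t], tf.trainable_variables())
              for t in range(steps)]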
Alternatively, dynamic graphs have never been Tensorflow's strong suit, so consider using other frameworks like PyTorch, or try out eager execution and see if that helps.

Using TensorFlow ``grad_loss / grad_ys`` parameter to add gradients

I'm trying to use the grad_loss parameter in optimizer.minimize(loss, grad_loss=) to modify the network gradients with existing gradients.
I followed the comments here:
Use of grads_ys parameter in tf.gradients - TensorFlow
and I would like to run a toy example, in which I recreate the default 1 values for grad_ys, as specified in the documentation.
Here's the relevant code segment:
grads_and_vars = optimizer.compute_gradients(loss_op)
vars_with_grad = [v for g, v in grads_and_vars if g is not None]

grad_loss = []
for grad, var in grads_and_vars:
    grad_loss.append(tf.ones_like(grad))

train_op = optimizer.minimize(loss_op, grad_loss=grad_loss)
The first part extracts gradients using compute_gradients. The last line computes gradients of the loss function loss_op, but attempts to use 1-filled vectors for the grads. As far as I understand, this should behave similarly to running minimize without the grad_loss parameter.
Unfortunately, this fails since it expects grad_loss to be a Tensor (and have a dtype) and not a list. Looking into gradients_impl.py, I see that the function expects grad_loss to be of the same dimension as loss (which in this case is a scalar).
I would appreciate any assistance in this simple example - how do I add elements to the gradients this way?
EDIT: I guess the question boils down to the definition of grad_loss: "A Tensor holding the gradient computed for loss." How do I generate such a tensor from a set of gradients obtained by compute_gradients?
Thanks.
You can make use of the tf.convert_to_tensor method to stack your list of gradients into a single tensor, and then use tf.reduce_sum to collapse it to a scalar matching the scalar loss (this assumes the gradients all share a shape; otherwise reduce each one to a scalar first and add them up):
train_op = optimizer.minimize(loss_op, grad_loss=tf.reduce_sum(tf.convert_to_tensor(grad_loss)))
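As a sanity check on what shape grad_loss expects: since loss_op is a scalar, the documented default of "all ones" is recreated by a scalar one, not by one ones-tensor per variable. A toy sketch (the variable, loss, and optimizer here are illustrative, not from the question):

import tensorflow as tf

x = tf.Variable(2.0)
loss_op = tf.square(x)
optimizer = tf.train.GradientDescentOptimizer(0.1)

default_op = optimizer.minimize(loss_op)
same_op = optimizer.minimize(loss_op, grad_loss=tf.ones_like(loss_op))
# Both ops apply the same update; scaling grad_loss by c scales every
# parameter gradient by c, via the chain rule through the scalar loss.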

Computing the weight-update-ratio in Tensorflow

I'm searching for a way to compute the weight-update-ratio for optimizer steps in Tensorflow. The weight-update-ratio is defined as the update-scale divided by the variable scale in each step and can be used for inspecting network training.
Ideally I want a non-intrusive way to compute it in a single session run, but couldn't accomplish quite what I was looking for. Since the update-scale and parameter scale are independent of the train step, one needs to add explicit dependencies to the graph in order to capture the variable scale before and after the update step. Unfortunately, it seems that in TF dependencies can only be defined for new nodes, which further complicates the issue.
So far, the best I've come up with is a context manager for defining the necessary operations. It's used as follows:
opt = tf.train.AdamOptimizer(1e0)
grads = tf.gradients(loss, tf.trainable_variables())
grads = list(zip(grads, tf.trainable_variables()))

with compute_weight_update_ratio('wur') as wur:
    train = opt.apply_gradients(grads_and_vars=grads)

# ...
with tf.Session() as sess:
    sess.run(wur.ratio)
The full code of compute_weight_update_ratio can be found below. What bugs me is that in the current state the weight-update-ratio (at least norm_before) is computed with every training step, but for performance reasons I'd prefer to do it selectively (e.g. only when summaries are computed).
Any ideas on how to improve?
import contextlib
import tensorflow as tf

@contextlib.contextmanager
def compute_weight_update_ratio(name, var_scope=None):
    '''Injects ops into training to compute the weight-update-ratio.

    The weight-update-ratio is computed as the update scale divided
    by the variable scale before the update and should be somewhere in
    the range of 1e-2 or 1e-3.

    Params
    ------
    name : str
        Operation name

    Kwargs
    ------
    var_scope : str, optional
        Name selection of variables to compute the weight-update-ratio for.
        Defaults to all. Regex supported.
    '''

    class WeightUpdateRatio:
        def __init__(self):
            self.num_train = len(tf.get_collection(tf.GraphKeys.TRAIN_OP))
            self.variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=var_scope)
            self.norm_before = tf.norm(self.variables, name='norm_before')

        def compute_ratio(self):
            train_ops = tf.get_collection(tf.GraphKeys.TRAIN_OP)
            assert len(train_ops) > self.num_train, 'Missing training op'
            with tf.control_dependencies(train_ops[self.num_train:]):
                self.norm_after = tf.norm(self.variables, name='norm_after')
                absdiff = tf.abs(tf.subtract(self.norm_after, self.norm_before), name='absdiff')
                self.ratio = tf.divide(absdiff, self.norm_before, name=name)

    with tf.name_scope(name):
        try:
            wur = WeightUpdateRatio()
            with tf.control_dependencies([wur.norm_before]):
                yield wur
        finally:
            wur.compute_ratio()
You don't need to worry about performance too much. Tensorflow only executes the subgraph necessary to produce the output.
So, in your training loop, if wur.ratio is not called during an iteration, none of the extra nodes created to compute it will be executed.
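Concretely, it is enough to fetch wur.ratio only on the iterations where you want it; on the other iterations the extra norm ops simply never run (train and wur are from the question, num_steps and the cadence are illustrative):

for step in range(num_steps):
    if step % 100 == 0:  # e.g. only when summaries are written
        _, ratio = sess.run([train, wur.ratio])
    else:
        sess.run(train)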
