I'm trying to use the grad_loss parameter in optimizer.minimize(loss, grad_loss=) to modify the network gradients with existing gradients.
I followed the comments here:
Use of grads_ys parameter in tf.gradients - TensorFlow
and I would like to run a toy example, in which I recreate the default 1 values for grad_ys, as specified in the documentation.
Here's the relevant code segment:
grads_and_vars = optimizer.compute_gradients(loss_op)
vars_with_grad = [v for g, v in grads_and_vars if g is not None]
grad_loss = []
for grad, var in grads_and_vars:
    grad_loss.append(tf.ones_like(grad))
train_op = optimizer.minimize(loss_op, grad_loss=grad_loss)
The first part extracts gradients using compute_gradients. The last line computes gradients of the loss function loss_op, but attempts to use 1-filled vectors for the grads. As far as I understand, this should behave similarly to running minimize without the grad_loss parameter.
Unfortunately, this fails since it expects grad_loss to be a Tensor (and have a dtype) and not a list. Looking into gradients_impl.py, I see that the function expects grad_loss to have the same dimensions as loss (which in this case is a scalar).
I would appreciate any assistance in this simple example - how do I add elements to the gradients this way?
EDIT: I guess the question boils down to the definition of grad_loss: "A Tensor holding the gradient computed for loss." How do I generate such a tensor from a set of gradients obtained by compute_gradients?
Thanks.
You can make use of the tf.convert_to_tensor method to convert your list of gradients to a tensor, and then use tf.reduce_sum:
train_op = optimizer.minimize(loss_op, grad_loss=tf.reduce_sum(tf.convert_to_tensor(grad_loss)))
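As a side note, grad_loss is forwarded by minimize to compute_gradients and ends up as grad_ys in tf.gradients, so it has to match the shape of loss_op (a scalar here), not the shapes of the per-variable gradients. A minimal sketch that recreates the default 1-valued grad_ys would therefore be:
# grad_loss must have the same shape as the loss it weights; for a scalar loss
# the default grad_ys of 1 is just a ones-like tensor of the loss itself
train_op = optimizer.minimize(loss_op, grad_loss=tf.ones_like(loss_op))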
Currently, I am working on Universal Perturbation-style research, in which I use the gradient of the layer before the activation function to retrace the gradient step taken in the last iteration.
However, when I try to extract the gradient using K.gradients, I can't seem to extract the right stuff.
Either I get a tensor, which I don't want, or I get [zero]. What I want are the exact gradients of that second to last layer, given the input-image. This is what I currently have:
f_image = np.array(model.predict(image)).flatten()
I = (np.array(f_image)).flatten().argsort()[::-1]
I = I[0:num_classes]
pert_image = image
gradients = np.asarray(grads(pert_image,I))
Here grads should be the gradient function to get the exact gradients. When I use the following code, I get a tensor:
gradients = K.gradients(model.layers[-2].output, model.layers[0].input)[0]
Here the output is the layer that gives the largest influences before the activation used to classify, and the input is the perturbed image, which starts off as the original image.
Could someone tell me what is wrong with my K.gradients implementation?
K.gradients computes the gradient symbolically; you need to evaluate the gradient with actual inputs in order to get numerical values. You can do this using K.function to build a callable:
import keras.backend as K
gradients = K.gradients(model.layers[-2].output, model.layers[0].input)[0]
grad_fn = K.function([model.input], [gradients])
You can then call grad_fn with an appropriate input (including the batch dimension), and it will return the numerical values of the gradient:
actual_gradients = grad_fn([image])
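Note that K.function returns a list with one entry per output tensor, so if you want the gradient array itself rather than a one-element list, take the first element:
# grad_fn returns [gradient_array]; index into it to get the array
actual_gradients = grad_fn([image])[0]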
I'm trying to implement a version of differentially private stochastic gradient descent (e.g., this), which goes as follows:
Compute the gradient with respect to each point in the batch of size L, clip each of the L gradients separately, average them together, and finally perform a (noisy) gradient descent step.
What is the best way to do this in pytorch?
Preferably, there would be a way to simultaneously compute the gradients for each point in the batch:
x # inputs with batch size L
y #true labels
y_output = model(x)
loss = loss_func(y_output,y) #vector of length L
loss.backward() #stores L distinct gradients in each param.grad, magically
But failing that, I could compute each gradient separately and then clip its norm before accumulating; however, the following
x # inputs with batch size L
y #true labels
y_output = model(x)
loss = loss_func(y_output,y) #vector of length L
for i in range(loss.size()[0]):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm(model.parameters(), clip_size)
accumulates the ith gradient into param.grad and then clips, rather than clipping before accumulating it into the gradient. What's the best way to get around this issue?
I don't think you can do much better than the second method in terms of computational efficiency; you're losing the benefits of batching in your backward pass, and that's a fact. Regarding the order of clipping, autograd stores the gradients in .grad of parameter tensors. A crude solution would be to add a dictionary like
clipped_grads = {name: torch.zeros_like(param) for name, param in net.named_parameters()}
Run your for loop like
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm_(net.parameters(), clip_size)
    for name, param in net.named_parameters():
        clipped_grads[name] += param.grad / loss.size(0)
    net.zero_grad()

for name, param in net.named_parameters():
    param.grad = clipped_grads[name]

optimizer.step()
where I omitted much of the detach, requires_grad=False and similar business which may be necessary to make it behave as expected.
The disadvantage of the above is that you end up storing 2x the memory for your parameter gradients. In principle you could take the "raw" gradient, clip it, add it to clipped_grads, and then discard it as soon as no downstream operations need it, whereas here you retain the raw values in grad until the end of the backward pass. It may be that register_backward_hook allows you to do that if you go against the guidelines and actually modify the grad_input, but you would have to verify with someone more intimately acquainted with autograd.
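For what it's worth, here is a rough sketch of that clip-and-discard idea using per-tensor hooks (Tensor.register_hook rather than register_backward_hook). It is only an approximation: it clips each parameter tensor's gradient norm separately, not the global norm across all parameters as clip_grad_norm_ does, and it assumes net, loss, clip_size and clipped_grads as defined above.
import torch

def make_hook(name):
    def hook(grad):
        # clip this parameter's per-sample gradient and accumulate it right away
        scale = min(1.0, clip_size / (grad.norm().item() + 1e-6))
        clipped_grads[name] += grad * scale / loss.size(0)
        # return zeros so nothing piles up in param.grad between samples
        return torch.zeros_like(grad)
    return hook

for name, param in net.named_parameters():
    param.register_hook(make_hook(name))

for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)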
This package calculates per-sample gradients in parallel. The memory needed is still batch_size times that of standard stochastic gradient descent, but due to parallelization it can run much faster.
I want to implement C-MWP as described here: https://arxiv.org/pdf/1608.00507.pdf in keras/tensorflow.
This involves modifying the way backprop is performed. The new gradient is a function of the bottom activation responses, the weight parameters, and the gradients of the layer above.
As a start, I was looking at the way keras-vis is doing modified backprop:
def _register_guided_gradient(name):
    if name not in ops._gradient_registry._registry:
        @tf.RegisterGradient(name)
        def _guided_backprop(op, grad):
            dtype = op.outputs[0].dtype
            gate_g = tf.cast(grad > 0., dtype)
            gate_y = tf.cast(op.outputs[0] > 0, dtype)
            return gate_y * gate_g * grad
However, to implement C-MWP I need access to the weights of the layer on which the backprop is performed. Is it possible to access the weights within the @tf.RegisterGradient(name) function? Or am I on the wrong path?
The gradient computation in TF is fundamentally per-operation. If the operation whose gradient you want to change is performed on the weights, or at least the weights are not far from it in the operation graph, you can try finding the weights tensor by walking the graph inside your custom gradient. For example, say you have something like
x = tf.get_variable(...)
y = 5.0 * x
tf.gradients(y, x)
You can get to the variable tensor (more precisely, the tensor produced by the variable reading operation) with something like
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1]
    ...
If the weights are not immediate inputs, but you know how to get to them, you can walk the graph a bit using something like:
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1].op.inputs[0].op.inputs[2]
    ...
You should understand that this solution is very hacky. If you control the forward pass, you might want to just define a custom gradient for the subgraph you care about. You can see how to do that in How to register a custom gradient for a operation composed of tf operations, in How Can I Define Only the Gradient for a Tensorflow Subgraph?, and at https://www.tensorflow.org/api_docs/python/tf/Graph#gradient_override_map
Given a TensorFlow tf.while_loop, how can I calculate the gradient of x_out with respect to all weights of the network for each time step?
network_input = tf.placeholder(tf.float32, [None])
steps = tf.constant(0.0)
weight_0 = tf.Variable(1.0)
layer_1 = network_input * weight_0

def condition(steps, x):
    return steps <= 5

def loop(steps, x_in):
    weight_1 = tf.Variable(1.0)
    x_out = x_in * weight_1
    steps += 1
    return [steps, x_out]

_, x_final = tf.while_loop(
    condition,
    loop,
    [steps, layer_1]
)
Some notes
In my network the condition is dynamic. Different runs are going to run the while loop a different number of times.
Calling tf.gradients(x, tf.trainable_variables()) crashes with AttributeError: 'WhileContext' object has no attribute 'pred'. It seems like the only possibility for using tf.gradients within the loop is to calculate the gradient with respect to weight_1 and the current value of x_in at the current time step only, without backpropagating through time.
In each time step, the network is going to output a probability distribution over actions. The gradients are then needed for a policy gradient implementation.
You can't ever call tf.gradients inside tf.while_loop in Tensorflow, based on this and this. I found this out the hard way when I was trying to build conjugate gradient descent entirely inside the Tensorflow graph.
But if I understand your model correctly, you could make your own version of an RNNCell and wrap it in tf.nn.dynamic_rnn, but the actual cell implementation will be a little complex since you need to evaluate a condition dynamically at runtime.
For starters, you can take a look at Tensorflow's dynamic_rnn code here.
Alternatively, dynamic graphs have never been Tensorflow's strong suit, so consider using other frameworks like PyTorch, or try out eager execution and see if that helps.
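For illustration only, a rough skeleton of the RNNCell idea, assuming TF 1.x APIs; dummy_inputs and lengths are made-up names, and the dynamic condition is approximated with a sequence_length argument, so this is not a drop-in replacement for the loop above:
import tensorflow as tf

class StepCell(tf.nn.rnn_cell.RNNCell):
    """One iteration of the loop body expressed as an RNN cell (the state is x)."""

    def __init__(self):
        super(StepCell, self).__init__()
        self.weight_1 = tf.Variable(1.0)

    @property
    def state_size(self):
        return 1

    @property
    def output_size(self):
        return 1

    def call(self, inputs, state):
        x_out = state * self.weight_1
        return x_out, x_out

# dummy_inputs: [batch, max_steps, 1]; lengths: per-example step counts that
# stand in for the dynamic stopping condition (initial_state could carry layer_1)
outputs, x_final = tf.nn.dynamic_rnn(StepCell(), dummy_inputs,
                                     sequence_length=lengths, dtype=tf.float32)
# x_final is now differentiable through every executed time step
grads = tf.gradients(x_final, tf.trainable_variables())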
I am using a function consisting of compound Tensorflow operations. However, instead of letting Tensorflow automatically compute its derivatives with respect to one of the inputs, I would like to replace the gradients with a different computation on the same input. Moreover, some of the calculation is shared between the forward and backward pass. For example:
def func(in1, in2):
    # do something with the inputs using only tf operations
    shared_rep = tf.op1(tf.op2(tf.op3(in1, in2)))  # same computation for both forward and gradient pass
    # return output of forward computation
    return tf.op4(shared_rep)

def func_grad(in1, in2):
    shared_rep = tf.op1(tf.op2(tf.op3(in1, in2)))
    # explicitly calculate gradients with respect to in1, with the intention of
    # replacing the gradients computed by Tensorflow
    mygrad1 = tf.op5(tf.op6(shared_rep))
    return mygrad1

in1 = tf.Variable([1, 2, 3])
in2 = tf.Variable([2.5, 0.01])
func_val = func(in1, in2)
my_grad1 = func_grad(in1, in2)
tf_grad1 = tf.gradients(func_val, in1)

with tf.Session() as sess:
    # would like tf_grad1 to equal my_grad1
    val, my1, tf1 = sess.run([func_val, my_grad1, tf_grad1])
    tf.assert_equal(my1, tf1)
NOTE: This is similar to question How to replace or modify gradient? with one key difference: I am not interested in Tensorflow computing gradients of a different function in the backward pass; rather I would like to supply the gradients myself based on alternate tensorflow operations on the input.
I am trying to use the ideas proposed in the solution to the above question and in the following post, that is, using tf.RegisterGradient and gradient_override_map to override the gradient of the identity function wrapping the forward function.
This fails because inside the registered alternate grad for identity, I have no access to the input to func_grad:
@tf.RegisterGradient("CustomGrad")
def alternate_identity_grad(op, grad):
    # op.inputs[0] is the output of func(in1, in2)
    # grad is of no use, because I would like to replace it with func_grad(in1, in2)
    ...

g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
    out_grad = tf.identity(input, name="Identity")
EDIT: After additional research, I believe this question is similar to the following question. I managed to obtain the desired solution by combining gradient_override_map with the hack suggested here.
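For completeness, a minimal sketch of the same goal using tf.custom_gradient (available in newer TF 1.x releases). This is a different mechanism than the gradient_override_map hack referred to above and sidesteps the identity wrapper; tf.op1 through tf.op6 are the placeholder ops from the question, and passing Variables straight in may need extra care:
@tf.custom_gradient
def func_with_my_grad(in1, in2):
    shared_rep = tf.op1(tf.op2(tf.op3(in1, in2)))
    out = tf.op4(shared_rep)

    def grad(dy):
        # reuse shared_rep for the backward pass and hand back our own gradient
        # with respect to in1; a zero gradient is supplied for in2 in this sketch
        mygrad1 = tf.op5(tf.op6(shared_rep))
        return dy * mygrad1, tf.zeros_like(in2)

    return out, grad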