PyTorch gradient becomes None when dividing by a scalar - python

Consider the following code block:
import torch

n = 10
x = torch.ones(n, requires_grad=True) / n
y = torch.rand(n)
z = torch.sum(x * y)
z.backward()
print(x.grad)  # results in None
print(y)
As written, x.grad is None. However, if I change the definition of x by removing the division (x = torch.ones(n, requires_grad=True)), then I do get a non-None gradient, equal to y.
I've googled a bunch looking for this issue, and I think it reflects something fundamental that I don't understand about how the computational graph works in PyTorch. I'd love some clarification. Thanks!

When you set x to a tensor divided by a scalar, x is no longer what is called a "leaf" tensor in PyTorch. A leaf tensor is a tensor at the beginning of the computation graph (which is a directed acyclic graph whose nodes represent objects such as tensors and whose edges represent mathematical operations). More specifically, it is a tensor that was not created by a computational operation tracked by the autograd engine.
In your example, torch.ones(n, requires_grad=True) is a leaf tensor, but you can't access it directly in your code; the x you hold is the result of the division.
The reasoning behind not keeping the grad for non-leaf tensors is that typically, when you train a network, the weights and biases are leaf tensors and they are what we need the gradient for.
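A quick way to see this distinction is to check the is_leaf attribute, as in this small sketch based on the question's code:
import torch

n = 10
a = torch.ones(n, requires_grad=True)  # created directly: a leaf tensor
b = a / n                              # created by a tracked operation: not a leaf
print(a.is_leaf)  # True
print(b.is_leaf)  # False, so b.grad is not populated by default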
If you want to access the gradients of a non-leaf tensor, you should call the retain_grad function, which means in your code you should add:
x.retain_grad()
after the assignment to x.
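Applied to the question's code, a minimal sketch of that fix:
import torch

n = 10
x = torch.ones(n, requires_grad=True) / n
x.retain_grad()  # ask autograd to populate .grad for this non-leaf tensor
y = torch.rand(n)
z = torch.sum(x * y)
z.backward()
print(x.grad)  # now equal to y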

It is true that you need to retain the grad. However, the easiest correction to this issue is to restructure the computation so that x stays a leaf tensor, for example by applying the torch.div() function inside the forward pass rather than in the definition of x.
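A sketch of that restructuring (note that the gradient then picks up the 1/n factor):
import torch

n = 10
x = torch.ones(n, requires_grad=True)  # x stays a leaf tensor
y = torch.rand(n)
z = torch.sum(torch.div(x, n) * y)     # scale inside the forward pass instead
z.backward()
print(x.grad)  # equal to y / n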

Related

Workaround for tf.reshape breaking the flow of the gradient (Jacobian)

I have a program in which I'm trying to calculate the Jacobian of a neural network, but in order to properly define the Jacobian I used tf.reshape to make the data vectors (as far as I know, a Jacobian dy/dx is only defined when y and x are vectors, not matrices or tensors).
This is my code:
@tf.function
def A_calculator():
    with tf.GradientTape(watch_accessed_variables=False) as gtape:
        noise = tf.random.normal([1000, 100])
        gtape.watch(noise)
        fakenoise = tf.reshape(gen(noise), [1000, -1])
        reshaped_noise = tf.reshape(noise, [1000, -1])
    # calculate the Jacobian
    Jz = gtape.batch_jacobian(fakenoise, reshaped_noise)
    return Jz
where gen is a neural network that returns an image (a generator).
My problem is that Jz is always a tensor with zeros as elements.
I searched for a solution, and the closest thing was here (this is what made me suspect that the problem is tf.reshape), but the solution there doesn't solve my problem, as I want to do the reshape after I feed the value to the function gen. Does anybody know how to solve this, or why Jz always comes out as a tensor of zeros?
Reshaping every tensor is unnecessary, as reshaping a (1000, 100) tensor to (1000, -1) results in the same shape. Skip the reshaping altogether, at all stages; see the sketch below.
Also check the generator; it could take a lot of time to produce the fakenoise.
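A minimal sketch of that simplification (gen is the question's generator network, assumed to map (1000, 100) noise to images; batch_jacobian accepts higher-rank targets, so no reshaping is needed):
import tensorflow as tf

@tf.function
def A_calculator():
    with tf.GradientTape(watch_accessed_variables=False) as gtape:
        noise = tf.random.normal([1000, 100])
        gtape.watch(noise)      # watch the tensor actually fed to gen
        fakenoise = gen(noise)  # no intermediate reshape to break the flow
    # Jacobian of each sample's output w.r.t. its own noise vector
    return gtape.batch_jacobian(fakenoise, noise)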

Difference between "detach()" and "with torch.no_grad()" in PyTorch?

I know about two ways to exclude parts of a computation from the gradient calculation in backward().
Method 1: using with torch.no_grad():
with torch.no_grad():
    y = reward + gamma * torch.max(net.forward(x))
loss = criterion(net.forward(torch.from_numpy(o)), y)
loss.backward()
Method 2: using .detach():
y = reward + gamma * torch.max(net.forward(x))
loss = criterion(net.forward(torch.from_numpy(o)), y.detach())
loss.backward()
Is there a difference between these two? Are there benefits/downsides to either?
tensor.detach() creates a tensor that shares storage with tensor but does not require grad. It detaches the output from the computational graph, so no gradient will be backpropagated along this variable.
The wrapper with torch.no_grad() temporarily sets all requires_grad flags to False; torch.no_grad says that no operation inside it should build the graph.
The difference is that detach() refers only to the tensor on which it is called, while torch.no_grad affects all operations taking place within the with statement. Also, torch.no_grad will use less memory because it knows from the beginning that no gradients are needed, so it doesn't need to keep intermediate results.
Learn more about the differences between these along with examples from here.
detach()
One example without detach():
from torchviz import make_dot
import torch

x = torch.ones(2, requires_grad=True)
y = 2 * x
z = 3 + x
r = (y + z).sum()
make_dot(r)
In the rendered graph, the end result r (in green) is the root of the AD computational graph, and the leaf tensor is shown in blue.
Another example with detach():
from torchviz import make_dot
import torch

x = torch.ones(2, requires_grad=True)
y = 2 * x
z = 3 + x.detach()
r = (y + z).sum()
make_dot(r)
This is the same as:
from torchviz import make_dot
import torch

x = torch.ones(2, requires_grad=True)
y = 2 * x
z = 3 + x.data
r = (y + z).sum()
make_dot(r)
But x.data is the old notation, and x.detach() is the new way.
What is the difference between x and x.detach()?
print(x)
print(x.detach())
Out:
tensor([1., 1.], requires_grad=True)
tensor([1., 1.])
So x.detach() is a way to remove requires_grad and what you get is a new detached tensor (detached from AD computational graph).
torch.no_grad
torch.no_grad is actually a class.
x = torch.ones(2, requires_grad=True)
with torch.no_grad():
    y = x * 2
print(y.requires_grad)
Out:
False
From help(torch.no_grad):
Disabling gradient calculation is useful for inference, when you are sure that you will not call Tensor.backward(). It will reduce memory consumption for computations that would otherwise have requires_grad=True. In this mode, the result of every computation will have requires_grad=False, even when the inputs have requires_grad=True.
A simple way to put it: with torch.no_grad() behaves like a context in which everything computed inside it temporarily has requires_grad set to False. So there is no need to specify anything beyond this if you need to stop backpropagation through certain variables or functions.
However, .detach() simply detaches the variable from the gradient computation graph, as the name suggests. It is used when this has to be specified for a limited number of variables or functions, e.g., when displaying the loss and accuracy after an epoch of neural network training, because at that moment the gradient is not needed and keeping it would only consume resources.
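As a small illustration of that last point (a sketch; net, criterion, optimizer, and loader are placeholders from a typical training loop), detaching the loss for logging keeps the graph from being retained:
running_loss = 0.0
for x_batch, y_batch in loader:
    optimizer.zero_grad()
    loss = criterion(net(x_batch), y_batch)
    loss.backward()
    optimizer.step()
    running_loss += loss.detach().item()  # logging only; no graph kept alive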

Use tf.gradients or tf.hessians on flattened parameter tensor

Let's say I want to compute the Hessian of a scalar-valued function with respect to some parameters W (e.g the weights and biases of a feed-forward neural network).
If you consider the following code, implementing a two-dimensional linear model trained to minimize an MSE loss:
import numpy as np
import tensorflow as tf
x = tf.placeholder(dtype=tf.float32, shape=[None, 2])  # inputs
t = tf.placeholder(dtype=tf.float32, shape=[None])     # labels
W = tf.Variable(np.eye(2), dtype=tf.float32)           # weights
preds = tf.matmul(x, W)                                 # linear model
loss = tf.reduce_mean(tf.square(preds - t), axis=0)     # MSE loss
params = tf.trainable_variables()
hessian = tf.hessians(loss, params)
you'd expect session.run(hessian, feed_dict={...}) to return a 2x2 matrix (equal to W). It turns out that because params is a 2x2 tensor, the output is rather a tensor with shape [2, 2, 2, 2]. While I can easily reshape the tensor to obtain the matrix I want, it seems that this operation might be extremely cumbersome when params becomes a list of tensors of varying size (i.e., when the model is a deep neural network, for instance).
It seems there are two ways around this:
Flatten params into a 1D tensor called flat_params:
flat_params = tf.concat([tf.reshape(p, [-1]) for p in params], axis=0)
so that tf.hessians(loss, flat_params) naturally returns a 2x2 matrix. However, as noted in Why does Tensorflow Reshape tf.reshape() break the flow of gradients? for tf.gradients (which also holds for tf.hessians), TensorFlow is not able to see the symbolic link in the graph between params and flat_params, and tf.hessians(loss, flat_params) will raise an error because the gradients will be seen as None.
In https://afqueiruga.github.io/tensorflow/2017/12/28/hessian-mnist.html, the author of the code goes the other way: he first creates the flat parameter and reshapes its parts into self.params. This trick does work and gets you the Hessian with its expected shape (a 2x2 matrix). However, it seems to me that this will be cumbersome to use with a complex model, and impossible to apply if you create your model via built-in functions (like tf.layers.dense, ...).
Is there no straightforward way to get the Hessian matrix (the 2x2 matrix in this example) from tf.hessians when self.params is a list of tensors of arbitrary shapes? If not, how can you automate the reshaping of the output tensor of tf.hessians?
It turns out (as of TensorFlow r1.13) that if len(xs) > 1, then tf.hessians(ys, xs) returns tensors corresponding only to the block-diagonal submatrices of the full Hessian matrix. The full story and solutions are in this paper, https://arxiv.org/pdf/1905.05559, with code at https://github.com/gknilsen/pyhessian.
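To illustrate the block-diagonal behavior on the setup above, each returned block can at least be reshaped into a 2-D matrix per parameter tensor (a sketch, not the full Hessian):
blocks = tf.hessians(loss, params)  # one block per tensor in params
flat_blocks = []
for h, p in zip(blocks, params):
    n_i = p.shape.num_elements()                   # scalar parameters in this tensor
    flat_blocks.append(tf.reshape(h, [n_i, n_i]))  # (n_i, n_i) diagonal block only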

Tensorflow - Access weights while doing backprop

I want to implement C-MWP as described here: https://arxiv.org/pdf/1608.00507.pdf in keras/tensorflow.
This involves modifying the way backprop is performed. The new gradient is a function of the bottom activation responses, the weight parameters, and the gradients of the layer above.
As a start, I was looking at the way keras-vis does modified backprop:
def _register_guided_gradient(name):
    if name not in ops._gradient_registry._registry:
        @tf.RegisterGradient(name)
        def _guided_backprop(op, grad):
            dtype = op.outputs[0].dtype
            gate_g = tf.cast(grad > 0., dtype)
            gate_y = tf.cast(op.outputs[0] > 0, dtype)
            return gate_y * gate_g * grad
However, to implement C-MWP I need access to the weights of the layer on which the backprop is performed. Is it possible to access the weights within the @tf.RegisterGradient(name)-decorated function? Or am I on the wrong path?
The gradient computation in TF is fundamentally per-operation. If the operation whose gradient you want to change is performed on the weights, or at least the weights are not far from it in the operation graph, you can try finding the weights tensor by walking the graph inside your custom gradient. For example, say you have something like:
x = tf.get_variable(...)
y = 5.0 * x
tf.gradients(y, x)
You can get to the variable tensor (more precisely, the tensor produced by the variable reading operation) with something like
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1]
    ...
If the weights are not immediate inputs, but you know how to get to them, you can walk the graph a bit using something like:
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1].op.inputs[0].op.inputs[2]
    ...
You should understand that this solution is very hacky. If you control the forward pass, you might want to define a custom gradient just for the subgraph you care about. You can see how to do that in How to register a custom gradient for an operation composed of tf operations, How Can I Define Only the Gradient for a Tensorflow Subgraph?, and https://www.tensorflow.org/api_docs/python/tf/Graph#gradient_override_map.
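A minimal sketch of the gradient_override_map pattern from that last link (TF1-style; the name "CustomClipGrad" and the clipping rule are illustrative assumptions):
import tensorflow as tf

@tf.RegisterGradient("CustomClipGrad")
def _clip_grad(op, grad):
    # replace the Identity gradient with a clipped version
    return tf.clip_by_value(grad, -0.1, 0.1)

g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomClipGrad"}):
    y = tf.identity(x)  # x: some tensor in your forward pass (assumed)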

tf.assign on a tf.concat tensor drops the Variable character of tensors?

I am trying to set specific values for the weights and biases of a TensorFlow neural network using the Python API. To this end, I placed all weights and biases in a common collection, with proper reshaping and using tf.concat on the tensors from each layer.
At a certain stage in my code, I retrieve said collection. However, when I then try to tf.assign (using a tf.placeholder of the same shape) to this concatenated tensor, in order to set all weights/biases from a single vector of values (e.g., sitting in the feed_dict), I get the error
AttributeError: 'Tensor' object has no attribute 'assign'
I have boiled my problem down to a minimum working example (MWE) as follows:
import tensorflow as tf

a = tf.Variable(tf.random_uniform([2], dtype=tf.float32))
b = tf.Variable(tf.random_uniform([2], dtype=tf.float32))
c = tf.concat([a, b], axis=0)
d_all = tf.placeholder(shape=[4], dtype=tf.float32)
d_single = tf.placeholder(shape=[2], dtype=tf.float32)

# e_all = tf.assign(c, d_all)
e_single = tf.assign(a, d_single)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

print(a)
print(d_single)
sess.run(e_single, feed_dict={d_single: [1, 2]})

print(c)
print(d_all)
# sess.run(e_all, feed_dict={d_all: [1, 2, 3, 4]})
The commented-out lines do not work; they fail with the same error. It seems that the tensor resulting from tf.concat is not a Variable anymore and therefore does not have an assign method. I found a related issue here, but my problem is not solved by validate_shape as suggested there.
Any ideas? Is this desired behavior?
Yes, this is by design, because c is the output of an op, not a variable. Here's the simplest version of it:
c = a + b
tf.assign(c, a) # Does not work!
Basically, this graph means that the node c depends on a and b through a certain operation (concat, addition, whatever). Assigning other values to c conflicts with the values coming from a and b; in other words, it breaks the computational graph.
What you should do instead is split d_all into tensors of shape [2] and assign to the underlying a and b. That is perfectly valid, as sketched below.
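A sketch of that fix, continuing the MWE above:
d_a, d_b = tf.split(d_all, 2, axis=0)  # two length-2 pieces of the placeholder
e_all = tf.group(tf.assign(a, d_a), tf.assign(b, d_b))  # assign the real variables
sess.run(e_all, feed_dict={d_all: [1, 2, 3, 4]})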
