Difference between "detach()" and "with torch.no_grad()" in PyTorch?

I know of two ways to exclude elements of a computation from the gradient calculation in backward():
Method 1: using with torch.no_grad()
with torch.no_grad():
    y = reward + gamma * torch.max(net.forward(x))
loss = criterion(net.forward(torch.from_numpy(o)), y)
loss.backward()
Method 2: using .detach()
y = reward + gamma * torch.max(net.forward(x))
loss = criterion(net.forward(torch.from_numpy(o)), y.detach())
loss.backward()
Is there a difference between these two? Are there benefits/downsides to either?

tensor.detach() creates a tensor that shares storage with tensor but does not require grad. It detaches the output from the computational graph, so no gradient will be backpropagated along this variable.
The wrapper with torch.no_grad() temporarily sets all requires_grad flags to False: torch.no_grad says that no operation should build the graph.
The difference is that detach() applies only to the tensor it is called on, while torch.no_grad affects all operations taking place within the with statement. Also, torch.no_grad will use less memory, because it knows from the beginning that no gradients are needed, so it doesn't need to keep intermediary results.
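A quick illustrative sketch of the scope difference (not from the original answer):
import torch

x = torch.ones(2, requires_grad=True)

# detach() only cuts off the tensor it is called on
a = x.detach() * 2
b = x * 2
print(a.requires_grad, b.requires_grad)  # False True

# no_grad() affects every operation inside the block
with torch.no_grad():
    c = x * 2
print(c.requires_grad)  # False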
Learn more about the differences between these along with examples from here.

detach()
One example without detach():
import torch
from torchviz import make_dot

x = torch.ones(2, requires_grad=True)
y = 2 * x
z = 3 + x
r = (y + z).sum()
make_dot(r)
In the rendered graph, the end result r (in green) is the root of the AD computational graph, and the leaf tensor x is shown in blue.
Another example with detach():
import torch
from torchviz import make_dot

x = torch.ones(2, requires_grad=True)
y = 2 * x
z = 3 + x.detach()
r = (y + z).sum()
make_dot(r)
This is the same as:
import torch
from torchviz import make_dot

x = torch.ones(2, requires_grad=True)
y = 2 * x
z = 3 + x.data
r = (y + z).sum()
make_dot(r)
But x.data is the old notation; x.detach() is the new, recommended way.
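A related practical difference (a hedged sketch, not from the original answer): in-place edits through detach() are still noticed by autograd's version counter, while edits through .data are not, which can silently corrupt gradients.
import torch

x = torch.ones(2, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid's backward reuses its output y

y.detach().zero_()     # in-place edit is recorded by the version counter
y.sum().backward()     # raises a RuntimeError instead of silently
                       # computing a wrong gradient

# with y.data.zero_() instead, backward() would run and return an
# incorrect gradient without any warning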
What is the difference with x.detach()?
print(x)
print(x.detach())
Out:
tensor([1., 1.], requires_grad=True)
tensor([1., 1.])
So x.detach() is a way to remove requires_grad: what you get is a new tensor, detached from the AD computational graph.
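As noted above, the detached tensor shares storage with the original; a minimal sketch:
import torch

x = torch.ones(2, requires_grad=True)
d = x.detach()
d[0] = 5.0   # in-place change through the detached view...
print(x)     # ...is visible in x: tensor([5., 1.], requires_grad=True)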
torch.no_grad
torch.no_grad is actually a class.
x = torch.ones(2, requires_grad=True)
with torch.no_grad():
    y = x * 2
print(y.requires_grad)
Out:
False
From help(torch.no_grad):
Disabling gradient calculation is useful for inference, when you are sure
that you will not call :meth:`Tensor.backward()`. It will reduce memory
consumption for computations that would otherwise have requires_grad=True.

In this mode, the result of every computation will have
requires_grad=False, even when the inputs have requires_grad=True.
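Because torch.no_grad is a class (a context manager), it can also be used as a decorator; a small sketch:
import torch

@torch.no_grad()
def evaluate(model, x):
    # everything inside runs without building the autograd graph
    return model(x)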

A simple way to put it is that with torch.no_grad() behaves like a block in which everything computed inside it temporarily has its requires_grad flag set to False. So there is nothing further to specify if you need to stop backpropagation through a whole region of code.
tensor.detach(), on the other hand, simply detaches a single variable from the gradient computation graph, as the name suggests. It is the right tool when this has to be specified for only a limited number of variables or functions, e.g., when displaying the loss and accuracy after an epoch of neural network training: at that moment computing gradients would only consume resources, since the gradient won't matter when displaying results.
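For example, a typical training-loop logging pattern (a sketch; criterion, optimizer, and the other names are assumed):
loss = criterion(output, target)
loss.backward()
optimizer.step()

# detach before accumulating so the whole graph is not kept alive
running_loss += loss.detach()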

Related

PyTorch gradient becomes None when dividing by scalar

Consider the following code block:
import torch
n=10
x = torch.ones(n, requires_grad=True)/n
y = torch.rand(n)
z = torch.sum(x*y)
z.backward()
print(x.grad) # results in None
print(y)
As written, x.grad is None. However, if I change the definition of x by removing the scalar division (x = torch.ones(n, requires_grad=True)), then I do get a non-None gradient that is equal to y.
I've googled a bunch looking for this issue, and I think it reflects something fundamental that I don't understand about how the computational graph works in torch. I'd love some clarification. Thanks!
When you set x to a tensor divided by some scalar, x is no longer what is called a "leaf" tensor in PyTorch. A leaf tensor is a tensor at the beginning of the computation graph (which is a DAG, with nodes representing objects such as tensors and edges representing mathematical operations). More specifically, it is a tensor which was not created by a computational operation tracked by the autograd engine.
In your example - torch.ones(n, requires_grad=True) is a leaf tensor, but you can't access it directly in your code.
The reasoning behind not keeping the grad for non-leaf tensors is that typically, when you train a network, the weights and biases are leaf tensors and they are what we need the gradient for.
If you want to access the gradients of a non-leaf tensor, you should call the retain_grad function, which means in your code you should add:
x.retain_grad()
after the assignment to x.
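Putting it together, a minimal corrected version of the snippet above:
import torch

n = 10
x = torch.ones(n, requires_grad=True) / n
x.retain_grad()   # keep .grad for this non-leaf tensor
y = torch.rand(n)
z = torch.sum(x * y)
z.backward()
print(x.grad)     # now equals y instead of None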
It is true that you need to retain the grad. However, the easiest correction to this issue is using the torch.div() function.

computing gradients for every individual sample in a batch in PyTorch

I'm trying to implement a version of differentially private stochastic gradient descent (e.g., this), which goes as follows:
Compute the gradient with respect to each point in the batch of size L, then clip each of the L gradients separately, then average them together, and then finally perform a (noisy) gradient descent step.
What is the best way to do this in pytorch?
Preferably, there would be a way to simultaneously compute the gradients for each point in the batch:
x  # inputs with batch size L
y  # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # vector of length L
loss.backward()  # stores L distinct gradients in each param.grad, magically
But failing that, we could compute each gradient separately and then clip the norm before accumulating them. However, the following
x  # inputs with batch size L
y  # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # vector of length L
for i in range(loss.size()[0]):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_size)
accumulates the i-th gradient and then clips, rather than clipping before accumulating it into the gradient. What's the best way to get around this issue?
I don't think you can do much better than the second method in terms of computational efficiency; you're losing the benefits of batching in your backward, and that's a fact. Regarding the order of clipping: autograd stores the gradients in .grad of parameter tensors. A crude solution would be to add a dictionary like
clipped_grads = {name: torch.zeros_like(param) for name, param in net.named_parameters()}
Run your for loop like
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm_(net.parameters(), clip_size)
    for name, param in net.named_parameters():
        clipped_grads[name] += param.grad / loss.size(0)
    net.zero_grad()

for name, param in net.named_parameters():
    param.grad = clipped_grads[name]

optimizer.step()
where I omitted much of the detach, requires_grad=False and similar business which may be necessary to make it behave as expected.
The disadvantage of the above is that you end up storing 2x the memory for your parameter gradients. In principle you could take the "raw" gradient, clip it, add it to clipped_grads, and then discard it as soon as no downstream operations need it, whereas here you retain the raw values in grad until the end of the backward pass. It may be that register_backward_hook allows you to do that if you go against the guidelines and actually modify the grad_input, but you would have to verify with someone more intimately acquainted with autograd.
This package calculates per-sample gradients in parallel. The memory needed is still batch_size times that of standard stochastic gradient descent, but due to parallelization it can run much faster.
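On recent PyTorch (>= 2.0, an assumption beyond the original answers), torch.func computes vectorized per-sample gradients directly; a minimal sketch:
import torch
from torch.func import functional_call, grad, vmap

model = torch.nn.Linear(4, 1)
params = {k: v.detach() for k, v in model.named_parameters()}

def compute_loss(params, sample, target):
    # loss for a single (sample, target) pair
    pred = functional_call(model, params, (sample.unsqueeze(0),))
    return torch.nn.functional.mse_loss(pred.squeeze(0), target)

# vmap over the batch dimension of (sample, target), not over params
per_sample_grads = vmap(grad(compute_loss), in_dims=(None, 0, 0))

x = torch.randn(8, 4)  # batch of L = 8 inputs
y = torch.randn(8, 1)  # true labels
grads = per_sample_grads(params, x, y)  # each entry has leading dim 8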

Incorporate side conditions into Keras neural network

I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less than or equal to 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, TensorFlow (and Keras) don't allow you to add hard constraints to your model.
You have to convert your invariant (norm <= 1) into a penalty function that is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, 0.0)
total_loss = mse + norm_loss
Look at the docs of where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal, this part of the loss is simply 0. No gradient is produced.
But this can be very hard to optimize. Your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000 * norm_loss. Be very careful with this; it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to norm one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target))
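A minimal sketch of wiring the penalty into a Keras custom loss (the hinge form max(norm - 1, 0) and penalty_weight are my assumptions, not from the original answer):
import tensorflow as tf

def mse_with_norm_penalty(penalty_weight=1.0):
    def loss(y_true, y_pred):
        mse = tf.reduce_mean(tf.square(y_true - y_pred))
        y_norm = tf.norm(y_pred, axis=-1)
        # penalize only the part of the norm that exceeds 1
        excess = tf.maximum(y_norm - 1.0, 0.0)
        return mse + penalty_weight * tf.reduce_mean(excess)
    return loss

# model.compile(optimizer="adam", loss=mse_with_norm_penalty(1000.0))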

Compute gradients for each time step of tf.while_loop

Given a TensorFlow tf.while_loop, how can I calculate the gradient of x_out with respect to all weights of the network for each time step?
network_input = tf.placeholder(tf.float32, [None])
steps = tf.constant(0.0)
weight_0 = tf.Variable(1.0)
layer_1 = network_input * weight_0

def condition(steps, x):
    return steps <= 5

def loop(steps, x_in):
    weight_1 = tf.Variable(1.0)
    x_out = x_in * weight_1
    steps += 1
    return [steps, x_out]

_, x_final = tf.while_loop(
    condition,
    loop,
    [steps, layer_1]
)
Some notes
In my network the condition is dynamic. Different runs are going to run the while loop a different number of times.
Calling tf.gradients(x, tf.trainable_variables()) crashes with AttributeError: 'WhileContext' object has no attribute 'pred'. It seems like the only way to use tf.gradients within the loop is to calculate the gradient with respect to weight_1 and the current value of x_in at that time step only, without backpropagating through time.
In each time step, the network is going to output a probability distribution over actions. The gradients are then needed for a policy gradient implementation.
You can't ever call tf.gradients inside tf.while_loop in Tensorflow, based on this and this; I found this out the hard way when I was trying to build conjugate gradient descent entirely into the Tensorflow graph.
But if I understand your model correctly, you could make your own version of an RNNCell and wrap it in a tf.dynamic_rnn, although the actual cell implementation will be a little complex since you need to evaluate a condition dynamically at runtime.
For starters, you can take a look at Tensorflow's dynamic_rnn code here.
Alternatively, dynamic graphs have never been Tensorflow's strong suit, so consider using other frameworks like PyTorch, or you can try out eager execution and see if that helps.
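For comparison, in PyTorch the loop is plain Python, so per-step gradients come straight from torch.autograd.grad; a sketch mirroring the question's setup (not from the original answer):
import torch

weight_0 = torch.tensor(1.0, requires_grad=True)
weight_1 = torch.tensor(1.0, requires_grad=True)

network_input = torch.tensor([2.0])
x = network_input * weight_0

step_grads = []
steps = 0
while steps <= 5:  # the condition can depend on runtime values
    x = x * weight_1
    # gradient of the current x w.r.t. both weights, keeping the
    # graph alive for later steps
    grads = torch.autograd.grad(x.sum(), (weight_0, weight_1),
                                retain_graph=True)
    step_grads.append(grads)
    steps += 1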

Tensorflow GAN discriminator loss NaN due to negative discriminator output

In my implementation of a GAN network, the output of the discriminator is something like 2.05145e+07, which leads to 1 - disc_output = 1 - 2.05145e+07 = -2.05145e+07 (a negative number), and therefore log(1 - disc_output) is NaN.
I am not the first one with this kind of problem. One solution is to only allow positive values inside the log, as done here.
Does anyone know a better solution to this? Maybe some different loss function?
Because the discriminator returns a probability value, its output must be between 0 and 1. Try applying sigmoid (https://www.tensorflow.org/api_docs/python/tf/sigmoid) before using the discriminator outputs.
Additionally, as others did, I suggest using tf.log(tf.maximum(x, 1e-9)) in case of numerical instability.
There are standard techniques to avoid log numerical instability. For example, what you often care about is the loss (which is a function of the log), not the log value itself. For instance, with logistic loss:
For brevity, let x = logits, z = labels. The logistic loss is
z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
= max(x, 0) - x * z + log(1 + exp(-abs(x)))
These tricks are already implemented in standard tensorflow losses (like tf.losses.sigmoid_cross_entropy). Note that the naive solution of taking a max or a min inside of the log is not a good solution, since there aren't meaningful gradients in the saturated regions: for instance, d/dx[max(x, 0)] = 0 for x < 0, which means there won't be gradients in the saturated region.
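For instance, a minimal sketch using the built-in numerically stable form (the values are made up):
import tensorflow as tf

# raw, unbounded discriminator outputs and {0, 1} labels
logits = tf.constant([2.0e7, -3.0])
labels = tf.constant([1.0, 0.0])

# never computes the log of a non-positive number internally
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)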
TensorFlow has GAN support with tf.contrib.gan. These losses already implement all of the standard numerical stability tricks and avoid you having to reinvent the wheel.
tfgan = tf.contrib.gan
tfgan.losses.minimax_discriminator_loss(...)
See https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gan for more details.
