If I need to backpropagate through a neural network twice and I don't use retain_graph=True, I get an error.
Why? I realize it is convenient to keep the intermediate values from the first backpropagation so they can be reused for the second one. However, why aren't they simply recalculated, the way they were originally computed during the first backpropagation?
By default, PyTorch doesn't keep the intermediate values, because one of PyTorch's main features is dynamic computation graphs: after backpropagation the graph is freed and all of its intermediate buffers are destroyed.
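A minimal sketch of both the error and the workaround:

import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

y.backward(retain_graph=True)  # keep the intermediate buffers alive
y.backward()                   # works; x.grad accumulates to 4 + 4 = 8
# Without retain_graph=True on the first call, the second call raises:
# RuntimeError: Trying to backward through the graph a second time ...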
The zero_grad() method is used when we want to "conserve" RAM with massive datasets. There is already an answer on that here: Why do we need to call zero_grad() in PyTorch?
Gradients are used to update the parameters during backprop. But if we delete the gradients by setting them to 0, how can the optimization happen during the backward pass?
There are models where we use this method and optimization still occurs. How is this possible?
You don't "delete the gradients", you simply clear the cache of gradients from previous iteration. The reason of existence of this cache is ease of implementation of specific methods such as simulation of big batch without memory to actually use the whole batch.
I'm trying to implement CLIP-based style transfer. The full code is here
For some unknown reason, the optimizer doesn't change the values of the latent tensor. I can confirm that the values are equal before and after the iteration steps. I've also made sure that requires_grad is True, and I've tried various loss functions and optimizers.
Any idea why it doesn't work?
I see some problems with your code.
The optimizer takes in parameters, and parameters are supposed to be leaf nodes in your computation graph. In your case, you tell the optimizer to use latent as the parameter, but it must have complained, since latent is the result of some computations.
So you detached latent, which makes it a leaf node. But detaching creates a new latent variable that is cut off from the computation graph, so gradients from the loss can no longer reach it.
Also, to optimize a parameter, the loss should be a function of that parameter. I can't see whether you are actually using latent in your loss computation, so that could be another issue.
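For reference, a working leaf-parameter setup might look like this (a sketch; the initialization and loss are stand-ins for whatever the real code computes):

import torch

# detach().clone() turns a computed tensor into a leaf; requires_grad_(True)
# then makes it trainable. torch.randn stands in for the real initialization.
latent = torch.randn(1, 256).detach().clone().requires_grad_(True)
optimizer = torch.optim.Adam([latent], lr=0.05)

for _ in range(10):
    optimizer.zero_grad()
    loss = latent.pow(2).mean()  # stand-in loss; it must be computed FROM latent
    loss.backward()
    optimizer.step()             # the values of latent now actually change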
I think I've found the issue. On line 86, where I compute a one-hot vector from latent in order to decode it and pass it to CLIP, the graph breaks: vae_make_onehot returns a leaf tensor.
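For anyone hitting the same problem: a common workaround for this kind of graph break is a straight-through estimator (a sketch, not necessarily the exact fix used in the linked code):

import torch
import torch.nn.functional as F

def onehot_straight_through(latent):
    # Hard one-hot values in the forward pass, soft gradients in the backward pass.
    soft = F.softmax(latent, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), latent.shape[-1]).to(soft.dtype)
    # The value equals hard, but gradients flow through soft:
    return (hard - soft).detach() + soft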
I need to backpropagate through my neural network multiple times, so I set backward(retain_graph=True).
However, this is causing
RuntimeError: CUDA out of memory
I don't understand why this is.
Is the number of variables or weights doubling? Shouldn't the amount of memory used remain the same regardless of how many times backward() is called?
The source of the issue:
You are right that, in theory, the memory should not increase no matter how many times we call the backward function.
However, your issue is not caused by the backpropagation itself, but by the retain_graph argument, which you have set to True when calling the backward function.
When you run your network by passing in a set of input data, you call the forward function, which creates a "computation graph".
A computation graph contains all the operations that your network has performed.
Then, when you call the backward function, the saved computation graph is essentially run backward to determine which weights should be adjusted and in which directions (that is, the gradients).
So PyTorch keeps the computation graph in memory in order to be able to call the backward function.
After the backward function has been called and the gradients have been calculated, the graph is freed from memory, as explained in the docs (https://pytorch.org/docs/stable/autograd.html):
retain_graph (bool, optional) – If False, the graph used to compute the grad will be freed. Note that in nearly all cases setting this option to True is not needed and often can be worked around in a much more efficient way. Defaults to the value of create_graph.
Then, usually during training, we apply the gradients to the network in order to minimise the loss, re-run the network, and thereby create a new computation graph. So we only have one graph in memory at any given time.
The issue:
If you set retain_graph to True when you call the backward function, you will keep in memory the computation graphs of ALL the previous runs of your network.
And since every run of your network creates a new computation graph, storing them all means you can, and eventually will, run out of memory.
On the first run of your network, you have only one graph in memory. But by the 10th run you have 10 graphs in memory, and by the 10,000th run you have 10,000. That is not sustainable, and it is understandable why the docs recommend against it.
So even if it may seem that the issue is the backpropagation, it is actually the storing of the computation graphs; and since we usually call the forward and backward functions once per iteration or network run, the confusion is understandable.
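A toy scenario where this actually happens is a value carried across iterations, as with an RNN hidden state (the model and sizes here are stand-ins):

import torch
from torch import nn

model = nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
state = torch.zeros(1, 10)  # carried from one iteration to the next

for step in range(10000):
    state = model(state)              # the new graph is CHAINED to all old ones
    loss = state.pow(2).mean()
    loss.backward(retain_graph=True)  # required here, and keeps every graph alive
    optimizer.step()
    optimizer.zero_grad()
    # Memory grows with step: roughly step + 1 graphs are alive at this point.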
Solution:
What you need to do is find a way to make your network and architecture work without using retain_graph. Using it makes it almost impossible to train your network, since each iteration increases your memory usage and decreases your training speed, and in your case it even causes you to run out of memory.
You did not mention why you need to backpropagate multiple times, but it is rarely needed, and I do not know of a case where it cannot be worked around. For example, if you need to access variables or weights from previous runs, you could save them into variables and access them later, instead of attempting a new backpropagation (see the sketch after this section).
You likely need to backpropagate multiple times for some other reason, but believe me (I have been in this situation), there is most likely a way to accomplish what you are trying to do without storing the previous computation graphs.
If you want to share why you need to backpropagate multiple times, maybe others and I can help you further.
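For example, breaking the chain with detach keeps the values from previous runs without keeping their graphs alive (same stand-in names as the sketch above):

import torch
from torch import nn

model = nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
state = torch.zeros(1, 10)

for step in range(10000):
    state = model(state.detach())  # detach keeps the VALUE but drops the old graph
    loss = state.pow(2).mean()
    loss.backward()                # no retain_graph needed; the graph is freed here
    optimizer.step()
    optimizer.zero_grad()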
More about the backward process:
If you want to learn more about the backward process, it computes what is called the "vector-Jacobian product". It is a bit complex and is handled by PyTorch. I do not yet fully understand it, but this resource seems like a good starting point, as it is less intimidating than the PyTorch documentation (in terms of algebra): https://mc.ai/how-pytorch-backward-function-works/
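In short: for y = f(x) with Jacobian J, calling y.backward(v) computes the product J^T v rather than the full Jacobian. A tiny demonstration:

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2                       # elementwise, so the Jacobian is 2 * I
v = torch.tensor([1.0, 0.5, 0.25])
y.backward(v)                   # computes J^T v, not J itself
print(x.grad)                   # tensor([2.0000, 1.0000, 0.5000])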
I am reproducing the original paper on Elman networks (Elman, 1990), which, together with Jordan networks, are known as Simple Recurrent Networks (SRNs). As far as I can tell, my code correctly implements the forward propagation, while the learning phase is incomplete. I am implementing the network using the low-level API of TensorFlow, in Python.
The Elman network is an artificial neural network composed of two layers, where the hidden layer gets copied as a "context layer" that is concatenated with the inputs the next time we run forward propagation. Initially, the context layer is initialized with activations of 0.5, and its copy connection has a fixed weight of 1.0.
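In NumPy terms, a single forward step looks roughly like this (a sketch; the shapes and the sigmoid activation are illustrative, and the weight names match the variables mentioned below):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_forward_step(x, context, w_x, w_c, b_1):
    # Hidden activation from the current input and the previous context
    a_1 = sigmoid(x @ w_x + context @ w_c + b_1)
    # The new hidden state is copied (fixed weight 1.0) to become the next context
    return a_1, a_1.copy()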
My question is about the calculation of gradients during backpropagation of the network. In my code, I use tf.assign to update the context units with the activations from the hidden layer. Before adding the assignment operator to the graph, TensorBoard shows that GradientDescentOptimizer will compute gradients for all the variables in the graph. After I include this statement, gradients don't show up for the variables in nodes coming "before" the assignment. In other words, I would expect b_1, w_x, w_c, and a_1 to show up in the list of gradients computed by the optimizer, even with the assignment in the graph.
I believe my implementation of the forward propagation is correct, because I compared the final activation values from the tf.assign version with those from another implementation using plain NumPy arrays. The values are equal.
Finally: is this behavior intentional or am I doing something wrong?
Here's a notebook with the implementation of the network as I described:
https://gist.github.com/Irio/d00b9661023923be7c963395483dfd73
References
Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211. Retrieved from https://crl.ucsd.edu/~elman/Papers/fsit.pdf
No, assign operations do not backpropagate a gradient. That is on purpose, as assigning a value to a variable is not a differentiable operation. However, you probably do not want the gradient of the assignment, but rather the gradient of the new value of the variable. You can use that gradient; just do not take it from the output of an assignment operation. For example, you can do something like this:
import tensorflow as tf

# var_initial_value is a placeholder for however the variable is initialized
my_var = tf.Variable(var_initial_value, name="MyVar")
# Compute the new value for the variable
new_my_var = ...
# Make the assignment operation a control dependency
with tf.control_dependencies([tf.assign(my_var, new_my_var)]):
    # Passing the value through identity ensures the assignment is executed
    # while keeping the result differentiable
    new_my_var = tf.identity(new_my_var)
# Continue using the value
This means that my_var is not used in the backpropagation, so it will not be updated by an optimizer. However, I suppose that if you are assigning values to my_var yourself, then it should not be updated by the optimizer anyway.
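Applied to the Elman network from the question, the same pattern might look like this (a sketch; the sizes and initializations are assumptions, while the variable names follow the question):

import tensorflow as tf

n_inputs, n_hidden = 4, 8  # arbitrary sizes for illustration
x = tf.placeholder(tf.float32, shape=[1, n_inputs])
context = tf.Variable(tf.fill([1, n_hidden], 0.5), trainable=False)
w_x = tf.Variable(tf.random_normal([n_inputs, n_hidden]))
w_c = tf.Variable(tf.random_normal([n_hidden, n_hidden]))
b_1 = tf.Variable(tf.zeros([n_hidden]))

# Hidden activation from the current input and the previous context
a_1 = tf.sigmoid(tf.matmul(x, w_x) + tf.matmul(context, w_c) + b_1)

# Copying the hidden activations into the context as a control dependency
# keeps gradients flowing through a_1 to w_x, w_c, and b_1
with tf.control_dependencies([tf.assign(context, a_1)]):
    a_1 = tf.identity(a_1)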
I wrote a custom layer that is part of a neural network, and it contains some operations that I am using for the first time, such as tf.scan and tf.slice.
I can easily test that the forward pass works and that its output makes sense, but how do I know that it will still work during learning, when it has to do backpropagation? Can I safely assume that everything is going to be fine just because the results make sense in the forward pass?
I was thinking that one possibility might be to create a neural network, replace one or two layers with the custom ones I have just created, train it, and see what happens. However, besides taking quite a long time, the network might learn in the other layers while my custom layer does not work well anyway.
In conclusion: is there any way I can check that backpropagation will work well, so that I won't have any problems during learning in this layer?
As far as I know, almost all TensorFlow ops are differentiable, including ops such as tf.abs or tf.where, and gradients flow correctly through them. TensorFlow has an automatic differentiation engine that takes any TensorFlow graph and computes derivatives w.r.t. the desired variables.
So if your graph is composed of TensorFlow ops, I wouldn't worry about the gradients being wrong (if you post the code of your layer, I can expand further). However, there are still issues like numerical stability, which can make an otherwise mathematically sound operation fail in practice (e.g. a naive softmax computation, or tf.exp in your graph in general).
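For example, the naive softmax overflows for large logits, while the standard max-subtraction trick (sketched here) is safe:

import tensorflow as tf

def stable_softmax(logits):
    # Subtracting the row max keeps tf.exp from overflowing on large logits;
    # the result is mathematically identical to the naive softmax.
    shifted = logits - tf.reduce_max(logits, axis=-1, keepdims=True)
    exp = tf.exp(shifted)
    return exp / tf.reduce_sum(exp, axis=-1, keepdims=True)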
If you still want to examine your gradients by hand, you can compute the derivatives in your graph using the tf.gradients op, which will give you the gradients you want, and you can then check by hand whether TensorFlow did the differentiation correctly (see https://www.tensorflow.org/api_docs/python/tf/gradients).
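A minimal example of such a check (the tf.abs graph stands in for your custom layer):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 3])
out = tf.reduce_sum(tf.abs(x))  # stand-in for the custom layer plus a loss

grads = tf.gradients(out, [x])  # symbolic derivative of out w.r.t. x

with tf.Session() as sess:
    g = sess.run(grads, feed_dict={x: [[1.0, -2.0, 3.0]]})
    print(g)  # [array([[ 1., -1.,  1.]], dtype=float32)] == sign(x), as expected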