Tensorflow, tf.gradients calculations - python

I am learning how to use TensorFlow, and I am really stuck on one particular point that I cannot make sense of. Imagine I have a 5-layer network whose output is represented by output. Now suppose I want to find the gradient of output with respect to layer_2. For that purpose, the code I would write in TensorFlow is something like:
gradients_i_want = tf.gradients(output, layer_2)
In theory, this gradient should be calculated via the chain rule. My question is whether TensorFlow actually computes these gradients via the chain rule, or whether it just takes the derivative of output with respect to layer_2 directly.

TensorFlow will create a graph for your model, where each node is an operation (e.g. addition, multiplication, or a combination of them). Basic ops have manually defined gradient functions, and those functions are used when applying the chain rule while traveling backwards through the graph.
If you write your own custom op, you might need to also write the corresponding gradient function.
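To make the idea above concrete, here is a minimal pure-Python sketch. It is not TensorFlow's implementation; the names OPS, forward, and backward are hypothetical and chosen only for illustration. Each op carries a hand-written local derivative, and the backward pass multiplies those derivatives in reverse order, which is the chain rule.

```python
import math

# Each op maps to (forward function, hand-written local derivative).
# This registry is an illustration only, not a TensorFlow API.
OPS = {
    "square": (lambda x: x * x, lambda x: 2.0 * x),
    "sin": (math.sin, math.cos),
    "exp": (math.exp, math.exp),
}


def forward(op_names, x):
    """Apply the ops in order and remember the input seen by each op."""
    inputs = []
    for name in op_names:
        inputs.append(x)
        x = OPS[name][0](x)
    return x, inputs


def backward(op_names, inputs, upstream=1.0):
    """Walk the ops in reverse, multiplying by each local derivative."""
    grad = upstream
    for name, op_input in zip(reversed(op_names), reversed(inputs)):
        grad *= OPS[name][1](op_input)
    return grad


# output = exp(sin(x ** 2)), a stand-in for a small stack of layers.
layers = ["square", "sin", "exp"]
x0 = 0.7
output, saved_inputs = forward(layers, x0)
d_output_d_x = backward(layers, saved_inputs)

# Gradient with respect to an intermediate value (the output of "square"),
# analogous to asking for tf.gradients(output, layer_2): only the ops
# after that point take part in the backward walk.
d_output_d_square = backward(layers[1:], saved_inputs[1:])
```

In this sketch, asking for the gradient with respect to an intermediate value is simply a shorter backward walk, so "using the chain rule" and "taking the derivative of output with respect to layer_2" describe the same computation.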

Related

Do TensorFlow optimizers learn gradients in a graph with assignments?

I am reproducing the original paper of Elman networks (Elman, 1990) – together with Jordan networks, known as Simple Recurrent Networks (SRN). As far as I can understand, my code correctly implements the forward propagation, while the learning phase is incomplete. I am implementing the network using the low-level API of TensorFlow, in Python.
The Elman network is an artificial neural network composed of two layers, where the hidden layer gets copied as a "context layer," which is concatenated with the inputs the next time we run forward propagation through the network. Initially, the context layer is initialized with activation = 0.5 and has a fixed weight of 1.0.
My question is on the calculation of gradients, in the backpropagation of the network. In my code, I use tf.assign to update context units with the activations from the hidden layer. Before adding the assignment operator to the graph, TensorBoard shows that GradientDescentOptimizer will learn gradients from all the variables in the graph. After I include this statement, gradients don't show up for the variables in nodes coming "before" the assignment. In other words, I would expect b_1, w_x, w_c, and a_1 to show up in the list of gradients learned by the optimizer, even with the assignment in the graph.
I believe my implementation for the forward propagation is correct because I compared final values for activations using tf.assign and values from another implementation, using plain Numpy arrays. The values are equal.
Finally: is this behavior intentional or am I doing something wrong?
Here's a notebook with the implementation of the network as I described:
https://gist.github.com/Irio/d00b9661023923be7c963395483dfd73
References
Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211. Retrieved from https://crl.ucsd.edu/~elman/Papers/fsit.pdf
No, assign operations do not backpropagate a gradient. That is on purpose, as assigning a value to a variable is not a differentiable operation. However, you probably do not want the gradient of the assignment, but the gradient of the new value of the variable. You can use that gradient, just do not use it as the output of an assignment operation. For example, you can do something like this:
import tensorflow as tf

# var_initial_value stands for whatever initial value you use
my_var = tf.Variable(var_initial_value, name="MyVar")
# Compute new value for the variable
new_my_var = ...
# Make the assignment operation a control dependency
with tf.control_dependencies([tf.assign(my_var, new_my_var)]):
    # Passing the value through identity here will ensure assignment is done
    # while keeping it differentiable
    new_my_var = tf.identity(new_my_var)
# Continue using the value
This would mean that my_var is not used in the backpropagation, and so it will not be updated by an optimizer. However, I suppose if you are assigning values to my_var yourself, then it should not be updated by the optimizer.

How to make sure Tensorflow's backpropagation works?

I wrote a custom layer that is part of a neural network and it contains some operations that I am using for the first time such as tf.scan and tf.slice.
I can easily test that the forward pass works and it makes sense, but how do I know that it will still work during the learning, when it has to do backpropagation? Can I safely assume that everything is going to be fine because the results I get make sense in the forward pass?
I was thinking that one possibility might be to create a neural network, replace one or two layers with the custom ones I have just created, train it, and see what happens. However, besides taking quite a long time, this would not be conclusive: the network may learn in the other layers while my custom layer still does not work well.
In conclusion, is there any way I can see that back-propagation will work well and I won't have any problems during the learning in this layer?
As far as I know, almost all TensorFlow ops are differentiable, including ops such as tf.abs or tf.where, and gradients flow correctly through them. TensorFlow has an automatic differentiation engine that takes any TensorFlow graph and computes derivatives w.r.t. the desired variables.
So if your graph is composed of TensorFlow ops, I wouldn't worry about the gradients being wrong (if you post the code of your layer, I could expand further). However, there are still issues like numerical stability, which can make an otherwise mathematically sound operation fail in practice (e.g. a naive softmax computation, or tf.exp in your graph in general). Apart from that, TensorFlow differentiation should be correct and taken care of, from the user's point of view.
If you still want to examine your gradients by hand, you can compute the derivatives in your graph using tf.gradients op, which will get you the gradients that you wish and you can check by hand if TensorFlow did the differentiation correctly. (See https://www.tensorflow.org/api_docs/python/tf/gradients)
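One common way to do such a check without deriving anything by hand is to compare the analytic gradient against a finite-difference estimate. The sketch below is pure Python and only illustrates the technique; layer and layer_gradient are hypothetical stand-ins for a custom layer and for the values tf.gradients would return. TensorFlow also ships a helper along these lines, tf.test.compute_gradient, whose signature depends on the version you use.

```python
import math


def numerical_gradient(f, xs, eps=1e-6):
    """Central-difference estimate of df/dx_i for a scalar function f."""
    grads = []
    for i in range(len(xs)):
        plus = list(xs)
        minus = list(xs)
        plus[i] += eps
        minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2.0 * eps))
    return grads


def layer(xs):
    """Hypothetical stand-in for a custom layer: sum of tanh(x_i) ** 2."""
    return sum(math.tanh(x) ** 2 for x in xs)


def layer_gradient(xs):
    """Analytic gradient; in practice this would come from tf.gradients."""
    return [2.0 * math.tanh(x) * (1.0 - math.tanh(x) ** 2) for x in xs]


point = [0.3, -1.2, 2.0]
numeric = numerical_gradient(layer, point)
analytic = layer_gradient(point)

# Largest absolute disagreement between the two gradients.
max_error = max(abs(n - a) for n, a in zip(numeric, analytic))
gradients_agree = max_error < 1e-6
```

A large max_error at several test points suggests that the gradient is wrong or numerically unstable there. A small one is good evidence, though not proof, that backpropagation through the layer behaves as expected.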

PyTorch Linear Algebra Gradients

I'm looking to back-propagate gradients through a singular value decomposition for regularisation purposes. PyTorch currently does not support backpropagation through a singular value decomposition.
I know that I could write my own custom function that operates on a Variable; takes its .data tensor, applies the torch.svd to it, wraps a Variable around its singular values and returns it in the forward pass, and in the backward pass applies the appropriate Jacobian matrix to the incoming gradients.
However, I was wondering whether there is a more elegant (and potentially faster) solution, where I could get around the "Type Variable doesn't implement stateless method svd" error directly, call LAPACK, etc.?
If someone could guide me through the appropriate steps and source files I need to look at, I'd be very grateful. I suppose these steps would similarly apply to other linear algebra operations which have no associated backward method currently.
torch.svd with forward and backward pass is now available in the PyTorch master branch:
http://pytorch.org/docs/master/torch.html#torch.svd
You need to install PyTorch from source:
https://github.com/pytorch/pytorch/#from-source
PyTorch's torch.linalg.svd operation supports gradient calculations, but note:
Gradients computed using U and Vh may be unstable if input is not full rank or has non-unique singular values.

What is the meaning of 'self.diff' in 'forward' of a custom python loss layer for Caffe training?

I am trying to use a custom Python loss layer. When I checked several examples online, such as the Euclidean loss layer and the Dice loss layer, I noticed that a variable 'self.diff' is always assigned in 'forward'. For the Dice loss layer in particular:
self.diff[...] = bottom[1].data
I wonder whether there is any reason that this variable has to be introduced in forward, or whether I can just use bottom[1].data to access the ground truth label.
In addition, what is the point of top[0].reshape(1) in reshape, since by definition in forward the loss output is already a scalar?
You need to set the diff attribute of the layer for overall consistency and as part of the data communication protocol; it is then available in other places in the class, and anywhere the loss layer object appears. bottom is a local parameter, and is not available elsewhere in the same form.
In general, the code is expandable for a variety of applications and more complex computations; the reshaping is part of this, ensuring that the returned value is scalar, even if someone expands the inputs to work with vectors or matrices.
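The role of self.diff is easier to see in a stripped-down sketch. The code below does not use Caffe at all: Blob and EuclideanLossSketch are hypothetical stand-ins written with plain Python lists, loosely modeled on the Euclidean loss example the question mentions. It assumes each list element is one sample. forward caches the difference in self.diff, and backward reuses that cached value instead of recomputing it.

```python
class Blob:
    """Hypothetical stand-in for a Caffe blob: values plus their gradients."""

    def __init__(self, data):
        self.data = list(data)
        self.diff = [0.0] * len(self.data)


class EuclideanLossSketch:
    """Mimics the forward/backward split of a Python loss layer."""

    def forward(self, bottom, top):
        # Cache prediction - label so that backward can reuse it.
        self.diff = [p - y for p, y in zip(bottom[0].data, bottom[1].data)]
        num = len(self.diff)  # assumption: one list element per sample
        top[0].data[0] = sum(d * d for d in self.diff) / num / 2.0

    def backward(self, top, propagate_down, bottom):
        num = len(self.diff)
        # Gradient is +diff/num for the prediction and -diff/num for the label.
        for i, sign in enumerate((1.0, -1.0)):
            if propagate_down[i]:
                bottom[i].diff = [sign * d / num for d in self.diff]


predictions = Blob([1.0, 2.0, 4.0])
labels = Blob([1.0, 0.0, 1.0])
loss = Blob([0.0])  # one-element output, playing the role of top[0].reshape(1)

loss_layer = EuclideanLossSketch()
loss_layer.forward([predictions, labels], [loss])
loss_layer.backward([loss], [True, False], [predictions, labels])
```

Here the gradient is propagated only to the predictions, because propagate_down is False for the labels. Whether a given layer really needs the cached attribute depends on what its backward method reads.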

How does TensorFlow calculate the gradients for the tf.train.GradientDescentOptimizer?

I am trying to understand how TensorFlow computes the gradients for the tf.train.GradientDescentOptimizer.
If I understand section 4.1 in the TensorFlow whitepaper correctly, it computes the gradients via backpropagation, by adding nodes to the TensorFlow graph which compute the derivative of a node in the original graph.
When TensorFlow needs to compute the gradient of a tensor C with respect to some tensor I on which C depends, it first finds the path in the computation graph from I to C. Then it backtracks from C to I, and for each operation on the backward path it adds a node to the TensorFlow graph, composing the partial gradients along the backwards path using the chain rule. The newly added node computes the “gradient function” for the corresponding operation in the forward path. A gradient function may be registered by any operation. This function takes as input not only the partial gradients computed already along the backward path, but also, optionally, the inputs and outputs of the forward operation.
[Section 4.1 TensorFlow whitepaper]
Question 1: Is there a second node implementation for each TensorFlow node which represents the derivative of the original TensorFlow node?
Question 2: Is there a way to visualize which derivative nodes get added to the graph (or any logs)?
Each node gets a corresponding method that computes backprop values (registered using something like @ops.RegisterGradient("Sum") in Python).
You can visualize the graph using the method here.
However, note that since the automatic differentiation code is meant to work for a range of conditions, the graph it creates is quite complicated and not very useful to look at. It's not uncommon to have 10 op nodes for a simple gradient calculation that could be implemented with 1-2 nodes.
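The following pure-Python sketch illustrates both points: gradient functions registered per op type, and gradient construction adding new nodes to the graph. It is a toy model, not TensorFlow code. Graph, register_gradient, and add_gradients are hypothetical names, the "gradients/" prefix only imitates TensorFlow's naming, and the sketch handles a linear chain of ops only.

```python
class Graph:
    """Toy graph: an ordered list of (name, op type, input names)."""

    def __init__(self):
        self.nodes = []

    def add(self, op, inputs, name):
        self.nodes.append((name, op, tuple(inputs)))
        return name


GRADIENT_BUILDERS = {}


def register_gradient(op):
    """Decorator that registers a gradient builder for one op type."""
    def decorator(fn):
        GRADIENT_BUILDERS[op] = fn
        return fn
    return decorator


@register_gradient("Square")
def _square_grad(graph, node, upstream):
    name, _, (x,) = node
    # d(x^2)/dx = 2x, then multiply by the upstream gradient.
    two_x = graph.add("MulConst2", [x], "gradients/" + name + "/two_x")
    return graph.add("Mul", [upstream, two_x], "gradients/" + name + "/mul")


@register_gradient("Exp")
def _exp_grad(graph, node, upstream):
    name, _, _ = node
    # d(exp(x))/dx = exp(x), so the forward output is reused as an input.
    return graph.add("Mul", [upstream, name], "gradients/" + name + "/mul")


def add_gradients(graph, forward_path):
    """Backtrack along the path, adding gradient nodes for each op."""
    upstream = graph.add("OnesLike", [forward_path[-1][0]], "gradients/ones")
    for node in reversed(forward_path):
        upstream = GRADIENT_BUILDERS[node[1]](graph, node, upstream)
    return upstream


g = Graph()
g.add("Placeholder", [], "x")
g.add("Square", ["x"], "sq")
g.add("Exp", ["sq"], "y")

forward_path = g.nodes[1:]  # the ops between x and y (a copy of the list)
nodes_before = len(g.nodes)
grad_name = add_gradients(g, forward_path)
added = [name for name, _, _ in g.nodes[nodes_before:]]
```

Even in this toy model, two forward ops produce four gradient nodes. In TensorFlow 1.x, comparing [op.name for op in tf.get_default_graph().get_operations()] before and after calling tf.gradients should give the analogous list of added ops.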
