I am reproducing the original paper of Elman networks (Elman, 1990) – together with Jordan networks, known as Simple Recurrent Networks (SRN). As far as I can understand, my code correctly implements the forward propagation, while the learning phase is incomplete. I am implementing the network using the low-level API of TensorFlow, in Python.
The Elman network is an artificial neural network composed of two layers, where the hidden layer gets copied as a "context layer," which concatenates with the inputs the next time we run forward propagate the network. Initially, the context layer is initialized with activation = 0.5 and has a fixed weight of 1.0.
My question is on the calculation of gradients, in the backpropagation of the network. In my code, I use tf.assign to update context units with the activations from the hidden layer. Before adding the assignment operator to the graph, TensorBoard shows that GradientDescentOptimizer will learn gradients from all the variables in the graph. After I include this statement, gradients don't show up for the variables in nodes coming "before" the assignment. In other words, I would expect b_1, w_x, w_c, and a_1 to show up in the list of gradients learned by the optimizer, even with the assignment in the graph.
I believe my implementation for the forward propagation is correct because I compared final values for activations using tf.assign and values from another implementation, using plain Numpy arrays. The values are equal.
Finally: is this behavior intentional or am I doing something wrong?
Here's a notebook with the implementation of the network as I described:
https://gist.github.com/Irio/d00b9661023923be7c963395483dfd73
References
Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211. Retrieved from https://crl.ucsd.edu/~elman/Papers/fsit.pdf
No, assign operations do not backpropagate a gradient. That is on purpose, as assigning a value to a variable is not a differentiable operation. However, you probably do not want the gradient of the assignment, but the gradient of the new value of the variable. You can use that gradient, just do not use it as the output of an assignment operation. For example, you can do something like this:
import tensorflow as tf
my_var = tf.Variable(var_intial_value, name="MyVar")
# Compute new value for the variable
new_my_var = ...
# Make the assignment operation a control dependency
with tf.control_dependencies([tf.assign(my_var, new_my_var)]):
# Passing the value through identity here will ensure assignment is done
# while keeping it differentiable
new_my_var = tf.identity(new_my_var)
# Continue using the value
This would mean that my_var is not used in the backpropagation, and so it will not be updated by an optimizer. However, I suppose if you are assigning values to my_var yourself, then it should not be updated by the optimizer.
Related
Suppose you have a neural network with 2 layers A and B. A gets the network input. A and B are consecutive (A's output is fed into B as input). Both A and B output predictions (prediction1 and prediction2) Picture of the described architecture
You calculate a loss (loss1) directly after the first layer (A) with a target (target1). You also calculate a loss after the second layer (loss2) with its own target (target2).
Does it make sense to use the sum of loss1 and loss2 as the error function and back propagate this loss through the entire network? If so, why is it "allowed" to back propagate loss1 through B even though it has nothing to do with it?
This question is related to this question
https://datascience.stackexchange.com/questions/37022/intuition-importance-of-intermediate-supervision-in-deep-learning
but it does not answer my question sufficiently.
In my case, A and B are unrelated modules. In the aforementioned question, A and B would be identical. The targets would be the same, too.
(Additional information)
The reason why I'm asking is that I'm trying to understand LCNN (https://github.com/zhou13/lcnn) from this paper.
LCNN is made up of an Hourglass backbone, which then gets fed into MultiTask Learner (creates loss1), which in turn gets fed into a LineVectorizer Module (loss2). Both loss1 and loss2 are then summed up here and then back propagated through the entire network here.
Even though I've visited several deep learning lectures, I didn't know this was "allowed" or makes sense to do. I would have expected to use two loss.backward(), one for each loss. Or is the pytorch computational graph doing something magical here? LCNN converges and outperforms other neural networks which try to solve the same task.
Yes, It is "allowed" and also makes sense.
From the question, I believe you have understood most of it so I'm not going to details about why this multi-loss architecture can be useful. I think the main part that has made you confused is why does "loss1" back-propagate through "B"? and the answer is: It doesn't. The fact is that loss1 is calculated using this formula:
loss1 = SOME_FUNCTION(label, y_hat)
and y_hat(prediction1) is only dependent on layers before it. Hence, the gradient of this loss only flows through layers before this section (A) and not the ones after it (B). To better understand this, you could again check the mathematics of artificial neural networks. The loss2, on the other hand, back-propagates through all of the network (including part A). When you use a cumulative loss (Loss = loss1 + loss2), a framework like Pytorch will automatically follow the gradient of every predicted label to the first layer.
If I need to backpropagate through a neural network twice and I don't use retain_graph=True, I get an error.
Why? I realize it is nice to keep the intermediate variables used for the first backpropagation to be reused for the second backpropagation. However, why aren't they simply recalculated, like they were originally calculated in the first backpropagation?
By default, PyTorch doesn't store intermediate gradients, because the PyTorch's main feature is Dynamic Computational Graphs, so after backpropagation the graph will be freed all the intermediate buffers will be destroyed.
I wrote a custom layer that is part of a neural network and it contains some operations that I am using for the first time such as tf.scan and tf.slice.
I can easily test that the forward pass works and it makes sense, but how do I know that it will still work during the learning, when it has to do backpropagation? Can I safely assume that everything is going to be fine because the results I get make sense in the forward pass?
I was thinking that one possibility might be to create a neural network, replace one or two layers with the custom ones I have just created, train it, and see what happens. However, despite this would take quite a long time, the network may learn in the other layers whereas in my custom layer it may not work well anyway.
In conclusion, is there any way I can see that back-propagation will work well and I won't have any problems during the learning in this layer?
As far as I know, almost all TensorFlow ops are differentiable, including ops such as tf.abs or tf.where and gradient flows correctly through them. TensorFlow has an automatic differentiation engine, that takes any TensorFlow graph and computes derivatives w.r.t. desired variables.
So if your graph is composed of TensorFlow ops I wouldn't worry about the gradients being wrong (if you would post the code of your layer, I could expand further). However, there are still issues like numerical stability which can make otherwise mathematically sound operation still fail in practice (e.g. naive softmax computation, or tf.exp in your graph in general). Apart from that, TensorFlow differentiation should be correct and taken care of, from the user's point of view.
If you still want to examine your gradients by hand, you can compute the derivatives in your graph using tf.gradients op, which will get you the gradients that you wish and you can check by hand if TensorFlow did the differentiation correctly. (See https://www.tensorflow.org/api_docs/python/tf/gradients)
I am learning how to use Tensorflow and at this 1 particular point I am really stuck and can not make a sense around it. Imagine I have a 5 layer network and the output is represented by output. Now suppose I want to find the gradient of output with respect to layer_2. For that purpose, the code I will write in Tensorflow will be something like:
gradients_i_want = tf.gradients(output, layer_2)
Theoretically, this gradient should be calculated via chain rule. I want to ask, that whether Tensorflow calculates these gradients via chain rule or it will just take the derivative of output with respect to layer_2
Tensorflow will create a graph for your model, where each node is an operation (e.g. addition, multiplication, or a combination of them). Basic ops have manually defined gradient functions, and those functions will be used when applying the chain rule while traveling backwards through the graph.
If you write your own custom op, you might need to also write the corresponding gradient function.
I try to use a custom python loss layer. When I checked several examples online, such as:
Euclidean loss layer, Dice loss layer,
I notice a variable 'self.diff' is always assigned in 'forward'. Especially for the Dice loss layer,
self.diff[...] = bottom[1].data
I wonder if there is any reason that this variable has to be introduced in forward or I can just use bottom[1].data to access ground truth label?
In addition, what is the point of top[0].reshape(1) in reshape, since by definition in forward, the loss output is a scalar itself.
You need to set the diff attribute of the layer for overall consistency and data communication protocol; it's available other places in the class, and anywhere the loss layer object appears. bottom is a local parameter, and is not available elsewhere in the same form.
In general, the code is expandable for a variety of applications and more complex computations; the reshaping is part of this, ensuring that the returned value is scalar, even if someone expands the inputs to work with vectors or matrices.