TensorFlow - does autodiff relieve us from the back-prop implementation? - python

Question
When using TensorFlow, for instance when implementing a custom neural network layer, what is the standard practice for implementing back-propagation? Do we not have to work out the auto-differentiation formulas ourselves?
Background
With numpy, when creating a layer (e.g. matmul), the back-propagation gradient is first derived analytically and coded accordingly.
def forward(self, X):
    self._X = X
    np.matmul(self._X, self.W.T, out=self._Y)
    return self._Y

def backward(self, dY):
    """dY = dL/dY is the Jacobian of the loss L w.r.t. the matmul output Y."""
    self._dY = dY
    return np.matmul(self._dY, self.W, out=self._dX)
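For reference, here is a self-contained sketch along those lines (the MatMul class name, the shapes, and the finite-difference check are illustrative additions, not from the original post); it verifies the hand-derived backward pass against a numerical gradient:

import numpy as np

class MatMul:
    """Y = X @ W.T with a hand-derived backward pass (illustrative)."""
    def __init__(self, n_in, n_out):
        self.W = 0.1 * np.random.randn(n_out, n_in)

    def forward(self, X):
        self._X = X
        return np.matmul(X, self.W.T)

    def backward(self, dY):
        """Given dY = dL/dY, store dW = dL/dW and return dX = dL/dX."""
        self._dW = np.matmul(dY.T, self._X)
        return np.matmul(dY, self.W)

# Check the analytic dL/dX against a finite difference, with L = sum(Y)
layer = MatMul(n_in=4, n_out=3)
X = np.random.randn(2, 4)
dX = layer.backward(np.ones_like(layer.forward(X)))

eps, i, j = 1e-6, 0, 2
Xp, Xm = X.copy(), X.copy()
Xp[i, j] += eps
Xm[i, j] -= eps
numeric = (layer.forward(Xp).sum() - layer.forward(Xm).sum()) / (2 * eps)
print(np.isclose(dX[i, j], numeric))  # True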
In TensorFlow, there is autodiff, which seems to take care of the Jacobian calculation. Does this mean we do not have to derive the gradient formula manually, but can let the TensorFlow tape take care of it?
Computing gradients
To differentiate automatically, TensorFlow needs to remember what operations happen in what order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients.

Correct: you just need to define the forward pass, and TensorFlow generates an appropriate backward pass. From tf2 autodiff:
TensorFlow provides the tf.GradientTape API for automatic
differentiation; that is, computing the gradient of a computation with
respect to some inputs, usually tf.Variables. TensorFlow "records"
relevant operations executed inside the context of a tf.GradientTape
onto a "tape". TensorFlow then uses that tape to compute the gradients
of a "recorded" computation using reverse mode differentiation.
To do this, TensorFlow is given the forward pass (or the loss) and a set of tf.Variable variables with respect to which to compute the derivatives. This process is only possible for a specific set of operations defined by TensorFlow itself. In order to create a custom NN layer you need to define its forward pass using these operations (all of them part of TF or translated to it by some converter).*
Since you seem to have a numpy background, you could define your custom forward pass using numpy and then translate it to TensorFlow using the TensorFlow NumPy API (tf.experimental.numpy). You could alternatively use tf.numpy_function. After this, TF will create the backpropagation for you.
(*) Note that some operations, such as control statements, are not themselves differentiable, so they are invisible to gradient-based optimizers. There are some caveats about these.
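As a minimal illustration (the Linear layer name and shapes below are assumptions for this sketch, not part of the question): only the forward pass is written with TF ops, and the tape supplies the gradients.

import tensorflow as tf

class Linear(tf.keras.layers.Layer):
    """Custom layer: only the forward pass is written; no backward() anywhere."""
    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="random_normal", trainable=True)

    def call(self, x):
        return tf.matmul(x, self.w)  # forward pass built from TF ops

layer = Linear(3)
x = tf.random.normal((2, 4))
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(layer(x) ** 2)

# TF derives dloss/dw from the recorded forward ops.
grads = tape.gradient(loss, layer.trainable_variables)
print(grads[0].shape)  # (4, 3)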

Basically, TensorFlow is a symbolic math library based on dataflow and differentiable programming, and we do not have to work out the auto-differentiation formulas manually; all of that math is handled behind the scenes automatically. You quoted correctly from the official doc about gradient computation. However, in case you want to know how it can be done manually with numpy, I recommend this fantastic course, Neural Networks and Deep Learning, especially week 4, or an alternative source here.
FYI, in TF 2 we can do custom training from scratch by overriding the train_step method of the tf.keras.Model class, and there we can use the tf.GradientTape API for automatic differentiation, i.e. computing the gradient of a computation with respect to some inputs. The same official page includes more information on this; also see this well-written article on tf.GradientTape. For example, using this API we can easily compute the gradient as follows:
import tensorflow as tf

# some input
x = tf.Variable(3.0, trainable=True)

with tf.GradientTape() as tape:
    # some output
    y = x**3 + x**2 + x + 5

# compute gradient of y w.r.t. x: dy/dx = 3x**2 + 2x + 1 -> 34 at x = 3
print(tape.gradient(y, x).numpy())
# 34.0
Also, we can compute higher-order derivatives, such as:
x = tf.Variable(3.0, trainable=True)

with tf.GradientTape() as tape1:
    with tf.GradientTape() as tape2:
        y = x**3 + x**2 + x + 5
    # first derivative, taken inside tape1 so that it is recorded too
    order_1 = tape2.gradient(y, x)

# second derivative: d2y/dx2 = 6x + 2 -> 20 at x = 3
order_2 = tape1.gradient(order_1, x)
print(order_2.numpy())
# 20.0
Now, in custom model training in tf.keras, we first make a forward pass and compute the loss, then compute the gradients of the model's trainable variables with respect to the loss, and finally update the weights based on those gradients. Below is a code snippet; for end-to-end details, see Writing a training loop from scratch.
# Open a GradientTape to record the operations run
# during the forward pass, which enables auto-differentiation.
with tf.GradientTape() as tape:
    # Run the forward pass of the layer.
    # The operations that the layer applies
    # to its inputs are going to be recorded
    # on the GradientTape.
    logits = model(x_batch_train, training=True)  # Logits for this minibatch
    # Compute the loss value for this minibatch.
    loss_value = loss_fn(y_batch_train, logits)

# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)

# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
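And, as mentioned above, the same logic can live inside an overridden train_step. A minimal sketch, simplified from the official guide (note that recent Keras versions expose self.compute_loss instead of self.compiled_loss):

import tensorflow as tf

class CustomModel(tf.keras.Model):
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            # Applies the loss configured in compile()
            loss = self.compiled_loss(y, y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

inputs = tf.keras.Input(shape=(8,))
outputs = tf.keras.layers.Dense(1)(inputs)
model = CustomModel(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.fit(tf.random.normal((32, 8)), tf.random.normal((32, 1)), verbose=0)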

Related

Implementing Backprop for custom loss functions

I have a neural network Network that has a vector output. Instead of using a typical loss function, I would like to implement my own loss function that is a method in some class. This looks something like:
class whatever:
    def __init__(self, network, optimizer):
        self.network = network
        self.optimizer = optimizer

    def cost_function(self, relevant_data):
        ...  # implementation of the cost function w.r.t. the network output and relevant_data

    def train(self, epochs, other_params):
        ...  # the part I'm having trouble with
The main thing I'm concerned with is taking gradients. Since I'm using my own custom loss function, do I need to implement my own gradient of the cost function?
Once I do the math, I realize that if the cost is J, then the gradient of J is a fairly simple function in terms of the gradient of the final layer of the Network, i.e. it looks something like: Equation link.
If I used some traditional loss function like CrossEntropy, my backward pass would look like:
objective = nn.CrossEntropyLoss()
for epoch in range(epochs):
    optimizer.zero_grad()
    output = Network(input)
    loss = objective(output, data)
    loss.backward()
    optimizer.step()
But how do we do this in my case? My guess is something like:
for epoch in range(epochs):
    optimizer.zero_grad()
    output = Network(input)
    loss = cost_function(output, data)
    # And here is where the problem comes in
    loss.backward()
    optimizer.step()
loss.backward(), as I understand it, computes the gradients of the loss function with respect to the parameters. But can I still invoke it while using my own loss function, given that (presumably) the program doesn't know what the gradient equation is? Do I have to implement another method/subroutine to find the gradients as well?
Which brings me to my other question: if I do want to implement gradient calculation for my loss function, I also need the gradient of the neural network parameters. How do I obtain those? Is there a function for that?
As long as all your steps starting from the input till the loss function involve differentiable operations on PyTorch's tensors, you need not do anything extra. PyTorch builds a computational graph that keeps track of each operation, its inputs, and gradients. So, calling loss.backward() on your custom loss would still propagate gradients back correctly through the graph. A Gentle Introduction to torch.autograd from the PyTorch tutorials may be a useful reference.
After the backward pass, if you need to directly access the gradients for further processing, you can do so using the .grad attribute (so t.grad for tensor t in the graph).
Finally, if you have a specific use case for finding the gradient of an arbitrary differentiable function implemented using PyTorch's tensors with respect to one of its inputs (e.g. gradient of the loss with respect to a particular weight in the network), you could use torch.autograd.grad.
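A minimal sketch tying these points together (the network architecture, data, and penalty term below are arbitrary placeholders): the custom loss is built purely from tensor operations, so loss.backward() works unchanged.

import torch
import torch.nn as nn

network = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
optimizer = torch.optim.SGD(network.parameters(), lr=0.01)

def cost_function(output, target):
    # Any composition of differentiable tensor ops; no manual gradient needed.
    return ((output - target) ** 2).mean() + 0.01 * output.abs().sum()

x = torch.randn(16, 4)
target = torch.randn(16, 2)

optimizer.zero_grad()
loss = cost_function(network(x), target)
loss.backward()                        # autograd walks the recorded graph
print(network[0].weight.grad.shape)    # torch.Size([8, 4]) -- via .grad
optimizer.step()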

When calling a Keras Model there is no difference between having @tf.function or not, but it matters when building a low-level model

Is this issue a bug?
Compare the following two code snippets. If @tf.function is included, both work well; if it is not included, the custom low-level model does not train.
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# @tf.function
def propagate(x_batch, y_batch):
    """
    Complete both forward and backward propagation on our
    batches.
    """
    # Record operations to automatically obtain the gradients
    with tf.GradientTape() as tape:
        logits = model(x_batch)
        # Calculate the total loss of the entire network
        loss = loss_fn(y_batch, tf.nn.softmax(logits))
    # Compute the accuracy of our model
    # (convert our logits to a softmax distribution)
    accuracy(y_batch, tf.nn.softmax(logits))
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
Compared to when we define a custom low-level model:
class Model(object):
    def __init__(self):
        self.weights, self.biases = self.initialize_weights_and_biases()
        self.trainable_vars = list(self.weights.values()) + list(self.biases.values())

    def initialize_weights_and_biases(self):
        ...  # body elided in the question
        return out_layer
This is a very good question, and there has been a very interesting conversation about it on GitHub.
A Google engineer (GitHub ID alextp) has clarified this question there.
Providing the clarification here for the benefit of the Stack Overflow community.
The problem is caused by using softmax and then cross entropy separately, instead of using softmax_cross_entropy_with_logits or the equivalent.
softmax-then-cross-entropy is really numerically unstable (you throw away most of the bits of your logits when doing the softmax) and should never be used.
Because of this, the Keras cross-entropy function has logic to "undo" the softmax in graph mode.
So, the solution is to use softmax_cross_entropy_with_logits or the equivalent, instead of applying softmax and then cross entropy separately.
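In code, that means feeding raw logits to a loss that applies the softmax internally; a sketch of two equivalent fixes, reusing the names from the snippet above:

logits = model(x_batch)  # raw outputs; do NOT apply softmax here

# Option 1: the fused, numerically stable op
loss = tf.nn.softmax_cross_entropy_with_logits(labels=y_batch, logits=logits)

# Option 2: tell the Keras loss that it is receiving logits
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
loss = loss_fn(y_batch, logits)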
For a detailed and insightful conversation about this issue, please refer to this link.
Happy Learning!

Keras: custom objective function, where to put the derivative

I am trying to modify the loss function of my convnet a bit, and I have some questions on the implementation side.
I already know how to create a custom loss function in Keras and how to call it, but it is still not clear to me where to include the derivative of the function.
Let's say that my new loss function is:
Loss = cross-entropy + f(x)
where f(x) = x**2.
Where should I include f'(x)=2x so that it is used in the back-prop step?
Does Keras automatically do that? Or should I define this explicitly in some part?
Thanks for any hint on this, since I do not know how to do it.
Chuan.
Loss must be a function of a) your network's output and b) the correct labels.
Having loss = sum(a, b) makes your network minimize both a) and b).
Minimizing x**2 brings x close to zero.
As for minimizing softmax(): softmax(x) is not a loss function; it is defined only for a vector X and makes that vector sum to 1, so you can't really minimize it. I guess you are mixing concepts here.
Softmax is an activation function, and its output can be used to compute a loss, e.g. logloss.
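To address the original question directly: as long as f(x) is expressed with Keras/TF operations, its derivative f'(x) = 2x is obtained by automatic differentiation during back-prop, and you never write it yourself. A minimal sketch, assuming x refers to the model's output and model is your compiled network:

import tensorflow as tf

def custom_loss(y_true, y_pred):
    cross_entropy = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    penalty = tf.reduce_mean(y_pred ** 2)  # f(x) = x**2
    # f'(x) = 2x is produced by autodiff; nothing extra to define.
    return cross_entropy + penalty

model.compile(optimizer="adam", loss=custom_loss)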

Lasagne / Theano gradient values

I'm currently working on recurrent neural nets using Lasagne / Theano.
While training, updates are calculated using Theano's symbolic gradient.
grads = theano.grad(loss_or_grads, params)
While the gradient expression is perfectly fine in general, I'm also interested in the gradient values in order to monitor training.
My question now is if there is a built-in method to also get gradient values, which I haven't found so far, or if I'll have to do it myself.
Thanks in advance
I'm not aware of any Lasagne function to evaluate the gradient, but you can get it yourself with a simple Theano function.
Say we have the following theano variables:
inputs = Inputs to the network
targets = Target outputs of the network
loss = Value of the loss function, defined as a function of network outputs and targets
l_hid = Recurrent layer of the network, type lasagne.layers.RecurrentLayer
Say we're interested in the gradient of the loss function w.r.t. the recurrent weights:
grad = theano.grad(loss, l_hid.W_hid_to_hid)
Define a Theano function to get a numerical value for the gradient:
get_grad = theano.function([inputs, targets], grad)
Now, just call get_grad for any value of the inputs and targets (e.g. the current minibatch). get_grad doesn't need to be passed the values of the weights, because they're stored as Theano shared variables.
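For example, a short usage sketch that monitors the gradient norm during training (the minibatches iterator is a placeholder; the other names follow the answer above):

import numpy as np

# Compile once, outside the training loop
get_grad = theano.function([inputs, targets], grad)

for x_batch, y_batch in minibatches:   # `minibatches` is a placeholder iterator
    g = get_grad(x_batch, y_batch)     # numerical gradient for this batch
    print("grad norm:", np.linalg.norm(g))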

How to efficiently compute gradients layer-wise in Tensorflow?

I am trying to implement the distributed synchronous SGD approach described in this paper using TensorFlow. For that I need to compute and apply gradients layer-wise. In principle I can do it in the following way (note: incomplete code):
# WORKER CODE
opt = tf.train.GradientDescentOptimizer(learning_rate)
for layer_vars in all_layer_vars:
    grads_vars = opt.compute_gradients(loss, layer_vars)
    grads = sess.run([grad_var[0] for grad_var in grads_vars], feed_dict)
    send_grads_to_master(zip(grads, layer_vars))

# MASTER CODE
while True:
    grads_vars = receive_grads_from_worker()
    sess.run(opt.apply_gradients(grads_vars))
What I wonder is whether, in this scenario (with several compute_gradients() calls within different session.run()'s), the number of internal operations performed by TensorFlow is the same as, or higher than, in the "standard" scenario where all gradients are computed with a single invocation of compute_gradients().
That is, thinking of the backpropagation algorithm, I wonder whether in this distributed scenario TensorFlow will compute the different "deltas" only once or not. If the latter, is there a more efficient way of doing what I want?
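One way to avoid recomputing the shared deltas, sketched here under the question's own helper names (all_layer_vars, send_grads_to_master, sess, feed_dict are assumed from the snippet above, not a verified solution), is a single compute_gradients() call over all variables followed by a per-layer split:

# WORKER: one compute_gradients() call covers all layers,
# so the shared backward pass runs only once; split per layer afterwards.
opt = tf.train.GradientDescentOptimizer(learning_rate)
grads_vars = opt.compute_gradients(loss)  # all trainable variables at once
grad_values = sess.run([gv[0] for gv in grads_vars], feed_dict)

for layer_vars in all_layer_vars:
    layer_names = {v.name for v in layer_vars}
    layer_pairs = [(g, v) for g, (_, v) in zip(grad_values, grads_vars)
                   if v.name in layer_names]
    send_grads_to_master(layer_pairs)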
