Lasagne / Theano gradient values - python

I'm currently working on recurrent neural nets using Lasagne / Theano.
While training, updates are calculated using Theano's symbolic gradient.
grads = theano.grad(loss_or_grads, params)
While the gradient expression is perfectly fine in general, I'm also interested in the gradient values in order to monitor training.
My question now is whether there is a built-in method to get the gradient values as well (I haven't found one so far), or whether I'll have to do it myself.
Thanks in advance

I'm not aware of any Lasagne function to evaluate the gradient, but you can get it yourself with a simple Theano function.
Say we have the following Theano variables:
inputs = Inputs to the network
targets = Target outputs of the network
loss = Value of the loss function, defined as a function of network outputs and targets
l_hid = Recurrent layer of the network, type lasagne.layers.RecurrentLayer
Say we're interested in the gradient of the loss function w.r.t. the recurrent weights:
grad = theano.grad(loss, l_hid.W_hid_to_hid)
Define a Theano function to get a numerical value for the gradient:
get_grad = theano.function([inputs, targets], grad)
Now, just call get_grad for any value of the inputs and targets (e.g. the current minibatch). get_grad() doesn't need to be passed the value of the weights because they're stored as a theano shared variable.
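For example, to monitor the gradient magnitude during training (a minimal sketch; x_batch and y_batch are hypothetical minibatch arrays matching the shapes of the inputs and targets variables above):

import numpy as np

# x_batch, y_batch: hypothetical minibatch arrays with shapes matching
# the `inputs` and `targets` variables defined above.
grad_val = get_grad(x_batch, y_batch)
print(grad_val.shape, np.abs(grad_val).mean())  # track the mean |gradient|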

Related

Using Neural Network as loss for another Neural Network in pytorch

I'm struggling to implement the following in PyTorch Lightning. I have two neural networks, say f and g, where g is pretrained and f has to be trained from scratch. The loss of f is defined for a training pair (x, y) as l1 = mse(g(f(x)), g(y)). Here x and y are 1D arrays (time-domain signals), but that doesn't really matter for this question.
Now the loss can be equivalently defined by extending f by g (call this network h) and freezing all layers in g: mse(h(x), g(y)) is then the same loss as l1. But when I write, before training:
g.eval()
for param in g.parameters():
    param.requires_grad = False
and then try to train my network, I get the error:
RuntimeError: cudnn RNN backward can only be called in training mode
Of course this makes sense. So my question is: how can I use g as a loss function that is fully backpropagatable? I need this property, since I want to compute the gradients dg(f(x))/dx.
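For reference, a minimal sketch of the setup described above (assuming f and g are torch.nn.Module instances): one commonly suggested workaround for the cudnn error is to freeze g's parameters while leaving g in train mode, since the cudnn RNN backward pass is only available in training mode.

import torch
import torch.nn.functional as F

# Freeze g's parameters but keep it in train mode, so the cudnn RNN
# backward pass stays available while g itself receives no updates.
for param in g.parameters():
    param.requires_grad = False
g.train()

def loss_fn(x, y):
    # l1 = mse(g(f(x)), g(y)) from the question; gradients flow through
    # the frozen g into f (and into x, if x requires grad).
    return F.mse_loss(g(f(x)), g(y))

Note that keeping g in train mode changes the behaviour of any dropout or batch-norm layers inside g, so this is a trade-off rather than a complete solution.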

Implementing Backprop for custom loss functions

I have a neural network Network that has a vector output. Instead of using a typical loss function, I would like to implement my own loss function that is a method in some class. This looks something like:
class whatever:
    def __init__(self, network, optimizer):
        self.network = network
        self.optimizer = optimizer

    def cost_function(self, relevant_data):
        ...  # implementation of the cost function in terms of the network output and relevant_data

    def train(self, epochs, other_params):
        ...  # part I'm having trouble with
The main thing I'm concerned about is taking gradients. Since I'm using my own custom loss function, do I need to implement my own gradient with respect to the cost function?
Once I do the math, I realize that if the cost is J, then the gradient of J is a fairly simple function of the gradient of the final layer of the Network. I.e., it looks something like: Equation link.
If I used some traditional loss function like CrossEntropy, my backward pass would look like:
objective = nn.CrossEntropyLoss()
for epoch in range(epochs):
    optimizer.zero_grad()
    output = Network(input)
    loss = objective(output, data)
    loss.backward()
    optimizer.step()
But how do we do this in my case? My guess is something like:
for epoch in range(epochs):
    optimizer.zero_grad()
    output = Network(input)
    loss = cost_function(output, data)
    # And here is where the problem comes in
    loss.backward()
    optimizer.step()
loss.backward(), as I understand it, takes the gradients of the loss function with respect to the parameters. But can I still invoke it while using my own loss function (presumably the program doesn't know what the gradient equation is)? Do I have to implement another method/subroutine to find the gradients as well?
Which brings me to my other question: if I do want to implement the gradient calculation for my loss function, I also need the gradients with respect to the neural network parameters. How do I obtain those? Is there a function for that?
As long as all your steps starting from the input till the loss function involve differentiable operations on PyTorch's tensors, you need not do anything extra. PyTorch builds a computational graph that keeps track of each operation, its inputs, and gradients. So, calling loss.backward() on your custom loss would still propagate gradients back correctly through the graph. A Gentle Introduction to torch.autograd from the PyTorch tutorials may be a useful reference.
After the backward pass, if you need to directly access the gradients for further processing, you can do so using the .grad attribute (so t.grad for tensor t in the graph).
Finally, if you have a specific use case for finding the gradient of an arbitrary differentiable function implemented using PyTorch's tensors with respect to one of its inputs (e.g. gradient of the loss with respect to a particular weight in the network), you could use torch.autograd.grad.
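As a minimal sketch with a hypothetical toy model (the specific network and loss below are illustrative, not from the question), both ways of reading gradients look like this:

import torch
import torch.nn as nn

net = nn.Linear(4, 1)                       # hypothetical tiny network
x = torch.randn(8, 4, requires_grad=True)
loss = (net(x) ** 2).mean()                 # any custom differentiable loss

# Option 1: backward() populates .grad on every leaf tensor that requires grad.
loss.backward()
print(net.weight.grad)                      # d(loss)/d(weight)

# Option 2: torch.autograd.grad returns the gradients directly,
# e.g. the gradient of the loss with respect to the input x.
loss2 = (net(x) ** 2).mean()                # rebuild the graph
(grad_x,) = torch.autograd.grad(loss2, x)
print(grad_x.shape)                         # torch.Size([8, 4])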

Tensorflow - does autodiff relieve us from the back-prop implementation?

Question
When using Tensorflow, for instance implementing a custom neural network layer, what is the standard practice to implement the back-propagation? Do we not have to work on the auto-differentiation formulas?
Background
With numpy, when creating a layer (e.g. matmul), the back-propagation gradient is first derived analytically and coded accordingly:
def forward(self, X):
    self._X = X
    self._Y = np.matmul(self._X, self.W.T)
    return self._Y

def backward(self, dY):
    """dY = dL/dY is a Jacobian, where L is the loss and Y is the matmul output"""
    self._dY = dY
    self._dX = np.matmul(self._dY, self.W)
    return self._dX
In Tensorflow, there is autodiff, which seems to take care of the Jacobian calculation. Does this mean we do not have to derive the gradient formula manually, and can instead let the Tensorflow tape look after it?
Computing gradients
To differentiate automatically, TensorFlow needs to remember what operations happen in what order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients.
Correct, you just need to define the forward pass and Tensorflow generates an appropriate backward pass. From tf2 autodiff:
TensorFlow provides the tf.GradientTape API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow "records" relevant operations executed inside the context of a tf.GradientTape onto a "tape". TensorFlow then uses that tape to compute the gradients of a "recorded" computation using reverse mode differentiation.
To do this, Tensorflow is given the forward pass (or the loss) and a set of tf.Variable variables to compute the derivatives with respect to. This process is only possible for a specific set of operations defined by Tensorflow itself. In order to create a custom NN layer, you need to define its forward pass using these operations (all of them part of TF or translated to it by some converter).*
Since you seem to have a numpy background, you could define your custom forward pass using numpy and then translate it to Tensorflow using the tf_numpy API. (You could alternatively wrap plain numpy code with tf.numpy_function, but note that TF cannot differentiate through such wrapped calls automatically.) After this, TF will create the backpropagation for you.
(*) Note that some operations, such as control statements, are not themselves differentiable and are thus invisible to gradient-based optimizers. There are some caveats about these.
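For instance, a numpy-style forward pass written with the tf.experimental.numpy API (available in recent TF 2.x) stays differentiable, because its operations are ordinary TF ops under the hood. A small sketch with made-up shapes:

import tensorflow as tf
import tensorflow.experimental.numpy as tnp

W = tf.Variable(tf.random.normal([3, 2]))
x = tf.constant([[1.0, 2.0, 3.0]])

with tf.GradientTape() as tape:
    y = tnp.matmul(x, W)       # numpy-like op, recorded on the tape
    loss = tnp.sum(y ** 2)

print(tape.gradient(loss, W))  # d(loss)/dW, no hand-written backward pass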
Basically, Tensorflow is a symbolic math library based on dataflow and differentiable programming. We do not have to work out the auto-differentiation formulas manually; all of that math is done behind the scenes, automatically. You quoted correctly from the official doc about gradient computation. However, in case you want to know how it can be done manually with numpy, I recommend the Neural Networks and Deep Learning course, especially week 4, or an alternative source here.
FYI, in TF 2 we can do custom training from scratch by overriding the train_step method of the tf.keras.Model class, and there we can use the tf.GradientTape API for automatic differentiation, i.e. computing the gradient of a computation with respect to some inputs. The same official page includes more information on this; also see this well-written article on tf.GradientTape. For example, using this API we can easily compute the gradient as follows:
import tensorflow as tf

# some input
x = tf.Variable(3.0, trainable=True)

with tf.GradientTape() as tape:
    # some output
    y = x**3 + x**2 + x + 5

# compute gradient of y wrt x
print(tape.gradient(y, x).numpy())
# 34.0
Also, we can compute higher-order derivatives, such as:
x = tf.Variable(3.0, trainable=True)

with tf.GradientTape() as tape1:
    with tf.GradientTape() as tape2:
        y = x**3 + x**2 + x + 5
    # first derivative
    order_1 = tape2.gradient(y, x)

# second derivative
order_2 = tape1.gradient(order_1, x)
print(order_2.numpy())
# 20.0
Now, in custom model training in tf.keras, we first make a forward pass and compute the loss, then compute gradients of the model's trainable variables with respect to the loss, and finally update the model's weights based on these gradients. Below is a code snippet; the end-to-end details are in Writing a training loop from scratch.
# Open a GradientTape to record the operations run
# during the forward pass, which enables auto-differentiation.
with tf.GradientTape() as tape:
    # Run the forward pass of the layer. The operations that the layer
    # applies to its inputs are going to be recorded on the GradientTape.
    logits = model(x_batch_train, training=True)  # Logits for this minibatch
    # Compute the loss value for this minibatch.
    loss_value = loss_fn(y_batch_train, logits)

# Use the gradient tape to automatically retrieve the gradients
# of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)

# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
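For completeness, here is a sketch of the train_step override mentioned above, following the standard tf.keras custom-training pattern (names as in the official Keras guide):

import tensorflow as tf

class CustomModel(tf.keras.Model):
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)       # forward pass
            loss = self.compiled_loss(y, y_pred)  # loss configured in compile()
        # Compute gradients, apply them, then update the metrics.
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}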

Keras Create New Loss Function

I am looking to design a custom loss function for a Keras model. The model itself is a neural network that accepts a set of images and is supposed to run a regression to produce an output value. Due to the physical conditions of the problem, I need to add a regularization term to the regular mse, calculated as $\cos(y_{pred}) \cdot f(X_i)$, where $y_{pred}$ is the output of the neural network, $X_i$ is the training example used to calculate $y_{pred}$, and $f$ is some function which computes a value based on the image.
My problem is how to get $X_i$ from the model: a loss function is supposed to accept just two inputs, $y_{pred}$ and $y_{true}$, which are tensors.
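One commonly used workaround (a sketch, not a canonical answer) is to attach the input-dependent term with model.add_loss, which, unlike a compiled loss, can see arbitrary tensors such as the model input. The f below is a hypothetical placeholder for the image-based function from the question and is assumed to be built from differentiable TF ops; the shapes are made up:

import tensorflow as tf

def f(x):
    # Hypothetical stand-in for the image-based function; here just the
    # mean pixel intensity per example, assumed differentiable.
    return tf.reduce_mean(x, axis=[1, 2, 3])

inputs = tf.keras.Input(shape=(64, 64, 3))
y_pred = tf.keras.layers.Dense(1)(tf.keras.layers.Flatten()(inputs))
model = tf.keras.Model(inputs, y_pred)

# Regularization term cos(y_pred) * f(X_i), added on top of the usual mse.
reg = tf.reduce_mean(tf.cos(tf.squeeze(y_pred, axis=-1)) * f(inputs))
model.add_loss(reg)
model.compile(optimizer="adam", loss="mse")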

Getting low test accuracy using Tensorflow batch_norm function

I am using the official Batch Normalization (BN) function (tf.contrib.layers.batch_norm()) of Tensorflow on the MNIST data. I use the following code for adding BN:
local4_bn = tf.contrib.layers.batch_norm(local4, is_training=True)
During testing, I change to is_training=False in the above line of code and observe only 20% accuracy. However, I get ~99% accuracy if I keep is_training=True at test time as well, with a batch size of 100 images. This observation indicates that the exponential moving average and variance computed by batch_norm() are probably incorrect, or that I am missing something in my code.
Can anyone suggest a solution to this problem?
You get ~99% accuracy when you test your model with is_training=True only because of the batch size of 100.
If you change the batch size to 1 your accuracy will decrease.
This is due to the fact that you're computing the exponential moving average and variance for the input batch and then (batch-)normalizing the layer's output using these values.
The batch_norm function has a variables_collections parameter that helps you store the computed moving average and variance during the training phase and reuse them during the test phase.
If you define a collection for these variables, then the batch_norm layer will use them during the testing phase, instead of calculating new values.
Therefore, if you change your batch normalization layer definition to
local4_bn = tf.contrib.layers.batch_norm(local4, is_training=True, variables_collections=["batch_norm_non_trainable_variables_collection"])
The layer will store the computed variables into the "batch_norm_non_trainable_variables_collection" collection.
In the test phase, when you pass the is_training=False parameter, the layer will reuse the computed values that it finds in the collection.
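So a test-phase definition mirroring the training one would look like:

local4_bn = tf.contrib.layers.batch_norm(
    local4,
    is_training=False,  # use the stored moving mean/variance
    variables_collections=["batch_norm_non_trainable_variables_collection"])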
Note that the moving average and the variance are not trainable parameters; therefore, if you save only your model's trainable parameters in the checkpoint files, you have to manually add the non-trainable variables stored in the previously defined collection.
You can do it when you create the Saver object:
saver = tf.train.Saver(tf.trainable_variables() + tf.get_collection_ref("batch_norm_non_trainable_variables_collection") + otherlistofvariables)
In addition, since batch normalization can limit the expressive power of the layer it is applied to (because it restricts the range of the values), you should let the network learn the parameters gamma and beta (the affine-transformation coefficients described in the paper), which allows it to learn an affine transformation that increases the representational power of the layer.
You can enable the learning of these parameters by setting the corresponding batch_norm arguments to True:
local4_bn = tf.contrib.layers.batch_norm(
    local4,
    is_training=True,
    center=True,  # beta
    scale=True,   # gamma
    variables_collections=["batch_norm_non_trainable_variables_collection"])
I have encountered the same problem when processing MNIST: my training accuracy is normal, while the test accuracy is very low at the beginning and then grows gradually.
I changed the default momentum=0.99 to momentum=0.9, and then it worked fine.
My source code is here:
mnist_bn_fixed.py
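For reference, a sketch of that change, assuming the tf.layers.batch_normalization API (whose momentum parameter defaults to 0.99); a smaller momentum makes the moving statistics adapt faster over short training runs:

local4_bn = tf.layers.batch_normalization(
    local4,
    momentum=0.9,          # down from the default 0.99
    training=is_training)  # hypothetical placeholder: True in training, False at test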
