When is Momentum Applied in Tensorflow Gradient Tape? - python

I've been playing around with automatic gradients in tensorflow and I had a question. If we are updating an optimizer, say ADAM, when is the momentum algorithm applied to the gradient? Is it applied when we call tape.gradient(loss,model.trainable_variables) or when we call model.optimizer.apply_gradients(zip(dtf_network,model.trainable_variables))?

tape.gradient computes the gradients straightforwardly without reference to an optimizer. Since momentum is part of the optimizer, the tape does not include it. AFAIK momentum is usually implemented by adding extra variables in the optimizer that store the running average. All of this is handled in optimizer.apply_gradients.
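A minimal sketch of that separation, written in plain Python rather than TensorFlow. The names `grad_fn` and `MomentumOptimizer` are hypothetical, and the update rule is classical momentum rather than Adam's exact formula. The point is only that the gradient function is stateless while the optimizer owns the running average.

```python
# Plain-Python sketch (not TensorFlow code): the gradient is computed without
# any optimizer state; momentum lives inside the optimizer's update step.

def grad_fn(w):
    """Gradient of the loss J(w) = w**2; plays the role of tape.gradient."""
    return 2.0 * w


class MomentumOptimizer:
    """Hypothetical optimizer that keeps one velocity slot per variable."""

    def __init__(self, learning_rate=0.1, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}  # extra state, created lazily per variable name

    def apply_gradients(self, grads_and_vars, variables):
        # grads_and_vars: iterable of (gradient, variable_name) pairs
        for grad, name in grads_and_vars:
            v = self.momentum * self.velocity.get(name, 0.0) + grad
            self.velocity[name] = v
            variables[name] -= self.learning_rate * v


variables = {"w": 1.0}
optimizer = MomentumOptimizer()

raw_grad = grad_fn(variables["w"])       # 2.0, no momentum involved
optimizer.apply_gradients([(raw_grad, "w")], variables)
first_step = 1.0 - variables["w"]        # about 0.2 (learning_rate * grad)

variables["w"] = 1.0                     # reset the weight, keep the state
same_grad = grad_fn(variables["w"])      # still 2.0
optimizer.apply_gradients([(same_grad, "w")], variables)
second_step = 1.0 - variables["w"]       # larger: velocity has accumulated
```

The same raw gradient produces a larger second step because the velocity stored in the optimizer carries over between calls.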


Implementing Backprop for custom loss functions

I have a neural network Network that has a vector output. Instead of using a typical loss function, I would like to implement my own loss function that is a method in some class. This looks something like:
class whatever:
    def __init__(self, network, optimizer):
        self.network = network
        self.optimizer = optimizer

    def cost_function(self, relevant_data):
        # ...implementation of cost function with respect to output of network and relevant_data...
        ...

    def train(self, epochs, other_params):
        # ...part I'm having trouble with...
        ...
The main thing I'm concerned about is taking gradients. Since I'm using my own custom loss function, do I need to implement my own gradient with respect to the cost function?
Once I do the math, I realize that if the cost is J, then the gradient of J is a fairly simple function of the gradient of the final layer of the Network. I.e., it looks something like: Equation link.
If I used some traditional loss function like CrossEntropy, my training loop would look like:
objective = nn.CrossEntropyLoss()
for epoch in range(epochs):
    output = Network(input)
    loss = objective(output, data)
But how do we do this in my case? My guess is something like:
for epoch in range(epochs):
    output = Network(input)
    loss = cost_function(output, data)
    # And here is where the problem comes in
loss.backward(), as I understand it, takes the gradients of the loss function with respect to the parameters. But can I still invoke it while using my own loss function? Presumably the program doesn't know what the gradient equation is. Do I have to implement another method/subroutine to find the gradients as well?
Which brings me to my other question: if I do want to implement gradient calculation for my loss function, I also need the gradient of the neural network parameters. How do I obtain those? Is there a function for that?
As long as all your steps from the input to the loss involve differentiable operations on PyTorch tensors, you need not do anything extra. PyTorch builds a computational graph that keeps track of each operation and its inputs, and uses it to compute gradients. So calling loss.backward() on your custom loss will still propagate gradients back correctly through the graph. A Gentle Introduction to torch.autograd from the PyTorch tutorials may be a useful reference.
After the backward pass, if you need to access the gradients directly for further processing, use the .grad attribute (so t.grad for a tensor t in the graph). By default .grad is populated only for leaf tensors with requires_grad=True, such as the network's parameters; call t.retain_grad() before the backward pass to keep it for an intermediate tensor.
Finally, if you have a specific use case for finding the gradient of an arbitrary differentiable function implemented using PyTorch's tensors with respect to one of its inputs (e.g. gradient of the loss with respect to a particular weight in the network), you could use torch.autograd.grad.
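A minimal sketch of the three points above, assuming a small nn.Linear as a stand-in for the asker's Network. The particular `cost_function` is hypothetical; any loss built from differentiable tensor operations would behave the same way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

network = nn.Linear(3, 2)  # stand-in for the asker's Network (assumption)


def cost_function(output, target):
    """Hypothetical custom loss: mean absolute error plus a small L2 term."""
    return (output - target).abs().mean() + 0.01 * (output ** 2).sum()


inputs = torch.randn(4, 3)
target = torch.randn(4, 2)

# Forward pass and custom loss; autograd records every tensor operation.
output = network(inputs)
loss = cost_function(output, target)

# torch.autograd.grad returns the gradients without writing to .grad.
# retain_graph=True keeps the graph alive for the backward() call below.
params = list(network.parameters())
explicit_grads = torch.autograd.grad(loss, params, retain_graph=True)

# backward() populates .grad on every parameter; no hand-written gradient.
loss.backward()

for p, g in zip(params, explicit_grads):
    print(tuple(p.shape), torch.allclose(p.grad, g))
```

Both routes go through the same recorded graph, so the gradients from torch.autograd.grad and the ones stored in .grad agree.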

Optimizer for an RNN using pytorch

The PyTorch RNN tutorial uses
for p in net.parameters():
    p.data.add_(p.grad.data, alpha=-learning_rate)
as its optimizer. Does anyone know the difference between doing that and calling the usual optimizer.step() once an optimizer has been defined explicitly? Is there any special consideration to take into account regarding the optimizer when training RNNs?
It looks like the example uses plain gradient descent to update each parameter p:
$$p \leftarrow p - \eta \, \frac{\partial J}{\partial p}$$
where J is the cost and \eta is learning_rate.
If the optimizer you're using is plain gradient descent (for example torch.optim.SGD with no momentum and no weight decay), then there is no difference between using optimizer.step() and the code in the example.
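A quick way to check this claim is to run both updates from identical starting weights and compare the results. The sketch below assumes a tiny nn.Linear model and random data purely for illustration, and uses torch.optim.SGD with its default settings (no momentum, no weight decay).

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
learning_rate = 0.1

net_manual = nn.Linear(3, 1)
net_optim = copy.deepcopy(net_manual)  # identical starting weights
optimizer = torch.optim.SGD(net_optim.parameters(), lr=learning_rate)

x = torch.randn(8, 3)
y = torch.randn(8, 1)

# Manual update, in the style of the tutorial.
net_manual.zero_grad()
loss = ((net_manual(x) - y) ** 2).mean()
loss.backward()
with torch.no_grad():
    for p in net_manual.parameters():
        p.add_(p.grad, alpha=-learning_rate)

# The same step taken through the optimizer.
optimizer.zero_grad()
loss = ((net_optim(x) - y) ** 2).mean()
loss.backward()
optimizer.step()

same = all(
    torch.allclose(a, b)
    for a, b in zip(net_manual.parameters(), net_optim.parameters())
)
print(same)
```

With momentum, weight decay, or an adaptive optimizer such as Adam, the two sets of weights would no longer match after the step.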
I know that's not a super exciting answer to your question, because it depends on how the step() function is written. Check out this page to learn about step() and this page to learn more about torch.optim.

How to monitor vanishing and exploding gradients in Keras with TensorBoard?

I would like to monitor how the gradients change in TensorBoard with Keras, to tell whether they are vanishing or exploding. What should I do?
To visualize training in TensorBoard, add the keras.callbacks.TensorBoard callback to model.fit. Set write_grads=True to log the gradients; this also requires histogram_freq to be greater than 0. Note that the tf.keras version of the callback in TensorFlow 2 ignores write_grads, so there the gradients have to be logged manually. Right after training starts, you can run...
tensorboard --logdir=/full_path_to_your_logs
... from the command line and point your browser to http://localhost:6006. See the example code in this question.
To check for vanishing or exploding gradients, pay attention to the distribution and absolute values of the gradients in the layer of interest ("Distributions" tab):
If the distribution is highly peaked and concentrated around 0, the gradients are probably vanishing. Here's a concrete example of what it looks like in practice.
If the distribution grows rapidly in absolute value over time, the gradients are exploding. Often the output values of the same layer become NaN very quickly as well.
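The visual rule of thumb above can also be expressed as a numeric check on per-layer gradient norms. This is only a sketch in plain Python: the helper names, the layer names, and both thresholds are illustrative assumptions, not Keras or TensorBoard functionality, and the norms would have to be collected by your own logging.

```python
import math


def gradient_norm(values):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(v * v for v in values))


def diagnose(norm_history, vanish_below=1e-6, explode_above=1e3):
    """Label one layer from its gradient norms over training steps.

    Both thresholds are illustrative assumptions, not standard values.
    """
    latest = norm_history[-1]
    if math.isnan(latest) or latest > explode_above:
        return "exploding"
    if latest < vanish_below:
        return "vanishing"
    return "ok"


# Hypothetical gradient norms per layer, one entry per logged step.
history = {
    "dense_1": [1e-3, 1e-5, 1e-8],  # shrinking towards zero
    "dense_2": [0.5, 0.4, 0.45],    # stable
    "dense_3": [2.0, 150.0, 9e4],   # growing rapidly
}

report = {layer: diagnose(norms) for layer, norms in history.items()}
print(report)
```

Sensible thresholds depend on the model and the loss scale, so the histograms remain the more reliable guide.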

Regularization Application in the Proximal Adagrad Optimizer in Tensorflow

I have been trying to implement l1-regularization in Tensorflow using the l1_regularization_strength parameter in the ProximalAdagradOptimizer function from Tensorflow. (I am using this optimizer specifically to get a sparse solution.) I have two questions regarding the regularization.
Does the l1-regularization used in the optimizer apply to forward and backward propagation for a neural network or only the back propagation?
Is there a way to break down the optimizer so the regularization only applies to specific layers in the network?
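Regularization applies to neither the forward pass nor backpropagation; it is applied in the weight update. The l1 penalty is not added to the loss. It enters through the proximal step the optimizer performs when it updates each variable.
You can use different optimizers for different layers by explicitly passing each optimizer the list of variables it should minimize over (the var_list argument of minimize).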
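To make both answers concrete, here is a scalar sketch of a proximal-Adagrad-style update in plain Python. It follows the general form of the update (Adagrad step followed by soft-thresholding) and is not TensorFlow's implementation, which may differ in detail. The function name, the layer names, and all numeric values are assumptions chosen for illustration.

```python
import math


def proximal_adagrad_step(var, grad, accum, lr, l1):
    """One scalar proximal-Adagrad-style update with an L1 penalty.

    Returns (new_var, new_accum). With l1 == 0 this is a plain Adagrad step.
    """
    accum = accum + grad * grad
    step = lr / math.sqrt(accum)               # per-coordinate learning rate
    prox = var - step * grad                   # ordinary gradient step
    shrunk = max(abs(prox) - step * l1, 0.0)   # soft-thresholding applies L1
    return math.copysign(shrunk, prox), accum


# Two hypothetical layers with identical weights and gradients.
weights = {"layer_a": 0.05, "layer_b": 0.05}
l1_strength = {"layer_a": 0.5, "layer_b": 0.0}  # regularize layer_a only
grad = 0.01
initial_accum = 0.1  # assumed starting value for the Adagrad accumulator

updated = {}
for name, w in weights.items():
    updated[name], _ = proximal_adagrad_step(
        w, grad, initial_accum, lr=0.1, l1=l1_strength[name]
    )

print(updated)
```

The gradient passed in is the same for both layers, so the penalty clearly plays no part in the forward or backward pass. Only the update differs: the regularized weight is driven to exactly zero, while the other one takes an ordinary Adagrad step.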

MLPClassifier parameter setting

I'm developing a project which uses the backpropagation algorithm, so I'm learning how backpropagation is used in scikit-learn.
mlp = MLPClassifier(hidden_layer_sizes=(hiddenLayerSize,), solver='lbfgs', learning_rate='constant',learning_rate_init=0.001, max_iter=100000, random_state=1)
There are different solver options as lbfgs, adam and sgd and also activation options. Are there any best practices about which option should be used for backpropagation?
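solver is the argument that selects the optimization algorithm. No single choice is best for every problem. The scikit-learn documentation notes that the default, adam, works well on relatively large datasets (thousands of training samples or more), while lbfgs can converge faster and perform better on small ones. sgd (stochastic gradient descent) can also work well but needs more tuning: apart from the learning rate you should also set the momentum argument (the default value of 0.9 usually works). Note that learning_rate and momentum are only used when solver='sgd', so they have no effect with lbfgs.
The activation option introduces non-linearity into the model. If your model has several layers, you need a non-linear activation such as relu (rectified linear unit); otherwise the stacked layers collapse into a single linear map and the extra layers are useless. relu is the simplest and most widely used choice, and it is the default in MLPClassifier.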
Another thing to consider is that the learning rate should not be too large when the activation function is ReLU.
The main issue with ReLU is the so-called 'dying ReLU' problem. A neuron is considered dead when it is stuck on the negative side, where its output and gradient are both zero, so it stops updating. This is most likely to occur when the learning rate is too large.
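The effect can be reproduced with a single ReLU neuron in plain Python. The data, the starting weight, and the two learning rates below are made up for illustration; the sketch only shows that one oversized step can push every pre-activation negative, after which the gradient is zero and the neuron stays dead.

```python
def relu(z):
    return max(z, 0.0)


def relu_grad(z):
    """Derivative of ReLU with respect to its input (0 for z <= 0)."""
    return 1.0 if z > 0 else 0.0


def sgd_step(w, b, xs, targets, lr):
    """One full-batch gradient step for a ReLU neuron with squared error."""
    grad_w = grad_b = 0.0
    for x, t in zip(xs, targets):
        z = w * x + b
        delta = 2.0 * (relu(z) - t) * relu_grad(z)
        grad_w += delta * x / len(xs)
        grad_b += delta / len(xs)
    return w - lr * grad_w, b - lr * grad_b, (grad_w, grad_b)


xs = [1.0, 2.0, 3.0]       # all inputs are positive
targets = [0.5, 1.0, 1.5]  # hypothetical targets (ideal weight is 0.5)

# Oversized learning rate: the first step overshoots far into the negatives.
w, b, _ = sgd_step(2.0, 0.0, xs, targets, lr=1.0)
w, b, grads_after = sgd_step(w, b, xs, targets, lr=1.0)

# Small learning rate from the same start: the neuron stays active.
w_small, b_small, _ = sgd_step(2.0, 0.0, xs, targets, lr=0.01)

print(w, b, grads_after)
print(w_small, b_small)
```

After the large step the weight and bias are both negative, every input maps to zero, and further steps change nothing because the gradient is zero.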

