I have been trying to implement L1 regularization in TensorFlow using the l1_regularization_strength parameter of the ProximalAdagradOptimizer. (I am using this optimizer specifically to get a sparse solution.) I have two questions regarding the regularization.
Does the l1-regularization used in the optimizer apply to forward and backward propagation for a neural network or only the back propagation?
Is there a way to break down the optimizer so the regularization only applies to specific layers in the network?
Regularization applies neither to the forward pass nor to backpropagation, but to the weight updates.
You can use different optimizers for different layers by explicitly passing the variables to minimize to each optimizer.
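For example, a rough sketch of what that could look like with the TF1-style API (the tiny graph, scope names, and hyperparameters below are purely illustrative; in your own code you would pass your loss and your layers' variable lists):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Illustrative two-layer graph; only the second layer gets L1-regularized updates.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
with tf.variable_scope("layer1"):
    h = tf.layers.dense(x, 32, activation=tf.nn.relu)
with tf.variable_scope("layer2"):
    out = tf.layers.dense(h, 1)
loss = tf.reduce_mean(tf.square(out - y))

layer1_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="layer1")
layer2_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="layer2")

plain_opt = tf.train.AdagradOptimizer(learning_rate=0.01)
sparse_opt = tf.train.ProximalAdagradOptimizer(learning_rate=0.01,
                                               l1_regularization_strength=0.001)

# Each optimizer only updates (and regularizes) the variables it was given.
train_op = tf.group(plain_opt.minimize(loss, var_list=layer1_vars),
                    sparse_opt.minimize(loss, var_list=layer2_vars))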
I have seen code like
torch.optim.SGD(model.parameters(), lr=lr)
for example in the Transformer tutorial but I have also seen code like
optimizer = SGD((p for p in model.parameters() if p.requires_grad), lr=lr)
From the source code of Torch's SGD optimizer class, it looks like SGD filters for and updates only parameters whose grad is not None.*
Is it necessary to filter for only the parameters who require gradients? Is there any advantage to filtering, for example in terms of performance?
How does PyTorch implement gradient updates that are conditional on parameters not being frozen in its optimizers?
*A superficial inspection of the optimizers' implementations doesn't show any such obvious filtering of the parameter set fed to the workhorse function of the respective optimizer, for example the sgd function for the SGD class.
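For what it's worth, here is a minimal sketch of the behaviour I mean (frozen parameters keep grad == None after backward(), so the optimizer's step appears to skip them even without filtering at construction time):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# Freeze the first layer; its parameters never receive a .grad.
for p in model[0].parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # no explicit filtering

model(torch.randn(3, 4)).sum().backward()
print([p.grad is None for p in model[0].parameters()])    # [True, True]
optimizer.step()                                          # parameters with grad None are skipped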
I am using the pre-trained ResNet18 model for a simple binary image classification task. However, all the tutorials, including PyTorch's own, use nn.Linear(num_of_features, classes) for the final fully connected layer. What I fail to understand is: where is the activation function for that module? Also, if I want to use sigmoid/softmax, how do I go about that?
Thanks for your help in advance, I am kinda new to PyTorch.
No, you do not use an activation in the last layer if your loss function is CrossEntropyLoss, because PyTorch's CrossEntropyLoss combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
Why do they do that?
You actually need the logits (the raw output of the linear layer, before any sigmoid/softmax) for the loss calculation, so it is a correct design not to have the activation as part of the forward pass. Moreover, for predictions you don't need the softmax either, because argmax(linear(x)) == argmax(softmax(linear(x))), i.e. softmax does not change the ordering but only the magnitudes (it is a squashing function that maps arbitrary values into the [0, 1] range while preserving the ordering).
If you want to use activation functions to add some sort of non-linearity, you normally do that by using a multi-layer network and putting the activation functions in every layer except the last one.
Finally, if you are using another loss function such as NLLLoss, PoissonNLLLoss, or BCELoss, then you have to apply the corresponding activation yourself (e.g. the sigmoid for BCELoss). On the same note, if you are using BCEWithLogitsLoss you don't need to apply the sigmoid, because this loss combines a Sigmoid layer and the BCELoss in one single class.
Check the PyTorch docs to see how to use each loss.
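For your binary case, a minimal sketch of what this could look like with BCEWithLogitsLoss (the shapes and targets below are made up, and the pretrained weights are omitted to keep it short):

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18()             # in practice, load the pretrained weights here
model.fc = nn.Linear(model.fc.in_features, 1)
criterion = nn.BCEWithLogitsLoss()    # applies the sigmoid internally

images = torch.randn(4, 3, 224, 224)
targets = torch.tensor([1., 0., 1., 0.])

logits = model(images).squeeze(1)     # raw scores; no activation in the forward pass
loss = criterion(logits, targets)

# Only apply the sigmoid explicitly when you want probabilities at inference time.
probs = torch.sigmoid(logits)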
Usually, no ReLU activation function is used in the last layer. The output of the torch.nn.Linear layer is fed to the softmax function of the cross-entropy loss, e.g., by using torch.nn.CrossEntropyLoss. What you may be looking for is the binary-cross-entropy loss torch.nn.BCELoss.
In the tutorials you see on the internet, people mostly do multi-class classification, for which they use the cross-entropy loss, which doesn't require a user-defined activation function at the output: it applies the softmax itself (in fact, applying an activation function before the cross-entropy is one of the most common mistakes in PyTorch). However, in your case you have a binary classification problem, for which you need the binary cross-entropy loss, which, unlike the other one, does not apply any activation function by itself. So you will need to apply a sigmoid activation (or any activation that maps real numbers into the range (0, 1)) yourself.
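For example, a small sketch of that pattern with BCELoss (the feature size and targets are illustrative):

import torch
import torch.nn as nn

head = nn.Linear(512, 1)              # e.g. a replacement for resnet18's fc layer
criterion = nn.BCELoss()              # expects probabilities, not logits

features = torch.randn(4, 512)
probs = torch.sigmoid(head(features)).squeeze(1)   # map logits into (0, 1) yourself
loss = criterion(probs, torch.tensor([1., 0., 1., 0.]))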
I've been playing around with automatic gradients in tensorflow and I had a question. If we are updating an optimizer, say ADAM, when is the momentum algorithm applied to the gradient? Is it applied when we call tape.gradient(loss,model.trainable_variables) or when we call model.optimizer.apply_gradients(zip(dtf_network,model.trainable_variables))?
Thanks!
tape.gradient computes the raw gradients without reference to any optimizer. Since momentum is part of the optimizer, the tape does not include it. AFAIK momentum is usually implemented by adding extra slot variables in the optimizer that store the running averages. All of this is handled in optimizer.apply_gradients.
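A minimal sketch to illustrate the split (the model and loss below are made up):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

x = tf.random.normal((8, 3))
y = tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))

# Plain dL/dw, no momentum involved yet.
grads = tape.gradient(loss, model.trainable_variables)

# Adam's running averages (its slot variables) are created and used here.
optimizer.apply_gradients(zip(grads, model.trainable_variables))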
I implemented a deep neural network from scratch, without using any Python frameworks like TensorFlow or Keras.
The problem is that no matter what I change in my code, like adjusting the learning rate, changing the number of layers or nodes, or switching the activation functions from sigmoid to ReLU to leaky ReLU, I end up with a training loss that starts at 6.98 but always converges to 3.24...
Why is that?
Please review my forward- and back-propagation algorithms. Maybe there's something wrong in them that I couldn't identify.
My hidden layers use leaky ReLU and the final layer uses a sigmoid activation.
I'm trying to classify the MNIST handwritten digits.
code:
# FORWARD PROPAGATION
for i in range(layers-1):
    cache["a"+str(i+1)] = lrelu(np.dot(param["w"+str(i+1)], cache["a"+str(i)]) + param["b"+str(i+1)])
cache["a"+str(layers)] = sigmoid(np.dot(param["w"+str(layers)], cache["a"+str(layers-1)]) + param["b"+str(layers)])
yn = cache["a"+str(layers)]

# binary cross-entropy cost, averaged over the m examples
m = X.shape[1]
cost = -np.sum(y*np.log(yn) + (1-y)*np.log(1-yn)) / m
if j % 10 == 0:
    print(cost)
costs.append(cost)

# BACKPROPAGATION
grad = {"dz"+str(layers): yn - y}
for i in range(layers):
    grad["dw"+str(layers-i)] = np.dot(grad["dz"+str(layers-i)], cache["a"+str(layers-i-1)].T) / m
    grad["db"+str(layers-i)] = np.sum(grad["dz"+str(layers-i)], axis=1, keepdims=True) / m
    if i < layers-1:
        grad["dz"+str(layers-i-1)] = np.dot(param["w"+str(layers-i)].T, grad["dz"+str(layers-i)]) * lreluDer(cache["a"+str(layers-i-1)])

# gradient-descent parameter update
for i in range(layers):
    param["w"+str(i+1)] = param["w"+str(i+1)] - alpha*grad["dw"+str(i+1)]
    param["b"+str(i+1)] = param["b"+str(i+1)] - alpha*grad["db"+str(i+1)]
The implementation seems okay. While you could converge to the same value with different models/learning rates/hyperparameters, what's suspicious is that the loss starts at the same value, 6.98 in your case, every time.
I suspect it has to do with your initialisation. If you're setting all your weights to zero initially, you're not going to break symmetry. That is explained here and here in adequate detail.
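As a sketch, something like this (reusing the param / "w" / "b" naming from your code; the layer sizes are just an example) would break the symmetry:

import numpy as np

def init_params(layer_dims, seed=0):
    # Small random weights break the symmetry; biases can stay at zero.
    rng = np.random.default_rng(seed)
    param = {}
    for i in range(1, len(layer_dims)):
        param["w"+str(i)] = rng.standard_normal((layer_dims[i], layer_dims[i-1])) * 0.01
        param["b"+str(i)] = np.zeros((layer_dims[i], 1))
    return param

param = init_params([784, 128, 64, 10])   # e.g. for MNIST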
I'm developing a project which uses the backpropagation algorithm, so I'm learning how backpropagation is done in scikit-learn.
mlp = MLPClassifier(hidden_layer_sizes=(hiddenLayerSize,), solver='lbfgs', learning_rate='constant',learning_rate_init=0.001, max_iter=100000, random_state=1)
There are different solver options such as lbfgs, adam and sgd, and also activation options. Are there any best practices about which options should be used for backpropagation?
solver is the argument that sets the optimization algorithm here. In general, setting sgd (stochastic gradient descent) works best, and it also achieves faster convergence. When using sgd, apart from setting the learning_rate you also need to set the momentum argument (the default value of 0.9 works).
The activation option is there to introduce non-linearity into the model. If your model has many layers, you have to use an activation function such as relu (rectified linear unit) to introduce non-linearity; otherwise using multiple layers becomes useless. relu is the simplest and one of the most useful activation functions.
Another thing to consider is that the learning rate should not be too large when the activation function is ReLU.
The main issue with the ReLU function is the so-called "dying ReLU" problem: a neuron is considered dead when it gets stuck on the negative side, and this is most likely to occur when the learning rate is too large.
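Putting the suggestions together, a configuration along these lines might be a reasonable starting point (the hidden-layer size and the other values are illustrative, not tuned):

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    solver='sgd',
    activation='relu',
    learning_rate='constant',
    learning_rate_init=0.001,   # keep this modest to avoid dying ReLU units
    momentum=0.9,
    max_iter=100000,
    random_state=1,
)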