I have seen code like
torch.optim.SGD(model.parameters(), lr=lr)
for example in the Transformer tutorial, but I have also seen code like
optimizer = SGD((p for p in model.parameters() if p.requires_grad), lr=lr)
From the source code of PyTorch's SGD optimizer class, SGD filters for and updates only parameters whose grad is not None.*
Is it necessary to filter for only the parameters that require gradients? Is there any advantage to filtering, for example in terms of performance?
How does PyTorch implement gradient updates that are conditional on parameters not being frozen in its optimizers?
*A superficial inspection of the optimizers' implementations shows no such obvious filtering of the parameter set fed to the respective optimizer's workhorse function, for example the sgd function for the SGD class.
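A minimal sketch of the behavior in question, assuming standard PyTorch semantics (a frozen parameter's .grad stays None after backward(), and step() skips parameters with None grads); model and layer sizes are illustrative:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model.bias.requires_grad = False          # freeze the bias

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).sum()
loss.backward()

print(model.weight.grad is None)          # False: a gradient was computed
print(model.bias.grad is None)            # True: frozen, so .grad stays None
optimizer.step()                          # parameters with .grad None are skipped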
Related
I am interested in how to train a deep neural network with a custom loss function. I have seen posts on Stack Overflow, but they aren't answered. I have downloaded VGG16, frozen its weights, and added my own head. Now I want to train that network with a custom loss; how can I do that?
Here is a custom RMSE loss in PyTorch. I hope this gives you a concrete idea of how to implement a custom loss function. You must create a class that inherits from nn.Module and define the initialization and the forward pass.
import torch
import torch.nn as nn

class RMSELoss(nn.Module):
    def __init__(self, eps=1e-9):
        super().__init__()
        self.mse = nn.MSELoss()
        self.eps = eps  # small constant keeps the sqrt differentiable at zero

    def forward(self, yhat, y):
        loss = torch.sqrt(self.mse(yhat, y) + self.eps)
        return loss
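Usage is the same as for any built-in criterion; model, x, and y below stand for whatever your training loop provides:

criterion = RMSELoss()
loss = criterion(model(x), y)  # yhat = model(x), y is the target
loss.backward()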
You can simply define a function with two input parameters (true value, predicted value) and then calculate the loss from those values by your very own method.
Here is the coding sample:
def custom_loss(y_true, y_pred):
    return tf.losses.mean_squared_error(y_true, y_pred)
I have used MSE from the TensorFlow backend in this example, but you can do the calculation manually here instead.
Compile your model with this loss function.
model.compile(
    optimizer=your_optimizer,
    loss=custom_loss
)
You can also define your own customized metric to judge during the training.
def custom_metric(y_true, y_pred):
    return calculate_your_metric(y_true, y_pred)
Finally, compile with it,
model.compile(
    optimizer=your_optimizer,
    loss=custom_loss,
    metrics=[custom_metric]
)
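Putting the pieces together, here is a minimal end-to-end sketch, assuming tf.keras; the model, layer sizes, and random data are placeholders, and the loss/metric bodies use manual calculations:

import numpy as np
import tensorflow as tf

def custom_loss(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))   # manual MSE

def custom_metric(y_true, y_pred):
    return tf.reduce_mean(tf.abs(y_true - y_pred))      # manual MAE

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss=custom_loss, metrics=[custom_metric])
model.fit(np.random.rand(32, 4), np.random.rand(32, 1), epochs=1)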
There are several examples and repositories showing how to implement perceptual loss, which sounds like what you are referring to. Of course, you can generalize and learn from some of these approaches for different models depending on your problem. If you do so, I recommend writing about it and sharing. I don't see many examples other than ones using some pretrained VGG model, and breaking that mold might be a nice contribution! Anyway, you might find these other answers useful:
Implement perceptual loss with pretrained VGG using keras
VGG, perceptual loss in keras
I'm trying to use different optimizers for scikit-learn's MLPClassifier. As far as the docs show, there are only a few solvers (MLPClassifier's solver parameter) available, which are:
‘lbfgs’ is an optimizer in the family of quasi-Newton methods.
‘sgd’ refers to stochastic gradient descent.
‘adam’ refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba
What I'm trying to use is another one called Nadam. I've tried mixing some parameters to achieve it, with solver='adam' and nesterovs_momentum=True, but sklearn's docs say that the latter parameter is only used for sgd:
Whether to use Nesterov’s momentum. Only used when solver=’sgd’ and momentum > 0.
I tried the above because I thought it meant what Keras's Nadam docs state:
Much like Adam is essentially RMSprop with momentum, Nadam is RMSprop with Nesterov momentum.
Because of all this, I don't think I'm doing the right thing. My code below shows what I've done so far.
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='adam', nesterovs_momentum=True)
You cannot supply nesterovs_momentum to the adam optimizer.
To see this, have a look at the BaseMultilayerPerceptron code on GitHub.
Compare the params accepted by SGDOptimizer with those accepted by AdamOptimizer.
You will see that adam simply doesn't accept nesterovs_momentum.
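A quick way to check this yourself; note that the import below uses a private scikit-learn module whose path may differ between versions, so treat it as an assumption:

import inspect
from sklearn.neural_network._stochastic_optimizers import SGDOptimizer, AdamOptimizer

print(inspect.signature(SGDOptimizer.__init__))   # includes momentum and nesterov
print(inspect.signature(AdamOptimizer.__init__))  # only beta_1, beta_2, epsilon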
I finished building the DNN model for the Titanic dataset. Given that, how do I make predictions on X_test? My code can be accessed through my GitHub:
https://github.com/isaac-altair/Titanic-Dataset
Thanks
When you trained your model you asked TensorFlow to evaluate your train_op. Your train_op is your optimizer's update step, e.g.:
train_op = tf.train.AdamOptimizer(...).minimize(cost)
You ran something like this to train the model:
sess.run([train_op], feed_dict={x:data, y:labels})
The train_op depends on things like the gradients and the operations that update the weights, so all of these things happened when you ran the train_op.
At inference time you simply ask it to perform different calculations. You can have the optimizer defined, but if you don't ask it to run the optimizer, it won't perform any of the actions that the optimizer depends on. You probably have an output of the network called logits (you could call it anything, but logits is the most common name and the one seen in most tutorials). You might also have defined an op called accuracy which computes the accuracy of the batch. You can get the values of both with a similar request to TensorFlow:
sess.run([logits, accuracy], feed_dict={x:data, y:labels})
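For the actual predictions the question asks about, you would typically add an argmax over the logits and run just that op; logits and x are the hypothetical names from above, and X_test is your test matrix:

predictions = tf.argmax(logits, axis=1)                # class index per example
y_pred = sess.run(predictions, feed_dict={x: X_test})  # no train_op, so no weight updates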
Almost any tutorial will demonstrate this. My favorite tutorials are here: https://github.com/aymericdamien/TensorFlow-Examples
I have been trying to implement l1-regularization in Tensorflow using the l1_regularization_strength parameter in the ProximalAdagradOptimizer function from Tensorflow. (I am using this optimizer specifically to get a sparse solution.) I have two questions regarding the regularization.
Does the l1-regularization used in the optimizer apply to both forward and backward propagation for a neural network, or only to backpropagation?
Is there a way to break down the optimizer so the regularization only applies to specific layers in the network?
Regularization applies neither to forward nor to backward propagation, but to the weight updates.
You can use different optimizers for different layers by explicitly passing the variables to minimize to each optimizer.
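A sketch of that second point, assuming TF1-style graph code to match the question; the scope name 'output' and the variable loss are assumptions standing in for your own graph:

import tensorflow as tf

# 'output' is a hypothetical variable scope; loss is assumed defined elsewhere.
sparse_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='output')
dense_vars = [v for v in tf.trainable_variables() if v not in sparse_vars]

# Proximal (L1) updates only for the chosen layer's variables...
sparse_op = tf.train.ProximalAdagradOptimizer(
    0.01, l1_regularization_strength=0.001).minimize(loss, var_list=sparse_vars)
# ...and plain gradient descent for everything else.
dense_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss, var_list=dense_vars)
train_op = tf.group(sparse_op, dense_op)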
I'm developing a project which uses the backpropagation algorithm, so I'm learning about backpropagation in scikit-learn.
mlp = MLPClassifier(hidden_layer_sizes=(hiddenLayerSize,), solver='lbfgs', learning_rate='constant', learning_rate_init=0.001, max_iter=100000, random_state=1)
There are different solver options such as lbfgs, adam and sgd, and also activation options. Are there any best practices about which options should be used for backpropagation?
solver is the argument that sets the optimization algorithm here. In general sgd (stochastic gradient descent) works best and also achieves faster convergence. When using sgd, apart from setting the learning_rate you also need to set the momentum argument (the default value 0.9 works).
The activation option is there to introduce non-linearity into the model. If your model has many layers, you have to use an activation function such as relu (rectified linear unit) to introduce non-linearity; otherwise using multiple layers becomes useless. relu is the simplest and one of the most useful activation functions.
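Putting that advice into code, a possible starting point; the hyperparameter values are illustrative, not tuned:

from sklearn.neural_network import MLPClassifier

# Illustrative hyperparameters; tune for your data.
mlp = MLPClassifier(hidden_layer_sizes=(100,), solver='sgd',
                    learning_rate_init=0.001, momentum=0.9,
                    activation='relu', max_iter=1000, random_state=1)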
Another thing to consider is that the learning rate should not be too large when the activation function is ReLU.
The main issue with ReLU is the so-called 'dying ReLU' problem: a neuron is considered dead when it is stuck on the negative side, and this is most likely to occur when the learning rate is too large.