I'm trying to use different optimizers with scikit-learn's MLPClassifier. As far as the docs show, there are only a few solvers (MLPClassifier's parameter for choosing the optimizer) available, which are:
‘lbfgs’ is an optimizer in the family of quasi-Newton methods.
‘sgd’ refers to stochastic gradient descent.
‘adam’ refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba
What I'm trying to use is another one called Nadam. I've tried to mix parameters to approximate it, using solver='adam' with nesterovs_momentum=True, but sklearn's docs say that the latter parameter is only used for sgd:
Whether to use Nesterov’s momentum. Only used when solver=’sgd’ and momentum > 0.
I've tried the above because I thought it matched what Keras's Nadam docs state:
Much like Adam is essentially RMSprop with momentum, Nadam is RMSprop with Nesterov momentum.
Because of all this, I don't think I'm doing the right thing. My code below shows what I've done so far.
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(solver='adam', nesterovs_momentum=True)
You cannot supply nesterovs_momentum to the adam optimizer.
To see this, have a look at the BaseMultilayerPerceptron code on GitHub.
Compare the params accepted by SGDOptimizer with those accepted by AdamOptimizer.
You will see that adam simply doesn't accept nesterovs_momentum.
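If you want to check this yourself without reading the full source, you can inspect the constructor signatures of the two internal optimizer classes. A quick sketch; the private module path below is from recent sklearn versions and may differ in yours:

import inspect
from sklearn.neural_network._stochastic_optimizers import (
    SGDOptimizer,
    AdamOptimizer,
)

# SGDOptimizer exposes momentum and nesterov arguments ...
print(inspect.signature(SGDOptimizer.__init__))
# ... while AdamOptimizer only takes beta_1, beta_2 and epsilon,
# so there is nothing Nesterov-related to switch on for adam.
print(inspect.signature(AdamOptimizer.__init__))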
I have seen code like
torch.optim.SGD(model.parameters(), lr=lr)
for example in the Transformer tutorial, but I have also seen code like
optimizer = SGD((p for p in model.parameters() if p.requires_grad), lr=lr)
From the source code of Torch's SGD optimizer class, SGD filters for and modifies only parameters whose grad is not None*
Is it necessary to filter for only the parameters that require gradients? Is there any advantage to filtering, for example in terms of performance?
How does PyTorch implement gradient updates that are conditional on parameters not being frozen in its optimizers?
*A superficial inspection of the optimizers' implementations doesn't show such obvious filtering of the parameter set fed to the workhorse function of the respective optimizer, for example the sgd function for the SGD class.
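For reference, the skip the footnote alludes to looks roughly like this (a simplified sketch of a vanilla SGD step, not the actual torch source):

import torch

def naive_sgd_step(params, lr):
    # Parameters whose .grad is None (e.g. frozen or unused ones)
    # are skipped here, which is why pre-filtering on requires_grad
    # is not strictly necessary for correctness.
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            p.add_(p.grad, alpha=-lr)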
The PyTorch RNN tutorial uses
for p in net.parameters():
    p.data.add_(p.grad.data, alpha=-learning_rate)
as its optimizer. Does anyone know the difference between doing that and the classical optimizer.step(), once an optimizer has been defined explicitly? Are there any special considerations one has to take into account regarding the optimizer when training RNNs?
It looks like the example uses a simple gradient descent algorithm to update:

θ ← θ − α · ∂J/∂θ

where J is the cost and α is the learning rate.
If the optimizer you're using is a simple gradient descent tool, then there is no difference between using optimizer.step() and the code in the example.
I know that's not a super exciting answer to your question, because it depends on how the step() function is written. Check out this page to learn about step() and this page to learn more about torch.optim.
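To make the equivalence concrete, here is a minimal sketch (the model, data, and learning rate are placeholders, not from the tutorial):

import torch

net = torch.nn.Linear(4, 2)                  # placeholder model
optimizer = torch.optim.SGD(net.parameters(), lr=0.005)

x = torch.randn(8, 4)
loss = net(x).pow(2).mean()                  # placeholder loss
loss.backward()

# Option 1: let the optimizer apply the update.
optimizer.step()

# Option 2: apply the same update by hand (do one or the other, not both).
with torch.no_grad():
    for p in net.parameters():
        if p.grad is not None:
            p.add_(p.grad, alpha=-0.005)

For plain SGD with no momentum, weight decay, or dampening, both options perform exactly the same parameter update; a difference only appears once the optimizer carries extra state.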
I have been trying to implement L1 regularization in TensorFlow using the l1_regularization_strength parameter of the ProximalAdagradOptimizer function from TensorFlow. (I am using this optimizer specifically to get a sparse solution.) I have two questions regarding the regularization.
Does the l1-regularization used in the optimizer apply to forward and backward propagation for a neural network or only the back propagation?
Is there a way to break down the optimizer so the regularization only applies to specific layers in the network?
Regularization applies neither to the forward nor the backward pass but to the weight updates: with a proximal optimizer, the L1 shrinkage is applied to the weights as part of the update step itself.
You can use different optimizers for different layers by explicitly passing each optimizer the variables it should minimize (the var_list argument of minimize()).
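As a minimal TF 1.x-style sketch of that idea (the layer names, shapes, and hyperparameters here are placeholders, not from the question):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])

with tf.variable_scope('layer1'):
    h = tf.layers.dense(x, 5, activation=tf.nn.relu)
with tf.variable_scope('layer2'):
    out = tf.layers.dense(h, 1)

loss = tf.reduce_mean(tf.square(out - y))

layer1_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='layer1')
layer2_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='layer2')

# L1 proximal updates (sparsity) only for layer1 ...
sparse_opt = tf.train.ProximalAdagradOptimizer(
    learning_rate=0.01, l1_regularization_strength=0.001)
# ... and plain Adagrad for layer2.
dense_opt = tf.train.AdagradOptimizer(learning_rate=0.01)

train_op = tf.group(
    sparse_opt.minimize(loss, var_list=layer1_vars),
    dense_opt.minimize(loss, var_list=layer2_vars),
)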
I am attempting to perform cross validation on an SGD model in PySpark. I am working with LinearRegressionWithSGD from pyspark.mllib.regression, and with ParamGridBuilder and CrossValidator, both from the pyspark.ml.tuning library.
After following the documentation on the Spark website, I was hoping running this would work:
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.mllib.regression import LinearRegressionWithSGD

lr = LinearRegressionWithSGD()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder() \
    .addGrid(lr.stepSize, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=10)
But LinearRegressionWithSGD() does not have a stepSize attribute (I tried others with no luck either).
I can set lr to LinearRegression, but then I am unable to use SGD for my model and cross validate.
There is a kFold method in the Scala API, but I am not sure how to access it from PySpark.
You can use the step parameter of LinearRegressionWithSGD to define your step size, but that will not make your code work, because you are mixing incompatible libraries: pyspark.ml and pyspark.mllib. Specifically, you cannot use LinearRegressionWithSGD with the pyspark.ml library; you have to use pyspark.ml.regression.LinearRegression. Unfortunately, I do not know how to do cross validation with the ml library using SGD optimization, and I would like to know myself.
The good news is that you can set the solver attribute of pyspark.ml.regression.LinearRegression to 'gd', so you can probably configure the 'gd' solver to run as SGD, but I am not sure where the solver documentation is or how to set the solver's attributes (e.g. the batch size). The API shows the LinearRegression object calling Param(), but I am not sure whether it is using the pyspark.mllib optimizer. If anyone knows how to set those solver attributes, that would answer your question, since it would let you use the Pipeline, ParamGridBuilder, and CrossValidator ml packages for model selection with LinearRegression using SGD optimization for parameter tuning.
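In the meantime, a minimal sketch of the ml-only route described above might look like this (the column names, grid values, and training_df are placeholders, not from your question):

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

lr = LinearRegression(featuresCol='features', labelCol='label')
pipeline = Pipeline(stages=[lr])

paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.maxIter, [10, 100]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=10)

# cvModel = crossval.fit(training_df)  # training_df: your labeled DataFrame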
Respectfully,
Shane
I'm developing a project which uses the backpropagation algorithm, so I'm learning how the backpropagation algorithm works in scikit-learn.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(hiddenLayerSize,),
                    solver='lbfgs',
                    learning_rate='constant',
                    learning_rate_init=0.001,
                    max_iter=100000,
                    random_state=1)
There are different solver options such as lbfgs, adam and sgd, and also activation options. Are there any best practices about which options should be used for backpropagation?
solver is the argument that sets the optimization algorithm here. In general, sgd (stochastic gradient descent) works well and also achieves faster convergence. When using sgd, apart from setting the learning_rate you also need to set the momentum argument (the default value of 0.9 works).
The activation option is there to introduce non-linearity into the model. If your model has many layers, you have to use an activation function such as relu (rectified linear unit) to introduce non-linearity; otherwise using multiple layers becomes useless, since a stack of purely linear layers collapses into a single linear transformation. relu is the simplest and one of the most useful activation functions.
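Putting those two recommendations together, a starting configuration could look like this (a sketch; the layer size and learning rate are illustrative, not tuned):

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(100,),
                    solver='sgd',              # stochastic gradient descent
                    momentum=0.9,              # the default, usually fine
                    learning_rate_init=0.001,
                    activation='relu',         # non-linearity between layers
                    max_iter=1000,
                    random_state=1)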
Another thing to consider is that the learning rate should not be too large when the activation function is ReLU.
The main issue with ReLU is the so-called 'dying ReLU' problem: a neuron is considered dead when it is stuck on the negative side, always outputting zero, so its gradient is zero and it stops learning. This is most likely to occur when the learning rate is too large.