MLPClassifier parameter setting - python

I'm developing a project which uses Backpropation algorithm. So I'm learning Backpropagation algorithm in scikit-learn.
mlp = MLPClassifier(hidden_layer_sizes=(hiddenLayerSize,), solver='lbfgs', learning_rate='constant',learning_rate_init=0.001, max_iter=100000, random_state=1)
There are different solver options as lbfgs, adam and sgd and also activation options. Are there any best practices about which option should be used for backpropagation?

solver is the argument to set the optimization algorithm here. In general setting sgd (stochastic gradient descent) works best, also it achieves faster convergence. While using sgd you apart from setting the learning_rate you also need to set the momentum argument (default value =0.9 works).
activation functions option is for, to introduce non-linearity of the model, if your model has many layers you have to use activation function such as relu (rectified linear unit) to introduce no-linearity, else using multiple layers become useless. relu is the most simplest and most useful activation function.

Another thing to consider is that the learning rate should not be too large when the activation function is ReLu.
The main issue with the ReLu function is the so called 'Dying Relu' problem. A neuron is considered dead when it is stuck in the negative side and it is most likely to occur when the learning rate is too large.

Related

How can I increase the VGG16 Model's Accurancy? (Underfitting or Overfitting Issues)

I have a problem about increasing the accuracy of the VGG16 model.
Even if I defined some Dense layers, I couldn't handle with it. Can you help me how to get the best result if you don't mind? I tried to use Dropout but I couldn't increase its accuracy. Can you look through it if you don't want to open this file?
I think it can be overfitting or Underfitting in terms of model's behaviour.
Here is my model shown below.
base_model=VGG16(
include_top=False,
weights="imagenet",
input_shape=(IMAGE_SIZE,IMAGE_SIZE,3))
#freeze the base model
base_model.trainable = False
model=Sequential()
model.add(base_model)
model.add(Flatten())
model.add(Dense(512,activation='relu'))
#model.add(Dropout(0.2))
model.add(Dense(256,activation='relu'))
#model.add(Dropout(0.2))
model.add(Dense(128,activation='relu'))
#model.add(Dropout(0.2))
model.add(Dense(num_classes,activation='softmax'))
model.summary()
Here is my project link : Project
There are a number of different things you can do, and that depends on your problem scope. What you are showing is the basic transfer learning model with a couple of dense layers.
Regularisation is one thing that you have done already by using Dropout, but you have turned it off. Other regularisation tools are L2, L1 regularisation to keep things simple. The other thing you can do is to use a lower learning rate, reduce the batch size, use batch normalisation or change the optimisation function, or all of the above at the same time.
Creating a Neural Network Model is the easy part. The more important and hard to master skill is optimising it to perform good on general data by tweaking each parameter untill you produce better results.
Try looking at these three guides (or other ones that you can find about hyperparameter optimisation) to understand more:
Optimising Neural Networks Where To Start?
A Comprehensive Guide on Neural Networks Performance Optimization
An Overview of Regularization Techniques in Deep Learning 
Dropout is a regularization technique that introduces noise in the network weights in order to avoid overspecialization of the network to the training inputs (overfitting). The way this kind of layer introduces noise is by mapping some trained weights to 0. The probability according to each weight will remain untouched is given by the dropout value (the input value to the Dropout layer).
In your case, your dropout value set to 0.2 will mean that ~80% of the weights will be changed into 0 for each layer, which largely reduces the training effectiveness of any neural model. Typically you want to keep this value high, but not equal to 1, otherwise no regularization (annihilation of weights) will be performed.
Try reintroducing Dropout with higher values (like 0.95, 0.9, 0.8, 0.7) and compare them without the regularization (dropout=1 or just comment the dropout layer line) and with excess of regularization (dropout=0.2).
This fix on dropout may come in handy for boosting your model performances.

Why does NN training loss flatten?

I implemented a deep learning neural network from scratch without using any python frameworks like tensorflow or keras.
The problem is no matter what i change in my code like adjusting learning rate or changing layers or changing no. of nodes or changing activation functions from sigmoid to relu to leaky relu, i end up with a training loss that starts with 6.98 but always converges to 3.24...
Why is that?
Please review my forward and back prop algorithms.Maybe there's something wrong in that which i couldn't identify.
My hidden layers use leaky relu and final layer uses sigmoid activation.
Im trying to classify the mnist handwritten digits.
code:
#FORWARDPROPAGATION
for i in range(layers-1):
cache["a"+str(i+1)]=lrelu((np.dot(param["w"+str(i+1)],cache["a"+str(i)]))+param["b"+str(i+1)])
cache["a"+str(layers)]=sigmoid((np.dot(param["w"+str(layers)],cache["a"+str(layers-1)]))+param["b"+str(layers)])
yn=cache["a"+str(layers)]
m=X.shape[1]
cost=-np.sum((y*np.log(yn)+(1-y)*np.log(1-yn)))/m
if j%10==0:
print(cost)
costs.append(cost)
#BACKPROPAGATION
grad={"dz"+str(layers):yn-y}
for i in range(layers):
grad["dw"+str(layers-i)]=np.dot(grad["dz"+str(layers-i)],cache["a"+str(layers-i-1)].T)/m
grad["db"+str(layers-i)]=np.sum(grad["dz"+str(layers-i)],1,keepdims=True)/m
if i<layers-1:
grad["dz"+str(layers-i-1)]=np.dot(param["w"+str(layers-i)].T,grad["dz"+str(layers-i)])*lreluDer(cache["a"+str(layers-i-1)])
for i in range(layers):
param["w"+str(i+1)]=param["w"+str(i+1)] - alpha*grad["dw"+str(i+1)]
param["b"+str(i+1)]=param["b"+str(i+1)] - alpha*grad["db"+str(i+1)]
The implementation seems okay. While you could converge to the same value with different models/learning rate/hyper parameters, what's frightening is having the same starting value everytime, 6.98 in your case.
I suspect it has to do with your initialisation. If you're setting all your weights initially to zero, you're not gonna break symmetry. That is explained here and here in adequate detail.

Regularization Application in the Proximal Adagrad Optimizer in Tensorflow

I have been trying to implement l1-regularization in Tensorflow using the l1_regularization_strength parameter in the ProximalAdagradOptimizer function from Tensorflow. (I am using this optimizer specifically to get a sparse solution.) I have two questions regarding the regularization.
Does the l1-regularization used in the optimizer apply to forward and backward propagation for a neural network or only the back propagation?
Is there a way to break down the optimizer so the regularization only applies to specific layers in the network?
Regularization applies neither to forward or backpropagation but to the weight updates.
You can use different optimizers for different layers by explicitly passing the variables to minimize to each optimizer.

Homemade deep learning library: numerical issue with relu activation

For the sake of learning the finer details of a deep learning neural network, I have coded my own library with everything (optimizer, layers, activations, cost function) homemade.
It seems to work fine when benchmarking in on the MNIST dataset, and using only sigmoid activation functions.
Unfortunately I seem to get issues when replacing these with relus.
This is what my learning curve looks like for 50 epochs on a training dataset of ~500 examples:
Everything is fine for the first ~8 epochs and then I get a complete collapse on the score of a dummy classifier (~0.1 accuracy). I checked the code of the relu and it seems fine. Here are my forward and backward passes:
def fprop(self, inputs):
return np.maximum( inputs, 0.)
def bprop(self, inputs, outputs, grads_wrt_outputs):
derivative = (outputs > 0).astype( float)
return derivative * grads_wrt_outputs
The culprit seems to be in the numerical stability of the relu. I tried different learning rates and many parameter initializers for the same result. Tanh and sigmoid work properly. Is this a known issue? Is it a consequence of non-continuous derivative of the relu function?
Yes, it's quite possible that the ReLU's are to blame. Most of the classic perceptron-based models, including ConvNet (the classic MNIST trainer), depend on both positive and negative weights for their training accuracy. ReLU ignores the negative characteristics, thus detracting from the model's capabilities.
ReLU is better-suited for convolution layers; it's a filter that says, "If the kernel isn't excited about this part of the input, I don't care how deep the boredom goes; just ignore it." MNIST training depends on counter-correction, allowing nodes to say "No, this isn't good, run the other way!"

Where to apply batch normalization on standard CNNs

I have the following architecture:
Conv1
Relu1
Pooling1
Conv2
Relu2
Pooling3
FullyConnect1
FullyConnect2
My question is, where do I apply batch normalization? And what would be the best function to do this in TensorFlow?
The original batch-norm paper prescribes using the batch-norm before ReLU activation. But there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:
... I can guarantee that recent code written by Christian [Szegedy]
applies relu
before BN. It is still occasionally a topic of debate, though.
To your second question: in tensorflow, you can use a high-level tf.layers.batch_normalization function, or a low-level tf.nn.batch_normalization.
There's some debate on this question. This Stack Overflow thread and this keras thread are examples of the debate. Andrew Ng says that batch normalization should be applied immediately before the non-linearity of the current layer. The authors of the BN paper said that as well, but now according to François Chollet on the keras thread, the BN paper authors use BN after the activation layer. On the other hand, there are some benchmarks such as the one discussed on this torch-residual-networks github issue that show BN performing better after the activation layers.
My current opinion (open to being corrected) is that you should do BN after the activation layer, and if you have the budget for it and are trying to squeeze out extra accuracy, try before the activation layer.
So adding Batch Normalization to your CNN would look like this:
Conv1
Relu1
BatchNormalization
Pooling1
Conv2
Relu2
BatchNormalization
Pooling3
FullyConnect1
BatchNormalization
FullyConnect2
BatchNormalization
In addition to the original paper using batch normalization before the activation, Bengio's book Deep Learning, section 8.7.1 gives some reasoning for why applying batch normalization after the activation (or directly before the input to the next layer) may cause some issues:
It is natural to wonder whether we should apply batch normalization to
the input X, or to the transformed value XW+b. Ioffe and Szegedy (2015)
recommend the latter. More specifically, XW+b should be replaced by a
normalized version of XW. The bias term should be omitted because it
becomes redundant with the β parameter applied by the batch
normalization reparameterization. The input to a layer is usually the
output of a nonlinear activation function such as the rectified linear
function in a previous layer. The statistics of the input are thus
more non-Gaussian and less amenable to standardization by linear
operations.
In other words, if we use a relu activation, all negative values are mapped to zero. This will likely result in a mean value that is already very close to zero, but the distribution of the remaining data will be heavily skewed to the right. Trying to normalize that data to a nice bell-shaped curve probably won't give the best results. For activations outside of the relu family this may not be as big of an issue.
Some report better results when placing batch normalization after activation, while others get better results with batch normalization before activation. It's an open debate. I suggest that you test your model using both configurations, and if batch normalization after activation gives a significant decrease in validation loss, use that configuration instead.

Categories

Resources