Why does NN training loss flatten?

Why does NN training loss flatten? - python

I implemented a deep learning neural network from scratch without using any python frameworks like tensorflow or keras.
The problem is no matter what i change in my code like adjusting learning rate or changing layers or changing no. of nodes or changing activation functions from sigmoid to relu to leaky relu, i end up with a training loss that starts with 6.98 but always converges to 3.24...
Why is that?
Please review my forward and back prop algorithms.Maybe there's something wrong in that which i couldn't identify.
My hidden layers use leaky relu and final layer uses sigmoid activation.
Im trying to classify the mnist handwritten digits.
code:
#FORWARDPROPAGATION
for i in range(layers-1):
cache["a"+str(i+1)]=lrelu((np.dot(param["w"+str(i+1)],cache["a"+str(i)]))+param["b"+str(i+1)])
cache["a"+str(layers)]=sigmoid((np.dot(param["w"+str(layers)],cache["a"+str(layers-1)]))+param["b"+str(layers)])
yn=cache["a"+str(layers)]
m=X.shape[1]
cost=-np.sum((y*np.log(yn)+(1-y)*np.log(1-yn)))/m
if j%10==0:
print(cost)
costs.append(cost)
#BACKPROPAGATION
grad={"dz"+str(layers):yn-y}
for i in range(layers):
grad["dw"+str(layers-i)]=np.dot(grad["dz"+str(layers-i)],cache["a"+str(layers-i-1)].T)/m
grad["db"+str(layers-i)]=np.sum(grad["dz"+str(layers-i)],1,keepdims=True)/m
if i<layers-1:
grad["dz"+str(layers-i-1)]=np.dot(param["w"+str(layers-i)].T,grad["dz"+str(layers-i)])*lreluDer(cache["a"+str(layers-i-1)])
for i in range(layers):
param["w"+str(i+1)]=param["w"+str(i+1)] - alpha*grad["dw"+str(i+1)]
param["b"+str(i+1)]=param["b"+str(i+1)] - alpha*grad["db"+str(i+1)]

The implementation seems okay. While you could converge to the same value with different models/learning rate/hyper parameters, what's frightening is having the same starting value everytime, 6.98 in your case.
I suspect it has to do with your initialisation. If you're setting all your weights initially to zero, you're not gonna break symmetry. That is explained here and here in adequate detail.

Related

analyze the train-validation accuracy learning curve

I am building a two-layer neural network from scratch on the Fashion MNIST dataset. In between, using the RELU as activation and on the last layer, I am using softmax cross entropy. I am getting the below learning curve between train and validation accuracy which is wrong obviously. But if you see my loss curve, it's decreasing but my model is not learning. I am not able to my head around where I am going wrong. Could anyone explain these two graphs, like where I could be possibly going wrong?

I don't know exactly what you are doing, and I don't know anything about your architecture, but it's wrong to use ReLU on the last layer.
Usually you leave the last layer as linear (no activation). This will produce the logits that enter the Softmax. The output of the softmax will try to approximate the probability distribution on the classes.
This could be a reason for your results.

Exercise 4 by Andrew Ng in Keras.

I am studying some machine learning on my own and I am practicing (in Python) with the assignments of the course held by Andrew Ng.
After completing the fourth exercise by hand, I tought to do it in Keras to practice with the library.
In the exercise we have 5000 images of hand written digits, going from 0 to 9. Each image is a 20x20 matrix. The dataset is stored in a matrix X of shape 5000x400 (each image has been 'unrolled') and the labels are stored in a matrix y of shape 5000x10. Each row of y is a hot-one vector.
The exercise asks to implement backpropagation to maximaze the log likelihood, for a simple neural network with one input layer, one hidden layer and one output layer. The hidden layer has 25 neurons and the output layer 10. We use sigmoid as activation for both layers.
My code in Keras is this
model=Sequential()
model.add(Dense(25,input_shape=(400,),use_bias=True,kernel_regularizer=regularizers.l2(1),activation='sigmoid',kernel_initializer='glorot_uniform'))
model.add(Dense(10,use_bias=True,kernel_regularizer=regularizers.l2(1),activation='sigmoid',kernel_initializer='glorot_uniform'))
model.compile(loss='categorical_crossentropy',optimizer='sgd',metrics=['accuracy'])
model.fit(X, y, batch_size=5000,epochs=100, verbose=1)
Since I want this to be as similar as possible to the assignment I have used the same initial weights as the assignment, the same regularization parameter, the same activations and gradient descent as a optimizer (actually the assignment uses the Truncated Newton Method but I don't think my problem lies here).
I thought I was doing everything correctly but when I train the network I get a 10% accuracy on the training dataset. Even playing a little bit with the parameters the accuracy doesn't change much. To try to understand better the problem I tested it with smaller pieces of the dataset. For instance if I select a subdataset of 100 elements containing x images of zero and 100-x images of one, I get a x% training accuracy. My guess is that the network is optimizing the parameters to recognise only the first digit.
Now my questions are: what I am missing? Why isn't this the right implementation of the neural network described above?

If you are practising on the MNIST dataset, to classify 10 digits, you have 10 classes to predict. Rather than sigmoid, you should use ReLU in the hidden layers ( in your case the first layer ) and use softmax activation on the output layer. Use categorical crossentropy loss function with adam or sgd optimizer.

Why does my TensorFlow NN model's predicted values have upper limit?

I have a neural network with three layers. I've tried using tanh and sigmoid functions for my activations and then the output layer is just a simple linear function (I'm trying to model a regression problem).
For some reason my model seems to have a hard cut off where it will never predict a value above some threshold (even though it should). What reason could there be for this?
Here is what predictions from the model look like (with sigmoid activations):
update:
With relu activation, and switching from gradient descent to Adam, and adding L2 regularization... the model predicts same value for every input...

A linear layer regressing a single value will have outputs of the form
output = bias + sum(kernel * inputs)
If inputs comes from a tanh, then -1 <= inputs <= 1, and hence
bias - sum(abs(kernel)) <= output <= bias + sum(abs(kernel))
If you want an unbounded output, consider using an unbounded activation on all intermediate layers, e.g. relu.

I think your problem concerns the generalization/expressiveness of the model. Regression is a basic task, there should be no problem with the method itself, but problem with the execution. #DomJack explained how output is restricted for a specific set of parameters, but that only happens for anomaly data. In general, when training parameters would be tuned so that it will predict output correctly.
So first point is about the quality of training data. Make sure you have large enough training data (and it is split randomly if you split train/test from one dataset). Also, maybe trivial, but make sure you didn't mess up input/output value in preprocessing.
Another point is about the size of the network. Make sure you use large enough hidden layer.

Homemade deep learning library: numerical issue with relu activation

For the sake of learning the finer details of a deep learning neural network, I have coded my own library with everything (optimizer, layers, activations, cost function) homemade.
It seems to work fine when benchmarking in on the MNIST dataset, and using only sigmoid activation functions.
Unfortunately I seem to get issues when replacing these with relus.
This is what my learning curve looks like for 50 epochs on a training dataset of ~500 examples:
Everything is fine for the first ~8 epochs and then I get a complete collapse on the score of a dummy classifier (~0.1 accuracy). I checked the code of the relu and it seems fine. Here are my forward and backward passes:
def fprop(self, inputs):
return np.maximum( inputs, 0.)
def bprop(self, inputs, outputs, grads_wrt_outputs):
derivative = (outputs > 0).astype( float)
return derivative * grads_wrt_outputs
The culprit seems to be in the numerical stability of the relu. I tried different learning rates and many parameter initializers for the same result. Tanh and sigmoid work properly. Is this a known issue? Is it a consequence of non-continuous derivative of the relu function?

Yes, it's quite possible that the ReLU's are to blame. Most of the classic perceptron-based models, including ConvNet (the classic MNIST trainer), depend on both positive and negative weights for their training accuracy. ReLU ignores the negative characteristics, thus detracting from the model's capabilities.
ReLU is better-suited for convolution layers; it's a filter that says, "If the kernel isn't excited about this part of the input, I don't care how deep the boredom goes; just ignore it." MNIST training depends on counter-correction, allowing nodes to say "No, this isn't good, run the other way!"

Where to apply batch normalization on standard CNNs

I have the following architecture:
Conv1
Relu1
Pooling1
Conv2
Relu2
Pooling3
FullyConnect1
FullyConnect2
My question is, where do I apply batch normalization? And what would be the best function to do this in TensorFlow?

The original batch-norm paper prescribes using the batch-norm before ReLU activation. But there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:
... I can guarantee that recent code written by Christian [Szegedy]
applies relu
before BN. It is still occasionally a topic of debate, though.
To your second question: in tensorflow, you can use a high-level tf.layers.batch_normalization function, or a low-level tf.nn.batch_normalization.

There's some debate on this question. This Stack Overflow thread and this keras thread are examples of the debate. Andrew Ng says that batch normalization should be applied immediately before the non-linearity of the current layer. The authors of the BN paper said that as well, but now according to François Chollet on the keras thread, the BN paper authors use BN after the activation layer. On the other hand, there are some benchmarks such as the one discussed on this torch-residual-networks github issue that show BN performing better after the activation layers.
My current opinion (open to being corrected) is that you should do BN after the activation layer, and if you have the budget for it and are trying to squeeze out extra accuracy, try before the activation layer.
So adding Batch Normalization to your CNN would look like this:
Conv1
Relu1
BatchNormalization
Pooling1
Conv2
Relu2
BatchNormalization
Pooling3
FullyConnect1
BatchNormalization
FullyConnect2
BatchNormalization

In addition to the original paper using batch normalization before the activation, Bengio's book Deep Learning, section 8.7.1 gives some reasoning for why applying batch normalization after the activation (or directly before the input to the next layer) may cause some issues:
It is natural to wonder whether we should apply batch normalization to
the input X, or to the transformed value XW+b. Ioﬀe and Szegedy (2015)
recommend the latter. More speciﬁcally, XW+b should be replaced by a
normalized version of XW. The bias term should be omitted because it
becomes redundant with the β parameter applied by the batch
normalization reparameterization. The input to a layer is usually the
output of a nonlinear activation function such as the rectiﬁed linear
function in a previous layer. The statistics of the input are thus
more non-Gaussian and less amenable to standardization by linear
operations.
In other words, if we use a relu activation, all negative values are mapped to zero. This will likely result in a mean value that is already very close to zero, but the distribution of the remaining data will be heavily skewed to the right. Trying to normalize that data to a nice bell-shaped curve probably won't give the best results. For activations outside of the relu family this may not be as big of an issue.
Some report better results when placing batch normalization after activation, while others get better results with batch normalization before activation. It's an open debate. I suggest that you test your model using both configurations, and if batch normalization after activation gives a significant decrease in validation loss, use that configuration instead.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why does NN training loss flatten? - python

Related

analyze the train-validation accuracy learning curve

Exercise 4 by Andrew Ng in Keras.

Why does my TensorFlow NN model's predicted values have upper limit?

Homemade deep learning library: numerical issue with relu activation

Where to apply batch normalization on standard CNNs

Categories

Resources