SGD optimiser graph

SGD optimiser graph - python

I just wanted to ask a quick question. I understand that val_loss and train_loss is insufficient to tell if the model is overfitting. However, i wish to use it as a rough gauge by monitoring if the val_loss is increasing. As i use SGD optimiser, i seem to have 2 different trends based on the smoothing value. Which should i use? Blue is val_loss and Orange is train_loss.
From smoothing = 0.999, both seems to be decreasing but from smoothing = 0.927, val_loss seems to be increasing. Thank you for reading!
Also, when is a good time to decrease the learning rate? Is it directly before the model overfits?
Smoothing = 0.999
Smoothing = 0.927

In my experience with DL as applied to CNNs, overfitting is tied more to the difference in train/val accuracies/losses rather than just one or the other. In your graphs, it's clear that the difference in loss is increasing as time goes on, showing that your model does not generalize well to the dataset, and hence shows signs of overfitting. It would also help for you to track classification accuracy on train and val datasets if possible--this will show you the generalization error which acts as a similar metric but might show more visible effects.
Dropping the learning rate once the loss starts to even out and overfitting begins is a good idea; however you may find better gains for your generalization if you adjust the net's complexity to better fit the dataset first. For such overfitting, a modest decrease in complexity may help--use the difference in train/val losses and accuracies to confirm.

Related

Why does train data performance deteriorate dramatically?

I am training a binary classifier model that classifies between disease and non-disease.
When I run the model, training loss decreased and auc, acc, get increased.
But, after certain epoch train loss increased and auc, acc were decreased.
I don't know why training performance got decreased after certain epoch.
I used general 1d cnn model and methods, details here:
I tried already to:
batch shuffle
introduce class weights
loss change (binary_crossentropy > BinaryFocalLoss)
learning_rate change

Two questions for you going forward.
Does the training and validation accuracy keep dropping - when you would just let it run for let's say 100 epochs? Definitely something I would try.
Which optimizer are you using? SGD? ADAM?
How large is your dropout, maybe this value is too large. Try without and check whether the behavior is still the same.
It might also be the optimizer
As you do not seem to augment (this could be a potential issue if you do by accident break some label affiliation) your data, each epoch should see similar gradients. Thus I guess, at this point in your optimization process, the learning rate and thus the update step is not adjusted properly - hence not allowing to further progress into that local optimum, and rather overstepping the minimum while at the same time decreasing training and validation performance.
This is an intuitive explanation and the next things I would try are:
Scheduling the learning rate
Using a more sophisticated optimizer (starting with ADAM if you are not already using it)

Your model is overfitting. This is why your accuracy increases and then begins decreasing. You need to implement Early Stopping to stop at the Epoch with the best results. You should also implement dropout layers.

What does a sudden increase in accuracy during epoch training show about my model?

I am learning Convolution Neural Network now and practicing it on kaggle digit recognizer (MNIST) dataset.
While training the data, I noticed that inspite of initial gradually growing accuracy, in between there was a huge jump i.e from 0.8984 to 0.9814.
As a beginner, I want to investigate what does this jump really show about my model. Here is the image of the epochs:
enter image description here
I have circled the jump in yellow. Thanks in advance!

As the loss gradually starts to decrease, this create an impact on fitting of the model. The cost function makes the loss go down, which directly creates an impact on the fitting of model. Better the fitting of model into training data, better the accuracy (which we can easily see as the accuracy increases with the reduction in loss). There is almost a difference of 0.08 in your consecutive loss function which is enough for the model to fit more from the current state.
Now as the model progresses, we try it on the testing dataset because the real world data is nothing like the data we trained it on.
However, a higher accuracy might not always be good as the model is considered to be over-evaluated which is also known as overfitting which means the model is performing too well that it can't handle any little changes. Therefore, a correct balance between learning rate and epochs are required in order to predict the classes correctly. It also depends on the architecture, Optimizing function which make sure the oscillations are low and numerous other things.

Approximately periodic jumps in TensorFlow model loss

I am using tensorflow.keras to train a CNN in an image recognition problem, using the Adam minimiser to minimise a custom loss (some code is at the bottom of the question). I am experimenting with how much data I need to use in my training set, and thought I should look into whether each of my models have properly converged. However, when plotting loss vs number of epochs of training for different training set fractions, I noticed approximately periodic spikes in the loss function, as in the plot below. Here, the different lines show different training set sizes as a fraction of my total dataset.
As I decrease the size of the training set (blue -> orange -> green), the frequency of these spikes appears to decrease, though the amplitude appears to increase. Intuitively, I would associate this kind of behaviour with a minimiser jumping out of a local minimum, but I am not experienced enough with TensorFlow/CNNs to know if that is the correct way to interpret this behaviour. Equally, I can't quite understand the variation with training set size.
Can anyone help me to understand this behaviour? And should I be concerned by these features?
from quasarnet.models import QuasarNET, custom_loss
from tensorflow.keras.optimizers import Adam
...
model = QuasarNET(
X[0,:,None].shape,
nlines=len(args.lines)+len(args.lines_bal)
)
loss = []
for i in args.lines:
loss.append(custom_loss)
for i in args.lines_bal:
loss.append(custom_loss)
adam = Adam(decay=0.)
model.compile(optimizer=adam, loss=loss, metrics=[])
box, sample_weight = io.objective(z,Y,bal,lines=args.lines,
lines_bal=args.lines_bal)
print( "starting fit")
history = model.fit(X[:,:,None], box,
epochs = args.epochs,
batch_size = 256,
sample_weight = sample_weight)

Following some discussion from a colleague, I believe that we have solved this problem. As a default, the Adam minimiser uses an adaptive learning rate that is inversely proportional to the variance of the gradient in its recent history. When the loss starts to flatten out, the variance of the gradient decreases, and so the minimiser increases the learning rate. This can happen quite drastically, causing the minimiser to "jump" to a higher loss point in parameter space.
You can avoid this by setting amsgrad=True when initialising the minimiser (http://www.satyenkale.com/papers/amsgrad.pdf). This prevents the learning rate from increasing in this way, and thus results in better convergence. The (somewhat basic) plot below shows loss vs number of training epochs for the normal setup, as in the original question (norm loss) compared to the loss when setting amsgrad=True in the minimiser (amsgrad loss).
Clearly, the loss function is much better behaved with amsgrad=True, and, with more epochs of training, should result in a stable convergence.

Validation accuracy increasing but validation loss is also increasing

I am using a CNN network to classify images into 5 classes. The size of my dataset is around 370K. I am using Adam optimizer with learning rate 0.0001 and batch size of 32. Surprisingly, I am getting improvement in validation accuracy over the epochs but validation loss is constantly growing.
I am assuming that the model is becoming less and less unsure about the validation set but the accuracy is more because the value of softmax output is more than the threshold value.
What can be the reason behind this? Any help in this regard would be highly appreciated.

I think this is a case of overfitting, as previous comments pointed out. Overfitting can be the result of high variance in the dataset. When you trained the CNN it showed a good ratio towards the decreasing of training error, producing a more complex model. More complex models produce overfitting and it can be noted when validation error tends to increase.
Adam optimizer is taking care of the learning rate, exponential decay and in general of the optimization of the model, but it won't take any action against overfitting. If you want to reduce it (overfitting), you will need to add a regularization technique which will penalize large values of the weights in the model.
You can read more details about this in the deep learning book: http://www.deeplearningbook.org/contents/regularization.html

training loss decreases while dev loss increases

I'm observing the following patterns in a one-layer CNN, binary classification model:
Training loss decreases while dev loss increased with number of steps
Training accuracy increases while dev accuracy decreases with number of steps
Based on past SO questions and literature review, it seems that these patterns are indicative of over-fitting (the model performs well in training, but cannot generalize to new examples).
The graphs below illustrate the loss and accuracy with respect to the number of steps in training.
In both,
The orange line represents the summary of the dev set performance.
The blue line represents the summary of the training set performance.
Loss:
Accuracy:
Traditional remedies I've considered, and my observations about them:
Adding L2 Regularization : I've tried many coefficients of L2 regularization -- from 0.0 to 4.5; all of these tests yield a similar pattern by the 5,000th step in both loss and accuracy.
Cross validation : It seems that the role of cross-validation is widely mis-understood online. As this answer states, cross-validation is for model checking, not model building. Indeed, cross-validation would be a way to check if the model generalizes well. And actually, the graphs I show are from one fold of a 4-fold cross-validation. If I observe a similar pattern in the loss/accuracy in all the folds, what other insight does cross-validation offer other than the confirmation that the model does not generalize well?
Early stopping : This would seem the most intuitive, but the loss graph seems to indicate that the loss levels out only after a divergence in the dev set loss is observed; the starting point of this early stop, then, doesn't seem easy to decide.
Data : The amount of labeled data I have available is limited, so training on more data is not an option right now.
All this said, what I am asking is:
If the patterns observed in the loss and accuracy are indeed indicative of over-fitting, are there any other methods to counteract over-fitting that I haven't considered?
If these patterns are not indicative of over-fitting, what else could they mean?
Thanks -- any insight would be much appreciated.

I think that you are totally on the right track. Looks like classic over-fitting.
One option is adding dropout if you don't already have it. It falls into the category of regularization, but it is more commonly used now then L1 and L2 regularization.
Changing the model architcture could get better results but it's hard to say what specifically would be best. It could help to make it deeper with more layers and possibly some pooling layers. It will likely still overfit but you might get a higher accuracy on the dev set before that happens.
Getting more data may be one of the best things you could do. If you can't get more data you can try to augment the data. You can also try cleaning the data to remove noise which can help prevent the model from fitting to noise.
You may ultimately want to try setting up a hyperparameter optimization search. This, however, can take a while on neural nets which take a while to train. Make sure you remove a test set before hyper parameter tuning.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.