I am using tensorflow.keras to train a CNN on an image recognition problem, using the Adam minimiser to minimise a custom loss (some code is at the bottom of the question). I am experimenting with how much data I need in my training set, and thought I should check whether each of my models has properly converged. However, when plotting loss vs number of epochs of training for different training set fractions, I noticed approximately periodic spikes in the loss function, as in the plot below. Here, the different lines show different training set sizes as a fraction of my total dataset.
As I decrease the size of the training set (blue -> orange -> green), the frequency of these spikes appears to decrease, though the amplitude appears to increase. Intuitively, I would associate this kind of behaviour with a minimiser jumping out of a local minimum, but I am not experienced enough with TensorFlow/CNNs to know if that is the correct way to interpret this behaviour. Equally, I can't quite understand the variation with training set size.
Can anyone help me to understand this behaviour? And should I be concerned by these features?
from quasarnet.models import QuasarNET, custom_loss
from tensorflow.keras.optimizers import Adam
...
# Build the network; nlines sets the number of outputs
model = QuasarNET(
    X[0, :, None].shape,
    nlines=len(args.lines) + len(args.lines_bal)
)

# One copy of the custom loss per output
loss = []
for i in args.lines:
    loss.append(custom_loss)
for i in args.lines_bal:
    loss.append(custom_loss)

adam = Adam(decay=0.)
model.compile(optimizer=adam, loss=loss, metrics=[])

# Build the training targets and per-sample weights
box, sample_weight = io.objective(z, Y, bal, lines=args.lines,
                                  lines_bal=args.lines_bal)

print("starting fit")
history = model.fit(X[:, :, None], box,
                    epochs=args.epochs,
                    batch_size=256,
                    sample_weight=sample_weight)
Following some discussion with a colleague, I believe that we have solved this problem. By default, the Adam minimiser takes per-parameter steps whose size is inversely proportional to the square root of a running average of the recent squared gradients. When the loss starts to flatten out, the gradients shrink, this running average decays, and so the effective step size grows. This can happen quite drastically, causing the minimiser to "jump" to a higher-loss point in parameter space.
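To make the mechanism concrete, here is a minimal sketch of the update as a plain NumPy function (a simplification of Adam with the bias correction omitted; it is not code from the question):

import numpy as np

def adam_step(theta, grad, m, v, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Simplified Adam update (bias correction omitted for clarity)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    # Effective step size ~ alpha / sqrt(v): when the loss flattens and the
    # gradients shrink, v decays and the step grows, which can kick the
    # parameters to a higher-loss region.
    return theta - alpha * m / (np.sqrt(v) + eps), m, v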
You can avoid this by setting amsgrad=True when initialising the minimiser (http://www.satyenkale.com/papers/amsgrad.pdf). AMSGrad keeps the running maximum of this average, which prevents the effective step size from increasing in this way, and thus results in better convergence. The (somewhat basic) plot below shows loss vs number of training epochs for the normal setup, as in the original question (norm loss), compared to the loss when setting amsgrad=True in the minimiser (amsgrad loss).
Clearly, the loss function is much better behaved with amsgrad=True, and, with more epochs of training, should result in a stable convergence.
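For reference, the only change needed in the compile step from the question is the extra argument to Adam (amsgrad is a standard option of the tf.keras Adam optimizer):

from tensorflow.keras.optimizers import Adam

# amsgrad=True uses the running maximum of the squared-gradient average,
# so the effective step size can never grow as the loss flattens out.
adam = Adam(decay=0., amsgrad=True)
model.compile(optimizer=adam, loss=loss, metrics=[])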
Related
My learning curves show fluctuations for a 3-class classification problem. I am training using ResNet-50 with:
class 1 of 899 images
class 2 of 899 images
class 3 of 690 images.
My model gave a
train accuracy of 99.5%
validation accuracy of 93%
test accuracy of 88%
with epochs = 300, batch size = 32 and learning rate = 0.1.
I tried tuning my parameters to epochs of 50, 100, 200 and 300, batch sizes of 16 and 32, and learning rates of 0.1, 0.01, 0.001 and 0.0001, but the spikes are still present. Is the problem with my model or my dataset? How can I tell whether my model is actually learning?
'Spikes' are to be expected when training any model - especially with smaller batch sizes.
To understand why this could be so, assume the batch_size is 1, and that in each epoch we start gradient descent at a random data-point and pick the following data-points randomly as well.
This data-point helps us (the optimizer) find the direction to move in towards the closest minimum. We then move on to the next data-point, which again points towards its nearest minimum. Throughout the epoch, we use the directions from each of our data-points to find the minimum, giving every data-point equal weight.
Thus, the order in which we go through the data-points has a drastic effect on which minimum we end up in.
Oftentimes, with smaller batch_sizes, we get stuck in a poor local minimum (which gives a huge loss), and sometimes we hit the jackpot with a better minimum (which gives a smaller loss).
This is one of the possible reasons you're getting good loss in some epochs and poor loss in others.
To answer your question about whether your model is actually learning anything, you should look at the smoothed loss graph. If the loss (or your metric) is improving over time, then your model is surely learning. If, however, the loss is fluctuating between two values, your model is not learning - as in, the weights are not improving through backpropagation. This could be because your dataset is just noise, your gradients are not being backpropagated, or other reasons which you can find here.
From your graph, although there is no mention of batch_size, the loss seems to be decreasing with epochs. That means your model is learning.
If you want to smooth the curve or get rid of the spikes, one of the things you should try is training your model with a bigger batch_size, provided your VRAM can fit it.
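As a rough illustration of both suggestions, here is a minimal sketch; the model, the data names and the smoothing factor are placeholders, not taken from the question:

import numpy as np

def smooth(values, factor=0.9):
    """Exponentially smooth a loss curve so the trend is visible through the spikes."""
    smoothed, last = [], values[0]
    for v in values:
        last = factor * last + (1 - factor) * v
        smoothed.append(last)
    return np.array(smoothed)

# Retrain with a larger batch size (if it fits in VRAM), then inspect the trend:
# history = model.fit(x_train, y_train, epochs=300, batch_size=128)
# smoothed_loss = smooth(history.history['loss'])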
I have a problem when training a U-Net (which has many similarities with a CNN) in Keras with TensorFlow. When training starts, the accuracy increases and the loss steadily goes down. At around epoch 40, in my example, the validation loss jumps to its maximum and the validation accuracy drops to zero. What can I do to prevent that from happening? I am using a similar approach to this one for my code, in Keras.
Example image of the Loss
Edit:
I already tried changing the learning rate, adding dropout and changing optimizers; those did not change the curve for the better. As I have a big training set, it is very unlikely that I am encountering overfitting.
I just wanted to ask a quick question. I understand that val_loss and train_loss are insufficient to tell if the model is overfitting. However, I wish to use them as a rough gauge by monitoring whether the val_loss is increasing. As I use the SGD optimiser, I seem to get 2 different trends depending on the smoothing value. Which should I use? Blue is val_loss and orange is train_loss.
At smoothing = 0.999, both seem to be decreasing, but at smoothing = 0.927, val_loss seems to be increasing. Thank you for reading!
Also, when is a good time to decrease the learning rate? Is it directly before the model overfits?
Smoothing = 0.999
Smoothing = 0.927
In my experience with DL as applied to CNNs, overfitting is tied more to the difference in train/val accuracies/losses rather than just one or the other. In your graphs, it's clear that the difference in loss is increasing as time goes on, showing that your model does not generalize well to the dataset, and hence shows signs of overfitting. It would also help for you to track classification accuracy on train and val datasets if possible--this will show you the generalization error which acts as a similar metric but might show more visible effects.
Dropping the learning rate once the loss starts to even out and overfitting begins is a good idea; however, you may find better gains for your generalization if you adjust the net's complexity to better fit the dataset first. For such overfitting, a modest decrease in complexity may help; use the difference in train/val losses and accuracies to confirm.
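If you do decide to drop the learning rate once the validation loss plateaus, one common way to automate it in tf.keras is the ReduceLROnPlateau callback. The model and data names below are placeholders, and the factor/patience values are arbitrary:

from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate whenever val_loss has not improved for 5 epochs.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=5, min_lr=1e-6)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100, batch_size=32,
          callbacks=[reduce_lr])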
I am using the TensorFlow DNNRegressor Estimator model to build a neural network, but calling the estimator.train() function gives output as follows:
That is, my loss is varying a lot with every step, but as far as I know, the loss should decrease with the number of iterations. Also see the attached screenshot of the TensorBoard visualisation of the loss function:
The doubts I'm not able to figure out are:
Is this the overall loss value (combined loss for every step processed so far), or just that step's loss value?
If it is that step's loss value, how do I get the overall loss and see its trend, which I feel should decrease with an increasing number of iterations? To my knowledge, that is the value we should look at while training on a dataset.
If this is the overall loss value, then why is it fluctuating so much? Am I missing something?
First of all, let me point out that tf.contrib.learn.DNNRegressor uses a linear regression head with mean_squared_loss, i.e. simple L2 loss.
Whether it is overall loss function value (combined loss for every step processed till now) or just that step's loss value?
Each point on the chart is the value of the loss function on that particular step, given the learning so far.
If it is that step's loss value, then how to get value of overall loss function and see its trend, which I feel should decrease with increasing no of iterations?
There's no overall loss function; you probably mean a chart of how the loss changed after each step. That's exactly what TensorBoard is showing you. You are right that its trend is not downwards, as it should be. This indicates that your neural network is not learning.
If this is overall loss value, then why is it fluctuating so much? Am I missing something?
A common reason for the neural network not learning is poor choice of hyperparameters (though there are many more mistakes you can possibly make). For example:
the learning rate is too large
it's also possible that the learning rate is too small, which means that the neural network is learning, but very, very slowly, so that you can't see it
the weight initialization scale is probably too large; try decreasing it
the batch size may be too large as well
you're passing the wrong labels for the inputs
the training data contains missing values, or is not normalized
...
What I usually do to check if the neural network is at least somehow working is to reduce the training set to a few examples and try to overfit the network. This experiment is very fast, so I can try various learning rates, initialization variances and other parameters to find a sweet spot. Once I have a steadily decreasing loss chart, I go on with a bigger set.
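A minimal sketch of that sanity check, shown here with tf.keras rather than the Estimator API; the toy architecture and the x_train/y_train names are placeholders:

import tensorflow as tf

# Take a tiny subset of the training data and try to drive the loss to ~0.
# If the network cannot overfit even this, something is wrong with the labels,
# the gradients, or the hyperparameters.
x_small, y_small = x_train[:32], y_train[:32]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(x_small.shape[1],)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')

history = model.fit(x_small, y_small, epochs=500, verbose=0)
print(history.history['loss'][-1])  # should be close to zero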
Though the previous answer is very informative and good, it doesn't quite address your issue. When you instantiate DNNRegressor, add:
loss_reduction=tf.losses.Reduction.MEAN
to the constructor, and you'll see your average loss converge.
estimator = tf.estimator.DNNRegressor(
    feature_columns=feat_clmns,
    hidden_units=[32, 64, 32],
    weight_column=weight_clmn,
    loss_reduction=tf.losses.Reduction.MEAN)
I'm running some TensorFlow code. On the terminal it gives me values for training and test accuracy, and also the step size. Can someone please explain these terms, or provide any material I can read to understand them, and also the stochastic gradient descent method for convolutional neural networks?
From what you have displayed in the terminal, you are actually using tflearn. This should also display the LOSS or COST, which is how far your prediction is from the actual output. Low loss and high accuracy = better model.
Stochastic Gradient Descent (SGD) allows learning rate decay. There is a good explanation here: http://tflearn.org/optimizers/#stochastic-gradient-descent
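For example, a decaying learning rate can be set up roughly like this; the parameter names follow the tflearn documentation linked above, and the values are placeholders, so check them against your tflearn version:

from tflearn.optimizers import SGD

# SGD whose learning rate is multiplied by lr_decay every decay_step steps.
sgd = SGD(learning_rate=0.01, lr_decay=0.96, decay_step=100)

# The optimizer is then passed to the regression layer, e.g.:
# net = tflearn.regression(net, optimizer=sgd, loss='categorical_crossentropy')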
In the menu on the left side you will find everything about Loss, Training, Accuracy, Layers, etc.
And you can actually choose how often (i.e. at which step) you want to display these things.
As for batch size, learning rate, number of iterations, number of layers and number of nodes, you can play around with all of these and see which works best for your dataset.