My learning curves show fluctuations for a 3-class classification problem. I am training a ResNet-50 with
class 1: 899 images
class 2: 899 images
class 3: 690 images.
My model gave a
train accuracy of 99.5%,
validation accuracy of 93%, and
test accuracy of 88%
with 300 epochs, batch size 32, and learning rate 0.1.
I tried tuning my hyperparameters to epochs of 50, 100, 200, and 300, batch sizes of 16 and 32, and learning rates of 0.1, 0.01, 0.001, and 0.0001, but the spikes are still present. Is the problem with my model or my dataset? How can I actually tell that my model is learning?
'Spikes' are to be expected when training any model - especially with smaller batch sizes.
To understand why this could be so, assume the batch_size is 1 and that in each epoch we run gradient descent over the data points in a random order.
Each data point gives us (the optimizer) a direction to move in to reduce the loss on that point. The next data point points towards its own nearest minimum, which may be a different direction. Throughout the epoch we follow the directions from each of our data points in turn, giving every data point equal weight.
Thus, the order in which we visit the data points has a noticeable effect on the path the optimizer takes and on which minimum we end up in.
Oftentimes, with smaller batch_sizes, an update pushes us towards a poor local minimum (which gives us a large loss), and sometimes we hit the jackpot and move towards a better minimum (which gives us a smaller loss).
This is one of the possible reasons you're getting a good loss in some epochs and a poor loss in others.
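To make this concrete, here is a toy numpy sketch (a hypothetical 1-D least-squares problem, not your ResNet) showing how per-sample updates produce a much spikier loss curve than larger mini-batches:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.5, size=1000)   # true slope = 3

def run_sgd(batch_size, lr=0.05, steps=200):
    w = 0.0
    losses = []
    for _ in range(steps):
        idx = rng.integers(0, len(x), size=batch_size)
        xb, yb = x[idx], y[idx]
        grad = 2 * np.mean((w * xb - yb) * xb)    # gradient of the MSE w.r.t. w on this batch
        w -= lr * grad
        losses.append(np.mean((w * x - y) ** 2))  # full-data loss, recorded for plotting
    return losses

noisy = run_sgd(batch_size=1)     # jagged curve with visible spikes
smooth = run_sgd(batch_size=64)   # much smoother curve

Plotting noisy vs smooth shows the same effect you are seeing: the smaller the batch, the noisier each step and the spikier the curve.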
To answer your question about whether your model is actually learning anything, you should look at the smoothed loss curve. If the loss (or your metric) is improving over time, then your model is surely learning. If, however, the loss just keeps fluctuating between two values, your model is not learning - the weights are not improving through backpropagation. This could be because your dataset is just noise, your gradients are not being backpropagated, or other reasons which you can find here.
From your graph, although there is no mention of the batch_size, the loss seems to be decreasing with the epochs. That means your model is learning.
If you want to smooth the curve or get rid of the spikes, one of the things you should try is training your model with a bigger batch_size, provided your VRAM can fit it.
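To get the smoothed view of the loss mentioned above, a minimal sketch (assuming you have your per-epoch losses recorded somewhere, e.g. in Keras' History object, here called history):

import numpy as np
import matplotlib.pyplot as plt

# 'loss' is assumed to be your recorded per-epoch training loss
loss = np.array(history.history['loss'])
window = 10                                                       # smoothing window, in epochs
smoothed = np.convolve(loss, np.ones(window) / window, mode='valid')

plt.plot(loss, alpha=0.3, label='raw loss')
plt.plot(np.arange(window - 1, len(loss)), smoothed, label='smoothed loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()

If the smoothed curve trends downwards despite the spikes, the model is learning.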
I'm using a relatively simple neural network with fully connected layers in keras. For some reason, the accuracy drastically increases basically to its final value after only one training epoch (likewise, the loss sharply decreases). I've tried architectures with larger and smaller numbers of hidden layers too. This network also performs poorly on the testing data, so I am trying to find a more optimal architecture or improve my training set accordingly.
It is trained on a set of 6500 1-D array-like samples, and I'm using a batch size of 512.
As Murilo said, it's hard to say much without more information, but it can come from multiple things:
Your network learns through the batches of each epoch, meaning that your ~12 batches (6500/512) are already enough to learn a good bit of classification.

Your weights are not really well initialized, and produce a huge loss for the first epoch. The massive decrease in the loss is actually the solver 'squishing' the weights. The best explanation I found for this comes from A. Karpathy in his 'MakeMore' tutorial: https://youtu.be/P6sfmUTpUmc?t=260
Now this sudden decrease of the loss is not extreme here (from 0.5 to 0.2) so I would not care much. I agree with Murilo that low accuracy in validation can come from too few samples in your validation set, or a bad shuffling between train and validation sets.
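One quick sanity check related to the initialization point above (a sketch, assuming a Keras softmax classifier named model and training arrays x_train/y_train; adjust num_classes to your problem): the loss of the freshly initialized, untrained model should be close to the "uniform guess" baseline of -log(1/num_classes), and a much larger value points to poorly scaled initial weights.

import numpy as np

num_classes = 10                               # replace with your number of classes
expected = -np.log(1.0 / num_classes)          # baseline loss for a uniform guess
results = model.evaluate(x_train[:512], y_train[:512], verbose=0)   # run before any training
first_loss = results[0] if isinstance(results, list) else results
print(f"expected initial loss ~{expected:.3f}, measured {first_loss:.3f}")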
I'm having trouble understanding and replicating the original implementation of ResNet on the CIFAR-10 dataset, as described in the paper "Deep Residual Learning for Image Recognition". Specifically, I have a few questions about the following passage:
We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [13] and BN [16] but with no dropout. These models are trained with a minibatch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.
What does a minibatch size of 128 on two GPUs entail? Does this mean the batch size per GPU is 64?
How can I convert from iterations to epochs? Is the model trained for 64000 * 128/45000 = 182.04 epochs?
How can I implement the training and learning rate scheduling in PyTorch? Since 45000 isn't divisible by 128, should I drop the last 72 images every epoch? Also, since the 32k, 48k, and 64k milestones don't fall on a whole number of epochs, should I round them to the nearest epochs? Or is there a way to change the learning rate and terminate training in the middle of an epoch?
If anyone could point me in the right direction, I greatly appreciate it. I'm new to deep learning, so thank you for your help and kind understanding.
What does a minibatch size of 128 on two GPUs entail? Does this mean the batch size per GPU is 64?
When running on two GPUs on the same machine, the batch is split between the GPUs, as you've said. The gradients produced by the two GPUs are transferred, averaged, and applied on one of the GPUs, or possibly on the CPU.
Here's more info: https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html
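A minimal PyTorch sketch of that setup (the Linear layer is just a placeholder model; nn.DataParallel handles the splitting and gradient accumulation for you):

import torch
from torch import nn

# placeholder model, replicated across two GPUs
model = nn.DataParallel(nn.Linear(3 * 32 * 32, 10), device_ids=[0, 1]).cuda()

x = torch.randn(128, 3 * 32 * 32).cuda()   # one minibatch of 128
out = model(x)                             # each GPU processes 64 of the 128 samples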
How can I convert from iterations to epochs? Is the model trained for 64000 * 128/45000 = 182.04 epochs?
I encourage everyone to think in terms of iterations rather than epochs. Each iteration equates to a single weight update, which is much more relevant to model convergence than an epoch is. If you think in epochs, you have to adjust the number of epochs of training every time you try a different batch size. This isn't the case if you think in terms of iterations (aka training steps, or weight updates). But your formula for computing epochs is correct.
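For reference, the arithmetic is just:

iterations = 64_000
batch_size = 128
train_size = 45_000

updates_per_epoch = train_size / batch_size     # ~351.6 weight updates per epoch
epochs = iterations * batch_size / train_size   # ~182.0 epochs, as in your formula
print(updates_per_epoch, epochs)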
How can I implement the training and learning rate scheduling in PyTorch?
I think this PyTorch Lightning post answers the question; it looks like support for this was added to Lightning (sorry for a non-authoritative answer here, I'm more familiar with TensorFlow):
https://forums.pytorchlightning.ai/t/training-for-a-set-number-of-iterations-without-setting-epochs/178
https://github.com/Lightning-AI/lightning/pull/5687
You can also just use epochs, of course; the learning rate adjustments don't have to happen at exactly the same points as the paper describes, and getting as close as you reasonably can with rounding will work just fine.
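If you do want to follow the paper's iteration-based schedule exactly, here is a minimal plain-PyTorch sketch (the model and data are placeholders, not the actual ResNet/CIFAR-10 pipeline); stepping a MultiStepLR scheduler once per iteration puts the 32k/48k milestones exactly where the paper does, regardless of epoch boundaries:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# placeholder model and data; substitute your ResNet and CIFAR-10 loader
model = nn.Linear(32, 10)
data = TensorDataset(torch.randn(45_000, 32), torch.randint(0, 10, (45_000,)))
train_loader = DataLoader(data, batch_size=128, shuffle=True, drop_last=False)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[32_000, 48_000], gamma=0.1)

max_iters, it = 64_000, 0
while it < max_iters:
    for x, y in train_loader:
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()        # stepped once per iteration, so milestones land exactly
        it += 1
        if it >= max_iters:
            break

With drop_last=False the last, smaller batch of each epoch is simply kept; whether you drop those 72 images or not makes no practical difference.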
I am learning about Convolutional Neural Networks and practicing on the Kaggle digit recognizer (MNIST) dataset.
While training, I noticed that despite the accuracy initially growing gradually, there was a huge jump in between, i.e. from 0.8984 to 0.9814.
As a beginner, I want to understand what this jump really says about my model. Here is an image of the epochs:
I have circled the jump in yellow. Thanks in advance!
As the loss decreases, the model fits the training data better. The optimizer drives the cost function down, which directly improves the fit, and the better the model fits the training data, the better the accuracy (which you can see as the accuracy rising while the loss falls). There is a difference of almost 0.08 between your consecutive loss values, which is enough for the model to fit noticeably better than in its current state.
As the model progresses, we also evaluate it on a test dataset, because real-world data is not the data we trained on.
However, a very high training accuracy is not always good: the model may be overfitting, meaning it performs so well on the training data that it cannot handle even small changes in the input. Therefore, a sensible balance between learning rate and number of epochs is required to predict the classes correctly. It also depends on the architecture, on the optimizer (which should keep the oscillations low), and on numerous other things.
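A toy illustration with hypothetical numbers (not from your model): a modest drop in per-sample cross-entropy can flip a borderline prediction from wrong to right, which is why accuracy can jump while the average loss only moves by a few hundredths.

import numpy as np

# per-sample cross-entropy is -log(p_true_class); think of a sample whose
# true-class probability crosses 0.5 and so flips from misclassified to correct
before = -np.log(0.45)            # true class at 45% probability: loss ~0.80
after = -np.log(0.55)             # true class at 55% probability: loss ~0.60, now correct
print(round(before - after, 3))   # ~0.2 loss change, but accuracy flips from 0 to 1 for this sample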
I am using tensorflow.keras to train a CNN in an image recognition problem, using the Adam minimiser to minimise a custom loss (some code is at the bottom of the question). I am experimenting with how much data I need to use in my training set, and thought I should look into whether each of my models has properly converged. However, when plotting loss vs number of epochs of training for different training set fractions, I noticed approximately periodic spikes in the loss function, as in the plot below. Here, the different lines show different training set sizes as a fraction of my total dataset.
As I decrease the size of the training set (blue -> orange -> green), the frequency of these spikes appears to decrease, though the amplitude appears to increase. Intuitively, I would associate this kind of behaviour with a minimiser jumping out of a local minimum, but I am not experienced enough with TensorFlow/CNNs to know if that is the correct way to interpret this behaviour. Equally, I can't quite understand the variation with training set size.
Can anyone help me to understand this behaviour? And should I be concerned by these features?
from quasarnet.models import QuasarNET, custom_loss
from tensorflow.keras.optimizers import Adam
...
model = QuasarNET(
    X[0,:,None].shape,
    nlines=len(args.lines)+len(args.lines_bal)
)

loss = []
for i in args.lines:
    loss.append(custom_loss)
for i in args.lines_bal:
    loss.append(custom_loss)

adam = Adam(decay=0.)
model.compile(optimizer=adam, loss=loss, metrics=[])

box, sample_weight = io.objective(z, Y, bal, lines=args.lines,
                                  lines_bal=args.lines_bal)

print("starting fit")
history = model.fit(X[:,:,None], box,
                    epochs=args.epochs,
                    batch_size=256,
                    sample_weight=sample_weight)
Following some discussion with a colleague, I believe that we have solved this problem. By default, the Adam minimiser scales each step by a factor that is inversely proportional to the square root of a running average of recent squared gradients. When the loss starts to flatten out, the gradients (and hence this running average) shrink, so the effective learning rate grows. This can happen quite drastically, causing the minimiser to "jump" to a higher-loss point in parameter space.
You can avoid this by setting amsgrad=True when initialising the minimiser (http://www.satyenkale.com/papers/amsgrad.pdf). This prevents the learning rate from increasing in this way, and thus results in better convergence. The (somewhat basic) plot below shows loss vs number of training epochs for the normal setup, as in the original question (norm loss) compared to the loss when setting amsgrad=True in the minimiser (amsgrad loss).
Clearly, the loss function is much better behaved with amsgrad=True, and, with more epochs of training, should result in a stable convergence.
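For reference, a minimal sketch of the change, assuming the same setup as the code in the question (model and loss as defined there):

from tensorflow.keras.optimizers import Adam

# only the optimizer line changes relative to the code in the question
adam = Adam(decay=0., amsgrad=True)   # AMSGrad keeps a running maximum of the second-moment
                                      # estimate, so the effective step size cannot grow
model.compile(optimizer=adam, loss=loss, metrics=[])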
I'm running some TensorFlow code. In the terminal it gives me values for training and test accuracy and also the step size. Can someone please explain these terms, or provide any material I can read to understand them, and also the stochastic gradient descent method for convolutional neural networks?
From what you have displayed in the terminal, you are actually using tflearn. This should also display the LOSS or COST, which is how far your prediction is from the actual output. Low loss and high accuracy = better model.
Stochastic Gradient Descent (SGD) allows learning rate decay. There is a good explanation here: http://tflearn.org/optimizers/#stochastic-gradient-descent
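A minimal sketch based on the tflearn docs linked above (the network here is a placeholder, not your CNN; the relevant part is constructing the SGD optimizer with learning rate decay and passing it to regression):

import tflearn

# SGD with exponential learning rate decay, following the tflearn docs linked above
sgd = tflearn.SGD(learning_rate=0.1, lr_decay=0.96, decay_step=100)

# placeholder network; substitute your own convolutional layers
net = tflearn.input_data(shape=[None, 784])
net = tflearn.fully_connected(net, 10, activation='softmax')
net = tflearn.regression(net, optimizer=sgd, loss='categorical_crossentropy')

model = tflearn.DNN(net)
# model.fit(X, Y, n_epoch=10, batch_size=64, show_metric=True)   # X, Y: your training arrays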
In the menu on the left side you will find everything about Loss, Training, Accuracy, Layers, etc.
And you can actually choose how often you want to display these things (I mean at what step).
As for batch size, learning rate, number of iterations, number of layers and number of nodes, you can play around with all of these and see what works best for your dataset.