I'm having trouble understanding and replicating the original implementation of ResNet on the CIFAR-10 dataset, as described in the paper "Deep Residual Learning for Image Recognition". Specifically, I have a few questions about the following passage:
We use a weight decay of 0.0001 and momentum of 0.9,
and adopt the weight initialization in [13] and BN [16] but
with no dropout. These models are trained with a minibatch size of 128 on two GPUs. We start with a learning
rate of 0.1, divide it by 10 at 32k and 48k iterations, and
terminate training at 64k iterations, which is determined on
a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side,
and a 32×32 crop is randomly sampled from the padded
image or its horizontal flip. For testing, we only evaluate
the single view of the original 32×32 image.
What does a minibatch size of 128 on two GPUs entail? Does this mean the batch size per GPU is 64?
How can I convert from iterations to epochs? Is the model trained for 64000 * 128/45000 = 182.04 epochs?
How can I implement the training and learning rate scheduling in PyTorch? Since 45000 isn't divisible by 128, should I drop the last 72 images every epoch? Also, since the 32k, 48k, and 64k milestones don't fall on a whole number of epochs, should I round them to the nearest epochs? Or is there a way to change the learning rate and terminate training in the middle of an epoch?
If anyone could point me in the right direction, I greatly appreciate it. I'm new to deep learning, so thank you for your help and kind understanding.
What does a minibatch size of 128 on two GPUs entail? Does this mean the batch size per GPU is 64?
When running on two GPUs on the same machine, the batch size is split between the GPUs, as you've said. The gradients produced by the two GPUs are transferred, averaged, and applied on one of the GPUs, or possibly on the CPU.
Here's more info: https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html
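In PyTorch this is what nn.DataParallel gives you out of the box. Here is a minimal sketch, assuming two visible GPUs; the resnet18 model is just a placeholder, and only the splitting behaviour is the point:

import torch
import torch.nn as nn
import torchvision

# Placeholder model; any nn.Module works the same way.
model = torchvision.models.resnet18(num_classes=10)
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()  # replicas on GPU 0 and GPU 1

images = torch.randn(128, 3, 32, 32).cuda()  # one minibatch of 128 images
# DataParallel scatters the batch, so each GPU runs a forward pass on 64 images;
# the outputs are gathered back on GPU 0, and during training the backward pass
# sums the per-replica gradients onto the parameters there.
outputs = model(images)  # shape: (128, 10)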
How can I convert from iterations to epochs? Is the model trained for 64000 * 128/45000 = 182.04 epochs?
I encourage everyone to think in terms of iterations rather than epochs. Each iteration equates to a single weight update, which is much more relevant to model convergence than an epoch is. If you think in epochs, you have to adjust the number of epochs of training every time you try a different batch size. This isn't the case if you think in terms of iterations (a.k.a. training steps, or weight updates). That said, your formula for converting to epochs is correct.
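As a quick sanity check on that arithmetic, using the paper's numbers and the 45k training split:

iterations = 64000
batch_size = 128
train_size = 45000                              # 45k/5k train/val split
updates_per_epoch = train_size / batch_size     # ~351.6 weight updates per epoch
epochs = iterations * batch_size / train_size   # ~182.04 epochs
print(updates_per_epoch, epochs)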
How can I implement the training and learning rate scheduling in PyTorch?
I think this PyTorch Lightning post answers the question; it looks like support for this was added to Lightning (sorry for a non-authoritative answer here, I'm more familiar with TensorFlow):
https://forums.pytorchlightning.ai/t/training-for-a-set-number-of-iterations-without-setting-epochs/178
https://github.com/Lightning-AI/lightning/pull/5687
You can also just use epochs, of course; adjusting the learning rate doesn't have to happen at exactly the same points as the paper describes, and getting as close as rounding allows will work just fine.
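For what it's worth, here is a minimal PyTorch sketch of an iteration-based loop under those assumptions: the model is a placeholder (torchvision's resnet18 rather than the paper's CIFAR ResNet), the full 50k training set is used instead of the paper's 45k/5k split (that split would need torch.utils.data.random_split), and MultiStepLR is stepped once per iteration so the 32k/48k milestones and the 64k stop can land mid-epoch without any rounding:

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# Placeholder model; swap in the CIFAR-style ResNet from the paper.
model = torchvision.models.resnet18(num_classes=10).cuda()

transform = T.Compose([
    T.RandomCrop(32, padding=4),   # pad 4 pixels on each side, random 32x32 crop
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10('data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# Milestones are expressed in iterations because scheduler.step() runs once per weight update.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[32000, 48000], gamma=0.1)
criterion = nn.CrossEntropyLoss()

iteration, max_iterations = 0, 64000
while iteration < max_iterations:
    for images, labels in train_loader:       # keep cycling through epochs
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()                      # learning rate drops at 32k and 48k iterations
        iteration += 1
        if iteration >= max_iterations:       # terminate mid-epoch if necessary
            break

Whether you keep or drop the last partial batch of each epoch barely matters; the loop above simply counts weight updates, so the schedule and the stopping point are exact either way.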
Related
I'm using a relatively simple neural network with fully connected layers in keras. For some reason, the accuracy drastically increases basically to its final value after only one training epoch (likewise, the loss sharply decreases). I've tried architectures with larger and smaller numbers of hidden layers too. This network also performs poorly on the testing data, so I am trying to find a more optimal architecture or improve my training set accordingly.
It is trained on a set of 6500 1D array-like data, and I'm using a batch size of 512.
As Murilo said, it's hard to say much without more information, but it can come from multiple things:
- Your network learns through the batches of each epoch, meaning that your ~12 batches (6500/512) are already enough to learn a good bit of the classification.
- Your weights are not particularly well initialized and produce a huge loss for the first epoch. The massive decrease in the loss is actually the solver 'squishing' the weights. The best explanation I found for this comes from A. Karpathy in his 'MakeMore' tutorial: https://youtu.be/P6sfmUTpUmc?t=260
Now, this sudden decrease of the loss is not extreme here (from 0.5 to 0.2), so I would not worry much about it. I agree with Murilo that low validation accuracy can come from too few samples in your validation set, or from bad shuffling between the train and validation sets.
We are developing a prediction model using deepchem's GCNModel.
Training the model and verifying its performance went fine, but we found that prediction takes a lot of time.
We are trying to predict a total of 1 million data points, and the parameters used are as follows.
model = GCNModel(n_tasks=1, mode='regression', number_atom_features=32, learning_rate=0.0001, dropout=0.2, batch_size=32, device=device, model_dir=model_path)
I changed the batch size to try to improve performance, and found that prediction was faster when the batch size was decreased than when it was increased.
All models had the same GPU memory usage.
Common sense suggests that the larger the batch size, the faster prediction should be. Can you tell me why it works in reverse here?
We would be grateful if you could also let us know how we can further improve the prediction time.
Let's clarify some definitions first.
Epoch
The number of times that your model and learning algorithm walk through your entire dataset (i.e., complete passes).
BatchSize
The number of samples (individual rows of your training data) processed before the internal model parameters are updated.
So your batch size is something between 1 and len(training_data).
Generally, a larger batch size gives a more accurate gradient estimate from the training data.
Epoch ↑, Batch Size ↑ => Accuracy ↑, Speed ↓
So the short answer to your question is that a larger batch size takes more memory and more processing per update, and therefore takes longer.
Here is a link with more details: https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
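To make that concrete, a small illustrative snippet (the dataset size is just an example) showing how the number of weight updates per epoch shrinks as the batch size grows:

import math

n_samples = 50000                           # example dataset size
for batch_size in (1, 32, 128, 512):
    updates_per_epoch = math.ceil(n_samples / batch_size)
    print(batch_size, updates_per_epoch)    # larger batches => fewer, heavier updates per epoch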
There are two components regarding the speed:
Your batch size and model size
Your CPU/GPU power in spawning and processing batches
And the two of them need to be balanced. For example, if your model finishes predicting the current batch but the next batch is not yet prepared, you will notice a drop in GPU utilization for a brief moment. Sadly, there are no built-in metrics that directly tell you about this balance - try using time.time() to benchmark your model's prediction as well as the dataloader speed.
However, I don't think that's worth the effort, so you can keep decreasing the batch size up to the point where there is no further improvement - that's where to stop.
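A rough sketch of that kind of timing; featurize_batch and predict below are stand-in stubs for whatever your DeepChem pipeline actually does, so only the time.time() bracketing is the point:

import time

def featurize_batch(raw_inputs):
    # Stand-in for your featurization / data loading step.
    return [x * 2 for x in raw_inputs]

def predict(batch):
    # Stand-in for the model's forward pass on a prepared batch.
    return [x + 1 for x in batch]

raw_inputs = list(range(1_000_000))

t0 = time.time()
batch = featurize_batch(raw_inputs)     # data preparation time
t1 = time.time()
preds = predict(batch)                  # prediction time
t2 = time.time()

print('data prep: %.3fs  prediction: %.3fs' % (t1 - t0, t2 - t1))
# If data preparation dominates, the GPU is starved and changing the batch size
# won't help much; if prediction dominates, tune the batch size and model size.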
My learning curves show fluctuations for a 3-class classification problem. I am training a ResNet-50 with
class 1 of 899 images
class 2 of 899 images
class 3 of 690 images.
My model gave a
train accuracy of 99.5%
validation accuracy 93%
test accuracy of 88%
with Epochs = 300, Batch size 32 and learning rate 0.1.
I tried tuning my parameters to epochs of 50, 100, 200, and 300, batch sizes of 16 and 32, and learning rates of 0.1, 0.01, 0.001, and 0.0001, but the spikes are still present. Is the problem with my model or with my dataset? How can I tell whether my model is actually learning?
'Spikes' are to be expected when training any model - especially with smaller batch sizes.
To understand why this could be so, assume the batch_size is 1 and in each epoch we start gradient descent with a random data-point and pick following data-points randomly as well.
This data-point helps us (the optimizer) find the direction to move in to find the closest minima. Now we move on to the next data-point, which again points towards the nearest minima. Throughout this epoch, we use the directions from each of our datapoints to find the minima and will give every datapoint equal weightage.
Thus, the order of the datapoints in which we go through to find the minima will have a drastic effect on what minima we get to.
Often, with smaller batch_sizes, we might get stuck in a local minimum (which gives us a large loss), and sometimes we hit the jackpot with a better minimum (which gives us a smaller loss).
This is one of the possible reasons you're getting good loss in some epochs and poor loss in others.
To answer your question about whether your model is actually learning anything: you should look at the smoothed loss graph. If the loss (or your metric) is improving over time, then your model is surely learning. If, however, the loss is fluctuating between two values, your model is not learning - the weights are not improving through backpropagation. This could be because your dataset is just noise, your gradients are not being backpropagated, or other reasons which you can find here.
From your graph, although there is no mention of batch_size, the loss seems to be reducing wrt epochs. That means your model is learning.
If you want to smooth the curve or get rid of the spikes, one of the things you should try is training your model with a bigger batch_size, provided your VRAM can fit it.
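For the smoothed loss graph mentioned above, one simple way to produce it is an exponential moving average over your recorded per-epoch losses (the loss values below are made up):

def smooth(values, beta=0.9):
    # Exponential moving average: higher beta => smoother curve.
    smoothed, running = [], values[0]
    for v in values:
        running = beta * running + (1 - beta) * v
        smoothed.append(running)
    return smoothed

losses = [1.2, 0.9, 1.4, 0.7, 1.1, 0.6, 0.9, 0.5]   # replace with your own history
print(smooth(losses))
# A downward trend in the smoothed values means the model is learning,
# even if the raw per-epoch losses jump around.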
While working on improving the speed of my LSTM network, I stumbled upon a question.
How does the batch size affect the quality of the predict/predict_on_batch method?
To be specific: I know the batch size affects the gradient calculation during the training phase and therefore the trained model and its prediction results, but should the prediction results vary if a different batch size is used for batch prediction than was used during training?
Or should one always use the same batch size for training/fitting and prediction, and if so why?
What I tried:
I have a well working (0.96 detection rate and 0 false alarms) LSTM for anomaly detection
but it takes a long time to go through all test files because of the slow prediction (40min),
using predict(x) trained with a batch_size of 512.
When using predict_on_batch on one sample at a time (batch_size=1), it again takes 35 minutes to test, with 0.96 DR and 0 FA.
When using predict_on_batch(batch), or predict(batch, batch_size=512) with the same batch size of 512 as used in training, instead of predict(x), it takes only about 32 seconds, but the results are much worse (0.63 DR, 62455 FA).
Something similar happens for smaller batch_size values greater than 1.
The answer in this thread, 'Prediction is depending on the batch size in Keras', is pretty outdated, but is it still accurate (it blames standardization)?
I don't really understand what is happening here; I hope someone can clarify!
Versions used:
Keras 2.3.1
Python 3.7.4
Assuming we have 500k items worth of training data, does it matter if we train the model one item at a time or 'n' items at a time or all at once?
Assume inputTrainingData and outputTrainingData are 2D arrays ([[]]) and train_step is any generic TensorFlow training step.
Option 1 Train one item at a time -
for i in range(len(inputTrainingData)):
    train_step.run(feed_dict={x: [inputTrainingData[i]], y: [outputTrainingData[i]], keep_prob: .60}, session=sess)
Option 2 Train on all at once -
train_step.run(feed_dict={x: inputTrainingData, y: outputTrainingData, keep_prob: .60}, session=sess)
Is there any difference between options 1 and 2 above as far as the quality of training is concerned?
Yes, there is a difference. Option 1 consumes much less memory but is also much less accurate. Option 2 could eat up all of your RAM but should prove more accurate. However, if you use your whole training set at once, be sure to limit the number of steps to avoid over-fitting.
Ideally, use data in batches (typically between 16 and 256).
Most optimization techniques are 'stochastic', i.e. they rely on a statistical sample of examples to estimate a model update.
To sum up:
- More data per step => more accuracy (but more memory) => higher risk of over-fitting (so limit the number of training steps)
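A sketch of that middle ground, reusing the placeholder names from your question (inputTrainingData, outputTrainingData, train_step, x, y, keep_prob, sess), so it is only meant to show the slicing pattern:

batch_size = 128   # anywhere in the 16-256 range is a common starting point

# You would normally also shuffle the data at the start of each epoch.
for start in range(0, len(inputTrainingData), batch_size):
    end = start + batch_size
    train_step.run(feed_dict={x: inputTrainingData[start:end],
                              y: outputTrainingData[start:end],
                              keep_prob: .60},
                   session=sess)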
There is a difference between these options. Normally you train with a fixed batch size, for example 128 samples per update.
You could also use a batch size of one, as in your first example.
The advantage of that method is that you can output the training progress of the neural network after every update.
If you learn on all the data at once, it may be a little faster, but you only find out at the end whether the training was effective.
The best approach is to pick a batch size and learn batch by batch, so you can monitor progress after every batch and keep it under control.
Mathematically these two methods are different. One is called stochastic gradient descent and the other is called batch gradient descent. You are missing the most commonly used one - mini-batch gradient descent. There has been a lot of research on this topic, but basically different batch sizes have different convergence properties. Generally people use batch sizes that are greater than one but not the full dataset. This is usually necessary since most datasets cannot fit into memory all at once. Also, if your model uses batch normalization, then a batch size of one won't converge. This paper discusses the effects of batch size (among other things) on performance. The takeaway is that larger batch sizes do not generalize as well. (They actually argue it isn't the batch size itself but the fact that you have fewer updates when the batch is larger.) I would recommend a batch size of 32 to start, and experimenting to see how batch size affects performance.
Here is a graph of the effects of batch size on training and validation performance from the paper I linked.