Clarification about Gradient Accumulation

Clarification about Gradient Accumulation - python

I'm trying to get a better understanding of how Gradient Accumulation works and why it is useful. To this end, I wanted to ask what is the difference (if any) between these two possible PyTorch-like implementations of a custom training loop with gradient accumulation:
gradient_accumulation_steps = 5
for batch_idx, batch in enumerate(dataset):
x_batch, y_true_batch = batch
y_pred_batch = model(x_batch)
loss = loss_fn(y_true_batch, y_pred_batch)
loss.backward()
if (batch_idx + 1) % gradient_accumulation_steps == 0: # (assumption: the number of batches is a multiple of gradient_accumulation_steps)
optimizer.step()
optimizer.zero_grad()
y_true_batches, y_pred_batches = [], []
gradient_accumulation_steps = 5
for batch_idx, batch in enumerate(dataset):
x_batch, y_true_batch = batch
y_pred_batch = model(x_batch)
y_true_batches.append(y_true_batch)
y_pred_batches.append(y_pred_batch)
if (batch_idx + 1) % gradient_accumulation_steps == 0: # (assumption: the number of batches is a multiple of gradient_accumulation_steps)
y_true = stack_vertically(y_true_batches)
y_pred = stack_vertically(y_pred_batches)
loss = loss_fn(y_true, y_pred)
loss.backward()
optimizer.step()
optimizer.zero_grad()
y_true_batches.clear()
y_pred_batches.clear()
Also, kind of as an unrelated question: Since the purpose of gradient accumulation is to mimic a larger batch size in cases where you have memory constraints, does it mean that I should also increase the learning rate proportionally?

1. The difference between the two programs:
Conceptually, your two implementations are the same: you forward gradient_accumulation_steps batches for each weight update.
As you already observed, the second method requires more memory resources than the first one.
There is, however, a slight difference: usually, loss functions implementation use mean to reduce the loss over the batch. When you use gradient accumulation (first implementation) you reduce using mean over each mini-batch, but using sum over the accumulated gradient_accumulation_steps mini-batches. To make sure the accumulated gradient implementation is identical to large batches implementation you need to be very careful in the way the loss function is reduced. In many cases you will need to divide the accumulated loss by gradient_accumulation_steps. See this answer for a detailed imlpementation.
2. Batch size and learning rate:
Learning rate and batch size are indeed related. When increasing the batch size one usually reduces the learning rate.
See, e.g.:
Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le, Don't Decay the Learning Rate, Increase the Batch Size (ICLR 2018).

Related

Final step of PyTorch Gradient Accumulation for small datasets

I am training a BERT model on a relatively small dataset and cannot afford to lose any labelled sample as they must all be used for training. Due to GPU memory constraints, I am using gradient accumulation to train on larger batches (e.g. 32). According to PyTorch documentation, gradient accumulation is implemented as follows:
scaler = GradScaler()
for epoch in epochs:
for i, (input, target) in enumerate(data):
with autocast():
output = model(input)
loss = loss_fn(output, target)
loss = loss / iters_to_accumulate
# Accumulates scaled gradients.
scaler.scale(loss).backward()
if (i + 1) % iters_to_accumulate == 0:
# may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
However, if you are using e.g. 110 training samples, with batch size 8 and accumulation step 4 (i.e. effective batch size 32), this method would only train the first 96 samples (i.e. 32 x 3), i.e. wasting 14 samples. In order to avoid this, I'd like to modify the code as follows (notice change to the final if statement):
scaler = GradScaler()
for epoch in epochs:
for i, (input, target) in enumerate(data):
with autocast():
output = model(input)
loss = loss_fn(output, target)
loss = loss / iters_to_accumulate
# Accumulates scaled gradients.
scaler.scale(loss).backward()
if (i + 1) % iters_to_accumulate == 0 or (i + 1) == len(data):
# may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
Is this correct and really that simple, or will this have any side effects? It seems very simple to me, but I've never seen it done before. Any help appreciated!

As Lucas Ramos already mentioned, when using DataLoader where the underlying dataset's size is not divisible by the batch size, the default behavior is to have a smaller last batch:
drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
Your plan is basically implementing gradient accumulation combined with drop_last=False - that is having the last batch smaller than all others.
Therefore, in principle there's nothing wrong with training with varying batch sizes.
However, there is something you need to fix in your code:
The loss is averaged over the mini-batch. So, if you process mini batches in the usual way you do not need to worry about it. However, when accumulating gradients you do it explicitly by dividing the loss by iters_to_accumulate:
loss = loss / iters_to_accumulate
In the last mini batch (with smaller size) you need to change the value of iter_to_accumulate to reflect this smaller minibatch size!
I proposed this revised code, breaking the training loop into two: an outer loop on mini-batches, and an inner one that accumulates gradients per mini batch. Note how using an iter over the DataLoader helps breaking the training loop into two:
scaler = GradScaler()
for epoch in epochs:
bi = 0 # index batches
# outer loop over minibatches
data_iter = iter(data)
while bi < len(data):
# determine the range for this batch
nbi = min(len(data), bi + iters_to_accumulate)
# inner loop over the items of the mini batch - accumulating gradients
for i in range(bi, nbi):
input, target = data_iter.next()
with autocast():
output = model(input)
loss = loss_fn(output, target)
loss = loss / (nbi - bi) # divide by the true batch size
# Accumulates scaled gradients.
scaler.scale(loss).backward()
# done mini batch loop - gradients were accumulated, we can make an optimizatino step.
# may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
bi = nbi

I was pretty sure I've seen this done before. Check out this code from Pytorch Lightning (functions _accumulated_batches_reached, _num_training_batches_reached and should_accumulate).

Tensorflow NMT with Attention Tutorial -- need help understanding loss function

I'm following Tensorflow's Neural Machine Translation with Attention tutorial (link) but am unclear about some implementation details. It'd be great if someone could help clarify or refer me to a source/better place to ask:
1) def loss_function(real, pred): This function computes loss at a specific time step (say t), averaged over the entire the batch. Examples whose labels at t is <pad> (i.e. no real data, only padded so that all example sequences are of same length) are masked so as not to count towards loss.
My question: It seems loss should get smaller the bigger t is (since more examples are <pad> the further we get to maximum length). So why is loss averaged over the entire batch, and not just over the number of valid (non-<pad>) examples? (This is analogous to using tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS instead of tf.losses.Reduction.SUM_OVER_BATCH_SIZE)
2) for epoch in range(EPOCHS) ——> Two loss variables are defined in the training loop:
loss = sum of loss_function() outputs over all time steps
batch_loss = loss divided by number of time steps
My question: Why are gradients computed w.r.t. loss and not batch_loss? Shouldn't batch_loss be the average loss over all time steps and the entire batch?
Many thanks!

It seems loss should get smaller the bigger t
The loss does get smaller since the pad token is getting masked while calculating the loss.
Batch_loss is used only to print the loss calculated of each batch. Batch loss is calculated for every batch and across all the time steps.
for t in range(1, targ.shape[1])
This loop runs over the batch for all timesteps and calculates the loss by masking the padded values.
I hope this clears it up :)

How to inject data into a graph when using an input pipeline?

I am using an initializable iterator in my code. The iterator returns batches of size 100 from a csv dataset that has 20.000 entries. During training, however, I came across a problem. Consider this piece of code:
def get_dataset_iterator(batch_size):
# parametrized with batch_size
dataset = ...
return dataset.make_initializable_iterator()
## build a model and train it (x is the input of my model)
iterator = get_dataset_iterator(100)
x = iterator.get_next()
y = model(x)
## L1 norm as loss, this works because the model is an autoencoder
loss = tf.abs(x - y)
## training operator
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
with tf.Session() as sess:
for epoch in range(100):
sess.run(iterator.initializer)
# iterate through the whole dataset once during the epoch and
# do 200 mini batch updates
for _ in range(number_of_samples // batch_size):
sess.run(train_op)
print(f'Epoch {epoch} training done!')
# TODO: print loss after epoch here
I am interested in the training loss AFTER finishing the epoch. It makes most sense to me that I calculate the average loss over the whole training set (e.g. feeding all 20.000 samples through the network and averaging their loss). I could reuse the dataset iterator here with a batch size of 20.000, but I have declared x as the input.
So the questions are:
1.) Does the loss calculation over all 20.000 examples make sense? I have seen some people do the calculation with just a mini-batch (the last batch of the epoch).
2.) How can I calculate the loss over the whole training set with an input pipeline? I have to inject all of training data somehow, so that I can run sess.run(loss) without calculating it over only 100 samples (because x is declared as input).
EDIT FOR CLARIFICATION:
If I wrote my training loop the following way, there would be some things that bother me:
with tf.Session() as sess:
for epoch in range(100):
sess.run(iterator.initializer)
# iterate through the whole dataset once during the epoch and
# do 200 mini batch updates
for _ in range(number_of_samples // batch_size):
_, current_loss = sess.run([train_op, loss])
print(f'Epoch {epoch} training done!')
print(current_loss)
Firstly, loss would still be evaluated before doing the last weight update. That means whatever comes out is not the latest value. Secondly, I would not be able to access current_loss after exiting the for loop so I would not be able to print it.

1) Loss calculation over the whole training set (before updating weights) does make sense and is called batch gradient descent (despite using the whole training set and not a mini batch).
However, calculating a loss for your whole dataset before updating weights is slow (especially with large datasets) and training will take a long time to converge. As a result, using a mini batch of data to calculate loss and update weights is what is normally done instead. Although using a mini batch will produce a noisy estimate of the loss it is actually good enough estimate to train networks with enough training iterations.
EDIT:
I agree that the loss value you print will not be the latest loss with the latest updated weights. Probably for most cases it really doesn't make much different or change results so people just go with how you have wrote the code above. However, if you really want to obtain the true latest loss value after you have done training (to print out) then you will just have to run the loss op again after you have done a train op e.g.:
for _ in range(number_of_samples // batch_size):
sess.run([train_op])
current_loss = sess.run([loss])
This will get your true latest value. Of course this wont be on the whole dataset and will be just for a minibatch of 100. Again the value is likely a good enough estimate but if you wish to calculate exact loss for whole dataset you will have to run through your entire set e.g. another loop and then average the loss:
...
# Train loop
for _ in range(number_of_samples // batch_size):
_, current_loss = sess.run([train_op, loss])
print(f'Epoch {epoch} training done!')
# Calculate loss of whole train set after training an epoch.
sess.run(iterator.initializer)
current_loss_list = []
for _ in range(number_of_samples // batch_size):
_, current_loss = sess.run([loss])
current_loss_list.append(current_loss)
train_loss_whole_dataset = np.mean(current_loss_list)
print(train_loss_whole_dataset)
EDIT 2:
As pointed out doing the serial calls to train_op then loss will call the iterator twice and so things might not work out nicely (e.g. run out of data). Therefore my 2nd bit of code will be better to use.

I think the following code will answer your questions:
(A) how can you print the batch loss AFTER performing the train step? (B) how can you calculate the loss over the entire training set, even though the dataset iterator gives only a batch each time?
import tensorflow as tf
import numpy as np
dataset_size = 200
batch_size= 5
dimension = 4
# create some training dataset
dataset = tf.data.Dataset.\
from_tensor_slices(np.random.normal(2.0,size=(dataset_size,dimension)).
astype(np.float32))
dataset = dataset.batch(batch_size) # take batches
iterator = dataset.make_initializable_iterator()
x = tf.cast(iterator.get_next(),tf.float32)
w = tf.Variable(np.random.normal(size=(1,dimension)).astype(np.float32))
loss_func = lambda x,w: tf.reduce_mean(tf.square(x-w)) # notice that the loss function is a mean!
loss = loss_func(x,w) # this is the loss that will be minimized
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
# we are going to use control_dependencies so that we know that we have a loss calculation AFTER the train step
with tf.control_dependencies([train_op]):
loss_after_train_op = loss_func(x,w) # this is an identical loss, but will only be calculated AFTER train_op has
# been performed
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
# train one epoch
sess.run(iterator.initializer)
for i in range(dataset_size//batch_size):
# the training step will update the weights based on ONE batch of examples each step
loss1,_,loss2 = sess.run([loss,train_op,loss_after_train_op])
print('train step {:d}. batch loss before step: {:f}. batch loss after step: {:f}'.format(i,loss1,loss2))
# evaluate loss on entire training set. Notice that this calculation assumes the the loss is of the form
# tf.reduce_mean(...)
sess.run(iterator.initializer)
epoch_loss = 0
for i in range(dataset_size // batch_size):
batch_loss = sess.run(loss)
epoch_loss += batch_loss*batch_size
epoch_loss = epoch_loss/dataset_size
print('loss over entire training dataset: {:f}'.format(epoch_loss))
As for your question whether it makes sense to calculate loss over the entire training set - yes, it makes sense, for evaluation purposes. It usually does not make sense to perform training steps which are based on all of the training set since this set is usually very large and you want to update your weights more often, without needing to go over the entire training set each time.

Why do we need to call zero_grad() in PyTorch?

Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.

In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting to do backpropragation (i.e., updating the Weights and biases) because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
from torch.autograd import Variable
import torch.optim as optim
def linear_model(x, W, b):
return torch.matmul(x, W) + b
data, targets = ...
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)
optimizer = optim.Adam([W, b])
for sample, target in zip(data, targets):
# clear out the gradients of all Variables
# in this optimizer (i.e. W, b)
optimizer.zero_grad()
output = linear_model(sample, W, b)
loss = (output - target) ** 2
loss.backward()
optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)
for sample, target in zip(data, targets):
# clear out the gradients of Variables
# (i.e. W, b)
W.grad.data.zero_()
b.grad.data.zero_()
output = linear_model(sample, W, b)
loss = (output - target) ** 2
loss.backward()
W -= learning_rate * W.grad.data
b -= learning_rate * b.grad.data
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, Pytorch offers the option to reset the gradients to None optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but might be error-prone if not handled carefully.

Although the idea can be derived from the chosen answer, but I feel like I want to write that explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() provides more freedom on how gradient is accumulated and applied by the optimizer in the training loop. This is crucial when the model or input data is big and one actual training batch do not fit in to the gpu card.
Here in this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for the forward pass, following the loss.backward(). This is limited by the gpu memory.
gradient_accumulation_steps is the actual training batch size, where loss from multiple forward pass is accumulated. This is NOT limited by the gpu memory.
From this example, you can see how optimizer.zero_grad() may followed by optimizer.step() but NOT loss.backward(). loss.backward() is invoked in every single iteration (line 216) but optimizer.zero_grad() and optimizer.step() is only invoked when the number of accumulated train batch equals the gradient_accumulation_steps (line 227 inside the if block in line 219)
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
Also someone is asking about equivalent method in TensorFlow. I guess tf.GradientTape serve the same purpose.
(I am still new to AI library, please correct me if anything I said is wrong)

zero_grad() restarts looping without losses from the last step if you use the gradient method for decreasing the error (or losses).
If you do not use zero_grad() the loss will increase not decrease as required.
For example:
If you use zero_grad() you will get the following output:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad() you will get the following output:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5

You don't have to call grad_zero() alternatively one can decay the gradients for example:
optimizer = some_pytorch_optimizer
# decay the grads :
for group in optimizer.param_groups:
for p in group['params']:
if p.grad is not None:
''' original code from git:
if set_to_none:
p.grad = None
else:
if p.grad.grad_fn is not None:
p.grad.detach_()
else:
p.grad.requires_grad_(False)
p.grad.zero_()
'''
p.grad = p.grad / 2
this way the learning is much more continues

During the feed forward propagation the weights are assigned to inputs and after the 1st iteration the weights are initialized what the model has learnt seeing the samples(inputs). And when we start back propagation we want to update weights in order to get minimum loss of our cost function. So we clear off our previous weights in order to obtained more better weights. This we keep doing in training and we do not perform this in testing because we have got the weights in training time which is best fitted in our data. Hope this would clear more!

In simple terms We need ZERO_GRAD
because when we start a training loop we do not want past gardients or past results to interfere with our current results beacuse how PyTorch works as it collects/accumulates the gradients on backpropagation and if the past results may mixup and give us the wrong results so we set the gradient to zero every time we go through the loop.
Here is a example:
`
# let us write a training loop
torch.manual_seed(42)
epochs = 200
for epoch in range(epochs):
model_1.train()
y_pred = model_1(X_train)
loss = loss_fn(y_pred,y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
`
In this for loop if we do not set the optimizer to zero every time the past value it may get add up and changes the result.
So we use zero_grad to not face the wrong accumulated results.

tensorflow cifar10 example Learning rate decay confusion when using multiple gpus

community,
I have a small question about the learning rate decay in multi-GPUs training of the Tensorflow cifar10 example.
Here is the code:
# Create a variable to count the number of train() calls. This equals the
# number of batches processed * FLAGS.num_gpus.
global_step = tf.get_variable(
'global_step', [],
initializer=tf.constant_initializer(0), trainable=False)
# Calculate the learning rate schedule.
num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
FLAGS.batch_size)
decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
# Decay the learning rate exponentially based on the number of steps.
lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
global_step,
decay_steps,
cifar10.LEARNING_RATE_DECAY_FACTOR,
staircase=True)
In this code, the number of gpus is not considered. For instance, if we increase the FLAGS.num_gpus to 4. The decay_steps does not change.
In the comments, global_step is supposed to equal the number of batches processed * FlAGS.num_gpus. However, global_step only increases when opt.apply_gradients() function is called. It only increases 1 step per iteration.
In my opinion, the code should be
decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY/FLAGS.num_gpus)
Therefore, when utilizing multiple GPUs, the number of iteration required to go through 1 epoch is reduced.
Please correct me and help me understand if my logic is not correct.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Clarification about Gradient Accumulation - python

Related

Final step of PyTorch Gradient Accumulation for small datasets

Tensorflow NMT with Attention Tutorial -- need help understanding loss function

How to inject data into a graph when using an input pipeline?

Why do we need to call zero_grad() in PyTorch?

tensorflow cifar10 example Learning rate decay confusion when using multiple gpus

Categories

Resources