I want to accumulate the gradients before I do a backward pass, and I am wondering what the right way to do it is. According to this article it's:
model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
whereas I expected it to be:
model.zero_grad()                                   # Reset gradients tensors
loss = 0
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss += loss_function(predictions, labels)      # Compute loss function
    if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
        loss = loss / accumulation_steps            # Normalize our loss (if averaged)
        loss.backward()                             # Backward pass
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        loss = 0
where I accumulate the loss and then divide by the accumulation steps to average it.
Secondary question: if I am right, would you expect my method to be quicker, considering I only do the backward pass once every accumulation_steps iterations?
According to the answer here, the first method is the more memory-efficient one; the amount of computation required is more or less the same in both methods.
The second method keeps accumulating the computation graph, so it would require accumulation_steps times more memory. The first method computes the gradients straight away (and simply adds them up), so it requires less memory.
The backward pass, loss.backward(), is the operation that actually computes the gradients.
If you only do the forward pass (predictions = model(inputs)), no gradients are computed, and thus no accumulation is possible.
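To see the accumulation the first method relies on, here is a minimal, self-contained sketch (not from the question) showing that each loss.backward() call adds into param.grad until the gradients are explicitly zeroed:

import torch

w = torch.ones(1, requires_grad=True)

loss = (2 * w).sum()
loss.backward()
print(w.grad)  # tensor([2.])

loss = (2 * w).sum()
loss.backward()
print(w.grad)  # tensor([4.]) -- the second backward() added to the existing gradient

w.grad.zero_()
print(w.grad)  # tensor([0.]) -- gradients must be reset explicitly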
I am a new PyTorch user and here is the code I am playing with.
epochs = 20   # train for this number of epochs
losses = []   # to keep track of losses

for i in range(epochs):
    i += 1    # counter
    y_pred = model(cat_train, con_train)
    loss = torch.sqrt(criterion(y_pred, y_train))
    losses.append(loss)   # append loss values
    if i % 10 == 1:       # print out our progress
        print(f'epoch: {i} loss is {loss}')
    # back propagation
    optimizer.zero_grad()  # zero the gradients
    loss.backward()        # backward pass
    optimizer.step()

plt.plot(range(epochs), losses)
and it gives me the following error:
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
I know the problem is related to the type of the losses, which look like this:
tensor(3.6168, grad_fn=<SqrtBackward0>)
Can you suggest how I can grab the first column (the numeric values of this tensor) and make it plottable, i.e. an array rather than a Tensor?
You can use torch.Tensor.item.
So, replace the statement
losses.append(loss)
with
losses.append(loss.item())
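As a minimal, self-contained illustration (the loss below is a toy stand-in, not the model from the question), .item() stores plain Python floats that matplotlib can plot directly:

import torch
import matplotlib.pyplot as plt

losses = []
for step in range(5):
    x = torch.randn(10, requires_grad=True)
    loss = (x ** 2).mean()          # toy stand-in for the real loss
    losses.append(loss.item())      # .item() returns a detached Python float

plt.plot(range(len(losses)), losses)
plt.show()

# alternatively, a tensor that still requires grad can be converted with
# loss.detach().numpy()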
I am training a BERT model on a relatively small dataset and cannot afford to lose any labelled sample as they must all be used for training. Due to GPU memory constraints, I am using gradient accumulation to train on larger batches (e.g. 32). According to PyTorch documentation, gradient accumulation is implemented as follows:
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
However, if you are using e.g. 110 training samples, with batch size 8 and accumulation steps 4 (i.e. effective batch size 32), this method would only train on the first 96 samples (i.e. 32 x 3), wasting the remaining 14 samples. In order to avoid this, I'd like to modify the code as follows (notice the change to the final if statement):
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0 or (i + 1) == len(data):
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
Is this correct and really that simple, or will this have any side effects? It seems very simple to me, but I've never seen it done before. Any help appreciated!
As Lucas Ramos already mentioned, when using DataLoader where the underlying dataset's size is not divisible by the batch size, the default behavior is to have a smaller last batch:
drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
Your plan is basically implementing gradient accumulation combined with drop_last=False - that is, having the last batch smaller than all the others.
Therefore, in principle there's nothing wrong with training with varying batch sizes.
However, there is something you need to fix in your code:
The loss is averaged over the mini-batch. So, if you process mini batches in the usual way you do not need to worry about it. However, when accumulating gradients you do it explicitly by dividing the loss by iters_to_accumulate:
loss = loss / iters_to_accumulate
In the last (smaller) mini batch you need to change the value of iters_to_accumulate to reflect the smaller mini-batch size!
I propose the following revised code, breaking the training loop into two: an outer loop over mini batches, and an inner one that accumulates gradients per mini batch. Note how using an iter over the DataLoader helps break the training loop into two:
scaler = GradScaler()

for epoch in epochs:
    bi = 0  # index batches
    # outer loop over minibatches
    data_iter = iter(data)
    while bi < len(data):
        # determine the range for this batch
        nbi = min(len(data), bi + iters_to_accumulate)
        # inner loop over the items of the mini batch - accumulating gradients
        for i in range(bi, nbi):
            input, target = next(data_iter)
            with autocast():
                output = model(input)
                loss = loss_fn(output, target)
                loss = loss / (nbi - bi)  # divide by the true batch size

            # Accumulates scaled gradients.
            scaler.scale(loss).backward()
        # done mini batch loop - gradients were accumulated, we can make an optimization step.
        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        bi = nbi
I was pretty sure I've seen this done before. Check out this code from Pytorch Lightning (functions _accumulated_batches_reached, _num_training_batches_reached and should_accumulate).
I am reading the PyTorch official tutorial for fine-tuning and I am faced with one problem: the calculation of the loss in each epoch.
Before this, I would calculate the loss for each batch of data, accumulate these batch losses, and take the mean of these values as the loss of the epoch. But in that example, the calculation is as follows:
for inputs, labels in dataloaders[phase]:
    inputs = inputs.to(device)
    labels = labels.to(device)

    # zero the parameter gradients
    optimizer.zero_grad()

    # forward
    # track history if only in train
    with torch.set_grad_enabled(phase == 'train'):
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        loss = criterion(outputs, labels)

        # backward + optimize only if in training phase
        if phase == 'train':
            loss.backward()
            optimizer.step()

    # statistics
    running_loss += loss.item() * inputs.size(0)
    running_corrects += torch.sum(preds == labels.data)
My question is about this line: running_loss += loss.item() * inputs.size(0). It multiplies the loss value of the batch by the batch size. What is the correct way to calculate the loss of an epoch?
Also, what is the unit of the loss, and what is the range of the loss value?
Yes, the code snippet adds up the batch mean error multiplied by the batch size. If you want to calculate the true sum instead, you can use
torch.nn.CrossEntropyLoss(reduction = "sum")
which will give you the sum of errors for the batch. Then you can directly sum for each batch as follows:
running_loss += loss.item()
The range of the loss value depends on your number of classes and your feature vector. The code in your question will give the same running_loss as using reduction="sum", because it basically computes
(loss / batch_size) * batch_size
which is the same value. However, the backpropagation changes: in one case you backpropagate the sum of the losses, in the other you backpropagate the mean loss.
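To make the epoch-level bookkeeping concrete, here is a self-contained sketch with a toy model (illustrative only, not the tutorial's code): accumulate a per-sample sum during the epoch and divide by the number of samples at the end.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# toy data/model so the snippet runs on its own
X, y = torch.randn(110, 20), torch.randint(0, 3, (110,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=8)
model = nn.Linear(20, 3)
criterion = nn.CrossEntropyLoss()            # default reduction='mean' over the batch
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

running_loss, n_samples = 0.0, 0
for inputs, labels in train_loader:
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    running_loss += loss.item() * inputs.size(0)  # undo the mean -> per-batch sum
    n_samples += inputs.size(0)

epoch_loss = running_loss / n_samples  # mean loss per sample over the whole epoch
print(f'epoch loss: {epoch_loss:.4f}')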
I am using an initializable iterator in my code. The iterator returns batches of size 100 from a CSV dataset that has 20,000 entries. During training, however, I came across a problem. Consider this piece of code:
def get_dataset_iterator(batch_size):
    # parametrized with batch_size
    dataset = ...
    return dataset.make_initializable_iterator()

## build a model and train it (x is the input of my model)
iterator = get_dataset_iterator(100)
x = iterator.get_next()
y = model(x)

## L1 norm as loss, this works because the model is an autoencoder
loss = tf.abs(x - y)

## training operator
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    for epoch in range(100):
        sess.run(iterator.initializer)
        # iterate through the whole dataset once during the epoch and
        # do 200 mini batch updates
        for _ in range(number_of_samples // batch_size):
            sess.run(train_op)

        print(f'Epoch {epoch} training done!')
        # TODO: print loss after epoch here
I am interested in the training loss AFTER finishing the epoch. It makes the most sense to me to calculate the average loss over the whole training set (e.g. feeding all 20,000 samples through the network and averaging their loss). I could reuse the dataset iterator here with a batch size of 20,000, but I have already declared x as the input.
So the questions are:
1.) Does the loss calculation over all 20,000 examples make sense? I have seen some people do the calculation with just a mini batch (the last batch of the epoch).
2.) How can I calculate the loss over the whole training set with an input pipeline? I would have to inject all of the training data somehow, so that I can run sess.run(loss) without calculating it over only 100 samples (because x is declared as the input).
EDIT FOR CLARIFICATION:
If I wrote my training loop the following way, there would be some things that bother me:
with tf.Session() as sess:
    for epoch in range(100):
        sess.run(iterator.initializer)
        # iterate through the whole dataset once during the epoch and
        # do 200 mini batch updates
        for _ in range(number_of_samples // batch_size):
            _, current_loss = sess.run([train_op, loss])

        print(f'Epoch {epoch} training done!')
        print(current_loss)
Firstly, loss would still be evaluated before doing the last weight update. That means whatever comes out is not the latest value. Secondly, I would not be able to access current_loss after exiting the for loop so I would not be able to print it.
1) Loss calculation over the whole training set (before updating the weights) does make sense and is called batch gradient descent (despite the name, it uses the whole training set rather than a mini batch).
However, calculating the loss over your whole dataset before each weight update is slow (especially with large datasets), and training will take a long time to converge. As a result, using a mini batch of data to calculate the loss and update the weights is what is normally done instead. Although a mini batch produces a noisy estimate of the loss, it is a good enough estimate to train networks given enough training iterations.
EDIT:
I agree that the loss value you print will not be the latest loss computed with the latest updated weights. In most cases it probably doesn't make much difference or change the results, so people just write the code the way you have above. However, if you really want to obtain the true latest loss value after the train op (to print out), then you will just have to run the loss op again after the train op, e.g.:
for _ in range(number_of_samples // batch_size):
    sess.run([train_op])
    current_loss = sess.run([loss])
This will get your true latest value. Of course, this won't be on the whole dataset; it will just be for a minibatch of 100. Again, the value is likely a good enough estimate, but if you wish to calculate the exact loss for the whole dataset you will have to run through your entire set, e.g. in another loop, and then average the loss:
...
# Train loop
for _ in range(number_of_samples // batch_size):
    _, current_loss = sess.run([train_op, loss])
print(f'Epoch {epoch} training done!')

# Calculate loss of whole train set after training an epoch.
sess.run(iterator.initializer)
current_loss_list = []
for _ in range(number_of_samples // batch_size):
    current_loss = sess.run(loss)
    current_loss_list.append(current_loss)
train_loss_whole_dataset = np.mean(current_loss_list)
print(train_loss_whole_dataset)
EDIT 2:
As pointed out, making serial calls to train_op and then loss will advance the iterator twice, so things might not work out nicely (e.g. you may run out of data). Therefore my second bit of code is the better one to use.
I think the following code will answer your questions:
(A) how can you print the batch loss AFTER performing the train step? (B) how can you calculate the loss over the entire training set, even though the dataset iterator gives only a batch each time?
import tensorflow as tf
import numpy as np

dataset_size = 200
batch_size = 5
dimension = 4

# create some training dataset
dataset = tf.data.Dataset.from_tensor_slices(
    np.random.normal(2.0, size=(dataset_size, dimension)).astype(np.float32))
dataset = dataset.batch(batch_size)  # take batches

iterator = dataset.make_initializable_iterator()
x = tf.cast(iterator.get_next(), tf.float32)
w = tf.Variable(np.random.normal(size=(1, dimension)).astype(np.float32))

loss_func = lambda x, w: tf.reduce_mean(tf.square(x - w))  # notice that the loss function is a mean!
loss = loss_func(x, w)  # this is the loss that will be minimized
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# we are going to use control_dependencies so that we know that we have a loss
# calculation AFTER the train step
with tf.control_dependencies([train_op]):
    loss_after_train_op = loss_func(x, w)  # an identical loss, but calculated only AFTER train_op has been performed

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # train one epoch
    sess.run(iterator.initializer)
    for i in range(dataset_size // batch_size):
        # the training step will update the weights based on ONE batch of examples each step
        loss1, _, loss2 = sess.run([loss, train_op, loss_after_train_op])
        print('train step {:d}. batch loss before step: {:f}. batch loss after step: {:f}'.format(i, loss1, loss2))

    # evaluate loss on entire training set. Notice that this calculation assumes
    # that the loss is of the form tf.reduce_mean(...)
    sess.run(iterator.initializer)
    epoch_loss = 0
    for i in range(dataset_size // batch_size):
        batch_loss = sess.run(loss)
        epoch_loss += batch_loss * batch_size
    epoch_loss = epoch_loss / dataset_size
    print('loss over entire training dataset: {:f}'.format(epoch_loss))
As for your question whether it makes sense to calculate loss over the entire training set - yes, it makes sense, for evaluation purposes. It usually does not make sense to perform training steps which are based on all of the training set since this set is usually very large and you want to update your weights more often, without needing to go over the entire training set each time.
Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.
In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., before updating the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
from torch.autograd import Variable
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all Variables
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()
    optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of Variables
    # (i.e. W, b)
    W.grad.data.zero_()
    b.grad.data.zero_()

    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()

    W -= learning_rate * W.grad.data
    b -= learning_rate * b.grad.data
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None, optimizer.zero_grad(set_to_none=True), instead of filling them with a tensor of zeros. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.
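For illustration, a minimal sketch of how set_to_none=True slots into an ordinary training loop (the toy model and data here are placeholders, not from the answer above):

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(3):
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = nn.functional.mse_loss(model(x), y)

    optimizer.zero_grad(set_to_none=True)  # p.grad becomes None instead of a zero tensor
    loss.backward()                        # backward() re-creates the .grad tensors
    optimizer.step()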
Although the idea can be derived from the chosen answer, I feel like I want to write it out explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() gives you more freedom over how gradients are accumulated and applied by the optimizer in the training loop. This is crucial when the model or the input data is big and one actual training batch does not fit on the GPU.
Here in this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for the forward pass that is followed by loss.backward(). This is limited by the GPU memory.
gradient_accumulation_steps sets how many such batches are accumulated before an optimizer step, so the effective training batch size is train_batch_size x gradient_accumulation_steps. This is NOT limited by the GPU memory.
From this example, you can see how optimizer.zero_grad() may be followed by optimizer.step() but NOT by loss.backward(). loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated train batches equals gradient_accumulation_steps (line 227, inside the if block in line 219).
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
Someone also asked about the equivalent method in TensorFlow; I guess tf.GradientTape serves the same purpose (a rough sketch is given below).
(I am still new to these AI libraries, so please correct me if anything I said is wrong.)
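As a rough, hedged sketch of what gradient accumulation with tf.GradientTape could look like in TensorFlow 2 (the toy model, loss, optimizer, and dataset here are placeholders, not code from the linked repository):

import tensorflow as tf

# toy setup so the sketch runs on its own
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(0.1)
loss_fn = tf.keras.losses.MeanSquaredError()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((32, 4)), tf.random.normal((32, 1)))).batch(4)

accum_steps = 4
model.build((None, 4))  # create the variables so we can allocate accumulators
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x)) / accum_steps   # normalise, as in the PyTorch examples
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]

    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]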
zero_grad() clears the gradients left over from the last step, so that when you use a gradient method for decreasing the error (or loss) each update is based only on the current step's gradients.
If you do not use zero_grad(), the gradients keep accumulating and the loss may increase instead of decreasing as required.
For example:
If you use zero_grad() you will get the following output:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad() you will get the following output:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
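As a toy comparison (illustrative only; the exact numbers will differ), the same tiny training loop can be run with and without zeroing the gradients to see the effect:

import torch
import torch.nn as nn

def train(zero_grad: bool):
    torch.manual_seed(0)
    model = nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(64, 1)
    y = 3 * x + 1
    for step in range(5):
        loss = nn.functional.mse_loss(model(x), y)
        if zero_grad:
            optimizer.zero_grad()      # discard gradients from previous steps
        loss.backward()                # adds to whatever is already in .grad
        optimizer.step()
        print(f'zero_grad={zero_grad} step={step} loss={loss.item():.3f}')

train(zero_grad=True)   # updates use fresh gradients each step
train(zero_grad=False)  # stale gradients pile up, so the updates overshoot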
You don't have to call zero_grad(); alternatively, you can decay the gradients, for example:
optimizer = some_pytorch_optimizer

# decay the grads:
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            p.grad = p.grad / 2
This way the learning is much more continuous.
During the forward pass the inputs are combined with the weights, and after the first iteration the weights reflect what the model has learnt from the samples (inputs). When we start backpropagation we want to update the weights in order to minimise the loss of our cost function, so we clear the previous gradients in order to obtain a better update. We keep doing this during training, and we do not do it during testing, because the weights obtained during training are the ones that best fit our data. Hope this makes it clearer!
In simple terms, we need zero_grad() because when we start a training loop we do not want past gradients or past results to interfere with our current results. Because of how PyTorch works, it collects/accumulates the gradients during backpropagation, and if the past results mix in they can give us the wrong result, so we set the gradients to zero every time we go through the loop.
Here is an example:

# let us write a training loop
torch.manual_seed(42)

epochs = 200
for epoch in range(epochs):
    model_1.train()
    y_pred = model_1(X_train)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In this for loop, if we do not zero the gradients every time, the past values may get added up and change the result.
So we use zero_grad() to avoid wrong accumulated results.