Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.
In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., before computing the gradients that are used to update the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient when training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action is to accumulate (i.e., sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
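To make the accumulation concrete, here is a minimal sketch with a single hypothetical parameter tensor w and a toy loss (neither is from the original answer). It calls backward() twice without zeroing, then once more after zeroing:

```python
import torch

# One small parameter tensor so the gradients are easy to follow (illustrative only).
w = torch.tensor([1.0, 2.0], requires_grad=True)

def loss_fn(w):
    # Toy loss: sum of squares, so d(loss)/dw = 2 * w
    return (w ** 2).sum()

loss_fn(w).backward()
first_grad = w.grad.clone()          # gradient of one backward pass

loss_fn(w).backward()                # no zeroing in between
accumulated_grad = w.grad.clone()    # old gradient + new gradient

w.grad.zero_()                       # what zeroing does for each parameter
loss_fn(w).backward()
fresh_grad = w.grad.clone()          # same as the first gradient again

print(first_grad, accumulated_grad, fresh_grad)
```

Without the zeroing step, the second gradient is twice the first even though w has not changed, which is the "combination of old and new gradient" described above.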
Here is a simple example:
import torch
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

# Variable is deprecated; tensors created with requires_grad=True are tracked directly
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all tensors
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # reduce to a scalar so that backward() can be called without arguments
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
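W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
learning_rate = 0.01  # assumed value; not given in the original

for sample, target in zip(data, targets):
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()
    # update the parameters outside of autograd tracking
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad
    # clear out the gradients (i.e. of W, b) before the next iteration;
    # zeroing at the top of the loop would fail on the first pass, when .grad is still None
    W.grad.zero_()
    b.grad.zero_()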
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeros. The docs claim that this setting reduces memory requirements and slightly improves performance, but might be error-prone if not handled carefully. Since v2.0, set_to_none=True is the default.
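The difference between the two modes can be checked directly. The sketch below uses a small hypothetical nn.Linear model and passes set_to_none explicitly in both calls, so it does not depend on the default of the installed version:

```python
import torch

# Hypothetical tiny model and optimizer, for illustration only.
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Populate the gradients, then fill them with zeros.
model(torch.ones(2, 3)).sum().backward()
optimizer.zero_grad(set_to_none=False)
all_zero = all(
    p.grad is not None and bool((p.grad == 0).all())
    for p in model.parameters()
)

# Populate the gradients again, then reset them to None.
model(torch.ones(2, 3)).sum().backward()
optimizer.zero_grad(set_to_none=True)
all_none = all(p.grad is None for p in model.parameters())

print(all_zero, all_none)
```

The "error-prone" part is the second case: any code that reads p.grad after the reset (for example manual gradient clipping or logging) has to handle None instead of a zero tensor.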
Although the idea can be derived from the chosen answer, I want to state it explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() gives more freedom in how the gradient is accumulated and applied by the optimizer in the training loop. This is crucial when the model or the input data is big and one actual training batch does not fit on the GPU.
Here in this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for each forward pass and the loss.backward() call that follows it. This is limited by the GPU memory.
gradient_accumulation_steps is the number of forward/backward passes whose gradients are accumulated before one parameter update, so the actual training batch size is train_batch_size * gradient_accumulation_steps. This is NOT limited by the GPU memory.
From this example, you can see how optimizer.zero_grad() is paired with optimizer.step() but NOT with loss.backward(). loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated train batches equals gradient_accumulation_steps (line 227, inside the if block at line 219).
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
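The pattern described above can be sketched roughly as follows. This is not the code from the linked script; the model, the data, and the sizes are hypothetical, and only the name gradient_accumulation_steps is borrowed from it:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)                       # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# Hypothetical data: 8 micro-batches of 2 samples each.
micro_batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(8)]
gradient_accumulation_steps = 4                     # effective batch = 2 * 4 samples

optimizer_steps = 0
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(inputs), targets)
    # Scale so the accumulated gradient is an average rather than a sum.
    (loss / gradient_accumulation_steps).backward()  # runs in every iteration
    if step % gradient_accumulation_steps == 0:
        optimizer.step()         # apply the accumulated gradient
        optimizer.zero_grad()    # reset only after the update
        optimizer_steps += 1

print(optimizer_steps)
```

Dividing the loss by gradient_accumulation_steps is a design choice that keeps the update comparable to one large batch with a mean-reduced loss; without it the accumulated gradient is the sum over the micro-batches.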
Someone also asked about an equivalent method in TensorFlow. I guess tf.GradientTape serves the same purpose.
(I am still new to AI library, please correct me if anything I said is wrong)
zero_grad() starts each loop iteration without the gradients left over from the last step, if you use a gradient method for decreasing the error (or loss).
If you do not use zero_grad(), the loss may increase instead of decreasing as required.
For example:
If you use zero_grad() you will get the following output:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad() you will get the following output:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
You don't have to call zero_grad(); alternatively, one can decay the gradients, for example:
optimizer = some_pytorch_optimizer
# decay the grads:
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            p.grad = p.grad / 2
This way the learning is much more continuous.
During forward propagation the inputs are combined with the weights, and after the first iteration the weights reflect what the model has learnt from the samples (inputs) seen so far. When we start backpropagation, we want to compute gradients to update the weights so that the loss gets smaller. So we clear the previous gradients (not the weights themselves), so that each update is based only on the gradient of the current batch. We keep doing this during training and do not do it during testing, because at test time no gradients are computed and we simply use the weights from training that best fit our data. Hope this clears things up!
In simple terms, we need zero_grad()
because when we start a training loop we do not want past gradients to interfere with our current results. PyTorch accumulates the gradients on backpropagation, so the past gradients would mix with the new ones and give wrong results; therefore we set the gradients to zero every time we go through the loop.
Here is an example:
`
# let us write a training loop
torch.manual_seed(42)
epochs = 200
for epoch in range(epochs):
    model_1.train()
    y_pred = model_1(X_train)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
`
In this for loop, if we do not zero the gradients on every iteration, the past values add up and change the result.
So we use zero_grad() to avoid wrongly accumulated results.
Related
I am building an optimizer (https://github.com/keras-team/keras/blob/master/keras/optimizers.py) which calculates a search direction and then tries a few different step lengths to find which gives the lowest loss. However, I am running into problems when trying to change the step length depending on the value of the loss itself. It appears that the loss (which is a tensor dependant upon the weight of the network and the data) cannot be updated/recalculated more than once during each training loop, which I find very odd.
This is the relevant code I have in get_updates(self, loss, params):
L1 = loss
for p, direction in zip(params, directions):
    self.updates.append(K.update(p, p + length*direction))
L2 = loss
for p, direction in zip(params, directions):
    self.updates.append(K.update(p, tf.cond(L2 < L1, lambda: p + 0.5*length*direction, lambda: p)))
The problem is that L1 and L2 are the same and no matter what I try I can't get the loss to update after I've updated the weights. I've also tried just p = p+length*direction and p.assign() but the loss doesn't update. Does anyone know how I can get an updated value of the loss?
Note, I am able to get the loss from the previous batch/epoch if I save a loss value and update using self.updates.append(K.update(self.prev_loss,loss)), however since the data will change between batches I am no longer working on the same loss function and thus my comparison between the losses to determine if the step length should be lower is not valid.
I am using an initializable iterator in my code. The iterator returns batches of size 100 from a csv dataset that has 20.000 entries. During training, however, I came across a problem. Consider this piece of code:
def get_dataset_iterator(batch_size):
    # parametrized with batch_size
    dataset = ...
    return dataset.make_initializable_iterator()

## build a model and train it (x is the input of my model)
iterator = get_dataset_iterator(100)
x = iterator.get_next()
y = model(x)
## L1 norm as loss, this works because the model is an autoencoder
loss = tf.abs(x - y)
## training operator
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    for epoch in range(100):
        sess.run(iterator.initializer)
        # iterate through the whole dataset once during the epoch and
        # do 200 mini batch updates
        for _ in range(number_of_samples // batch_size):
            sess.run(train_op)
        print(f'Epoch {epoch} training done!')
        # TODO: print loss after epoch here
I am interested in the training loss AFTER finishing the epoch. It makes most sense to me that I calculate the average loss over the whole training set (e.g. feeding all 20.000 samples through the network and averaging their loss). I could reuse the dataset iterator here with a batch size of 20.000, but I have declared x as the input.
So the questions are:
1.) Does the loss calculation over all 20.000 examples make sense? I have seen some people do the calculation with just a mini-batch (the last batch of the epoch).
2.) How can I calculate the loss over the whole training set with an input pipeline? I have to inject all of training data somehow, so that I can run sess.run(loss) without calculating it over only 100 samples (because x is declared as input).
EDIT FOR CLARIFICATION:
If I wrote my training loop the following way, there would be some things that bother me:
with tf.Session() as sess:
    for epoch in range(100):
        sess.run(iterator.initializer)
        # iterate through the whole dataset once during the epoch and
        # do 200 mini batch updates
        for _ in range(number_of_samples // batch_size):
            _, current_loss = sess.run([train_op, loss])
        print(f'Epoch {epoch} training done!')
        print(current_loss)
Firstly, loss would still be evaluated before doing the last weight update. That means whatever comes out is not the latest value. Secondly, I would not be able to access current_loss after exiting the for loop so I would not be able to print it.
1) Loss calculation over the whole training set (before updating weights) does make sense and is called batch gradient descent (despite using the whole training set and not a mini batch).
However, calculating the loss for your whole dataset before updating weights is slow (especially with large datasets), and training will take a long time to converge. As a result, using a mini-batch of data to calculate the loss and update the weights is what is normally done instead. Although a mini-batch produces a noisy estimate of the loss, it is a good enough estimate to train networks, given enough training iterations.
EDIT:
I agree that the loss value you print will not be the latest loss with the latest updated weights. In most cases it probably doesn't make much difference or change the results, so people just go with the code as you have written it above. However, if you really want to obtain the true latest loss value after training (to print it out), then you will have to run the loss op again after the train op, e.g.:
for _ in range(number_of_samples // batch_size):
    sess.run(train_op)
    current_loss = sess.run(loss)
This will get you the true latest value. Of course this won't be for the whole dataset, only for a mini-batch of 100. Again, the value is likely a good enough estimate, but if you wish to calculate the exact loss for the whole dataset you will have to run through your entire set, e.g. with another loop, and then average the loss:
...
# Train loop
for _ in range(number_of_samples // batch_size):
    _, current_loss = sess.run([train_op, loss])
print(f'Epoch {epoch} training done!')
# Calculate loss of whole train set after training an epoch.
sess.run(iterator.initializer)
current_loss_list = []
for _ in range(number_of_samples // batch_size):
    # only the loss op is run here, so no weights are updated
    current_loss = sess.run(loss)
    current_loss_list.append(current_loss)
train_loss_whole_dataset = np.mean(current_loss_list)
print(train_loss_whole_dataset)
EDIT 2:
As pointed out, making serial calls to train_op and then loss will call the iterator twice, so things might not work out nicely (e.g. you may run out of data). Therefore my second bit of code is the better one to use.
I think the following code will answer your questions:
(A) how can you print the batch loss AFTER performing the train step? (B) how can you calculate the loss over the entire training set, even though the dataset iterator gives only a batch each time?
import tensorflow as tf
import numpy as np

dataset_size = 200
batch_size = 5
dimension = 4

# create some training dataset
dataset = tf.data.Dataset.from_tensor_slices(
    np.random.normal(2.0, size=(dataset_size, dimension)).astype(np.float32))
dataset = dataset.batch(batch_size)  # take batches
iterator = dataset.make_initializable_iterator()
x = tf.cast(iterator.get_next(), tf.float32)
w = tf.Variable(np.random.normal(size=(1, dimension)).astype(np.float32))

loss_func = lambda x, w: tf.reduce_mean(tf.square(x - w))  # notice that the loss function is a mean!
loss = loss_func(x, w)  # this is the loss that will be minimized
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# we are going to use control_dependencies so that we know that we have a loss calculation AFTER the train step
with tf.control_dependencies([train_op]):
    # this is an identical loss, but will only be calculated AFTER train_op has been performed
    loss_after_train_op = loss_func(x, w)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # train one epoch
    sess.run(iterator.initializer)
    for i in range(dataset_size // batch_size):
        # the training step will update the weights based on ONE batch of examples each step
        loss1, _, loss2 = sess.run([loss, train_op, loss_after_train_op])
        print('train step {:d}. batch loss before step: {:f}. batch loss after step: {:f}'.format(i, loss1, loss2))

    # evaluate loss on entire training set. Notice that this calculation assumes that the loss is of the form
    # tf.reduce_mean(...)
    sess.run(iterator.initializer)
    epoch_loss = 0
    for i in range(dataset_size // batch_size):
        batch_loss = sess.run(loss)
        epoch_loss += batch_loss * batch_size
    epoch_loss = epoch_loss / dataset_size
    print('loss over entire training dataset: {:f}'.format(epoch_loss))
As for your question whether it makes sense to calculate loss over the entire training set - yes, it makes sense, for evaluation purposes. It usually does not make sense to perform training steps which are based on all of the training set since this set is usually very large and you want to update your weights more often, without needing to go over the entire training set each time.
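One detail of the epoch-loss calculation above: multiplying every batch loss by batch_size is exact only when all batches are full, which holds for 200 samples in batches of 5. A small framework-free sketch of the same idea, weighting each mean by the actual batch size (the function name and the numbers are made up for illustration), could look like this:

```python
def dataset_mean_loss(batch_mean_losses, batch_sizes):
    """Combine per-batch mean losses into the mean loss over all samples."""
    total = sum(loss * n for loss, n in zip(batch_mean_losses, batch_sizes))
    return total / sum(batch_sizes)

# Hypothetical numbers: 12 samples split into batches of 5, 5 and 2.
batch_mean_losses = [0.5, 0.3, 1.0]
batch_sizes = [5, 5, 2]

weighted = dataset_mean_loss(batch_mean_losses, batch_sizes)
# A plain average of the batch means over-weights the small last batch.
unweighted = sum(batch_mean_losses) / len(batch_mean_losses)

print(weighted, unweighted)
```

With equally sized batches the two values coincide, so the simpler np.mean over batch losses is fine in that case.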
Below is the code.
def create_train_model(hidden_nodes, num_iters):
    tf.reset_default_graph()
    X = tf.placeholder(shape=(120, 4), dtype=tf.float64, name='X')
    y = tf.placeholder(shape=(120, 1), dtype=tf.float64, name='y')
    W1 = tf.Variable(np.random.rand(4, hidden_nodes), dtype=tf.float64)
    W2 = tf.Variable(np.random.rand(hidden_nodes, 2), dtype=tf.float64)
    A1 = tf.sigmoid(tf.matmul(X, W1))
    U_est = tf.sigmoid(tf.matmul(A1, W2))
    loss = fuloss3(U_est, y)
    optimizer = tf.train.AdagradOptimizer(4.9406564584124654e-324)
    TRAIN = optimizer.minimize(loss)
    init = tf.initialize_all_variables()
    sess = tf.Session()
    sess.run(init)
    for i in range(num_iters):
        pout = sess.run(loss, feed_dict={X: Xtrain, y: ytrain})
        sess.run(TRAIN, feed_dict={X: Xtrain, y: ytrain})
        loss_plot[hidden_nodes][i] = sess.run(loss, feed_dict={X: Xtrain, y: ytrain})
        print(pout)
        weights1 = sess.run(W1)
        weights2 = sess.run(W2)
    print(weights1)
    print(weights2)
    print('loss (hidden nodes: %d, iterations: %d): %.2f' % (hidden_nodes,
          num_iters, loss_plot[hidden_nodes][num_iters-1]))
    sess.close()
    return weights1, weights2
print(pout) returns a non-NaN number, while after training the weights all come out as NaN, even though I have set the learning rate to the smallest possible value. Why would this happen? With a learning rate so small, you're basically not moving the variables. The fact that the initial run on loss gave a valid result, as evident from pout, means that it's not an issue with how I set up my loss. Thanks in advance.
I suspect your problem is here:
W1=tf.Variable(np.random.rand(4,hidden_nodes),dtype=tf.float64)
W2=tf.Variable(np.random.rand(hidden_nodes,2),dtype=tf.float64)
Try this out:
W1 = tf.get_variable("W1", shape=..., dtype=...,
                     initializer=tf.contrib.layers.xavier_initializer())
W2 = tf.get_variable("W2", shape=..., dtype=...,
                     initializer=tf.contrib.layers.xavier_initializer())
Your weight initialization is in the [0,1] range, which are quite large weights. That's going to start the network off with wild gradient swings that are likely to throw you into a NaN situation.
The xavier initializer will take into account the number of inputs to a node and initialize the value such that you aren't saturating a node. In lay terms it initializes the weights intelligently depending on your architecture.
Note that there is a convolutional version of this initializer too.
Optionally, as a quick test, you could cut down the size of your weight initialization by simply multiplying the random weights by a small value such as 1e-4.
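To give a feel for the scale involved, here is a framework-free sketch of the uniform variant of Xavier (Glorot) initialization, which samples from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The helper name and the layer width are hypothetical, and this is an illustration of the formula rather than TensorFlow's implementation:

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-limit, limit),
    where limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, limit

rng = random.Random(0)
hidden_nodes = 8  # hypothetical layer width

W1, limit1 = xavier_uniform(4, hidden_nodes, rng)
W2, limit2 = xavier_uniform(hidden_nodes, 2, rng)

print(limit1, limit2)
```

Unlike np.random.rand, which yields only positive values in [0, 1), these weights are centred on zero and shrink as the layer gets wider, which makes it less likely that the sigmoid units start out saturated.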
Post a comment back here if that doesn't resolve the issue.
I'd like to adjust the sampling rate during training of my neural network to test some stuff and see what happens. To achieve that, my idea was to create a new loss and optimizer for every iteration using the same computation graph.
def optimize(self, negative_sampling_rate):
    return tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.calc_loss(negative_sampling_rate))

def calc_loss(self, negative_sampling_rate):
    return tf.reduce_mean(tf.nn.nce_loss(
        weights=self.graph.prediction_weights,
        biases=self.graph.prediction_bias,
        labels=self.graph.labels,
        inputs=self.graph.hidden,
        num_sampled=negative_sampling_rate,
        num_classes=self.graph.prediction_weights.shape[1])
    )

def train(self, batch_inputs, batch_labels, negative_sampling_rate):
    feed_dict = {self.graph.X: batch_inputs, self.graph.labels: batch_labels}
    _, loss_val = self.session.run(
        [self.optimize(negative_sampling_rate), self.calc_loss(negative_sampling_rate)], feed_dict=feed_dict
    )
    return loss_val
But I'm a little bit worried about the optimizer. I've heard that optimizers have internal variables, which change on every training iteration. Is that true for all optimizers or only for some, and if so, which ones are usable for this approach?
The Neural network should then be trained like:
for step in range(training_steps):
    NN.train(inputs, labels, compute_sampling_rate(step))
First of all, it should be okay to change the number of samples in the nce loss without causing problems for the optimizer. The internal variables stored by some optimizers relate to the historical gradients of the trainable variables in your graph.
Secondly, if you do want to reset the state of your optimizer for some reason, the way I do it is by putting the optimizer in a variable scope. Whenever I want to reset it then I run the reset_optimizer op.
reset_optimizer = tf.no_op()
with tf.variable_scope('optimizer'):
    train_op = optimizer(learning_rate).minimize(loss)
opt_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, "optimizer")
if len(opt_vars):  # check if optimizer state needs resetting
    reset_optimizer = tf.variables_initializer(opt_vars)
I would like to train the weights of a model based on the sum of the loss values of several batches. However, it seems that once you run the graph for each of the individual batches, the object that is returned is just a regular numpy array. So when you try to use an optimizer like GradientDescentOptimizer, it no longer has information about the variables that were used to calculate the sum of the losses, so it can't find the gradients of the weights that would help minimize the loss. Here's an example TensorFlow script to illustrate what I'm talking about:
weights = tf.Variable(tf.ones([num_feature_values], tf.float32))
feature_values = tf.placeholder(tf.int32, shape=[num_feature_values])
labels = tf.placeholder(tf.int32, shape=[1])
loss_op = some_loss_function(weights, feature_values, labels)

with tf.Session() as sess:
    for batch in batches:
        feed_dict = fill_feature_values_and_labels(batch)
        # Calculates loss for one batch
        loss = sess.run(loss_op, feed_dict=feed_dict)
        # Adds it to total loss
        total_loss += loss

# Want to train weights to minimize total_loss, however this
# doesn't work because the graph has already been run.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss)

with tf.Session() as sess:
    for step in xrange(num_steps):
        sess.run(optimizer)
The total_loss is a numpy array and thus cannot be used in the optimizer. Does anyone know a way around the problem, where I want to use information across many batches but still need the graph intact in order to preserve the fact that the total_loss is a function of the weights?
The thing you optimize in any of the trainers must be a part of the graph, here what you train on is the actual realized result, so it won't work.
I think the way you should probably do this is to construct your input as a batch of batches e.g.
input = tf.placeholder("float", (number_of_batches, batch_size, input_size))
Then have your target also be a 3d tensor which can be trained on.
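The underlying point, that the summed loss has to stay a function of the weights, can be illustrated with a short sketch. It uses PyTorch rather than TensorFlow (an assumption made only to keep the example small and runnable), with a hypothetical squared-error loss and random data shaped as a batch of batches:

```python
import torch

torch.manual_seed(0)
number_of_batches, batch_size, input_size = 3, 4, 2

weights = torch.ones(input_size, requires_grad=True)
# Inputs and targets stored as a "batch of batches" (3-D and 2-D tensors).
inputs = torch.randn(number_of_batches, batch_size, input_size)
targets = torch.randn(number_of_batches, batch_size)

def batch_loss(x, y):
    # Hypothetical squared-error loss for one batch.
    return ((x @ weights - y) ** 2).mean()

# Keep the per-batch losses as tensors so their sum stays differentiable.
total_loss = sum(batch_loss(inputs[i], targets[i])
                 for i in range(number_of_batches))
total_loss.backward()
grad_from_sum = weights.grad.clone()

# Converting to a plain number (comparable to the numpy value returned by
# sess.run) drops the link to the weights, so nothing can be optimized from it.
detached_total = float(total_loss)

print(grad_from_sum, detached_total)
```

The gradient obtained from the summed loss equals the sum of the per-batch gradients, which is why accumulating gradients batch by batch is an alternative to building one large input tensor.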