mini-batch gradient descent implementation in tensorflow


While reading a TensorFlow implementation of a deep learning model, I am trying to understand the following code segment from the training process.
self.net.gradients_node = tf.gradients(loss, self.variables)
for epoch in range(epochs):
    total_loss = 0
    for step in range((epoch*training_iters), ((epoch+1)*training_iters)):
        batch_x, batch_y = data_provider(self.batch_size)
        # Run optimization op (backprop)
        _, loss, lr, gradients = sess.run((self.optimizer, self.net.cost, self.learning_rate_node, self.net.gradients_node),
                                          feed_dict={self.net.x: batch_x,
                                                     self.net.y: util.crop_to_shape(batch_y, pred_shape),
                                                     self.net.keep_prob: dropout})
        if avg_gradients is None:
            avg_gradients = [np.zeros_like(gradient) for gradient in gradients]
        for i in range(len(gradients)):
            avg_gradients[i] = (avg_gradients[i] * (1.0 - (1.0 / (step+1)))) + (gradients[i] / (step+1))
        norm_gradients = [np.linalg.norm(gradient) for gradient in avg_gradients]
        self.norm_gradients_node.assign(norm_gradients).eval()
        total_loss += loss
I think it is related to mini-batch gradient descent, but I cannot understand how it works, and I have difficulty connecting it to the mini-batch gradient descent algorithm.

The averaging in this snippet is not part of mini-batch SGD itself; the parameter update happens inside self.optimizer.
The rest of the loop computes a running average of the gradients over all steps. After the first step avg_gradients contains the gradient that was just computed, after the second step it is the elementwise mean of the two gradients, and after n steps it is the elementwise mean of all n gradients computed so far. The code then computes the norm of each averaged gradient with np.linalg.norm and stores those norms in norm_gradients_node; the gradients themselves are not rescaled. It is hard to tell why those average gradients are needed without the context in which they were presented.
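To see that the update in the loop is an incremental mean, here is a minimal framework-free sketch. The three gradient vectors are made-up values, and update_running_mean is a hypothetical helper that is not part of the original code.

```python
import math

def update_running_mean(avg, grad, step):
    """Incremental mean: avg * (1 - 1/(step+1)) + grad / (step+1)."""
    n = step + 1
    return [a * (1.0 - 1.0 / n) + g / n for a, g in zip(avg, grad)]

# Made-up gradients of a 2-parameter model over three steps.
gradients = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]]

avg = [0.0, 0.0]
for step, grad in enumerate(gradients):
    avg = update_running_mean(avg, grad, step)

# The running result matches the plain elementwise mean of all gradients.
direct_mean = [sum(col) / len(gradients) for col in zip(*gradients)]

# What np.linalg.norm does in the snippet: a norm is computed, nothing is rescaled.
norm_of_mean = math.sqrt(sum(a * a for a in avg))
```

The same formula works for any step count, as long as step starts at 0 and the average is never reset.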

Related

Tensorflow calculate hessian of model weights in a batch

I am replicating a paper. I have a basic Keras CNN model for MNIST classification. For each sample z in the training set, I want to calculate the Hessian matrix of that sample's loss with respect to the model parameters. I then want to average this Hessian over the training data (n is the number of training samples).
My final goal is to calculate this value (the influence score):
I can calculate the left term and the right term and want to compute the Hessian term. I don't know how to calculate hessian for the model weights for a batch of examples (vectorization). I was able to calculate it only for a sample at a time which is too slow.
x = tf.convert_to_tensor(x_train[0:13])
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = model(x)
        mce = tf.keras.losses.CategoricalCrossentropy()
        y_expanded = y_train[train_idx]
        loss = mce(y_expanded, y)
    g = t1.gradient(loss, model.weights[4])
h = t2.jacobian(g, model.weights[4])
print(h.shape)
For clarification, if a model layer has dimension 20*30, I want to feed a batch of 13 samples to it and get a Hessian of dimension (13,20,30,20,30). At the moment I can only get a Hessian of dimension (20,30,20,30), which defeats the vectorization (the code above).
This thread has the same problem, except that I want the second-order derivative rather than the first-order one.
I also tried the script below, which returns a (13,20,30,20,30) tensor with the right dimensions. However, when I compared the sum of this tensor with the sum of 13 single-sample Hessians computed in a for loop from 0 to 12, the numbers differed, so this does not work either.
x = tf.convert_to_tensor(x_train[0:13])
mce = tf.keras.losses.CategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        t1.watch(model.weights[4])
        y_expanded = y_train[0:13]
        y = model(x)
        loss = mce(y_expanded, y)
    j1 = t1.jacobian(loss, model.weights[4])
j3 = t2.jacobian(j1, model.weights[4])
print(j3.shape)

That is how Hessians are defined: you can only calculate the Hessian of a scalar function.
There is nothing new here. The same is true of gradients, and batches are handled there by accumulating the per-example gradients; something similar can be done with the Hessian.
If you know how to compute the Hessian of the loss, you can define a batch cost and compute its Hessian with the same method. For example, you could define the cost as sum(losses), where losses is the vector of per-example losses in the batch.
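The reason this works is that differentiation is linear: the Hessian of sum(losses) equals the sum of the per-example Hessians, so dividing by the batch size gives the average Hessian without building a per-sample tensor. The sketch below checks this numerically in plain Python with a finite-difference Hessian. The logistic loss, the three examples and the weight vector are all made up for illustration and are unrelated to the Keras model in the question.

```python
import math

def per_example_loss(w, x, y):
    """Logistic loss of one example with label y in {-1, +1}."""
    margin = y * (w[0] * x[0] + w[1] * x[1])
    return math.log(1.0 + math.exp(-margin))

def hessian(f, w, eps=1e-4):
    """Central-difference Hessian of a scalar function f at the point w."""
    n = len(w)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def f_shifted(di, dj):
                v = list(w)
                v[i] += di
                v[j] += dj
                return f(v)
            H[i][j] = (f_shifted(eps, eps) - f_shifted(eps, -eps)
                       - f_shifted(-eps, eps) + f_shifted(-eps, -eps)) / (4 * eps * eps)
    return H

# Made-up batch of three examples and a made-up weight vector.
batch = [([1.0, 2.0], 1), ([0.5, -1.0], -1), ([-1.5, 0.3], 1)]
w = [0.2, -0.4]

# Hessian of the summed (scalar) batch loss ...
batch_hessian = hessian(lambda v: sum(per_example_loss(v, x, y) for x, y in batch), w)

# ... versus the sum of the per-example Hessians.
summed = [[0.0, 0.0], [0.0, 0.0]]
for x, y in batch:
    H = hessian(lambda v, x=x, y=y: per_example_loss(v, x, y), w)
    for i in range(2):
        for j in range(2):
            summed[i][j] += H[i][j]
```

In a real model you would use automatic differentiation rather than finite differences; the finite-difference version is only used here to keep the check self-contained.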

Let's suppose you have a model and you want to train its weights using the Hessian of the loss on the training images with respect to the trainable weights.
# Import the libraries we need
import tensorflow as tf
from tensorflow.python.eager import forwardprop
model = tf.keras.models.load_model('model.h5')
# Define the Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)
# Define the loss function
def loss_function(y_true, y_pred):
    return tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
# Define the accuracy metric function
def accuracy_function(y_true, y_pred):
    return tf.keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
Now, define the variables for storing the mean of the loss and accuracy
train_loss = tf.keras.metrics.Mean(name='loss')
train_accuracy = tf.keras.metrics.Mean(name='accuracy')
# Compute a Hessian-vector product (forward-over-backward) instead of the full Hessian, which is much cheaper
vector = [tf.ones_like(v) for v in model.trainable_variables]
def _forward_over_back_hvp(images, labels):
    with forwardprop.ForwardAccumulator(model.trainable_variables, vector) as acc:
        with tf.GradientTape() as grad_tape:
            logits = model(images, training=True)
            loss = loss_function(labels, logits)
        grads = grad_tape.gradient(loss, model.trainable_variables)
    # Note: acc.jvp(grads) is the Hessian-vector product with `vector`, not the full Hessian
    hessian = acc.jvp(grads)
    optimizer.apply_gradients(zip(hessian, model.trainable_variables))
    train_loss(loss)  # keep adding the loss
    train_accuracy(accuracy_function(labels, logits))  # keep adding the accuracy
# Now call the function in a training loop
import time
for epoch in range(20):
    start = time.time()
    train_loss.reset_states()
    train_accuracy.reset_states()
    for i, (x, y) in enumerate(dataset):
        _forward_over_back_hvp(x, y)
        if i % 50 == 0:
            print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
    print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')
Epoch 1 Loss 2.6396 Accuracy 0.1250
Time taken for 1 epoch: 0.23 secs
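For readers unsure what a Hessian-vector product is: for a loss f with Hessian H, it is the vector H v, which can be obtained without ever forming H. The sketch below approximates it with a central difference of the gradient on a toy quadratic f(w) = 0.5 * w^T A w, whose Hessian is A. The matrix A, the point w and the helper names are assumptions made for illustration; this is not the TensorFlow forward-over-backward implementation.

```python
def grad(w, A):
    """Gradient of f(w) = 0.5 * w^T A w for symmetric A, i.e. A @ w."""
    return [sum(a_ij * w_j for a_ij, w_j in zip(row, w)) for row in A]

def hvp(w, v, A, eps=1e-5):
    """Central-difference Hessian-vector product:
    (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad(w_plus, A), grad(w_minus, A)
    return [(p - m) / (2 * eps) for p, m in zip(g_plus, g_minus)]

A = [[2.0, 1.0], [1.0, 3.0]]   # Hessian of the toy quadratic
w = [0.5, -1.0]
v = [1.0, 1.0]                 # all-ones direction, like `vector` in the answer

result = hvp(w, v, A)          # approximately A @ v = [3.0, 4.0]
```

With an all-ones v, each entry of the result is the sum of one row of the Hessian, which is a different quantity from a gradient.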

Why does this training loss fluctuates? (Logistic regression from scratch with binary cross entropy loss)

I am trying to implement logistic regression from scratch using binary cross entropy loss function. The loss function implemented below is created based on the following formula.
def binary_crossentropy(y, yhat):
    no_of_samples = len(y)
    numerator_1 = y*np.log(yhat)
    numerator_2 = (1-y) * np.log(1-yhat)
    loss = -(np.sum(numerator_1 + numerator_2) / no_of_samples)
    return loss
And below is how I implement the training using gradient descent.
L = 0.01
epochs = 40000
no_of_samples = len(x)
# Keeping track of the loss
loss = []
for _ in range(epochs):
    yhat = sigmoid(x*weight + bias)
    # Finding out the loss of each iteration
    loss.append(binary_crossentropy(y, yhat))
    d_weight = np.sum(x *(yhat-y)) / no_of_samples
    d_bias = np.sum(yhat-y) / no_of_samples
    weight = weight - L*d_weight
    bias = bias - L*d_bias
The training above goes fine, since the weight and bias are properly adjusted. My question is: why does the loss curve fluctuate so much?
I have previously implemented linear regression, and there the loss decreased steadily.
Is there anything incorrect in my logistic regression implementation? If it is already correct, why does the loss fluctuate that way?

You need to tune the hyperparameters to see whether the problem goes away. One thing you can do is change the optimizer. For instance, you can use fmin_tnc (from scipy.optimize) instead of plain gradient descent.
Besides, you can tune the number of epochs, the learning rate L, and, if you use sklearn's LogisticRegression, the solver ('newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga').
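One way to narrow down the cause is to run the same update rule on a tiny dataset where the behaviour is predictable. The sketch below is a plain-Python version of the question's loop on a made-up 1-D dataset; the data, the zero initial values and the eps clipping inside the loss are assumptions, not part of the question. Clipping guards against log(0) when a prediction saturates at 0 or 1, which is one possible (not confirmed) source of erratic loss values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y, yhat, eps=1e-12):
    """Mean BCE; predictions are clipped to [eps, 1 - eps] so log() never sees 0."""
    total = 0.0
    for yi, pi in zip(y, yhat):
        pi = min(max(pi, eps), 1.0 - eps)
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -total / len(y)

# Made-up 1-D dataset (not from the question); the classes overlap slightly.
x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]

weight, bias, L = 0.0, 0.0, 0.01
loss = []
for _ in range(2000):
    yhat = [sigmoid(xi * weight + bias) for xi in x]
    loss.append(binary_crossentropy(y, yhat))
    # Same gradients as in the question, written without NumPy.
    d_weight = sum(xi * (pi - yi) for xi, pi, yi in zip(x, yhat, y)) / len(x)
    d_bias = sum(pi - yi for pi, yi in zip(yhat, y)) / len(x)
    weight -= L * d_weight
    bias -= L * d_bias

# With full-batch updates and a small step size the loss never goes up.
is_monotone = all(b <= a + 1e-12 for a, b in zip(loss, loss[1:]))
```

If the loss is smooth here but not on your data, the step size, the feature scale or saturated predictions are more likely culprits than the gradient formulas.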

In pytorch, how to train a model with two or more outputs?

output_1, output_2 = model(x)
loss = cross_entropy_loss(output_1, target_1)
loss.backward()
optimizer.step()
loss = cross_entropy_loss(output_2, target_2)
loss.backward()
optimizer.step()
However, when I run this piece of code, I get this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 4]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
So what am I supposed to do to train a model with two or more outputs?

The entire premise on which PyTorch (and other DL frameworks) is founded is the backpropagation of the gradients of a scalar loss function.
In your case, you have a vector-valued (dim=2) loss function:
[cross_entropy_loss(output_1, target_1), cross_entropy_loss(output_2, target_2)]
You need to decide how to combine these two losses into a single scalar loss.
For instance:
weight = 0.5 # relative weight
loss = weight * cross_entropy_loss(output_1, target_1) + (1. - weight) * cross_entropy_loss(output_2, target_2)
# now loss is a scalar
loss.backward()
optimizer.step()
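To illustrate what the relative weight does, here is a framework-free sketch with a single shared parameter and two toy quadratic losses standing in for the two cross-entropy terms. The losses, the learning rate and the starting value are assumptions chosen so the result can be checked by hand.

```python
def loss_1_grad(w):
    """Gradient of loss_1(w) = (w - 1)^2, standing in for the first output's loss."""
    return 2.0 * (w - 1.0)

def loss_2_grad(w):
    """Gradient of loss_2(w) = (w + 3)^2, standing in for the second output's loss."""
    return 2.0 * (w + 3.0)

weight = 0.5          # relative weight, as in the answer
learning_rate = 0.1   # assumed value
w = 0.0               # single shared parameter

for _ in range(200):
    # Gradient of the scalar loss weight * loss_1 + (1 - weight) * loss_2.
    grad = weight * loss_1_grad(w) + (1.0 - weight) * loss_2_grad(w)
    w -= learning_rate * grad

# Minimizer of the combined loss: weight * 1 + (1 - weight) * (-3).
expected = weight * 1.0 + (1.0 - weight) * (-3.0)
```

The shared parameter settles at a compromise between the two individual minima, and changing weight moves that compromise towards one loss or the other.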

Why do we need to call zero_grad() in PyTorch?

Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.

In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., before computing the gradients that are used to update the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action is to accumulate (i.e., sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
import torch.optim as optim
def linear_model(x, W, b):
    return torch.matmul(x, W) + b
data, targets = ...
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
optimizer = optim.Adam([W, b])
for sample, target in zip(data, targets):
    # clear out the gradients of all tensors
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # backward() needs a scalar, so reduce the squared error with sum()
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
learning_rate = 0.01  # example value
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
for sample, target in zip(data, targets):
    # clear out the gradients of W and b
    # (.grad is None until the first backward() call)
    if W.grad is not None:
        W.grad.zero_()
        b.grad.zero_()
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()
    # update in place without recording the update in the autograd graph
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeros. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.
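The accumulation itself can be pictured with a toy model in plain Python. Param and backward below are hypothetical stand-ins written for this sketch; they only imitate the "add to .grad" behaviour and are not the PyTorch API.

```python
class Param:
    """Toy stand-in for a parameter with a .grad buffer (not the PyTorch API)."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

def backward(param, x, target):
    """Add d/dw of (w * x - target)^2 to param.grad, mimicking accumulation."""
    param.grad += 2.0 * (param.value * x - target) * x

w = Param(1.0)

backward(w, x=2.0, target=0.0)
first = w.grad            # gradient of one backward pass

backward(w, x=2.0, target=0.0)
accumulated = w.grad      # the second pass was added on top of the first

w.grad = 0.0              # the effect of zeroing the gradient
backward(w, x=2.0, target=0.0)
fresh = w.grad            # back to the gradient of a single pass
```

Without the reset, an update would use accumulated (twice the intended gradient here) and the error grows with every further pass.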

Although the idea can be derived from the chosen answer, I want to state it explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() gives you more freedom over how gradients are accumulated and applied by the optimizer in the training loop. This is crucial when the model or the input data is so big that one actual training batch does not fit on the GPU.
In this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for each forward pass and the loss.backward() that follows it. This is limited by GPU memory.
gradient_accumulation_steps is the number of such forward/backward passes whose gradients are accumulated before one optimizer step, so the effective training batch size is train_batch_size * gradient_accumulation_steps. This is NOT limited by GPU memory.
From this example, you can see that optimizer.zero_grad() is paired with optimizer.step() rather than with loss.backward(). loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated train batches equals gradient_accumulation_steps (line 227, inside the if block that starts at line 219).
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
Someone also asked about an equivalent method in TensorFlow. I guess tf.GradientTape serves the same purpose.
(I am still new to AI library, please correct me if anything I said is wrong)
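A small plain-Python check of why accumulation works: averaging the gradients of several equally sized micro-batches gives the same gradient as one pass over the combined batch. The data, the squared-error loss and the names micro_batch_size and accumulation_steps are assumptions for this sketch and are not taken from the linked script.

```python
def mean_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data: one "real" batch of 8 examples.
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, -2.0, 1.0]
ys = [1.0, 0.0, 3.5, 2.0, -1.0, 5.0, -3.0, 1.5]
w = 0.3

micro_batch_size = 2      # plays the role of train_batch_size
accumulation_steps = 4    # plays the role of gradient_accumulation_steps

accumulated = 0.0         # gradient buffer, zeroed once per optimizer step
for k in range(accumulation_steps):
    lo = k * micro_batch_size
    hi = lo + micro_batch_size
    # Dividing by accumulation_steps keeps the result an average, not a sum.
    accumulated += mean_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

full_batch = mean_grad(w, xs, ys)   # gradient of one pass over all 8 examples
```

Only one micro-batch has to be in memory at a time, which is why the effective batch size is not bounded by GPU memory.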

zero_grad() clears the gradients left over from the previous step, so each iteration of the loop starts fresh when you use a gradient method to decrease the error (or loss).
If you do not use zero_grad(), the loss may increase instead of decreasing as required.
For example:
If you use zero_grad(), you might see output like the following:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad(), you might see output like the following:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5

You don't have to call zero_grad(). Alternatively, you can decay the gradients, for example:
optimizer = some_pytorch_optimizer
# decay the grads:
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            p.grad = p.grad / 2
This way the learning is much more continuous.

During forward propagation the current weights are applied to the inputs; after the first iteration those weights reflect what the model has learnt from the samples it has seen. When we run backpropagation we want to update the weights so as to minimize the loss. So we clear the previous gradients (not the weights) so that each update is based only on the current batch. We keep doing this during training, and we do not do it at test time because no gradients are computed there; we simply use the weights that were fitted during training. Hope this clears things up!

In simple terms, we need zero_grad() because when we start a training loop we do not want past gradients to interfere with the current update. PyTorch accumulates gradients on every backward pass, so stale gradients would mix with the new ones and give wrong results. That is why we set the gradients to zero every time we go through the loop.
Here is an example:
# let us write a training loop
torch.manual_seed(42)
epochs = 200
for epoch in range(epochs):
    model_1.train()
    y_pred = model_1(X_train)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In this for loop, if we do not zero the gradients every time, the past values are added to the new ones and change the result.
So we use zero_grad() to avoid wrongly accumulated results.

How to apply gradient clipping in TensorFlow?

Consider the example code.
I would like to know how to apply gradient clipping to the RNN in this network, where exploding gradients are possible.
tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)
This is an example of what could be used, but where do I introduce it?
In the definition of the RNN:
lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
# Split data because rnn cell needs a list of inputs for the RNN inner loop
_X = tf.split(0, n_steps, _X) # n_steps
tf.clip_by_value(_X, -1, 1, name=None)
But this doesn't make sense, as the tensor _X is the input and not the gradient that is to be clipped.
Do I have to define my own Optimizer for this, or is there a simpler option?

Gradient clipping needs to happen after computing the gradients, but before applying them to update the model's parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.
In order to clip your gradients you'll need to explicitly compute, clip, and apply them as described in this section in TensorFlow's API documentation. Specifically you'll need to substitute the call to the minimize() method with something like the following:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)
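The compute, clip, apply order can be followed by hand on a toy problem. The sketch below uses plain Python lists instead of tensors; the quadratic loss, the learning rate and the helper names are assumptions for illustration, and clip_by_value here only mirrors the idea of tf.clip_by_value.

```python
def loss(w):
    """Toy loss with one very steep direction: 50 * w[0]^2 + 0.5 * w[1]^2."""
    return 50.0 * w[0] ** 2 + 0.5 * w[1] ** 2

def compute_gradients(w):
    """Analytic gradient of the toy loss."""
    return [100.0 * w[0], 1.0 * w[1]]

def clip_by_value(values, lo, hi):
    """Element-wise clipping, the same idea as tf.clip_by_value."""
    return [min(max(v, lo), hi) for v in values]

w = [1.0, 1.0]
learning_rate = 0.1   # assumed value, deliberately too large for the steep direction

grads = compute_gradients(w)                  # 1. compute: [100.0, 1.0]
capped = clip_by_value(grads, -1.0, 1.0)      # 2. clip:    [1.0, 1.0]
w_clipped = [wi - learning_rate * gi for wi, gi in zip(w, capped)]   # 3. apply
w_unclipped = [wi - learning_rate * gi for wi, gi in zip(w, grads)]  # no clipping
```

In this toy case the clipped step lowers the loss while the unclipped step overshoots and raises it. Note that element-wise clipping also changes the direction of the update, not just its length.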

Despite what seems to be popular, you probably want to clip the whole gradient by its global norm:
optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))
Clipping each gradient matrix individually changes their relative scale but is also possible:
optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))
In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don't need to store the update op because it runs automatically without passing it to a session:
optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
    loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))
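The difference between the two strategies shows up when the gradients of different variables have very different magnitudes. The plain-Python sketch below uses lists in place of tensors; the two gradient vectors are made up, and the helpers only follow the formulas behind tf.clip_by_norm and tf.clip_by_global_norm rather than calling TensorFlow.

```python
import math

def norm(t):
    """L2 norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in t))

def clip_by_norm(t, clip_norm):
    """Rescale one tensor so that its own L2 norm is at most clip_norm."""
    n = norm(t)
    return list(t) if n <= clip_norm else [v * clip_norm / n for v in t]

def clip_by_global_norm(tensors, clip_norm):
    """Rescale all tensors by one shared factor based on their joint L2 norm."""
    global_norm = math.sqrt(sum(norm(t) ** 2 for t in tensors))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[v * scale for v in t] for t in tensors], global_norm

# Made-up gradients for two variables with norms 50 and 0.5.
gradients = [[30.0, 40.0], [0.3, 0.4]]

per_tensor = [clip_by_norm(g, 5.0) for g in gradients]
global_clipped, global_norm = clip_by_global_norm(gradients, 5.0)

ratio_before = norm(gradients[0]) / norm(gradients[1])            # about 100
ratio_per_tensor = norm(per_tensor[0]) / norm(per_tensor[1])      # about 10
ratio_global = norm(global_clipped[0]) / norm(global_clipped[1])  # about 100
```

Global-norm clipping shrinks every gradient by the same factor, so the overall update keeps its direction; per-tensor clipping only shrinks the large one and therefore changes the relative scale.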

It's easy for tf.keras!
optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)
This optimizer will clip every gradient element to the range [-1.0, 1.0].
See the docs.

This is actually properly explained in the documentation:
Calling minimize() takes care of both computing the gradients and
applying them to the variables. If you want to process the gradients
before applying them you can instead use the optimizer in three steps:
Compute the gradients with compute_gradients().
Process the gradients as you wish.
Apply the processed gradients with apply_gradients().
And in the example they provide they use these 3 steps:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.

For those who would like to understand the idea of gradient clipping (by norm):
Whenever the gradient norm is greater than a particular threshold, we clip the gradient norm so that it stays within the threshold. This threshold is sometimes set to 5.
Let the gradient be g and the max_norm_threshold be j.
Now, if ||g|| > j , we do:
g = ( j * g ) / ||g||
This is the implementation done in tf.clip_by_norm
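As a worked instance of the rule above, here is a minimal sketch in plain Python, with g as a list of numbers and j as the threshold. The example vectors are made up, and the function follows the formula in this answer rather than calling TensorFlow.

```python
import math

def clip_by_norm(g, j):
    """Return g unchanged if ||g|| <= j, otherwise j * g / ||g||."""
    g_norm = math.sqrt(sum(v * v for v in g))
    if g_norm <= j:
        return list(g)
    return [j * v / g_norm for v in g]

large = clip_by_norm([3.0, 4.0], 1.0)   # ||g|| = 5 > 1, so g is rescaled to norm 1
small = clip_by_norm([0.3, 0.4], 1.0)   # ||g|| = 0.5 <= 1, so g is left unchanged
```

The rescaled vector points in the same direction as the original; only its length is capped at j.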

IMO the best solution is wrapping your optimizer with TF's estimator decorator tf.contrib.estimator.clip_gradients_by_norm (TensorFlow 1.x only, since tf.contrib was removed in TensorFlow 2):
original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)
This way you only have to define this once, and not run it after every gradients calculation.
Documentation:
https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm

Gradient clipping basically helps in the case of exploding gradients. Say your loss is very high, which results in very large gradients flowing through the network and may lead to NaN values. To overcome this we clip gradients to a specific range (-1 to 1, or any range suited to the problem).
clipped_value = [(tf.clip_by_value(grad, -limit, +limit), var) for grad, var in grads_and_vars]
where limit is the clipping bound and grads_and_vars are the pairs of gradients (which you calculate via optimizer.compute_gradients) and the variables they will be applied to.
After clipping we simply apply the clipped values using the optimizer.
optimizer.apply_gradients(clipped_value)

Method 1
If you are training your model with a custom training loop, then one update step will look like this:
'''
for loop over full dataset
X -> training samples
y -> labels
'''
optimizer = tf.keras.optimizers.Adam()
for x, y in train_Data:
    with tf.GradientTape() as tape:
        prob = model(x, training=True)
        # calculate loss
        train_loss_value = loss_fn(y, prob)
    # get gradients
    gradients = tape.gradient(train_loss_value, model.trainable_weights)
    # option 1: clip each gradient by its norm
    gradients = [(tf.clip_by_norm(grad, clip_norm=1.0)) for grad in gradients]
    # option 2: clip each gradient element by value (normally you pick one of the two options)
    gradients = [(tf.clip_by_value(grad, clip_value_min=-1.0, clip_value_max=1.0)) for grad in gradients]
    # apply gradients
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
Method 2
Or you could simply replace the optimizer definition in the code above with one of the following, and drop the manual clipping lines:
# for clipping by norm
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# for clipping by value
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)
The second method also works if you are using the model.compile -> model.fit pipeline.
