What is the right calculation of epoch loss in training? - python

I am reading the official PyTorch fine-tuning tutorial and I am stuck on one problem: the calculation of the loss in each epoch.
Until now, I computed the loss for each batch of data, accumulated those batch losses, and took their mean as the epoch loss. But in that example, the calculation is as follows:
for inputs, labels in dataloaders[phase]:
    inputs = inputs.to(device)
    labels = labels.to(device)

    # zero the parameter gradients
    optimizer.zero_grad()

    # forward
    # track history if only in train
    with torch.set_grad_enabled(phase == 'train'):
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        loss = criterion(outputs, labels)

        # backward + optimize only if in training phase
        if phase == 'train':
            loss.backward()
            optimizer.step()

    # statistics
    running_loss += loss.item() * inputs.size(0)
    running_corrects += torch.sum(preds == labels.data)
My question is about the line running_loss += loss.item() * inputs.size(0): it multiplies the batch loss by the batch size. What is the correct way to calculate the loss of an epoch?
And what is the unit of the loss? What is the range of the loss value?

Yes, the snippet multiplies the batch mean loss by the batch size. If you want the true sum, you can use
torch.nn.CrossEntropyLoss(reduction="sum")
which gives you the sum of the per-sample errors in the batch. Then you can accumulate it directly for each batch as follows:
running_loss += loss.item()
The range of the loss value depends on your number of classes and your feature vector. The code in your question gives the same running_loss as using reduction="sum", because it effectively computes
(loss_sum / batch_size) * batch_size
which is the same value. However, the backpropagation differs: in one case you backprop through the sum of the losses, in the other through the mean loss.
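To turn either accumulator into an epoch loss, divide by the number of samples seen during the epoch rather than by the number of batches, so a smaller final batch is weighted correctly. A minimal sketch, assuming the model, criterion, dataloaders and device names from the question:
running_loss = 0.0
num_samples = 0
for inputs, labels in dataloaders[phase]:
    inputs, labels = inputs.to(device), labels.to(device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)                 # mean loss over the batch
    running_loss += loss.item() * inputs.size(0)      # undo the mean -> batch sum
    num_samples += inputs.size(0)
epoch_loss = running_loss / num_samples               # average loss per sample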

Related

Tensorflow calculate hessian of model weights in a batch

I am replicating a paper. I have a basic Keras CNN model for MNIST classification. For a sample z in the training set, I want to calculate the Hessian of that sample's loss with respect to the model parameters, and then average this Hessian over the training data (n is the number of training samples).
My final goal is to calculate the influence score defined in the paper.
I can calculate the left and right terms; what is missing is the Hessian term. I don't know how to compute the Hessian with respect to the model weights for a batch of examples (i.e. vectorized). I can only calculate it one sample at a time, which is too slow.
x = tf.convert_to_tensor(x_train[0:13])
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = model(x)
        mce = tf.keras.losses.CategoricalCrossentropy()
        y_expanded = y_train[train_idx]
        loss = mce(y_expanded, y)
    g = t1.gradient(loss, model.weights[4])
h = t2.jacobian(g, model.weights[4])
print(h.shape)
For clarification: if a model layer has dimension 20*30, I want to feed a batch of 13 samples to it and get a Hessian of dimension (13,20,30,20,30). Currently I can only get a Hessian of dimension (20,30,20,30), which defeats the vectorization (the code above).
This thread has the same problem, except that I want the second-order derivative rather than the first-order.
I also tried the script below, which returns a (13,20,30,20,30) tensor with the right shape. But when I compared its sum against the sum of 13 single-sample Hessian calculations in a for loop from 0 to 12, the numbers differ, so it does not work either (I expected equal values).
x = tf.convert_to_tensor(x_train[0:13])
mce = tf.keras.losses.CategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        t1.watch(model.weights[4])
        y_expanded = y_train[0:13]
        y = model(x)
        loss = mce(y_expanded, y)
    j1 = t1.jacobian(loss, model.weights[4])
j3 = t2.jacobian(j1, model.weights[4])
print(j3.shape)
That's how Hessians are defined: you can only compute the Hessian of a scalar function.
But nothing new here; the same applies to gradients, and what is usually done to handle batches is to accumulate the gradients. Something similar can be done with the Hessian.
If you know how to compute the Hessian of a single-sample loss, you can define a batch cost and compute its Hessian with the same method, e.g. define your cost as sum(losses), where losses is the vector of per-example losses in the batch.
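A minimal sketch of that suggestion, reusing the model, x_train, y_train and model.weights[4] names from the snippets in the question (so they are assumptions here): sum the per-sample losses into a scalar batch cost and compute its Hessian with nested tapes.
import tensorflow as tf

x = tf.convert_to_tensor(x_train[0:13])
y_true = tf.convert_to_tensor(y_train[0:13])
mce = tf.keras.losses.CategoricalCrossentropy(reduction=tf.keras.losses.Reduction.SUM)

with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y_pred = model(x)
        cost = mce(y_true, y_pred)            # scalar batch cost = sum of losses
    g = t1.gradient(cost, model.weights[4])
h = t2.jacobian(g, model.weights[4])          # (20, 30, 20, 30) for a 20x30 layer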
Let's suppose you have a model and you want to train its weights using the Hessian of the loss on the training images w.r.t. the trainable weights:
# Import the libraries we need
import tensorflow as tf
from tensorflow.python.eager import forwardprop

model = tf.keras.models.load_model('model.h5')

# Define the Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)

# Define the loss function
def loss_function(y_true, y_pred):
    return tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)

# Define the accuracy metric function
def accuracy_function(y_true, y_pred):
    return tf.keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
Now, define the variables for storing the mean of the loss and accuracy
train_loss = tf.keras.metrics.Mean(name='loss')
train_accuracy = tf.keras.metrics.Mean(name='accuracy')

# Now compute the Hessian in some different style for better efficiency of the model
vector = [tf.ones_like(v) for v in model.trainable_variables]

def _forward_over_back_hvp(images, labels):
    with forwardprop.ForwardAccumulator(model.trainable_variables, vector) as acc:
        with tf.GradientTape() as grad_tape:
            logits = model(images, training=True)
            loss = loss_function(labels, logits)
        grads = grad_tape.gradient(loss, model.trainable_variables)
        hessian = acc.jvp(grads)
    optimizer.apply_gradients(zip(hessian, model.trainable_variables))
    train_loss(loss)                                   # keep adding the loss
    train_accuracy(accuracy_function(labels, logits))  # keep adding the accuracy
# Now, here we need to call the function and train it
import time

for epoch in range(20):
    start = time.time()
    train_loss.reset_states()
    train_accuracy.reset_states()
    for i, (x, y) in enumerate(dataset):
        _forward_over_back_hvp(x, y)
        if i % 50 == 0:
            print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
    print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')
Epoch 1 Loss 2.6396 Accuracy 0.1250
Time taken for 1 epoch: 0.23 secs

PyTorch learning scheduler order changes loss in a drastic way

I'm a beginner at PyTorch and am trying to train an MNIST model based on a custom neural network class. My learning rate scheduler, loss function and optimizer are:
optimizer = optim.Adam(model.parameters(), lr=0.003)
loss_fn = nn.CrossEntropyLoss()
exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
I'm also using a learning rate scheduler during training. Initially, I had my training loop like this:
# this training gives a high loss and it doesn't vary much
def training(epochs):
    model.train()
    for batch_idx, (imgs, labels) in enumerate(train_loader):
        imgs = imgs.to(device=device)
        labels = labels.to(device=device)
        optimizer.zero_grad()
        outputs = model(imgs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        exp_lr_scheduler.step()  # inside the loop and after the optimizer
        if (batch_idx + 1) % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, (batch_idx + 1) * len(imgs), len(train_loader.dataset),
                100. * (batch_idx + 1) / len(train_loader), loss.data))
But this training was not efficient and my loss was almost the same in every epoch.
Then, I changed my training function to this in the end:
# this training works perfectly
def training(epochs):
    model.train()
    exp_lr_scheduler.step()  # out of the loop but before the optimizer step
    for batch_idx, (imgs, labels) in enumerate(train_loader):
        imgs = imgs.to(device=device)
        labels = labels.to(device=device)
        optimizer.zero_grad()
        outputs = model(imgs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        if (batch_idx + 1) % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, (batch_idx + 1) * len(imgs), len(train_loader.dataset),
                100. * (batch_idx + 1) / len(train_loader), loss.data))
And now it works correctly. I just don't understand the reason for this.
I have two queries:
Shouldn't exp_lr_scheduler.step() be inside the for loop so that it also gets updated with every epoch? And
the latest PyTorch version says to call exp_lr_scheduler.step() after optimizer.step(), but doing this in my training function gives me a worse loss.
What can be the reason, or am I doing it wrong?
StepLR multiplies the learning rate by gamma after every step_size calls to scheduler.step(); with step_size=7 and gamma=0.1 the learning rate becomes 10 times smaller every 7 steps. Because your first snippet calls exp_lr_scheduler.step() inside the batch loop, that decay happens every 7 batches rather than every 7 epochs, so the learning rate shrinks to almost zero very quickly.
Have you tried increasing the starting learning rate? I would try 0.1 or 0.01. The problem could be the size of the starting learning rate, since the starting point is already quite small. Then the gradient descent algorithm (or its derivatives, such as Adam) cannot move towards the minimum because the step is too small, and your results stay the same (stuck at the same point of the function being minimized).
Hope it helps.
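For reference, a minimal sketch of the conventional ordering, using the optimizer, scheduler and loader names from the question: step the optimizer every batch and step the scheduler once per epoch, after optimizer.step() has been called.
for epoch in range(epochs):
    model.train()
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(imgs), labels)
        loss.backward()
        optimizer.step()         # once per batch
    exp_lr_scheduler.step()      # once per epoch, so the LR drops every 7 epochs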

Is this the correct way of computing the average accuracy?

I am fairly new to coding and am getting confused between average accuracy and overall accuracy. I have created a function to calculate accuracy; I then divide the accumulated result by len(dataloader) at the end of each epoch. Is this the correct way to calculate average accuracy? If not, could someone explain how to do this correctly?
def accuracy(predictions, labels):
    classes = torch.argmax(predictions, dim=1)
    return torch.mean((classes == labels).float())

def train(model, optimizer, dataloader):
    # Setting model to train mode
    model.train()
    acc = 0.0
    loss = 0.0
    loss_fc = nn.CrossEntropyLoss()
    for i, (img, label) in enumerate(dataloader):
        # source images and labels to cpu device
        img, label = img.to(device), label.to(device)
        y_pred = model(img)
        optimizer.zero_grad()
        loss = loss_fc(y_pred, label)
        loss.backward()
        optimizer.step()
        # Update loss and accuracy
        loss += loss.item()
        acc += accuracy(y_pred, label)
    loss /= len(dataloader)
    acc /= len(dataloader)
Not sure what you mean by the overall and average accuracy. Typically accuracy is calculated at the end of each epoch. You pass the accuracy function your predictions and your actual labels and it returns what proportion you got right as a decimal (0-1).
I haven't seen much use for calculating the average accuracy across every epoch during training, as this metric would be heavily influenced by how fast your model learns rather than by how well it eventually performs; e.g. a model that needs many epochs to do well will probably appear worse on this average accuracy than one that converges in fewer epochs.
If you take a look at the accuracy score metric from scikit-learn it should help clear things up for you.
Link:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
Hope this helps!
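If what you want is the overall accuracy over the whole epoch, a common alternative (a sketch reusing the model, dataloader and device names from the question) is to count correct predictions and divide by the number of samples, which also weights a smaller final batch correctly:
correct = 0
total = 0
for img, label in dataloader:
    img, label = img.to(device), label.to(device)
    y_pred = model(img)
    correct += (torch.argmax(y_pred, dim=1) == label).sum().item()
    total += label.size(0)
epoch_accuracy = correct / total  # fraction of samples classified correctly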

In pytorch, how to train a model with two or more outputs?

output_1, output_2 = model(x)
loss = cross_entropy_loss(output_1, target_1)
loss.backward()
optimizer.step()
loss = cross_entropy_loss(output_2, target_2)
loss.backward()
optimizer.step()
However, when I run this piece of code, I get this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 4]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
So I really want to know what I am supposed to do to train a model with two or more outputs.
The entire premise on which PyTorch (and other DL frameworks) is founded is the backpropagation of the gradients of a scalar loss function.
In your case, you have a vector-valued (dim=2) loss:
[cross_entropy_loss(output_1, target_1), cross_entropy_loss(output_2, target_2)]
You need to decide how to combine these two losses into a single scalar loss.
For instance:
weight = 0.5 # relative weight
loss = weight * cross_entropy_loss(output_1, target_1) + (1. - weight) * cross_entropy_loss(output_2, target_2)
# now loss is a scalar
loss.backward()
optimizer.step()
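An equally weighted variant (a sketch with the same names as in the question) is to simply sum the two losses and call backward()/step() once. Calling optimizer.step() between the two separate backward passes is what triggers the in-place-modification error, because the first step updates parameters that the second backward pass still needs.
output_1, output_2 = model(x)
loss = cross_entropy_loss(output_1, target_1) + cross_entropy_loss(output_2, target_2)
loss.backward()   # gradients from both heads accumulate in a single pass
optimizer.step()  # one parameter update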

Accumulating Gradients

I want to accumulate the gradients before I do a backward pass, so I am wondering what the right way of doing it is. According to this article, it's:
model.zero_grad()                                  # Reset gradient tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                    # Forward pass
    loss = loss_function(predictions, labels)      # Compute loss function
    loss = loss / accumulation_steps               # Normalize our loss (if averaged)
    loss.backward()                                # Backward pass
    if (i + 1) % accumulation_steps == 0:          # Wait for several backward steps
        optimizer.step()                           # Now we can do an optimizer step
        model.zero_grad()
whereas I expected it to be:
model.zero_grad()                                  # Reset gradient tensors
loss = 0
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                    # Forward pass
    loss += loss_function(predictions, labels)     # Compute loss function
    if (i + 1) % accumulation_steps == 0:          # Wait for several backward steps
        loss = loss / accumulation_steps           # Normalize our loss (if averaged)
        loss.backward()                            # Backward pass
        optimizer.step()                           # Now we can do an optimizer step
        model.zero_grad()
        loss = 0
where I accumulate the loss and then divide by the accumulation steps to average it.
A secondary question: if I am right, would you expect my method to be quicker, considering I only do the backward pass once every accumulation_steps iterations?
So according to the answer here, the first method is more memory efficient; the amount of work required is more or less the same in both methods.
The second method keeps accumulating the computation graph, and so requires accumulation_steps times more memory. The first method calculates the gradients straight away (and simply adds them up), so it requires less memory.
The backward pass loss.backward() is the operation that actually computes the gradients.
If you only do the forward pass (predictions = model(inputs)), no gradients are computed, and thus no accumulation is possible.
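As a small illustration of why the first method needs no stored graph between steps (a self-contained sketch with a toy model, not the code from the article): repeated backward() calls without zero_grad() accumulate into each parameter's .grad.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
loss_fn = nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 2)
accumulation_steps = 2

model.zero_grad()
for chunk_x, chunk_y in zip(x.split(4), y.split(4)):
    loss = loss_fn(model(chunk_x), chunk_y) / accumulation_steps  # normalize
    loss.backward()   # adds this chunk's gradients into .grad; the graph is freed afterwards
# model.weight.grad now holds the averaged gradient over both chunks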
