I'm a beginner with PyTorch and am trying to train an MNIST model based on a custom neural network class. My optimizer, loss function, and learning rate scheduler are:
optimizer = optim.Adam(model.parameters(), lr=0.003)
loss_fn = nn.CrossEntropyLoss()
exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
I'm using the learning rate scheduler to decay the learning rate during training. Initially, I had my training loop like this:
# this training gives a high loss that barely changes
def training(epoch):
    model.train()
    for batch_idx, (imgs, labels) in enumerate(train_loader):
        imgs = imgs.to(device=device)
        labels = labels.to(device=device)
        optimizer.zero_grad()
        outputs = model(imgs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        exp_lr_scheduler.step()  # inside the loop and after the optimizer
        if (batch_idx + 1) % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, (batch_idx + 1) * len(imgs), len(train_loader.dataset),
                100. * (batch_idx + 1) / len(train_loader), loss.item()))
But this training was not effective: the loss was almost the same in every epoch. I then changed my training function to this:
# this training works perfectly
def training(epoch):
    model.train()
    exp_lr_scheduler.step()  # outside the loop, before any optimizer step
    for batch_idx, (imgs, labels) in enumerate(train_loader):
        imgs = imgs.to(device=device)
        labels = labels.to(device=device)
        optimizer.zero_grad()
        outputs = model(imgs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        if (batch_idx + 1) % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, (batch_idx + 1) * len(imgs), len(train_loader.dataset),
                100. * (batch_idx + 1) / len(train_loader), loss.item()))
And now it's working correctly, but I just don't get the reason why.
I have two queries:
Shouldn't exp_lr_scheduler.step() be in the for loop so that it also gets updated with every epoch? And:
The latest PyTorch versions say to call exp_lr_scheduler.step() after optimizer.step(), but doing this in my training function gives me a worse loss.
What could be the reason, or am I doing it wrong?
StepLR multiplies the learning rate by gamma every step_size calls to step(). If you call step() once per epoch and step_size is 7, the learning rate is updated every 7 epochs; with gamma=0.1, it gets 10 times smaller every 7 epochs. In your first snippet, however, step() is called once per batch, so the learning rate shrinks by a factor of 10 every 7 batches and almost immediately becomes too small for the model to learn anything.
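For reference, here is a minimal sketch, using the names from your snippet, of the placement recommended by recent PyTorch versions: optimizer.step() per batch, scheduler.step() once per epoch (num_epochs is a placeholder):
# scheduler stepped once per epoch, after that epoch's optimizer steps
for epoch in range(num_epochs):
    model.train()
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(imgs), labels)
        loss.backward()
        optimizer.step()
    exp_lr_scheduler.step()  # after all optimizer.step() calls of the epoch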
Have you also tried increasing the starting learning rate? I would try 0.1 or 0.01. The problem could be the size of the starting learning rate, since the starting point is already quite small. With such a small step, gradient descent (and its derivatives, such as Adam) can barely move towards the minimum, so your results stay roughly the same (you remain near the same point of the function being minimized).
Hope it helps.
I am fairly new to coding and am getting confused between average accuracy and overall accuracy. I have created a function to calculate accuracy; I then divide the accumulated result by len(dataloader) at the end of each epoch. Is this the correct way to calculate average accuracy? If not, could someone explain how to do this correctly?
def accuracy(predictions, labels):
    classes = torch.argmax(predictions, dim=1)
    return torch.mean((classes == labels).float())

def train(model, optimizer, dataloader):
    # Setting model to train mode
    model.train()
    acc = 0.0
    running_loss = 0.0
    loss_fc = nn.CrossEntropyLoss()
    for i, (img, label) in enumerate(dataloader):
        # Move images and labels to the device
        img, label = img.to(device), label.to(device)
        y_pred = model(img)
        optimizer.zero_grad()
        loss = loss_fc(y_pred, label)
        loss.backward()
        optimizer.step()
        # Update running loss and accuracy
        running_loss += loss.item()
        acc += accuracy(y_pred, label)
    running_loss /= len(dataloader)
    acc /= len(dataloader)
Not sure what you mean by the overall and average accuracy. Typically accuracy is calculated at the end of each epoch. You pass the accuracy function your predictions and your actual labels and it returns what proportion you got right as a decimal (0-1).
I haven't seen any use for calculating the average accuracy across every epoch during training, as this metric would be heavily impacted by how fast your model learns rather than how well it is eventually able to perform. For example, a model that needs a lot of epochs to do well will probably appear worse on this average accuracy than one that converges in fewer epochs.
If you take a look at the accuracy score metric from scikit-learn it should help clear things up for you.
Link:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
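As a point of comparison, here is a minimal sketch, reusing the model, dataloader and device names from the question, that computes epoch accuracy by counting correct predictions over all samples; this also handles a final batch of a different size:
correct = 0
total = 0
model.eval()
with torch.no_grad():
    for img, label in dataloader:
        img, label = img.to(device), label.to(device)
        preds = torch.argmax(model(img), dim=1)
        correct += (preds == label).sum().item()
        total += label.size(0)
epoch_accuracy = correct / total  # fraction of all samples classified correctly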
Hope this helps!
I am reading the official PyTorch tutorial on fine-tuning and I am stuck on one point: the calculation of the loss in each epoch.
Until now, I computed the loss for each batch of data, accumulated these batch losses, and took their mean as the epoch loss. But in that example, the calculation is as follows:
for inputs, labels in dataloaders[phase]:
    inputs = inputs.to(device)
    labels = labels.to(device)
    # zero the parameter gradients
    optimizer.zero_grad()
    # forward
    # track history if only in train
    with torch.set_grad_enabled(phase == 'train'):
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        loss = criterion(outputs, labels)
        # backward + optimize only if in training phase
        if phase == 'train':
            loss.backward()
            optimizer.step()
    # statistics
    running_loss += loss.item() * inputs.size(0)
    running_corrects += torch.sum(preds == labels.data)
My question is about this line: running_loss += loss.item() * inputs.size(0). It multiplies the batch's loss value by the batch size. What is the correct way to calculate the loss of an epoch?
Also, what is the unit of the loss, and what range of values can it take?
Yes, the code snippet adds the batch mean loss multiplied by the batch size. If you want the true sum instead, you can use
torch.nn.CrossEntropyLoss(reduction="sum")
which gives you the sum of the errors over the batch. Then you can simply accumulate it for each batch as follows:
running_loss += loss.item()
The range of the loss value depends on your number of classes and on your feature vector. The code in your question gives the same running_loss as reduction="sum" would, because with the default mean reduction it essentially computes
(sum_of_batch_losses / batch_size) * batch_size
which equals the summed loss. Backpropagation, however, differs: in one case you backpropagate the sum of the losses, in the other the mean loss.
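For the epoch loss itself, a minimal sketch, using the names from the tutorial snippet with the default mean reduction: accumulate the per-batch sums and divide by the dataset size at the end.
running_loss = 0.0
for inputs, labels in dataloaders[phase]:
    inputs, labels = inputs.to(device), labels.to(device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)              # mean loss over the batch
    running_loss += loss.item() * inputs.size(0)   # undo the mean: batch sum
epoch_loss = running_loss / len(dataloaders[phase].dataset)  # mean loss per sample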
I think the Adam optimizer is designed so that it automatically adjusts the learning rate.
But there is an option to explicitly specify decay in the Adam parameter options in Keras.
I want to clarify the effect of decay on the Adam optimizer in Keras.
If we compile the model with decay=0.01 and lr=0.001, and then fit for 50 epochs, does the learning rate get reduced by a factor of 0.01 after each epoch?
Is there any way to specify that the learning rate should only decay after a certain number of epochs?
In PyTorch there is a different implementation called AdamW, which is not present in the standard Keras library.
Is this the same as varying the decay after every epoch, as mentioned above?
Thanks in advance for the reply.
From the source code, decay adjusts lr per iteration according to
lr = lr * (1. / (1. + decay * iterations))  # simplified
This is epoch-independent: iterations is incremented by 1 on each batch fit (e.g. each time train_on_batch is called, or once per batch in x for model.fit(x), usually len(x) // batch_size batches per epoch).
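For illustration only, a tiny sketch of that formula (the decayed_lr helper is a made-up name, not part of Keras):
def decayed_lr(base_lr, decay, iterations):
    # effective learning rate after `iterations` batch updates
    return base_lr * (1.0 / (1.0 + decay * iterations))

# e.g. lr=0.001, decay=0.01, after one epoch of 50 000 samples at batch_size=32
print(decayed_lr(0.001, 0.01, 50000 // 32))  # roughly 6e-05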
To implement what you've described, you can use a callback as below:
from keras.callbacks import LearningRateScheduler

def decay_schedule(epoch, lr):
    # decay by 0.1 every 5 epochs; use `% 1` to decay after each epoch
    if (epoch % 5 == 0) and (epoch != 0):
        lr = lr * 0.1
    return lr

lr_scheduler = LearningRateScheduler(decay_schedule)
model.fit(x, y, epochs=50, callbacks=[lr_scheduler])
The LearningRateScheduler takes a function as an argument; at the beginning of each epoch, .fit feeds that function the epoch index and the current lr, then updates lr according to what the function returns. So on the next epoch, the function is fed the updated lr.
Also, there is a Keras implementation of AdamW, NadamW, and SGDW, by me - Keras AdamW.
Clarification: the very first call to .fit() invokes on_epoch_begin with epoch = 0, so if we don't want lr to be decayed immediately, we should add an epoch != 0 check in decay_schedule. Then epoch denotes how many epochs have already passed, so the decay is first applied when epoch = 5.
Internally, the learning rate decay is applied after each batch, not after each epoch as is commonly believed.
You can read more about it here: https://www.pyimagesearch.com/2019/07/22/keras-learning-rate-schedules-and-decay/
However, you can also implement your own learning rate schedule via a custom callback function:
def learning_rate_scheduler(epoch, lr):
    # Divide the lr by 5 after every 10 epochs
    # (epoch + 1) since epochs are counted from 0
    if (epoch + 1) % 10 == 0:
        lr = lr / 5
    return lr

callbacks = [
    tensorflow.keras.callbacks.LearningRateScheduler(learning_rate_scheduler, verbose=1)
]

model.fit(..., callbacks=callbacks, ...)
The above method works for all types of optimizers, not only Adam.
Hi, I am currently learning the use of schedulers in deep learning with PyTorch. I came across the following code:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
# Set seed
torch.manual_seed(0)
# Where to add a new import
from torch.optim.lr_scheduler import ReduceLROnPlateau
'''
STEP 1: LOADING DATASET
'''
train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())
'''
STEP 2: MAKING DATASET ITERABLE
'''
batch_size = 100
n_iters = 6000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)
'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        # Linear function (readout)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        # Linear function (readout)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10
model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)
'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()
'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)
'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# lr = lr * factor
# mode='max': look for the maximum validation accuracy to track
# patience: number of epochs - 1 where loss plateaus before decreasing LR
# patience = 0, after 1 bad epoch, reduce LR
# factor = decaying factor
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=0, verbose=True)
'''
STEP 7: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Load images as Variable
        images = images.view(-1, 28*28).requires_grad_()
        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        # Forward pass to get output/logits
        outputs = model(images)
        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)
        # Getting gradients w.r.t. parameters
        loss.backward()
        # Updating parameters
        optimizer.step()
        iter += 1
        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch Variable
                images = images.view(-1, 28*28)
                # Forward pass only to get logits/output
                outputs = model(images)
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                # Total number of labels
                total += labels.size(0)
                # Total correct predictions
                # Without .item(), it is a uint8 tensor which will not work when you pass this number to the scheduler
                correct += (predicted == labels).sum().item()
            accuracy = 100 * correct / total
            # Print Loss
            # print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.data[0], accuracy))

    # Decay Learning Rate, pass validation accuracy for tracking at every epoch
    print('Epoch {} completed'.format(epoch))
    print('Loss: {}. Accuracy: {}'.format(loss.item(), accuracy))
    print('-'*20)
    scheduler.step(accuracy)
I am using the above strategy. The only thing I am not able to understand is how they are using the test data to compute accuracy and then decreasing the learning rate based on it via the scheduler (the last line of the code). Can we, during training, show the test accuracy to the scheduler and ask it to reduce the learning rate? I found a similar thing on GitHub in the ResNet main.py too. Can someone please clarify?
I think there might be some confusion regarding the term test here.
Difference between test and validation data
What the code actually refers to as test is the validation set, not the actual test set. The difference is that the validation set is used during training to see how well the model generalizes. Normally people just cut off a part of the training data and use that for validation. To me it seems like your code is using the same data for training and validation, but that's just my assumption because I don't know what ./data looks like.
To work in a strictly scientific way, your model should never see actual test data during training, only training and validation data. This way we can assess the model's actual ability to generalize on unseen data after training.
Reducing learning rate based on validation accuracy
The reason you use validation data (called test data in your case) to reduce the learning rate is probably that doing this with the actual training data and training accuracy makes the model more likely to overfit. Why? A plateau in the training accuracy does not necessarily imply a plateau in the validation accuracy, and the other way round. You could be stepping in a promising direction with respect to the validation accuracy (and thus towards parameters that generalize well) and then suddenly reduce or increase the learning rate just because the training accuracy plateaued (or didn't).
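For illustration, here is a minimal sketch, reusing the train_dataset, model, criterion, optimizer, scheduler, batch_size and num_epochs defined in the question, of carving out a validation split from the training data and stepping the scheduler on validation accuracy instead of test accuracy:
from torch.utils.data import random_split, DataLoader

# hold out 10% of the training data as a validation split
val_size = len(train_dataset) // 10
train_size = len(train_dataset) - val_size
train_subset, val_subset = random_split(train_dataset, [train_size, val_size])

train_loader = DataLoader(train_subset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False)

for epoch in range(num_epochs):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images.view(-1, 28*28)), labels)
        loss.backward()
        optimizer.step()

    # evaluate on the held-out validation split, not on the test set
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            preds = model(images.view(-1, 28*28)).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    val_accuracy = 100 * correct / total

    # mode='max': reduce the LR when validation accuracy stops improving
    scheduler.step(val_accuracy)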
Is it possible to set model.loss in a callback without re-compiling with model.compile(...) afterwards (since that resets the optimizer states), and instead just recompile model.loss? For example:
class NewCallback(Callback):
    def __init__(self):
        super(NewCallback, self).__init__()

    def on_epoch_end(self, epoch, logs={}):
        self.model.loss = [loss_wrapper(t_change, current_epoch=epoch)]
        self.model.compile_only_loss()  # is there a version or hack of
                                        # model.compile(...) like this?
To expand on previous Stack Overflow examples:
To achieve a loss function which depends on the epoch number (as in this Stack Overflow question):
def loss_wrapper(t_change, current_epoch):
    def custom_loss(y_true, y_pred):
        c_epoch = K.get_value(current_epoch)
        if c_epoch < t_change:
            # compute loss_1
        else:
            # compute loss_2
    return custom_loss
where "current_epoch" is a Keras variable updated with a callback:
current_epoch = K.variable(0.)
model.compile(optimizer=opt, loss=loss_wrapper(5, current_epoch),
              metrics=...)

class NewCallback(Callback):
    def __init__(self, current_epoch):
        self.current_epoch = current_epoch

    def on_epoch_end(self, epoch, logs={}):
        K.set_value(self.current_epoch, epoch)
One can essentially turn Python code into compositions of backend functions for the loss to work, as follows:
def loss_wrapper(t_change, current_epoch):
    def custom_loss(y_true, y_pred):
        # compute loss_1 and loss_2
        bool_case_1 = K.less(current_epoch, t_change)
        num_case_1 = K.cast(bool_case_1, "float32")
        loss = num_case_1 * loss_1 + (1 - num_case_1) * loss_2
        return loss
    return custom_loss
and it works.
However, I am not satisfied with these hacks and wonder: is it possible to set model.loss in a callback without re-compiling with model.compile(...) afterwards (since that resets the optimizer states), and instead just recompile model.loss?
I hope you have found a solution to your problem by now, but using TensorFlow I think you can solve this by building a custom training loop (here is the doc). This will not override the loss attribute as you requested, but you can probably achieve what you are looking for.
example
initializing variable
modifying the example from the documentation, with a model and dataset as such:
import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,), name="digits")
x1 = tf.keras.layers.Dense(64, activation="relu")(inputs)
x2 = tf.keras.layers.Dense(64, activation="relu")(x1)
outputs = tf.keras.layers.Dense(10, name="predictions")(x2)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
# Prepare the training dataset.
batch_size = 64
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = np.reshape(x_train, (-1, 784))
x_test = np.reshape(x_test, (-1, 784))
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)
we can define our two loss functions (the two I chose make no sense from a scientific point of view, but they allow us to check that the code works):
# Instantiate an optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
# Instantiate a loss function.
loss_1 = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_2 = lambda y_true, y_pred: -1 * loss_1(y_true, y_pred)
training loop
we can then execute our custom training loop:
epochs = 10
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        # Alternate between the two losses from one epoch to the next.
        loss_fn = loss_1 if epoch % 2 else loss_2

        # Open a GradientTape to record the operations run
        # during the forward pass, which enables auto-differentiation.
        with tf.GradientTape() as tape:
            # Run the forward pass of the layer.
            # The operations that the layer applies
            # to its inputs are going to be recorded
            # on the GradientTape.
            logits = model(x_batch_train, training=True)  # Logits for this minibatch

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)

        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss.
        grads = tape.gradient(loss_value, model.trainable_weights)

        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %s samples" % ((step + 1) * 64))
and we check that the output is what we want (alternating positive and negative losses):
Start of epoch 0
Training loss (for one batch) at step 0: -96.1003
Seen so far: 64 samples
Training loss (for one batch) at step 200: -3383849.5000
Seen so far: 12864 samples
Training loss (for one batch) at step 400: -40419124.0000
Seen so far: 25664 samples
Training loss (for one batch) at step 600: -149133008.0000
Seen so far: 38464 samples
Training loss (for one batch) at step 800: -328322816.0000
Seen so far: 51264 samples
Start of epoch 1
Training loss (for one batch) at step 0: 580457984.0000
Seen so far: 64 samples
Training loss (for one batch) at step 200: 297710528.0000
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 213328544.0000
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 159328976.0000
Seen so far: 38464 samples
Training loss (for one batch) at step 800: 105737024.0000
Seen so far: 51264 samples
drawbacks and further improvements
The problem with writing custom loops like this is that you lose the convenience of Keras's fit method. I think you can manage this by defining a custom model and overriding train_step, as shown here in the documentation.
If you really need to have the loss attribute of your model changed, you can set the compiled_loss attribute using a keras.engine.compile_utils.LossesContainer (here is the reference) and set model.train_function to model.make_train_function() (so that the new loss gets taken into account).
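As a further sketch of that train_step idea, reusing the loss_1 and loss_2 defined above; the SwitchLossModel class, the t_change threshold and the EpochTracker callback are illustrative names of mine, not Keras API:
class SwitchLossModel(tf.keras.Model):
    """Functional model whose train_step picks the loss based on the current epoch."""
    def __init__(self, *args, t_change=5, **kwargs):
        super().__init__(*args, **kwargs)
        self.t_change = t_change
        # epoch counter updated by a callback, so no re-compilation is needed
        self.current_epoch = tf.Variable(0, trainable=False, dtype=tf.int64)

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            # use loss_1 before t_change epochs, loss_2 afterwards
            loss = tf.cond(
                self.current_epoch < self.t_change,
                lambda: loss_1(y, y_pred),
                lambda: loss_2(y, y_pred),
            )
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

class EpochTracker(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.model.current_epoch.assign(epoch)

# reuse the functional graph from above and train with the stock fit() loop
model = SwitchLossModel(inputs=inputs, outputs=outputs, t_change=5)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
model.fit(train_dataset, epochs=10, callbacks=[EpochTracker()])
Since the epoch counter is a tf.Variable that the callback assigns in place, switching losses never triggers a re-compile, so the optimizer state is preserved.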