Decay parameter of Adam optimizer in Keras - python

I think that the Adam optimizer is designed such that it automatically adjusts the learning rate.
But there is an option to explicitly mention the decay in the Adam parameter options in Keras.
I want to clarify the effect of decay on Adam optimizer in Keras.
If we compile the model using decay say 0.01 on lr = 0.001, and then fit the model running for 50 epochs, then does the learning rate get reduced by a factor of 0.01 after each epoch?
Is there any way where we can specify that the learning rate should decay only after running for certain number of epochs?
In PyTorch there is a different implementation called AdamW, which is not present in the standard Keras library.
Is this the same as varying the decay after every epoch as mentioned above?
Thanks in advance for the reply.

From the source code, decay adjusts lr per iteration according to
lr = lr * (1. / (1. + decay * iterations))  # simplified
This is epoch-independent. iterations is incremented by 1 on each batch fit (e.g. each time train_on_batch is called, or once per batch in x for model.fit(x) - usually len(x) // batch_size batches per epoch).
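For intuition, here is a minimal sketch of that formula in plain Python (the lr, decay and batch-count values are hypothetical, not taken from the question):

def decayed_lr(initial_lr, decay, iterations):
    # mirrors the simplified formula above: lr * 1 / (1 + decay * iterations)
    return initial_lr * (1. / (1. + decay * iterations))

initial_lr, decay = 0.001, 0.01   # hypothetical values
batches_per_epoch = 100           # e.g. len(x) // batch_size
for epoch in (1, 10, 50):
    print(epoch, decayed_lr(initial_lr, decay, epoch * batches_per_epoch))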
To implement what you've described, you can use a callback as below:
from keras.callbacks import LearningRateScheduler

def decay_schedule(epoch, lr):
    # decay by a factor of 0.1 every 5 epochs; use `% 1` to decay after each epoch
    if (epoch % 5 == 0) and (epoch != 0):
        lr = lr * 0.1
    return lr

lr_scheduler = LearningRateScheduler(decay_schedule)
model.fit(x, y, epochs=50, callbacks=[lr_scheduler])
The LearningRateScheduler takes a function as an argument, and that function is fed the epoch index and lr at the beginning of each epoch by .fit. It then updates lr according to that function - so on the next epoch, the function is fed the updated lr.
Also, there is a Keras implementation of AdamW, NadamW, and SGDW, by me - Keras AdamW.
Clarification: the very first call to .fit() invokes on_epoch_begin with epoch = 0 - if we don't wish lr to be decayed immediately, we should add an epoch != 0 check in decay_schedule. Then, epoch denotes how many epochs have already passed - so when epoch = 5, the decay is applied.

Internally, the learning rate decay is applied after each batch, not after each epoch as is commonly believed.
You can read more about it here: https://www.pyimagesearch.com/2019/07/22/keras-learning-rate-schedules-and-decay/
However, you can also implement your own learning rate scheduler via a custom callback:
def learning_rate_scheduler(epoch, lr):
    # Say you want to divide the lr by 5 after every 10 epochs
    # (epoch + 1) since epoch numbering starts from 0
    if (epoch + 1) % 10 == 0:
        lr = lr / 5
    return lr  # the scheduler must return the (possibly updated) lr

callbacks = [
    tensorflow.keras.callbacks.LearningRateScheduler(learning_rate_scheduler, verbose=1)
]

model.fit(..., callbacks=callbacks, ...)
The above method works for all types of optimizers, not only Adam.
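To illustrate, a minimal sketch with plain SGD instead of Adam, reusing learning_rate_scheduler from above (model, x and y are assumed to exist, and the loss shown is only a placeholder):

import tensorflow

model.compile(optimizer=tensorflow.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy")
model.fit(x, y, epochs=50,
          callbacks=[tensorflow.keras.callbacks.LearningRateScheduler(learning_rate_scheduler, verbose=1)])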

Related

Why doesn't the Adadelta optimizer decay the learning rate?

I have initialised an Adadelta optimizer in Keras (using Tensorflow backend) and assigned it to a model:
my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, rho=0.95)
my_model.compile(optimizer=my_adadelta, loss="binary_crossentropy")
During training, I am using a callback to print the learning rate after every epoch:
class LRPrintCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        print(K.eval(lr))
However, this prints the same (initial) learning rate after every epoch.
The same thing happens if I initialize the optimizer like this:
my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, decay=0.95)
Am I doing something wrong in the initialization? Is the learning rate maybe changing but I am not printing the right thing?
As discussed in a relevant Github thread, the decay does not affect the variable lr itself, which is used only to store the initial value of the learning rate. In order to print the decayed value, you need to explicitly compute it yourself and store it in a separate variable lr_with_decay; you can do so by using the following callback:
class MyCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        decay = self.model.optimizer.decay
        iterations = self.model.optimizer.iterations
        lr_with_decay = lr / (1. + decay * K.cast(iterations, K.dtype(decay)))
        print(K.eval(lr_with_decay))
as explained here and here. In fact, the specific code snippet suggested there, i.e.
lr = self.lr
if self.initial_decay > 0:
    lr *= (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
comes directly from the underlying Keras source code for Adadelta.
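To use it, attach an instance to training like any other callback; a usage sketch (my_model is from the question, while the training arrays are hypothetically named X_train and y_train):

my_model.fit(X_train, y_train, epochs=10, callbacks=[MyCallback()])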
As is clear from inspecting the linked source code, the parameter of interest here for decaying the learning rate is decay, not rho; although the documentation also uses the term 'decay' to describe rho, it is a different kind of decay that has nothing to do with the learning rate:
rho: float >= 0. Adadelta decay factor, corresponding to fraction of gradient to keep at each time step.
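For reference, a sketch of the standard Adadelta update (my own transcription of the original paper, not the Keras docs), showing that rho is the coefficient of the exponential moving averages rather than a learning rate decay:

E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2
\Delta\theta_t = - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t
E[\Delta\theta^2]_t = \rho \, E[\Delta\theta^2]_{t-1} + (1 - \rho) \, \Delta\theta_t^2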

Cannot improve model accuracy

I am building a general-purpose NN that would classify images (Dog/No Dog) and movie reviews (Good/Bad). I have to stick to a very specific architecture and loss function, so changing these two seems out of the equation. My architecture is a two-layer network with ReLU followed by a sigmoid and a cross-entropy loss function. With 1000 epochs and a learning rate of around .001 I am getting 100 percent training accuracy and .72 testing accuracy. I was looking for suggestions to improve my testing accuracy. This is the layout of what I have:
def train_net(epochs,batch_size,train_x,train_y,model_size,lr):
    n_x,n_h,n_y=model_size
    model = Net(n_x, n_h, n_y)
    optim = torch.optim.Adam(model.parameters(),lr=0.005)
    loss_function = nn.BCELoss()
    train_losses = []
    accuracy = []
    for epoch in range(epochs):
        count=0
        model.train()
        train_loss = []
        batch_accuracy = []
        for idx in range(0, train_x.shape[0], batch_size):
            batch_x = torch.from_numpy(train_x[idx : idx + batch_size]).float()
            batch_y = torch.from_numpy(train_y[:,idx : idx + batch_size]).float()
            model_output = model(batch_x)
            batch_accuracy=[]
            loss = loss_function(model_output, batch_y)
            train_loss.append(loss.item())
            preds = model_output > 0.5
            nb_correct = (preds == batch_y).sum()
            count+=nb_correct.item()
            optim.zero_grad()
            loss.backward()
            # Scheduler made it worse
            # scheduler.step(loss.item())
            optim.step()
        if epoch % 100 == 1:
            train_losses.append(train_loss)
            print("Iteration : {}, Training loss: {} ,Accuracy %: {}".format(epoch,np.mean(train_loss),(count/train_x.shape[0])*100))
    plt.plot(np.squeeze(train_losses))
    plt.ylabel('loss')
    plt.xlabel('iterations (per tens)')
    plt.title("Learning rate =" + str(lr))
    plt.show()
    return model
My model parameters:
batch_size = 32
lr = 0.0001
epochs = 1500
n_x = 12288 # num_px * num_px * 3
n_h = 7
n_y = 1
model_size=n_x,n_h,n_y
model=train_net(epochs,batch_size,train_x,train_y,model_size,lr)
and this is the testing phase.
model.eval() # Setting the model to eval mode, hence making it deterministic.
test_loss = []
count=0;
loss_function = nn.BCELoss()
for idx in range(0, test_x.shape[0], batch_size):
    with torch.no_grad():
        batch_x = torch.from_numpy(test_x[idx : idx + batch_size]).float()
        batch_y = torch.from_numpy(test_y[:,idx : idx + batch_size]).float()
        model_output = model(batch_x)
        preds = model_output > 0.5
        loss = loss_function(model_output, batch_y)
        test_loss.append(loss.item())
        nb_correct = (preds == batch_y).sum()
        count+=nb_correct.item()
print("test loss: {},test accuracy: {}".format(np.mean(test_loss),count/test_x.shape[0]))
Things I have tried:
Messing around with the learning rate, adding momentum, using schedulers and changing batch sizes. Of course these were mainly guesses and not based on any valid assumptions.
The issue you're facing is overfitting. With 100% accuracy on the training set, your model is effectively memorizing the training set and then failing to generalize to unseen samples. The good news is that this is a very common challenge!
You need regularization. One method is dropout, whereby on different training epochs a random set of the NN connections are dropped, forcing the network to "learn" alternate pathways and weights, and softening sharp peaks in parameter space. Since you need to keep your architecture and loss function the same, you won't be able to add such an option in (though for completeness, read this article for a description and implementation of dropout in PyTorch).
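For completeness, a minimal sketch of what dropout looks like in PyTorch (a hypothetical two-layer module for illustration only, since the asker's architecture is fixed):

import torch.nn as nn

class NetWithDropout(nn.Module):
    # hypothetical module, not the asker's fixed architecture
    def __init__(self, n_x, n_h, n_y, p=0.5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_x, n_h),
            nn.ReLU(),
            nn.Dropout(p),        # randomly zeroes activations during training only
            nn.Linear(n_h, n_y),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.layers(x)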
Given your constraints, you'll want to use something like L2 or L1 weight regularization. This typically shows up in the way of adding an additional term to the cost/loss function, which penalizes large weights. In PyTorch, L2 regularization is implemented via the torch.optim construct, with the option weight_decay. (See documentation: torch.optim, search for 'L2')
For your code, try something like:
def train_net(epochs,batch_size,train_x,train_y,model_size,lr):
    ...
    optim = torch.optim.Adam(model.parameters(),...,weight_decay=0.01)
    ...
Based on your statement that your training accuracy is 100%, while your testing accuracy is significantly lower at 72%, it seems that you are significantly overfitting your dataset.
In short, this means that your model is training itself too specifically to the training data that you've given it, picking up on quirks that may exist in the training data but which are not inherent to the classification. For example, if the dogs in your training data were all white, the model would eventually learn to associate the color white with dogs, and be hard-pressed to recognize dogs of other colors given to it in the test data set.
There are many avenues to address this issue: a well sourced overview of the subject written in simple terms can be found here.
Without more information on the specific constraints you have around changing the architecture of the neural network, it's tough to say for sure what you will and will not be able to change. However, weight regularization and dropout are often used to great effect (and are described in the above article.) You should also be free to implement early stopping and a weight constraint to the model.
I'll leave it to you to find resources on how to implement these specific strategies in PyTorch, but this should provide a good jumping-off point.
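As one concrete starting point, a minimal early-stopping sketch in PyTorch (the helper functions, patience value and file name are illustrative assumptions, not the asker's code):

import torch

# assumes `model`, `optim`, `loss_function`, `epochs`, and two hypothetical helpers
# `train_one_epoch(...)` and `evaluate_val_loss(...)` are defined elsewhere
best_val_loss = float("inf")
patience, epochs_without_improvement = 10, 0

for epoch in range(epochs):
    train_one_epoch(model, optim, loss_function)
    val_loss = evaluate_val_loss(model, loss_function)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best weights so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # stop once validation loss stops improving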

PyTorch: How to change the learning rate of an optimizer at any given moment (no LR schedule)

Is it possible in PyTorch to change the learning rate of the optimizer in the middle of training dynamically (I don't want to define a learning rate schedule beforehand)?
So let's say I have an optimizer:
optim = torch.optim.SGD(model.parameters(), lr=0.01)
Now due to some tests which I perform during training, I realize my learning rate is too high so I want to change it to say 0.001. There doesn't seem to be a method optim.set_lr(0.001) but is there some way to do this?
So the learning rate is stored in optim.param_groups[i]['lr'].
optim.param_groups is a list of the different weight groups which can have different learning rates. Thus, simply doing:
for g in optim.param_groups:
    g['lr'] = 0.001
will do the trick.
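A usage sketch of changing the rate mid-training (the training/validation helpers and the threshold are hypothetical; only the param_groups update comes from this answer):

import torch

optim = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(100):
    train_one_epoch(model, optim)          # hypothetical helper
    val_loss = validate(model)             # hypothetical helper
    if val_loss > 1.0:                     # whatever test tells you the lr is too high
        for g in optim.param_groups:
            g['lr'] = 0.001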
Alternatively, as mentioned in the comments, if your learning rate only depends on the epoch number, you can use a learning rate scheduler.
For example (modified example from the doc):
from torch.optim.lr_scheduler import LambdaLR

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Assuming optimizer has two groups.
lambda_group1 = lambda epoch: epoch // 30
lambda_group2 = lambda epoch: 0.95 ** epoch
scheduler = LambdaLR(optimizer, lr_lambda=[lambda_group1, lambda_group2])

for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()
Also, there is a prebuilt learning rate scheduler to reduce on plateaus.
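For example, a minimal sketch of ReduceLROnPlateau (the factor and patience values are illustrative):

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)

for epoch in range(100):
    train(...)
    val_loss = validate(...)
    scheduler.step(val_loss)   # reduces lr when val_loss stops improving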
Instead of a loop in patapouf_ai's answer, you can do it directly via:
optim.param_groups[0]['lr'] = 0.001

tensorflow cifar10 example Learning rate decay confusion when using multiple gpus

Hello community,
I have a small question about the learning rate decay in multi-GPUs training of the Tensorflow cifar10 example.
Here is the code:
# Create a variable to count the number of train() calls. This equals the
# number of batches processed * FLAGS.num_gpus.
global_step = tf.get_variable(
    'global_step', [],
    initializer=tf.constant_initializer(0), trainable=False)

# Calculate the learning rate schedule.
num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                         FLAGS.batch_size)
decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)

# Decay the learning rate exponentially based on the number of steps.
lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                global_step,
                                decay_steps,
                                cifar10.LEARNING_RATE_DECAY_FACTOR,
                                staircase=True)
In this code, the number of GPUs is not considered. For instance, if we increase FLAGS.num_gpus to 4, decay_steps does not change.
In the comments, global_step is supposed to equal the number of batches processed * FLAGS.num_gpus. However, global_step only increases when opt.apply_gradients() is called, i.e. by 1 per iteration.
In my opinion, the code should be
decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY/FLAGS.num_gpus)
Therefore, when utilizing multiple GPUs, the number of iterations required to go through 1 epoch is reduced.
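To make the arithmetic concrete (only NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000 is from the cifar10 example; the other numbers are illustrative assumptions):

NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000
batch_size = 128                      # assumed per-tower batch size
NUM_EPOCHS_PER_DECAY = 350            # assumed value
num_gpus = 4

num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / batch_size                # ~390 batches per tower
decay_steps_original = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)             # 136718 global steps
decay_steps_proposed = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY / num_gpus)  # 34179 global steps

# Each global step consumes num_gpus batches, so one epoch takes
# num_batches_per_epoch / num_gpus global steps - hence the proposed division.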
Please correct me and help me understand if my logic is not correct.

Keras: how learning rate changes when Adadelta optimizer is used?

For example, I use Adadelta as the optimizer when compiling the network model; the learning rate will then change over time according to this rule (but what is iterations?), and how can I log the learning rate value to the console?
model.compile(loss=keras.losses.mean_squared_error,
              optimizer=keras.optimizers.Adadelta())
In the documentation, is lr just the starting learning rate?
The rule is related to updates with decay. Adadelta is an adaptive learning rate method which uses an exponentially decaying average of gradients.
Looking at the Keras source code, the learning rate is recalculated based on decay like this:
lr = self.lr
if self.initial_decay > 0:
    lr *= (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
So yes, lr is just the starting learning rate.
To print it after every epoch, as @orabis mentioned, you can make a callback class:
class YourLearningRateTracker(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        decay = self.model.optimizer.decay
        iterations = self.model.optimizer.iterations
        lr_with_decay = lr / (1. + decay * K.cast(iterations, K.dtype(decay)))
        print(K.eval(lr_with_decay))
and then add its instance to the callbacks when calling model.fit() like:
model.fit(..., callbacks=[YourLearningRateTracker()])
However, note that, by default, the decay parameter for Adadelta is zero and is not among the "standard" arguments, so your learning rate would not change its value when using default arguments.
I suspect that decay is not intended to be used with Adadelta.
On the other hand, rho parameter, which is nonzero by default, doesn’t describe the decay of the learning rate, but corresponds to the fraction of gradient to keep at each time step (according to the Keras documentation).
I found some relevant information on this Github issue, and by asking a similar question.
