I trained a model for several epochs and now want to continue training it for more epochs. How does the Adam optimizer behave in that case? Will it reinitialize the timestep to t = 0, or will it keep the last timestep?
a) The TensorFlow documentation shows the following calculations. Is there a way I can add these quantities to TensorBoard?
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
A few related questions (question1 and question2) have gone unanswered for a long time.
I am actually seeing a problem with the error rate when re-training the model from the last checkpoint, and I am not sure what exactly happens with the Adam optimizer in this case.
I think your question is similar to this one: Saving the state of the AdaGrad algorithm in Tensorflow
If you save and reload the state of the optimizer, it will continue where it left off; if you don't load the optimizer state after training, it will simply start again!
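To make the point about optimizer state concrete, here is a minimal pure-Python sketch of the update rule quoted in the question. The ScalarAdam class, the train helper, and the toy objective f(x) = x**2 are all made up for illustration and are not TensorFlow's API. It only shows that t, m and v are the state that has to travel with a checkpoint.

```python
import copy
import math


class ScalarAdam:
    """Minimal Adam for one scalar parameter, following the update rule quoted above."""

    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1, self.beta2, self.epsilon = beta1, beta2, epsilon
        # Optimizer state: the timestep and the two moving averages.
        self.state = {"t": 0, "m": 0.0, "v": 0.0}

    def step(self, variable, g):
        s = self.state
        s["t"] += 1
        lr_t = (self.learning_rate * math.sqrt(1 - self.beta2 ** s["t"])
                / (1 - self.beta1 ** s["t"]))
        s["m"] = self.beta1 * s["m"] + (1 - self.beta1) * g
        s["v"] = self.beta2 * s["v"] + (1 - self.beta2) * g * g
        return variable - lr_t * s["m"] / (math.sqrt(s["v"]) + self.epsilon)


def train(opt, x, steps):
    """Minimize the toy objective f(x) = x**2, whose gradient is 2 * x."""
    for _ in range(steps):
        x = opt.step(x, 2 * x)
    return x


# Uninterrupted run of 20 steps.
x_full = train(ScalarAdam(), 1.0, 20)

# 10 steps, "checkpoint" the variable and the optimizer state, then resume.
first = ScalarAdam()
x_ckpt = train(first, 1.0, 10)
saved_state = copy.deepcopy(first.state)

resumed = ScalarAdam()
resumed.state = copy.deepcopy(saved_state)  # t, m and v are restored
x_resumed = train(resumed, x_ckpt, 10)

fresh = ScalarAdam()                        # t, m and v restart from 0
x_fresh = train(fresh, x_ckpt, 10)

print(x_full, x_resumed, x_fresh)
print(resumed.state["t"], fresh.state["t"])
```

With the state restored, the resumed run reproduces the uninterrupted one. With a fresh optimizer, the timestep starts again at 0 and the moving averages have to be rebuilt.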
I think that the Adam optimizer is designed such that it automatically adjusts the learning rate.
But there is an option to explicitly mention the decay in the Adam parameter options in Keras.
I want to clarify the effect of decay on Adam optimizer in Keras.
If we compile the model using decay say 0.01 on lr = 0.001, and then fit the model running for 50 epochs, then does the learning rate get reduced by a factor of 0.01 after each epoch?
Is there any way to specify that the learning rate should decay only after running for a certain number of epochs?
In PyTorch there is a different implementation called AdamW, which is not present in the standard Keras library.
Is this the same as varying the decay after every epoch as mentioned above?
Thanks in advance for the reply.
According to the source code, decay adjusts lr on every iteration according to
lr = lr * (1. / (1. + decay * iterations)) # simplified
This is epoch-independent. iterations is incremented by 1 on each batch fit (e.g. each time train_on_batch is called, or once for every batch in x for model.fit(x), usually len(x) // batch_size batches).
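As a rough numeric illustration of the formula above, here is a small sketch. It assumes that the lr on the right-hand side is the initial learning rate (that is how I read the Keras 2.x source). The decayed_lr helper and the batches_per_epoch value are made up for the example.

```python
def decayed_lr(initial_lr, decay, iterations):
    """Learning rate after `iterations` batch updates under the rule quoted above."""
    return initial_lr * (1.0 / (1.0 + decay * iterations))


initial_lr = 0.001
decay = 0.01
batches_per_epoch = 100  # hypothetical, e.g. 3200 samples with batch_size=32

for epoch in (0, 1, 10, 50):
    iterations = epoch * batches_per_epoch
    print(epoch, decayed_lr(initial_lr, decay, iterations))
```

Under these assumptions the learning rate is already halved after the first epoch (100 iterations), so decay=0.01 does not mean "reduce by a factor of 0.01 per epoch".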
To implement what you've described, you can use a callback as below:
from keras.callbacks import LearningRateScheduler
def decay_schedule(epoch, lr):
    # decay by 0.1 every 5 epochs; use `% 1` to decay after each epoch
    if (epoch % 5 == 0) and (epoch != 0):
        lr = lr * 0.1
    return lr
lr_scheduler = LearningRateScheduler(decay_schedule)
model.fit(x, y, epochs=50, callbacks=[lr_scheduler])
The LearningRateScheduler takes a function as an argument, and the function is fed the epoch index and lr at the beginning of each epoch by .fit. It then updates lr according to that function - so on next epoch, the function is fed the updated lr.
Also, there is a Keras implementation of AdamW, NadamW, and SGDW, by me - Keras AdamW.
Clarification: the very first call to .fit() invokes on_epoch_begin with epoch = 0. If we don't want lr to be decayed immediately, we should add an epoch != 0 check in decay_schedule. Then, epoch denotes how many epochs have already passed, so when epoch = 5, the decay is applied.
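To see which epochs actually get the decayed value, the schedule can be simulated without Keras. This is only a sketch: the simulate helper is made up and just mimics how the callback feeds the returned lr back into the next epoch.

```python
def decay_schedule(epoch, lr):
    # Same rule as in the callback above: decay by 0.1 every 5 epochs, skipping epoch 0.
    if (epoch % 5 == 0) and (epoch != 0):
        lr = lr * 0.1
    return lr


def simulate(schedule, initial_lr, epochs):
    """Feed each epoch's returned lr into the next epoch, as LearningRateScheduler does."""
    lr = initial_lr
    history = []
    for epoch in range(epochs):
        lr = schedule(epoch, lr)
        history.append(lr)
    return history


history = simulate(decay_schedule, 0.001, 12)
for epoch, lr in enumerate(history):
    print(epoch, lr)
```

Epochs 0 to 4 keep the initial value, epochs 5 to 9 run at one tenth of it, and the next drop happens at epoch 10.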
Internally, the learning rate decays after each batch, not after each epoch as is commonly believed.
You can read more about it here: https://www.pyimagesearch.com/2019/07/22/keras-learning-rate-schedules-and-decay/
However, you can also implement your own learning_rate scheduler, via a custom callback function:
import tensorflow
def learning_rate_scheduler(epoch, lr):
    # Say you want to divide the lr by 5 after every 10 epochs
    # (epoch + 1) since it starts from epoch 0
    if (epoch + 1) % 10 == 0:
        lr = lr / 5
    return lr  # the scheduler must return the (possibly updated) lr
callbacks = [
    tensorflow.keras.callbacks.LearningRateScheduler(learning_rate_scheduler, verbose=1)
]
model.fit(..., callbacks=callbacks, ...)
The above method works for all types of optimizers, not only Adam.
I have a question concerning learning rate decay in Keras. I need to understand how the option decay works inside optimizers in order to translate it to an equivalent PyTorch formulation.
From the source code of SGD I see that the update is done this way after every batch update:
lr = self.lr * (1. / (1. + self.decay * self.iterations))
Does this mean that after every batch update the lr is updated starting from its value from its previous update or from its initial value? I mean, which of the two following interpretation is the correct one?
lr = lr_0 * (1. / (1. + self.decay * self.iterations))
or
lr = lr * (1. / (1. + self.decay * self.iterations)),
where lr is the lr updated after previous iteration and lr_0 is always the initial learning rate.
If the correct answer is the first one, this would mean that, in my case, the learning rate would decay from 0.001 to just 0.0002 after 100 epochs, whereas in the second case it would decay from 0.001 to around 1e-230 after 70 epochs.
Just to give you some context, I'm working with a CNN for a regression problem from images and I just have to translate Keras code into PyTorch code. So far, with the second of the aforementioned interpretations, I only ever manage to predict the same value, regardless of batch size and input at test time.
Thanks in advance for your help!
Based on the implementation in Keras, I think your first formulation is the correct one, the one that contains the initial learning rate (note that self.lr is not being updated).
However I think your calculation is probably not correct: since the denominator is the same, and lr_0 >= lr since you are doing decay, the first formulation has to result in a bigger number.
I'm not sure if this decay is available in PyTorch, but you can easily create something similar with torch.optim.lr_scheduler.LambdaLR.
decay = .001
fcn = lambda step: 1./(1. + decay*step)
scheduler = LambdaLR(optimizer, lr_lambda=fcn)
Finally, don't forget that you need to call .step() explicitly on the scheduler; stepping your optimizer is not enough. Also, learning rate scheduling is most often done only after a full epoch, not after every single batch, but I see that here you are just recreating the Keras behavior.
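The two interpretations from the question can be compared numerically with a few lines of plain Python. This is only a sketch: the function names are made up, and lr_0 = 0.001 with decay = 0.001 are illustrative values, not the asker's actual settings.

```python
def lr_from_initial(lr_0, decay, iterations):
    """First interpretation: always rescale the initial learning rate."""
    return lr_0 * (1.0 / (1.0 + decay * iterations))


def lr_recursive(lr_0, decay, iterations):
    """Second interpretation: rescale the previously decayed learning rate at each step."""
    lr = lr_0
    for i in range(1, iterations + 1):
        lr = lr * (1.0 / (1.0 + decay * i))
    return lr


lr_0, decay = 0.001, 0.001  # hypothetical values for illustration
for n in (0, 10, 100, 1000):
    print(n, lr_from_initial(lr_0, decay, n), lr_recursive(lr_0, decay, n))
```

The first form shrinks slowly (it halves after 1000 iterations with these values), while the recursive form multiplies all the factors together and collapses towards zero very quickly. That would fit the "always predicts the same value" symptom described in the question.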
Actually, the response of mkisantal might be incorrect, since the actual equation for the learning rate in keras (at least it was, now there is no default decay option) was like this:
lr = lr * (1. / (1. + self.decay * self.iterations))
(see https://github.com/keras-team/keras/blob/2.2.0/keras/optimizers.py#L178)
And the solution presented by mkisantal is missing the recurrent/multiplicative term lr, therefore the more accurate version should be based on MultiplicativeLR:
decay = .001
fcn = lambda step: 1./(1. + decay*step)
scheduler = MultiplicativeLR(optimizer, lr_lambda=fcn)
In Tensorflow, after I obtain my loss term, I give it to an optimizer and it adds the necessary differentiation and update terms to the computation graph:
global_counter = tf.Variable(0, dtype=DATA_TYPE, trainable=False)
learning_rate = tf.train.exponential_decay(
INITIAL_LR, # Base learning rate.
global_counter, # Current index into the dataset.
DECAY_STEP, # Decay step.
DECAY_RATE, # Decay rate.
staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(network.finalLoss, global_step=global_counter)
feed_dict = {TRAIN_DATA_TENSOR: samples, TRAIN_LABEL_TENSOR: labels}
results = sess.run([optimizer], feed_dict=feed_dict)
I want to make a small modification to this process. I want to scale the learning_rate differently for each distinct parameter in the network. For example, let A and B be two different trainable parameters in the network, and let dL/dA and dL/dB be the partial derivatives of the loss with respect to those parameters. The momentum optimizer updates the variables as:
Ma <- 0.9*Ma + learning_rate*dL/dA
A <- A - Ma
Mb <- 0.9*Mb + learning_rate*dL/dB
B <- B - Mb
I want to modify this as:
Ma <- 0.9*Ma + ca*learning_rate*dL/dA
A <- A - Ma
Mb <- 0.9*Mb + cb*learning_rate*dL/dB
B <- B - Mb
Where ca and cb are special learning rate scales for different parameters. As far as I understand, Tensorflow has compute_gradients and apply_gradients methods we can call for such cases, but the documentation is not very clear about how to use them. Any help would be much appreciated.
To calculate the gradients:
self.gradients = tf.gradients(self.loss, tf.trainable_variables())
Now you can access the gradients using sess.run([model.gradients], feed_dict).
Assuming you have declared the learning_rate as a tf.Variable(), you can assign the learning rate using the following code:
sess.run(tf.assign(model.lr, args.learning_rate * (args.decay_rate ** epoch)))
The above code is just an example. You can modify it to be used for your purpose.
Custom learning rates are very easy to handle in TensorFlow.
learning_rate = tf.Variable(INITIAL_LR,trainable=False,name="lr")
and say l1 and l2 are two different learning rates :
l1 = ca * learning_rate
l2 = cb * learning_rate
you can do any type of mathematical manipulation with respect to learning rate, and apply it in this manner :
optimizer=tf.train.MomentumOptimizer(l1,0.9).minimize(network.finalLoss, global_step=global_counter)
Regarding your problem: what you actually want is a different gradient scaling for different layers, say layer L1 (the trainable variables that use Ma) and layer L2 (the trainable variables that use Mb).
global_counter = tf.Variable(0, dtype=DATA_TYPE, trainable=False)
learning_rate = tf.train.exponential_decay(
INITIAL_LR, # Base learning rate.
global_counter, # Current index into the dataset.
DECAY_STEP, # Decay step.
DECAY_RATE, # Decay rate.
staircase=True)
optimizer1 = tf.train.MomentumOptimizer(ca * learning_rate, 0.9).minimize(network.finalLoss, global_step=global_counter , var_list= L1)
optimizer2 = tf.train.MomentumOptimizer(cb * learning_rate, 0.9).minimize(network.finalLoss, global_step=global_counter , var_list= L2)
optimizer = tf.group(optimizer1 , optimizer2)
feed_dict = {TRAIN_DATA_TENSOR: samples, TRAIN_LABEL_TENSOR: labels}
results = sess.run([optimizer], feed_dict=feed_dict)
You can find the optimized version of the above code here
Please note that if you set the learning rate via tf.assign, it returns a reference to the learning rate variable, whereas the optimizer expects a float learning rate value, which will probably (and should) throw an error.
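For reference, the per-parameter update rule from the question can be written out in plain Python. This is a sketch, not TensorFlow code: the momentum_step helper, the toy loss L = A**2 + B**2, and the scale values ca and cb are all assumptions made for illustration.

```python
def momentum_step(value, velocity, grad, learning_rate, scale=1.0, momentum=0.9):
    """One update of the rule above: M <- 0.9*M + c*lr*dL/dX, then X <- X - M."""
    velocity = momentum * velocity + scale * learning_rate * grad
    return value - velocity, velocity


learning_rate = 0.1
ca, cb = 1.0, 0.1  # hypothetical per-parameter scales

# Toy loss L = A**2 + B**2, so dL/dA = 2*A and dL/dB = 2*B.
A, B = 1.0, 1.0
Ma, Mb = 0.0, 0.0
for _ in range(3):
    A, Ma = momentum_step(A, Ma, 2 * A, learning_rate, scale=ca)
    B, Mb = momentum_step(B, Mb, 2 * B, learning_rate, scale=cb)

print(A, B)
```

Since the scale only ever multiplies the gradient term, scaling each gradient by its own factor before handing it to a single optimizer should have the same effect as giving each parameter its own learning rate. That is the kind of thing compute_gradients and apply_gradients make possible.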
I would like to create a custom loss function that has a weight term that's updated based on what epoch I'm in.
For example:
Let's say I have a loss function which has a beta weight, where beta increases over the first 20 epochs...
from keras import objectives
def custom_loss(x, x_pred):
    loss1 = objectives.binary_crossentropy(x, x_pred)
    loss2 = objectives.mse(x, x_pred)
    return (beta*current_epoch/20) * loss1 + loss2
How could I implement something like this into a keras loss function?
Looking at their documentation, they mention that you can use Theano/TensorFlow symbolic functions that return a scalar for each data point.
So you could do something like this
loss = (tf.contrib.losses.softmax_cross_entropy(x, x_pred) * (beta * current_epoch / 20)
        + tf.contrib.losses.mean_squared_error(x, x_pred))
You would have to pass x and x_pred as tf.placeholders.
I think you could use Keras for model creation, but then you would still have to run the computational graph with sess.run().
References:
https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html#using-keras-models-with-tensorflow
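The weighting itself is easy to check outside of any framework. Below is a plain-Python sketch of the loss from the question; the helper names are made up, and holding the weight at beta once 20 epochs have passed is my assumption, since the question only says that beta increases over the first 20 epochs. In an actual Keras model, current_epoch would have to live in something like a backend variable that a callback updates at the start of each epoch, since the loss function itself is only given x and x_pred.

```python
import math


def binary_crossentropy(x, x_pred, eps=1e-7):
    """Mean binary cross-entropy over a list of targets and predictions."""
    total = 0.0
    for t, p in zip(x, x_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(x)


def mse(x, x_pred):
    """Mean squared error over a list of targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(x, x_pred)) / len(x)


def loss_weight(current_epoch, beta=1.0, warmup_epochs=20):
    """beta * current_epoch / 20, held at beta after the warm-up (an assumption)."""
    return beta * min(current_epoch, warmup_epochs) / warmup_epochs


def custom_loss(x, x_pred, current_epoch, beta=1.0):
    return loss_weight(current_epoch, beta) * binary_crossentropy(x, x_pred) + mse(x, x_pred)


x = [1.0, 0.0, 1.0]
x_pred = [0.9, 0.2, 0.6]
for epoch in (0, 10, 20, 30):
    print(epoch, custom_loss(x, x_pred, epoch))
```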
I'm confused as to how the Adam optimizer actually works in TensorFlow.
The way I read the docs, it says that the learning rate is changed every gradient descent iteration.
But when I call the function, I give it a learning rate. And I don't call the function to, say, run one full epoch (implicitly running as many iterations as needed to go through my training data). I call it explicitly for each batch, like this:
for epoch in epochs:
    for batch in data:
        sess.run(train_adam_step, feed_dict={eta: 1e-3})
So my eta cannot be changing. And I'm not passing a time variable in. Or is this some sort of generator type thing where upon session creation t is incremented each time I call the optimizer?
Assuming it is some generator type thing and the learning rate is being invisibly reduced: How could I get to run the adam optimizer without decaying the learning rate? It seems to me like RMSProp is basically the same, the only thing I'd have to do to make it equal (learning rate disregarded) is to change the hyperparameters momentum and decay to match beta1 and beta2 respectively. Is that correct?
I find the documentation quite clear; I will paste the algorithm here in pseudo-code:
Your parameters:
learning_rate: between 1e-4 and 1e-2 is standard
beta1: 0.9 by default
beta2: 0.999 by default
epsilon: 1e-08 by default
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
Initialization:
m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)
m_t and v_t keep track of a moving average of the gradient and of its square, for each parameter of the network. (So if you have 1M parameters, Adam will keep 2M more values in memory.)
At each iteration t, and for each parameter of the model:
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * gradient
v_t <- beta2 * v_{t-1} + (1 - beta2) * gradient ** 2
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
Here lr_t is a bit different from learning_rate because, for early iterations, the moving averages have not converged yet, so we have to normalize by multiplying by sqrt(1 - beta2^t) / (1 - beta1^t). When t is large (well above 1./(1.-beta2)), lr_t is almost equal to learning_rate.
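To get a feel for how quickly lr_t approaches learning_rate, the correction factor can simply be tabulated. This is a small sketch using the default beta1 and beta2 listed above; bias_correction_factor is a made-up helper name.

```python
import math


def bias_correction_factor(t, beta1=0.9, beta2=0.999):
    """lr_t / learning_rate = sqrt(1 - beta2^t) / (1 - beta1^t)."""
    return math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)


for t in (1, 10, 100, 1000, 10000):
    print(t, bias_correction_factor(t))
```

With the defaults, the factor is still clearly below 1 around t = 1000 and only gets very close to 1 after several thousand steps. The factor depends only on t, so this happens regardless of the learning_rate you pass in.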
To answer your question, you just need to pass a fixed learning rate, keep beta1 and beta2 default values, maybe modify epsilon, and Adam will do the magic :)
Link with RMSProp
Adam with beta1=0 is equivalent to RMSProp with momentum=0 (apart from Adam's bias correction and where epsilon is added). The argument beta2 of Adam and the argument decay of RMSProp are the same.
However, RMSProp does not keep a moving average of the gradient. But it can maintain a momentum, like MomentumOptimizer.
A detailed description of rmsprop.
maintain a moving (discounted) average of the square of gradients
divide gradient by the root of this average
(can maintain a momentum)
Here is the pseudo-code:
v_t <- decay * v_{t-1} + (1-decay) * gradient ** 2
mom_t <- momentum * mom_{t-1} + learning_rate * gradient / sqrt(v_t + epsilon)
variable <- variable - mom_t
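The RMSProp pseudo-code above translates almost line by line into Python. Here is a sketch for a single scalar variable; rmsprop_step is a made-up helper, the default hyperparameter values are chosen for illustration, and the objective f(x) = x**2 is just a toy example.

```python
import math


def rmsprop_step(variable, gradient, v, mom,
                 learning_rate=0.001, decay=0.9, momentum=0.0, epsilon=1e-10):
    """One RMSProp update for a scalar, following the pseudo-code above."""
    v = decay * v + (1 - decay) * gradient ** 2
    mom = momentum * mom + learning_rate * gradient / math.sqrt(v + epsilon)
    return variable - mom, v, mom


# Minimize the toy objective f(x) = x**2 (gradient 2 * x) for a few steps.
x, v, mom = 1.0, 0.0, 0.0
for _ in range(5):
    x, v, mom = rmsprop_step(x, 2 * x, v, mom, learning_rate=0.01)
    print(x, v, mom)
```

Note that there is no timestep t and no bias correction here: the only state is v and, if momentum is used, mom.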
RMSProp and Adam both have adaptive learning rates.
The basic RMSProp:
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
As you can see, it originally has two parameters, decay_rate and eps.
We can then add momentum to make the gradient more stable, and write:
cache = decay_rate * cache + (1 - decay_rate) * dx**2
m = beta1*m + (1-beta1)*dx  # beta1 is the momentum parameter in the docs
x += - learning_rate * m / (np.sqrt(cache) + eps)
Here you can see that if we set beta1 = 0, this is RMSProp without momentum.
Next, the basics of Adam.
In CS231n, Andrej Karpathy initially describes Adam like this:
Adam is a recently proposed update that looks a bit like RMSProp with momentum
So yes! Then what makes it different from RMSProp with momentum?
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
He also mentions that, in this update equation, m and v are smoother.
So the difference from RMSProp is that the update is less noisy.
Where does this noise come from?
Well, in the initialization step we initialize m and v to zero.
m=v=0
To reduce this initialization effect, it is always good to have some warm-up. The equations then become:
m = beta1*m + (1-beta1)*dx  # beta1 = 0.9, beta2 = 0.999
mt = m / (1-beta1**t)
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1-beta2**t)
x += - learning_rate * mt / (np.sqrt(vt) + eps)
Now we run this for a few iterations. Pay close attention to the mt and vt lines: as t (the iteration number) increases, the following happens to mt:
mt = m