I have a question concerning learning rate decay in Keras. I need to understand how the option decay works inside optimizers in order to translate it to an equivalent PyTorch formulation.
From the source code of SGD I see that the update is done this way after every batch update:
lr = self.lr * (1. / (1. + self.decay * self.iterations))
Does this mean that after every batch update the lr is updated starting from its value at the previous update, or from its initial value? In other words, which of the following two interpretations is correct?
lr = lr_0 * (1. / (1. + self.decay * self.iterations))
or
lr = lr * (1. / (1. + self.decay * self.iterations)),
where lr is the learning rate updated after the previous iteration and lr_0 is the initial learning rate.
If the correct answer is the first one, this would mean that, in my case, the learning rate would decay from 0.001 to just 0.0002 after 100 epochs, whereas in the second case it would decay from 0.001 to around 1e-230 after 70 epochs.
Just to give you some context, I'm working with a CNN for a regression problem from images, and I just have to translate Keras code into PyTorch code. So far, with the second of the aforementioned interpretations, the model only ever predicts the same value, regardless of batch size and input at test time.
Thanks in advance for your help!
Based on the implementation in Keras, I think your first formulation is the correct one, the one that contains the initial learning rate (note that self.lr is not being updated).
However, I think your calculation is probably not correct: since the denominator is the same, and lr_0 >= lr because you are decaying, the first formulation has to result in a bigger number.
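For illustration, here is a tiny sketch comparing the two interpretations numerically (the decay value is hypothetical, since the question does not state it):
lr_0, decay = 0.001, 1e-4   # hypothetical values
lr_fixed = lr_0
lr_running = lr_0
for it in range(1, 10001):
    lr_fixed = lr_0 * (1. / (1. + decay * it))          # interpretation 1: always from lr_0
    lr_running = lr_running * (1. / (1. + decay * it))  # interpretation 2: compounding
print(lr_fixed)    # 0.0005 after 10,000 iterations
print(lr_running)  # collapses towards 0 (underflows)
The first interpretation shrinks the learning rate slowly, while the second collapses it almost immediately.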
I'm not sure if this decay is available in PyTorch, but you can easily create something similar with torch.optim.lr_scheduler.LambdaLR.
from torch.optim.lr_scheduler import LambdaLR

decay = .001
fcn = lambda step: 1./(1. + decay*step)
scheduler = LambdaLR(optimizer, lr_lambda=fcn)
Finally, don't forget that you need to call .step() explicitly on the scheduler; it's not enough to step your optimizer. Also, learning rate scheduling is most often done only after a full epoch, not after every single batch, but I see that here you are just recreating the Keras behavior.
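For example, a minimal per-batch training loop with this scheduler could look like the following sketch (model, criterion, train_loader and num_epochs are placeholder names, not defined in the original code):
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()  # stepped per batch to mimic Keras' per-iteration decay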
Actually, mkisantal's response might be incorrect, since the actual equation for the learning rate in Keras (at least it was; there is no default decay option anymore) was:
lr = lr * (1. / (1. + self.decay * self.iterations))
(see https://github.com/keras-team/keras/blob/2.2.0/keras/optimizers.py#L178)
The solution presented by mkisantal is missing the recurrent/multiplicative term lr, so the more accurate version should be based on MultiplicativeLR:
from torch.optim.lr_scheduler import MultiplicativeLR

decay = .001
fcn = lambda step: 1./(1. + decay*step)
scheduler = MultiplicativeLR(optimizer, lr_lambda=fcn)
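To see the difference between the two schedulers, here is a small sketch with a toy parameter and optimizers (all names are illustrative): LambdaLR multiplies the initial learning rate by fcn(step), whereas MultiplicativeLR multiplies the previous learning rate by fcn(step).
import torch
from torch.optim.lr_scheduler import LambdaLR, MultiplicativeLR

p = torch.nn.Parameter(torch.zeros(1))   # toy parameter, just to build the optimizers
decay = .001
fcn = lambda step: 1./(1. + decay*step)

opt_a = torch.optim.SGD([p], lr=0.001)
opt_b = torch.optim.SGD([p], lr=0.001)
sched_a = LambdaLR(opt_a, lr_lambda=fcn)          # lr = lr_0 * fcn(step)
sched_b = MultiplicativeLR(opt_b, lr_lambda=fcn)  # lr = previous lr * fcn(step)

for step in range(3):
    opt_a.step(); opt_b.step()
    sched_a.step(); sched_b.step()
    print(sched_a.get_last_lr(), sched_b.get_last_lr())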
I have initialised an Adadelta optimizer in Keras (using Tensorflow backend) and assigned it to a model:
my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, rho=0.95)
my_model.compile(optimizer=my_adadelta, loss="binary_crossentropy")
During training, I am using a callback to print the learning rate after every epoch:
from keras.callbacks import Callback
from keras import backend as K

class LRPrintCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        print(K.eval(lr))
However, this prints the same (initial) learning rate after every epoch.
The same thing happens if I initialize the optimizer like this:
my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, decay=0.95)
Am I doing something wrong in the initialization? Is the learning rate maybe changing but I am not printing the right thing?
As discussed in a relevant Github thread, the decay does not affect the variable lr itself, which is used only to store the initial value of the learning rate. In order to print the decayed value, you need to explicitly compute it yourself and store it in a separate variable lr_with_decay; you can do so by using the following callback:
class MyCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        decay = self.model.optimizer.decay
        iterations = self.model.optimizer.iterations
        lr_with_decay = lr / (1. + decay * K.cast(iterations, K.dtype(decay)))
        print(K.eval(lr_with_decay))
as explained here and here. In fact, the specific code snippet suggested there, i.e.
lr = self.lr
if self.initial_decay > 0:
    lr *= (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
comes directly from the underlying Keras source code for Adadelta.
As is clear from inspection of the linked source code, the parameter of interest here for decaying the learning rate is decay, not rho; although the term 'decay' is also used to describe rho in the documentation, it is a different kind of decay that has nothing to do with the learning rate:
rho: float >= 0. Adadelta decay factor, corresponding to fraction of gradient to keep at each time step.
I'm now working on a small project, but I don't know how I should build the model.
The number of inputs is 27 and the number of outputs is 163.
I need to find the weights and biases by training, and I have done this with 5 layers including ReLU and dropout.
When I look at the training and validation loss curves in TensorBoard, they look OK.
However, what I also need to be concerned about is uniformity, which is calculated as below:
uniformity = (max. of y - min. of y) / (max. of y + min. of y)
I have real uniformity data that is given, and when I compute uniformities from the y_predict values, the difference from the real uniformity values is too big.
Is there any way to include uniformity in the training, so that it not only cares about finding the right weights and biases, but also about getting the uniformity close?
Thank you!
You may incorporate a uniformity constraint into your loss function during training.
def my_loss(labels, predictions):
    lambda_ = 0.01
    return tf.losses.mean_squared_error(labels, predictions) + \
        lambda_ * uniformity(labels) / uniformity(predictions)
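The uniformity function is not defined in the answer above; a minimal sketch of it, following the (max - min) / (max + min) formula from the question and assuming TF1-style tensors, could be the following (reducing over the whole tensor is an assumption; a per-sample version may be preferable):
import tensorflow as tf

def uniformity(y):
    # (max - min) / (max + min), computed over all elements of the tensor
    y_max = tf.reduce_max(y)
    y_min = tf.reduce_min(y)
    return (y_max - y_min) / (y_max + y_min)
You would then compile with model.compile(optimizer=..., loss=my_loss) and tune lambda_ to control how strongly the uniformity term is weighted.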
For example, if I use Adadelta as the optimizer when compiling the network model, then the learning rate will change over time according to this rule (but what is iterations?), and how can I log the learning rate value to the console?
model.compile(loss=keras.losses.mean_squared_error,
              optimizer=keras.optimizers.Adadelta())
In the documentation, is lr just the starting learning rate?
The rule is related to updates with decay. Adadelta is an adaptive learning rate method which uses an exponentially decaying average of squared gradients.
Looking at the Keras source code, the learning rate is recalculated based on decay like this:
lr = self.lr
if self.initial_decay > 0:
    lr *= (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
So yes, lr is just starting learning rate.
To print it after every epoch, as @orabis mentioned, you can make a callback class:
class YourLearningRateTracker(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        decay = self.model.optimizer.decay
        iterations = self.model.optimizer.iterations
        lr_with_decay = lr / (1. + decay * K.cast(iterations, K.dtype(decay)))
        print(K.eval(lr_with_decay))
and then add its instance to the callbacks when calling model.fit() like:
model.fit(..., callbacks=[YourLearningRateTracker()])
However, note that, by default, the decay parameter for Adadelta is zero and is not part of the “standard” arguments, so your learning rate would not change its value when using the default arguments.
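For illustration, a non-default decay would have to be passed explicitly, e.g. (assuming a Keras version whose optimizers still accept the decay argument, as the question's own snippet does):
my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, rho=0.95, decay=1e-4)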
I suspect that decay is not intended to be used with Adadelta.
On the other hand, rho parameter, which is nonzero by default, doesn’t describe the decay of the learning rate, but corresponds to the fraction of gradient to keep at each time step (according to the Keras documentation).
I found some relevant information on this Github issue, and by asking a similar question.
I'm confused regarding as to how the adam optimizer actually works in tensorflow.
The way I read the docs, it says that the learning rate is changed every gradient descent iteration.
But when I call the function I give it a learning rate. And I don't call the function to, let's say, do one epoch (implicitly calling # iterations so as to go through my training data). I call the function for each batch explicitly, like
for epoch in epochs:
    for batch in data:
        sess.run(train_adam_step, feed_dict={eta: 1e-3})
So my eta cannot be changing. And I'm not passing a time variable in. Or is this some sort of generator-type thing where, upon session creation, t is incremented each time I call the optimizer?
Assuming it is some generator-type thing and the learning rate is being invisibly reduced: how could I run the Adam optimizer without decaying the learning rate? It seems to me like RMSProp is basically the same; the only thing I'd have to do to make it equal (learning rate aside) is to change the hyperparameters momentum and decay to match beta1 and beta2 respectively. Is that correct?
I find the documentation quite clear; I will paste the algorithm here in pseudo-code:
Your parameters:
learning_rate: between 1e-4 and 1e-2 is standard
beta1: 0.9 by default
beta2: 0.999 by default
epsilon: 1e-08 by default
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
Initialization:
m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)
m_t and v_t will keep track of a moving average of the gradient and its square, for each parameter of the network. (So if you have 1M parameters, Adam will keep in memory 2M more parameters.)
At each iteration t, and for each parameter of the model:
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * gradient
v_t <- beta2 * v_{t-1} + (1 - beta2) * gradient ** 2
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
Here lr_t is a bit different from learning_rate, because for early iterations the moving averages have not converged yet, so we have to normalize by multiplying by sqrt(1 - beta2^t) / (1 - beta1^t). When t is high (t > 1./(1.-beta2)), lr_t is almost equal to learning_rate.
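As a sanity check, here is a small NumPy sketch of a single Adam step following the pseudo-code above (the function and variable names are illustrative, not part of the TensorFlow API):
import numpy as np

def adam_step(variable, gradient, m, v, t,
              learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # One Adam update for a single parameter tensor, mirroring the pseudo-code
    t = t + 1
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t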
To answer your question, you just need to pass a fixed learning rate, keep beta1 and beta2 default values, maybe modify epsilon, and Adam will do the magic :)
Link with RMSProp
Adam with beta1=0 is equivalent to RMSProp with momentum=0 (with beta1=0 the first moment m_t is just the current gradient). The argument beta2 of Adam and the argument decay of RMSProp are the same.
However, RMSProp does not keep a moving average of the gradient, but it can maintain a momentum, like MomentumOptimizer.
A detailed description of RMSProp:
- maintain a moving (discounted) average of the square of gradients
- divide the gradient by the root of this average
- (it can maintain a momentum)
Here is the pseudo-code:
v_t <- decay * v_{t-1} + (1-decay) * gradient ** 2
mom_t <- momentum * mom_{t-1} + learning_rate * gradient / sqrt(v_t + epsilon)
variable <- variable - mom
RMSProp and Adam both have adaptive learning rates.
The basic RMSProp update:
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
You can see this originally has two parameters, decay_rate and eps.
Then we can add momentum to make our gradient more stable. Then we can write:
cache = decay_rate * cache + (1 - decay_rate) * dx**2
m = beta1*m + (1-beta1)*dx   # beta1 = momentum parameter in the doc
x += - learning_rate * m / (np.sqrt(cache) + eps)
Now you can see that if we keep beta1 = 0, then it's RMSProp without the momentum.
Then, the basics of Adam.
In CS231n, Andrej Karpathy initially described Adam like this:
Adam is a recently proposed update that looks a bit like RMSProp with momentum
So yes! Then what makes it different from RMSProp with momentum?
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
He also mentioned that in the update equation m and v are smoother.
So the difference from RMSProp is that the update is less noisy.
What causes this noise?
Well, in the initialization procedure we initialize m and v as zero:
m = v = 0
In order to reduce this initialization effect, it helps to have some warm-up (bias correction). The equations then become:
m = beta1*m + (1-beta1)*dx        # beta1 = 0.9, beta2 = 0.999
mt = m / (1 - beta1**t)            # bias-corrected first moment
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1 - beta2**t)            # bias-corrected second moment
x += - learning_rate * mt / (np.sqrt(vt) + eps)
Now run this for a few iterations and pay attention to the bias-correction lines (mt and vt): you can see that as t (the iteration number) increases, mt approaches m (and vt approaches v), so the correction mainly matters during the first iterations.
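To make the effect of the bias correction concrete, here is a tiny sketch (beta1 = 0.9 as above) printing the correction factor 1/(1 - beta1**t) for a few values of t:
beta1 = 0.9
for t in [1, 10, 100, 1000]:
    correction = 1. / (1. - beta1 ** t)
    print(t, correction)   # roughly 10.0 at t=1, 1.5 at t=10, and 1.0 for large t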
I implemented linear regression with gradient descent in Python. To see how well it is doing I compared it with scikit-learn's LinearRegression() class. For some reason, sklearn always outperforms my program by an MSE of 3 on average (I am using the Boston Housing dataset for testing). I understand that I am currently not doing gradient checking to check for convergence, but I am allowing for many iterations and have set the learning rate low enough that it SHOULD converge. Is there any clear bug in my learning algorithm implementation? Here is my code:
import numpy as np
from sklearn.linear_model import LinearRegression

def getWeights(x):
    lenWeights = len(x[1,:]);
    weights = np.random.rand(lenWeights)
    bias = np.random.random();
    return weights,bias

def train(x,y,weights,bias,maxIter):
    converged = False;
    iterations = 1;
    m = len(x);
    alpha = 0.001;
    while not converged:
        for i in range(len(x)):
            # Dot product of weights and training sample
            hypothesis = np.dot(x[i,:], weights) + bias;
            # Calculate gradient
            error = hypothesis - y[i];
            grad = (alpha * 1/m) * ( error * x[i,:] );
            # Update weights and bias
            weights = weights - grad;
            bias = bias - alpha * error;
            iterations = iterations + 1;
            if iterations > maxIter:
                converged = True;
                break
    return weights, bias

def predict(x, weights, bias):
    return np.dot(x,weights) + bias

if __name__ == '__main__':
    data = np.loadtxt('housing.txt');
    x = data[:,:-1];
    y = data[:,-1];
    for i in range(len(x[1,:])):
        x[:,i] = ( (x[:,i] - np.min(x[:,i])) / (np.max(x[:,i]) - np.min(x[:,i])) );
    initialWeights,initialBias = getWeights(x);
    weights,bias = train(x,y,initialWeights,initialBias,55000);
    pred = predict(x, weights,bias);
    MSE = np.mean(abs(pred - y));
    print "This Program MSE: " + str(MSE)
    sklearnModel = LinearRegression();
    sklearnModel = sklearnModel.fit(x,y);
    sklearnModel = sklearnModel.predict(x);
    skMSE = np.mean(abs(sklearnModel - y));
    print "Sklearn MSE: " + str(skMSE)
First, make sure that you are computing the correct objective function value. The linear regression objective should be .5*np.mean((pred-y)**2), rather than np.mean(abs(pred - y)).
You are actually running a stochastic gradient descent (SGD) algorithm (running a gradient iteration on individual examples), which should be distinguished from "gradient descent".
SGD is a good learning method, but a bad optimization method - it can take many iterations to converge to a minimum of the empirical error (http://leon.bottou.org/publications/pdf/nips-2007.pdf).
For SGD to converge, the learning rate must be restricted. Typically, the learning rate is set to the base learning rate divided by the number of iterations, something like alpha/(iterations+1), using the variables in your code.
You also include a multiple of 1/m in your gradient, which is typically not used in SGD updates.
To test your SGD implementation, rather than evaluating the error on the dataset that you trained with, split the dataset into a training set and a test set, and evaluate the error on this test set after training with both methods. The training/test set split will allow you to estimate the performance of your algorithm as a learning algorithm (estimate the expected error) rather than as an optimization algorithm (minimize the empirical error).
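To make those suggestions concrete, here is a hedged sketch of a per-example SGD loop with a decaying step size and without the 1/m factor, plus evaluation with the squared-error objective on a held-out split (the function name, the default alpha, and the 80/20 split are illustrative choices, not taken from the question):
import numpy as np

def train_sgd(x, y, weights, bias, max_iter, alpha=0.01):
    # Per-example SGD with a decaying step size alpha / (t + 1), no 1/m factor
    t = 0
    while t < max_iter:
        for i in range(len(x)):
            error = np.dot(x[i, :], weights) + bias - y[i]
            step = alpha / (t + 1)
            weights = weights - step * error * x[i, :]
            bias = bias - step * error
            t += 1
            if t >= max_iter:
                break
    return weights, bias

# Evaluate the actual squared-error objective on a held-out test set:
# n_train = int(0.8 * len(x))
# weights, bias = train_sgd(x[:n_train], y[:n_train], initialWeights, initialBias, 55000)
# pred = np.dot(x[n_train:], weights) + bias
# test_objective = 0.5 * np.mean((pred - y[n_train:]) ** 2)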
Try increasing your iteration value. This should, hopefully, allow your algorithm to converge on a value that is closer to the global minimum. Keep in mind that you are not using L-BFGS, which can converge much faster than plain gradient descent or even SGD.
Also try using the normal equation as another way to do linear regression; a sketch is given after the link below.
http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/.
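For reference, a minimal sketch of the normal equation on the same data layout (a column of ones is appended so the bias is learned as an extra weight; the function name is illustrative):
import numpy as np

def normal_equation(x, y):
    # Solve (X^T X) theta = X^T y in closed form
    X = np.hstack([x, np.ones((len(x), 1))])
    theta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
    return theta[:-1], theta[-1]   # weights, bias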