Tensorflow Adam optimizer vs Keras Adam optimizer - python

I originally developed a classifier in Keras, where my optimizer was very easy to apply decay to.
adam = keras.optimizers.Adam(decay=0.001)
Recently I tried to change the entire code to pure Tensorflow, and cannot figure out how to correctly apply the same decay mechanism to my optimizer.
optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss=loss,global_step=tf.train.get_global_step())
How do I apply the same learning rate decay seen in my Keras code snippet to my Tensorflow snippet?

You can find decent documentation about learning rate decay in TensorFlow:
...
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100000, 0.96, staircase=True)
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)
tf.train.exponential_decay applies exponential decay to the learning rate.
Other decays:
inverse_time_decay
polynomial_decay
linear_cosine_decay
exponential_decay
cosine_decay
cosine_decay_restarts
natural_exp_decay
noisy_linear_cosine_decay
Keras implements decay in its Adam optimizer similarly to the line below, which is very close to inverse_time_decay in TensorFlow:
lr = self.lr * (1. / (1. + self.decay * self.iterations))
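If you want to reproduce the Keras behaviour in graph-style TensorFlow, a sketch along these lines should be close (my own example, not from the question; it assumes a loss tensor named loss is already defined and reuses the decay value 0.001 from the Keras snippet):
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
# decay_steps=1 applies the decay per step, mirroring Keras' iterations counter.
learning_rate = tf.train.inverse_time_decay(
    learning_rate=0.001,      # Adam's default starting learning rate
    global_step=global_step,
    decay_steps=1,
    decay_rate=0.001)         # the `decay` used in the Keras snippet
optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.minimize(loss=loss, global_step=global_step)  # assumes `loss` exists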

You can find some useful hints about what you want to do here: https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/.
To answer your question, I quote this source:
The callbacks operate separately from the optimization algorithm, although they adjust the learning rate used by the optimization algorithm. It is recommended to use the SGD when using a learning rate schedule callback.
Based on this article you will see how to use keras.callbacks, and hopefully succeed in setting the learning rate of the Keras Adam optimizer as you wish. Note, though, that this is not recommended (I haven't tried it myself).
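For illustration, here is a toy example of a learning rate schedule callback paired with SGD, as the quoted advice recommends (my own sketch, not from the article; the model and data are placeholders):
import numpy as np
from tensorflow import keras

def schedule(epoch):
    # inverse-time style decay per epoch, echoing lr * 1/(1 + decay * iterations)
    return 0.1 / (1.0 + 0.01 * epoch)

model = keras.Sequential([keras.layers.Dense(1, input_shape=(10,))])  # placeholder model
model.compile(optimizer=keras.optimizers.SGD(0.1), loss="mse")
x, y = np.random.rand(256, 10), np.random.rand(256, 1)                # dummy data
model.fit(x, y, epochs=5,
          callbacks=[keras.callbacks.LearningRateScheduler(schedule, verbose=1)])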


Why is this tensorflow training taking so long?

I'm learning DRL with the book Deep Reinforcement Learning in Action. In chapter 3, they present the simple game Gridworld (instructions here, in the rules section) with the corresponding code in PyTorch.
I've experimented with the code, and it takes less than 3 minutes to train the network to an 89% win rate (it won 89 of 100 games after training).
As an exercise, I have migrated the code to TensorFlow. All the code is here.
The problem is that with my TensorFlow port it takes nearly 2 hours to train the network to a win rate of 84%. Both versions use only the CPU to train (I don't have a GPU).
The training loss figures seem correct, and so does the win rate (we have to take into consideration that the game is random and can have impossible states). The problem is the performance of the overall process.
I'm doing something terribly wrong, but what?
The main differences are in the training loop. In torch it is this:
loss_fn = torch.nn.MSELoss()
learning_rate = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
....
Q1 = model(state1_batch)
with torch.no_grad():
    Q2 = model2(state2_batch) #B
Y = reward_batch + gamma * ((1-done_batch) * torch.max(Q2,dim=1)[0])
X = Q1.gather(dim=1,index=action_batch.long().unsqueeze(dim=1)).squeeze()
loss = loss_fn(X, Y.detach())
optimizer.zero_grad()
loss.backward()
optimizer.step()
and in the TensorFlow version:
loss_fn = tf.keras.losses.MSE
learning_rate = 1e-3
optimizer = tf.keras.optimizers.Adam(learning_rate)
...
Q2 = model2(state2_batch) #B
with tf.GradientTape() as tape:
    Q1 = model(state1_batch)
    Y = reward_batch + gamma * ((1-done_batch) * tf.math.reduce_max(Q2, axis=1))
    X = [Q1[i][action_batch[i]] for i in range(len(action_batch))]
    loss = loss_fn(X, Y)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
Why is the training taking so long?
Why is TensorFlow slow?
TensorFlow has two execution modes: eager execution and graph mode. Since version 2, TensorFlow defaults to eager execution. Eager execution is great, as it enables you to write code close to how you would write standard Python. It's easier to write and easier to debug. Unfortunately, it's really not as fast as graph mode.
So the idea is, once the function is prototyped in eager mode, to make TensorFlow execute it in graph mode. For that you can use tf.function, which compiles a callable into a TensorFlow graph. Once the function is compiled into a graph, the performance gain is usually significant. The recommended approach when developing in TensorFlow is the following:
Debug in eager mode, then decorate with @tf.function.
Don't rely on Python side effects like object mutation or list appends.
tf.function works best with TensorFlow ops; NumPy and Python calls are converted to constants.
I would add: think about the critical parts of your program, and which ones should be converted first into graph mode. It's usually the parts where you call a model to get a result. It's where you will see the best improvements.
You can find more information in the following guides:
Better performance with tf.function
Introduction to graphs and tf.function
Applying tf.function to your code
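Before modifying the training loop, a quick way to see the eager-versus-graph gap for yourself is to time the same function both ways. The snippet below is my own toy illustration (the function and shapes are made up), not code from the question:
import timeit
import tensorflow as tf

# Made-up computation, purely to illustrate the eager vs. graph-mode gap.
def dense_steps(x, w):
    for _ in range(100):
        x = tf.nn.tanh(tf.matmul(x, w))
    return x

graph_steps = tf.function(dense_steps)  # compiled into a TensorFlow graph

x = tf.random.normal((64, 64))
w = tf.random.normal((64, 64)) * 0.1
graph_steps(x, w)  # the first call traces and compiles the graph

print("eager:", timeit.timeit(lambda: dense_steps(x, w), number=100))
print("graph:", timeit.timeit(lambda: graph_steps(x, w), number=100))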
So, there are at least two things you can change in your code to make it run quite a bit faster:
The first one is to not use model.predict on a small amount of data. The function is made to work on a huge dataset or on a generator. (See this comment on GitHub.) Instead, you should call the model directly, and for a performance enhancement, you can wrap the call to the model in a tf.function.
Model.predict is a top-level API designed for batch-predicting outside of any loops, with the full features of the Keras APIs.
The second one is to make your training step a separate function, and to decorate that function with @tf.function.
So, I would declare the following things before your training loop:
# to call instead of model.predict
model_func = tf.function(model)

def get_train_func(model, model2, loss_fn, optimizer):
    """Wrapper that creates a train step using the two models passed"""
    @tf.function
    def train_func(state1_batch, state2_batch, done_batch, reward_batch, action_batch):
        Q2 = model2(state2_batch) #B
        with tf.GradientTape() as tape:
            Q1 = model(state1_batch)
            Y = reward_batch + gamma * ((1-done_batch) * tf.math.reduce_max(Q2, axis=1))
            # gather is more efficient than a list comprehension, and needed in a tf.function
            X = tf.gather(Q1, action_batch, batch_dims=1)
            loss = loss_fn(X, Y)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    return train_func

# train step is a callable
train_step = get_train_func(model, model2, loss_fn, optimizer)
And you can use that function in your training loop:
if len(replay) > batch_size:
    minibatch = random.sample(replay, batch_size)
    state1_batch = np.array([s1 for (s1,a,r,s2,d) in minibatch]).reshape((batch_size, 64))
    action_batch = np.array([a for (s1,a,r,s2,d) in minibatch])  # TODO: possible differences
    reward_batch = np.float32([r for (s1,a,r,s2,d) in minibatch])
    state2_batch = np.array([s2 for (s1,a,r,s2,d) in minibatch]).reshape((batch_size, 64))
    done_batch = np.array([d for (s1,a,r,s2,d) in minibatch]).astype(np.float32)

    loss = train_step(state1_batch, state2_batch, done_batch, reward_batch, action_batch)
    losses.append(loss)
There are other changes you could make to make your code more TensorFlow-esque, but with those modifications, your code takes ~2 minutes on my CPU (with a 97% win rate).

YOLOv3-tensorflow not converging

I am trying to implement YOLOv3 in TensorFlow. I have taken help from online repositories and was successful in converting the darknet weights to TensorFlow and running inference.
Now, I am trying to train the model using the YOLO loss as implemented here.
I am using the following code snippet to do so:
with tf.name_scope('Loss_and_Detect'):
    yolo_loss = compute_loss(output, y_true, anchors, config.num_classes, print_loss=False)
    tf.summary.scalar('YOLO_loss', yolo_loss)
    variables = tf.trainable_variables()
    # Variables to be optimized by train_op if the pre-trained darknet-53 is used as is
    if config.pre_train:
        variables = variables[312:]  # Get the weights after the 52nd conv-layer (darknet-53)
    # 5e-4 as used in the paper
    l2_loss = config.weight_decay * tf.add_n([tf.nn.l2_loss(tf.cast(v, dtype=tf.float32)) for v in variables])
    loss = yolo_loss + l2_loss
    tf.summary.scalar('L2_loss', l2_loss)
    tf.summary.scalar('Total_loss', loss)

# Define an optimizer for minimizing the computed loss
with tf.name_scope('Optimizer'):
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = optimizer.minimize(loss=loss, global_step=global_step, var_list=variables)
The problem is that my YOLO_loss is stuck at ~7-8 and the L2_loss keeps increasing.
Here is a snapshot of the TensorBoard with learning rate 1e-6 and exponential decay applied to it (decay_rate=0.8).
I cannot figure out what I am missing/doing wrong.
Any help is appreciated.

Keras: how learning rate changes when Adadelta optimizer is used?

For example, I use Adadelta as the optimizer when compiling the network model. Will the learning rate then change over time by this rule (and what is iterations?), and how can I log the learning rate value to the console?
model.compile(loss=keras.losses.mean_squared_error,
              optimizer=keras.optimizers.Adadelta())
In the documentation, is lr just the starting learning rate?
The rule is related to updates with decay. Adadelta is an adaptive learning rate method which uses an exponentially decaying average of gradients.
Looking at the Keras source code, the learning rate is recalculated based on decay like this:
lr = self.lr
if self.initial_decay > 0:
    lr *= (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
So yes, lr is just the starting learning rate.
To print it after every epoch, as @orabis mentioned, you can make a callback class:
from keras import backend as K
from keras.callbacks import Callback

class YourLearningRateTracker(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        decay = self.model.optimizer.decay
        iterations = self.model.optimizer.iterations
        lr_with_decay = lr / (1. + decay * K.cast(iterations, K.dtype(decay)))
        print(K.eval(lr_with_decay))
and then add its instance to the callbacks when calling model.fit() like:
model.fit(..., callbacks=[YourLearningRateTracker()])
However, note that, by default, the decay parameter for Adadelta is zero and is not part of the "standard" arguments, so your learning rate would not change its value when using default arguments.
I suspect that decay is not intended to be used with Adadelta.
On the other hand, rho parameter, which is nonzero by default, doesn’t describe the decay of the learning rate, but corresponds to the fraction of gradient to keep at each time step (according to the Keras documentation).
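For intuition, here is a tiny numeric sketch (my own, not the Keras implementation) of how rho acts as the exponential moving-average factor in Adadelta's two accumulators, following the update rule from the Adadelta paper:
import numpy as np

rho, eps = 0.95, 1e-7
acc_grad, acc_update = 0.0, 0.0      # running averages of g^2 and update^2

for g in [0.5, 0.4, 0.6, 0.3]:       # made-up gradients for a single parameter
    acc_grad = rho * acc_grad + (1 - rho) * g ** 2
    update = -np.sqrt(acc_update + eps) / np.sqrt(acc_grad + eps) * g
    acc_update = rho * acc_update + (1 - rho) * update ** 2
    print("g = %.2f  effective step = %.6f" % (g, update))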
I found some relevant information on this Github issue, and by asking a similar question.

TensorFlow why my cost function doesn't decrease?

I'm using a very simple NN with a normalized word2vec as input.
When running my training (based on mini batches), the training cost starts around 1020 and decreases to around 1000, but never below that, and my accuracy is around 50%.
Why doesn't the cost decrease? How can I verify that the weight matrix is updated at each run?
apply_weights_OP = tf.matmul(X, weights, name="apply_weights")
add_bias_OP = tf.add(apply_weights_OP, bias, name="add_bias")
activation_OP = tf.nn.sigmoid(add_bias_OP, name="activation")
cost_OP = tf.nn.l2_loss(activation_OP-yGold, name="squared_error_cost")
optimizer = tf.train.AdamOptimizer(0.001)
global_step = tf.Variable(0, name='global_step', trainable=False)
training_OP = optimizer.minimize(cost_OP, global_step=global_step)
correct_predictions_OP = tf.equal(
    tf.argmax(activation_OP, 0),
    tf.argmax(yGold, 0)
)
accuracy_OP = tf.reduce_mean(tf.cast(correct_predictions_OP, "float"))

newCost, train_accuracy, _ = sess.run(
    [cost_OP, accuracy_OP, training_OP],
    feed_dict={
        X: trainX[indice_bas: indice_haut],
        yGold: trainY[indice_bas: indice_haut]
    }
)
Thanks
Try using cross-entropy instead of the L2 loss; also, there is no real point in having an activation function on your output layer.
The examples that ship with TensorFlow actually have a basic model that is very similar to what you are trying to do.
By the way, it might also be that the problem you are trying to learn is simply not solvable by a simple linear model (which is what you are doing); try using a deeper model. Here is an example of a 2-layer-deep multilayer perceptron.
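As a concrete illustration of the first suggestion, here is a minimal TF1-style sketch (my own, with placeholder shapes, not the asker's full code) that computes logits without an output activation and feeds them to a cross-entropy loss instead of tf.nn.l2_loss:
import tensorflow as tf

num_features, num_classes = 300, 2   # placeholder sizes for illustration
X = tf.placeholder(tf.float32, [None, num_features])
yGold = tf.placeholder(tf.float32, [None, num_classes])
weights = tf.Variable(tf.random_normal([num_features, num_classes], stddev=0.1))
bias = tf.Variable(tf.zeros([num_classes]))

# Logits: no activation on the output layer.
logits_OP = tf.add(tf.matmul(X, weights), bias, name="logits")
# Cross-entropy loss; the softmax is folded into the loss op.
cost_OP = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=yGold, logits=logits_OP),
    name="cross_entropy_cost")
training_OP = tf.train.AdamOptimizer(0.001).minimize(cost_OP)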

Tensorflow: global_step not incremented; hence exponentialDecay not working

I'm trying to learn TensorFlow, and I wanted to use TensorFlow's cifar10 tutorial framework and train it on top of MNIST (combining two tutorials).
In cifar10.py's train method:
cifar10.train(total_loss, global_step):
lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                global_step,
                                100,
                                0.1,
                                staircase=True)
tf.scalar_summary('learning_rate', lr)
tf.scalar_summary('global_step', global_step)
The global_step is initialized and passed in, it does increase by 1 per step, and the learning rate decays properly; the source code can be found in TensorFlow's cifar10 tutorial.
However, when I tried to do the same for my revised mnist.py's train method code:
mnist.training(loss, batch_size, global_step):
# Decay the learning rate exponentially based on the number of steps.
lr = tf.train.exponential_decay(0.1,
                                global_step,
                                100,
                                0.1,
                                staircase=True)
tf.scalar_summary('learning_rate1', lr)
tf.scalar_summary('global_step1', global_step)
# Create the gradient descent optimizer with the given learning rate.
optimizer = tf.train.GradientDescentOptimizer(lr)
# Create a variable to track the global step.
global_step = tf.Variable(0, name='global_step', trainable=False)
# Use the optimizer to apply the gradients that minimize the loss
# (and also increment the global step counter) as a single training step.
train_op = optimizer.minimize(loss, global_step=global_step)
tf.scalar_summary('global_step2', global_step)
tf.scalar_summary('learning_rate2', lr)
return train_op
The global step is initialized (in both cifar10 and my mnist file) as:
with tf.Graph().as_default():
    global_step = tf.Variable(0, trainable=False)
    ...
    # Build a Graph that trains the model with one batch of examples and
    # updates the model parameters.
    train_op = mnist10.training(loss, batch_size=100,
                                global_step=global_step)
Here, I record the scalar_summary of global step and learning rate twice:
learning_rate1 and learning_rate2 are both the same and constant at 0.1 (initial learning rate).
global_step1 is also constant at 0 across 2000 steps.
global_step2 is increasing linearly 1 per step.
The more detailed code structure can be found at:
https://bitbucket.org/jackywang529/tesorflow-sandbox/src
It's quite confusing to me why this might be the case. I thought everything was set up symbolically, so once the program starts running the global step should be incremented no matter where I write the summary, and I think this constant global step is why my learning rate is constant. Of course I might have made some simple mistake, and I would be glad to get help or an explanation.
(Plot: global_steps written before and after the minimize function is called)
You are passing an argument called global_step to mnist.training, AND also creating a variable called global_step inside mnist.training. The one used for tracking the exponential_decay is the variable that is passed in, but the one that is actually incremented (by being passed to optimizer.minimize) is the newly created variable. Simply remove the following statement from mnist.training and things should work:
global_step = tf.Variable(0, name='global_step', trainable=False)
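For reference, here is a sketch of what the fixed mnist.training could look like, reusing only the code already shown in the question (including its old tf.scalar_summary calls) with the shadowing variable removed:
import tensorflow as tf

def training(loss, batch_size, global_step):
    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(0.1, global_step, 100, 0.1, staircase=True)
    tf.scalar_summary('learning_rate', lr)
    optimizer = tf.train.GradientDescentOptimizer(lr)
    # minimize() now increments the global_step that was passed in,
    # so exponential_decay sees it advance and the learning rate actually decays.
    train_op = optimizer.minimize(loss, global_step=global_step)
    tf.scalar_summary('global_step', global_step)
    return train_op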
