I am trying to implement YOLOv3 in TensorFlow. I have taken help from online repositories and was able to convert the Darknet weights to TensorFlow and run inference.
Now I am trying to train the model using the YOLO loss as implemented here.
I am using the following code snippet to do so:
with tf.name_scope('Loss_and_Detect'):
    yolo_loss = compute_loss(output, y_true, anchors, config.num_classes, print_loss=False)
    tf.summary.scalar('YOLO_loss', yolo_loss)
    variables = tf.trainable_variables()
    # Variables to be optimized by train_op if the pre-trained darknet-53 is used as is
    if config.pre_train:
        variables = variables[312:]  # Get the weights after the 52nd conv-layer (darknet-53)
    # 5e-4 as used in the paper
    l2_loss = config.weight_decay * tf.add_n([tf.nn.l2_loss(tf.cast(v, dtype=tf.float32)) for v in variables])
    loss = yolo_loss + l2_loss
    tf.summary.scalar('L2_loss', l2_loss)
    tf.summary.scalar('Total_loss', loss)

# Define an optimizer for minimizing the computed loss
with tf.name_scope('Optimizer'):
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        train_op = optimizer.minimize(loss=loss, global_step=global_step, var_list=variables)
The problem is that my YOLO_loss gets stuck around 7-8 and the L2_loss keeps increasing.
Here is a snapshot of the TensorBoard run with a learning rate of 1e-6 and exponential decay applied to it (decay_rate=0.8).
I cannot figure out what I am missing or doing wrong.
Any help is appreciated.
I'm working on a custom transformer model whose training method goes like this:

# Simplified version of my training method, where model = myTransformerModel()
for window in data:  # step through data
    l1 = model(window)
    loss = torch.mean(l1)
    optimizer.zero_grad()
    loss.backward(retain_graph=True)
    optimizer.step()
scheduler.step()
I'm trying to recreate this in TensorFlow; currently it looks like this:

for window in data:  # step through data
    with tf.GradientTape() as tape:
        l1 = model.call(window)
        loss = tf.reduce_mean(l1)
    train = optimizer.minimize(loss, var_list=model.trainable_variables, tape=tape)
This works, but it causes the scheduler to step with every window, which throws off the learning rate.
I have also tried this in place of the minimize line:

gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))

Is there a good way to make the TensorFlow model behave more like the PyTorch one? Is there a better way to implement my training steps with GradientTape?
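For what it's worth, here is a minimal sketch of one possible arrangement, not taken from the original post: compute and apply gradients explicitly per window, and drive the learning rate yourself so it only changes once per pass over the data. The stand-in model, data, and lr_schedule below are placeholders for illustration.

import tensorflow as tf

# Placeholder model and data so the sketch runs on its own; swap in the transformer and windows.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
data = [tf.random.normal((8, 4)) for _ in range(5)]   # "windows"
num_epochs = 3

def lr_schedule(epoch):
    # Illustrative per-epoch decay; replace with whatever rule the scheduler used.
    return 1e-3 * (0.9 ** epoch)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule(0))

for epoch in range(num_epochs):
    # Change the learning rate once per epoch instead of once per window.
    optimizer.learning_rate = lr_schedule(epoch)
    for window in data:
        with tf.GradientTape() as tape:
            l1 = model(window, training=True)
            loss = tf.reduce_mean(l1)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))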
I would like to perform transfer learning with a pretrained Keras model:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

base_model = keras.applications.MobileNetV2(input_shape=(96, 96, 3), include_top=False, pooling='avg')
x = base_model.outputs[0]
outputs = layers.Dense(10, activation=tf.nn.softmax)(x)
model = keras.Model(inputs=base_model.inputs, outputs=outputs)
Training with the Keras compile/fit functions converges:
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
history = model.fit(train_data, epochs=1)
The results are: loss: 0.4402 - accuracy: 0.8548
I want to train with tf.GradientTape instead, but it doesn't converge:
optimizer = keras.optimizers.Adam()
train_loss = keras.metrics.Mean()
train_acc = keras.metrics.SparseCategoricalAccuracy()

def train_step(data, labels):
    with tf.GradientTape() as gt:
        pred = model(data)
        loss = keras.losses.SparseCategoricalCrossentropy()(labels, pred)
    grads = gt.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    train_loss(loss)
    train_acc(labels, pred)

for xs, ys in train_data:
    train_step(xs, ys)
print('train_loss = {:.3f}, train_acc = {:.3f}'.format(train_loss.result(), train_acc.result()))
But the results are: train_loss = 7.576, train_acc = 0.101
If I only train the last layer by setting
base_model.trainable = False
It converges and the results are: train_loss = 0.525, train_acc = 0.823
What's the problem with the code? How should I modify it? Thanks.
Try ReLU as the activation function. It may be a vanishing gradient issue, which can occur when you use an activation function other than ReLU.
Following my comment, the reason it didn't converge is that you picked a learning rate that was too big. This causes the weights to change too much and the loss to explode. When you set base_model.trainable to False, most of the weights in the network were fixed and the learning rate was a good fit for your last layer. Here's a picture:
As a general rule, the learning rate should be chosen separately for each experiment.
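For instance (a hedged sketch; the value 1e-4 is only an illustration, not taken from the answer), you can pass an explicit, smaller learning rate when fine-tuning the whole network:

from tensorflow import keras

# Illustrative value only; tune it per experiment.
optimizer = keras.optimizers.Adam(learning_rate=1e-4)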
Edit: Following Wilson's comment, I'm not sure this is the reason you get different results, but it could be:
When you specify your loss, it is computed on each element of the batch; to get the loss of the batch you then take either the sum or the mean of those per-element losses, and depending on which one you choose you get a different magnitude. For example, if your batch size is 64, summing the losses yields a loss 64 times bigger, which yields gradients 64 times bigger, so choosing sum over mean with a batch size of 64 is like picking a 64 times bigger learning rate.
So maybe the reason you get different results is that, by default, a keras.losses object wrapped in model.compile uses a different reduction method. In the same vein, if the loss is reduced with a sum, its magnitude depends on the batch size: with twice the batch size you get (on average) twice the loss, and twice the gradient, so it's like doubling the learning rate.
My advice is to check the reduction method used by the loss to make sure it's the same in both cases and, if it's sum, to check that the batch size is the same. I would advise using mean reduction in general, since it is not influenced by the batch size.
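A small sketch of making the reduction explicit in both paths (using the reduction argument of the TF 2.x tf.keras losses; the SUM variant is shown only to illustrate the contrast):

from tensorflow import keras

# Mean over the batch: loss magnitude does not depend on batch size.
loss_mean = keras.losses.SparseCategoricalCrossentropy(
    reduction=keras.losses.Reduction.SUM_OVER_BATCH_SIZE)

# Sum over the batch: loss (and gradients) scale with batch size.
loss_sum = keras.losses.SparseCategoricalCrossentropy(
    reduction=keras.losses.Reduction.SUM)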
I am coding a WGAN in TensorFlow on the MNIST dataset. It works well, but I am finding it difficult to clip the weights of the discriminator model to [-0.01, 0.01] in TensorFlow. In Keras we can do weight clipping using:
for l in self.discriminator.layers:
    weights = l.get_weights()
    weights = [np.clip(w, -self.clip_value, self.clip_value) for w in weights]
    l.set_weights(weights)
I have found a TensorFlow doc for clipping the discriminator weights:
tf.contrib.gan.features.clip_discriminator_weights(
    optimizer,
    model,
    weight_clip
)
Other than this, there is not much given on how to use this function.
# my tf code
def generator(z):
    h = tf.nn.relu(layer_mlp(z, "g1", [10, 128]))
    prob = tf.nn.sigmoid(layer_mlp(h, "g2", [128, 784]))
    return prob

def discriminator(x):
    h = tf.nn.relu(layer_mlp(x, "d1", [784, 128]))
    logit = layer_mlp(h, "d2", [128, 1])
    prob = tf.nn.sigmoid(logit)
    return prob

G_sample = generator(z)
D_real = discriminator(x)
D_fake = discriminator(G_sample)

D_loss = tf.reduce_mean(D_real) - tf.reduce_mean(D_fake)
G_loss = -tf.reduce_mean(D_fake)

for epoch in epochs:
    # training the model
Adding to Yaakov's answer, you can use tf.clip_by_value with trainable_variables, as shown in this repo: https://github.com/hcnoh/WGAN-tensorflow2
for w in model.discriminator.trainable_variables:
    w.assign(tf.clip_by_value(w, -clip_const, clip_const))
You can use the function below to implement clipping in TensorFlow:
tf.clip_by_value(
    t,
    clip_value_min,
    clip_value_max,
    name=None
)
Please refer to the links below for how to use it in your code:
https://www.tensorflow.org/api_docs/python/tf/clip_by_value
https://github.com/wiseodd/generative-models/blob/master/GAN/wasserstein_gan/wgan_tensorflow.py
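As a rough sketch of how this fits the graph-mode code in the question (the way the discriminator variables are collected below is an assumption, since the original snippet does not show how layer_mlp names its variables), you can build the clip ops once and run them after each discriminator update, similar to what the wiseodd repository does:

import tensorflow as tf

clip_value = 0.01

# Assumption: discriminator variables can be selected by their name prefix ("d1", "d2").
theta_D = [v for v in tf.trainable_variables() if v.name.startswith("d")]

# One assign op per variable; running these clamps the weights to [-0.01, 0.01].
clip_D = [v.assign(tf.clip_by_value(v, -clip_value, clip_value)) for v in theta_D]

# In the training loop, after each discriminator optimizer step:
# sess.run(clip_D)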
I originally developed a classifier in Keras, where applying decay to my optimizer was very easy:

adam = keras.optimizers.Adam(decay=0.001)

Recently I tried to port the entire code to pure TensorFlow and cannot figure out how to correctly apply the same decay mechanism to my optimizer:

optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())

How do I apply the same learning rate decay seen in my Keras snippet to my TensorFlow snippet?
You can find decent documentation about decay in TensorFlow:
...
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100000, 0.96, staircase=True)
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)
tf.train.exponential_decay applies exponential decay to the learning rate.
Other decays:
inverse_time_decay
polynomial_decay
linear_cosine_decay
exponential_decay
cosine_decay
cosine_decay_restarts
natural_exp_decay
noisy_linear_cosine_decay
Keras implements decay in its Adam optimizer similarly to the line below, which is very close to inverse_time_decay in TensorFlow:
lr = self.lr * (1. / (1. + self.decay * self.iterations))
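Putting the two together, a minimal sketch (not from the answer itself) of reproducing the Keras-style decay with tf.train.inverse_time_decay and AdamOptimizer in TF 1.x; decay_steps=1 mimics the per-iteration decay of the Keras formula, and loss is the loss tensor from your own snippet:

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()

# lr = 0.001 / (1 + 0.001 * step), analogous to keras.optimizers.Adam(decay=0.001)
learning_rate = tf.train.inverse_time_decay(
    learning_rate=0.001,
    global_step=global_step,
    decay_steps=1,
    decay_rate=0.001)

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss=loss, global_step=global_step)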
You can find some useful hints on what you want to do here: https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/.
To answer your question, I quote this source:
The callbacks operate separately from the optimization algorithm, although they adjust the learning rate used by the optimization algorithm. It is recommended to use the SGD when using a learning rate schedule callback.
Based on this article, you will find how to use keras.callbacks and hopefully succeed in setting the learning rate of the Keras Adam optimizer as you wished. Note, though, that this is not recommended (I haven't tried it myself yet).
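For reference, a minimal sketch of the callback approach that article describes (the placeholder model, data, and decay value below are mine, not from the question); per the quote above, it uses SGD rather than Adam:

import numpy as np
from tensorflow import keras

# Placeholder model and data, only to make the sketch runnable.
model = keras.Sequential([keras.layers.Dense(1, input_shape=(4,))])
x = np.random.rand(32, 4)
y = np.random.rand(32, 1)

initial_lr = 0.01
decay = 0.001

def schedule(epoch, lr):
    # Time-based decay: lr = lr0 / (1 + decay * epoch)
    return initial_lr / (1.0 + decay * epoch)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=initial_lr), loss='mse')
model.fit(x, y, epochs=5,
          callbacks=[keras.callbacks.LearningRateScheduler(schedule)])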
I'm using a very simple NN with normalized word2vec vectors as input.
When running my training (based on mini-batches), the training cost starts around 1020 and decreases to around 1000, but never goes lower, and my accuracy stays around 50%.
Why doesn't the cost decrease? How can I verify that the weight matrix is actually updated at each run?
apply_weights_OP = tf.matmul(X, weights, name="apply_weights")
add_bias_OP = tf.add(apply_weights_OP, bias, name="add_bias")
activation_OP = tf.nn.sigmoid(add_bias_OP, name="activation")

cost_OP = tf.nn.l2_loss(activation_OP - yGold, name="squared_error_cost")

optimizer = tf.train.AdamOptimizer(0.001)
global_step = tf.Variable(0, name='global_step', trainable=False)
training_OP = optimizer.minimize(cost_OP, global_step=global_step)

correct_predictions_OP = tf.equal(
    tf.argmax(activation_OP, 0),
    tf.argmax(yGold, 0)
)
accuracy_OP = tf.reduce_mean(tf.cast(correct_predictions_OP, "float"))

newCost, train_accuracy, _ = sess.run(
    [cost_OP, accuracy_OP, training_OP],
    feed_dict={
        X: trainX[indice_bas:indice_haut],
        yGold: trainY[indice_bas:indice_haut]
    }
)
Thanks
Try using cross-entropy instead of the L2 loss; also, there is no real point in having an activation function on your output layer.
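A small sketch of that change, reusing the names from the question (the logits are add_bias_OP, i.e. the pre-sigmoid output), swapping the squared-error cost for sigmoid cross-entropy:

import tensorflow as tf

# Cross-entropy computed on the logits; the sigmoid is folded into the loss.
cost_OP = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=yGold, logits=add_bias_OP),
    name="cross_entropy_cost")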
The examples that ship with TensorFlow actually include a basic model that is very similar to what you are trying to do.
By the way, it might also be that the problem you are trying to learn is simply not solvable by a simple linear model (which is what you are building); try using a deeper model. Here is an example of a 2-layer multilayer perceptron.
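For illustration, a minimal sketch of such a 2-layer MLP in the same TF 1.x style as the question (the layer sizes and placeholder names below are assumptions, not taken from the linked example):

import tensorflow as tf

n_input, n_hidden, n_classes = 300, 256, 2   # placeholder sizes

X = tf.placeholder(tf.float32, [None, n_input])
yGold = tf.placeholder(tf.float32, [None, n_classes])

# Hidden layer with ReLU; the output layer is left as logits.
W1 = tf.Variable(tf.random_normal([n_input, n_hidden]))
b1 = tf.Variable(tf.zeros([n_hidden]))
hidden = tf.nn.relu(tf.matmul(X, W1) + b1)

W2 = tf.Variable(tf.random_normal([n_hidden, n_classes]))
b2 = tf.Variable(tf.zeros([n_classes]))
logits = tf.matmul(hidden, W2) + b2

cost_OP = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=yGold, logits=logits))
training_OP = tf.train.AdamOptimizer(0.001).minimize(cost_OP)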