TensorFlow why my cost function doesn't decrease?

TensorFlow why my cost function doesn't decrease? - python

I'm using a very simple NN with a normalized word2vec as input.
When running my train (based on the mini batch) the train cost start around 1020 and decrease around 1000 but never less than this and my accuracy is around 50%.
Why doesn't the cost decrease ? How can I verify that the weigth matrice is updated at each run?
apply_weights_OP = tf.matmul(X, weights, name="apply_weights")
add_bias_OP = tf.add(apply_weights_OP, bias, name="add_bias")
activation_OP = tf.nn.sigmoid(add_bias_OP, name="activation")
cost_OP = tf.nn.l2_loss(activation_OP-yGold, name="squared_error_cost")
optimizer = tf.train.AdamOptimizer(0.001)
global_step = tf.Variable(0, name='global_step', trainable=False)
training_OP = optimizer.minimize(cost_OP, global_step=global_step)
correct_predictions_OP = tf.equal(
tf.argmax(activation_OP, 0),
tf.argmax(yGold, 0)
)
accuracy_OP = tf.reduce_mean(tf.cast(correct_predictions_OP, "float"))
newCost, train_accuracy, _ = sess.run(
[cost_OP, accuracy_OP, training_OP],
feed_dict={
X: trainX[indice_bas: indice_haut],
yGold: trainY[indice_bas: indice_haut]
}
)
Thanks

try using cross entropy instead of the L2 loss, also there is no real point in having an activation function on your output layer.
The examples that ship with tensorflow actually have a basic model that is very similar to what you are trying.
btw: it might also be that the problem you are trying to learn is simply not solvable by a simple linear model (i.e. what you are trying to do), try using a deeper model. Here is an example of a 2 layer deep multilayer perceptron.

Related

Model not improving with GradientTape but with model.fit()

I am currently trying to train a model using tf.GradientTape, as model.fit(...) from keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, tf.GradientTape does not.
During training, the loss using the tf.GradientTape custom workflow will first slightly decrease, but then become stuck and not improve any further, no matter how many epochs I run. The chosen metric will also not change after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero to something very large. The running loss is more stable but shows the model not improving.
This is all in contrast to using model.fit(...), where loss and metrics are improving immediately.
My code:
def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
x1 = Input(62)
x2 = Input((62, 3))
x = Embedding(30, 100, mask_zero=True)(x1)
x = Concatenate()([x, x2])
x = Bidirectional(LSTM(500,
return_sequences=True,
kernel_regularizer=kernel_regularizer,
dropout=dropout,
recurrent_dropout=recurrent_dropout))(x)
x = Bidirectional(LSTM(500,
return_sequences=False,
kernel_regularizer=kernel_regularizer,
dropout=dropout,
recurrent_dropout=recurrent_dropout))(x)
x = Activation('softmax')(x)
x = Dense(1000)(x)
x = Dense(500)(x)
x = Dense(250)(x)
x = Dense(1, bias_initializer='ones')(x)
x = tf.math.abs(x)
return Model(inputs=[x1, x2], outputs=x)
optimizer = Adam(learning_rate=0.0001)
model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA
dat_train = tf.data.Dataset.from_generator(
generator= lambda: <load_function()>
output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)
# keras training
model.fit(dat_train, epochs=50)
# custom training
for epoch in range(50):
for (x1, x2), y in dat_train:
with tf.GradientTape() as tape:
y_pred = model((x1, x2), training=True)
loss = model.loss(y, y_pred)
grads = tape.gradient(loss, model.trainable_variables)
model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I could use relu at the output layer, however, I found the abs to be more robust. Changing it does not change the outcome. The input x1 of the model is a sequence, x2 are some additional features, that are later concatenated to the embedded x1 sequence. For my approach, I'm not using the MSE, but it works either way.
I could provide some data, however, my dataset is quite large, so I would need to extract a bit out of it.
All in all, my problem seems to be similar to:
Keras model doesn't train when using GradientTape
Edit 1
The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model.
Additionally, some things I noticed:
The custom training takes roughly 2x the amount of time compared to model.fit(...).
The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know if that's normal and don't know how to compare it to the gradients provided by model.fit(...).
Edit 2
I've added a Google Colab notebook to reproduce the issue:
https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing
The loss and MSE for 20 epochs is shown here:
custom training
keras training
While I only used a portion of my data in the notebook, it will still run for a very long time. For the custom training run, the loss for each batch is simply stored in losses. It matches the behavior in the custom training run image.
So far, I've noticed two ways of improving the performance of the custom training:
The usage of custom layer initialization
Using MSE as a loss function
Using the MSE, compared to my own loss function actually improves the custom training performance. Still, using MSE and/or different initialization won't come close to the performance of keras fit.

I have found the solution, it was a simple shape mismatch, which was somehow not picked up by any error check and worked both with my custom loss function and MSE. Using x = Reshape(())(x) as final layer did the trick.

Transfer learning with pretrained model by tf.GradientTape can't converge

I would like to perform transfer learning with pretrained model of keras
import tensorflow as tf
from tensorflow import keras
base_model = keras.applications.MobileNetV2(input_shape=(96, 96, 3), include_top=False, pooling='avg')
x = base_model.outputs[0]
outputs = layers.Dense(10, activation=tf.nn.softmax)(x)
model = keras.Model(inputs=base_model.inputs, outputs=outputs)
Training with keras compile/fit functions can converge
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
history = model.fit(train_data, epochs=1)
The results are: loss: 0.4402 - accuracy: 0.8548
I wanna train with tf.GradientTape, but it can't converge
optimizer = keras.optimizers.Adam()
train_loss = keras.metrics.Mean()
train_acc = keras.metrics.SparseCategoricalAccuracy()
def train_step(data, labels):
with tf.GradientTape() as gt:
pred = model(data)
loss = keras.losses.SparseCategoricalCrossentropy()(labels, pred)
grads = gt.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
train_loss(loss)
train_acc(labels, pred)
for xs, ys in train_data:
train_step(xs, ys)
print('train_loss = {:.3f}, train_acc = {:.3f}'.format(train_loss.result(), train_acc.result()))
But the results are: train_loss = 7.576, train_acc = 0.101
If I only train the last layer by setting
base_model.trainable = False
It converges and the results are: train_loss = 0.525, train_acc = 0.823
What's the problem with the codes? How should I modify it? Thanks

Try RELU as activation function. It may be Vanishing Gradient issue which occurs if you use activation function other than RELU.

Following my comment, the reason why it didn't converge is because you picked a learning rate that was too big. This causes the weight to change too much and the loss to explode. When setting base_model.trainable to False, most of the weight in the networks were fixed and the learning rate was a good fit for your last layers. Here's a picture :
As a general rule, your learning rate should always be chosen for each experiments.
Edit : Following Wilson's comment, I'm not sure this is the reason you have different results but this could be it :
When you specify your loss, your loss is computed on each element of the batch, then to get the loss of the batch, you can take the sum or the mean of the losses, depending on which one you chose, you get a different magnitude. For example, if your batch size is 64, summing the loss will yield you a 64 times bigger loss which will yield 64 times bigger gradient, so choosing sum over mean with a batch size 64 is like picking a 64 times bigger learning rate.
So maybe the reason you have different results is that by default a keras.losses wrapped in a model.compile has a different reduction method. In the same vein, if the loss is reduced by a sum method, the magnitude of the loss depends on the batch size, if you have twice the batch size, you get (on average) twice the loss, and twice the gradient and so it's like doubling the learning rate.
My advice is to check the reduction method used by the loss to be sure it's the same in both case, and if it's sum, to check that the batch size is the same. I would advise to use mean reduction in general since it's not influenced by batch size.

YOLOv3-tensorflow not converging

I am trying to implement YOLOv3 in tensorflow, I have taken help from online repositories and was successful in converting the darknet weights to tensorflow and run inference.
Now, I am trying to train the model using the YOLO loss as implemented here.
I am using the following code snipppet to do so:
with tf.name_scope('Loss_and_Detect'):
yolo_loss = compute_loss(output, y_true, anchors, config.num_classes, print_loss=False)
tf.summary.scalar('YOLO_loss', yolo_loss)
variables = tf.trainable_variables()
# Variables to be optimized by train_op if the pre-trained darknet-53 is used as is
if config.pre_train:
variables = variables[312:] # Get the weights after the 52nd conv-layer (darknet-53)
# 5e-4 as used in the paper
l2_loss = config.weight_decay * tf.add_n([tf.nn.l2_loss(tf.cast(v, dtype=tf.float32)) for v in variables])
loss = yolo_loss + l2_loss
tf.summary.scalar('L2_loss', l2_loss)
tf.summary.scalar('Total_loss', loss)
# Define an optimizer for minimizing the computed loss
with tf.name_scope('Optimizer'):
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_op = optimizer.minimize(loss=loss, global_step=global_step, var_list=variables)
The problem is my YOLO_loss is stuck at ~7-8 and the L2_loss keeps on increasing.
Here is a snapshot of the tensorboard with learning rate 1e-6 with exponential decay applied to it.(decay_rate=0.8)
I cannot figure out what an I missing/doing wrong.
Any help is appreciated.

XOR neural network, the losses don't go down

I'm using Mxnet to train a XOR neural network, but the losses don't go down, they are always above 0.5.
Below is my code in Mxnet 1.1.0; Python 3.6; OS X El Capitan 10.11.6
I tried 2 loss functions - squared loss and softmax loss, both didn't work.
from mxnet import ndarray as nd
from mxnet import autograd
from mxnet import gluon
import matplotlib.pyplot as plt
X = nd.array([[0,0],[0,1],[1,0],[1,1]])
y = nd.array([0,1,1,0])
batch_size = 1
dataset = gluon.data.ArrayDataset(X, y)
data_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True)
plt.scatter(X[:, 1].asnumpy(),y.asnumpy())
plt.show()
net = gluon.nn.Sequential()
with net.name_scope():
net.add(gluon.nn.Dense(2, activation="tanh"))
net.add(gluon.nn.Dense(1, activation="tanh"))
net.initialize()
softmax_cross_entropy = gluon.loss.SigmoidBCELoss()#SigmoidBinaryCrossEntropyLoss()
square_loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.3})
train_losses = []
for epoch in range(100):
train_loss = 0
for data, label in data_iter:
with autograd.record():
output = net(data)
loss = square_loss(output, label)
loss.backward()
trainer.step(batch_size)
train_loss += nd.mean(loss).asscalar()
train_losses.append(train_loss)
plt.plot(train_losses)
plt.show()

I got this question figured out in somewhere else, so I'm going to post the answer here.
Basically, the issue in my original code was multi-dimensional.
Weight initialization. Notice that I used default initialization
net.initialize()
which actually does
net.initialize(initializer.Uniform(scale=0.07))
Apparently these initial weights were too small, and the network could never jump out of them. So the fix is
net.initialize(mx.init.Uniform(1))
After doing this, the network could converge using sigmoid/tanh as the activation, and using L2Loss as the loss function. And it worked with sigmoid and SigmoidBCELoss. However, it still didn't work with tanh and SigmoidBCELoss, which can be fixed by the second item below.
SigmoidBCELoss has to be used in these 2 scenarios in the output layer.
2.1. Linear activation and SigmoidBCELoss(from_sigmoid=False);
2.2. Non-linear activation and SigmoidBCELoss(from_sigmoid=True), in which the output of the non-linear function falls into (0, 1).
In my original code, when I used SigmoidBCELoss, I was using either all sigmoid, or all tanh. So just need to change the activation in the output layer from tanh to sigmoid, and the network could converge. I can still have tanh in the hidden layers.
Hope this helps!

Why is this TensorFlow code so slow?

I am training a deep neural network multi-class classifier using TensorFlow. The network outputs the linear values from the final layer, which the tf.nn.softmax_cross_entropy_with_logits cost function takes as input. However, I don't really care about that linear output per se - I want to know what it looks like when the softmax function is applied to it.
Below the relevant parts of my code:
def train_network(x, num_hidden_layers):
prediction = neural_network_model(x, num_hidden_layers)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
# train the network
...
# get the network output; x_test is my test data (len=663)
output = sess.run(prediction,feed_dict={x: x_test})
# get softmax values of output
for i in range(len(x_test)):
softm = sess.run(tf.nn.softmax(output[i]))
pred_class = sess.run(tf.argmax(softm))
print(pred_class)
...
Now, that final for-loop in which I calculate the softmax values is extremely slow. Why is that, and how do I do this properly?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.