I would like to perform transfer learning with a pretrained Keras model:
import tensorflow as tf
from tensorflow import keras
base_model = keras.applications.MobileNetV2(input_shape=(96, 96, 3), include_top=False, pooling='avg')
x = base_model.outputs[0]
outputs = keras.layers.Dense(10, activation=tf.nn.softmax)(x)
model = keras.Model(inputs=base_model.inputs, outputs=outputs)
Training with the Keras compile/fit API converges:
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
history = model.fit(train_data, epochs=1)
The results are: loss: 0.4402 - accuracy: 0.8548
I want to train with tf.GradientTape instead, but it doesn't converge:
optimizer = keras.optimizers.Adam()
train_loss = keras.metrics.Mean()
train_acc = keras.metrics.SparseCategoricalAccuracy()
def train_step(data, labels):
    with tf.GradientTape() as gt:
        pred = model(data)
        loss = keras.losses.SparseCategoricalCrossentropy()(labels, pred)
    grads = gt.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    train_loss(loss)
    train_acc(labels, pred)

for xs, ys in train_data:
    train_step(xs, ys)
print('train_loss = {:.3f}, train_acc = {:.3f}'.format(train_loss.result(), train_acc.result()))
But the results are: train_loss = 7.576, train_acc = 0.101
If I only train the last layer by setting
base_model.trainable = False
It converges and the results are: train_loss = 0.525, train_acc = 0.823
What's the problem with the code? How should I modify it? Thanks.
Try ReLU as the activation function. It may be a vanishing-gradient issue, which can occur when you use an activation function other than ReLU.
Following my comment, the reason it didn't converge is that you picked a learning rate that was too big. This causes the weights to change too much and the loss to explode. When you set base_model.trainable to False, most of the weights in the network were frozen and the learning rate was a good fit for your last layers.
As a general rule, the learning rate should be tuned for each experiment.
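For instance, when the whole base model is trainable, a much smaller learning rate is usually a safer starting point; the value below is only an illustrative assumption, not something prescribed by the question:

# Illustrative only: a smaller learning rate for full fine-tuning
# (Adam's default is 1e-3; 1e-5 is a common conservative choice).
optimizer = keras.optimizers.Adam(learning_rate=1e-5)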
Edit: Following Wilson's comment, I'm not sure this is the reason you get different results, but it could be:
When you specify your loss, it is computed on each element of the batch; to get the loss of the whole batch you then take either the sum or the mean of those per-element losses, and depending on which one you choose you get a different magnitude. For example, if your batch size is 64, summing the losses yields a loss 64 times bigger, which yields gradients 64 times bigger, so choosing sum over mean with a batch size of 64 is like picking a 64 times bigger learning rate.
So maybe the reason you get different results is that, by default, a keras.losses object wrapped in model.compile uses a different reduction method. In the same vein, if the loss is reduced with a sum, its magnitude depends on the batch size: with twice the batch size you get (on average) twice the loss and twice the gradient, so it's like doubling the learning rate.
My advice is to check the reduction method used by the loss to make sure it's the same in both cases, and if it's sum, to also check that the batch size is the same. In general I would advise using mean reduction, since it's not influenced by the batch size.
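To make that explicit, you can instantiate the loss once with a fixed reduction and reuse it in the custom step; a minimal sketch reusing the names from the question's snippet (the Reduction enum is the TF2-style API):

# Mean-style reduction, matching what compile/fit uses by default.
loss_fn = keras.losses.SparseCategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE)

def train_step(data, labels):
    with tf.GradientTape() as gt:
        pred = model(data)
        loss = loss_fn(labels, pred)  # averaged over the batch, not summed
    grads = gt.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))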
I am currently trying to train a model using tf.GradientTape, as model.fit(...) from keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, tf.GradientTape does not.
During training, the loss using the tf.GradientTape custom workflow will first slightly decrease, but then become stuck and not improve any further, no matter how many epochs I run. The chosen metric will also not change after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero to something very large. The running loss is more stable but shows the model not improving.
This is all in contrast to using model.fit(...), where loss and metrics are improving immediately.
My code:
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import (Input, Embedding, Concatenate, Bidirectional,
                                     LSTM, Activation, Dense)
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
    x1 = Input(62)
    x2 = Input((62, 3))
    x = Embedding(30, 100, mask_zero=True)(x1)
    x = Concatenate()([x, x2])
    x = Bidirectional(LSTM(500,
                           return_sequences=True,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Bidirectional(LSTM(500,
                           return_sequences=False,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Activation('softmax')(x)
    x = Dense(1000)(x)
    x = Dense(500)(x)
    x = Dense(250)(x)
    x = Dense(1, bias_initializer='ones')(x)
    x = tf.math.abs(x)
    return Model(inputs=[x1, x2], outputs=x)
optimizer = Adam(learning_rate=0.0001)
model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

dat_train = tf.data.Dataset.from_generator(
    generator=lambda: <load_function()>,
    output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)
# keras training
model.fit(dat_train, epochs=50)

# custom training
for epoch in range(50):
    for (x1, x2), y in dat_train:
        with tf.GradientTape() as tape:
            y_pred = model((x1, x2), training=True)
            loss = model.loss(y, y_pred)
        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I could use relu at the output layer; however, I found abs to be more robust, and changing it does not change the outcome. The input x1 of the model is a sequence, and x2 holds some additional features that are later concatenated to the embedded x1 sequence. For my approach I'm not using MSE, but it behaves the same either way.
I could provide some data; however, my dataset is quite large, so I would need to extract a portion of it.
All in all, my problem seems to be similar to:
Keras model doesn't train when using GradientTape
Edit 1
The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model.
Additionally, some things I noticed:
The custom training takes roughly 2x the amount of time compared to model.fit(...).
The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know whether that's normal, and I don't know how to compare them to the gradients produced by model.fit(...) (a sketch for logging gradient norms follows below).
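One way to at least quantify this (a minimal sketch reusing the tape/grads names from the custom loop above) is to print the global gradient norm every step:

# Minimal sketch: log the overall gradient magnitude per training step.
# `grads` is the list returned by tape.gradient(...) in the custom loop above.
grad_norm = tf.linalg.global_norm([g for g in grads if g is not None])
tf.print("global gradient norm:", grad_norm)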
Edit 2
I've added a Google Colab notebook to reproduce the issue:
https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing
The loss and MSE for 20 epochs are shown in two plots in the notebook, one for the custom training run and one for the keras training run. While I only used a portion of my data in the notebook, it will still run for a very long time. For the custom training run, the loss for each batch is simply stored in losses; it matches the behavior shown in the custom training plot.
So far, I've noticed two ways of improving the performance of the custom training:
The usage of custom layer initialization
Using MSE as a loss function
Using MSE instead of my own loss function actually improves the custom training performance. Still, neither MSE nor a different initialization comes close to the performance of keras fit.
I have found the solution: it was a simple shape mismatch, which was somehow not picked up by any error check and silently worked both with my custom loss function and with MSE. Using x = Reshape(())(x) as the final layer did the trick.
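This kind of silent failure is typically a broadcasting issue: if the targets have shape (batch,) while the predictions have shape (batch, 1), the element-wise difference broadcasts to (batch, batch) and the loss is averaged over the wrong axis. A minimal, self-contained sketch with made-up values (not the asker's data) showing the effect on the functional MSE:

import tensorflow as tf

y_true = tf.constant([1.0, 2.0, 3.0, 4.0])          # shape (4,)
y_pred = tf.constant([[1.0], [2.0], [3.0], [4.0]])  # shape (4, 1)

# (4, 1) - (4,) broadcasts to (4, 4): the per-sample loss is non-zero and wrong
print(tf.keras.losses.mean_squared_error(y_true, y_pred))
# Once the shapes match, the loss is 0 as expected
print(tf.keras.losses.mean_squared_error(y_true, tf.squeeze(y_pred, axis=-1)))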
I have an ML model with 3 outputs, each of which is passed through a softmax layer. I'm trying to train the model from scratch with a custom training loop in Keras. My generator output is something like this:
x,y = (batch_size, h,w,c), [(batch_size, 2), (batch_size, 8), (batch_size, 10)]
As you can see, each of the 3 outputs has a different number of labels: the first output layer has 2 labels, the second has 8, and the last has 10.
Here is the code snippet,
epochs = 1
optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)
loss_fn = tf.keras.losses.CategoricalCrossentropy()
train_acc_metric = tf.keras.metrics.Accuracy()
val_acc_metric = tf.keras.metrics.Accuracy()
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)  # Logits for this minibatch
        # Compute the loss value for this minibatch.
        train_loss_value = loss_fn(y, logits)
    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss.
    grads = tape.gradient(train_loss_value, model.trainable_weights)
    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    # Update training metric.
    train_acc_metric.update_state(y, logits)
    return train_loss_value
And,
for epoch in range(epochs):
    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_generator):
        train_loss_value = train_step(x_batch_train, y_batch_train)
    # Reset metrics at the end of each epoch
    train_acc_metric.reset_states()
When I run the code using the fit method, it works nicely, but the custom training loop above throws the following error message:
ValueError: Dimension 1 in both shapes must be equal, but are 8 and 10. Shapes are [64,8] and [64,10].
From merging shape 1 with other shapes.
for '{{node categorical_crossentropy/packed}} = Pack[N=3, T=DT_FLOAT, axis=0](net_2/gra/Softmax, net_2/vow/Softmax, net_2/cons/Softmax)'
with input shapes: [64,2], [64,8], [64,10].
I suspect I am not handling the loss computation properly here: loss_fn(y, logits). y has 3 outputs and logits also has 3 outputs. When I pass loss_fn(y[0], logits[0]), it works for the first output taken on its own.
Any suggestion would be appreciated. Thanks.
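For what it's worth, since the three heads have different numbers of classes, the loss probably has to be computed per output and then combined, rather than letting CategoricalCrossentropy try to stack the three tensors. A minimal sketch of that idea (assuming model(x) returns a list of three tensors matching the generator's three label arrays; an unweighted sum is only one possible way to combine them):

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)  # list of 3 tensors: (b, 2), (b, 8), (b, 10)
        # One loss term per head, then combine them (unweighted sum here).
        train_loss_value = tf.add_n(
            [loss_fn(y_true, y_hat) for y_true, y_hat in zip(y, logits)])
    grads = tape.gradient(train_loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return train_loss_value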
I want to use Tensorboard to plot the mean squared error (y-axis) for every iteration over a given time frame (x-axis), say 5 minutes.
However, I can only plot the MSE once per epoch and set a callback at 5 minutes, which does not solve my problem.
I have searched for a way to set a maximum number of iterations rather than epochs when calling model.fit, but without luck. I know the number of iterations is the number of batches needed to complete one epoch, but since I want to tune the batch_size, I prefer to work in iterations.
My code currently looks like the following:
input_size = len(train_dataset.keys())
output_size = 10
hidden_layer_size = 250
n_epochs = 3
weights_initializer = keras.initializers.GlorotUniform()

#A function that trains and validates the model and returns the MSE
def train_val_model(run_dir, hparams):
    model = keras.models.Sequential([
        #Layer to be used as an entry point into a Network
        keras.layers.InputLayer(input_shape=[len(train_dataset.keys())]),
        #Dense layer 1
        keras.layers.Dense(hidden_layer_size, activation='relu',
                           kernel_initializer=weights_initializer,
                           name='Layer_1'),
        #Dense layer 2
        keras.layers.Dense(hidden_layer_size, activation='relu',
                           kernel_initializer=weights_initializer,
                           name='Layer_2'),
        #activation function is linear since we are doing regression
        keras.layers.Dense(output_size, activation='linear', name='Output_layer')
    ])

    #Use the stochastic gradient descent optimizer but change batch_size to get BSG, SGD or MiniSGD
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.0,
                                        nesterov=False)

    #Compiling the model
    model.compile(optimizer=optimizer,
                  loss='mean_squared_error',      #Computes the mean of squares of errors between labels and predictions
                  metrics=['mean_squared_error']) #Computes the mean squared error between y_true and y_pred

    # initialize TimeStopping callback
    time_stopping_callback = tfa.callbacks.TimeStopping(seconds=5*60, verbose=1)

    #Training the network
    history = model.fit(normed_train_data, train_labels,
                        epochs=n_epochs,
                        batch_size=hparams['batch_size'],
                        verbose=1,
                        #validation_split=0.2,
                        callbacks=[tf.keras.callbacks.TensorBoard(run_dir + "/Keras"), time_stopping_callback])
    return history

#train_val_model("logs/sample", {'batch_size': len(normed_train_data)})
train_val_model("logs/sample1", {'batch_size': 1})
%tensorboard --logdir_spec=BSG:logs/sample,SGD:logs/sample1
This results in a TensorBoard plot with only one data point per epoch, whereas the desired output would show the MSE for every iteration over the time frame.
The reason you can't do it every iteration is that the loss is calculated at the end of each epoch. If you want to tune the batch size, run for a set number of epochs and evaluate. Start from 16, jump in powers of 2, and see how far you can push your network. Usually a bigger batch size is said to increase performance, but the effect is not substantial enough to focus on it alone; focus on other parts of the network first.
The answer was actually quite simple.
tf.keras.callbacks.TensorBoard has an update_freq argument that lets you control when losses and metrics are written to TensorBoard. The default is 'epoch', but you can change it to 'batch', or to an integer if you want to write to TensorBoard every n batches. See the documentation for more information: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard
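Applied to the snippet above, it is a single extra argument to the callback (the integer variant writes every n batches):

# Write losses/metrics every batch instead of once per epoch
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_dir + "/Keras", update_freq='batch')
# ...or, for example, every 100 batches:
# tensorboard_cb = tf.keras.callbacks.TensorBoard(run_dir + "/Keras", update_freq=100)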
I am trying to plot the different learning outcome when using Batch gradient descent, Stochastic gradient descent and mini-batch stochastic gradient descent.
Everywhere I look, I read that batch_size=1 is the same as plain SGD and batch_size=len(train_data) is the same as batch gradient descent.
I know that stochastic gradient descent is when you use only one single data sample for every update and batch gradient descent uses the entire training data set to compute the gradient of the objective function / update.
However, when setting batch_size in Keras, the opposite seems to be happening. Take my code for example, where I have set batch_size equal to the length of my training data:
input_size = len(train_dataset.keys())
output_size = 10
hidden_layer_size = 250
n_epochs = 250
weights_initializer = keras.initializers.GlorotUniform()

#A function that trains and validates the model and returns the MSE
def train_val_model(run_dir, hparams):
    model = keras.models.Sequential([
        #Layer to be used as an entry point into a Network
        keras.layers.InputLayer(input_shape=[len(train_dataset.keys())]),
        #Dense layer 1
        keras.layers.Dense(hidden_layer_size, activation='relu',
                           kernel_initializer=weights_initializer,
                           name='Layer_1'),
        #Dense layer 2
        keras.layers.Dense(hidden_layer_size, activation='relu',
                           kernel_initializer=weights_initializer,
                           name='Layer_2'),
        #activation function is linear since we are doing regression
        keras.layers.Dense(output_size, activation='linear', name='Output_layer')
    ])

    #Use the stochastic gradient descent optimizer but change batch_size to get BSG, SGD or MiniSGD
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.0,
                                        nesterov=False)

    #Compiling the model
    model.compile(optimizer=optimizer,
                  loss='mean_squared_error',      #Computes the mean of squares of errors between labels and predictions
                  metrics=['mean_squared_error']) #Computes the mean squared error between y_true and y_pred

    # initialize TimeStopping callback
    time_stopping_callback = tfa.callbacks.TimeStopping(seconds=5*60, verbose=1)

    #Training the network
    history = model.fit(normed_train_data, train_labels,
                        epochs=n_epochs,
                        batch_size=hparams['batch_size'],
                        verbose=1,
                        #validation_split=0.2,
                        callbacks=[tf.keras.callbacks.TensorBoard(run_dir + "/Keras"), time_stopping_callback])
    return history

train_val_model("logs/sample", {'batch_size': len(normed_train_data)})
When running this, the output seems to show a single update for each epoch, i.e. SGD: underneath every epoch it says 1/1, which I assume means a single update iteration. If I instead set batch_size=1, I get 90000/90000, which is the size of my entire data set (training-time wise this also makes sense).
So, my question is: is batch_size=1 actually batch gradient descent rather than stochastic gradient descent, and batch_size=len(train_data) actually stochastic gradient descent rather than batch gradient descent?
There are actually three (3) cases:
batch_size = 1 means indeed stochastic gradient descent (SGD)
A batch_size equal to the whole of the training data is (batch) gradient descent (GD)
Intermediate cases (which are actually used in practice) are usually referred to as mini-batch gradient descent
See A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size for more details and references. Truth is, in practice, when we say "SGD" we usually mean "mini-batch SGD".
These definitions are in fact fully compliant with what you report from your experiments:
With batch_size=len(train_data) (GD case), only one update is indeed expected per epoch (since there is only one batch), hence the 1/1 indication in Keras output.
In contrast, with batch_size = 1 (SGD case), you expect as many updates as samples in your training data (since this is now the number of your batches), i.e. 90000, hence the 90000/90000 indication in Keras output.
i.e. the number of updates per epoch (which Keras indicates) is equal to the number of batches used (and not to the batch size).
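To make the arithmetic concrete, the progress counter Keras prints per epoch is simply ceil(n_samples / batch_size); a quick sketch with the numbers from the question:

import math

n_samples = 90000
for batch_size in (1, 32, n_samples):
    updates_per_epoch = math.ceil(n_samples / batch_size)
    print(batch_size, updates_per_epoch)  # 1 -> 90000, 32 -> 2813, 90000 -> 1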
batch_size determines how many samples each weight update is computed from.
Here, batch_size=1 means each update uses a single sample. By your definitions, this is SGD.
If you have batch_size=len(train_data), each update to your weights uses the gradient computed over your entire dataset. This is just good old (batch) gradient descent.
Mini-batch gradient descent is somewhere in the middle, where batch_size is neither 1 nor the size of your entire training dataset. Take 32 for example: mini-batch gradient descent would update your weights every 32 examples, so it smooths out the ruggedness of SGD with just 1 example (where outliers may have a lot of impact) while still keeping the benefits SGD has over plain gradient descent.
Is it possible to set model.loss in a callback without re-running model.compile(...) afterwards (since then the optimizer states are reset), and instead just recompile model.loss, for example:
class NewCallback(Callback):
    def __init__(self):
        super(NewCallback, self).__init__()

    def on_epoch_end(self, epoch, logs={}):
        self.model.loss = [loss_wrapper(t_change, current_epoch=epoch)]
        self.model.compile_only_loss()  # is there a version or hack of
                                        # model.compile(...) like this?
To expand on previous Stack Overflow examples:
To achieve a loss function that depends on the epoch number, like this (as in this Stack Overflow question):
def loss_wrapper(t_change, current_epoch):
    def custom_loss(y_true, y_pred):
        c_epoch = K.get_value(current_epoch)
        if c_epoch < t_change:
            # compute loss_1
        else:
            # compute loss_2
    return custom_loss
where "current_epoch" is a Keras variable updated with a callback:
current_epoch = K.variable(0.)
model.compile(optimizer=opt, loss=loss_wrapper(5, current_epoch),
              metrics=...)

class NewCallback(Callback):
    def __init__(self, current_epoch):
        self.current_epoch = current_epoch

    def on_epoch_end(self, epoch, logs={}):
        K.set_value(self.current_epoch, epoch)
One can essentially turn the Python control flow into a composition of backend functions so that the loss works, as follows:
def loss_wrapper(t_change, current_epoch):
    def custom_loss(y_true, y_pred):
        # compute loss_1 and loss_2
        bool_case_1 = K.less(current_epoch, t_change)
        num_case_1 = K.cast(bool_case_1, "float32")
        loss = (num_case_1) * loss_1 + (1 - num_case_1) * loss_2
        return loss
    return custom_loss
it works.
I am not satisfied with these hacks and wonder: is it possible to set model.loss in a callback without re-running model.compile(...) afterwards (since then the optimizer states are reset), just recompiling model.loss?
I hope you have found a solution to your problem by now, but with TensorFlow I think you can solve this by building a custom training loop (here is the doc). This will not override the loss attribute as you requested, but you can probably achieve what you are looking for.
Example
Initializing variables
Modifying the example from the documentation, with a model and dataset as such:
import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,), name="digits")
x1 = tf.keras.layers.Dense(64, activation="relu")(inputs)
x2 = tf.keras.layers.Dense(64, activation="relu")(x1)
outputs = tf.keras.layers.Dense(10, name="predictions")(x2)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Prepare the training dataset.
batch_size = 64
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = np.reshape(x_train, (-1, 784))
x_test = np.reshape(x_test, (-1, 784))
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)
We can define our two loss functions (the two I chose make no sense from a scientific point of view, but they let us check that the code works):
# Instantiate an optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
# Instantiate a loss function.
loss_1 = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_2 = lambda y_true, y_pred: -1 * loss_1(y_true, y_pred)
Training loop
We can then execute our custom training loop:
epochs = 10
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):

        # Open a GradientTape to record the operations run
        # during the forward pass, which enables auto-differentiation.
        loss_fn = loss_1 if epoch % 2 else loss_2
        with tf.GradientTape() as tape:

            # Run the forward pass of the layer.
            # The operations that the layer applies
            # to its inputs are going to be recorded
            # on the GradientTape.
            logits = model(x_batch_train, training=True)  # Logits for this minibatch

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)

        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss.
        grads = tape.gradient(loss_value, model.trainable_weights)

        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %s samples" % ((step + 1) * 64))
And we check that the output is what we want (alternating positive and negative losses):
Start of epoch 0
Training loss (for one batch) at step 0: -96.1003
Seen so far: 64 samples
Training loss (for one batch) at step 200: -3383849.5000
Seen so far: 12864 samples
Training loss (for one batch) at step 400: -40419124.0000
Seen so far: 25664 samples
Training loss (for one batch) at step 600: -149133008.0000
Seen so far: 38464 samples
Training loss (for one batch) at step 800: -328322816.0000
Seen so far: 51264 samples
Start of epoch 1
Training loss (for one batch) at step 0: 580457984.0000
Seen so far: 64 samples
Training loss (for one batch) at step 200: 297710528.0000
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 213328544.0000
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 159328976.0000
Seen so far: 38464 samples
Training loss (for one batch) at step 800: 105737024.0000
Seen so far: 51264 samples
Drawbacks and further improvements
The problem with writing custom loops like this is that you lose the convenience of Keras's fit method. I think you can manage this by defining a custom model and overriding train_step, as shown here in the documentation; a sketch of that idea follows below.
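A minimal sketch of such a train_step override, keeping the epoch-dependent switch as a tf.Variable updated by a callback (the names loss_1, loss_2 and t_change echo the question; everything else is an illustrative assumption, not the question's own API):

class SwitchLossModel(tf.keras.Model):
    # Training loss switches from loss_1 to loss_2 once epoch_var reaches t_change.
    def __init__(self, *args, loss_1, loss_2, t_change, **kwargs):
        super().__init__(*args, **kwargs)
        self.loss_1, self.loss_2, self.t_change = loss_1, loss_2, t_change
        self.epoch_var = tf.Variable(0.0, trainable=False)

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = tf.cond(self.epoch_var < self.t_change,
                           lambda: self.loss_1(y, y_pred),
                           lambda: self.loss_2(y, y_pred))
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

class EpochUpdater(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.model.epoch_var.assign(float(epoch))

Compile is then called only once (model.compile(optimizer=...)), so the optimizer state is never reset, and EpochUpdater() is passed in the callbacks list of fit.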
If you really need to have the loss attribute of your model changed, you can set the compiled_loss attribute using a keras.engine.compile_utils.LossesContainer (here is the reference) and set model.train_function to model.make_train_function() (so that the new loss gets taken into account).