Computing the loss (MSE) for every iteration and time Tensorflow

Computing the loss (MSE) for every iteration and time Tensorflow - python

I want to use Tensorboard to plot the mean squared error (y-axis) for every iteration over a given time frame (x-axis), say 5 minutes.
However, i can only plot the MSE given every epoch and set a callback at 5 minutes. This does not however solve my problem.
I have tried looking at the internet for some solutions to how you can maybe set a maximum number of iterations rather than epochs when doing model.fit, but without luck. I know iterations is the number of batches needed to complete one epoch, but as I want to tune the batch_size, I prefer to use the iterations.
My code currently looks like the following:
input_size = len(train_dataset.keys())
output_size = 10
hidden_layer_size = 250
n_epochs = 3
weights_initializer = keras.initializers.GlorotUniform()
#A function that trains and validates the model and returns the MSE
def train_val_model(run_dir, hparams):
model = keras.models.Sequential([
#Layer to be used as an entry point into a Network
keras.layers.InputLayer(input_shape=[len(train_dataset.keys())]),
#Dense layer 1
keras.layers.Dense(hidden_layer_size, activation='relu',
kernel_initializer = weights_initializer,
name='Layer_1'),
#Dense layer 2
keras.layers.Dense(hidden_layer_size, activation='relu',
kernel_initializer = weights_initializer,
name='Layer_2'),
#activation function is linear since we are doing regression
keras.layers.Dense(output_size, activation='linear', name='Output_layer')
])
#Use the stochastic gradient descent optimizer but change batch_size to get BSG, SGD or MiniSGD
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.0,
nesterov=False)
#Compiling the model
model.compile(optimizer=optimizer,
loss='mean_squared_error', #Computes the mean of squares of errors between labels and predictions
metrics=['mean_squared_error']) #Computes the mean squared error between y_true and y_pred
# initialize TimeStopping callback
time_stopping_callback = tfa.callbacks.TimeStopping(seconds=5*60, verbose=1)
#Training the network
history = model.fit(normed_train_data, train_labels,
epochs=n_epochs,
batch_size=hparams['batch_size'],
verbose=1,
#validation_split=0.2,
callbacks=[tf.keras.callbacks.TensorBoard(run_dir + "/Keras"), time_stopping_callback])
return history
#train_val_model("logs/sample", {'batch_size': len(normed_train_data)})
train_val_model("logs/sample1", {'batch_size': 1})
%tensorboard --logdir_spec=BSG:logs/sample,SGD:logs/sample1
resulting in:
The desired output should look something like this:

The reason you can't do it every iteration is that the loss is calculated at the end of each epoch. If you want to tune the batch size, run for a set number of epochs and evaluate. Start from 16 and jump in powers of 2 and see how much you can push the power of your network. But, usually bigger batch size is said to increase performance but it is not as substantial to solely focus on it. Focus on other things in the network first.

The answer was actually quite simple.
tf.keras.callbacks.TensorBoard has an update_freq argument allowing you to control when to write losses and metrics to tensorboard. The standard is epoch, but you can change it to batch or an integer if you want to write to tensorboard every n batches. See the documentation for more information: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard

Related

Model not improving with GradientTape but with model.fit()

I am currently trying to train a model using tf.GradientTape, as model.fit(...) from keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, tf.GradientTape does not.
During training, the loss using the tf.GradientTape custom workflow will first slightly decrease, but then become stuck and not improve any further, no matter how many epochs I run. The chosen metric will also not change after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero to something very large. The running loss is more stable but shows the model not improving.
This is all in contrast to using model.fit(...), where loss and metrics are improving immediately.
My code:
def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
x1 = Input(62)
x2 = Input((62, 3))
x = Embedding(30, 100, mask_zero=True)(x1)
x = Concatenate()([x, x2])
x = Bidirectional(LSTM(500,
return_sequences=True,
kernel_regularizer=kernel_regularizer,
dropout=dropout,
recurrent_dropout=recurrent_dropout))(x)
x = Bidirectional(LSTM(500,
return_sequences=False,
kernel_regularizer=kernel_regularizer,
dropout=dropout,
recurrent_dropout=recurrent_dropout))(x)
x = Activation('softmax')(x)
x = Dense(1000)(x)
x = Dense(500)(x)
x = Dense(250)(x)
x = Dense(1, bias_initializer='ones')(x)
x = tf.math.abs(x)
return Model(inputs=[x1, x2], outputs=x)
optimizer = Adam(learning_rate=0.0001)
model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA
dat_train = tf.data.Dataset.from_generator(
generator= lambda: <load_function()>
output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)
# keras training
model.fit(dat_train, epochs=50)
# custom training
for epoch in range(50):
for (x1, x2), y in dat_train:
with tf.GradientTape() as tape:
y_pred = model((x1, x2), training=True)
loss = model.loss(y, y_pred)
grads = tape.gradient(loss, model.trainable_variables)
model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I could use relu at the output layer, however, I found the abs to be more robust. Changing it does not change the outcome. The input x1 of the model is a sequence, x2 are some additional features, that are later concatenated to the embedded x1 sequence. For my approach, I'm not using the MSE, but it works either way.
I could provide some data, however, my dataset is quite large, so I would need to extract a bit out of it.
All in all, my problem seems to be similar to:
Keras model doesn't train when using GradientTape
Edit 1
The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model.
Additionally, some things I noticed:
The custom training takes roughly 2x the amount of time compared to model.fit(...).
The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know if that's normal and don't know how to compare it to the gradients provided by model.fit(...).
Edit 2
I've added a Google Colab notebook to reproduce the issue:
https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing
The loss and MSE for 20 epochs is shown here:
custom training
keras training
While I only used a portion of my data in the notebook, it will still run for a very long time. For the custom training run, the loss for each batch is simply stored in losses. It matches the behavior in the custom training run image.
So far, I've noticed two ways of improving the performance of the custom training:
The usage of custom layer initialization
Using MSE as a loss function
Using the MSE, compared to my own loss function actually improves the custom training performance. Still, using MSE and/or different initialization won't come close to the performance of keras fit.

I have found the solution, it was a simple shape mismatch, which was somehow not picked up by any error check and worked both with my custom loss function and MSE. Using x = Reshape(())(x) as final layer did the trick.

Batch size for Stochastic gradient descent is length of training data and not 1?

I am trying to plot the different learning outcome when using Batch gradient descent, Stochastic gradient descent and mini-batch stochastic gradient descent.
Everywhere i look, i read that a batch_size=1 is the same as having a plain SGD and a batch_size=len(train_data) is the same as having the Batch gradient descent.
I know that stochastic gradient descent is when you use only one single data sample for every update and batch gradient descent uses the entire training data set to compute the gradient of the objective function / update.
However, when implementing the batch_size using keras, it seems to be the opposite that is happening. Take my code for example, where I have set the batch_size equal to the length of my training_data
input_size = len(train_dataset.keys())
output_size = 10
hidden_layer_size = 250
n_epochs = 250
weights_initializer = keras.initializers.GlorotUniform()
#A function that trains and validates the model and returns the MSE
def train_val_model(run_dir, hparams):
model = keras.models.Sequential([
#Layer to be used as an entry point into a Network
keras.layers.InputLayer(input_shape=[len(train_dataset.keys())]),
#Dense layer 1
keras.layers.Dense(hidden_layer_size, activation='relu',
kernel_initializer = weights_initializer,
name='Layer_1'),
#Dense layer 2
keras.layers.Dense(hidden_layer_size, activation='relu',
kernel_initializer = weights_initializer,
name='Layer_2'),
#activation function is linear since we are doing regression
keras.layers.Dense(output_size, activation='linear', name='Output_layer')
])
#Use the stochastic gradient descent optimizer but change batch_size to get BSG, SGD or MiniSGD
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.0,
nesterov=False)
#Compiling the model
model.compile(optimizer=optimizer,
loss='mean_squared_error', #Computes the mean of squares of errors between labels and predictions
metrics=['mean_squared_error']) #Computes the mean squared error between y_true and y_pred
# initialize TimeStopping callback
time_stopping_callback = tfa.callbacks.TimeStopping(seconds=5*60, verbose=1)
#Training the network
history = model.fit(normed_train_data, train_labels,
epochs=n_epochs,
batch_size=hparams['batch_size'],
verbose=1,
#validation_split=0.2,
callbacks=[tf.keras.callbacks.TensorBoard(run_dir + "/Keras"), time_stopping_callback])
return history
train_val_model("logs/sample", {'batch_size': len(normed_train_data)})
When running this, the output seems to show a single update for each epoch i.e. SGD
:
As can be seen underneath every epoch it says 1/1 which I assume means a single update iteration. If I on the other hand set the batch_size=1 I get 90000/90000 which is the size of my entire data-set (training time wise this also makes sense).
So, my question is, batch_size=1 is actually Batch gradient descent and not stochastic gradient descent and batch_size=len(train_data) is actually stochastic gradient descent and not batch gradient descent?

There are actually three (3) cases:
batch_size = 1 means indeed stochastic gradient descent (SGD)
A batch_size equal to the whole of the training data is (batch) gradient descent (GD)
Intermediate cases (which are actually used in practice) are usually referred to as mini-batch gradient descent
See A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size for more details and references. Truth is, in practice, when we say "SGD" we usually mean "mini-batch SGD".
These definitions are in fact fully compliant with what you report from your experiments:
With batch_size=len(train_data) (GD case), only one update is indeed expected per epoch (since there is only one batch), hence the 1/1 indication in Keras output.
In contrast, with batch_size = 1 (SGD case), you expect as many updates as samples in your training data (since this is now the number of your batches), i.e. 90000, hence the 90000/90000 indication in Keras output.
i.e. the number of updates per epoch (which Keras indicates) is equal to the number of batches used (and not to the batch size).

batch_size is the size of how large each update will be.
Here, batch_size=1 means the size of each update is 1 sample. By your definitions, this would be SGD.
If you have batch_size=len(train_data), that means that each update to your weights will require the resulting gradient from your entire dataset. This is actually just good old gradient descent.
Batch gradient descent is somewhere in the middle, where the batch_size isn't 1 and the batch size isn't your entire training dataset. Take 32 for example. Batch gradient descent would update your weights every 32 examples, so it smooths out the ruggedness of SGD with just 1 example (where outliers may have a lot of impact) and yet has the benefits that SGD has over regular gradient descne.t

Simple neural network refuses to overfit

I wrote this super simple piece of code
model = Sequential()
model.add(Dense(1, input_dim=d, activation='linear'))
model.compile(loss='mse', optimizer='adam')
model.fit(X_train, y_train, epochs=10000, batch_size=n)
test_mse = model.evaluate(X_test, y_test)
print('test mse is {}'.format(test_mse))
X_train is an n by d numpy matrix and y is n by 1 numpy matrix.
This is basically the simplest linear neural network you could think of. One layer, input dimension is d, and we output a number.
It simply refuses to overfit. Even after running an insane amount of iterations (10k as you can see), the training loss is at around 0.17.
I expect the loss to be zero. Why do I expect that? Because in my case, d is much greater than n. I have a lot more degrees of freedom. And as a further piece of evidence, when I actually solve X_train # w = y_train using numpy.linalg.lstsq, the max value of X_train # w - y is something like 10 to the -14.
So this system is definitely solvable. I expected to see zero loss or very close to zero loss, but I don't. Why?

Transfer learning with pretrained model by tf.GradientTape can't converge

I would like to perform transfer learning with pretrained model of keras
import tensorflow as tf
from tensorflow import keras
base_model = keras.applications.MobileNetV2(input_shape=(96, 96, 3), include_top=False, pooling='avg')
x = base_model.outputs[0]
outputs = layers.Dense(10, activation=tf.nn.softmax)(x)
model = keras.Model(inputs=base_model.inputs, outputs=outputs)
Training with keras compile/fit functions can converge
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
history = model.fit(train_data, epochs=1)
The results are: loss: 0.4402 - accuracy: 0.8548
I wanna train with tf.GradientTape, but it can't converge
optimizer = keras.optimizers.Adam()
train_loss = keras.metrics.Mean()
train_acc = keras.metrics.SparseCategoricalAccuracy()
def train_step(data, labels):
with tf.GradientTape() as gt:
pred = model(data)
loss = keras.losses.SparseCategoricalCrossentropy()(labels, pred)
grads = gt.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
train_loss(loss)
train_acc(labels, pred)
for xs, ys in train_data:
train_step(xs, ys)
print('train_loss = {:.3f}, train_acc = {:.3f}'.format(train_loss.result(), train_acc.result()))
But the results are: train_loss = 7.576, train_acc = 0.101
If I only train the last layer by setting
base_model.trainable = False
It converges and the results are: train_loss = 0.525, train_acc = 0.823
What's the problem with the codes? How should I modify it? Thanks

Try RELU as activation function. It may be Vanishing Gradient issue which occurs if you use activation function other than RELU.

Following my comment, the reason why it didn't converge is because you picked a learning rate that was too big. This causes the weight to change too much and the loss to explode. When setting base_model.trainable to False, most of the weight in the networks were fixed and the learning rate was a good fit for your last layers. Here's a picture :
As a general rule, your learning rate should always be chosen for each experiments.
Edit : Following Wilson's comment, I'm not sure this is the reason you have different results but this could be it :
When you specify your loss, your loss is computed on each element of the batch, then to get the loss of the batch, you can take the sum or the mean of the losses, depending on which one you chose, you get a different magnitude. For example, if your batch size is 64, summing the loss will yield you a 64 times bigger loss which will yield 64 times bigger gradient, so choosing sum over mean with a batch size 64 is like picking a 64 times bigger learning rate.
So maybe the reason you have different results is that by default a keras.losses wrapped in a model.compile has a different reduction method. In the same vein, if the loss is reduced by a sum method, the magnitude of the loss depends on the batch size, if you have twice the batch size, you get (on average) twice the loss, and twice the gradient and so it's like doubling the learning rate.
My advice is to check the reduction method used by the loss to be sure it's the same in both case, and if it's sum, to check that the batch size is the same. I would advise to use mean reduction in general since it's not influenced by batch size.

Extracting weights from best Neural Network in Tensorflow/Keras - multiple epochs

I am working on a 1 - hidden - layer Neural Network with 2000 neurons and 8 + constant input neurons for a regression problem.
In particular, as optimizer I am using RMSprop with learning parameter = 0.001, ReLU activation from input to hidden layer and linear from hidden to output. I am also using a mini-batch-gradient-descent (32 observations) and running the model 2000 times, that is epochs = 2000.
My goal is, after the training, to extract the weights from the best Neural Network out of the 2000 run (where, after many trials, the best one is never the last, and with best I mean the one that leads to the smallest MSE).
Using save_weights('my_model_2.h5', save_format='h5') actually works, but at my understanding it extract the weights from the last epoch, while I want those from the epoch in which the NN has perfomed the best. Please find the code I have written:
def build_first_NN():
model = keras.Sequential([
layers.Dense(2000, activation=tf.nn.relu, input_shape=[len(X_34.keys())]),
layers.Dense(1)
])
optimizer = tf.keras.optimizers.RMSprop(0.001)
model.compile(loss='mean_squared_error',
optimizer=optimizer,
metrics=['mean_absolute_error', 'mean_squared_error']
)
return model
first_NN = build_first_NN()
history_firstNN_all_nocv = first_NN.fit(X_34,
y_34,
epochs = 2000)
first_NN.save_weights('my_model_2.h5', save_format='h5')
trained_weights_path = 'C:/Users/Myname/Desktop/otherfolder/Data/my_model_2.h5'
trained_weights = h5py.File(trained_weights_path, 'r')
weights_0 = pd.DataFrame(trained_weights['dense/dense/kernel:0'][:])
weights_1 = pd.DataFrame(trained_weights['dense_1/dense_1/kernel:0'][:])
The then extracted weights should be those from the last of the 2000 epochs: how can I get those from, instead, the one in which the MSE was the smallest?
Looking forward for any comment.
EDIT: SOLVED
Building on the received suggestions, as for general interest, that's how I have updated my code, meeting my scope:
# build_first_NN() as defined before
first_NN = build_first_NN()
trained_weights_path = 'C:/Users/Myname/Desktop/otherfolder/Data/my_model_2.h5'
checkpoint = ModelCheckpoint(trained_weights_path,
monitor='mean_squared_error',
verbose=1,
save_best_only=True,
mode='min')
history_firstNN_all_nocv = first_NN.fit(X_34,
y_34,
epochs = 2000,
callbacks = [checkpoint])
trained_weights = h5py.File(trained_weights_path, 'r')
weights_0 = pd.DataFrame(trained_weights['model_weights/dense/dense/kernel:0'][:])
weights_1 = pd.DataFrame(trained_weights['model_weights/dense_1/dense_1/kernel:0'][:])

Use ModelCheckpoint callback from Keras.
from keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_squared_error', verbose=1, save_best_only=True, mode='max')
use this as a callback in your model.fit() . This will always save the model with the highest validation accuracy (lowest MSE on validation) at the location specified by filepath.
You can find the documentation here.
Of course, you need validation data during training for this. Otherwise I think you can probably be able to check on lowest training MSE by writing a callback function yourself.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.