Do Keras epoch-counting callbacks work across several fitting sessions?


Some Keras callbacks, like ModelCheckpoint or ReduceLROnPlateau, rely on counting the number of epochs for which some condition holds before they take an action.
For certain purposes I need to train a Keras model in several fitting sessions, i.e. something like
for epoch in range(num_epochs):
    model.fit(data, epochs=1)
rather than
model.fit(data, epochs=num_epochs)
I was wondering if Keras callbacks work even if I use them across several fitting sessions.

Each time model.fit(...) is called, callbacks.History is reset, so no, it will not work like that. While you could log the metrics yourself, as @kacpo1 mentioned, and save them after each call, you may benefit from the train_on_batch(...) method. It performs a single update on one batch, and you can set reset_metrics=False in the method call to retain your metrics across calls.
https://keras.io/api/models/model_training_apis/#trainonbatch-method

I was wondering if Keras callbacks work even if I use them across several fitting sessions.
The answer is no. If you are doing things this way (a loop that fits one epoch at a time), you can handle things like saving weights and learning rate decay yourself, without using callbacks.
for epoch in range(num_epochs):
    model.fit(data, epochs=1)
    if epoch % 5 == 0:
        model.save_weights(...)
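The manual bookkeeping described above can be sketched without Keras. The class below is a hypothetical helper (PlateauTracker, patience, factor and min_lr are illustrative names, not Keras API) that keeps a patience counter alive across iterations of such a loop. In a real loop you would feed it the loss from each model.fit(data, epochs=1) call and apply the returned learning rate to the optimizer yourself; the loss values here are made up.

```python
class PlateauTracker:
    """Hypothetical helper: cut the learning rate after `patience` epochs without improvement."""

    def __init__(self, lr, patience=3, factor=0.5, min_lr=1e-6):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0  # epochs since the last improvement; survives across fit calls

    def update(self, loss):
        """Record one epoch's loss and return the learning rate to use next."""
        if loss < self.best:
            self.best = loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


# Stand-in losses; in practice each value would come from one single-epoch fit call.
tracker = PlateauTracker(lr=0.1, patience=2)
losses = [1.0, 0.8, 0.9, 0.85, 0.7]
lrs = [tracker.update(loss) for loss in losses]
```

Because the counter lives in your own object rather than in a callback, it is not reset when a new fitting session starts.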

Related

Why does train data performance deteriorate dramatically?

I am training a binary classifier model that classifies between disease and non-disease.
When I run the model, the training loss decreases and AUC and accuracy increase.
But after a certain epoch, the training loss increases and AUC and accuracy decrease.
I don't know why training performance degrades after that epoch.
I used general 1d cnn model and methods, details here:
I have already tried:
shuffling the batches
introducing class weights
changing the loss (binary_crossentropy > BinaryFocalLoss)
changing the learning_rate

A few questions for you going forward.
Do the training and validation accuracy keep dropping if you just let it run for, say, 100 epochs? That is definitely something I would try.
Which optimizer are you using? SGD? ADAM?
How large is your dropout? Maybe this value is too large. Try without it and check whether the behavior is still the same.
It might also be the optimizer.
As you do not seem to augment your data (augmentation could be an issue if it accidentally breaks the label assignment), each epoch should see similar gradients. My guess is that at this point in the optimization the learning rate, and thus the update step, is not adjusted properly: instead of progressing further into the local optimum, the updates overshoot the minimum, which degrades both training and validation performance.
This is an intuitive explanation and the next things I would try are:
Scheduling the learning rate
Using a more sophisticated optimizer (starting with ADAM if you are not already using it)
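As a rough illustration of what scheduling the learning rate can mean, here is a minimal step-decay sketch. The function name and all default values are assumptions chosen for the example, not taken from the question. In Keras, a function of this shape (epoch index in, learning rate out) is the kind of thing one would hand to the LearningRateScheduler callback.

```python
def step_decay(epoch, initial_lr=1e-3, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs (illustrative values)."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


# Learning rates at a few sample epochs (0-based epoch indices).
schedule = [step_decay(e) for e in (0, 9, 10, 25)]
```

Smaller steps late in training make it less likely that an update jumps past the minimum, which is the failure mode guessed at above.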

Your model is overfitting. This is why your accuracy increases and then begins decreasing. You need to implement Early Stopping to stop at the Epoch with the best results. You should also implement dropout layers.

Keras loss and metrics values do not match with same function in each

I am using keras with a custom loss function like below:
def custom_fn(y_true, y_pred):
    # changing y_true, y_pred values systematically
    return mean_absolute_percentage_error(y_true, y_pred)
Then I am calling model.compile(loss=custom_fn) and model.fit(X, y,..validation_data=(X_val, y_val)..)
Keras is then saving loss and val_loss in model history. As a sanity check, when the model finishes training, I am using model.predict(X_val) so I can calculate validation loss manually with my custom_fn using the trained model.
I am saving the model with the best epoch using this callback:
callbacks.append(ModelCheckpoint(path, save_best_only=True, monitor='val_loss', mode='min'))
so after calculating this, the validation loss should match keras' val_loss value of the best epoch. But this is not happening.
As another attempt to figure this issue out, I am also doing this:
model.compile(loss=custom_fn, metrics=[custom_fn])
And to my surprise, val_loss and val_custom_fn do not match (nor do loss and custom_fn, for that matter).
This is really strange: my custom_fn is essentially Keras' built-in mape with y_true and y_pred slightly manipulated. What is going on here?
PS: the layers I am using are LSTM layers and a final Dense layer. But I think this information is not relevant to the problem. I am also using regularisation as hyperparameter but not dropout.
Update
Even removing custom_fn and using keras' built in mape as a loss function and metric like so:
model.compile(loss='mape', metrics=['mape'])
and, for simplicity, removing the ModelCheckpoint callback has the same effect: val_loss and val_mape are not equal in any epoch. This is extremely strange to me. Either I am missing something or there is a bug in the Keras code; the former seems more likely.

This blog post suggests that keras adds any regularisation used in the training when calculating the validation loss. And obviously, when calculating the metric of choice no regularisation is applied. This is why it occurs with any loss function of choice as stated in the question.
This is something I could not find any documentation on from Keras. However, it seems to hold up since when I remove all regularisation hyperparameters, the val_loss and val_custom_fn match exactly in each epoch.
An easy workaround is to use custom_fn as a metric and save the best model based on that metric (val_custom_fn) rather than on val_loss. Alternatively, loop through the epochs manually and calculate the correct val_loss yourself after training each epoch. The latter seems to make more sense, since there is no reason to include custom_fn both as a metric and as a loss function.
If anyone can find any evidence of this in the Keras documentation that would be helpful.
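The effect described in this answer can be reproduced on paper with a few numbers. The sketch below is plain Python, not Keras code: the mape function is a simplified version without the epsilon clipping Keras applies, and the weights, targets, and L2 coefficient are all made up. It only shows why a loss that includes a regularisation penalty cannot equal a metric computed from predictions alone.

```python
def mape(y_true, y_pred):
    """Simplified mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def l2_penalty(weights, l2=0.01):
    """L2 weight penalty of the kind a kernel regulariser contributes (coefficient is illustrative)."""
    return l2 * sum(w * w for w in weights)


y_true = [10.0, 20.0, 40.0]
y_pred = [11.0, 18.0, 40.0]
weights = [0.5, -1.5, 2.0]  # made-up layer weights

metric_value = mape(y_true, y_pred)              # what a 'mape' metric would report
loss_value = metric_value + l2_penalty(weights)  # what a regularised loss would report
gap = loss_value - metric_value                  # equals the penalty, independent of the data
```

With the regularisation coefficient set to zero the gap vanishes, which matches the observation that the two values agree once regularisation is removed.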

Difference between TensorFlow model fit and train_on_batch

I am building a vanilla DQN model to play the OpenAI gym Cartpole game.
However, in the training step where I feed in the state as input and the target Q values as the labels, if I use model.fit(x=states, y=target_q), it works fine and the agent can eventually play the game well, but if I use model.train_on_batch(x=states, y=target_q), the loss won't decrease and the model will not play the game anywhere better than a random policy.
I wonder what is the difference between fit and train_on_batch? To my understanding, fit calls train_on_batch with a batch size of 32 under the hood which should make no difference since specifying the batch size to equal the actual data size I feed in makes no difference.
The full code is here if more contextual information is needed to answer this question: https://github.com/ultronify/cartpole-tf

model.fit will train 1 or more epochs. That means it will train multiple batches. model.train_on_batch, as the name implies, trains only one batch.
To give a concrete example, imagine you are training a model on 10 images with a batch size of 2. model.fit will train on all 10 images, so it will update the weights 5 times per epoch. (You can specify multiple epochs, so that it iterates over your dataset several times.) model.train_on_batch will perform a single weight update, as you only give the model one batch. You would give model.train_on_batch two images if your batch size is 2.
And if we assume that model.fit calls model.train_on_batch under the hood (though I don't think it does), then model.train_on_batch would be called multiple times, likely in a loop. Here's pseudocode to explain.
def fit(x, y, batch_size, epochs=1):
    for epoch in range(epochs):
        for batch_x, batch_y in batch(x, y, batch_size):
            model.train_on_batch(batch_x, batch_y)
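The batch helper used in the pseudocode above is not defined there. A minimal version, assuming x and y are plain sequences of equal length, could look like the following; it also confirms the count of 5 updates per epoch from the 10-image example. Note that the real fit shuffles the data by default, which this sketch does not do.

```python
def batch(x, y, batch_size):
    """Yield consecutive (batch_x, batch_y) slices; the last one may be smaller."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]


images = list(range(10))          # stand-ins for 10 images
labels = [i % 2 for i in images]  # made-up labels
batches = list(batch(images, labels, batch_size=2))
updates_per_epoch = len(batches)  # one train_on_batch call per batch
```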

How to begin counting in ModelCheckpoint from an epoch greater than 1

I am working on a project where Keras ModelCheckpoint is used. This callback class seems to cover my needs except for one small detail.
I have not found a way to pass a counter to the epoch numbering so as to deal with resume model training cases. I often train for some epochs and then resume training afterwards. I would like to have a consistent model saving pattern like:
model.{epoch:03d}-{loss:.2f}.hdf5
but with numbering beginning from the epoch the previous training stopped and not from the beginning.
The current command I use is this:
ckp_saver = ModelCheckpoint(checkpoint_dir + "/model.{epoch:03d}-{loss:.2f}.hdf5", monitor='loss', verbose=0,
                            save_best_only=False, save_weights_only=True, mode='auto', period=1)
Is there any way to pass this information to ModelCheckpoint? The solution I found is to edit the Keras code and add a default argument containing the number of epochs already trained (defaulting to 0 if not passed, so as not to break any other code), but I would prefer to avoid this if it's not necessary. Any other ideas?
The original code was taken from this file here.
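One option that may avoid editing Keras is the initial_epoch argument of model.fit, in which case epochs is the index of the final epoch rather than the number of additional epochs. The sketch below does not call Keras. It only mimics how the filename template would expand under that assumption, given that ModelCheckpoint fills the template with the 1-based epoch number. checkpoint_names is a hypothetical helper and the loss values are made up.

```python
TEMPLATE = "model.{epoch:03d}-{loss:.2f}.hdf5"


def checkpoint_names(initial_epoch, epochs, losses):
    """Mimic the names produced for fit(..., initial_epoch=initial_epoch, epochs=epochs)."""
    names = []
    for epoch, loss in zip(range(initial_epoch, epochs), losses):
        # Callbacks receive a 0-based epoch index; the template is filled with epoch + 1.
        names.append(TEMPLATE.format(epoch=epoch + 1, loss=loss))
    return names


first_run = checkpoint_names(0, 3, [0.91, 0.75, 0.62])  # a fresh run of 3 epochs
resumed = checkpoint_names(3, 5, [0.55, 0.5])           # resumed for 2 more epochs
```

The resumed session continues the numbering at 004 instead of starting again at 001.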

Different accuracy by fit() and evaluate() in Keras with the same dataset

I wrote Keras code to train GoogleNet. However, the accuracy reported by fit() is 100%, yet evaluate() on the same training dataset gives only 25%, which is a huge discrepancy. Also, unlike the accuracy from fit(), the accuracy from evaluate() does not improve with more training; it stays at about 25%.
Does anyone have an idea of what is wrong here?
# Training dataset and labels are given. Load the GoogleNet model here.
from keras.models import load_model
model = load_model('FT_InceptionV3.h5')
# Training phase
model.fit(x=X_train,
          y=y_train,
          batch_size=5,
          epochs=20,
          validation_split=0,
          # callbacks=[tensorboard]
          )
# Testing phase
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=1)
print("Train loss=", train_loss, "Train accuracy", train_acc)
Training Result
Testing Result

After some digging into Keras issues, I found this.
The reason for this is that when you use fit, the weights are updated after each batch of the training data. The loss value reported by fit is therefore not the mean loss of the final model, but the mean of the losses of all the slightly different models used on each batch.
On the other hand, when you use evaluate, the same model is used on the whole dataset. This final model never appears in the loss reported by fit, since even on the last batch of training the computed loss is used to update the model's weights.
To sum everything up, fit and evaluate have two completely different behaviours.
Reference:-
Keras_issues_thread
Keras_official_doc
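A toy example may make the distinction above concrete. The sketch below is plain Python, not Keras: it fits a single weight w to made-up data with per-sample gradient steps, records each batch loss before the update (as a running training loss would), and then scores the final weight on the whole dataset (as evaluate would). The data, learning rate, and variable names are all assumptions for illustration.

```python
def squared_error(w, x, y):
    """Squared error of the one-parameter model y_hat = w * x."""
    return (w * x - y) ** 2


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]  # targets generated with a true weight of 2
w, lr = 0.0, 0.01

batch_losses = []
for x, y in zip(xs, ys):  # one epoch, batch size 1
    batch_losses.append(squared_error(w, x, y))  # loss measured BEFORE this batch's update
    grad = 2.0 * (w * x - y) * x
    w -= lr * grad

# Mean over losses from four different intermediate models.
fit_style_loss = sum(batch_losses) / len(batch_losses)
# Loss of the single final model over the whole dataset.
evaluate_style_loss = sum(squared_error(w, x, y) for x, y in zip(xs, ys)) / len(xs)
```

In this toy run the two numbers differ and the final model happens to score better; the point is only that they are computed from different models, not which one is larger.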
