My model has two outputs, and I want to monitor one of them to decide when to save my model.
Below is part of my code. The TensorFlow version is 2.0.
model = MobileNetBaseModel()()
model.compile(optimizer=tf.keras.optimizers.Adam(),
              metrics={"pitch_yaw_roll": "mae"},
              loss={"pitch_yaw_roll": compute_mse_loss,  # or "mse"
                    "total_logits": compute_cross_entropy_loss(num_classes=num_classes)},
              loss_weights={"pitch_yaw_roll": mse_weight, "total_logits": cross_entropy_weight})
file_path = os.path.join(checkpoint_path, "model.{epoch:2d}-{val_loss:.2f}.h5")
tf.keras.callbacks.ModelCheckpoint(filepath=file_path,
                                   monitor="val_loss",
                                   verbose=1,
                                   save_freq=save_freq,
                                   save_best_only=True)
The ModelCheckpoint callback defaults to monitor='val_loss'; how do I choose the value I need? I want to monitor {"pitch_yaw_roll": "mae"}.
If you want ModelCheckpoint to save according to another metric, use the key of that metric from the metrics dictionary you passed to .compile(metrics={...}, ...).
So, for example, if you would like to save only the epoch with the best "pitch_yaw_roll" result (best being the minimum value), you should use:
tf.keras.callbacks.ModelCheckpoint(filepath=file_path,
                                   monitor="val_pitch_yaw_roll",
                                   verbose=1,
                                   mode="min",
                                   save_freq=save_freq,
                                   save_best_only=True)
If you opt for "pitch_yaw_roll" instead of "val_pitch_yaw_roll", it will save according to the training value rather than the validation value.
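For completeness, the callback also has to be passed to fit(), and validation data must be supplied, otherwise no val_* keys exist to monitor. A minimal sketch (the dataset names, num_epochs, and the exact monitor key are assumptions; see the comment below about inspecting history.history):

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=file_path,
    monitor="val_pitch_yaw_roll",   # may be logged as "val_pitch_yaw_roll_mae" depending on the TF version
    mode="min",
    verbose=1,
    save_best_only=True)

model.fit(train_dataset,                # assumed dataset yielding both outputs
          validation_data=val_dataset,  # required so that val_* values are logged at all
          epochs=num_epochs,
          callbacks=[checkpoint_cb])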
Just adding to the answer above: I believe your checkpoint doesn't work because of an incorrect name for the value to monitor.
In general, the solution here is to have a peek into the history that your fit creates:
import pandas as pd

history = model.fit(...)
pd.DataFrame(history.history)
There you will find the names of the metrics you should use in the monitor argument.
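If you only need the names rather than the full table, printing the keys is enough:

print(list(history.history.keys()))  # the exact strings to pass to monitor=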
I'm training a model and using TensorFlow callbacks to save my training logs, and I have a ModelCheckpoint callback to save my model's weights.
During training, at every epoch it says "WARNING:tensorflow: Can save best model only with val_acc available, skipping". This is issue 1.
Here is the code I include in callbacks=[] during model.fit:
def create_tensorboard_callback(dir_name, experiment_name):
    """
    Creates a TensorBoard callback instance to store log files.

    Stores log files with the filepath:
      "dir_name/experiment_name/current_datetime/"

    Args:
      dir_name: target directory to store TensorBoard log files
      experiment_name: name of experiment directory (e.g. efficientnet_model_1)
    """
    log_dir = dir_name + "/" + experiment_name + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
    print(f"Saving TensorBoard log files to: {log_dir}")
    return tensorboard_callback
# Create ModelCheckpoint callback to save the model's progress
checkpoint_path = "model_checkpoints/cp.ckpt"
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                      monitor="val_acc",
                                                      save_best_only=True,  # SAVING BEST ONLY
                                                      save_weights_only=True,
                                                      verbose=0)
Code for fitting the model with callbacks:
history_101_food_classes_feature_extract = model.fit(
    train_data,
    epochs=3,
    steps_per_epoch=len(train_data),
    validation_data=test_data,
    validation_steps=int(0.15 * len(test_data)),
    callbacks=[create_tensorboard_callback("training_logs",
                                           "efficientnetb0_101_classes_all_data_feature_extract"),
               model_checkpoint])
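For reference, I can list which metric names actually end up in the logs with something like this (just a quick check):

# Check whether the metric is logged as "val_acc" or "val_accuracy";
# the string passed to monitor= has to match one of these keys exactly.
print(history_101_food_classes_feature_extract.history.keys())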
Also, I cloned my model and used cloned_model.load_weights(checkpoint_path), then evaluated both the original and the cloned model with model.evaluate(test_data). The original model scores 70+% accuracy, while the cloned model always returns the exact same accuracy. This is issue 2.
My guess was that I had previously trained and saved a very high-accuracy model, hence issue 1, where it refuses to save at every epoch. But my model_checkpoint path looks clean to me.
And if I did previously save a high-accuracy model to my checkpoint_path, then when I clone a new model using the weights loaded from that path, why would it give 0.54 accuracy every time and not something higher? (Issue 2)
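The cloning step is roughly the following (a simplified sketch, not the exact notebook code; the loss and metric names are assumptions):

cloned_model = tf.keras.models.clone_model(model)      # fresh, randomly initialised weights
cloned_model.load_weights(checkpoint_path)             # load weights from the checkpoint above
cloned_model.compile(loss="categorical_crossentropy",  # assumed; must re-compile before evaluate
                     optimizer=tf.keras.optimizers.Adam(),
                     metrics=["accuracy"])
cloned_model.evaluate(test_data)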
I need help. Let me know if you need more info from my side to solve this issue, happy to answer. Thanks. If you want to see the full code, here's the link to it.
https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/07_food_vision_milestone_project_1.ipynb
I am training a text classification model on a large dataset using the BERT classifier (bert-base-uncased) from the simpletransformers library. By default, simpletransformers reports mcc and eval_loss for evaluation during training and in the test (eval) phase. I was able to add extra metrics such as acc, f1, etc. for the test phase (by passing them to the eval_model function), but I don't know how to tell simpletransformers to report these metrics during the training phase as well. Is it possible to do the same thing with the train_model function?
It is worth mentioning that the eval_during_training option is True.
It prints the mcc and eval_loss of the training for each checkpoint (in eval_results.txt in outputs), and I need the other metrics to be reported at each checkpoint as well.
result, model_outputs, wrong_predictions = model.eval_model(eval_df, f1=f1_multiclass, acc=accuracy_score)
Thanks in advance
cheers
After searching the web, I couldn't find the answer to my question, so I started looking at the source code. It turns out it is way simpler than I thought: to include more metrics during training, you pass them just the way you pass them to the eval_model method. Here is sample code that shows how to feed extra metrics to the simpletransformers train_model and eval_model methods.
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score)

def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='weighted')

def prec_multiclass(labels, preds):
    return precision_score(labels, preds, average='weighted')

def recall_multiclass(labels, preds):
    return recall_score(labels, preds, average='weighted')

model.train_model(train_df, eval_df=test_df,
                  f1=f1_multiclass,
                  acc=accuracy_score,
                  prec=prec_multiclass,
                  recall=recall_multiclass,
                  cohen=cohen_kappa_score)

result, model_outputs, wrong_predictions = model.eval_model(test_df,
                                                            f1=f1_multiclass,
                                                            acc=accuracy_score,
                                                            prec=prec_multiclass,
                                                            recall=recall_multiclass,
                                                            cohen=cohen_kappa_score)
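Note that these extra metrics are only computed at each checkpoint if evaluation during training is enabled in the model args, roughly like this (the exact argument spelling may differ between simpletransformers versions; num_labels is assumed to be defined elsewhere):

from simpletransformers.classification import ClassificationModel

model = ClassificationModel("bert", "bert-base-uncased",
                            num_labels=num_labels,
                            args={"evaluate_during_training": True})  # cf. the eval_during_training option above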
I'm trying to use a hook in my DNNClassifier model using tensorflow.keras.callbacks.EarlyStopping but I have no idea what to put in monitor. The documentation is not exactly helpful here.
From looking at the code, a softmax cross-entropy is used as the loss function, but for DNNRegressor the loss node is dnn/head/weighted_loss/Sum, as per this thread. I have tried getting TensorBoard up and running, but I am not able to, and the import script from a saved model is equally defective on my machine.
Is there any way to figure out what the node of the DNNClassifier's loss is?
The monitor does not refer to a graph node or a layer, but to a loss or metric value. Indeed, any value that is present in your logs dictionary can be used: https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/python/keras/callbacks.py#L676
You can inspect the values you have in logs without debugging by using CSVLogger, for instance:

from tensorflow.keras.callbacks import CSVLogger

csv_logger = CSVLogger(filename=os.path.join(args.log_dir, 'train.csv'),
                       separator=',', append=False)
If you cannot write to a file, you can print everything you have in logs to stdout:

from tensorflow.keras.callbacks import LambdaCallback

mycallback = LambdaCallback(
    on_epoch_end=lambda epoch, logs: print('\n'.join(
        '{}: {}'.format(k, v) for k, v in logs.items())))
In case you do not have the metric in logs, you can use LambdaCallback to put it there. For instance:
eval_callback = LambdaCallback(
    on_epoch_end=lambda epoch, logs: logs.update({'metric_name': get_metric_value()}))
early_stopping = EarlyStopping(monitor='metric_name', min_delta=0.0,
                               patience=10, verbose=1, mode='min')
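Both callbacks then go in the callbacks list of the training call; the LambdaCallback that injects the value should come before the EarlyStopping that reads it, because callbacks are invoked in list order. A minimal sketch, assuming a Keras model.fit() training loop with placeholder data names:

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100,
          callbacks=[eval_callback, early_stopping])  # injector first, then EarlyStopping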
Is there any way to return the number of epochs after which the training was stopped in Keras when using the EarlyStopping callback?
I can get the log of the training and validation loss and compute the number of epochs myself using the patience parameter, but is there a more direct way?
Use the EarlyStopping.stopped_epoch attribute: keep the callback in a separate variable, say callback, and check callback.stopped_epoch after training has stopped.
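A minimal sketch (the monitored value, patience, and data names are just examples):

from tensorflow.keras.callbacks import EarlyStopping

callback = EarlyStopping(monitor='val_loss', patience=3)
model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[callback])
print(callback.stopped_epoch)  # 0 if training ran the full number of epochs without stopping early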
Subtracting the patience value from the total number of epochs - as suggested in this comment - might not work in some situations. For instance, if you set epochs=100 and patience=20, and the best accuracy/loss value is found at epoch 90, training will stop at epoch 100. So with this approach you would get a wrong number (100 - 20 = 80).
Moreover, as noted in this comment, EarlyStopping.stopped_epoch only gives you the epoch at which training stopped, NOT the epoch at which the best weights were saved. This matters in particular when you set restore_best_weights=True or rely on ModelCheckpoint to save the best model before stopping the training.
Therefore, my solution is to get the index of the best value in the model's history array. Assuming that the metric used is the validation accuracy, and relying on numpy, here is some code:
import numpy as np
model.fit(...)
hist = model.history.history['val_acc']
n_epochs_best = np.argmax(hist)
You can also leverage the History() callback to find out the number of epochs the model was trained for. For example:

from keras.callbacks import History, EarlyStopping

history = History()
callbacks = [history, EarlyStopping(monitor='val_loss', patience=5, verbose=1, min_delta=1e-4)]
history = model.fit_generator(...., callbacks=callbacks)
number_of_epochs_it_ran = len(history.history['loss'])
I'm trying to add some TensorBoard logging to a model which uses the new tf.estimator API.
I have a hook set up like so:
summary_hook = tf.train.SummarySaverHook(
    save_secs=2,
    output_dir=MODEL_DIR,
    summary_op=tf.summary.merge_all())

# ...

classifier.train(
    input_fn,
    steps=1000,
    hooks=[summary_hook])
In my model_fn, I am also creating a summary:

def model_fn(features, labels, mode):
    # ... model stuff, calculate the value of loss
    tf.summary.scalar("loss", loss)
    # ...
However, when I run this code, I get the following error from the summary_hook: "Exactly one of scaffold or summary_op must be provided." This is probably because tf.summary.merge_all() is not finding any summaries and is returning None, despite the tf.summary.scalar I declared in the model_fn.
Any ideas why this wouldn't be working?
Use tf.train.Scaffold() and pass tf.summary.merge_all() to it, as follows:
summary_hook = tf.train.SummarySaverHook(
    save_secs=2,
    output_dir=MODEL_DIR,
    scaffold=tf.train.Scaffold(summary_op=tf.summary.merge_all()))
Just for whoever has this question in the future: the selected solution doesn't work for me (see my comments on the selected solution).
Actually, with the TF 1.2 Estimator API, you don't need a summary_hook at all. I just have tf.summary.scalar("loss", loss) in the model_fn and run the code without a summary_hook; the loss is recorded and shown in TensorBoard. I'm not sure whether the TF API changed after this and similar questions were asked.
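A minimal model_fn along these lines (the toy network, feature key "x", and optimizer are placeholders, just to show where the summary call goes):

def model_fn(features, labels, mode):
    logits = tf.layers.dense(features["x"], units=2)   # toy model
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    tf.summary.scalar("loss", loss)                    # written to model_dir by the Estimator itself
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=MODEL_DIR)
estimator.train(input_fn, steps=1000)                  # no SummarySaverHook needed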
With TensorFlow r1.3:
Add your summary ops in your Estimator model_fn.
Example:
tf.summary.histogram(tensorOp.name, tensorOp)
If you feel that writing summaries may consume time and space, you can control the frequency of summary writing in your Estimator's run_config:
run_config = tf.contrib.learn.RunConfig()
run_config = run_config.replace(model_dir=FLAGS.model_dir)
run_config = run_config.replace(save_summary_steps=150)
Note: this will affect the overall summary-writing frequency for the TensorBoard logging of your estimator (tf.estimator.Estimator).
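The config then has to be passed to the Estimator constructor for it to take effect; with the core (non-contrib) API this looks roughly like:

run_config = tf.estimator.RunConfig().replace(save_summary_steps=150)
estimator = tf.estimator.Estimator(model_fn=model_fn,
                                   model_dir=FLAGS.model_dir,
                                   config=run_config)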