I'm training a model and using TensorFlow callbacks to save my training logs, and I have a ModelCheckpoint callback to save my model's weights.
During training, every epoch prints "WARNING:tensorflow: Can save best model only with val_acc available, skipping". This is issue 1.
Here is the code I include in callbacks=[] during model.fit.
def create_tensorboard_callback(dir_name, experiment_name):
  """
  Creates a TensorBoard callback instance to store log files.

  Stores log files with the filepath:
    "dir_name/experiment_name/current_datetime/"

  Args:
    dir_name: target directory to store TensorBoard log files
    experiment_name: name of experiment directory (e.g. efficientnet_model_1)
  """
  log_dir = dir_name + "/" + experiment_name + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  tensorboard_callback = tf.keras.callbacks.TensorBoard(
      log_dir=log_dir
  )
  print(f"Saving TensorBoard log files to: {log_dir}")
  return tensorboard_callback
# Create ModelCheckpoint callback to save model's progress
checkpoint_path = "model_checkpoints/cp.ckpt"
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                      monitor="val_acc",
                                                      save_best_only=True,  # SAVING BEST ONLY
                                                      save_weights_only=True,
                                                      verbose=0)
Code for fitting the model with callbacks:
history_101_food_classes_feature_extract = model.fit(train_data,
                                                      epochs=3,
                                                      steps_per_epoch=len(train_data),
                                                      validation_data=test_data,
                                                      validation_steps=int(0.15 * len(test_data)),
                                                      callbacks=[create_tensorboard_callback("training_logs",
                                                                                             "efficientnetb0_101_classes_all_data_feature_extract"),
                                                                 model_checkpoint])
Also, I cloned my model and used cloned_model.load_weights(checkpoint_path), then evaluated both the original and the cloned model with model.evaluate(test_data). The original model scores 70+% accuracy, while cloned_model returns the same accuracy (0.54) every time. This is issue 2.
My guess was that I had previously trained and saved a very high-accuracy model, hence issue 1 where it refuses to save at every epoch. But my model_checkpoint path looks clean to me.
And if I did previously save a high-accuracy model to my checkpoint_path, then when I clone a new model and load the weights from that path, why would it give 0.54 accuracy every time and not something higher? (Issue 2)
I need help. Let me know if you need more info from my side to solve this issue, happy to answer. Thanks. If you want to see the full code, here's the link to it.
https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/07_food_vision_milestone_project_1.ipynb
Related
So I am doing Transfer Learning with tensorflow and I want to be able to run
history = model.fit(...) # Run initial training with base_model.trainable = False
After the first training is done, I can fine-tune it by unfreezing some layers, so if the first session ran for 20 epochs, my next block of code will be:
# Train the model again for a few epochs
fine_tune_epochs = 10
total_epochs = len(history.epoch) + fine_tune_epochs
history_tuned = model.fit(train_set, validation_data=dev_set,
                          initial_epoch=history.epoch[-1], epochs=total_epochs,
                          verbose=2, callbacks=callbacks)
Basically it will take the epochs from history and will continue training from the last epoch and save these results in history_tuned
But I might want to train it again with more layers unfrozen, so I would run another session into history_tuned02 and keep chaining the epochs from each history so that my graphs look like a single run, like the image below.
As you can see from the graph, it all looks connected, but in reality it is two different training sessions: the first one where the model is frozen, and then the fine-tuning session. You can even tell where fine-tuning starts from the bump in performance.
The problem is, for me to do this I have to leave Jupyter open for days, because if I close it, all the variables are gone and I would need to train everything again, which would take insane amounts of time.
I tried using the dill package, but it would not work on history. I also tried %store history, but it would not work either for some reason, as you can see from the image below from a dummy notebook where I test things.
So is there a way to save the history variable to disk, close Jupyter, open it again, restore history, and continue my work? Even if I leave Jupyter and VS Code open until I finish with the model, crashes do happen.
Also, I use the checkpoint callback in TensorFlow so my weights are saved; restoring those is not a problem, but I do need history as well, if possible.
UPDATE:
When I use the CSVLogger callback as suggested and read it back with
history = pd.read_csv('demo/logs/hist.log')
then
history.head()
The output is
You can save your history in two ways:
The manual method:
Simply interrupt your training and save the history.history dictionary:
import pickle

with open('/history_dict', 'wb') as file:
    pickle.dump(history.history, file)
You can then reload it with:
history = pickle.load(open('/history_dict', 'rb'))
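To get the single connected graph from the question, one option is to concatenate the restored dictionary with the next session's history before plotting. A minimal sketch, assuming the logged keys are 'accuracy'/'val_accuracy' (adjust to your metric names) and reusing model, train_set, dev_set, total_epochs and callbacks from the question:
import pickle
import matplotlib.pyplot as plt

# Restore the dictionary that was pickled before closing the notebook
old_history = pickle.load(open('/history_dict', 'rb'))

# ... rebuild the model, restore the checkpointed weights, then fine-tune ...
history_tuned = model.fit(train_set, validation_data=dev_set,
                          initial_epoch=len(old_history['accuracy']),
                          epochs=total_epochs, callbacks=callbacks)

# Concatenate both sessions so the plot reads as one continuous run
acc = old_history['accuracy'] + history_tuned.history['accuracy']
val_acc = old_history['val_accuracy'] + history_tuned.history['val_accuracy']

plt.plot(acc, label='accuracy')
plt.plot(val_acc, label='val_accuracy')
plt.legend()
plt.show()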
The automated method:
You can create a simple callback that stores your history at the end of every epoch. So even if your training crashes, it has been saved automatically and can be restored.
The callback can be something like this:
from tensorflow import keras
import tensorflow.keras.backend as K
import os
import csv

my_dir = './model_dir/'  # where to save history (the directory must already exist)

class SaveHistory(keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        # Record the current learning rate alongside the other logged metrics
        if 'lr' not in logs.keys():
            logs.setdefault('lr', 0)
            logs['lr'] = K.get_value(self.model.optimizer.lr)

        # Write the header once, when the file does not exist yet
        if 'history.csv' not in os.listdir(my_dir):
            with open(os.path.join(my_dir, 'history.csv'), 'a') as f:
                content = csv.DictWriter(f, logs.keys())
                content.writeheader()

        # Append one row per epoch
        with open(os.path.join(my_dir, 'history.csv'), 'a') as f:
            content = csv.DictWriter(f, logs.keys())
            content.writerow(logs)

model.fit(..., callbacks=[SaveHistory()])
To reload the history saved as a .csv simply do:
import pandas as pd
history = pd.read_csv('history.csv')
Also, I think that besides the custom callback, you can save the history alongside your model checkpoints with a CSVLogger, like this:
history = model.fit(..., callbacks=[keras.callbacks.CSVLogger('history.csv')])
This can be loaded back with pandas as shown above.
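CSVLogger also takes an append argument, which helps with the multi-session workflow in the question. A minimal sketch (the fit arguments are placeholders taken from the question) that keeps adding rows to the same history.csv across the frozen run and the fine-tuning run:
from tensorflow import keras

# Re-create the logger with append=True in every session so earlier rows are kept
csv_logger = keras.callbacks.CSVLogger('history.csv', append=True)

# First (frozen) session writes the header and the first rows
history = model.fit(train_set, validation_data=dev_set, epochs=20,
                    callbacks=[csv_logger])

# ... unfreeze layers and recompile ...

# Fine-tuning session keeps appending to the same file
history_tuned = model.fit(train_set, validation_data=dev_set,
                          initial_epoch=history.epoch[-1], epochs=total_epochs,
                          callbacks=[csv_logger])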
Hi, I have tried to load my checkpoints but I get the following error:
" W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open ../codeOutputs/3DNewArchitectureWithRotation: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?"
This is the code I have used:
checkpoint_filepath = '../codeOutputs/3DNewArchitectureWithRotation'

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    monitor='val_loss',
    verbose=0,
    save_best_only=False,
    save_weights_only=False,
    mode='auto',
    save_freq='epoch',
    options=None,
    initial_value_threshold=None,
)

Model.load_weights(checkpoint_filepath)

BestRegressor = Model.fit(aaaiTrainImages, afTrainPorosity,
                          validation_data=(aaaiValidationImages, afValidationPorosity),
                          epochs=Epochs,
                          callbacks=[EarlyStop, model_checkpoint_callback],
                          verbose=2)
It seems the file type the checkpoints have been saved as is HDF document (application/x-hdf).
I would appreciate any help, as I spent many days training my model and it suddenly crashed, so it would be really helpful if I could skip retraining it up to the point I had already reached.
I was faced with the same issue. As others have pointed out, the issue derives from the argument save_weights_only=False which creates a directory of files. You can still call model.load_weights() and depersist the model, but you get that unpleasant error. One approach I took was to use the following to depersist the model without any errors/warnings.
import tensorflow as tf
m = tf.keras.models.load_model('/path/to/checkpoint/dir')
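From there you can carry on as usual. A short sketch reusing the questioner's variable names (aaaiTrainImages, EarlyStop, etc. come from the question above) and assuming the checkpoint directory is the one from the question:
import tensorflow as tf

# Restore the full model (architecture, weights and, usually, optimizer state)
m = tf.keras.models.load_model('../codeOutputs/3DNewArchitectureWithRotation')

# Resume training with the restored model instead of calling Model.load_weights()
BestRegressor = m.fit(aaaiTrainImages, afTrainPorosity,
                      validation_data=(aaaiValidationImages, afValidationPorosity),
                      epochs=Epochs,
                      callbacks=[EarlyStop, model_checkpoint_callback],
                      verbose=2)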
I am using Keras with Tensorflow backend. My work involves comparing the performances of several models such as Inception, VGG, Resnet etc on my dataset.
I would like to plot the training accuracies of several models in one graph. I am trying to do this in Tensorboard, but it is not working.
Is there a way of plotting multiple graphs in one plot using Tensorboard or is there some other way I can do this?
Thank you
If you are using the SummaryWriter from tensorboardX or pytorch 1.2, you have a method called add_scalars:
Call it like this:
my_summary_writer.add_scalars(f'loss/check_info', {
    'score': score[iteration],
    'score_nf': score_nf[iteration],
}, iteration)
And it will show up like this:
Be careful that add_scalars will mess with the organisation of your runs: it will add multiple entries to this list (and thus create confusion):
I would recommend that instead you just do:
my_summary_writer.add_scalar(f'check_info/score', score[iter], iter)
my_summary_writer.add_scalar(f'check_info/score_nf', score_nf[iter], iter)
Here is a way to have multiple graphs in one plot grouped into one single run, using add_custom_scalar on PyTorch.
What I get:
The corresponding complete running code:
from torch.utils.tensorboard import SummaryWriter
import math

layout = {
    "ABCDE": {
        "loss": ["Multiline", ["loss/train", "loss/validation"]],
        "accuracy": ["Multiline", ["accuracy/train", "accuracy/validation"]],
    },
}

writer = SummaryWriter()
writer.add_custom_scalars(layout)

epochs = 10
batch_size = 50

for epoch in range(epochs):
    for index in range(batch_size):
        global_batch_index = epoch * batch_size + index

        train_loss = math.exp(-0.01 * global_batch_index)
        train_accuracy = 1 - math.exp(-0.01 * global_batch_index)
        writer.add_scalar("loss/train", train_loss, global_batch_index)
        writer.add_scalar("accuracy/train", train_accuracy, global_batch_index)

        validation_loss = train_loss + 0.1
        validation_accuracy = train_accuracy - 0.1
        writer.add_scalar("loss/validation", validation_loss, global_batch_index)
        writer.add_scalar("accuracy/validation", validation_accuracy, global_batch_index)

writer.close()
Please note that the tab to use, at the top left of the window, is not SCALARS but CUSTOM SCALARS.
You can definitely plot scalars like the loss and validation accuracy: tf.summary.scalar("loss", cost), where cost is a tensor such as cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), reduction_indices=1)).
Now you write a summary for each value you want to plot, and then you may want to merge all these summaries into a single op with merged_summary_op = tf.summary.merge_all().
The next step is to run this merged summary in the session: summary = sess.run(merged_summary_op).
After you run merged_summary_op you have to write the result using a summary writer: summary_writer.add_summary(summary, epoch_number), where summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph()).
Now open the terminal or cmd and run the command tensorboard --logdir="logpath".
Then open http://0.0.0.0:6006/ in your web browser.
You can refer to the following link: https://github.com/jayshah19949596/Tensorboard-Visualization-Freezing-Graph
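Putting the steps above together, here is a rough end-to-end sketch in TF 1.x style (pred, y, train_op, train_feed, logs_path and num_epochs are placeholders, not from the original post):
import tensorflow as tf

cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), reduction_indices=1))
tf.summary.scalar("loss", cost)
merged_summary_op = tf.summary.merge_all()

summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch_number in range(num_epochs):
        # Run one training step and the merged summaries together
        _, summary = sess.run([train_op, merged_summary_op], feed_dict=train_feed)
        # Write the summaries so TensorBoard can plot them against the epoch
        summary_writer.add_summary(summary, epoch_number)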
Other things you can plot are the weights and inputs.
You can also display images on TensorBoard.
I think if you are using Keras with TensorFlow 1.5 then using TensorBoard is easy, because in TensorFlow 1.5 Keras is included as the official high-level API.
I am sure you can plot different accuracies on the same graph for the same model with different hyper-parameters by using different FileWriter instances with different log paths.
Check the image below:
I don't know if you can plot the accuracy of different models on the same graph out of the box... but you can write a program that does that.
Maybe you can write the summary information of different models to different directories and then point TensorBoard to the parent directory to plot the accuracy of different models on the same graph, as suggested in the comment by @RobertLugg.
==================== UPDATED =================
I have tried saving the accuracy and loss of different models to different directories and then pointing TensorBoard at the parent directory, and it works: you get the results of the different models in the same graph.
Just save each run in a different folder under a main folder and open TensorBoard on the main folder.
for i in range(x):
    tensorboard = TensorBoard(log_dir='./logs/' + 'run' + str(i), histogram_freq=0,
                              write_graph=True, write_images=False)
    model.fit(X, Y, epochs=150, batch_size=10, callbacks=[tensorboard])
From the terminal, run tensorboard as such:
tensorboard --logdir=logs
My model has two outputs, and I want to monitor one of them to decide when to save my model.
Below is part of my code. The version of TensorFlow is 2.0
model = MobileNetBaseModel()()
model.compile(optimizer=tf.keras.optimizers.Adam(),
              metrics={"pitch_yaw_roll": "mae"},
              loss={"pitch_yaw_roll": compute_mse_loss,  # or "mse"
                    "total_logits": compute_cross_entropy_loss(num_classes=num_classes)},
              loss_weights={"pitch_yaw_roll": mse_weight, "total_logits": cross_entropy_weight})

file_path = os.path.join(checkpoint_path, "model.{epoch:2d}-{val_loss:.2f}.h5")
tf.keras.callbacks.ModelCheckpoint(filepath=file_path,
                                   monitor="val_loss",
                                   verbose=1,
                                   save_freq=save_freq,
                                   save_best_only=True)
The default is monitor='val_loss' in the ModelCheckpoint callback; how do I choose the metric I need? I want to monitor {"pitch_yaw_roll": "mae"}.
If you want ModelCheckpoint to save according to another metric value, use the key of that metric in the .compile(metrics={...}, ...) metrics dictionary.
So for example, if you would like to save only the best "pitch_yaw_roll" epoch result (best being the minimum value) you should use
tf.keras.callbacks.ModelCheckpoint(filepath=file_path,
                                   monitor="val_pitch_yaw_roll",
                                   verbose=1,
                                   mode="min",
                                   save_freq=save_freq,
                                   save_best_only=True)
If you opt for "pitch_yaw_roll" instead of "val_pitch_yaw_roll", it will save according to the training value and not according to the validation value.
Just adding to the comment above, I believe your checkpoint doesn't work because of an incorrect name for the value to monitor.
In general, the solution here is to have a peek into the history that your fit creates:
history = model.fit(...)
pd.DataFrame(history.history)
There you will find the names of the metrics you should use in the monitor argument.
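As an illustration only (the exact names depend on your outputs and metrics, so treat these as assumptions to verify against your own output), for a two-output model like the one in the question the logged keys are usually prefixed with the output name:
history = model.fit(...)
print(history.history.keys())
# Typically something along the lines of:
# dict_keys(['loss', 'pitch_yaw_roll_loss', 'total_logits_loss', 'pitch_yaw_roll_mae',
#            'val_loss', 'val_pitch_yaw_roll_loss', 'val_total_logits_loss',
#            'val_pitch_yaw_roll_mae'])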
I'm trying to use a hook in my DNNClassifier model using tensorflow.keras.callbacks.EarlyStopping but I have no idea what to put in monitor. The documentation is not exactly helpful here.
From looking at the code, a softmax cross-entropy is used as the loss function, but for DNNRegressor the loss node is dnn/head/weighted_loss/Sum as per this thread. I have tried getting TensorBoard up and running but I am not able to, and the import script from a saved model is equally defective on my machine.
Is there any way to figure out what the node of the DNNClassifier's loss is?
The monitor does not refer to a graph node or a layer, but to a loss or metric value. Indeed any value can be used that is present in your logs dictionary: https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/python/keras/callbacks.py#L676
You can inspect the values you have in logs without debugging, by using CSVLogger for instance:
csv_logger = CSVLogger(filename=os.path.join(args.log_dir, 'train.csv'), separator=',', append=False)
If you cannot write to a file, you can print out everything you have in logs to stdout:
mycallback = LambdaCallback(on_epoch_end=lambda epoch, logs: print('\n'.join(['{}: {}'.format(k, v) for k, v in logs.items()])))
In case you do not have the metric in logs, you can use LambdaCallback to put it there. For instance:
eval_callback = LambdaCallback(on_epoch_end=lambda epoch, logs: logs.update({'metric_name': get_metric_value()}))
early_stopping = EarlyStopping(monitor='metric_name', min_delta=0.0, patience=10, verbose=1, mode='min')
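One practical detail (a usage sketch; train_data and val_data are placeholders): callbacks run in the order they appear in the list, so the LambdaCallback that injects the metric should come before EarlyStopping, otherwise 'metric_name' will not yet be in logs when EarlyStopping checks it.
model.fit(train_data,
          validation_data=val_data,
          epochs=100,
          # eval_callback first, so 'metric_name' is in logs before EarlyStopping reads it
          callbacks=[eval_callback, early_stopping])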