Finding all checkpoint paths in TensorFlow - python

So far I have only used checkpoint saving and loading in TensorFlow to load the last checkpoint. The code I usually use for this is along these lines:
ckpt = tf.train.get_checkpoint_state(load_dir)
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(session, ckpt.model_checkpoint_path)
else:
    tf.gfile.DeleteRecursively(load_dir)
    tf.gfile.MakeDirs(load_dir)
However, in my latest experiment I'm saving a checkpoint every 1000 iterations, and I want to run an evaluation script on all of the checkpoints, e.g. to show how different validation metrics progress. Is there an easy way of getting all checkpoints in TensorFlow, or will I need to loop over all of the names myself using os?

The ckpt object in your code snippet is a CheckpointState protocol buffer. Instead of accessing only the most recent model path (ckpt.model_checkpoint_path), you can iterate over all of them with something like:
for model_path in ckpt.all_model_checkpoint_paths:
    saver.restore(session, model_path)
    # Do the evaluation using the restored model
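If the checkpoint state file is missing or does not list every checkpoint still on disk, the directory scan mentioned in the question is a possible fallback. The following is only a sketch using the standard library. The helper name list_checkpoint_prefixes is hypothetical, and it assumes the files are named like "model.ckpt-1000.index", with the global step after the final dash.

```python
import os
import re


def list_checkpoint_prefixes(ckpt_dir):
    """Return checkpoint prefixes found in ckpt_dir, sorted by step number."""
    # Assumption: each checkpoint has an index file named "<prefix>-<step>.index".
    pattern = re.compile(r"^(?P<prefix>.+-(?P<step>\d+))\.index$")
    found = []
    for name in os.listdir(ckpt_dir):
        match = pattern.match(name)
        if match:
            step = int(match.group("step"))
            found.append((step, os.path.join(ckpt_dir, match.group("prefix"))))
    # Sort numerically by step so that 2000 comes before 10000.
    return [path for _, path in sorted(found)]
```

Each returned prefix could then be passed to saver.restore(session, prefix) in the same way as the paths in the loop above.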

Related

Pytorch-Lightning ModelCheckpoint get paths of saved checkpoints

I am using PyTorch Lightning with, among other callbacks, a ModelCheckpoint that saves models with a formatted filename like filename="model_{epoch}-{val_acc:.2f}".
Later in the process I want to load these checkpoints again; for simplicity, let's say I only want the best N, kept via save_top_k=N.
As the filename is dynamic, I wonder how I can retrieve the checkpoints easily. Is there a built-in attribute, on the callback or via the trainer, that gives the saved checkpoints?
For example, something like
checkpoint_callback.get_top_k_paths()
I know I can do it with glob and model_dir, but I'm wondering if there is a one-line solution built in somewhere.
You can retrieve the best model path from the checkpoint callback after training:
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
# retrieve the best checkpoint after training
checkpoint_callback = ModelCheckpoint(dirpath='my/path/')
trainer = Trainer(callbacks=[checkpoint_callback])
model = ...
trainer.fit(model)
print(checkpoint_callback.best_model_path)
To find all the checkpoints, you can list the files in the dirpath where the checkpoints are saved.
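Listing the directory could look roughly like the sketch below. The helper name top_k_checkpoints is hypothetical, and the code assumes the metric value is embedded in each file name as "val_acc=<value>" (for example "model_epoch=3-val_acc=0.91.ckpt"), so it is worth checking against the names your run actually produces. Recent Lightning versions also appear to keep a best_k_models dictionary on the callback, which may make the scan unnecessary; the sketch does not rely on it.

```python
import re
from pathlib import Path


def top_k_checkpoints(dirpath, k, metric="val_acc"):
    """Return up to k .ckpt paths in dirpath, highest metric value first."""
    # Assumption: file names contain "<metric>=<number>", e.g. "val_acc=0.91".
    pattern = re.compile(re.escape(metric) + r"=(\d+(?:\.\d+)?)")
    scored = []
    for path in Path(dirpath).glob("*.ckpt"):
        match = pattern.search(path.name)
        if match:
            scored.append((float(match.group(1)), path))
    # Higher is better here; reverse the order for metrics such as a loss.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in scored[:k]]
```

For instance, top_k_checkpoints('my/path/', k=3) would return the three files with the highest val_acc found in that directory, and files whose names do not contain the metric are skipped.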

How do I restore the Generator of a GAN from a Tensorflow Model?

I'm trying to restore the trained Generator of a Generative Adversarial Network from a TensorFlow model (the metagraph and the checkpoint).
I'm new to TensorFlow and Python, so I'm not sure if what I'm doing makes sense. I have already tried importing the metagraph from the meta file and restoring the variables from the checkpoint, but I'm not sure what to do next. My goal is to restore the trained Generator from the last checkpoint and then use it to generate new data from noise input.
Here's a link to a drive containing the model files:
https://drive.google.com/drive/folders/1MaELMC4aOroSQlMJ32J3_ff3wxiBT_Fq?usp=sharing
So far I have tried the following and it seems to be loading the graph:
# import the graph from the file
imported_graph = tf.train.import_meta_graph("../../models/model-9.meta")
# list all the operations in the graph
for op in tf.get_default_graph().get_operations():
    print(op.name)
# run the session
with tf.Session() as sess:
    # restore the saved variables
    imported_graph.restore(sess, "../../models/model-9")
However, I'm not sure what to do next. Is it possible to run only the trained generator using these files? How can I access it?
In the Tensorflow 2 doc, they save both the generator and the discriminator. However, they do not explain how to only restore the generator.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(generator_optimizer=generator_optimizer,
                                 discriminator_optimizer=discriminator_optimizer,
                                 generator=generator,
                                 discriminator=discriminator)
And then restore with
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
From https://www.tensorflow.org/tutorials/generative/dcgan#save_checkpoints

how to convert tensorflow .meta .data .index to .ckpt file?

As we know, when TensorFlow saves a checkpoint it writes three files, e.g.:
model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
I looked at the Faster R-CNN repository and found that it has an evaluation.py script for evaluating the pre-trained model, but the script only accepts a .ckpt file (like the pre-trained models they provide).
I have run some fine-tuning from their pre-trained model.
Is there a way to convert the .data-00000-of-00001, .index and .meta files into one single .ckpt file so that I can run the evaluate.py script on the checkpoint?
(I also noticed that the pre-trained models provided in the repo consist of only one .ckpt file. How is that possible when the save-checkpoint function generates three files?)
These three files
model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
are the more recent checkpoint format, while a single model.ckpt file is the previous checkpoint format.
Converting between them would be like converting a Nintendo Switch to an NES, or a three-disc CD bundle to a single ROM cartridge.
You don't need to convert. You can save the variables in the network using:
saver = tf.train.Saver()
saver.save(sess, 'path of save/fileName.ckpt')
To restore the network for reuse later or in another script, use:
saver = tf.train.Saver()
saver.restore(sess, tf.train.latest_checkpoint('path of save/'))
sess.run(....)
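Important points:
The graph built for sess must be the same between the first and later runs (coherent structure).
tf.train.latest_checkpoint needs the path of the folder containing the saved files, not an individual file path; it returns the checkpoint prefix that saver.restore expects.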
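In the newer format, "model.ckpt" acts as a shared prefix rather than a file on disk, and scripts that ask for a .ckpt path will often accept that prefix; whether a particular evaluation script does is an assumption to verify. As a sketch only, the hypothetical helpers below derive the prefix from any of the three file names and check that the index file and at least one data shard exist for it.

```python
import glob
import os
import re


def checkpoint_prefix(filename):
    """Strip the .index, .meta or .data-XXXXX-of-XXXXX suffix from a file name."""
    return re.sub(r"\.(index|meta|data-\d{5}-of-\d{5})$", "", filename)


def is_complete_checkpoint(prefix):
    """True if an index file and at least one data shard exist for this prefix."""
    has_index = os.path.isfile(prefix + ".index")
    # glob.escape protects any special characters in the directory or prefix.
    has_data = bool(glob.glob(glob.escape(prefix) + ".data-*-of-*"))
    return has_index and has_data
```

With the files listed in the question, checkpoint_prefix("model.ckpt.index") gives "model.ckpt", which is the string to try wherever a checkpoint path is requested.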

Tensorflow: differences on how to save a trained model

I noticed that in Python there are several ways of saving a trained model.
However, I am not able to see the real difference between them.
Checkpoints
saver = tf.train.Saver()
saver.save(session, output_path)
Freezing
from tensorflow.python.framework import graph_util
input_graph_def = graph.as_graph_def()
output_graph_def = graph_util.convert_variables_to_constants(
    session, input_graph_def, output_nodes_names)
with tf.gfile.GFile(output_graph, "wb") as output_graph_file:
    output_graph_file.write(output_graph_def.SerializeToString())
SavedModelBuilder
builder = tf.saved_model.builder.SavedModelBuilder(output_path)
builder.add_meta_graph_and_variables(
    session,
    [tf.saved_model.tag_constants.SERVING],
    clear_devices=True)
builder.save()
Let's consider different scenarios: evaluation/inference, fine-tuning, serving API, export to other frameworks.
What's the best way of saving a model for each of these situations? Are there rules about when to use one method or the other?
Thanks
This is not an exhaustive answer, but with modern (mid 2018) TensorFlow, you probably only need Checkpoints and SavedModels.
As pointed out in
https://www.tensorflow.org/get_started/checkpoints
"Checkpoints - a format dependent on the code that created the model"
"SavedModel - a format independent of the code that created the model"
"Freezing" largely got folded into and replaced by SavedModel.
In your training code and while you still want to retain the capability to continue training/fine-tuning, checkpoints are the way to go, as all the relevant code/state to not only train but also monitor that training is kept around between the checkpoints and your code.
When you move over to the "serving" side (i.e., consumption), you add all the metadata needed to use the model, strip out the unneeded training elements, and export to SavedModel.
I have not personally tried to export to other frameworks from TensorFlow, just into it, so I cannot offer a good opinion on what would be best for that case.

What are tensorflow summaries? How exactly are they utilized when using a tf model to make predictions?

Posting here as I couldn't find an explicit answer in TensorFlow's documentation. I am curious about the actual purpose of the summary files in TensorFlow. After training, I load the TensorFlow model (the model file and the meta file) that was saved by:
tf.train.Saver()
There seems to be no need for me to keep the summary files apart from logging training information; I can use my models to predict without referencing the summaries.
Are the summary files merely logs of the training runs (accuracy and loss)? Is there any other purpose that these files serve?
