Pytorch-Lightning ModelCheckpoint get paths of saved checkpoints - python

I am using PyTorch Lightning with, among other callbacks, a ModelCheckpoint that saves models under a formatted filename such as `filename="model_{epoch}-{val_acc:.2f}"`.
Later in the process I want to load these checkpoints again; for simplicity, let's say I only want the best N, kept via save_top_k=N.
Because the filename is dynamic, I wonder how I can retrieve the checkpoints easily. Is there a built-in attribute, on the callback or via the trainer, that gives the paths of the saved checkpoints?
For example, something like
checkpoint_callback.get_top_k_paths()
I know I can do it with glob and model_dir, but I am wondering whether there is a built-in one-line solution somewhere.

You can retrieve the best model path from the checkpoint callback after training:
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
# retrieve the best checkpoint after training
checkpoint_callback = ModelCheckpoint(dirpath='my/path/')
trainer = Trainer(callbacks=[checkpoint_callback])
model = ...
trainer.fit(model)
print(checkpoint_callback.best_model_path)
# with save_top_k=N, checkpoint_callback.best_k_models maps each kept path to its monitored score
To find all the checkpoints, list the files in the dirpath where the checkpoints are saved.
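Listing the files in dirpath can be sketched with the standard library alone. This is a minimal sketch, not part of Lightning: the helper name `list_checkpoints` is hypothetical, and it assumes the checkpoints keep the default `.ckpt` extension.

```python
from pathlib import Path


def list_checkpoints(dirpath):
    """Return every *.ckpt file directly inside dirpath, sorted by name.

    Hypothetical helper; assumes the default '.ckpt' extension is in use.
    """
    return sorted(Path(dirpath).glob("*.ckpt"))
```

Calling `list_checkpoints('my/path/')` after `trainer.fit(model)` would then return the saved checkpoint paths as `Path` objects, which can be passed on as `str(path)` where a string is required.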

Related

Save model with weights using state dict Pytorch

I have a PyTorch model class and its state dict with the weights.
I'd like to save the model directly with its weights in a .pt file using torch.save(model, PATH), but that simply saves the state dict again.
How do I save the model with the loaded weights in it?
What I'm currently doing:
lin_model = ModelClass(args)
lin_model.load_state_dict(torch.load('state_dict.pt'))
torch.save(lin_model, PATH)
I want the newly saved model to be a fully loaded .pt file. Please help me here, thanks in advance.
According to the PyTorch documentation, torch.save(model, PATH) saves the entire model together with a reference to its class. The catch is that this does not always work. The saved file is in pickle format, and pickle does not store the class definition itself, only the import path of the file that contains the model class. Loading can therefore break in various ways when the file is used in other projects or after the code has been moved.
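The point about pickle can be illustrated without PyTorch. The sketch below uses a hypothetical stand-in class, `TinyModel`, rather than a real `torch.nn.Module`; it only shows that the pickled bytes name the class while containing none of its code.

```python
import pickle


class TinyModel:
    """Hypothetical stand-in for a model class (not a torch.nn.Module)."""

    def __init__(self, weights):
        self.weights = weights

    def forward(self, x):
        return sum(w * x for w in self.weights)


payload = pickle.dumps(TinyModel([0.5, 1.5]))

# The class is stored by name, as a reference to where it can be imported from.
has_class_name = b"TinyModel" in payload
# The method itself is not stored; only instance attributes are.
has_method_name = b"forward" in payload

# Unpickling works here only because TinyModel is still importable by that name.
restored = pickle.loads(payload)
```

Because only the reference is stored, loading the same bytes in a project where the class lives under a different module path would fail, which is why saving the state dict and rebuilding the class explicitly tends to be the more portable choice.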

How to convert a pretrained tensorflow pb frozen graph into a modifiable h5 keras model?

I have been searching for a way to do this for a long time and cannot find an answer. Most threads I found are from people wanting to do the opposite.
Backstory:
I am experimenting with some pre-trained models provided by the tensorflow/models repository. The models are saved as .pb frozen graphs. I want to fine-tune some of these models by changing the final layers to suit my application.
Hence, I want to load the models inside a Jupyter notebook as normal Keras h5 models.
How can I do that?
Do you have a better way to do so?
Thanks.
It seems all you have to do is download the model files and store them in a directory, for example c:\models, and then load the model from there.
import os
import tensorflow as tf
# load_model expects a SavedModel directory (or an .h5 file)
model = tf.keras.models.load_model(r'c:\models')
model.summary() # prints out the model layers
# generate code to modify the model as you typically do for transfer learning
# compile the changed model
# train the model
# save the trained model as a .h5 file
save_dir = r'path to the directory you want to save the model to'
model_identifier = 'abcd.h5' # for abcd use whatever identification you want
save_path = os.path.join(save_dir, model_identifier)
model.save(save_path)

How to save and load google NLP reformer model

I'm working with the recent NLP model from Google.
I have read a few posts, but mostly I'm playing with the Colab example, which has all the model creation steps and testing functions. My problem now is that, since the model takes a long time to train even on Google's TPUs, I need to save the trained model. My guess is that it works similarly to the GPT-2 model, in the sense that the model can be trained over several sessions, since training can be stopped at any moment:
# This will take at least 30 minutes to run to completion, but can safely
# be interrupted by selecting "Runtime > Interrupt Execution"
But I have not found an example of how to save and load the model once it is trained. With GPT-2, a new directory was created automatically for each new model, and to use the model it was only necessary to point to that directory. For this one I can't find how to load a previously trained model.
EDIT:
In the notebook I saw this code:
# Set up a Trainer.
output_dir = os.path.expanduser('~/train_dir/')
!rm -f ~/train_dir/model.pkl # Remove old model
trainer = trax.supervised.Trainer(
    model=trax.models.ReformerLM,
    loss_fn=trax.layers.CrossEntropyLoss,
    optimizer=trax.optimizers.Adam,
    lr_schedule=trax.lr.MultifactorSchedule,
    inputs=trax.supervised.inputs.Inputs(my_inputs),
    output_dir=output_dir,
    has_weights=True)
This deletes the previous model. I looked into that directory and found this:
I used pickle to load this model.pkl file, which I also copied to my Gdrive folder:
import pickle
with open('model.pkl', 'rb') as handle:
    reformer_model = pickle.load(handle)
reformer_model
But this is just a dictionary with the weights, not a model that can be used directly:
If you remove the line "!rm -f ~/train_dir/model.pkl # Remove old model" and change output_dir to point to the folder that holds the saved model, the Trainer will load that model and continue training from where you left off. If there is no model in that directory, it will create a new one.
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
%cd /content/gdrive/My\ Drive/
# Train tiny model with Trainer.
output_dir = "CornellMovieDialog/Model/"
trainer = trax.supervised.Trainer(
    model=tiny_transformer_lm,
    loss_fn=trax.layers.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adafactor, # Change optimizer params here.
    lr_schedule=trax.lr.MultifactorSchedule, # Change lr schedule here.
    inputs=copy_inputs,
    output_dir=output_dir)

how to convert tensorflow .meta .data .index to .ckpt file?

As we know, when TensorFlow saves a checkpoint it writes 3 files, for example:
model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
I looked at the Faster R-CNN repository and found that it has an evaluation.py script that evaluates a pre-trained model, but the script only accepts a .ckpt file (like the pre-trained models they provide).
I have run some fine-tuning starting from their pre-trained model.
Now I wonder whether there is a way to convert the .data-00000-of-00001, .index and .meta files into one single .ckpt file, so that I can run the evaluate.py script on the checkpoint.
(I also noticed that the pre-trained models provided in the repo consist of only one .ckpt file. How can that be when the save-checkpoint function generates 3 files?)
These three files
model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
are the more recent checkpoint format, while a single
model.ckpt
file is the previous checkpoint format.
Converting from the new format to the old one would be like converting a Nintendo Switch to an NES, or a three-disc CD bundle to a single ROM cartridge.
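You don't need to convert. You can save the variables in the network using
# TF 1.x API; under TF 2.x it is available as tf.compat.v1.train.Saver
saver = tf.train.Saver()
saver.save(sess, 'path of save/fileName.ckpt')
To restore the network for reuse later or in another script, use:
saver = tf.train.Saver()
saver.restore(sess, tf.train.latest_checkpoint('path of save/'))
sess.run(....)
Important points:
The graph behind sess must have the same structure in the first and later runs (coherent structure).
tf.train.latest_checkpoint needs the path of the folder containing the saved files, not an individual file path. saver.restore then receives the checkpoint prefix it returns (here 'path of save/fileName.ckpt'), not one of the .data, .index or .meta files.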
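Since restoring works from the shared prefix rather than from any single file, the value to hand to a script that expects a ".ckpt" path is usually that prefix. The sketch below finds the prefixes in a folder using only the standard library; the helper name `checkpoint_prefixes` is hypothetical, and it assumes the script passes its argument on to saver.restore unchanged.

```python
from pathlib import Path


def checkpoint_prefixes(directory):
    """Return one checkpoint prefix per *.index file found in `directory`.

    Hypothetical helper: 'model.ckpt.index' yields the prefix 'model.ckpt'.
    """
    return sorted(str(p.with_suffix("")) for p in Path(directory).glob("*.index"))
```

For the three files in the question this returns a single entry ending in `model.ckpt`, even though no file with exactly that name exists on disk.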

Finding all checkpoints path in Tensorflow

So far I have used saving and loading checkpoints in TensorFlow only for loading the last checkpoint. Usually the code I use for this is along these lines:
ckpt = tf.train.get_checkpoint_state(load_dir)
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(session, ckpt.model_checkpoint_path)
else:
    tf.gfile.DeleteRecursively(load_dir)
    tf.gfile.MakeDirs(load_dir)
However, in my latest experiment I'm saving a checkpoint every 1000 iterations, and I want to run an evaluation script on all of the checkpoints, e.g. to show how different validation metrics progress. Is there an easy way of getting all checkpoints in TensorFlow, or will I need to loop over all of the names using os?
The ckpt object in your code snippet is a CheckpointState protocol buffer. Instead of accessing only the most recent model path (ckpt.model_checkpoint_path), you can iterate over all of them using something like:
for model_path in ckpt.all_model_checkpoint_paths:
    saver.restore(session, model_path)
    # Do the evaluation using the restored model
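To plot how a metric progresses, each restored checkpoint also needs its iteration number. A minimal sketch, assuming the checkpoints were saved with a global_step so that each path ends in "-<step>" (for example model.ckpt-1000); the helper name and the sample paths are hypothetical.

```python
import re


def step_from_checkpoint_path(path):
    """Return the trailing step number of a checkpoint path, or None if absent."""
    match = re.search(r"-(\d+)$", path)
    return int(match.group(1)) if match else None


# Hypothetical paths, in the form the loop above would iterate over.
paths = ["run/model.ckpt-2000", "run/model.ckpt-1000", "run/model.ckpt"]
steps = [step_from_checkpoint_path(p) for p in paths]
```

Inside the loop, `step_from_checkpoint_path(model_path)` could serve as the x-axis value for the evaluation result of that checkpoint.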
