Error in saving checkpoint of tensorflow model in colab


I am training a transformer model for a chatbot. I want to save checkpoints in Colab so that I can reuse the trained model after training is done.
I followed the model-saving tutorial from TensorFlow, but it keeps giving me the following error.
UnimplementedError: File system scheme '[local]' not implemented (file: 'training_1/cp.ckpt_temp/part-00000-of-00001') [Op:MultiDeviceIteratorInit]
This is my attempt at saving the checkpoints.
import os
import tensorflow as tf
checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
# Create checkpoint callback
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)
# Fit model (model, dataset and EPOCHS are defined elsewhere)
model.fit(dataset, epochs=EPOCHS, callbacks=[cp_callback])
In some runs the model trains for about 5 epochs before this error occurs, while in others the error occurs within just one or two epochs. I am using a TPU to train the model.
What causes this issue and is there a way to get rid of it?
Any help will be highly appreciated.

Related

How to continue printing the intermediate results in Jupyter after reconnection?

Today, I used Jupyter to run a deep learning model remotely.
After the browser had been disconnected for some time, I reconnected to the running kernel, but Jupyter did not continue to print the intermediate output.
From the GPU usage and the Jupyter command-line output, I can see that the kernel is still running.
Is there any way I can continue to observe the intermediate output of the kernel?
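One possible workaround, sketched here under assumptions rather than taken from the thread, is to write progress to a log file in addition to printing it, so the output can be read back after reconnecting. The logger name, the temporary log path, and the placeholder loss values below are all hypothetical.

```python
import logging
import os
import tempfile

# Hypothetical log location; in practice pick a path that outlives the browser session.
log_path = os.path.join(tempfile.mkdtemp(), "training.log")

logger = logging.getLogger("training_progress")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
logger.addHandler(handler)

# Stand-in for a training loop: write one line per epoch.
for epoch in range(1, 4):
    fake_loss = 1.0 / epoch  # placeholder value, not a real metric
    logger.info("epoch=%d loss=%.4f", epoch, fake_loss)

handler.flush()

# After reconnecting, the file can be read from any cell or terminal.
with open(log_path) as f:
    lines = f.read().splitlines()
print(lines[-1])
```

Because the kernel keeps running while the browser is disconnected, the file keeps growing even when the notebook output does not.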

The lifetime of a Google Colab session with the browser open is usually 12 hours.
The best way to keep your progress is to checkpoint your deep learning model, so that you do not lose the most recently trained weights.
This is an example of how you can use a checkpoint callback while training your deep learning model; more examples and details can be found here.
import math
import os
import tensorflow as tf
# Include the epoch in the file name (uses `str.format`)
checkpoint_path = "training_2/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
batch_size = 32
# An integer save_freq counts batches, so work out the number of batches per epoch
n_batches = math.ceil(len(train_images) / batch_size)
# Create a callback that saves the model's weights every 5 epochs
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    save_freq=5*n_batches)
# Create a new model instance (create_model is defined elsewhere)
model = create_model()
# Save the weights using the `checkpoint_path` format
model.save_weights(checkpoint_path.format(epoch=0))
# Train the model with the new callback
model.fit(train_images,
          train_labels,
          epochs=50,
          batch_size=batch_size,
          callbacks=[cp_callback],
          validation_data=(test_images, test_labels),
          verbose=0)
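As a rough sketch of the arithmetic involved: when `save_freq` is an integer, `ModelCheckpoint` counts batches rather than epochs, so saving every N epochs means passing N times the number of batches per epoch. The helper name `batches_per_epoch` and the training-set size of 2000 are assumptions for illustration only.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Batches run per epoch; the last batch may be partial."""
    return math.ceil(num_samples / batch_size)

num_samples = 2000  # hypothetical training-set size
batch_size = 32
n_batches = batches_per_epoch(num_samples, batch_size)

# Integer save_freq that corresponds to "every 5 epochs"
save_freq = 5 * n_batches

# How the file name template expands for a given epoch number
checkpoint_path = "training_2/cp-{epoch:04d}.ckpt"
example_name = checkpoint_path.format(epoch=5)

print(n_batches, save_freq, example_name)
```

Passing `5 * batch_size` instead would mean 160 batches here, roughly two and a half epochs.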

How to save a TensorFlow model after a certain amount of epochs?

I have a model that trains on images. I want to know how to save the model after a certain number of epochs, so that I have multiple reference points rather than just one saved model at the end. Also, how do I specify the folder or directory in which to save the model?
Here's an example. Where would I add the new code to save after a number of epochs? (Side question: would the model.save command at the end work? I haven't started training, and I don't want to get to the end only to find that the model is not saving.)
model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Adam optimizer
# loss function will be categorical cross entropy
# evaluation metric will be accuracy
step_size_train = train_generator.n // train_generator.batch_size
model.fit_generator(generator=train_generator,
                    steps_per_epoch=step_size_train,
                    epochs=15)
# Raw string, so the backslashes in the Windows path are not read as escape sequences
model.save(r'C:\Users\Omar\Desktop\trainedmodel.h5')

You can use the Keras ModelCheckpoint callback. Here is the code:
checkpoint = keras.callbacks.ModelCheckpoint('model{epoch:08d}.h5', period=5)
Pass it to fit_generator through the callbacks argument:
model.fit_generator(generator=train_generator,
                    steps_per_epoch=step_size_train,
                    epochs=15,
                    callbacks=[checkpoint])
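To choose the folder, put the directory in front of the file name template. The sketch below shows how such a template expands and at which epochs a save would happen. It assumes one save every `period` epochs with 1-based epoch numbers in the file name, which may differ between Keras versions. The helper `saved_epochs` is hypothetical, and `PureWindowsPath` is used only so the Windows path from the question is formatted the same way on any platform.

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are kept literally. In a normal string,
# "\U" starts a Unicode escape and "\t" is a tab character.
save_dir = PureWindowsPath(r"C:\Users\Omar\Desktop")

# Template to pass as the ModelCheckpoint file path;
# {epoch:08d} is zero-padded to 8 digits.
filepath_template = str(save_dir / "model{epoch:08d}.h5")

def saved_epochs(total_epochs, period):
    """Epochs at which a save would happen, assuming one save every `period` epochs."""
    return [e for e in range(1, total_epochs + 1) if e % period == 0]

epochs = saved_epochs(total_epochs=15, period=5)
filenames = [filepath_template.format(epoch=e) for e in epochs]

for name in filenames:
    print(name)
```

This also bears on the side question: the path passed to `model.save` needs a raw string or doubled backslashes, because `'\U'` inside a normal string literal is a syntax error in Python 3.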

Training a keras model using TPU pods?

I was wondering if anyone has an example of using a keras model on a TPU pod?
I have a model creating method which returns a keras model which is compiled within a TPU strategy scope, as recommended by many examples on using TPUs with keras. This works with v3-8 but gives an error when tried with more cores (specifically v3-32):
with strategy.scope():
    keras_model = create_model()
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
    keras_model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
When running model.fit, it fails with the following error:
Failed copying input tensor from /job:worker/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:1/device:CPU:0 in order to run DatasetFromGraph: FetchOutputs node : not found [Op:DatasetFromGraph]
The model inputs are NumPy arrays. Is a tensorflow.data.Dataset perhaps required?

Keras load model after saving makes random predictions in a new python session

I am using tensorflow version '2.0.0' and keras version '2.3.0' to develop the model. Here's how I saved the model:
seed = 1234
random.seed(seed)
np.random.seed(seed)
tf.compat.v1.random.set_random_seed(seed)
I then save the entire model as instructed here:
model.save('some_model_name.h5')
I am getting an accuracy of about 95% during training. When I load the model from a different python session, like:
# Recreate the exact same model
new_model = load_model('some_model_name.h5', custom_objects={'SeqSelfAttention': SeqSelfAttention})
score = new_model.evaluate([x_img_train, x_txt_train], y_train, verbose=2)
print("%s: %.2f%%" % (new_model.metrics_names[1], score[1]*100))
The accuracy now is about 4%. Please note that I have batch norm and dropout layers. How can I make the predictions of my model consistent across different sessions?

First, I downgraded TensorFlow to version 1.13.1 because of stability issues with 2.0.0.
Second, I had to ensure a few things before I could achieve some level of reproducibility:
Using the Adagrad optimizer instead of Adam gave me performance comparable to the training session. With Adam, the predictions showed high variance every time I loaded the session.
Loading the architecture from JSON and then loading the model weights gave me different results from saving and loading the weights only. The former approach seemed to produce performance comparable to training.
Training inside a tf.Session, saving it, and reloading it in a new Python session did the trick.
There was no variation in the results with or without dropout or batch norm.
Please note that following these steps gave me some level of consistency, although the results are not 100% reproducible. If you're facing a similar issue, perhaps these insights could help.

After loading the model in a new kernel instance, make sure to configure the losses and metrics again with .compile(), in the same way you did before saving.
For example:
old_model = tf.keras.Sequential([ ... ])
old_model.compile(loss = 'mean_squared_error', optimizer = 'sgd', metrics = ['accuracy'])
old_model.fit(train_ds, validation_data=valid_ds, epochs=3)
old_model.evaluate(test_ds)
old_model.save('some_model_name.h5')
Then in the new kernel:
from tensorflow.keras.models import load_model
new_model = load_model("some_model_name.h5")
new_model.compile(loss = 'mean_squared_error', optimizer = 'sgd', metrics = ['accuracy'])
new_model.evaluate(test_ds) # should be the same now

When to call compile while training a TensorFlow (2.0) model in an incremental fashion?

I am writing a neural network to train incrementally (not online). Here is a snippet of the code:
output = create_model()
model = Model(inputs=values, outputs=output)
if start_epoch > 1:
    weights_list = load_model_from_pickle()
    model.set_weights(weights_list)
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(data, label, epochs=1, verbose=1, batch_size=1024, shuffle=False)
In essence, I want to load previously trained weights and train for a few more epochs. I read in an SO reply that calling compile changes the weights. Is there any other way to do it? Does it make sense to set the weights after calling compile? Will the answer change if I run my model in a multi-GPU setting?

You need to compile the model once; after training, when you reload the model, you don't need to compile it again. Read more here.
The compile function defines the optimizer, loss function, and metrics you want. It does not change any weights. For more detailed information, read here.
