How can I catch CUDA_ERROR_LAUNCH_FAILED in my code?

I have created a custom class for an NLP project.
This class fits a Keras model, but training often fails with CUDA_ERROR_LAUNCH_FAILED (apparently caused by memory issues). The error can occur after hours of training.
As I can't find a fix for the CUDA error, I tried to implement a workaround:
I added a ModelCheckpoint to save the "best" model at each epoch.
If an error occurs during training, I reload the best model and clear the GPU memory.
I resume training with the reloaded model.
I tested this by simulating errors (KeyboardInterrupt), and it works.
However, I can't find a way to catch CUDA_ERROR_LAUNCH_FAILED. It seems to just stop the Python process (a low-level error?).
Does anyone know how to catch CUDA_ERROR_LAUNCH_FAILED?
Code snippet:
def custom_fit(self, some_arguments):
    ...
    try:
        fit_history = self.model.fit(
            x_train,
            y_train_dummies,
            batch_size=self.batch_size,
            epochs=self.epochs,
            validation_split=None,
            validation_data=validation_data,
            callbacks=callbacks,  # Includes ModelCheckpoint
            verbose=1,
        )
    except:  # CUDA_ERROR_LAUNCH_FAILED is not caught here
        ...
        # Reload model
        # Clear GPU memory
        # Resume training (recursive call to this function)
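
One possible workaround, assuming the error really does terminate the interpreter as the question suspects: an except clause cannot run inside a process that has already died, so the retry loop would have to live in a parent process. The sketch below uses only the standard library. The helper name run_with_restarts and the script name train.py are hypothetical, and the training script is assumed to reload the best checkpoint itself when it starts.

```python
import subprocess
import sys


def run_with_restarts(cmd, max_restarts=3):
    """Run cmd as a child process and rerun it while it exits abnormally.

    Returns the number of attempts used. Raises RuntimeError once the
    child has failed max_restarts + 1 times in a row.
    """
    for attempt in range(1, max_restarts + 2):
        # A crash in native code shows up here as a non-zero (or, on
        # POSIX, negative) return code instead of a Python exception.
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return attempt
        print(f"attempt {attempt} exited with code {returncode}", file=sys.stderr)
    raise RuntimeError(f"{cmd!r} still failing after {max_restarts} restarts")


# Hypothetical usage: train.py is assumed to reload the best checkpoint
# (if one exists) before calling model.fit().
# run_with_restarts([sys.executable, "train.py"])
```

Because each attempt is a fresh process, the GPU memory held by the crashed run is released by the operating system rather than by cleanup code in Python.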

Related

How to get the Keras history object when you abort training?

When I train with the TensorFlow 2.0 / Keras APIs, I usually do something like this:
model = tf.keras.Model(inputs, outputs)
history = model.fit(x, y, batch_size=64, epochs=10)
But sometimes things don't work out as planned and I need to abort with Ctrl-C or by pressing stop in a Jupyter notebook.
How can I still get the history object when I abort training early? I can't find any detailed documentation for how to get history.

As answered by @today in a comment, the history object is also available as an attribute of the model:
model.history.history['val_loss']

Restoring correct version of tensorflow

A few weeks ago, I was working on a project and installed an older version of tensorflow to try to fix a problem I was having. It didn't work as I had hoped, so I pip installed the newest version of tensorflow, but now I regularly get error messages that seem related to tensorflow being out of date. They don't stop program execution, but they are there. As far as I know, I have the most recent version installed, but I think I must be missing something.
This is an example of one of the errors I'm getting: WARNING: tensorflow: Can save best model only with val_loss available, skipping.
This happens when I try to save a Keras model using ModelCheckpoint. I get a different message when I use model_save(). The issues seem to arise whenever I try to save any model in any way. If anyone has any advice, I would love it.
I'm using Python on Google Colab. Please let me know if you need more info from me.
Edit: Adding code for ModelCheckpoint:
save=ModelCheckpoint("/content/drive/My Drive/Colab Notebooks/cavity data/Frequency Model.h5", save_best_only=True, verbose=1)
It was then passed to model.fit() like this:
model.fit(X_train, Y_train, epochs=500, callbacks=[save, stop], verbose=1)

The default monitor for ModelCheckpoint is the validation loss, "val_loss".
As the warning suggests, the key "val_loss" is missing because you didn't pass any validation data to model.fit().
Either specify validation_split or validation_data in model.fit(), or monitor the training loss or accuracy instead, as in the example below.
monitor = "accuracy" # or "loss"
save = ModelCheckpoint("/content/drive/My Drive/Colab Notebooks/cavity data/Frequency Model.h5", monitor=monitor, save_best_only=True, verbose=1)
model.fit(X_train, Y_train, epochs=500, callbacks=[save, stop], verbose=1)
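
To make the warning concrete, here is a simplified stand-in for the check that save_best_only performs. It is not the Keras source, just a sketch of the behaviour described above: the callback looks up the monitored key in the metrics logged for the epoch and skips saving when the key is absent. The function name should_save is hypothetical, and the sketch assumes lower is better, which suits a loss but not an accuracy.

```python
def should_save(logs, monitor="val_loss", best=float("inf")):
    """Return (save_now, new_best) for one epoch's logged metrics.

    Simplified illustration only: assumes a smaller monitored value
    is an improvement.
    """
    current = logs.get(monitor)
    if current is None:
        # The monitored key was never logged, so there is nothing to
        # compare: this is the situation the warning reports.
        return False, best
    if current < best:
        return True, current
    return False, best


# Without validation data, only training metrics are logged.
epoch_logs = {"loss": 0.40, "accuracy": 0.80}
print(should_save(epoch_logs))                  # monitor="val_loss" is missing
print(should_save(epoch_logs, monitor="loss"))  # training loss is available
```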

keras model.save() isn't saving

I have a keras NN that I want to train and validate using two sets of data, and then test the ultimate performance of using a third set. In order to avoid having to rerun the training every time I restart my google colab runtime or want to change my test data, I want to save the final state of the model after training in one script and then load it again in another script.
I've looked everywhere and it seems that model.save("content/drive/My Drive/Directory/ModelName", save_format='tf') should do the trick, but even though it outputs INFO:tensorflow:Assets written to: content/drive/My Drive/Directory/ModelName/assets nothing appears in my Google Drive, so I assume it isn't actually saving.
Please can someone help me solve this issue?
Thanks in advance!

The standard way of saving and restoring your model's state after Google Colab terminates your connection is to use ModelCheckpoint. This is a Keras callback that runs after each epoch and can save your model, for instance, every time there is an improvement. Here are the steps needed to accomplish what you want:
Connect to Google Drive
Use this code in order to connect to Google Drive:
from google.colab import drive
drive.mount('/content/gdrive')
Give access to Google Colab
You'll then be presented with a link. Open it, authorize Google Colab, and copy the given code into the text box.
Define your ModelCheckpoint
This is how you could define your ModelCheckpoint's callback:
from keras.callbacks import ModelCheckpoint
filepath="/content/gdrive/My Drive/MyCNN/epochs:{epoch:03d}-val_acc:{val_acc:.3f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
Use it as a callback while you're training the model
Then pass the callback to model.fit() so that it runs after each epoch and saves the model's state.
model.fit(X_train, y_train,
          batch_size=64,
          epochs=epochs,
          verbose=1,
          validation_data=(X_val, y_val),
          callbacks=callbacks_list)
Load the model after Google Colab terminated
Finally, after your session has been terminated, you can load your previous model's state by running the following code. Don't forget to re-define your model first; at this stage you only load the weights.
model.load_weights('/content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5')
Hope that this answers your question.
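
The filepath passed to ModelCheckpoint above is an ordinary Python format string that gets filled in with the epoch number and the logged metrics. The short sketch below shows how the template expands; the epoch and val_acc values are made up to match the file name used in the load_weights step.

```python
filepath = "/content/gdrive/My Drive/MyCNN/epochs:{epoch:03d}-val_acc:{val_acc:.3f}.hdf5"

# Hypothetical values for one epoch, chosen to match the file loaded above.
name = filepath.format(epoch=47, val_acc=0.905)
print(name)
```

Note that str.format raises KeyError when a placeholder has no matching value, so the metric name in the template has to match a metric that is actually logged. Depending on the Keras version and the metrics passed to compile(), that name may be val_accuracy rather than val_acc.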

Python crashes when saving Keras model

I am building a model in Keras that contains roughly 4.2M parameters. When I try to save the model using ModelCheckpoint or using model.save('best_model.hdf5'), Python crashes.
The model runs without any issues when I comment out the code that saves the model, so there isn't any other issue that could be causing Python to crash.
My reasoning here is that a large number of parameters is causing python to crash.
I have looked but haven't been able to find any solution.
Are there any alternatives available to save my model and reuse it in Keras? Or is there a way to fix this issue?
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_squared_error', verbose=1, save_best_only=True, mode='max')
model.save(filepath)
Python doesn't report any error. This is all that pops up:
[image: PythonErrorPopup]

How to begin counting in ModelCheckpoint from an epoch greater than 1

I am working on a project where Keras ModelCheckpoint is used. This callback class seems to cover my needs except for one small detail.
I have not found a way to pass a starting value for the epoch numbering, which I need when resuming training. I often train for some epochs and then resume training later. I would like to have a consistent model saving pattern like:
model.{epoch:03d}-{loss:.2f}.hdf5
but with numbering beginning from the epoch the previous training stopped and not from the beginning.
The current command I use is this:
ckp_saver = ModelCheckpoint(checkpoint_dir + "/model.{epoch:03d}-{loss:.2f}.hdf5", monitor='loss', verbose=0,
                            save_best_only=False, save_weights_only=True, mode='auto', period=1)
Is there any way to pass this information to ModelCheckpoint? The solution I found is to edit the Keras code and add an argument containing the number of epochs already trained (defaulting to 0 if not passed, so as not to break any other code), but I would prefer to avoid this if it's not necessary. Any other ideas?
The original code was taken from this file here.
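
One option that avoids editing Keras: model.fit() accepts an initial_epoch argument, and ModelCheckpoint fills the {epoch} placeholder from the epoch index that fit() reports, so the numbering continues from that value. The sketch below recovers the last trained epoch from existing checkpoint file names. The helper last_saved_epoch is hypothetical, and it assumes the files follow the model.{epoch:03d}-{loss:.2f}.hdf5 pattern from the question.

```python
import re

# Matches names such as "model.012-0.31.hdf5" and captures the epoch number.
CHECKPOINT_RE = re.compile(r"model\.(\d+)-.*\.hdf5$")


def last_saved_epoch(filenames):
    """Return the highest epoch number found in filenames, or 0 if none match."""
    epochs = []
    for name in filenames:
        match = CHECKPOINT_RE.search(name)
        if match:
            epochs.append(int(match.group(1)))
    return max(epochs, default=0)


# Hypothetical usage when resuming (x, y, and model are assumed to exist):
# initial_epoch = last_saved_epoch(os.listdir(checkpoint_dir))
# model.fit(x, y, epochs=initial_epoch + 10, initial_epoch=initial_epoch,
#           callbacks=[ckp_saver])
```

In fit(), epochs is the index of the final epoch rather than the number of additional epochs, which is why the usage comment adds the extra epochs to initial_epoch.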
