I am building a model in Keras that contains roughly 4.2M parameters. When I try to save the model using ModelCheckpoint or using model.save('best_model.hdf5'), Python crashes.
The model runs without any issues when I comment out the code that saves the model, so nothing else seems to be causing Python to crash.
My reasoning is that the large number of parameters is what's causing Python to crash.
I have looked but haven't been able to find any solution.
Are there any alternatives available to save my model and reuse it in Keras? Or is there a way to fix this issue?
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_squared_error', verbose=1, save_best_only=True, mode='max')
model.save(filepath)
Python doesn't print any error. This is all that pops up:
[Screenshot: a generic Python crash dialog]
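One workaround worth trying, sketched under the assumption that the crash is specific to the full-model HDF5 save: persist the architecture as JSON and the weights separately, which goes through a different code path. The filenames below are illustrative.

# Save architecture and weights separately (filenames are illustrative).
with open('best_model.json', 'w') as f:
    f.write(model.to_json())                 # architecture only
model.save_weights('best_model_weights.h5')  # weights only

# To reuse the model later:
from keras.models import model_from_json
with open('best_model.json') as f:
    model = model_from_json(f.read())
model.load_weights('best_model_weights.h5')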
I have created a custom class for an NLP project.
This class tries to fit a Keras model, but I often get CUDA_ERROR_LAUNCH_FAILED during training (it seems to be caused by memory issues). These errors can occur after hours of training.
As I can't find a fix for the CUDA error, I tried to implement a workaround:
I added a ModelCheckpoint to save the "best" model at each epoch
If an error occurs during training, I reload the best model & clear the GPU memory.
I resume the training with the reloaded model
I tried this by simulating errors (KeyboardInterrupt), and it works.
However, I can't find a way to catch the CUDA_ERROR_LAUNCH_FAILED error. It seems to just stop the Python process (a low-level error?).
Does anyone know how to catch these CUDA_ERROR_LAUNCH_FAILED errors?
Code snippet:
def custom_fit(self, some_arguments):
    ...
    try:
        fit_history = self.model.fit(
            x_train,
            y_train_dummies,
            batch_size=self.batch_size,
            epochs=self.epochs,
            validation_split=None,
            validation_data=validation_data,
            callbacks=callbacks,  # Includes ModelCheckpoint
            verbose=1,
        )
    except:  # CUDA_ERROR_LAUNCH_FAILED is not caught here
        ...
        # Reload model
        # Clear GPU memory
        # Resume training (recursive call to this function)
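Since CUDA_ERROR_LAUNCH_FAILED can take down the whole Python process, no except clause inside that process may ever see it. A sketch of one possible workaround is to supervise the training from a parent process; here train_once.py is a hypothetical script that wraps custom_fit() and relies on the ModelCheckpoint to persist progress:

import subprocess
import sys

# Relaunch the training script until it exits cleanly.
while True:
    result = subprocess.run([sys.executable, "train_once.py"])
    if result.returncode == 0:
        break  # training completed normally
    # Nonzero exit code: the child likely died (e.g. CUDA_ERROR_LAUNCH_FAILED).
    # train_once.py is expected to reload the best checkpoint when it starts.
    print("Training process died with code {}; restarting.".format(result.returncode))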
A few weeks ago, I was working on a project and installed an older version of TensorFlow to try to fix a problem I was having. It didn't work as I had hoped, so I pip installed the newest version of TensorFlow, but now I'm regularly getting warning messages that I assumed were related to TensorFlow being out of date. They don't stop program execution, but they are there. As far as I know, I have the most recent version installed, but I think I must be missing something. This is an example of one of the warnings I'm getting:

WARNING:tensorflow:Can save best model only with val_loss available, skipping.

This happens when I try to save a Keras model using ModelCheckpoint. I get a different message when I use model.save(). It seems the issues arise whenever I try to save any model in any way. If anyone has any advice, I would love it.
I'm using Python on Google Colab. Please let me know if you need more info from me.
Edit: Adding code for ModelCheckpoint:
save = ModelCheckpoint("/content/drive/My Drive/Colab Notebooks/cavity data/Frequency Model.h5", save_best_only=True, verbose=1)
It was then passed to model.fit() like this:
model.fit(X_train, Y_train, epochs=500, callbacks=[save, stop], verbose=1)
The default monitor for ModelCheckpoint is the validation loss or "val_loss".
As the warning suggests, the key "val_loss" is missing because you didn't use validation data in model.fit().
Either specify the validation split or validation data in model.fit(), or just use the training loss or accuracy as the monitor for ModelCheckpoint, as in my example below.
monitor = "accuracy" # or "loss"
save = ModelCheckpoint("/content/drive/My Drive/Colab Notebooks/cavity data/Frequency Model.h5", monitor=monitor, save_best_only=True, verbose=1)
model.fit(X_train, Y_train, epochs=500, callbacks=[save, stop], verbose=1)
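Alternatively, to keep monitoring "val_loss", hold out part of the training data in model.fit(); the 0.2 split below is just an example value:

model.fit(X_train, Y_train, epochs=500, validation_split=0.2,
          callbacks=[save, stop], verbose=1)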
I have a Keras NN that I want to train and validate using two sets of data, and then test on a third set to measure its final performance. To avoid having to rerun the training every time I restart my Google Colab runtime or want to change my test data, I want to save the final state of the model after training in one script and then load it again in another script.
I've looked everywhere and it seems that model.save("content/drive/My Drive/Directory/ModelName", save_format='tf') should do the trick, but even though it outputs INFO:tensorflow:Assets written to: content/drive/My Drive/Directory/ModelName/assets nothing appears in my Google Drive, so I assume it isn't actually saving.
Please can someone help me solve this issue?
Thanks in advance!
The standard way of saving and retrieving your model's state after Google Colab terminates your connection is to use a feature called ModelCheckpoint. This is a Keras callback that runs after each epoch and saves your model, for instance whenever there's an improvement. Here are the steps needed to accomplish what you want:
Connect to Google Drive
Use this code in order to connect to Google Drive:
from google.colab import drive
drive.mount('/content/gdrive')
Give access to Google Colab
You'll then be presented with a link; open it and authorize Google Colab by copying the given code into the text box.
Define your ModelCheckpoint
This is how you could define your ModelCheckpoint's callback:
from keras.callbacks import ModelCheckpoint
filepath="/content/gdrive/My Drive/MyCNN/epochs:{epoch:03d}-val_acc:{val_acc:.3f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
Use it as a callback while you're training the model
Then you need to tell your model to run this callback after each epoch so that it saves the model's state.
model.fit(X_train, y_train,
          batch_size=64,
          epochs=epochs,
          verbose=1,
          validation_data=(X_val, y_val),
          callbacks=callbacks_list)
Load the model after Google Colab terminated
Finally, after your session is terminated, you can load your previous model's state by simply running the following code. Don't forget to re-define your model first and only load the weights at this stage.
model.load_weights('/content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5')
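In other words, the reload step looks roughly like this (create_model() is a hypothetical stand-in for however you build your architecture; it's not part of the original answer):

model = create_model()  # must rebuild the exact architecture that was trained
model.load_weights('/content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5')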
Hope that this answers your question.
I've searched around for answers regarding Keras's load_model, but I still have a question.
I am following this model really closely (https://github.com/experiencor/keras-yolo2), and am training on a custom dataset.
I have done the training, which gives me a yolov2.h5 file: basically model weights to load into the Keras model. But I am encountering some problems with loading the model.
When I load the model (in a separate file, separate.py):
model = load_model('file_dir/yolov2.h5')
I first encounter this issue:
NameError: name 'tf' is not defined
I then searched around and modified my code to add custom objects, like so:
model = load_model('file_dir/yolov2.h5', custom_objects={'tf':tf})
This clears the first error but results in another:
ValueError: Unknown loss function : custom_loss
I used the custom_loss function from yolov2 (https://github.com/experiencor/keras-yolo2/blob/master/frontend.py), so I tried to solve it with:
from frontend import YOLO
model = load_model('file_dir/yolov2.h5', custom_objects={'tf': tf, 'custom_loss': YOLO.custom_loss})
But I ran into another error:
TypeError: custom_loss() missing 1 required positional argument
I'm rather stuck here because I have no idea how to pass in the parameters for custom_loss. I'd appreciate some help with this (I don't particularly understand this part, since I'm loading my model in a different Python script, separate.py). Thank you so much!
(Edit: This fix doesn't work for me either)
model = load_model('file_dir/yolov2.h5', compile=False)
To resolve this problem, since you already have the network definition at hand, save only the trained weights (similar to what the Keras ModelCheckpoint callback does).
For testing, build the model, with no need to compile it, and then load the trained weights using model.load_weights('path/to/saved/weights').
You can also pass by_name=True if you build the network in a different way; in that case you should keep the layer names the same.
Another option is to set the weights manually: load the .h5 file with h5py, e.g. h5py.File('path/to/weights', mode='r') (have a look at how Keras does it), then match the layer names of the model with those of the loaded weights.
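A rough sketch of that weights-only workflow (build_yolo_model() is a hypothetical stand-in for however you construct the architecture, e.g. with the keras-yolo2 code, and the weights filename is illustrative):

# In the training script: save only the weights, not the full model.
model.save_weights('file_dir/yolov2_weights.h5')

# In separate.py: rebuild the same architecture, then load the weights.
model = build_yolo_model()  # hypothetical constructor for your network
model.load_weights('file_dir/yolov2_weights.h5')

# If the network is built differently, match layers by name instead:
# model.load_weights('file_dir/yolov2_weights.h5', by_name=True)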
I'm using Keras to train a CNN using the fit_generator function.
It seems to be a known issue that TensorBoard doesn't show histograms and distributions in this setup.
Did anybody figure out a way to make it work anyway?
There is no easy way to just plug it in with one line of code, you have to write your summaries by hand.
The good news is that it's not difficult and you can use the TensorBoard callback code in Keras as a reference.
(There is also a version 2 ready for TensorFlow 2.x.)
Basically, write a function, e.g. write_summaries(model), and call it whenever you want to write your summaries (e.g. just after your fit_generator()).
Inside your write_summaries(model) function, use tf.summary.histogram and the other tf.summary functions to log the data you want to see in TensorBoard.
If you don't know exactly how, check the official tutorial and this great example of MNIST with summaries.
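For instance, here is a minimal sketch of such a function, assuming TensorFlow 2.x (the log directory and the step argument are illustrative):

import tensorflow as tf

writer = tf.summary.create_file_writer("logs/manual")

def write_summaries(model, step):
    # Log a histogram of every weight tensor so TensorBoard can plot
    # its distribution over training.
    with writer.as_default():
        for weight in model.weights:
            tf.summary.histogram(weight.name, weight, step=step)
    writer.flush()

# e.g. call once per epoch, or right after fit_generator() returns:
# write_summaries(model, step=epoch)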
I believe bartgras's explanation is superseded in more recent versions of Keras (I'm using Keras 2.2.2). To get histograms in TensorBoard, all I did was the following (where bg is a data-wrangling class that exposes a generator via bg.training_batch(); bg.validation_batch(), however, is NOT a generator):
from datetime import datetime
import keras

# Filesystem-safe, unique run name so each run gets its own log directory
NAME = "Foo_{}".format(datetime.now().isoformat(timespec='seconds')).replace(':', '-')
tensorboard = keras.callbacks.TensorBoard(
    log_dir="logs/{}".format(NAME),
    histogram_freq=1,  # write weight histograms every epoch
    write_images=True)
callbacks = [
    tensorboard
]
history = model.fit_generator(
    bg.training_batch(),
    validation_data=bg.validation_batch(),
    epochs=EPOCHS,
    steps_per_epoch=bg.steps_per_epoch,
    validation_steps=bg.validation_steps,
    verbose=1,
    shuffle=False,
    callbacks=callbacks)