Restoring correct version of tensorflow - python

A few weeks ago, while working on a project, I installed an older version of TensorFlow to try to fix a problem I was having. It didn't work as I had hoped, so I pip installed the newest version of TensorFlow, but now I regularly get messages that I assume are related to TensorFlow being out of date. They don't stop program execution, but they are there. As far as I know, I have the most recent version installed, so I think I must be missing something. This is an example of one of the messages I'm getting: WARNING: tensorflow: Can save best model only with val_loss available, skipping. It appears when I try to save a Keras model using ModelCheckpoint. I get a different message when I use model.save(). It seems the issues arise whenever I try to save a model in any way. If anyone has any advice, I would love to hear it.
I'm using Python on Google Colab. Please let me know if you need more info from me.
Edit: Adding code for ModelCheckpoint:
save=ModelCheckpoint("/content/drive/My Drive/Colab Notebooks/cavity data/Frequency Model.h5", save_best_only=True, verbose=1)
it was then called in model.fit() like this:
model.fit(X_train, Y_train, epochs=500, callbacks=[save, stop], verbose=1)

The default monitor for ModelCheckpoint is the validation loss or "val_loss".
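As the warning suggests, the key "val_loss" is missing because you didn't pass any validation data to model.fit(). The message is about your training setup, not a sign that TensorFlow is out of date.
Either pass validation_split or validation_data to model.fit(), or use the training loss or accuracy as the monitor for ModelCheckpoint, as in my example below.
monitor = "accuracy" # or "loss"; "accuracy" is only logged if the model was compiled with metrics=["accuracy"]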
save = ModelCheckpoint("/content/drive/My Drive/Colab Notebooks/cavity data/Frequency Model.h5", monitor=monitor, save_best_only=True, verbose=1)
model.fit(X_train, Y_train, epochs=500, callbacks=[save, stop], verbose=1)
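To make the warning concrete, here is a framework-free sketch of the decision a save-best-only checkpoint has to make at the end of each epoch. BestOnlySaver is a hypothetical, simplified class written for illustration, not the Keras source: if the monitored key is absent from the epoch's logs there is nothing to compare, so it warns and skips; otherwise it saves only on improvement.

```python
import math
import warnings


class BestOnlySaver:
    """Hypothetical, simplified stand-in for ModelCheckpoint(save_best_only=True)."""

    def __init__(self, monitor="val_loss", mode="min"):
        self.monitor = monitor
        self.mode = mode
        self.best = math.inf if mode == "min" else -math.inf
        self.saved_epochs = []

    def on_epoch_end(self, epoch, logs):
        current = logs.get(self.monitor)
        if current is None:
            # No validation data -> no "val_loss" key -> nothing to compare, so skip.
            warnings.warn(f"Can save best model only with {self.monitor} available, skipping.")
            return
        improved = current < self.best if self.mode == "min" else current > self.best
        if improved:
            self.best = current
            self.saved_epochs.append(epoch)  # a real callback would write the file here


# Logs as fit() might report them WITHOUT validation data: training metrics only.
train_only_logs = [{"loss": 0.9}, {"loss": 0.7}, {"loss": 0.8}]

default = BestOnlySaver()  # monitor="val_loss", like the ModelCheckpoint default
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for epoch, logs in enumerate(train_only_logs):
        default.on_epoch_end(epoch, logs)
print(default.saved_epochs, len(caught))  # [] 3

on_loss = BestOnlySaver(monitor="loss")  # monitor a key that is actually logged
for epoch, logs in enumerate(train_only_logs):
    on_loss.on_epoch_end(epoch, logs)
print(on_loss.saved_epochs)  # [0, 1]
```

Monitoring "val_loss" without validation data warns on every epoch and never saves, while monitoring "loss" saves at epochs 0 and 1 and skips epoch 2 because 0.8 is no improvement over 0.7.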

Related

How can I catch CUDA_ERROR_LAUNCH_FAILED in my code?

I have created a custom class for an NLP project.
This class tries to fit a Keras model, but I often get CUDA_ERROR_LAUNCH_FAILED during training (it seems to be caused by memory issues). These errors can occur after hours of training.
As I can't find a fix for the CUDA error, I tried to implement a workaround:
I added a ModelCheckpoint to save the "best" model at each epoch
If an error occurs during training, I reload the best model & clear the GPU memory.
I resume the training with the reloaded model
I tried this by simulating errors (KeyboardInterrupt), and it works.
However, I can't find a way to catch the CUDA_ERROR_LAUNCH_FAILED error. It seems to just stop the Python process (a low-level error?).
Does anyone know how to catch CUDA_ERROR_LAUNCH_FAILED errors?
Code snippet:
def custom_fit(self, some_arguments):
    ...
    try:
        fit_history = self.model.fit(
            x_train,
            y_train_dummies,
            batch_size=self.batch_size,
            epochs=self.epochs,
            validation_split=None,
            validation_data=validation_data,
            callbacks=callbacks,  # Includes ModelCheckpoint
            verbose=1,
        )
    except:  # CUDA_ERROR_LAUNCH_FAILED is not caught here
        ...
        # Reload model
        # Clear GPU memory
        # Resume training (recursive call to this function)
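The reload-and-resume workaround described above can also be written as a loop instead of a recursive call, which keeps the call stack flat across many restarts. This is a hedged, framework-free sketch: TrainingInterrupted, fit_with_resume and fake_run are hypothetical names, and the except clause only ever sees errors that surface as Python exceptions. If CUDA_ERROR_LAUNCH_FAILED terminates the process, as the question describes, no except clause in that same process will run.

```python
class TrainingInterrupted(RuntimeError):
    """Hypothetical, catchable error carrying the last checkpointed epoch."""

    def __init__(self, last_checkpoint_epoch):
        super().__init__(f"failed after checkpoint at epoch {last_checkpoint_epoch}")
        self.last_checkpoint_epoch = last_checkpoint_epoch


def fit_with_resume(run_epochs, total_epochs, max_restarts=3):
    """Iterative version of the reload-and-resume workaround.

    run_epochs(start, stop) is a hypothetical function that trains epochs
    [start, stop) and raises TrainingInterrupted on failure.
    Returns the number of restarts that were needed.
    """
    start, restarts = 0, 0
    while True:
        try:
            run_epochs(start, total_epochs)
            return restarts
        except TrainingInterrupted as err:
            if restarts >= max_restarts:
                raise  # give up and surface the last error
            restarts += 1
            # In the real workaround: reload the best checkpoint and clear GPU memory here.
            start = err.last_checkpoint_epoch + 1


completed = []
failures = {3}  # hypothetical: epoch 3 fails once


def fake_run(start, stop):
    for epoch in range(start, stop):
        if epoch in failures:
            failures.discard(epoch)  # fail only the first time
            raise TrainingInterrupted(last_checkpoint_epoch=epoch - 1)
        completed.append(epoch)  # stands in for "checkpoint saved at end of epoch"


restarts = fit_with_resume(fake_run, total_epochs=6)
print(restarts, completed)  # 1 [0, 1, 2, 3, 4, 5]
```

With Keras, resuming at a given epoch would correspond to reloading the checkpoint and then passing initial_epoch=start to model.fit().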

How to get the Keras history object when you abort training?

When I train with tensorflow 2.0 / Keras APIs, I usually do something like this
model = tf.keras.Model(inputs, outputs)
history = model.fit(x, y, batch_size=64, epochs=10)
But sometimes things don't work out as I planned, and I need to abort with Ctrl-C or by pressing Stop in a Jupyter notebook.
How can I still get the history object when I abort training early? I can't find any detailed documentation for how to get history.
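As answered by #today in a comment above, the history object is also available as an attribute of the model, so the epochs completed before the interruption can still be read (the 'val_loss' key is only present if validation data was passed to fit()):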
model.history.history['val_loss']
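Why this works: tf.keras attaches its History callback to the model as model.history when fit() sets up its callbacks, and the callback appends each finished epoch's logs to history.history. The sketch below mimics that behaviour without Keras; RecordingHistory and ToyModel are hypothetical stand-ins, not the Keras classes, and the KeyboardInterrupt simulates Ctrl-C or the Jupyter Stop button.

```python
class RecordingHistory:
    """Simplified stand-in for keras.callbacks.History (not the Keras source)."""

    def __init__(self):
        self.epoch = []
        self.history = {}

    def on_epoch_end(self, epoch, logs):
        self.epoch.append(epoch)
        for key, value in logs.items():
            self.history.setdefault(key, []).append(value)


class ToyModel:
    """Hypothetical model whose fit() attaches its history object to itself."""

    def fit(self, epochs, interrupt_at=None):
        self.history = RecordingHistory()  # attached BEFORE training starts
        for epoch in range(epochs):
            if epoch == interrupt_at:
                raise KeyboardInterrupt  # simulates Ctrl-C / Jupyter "Stop"
            logs = {"loss": 1.0 / (epoch + 1), "val_loss": 1.2 / (epoch + 1)}
            self.history.on_epoch_end(epoch, logs)
        return self.history


model = ToyModel()
try:
    history = model.fit(epochs=10, interrupt_at=3)
except KeyboardInterrupt:
    history = model.history  # fit() never returned, but the attribute survives

print(history.epoch)                     # [0, 1, 2]
print(len(history.history["val_loss"]))  # 3
```

Only epochs that finished before the interrupt are recorded; a partially completed epoch never reaches on_epoch_end.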

keras model.save() isn't saving

I have a Keras NN that I want to train and validate using two sets of data, and then test its final performance on a third set. To avoid having to rerun the training every time I restart my Google Colab runtime or want to change my test data, I want to save the final state of the model after training in one script and then load it again in another script.
I've looked everywhere and it seems that model.save("content/drive/My Drive/Directory/ModelName", save_format='tf') should do the trick, but even though it outputs INFO:tensorflow:Assets written to: content/drive/My Drive/Directory/ModelName/assets nothing appears in my Google Drive, so I assume it isn't actually saving.
Please can someone help me solve this issue?
Thanks in advance!
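One detail in the question's path is worth checking, offered as an observation about the path string rather than as part of the answer below: "content/drive/My Drive/Directory/ModelName" has no leading slash, so it is resolved relative to the current working directory, which is consistent with the INFO log also printing it without a slash. The sketch assumes the working directory is /content (typical for Colab) and uses posixpath so it behaves the same on any OS.

```python
import posixpath

question_path = "content/drive/My Drive/Directory/ModelName"  # as written in the question
drive_path = "/content/drive/My Drive/Directory/ModelName"    # same path with a leading slash

print(posixpath.isabs(question_path))  # False
print(posixpath.isabs(drive_path))     # True

# Assumption: the notebook's working directory is /content.
resolved = posixpath.normpath(posixpath.join("/content", question_path))
print(resolved)  # /content/content/drive/My Drive/Directory/ModelName
```

If that assumption holds, the SavedModel lands in /content/content/drive/..., on the runtime's local disk rather than in the mounted Drive, and it disappears when the runtime restarts.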
The standard way of saving and retrieving your model's state after Google Colab has terminated your connection is to use a feature called ModelCheckpoint. This is a Keras callback that runs after each epoch and can, for instance, save your model any time there's an improvement. Here are the steps needed to accomplish what you want:
Connect to Google Drive
Use this code in order to connect to Google Drive:
from google.colab import drive
drive.mount('/content/gdrive')
Give access to Google Colab
You'll then be presented with a link. Open it, authorize Google Colab, and paste the code it gives you into the text box.
Define your ModelCheckpoint
This is how you could define your ModelCheckpoint's callback:
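from keras.callbacks import ModelCheckpoint
# Note: newer Keras versions log accuracy as "val_accuracy"; use whichever key your fit() logs contain.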
filepath="/content/gdrive/My Drive/MyCNN/epochs:{epoch:03d}-val_acc:{val_acc:.3f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
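The placeholders in filepath are ordinary Python str.format fields. As far as I know, Keras fills them with the (1-based) epoch number plus the metrics logged for that epoch; that behaviour is an assumption here, and the metric values below are hypothetical. The sketch shows both the resulting filename and what happens when the logs use a different key than the template expects.

```python
filepath = "/content/gdrive/My Drive/MyCNN/epochs:{epoch:03d}-val_acc:{val_acc:.3f}.hdf5"

# Hypothetical end-of-epoch metrics; unused keys such as "loss" are simply ignored.
logs = {"loss": 0.31, "val_acc": 0.9049}
name = filepath.format(epoch=47, **logs)
print(name)  # /content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5

# If the logs name the metric "val_accuracy" instead, the template cannot be filled.
try:
    filepath.format(epoch=1, val_accuracy=0.9)
except KeyError as missing:
    print("missing key:", missing)  # missing key: 'val_acc'
```

So the template and the monitor argument should both use the exact metric name that appears in your training logs.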
Use it as a callback while training the model
Then pass the callback list to model.fit() so that the model's state is saved after each epoch:
model.fit(X_train, y_train,
batch_size=64,
epochs=epochs,
verbose=1,
validation_data=(X_val, y_val),
callbacks=callbacks_list)
Load the model after Google Colab terminated
Finally, after your session has been terminated, you can restore the model's state by running the following code. Don't forget to re-define your model first; at this stage you only load the weights.
model.load_weights('/content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5')
Hope that this answers your question.
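The load step above hardcodes one filename. A small helper can instead pick the checkpoint with the highest val_acc encoded in the names produced by the filepath template; best_checkpoint is a hypothetical helper, and it relies on that exact naming scheme. It takes a list of paths so it can be fed from glob.

```python
import re

# Matches names like ".../epochs:047-val_acc:0.905.hdf5" (the template used above).
PATTERN = re.compile(r"epochs:(\d+)-val_acc:(\d+\.\d+)\.hdf5$")


def best_checkpoint(paths):
    """Return the path with the highest val_acc in its filename, or None if none match."""
    best_path, best_acc = None, float("-inf")
    for path in paths:
        match = PATTERN.search(path)
        if match and float(match.group(2)) > best_acc:
            best_path, best_acc = path, float(match.group(2))
    return best_path


names = [
    "/content/gdrive/My Drive/MyCNN/epochs:012-val_acc:0.871.hdf5",
    "/content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5",
    "/content/gdrive/My Drive/MyCNN/epochs:031-val_acc:0.893.hdf5",
    "/content/gdrive/My Drive/MyCNN/notes.txt",  # ignored: does not match the template
]
print(best_checkpoint(names))  # /content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5
```

In the notebook this could be used as model.load_weights(best_checkpoint(glob.glob("/content/gdrive/My Drive/MyCNN/*.hdf5"))). With save_best_only=True and mode='max', each newly saved file is an improvement, so the best file is also the most recent one.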

Python crashes when saving Keras model

I am building a model in Keras that contains roughly 4.2M parameters. When I try to save the model using ModelCheckpoint or using model.save('best_model.hdf5'), Python crashes.
The model runs without any issues when I comment out the code that saves the model, so nothing else appears to be causing Python to crash.
My guess is that the large number of parameters is causing Python to crash.
I have looked but haven't been able to find any solution.
Are there any alternatives available to save my model and reuse it in Keras? Or is there a way to fix this issue?
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_squared_error', verbose=1, save_best_only=True, mode='min')  # MSE should be minimized, so mode='min' rather than 'max'
model.save(filepath)
Python doesn't raise any error. This pop-up is all that appears:
[Screenshot: Python error pop-up]
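For scale, here is a rough estimate of how much data 4.2M parameters represent on disk. It assumes float32 weights and, for the second figure, an Adam optimizer, which keeps two extra values per parameter that model.save() writes by default (include_optimizer=True). It is only a size estimate and does not diagnose the crash.

```python
PARAMS = 4_200_000       # from the question
BYTES_PER_FLOAT32 = 4    # assumption: weights stored as float32

weight_bytes = PARAMS * BYTES_PER_FLOAT32
print(weight_bytes, round(weight_bytes / 2**20, 1))  # 16800000 16.0

# Assumption: Adam stores two extra float32 slots (m and v) per parameter.
with_adam = weight_bytes * 3
print(with_adam, round(with_adam / 2**20, 1))  # 50400000 48.1
```

Roughly 16 to 48 MiB is a modest file size, so parameter count alone may not explain the crash.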

TensorBoard Distributions and Histograms with Keras and fit_generator

I'm using Keras to train a CNN using the fit_generator function.
It seems to be a known issue that TensorBoard doesn't show histograms and distributions in this setup.
Did anybody figure out a way to make it work anyway?
There is no easy way to just plug it in with one line of code, you have to write your summaries by hand.
The good news is that it's not difficult and you can use the TensorBoard callback code in Keras as a reference.
(There is also a version 2 ready for TensorFlow 2.x.)
Basically, write a function e.g. write_summaries(model) and call it whenever you want to write your summaries (e.g. just after your fit_generator())
Inside your write_summaries(model) function use tf.summary, histogram_summary and other summary functions to log data you want to see on tensorboard.
If you're not sure how to do this, check the official TensorBoard tutorial
and this great example of MNIST with summaries.
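The write_summaries(model) idea boils down to: at a given step, compute a histogram for each layer's weights and hand it to a summary writer. The framework-free sketch below shows only that shape. write_summaries here takes a hypothetical dict of layer name to flat weight list, and the writer is any callable; in TensorFlow 2.x the writer role would be played by tf.summary.histogram(name, data, step=step) called inside a tf.summary.create_file_writer(logdir).as_default() block, which computes its own buckets.

```python
def histogram(values, num_buckets=5):
    """Count values into equal-width buckets spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0  # avoid zero width when all values are equal
    counts = [0] * num_buckets
    for v in values:
        index = min(int((v - lo) / width), num_buckets - 1)  # max value goes in the last bucket
        counts[index] += 1
    return lo, hi, counts


def write_summaries(layer_weights, step, writer):
    """Hypothetical write_summaries: one histogram record per layer for this step."""
    for name, values in layer_weights.items():
        writer(name, step, histogram(values))


records = []
weights = {"dense/kernel": [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0]}  # hypothetical weights
write_summaries(weights, step=1, writer=lambda *rec: records.append(rec))
print(records)  # [('dense/kernel', 1, (-1.0, 1.0, [1, 1, 1, 2, 1]))]
```

Calling a function like this right after each fit_generator() call (or every N epochs in a manual loop) gives TensorBoard one histogram per layer per step.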
I believe bartgras's explanation is superseded in more recent versions of Keras (I'm using Keras 2.2.2). To get histograms in TensorBoard, all I did was the following, where bg is a data-wrangling class that exposes a generator via bg.training_batch(); bg.validation_batch(), however, is NOT a generator:
from datetime import datetime
import keras
NAME = "Foo_{}".format(datetime.now().isoformat(timespec='seconds')).replace(':', '-')
tensorboard = keras.callbacks.TensorBoard(
log_dir="logs/{}".format(NAME),
histogram_freq=1,
write_images=True)
callbacks = [
tensorboard
]
history = model.fit_generator(
bg.training_batch(),
validation_data=bg.validation_batch(),
epochs=EPOCHS,
steps_per_epoch=bg.steps_per_epoch,
validation_steps=bg.validation_steps,
verbose=1,
shuffle=False,
callbacks=callbacks)
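The NAME line gives each run its own log subdirectory, so TensorBoard can show runs side by side. The sketch below isolates that naming step with a fixed timestamp so the result is reproducible; run_name is a hypothetical wrapper around the same expression. Replacing ':' is presumably there because Windows does not allow colons in file names.

```python
from datetime import datetime


def run_name(prefix, when):
    # ISO timestamp truncated to seconds; ':' replaced so the name is a valid directory on Windows too
    return "{}_{}".format(prefix, when.isoformat(timespec="seconds")).replace(":", "-")


name = run_name("Foo", datetime(2018, 9, 1, 14, 30, 5))
print(name)                        # Foo_2018-09-01T14-30-05
print("logs/{}".format(name))      # logs/Foo_2018-09-01T14-30-05
```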
