How to load checkpoints - Python

Hi, I have tried to load my checkpoints but I get the following error:
" W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open ../codeOutputs/3DNewArchitectureWithRotation: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?"
This is the code I have used:
checkpoint_filepath = '../codeOutputs/3DNewArchitectureWithRotation'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    monitor='val_loss',
    verbose=0,
    save_best_only=False,
    save_weights_only=False,
    mode='auto',
    save_freq='epoch',
    options=None,
    initial_value_threshold=None,
)
Model.load_weights(checkpoint_filepath)
BestRegressor = Model.fit(aaaiTrainImages, afTrainPorosity,
                          validation_data=(aaaiValidationImages, afValidationPorosity),
                          epochs=Epochs,
                          callbacks=[EarlyStop, model_checkpoint_callback],
                          verbose=2)
It seems the file type the checkpoints have been saved as is HDF document (application/x-hdf).
I would appreciate any help, as I have spent many days training my model and it suddenly crashed, so it would be really helpful if I could skip retraining it up to the point I had already reached.

I was faced with the same issue. As others have pointed out, the issue derives from the argument save_weights_only=False, which saves the full model as a directory of files rather than a single weights file. You can still call model.load_weights() against it and restore the model, but you get that unpleasant error. One approach I took was to use the following to restore the model without any errors/warnings.
import tensorflow as tf
m = tf.keras.models.load_model('/path/to/checkpoint/dir')
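For completeness, a minimal sketch of the two loading paths (the checkpoint path is just the one from the question, and the surrounding model-building code is assumed):
import tensorflow as tf

checkpoint_filepath = '../codeOutputs/3DNewArchitectureWithRotation'

# With save_weights_only=False the callback wrote a full SavedModel
# directory, so rebuild the entire model from it:
model = tf.keras.models.load_model(checkpoint_filepath)

# Only if the callback had used save_weights_only=True would you instead
# restore onto an already-built model of the same architecture:
# model.load_weights(checkpoint_filepath)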

Related

What's the correct parameter to use on tf.keras.models.load_model

I'm new to Python and I'm kind of lost trying to set up Keras model and weight saving; could someone with more experience please help me?
I'm following this guide to learn how a prediction system works that would give advice on lottery games:
https://medium.com/#polanitzer/predicting-the-israeli-lottery-results-for-the-november-29-2022-game-using-an-artificial-191489eb2c10
On my Jurassic computer this freezes at random epochs above 900, and reading the TensorFlow docs I saw it is possible to save the weights every epoch and continue from the previous one if it fails / the computer freezes.
This is what I did for the checkpoint:
checkpoint_filepath = "/home/ubuntu/Downloads/Lottery/checkpoints/lottery/"
model_checkpoint_callback = ModelCheckpoint(
    filepath=os.path.join(checkpoint_filepath, "weights-improvement.hd5"),
    monitor='val_accuracy',
    verbose=1,
    save_best_only=True,
    save_weights_only=True,
    save_freq='epoch',
    mode='max')
es = EarlyStopping(monitor='val_accuracy', patience=5)
callbacks_list = [model_checkpoint_callback, es]
And to load the model:
model.load_weights("/home/ubuntu/Downloads/Lottery/checkpoints/lottery/weights-improvement.hd5")
loss, acc = model.evaluate(train_samples, train_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))
load_model('/home/ubuntu/Downloads/Lottery/lottery/')
I tried using train_samples / train_labels and x_train / y_train, with no luck restoring them. Running all the code in a Jupyter notebook, it starts from the beginning every time (0.08% accuracy even though the previous run had reached 60% before freezing).
And to train the model I did:
model.fit(x=x_train, y=y_train, batch_size=32, epochs=1200, verbose=2,
          callbacks=[model_checkpoint_callback], validation_split=0.22)
model.save('lottery')
I was reading docs from here:
https://www.tensorflow.org/tutorials/keras/save_and_load
What am I doing wrong?
Thanks in advance to everyone willing to help me!
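Not a full answer, but here is a minimal save-and-resume sketch of the pattern the docs describe. build_model() is a hypothetical stand-in for whatever code builds your architecture, and note that monitoring 'val_accuracy' only works if the model was compiled with metrics=['accuracy']:
import os
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

checkpoint_filepath = "/home/ubuntu/Downloads/Lottery/checkpoints/lottery/"
weights_path = os.path.join(checkpoint_filepath, "weights-improvement.hd5")

model = build_model()  # hypothetical: must recreate the exact same architecture
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Restore after building/compiling and before fitting, and only if a
# previous run actually wrote the file:
if os.path.exists(weights_path):
    model.load_weights(weights_path)

model_checkpoint_callback = ModelCheckpoint(filepath=weights_path,
                                            monitor="val_accuracy",
                                            verbose=1,
                                            save_best_only=True,
                                            save_weights_only=True,
                                            mode="max")
es = EarlyStopping(monitor="val_accuracy", patience=5)

model.fit(x=x_train, y=y_train, batch_size=32, epochs=1200, verbose=2,
          callbacks=[model_checkpoint_callback, es], validation_split=0.22)
Also note that restoring weights does not restore the epoch counter: fit() starts counting from the beginning again unless you pass initial_epoch.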

Tensorflow Callback, Multiple Issues on Saving and Loading Weights

I'm training a model and using the TensorFlow callbacks to save my training logs, and I have a model checkpoint to save my model's weights.
During training, every epoch it says "WARNING:tensorflow: Can save best model only with val_acc available, skipping". This is issue 1.
Here is the code I used to include in callbacks=[] during model.fit.
def create_tensorboard_callback(dir_name, experiment_name):
    """
    Creates a TensorBoard callback instance to store log files.

    Stores log files with the filepath:
    "dir_name/experiment_name/current_datetime/"

    Args:
        dir_name: target directory to store TensorBoard log files
        experiment_name: name of experiment directory (e.g. efficientnet_model_1)
    """
    log_dir = dir_name + "/" + experiment_name + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
    print(f"Saving TensorBoard log files to: {log_dir}")
    return tensorboard_callback
# Create ModelCheckpoint callback to save model's progress
checkpoint_path = "model_checkpoints/cp.ckpt"
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                      monitor="val_acc",
                                                      save_best_only=True,  # SAVING BEST ONLY
                                                      save_weights_only=True,
                                                      verbose=0)
Code for fitting the model with callbacks:
history_101_food_classes_feature_extract = model.fit(
    train_data,
    epochs=3,
    steps_per_epoch=len(train_data),
    validation_data=test_data,
    validation_steps=int(0.15 * len(test_data)),
    callbacks=[create_tensorboard_callback("training_logs",
                                           "efficientnetb0_101_classes_all_data_feature_extract"),
               model_checkpoint])
Also, I cloned my model and used cloned_model.load_weights(checkpoint_path), then evaluated both the original and the cloned model with model.evaluate(test_data). The original model scores 70+% accuracy, while the cloned model always returns the exact same accuracy of 0.54. This is issue 2.
My guess was that I had previously trained and saved a very high accuracy model, hence issue 1, where it refuses to save at every epoch. But my model_checkpoint path looks clean to me.
And if I did previously save a high accuracy model to my checkpoint_path, when I clone a new model using weights loaded from that path, why would it give 0.54 accuracy every time and not something higher? (Issue 2)
I need help. Let me know if you need more info from my side to solve this issue, happy to answer. Thanks. If you want to see the full code, here's the link to it.
https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/07_food_vision_milestone_project_1.ipynb
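One thing worth checking for issue 1 (a guess, not a confirmed fix): ModelCheckpoint can only monitor a metric name that actually appears in the training logs, and in TF 2.x compiling with metrics=['accuracy'] logs it as accuracy/val_accuracy rather than val_acc, so the callback skips saving every epoch. A minimal sketch of the names lining up (`model` is the compiled model from the question):
import tensorflow as tf

# The monitored name must match what shows up in the fit() logs.
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])   # logged as "accuracy" / "val_accuracy"

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "model_checkpoints/cp.ckpt",
    monitor="val_accuracy",   # matches the logged metric name
    save_best_only=True,
    save_weights_only=True,
    verbose=0)
If the checkpoint was indeed never (re)written because of issue 1, the weights the cloned model loads are not the freshly trained ones, which could also explain the fixed 0.54 in issue 2.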

Google Colab not saving model or model checkpoints?

I've been using Colab to train my models, but it's quite infuriating that so far I have only been able to save the weights to my Google Drive, not the whole model, or even model checkpoints.
I mounted Google Drive with:
from google.colab import drive
drive.mount('/content/gdrive')
And I know that I can read files from the Drive as this code works:
import numpy as np

with np.load("/content/gdrive/MyDrive/trainingData.npz") as f:
    dataX = f["dataX"]
    dataY = f["dataY"]
And I set up the TPU using the following:
%tensorflow_version 2.x
import tensorflow as tf
print("Tensorflow version " + tf.__version__)

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
But when I run the following code, no model checkpoints get saved:
with tpu_strategy.scope():
    model = Sequential()
    model.add(LSTM(256, input_shape=(dataX.shape[1], dataX.shape[2])))
    model.add(Dropout(0.2))
    model.add(Dense(dataY.shape[1], activation="softmax"))
    model.compile(loss='categorical_crossentropy', optimizer='adam')

filepath = "/content/gdrive/MyDrive/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

model.fit(dataX, dataY, epochs=50, batch_size=128)
I can't even just save the model normally: model.save("/content/gdrive/MyDrive/model") gives:
UnimplementedError: File system scheme '[local]' not implemented (file: 'model/variables/variables_temp/part-00000-of-00001')
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
The interesting thing is that I can still save model weights, via model.save_weights("/content/gdrive/MyDrive/model.h5")
However, as I want to be able to save the whole model for future training, just saving the weights is not satisfactory.
What errors have I made and how can I save my model?
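Two things that might be relevant here, offered as guesses rather than confirmed fixes: callbacks_list is defined but never passed to model.fit, so the checkpoint callback never runs; and on a TPU runtime the save executes on the TPU workers, which cannot see the notebook's local/Drive filesystem, which TF lets you work around with tf.saved_model.SaveOptions(experimental_io_device='/job:localhost'). A sketch under those assumptions (`model`, `dataX`, `dataY`, and `callbacks_list` are the ones from the question):
import tensorflow as tf

# Actually hand the checkpoint callback to fit():
model.fit(dataX, dataY, epochs=50, batch_size=128, callbacks=callbacks_list)

# Route the SavedModel write through the local host instead of the TPU workers:
save_options = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
model.save("/content/gdrive/MyDrive/model", options=save_options)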

How to access log file with TensorBoard on Anaconda

I've made a NN model with Keras in an Anaconda environment (I'm using Jupyter).
I want to access the log file that I'm writing with TensorBoard, and I would like to see the accuracy and loss function graphs.
However, when I try to access the log file from the terminal, this error occurs: AttributeError: module 'tensorboard.util' has no attribute 'PersistentOpEvaluator'
Can anyone help me write these graphs and view them by opening TensorBoard?
This is my code:
hidden_size = 256
sl_model = keras.models.Sequential()
[...]
sl_model.add(keras.layers.Dense(max_length, activation='softmax'))
optimizer = keras.optimizers.Adam()
sl_model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['acc'])
batch_size = 128
epochs = 3
# Let's print a summary of the model
sl_model.summary()
# I'd like to access this file
cbk = keras.callbacks.TensorBoard("logging/keras_model")
print("\nStarting training...")
sl_model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
             shuffle=True, validation_data=(x_dev, y_dev), callbacks=[cbk])
How can I fix this? Thank you!
You must delete the tensorboard directory in site-packages, then run pip install tensorboard --upgrade, supposing your tensorflow version is up to date.
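If the terminal route keeps failing after the upgrade, TensorBoard can also be launched from inside the notebook itself (assuming a reasonably recent tensorboard; the log path is the one from the question):
# In a Jupyter cell:
%load_ext tensorboard
%tensorboard --logdir logging/keras_model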

Exporting/Importing Keras Model to TensorFlow fails when using multi_gpu_model

I'm currently struggling with importing my exported Keras model into TensorFlow. The code worked fine with a sequential model: I was able to train the model in Python and then import it into my C++ application. Since I needed more resources I decided to distribute the model onto several GPUs. Afterwards I was not able to import the model.
This is how I created my model before:
input_img = Input(shape=(imgDim, imgDim, 1))
# add several layers to net
model = Model(input_img, net)
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          epochs=100,
          batch_size=100,
          shuffle=True,
          validation_data=(x_test, y_test))
saveKerasModelAsProtobuf(model, outpath)
This is how I export my model:
def saveKerasModelAsProtobuf(model, outputPath):
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={'image': model.input}, outputs={'scores': model.output})
    builder = tf.saved_model.builder.SavedModelBuilder(outputPath)
    builder.add_meta_graph_and_variables(
        sess=keras.backend.get_session(),
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                signature
        }
    )
    builder.save()
    return
This is how I changed the code to run on multiple GPUs:
input_img = Input(shape=(imgDim, imgDim, 1))
# add several layers to net
model = Model(input_img, net)
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(optimizer='adam',
                       loss='binary_crossentropy',
                       metrics=['accuracy'])
parallel_model.fit(x_train, y_train,
                   epochs=100,
                   batch_size=100,
                   shuffle=True,
                   validation_data=(x_test, y_test))
# export model rather than parallel_model:
saveKerasModelAsProtobuf(model, outpath)
When I try to import the model in C++ on a single-GPU machine I get the following error, indicating that it's not actually the plain (non-parallel) model, as I would expect, but the parallel_model:
Cannot assign a device for operation 'replica_3/lambda_4/Shape': Operation was explicitly assigned to /device:GPU:3 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[Node: replica_3/lambda_4/Shape = Shape[T=DT_FLOAT, _output_shapes=[[4]], out_type=DT_INT32, _device="/device:GPU:3"](input_1)]]
From what I read, they should share the same weights, but not the internal structure. What am I doing wrong? Is there a better/more generic way to export the model?
Thanks!
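Not a confirmed fix, but a common workaround for this class of problem is to persist only the weights from training, rebuild the plain single-device model in a fresh session, load the weights into it, and export that clean graph. buildModel() below is a hypothetical stand-in for whatever code built the original architecture; `model`, `saveKerasModelAsProtobuf`, and `outpath` are the ones from the question:
import keras

# The weights are shared between model and parallel_model, so saving the
# base model gives the trained parameters:
model.save_weights('trained_weights.h5')

# Start a fresh graph/session so no replica/device assignments leak into
# the export:
keras.backend.clear_session()

clean_model = buildModel()  # hypothetical: recreates the same architecture
clean_model.load_weights('trained_weights.h5')
saveKerasModelAsProtobuf(clean_model, outpath)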
