Loading a saved model to resume training - python

I'm training a ResNet model to classify car brands.
I saved the weights after every epoch during training.
For a test, I stopped the training at epoch 3.
import os
import tensorflow as tf

# checkpoint = ModelCheckpoint("best_model.hdf5", monitor='loss', verbose=1)
checkpoint_path = "weights/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Save the weights at the end of every epoch
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path, verbose=1,
    save_freq='epoch')

# Save the initial (untrained) weights as epoch 0
model.save_weights(checkpoint_path.format(epoch=0))

history = model.fit_generator(
    training_set,
    validation_data=test_set,
    epochs=50,
    steps_per_epoch=len(training_set),
    validation_steps=len(test_set),
    callbacks=[cp_callback]
)
However, when I load the weights, I am unsure whether training resumes from the last saved epoch, since it says Epoch 1/50 again. Below is the code I use to load the last saved model.
from keras.models import Sequential, load_model

# load the model
new_model = load_model('./weights/cp-0003.ckpt')

# fit the model
history = new_model.fit_generator(
    training_set,
    validation_data=test_set,
    epochs=50,
    steps_per_epoch=len(training_set),
    validation_steps=len(test_set),
    callbacks=[cp_callback]
)
This is what it looks like:
[Image: the console output shows the run starting again from Epoch 1/50]
Can someone please help?

You can use the initial_epoch argument of fit_generator. By default it is set to 0, but you can set it to any positive number:
import os
import tensorflow as tf
from keras.models import Sequential, load_model

# `model`, `training_set` and `test_set` are the ones from your question
checkpoint_path = "weights/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Save the weights at the end of every epoch
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path, verbose=1,
    save_freq='epoch')

# Save the initial weights as epoch 0
model.save_weights(checkpoint_path.format(epoch=0))

history = model.fit_generator(
    training_set,
    validation_data=test_set,
    epochs=3,
    steps_per_epoch=len(training_set),
    validation_steps=len(test_set),
    callbacks=[cp_callback]
)

# Load the checkpoint saved after epoch 3 and resume from there
new_model = load_model('./weights/cp-0003.ckpt')

history = new_model.fit_generator(
    training_set,
    validation_data=test_set,
    epochs=50,
    steps_per_epoch=len(training_set),
    validation_steps=len(test_set),
    callbacks=[cp_callback],
    initial_epoch=3
)
This will train your model for 50 - 3 = 47 additional epochs.
Some remarks regarding your code if you use TensorFlow 2.x:
fit_generator is deprecated, since fit now supports generators
you should replace your imports from keras... with from tensorflow.keras...
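For reference, a minimal sketch of the same resume step in TF 2.x style (using fit with the generator and tensorflow.keras imports; it assumes the checkpoint at ./weights/cp-0003.ckpt was saved as a full model, as above):
import tensorflow as tf
from tensorflow.keras.models import load_model

# Load the checkpoint saved after epoch 3
new_model = load_model('./weights/cp-0003.ckpt')

# In TF 2.x, fit() accepts generators directly, so fit_generator is not needed
history = new_model.fit(
    training_set,
    validation_data=test_set,
    epochs=50,
    steps_per_epoch=len(training_set),
    validation_steps=len(test_set),
    callbacks=[cp_callback],
    initial_epoch=3
)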

Related

Issues with Keras load_model function

I am building a CNN in Keras with a TensorFlow backend for speaker identification, and currently I am attempting to train the model and then save it as an .hdf5 file. The program trains the model for 100 epochs with early stopping and checkpoints, saving only the best model to a file, as illustrated in the code below:
class BuildModel:
    # Create First Model in Ensemble
    def createModel(self, model_input, n_outputs, first_session=True):
        if first_session != True:
            model = load_model('SI_ideal_model_fixed.hdf5')
            return model

        # Define Input Layer
        inputs = model_input

        # Define Densely Connected Layers
        conv = Dense(16, activation='relu')(inputs)
        conv = Dense(64, activation='relu')(conv)
        conv = Dense(16, activation='relu')(conv)
        conv = Reshape((conv.shape[1]*conv.shape[2]*conv.shape[3],))(conv)
        outputs = Dense(n_outputs, activation='softmax')(conv)

        # Create Model
        model = Model(inputs, outputs)
        model.summary()
        return model

    # Train the Model
    def evaluateModel(self, x_train, x_val, y_train, y_val, num_classes, first_session=True):
        # Model Parameters
        verbose, epochs, batch_size, patience = 1, 100, 64, 10

        # Determine Input and Output Dimensions
        x = x_train[0].shape[0]  # Number of MFCC rows
        y = x_train[0].shape[1]  # Number of MFCC columns
        c = 1                    # Number of channels

        # Create Model
        inputs = Input(shape=(x, y, c), name='input')
        model = self.createModel(model_input=inputs,
                                 n_outputs=num_classes,
                                 first_session=first_session)

        # Compile Model
        model.compile(loss='categorical_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])

        # Callbacks
        es = EarlyStopping(monitor='val_loss',
                           mode='min',
                           verbose=verbose,
                           patience=patience,
                           min_delta=0.0001)  # Stop training at right time
        mc = ModelCheckpoint('SI_ideal_model_fixed.hdf5',
                             monitor='val_accuracy',
                             verbose=verbose,
                             save_best_only=True,
                             mode='max')  # Save best model after each epoch
        reduce_lr = ReduceLROnPlateau(monitor='val_loss',
                                      factor=0.2,
                                      patience=patience//2,
                                      min_lr=1e-3)  # Reduce learning rate once learning stagnates

        # Evaluate Model
        model.fit(x_train, y=y_train, epochs=epochs,
                  callbacks=[es, mc, reduce_lr], batch_size=batch_size,
                  validation_data=(x_val, y_val))
        accuracy = model.evaluate(x=x_train, y=y_train,
                                  batch_size=batch_size,
                                  verbose=verbose)

        # Load Best Model
        model = load_model('SI_ideal_model_fixed.hdf5')
        return (accuracy[1], model)
However, it appears that the load_model function is not working properly, since the model achieved a validation accuracy of 0.56193 after the first training session but then only started with a validation accuracy of 0.2508 at the beginning of the second training session. (From what I have seen, the first epoch of the second training session should have a validation accuracy much closer to that of the best model.)
Moreover, I then attempted to test the trained model on a set of unseen samples with model.predict, and it failed on all six, often with high probabilities, which leads me to believe that it was using minimally trained (or untrained) weights.
So, my question is: could this be an issue with loading and saving the models using the load_model and ModelCheckpoint functions? If so, what is the best alternative method? If not, what are some good troubleshooting tips for improving the model's prediction performance?
I am not sure what you mean by training session. What I would do is first train for a few epochs and note the validation accuracy. Then load the model and use evaluate() to get the same accuracy. If it differs, then yes, something is wrong with your loading. Here is what I would do:
def createModel(self, model_input, n_outputs):
    # Define Input Layer
    inputs = model_input

    # Define Densely Connected Layers
    conv = Dense(16, activation='relu')(inputs)
    conv2 = Dense(64, activation='relu')(conv)
    conv3 = Dense(16, activation='relu')(conv2)
    conv4 = Reshape((conv.shape[1]*conv.shape[2]*conv.shape[3],))(conv3)
    outputs = Dense(n_outputs, activation='softmax')(conv4)

    # Create Model
    model = Model(inputs, outputs)
    return model

# Train the Model
def evaluateModel(self, x_train, x_val, y_train, y_val, num_classes, first_session=True):
    # Model Parameters
    verbose, epochs, batch_size, patience = 1, 100, 64, 10

    # Determine Input and Output Dimensions
    x = x_train[0].shape[0]  # Number of MFCC rows
    y = x_train[0].shape[1]  # Number of MFCC columns
    c = 1                    # Number of channels

    # Create Model
    inputs = Input(shape=(x, y, c), name='input')
    model = self.createModel(model_input=inputs,
                             n_outputs=num_classes)

    # Compile Model
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    # Callbacks
    es = EarlyStopping(monitor='val_loss',
                       mode='min',
                       verbose=verbose,
                       patience=patience,
                       min_delta=0.0001)  # Stop training at right time
    mc = ModelCheckpoint('SI_ideal_model_fixed.h5',
                         monitor='val_accuracy',
                         verbose=verbose,
                         save_best_only=True,
                         save_weights_only=False)  # Save best model after each epoch
    reduce_lr = ReduceLROnPlateau(monitor='val_loss',
                                  factor=0.2,
                                  patience=patience//2,
                                  min_lr=1e-3)  # Reduce learning rate once learning stagnates

    # Train, then evaluate the in-memory model
    model.fit(x_train, y=y_train, epochs=5,
              callbacks=[es, mc, reduce_lr], batch_size=batch_size,
              validation_data=(x_val, y_val))
    accuracy = model.evaluate(x=x_val, y=y_val,
                              batch_size=batch_size,
                              verbose=verbose)

    # Load Best Model and evaluate it on the same data
    model2 = load_model('SI_ideal_model_fixed.h5')
    model2.evaluate(x=x_val, y=y_val,
                    batch_size=batch_size,
                    verbose=verbose)
    return (accuracy[1], model)
The two evaluations really should print the same thing.
P.S. TF might change the order of your computations, so I used different names in the model to prevent that (e.g. conv, conv2, ...).
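If you want an even stricter check than comparing evaluate() outputs, you can compare raw predictions before and after a save/load round trip. A minimal sketch, reusing x_val and the trained model from above; 'roundtrip_check.h5' is a hypothetical filename (note that the best checkpoint saved with save_best_only=True may legitimately differ from the final in-memory model, so compare against a fresh save):
import numpy as np
from tensorflow.keras.models import load_model

# Save the final in-memory model and reload it
model.save('roundtrip_check.h5')
reloaded = load_model('roundtrip_check.h5')

preds_before = model.predict(x_val, batch_size=64)
preds_after = reloaded.predict(x_val, batch_size=64)

# They should match (up to floating-point noise) if saving/loading works correctly
print(np.allclose(preds_before, preds_after, atol=1e-5))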

Load epochs logs Keras model

I am loading the Keras model I have been training for 150 epochs:
tbCallBack = tensorflow.keras.callbacks.TensorBoard(log_dir='./Graph', histogram_freq=0,
                                                    write_graph=True, write_images=True)

my_model.fit(X_train, X_train,
             epochs=200,
             batch_size=100,
             shuffle=True,
             validation_data=(X_test, X_test),
             callbacks=[tbCallBack]
             )

# Save the model
my_model.save('my_model.hdf5')
Then I load the Keras model:
my_model = load_model("my_model.hdf5")
Is there a way to load all the epoch logs (loss, accuracy, ...)?
You can use the Keras callback called CSVLogger.
According to the documentation, it streams the results from each epoch into a CSV file.
This is the code from its documentation:
from keras.callbacks import CSVLogger
csv_logger = CSVLogger('training.log')
model.fit(X_train, Y_train, callbacks=[csv_logger])
You can then manipulate it as a normal CSV file, for your needs.
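For example, to get the per-epoch history back after a restart, you can read the log with pandas (a sketch; the exact column names depend on the metrics you compiled with):
import pandas as pd

# Read the per-epoch metrics written by CSVLogger
logs = pd.read_csv('training.log')
print(logs.columns.tolist())           # e.g. ['epoch', 'loss', 'accuracy', 'val_loss', ...]
print(logs[['epoch', 'loss']].tail())  # last few epochs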

How can I restart model.fit after x epochs if loss remains high?

Sometimes I have trouble getting a fit on my data, and when I restart the fit (with shuffle=True) I sometimes get a good fit.
See my previous question:
https://datascience.stackexchange.com/questions/62516/why-does-my-model-sometimes-not-learn-well-from-same-data
As a workaround, I want to automatically restart the fitting process if the loss is still high after x epochs. How can I achieve this?
I assume I would need a custom version of the EarlyStopping callback. How could I differentiate between early stopping because a low loss (< 0.5) was found, so training is finished, and stopping because the loss is still > 0.5 after x epochs, so training needs to restart?
Here is a simplified structure:
def train_till_good():
    while not_finished:
        train()

def train():
    load_data()
    model = VerySimpleNet2()
    checkpoint = keras.callbacks.ModelCheckpoint(filepath=images_root + dataset_name + '\\CheckPoint.hdf5')
    myOpt = keras.optimizers.Adam(lr=0.001, decay=0.01)
    model.compile(optimizer=myOpt, loss='categorical_crossentropy', metrics=['accuracy'])
    LRS = CyclicLR(base_lr=0.000005, max_lr=0.0003, step_size=200.)
    tensorboard = keras.callbacks.TensorBoard(log_dir='C:\\Tensorflow', histogram_freq=0,
                                              write_graph=True, write_images=False)
    ES = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
    model.fit(train_images, train_labels, shuffle=True, epochs=num_epochs,
              callbacks=[checkpoint,
                         tensorboard,
                         ES,
                         LRS],
              validation_data=(test_images, test_labels)
              )

def VerySimpleNet2():
    model = keras.Sequential([
        keras.layers.Dense(112, activation=tf.nn.relu, input_shape=(224, 224, 3)),
        keras.layers.Dropout(0.4),
        keras.layers.Flatten(),
        keras.layers.Dense(3, activation=tf.nn.softmax)
    ])
    return model
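One possible approach (a sketch under the question's assumptions, not from the original thread): a small custom callback that stops training and records that a restart is needed if the loss is still above a threshold after a given number of epochs; the outer loop then rebuilds the model and tries again. RestartOnHighLoss, check_epoch and threshold are hypothetical names/values, and the variables train_images, train_labels, test_images, test_labels, num_epochs and VerySimpleNet2 are reused from the question.
from tensorflow import keras

class RestartOnHighLoss(keras.callbacks.Callback):
    # Stop training if the loss is still above `threshold` after `check_epoch` epochs
    def __init__(self, check_epoch=10, threshold=0.5):
        super().__init__()
        self.check_epoch = check_epoch
        self.threshold = threshold
        self.needs_restart = False

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # `epoch` is zero-based, so epoch + 1 epochs have completed so far
        if epoch + 1 >= self.check_epoch and logs.get('loss', float('inf')) > self.threshold:
            self.needs_restart = True
            self.model.stop_training = True

def train_till_good():
    while True:
        restart_cb = RestartOnHighLoss(check_epoch=10, threshold=0.5)
        model = VerySimpleNet2()  # fresh weights on every attempt
        model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
        model.fit(train_images, train_labels, shuffle=True, epochs=num_epochs,
                  callbacks=[restart_cb],
                  validation_data=(test_images, test_labels))
        if not restart_cb.needs_restart:
            break  # training finished normally (loss dropped below the threshold in time)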

Resume Training tf.keras Tensorboard

I encountered some problems when I continued training my model and visualized the progress on TensorBoard.
My question is: how do I resume training from the same step without specifying any epoch manually? If possible, simply by loading the saved model, it would somehow read the global_step from the saved optimizer and continue training from there.
I have provided some code below to reproduce a similar error.
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.models import load_model

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, callbacks=[TensorBoard()])
model.save('./final_model.h5', include_optimizer=True)

del model

model = load_model('./final_model.h5')
model.fit(x_train, y_train, epochs=10, callbacks=[TensorBoard()])
You can run TensorBoard with the command:
tensorboard --logdir ./logs
You can set the parameter initial_epoch in model.fit() to the number of the epoch you want your training to start from. Keep in mind that the model trains until the epoch of index epochs is reached (and not for an additional number of epochs given by epochs).
In your example, if you want to train for 10 more epochs, it should be:
model.fit(x_train, y_train, initial_epoch=9, epochs=19, callbacks=[TensorBoard()])
This will let you visualise your plots on TensorBoard correctly.
More extensive information about these parameters can be found in the docs.
Here is sample code in case someone needs it. It implements the idea proposed by Abhinav Anand:
from glob import glob
from os.path import join
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tensorflow.keras.models import load_model

# `dir` is the directory where checkpoints and TensorBoard logs are written
mca = ModelCheckpoint(join(dir, 'model_{epoch:03d}.h5'),
                      monitor='loss',
                      save_best_only=False)
tb = TensorBoard(log_dir=join(dir, 'logs'),
                 write_graph=True,
                 write_images=True)

# Look for previously saved checkpoints and resume from the most recent one
files = sorted(glob(join(dir, 'model_???.h5')))
if files:
    model_file = files[-1]
    initial_epoch = int(model_file[-6:-3])
    print('Resuming using saved model %s.' % model_file)
    model = load_model(model_file)
else:
    model = nn.model()
    initial_epoch = 0

model.fit(x_train,
          y_train,
          epochs=100,
          initial_epoch=initial_epoch,
          callbacks=[mca, tb])
Replace nn.model() with your own function for defining the model.
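For illustration, a minimal hypothetical stand-in for that function (build_model is a placeholder name, and the architecture simply assumes the MNIST example from the question):
import tensorflow as tf

def build_model():
    # Hypothetical replacement for nn.model(); substitute your own architecture
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model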
It's very simple. Create checkpoints while training the model and then use those checkpoints to resume training from where you left off.
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, callbacks=[TensorBoard()])
model.save('./final_model.h5', include_optimizer=True)

model = load_model('./final_model.h5')

callbacks = list()

tensorboard = TensorBoard()
callbacks.append(tensorboard)

file_path = "model-{epoch:02d}-{loss:.4f}.hdf5"

# Create checkpoints and save according to your needs.
# `period` is the number of epochs between saves during training;
# `save_weights_only` should be False in your case, so the full model is saved.
checkpoints = ModelCheckpoint(file_path, monitor='loss', verbose=1, period=1, save_weights_only=False)
callbacks.append(checkpoints)

model.fit(x_train, y_train, epochs=10, callbacks=callbacks)
After this, just load the checkpoint you want to resume training from:
model = load_model(checkpoint_of_choice)
model.fit(x_train, y_train, epochs=10, callbacks=callbacks)
And you are done.
Let me know if you have more questions about this.
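As noted in the first answer above, if you also want the epoch numbering and the TensorBoard curves to continue rather than restart at 1, pass initial_epoch when resuming. A minimal sketch; the checkpoint filename here is hypothetical:
# Resume from a saved checkpoint and keep the epoch counter consistent
model = load_model("model-05-0.1234.hdf5")  # hypothetical checkpoint saved after epoch 5
model.fit(x_train, y_train,
          initial_epoch=5,   # epochs 1-5 were already trained
          epochs=10,         # train up to epoch 10 (5 more epochs)
          callbacks=callbacks)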

Continue train CNN with saved model in keras

I am training a CNN model with the Keras library, with the number of epochs set to 25. Can I first run the model for 10 epochs and then save it with these lines of code:
model.fit_generator(training_set,
                    steps_per_epoch=100000,
                    epochs=10,
                    validation_data=test_set,
                    validation_steps=40000)

from keras.models import load_model
model.save('my_model.h5')
Then I restart Python and continue running the next 15 epochs with the same dataset, like the code below:
model = load_model('my_model.h5')
model.fit_generator(training_set,
                    steps_per_epoch=100000,
                    epochs=15,
                    validation_data=test_set,
                    validation_steps=40000)
Is this sufficient to continue training, or do I have to do any other step to continue the job? I would appreciate any support.
Yes, this is okay: model.save saves the weights, the model architecture, and the optimizer state, so you can resume training with no problems.
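One optional refinement (a sketch, building on the initial_epoch argument discussed earlier on this page): if you want the resumed run's epoch counter to continue at 11/25 instead of restarting at 1/15, load the model and pass initial_epoch as well:
from keras.models import load_model

model = load_model('my_model.h5')
# Train the remaining 15 epochs; the counter continues at epoch 11 of 25
model.fit_generator(training_set,
                    steps_per_epoch=100000,
                    initial_epoch=10,
                    epochs=25,
                    validation_data=test_set,
                    validation_steps=40000)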
