I am training a GPT2 text generation model in TensorFlow and am performing a single epoch across my text corpus. My question is, how can I save my model every, say, 10 steps or so? My model abruptly stopped training on the 100th step with only another 20 to go....oooof.
I'm aware of the ModelCheckpoint() callback, but it isn't clear to me whether save_freq can count steps instead of epochs.
tf.keras.callbacks.ModelCheckpoint(
filepath, monitor='val_loss', verbose=0, save_best_only=False,
save_weights_only=False, mode='auto', save_freq='epoch', **kwargs)
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint
Set save_freq to an integer and the model will be saved after that many batches; save_freq=1 saves after every step. I would not recommend saving that often, because the I/O of each save will slow your training down considerably.
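For example, a minimal sketch that checkpoints every 10 batches (the file path is just a placeholder):
import tensorflow as tf

ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='checkpoints/latest.h5',   # placeholder path
    save_weights_only=True,             # weights-only keeps each save fast
    save_freq=10)                       # integer = save every 10 batches, not epochs

# model.fit(train_dataset, epochs=1, callbacks=[ckpt_cb])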
Related
I have a model that I want to train for 10 epochs for a certain hyperparameter setting. After training, I will use the history.history object to find the epoch where the validation loss was at its minimum. Once I have this best-scoring epoch, I would like to retrieve that model and use it to predict the test data. Now, imagine that my best-scoring epoch was not the last one. Is there any option within this Keras history object (such as history.model) to retrieve past values of the weights? I imagine that, if there is not, I would have to create a dictionary and temporarily store each model per epoch until training finishes and I find the best one. But when using model.fit, there is no option to store each model per epoch, right? How would you do this?
Keras offers the option of evaluating your model on validation data after each epoch.
After dividing your data into training, test and validation sets, you can train your model like this:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

model = modelmlp(np.shape(x_train)[0], hidden, 4)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
hist = model.fit(x_train, y_train, epochs=epochs, batch_size=batch,
                 validation_data=(x_valid, y_valid), verbose=verbose[1],
                 callbacks=[ModelCheckpoint(filepath='bestweights.hdf5',
                                            monitor='val_loss', verbose=verbose[2],
                                            save_best_only=True, mode='min')])
model = load_model('bestweights.hdf5')
This code will train your model and, after each epoch, evaluate it on the validation data. Every time the result on the validation data improves, the model is saved to a file.
After the training process ends, you just need to load the model from that file.
You can use Keras callbacks for this.
You can save the model weights based on the metric you need. Let's say, for example, you want to save the model with the minimum loss; you'll have to define a ModelCheckpoint.
First, before training the model, define the checkpoint in the format given below:
from tensorflow.keras.callbacks import ModelCheckpoint

callbacks = [ModelCheckpoint(filepath='resNet_centering2.h5', monitor='val_loss', mode='min', save_best_only=True)]
Now that you have defined the callback, you'll have to use it in the model.fit call:
history = model.fit(
    x=X_train,
    y=Y_train,
    callbacks=callbacks,
    batch_size=4,
    epochs=100,
    verbose=1,
    validation_data=(X_test, Y_test))
This will save the best weights of your model at the defined filepath, and you can load those weights later with the call below:
from tensorflow.keras.models import load_model

model = load_model('resNet_centering2.h5')
I hope it solves your problem.
While doing transfer learning on VGG, with a decent amount of data and the following configuration:
base_big_3 = tf.keras.applications.VGG19(include_top=False, weights='imagenet',input_shape=[IMG_SIZE,IMG_SIZE,3])
model_big_3 = tf.keras.Sequential()
model_big_3.add(base_big_3)
model_big_3.add(BatchNormalization(axis=-1))
model_big_3.add(GlobalAveragePooling2D())
model_big_3.add(Dense(5, activation='softmax'))
model_big_3.compile(loss=tf.keras.losses.CategoricalCrossentropy(), optimizer=tf.keras.optimizers.Adamax(learning_rate=0.01), metrics=['acc'])
history = model_big_3.fit(
    train_generator,
    steps_per_epoch=BATCH_SIZE,
    epochs=100,
    validation_data=valid_generator,
    batch_size=BATCH_SIZE
)
The training loss and validation loss vary as shown below: the training loss is constant throughout, and the validation loss spikes initially and then becomes constant.
What I tried out
I tried the solutions given here one by one and decreased the learning rate from 0.01 to 0.0001. This time the training loss did go down slightly, but the validation loss still fluctuates heavily. The training loss and validation loss vary as shown below.
The above solution link also suggests normalizing the input, but in my opinion the images don't need to be normalized, because the data doesn't vary much and the VGG network already has batch normalization; please correct me if I'm wrong. Please point out what is leading to this kind of behavior, what to change in the configuration, and how I can improve training.
One thing I see is that you set steps_per_epoch=BATCH_SIZE. Assume you have 3200 training samples and BATCH_SIZE=32. To go through all your training samples you would have to go through 3200/32 = 100 batches, but with steps_per_epoch=BATCH_SIZE=32 you only go through 32 * 32 = 1024 samples in an epoch. Set steps_per_epoch as
steps_per_epoch = number_of_train_samples // BATCH_SIZE
where BATCH_SIZE is whatever you specified in the generator. Alternatively you can leave it as None and model.fit will determine the right value internally.
As stated in the model.fit documentation located here,
Do not specify the batch_size if your data is in the form of datasets,
generators, or keras.utils.Sequence instances (since they generate batches).
Since in model.fit you use train_generator I assume this is a generator.
The VGG model was trained on ImageNet images that were preprocessed before being fed to the network (in Keras, tf.keras.applications.vgg19.preprocess_input converts images to BGR and zero-centers each channel with the ImageNet means), so somewhere in your input pipeline you should apply the same preprocessing to your images. What BATCH_SIZE did you use? Making it larger (within the limits of your memory size) may help smooth out the fluctuations.
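If the generators come from ImageDataGenerator (an assumption, since the question doesn't show how they are built), the preprocessing mentioned above could be wired in like this:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.vgg19 import preprocess_input

# Apply the same preprocessing VGG19 was trained with to every image.
train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
valid_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

# train_generator = train_datagen.flow_from_directory('data/train', ...)
# valid_generator = valid_datagen.flow_from_directory('data/valid', ...)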
I also recommend you use two Keras callbacks, EarlyStopping and ReduceLROnPlateau. Documentation is here. Set them up to monitor the validation loss. My suggested code is shown below:
estop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=4, verbose=1,
                                         restore_best_weights=True)
rlronp = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                              patience=2, verbose=1)
callbacks=[estop, rlronp]
# in model.fit add callbacks=callbacks
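Putting the pieces together, a sketch of the corrected training call (assuming train_generator and valid_generator already yield batches of BATCH_SIZE):
history = model_big_3.fit(
    train_generator,
    steps_per_epoch=None,             # or number_of_train_samples // BATCH_SIZE
    epochs=100,
    validation_data=valid_generator,  # no batch_size argument with generators
    callbacks=callbacks)              # [estop, rlronp] from above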
Does Keras keep the best-scoring epoch's weights when training finishes? If not, why not? Sometimes I will have an epoch that gets 95-ish % accuracy and then finish with an epoch that is 10% or so lower. I just never can tell whether it reverts back to that best epoch.
If you are using Keras, then in the ModelCheckpoint callback set save_best_only=True. With this option enabled, it saves the model that shows the best result for the metric you set in the monitor attribute, be it loss or accuracy.
Read more about it here - https://keras.io/callbacks/#modelcheckpoint
keras.callbacks.ModelCheckpoint(filepath, monitor='val_loss', verbose=0, save_best_only=True, save_weights_only=False, mode='auto', period=1)
I am working on a project where Keras ModelCheckpoint is used. This callback class seems to cover my needs except for one small detail.
I have not found a way to pass a counter to the epoch numbering so as to deal with resumed training. I often train for some epochs and then resume training later. I would like to have a consistent model-saving pattern like:
model.{epoch:03d}-{loss:.2f}.hdf5
but with numbering beginning from the epoch the previous training stopped and not from the beginning.
The current command I use is this:
ckp_saver = ModelCheckpoint(checkpoint_dir + "/model.{epoch:03d}-{loss:.2f}.hdf5", monitor='loss', verbose=0,
                            save_best_only=False, save_weights_only=True, mode='auto', period=1)
Is there any way to pass this information to ModelCheckpoint? The solution I found is to edit the Keras code and add a default argument containing the number of previously trained epochs (defaulting to 0 if not passed) so as not to break any other code, but I would prefer to avoid this if it's not necessary. Any other ideas?
The original code was taken from this file here.
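One approach that avoids editing the Keras source (a sketch; epochs_already_done, new_epochs and the load_weights path are hypothetical names) is to pass initial_epoch to model.fit when resuming. Callbacks then see the absolute epoch index, so the number formatted into the filename continues where the previous run stopped:
from tensorflow.keras.callbacks import ModelCheckpoint

epochs_already_done = 42   # hypothetical: epochs completed by the previous run
new_epochs = 10            # hypothetical: how many more epochs to train

ckp_saver = ModelCheckpoint(checkpoint_dir + "/model.{epoch:03d}-{loss:.2f}.hdf5", monitor='loss', verbose=0,
                            save_best_only=False, save_weights_only=True, mode='auto', period=1)

# model.load_weights('model.042-0.37.hdf5')  # hypothetical: restore the last checkpoint first
model.fit(x_train, y_train,
          initial_epoch=epochs_already_done,        # epoch numbering continues from here
          epochs=epochs_already_done + new_epochs,  # index of the final epoch
          callbacks=[ckp_saver])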
When I use the EarlyStopping callback, does Keras save the best model in terms of val_loss, or does it save the model at save_epoch = [best epoch in terms of val_loss] + EARLY_STOPPING_PATIENCE_EPOCHS?
If it's the second option, how do I save just the best model?
Here is code snippet:
early_stopping = EarlyStopping(monitor='val_loss', patience=EARLY_STOPPING_PATIENCE_EPOCHS)
history = model.fit_generator(
    train_generator,
    steps_per_epoch=100,  # 1 epoch = BATCH_SIZE * steps_per_epoch samples
    epochs=N_EPOCHS,
    validation_data=test_generator,
    validation_steps=20,
    callbacks=[early_stopping])
#Save train log to .csv
pd.DataFrame(history.history).to_csv('vgg16_binary_crossentropy_train_log.csv', index=False)
model.save('vgg16_binary_crossentropy.h5')
In v2.2.4+ of Keras, EarlyStopping has a restore_best_weights parameter which, when set to True, leaves the model with the weights from the epoch with the best validation performance. For example:
EarlyStopping(restore_best_weights=True)
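Applied to the snippet in the question, it might look like this (a sketch; the rest of the fit call stays as it was):
from tensorflow.keras.callbacks import EarlyStopping

# With restore_best_weights=True, the best-val_loss weights are restored when
# training stops early, so a subsequent model.save(...) keeps that model.
early_stopping = EarlyStopping(monitor='val_loss',
                               patience=EARLY_STOPPING_PATIENCE_EPOCHS,
                               restore_best_weights=True)

# history = model.fit_generator(..., callbacks=[early_stopping])
# model.save('vgg16_binary_crossentropy.h5')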
From my experience using the EarlyStopping callback, the model will not be saved automatically... it will just stop training, and when you save it manually it will be the second option you describe.
To have your model saved each time val_loss decreases, see the following documentation page:
https://keras.io/callbacks/ and look at the "Example: model checkpoints" section, which tells you exactly what to do.
Note that if you wish to re-use your saved model, I have had better luck using save_weights in combination with saving the architecture in JSON. YMMV.
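For reference, that save_weights-plus-JSON pattern might look roughly like this (file names are illustrative):
from tensorflow.keras.models import model_from_json

# Saving: architecture as JSON, weights separately.
with open('model_architecture.json', 'w') as f:
    f.write(model.to_json())
model.save_weights('model_weights.h5')

# Loading later:
with open('model_architecture.json') as f:
    restored = model_from_json(f.read())
restored.load_weights('model_weights.h5')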