So I'm trying to train a model on Colab, and it is going to take roughly 70-72 hours of continuous running. I have a free account, so I get kicked off due to over-use or inactivity pretty frequently, which means I can't just dump the history into a pickle file at the end.
history = model.fit_generator(custom_generator(train_csv_list,batch_size), steps_per_epoch=len(train_csv_list[:13400])//(batch_size), epochs=1000, verbose=1, callbacks=[stop_training], validation_data=(x_valid,y_valid))
I found the CSVLogger callback and added it to my callbacks as below, but it won't create model_history_log.csv for some reason. I don't get any error or warning. What am I doing wrong?
My goal is simply to save the accuracy and loss throughout the training process.
class stop_(Callback):
    def on_epoch_end(self, epoch, logs={}):
        model.save(Path("/content/drive/MyDrive/.../model" + str(int(epoch))))
        CSVLogger("/content/drive/MyDrive/.../model_history_log.csv", append=True)
        if logs.get('accuracy') > ACCURACY_THRESHOLD:
            print("\nReached %2.2f%% accuracy, so stopping training!!" % (ACCURACY_THRESHOLD * 100))
            self.model.stop_training = True

stop_training = stop_()
Also, since I'm saving the model at every epoch, does the saved model contain this information? So far I haven't found anything, and I doubt it saves accuracy, loss, val_accuracy, etc.
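For reference, CSVLogger only writes its file when it is passed to model.fit in the callbacks list; instantiating it inside another callback's on_epoch_end does nothing. A minimal sketch of the intended usage, reusing the fit call from the question (the shortened path is kept as-is):

from tensorflow.keras.callbacks import CSVLogger

# writes epoch, loss, accuracy (and the val_* metrics, if present) after every epoch
csv_logger = CSVLogger("/content/drive/MyDrive/.../model_history_log.csv", append=True)

history = model.fit_generator(custom_generator(train_csv_list, batch_size),
                              steps_per_epoch=len(train_csv_list[:13400]) // batch_size,
                              epochs=1000,
                              verbose=1,
                              callbacks=[stop_training, csv_logger],  # the logger must be in this list
                              validation_data=(x_valid, y_valid))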
I think you want to write your callback as follows:
import os
import pandas as pd
import tensorflow as tf

class STOP(tf.keras.callbacks.Callback):
    def __init__(self, model, csv_path, model_save_dir, epochs, acc_thld):  # initialization of the callback
        # model is your compiled model
        # csv_path is the path where the csv file will be stored
        # model_save_dir is the path to the directory where model files will be saved
        # epochs is the number of epochs you set in model.fit
        # acc_thld is the accuracy threshold at which training halts
        self.model = model
        self.csv_path = csv_path
        self.model_save_dir = model_save_dir
        self.epochs = epochs
        self.acc_thld = acc_thld
        self.acc_list = []    # create empty list to store accuracy
        self.loss_list = []   # create empty list to store loss
        self.epoch_list = []  # create empty list to store the epoch

    def on_epoch_end(self, epoch, logs=None):  # method runs at the end of each epoch
        savestr = '_' + str(epoch + 1) + '.h5'  # model will be saved as an .h5 file with name _epoch.h5
        save_path = os.path.join(self.model_save_dir, savestr)
        acc = logs.get('accuracy')  # get the accuracy for this epoch
        loss = logs.get('loss')     # get the loss for this epoch
        self.model.save(save_path)  # save the model
        self.acc_list.append(acc)
        self.loss_list.append(loss)
        self.epoch_list.append(epoch + 1)
        if acc > self.acc_thld or epoch + 1 == self.epochs:  # check if acc > thld or if this was the last epoch
            self.model.stop_training = True  # stop training
            Eseries = pd.Series(self.epoch_list, name='Epoch')
            Accseries = pd.Series(self.acc_list, name='accuracy')
            Lseries = pd.Series(self.loss_list, name='loss')
            df = pd.concat([Eseries, Lseries, Accseries], axis=1)  # create a dataframe with columns Epoch, loss, accuracy
            df.to_csv(self.csv_path, index=False)  # convert dataframe to a csv file and save it
            if acc > self.acc_thld:
                print('\nTraining halted on epoch', epoch + 1, 'when accuracy exceeded the threshold')
Then, before you run model.fit, use code like this:
epochs = 20  # set the number of epochs for model.fit and the callback
sdir = r'C:\Temp\stooges'  # set the directory where saved model files and the csv file will be stored
acc_thld = .98  # set the accuracy threshold
csv_path = os.path.join(sdir, 'traindata.csv')  # name of the csv file to be saved in sdir
callbacks = STOP(model, csv_path, sdir, epochs, acc_thld)  # instantiate the callback
Remember to pass the callback to model.fit as callbacks=[callbacks]. I tested this on a simple dataset. It ran for only 3 epochs before the accuracy exceeded the threshold of .98. Since it ran for 3 epochs, it created 3 saved model files in sdir, labeled as
_1.h5
_2.h5
_3.h5
It also created the csv file labelled as traindata.csv. The csv file content was
Epoch loss accuracy
1 8.086007 .817778
2 6.911876 .974444
3 6.129871 .987778
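For completeness, a hedged sketch of how the callback could be wired into model.fit; the data names (train_gen, x_valid, y_valid) are placeholders, not part of the tested example:

history = model.fit(train_gen,
                    validation_data=(x_valid, y_valid),
                    epochs=epochs,          # same value that was passed to the STOP callback
                    callbacks=[callbacks],  # the STOP instance created above
                    verbose=1)

Note that, as written, the callback only writes the csv when training stops; if you want the file updated after every epoch (useful on Colab, where the session can die mid-run), move the DataFrame creation and df.to_csv call out of the if block.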
Related
I've been taking part in a hackathon and playing with Keras callbacks and neural networks. Is there a way to monitor not just loss or val_loss but BOTH of them, to avoid overfitting either the test or the train set?
For example, can I put a function in the monitor field instead of just one field name?
I want to monitor val_loss to pick the lowest value, but I also want a second criterion to pick the minimum difference between val_loss and loss.
I have an answer to a problem that is pretty similar to this, here.
Basically, it is not possible to monitor multiple metrics with keras callbacks. However, you could define a custom callback (see the documentation for more info) that can access the logs at each epoch and do some operations.
Say you want to monitor loss and val_loss; you can do something like this:
import tensorflow as tf
from tensorflow import keras
class CombineCallback(tf.keras.callbacks.Callback):
    def __init__(self, **kargs):
        super(CombineCallback, self).__init__(**kargs)

    def on_epoch_end(self, epoch, logs={}):
        logs['combine_metric'] = logs['val_loss'] + logs['loss']
Side note: the most important thing, in my opinion, is to monitor the validation loss. The training loss will of course keep dropping, so it is not really that meaningful to observe. If you really want to monitor both, I suggest adding a multiplicative factor and giving more weight to the validation loss. In this case:
class CombineCallback(tf.keras.callbacks.Callback):
    def __init__(self, **kargs):
        super(CombineCallback, self).__init__(**kargs)

    def on_epoch_end(self, epoch, logs={}):
        factor = 0.8
        logs['combine_metric'] = factor * logs['val_loss'] + (1 - factor) * logs['loss']
Then, if you only want to monitor this new metric during the training, you can use it like this:
model.fit(
...
callbacks=[CombineCallback()],
)
Instead, if you also want to stop the training using the new metric, you should combine the new callback with the early stopping callback:
combined_cb = CombineCallback()
early_stopping_cb = keras.callbacks.EarlyStopping(monitor="combine_metric")
model.fit(
...
callbacks=[combined_cb, early_stopping_cb],
)
Be sure to put the CombineCallback before the early stopping callback in the callbacks list.
Moreover, you can draw more inspiration here.
You can choose between two approaches:
Create a custom metric to record the quantity you want, by subclassing tf.keras.metrics.Metric. See https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric for an example (a minimal sketch also follows this list).
You can then use your metric in standard callbacks e.g. EarlyStopping()
Create a custom callback to do the calculation (and take the action) you want, by subclassing tf.keras.callbacks.Callback. See https://www.tensorflow.org/guide/keras/custom_callback for how to do this.
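As an illustration of the first approach, here is a minimal sketch of a Metric subclass; the class name RunningMSE and the quantity it tracks are made up for this example, and in TF versions before 2.5 the reset method is named reset_states rather than reset_state:

import tensorflow as tf

class RunningMSE(tf.keras.metrics.Metric):
    # hypothetical metric: mean squared error accumulated over batches
    def __init__(self, name="running_mse", **kwargs):
        super().__init__(name=name, **kwargs)
        self.total = self.add_weight(name="total", initializer="zeros")
        self.count = self.add_weight(name="count", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        err = tf.reduce_mean(tf.square(tf.cast(y_true, y_pred.dtype) - y_pred))
        self.total.assign_add(err)
        self.count.assign_add(1.0)

    def result(self):
        return self.total / self.count

    def reset_state(self):  # called automatically at the start of each epoch
        self.total.assign(0.0)
        self.count.assign(0.0)

# model.compile(optimizer="adam", loss="mse", metrics=[RunningMSE()])
# early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_running_mse", mode="min")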
Below is a Keras custom callback that should do the job. The callback monitors both the training accuracy and the validation accuracy. The form of the callback is
callbacks=[SOMT(model, train_thold, valid_thold)] where:
model is the name of your compiled model
train_thold is a float between 0 and 1. It is the training accuracy that must be achieved by the model in order to conditionally stop training
valid_thold is a float between 0 and 1. It is the validation accuracy that must be achieved by the model in order to conditionally stop training
Note: to stop training, BOTH train_thold and valid_thold must be exceeded in the SAME epoch.
If you want to stop training based solely on the training accuracy, set valid_thold to 0.0.
Similarly, if you want to stop training on just the validation accuracy, set train_thold to 0.0.
Note that if both thresholds are not achieved in the same epoch, training will continue until the number of epochs specified in model.fit is reached.
For example, take the case where you want to stop training when the training accuracy has reached or exceeded 95% and the validation accuracy has reached at least 85%;
then the code would be callbacks=[SOMT(my_model, .95, .85)]
import time
from tensorflow import keras

class SOMT(keras.callbacks.Callback):
    def __init__(self, model, train_thold, valid_thold):
        super(SOMT, self).__init__()
        self.model = model
        self.train_thold = train_thold
        self.valid_thold = valid_thold

    def on_train_begin(self, logs=None):
        print('Starting training - training will halt if training accuracy achieves or exceeds', self.train_thold)
        print('and validation accuracy meets or exceeds', self.valid_thold)
        msg = '{0:^8s}{1:^12s}{2:^12s}{3:^12s}{4:^12s}{5:^12s}'.format('Epoch', 'Train Acc', 'Train Loss', 'Valid Acc', 'Valid Loss', 'Duration')
        print(msg)

    def on_train_batch_end(self, batch, logs=None):
        acc = logs.get('accuracy') * 100  # get training accuracy
        loss = logs.get('loss')
        msg = '{0:1s}processed batch {1:4s} training accuracy= {2:8.3f} loss: {3:8.5f}'.format(' ', str(batch), acc, loss)
        print(msg, '\r', end='')  # prints over the same line to show a running batch count

    def on_epoch_begin(self, epoch, logs=None):
        self.now = time.time()

    def on_epoch_end(self, epoch, logs=None):
        later = time.time()
        duration = later - self.now
        tacc = logs.get('accuracy')
        vacc = logs.get('val_accuracy')
        tr_loss = logs.get('loss')
        v_loss = logs.get('val_loss')
        ep = epoch + 1
        print(f'{ep:^8.0f} {tacc:^12.2f}{tr_loss:^12.4f}{vacc:^12.2f}{v_loss:^12.4f}{duration:^12.2f}')
        if tacc >= self.train_thold and vacc >= self.valid_thold:
            print(f'\ntraining accuracy and validation accuracy reached the thresholds on epoch {epoch + 1}')
            self.model.stop_training = True  # stop training
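A hedged usage sketch; my_model, the data arrays, and the epoch count are placeholders rather than values from the answer:

history = my_model.fit(x_train, y_train,
                       validation_data=(x_valid, y_valid),
                       epochs=50,
                       verbose=0,  # the callback prints its own per-batch and per-epoch lines
                       callbacks=[SOMT(my_model, .95, .85)])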
I am downloading the model microsoft/Multilingual-MiniLM-L12-H384 (https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384/tree/main) and then using it.
Transformers version: 4.11.3
I have written the code below:
import numpy as np
import transformers as tr

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = np.sum(predictions == labels) / predictions.shape[0]
    return {"accuracy": acc}
model = tr.BertForSequenceClassification.from_pretrained("/home/pc/minilm_model",num_labels=2)
model.to(device)
print("hello")
training_args = tr.TrainingArguments(
output_dir='/home/pc/proj/results2', # output directory
num_train_epochs=10, # total number of training epochs
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=32, # batch size for evaluation
learning_rate=2e-5,
warmup_steps=1000, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=1000,
evaluation_strategy="epoch",
save_strategy="no"
)
trainer = tr.Trainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_data, # training dataset
eval_dataset=val_data, # evaluation dataset
compute_metrics=compute_metrics
)
Is there a way to retrieve (I want to use TensorBoard):
the training loss for every epoch
the validation loss for every epoch
I do not see anything in my logging directory apart from the model arguments, and it is otherwise empty.
How can I save my training and validation loss so that the TensorBoard event files capture them?
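A hedged sketch of the usual fix, assuming a transformers 4.x version where TrainingArguments accepts report_to and logging_strategy and the tensorboard package is installed: ask the Trainer to report to TensorBoard and to log once per epoch, then point TensorBoard at logging_dir.

training_args = tr.TrainingArguments(
    output_dir='/home/pc/proj/results2',
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    warmup_steps=1000,
    weight_decay=0.01,
    logging_dir='./logs',          # TensorBoard event files are written here
    logging_strategy="epoch",      # log the training loss once per epoch
    evaluation_strategy="epoch",   # evaluate (and log eval_loss) once per epoch
    report_to=["tensorboard"],     # make sure the TensorBoard callback is active
    save_strategy="no"
)
# afterwards: tensorboard --logdir ./logs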
I am currently trying to add a feature to interrupt and resume training on a GAN created from this example code: https://machinelearningmastery.com/how-to-develop-an-auxiliary-classifier-gan-ac-gan-from-scratch-with-keras/
I managed to get it working in a way where I save the weights of the entire composite GAN in the summarize_performance function, which gets triggered every 10 epochs, like this:
# save all weights
filename3 = 'weights_%08d.h5' % (step+1)
gan_model.save_weights(filename3)
print('>Saved: %s and %s and %s' % (filename1, filename2, filename3))
which is loaded in a function I added to the start of the program called load_model, which takes the architecture of the GAN built as normal but updates its weights to the most recent values, like this:
# load model from file and return start batch number
def load_model(gan_model):
    start_batch = 0
    files = glob.glob("./weights_0*.h5")
    if len(files) > 0:
        most_recent_file = files[len(files) - 1]
        gan_model.load_weights(most_recent_file)
        # TODO: breaks if using more than 8 digits for batches
        start_batch = int(most_recent_file[10:18])
    if start_batch != 0:
        print("> found existing weights; starting at batch %d" % start_batch)
    return start_batch
where start_batch gets passed to the train function in order to skip the already completed epochs.
While this weight-saving approach does "work", I still think my approach is wrong, since I've discovered that the weight data obviously does not include the optimizer state of the GAN, so training does not continue as it would have if it hadn't been interrupted.
The way I've found to save progress while also saving the optimizer state is apparently to save the entire model instead of just the weights.
Here I run into a problem, since in a GAN I don't have just one model which I train, I have 3 models:
The generator model g_model
The discriminator model d_model
and the composite GAN model gan_model
which are all connected and dependent on each other. If I took the naive approach and saved and restored each of these component models individually, I'd end up with 3 separate, disjointed models instead of a GAN.
Is there a way to save and restore the entire GAN in a way that would let me resume training as if no interruption had occurred?
Maybe consider using tf.train.Checkpoint, if you would like to restore your entire GAN:
### In your training loop
checkpoint_dir = '/checkpoints'
checkpoint = tf.train.Checkpoint(gan_optimizer=gan_optimizer,
                                 discriminator_optimizer=discriminator_optimizer,
                                 generator=generator,
                                 discriminator=discriminator,
                                 gan_model=gan_model)
ckpt_manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir, max_to_keep=3)

if ckpt_manager.latest_checkpoint:
    checkpoint.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!!')
....
....
if (epoch + 1) % 40 == 0:
    ckpt_save_path = ckpt_manager.save()
    print('Saving checkpoint for epoch {} at {}'.format(epoch + 1, ckpt_save_path))

### After x number of epochs, just save your generator model for inference.
generator.save('your_model.h5')
You can also consider getting rid of the composite model completely. Here is an example of what I mean.
I'm training a huge model. Unfortunately, the runtime environment breaks off about halfway through and I have to restart training. I save the model after each epoch.
My question is: say I've trained 5 out of 10 epochs.
How do I load the model and indicate that I was at the 5th epoch, so that training continues from there and only runs the remaining 5 epochs? I know I can load the model, but how do I say that I was at epoch 5 and that only 5 more epochs need to run, because I wanted a total of 10?
cp_callback = [tf.keras.callbacks.ModelCheckpoint(
filepath='/saved/model.h5',
verbose=1,
save_weights_only=True,
save_freq= 'epoch'),
tf.keras.callbacks.EarlyStopping(monitor='loss', patience=2)]
You can save the epoch number in a separate file (a pickle or json file).
import json

train_parameters = {'iter': iteration, 'batch_size': batch_size}

# saving
json.dump(train_parameters, open(output_path + "trainParameters.txt", 'w'))

# loading
train_parameters = json.load(open(path_to_saved_model + "trainParameters.txt"))
# run one dummy training step so that the model and optimizer variables are built,
# then load the previously saved weights
input = tf.random.uniform([8, 24], 0, 100, dtype=tf.int32)
model.compile(optimizer=optimizer, loss=training_loss, metrics=evaluation_accuracy)
hist = model.fit((input, input), input, epochs=1, steps_per_epoch=1, verbose=0)

model.load_weights(path_to_saved_model + 'saved.h5')
But if you also need to preserve the learning-rate step, save the optimizer state as well: the optimizer state contains the iteration number (the number of batches passed).
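To make a restarted run pick up where it left off, here is a hedged sketch of the resume step; initial_epoch is a standard model.fit argument, while the 'epoch' key in the json file and the trainX/trainY names are assumptions added for this sketch:

# when saving, also record the last completed epoch, e.g.:
# json.dump({'epoch': last_epoch, 'batch_size': batch_size}, open(output_path + "trainParameters.txt", 'w'))

train_parameters = json.load(open(path_to_saved_model + "trainParameters.txt"))
model.load_weights(path_to_saved_model + 'saved.h5')

model.fit(trainX, trainY,
          epochs=10,                                # the total you originally wanted
          initial_epoch=train_parameters['epoch'],  # e.g. 5: training resumes with epoch 6
          callbacks=cp_callback)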
When you run a Keras neural network model you might see something like this in the console:
Epoch 1/3
6/1000 [..............................] - ETA: 7994s - loss: 5111.7661
As time goes on the loss hopefully improves. I want to log these losses to a file over time so that I can learn from them. I have tried:
logging.basicConfig(filename='example.log', filemode='w', level=logging.DEBUG)
but this doesn't work. I am not sure what level of logging I need in this situation.
I have also tried using a callback like in:
def generate_train_batch():
    while 1:
        for i in xrange(0, dset_X.shape[0], 3):
            yield dset_X[i:i+3, :, :, :], dset_y[i:i+3, :, :]

class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.losses = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))

logloss = LossHistory()
colorize.fit_generator(generate_train_batch(), samples_per_epoch=1000, nb_epoch=3, callbacks=['logloss'])
but obviously this isn't writing to a file. Whatever the method, through a callback or the logging module or anything else, I would love to hear your solutions for logging loss of a keras neural network to a file. Thanks!
You can use the CSVLogger callback.
As an example:
from keras.callbacks import CSVLogger
csv_logger = CSVLogger('log.csv', append=True, separator=';')
model.fit(X_train, Y_train, callbacks=[csv_logger])
Look at: Keras Callbacks
There is a simple solution to your problem. Every time one of the fit methods is used, it returns a special callback called the History callback. It has a history attribute, which is a dictionary of all metrics recorded after every epoch. So to get the list of loss values after every epoch you can easily do:
history_callback = model.fit(params...)
loss_history = history_callback.history["loss"]
It's easy to save such a list to a file (e.g. by converting it to a numpy array and using the savetxt method).
UPDATE:
Try:
import numpy
numpy_loss_history = numpy.array(loss_history)
numpy.savetxt("loss_history.txt", numpy_loss_history, delimiter=",")
UPDATE 2:
The solution to the problem of recording the loss after every batch is described in the Keras callbacks documentation, in the Create a Callback paragraph.
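A minimal sketch of that idea; the callback name and the batch_loss.txt filename are placeholders:

import tensorflow as tf

class BatchLossLogger(tf.keras.callbacks.Callback):
    def __init__(self, path):
        super().__init__()
        self.path = path

    def on_batch_end(self, batch, logs=None):
        # append one "batch,loss" line per batch
        with open(self.path, 'a') as f:
            f.write('%d,%f\n' % (batch, logs.get('loss')))

# model.fit(X_train, Y_train, callbacks=[BatchLossLogger('batch_loss.txt')])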
Old question, but here goes: Keras history output maps perfectly onto a pandas DataFrame.
If you want to write the entire history to csv in one line:
pandas.DataFrame(model.fit(...).history).to_csv("history.csv")
Cheers
You can redirect the sys.stdout object to a file before the model.fit call and reassign it to the standard console after model.fit, as follows:
import sys
oldStdout = sys.stdout
file = open('logFile', 'w')
sys.stdout = file
model.fit(Xtrain, Ytrain)
sys.stdout = oldStdout
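A variant of the same idea using the standard library's contextlib.redirect_stdout, so that stdout is restored even if fit raises; Xtrain and Ytrain are the same placeholders as above:

import contextlib

with open('logFile', 'w') as f, contextlib.redirect_stdout(f):
    model.fit(Xtrain, Ytrain)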
In TensorFlow 2.0 it is quite easy to get the loss and accuracy of each epoch, because model.fit returns a History object. Its History.history attribute is a record of training loss values and metric values at successive epochs, as well as validation loss values and validation metric values.
If you have validation Data
History = model.fit(trainX,trainY,validation_data = (testX,testY),batch_size= 100, epochs = epochs,verbose = 1)
train_loss = History.history['loss']
val_loss = History.history['val_loss']
acc = History.history['accuracy']
val_acc = History.history['val_accuracy']
If you don't have validation Data
History = model.fit(trainX,trainY,batch_size= 100, epochs = epochs,verbose = 1)
train_loss = History.history['loss']
acc = History.history['accuracy']
Then, to save the list to a text file, use the code below:
import numpy as np

train_loss = np.array(train_loss)
np.savetxt("train_loss.txt", train_loss, delimiter=",")
Best is to create a LambdaCallback:
from keras.callbacks import LambdaCallback
txt_log = open('loss_log.txt', mode='wt', buffering=1)
save_op_callback = LambdaCallback(
    on_epoch_end=lambda epoch, logs: txt_log.write(
        str({'epoch': epoch, 'loss': logs['loss']}) + '\n'),
    on_train_end=lambda logs: txt_log.close()
)
Now, just add it like this in the model.fit function:
model.fit(...,callbacks = [save_op_callback])