According to the AWS SageMaker documentation, in order to track metrics in CloudWatch for custom (non-built-in) ML algorithms, I have to define my estimator as below.
But I am not sure how to alter my training script so that the metric definitions declared in my estimator can pick up these values.
estimator = Estimator(image_name=ImageName,
                      role='SageMakerRole',
                      instance_count=1,
                      instance_type='ml.c4.xlarge',
                      k=10,
                      sagemaker_session=sagemaker_session,
                      metric_definitions=[
                          {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
                          {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
                      ])
In my training code, I have
for epoch in range(1, args.epochs + 1):
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_loader):
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        model.zero_grad()
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]
        total_loss += loss.item()
        loss.backward()  # compute the gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip to prevent exploding gradients
        optimizer.step()  # update parameters based on the gradients, the learning rate, etc.
    logger.info("Average training loss: %f\n", total_loss / len(train_loader))
Here, I want the train:error to pick up total_loss / len(train_loader) but I am not sure how to assign this.
You have to define a regex that captures the pattern your training script actually prints. Since your script logs "Average training loss: <value>", try:
{'Name': 'Average training loss', 'Regex': 'Average training loss: ([0-9\\.]+)'}
(The Name is just the label the metric gets in CloudWatch; you can keep 'train:error' if you prefer.) You can test the regex in an online regex tester and see what it captures.
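For completeness, here is a minimal sketch of how the two pieces fit together, based on the logger call and estimator above (the metric name 'train:error' is kept from the original estimator; the exact name is up to you):

# In the training script: emit the value in a stable, parseable format.
avg_train_loss = total_loss / len(train_loader)
logger.info("Average training loss: %f", avg_train_loss)

# In the estimator: the regex must match that exact log format.
metric_definitions = [
    {'Name': 'train:error', 'Regex': 'Average training loss: ([0-9\\.]+)'}
]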
I use Colab (with GPU) to run my code, but it takes a long time, approximately 12 hours per epoch. On the other hand, when I used Keras it took about 1 hour per epoch.
I want to run the code in PyTorch to fine-tune it. How can I make the PyTorch version faster?
# function to train the model
def train():
    model.train()
    total_loss, total_accuracy = 0, 0
    # empty lists to save model predictions and labels
    total_preds = []
    Labels = []
    # iterate over batches
    for step, batch in enumerate(train_dataloader):
        # progress update after every 10 batches
        if step % 10 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))
        # push the batch to gpu
        # batch = [r for r in batch]
        sent_id, mask, labels = batch['input_ids'], batch['attention_mask'], batch['labels']
        # clear previously calculated gradients
        model.zero_grad()
        # get model predictions for the current batch
        preds = model(sent_id, mask, labels)
        preds = torch.argmax(preds, dim=1)
        preds = preds.detach().numpy()
        labels = labels.detach().numpy()
        alpha = 0.25
        gamma = 2
        ce_loss = dice_loss(preds, labels)
        total_loss = total_loss + ce_loss
        # clip the gradients to 1.0; helps prevent the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # update parameters
        optimizer.step()
        total_preds.append(preds)
        total_accuracy += (preds == labels).sum()
    # compute the training loss of the epoch
    avg_loss = total_loss / len(traindataset)
    avg_accuracy = total_accuracy / len(traindataset)
    # return the loss and predictions
    return avg_loss, total_preds, avg_accuracy
I have the following code that trains a model and stores logs in a results variable
import tqdm.notebook as tq
import sys

num_epochs = 10
results = {"train_loss": [], "val_loss": [], "train_acc": [], "val_acc": []}
for epoch in range(1, num_epochs + 1):
    sys.stdout.write(f"---Epoch {epoch}/{num_epochs}: ")
    epoch_loss = {"train": [], "val": []}
    epoch_acc = {"train": [], "val": []}
    for phase in ['train', 'val']:
        if phase == "train":
            model.train(True)
        else:
            model.train(False)
        # most important thing I learned from this project was how to fix tqdm nastiness in colab
        for batch_idx, (x, y) in tq.tqdm(enumerate(dataloaders[phase]),
                                         total=len(dataloaders[phase]),
                                         leave=False):
            # put data on device and get output
            x, y = x.to(device), y.to(device)
            preds = model(x)
            # calculate and log model loss
            batch_loss = criterion(preds, y)
            epoch_loss[phase].append(batch_loss.item())
            # calculate acc and append to epoch_acc
            preds = torch.argmax(preds, dim=1)
            batch_acc = torch.sum(preds == y) / len(y)
            epoch_acc[phase].append(batch_acc)
            # zero the grad
            optimizer.zero_grad()
            # take a step if training mode is on
            if phase == "train":
                batch_loss.backward()
                optimizer.step()
                scheduler.step()
    # at the end of each epoch, calculate avg epoch train/val loss/accuracy
    train_loss = sum(epoch_loss["train"]) / len(epoch_loss["train"])
    val_loss = sum(epoch_loss["val"]) / len(epoch_loss["val"])
    train_acc = 100 * sum(epoch_acc["train"]) / len(epoch_acc["train"])
    val_acc = 100 * sum(epoch_acc["val"]) / len(epoch_acc["val"])
    # log losses and accs every epoch
    results['train_loss'].extend(epoch_loss['train'])
    results['train_acc'].extend(epoch_acc['train'])
    results['val_loss'].extend(epoch_loss['val'])
    results['val_acc'].extend(epoch_acc['val'])
    # and print it nicely
    sys.stdout.write("train_loss: {:.4f} train_acc: {:.2f}% ".format(train_loss, train_acc))
    sys.stdout.write("val_loss: {:.4f} val_acc: {:.2f}%\n".format(val_loss, val_acc))
I'm logging the average accuracy and average loss of every batch into separate training/validation loss/acc arrays. The problem is that I have more training batches than validation batches, so when I try to graph my training logs I get something like this:
Is there a workaround for this?
You are making a few conceptual errors:
You are calculating the validation loss/accuracy in multiple batches, as opposed to over the entire validation set
You are calculating the validation accuracy for a static model after it has already trained on all the data, as opposed to periodically assessing the validation accuracy as it is training
You should average your batch training performance over each epoch, and once per epoch calculate the complete loss/acc statistics across the entire validation set. Then you will have n_epochs values for both training and validation and can plot them on the same axes.
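As a minimal sketch (using the per-epoch averages already computed in your loop; matplotlib is assumed for the plot), store one value per epoch instead of extending the raw per-batch lists, so both curves end up with num_epochs points and share an x-axis:

# inside the epoch loop, after train_loss / val_loss / train_acc / val_acc are computed
results['train_loss'].append(train_loss)   # one value per epoch
results['val_loss'].append(val_loss)
results['train_acc'].append(train_acc)
results['val_acc'].append(val_acc)

# after training: both curves now have num_epochs points
import matplotlib.pyplot as plt
epochs = range(1, num_epochs + 1)
plt.plot(epochs, results['train_loss'], label='train loss')
plt.plot(epochs, results['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.legend()
plt.show()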
Please add at least a brief comment on your reasoning so that I can improve my question. Thank you :-)
I'm trying to train a tf.keras model with gradient accumulation (GA). I don't want to write a custom training loop (like this); instead I want to customize the .fit() method by overriding train_step. Is it possible, and how can I accomplish this? The reason is that if we want to benefit from Keras built-in functionality like fit and callbacks, we don't want a custom training loop, but at the same time, if we need to override train_step for some reason (like GA or something else), we can customize the fit method and still leverage those built-in functions.
Also, I know the pros of using GA, but what are the major cons of using it? Why does it not come as a default but as an optional feature of the framework?
# overriding train step
# my attempt
# it's not appropriately implemented
# and need to fix
class CustomTrainStep(keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = n_gradients
        self.gradient_accumulation = [
            tf.zeros_like(this_var) for this_var in self.trainable_variables
        ]

    def train_step(self, data):
        x, y = data
        batch_size = tf.cast(tf.shape(x)[0], tf.float32)
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(
                y, y_pred, regularization_losses=self.losses
            )
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        accum_gradient = [
            (acum_grad + grad) for acum_grad, grad in
            zip(self.gradient_accumulation, gradients)
        ]
        accum_gradient = [
            this_grad / batch_size for this_grad in accum_gradient
        ]
        # apply accumulated gradients
        self.optimizer.apply_gradients(
            zip(accum_gradient, self.trainable_variables)
        )
        # TODO: reset self.gradient_accumulation
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}
Please run and check it with the following toy setup.
# Model
size = 32
input = keras.Input(shape=(size, size, 3))
efnet = keras.applications.DenseNet121(
    weights=None,
    include_top=False,
    input_tensor=input
)
base_maps = keras.layers.GlobalAveragePooling2D()(efnet.output)
base_maps = keras.layers.Dense(
    units=10, activation='softmax',
    name='primary'
)(base_maps)
custom_model = CustomTrainStep(
    n_gradients=10, inputs=[input], outputs=[base_maps]
)

# bind all
custom_model.compile(
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=['accuracy'],
    optimizer=keras.optimizers.Adam()
)

# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.expand_dims(x_train, -1)
x_train = tf.repeat(x_train, 3, axis=-1)
x_train = tf.divide(x_train, 255)
x_train = tf.image.resize(x_train, [size, size])  # if we want to resize
y_train = tf.one_hot(y_train, depth=10)

# customized fit
custom_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose=1)
Update
I've found that others have also tried to achieve this and ended up with the same issue. One of them found a workaround, here, but it's too messy, and I think there should be a better approach.
Update 2
The accepted answer (by Mr. For Example) is fine and works well with a single strategy. Now I would like to start a second bounty to extend it to support multi-GPU, TPU, and mixed-precision training. There are some complications; see the details.
Yes, it is possible to customize the .fit() method by overriding train_step without a custom training loop. The following simple example shows how to train a simple MNIST classifier with gradient accumulation:
import tensorflow as tf

class CustomTrainStep(tf.keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = tf.constant(n_gradients, dtype=tf.int32)
        self.n_acum_step = tf.Variable(0, dtype=tf.int32, trainable=False)
        self.gradient_accumulation = [
            tf.Variable(tf.zeros_like(v, dtype=tf.float32), trainable=False)
            for v in self.trainable_variables
        ]

    def train_step(self, data):
        self.n_acum_step.assign_add(1)
        x, y = data
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign_add(gradients[i])
        # If n_acum_step reaches n_gradients, apply the accumulated gradients to update the variables; otherwise do nothing
        tf.cond(tf.equal(self.n_acum_step, self.n_gradients), self.apply_accu_gradients, lambda: None)
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def apply_accu_gradients(self):
        # apply accumulated gradients
        self.optimizer.apply_gradients(zip(self.gradient_accumulation, self.trainable_variables))
        # reset
        self.n_acum_step.assign(0)
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign(tf.zeros_like(self.trainable_variables[i], dtype=tf.float32))

# Model
input = tf.keras.Input(shape=(28, 28))
base_maps = tf.keras.layers.Flatten(input_shape=(28, 28))(input)
base_maps = tf.keras.layers.Dense(128, activation='relu')(base_maps)
base_maps = tf.keras.layers.Dense(units=10, activation='softmax', name='primary')(base_maps)
custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])

# bind all
custom_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=['accuracy'],
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3)
)

# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.divide(x_train, 255)
y_train = tf.one_hot(y_train, depth=10)

# customized fit
custom_model.fit(x_train, y_train, batch_size=6, epochs=3, verbose=1)
Outputs:
Epoch 1/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.5053 - accuracy: 0.8584
Epoch 2/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.1389 - accuracy: 0.9600
Epoch 3/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.0898 - accuracy: 0.9748
Pros:
Gradient accumulation is a mechanism to split the batch of samples, used for training a neural network, into several mini-batches of samples that will be run sequentially.
Because GA calculates the loss and gradients after each mini-batch but, instead of updating the model parameters, waits and accumulates the gradients over consecutive batches, it can overcome memory constraints, i.e. it trains the model with less memory while behaving much as if it used a large batch size.
Example: If you run a gradient accumulation with steps of 5 and a batch size of 4 images, it serves almost the same purpose as running with a batch size of 20 images.
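As a quick back-of-the-envelope sketch of that example (the numbers come straight from the quote above):

micro_batch_size = 4        # what fits in memory per forward/backward pass
accumulation_steps = 5      # mini-batches whose gradients are accumulated before an update
effective_batch_size = micro_batch_size * accumulation_steps  # = 20, roughly one large-batch update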
We can also parallelize training when using GA, i.e. aggregate gradients from multiple machines.
Things to consider:
This technique works so well that it is widely used. There are a few things to consider before using it, though I don't think they should be called cons; after all, all GA does is turn 4 + 4 into 2 + 2 + 2 + 2.
If your machine already has enough memory for a batch size that is large enough, there is no need to use it: it is well known that an overly large batch size can lead to poor generalization, and GA will certainly run slower if you use it to reach the same batch size that your machine's memory can already handle.
Reference:
What is Gradient Accumulation in Deep Learning?
Thanks to @Mr. For Example for his convenient answer.
Usually, I have also observed that gradient accumulation does not speed up training, since we are doing n_gradients forward passes and computing all the gradients, but it does speed up the convergence of the model. I found that using the mixed_precision technique can be really helpful here. Details here.
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)
Here is a complete gist.
I'm working on an NLP task from a Kaggle competition; the goal is to predict whether a tweet describes a real disaster or not. I'm using BertForSequenceClassification.
My training set size is 10000, which I split into:
8000 as training set
2000 as validation set
Learning rate: 2e-5
Epochs: 4
Batch size: 32
Even though I have good learning curves, the performance on the test set is bad (0.47 when submitting on Kaggle). I tried many changes to the learning rate and the number of epochs, but I still have the same problem.
How can I change the parameters of the BERT model to get better performance on the test set?
from transformers import BertForSequenceClassification, AdamW, BertConfig
from transformers import BertTokenizer

print("Loading BertTokenizer...")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    output_attentions=False,
    output_hidden_states=False,
)
model.cuda()

optimizer = AdamW(model.parameters(),
                  lr=1.5e-5,
                  eps=1e-8,
                  )

from transformers import get_linear_schedule_with_warmup
epochs = 4
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)

import random
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
##################################################################
#                            TRAINING                            #
##################################################################
training_stats = []
for epoch_i in range(0, epochs):
    print("****Epoch {:} /{:} ******".format(epoch_i + 1, epochs))
    print("Training...")
    t0 = time.time()
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_dataloader):
        if step % 100 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print("  Batch {:>5,}  of  {:>5,}.  Elapsed: {:}".format(step, len(train_dataloader), elapsed))
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        model.zero_grad()
        loss, logits = model(b_input_ids,
                             token_type_ids=None,
                             attention_mask=b_input_mask,
                             labels=b_labels
                             )
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
    avg_train_loss = total_loss / len(train_dataloader)
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took {:}".format(format_time(time.time() - t0)))
    training_time = format_time(time.time() - t0)
    ##################################################################
    #                           VALIDATION                           #
    ##################################################################
    print("")
    print("Running Validation...")
    t0 = time.time()
    model.eval()
    total_eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
            loss, logits = model(b_input_ids,
                                 token_type_ids=None,
                                 attention_mask=b_input_mask,
                                 labels=b_labels)
        total_eval_loss += loss.item()
        # Move to cpu
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        # Accuracy of this batch
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1
    print("  Accuracy: {0:.2f}".format(eval_accuracy / nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    print("  Average validation loss: {0:.2f}".format(avg_val_loss))
    avg_val_accuracy = eval_accuracy / len(validation_dataloader)
    validation_time = format_time(time.time() - t0)
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training completed!")
And I put the results in a CSV to submit, like this:
predictions = predictions[:, 1]
predictions[predictions > 0] = 0
predictions[predictions < 0] = 1
predictions = predictions.astype(np.int64)
sample_submission = pd.read_csv('sample_submission.csv', sep=',', index_col=0)
sample_submission["target"] = predictions
sample_submission.head()
to_submit = sample_submission.to_csv("submission.csv", index=True)
I used this code to train a model:
def train(model, epochs):
    for epoch in range(epochs):
        for idx, batch in enumerate(train_loader):
            x, bndbox = batch  # unpack batch
            pred_bndbox = model(x)  # forward pass
            # print('label:', bndbox, 'prediction:', pred_bndbox)
            loss = criterion(pred_bndbox, bndbox)  # compute loss for this batch
            optimiser.zero_grad()  # zero gradients of optimiser
            loss.backward()  # backward pass (find rate of change of loss with respect to model parameters)
            optimiser.step()  # take optimisation step
            print('Epoch:', epoch, 'Batch:', idx, 'Loss:', loss.item())
            writer.add_scalar('DETECTION Loss/Train', loss, epoch * len(train_loader) + idx)  # write loss to a graph

train(cnn, epochs)
torch.save(cnn.state_dict(), str(time.time()))  # save model

def visualise(model, n):
    model.eval()
    for idx, batch in enumerate(test_loader):
        x, y = batch
        pred_bndbox = model(x)
        S40dataset.show(batch, pred_bndbox=pred_bndbox)
        if idx == n:
            break
How do I evaluate the model prediction on a single image to check the operation of the neural network?
You can use:
model.eval()  # put the model in evaluation mode
with torch.no_grad():  # do not calculate gradients
    class_index = model(single_image).argmax()  # get the predicted class for the image
This code stores the network's prediction as the index of the class in the class_index variable. You have to load the image you want to examine into the single_image variable with the right shape.
Hope that helps.
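For example, here is a minimal sketch of preparing a single image for that check; the file path, transforms, and image size are illustrative assumptions, not from the original post, so match them to whatever preprocessing you used during training:

import torch
from PIL import Image
from torchvision import transforms

# hypothetical preprocessing; adjust to your training pipeline
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

img = Image.open("example.jpg").convert("RGB")          # hypothetical path
single_image = preprocess(img).unsqueeze(0)             # add a batch dimension: (1, C, H, W)
single_image = single_image.to(next(model.parameters()).device)  # move to the model's device

model.eval()
with torch.no_grad():
    class_index = model(single_image).argmax().item()
print("Predicted class index:", class_index)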