I have the following code that trains a model and stores logs in a results variable
import tqdm.notebook as tq
import sys

num_epochs = 10
results = {"train_loss": [], "val_loss": [], "train_acc": [], "val_acc": []}

for epoch in range(1, num_epochs+1):
    sys.stdout.write(f"---Epoch {epoch}/{num_epochs}: ")
    epoch_loss = {"train": [], "val": []}
    epoch_acc = {"train": [], "val": []}

    for phase in ['train', 'val']:
        if phase=="train":
            model.train(True)
        else:
            model.train(False)

        # most important thing I learned from this project was how to fix tqdm nastiness in colab
        for batch_idx, (x, y) in tq.tqdm(enumerate(dataloaders[phase]),
                                         total=len(dataloaders[phase]),
                                         leave=False):
            # put data to device and get output
            x, y = x.to(device), y.to(device)
            preds = model(x)

            # calc and log model loss
            batch_loss = criterion(preds, y)
            epoch_loss[phase].append(batch_loss.item())

            # calculate acc and extend to epoch_acc
            preds = torch.argmax(preds, dim=1)
            batch_acc = torch.sum(preds==y)/len(y)
            epoch_acc[phase].append(batch_acc)

            # zero the grad
            optimizer.zero_grad()

            # take a step if training mode is on
            if phase=="train":
                batch_loss.backward()
                optimizer.step()
                scheduler.step()

    # at the end of each epoch, calculate avg epoch train/val loss/accuracy
    train_loss = sum(epoch_loss["train"])/len(epoch_loss["train"])
    val_loss = sum(epoch_loss["val"])/len(epoch_loss["val"])
    train_acc = 100*sum(epoch_acc["train"])/len(epoch_acc["train"])
    val_acc = 100*sum(epoch_acc["val"])/len(epoch_acc["val"])

    # log losses and accs every epoch
    results['train_loss'].extend(epoch_loss['train'])
    results['train_acc'].extend(epoch_acc['train'])
    results['val_loss'].extend(epoch_loss['val'])
    results['val_acc'].extend(epoch_acc['val'])

    # and print it nicely
    sys.stdout.write("train_loss: {:.4f} train_acc: {:.2f}% ".format(train_loss, train_acc))
    sys.stdout.write("val_loss: {:.4f} val_acc: {:.2f}%\n".format(val_loss, val_acc))
I'm logging the average accuracy and average loss of every batch into separate training/validation loss/accuracy arrays. The problem is that I have more training batches than validation batches, so the arrays have different lengths, and when I try to graph my training and validation logs together I get something like this:
Is there a workaround for this?
You are making a few conceptual errors:
You are calculating the validation loss/accuracy in multiple batches, as opposed to over the entire validation set
You are calculating the validation accuracy for a static model after it has already trained on all the data, as opposed to periodically assessing the validation accuracy as it is training
You should average your batch training performance over each epoch, and once per epoch calculate the complete loss/acc statistics across the entire validation set. Then you will have n_epochs values for both training and validation and can plot them on the same axes.
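For example, a minimal sketch of the epoch-level logging, reusing the epoch_loss/epoch_acc dictionaries and the results keys from the question (the averaging itself is unchanged; only the logging moves to once per epoch):

# at the end of each epoch (inside the epoch loop, after both phases),
# append one averaged value per epoch instead of extending with every batch value
train_loss = sum(epoch_loss["train"]) / len(epoch_loss["train"])
val_loss = sum(epoch_loss["val"]) / len(epoch_loss["val"])
# batch_acc is a tensor in the question's code, so convert to plain floats here
train_acc = 100 * sum(a.item() for a in epoch_acc["train"]) / len(epoch_acc["train"])
val_acc = 100 * sum(a.item() for a in epoch_acc["val"]) / len(epoch_acc["val"])

results["train_loss"].append(train_loss)
results["val_loss"].append(val_loss)
results["train_acc"].append(train_acc)
results["val_acc"].append(val_acc)

# now every list in results has exactly num_epochs entries,
# so the training and validation curves can share the same x-axis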
I use Colab (with GPU) to run my code, but it takes a long time, approximately 12 hours per epoch. On the other hand, when I used Keras it took 1 hour per epoch.
I want to run the code in PyTorch to fine-tune it, so how can I make PyTorch faster?
# function to train the model
def train():
    model.train()
    total_loss, total_accuracy = 0, 0

    # empty list to save model predictions
    total_preds=[]
    Labels=[]

    # iterate over batches
    for step,batch in enumerate(train_dataloader):
        # progress update after every 10 batches.
        if step % 10 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))

        # push the batch to gpu
        #batch = [r for r in batch]
        sent_id, mask, labels = batch['input_ids'],batch['attention_mask'],batch['labels']

        # clear previously calculated gradients
        model.zero_grad()
        #print(7)

        # get model predictions for the current batch
        preds = model(sent_id, mask, labels)
        preds = torch.argmax(preds, dim=1)
        preds = preds.detach().numpy()
        labels = labels.detach().numpy()
        alpha = 0.25
        gamma = 2
        ce_loss = dice_loss(preds, labels)
        total_loss = total_loss + ce_loss

        # clip the gradients to 1.0. It helps in preventing the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # update parameters
        optimizer.step()

        total_preds.append(preds)
        total_accuracy += (preds == labels).sum()

    # compute the training loss of the epoch
    avg_loss = total_loss / len(traindataset)
    avg_accuracy = total_accuracy / len(traindataset)

    # returns the loss and predictions
    return avg_loss, total_preds, avg_accuracy
From the AWS SageMaker documentation, in order to track metrics in CloudWatch for custom ML algorithms (non-built-in), I read that I have to define my estimator as below.
But I am not sure how to alter my training script so that the metric definitions declared inside my estimator can pick up these values.
estimator = Estimator(image_name=ImageName,
                      role='SageMakerRole',
                      instance_count=1,
                      instance_type='ml.c4.xlarge',
                      k=10,
                      sagemaker_session=sagemaker_session,
                      metric_definitions=[
                          {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
                          {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
                      ])
In my training code, I have
for epoch in range(1, args.epochs + 1):
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_loader):
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model.zero_grad()
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]
        total_loss += loss.item()
        loss.backward()  # Computes the gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Clip for error prevention
        # Parameters are modified based on their gradients, the learning rate, etc.
        optimizer.step()  # Back Prop

    logger.info("Average training loss: %f\n", total_loss / len(train_loader))
Here, I want the train:error to pick up total_loss / len(train_loader) but I am not sure how to assign this.
You have to define a regex that captures the pattern your training script actually logs. Try this:
{'Name': 'Average training loss', 'Regex': 'Average training loss: ([0-9\.]+)'}
You can try the regex in an online regex tester and see what it captures.
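If you want to keep the train:error name you already declared, the same regex simply goes under that name. A sketch putting the two pieces together, reusing the Estimator arguments and the logger call from the question (the Regex must match the exact text your script prints):

# training script: this produces log lines like "Average training loss: 0.123456"
logger.info("Average training loss: %f\n", total_loss / len(train_loader))

# notebook / launcher script
estimator = Estimator(image_name=ImageName,
                      role='SageMakerRole',
                      instance_count=1,
                      instance_type='ml.c4.xlarge',
                      sagemaker_session=sagemaker_session,
                      metric_definitions=[
                          {'Name': 'train:error', 'Regex': 'Average training loss: ([0-9\.]+)'}
                      ])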
I'm working on an NLP task from a Kaggle competition; the goal is to predict whether a tweet describes a real disaster or not. I'm using BertForSequenceClassification.
My Training set size is 10000, I split it into:
8000 as Training set
2000 as Validation set
Learning rate : 2e-5
Epochs :4
Batch size :32
Even though I have good learning curves, the performance on the test set is bad (0.47 when submitting on Kaggle). I tried many changes to the learning rate and number of epochs, but I still have the same problem.
How can I change the parameters of the BERT model to get better performance on the test set?
from transformers import BertForSequenceClassification, AdamW, BertConfig
from transformers import BertTokenizer

print("Loading BertTokenizer...")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    output_attentions=False,
    output_hidden_states=False,
)
model.cuda()

optimizer = AdamW(model.parameters(),
                  lr=1.5e-5,
                  eps=1e-8,
                  )

from transformers import get_linear_schedule_with_warmup

epochs = 4
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)

import random
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
##################################################################
#                           TRAINING                             #
##################################################################

# loss_values=[]
training_stats = []

for epoch_i in range(0, epochs):
    print("****Epoch {:} / {:} ******".format(epoch_i + 1, epochs))
    print("Training...")
    t0 = time.time()
    total_loss = 0
    model.train()

    for step, batch in enumerate(train_dataloader):
        if step % 100 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print("  Batch {:>5,}  of  {:>5,}.  Elapsed: {:}".format(step, len(train_dataloader), elapsed))

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model.zero_grad()

        # outputs = model(b_input_ids,
        #                 token_type_ids=None,
        #                 # attention_masks=b_input_mask,
        #                 labels=b_labels
        #                 )
        loss, logits = model(b_input_ids,
                             token_type_ids=None,
                             attention_mask=b_input_mask,
                             labels=b_labels
                             )

        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    avg_train_loss = total_loss / len(train_dataloader)
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
    training_time = format_time(time.time() - t0)

    ##################################################################
    #                          VALIDATION                            #
    ##################################################################
    print("")
    print("Running Validation...")
    t0 = time.time()
    model.eval()
    total_eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():
            # outputs = model(b_input_ids,
            #                 token_type_ids=None,
            #                 # attention_masks=b_input_mask
            #                 )
            # logits = outputs[0]
            loss, logits = model(b_input_ids,
                                 token_type_ids=None,
                                 attention_mask=b_input_mask,
                                 labels=b_labels)

        total_eval_loss += loss.item()

        # Move to cpu
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Accuracy of this batch
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("  Accuracy: {0:.2f}".format(eval_accuracy / nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

    avg_val_loss = total_eval_loss / len(validation_dataloader)
    print("  Average validation loss: {0:.2f}".format(avg_val_loss))
    avg_val_accuracy = eval_accuracy / len(validation_dataloader)
    validation_time = format_time(time.time() - t0)

    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training completed!")
And I put the results into a CSV to submit, like this:
predictions=predictions[:,1]
predictions[predictions>0]=0
predictions[predictions<0]=1
predictions=predictions.astype(np.int64)
sample_submission=pd.read_csv('sample_submission.csv',sep=',',index_col=0)
sample_submission["target"]=predictions
sample_submission.head()
to_submit=sample_submission.to_csv("submission.csv",index=True)
I'm writing a custom training loop using the code provided in the TensorFlow DCGAN implementation guide. I wanted to add callbacks to the training loop. In Keras, I know we pass them as an argument to the fit method, but I can't find resources on how to use these callbacks in a custom training loop. I'm adding the code for the custom training loop from the TensorFlow documentation:
# Notice the use of `tf.function`
# This annotation causes the function to be "compiled".
@tf.function
def train_step(images):
    noise = tf.random.normal([BATCH_SIZE, noise_dim])

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        generated_images = generator(noise, training=True)

        real_output = discriminator(images, training=True)
        fake_output = discriminator(generated_images, training=True)

        gen_loss = generator_loss(fake_output)
        disc_loss = discriminator_loss(real_output, fake_output)

    gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
    gradients_of_discriminator = disc_tape.gradient(disc_loss, discriminator.trainable_variables)

    generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator, discriminator.trainable_variables))


def train(dataset, epochs):
    for epoch in range(epochs):
        start = time.time()

        for image_batch in dataset:
            train_step(image_batch)

        # Produce images for the GIF as we go
        display.clear_output(wait=True)
        generate_and_save_images(generator,
                                 epoch + 1,
                                 seed)

        # Save the model every 15 epochs
        if (epoch + 1) % 15 == 0:
            checkpoint.save(file_prefix=checkpoint_prefix)

        print('Time for epoch {} is {} sec'.format(epoch + 1, time.time() - start))

    # Generate after the final epoch
    display.clear_output(wait=True)
    generate_and_save_images(generator,
                             epochs,
                             seed)
I've had this problem myself: (1) I want to use a custom training loop; (2) I don't want to lose the bells and whistles Keras gives me in terms of callbacks; (3) I don't want to re-implement them all myself. Tensorflow has a design philosophy of allowing a developer to gradually opt-in to its more low-level APIs. As @HyeonPhilYoun notes in his comment below, the official documentation for tf.keras.callbacks.Callback gives an example of what we're looking for.
The following has worked for me, but can be improved by reverse engineering tf.keras.Model.
The trick is to use tf.keras.callbacks.CallbackList and then manually trigger its lifecycle events from within your custom training loop. This example uses tqdm to give attractive progress bars, but CallbackList has a progress_bar initialization argument that can let you use the defaults. training_model is a typical instance of tf.keras.Model.
from tqdm.notebook import tqdm, trange

# Populate with typical keras callbacks
_callbacks = []

callbacks = tf.keras.callbacks.CallbackList(
    _callbacks, add_history=True, model=training_model)

logs = {}
callbacks.on_train_begin(logs=logs)

# Presentation
epochs = trange(
    max_epochs,
    desc="Epoch",
    unit="Epoch",
    postfix="loss = {loss:.4f}, accuracy = {accuracy:.4f}")
epochs.set_postfix(loss=0, accuracy=0)

# Get a stable test set so epoch results are comparable
test_batches = batches(test_x, test_Y)

for epoch in epochs:
    callbacks.on_epoch_begin(epoch, logs=logs)

    # I like to formulate new batches each epoch
    # if there are data augmentation methods in play
    training_batches = batches(x, Y)

    # Presentation
    enumerated_batches = tqdm(
        enumerate(training_batches),
        desc="Batch",
        unit="batch",
        postfix="loss = {loss:.4f}, accuracy = {accuracy:.4f}",
        position=1,
        leave=False)

    for (batch, (x, y)) in enumerated_batches:
        training_model.reset_states()

        callbacks.on_batch_begin(batch, logs=logs)
        callbacks.on_train_batch_begin(batch, logs=logs)

        logs = training_model.train_on_batch(x=x, y=y, return_dict=True)

        callbacks.on_train_batch_end(batch, logs=logs)
        callbacks.on_batch_end(batch, logs=logs)

        # Presentation
        enumerated_batches.set_postfix(
            loss=float(logs["loss"]),
            accuracy=float(logs["accuracy"]))

    for (batch, (x, y)) in enumerate(test_batches):
        training_model.reset_states()

        callbacks.on_batch_begin(batch, logs=logs)
        callbacks.on_test_batch_begin(batch, logs=logs)

        logs = training_model.test_on_batch(x=x, y=y, return_dict=True)

        callbacks.on_test_batch_end(batch, logs=logs)
        callbacks.on_batch_end(batch, logs=logs)

    # Presentation
    epochs.set_postfix(
        loss=float(logs["loss"]),
        accuracy=float(logs["accuracy"]))

    callbacks.on_epoch_end(epoch, logs=logs)

    # NOTE: This is a decent place to check on your early stopping
    # callback.
    # Example: use training_model.stop_training to check for early stopping

callbacks.on_train_end(logs=logs)

# Fetch the history object we normally get from keras.fit
history_object = None
for cb in callbacks:
    if isinstance(cb, tf.keras.callbacks.History):
        history_object = cb
assert history_object is not None
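If the loop ran to completion, the recovered History object should behave like the one returned by keras fit; for example (assuming matplotlib.pyplot is imported as plt):

print(history_object.history.keys())      # e.g. dict_keys(['loss', 'accuracy', ...])
plt.plot(history_object.history["loss"])  # per-epoch values accumulated via on_epoch_end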
The simplest way would be to check if the loss has changed over your expected period and break or manipulate the training process if not.
Here is one way you could implement a custom early stopping callback:
def Callback_EarlyStopping(LossList, min_delta=0.1, patience=20):
    # No early stopping for 2*patience epochs
    if len(LossList)//patience < 2:
        return False
    # Mean loss for last patience epochs and second-last patience epochs
    mean_previous = np.mean(LossList[::-1][patience:2*patience])  # second-last
    mean_recent = np.mean(LossList[::-1][:patience])  # last
    # you can use relative or absolute change
    delta_abs = np.abs(mean_recent - mean_previous)  # abs change
    delta_abs = np.abs(delta_abs / mean_previous)  # relative change
    if delta_abs < min_delta:
        print("*CB_ES* Loss didn't change much from last %d epochs" % (patience))
        print("*CB_ES* Percent change in loss value:", delta_abs*1e2)
        return True
    else:
        return False
This Callback_EarlyStopping checks your metric/loss every epoch and returns True if the relative change is smaller than you expected, by comparing the moving average of the loss over the most recent patience epochs with the moving average over the patience epochs before that. You can then capture this True signal and break the training loop. To completely answer your question, within your sample training loop you can use it as:
gen_loss_seq = []
for epoch in range(epochs):
    # in your example, make sure your train_step returns gen_loss
    gen_loss = train_step(dataset)
    # ideally, you can have a validation_step and get gen_valid_loss
    gen_loss_seq.append(gen_loss)
    # check every 20 epochs and stop if gen_valid_loss doesn't change by 10%
    stopEarly = Callback_EarlyStopping(gen_loss_seq, min_delta=0.1, patience=20)
    if stopEarly:
        print("Callback_EarlyStopping signal received at epoch= %d/%d" % (epoch, epochs))
        print("Terminating training ")
        break
Of course, you can increase the complexity in numerous ways, for example, which loss or metrics you would like to track, your interest in the loss at a particular epoch or in a moving average of the loss, your interest in relative or absolute change in value, etc. You can refer to the Tensorflow 2.x implementation of tf.keras.callbacks.EarlyStopping here, which is generally used with the popular tf.keras.Model.fit method.
I think you would need to implement the functionality of the callback manually. It should not be too difficult. You could for instance have the "train_step" function return the losses and then implement functionality of callbacks such as early stopping in your "train" function. For callbacks such as learning rate schedule the function tf.keras.backend.set_value(generator_optimizer.lr,new_lr) would come in handy. Therefore the functionality of the callback would be implemented in your "train" function.
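For example, a minimal sketch of such a manual learning-rate schedule inside the train function from the question (the decay rule here is an arbitrary placeholder; generator_optimizer and train_step are the ones defined in the DCGAN example):

def train(dataset, epochs, initial_lr=1e-4):
    for epoch in range(epochs):
        # simple step decay: halve the generator learning rate every 10 epochs
        new_lr = initial_lr * (0.5 ** (epoch // 10))
        tf.keras.backend.set_value(generator_optimizer.lr, new_lr)

        for image_batch in dataset:
            train_step(image_batch)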
A custom training loop is just a normal Python loop, so you can use if statements to break the loop whenever some condition is met. For instance:
if len(loss_history) > patience:
    if loss_history.popleft()*delta < min(loss_history):
        print(f'\nEarly stopping. No improvement of more than {delta:.5%} in '
              f'validation loss in the last {patience} epochs.')
        break
If there is no improvement of delta% in the loss over the past patience epochs, the loop is broken. Here, I'm using a collections.deque, which works nicely as a rolling list that keeps in memory only the information from the last patience epochs.
Here's a full implementation, using the example from the TensorFlow documentation:
from collections import deque

patience = 3
delta = 0.001

loss_history = deque(maxlen=patience + 1)

for epoch in range(1, 25 + 1):
    train_loss = tf.metrics.Mean()
    train_acc = tf.metrics.CategoricalAccuracy()
    test_loss = tf.metrics.Mean()
    test_acc = tf.metrics.CategoricalAccuracy()

    for x, y in train:
        loss_value, grads = get_grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        train_loss.update_state(loss_value)
        train_acc.update_state(y, model(x, training=True))

    for x, y in test:
        loss_value, _ = get_grad(model, x, y)
        test_loss.update_state(loss_value)
        test_acc.update_state(y, model(x, training=False))

    print(verbose.format(epoch,
                         train_loss.result(),
                         test_loss.result(),
                         train_acc.result(),
                         test_acc.result()))

    loss_history.append(test_loss.result())

    if len(loss_history) > patience:
        if loss_history.popleft()*delta < min(loss_history):
            print(f'\nEarly stopping. No improvement of more than {delta:.5%} in '
                  f'validation loss in the last {patience} epochs.')
            break
Epoch 1 Loss: 0.191 TLoss: 0.282 Acc: 68.920% TAcc: 89.200%
Epoch 2 Loss: 0.157 TLoss: 0.297 Acc: 70.880% TAcc: 90.000%
Epoch 3 Loss: 0.133 TLoss: 0.318 Acc: 71.560% TAcc: 90.800%
Epoch 4 Loss: 0.117 TLoss: 0.299 Acc: 71.960% TAcc: 90.800%
Early stopping. No improvement of more than 0.10000% in validation loss in the last 3 epochs.
aapa3e8's answer is correct, but here is an implementation of Callback_EarlyStopping that is more similar to tf.keras.callbacks.EarlyStopping:
def Callback_EarlyStopping(MetricList, min_delta=0.1, patience=20, mode='min'):
    # No early stopping for the first patience epochs
    if len(MetricList) <= patience:
        return False

    min_delta = abs(min_delta)
    if mode == 'min':
        min_delta *= -1
    else:
        min_delta *= 1

    # last patience epochs
    last_patience_epochs = [x + min_delta for x in MetricList[::-1][1:patience + 1]]
    current_metric = MetricList[::-1][0]

    if mode == 'min':
        if current_metric >= max(last_patience_epochs):
            print(f'Metric did not decrease for the last {patience} epochs.')
            return True
        else:
            return False
    else:
        if current_metric <= min(last_patience_epochs):
            print(f'Metric did not increase for the last {patience} epochs.')
            return True
        else:
            return False
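Usage is the same as in the earlier answer; the only addition is the mode argument, e.g. for a metric that should increase (a sketch, assuming you collect one validation accuracy value per epoch in val_acc_seq):

stopEarly = Callback_EarlyStopping(val_acc_seq, min_delta=0.01, patience=20, mode='max')
if stopEarly:
    print("Callback_EarlyStopping signal received at epoch = %d/%d" % (epoch, epochs))
    break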
I tested @Rob Hall's method with tensorboard callbacks and it did indeed work. So in my case it looked like this:

tensorboard_callback = keras.callbacks.TensorBoard(
    log_dir='./callbacks/tensorboard',
    histogram_freq=1)

_callbacks = [tensorboard_callback]
callbacks = keras.callbacks.CallbackList(
    _callbacks, add_history=True, model=encoder)

logs_ae = {}
callbacks.on_train_begin(logs=logs_ae)
...
...
I am trying to use TensorFlow to train a neural network (LeNet) on traffic sign images. I want to check the effect of a preprocessing technique on the performance of the network, so I preprocessed the images and stored the results (training images, validation images, test images, final test images) as a tuple in a dict.
I then tried to iterate over this dict and use TensorFlow's training and validation operations as follows:
import tensorflow as tf
from sklearn.utils import shuffle

output_data = []

EPOCHS = 5
BATCH_SIZE = 128
rate = 0.0005

for key in finalInputdata.keys():
    for procTypes in range(0, (len(finalInputdata[key]))):
        if np.shape(finalInputdata[key][procTypes][0]) != ():
            X_train = finalInputdata[key][procTypes][0]
            X_valid = finalInputdata[key][procTypes][1]
            X_test = finalInputdata[key][procTypes][2]
            X_finaltest = finalInputdata[key][procTypes][3]

            x = tf.placeholder(tf.float32, (None, 32, 32, np.shape(X_train)[-1]))
            y = tf.placeholder(tf.int32, (None))
            one_hot_y = tf.one_hot(y, 43)

            # Tensor Operations
            logits = LeNet(x, np.shape(X_train)[-1])
            cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits, one_hot_y)
            softmax_probability = tf.nn.softmax(logits)
            loss_operation = tf.reduce_mean(cross_entropy)
            optimizer = tf.train.AdamOptimizer(learning_rate=rate)
            training_operation = optimizer.minimize(loss_operation)
            correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(one_hot_y, 1))
            accuracy_operation = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

            # Pipeline for training and evaluation
            sess = tf.InteractiveSession()
            sess.run(tf.global_variables_initializer())
            num_examples = len(X_train)

            print("Training on %s images processed as %s" % (key, dict_fornames['proctypes'][procTypes]))
            print()

            for i in range(EPOCHS):
                X_train, y_train = shuffle(X_train, y_train)
                for offset in range(0, num_examples, BATCH_SIZE):
                    end = offset + BATCH_SIZE
                    batch_x, batch_y = X_train[offset:end], y_train[offset:end]
                    sess.run(training_operation, feed_dict={x: batch_x, y: batch_y})

                training_accuracy = evaluate(X_train, y_train)
                validation_accuracy = evaluate(X_valid, y_valid)
                testing_accuracy = evaluate(X_test, y_test)
                final_accuracy = evaluate(X_finaltest, y_finalTest)

                print("EPOCH {} ...".format(i+1))
                print("Training Accuracy = {:.3f}".format(training_accuracy))
                print("Validation Accuracy = {:.3f}".format(validation_accuracy))
                print()

                output_data.append({'EPOCHS': EPOCHS, 'LearningRate': rate, 'ImageType': 'RGB',
                                    'PreprocType': dict_fornames['proctypes'][0],
                                    'TrainingAccuracy': training_accuracy, 'ValidationAccuracy': validation_accuracy,
                                    'TestingAccuracy': testing_accuracy})

            sess.close()
The evaluate function is as follows
def evaluate(X_data, y_data):
    num_examples = len(X_data)
    total_accuracy = 0
    sess = tf.get_default_session()
    for offset in range(0, num_examples, BATCH_SIZE):
        batch_x, batch_y = X_data[offset:offset+BATCH_SIZE], y_data[offset:offset+BATCH_SIZE]
        accuracy = sess.run(accuracy_operation, feed_dict={x: batch_x, y: batch_y})
        total_accuracy += (accuracy * len(batch_x))
    return total_accuracy / num_examples
Once I execute the program, it works well for the first iteration of the dataset, but from the second iteration onward the network doesn't train, and this continues for all the other iterations.
Training on RGB images processed as Original
EPOCH 1 ...
Training Accuracy = 0.525
Validation Accuracy = 0.474
EPOCH 2 ...
Training Accuracy = 0.763
Validation Accuracy = 0.682
EPOCH 3 ...
Training Accuracy = 0.844
Validation Accuracy = 0.723
EPOCH 4 ...
Training Accuracy = 0.888
Validation Accuracy = 0.779
EPOCH 5 ...
Training Accuracy = 0.913
Validation Accuracy = 0.795
Training on RGB images processed as Mean Subtracted Data
EPOCH 1 ...
Training Accuracy = 0.056
Validation Accuracy = 0.057
EPOCH 2 ...
Training Accuracy = 0.057
Validation Accuracy = 0.057
EPOCH 3 ...
Training Accuracy = 0.057
Validation Accuracy = 0.056
EPOCH 4 ...
Training Accuracy = 0.058
Validation Accuracy = 0.056
EPOCH 5 ...
Training Accuracy = 0.058
Validation Accuracy = 0.058
Training on RGB images processed as Normalized Data
EPOCH 1 ...
Training Accuracy = 0.058
Validation Accuracy = 0.054
EPOCH 2 ...
Training Accuracy = 0.058
Validation Accuracy = 0.054
EPOCH 3 ...
Training Accuracy = 0.058
Validation Accuracy = 0.054
EPOCH 4 ...
Training Accuracy = 0.058
Validation Accuracy = 0.054
EPOCH 5 ...
Training Accuracy = 0.058
Validation Accuracy = 0.054
However, if I restart the kernel and use any datatype (any iteration), it works. I figured out that I must clear the graph or run multiple sessions for multiple datatypes, but I am not yet clear on how to do that. I tried using tf.reset_default_graph(), but it seems to have no effect. Can somebody point me in the right direction?
Thanks
You might want to try data that is normalized to zero mean and unit variance before feeding it to the network, e.g. by scaling images to the -1..1 range; that said, the 0..1 range mostly sounds sane as well. Depending on the activations used in the network, the value range can make all the difference: ReLUs, for example, die out for inputs below zero; sigmoids start to saturate when values are below -4 or above +4; and tanh activations miss out on half of their value range if no value is ever below 0. If the value range is too big, gradients may explode as well, preventing training altogether. In this paper, the authors seem to subtract the (batch) image mean instead of the value-range mean.
You can also try a smaller learning rate (although personally, I usually start experimenting around 0.0001 for Adam).
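For example, a minimal sketch of the two normalization options, assuming X_train holds uint8 images in the 0..255 range (apply the same statistics to the validation and test sets):

X_train = X_train.astype(np.float32)

# option 1: scale to the -1..1 range
X_train_scaled = X_train / 127.5 - 1.0

# option 2: zero mean, unit variance (statistics computed on the training set only)
mean, std = X_train.mean(), X_train.std()
X_train_standardized = (X_train - mean) / std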
As for the multiple-sessions part of your question: the way it is currently implemented, your code is basically cluttering the default graph. By calling
for key in finalInputdata.keys():
    for procTypes in range(0, (len(finalInputdata[key]))):
        if np.shape(finalInputdata[key][procTypes][0]) != ():

            # ...

            x = tf.placeholder(tf.float32, (None, 32, 32, np.shape(X_train)[-1]))
            y = tf.placeholder(tf.int32, (None))
            one_hot_y = tf.one_hot(y, 43)

            # Tensor Operations
            logits = LeNet(x, np.shape(X_train)[-1])

            # ... etc ...
you are creating len(finalInputdata) * N different instances of LeNet, all within the default graph. That might be an issue when variables are internally reused in the network.
If you do want to reset your default graph in order to try different hyperparameters, try
for key in finalInputdata.keys():
    for procTypes in range(0, (len(finalInputdata[key]))):
        tf.reset_default_graph()
        # define the graph

        sess = tf.InteractiveSession()
        # train
but it is probably better to explicitly create Graphs and Sessions like so:
for key in finalInputdata.keys():
    for procTypes in range(0, (len(finalInputdata[key]))):
        with tf.Graph().as_default() as graph:
            # define the graph

            with tf.Session(graph=graph) as sess:
                # train
Instead of calling sess = tf.get_default_session() you would then directly use the sess reference.
I also found that Jupyter kernels and GPU enabled TensorFlow don't play together that well when iterating on networks, sometimes running into out of memory errors or downright crashing the browser tab.
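One mitigation for the out-of-memory part, as a sketch using the TF1-style API from the code above: let the session allocate GPU memory on demand instead of reserving it all up front.

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(graph=graph, config=config) as sess:
    # train
    pass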