It's a simple model architecture based on this tutorial. The dataset would look like this, although in 10 dimensions:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers, optimizers
from sklearn.datasets import make_blobs
def pre_processing(inputs, targets):
    inputs = tf.cast(inputs, tf.float32)
    targets = tf.cast(targets, tf.int64)
    return inputs, targets

def get_data():
    inputs, targets = make_blobs(n_samples=1000, n_features=10, centers=7, cluster_std=1)
    data = tf.data.Dataset.from_tensor_slices((inputs, targets))
    data = data.map(pre_processing)
    data = data.take(count=1000).shuffle(buffer_size=1000).batch(batch_size=256)
    return data
model = Sequential([
    layers.Dense(8, input_shape=(10,), activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(7)])  # output layer: one logit per blob centre

def compute_loss(logits, labels):
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=labels))
def compute_accuracy(logits, labels):
    predictions = tf.argmax(logits, axis=1)
    return tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))

def train_step(model, optim, x, y):
    with tf.GradientTape() as tape:
        logits = model(x)
        loss = compute_loss(logits, y)
    grads = tape.gradient(loss, model.trainable_variables)
    optim.apply_gradients(zip(grads, model.trainable_variables))
    accuracy = compute_accuracy(logits, y)
    return loss, accuracy

def train(epochs, model, optim):
    train_ds = get_data()
    loss = 0.
    acc = 0.
    for step, (x, y) in enumerate(train_ds):
        loss, acc = train_step(model, optim, x, y)
        if step % 500 == 0:
            print(f'Epoch {epochs} loss {loss.numpy()} acc {acc.numpy()}')
    return loss, acc

optim = optimizers.Adam(learning_rate=1e-6)
for epoch in range(100):
    loss, accuracy = train(epoch, model, optim)
Epoch 85 loss 2.530677080154419 acc 0.140625
Epoch 86 loss 3.3184046745300293 acc 0.0
Epoch 87 loss 3.138179063796997 acc 0.30078125
Epoch 88 loss 3.7781732082366943 acc 0.0
Epoch 89 loss 3.4101686477661133 acc 0.14453125
Epoch 90 loss 2.2888522148132324 acc 0.13671875
Epoch 91 loss 5.993691444396973 acc 0.16015625
What have I done wrong?
There are two problems in your code:
The first one is that you are generating a new training dataset in each epoch (see first line of train function, i.e. get_data function is called in each epoch). Since you are using sklearn.datasets.make_blobs function to generate data clusters, there is no guarantee that the generated data clusters between different calls follow the same distribution and/or label mapping. Therefore, the best thing the model could do in each epoch on a completely different dataset is just a random guess (hence, the average 1/7 ~= 0.14 accuracy you see in the results). To resolve this problem, take the data generation out of train function (i.e. generate the data at global level once by calling get_data function), and then pass the generated data to train function as an argument in each epoch.
The second problem is that you are using a very low learning rate, i.e. 1e-6, for the optimizer; therefore, the model is stuck and effectively not training at all. Instead, use the default learning rate for Adam optimizer, i.e. 1e-3, and change it only as needed (e.g. based on the results of experiments you perform).
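To see the first problem in isolation, here is a minimal sketch that only inspects what make_blobs generates. The helper name blob_centers is made up for this illustration, and return_centers assumes scikit-learn 0.23 or newer.

```python
import numpy as np
from sklearn.datasets import make_blobs

def blob_centers(random_state=None):
    """Return the cluster centres that make_blobs draws for one call."""
    _, _, centers = make_blobs(n_samples=1000, n_features=10, centers=7,
                               cluster_std=1, random_state=random_state,
                               return_centers=True)
    return centers

# Two unseeded calls draw different centres, so label 3 in one call
# has nothing to do with label 3 in the next call.
first, second = blob_centers(), blob_centers()
print("unseeded calls share centres:", np.allclose(first, second))

# Seeding (or simply calling make_blobs once and reusing the result)
# keeps the clusters and their labels fixed.
seeded_a, seeded_b = blob_centers(random_state=0), blob_centers(random_state=0)
print("seeded calls share centres:", np.allclose(seeded_a, seeded_b))
```

Calling get_data once outside train and reusing the returned dataset has the same effect as seeding, and is usually the simpler fix.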
I tried to write code for brand detection, but while training the model the loss stays high in every epoch. I tried normalizing the dataset, but nothing changed. Am I doing something wrong?
My code is as below:
train_link = "C:/Users\proin\OneDrive\Masaüstü\Data_2/train"
test_link = "C:/Users\proin\OneDrive\Masaüstü\Data_2/test"
val_link = "C:/Users\proin\OneDrive\Masaüstü\Data_2/validation"
transforming_train = transforms.Compose([transforms.Resize((300, 300)), transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
transforming_test = transforms.Compose([transforms.Resize((300, 300)), transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
I also tried plugging the dataset's mean and std values into the Normalize transform, but nothing changed.
Let me continue:
trainset = torchvision.datasets.ImageFolder(train_link, transform = transforming_train)
testset = torchvision.datasets.ImageFolder(test_link, transform = transforming_test)
valset = torchvision.datasets.ImageFolder(val_link, transform = transforming_test)
batch_size = 1
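trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size)
valloader = torch.utils.data.DataLoader(valset, batch_size=batch_size)
testloader = torch.utils.data.DataLoader(testset, batch_size=1)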
Because ImageFolder has no normalize argument, I couldn't pass it in here.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
And I'm using the CUDA as well.
def train_log_loss_network(model, train_loader, val_loader=None, epochs=50, device="cpu"):
    loss_fn = nn.CrossEntropyLoss() #CrossEntropy is another name for the Logistic Regression loss function. Like before, we phrase learning as minimize a loss function. This is the loss we are going to minimize!

    #We need an optimizer! Adam is a good default one that works "well enough" for most problems
    #To tell Adam what to optimize, we give it the model's parameters - because that's what the learning will adjust
    # optimizer = torch.optim.Adam(model.parameters())
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    #Devices can be specified by a string, or a special torch object
    #If it is a string, let's get the correct device
    if device.__class__ == str:
        device = torch.device(device)

    model.to(device) #Place the model on the correct compute resource
    for epoch in range(epochs):
        model = model.train() #Put our model in training mode
        running_loss = 0.0

        for inputs, labels in train_loader: #tqdm(train_loader):
            #Move the batch to the device we are using.
            inputs, labels = inputs.cuda(), labels.cuda()
            inputs = inputs.to(device)
            labels = labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            y_pred = model(inputs)

            # Compute loss.
            loss = loss_fn(y_pred, labels.long())

            # Backward pass: compute gradient of the loss with respect to model parameters
            loss.backward()

            # Calling the step function on an Optimizer makes an update to its parameters
            optimizer.step()

            running_loss += loss.item() * inputs.size(0)

        if val_loader is None:
            print("Loss after epoch {} is {}".format(epoch + 1, running_loss))
        else: #Let's find out validation performance as we go!
            model = model.eval() #Set the model to "evaluation" mode, b/c we don't want to make any updates!

            predictions = []
            targets = []

            for inputs, labels in val_loader:
                #Move the batch to the device we are using.
                inputs = inputs.to(device)
                labels = labels.to(device)

                y_pred = model(inputs)

                # Get predicted classes
                # y_pred will have a shape (Batch_size, C)
                #We are asking for which class had the largest response along dimension #1, the C dimension
                for pred in torch.argmax(y_pred, dim=1).cpu().numpy():
                    predictions.append(pred)
                for l in labels.cpu().numpy():
                    targets.append(l)

            #print("Network Accuracy: ", )
            print("Loss after epoch {} is {}. Accuracy: {}".format(epoch + 1, running_loss, accuracy_score(predictions, targets)))
And lastly,
train_log_loss_network(model, trainloader, val_loader=valloader, epochs=10, device=device)
Even when I tried a different number of epochs and different conv layers, the results were similar.
Loss after epoch 1 is 72568.83042097092. Accuracy: 0.0036231884057971015
Loss after epoch 2 is 72568.78793954849. Accuracy: 0.0036231884057971015
Loss after epoch 3 is 72568.74511051178. Accuracy: 0.0036231884057971015
Loss after epoch 4 is 72568.7018828392. Accuracy: 0.014492753623188406
Loss after epoch 5 is 72568.65722990036. Accuracy: 0.014492753623188406
I want to use tensorflow's custom training loop for my model but, down to memory constraints, I can only pass a small number of samples (mini-batches) through in one go. How do I use an approach to train on these mini-batches and sensibly aggregate the gradients for the whole batch on one machine (GPU/CPU)? See below example with code from here - note this example doesn't hit memory issues based on the batch size but does give the idea of what I'm trying to do:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
#simple MNIST model
inputs = keras.Input(shape=(784,), name="digits")
x1 = layers.Dense(64, activation="relu")(inputs)
x2 = layers.Dense(64, activation="relu")(x1)
outputs = layers.Dense(10, name="predictions")(x2)
model = keras.Model(inputs=inputs, outputs=outputs)
# Instantiate an optimizer.
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
# Instantiate a loss function.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Prepare the training dataset.
batch_size = 64
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = np.reshape(x_train, (-1, 784))
x_test = np.reshape(x_test, (-1, 784))
# Reserve 10,000 samples for validation.
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]
# Prepare the training dataset.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)
# Prepare the validation dataset.
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
val_dataset = val_dataset.batch(batch_size)
If training on the full 64 sample batch size in one go could fit in memory we could simply use:
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value

import time

epochs = 10
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))
    start_time = time.time()

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        loss_value = train_step(x_batch_train, y_batch_train)

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %d samples" % ((step + 1) * batch_size))
However, how do I update train_step to enable it to take four mini-batch runs of size 16 (for example) to make up the full batch size of 64 to deal with my more memory intensive data and then aggregate the gradients to update the model?
I tried just writing a loop within the with tf.GradientTape() as tape: call and just stacking the loss results but I don't think this is the correct approach.
I also thought about using tf.distribute.Strategy but my understanding is this is only for using when training across machines or GPUs so I don't see how I could use it here?
To summarise, what I want to do is agnostic to the dataset and model architecture. I guess I am looking for a gradient AllReduce approach which, instead of splitting the mini-batches across different machines, just runs them iteratively. So it would need to:
Compute the gradient using a mini-batch.
Compute the mean of the gradients from all mini-batches, using an AllReduce collective-style approach.
Update the model with the averaged gradient.
I assume this approach of applying the mean of the gradients would be far less memory intensive than applying all the gradients as discussed here
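The three steps listed above are commonly called gradient accumulation, and they can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the function names accumulate_gradients and train_step, the small stand-in model, and the random data are all made up for illustration. It assumes the batch size is divisible by the number of mini-batches and that the loss is a per-batch mean.

```python
import tensorflow as tf

def accumulate_gradients(model, loss_fn, x, y, num_micro_batches=4):
    """Average the gradients of equally sized mini-batches, one at a time."""
    x_parts = tf.split(x, num_micro_batches, axis=0)
    y_parts = tf.split(y, num_micro_batches, axis=0)
    total_loss = 0.0
    accumulated = None
    for x_mb, y_mb in zip(x_parts, y_parts):
        # Only one mini-batch's activations are recorded on the tape at a time.
        with tf.GradientTape() as tape:
            logits = model(x_mb, training=True)
            # Scale so that the summed gradients equal the full-batch mean.
            loss = loss_fn(y_mb, logits) / num_micro_batches
        grads = tape.gradient(loss, model.trainable_variables)
        if accumulated is None:
            accumulated = grads
        else:
            accumulated = [a + g for a, g in zip(accumulated, grads)]
        total_loss += loss
    return total_loss, accumulated

def train_step(model, optimizer, loss_fn, x, y, num_micro_batches=4):
    """One optimizer update from the averaged mini-batch gradients."""
    loss, grads = accumulate_gradients(model, loss_fn, x, y, num_micro_batches)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Small stand-in model and random data, for illustration only.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)

x = tf.random.normal((64, 784))
y = tf.random.uniform((64,), maxval=10, dtype=tf.int64)
loss = train_step(model, optimizer, loss_fn, x, y, num_micro_batches=4)
print("loss:", float(loss))
```

Because each mini-batch loss is divided by the number of mini-batches before differentiation, the accumulated gradient matches the gradient of the mean loss over the full batch of 64. Layers that compute batch statistics, such as batch normalization, would still see only 16 samples at a time.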
I want to use the following code, written for a traditional image classification problem, for my regression problem. The code can be found here:
GeeksforGeeks-Training Neural Networks with Validation using Pytorch
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(28*28, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(1,-1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = Network()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)

epochs = 5
for e in range(epochs):
    train_loss = 0.0
    model.train()     # Optional when not using Model Specific layer
    for data, labels in trainloader:
        if torch.cuda.is_available():
            data, labels = data.cuda(), labels.cuda()

        optimizer.zero_grad()
        target = model(data)
        loss = criterion(target,labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    valid_loss = 0.0
    model.eval()     # Optional when not using Model Specific layer
    for data, labels in validloader:
        if torch.cuda.is_available():
            data, labels = data.cuda(), labels.cuda()

        target = model(data)
        loss = criterion(target,labels)
        valid_loss = loss.item() * data.size(0)

    print(f'Epoch {e+1} \t\t Training Loss: {train_loss / len(trainloader)} \t\t Validation Loss: {valid_loss / len(validloader)}')
I can understand why the training loss is summed up and then divided by the length of the training data in this example, but I can't get why the validation loss is also not summed up and divided by the length. If I understand correctly, the validation loss will be calculated here by using the validation loss of the last batch and then it is multiplied by the length of the batch size.
Is this calculation of the validation loss correct? Can I use the code for my regression problem, assuming I use regression-specific metrics (e.g. MSE instead of CrossEntropyLoss etc.)?
Yes, you can use the code for your regression task. The targets of the code example are one-hot vectors or in the MNIST example the numbers 0 to 9, which symbolize the classes. You would make a scalar out of that in the regression case. The loss function, which is the cross-entropy in the example, can be replaced by the MSE in your case.
I assume that the validation loss in this example is only estimated by extrapolating from a single data point to all other data points.
Since data.size represents the batch size, even averaging would only come out with the loss of that single data point.
However, on the web page, the validation loss is calculated over all data points in the validation set, as it should be done.
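To make the difference concrete, here is a small framework-free sketch of the aggregation. The function name mean_validation_loss and the numbers are made up for illustration. In the quoted loop, each batch loss corresponds to loss.item() and each batch size to data.size(0).

```python
def mean_validation_loss(batch_losses, batch_sizes):
    """Sample-weighted mean of per-batch mean losses."""
    total = 0.0
    seen = 0
    for loss, size in zip(batch_losses, batch_sizes):
        total += loss * size  # accumulate with +=, do not overwrite with =
        seen += size
    return total / seen

# Hypothetical per-batch mean losses; the last batch is smaller.
losses = [0.9, 0.5, 0.1]
sizes = [4, 4, 2]

# Loss averaged over all validation samples.
print(mean_validation_loss(losses, sizes))

# What the quoted loop reports instead: only the last batch,
# divided by the number of batches.
print(losses[-1] * sizes[-1] / len(losses))
```

In the PyTorch loop this amounts to writing valid_loss += loss.item() * data.size(0) and dividing by the number of validation samples rather than the number of batches.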
I am trying to train a model using the keras method. This method returns a history object which contains loss values for each epoch - however I would like to have loss values for each individual batch.
Looking online I have found suggestions to use a custom callback class with an on_batch_end(self, logs={}) method. The problem is that this method only gets passed aggregated statistics that get reset each epoch. I would like to have individual statistics for each batch.
You could do that easily with a custom training loop, where you can just append a list with the loss value of every batch.
Here's how to do all of it:
import tensorflow as tf
import tensorflow_datasets as tfds
ds = tfds.load('iris', split='train', as_supervised=True)
train = ds.take(125).shuffle(16).batch(4)
test = ds.skip(125).take(25).shuffle(16).batch(4)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])

loss_object = tf.losses.SparseCategoricalCrossentropy(from_logits=False)

def compute_loss(model, x, y, training):
    out = model(x, training=training)
    loss = loss_object(y_true=y, y_pred=out)
    return loss

def get_grad(model, x, y):
    with tf.GradientTape() as tape:
        loss = compute_loss(model, x, y, training=True)
    return loss, tape.gradient(loss, model.trainable_variables)

optimizer = tf.optimizers.Adam()

verbose = "Epoch {:2d} Loss: {:.3f} TLoss: {:.3f} Acc: {:.2%} TAcc: {:.2%}"

train_loss_per_train_batch = list()

for epoch in range(1, 25 + 1):
    train_loss = tf.metrics.Mean()
    train_acc = tf.metrics.SparseCategoricalAccuracy()
    test_loss = tf.metrics.Mean()
    test_acc = tf.metrics.SparseCategoricalAccuracy()

    for x, y in train:
        loss_value, grads = get_grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        train_loss.update_state(loss_value)
        train_acc.update_state(y, model(x, training=True))
        # keep the loss of every individual batch
        train_loss_per_train_batch.append(loss_value.numpy())

    for x, y in test:
        loss_value, _ = get_grad(model, x, y)
        test_loss.update_state(loss_value)
        test_acc.update_state(y, model(x, training=False))

    print(verbose.format(epoch,
                         float(train_loss.result()),
                         float(test_loss.result()),
                         float(train_acc.result()),
                         float(test_acc.result())))
The loss for the current batch can be calculated from the provided average loss as follows:
from tensorflow.keras.callbacks import Callback
class CustomCallback(Callback):
    ''' This callback converts the average loss (default behavior in TF>=2.2)
        into the loss for only the current batch.
    '''
    def on_epoch_begin(self, epoch, logs={}):
        self.previous_loss_sum = 0

    def on_train_batch_end(self, batch, logs={}):
        # calculate loss of current batch:
        current_loss_sum = (batch + 1) * logs['loss']
        current_loss = current_loss_sum - self.previous_loss_sum
        self.previous_loss_sum = current_loss_sum

        # use current_loss:
        # ...
This code can be added to any custom callback that needs the loss for the current batch instead of the average loss.
Also, if you are using Tensorflow 1 or TensorFlow 2 version <= 2.1, then do not include this code in your callback, as in those versions the current loss is already provided, instead of the average loss.
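The arithmetic used by the callback can be checked without Keras. The sketch below is only an illustration with made-up numbers and a hypothetical helper name, batch_losses_from_running_mean. It assumes the reported value is an unweighted running mean over batches, which may not hold exactly if the last batch of an epoch is smaller.

```python
def batch_losses_from_running_mean(running_means):
    """Undo a running average: running_means[i] is the mean of the first i + 1 batch losses."""
    losses = []
    previous_sum = 0.0
    for batch, mean in enumerate(running_means):
        current_sum = (batch + 1) * mean  # sum of all batch losses so far
        losses.append(current_sum - previous_sum)
        previous_sum = current_sum
    return losses

# Made-up per-batch losses and the running mean a progress bar would show.
true_losses = [2.0, 1.0, 0.5, 1.5]
running = []
total = 0.0
for i, loss in enumerate(true_losses, start=1):
    total += loss
    running.append(total / i)

recovered = batch_losses_from_running_mean(running)
print(recovered)
```

Up to floating-point rounding, recovered contains the same values as true_losses.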
I want to customize the fit function of the model in order to apply the gradient descent on the weights only if the model improved its predictions on the validation data. The reason for this is that I want to prevent overfitting.
According to this guide it should be possible to customize the fit function of the model. However, the following code runs into errors:
class CustomModel(tf.keras.Model):
    def train_step(self, data):
        x, y = data

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)

        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        ### check and apply gradient
        Y_pred_val = self.predict(X_val)  # this does not work
        acc_val = calculate_accuracy(Y_val, Y_pred_val)
        if acc_val > last_acc_val:
            self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        self.compiled_metrics.update_state(y, y_pred)
        return_obj = {m.name: m.result() for m in self.metrics}
        return_obj["acc_val"] = acc_val
        return return_obj
How could it be possible to evaluate the model inside the fit function?
You don't have to subclass fit() for this. You can just make a custom training loop. Look how I did that:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow.keras import Model
import tensorflow as tf
from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D, Flatten, Concatenate
import tensorflow_datasets as tfds
from tensorflow.keras.regularizers import l1, l2, l1_l2
from collections import deque
dataset, info = tfds.load('mnist', with_info=True, split='train')

TAKE = 1_000

data = dataset.map(lambda x: (tf.cast(x['image'],
                   tf.float32), x['label'])).shuffle(TAKE).take(TAKE)

len_train = int(8e-1*TAKE)

train = data.take(len_train).batch(8)
test = data.skip(len_train).take(info.splits['train'].num_examples - len_train).batch(8)
class CNN(Model):
    def __init__(self):
        super(CNN, self).__init__()
        self.layer1 = Dense(32, activation=tf.nn.relu)
        self.layer2 = Conv2D(filters=16,
                             kernel_size=(3, 3),
                             strides=(1, 1))
        self.layer3 = MaxPooling2D(pool_size=(2, 2))
        self.layer4 = Conv2D(filters=32,
                             kernel_size=(3, 3),
                             strides=(1, 1))
        self.layer5 = MaxPooling2D(pool_size=(2, 2))
        self.layer6 = Flatten()
        self.layer7 = Dense(units=64)
        self.layer8 = Dense(units=64,
                            kernel_regularizer=l1_l2(l1=1e-2, l2=1e-2))
        self.layer9 = Concatenate()
        self.layer10 = Dense(units=info.features['label'].num_classes)

    def call(self, inputs, training=None, **kwargs):
        b = self.layer1(inputs)
        a = self.layer2(inputs)
        a = self.layer3(a)
        a = self.layer4(a)
        a = self.layer5(a)
        a = self.layer6(a)
        a = self.layer8(a)
        b = self.layer7(b)
        b = self.layer6(b)
        x = self.layer9([a, b])
        x = self.layer10(x)
        return x
cnn = CNN()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
train_loss = tf.keras.metrics.Mean()
test_loss = tf.keras.metrics.Mean()
train_acc = tf.keras.metrics.SparseCategoricalAccuracy()
test_acc = tf.keras.metrics.SparseCategoricalAccuracy()
optimizer = tf.keras.optimizers.Nadam()
template = 'Epoch {:3} Train Loss {:7.4f} Test Loss {:7.4f} ' \
           'Train Acc {:6.2%} Test Acc {:6.2%} '

epochs = 5
early_stop = epochs//50

loss_hist = deque()
acc_hist = deque(maxlen=1)
acc_hist.append(0.)  # assumed starting value so that acc_hist[-1] exists

for epoch in range(1, epochs + 1):
    for images, labels in train:
        with tf.GradientTape() as tape:
            logits = cnn(images, training=True)
            loss = loss_object(labels, logits)
        train_loss(loss)
        train_acc(labels, logits)

        current_acc = tf.metrics.SparseCategoricalAccuracy()(labels, logits)

        # apply the gradients only if this batch beats the stored accuracy
        if tf.greater(current_acc, acc_hist[-1]):
            gradients = tape.gradient(loss, cnn.trainable_variables)
            optimizer.apply_gradients(zip(gradients, cnn.trainable_variables))
        acc_hist.append(current_acc)  # assumed bookkeeping: remember the latest accuracy

    for images, labels in test:
        logits = cnn(images, training=False)
        loss = loss_object(labels, logits)
        test_loss(loss)
        test_acc(labels, logits)

    print(template.format(epoch,
                          float(train_loss.result()), float(test_loss.result()),
                          float(train_acc.result()), float(test_acc.result())))

    if len(loss_hist) > early_stop and loss_hist.popleft() < min(loss_hist):
        print('Early stopping. No validation loss decrease in %i epochs.' % early_stop)
        break
Epoch 1 Train Loss 21.1698 Test Loss 21.3391 Train Acc 37.13% Test Acc 38.50%
Epoch 2 Train Loss 13.8314 Test Loss 12.2496 Train Acc 50.88% Test Acc 52.50%
Epoch 3 Train Loss 13.7594 Test Loss 12.5884 Train Acc 51.75% Test Acc 53.00%
Epoch 4 Train Loss 13.1418 Test Loss 13.2374 Train Acc 52.75% Test Acc 51.50%
Epoch 5 Train Loss 13.6471 Test Loss 13.3157 Train Acc 49.63% Test Acc 51.50%
Here's the part that does the job. acc_hist is a deque holding the previous accuracy; the gradients are applied only if the current batch accuracy is greater than its last element, and skipped otherwise.
for images, labels in train:
    with tf.GradientTape() as tape:
        logits = cnn(images, training=True)
        loss = loss_object(labels, logits)
    train_acc(labels, logits)

    current_acc = tf.metrics.SparseCategoricalAccuracy()(labels, logits)

    if tf.greater(current_acc, acc_hist[-1]):
        gradients = tape.gradient(loss, cnn.trainable_variables)
        optimizer.apply_gradients(zip(gradients, cnn.trainable_variables))
Rather than create a custom fit I think it would be easier to use the callback ModelCheckpoint.
What you are trying to do is get the model that has the lowest validation error. Set the callback up to monitor validation loss. That way it will save the best model even if the network starts to overfit. Documentation is here.
If you do not get a model with a satisfactory validation accuracy then you will have to take other measures.
First look at your training accuracy.
My experience is that you should achieve at least 95%.
If the training accuracy is good but the validation accuracy is poor and degrades as you run more epochs, that is a sign of overfitting.
You did not show the model, but if you are doing classification you will probably have dense layers, with the final layer using softmax activation.
Start out with a model with only one dense layer and see if it trains well.
If not, you may have to add additional dense hidden layers. If you do, include a dropout layer to help prevent overfitting. You might also consider using regularizers. Documentation is here.
I also find you can get improved performance if you dynamically adjust the learning rate. The callback ReduceLROnPlateau enables that capability.
Set it up to monitor validation loss and to reduce the learning rate by a factor if the loss fails to decrease. Documentation is here.
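As a rough sketch of how the two callbacks mentioned above could be configured: the file name best_model.keras and the factor and patience values are illustrative assumptions, not recommendations.

```python
import tensorflow as tf

# Save the model whenever the validation loss reaches a new minimum.
# "best_model.keras" is a hypothetical file name.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="best_model.keras",
    monitor="val_loss",
    save_best_only=True,
)

# Halve the learning rate after 3 epochs without a val_loss improvement
# (factor and patience are illustrative values).
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.5,
    patience=3,
)

callbacks = [checkpoint, reduce_lr]
# Pass the list to fit together with validation data, for example:
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=callbacks)
```

Both callbacks need validation data in the fit call, otherwise val_loss is never reported and there is nothing to monitor.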