Implementing Multiclass Dice Loss Function - python

I am doing multi class segmentation using UNet. My input to the model is HxWxC and my output is,
outputs = layers.Conv2D(n_classes, (1, 1), activation='sigmoid')(decoder0)
Using SparseCategoricalCrossentropy I can train the network fine. Now I would like to also try dice coefficient as the loss function. Implemented as follows,
def dice_loss(y_true, y_pred, smooth=1e-6):
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.math.sigmoid(y_pred)
    numerator = 2 * tf.reduce_sum(y_true * y_pred) + smooth
    denominator = tf.reduce_sum(y_true + y_pred) + smooth
    return 1 - numerator / denominator
However, the loss actually increases instead of decreasing. I have checked multiple sources, but all the material I find uses Dice loss for binary classification, not multiclass. So my question is: is there a problem with the implementation?

The problem is that your Dice loss doesn't account for the number of classes you have and instead assumes the binary case, which might explain the increase in your loss.
You should implement a generalized Dice loss that accounts for all the classes and returns the value for all of them.
Something like the following:
from tensorflow.keras import backend as K
def dice_coef_9cat(y_true, y_pred, smooth=1e-7):
    '''
    Dice coefficient for 10 categories. Ignores background pixel label 0
    Pass to model as metric during compile statement
    '''
    y_true_f = K.flatten(K.one_hot(K.cast(y_true, 'int32'), num_classes=10)[...,1:])
    y_pred_f = K.flatten(y_pred[...,1:])
    intersect = K.sum(y_true_f * y_pred_f, axis=-1)
    denom = K.sum(y_true_f + y_pred_f, axis=-1)
    return K.mean((2. * intersect / (denom + smooth)))
def dice_coef_9cat_loss(y_true, y_pred):
    '''
    Dice loss to minimize. Pass to model as loss during compile statement
    '''
    return 1 - dice_coef_9cat(y_true, y_pred)
This snippet is taken from
This is for 9 categories; adjust it to the number of categories you have.
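To make the per-class idea concrete, here is a minimal pure-Python sketch (no TensorFlow). It assumes one-hot targets and per-class probabilities stored as nested lists. The helper name `multiclass_dice_loss` and the toy data are illustrative assumptions, not part of the snippet above. It computes one Dice score per class and averages them, rather than pooling all classes into a single binary-style ratio.

```python
def multiclass_dice_loss(y_true, y_pred, smooth=1e-7):
    """Mean (1 - Dice) over classes.

    y_true: list of pixels, each a one-hot list of length n_classes.
    y_pred: list of pixels, each a list of class probabilities.
    """
    n_classes = len(y_true[0])
    dice_per_class = []
    for c in range(n_classes):
        # Soft intersection and soft "size" of class c over all pixels.
        intersect = sum(t[c] * p[c] for t, p in zip(y_true, y_pred))
        denom = sum(t[c] + p[c] for t, p in zip(y_true, y_pred))
        dice_per_class.append((2.0 * intersect + smooth) / (denom + smooth))
    return 1.0 - sum(dice_per_class) / n_classes


# Toy example: 4 pixels, 3 classes (hypothetical data).
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
perfect = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 4

loss_perfect = multiclass_dice_loss(y_true, perfect)  # close to 0
loss_uniform = multiclass_dice_loss(y_true, uniform)  # clearly larger
```

Averaging per class keeps a rare class from being drowned out by a frequent one, which is usually the reason to prefer this form in multiclass segmentation.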

If you are doing multi-class segmentation, the 'softmax' activation function should be used.
I would recommend using one-hot encoded ground-truth masks. This needs to be done outside of the loss calculation code.
The generalized dice loss and others were implemented in the following link:

Not sure why but the last layer has "sigmoid" as activation function.
For Multiclass segmentation it has to be "softmax" not "sigmoid".
Also, the loss you are considering is SparseCategoricalCrossentropy along with a multichannel output. If the last layer would have just 1 channel (when doing multi class segmentation), then using SparseCategoricalCrossentropy makes sense but when you have multiple channels as your output the loss which is to be considered is "CategoricalCrossentropy".
Your loss is increasing because the activation and the output channels don't match (as mentioned above). Change the output layer from
outputs = layers.Conv2D(n_classes, (1, 1), activation='sigmoid')(decoder0)
to
outputs = layers.Conv2D(n_classes, (1, 1), activation='softmax')(decoder0)
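As a small illustration of why the activation matters, the following pure-Python sketch compares the two activations on one pixel's logits. The logit values are made up for illustration. Sigmoid scores each channel independently, so the outputs need not sum to 1, while softmax yields a proper distribution over the classes.

```python
import math


def sigmoid(logits):
    # Element-wise: every channel is squashed on its own.
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # hypothetical per-pixel logits for 3 classes
sig = sigmoid(logits)
soft = softmax(logits)
sig_sum = sum(sig)    # not constrained to 1
soft_sum = sum(soft)  # sums to 1 up to floating-point error
```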


Siamese Network for binary classification with pre-encoded inputs

I want to train a Siamese Network to compare vectors for similarity.
My dataset consists of pairs of vectors and a target column with "1" if they are the same and "0" otherwise (binary classification):
import pandas as pd
# Define train and test sets.
X_train_val = pd.read_csv("train.csv")
y_train_val = X_train_val.pop("class")
# Keep 50% of X_train_val in validation set.
X_train, X_val = X_train_val[:991], X_train_val[991:]
y_train, y_val = y_train_val[:991], y_train_val[991:]
del X_train_val, y_train_val
# Split our data to 'left' and 'right' inputs (one for each side Siamese).
X_left_train, X_right_train = X_train.iloc[:, :200], X_train.iloc[:, 200:]
X_left_val, X_right_val = X_val.iloc[:, :200], X_val.iloc[:, 200:]
assert X_left_train.shape == X_right_train.shape
# Repeat for test set.
X_test = pd.read_csv("test.csv")
y_test = X_test.pop("class")
X_left_test, X_right_test = X_test.iloc[:, :200], X_test.iloc[:, 200:]
         v0        v1        v2  ...      v397      v398      v399  class
0  0.003615  0.013794  0.030388  ... -0.093931  0.106202  0.034870    0.0
1  0.018988  0.056302  0.002915  ... -0.007905  0.100859 -0.043529    0.0
2  0.072516  0.125697  0.111230  ... -0.010007  0.064125 -0.085632    0.0
3  0.051016  0.066028  0.082519  ...  0.012677  0.043831 -0.073935    1.0
4  0.020367  0.026446  0.015681  ...  0.062367 -0.022781 -0.032091    0.0
1.0    1060
0.0     923
Name: class, dtype: int64
1.0    354
0.0    308
Name: class, dtype: int64
The rest of my script is as follows:
import keras
import keras.backend as K
from keras.layers import Dense, Dropout, Input, Lambda
from keras.models import Model
def euclidean_distance(vectors):
    """Find the Euclidean distance between two vectors."""
    x, y = vectors
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    # Clamp to a small epsilon so the square root never sees exactly zero, where its gradient is undefined.
    return K.sqrt(K.maximum(sum_square, K.epsilon()))
def contrastive_loss(y_true, y_pred):
    """Distance-based loss function that tries to ensure that data samples that are semantically similar are embedded closer together."""
    margin = 1
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))
def accuracy(y_true, y_pred):
    """Compute classification accuracy with a fixed threshold on distances."""
    return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))
def create_base_network(input_dim: int, dense_units: int, dropout_rate: float):
    # Keras expects the input shape as a tuple, not a bare int.
    input1 = Input(shape=(input_dim,), name="encoder")
    x = input1
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu", name="Embeddings")(x)
    return Model(input1, x)
def build_siamese_model(input_dim: int):
    shared_network = create_base_network(input_dim, dense_units=128, dropout_rate=0.1)
    left_input = Input(shape=(input_dim,))
    right_input = Input(shape=(input_dim,))
    # Since this is a siamese nn, both sides share the same network.
    encoded_l = shared_network(left_input)
    encoded_r = shared_network(right_input)
    # The euclidean distance layer outputs close to 0 value when two inputs are similar and 1 otherwise.
    distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
    siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
    siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
    return siamese_net
model = build_siamese_model(X_left_train.shape[1])
es_callback = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, verbose=0)
history = model.fit(
    [X_left_train, X_right_train],
    y_train,
    validation_data=([X_left_val, X_right_val], y_val),
    callbacks=[es_callback],
)
I have plotted the contrastive loss vs epoch and model accuracy vs epoch:
The validation line is almost flat, which seems odd to me (overfitted?).
After changing the dropout of the shared network from 0.1 to 0.5, I get the following results:
Somehow it looks better, but yields bad predictions as well.
My questions are:
1. Most examples of Siamese Networks I've seen so far involve embedding layers (text pairs) and/or convolution layers (image pairs). My input pairs are the actual vector representations of some text, which is why I used Dense layers for the shared network. Is this the proper approach?
2. The output layer of my Siamese Network is as follows:
distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
but someone over the internet suggested this instead:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="sigmoid")(distance) # returns the class probability
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
I'm not sure which one to trust, nor what the difference between them is (except that the former returns the distance and the latter returns the class probability). In my experiments, I get poor results with binary_crossentropy.
After following @PlzBePython's suggestions, I came up with the following network:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="linear")(distance)
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
Thank you for your help!
This is less of an answer and more writing my thoughts down and hoping they can help find an answer.
In general, everything you do seems pretty reasonable to me.
Regarding your Questions:
1. Embedding or feature extraction layers are never a must, but they almost always make it easier to learn the intended mapping. You can think of them as providing your distance model with a comprehensive summary of a sentence instead of its raw words. This also makes your model independent of the location of a word. In your case, creating the summary/important features of a sentence and embedding similar sentences close to each other is done by the same network. Of course, this can also work, and I don't even think it's a bad approach. However, I would maybe increase the network size.
2. In my opinion, those two loss functions are not too different. Binary Crossentropy is defined as:
$L_{BCE} = -\left[ y \log(p) + (1 - y) \log(1 - p) \right]$
While Contrastive Loss (margin = 1) is:
$L_{C} = y \, d^2 + (1 - y) \max(1 - d, 0)^2$
Here y is the label, p is the predicted probability, and d is the predicted distance. So you basically swap a log function for a hinge function.
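The comparison between the two losses can be checked numerically with a small pure-Python sketch. The function names and sample values are illustrative. It assumes `p` is the predicted probability of "same" for the crossentropy, and `d` is the predicted distance for the contrastive loss, using the same convention as `contrastive_loss` in the question.

```python
import math


def binary_crossentropy(y, p, eps=1e-7):
    # Clip the probability so the logarithm stays finite.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def contrastive_loss(y, d, margin=1.0):
    # Similar pairs (y = 1) are pulled together; dissimilar pairs (y = 0)
    # are only penalized while they are closer than the margin.
    return y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2


# A well-classified similar pair under each loss.
bce_similar = binary_crossentropy(1, 0.9)
con_similar = contrastive_loss(1, 0.1)

# A dissimilar pair already beyond the margin: the hinge is exactly flat,
# whereas the log term of the crossentropy never reaches zero.
con_far = contrastive_loss(0, 1.5)
bce_far = binary_crossentropy(0, 0.01)
```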
The only real difference comes from the distance calculation. You were probably advised to use some kind of L1 distance, since L2 distance is supposed to perform worse in higher dimensions (see for example here) and your dimensionality is 128. Personally, I would rather go with L1 in your case, but I don't think it's a dealbreaker.
What I would try is:
- increase the margin parameter. A margin of 1 always results in a fairly low loss in the false positive case, which could slow down training in general
- try embedding into the (-inf, inf) space (change the activation of the last embedding layer to "linear")
- change the "binary_crossentropy" loss to keras.losses.BinaryCrossentropy(from_logits=True) and the last activation from "sigmoid" to "linear". This should not actually make a difference, but I've had some odd experiences with the Keras binary crossentropy function, and from_logits seems to help sometimes
- increase the number of parameters
Lastly, a validation accuracy of 90% actually looks pretty good to me. Keep in mind that when the validation accuracy is calculated in the first epoch, the model has already done about 60 weight updates (batch_size = 32). That means, especially in the first epoch, a validation accuracy that is higher than the training accuracy (which is calculated during training) is kind of to be expected. Also, this can sometimes give the false impression that training loss is increasing faster than validation loss.
I recommended "linear" in the last layer because TensorFlow recommends it (from_logits=True, which requires values in (-inf, inf)) for Binary Crossentropy. In my experience, it converges better.
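A minimal pure-Python sketch of why computing the crossentropy from logits can behave better. The function names are illustrative. The stable expression below is the one documented for `tf.nn.sigmoid_cross_entropy_with_logits`. Whether Keras takes exactly this path internally for every configuration is not something this sketch verifies. For moderate logits both forms agree, but the logit form stays finite where the probability form would need `exp(800)`, which overflows a float.

```python
import math


def bce_from_probability(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(y, z):
    # Stable form: max(z, 0) - z * y + log(1 + exp(-|z|))
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


z = 3.0
p = 1.0 / (1.0 + math.exp(-z))  # sigmoid applied explicitly
via_probability = bce_from_probability(1, p)
via_logit = bce_from_logit(1, z)

# An extreme logit: the loss is simply about |z|, with no overflow.
extreme = bce_from_logit(1, -800.0)
```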

Differentiable Hamming Loss for TensorFlow

The Hamming Loss is the fraction of labels for which our prediction is wrong, i.e. the number of wrong labels normalized by the total number of labels.
The standard implementation of the Hamming Loss as a metric relies on counting the wrong predictions, with something along these lines (in TensorFlow):
count_non_zero = tf.math.count_nonzero(actuals - predictions)
return tf.reduce_mean(count_non_zero / actuals.get_shape()[-1])
Implementing the Hamming Loss as an actual loss requires it to be differentiable, which is not the case here because of tf.math.count_nonzero.
An alternative (approximate) method would be to count the non-zero labels as follows, but unfortunately the NN doesn't seem to improve.
def hamming_loss(y_true, y_pred):
    y_true = tf.convert_to_tensor(y_true, name="y_true")
    y_pred = tf.convert_to_tensor(y_pred, name="y_pred")
    diff = tf.cast(tf.math.abs(y_true - y_pred), dtype=tf.float32)
    #Counting non-zeros in a differentiable way
    epsilon = K.epsilon()
    nonzero = tf.reduce_mean(tf.math.abs( diff / (tf.math.abs(diff) + epsilon)))
    return tf.reduce_mean(nonzero / K.int_shape(y_pred)[-1])
Concluding, what's the correct implementation of the Hamming Loss for TensorFlow?
Your network doesn't converge because
diff / (tf.math.abs(diff) + epsilon)
yields a vector of (almost exactly) 0s and 1s. That function is flat almost everywhere, so the gradient vanishes for both the zeros and the ones.
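The vanishing gradient can be seen with a quick finite-difference check in pure Python. `soft_nonzero` mirrors the expression from the question. `soft_hamming` is not from the question or the answer: it is one assumed smooth substitute (the mean absolute difference between labels and predicted probabilities), shown only for contrast.

```python
def soft_nonzero(diff, eps=1e-7):
    """The question's surrogate for 'is nonzero': |diff / (|diff| + eps)|."""
    return abs(diff / (abs(diff) + eps))


def soft_hamming(y_true, y_pred):
    """Assumed alternative: mean absolute error between labels and probabilities."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def numerical_grad(f, x, h=1e-6):
    # Central finite difference.
    return (f(x + h) - f(x - h)) / (2 * h)


value = soft_nonzero(0.3)                         # almost exactly 1
grad_step = numerical_grad(soft_nonzero, 0.3)     # almost exactly 0
grad_soft = numerical_grad(lambda p: soft_hamming([1.0], [p]), 0.7)  # about -1
```

With the step-like surrogate, an error of 0.3 and an error of 0.9 both score about 1, so the optimizer gets no signal about which direction reduces the error. The smooth version keeps a usable slope.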

Deal with imbalanced dataset in text classification with Keras and Theano

For a dataset of ~20,000 texts, the true and false samples are ~5,000 against ~15,000. A two-channel textCNN built with Keras and Theano is used for the classification, and the F1 score is the evaluation metric. The F1 score is not bad, but the confusion matrix shows that the accuracy on the true samples is relatively low (~40%). However, it is very important to predict the true samples accurately. Therefore, I want to design a custom binary cross entropy loss function that increases the weight of misclassified true samples and makes the model focus more on predicting the true samples accurately.
I tried class_weight (computed with sklearn) in the fit method, and it did not work very well, since the weight is applied to all samples instead of only the misclassified ones.
I tried and adjusted the method mentioned here, but its loss function was categorical cross entropy and it did not work well for the binary classification problem. I tried to modify the loss function into a binary one but encountered some issues with the input dimension.
The sample code of the cost sensitive loss function focusing on the mis-classified samples is:
from itertools import product
import keras.backend as K
def w_categorical_crossentropy(y_true, y_pred, weights):
    nb_cl = len(weights)
    final_mask = K.zeros_like(y_pred[:, 0])
    y_pred_max = K.max(y_pred, axis=1)
    y_pred_max = K.reshape(y_pred_max, (K.shape(y_pred)[0], 1))
    # Cast the boolean argmax mask to float so it can be multiplied by the weights.
    y_pred_max_mat = K.cast(K.equal(y_pred, y_pred_max), K.floatx())
    for c_p, c_t in product(range(nb_cl), range(nb_cl)):
        final_mask += (weights[c_t, c_p] * y_pred_max_mat[:, c_p] * y_true[:, c_t])
    return K.categorical_crossentropy(y_pred, y_true) * final_mask
Actually, a custom loss function for binary classification implemented with Keras and Theano that focuses on the mis-classified samples is of great importance to the imbalanced dataset. Please help troubleshoot this. Thanks!
Well, when I have to deal with imbalanced datasets in Keras, I first compute the weights for each class and pass them to the model during training. This looks something like this:
import numpy as np
from sklearn.utils import compute_class_weight
w = compute_class_weight('balanced', classes=np.unique(targets), y=targets)
# here I am adding only two categories with their corresponding weights
# you can spin a loop or continue by hand until you include all of your categories
weights = {
    np.unique(targets)[0]: w[0],  # weight for class 0
    np.unique(targets)[1]: w[1],  # weight for class 1
}
# then during training you pass them like this:
# model.fit(x, y=targets, {..}, class_weight=weights)
I believe this will solve your problem.
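For intuition about what the 'balanced' weights amount to, here is a pure-Python sketch using the class sizes from the question (~5,000 true vs ~15,000 false). It uses the formula documented for scikit-learn's 'balanced' mode, n_samples / (n_classes * count). The `weighted_bce` helper is a hypothetical illustration of how such weights scale a binary crossentropy. It is not the code Keras runs internally.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """n_samples / (n_classes * count_per_class), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_bce(y_true, y_pred, weights, eps=1e-7):
    """Binary crossentropy where each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(y_true)


labels = [1] * 5000 + [0] * 15000
weights = balanced_class_weights(labels)  # minority class gets the larger weight

# An equally confident mistake costs more on the minority (true) class.
missed_true = weighted_bce([1], [0.2], weights)
missed_false = weighted_bce([0], [0.8], weights)
```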

Custom combined hinge/kb-divergence loss function in siamese-net fails to generate meaningful speaker-embeddings

I'm currently trying to implement a siamese-net in Keras where I have to implement the following loss function:
loss(p ∥ q) = I_s · KL(p ∥ q) + I_ds · HL(p ∥ q)
detailed description of loss function from paper
Where KL is the Kullback-Leibler divergence and HL is the Hinge-loss.
During training, I label same-speaker pairs as 1, different speakers as 0.
The goal is to use the trained net to extract embeddings from spectrograms.
A spectrogram is a 2-dimensional numpy-array 40x128 (time x frequency)
The problem is that I never get over 0.5 accuracy, and when clustering the speaker embeddings the results show there seems to be no correlation between embeddings and speakers.
I implemented the KL divergence as the distance measure, and adjusted the hinge loss accordingly:
def kullback_leibler_divergence(vects):
    x, y = vects
    x = ks.backend.clip(x, ks.backend.epsilon(), 1)
    y = ks.backend.clip(y, ks.backend.epsilon(), 1)
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
def kullback_leibler_shape(shapes):
    shape1, shape2 = shapes
    return shape1[0], 1
def kb_hinge_loss(y_true, y_pred):
    """
    y_true: binary label, 1 = same speaker
    y_pred: output of siamese net i.e. kullback-leibler distribution
    """
    hinge = ks.backend.mean(ks.backend.maximum(MARGIN - y_pred, 0.), axis=-1)
    return y_true * y_pred + (1 - y_true) * hinge
A single spectrogram would be fed into a branch of the base network, the siamese-net consists of two such branches, so two spectrograms are fed simultaneously, and joined in the distance-layer. The output of the base network is 1 x 128. The distance layer computes the kullback-leibler divergence and its output is fed into the kb_hinge_loss. The architecture of the base-network is as follows:
def create_lstm(units: int, gpu: bool, name: str, is_sequence: bool = True):
    if gpu:
        return ks.layers.CuDNNLSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
    return ks.layers.LSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
def build_model(mode: str = 'train') -> ks.Model:
    topology = TRAIN_CONF['topology']
    is_gpu = tf.test.is_gpu_available(cuda_only=True)
    model = ks.Sequential(name='base_network')
    model.add(
        ks.layers.Bidirectional(create_lstm(topology['blstm1_units'], is_gpu, name='blstm_1'), input_shape=INPUT_DIMS))
    model.add(ks.layers.Bidirectional(create_lstm(topology['blstm2_units'], is_gpu, is_sequence=False, name='blstm_2')))
    if mode == 'extraction':
        return model
    num_units = topology['dense1_units']
    model.add(ks.layers.Dense(num_units, name='dense_1'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense2_units']
    model.add(ks.layers.Dense(num_units, name='dense_2'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense3_units']
    model.add(ks.layers.Dense(num_units, name='dense_3'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense4_units']
    model.add(ks.layers.Dense(num_units, name='dense_4'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    return model
I then build a siamese net as follows:
base_network = build_model()
input_a = ks.Input(shape=INPUT_DIMS, name='input_a')
input_b = ks.Input(shape=INPUT_DIMS, name='input_b')
processed_a = base_network(input_a)
processed_b = base_network(input_b)
distance = ks.layers.Lambda(kullback_leibler_divergence,
                            name='distance')([processed_a, processed_b])
model = ks.Model(inputs=[input_a, input_b], outputs=distance)
adam = build_optimizer()
model.compile(loss=kb_hinge_loss, optimizer=adam, metrics=['accuracy'])
Lastly, I build a net with the same architecture with only one input, and try to extract embeddings, and then build the mean over them, where an embedding should serve as a representation for a speaker, to be used during clustering:
utterance_embedding = np.mean(embedding_extractor.predict_on_batch(spectrogram), axis=0)
We train the net on the voxceleb speaker set.
The full code can be seen here: GitHub repo
I'm trying to figure out if I have made any wrong assumptions and how to improve my accuracy.
Issue with accuracy
Notice that in your model:
y_true = labels
y_pred = kullback-leibler divergence
These two cannot be compared directly. For example, for correct results, when y_true == 1 (same speaker), the Kullback-Leibler divergence should be y_pred == 0 (no divergence).
So it's totally expected that the built-in accuracy metric will not work properly.
Then, either you create a custom metric, or you rely only on the loss for evaluation.
This custom metric needs a few adjustments in order to be feasible, as explained below.
Possible issues with the loss
This might be a problem
First, notice that you're using clip on the values for the Kullback-Leibler. This may be bad because clipping loses the gradients in the clipped regions. And since your activation is a PReLU, you have values lower than zero and bigger than 1. So there are certainly zero-gradient cases here and there, with the risk of ending up with a frozen model.
So, you might not want to clip these values. And to avoid negative values from the PReLU, you can try a 'softplus' activation, which is a kind of soft ReLU without negative values. You might also add an epsilon to avoid trouble, but there is no problem in leaving values bigger than one:
#considering you used 'softplus' instead of 'PRelu' in speakers
def kullback_leibler_divergence(speakers):
    x, y = speakers
    x = x + ks.backend.epsilon()
    y = y + ks.backend.epsilon()
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
Asymmetry in Kullback-Leibler
This IS a problem
Notice also that Kullback-Leibler is not a symmetric function, and (for outputs that are not normalized to sum to 1, as here) it also doesn't have its minimum at zero! The perfect match is zero, but bad matches can have lower values, and this is bad for a loss function because it will drive you to divergence.
See this picture showing KL's graph
Your paper states that you should sum two losses: (p||q) and (q||p).
This eliminates the asymmetry and also the negative values.
distance1 = ks.layers.Lambda(kullback_leibler_divergence,
                             name='distance1')([processed_a, processed_b])
distance2 = ks.layers.Lambda(kullback_leibler_divergence,
                             name='distance2')([processed_b, processed_a])
distance = ks.layers.Add(name='dist_add')([distance1,distance2])
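Both claims (asymmetry, and values below zero for unnormalized vectors) can be verified with a tiny pure-Python sketch. The vectors are made-up stand-ins for positive network outputs that do not sum to 1. The symmetric sum equals sum((a - b) * log(a / b)), whose terms are each non-negative, so it cannot go below zero.

```python
import math


def kl(x, y):
    """Unnormalized KL-style sum, as in the Lambda layer: sum(x * log(x / y))."""
    return sum(a * math.log(a / b) for a, b in zip(x, y))


x = [0.1, 0.2]  # hypothetical positive outputs, not normalized
y = [0.5, 0.9]

kl_xy = kl(x, y)           # negative: a "bad match" scoring below zero
kl_yx = kl(y, x)           # a different, positive value: not symmetric
symmetric = kl_xy + kl_yx  # non-negative, and the same in both orders
self_match = kl(x, x)      # exactly zero for a perfect match
```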
Very low margin and clipped hinge
This might be a problem
Finally, see that the hinge loss also clips values below zero!
Since Kullback-Leibler is not limited to 1, samples with high divergence may not be controlled by this loss. I'm not sure if this is really an issue, but you might want to either:
- increase the margin
- inside the Kullback-Leibler, use mean instead of sum
- use a softplus in the hinge instead of a max, to avoid losing gradients.
MARGIN = someValue
hinge = ks.backend.mean(ks.backend.softplus(MARGIN - y_pred), axis=-1)
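The difference between the two hinge variants shows up in their gradients. In this pure-Python sketch the margin and the divergence value are arbitrary illustrative numbers. Beyond the margin the hard hinge has a gradient of exactly zero, while the softplus version still gives a small non-zero gradient.

```python
import math


def hinge(d, margin):
    return max(margin - d, 0.0)


def softplus_hinge(d, margin):
    z = margin - d
    # Stable softplus: log(1 + exp(z)) = max(z, 0) + log1p(exp(-|z|))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def numerical_grad(f, x, h=1e-6):
    # Central finite difference.
    return (f(x + h) - f(x - h)) / (2 * h)


margin = 1.0  # hypothetical value
d = 3.0       # a divergence well beyond the margin

grad_hinge = numerical_grad(lambda v: hinge(v, margin), d)          # 0.0
grad_soft = numerical_grad(lambda v: softplus_hinge(v, margin), d)  # about -0.119
```

Note that softplus never reaches zero, so different-speaker pairs keep being pushed apart slightly even when they are already far from each other. Whether that is desirable depends on the margin you choose.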
Now we can think of a custom accuracy
This is not very easy, since we don't have clear limits on KL that tell us "correct/not correct".
You might try one at random, but you'd need to tune this threshold parameter until you find a value that represents reality well. You may, for instance, use your validation data to find the threshold that gives the best accuracy.
def customMetric(y_true_targets, y_pred_KBL):
    isMatch = ks.backend.less(y_pred_KBL, threshold)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    isMatch = ks.backend.equal(y_true_targets, isMatch)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    return ks.backend.mean(isMatch)
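One simple way to pick the threshold from validation data is a brute-force search over the observed divergences. This is a pure-Python sketch with made-up validation values. The helper names `accuracy_at` and `best_threshold` are hypothetical, and in practice the divergences would come from running the trained model on the validation pairs.

```python
def accuracy_at(threshold, distances, labels):
    """Fraction of pairs where (distance < threshold) agrees with the label."""
    hits = sum(int(d < threshold) == y for d, y in zip(distances, labels))
    return hits / len(labels)


def best_threshold(distances, labels):
    """Try each observed distance (plus one value above the max) as a candidate."""
    candidates = sorted(set(distances)) + [max(distances) + 1.0]
    return max(candidates, key=lambda t: accuracy_at(t, distances, labels))


# Hypothetical validation divergences and labels (1 = same speaker).
val_dist = [0.1, 0.3, 0.4, 1.2, 2.5, 3.0]
val_labels = [1, 1, 1, 0, 0, 0]

threshold = best_threshold(val_dist, val_labels)
acc = accuracy_at(threshold, val_dist, val_labels)
```

The value found this way can then be used as `threshold` in the metric above. It should be re-tuned whenever the model or the margin changes.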

Zero Jaccard accuracy in u-net implementation with keras

I am trying to use u-net with a Keras implementation, and I am using the following repo
It works well, but my problem is a two-class segmentation problem, so I want to use Jaccard as the accuracy metric and also as the loss function.
I tried to define the function:
def Jac(y_true, y_pred):
    y_pred_f = K.flatten(K.round(y_pred))
    y_true_f = K.flatten(y_true)
    num = K.sum(y_true_f * y_pred_f)
    den = K.sum(y_true_f) + K.sum(y_pred_f) - num
    return num / den
and call it in the compilation:
model.compile(optimizer = Adam(lr = 1e-4), loss = ['binary_crossentropy'], metrics = [Jac])
When I do that, the Jaccard accuracy decreases with every iteration until it reaches zero!
Is there any explanation for why that happens?
P.S: The same thing happens with the Dice.
P.S: The output layer is conv 1 * 1 with sigmoid activation function
Attached the original implementation in keras of the binary accuracy:
def binary_accuracy(y_true, y_pred):
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
And I can see that it also uses rounding to get the output prediction.
You're rounding inside your function (K.round).
That causes two problems:
- (real problem) The function is not differentiable and cannot be used as a loss function (a "None values not supported" error will be shown)
- Whenever your network is unsure and outputs values below 0.5, those values will be considered zero.
If the number of black (zero) pixels in y_true is greater than the number of white (1) ones, this will happen:
- your network will tend to predict everything as zero first, and this will indeed result in a better binary crossentropy loss!
  - And also a better Jaccard if not rounded
  - But a zero Jaccard if rounded
- and only later, when the learning rates are more finely adjusted, it will start bringing out the white pixels where they should be.
You should really be using a non-rounded function for both reasons above.
And plot your outputs sometimes to see what is going on :)
Notice that if you're using this as a loss function, multiply it by -1 (because you will want it to decrease, not increase)
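The effect of rounding can be reproduced with a few numbers in pure Python. The mask and predictions below are made up to mimic an early training stage, in which the network is unsure and every output is below 0.5.

```python
def jaccard(y_true, y_pred, rounded):
    """Jaccard index on flat lists, optionally rounding predictions first."""
    if rounded:
        y_pred = [float(round(p)) for p in y_pred]
    num = sum(t * p for t, p in zip(y_true, y_pred))
    den = sum(y_true) + sum(y_pred) - num
    return num / den if den else 0.0


# Mostly background, two foreground pixels the network is still unsure about.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.4, 0.4]

soft = jaccard(y_true, y_pred, rounded=False)  # about 0.31: partial credit
hard = jaccard(y_true, y_pred, rounded=True)   # 0.0: every prediction rounds to 0
```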
Try these functions below, copied from GitHub. Use jacard_coef as a Keras metric and, if you want, jacard_coef_loss as the Keras loss.
def jacard_coef(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (intersection + 1.0) / (K.sum(y_true_f) + K.sum(y_pred_f) - intersection + 1.0)
def jacard_coef_loss(y_true, y_pred):
    return -jacard_coef(y_true, y_pred)
model.compile(optimizer = Adam(lr = 1e-4), loss = [jacard_coef_loss], metrics = [jacard_coef])

