The Hamming Loss counts the number of labels for which our prediction is wrong and normalizes it by the total number of labels.
The standard implementation of the Hamming Loss as a metric relies on counting the wrong predictions, with something along these lines (in TensorFlow):
count_non_zero = tf.math.count_nonzero(actuals - predictions)
return tf.reduce_mean(count_non_zero / actuals.get_shape()[-1])
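For reference, a minimal self-contained sketch of how that fragment could be wrapped as a metric (the function name and test values are mine, and I normalize per sample rather than globally):
import tensorflow as tf

def hamming_loss_metric(actuals, predictions):
    # Count the label positions where prediction and ground truth disagree,
    # then normalize by the number of labels per sample.
    actuals = tf.cast(actuals, tf.float32)
    predictions = tf.cast(predictions, tf.float32)
    count_non_zero = tf.math.count_nonzero(actuals - predictions, axis=-1, dtype=tf.float32)
    num_labels = tf.cast(tf.shape(actuals)[-1], tf.float32)
    return tf.reduce_mean(count_non_zero / num_labels)

# Example: 2 samples, 4 labels each; sample 1 has 1 mismatch, sample 2 has 2.
y_true = tf.constant([[1., 0., 1., 1.], [0., 1., 0., 0.]])
y_hat  = tf.constant([[1., 1., 1., 1.], [1., 1., 1., 0.]])
print(hamming_loss_metric(y_true, y_hat).numpy())  # 0.375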
Implementing the Hamming Loss as an actual loss requires it to be differentiable, which is not the case here because of tf.math.count_nonzero.
An alternative (approximate) method would be to count the non-zero labels as below, but unfortunately the network doesn't seem to improve.
def hamming_loss(y_true, y_pred):
    y_true = tf.convert_to_tensor(y_true, name="y_true")
    y_pred = tf.convert_to_tensor(y_pred, name="y_pred")
    diff = tf.cast(tf.math.abs(y_true - y_pred), dtype=tf.float32)
    # Counting non-zeros in a differentiable way
    epsilon = K.epsilon()
    nonzero = tf.reduce_mean(tf.math.abs(diff / (tf.math.abs(diff) + epsilon)))
    return tf.reduce_mean(nonzero / K.int_shape(y_pred)[-1])
To conclude: what is the correct implementation of the Hamming Loss for TensorFlow?
Your network doesn't converge because:
diff / (tf.math.abs(diff) + epsilon)
yields a vector of (approximately) 0s and 1s, which kills the gradients both at the zeros and at the ones.
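To see this concretely, here is a small illustration (my own sketch, with arbitrary values) showing that the gradient of that soft count with respect to the predictions is essentially zero:
import tensorflow as tf

y_true = tf.constant([1., 0., 1., 0.])
y_pred = tf.Variable([0.9, 0.2, 0.4, 0.1])
epsilon = 1e-7

with tf.GradientTape() as tape:
    diff = tf.math.abs(y_true - y_pred)
    nonzero = tf.reduce_mean(diff / (diff + epsilon))

grad = tape.gradient(nonzero, y_pred)
print(grad.numpy())  # tiny values on the order of epsilon / diff**2, i.e. effectively zero gradients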
I am trying to implement logistic regression from scratch using the binary cross entropy loss function. The loss function implemented below is based on the following formula:
loss = -(1/N) * sum( y*log(yhat) + (1-y)*log(1-yhat) )
def binary_crossentropy(y, yhat):
    no_of_samples = len(y)
    numerator_1 = y * np.log(yhat)
    numerator_2 = (1 - y) * np.log(1 - yhat)
    loss = -(np.sum(numerator_1 + numerator_2) / no_of_samples)
    return loss
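As a quick sanity check of the function above (the numbers here are made up for illustration):
import numpy as np

y = np.array([1, 0, 1])
yhat = np.array([0.9, 0.1, 0.8])
print(binary_crossentropy(y, yhat))  # ~0.14, small because the predictions are close to the labels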
And below is how I implement the training using gradient descent.
L = 0.01
epochs = 40000
no_of_samples = len(x)
# Keeping track of the loss
loss = []
for _ in range(epochs):
    yhat = sigmoid(x*weight + bias)
    # Finding out the loss of each iteration
    loss.append(binary_crossentropy(y, yhat))
    d_weight = np.sum(x * (yhat - y)) / no_of_samples
    d_bias = np.sum(yhat - y) / no_of_samples
    weight = weight - L*d_weight
    bias = bias - L*d_bias
The training above goes fine, since the weight and bias are adjusted properly. But my question is: why does the loss graph fluctuate so much?
I have previously implemented linear regression, and there the loss decreased steadily.
Is there anything incorrect in my logistic regression implementation? If the implementation is already correct, why does it fluctuate that way?
You need to tune the hyperparameters to see whether the problem goes away. One thing that can be done is to change the optimizer; for instance, you could use fmin_tnc (from SciPy) instead of plain gradient descent.
Besides, you can tune the number of epochs, the learning rate L, and the type of solver ('newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga') if you use scikit-learn for the regression.
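As an illustration of that suggestion, a minimal scikit-learn sketch (assuming x and y are the arrays from the question and that x is 1-D):
from sklearn.linear_model import LogisticRegression

# Try a different solver and compare the fitted parameters to the manual ones.
clf = LogisticRegression(solver='lbfgs', max_iter=1000)
clf.fit(x.reshape(-1, 1), y)
print(clf.coef_, clf.intercept_)
Comparing clf.coef_ and clf.intercept_ with the manually learned weight and bias is a quick way to check whether the fluctuation comes from the optimizer settings or from the data itself.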
I am using the image segmentation guide by fchollet to perform semantic segmentation. I have attempted to modify the guide to suit my dataset by labelling the 8-bit image mask values as 1 and 2, like in the Oxford Pets dataset (these get shifted down to 0 and 1 in class OxfordPets(keras.utils.Sequence):).
The question is: how do I get the IoU metric of a single class (e.g. 1)?
I have tried different metrics suggested on Stack Overflow, but most of them suggest using MeanIoU, which I tried, but I got a nan loss as a result. Here is an example of a mask after using autocontrast.
PIL.ImageOps.autocontrast(load_img(val_target_img_paths[i]))
The model seems to train well, but the accuracy was decreasing over time.
Also, can someone help explain how the metric score is calculated from y_true and y_pred? I don't quite understand when the label value is used in the IoU metric calculation.
I had a similar problem. I used jaccard_distance_loss and dice_metric, which are based on IoU. My task was binary segmentation, so you might have to modify the code if you want to use it for a multi-label problem.
from keras import backend as K
import numpy as np

def jaccard_distance_loss(y_true, y_pred, smooth=100):
    """
    Jaccard = (|X & Y|) / (|X| + |Y| - |X & Y|)
            = sum(|A*B|) / (sum(|A|) + sum(|B|) - sum(|A*B|))
    The Jaccard distance loss is useful for unbalanced datasets. This has been
    shifted so it converges on 0 and is smoothed to avoid exploding or disappearing
    gradients.
    Ref: https://en.wikipedia.org/wiki/Jaccard_index
    #url: https://gist.github.com/wassname/f1452b748efcbeb4cb9b1d059dce6f96
    #author: wassname
    """
    intersection = K.sum(K.sum(K.abs(y_true * y_pred), axis=-1))
    sum_ = K.sum(K.sum(K.abs(y_true) + K.abs(y_pred), axis=-1))
    jac = (intersection + smooth) / (sum_ - intersection + smooth)
    return (1 - jac) * smooth

def dice_metric(y_pred, y_true):
    intersection = K.sum(K.sum(K.abs(y_true * y_pred), axis=-1))
    union = K.sum(K.sum(K.abs(y_true) + K.abs(y_pred), axis=-1))
    # if y_pred.sum() == 0 and y_true.sum() == 0:
    #     return 1.0
    return 2 * intersection / union
# Example
size = 10
y_true = np.zeros(shape=(size,size))
y_true[3:6,3:6] = 1
y_pred = np.zeros(shape=(size,size))
y_pred[3:5,3:5] = 1
loss = jaccard_distance_loss(y_true,y_pred)
metric = dice_metric(y_pred,y_true)
print(f"loss: {loss}")
print(f"dice_metric: {metric}")
loss: 4.587155963302747
dice_metric: 0.6153846153846154
You can use the tf.keras.metrics.IoU metric; its documentation is here: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/IoU
How to use it is shown here:
import tensorflow as tf
y_true = [0, 1]
y_pred = [0.5, 1.0]  # they must be the same shape
# target_class_ids indicates the class/classes you want to calculate IoU for
iou_metric = tf.keras.metrics.IoU(num_classes=2, target_class_ids=[1])
iou_metric.update_state(y_true, y_pred)
print(iou_metric.result().numpy())
And as you can see in the documentation, IoU is calculated by:
true_positives / (true_positives + false_positives + false_negatives)
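If it helps, here is a small hand-computed sketch (my own toy labels, assuming hard class labels after argmax/thresholding) of how that formula yields the IoU for a single class, e.g. class 1:
import numpy as np

y_true = np.array([0, 1, 1, 1, 0, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0])   # already argmax-ed / thresholded class labels

cls = 1
tp = np.sum((y_true == cls) & (y_pred == cls))   # 2
fp = np.sum((y_true != cls) & (y_pred == cls))   # 1
fn = np.sum((y_true == cls) & (y_pred != cls))   # 1
iou = tp / (tp + fp + fn)
print(iou)  # 0.5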
I am working on epilepsy seizure prediction. I have an imbalanced dataset, and I want to compensate for the imbalance by using focal loss. I have 2 classes with one-hot encoded labels. I found the focal loss code below, but I don't know how I can get the y_pred that is used in the focal loss code before model.fit_generator.
y_pred is the output of the model, so how can I use it in the focal loss code before fitting my model?
focal loss code:
def categorical_focal_loss(gamma=2.0, alpha=0.25):
    """
    Implementation of Focal Loss from the paper, for multiclass classification.
    Formula:
        loss = -alpha * ((1 - p)^gamma) * log(p)
    Parameters:
        alpha -- the same as the weighting factor in balanced cross entropy
        gamma -- focusing parameter for the modulating factor (1 - p)
    Default values:
        gamma -- 2.0 as mentioned in the paper
        alpha -- 0.25 as mentioned in the paper
    """
    def focal_loss(y_true, y_pred):
        # Define epsilon so that backpropagation will not result in NaN
        # for the 0-divisor case
        epsilon = K.epsilon()
        # Add the epsilon to the prediction value
        # y_pred = y_pred + epsilon
        # Clip the prediction value
        y_pred = K.clip(y_pred, epsilon, 1.0 - epsilon)
        # Calculate cross entropy
        cross_entropy = -y_true * K.log(y_pred)
        # Calculate the weight, which consists of the modulating factor and the weighting factor
        weight = alpha * y_true * K.pow((1 - y_pred), gamma)
        # Calculate focal loss
        loss = weight * cross_entropy
        # Sum the losses over the class axis for each sample in the mini-batch
        loss = K.sum(loss, axis=1)
        return loss
    return focal_loss
My code:
history = model.fit_generator(generate_arrays_for_training(indexPat, train_data, start=0, end=100),
                              validation_data=generate_arrays_for_training(indexPat, test_data, start=0, end=100),
                              steps_per_epoch=int(len(train_data) / 2),
                              validation_steps=int(len(test_data) / 2),
                              verbose=2, epochs=65, max_queue_size=2, shuffle=True)
preictPrediction = model.predict_generator(generate_arrays_for_predict(indexPat, filesPath_data), max_queue_size=4, steps=len(filesPath_data))
y_pred1 = np.argmax(preictPrediction, axis=1)
y_pred = list(y_pred1)
From the comment section, for the benefit of the community:
This is not specific to focal loss; all Keras loss functions take
y_true and y_pred. You do not need to worry about where those parameters
come from, they are fed by Keras automatically.
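In other words, you only need to pass the loss function to model.compile and keep your generator code as-is. A minimal sketch (the optimizer and metric are placeholders I chose, not something prescribed by the question):
model.compile(optimizer='adam',
              loss=categorical_focal_loss(gamma=2.0, alpha=0.25),
              metrics=['accuracy'])

history = model.fit_generator(generate_arrays_for_training(indexPat, train_data, start=0, end=100),
                              validation_data=generate_arrays_for_training(indexPat, test_data, start=0, end=100),
                              steps_per_epoch=int(len(train_data) / 2),
                              validation_steps=int(len(test_data) / 2),
                              verbose=2, epochs=65, max_queue_size=2, shuffle=True)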
I'm currently trying to implement a siamese-net in Keras where I have to implement the following loss function:
loss(p ∥ q) = Is · KL(p ∥ q) + Ids · HL(p ∥ q)
where KL is the Kullback-Leibler divergence and HL is the hinge loss (a detailed description of the loss function is given in the paper).
During training, I label same-speaker pairs as 1, different speakers as 0.
The goal is to use the trained net to extract embeddings from spectrograms.
A spectrogram is a 2-dimensional numpy-array 40x128 (time x frequency)
The problem is that I never get above 0.5 accuracy, and when clustering the speaker embeddings the results show that there seems to be no correlation between embeddings and speakers.
I implemented the KL divergence as the distance measure and adjusted the hinge loss accordingly:
def kullback_leibler_divergence(vects):
    x, y = vects
    x = ks.backend.clip(x, ks.backend.epsilon(), 1)
    y = ks.backend.clip(y, ks.backend.epsilon(), 1)
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)

def kullback_leibler_shape(shapes):
    shape1, shape2 = shapes
    return shape1[0], 1

def kb_hinge_loss(y_true, y_pred):
    """
    y_true: binary label, 1 = same speaker
    y_pred: output of the siamese net, i.e. the Kullback-Leibler divergence
    """
    MARGIN = 1.
    hinge = ks.backend.mean(ks.backend.maximum(MARGIN - y_pred, 0.), axis=-1)
    return y_true * y_pred + (1 - y_true) * hinge
A single spectrogram is fed into a branch of the base network; the siamese net consists of two such branches, so two spectrograms are fed in simultaneously and joined in the distance layer. The output of the base network is 1 x 128. The distance layer computes the Kullback-Leibler divergence and its output is fed into kb_hinge_loss. The architecture of the base network is as follows:
def create_lstm(units: int, gpu: bool, name: str, is_sequence: bool = True):
    if gpu:
        return ks.layers.CuDNNLSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
    else:
        return ks.layers.LSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)

def build_model(mode: str = 'train') -> ks.Model:
    topology = TRAIN_CONF['topology']
    is_gpu = tf.test.is_gpu_available(cuda_only=True)
    model = ks.Sequential(name='base_network')
    model.add(
        ks.layers.Bidirectional(create_lstm(topology['blstm1_units'], is_gpu, name='blstm_1'), input_shape=INPUT_DIMS))
    model.add(ks.layers.Dropout(topology['dropout1']))
    model.add(ks.layers.Bidirectional(create_lstm(topology['blstm2_units'], is_gpu, is_sequence=False, name='blstm_2')))
    if mode == 'extraction':
        return model
    num_units = topology['dense1_units']
    model.add(ks.layers.Dense(num_units, name='dense_1'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    model.add(ks.layers.Dropout(topology['dropout2']))
    num_units = topology['dense2_units']
    model.add(ks.layers.Dense(num_units, name='dense_2'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense3_units']
    model.add(ks.layers.Dense(num_units, name='dense_3'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense4_units']
    model.add(ks.layers.Dense(num_units, name='dense_4'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    return model
I then build a siamese net as follows:
base_network = build_model()
input_a = ks.Input(shape=INPUT_DIMS, name='input_a')
input_b = ks.Input(shape=INPUT_DIMS, name='input_b')
processed_a = base_network(input_a)
processed_b = base_network(input_b)
distance = ks.layers.Lambda(kullback_leibler_divergence,
output_shape=kullback_leibler_shape,
name='distance')([processed_a, processed_b])
model = ks.Model(inputs=[input_a, input_b], outputs=distance)
adam = build_optimizer()
model.compile(loss=kb_hinge_loss, optimizer=adam, metrics=['accuracy'])
Lastly, I build a net with the same architecture but with only one input, try to extract embeddings, and then take the mean over them; this mean embedding should serve as a representation of a speaker, to be used during clustering:
utterance_embedding = np.mean(embedding_extractor.predict_on_batch(spectrogram), axis=0)
We train the net on the VoxCeleb speaker set.
The full code can be seen here: GitHub repo
I'm trying to figure out whether I have made any wrong assumptions and how to improve my accuracy.
Issue with accuracy
Notice that in your model:
y_true = labels
y_pred = Kullback-Leibler divergence
These two cannot be compared. See this example: for correct results, when y_true == 1 (same speaker), the Kullback-Leibler output should be y_pred == 0 (no divergence).
So it's totally expected that the accuracy metric will not work properly.
Then either you create a custom metric, or you rely only on the loss for evaluation.
This custom metric would need a few adjustments in order to be feasible, as explained below.
Possible issues with the loss
Clipping
This might be a problem
First, notice that you're clipping the values going into the Kullback-Leibler term. This may be bad because clipping loses the gradients in the clipped regions. And since your activation is a PReLU, you have values lower than zero and bigger than 1, so there are certainly zero-gradient cases here and there, with the risk of ending up with a frozen model.
So you might not want to clip these values. To avoid negative values from the PReLU, you can try a 'softplus' activation, which is a kind of soft ReLU without negative values. You might also add an epsilon to avoid trouble, but there is no problem in leaving values bigger than one:
# considering you used 'softplus' instead of 'PReLU' for the speaker embeddings
def kullback_leibler_divergence(speakers):
    x, y = speakers
    x = x + ks.backend.epsilon()
    y = y + ks.backend.epsilon()
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
Asymmetry in the Kullback-Leibler divergence
This IS a problem
Notice also that the Kullback-Leibler divergence is not a symmetric function, and, as used here, it doesn't have its minimum at zero! The perfect match is zero, but bad matches can yield lower (negative) values, and that is bad for a loss function because it will drive you towards divergence.
See this picture showing the graph of the KL term.
Your paper states that you should sum the two losses: (p||q) and (q||p).
This eliminates the asymmetry and also the negative values.
So:
distance1 = ks.layers.Lambda(kullback_leibler_divergence,
                             name='distance1')([processed_a, processed_b])
distance2 = ks.layers.Lambda(kullback_leibler_divergence,
                             name='distance2')([processed_b, processed_a])
distance = ks.layers.Add(name='dist_add')([distance1, distance2])
Very low margin and clipped hinge
This might be a problem
Finally, note that the hinge loss also clips values below zero!
Since the Kullback-Leibler term is not limited to 1, samples with high divergence may not be controlled by this loss. I'm not sure whether this is really an issue, but you might want to either:
increase the margin
inside the Kullback-Leibler, use mean instead of sum
use a softplus in hinge instead of a max, to avoid losing gradients.
See:
MARGIN = someValue
hinge = ks.backend.mean(ks.backend.softplus(MARGIN - y_pred), axis=-1)
Now we can think of a custom accuracy metric.
This is not very easy, since we don't have clear limits on the KL value that tell us "correct / not correct".
You might pick a threshold at random, but you'd need to tune it until you find a value that represents reality well. You may, for instance, use your validation data to find the threshold that brings the best accuracy.
def customMetric(y_true_targets, y_pred_KBL):
    isMatch = ks.backend.less(y_pred_KBL, threshold)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    isMatch = ks.backend.equal(y_true_targets, isMatch)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    return ks.backend.mean(isMatch)
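Since threshold is a free variable in that snippet, one option (just a sketch, with an arbitrary starting value to be tuned on validation data) is to wrap the metric in a closure and pass it to compile:
def make_kbl_accuracy(threshold):
    def kbl_accuracy(y_true_targets, y_pred_KBL):
        isMatch = ks.backend.less(y_pred_KBL, threshold)
        isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
        isMatch = ks.backend.equal(y_true_targets, isMatch)
        isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
        return ks.backend.mean(isMatch)
    return kbl_accuracy

model.compile(loss=kb_hinge_loss, optimizer=adam,
              metrics=[make_kbl_accuracy(threshold=0.5)])  # 0.5 is just a starting guess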
I am trying to use U-Net with a Keras implementation; I am using the following repo:
https://github.com/zhixuhao/unet
It works well, but mine is a two-class segmentation problem, so I want to use the Jaccard index both as the accuracy metric and as the loss function.
I tried to define the function:
def Jac(y_true, y_pred):
    y_pred_f = K.flatten(K.round(y_pred))
    y_true_f = K.flatten(y_true)
    num = K.sum(y_true_f * y_pred_f)
    den = K.sum(y_true_f) + K.sum(y_pred_f) - num
    return num / den
and call it in the compilation:
model.compile(optimizer = Adam(lr = 1e-4), loss = ['binary_crossentropy'], metrics = [Jac])
When I do that, the Jaccard accuracy decreases in every iteration until it reaches zero!
Any explanation of why that happens?
P.S.: The same thing happens with the Dice coefficient.
P.S.: The output layer is a 1x1 convolution with a sigmoid activation function.
Update:
Attached is the original Keras implementation of binary accuracy:
def binary_accuracy(y_true, y_pred):
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
And I can see that it also uses rounding to get the output prediction.
You're rounding your predictions (K.round).
That causes two problems:
(The real problem) The rounded function is not differentiable, so it cannot be used as a loss function (a "None values not supported" error will be shown).
Whenever your network is unsure and outputs values below 0.5, those values will be rounded down to zero.
If the number of black (zero) pixels in y_true is greater than the number of white (one) pixels, this will happen:
your network will tend to predict everything as zero first, and this will indeed result in a better binary crossentropy loss!
It will also give a better Jaccard if not rounded,
but a zero Jaccard if rounded,
and only later, when the learning rate is more finely adjusted, will it start bringing out the white pixels where they should be.
You should really be using a non-rounded function, for both reasons above.
And plot your outputs sometimes to see what is going on :)
Notice that if you're using this as a loss function, multiply it by -1 (because you want it to decrease, not increase).
Try the functions below, copied from GitHub. Use jacard_coef in your Keras metrics and, if you want, jacard_coef_loss as your Keras loss:
def jacard_coef(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (intersection + 1.0) / (K.sum(y_true_f) + K.sum(y_pred_f) - intersection + 1.0)

def jacard_coef_loss(y_true, y_pred):
    return -jacard_coef(y_true, y_pred)

model.compile(optimizer = Adam(lr = 1e-4), loss = [jacard_coef_loss], metrics = [jacard_coef])