Siamese Network for binary classification with pre-encoded inputs - python

I want to train a Siamese Network to compare vectors for similarity.
My dataset consists of pairs of vectors and a target column with "1" if they are the same and "0" otherwise (binary classification):
import pandas as pd
# Define train and test sets.
X_train_val = pd.read_csv("train.csv")
print(X_train_val.head())
y_train_val = X_train_val.pop("class")
print(y_train_val.value_counts())
# Keep 50% of X_train_val in validation set.
X_train, X_val = X_train_val[:991], X_train_val[991:]
y_train, y_val = y_train_val[:991], y_train_val[991:]
del X_train_val, y_train_val
# Split our data into 'left' and 'right' inputs (one for each side of the Siamese network).
X_left_train, X_right_train = X_train.iloc[:, :200], X_train.iloc[:, 200:]
X_left_val, X_right_val = X_val.iloc[:, :200], X_val.iloc[:, 200:]
assert X_left_train.shape == X_right_train.shape
# Repeat for test set.
X_test = pd.read_csv("test.csv")
y_test = X_test.pop("class")
print(y_test.value_counts())
X_left_test, X_right_test = X_test.iloc[:, :200], X_test.iloc[:, 200:]
which prints:
v0 v1 v2 ... v397 v398 v399 class
0 0.003615 0.013794 0.030388 ... -0.093931 0.106202 0.034870 0.0
1 0.018988 0.056302 0.002915 ... -0.007905 0.100859 -0.043529 0.0
2 0.072516 0.125697 0.111230 ... -0.010007 0.064125 -0.085632 0.0
3 0.051016 0.066028 0.082519 ... 0.012677 0.043831 -0.073935 1.0
4 0.020367 0.026446 0.015681 ... 0.062367 -0.022781 -0.032091 0.0
1.0 1060
0.0 923
Name: class, dtype: int64
1.0 354
0.0 308
Name: class, dtype: int64
The rest of my script is as follows:
import keras
import keras.backend as K
from keras.layers import Dense, Dropout, Input, Lambda
from keras.models import Model
def euclidean_distance(vectors):
"""
Find the Euclidean distance between two vectors.
"""
x, y = vectors
sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
# Epsilon is a small value that makes very little difference to the denominator, but ensures it isn't exactly zero.
return K.sqrt(K.maximum(sum_square, K.epsilon()))
def contrastive_loss(y_true, y_pred):
"""
Distance-based loss function that tries to ensure that data samples that are semantically similar are embedded closer together.
See:
* https://gombru.github.io/2019/04/03/ranking_loss/
"""
margin = 1
return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))
def accuracy(y_true, y_pred):
"""
Compute classification accuracy with a fixed threshold on distances.
"""
return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))
def create_base_network(input_dim: int, dense_units: int, dropout_rate: float):
input1 = Input(input_dim, name="encoder")
x = input1
x = Dense(dense_units, activation="relu")(x)
x = Dropout(dropout_rate)(x)
x = Dense(dense_units, activation="relu")(x)
x = Dropout(dropout_rate)(x)
x = Dense(dense_units, activation="relu", name="Embeddings")(x)
return Model(input1, x)
def build_siamese_model(input_dim: int):
shared_network = create_base_network(input_dim, dense_units=128, dropout_rate=0.1)
left_input = Input(input_dim)
right_input = Input(input_dim)
# Since this is a siamese nn, both sides share the same network.
encoded_l = shared_network(left_input)
encoded_r = shared_network(right_input)
# The Euclidean distance layer outputs a value close to 0 when the two inputs are similar and close to 1 otherwise.
distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
return siamese_net
model = build_siamese_model(X_left_train.shape[1])
es_callback = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, verbose=0)
history = model.fit(
[X_left_train, X_right_train],
y_train,
validation_data=([X_left_val, X_right_val], y_val),
epochs=100,
callbacks=[es_callback],
verbose=1,
)
I have plotted the contrastive loss vs. epoch and the model accuracy vs. epoch. The validation line is almost flat, which seems odd to me (overfitting?).
After changing the dropout of the shared network from 0.1 to 0.5, the curves look somewhat better, but the model still yields bad predictions.
My questions are:
Most examples of Siamese Networks I've seen so far involve embedding layers (text pairs) and/or convolution layers (image pairs). My input pairs are the actual vector representations of some text, which is why I used Dense layers for the shared network. Is this the proper approach?
The output layer of my Siamese Network is as follows:
distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
but someone on the internet suggested this instead:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="sigmoid")(distance) # returns the class probability
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
I'm not sure which one to trust nor the difference between them (except that the former returns the distance and the latter returns the class probability). In my experiments, I get poor results with binary_crossentropy.
EDIT:
After following @PlzBePython's suggestions, I came up with the following output layer:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="linear")(distance)
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
Thank you for your help!

This is less of an answer and more writing my thoughts down and hoping they can help find an answer.
In general, everything you do seems pretty reasonable to me.
Regarding your Questions:
1:
Embedding or feature extraction layers are never a must, but they almost always make it easier to learn the intended task. You can think of them as providing your distance model with a comprehensive summary of a sentence instead of its raw words. This also makes your model independent of the position of a word. In your case, creating the summary/important features of a sentence and embedding similar sentences close to each other is done by the same network. Of course, this can also work, and I don't even think it's a bad approach. However, I would maybe increase the network size.
2:
In my opinion, those two loss functions are not too different. Binary cross-entropy is defined as:
BCE(y, p) = -[ y · log(p) + (1 - y) · log(1 - p) ]
while contrastive loss (margin = 1) is:
CL(y, d) = y · d² + (1 - y) · max(1 - d, 0)²
So you basically swap a log function for a hinge function.
The only real difference comes from the distance calculation. You were probably pointed to some kind of L1 distance because the L2 distance is supposed to perform worse in higher dimensions (see for example here), and your dimensionality is 128. Personally, I would rather go with L1 in your case, but I don't think it's a dealbreaker.
What I would try (a rough sketch follows this list):
increase the margin parameter; a margin of 1 always results in a pretty low loss in the false-positive case, which could slow down training in general
try embedding into the (-inf, inf) space (change the last embedding layer's activation to "linear")
change the "binary_crossentropy" loss into keras.losses.BinaryCrossentropy(from_logits=True) and the last activation from "sigmoid" to "linear"; this should in principle not make a difference, but I've had some odd experiences with the Keras binary cross-entropy function, and from_logits seems to help sometimes
increase the number of parameters
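A rough sketch of how the first three points could look in code. The margin value, unit counts, and optimizer choice below are illustrative assumptions, not tuned values; the layout mirrors the question's model:
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Dropout, Input, Lambda
from tensorflow.keras.models import Model

MARGIN = 5.0  # assumption: a larger margin than 1

def contrastive_loss(y_true, y_pred):
    y_true = K.cast(y_true, y_pred.dtype)
    return K.mean(y_true * K.square(y_pred)
                  + (1 - y_true) * K.square(K.maximum(MARGIN - y_pred, 0)))

def euclidean_distance(vectors):
    x, y = vectors
    return K.sqrt(K.maximum(K.sum(K.square(x - y), axis=1, keepdims=True),
                            K.epsilon()))

def create_base_network(input_dim, dense_units=256, dropout_rate=0.1):
    inp = Input((input_dim,))
    x = Dense(dense_units, activation="relu")(inp)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    # Embed into (-inf, inf): linear activation on the embedding layer.
    x = Dense(dense_units, activation="linear", name="Embeddings")(x)
    return Model(inp, x)

def build_siamese_model(input_dim):
    shared = create_base_network(input_dim)
    left, right = Input((input_dim,)), Input((input_dim,))
    distance = Lambda(euclidean_distance,
                      name="Euclidean-Distance")([shared(left), shared(right)])
    model = Model([left, right], distance)
    model.compile(loss=contrastive_loss, optimizer="rmsprop")
    return model
An L1 distance could be swapped in by returning K.sum(K.abs(x - y), axis=1, keepdims=True) instead of the Euclidean expression.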
Lastly, a validation accuracy of 90% actually looks pretty good to me. Keep in mind that when the validation accuracy is calculated in the first epoch, the model has already done about 60 weight updates (batch_size = 32). That means that, especially in the first epoch, a validation accuracy higher than the training accuracy (which is averaged over the whole epoch during training) is to be expected. This can also create the false impression that the training loss is decreasing more slowly than the validation loss.
EDIT
I recommended "linear" in the last layer, because tensorflow recommends it ("from_logits"=True which requires value in [-inf, inf]) for Binary Crossentropy. In my experience, it converges better.

Related

BatchNorm makes accuracy at prediction time around 10% of what's reported during training in tensorflow 2.6

I know this has been discussed previously, but I did not find a concrete answer, and some answers did not work when I tried them. The case is simple: I have a model, and if I use batch norm, the training accuracy reported by model.fit(training_data) is above 0.9 (it consistently increases, and the loss decreases). But after training, if I run model.evaluate(training_data) (note: the same data), it returns 0.09, and predictions are really bad too (the accuracy is also low if calculated manually from the results of model.predict(training_data)). I know the difference between training and testing time in batch norm, and I know differences should be expected, but a drop from 0.9 to 0.09 just seems wrong (and the model is completely unusable). I tried some solutions from other threads:
use batch_size in .evaluate to be the same as .fit: did not make a difference
set tf.keras.backend.set_learning_phase(0): got a message saying it is now deprecated and made no difference.
set all batch norm layers to have layer.trainable=False before .predict and .evaluate: it did not make a difference.
If I remove the batch norm layers, the report from model.fit(training_data) coincides with model.evaluate(training_data), but then training makes no progress (results are consistent but bad), so I need to keep them.
Is this a major bug in TF 2.6?
Update: also tested TF 2.5, result is the same.
Sample code (omitting irrelevant parts, like data reading and pre-processing):
### model definition
class CLS_BERT_Embedding(tf.keras.Model):
"""Will only use the CLS token"""
def __init__(self, bert_trainable=False, number_filters=50,FNN_units=512,
number_clases=2,dropout_rate=0.1,name="dcnn"):
super(CLS_BERT_Embedding,self).__init__(name)
self.checkpoint_id ="CLS_BERT_Embedding_bn_3fc_{}filters_{}fc_units_berttrainable{}".format(number_filters,
FNN_units,bert_trainable)
# trainable= False so we don't fine-tune bert, just use as embedding layer
self.bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
trainable=bert_trainable,
input_shape=(3,376))
self.dense_1 = layers.Dense(units = FNN_units,activation="relu")
self.bn1 = layers.BatchNormalization()
self.dense_2 = layers.Dense(units = FNN_units, activation="relu")
self.bn2 = layers.BatchNormalization()
self.dense_3 = layers.Dense(units = FNN_units, activation="relu")
self.bn3 = layers.BatchNormalization()
self.dropout = layers.Dropout(rate=dropout_rate)
if number_clases == 2:
self.last_dense = layers.Dense(units=1,activation="sigmoid")
else:
self.last_dense = layers.Dense(units=number_clases,activation="softmax")
def get_bert_embeddings(self,all_tokens):
CLS_embedding ,embeddings = self.bert_layer([all_tokens[:,0,:],
all_tokens[:,1,:],
all_tokens[:,2,:]])
return CLS_embedding,embeddings
def call(self,inputs,training):
CLS_embedding, x_seq = self.get_bert_embeddings(inputs)
x = self.dense_1(CLS_embedding)
x = self.bn1(x,training)
x = self.dense_2(x)
x = self.bn2(x,training)
x = self.dense_3(x)
x = self.bn3(x,training)
output = self.last_dense(x)
return output
#### config and hyper-params
NUMBER_FILTERS = 1024
FNN_UNITS = 2048
BERT_TRAINABLE = False
NUMBER_CLASSES = len(tokenizer.vocab)
DROPOUT_RATE = 0.2
NUMBER_EPOCHS = 3
LR = 0.001
DEVICE = '/GPU:0'
#### optimization definition
with tf.device(DEVICE):
model = CLS_BERT_Embedding(
bert_trainable = BERT_TRAINABLE,
number_filters=NUMBER_FILTERS,
FNN_units=FNN_UNITS,
number_clases=NUMBER_CLASSES,
dropout_rate = DROPOUT_RATE)
if NUMBER_CLASSES == 2:
loss = "binary_crossentropy"
metrics = ["accuracy"]
else:
loss="sparse_categorical_crossentropy"
metrics = ["sparse_categorical_accuracy"]
optimizer = tf.keras.optimizers.Adam(learning_rate = LR)
loss="sparse_categorical_crossentropy"
model.compile(loss=loss,optimizer=optimizer,metrics=metrics)
### training
with tf.device(DEVICE):
model.fit(train_dataset,
batch_size = BATCH_SIZE ,
epochs=NUMBER_EPOCHS,
shuffle=True,
callbacks=[MyCustomCallback(),
tf.keras.callbacks.ReduceLROnPlateau(monitor="loss",patience=5),
tensorboard,lr_tensorboard])
### testing
train_results = model.evaluate(train_dataset,batch_size = BATCH_SIZE)
print(train_results)
Try running inference without adjusting the trainable flag at all, and then verify that self.bn1.trainable is True. Then run forward prop by calling the model as a callable on each batch of your training data, but with training=True, evaluating each time. That would be something like:
for idx, d in enumerate(train_dataset):
    _ = model(d[0], training=True)   # forward pass in training mode updates the BN moving statistics
    model.evaluate(d[0], d[1])
    if idx > 100:
        break
If your loss starts dropping, then this is an example of your batch norm moving statistics not updating fast enough, which is possible given that the BNs are not trained but your BERT model/layer is. If not, ignore the rest of this, because you may have a different issue.
If that's the case, you have two options. One is to keep calling the model to allow the BN moving statistics to stabilize.
The other is to statistically analyze the output of your BERT layer (get the mean and variance) and directly update the BN layers' moving-statistic weights. Probably the first BN is sufficient, given that your later Dense layers are Xavier Glorot initialized; but since you are using ReLU, you might also try Kaiming He initialization on them.
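A rough sketch of that second option, assuming train_dataset yields (tokens, labels) batches as in the question's fit call; the attribute names (get_bert_embeddings, dense_1, bn1) come from the question's model class:
import tensorflow as tf

# Estimate the statistics of the activations feeding the first BN layer
# over a sample of the training data, then write them into that layer's
# moving statistics.
feats = []
for tokens, _ in train_dataset.take(100):   # sample of the data
    cls_emb, _ = model.get_bert_embeddings(tokens)
    feats.append(model.dense_1(cls_emb))
feats = tf.concat(feats, axis=0)
mean, var = tf.nn.moments(feats, axes=[0])
model.bn1.moving_mean.assign(mean)
model.bn1.moving_variance.assign(var)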
I agree with @Yaoshiang: it is likely that the internal statistics (moving average, moving variance) do not coincide with the per-batch mean and variance, hence a different normalization in the BN layers at training and at test time. Thinking about it, if we use the same batch size for training and testing, then we can keep training=True at test time without it being a real problem (when using predict or evaluate). Otherwise, we can force training to use the moving average and moving variance for the normalization, rather than the per-batch mean and variance, while still estimating beta and gamma. (This implies a minor modification of the BatchNormalization class.)
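As a minimal sketch of the first suggestion (binary labels assumed), one can bypass model.evaluate and call the model directly with training=True, so BN normalizes each test batch with per-batch statistics, exactly as during training:
import tensorflow as tf

acc = tf.keras.metrics.BinaryAccuracy()
for x_batch, y_batch in train_dataset:
    preds = model(x_batch, training=True)   # BN uses the batch statistics
    acc.update_state(y_batch, preds)
print("accuracy with training-mode BN:", float(acc.result()))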

LSTM doesn't learn to add random numbers

I was trying to do a pretty simple thing: train an LSTM that takes a sequence of random numbers and outputs their sum. But after some hours without convergence, I decided to ask here which of my premises doesn't work.
The idea is simple:
I generate a training set of fixed-length sequences of random numbers (drawn from a normal distribution) and label each sequence with its sum
I use an LSTM with an RMSE loss for predicting the output, the sum of these numbers, given the sequence input
Intuitively, the LSTM should learn to set the weights of the input gate to 1 (bias 0), the weights of the forget gate to 0 (bias 1), and the weights of the output gate to 1 (bias 0), and simply add these numbers, but it doesn't. I'm pasting the code I use; I tried different learning rates, optimizers, and batching, and observed the gradients and the outputs, but I can't find the exact reason why it's failing.
Code for generating sequences:
import tensorflow as tf
import numpy as np
tf.enable_eager_execution()
def generate_sequences(n_samples, seq_len):
total_shape = n_samples*seq_len
random_values = np.random.randn(total_shape)
random_values = random_values.reshape(n_samples, -1)
targets = np.sum(random_values, axis=1)
return random_values, targets
Code for training:
n_samples = 100000
seq_len = 2
lr=0.1
epochs = n_samples
batch_size = 1
input_shape = 1
data, targets = generate_sequences(n_samples, seq_len)
train_data = tf.data.Dataset.from_tensor_slices((data, targets))
output = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(1, dtype='float64', recurrent_activation=None, activation=None), input_shape=(batch_size, seq_len, input_shape))
iterator = train_data.batch(batch_size).make_one_shot_iterator()
optimizer = tf.train.AdamOptimizer(lr)
for i in range(epochs):
my_inp, target = iterator.get_next()
with tf.GradientTape(persistent=True) as tape:
tape.watch(my_inp)
my_out = output(tf.reshape(my_inp, shape=(batch_size,seq_len,1)))
loss = tf.sqrt(tf.reduce_sum(tf.square(target - my_out)),1)/batch_size
grads = tape.gradient(loss, output.trainable_variables)
optimizer.apply_gradients(zip(grads, output.trainable_variables),
global_step=tf.train.get_or_create_global_step())
I also have a conjecture that this is a theoretical problem: if we sum random values drawn from a normal distribution, the output is not in the [-1, 1] range, and the LSTM, due to the tanh activations, can't learn it. But changing them didn't improve the performance (they are changed to linear in the example code).
EDIT:
After setting the activations to linear, I realised that the tanh() squashes the values.
SOLVED:
The problem was actually the tanh() of the gates and recurrent states, which was squashing my outputs and not allowing them to grow by adding up the summands. Setting all activations to linear works pretty well.
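For reference, a minimal TF2-style sketch of that fix (the sizes and hyperparameters here are arbitrary choices, not taken from the question): an LSTM cell with linear gate and state activations trained to output the sum of its input sequence.
import numpy as np
import tensorflow as tf

seq_len = 2
x = np.random.randn(10000, seq_len, 1).astype("float32")
y = x.sum(axis=1)

model = tf.keras.Sequential([
    tf.keras.layers.RNN(
        tf.keras.layers.LSTMCell(1, activation=None, recurrent_activation=None),
        input_shape=(seq_len, 1)),
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss="mse")
model.fit(x, y, batch_size=32, epochs=10, verbose=0)
# Should move toward 3.0 as training converges.
print(model.predict(np.array([[[1.0], [2.0]]], dtype="float32")))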

For a classification model in tensorflow, is there a way to impose an asymmetric cost function during the training?

I am trying to build a neural network in TensorFlow where a Type I error (false positive) is more costly than a Type II error (false negative). Is there a way to impose this during the training process (i.e. by inputting a cost matrix)? This is possible with simple models like logistic regression in scikit-learn by specifying the class_weight parameter.
cw = {0: 3,1:1}
clf = LogisticRegression(class_weight = cw )
In this case, incorrectly predicting a 0 is 3x more costly than incorrectly predicting a 1. However, this cannot be performed with a Neural Network, so I want to see if it is possible in tensorflow.
Thanks
You could use tf.nn.weighted_cross_entropy_with_logits and its pos_weight argument.
This argument weights the positive class, as described by the documentation (in TF 2.0 at least):
A value pos_weights > 1 decreases the false negative count, hence increasing the recall.
Conversely setting pos_weights < 1 decreases the false positive count and increases the precision.
In your case, you could create custom loss function like this:
import tensorflow as tf
# Output logits from your network, not the values after sigmoid activation
class WeightedBinaryCrossEntropy:
def __init__(self, positive_weight: float):
self.positive_weight = positive_weight
def __call__(self, targets, logits, sample_weight=None):
return tf.nn.weighted_cross_entropy_with_logits(
targets, logits, pos_weight=self.positive_weight
)
And create a neural network that uses it, for example with tf.keras (samples are weighted as they were in your question):
import numpy as np
model = tf.keras.models.Sequential(
[
tf.keras.layers.Dense(32, input_shape=(10,)),
tf.keras.layers.Activation("relu"),
tf.keras.layers.Dense(10),
tf.keras.layers.Activation("relu"),
# Output one logit for binary classification
tf.keras.layers.Dense(1),
]
)
# Example random data
data = np.random.random((32, 10))
targets = np.random.randint(2, size=32)
# 3 times as costly to make type I error
model.compile(optimizer="rmsprop", loss=WeightedBinaryCrossEntropy(positive_weight=3))
model.fit(data, targets, batch_size=32)
You can use a logarithmic scale. For a 0 incorrectly predicted as 1, y - ŷ = -1 and the loss is about 1.72. For a 1 predicted as 0, y - ŷ = 1 and the loss is about 0.63. For y == ŷ the loss equals 0. So a 0 incorrectly predicted as 1 is almost three times as costly.
import numpy as np
from math import exp
loss=abs(1-exp(-np.log(exp(y-ŷ))))
#abs(1-exp(-np.log(exp(0))))
#Out[53]: 0.0
#abs(1-exp(-np.log(exp(-1))))
#Out[54]: 1.718281828459045
#abs(1-exp(-np.log(exp(1))))
#Out[55]: 0.6321205588285577
Then you will have a convex optimization. Implementing:
import keras.backend as K
def custom_loss(y_true, y_pred):
    # Same expression as above, written with backend ops so it works on tensors.
    return K.mean(K.abs(1 - K.exp(-K.log(K.exp(y_true - y_pred)))))
Then:
model.compile(loss=custom_loss, optimizer=sgd,metrics = ['accuracy'])

Custom combined hinge/KL-divergence loss function in siamese-net fails to generate meaningful speaker-embeddings

I'm currently trying to implement a siamese-net in Keras where I have to implement the following loss function:
loss(p ∥ q) = Is · KL(p ∥ q) + Ids · HL(p ∥ q)
detailed description of loss function from paper
Where KL is the Kullback-Leibler divergence and HL is the Hinge-loss.
During training, I label same-speaker pairs as 1, different speakers as 0.
The goal is to use the trained net to extract embeddings from spectrograms.
A spectrogram is a 2-dimensional numpy-array 40x128 (time x frequency)
The problem is that I never get over 0.5 accuracy, and when clustering the speaker embeddings, the results show that there seems to be no correlation between embeddings and speakers.
I implemented the KL-divergence as the distance measure and adjusted the hinge loss accordingly:
def kullback_leibler_divergence(vects):
x, y = vects
x = ks.backend.clip(x, ks.backend.epsilon(), 1)
y = ks.backend.clip(y, ks.backend.epsilon(), 1)
return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
def kullback_leibler_shape(shapes):
shape1, shape2 = shapes
return shape1[0], 1
def kb_hinge_loss(y_true, y_pred):
"""
y_true: binary label, 1 = same speaker
y_pred: output of the siamese net, i.e. the Kullback-Leibler divergence
"""
MARGIN = 1.
hinge = ks.backend.mean(ks.backend.maximum(MARGIN - y_pred, 0.), axis=-1)
return y_true * y_pred + (1 - y_true) * hinge
A single spectrogram is fed into a branch of the base network; the siamese net consists of two such branches, so two spectrograms are fed simultaneously and joined in the distance layer. The output of the base network is 1 x 128. The distance layer computes the Kullback-Leibler divergence and its output is fed into kb_hinge_loss. The architecture of the base network is as follows:
def create_lstm(units: int, gpu: bool, name: str, is_sequence: bool = True):
if gpu:
return ks.layers.CuDNNLSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
else:
return ks.layers.LSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
def build_model(mode: str = 'train') -> ks.Model:
topology = TRAIN_CONF['topology']
is_gpu = tf.test.is_gpu_available(cuda_only=True)
model = ks.Sequential(name='base_network')
model.add(
ks.layers.Bidirectional(create_lstm(topology['blstm1_units'], is_gpu, name='blstm_1'), input_shape=INPUT_DIMS))
model.add(ks.layers.Dropout(topology['dropout1']))
model.add(ks.layers.Bidirectional(create_lstm(topology['blstm2_units'], is_gpu, is_sequence=False, name='blstm_2')))
if mode == 'extraction':
return model
num_units = topology['dense1_units']
model.add(ks.layers.Dense(num_units, name='dense_1'))
model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
model.add(ks.layers.Dropout(topology['dropout2']))
num_units = topology['dense2_units']
model.add(ks.layers.Dense(num_units, name='dense_2'))
model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
num_units = topology['dense3_units']
model.add(ks.layers.Dense(num_units, name='dense_3'))
model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
num_units = topology['dense4_units']
model.add(ks.layers.Dense(num_units, name='dense_4'))
model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
return model
I then build a siamese net as follows:
base_network = build_model()
input_a = ks.Input(shape=INPUT_DIMS, name='input_a')
input_b = ks.Input(shape=INPUT_DIMS, name='input_b')
processed_a = base_network(input_a)
processed_b = base_network(input_b)
distance = ks.layers.Lambda(kullback_leibler_divergence,
output_shape=kullback_leibler_shape,
name='distance')([processed_a, processed_b])
model = ks.Model(inputs=[input_a, input_b], outputs=distance)
adam = build_optimizer()
model.compile(loss=kb_hinge_loss, optimizer=adam, metrics=['accuracy'])
Lastly, I build a net with the same architecture but only one input, extract embeddings, and then take their mean, where an embedding should serve as a representation of a speaker to be used during clustering:
utterance_embedding = np.mean(embedding_extractor.predict_on_batch(spectrogram), axis=0)
We train the net on the voxceleb speaker set.
The full code can be seen here: GitHub repo
I'm trying to figure out if I have made any wrong assumptions and how to improve my accuracy.
Issue with accuracy
Notice that in your model:
y_true = labels
y_pred = kullback-leibler divergence
These two cannot be compared; see this example: for correct results, when y_true == 1 (same speaker), the Kullback-Leibler output should be y_pred == 0 (no divergence).
So it's totally expected that the metric will not work properly.
You should therefore either create a custom metric or rely only on the loss for evaluation.
This custom metric needs a few adjustments in order to be feasible, as explained below.
Possible issues with the loss
Clipping
This might be a problem
First, notice that you're clipping the values going into the Kullback-Leibler divergence. This may be bad because clipping loses the gradients in the clipped regions. And since your activation is a PReLU, you have values lower than zero and bigger than one, so there are certainly zero-gradient cases here and there, with the risk of a frozen model.
So you might not want to clip these values. To avoid negative values with the PReLU, you can try a 'softplus' activation, which is a kind of soft ReLU without negative values. You might also add an epsilon to avoid trouble, but there is no problem in leaving values bigger than one:
#considering you used 'softplus' instead of 'PRelu' in speakers
def kullback_leibler_divergence(speakers):
x, y = speakers
x = x + ks.backend.epsilon()
y = y + ks.backend.epsilon()
return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
Asymmetry in Kullback-Leibler
This IS a problem
Notice also that Kullback-Leibler is not a symmetric function and does not have its minimum at zero! The perfect match is zero, but bad matches can have lower values, and this is bad for a loss function because it will drive you toward divergence.
See the picture showing the graph of the KL divergence.
Your paper states that you should sum the two losses, KL(p ∥ q) and KL(q ∥ p).
This eliminates the asymmetry and also the negative values.
So:
distance1 = ks.layers.Lambda(kullback_leibler_divergence,
name='distance1')([processed_a, processed_b])
distance2 = ks.layers.Lambda(kullback_leibler_divergence,
name='distance2')([processed_b, processed_a])
distance = ks.layers.Add(name='dist_add')([distance1,distance2])
Very low margin and clipped hinge
This might be a problem
Finally, see that the hinge loss also clips values below zero!
Since the Kullback-Leibler divergence is not limited to 1, samples with high divergence may not be controlled by this loss. I'm not sure whether this is really an issue, but you might want to either:
increase the margin
inside the Kullback-Leibler, use mean instead of sum
use a softplus in hinge instead of a max, to avoid losing gradients.
See:
MARGIN = someValue
hinge = ks.backend.mean(ks.backend.softplus(MARGIN - y_pred), axis=-1)
Now we can think of a custom accuracy.
This is not very easy, since we don't have clear limits on the KL divergence that tell us "correct/not correct".
You might try a threshold at random, but you'd need to tune it until you find a value that represents reality well. You may, for instance, use your validation data to find the threshold that brings the best accuracy (a sketch of such a search follows the metric below).
def customMetric(y_true_targets, y_pred_KBL):
isMatch = ks.backend.less(y_pred_KBL, threshold)
isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
isMatch = ks.backend.equal(y_true_targets, isMatch)
isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
return ks.backend.mean(isMatch)
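A small sketch of that threshold search on validation data. Here val_left, val_right, and val_labels are hypothetical names for the validation pairs and labels, and model is the compiled siamese net:
import numpy as np

distances = model.predict([val_left, val_right]).ravel()
best_thr, best_acc = None, 0.0
for thr in np.linspace(distances.min(), distances.max(), 100):
    acc = np.mean((distances < thr).astype(int) == val_labels)
    if acc > best_acc:
        best_thr, best_acc = thr, acc
print(best_thr, best_acc)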

Keras custom loss: a * MAE + (1-a) * constant

I am trying to implement a fairly simple custom loss function in Keras.
I am trying to make the network predict a bad-input case (i.e. one on which it has no chance of predicting the correct output), along with the correct output. To do this, I used a loss function which allows the network to 'choose' a constant loss (8) instead of its current loss (determined by MAE).
loss = quality * output + (1-quality) * 8
Where quality is output from sigmoid, so in [0,1]
How would I design such a loss function properly in Keras?
Specifically, in the basic case, the network gets several predictions of the output, along with metrics known or thought to correlate with prediction quality. The role of the (small) network is to use these metrics to determine the weights to apply when averaging these different predictions. This works well enough.
However, in some fraction of cases (say 5-10%) the input data is so bad that all predictors will be wrong. In that case, I want to output '?' to the user instead of a wrong answer.
My code complained about 1 array vs. 2 arrays (presumably the same number of y_true and y_pred arrays is expected, but I don't have these).
model = Model(inputs=[ain, in1, in2, in3, in4, in5, x], outputs=[pred,qual])
model.compile(loss=quality_loss, optimizer='adam', metrics=['mae'])
model.fit([acc, hrmet0, hrmet1, hrmet2, hrmet3, hrmet4, hrs], ref, epochs=50, batch_size=5000, verbose=2, shuffle=True)
It seems having two outputs is causing the loss function to be called independently for each output.
ValueError: Error when checking model target: the list of Numpy arrays that you
are passing to your model is not the size the model expected. Expected to see 2
array(s), but instead got the following list of 1 arrays:
This was solved by passing a concatenated array instead.
def quality_loss(y_true, y_pred):
qual = y_pred[:,0]
hr = y_pred[:,1]
const = 8
return qual * mean_absolute_error(y_true,hr) + (1 - qual) * const
def my_mae(y_true,y_pred):
return mean_absolute_error(y_true,y_pred[:,1])
model = Model(inputs=[xin, in1, in2, in3, in4, in5, hr], outputs=concatenate([qual, pred_hr]))
model.compile(loss=quality_loss, optimizer='adam', metrics=[my_mae])
Network code:
xin = Input(shape=(1,))
in1 = Input(shape=(4,))
net1 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in1) )
in2 = Input(shape=(4,))
net2 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in2) )
in3 = Input(shape=(4,))
net3 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in3) )
in4 = Input(shape=(4,))
net4 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in4) )
in5 = Input(shape=(4,))
net5 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in5) )
smweights = Dense(5, activation='softmax')( concatenate([xin, net1, net2, net3, net4, net5]) )
qual = Dense(1, activation='sigmoid')( Dense(3, activation='tanh')( concatenate([xin, net1, net2, net3, net4, net5]) ) )
x = Input(shape=(5,))
pred = dot([x, smweights], axes=1)
This runs, but converges to loss = const with an MAE > 25 (whereas a simple MAE loss here easily achieves 3-4). Something is still not quite right with the loss function. Since the shape of y_true/y_pred inside the loss function shows as (?), it's hard to track what exactly is being passed.
This issue is actually not caused by your custom loss function, but by something else: The problem arises because of how you call the fit function.
When you define the model, you give it 7 inputs and 2 outputs:
model = Model(inputs=[ain, in1, in2, in3, in4, in5, x], outputs=[pred,qual])
When you eventually call the fit function, you give a list of 7 arrays as the input of the network, but only 1 target output value called ref:
model.fit([acc, hrmet0, hrmet1, hrmet2, hrmet3, hrmet4, hrs], ref, ...)
This will not work. You have to supply the fit function with the same number of inputs and outputs as declared in the model's definition.
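To illustrate that point with a self-contained toy example (the layer sizes and data here are made up): a model declared with two outputs must be given two target arrays in fit, one per output.
import numpy as np
from keras.layers import Dense, Input
from keras.models import Model

inp = Input(shape=(4,))
pred_out = Dense(1, name="pred")(inp)
qual_out = Dense(1, activation="sigmoid", name="qual")(inp)
model = Model(inputs=inp, outputs=[pred_out, qual_out])
model.compile(loss=["mae", "binary_crossentropy"], optimizer="adam")

x = np.random.random((8, 4))
y_pred_target = np.random.random((8, 1))
y_qual_target = np.random.randint(2, size=(8, 1))
model.fit(x, [y_pred_target, y_qual_target], epochs=1, verbose=0)  # two outputs -> two targets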
Edit: I think there is a conceptual problem with your approach: how are you actually planning to define the quality of your prediction? And why do you think that adding a branch to your network that is supposed to judge the quality of the network's prediction will actually help train it? The network will converge to a local minimum of the loss function, and the fancier your loss function is, the more likely it is that it will not converge to the state you actually want, but to some other local (not global) minimum. You could try to experiment with different optimizers and learning rates; maybe that helps your training.
