My model always predicts probabilities under 0.5 for every pixel.
I dropped all images without ships and have tried focal loss, IoU loss, and weighted loss to deal with the imbalance.
But the result is the same: after a few batches, the predicted masks gradually become all zeros.
Here is my notebook: [link]
Kaggle discussion: [link]
In the notebook, what I basically did is:
(1) discard all samples where there is no ship
(2) build a plain U-Net
(3) define three custom loss functions (iouloss, focal_binarycrossentropy, biased_crossentropy), all of which I have tried
(4) train and submit
import tensorflow as tf
from keras import backend as K

# define different losses to try
def iouloss(y_true, y_pred):
    intersection = K.sum(y_true * y_pred, axis=-1)
    sum_ = K.sum(y_true + y_pred, axis=-1)
    # epsilon keeps the division stable if a mask is empty
    jac = intersection / (sum_ - intersection + K.epsilon())
    return 1 - jac

def focal_binarycrossentropy(y_true, y_pred):
    # focal loss with gamma 8
    t1 = K.binary_crossentropy(y_true, y_pred)
    t2 = tf.where(tf.equal(y_true, 0), t1 * (y_pred ** 8), t1 * ((1 - y_pred) ** 8))
    return t2

def biased_crossentropy(y_true, y_pred):
    # apply 1000 times heavier punishment to ship pixels (y_true == 1)
    t1 = K.binary_crossentropy(y_true, y_pred)
    t2 = tf.where(tf.equal(y_true, 0), t1, t1 * 1000)
    return t2
...
# try different loss functions
unet.compile(loss=iouloss, optimizer="adam", metrics=[ioumetric])
or
unet.compile(loss=focal_binarycrossentropy, optimizer="adam", metrics=[ioumetric])
or
unet.compile(loss=biased_crossentropy, optimizer="adam", metrics=[ioumetric])
...
# start training
unet.train_on_batch(x=image_batch,y=mask_batch)
One option that Keras provides is the class_weight parameter in fit. From the documentation:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
This will allow you to counter the imbalance to some extent.
I have heard of the Dice coefficient being used for this problem, although I have no personal experience with it. Perhaps you could try it? It is related to the Jaccard index, but I have heard anecdotally that it is easier to train with. Sorry not to offer anything more concrete.
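For reference, here is a sketch of a commonly used smoothed (soft) Dice loss in the same Keras backend style as the losses above; the smooth constant is a conventional stabiliser, not something from this thread:
from keras import backend as K

def dice_loss(y_true, y_pred, smooth=1.0):
    # soft Dice: 2*|A∩B| / (|A| + |B|), computed on probabilities
    intersection = K.sum(y_true * y_pred, axis=-1)
    denom = K.sum(y_true, axis=-1) + K.sum(y_pred, axis=-1)
    return 1 - (2.0 * intersection + smooth) / (denom + smooth)

unet.compile(loss=dice_loss, optimizer="adam", metrics=[ioumetric])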
Hi everybody,
I'm beginning with TensorFlow Probability and I have some difficulties interpreting my Bayesian neural network outputs.
I'm working on a regression case and started with the example provided in the TensorFlow blog post here: https://blog.tensorflow.org/2019/03/regression-with-probabilistic-layers-in.html?hl=fr
As I seek to know the uncertainty of my network's predictions, I dived directly into example 4 with aleatoric & epistemic uncertainty. You can find my code below:
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow import keras
from tensorflow.keras import optimizers

tfd = tfp.distributions

def negative_loglikelihood(targets, estimated_distribution):
    return -estimated_distribution.log_prob(targets)

def posterior_mean_field(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size  # total number of parameters (weights and biases)
    c = np.log(np.expm1(1.))
    return tf.keras.Sequential([
        # random_gaussian_initializer is a custom helper defined elsewhere in the notebook
        tfp.layers.VariableLayer(2 * n, dtype=dtype,
                                 initializer=lambda shape, dtype: random_gaussian_initializer(shape, dtype),
                                 trainable=True),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            # Normal distribution with location loc and scale parameters
            tfd.Normal(loc=t[..., :n],
                       scale=1e-5 + 0.01 * tf.nn.softplus(c + t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])
def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t, scale=1),
            reinterpreted_batch_ndims=1)),
    ])
def build_model(param):
    model = keras.Sequential()
    for i in range(param["n_layers"]):
        name = "n_units_l" + str(i)
        num_hidden = param[name]
        model.add(tfp.layers.DenseVariational(units=num_hidden,
                                              make_prior_fn=prior,
                                              make_posterior_fn=posterior_mean_field,
                                              kl_weight=1 / len(X_train),
                                              activation="relu"))
    model.add(tfp.layers.DenseVariational(units=2,
                                          make_prior_fn=prior,
                                          make_posterior_fn=posterior_mean_field,
                                          activation="relu",
                                          kl_weight=1 / len(X_train)))
    model.add(tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(0.01 * t[..., 1:]))))
    lr = param["learning_rate"]
    optimizer = optimizers.Adam(learning_rate=lr)
    model.compile(
        loss=negative_loglikelihood,
        optimizer=optimizer,
        metrics=[keras.metrics.RootMeanSquaredError()],
    )
    return model
I think I have the same network as in the TFP example; I just added a few hidden layers with different units. I also added 0.01 in front of the softplus in the posterior, as suggested in "Not able to get reasonable results from DenseVariational", which allows the network to reach good performance.
The performance of the model is very good (less than 1% error), but I have some questions:
Since Bayesian neural networks "promise" to measure the uncertainty of their predictions, I was expecting bigger errors on high-variance predictions. I plotted the absolute error against the predicted variance, and to my mind the results are not good enough. Of course, the model is better at low variance, but I can still get really bad predictions at low variance, so I cannot really use the standard deviation to filter out bad predictions. Why is my Bayesian neural network struggling to give me the uncertainty?
The previous network was trained for 2000 epochs, and we can notice a strange phenomenon: a vertical bar at the lowest stdev. If I increase the number of epochs up to 25000, my results get better on both the training and validation sets.
But the vertical-bar phenomenon we may notice in figure 1 becomes much more obvious. It seems that the more I increase the number of epochs, the more all output standard deviations converge to 0.69. Is that a case of overfitting? Why this value of 0.6931571960449219, and why can't I get a lower stdev? Since the phenomenon starts appearing at 2000 epochs, am I already overfitting at 2000 epochs?
At this point the stdev is totally useless. So is there some kind of trade-off? With few epochs my model is less performant but gives me some insight into uncertainty (even if I think it is not sufficient), whereas with many epochs I get better performance but no more uncertainty information, since all outputs have the same stdev.
Sorry for the long post and the language mistakes.
Thank you in advance for your help and any feedback.
I solved the problem of why my uncertainty could not get lower than 0.6931571960449219.
This value is actually converging to log(2), and it is due to the ReLU activation function on my last DenseVariational layer.
Indeed, the scale of tfd.Normal is a softplus (tf.math.softplus).
And softplus is implemented as softplus(x) = log(exp(x) + 1). Because of the ReLU, my x never goes negative, so the minimum is softplus(0) = log(exp(0) + 1) = log(2) ≈ 0.693, which is exactly the floor I was seeing.
A basic linear activation function solved the problem, and my uncertainty behaves normally now.
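Concretely, the fix amounts to replacing the ReLU with a linear activation on the last DenseVariational layer from build_model above (a sketch):
# linear activation (activation=None): the scale input can now go negative,
# so the softplus can reach values below log(2)
model.add(tfp.layers.DenseVariational(units=2,
                                      make_prior_fn=prior,
                                      make_posterior_fn=posterior_mean_field,
                                      activation=None,
                                      kl_weight=1 / len(X_train)))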
I am building an image segmentation model using Keras and I want to train my model on multiple loss functions. I have seen this link, but I am looking for a simpler and more straightforward solution for this situation, as my loss functions are quite complex. Can someone tell me how to build a model with a single output and multiple losses in Keras?
You can use multiple losses with one output via a weighted loss, which is a sum of your losses, each multiplied by a weight. Create a custom loss that returns this weighted sum and pass it to model.compile. There is an example here.
This is just an example from here. You could play around with it.
import tensorflow as tf

def custom_losses(y_true, y_pred):
    alpha = 0.6  # weight on the Huber term
    squared_difference = tf.square(y_true - y_pred)
    huber = tf.keras.losses.huber(y_true, y_pred)
    return tf.reduce_mean(squared_difference, axis=-1) + alpha * huber

model.compile(optimizer='adam', loss=custom_losses, metrics=['MeanSquaredError'])
For a text dataset of ~20,000 samples, the true and false samples are ~5,000 against ~15,000. A two-channel textCNN built with Keras and Theano is used to do the classification, and the F1 score is the evaluation metric. The F1 score is not bad, but the confusion matrix shows that the accuracy on the true samples is relatively low (~40%). Yet it is very important to predict the true samples accurately. Therefore, I want to design a custom binary cross-entropy loss function that increases the weight of misclassified true samples and makes the model focus more on predicting the true samples accurately.
I tried class_weight (computed with sklearn) in the model.fit method, and it did not work very well, since the weight is applied to all samples of a class instead of only the misclassified ones.
I also tried and adjusted the method mentioned here: https://github.com/keras-team/keras/issues/2115, but the loss function there is a categorical cross-entropy, and it did not work well for my binary classification problem. I tried to modify the loss function into a binary one but ran into some issues concerning the input dimensions.
The sample code of the cost-sensitive loss function focusing on the misclassified samples is:
from itertools import product
from keras import backend as K

def w_categorical_crossentropy(y_true, y_pred, weights):
    nb_cl = len(weights)
    final_mask = K.zeros_like(y_pred[:, 0])
    y_pred_max = K.max(y_pred, axis=1)
    y_pred_max = K.reshape(y_pred_max, (K.shape(y_pred)[0], 1))
    # cast the boolean mask to floats so it can be multiplied below
    y_pred_max_mat = K.cast(K.equal(y_pred, y_pred_max), K.floatx())
    for c_p, c_t in product(range(nb_cl), range(nb_cl)):
        final_mask += weights[c_t, c_p] * y_pred_max_mat[:, c_p] * y_true[:, c_t]
    # note: Keras 2 expects (target, output) argument order here
    return K.categorical_crossentropy(y_true, y_pred) * final_mask
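Since this loss takes a third weights argument, it must be wrapped before being passed to compile. A sketch using functools.partial, following the pattern from the linked issue (the 2x2 cost matrix values here are illustrative):
from functools import partial
import numpy as np

w_array = np.ones((2, 2))
w_array[1, 0] = 10.0  # illustrative: misclassified true samples cost 10x

ncce = partial(w_categorical_crossentropy, weights=w_array)
ncce.__name__ = 'w_categorical_crossentropy'  # Keras wants a named loss

model.compile(optimizer='adam', loss=ncce, metrics=['accuracy'])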
In short, a custom loss function for binary classification, implemented with Keras and Theano, that focuses on the misclassified samples would be of great value for this imbalanced dataset. Please help troubleshoot this. Thanks!
Well, when I have to deal with imbalanced datasets in Keras, what I do is first compute the weight for each class and then pass these weights to the model instance during training. This will look something like this:
import numpy as np
from sklearn.utils import compute_class_weight

w = compute_class_weight('balanced', classes=np.unique(targets), y=targets)

# here I am adding only two categories with their corresponding weights;
# you can spin a loop or continue by hand until you include all of your categories
weights = {
    np.unique(targets)[0]: w[0],  # class 0 with weight w[0]
    np.unique(targets)[1]: w[1],  # class 1 with weight w[1]
}

# then during training you do like this
model.fit(x=features, y=targets, class_weight=weights)  # plus your other fit arguments
I believe this will solve your problem.
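If you would rather bake the weighting into the loss itself, which is closer to what the question asks, here is a minimal sketch of a weighted binary cross-entropy; pos_weight is a value you choose yourself, for example derived from compute_class_weight above:
from keras import backend as K

def weighted_binary_crossentropy(pos_weight):
    # pos_weight > 1 makes errors on the true (positive) samples cost more
    def loss(y_true, y_pred):
        bce = K.binary_crossentropy(y_true, y_pred)
        sample_weights = y_true * pos_weight + (1.0 - y_true)
        return K.mean(bce * sample_weights, axis=-1)
    return loss

model.compile(optimizer='adam', loss=weighted_binary_crossentropy(3.0), metrics=['accuracy'])  # 3.0 is an arbitrary example weight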
I'm currently implementing a siamese net in Keras, in which I have to use the following loss function:
loss(p ∥ q) = I_s · KL(p ∥ q) + I_ds · HL(p ∥ q)
[figure: detailed description of the loss function from the paper]
where KL is the Kullback-Leibler divergence, HL is the hinge loss, and I_s / I_ds indicate same-speaker and different-speaker pairs.
During training, I label same-speaker pairs as 1 and different speakers as 0.
The goal is to use the trained net to extract embeddings from spectrograms.
A spectrogram is a 2-dimensional numpy array of shape 40x128 (time x frequency).
The problem is that I never get over 0.5 accuracy, and when clustering the speaker embeddings, the results show there seems to be no correlation between the embeddings and the speakers.
I implemented the KL divergence as the distance measure and adjusted the hinge loss accordingly:
def kullback_leibler_divergence(vects):
    x, y = vects
    x = ks.backend.clip(x, ks.backend.epsilon(), 1)
    y = ks.backend.clip(y, ks.backend.epsilon(), 1)
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)

def kullback_leibler_shape(shapes):
    shape1, shape2 = shapes
    return shape1[0], 1

def kb_hinge_loss(y_true, y_pred):
    """
    y_true: binary label, 1 = same speaker
    y_pred: output of the siamese net, i.e. the Kullback-Leibler divergence
    """
    MARGIN = 1.
    hinge = ks.backend.mean(ks.backend.maximum(MARGIN - y_pred, 0.), axis=-1)
    return y_true * y_pred + (1 - y_true) * hinge
A single spectrogram is fed into a branch of the base network; the siamese net consists of two such branches, so two spectrograms are fed in simultaneously and joined in the distance layer. The output of the base network is 1 x 128. The distance layer computes the Kullback-Leibler divergence, and its output is fed into kb_hinge_loss. The architecture of the base network is as follows:
def create_lstm(units: int, gpu: bool, name: str, is_sequence: bool = True):
    if gpu:
        return ks.layers.CuDNNLSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
    else:
        return ks.layers.LSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)

def build_model(mode: str = 'train') -> ks.Model:
    topology = TRAIN_CONF['topology']
    is_gpu = tf.test.is_gpu_available(cuda_only=True)
    model = ks.Sequential(name='base_network')
    model.add(
        ks.layers.Bidirectional(create_lstm(topology['blstm1_units'], is_gpu, name='blstm_1'), input_shape=INPUT_DIMS))
    model.add(ks.layers.Dropout(topology['dropout1']))
    model.add(ks.layers.Bidirectional(create_lstm(topology['blstm2_units'], is_gpu, is_sequence=False, name='blstm_2')))
    if mode == 'extraction':
        return model
    num_units = topology['dense1_units']
    model.add(ks.layers.Dense(num_units, name='dense_1'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    model.add(ks.layers.Dropout(topology['dropout2']))
    num_units = topology['dense2_units']
    model.add(ks.layers.Dense(num_units, name='dense_2'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense3_units']
    model.add(ks.layers.Dense(num_units, name='dense_3'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense4_units']
    model.add(ks.layers.Dense(num_units, name='dense_4'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    return model
I then build a siamese net as follows:
base_network = build_model()
input_a = ks.Input(shape=INPUT_DIMS, name='input_a')
input_b = ks.Input(shape=INPUT_DIMS, name='input_b')
processed_a = base_network(input_a)
processed_b = base_network(input_b)
distance = ks.layers.Lambda(kullback_leibler_divergence,
                            output_shape=kullback_leibler_shape,
                            name='distance')([processed_a, processed_b])
model = ks.Model(inputs=[input_a, input_b], outputs=distance)
adam = build_optimizer()
model.compile(loss=kb_hinge_loss, optimizer=adam, metrics=['accuracy'])
Lastly, I build a net with the same architecture but only one input, try to extract embeddings, and then take their mean, where an embedding should serve as a representation of a speaker to be used during clustering:
utterance_embedding = np.mean(embedding_extractor.predict_on_batch(spectrogram), axis=0)
We train the net on the VoxCeleb speaker set.
The full code can be seen here: GitHub repo
I'm trying to figure out if I have made any wrong assumptions and how to improve my accuracy.
Issue with accuracy
Notice that in your model:
y_true = labels
y_pred = Kullback-Leibler divergence
These two cannot be compared. See this example: for correct results, when y_true == 1 (same speaker), the Kullback-Leibler divergence should be y_pred == 0 (no divergence).
So it's totally expected that the accuracy metric will not work properly.
Then either you create a custom metric, or you rely only on the loss for evaluation.
This custom metric would need a few adjustments to be feasible, as explained below.
Possible issues with the loss
Clipping
This might be a problem
First, notice that you're clipping the values that go into the Kullback-Leibler term. This may be bad, because clipping loses the gradients in the clipped regions. And since your activation is a PReLU, you have values lower than zero and bigger than 1, so there are certainly zero-gradient cases here and there, with the risk of a frozen model.
So, you might not want to clip these values. To avoid having negative values from the PReLU, you can try a 'softplus' activation, which is kind of a soft ReLU without negative values. You might also add an epsilon to avoid trouble, while there is no problem in leaving values bigger than one:
# assuming you used 'softplus' instead of 'PReLU' for the speaker embeddings
def kullback_leibler_divergence(speakers):
    x, y = speakers
    x = x + ks.backend.epsilon()
    y = y + ks.backend.epsilon()
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
Asymmetry in Kullback-Leibler
This IS a problem
Notice also that this Kullback-Leibler expression is not a symmetric function and does not have its minimum at zero!! The perfect match is zero, but bad matches can have lower (negative) values, and this is bad for a loss function, because it will drive the optimizer toward divergence.
See this picture showing the graph of the KL term.
Your paper states that you should sum the two directions: KL(p ∥ q) and KL(q ∥ p).
This eliminates both the asymmetry and the negative values.
So:
distance1 = ks.layers.Lambda(kullback_leibler_divergence,
                             name='distance1')([processed_a, processed_b])
distance2 = ks.layers.Lambda(kullback_leibler_divergence,
                             name='distance2')([processed_b, processed_a])
distance = ks.layers.Add(name='dist_add')([distance1, distance2])
Very low margin and clipped hinge
This might be a problem
Finally, see that the hinge loss also clips values below zero!
Since the Kullback-Leibler term is not limited to 1, samples with high divergence may not be controlled by this loss. I'm not sure whether this is really an issue, but you might want to either:
increase the margin
inside the Kullback-Leibler, use mean instead of sum
use a softplus in hinge instead of a max, to avoid losing gradients.
See:
MARGIN = someValue
hinge = ks.backend.mean(ks.backend.softplus(MARGIN - y_pred), axis=-1)
Now we can think about a custom accuracy.
This is not very easy, since we don't have clear limits on the KL value that tell us "correct / not correct".
You might pick a threshold arbitrarily, but you'd need to tune it until you find something that represents reality well. You may, for instance, use your validation data to find the threshold that gives the best accuracy.
def customMetric(y_true_targets, y_pred_KBL):
    # threshold: divergences below it are predicted as "same speaker"
    isMatch = ks.backend.less(y_pred_KBL, threshold)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    isMatch = ks.backend.equal(y_true_targets, isMatch)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    return ks.backend.mean(isMatch)
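Since threshold is a free variable in the metric above, one way to bind it is a small factory; a sketch, where the 0.1 value is purely illustrative and should be tuned on validation data:
def make_kbl_accuracy(threshold):
    # binds the tuned threshold into a Keras-compatible metric
    def kbl_accuracy(y_true, y_pred):
        is_match = ks.backend.cast(ks.backend.less(y_pred, threshold), ks.backend.floatx())
        correct = ks.backend.cast(ks.backend.equal(y_true, is_match), ks.backend.floatx())
        return ks.backend.mean(correct)
    return kbl_accuracy

model.compile(loss=kb_hinge_loss, optimizer=adam, metrics=[make_kbl_accuracy(0.1)])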
I am pretty new to neural networks. I am training a network in TensorFlow, but the number of positive examples is much, much smaller than the number of negative examples in my dataset (it is a medical dataset).
So, I know that the F-score, calculated from precision and recall, is a good measure of how well the model is trained.
I have used error functions like cross-entropy loss or MSE before, but they are all based on accuracy calculation (if I am not wrong). But how do I use this F-score as an error function? Is there a TensorFlow function for that, or do I have to create a new one?
Thanks in advance.
It appears that approaches for optimising directly for these types of metrics have been devised and used successfully, improving scores and/or training times:
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77289
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/70328
https://www.kaggle.com/rejpalcz/best-loss-function-for-f1-score-metric
One such method involves using sums of probabilities, in place of counts, for the sets of true positives, false positives, and false negatives. For example, F-beta loss (the generalisation of F1) can be calculated with PyTorch as follows:
def forward(self, y_logits, y_true):
    y_pred = self.sigmoid(y_logits)
    # "soft" counts: sums of probabilities stand in for hard counts
    TP = (y_pred * y_true).sum(dim=1)
    FP = (y_pred * (1 - y_true)).sum(dim=1)  # predicted positive, actually negative
    FN = ((1 - y_pred) * y_true).sum(dim=1)  # predicted negative, actually positive
    fbeta = (1 + self.beta**2) * TP / ((1 + self.beta**2) * TP + (self.beta**2) * FN + FP + self.epsilon)
    fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
    return 1 - fbeta.mean()
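For completeness, a minimal nn.Module that this forward method could live in; the class name and default hyperparameters here are my own assumptions, not from the linked kernels:
import torch.nn as nn

class FBetaLoss(nn.Module):
    def __init__(self, beta: float = 1.0, epsilon: float = 1e-7):
        super().__init__()
        self.beta = beta
        self.epsilon = epsilon
        self.sigmoid = nn.Sigmoid()

    # forward(self, y_logits, y_true) as defined above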
An alternative method is described in this paper:
https://arxiv.org/abs/1608.04802
The approach taken optimises a lower bound on the statistic. Other metrics, such as AUROC and AUCPR, are also discussed. A TensorFlow implementation of this approach can be found here:
https://github.com/tensorflow/models/tree/master/research/global_objectives
I think you are confusing model evaluation metrics for classification with training losses.
Accuracy, precision, F-scores etc. are evaluation metrics computed from binary outcomes and binary predictions.
For model training, you need a function that compares a continuous score (your model's output) with a binary outcome, such as cross-entropy. Ideally, this function is calibrated so that it is minimised when the predicted mean matches the population mean (given covariates). Such functions are called proper scoring rules, and cross-entropy is one of them.
Also check the thread is-accuracy-an-improper-scoring-rule-in-a-binary-classification-setting
If you want to weigh positive and negative cases differently, two methods are:
oversample the minority class and correct the predicted probabilities when predicting on new examples (see the sketch at the end of this answer). For fancier methods, check the under-sampling module of imbalanced-learn to get an overview.
use a different proper scoring rule as the training loss. This allows you to, e.g., build in asymmetry in how you treat positive and negative cases while preserving calibration. Here is a review of the subject.
I recommend just using simple oversampling in practice.
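For the first method, the probability correction can be done on the logit scale, shifting by the difference between the true and the oversampled log-odds of the positive class. A sketch, assuming the model is reasonably calibrated on the oversampled data; pi_true and pi_train are the positive-class rates before and after oversampling:
import numpy as np

def correct_probs(p_oversampled, pi_true, pi_train):
    # shift the logit by (true log-odds) - (training log-odds)
    logit = np.log(p_oversampled / (1 - p_oversampled))
    logit += np.log(pi_true / (1 - pi_true)) - np.log(pi_train / (1 - pi_train))
    return 1 / (1 + np.exp(-logit))

# example: positives oversampled from 5% of the data up to 50%
p_adjusted = correct_probs(p_oversampled=0.8, pi_true=0.05, pi_train=0.5)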
Loss value and accuracy are different concepts. The loss value is used to train the NN, whereas accuracy and other metrics are used to evaluate the training result.