Why not use mean squared error for classification problems? - python

I am trying to solve a simple binary classification problem using an LSTM and am trying to figure out the correct loss function for the network. The issue is that when I use binary cross-entropy as the loss function, the loss values for training and testing are relatively high compared to using the mean squared error (MSE) function.
Upon research, I came across justifications that binary cross-entropy should be used for classification problems and MSE for regression problems. However, in my case, I am getting better accuracy and a lower loss value with MSE for binary classification.
I am not sure how to justify these results. Why not use mean squared error for classification problems?

I would like to show it using an example.
Assume a 6 class classification problem.
Assume,
True probabilities = [1, 0, 0, 0, 0, 0]
Case 1:
Predicted probabilities = [0.2, 0.16, 0.16, 0.16, 0.16, 0.16]
Case 2:
Predicted probabilities = [0.4, 0.5, 0.1, 0, 0, 0]
The MSE in Case 1 and Case 2 is 0.128 and 0.1033, respectively.
Although Case 1 correctly predicts class 1 for the instance (while Case 2 predicts class 2), the loss in Case 1 is higher than the loss in Case 2.
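A quick numeric check of these two cases (a minimal NumPy sketch; the arrays simply restate the example above):
import numpy as np

y_true = np.array([1, 0, 0, 0, 0, 0])
case1 = np.array([0.2, 0.16, 0.16, 0.16, 0.16, 0.16])
case2 = np.array([0.4, 0.5, 0.1, 0.0, 0.0, 0.0])

print(np.mean((y_true - case1) ** 2))      # ~0.128
print(np.mean((y_true - case2) ** 2))      # ~0.1033
print(np.argmax(case1), np.argmax(case2))  # index 0 (the true class) vs index 1 (wrong)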

The answer is right there in your question: for the same mistake, the value of the binary cross-entropy loss is much higher than the value of the MSE loss.
Case 1 (large error):
Let's say your model predicted 1e-7 and the actual label is 1.
Binary cross-entropy loss will be -log(1e-7) = 16.11.
Squared error will be (1 - 1e-7)^2 ≈ 1.0.
Case 2 (small error):
Let's say your model predicted 0.94 and the actual label is 1.
Binary cross-entropy loss will be -log(0.94) = 0.06.
Squared error will be (1 - 0.94)^2 = 0.0036.
In Case 1, when the prediction is far off from reality, the BCE loss has a much larger value than the squared error. A large loss value produces large gradients, so the optimizer takes a larger step in the direction opposite to the gradient, which results in a relatively larger reduction in loss.
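To make the two cases concrete, here is a small NumPy sketch reproducing the numbers above (the natural log is assumed for the cross-entropy):
import numpy as np

for p in (1e-7, 0.94):          # predicted probability; the true label is 1
    bce = -np.log(p)            # binary cross-entropy for a positive example
    se = (1 - p) ** 2           # squared error
    print(f"p={p}: BCE={bce:.4f}, squared error={se:.4f}")
# p=1e-07: BCE=16.1181, squared error=1.0000 (approximately)
# p=0.94:  BCE=0.0619,  squared error=0.0036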

Though #nerd21 gives a good example of why MSE is a bad loss function for 6-class classification, the same argument does not hold for binary classification.
Consider just binary classification. The label is [1, 0]; one prediction is h1 = [p, 1-p] and another is h2 = [q, 1-q], so their MSEs (up to a constant factor) are:
L1 = 2*(1-p)^2, L2 = 2*(1-q)^2
Assume h1 is a misclassification, i.e. p < 1-p, thus 0 < p < 0.5.
Assume h2 is a correct classification, i.e. q > 1-q, thus 0.5 < q < 1.
Then L1 - L2 = 2*(p-q)*(p+q-2) > 0 is guaranteed, because:
p < q, so p - q < 0;
p + q < 0.5 + 1 = 1.5, thus p + q - 2 < -0.5 < 0;
thus L1 - L2 > 0, i.e. L1 > L2.
This means that for binary classification with MSE as the loss function, a misclassification always incurs a larger loss than a correct classification.
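A short sketch that spot-checks this inequality over a grid of p and q values (my own verification, not part of the original argument):
import numpy as np

ps = np.linspace(0.01, 0.49, 25)   # misclassified predictions: p < 0.5
qs = np.linspace(0.51, 0.99, 25)   # correct predictions: q > 0.5
P, Q = np.meshgrid(ps, qs)
L1 = 2 * (1 - P) ** 2              # loss of the misclassified prediction
L2 = 2 * (1 - Q) ** 2              # loss of the correct prediction
print(bool(np.all(L1 > L2)))       # True: misclassification always costs more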

I'd like to share my understanding of the MSE and binary cross-entropy functions.
In the case of classification, we take the argmax of the probabilities for each training instance.
Now, consider a binary classifier where the model predicts the probabilities as [0.49, 0.51]. In this case, the model will return 1 as the prediction.
Now, assume that the actual label is also 1.
In such a case, if the loss were computed on the hard (argmax) prediction, MSE would return 0, whereas binary cross-entropy, computed on the probability 0.51, still returns a "tangible" value.
And if the trained model predicts this kind of borderline probability for all data samples, binary cross-entropy effectively returns a big accumulated loss value, whereas MSE on the hard predictions returns 0.
According to MSE it is a perfect model but, actually, it is not that good a model, which is why we should not use MSE for classification.
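To illustrate the point with numbers (a hedged sketch; the 0.51 probability comes from the example above):
import numpy as np

p_positive = 0.51                          # predicted probability of the true class
hard_pred = 1 if p_positive > 0.5 else 0   # argmax-style decision, equal to the label

mse_on_hard = (1 - hard_pred) ** 2         # 0.0: looks "perfect"
bce_on_prob = -np.log(p_positive)          # ~0.673: still a tangible loss
print(mse_on_hard, round(bce_on_prob, 3))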

Related

How to interpret the probability predictions of a deep learning model that are the output of a sigmoid activation in the last layer?

I have trained a binary classification task (pos. vs. neg.) and have a .h5 model, and I have external data (which was never used in training nor in validation). There are 20 samples overall, belonging to both classes.
preds = model.predict(img)
y_classes = np.argmax(preds, axis=1)
The above code is supposed to calculate the probabilities (preds) and class labels (0 or 1) as if the model had been trained with softmax as the last output layer. But preds is only a single number in [0, 1] and y_classes is always 0.
To go back a little, the model was evaluated with mean AUC, the area being around 0.75.
I can see that the probabilities of those 20 samples mostly (17 of them) lie between 0 and 0.15; the rest are 0.74, 0.51 and 0.79.
How do I make a conclusion from this?
EDIT:
10 of the 20 test samples belong to the positive class and the other 10 belong to the negative class. All 10 that belong to the positive class have a very low probability (0 - 0.15). 7 out of 10 negative samples have the same low probability, with only 3 being higher (0.74, 0.51 and 0.79).
The question: why is the model predicting the samples with such low probability even though its AUC was quite high?
The sigmoid activation function is used to generate probabilities in binary classification problems. In this case, the model outputs an array of probabilities with shape equal to the number of images to predict. We can retrieve the predicted class simply by checking the probability score: if it's above 0.5 (this is a common practice, but you can also change it according to your needs) the image belongs to class 1, else it belongs to class 0.
preds = model.predict(img) # (n_images, 1)
y_classes = ((preds > 0.5) + 0).ravel() # (n_images,)
in case of sigmoid, your last output layer must be Dense(1, activation='sigmoid')
in the case of softmax (as you have just done), the predicted classes are retrieved using argmax
preds = model.predict(img) # (n_images, n_class)
y_classes = np.argmax(preds, axis=1) # (n_images,)
in case of softmax, your last output layer must be Dense(n_classes, activation='softmax')
WHY AUC IS NOT A GOOD METRIC
The value of AUC can be misleading and can cause us sometimes to overestimate and sometimes to underestimate the actual performance of a model. The behavior of average precision is more expressive in getting a sense of how the model is doing, because it is more sensitive in distinguishing between a good and a very good model. Moreover, it is directly linked to precision: an indicator which is human-understandable. Here is a great reference on the topic which explains all you need: https://towardsdatascience.com/why-you-should-stop-using-the-roc-curve-a46a9adc728
By using a sigmoid function as your activation function you are basically "compressing" the output of prior layers to a probability value from 0 to 1.
The softmax function, loosely speaking, takes a set of per-class scores, aggregates them, and expresses each class probability as the ratio between that class's score and the aggregate over all classes (strictly, softmax exponentiates the raw scores before normalizing, but the normalization step is the intuition that matters here).
For example: if I'm using a model to predict whether an image is an image of a banana, apple or grape, and my model scores a certain image as 0.75 banana, 0.20 apple and 0.15 grape (each score generated with a sigmoid function), the normalization step makes this calculation:
banana: 0.75 / (0.75 + 0.20 + 0.15) = 0.6818; apple: 0.20 / 1.10 = 0.1818; grape: 0.15 / 1.10 = 0.1364.
As we can see, this model will classify this specific picture as a picture of a banana thanks to the softmax layer. Yet, in order to make this classification, it first relied on a series of sigmoid scores.
So, to finally get to the point, I'd say that the interpretation of a sigmoid output should be similar to the one you'd make with a softmax layer, but while a softmax layer gives you the comparison between one class and another, a sigmoid function simply tells you how likely it is that this piece of information belongs to the positive class.
In order to make the final call and decide if a certain item does or doesn't belong to the positive class, you need to pick a threshold (not necessarily 0.5). Picking a threshold is the final step of your output interpretation. If you'd like to max the precision of your model, you will pick a high threshold, but if you'd like to max the recall of your model you can definitely pick a lower threshold.
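If it helps, here is a minimal sketch of sweeping thresholds with scikit-learn to see the precision/recall trade-off (the toy scores and labels are made up for illustration):
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.74, 0.51, 0.9])  # sigmoid outputs

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
A high threshold tends to raise precision at the cost of recall, and a low threshold does the opposite, which is exactly the trade-off described above.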
I hope it answers your question, let me know if you'd like me to elaborate on anything as this answer is quite general.

Meaning of Loss function in Keras?

I made a neural network with Keras in Python and cannot really understand what the loss function means.
So here, first, some general information:
I worked with the poker hand dataset with classes 0-9, which I wrote as vectors with one-hot encoding. I used the softmax activation in the last layer, so my output tells me, for each of the 10 entries in a vector, the probability that the sample belongs to a certain class. For example:
my real label is (0,1,0,0,0,0,0,0,0,0), which means class 1 (0-9 ranges from no card to royal flush), and class 1 means one pair (if you know poker).
With the neural net, I get at the end outputs like (0.4, 0.2, 0.1, 0.1, 0.2, 0, 0, 0, 0, 0), which means that my sample belongs with 40 percent probability to class 0, with 20 percent to class 1, and so on.
Alright! I also used binary cross-entropy as the loss, the accuracy metric and the RMSprop optimizer.
When I use model.evaluate() from Keras, I get something like 0.16 for the loss and I do not know how to interpret this.
Does this mean that, on average, my predictions deviate by 0.16 from the truth? So if my prediction for class 0 is 0.5, could it also be 0.66 or 0.34?
Or how else can I interpret it?
Please send help!
First of all, according to your problem definition you have a multi-class problem. Thus, you should use categorical_crossentropy. binary_crossentropy is for two-class problems or for multi-label classification.
But the value of the loss function is mostly meaningful in relative terms. To interpret it, you first have to understand what the cross-entropy means. The formula is:
-sum_{c=1}^{M} y_{o,c} * log(p_{o,c})
where
M is the number of classes,
y_{o,c} is the binary indicator (0 or 1) of whether class label c is the correct classification for observation o, and
p_{o,c} is the predicted probability that observation o is of class c.
For binary cross-entropy, M is equal to 2. For categorical cross-entropy, M > 2.
Therefore, the cross-entropy decreases as the predicted probability converges to the actual label (the -log(p) curve drops towards 0 as p approaches 1).
Now let's take your example, where you have 10 classes and your real label is (0,1,0,0,0,0,0,0,0,0).
If you have a loss of 0.16, it means that -log(p) = 0.16, i.e. p = exp(-0.16) ≈ 0.85, which means that your model has assigned 0.85 to the correct label.
Therefore, the loss function gives you the negative log of the correct classification probability. Since in Keras the loss is computed over whole batches, it is the average of the negative log of the correct classification probability over the data in the specific batch. If you use the evaluate function, it is the average over the whole dataset you are evaluating.
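As a sanity check of this interpretation, a small NumPy sketch (the example prediction is made up so that the loss comes out near the reported 0.16):
import numpy as np

y_true = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])   # one-hot label for class 1
y_pred = np.array([0.05, 0.85, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01, 0.005, 0.005])

loss = -np.sum(y_true * np.log(y_pred))   # categorical cross-entropy for one sample
print(round(loss, 3))                     # ~0.163
print(round(np.exp(-loss), 3))            # ~0.85, the probability assigned to the true class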

How to use F-score as error function to train neural networks?

I am pretty new to neural networks. I am training a network in TensorFlow, but the number of positive examples is much, much smaller than the number of negative examples in my dataset (it is a medical dataset).
So, I know that the F-score, calculated from precision and recall, is a good measure of how well the model is trained.
I have used error functions like cross-entropy loss or MSE before, but they are all based on accuracy calculation (if I am not wrong). But how do I use this F-score as an error function? Is there a TensorFlow function for that, or do I have to create a new one?
Thanks in advance.
It appears that approaches for optimising directly for these types of metrics have been devised and used successfully, improving scoring and/or training times:
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77289
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/70328
https://www.kaggle.com/rejpalcz/best-loss-function-for-f1-score-metric
One such method involves using sums of predicted probabilities, in place of hard counts, for the sets of true positives, false positives, and false negatives. For example, an F-beta loss (the generalisation of F1) can be calculated with PyTorch as follows:
def forward(self, y_logits, y_true):
    # turn raw logits into probabilities
    y_pred = self.sigmoid(y_logits)
    # "soft" counts: sums of probabilities instead of hard 0/1 counts
    TP = (y_pred * y_true).sum(dim=1)
    FP = (y_pred * (1 - y_true)).sum(dim=1)
    FN = ((1 - y_pred) * y_true).sum(dim=1)
    fbeta = (1 + self.beta**2) * TP / ((1 + self.beta**2) * TP + (self.beta**2) * FN + FP + self.epsilon)
    fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
    return 1 - fbeta.mean()
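For context, here is a minimal sketch of the module this forward method could live in, plus a usage example (the class name and the default beta and epsilon values are my own assumptions, not from the linked kernels):
import torch
import torch.nn as nn

class FBetaLoss(nn.Module):
    # Hypothetical wrapper around the forward pass above.
    def __init__(self, beta=1.0, epsilon=1e-7):
        super().__init__()
        self.beta = beta
        self.epsilon = epsilon
        self.sigmoid = nn.Sigmoid()

    def forward(self, y_logits, y_true):
        y_pred = self.sigmoid(y_logits)
        TP = (y_pred * y_true).sum(dim=1)
        FP = (y_pred * (1 - y_true)).sum(dim=1)
        FN = ((1 - y_pred) * y_true).sum(dim=1)
        fbeta = (1 + self.beta**2) * TP / ((1 + self.beta**2) * TP + (self.beta**2) * FN + FP + self.epsilon)
        fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
        return 1 - fbeta.mean()

# usage sketch on random multi-label data
criterion = FBetaLoss(beta=1.0)
logits = torch.randn(8, 5, requires_grad=True)   # batch of 8 examples, 5 labels
targets = torch.randint(0, 2, (8, 5)).float()    # random 0/1 targets
loss = criterion(logits, targets)
loss.backward()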
An alternative method is described in this paper:
https://arxiv.org/abs/1608.04802
The approach taken optimises for a lower bound on the statistic. Other metrics such as AUROC and AUCPR are also discussed. An implementation in TF of such an approach can be found here:
https://github.com/tensorflow/models/tree/master/research/global_objectives
I think you are confusing model evaluation metrics for classification with training losses.
Accuracy, precision, F-scores etc. are evaluation metrics computed from binary outcomes and binary predictions.
For model training, you need a function that compares a continuous score (your model output) with a binary outcome - like cross-entropy. Ideally, this is calibrated such that it is minimised if the predicted mean matches the population mean (given covariates). These rules are called proper scoring rules, and the cross-entropy is one of them.
Also check the thread is-accuracy-an-improper-scoring-rule-in-a-binary-classification-setting
If you want to weigh positive and negative cases differently, two methods are:
oversample the minority class and correct the predicted probabilities when predicting on new examples. For fancier methods, check the under-sampling module of imbalanced-learn to get an overview.
use a different proper scoring rule as the training loss. This allows you to, e.g., build asymmetry into how you treat positive and negative cases while preserving calibration (see the sketch below). Here is a review of the subject.
In practice, I recommend just using simple oversampling.
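As one simple way of treating positive and negative cases differently (my own sketch, not taken from the answer above; note that reweighting like this does shift the probabilities the model converges to), TensorFlow provides a weighted cross-entropy:
import tensorflow as tf

# toy labels and raw model outputs (logits)
labels = tf.constant([[1.0], [0.0], [0.0], [1.0]])
logits = tf.constant([[0.2], [-1.5], [0.3], [2.0]])

# pos_weight > 1 penalises missed positives more heavily;
# the value 5.0 is an arbitrary illustration, not a recommendation
loss = tf.nn.weighted_cross_entropy_with_logits(
    labels=labels, logits=logits, pos_weight=5.0)
mean_loss = tf.reduce_mean(loss)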
The loss value and accuracy are different concepts. The loss value is used for training the NN, whereas accuracy and other metrics are used to evaluate the result of training.

Machine Learning: Does computing the accuracy score for binary labels always result in a low accuracy score?

If I have 2 labels (1 and 0), and after I pass my logits through a softmax activation layer, I get something like:
[[0.1, 0.9],
[0.3, 0.7],
[0.333, 0.667]]
as a predictions output, and my labels are only 1 or 0, does this always result in a low accuracy? That is to say, if I have a lot more classes, will my softmax layer give me something close to either 1 or 0 for each of the classes, which would give me a higher accuracy score?
Further, if I want to use accuracy as my metric, is there a way to scale my probabilities to either 0 or 1? Can this be done by applying a mask in TensorFlow that outputs boolean values whenever a probability hits 0.5 or above?
After the softmax layer you have probabilities in the range 0..1,
so if you want to check accuracy against labels that are only 0 or 1, you have to convert the probabilities:
if pred > 0.5 then pred = 1
if pred <= 0.5 then pred = 0
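As a concrete sketch of that thresholding (variable names are illustrative; the probabilities are the ones from the question):
import numpy as np
import tensorflow as tf

probs = np.array([[0.1, 0.9], [0.3, 0.7], [0.333, 0.667]])   # softmax output

# take the probability of class 1 and threshold it at 0.5
hard_np = (probs[:, 1] >= 0.5).astype(int)        # array([1, 1, 1])
hard_tf = tf.cast(probs[:, 1] >= 0.5, tf.int32)   # the same as a TensorFlow op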

Text Classification for multiple label

I am doing text classification with a convolutional neural network. I used health documents (ICD-9-CM codes) for my project and I used the same model as dennybritz, but my data has 36 labels. I used one-hot encoding to encode my labels.
Here is my problem: when I run data that has one label per document, the accuracy is good, from 0.8 to 1. If I run data that has more than one label, the accuracy is significantly reduced.
For example: a document with a single label "782.0" is encoded as [0 0 1 0 ... 0],
while a document with multiple labels "782.0 V13.09 593.5" is encoded as [1 0 1 0 ... 1].
Could anyone suggest why this happens and how to improve it?
The label encoding seems correct. If you have multiple correct labels, [1 0 1 0 ... 1] looks totally fine. The loss function used in Denny's post is tf.nn.softmax_cross_entropy_with_logits, which is the loss function for a multi-class problem.
Computes softmax cross entropy between logits and labels.
Measures the probability error in discrete classification tasks in
which the classes are mutually exclusive (each entry is in exactly one class).
In a multi-label problem, you should use tf.nn.sigmoid_cross_entropy_with_logits:
Computes sigmoid cross entropy given logits.
Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.
The input to the loss function would be logits (WX) and targets (labels).
Fix the accuracy measure
In order to measure the accuracy correctly for a multi-label problem, the code below needs to be changed.
# Calculate Accuracy
with tf.name_scope("accuracy"):
    correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
The logic of correct_predictions above is incorrect when you can have multiple correct labels. For example, say num_classes=4 and labels 0 and 2 are correct, so your input_y=[1, 0, 1, 0]. tf.argmax(self.input_y, 1) would then need to break the tie between index 0 and index 2. I am not sure how tf.argmax breaks ties, but if it does so by choosing the smaller index, a prediction of label 2 is always considered wrong, which definitely hurts your accuracy measure.
Actually, in a multi-label problem, precision and recall are better metrics than accuracy. You can also consider using precision@k (tf.nn.in_top_k) to report classifier performance.
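For completeness, a hedged sketch of one way to measure element-wise accuracy in the multi-label setting (the function name and the 0.5 threshold are my own choices, not from Denny's code; logits are assumed to be the raw scores before the sigmoid):
import tensorflow as tf

def multilabel_accuracy(logits, labels, threshold=0.5):
    """Element-wise accuracy for a 0/1 multi-label target matrix."""
    probs = tf.sigmoid(logits)                         # per-label probabilities
    preds = tf.cast(probs >= threshold, labels.dtype)  # hard 0/1 predictions
    correct = tf.cast(tf.equal(preds, labels), tf.float32)
    return tf.reduce_mean(correct)

# usage sketch with the example from above: labels 0 and 2 are correct
labels = tf.constant([[1., 0., 1., 0.]])
logits = tf.constant([[2.0, -1.0, 0.5, -2.0]])
print(multilabel_accuracy(logits, labels))  # 1.0: all four entries match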
