[Image: confusion matrix]
I have an issue where I'm trying to compute the test accuracy for a naive classifier that always predicts ŷ = −1.
I have already calculated the test accuracy of the classifier based on the confusion matrix attached above by using (TN + TP)/n. But how do I calculate the naive value?
accuracy = (109112+3805)/127933
naive_accuracy = # TODO: Compute the accuracy of the naive classifier
It is actually the same formula. You should just notice that your naive classifier never gives positive answers, so TP = 0. TN will be equal to the total number of negatives: TN = 123324.
So naive_accuracy = (TN + TP)/n = (123324 + 0)/127933.
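Plugging in the counts given above, that completes the TODO:

naive_accuracy = (123324 + 0) / 127933   # ~0.964, versus ~0.883 for the classifier above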
And yes, this is a case where the naive classifier actually shows better accuracy than the one computed from the confusion matrix you are referring to. This is due to the class imbalance problem: there are roughly 30 times more negative examples than positive ones. This is why accuracy is not a useful metric in this setting. Check out the precision, recall and F-score metrics if you need a meaningful result.
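For example, here is a minimal sketch computing those metrics directly from the counts above; note that the FP/FN split is derived from the stated totals rather than given explicitly, so treat it as an assumption:

TP, TN, n = 3805, 109112, 127933
negatives = 123324                 # all actual negatives, as stated above
positives = n - negatives          # 4609 actual positives
FN = positives - TP                # 804 positives that were missed
FP = negatives - TN                # 14212 negatives flagged as positive

precision = TP / (TP + FP)         # ~0.211
recall = TP / (TP + FN)            # ~0.826
f1 = 2 * precision * recall / (precision + recall)   # ~0.336
print(precision, recall, f1)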
Related
I'm trying to clarify something about accuracy in Python. I have 3 classes of cancer and I'm trying to predict samples (patients) by their condition. I have followed the method proposed by another user on Stack Overflow:
True Positive Rate and False Positive Rate (TPR, FPR) for Multi-Class Data in python
Now I have done exactly the same (only the sensitivity, specificity and accuracy parts were needed):
import numpy as np
from sklearn.metrics import confusion_matrix

cnf_matrix = confusion_matrix(y_test, pred_y)
FP = cnf_matrix.sum(axis=0) - np.diag(cnf_matrix)
FN = cnf_matrix.sum(axis=1) - np.diag(cnf_matrix)
TP = np.diag(cnf_matrix)
TN = cnf_matrix.sum() - (FP + FN + TP)
FP = FP.astype(float)
FN = FN.astype(float)
TP = TP.astype(float)
TN = TN.astype(float)
# Sensitivity, hit rate, recall, or true positive rate
Sensitivity = TP/(TP+FN)
# Specificity or true negative rate
Specificity = TN/(TN+FP)
# Overall accuracy (even if I don't think it is overall)
ACC = (TP+TN)/(TP+FP+FN+TN)
And as a result I get 3 lists (sensitivity, specificity and accuracy), but each of these lists contains 3 values (I guess one per class).
Sensitivity : [0.76999182 0.99404079 0.96377484]
Specificity : [0.98132687 0.97199254 0.9036957 ]
ACC : [0.91487179 0.97717949 0.92794872]
But in that post the author spoke about "overall accuracy", whereas I get the individual accuracy for each class (not bad, though). In fact, when I use accuracy_score from scikit-learn the final accuracy is different:
accuracy = accuracy_score(y_test,pred_y)
accuracy: 0.9099999999999991
I assume that using that technique I get an accuracy for each class, so I can compute the mean accuracy (which in this case is 0.9399999999999992), while scikit-learn gives me the overall accuracy? I think it is important to know which is which, because sometimes the difference is about 20%, which is a lot.
The accuracy returned by sklearn.metrics.accuracy_score is
(number of correctly predicted samples) / (total number of samples)
i.e., accuracy.
What you're computing there is not the accuracy for the entire dataset; it is the accuracy of the one-vs-rest binary classification problem for each label, which you'll see listed here as accuracy for binary classification.
I haven't really ever seen that metric used, generally you'd pay attention to precision, recall, F1 score, and actual accuracy. Even if you wanted to use it, you should be careful when computing the mean: often there is a class imbalance in your data, so you might want to use a weighted mean.
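To see the difference concretely, here is a small hedged sketch (the labels below are made up) comparing the overall accuracy with the per-class one-vs-rest accuracies computed the same way as above:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 0, 1, 2, 2])

cm = confusion_matrix(y_true, y_pred)
overall = np.trace(cm) / cm.sum()            # fraction of all samples predicted correctly

per_class_binary = []
for k in range(cm.shape[0]):
    tp = cm[k, k]
    fp = cm[:, k].sum() - tp
    fn = cm[k, :].sum() - tp
    tn = cm.sum() - tp - fp - fn
    per_class_binary.append((tp + tn) / cm.sum())   # one-vs-rest accuracy for class k

print(overall, accuracy_score(y_true, y_pred))   # identical: 0.667, 0.667
print(per_class_binary)                          # [0.833, 0.667, 0.833]; mean is higher than overall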
I am trying to solve a simple binary classification problem using an LSTM. I am trying to figure out the correct loss function for the network. The issue is that when I use binary cross-entropy as the loss function, the loss values for training and testing are relatively high compared to using the mean squared error (MSE) function.
After some research, I came across justifications that binary cross-entropy should be used for classification problems and MSE for regression problems. However, in my case, I am getting better accuracy and a lower loss value with MSE for binary classification.
I am not sure how to justify these obtained results. Why not use mean squared error for classification problems?
I would like to show it using an example.
Assume a 6-class classification problem.
Assume,
True probabilities = [1, 0, 0, 0, 0, 0]
Case 1:
Predicted probabilities = [0.2, 0.16, 0.16, 0.16, 0.16, 0.16]
Case 2:
Predicted probabilities = [0.4, 0.5, 0.1, 0, 0, 0]
The MSE in Case 1 and Case 2 is 0.128 and 0.1033, respectively.
Although Case 1 correctly predicts class 1 for the instance, the loss in Case 1 is higher than the loss in Case 2.
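A quick NumPy check of those two MSE values (just reproducing the arithmetic above):

import numpy as np

y_true = np.array([1, 0, 0, 0, 0, 0])
case1 = np.array([0.2, 0.16, 0.16, 0.16, 0.16, 0.16])
case2 = np.array([0.4, 0.5, 0.1, 0.0, 0.0, 0.0])

print(np.mean((y_true - case1) ** 2))   # 0.128   -> argmax is correct, yet the loss is higher
print(np.mean((y_true - case2) ** 2))   # ~0.1033 -> argmax is wrong, yet the loss is lower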
The answer is right there in your question: the value of the binary cross-entropy loss is higher than the MSE loss when the prediction is far off.
Case 1 (Large Error):
Let's say your model predicted 1e-7 and the actual label is 1.
Binary cross-entropy loss will be -log(1e-7) = 16.11.
Mean squared error will be (1 - 1e-7)^2 ≈ 1.0.
Case 2 (Small Error):
Let's say your model predicted 0.94 and the actual label is 1.
Binary cross-entropy loss will be -log(0.94) = 0.06.
Mean squared error will be (1 - 0.94)^2 = 0.0036.
In Case 1, when the prediction is far off from reality, the BCE loss is much larger than the MSE loss. With a large loss value you get large gradients, so the optimizer takes a larger step in the direction opposite to the gradient, which results in a relatively larger reduction in loss.
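A quick check of these numbers with PyTorch's built-in losses (one-element tensors matching the two cases above):

import torch
import torch.nn as nn

target = torch.tensor([1.0])
far_off = torch.tensor([1e-7])   # large error
close = torch.tensor([0.94])     # small error

bce, mse = nn.BCELoss(), nn.MSELoss()
print(bce(far_off, target).item(), mse(far_off, target).item())  # ~16.12 vs ~1.0
print(bce(close, target).item(), mse(close, target).item())      # ~0.062 vs ~0.0036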
Though #nerd21 gives a good example of why MSE as a loss function is bad for 6-class classification, the same does not hold for binary classification.
Let's just consider binary classification. The label is [1, 0]; one prediction is h1 = [p, 1-p], another prediction is h2 = [q, 1-q], so their MSEs (here, sums of squared errors) are:
L1 = 2*(1-p)^2, L2 = 2*(1-q)^2
Assume h1 is a misclassification, i.e. p < 1-p, thus 0 < p < 0.5.
Assume h2 is a correct classification, i.e. q > 1-q, thus 0.5 < q < 1.
Then L1 - L2 = 2(p-q)(p+q-2) > 0 is guaranteed:
p < q, so p - q < 0;
p + q < 0.5 + 1 = 1.5, thus p + q - 2 < -0.5 < 0;
thus L1 - L2 > 0, i.e. L1 > L2.
This means that for binary classification with MSE as the loss function, a misclassification will always have a larger loss than a correct classification.
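A quick numeric spot check of this inequality (p and q below are arbitrary values satisfying the assumptions above):

p, q = 0.3, 0.8
L1 = 2 * (1 - p) ** 2   # loss of the misclassified prediction h1
L2 = 2 * (1 - q) ** 2   # loss of the correctly classified prediction h2
print(L1, L2, L1 > L2)  # 0.98 0.08 True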
I'd like to share my understanding of the MSE and binary cross-entropy functions.
In the case of classification, we take the argmax of the predicted probabilities for each instance.
Now, consider an example of a binary classifier where the model predicts the probabilities as [0.49, 0.51]. In this case, the model will return 1 as the prediction.
Now, assume that the actual label is also 1.
In such a case, if MSE is computed on that hard (argmax) prediction, it will return 0 as the loss value, whereas binary cross-entropy, computed on the predicted probability, will return some "tangible" value.
And if the trained model predicts similarly borderline probabilities for all data samples, then binary cross-entropy effectively returns a large accumulated loss value, whereas MSE returns 0.
According to MSE, it's a perfect model, but in reality it's not that good a model; that's why we should not use MSE for classification.
I am looking at precision, recall, and f-score using scikit-learn using:
import numpy as np
from sklearn.metrics import precision_score
Then:
y_true = np.array(["one", "two", "three"])
y_pred = np.array(["one", "two"])
precision = precision_score(y_true, y_pred, average=None)
print(precision)
The error returned is:
ValueError: Found input variables with inconsistent numbers of samples: [3, 2]
Since the input arrays have different lengths, why does scikit-learn require an equal number of samples? Particularly when evaluating recall, which I would have thought involves making more guesses than there are answers.
I could implement my own metrics or just truncate the arrays so they match. I want to be sure there is no underlying reason why I should not do that.
It really depends on what your y_true and y_pred mean in your case. But generally, y_true will be a vector indicating what the true value is supposed to be for every element of y_pred. I think this is not your case, and to use scikit-learn's metrics you would need to put them in that format.
So in the case of binary classification (with 0/1 labels), precision will be:
correct_classifications = (y_true == y_pred).astype(int)
precision = sum(y_pred * correct_classifications) / sum(y_pred)
Here you see that you need y_true and y_pred to be the same length.
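For instance, a minimal sketch with correctly paired arrays (the labels are made up):

import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])        # one prediction per true label
print(precision_score(y_true, y_pred))    # 2 of the 3 predicted positives are correct -> 0.666...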
That is quite simply because sklearn is playing it safe here.
It doesn't make sense not to produce predictions for 100% of the test set.
Let's say you have 1M data points in your dataset but you only predict 200k, are those the first 200k points? The last? Spread all over? How would the library know which matches which?
You have to have a 1:1 correspondence at the input of the metrics calculation. If you don't have predictions for some points, throw those points out (but make sure you know why you don't have those predictions in the first place, and that it isn't a problem with your pipeline); you don't want to claim 100% recall at 1% precision when, in the end, you only predicted for 10% of the dataset.
I am pretty new to neural networks. I am training a network in TensorFlow, but the number of positive examples is much, much smaller than the number of negative examples in my dataset (it is a medical dataset).
So, I know that the F-score calculated from precision and recall is a good measure of how well the model is trained.
I have used error functions like cross-entropy loss or MSE before, but they are all based on accuracy calculations (if I am not wrong). But how do I use this F-score as an error function? Is there a TensorFlow function for that, or do I have to create a new one?
Thanks in advance.
It appears that approaches for optimising directly for these types of metrics have been devised and used successfully, improving scores and/or training times:
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77289
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/70328
https://www.kaggle.com/rejpalcz/best-loss-function-for-f1-score-metric
One such method involves using sums of probabilities, in place of counts, for the sets of true positives, false positives and false negatives. For example, the F-beta loss (the generalisation of F1) can be calculated with Torch in Python as follows:
def forward(self, y_logits, y_true):
    y_pred = self.sigmoid(y_logits)
    # "Soft" counts: sums of predicted probabilities instead of hard 0/1 counts
    TP = (y_pred * y_true).sum(dim=1)
    FP = (y_pred * (1 - y_true)).sum(dim=1)    # predicted positive, actually negative
    FN = ((1 - y_pred) * y_true).sum(dim=1)    # predicted negative, actually positive
    fbeta = (1 + self.beta**2) * TP / ((1 + self.beta**2) * TP + (self.beta**2) * FN + FP + self.epsilon)
    fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
    return 1 - fbeta.mean()
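For context, here is a self-contained sketch of how that forward might sit inside a module; the class name FBetaLoss, the default beta and epsilon values, and the random data at the end are assumptions for illustration, not part of the original code:

import torch
import torch.nn as nn

class FBetaLoss(nn.Module):
    # Hypothetical wrapper around the forward pass shown above
    def __init__(self, beta=1.0, epsilon=1e-7):
        super().__init__()
        self.beta = beta
        self.epsilon = epsilon
        self.sigmoid = nn.Sigmoid()

    def forward(self, y_logits, y_true):
        y_pred = self.sigmoid(y_logits)
        TP = (y_pred * y_true).sum(dim=1)
        FP = (y_pred * (1 - y_true)).sum(dim=1)
        FN = ((1 - y_pred) * y_true).sum(dim=1)
        fbeta = (1 + self.beta**2) * TP / ((1 + self.beta**2) * TP + (self.beta**2) * FN + FP + self.epsilon)
        fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
        return 1 - fbeta.mean()

# Example on random multi-label data: 4 samples, 6 labels
criterion = FBetaLoss(beta=1.0)
logits = torch.randn(4, 6, requires_grad=True)
targets = torch.randint(0, 2, (4, 6)).float()
loss = criterion(logits, targets)
loss.backward()
print(loss.item())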
An alternative method is described in this paper:
https://arxiv.org/abs/1608.04802
The approach taken optimises for a lower bound on the statistic. Other metrics such as AUROC and AUCPR are also discussed. An implementation in TF of such an approach can be found here:
https://github.com/tensorflow/models/tree/master/research/global_objectives
I think you are confusing model evaluation metrics for classification with training losses.
Accuracy, precision, F-scores etc. are evaluation metrics computed from binary outcomes and binary predictions.
For model training, you need a function that compares a continuous score (your model output) with a binary outcome, such as cross-entropy. Ideally, this is calibrated such that it is minimised when the predicted mean matches the population mean (given covariates). Such functions are called proper scoring rules, and cross-entropy is one of them.
Also check the thread is-accuracy-an-improper-scoring-rule-in-a-binary-classification-setting
If you want to weight positive and negative cases differently, two methods are:
oversample the minority class and correct the predicted probabilities when predicting on new examples. For fancier methods, check the over- and under-sampling modules of imbalanced-learn to get an overview (see the sketch after this list).
use a different proper scoring rule as the training loss. This allows you to, e.g., build in asymmetry in how you treat positive and negative cases while preserving calibration. Here is a review of the subject.
I recommend just using simple oversampling in practice.
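As referenced above, a minimal sketch of simple oversampling with imbalanced-learn (the dataset is synthetic, only to show the shape of the API):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Synthetic, heavily imbalanced binary dataset
X, y = make_classification(n_samples=1000, weights=[0.97, 0.03], random_state=0)
print(Counter(y))

# Duplicate minority-class samples until the classes are balanced
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))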
The loss value and accuracy are different concepts. The loss value is used for training the NN, whereas accuracy and other metrics are used to evaluate the training result.
I'm using a linear regression model to predict weather data for one year. The prediction is done using Python's sklearn library. The problem is that I need to find the accuracy of the prediction. After a quick internet search I found that R^2 is the way to measure accuracy. I calculated the r value as follows:
r value
0.0919309031356
Coefficients:
[-20.01071429 0. ]
Residual sum of squares: 19331.78
Variance score: -0.23
The problem is that I need to show the accuracy as a percentage. How do I do that? Do I need use a tool to find out the accuracy?
Maybe this question is more complicated than I think, but why not just
r = str((r**2) * 100) + '%'
For regression problems, you can use the following metrics to determine the quality of the fit (http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics):
Mean squared error. Fit is good when the value is as low as possible
R^2 score. Fit is good when the value is 1 or close to it.
You can also calculate prediction error with:
(Actual value - Predicted value)/Actual value.
However, I am not sure if this is a common metric to evaluate a linear regression fit.
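A minimal sketch of computing these with scikit-learn (the arrays are made-up values, just to show the calls):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_predicted = np.array([2.8, 5.4, 7.0, 10.5])

print(mean_squared_error(y_actual, y_predicted))   # lower is better
print(r2_score(y_actual, y_predicted))             # closer to 1 is better
# Relative prediction error per point, as suggested above
print((y_actual - y_predicted) / y_actual)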