How do I calculate the average TPR, TNR, FPR, and FNR in the case of an imbalanced dataset?
For example, FPR: [3.54224720e-04 0.00000000e+00 1.59383505e-05 0.00000000e+00]
So, can I just sum the values of the 4 classes and divide by 4?
TPR: [3.54224720e-04 + 0.00000000e+00 + 1.59383505e-05 + 0.00000000e+00]/4 = 0.99966 ?
And is 3.54224720e-04 equal to 0.000354224720?
Thank you
import numpy as np

# matrix is the multi-class confusion matrix, e.g. from sklearn.metrics.confusion_matrix
FP = np.sum(matrix, axis=0) - np.diag(matrix)
FN = np.sum(matrix, axis=1) - np.diag(matrix)
TP = np.diag(matrix)
TN = np.sum(matrix) - (FP + FN + TP)
# True Positive rate
TPR = TP/(TP+FN)
print("TPR:", TPR)
# True Negative Rate
TNR = TN/(TN+FP)
print("TNR:", TNR)
# False Positive Rate
FPR = FP/(FP+TN)
print("FPR:", FPR)
# False Negative Rate
FNR = FN/(TP+FN)
print("FNR:", FNR)
# Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)
print("ACC :", ACC)
There are different ways of averaging the metrics. If you check packages such as sklearn, you will see that there are multiple averaging options you can pass: micro, macro, weighted, etc.
If you want to calculate them manually, one way (micro) is to take the TP, FN, FP, and TN values from your four per-class outputs, sum them up, and then calculate your metrics from the totals.
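For example, a minimal sketch of micro-averaging, assuming the per-class TP, FP, FN, and TN arrays computed as in the question's code:

import numpy as np

# Pool the per-class counts first, then compute each metric once on the totals
TP_total = np.sum(TP)
FP_total = np.sum(FP)
FN_total = np.sum(FN)
TN_total = np.sum(TN)

micro_TPR = TP_total / (TP_total + FN_total)
micro_FPR = FP_total / (FP_total + TN_total)
print("micro TPR:", micro_TPR, "micro FPR:", micro_FPR)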
So you should really understand your problem and see which averaging makes sense. In the case of imbalanced data, the weighted average is usually the better choice. Keep in mind that if you have any baseline calculation, you have to use exactly the same averaging method to get a fair comparison, since there can be huge differences between the different ways of averaging.
And yes, those two numbers are equal: 3.54224720e-04 is scientific notation for 0.000354224720.
Update:
As the documentation shows:
Weighted average: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
This question is also related.
In your case, for weighted metrics you calculate each metric for each of your 4 classes separately. Then, using the number of instances in each class (its support), you compute the weighted average: each per-class value is multiplied by the fraction of samples belonging to that class, and the results are summed. The original answer included a picture of the equation for weighted precision; a sketch of the same computation is shown below.
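A minimal sketch of the weighted average, using hypothetical per-class precision values and per-class supports (the numbers below are placeholders, not from the question):

import numpy as np

# Placeholder per-class precision values and supports (true instances per class)
precision_per_class = np.array([0.90, 0.75, 0.85, 0.60])
support = np.array([1000, 50, 200, 10])

# Weighted average: each class contributes proportionally to its support
weighted_precision = np.sum(precision_per_class * support / support.sum())
print("weighted precision:", weighted_precision)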
[confusion matrix image]
I have an issue where I'm trying to compute the test accuracy for a naive classifier that always predicts ŷ = -1.
I have already calculated the test accuracy of the classifier based on the confusion matrix attached above by using (TN + TP)/N. But how do I calculate the naive value?
accuracy = (109112+3805)/127933
naive_accuracy = # TODO: Compute the accuracy of the naive classifier
It is actually the same formula. You should just notice that your naive classifier never gives positive answers, so TP = 0. TN will be equal to the total number of negatives: TN = 123324.
So naive_accuracy = (TN + TP)/N = (123324 + 0)/127933.
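A quick check of the two numbers in Python, using the counts given above:

# Counts taken from the question
TP, TN, N = 3805, 109112, 127933     # the trained classifier's confusion matrix
naive_TN = 123324                    # total number of negatives

accuracy = (TN + TP) / N             # ~0.883
naive_accuracy = (naive_TN + 0) / N  # ~0.964, since TP = 0 for the naive classifier
print(accuracy, naive_accuracy)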
And yes, this is a case where the naive classifier actually shows better accuracy than the one given by the confusion matrix you are referring to. This is due to the data imbalance problem: there are roughly 30 times more negative examples than positive ones. This is why accuracy is not applicable in that setting. Please check out the precision, recall, and F-score metrics if you need a meaningful result.
I'm trying to clarify something about accuracy in Python. I have 3 classes of cancer and I'm trying to predict samples (patients) by their condition. I have followed this method, proposed in another answer on Stack Overflow:
True Positive Rate and False Positive Rate (TPR, FPR) for Multi-Class Data in python
Now I have done exactly the same (only the sensitivity, specificity, and accuracy parts were needed):
import numpy as np
from sklearn.metrics import confusion_matrix

cnf_matrix = confusion_matrix(y_test, pred_y)
FP = cnf_matrix.sum(axis=0) - np.diag(cnf_matrix)
FN = cnf_matrix.sum(axis=1) - np.diag(cnf_matrix)
TP = np.diag(cnf_matrix)
TN = cnf_matrix.sum() - (FP + FN + TP)
FP = FP.astype(float)
FN = FN.astype(float)
TP = TP.astype(float)
TN = TN.astype(float)
# Sensitivity, hit rate, recall, or true positive rate
Sensitivity = TP/(TP+FN)
# Specificity or true negative rate
Specificity = TN/(TN+FP)
# Overall accuracy (even if I don't think it is overall)
ACC = (TP+TN)/(TP+FP+FN+TN)
And as a result I get 3 lists (sensitivity, specificity and accuracy), but each of these lists contains 3 values (I guess one per class).
Sensitivity : [0.76999182 0.99404079 0.96377484]
Specificity : [0.98132687 0.97199254 0.9036957 ]
ACC : [0.91487179 0.97717949 0.92794872]
But the post talked about "overall accuracy", while I instead get an individual accuracy for each class (not bad, though). In fact, when I use accuracy_score from scikit-learn, the final accuracy is different:
accuracy = accuracy_score(y_test,pred_y)
accuracy: 0.9099999999999991
I assume that using that technique I get an accuracy for each class, so I can compute the mean accuracy (in this case 0.9399999999999992), while scikit-learn gives me the overall accuracy? I think it is important to know which is which, because sometimes the difference is about 20%, and that is a lot.
The accuracy returned from sklearn.metrics.accuracy_score is
(number of correctly predicted samples) / (total number of samples)
i.e., accuracy.
What you're computing there is not the accuracy for the entire dataset; it is the accuracy of the binary classification problem for each label, which you'll see listed here as accuracy for binary classification.
I haven't really ever seen that metric used; generally you'd pay attention to precision, recall, F1 score, and the actual accuracy. Even if you wanted to use it, you should be careful when computing the mean: often there is a class imbalance in your data, so you might want to use a weighted mean.
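To make the difference concrete, here is a minimal sketch, assuming a multi-class confusion matrix cnf_matrix like the one computed in the question:

import numpy as np

# Overall accuracy: correctly predicted samples / total samples
overall_acc = np.trace(cnf_matrix) / cnf_matrix.sum()

# Per-class binary (one-vs-rest) accuracy, as computed in the question
TP = np.diag(cnf_matrix)
FP = cnf_matrix.sum(axis=0) - TP
FN = cnf_matrix.sum(axis=1) - TP
TN = cnf_matrix.sum() - (TP + FP + FN)
per_class_acc = (TP + TN) / (TP + TN + FP + FN)

# Unweighted mean vs. support-weighted mean of the per-class accuracies
support = cnf_matrix.sum(axis=1)
print(overall_acc, per_class_acc.mean(), np.average(per_class_acc, weights=support))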
I am working on a binary segmentation problem using PyTorch. I want to know the correct way of calculating metrics like precision, recall, F1-score, mIoU, etc. for my test set. From the many code examples available online, I found different ways of calculating them, which I have listed below:
Method 1
Calculate metrics for individual images separately.
Compute the sum of each metric of all test images.
Divide the metrics by the number of test images.
Example:
Score = 0.0
for i, (x, y) in enumerate(zip(test_x, test_y)):
    ...
    # Score could be any of the metrics
    Score += calc_metrics(mask1, pred1)

Final_Score = Score / len(test_x)
Method 2
Add TP, FP, TN, and FN pixel count of all the images and prepare a confusion matrix.
Calculate all metrics at the end using the total TP, FP, TN, and FN pixel count and confusion matrix.
Example:
FP = FN = TP = TN = 0.0
for i, (x, y) in enumerate(zip(test_x, test_y)):
    ...
    FP += float(np.sum((pred1 == 1) & (mask1 == 0)))
    FN += float(np.sum((pred1 == 0) & (mask1 == 1)))
    TP += float(np.sum((pred1 == 1) & (mask1 == 1)))
    TN += float(np.sum((pred1 == 0) & (mask1 == 0)))

Score = calc_metrics(TP, FP, TN, FN)
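calc_metrics itself is not shown in the question. As an illustration only (this is my assumption of what such a helper might compute from the aggregated pixel counts in Method 2, not the original code):

def calc_metrics(TP, FP, TN, FN, eps=1e-7):
    # Standard pixel-wise metrics from aggregated counts; eps avoids division by zero
    precision = TP / (TP + FP + eps)
    recall = TP / (TP + FN + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = TP / (TP + FP + FN + eps)  # intersection over union of the foreground class
    accuracy = (TP + TN) / (TP + TN + FP + FN + eps)
    return precision, recall, f1, iou, accuracy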
Method 3
Calculate batch-wise metrics and divide their sum by the number of test images at the end; a sketch of this idea is shown below.
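No example was given for this method in the original post; a rough sketch of the idea, assuming a test_loader that yields one batch of masks and predictions per iteration (both names are hypothetical), might be:

# Hypothetical sketch of Method 3: accumulate one metric value per batch,
# then normalize by the number of test images at the end.
Score = 0.0
for masks, preds in test_loader:
    Score += calc_metrics(masks, preds)  # metric computed over the whole batch at once

Final_Score = Score / len(test_x)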
Method 4
Unlike all the methods above, which use 0.5 as the threshold on the predicted mask (after applying a sigmoid), this method uses a range of thresholds, computes each metric at every threshold, and takes the mean of these values at the end.
Example:
thresh_metric = np.zeros(len(threshold))
thresh_final = np.zeros((len(test_x), len(threshold)))
for i, (x, y) in enumerate(zip(test_x, test_y)):
    ...
    for t, thresh in enumerate(threshold):
        thresh_metric[t] = calc_metrics(mask1, pred1, thresh)
    thresh_final[i, :] = thresh_metric

Score = np.sum(np.mean(thresh_final, axis=1)) / len(test_x)
I am confused about which way to use to report my model's results on the test set.
I am new to deep learning and I want to be able to evaluate a model that has been trained for a certain number of epochs using the F1 score. I believe this requires calculating precision and recall first.
The model being trained is SSD-300-Tensorflow. Is there code or something that can produce these results? I am not using scikit-learn or anything, as I am not sure whether that is required to calculate the score, but any guidance is appreciated.
There is a file for evaluation in the folder tf_extended called metrics.py which has code for precision and recall. After training, I have the checkpoints in the logs folder. How can I calculate my metrics? I am using Google Colab due to hardware limitations (GPU issues).
If you are using the Tensorflow Object Detection API, it provides a way for running model evaluation that can be configured for different metrics. A tutorial on how to do this is here.
The COCO evaluation metrics includes analogous measures of precision and recall for object detection use cases. A good overview of these metrics is here. The concepts of precision and recall need to be adapted somewhat for object detection scenarios because you have to define "how closely" a predicted bounding box needs to match the ground truth bounding box to be considered a true positive.
I am not certain what the analog for F1 score would be for object detection scenarios. Typically, I have seen models compared using mAP as the single evaluation metric.
You should first calculate the false positives, false negatives, true positives, and true negatives. To obtain these values you must evaluate your model on the test dataset. This link might help.
With these formulas you can calculate precision and recall, and here is some example code:
import numpy as np

y_hat = []
y = []
threshold = 0.5

for data, label in test_dataset:
    y_hat.extend(model.predict(data))
    y.extend(label.numpy()[:, 1])

y_hat = np.asarray(y_hat)
y = np.asarray(y)
m = len(y)

# Binarize the predicted probability of the positive class
y_hat = np.asarray([1 if i > threshold else 0 for i in y_hat[:, 1]])

true_positive = np.logical_and(y, y_hat).sum()
true_negative = np.logical_and(np.logical_not(y_hat), np.logical_not(y)).sum()
false_positive = np.logical_and(np.logical_not(y), y_hat).sum()
false_negative = np.logical_and(np.logical_not(y_hat), y).sum()
total = true_positive + true_negative + false_negative + false_positive
assert total == m

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
accuracy = (true_positive + true_negative) / total
f1score = 2 * precision * recall / (precision + recall)
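As a sanity check, assuming scikit-learn is available, the same numbers can be cross-checked against its built-in metrics (using the y and y_hat arrays from the snippet above):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# These should match the manually computed values above
assert np.isclose(precision, precision_score(y, y_hat))
assert np.isclose(recall, recall_score(y, y_hat))
assert np.isclose(f1score, f1_score(y, y_hat))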
I'm just wondering if this is a legitimate way of calculating classification accuracy:
obtain precision recall thresholds
for each threshold binarize the continuous y_scores
calculate their accuracy from the contingency table (confusion matrix)
return the average accuracy for the thresholds
import numpy as np
from sklearn.metrics import precision_recall_curve, confusion_matrix
from sklearn.preprocessing import binarize

precision, recall, thresholds = precision_recall_curve(np.array(np_y_true), np.array(np_y_scores))

accuracy = 0
for threshold in thresholds:
    contingency_table = confusion_matrix(np_y_true, binarize(np.array(np_y_scores).reshape(1, -1), threshold=threshold)[0])
    accuracy += (float(contingency_table[0][0]) + float(contingency_table[1][1])) / float(np.sum(contingency_table))

print("Classification accuracy is: {}".format(accuracy / len(thresholds)))
You are heading in the right direction.
The confusion matrix definitely is the right start for computing the accuracy of your classifier. It seems to me that you are aiming at receiver operating characteristics.
In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
The AUC (area under the curve) is a measurement of your classifier's performance. More information and explanation can be found here:
https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it
http://mlwiki.org/index.php/ROC_Analysis
This is my implementation, which you are welcome to improve/comment:
import numpy as np
import matplotlib.pyplot as plt

def auc(y_true, y_val, plot=False):
    # check input
    if len(y_true) != len(y_val):
        raise ValueError('Label vector (y_true) and corresponding value vector (y_val) must have the same length.\n')

    # empty lists for true positive and false positive counts
    tp = []
    fp = []

    # count 1's and -1's in y_true
    cond_positive = list(y_true).count(1)
    cond_negative = list(y_true).count(-1)

    # all possibly relevant bias parameters stored in a list
    bias_set = sorted(list(set(y_val)), key=float, reverse=True)
    bias_set.append(min(bias_set) * 0.9)

    # initialize y_pred array full of negative predictions (-1)
    y_pred = np.ones(len(y_true)) * (-1)

    # the computation time is mainly influenced by this for loop
    # for a contamination rate of 1% it already takes ~8s to terminate
    for bias in bias_set:
        # lower values tend to correspond to label -1
        # indices of values which exceed the bias
        posIdx = np.where(y_val > bias)

        # set predicted values to 1
        y_pred[posIdx] = 1

        # the following simply calculates results which enable a distinction
        # between the cases of true positive and false positive
        results = np.asarray(y_true) + 2 * np.asarray(y_pred)

        # append the number of tp's and fp's
        tp.append(float(list(results).count(3)))
        fp.append(float(list(results).count(1)))

    # calculate true positive / false positive rate
    tpr = np.asarray(tp) / cond_positive
    fpr = np.asarray(fp) / cond_negative

    # optional scatter plot of the ROC points
    if plot:
        plt.scatter(fpr, tpr)
        plt.show()

    # calculate AUC via the trapezoidal rule
    AUC = np.trapz(tpr, fpr)

    return AUC
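A small usage example; the random data and the comparison against sklearn.metrics.roc_auc_score are my additions, not part of the original answer:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.choice([-1, 1], size=1000, p=[0.9, 0.1])   # imbalanced +1/-1 labels
y_val = rng.random(1000) * 0.6 + (y_true == 1) * 0.4    # positives tend to score higher

print("manual AUC :", auc(y_true, y_val))
print("sklearn AUC:", roc_auc_score(y_true, y_val))     # should be close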