Classification accuracy after recall and precision - python

I'm just wondering if this is a legitimate way of calculating classification accuracy:
obtain precision recall thresholds
for each threshold binarize the continuous y_scores
calculate their accuracy from the contingency table (confusion matrix)
return the average accuracy for the thresholds
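import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.preprocessing import binarize
# precision_recall_curve returns (precision, recall, thresholds), in that order
precision, recall, thresholds = precision_recall_curve(np.array(np_y_true), np.array(np_y_scores))
accuracy = 0.0
for threshold in thresholds:
    # binarize expects a 2D array, so the scores are reshaped to a single row
    contingency_table = confusion_matrix(np_y_true, binarize(np.array(np_y_scores).reshape(1, -1), threshold=threshold)[0])
    accuracy += (float(contingency_table[0][0]) + float(contingency_table[1][1]))/float(np.sum(contingency_table))
print("Classification accuracy is: {}".format(accuracy/len(thresholds)))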

You are heading in the right direction.
The confusion matrix is definitely the right starting point for computing the accuracy of your classifier. It seems to me that you are aiming at receiver operating characteristics.
In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
The AUC (area under the curve) is a measure of your classifier's performance. More information and explanation can be found here:
https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it
http://mlwiki.org/index.php/ROC_Analysis
This is my implementation, which you are welcome to improve/comment:
import numpy as np
import matplotlib.pyplot as plt

def auc(y_true, y_val, plot=False):
    # check input
    if len(y_true) != len(y_val):
        raise ValueError('Label vector (y_true) and corresponding value vector (y_val) must have the same length.\n')
    # empty lists for the true positive and false positive counts
    tp = []
    fp = []
    # count 1's and -1's in y_true
    cond_positive = list(y_true).count(1)
    cond_negative = list(y_true).count(-1)
    # all possibly relevant bias parameters stored in a list
    bias_set = sorted(list(set(y_val)), key=float, reverse=True)
    # add one bias below the minimum so that every sample ends up predicted positive
    # (min*0.9 is not below the minimum when the minimum is zero or negative)
    bias_set.append(min(bias_set) - 1)
    # initialize y_pred array full of negative predictions (-1)
    y_pred = np.ones(len(y_true))*(-1)
    # the computation time is mainly influenced by this for loop
    # for a contamination rate of 1% it already takes ~8s to terminate
    for bias in bias_set:
        # "lower values tend to correspond to label −1"
        # indices of values which exceed the bias
        posIdx = np.where(np.asarray(y_val) > bias)
        # set predicted values to 1
        y_pred[posIdx] = 1
        # the following sum takes a distinct value for each case:
        # 3 = true positive, 1 = false positive, -1 = false negative, -3 = true negative
        results = np.asarray(y_true) + 2*np.asarray(y_pred)
        # append the number of tp's and fp's
        tp.append(float(list(results).count(3)))
        fp.append(float(list(results).count(1)))
    # calculate true positive rate and false positive rate
    tpr = np.asarray(tp)/cond_positive
    fpr = np.asarray(fp)/cond_negative
    # optional scatterplot
    if plot:
        plt.scatter(fpr, tpr)
        plt.show()
    # calculate AUC with the trapezoidal rule
    AUC = np.trapz(tpr, fpr)
    return AUC
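A quick way to sanity-check an implementation like the one above: the AUC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes that directly. `auc_pairwise` is a hypothetical helper name, and it assumes labels coded as 1 and -1 as in the code above.

```python
import numpy as np

def auc_pairwise(y_true, y_val):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Assumes labels are 1 (positive) and -1 (negative). Ties count as 0.5.
    """
    y_true = np.asarray(y_true)
    y_val = np.asarray(y_val, dtype=float)
    pos = y_val[y_true == 1]
    neg = y_val[y_true == -1]
    # score difference for every (positive, negative) pair
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# toy data: 3 of the 4 positive/negative pairs are ranked correctly
toy_auc = auc_pairwise([-1, -1, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

This builds all pairs in memory, so it is only meant for small test inputs, not as a replacement for the threshold sweep.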

Related

How do I calculate the naive accuracy from a confusion matrix?

[Image: confusion matrix]
I have an issue where I'm trying to compute the test accuracy for a naive classifier that always predicts ŷ = −1.
I have already calculated the test accuracy of the classifier based on the confusion matrix attached above by using (TN + TP)/𝑛. But how do I calculate the naive value?
accuracy = (109112+3805)/127933
naive_accuracy = # TODO: Compute the accuracy of the naive classifier
It is actually the same formula. Note that your naive classifier never gives positive answers, so TP = 0. TN then equals the total number of negatives: TN = 123324.
So naive_accuracy = (TN + TP)/𝑛 = (123324 + 0)/127933.
And yes, in this case the naive classifier actually shows better accuracy than the one given by the confusion matrix you are referring to. This is due to class imbalance: there are roughly 27 times more negative examples than positive ones (123324 versus 4609). This is why accuracy is misleading in that setting. Check out precision, recall and F-score if you need a meaningful result.
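The arithmetic above can be checked in a few lines. This sketch assumes, from the order of the terms in (TN + TP)/𝑛, that 109112 is the classifier's TN count and 3805 its TP count; the variable names are illustrative.

```python
# Counts taken from the question and answer above
n = 127933                # total number of test samples
tn_model = 109112         # assumed: true negatives of the trained classifier
tp_model = 3805           # assumed: true positives of the trained classifier
n_negative = 123324       # total number of negative samples
n_positive = n - n_negative

# Accuracy of the trained classifier
accuracy = (tn_model + tp_model) / n

# The naive classifier always predicts -1: TP = 0, TN = all negatives
naive_accuracy = (n_negative + 0) / n

# Recall exposes what accuracy hides: the naive classifier finds no positives
model_recall = tp_model / n_positive
naive_recall = 0 / n_positive

print(accuracy, naive_accuracy)
print(model_recall, naive_recall)
```

The naive classifier wins on accuracy but has zero recall, which is the imbalance effect the answer describes.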

How to calculate binary segmentation metrics?

I am working on a binary segmentation problem using Pytorch. I want to know the correct way of calculating metrics like precision, recall, f1-score, mIOU, etc for my test set. From many of the online codes available, I found different ways of calculation, which I have mentioned below:
Method 1
Calculate metrics for individual images separately.
Compute the sum of each metric of all test images.
Divide the metrics by the number of test images.
Example:
for i, (x, y) in enumerate(zip(test_x, test_y)):
    ...
    Score += calc_metrics(mask1, pred1)
    # Score could be any of the metrics
Final_Score = Score / len(test_x)
Method 2
Add TP, FP, TN, and FN pixel count of all the images and prepare a confusion matrix.
Calculate all metrics at the end using the total TP, FP, TN, and FN pixel count and confusion matrix.
Example:
for i, (x, y) in enumerate(zip(test_x, test_y)):
    ...
    # np.float was removed in NumPy 1.24; the built-in float does the same job
    FP += float(np.sum((pred1==1) & (mask1==0)))
    FN += float(np.sum((pred1==0) & (mask1==1)))
    TP += float(np.sum((pred1==1) & (mask1==1)))
    TN += float(np.sum((pred1==0) & (mask1==0)))
Score = calc_metrics(TP, FP, TN, FN)
Method 3
Calculate batch-wise metrics and divide them by the number of test images at the end.
Method 4
Unlike all the above methods, which use 0.5 as the threshold on the predicted mask (after applying sigmoid), this method uses a range of thresholds and computes metrics on different thresholds for each metric and takes the mean of these values at the end.
Example:
for i, (x, y) in enumerate(zip(test_x, test_y)):
    ...
    for t in range(len(threshold)):
        thresh = threshold[t]
        thresh_metric[t] = calc_metrics(mask1, pred1, thresh)
    thresh_final[i, :] = thresh_metric
# mean over thresholds for each image, then average over images
Score = np.sum(np.mean(thresh_final, axis=1)) / len(test_x)
I am confused about which way to use to report my model's results on the test set.
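To make the difference between Method 1 and Method 2 concrete, here is a minimal sketch on two tiny made-up masks, using IoU as the metric. The helpers `counts` and `iou` and the toy arrays are illustrative assumptions, not code from any particular repository. It does not settle which method to report, only shows that they can disagree.

```python
import numpy as np

def counts(mask, pred):
    """Pixel-level TP, FP and FN for one binary mask/prediction pair."""
    tp = int(np.sum((pred == 1) & (mask == 1)))
    fp = int(np.sum((pred == 1) & (mask == 0)))
    fn = int(np.sum((pred == 0) & (mask == 1)))
    return tp, fp, fn

def iou(tp, fp, fn):
    denom = tp + fp + fn
    # assumption: an empty mask with an empty prediction counts as a perfect match
    return tp / denom if denom else 1.0

# image 1: large object segmented perfectly; image 2: one-pixel object missed
masks = [np.array([[1, 1, 1, 1], [1, 1, 1, 1]]),
         np.array([[1, 0, 0, 0], [0, 0, 0, 0]])]
preds = [np.array([[1, 1, 1, 1], [1, 1, 1, 1]]),
         np.array([[0, 1, 0, 0], [0, 0, 0, 0]])]

# Method 1: metric per image, then average over images
method1 = float(np.mean([iou(*counts(m, p)) for m, p in zip(masks, preds)]))

# Method 2: pool the pixel counts over all images, then compute the metric once
total_tp, total_fp, total_fn = np.sum([counts(m, p) for m, p in zip(masks, preds)], axis=0)
method2 = iou(total_tp, total_fp, total_fn)

print(method1, method2)
```

Method 1 weights every image equally, so the missed small object halves the score; Method 2 weights every pixel equally, so large objects dominate. Whichever is chosen, a baseline should be scored the same way.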

How to predict Precision, Recall and F1 score after training SSD

I am new to deep learning and I want to be able to evaluate the model which underwent training for certain epochs using F1 score. I believe it requires the calculation of precision and recall first.
The model trained is SSD-300-Tensorflow. Is there code or something that can produce these results? I am not using scikit-learn or anything, as I am not sure if that is required to calculate the score, but any guidance is appreciated.
There is a file for evaluation in the folder tf_extended called metrics.py which has code for precision and recall. After training, I have the checkpoints in the logs folder. How can I calculate my metrics? I am using Google Colab due to hardware limitations (GPU issues).
If you are using the Tensorflow Object Detection API, it provides a way for running model evaluation that can be configured for different metrics. A tutorial on how to do this is here.
The COCO evaluation metrics include analogous measures of precision and recall for object detection use cases. A good overview of these metrics is here. The concepts of precision and recall need to be adapted somewhat for object detection scenarios because you have to define "how closely" a predicted bounding box needs to match the ground truth bounding box to be considered a true positive.
I am not certain what the analog for F1 score would be for object detection scenarios. Typically, I have seen models compared using mAP as the single evaluation metric.
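The "how closely" criterion is usually expressed as intersection over union (IoU) between the predicted and the ground truth box. The sketch below is a plain-Python illustration, not the Object Detection API's implementation; the `(x_min, y_min, x_max, y_max)` box format, the helper name `box_iou`, and the 0.5 threshold are assumptions made for the example.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # corners of the overlap rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

IOU_THRESHOLD = 0.5  # assumed threshold for counting a detection as a true positive

ground_truth = (0, 0, 2, 2)
prediction = (1, 1, 3, 3)
score = box_iou(ground_truth, prediction)
is_true_positive = score >= IOU_THRESHOLD
print(score, is_true_positive)
```

Once every detection is labeled true or false positive this way, and unmatched ground truth boxes are counted as false negatives, the usual precision and recall formulas apply.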
You should first calculate the false positives, false negatives, true positives and true negatives. To obtain these values you must evaluate your model on the test dataset. This link might help
With these formulas you can calculate precision and recall; here is some example code:
import numpy as np

y_hat = []
y = []
threshold = 0.5
# collect predictions and labels over the whole test set
for data, label in test_dataset:
    y_hat.extend(model.predict(data))
    y.extend(label.numpy()[:, 1])
y_hat = np.asarray(y_hat)
y = np.asarray(y)
m = len(y)
# binarize the positive-class score at the threshold
y_hat = np.asarray([1 if i > threshold else 0 for i in y_hat[:, 1]])
true_positive = np.logical_and(y, y_hat).sum()
true_negative = np.logical_and(np.logical_not(y_hat), np.logical_not(y)).sum()
false_positive = np.logical_and(np.logical_not(y), y_hat).sum()
false_negative = np.logical_and(np.logical_not(y_hat), y).sum()
total = true_positive + true_negative + false_negative + false_positive
assert total == m
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
accuracy = (true_positive + true_negative) / total
f1score = 2 * precision * recall / (precision + recall)

ground truth fit is worse than cross validated fit on noisy data?

I am getting some weird results when playing around with cross validation, and I would greatly appreciate any comments.
Briefly, I get a lower mean squared error (MSE) when doing regression (least-squares) using cross-validation (CV) than when using the "ground truth weights" that I used to generate the data.
Note however, that I compute the MSE on noisy data (generated data + noise), so MSE of 0 would not be expected for noise levels above 0.
Weirdly, for high noise conditions, I get lower MSE with cross validated least squares than with the "ground truth" weights used to generate the clean data, to which I then add different levels of noise on the input (X). If I instead add Gaussian noise to the output (y), the "ground truth weights" perform better.
More details below.
Simulation of data
I am generating beta from a Gaussian and X from a uniform distribution. I then compute the to-be-regressed y as y = beta * X.
python 3 code:
import numpy as np

def generate_data(noise_frac):
    X = np.random.rand(ntrials, nneurons)
    # note: this line overwrites the uniform X above with standard-normal samples
    X = np.random.normal(size=(ntrials, nneurons))
    beta = np.random.randn(nneurons)
    y = X @ beta
    # not very important how I generated noise here
    noise_x = np.random.multivariate_normal(mean=np.zeros(nneurons), cov=np.diag(np.random.rand(nneurons)), size=ntrials)
    X_noise = X + noise_x*noise_frac
    return X_noise, y, beta
As you can see I also add noise to X.
Regression
I then project this noised data X_noise for different values of noise onto beta:
y_hat = X_noise @ beta
And compute the MSE:
mse = np.mean((y_hat - y)**2)
As expected, MSE increases with noise (blue line in the figure).
However, I get lower MSE if I use cross validated least-squares! This is now orange line in the figure.
To do CV, I split X_noise into 100 random train and test sets. In broad terms, this is how I do CV in Python:
beta_lsq = np.linalg.pinv(X_train) @ y_train
y_hat_lsq = X_test @ beta_lsq
mse = np.mean((y_hat_lsq - y_test)**2)
On the other hand, if I add noise to y, instead of X, then everything makes sense:
Thank you very much in advance!
PS: This is a crosspost from stack overflow
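A self-contained sketch that reproduces the setup described above, with some simplifying assumptions: isotropic Gaussian noise on X, a single train/test split instead of 100, and a fixed seed. `compare_mse` is a hypothetical helper name. One likely explanation for the observation is the classic errors-in-variables (attenuation) effect: when the inputs are noisy, least squares fitted on X_noise learns shrunken weights, and those predict y from X_noise better than the generating weights do.

```python
import numpy as np

def compare_mse(noise_frac, ntrials=2000, nneurons=10, seed=0):
    """Return (MSE with generating weights, MSE with held-out least squares)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(ntrials, nneurons))
    beta = rng.normal(size=nneurons)
    y = X @ beta                      # clean targets
    # assumption: isotropic Gaussian noise added to the inputs only
    X_noise = X + noise_frac * rng.normal(size=(ntrials, nneurons))

    half = ntrials // 2
    X_train, X_test = X_noise[:half], X_noise[half:]
    y_train, y_test = y[:half], y[half:]

    # "ground truth" weights applied to the noisy test inputs
    mse_true = np.mean((X_test @ beta - y_test) ** 2)

    # least squares fitted on the noisy training inputs, evaluated on the test half
    beta_lsq = np.linalg.pinv(X_train) @ y_train
    mse_lsq = np.mean((X_test @ beta_lsq - y_test) ** 2)
    return mse_true, mse_lsq

for noise_frac in (0.0, 0.5, 1.0):
    print(noise_frac, compare_mse(noise_frac))
```

With noise on y instead of X, the generating weights remain the best linear predictor, which would be consistent with the second observation in the question.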

How to calculate average TPR, TNR, FPR, FNR - Multiclass Classification

How do I calculate the average TPR, TNR, FPR, FNR in the case of an imbalanced dataset?
Example FPR: [3.54224720e-04 0.00000000e+00 1.59383505e-05 0.00000000e+00]
So, can I sum over the 4 classes and divide by 4?
TPR :[3.54224720e-04 + 0.00000000e+00 + 1.59383505e-05 + 0.00000000e+00]/4 = 0.99966 ?
And how should I read 3.54224720e-04? Is it equal to 0.000354224720?
Thank you
FP = np.sum(matrix, axis=0) - np.diag(matrix)
FN = np.sum(matrix, axis=1) - np.diag(matrix)
TP = np.diag(matrix)
TN = np.sum(matrix) - (FP + FN + TP)
# True Positive rate
TPR = TP/(TP+FN)
print("TPR:", TPR)
# True Negative Rate
TNR = TN/(TN+FP)
print("TNR:", TNR)
# False Positive Rate
FPR = FP/(FP+TN)
print("FPR:", FPR)
# False Negative Rate
FNR = FN/(TP+FN)
print("FNR:", FNR)
# Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)
print("ACC :", ACC)
There are different ways of averaging the metrics. If you check packages such as sklearn, you will see there are several options you can pass: micro, macro, weighted, etc.
If you want to calculate them manually, one way (micro) is to take the TP, FN, FP and TN values from your four classes, sum them up, and then calculate your metrics from the totals.
So you should really understand your problem to see which one makes sense. In the case of imbalanced data, it is usually better to use the weighted average. Keep in mind that if you have any baseline calculation, you have to use the exact same method for calculating these values to give a fair comparison, since there can be huge differences between different ways of averaging.
And yes, those two numbers are equal.
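The averaging options can be sketched with NumPy. The 3-class confusion matrix below is made up for illustration (rows are true classes, columns are predicted classes, matching the FP/FN code in the question); only `question_fpr` comes from the question itself.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
matrix = np.array([[50, 2, 0],
                   [3, 10, 1],
                   [0, 1, 5]])

TP = np.diag(matrix)
FP = matrix.sum(axis=0) - TP
FN = matrix.sum(axis=1) - TP
TN = matrix.sum() - (TP + FP + FN)
support = TP + FN                     # number of true instances per class

# macro: average the per-class rates, every class counts equally
macro_tpr = np.mean(TP / (TP + FN))
macro_fpr = np.mean(FP / (FP + TN))

# micro: pool the counts over classes first, then compute the rate once
micro_tpr = TP.sum() / (TP.sum() + FN.sum())
micro_fpr = FP.sum() / (FP.sum() + TN.sum())

# weighted: average the per-class rates, weighted by support
weighted_tpr = np.sum(support * (TP / (TP + FN))) / support.sum()

# macro average of the four FPR values quoted in the question
question_fpr = np.array([3.54224720e-04, 0.0, 1.59383505e-05, 0.0])
question_macro_fpr = question_fpr.mean()

print(macro_tpr, micro_tpr, weighted_tpr)
print(macro_fpr, micro_fpr)
print(question_macro_fpr)
```

For the values in the question, summing and dividing by 4 gives about 9.25e-05, not 0.99966. Note also that for TPR the support-weighted average coincides with the micro average, since both reduce to the total TP count divided by the number of samples.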
Update:
As the documentation shows:
Weighted average: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
This question is also related.
In your case, for weighted metrics, you calculate each metric for each of your 4 classes separately. Knowing the number of instances in each class, you then calculate the weighted average of the metric. This picture shows the equation for the weighted precision:
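Written out under the documentation's definition quoted above, the weighted precision takes the following form. The symbols are labels chosen for this sketch, not taken from the original picture: $K$ is the number of classes, $n_k$ the support of class $k$, and $N$ the total number of samples.

```latex
P_{\text{weighted}} = \sum_{k=1}^{K} \frac{n_k}{N} \, P_k,
\qquad
P_k = \frac{TP_k}{TP_k + FP_k},
\qquad
n_k = TP_k + FN_k,
\qquad
N = \sum_{k=1}^{K} n_k
```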
