I am working on a binary segmentation problem using PyTorch. I want to know the correct way of calculating metrics like precision, recall, F1-score, mIoU, etc. for my test set. In the code available online I found several different ways of calculating them, which I have listed below:
Method 1
Calculate metrics for each test image separately.
Sum each metric over all test images.
Divide each sum by the number of test images.
Example:
for i, (x, y) in enumerate(zip(test_x, test_y)):
    ...
    Score += calc_metrics(mask1, pred1)
    # Score could be any of the metrics

Final_Score = Score / len(test_x)
Method 2
Add TP, FP, TN, and FN pixel count of all the images and prepare a confusion matrix.
Calculate all metrics at the end using the total TP, FP, TN, and FN pixel count and confusion matrix.
Example:
for i, (x, y) in enumerate(zip(test_x, test_y)):
    ...
    FP += float(np.sum((pred1 == 1) & (mask1 == 0)))
    FN += float(np.sum((pred1 == 0) & (mask1 == 1)))
    TP += float(np.sum((pred1 == 1) & (mask1 == 1)))
    TN += float(np.sum((pred1 == 0) & (mask1 == 0)))

Score = calc_metrics(TP, FP, TN, FN)
Method 3
Calculate batch-wise metrics and divide them by the number of test images at the end.
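A rough sketch of what I mean (test_loader here stands for a hypothetical batched test loader, and calc_metrics is the same placeholder as above):

# Method 3 sketch: metrics computed once per batch, then normalised at the end
Score = 0.0
for masks, preds in test_loader:         # each iteration yields one batch
    Score += calc_metrics(masks, preds)  # one value per batch
Final_Score = Score / len(test_x)        # divide by the number of test images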
Method 4
Unlike the methods above, which threshold the predicted mask at 0.5 (after applying the sigmoid), this method sweeps a range of thresholds, computes each metric at every threshold, and takes the mean of these values at the end.
Example:
for i, (x, y) in enumerate(zip(test_x, test_y)):
    ...
    for t in range(len(threshold)):
        thresh = threshold[t]
        thresh_metric[t] = calc_metrics(mask1, pred1, thresh)
    thresh_final[i, :] = thresh_metric

Score = np.sum(np.mean(thresh_final, axis=1)) / len(test_x)
I am confused about which of these methods I should use to report my model's results on the test set.
Related
I am new to deep learning and I want to be able to evaluate a model that has been trained for a certain number of epochs using the F1 score. I believe this requires calculating precision and recall first.
The trained model is SSD-300-Tensorflow. Is there code or something else that can produce these results? I am not using scikit-learn or anything similar, as I am not sure whether that is required to calculate the score, but any guidance is appreciated.
There is a file for evaluation in the tf_extended folder called metrics.py, which has code for precision and recall. After training, I have the checkpoints in the logs folder. How can I calculate my metrics? I am using Google Colab due to hardware limitations (GPU issues).
If you are using the TensorFlow Object Detection API, it provides a way to run model evaluation that can be configured for different metrics. A tutorial on how to do this is here.
The COCO evaluation metrics include analogous measures of precision and recall for object detection use cases. A good overview of these metrics is here. The concepts of precision and recall need to be adapted somewhat for object detection scenarios because you have to define "how closely" a predicted bounding box needs to match the ground-truth bounding box to be considered a true positive.
I am not certain what the analog for F1 score would be for object detection scenarios. Typically, I have seen models compared using mAP as the single evaluation metric.
You should first calculate the false positives, false negatives, true positives, and true negatives. To obtain these values you must evaluate your model on the test dataset. This link might help.
With these formulas you can calculate precision and recall; here is some example code:
import numpy as np

y_hat = []
y = []
threshold = 0.5

for data, label in test_dataset:
    y_hat.extend(model.predict(data))
    y.extend(label.numpy()[:, 1])

y_hat = np.asarray(y_hat)
y = np.asarray(y)
m = len(y)

# binarize the positive-class scores at the chosen threshold
y_hat = np.asarray([1 if i > threshold else 0 for i in y_hat[:, 1]])

true_positive = np.logical_and(y, y_hat).sum()
true_negative = np.logical_and(np.logical_not(y_hat), np.logical_not(y)).sum()
false_positive = np.logical_and(np.logical_not(y), y_hat).sum()
false_negative = np.logical_and(np.logical_not(y_hat), y).sum()

total = true_positive + true_negative + false_negative + false_positive
assert total == m

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
accuracy = (true_positive + true_negative) / total
f1score = 2 * precision * recall / (precision + recall)
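If scikit-learn happens to be available (the question mentions it is not required), the same numbers can be cross-checked with its metrics functions, for example:

# Optional sanity check with scikit-learn
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

sk_precision, sk_recall, sk_f1, _ = precision_recall_fscore_support(y, y_hat, average='binary')
sk_accuracy = accuracy_score(y, y_hat)
# sk_precision, sk_recall, sk_f1 and sk_accuracy should match the values computed above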
I'm trying to adapt some code for positive-unlabeled learning from this example. It runs with my data, but I also want to calculate the ROC AUC score, which is where I'm getting stuck.
My data is divided into positive samples (data_P) and unlabeled samples (data_U), each with only 2 features/columns of data such as:
#3 example rows:
data_P
[[-1.471, 5.766],
[-1.672, 5.121],
[-1.371, 4.619]]
#3 example rows:
data_U
[[1.23, 6.26],
[-5.72, 4.1213],
[-3.1, 7.129]]
I run the positive-unlabeled learning as in the linked example:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

known_labels_ratio = 0.5
NP = data_P.shape[0]
NU = data_U.shape[0]
T = 1000
K = NP
train_label = np.zeros(shape=(NP+K,))
train_label[:NP] = 1.0
n_oob = np.zeros(shape=(NU,))
f_oob = np.zeros(shape=(NU, 2))

for i in range(T):
    # Bootstrap resample
    bootstrap_sample = np.random.choice(np.arange(NU), replace=True, size=K)
    # Positive set + bootstrapped unlabeled set
    data_bootstrap = np.concatenate((data_P, data_U[bootstrap_sample, :]), axis=0)
    # Train model
    model = DecisionTreeClassifier(max_depth=None, max_features=None,
                                   criterion='gini', class_weight='balanced')
    model.fit(data_bootstrap, train_label)
    # Index for the out-of-bag (oob) samples
    idx_oob = sorted(set(range(NU)) - set(np.unique(bootstrap_sample)))
    # Transductive learning of oob samples
    f_oob[idx_oob] += model.predict_proba(data_U[idx_oob])
    n_oob[idx_oob] += 1

predict_proba = f_oob[:, 1] / n_oob
This all runs fine, but what I want is to run roc_auc_score(), and I'm stuck on how to do that without errors.
Currently I am trying:
y_pred = model.predict_proba(data_bootstrap)
roc_auc_score(train_label, y_pred)
ValueError: bad input shape (3, 2)
The problem seems to be that y_pred gives an output with 2 columns, looking like:
y_pred
array([[0.00554287, 0.9944571 ],
[0.0732314 , 0.9267686 ],
[0.16861796, 0.83138204]])
I'm not sure why y_pred ends up like this. Is it giving the probability of the sample falling into each of the two groups, essentially positive or other? Could I just select, per row, the probability with the highest score? Or is there a way for me to change this, or another way to calculate the ROC AUC score?
For roc_auc_score, y_pred must be a single number per sample, giving the probability of the positive class p1; currently your y_pred consists of both probabilities [p0, p1] (with p0 + p1 = 1.0 by definition).
Assuming that your positive class is class 1 (i.e. the second element of each array in y_pred), what you should do is:
y_pred_pos = [y_pred[i, 1] for i in range(len(y_pred))]
y_pred_pos # inspect
# [0.9944571, 0.9267686, 0.83138204]
roc_auc_score(train_label, y_pred_pos)
In case your y_pred is a Numpy array (and not a Python list), you could replace the list comprehension in the first command above with:
y_pred_pos = y_pred[:,1]
How do I calculate the average TPR, TNR, FPR, and FNR in the case of an imbalanced dataset?
example FPR: [3.54224720e-04 0.00000000e+00 1.59383505e-05 0.00000000e+00]
So, can I sum the values of the 4 classes and divide by 4?
TPR :[3.54224720e-04 + 0.00000000e+00 + 1.59383505e-05 + 0.00000000e+00]/4 = 0.99966 ?
And is 3.54224720e-04 equal to 0.000354224720?
Thank you
# matrix is the multiclass confusion matrix (e.g. from sklearn.metrics.confusion_matrix)
FP = np.sum(matrix, axis=0) - np.diag(matrix)
FN = np.sum(matrix, axis=1) - np.diag(matrix)
TP = np.diag(matrix)
TN = np.sum(matrix) - (FP + FN + TP)
# True Positive rate
TPR = TP/(TP+FN)
print("TPR:", TPR)
# True Negative Rate
TNR = TN/(TN+FP)
print("TNR:", TNR)
# False Positive Rate
FPR = FP/(FP+TN)
print("FPR:", FPR)
# False Negative Rate
FNR = FN/(TP+FN)
print("FNR:", FNR)
# Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)
print("ACC :", ACC)
There are different ways of averaging the metrics. If you check packages such as sklearn, you will see that there are multiple parameters you can pass: micro, macro, weighted, etc.
If you want to calculate them manually, one way (micro) is to take the TP, FN, FP, and TN values of your four classes, sum them up, and then calculate your metrics from the pooled counts, as in the sketch below.
So you should really understand your problem and see which one makes sense. Mostly, in the case of imbalanced data, it is better to use the weighted average. Keep in mind that if you have any baseline calculation, you have to use the exact same averaging method to give a fair comparison, since there can be huge differences between the different ways of averaging.
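For example, a minimal sketch of the micro approach, reusing the per-class TP/FP/FN arrays computed in the question:

# Micro averaging: pool the counts over all classes, then apply the formulas once
TP_micro = TP.sum()
FP_micro = FP.sum()
FN_micro = FN.sum()

precision_micro = TP_micro / (TP_micro + FP_micro)
recall_micro = TP_micro / (TP_micro + FN_micro)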
And yes, those two numbers are equal.
Update:
As the documentation shows:
Weighted average: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
This question is also related.
In your case, for the weighted metrics you calculate each metric for each of your 4 classes separately; then, using the number of instances (the support) of each class, you compute the weighted average. For weighted precision the equation is precision_weighted = sum_i (n_i / N) * precision_i, where n_i is the support of class i and N is the total number of instances.
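A minimal sketch of that computation, reusing matrix, TP, and FP from the question (the commented sklearn call assumes the original label vectors y_true/y_pred are still at hand):

import numpy as np

precision_per_class = TP / (TP + FP)   # one precision value per class
support = np.sum(matrix, axis=1)       # number of true instances per class

# Weighted average: per-class precision weighted by support
precision_weighted = np.sum(precision_per_class * support) / np.sum(support)

# Equivalent sklearn call, if y_true and y_pred are available:
# from sklearn.metrics import precision_score
# precision_weighted = precision_score(y_true, y_pred, average='weighted')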
I've trained a model and identified a 'threshold' that I'd like to deploy it at, but I'm having trouble understanding how the threshold relates to the score.
X = labeled_data[features].reset_index(drop=True)
Y = np.array(labeled_data['fraud'].reset_index(drop=True))
# (train/test etc.. settle on an acceptable model)
grad_des = SGDClassifier(alpha=alpha_optimum, l1_ratio=l1_optimum, loss='log')
grad_des.fit(X, Y)
score_Y = grad_des.predict_proba(X)
precision, recall, thresholds = precision_recall_curve(Y, score_Y[:,1])
Alright, so now I plot precision and recall vs. threshold and decide I want my threshold to be 0.4.
What is the threshold?
My model's coefficients, which I understand 'score' an event by computing coefficients['x'] * event_values['x'] and summing, add up to 29, while the threshold is between 0 and 1.
How am I to understand the translation from the threshold to what is, I guess, a raw score? Would an event with a 1 for all features (all are binary) have a calculated score of 29, since that is the sum of all the coefficients?
Do I need to compute this 'raw' score metric for all events and then plot that against precision instead of threshold?
Edit and Update:
So my question hinged on a lack of understanding of the logistic function, as Mikhail Korobov pointed out below. Regardless of the 'raw score', the logistic function forces the value into the [0, 1] range.
In order to 'unwrap' that value back into the 'raw score' I was looking for, I can do scipy.special.logit(0.8) - grad_des.intercept_, which returns the 'score' of the row.
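A quick sketch of that relationship (assuming loss='log', so predict_proba is the logistic of the linear score):

import numpy as np
from scipy.special import expit, logit

raw = np.asarray(X) @ grad_des.coef_.ravel() + grad_des.intercept_  # linear "raw score"
proba = expit(raw)                            # equals grad_des.predict_proba(X)[:, 1]
recovered = logit(proba) - grad_des.intercept_
print(np.allclose(recovered, np.asarray(X) @ grad_des.coef_.ravel()))  # True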
Probabilities are not just coefficients['x'] * event_values['x'] - a logistic function is applied to these scores to get probability values in the [0, 1] range.
The predict_proba method returns these probabilities; they are in the [0, 1] range.
To get a concrete yes/no prediction one has to choose a probability threshold. An obvious and sane way is to use 0.5: if the probability is greater than 0.5, predict "yep"; otherwise predict "nope". This is what the .predict() method does.
precision_recall_curve tries different probability thresholds and computes precision and recall for them. If, based on the precision and recall scores, you believe some other threshold is better for your application, you can use it instead of 0.5, e.g. bool_prediction = score_Y[:, 1] > threshold.
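For example, a minimal sketch of deploying the 0.4 threshold chosen from the curve, reusing the names from the question:

chosen_threshold = 0.4
proba_pos = grad_des.predict_proba(X)[:, 1]     # probabilities in [0, 1]
bool_prediction = proba_pos > chosen_threshold  # True means "fraud" at this threshold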
I'm just wondering if this is a legitimate way of calculating classification accuracy:
obtain precision recall thresholds
for each threshold binarize the continuous y_scores
calculate their accuracy from the contingency table (confusion matrix)
return the average accuracy for the thresholds
precision, recall, thresholds = precision_recall_curve(np.array(np_y_true), np.array(np_y_scores))

accuracy = 0
for threshold in thresholds:
    contingency_table = confusion_matrix(np_y_true, binarize(np_y_scores, threshold=threshold)[0])
    accuracy += (float(contingency_table[0][0]) + float(contingency_table[1][1])) / float(np.sum(contingency_table))

print("Classification accuracy is: {}".format(accuracy / len(thresholds)))
You are heading in the right direction.
The confusion matrix is definitely the right starting point for computing the accuracy of your classifier. It seems to me that you are aiming at receiver operating characteristics.
In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
The AUC (area under the curve) is a measurement of your classifier's performance. More information and explanation can be found here:
https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it
http://mlwiki.org/index.php/ROC_Analysis
This is my implementation, which you are welcome to improve/comment:
import numpy as np
import matplotlib.pyplot as plt


def auc(y_true, y_val, plot=False):
    # check input
    if len(y_true) != len(y_val):
        raise ValueError('Label vector (y_true) and corresponding value vector (y_val) must have the same length.\n')

    # empty arrays, true positive and false positive numbers
    tp = []
    fp = []
    # count 1's and -1's in y_true
    cond_positive = list(y_true).count(1)
    cond_negative = list(y_true).count(-1)

    # all possibly relevant bias parameters stored in a list
    bias_set = sorted(list(set(y_val)), key=float, reverse=True)
    # append a bias strictly below the smallest value so the final step predicts everything positive
    bias_set.append(min(bias_set) - 1.0)

    # initialize y_pred array full of negative predictions (-1)
    y_pred = np.ones(len(y_true)) * (-1)

    # the computation time is mainly influenced by this for loop
    # for a contamination rate of 1% it already takes ~8s to terminate
    for bias in bias_set:
        # "lower values tend to correspond to label −1"
        # indices of values which exceed the bias
        posIdx = np.where(y_val > bias)
        # set predicted values to 1
        y_pred[posIdx] = 1

        # the following computation simply yields results which enable a distinction
        # between the cases of true positive and false positive
        results = np.asarray(y_true) + 2 * np.asarray(y_pred)

        # append the amount of tp's and fp's
        tp.append(float(list(results).count(3)))
        fp.append(float(list(results).count(1)))

    # calculate true positive / false positive rates
    tpr = np.asarray(tp) / cond_positive
    fpr = np.asarray(fp) / cond_negative

    # optional scatter plot
    if plot == True:
        plt.scatter(fpr, tpr)
        plt.show()

    # calculate AUC
    AUC = np.trapz(tpr, fpr)
    return AUC
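A quick usage sketch on synthetic data, cross-checked against scikit-learn's roc_auc_score (assuming labels are +1/-1 as the function above expects):

from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.choice([-1, 1], size=1000)
y_val = y_true + rng.normal(scale=2.0, size=1000)  # noisy scores correlated with the labels

print(auc(y_true, y_val))            # implementation above
print(roc_auc_score(y_true, y_val))  # should agree closely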