Changing the threshold in a confusion matrix in Python

Given true labels, predicted scores, and a calculated optimal threshold, how do I construct a confusion matrix using that optimal threshold?
# given data
import numpy as np
from sklearn import metrics

label = np.array([0,0,1,0,1,0,0,1,1,1])
pred = np.array([0.15,0.2,0.25,0.37,0.41,0.55,0.65,0.8,0.92,0.99])

fpr, tpr, thresh = metrics.roc_curve(label, pred)
auc = metrics.roc_auc_score(label, pred)

# Getting the optimal threshold (Youden's J statistic: argmax of tpr - fpr)
def optimal_index(tpr, fpr, thresh):
    optimal_idx = np.argmax(tpr - fpr)
    optimal_thres = thresh[optimal_idx]
    return optimal_thres

threshold = optimal_index(tpr, fpr, thresh)
print("Optimal threshold value for the classifier is:", threshold)
Optimal threshold value for the classifier is: 0.8
Now I need to construct a confusion matrix using that 0.8 optimal threshold, but I simply can't figure out how to achieve that.
Please provide any insight! Thank you

You could use metrics.confusion_matrix. First convert the prediction scores to labels using your optimal threshold:
y_pred = list(map(lambda x: 0 if x < threshold else 1, pred))
then build and plot the matrix:
matrix = metrics.confusion_matrix(label, y_pred)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=matrix, display_labels=[0, 1])
disp.plot()
plt.show()
For more help, see plot confusion matrix with keras
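Putting the pieces together, here is a minimal end-to-end sketch (assuming only the data from the question plus scikit-learn and matplotlib; a vectorized comparison is used instead of map, which gives the same labels):
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

label = np.array([0,0,1,0,1,0,0,1,1,1])
pred = np.array([0.15,0.2,0.25,0.37,0.41,0.55,0.65,0.8,0.92,0.99])

# optimal threshold via Youden's J statistic (argmax of tpr - fpr)
fpr, tpr, thresh = metrics.roc_curve(label, pred)
threshold = thresh[np.argmax(tpr - fpr)]

# binarize the scores at the optimal threshold, then build and plot the matrix
y_pred = (pred >= threshold).astype(int)
matrix = metrics.confusion_matrix(label, y_pred)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=matrix, display_labels=[0, 1])
disp.plot()
plt.show()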

Related

ValueError: Classification metrics can't handle a mix of multiclass and multilabel-indicator targets in ROC curve calculation

I'm trying to draw a ROC curve for multiclass classification.
First I calculate y_pred and y_proba using the following code:
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 0)
# training a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
y_pred = dtree_model.predict(X_test)
y_proba= dtree_model.predict_proba(X_test)
After that I use the following function to calculate tpr and fpr
from sklearn.metrics import confusion_matrix

def calculate_tpr_fpr(y_test, y_pred):
    '''
    Calculates the True Positive Rate (tpr) and the False Positive Rate (fpr) based on real and predicted observations
    Args:
        y_test: The list or series with the real classes
        y_pred: The list or series with the predicted classes
    Returns:
        tpr: The True Positive Rate of the classifier
        fpr: The False Positive Rate of the classifier
    '''
    # Calculate the confusion matrix and recover each element
    cm = confusion_matrix(y_test, y_pred)
    TN = cm[0, 0]
    FP = cm[0, 1]
    FN = cm[1, 0]
    TP = cm[1, 1]
    # Calculate tpr and fpr
    tpr = TP / (TP + FN)      # sensitivity - true positive rate
    fpr = 1 - TN / (TN + FP)  # 1 - specificity - false positive rate
    return tpr, fpr
Then, I try using this function to calculate a list of fpr and tpr to draw the curve
def get_all_roc_coordinates(y_test, y_proba):
    '''
    Calculates all the ROC curve coordinates (tpr and fpr) by considering each point as a threshold for the prediction of the class.
    Args:
        y_test: The list or series with the real classes.
        y_proba: The array with the probabilities for each class, obtained by using the `.predict_proba()` method.
    Returns:
        tpr_list: The list of TPRs representing each threshold.
        fpr_list: The list of FPRs representing each threshold.
    '''
    tpr_list = [0]
    fpr_list = [0]
    for i in range(len(y_proba)):
        threshold = y_proba[i]
        y_pred = y_proba = threshold
        tpr, fpr = calculate_tpr_fpr(y_test, y_pred)
        tpr_list.append(tpr)
        fpr_list.append(fpr)
    return tpr_list, fpr_list
but it gives me the following error
ValueError: Classification metrics can't handle a mix of multiclass and multilabel-indicator targets
Note that the Y column is multiclass {0,1,2}. I also tried to ensure that y is string not integer, but it gives me the same error.
You've got 3 classes but you only use 2 classes in your calculate_tpr_fpr(). Also, you probably meant y_pred = y_proba > threshold. Either way, it won't be that easy since you've got 3 columns of class scores. The easiest way seems to be drawing one vs rest curves, treating each column individually:
from sklearn.metrics import roc_curve
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt

classes = range(y_proba.shape[1])
for i in classes:
    fpr, tpr, _ = roc_curve(label_binarize(y_test, classes=classes)[:, i], y_proba[:, i])
    plt.plot(fpr, tpr, alpha=0.7)
plt.legend(classes)
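If you also want the manual helper from the question to work per class, here is a hedged sketch (it assumes the corrected calculate_tpr_fpr above and uses a hypothetical per-column helper, roc_coordinates_for_class, instead of the original function):
import numpy as np
from sklearn.preprocessing import label_binarize

def roc_coordinates_for_class(y_test_bin, scores):
    # y_test_bin: binarized labels for one class; scores: that class's probability column
    tpr_list, fpr_list = [0], [0]
    for threshold in np.sort(scores):
        y_pred = (scores >= threshold).astype(int)   # note: comparison, not assignment
        tpr, fpr = calculate_tpr_fpr(y_test_bin, y_pred)
        tpr_list.append(tpr)
        fpr_list.append(fpr)
    return tpr_list, fpr_list

# usage for class i:
# y_test_bin = label_binarize(y_test, classes=classes)[:, i]
# tprs, fprs = roc_coordinates_for_class(y_test_bin, y_proba[:, i])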

How to read this ROC curve and set custom thresholds?

Using this code:
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
y_true = [1,0,0]
y_predict = [.6,.1,.1]
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_predict , pos_label=1)
print(fpr)
print(tpr)
print(thresholds)
# Print ROC curve
plt.plot(fpr,tpr)
plt.show()
y_true = [1,0,0]
y_predict = [.6,.1,.6]
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_predict , pos_label=1)
print(fpr)
print(tpr)
print(thresholds)
# Print ROC curve
plt.plot(fpr,tpr)
plt.show()
the following ROC curves are plotted:
scikit-learn sets the thresholds, but I would like to set custom thresholds.
For example, for the values:
y_true = [1,0,0]
y_predict = [.6,.1,.6]
the following thresholds are returned:
[1.6 0.6 0.1]
Why does the value 1.6 not appear on the ROC curve? Is the threshold 1.6 redundant in this case, since the probabilities range from 0 to 1? Can custom thresholds such as 0.3, 0.5, 0.7 be set to check how well the classifier performs in those cases?
Update:
From https://sachinkalsi.github.io/blog/category/ml/2018/08/20/top-8-performance-metrics-one-should-know.html#receiver-operating-characteristic-curve-roc I used the same true and predicted values:
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
y_true = [1,1,1,0]
y_predict = [.94,.87,.83,.80]
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_predict , pos_label=1)
print('false positive rate:', fpr)
print('true positive rate:', tpr)
print('thresholds:', thresholds)
# Print ROC curve
plt.plot(fpr,tpr)
plt.show()
which produces this plot:
The plot is different from the referenced plot in the blog, and the thresholds are different too:
The thresholds returned by scikit-learn's metrics.roc_curve are: [0.94 0.83 0.8 ]. Shouldn't scikit-learn return a similar ROC curve, since it is using the same points? Should I implement the ROC curve myself instead of relying on the scikit-learn implementation, given that the results are different?
Thresholds won't appear in the ROC curve. The scikit-learn documentation says:
thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1
If y_predict contains 0.3, 0.5, 0.7, then those thresholds will be tried by the metrics.roc_curve function.
Typically these steps are followed while calculating a ROC curve:
1. Sort y_predict in descending order.
2. For each of the probability scores (let's say τ_i) in y_predict, if y_predict >= τ_i, then consider that data point as positive.
P.S.: If we have N data points, then we will have N thresholds (if the combinations of y_true and y_predict are unique).
3. For each of the y_predict (τ_i) values, calculate TPR & FPR.
4. Plot the ROC curve by taking the N (no. of data points) (TPR, FPR) pairs.
You can refer to this blog for detailed information.
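If you really want to probe specific thresholds such as 0.3, 0.5, 0.7 yourself, a small sketch (assuming binary labels and probability scores, and working independently of roc_curve) is to binarize the scores at each custom threshold and compute TPR/FPR from the confusion matrix:
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0])
y_score = np.array([.94, .87, .83, .80])

for t in [0.3, 0.5, 0.7]:
    y_pred = (y_score >= t).astype(int)                     # positive if score >= threshold
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if (tp + fn) else 0.0              # true positive rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0              # false positive rate
    print("threshold=%.1f: TPR=%.2f, FPR=%.2f" % (t, tpr, fpr))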

Manually calculate AUC

How can I obtain the AUC value having fpr and tpr? Fpr and tpr are just 2 floats obtained from these formulas:
my_fpr = fp / (fp + tn)
my_tpr = tp / (tp + fn)
my_roc_auc = auc(my_fpr, my_tpr)
I know this can't be possible, because fpr and tpr are just single floats and they need to be arrays, but I can't figure out how to do that. I also know that I can compute the AUC this way:
y_predict_proba = model.predict_proba(X_test)
probabilities = np.array(y_predict_proba)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probabilities)
roc_auc = auc(fpr, tpr)
but I want to avoid using predict_proba for some reasons. So my question is: how can I obtain AUC having fp, tp, fn, tn, fpr, tpr? In other words, is it possible to obtain AUC without roc_curve?
Yes, it is possible to obtain the AUC without calling roc_curve.
You first need to create the ROC (Receiver Operating Characteristics) curve. To be able to use the ROC curve, your classifier should be able to rank examples such that the ones with higher rank are more likely to be positive (e.g. fraudulent). As an example, Logistic Regression outputs probabilities, which is a score that you can use for ranking.
The ROC curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. As an example:
The model performance is determined by looking at the area under the ROC curve (or AUC).
You can find a more detailed explanation here.
You can divide the space into 2 parts: a triangle and a trapezium. The triangle will have area TPR*FPR/2, and the trapezium (1-FPR)*(1+TPR)/2 = 1/2 - FPR/2 + TPR/2 - TPR*FPR/2. The total area is 1/2 - FPR/2 + TPR/2 = (1 + TPR - FPR)/2. This is how you can get the AUC, having just the 2 rates.
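As a quick sketch of that formula (assuming a single confusion matrix, i.e. one operating point on the ROC curve, with hypothetical counts), the AUC of the two-segment curve through (0,0), (FPR,TPR) and (1,1) can be computed directly:
from sklearn.metrics import auc

# hypothetical counts from a single confusion matrix
tp, fp, fn, tn = 80, 10, 20, 90

fpr = fp / (fp + tn)   # false positive rate
tpr = tp / (tp + fn)   # true positive rate

# area under the piecewise-linear ROC through (0,0), (fpr,tpr), (1,1)
auc_single_point = (1 + tpr - fpr) / 2

# the same value via the trapezoidal rule over the three points
auc_trapz = auc([0, fpr, 1], [0, tpr, 1])

print(auc_single_point, auc_trapz)   # both 0.85 for these counts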

Confusion matrix from probabilities

I have the following scikit-learn machine learning pipeline:
import numpy as np
from numpy import interp
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

cv = StratifiedKFold(n_splits=6)
classifier = svm.SVC(kernel='linear', probability=True,
                     random_state=random_state)
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 0
for train, test in cv.split(X, y):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    tprs.append(interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    i += 1
Now I would like to also calculate (and plot) the confusion matrix. How can this be done with the above code? I'm only getting probabilities (which I need for calculating AUC). I have 4 classes (1...4).
You can use this example here to plot confusion matrix:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
But for this you need to have discrete class values (not probabilities), which can easily be derived from your probas_ variable using:
y_pred = np.argmax(probas_, axis=1)
Now you can use this y_pred in confusion_matrix.
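A hedged sketch of how that could slot into the cross-validation loop above (it assumes the class labels are 1..4 and accumulates a single matrix over all folds; variable names reuse the ones from the question):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

classes = [1, 2, 3, 4]                        # assumed label values
cm_total = np.zeros((4, 4), dtype=int)

for train, test in cv.split(X, y):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # predict_proba columns follow classifier.classes_, so map the argmax index back to a label
    y_pred = classifier.classes_[np.argmax(probas_, axis=1)]
    cm_total += confusion_matrix(y[test], y_pred, labels=classes)

ConfusionMatrixDisplay(cm_total, display_labels=classes).plot()
plt.show()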

How to Calculate ROC and AUC in Caffe?

I have trained a binary-class CNN in Caffe, and now I want to plot the ROC curve and calculate the AUC value. I have two questions:
1) How to plot the ROC curve in Caffe with python?
2) How to calculate the AUC value of the ROC curve?
Python has roc_curve and roc_auc_score functions in the sklearn.metrics module; just import and use them.
Assuming you have a binary prediction layer that outputs a two-vector of binary class probabilities (let's call it "prob") then your code should look something like:
import caffe
from sklearn import metrics

# load the net with trained weights
net = caffe.Net('/path/to/deploy.prototxt', '/path/to/weights.caffemodel', caffe.TEST)

y_score = []
y_true = []
for i in xrange(N):                 # assuming you have N validation samples
    x_i = ...                       # get the i-th validation sample
    y_true.append(y_i)              # y_i is 0 or 1, the TRUE label of x_i
    out = net.forward(data=x_i)     # get the prediction for x_i
    y_score.append(out['prob'][1])  # get the score for the "1" class

# once you have N y_score and y_true values
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score, pos_label=1)
auc = metrics.roc_auc_score(y_true, y_score)
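Once fpr, tpr and auc are computed, plotting the curve is plain matplotlib; a minimal sketch:
import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label='ROC curve (AUC = %.3f)' % auc)
plt.plot([0, 1], [0, 1], linestyle='--', label='chance')   # diagonal reference line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()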
