I have trained a binary-class CNN in Caffe, and now I want to plot the ROC curve and calculate the AUC value. I have two questions:
1) How to plot the ROC curve in Caffe with python?
2) How to calculate the AUC value of the ROC curve?
The sklearn.metrics module provides the roc_curve and roc_auc_score functions; just import and use them.
Assuming you have a binary prediction layer that outputs a two-vector of binary class probabilities (let's call it "prob") then your code should look something like:
import caffe
from sklearn import metrics
# load the net with trained weights
net = caffe.Net('/path/to/deploy.prototxt', '/path/to/weights.caffemodel', caffe.TEST)
y_score = []
y_true = []
for i in range(N):  # assuming you have N validation samples
    x_i = ...  # get i-th validation sample
    y_true.append(y_i)  # y_i is 0 or 1, the TRUE label of x_i
    out = net.forward(data=x_i)  # get prediction for x_i
    y_score.append(out['prob'][0][1])  # score for class "1" (assumes a batch of one sample)
# once you have N y_score and y_true values
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score, pos_label=1)
auc = metrics.roc_auc_score(y_true, y_score)
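To answer the plotting part of the question, the fpr and tpr arrays can be passed straight to matplotlib. The following is a minimal sketch, not tied to Caffe: the labels and scores are hypothetical stand-ins for the y_true and y_score lists collected in the loop above.

```python
import matplotlib.pyplot as plt
from sklearn import metrics

# Hypothetical labels and class-1 scores standing in for y_true / y_score
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score, pos_label=1)
roc_auc = metrics.auc(fpr, tpr)  # area under the (fpr, tpr) curve

plt.plot(fpr, tpr, label="ROC (AUC = %.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc="lower right")
# plt.show()  # uncomment when running interactively
```

For binary labels, metrics.auc(fpr, tpr) and metrics.roc_auc_score(y_true, y_score) should give the same value, so either can be used for the legend.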
I'm trying to draw a ROC curve for multiclass classification.
At first I calculate y_pred and y_proba using the following code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
# training a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
y_pred = dtree_model.predict(X_test)
y_proba = dtree_model.predict_proba(X_test)
After that I use the following function to calculate tpr and fpr
from sklearn.metrics import confusion_matrix
def calculate_tpr_fpr(y_test, y_pred):
    '''
    Calculates the True Positive Rate (tpr) and the False Positive Rate (fpr) based on real and predicted observations.

    Args:
        y_test: The list or series with the real classes
        y_pred: The list or series with the predicted classes

    Returns:
        tpr: The True Positive Rate of the classifier
        fpr: The False Positive Rate of the classifier
    '''
    # Calculates the confusion matrix and recovers each element
    cm = confusion_matrix(y_test, y_pred)
    TN = cm[0, 0]
    FP = cm[0, 1]
    FN = cm[1, 0]
    TP = cm[1, 1]
    # Calculates tpr and fpr
    tpr = TP / (TP + FN)  # sensitivity - true positive rate
    fpr = 1 - TN / (TN + FP)  # 1-specificity - false positive rate
    return tpr, fpr
Then, I try using this function to calculate a list of fpr and tpr to draw the curve
def get_all_roc_coordinates(y_test, y_proba):
    '''
    Calculates all the ROC Curve coordinates (tpr and fpr) by considering each point as a threshold for the prediction of the class.

    Args:
        y_test: The list or series with the real classes.
        y_proba: The array with the probabilities for each class, obtained by using the `.predict_proba()` method.

    Returns:
        tpr_list: The list of TPRs representing each threshold.
        fpr_list: The list of FPRs representing each threshold.
    '''
    tpr_list = [0]
    fpr_list = [0]
    for i in range(len(y_proba)):
        threshold = y_proba[i]
        y_pred = y_proba = threshold
        tpr, fpr = calculate_tpr_fpr(y_test, y_pred)
        tpr_list.append(tpr)
        fpr_list.append(fpr)
    return tpr_list, fpr_list
but it gives me the following error
ValueError: Classification metrics can't handle a mix of multiclass and multilabel-indicator targets
Note that the Y column is multiclass {0,1,2}. I also tried to ensure that y is string not integer, but it gives me the same error.
You've got 3 classes, but calculate_tpr_fpr() only reads the four cells of a 2x2 confusion matrix, i.e. it assumes 2 classes. Also, you probably meant y_pred = y_proba > threshold. Either way, it won't be that easy since you've got 3 columns of class scores. The easiest way seems to be drawing one-vs-rest curves, treating each column individually:
from sklearn.metrics import roc_curve
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
classes = range(y_proba.shape[1])
for i in classes:
    fpr, tpr, _ = roc_curve(label_binarize(y_test, classes=classes)[:,i], y_proba[:,i])
    plt.plot(fpr, tpr, alpha=0.7)
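If you also want a number per curve, the same one-vs-rest loop can collect an AUC for each class. This is a minimal sketch under assumptions: it uses the iris dataset as a stand-in for the X and Y of the question, and per_class_auc is just an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with three classes {0, 1, 2} (assumption: replaces the question's X, Y)
X, Y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)  # one column of scores per class

classes = model.classes_
y_onehot = label_binarize(y_test, classes=classes)  # one 0/1 column per class

per_class_auc = {}
for i, c in enumerate(classes):
    # Class c against all other classes, using only column i of the scores
    fpr, tpr, _ = roc_curve(y_onehot[:, i], y_proba[:, i])
    per_class_auc[int(c)] = auc(fpr, tpr)

print(per_class_auc)
```

Binarizing once outside the loop avoids recomputing the indicator matrix for every class.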
I'm using scikit-learn, and I want to plot the precision and recall curves. The classifier I'm using is RandomForestClassifier. All the examples in the scikit-learn documentation use binary classification. Also, can I plot a ROC curve for multiclass?
Also, I only found an example for SVM with multilabel data, and it relies on decision_function, which RandomForest doesn't have.
From scikit-learn documentation:
Precision-recall curves are typically used in binary classification to
study the output of a classifier. In order to extend the
precision-recall curve and average precision to multi-class or
multi-label classification, it is necessary to binarize the output.
One curve can be drawn per label, but one can also draw a
precision-recall curve by considering each element of the label
indicator matrix as a binary prediction (micro-averaging).
Receiver Operating Characteristic (ROC):
ROC curves are typically used in binary classification to study the
output of a classifier. In order to extend ROC curve and ROC area to
multi-class or multi-label classification, it is necessary to binarize
the output. One ROC curve can be drawn per label, but one can also
draw a ROC curve by considering each element of the label indicator
matrix as a binary prediction (micro-averaging).
Therefore, you should binarize the output and consider precision-recall and roc curves for each class. Moreover, you are going to use predict_proba to get class probabilities.
I divide the code into three parts:
general settings, learning and prediction
precision-recall curve
ROC curve
1. general settings, learning and prediction
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
#%matplotlib inline
mnist = fetch_openml("mnist_784")
y = mnist.target
y = y.astype(np.uint8)  # the targets are loaded as strings
n_classes = len(set(y))
# binarize the integer labels: one 0/1 column per class
Y = label_binarize(y, classes=[*range(n_classes)])
X_train, X_test, y_train, y_test = train_test_split(mnist.data,
                                                    Y,
                                                    random_state = 42)
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=50))
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)
2. precision-recall curve
# precision recall curve
precision = dict()
recall = dict()
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(y_test[:, i],
                                                        y_score[:, i])
    plt.plot(recall[i], precision[i], lw=2, label='class {}'.format(i))
plt.xlabel("recall")
plt.ylabel("precision")
plt.legend(loc="best")
plt.title("precision vs. recall curve")
plt.show()
3. ROC curve
# roc curve
fpr = dict()
tpr = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i],
                                  y_score[:, i])
    plt.plot(fpr[i], tpr[i], lw=2, label='class {}'.format(i))
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend(loc="best")
plt.title("ROC curve")
plt.show()
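The documentation quoted above also mentions micro-averaging, which treats every cell of the label indicator matrix as one binary prediction and yields a single curve. A minimal sketch of that idea follows. To keep it quick to run it assumes the small load_digits dataset instead of MNIST and a plain RandomForestClassifier (whose predict_proba already returns one column per class); fpr_micro, tpr_micro and auc_micro are illustrative names.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Assumption: load_digits stands in for MNIST to keep the example small
X, y = load_digits(return_X_y=True)
n_classes = 10
Y = label_binarize(y, classes=list(range(n_classes)))  # indicator matrix

X_train, X_test, y_train, y_test, Y_train, Y_test = train_test_split(
    X, y, Y, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)  # shape (n_samples, n_classes)

# Micro-average: flatten labels and scores so each cell is one binary decision
fpr_micro, tpr_micro, _ = roc_curve(Y_test.ravel(), y_score.ravel())
auc_micro = auc(fpr_micro, tpr_micro)
print("micro-averaged AUC: %.3f" % auc_micro)
```

The resulting fpr_micro and tpr_micro can be plotted with plt.plot exactly like the per-class curves.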
Using this code :
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
y_true = [1,0,0]
y_predict = [.6,.1,.1]
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_predict , pos_label=1)
# Print ROC curve
y_true = [1,0,0]
y_predict = [.6,.1,.6]
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_predict , pos_label=1)
# Print ROC curve
the following roc curves are plotted :
scikit learn sets the thresholds but I would like to set custom thresholds.
For example, for values :
y_true = [1,0,0]
y_predict = [.6,.1,.6]
The following thresholds are returned :
[1.6 0.6 0.1]
Why does the value 1.6 not appear in the ROC curve? Is the threshold 1.6 redundant in this case, since the probabilities range from 0 to 1? Can custom thresholds such as .3, .5, .7 be set to check how well the classifier performs in this case?
Update :
From https://sachinkalsi.github.io/blog/category/ml/2018/08/20/top-8-performance-metrics-one-should-know.html#receiver-operating-characteristic-curve-roc I used same x and predicted values :
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
y_true = [1,1,1,0]
y_predict = [.94,.87,.83,.80]
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_predict , pos_label=1)
print('false positive rate:', fpr)
print('true positive rate:', tpr)
print('thresholds:', thresholds)
# Print ROC curve
which produces this plot :
The plot differs from the referenced plot in the blog, and the thresholds differ as well.
The thresholds returned by the scikit-learn metrics.roc_curve implementation are: thresholds: [0.94 0.83 0.8 ]. Should scikit-learn return a similar ROC curve, given that it uses the same points? Should I implement the ROC curve myself instead of relying on the scikit-learn implementation, since the results are different?
Thresholds won't appear in the ROC curve. The scikit-learn documentation says:
thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1
metrics.roc_curve has no parameter for custom thresholds; it uses the distinct values in y_predict as thresholds (some of them may be dropped when drop_intermediate=True, which is the default). So if y_predict contains 0.3, 0.5 and 0.7, those thresholds will be tried.
Typically these steps are followed when calculating a ROC curve:
1. Sort y_predict in descending order.
2. For each probability score (let's say τ_i) in y_predict, consider every data point with y_predict >= τ_i as positive.
P.S: If we have N data points, then we will have N thresholds (if the combinations of y_true and y_predict are unique)
3. For each threshold τ_i, calculate TPR & FPR.
4. Plot the ROC curve from the N (no. of data points) TPR, FPR pairs.
You can refer to this blog for detailed information
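If you do want to evaluate specific thresholds such as .3, .5 and .7, the steps above are easy to apply by hand. The following is a minimal sketch; roc_points is a hypothetical helper, not a scikit-learn function, and it assumes binary labels with at least one positive and one negative sample.

```python
import numpy as np

def roc_points(y_true, y_score, thresholds):
    """Return (threshold, fpr, tpr) for each user-supplied threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = y_true.sum()       # number of actual positives
    n_neg = (~y_true).sum()    # number of actual negatives
    points = []
    for t in thresholds:
        y_pred = y_score >= t  # predict positive at or above the threshold
        tpr = (y_pred & y_true).sum() / n_pos
        fpr = (y_pred & ~y_true).sum() / n_neg
        points.append((t, float(fpr), float(tpr)))
    return points

# Values from the question
points = roc_points([1, 0, 0], [.6, .1, .6], thresholds=[.3, .5, .7])
for t, fpr, tpr in points:
    print("threshold=%.1f  fpr=%.2f  tpr=%.2f" % (t, fpr, tpr))
```

With these values, the thresholds .3 and .5 give the same point because no score lies between them, which is why roc_curve only needs the observed scores as thresholds.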
I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and compute a p-value to assess statistical significance.
Below is a minimal example in scikit-learn which trains three different models on a binary classification dataset, plots the ROC curves and calculates the AUC scores.
Here are my specific questions:
How to calculate the 95% confidence interval (CI) of the ROC AUC scores on the test set? (e.g. with bootstrapping).
How to compare the AUC scores (on test set) and measure the p-value to assess statistical significance? (The null hypothesis is that the models are not different. Rejecting the null hypothesis means the difference in AUC scores is statistically significant.)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import matplotlib
import matplotlib.pyplot as plt
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=17)
# Naive Bayes Classifier
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
nb_prediction_proba = nb_clf.predict_proba(X_test)[:, 1]
# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=20)
rf_clf.fit(X_train, y_train)
rf_prediction_proba = rf_clf.predict_proba(X_test)[:, 1]
# Multi-layer Perceptron Classifier
mlp_clf = MLPClassifier(alpha=1, hidden_layer_sizes=150)
mlp_clf.fit(X_train, y_train)
mlp_prediction_proba = mlp_clf.predict_proba(X_test)[:, 1]
def roc_curve_and_score(y_test, pred_proba):
    fpr, tpr, _ = roc_curve(y_test.ravel(), pred_proba.ravel())
    roc_auc = roc_auc_score(y_test.ravel(), pred_proba.ravel())
    return fpr, tpr, roc_auc
plt.figure(figsize=(8, 6))
matplotlib.rcParams.update({'font.size': 14})
fpr, tpr, roc_auc = roc_curve_and_score(y_test, rf_prediction_proba)
plt.plot(fpr, tpr, color='darkorange', lw=2,
label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, nb_prediction_proba)
plt.plot(fpr, tpr, color='green', lw=2,
label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, mlp_prediction_proba)
plt.plot(fpr, tpr, color='crimson', lw=2,
label='ROC AUC={0:.3f}'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.legend(loc="lower right")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity')
Bootstrap for 95% confidence interval
You want to repeat your analysis on multiple resamplings of your data. In the general case, assume you have a function f(x) that determines whatever statistic you need from data x and you can bootstrap like this:
def bootstrap(x, f, nsamples=1000):
    stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])]) for _ in range(nsamples)]
    return np.percentile(stats, (2.5, 97.5))
This gives you so-called plug-in estimates of the 95% confidence interval (i.e. you just take the percentiles of the bootstrap distribution).
In your case, you can write a more specific function like this
def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    auc_values = []
    for b in range(nsamples):
        # resample the training set with replacement and refit
        idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
        clf.fit(X_train[idx], y_train[idx])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
        auc_values.append(roc_auc)
    return np.percentile(auc_values, (2.5, 97.5))
Here, clf is the classifier for which you want to test the performance and X_train, y_train, X_test, y_test are like in your code.
This gives me the following confidence intervals (rounded to three digits, 1000 bootstrap samples):
Naive Bayes: 0.986 [0.980 0.988] (estimate, lower and upper limit of confidence interval)
Random Forest: 0.983 [0.974 0.989]
Multilayer Perceptron: 0.974 [0.223 0.98]
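The function above refits the classifier on resampled training data. If you instead want to keep the fitted model fixed and only quantify the uncertainty that comes from the finite test set, you can resample the (label, score) pairs of the test set. A minimal sketch under assumptions: bootstrap_auc_ci is a hypothetical helper, and the labels and scores are synthetic stand-ins for y_test and one of the *_prediction_proba arrays.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, nsamples=1000, seed=0):
    """Percentile bootstrap CI of the AUC, resampling test-set pairs."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    aucs = []
    while len(aucs) < nsamples:
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC is undefined if only one class was drawn; redraw
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, (2.5, 97.5))

# Synthetic stand-in data: positives tend to receive higher scores
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)
y_score = y_true * 0.5 + rng.random(200)

ci_low, ci_high = bootstrap_auc_ci(y_true, y_score)
print("AUC = %.3f, 95%% CI [%.3f, %.3f]"
      % (roc_auc_score(y_true, y_score), ci_low, ci_high))
```

This variant is much cheaper because no model is refitted, but it answers a slightly different question: it ignores the variability due to the training sample.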
Permutation tests to test against chance performance
A permutation test would technically go over all permutations of your observation sequence and evaluate your ROC curve with the permuted target values (features are not permuted). This is OK if you have only a few observations, but it becomes very costly if you have more. It is therefore common to subsample the permutations and simply evaluate a number of random ones. Here, the implementation depends a bit more on the specific thing you want to test. The following function does that for your roc_auc values:
def permutation_test(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    idx1 = np.arange(X_train.shape[0])
    idx2 = np.arange(X_test.shape[0])
    auc_values = np.empty(nsamples)
    for b in range(nsamples):
        np.random.shuffle(idx1)  # Shuffles in-place
        clf.fit(X_train, y_train[idx1])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test[idx2].ravel(), pred.ravel())
        auc_values[b] = roc_auc
    clf.fit(X_train, y_train)
    pred = clf.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
    return roc_auc, np.mean(auc_values >= roc_auc)
This function again takes your classifier as clf and returns the AUC value on the unshuffled data and the p-value (i.e. the probability of observing an AUC value larger than or equal to what you have in the unshuffled data).
Running this with 1000 samples gives p-values of 0 for all three classifiers. Note that these are not exact because of the sampling, but they are an indication that all of these classifiers perform better than chance.
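A p-value of exactly 0 is an artifact of the finite number of permutations: with 1000 samples the estimate cannot resolve anything much below 1/1000. A common convention, which is an addition here and not part of the function above, is to count the observed statistic as one of the permutations so that the reported p-value is never zero. A minimal sketch with a hypothetical helper and made-up null values:

```python
import numpy as np

def permutation_p_value(null_values, observed):
    """One-sided permutation p-value with the add-one correction."""
    null_values = np.asarray(null_values)
    n_extreme = np.sum(null_values >= observed)
    # +1 in numerator and denominator counts the observed value itself
    return (n_extreme + 1) / (len(null_values) + 1)

# Made-up AUC values under shuffled labels and an observed AUC
p = permutation_p_value([0.48, 0.52, 0.50, 0.55], observed=0.97)
print(p)
```

Applied to the 1000 shuffled AUC values from permutation_test, this would report roughly 0.001 instead of 0 when no shuffled AUC reaches the observed one.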
Permutation test for differences between classifiers
This is much easier. Given two classifiers, you have a prediction from each of them for every observation. You just shuffle the assignment between predictions and classifiers like this:
def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=1000):
    auc_differences = []
    auc1 = roc_auc_score(y_test.ravel(), pred_proba_1.ravel())
    auc2 = roc_auc_score(y_test.ravel(), pred_proba_2.ravel())
    observed_difference = auc1 - auc2
    for _ in range(nsamples):
        # for each observation, randomly swap the two classifiers' predictions
        mask = np.random.randint(2, size=len(pred_proba_1.ravel()))
        p1 = np.where(mask, pred_proba_1.ravel(), pred_proba_2.ravel())
        p2 = np.where(mask, pred_proba_2.ravel(), pred_proba_1.ravel())
        auc1 = roc_auc_score(y_test.ravel(), p1)
        auc2 = roc_auc_score(y_test.ravel(), p2)
        auc_differences.append(auc1 - auc2)
    # convert to an array so the comparison is element-wise
    return observed_difference, np.mean(np.array(auc_differences) >= observed_difference)
With this test and 1000 samples, I find no significant differences between the three classifiers:
Naive bayes vs random forest: diff=0.0029, p(diff>)=0.311
Naive bayes vs MLP: diff=0.0117, p(diff>)=0.186
random forest vs MLP: diff=0.0088, p(diff>)=0.203
Here diff denotes the difference in ROC AUC between the two classifiers, and p(diff>) is the empirical probability of observing a larger difference on a shuffled data set.
One can use the code given below to compute the AUC and asymptotic normally distributed confidence interval for Neural Nets.
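One common normal-approximation interval for the AUC is the Hanley-McNeil formula, which needs only the AUC and the number of positive and negative samples. The following is a minimal sketch of that formula and may differ from the approach this answer had in mind; auc_normal_ci is a hypothetical helper, and it works on any array of scores regardless of which model (neural net or otherwise) produced them.

```python
import math
from sklearn.metrics import roc_auc_score

def auc_normal_ci(y_true, y_score, z=1.96):
    """AUC with a Hanley-McNeil normal-approximation confidence interval."""
    a = float(roc_auc_score(y_true, y_score))
    n_pos = sum(1 for label in y_true if label == 1)
    n_neg = len(y_true) - n_pos
    q1 = a / (2 - a)
    q2 = 2 * a ** 2 / (1 + a)
    var = (a * (1 - a)
           + (n_pos - 1) * (q1 - a ** 2)
           + (n_neg - 1) * (q2 - a ** 2)) / (n_pos * n_neg)
    se = math.sqrt(max(var, 0.0))  # guard against tiny negative round-off
    # Clip to [0, 1] because the normal approximation can overshoot
    return a, max(0.0, a - z * se), min(1.0, a + z * se)

# Tiny illustrative data set (labels must be 0/1 here)
auc_value, ci_low, ci_high = auc_normal_ci([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_value, ci_low, ci_high)
```

The approximation is asymptotic, so on very small test sets like the one above the interval is wide and a bootstrap interval is usually the safer choice.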
I have the following scikit-learn machine learning pipeline:
import numpy as np
from sklearn import svm
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=6)
classifier = svm.SVC(kernel='linear', probability=True)
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 0
for train, test in cv.split(X, y):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    i += 1
Now I would also like to calculate (and plot) the confusion matrix. How can this be done with the above code? I'm only getting probabilities (which I need for calculating AUC). I have 4 classes (1...4).
You can use this example here to plot confusion matrix:
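But for this you need discrete class values, not probabilities. These can easily be derived from your probas_ variable. Note that np.argmax returns column indices (0...3) while your labels are 1...4, so map the indices back to labels through classifier.classes_:
y_pred = classifier.classes_[np.argmax(probas_, axis=1)]
Now you can pass this y_pred to the confusion matrix function.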
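Inside the cross-validation loop you get one confusion matrix per fold; summing them gives a single matrix over all held-out samples. A minimal sketch under assumptions: the data comes from make_classification as a stand-in for your X and y (shifted so the labels are 1...4), random_state is fixed only for reproducibility, and total_cm is an illustrative name.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

# Stand-in data with four classes, relabelled to 1...4 (assumption)
X, y = make_classification(n_samples=240, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)
y = y + 1

cv = StratifiedKFold(n_splits=6)
classifier = svm.SVC(kernel='linear', probability=True, random_state=0)

labels = np.unique(y)
total_cm = np.zeros((len(labels), len(labels)), dtype=np.int64)
for train, test in cv.split(X, y):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # Column index of the highest probability, mapped back to the label
    y_pred = classifier.classes_[np.argmax(probas_, axis=1)]
    # Fixing `labels` keeps the row/column order identical across folds
    total_cm += confusion_matrix(y[test], y_pred, labels=labels)

print(total_cm)
```

Rows are true classes and columns are predicted classes. For plotting, recent scikit-learn versions provide sklearn.metrics.ConfusionMatrixDisplay, which can be constructed from total_cm and the labels.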