Related
I'm trying to draw a roc curve for multiclass classification.
At first I calculate y_pred and y_proba using the following code
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 0)
# training a DescisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
y_pred = dtree_model.predict(X_test)
y_proba= dtree_model.predict_proba(X_test)
After that I use the following function to calculate tpr and fpr
from sklearn.metrics import confusion_matrix
def calculate_tpr_fpr(y_test, y_pred):
'''
Calculates the True Positive Rate (tpr) and the True Negative Rate (fpr) based on real and predicted observations
Args:
y_real: The list or series with the real classes
y_pred: The list or series with the predicted classes
Returns:
tpr: The True Positive Rate of the classifier
fpr: The False Positive Rate of the classifier
'''
# Calculates the confusion matrix and recover each element
cm = confusion_matrix(y_test, y_pred)
TN = cm[0, 0]
FP = cm[0, 1]
FN = cm[1, 0]
TP = cm[1, 1]
# Calculates tpr and fpr
tpr = TP / (TP + FN) # sensitivity - true positive rate
fpr = 1 - TN / (TN + FP) # 1-specificity - false positive rate
return tpr, fpr
Then, I try using this function to calculate a list of fpr and tpr to draw the curve
def get_all_roc_coordinates(y_test, y_proba):
'''
Calculates all the ROC Curve coordinates (tpr and fpr) by considering each point as a treshold for the predicion of the class.
Args:
y_test: The list or series with the real classes.
y_proba: The array with the probabilities for each class, obtained by using the `.predict_proba()` method.
Returns:
tpr_list: The list of TPRs representing each threshold.
fpr_list: The list of FPRs representing each threshold.
'''
tpr_list = [0]
fpr_list = [0]
for i in range(len(y_proba)):
threshold = y_proba[i]
y_pred = y_proba = threshold
tpr, fpr = calculate_tpr_fpr(y_test, y_pred)
tpr_list.append(tpr)
fpr_list.append(fpr)
return tpr_list, fpr_list
but it gives me the following error
ValueError: Classification metrics can't handle a mix of multiclass and multilabel-indicator targets
Note that the Y column is multiclass {0,1,2}. I also tried to ensure that y is string not integer, but it gives me the same error.
You've got 3 classes but you only use 2 classes in your calculate_tpr_fpr(). Also, you probably meant y_pred = y_proba > threshold. Either way, it won't be that easy since you've got 3 columns of class scores. The easiest way seems to be drawing one vs rest curves, treating each column individually:
from sklearn.metrics import roc_curve
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
classes = range(y_proba.shape[1])
for i in classes:
fpr, tpr, _ = roc_curve(label_binarize(y_test, classes=classes)[:,i], y_proba[:,i])
plt.plot(fpr, tpr, alpha=0.7)
plt.legend(classes)
I would like to evaluate my model's ability to discriminate between people with prediabetes (hba1c 5.7-6.4%) and diabetes type 2 (hba1c > 6.4%)
My outcome label (y_test) is hba1c>5.7%, defining unhealthy people with undiagnosed diabetes or prediabetic conditions.
How do I separate the two ranges, compare the predicted values with actual values and calculate the sensitivity?
The present example is according to the logistic regression model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
X_train = X_train[feat_cols]
X_test = X_test[feat_cols]
# Building LGR and evaluating on training data
LGR = LogisticRegression(max_iter=100, random_state=1)
def evaluate_model(LGR, X_test, y_test):
# Predict Test Data
y_pred = LGR.predict(X_test)
# Calculate accuracy, precision, sensitivity and specificity
acc = metrics.accuracy_score(y_test, y_pred)
prec = metrics.precision_score(y_test, y_pred)
sen = metrics.recall_score(y_test, y_pred, pos_label=1)
spe = metrics.recall_score(y_test, y_pred, pos_label=0)
# Calculate area under curve (AUC)
y_pred_proba = LGR.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
# Display confussion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
return {'acc': acc, 'prec': prec, 'sen': sen, 'spe': spe,
'fpr': fpr, 'tpr': tpr, 'auc': auc, 'cm': cm}
LGR_eval = evaluate_model(LGR, X_test, y_test)
# Print result
print('Accuracy:', LGR_eval['acc'])
print('Precision:', LGR_eval['prec'])
print('Sensitivity:', LGR_eval['sen'])
print('Specificity:', LGR_eval['spe'])
print('Area Under Curve:', LGR_eval['auc'])
print('Confusion Matrix:\n', LGR_eval['cm'])```
Accuracy: 0.7315175097276264
Precision: 0.711340206185567
Sensitivity: 0.7439353099730458
Specificity: 0.72
Area Under Curve: 0.8036994609164421
Confusion Matrix:
[[288 112]
[ 95 276]]
(Answering comment above)
As you do not use a linear output but the logistical regression it will be difficult to achieve what you ask for without making quite some changes.
Either you change your model to a linear regression model and predict the hbac-value and after that just classify the prediction in the < 5.7, 5.7 - 6.4 range or > 6.4 range. This way you can use your metrics which you've used above.
The other way around is dependant on the Y dataset, does it contain any labels regarding the different conditions or is it just labeled healthy / unhealthy? If you were to add another label which corresponds to the values (alot like above) then you can turn your model to an multi-output prediction model and still use the logistical regression, and then investigate your metrics for the requested classes.
Edit in response to comment:
Below is the function from comments.
def IsAtRisk(x):
if x < 5.7:
return 0
return 1
df['IsAtRisk'] = df['LBXGH'].map(IsAtRisk)
print(df)
print(f"{len(df[df['IsAtRisk'] == True])} of {len(df)} people are at risk")
If you instead include the range that you ask for and apply another class you'll have the labels for the diffrent classes and can measure with the metrics regarding how the model performs.
def IsAtRisk(x):
if x < 5.7:
return 0
elif 5.7 < x < 6.4:
return 1
return 2
But for this to work you should most likely format the samples in a different way depending on your model structure. If you would share your model structure and the output-layer it would help.
Most likely you will want to restructure your Y labels to
y_sample = [1, 0, 0] # This is the probability 1 of class 0 which may be the health individuals in your dataset
# With this in mind you can correct the return values from the above function to return arrays of the labels instead of ints.
I am trying to apply the idea of sklearn ROC extension to multiclass to my dataset. My per-class ROC curve looks find of a straight line each, unline the sklearn's example showing curve's fluctuating.
I give an MWE below to show what I mean:
# all imports
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# dummy dataset
X, y = make_classification(10000, n_classes=5, n_informative=10, weights=[.04, .4, .12, .5, .04])
train, test, ytrain, ytest = train_test_split(X, y, test_size=.3, random_state=42)
# random forest model
model = RandomForestClassifier()
model.fit(train, ytrain)
yhat = model.predict(test)
The following function then plots the ROC curve:
def plot_roc_curve(y_test, y_pred):
n_classes = len(np.unique(y_test))
y_test = label_binarize(y_test, classes=np.arange(n_classes))
y_pred = label_binarize(y_pred, classes=np.arange(n_classes))
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_pred.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
# Finally average it and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
# Plot all ROC curves
#plt.figure(figsize=(10,5))
plt.figure(dpi=600)
lw = 2
plt.plot(fpr["micro"], tpr["micro"],
label="micro-average ROC curve (area = {0:0.2f})".format(roc_auc["micro"]),
color="deeppink", linestyle=":", linewidth=4,)
plt.plot(fpr["macro"], tpr["macro"],
label="macro-average ROC curve (area = {0:0.2f})".format(roc_auc["macro"]),
color="navy", linestyle=":", linewidth=4,)
colors = cycle(["aqua", "darkorange", "darkgreen", "yellow", "blue"])
for i, color in zip(range(n_classes), colors):
plt.plot(fpr[i], tpr[i], color=color, lw=lw,
label="ROC curve of class {0} (area = {1:0.2f})".format(i, roc_auc[i]),)
plt.plot([0, 1], [0, 1], "k--", lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) curve")
plt.legend()
Output:
plot_roc_curve(ytest, yhat)
Kind of straight line bending once. I would like to see the model performance at different thresholds, not just one, a figure similar to sklearn's illustration for 3-classes shown below:
Point is that you're using predict() rather than predict_proba()/decision_function() to define your y_hat. This means - considering that the threshold vector is defined by the number of distinct values in y_hat (see here for reference), that you'll have few thresholds per class only on which tpr and fpr are computed (which in turn implies that your curves are evaluated at few points only).
Indeed, consider what the doc says to pass to y_scores in roc_curve(), either prob estimates or decision values. In the example from sklearn, decision values are used to compute the scores. Given that you're considering a RandomForestClassifier(), considering probability estimates in your y_hat should be the way to go.
What's the point then of label-binarizing the output? The standard definition for ROC is in terms of binary classification. To pass to a multiclass problem, you have to convert your problem into binary by using OneVsAll approach, so that you'll have n_class number of ROC curves. (Observe, indeed, that as SVC() handles multiclass problems in a OvO fashion by default, in the example they had to force to use OvA by applying OneVsRestClassifier constructor; with a RandomForestClassifier you don't have such problem as that's inherently multiclass, see here for reference). In these terms, once you switch to predict_proba() you'll see there's no much sense in label binarizing predictions.
# all imports
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# dummy dataset
X, y = make_classification(10000, n_classes=5, n_informative=10, weights=[.04, .4, .12, .5, .04])
train, test, ytrain, ytest = train_test_split(X, y, test_size=.3, random_state=42)
# random forest model
model = RandomForestClassifier()
model.fit(train, ytrain)
yhat = model.predict_proba(test)
def plot_roc_curve(y_test, y_pred):
n_classes = len(np.unique(y_test))
y_test = label_binarize(y_test, classes=np.arange(n_classes))
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
thresholds = dict()
for i in range(n_classes):
fpr[i], tpr[i], thresholds[i] = roc_curve(y_test[:, i], y_pred[:, i], drop_intermediate=False)
roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_pred.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
# Finally average it and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
# Plot all ROC curves
#plt.figure(figsize=(10,5))
plt.figure(dpi=600)
lw = 2
plt.plot(fpr["micro"], tpr["micro"],
label="micro-average ROC curve (area = {0:0.2f})".format(roc_auc["micro"]),
color="deeppink", linestyle=":", linewidth=4,)
plt.plot(fpr["macro"], tpr["macro"],
label="macro-average ROC curve (area = {0:0.2f})".format(roc_auc["macro"]),
color="navy", linestyle=":", linewidth=4,)
colors = cycle(["aqua", "darkorange", "darkgreen", "yellow", "blue"])
for i, color in zip(range(n_classes), colors):
plt.plot(fpr[i], tpr[i], color=color, lw=lw,
label="ROC curve of class {0} (area = {1:0.2f})".format(i, roc_auc[i]),)
plt.plot([0, 1], [0, 1], "k--", lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) curve")
plt.legend()
Eventually, consider that roc_curve() has also a drop_intermediate parameter meant for dropping suboptimal thresholds (it might be useful to know).
Just to update on #amiola answer: I had an issue with non-monotonic classes which lead to very strange fuzzy results. In this case a little modification to the function above will work very well:
classes = sorted(list(y_test['label'].unique()))
Use this in the label_binarize line:
y_test = label_binarize(y_test, classes=classes)
And then when you need a range in the function, just use:
range(len(classes))
Thanks to #dx2-66 for the answer. You can check for more details here.
I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and p-value to access statistical significance.
Below is a minimal example in scikit-learn which trains three different models on a binary classification dataset, plots the ROC curves and calculates the AUC scores.
Here are my specific questions:
How to calculate the 95% confidence interval (CI) of the ROC AUC scores on the test set? (e.g. with bootstrapping).
How to compare the AUC scores (on test set) and measure the p-value to assess statistical significance? (The null hypothesis is that the models are not different. Rejecting the null hypothesis means the difference in AUC scores is statistically significant.)
.
import numpy as np
np.random.seed(2018)
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import matplotlib
import matplotlib.pyplot as plt
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=17)
# Naive Bayes Classifier
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
nb_prediction_proba = nb_clf.predict_proba(X_test)[:, 1]
# Ranodm Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=20)
rf_clf.fit(X_train, y_train)
rf_prediction_proba = rf_clf.predict_proba(X_test)[:, 1]
# Multi-layer Perceptron Classifier
mlp_clf = MLPClassifier(alpha=1, hidden_layer_sizes=150)
mlp_clf.fit(X_train, y_train)
mlp_prediction_proba = mlp_clf.predict_proba(X_test)[:, 1]
def roc_curve_and_score(y_test, pred_proba):
fpr, tpr, _ = roc_curve(y_test.ravel(), pred_proba.ravel())
roc_auc = roc_auc_score(y_test.ravel(), pred_proba.ravel())
return fpr, tpr, roc_auc
plt.figure(figsize=(8, 6))
matplotlib.rcParams.update({'font.size': 14})
plt.grid()
fpr, tpr, roc_auc = roc_curve_and_score(y_test, rf_prediction_proba)
plt.plot(fpr, tpr, color='darkorange', lw=2,
label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, nb_prediction_proba)
plt.plot(fpr, tpr, color='green', lw=2,
label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, mlp_prediction_proba)
plt.plot(fpr, tpr, color='crimson', lw=2,
label='ROC AUC={0:.3f}'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.legend(loc="lower right")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity')
plt.ylabel('Sensitivity')
plt.show()
Bootstrap for 95% confidence interval
You want to repeat your analysis on multiple resamplings of your data. In the general case, assume you have a function f(x) that determines whatever statistic you need from data x and you can bootstrap like this:
def bootstrap(x, f, nsamples=1000):
stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])]) for _ in range(nsamples)]
return np.percentile(stats, (2.5, 97.5))
This gives you so-called plug-in estimates of the 95% confidence interval (i.e. you just take the percentiles of the bootstrap distribution).
In your case, you can write a more specific function like this
def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
auc_values = []
for b in range(nsamples):
idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
clf.fit(X_train[idx], y_train[idx])
pred = clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
auc_values.append(roc_auc)
return np.percentile(auc_values, (2.5, 97.5))
Here, clf is the classifier for which you want to test the performance and X_train, y_train, X_test, y_test are like in your code.
This gives me the following confidence intervals (rounded to three digits, 1000 bootstrap samples):
Naive Bayes: 0.986 [0.980 0.988] (estimate, lower and upper limit of confidence interval)
Random Forest: 0.983 [0.974 0.989]
Multilayer Perceptron: 0.974 [0.223 0.98]
Permutation tests to test against chance performance
A permutation test would technically go over all permutations of your observation sequence and evaluate your roc curve with the permuted target values (features are not permuted). This is ok if you have a few observations, but it becomes very costly if you more observations. It is therefore common to subsample the number of permutations and simply do a number of random permutations. Here, the implementation depends a bit more on the specific thing you want to test. The following function does that for your roc_auc values
def permutation_test(clf, X_train, y_train, X_test, y_test, nsamples=1000):
idx1 = np.arange(X_train.shape[0])
idx2 = np.arange(X_test.shape[0])
auc_values = np.empty(nsamples)
for b in range(nsamples):
np.random.shuffle(idx1) # Shuffles in-place
np.random.shuffle(idx2)
clf.fit(X_train, y_train[idx1])
pred = clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test[idx2].ravel(), pred.ravel())
auc_values[b] = roc_auc
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
return roc_auc, np.mean(auc_values >= roc_auc)
This function again takes your classifier as clf and returns the AUC value on the unshuffled data and the p-value (i.e. probability to observe an AUC value larger than or equal to what you have in the unshuffled data).
Running this with 1000 samples gives p-values of 0 for all three classifiers. Note that these are not exact because of the sampling, but they are an indicating that all of these classifiers perform better than chance.
Permutation test for differences between classifiers
This is much easier. Given two classifiers, you have prediction for every observation. You just shuffle the assignment between predictions and classifiers like this
def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=1000):
auc_differences = []
auc1 = roc_auc_score(y_test.ravel(), pred_proba_1.ravel())
auc2 = roc_auc_score(y_test.ravel(), pred_proba_2.ravel())
observed_difference = auc1 - auc2
for _ in range(nsamples):
mask = np.random.randint(2, size=len(pred_proba_1.ravel()))
p1 = np.where(mask, pred_proba_1.ravel(), pred_proba_2.ravel())
p2 = np.where(mask, pred_proba_2.ravel(), pred_proba_1.ravel())
auc1 = roc_auc_score(y_test.ravel(), p1)
auc2 = roc_auc_score(y_test.ravel(), p2)
auc_differences.append(auc1 - auc2)
return observed_difference, np.mean(auc_differences >= observed_difference)
With this test and 1000 samples, I find no significant differences between the three classifiers:
Naive bayes vs random forest: diff=0.0029, p(diff>)=0.311
Naive bayes vs MLP: diff=0.0117, p(diff>)=0.186
random forest vs MLP: diff=0.0088, p(diff>)=0.203
Where diff denotes the difference in roc curves between the two classifiers and p(diff>) is the empirical probability to observe a larger difference on a shuffled data set.
One can use the code given below to compute the AUC and asymptotic normally distributed confidence interval for Neural Nets.
tf.contrib.metrics.auc_with_confidence_intervals(
labels,
predictions,
weights=None,
alpha=0.95,
logit_transformation=True,
metrics_collections=(),
updates_collections=(),
name=None)
I have trained a binary-classes CNN in Caffe, and now i want to plot the ROC curve and calculate the AUC value. I have two quetions:
1) How to plot the ROC curve in Caffe with python?
2) How to calculate the AUC value of the ROC curve?
Python has roc_curve and roc_auc_score functions in sklearn.metrics module, just import and use them.
Assuming you have a binary prediction layer that outputs a two-vector of binary class probabilities (let's call it "prob") then your code should look something like:
import caffe
from sklearn import metrics
# load the net with trained weights
net = caffe.Net('/path/to/deploy.prototxt', '/path/to/weights.caffemodel', caffe.TEST)
y_score = []
y_true = []
for i in xrange(N): # assuming you have N validation samples
x_i = ... # get i-th validation sample
y_true.append( y_i ) # y_i is 0 or 1 the TRUE label of x_i
out = net.forward( data=x_i ) # get prediction for x_i
y_score.append( out['prob'][1] ) # get score for "1" class
# once you have N y_score and y_true values
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score, pos_label=1)
auc = metrics.roc_auc_score(y_true, y_scores)