How can I obtain the AUC value having fpr and tpr? Fpr and tpr are just 2 floats obtained from these formulas:
my_fpr = fp / (fp + tn)
my_tpr = tp / (tp + fn)
my_roc_auc = auc(my_fpr, my_tpr)
I know this isn't possible as written, because fpr and tpr are just single floats and auc expects arrays, but I can't figure out how to do it otherwise. I also know that I can compute AUC this way:
y_predict_proba = model.predict_proba(X_test)
probabilities = np.array(y_predict_proba)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probabilities)
roc_auc = auc(fpr, tpr)
but I want to avoid using predict_proba for some reasons. So my question is: how can I obtain AUC having fp, tp, fn, tn, fpr, tpr? In other words, is it possible to obtain AUC without roc_curve?
Yes, it is possible to obtain the AUC without calling roc_curve.
You first need to create the ROC (Receiver Operating Characteristics) curve. To be able to use the ROC curve, your classifier should be able to rank examples such that the ones with higher rank are more likely to be positive (e.g. fraudulent). As an example, Logistic Regression outputs probabilities, which is a score that you can use for ranking.
The ROC curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. As an example:
The model performance is determined by looking at the area under the ROC curve (the AUC).
You can find a more detailed explanation here.
With a single (FPR, TPR) point, the ROC curve consists of the two segments (0,0) → (FPR, TPR) → (1,1), so you can divide the area under it into two parts: a triangle and a trapezium. The triangle has area TPR*FPR/2, and the trapezium has area (1-FPR)*(1+TPR)/2 = 1/2 - FPR/2 + TPR/2 - TPR*FPR/2. The total area is therefore 1/2 - FPR/2 + TPR/2. This is how you can get the AUC from just one operating point.
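For completeness, here is a minimal sketch of that calculation as a function of the raw confusion-matrix counts (the counts in the example are made up):
from sklearn.metrics import auc

def auc_from_counts(tp, fp, fn, tn):
    # AUC of the two-segment ROC curve (0,0) -> (FPR, TPR) -> (1,1)
    tpr = tp / (tp + fn)  # true positive rate (sensitivity)
    fpr = fp / (fp + tn)  # false positive rate (1 - specificity)
    return 0.5 - fpr / 2 + tpr / 2

tp, fp, fn, tn = 80, 10, 20, 90  # hypothetical counts
print(auc_from_counts(tp, fp, fn, tn))                      # 0.85
# Cross-check with sklearn by padding the single point with (0,0) and (1,1):
print(auc([0, fp / (fp + tn), 1], [0, tp / (tp + fn), 1]))  # 0.85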
Related
I'm trying to draw a ROC curve for multiclass classification.
First, I calculate y_pred and y_proba using the following code:
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 0)
# training a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
y_pred = dtree_model.predict(X_test)
y_proba= dtree_model.predict_proba(X_test)
After that I use the following function to calculate tpr and fpr
from sklearn.metrics import confusion_matrix

def calculate_tpr_fpr(y_test, y_pred):
    '''
    Calculates the True Positive Rate (tpr) and the False Positive Rate (fpr) based on real and predicted observations
    Args:
        y_test: The list or series with the real classes
        y_pred: The list or series with the predicted classes
    Returns:
        tpr: The True Positive Rate of the classifier
        fpr: The False Positive Rate of the classifier
    '''
    # Calculate the confusion matrix and recover each element
    cm = confusion_matrix(y_test, y_pred)
    TN = cm[0, 0]
    FP = cm[0, 1]
    FN = cm[1, 0]
    TP = cm[1, 1]
    # Calculate tpr and fpr
    tpr = TP / (TP + FN)      # sensitivity - true positive rate
    fpr = 1 - TN / (TN + FP)  # 1 - specificity - false positive rate
    return tpr, fpr
Then, I try using this function to calculate a list of fpr and tpr to draw the curve
def get_all_roc_coordinates(y_test, y_proba):
    '''
    Calculates all the ROC Curve coordinates (tpr and fpr) by considering each point as a threshold for the prediction of the class.
    Args:
        y_test: The list or series with the real classes.
        y_proba: The array with the probabilities for each class, obtained by using the `.predict_proba()` method.
    Returns:
        tpr_list: The list of TPRs representing each threshold.
        fpr_list: The list of FPRs representing each threshold.
    '''
    tpr_list = [0]
    fpr_list = [0]
    for i in range(len(y_proba)):
        threshold = y_proba[i]
        y_pred = y_proba = threshold
        tpr, fpr = calculate_tpr_fpr(y_test, y_pred)
        tpr_list.append(tpr)
        fpr_list.append(fpr)
    return tpr_list, fpr_list
but it gives me the following error
ValueError: Classification metrics can't handle a mix of multiclass and multilabel-indicator targets
Note that the Y column is multiclass {0,1,2}. I also tried to ensure that y is string not integer, but it gives me the same error.
You've got 3 classes but you only use 2 classes in your calculate_tpr_fpr(). Also, you probably meant y_pred = y_proba > threshold. Either way, it won't be that easy since you've got 3 columns of class scores. The easiest way seems to be drawing one-vs-rest curves, treating each column individually:
from sklearn.metrics import roc_curve
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt

classes = range(y_proba.shape[1])
for i in classes:
    fpr, tpr, _ = roc_curve(label_binarize(y_test, classes=classes)[:, i], y_proba[:, i])
    plt.plot(fpr, tpr, alpha=0.7)
plt.legend(classes)
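If you also want a number per class rather than just the curves, the same one-vs-rest split works for the AUC. A sketch under the assumption that your scikit-learn version supports multi_class='ovr' in roc_auc_score and that y_proba contains per-class probabilities that sum to 1 per row:
from sklearn.metrics import auc, roc_auc_score

for i in classes:
    fpr, tpr, _ = roc_curve(label_binarize(y_test, classes=classes)[:, i], y_proba[:, i])
    print(f"class {i}: one-vs-rest AUC = {auc(fpr, tpr):.3f}")

# Or as a single macro-averaged score over all classes
print(roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro'))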
Using this code:
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
y_true = [1,0,0]
y_predict = [.6,.1,.1]
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_predict , pos_label=1)
print(fpr)
print(tpr)
print(thresholds)
# Print ROC curve
plt.plot(fpr,tpr)
plt.show()
y_true = [1,0,0]
y_predict = [.6,.1,.6]
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_predict , pos_label=1)
print(fpr)
print(tpr)
print(thresholds)
# Print ROC curve
plt.plot(fpr,tpr)
plt.show()
the following ROC curves are plotted:
scikit-learn sets the thresholds, but I would like to set custom thresholds.
For example, for values :
y_true = [1,0,0]
y_predict = [.6,.1,.6]
The following thresholds are returned:
[1.6 0.6 0.1]
Why does the value 1.6 not appear in the ROC curve? Is the threshold 1.6 redundant in this case, since the probabilities range from 0 to 1? Can custom thresholds (e.g. .3, .5, .7) be set to check how well the classifier performs at them?
Update:
From https://sachinkalsi.github.io/blog/category/ml/2018/08/20/top-8-performance-metrics-one-should-know.html#receiver-operating-characteristic-curve-roc I used the same x and predicted values:
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
y_true = [1,1,1,0]
y_predict = [.94,.87,.83,.80]
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_predict , pos_label=1)
print('false positive rate:', fpr)
print('true positive rate:', tpr)
print('thresholds:', thresholds)
# Print ROC curve
plt.plot(fpr,tpr)
plt.show()
which produces this plot:
The plot is different from the one referenced in the blog, and the thresholds are different too.
The thresholds returned by scikit-learn's metrics.roc_curve are: thresholds: [0.94 0.83 0.8 ]. Shouldn't scikit-learn return a similar ROC curve, since it is using the same points? Should I implement the ROC curve myself instead of relying on the scikit-learn implementation, given that the results are different?
Thresholds won't appear in the ROC curve. The scikit-learn documentation says:
thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1
If y_predict contains 0.3, 0.5, 0.7, then those thresholds will be tried by the metrics.roc_curve function.
Typically these steps are followed while calculating the ROC curve:
1. Sort y_predict in descending order.
2. For each of the probability scores (let's say τ_i) in y_predict, if y_predict >= τ_i, then consider that data point as positive.
P.S.: If we have N data points, then we will have N thresholds (if the combinations of y_true and y_predict are unique).
3. For each of the y_predict (τ_i) values, calculate TPR & FPR.
4. Plot the ROC by taking the N (no. of data points) (TPR, FPR) pairs.
You can refer to this blog for detailed information.
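If you want to evaluate specific thresholds such as .3, .5 and .7 yourself, you can compute the (FPR, TPR) pair for each of them directly instead of relying on the thresholds roc_curve picks. A minimal sketch, reusing y_true and y_predict from above (note that for these particular scores all three thresholds give the same point, which is exactly the kind of redundancy roc_curve collapses):
import numpy as np

y_true = np.array([1, 1, 1, 0])
y_predict = np.array([.94, .87, .83, .80])

for thr in [.3, .5, .7]:
    y_hat = (y_predict >= thr).astype(int)   # classify as positive at this threshold
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    print(thr, 'TPR =', tp / (tp + fn), 'FPR =', fp / (fp + tn))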
I want to plot a ROC curve for evaluating a trained Nearest Centroid classifier.
My code works for Naive Bayes, SVM, kNN and DT but I get an exception whenever I try to plot the curve for Nearest Centroid, because the estimator has no .predict_proba() method:
AttributeError: 'NearestCentroid' object has no attribute 'predict_proba'
The code for plotting the curve is
def plot_roc(self):
    plt.clf()
    for label, estimator in self.roc_estimators.items():
        estimator.fit(self.data_train, self.target_train)
        proba_for_each_class = estimator.predict_proba(self.data_test)
        fpr, tpr, thresholds = roc_curve(self.target_test, proba_for_each_class[:, 1])
        plt.plot(fpr, tpr, label=label)
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8)
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.legend()
    plt.show()
self.roc_estimators is a dict where I store the trained estimators with the label of the classifier like this
cl_label = "kNN"
knn_estimator = KNeighborsClassifier(algorithm='ball_tree', p=2, n_neighbors=5)
knn_estimator.fit(self.data_train, self.target_train)
self.roc_estimators[cl_label] = knn_estimator
and for Nearest Centroid respectively
cl_label = "Nearest Centroid"
nc_estimator = NearestCentroid(metric='euclidean', shrink_threshold=6)
nc_estimator.fit(self.data_train, self.target_train)
self.roc_estimators[cl_label] = nc_estimator
So it works for all the classifiers I tried, but not for Nearest Centroid. Is there something about the nature of the Nearest Centroid classifier that I am missing which explains why it is not possible to plot the ROC curve (more specifically, why the estimator does not have a .predict_proba() method)? Thank you in advance!
You need a "score" for each prediction to make the ROC curve. This could be the predicted probability of belonging to one class.
See e.g. https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Curves_in_ROC_space
Just looking for the nearest centroid will give you predicted class, but not the probability.
EDIT: For NearestCentroid it is not possible to compute a score. This is simply a limitation of the model. It assigns a class to each sample, but not a probability of that class. I guess if you need to use Nearest Centroid and you want a probability, you can use some ensemble method. Train a bunch of models on subsets of your training data, and average their predictions on your test set. That could give you a score. See scikit-learn.org/stable/modules/ensemble.html#bagging
To get the class probabilities you can do something like (untested code):
from sklearn.utils.extmath import softmax
from sklearn.metrics.pairwise import pairwise_distances

def predict_proba(self, X):
    # Negate the distances so that a smaller distance to a centroid
    # translates into a larger pseudo-probability for that class.
    distances = pairwise_distances(X, self.centroids_, metric=self.metric)
    probs = softmax(-distances)
    return probs

clf = NearestCentroid()
clf.fit(X_train, y_train)
predict_proba(clf, X_test)
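These pseudo-probabilities can then be fed into roc_curve just like the output of a built-in predict_proba. A usage sketch for the binary case, reusing clf, X_test and y_test from above and assuming the positive class is in column 1:
from sklearn.metrics import roc_curve, auc

probs = predict_proba(clf, X_test)
fpr, tpr, _ = roc_curve(y_test, probs[:, 1])
print('AUC:', auc(fpr, tpr))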
I have the following code:
from sklearn.metrics import roc_curve, auc
actual = [1,1,1,0,0,1]
prediction_scores = [0.9,0.9,0.9,0.1,0.1,0.1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, prediction_scores, pos_label=1)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
# 0.875
In this example the interpretation of prediction_scores is straightforward namely, the higher the score the more confident the prediction is.
Now I have another set of prediction scores. They are not fractions, and the interpretation is reversed: the lower the score, the more confident the prediction is.
prediction_scores_v2 = [10.3,10.3,10.2,10.5,2000.34,2000.34]
# so this is equivalent
My question is: how can I scale prediction_scores_v2 so that it gives a similar AUC score to the first one?
To put it another way, scikit-learn's roc_curve requires y_score to be probability estimates of the positive class. How can I treat the values if the y_score I have is effectively an estimate for the wrong class?
For AUC, you really only care about the order of your predictions. So as long as that is true, you can just get your predictions into a format that AUC will accept.
You'll want to divide by the max to get your predictions to be between 0 and 1, and then subtract from 1 since lower is better in your case:
max_pred = max(prediction_scores_v2)
prediction_scores_v2[:] = (1-x/max_pred for x in prediction_scores_v2)
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, prediction_scores_v2, pos_label=1)
roc_auc = auc(false_positive_rate, true_positive_rate)
# 0.8125
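Since the AUC only depends on the ranking of the scores, an even simpler option is to negate them, which turns "lower is better" into "higher is better" without any rescaling and gives the same AUC. A sketch reusing actual from above and restating the original scores (since prediction_scores_v2 was already rescaled in place):
from sklearn.metrics import roc_curve, auc

original_scores = [10.3, 10.3, 10.2, 10.5, 2000.34, 2000.34]
neg_scores = [-x for x in original_scores]   # negation reverses the ranking
fpr, tpr, _ = roc_curve(actual, neg_scores, pos_label=1)
print(auc(fpr, tpr))  # 0.8125, same as with the rescaled scores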
How can I treat the value if the y_score I have is probability estimates of the wrong class?
This is a really cheap shot, but have you considered reversing the original class list, as in
actual = [abs(x-1) for x in actual]
Then, you could still apply the normalization @Tchotchke proposed.
Still, in the end, @BrenBarn seems right. If possible, have an in-depth look at how these values are created and/or used in the other prediction tool.
I'm doing a binary classification. I have imbalanced data, and I've used the SVM class weights to try to mitigate the situation.
As you can see, I've calculated and plotted the ROC curve for each class, and I got the following plot:
It looks like the two curves sum up to one, and I'm not sure if I'm doing the right thing, because it's the first time I've drawn my own ROC curve. I'm using scikit-learn to plot. Is it right to plot each class alone, and is the classifier failing to classify the blue class?
This is the code that I've used to get the plot:
y_pred = clf.predict_proba(X_test)[:, 0]   # probability of the first class
y_pred2 = clf.predict_proba(X_test)[:, 1]  # probability of the second class

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
auc = metrics.auc(fpr, tpr)
print("auc for the first class", auc)

fpr2, tpr2, thresholds2 = metrics.roc_curve(y_test, y_pred2)
auc2 = metrics.auc(fpr2, tpr2)
print("auc for the second class", auc2)

# plotting the roc curve
plt.plot(fpr, tpr)
plt.plot(fpr2, tpr2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('Roc curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc="lower right")
plt.show()
I know there is a better way to write this (as a dictionary, for example), but I was just trying to see the curve first.
See the Wikipedia entry for all your ROC curve needs :)
predict_proba returns class probabilities for each class. The first column contains the probability of the first class and the second column contains the probability of the second class. Note that the two curves are rotated versions of each other. That is because the class probabilities add up to 1.
The documentation of roc_curve states that the second parameter must contain
Target scores, can either be probability estimates of the positive class or confidence values.
This means you have to pass the probabilities that correspond to class 1. Most likely this is the second column.
You get the blue curve because you passed the probabilities of the wrong class (first column). Only the green curve is correct.
It does not make sense to compute ROC curves for each class, because the ROC curve describes the ability of the classifier to distinguish two classes. You have only one curve per classifier.
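A minimal corrected sketch along those lines, reusing clf, X_test, y_test, metrics and plt from the question's code and assuming the positive class is labeled 1 (i.e. the second column of predict_proba):
y_score = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
plt.plot(fpr, tpr, label='SVM (AUC = %.3f)' % metrics.auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')
plt.show()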
The specific problem is a coding mistake.
predict_proba returns class probabilities (1 if it's certainly the class, 0 if it is definitely not the class; usually it's something in between).
metrics.roc_curve(y_test, y_pred) now compares class labels against probabilities, which is like comparing pears against apple juice.
You should use predict instead of predict_proba to predict class labels and not probabilities. These can be compared against the true class labels for computing the ROC curve. Incidentally, this also removes the option to plot a second curve - you only get one curve for the classifier, not one for each class.
You have to rethink the whole approach. The ROC curve indicates the quality of a classifier at different "probability" thresholds, not of the individual classes. Usually, the diagonal line (corresponding to an AUC of 0.5, i.e. a random guess) is the benchmark: it shows whether your classifier is able to beat random guessing.
It's because, while building the ROC for class 0, roc_curve treats '0' in y_test as Boolean False for the target class.
Try changing:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
to
fpr, tpr, thresholds = metrics.roc_curve(1-y_test, y_pred)