I'm doing binary classification on an imbalanced dataset, and I've used SVM class weights to try to mitigate the imbalance.
As you can see, I've calculated and plotted the ROC curve for each class and got the following plot:
It looks like the two classes sum up to one. I'm not sure if I'm doing the right thing, because it's the first time I've drawn my own ROC curve. I'm using scikit-learn to plot. Is it right to plot each class separately, and is the classifier failing on the blue class?
This is the code I used to get the plot:
from sklearn import metrics
import matplotlib.pyplot as plt

y_pred = clf.predict_proba(X_test)[:, 0]   # probability of the first class
y_pred2 = clf.predict_proba(X_test)[:, 1]  # probability of the second class

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
auc = metrics.auc(fpr, tpr)
print("auc for the first class", auc)

fpr2, tpr2, thresholds2 = metrics.roc_curve(y_test, y_pred2)
auc2 = metrics.auc(fpr2, tpr2)
print("auc for the second class", auc2)

# plotting the ROC curve
plt.plot(fpr, tpr, label='first class')
plt.plot(fpr2, tpr2, label='second class')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc="lower right")
plt.show()
I know there is a cleaner way to write this (using a dictionary, for example), but I was just trying to see the curves first.
See the Wikipedia entry for all your ROC curve needs :)
predict_proba returns class probabilities for each class. The first column contains the probability of the first class and the second column contains the probability of the second class. Note that the two curves are rotated versions of each other. That is because the class probabilities add up to 1.
The documentation of roc_curve states that the second parameter must contain
Target scores, can either be probability estimates of the positive class or confidence values.
This means you have to pass the probabilities that correspond to class 1. Most likely this is the second column.
You get the blue curve because you passed the probabilities of the wrong class (first column). Only the green curve is correct.
It does not make sense to compute ROC curves for each class, because the ROC curve describes the ability of the classifier to distinguish two classes. You have only one curve per classifier.
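A minimal sketch of the corrected version, assuming the same clf, X_test and y_test as in the question (note the single curve, built from the probabilities of class 1):

from sklearn import metrics
import matplotlib.pyplot as plt

# column order follows clf.classes_, so [:, 1] is the probability of class 1
y_score = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
print("AUC:", metrics.auc(fpr, tpr))

plt.plot(fpr, tpr, label='classifier')
plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')
plt.show()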
The specific problem is a coding mistake.
predict_proba returns class probabilities (1 if the sample certainly belongs to the class, 0 if it definitely does not, and usually something in between).
metrics.roc_curve(y_test, y_pred) now compares class labels against probabilities, which is like comparing pears against apple juice.
You should use predict instead of predict_proba to predict class labels and not probabilities. These can be compared against the true class labels for computing the ROC curve. Incidentally, this also removes the option to plot a second curve - you only get one curve for the classifier, not one for each class.
You have to rethink the whole approach. The ROC curve indicates the quality of a classifier at different "probability" thresholds, not the quality of individual classes. Usually, the diagonal line (corresponding to an AUC of 0.5) is the benchmark that tells you whether your classifier is able to beat a random guess.
It's because, while building the ROC for class 0, roc_curve treats the '0' entries in y_test as the negative class rather than as your target class.
Try changing:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
to:
fpr, tpr, thresholds = metrics.roc_curve(1 - y_test, y_pred)
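Equivalently, roc_curve lets you say which label should be treated as positive instead of flipping the labels; a small sketch, assuming the same y_test and y_pred (first predict_proba column) as in the question:

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=0)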
Related
I am doing model selection with 10-odd models for a binary classification problem. I loop through each model, do a train-test split, make predictions, and find the score, adding each score to a dictionary whose keys are the models.
However, many of these models (Decision Tree, Random Forest, AdaBoost, etc.) do not give me probabilities as floats but rather the class as an int (1 or 0). This means that my AUC plots for these models have only three points.
My code:
from plotly.subplots import make_subplots
from plotly.offline import iplot
from sklearn.metrics import roc_auc_score, roc_curve

fig = make_subplots(1, 1)

# "no skill" baseline: predict 0 for every sample
ns_probs = [0 for _ in range(len(y_test))]
ns_auc = roc_auc_score(y_test, ns_probs)
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)

for model in scores_kf_df.columns:
    fig.add_scatter(x=scores_kf_df.T.loc[model, 'fpr'], y=scores_kf_df.T.loc[model, 'tpr'], name=model)

fig.add_scatter(x=ns_fpr, y=ns_tpr, name='No Skill', line={'dash': 'dot'})
fig.update_layout({'title': {'text': 'ROC AUC Curve', 'x': 0.5, 'font': {'size': 28}},
                   'xaxis': {'title': 'False Positive Rate'},
                   'yaxis': {'title': 'True Positive Rate'}})
iplot(fig)
And my plot here:
Is there a way to get probabilities for these models, similar to the predict_proba() function in logistic regression?
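For what it's worth, the scikit-learn implementations of these classifiers do expose predict_proba; a minimal sketch, assuming X_train, y_train, X_test and y_test are already split (the random forest settings are only illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# one column per class; take the positive-class column as the score
rf_scores = rf.predict_proba(X_test)[:, 1]
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_scores)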
When plotting the ROC (or deriving the AUC) in scikit-learn, how can one specify arbitrary thresholds for roc_curve, rather than having the function calculate them internally and return them?
from sklearn.metrics import roc_curve
fpr,tpr,thresholds = roc_curve(y_true,y_pred)
A related question was asked at Scikit - How to define thresholds for plotting roc curve, but the OP's accepted answer indicates that their intent was different to how it was written.
Thanks!
What you get from the classifier are scores, not just a class prediction.
roc_curve will give you a set of thresholds with associated false positive rates and true positive rates.
If you want your own threshold, just use it:
y_class = y_pred > threshold
Then you can display a confusion matrix, with this new y_class compared to y_true.
And if you want several thresholds, do the same for each of them and compute the confusion matrix each time to get the true and false positive rates.
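A minimal sketch of that idea, assuming y_true and y_pred (scores) are NumPy arrays and the thresholds are just illustrative values:

from sklearn.metrics import confusion_matrix

for threshold in [0.3, 0.5, 0.7]:
    y_class = (y_pred > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_class).ravel()
    print(f"threshold={threshold:.2f}  TPR={tp / (tp + fn):.3f}  FPR={fp / (fp + tn):.3f}")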
It's quite simple: the ROC curve shows you the classifier's output at different thresholds. You still choose the best threshold for your model to make forecasts, but the ROC curve shows you how robust/good your model is across different thresholds. Here you have a good explanation of how it works: https://www.dataschool.io/roc-curves-and-auc-explained/
I want to plot a ROC curve for evaluating a trained Nearest Centroid classifier.
My code works for Naive Bayes, SVM, kNN and DT but I get an exception whenever I try to plot the curve for Nearest Centroid, because the estimator has no .predict_proba() method:
AttributeError: 'NearestCentroid' object has no attribute 'predict_proba'
The code for plotting the curve is
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

def plot_roc(self):
    plt.clf()
    for label, estimator in self.roc_estimators.items():
        estimator.fit(self.data_train, self.target_train)
        proba_for_each_class = estimator.predict_proba(self.data_test)
        fpr, tpr, thresholds = roc_curve(self.target_test, proba_for_each_class[:, 1])
        plt.plot(fpr, tpr, label=label)
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8)
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.legend()
    plt.show()
self.roc_estimators is a dict where I store the trained estimators with the label of the classifier like this
cl_label = "kNN"
knn_estimator = KNeighborsClassifier(algorithm='ball_tree', p=2, n_neighbors=5)
knn_estimator.fit(self.data_train, self.target_train)
self.roc_estimators[cl_label] = knn_estimator
and for Nearest Centroid respectively
cl_label = "Nearest Centroid"
nc_estimator = NearestCentroid(metric='euclidean', shrink_threshold=6)
nc_estimator.fit(self.data_train, self.target_train)
self.roc_estimators[cl_label] = nc_estimator
So it works for all the classifiers I tried, but not for Nearest Centroid. Is there a specific reason, regarding the nature of the Nearest Centroid classifier, that I am missing which explains why it is not possible to plot the ROC curve (more specifically, why the estimator does not have a .predict_proba() method)? Thank you in advance!
You need a "score" for each prediction to make the ROC curve. This could be the predicted probability of belonging to one class.
See e.g. https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Curves_in_ROC_space
Just looking for the nearest centroid will give you the predicted class, but not the probability.
EDIT: For NearestCentroid it is not possible to compute a score. This is simply a limitation of the model. It assigns a class to each sample, but not a probability of that class. I guess if you need to use Nearest Centroid and you want a probability, you can use some ensemble method: train a bunch of models on subsets of your training data and average their predictions on your test set. That could give you a score. See scikit-learn.org/stable/modules/ensemble.html#bagging
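A minimal sketch of that bagging idea, assuming a binary target and X_train, y_train, X_test, y_test already defined (when the base estimator has no predict_proba, BaggingClassifier falls back to averaging hard votes, which still gives a usable score):

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import roc_curve, auc

# 100 estimators is an arbitrary choice; each one is trained on a bootstrap sample
bagged_nc = BaggingClassifier(NearestCentroid(metric='euclidean'), n_estimators=100)
bagged_nc.fit(X_train, y_train)

scores = bagged_nc.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
print("AUC:", auc(fpr, tpr))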
To get the class probabilities you can do something like (untested code):
from sklearn.neighbors import NearestCentroid
from sklearn.utils.extmath import softmax
from sklearn.metrics.pairwise import pairwise_distances

def predict_proba(self, X):
    distances = pairwise_distances(X, self.centroids_, metric=self.metric)
    # negate so that smaller distances (closer centroids) get larger probabilities
    probs = softmax(-distances)
    return probs

clf = NearestCentroid()
clf.fit(X_train, y_train)
predict_proba(clf, X_test)
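Note that the rows of clf.centroids_ (and hence the columns returned here) follow clf.classes_, so take the column that corresponds to your positive label before passing it to roc_curve.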
I am doing a binary classification task on an imbalanced data set, and right now I am computing the ROC AUC using:
sklearn.metrics.roc_auc_score(y_true, y_score, average='macro')
and I have two questions:
I am not sure if the 'macro' averaging is influenced by the class imbalance here, and what is the best averaging in this situation (when classifying imbalanced classes)?
Is there a reference that shows how scikit-learn calculates the ROC AUC with the different averaging arguments?
If your target variable is binary, then average does not make sense and is ignored. See https://github.com/scikit-learn/scikit-learn/blob/7b136e92acf49d46251479b75c88cba632de1937/sklearn/metrics/base.py#L76 and also the comment in the doc: https://github.com/scikit-learn/scikit-learn/blob/7b136e92acf49d46251479b75c88cba632de1937/sklearn/metrics/base.py#L52
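A quick sketch to confirm this with a plain binary target and 1-D scores (the numbers are only illustrative): both settings return the same value.

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(roc_auc_score(y_true, y_score, average='macro'))     # 0.75
print(roc_auc_score(y_true, y_score, average='weighted'))  # 0.75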
average='weighted' is your choice for the problem of imbalanced classes, as follows from section 3.3.2.1 in
http://scikit-learn.org/stable/modules/model_evaluation.html
Using average='macro' is the reasonable way to go. Hopefully, you already trained your model with consideration of the data's imbalance. So now, when evaluating performance, you want to give both classes the same weight.
For example, say your set consists of 90% positive examples, the ROC AUC for the positive label is 0.8, and the ROC AUC for the negative label is 0.4. Using average='weighted' will produce an average ROC AUC of 0.8 * 0.9 + 0.4 * 0.1 = 0.76, which is obviously mostly affected by the positive label's score. Using average='macro' will give the minority label (0) equal weight, resulting in (0.8 + 0.4) / 2 = 0.6 in this case.
To conclude, if you don't care much about precision and recall relating to the negative label, use average='weighted'. Otherwise, use average='macro'.
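A tiny sketch of that arithmetic (the per-label AUCs and the 90/10 split are the hypothetical numbers from the example above):

auc_pos, auc_neg = 0.8, 0.4   # hypothetical per-label ROC AUCs
w_pos, w_neg = 0.9, 0.1       # class proportions (90% positive examples)

weighted = auc_pos * w_pos + auc_neg * w_neg   # 'weighted': dominated by the majority class
macro = (auc_pos + auc_neg) / 2                # 'macro': both labels count equally
print(round(weighted, 2), round(macro, 2))     # 0.76 0.6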
I have the following code:
from sklearn.metrics import roc_curve, auc
actual = [1,1,1,0,0,1]
prediction_scores = [0.9,0.9,0.9,0.1,0.1,0.1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, prediction_scores, pos_label=1)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
# 0.875
In this example the interpretation of prediction_scores is straightforward: the higher the score, the more confident the prediction is.
Now I have another set of prediction scores. It is non-fractional, and the interpretation is the reverse: the lower the score, the more confident the prediction is.
prediction_scores_v2 = [10.3,10.3,10.2,10.5,2000.34,2000.34]
# so this is equivalent
My question is: how can I scale prediction_scores_v2 so that it gives an AUC score similar to the first one?
To put it another way, scikit-learn's roc_curve requires the y_score to be probability estimates of the positive class. How can I treat the value if the y_score I have is probability estimates of the wrong class?
For AUC, you really only care about the order of your predictions. So as long as that is true, you can just get your predictions into a format that AUC will accept.
You'll want to divide by the max to get your predictions to be between 0 and 1, and then subtract from 1 since lower is better in your case:
max_pred = max(prediction_scores_v2)
prediction_scores_v2[:] = (1-x/max_pred for x in prediction_scores_v2)
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, prediction_scores_v2, pos_label=1)
roc_auc = auc(false_positive_rate, true_positive_rate)
# 0.8125
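Since only the ranking matters for AUC, simply negating the scores gives the same result; a quick check, restating the original lists from the question (the snippet above overwrites prediction_scores_v2 in place):

from sklearn.metrics import roc_curve, auc

actual = [1, 1, 1, 0, 0, 1]
original_scores_v2 = [10.3, 10.3, 10.2, 10.5, 2000.34, 2000.34]

neg_scores = [-x for x in original_scores_v2]
fpr_neg, tpr_neg, _ = roc_curve(actual, neg_scores, pos_label=1)
print(auc(fpr_neg, tpr_neg))  # 0.8125, same as above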
How can I treat the value if the y_score I have is probability estimates of the wrong class?
This is a really cheap shot, but have you considered reversing the original class list, as in
actual = [abs(x-1) for x in actual]
Then, you could still apply the normalization @Tchotchke proposed.
Still, in the end, @BrenBarn seems right. If possible, take an in-depth look at how these values are created and/or used in the other prediction tool.