I am working with an imbalanced dataset. After splitting the dataset into training and test sets, I applied the SMOTE algorithm to balance the training set before applying ML models. I want to apply cross-validation, plot the ROC curve of each fold showing the AUC of each fold, and also display the mean of the AUCs in the plot. I named the resampled training-set variables X_train_res and y_train_res, and the following is the code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

cv = StratifiedKFold(n_splits=10)
classifier = SVC(kernel='sigmoid', probability=True, random_state=0)

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

plt.figure(figsize=(10, 10))
i = 0
for train, test in cv.split(X_train_res, y_train_res):
    # Fit on the training part of the fold, predict probabilities on the test part
    probas_ = classifier.fit(X_train_res[train], y_train_res[train]).predict_proba(X_train_res[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y_train_res[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))  # interpolate TPR onto a common FPR grid
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i += 1

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
         label='Chance', alpha=.8)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)

std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')

plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Cross-Validation ROC of SVM', fontsize=18)
plt.legend(loc="lower right", prop={'size': 15})
plt.show()
The output is the attached plot.
Please tell me whether this code is correct for plotting the ROC curve for cross-validation or not.
The problem is that I do not clearly understand cross-validation. In the for loop, I have passed the resampled training-set variables for X and y. Does cross-validation work like this?
Leaving aside SMOTE and the imbalance issue, neither of which appears in your code, your procedure looks correct.
In more detail, for each one of your n_splits=10:
you create train and test folds
you fit the model using the train fold:
classifier.fit(X_train_res[train], y_train_res[train])
and then you predict probabilities using the test fold:
predict_proba(X_train_res[test])
This is exactly the idea behind cross-validation.
So, since you have n_splits=10, you get 10 ROC curves and respective AUC values (and their average), exactly as expected.
However:
The need for (SMOTE) upsampling due to the class imbalance changes the correct procedure and makes your overall process incorrect: you should not upsample your initial dataset; instead, you need to incorporate the upsampling procedure into the CV process.
So, the correct procedure here for each one of your n_splits becomes (notice that starting with a stratified CV split, as you have done, is essential in class-imbalance cases):
create train and test folds
upsample your train fold with SMOTE
fit the model using the upsampled train fold
predict probabilities using the test fold (not upsampled)
For details regarding the rationale, please see my own answer in the Data Science SE thread Why you shouldn't upsample before cross validation.
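As a minimal sketch of that per-fold procedure, assuming imbalanced-learn's SMOTE and that the original, non-resampled training arrays are available as NumPy arrays X_train and y_train (the question only shows the already-resampled X_train_res/y_train_res, so these names are an assumption):
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

cv = StratifiedKFold(n_splits=10)
classifier = SVC(kernel='sigmoid', probability=True, random_state=0)
aucs = []
for train, test in cv.split(X_train, y_train):
    # Upsample ONLY the training part of the fold (X_train/y_train are assumed
    # to be the original, pre-SMOTE training arrays)
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train[train], y_train[train])
    classifier.fit(X_res, y_res)
    # Predict probabilities on the untouched (non-upsampled) test part of the fold
    probas_ = classifier.predict_proba(X_train[test])
    fpr, tpr, _ = roc_curve(y_train[test], probas_[:, 1])
    aucs.append(auc(fpr, tpr))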
I am doing binary classification using the Sequential() model of Keras, and I have some doubts about its accuracy assessment.
I am calculating the AUC-ROC for it. For this, should I use the prediction probability or the predicted class?
Explanation:
After training the model, I'm doing model.predict() to find prediction values for training and validation data (code below).
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_pred_train = model.predict(x_train_df).ravel()
y_pred_val = model.predict(x_val_df).ravel()
fpr_train, tpr_train, thresholds_roc_train = roc_curve(y_train_df, y_pred_train, pos_label=None)
fpr_val, tpr_val, thresholds_roc_val = roc_curve(y_val_df, y_pred_val, pos_label=None)
roc_auc_train = auc(fpr_train, tpr_train)
roc_auc_val = auc(fpr_val, tpr_val)
plt.figure()
lw = 2
plt.plot(fpr_train, tpr_train, color='darkgreen',lw=lw, label='ROC curve Training (area = %0.2f)' % roc_auc_train)
plt.plot(fpr_val, tpr_val, color='darkorange',lw=lw, label='ROC curve Validation (area = %0.2f)' % roc_auc_val)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--',label='Base line')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
This produces the attached plot; the training and validation AUCs are 0.76 and 0.76.
model.predict() gives the probabilities, not the actual predicted classes, so I changed the first two lines of the above code sample to give the classes:
y_pred_train = (model.predict(x_train_df).ravel() > 0.5).astype("int32")
y_pred_val = (model.predict(x_val_df).ravel() > 0.5).astype("int32")
So this now calculates the AUC-ROC from the class values (I guess). But the values I am getting with this are very different and much lower: the training and validation AUCs are 0.66 and 0.46 (plot attached).
Which of these two is the correct way, and why the difference in the values?
An ROC curve is normally created by plotting sensitivity (TPR) against 1 − specificity (FPR) while varying the classification threshold from 0.0 to 1.0, e.g.:
See for instance here:
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
Some pseudo code to get you started:
import numpy as np

pred_proba = model.predict(x_train_df).ravel()
# assuming you have a truth array of 0/1 classifications
truth = np.asarray(y_train_df).astype(int)

tprs, fprs = [], []
for thresh in np.arange(0.0, 1.0, 0.1):
    pred = np.where(pred_proba > thresh, 1, 0)
    # count true positives, false positives, false negatives, true negatives
    tp = np.count_nonzero((truth == 1) & (pred == 1))
    fp = np.count_nonzero((truth == 0) & (pred == 1))
    fn = np.count_nonzero((truth == 1) & (pred == 0))
    tn = np.count_nonzero((truth == 0) & (pred == 0))
    # sensitivity (TPR) and 1 - specificity (FPR) for the current threshold
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    tprs.append(tpr)
    fprs.append(fpr)
# now you can plot the (fpr, tpr) point for each threshold value
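For what it's worth, scikit-learn's roc_curve performs this threshold sweep internally when it is given the predicted probabilities, so the probability-based version from the question is the appropriate input; passing hard 0/1 classes collapses the curve to a single operating point, which typically explains the lower areas. A minimal sketch, reusing the question's variable names:
from sklearn.metrics import roc_curve, auc

# Pass the continuous predicted probabilities, not thresholded 0/1 classes
y_pred_val = model.predict(x_val_df).ravel()
fpr_val, tpr_val, thresholds = roc_curve(y_val_df, y_pred_val)
roc_auc_val = auc(fpr_val, tpr_val)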
I'm using scikit-learn, and I want to plot precision-recall curves. The classifier I'm using is RandomForestClassifier. All the resources in the scikit-learn documentation use binary classification. Also, can I plot a ROC curve for multiclass?
Also, I only found an example for SVM for multilabel, and it uses a decision_function, which RandomForest doesn't have.
From scikit-learn documentation:
Precision-Recall:
Precision-recall curves are typically used in binary classification to
study the output of a classifier. In order to extend the
precision-recall curve and average precision to multi-class or
multi-label classification, it is necessary to binarize the output.
One curve can be drawn per label, but one can also draw a
precision-recall curve by considering each element of the label
indicator matrix as a binary prediction (micro-averaging).
Receiver Operating Characteristic (ROC):
ROC curves are typically used in binary classification to study the
output of a classifier. In order to extend ROC curve and ROC area to
multi-class or multi-label classification, it is necessary to binarize
the output. One ROC curve can be drawn per label, but one can also
draw a ROC curve by considering each element of the label indicator
matrix as a binary prediction (micro-averaging).
Therefore, you should binarize the output and consider precision-recall and roc curves for each class. Moreover, you are going to use predict_proba to get class probabilities.
I divide the code into three parts:
general settings, learning and prediction
precision-recall curve
ROC curve
1. general settings, learning and prediction
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
#%matplotlib inline

mnist = fetch_openml("mnist_784", as_frame=False)  # NumPy arrays rather than a DataFrame
y = mnist.target
y = y.astype(np.uint8)
n_classes = len(set(y))
Y = label_binarize(y, classes=[*range(n_classes)])  # binarize the integer labels
X_train, X_test, y_train, y_test = train_test_split(mnist.data,
                                                    Y,
                                                    random_state=42)
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=50,
                                                 max_depth=3,
                                                 random_state=0))
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)
2. precision-recall curve
# precision recall curve
precision = dict()
recall = dict()
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(y_test[:, i],
                                                        y_score[:, i])
    plt.plot(recall[i], precision[i], lw=2, label='class {}'.format(i))

plt.xlabel("recall")
plt.ylabel("precision")
plt.legend(loc="best")
plt.title("precision vs. recall curve")
plt.show()
3. ROC curve
# roc curve
fpr = dict()
tpr = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i],
                                  y_score[:, i])
    plt.plot(fpr[i], tpr[i], lw=2, label='class {}'.format(i))

plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend(loc="best")
plt.title("ROC curve")
plt.show()
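The documentation excerpt above also mentions micro-averaging; a minimal sketch of a micro-averaged ROC curve, reusing the y_test and y_score arrays from part 1, could look like this:
from sklearn.metrics import auc

# Treat every element of the label indicator matrix as one binary prediction
fpr_micro, tpr_micro, _ = roc_curve(y_test.ravel(), y_score.ravel())
plt.plot(fpr_micro, tpr_micro, lw=2, linestyle=':',
         label='micro-average ROC (area = {:.2f})'.format(auc(fpr_micro, tpr_micro)))
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend(loc="best")
plt.title("micro-averaged ROC curve")
plt.show()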
I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and compute the p-value to assess statistical significance.
Below is a minimal example in scikit-learn which trains three different models on a binary classification dataset, plots the ROC curves and calculates the AUC scores.
Here are my specific questions:
How to calculate the 95% confidence interval (CI) of the ROC AUC scores on the test set? (e.g. with bootstrapping).
How to compare the AUC scores (on test set) and measure the p-value to assess statistical significance? (The null hypothesis is that the models are not different. Rejecting the null hypothesis means the difference in AUC scores is statistically significant.)
import numpy as np
np.random.seed(2018)
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import matplotlib
import matplotlib.pyplot as plt
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=17)
# Naive Bayes Classifier
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
nb_prediction_proba = nb_clf.predict_proba(X_test)[:, 1]
# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=20)
rf_clf.fit(X_train, y_train)
rf_prediction_proba = rf_clf.predict_proba(X_test)[:, 1]
# Multi-layer Perceptron Classifier
mlp_clf = MLPClassifier(alpha=1, hidden_layer_sizes=150)
mlp_clf.fit(X_train, y_train)
mlp_prediction_proba = mlp_clf.predict_proba(X_test)[:, 1]
def roc_curve_and_score(y_test, pred_proba):
    fpr, tpr, _ = roc_curve(y_test.ravel(), pred_proba.ravel())
    roc_auc = roc_auc_score(y_test.ravel(), pred_proba.ravel())
    return fpr, tpr, roc_auc
plt.figure(figsize=(8, 6))
matplotlib.rcParams.update({'font.size': 14})
plt.grid()
fpr, tpr, roc_auc = roc_curve_and_score(y_test, rf_prediction_proba)
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, nb_prediction_proba)
plt.plot(fpr, tpr, color='green', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, mlp_prediction_proba)
plt.plot(fpr, tpr, color='crimson', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.legend(loc="lower right")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity')
plt.ylabel('Sensitivity')
plt.show()
Bootstrap for 95% confidence interval
You want to repeat your analysis on multiple resamplings of your data. In the general case, assume you have a function f(x) that determines whatever statistic you need from data x and you can bootstrap like this:
def bootstrap(x, f, nsamples=1000):
    stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])]) for _ in range(nsamples)]
    return np.percentile(stats, (2.5, 97.5))
This gives you so-called plug-in estimates of the 95% confidence interval (i.e. you just take the percentiles of the bootstrap distribution).
In your case, you can write a more specific function like this
def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    auc_values = []
    for b in range(nsamples):
        idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
        clf.fit(X_train[idx], y_train[idx])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
        auc_values.append(roc_auc)
    return np.percentile(auc_values, (2.5, 97.5))
Here, clf is the classifier for which you want to test the performance and X_train, y_train, X_test, y_test are like in your code.
This gives me the following confidence intervals (rounded to three digits, 1000 bootstrap samples):
Naive Bayes: 0.986 [0.980 0.988] (estimate, lower and upper limit of confidence interval)
Random Forest: 0.983 [0.974 0.989]
Multilayer Perceptron: 0.974 [0.223 0.98]
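For reference, the intervals above presumably come from calls along these lines, reusing the classifiers and the train/test split from the question's code:
# Hypothetical usage; nb_clf, rf_clf, mlp_clf and the splits come from the question
for name, clf in [("Naive Bayes", nb_clf),
                  ("Random Forest", rf_clf),
                  ("Multilayer Perceptron", mlp_clf)]:
    lower, upper = bootstrap_auc(clf, X_train, y_train, X_test, y_test)
    print(name, lower, upper)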
Permutation tests to test against chance performance
A permutation test would technically go over all permutations of your observation sequence and evaluate your ROC curve with the permuted target values (features are not permuted). This is fine if you have a few observations, but it becomes very costly if you have more observations. It is therefore common to subsample the number of permutations and simply do a number of random permutations. Here, the implementation depends a bit more on the specific thing you want to test. The following function does that for your roc_auc values:
def permutation_test(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    idx1 = np.arange(X_train.shape[0])
    idx2 = np.arange(X_test.shape[0])
    auc_values = np.empty(nsamples)
    for b in range(nsamples):
        np.random.shuffle(idx1)  # Shuffles in-place
        np.random.shuffle(idx2)
        clf.fit(X_train, y_train[idx1])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test[idx2].ravel(), pred.ravel())
        auc_values[b] = roc_auc
    clf.fit(X_train, y_train)
    pred = clf.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
    return roc_auc, np.mean(auc_values >= roc_auc)
This function again takes your classifier as clf and returns the AUC value on the unshuffled data and the p-value (i.e. probability to observe an AUC value larger than or equal to what you have in the unshuffled data).
Running this with 1000 samples gives p-values of 0 for all three classifiers. Note that these are not exact because of the sampling, but they are an indication that all of these classifiers perform better than chance.
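For reference, a call of this form (again with the classifiers and split from the question) yields the unpermuted AUC together with its p-value:
# Hypothetical usage with the Random Forest classifier from the question
roc_auc, p_value = permutation_test(rf_clf, X_train, y_train, X_test, y_test)
print(roc_auc, p_value)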
Permutation test for differences between classifiers
This is much easier. Given two classifiers, you have predictions for every observation. You just shuffle the assignment between predictions and classifiers like this:
def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=1000):
    auc_differences = []
    auc1 = roc_auc_score(y_test.ravel(), pred_proba_1.ravel())
    auc2 = roc_auc_score(y_test.ravel(), pred_proba_2.ravel())
    observed_difference = auc1 - auc2
    for _ in range(nsamples):
        mask = np.random.randint(2, size=len(pred_proba_1.ravel()))
        p1 = np.where(mask, pred_proba_1.ravel(), pred_proba_2.ravel())
        p2 = np.where(mask, pred_proba_2.ravel(), pred_proba_1.ravel())
        auc1 = roc_auc_score(y_test.ravel(), p1)
        auc2 = roc_auc_score(y_test.ravel(), p2)
        auc_differences.append(auc1 - auc2)
    return observed_difference, np.mean(auc_differences >= observed_difference)
With this test and 1000 samples, I find no significant differences between the three classifiers:
Naive bayes vs random forest: diff=0.0029, p(diff>)=0.311
Naive bayes vs MLP: diff=0.0117, p(diff>)=0.186
random forest vs MLP: diff=0.0088, p(diff>)=0.203
Here diff denotes the difference in ROC AUC between the two classifiers, and p(diff>) is the empirical probability of observing a larger difference on a shuffled data set.
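The differences above presumably come from calls of this form, using the predicted probabilities already computed in the question's code:
# Hypothetical usage with the prediction arrays from the question
diff, p = permutation_test_between_clfs(y_test, nb_prediction_proba, rf_prediction_proba)
print(diff, p)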
One can use the code given below to compute the AUC and asymptotic normally distributed confidence interval for Neural Nets.
tf.contrib.metrics.auc_with_confidence_intervals(
    labels,
    predictions,
    weights=None,
    alpha=0.95,
    logit_transformation=True,
    metrics_collections=(),
    updates_collections=(),
    name=None)
I want to plot a ROC curve for evaluating a trained Nearest Centroid classifier.
My code works for Naive Bayes, SVM, kNN and DT but I get an exception whenever I try to plot the curve for Nearest Centroid, because the estimator has no .predict_proba() method:
AttributeError: 'NearestCentroid' object has no attribute 'predict_proba'
The code for plotting the curve is
def plot_roc(self):
    plt.clf()
    for label, estimator in self.roc_estimators.items():
        estimator.fit(self.data_train, self.target_train)
        proba_for_each_class = estimator.predict_proba(self.data_test)
        fpr, tpr, thresholds = roc_curve(self.target_test, proba_for_each_class[:, 1])
        plt.plot(fpr, tpr, label=label)
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8)
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.legend()
    plt.show()
self.roc_estimators is a dict where I store the trained estimators with the label of the classifier like this
cl_label = "kNN"
knn_estimator = KNeighborsClassifier(algorithm='ball_tree', p=2, n_neighbors=5)
knn_estimator.fit(self.data_train, self.target_train)
self.roc_estimators[cl_label] = knn_estimator
and for Nearest Centroid respectively
cl_label = "Nearest Centroid"
nc_estimator = NearestCentroid(metric='euclidean', shrink_threshold=6)
nc_estimator.fit(self.data_train, self.target_train)
self.roc_estimators[cl_label] = nc_estimator
So it works for all classifiers I tried, but not for Nearest Centroid. Is there a specific reason, regarding the nature of the Nearest Centroid classifier, that I am missing which explains why it is not possible to plot the ROC curve (more specifically, why the estimator does not have the .predict_proba() method)? Thank you in advance!
You need a "score" for each prediction to make the ROC curve. This could be the predicted probability of belonging to one class.
See e.g. https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Curves_in_ROC_space
Just looking for the nearest centroid gives you the predicted class, but not a probability.
EDIT: For NearestCentroid it is not possible to compute a score. This is simply a limitation of the model. It assigns a class to each sample, but not a probability of that class. I guess if you need to use Nearest Centroid and you want a probability, you can use some ensemble method: train a bunch of models on subsets of your training data and average their predictions on your test set. That could give you a score; see the rough sketch below. See scikit-learn.org/stable/modules/ensemble.html#bagging
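As a rough sketch of that bagging idea (parameter values are illustrative; the constructor argument is estimator in recent scikit-learn and base_estimator in older versions, and data_train/target_train/data_test/target_test stand for the question's attributes without the self. prefix):
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import roc_curve

# Bag many NearestCentroid models fitted on bootstrap subsets of the training data.
# BaggingClassifier falls back to voting when the base estimator has no
# predict_proba, which yields a usable score for the ROC curve.
bagged_nc = BaggingClassifier(estimator=NearestCentroid(metric='euclidean'),
                              n_estimators=50, random_state=0)
bagged_nc.fit(data_train, target_train)
scores = bagged_nc.predict_proba(data_test)[:, 1]
fpr, tpr, thresholds = roc_curve(target_test, scores)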
To get the class probabilities you can do something like (untested code):
from sklearn.utils.extmath import softmax
from sklearn.metrics.pairwise import pairwise_distances

def predict_proba(self, X):
    # Distance of each sample to each class centroid
    distances = pairwise_distances(X, self.centroids_, metric=self.metric)
    # Negate the distances so that closer centroids get higher probability
    probs = softmax(-distances)
    return probs

clf = NearestCentroid()
clf.fit(X_train, y_train)
predict_proba(clf, X_test)
I implemented a model using gradient boosting decision tree as classifier and I plotted learning curves for both training and test sets to decide what to do next in order to improve my model.
The result is shown in the image below.
(The y-axis is accuracy, i.e. the percentage of correct predictions, while the x-axis is the number of samples I use to train the model.)
I understand that the gap between the training and testing score is probably due to high variance (overfitting). But the image also shows that the test score (the green line) increases very little while the number of samples grows from 2000 to 3000; the curve of the testing score is getting flat. The model is not getting better even with more samples.
My understanding is that a flat learning curve usually indicates high bias (underfitting). Is it possible that both underfitting and overfitting are happening in this model? Or is there another explanation for the flat curve?
Any help would be appreciated. Thanks in advance.
=====================================
The code I use is as follows. Basically, I use the same code as the example in the sklearn documentation:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
title = "Learning Curves (GBDT)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = GradientBoostingClassifier(n_estimators=450)
X,y= features, target #features and target are already loaded
plot_learning_curve(estimator, title, X, y, ylim=(0.6, 1.01), cv=cv, n_jobs=4)
plt.show()
I would say you are overfitting. Considering you are using cross validation, the gap between the training and the cross-validation score is probably too big. Without cross validation or random splitting, it could be that your train and test data differ in some way.
There are a couple of ways you could try to mitigate this:
Add more data (the training score will probably still go down a little bit more)
Reduce the number of estimators, or even better, use early stopping (see the sketch after this list)
Increase gamma for pruning (an XGBoost parameter; scikit-learn's GradientBoostingClassifier has no gamma)
Use subsampling (by tree, by column, ...)
There are lots of parameters that you can play with, so have some fun! :-D
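A minimal sketch of the early-stopping and subsampling suggestions with scikit-learn's GradientBoostingClassifier (the parameter values are illustrative, not tuned):
from sklearn.ensemble import GradientBoostingClassifier

estimator = GradientBoostingClassifier(
    n_estimators=450,          # upper bound; early stopping may use fewer trees
    subsample=0.8,             # row subsampling per tree
    max_features='sqrt',       # column subsampling per split
    validation_fraction=0.2,   # held-out fraction monitored for early stopping
    n_iter_no_change=10,       # stop when the validation score stops improving
    random_state=0)
estimator.fit(X, y)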
First of all, your training accuracy goes down quite a bit when you add more examples. So this could still be high variance. However, I doubt that this is the only explanation as the gap seems to be too big.
A reason for a gap between the training accuracy and the test accuracy could be a different distribution of the training samples and the test samples. However, with cross-validation this should not happen (do you make a k-fold cross validation where you re-train for each of the k folds?)
You should pay more attention to your training accuracy. If it goes down during the training, you did something terribly wrong. Check again the correctness of your data (are your labels correct?) and your model.
Normally, both train and test accuracies should increase, but test accuracy is behind.