I am working with an extremely imbalanced dataset of 44 samples for my research project. It is a binary classification problem with only 3 of the 44 samples in the minority class, and I am using leave-one-out cross-validation (LOOCV). If I apply SMOTE oversampling to the entire dataset before the LOOCV loop, prediction accuracy and ROC AUC are close to 90% and 0.9 respectively. However, if I oversample only the training set inside the LOOCV loop, which seems the more logical approach, the ROC AUC falls as low as 0.3.
I also tried precision-recall curves and stratified k-fold cross-validation, but saw a similar gap between oversampling outside and inside the loop.
Where is the right place to oversample, and what explains the difference?
Oversampling inside the loop:
from imblearn.over_sampling import SMOTE
from sklearn import metrics
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

loo = LeaveOneOut()
i = 0
acc_dec = 0
y_test_dec = []  # Store y_test for every split
y_pred_dec = []  # Store probability of the positive label for every split
for train, test in loo.split(X):  # Leave One Out Cross Validation
    # Create training and test sets for split indices
    X_train = X.loc[train]
    y_train = Y.loc[train]
    X_test = X.loc[test]
    y_test = Y.loc[test]
    # Oversample the minority class with SMOTE (training fold only)
    sm = SMOTE(sampling_strategy='minority', k_neighbors=1)
    X_res, y_res = sm.fit_resample(X_train, y_train)
    # KNN
    clf = KNeighborsClassifier(n_neighbors=5)
    clf = clf.fit(X_res, y_res)
    y_pred = clf.predict(X_test)
    acc_dec = acc_dec + metrics.accuracy_score(y_test, y_pred)
    y_test_dec.append(y_test.to_numpy()[0])
    y_pred_dec.append(clf.predict_proba(X_test)[:, 1][0])
    i += 1
# Compute the ROC curve and AUC from the pooled held-out predictions
fpr, tpr, threshold = metrics.roc_curve(y_test_dec, y_pred_dec, pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
print(str(acc_dec / i * 100) + "%")
AUC: 0.25
Accuracy: 68.1%
Oversampling outside the loop:
import pandas as pd

acc_dec = 0  # Accumulated accuracy across splits
y_test_dec = []  # Store y_test for every split
y_pred_dec = []  # Store probability of the positive label for every split
i = 0
# Oversampling before the loop
sm = SMOTE(k_neighbors=1)
X, Y = sm.fit_resample(X, Y)
X = pd.DataFrame(X)
Y = pd.DataFrame(Y)
for train, test in loo.split(X):  # Leave One Out Cross Validation
    # Create training and test sets for split indices
    X_train = X.loc[train]
    y_train = Y.loc[train]
    X_test = X.loc[test]
    y_test = Y.loc[test]
    # KNN
    clf = KNeighborsClassifier(n_neighbors=5)
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc_dec = acc_dec + metrics.accuracy_score(y_test, y_pred)
    y_test_dec.append(y_test.to_numpy()[0])
    y_pred_dec.append(clf.predict_proba(X_test)[:, 1][0])
    i += 1
# Compute the ROC curve and AUC from the pooled held-out predictions
fpr, tpr, threshold = metrics.roc_curve(y_test_dec, y_pred_dec, pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
print(str(acc_dec / i * 100) + "%")
AUC: 0.99
Accuracy: 90.24%
How can these two approaches lead to such different results? Which one should I follow?
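Upsampling (e.g. with SMOTE) before you split your data means the synthetic samples are built from points that later end up in the test fold, so information is shared between the training set and the test set. This is usually called "leakage". Your first setup (oversampling inside the loop) is, unfortunately, the correct one.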
Here's a post walking through this problem.
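To see the leakage in isolation, here is a minimal sketch on pure-noise data, where no classifier should do better than chance. It is not the asker's pipeline: plain duplication of minority rows stands in for SMOTE (an assumption, so that the example needs only numpy and scikit-learn), and the data, the seed, and the helper names `oversample_minority` and `loocv_probs` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(44, 5))      # synthetic noise features: no real signal
y = np.zeros(44, dtype=int)
y[:3] = 1                         # 3 minority samples, as in the question


def oversample_minority(X, y):
    """Duplicate minority rows (cycling through them) until classes balance.

    Stand-in for SMOTE; hypothetical helper, not an imblearn API.
    """
    classes, counts = np.unique(y, return_counts=True)
    minority_idx = np.flatnonzero(y == classes[np.argmin(counts)])
    extra = np.resize(minority_idx, counts.max() - counts.min())
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


def loocv_probs(X, y, oversample_inside):
    """Pooled leave-one-out labels and positive-class probabilities."""
    labels, probs = [], []
    for train, test in LeaveOneOut().split(X):
        X_tr, y_tr = X[train], y[train]
        if oversample_inside:
            X_tr, y_tr = oversample_minority(X_tr, y_tr)
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
        labels.append(y[test][0])
        probs.append(clf.predict_proba(X[test])[0, 1])
    return np.array(labels), np.array(probs)


# Correct: oversample only the training fold
y_in, p_in = loocv_probs(X, y, oversample_inside=True)

# Leaky: oversample first, then cross-validate the inflated dataset
X_bal, y_bal = oversample_minority(X, y)
y_out, p_out = loocv_probs(X_bal, y_bal, oversample_inside=False)

print("AUC, oversampling inside the loop: ", roc_auc_score(y_in, p_in))
print("AUC, oversampling outside the loop:", roc_auc_score(y_out, p_out))
```

In the leaky variant every held-out minority row has exact copies of itself in the training fold, so its five nearest neighbours are all positive and its predicted probability is 1.0 whether or not the features carry any signal. SMOTE produces near-copies rather than exact ones, but the mechanism is the same.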
I would like to evaluate my model's ability to discriminate between people with prediabetes (HbA1c 5.7-6.4%) and type 2 diabetes (HbA1c > 6.4%).
My outcome label (y_test) is HbA1c > 5.7%, which defines unhealthy people with undiagnosed diabetes or prediabetic conditions.
How do I separate the two ranges, compare the predicted values with the actual values, and calculate the sensitivity?
The example below uses a logistic regression model.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
X_train = X_train[feat_cols]
X_test = X_test[feat_cols]
# Build the logistic regression model and fit it on the training data
LGR = LogisticRegression(max_iter=100, random_state=1)
LGR.fit(X_train, y_train)
def evaluate_model(LGR, X_test, y_test):
    # Predict test data
    y_pred = LGR.predict(X_test)
    # Calculate accuracy, precision, sensitivity and specificity
    acc = metrics.accuracy_score(y_test, y_pred)
    prec = metrics.precision_score(y_test, y_pred)
    sen = metrics.recall_score(y_test, y_pred, pos_label=1)
    spe = metrics.recall_score(y_test, y_pred, pos_label=0)
    # Calculate area under curve (AUC)
    y_pred_proba = LGR.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
    auc = metrics.roc_auc_score(y_test, y_pred_proba)
    # Compute confusion matrix
    cm = metrics.confusion_matrix(y_test, y_pred)
    return {'acc': acc, 'prec': prec, 'sen': sen, 'spe': spe,
            'fpr': fpr, 'tpr': tpr, 'auc': auc, 'cm': cm}
LGR_eval = evaluate_model(LGR, X_test, y_test)
# Print result
print('Accuracy:', LGR_eval['acc'])
print('Precision:', LGR_eval['prec'])
print('Sensitivity:', LGR_eval['sen'])
print('Specificity:', LGR_eval['spe'])
print('Area Under Curve:', LGR_eval['auc'])
print('Confusion Matrix:\n', LGR_eval['cm'])
Accuracy: 0.7315175097276264
Precision: 0.711340206185567
Sensitivity: 0.7439353099730458
Specificity: 0.72
Area Under Curve: 0.8036994609164421
Confusion Matrix:
[[288 112]
[ 95 276]]
(Answering comment above)
Because your model is a logistic regression (a classifier) rather than a regression on the HbA1c value, it will be difficult to achieve what you ask for without some fairly substantial changes.
One option is to switch to a linear regression model that predicts the HbA1c value, and then classify each prediction into the < 5.7, 5.7-6.4, or > 6.4 range. That way you can keep using the metrics you used above.
The other option depends on the Y dataset: does it contain labels for the different conditions, or is it just labeled healthy / unhealthy? If you add a label that corresponds to the value ranges (much like above), you can turn your model into a multi-class prediction model, still use logistic regression, and then inspect your metrics for each class.
Edit in response to comment:
Below is the function from comments.
def IsAtRisk(x):
    if x < 5.7:
        return 0
    return 1
df['IsAtRisk'] = df['LBXGH'].map(IsAtRisk)
print(df)
print(f"{len(df[df['IsAtRisk'] == True])} of {len(df)} people are at risk")
If you instead include the range you ask about as a separate class, you'll have labels for the different classes and can use the metrics to measure how the model performs on each.
def IsAtRisk(x):
    if x < 5.7:
        return 0
    elif x <= 6.4:  # 5.7-6.4 inclusive, so the boundary values are not sent to class 2
        return 1
    return 2
For this to work, you will most likely need to format the samples differently, depending on your model structure. It would help if you shared your model structure and the output layer.
Most likely you will want to restructure your Y labels as one-hot vectors:
y_sample = [1, 0, 0]  # Probability 1 for class 0, which may be the healthy individuals in your dataset
# With this in mind, you can change the return values of the function above to return arrays of labels instead of ints.
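As a rough sketch of the first option (predict the HbA1c value, bin it, then score each class), the snippet below computes one sensitivity value per class. The HbA1c arrays are made-up numbers standing in for measured values and a regressor's output, and `hba1c_class` is a hypothetical helper name, so treat this as an illustration rather than the asker's pipeline.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score


def hba1c_class(x):
    """Map an HbA1c value (%) to 0 = healthy, 1 = prediabetes, 2 = diabetes."""
    if x < 5.7:
        return 0
    if x <= 6.4:
        return 1
    return 2


# Made-up values for illustration only: measured HbA1c and a model's predictions
hba1c_true = np.array([5.0, 5.6, 5.7, 6.0, 6.4, 6.5, 7.2, 8.0])
hba1c_pred = np.array([5.2, 5.8, 5.9, 6.6, 6.1, 6.3, 7.0, 9.1])

y_true = np.array([hba1c_class(v) for v in hba1c_true])
y_pred = np.array([hba1c_class(v) for v in hba1c_pred])

# average=None returns one recall (sensitivity) value per class
per_class_sensitivity = recall_score(y_true, y_pred, labels=[0, 1, 2], average=None)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

for name, sens in zip(["healthy", "prediabetes", "diabetes"], per_class_sensitivity):
    print(f"{name}: sensitivity = {sens:.2f}")
print(cm)
```

Row i of the confusion matrix shows how the true members of class i were distributed over the predicted classes, which makes it easy to see whether prediabetes is being confused with diabetes or with the healthy group.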
I have the code below (using sklearn) that first uses the training set for cross-validation, and for a final check, uses the test set. However, the cross-validation consistently performs better, as shown below. Am I over-fitting on the training data? If so, which hyper parameter(s) would be best to tune to avoid this?
from numpy import mean

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Cross-validation
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
scores = cross_validate(rfc, X_train, y_train, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )
Which gives me:
0.8536558341101569 0.8641939667622551 0.8392201023654705 0.8514895113569482 0.9264002192260914
Re-train the model now with the entire training+validation set, and test it with never-seen-before test-set
RFC = RandomForestClassifier()
RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
y_pred_proba = RFC.predict_proba(X_test)[::,1]
auc = roc_auc_score(y_test, y_pred_proba)
print(accuracy,
      precision,
      recall,
      f1,
      auc
      )
Now it gives me the numbers below, which are clearly worse:
0.7809788654060067 0.5113236034222446 0.5044687189672294 0.5078730317420644 0.7589037004728368
I am able to reproduce your scenario with Pima Indians Diabetes Dataset.
The difference you see in the prediction metrics is not consistent, and in some runs you may even see the opposite, because it depends on which samples end up in X_test during the split: some cases are easier to predict and give better metrics, and vice versa. Cross-validation runs predictions over the whole set in rotation and averages out this effect, whereas a single X_test set suffers from the randomness of the split.
In order to have better visibility on what is happening here, I have modified your experiment and split in two steps:
1. Cross-validation step:
I use the whole of the X and y sets and run rest of the code as it is
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# cv = KFold(n_splits=10)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
scores = cross_validate(rfc, X, y, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )
Output:
0.768257006151743 0.6943032069967433 0.593436328663432 0.6357667086829574 0.8221242747913622
2. Classic train-test step:
Next I run the plain train-test step, but I do it 50 times with the different train_test splits, and average the metrics (similar to Cross-validation step).
accuracies = []
precisions = []
recalls = []
f1s = []
aucs = []
for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    RFC = RandomForestClassifier()
    RFC.fit(X_train, y_train)
    y_pred = RFC.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    y_pred_proba = RFC.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    accuracies.append(accuracy)
    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)
    aucs.append(auc)
print(mean(accuracies),
      mean(precisions),
      mean(recalls),
      mean(f1s),
      mean(aucs)
      )
Output:
0.7606926406926405 0.7001931059992001 0.5778712922956755 0.6306501622080503 0.8207846633339568
As expected the prediction metrics are similar. However, the Cross-validation runs much faster and uses each data point of the whole data set for testing (in rotation) by a given number of times.
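The averages above hide how much a single split can move. A small sketch that reports the spread as well as the mean makes the point about random splits more concrete. It uses a synthetic dataset from `make_classification` as a stand-in (an assumption, not the Pima data), with fixed seeds chosen only so that the run is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 300 samples, binary target
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

accuracies = []
for seed in range(20):
    # Only the split changes between iterations; the model seed is fixed
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, model.predict(X_test)))

accuracies = np.array(accuracies)
print(f"mean={accuracies.mean():.3f}  std={accuracies.std():.3f}  "
      f"min={accuracies.min():.3f}  max={accuracies.max():.3f}")
```

If the gap between min and max is of the same order as the gap between a cross-validation score and a single test-set score, the single split is a plausible explanation for that gap.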
This is somewhat of a follow up to my previous question about evaluating my scikit Gaussian process regressor. I am very new to GPRs and I think that I may be making a methodological mistake in how I am using training vs testing data.
Essentially I'm wondering what the difference is between specifying training data by splitting the input between test and training data like this:
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.33,
                                                    random_state=0)
kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X_train, y_train)
score = gp.score(X_test, y_test)
print(score)
x_pred = np.atleast_2d(np.linspace(0,10,1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
vs using the full data set to train like this.
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T
kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X, Y)
score = gp.score(X, Y)
print(score)
x_pred = np.atleast_2d(np.linspace(0,10,1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
Is one of these options going to result in incorrect predictions?
You split off the test data from the training data to evaluate your model, because otherwise you have no idea whether you are overfitting. For example, put some data in Excel and plot it with a smoothed line. Technically, that spline from Excel is a perfect model of the plotted points, but it is useless for predicting new values.
In your example, the predictions are made over a uniform grid so that you can visualize what your model thinks the underlying function is. But that tells you nothing about how well the model generalizes. Sometimes you can get a very high score (> 95%) on training data and worse than chance on test data, which means the model is overfitting.
In addition to plotting predictions over a uniform grid to visualize the model, you should also predict values for the test set and then compare the metrics on the test and training data.
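A minimal sketch of that comparison, assuming a noisy sine curve as stand-in data (the asker's `some_data` and `other_data` are not shown) and fewer optimizer restarts to keep it quick:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))                 # synthetic inputs (assumption)
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=60)  # noisy synthetic targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=2, random_state=0)
gp.fit(X_train, y_train)

# score() returns R^2; compare it on data the model has and has not seen
train_r2 = gp.score(X_train, y_train)
test_r2 = gp.score(X_test, y_test)
print(f"R^2 on training data: {train_r2:.3f}")
print(f"R^2 on held-out data: {test_r2:.3f}")

# The uniform grid is still useful, but only for visualizing the fitted function
x_grid = np.linspace(0, 10, 1000).reshape(-1, 1)
y_grid, sigma = gp.predict(x_grid, return_std=True)
```

A training score far above the held-out score is the usual sign of overfitting. Calling `gp.score(X, Y)` on the same data used for fitting, as in the second snippet of the question, can only ever report the first of the two numbers.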
I have a very challenging dataset to work with, but even with that in mind, the ROC curves I am getting seem quite bizarre and look wrong.
Below is my code - I have used the scikitplot library (skplt) for plotting ROC curves after passing in my predictions and the ground truth labels so I cannot reasonably be getting that wrong. Is there something crazily obvious that I am missing here?
# My dataset - note that m (number of examples) is 115. These are histograms that are already
# summed to 1 so I am doubtful that further preprocessing is necessary.
X, y = load_new_dataset(positives, positive_files, m=115, upper=21, range_size=10, display_plot=False)
# Partition - class balance is 0.87 : 0.13 for negative and positive classes respectively
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, stratify=y)
# Pick a baseline classifier - Naive Bayes
nb = GaussianNB()
# Very large class imbalance, so use stratified K-fold cross-validation.
cross_val = StratifiedKFold(n_splits=10)
# Use RFE for feature selection
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
# Create pipeline, nothing fancy here
clf = Pipeline(steps=[("feature selection", selector), ("classifier", nb)])
# Score using F1-score due to class imbalance - accuracy unlikely to be meaningful
scores = cross_val_score(clf, X_train, y_train, cv=cross_val,
                         scoring=make_scorer(f1_score, average='micro'))
# Fit and make predictions. Use these to plot ROC curves.
print(scores)
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, y_pred)
plt.show()
And below is the starkly binary ROC curve:
I understand that I can't expect outstanding performance with such a challenging dataset, but even so I cannot fathom why I am getting such a binary result, particularly for the ROC curves of the individual classes. No, I cannot get more data, although I sincerely wish I could. If this really is valid code, then I will just have to make do with it and perhaps report the micro-average F1 score, which does not look too bad.
For reference, using the make_classification function from sklearn in the code snippet below, I get the following ROC curve:
# Randomly generate a dataset with similar characteristics (size, class balance,
# num_features)
X, y = make_classification(n_samples=103, n_features=21, random_state=0, n_classes=2, \
                           weights=[0.87, 0.13], n_informative=5, n_clusters_per_class=3)
positives = np.where(y == 1)
X_minority, X_majority, y_minority, y_majority = np.take(X, positives, axis=0), \
                                                 np.delete(X, positives, axis=0), \
                                                 np.take(y, positives, axis=0), \
                                                 np.delete(y, positives, axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, stratify=y)
# Cross-validation again
cross_val = StratifiedKFold(n_splits=10)
# Use Naive Bayes again for consistency
clf = GaussianNB()
# Likewise for the evaluation metric
scores = cross_val_score(clf, X_train, y_train, cv=cross_val, \
                         scoring=make_scorer(f1_score, average='micro'))
print(scores)
# Fit, predict, plot results
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, y_pred)
plt.show()
Am I doing something wrong? Or is this what I should expect given these characteristics?
Thanks to Stev's kind suggestion of increasing the test size, the resulting curves I ended up getting were far smoother and exhibited much less variance. Using SMOTE in this case was also very helpful and I would advise it (using imblearn perhaps) for anyone else with a similar issue.
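Why the test size matters can be sketched directly: a ROC curve cannot have more points than there are test rows plus one, so a 10% split of 115 examples leaves only a dozen rows and a curve made of a few large steps. The snippet below uses synthetic data from `make_classification` as a stand-in for the real histograms, and `roc_summary` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in with a similar size and class balance (assumption)
X, y = make_classification(n_samples=115, n_features=21, n_informative=5,
                           weights=[0.87, 0.13], random_state=0)


def roc_summary(test_size):
    """Return (test rows, positive test rows, number of ROC points)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0)
    scores = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    # drop_intermediate=False keeps every threshold so the count is not pruned
    fpr, tpr, _ = roc_curve(y_te, scores, drop_intermediate=False)
    return len(y_te), int(y_te.sum()), len(fpr)


small = roc_summary(0.10)
large = roc_summary(0.40)
print("test_size=0.10 -> rows, positives, ROC points:", small)
print("test_size=0.40 -> rows, positives, ROC points:", large)
```

With only a handful of positive rows in the test set, the true-positive rate can take just a few distinct values, which is what produces the blocky, near-binary curve. A larger test set, or pooling predictions across cross-validation folds, gives the curve more points to work with.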
I'm using GridSearchCV to identify the best set of parameters for a random forest classifier.
PARAMS = {
    'max_depth': [8, None],
    'n_estimators': [500, 1000]
}
rf = RandomForestClassifier()
clf = grid_search.GridSearchCV(estimator=rf, param_grid=PARAMS, scoring='roc_auc', cv=5, n_jobs=4)
clf.fit(data, labels)
where data and labels are respectively the full dataset and the corresponding labels.
Now, I compared the performance returned by the GridSearchCV (from clf.grid_scores_) with a "manual" AUC estimation:
aucs = []
for fold in range(0, n_folds):
    probabilities = []
    train_data, train_labels = read_data(train_file_fold)
    test_data, test_labels = read_data(test_file_fold)
    clf = RandomForestClassifier(n_estimators=1000, max_depth=8)
    clf = clf.fit(train_data, train_labels)
    predicted_probs = clf.predict_proba(test_data)
    # Keep the probability of the positive class (column 1)
    for value in predicted_probs:
        for k, pr in enumerate(value):
            if k == 1:
                probabilities.append(pr)
    fpr, tpr, thresholds = metrics.roc_curve(test_labels, probabilities, pos_label=1)
    fold_auc = metrics.auc(fpr, tpr)
    aucs.append(fold_auc)
performance = np.mean(aucs)
where I manually pre-split the data into training and test set (same 5 CV approach).
The AUC values returned by GridSearchCV are always higher than the manually calculated ones (e.g. 0.70 from GridSearchCV vs. 0.62 manually) when using the same parameters for the random forest.
I know that different training and test splits can give different performance, but this occurred consistently across 100 repetitions of GridSearchCV. Interestingly, if I use accuracy instead of roc_auc as the scoring metric, the difference in performance is minimal and can be attributed to the different training and test sets. Is this happening because GridSearchCV estimates the AUC differently than metrics.roc_curve does?
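One way to narrow this down is to rule out the scorer itself: run the built-in `roc_auc` scoring and the manual `roc_curve` plus `auc` computation on exactly the same folds and compare the numbers. The sketch below does that with `cross_val_score` from `sklearn.model_selection` on synthetic data. The dataset, the seeds, and the smaller forest are assumptions chosen to keep it fast and repeatable.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0)

# A fixed seed makes cv.split() return the same folds every time it is called
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# AUC per fold as computed by the built-in scorer
scorer_aucs = cross_val_score(rf, X, y, scoring="roc_auc", cv=cv)

# AUC per fold computed by hand on the same folds
manual_aucs = []
for train_idx, test_idx in cv.split(X, y):
    model = clone(rf).fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[test_idx])[:, 1]
    fpr, tpr, _ = roc_curve(y[test_idx], probs, pos_label=1)
    manual_aucs.append(auc(fpr, tpr))

print("built-in scorer:", np.round(scorer_aucs, 4))
print("manual:         ", np.round(manual_aucs, 4))
```

If the two sets of numbers agree, the gap more likely comes from how the folds are built than from how the AUC is computed. For a classifier and an integer `cv`, GridSearchCV uses stratified folds, so manually pre-split files that are not stratified would not be directly comparable.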