I am doing model selection with 10-odd models for a binary classification problem. I loop through each model, do a train-test split, make predictions, and compute the score, storing each score in a dictionary whose keys are the models.
However, many of these models (Decision Tree, Random Forest, AdaBoost, etc.) do not give probabilities as a float but rather the class as an int (1 or 0). This means that my ROC curves for these models have only three points.
My code:
fig = make_subplots(1,1)
ns_probs = [0 for i in range(len(y_test))]
ns_auc = roc_auc_score(y_test, ns_probs)
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
for model in scores_kf_df.columns:
    fig.add_scatter(x=scores_kf_df.T.loc[model, 'fpr'], y=scores_kf_df.T.loc[model, 'tpr'], name=model)
fig.add_scatter(x=ns_fpr, y=ns_tpr, name='No Skill', line={'dash': 'dot'})
fig.update_layout({'title': {'text': 'ROC AUC Curve', 'x': 0.5, 'font': {'size': 28}},
                   'xaxis': {'title': 'False Positive Rate'},
                   'yaxis': {'title': 'True Positive Rate'}})
iplot(fig)
And my plot here:
Is there a way to get probabilities for these models, similar to the predict_proba() function in logistic regression?
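For what it's worth, scikit-learn's tree-based classifiers do expose predict_proba(); a three-point curve usually means the hard labels from predict() were passed to roc_curve. Here is a minimal sketch of the idea. The synthetic dataset, the model settings and the `curves` dictionary are my own assumptions for illustration, not the asker's setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with your own X, y).
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

curves = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Column 1 holds the estimated probability of the positive class (label 1).
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    curves[name] = (fpr, tpr, roc_auc_score(y_test, scores))

for name, (fpr, tpr, auc_value) in curves.items():
    print(name, "ROC points:", len(fpr), "AUC:", round(auc_value, 3))
```

A fully grown single decision tree can still give very few distinct probabilities, because its leaves tend to be pure; limiting the depth (as above) or using an ensemble gives a smoother curve.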
I'm building a decision tree model based on data from the "Give me some credit" Kaggle competition (https://www.kaggle.com/competitions/GiveMeSomeCredit/overview). I'm trying to train this model on the training dataset from the competition and then apply it to my own dataset for research.
The problem I'm facing is that it looks like the f1 score my model gets and the results presented by the confusion matrix do not correlate, and the higher the f1 score is, the worse label prediction becomes. Currently my best parameters for maximizing f1 are the following (the way I measure the score is included):
from sklearn.model_selection import RandomizedSearchCV
import xgboost
classifier=xgboost.XGBClassifier(tree_method='gpu_hist', booster='gbtree', importance_type='gain')
params = {
    "colsample_bytree": [0.3],
    "gamma": [0.3],
    "learning_rate": [0.1],
    "max_delta_step": [1],
    "max_depth": [4],
    "min_child_weight": [9],
    "n_estimators": [150],
    "num_parallel_tree": [1],
    "random_state": [0],
    "reg_alpha": [0],
    "reg_lambda": [0],
    "scale_pos_weight": [4],
    "validate_parameters": [1],
    "n_jobs": [-1],
    "subsample": [1],
}
clf=RandomizedSearchCV(classifier,param_distributions=params,n_iter=100,scoring='f1',cv=10,verbose=3)
clf.fit(X,y)
These parameters give me an f1 score of ≈0.46.
However, when this model is output onto a confusion matrix, the label prediction accuracy for label "1" is only 50% (Picture below).
When attempting to tune the parameters in order to achieve better label prediction, I can improve the label prediction accuracy to 97% for both labels, however that decreases the f1 score to about 0.3. Here's the code I use for creating the confusion matrix (parameters included are the ones that have the f1 score of 0.3):
from xgboost import XGBClassifier
from numpy import nan
final_model = XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
                            colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.7,
                            early_stopping_rounds=None, enable_categorical=False,
                            eval_metric=None, gamma=0.2, gpu_id=0, grow_policy='depthwise',
                            importance_type='gain', interaction_constraints='',
                            learning_rate=1.5, max_bin=256, max_cat_to_onehot=4,
                            max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=9,
                            missing=nan, monotone_constraints='()', n_estimators=800,
                            n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=0,
                            reg_alpha=0, reg_lambda=1, scale_pos_weight=5)
final_model.fit(X,y)
pred_xgboost = final_model.predict(X)
cm = confusion_matrix(y, pred_xgboost)
cm_norm = cm/cm.sum(axis=1)[:, np.newaxis]
plt.figure()
fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(cm_norm, classes=rf.classes_)
And here's the confusion matrix for these parameters:
I don't understand why there is seemingly no correlation between these two metrics (f1 score and confusion matrix accuracy), perhaps a different scoring system would prove more useful?
Would you kindly show the absolute values?
Technically, cm_norm = cm/cm.sum(axis=1)[:, np.newaxis] would represent recall, not the accuracy. You can easily get a matrix with a good recall but poor precision for the positive class (e.g. [[9000, 300], [1, 30]]) - you can check your precision using the same code with axis=0. (F1 is the harmonic mean of your positive class recall and precision.)
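To make the recall versus precision point concrete, here is a small sketch using the example matrix from the paragraph above (rows are true classes, columns are predicted classes, following scikit-learn's confusion_matrix convention). The variable names are mine:

```python
import numpy as np

# Example matrix from the text: rows = true class, columns = predicted class.
cm = np.array([[9000, 300],
               [1, 30]])

# Dividing each row by its sum puts per-class recall on the diagonal.
recall_matrix = cm / cm.sum(axis=1, keepdims=True)
# Dividing each column by its sum puts per-class precision on the diagonal.
precision_matrix = cm / cm.sum(axis=0, keepdims=True)

pos_recall = recall_matrix[1, 1]        # 30 / 31
pos_precision = precision_matrix[1, 1]  # 30 / 330
f1 = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)

print("recall:", pos_recall, "precision:", pos_precision, "F1:", f1)
```

So a row-normalized matrix can show roughly 97% on the diagonal for both classes while F1 for the positive class stays below 0.2.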
If you wish to optimize for F1, you should also look for an optimal classification threshold on the sklearn.metrics.precision_recall_curve().
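A rough sketch of such a threshold search follows. The synthetic data and the logistic regression model are placeholders I picked for illustration (any classifier with predict_proba would do), and in practice the threshold should be chosen on held-out data rather than on the training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (assumption: stand-in for the real dataset).
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_valid)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_valid, scores)
# precision/recall have one more entry than thresholds; drop the final point.
p, r = precision[:-1], recall[:-1]
f1_per_threshold = 2 * p * r / np.maximum(p + r, 1e-12)

best_threshold = thresholds[int(np.argmax(f1_per_threshold))]
best_f1 = f1_score(y_valid, (scores >= best_threshold).astype(int))
default_f1 = f1_score(y_valid, (scores >= 0.5).astype(int))

print("best threshold:", best_threshold)
print("F1 at best threshold:", best_f1, "F1 at 0.5:", default_f1)
```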
There is a relationship, although not so obvious. It would help to understand it better if you generate a classification report.
Also, a higher max_rate can change the value of Recall Specificity, which affects one of the class's f1_score in the classification report, but not the f1-score derived from f1_score(y_valid, predictions). Oversampling can also affect the Recall.
from sklearn.metrics import classification_report
ClassificationReport = classification_report(y_valid,predictions.round(),output_dict=True)
f1_score is the harmonic mean of precision and recall. The confusion matrix summary below shows the precision of each class. With the classification report, I can see the relationship, like in the example below.
Classification Report
   precision    recall  f1-score   support
0   0.722292  0.922951  0.810385   23167.0
1   0.982273  0.923263  0.951854  107132.0
Confusion Matrix using Validation Data (y_valid)
True Negative : CHGOFF (0) was predicted 21382 times correctly (72.23 %)
False Negative : CHGOFF (0) was predicted 8221 times incorrectly (27.77 %)
True Positive : P I F (1) was predicted 98911 times correctly (98.23 %)
False Positive : P I F (1) was predicted 1785 times incorrectly (1.77 %)
I want to plot a ROC curve for evaluating a trained Nearest Centroid classifier.
My code works for Naive Bayes, SVM, kNN and DT but I get an exception whenever I try to plot the curve for Nearest Centroid, because the estimator has no .predict_proba() method:
AttributeError: 'NearestCentroid' object has no attribute 'predict_proba'
The code for plotting the curve is
def plot_roc(self):
    plt.clf()
    for label, estimator in self.roc_estimators.items():
        estimator.fit(self.data_train, self.target_train)
        proba_for_each_class = estimator.predict_proba(self.data_test)
        fpr, tpr, thresholds = roc_curve(self.target_test, proba_for_each_class[:, 1])
        plt.plot(fpr, tpr, label=label)
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8)
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.legend()
    plt.show()
self.roc_estimators is a dict where I store the trained estimators with the label of the classifier like this
cl_label = "kNN"
knn_estimator = KNeighborsClassifier(algorithm='ball_tree', p=2, n_neighbors=5)
knn_estimator.fit(self.data_train, self.target_train)
self.roc_estimators[cl_label] = knn_estimator
and for Nearest Centroid respectively
cl_label = "Nearest Centroid"
nc_estimator = NearestCentroid(metric='euclidean', shrink_threshold=6)
nc_estimator.fit(self.data_train, self.target_train)
self.roc_estimators[cl_label] = nc_estimator
So it works for all classifiers I tried but not for Nearest Centroid. Is there a specific reason regarding the nature of the Nearest Centroid classifier that I am missing which explains why it is not possible to plot the ROC curve (more specifically why the estimator does not have the .predict_proba() method?) Thank you in advance!
You need a "score" for each prediction to make the ROC curve. This could be the predicted probability of belonging to one class.
See e.g. https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Curves_in_ROC_space
Just looking for the nearest centroid will give you predicted class, but not the probability.
EDIT: NearestCentroid does not give you a score out of the box. This is simply a limitation of the model as implemented: it assigns a class to each sample, but not a probability of that class. I guess if you need to use Nearest Centroid and you want a probability, you can use some ensemble method. Train a bunch of models on subsets of your training data, and average their predictions on your test set. That could give you a score. See scikit-learn.org/stable/modules/ensemble.html#bagging
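A minimal sketch of that bagging idea, assuming synthetic data and an arbitrary choice of 50 estimators (both are my assumptions). When the base estimator has no predict_proba, BaggingClassifier falls back to vote fractions, which can serve as a coarse score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=600, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each NearestCentroid is fit on a random half of the training data.
bag = BaggingClassifier(NearestCentroid(), n_estimators=50,
                        max_samples=0.5, random_state=0)
bag.fit(X_train, y_train)

# Fraction of the 50 models voting for class 1, used as the score.
scores = bag.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
auc_value = roc_auc_score(y_test, scores)
print("AUC from bagged votes:", auc_value)
```

Because the centroids barely move between subsamples, most vote fractions end up at 0 or 1, so the resulting curve is still fairly coarse.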
To get the class probabilities you can do something like (untested code):
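from sklearn.neighbors import NearestCentroid
from sklearn.utils.extmath import softmax
from sklearn.metrics.pairwise import pairwise_distances
def predict_proba(self, X):
    # Distance from each sample to each class centroid (columns follow self.classes_).
    distances = pairwise_distances(X, self.centroids_, metric=self.metric)
    # Negate the distances so that a closer centroid gets a higher probability.
    probs = softmax(-distances)
    return probs
clf = NearestCentroid()
clf.fit(X_train, y_train)
predict_proba(clf, X_test)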
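For a ROC curve you don't strictly need probabilities, only a score that ranks samples. One option, sketched here on synthetic data (the data and the variable names are my assumptions), is to use the difference between the distances to the two centroids. Newer scikit-learn releases may also expose probability methods on NearestCentroid directly, so it is worth checking your installed version:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

# Synthetic binary data (assumption: labels are 0 and 1).
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = NearestCentroid().fit(X_train, y_train)

# One column of distances per class, ordered like clf.classes_.
distances = pairwise_distances(X_test, clf.centroids_, metric="euclidean")
pos = list(clf.classes_).index(1)
neg = 1 - pos

# Larger score = closer to the positive centroid than to the negative one.
scores = distances[:, neg] - distances[:, pos]

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc_value = roc_auc_score(y_test, scores)
print("AUC:", auc_value)
```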
I have the following code:
from sklearn.metrics import roc_curve, auc
actual = [1,1,1,0,0,1]
prediction_scores = [0.9,0.9,0.9,0.1,0.1,0.1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, prediction_scores, pos_label=1)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
# 0.875
In this example the interpretation of prediction_scores is straightforward namely, the higher the score the more confident the prediction is.
Now I have another set of prediction scores.
It is non-fractional, and the interpretation is the reverse: the lower the score, the more confident the prediction is.
prediction_scores_v2 = [10.3,10.3,10.2,10.5,2000.34,2000.34]
# so this is equivalent
My question is: how can I scale the values in prediction_scores_v2 so that they give an AUC similar to the first one?
To put it another way, scikit-learn's roc_curve requires y_score to be probability estimates of the positive class. How should I treat the values if the y_score I have is probability estimates of the wrong class?
For AUC, you really only care about the order of your predictions. As long as that ordering is preserved, you can transform your predictions into whatever format the AUC function will accept.
You'll want to divide by the max to get your predictions to be between 0 and 1, and then subtract from 1 since lower is better in your case:
max_pred = max(prediction_scores_v2)
prediction_scores_v2[:] = (1-x/max_pred for x in prediction_scores_v2)
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, prediction_scores_v2, pos_label=1)
roc_auc = auc(false_positive_rate, true_positive_rate)
# 0.8125
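Since only the ordering matters, simply negating the scores should give the same AUC as the rescaling above. A quick check with the numbers from the question (the names `negated` and `rescaled` are mine):

```python
from sklearn.metrics import roc_auc_score

actual = [1, 1, 1, 0, 0, 1]
prediction_scores_v2 = [10.3, 10.3, 10.2, 10.5, 2000.34, 2000.34]

# Option 1: flip the sign so that "lower is better" becomes "higher is better".
negated = [-x for x in prediction_scores_v2]

# Option 2: the divide-by-max-and-subtract-from-1 rescaling.
max_pred = max(prediction_scores_v2)
rescaled = [1 - x / max_pred for x in prediction_scores_v2]

auc_negated = roc_auc_score(actual, negated)
auc_rescaled = roc_auc_score(actual, rescaled)
print(auc_negated, auc_rescaled)
```

Both transformations are monotonically decreasing in the original score, so they rank the samples identically.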
How can I treat the value if the y_score I have is probability estimates of the wrong class?
This is a really cheap shot, but have you considered reversing the original class list, as in
actual = [abs(x-1) for x in actual]
Then, you could still apply the normalization @Tchotchke proposed.
Still, in the end, @BrenBarn seems right. If possible, have an in-depth look at how these values are created and/or used in the other prediction tool.
I'm doing binary classification. My data is imbalanced, and I've used the SVM weight to try to mitigate the situation.
As you can see I've calculated and plot the roc curve for each class and I've got the following plot:
It looks like the two classes sum up to one, and I'm not sure whether I'm doing the right thing because it's the first time I've drawn my own ROC curve. I'm using scikit-learn to plot. Is it right to plot each class separately, and is the classifier failing to classify the blue class?
this is the code that I've used to get the plot:
y_pred = clf.predict_proba(X_test)[:,0] # for calculating the probability of the first class
y_pred2 = clf.predict_proba(X_test)[:,1] # for calculating the probability of the second class
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
auc=metrics.auc(fpr, tpr)
print("auc for the first class", auc)
fpr2, tpr2, thresholds2 = metrics.roc_curve(y_test, y_pred2)
auc2=metrics.auc(fpr2, tpr2)
print("auc for the second class", auc2)
# ploting the roc curve
plt.plot(fpr,tpr)
plt.plot(fpr2,tpr2)
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.title('Roc curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc="lower right")
plt.show()
I know there is a better way to write as a dictionary for example but I was just trying to see the curve first
See the Wikipedia entry for all your ROC curve needs :)
predict_proba returns class probabilities for each class. The first column contains the probability of the first class and the second column contains the probability of the second class. Note that the two curves are rotated versions of each other. That is because the class probabilities add up to 1.
The documentation of roc_curve states that the second parameter must contain
Target scores, can either be probability estimates of the positive class or confidence values.
This means you have to pass the probabilities that correspond to class 1. Most likely this is the second column.
You get the blue curve because you passed the probabilities of the wrong class (first column). Only the green curve is correct.
It does not make sense to compute ROC curves for each class, because the ROC curve describes the ability of the classifier to distinguish two classes. You have only one curve per classifier.
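Here is a small sketch of the column mix-up. The data and the weighted logistic regression are stand-ins I chose for illustration, not the asker's SVM. Looking the column up via classes_ avoids guessing which one belongs to class 1:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (assumption).
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# Find the column that belongs to the positive class (label 1).
pos_col = list(clf.classes_).index(1)

auc_correct = roc_auc_score(y_test, proba[:, pos_col])
# Passing the other column scores the wrong class against the same labels.
auc_wrong_column = roc_auc_score(y_test, proba[:, 1 - pos_col])

print("AUC with class-1 column:", auc_correct)
print("AUC with class-0 column:", auc_wrong_column)
```

The two values add up to 1 (up to floating-point ties), which is why the second curve looks like a mirrored copy of the first.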
The specific problem is a coding mistake.
predict_proba returns class probabilities (1 if it's certainly the class, 0 if it is definitely not the class, usually it's something in-between).
metrics.roc_curve(y_test, y_pred) now compares class labels against probabilities, which is like comparing pears against apple juice.
You should use predict instead of predict_proba to predict class labels and not probabilities. These can be compared against the true class labels for computing the ROC curve. Incidentally, this also removes the option to plot a second curve - you only get one curve for the classifier, not one for each class.
You have to rethink the whole approach. A ROC curve indicates the quality of a classifier at different "probability" thresholds, not per class. Usually, the diagonal line from (0, 0) to (1, 1), which corresponds to an AUC of 0.5, is the benchmark for whether your classifier is able to beat a random guess.
It's because while building ROC for class 0, it considers '0' in y_test as Boolean False for your target class.
Try changing:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred) to fpr, tpr, thresholds = metrics.roc_curve(1-y_test, y_pred)
I've been trying to figure out scikit's Random Forest sample_weight use and I cannot explain some of the results I'm seeing. Fundamentally I need it to balance a classification problem with unbalanced classes.
In particular, I was expecting that if I used a sample_weight array of all 1's I would get the same result as with sample_weight=None. Additionally, I was expecting that any array of equal weights (i.e. all 1s, or all 10s or all 0.8s...) would provide the same result. Perhaps my intuition of weights is wrong in this case.
Here's the code:
import numpy as np
from sklearn import ensemble, metrics, datasets
#create a synthetic dataset with unbalanced classes
X, y = datasets.make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=4,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    weights=[0.9],
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=0)
model = ensemble.RandomForestClassifier()
w0=1 #weight associated to 0's
w1=1 #weight associated to 1's
#I should split train and validation but for the sake of understanding sample_weights I'll skip this step
model.fit(X, y,sample_weight=np.array([w0 if r==0 else w1 for r in y]))
preds = model.predict(X)
probas = model.predict_proba(X)
ACC = metrics.accuracy_score(y,preds)
precision, recall, thresholds = metrics.precision_recall_curve(y, probas[:, 1])
fpr, tpr, thresholds = metrics.roc_curve(y, probas[:, 1])
ROC = metrics.auc(fpr, tpr)
cm = metrics.confusion_matrix(y,preds)
print("ACCURACY:", ACC)
print("ROC:", ROC)
print("F1 Score:", metrics.f1_score(y, preds))
print("TP:", cm[1,1], cm[1,1]/cm.sum())
print("FP:", cm[0,1], cm[0,1]/cm.sum())
print("Precision:", cm[1,1]/(cm[1,1]+cm[0,1]))
print("Recall:", cm[1,1]/(cm[1,1]+cm[1,0]))
With w0=w1=1 I get, for instance, F1=0.9456.
With w0=w1=10 I get, for instance, F1=0.9569.
With sample_weights=None I get F1=0.9474.
With the Random Forest algorithm, there is, as the name implies, some "Random"ness to it.
You are getting different F1 scores because each tree in the forest is trained on a random bootstrap sample of your data (and considers a random subset of features at each split), and the forest then averages across all of the trees. I am not surprised, therefore, that you have similar (but non-identical) F1 scores for each of your runs. Fixing random_state takes this source of variation out of the comparison.
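To separate the effect of the weights from the effect of the randomness, you could fix the seed and refit. This is only a sketch on a smaller synthetic dataset; the helper fit_and_score and the choice of 50 trees are my own assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=4,
                           weights=[0.9], flip_y=0.01, random_state=0)

def fit_and_score(sample_weight, seed):
    """Fit a forest with the given weights and seed; return (F1, probabilities)."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y, sample_weight=sample_weight)
    return f1_score(y, model.predict(X)), model.predict_proba(X)

ones = np.ones(len(y))
f1_a, proba_a = fit_and_score(ones, seed=0)
f1_b, proba_b = fit_and_score(ones, seed=0)   # same seed: same forest
f1_c, proba_c = fit_and_score(ones, seed=1)   # different seed: may differ
f1_none, _ = fit_and_score(None, seed=0)      # compare against no weights
f1_tens, _ = fit_and_score(10 * ones, seed=0) # compare against all 10s

print("same seed:", f1_a, f1_b)
print("other seed:", f1_c)
print("None vs all-10s, same seed:", f1_none, f1_tens)
```

With the seed held fixed, any remaining difference between the weighted and unweighted runs can be attributed to the weights rather than to the bootstrap sampling.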
I have tried balancing the weights before. You may want to try balancing the weights by the size of each class in the population. For example, if you were to have two classes as such:
Class A: 5 members
Class B: 2 members
You may wish to balance the weights by assigning 2/7 for each of Class A's members and 5/7 for each of Class B's members. That's just an idea as a starting place, though. How you weight your classes will depend on the problem you have.
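As a sketch of that weighting on the toy example above (5 members of class A, 2 of class B), the manual weights can be compared with scikit-learn's "balanced" heuristic, which uses n_samples / (n_classes * class_count). The array names are mine:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels from the example: 5 members of class A, 2 of class B.
y = np.array(["A"] * 5 + ["B"] * 2)

# Manual scheme from the text: 2/7 per A sample, 5/7 per B sample.
manual = np.where(y == "A", 2 / 7, 5 / 7)

# scikit-learn's heuristic: n_samples / (n_classes * class_count).
balanced = compute_sample_weight("balanced", y)

# In both schemes each class ends up with the same total weight.
print("manual totals:", manual[y == "A"].sum(), manual[y == "B"].sum())
print("balanced totals:", balanced[y == "A"].sum(), balanced[y == "B"].sum())
```

The two schemes differ only by a constant factor, and either array can be passed as sample_weight to fit(); RandomForestClassifier also accepts class_weight="balanced" directly.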