I just ran a random forest model on an imbalanced dataset. I got the AUC and the confusion matrix. The AUC seemed not bad, but the model actually predicts every instance as positive. How did that happen, and how do I use AUC properly?
The ROC curve is below:
You can have this problem when your data is heavily skewed toward one class (somewhat like how a small false positive rate can still be terrible for medical tests for rare conditions). It can help to look at the entire receiver operating characteristic (ROC) curve instead of just the AUC summary score.
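For instance, here is a minimal sketch with scikit-learn, assuming clf is your fitted random forest and X_test, y_test are your held-out data, that plots the full curve rather than only the summary score:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

probs = clf.predict_proba(X_test)[:, 1]          # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, probs)  # the full curve, not just the area
plt.plot(fpr, tpr, label="AUC = %.3f" % roc_auc_score(y_test, probs))
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
Looking at where the curve actually sits (for example, whether any threshold gives a useful true positive rate at an acceptable false positive rate) tells you more about an imbalanced problem than the area alone.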
I'm using BayesSearchCV from scikit-optimize to train a model on a fairly imbalanced dataset. From what I'm reading, precision or ROC AUC would be the best metrics for an imbalanced dataset. In my code:
knn_b = BayesSearchCV(estimator=pipe, search_spaces=search_space, n_iter=40, random_state=7, scoring='roc_auc')
knn_b.fit(X_train, y_train)
The number of iterations is just a value I chose arbitrarily (although I get a warning saying I have already reached the best result, and there is no way to stop early as far as I'm aware?). For the scoring parameter, I specified roc_auc, which I assume will be the primary metric used to select the best parameters in the results. So when I call knn_b.best_params_, I should get the parameters for which the roc_auc metric is highest. Is that correct?
My confusion is about what I see when I look at the results using knn_b.cv_results_. Shouldn't mean_test_score be the roc_auc score because of the scoring param in the BayesSearchCV class? What I'm doing is plotting the results and seeing how each combination of params performed.
import seaborn as sns

sns.relplot(
    data=knn_b.cv_results_, kind='line', x='param_classifier__n_neighbors', y='mean_test_score',
    hue='param_scaler', col='param_classifier__p',
)
When I try to use the roc_auc_score() function on the true and predicted values, I get something completely different.
Is the mean_test_score here something different? How would I be able to get the individual/mean roc_auc score for each CV split of each iteration? Similarly for when I want to use RandomizedSearchCV or GridSearchCV.
EDIT: tl;dr: I want to know what exactly is being computed in mean_test_score. I thought it was roc_auc because of the scoring param, or accuracy, but it seems to be neither.
mean_test_score is the AUROC, because of your scoring parameter, yes.
Your main problem is that the ROC curve (and the area under it) requires probability predictions (or another continuous score), not hard class predictions. Your manual calculation is therefore incorrect.
You shouldn't expect exactly the same score anyway. Your second score is on the test set, and the first score is optimistically biased by the hyperparameter selection.
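As a rough sketch of the difference, assuming a binary problem and that X_test and y_test are your held-out data:
from sklearn.metrics import roc_auc_score

# cross-validated AUROC of the best parameter combination (this is what mean_test_score reports)
print(knn_b.best_score_)

# per-split scores for every iteration are also in cv_results_ (split0_test_score, split1_test_score, ...)
print(knn_b.cv_results_['split0_test_score'])

# test-set AUROC: pass positive-class probabilities, not hard class predictions
print(roc_auc_score(y_test, knn_b.predict_proba(X_test)[:, 1]))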
I have a multi-class classification problem and I want to measure AUC on training and test data.
tf.keras has an AUC metric (tf.keras.metrics.AUC), but I am not able to tell whether this metric can safely be used for multi-class problems. Even the example "Classification on imbalanced data" on the official web page is dedicated to a binary classification problem.
I have implemented a CNN model that predicts six classes, with a softmax layer that gives the probabilities of all the classes. I used this metric as follows:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC

self.model.compile(loss='categorical_crossentropy',
                   optimizer=Adam(hp.get("learning_rate")),
                   metrics=['accuracy', AUC()])
and the code executed without any problem. However, sometimes I see results that are quite strange to me. For example, the model reported an accuracy of 0.78333336 and an AUC of 0.97327775. Is this possible? Can a model have such a low accuracy and such a high AUC?
I worry that, although the code does not raise any error, the AUC metric is being computed incorrectly.
Can somebody confirm whether or not this metric supports multi-class classification problems?
There is a boolean argument multi_label in your tf.keras.metrics.AUC call.
If True (not the default), multi-label data will be treated as such, and so AUC is computed separately for each label and then averaged across labels.
When False (the default), the data will be flattened into a single label before AUC computation. In the latter case, when multi-label data is passed to AUC, each label-prediction pair is treated as an individual data point.
The documentation recommends setting it to False for multi-class data.
e.g. for multi-label data: tf.keras.metrics.AUC(multi_label=True)
See the AUC Documentation for more details.
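As an illustrative sketch only (model stands in for your six-class softmax CNN, and the optimizer and learning rate are placeholders), compiling with the documentation-recommended setting for multi-class data might look like:
import tensorflow as tf

model.compile(
    loss='categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam(1e-3),  # placeholder learning rate
    # multi_label=False (the default) flattens label-prediction pairs before computing AUC,
    # which is what the documentation recommends for multi-class data
    metrics=['accuracy', tf.keras.metrics.AUC(multi_label=False, name='auc')],
)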
AUC can indeed be higher than accuracy: accuracy is measured at a single threshold, while AUC measures ranking quality across all thresholds.
Additionally, you can use the ROC curve to choose the cutoff threshold for a binary classifier (this cutoff is 0.5 by default). Though there are more principled ways to choose this cutoff, you could simply increase it from 0 to 1 and pick the value that maximizes your accuracy (this is a naive solution, and I recommend reading https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/One_ROC_Curve_and_Cutoff_Analysis.pdf for an in-depth explanation of cutoff analysis).
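A naive version of that sweep might look like the following, assuming y_true are the binary labels and probs are the positive-class probabilities (both are placeholders here):
import numpy as np
from sklearn.metrics import accuracy_score

thresholds = np.linspace(0.0, 1.0, 101)
accuracies = [accuracy_score(y_true, probs >= t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(accuracies))]
print(best_threshold, max(accuracies))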
I am trying to get the ROC curve for a binary (good/bad) classifier that I used for a project. This classifier uses the genetic algorithm to make predictions.
E.g. a test chromosome given by [1.0, 0.5, 0.4, 0.7] is said to be good if it matches another chromosome, say [0.8, 0.5, 0.3, 0.6]. By matching, I mean having a Euclidean distance (from the other chromosome) below a particular value.
I have completed the classification of the 600 instances, and I have the final confusion matrix (by this I mean the four-valued table from which we can calculate the final TPR and FPR), the correct classification labels for each instance, and also all the predictions for each instance.
I have read the documentation about the ROC curve, Receiver operating characteristic, and Tools for Machine Learning Performance Evaluation: ROC Curves in Python. How do I proceed to get the ROC curve?
With my final four-valued table I think I can only plot a single point on the curve. The links above keep mentioning that I need a score (i.e., a probability score), but I don't know how to get one for a genetic algorithm classifier. How do I use each instance's prediction to create a kind of continuous ROC curve?
Disclaimer: I am new to the ROC plotting thing, and I am coding this in Python - hence, I attached the Python-related ROC documents.
It does not matter how you created your classifier. In the end, your model simply gives a positive label iff ||x - x_i|| < T, where T is some predefined threshold. ROC curves are parametrized by exactly this kind of thing: a scalar value that you can vary to make the model more biased toward classifying instances as positive or negative. So simply go through multiple values of T, compute the TPR and FPR for each value, and this will give you your ROC curve. That's all!
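A rough sketch, assuming y_true holds the correct labels (1 = good) and distances holds each instance's Euclidean distance to its reference chromosome (both names are placeholders): since a smaller distance means "more positive", the negated distance can be used directly as the score, and sklearn's roc_curve performs the sweep over T for you.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

scores = -np.asarray(distances)  # smaller distance => more likely positive, so negate to get a "higher is better" score
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()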
I've got a question about where the sklearn SVM classifier, at default settings, will be on an ROC curve or, failing that, how to find out. I had been under the impression that the ROC curve was a description of general performance, so trying to find the exact position of the classifier was new to me.
Assume that the ROC curve looks like the mean curve on the graph provided here.
Assuming you train an SVM on the entire dataset at default settings, where will it lie on the ROC curve?
EDIT: Clarification
Assume I train an SVM at default values (sklearn); how would I determine where on the ROC curve it lies? Alternatively, which setting on the SVC class allows me to set the ROC position?
I think you're misunderstanding the concept of an ROC curve. A model doesn't "lie on the ROC"; a model has an ROC curve. This can be used for evaluating your model, or for deciding how you're going to use your model.
Evaluating your model's performance
To calculate the ROC curve of your model, use the roc_curve function, passing the actual labels and the predicted probabilities of the positive class:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y, model.predict_proba(X)[:, 1])
If you want a single measure of your model's performance, you can use the area under the ROC curve; this can be useful if you're trying to tune the hyperparameters of your model, optimise your feature selection, etc. A typical way to calculate this (with k-fold cross-validation) in sklearn would be:
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer versions
cross_val_score(model, X, y, scoring='roc_auc')
Using your model to predict.
If you just call model.predict(X), the model will predict based on a probability threshold of 0.5. This is probably not what you want: as @AndreHolzner pointed out in the comments on your question, you'll want to use your ROC curve to decide the false positive rate that you're willing to accept. After that you just check whether your predicted positive-class probabilities are above the corresponding threshold or not:
thresh = 0.8
predictions = model.predict_proba(X)[:, 1] > thresh
I am using the scikit-learn library for Python for a classification problem. I used RandomForestClassifier and an SVM (the SVC class). However, while the RF achieves about 66% precision and 68% recall, the SVM only gets up to 45% for each.
I did a grid search over the parameters C and gamma for the RBF-SVM and also considered scaling and normalization in advance. However, I think the gap between the RF and the SVM is still too large.
What else should I consider to get an adequate SVM performance?
I thought it should be possible to get at least up to equal results.
(All the scores are obtained by cross-validation on the very same test and training sets.)
As EdChum said in the comments, there is no rule or guarantee that any one model always performs best.
The SVM with RBF kernel model makes the assumption that the optimal decision boundary is smooth and rotation invariant (once you fix a specific feature scaling that is not rotation invariant).
The random forest does not make the smoothness assumption (its prediction function is piecewise constant) and favors axis-aligned decision boundaries.
The assumptions made by the RF model might just better fit the task.
BTW, thanks for having grid searched C and gamma and checked the impact of feature normalization before asking on stackoverflow :)
Edit: to get some more insight, it might be interesting to plot the learning curves for the two models. It might be the case that the SVM model's regularization and kernel bandwidth cannot deal with overfitting well enough, while the ensemble nature of the RF works better for this dataset size. The gap might get smaller if you had more data. The learning curve plot is a good way to check how much your model would benefit from more samples.
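A possible sketch with scikit-learn's learning_curve, where rf_clf and svm_clf are placeholders for your two estimators and X, y is your full dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

for name, model in [("RF", rf_clf), ("SVM", svm_clf)]:
    # returns absolute training-set sizes plus train/validation scores for each size
    sizes, train_scores, valid_scores = learning_curve(
        model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
    plt.plot(sizes, valid_scores.mean(axis=1), marker="o", label=name + " (validation)")

plt.xlabel("Training set size")
plt.ylabel("Cross-validated score")
plt.legend()
plt.show()
If the SVM's validation curve is still rising steeply at the largest training size while the RF's has flattened, that would support the "more data would close the gap" hypothesis.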