Scoring parameter in BayesSearchCV class confusion

Scoring parameter in BayesSearchCV class confusion - python

I'm using BayesSearchCV from scikit-optimize to train a model on a fairly imbalanced dataset. From what I'm reading precision or ROC AUC would be the best metrics for imbalanced dataset. In my code:
knn_b = BayesSearchCV(estimator=pipe, search_spaces=search_space, n_iter=40, random_state=7, scoring='roc_auc')
knn_b.fit(X_train, y_train)
The number of iterations is just a random value I chose (although I get a warning saying I already reached the best result, and there is not a way to early stop as far as I'm aware?). For the scoring parameter, I specified roc_auc, which I'm assuming it will be the primary metric to monitor for the best parameter in the results. So when I call knn_b.best_params_, I should have the parameters where the roc_auc metrics is higher. Is that correct?
My confusion is when I look at the results using knn_b.cv_results_. Shouldn't the mean_test_score be the roc_auc score because of the scoring param in the BayesSearchCV class? What I'm doing it plotting the results and seeing how each combination of params performed.
sns.relplot(
data=knn_b.cv_results_, kind='line', x='param_classifier__n_neighbors', y='mean_test_score',
hue='param_scaler', col='param_classifier__p',
)
When I try to use to roc_auc_score() function on the true and predicted values, I get something completely different.
Is the mean_test_score here different? How would I be able to get the individual/mean roc_auc score of each CV/split of each iteration? Similarly for when I want to use RandomizedSearchCV or GridSearchCV.
EDIT: tldr; I want to know what's being computed exactly in mean_test_score. I thought it was roc_auc because of the scoring param, or accuracy, but it seems to be neither.

mean_test_score is the AUROC, because of your scoring parameter, yes.
Your main problem is that the ROC curve (and the area under it) require the probability predictions (or other continuous score), not the hard class predictions. Your manual calculation is thus incorrect.
You shouldn't expect exactly the same score anyway. Your second score is on the test set, and the first score is optimistically biased by the hyperparameter selection.

Related

wanted insights on score method of sklearn classifiers

There is a .score() function for classifiers that sklearn provides to us like LogisticRegression,DecisionTreeClassifier,etc.Does this score function returns the score on the basis of accuracy of its prediction?If yes then what about the cases where accuracy might not be the best parameter to evaluate the performance of the model?Is the score function self adjusting according to the use cases?

Yes, as you can see from the documentation of LogisticRegression and DecisionTreeClassifier, the score method returns the "Mean accuracy of self.predict(X) wrt. y.". So, it does indeed return the accuracy of the predictions.
In cases where you want to use other metrics to evaluate a model's performance, you can use the metrics provided in the scikit-learn library which you can find on the scikit-learn's website.
An example would be using F1 as a metric. You can have your true values y_true and your predicted values y_pred, then calling f1_score(y_true, y_pred) to get the F1 result.

Does tf.keras.metrics.AUC work on multi-class problems?

I have a multi-class classification problem and I want to measure AUC on training and test data.
tf.keras has implemented AUC metric (tf.keras.metrics.AUC), but I'm not be able to see whether this metric could safely be used in multi-class problems. Even, the example "Classification on imbalanced data" on the official Web page is dedicated to a binary classification problem.
I have implemented a CNN model that predicts six classes, having a softmax layer that gives the probabilities of all the classes. I used this metric as follows
self.model.compile(loss='categorical_crossentropy',
optimizer=Adam(hp.get("learning_rate")),
metrics=['accuracy', AUC()]),
and the code was executed without any problem. However, sometimes I see some results that are quite strange for me. For example, the model reported an accuracy of 0.78333336 and AUC equal to 0.97327775, Is this possible? Can a model have a low accuracy and an AUC so high?
I wonder that, although the code does not give any error, the AUC metric is computing wrong.
Somebody may confirm me whether or not this metrics support multi-class classification problems?

There is the argument multi_label which is a boolean inside your tf.keras.metrics.AUC call.
If True (not the default), multi-label data will be treated as such, and so AUC is computed separately for each label and then averaged across labels.
When False (the default), the data will be flattened into a single label before AUC computation. In the latter case, when multi-label data is passed to AUC, each label-prediction pair is treated as an individual data point.
The documentation recommends to set it to False for multi-class data.
e.g.: tf.keras.metrics.AUC(multi_label = True)
See the AUC Documentation for more details.

AUC can have a higher score than accuracy.
Additionally, you can use AUC to decide the cutoff threshold for a binary classifier(this cutoff is by default 0.5). Though there are more technical ways to decide this cutoff, you could simply simply increase it from 0 to 1 to find the value which maximizes your accuracy(this is a naive solution and 1 recommend you to read this https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/One_ROC_Curve_and_Cutoff_Analysis.pdf for an in depth explanation on cutoff analysis )

What does sklearn "RidgeClassifier" do?

I'm trying to understand the difference between RidgeClassifier and LogisticRegression in sklearn.linear_model. I couldn't find it in the documentation.
I think I understand quite well what the LogisticRegression does.It computes the coefficients and intercept to minimise half of sum of squares of the coefficients + C times the binary cross-entropy loss, where C is the regularisation parameter. I checked against a naive implementation from scratch, and results coincide.
Results of RidgeClassifier differ and I couldn't figure out, how the coefficients and intercept are computed there? Looking at the Github code, I'm not experienced enough to untangle it.
The reason why I'm asking is that I like the RidgeClassifier results -- it generalises a bit better to my problem. But before I use it, I would like to at least have an idea where does it come from.
Thanks for possible help.

RidgeClassifier() works differently compared to LogisticRegression() with l2 penalty. The loss function for RidgeClassifier() is not cross entropy.
RidgeClassifier() uses Ridge() regression model in the following way to create a classifier:
Let us consider binary classification for simplicity.
Convert target variable into +1 or -1 based on the class in which it belongs to.
Build a Ridge() model (which is a regression model) to predict our target variable. The loss function is MSE + l2 penalty
If the Ridge() regression's prediction value (calculated based on decision_function() function) is greater than 0, then predict as positive class else negative class.
For multi-class classification:
Use LabelBinarizer() to create a multi-output regression scenario, and then train independent Ridge() regression models, one for each class (One-Vs-Rest modelling).
Get prediction from each class's Ridge() regression model (a real number for each class) and then use argmax to predict the class.

Does sklearn LogisticRegressionCV use all data for final model

I was wondering how the final model (i.e. decision boundary) of LogisticRegressionCV in sklearn was calculated. So say I have some Xdata and ylabels such that
Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary
and now I run
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0],cv=5)
clf.fit(Xdata,ylabels)
This is looking at just one regularization parameter and 5 folds in the CV. So clf.scores_ will be a dictionary with one key with a value that is an array with shape (n_folds,1). With these five folds you can get a better idea of how the model performs.
However, I'm confused about what you get from clf.coef_ (and I'm assuming the parameters in clf.coef_ are the ones used in clf.predict). I have a few options I think it could be:
The parameters in clf.coef_ are from training the model on all the data
The parameters in clf.coef_ are from the best scoring fold
The parameters in clf.coef_ are averaged across the folds in some way.
I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:
GridSearchCV final model
scikit-learn LogisticRegressionCV: best coefficients
Using cross validation and AUC-ROC for a logistic regression model in sklearn
Evaluating Logistic regression with cross validation

You are mistaking between hyper-parameters and parameters. All scikit-learn estimators which have CV in the end, like LogisticRegressionCV, GridSearchCV, or RandomizedSearchCV tune the hyper-parameters.
Hyper-parameters are not learnt from training on the data. They are set prior to learning assuming that they will contribute to optimal learning. More information is present here:
Hyper-parameters are parameters that are not directly learnt within
estimators. In scikit-learn they are passed as arguments to the
constructor of the estimator classes. Typical examples include C,
kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.
In case of LogisticRegression, C is a hyper-parameter which describes the inverse of regularization strength. The higher the C, the less regularization is applied on the training. Its not that C will be changed during training. It will be fixed.
Now coming to coef_. coef_ contains coefficient (also called weights) of the features, which are learnt (and updated) during the training. Now depending on the value of C (and other hyper-parameters present in contructor), these can vary during the training.
Now there is another topic on how to get the optimum initial values of coef_, so that the training is faster and better. Thats optimization. Some start with random weights between 0-1, others start with 0, etc etc. But for the scope of your question, that is not relevant. LogisticRegressionCV is not used for that.
This is what LogisticRegressionCV does:
Get the values of different C from constructor (In your example you passed 1.0).
For each value of C, do the cross-validation of supplied data, in which the LogisticRegression will be fit() on training data of the current fold, and scored on the test data. The scores from test data of all folds are averaged and that becomes the score of the current C. This is done for all C values you provided, and the C with the highest average score will be chosen.
Now the chosen C is set as the final C and LogisticRegression is again trained (by calling fit()) on the whole data (Xdata,ylabels here).
Thats what all the hyper-parameter tuners do, be it GridSearchCV, or LogisticRegressionCV, or LassoCV etc.
The initializing and updating of coef_ feature weights is done inside the fit() function of the algorithm which is out of scope for the hyper-parameter tuning. That optimization part is dependent on the internal optimization algorithm of the process. For example solver param in case of LogisticRegression.
Hope this makes things clear. Feel free to ask if still any doubt.

You have the parameter refit=True by default. On the docs you can read:
If set to True, the scores are averaged across all folds, and the
coefs and the C that corresponds to the best score is taken, and a
final refit is done using these parameters. Otherwise the coefs,
intercepts and C that correspond to the best scores across folds are
averaged.
So if refit=True the CV model is retrained using all the data.
When it says the final refit is done using these parameters it is talking about the C regularization parameter. So it uses the C that gives the best
average score across the K folds.
When refit=False it retrieves you the best model in cross validation.
So if you trained 5 folds, you will get the model (coeff + C + intercept), trained on 4 folds of data, which gave the best score on its fold test set.
I agree that the documetation here is not very clear but averaging C values and coefficients does not really make much sense

I just took a look at the source code. It seems for refit = True, they just selected the best hyperparameter (C and l1_ratio) and retrain the model with all the data.
for refit = False:
It seems they do average the hyperparameters, see the blow source code:
best_indices = np.argmax(scores, axis=1)
...
best_indices_C = best_indices % len(self.Cs_)
self.C_.append(np.mean(self.Cs_[best_indices_C]))

Default choice of SVM position on ROC curve

I've got a question about where the sklearn SVM classifier, on default settings, will be on a ROC curve or, failing that, how to find out. I've been of the assumption that the ROC curve was a description of general performance, so trying to find the exact position of the classifier was new to me.
Assume that the ROC curve looks like the mean on the graph provided
here.
Assuming you train a SVM on the entire dataset at default settings, where will it lie on the ROC curve
EDIT: Clarification
Assume I train a SVM at default values (sklearn), how would I determine where on the ROC curve it was. Alternatively, which setting on the SVC class allows me to set ROC position?

I think you're misunderstanding the concept of an ROC. A model doesn't "lie on the ROC", a model has an ROC curve. This can be used for evaluating your model, or for deciding how you're going to use your model.
Evaluating your model's performance
To calculate the ROC of your model, use the roc_curve function, with inputs as the predicted probabilities from your model, and the actual results:
from sklearn.metrics import roc_curve
roc = roc_curve(model.predict_proba(X), y)
If you want a single measure of your model's performance, you can use the area under the ROC; this can be useful if you're trying to tune hyperparamaters of your model, optimise your feature selection, etc. A typical way to calculate this (with k-fold cross validation) in sklearn would be:
from sklearn.cross_validation import cross_val_score
cross_val_score(model, X, y, scoring = 'roc_auc')
Using your model to predict.
If you just call model.predict(X), the model will predict based on a probability threshold of 0.5. This is probably not what you want: as #AndreHolzner pointed out in the comment to your question, you'll want to use your ROC curve to decide the false positive rate that you're willing to accept. After this you just check whether your predicted probabilities are above this threshold or not:
thresh = 0.8
predictions = model.predict_proba(X) > thresh

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.