Combination of GridSearchCV's refit and scorer unclear - python

I use GridSearchCV to find the best parameters in the inner loop of my nested cross-validation. The 'inner winner' is found using GridSearchCV(scoring='balanced_accuracy'), so as I understand the documentation, the model with the highest average balanced accuracy across the inner folds is the best_estimator_. I don't understand what the different arguments for refit in GridSearchCV do in combination with the scoring argument. If refit is True, what scoring function will be used to estimate the performance of that 'inner winner' when refitted to the dataset? The same scoring function that was passed to scoring (so in my case 'balanced_accuracy')? Why can you also pass a string to refit? Does that mean that you can use different functions for 1) finding the 'inner winner' and 2) estimating the performance of that 'inner winner' on the whole dataset?

When refit=True, sklearn refits the model on the entire training set. No data is held out at that point, so the refit itself does not produce a performance estimate with any scoring function.
If you pass multiple scorers to GridSearchCV, say f1_score or precision along with your balanced_accuracy, sklearn needs to know which of those scorers should pick the "inner winner", as you call it. For example with KNN, f1_score might be best at K=5 while accuracy is highest at K=10; there is no way for sklearn to decide on its own which value of the hyper-parameter K is "the best".
To resolve that, you pass the name of one scorer as a string to refit, specifying which of those scorers should ultimately decide the best hyper-parameters. That best combination is then used to retrain (refit) the model on the full dataset, as in the sketch below. So when you have just one scorer, as seems to be your case, you don't have to worry about this: refit=True will suffice.
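A minimal sketch of both situations (the dataset, model and grid are illustrative, not taken from the question): with several scorers, refit names the one that decides the winner; with a single scorer, refit=True simply refits the best model on the whole training set.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10]},
    scoring={"bal_acc": "balanced_accuracy", "f1": "f1"},  # two scorers
    refit="bal_acc",  # this scorer picks best_params_ and drives the final refit
    cv=5,
)
search.fit(X, y)
print(search.best_params_)                  # chosen by balanced accuracy
print(search.cv_results_["mean_test_f1"])   # f1 is still reported per candidate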

Related

xgboost and gridsearchcv in python

I have a question about this tutorial.
The author is doing hyper-parameter tuning. The first block shows different values of the hyperparameters.
Then he initializes GridSearchCV with cv=3 and scoring='roc_auc'.
Then he fits the GridSearchCV object and passes eval_set and eval_metric='auc'.
What is the purpose of using both cv and eval_set? Shouldn't we use just one of them? How are they used along with scoring='roc_auc' and eval_metric='auc'?
Is there a better way to do hyper-parameter tuning with GridSearchCV? Please suggest or provide a link.
GridSearchCV performs cross-validation for hyperparameter tuning using only the training data. Since refit=True by default, the best fit is then validated on the eval_set provided (a true test score).
You can use any metric for the cross-validation and for the testing. However, it would be odd to use different metrics for the CV hyperparameter optimization and for the testing phase, so the same metric is used. If you are wondering about the slightly different metric names, I think it's just because xgboost is a sklearn-interface-compliant package that is not developed by the same team as sklearn. Both should compute the same thing (the area under the receiver operating characteristic curve of the predictions). Take a look at the sklearn docs: auc and roc_auc.
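For concreteness, here is a minimal sketch of that setup (it assumes xgboost's scikit-learn wrapper and uses made-up data and a made-up grid): cv and scoring='roc_auc' drive the hyperparameter search on the training data, while eval_set and eval_metric='auc' are only seen by xgboost during each individual fit, e.g. for monitoring or early stopping.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    XGBClassifier(eval_metric="auc"),  # recent xgboost versions take eval_metric here
    param_grid={"max_depth": [3, 5], "n_estimators": [100, 200]},
    cv=3,
    scoring="roc_auc",  # used by GridSearchCV to pick the best parameters
)
# eval_set is forwarded to XGBClassifier.fit for every candidate fit and for the refit
search.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print(search.best_params_, search.best_score_)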
I don't think there is a better way.

Confusion around the SKLearn GridSearchCV scoring parameter and using train test split

I'm a little bit confused about how GridSearchCV works with Train Test Split.
As far as I know, when creating models for the dataset I'm using, a paper used roc-auc.
I'm trying to replicate what this paper did, at least as well as I can. From reading a few other posts here, I've gathered that running GridSearchCV on the entire dataset is prone to overfitting, so we should split the data into a training partition and a testing partition. Then we should run GridSearchCV on the training partition with whatever model and parameter grid, fit it, and finally get a score using the test partition we set aside.
Now where I'm confused is with GridSearchCV: as far as I understand, it gives us scores for each of the folds the data is split into during the parameter search, and with best_score_ we can pull the best of these scores. I don't understand what these scores represent, or why you can pass in a scoring parameter at all, since the job of GridSearchCV is to find the best possible parameters anyway. (Perhaps this is a poor assumption, but I'm assuming there is an objectively best set of parameters, regardless of scoring method.) What I figured was that I would find the best parameters with GridSearchCV, use those parameters to create and fit a model, and finally test that model on the partition I saved for testing, using the roc-auc scoring method.
So, in the end, does it matter (if at all) what scoring method I pass into GridSearchCV, since it will always look for the best set of parameters anyway, which I will then use to compute my final score on the testing partition?
This document may help.
Here you can see that the scoring parameter allows you to use various metrics, such as roc_auc. See all of Scikit-learn's metrics here.
Optimizing over different metrics results in different optimal parameters. Just think about optimizing precision versus recall: optimizing precision leads to fewer false positives, while optimizing recall leads to fewer false negatives.
Also, the CV in GridSearchCV stands for cross-validation. The train/test splitting for the search happens inside this function; it is taken care of for you. You only have to provide the splitter as an argument to GridSearchCV, for example cv=StratifiedKFold(n_splits=5, shuffle=True), as in the sketch below.
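A sketch of that workflow (data, model and grid are illustrative): tune with roc_auc inside GridSearchCV on the training partition, then score the refitted winner on the held-out test partition with the same metric.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)  # refit=True: the best parameters are retrained on all of X_train
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, test_auc)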

Does sklearn LogisticRegressionCV use all data for final model

I was wondering how the final model (i.e. decision boundary) of LogisticRegressionCV in sklearn was calculated. So say I have some Xdata and ylabels such that
Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary
and now I run
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0],cv=5)
clf.fit(Xdata,ylabels)
This is looking at just one regularization parameter and 5 CV folds. So clf.scores_ will be a dictionary with one key, whose value is an array of shape (n_folds, 1). With these five folds you can get a better idea of how the model performs.
However, I'm confused about what you get from clf.coef_ (I'm assuming the parameters in clf.coef_ are the ones used by clf.predict). These are the options I think it could be:
The parameters in clf.coef_ are from training the model on all the data
The parameters in clf.coef_ are from the best scoring fold
The parameters in clf.coef_ are averaged across the folds in some way.
I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:
GridSearchCV final model
scikit-learn LogisticRegressionCV: best coefficients
Using cross validation and AUC-ROC for a logistic regression model in sklearn
Evaluating Logistic regression with cross validation
You are confusing hyper-parameters with parameters. All scikit-learn estimators whose names end in CV, like LogisticRegressionCV, GridSearchCV, or RandomizedSearchCV, tune hyper-parameters.
Hyper-parameters are not learnt by training on the data. They are set prior to learning, in the hope that they will lead to optimal learning. More information is available here:
Hyper-parameters are parameters that are not directly learnt within
estimators. In scikit-learn they are passed as arguments to the
constructor of the estimator classes. Typical examples include C,
kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.
In the case of LogisticRegression, C is a hyper-parameter which describes the inverse of the regularization strength. The higher C is, the less regularization is applied during training. C is not changed during training; it stays fixed.
Now coming to coef_: coef_ contains the coefficients (also called weights) of the features, which are learnt (and updated) during training. Depending on the value of C (and the other hyper-parameters set in the constructor), these can vary during training.
There is a separate topic about how to choose good initial values of coef_ so that training is faster and better: that is optimization. Some solvers start with random weights between 0 and 1, others start with 0, and so on. But for the scope of your question, that is not relevant; LogisticRegressionCV is not used for that.
This is what LogisticRegressionCV does:
Get the different values of C from the constructor (in your example you passed only 1.0).
For each value of C, cross-validate the supplied data: LogisticRegression is fit() on the training data of the current fold and scored on that fold's test data. The test scores from all folds are averaged, and that average becomes the score of the current C. This is done for all C values you provided, and the C with the highest average score is chosen.
The chosen C is set as the final C, and LogisticRegression is trained again (by calling fit()) on the whole data (Xdata, ylabels here).
That's what all the hyper-parameter tuners do, be it GridSearchCV, LogisticRegressionCV, LassoCV, etc.
The initialization and updating of the coef_ feature weights happen inside the fit() function of the algorithm, which is out of scope for the hyper-parameter tuning. That optimization part depends on the internal optimization algorithm, for example the solver parameter of LogisticRegression. The sketch below mirrors these steps.
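Here is a rough sketch of those steps with plain LogisticRegression and cross_val_score (illustrative only, not the actual sklearn internals, which use a regularization path and other optimizations):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

Xdata, ylabels = make_classification(n_samples=200, random_state=0)
Cs = [0.01, 0.1, 1.0, 10.0]

# Steps 1-2: average CV score for each candidate C
mean_scores = [cross_val_score(LogisticRegression(C=C, max_iter=1000),
                               Xdata, ylabels, cv=5).mean() for C in Cs]
best_C = Cs[int(np.argmax(mean_scores))]

# Step 3: refit on the whole dataset with the chosen C; this model's coef_ plays
# the role of what LogisticRegressionCV(Cs=Cs, cv=5).fit(Xdata, ylabels).coef_ holds
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(Xdata, ylabels)
print(best_C, final_model.coef_.shape)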
Hope this makes things clear. Feel free to ask if anything is still unclear.
The parameter refit is True by default. In the docs you can read:
If set to True, the scores are averaged across all folds, and the
coefs and the C that corresponds to the best score is taken, and a
final refit is done using these parameters. Otherwise the coefs,
intercepts and C that correspond to the best scores across folds are
averaged.
So if refit=True the CV model is retrained using all the data.
When it says the final refit is done using these parameters, it is talking about the C regularization parameter. So it uses the C that gives the best average score across the K folds.
When refit=False it gives you the best model from cross-validation. So if you trained 5 folds, you get the model (coefficients + C + intercept) that was trained on 4 folds of data and gave the best score on its fold's test set.
I agree that the documentation here is not very clear, but averaging C values and coefficients does not really make much sense.
I just took a look at the source code. It seems that for refit=True, they just select the best hyperparameters (C and l1_ratio) and retrain the model on all the data.
For refit=False:
It seems they do average the hyperparameters; see the source code below:
best_indices = np.argmax(scores, axis=1)
...
best_indices_C = best_indices % len(self.Cs_)
self.C_.append(np.mean(self.Cs_[best_indices_C]))
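A quick illustrative check of the two behaviours (made-up data; C_ is the attribute that stores the selected C per class):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=300, random_state=0)
Cs = [0.01, 0.1, 1.0, 10.0]

clf_refit = LogisticRegressionCV(Cs=Cs, cv=5, refit=True, max_iter=1000).fit(X, y)
clf_norefit = LogisticRegressionCV(Cs=Cs, cv=5, refit=False, max_iter=1000).fit(X, y)

print(clf_refit.C_)    # the single best C, used to refit on all of X
print(clf_norefit.C_)  # mean of the per-fold best Cs (see the source snippet above)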

How to run scikit's cross validation with several classifiers on the same folds

I'm currently working on a research study comparing classifier performance. To evaluate that performance, I'm computing the accuracy, the area under the curve and the squared error for each classifier on all the datasets I have. Besides, I need to tune parameters for some of the classifiers in order to select the best parameters in terms of accuracy, so a validation set is required (I chose 20% of the dataset).
I was told that, in order to make this comparison more meaningful, the cross-validation should be performed on the same folds for each classifier.
So basically, is there a way to use the cross_val_score method so that it always runs on the same folds for all the classifiers, or should I write some code from scratch to do this job?
Thank you in advance.
cross_val_score accepts a cv parameter which represents the cross-validation object you want to use. You probably want StratifiedKFold, which accepts a shuffle parameter specifying whether to shuffle the data prior to running cross-validation on it.
cv can also be an int, in which case a StratifiedKFold or KFold object is created automatically with K = cv folds.
As you can tell from the documentation, shuffle is False by default, so by default the cross-validation will already be performed on the same folds for all of your classifiers.
You can test it by running it twice on the same classifier to make sure (you should get the exact same results).
You can specify it yourself like this:
your_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # or shuffle=False; fix random_state so the shuffled folds are identical for every classifier
cross_val_score(your_estimator, your_X, y=your_y, cv=your_cv)
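Putting it together, a sketch of comparing several classifiers on exactly the same folds (the classifiers and data are just examples) by reusing one cv object with a fixed random_state:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")  # identical splits for both
    print(type(clf).__name__, scores.mean())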

Confused with respect to the working of GridSearchCV

GridSearchCV implements a fit method in which it performs n-fold cross-validation to determine the best parameters. After this we can directly apply the best estimator to the testing data using predict(), following this link: http://scikit-learn.org/stable/auto_examples/grid_search_digits.html
It says here "The model is trained on the full development set"
However, we have only applied n-fold cross-validation here. Is the classifier somehow also training itself on the entire data? Or is it just choosing, from among the n folds, the trained estimator with the best parameters when applying predict()?
If you want to use predict, you'll need to set 'refit' to True. From the documentation:
refit : boolean
Refit the best estimator with the entire dataset.
If “False”, it is impossible to make predictions using
this GridSearchCV instance after fitting.
It looks like it is True by default, so in the example predict is based on a model refitted on the whole training set, as sketched below.
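A minimal sketch of that behaviour (the grid is just an example): with the default refit=True, GridSearchCV runs the n-fold search on the data passed to fit() and then retrains the best parameter combination on that full development set, so predict() uses a model trained on all of it rather than on a single fold.

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, random_state=0)

search = GridSearchCV(SVC(), {"C": [1, 10], "gamma": [1e-3, 1e-4]}, cv=5)
search.fit(X_dev, y_dev)             # CV picks parameters, then refits on all of X_dev
print(search.score(X_eval, y_eval))  # predictions come from the refitted best_estimator_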
