I was wondering how the final model (i.e. decision boundary) of LogisticRegressionCV in sklearn was calculated. So say I have some Xdata and ylabels such that
Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary
and now I run
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0],cv=5)
clf.fit(Xdata,ylabels)
This is looking at just one regularization parameter and 5 folds in the CV, so clf.scores_ will be a dictionary with one key whose value is an array of shape (n_folds, 1). With these five folds you can get a better idea of how the model performs.
However, I'm confused about what you get from clf.coef_ (and I'm assuming the parameters in clf.coef_ are the ones used in clf.predict). There are a few options for what it could be:
The parameters in clf.coef_ are from training the model on all the data
The parameters in clf.coef_ are from the best scoring fold
The parameters in clf.coef_ are averaged across the folds in some way.
I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:
GridSearchCV final model
scikit-learn LogisticRegressionCV: best coefficients
Using cross validation and AUC-ROC for a logistic regression model in sklearn
Evaluating Logistic regression with cross validation
You are confusing hyper-parameters with parameters. All scikit-learn estimators whose names end in CV, like LogisticRegressionCV, GridSearchCV, or RandomizedSearchCV, tune hyper-parameters.
Hyper-parameters are not learnt by training on the data. They are set prior to learning, on the assumption that they will contribute to optimal learning. More information is available here:
Hyper-parameters are parameters that are not directly learnt within
estimators. In scikit-learn they are passed as arguments to the
constructor of the estimator classes. Typical examples include C,
kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.
In the case of LogisticRegression, C is a hyper-parameter that describes the inverse of the regularization strength. The higher the C, the less regularization is applied during training. It's not that C will be changed during training; it stays fixed.
Now coming to coef_. coef_ contains the coefficients (also called weights) of the features, which are learnt (and updated) during training. Depending on the value of C (and the other hyper-parameters passed to the constructor), these can vary during training.
There is a separate topic of how to get optimal initial values of coef_ so that training is faster and better. That's optimization. Some solvers start with random weights between 0 and 1, others start with 0, and so on. But for the scope of your question that is not relevant; LogisticRegressionCV is not used for that.
This is what LogisticRegressionCV does:
Get the candidate values of C from the constructor (in your example you passed just 1.0).
For each value of C, cross-validate on the supplied data: in each fold a LogisticRegression is fit() on the training part of the fold and scored on the test part. The test scores from all folds are averaged, and that average becomes the score of the current C. This is done for all the C values you provided, and the C with the highest average score is chosen.
The chosen C is then set as the final C, and LogisticRegression is trained again (by calling fit()) on the whole data (Xdata, ylabels here).
That's what all the hyper-parameter tuners do, be it GridSearchCV, LogisticRegressionCV, or LassoCV.
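As a rough illustration, here is a minimal sketch of that procedure, assuming binary ylabels and accuracy scoring (the real implementation is more efficient and shares work across the regularization path):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

Cs = [0.01, 0.1, 1.0, 10.0]  # candidate hyper-parameter values
mean_scores = []
for C in Cs:
    fold_scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=5).split(Xdata, ylabels):
        model = LogisticRegression(C=C).fit(Xdata[train_idx], ylabels[train_idx])
        fold_scores.append(model.score(Xdata[test_idx], ylabels[test_idx]))
    mean_scores.append(np.mean(fold_scores))  # average CV score for this C

best_C = Cs[int(np.argmax(mean_scores))]
# Final refit on all the data; this is where coef_ comes from (with refit=True)
final_model = LogisticRegression(C=best_C).fit(Xdata, ylabels)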
The initialization and updating of the coef_ feature weights happens inside the algorithm's fit() function, which is out of scope for hyper-parameter tuning. That optimization part depends on the internal optimization algorithm of the process, for example the solver parameter in the case of LogisticRegression.
Hope this makes things clear. Feel free to ask if you still have any doubts.
You have the parameter refit=True by default. On the docs you can read:
If set to True, the scores are averaged across all folds, and the
coefs and the C that corresponds to the best score is taken, and a
final refit is done using these parameters. Otherwise the coefs,
intercepts and C that correspond to the best scores across folds are
averaged.
So if refit=True the CV model is retrained using all the data.
When it says the final refit is done using these parameters it is talking about the C regularization parameter. So it uses the C that gives the best
average score across the K folds.
When refit=False, no final refit on all the data is done. Per the documentation quoted above, the coefs, intercepts and C that correspond to the best score in each fold are averaged to give the final model.
I agree that the documentation here is not very clear, but averaging C values and coefficients does not really make much sense.
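A quick way to see the difference is to fit with both settings and compare the stored C_ and coef_ (a minimal sketch; Xdata and ylabels as in the question):

from sklearn.linear_model import LogisticRegressionCV

clf_refit = LogisticRegressionCV(Cs=[0.1, 1.0, 10.0], cv=5, refit=True).fit(Xdata, ylabels)
clf_norefit = LogisticRegressionCV(Cs=[0.1, 1.0, 10.0], cv=5, refit=False).fit(Xdata, ylabels)

print(clf_refit.C_, clf_refit.coef_)      # best C; coefficients from a refit on all the data
print(clf_norefit.C_, clf_norefit.coef_)  # averages of the per-fold best C and coefficients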
I just took a look at the source code. It seems that for refit=True, they just select the best hyperparameters (C and l1_ratio) and retrain the model on all the data.
For refit=False:
It seems they do average the hyperparameters; see the source code below:
best_indices = np.argmax(scores, axis=1)
...
best_indices_C = best_indices % len(self.Cs_)
self.C_.append(np.mean(self.Cs_[best_indices_C]))
My question has to do with GridSearchCV, RidgeCV, and StackingClassifier/Regressor.
Stacking Classifier/Regressor: AFAIK, it first trains each base estimator individually on the whole train set. Then it uses a cross-validation scheme, using the predictions of each base estimator as the new features to train the new final estimator. From the documentation: "To generalize and avoid over-fitting, the final_estimator is trained on out-samples using sklearn.model_selection.cross_val_predict internally."
My question is, what exactly does this mean? Does it break the train data into k folds, and then for each fold train the final estimator on the training section of the fold and test it on the testing section, and then take the final estimator weights from the fold with the best score? Or what?
I think I can group GridSearchCV and RidgeCV into the same question as they are quite similar (albeit RidgeCV uses leave-one-out CV by default).
To find the best hyperparameters, do they do a CV on all the folds for each set of hyperparameters, find the hyperparameters that had the best average score, AND THEN, AFTER finding the best hyperparameters, train the model with the best hyperparameters using the WHOLE training set? Or am I looking at it wrong?
If anyone could shed some light on this, that would be great. Thanks!
You're exactly right. The process looks like this:
Select the first set of hyperparameters
Partition the data into k-folds
Train the model on k-1 folds and score it on the held-out fold, for each of the k folds
Obtain the average score (loss, r2, or whatever specified criteria)
Repeat steps 2-4 for all other sets of hyperparameters
Choose the set of hyperparameters with the best score
Retrain the model on the entire dataset (as opposed to a single fold) using the best hyperparameters
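In scikit-learn terms, the whole procedure looks like this (a minimal sketch; the dataset and grid values are just placeholders):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(random_state=0)
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.1, 1.0, 10.0]},
                    cv=5)              # steps 1-6: cross-validate each candidate
grid.fit(X, y)                         # step 7: final refit on all of X, y (refit=True by default)
print(grid.best_params_)               # hyperparameters with the best mean CV score
print(grid.best_estimator_.coef_)      # coefficients from the final refit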
I'm creating a model to perform Logistic regression on a dataset using Python. This is my code:
from sklearn import linear_model
my_classifier2=linear_model.LogisticRegression(solver='lbfgs',max_iter=10000)
Now, according to the sklearn doc page, max_iter is the maximum number of iterations taken for the solvers to converge. How do I specifically state that I need 'N' iterations?
Any kind of help would be really appreciated.
I'm not sure, but do you want to know the optimal number of iterations for your model? If so, you are better off utilizing GridSearchCV, which can tune hyper-parameters like max_iter.
Briefly,
Split your data into two groups, train/test, with train_test_split or KFold, which can be imported from sklearn
Set your parameter grid, for instance para = [{'max_iter': [1, 10, 100, 1000]}]
Create the instance, for example clf = GridSearchCV(LogisticRegression(), param_grid=para, cv=5, scoring='accuracy')
Fit it on the train data like this: clf.fit(x_train, y_train)
You can also fetch the best number of iterations with RandomizedSearchCV or BayesianOptimization.
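Putting those steps together, a minimal sketch might look like this (placeholder data; I use accuracy scoring since r2 is a regression metric):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

para = [{'max_iter': [1, 10, 100, 1000]}]
clf = GridSearchCV(LogisticRegression(solver='lbfgs'), param_grid=para,
                   cv=5, scoring='accuracy')
clf.fit(x_train, y_train)
print(clf.best_params_)  # e.g. {'max_iter': 100}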
About the GridSearchCV of the max_iter parameter: the fitted LogisticRegression models have an attribute n_iter_, so you can discover the exact max_iter needed for a given sample size and feature set:
n_iter_: ndarray of shape (n_classes,) or (1, )
Actual number of iterations for all classes. If binary or multinomial, it
returns only 1 element. For liblinear solver, only the maximum number of
iteration across all classes is given.
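For example, you can fit once with a generous max_iter and read back how many iterations the solver actually used (a minimal sketch on placeholder data):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(solver='lbfgs', max_iter=10000).fit(X, y)
print(clf.n_iter_)  # actual iterations used, e.g. array([25]); far below max_iter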
Scanning very short intervals, like 1 by 1, is a waste of resources that could be spent on more important LogisticRegression fit parameters, such as the combination of the solver itself, its regularization penalty, and the inverse of the regularization strength C, which all contribute to faster convergence within a given max_iter.
Setting a very high max_iter can also be a waste of resources if you haven't previously done at least minimal feature preprocessing: feature scaling, and perhaps imputation, outlier clipping, and dimensionality reduction (e.g. PCA).
Things can become worse: a tuned max_iter could be fine for a given sample size but not for a bigger one, for instance if you are developing a cross-validated learning curve, which by the way is imperative for optimal machine learning.
It becomes even worse if you increase the sample size in a pipeline that generates feature vectors, such as n-grams (NLP): more rows will generate more (sparse) features for the LogisticRegression classification.
I think it's important to observe whether different solvers converge or not for a given sample size, generated features, and max_iter.
Methods that help faster convergence, and which eventually make increasing max_iter unnecessary, are:
Feature scaling
Dimensionality Reduction (e.g. PCA) of scaled features
There's a nice sklearn example demonstrating the importance of feature scaling
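A minimal sketch of that idea, comparing n_iter_ with and without scaling on placeholder data (exact iteration counts will vary):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
X[:, 0] *= 1000  # one badly scaled feature slows down convergence

raw = LogisticRegression(max_iter=10000).fit(X, y)
scaled = make_pipeline(StandardScaler(),
                       LogisticRegression(max_iter=10000)).fit(X, y)

print(raw.n_iter_)                                       # more iterations needed
print(scaled.named_steps['logisticregression'].n_iter_)  # typically far fewer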
How is GridSearchCV() (and or RandomizedSearchCV()) implemented in scikit? I wonder about the following: When using one of these techniques, how are the following aspects taken into account:
validation set
model selection
hyperparameter tuning
prediction?
Here is a picture that summarizes my confusion:
What happens when and how often? Maybe to keep it simple, let's assume a single neural network acting as our model.
My understanding so far:
In the first iteration, the model is fit on the training data, which is separated into different folds. Here I struggle already: is the model trained on a single fold and then tested on the validation fold?
What happens then with the next fold? Does the model keep the weights learned from its first training fold, or will it be re-initialized for the next training fold?
To be more precise: In the first iteration, is the model fit four times and tested four times on the validation set, independently between all folds?
When the next iteration begins, the model keeps no information from the first iteration, right?
Thus, are all iterations and all folds independent of each other?
How are the hyperparameters tuned here?
In above example, there are 25 folds in total. Is the model with a constant set of hyperparameters fit and tested 20 times?
Let's say, we have two hyperparameters to tune: Learning rate and dropout rate, both with two levels:
learning_rate = [0.3, 0.6] and
dropout_rate = [0.4, 0.8].
Will the neural net now be fitted 80 times? And when having not only a single model but e.g. two models (a neural network and a random forest), will the whole procedure be performed twice?
Is there a possibility to see how many folds GridSearchCV() will consider?
I have seen Does GridSearchCV perform cross-validation? , Model help using Scikit-learn when using GridSearch and scikit-learn GridSearchCV with multiple repetitions but I can't see a clear and precise answer to my questions.
So the k-folds method:
You split your training set into k parts (folds), for example 5. You take the first part as the validation set and the 4 other parts as the training set. You train, and this gives you a training/CV performance. You do this 5 times (the number of folds), with each fold in turn becoming the validation set and the others the training set. At the end you take the mean of the performances to obtain the CV performance of your model. This is the k-fold method.
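In code, that loop looks roughly like this (a minimal sketch with placeholder data and an SVC as the model):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    model = SVC().fit(X[train_idx], y[train_idx])        # train on 4 parts
    scores.append(model.score(X[val_idx], y[val_idx]))   # validate on the 5th
print(np.mean(scores))  # the CV performance of the model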
Now, GridSearchCV is a hyperparameter tuner which uses the k-fold method. The principle is that you give GridSearchCV a dictionary with all the hyperparameters you want to test; it then tests all the combinations and selects the best set of hyperparameters (those with the best model CV performance). It can take a very long time.
You pass a model (estimator) to GridSearchCV, a set of params, and, if you want, the number of k-folds.
Example:
GridSearchCV(SVC(), parameters, cv = 5)
where SVC() is the estimator, parameters is your dictionary of hyperparameters and cv is the number of folds.
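As for the counting question above: the number of fits is the number of hyperparameter combinations times the number of folds, plus one final refit on all the data. A minimal sketch with a 2x2 grid and placeholder data:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
parameters = {'C': [0.1, 1.0], 'gamma': [0.01, 0.1]}  # 2 x 2 = 4 combinations
grid = GridSearchCV(SVC(), parameters, cv=5, verbose=1)
grid.fit(X, y)  # verbose=1 prints: Fitting 5 folds for each of 4 candidates, totalling 20 fits
print(grid.best_params_)  # one extra refit on all the data then produced best_estimator_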
I'm trying to understand the difference between RidgeClassifier and LogisticRegression in sklearn.linear_model. I couldn't find it in the documentation.
I think I understand quite well what LogisticRegression does. It computes the coefficients and intercept to minimise half of the sum of squares of the coefficients plus C times the binary cross-entropy loss, where C is the regularisation parameter. I checked against a naive implementation from scratch, and the results coincide.
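For reference, that objective in sklearn's parametrization, with labels $y_i \in \{-1, +1\}$, is

$$\min_{w,\,c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + \exp\big(-y_i (x_i^T w + c)\big)\right)$$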
The results of RidgeClassifier differ, and I couldn't figure out how the coefficients and intercept are computed there. Looking at the GitHub code, I'm not experienced enough to untangle it.
The reason why I'm asking is that I like the RidgeClassifier results: it generalises a bit better on my problem. But before I use it, I would like to at least have an idea of where it comes from.
Thanks for possible help.
RidgeClassifier() works differently compared to LogisticRegression() with l2 penalty. The loss function for RidgeClassifier() is not cross entropy.
RidgeClassifier() uses Ridge() regression model in the following way to create a classifier:
Let us consider binary classification for simplicity.
Convert the target variable into +1 or -1 based on the class it belongs to.
Build a Ridge() model (which is a regression model) to predict the transformed target. The loss function is MSE + l2 penalty.
If the Ridge() regression's prediction value (computed via the decision_function() method) is greater than 0, predict the positive class, else the negative class.
For multi-class classification:
Use LabelBinarizer() to create a multi-output regression scenario, and then train independent Ridge() regression models, one for each class (One-Vs-Rest modelling).
Get prediction from each class's Ridge() regression model (a real number for each class) and then use argmax to predict the class.
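A minimal sketch of the binary case, showing that RidgeClassifier matches a plain Ridge regression on +1/-1 targets (placeholder data, alpha left at its default):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, RidgeClassifier

X, y = make_classification(random_state=0)  # y in {0, 1}
y_pm = 2 * y - 1                            # convert targets to {-1, +1}

reg = Ridge(alpha=1.0).fit(X, y_pm)         # MSE + l2 penalty on the +1/-1 targets
clf = RidgeClassifier(alpha=1.0).fit(X, y)

manual_pred = (reg.predict(X) > 0).astype(int)       # threshold the regression output at 0
print(np.array_equal(manual_pred, clf.predict(X)))   # True: same predictions
print(np.allclose(reg.coef_, clf.coef_))             # True: same underlying model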
I use a linear SVM from scikit-learn (LinearSVC) for a binary classification problem. I understand that LinearSVC can give me the predicted labels and the decision scores, but I want probability estimates (confidence in the label). I want to continue using LinearSVC because of its speed (compared to sklearn.svm.SVC with a linear kernel). Is it reasonable to use a logistic function to convert the decision scores to probabilities?
import sklearn.svm as suppmach
# Fit model:
svmmodel = suppmach.LinearSVC(penalty='l1', C=1, dual=False)  # the l1 penalty requires dual=False
predicted_test= svmmodel.predict(x_test)
predicted_test_scores= svmmodel.decision_function(x_test)
I want to check whether it makes sense to obtain probability estimates simply as 1 / (1 + exp(-x)), where x is the decision score.
Alternately, are there other options wrt classifiers that I can use to do this efficiently?
Thanks.
scikit-learn provides CalibratedClassifierCV, which can be used to solve this problem: it lets you add probability output to LinearSVC or any other classifier which implements the decision_function method:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

svm = LinearSVC()
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
The user guide has a nice section on that. By default CalibratedClassifierCV + LinearSVC will get you Platt scaling, but it also provides other options (the isotonic regression method), and it is not limited to SVM classifiers.
I took a look at the apis in sklearn.svm.* family. All below models, e.g.,
sklearn.svm.SVC
sklearn.svm.NuSVC
sklearn.svm.SVR
sklearn.svm.NuSVR
have a common interface that supplies a
probability: boolean, optional (default=False)
parameter to the model. If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs, based on the idea of Platt scaling. The form of the transformation is similar to a logistic function, as you pointed out, except that two specific constants A and B are learned in a post-processing step. Also see this stackoverflow post for more details.
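For example (a minimal sketch with placeholder data):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(kernel='linear', probability=True).fit(x_train, y_train)  # fits Platt scaling internally
print(svc.predict_proba(x_test)[:3])  # calibrated class probabilities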
I actually don't know why this post-processing is not available for LinearSVC. Otherwise, you would just call predict_proba(X) to get the probability estimate.
Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you can understand the underlying algorithm of Platt scaling, you can probably write your own or contribute to the scikit-learn svm family. :) Also feel free to use the above four SVM variations that support predict_proba.
If you want speed, then just replace the SVM with sklearn.linear_model.LogisticRegression. That uses the exact same training algorithm as LinearSVC, but with log-loss instead of hinge loss.
Using [1 / (1 + exp(-x))] will produce probabilities, in a formal sense (numbers between zero and one), but they won't adhere to any justifiable probability model.
If what your really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation.
Just as an extension for binary classification with SVMs: you could also take a look at SGDClassifier, which by default performs gradient descent with an SVM. For the estimation of binary probabilities it uses the modified Huber loss, via
(clip(decision_function(X), -1, 1) + 1) / 2
An example would look like:
from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(loss="modified_huber")
svm.fit(X_train, y_train)
proba = svm.predict_proba(X_test)