I'm using scikit-learn's LassoCV function. During cross-validation, what scoring metric is being used by default?
I would like cross-validation to be based on "Mean squared error regression loss". Can one use this metric with LassoCV? One can specify a scoring metric for LogisticRegressionCV, so it may be possible with LassoCV too?
LassoCV uses R^2 as the scoring metric. From the docs:
By default, parameter search uses the score function of the estimator
to evaluate a parameter setting. These are the
sklearn.metrics.accuracy_score for classification and
sklearn.metrics.r2_score for regression.
To use an alternative scoring metric, such as mean squared error, you need to use GridSearchCV or RandomizedSearchCV (instead of LassoCV) and specify the scoring parameter as scoring='neg_mean_squared_error'. From the docs:
An alternative scoring function can be specified via the scoring
parameter to GridSearchCV, RandomizedSearchCV and many of the
specialized cross-validation tools described below.
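A minimal sketch of that approach, assuming synthetic data and an arbitrary grid of alpha values (both are made up for illustration). Note that sklearn reports the negated MSE so that higher is always better:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data, purely for illustration.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# The candidate alphas are arbitrary choices for this sketch.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mse = -search.best_score_  # flip the sign to recover the mean CV MSE
```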
I think the accepted answer is wrong, as it quotes the documentation of Grid Search, but LassoCV uses regularisation paths, not grid search.
In fact in the docs page for LassoCV, it says that the loss function is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Meaning that it's minimising the MSE (plus the LASSO term).
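One way to check which criterion picks alpha is to inspect the fitted model's mse_path_ attribute, which stores the held-out mean squared error for each alpha on each fold. A small sketch on synthetic data (the dataset is an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
model = LassoCV(cv=5, random_state=0).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds): test-fold MSE per alpha and fold.
mean_mse = model.mse_path_.mean(axis=1)

# The alpha with the lowest fold-averaged MSE should match model.alpha_.
alpha_from_mse = model.alphas_[np.argmin(mean_mse)]
```

If alpha_from_mse agrees with model.alpha_, the selection is driven by cross-validated MSE rather than R^2.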
I have a question about this tutorial.
The author is doing hyperparameter tuning. The first window shows different values of the hyperparameters.
Then he initializes GridSearchCV with cv=3 and scoring='roc_auc'.
Then he fits GridSearchCV, passing eval_set and eval_metric='auc'.
What is the purpose of using both cv and eval_set? Shouldn't we use just one of them? How do they interact with scoring='roc_auc' and eval_metric='auc'?
Is there a better way to do hyperparameter tuning with GridSearchCV? Please suggest one or provide a link.
GridSearchCV performs cv for hyperparameter tuning using only training data. Since refit=True by default, the best fit is then validated on the eval set provided (a true test score).
You can use any metric to perform cv and testing. However, it would be odd to use different metrics for the cv hyperparameter optimization and testing phases, so the same metric is used. If you are wondering about the slightly different metric naming, I think it's just because xgboost is a sklearn-interface-compliant package, but it's not developed by the same people as sklearn. Both should compute the same thing (the area under the receiver operating characteristic curve of the predictions). Take a look at the sklearn docs: auc and roc_auc.
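On the sklearn side, the two routes to this number can be compared directly. A small sketch with made-up labels and scores (it does not involve xgboost itself):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and predicted scores, invented for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Route 1: the scorer behind scoring='roc_auc'.
direct = roc_auc_score(y_true, y_score)

# Route 2: build the ROC curve, then integrate it with the generic auc helper.
fpr, tpr, _ = roc_curve(y_true, y_score)
via_curve = auc(fpr, tpr)
```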
I don't think there is a better way.
There is a .score() function for the classifiers that sklearn provides, like LogisticRegression, DecisionTreeClassifier, etc. Does this score function return the score on the basis of the accuracy of its predictions? If yes, then what about the cases where accuracy might not be the best metric to evaluate the performance of the model? Is the score function self-adjusting according to the use case?
Yes, as you can see from the documentation of LogisticRegression and DecisionTreeClassifier, the score method returns the "Mean accuracy of self.predict(X) wrt. y.". So, it does indeed return the accuracy of the predictions.
In cases where you want to use other metrics to evaluate a model's performance, you can use the metrics provided in the scikit-learn library which you can find on the scikit-learn's website.
An example would be using F1 as a metric. You can have your true values y_true and your predicted values y_pred, then calling f1_score(y_true, y_pred) to get the F1 result.
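A short sketch with made-up labels, chosen so that accuracy and F1 disagree (the class imbalance makes accuracy look better than F1):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced labels, invented for illustration.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)  # 5 of 6 predictions are correct
f1 = f1_score(y_true, y_pred)         # precision 1.0, recall 0.5 on the positive class
```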
In order to properly fit a regularized linear regression model like the Elastic Net, the independent variables have to be standardized first. However, the coefficients then have a different meaning. In order to extract the proper weights of such a model, do I need to calculate them manually with this equation:
b = b' * std_y/std_x
or is there already some built-in feature in sklearn?
Also: I don't think I can just use normalize=True parameter, since I have dummy variables which should probably remain unscaled
You can unstandardize using the mean and standard deviation. sklearn provides them after you use StandardScaler.
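import numpy as np
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train_std = ss.fit_transform(X_train) # or whatever you called it
# ... fit model on X_train_std; this assumes y was left on its original scale ...
# Each coefficient is divided by its feature's standard deviation (ss.scale_)
unstandardized_coefficients = model.coef_ / ss.scale_
unstandardized_intercept = model.intercept_ - np.sum(model.coef_ * ss.mean_ / ss.scale_)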
That would put them on the scale of the unstandardized data.
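As a sanity check, here is a self-contained sketch on synthetic data, assuming only the features were standardized and y was left on its original scale. Back-transformed coefficients should reproduce the model's predictions when applied to the raw features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic features with deliberately different scales and offsets.
X, y = make_regression(n_samples=100, n_features=4, noise=1.0, random_state=0)
X = X * np.array([1.0, 10.0, 0.5, 3.0]) + np.array([0.0, 5.0, -2.0, 1.0])

ss = StandardScaler()
X_std = ss.fit_transform(X)
model = ElasticNet(alpha=0.1).fit(X_std, y)

# Undo the scaling: divide by each feature's std, then fix up the intercept.
coef_orig = model.coef_ / ss.scale_
intercept_orig = model.intercept_ - np.sum(model.coef_ * ss.mean_ / ss.scale_)

pred_std = model.predict(X_std)              # predictions on standardized inputs
pred_orig = X @ coef_orig + intercept_orig   # same predictions from raw inputs
```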
However, since you're using regularization, it becomes a biased estimator. There is a tradeoff between performance and interpretability when it comes to biased/unbiased estimators. This is more a discussion for stats.stackexchange.com. There's a difference between an unbiased estimator and a low MSE estimator. Read about biased estimators and interpretability here: When is a biased estimator preferable to unbiased one?.
tl;dr It doesn't make sense to do what you suggested.
I use GridSearchCV to find the best parameters in the inner loop of my nested cross-validation. The 'inner winner' is found using GridSearchCV(scoring='balanced_accuracy'), so as I understand the documentation, the model with the highest balanced accuracy on average in the inner folds is the 'best_estimator'. I don't understand what the different arguments for refit in GridSearchCV do in combination with the scoring argument. If refit is True, what scoring function will be used to estimate the performance of that 'inner winner' when refitted to the dataset? The same scoring function that was passed to scoring (so in my case 'balanced_accuracy')? Why can you also pass a string to refit? Does that mean that you can use different functions for 1.) finding the 'inner winner' and 2.) estimating the performance of that 'inner winner' on the whole dataset?
When refit=True, sklearn uses the entire training set to refit the model. So, there is no test data left to estimate the performance using any scorer function.
If you use multiple scorers in GridSearchCV, maybe f1_score or precision along with your balanced_accuracy, sklearn needs to know which one of those scorers to use to find the "inner winner", as you say. For example, with KNN, f1_score might give the best result with K=5, but accuracy might be highest for K=10. There is no way for sklearn to know which value of the hyperparameter K is the best.
To resolve that, you can pass the name of one scorer as a string to refit to specify which of those scorers should ultimately decide the best hyperparameter. This best value will then be used to retrain or refit the model using the full dataset. So, when you've got just one scorer, as your case seems to be, you don't have to worry about this. Simply refit=True will suffice.
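A small sketch of the multi-scorer case, assuming synthetic data and an arbitrary grid of K values. Both metrics are recorded in cv_results_, but only the one named in refit decides the winner:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 10]},
    scoring=["balanced_accuracy", "f1"],  # both are computed on every fold
    refit="balanced_accuracy",            # this one picks best_params_
    cv=3,
)
search.fit(X, y)

mean_bal_acc = search.cv_results_["mean_test_balanced_accuracy"]
mean_f1 = search.cv_results_["mean_test_f1"]
best_k = search.best_params_["n_neighbors"]
```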
I use the linear SVM from scikit-learn (LinearSVC) for a binary classification problem. I understand that LinearSVC can give me the predicted labels and the decision scores, but I wanted probability estimates (confidence in the label). I want to continue using LinearSVC because of speed (as compared to sklearn.svm.SVC with a linear kernel). Is it reasonable to use a logistic function to convert the decision scores to probabilities?
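import sklearn.svm as suppmach
# Fit model (penalty='l1' requires dual=False in LinearSVC):
svmmodel = suppmach.LinearSVC(penalty='l1', C=1, dual=False)
svmmodel.fit(x_train, y_train)  # x_train, y_train: your training data
predicted_test = svmmodel.predict(x_test)
predicted_test_scores = svmmodel.decision_function(x_test)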
I want to check if it makes sense to obtain Probability estimates simply as [1 / (1 + exp(-x)) ] where x is the decision score.
Alternately, are there other options wrt classifiers that I can use to do this efficiently?
Thanks.
scikit-learn provides CalibratedClassifierCV, which can be used to solve this problem: it lets you add probability output to LinearSVC or any other classifier that implements the decision_function method:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
svm = LinearSVC()
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
The user guide has a nice section on that. By default CalibratedClassifierCV+LinearSVC will get you Platt scaling, but it also provides other options (the isotonic regression method), and it is not limited to SVM classifiers.
I took a look at the APIs in the sklearn.svm.* family. The classifiers, e.g.,
sklearn.svm.SVC
sklearn.svm.NuSVC
have a common interface that supplies a
probability: boolean, optional (default=False)
parameter to the model. (The regressors sklearn.svm.SVR and sklearn.svm.NuSVR do not take this parameter, since probability estimates only apply to classification.) If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs, based on the idea of Platt scaling. The form of the transformation is similar to a logistic function, as you pointed out; however, two specific constants A and B are learned in a post-processing step. Also see this stackoverflow post for more details.
I actually don't know why this post-processing is not available for LinearSVC. Otherwise, you would just call predict_proba(X) to get the probability estimate.
Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you can understand the underlying algorithm of Platt scaling, you can probably write your own or contribute to the scikit-learn svm family. :) Also feel free to use the two SVM classifiers above, which support predict_proba.
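To make the idea of learning A and B concrete, here is a simplified sketch: fit a one-dimensional logistic regression on held-out LinearSVC decision scores. This is an approximation for illustration, not libsvm's actual implementation (Platt's original method also smooths the targets), and the data split and variable names are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Step 1: train the SVM on its own portion of the data.
svm = LinearSVC(max_iter=10000).fit(X_fit, y_fit)

# Step 2: learn a slope and offset on held-out decision scores.
# A large C approximates an unregularized logistic fit.
scores_cal = svm.decision_function(X_cal).reshape(-1, 1)
platt = LogisticRegression(C=1e6).fit(scores_cal, y_cal)

# Step 3: map new decision scores to calibrated probabilities.
scores_test = svm.decision_function(X_test).reshape(-1, 1)
proba_test = platt.predict_proba(scores_test)[:, 1]
```

The fitted platt.coef_ and platt.intercept_ play the role of A and B, up to sign convention. In practice CalibratedClassifierCV does this bookkeeping for you.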
If you want speed, then just replace the SVM with sklearn.linear_model.LogisticRegression. That uses the exact same training algorithm as LinearSVC, but with log-loss instead of hinge loss.
Using [1 / (1 + exp(-x))] will produce probabilities, in a formal sense (numbers between zero and one), but they won't adhere to any justifiable probability model.
If what you really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation.
Just as an extension for binary classification with SVMs: you could also take a look at SGDClassifier, which by default fits a linear SVM (hinge loss) using stochastic gradient descent. With the modified Huber loss, it estimates the binary probabilities as
(clip(decision_function(X), -1, 1) + 1) / 2
An example would look like:
from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(loss="modified_huber")
svm.fit(X_train, y_train)
proba = svm.predict_proba(X_test)
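To see that formula in action, the following sketch (on synthetic data, invented for illustration) compares predict_proba against the clipped decision function for the positive class:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = SGDClassifier(loss="modified_huber", random_state=0).fit(X, y)

# Probability of the positive class as reported by the model.
proba = clf.predict_proba(X)[:, 1]

# The same quantity computed by hand from the decision scores.
manual = (np.clip(clf.decision_function(X), -1, 1) + 1) / 2
```

Because of the clipping, any sample with a decision score beyond ±1 gets a probability of exactly 0 or 1, so these estimates are fairly coarse.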