In order to properly fit a regularized linear regression model like the Elastic Net, the independent variables have to be standardized first. However, the coefficients then have a different meaning. To extract the proper weights of such a model, do I need to calculate them manually with this equation:
b = b' * std_y/std_x
or is there already some built-in feature in sklearn?
Also: I don't think I can just use the normalize=True parameter, since I have dummy variables which should probably remain unscaled.
You can unstandardize using the mean and standard deviation. sklearn provides them after you use StandardScaler.
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # or whatever you called it
# divide by the per-feature std (ss.scale_) to put the coefficients back on
# the original scale of X; the means only affect the intercept, not the coefficients
unstandardized_coefficients = model.coef_ / ss.scale_
That would put them on the scale of the unstandardized data.
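As a quick check, here is a minimal sketch (with made-up data, assuming only X is standardized, not y) verifying that the back-transformed coefficients and intercept reproduce the model's predictions on the original scale:
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(100, 4) * [1, 10, 100, 0.1]      # features on very different scales
y = X @ [1.0, 0.2, 0.01, 5.0] + rng.randn(100)

ss = StandardScaler()
X_std = ss.fit_transform(X)
model = ElasticNet(alpha=0.1).fit(X_std, y)

coef_orig = model.coef_ / ss.scale_                              # back-transformed weights
intercept_orig = model.intercept_ - np.dot(ss.mean_, coef_orig)  # means go into the intercept
print(np.allclose(X @ coef_orig + intercept_orig, model.predict(X_std)))  # True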
However, since you're using regularization, the estimator is biased. There is a tradeoff between performance and interpretability when it comes to biased/unbiased estimators; this is more of a discussion for stats.stackexchange.com. There's a difference between an unbiased estimator and a low-MSE estimator. Read about biased estimators and interpretability here: When is a biased estimator preferable to an unbiased one?
tl;dr It doesn't make sense to do what you suggested.
Related
I'm trying to understand the difference between RidgeClassifier and LogisticRegression in sklearn.linear_model. I couldn't find it in the documentation.
I think I understand quite well what LogisticRegression does. It computes the coefficients and intercept that minimise half the sum of squares of the coefficients plus C times the binary cross-entropy loss, where C is the regularisation parameter. I checked against a naive implementation from scratch, and the results coincide.
The results of RidgeClassifier differ, and I couldn't figure out how the coefficients and intercept are computed there. Looking at the GitHub code, I'm not experienced enough to untangle it.
The reason I'm asking is that I like the RidgeClassifier results -- it generalises a bit better on my problem. But before I use it, I would like to at least have an idea of where it comes from.
Thanks for any help.
RidgeClassifier() works differently compared to LogisticRegression() with l2 penalty. The loss function for RidgeClassifier() is not cross entropy.
RidgeClassifier() uses Ridge() regression model in the following way to create a classifier:
Let us consider binary classification for simplicity.
Convert the target variable into +1 or -1 based on the class it belongs to.
Build a Ridge() model (which is a regression model) to predict our target variable. The loss function is MSE + l2 penalty
If the Ridge() regression's predicted value (as given by decision_function()) is greater than 0, predict the positive class; otherwise predict the negative class.
For multi-class classification:
Use LabelBinarizer() to create a multi-output regression scenario, and then train independent Ridge() regression models, one for each class (One-Vs-Rest modelling).
Get prediction from each class's Ridge() regression model (a real number for each class) and then use argmax to predict the class.
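A minimal sketch (with toy data and alpha chosen only for illustration) of the binary case described above: reproducing RidgeClassifier()'s decision function with a plain Ridge() regression on +1/-1 targets. The comparisons are expected to hold under these assumptions:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, RidgeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = RidgeClassifier(alpha=1.0).fit(X, y)

y_pm1 = np.where(y == 1, 1.0, -1.0)             # targets encoded as +1 / -1
reg = Ridge(alpha=1.0).fit(X, y_pm1)            # plain regression on those targets
manual_pred = (reg.predict(X) > 0).astype(int)  # threshold the regression output at 0

print(np.allclose(clf.decision_function(X), reg.predict(X)))  # expected: True
print((manual_pred == clf.predict(X)).all())                  # expected: True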
I used to believe that scikit-learn's Logistic Regression classifier (as well as SVM) automatically standardizes my data before training. What made me believe this is the regularization parameter C that is passed to the LogisticRegression constructor: applying regularization (as I understand it) doesn't make sense without feature scaling. For regularization to work properly, all the features should be on comparable scales. Therefore, I used to assume that when calling LogisticRegression.fit(X) on training data X, the fit method first performs feature scaling and then starts training. In order to test my assumption, I've decided to manually scale the features of X as follows:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)
Then I've initialized a LogisticRegression object with a regularization parameter C:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(C=10.0, random_state=0)
I've found out that training the model on X is not equivalent to training the model on X_std. That is to say, the model produced by
log_reg.fit(X_std, y)
is not similar to the model produced by
log_reg.fit(X, y)
Does that mean that scikit-learn doesn't standardize the features before training? Or maybe it does scale but by applying a different procedure? If scikit-learn doesn't perform feature scaling, how is it consistent with requiring the regularization parameter C? Should I manually standardize my data every time before fitting the model in order for regularization to make sense?
Based on the following note in http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, I'd assume that you need to preprocess the data yourself (e.g. with a scaler from sklearn.preprocessing):
solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’}
Algorithm to use in the optimization problem.
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ is faster for large ones.
For multiclass problems, only ‘newton-cg’ and ‘lbfgs’ handle multinomial loss; ‘sag’ and ‘liblinear’ are limited to one-versus-rest schemes.
‘newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty.
Note that ‘sag’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
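A minimal sketch (reusing the X and y from the question, with a hypothetical X_new for new samples) of preprocessing the data yourself via a Pipeline, so that the scaling learnt on the training data is reused at prediction time:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# the scaler is fit on X, then LogisticRegression trains on the scaled features
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=10.0, random_state=0))
pipe.fit(X, y)
y_pred = pipe.predict(X_new)  # X_new is scaled with the stored mean/std before predicting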
I use a linear SVM from scikit-learn (LinearSVC) for a binary classification problem. I understand that LinearSVC can give me the predicted labels and the decision scores, but I wanted probability estimates (confidence in the label). I want to continue using LinearSVC because of speed (compared to sklearn.svm.SVC with a linear kernel). Is it reasonable to use a logistic function to convert the decision scores to probabilities?
import sklearn.svm as suppmach
# Fit model (penalty='l1' requires dual=False with the default squared hinge loss):
svmmodel = suppmach.LinearSVC(penalty='l1', dual=False, C=1)
predicted_test = svmmodel.predict(x_test)
predicted_test_scores = svmmodel.decision_function(x_test)
I want to check if it makes sense to obtain probability estimates simply as 1 / (1 + exp(-x)), where x is the decision score.
Alternatively, are there other options with respect to classifiers that I can use to do this efficiently?
Thanks.
scikit-learn provides CalibratedClassifierCV, which can be used to solve this problem: it allows you to add probability output to LinearSVC or any other classifier which implements the decision_function method:
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
The user guide has a nice section on that. By default CalibratedClassifierCV + LinearSVC will get you Platt scaling, but it also provides other options (the isotonic regression method), and it is not limited to SVM classifiers.
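For example, switching to isotonic calibration is just a parameter change (a small sketch, reusing the training data from the snippet above):
clf_iso = CalibratedClassifierCV(LinearSVC(), method="isotonic")
clf_iso.fit(X_train, y_train)
y_proba_iso = clf_iso.predict_proba(X_test)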
I took a look at the APIs in the sklearn.svm.* family. All of the models below, e.g.,
sklearn.svm.SVC
sklearn.svm.NuSVC
sklearn.svm.SVR
sklearn.svm.NuSVR
have a common interface that supplies a
probability: boolean, optional (default=False)
parameter to the model. If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs based on the idea of Platt scaling. The form of the transformation is similar to a logistic function, as you pointed out; however, two specific constants A and B are learned in a post-processing step. Also see this stackoverflow post for more details.
I actually don't know why this post-processing is not available for LinearSVC. Otherwise, you would just call predict_proba(X) to get the probability estimate.
Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you can understand the underlying algorithm of Platt scaling, you can probably write your own or contribute to the scikit-learn SVM family. :) Also feel free to use the above four SVM variations that support predict_proba.
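As a small sketch (assuming the same X_train, y_train and X_test as in the earlier snippets), using one of those variations looks like this:
from sklearn.svm import SVC

# probability=True makes libsvm fit the Platt-scaling step internally (training is slower)
clf = SVC(kernel="linear", probability=True)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)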
If you want speed, then just replace the SVM with sklearn.linear_model.LogisticRegression. That uses the exact same training algorithm as LinearSVC, but with log-loss instead of hinge loss.
Using [1 / (1 + exp(-x))] will produce probabilities, in a formal sense (numbers between zero and one), but they won't adhere to any justifiable probability model.
If what you really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation.
Just as an extension for binary classification with SVMs: you could also take a look at SGDClassifier, which performs gradient descent with an SVM by default. For estimating the binary probabilities it uses the modified Huber loss, with probabilities computed as
(clip(decision_function(X), -1, 1) + 1) / 2
An example would look like:
from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(loss="modified_huber")
svm.fit(X_train, y_train)
proba = svm.predict_proba(X_test)
I am performing linear regression using the Lasso method in sklearn.
According to their guidance, and what I have seen elsewhere, instead of simply conducting cross-validation on all of the training data, it is advised to split it into more traditional training set / validation set partitions.
The Lasso is thus trained on the training set, and then the hyperparameter alpha is tuned based on cross-validation results from the validation set. Finally, the accepted model is used on the test set to give a realistic view of how it will perform in reality. Separating the concerns here is a preventative measure against overfitting.
Actual Question
Does LassoCV conform to the above protocol, or does it just somehow train the model parameters and hyperparameters on the same data and/or during the same rounds of CV?
Thanks.
If you use sklearn.cross_validation.cross_val_score with a sklearn.linear_model.LassoCV object, then you are performing nested cross-validation. cross_val_score will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.cross_validation.KFold). The train set will be passed to the LassoCV, which itself performs another splitting of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.
import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.linear_model import LassoCV
X = np.random.randn(20, 10)
y = np.random.randn(len(X))
cv_outer = KFold(len(X), n_folds=5)
lasso = LassoCV(cv=3) # cv=3 makes a KFold inner splitting with 3 folds
scores = cross_val_score(lasso, X, y, cv=cv_outer)
Answer: no, LassoCV will not do all the work for you; you have to use it in conjunction with cross_val_score to obtain what you want. This is also the reasonable way to implement such objects, since we may be interested in only fitting a hyperparameter-optimized LassoCV without necessarily evaluating it on another set of held-out data.
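For that latter use, a minimal sketch (reusing the toy X and y defined above), where only the inner cross-validation runs and the chosen penalty can be inspected afterwards:
lasso_alone = LassoCV(cv=3).fit(X, y)
print(lasso_alone.alpha_)  # penalty chosen by the inner 3-fold cross-validation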
The literature on machine learning strongly suggests normalization of data for SVM (Preprocessing data in scikit-learn). And as answered before, the same StandardScaler should be applied to both training and test data.
What are the advantages of using StandardScaler over manually subtracting the mean and dividing by the standard deviation (other than the ability to use it in a pipeline)?
LinearSVC in scikit-learn depends on one-vs-the-rest for multiple classes (as larsmans mentioned, SVC depends on one-vs-one for multi-class). So what would happen if I have multiple classes trained with a pipeline with normalization as the first estimator? Would it also calculate the mean and standard deviation of each class, and use them during classification?
To be more specific, does the following classifier apply different means and standard deviations to each class before the svm stage of the pipeline?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

estimators = [('normalize', StandardScaler()), ('svm', SVC(class_weight='auto'))]
clf = Pipeline(estimators)
# Training
clf.fit(X_train, y)
# Classification
clf.predict(X_test)
The feature scaling performed by StandardScaler is performed without reference to the target classes. It only considers the X feature matrix. It calculates the mean and standard deviation of each feature across all samples, irrespective of the target class of each sample.
Each component of the pipeline operates independently: only the data is passed between them. Let's expand the pipeline's clf.fit(X_train, y). It roughly does the following:
X_train_scaled = clf.named_steps['normalize'].fit_transform(X_train, y)
clf.named_steps['svm'].fit(X_train_scaled, y)
The first scaling step actually ignores the y it is passed, but calculates the mean and standard deviation of each feature in X_train and stores them in its mean_ and std_ attributes (the fit component). It also centers and scales X_train and returns it (the transform component). The next step learns an SVM model and does what is necessary for one-vs-rest.
Now consider the pipeline at classification time. clf.predict(X_test) expands to:
X_test_scaled = clf.named_steps['normalize'].transform(X_test)
y_pred = clf.named_steps['svm'].predict(X_test_scaled)
returning y_pred. In the first line it uses the stored mean_ and std_ to apply the transformation to X_test using parameters learnt from the training data.
Yes, the scaling algorithm isn't very complicated. It just subtracts the mean and divides by the std. But StandardScaler:
provides a name to the algorithm so you can pull it out of the library
saves you from rolling your own, ensuring it works correctly and not requiring you to understand what it's doing on the inside
remembers the parameters from a fit or fit_transform for later transform operations (as above)
provides the same interface as other data transformations (and hence can be used in a pipeline)
operates over dense or sparse matrices
is able to reverse the transformation with its inverse_transform method
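A minimal sketch (with made-up data) of a few of those points: the scaler matches the manual computation, remembers its fit parameters, and can reverse the transformation:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.randn(100, 3) * [1.0, 5.0, 0.1] + [0.0, 10.0, -2.0]

ss = StandardScaler().fit(X)   # stores the per-feature mean and scale
X_scaled = ss.transform(X)

manual = (X - X.mean(axis=0)) / X.std(axis=0)          # hand-rolled equivalent
print(np.allclose(X_scaled, manual))                   # True
print(np.allclose(ss.inverse_transform(X_scaled), X))  # inverse_transform undoes the scaling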