Sklearn RANSAC without intercept - python

I am trying to fit a linear model without an intercept (forcing the intercept to 0) using sklearn's RANSAC (RANdom SAmple Consensus) algorithm. In LinearRegression one can easily set fit_intercept=False. However, this option does not seem to exist in RANSAC's list of possible parameters. Is this functionality not implemented? How should one do it? What alternatives to sklearn's RANSAC objectively select inliers and outliers while allowing the intercept to be fixed at 0?
The implementation should look like this, but it raises an error:
from sklearn.linear_model import RANSACRegressor
ransac_regressor = RANSACRegressor(fit_intercept=False)

RANSAC is a wrapper around other linear regressors that implements random sample consensus, so you can simply pass a base_estimator with fit_intercept=False:
from sklearn.linear_model import RANSACRegressor, LinearRegression
ransac_lm = RANSACRegressor(base_estimator=LinearRegression(fit_intercept=False))
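For context, here is a minimal sketch on made-up data showing that the fitted line really has a zero intercept (note that recent scikit-learn releases rename base_estimator to estimator, so the base model is passed positionally below to work with either name):
import numpy as np
from sklearn.linear_model import RANSACRegressor, LinearRegression

# Toy data: a zero-intercept line plus a few gross outliers.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=100)
y[:5] += 50  # turn the first few points into outliers

ransac_lm = RANSACRegressor(LinearRegression(fit_intercept=False))
ransac_lm.fit(X, y)

print(ransac_lm.estimator_.coef_)       # slope, close to 3
print(ransac_lm.estimator_.intercept_)  # 0.0, because fit_intercept=False
print(ransac_lm.inlier_mask_.sum())     # number of points kept as inliers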

Related

Why does the Lasso regression intercept differ when the intercept is included in the data versus fitted by the model itself?

So I am trying to fit a Lasso Regression model with Polynomial Features.
To create these Polynomial Features, I am using from sklearn.preprocessing import PolynomialFeatures
It has a parameter include_bias which can be set to True or False, as in PolynomialFeatures(degree=5, include_bias=False). This bias is nothing but the column of ones that represents the intercept in the final equation, as explained in detail in another answer here.
The issue arises when I either set include_bias=True and then fit the Lasso regression without an intercept (since it has already been taken care of), or choose not to include the bias and set fit_intercept=True in from sklearn.linear_model import Lasso. They should technically give the same result for the intercept coefficient, right? But it turns out they don't.
Is there some internal bug in scikit-learn, or what am I missing here? Let me know if anyone wants to try this on their own with some data; I'll share some.
When PolynomialFeatures.include_bias=True but Lasso.fit_intercept=False, the lasso model thinks the all-ones column is just another feature, and so the "intercept term" is penalized with the L1 penalty. When PolynomialFeatures.include_bias=False and Lasso.fit_intercept=True, the lasso model knows not to penalize the intercept that it is fitting.
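To see the effect concretely, here is a minimal sketch on made-up data (not the poster's dataset) comparing the penalized bias-column coefficient with the unpenalized fitted intercept:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 + 3.0 * x.ravel() + rng.normal(scale=0.1, size=200)

# Variant 1: the all-ones column is a regular feature, so its coefficient is penalized.
X_bias = PolynomialFeatures(degree=5, include_bias=True).fit_transform(x)
lasso_penalized = Lasso(alpha=0.1, fit_intercept=False).fit(X_bias, y)
print(lasso_penalized.coef_[0])        # "intercept" shrunk by the L1 penalty

# Variant 2: the intercept is fitted separately and left unpenalized.
X_nobias = PolynomialFeatures(degree=5, include_bias=False).fit_transform(x)
lasso_unpenalized = Lasso(alpha=0.1, fit_intercept=True).fit(X_nobias, y)
print(lasso_unpenalized.intercept_)    # unpenalized intercept; the two values differ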

Sklearn regression with clustered data

I'm trying to run a multinomial LogisticRegression in sklearn with a clustered dataset (that is, there is more than one observation per individual, where only some features change and others remain constant for each individual).
I am aware in statsmodels it is possible to account for this the following way:
mnl = MNLogit(x, y).fit(cov_type="cluster", cov_kwds={"groups": cluster_groups})
Is there a way to replicate this with the sklearn package instead?
In order to run a multinomial Logistic Regression in sklearn, you can use the LogisticRegression class and set the parameter multi_class to "multinomial".
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
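As a minimal sketch (X and y below are placeholders for your design matrix and labels, not the question's data):
from sklearn.linear_model import LogisticRegression

# multi_class="multinomial" fits a single softmax model over all classes;
# lbfgs is one of the solvers that supports it.
clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
clf.fit(X, y)
class_probabilities = clf.predict_proba(X)  # one column per class
Note, however, that scikit-learn does not report coefficient standard errors, so there is no direct counterpart to statsmodels' cov_type="cluster" option.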

What is .linear_model in sklearn.linear_model

I want to know what is the meaning of .linear_model in the following code -
from sklearn.linear_model import LogisticRegression
My understanding is that sklearn is the library/module (both mean the same thing) and LogisticRegression is a class inside this module.
But I'm not able to understand what .linear_model means?
linear_model is a module. sklearn is a package. A package is basically a module that contains other modules.
linear_model is a submodule of the sklearn package; it contains the classes and functions for performing machine learning with linear models.
The term linear model implies that the model is specified as a linear combination of features. Based on training data, the learning process computes one weight for each feature to form a model that can predict or estimate the target value.
It includes linear regression and classification, ridge regression and classification, Lasso, multi-task Lasso, and more.
Check the sklearn doc for further details.
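To make the dotted path concrete, both of the following refer to the same class; this is a generic sketch, not tied to any particular code from the question:
# Import the class directly from the linear_model submodule...
from sklearn.linear_model import LogisticRegression

# ...or import the submodule and reach the same class through it.
from sklearn import linear_model
model = linear_model.LogisticRegression()

print(LogisticRegression is linear_model.LogisticRegression)  # True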

Sklearn StandardScaler + ElasticNet with interpretable coefficients

In order to properly fit a regularized linear regression model like the Elastic Net, the independent variables have to be standardized first. However, the coefficients then have a different meaning. In order to extract the proper weights of such a model, do I need to calculate them manually with this equation:
b = b' * std_y/std_x
or is there already some built-in feature in sklearn?
Also: I don't think I can just use the normalize=True parameter, since I have dummy variables which should probably remain unscaled.
You can unstandardize using the mean and standard deviation. sklearn provides them after you use StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # or whatever you called it
# Dividing by the feature standard deviations puts the coefficients back on the
# original scale; the feature means only shift the intercept, not the slopes.
unstandardized_coefficients = model.coef_ / np.sqrt(ss.var_)
That would put them on the scale of the unstandardized data.
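As a sanity check, here is a minimal sketch on made-up data: with a very small penalty, the back-transformed coefficients land close to the true slopes (all names and values here are illustrative assumptions):
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -2.0], scale=[10.0, 0.5], size=(500, 2))
true_coefs = np.array([1.5, -4.0])
y = X @ true_coefs + rng.normal(scale=0.1, size=500)

ss = StandardScaler()
X_scaled = ss.fit_transform(X)

model = ElasticNet(alpha=1e-3, max_iter=10_000).fit(X_scaled, y)
print(model.coef_ / np.sqrt(ss.var_))  # approximately [1.5, -4.0]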
However, since you're using regularization, it becomes a biased estimator. There is a tradeoff between performance and interpretability when it comes to biased/unbiased estimators. This is more of a discussion for stats.stackexchange.com. There's a difference between an unbiased estimator and a low-MSE estimator. Read about biased estimators and interpretability here: When is a biased estimator preferable to an unbiased one?
tl;dr It doesn't make sense to do what you suggested.

How does LassoCV in scikit-learn partition data?

I am performing linear regression using the Lasso method in sklearn.
According to their guidance, and that which I have seen elsewhere, instead of simply conducting cross validation on all of the training data it is advised to split it up into more traditional training set / validation set partitions.
The Lasso is thus trained on the training set and then the hyperparameter alpha is tuned on the basis of cross-validation results on the validation set. Finally, the accepted model is used on the test set to give a realistic view of how it will perform in reality. Separating the concerns here is a preventative measure against overfitting.
Actual Question
Does LassoCV conform to the above protocol, or does it somehow train the model parameters and hyperparameters on the same data and/or during the same rounds of CV?
Thanks.
If you use sklearn.model_selection.cross_val_score with a sklearn.linear_model.LassoCV object, then you are performing nested cross-validation. cross_val_score will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.model_selection.KFold). The train set will be passed to the LassoCV, which itself performs another splitting of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LassoCV

# Toy data, just to make the snippet runnable.
X = np.random.randn(20, 10)
y = np.random.randn(len(X))

cv_outer = KFold(n_splits=5)  # outer loop: evaluates the whole tuning procedure
lasso = LassoCV(cv=3)         # cv=3 makes an inner KFold splitting with 3 folds
scores = cross_val_score(lasso, X, y, cv=cv_outer)
Answer: no, LassoCV will not do all the work for you; you have to use it in conjunction with cross_val_score to obtain what you want. At the same time, this is the reasonable way to implement such objects, since we may also be interested in only fitting a hyperparameter-optimized LassoCV without necessarily evaluating it on another held-out set.
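For that last point, fitting the LassoCV on its own (without the outer loop) simply returns the tuned model; a minimal sketch reusing the X and y from above:
lasso = LassoCV(cv=3).fit(X, y)
print(lasso.alpha_)  # penalty chosen by the inner cross-validation
print(lasso.coef_)   # coefficients refit on all of X with that alpha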
