Why is intercept_ an array in sklearn linear regression? - python

In sklearn's LinearRegression, the intercept_ attribute is returned as an array rather than a scalar. Why is that?
Other types of regressors, for example HuberRegressor, return intercept_ as a scalar, so consistency across the API cannot be the reason.

I would rephrase your question as "why do some algorithms return intercept_ as a scalar value?"
If we are talking about linear models, one intercept (bias) is needed per target rather than per feature: LinearRegression supports multi-output regression, so when it is fitted on a 2-D y it stores intercept_ as an array with one entry per target (and a plain scalar when y is 1-D).
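A minimal sketch (my own illustration, not from the original answer) of how the shape of intercept_ follows the shape of y:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.randn(20, 3)
y_single = np.random.randn(20)      # one target
y_multi = np.random.randn(20, 2)    # two targets

print(LinearRegression().fit(X, y_single).intercept_)        # a single float
print(LinearRegression().fit(X, y_multi).intercept_.shape)   # (2,) -- one intercept per target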
In HuberRegressor the intercept is set explicitly to a scalar value:
if self.fit_intercept:
    self.intercept_ = parameters[-2]
else:
    self.intercept_ = 0.0
self.coef_ = parameters[:X.shape[1]]

Related

how initial bias value is chosen in sklearn logistic regression?

When training a logistic regression, the algorithm goes through an iterative process where, at each step, it calculates the weights of the x variables and a bias value that minimize the loss function.
From the official sklearn code (class LogisticRegression | linear model in scikit-learn), the logistic regression class's fit method is as follows:
def fit(self, X, y, sample_weight=None):
    """
    Fit the model according to the given training data.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like of shape (n_samples,)
        Target vector relative to X.

    sample_weight : array-like of shape (n_samples,) default=None
        Array of weights that are assigned to individual samples.
        If not provided, then each sample is given unit weight.

        .. versionadded:: 0.17
           *sample_weight* support to LogisticRegression.
I am guessing sample_weight is the weight of the x variables, which are set to 1 if not given. Is the bias value also 1?
You sound somewhat confused, perhaps looking for an analogy here with the weights & biases of a neural network. But this is not the case; sample_weight here has nothing to do with the weights of a neural network, even as a concept.
sample_weight is there so that, if the (business) problem requires so, we can give more weight (i.e. more importance) to some samples compared with others, and this importance directly affects the loss. It is sometimes used in cases of imbalanced data; quoting from the Tips on practical use section of the documentation (it is about decision trees, but the rationale is the same):
Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
and from a relevant thread at Cross Validated:
Sample weights are used to increase the importance of a single data-point (let's say, some of your data is more trustworthy, then they receive a higher weight). So: The sample weights exist to change the importance of data-points
You can see a practical demonstration of how changing the weight of some samples changes the final model in the SO thread What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn? (again, it is about decision trees, but the rationale is the same).
Having clarified that, it should now be apparent that there is no room here for any kind of "bias" parameter whatsoever. In fact, the introductory paragraph in your question is wrong: logistic regression does not compute such weights and biases; it returns coefficients and an intercept term (sometimes itself called bias), and these coefficients & intercept have nothing to do with sample_weight.
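To make the distinction concrete, here is a small sketch (my own, not part of the original answer): sample_weight is an optional fit-time argument, while the learned coefficients and intercept are attributes of the fitted model:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Per-sample importance in the loss; defaults to 1 for every sample.
weights = np.array([1.0, 1.0, 5.0, 1.0])

clf = LogisticRegression().fit(X, y, sample_weight=weights)
print(clf.coef_)       # learned coefficients of the x variables
print(clf.intercept_)  # learned intercept ("bias"); unrelated to sample_weight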

Python PLSRegression : obtaining the latent variables scores using loadings

In sklearn.cross_decomposition.PLSRegression, we can obtain the latent variable scores for the X array using x_scores_.
I would like to extract the loadings to calculate the latent variable scores for a new array W. Intuitively, what I would do is: scores = W * loadings (matrix multiplication). I tried this using x_loadings_, x_weights_, and x_rotations_ as the loadings, as I could not figure out which array was the right one (there is little info on the sklearn website). I also tried to standardize W (subtracting the mean and dividing by the standard deviation of X) before multiplying by the loadings. But none of these works: even when I use the X array itself I cannot obtain the same scores as in the x_scores_ array.
Any help with this?
Actually, I just had to better understand the fit() and transform() methods of scikit-learn. I need to use transform(W) to obtain the latent variable scores for the W array:
1. fit(): generates the learned model parameters from the training data.
2. transform(): uses the parameters generated by fit() to transform a particular dataset.
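A short sketch (my own, assuming the default scaling) showing that transform() reproduces x_scores_ on the training data and can then be applied to a new array W with the same number of columns:
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
Y = X[:, :2] @ rng.randn(2, 1) + 0.1 * rng.randn(50, 1)

pls = PLSRegression(n_components=2).fit(X, Y)

# On the training data, transform() gives back the stored scores.
print(np.allclose(pls.transform(X), pls.x_scores_))   # True

# For a new array W (same number of columns as X):
W = rng.randn(10, 5)
W_scores = pls.transform(W)    # shape (10, 2)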

Third parameter (kwargs) sklearn fit() function

I'm quite new to scikit-learn and I have a question about the fit() function. I tried to look for information on the internet but couldn't find much.
In an assignment I have to create a dict of parameters that gets passed to the fit function of a classifier, which means the function will take 3 arguments (X, y, **kwargs). What parameters is this dictionary supposed to contain? Apparently they are hyperparameters for the fit function. Online I only found information for XGBoost, but I'm not supposed to use that, only classifiers from sklearn.
I also found online that fit can take a dictionary called **fit_params, but there is nothing about which parameters the function might take.
I hope my question is clear, thanks a lot in advance!
The model hyperparameters are not arguments of the fit function; they are arguments of the model class itself, which you need to instantiate beforehand.
If you have a dictionary with parameters that you want to pass to your model, you need to do things this way (here with a Logistic Regression):
from sklearn.linear_model import LogisticRegression
params = {"C":10, "max_iter":200}
LR = LogisticRegression(**params)
Now that you have created the model specifying the hyperparameters, you can proceed and fit it with your data.
LR.fit(X, y)
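If the assignment instead means a dictionary that is forwarded to fit() itself (often called fit_params), the usual fit-time arguments are things like sample_weight. A hedged sketch, assuming an SVC classifier:
import numpy as np
from sklearn import svm

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Fit-time keyword arguments, unpacked as clf.fit(X, y, **fit_params)
fit_params = {"sample_weight": np.array([1.0, 1.0, 2.0, 1.0])}

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X, y, **fit_params)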
I haven't used scikit-learn before, but you can get the docs of a function that you are unsure about by reading its __doc__ attribute. The fit() method of an SVC estimator, for example, has this in its __doc__:
Fit the SVM model according to the given training data.

Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
    Training vectors, where n_samples is the number of samples
    and n_features is the number of features.
    For kernel="precomputed", the expected shape of X is
    (n_samples, n_samples).

y : array-like of shape (n_samples,)
    Target values (class labels in classification, real numbers in
    regression)

sample_weight : array-like of shape (n_samples,), default=None
    Per-sample weights. Rescale C per sample. Higher weights
    force the classifier to put more emphasis on these points.

Returns
-------
self : object

Notes
-----
If X and y are not C-ordered and contiguous arrays of np.float64 and
X is not a scipy.sparse.csr_matrix, X and/or y may be copied.

If X is a dense array, then the other methods will not support sparse
matrices as input.
I ran this to get that output:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
print(clf.fit.__doc__)

Linear Discriminant Analysis transform function

# assuming: from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
x = data.values
y = target.values
lda = LDA(solver='eigen', shrinkage='auto', n_components=2)
df_lda = lda.fit(x, y).transform(x)
df_lda.shape
This is a small part of the code. I am trying to reduce the dimensionality to the most discriminative directions. To my understanding, the transform() function projects the data to maximize class separation for my data set and should return an array of shape (n_samples, n_components).
But my df_lda has shape (614, 1).
What am I missing here? Or is my data not linearly separable?
For K distinct classes in target.values, there are at most K-1 components in the transformed data (without further dimensionality reduction). Since you only have two classes in your data set, there is only one transformed component, and you cannot get more components than that.
I suppose it might be helpful for sklearn to issue a warning when you request more components than are available.
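A quick sketch (my own illustration) of how the number of available components follows the number of classes:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = rng.randn(30, 4)

y_two = np.array([0, 1] * 15)        # 2 classes -> at most 1 component
lda = LinearDiscriminantAnalysis(n_components=1)
print(lda.fit(X, y_two).transform(X).shape)    # (30, 1)

y_three = np.array([0, 1, 2] * 10)   # 3 classes -> up to 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
print(lda.fit(X, y_three).transform(X).shape)  # (30, 2)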

how to use sklearn when target variable is a proportion

There are standard ways of predicting proportions such as logistic regression (without thresholding) and beta regression. There have already been discussions about this:
http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression
I cannot tell if there exists a work-around within the sklearn framework.
There exists a workaround, but it is not intrinsically within the sklearn framework.
If you have a proportional target variable (value range 0-1) you run into two basic difficulties with scikit-learn:
Classifiers (such as logistic regression) only deal with class labels as target variables. As a workaround you could threshold your proportions to 0/1 and interpret them as class labels, but you would lose a lot of information.
Regression models (such as linear regression) do not restrict the target variable. You can train them on proportional data, but there is no guarantee that the output on unseen data will stay within the 0-1 range. However, in this situation there is a powerful workaround (below).
There are different ways to mathematically formulate logistic regression. One of them is the generalized linear model, which basically defines the logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.
In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -oo to oo and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed again to result in probabilities p = 1 / (exp(-y) + 1).
import numpy as np
from sklearn.linear_model import LinearRegression


class LogitRegression(LinearRegression):

    def fit(self, x, p):
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)


if __name__ == '__main__':
    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5

    model = LogitRegression()
    model.fit(x, p)
    print(model.predict([[-10], [0.0], [1]]))
    # [[ 2.06115362e-09]
    #  [ 5.00000000e-01]
    #  [ 8.80797078e-01]]
There are also numerous other alternatives. Some non-linear regression models work naturally in the 0-1 range. For example, Random Forest regressors will never exceed the range of target values they were trained on. Simply put probabilities in and you will get probabilities out. Neural networks with appropriate output activation functions (tanh, I guess) will also work well with probabilities, but if you want to use those, there are more specialized libraries than sklearn.
[*] You could in fact plug in any linear regression model which can make the method more powerful, but then it no longer is exactly equivalent to logistic regression.
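As a small illustration of the tree-based alternative mentioned above (my own sketch, not part of the original answer): a RandomForestRegressor's predictions are averages of training targets and therefore stay inside the training range:
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 1)
p = 1 / (1 + np.exp(-3 * X[:, 0]))        # proportions in (0, 1)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, p)
pred = rf.predict(np.linspace(-5, 5, 11).reshape(-1, 1))
print(pred.min() >= 0.0, pred.max() <= 1.0)   # True True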
