Transpose input matrix before LinearRegression in sklearn - python

Here is my Python program:
import numpy as np
from sklearn import linear_model
X = np.array([[1, 2, 4]]).T**2
y = np.array([1, 4, 16])
model = linear_model.LinearRegression()
model.fit(X, y)
print('Coefficients: \n', model.coef_)
As a result I get:
Coefficients:
[1.]
This is the first program I have tried with sklearn.
My question is: why do I have to use .T**2 in the third instruction?
Without
.T**2
I get these errors: https://imgur.com/a/XWzJx0f
I am using http://jupyter.org/try

As the documentation says, you have to pass a matrix of shape (n_samples, n_features), here (3, 1). So your input X in the form [[1, 2, 4]] needs the inner vector in a vertical position.
After .T**2:
array([[ 1],
       [ 4],
       [16]])
This is what happens under the hood: https://machinelearningmastery.com/solve-linear-regression-using-linear-algebra/
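For a sense of what happens under the hood, here is a minimal sketch of the ordinary least squares solution via the normal equations, which is essentially what LinearRegression computes (a hypothetical standalone example; np.linalg.lstsq is the numerically safer route in practice):
import numpy as np

X = np.array([[1, 2, 4]]).T**2          # shape (3, 1): 3 samples, 1 feature
y = np.array([1, 4, 16])

# Prepend an intercept column, then solve the normal equations (X'X) beta = X'y.
Xb = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(beta)  # [intercept, slope]; here [0., 1.] since y equals x exactly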

You have to match X and y in the same dimensions (the same number of training samples).
If you do not use the transpose, you have 1 training sample [1, 2, 4] but 3 labels, which does not match.
If you use the transpose, you have 3 samples [1], [2], [4], which match the 3 labels.
The **2 does not matter for the shapes.

The initial shape of matrix X is (1, 3). You need to pass the matrix in the form (3, 1), as the documentation says and as mentioned in the answer by Alessandro.
The **2 part just squares each element of matrix X. You can run it without that part; the coefficient will differ then. Currently, after squaring, the (X, y) pairs are (1, 1), (4, 4) and (16, 16), so the coefficient (the slope m of the equation y = mx + c, if you plot these points) is 1. If you don't square, the coefficient will differ accordingly, as the sketch below shows.
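To see this concretely, here is a minimal sketch fitting the same data without the squaring (the transpose is still required to get the (3, 1) shape):
import numpy as np
from sklearn import linear_model

X = np.array([[1, 2, 4]]).T   # still (3, 1), but the values are not squared
y = np.array([1, 4, 16])

model = linear_model.LinearRegression().fit(X, y)
print(model.coef_)  # no longer [1.]; roughly [5.14] for these points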

Related

Why does fit_transform() always give me zeros?

I'm wondering why the following:
sklearn.preprocessing.StandardScaler().fit_transform([[58,144000]])
gives this result:
array([[0., 0.]])
I'm doing a logistic regression where I run fit_transform() on an array of values (the actual data file) like the ones above, and that transform seems to work fine. But when I try a single pair of values as shown above ([[58, 144000]]), I get zeros.
For predictions using a "new" input, I need to scale that new value the same way the test/train data were scaled so my ML predictions will work.
Thanks for suggestions!
If you read the docs, you may wonder why it expects a 2D array, since you can compute the mean and standard deviation of a 1D vector, as your question implies. The answer is that it expects data shaped as (samples, features).
So when you pass data like [[58, 144000]], it is a (1, 2) array, which means 1 sample with 2 features. It then fit-transforms each feature, each of which is a single number, and standardizing a single number gives zero: [[0., 0.]].
On the other hand, if you pass the data as [[58], [144000]], it is (2, 1), which means 2 samples and 1 feature. It then standardizes that one feature and gives the result you probably expected: [[-1.], [1.]].
import numpy as np
from sklearn.preprocessing import StandardScaler

x = [58, 144000]
mu = np.mean(x)
sigma = np.std(x)
print([(58 - mu) / sigma, (144000 - mu) / sigma])        # [-1.0, 1.0]
print(StandardScaler().fit_transform([[58], [144000]]))  # [[-1.] [ 1.]]
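As for scaling a "new" input the same way as the training data: fit the scaler once on the training set, keep it, and call transform() (not fit_transform()) on new samples. A minimal sketch with made-up training data:
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[25, 50000], [35, 80000], [58, 144000]])  # made-up training set
scaler = StandardScaler().fit(X_train)   # learn mean/std from the training data only

X_new = np.array([[58, 144000]])         # one new sample, shape (1, 2)
print(scaler.transform(X_new))           # scaled with the training statistics, not zeros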

Element-wise multiplication between vector and scalar with two matrices

I have run a classification experiment with 2 classifiers on a dataset with 2 classes and 150 samples. The classifiers are scikit-learn objects with a predict_proba() method. This method returns an array of shape (samples, classes) with the probability distribution for each sample. I also computed another matrix G with shape (samples, 2) which contains the "importance" of each classifier for each sample.
The final output must be a linear combination of each predict_proba() row and the corresponding scalars in G. Example with a single sample:
G = np.array([0.3, 0.7])
classifier_1_proba = np.array([0.6, 0.4])
classifier_2_proba = np.array([0.2, 0.8])
Y = classifier_1_proba * G[0] + classifier_2_proba * G[1]
This is easy with just one sample/output, but I don't know how it could be done with multiple samples (e.g. an entire test set).
I think this would work for you:
Y = c1_proba * G[:, 0, None] + c2_proba * G[:, 1, None]
This assumes the classifier probability matrices c1_proba, c2_proba and the weights G are all 2D numpy arrays, as you described.
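For example, a minimal sketch with two samples, reusing the numbers from the question for the first row:
import numpy as np

# Rows are samples, columns are class probabilities.
c1_proba = np.array([[0.6, 0.4],
                     [0.1, 0.9]])
c2_proba = np.array([[0.2, 0.8],
                     [0.5, 0.5]])
G = np.array([[0.3, 0.7],   # per-sample weights of classifier 1 and classifier 2
              [0.5, 0.5]])

# None keeps a trailing axis, so each (samples,) weight column
# broadcasts across the class dimension.
Y = c1_proba * G[:, 0, None] + c2_proba * G[:, 1, None]
print(Y)  # first row matches the single-sample example: [0.32, 0.68]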

numpy covariance between each column of a matrix and a vector

Based on this post, I can get the covariance between two vectors using np.cov((x, y), rowvar=0). I have an M by N matrix and an M by 1 vector. I want to find the covariance between each column of the matrix and the given vector. I know I can use a for loop to do it. I was wondering if I can somehow use np.cov() to get the result directly.
As Warren Weckesser said, numpy.cov(X, Y) is a poor fit for the job because it simply joins the arrays into one M by (N+1) array and computes the huge (N+1) by (N+1) covariance matrix. But we always have the definition of covariance, and it is easy to use:
A = np.sqrt(np.arange(12).reshape(3, 4)) # some 3 by 4 array
b = np.array([[2], [4], [5]]) # some 3 by 1 vector
cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0]-1)
This returns the covariances of each column of A with b.
array([[ 2.21895142, 1.53934466, 1.3379221 , 1.20866607]])
The formula I used is for sample covariance (which is what numpy.cov computes, too), hence the division by (b.shape[0] - 1). If you divide by b.shape[0] you get the unadjusted population covariance.
For comparison, the same computation using np.cov:
import numpy as np
A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])
np.cov(A, b, rowvar=False)[-1, :-1]  # last row holds cov(b, .); the slice drops cov(b, b)
Same output, but it takes about twice as long (and for large matrices, the difference will be much larger). The slicing at the end is needed because np.cov computes a 5 by 5 matrix, in which only the first 4 entries of the last row are what you want; the rest is the covariance of A with itself, or of b with itself.
Correlation coefficient
The correlation coefficient is obtained by dividing by the square roots of the variances. Watch out for the -1 adjustment mentioned earlier: numpy.var does not apply it by default; to make it happen you need the ddof=1 parameter.
corr = cov / np.sqrt(np.var(b, ddof=1) * np.var(A, axis=0, ddof=1))
Check that the output is the same as with the less efficient version:
np.corrcoef(A, b, rowvar=False)[-1, :-1]

what does the option normalize = True in Lasso sklearn do?

I have a matrix where each column has mean 0 and std 1:
In [67]: x_val.std(axis=0).min()
Out[67]: 0.99999999999999922
In [71]: x_val.std(axis=0).max()
Out[71]: 1.0000000000000007
In [72]: x_val.mean(axis=0).max()
Out[72]: 1.1990408665951691e-16
In [73]: x_val.mean(axis=0).min()
Out[73]: -9.7144514654701197e-17
The number of non-zero coefficients changes if I use the normalize option:
In [74]: l = Lasso(alpha=alpha_perc70).fit(x_val, y_val)
In [81]: sum(l.coef_ != 0)
Out[81]: 47
In [84]: l2 = Lasso(alpha=alpha_perc70, normalize=True).fit(x_val, y_val)
In [93]: sum(l2.coef_ != 0)
Out[93]: 3
It seems to me that normalize just sets the variance of each column to 1. It is strange that the results change so much, since my data already has variance 1.
So what does normalize=True actually do?
This is due to an (or a potential [1]) inconsistency in the concept of scaling in sklearn.linear_model.base.center_data: if normalize=True, it will divide by the norm of each column of the design matrix, not by the standard deviation. For what it's worth, the keyword normalize=True will be deprecated from sklearn version 0.17.
Solution: Do not use normalize=True. Instead, build a sklearn.pipeline.Pipeline and prepend a sklearn.preprocessing.StandardScaler to your Lasso object; that way you don't even need to perform your initial scaling. A sketch follows after the footnote.
Note that the data-loss term in the sklearn implementation of Lasso is scaled by n_samples. Thus the minimal penalty yielding an all-zero solution is alpha_max = np.abs(X.T.dot(y)).max() / n_samples (for normalize=False).
[1] I say potential inconsistency, because normalize is associated to the word norm and thus at least linguistically consistent :)
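Here is a minimal sketch of the recommended pipeline approach (made-up data; make_pipeline is just a shorthand for building the Pipeline by hand):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
X = rng.randn(20, 10)   # raw, unscaled features
y = rng.randn(20)

# StandardScaler runs before Lasso on every fit, so no manual pre-scaling is needed.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)
print(model.named_steps['lasso'].coef_)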
[Stop reading here if you don't want the details]
Here is some copy-and-pasteable code reproducing the problem:
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
n_samples, n_features, n_active_vars = 20, 10, 5
X = rng.randn(n_samples, n_features)
X = (X - X.mean(0)) / X.std(0)
beta = rng.randn(n_features)
beta[rng.permutation(n_features)[:n_active_vars]] = 0.
y = X.dot(beta)
print(X.std(0))
print(X.mean(0))

lasso1 = Lasso(alpha=.1)
print(lasso1.fit(X, y).coef_)
lasso2 = Lasso(alpha=.1, normalize=True)
print(lasso2.fit(X, y).coef_)
In order to understand what is going on, now observe that
lasso1.fit(X / np.sqrt(n_samples), y).coef_ / np.sqrt(n_samples)
is equal to
lasso2.fit(X, y).coef_
Hence, scaling the design matrix and appropriately rescaling the coefficients by np.sqrt(n_samples) converts one model into the other. This can also be achieved by acting on the penalty: a lasso estimator with normalize=True and its penalty scaled down by np.sqrt(n_samples) behaves like a lasso estimator with normalize=False (on your type of data, i.e. already standardized to std=1).
lasso3 = Lasso(alpha=.1 / np.sqrt(n_samples), normalize=True)
print(lasso3.fit(X, y).coef_)  # yields the same coefficients as lasso1.fit(X, y).coef_
I think the top answer is wrong...
In Lasso, if you set normalize=True, every column is divided by its L2 norm (i.e., sd * sqrt(n)) before a lasso regression is fit. The magnitude of the design matrix is thus reduced, and the coefficients needed to fit the same data become larger. The larger the coefficients, the stronger the L1 penalty. So the objective pays more attention to the L1 penalty and forces more coefficients to zero, and you see sparser solutions (β = 0) as a result.
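A quick numerical check of the sd * sqrt(n) identity for a centered column (using the ddof=0 standard deviation):
import numpy as np

rng = np.random.RandomState(0)
col = rng.randn(20)
col -= col.mean()                      # center the column

print(np.linalg.norm(col))             # L2 norm of the centered column
print(col.std() * np.sqrt(col.size))   # sd * sqrt(n): the same number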

Least-Squares Regression of Matrices with Numpy

I'm looking to calculate a least-squares linear regression from an N by M matrix and a set of known, ground-truth solutions in an N by 1 matrix. From there, I'd like to get the slope, intercept, and residual value of each regression. The basic idea: I know the actual value that should be predicted for each sample in a row of N, and I'd like to determine which set of predicted values in a column of M is most accurate, using the residuals.
I don't describe matrices well, so here's a drawing:
(N,M) matrix with predicted values for each row N
in each column of M...
##NOTE: Values of M and N are not actually 4 and 3, just examples
4 columns in "M"
[1, 1.1, 0.8, 1.3]
[2, 1.9, 2.2, 1.7] 3 rows in "N"
[3, 3.1, 2.8, 3.3]
(1,N) matrix with actual values of N
[1]
[2] Actual value of each sample N, in a single column
[3]
So again, for clarity's sake, I'm looking to calculate the lstsq regression between each column of the (N,M) matrix and the (1,N) matrix.
For instance, the regression between
[1] and [1]
[2] [2]
[3] [3]
then the regression between
[1] and [1.1]
[2] [1.9]
[3] [3.1]
and so on, outputting the slope, intercept, and standard error (average residual) for each regression calculated.
So far in the numpy/scipy documentation and around the 'net, I've only found examples that compute one column at a time. I thought numpy had the capability to compute regressions on each column in a set with the standard
np.linalg.lstsq(arrayA,arrayB)
But that returns the error
ValueError: array dimensions must agree except for d_0
Do I need to split the columns into their own arrays, then compute one at a time?
Is there a parameter or matrix operation I need to use to have numpy calculate the regressions on each column independently?
I feel like it should be simpler. I've looked all over, and I can't seem to find anyone doing something similar.
Maybe you switched A and b?
The following works for me:
A = np.random.rand(4) + np.arange(3)[:, None]
# A is now a (3, 4) array
b = np.arange(3)
np.linalg.lstsq(A, b, rcond=None)
The 0th dimension of arrayB must be the same as the 0th dimension of arrayA (see the official documentation of np.linalg.lstsq). You need matrices with dimensions (N, M) and (N, 1), or (N, M) and (N,), instead of the (N, M) and (1, N) matrices you're using now.
Note that the (N, 1) and (N,) variants give identical results; only the shapes of the output arrays differ.
I get a slightly different exception from yours, but that may be due to different versions (I am using Python 2.7, Numpy 1.6 on Windows):
>>> A = np.arange(12).reshape(3, 4)
>>> b = np.arange(3).reshape(1, 3)
>>> np.linalg.lstsq(A,b)
# This gives "LinAlgError: Incompatible dimensions" exception
>>> np.linalg.lstsq(A,b.T)
# This works, note that I am using the transpose of b here
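Putting it together for the original goal: np.linalg.lstsq accepts a 2-D b, so all M regressions can be solved in one call by stacking the actual values with a column of ones. A minimal sketch, assuming each prediction column is regressed on the actual values:
import numpy as np

preds = np.array([[1.0, 1.1, 0.8, 1.3],   # (N, M) predicted values
                  [2.0, 1.9, 2.2, 1.7],
                  [3.0, 3.1, 2.8, 3.3]])
actual = np.array([1.0, 2.0, 3.0])        # (N,) ground-truth values

# Design matrix [actual, 1] of shape (N, 2); lstsq fits one regression
# per column of preds.
design = np.column_stack([actual, np.ones_like(actual)])
coef, resid, rank, sv = np.linalg.lstsq(design, preds, rcond=None)

slopes, intercepts = coef   # each of shape (M,)
print(slopes)               # slope of each column's fit
print(intercepts)           # intercept of each column's fit
print(resid)                # sum of squared residuals per column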
