what does the option normalize = True in Lasso sklearn do?


I have a matrix where each column has mean 0 and std 1
In [67]: x_val.std(axis=0).min()
Out[70]: 0.99999999999999922
In [71]: x_val.std(axis=0).max()
Out[71]: 1.0000000000000007
In [72]: x_val.mean(axis=0).max()
Out[72]: 1.1990408665951691e-16
In [73]: x_val.mean(axis=0).min()
Out[73]: -9.7144514654701197e-17
The number of non 0 coefficients changes if I use the normalize option
In [74]: l = Lasso(alpha=alpha_perc70).fit(x_val, y_val)
In [81]: sum(l.coef_!=0)
Out[83]: 47
In [84]: l2 = Lasso(alpha=alpha_perc70, normalize=True).fit(x_val, y_val)
In [93]: sum(l2.coef_!=0)
Out[95]: 3
It seems to me that normalize just sets the variance of each column to 1. It is strange that the results change so much, since my data already has variance 1.
So what does normalize=True actually do?

This is due to an (or a potential [1]) inconsistency in the concept of scaling in sklearn.linear_model.base.center_data: if normalize=True, it divides each column of the design matrix by its norm, not by its standard deviation. For what it's worth, the keyword normalize=True was slated for deprecation as of sklearn version 0.17, and recent scikit-learn releases no longer accept it.
Solution: Do not use normalize=True. Instead, build a sklearn.pipeline.Pipeline and prepend a sklearn.preprocessing.StandardScaler to your Lasso object. That way you don't even need to perform your initial scaling.
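A minimal sketch of the pipeline suggested above. The synthetic data, the step names "scaler" and "lasso", and alpha=0.1 are assumptions chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, deliberately unscaled data (illustrative only)
rng = np.random.RandomState(0)
X = rng.randn(50, 8) * rng.uniform(1, 100, size=8) + 5.0
true_coef = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.0])
y = X.dot(true_coef / X.std(0)) + 0.1 * rng.randn(50)

pipe = Pipeline([
    ("scaler", StandardScaler()),  # centers each column and divides by its std
    ("lasso", Lasso(alpha=0.1)),   # alpha=0.1 is an arbitrary choice
])
pipe.fit(X, y)

# The coefficients refer to the standardized features, not the raw columns
coef_std = pipe.named_steps["lasso"].coef_
print(coef_std)
```

Because the scaler is part of the pipeline, pipe.predict(X_new) applies the same centering and scaling to new data automatically.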
Note that the data loss term in the sklearn implementation of Lasso is scaled by n_samples. Thus the minimal penalty yielding a zero solution is alpha_max = np.abs(X.T.dot(y)).max() / n_samples (for normalize=False).
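The alpha_max formula can be checked numerically. This is a sketch on made-up standardized data; y is centered so that the fitted intercept plays no role, and the factors 1.01 and 0.5 are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
n_samples, n_features = 20, 10
X = rng.randn(n_samples, n_features)
X = (X - X.mean(0)) / X.std(0)
y = X.dot(rng.randn(n_features))
y = y - y.mean()  # centered target

# Smallest penalty for which the all-zero vector is the Lasso solution
alpha_max = np.abs(X.T.dot(y)).max() / n_samples

coef_above = Lasso(alpha=1.01 * alpha_max).fit(X, y).coef_
coef_below = Lasso(alpha=0.5 * alpha_max).fit(X, y).coef_
print(np.count_nonzero(coef_above), np.count_nonzero(coef_below))
```

Just above alpha_max every coefficient is zero; below it at least one feature enters the model.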
[1] I say potential inconsistency, because normalize is associated to the word norm and thus at least linguistically consistent :)
[Stop reading here if you don't want the details]
Here is some copy and pasteable code reproducing the problem
import numpy as np
rng = np.random.RandomState(42)
n_samples, n_features, n_active_vars = 20, 10, 5
X = rng.randn(n_samples, n_features)
X = ((X - X.mean(0)) / X.std(0))
beta = rng.randn(n_features)
beta[rng.permutation(n_features)[:n_active_vars]] = 0.
y = X.dot(beta)
print(X.std(0))
print(X.mean(0))
from sklearn.linear_model import Lasso
lasso1 = Lasso(alpha=.1)
print(lasso1.fit(X, y).coef_)
lasso2 = Lasso(alpha=.1, normalize=True)
print(lasso2.fit(X, y).coef_)
In order to understand what is going on, now observe that
lasso1.fit(X / np.sqrt(n_samples), y).coef_ / np.sqrt(n_samples)
is equal to
lasso2.fit(X, y).coef_
Hence, scaling the design matrix and appropriately rescaling the coefficients by np.sqrt(n_samples) converts one model to the other. This can also be achieved by acting on the penalty: A lasso estimator with normalize=True with its penalty scaled down by np.sqrt(n_samples) acts like a lasso estimator with normalize=False (on your type of data, i.e. already standardized to std=1).
lasso3 = Lasso(alpha=.1 / np.sqrt(n_samples), normalize=True)
print(lasso3.fit(X, y).coef_)  # yields the same coefficients as lasso1.fit(X, y).coef_
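On scikit-learn versions where the normalize keyword is unavailable, the same relationship can be reproduced by hand. The sketch below assumes that normalize=True amounts to dividing each centered column by its L2 norm and dividing the fitted coefficients by the same norms afterwards, as described above; the tight tol and large max_iter are only there to make the comparison numerically clean:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
n_samples, n_features, n_active_vars = 20, 10, 5
X = rng.randn(n_samples, n_features)
X = (X - X.mean(0)) / X.std(0)
beta = rng.randn(n_features)
beta[rng.permutation(n_features)[:n_active_vars]] = 0.
y = X.dot(beta)

opts = dict(tol=1e-10, max_iter=100000)
col_norms = np.linalg.norm(X, axis=0)  # all equal to sqrt(n_samples) here

# Plain Lasso on the standardized data
plain = Lasso(alpha=.1, **opts).fit(X, y).coef_

# Hand-made "normalize=True": scale columns by their norm, rescale coefficients
emulated = Lasso(alpha=.1, **opts).fit(X / col_norms, y).coef_ / col_norms

# Same emulation with the penalty scaled down by sqrt(n_samples)
rescaled = Lasso(alpha=.1 / np.sqrt(n_samples), **opts).fit(X / col_norms, y).coef_ / col_norms

print(plain)
print(emulated)
print(rescaled)
```

The third fit should agree with the first up to solver tolerance, while the second is penalized more heavily and therefore differs.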

I think the top answer is wrong...
In Lasso, if you set normalize=True, every column is divided by its L2 norm (i.e., sd*sqrt(n) for a centered column) before the lasso regression is fitted. The magnitude of the design matrix is thus reduced, and the "expected" coefficients are enlarged. The larger the coefficients, the stronger the L1 penalty, so the fit has to give more weight to the L1 penalty and sets more coefficients to 0. As a result you will see a sparser solution (more β=0).
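The identity "L2 norm = sd*sqrt(n)" only holds once the column has been centered. A small check on a made-up column with a non-zero mean (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 50
col = 3.0 * rng.randn(n) + 10.0  # made-up column with non-zero mean
centered = col - col.mean()

norm_centered = np.linalg.norm(centered)  # equals sd * sqrt(n)
norm_raw = np.linalg.norm(col)            # also picks up the mean
sd_sqrt_n = col.std() * np.sqrt(n)

print(norm_centered, sd_sqrt_n, norm_raw)
```

For the raw column the squared norm is n * (sd**2 + mean**2), so the two notions of scaling differ by more than a factor of sqrt(n) when the data are not centered.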

Related

scipy curve_fit incorrect for large X values

To determine trends over time, I use scipy curve_fit with X values from time.time(), for example 1663847528.7147126 (1.6 billion).
Doing a linear interpolation sometimes creates erroneous results, and providing approximate initial p0 values doesn't help. I found the magnitude of X to be a crucial element for this error and I wonder why?
Here is a simple snippet that shows working and non-working X offset:
import scipy.optimize
def fit_func(x, a, b):
    return a + b * x
y = list(range(5))
x = [1e8 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0]))
# Result is correct:
# (array([-1.e+08, 1.e+00]), array([[ 0., -0.],
# [-0., 0.]]))
x = [1e9 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.0]))
# Result is not correct:
# OptimizeWarning: Covariance of the parameters could not be estimated
# warnings.warn('Covariance of the parameters could not be estimated',
# (array([-4.53788811e+08, 4.53788812e-01]), array([[inf, inf],
# [inf, inf]]))
Almost perfect p0 for b removes the warning but still curve_fit doesn't work
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.99]))
# Result is not correct:
# (array([-7.60846335e+10, 7.60846334e+01]), array([[-1.97051972e+19, 1.97051970e+10],
# [ 1.97051970e+10, -1.97051968e+01]]))
# ...but perfect p0 works
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 1.0]))
#(array([-1.e+09, 1.e+00]), array([[inf, inf],
# [inf, inf]]))
As a side question, perhaps there's a more efficient method for a linear fit? Sometimes I want to find the second-order polynomial fit, though.
Tested with Python 3.9.6 and SciPy 1.7.1 under Windows 10.

Root cause
You are facing two problems:
Fitting procedures are scale sensitive. The units chosen for a variable (e.g. µA instead of kA) can artificially prevent an algorithm from converging properly (e.g. one variable is several orders of magnitude larger than another and dominates the regression);
Floating-point arithmetic error. When switching from 1e8 to 1e9 you hit the magnitude at which this kind of error becomes predominant.
The second one is very important to realize. Suppose you are limited to 8 significant digits: then 1 000 000 000 and 1 000 000 001 are the same number, because both are stored as 1.0000000e9, and distinguishing them would require one more digit (1.00000000_e9, where _ marks the missing digit). This is why your second example fails.
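The effect can be reproduced with NumPy's fixed-width floats. This is only an illustration of the rounding behaviour, not a trace of what curve_fit does internally; float32 stands in for the "8 significant digits" thought experiment:

```python
import numpy as np

a, b = 1_000_000_000, 1_000_000_001

# float32 keeps roughly 7 significant decimal digits: the two values collapse
same_in_float32 = np.float32(a) == np.float32(b)

# float64 keeps roughly 15-16 digits: the values stay distinct ...
same_in_float64 = np.float64(a) == np.float64(b)

# ... but squaring pushes the result to about 1e18, where the trailing "+1" is lost
exact_square = b * b                          # Python ints are exact
float_square = np.float64(b) * np.float64(b)  # rounded to the nearest float64
lost = exact_square - int(float_square)

print(same_in_float32, same_in_float64, lost)
```

So even when the inputs themselves are representable, intermediate quantities built from their squares or products may not be.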
Additionally, you are using a Non-Linear Least Squares algorithm to solve a Linear Least Squares problem, which is also related to your problem.
You have three solutions:
Normalize;
Normalize and change the methodology/algorithm;
Increase the machine precision.
I'll choose the first one as it is more generic; the second one has been proposed by @blunova and totally makes sense; the third runs into what is probably an inherent limitation.
Normalization
To mitigate both problems, a common solution is normalization. In your case a simple standardization is enough:
import numpy as np
import scipy.optimize
y = np.arange(5)
x = 1e9 + y
def fit_func(x, a, b):
    return a + b * x
xm = np.mean(x) # 1000000002.0
xs = np.std(x) # 1.4142135623730951
result = scipy.optimize.curve_fit(fit_func, (x - xm)/xs, y)
# (array([2. , 1.41421356]),
# array([[0., 0.],
# [0., 0.]]))
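# Back transformation (here a is the slope and b the intercept, as in y = x*a + b;
# note that fit_func above uses the opposite naming):
a = result[0][1]/xs # 1.0
b = result[0][0] - xm*result[0][1]/xs # -1000000000.0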
Or the same result using sklearn interface:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression())
])
pipe.fit(x.reshape(-1, 1), y)
pipe.named_steps["scaler"].mean_ # array([1.e+09])
pipe.named_steps["scaler"].scale_ # array([1.41421356])
pipe.named_steps["regressor"].coef_ # array([1.41421356])
pipe.named_steps["regressor"].intercept_ # 2.0
Back transformation
When you normalize, the fit result is expressed in terms of the normalized variable. To get the required fit parameters, you just need a bit of math to convert the regressed parameters back to the original variable's scale.
Simply write down and solve the transformation:
y = x'*a' + b'
x' = (x - m)/s
y = x*a + b
Which gives you the following solution:
a = a'/s
b = b' - m/s*a'
Precision addendum
Numpy default float precision is float64 as you expected and has about 15 significant digits:
x.dtype # dtype('float64')
np.finfo(np.float64).precision # 15
But scipy.optimize.curve_fit relies on SciPy's least-squares solvers (for instance scipy.optimize.least_squares), which use a squared metric to drive the optimization.
Without digging into the details, I suspect this is where the problem happens: when dealing with values that are all close to 1e9, you reach the threshold where floating-point arithmetic error becomes predominant.
So the 1e9 threshold you have hit is not related to distinguishing the values of your variable x (float64 has sufficient precision to tell them apart) but to how they are used when solving:
minimize F(x) = 0.5 * sum(rho(f_i(x)**2), i = 0, ..., m - 1)
subject to lb <= x <= ub
You can also check in its signature that the default tolerances are all 1e-08:
scipy.optimize.least_squares(fun, x0, jac='2-point', bounds=(- inf, inf),
method='trf', ftol=1e-08, xtol=1e-08, gtol=1e-08, x_scale=1.0,
loss='linear', f_scale=1.0, diff_step=None, tr_solver=None,
tr_options={}, jac_sparsity=None, max_nfev=None, verbose=0,
args=(), kwargs={})
Tightening them may let the algorithm take extra steps before convergence is declared (if it converges at all), but that will not replace or beat the usefulness of normalization.
Methods comparison
What is interesting about the scipy.stats.linregress method is that scale tolerance is handled by design. The method uses variable normalization, pure linear algebra and a numerical stability trick (see the TINY variable) to solve the LS problem even in problematic conditions.
This of course contrasts with the scipy.optimize.curve_fit method, which is an NLLS solver implemented as an optimized gradient descent algorithm (see the Levenberg–Marquardt algorithm).
If you stick with linear least squares problems (linear in terms of the parameters, not the variables, so a second-order polynomial is LLS), then LLS is probably the simpler option to choose, as it handles normalization for you.

If you just need to compute a linear fit, I believe curve_fit is not necessary and I would just use the linregress function instead from SciPy as well:
>>> from scipy import stats
>>> y = list(range(5))
>>> x = [1e8 + a for a in range(5)]
>>> stats.linregress(x, y)
LinregressResult(slope=1.0, intercept=-100000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)
>>> x2 = [1e9 + a for a in range(5)]
>>> stats.linregress(x2, y)
LinregressResult(slope=1.0, intercept=-1000000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)
In general, if you need a polynomial fit I would use NumPy polyfit.
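For the second-order case mentioned in the question, a sketch with np.polyfit on a centered x axis. The data are synthetic with known coefficients, and predict is a hypothetical helper introduced here; centering by x.mean() is one simple way to keep the problem well conditioned:

```python
import numpy as np

x = 1e9 + np.arange(10.0)
t = x - x.mean()                 # centered abscissa, values in [-4.5, 4.5]
y = 2.0 + 3.0 * t + 0.5 * t**2   # synthetic data with known coefficients

# np.polyfit returns the highest-degree coefficient first
c2, c1, c0 = np.polyfit(t, y, deg=2)
print(c2, c1, c0)

def predict(x_new):
    """Evaluate the fitted polynomial at original (uncentered) x values."""
    t_new = x_new - x.mean()
    return c0 + c1 * t_new + c2 * t_new**2
```

The coefficients are expressed in the shifted variable t = x - mean(x). Keeping them in that form avoids the very large, precision-sensitive values that expanding the polynomial back in terms of raw x would produce.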

How to get the unscaled regression coefficients errors using statsmodels?

I'm trying to compute the coefficient errors of a regression using statsmodels. Also known as the standard errors of the parameter estimates. But I need to compute their "unscaled" version. I've only managed to do so with NumPy.
You can see the meaning of "unscaled" in the docs: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
cov bool or str, optional
If given and not False, return not just the estimate but also its covariance matrix.
By default, the covariance are scaled by chi2/dof, where dof = M - (deg + 1),
i.e., the weights are presumed to be unreliable except in a relative sense and
everything is scaled such that the reduced chi2 is unity. This scaling is omitted
if cov='unscaled', as is relevant for the case that the weights are w = 1/sigma, with
sigma known to be a reliable estimate of the uncertainty.
I'm using this data to run the rest of the code in this post:
import numpy as np
x = np.array([-0.841, -0.399, 0.599, 0.203, 0.527, 0.129, 0.703, 0.503])
y = np.array([1.01, 1.24, 1.09, 0.95, 1.02, 0.97, 1.01, 0.98])
sigmas = np.array([6872.26, 80.71, 47.97, 699.94, 57.55, 1561.54, 311.98, 501.08])
# The convention for weights are different
sm_weights = np.array([1.0/sigma**2 for sigma in sigmas])
np_weights = np.array([1.0/sigma for sigma in sigmas])
With NumPy:
coefficients, cov = np.polyfit(x, y, deg=2, w=np_weights, cov='unscaled')
# The errors I need to get
print(np.sqrt(np.diag(cov))) # [917.57938013 191.2100413 211.29028248]
If I compute the regression using statsmodels:
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as smapi
polynomial_features = PolynomialFeatures(degree=2)
polynomial = polynomial_features.fit_transform(x.reshape(-1, 1))
model = smapi.WLS(y, polynomial, weights=sm_weights)
regression = model.fit()
# Get coefficient errors
# Notice the [::-1], statsmodels returns the coefficients in the reverse order NumPy does
print(regression.bse[::-1]) # [0.24532856, 0.05112286, 0.05649161]
So the values I get are different, but related:
np_errors = np.sqrt(np.diag(cov))
sm_errors = regression.bse[::-1]
print(np_errors / sm_errors) # [3740.2061481, 3740.2061481, 3740.2061481]
The NumPy documentation says the covariance are scaled by chi2/dof where dof = M - (deg + 1). So I tried the following:
degree = 2
model_predictions = np.polyval(coefficients, x)
residuals = (model_predictions - y)
chi_squared = np.sum(residuals**2)
degrees_of_freedom = len(x) - (degree + 1)
scale_factor = chi_squared / degrees_of_freedom
sm_cov = regression.cov_params()
unscaled_errors = np.sqrt(np.diag(sm_cov * scale_factor))[::-1] # [0.09848423, 0.02052266, 0.02267789]
unscaled_errors = np.sqrt(np.diag(sm_cov / scale_factor))[::-1] # [0.61112427, 0.12734931, 0.14072311]
What I notice is that the covariance matrix I get from NumPy is much larger than the one I get from statsmodels:
>>> cov
array([[ 841951.9188366 , -154385.61049538, -188456.18957375],
[-154385.61049538, 36561.27989418, 31208.76422516],
[-188456.18957375, 31208.76422516, 44643.58346933]])
>>> regression.cov_params()
array([[ 0.0031913 , 0.00223093, -0.0134716 ],
[ 0.00223093, 0.00261355, -0.0110361 ],
[-0.0134716 , -0.0110361 , 0.0601861 ]])
As long as I can't make them equivalent, I won't be able to get the same errors. Any idea of what the difference in scale could mean and how to make both covariance matrices equal?

statsmodels documentation is not well organized in some parts.
Here is a notebook with an example for the following
https://www.statsmodels.org/devel/examples/notebooks/generated/chi2_fitting.html
The regression models in statsmodels like OLS and WLS, have an option to keep the scale fixed. This is the equivalent to cov="unscaled" in numpy and scipy.
The statsmodels option is more general, because it allows fixing the scale at any user defined value.
https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLSResults.get_robustcov_results.html
If we have a model as defined in the example, either OLS or WLS, then using
regression = model.fit(cov_type="fixed scale")
will keep the scale at 1 and the resulting covariance matrix is unscaled.
Using
regression = model.fit(cov_type="fixed scale", cov_kwds={"scale": 2})
will keep the scale fixed at value two.
(some links to related discussion and motivation are in https://github.com/statsmodels/statsmodels/pull/2137 )
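To see what "scale" means numerically, here is a NumPy-only sketch using the data from the question (it deliberately avoids statsmodels, so it only illustrates the relationship and does not call the statsmodels API). The unscaled covariance is the inverse of the weighted normal matrix, and the default scaled covariance is that matrix multiplied by chi2/dof computed from the weighted residuals:

```python
import numpy as np

x = np.array([-0.841, -0.399, 0.599, 0.203, 0.527, 0.129, 0.703, 0.503])
y = np.array([1.01, 1.24, 1.09, 0.95, 1.02, 0.97, 1.01, 0.98])
sigmas = np.array([6872.26, 80.71, 47.97, 699.94, 57.55, 1561.54, 311.98, 501.08])
w = 1.0 / sigmas  # NumPy weight convention
deg = 2

coef, cov_unscaled = np.polyfit(x, y, deg=deg, w=w, cov="unscaled")
_, cov_scaled = np.polyfit(x, y, deg=deg, w=w, cov=True)

# Unscaled covariance: inverse of the weighted normal matrix A^T A
A = np.vander(x, deg + 1) * w[:, None]
cov_manual = np.linalg.inv(A.T @ A)

# Estimated scale: chi2 / dof, with chi2 built from the *weighted* residuals
resid = w * (y - np.polyval(coef, x))
scale = np.sum(resid**2) / (len(x) - (deg + 1))

print(np.sqrt(np.diag(cov_unscaled)))
print(scale)
```

Note that the chi2 attempted in the question used unweighted residuals, which is why neither multiplying nor dividing by that factor reconciled the two sets of errors.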
Caution
The fixed scale cov_type will be used for inferential statistics that are based on the covariance of the parameter estimates, cov_params.
This affects standard errors, t-tests, wald tests and confidence and prediction intervals.
However, some other results statistics might not be adjusted to use the fixed scale instead of the estimated scale, e.g. resid_pearson.
https://github.com/statsmodels/statsmodels/issues/8190

Why does fit_transform() always give me zeros?

I'm wondering why the following:
sklearn.preprocessing.StandardScaler().fit_transform([[58,144000]])
gives this result:
array([[0., 0.]])
I'm doing a Logistic Regression where I run fit_transform() on array of values (the actual data file) like the ones above. Yet, that transform seems to work fine. But when I try to do a single pair of values as shown above ([[58,144000]]), I get zeros.
For predictions using a "new" input, I need to scale that new value the same way as the test/train data were scaled so my ML predictions will work.
Thanks for suggestions.
Thanks!

If you read the docs, you may wonder why it expects a 2D array, since you can compute the mean and standard deviation of a vector, which is a 1D array, as your question suggests. The answer is that it expects data of shape (samples, features).
So, when you pass data like [[58,144000]], it is a (1,2) array, which means 1 sample with 2 features. The scaler then fits and transforms each feature separately, and each feature contains a single number. Standardizing a single value always gives zero: [[0., 0.]].
On the other hand, if you pass the data as [[58],[144000]], it is a (2,1) array, which means 2 samples and 1 feature. The scaler then standardizes that feature and gives the result you probably expected: [[-1],[1]].
import numpy as np
x = [58,144000]
mu = np.mean(x)
sigma = np.std(x)
print([((58 - mu) / sigma),((144000 - mu) / sigma)]) # [-1.0, 1.0]
from sklearn.preprocessing import StandardScaler
print(StandardScaler().fit_transform([[58],[144000]])) # [[-1.] [ 1.]]
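For the prediction use case in the question (scaling a new input the same way as the training data), the usual pattern is to fit the scaler once on the training set and then call transform, not fit_transform, on new samples. A minimal sketch with made-up training values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: two features per sample
X_train = np.array([[25, 50000], [40, 90000], [58, 144000], [33, 72000]], dtype=float)

scaler = StandardScaler().fit(X_train)  # learn mean and std from the training data

X_new = np.array([[58, 144000]], dtype=float)
X_new_scaled = scaler.transform(X_new)  # reuse the stored statistics, no refit
print(X_new_scaled)

# Refitting on the single sample is what produces the zeros
refit = StandardScaler().fit_transform(X_new)
print(refit)
```

The same fitted scaler object should be kept and reused at prediction time, for example by placing it in a Pipeline together with the classifier.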

Closed Form Ridge Regression

I am having trouble understanding the output of my function to implement multiple-ridge regression. I am doing this from scratch in Python for the closed form of the method. This closed form is: w = (X^T X + lambda I)^(-1) X^T y
I have a training set X that is 100 rows x 10 columns and a vector y that is 100x1.
My attempt is as follows:
def ridgeRegression(xMatrix, yVector, lambdaRange):
    wList = []
    for i in range(1, lambdaRange+1):
        lambVal = i
        # compute the inner values (X.T X + lambda I)
        xTranspose = np.transpose(xMatrix)
        xTx = xTranspose @ xMatrix
        lamb_I = lambVal * np.eye(xTx.shape[0])
        # invert inner, e.g. (inner)**(-1)
        inner_matInv = np.linalg.inv(xTx + lamb_I)
        # compute outer (X.T y)
        outer_xTy = np.dot(xTranspose, yVector)
        # multiply together
        w = inner_matInv @ outer_xTy
        wList.append(w)
    print(wList)
For testing, I am running it with the first 5 lambda values.
wList becomes 5 numpy.arrays each of length 10 (I'm assuming for the 10 coefficients).
Here is the first of those 5 arrays:
array([ 0.29686755, 1.48420319, 0.36388528, 0.70324668, -0.51604451,
2.39045735, 1.45295857, 2.21437745, 0.98222546, 0.86124358])
My question, and clarification:
Shouldn't there be 11 coefficients, (1 for the y-intercept + 10 slopes)?
How do I get the Minimum Square Error from this computation?
What comes next if I wanted to plot this line?
I think I am just really confused as to what I'm looking at, since I'm still working on my linear-algebra.
Thanks!

First, I would modify your ridge regression to look like the following:
import numpy as np
def ridgeRegression(X, y, lambdaRange):
    wList = []
    # Get normal form of `X`
    A = X.T @ X
    # Get Identity matrix
    I = np.eye(A.shape[0])
    # Get right hand side
    c = X.T @ y
    for lambVal in range(1, lambdaRange+1):
        # Set up equations Bw = c
        lamb_I = lambVal * I
        B = A + lamb_I
        # Solve for w
        w = np.linalg.solve(B,c)
        wList.append(w)
    return wList
Notice that I replaced your inv call to compute the matrix inverse with an implicit solve. This is much more numerically stable, which is an important consideration for these types of problems especially.
I've also taken the A = X.T @ X computation, the generation of the identity matrix I, and the right-hand-side vector c = X.T @ y computation out of the loop: these don't change within the loop and are relatively expensive to compute.
As was pointed out by @qwr, the number of columns of X will determine the number of coefficients you have. You have not described your model, so it's not clear how the underlying domain, x, is structured into X.
Traditionally, one might use polynomial regression, in which case X is the Vandermonde Matrix. In that case, the first coefficient would be associated with the y-intercept. However, based on the context of your question, you seem to be interested in multivariate linear regression. In any case, the model needs to be clearly defined. Once it is, then the returned weights may be used to further analyze your data.

Typically to make notation more compact, the matrix X contains a column of ones for an intercept, so if you have p predictors, the matrix is dimensions n by p+1. See Wikipedia article on linear regression for an example.
To compute in-sample MSE, use the definition for MSE: the average of squared residuals. To compute generalization error, you need cross-validation.
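A sketch of both points on synthetic data: prepending a column of ones yields 11 coefficients for 10 predictors, and the in-sample MSE is the mean of the squared residuals. The data, the true intercept 1.5 and lam = 1.0 are made up for illustration. Note that this simple version also penalizes the intercept, which is a simplification and not something the answers above prescribe:

```python
import numpy as np

rng = np.random.RandomState(0)
n, p = 100, 10
X = rng.randn(n, p)
true_w = rng.randn(p)
y = 1.5 + X @ true_w + 0.1 * rng.randn(n)

# Prepend a column of ones so the first coefficient acts as the intercept
X1 = np.column_stack([np.ones(n), X])  # shape (100, 11)

lam = 1.0
# Closed-form ridge solution via a linear solve (no explicit inverse)
w = np.linalg.solve(X1.T @ X1 + lam * np.eye(p + 1), X1.T @ y)

# In-sample mean squared error
residuals = y - X1 @ w
mse = np.mean(residuals**2)
print(w.shape, mse)
```

With more than one predictor the fit is a hyperplane, not a line, so a common way to visualize it is to plot predicted against observed y.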

Also, you shouldn't restrict lambVal to integers. It can be small (close to 0) if the aim is just to avoid numerical error when xTx is ill-conditioned.
I would advise you to use a logarithmic range instead of a linear one, starting from 0.001 and going up to 100 or more if you want to. For instance, you can change your code to this:
powerMin = -3
powerMax = 3
for i in range(powerMin, powerMax):
    lambVal = 10**i
    print(lambVal)
And then you can try a smaller range or a linear range once you figure out what is the correct order of lambVal with your data from cross-validation.
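The same logarithmic grid can be generated with np.logspace. The sketch below, on made-up data, also shows how the norm of the ridge solution shrinks as lambda grows, which can help when judging the right order of magnitude:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X @ rng.randn(10) + 0.1 * rng.randn(100)

lambdas = np.logspace(-3, 2, num=6)  # 0.001, 0.01, ..., 100

# Quantities that do not depend on lambda are computed once
A, c, I = X.T @ X, X.T @ y, np.eye(X.shape[1])
norms = [np.linalg.norm(np.linalg.solve(A + lam * I, c)) for lam in lambdas]

for lam, nrm in zip(lambdas, norms):
    print(lam, nrm)
```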

Transpose input matrix before LinearRegression in sklearn

Here is my python program:
import numpy as np
from sklearn import linear_model
X=np.array([[1, 2, 4]]).T**2
y=np.array([1, 4, 16])
model=linear_model.LinearRegression()
model.fit(X,y)
print('Coefficients: \n', model.coef_)
As a result i have:
Coefficients:
[1.]
This is the first program I have tested with sklearn.
My question is: why do I have to use the transpose, .T**2, in the third instruction?
Without
.T**2
I get these errors: https://imgur.com/a/XWzJx0f
I use http://jupyter.org/try

As the documentation says, you have to pass a matrix of shape (n_samples, n_features), here (3, 1). So your input X in the form [[1, 2, 4]] needs the inner vector in a vertical position.
After .T**2:
array([[ 1],
[ 4],
[16]])
This is what happens under the hood: https://machinelearningmastery.com/solve-linear-regression-using-linear-algebra/

X and y must have matching dimensions (the same number of training samples).
If you do not use the transpose, you have 1 training sample [1,2,4] but 3 labels, which does not match.
If you use the transpose, you have 3 samples [1], [2], [4], which match the 3 labels.
The **2 does not matter for the shape.

The initial shape of matrix X is (1,3). You need to pass the matrix in the form (3,1), as the documentation says and as mentioned in the answer by Alessandro.
The **2 part just squares each element of matrix X. You can run it without that part, but the coefficient will then differ. Currently, with the squaring, your (X, y) pairs are (1,1), (4,4) and (16,16), so the coefficient (the slope m of the equation y=mx+c, if you plot these on a graph) is 1. If you don't square, the coefficient will differ accordingly.
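A short sketch of the shapes involved and of the effect of squaring. reshape(-1, 1) is shown as an alternative way to obtain the column layout; it is not something the question used:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

row = np.array([[1, 2, 4]])  # shape (1, 3): 1 sample, 3 features
col = row.T                  # shape (3, 1): 3 samples, 1 feature
same = np.array([1, 2, 4]).reshape(-1, 1)  # equivalent column layout
y = np.array([1, 4, 16])

print(row.shape, col.shape, same.shape)

# With squaring, X equals y, so the slope is 1
coef_squared = LinearRegression().fit(col**2, y).coef_

# Without squaring, the least-squares slope is 36/7
coef_plain = LinearRegression().fit(col, y).coef_

print(coef_squared, coef_plain)
```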
