I have a classic linear regression problem of the form:
y = X b
where y is a response vector, X is a matrix of input variables, and b is the vector of fit parameters I am searching for.
Python provides b = numpy.linalg.lstsq(X, y)[0] for solving problems of this form (lstsq returns a tuple whose first element is the solution).
However, when I use this I tend to get either extremely large or extremely small values for the components of b.
I'd like to perform the same fit, but constrain the values of b between 0 and 255.
It looks like scipy.optimize.fmin_slsqp() is an option, but I found it extremely slow for the size of problem I'm interested in (X is something like 3375 by 1500 and hopefully even larger).
Are there any other Python options for performing constrained least squares fits? Or are there Python routines for performing Lasso regression, ridge regression, or some other regression method which penalizes large b coefficient values?
Recent SciPy versions include a bounded linear least-squares solver, scipy.optimize.lsq_linear:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.lsq_linear.html#scipy.optimize.lsq_linear
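For example, a minimal sketch with random stand-in data (the real X and y would go in their place); bounds=(0, 255) enforces the requested range on every component of b:

import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
X = rng.random((300, 50))   # stand-in for the real design matrix
y = rng.random(300)         # stand-in for the real response

res = lsq_linear(X, y, bounds=(0, 255))
b = res.x                   # constrained solution, 0 <= b_i <= 255
print(b.min(), b.max())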
You mention you would find Lasso Regression or Ridge Regression acceptable. These and many other constrained linear models are available in the scikit-learn package. Check out the section on generalized linear models.
Usually constraining the coefficients involves some kind of regularization parameter (C or alpha); some of the models (the ones ending in CV) can use cross-validation to set these parameters automatically. You can also further constrain models to use only positive coefficients; for example, there is an option for this on the Lasso model.
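For instance, a rough sketch with made-up data showing both the positive-coefficient option and a CV-tuned model (the alpha value here is arbitrary):

import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = X @ rng.random(20) + 0.1 * rng.standard_normal(200)

pos_lasso = Lasso(alpha=0.01, positive=True).fit(X, y)  # coefficients constrained >= 0
cv_lasso = LassoCV(cv=5, positive=True).fit(X, y)       # alpha chosen by cross-validation
print(pos_lasso.coef_.min() >= 0, cv_lasso.alpha_)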
scipy-optimize-leastsq-with-bound-constraints on SO gives leastsq_bounds, which is scipy leastsq + bound constraints such as 0 <= x_i <= 255.
(Scipy leastsq wraps MINPACK, one of several implementations of the widely used Levenberg–Marquardt algorithm, a.k.a. damped least squares. There are various ways of implementing bounds; leastsq_bounds is, I think, the simplest.)
As @conradlee says, you can find Lasso and Ridge regression implementations in the scikit-learn package. These regressors serve your purpose if you just want your fit parameters to be small or positive.
However, if you want to impose any other range as a bound on the fit parameters, you can build your own constrained regressor with the same package. See the answer by David Dale to this question for an example.
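For instance (a minimal sketch, not David Dale's exact code), a scikit-learn style regressor with box bounds built on scipy.optimize.lsq_linear might look like this:

import numpy as np
from scipy.optimize import lsq_linear
from sklearn.base import BaseEstimator, RegressorMixin

class BoundedLinearRegression(BaseEstimator, RegressorMixin):
    """Linear regression with box bounds on the coefficients (no intercept)."""
    def __init__(self, lower=0.0, upper=255.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y):
        res = lsq_linear(np.asarray(X), np.asarray(y), bounds=(self.lower, self.upper))
        self.coef_ = res.x
        return self

    def predict(self, X):
        return np.asarray(X) @ self.coef_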
I recently prepared some tutorials on Linear Regression in Python. Here is one of the options (Gekko) that includes constraints on the coefficients.
# Constrained Multiple Linear Regression
import numpy as np
nd = 100 # number of data sets
nc = 5 # number of inputs
x = np.random.rand(nd,nc)
y = np.random.rand(nd)
from gekko import GEKKO
m = GEKKO(remote=False); m.options.IMODE=2
c = m.Array(m.FV,nc+1)
for ci in c:
    ci.STATUS = 1    # estimate this coefficient
    ci.LOWER = -10   # lower bound on the coefficient
    ci.UPPER = 10    # upper bound on the coefficient
xd = m.Array(m.Param,nc)
for i in range(nc):
    xd[i].value = x[:,i]   # load each input column as a Gekko parameter
yd = m.Param(y); yp = m.Var()
s = m.sum([c[i]*xd[i] for i in range(nc)])
m.Equation(yp==s+c[-1])
m.Minimize((yd-yp)**2)
m.solve(disp=True)
a = [c[i].value[0] for i in range(nc+1)]
print('Solve time: ' + str(m.options.SOLVETIME))
print('Coefficients: ' + str(a))
It uses the nonlinear solver IPOPT to solve the problem, which often performs better than the scipy.optimize.minimize solvers. There are other constrained optimization methods in Python as well, as discussed in Is there a high quality nonlinear programming solver for Python?
I'm currently doing a project investigating the Bayesian Lasso, and part of the project involves running some simulations. It can be shown that if we set independent and identically distributed conditional Laplace priors on the regression coefficients beta, the posterior mode is a frequentist Lasso estimate with tuning parameter 2 x sigma x lambda. So to check my work, I often use both scikit-learn and statsmodels (in particular their Lasso implementations) to compute the frequentist Lasso estimate that should be approximately equal to the posterior mode (using the medians of sigma and lambda as estimates for 2 x sig x lam) and superimpose it onto my histograms. In the simulations I've run with independent predictor variables, all the Lasso estimates I compute with scikit-learn and statsmodels agree, and they appear to coincide with the posterior mode when I superimpose them on my histograms.
However, if I use virtually the same code but in the case of (a) multicollinearity, (b) p > n or (c) n > p but n small e.g. n=12 and p = 9, sklearn and statsmodels output different Lasso estimates occasionally for the same tuning hyperparameter, which is confusing. Here's my code for the multicollinear predictors case (lam=1 here):
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso

# burned_sig2_tracesLam, lam, X_trainStd and y_train come from the simulation above
sigmamedian = np.median(np.sqrt(burned_sig2_tracesLam))

# Check with SciKit's Lasso
skmodel = Lasso(alpha=2*sigmamedian*lam/(2*len(y_train)), fit_intercept=False, tol=1e-12)
skmodel = skmodel.fit(X_trainStd,y_train-np.mean(y_train))
print(skmodel.coef_)
# StatsModel's Lasso
smLasso = sm.OLS(y_train - np.mean(y_train),X_trainStd).fit_regularized(alpha=2*sigmamedian*lam/(2*len(y_train)))
print(smLasso.params)
The output in this case is:
[ 0.28284396, -1.23878332, 1.08344865, 0.29263474, 0. , 0.00655085]
[ 0.45950192, -1.32361768, 0.8906759, 0.28951489, 0. , 0.]
These aren't the same, which is confusing: I've checked the documentation and I haven't made the common mistake of handling the intercept differently in scikit-learn and statsmodels, the tuning hyperparameters passed in are the same, both modules use coordinate descent, and Lasso estimates should be unique for p < n (in fact Tibshirani has shown that Lasso estimates are almost surely unique if the data are generated from a continuous distribution, even when p > n). After superimposing both of these onto the histograms of the posterior distributions, it appears that scikit-learn's implementation is the one that returns the posterior mode (or at least a very good approximation of it). If I change lambda to 2, then scikit-learn and statsmodels still return different things, but now statsmodels returns the estimate that accurately approximates the posterior mode.
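One additional sanity check I can run is to plug both coefficient vectors into the (sklearn-style) Lasso objective directly and see which attains the lower value; a rough sketch using the variables from my code above:

def lasso_objective(beta, X, y, alpha):
    # sklearn's parameterization: (1/(2n)) * ||y - X beta||^2 + alpha * ||beta||_1
    n = len(y)
    resid = y - X @ beta
    return resid @ resid / (2 * n) + alpha * np.abs(beta).sum()

alpha = 2 * sigmamedian * lam / (2 * len(y_train))
yc = y_train - np.mean(y_train)
print(lasso_objective(skmodel.coef_, X_trainStd, yc, alpha))
print(lasso_objective(np.asarray(smLasso.params), X_trainStd, yc, alpha))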
I also generated data with p=25 and n = 20; the theory suggests that Lasso should set 5 of the coefficients to zero, but neither scikit nor statsmodels did this.
What's going on here?
I have a regression of the form model = sm.GLM(y, X, w=weight), which ends up being a simple weighted OLS. (Note that specifying w as the array of error weights actually works in sm.GLM identically to sm.WLS, despite this not being in the documentation.)
I'm using GLM because it allows me to fit with some additional constraints using fit_constrained(). My X consists of 6 independent variables, and I want to constrain the resulting coefficients of 2 of them to be positive. But I cannot seem to figure out the syntax to get fit_constrained() to work. The documentation is extremely bare and I cannot find any good examples anywhere. All I really need is the correct syntax for specifying these constraints. Thanks!
The function you mention is meant for linear constraints, that is, constraints requiring some linear combination of your coefficients to satisfy an equality; it is not meant for defining boundaries.
The closest you can get is using scipy's least_squares and defining the bounds. For example, we set up a dataset with 6 coefficients:
from scipy.optimize import least_squares
import numpy as np
np.random.seed(100)
x = np.random.uniform(0,1,(30,6))
y = np.random.normal(0,2,30)
The objective function basically does the matrix multiplication and returns the residuals:
def fun(b, x, y):
    return b[0] + np.matmul(x, b[1:]) - y
The first parameter is the intercept. Let's say we require b[1] and b[6] (the coefficients of the 1st and 6th inputs) to be non-negative:
res_lsq = least_squares(fun, [1,1,1,1,1,1,1], args=(x, y),
bounds=([-np.inf,0,-np.inf,-np.inf,-np.inf,-np.inf,0],+np.inf))
And we check the result:
res_lsq.x
array([-1.74342242e-01, 2.09521327e+00, -2.02132481e-01, 2.06247855e+00,
-3.65963504e+00, 6.52264332e-01, 5.33657765e-20])
Reviewing linear regressions via statsmodels' OLS fit, I see you have to use add_constant to add a constant '1' to all your points in the independent variable(s) before fitting. However, my only understanding of the intercept in this context is that it is the value of y for our line when x equals 0, so I'm not clear what purpose always injecting a '1' here serves. What is this constant actually telling the OLS fit?
It doesn't add a constant to your values, it adds a constant term to the linear equation it is fitting. In the single-predictor case, it's the difference between fitting a line y = mx to your data and fitting y = mx + b.
sm.add_constant in statsmodels is the same as sklearn's fit_intercept parameter in LinearRegression(). If you don't use sm.add_constant, or you set LinearRegression(fit_intercept=False), then both statsmodels and sklearn assume that b = 0 in y = mx + b, and they fit the model with b = 0 instead of estimating what b should be from your data.
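A tiny illustration with made-up numbers (fitted values shown in the comments are approximate):

import numpy as np
import statsmodels.api as sm

x = np.array([0., 1., 2., 3.])
y = np.array([1., 3., 5., 7.])      # exactly y = 2x + 1

# Without a constant the line is forced through the origin (b = 0):
print(sm.OLS(y, x).fit().params)    # ~[2.43], slope only, biased

# add_constant prepends a column of ones so the intercept b is also estimated:
X = sm.add_constant(x)              # columns: [1, x]
print(sm.OLS(y, X).fit().params)    # ~[1.0, 2.0] -> intercept, slope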
I am a heavy R user and am recently learning Python.
I have a question about how statsmodels.api handles duplicated features.
In my understanding, this function is the Python version of glm in R. So I am expecting the function to return the maximum likelihood estimates (MLE).
My question is: which algorithm does statsmodels employ to obtain the MLE?
Especially how is the algorithm handling the situation with duplicated features?
To clarify my question, I generate a sample of size 50 from a Bernoulli distribution with a single covariate x1.
import statsmodels.api as sm
import pandas as pd
import numpy as np
def ilogit(eta):
    return 1.0 - 1.0/(np.exp(eta)+1)
## generate samples
Nsample = 50
cov = {}
cov["x1"] = np.random.normal(0,1,Nsample)
cov = pd.DataFrame(cov)
true_value = 0.5
resp = {}
resp["FAIL"] = np.random.binomial(1, ilogit(true_value*cov["x1"]))
resp = pd.DataFrame(resp)
resp["NOFAIL"] = 1 - resp["FAIL"]
Then fit the logistic regression as:
## fit logistic regrssion
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This returns a summary table (omitted here). The estimated coefficient is more or less similar to the true value (= 0.5).
Then I create a duplicate column, namely x2, and fit the logistic regression model again. (glm in R package would return NA for x2)
cov["x2"] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This outputs another summary table (omitted here). Surprisingly, this works, and the coefficient estimates of x1 and x2 are exactly identical (= 0.1182). Since the previous fit returned a coefficient estimate of 0.2364 for x1, the estimate has been halved.
Then I increase the number of duplicated features to 9 and fit the model:
for icol in range(3,10):
    cov["x"+str(icol)] = cov["x1"]   # x3 ... x9 are copies of x1
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
As expected, the estimates of each duplicated variable are the same (0.0263) and they seem to be 9 times smaller than the original estimate for x1 (0.2364).
I am surprised by this unexpected behaviour of the maximum likelihood estimates. Could you explain why this is happening, and what kind of algorithm is employed behind statsmodels.api?
The short answer:
In this case GLM is using the Moore-Penrose generalized inverse, pinv, which corresponds to a principal component regression where components with zero eigenvalues are dropped. A zero eigenvalue is defined by the default threshold (rcond) in numpy.linalg.pinv.
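A toy illustration of that behavior with a plain linear model (just numpy.linalg.pinv applied to a design with a duplicated column, not the GLM itself): the minimum-norm solution splits the coefficient evenly across the duplicates.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
y = 0.5 * x1 + rng.normal(size=50)

X1 = x1[:, None]                    # single column
X2 = np.column_stack([x1, x1])      # duplicated column

print(np.linalg.pinv(X1) @ y)       # a single slope estimate near 0.5
print(np.linalg.pinv(X2) @ y)       # the same fit, split evenly across the two identical columns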
statsmodels does not have a systematic policy towards collinearity. Some nonlinear optimization routines raise an exception when the matrix inverse fails. However, the linear regression models, OLS and WLS, use the generalized inverse by default, in which case we see the behavior as above.
The default optimization algorithm in GLM.fit is iteratively reweighted least squares (IRLS), which uses WLS and inherits the default behavior of WLS for singular design matrices.
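For reference, a bare-bones sketch of what an IRLS iteration for a logistic GLM looks like (illustrative only; the actual statsmodels implementation differs in its details), with the WLS step solved via the pseudoinverse so that singular designs behave as described above:

import numpy as np

def irls_logit(X, y, n_iter=25):
    # Minimal IRLS for a logit model, without regularization or convergence checks
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # inverse logit
        w = mu * (1.0 - mu)               # working weights
        z = eta + (y - mu) / w            # working response
        sw = np.sqrt(w)
        # WLS step via pinv: a duplicated column again yields the minimum-norm
        # (evenly split) solution, which is the behavior seen in the question
        beta = np.linalg.pinv(sw[:, None] * X) @ (sw * z)
    return beta

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * x1)))
print(irls_logit(x1[:, None], y))                    # single slope estimate
print(irls_logit(np.column_stack([x1, x1]), y))      # the same fit, split in half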
The version in statsmodels master also has the option of using the standard scipy optimizers, where the behavior with respect to singular or near-singular design matrices will depend on the details of the optimization algorithm.
I want to do least-squares polynomial fits on data sets (X,Y,Yerr) and obtain the covariance matrices of the fit parameters. Also, since I have many data sets, CPU-time is an issue, so I'm seeking an analytical (=fast) solution. I found the following (non-ideal) options:
numpy.polyfit does the fit, but doesn't take into account the errors Yerr, nor does it return the covariance;
numpy.polynomial.polynomial.polyfit does accept Yerr as an input (in the form of weights), but doesn't return covariance either;
scipy.optimize.curve_fit and scipy.optimize.leastsq can be tailored to fit polynomials and return the covariance matrix, but - being iterative methods - these are much slower than the polyfit routines (which yield an analytical solution);
Does Python provide an analytical polynomial fit routine that returns the covariance of the fit parameters (or do I have to write one myself :-) ?
Update:
It appears that as of NumPy 1.7.0, numpy.polyfit not only accepts weights but also returns the covariance matrix of the coefficients ... So, issue resolved! :-)
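For example, a quick sketch with synthetic data (per the numpy.polyfit docs, the w argument takes 1/sigma, and cov=True returns the covariance matrix):

import numpy as np

x = np.linspace(0, 1, 50)
yerr = np.full_like(x, 0.1)                                # per-point standard errors
y = 1 + 2*x + 3*x**2 + yerr*np.random.normal(size=x.size)

coeffs, cov = np.polyfit(x, y, deg=2, w=1/yerr, cov=True)
print(coeffs)   # coefficients, highest power first
print(cov)      # their covariance matrix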
You want a fast weighted least squares model that returns the covariance matrix without additional overhead? In general, the right covariance matrix will depend on the data generating process (DGP), because different DGPs (say, heteroscedasticity of the errors) imply different distributions of the parameter estimates (think White vs. OLS standard errors). But if you can assume WLS is the right way to do it, then I believe you would use the asymptotic variance estimate for beta under WLS, (1/n X'V^-1 X)^-1, where V is the weighting matrix created from the Yerrs. That's a pretty simple formula if numpy.polynomial.polynomial.polyfit is working for you.
I looked for an online reference but couldn't find one. But see Fumio Hayashi's Econometrics, 2000, Princeton University Press, pp. 133-137 for a derivation and discussion.
Update 12/4/12:
There is another Stack Overflow question that comes close: numpy.polyfit has no keyword 'cov'. It has a nice explanation (with code) of how to use scikits.statsmodels to do what you want. I believe you'll want to replace the line:
result = sm.OLS(Y,reg_x_data).fit()
with
result = sm.WLS(Y,reg_x_data, weights).fit()
where you define weights as a function of Yerr, as before with numpy.polynomial.polynomial.polyfit. More details on using statsmodels with WLS are at
the statsmodels website.
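A hedged sketch of that WLS route with synthetic stand-ins for the data (assumed names and a quadratic design matrix; WLS weights taken as 1/Yerr**2):

import numpy as np
import statsmodels.api as sm

X = np.linspace(0, 1, 100)
Yerr = np.full_like(X, 0.1)                       # per-point standard errors
Y = 1 + 2*X + 3*X**2 + Yerr*np.random.normal(size=X.size)

reg_x_data = np.column_stack([X**0, X, X**2])     # polynomial design matrix
result = sm.WLS(Y, reg_x_data, weights=1.0/Yerr**2).fit()

print(result.params)        # fitted polynomial coefficients
print(result.cov_params())  # their covariance matrix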
Here it is using scipy.linalg.lstsq
import numpy as np
import scipy.linalg

# generate the test data
N = 100
xs = np.random.uniform(size=N)
errs = np.random.uniform(0, 0.1, size=N)  # per-point errors
ys = 1 + 2 * xs + 3 * xs ** 2 + errs * np.random.normal(size=N)

# do the fit: dividing each row by its error whitens the problem
polydeg = 2
A = np.vstack([1 / errs] + [xs ** _ / errs for _ in range(1, polydeg + 1)]).T
result = scipy.linalg.lstsq(A, ys / errs)[0]
covar = np.linalg.inv(A.T @ A)  # covariance matrix of the fitted coefficients
print(result, '\n', covar)
>> [ 0.99991811 2.00009834 3.00195187]
[[ 4.82718910e-07 -2.82097554e-06 3.80331414e-06]
[ -2.82097554e-06 1.77361434e-05 -2.60150367e-05]
[ 3.80331414e-06 -2.60150367e-05 4.22541049e-05]]