Can I force coefficients between 0-1 in LinearRegression? - python

I am using LinearRegression from sklearn.linear_model. Can I force the coefficients between 0 and 1? Also, can I give priority to solutions involving only binary coefficients? (Assume such a solution exists!)
From https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html, I only know how to force positive coefficients using the positive=True parameter, but coefficients reach values above 1:
from sklearn.linear_model import LinearRegression
reg = LinearRegression(positive=True, fit_intercept=False).fit(X, y)
Alternatively, can you suggest a different model for this?
EDIT:
As I understand it, reg.coef_ shows the coefficients that were found to fit the data best. Can I force the algorithm to only look for solutions with coefficients in the range 0-1 (or, if possible, binary)? For example, scipy.optimize.curve_fit lets you set bounds (possible ranges) for each variable.

Sklearn's LinearRegression() is a wrapper for scipy.linalg.lstsq(). Apart from positive=True, it does not implement bound constraints as far as I know, but you can try scipy.optimize.lsq_linear():
from scipy.optimize import lsq_linear
res = lsq_linear(X, y, bounds=(0, 1))
# Get coefficients:
print(res.x)
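
As a minimal sketch of the bounded fit above, assuming synthetic data (the `X`, `y`, and `true_coef` values below are made up for illustration and are not from the question), `lsq_linear` keeps every coefficient inside [0, 1] even where the unconstrained solution may drift outside that range:

```python
import numpy as np
from scipy.optimize import lsq_linear

# Synthetic data for illustration only (not from the question).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 4))
true_coef = np.array([0.0, 1.0, 1.0, 0.3])
y = X @ true_coef + rng.normal(0.0, 0.5, size=200)

# Unconstrained least squares, for comparison.
coef_free = np.linalg.lstsq(X, y, rcond=None)[0]

# Bounded least squares: every coefficient is restricted to [0, 1].
res = lsq_linear(X, y, bounds=(0.0, 1.0))
coef_bounded = res.x

print("unconstrained:", coef_free)
print("bounded:      ", coef_bounded)
```

Note that this only bounds the coefficients; it does not by itself favour binary (0 or 1) solutions.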

Related

Constraining OLS (or WLS) coefficients using statsmodels

I have a regression of the form model = sm.GLM(y, X, w = weight).
Which ends up being a simple weighted OLS. (Note that specifying w as the error weights array actually works in sm.GLM identically to sm.WLS, despite it not being in the documentation.)
I'm using GLM because this allows me to fit with some additional constraints using fit_constrained(). My X consists of 6 independent variables, 2 of which I want to constrain so that the resulting coefficients are positive. But I cannot figure out the syntax to get fit_constrained() to work. The documentation is extremely bare and I cannot find any good examples anywhere. All I really need is the correct syntax for inputting these constraints. Thanks!
fit_constrained() is meant for linear constraints, that is, for requiring a linear combination of your coefficients to fulfill some linear equality; it is not meant for defining bounds.
The closest you can get is using scipy least squares and defining the bounds. For example, we set up a dataset with 6 predictors:
from scipy.optimize import least_squares
import numpy as np
np.random.seed(100)
x = np.random.uniform(0,1,(30,6))
y = np.random.normal(0,2,30)
The function to basically matrix multiply and return error:
def fun(b, x, y):
    return b[0] + np.matmul(x,b[1:]) - y
The first entry of b is the intercept. Let's say we require b[1] and b[6], the coefficients of the 1st and 6th predictors, to be non-negative:
res_lsq = least_squares(fun, [1,1,1,1,1,1,1], args=(x, y),
                        bounds=([-np.inf,0,-np.inf,-np.inf,-np.inf,-np.inf,0],+np.inf))
And we check the result:
res_lsq.x
array([-1.74342242e-01, 2.09521327e+00, -2.02132481e-01, 2.06247855e+00,
-3.65963504e+00, 6.52264332e-01, 5.33657765e-20])
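
Because the residual in this example is linear in b, the same kind of bounded fit could also be posed with scipy.optimize.lsq_linear. The following is only a sketch under that assumption: it reuses the synthetic data from the answer, and the names `A`, `lower`, and `upper` are introduced here for illustration.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Same synthetic data as in the answer above.
np.random.seed(100)
x = np.random.uniform(0, 1, (30, 6))
y = np.random.normal(0, 2, 30)

# Prepend a column of ones so that b[0] plays the role of the intercept.
A = np.column_stack([np.ones(len(x)), x])

# Lower bound 0 for b[1] and b[6]; every other entry is unbounded.
lower = np.full(A.shape[1], -np.inf)
lower[[1, 6]] = 0.0
upper = np.full(A.shape[1], np.inf)

res_lin = lsq_linear(A, y, bounds=(lower, upper))
print(res_lin.x)
```

A dedicated linear solver avoids having to write the residual function and supply a starting point.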

statsmodels add_constant for OLS intercept, what is this actually doing?

Reviewing linear regressions via statsmodels OLS fit I see you have to use add_constant to add a constant '1' to all your points in the independent variable(s) before fitting. However my only understanding of intercepts in this context would be the value of y for our line when our x equals 0, so I'm not clear what purpose always just injecting a '1' here serves. What is this constant actually telling the OLS fit?
It doesn't add a constant to your values, it adds a constant term to the linear equation it is fitting. In the single-predictor case, it's the difference between fitting a line y = mx to your data vs fitting y = mx + b. Concretely, the column of ones becomes one more predictor, and the coefficient fitted for that column is the intercept b.
sm.add_constant in statsmodels is the same as sklearn's fit_intercept parameter in LinearRegression(). If you don't call sm.add_constant, or if you use LinearRegression(fit_intercept=False), then both statsmodels and sklearn assume that b=0 in y = mx + b, and they fit the model with b=0 instead of estimating b from your data.
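
To make the role of the column of ones concrete, here is a small NumPy-only sketch. It assumes made-up data with a true slope of 2 and a true intercept of 5, and uses np.linalg.lstsq as a stand-in for the OLS fit rather than statsmodels itself:

```python
import numpy as np

# Made-up data: true slope 2, true intercept 5 (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + 5.0 + rng.normal(0.0, 0.1, size=100)

# Without a constant column the model is y = m*x, forced through the origin.
X_no_const = x.reshape(-1, 1)
coef_no_const = np.linalg.lstsq(X_no_const, y, rcond=None)[0]

# With a column of ones the model is y = b*1 + m*x, so b is the intercept.
X_const = np.column_stack([np.ones_like(x), x])
b, m = np.linalg.lstsq(X_const, y, rcond=None)[0]

print("slope without constant:", coef_no_const[0])
print("intercept and slope with constant:", b, m)
```

Without the ones column, the slope has to absorb the missing intercept, so it ends up noticeably away from 2.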

statsmodels Python package: how exactly are duplicated features handled?

I am a heavy R user and am recently learning python.
I have a question about how statsmodels.api handles duplicated features.
In my understanding, this function is a Python version of glm in R. So I am expecting the function to return the maximum likelihood estimates (MLE).
My question is: which algorithm does statsmodels employ to obtain the MLE?
In particular, how does the algorithm handle duplicated features?
To clarify my question, I generate a sample of size 50 from a Bernoulli distribution with a single covariate x1.
import statsmodels.api as sm
import pandas as pd
import numpy as np
def ilogit(eta):
    return 1.0 - 1.0/(np.exp(eta)+1)
## generate samples
Nsample = 50
cov = {}
cov["x1"] = np.random.normal(0,1,Nsample)
cov = pd.DataFrame(cov)
true_value = 0.5
resp = {}
resp["FAIL"] = np.random.binomial(1, ilogit(true_value*cov["x1"]))
resp = pd.DataFrame(resp)
resp["NOFAIL"] = 1 - resp["FAIL"]
Then fit the logistic regression as:
## fit logistic regression
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This returns:
The estimated coefficient is more or less similar to the true value (=0.5).
Then I create a duplicate column, namely x2, and fit the logistic regression model again. (glm in R package would return NA for x2)
cov["x2"] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This outputs:
Surprisingly, this works and coefficient estimates of x1 and x2 are exactly identical (=0.1182). As the previous fit returns the coefficient estimate of x1 = 0.2364, the estimate was halved.
Then I increase the number of duplicated features to 9 and fit the model:
cov = cov
for icol in range(3,10):
    cov["x"+str(icol)] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
As expected, the estimates of each duplicated variable are the same (0.0263) and they seem to be 9 times smaller than the original estimate for x1 (0.2364).
I am surprised by this unexpected behaviour of the maximum likelihood estimates. Could you explain why this is happening, and also what kind of algorithms are employed behind statsmodels.api?
The short answer:
GLM is using the Moore-Penrose generalized inverse, pinv, in this case, which corresponds to a principal component regression where components with zero eigenvalues are dropped. A zero eigenvalue is defined by the default threshold (rcond) in numpy.linalg.pinv.
statsmodels does not have a systematic policy towards collinearity. Some nonlinear optimization routines raise an exception when the matrix inverse fails. However, the linear regression models, OLS and WLS, use the generalized inverse by default, in which case we see the behavior as above.
The default optimization algorithm in GLM.fit is iteratively reweighted least squares (IRLS), which uses WLS and inherits the default behavior of WLS for singular design matrices.
The version in statsmodels master also has the option of using the standard scipy optimizers, where the behavior with respect to singular or near-singular design matrices will depend on the details of the optimization algorithm.
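
The coefficient splitting seen in the question can be reproduced with a plain least-squares analogue. This is only a sketch: it uses ordinary least squares on made-up data rather than the logistic GLM, and the helper `pinv_fit` is a name introduced here. With k identical columns, the pseudoinverse returns the minimum-norm solution, which spreads the single-column estimate equally over the k copies.

```python
import numpy as np

# Made-up data with one real covariate (illustrative assumption).
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=50)
y = 0.5 * x1 + rng.normal(0.0, 0.1, size=50)

def pinv_fit(n_copies):
    """Least-squares fit via the pseudoinverse with n_copies identical columns."""
    X = np.column_stack([x1] * n_copies)
    # rcond is set explicitly (looser than NumPy's default) so that the
    # numerically zero singular values are reliably dropped in this demo.
    return np.linalg.pinv(X, rcond=1e-10) @ y

beta_1 = pinv_fit(1)  # single column
beta_2 = pinv_fit(2)  # one duplicate: each estimate is half of beta_1
beta_9 = pinv_fit(9)  # nine copies: each estimate is a ninth of beta_1

print(beta_1, beta_2, beta_9)
```

The fitted values are the same in all three cases, since the copies' coefficients sum to the single-column estimate.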

Estimating Posterior in Python?

I'm new to Bayesian stats and I'm trying to estimate the posterior for a Poisson likelihood and a gamma prior in Python. The parameter I'm trying to estimate is the lambda variable in the Poisson distribution. I think the posterior will take the form of a gamma distribution (conjugate prior?) but I don't want to leverage that. The only thing I'm given is the data (named "my_data"). Here's my code:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats
x=np.linspace(1,len(my_data),len(my_data))
lambda_estimate=np.mean(my_data)
prior= scipy.stats.gamma.pdf(x,alpha,beta) #the parameters dont matter for now
likelihood_temp = lambda yi, a: scipy.stats.poisson.pmf(yi, a)
likelihood = lambda y, a: np.log(np.prod([likelihood_temp(data, a) for data in my_data]))
posterior=likelihood(my_data,lambda_estimate) * prior
When I try to plot the posterior I get an empty plot. I plotted the prior and it looks fine, so I think the issue is the likelihood. I took the log because the data is fairly large and I didn't want things to get unstable. Can anyone point out the issues in my code? Any help would be appreciated.
In Bayesian statistics, one goal is to calculate the posterior distribution of the parameter (lambda) given the data and the prior over a range of possible values for lambda. In your code, you are calculating the prior over the array x, but you are using a single value of lambda to calculate the likelihood. The posterior and likelihood should be evaluated over x as well, something like:
posterior = [likelihood(my_data, lambda_i) for lambda_i in x] * prior
(assuming you are not taking the logs of the prior and likelihood)
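
A grid evaluation along these lines can be sketched in log space, which avoids the underflow that a product of many pmf values causes. Everything specific here is an assumption for illustration: the simulated `my_data`, the prior hyperparameters `prior_shape` and `prior_rate`, and the grid `lam_grid`. The closed-form conjugate mean is computed only as a sanity check.

```python
import numpy as np
from scipy import stats

# Stand-in data; the question's my_data is not available (assumption).
rng = np.random.default_rng(0)
my_data = rng.poisson(lam=4.0, size=200)

# Assumed gamma prior hyperparameters (shape and rate).
prior_shape, prior_rate = 2.0, 1.0

# Grid of candidate lambda values.
lam_grid = np.linspace(0.01, 15.0, 3000)
dlam = lam_grid[1] - lam_grid[0]

# Log prior and log likelihood evaluated at every grid point.
log_prior = stats.gamma.logpdf(lam_grid, a=prior_shape, scale=1.0 / prior_rate)
log_lik = stats.poisson.logpmf(my_data[:, None], lam_grid[None, :]).sum(axis=0)

# Unnormalised log posterior; subtract the max before exponentiating.
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum() * dlam  # normalise so the density integrates to about 1

grid_mean = (lam_grid * post).sum() * dlam
exact_mean = (prior_shape + my_data.sum()) / (prior_rate + len(my_data))
print(grid_mean, exact_mean)
```

Plotting `post` against `lam_grid` then gives the posterior curve over lambda.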
You might want to take a look at the PyMC3 library.
I would recommend you to have a look at the conjugate_prior module.
You could just type:
from conjugate_prior import GammaPoisson
model = GammaPoisson(prior_a, prior_b)
model = model.update(...)
credible_interval = model.posterior(lower_bound, upper_bound)

Constrained Linear Regression in Python

I have a classic linear regression problem of the form:
y = X b
where y is a response vector, X is a matrix of input variables, and b is the vector of fit parameters I am searching for.
Python provides b = numpy.linalg.lstsq( X , y ) for solving problems of this form.
However, when I use this I tend to get either extremely large or extremely small values for the components of b.
I'd like to perform the same fit, but constrain the values of b between 0 and 255.
It looks like scipy.optimize.fmin_slsqp() is an option, but I found it extremely slow for the size of problem I'm interested in (X is something like 3375 by 1500 and hopefully even larger).
Are there any other Python options for performing constrained least squares fits? Or are there Python routines for performing Lasso Regression or Ridge Regression or some other regression method which penalizes large b coefficient values?
Recent scipy versions include a solver:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.lsq_linear.html#scipy.optimize.lsq_linear
You mention you would find Lasso Regression or Ridge Regression acceptable. These and many other constrained linear models are available in the scikit-learn package. Check out the section on generalized linear models.
Usually constraining the coefficients involves some kind of regularization parameter (C or alpha); some of the models (the ones ending in CV) can use cross validation to automatically set these parameters. You can also further constrain models to use only positive coefficients; for example, there is an option for this on the Lasso model.
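
As a rough sketch of the two options mentioned in this answer, assuming made-up data and arbitrarily chosen alpha values (both would need tuning on a real problem): Ridge shrinks the overall size of the coefficient vector relative to plain least squares, and Lasso with positive=True keeps every coefficient non-negative.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_b = rng.uniform(-5.0, 5.0, size=10)
y = X @ true_b + rng.normal(0.0, 1.0, size=200)

# Plain least squares, for comparison.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge penalises large coefficients (alpha chosen arbitrarily here).
ridge = Ridge(alpha=50.0, fit_intercept=False).fit(X, y)

# Lasso restricted to non-negative coefficients.
lasso_pos = Lasso(alpha=0.1, positive=True, fit_intercept=False).fit(X, y)

print("OLS norm:  ", np.linalg.norm(b_ols))
print("Ridge norm:", np.linalg.norm(ridge.coef_))
print("Lasso (positive):", lasso_pos.coef_)
```

Neither model enforces an upper bound such as 255; for explicit box bounds a bounded least-squares solver is the more direct tool.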
scipy-optimize-leastsq-with-bound-constraints on SO gives leastsq_bounds, which is scipy leastsq + bound constraints such as 0 <= x_i <= 255. (Scipy leastsq wraps MINPACK, one of several implementations of the widely-used Levenberg–Marquardt algorithm, a.k.a. damped least-squares. There are various ways of implementing bounds; leastsq_bounds is, I think, the simplest.)
As @conradlee says, you can find Lasso and Ridge Regression implementations in the scikit-learn package. These regressors serve your purpose if you just want your fit parameters to be small or positive.
However, if you want to impose any other range as a bound for the fit parameters, you can build your own constrained Regressor with the same package. See the answer by David Dale to this question for an example.
I recently prepared some tutorials on Linear Regression in Python. Here is one of the options (Gekko) that includes constraints on the coefficients.
# Constrained Multiple Linear Regression
import numpy as np
nd = 100 # number of data sets
nc = 5 # number of inputs
x = np.random.rand(nd,nc)
y = np.random.rand(nd)
from gekko import GEKKO
m = GEKKO(remote=False); m.options.IMODE=2
c = m.Array(m.FV,nc+1)
for ci in c:
    ci.STATUS=1
    ci.LOWER = -10
    ci.UPPER = 10
xd = m.Array(m.Param,nc)
for i in range(nc):
    xd[i].value = x[:,i]
yd = m.Param(y); yp = m.Var()
s = m.sum([c[i]*xd[i] for i in range(nc)])
m.Equation(yp==s+c[-1])
m.Minimize((yd-yp)**2)
m.solve(disp=True)
a = [c[i].value[0] for i in range(nc+1)]
print('Solve time: ' + str(m.options.SOLVETIME))
print('Coefficients: ' + str(a))
It uses the nonlinear solver IPOPT to solve the problem, which I find works better than the scipy.optimize.minimize solver. There are other constrained optimization methods in Python as well, as discussed in Is there a high quality nonlinear programming solver for Python?.
