Reviewing linear regressions via statsmodels OLS fit I see you have to use add_constant to add a constant '1' to all your points in the independent variable(s) before fitting. However my only understanding of intercepts in this context would be the value of y for our line when our x equals 0, so I'm not clear what purpose always just injecting a '1' here serves. What is this constant actually telling the OLS fit?
It doesn't add a constant to your values; it adds a constant term to the linear equation it is fitting. In the single-predictor case, it's the difference between fitting the line y = mx to your data and fitting y = mx + b.
sm.add_constant in statsmodels plays the same role as sklearn's fit_intercept parameter in LinearRegression(). If you don't call sm.add_constant, or if you use LinearRegression(fit_intercept=False), then both statsmodels and sklearn assume that b = 0 in y = mx + b, and they fit the model with b fixed at 0 instead of estimating b from your data.
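For illustration, here is a minimal sketch with made-up data that fits the same model with and without the constant column; only the second fit estimates an intercept:
import numpy as np
import statsmodels.api as sm

# Made-up data with a true intercept of 5
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 5 + rng.normal(0, 1, 100)

# Without add_constant: fits y = m*x, so the intercept is forced to 0
fit_no_const = sm.OLS(y, x).fit()

# With add_constant: the extra column of 1s lets OLS estimate b in y = m*x + b
X = sm.add_constant(x)
fit_const = sm.OLS(y, X).fit()

print(fit_no_const.params)  # one coefficient (slope only)
print(fit_const.params)     # two coefficients: intercept (~5) and slope (~2)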
I am using LinearRegression from sklearn.linear_model. Can I force the coefficients to lie between 0 and 1? Also, can I give priority to solutions involving only binary coefficients? (Assume such a solution exists!)
From https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html, I only know how to force positive coefficients using the positive=True parameter, but coefficients reach values above 1:
from sklearn.linear_model import LinearRegression
reg = LinearRegression(positive=True, fit_intercept=False).fit(X, y)
Alternatively, can you suggest a different model for this?
EDIT:
As I understand it, the command reg.coef_ shows the coefficients that were found to fit the data best. Can I force the algorithm to only look for solutions with coefficients in the range 0-1 (or, if possible, binary)? E.g., scipy.optimize.curve_fit allows setting bounds (possible ranges) for each variable.
Sklearn's LinearRegression() is a wrapper around scipy.linalg.lstsq(). It does not implement a constrained version as far as I know, but you can try scipy.optimize.lsq_linear():
from scipy.optimize import lsq_linear
res = lsq_linear(X, y, bounds=(0, 1))
# Get coefficients:
print(res.x)
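For a self-contained illustration, here is a sketch with made-up data; the bounds keyword is the only thing that changes relative to an unconstrained solve. Note that lsq_linear only enforces box bounds, so it cannot prefer exactly-binary solutions:
import numpy as np
from scipy.optimize import lsq_linear

# Made-up data: 100 samples, 3 features, true coefficients inside [0, 1]
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_coef = np.array([0.2, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=0.1, size=100)

# Box-constrained least squares: every coefficient is forced into [0, 1]
res = lsq_linear(X, y, bounds=(0, 1))
print(res.x)  # estimated coefficients, all within [0, 1]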
I use Python's statsmodels.formula.api for most regression tasks. When I test a large number of variables in a model, I check their p-values to be confident that the variables are actually improving the model.
I usually apply polynomial transformations to variables to test whether that improves the fit. For example, this is the equation for a polynomial transform of 3 degrees to variable x:
ŷ = C + β₁x + β₂x² + β₃x³
My problem is that each term of the variable x has its own p-value. That is, there's a separate p-value for the x, x² and x³ terms.
In a multivariate regression model, where the predictor matrix X comprises n individual variables x₁, x₂, ..., xₙ, I have no way to assess the p-value of xᵢ if I applied a polynomial transform to it.
In this code, I apply a 3-degree polynomial transform to X:
import statsmodels.formula.api as smf
X = data["predictor"]
y = data["outcome"]
model = smf.ols(
    formula="outcome ~ predictor + I(predictor**2) + I(predictor**3)",
    data=data
)
results = model.fit()
print(results.pvalues)
This will yield 4 p-values: one for the constant/intercept, and one for each term of X. Can I somehow combine the p-values of X, or otherwise develop a proxy measure for them that is a single value?
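One way to get a single number is a joint F-test that compares the full polynomial model against a nested model without the predictor; a minimal sketch, assuming the same data DataFrame and column names as above:
import statsmodels.formula.api as smf

# Full model with polynomial terms vs. a restricted model without the predictor
full = smf.ols("outcome ~ predictor + I(predictor**2) + I(predictor**3)", data=data).fit()
restricted = smf.ols("outcome ~ 1", data=data).fit()

# compare_f_test returns (F statistic, p-value, difference in degrees of freedom)
f_value, p_value, df_diff = full.compare_f_test(restricted)
print(p_value)  # one p-value for all terms derived from the predictor jointly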
I have a regression of the form model = sm.GLM(y, X, w = weight).
Which ends up being a simple weighted OLS. (Note that specifying w as the error-weights array actually works in sm.GLM identically to sm.WLS, despite it not being in the documentation.)
I'm using GLM because this allows me to fit with some additional constraints using fit_constrained(). My X consists of 6 independent variables, 2 of which I want to constrain so that the resulting coefficients are positive. But I cannot seem to figure out the syntax to get fit_constrained() to work. The documentation is extremely bare and I cannot find any good examples anywhere. All I really need is the correct syntax for inputting these constraints. Thanks!
The function you mention is meant for linear constraints, that is, constraints in which a combination of your coefficients fulfills some linear equality; it is not meant for defining boundaries.
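For reference, the kind of constraint fit_constrained does accept is a linear equality; a hedged sketch with made-up data (the column names x1..x6 and the particular constraint are placeholders, not from the question):
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data with 6 named predictors
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 6)), columns=["x1", "x2", "x3", "x4", "x5", "x6"])
y = rng.normal(size=50)

model = sm.GLM(y, sm.add_constant(X))
# Equality constraints only, e.g. force two coefficients to sum to 1;
# there is no way to express "coefficient >= 0" here.
res = model.fit_constrained("x1 + x2 = 1")
print(res.params)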
The closest you can get is to use scipy's least_squares and define the bounds there. For example, we set up a dataset with 6 covariates:
from scipy.optimize import least_squares
import numpy as np
np.random.seed(100)
x = np.random.uniform(0,1,(30,6))
y = np.random.normal(0,2,30)
The function below does the matrix multiplication and returns the residual error:
def fun(b, x, y):
    return b[0] + np.matmul(x, b[1:]) - y
The first coefficient is the intercept. Let's say we require b[1] and b[6], the 1st and 6th slope coefficients, to always be nonnegative (lower bound 0):
res_lsq = least_squares(fun, [1, 1, 1, 1, 1, 1, 1], args=(x, y),
                        bounds=([-np.inf, 0, -np.inf, -np.inf, -np.inf, -np.inf, 0], +np.inf))
And we check the result:
res_lsq.x
array([-1.74342242e-01, 2.09521327e+00, -2.02132481e-01, 2.06247855e+00,
-3.65963504e+00, 6.52264332e-01, 5.33657765e-20])
I am a heavy R user and am recently learning Python.
I have a question about how statsmodels.api handles duplicated features.
In my understanding, this function is a Python version of glm in R. So I am expecting that the function returns the maximum likelihood estimates (MLE).
My question is: which algorithm does statsmodels employ to obtain the MLE?
In particular, how does the algorithm handle the situation with duplicated features?
To clarify my question, I generate a sample of size 50 from a Bernoulli distribution with a single covariate x1.
import statsmodels.api as sm
import pandas as pd
import numpy as np
def ilogit(eta):
    return 1.0 - 1.0 / (np.exp(eta) + 1)
## generate samples
Nsample = 50
cov = {}
cov["x1"] = np.random.normal(0,1,Nsample)
cov = pd.DataFrame(cov)
true_value = 0.5
resp = {}
resp["FAIL"] = np.random.binomial(1, ilogit(true_value*cov["x1"]))
resp = pd.DataFrame(resp)
resp["NOFAIL"] = 1 - resp["FAIL"]
Then fit the logistic regression as:
## fit logistic regression
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This returns a summary in which the estimated coefficient is more or less similar to the true value (0.5).
Then I create a duplicate column, namely x2, and fit the logistic regression model again. (glm in R would return NA for x2.)
cov["x2"] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This outputs a summary in which, surprisingly, the fit works and the coefficient estimates of x1 and x2 are exactly identical (0.1182 each). Since the previous fit returned a coefficient estimate of 0.2364 for x1, the estimate has been halved.
Then I increase the number of duplicated features to 9 and fit the model:
for icol in range(3, 10):
    cov["x" + str(icol)] = cov["x1"]
fit = sm.GLM(resp, cov, family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
As expected, the estimates of each duplicated variable are the same (0.0263) and they seem to be 9 times smaller than the original estimate for x1 (0.2364).
I am surprised by this unexpected behaviour of the maximum likelihood estimates. Could you explain why this is happening, and also what kind of algorithms are employed behind statsmodels.api?
The short answer:
GLM is using the Moore-Penrose generalized inverse, pinv, in this case, which corresponds to a principal component regression where components with (near-)zero singular values are dropped. "Zero" here is defined by the default threshold (rcond) in numpy.linalg.pinv.
statsmodels does not have a systematic policy towards collinearity. Some nonlinear optimization routines raise an exception when the matrix inverse fails. However, the linear regression models, OLS and WLS, use the generalized inverse by default, in which case we see the behavior as above.
The default optimization algorithm in GLM.fit is iteratively reweighted least squares (IRLS), which uses WLS and inherits the default behavior of WLS for singular design matrices.
The version in statsmodels master also has the option of using the standard scipy optimizers, where the behavior with respect to singular or near-singular design matrices will depend on the details of the optimization algorithm.
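The coefficient splitting in the question can be reproduced directly with pinv: the minimum-norm least-squares solution spreads the same total effect equally across duplicated columns. A small illustration with made-up data:
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(50, 1))
y = 0.5 * x1[:, 0] + rng.normal(size=50)

# The same column on its own vs. duplicated three times
X_single = x1
X_dup = np.hstack([x1, x1, x1])

b_single = np.linalg.pinv(X_single) @ y
b_dup = np.linalg.pinv(X_dup) @ y

print(b_single)  # one coefficient, roughly the true value
print(b_dup)     # the same total effect split equally across the three identical columns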
I have a classic linear regression problem of the form:
y = X b
where y is a response vector, X is a matrix of input variables, and b is the vector of fit parameters I am searching for.
Python provides b = numpy.linalg.lstsq( X , y ) for solving problems of this form.
However, when I use this I tend to get either extremely large or extremely small values for the components of b.
I'd like to perform the same fit, but constrain the values of b between 0 and 255.
It looks like scipy.optimize.fmin_slsqp() is an option, but I found it extremely slow for the size of problem I'm interested in (X is something like 3375 by 1500 and hopefully even larger).
Are there any other Python options for performing constrained least squares fits? Or are there Python routines for performing Lasso Regression, Ridge Regression, or some other regression method which penalizes large b coefficient values?
Recent SciPy versions include a suitable bounded least-squares solver, scipy.optimize.lsq_linear:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.lsq_linear.html#scipy.optimize.lsq_linear
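A minimal sketch with made-up data showing how the bounds from the question (0 to 255) would be passed:
import numpy as np
from scipy.optimize import lsq_linear

# Made-up data; only the bounds argument matters here
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.normal(size=200)

# Constrain every component of b to lie in [0, 255]
res = lsq_linear(X, y, bounds=(0, 255))
print(res.x.min(), res.x.max())  # all coefficients stay within the bounds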
You mention you would find Lasso Regression or Ridge Regression acceptable. These and many other constrained linear models are available in the scikit-learn package. Check out the section on generalized linear models.
Usually constraining the coefficients involves some kind of regularization parameter (C or alpha); some of the models (the ones ending in CV) can use cross-validation to automatically set these parameters. You can also further constrain models to use only positive coefficients; for example, there is an option for this on the Lasso model.
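As a sketch of what that looks like (made-up data; alpha is chosen arbitrarily here rather than by cross-validation):
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data with a few true effects, one of them negative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_coef = np.zeros(10)
true_coef[:3] = [1.0, -2.0, 0.5]
y = X @ true_coef + rng.normal(scale=0.1, size=100)

# Ridge penalizes large coefficients (L2); Lasso shrinks and sparsifies (L1)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1, positive=True).fit(X, y)  # positive=True forces coefficients >= 0

print(ridge.coef_)
print(lasso.coef_)  # non-negative; the negative effect is shrunk to 0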
scipy-optimize-leastsq-with-bound-constraints on SO gives leastsq_bounds, which is scipy leastsq plus bound constraints such as 0 <= x_i <= 255. (Scipy leastsq wraps MINPACK, one of several implementations of the widely-used Levenberg–Marquardt algorithm, a.k.a. damped least squares. There are various ways of implementing bounds; leastsq_bounds is, I think, the simplest.)
As @conradlee says, you can find Lasso and Ridge Regression implementations in the scikit-learn package. These regressors serve your purpose if you just want your fit parameters to be small or positive.
However, if you want to impose any other range as a bound for the fit parameters, you can build your own constrained Regressor with the same package. See the answer by David Dale to this question for an example.
I recently prepared some tutorials on Linear Regression in Python. Here is one of the options (Gekko) that includes constraints on the coefficients.
# Constrained Multiple Linear Regression
import numpy as np
nd = 100 # number of data sets
nc = 5 # number of inputs
x = np.random.rand(nd,nc)
y = np.random.rand(nd)
from gekko import GEKKO
m = GEKKO(remote=False); m.options.IMODE=2
c = m.Array(m.FV,nc+1)
for ci in c:
    ci.STATUS = 1
    ci.LOWER = -10
    ci.UPPER = 10
xd = m.Array(m.Param,nc)
for i in range(nc):
    xd[i].value = x[:, i]
yd = m.Param(y); yp = m.Var()
s = m.sum([c[i]*xd[i] for i in range(nc)])
m.Equation(yp==s+c[-1])
m.Minimize((yd-yp)**2)
m.solve(disp=True)
a = [c[i].value[0] for i in range(nc+1)]
print('Solve time: ' + str(m.options.SOLVETIME))
print('Coefficients: ' + str(a))
It uses the nonlinear solver IPOPT, which tends to handle this problem better than the scipy.optimize.minimize solvers. There are other constrained optimization methods in Python as well, as discussed in "Is there a high quality nonlinear programming solver for Python?".