Constraining OLS (or WLS) coefficients using statsmodels

I have a regression of the form model = sm.GLM(y, X, w = weight),
which ends up being a simple weighted OLS. (Note that specifying w as the error weights array actually works in sm.GLM identically to sm.WLS, despite it not being in the documentation.)
I'm using GLM because it lets me fit with some additional constraints using fit_constrained(). My X consists of 6 independent variables, and I want to constrain the coefficients of 2 of them to be positive. But I cannot figure out the syntax to get fit_constrained() to work. The documentation is extremely bare and I cannot find any good examples anywhere. All I really need is the correct syntax for inputting these constraints. Thanks!

fit_constrained() is meant for linear equality constraints, that is, for requiring some linear combination of your coefficients to equal a given value. It is not meant for defining bounds such as positivity.
The closest you can get is to use scipy's least_squares and define the bounds there. For example, we set up a dataset with 6 predictors:
from scipy.optimize import least_squares
import numpy as np
np.random.seed(100)
x = np.random.uniform(0,1,(30,6))
y = np.random.normal(0,2,30)
The residual function does the matrix multiplication and returns the error:
def fun(b, x, y):
    # b[0] is the intercept, b[1:] are the six slope coefficients
    return b[0] + np.matmul(x, b[1:]) - y
The first parameter is the intercept. Let's say we require the first and sixth slope coefficients (b[1] and b[6]) to be non-negative:
res_lsq = least_squares(fun, [1, 1, 1, 1, 1, 1, 1], args=(x, y),
                        bounds=([-np.inf, 0, -np.inf, -np.inf, -np.inf, -np.inf, 0], np.inf))
And we check the result:
res_lsq.x
array([-1.74342242e-01,  2.09521327e+00, -2.02132481e-01,  2.06247855e+00,
       -3.65963504e+00,  6.52264332e-01,  5.33657765e-20])
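Since the model in the question is linear and weighted, another option may be scipy.optimize.lsq_linear, which solves bounded linear least squares directly. The sketch below is only an illustration under assumptions: the data and the weights array w are made up, each row is scaled by sqrt(w) to turn the weighted problem into an ordinary one, and the bounded coefficients are the same two positions (1 and 6) as in the example above.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(100)
X = rng.uniform(0, 1, (30, 6))
y = rng.normal(0, 2, 30)
w = rng.uniform(0.5, 2.0, 30)  # hypothetical observation weights

# Design matrix with an explicit intercept column in position 0
A = np.column_stack([np.ones(len(y)), X])

# Weighted least squares: minimise sum_i w_i * (A_i @ b - y_i)**2,
# which equals ordinary least squares on rows scaled by sqrt(w_i)
sw = np.sqrt(w)

# Lower bounds: free everywhere except parameters 1 and 6, which must be >= 0
lower = np.full(A.shape[1], -np.inf)
lower[[1, 6]] = 0.0

res = lsq_linear(A * sw[:, None], y * sw, bounds=(lower, np.inf))
print(res.x)
```

Because the problem stays linear, no starting guess is needed here, unlike with least_squares.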

Related

Is there any function for calculating k and b coefficients for linear regression model with only one independent variable?

I know I can write the needed method myself, but there must be a function for this because the problem is so common. If somebody doesn't understand what I am talking about, take a look at the following formula:
{Image must be here}
For example, I have a function y = kx + b where y is the dependent variable and x is the independent one. I need to calculate k (slope) and b (intercept), and I have the formulas from the picture and everything those formulas need. Is there any function in the common data science libraries that can calculate them? I mentioned "only one independent variable" because sometimes there are multiple independent variables, which leads to multidimensional plots.
Googling gives nothing. I already use my own implementation, but I would prefer native functions from packages such as scipy, numpy, or sklearn.

I'm not sure I fully understand the question (especially, what do you mean by "one independent variable"?), so let me try to reformulate. If you have two variables, x and y, both represented by samples (x_1, ..., x_n) and (y_1, ..., y_n), and you suspect a linear relationship between them, y = a*x + b, then you can use numpy.polyfit to find the coefficients a and b. Here is an example:
import numpy as np
n = 100
x = np.linspace(0, 1, n)
y = 2*x + 0.3
a, b = np.polyfit(x, y, 1)
print(f"a={a}, b={b}")
Returns
a=2.0, b=0.30000000000000016
Hope that helps!
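If you would rather have a function that is specifically about simple linear regression, scipy.stats.linregress might be closer to what the question asks for. A minimal sketch on the same synthetic data as above (the variable name result is just my choice):

```python
import numpy as np
from scipy.stats import linregress

x = np.linspace(0, 1, 100)
y = 2 * x + 0.3

# linregress fits y = slope * x + intercept for a single predictor
result = linregress(x, y)
print(f"slope={result.slope}, intercept={result.intercept}")
```

The returned object also carries rvalue, pvalue, and stderr, which polyfit does not give you by default.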

statsmodels add_constant for OLS intercept, what is this actually doing?

Reviewing linear regressions via statsmodels OLS fit I see you have to use add_constant to add a constant '1' to all your points in the independent variable(s) before fitting. However my only understanding of intercepts in this context would be the value of y for our line when our x equals 0, so I'm not clear what purpose always just injecting a '1' here serves. What is this constant actually telling the OLS fit?

It doesn't add a constant to your values, it adds a constant term to the linear equation it is fitting. In the single-predictor case, it's the difference between fitting a line y = mx to your data vs. fitting y = mx + b.

sm.add_constant in statsmodels plays the same role as sklearn's fit_intercept parameter in LinearRegression(). If you skip sm.add_constant, or use LinearRegression(fit_intercept=False), then statsmodels and sklearn respectively assume that b = 0 in y = mx + b and fit the model with b fixed at 0, instead of estimating b from your data.
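To see what the extra column does, here is a small NumPy-only sketch (made-up data, no statsmodels needed). Prepending a column of ones to X, which is what sm.add_constant does by default, gives the solver a coefficient that multiplies 1 in every row, and that coefficient is the intercept b.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3 * x + 5  # true slope 3, true intercept 5

# Without a constant column: the model is y = m*x, forced through the origin
X_no_const = x[:, None]
(m_only,), *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a column of ones: the model is y = b*1 + m*x
X_const = np.column_stack([np.ones_like(x), x])
(b, m), *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("no constant:  m =", m_only)
print("with constant: b =", b, " m =", m)
```

Without the column of ones the slope is distorted, because the line has to pass through (0, 0) and the slope has to absorb the offset.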

Fitting a vector function with curve_fit in Scipy

I want to fit a function with vector output using Scipy's curve_fit (or something more appropriate if available). For example, consider the following function:
import numpy as np
def fmodel(x, a, b):
    return np.vstack([a*np.sin(b*x), a*x**2 - b*x, a*np.exp(b/x)])
Each component is a different function but they share the parameters I wish to fit. Ideally, I would do something like this:
x = np.linspace(1, 20, 50)
a = 0.1
b = 0.5
y = fmodel(x, a, b)
y_noisy = y + 0.2 * np.random.normal(size=y.shape)
from scipy.optimize import curve_fit
popt, pcov = curve_fit(f=fmodel, xdata=x, ydata=y_noisy, p0=[0.3, 0.1])
But curve_fit does not work with functions with vector output, and the error "Result from function call is not a proper array of floats." is thrown. What I did instead is flatten the output like this:
def fmodel_flat(x, a, b):
    # x is the original array tiled 3 times; evaluate on the first third only
    return fmodel(x[0:len(x)//3], a, b).flatten()
popt, pcov = curve_fit(f=fmodel_flat, xdata=np.tile(x, 3),
                       ydata=y_noisy.flatten(), p0=[0.3, 0.1])
and this works. If instead of a vector function I am actually fitting several functions with different inputs as well but which share model parameters, I can concatenate both input and output.
Is there a more appropriate way to fit vector function with Scipy or perhaps some additional module? A main consideration for me is efficiency - the actual functions to fit are much more complex and fitting can take some time, so if this use of curve_fit is mangled and is leading to excessive runtimes I would like to know what I should do instead.

If I can be so blunt as to recommend my own package symfit, I think it does precisely what you need. An example on fitting with shared parameters can be found in the docs.
Your specific problem stated above would become:
import numpy as np
from symfit import variables, parameters, Model, Fit, sin, exp
x, y_1, y_2, y_3 = variables('x, y_1, y_2, y_3')
a, b = parameters('a, b')
a.value = 0.3
b.value = 0.1
model = Model({
    y_1: a * sin(b * x),
    y_2: a * x**2 - b * x,
    y_3: a * exp(b / x),
})
xdata = np.linspace(1, 20, 50)
ydata = model(x=xdata, a=0.1, b=0.5)
y_noisy = ydata + 0.2 * np.random.normal(size=(len(model), len(xdata)))
fit = Fit(model, x=xdata, y_1=y_noisy[0], y_2=y_noisy[1], y_3=y_noisy[2])
fit_result = fit.execute()
Check out the docs for more!

I think what you're doing is perfectly fine from an efficiency standpoint. I'll try to look at the implementation and come up with something more quantitative, but for the time being here is my reasoning.
What you're doing during curve fitting is optimizing the parameters (a,b) such that
res = sum_i |f(x_i; a,b)-y_i|^2
is minimal. By this I mean that you have data points (x_i,y_i) of arbitrary dimensionality, two parameters (a,b) and a fitting model that approximates the data at query points x_i.
The curve fitting algorithm starts from a starting (a,b) pair, puts this into a black box that computes the above square error, and tries to come up with a new (a',b') pair that produces a smaller error. My point is that the error above is really a black box for the fitting algorithm: the configurational space of the fitting is defined merely by the (a,b) parameters. If you imagine how you'd implement a simple curve fitting function, you could imagine that you try to do, say, a gradient descent, with the square error as cost function.
Now, it should be irrelevant to the fitting procedure how the black box computes the error. It's easy to see that the dimensionality of x_i is really irrelevant for scalar functions, since it doesn't matter if you have 1000 1d query points to fit for, or a 10x10x10 grid in 3d space. What matters is that you have 1000 points x_i for which you need to compute f(x_i) ~ y_i from the model.
The only subtlety that should further be noted is that in case of a vector-valued function, the calculation of the error is not trivial. In my opinion, it's fine to define the error at each x_i point using the 2-norm of the vector-valued function. But hey: in this case, the square error at point x_i is
|f(x_i; a,b)-y_i|^2 == sum_k (f(x_i; a,b)[k]-y_i[k])^2
which implies that the square error for each component is accumulated. This just means that what you're doing right now is just right: by replicating your x_i points and taking into account each component of the function individually, your square error will contain exactly the 2-norm of the error at each point.
So my point is that what you're doing is mathematically correct, and I don't expect the behaviour of the fitting procedure to depend on how multivariate/vector-valued functions are handled.
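To make the argument concrete, here is a sketch that skips the tiling of x and hands the flattened residual vector straight to scipy.optimize.least_squares. It reuses fmodel and the starting values from the question; the residuals helper, the seed, and the use of least_squares instead of curve_fit are my own choices for illustration, not something the question prescribes.

```python
import numpy as np
from scipy.optimize import least_squares

def fmodel(x, a, b):
    # Three component functions sharing the parameters a and b
    return np.vstack([a*np.sin(b*x), a*x**2 - b*x, a*np.exp(b/x)])

def residuals(p, x, y):
    # Flatten the (3, n) residual matrix into one long vector; the sum of
    # its squares is the accumulated squared 2-norm over all points
    return (fmodel(x, *p) - y).ravel()

rng = np.random.default_rng(0)
x = np.linspace(1, 20, 50)
y_noisy = fmodel(x, 0.1, 0.5) + 0.2 * rng.normal(size=(3, x.size))

res = least_squares(residuals, x0=[0.3, 0.1], args=(x, y_noisy))
print(res.x)
```

The optimizer only ever sees a vector of 150 residuals and two parameters, which is the same problem the flattened curve_fit call solves.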

Python LeastSquares plot

I have to draw a plot using the least squares method in Python 3. I have lists of x and y values:
y = [186,273,308,484]
x = [2.25,2.34,2.47,2.56]
There are many more values for x and y; this is only an excerpt. I know that f(x) = y should be a linear function. I can get the coefficients "a" and "b" of this function by calculating:
delta_x = x[-1] - x[0] and delta_y = y[-1] - y[0]
and so on, using the tangent. I know how to do that.
But there are also uncertainties in y, about 2 percent of y. So I have a y_errors list, which contains all the uncertainties of y.
How can I draw the least squares fit now?
Of course I have used Google and I saw docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#least-square-fitting-leastsq, but there are some problems.
I tried to adapt the example from scipy.org to my own purpose, so I edited the x, y, and y_meas variables, putting in my own lists. But I don't know what the p0 variable in this example is, or what else I must edit to make my example work.
Of course I can also edit the residuals function. It must take only one variable, y_true. In addition to this, I don't understand the arguments of the leastsq function.
Sorry for my English and for asking such a newbie question, but I don't understand this method. Thank you in advance.

I believe you are trying to fit a set of {x, y} values (and possibly sigma_y, the uncertainties in y) to a linear expression. This is known as linear regression, and for linear regression (or indeed, for regression with any polynomial) you can use numpy's polyfit. The uncertainties can be used for the weights:
weight = 1/sigma_y
where sigma_y is the standard deviation in y.
The least-squares routines in scipy.optimize allow you to fit a non-linear function to data, but you have to write the function that computes the "residual" (data - model) in terms of variables that are to be adjusted in order to minimize the calculated residual.
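As a rough sketch of the polyfit route with the four points from the question, assuming the uncertainties really are 2 percent of y (the names sigma_y and y_fit are mine):

```python
import numpy as np

x = np.array([2.25, 2.34, 2.47, 2.56])
y = np.array([186.0, 273.0, 308.0, 484.0])
sigma_y = 0.02 * y  # assumed: 2 percent uncertainty on each y value

# polyfit minimises sum((w_i * (y_i - a*x_i - b))**2), so w = 1/sigma
a, b = np.polyfit(x, y, 1, w=1 / sigma_y)
print(f"a={a}, b={b}")

# Values of the fitted line, e.g. for plotting next to the data points
y_fit = a * x + b
```

From there, plotting the data with error bars and the line y_fit on top is an ordinary matplotlib job.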

Standard error in non-linear regression

I have been doing some Monte Carlo physics simulations with Python and I am unable to determine the standard error for the coefficients of a non-linear least squares fit.
Initially, I was using SciPy's scipy.stats.linregress for my model since I thought it would be a linear model, but I noticed it is actually some sort of power function. I then used NumPy's polyfit with a degree of 2, but I can't find any way to determine the standard error of the coefficients.
I know gnuplot can determine the errors for me, but I need to do fits for over 30 different cases. Does anyone know of a way for Python to read the standard error from gnuplot, or is there some other library I can use?

Finally found the answer to this long-asked question! I'm hoping this can at least save someone a few hours of hopeless research on this topic. Scipy has a function called curve_fit in its optimize module. It uses the least-squares method to determine the coefficients and, best of all, it gives you the covariance matrix. The diagonal of that matrix contains the variance of each coefficient, so by taking the square root of those values, the standard error of each coefficient can be determined! Scipy doesn't have much documentation for this, so here's some sample code for a better understanding:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plot
def func(x, a, b, c):
    return a*x**2 + b*x + c  # Refer [1]
x = np.linspace(0,4,50)
y = func(x,2.6,2,3) + 4*np.random.normal(size=len(x)) #Refer [2]
coeff, var_matrix = curve_fit(func,x,y)
variance = np.diagonal(var_matrix) #Refer [3]
SE = np.sqrt(variance) #Refer [4]
# ====== Making a dictionary to print results ======
results = {'a': [coeff[0], SE[0]], 'b': [coeff[1], SE[1]], 'c': [coeff[2], SE[2]]}
print("Coeff\tValue\t\tError")
for v, c in results.items():
    print(v, "\t", c[0], "\t", c[1])
#========End Results Printing=================
y2 = func(x,coeff[0],coeff[1],coeff[2]) #Saves the y values for the fitted model
plot.plot(x,y)
plot.plot(x,y2)
plot.show()
[1] What this function returns is critical because it defines what will be used to fit the model.
[2] Uses the function to create some arbitrary data plus some noise.
[3] Saves the diagonal of the covariance matrix to a 1D array.
[4] Takes the square root of the variance to get the standard error (SE).
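For a purely polynomial model like the one above, np.polyfit can also return a covariance matrix when called with cov=True, so the same square-root-of-the-diagonal step applies. This is only a sketch on made-up data with an arbitrary seed. Note that the way the covariance is scaled has differed between NumPy versions, so the numbers may not match curve_fit exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 4, 50)
true_coeffs = np.array([2.6, 2.0, 3.0])
y = np.polyval(true_coeffs, x) + 4 * rng.normal(size=x.size)

# cov=True makes polyfit return the coefficients and their covariance matrix
p, cov = np.polyfit(x, y, 2, cov=True)

# Standard error of each coefficient: square root of the diagonal
se = np.sqrt(np.diag(cov))

for name, value, err in zip("abc", p, se):
    print(f"{name}\t{value}\t{err}")
```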

It looks like gnuplot uses Levenberg-Marquardt, and there's a Python implementation available: you can get the error estimates from the mpfit.covar attribute. (Incidentally, you should think about what the error estimates "mean": are other parameters allowed to adjust to compensate, for example?)
