Solving linearised least squares using statsmodels - python

I'm trying to translate a simple linearised least squares problem to statsmodels, in order to learn how to use it for iterative least squares:
The (contrived) data comprise measurements of the time it takes for a ball to drop a given distance.
distance time
10 1.430
20 2.035
30 2.460
40 2.855
Using these measurements, I want to determine the acceleration due to gravity, using:
t = sqrt(2s/g)
This is (obviously) non-linear, but I can linearise it (F(x̄ + δx) = l0 + v, where x̄ is a provisional value), then use a provisional value for g (10) to calculate F(g), and iterate if necessary:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

measurements = pd.DataFrame({
    'distance': [10, 20, 30, 40],
    'time': [1.430, 2.035, 2.460, 2.855]
})

prov_g = 10

# F evaluated at the provisional g: t = sqrt(2d/g)
measurements['fg'] = measurements['distance'].apply(
    lambda d: ((2 * d) ** 0.5) * (prov_g ** -0.5))

# design matrix (a single column): dF/dg = -sqrt(d/2) * g**-1.5
measurements['A_matrix'] = measurements['distance'].apply(
    lambda d: -np.sqrt(d / 2) * (prov_g ** -1.5))

# observed minus computed
measurements['b'] = measurements['time'] - measurements['fg']

# normal equations; with a single column, A'A is a scalar
ATA = np.dot(measurements['A_matrix'], measurements['A_matrix'].T)
ATb = np.dot(measurements['A_matrix'].T, measurements['b'])
x = np.dot(ATA ** -1, ATb)

updated_g = prov_g + x
updated_g
>>> 9.807
What I can't figure out from the examples is how I can use statsmodels to do what I've just done manually (linearising the problem, then solving it using matrix multiplication).

statsmodels is not directly of any help here, at least not yet.
I think your linearized non-linear least squares optimization is essentially what scipy.optimize.leastsq does internally. It has several more user-friendly or extended wrappers, for example scipy.optimize.curve_fit or the lmfit package.
Statsmodels currently does not have a generic version of an equivalent iterative solver.
Statsmodels uses iteratively reweighted least squares as the optimizer in several models such as GLM and RLM. However, those are model-specific implementations. In those cases statsmodels uses WLS (weighted least squares) to calculate the equivalent of your solution for the linear model when computing the next step.
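For reference, a minimal sketch of that scipy route applied to the question's data (an illustration only, not part of statsmodels): curve_fit handles the linearisation and iteration internally, so g can be fitted directly from the non-linear model.

import numpy as np
from scipy.optimize import curve_fit

# data from the question
distance = np.array([10, 20, 30, 40], dtype=float)
time = np.array([1.430, 2.035, 2.460, 2.855])

# non-linear model: t = sqrt(2s/g), with g as the only free parameter
def drop_time(s, g):
    return np.sqrt(2 * s / g)

# p0 is the provisional value used in the question
popt, pcov = curve_fit(drop_time, distance, time, p0=[10.0])
print(popt[0])              # roughly 9.8, in line with the manual iteration
print(np.sqrt(pcov[0, 0]))  # standard error of the estimate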

Related

Python and GEKKO optimisation of a blackbox function

I am using GEKKO for fitting purposes trying to optimise functions which are explicitly defined - so I have a fully functional form and can create equation objects for optimisation purposes.
But now I have a different problem.
I can't create equations because of the complex functional dependence.
But I have a Python function that calculates the output from some inputs - the optimisation parameters, plus some others that can be treated as fixed or known.
The key points: I have the experimental data and a complex model described by a Python function f1(set_of_parameters). f1 is nonlinear and nonconvex, and it can't be expressed as one simple equation - it has a lot of conditional parameters, a lot of branches, calls to other Python functions inside, etc.
So f1 can't be converted to a Gekko model equation.
I need to find the parameters - set_of_optimal_parameters - that minimise the distance between f1(set_of_optimal_parameters) and the experimental data I have.
For each parameter of a set, I have initial values and boundaries and even some constraints.
So I need to do something like this:
m = GEKKO()

# parameters I need to find
param_1 = m.FV(value=val_1, lb=lb_1, ub=rb_1)
param_2 = m.FV(value=val_2, lb=lb_2, ub=rb_2)
...
param_n = m.FV(value=val_n, lb=lb_n, ub=rb_n)

# constructing the input for the function f1()
params_dataframe = ....()  # some function that collects all the parameters and arranges them into the proper form for the input of f1()

# exp data description
x = m.Param(value=xData)
z = m.Param(value=yData)
y = m.Var()

# model description - is it possible to use another function inside an equation?
# f1 is very complex with a lot of branches and options, and I don't really want to translate it into equation form
m.Equation(y == f1(params_dataframe))

# add some constraints
min = m.Param(value=some_val_min)
m.Equation(min <= (param_1 + param_2) / (param_1 + param_2)**2)

# trying to solve and minimize the sum of squares
m.Minimize((y - z)**2)

# options for the solver
param_1.STATUS = 1
param_2.STATUS = 1
...
param_n.STATUS = 1
m.options.IMODE = 2
m.options.SOLVER = 1
m.options.MAX_ITER = 1000
m.solve(disp=1)
Is it possible to use GEKKO this way, or is it not allowed? And why?
Gekko compiles equations into byte-code and requires all equations in Gekko format so that it can overload equation operators to provide exact first and second derivatives in sparse form. Black-box functions do not provide the necessary first and second derivatives, but they can provide function evaluations for finite differences (derivative approximations) or for surrogate functions.
To answer your question directly, you can't use f1(params) in a Gekko problem. If you need an optimizer to evaluate arbitrary black box functions, an optimizer such as scipy.optimize.minimize() is a good choice.
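For illustration only (this is the scipy route, not Gekko, and the model and data below are made-up stand-ins for the question's f1 and measurements), a minimal sketch looks like this:

import numpy as np
from scipy.optimize import minimize

# hypothetical stand-in for the question's black-box f1:
# any Python code (branches, calls to other functions, ...) can go here
def f1(params, x_exp):
    a, b, c = params
    return a * np.exp(-b * x_exp) + c

# made-up experimental data
x_exp = np.linspace(0, 10, 50)
y_exp = f1([2.0, 0.5, 1.0], x_exp) + 0.05 * np.random.randn(50)

# sum-of-squares distance between model output and experimental data
def objective(params):
    return np.sum((f1(params, x_exp) - y_exp) ** 2)

x0 = [1.0, 1.0, 0.0]                 # initial values
bounds = [(0, 10), (0, 5), (-5, 5)]  # per-parameter bounds
res = minimize(objective, x0, bounds=bounds, method='L-BFGS-B')
print(res.x, res.fun)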
If you would still like to use Gekko, there are several options to build a surrogate model for f1 that has continuous first and second derivatives. The choice of surrogate model depends on the number of params:
1D: use cspline()
2D: use bspline()
3D+: use Machine learning such as Gaussian Processes, Neural Network, Linear Regression, etc.
Here is an example that creates a surrogate model for y=f(x) where f(x)=3*np.sin(x) - (x-3). This equation could be modeled directly in Gekko, but it serves as an example of creating a cspline() object that approximates the function and finds the minimum.
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt

"""
minimize y
s.t.     y = f(x)
using cubic spline with random sampling of data
"""

# function to generate data for cspline
def f(x):
    return 3*np.sin(x) - (x-3)

x_data = np.random.rand(50)*10+10
y_data = f(x_data)

c = GEKKO()
x = c.Var(value=np.random.rand(1)*10+10)
y = c.Var()
c.cspline(x, y, x_data, y_data, True)
c.Obj(y)

c.options.IMODE = 3
c.options.CSV_READ = 0
c.options.SOLVER = 3
c.solve(disp=True)

if c.options.SOLVESTATUS == 1:
    plt.figure()
    plt.scatter(x_data, y_data, 5, 'b')
    plt.scatter(x.value, y.value, 200, 'r', 'x')
else:
    print('Failed!')
    print(x_data, y_data)
    plt.figure()
    plt.scatter(x_data, y_data, 5, 'b')
plt.show()

Scipy least squares: is it possible to optimize for two error functions simultaneously?

For this particular work, I am using scipy optimize to try find the best parameters that fit two different models at the same time.
model_func_par = lambda t, total, r0, theta: np.multiply((total/3),(1+2*r0),np.exp(-t/theta))
model_func_perp = lambda t, total, r0, theta: np.multiply((total/3),(1-r0),np.exp(-t/theta))
After this I create two error functions by subtracting the raw data, and plug them into scipy.optimize.leastsq(). As you can see I have two different equations with the same r0 and theta parameters - I have to find the parameters that best fit both equations (in theory, r0 and theta should be the same for both equations, but because of noise, experimental errors etc. I am sure this won't be quite the case).
I guess I could do a separate optimization for each equation and perhaps take an average of the two results, but I wanted to see if anyone knows of a way to do one optimization for both.
Thanks in advance!
Is there any specific reason to use np.multiply? Since the typical mathematical operators are overloaded for np.ndarrays, it's more convenient to write (and read):
model1 = lambda t, total, r0, theta: (total/3) * (1+2*r0) * np.exp(-t/theta)
model2 = lambda t, total, r0, theta: (total/3) * (1-r0) * np.exp(-t/theta)
To answer your question: AFAIK this isn't possible with scipy.optimize.least_squares. However, a very simple approach is to minimize the sum of the two squared residual norms
min || model1(xdata, *coeffs) - ydata ||^2 + || model2(xdata, *coeffs) - ydata ||^2
like this:
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import norm

# your xdata and ydata as np.ndarrays
# xdata = np.array([...])
# ydata = np.array([...])

# the objective function to minimize
def obj(coeffs):
    return norm(model1(xdata, *coeffs) - ydata)**2 + norm(model2(xdata, *coeffs) - ydata)**2

# the initial point (coefficients)
coeffs0 = np.ones(3)

# res.x contains your coefficients
res = minimize(obj, x0=coeffs0)
However, note that this isn't a complete answer. A better approach would be to formulate it as a multi-objective optimization problem. You may take a look at pymoo for this purpose.

How can I solve ODEs for a set number of time steps using Python (SciPy)?

I am trying to solve a set of ODEs using SciPy. The task I am given asks me to solve the differential equation for 500 time steps. How can I achieve this using SciPy?
So far, I have tried using scipy.integrate.solve_ivp, which gives me a correct solution, but I cannot control the number of time steps that it runs for. The t_span argument lets me configure what the initial and final values of t are, but I'm not actually interested in that -- I am only interested in how many times I integrate. (For example, when I run my equations with t_span = (0, 500) the solver integrates 907 times.)
Below is a simplified example of my code:
from scipy.integrate import solve_ivp

def simple_diff(t, z):
    x, y = z
    return [1 - 2*x*y, 2*x - y]

t_range = (0, 500)
xy_init = [0, 0]

sol = solve_ivp(simple_diff, t_range, xy_init)
I am also fine with using something other than SciPy, but solutions with SciPy are preferable.
You can use the t_eval argument to solve_ivp to evaluate at particular time points:
import numpy as np
t_eval = np.arange(501)
sol = solve_ivp(simple_diff, t_range, xy_init, t_eval=t_eval)
However, note that this will not cause the solver to limit the number of integration steps - that is determined by error metrics.
If you absolutely must evaluate the function exactly 500 times to obtain 500 integration steps, you are describing Euler integration, which will be less accurate than the algorithm that solve_ivp uses.
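For illustration only (not from the original answer), here is what a fixed-step Euler scheme with exactly 500 steps would look like for the same system; it gives a known number of derivative evaluations at the cost of accuracy:

import numpy as np

def simple_diff(t, z):
    x, y = z
    return np.array([1 - 2*x*y, 2*x - y])

# fixed-step Euler: exactly n_steps derivative evaluations
def euler(f, t0, t1, z0, n_steps):
    ts = np.linspace(t0, t1, n_steps + 1)
    zs = np.empty((n_steps + 1, len(z0)))
    zs[0] = z0
    for i in range(n_steps):
        dt = ts[i + 1] - ts[i]
        zs[i + 1] = zs[i] + dt * f(ts[i], zs[i])
    return ts, zs

ts, zs = euler(simple_diff, 0.0, 5.0, np.array([0.0, 0.0]), 500)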
Looking at the solutions to your equation, it feels like you probably want to integrate only up to t=5.
Here's what the result looks like when integrating with the above settings:
And here's the result for
t_eval = np.linspace(0, 5)
t_range = (0, 5)
sol = solve_ivp(simple_diff, t_range, xy_init, t_eval=t_eval)

Is there a way to estimate Poisson interaction effect in python statsmodels?

Does statsmodels in Python have a way to estimate an interaction effect with a 95% confidence interval? This would be a linear combination of the model's parameter estimates.
Given the example below, I would like to get the effect of being in arm 'b' among people in place 'there', which requires estimating the linear combination of model parameters
Beta_arm + Delta_arm*place, including the appropriate confidence interval.
I'm aware of mod.params and mod.conf_int(), but does statsmodels have other methods for linear combinations?
import random
import pandas as pd
import statsmodels.api as sm
import patsy
import numpy as np
cases = np.array([random.randint(0,10) for i in range(200)])
arm = [random.choice(['a', 'b']) for i in range(200)]
place = [random.choice(['here', 'there']) for i in range(200)]
df = pd.DataFrame({'arm': arm, 'place': place})
exog = patsy.dmatrix('arm + place + arm * place', df, return_type='dataframe')
mod = sm.GLM(endog=cases, exog=exog, family=sm.families.Poisson()).fit()
mod.summary()
Bollen's Delta Method is frequently used to get the confidence interval for the linear combination b1 * x + b2 * x * z.
I'm not sure how and to what extent Statsmodels incorporates the Delta Method.
If you want to go down the results.get_prediction route, just make sure all the 'other covariates' (if any) are set to their sample or population means.
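Not part of the original answer, but worth noting as a sketch: statsmodels results objects expose t_test(), which returns a Wald estimate and confidence interval for an arbitrary linear combination of the parameters. Assuming the design matrix columns come out of patsy in the order Intercept, arm[T.b], place[T.there], arm[T.b]:place[T.there], and using the fitted mod from the question's code:

import numpy as np

# 1 on arm[T.b] and on arm[T.b]:place[T.there]:
# Beta_arm + Delta_arm*place
contrast = np.array([0, 1, 0, 1])

lin_comb = mod.t_test(contrast)
print(lin_comb)                       # estimate, std err, z, p-value
print(lin_comb.conf_int(alpha=0.05))  # 95% confidence interval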

Constrained Linear Regression in Python

I have a classic linear regression problem of the form:
y = X b
where y is a response vector, X is a matrix of input variables, and b is the vector of fit parameters I am searching for.
Python provides b = numpy.linalg.lstsq( X , y ) for solving problems of this form.
However, when I use this I tend to get either extremely large or extremely small values for the components of b.
I'd like to perform the same fit, but constrain the values of b between 0 and 255.
It looks like scipy.optimize.fmin_slsqp() is an option, but I found it extremely slow for the size of problem I'm interested in (X is something like 3375 by 1500 and hopefully even larger).
Are there any other Python options for performing constrained least squares fits? Or are there Python routines for performing Lasso Regression or Ridge Regression or some other regression method which penalizes large b coefficient values?
Recent scipy versions include a bounded linear least-squares solver, scipy.optimize.lsq_linear:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.lsq_linear.html#scipy.optimize.lsq_linear
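A minimal sketch of how it applies to the 0-255 bounds in the question (the random X and y below, smaller than the question's 3375 x 1500 problem, just stand in for the real data):

import numpy as np
from scipy.optimize import lsq_linear

# stand-in data; replace with the real design matrix and response
rng = np.random.default_rng(0)
X = rng.random((300, 100))
y = rng.random(300)

# every component of b is constrained to the interval [0, 255]
res = lsq_linear(X, y, bounds=(0, 255))
b = res.x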
You mention you would find Lasso Regression or Ridge Regression acceptable. These and many other constrained linear models are available in the scikit-learn package. Check out the section on generalized linear models.
Usually constraining the coefficients involves some kind of regularization parameter (C or alpha); some of the models (the ones ending in CV) can use cross-validation to set these parameters automatically. You can also further constrain models to use only positive coefficients; for example, there is an option for this on the Lasso model (see the sketch below).
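For example, a sketch of the positive-coefficient option on Lasso (random stand-in data, alpha chosen arbitrarily):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = X @ rng.random(10) + 0.01 * rng.standard_normal(200)

# alpha sets the strength of the L1 penalty;
# positive=True constrains every coefficient to be >= 0
model = Lasso(alpha=0.01, positive=True).fit(X, y)
print(model.coef_)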
The SO question scipy-optimize-leastsq-with-bound-constraints gives leastsq_bounds, which is scipy leastsq plus bound constraints such as 0 <= x_i <= 255. (Scipy leastsq wraps MINPACK, one of several implementations of the widely-used Levenberg–Marquardt algorithm, a.k.a. damped least squares. There are various ways of implementing bounds; leastsq_bounds is, I think, the simplest.)
As @conradlee says, you can find Lasso and Ridge Regression implementations in the scikit-learn package. These regressors serve your purpose if you just want your fit parameters to be small or positive.
However, if you want to impose any other range as a bound for the fit parameters, you can build your own constrained Regressor with the same package. See the answer by David Dale to this question for an example.
I recently prepared some tutorials on Linear Regression in Python. Here is one of the options (Gekko) that includes constraints on the coefficients.
# Constrained Multiple Linear Regression
import numpy as np
from gekko import GEKKO

nd = 100  # number of data sets
nc = 5    # number of inputs
x = np.random.rand(nd, nc)
y = np.random.rand(nd)

m = GEKKO(remote=False)
m.options.IMODE = 2

# coefficients (nc slopes + 1 intercept), each bounded to [-10, 10]
c = m.Array(m.FV, nc+1)
for ci in c:
    ci.STATUS = 1
    ci.LOWER = -10
    ci.UPPER = 10

xd = m.Array(m.Param, nc)
for i in range(nc):
    xd[i].value = x[:, i]

yd = m.Param(y)
yp = m.Var()

s = m.sum([c[i]*xd[i] for i in range(nc)])
m.Equation(yp == s + c[-1])
m.Minimize((yd-yp)**2)

m.solve(disp=True)

a = [c[i].value[0] for i in range(nc+1)]
print('Solve time: ' + str(m.options.SOLVETIME))
print('Coefficients: ' + str(a))
It uses the nonlinear solver IPOPT to solve the problem, which handles this better than the scipy.optimize.minimize solver. There are other constrained optimization methods in Python as well, as discussed in Is there a high quality nonlinear programming solver for Python?.
