Fitting a sum to data in Python

Given that the fitting function is of the form

f(x) = sum_i r_i * (s_i * x)**2 / (1 + (s_i * x)**2)

I intend to fit such a function to the experimental data (x, y = f(x)) that I have. But then I have some doubts:
How do I define my fitting function when there's a summation involved?
Once the function is defined, i.e. def func(...): return ..., is it still possible to use curve_fit from scipy.optimize? There is now a whole set of parameters s_i and r_i involved, compared to the usual fitting cases with a few scalar parameters.
Finally, are such cases treated completely differently?
I feel a bit lost here; thanks for any help.

This is very well within reach of scipy.optimize.curve_fit (or just scipy.optimize.leastsq). The fact that a sum is involved does not matter at all, nor does it matter that you have arrays of parameters. The only thing to note is that curve_fit wants to give your fit function the parameters as individual arguments, while leastsq gives it a single vector.
Here's a solution:
import numpy as np
from scipy.optimize import curve_fit, leastsq

def f(x, r, s):
    """The fit function, applied to every x_k for the vectors r_i and s_i."""
    x = x[..., np.newaxis]  # add an axis for the summation
    # by virtue of numpy's fantastic broadcasting rules,
    # the following will be evaluated for every combination of k and i.
    x2s2 = (x * s)**2
    return np.sum(r * x2s2 / (1 + x2s2), axis=-1)

# fit using curve_fit
popt, pcov = curve_fit(
    lambda x, *params: f(x, params[:N], params[N:]),
    X, Y,
    np.r_[R0, S0],
)
R = popt[:N]
S = popt[N:]

# fit using leastsq
popt, ier = leastsq(
    lambda params: f(X, params[:N], params[N:]) - Y,
    np.r_[R0, S0],
)
R = popt[:N]
S = popt[N:]
A few things to note:
To start, we need the 1d arrays X and Y of measurements to fit to, the 1d arrays R0 and S0 as initial guesses, and N, the length of those two arrays.
I separated the implementation of the actual model f from the objective functions supplied to the fitters. Those I implemented using lambda functions. Of course, one could also have ordinary def ... functions and combine them into one.
The model function f uses numpy's broadcasting to simultaneously sum over a set of parameters (along the last axis) and evaluate in parallel for many x (along any axes before the last; both fit functions would complain if x has more than one dimension, so .ravel() helps there).
We concatenate the fit parameters R and S into a single parameter vector using numpy's shorthand np.r_[R,S].
curve_fit supplies every single parameter as a distinct parameter to the objective function. We want them as a vector, so we use *params: It catches all remaining parameters in a single list.
leastsq gives a single params vector. However, it neither supplies x, nor does it compare it to y. Those are directly bound into the objective function.
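For completeness, here is a minimal sketch of how the pieces above fit together; the "true" parameter values, the noise level and the grid of x values are made up purely to generate test data:

# hypothetical setup to exercise the fit above
r_true = np.array([1.0, 2.0, 0.5])
s_true = np.array([0.3, 1.0, 3.0])
N = len(r_true)

X = np.linspace(0.1, 10.0, 200)
Y = f(X, r_true, s_true) + np.random.normal(scale=0.01, size=X.shape)

R0 = np.ones(N)   # crude initial guesses
S0 = np.ones(N)

popt, pcov = curve_fit(
    lambda x, *params: f(x, params[:N], params[N:]),
    X, Y,
    np.r_[R0, S0],
)
print("fitted r:", popt[:N])
print("fitted s:", popt[N:])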

In order to use scipy.optimize.leastsq to estimate multiple parameters, you need to pack them into an array and unpack them inside your function. You can then do anything you want with them. For example, if your s_i are the first 3 and your r_i are the next three parameters in your array p, you would just set ssum=p[:3].sum() and rsum=p[3:6].sum(). But again, your parameters are not identified (according to your comment), so estimation is pointless.
For an example of using leastsq, see the Cookbook's Fitting Data example.
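Not taken from the Cookbook, but as a minimal sketch of that pack/unpack pattern with leastsq, using the same model as in the question (X and Y are the measured data arrays, and the s_i are packed before the r_i, as described above):

import numpy as np
from scipy.optimize import leastsq

def residuals(p, x, y):
    # unpack the single parameter vector into its two groups
    s, r = p[:3], p[3:6]
    x2s2 = (x[:, np.newaxis] * s)**2
    model = np.sum(r * x2s2 / (1 + x2s2), axis=-1)
    return model - y

p0 = np.ones(6)   # packed initial guess: [s1, s2, s3, r1, r2, r3]
popt, ier = leastsq(residuals, p0, args=(X, Y))
s_fit, r_fit = popt[:3], popt[3:6]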


Curve_fit for a function that returns a numpy array

I know scipy's curve_fit library and its power for fitting curves. I have read many examples here and in the documentation, but I cannot solve my problem.
For example, I have 10 files (chemical structures, but that does not matter) and ten experimental energy values. I have a function inside a class that calculates the theoretical energy of each structure for some parameters and returns a numpy array with the theoretical energy values.
I want to find the best parameters so that the theoretical values are nearest to the experimental ones. I will give a minimal example of my code here.
This is the class function that reads the experimental energy files, extracts the correct substring and returns the values as a numpy array. The self.path is just the directory and self.nPoints = 10. It is not so important, but I include it for the sake of completeness:
def experimentalValues(self):
    os.chdir(self.path)
    energy = np.zeros(self.nPoints)
    for i in range(1, self.nPoints):
        f = open("p_" + str(i + 1) + ".xyz", "r")
        energy[i] = float(f.readlines()[1].split()[1])
        f.close()
    os.chdir('..')
    return energy
I calculate the theoretical value with this class function, which takes two numpy arrays as arguments, let's say
sigma = np.full(nSubstrate, 2.)
epsilon = np.full(nSubstrate, 0.15)
where nSubstrate = 9
Here is the class function. It reads the files and uses two nested loops to calculate the theoretical value for each file, returning the results as a numpy array.
def theoreticalEnergy(self, epsilon, sigma):
    os.chdir(self.path)
    cE = np.zeros(self.nPoints)
    for n in range(0, self.nPoints):
        filenameXYZ = "p_" + str(n + 1) + "_extended.xyz"
        allCoordinates = np.loadtxt(filenameXYZ, skiprows = 0, usecols = (1, 2, 3))
        substrate = allCoordinates[0:self.nSubstrate]
        surface = allCoordinates[self.nSubstrate:]
        for i in range(0, substrate.shape[0]):
            positionAtomI = np.array(substrate[i][:])
            for j in range(0, surface.shape[0]):
                positionAtomJ = np.array(surface[j][:])
                distanceIJ = self.distance(positionAtomI, positionAtomJ)
                cE[n] += self.LennardJones(distanceIJ, epsilon[i], sigma[i])
    os.chdir('..')
    return cE
Again, for the sake of completeness the Lennard Jones class function is defined as
def LennardJones(self, distance, epsilon, sigma):
    repulsive = (sigma/distance) ** 12.
    attractive = (sigma/distance) ** 6.
    potential = 4. * epsilon * (repulsive - attractive)
    return potential
where in this case all the arguments are scalars, as is the return value.
To conclude the problem presentation I have 3 ingredients:
a numpy array with the experimental data
two numpy arrays with a guess for the parameters sigma and epsilon
a function that takes those parameters and returns a numpy array with the values to be fitted.
How can I solve this problem like the approach described in the documentation https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html?
Curve fitting
curve_fit fits a function f(w, x[i]) to points y[i] by finding w that minimizes sum((f(w, x[i]) - y[i])**2 for i in range(n)). As you will read in the first line after the function definition,
[It uses] non-linear least squares to fit a function, f, to data.
It refers to least_squares where it states
Given the residuals f(x) (an m-D real function of n real variables) and the loss function rho(s) (a scalar function), least_squares finds a local minimum of the cost function F(x):
minimize F(x) = 0.5 * sum(rho(f_i(x)**2), i = 0, ..., m - 1)
subject to lb <= x <= ub
Curve fitting is a kind of convex-cost multi-objective optimization. Since each individual cost is convex, you can add all of them and the result will still be a convex function. Notice that the decision variables (the parameters to be optimized) are the same at every point.
Your problem
In my understanding, for each energy level you have a different set of parameters. If you write it as a curve-fitting problem, the objective function could be expressed as sum((f(w[i], x[i]) - y[i])**2 ...), where y[i] is determined by the energy level. Since each of the terms in the sum is independent of the other terms, this is equivalent to finding each group of parameters w[i] separately by minimizing (f(w[i], x[i]) - y[i])**2.
Convexity
Convexity is a very convenient property for optimization because it ensures that you will have only one minimum in the parameter space. I am not doing a detailed analysis but have reasonable doubts about the convexity of your energy function.
The Lennard-Jones function is the difference of a repulsive and an attractive term, both with negative even exponents of the distance; this alone is very unlikely to be convex.
The sum of multiple local functions centered at different positions has no defined convexity.
Molecular energy, or crystal energy, or protein folding are well known to be non-convex.
A few days ago (on a bike ride) I was thinking about this, how the molecules will be configured in a global minimum energy, and I was wondering if it finds that configuration so rapidly because of quantum tunneling effects.
Non-convex optimization
Non-convex (global) optimization is different from (non-linear) least squares in the sense that when a local minimum is found the process doesn't return immediately; it starts making new attempts in different regions of the search space. If the function is smooth you can still take advantage of a gradient-based local optimization method, but the complexity is still NP.
A classic global optimization method is simulated annealing; if you have a chemistry background I think you will have some insights reading about it. Once upon a time, simulated annealing was provided in scipy.optimize.
You will find a few global optimization methods in scipy.optimize. I would encourage you to try basin hopping, since it has been successfully applied to similar problems, as you can read in the references.
I hope this drops you onto the right path to your solution. But be aware that you will probably need to spend some time learning how to use these functions, and you will need to make some decisions to balance accuracy, simplicity and efficiency.
If you want a better solution, take the time to derive the gradient of the cost function (you can return two values, f and df, where df is the gradient of f with respect to the decision variables).
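Not part of the answer above, but a rough sketch of how basin hopping could be wired to the sum-of-squares cost in this question. Here 'system' (an instance of the asker's class), the packing of epsilon and sigma into one flat vector, and the initial guesses are all assumptions for illustration:

import numpy as np
from scipy.optimize import basinhopping

y_exp = system.experimentalValues()   # the experimental energies

def obj(p):
    # unpack the flat parameter vector into epsilon and sigma
    epsilon, sigma = p[:system.nSubstrate], p[system.nSubstrate:]
    residual = system.theoreticalEnergy(epsilon, sigma) - y_exp
    return np.sum(residual**2)        # sum-of-squares cost

x0 = np.r_[np.full(system.nSubstrate, 0.15),   # initial guess for epsilon
           np.full(system.nSubstrate, 2.0)]    # initial guess for sigma

result = basinhopping(obj, x0, niter=100)
eps_fit = result.x[:system.nSubstrate]
sig_fit = result.x[system.nSubstrate:]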

Constraining OLS (or WLS) coeffecients using statsmodels

I have a regression of the form model = sm.GLM(y, X, w = weight), which ends up being a simple weighted OLS. (Note that specifying w as the error-weights array actually works in sm.GLM identically to sm.WLS, despite it not being in the documentation.)
I'm using GLM because this allows me to fit with some additional constraints using fit_constrained(). My X consists of 6 independent variables, 2 of which I want to constrain to have positive coefficients. But I cannot seem to figure out the syntax to get fit_constrained() to work. The documentation is extremely bare and I cannot find any good examples anywhere. All I really need is the correct syntax for imposing these constraints. Thanks!
fit_constrained() is meant for linear equality constraints, i.e. requiring that some linear combination of your coefficients equals a given value; it is not meant for defining bounds.
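For illustration only, a sketch of what fit_constrained() is for (assuming a plain numpy design matrix, whose columns get the default names x1 ... x6), which is not the kind of bound you want here:

import numpy as np
import statsmodels.api as sm

np.random.seed(100)
X = np.random.uniform(0, 1, (30, 6))   # columns get default names x1 .. x6
y = np.random.normal(0, 2, 30)

model = sm.GLM(y, X)
# a linear equality constraint: force the 2nd and 6th coefficients to be equal
res = model.fit_constrained("x2 - x6 = 0")
print(res.params)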
The closest you can get is to use scipy's least_squares and define the bounds. For example, we set up a dataset with 6 coefficients:
from scipy.optimize import least_squares
import numpy as np
np.random.seed(100)
x = np.random.uniform(0,1,(30,6))
y = np.random.normal(0,2,30)
The function basically does a matrix multiplication and returns the error:
def fun(b, x, y):
    return b[0] + np.matmul(x, b[1:]) - y
The first coefficient is the intercept. Let's say we require the 2nd and 6th to be always positive:
res_lsq = least_squares(fun, [1, 1, 1, 1, 1, 1, 1], args=(x, y),
                        bounds=([-np.inf, 0, -np.inf, -np.inf, -np.inf, -np.inf, 0], +np.inf))
And we check the result:
res_lsq.x
array([-1.74342242e-01, 2.09521327e+00, -2.02132481e-01, 2.06247855e+00,
-3.65963504e+00, 6.52264332e-01, 5.33657765e-20])

Output all guesses from scipy.optimize.leastsq()

I'm hoping to make an animation about how the least-squares regression analysis provided by scipy.optimize.leastsq() converges on a specific result. Is there any way to get the function to, say, append a tuple of guess values to a list for each iteration until the function converges to the local minimum? Or, is there a different library which includes this feature?
Below is what I have:
# initial guess for gaussian distributions to optimize [height, position, width].
# if more than 2 distributions are required, add a new set of [h,p,w] initial parameters
# to 'initials' for each new distribution.
# new parameters should be of the same format for consistency; i.e. [h,p,w],[h,p,w],... etc.
# a 'w' guess of 1 is typically a sufficient estimate.
initials = [6.5, 13, 1], [4.5, 19, 1]

# determines the number of gaussian functions to compute from the initial guesses
n = len(initials)

# formats initials into a 1D array
var = np.concatenate(initials)

# data matrix
M = np.array(master)

# defines a typical gaussian function, of independent variable x,
# amplitude a, position b, and width parameter c.
def gaussian(x, a, b, c):
    return a * np.exp((-(x - b)**2.0) / c**2.0)

# defines the expected resultant as a sum of intrinsic gaussian functions
def GaussSum(x, p):
    return sum(gaussian(x, p[3*k], p[3*k+1], p[3*k+2]) for k in range(n))

# defines the condition of minimization, reducing the square of the difference
# between the data (y) and the function 'func(x,p)'
def residuals(p, y, x):
    return (y - GaussSum(x, p))**2

# executes least-squares regression analysis to optimize initial parameters
cnsts = leastsq(residuals, var, args=(M[:,1], M[:,0]))[0]
What I'm eventually hoping for is for 'cnsts' to be a list of tuples of every guess, from the initial guess to the final guess.
If I'm understanding your question correctly, you want to make a guess at each of the different coefficients while fitting a linear regression line, then have a list of all the coefficients that have been guessed? Similar to how a NN will back-propagate the error to better fit a model?
Linear regression isn't guessing the different coefficients. It's just calculating them... https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/regression-analysis/find-a-linear-regression-equation/#FindaLinear
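Not part of the answer above, but since the question uses leastsq (an iterative non-linear solver) rather than a closed-form linear regression, one common workaround is to record the parameter vector inside the residual function, which leastsq calls at every function evaluation (so the log also includes the evaluations used for the finite-difference Jacobian). A minimal sketch based on the question's code:

history = []

def residuals_logged(p, y, x):
    history.append(tuple(p))   # record every parameter vector tried
    return (y - GaussSum(x, p))**2

cnsts = leastsq(residuals_logged, var, args=(M[:,1], M[:,0]))[0]
# 'history' now holds every guess from the initial parameters to the final ones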

Using a vector of constraints to a scipy.optimize function

I want to do constrained optimisation with a vector of constraints using the scipy.optimize library. In particular, I am supplying a vector of 3d coordinates r0 of N points -- hence a matrix of size N x 3 -- as input to the function. The coordinates are Cartesian, and I wish to freeze out all y dependence. That means I need the second column of my N x 3 matrix to be held to a constant, y0 say. How do I go about defining such a list of constraints?
To be concrete, let's the consider the COBYLA algorithm (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_cobyla.html#scipy.optimize.fmin_cobyla). I tried the following construction:
cons = []
for i in range(xyz0.shape[0]):
    def f(x):
        return x[i,1] - xyz0cyl[i,1]
    cons.append(f)

fmin_cobyla(energy, xyz0, cons, rhoend=1e-7)
and got the error:
     41 for i in range(xyz0.shape[0]):
     42     def f(x):
---> 43         return x[i,1]-xyz0cyl[i,1]
     44     cons.append(f)
     45
IndexError: too many indices for array
What is going on?
Your approach is wrong in quite a number of ways.
First, fmin_cobyla (like the other scipy optimizers) works on a flat 1-D parameter vector, so your N x 3 array is flattened before it is passed to the constraint functions, leaving you with an array of only one dimension. Therefore you can't index it with a pair of indices unless you reshape the array inside the constraint functions back to the original N x 3, which could be pretty expensive for large N:
return x.reshape(-1, 3)[i,1] - xyz0cyl[i,1]
Secondly, closures in Python are late-binding: all of the constraint functions will use the last value of i after the for loop has completed. You'll only find this out later, after fixing the first bug, when the optimisation does not go as expected. See How do lexical closures work? to learn more. The usual fix is sketched below.
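A minimal sketch of that fix, binding the current i through a default argument (and including the reshape from the first point):

cons = []
for i in range(xyz0.shape[0]):
    def f(x, i=i):   # bind the current value of i at definition time
        return x.reshape(-1, 3)[i, 1] - xyz0cyl[i, 1]
    cons.append(f)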
A better approach is to actually make the y coordinates (column index 1) stationary in your energy function, or to simply pass an N x 2 matrix to fmin_cobyla instead, as sketched below.
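A rough sketch of that second suggestion; energy and xyz0 are from the question, while the wrapper that reinserts the frozen y column is an assumption about how one might set it up:

import numpy as np
from scipy.optimize import fmin_cobyla

y0 = xyz0[:, 1].copy()   # the frozen y coordinates

def energy_xz(xz_flat):
    # rebuild the full N x 3 coordinates from the free x and z columns
    xz = xz_flat.reshape(-1, 2)
    xyz = np.column_stack([xz[:, 0], y0, xz[:, 1]])
    return energy(xyz)

# optimise only over the x and z columns; the trivially satisfied dummy
# constraint is there only because fmin_cobyla requires a cons argument
xz_opt = fmin_cobyla(energy_xz, xyz0[:, [0, 2]].ravel(),
                     cons=[lambda p: 1.0], rhoend=1e-7)
xz_opt = xz_opt.reshape(-1, 2)
xyz_opt = np.column_stack([xz_opt[:, 0], y0, xz_opt[:, 1]])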

Evaluate SmoothBivariateSpline for two 1d array lists

I have three arrays x, y, z. I wanted to smooth the z data, so I used the SmoothBivariateSpline function. But when I evaluate the result, I get completely different values compared to my original z data. Below is my code:
def envinterpolate(x, y, z):
    x_interp = np.linspace(min(x), max(x), len(x)*4)
    y_interp = np.linspace(min(y), max(y), len(x)*4)
    sbsp = SmoothBivariateSpline(x, y, z)
    z_interp = sbsp.ev(x_interp, y_interp)
    return z_interp
Is there anything wrong in my code while evaluating the values of spline?
Attaching the plot after trying the s=0 parameter (red line: my actual z data, black line: the z_interp data).
By convention, "smoothing" refers specifically to cases where you don't want the interpolant to pass exactly through your input data points (for example if you know that your input data is noisy).
SmoothBivariateSpline takes a parameter s that controls the degree of smoothing that is applied to the interpolant:
s : float, optional
    Positive smoothing factor defined for estimation condition:
    sum((w[i]*(z[i]-s(x[i], y[i])))**2, axis=0) <= s
    Default s=len(w) which should be a good value if 1/w[i] is an estimate of the standard deviation of z[i].
If you don't want any smoothing you could simply set s=0.
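For instance, a minimal change to the function in the question (only the s argument is added):

sbsp = SmoothBivariateSpline(x, y, z, s=0)   # interpolating spline, no smoothing
z_interp = sbsp.ev(x_interp, y_interp)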
