Fitting curve: why small numbers are better?

I have spent some time on this problem over the past few days. I have a set of data:
y = f(t), where y is a very small concentration (on the order of 10^-7) and t is in seconds. t varies from 0 to around 12000.
The measurements follow an established model:
y = Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)
And I need to find Vs, Vi, and k. So I used curve_fit, which returns the best fitting parameters, and I plotted the curve.
And then I used a similar model:
y = (Vs * t/3600 - ((Vs - Vi) * (1 - np.exp(-k * t/3600)) / k)) * 10**7
By doing that, t is in hours, and y is a number between 0 and about 10. The returned parameters are of course different. But when I plot each curve, here is what I get:
http://i.imgur.com/XLa4LtL.png
The green curve is the fit with the first model, the blue one the fit with the "normalized" model. The red dots are the experimental values.
The fitted curves are different. I did not expect that, and I don't understand why. Are the calculations more accurate if the numbers are "reasonable"?

The docstring for optimize.curve_fit says,
p0 : None, scalar, or M-length sequence
Initial guess for the parameters. If None, then the initial
values will all be 1 (if the number of parameters for the function
can be determined using introspection, otherwise a ValueError
is raised).
Thus, to begin with, the initial guess for the parameters is by default 1.
Moreover, curve fitting algorithms have to sample the function for various values of the parameters. The "various values" are initially chosen with an initial step size on the order of 1. The algorithm will work better if your data varies somewhat smoothly with changes in the parameter values that are on the order of 1.
If the function varies wildly with parameter changes on the order of 1, then the algorithm may tend to miss the optimum parameter values.
Note that even if the algorithm uses an adaptive step size when it tweaks the parameter values, if the initial tweak is so far off the mark as to produce a big residual, and if tweaking in some other direction happens to produce a smaller residual, then the algorithm may wander off in the wrong direction and miss the local minimum. It may find some other (undesired) local minimum, or simply fail to converge. So using an algorithm with an adaptive step size won't necessarily save you.
The moral of the story is that scaling your data can improve the algorithm's chances of finding the desired minimum.
Numerical algorithms in general all tend to work better when applied to data whose magnitude is on the order of 1. This bias enters into the algorithm in numerous ways. For instance, optimize.curve_fit relies on optimize.leastsq, and the call signature for optimize.leastsq is:
def leastsq(func, x0, args=(), Dfun=None, full_output=0,
col_deriv=0, ftol=1.49012e-8, xtol=1.49012e-8,
gtol=0.0, maxfev=0, epsfcn=None, factor=100, diag=None):
Thus, by default, the tolerances ftol and xtol are on the order of 1e-8. If finding the optimum parameter values requires much smaller tolerances, then these hard-coded default numbers will cause optimize.curve_fit to miss the optimal parameter values.
To make this more concrete, suppose you were trying to minimize f(x) = 1e-100*x**2. The factor of 1e-100 squashes the y-values so much that a wide range of x-values (the parameter values mentioned above) will fit within the tolerance of 1e-8. So, with un-ideal scaling, leastsq will not do a good job of finding the minimum.
Another reason to use floats on the order of 1 is because there are many more (IEEE754) floats in the interval [-1,1] than there are far away from 1. For example,
import struct
def floats_between(x, y):
    """
    http://stackoverflow.com/a/3587987/190597 (jsbueno)
    """
    a = struct.pack("<dd", x, y)
    b = struct.unpack("<qq", a)
    return b[1] - b[0]
In [26]: floats_between(0,1) / float(floats_between(1e6,1e7))
Out[26]: 311.4397707054894
This shows there are over 300 times as many floats representing numbers between 0 and 1 as there are in the interval [1e6, 1e7].
Thus, all else being equal, you'll typically get a more accurate answer if working with small numbers than very large numbers.
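Another way to look at the same effect is the gap between neighbouring doubles. The sketch below uses np.spacing from NumPy (my choice for illustration, not part of the original answer). It suggests that the absolute gap grows with magnitude, while the relative gap stays near 2e-16, so the benefit of small numbers is mostly about absolute resolution.

```python
import numpy as np

# Gap between a double and the next representable double above it.
gap_near_1 = np.spacing(1.0)      # 2**-52
gap_near_1e7 = np.spacing(1e7)    # 2**-29, because 2**23 <= 1e7 < 2**24

# Absolute resolution is much coarser far away from 1 ...
ratio = gap_near_1e7 / gap_near_1

# ... while the relative resolution is of similar size in both places.
rel_near_1 = gap_near_1 / 1.0
rel_near_1e7 = gap_near_1e7 / 1e7

print(gap_near_1, gap_near_1e7, ratio)
print(rel_near_1, rel_near_1e7)
```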

I would imagine it has more to do with the initial parameter estimates you are passing to curve_fit. If you are not passing any, I believe they all default to 1. Normalizing your data makes those initial estimates closer to the truth. If you don't want to use normalized data, just pass the initial estimates yourself and give them reasonable values.
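A minimal sketch of combining both ideas: fit in rescaled units with an explicit p0, then convert the parameters back. The "true" parameter values, the synthetic noise-free data, and the scale factors are assumptions made up for this illustration; only the model comes from the question.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    # Model from the question; valid in any consistent set of units.
    return Vs * t - (Vs - Vi) * (1.0 - np.exp(-k * t)) / k

# Hypothetical parameters in raw units (concentration per second, 1/second).
Vs_true, Vi_true, k_true = 3e-11, 1e-11, 1e-3
t_sec = np.linspace(0.0, 12000.0, 200)
y_raw = model(t_sec, Vs_true, Vi_true, k_true)   # synthetic, noise-free

# Rescale so that both t and y are of order 1.
T_SCALE = 3600.0   # seconds per hour
Y_SCALE = 1e7      # turns ~1e-7 into ~1
t_h = t_sec / T_SCALE
y_s = y_raw * Y_SCALE

# Explicit starting guess in the rescaled units (instead of the default 1, 1, 1).
p0 = (1.0, 0.5, 3.0)
popt, pcov = curve_fit(model, t_h, y_s, p0=p0)

# Convert back: Vs and Vi pick up both factors, k only the time factor.
Vs_fit = popt[0] / (T_SCALE * Y_SCALE)
Vi_fit = popt[1] / (T_SCALE * Y_SCALE)
k_fit = popt[2] / T_SCALE
print(Vs_fit, Vi_fit, k_fit)
```

With real, noisy data the recovered values will of course not match this closely.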

Others have already mentioned that you probably need to have a good starting guess for your fit. In cases like this, I usually try to find some quick and dirty tricks to get at least a ballpark estimate of the parameters. In your case, for large t, the exponential decays pretty quickly to zero, so for large t, you have
y == Vs * t - (Vs - Vi) / k
Doing a first-order linear fit like
[slope1, offset1] = polyfit(t[t > 2000], y[t > 2000], 1)
you will get slope1 == Vs and offset1 == (Vi - Vs) / k.
Subtracting this straight line from all the points you have, you get the exponential
residual == y - slope1 * t - offset1 == (Vs - Vi) / k * exp(-t * k)
Taking the log of both sides, you get
log(residual) == log((Vs - Vi) / k) - t * k
So doing a second fit
[slope2, offset2] = polyfit(t, log(y - slope1 * t - offset1), 1)
will give you slope2 == -k and offset2 == log((Vs - Vi) / k), which should be solvable for Vi since you already know Vs and k. You might have to limit the second fit to small values of t, otherwise you might be taking the log of negative numbers. Collect all the parameters you obtained with these fits and use them as the starting points for your curve_fit.
Finally, you might want to look into doing some sort of weighted fit. The information about the exponential part of your curve is contained in just the first few points, so maybe you should give those a higher weight. Doing this in a statistically correct way is not trivial.
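The two-step estimate can be sketched with NumPy alone. Expanding the model gives y = Vs*t - (Vs - Vi)/k + (Vs - Vi)/k * exp(-k*t), so after removing the straight line the residual is (Vs - Vi)/k * exp(-k*t). The parameter values, the synthetic data, and the cut-offs (t > 2000 and t < 600) are assumptions for this example, and it only works as written when Vs > Vi, so that the residual is positive.

```python
import numpy as np

def model(t, Vs, Vi, k):
    return Vs * t - (Vs - Vi) * (1.0 - np.exp(-k * t)) / k

# Hypothetical parameters and noise-free data (Vs > Vi is assumed).
Vs_true, Vi_true, k_true = 3e-11, 1e-11, 5e-3
t = np.linspace(0.0, 12000.0, 601)
y = model(t, Vs_true, Vi_true, k_true)

# Step 1: straight-line fit where the exponential has died out.
tail = t > 2000
slope1, offset1 = np.polyfit(t[tail], y[tail], 1)
Vs_est = slope1                       # offset1 is about (Vi - Vs) / k

# Step 2: log-linear fit of what is left at small t.
head = t < 600
residual = y[head] - slope1 * t[head] - offset1
slope2, offset2 = np.polyfit(t[head], np.log(residual), 1)
k_est = -slope2                       # offset2 is about log((Vs - Vi) / k)
Vi_est = Vs_est - k_est * np.exp(offset2)

# Ballpark values to hand to curve_fit as p0.
p0 = (Vs_est, Vi_est, k_est)
print(p0)
```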

Related

Simulated annealing, normalized temperature

I need to find the value of x that maximizes a given function.
This is the Python code for the formula: 2 ** (-2 *((((x-0.1) / 0.9)) ** 2)) * ((math.sin(5*math.pi*x)) ** 6).
I'm using the simulated annealing algorithm for this job, but I'm having a problem.
probability = pow(math.e, (actual_cost - best_cost) / temperature)
My "cost" (what I'm trying to optimize) is a very small number, most often between 0 and 0.1, but my temperature, on the other hand, is around 100.
So, when I apply the probability function, my result is always something like 99%, which makes my algorithm accept worse solutions in every iteration, instead of this probability decreasing over the iterations.
How can I adapt my temperature so that the probability changes over the iterations?

The solution to this can be found in the docs for scipy.optimize.basinhopping:
Choosing T: The parameter T is the “temperature” used in the
Metropolis criterion. Basinhopping steps are always accepted if
func(xnew) < func(xold). Otherwise, they are accepted with
probability:
exp( -(func(xnew) - func(xold)) / T )
So, for best results, T should to be comparable to the typical
difference (in function values) between local minima. (The height of
“walls” between local minima is irrelevant.)
If T is 0, the algorithm becomes Monotonic Basin-Hopping, in which all
steps that increase energy are rejected.
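To make the quoted rule concrete, here is a small sketch of the Metropolis acceptance probability for a move that worsens the cost by 0.05, a typical difference in the question. The function name, the starting temperature 0.1, and the cooling factor 0.95 are illustrative assumptions, not values from the question or from scipy.

```python
import math

def acceptance_probability(delta_cost, temperature):
    """Probability of accepting a move that is worse by delta_cost (>= 0)."""
    if delta_cost <= 0:
        return 1.0                      # improvements are always accepted
    return math.exp(-delta_cost / temperature)

# Temperature far larger than the cost differences: almost everything passes.
p_hot = acceptance_probability(0.05, 100.0)

# Temperature comparable to the cost differences: a meaningful filter.
p_matched = acceptance_probability(0.05, 0.05)

# A simple geometric cooling schedule lowers the probability step by step.
temperature = 0.1
cooling = 0.95
schedule = []
for step in range(5):
    schedule.append(acceptance_probability(0.05, temperature))
    temperature *= cooling

print(p_hot, p_matched, schedule)
```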

Using lmfit of two Gaussians how to restrain the parameters of the second peak in dependence of the first?

In a multi-peak fit I intend to constrain the solution space for the parameters of the second peak based on the values of the first one. In particular, I want the amplitude of the second peak never to be larger than the amplitude of the first one.
I've read on the lmfit website about "Using Inequality Constraints" and I have the feeling it should be possible with this approach, but I do not understand it well enough to make it work.
import lmfit
GaussianA = lmfit.models.GaussianModel(prefix='A_')
pars = GaussianA.make_params()
GaussianB = lmfit.models.GaussianModel(prefix='B_')
pars.update(GaussianB.make_params())
pars['B_amplitude'].set(expr = 'A_amplitude')
This locks in the amplitude of B to the amplitude of A.
However, how do I specify that the amplitude of B is at most 'A_amplitude'?
This doesn't work (it would be awesome if it were that easy), but maybe it helps to demonstrate what I'd like to have: pars['B_amplitude'].set(1,max='A_amplitude')

The min and max values for a lmfit.Parameter are not dynamically calculated from the other variables, but must be real numerical values. That is, something like
pars['B_amplitude'].set(1,max='A_amplitude') # Nope!
will not work.
What you need to do is follow the documentation for an inequality constraint (see https://lmfit.github.io/lmfit-py/constraints.html#using-inequality-constraints). That is, you can think of
B_amplitude < A_amplitude
as
B_amplitude = A_amplitude - delta_amplitude
with delta_amplitude being some variable value that must be positive.
That can be expressed as
GaussianA = lmfit.models.GaussianModel(prefix='A_')
pars = GaussianA.make_params()
GaussianB = lmfit.models.GaussianModel(prefix='B_')
pars.update(GaussianB.make_params())
pars.add('delta_amplitude', value=0.01, min=0, vary=True)
pars['B_amplitude'].set(expr = 'A_amplitude - delta_amplitude')
Now delta_amplitude is a variable that must be positive, and B_amplitude is no longer a freely varying parameter but is constrained by the values of A_amplitude and delta_amplitude.

Do you have a plot of your data? How noisy is it? I understand that you do 2 separate fits, but you have 2 peaks in your data. If your data is well behaved, you might be able to fit one peak first, then take its amplitude and fit the second one with limits set on the amplitude. But maybe it's better to set a limit on the x position, since you are talking about two different peaks.
Here is how I solved this in a slightly hacky way (I assume your problem is that your fit does not converge):
Find the highest peak (maximum) in the data -> x1
Cut out the data around that peak (x1 +- 2 half-power widths, depending on the distance between your peaks and their heights)
Find the highest peak (maximum) in the remaining data -> x2
Use a custom fit curve which is the sum of your two Gaussian curves: f(x) = gauss1 + gauss2, where each gauss(x, x1, width, amplitude, y_offset) = amplitude/width * e^(-(x-x1)^2/width) + y_offset
Sorry, it's been years since I did that, and without lmfit, so I can't give you details on it.
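A rough sketch of those steps with NumPy and scipy.optimize.curve_fit instead of lmfit. The Gaussian parametrisation (amplitude, center, sigma, no offset), the synthetic data, and the window width are assumptions chosen for the example; real data would need a window based on the actual peak widths.

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, amplitude, center, sigma):
    return amplitude * np.exp(-(x - center) ** 2 / (2.0 * sigma ** 2))

def two_gauss(x, a1, c1, s1, a2, c2, s2):
    return gauss(x, a1, c1, s1) + gauss(x, a2, c2, s2)

# Hypothetical noise-free data with two well separated peaks.
x = np.linspace(0.0, 10.0, 501)
y = two_gauss(x, 3.0, 3.0, 0.5, 1.5, 7.0, 0.8)

# Step 1: the highest point gives the position of the first peak.
i1 = np.argmax(y)
x1 = x[i1]

# Step 2: mask out a window around the first peak (width is a guess).
window = 2.0
outside = np.abs(x - x1) > window

# Step 3: the highest remaining point gives the second peak.
i2 = np.argmax(np.where(outside, y, -np.inf))
x2 = x[i2]

# Step 4: fit the sum of both Gaussians, starting from these guesses.
p0 = [y[i1], x1, 1.0, y[i2], x2, 1.0]
popt, pcov = curve_fit(two_gauss, x, y, p0=p0)
print(popt)
```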

equation system with fsolve

I am trying to find a solution for a system of equations using scipy.optimize.fsolve in Python 2.7. The goal is to calculate equilibrium concentrations for a chemical system. Due to the nature of the problem, some of the constants are very small. For some combinations of parameters I do get a proper solution; for others I don't. Either the solutions are negative, which is not reasonable from a physical point of view, or fsolve produces:
ier = 3, 'xtol=0.000000 is too small, no further improvement in the approximate\n solution is possible.')
ier = 4, 'The iteration is not making good progress, as measured by the \n improvement from the last five Jacobian evaluations.')
ier = 5, 'The iteration is not making good progress, as measured by the \n improvement from the last ten iterations.')
It seems to me, based on my research, that the failure to find proper solutions of the equation system is connected to the datatype float64 not being precise enough. As a friend pointed out, the system is not well conditioned, with parameters differing by several orders of magnitude.
So I tried to use fsolve with the mpfr type provided by the gmpy2 module, but that resulted in the following error:
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
Here is a small example with parameters which lead to a solution if the randomized starting values happen to be good. However, if the constant C_HCL is chosen to be something like 1e-4 or bigger, then I never find a proper solution.
from numpy import *
from scipy.optimize import *
K_1 = 1e-8
K_2 = 1e-8
K_W = 1e-30
C_HCL = 1e-11
C_NAOH = K_W/C_HCL
C_HL = 1e-6
if C_HCL-C_NAOH > 0:
    Saeure_Base = C_HCL-C_NAOH+sqrt(K_W)
    OH_init = K_W/(Saeure_Base)
elif C_HCL-C_NAOH < 0:
    OH_init = C_NAOH-C_HCL+sqrt(K_W)
    Saeure_Base = K_W/OH_init
# some randomized start parameters
G1 = random.uniform(0, 2)*Saeure_Base
G2 = random.uniform(0, 2)*OH_init
G3 = random.uniform(1, 2)*C_HL*(sqrt(K_W))/(Saeure_Base+OH_init)
G4 = random.uniform(0.1, 1)*(C_HL - G3)/2
G5 = C_HL - G3 - G4
zGuess = array([G1,G2,G3,G4,G5])
#equation system / 5 variables --> H3O, OH, HL, H2L, L
def myFunction(z):
    H3O = z[0]
    OH = z[1]
    HL = z[2]
    H2L = z[3]
    L = z[4]
    F = empty((5))
    F[0] = H3O*L/HL - K_1
    F[1] = OH*H2L/HL - K_2
    F[2] = K_W - OH*H3O
    F[3] = C_HL - HL - H2L - L
    F[4] = OH+L+C_HCL-H2L-H3O-C_NAOH
    return F
z = fsolve(myFunction,zGuess, maxfev=10000, xtol=1e-15, full_output=1,factor=0.1)
print(z)
So the questions are: is this problem caused by the precision of float64 and, if yes, (how) can it be solved with Python? Is fsolve the way to go? Would I need to change the fsolve function so it accepts a different data type?

The root of your problem is either theoretical or numerical.
The scipy.optimize.fsolve function is based on the MINPACK Fortran solver (http://www.netlib.org/minpack/). This solver uses a Newton-type algorithm (a modification of Powell's hybrid method) to find the solution.
There are underlying assumptions about the smoothness of the function when you use this algorithm. For example, the Jacobian matrix at the solution point x is supposed to be invertible. The issue you should be most concerned about is the basin of attraction.
In order to converge, the starting point of the algorithm needs to be near the actual solution, i.e. in its basin of attraction. This condition is always met for convex functions; however, it is easy to find functions for which this algorithm behaves badly. Your function is one of these, as it contains ratios of your input variables.
To address this issue, you should change the starting point. The starting point also becomes very important for functions with multiple solutions: the picture in the Wikipedia article shows the solution found depending on the starting point (five colours for five solutions). So you should be careful with your solution and actually check its "physical" aspects.
For the numerical aspects, the Newton-Raphson algorithm needs the value of the Jacobian matrix (the matrix of derivatives). If it is not provided to the MINPACK solver, the Jacobian is estimated with a finite-difference formula. The perturbation step for the finite-difference formula is controlled by epsfcn; it defaults to None and is only used when fprime is not provided (there is no need for the Jacobian estimation otherwise). So first you should consider setting it. You could also specify the Jacobian directly by differentiating your function by hand.
However, the minimum value for the step size will be the machine precision, also called machine epsilon. For your problem, you have very small input values, which can be a problem. I would suggest multiplying every one of them by the same value (like 10^6); this is equivalent to a change of units, but it avoids rounding errors and problems with machine precision.
This problem is also relevant to the parameter xtol=1e-15 you provided. In your error message it is printed as xtol=0.000000; a tolerance that close to machine precision cannot realistically be met. Also, if you look at your line F[2] = K_W - OH*H3O, given the machine precision, it does not matter whether K_W is 1e-15 or 1e-30: 0 is a solution in both cases as far as machine precision is concerned. To avoid this problem, just multiply everything by a bigger value.
So to sum up:
For the Newton-Raphson algorithm, the initialisation point matters!
For this algorithm, you should specify how the Jacobian is computed!
In numerical computation, avoid working with very small values. You can easily rescale them to something more convenient: it is a basic unit conversion, like working in grams instead of kilograms.
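As a sketch of the unit change, here is a much smaller system than the one in the question: the water ion product plus a charge balance for a fully dissociated acid. The values K_W = 1e-14 and C_ACID = 1e-7 and the scale factor are assumptions for illustration, not the question's constants. Concentrations are expressed in units of 1e-7 mol/L, so the unknowns and residuals are of order 1.

```python
import numpy as np
from scipy.optimize import fsolve

K_W = 1e-14      # (mol/L)**2, textbook value, not the question's 1e-30
C_ACID = 1e-7    # mol/L of a fully dissociated acid (hypothetical)
SCALE = 1e7      # unknowns are expressed in units of 1/SCALE mol/L

def equations_scaled(z):
    h, oh = z
    return [h * oh - K_W * SCALE ** 2,     # ion product, rescaled
            h - oh - C_ACID * SCALE]       # charge balance, rescaled

# Starting point of order 1 in the rescaled units.
z_scaled, info, ier, msg = fsolve(equations_scaled, [1.0, 1.0], full_output=True)

# Convert back to mol/L.
H3O, OH = z_scaled / SCALE
print(ier, H3O, OH)
```

The same idea applies to the five-variable system: rescale every concentration by one factor and every equilibrium constant by the matching power of that factor.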

Fitting Fresnel Equations Using Scipy

I am attempting a non-linear fit of Fresnel equations with data of reflectance against angle of incidence. Found on this site http://en.wikipedia.org/wiki/Fresnel_equations are two graphs that have a red and blue line. I need to basically fit the blue line when n1 = 1 to my data.
Here I use the following code where th is theta, the angle of incidence.
def Rperp(th, n, norm, constant):
    numerator = np.cos(th) - np.sqrt(n**2.0 - np.sin(th)**2.0)
    denominator = 1.0 * np.cos(th) + np.sqrt(n**2.0 - np.sin(th)**2.0)
    return ((numerator / denominator)**2.0) * norm + constant
The parameters I'm looking for are:
the index of refraction n
some normalization to multiply by and
a constant to shift the baseline of the graph.
My attempt is the following:
xdata = angle[1:] * 1.0 # angle of incidence
ydata = greenDD[1:] # reflectance
params = curve_fit(Rperp, xdata, ydata)
What I get is apparently a division by zero, and it gives me [1, 1, 1] for the parameters. The Fresnel equation itself is the part of Rperp without the normalization and the constant. Theta in the equation is also the angle of incidence. Overall, I am just not sure whether I am doing this right at all to get the parameters.
The idea seems to be that the first argument of the function is the independent variable and the rest are the parameters to be found. Then you just plug it into scipy's curve_fit and it will give you the fitted parameters for your data. If it is just a matter of getting around the division by zero, which I had thought might be integer division, then it seems like I should be set. Any help is appreciated, and let me know if things need to be clarified (such as np being numpy).

Make sure to pass the arguments to the trigonometric functions, like sine, in radians, not degrees.
As for why you're getting a negative refractive index returned: it is because in your function, you're always squaring the refractive index. The curve_fit algorithm might end up in a local minimum where (by accident) n is negative, because a negative n gives the same function value as the corresponding positive n.
Ideally, you'd add constraints to the minimization problem, but for this (simple) problem, just observe your formula and remember that a result of negative n is simply solved by changing the sign, as you did.
You could also try passing an initial guess to the algorithm and you might observe that it will not end up in the local minimum with negative value.
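A minimal sketch of both suggestions, radians and an explicit initial guess, on synthetic data. The angles, the "true" values (n = 1.5, norm = 1, constant = 0), and the starting guess are assumptions for illustration. One possible reason the default start of n = 1 gets stuck, by my reading of the formula: at n = 1 the square root equals cos(th), so the numerator is zero at every angle.

```python
import numpy as np
from scipy.optimize import curve_fit

def Rperp(th, n, norm, constant):
    # th must be in radians
    root = np.sqrt(n**2.0 - np.sin(th)**2.0)
    numerator = np.cos(th) - root
    denominator = np.cos(th) + root
    return (numerator / denominator)**2.0 * norm + constant

# Hypothetical measurement angles in degrees, converted before fitting.
angle_deg = np.linspace(5.0, 85.0, 41)
theta = np.radians(angle_deg)

# Synthetic, noise-free reflectance for n = 1.5.
ydata = Rperp(theta, 1.5, 1.0, 0.0)

# Start away from n = 1 and on the positive side.
p0 = [1.4, 1.0, 0.0]
popt, pcov = curve_fit(Rperp, theta, ydata, p0=p0)

# The model only depends on n**2, so report the magnitude.
n_fit = abs(popt[0])
print(n_fit, popt[1], popt[2])
```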

High frequency noise at solving differential equation

I'm trying to simulate a simple diffusion based on Fick's 2nd law.
from pylab import *
import numpy as np
gridpoints = 128
def profile(x):
    range = 2.
    straggle = .1576
    dose = 1
    return dose/(sqrt(2*pi)*straggle)*exp(-(x-range)**2/2/straggle**2)
x = linspace(0,4,gridpoints)
nx = profile(x)
dx = x[1] - x[0] # use np.diff(x) if x is not uniform
dxdx = dx**2
figure(figsize=(12,8))
plot(x,nx)
timestep = 0.5
steps = 21
diffusion_coefficient = 0.002
for i in range(steps):
    coefficients = [-1.785714e-3, 2.539683e-2, -0.2e0, 1.6e0,
                    -2.847222e0,
                    1.6e0, -0.2e0, 2.539683e-2, -1.785714e-3]
    ccf = (np.convolve(nx, coefficients) / dxdx)[4:-4] # second order derivative
    nx = timestep*diffusion_coefficient*ccf + nx
    plot(x,nx)
For the first few time steps everything looks fine, but then I start to get high-frequency noise, due to a build-up of numerical errors which are amplified through the second derivative. Since it seems to be hard to increase the float precision, I'm hoping that there is something else that I can do to suppress this. I already increased the number of points that are used to construct the 2nd derivative.

I don't have the time to study your solution in detail, but it seems that you are solving the partial differential equation with a forward Euler scheme. This is pretty easy to implement, as you show, but it can become numerically unstable if your timestep is too large. Your only options are to reduce the timestep or to decrease the spatial resolution (i.e. increase dx).
The easiest way to explain this is for the 1-D case: assume your concentration is a function of spatial coordinate x and timestep i. If you do all the math (write down your equations, substitute the partial derivatives with finite differences, should be pretty easy), you will probably get something like this:
C(x, i+1) = [1 - 2 * k] * C(x, i) + k * [C(x - 1, i) + C(x + 1, i)]
so the concentration of a point on the next step depends on its previous value and the ones of its two neighbors. It is not too hard to see that when k = 0.5, every point gets replaced by the average of its two neighbors, so a concentration profile of [...,0,1,0,1,0,...] will become [...,1,0,1,0,1,...] on the next step. If k > 0.5, such a profile will blow up exponentially. You calculate your second order derivative with a longer convolution (I effectively use [1,-2,1]), but I guess that does not change anything for the instability problem.
I don't know about normal diffusion, but based on experience with thermal diffusion, I would guess that k scales with dt * diffusion_coeff / dx^2. You thus have to choose your timestep small enough so that your simulation does not become unstable. To make the simulation stable, but still as fast as possible, choose your parameters so that k is a bit smaller than 0.5. Something similar can be derived for 2-D and 3-D cases. The easiest way to achieve this is to increase dx, since your total calculation time will scale with 1/dx^3 for a 1-D problem, 1/dx^4 for 2-D problems, and even 1/dx^5 for 3-D problems.
There are better methods to solve diffusion equations; I believe that Crank-Nicolson is at least standard for solving heat equations (which are also diffusion problems). The 'problem' is that this is an implicit method, which means that you have to solve a set of equations to calculate your 'concentration' at the next timestep, which is a bit of a pain to implement. But this method is guaranteed to be numerically stable, even for big timesteps.
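To check the stability condition for the numbers in the question, here is a small sketch that computes k = dt * D / dx^2 and runs the simple [1, -2, 1] forward Euler update with a safer timestep. The helper names and the choice k = 0.4 are assumptions for illustration. The limit for the question's wider 9-point stencil is probably somewhat lower than 0.5, so treat 0.5 as an upper bound.

```python
import numpy as np

def diffusion_number(dt, D, dx):
    """k = dt * D / dx**2 from the update rule above."""
    return dt * D / dx ** 2

def euler_step(c, k):
    """One forward Euler step with the [1, -2, 1] stencil; end points stay fixed."""
    new = c.copy()
    new[1:-1] = (1 - 2 * k) * c[1:-1] + k * (c[:-2] + c[2:])
    return new

# Grid, diffusion coefficient and initial profile as in the question.
x = np.linspace(0, 4, 128)
dx = x[1] - x[0]
D = 0.002
c0 = np.exp(-(x - 2.0) ** 2 / (2 * 0.1576 ** 2)) / (np.sqrt(2 * np.pi) * 0.1576)

# The question's timestep of 0.5 gives k of about 1, well above 0.5.
k_question = diffusion_number(0.5, D, dx)

# Pick the timestep from a target k below 0.5 instead.
k_safe = 0.4
dt_safe = k_safe * dx ** 2 / D

c = c0
for _ in range(200):
    c = euler_step(c, k_safe)

print(k_question, dt_safe, c0.max(), c.max())
```

With k below 0.5 every new value is a weighted average of old values, so the peak can only decrease and no oscillations appear.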
