Fit a bipolar sigmoid in Python

I'm stuck trying to fit a bipolar sigmoid curve - I'd like the standard bipolar sigmoid shape, but shifted and stretched. I have the following inputs:
x[0] = 8, x[48] = 2
So over 48 periods I need to drop from 8 to 2 using a bipolar sigmoid function to approximate a nice smooth dropoff. Any ideas how I could derive the curve that would fit those parameters?
Here's what I have so far, but I need to change the sigmoid function:
import math
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

plt.plot([sigmoid(float(z)) for z in range(1, 48)])

You could redefine the sigmoid function like so:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x, a, b, c, d):
    """General sigmoid function:
    a adjusts the amplitude,
    b adjusts the y offset,
    c adjusts the x offset,
    d adjusts the slope."""
    return ((a - b) / (1 + np.exp(x - (c / 2)) ** d)) + b

x = np.arange(49)
y = sigmoid(x, 8, 2, 48, 0.3)
plt.plot(x, y)
Severin's answer is likely more robust, but this should be fine if all you want is a quick and dirty solution.
In [2]: y[0]
Out[2]: 7.9955238269969806
In [3]: y[48]
Out[3]: 2.0044761730030203

From the generic bipolar sigmoid function

f(x, m, b) = 2/(1 + exp(-b*(x - m))) - 1

there are two parameters and two unknowns: the shift m and the scale b. You have two conditions, f(0) = 8 and f(48) = 2 (once the sigmoid's (-1, 1) range is mapped onto the desired (2, 8) range). Take the first condition and express b in terms of m; together with the second condition this gives a nonlinear equation to solve. Then use fsolve from SciPy to solve it numerically and recover b and m.
Here is a related question and answer that uses a similar method: How to random sample lognormal data in Python using the inverse CDF and specify target percentiles?
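A minimal sketch of that procedure (my own illustration, with two assumptions: the raw sigmoid is rescaled as y(x) = 5 - 3*f(x, m, b) so that its limits are 8 and 2, and the endpoint conditions are relaxed by a small tolerance eps, since the limits are only reached asymptotically; also, both conditions are handed to fsolve as a system rather than eliminating b by hand):

import numpy as np
from scipy.optimize import fsolve

def f(x, m, b):
    # generic bipolar sigmoid, range (-1, 1)
    return 2.0 / (1.0 + np.exp(-b * (x - m))) - 1.0

def y(x, m, b):
    # map the (-1, 1) range onto a drop from 8 to 2
    return 5.0 - 3.0 * f(x, m, b)

eps = 1e-2  # how close the endpoints must come to 8 and 2

def conditions(params):
    m, b = params
    return [y(0, m, b) - (8 - eps), y(48, m, b) - (2 + eps)]

m, b = fsolve(conditions, x0=[24.0, 0.2])
print(m, b, y(0, m, b), y(48, m, b))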

Alternatively, you could use curve_fit, which comes in handy if you have more than just two data points. As the resulting plot shows, the fitted graph contains the desired data points. I used @lanery's function for the fit; you can of course choose any function you like. This is the code with some inline comments:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def sigmoid(x, a, b, c, d):
    return ((a - b) / (1. + np.exp(x - (c / 2)) ** d)) + b

# one needs at least as many data points as parameters, so I just duplicate the data
xdata = [0., 48.] * 2
ydata = [8., 2.] * 2

# plot the data
plt.plot(xdata, ydata, 'bo', label='data')

# fit the data
popt, pcov = curve_fit(sigmoid, xdata, ydata, p0=[1., 1., 50., 0.5])

# plot the result
xdata_new = np.linspace(0, 50, 100)
plt.plot(xdata_new, sigmoid(xdata_new, *popt), 'r-', label='fit')
plt.legend(loc='best')
plt.show()

Related

scipy curve_fit gives the initial guess values as optimal results when I set the bounds parameter

I'm trying to fit a curve with a Gaussian plus a Lorentzian function, using the curve_fit function from scipy.

import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, x0, sig):
    return a * np.exp(-1/2 * (x - x0)**2 / sig**2)

def lorentzian(x, a, b, c):
    return a*c**2 / ((x - b)**2 + c**2)

def decompose(x, z, n, b, *par):
    hb_n = gaussian(x, par[0], 4861.3*(1 + z), n)
    hb_b = lorentzian(x, par[1], 4861.3*(1 + z), b)
    return hb_b + hb_n
And when I set the p0 parameter, I can get a reasonable result, which fits the curve well.
guess = [0.0001, 2, 10, 3e-16, 3e-16]
p, c = curve_fit(decompose, wave, residual, guess)
(figures: the fitted parameters, and the fitting model plotted over the data, when only p0 is set)
But if I set the p0 and bounds parameters simultaneously, the curve_fit function returns the initial guess as the final fitting result, which deviates considerably from the data.
guess = [0.0001, 2, 10, 3e-16, 3e-16]
p, c = curve_fit(decompose, wave, residual, guess, bounds=([-0.001, 0, 0, 0, 0], [0.001, 10, 100, 1e-15, 1e-15]))
(figures: the fitted parameters, and the fitting model plotted over the data, when p0 and bounds are set simultaneously)
I have tried many different combinations of bounds for the parameters, but the fitting results invariably return the initial guess values. I've been stuck on this problem for a long time and would be very grateful for any advice.
This happens due to a combination of the optimization algorithm and its parameters.
From the official documentation:
method : {'lm', 'trf', 'dogbox'}, optional
    Method to use for optimization. See least_squares for more details.
    Default is 'lm' for unconstrained problems and 'trf' if bounds are
    provided. The method 'lm' won't work when the number of observations
    is less than the number of variables; use 'trf' or 'dogbox' in this
    case.
So when you add bound constraints, curve_fit uses a different optimization algorithm (trust region reflective instead of Levenberg-Marquardt).
To debug the problem you can try to set full_output=True as Warren Weckesser noted in the comments.
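For example, a sketch of such a call (note that curve_fit only accepts full_output from SciPy 1.9 onwards; the bounds are the ones from the question):

popt, pcov, infodict, mesg, ier = curve_fit(
    decompose, wave, residual, p0=guess,
    bounds=([-0.001, 0, 0, 0, 0], [0.001, 10, 100, 1e-15, 1e-15]),
    full_output=True)
print(infodict['nfev'])  # number of function evaluations
print(mesg)              # termination message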
In the case of the fit with bounds you will see something similar to:
'nfev': 1
`gtol` termination condition is satisfied.
So the optimization stopped after the first iteration, which is why the found parameters are similar to the initial guess.
To fix this you can specify a lower gtol parameter. The full list of available parameters can be found here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.least_squares.html#scipy.optimize.least_squares
Example:
Code:
import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit

def gaussian(x, a, x0, sig):
    return a * np.exp(-1 / 2 * (x - x0) ** 2 / sig**2)

def lorentzian(x, a, b, c):
    return a * c**2 / ((x - b) ** 2 + c**2)

def decompose(x, z, n, b, *par):
    hb_n = gaussian(x, par[0], 4861.3 * (1 + z), n)
    hb_b = lorentzian(x, par[1], 4861.3 * (1 + z), b)
    return hb_b + hb_n

gt_parameters = [-2.42688295e-4, 2.3477827, 1.56977708e1, 4.47455820e-16, 2.2193466e-16]
wave = np.linspace(4750, 5000, num=400)
gt_curve = decompose(wave, *gt_parameters)
noisy_curve = gt_curve + np.random.normal(0, 2e-17, size=len(wave))

guess = [0.0001, 2, 10, 3e-16, 3e-16]
bounds = ([-0.001, 0, 0, 0, 0], [0.001, 10, 100, 1e-15, 1e-15])

options = [
    ("Levenberg-Marquardt without bounds", dict(method="lm")),
    ("Trust Region without bounds", dict(method="trf")),
    ("Trust Region with bounds", dict(method="trf", bounds=bounds)),
    (
        "Trust Region with bounds + fixed tolerance",
        dict(method="trf", bounds=bounds, gtol=1e-36),
    ),
]

fig, axs = plt.subplots(len(options))
for (title, fit_params), ax in zip(options, axs):
    ax.set_title(title)
    p, c = curve_fit(decompose, wave, noisy_curve, guess, **fit_params)
    fitted_curve = decompose(wave, *p)
    ax.plot(wave, gt_curve, label="gt_curve")
    ax.plot(wave, noisy_curve, label="noisy")
    ax.plot(wave, fitted_curve, label="fitted_curve")
handles, labels = ax.get_legend_handles_labels()
fig.legend(handles, labels)
plt.show()

Multivariate curve-fitting in python for estimating the parameter and order of ellipse-like shapes

I'm trying to find the best parameters (a, b, and c) of the following function (the general formula of a circle, ellipse, or rhombus):

(|x|/a)^c + (|y|/b)^c = 1

given two arrays of independent data (x and y) in Python. My main objective is to estimate the best values of a, b, and c from my x and y variables. I am using the curve_fit function from scipy, so here is my code with demo x and y:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

alpha = 5
beta = 3
N = 500
DIM = 2

np.random.seed(2)
theta = np.random.uniform(0, 2*np.pi, (N, 1))
eps_noise = 0.2 * np.random.normal(size=[N, 1])
circle = np.hstack([np.cos(theta), np.sin(theta)])
B = np.random.randint(-3, 3, (DIM, DIM))
noisy_ellipse = circle.dot(B) + eps_noise
X = noisy_ellipse[:, 0:1]
Y = noisy_ellipse[:, 1:]

def func(xdata, a, b, c):
    x, y = xdata
    return (np.abs(x)/a)**c + (np.abs(y)/b)**c

xdata = np.transpose(np.hstack((X, Y)))
ydata = np.ones((xdata.shape[1],))
pp, pcov = curve_fit(func, xdata, ydata, maxfev=1000000, bounds=((0, 0, 1), (50, 50, 2)))

plt.scatter(X, Y, label='Data Points')
x_coord = np.linspace(-5, 5, 300)
y_coord = np.linspace(-5, 5, 300)
X_coord, Y_coord = np.meshgrid(x_coord, y_coord)
Z_coord = func((X_coord, Y_coord), pp[0], pp[1], pp[2])
plt.contour(X_coord, Y_coord, Z_coord, levels=[1], colors=('g'), linewidths=2)
plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
By using this code, the parameters are [4.69949891, 3.65493859, 1.0] for a, b, and c.
The problem is that I usually get the smallest value of c allowed by its bounds, while for this demo data c is supposed to be very close to 2, since the data represent an ellipse.
Any help and suggestions for solving this issue are appreciated.
A curve whose equation is (|x/a|)^c + (|y/b|)^c = 1 is called a "superellipse":
http://mathworld.wolfram.com/Superellipse.html
For large c the superellipse tends to a rectangular shape. For c = 2 the curve is an ellipse, or a circle in the particular case a = b. For c close to 1 the superellipse tends to a rhombus shape. For c between 0 and 1 the superellipse looks like a (squashed) astroid with sharp vertices. This kind of shape will not be considered below.
Before looking at the OP's actual question, it is of interest to study the regression behaviour when fitting a superellipse to scattered data. A short experimental and simplified approach helps in understanding the mathematical difficulty, prior to the programming difficulties.
When the scatter increases, the computed value of c (corresponding to the minimum of the MSE) decreases. Also, the minimum becomes more and more difficult to localize. This is certainly a difficulty for the software. For even larger scatter, the value c = 1 leads to a rhombus shape. So it is not surprising that, in the highly scattered example published by the OP, the software gave a rhombus as the fitted curve.
If this was not the expected result, one has to choose another goal than the minimum MSE. For example, if the goal is to obtain an elliptic shape, one has to set c = 2. The result then gives an MSE worse than with the preceding rhombus shape, but the elliptic fitting is well achieved.
NOTE: In case of large scatter, the result depends a lot on the choice of fitting criterion (MSE, MAE, ..., and with respect to which variable). This can be the cause of very different results from one software package to another if the fitting criteria (sometimes not explicit) are different. Among the fitting criteria, if it is specified that the rhombus shape is excluded, one has to define a more representative criterion and/or model and implement it in the software.
IMPORTANCE OF THE FITTING CRITERION:
In order to show how important the choice of fitting criterion is, especially in case of highly scattered data, we repeat the study with a different criterion. The preceding criterion was the MSE of the errors on the superellipse equation itself:

MSE = (1/n) * sum_i [ (|x_i|/a)^c + (|y_i|/b)^c - 1 ]^2

We now choose a different criterion, for example the MSE of the errors on the radial coordinate in the polar system:

MSE = (1/n) * sum_i [ r_i - rho(theta_i) ]^2, with rho(theta) = [ (|cos(theta)|/a)^c + (|sin(theta)|/b)^c ]^(-1/c)

where (r_i, theta_i) are the polar coordinates of the data points (in the original post these notations were defined on a figure). The empirical study for increasing scatter shows that the numerical calculation with the second criterion is more robust than with the first: cases with higher scatter can be treated with the second fitting criterion. The drawback is that this second criterion is probably not available in off-the-shelf software, so one has to implement the above formulas in existing software if possible, or write specially adapted software.
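A minimal sketch of such an implementation with scipy (my own illustration, not from the original answer; it reuses the X, Y arrays from the question and the rho(theta) defined above):

import numpy as np
from scipy.optimize import least_squares

def radial_residuals(params, x, y):
    a, b, c = params
    theta = np.arctan2(y, x)
    r = np.hypot(x, y)
    # superellipse radius at angle theta, from (|r*cos(theta)|/a)^c + (|r*sin(theta)|/b)^c = 1
    rho = ((np.abs(np.cos(theta)) / a)**c + (np.abs(np.sin(theta)) / b)**c)**(-1.0 / c)
    return r - rho

res = least_squares(radial_residuals, x0=[1.0, 1.0, 2.0],
                    args=(X.ravel(), Y.ravel()),
                    bounds=([0.1, 0.1, 1.0], [50.0, 50.0, 10.0]))
a_fit, b_fit, c_fit = res.x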
Nevertheless, this discussion about fitting criteria is somewhat beside the point, because the criterion should not result from mathematical considerations alone. If the problem comes from a practical need in physics or technology, the fitting criterion may be dictated by that reality, leaving no free choice.
I have modified your code (though you took it from https://stackoverflow.com/a/47881806/10640534) quite a lot, but I think I have what you expect. I am using a different equation, which I found here. I have also used the new NumPy random generators, but I believe that is only aesthetic for this problem. I am drawing the ellipse using patches from matplotlib, which indeed is aesthetic, but definitely a way better solution to represent your conic. Importantly, I am using the dogbox method for curve_fit because other methods do not converge; occasionally the ellipse is not matched, and decreasing the added noise (e.g., rng.normal(0, 1, (500, 2)) / 1e2 instead of rng.normal(0, 1, (500, 2)) / 1e1) helps. Anyway, snippet and figure below.
import numpy as np
from numpy.random import default_rng
from matplotlib.patches import Ellipse
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(data, a, b, h, k, A):
    x, y = data
    return ((((x - h) * np.cos(A) + (y - k) * np.sin(A)) / a) ** 2
            + (((x - h) * np.sin(A) - (y - k) * np.cos(A)) / b) ** 2)

rng = default_rng(3)
numPoints = 500
center = rng.random(2) * 10 - 5
theta = rng.uniform(0, 2 * np.pi, (numPoints, 1))
circle = np.hstack([np.cos(theta), np.sin(theta)])
ellipse = (circle.dot(rng.random((2, 2)) * 2 * np.pi - np.pi)
           + (center[0], center[1]) + rng.normal(0, 1, (500, 2)) / 1e1)

pp, pcov = curve_fit(func, (ellipse[:, 0], ellipse[:, 1]), np.ones(numPoints),
                     p0=(1, 1, center[0], center[1], np.pi / 2),
                     method='dogbox')

plt.scatter(ellipse[:, 0], ellipse[:, 1], label='Data Points')
plt.gca().add_patch(Ellipse(xy=(pp[2], pp[3]), width=2 * pp[0],
                            height=2 * pp[1], angle=pp[4] * 180 / np.pi,
                            fill=False))
plt.gca().set_aspect('equal')
plt.tight_layout()
plt.show()
To incorporate the value of the exponent, I have used your equation and generated an ellipse according to this answer. This results in:
import numpy as np
from numpy.random import default_rng
from matplotlib.patches import Ellipse
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit, root
from scipy.special import ellipeinc

def angles_in_ellipse(num, a, b):
    assert num > 0
    assert a < b
    angles = 2 * np.pi * np.arange(num) / num
    if a != b:
        e = (1.0 - a ** 2.0 / b ** 2.0) ** 0.5
        tot_size = ellipeinc(2.0 * np.pi, e)
        arc_size = tot_size / num
        arcs = np.arange(num) * arc_size
        res = root(lambda x: (ellipeinc(x, e) - arcs), angles)
        angles = res.x
    return angles

def func(data, a, b, c):
    x, y = data
    return (np.absolute(x) / a) ** c + (np.absolute(y) / b) ** c

a = 10
b = 20
n = 100

phi = angles_in_ellipse(n, a, b)
e = (1.0 - a ** 2.0 / b ** 2.0) ** 0.5
arcs = ellipeinc(phi, e)
noise = default_rng(0).normal(0, 1, n) / 2

pp, pcov = curve_fit(func, (b * np.sin(phi) + noise,
                            a * np.cos(phi) + noise),
                     np.ones(n), method='lm')

plt.scatter(b * np.sin(phi) + noise, a * np.cos(phi) + noise,
            label='Data Points')
plt.gca().add_patch(Ellipse(xy=(0, 0), width=2 * pp[0], height=2 * pp[1],
                            angle=0, fill=False))
plt.gca().set_aspect('equal')
plt.tight_layout()
plt.show()
As you decrease noise values, pp will tend to (b, a, 2).

Scipy optimize curve_fit gives different plots for same parameters when fitting custom function

I have a problem with fitting a custom function using scipy.optimize in Python, and I do not know why it is happening. I generate data from a centered and normalized binomial distribution (a Gaussian curve) and then fit a curve to it. The expected outcome is in the picture where I plot my function over the fitted data. But when I do the fitting, it fails.
I'm convinced it is a Python thing, because the fit should give the parameter a = 1 (that's how I define it), and it does, but then the fit is bad (see picture). However, if I change sigma to 0.65*sigma in:

p_halfg, p_halfg_cov = optimize.curve_fit(lambda x, a: piecewise_half_gauss(x, a, sigma=0.65*sigma_fit), x, y, p0=[1])

it gives an almost perfect fit (a is then 5/3, as predicted by the math). Those fits should be the same, and they are not! I give more comments below. Could you please tell me what is happening and where the problem could be?
(figure: plot with a = 1 and sigma = sigma_fit)
(figure: plot with sigma = 0.65*sigma_fit)
I generate data from a normalized binomial distribution (I can provide my code, but the values are more important now). It is a distribution with N = 10 and p = 0.5; I'm centering it and taking only the right side of the curve. Then I'm fitting it with my half-Gauss function, which should be the same distribution as the binomial if its parameter a = 1 (and the sigma is equal to the sigma of the distribution, sqrt(np(1-p))). Now the problem is first that it does not fit the data as shown in the picture, despite getting the correct value of parameter a.
Notice the weird behavior: if I set sigma = 3*sigma_fit, I get a = 1/3 and a very bad fit (underestimate). If I set it to 0.2*sigma_fit, I also get a bad fit and a = 1/0.2 = 5 (overestimate). And so on. Why? (btw. a = 1/sigma, so the fitting procedure should work).
import numpy as np
import matplotlib.pyplot as plt
import math
import scipy.optimize as optimize

pi = math.pi

# define my function
sigma_fit = 1

def piecewise_half_gauss(x, a, sigma=sigma_fit):
    """Half of a normal distribution curve, defined as a Gaussian centered at 0,
    with a constant value (the preexponential factor) for x < 0.
    Arguments: x values as an ndarray whose numbers MUST be float type
                   (use linspace or np.arange(start, end, step, dtype=float)),
               a as a parameter of the width of the distribution,
               sigma being the deviation, the second moment.
    Returns: half-Gaussian curve
    Ex:
    >>> piecewise_half_gauss(5., 1)
    array(0.04839414)
    >>> x = np.linspace(0,10,11)
    ... piecewise_half_gauss(x, 2, 3)
    array([0.06649038, 0.06557329, 0.0628972 , 0.05867755, 0.05324133,
           0.04698531, 0.04032845, 0.03366645, 0.02733501, 0.02158627,
           0.01657952])
    >>> piecewise_half_gauss(np.arange(0,11,1, dtype=float), 1, 2.4)
    array([1.66225950e-01, 1.52405153e-01, 1.17463281e-01, 7.61037856e-02,
           4.14488078e-02, 1.89766470e-02, 7.30345854e-03, 2.36286717e-03,
           6.42616248e-04, 1.46914868e-04, 2.82345875e-05])
    """
    return np.piecewise(x, [x >= 0, x < 0],
                        [lambda x: np.exp(-x ** 2 / (2 * ((a * sigma) ** 2))) / (np.sqrt(2 * pi) * sigma * a),
                         lambda x: 1 / (np.sqrt(2 * pi) * sigma)])

# Create normalized data for the binomial distribution Bin(N, p)
n = 10
p = 0.5
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0.25231325, 0.20657662, 0.11337165, 0.0417071, 0.01028484,
              0.00170007])

# Get the estimate for the sigma parameter
sigma_fit = (n*p*(1-p))**0.5

# Get fitting parameters
p_halfg, p_halfg_cov = optimize.curve_fit(
    lambda x, a: piecewise_half_gauss(x, a, sigma=sigma_fit), x, y, p0=[1])
print(sigma_fit, p_halfg, p_halfg_cov)

## Plot the result
# unpack fitting parameters
a = np.float64(p_halfg)
# unpack uncertainties in fitting parameters from the diagonal of the covariance matrix
# da = [np.sqrt(p_halfg_cov[j, j]) for j in range(p_halfg.size)]  # if we fit more parameters
da = np.float64(np.sqrt(p_halfg_cov[0]))

# create the fitted curve from the fitted parameters
f_fit = np.linspace(0, 10, 50)
y_fit = piecewise_half_gauss(f_fit, a)

# Create a figure window to plot the data
fig = plt.figure(1, figsize=(10, 10))
plt.scatter(x, y, color='r', label='Original points')
plt.plot(f_fit, y_fit, label='Fit')
plt.xlabel('My x values')
plt.ylabel('My y values')
plt.text(5.8, .25, r'a = {0:0.5f}$\pm${1:0.6f}'.format(a, da))
plt.legend()
However, if I plot it manually, it fits EXACTLY!

plt.scatter(x, y, c='r', label='Original points')
plt.plot(np.linspace(0, 5, 50), piecewise_half_gauss(np.linspace(0, 5, 50), 1, sigma_fit), label='Fit')
plt.legend()
EDIT -- solved:
It was a plotting problem; one needs to use

y_fit = piecewise_half_gauss(f_fit, a, sigma=0.6*sigma_fit)

The problem was consistency between fitting and plotting: if I fit with a different sigma, I also need to pass that sigma in the plotting section when I generate y_fit:

# Get fitting parameters
p_halfg, p_halfg_cov = optimize.curve_fit(
    lambda x, a: piecewise_half_gauss(x, a, sigma=0.6*sigma_fit), x, y, p0=[1])
...
y_fit = piecewise_half_gauss(f_fit, a, sigma=0.6*sigma_fit)

Fitting of experimental data within two different regions

I am fitting a set of experimental data (sample) that spans two different experimental regions and can be expressed with two mathematical functions, as follows:
1st region:
y = m*x + c (the slope can be constrained to zero)
2nd region:
y = d*exp(-k*x)
The experimental data is shown below, and I coded it in Python as follows:

import numpy as np
from scipy.optimize import curve_fit

def func(x, m, c, d, k):
    return m*x + c + d*np.exp(-k*x)

popt, pcov = curve_fit(func, t, y)

Unfortunately, my data is not fitted properly, and the returned parameters do not make sense (see picture below). Any assistance will be appreciated.
Very interesting question. As said by a_guest, you will have to fit to the two regions separately. However, I think you probably also want the two regions to connect smoothly at the point t0, the point where we switch from one model to the other. In order to do this, we need to add the constraint that y1 == y2 at the point t0.
In order to do this with scipy, look at scipy.optimize.minimize with the SLSQP method. However, I wrote a scipy wrapper to make this kind of thing easier, called symfit. I will show you how to do this with symfit, because I think it's better suited to the task, but with this example you should also be able to implement it with pure scipy if you prefer.
from symfit import parameters, variables, Fit, Piecewise, exp, Eq
import numpy as np
import matplotlib.pyplot as plt
t, y = variables('t, y')
m, c, d, k, t0 = parameters('m, c, d, k, t0')
# Help the fit by bounding the switchpoint between the models
t0.min = 0.6
t0.max = 0.9
# Make a piecewise model
y1 = m * t + c
y2 = d * exp(- k * t)
model = {y: Piecewise((y1, t <= t0), (y2, t > t0))}
# As a constraint, we demand equality between the two models at the point t0
# to do this, we substitute t -> t0 and demand equality using `Eq`
constraints = [Eq(y1.subs({t: t0}), y2.subs({t: t0}))]
# Read the data
tdata, ydata = np.genfromtxt('Experimental Data.csv', delimiter=',', skip_header=1).T
fit = Fit(model, t=tdata, y=ydata, constraints=constraints)
fit_result = fit.execute()
print(fit_result)
plt.scatter(tdata, ydata)
plt.plot(tdata, fit.model(t=tdata, **fit_result.params).y)
plt.show()
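For reference, here is a rough pure-scipy equivalent of the above (my own sketch, not from the original answer), using scipy.optimize.minimize with the SLSQP method and an equality constraint that makes the two pieces meet at t0; the starting values are guesses:

import numpy as np
from scipy.optimize import minimize

tdata, ydata = np.genfromtxt('Experimental Data.csv', delimiter=',', skip_header=1).T

def model(t, m, c, d, k, t0):
    # piecewise model: linear for t <= t0, exponential for t > t0
    return np.where(t <= t0, m * t + c, d * np.exp(-k * t))

def sse(params):
    # sum of squared residuals
    return np.sum((ydata - model(tdata, *params)) ** 2)

def continuity(params):
    # equality constraint: both pieces give the same value at t0
    m, c, d, k, t0 = params
    return (m * t0 + c) - d * np.exp(-k * t0)

res = minimize(sse, x0=[1.0, 1.0, 400.0, 1.0, 0.75], method='SLSQP',
               bounds=[(None, None)] * 4 + [(0.6, 0.9)],
               constraints=[{'type': 'eq', 'fun': continuity}])
m, c, d, k, t0 = res.x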
Since your data shows different behavior in different regions, you also need to fit the data on these different regions. That is, instead of fitting a sum of the two models (functions), you should fit y = m*x + c on the left region and, separately, y = d*exp(-k*x) on the right region. If you have trouble finding the boundary of the two regions, you could assess it by comparing the goodness of fit.

popt_1, pcov_1 = curve_fit(lambda x, m, c: m*x + c, t[t < 0.8], y[t < 0.8], p0=(1, 0))
popt_2, pcov_2 = curve_fit(lambda x, d, k: d*np.exp(-k*x), t[t >= 0.8], y[t >= 0.8], p0=(400, 1))
Edit
Example code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

df = pd.read_csv('test.csv', index_col=None)
t = df.t.values
y = df.Y.values

boundary = t[y.argmax()]
t1 = t[t < boundary]
y1 = y[t < boundary]
t2 = t[t >= boundary]
y2 = y[t >= boundary]

f1 = lambda x, m, c: m*x + c
f2 = lambda x, d, k: d*np.exp(-k*x)

popt_1, pcov_1 = curve_fit(f1, t1, y1, p0=((y1[-1] - y1[0]) / (t1[-1] - t1[0]), y1[0]))
popt_2, pcov_2 = curve_fit(f2, t2, y2, p0=(y2[0], 1))

plt.title('Fitted data on two different domains')
plt.xlabel('t [a.u.]')
plt.ylabel('y [a.u.]')
plt.plot(t, y, '-o', label='Data')
plt.plot(t1, f1(t1, *popt_1), '--', color='#ff7f0e', lw=3, label='Fit')
plt.plot(t2, f2(t2, *popt_2), '--', color='#ff7f0e', lw=3, label='_nolegend_')
plt.grid()
plt.legend()
plt.show()
Which produces the following plot:
Note that the resulting "compound" function is not continuous at the boundary. If that is undesired, you can resolve it by fixing one of the fit parameters (e.g. k) before fitting the other domain (one way or the other). Alternatively, you could fit both regions separately, then determine the value at the boundary as the average of the two separate functions (i.e. y_b = (f1(t1[-1], *popt_1) + f2(t2[0], *popt_2)) / 2), and then repeat the fitting while constraining the parameters such that this boundary condition is fulfilled.
For example, fitting the linear function first and then fixing the d parameter of the exponential in order to have a continuous transition at the boundary (note that the linear function f1 is extrapolated outside its domain at t2[0] in order to ensure the continuity):
f1 = lambda x, m, c: m*x + c
popt_1, pcov_1 = curve_fit(f1, t1, y1, p0=((y1[-1] - y1[0]) / (t1[-1] - t1[0]), y1[0]))
d = f1(t2[0], *popt_1)
f2 = lambda x, k: d*np.exp(-k*(x - boundary))
popt_2, pcov_2 = curve_fit(f2, t2, y2, p0=(1,))
Which produces the following plot:
If you would prefer to use a single equation, I found that the Hocket-Sherby equation "y = b - (b-a) * exp(-c * (x**d))" seems like an OK fit to your data, yielding an R-squared of 0.99 and RMSE of 11.2 with parameters a = 1.1262189756312683E+01, b = 3.2040596733114870E+02, c = 3.9385197507261771E-01, and d = -4.7723382040098095E+00
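A quick sketch of fitting that equation with curve_fit (my own illustration; t and y are the data arrays from the snippets above, the quoted parameter values serve as the initial guess, and t is assumed positive, since d is negative):

import numpy as np
from scipy.optimize import curve_fit

def hocket_sherby(x, a, b, c, d):
    # y = b - (b - a) * exp(-c * x**d); x must be > 0 because d is negative
    return b - (b - a) * np.exp(-c * x ** d)

p0 = (1.1262189756312683e+01, 3.2040596733114870e+02,
      3.9385197507261771e-01, -4.7723382040098095e+00)
popt, pcov = curve_fit(hocket_sherby, t, y, p0=p0, maxfev=10000)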

Fit a curve for data made up of two distinct regimes

I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but ideally I'd like to fit with a single continuous line - which should look like two lines joined by a smooth curve around the threshold (~5000 in the data, shown above).
I attempted this using scipy.optimize curve_fit, trying a function that includes the sum of a straight line and an exponential:

y = a*x + b + c*np.exp((x-d)/e)

but despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.
If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
    one = a*x + b
    two = c*x + d
    return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
    """hyperbola(x) with parameters
       a/b = asymptotic slope
         c = curvature at vertex
         d = offset to vertex
         e = vertical offset"""
    return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e

def rot_hyperbola(x, a, b, c, d, e, th):
    pars = a, b, c, 0, 0  # do the shifting after rotation
    xd = x - d
    hsin = hyperbola(xd, *pars)*np.sin(th)
    xcos = xd*np.cos(th)
    return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. This is because it's not a good descriptor of your data, which is linear - and an exponential is very far from linear!
def line_exp(x, a, b, c, d, e):
    return a*x + b + c*np.exp((x-d)/e)

e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))
I researched this a little; Applied Linear Regression by Sanford Weisberg and the Correlation and Regression lecture notes by Steiger had some good info on it. However, they all lack the right model. The piecewise function should be:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit

dfseg = pd.read_csv('segreg.csv')

def err(w):
    th0 = w['th0'].value
    th1 = w['th1'].value
    th2 = w['th2'].value
    gamma = w['gamma'].value
    fit = th0 + th1*dfseg.Temp + th2*np.maximum(0, dfseg.Temp - gamma)
    return fit - dfseg.C

p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)

b0 = mi.params['th0']; b1 = mi.params['th1']; b2 = mi.params['th2']
gamma = int(mi.params['gamma'].value)

import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma, gamma), data=dfseg).fit()
print(reslin.summary())

x0 = np.array(range(0, gamma, 1))
x1 = np.array(range(0, 80 - gamma, 1))
y0 = b0 + b1*x0
y1 = b0 + b1*float(gamma) + (b1 + b2)*x1

plt.scatter(dfseg.Temp, dfseg.C)
plt.plot(x0, y0)
plt.plot(x1 + gamma, y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
Graph
I used @user423805's answer (found via this Google Groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function, or adding (b1 + b2) as in @user423805's answer, I used the same linear-spline calculation for both the minimizer and end usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, two gamma params):
#
# for 0 < x < gamma0 : y = th0 + (th1 * x)
# for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0))
# for gamma1 < x : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))
#
def least_splines_calc(x, thresholds, gammas):
    if len(thresholds) < 2:
        print("Error: expected at least two thresholds")
        return None
    applicable_gammas = list(filter(lambda gamma: x > gamma, gammas))
    # base result
    y = thresholds[0] + (thresholds[1] * x)
    # additional terms calculated depending on the x value
    for i in range(0, len(applicable_gammas)):
        y = y + (thresholds[i + 2] * (x - applicable_gammas[i]))
    return y

def least_splines_calc_array(x_array, thresholds, gammas):
    y_array = list(map(lambda x: least_splines_calc(x, thresholds, gammas), x_array))
    return y_array

def err(params, x, data):
    th0 = params['th0'].value
    th1 = params['th1'].value
    th2 = params['th2'].value
    th3 = params['th3'].value
    gamma1 = params['gamma1'].value
    gamma2 = params['gamma2'].value
    thresholds = np.array([th0, th1, th2, th3])
    gammas = np.array([gamma1, gamma2])
    fit = least_splines_calc_array(x, thresholds, gammas)
    return np.array(fit) - np.array(data)

p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('th3', 0.0),
           ('gamma1', 9.), ('gamma2', 9.3))  # NOTE: the 9. / 9.3 were guesses specific to my data; you will need to change these

mi = lmfit.minimize(err, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into an array of thresholds and an array of gammas, to re-use least_splines_calc for plotting the linear-spline regression.
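A small sketch of that conversion (my own illustration; x_plot is an assumed plotting grid):

import numpy as np
import matplotlib.pyplot as plt

thresholds = np.array([mi.params[name].value for name in ('th0', 'th1', 'th2', 'th3')])
gammas = np.array([mi.params[name].value for name in ('gamma1', 'gamma2')])

x_plot = np.linspace(dfseg.Temp.min(), dfseg.Temp.max(), 200)  # assumed plotting grid
plt.plot(x_plot, least_splines_calc_array(x_plot, thresholds, gammas))
plt.scatter(dfseg.Temp, dfseg.C)
plt.show()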
Reference: while there are various places that explain least splines (I think @user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf, which has the (b1 + b2) addition I disagree with in its sample code, despite similar equations), the one that made the most sense to me was this one (by Rob Schapire / Zia Khan at Princeton): https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines.
If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta.
Such an hyperbola can be written y = beta_o + beta_1*(x - x_o) + beta_2*sqrt[(x - x_o)^2 + delta^2/4], where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1*x and line 2 is y_2 = b_2 + m_2*x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1*x*. So, to connect with the formalism above, x_o = x*, beta_o = y*, and the two m's are the two thetas.
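A minimal sketch of using that formula as a curve_fit model (my own illustration; x and y are the data from the question, and the starting values are rough guesses read off the two-line fit above):

import numpy as np
from scipy.optimize import curve_fit

def watts_bacon(x, beta0, theta1, theta2, x0, delta):
    # y = beta_o + beta_1*(x - x_o) + beta_2*sqrt((x - x_o)^2 + delta^2/4)
    beta1 = (theta1 + theta2) / 2
    beta2 = (theta2 - theta1) / 2
    return beta0 + beta1 * (x - x0) + beta2 * np.sqrt((x - x0) ** 2 + delta ** 2 / 4)

p0 = (130., .02, .2, 5000., 100.)  # assumed starting guesses
popt, pcov = curve_fit(watts_bacon, x, y, p0=p0, maxfev=10000)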
There is a straightforward method (not iterative, no initial guess) on pp. 12-13 of https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf.
The data comes from scanning the figure published by IanRoberts in his question. Scanning the pixel coordinates is not accurate, so don't be surprised by some additional deviation. Note that the abscissa and ordinate scales have been divided by 1000.
The equations of the two segments, and the approximate values of the five parameters, are written on the figure in the original answer.
