I have a relatively complex plot which I would like to model. The model consists of two 'phases'/components which have been measured experimentally, however the ratio of these components is unknown. The data is shown in the image: where some combination of the orange and green curves make up the blue curve.
Complex curve and two components
The model is defined as:
Model = A * (z1 * 0.3 * y1 + z2 * 0.7 * y2) + D
where A is an overall scaling, z1 and z2 are the individual scalings for y1 and y2 which are the components. D is an offset.
I have tried to use the lmfit minimise function by defining the fit parameters and residual like so:
import lmfit
from lmfit import Minimizer, Parameters, report_fit
fit_params = Parameters()
fit_params.add('A', value=1.00, min=0, max=1)
fit_params.add('z1', value=1, min=0, max=1)
fit_params.add('z2', value=1, min=0, max=1)
fit_params.add('D', value=0.00, min=-5, max=5)
def residual(pars, x, data = total_scatter_expt):
    Data = total_scatter_expt
    Model = A * (z1 * 0.3 * y1 + z2 * 0.7 * y2) + D
    return model - data
Because y1 and y2 are also functions, it's not entirely clear to me what my x is in the residual or in the minimisation line. For this I have used the x values on the graph in the picture.
out = minimize(residual, fit_params, args=(x,))
print(fit_report(out))
This has led to the error
TypeError: can't multiply sequence by non-int of type 'float'
I'm not sure whether this problem would be best described as a deconvolution or perhaps fitting functions of functions. Any help would be appreciated.
You claim that your objective function is:
def residual(pars, x, data = total_scatter_expt):
    Data = total_scatter_expt
    Model = A * (z1 * 0.3 * y1 + z2 * 0.7 * y2) + D
    return model - data
There are a number of problems here:
'data = total_scatter_expt' means the default value for data will be the value of total_scatter_expt at the time of function definition (like when the text of this code is processed). Don't do that. Anyway, total_scatter_expt is not defined. This cannot be the code you actually ran.
You then set Data to total_scatter_expt. Again, this cannot be valid code.
You capitalize both Data and Model but then return model-data. Python is case sensitive so your code cannot be valid.
You access A, z1, y1, z2, y2, and D which are not defined and never use pars or x. Again, your code cannot be valid.
Almost certainly you want to extract the values for the parameters A, D, etc from pars, but you don't do this. Perhaps try
pvals = pars.valuesdict()
model = pvals['A'] * (pvals['z1'] ....
Probably you mean x to encompass y1 and/or y2.
In the future, give the code that actually shows the problem you are actually having. Do not lie about the error message you get: show the actual messages you get with the actual code you post.
curve_fit does not fit properly. I'm trying to fit experimental data with curve_fit. The data is imported from a .txt file into an array:
d = np.loadtxt("data.txt")
data_x = np.array(d[:, 0])
data_y = np.array(d[:, 2])
data_y_err = np.array(d[:, 3])
Since I know there must be two peaks, my model is a sum of two Gaussian curves:
def model_dGauss(x, xc, A, y0, w, dx):
    P = A/(w*np.sqrt(2*np.pi))
    mu1 = (x - (xc - dx/3))/(2*w**2)
    mu2 = (x - (xc + 2*dx/3))/(2*w**2)
    return y0 + P * ( np.exp(-mu1**2) + 0.5 * np.exp(-mu2**2))
The fit is very sensitive to my guess values. What is the point of fitting data if only nearly perfect starting parameters will provide a result? Or am I doing something completely wrong?
t = np.linspace(8.4, 10, 300)
guess_dG = [32, 1, 10, 0.1, 0.2]
popt, pcov = curve_fit(model_dGauss, data_x, data_y, p0=guess_dG, sigma=data_y_err, absolute_sigma=True)
A, xc, y0, w, dx = popt
Plotting the data
plt.scatter(data_x, data_y)
plt.plot(t, model_dGauss(t1,*popt))
plt.errorbar(data_x, data_y, yerr=data_y_err)
yields:
Plot result
The result is just a straight line at the bottom of my graph while the evaluated parameters are not that bad. How can that be?
A complete example of code is always appreciated (and, ahem, usually expected here on SO). To remove much of the confusion about using curve_fit here, allow me to suggest that you will have an easier time using lmfit (https://lmfit.github.io/lmfit-py) and especially its builtin model functions and its use of named parameters. With lmfit, your code for two Gaussians plus a constant offset might look like this:
from lmfit.models import GaussianModel, ConstantModel
# start with 1 Gaussian + Constant offset:
model = GaussianModel(prefix='p1_') + ConstantModel()
# this model will have parameters named:
# p1_amplitude, p1_center, p1_sigma, and c.
# here we give initial values to these parameters
params = model.make_params(p1_amplitude=10, p1_center=32, p1_sigma=0.5, c=10)
# optionally place bounds on parameters (probably not needed here):
params['p1_amplitude'].min = 0.
## params['p1_center'].vary = False # fix a parameter from varying in fit
# now do the fit (including weighting residual by 1/y_err):
result = model.fit(data_y, params, x=data_x, weights=1.0/data_y_err)
# print out param values, uncertainties, and fit statistics, or get best-fit
# parameters from `result.params`
print(result.fit_report())
# plot results
plt.errorbar(data_x, data_y, yerr=data_y_err, label='data')
plt.plot(data_x, result.best_fit, label='best fit')
plt.legend()
plt.show()
To add a second Gaussian, you could just do
model = GaussianModel(prefix='p1_') + GaussianModel(prefix='p2_') + ConstantModel()
# and then:
params = model.make_params(p1_amplitude=10, p1_center=32, p1_sigma=0.5, c=10,
                           p2_amplitude=2, p2_center=31.75, p2_sigma=0.5)
and so on.
Your model has the two Gaussians sharing, or at least having "linked", values: the sigma values should be the same for the two peaks, and the amplitude of the 2nd is half that of the 1st. As defined so far, the 2-Gaussian model has all the parameters being independent. But lmfit has a mechanism for setting constraints on any parameter by giving an algebraic expression in terms of other parameters. So, for example, you could say
params['p2_sigma'].expr = 'p1_sigma'
params['p2_amplitude'].expr = 'p1_amplitude / 2.0'
Now, p2_amplitude and p2_sigma will not be independently varied in the fit but will be constrained to have those values.
I was recently trying to plot a nonlinear decision boundary, and the function ended up being a partially horizontal hyperbola, where there were multiple y-values for a given x. Although I got it to work, I know there has to be a more pythonic or numpythonic way of plotting this line.
Background: The problem was a perceptron classifier on a set of inputs that were not linearly separable. In order to find this, the inputs were mapped to a general hyperbola function to increase the dimensionality to 5, and have these separable by a hyperplane. The equation for the decision boundary that will be plotted is
d(x) = w0 + w1xx + w2yy + w3xy + w4x + w5y
Through the course of the perceptron's gradient descent, the values for w0-w5 are found, and the boundary is the x,y value when d(x)=0.
Current implementation: I got it to work, but I think it is hacky. I first have to create an array of the given size so that I can append these values, and I have to delete the initialized value the first time I append my found value. I then sweep through the space on my graph and find a y-value, first by guessing high, second by guessing low, in order to find both possible y-values. I put these found values at the front and back of D, in order to plot this using matplotlib.
D = np.array([[0], [0]])
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
a_iter, b_iter = 0, 0 # used as initial guess for numeric solver
for xx in range(x_min, x_max):
    # used to print top and bottom sides of hyperbola
    yya = fsolve(lambda yy: W[:,0] + W[:,1]*xx**2 + W[:,2]*yy**2 + W[:,3]*xx*yy + W[:,4]*xx + W[:,5]*yy, max(a_iter, 7))
    yyb = fsolve(lambda yy: W[:,0] + W[:,1]*xx**2 + W[:,2]*yy**2 + W[:,3]*xx*yy + W[:,4]*xx + W[:,5]*yy, b_iter)
    a_iter = yya
    b_iter = yyb
    # add these points to a single matrix for printing
    dda = np.array([[xx],[yya]])
    ddb = np.array([[xx],[yyb]])
    D = np.concatenate((dda, D), axis=1)
    if xx == x_min: # delete initial [0; 0]
        D = dda
    D = np.concatenate((D, ddb), axis=1)
I know there has to be a better way to do this. Any insight is appreciated.
Edit: Apologies, I realize that without an image this is really difficult to understand. The main issues of finding multiple roots and populating a numpy array are a bit generic. I don't have enough rep to post images, but the link is below
nonlinear classifier
If you want to plot an implicit equation curve, you can use pyplot.contour(). Here is an example:
import numpy as np
import matplotlib.pyplot as pl
np.random.seed(1)
w = np.random.randn(6)
def f(x, y, w):
    return w[0] + w[1]*x**2 + w[2]*y**2 + w[3]*x*y + w[4]*x + w[5]*y
# evaluate f on a grid and draw only the f == 0 level
X, Y = np.mgrid[-2:2:100j, -2:2:100j]
pl.contour(X, Y, f(X, Y, w), levels=[0])
pl.show()
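If the points themselves are needed rather than just a plot, note that for a fixed x the boundary equation is a quadratic in y, so both branches can be written in closed form. This swaps in the ordinary quadratic formula for the fsolve sweep in the question; it is a sketch that assumes w is a 1-D array of six weights with w[2] != 0, and the function name boundary_branches and the example weights (the hyperbola x^2 - y^2 = 1) are hypothetical.

```python
import numpy as np

def f(x, y, w):
    """Decision function d(x, y) from the question."""
    return w[0] + w[1]*x**2 + w[2]*y**2 + w[3]*x*y + w[4]*x + w[5]*y

def boundary_branches(x, w):
    """Solve f(x, y, w) = 0 for y at each x; NaN where no real root exists.

    Rearranged as a*y**2 + b*y + c = 0 with
    a = w[2], b = w[3]*x + w[5], c = w[0] + w[1]*x**2 + w[4]*x.
    Assumes a != 0.
    """
    x = np.asarray(x, dtype=float)
    a = w[2]
    b = w[3] * x + w[5]
    c = w[0] + w[1] * x**2 + w[4] * x
    disc = b**2 - 4 * a * c
    root = np.sqrt(np.where(disc >= 0, disc, np.nan))  # NaN marks "no real y"
    return (-b + root) / (2 * a), (-b - root) / (2 * a)

# Example weights for x**2 - y**2 - 1 = 0 (made up for illustration).
w = np.array([-1.0, 1.0, -1.0, 0.0, 0.0, 0.0])
x = np.linspace(-3, 3, 13)
y_plus, y_minus = boundary_branches(x, w)
```

Matplotlib leaves gaps at NaN values, so plotting (x, y_plus) and (x, y_minus) directly draws each branch without any array bookkeeping.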
There are parameterized options too, e.g. a trig one, with branches centered at 0 and pi:
t = np.linspace(-np.pi/3, np.pi/3, 200) # 0 centered branch
y = 1/np.cos(t)
x = 1*np.tan(t)
plt.plot(x, y) # (default blue)
t = np.linspace(np.pi-np.pi/3, np.pi+np.pi/3, 200) # pi centered branch
y = 1/np.cos(t)
x = 1*np.tan(t)
plt.plot(x, y) # (default orange)
sympy ought to be up to finding the full denormalized, rotated, offset parameterized hyperbola coefficients from the bivariate polynomial ws
(or continue the hackage with a fit)
I am trying to run a least squares algorithm using numpy and am having trouble. Can someone please tell me what I am doing wrong in the given code? When I set y to be y = np.power(X, 1) + np.random.rand(20)*3 or some other reasonable function of x, everything works fine. But for the particular y defined by the values given below, the plot I am getting is senseless.
Is this some kind of numerical problem?
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
X = np.arange(1,21)
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
#y = np.power(X, 1) + np.random.rand(20)*3
w = np.linalg.lstsq(X.reshape(20, 1), y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X*w[0], 'blue')
plt.show()
Are you sure there is a linear relationship between what you are fitting and the y variable data?
Using the code (y = np.power(X, 1) + np.random.rand(20)*3) from your example, you have a linear relationship built into the y variable itself (with some noise) which allows your plot to track relatively well with the linear equation.
X = np.arange(1,21)
y = np.power(X, 1) + np.random.rand(20)*3
w = np.linalg.lstsq(X.reshape(20, 1), y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X*w[0], 'blue')
plt.show()
However, when you switch to something like your y variable
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
You end up with something less easy to fit.
Looking at the documentation, if you are attempting to fit this set of values, you will need to build in a constant (intercept) component, which lstsq does not do by default.
The docs state for lstsq
Return the least-squares solution to a linear matrix equation.
Solves the equation a x = b
If you really want to fit the data to a linear equation, running code like the below will give you something that almost matches your original data. However, the data behind this process seems to have a polynomial/exponential driver, which would make polyfit a better choice.
X = np.arange(1,21)
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
#y = np.power(X, 1) + np.random.rand(20)*3
X2 = np.vstack([X, np.ones(len(X))]).T
w = np.linalg.lstsq(X2, y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X.dot(w[0])+w[1], 'blue')
plt.show()
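To make the comparison concrete, the sketch below fits the same y values three ways and compares the residual sums of squares: slope only (the question's model), slope plus intercept, and a quadratic via np.polyfit. The variable names are illustrative, and the choice of degree 2 is an assumption made for the example, not something the data dictates.

```python
import numpy as np

X = np.arange(1, 21)
y = np.array([-0.00454712, -0.00457764, -0.0045166, -0.00442505, -0.00427246,
              -0.00411987, -0.00378418, -0.003479, -0.00314331, -0.00259399,
              -0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
              0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])

# Slope only: y ~ w*x, a line forced through the origin.
w_origin = np.linalg.lstsq(X.reshape(-1, 1), y, rcond=None)[0]
rss_origin = np.sum((X * w_origin[0] - y) ** 2)

# Slope + intercept: add a column of ones to the design matrix.
A = np.vstack([X, np.ones(len(X))]).T
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
rss_line = np.sum((slope * X + intercept - y) ** 2)

# np.polyfit includes the constant term automatically.
coef1 = np.polyfit(X, y, 1)   # same straight line as above
coef2 = np.polyfit(X, y, 2)   # quadratic, for comparison
rss_quad = np.sum((np.polyval(coef2, X) - y) ** 2)

print(rss_origin, rss_line, rss_quad)
```

The through-origin fit is the one that looks senseless in the question's plot: the data start near -0.0045 at x = 1, which a line through the origin cannot reproduce.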
I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but I'd like to fit with a continuous line ideally - which should look like two lines with a smooth curve joining them around the threshold (~5000 in the data, shown above).
I attempted this using scipy.optimize curve_fit and trying a function which included the sum of a straight line and an exponential:
y = a*x + b + c*np.exp((x-d)/e)
although despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.
If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
    one = a*x + b
    two = c*x + d
    return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
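Since the pastebin data may not be at hand, here is a self-contained sketch of the same fit on synthetic, noise-free data with a kink placed at x = 5000. The slopes, intercepts and starting guess are invented for illustration. As with the real data, the starting guess should put the crossover inside the data range; otherwise one of the two lines never touches any point and its parameters cannot be adjusted.

```python
import numpy as np
from scipy.optimize import curve_fit

def two_lines(x, a, b, c, d):
    """Upper envelope of the lines a*x + b and c*x + d."""
    return np.maximum(a * x + b, c * x + d)

# Synthetic data: shallow line below x = 5000, steep line above it.
x = np.linspace(0, 10000, 201)
y = two_lines(x, 0.02, 30.0, 0.2, -870.0)

pw0 = (0.03, 0.0, 0.15, -700.0)   # rough guess: slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)

# The lines a*x + b and c*x + d meet where x = (d - b) / (a - c).
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
print(pw, crossover)
```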
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
    """ hyperbola(x) with parameters
        a/b = asymptotic slope
        c = curvature at vertex
        d = offset to vertex
        e = vertical offset
    """
    return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e
def rot_hyperbola(x, a, b, c, d, e, th):
    pars = a, b, c, 0, 0 # do the shifting after rotation
    xd = x - d
    hsin = hyperbola(xd, *pars)*np.sin(th)
    xcos = xd*np.cos(th)
    return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. This is because it's not a good descriptor of your data, which is linear on both sides of the threshold; an exponential is very far from linear!
def line_exp(x, a, b, c, d, e):
    return a*x + b + c*np.exp((x-d)/e)
e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))
I researched this a little; Applied Linear Regression by Sanford and the Correlation and Regression lecture by Steiger had some good info on it. However, they all lack the right model. The piecewise function should be C = th0 + th1*Temp + th2*max(0, Temp - gamma), which is what err computes here:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit
dfseg = pd.read_csv('segreg.csv')
def err(w):
    th0 = w['th0'].value
    th1 = w['th1'].value
    th2 = w['th2'].value
    gamma = w['gamma'].value
    fit = th0 + th1*dfseg.Temp + th2*np.maximum(0,dfseg.Temp-gamma)
    return fit-dfseg.C
p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)
b0 = mi.params['th0']; b1=mi.params['th1'];b2=mi.params['th2']
gamma = int(mi.params['gamma'].value)
import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma,gamma), data=dfseg).fit()
print(reslin.summary())
x0 = np.array(range(0,gamma,1))
x1 = np.array(range(0,80-gamma,1))
y0 = b0 + b1*x0
y1 = (b0 + b1 * float(gamma) + (b1 + b2)* x1)
plt.scatter(dfseg.Temp, dfseg.C)
plt.plot(x0,y0)
plt.plot(x1+gamma,y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
Graph
I used @user423805's answer (found via google groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU ) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function or adding (b1 + b2) as in @user423805's answer, I used the same linear spline calculation for both the minimizer and end-usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, two gamma params):
#
# for 0 < x < gamma0 : y = th0 + (th1 * x)
# for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0))
# for gamma1 < x : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))
#
def least_splines_calc(x, thresholds, gammas):
    if len(thresholds) < 2:
        print("Error: expected at least two thresholds")
        return None
    # build a list (not a lazy filter object) so len() and indexing work on Python 3;
    # this assumes gammas is sorted in ascending order
    applicable_gammas = [gamma for gamma in gammas if x > gamma]
    # base result
    y = thresholds[0] + (thresholds[1] * x)
    # additional factors calculated depending on x value
    for i in range(len(applicable_gammas)):
        y = y + (thresholds[i + 2] * (x - applicable_gammas[i]))
    return y
def least_splines_calc_array(x_array, thresholds, gammas):
    # a list comprehension, so np.array() on the result gives a numeric array
    return [least_splines_calc(x, thresholds, gammas) for x in x_array]
def err(params, x, data):
    th0 = params['th0'].value
    th1 = params['th1'].value
    th2 = params['th2'].value
    th3 = params['th3'].value
    gamma1 = params['gamma1'].value
    gamma2 = params['gamma2'].value
    thresholds = np.array([th0, th1, th2, th3])
    gammas = np.array([gamma1, gamma2])
    fit = least_splines_calc_array(x, thresholds, gammas)
    return np.array(fit)-np.array(data)
p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('th3', 0.0),('gamma1', 9.),('gamma2', 9.3)) #NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
mi = lmfit.minimize(err, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into an array of thresholds and gammas to re-use least_splines_calc to plot the linear splines regression.
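The same linear spline can also be evaluated on a whole array at once by writing each extra segment as a hinge term, th_k * max(0, x - gamma_k). This is a sketch of an equivalent vectorized form rather than the code used above; the name linear_spline and the example numbers are made up. It gives the same values as the element-wise calculation when the gammas are sorted in ascending order, and unlike that version it does not depend on their order.

```python
import numpy as np

def linear_spline(x, thresholds, gammas):
    """Evaluate th0 + th1*x + sum_k th_(k+2) * max(0, x - gamma_k) on an array.

    thresholds: [th0, th1, th2, ...]; gammas: one knot per extra threshold.
    """
    x = np.asarray(x, dtype=float)
    y = thresholds[0] + thresholds[1] * x
    for th, gamma in zip(thresholds[2:], gammas):
        # hinge term: zero left of the knot, linear to the right of it
        y = y + th * np.maximum(0.0, x - gamma)
    return y

# Three segments with knots at 1 and 2 (illustrative values only).
thresholds = [1.0, 2.0, 3.0, -4.0]
gammas = [1.0, 2.0]
y_example = linear_spline([0.0, 1.5, 3.0], thresholds, gammas)
print(y_example)
```

The function can be called both inside the minimizer's residual and afterwards for plotting, which keeps the fitted model and the plotted model identical.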
Reference: While there are various places that explain least splines (I think @user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , which has the (b1 + b2) addition I disagree with in its sample code despite similar equations), the one that made the most sense to me was this one (by Rob Schapire / Zia Khan at Princeton): https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines.
If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta.
Such an hyperbola can be written
y = beta_o + beta_1*(x - x_o) + beta_2*SQRT[(x - x_o)^2 + delta^2/4],
where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1 *x and line 2 is y_2 = b_2 + m_2 * x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1 * x*. So, to connect with the formalism above, x_o = x*, beta_o = y* and the two m_*'s are the two thetas.
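A short sketch of that formula in code, assuming the five quantities are passed as plain numbers. The function names bent_hyperbola and intersection, and the two example lines, are hypothetical and chosen only to illustrate the recipe; they are not taken from the paper.

```python
import numpy as np

def bent_hyperbola(x, x_o, beta_o, theta_1, theta_2, delta):
    """Hyperbola with asymptote slopes theta_1 (left) and theta_2 (right),
    asymptotes crossing at (x_o, beta_o); delta controls the bend radius."""
    beta_1 = (theta_1 + theta_2) / 2.0
    beta_2 = (theta_2 - theta_1) / 2.0
    return beta_o + beta_1 * (x - x_o) + beta_2 * np.sqrt((x - x_o) ** 2 + delta ** 2 / 4.0)

def intersection(b_1, m_1, b_2, m_2):
    """Crossing point of y = b_1 + m_1*x and y = b_2 + m_2*x."""
    x_star = (b_2 - b_1) / (m_1 - m_2)
    return x_star, b_1 + m_1 * x_star

# Two example lines (made-up numbers): y = 30 + 0.02*x and y = -870 + 0.2*x.
x_o, beta_o = intersection(30.0, 0.02, -870.0, 0.2)

x = np.linspace(0, 10000, 201)
sharp = bent_hyperbola(x, x_o, beta_o, 0.02, 0.2, delta=0.0)      # follows the lines exactly
smooth = bent_hyperbola(x, x_o, beta_o, 0.02, 0.2, delta=1000.0)  # rounded corner
```

With delta = 0 the square root reduces to |x - x_o| and the curve is exactly the two lines; increasing delta rounds the corner while the asymptotes stay the same. All five quantities could also be left free and passed to a fitter such as curve_fit.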
There is a straightforward method (not iterative, no initial guess) pp.12-13 in https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The data comes from scanning the figure published by IanRoberts in his question. Reading coordinates off the pixels is not accurate, so don't be surprised by some additional deviation.
Note that the abscissa and ordinate scales have been divided by 1000.
The equations of the two segments are
The approximate values of the five parameters are written on the above figure.