Measuring the similarity between two irregular plots - Python

I have two irregular lines as lists of [x, y] coordinates, which have peaks and troughs. The lengths of the lists may vary slightly (they are unequal). I want to measure their similarity, i.e. check whether peaks and troughs of similar depth or height occur at the proper intervals, and produce a similarity measure. I want to do this in Python. Is there any inbuilt function to do this?

I don't know of any built-in functions in Python to do this.
I can give you a list of possible functions in the Python ecosystem you can use. This is in no way a complete list of functions, and there are probably quite a few methods out there that I am not aware of.
If the data is ordered, but you don't know which data point is the first and which data point is last:
Use the directed Hausdorff distance
If the data is ordered, and you know the first and last points are correct:
Discrete Fréchet distance *
Dynamic Time Warping (DTW) *
Partial Curve Mapping (PCM) **
A Curve-Length distance metric (uses arc length distance from beginning to end) **
Area between two curves **
* General mathematical methods used in a variety of machine learning tasks
** Methods I've used to identify unique material hysteresis responses
First let's assume we have two sets of the exact same random X, Y data. Note that, in this case, all of these methods will return zero. You can install similaritymeasures from pip if you do not have it.
import numpy as np
from scipy.spatial.distance import directed_hausdorff
import similaritymeasures
import matplotlib.pyplot as plt
# Generate random experimental data
np.random.seed(121)
x = np.random.random(100)
y = np.random.random(100)
P = np.array([x, y]).T
# Generate an exact copy of P, Q, which we will use to compare
Q = P.copy()
dh, ind1, ind2 = directed_hausdorff(P, Q)
df = similaritymeasures.frechet_dist(P, Q)
dtw, d = similaritymeasures.dtw(P, Q)
pcm = similaritymeasures.pcm(P, Q)
area = similaritymeasures.area_between_two_curves(P, Q)
cl = similaritymeasures.curve_length_measure(P, Q)
# all methods will return 0.0 when P and Q are the same
print(dh, df, dtw, pcm, cl, area)
The printed output is
0.0, 0.0, 0.0, 0.0, 0.0, 0.0
This is because the curves P and Q are exactly the same!
Now let's assume P and Q are different.
# Generate random experimental data
np.random.seed(121)
x = np.random.random(100)
y = np.random.random(100)
P = np.array([x, y]).T
# Generate random Q
x = np.random.random(100)
y = np.random.random(100)
Q = np.array([x, y]).T
dh, ind1, ind2 = directed_hausdorff(P, Q)
df = similaritymeasures.frechet_dist(P, Q)
dtw, d = similaritymeasures.dtw(P, Q)
pcm = similaritymeasures.pcm(P, Q)
area = similaritymeasures.area_between_two_curves(P, Q)
cl = similaritymeasures.curve_length_measure(P, Q)
# the measures are now nonzero, since P and Q differ
print(dh, df, dtw, pcm, cl, area)
The printed output is
0.107, 0.743, 37.69, 21.5, 6.86, 11.8
which quantify how different P is from Q according to each method.
You now have many methods to compare the two curves. I would start with DTW, since this has been used in many time series applications which look like the data you have uploaded.
We can visualize what P and Q look like with the following code.
plt.figure()
plt.plot(P[:, 0], P[:, 1])
plt.plot(Q[:, 0], Q[:, 1])
plt.show()
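Note that DTW and the Fréchet distance do not require the two curves to have the same number of points, which suits your unequal-length lists. A small sketch with made-up curves of different lengths:
import numpy as np
import similaritymeasures
# Two made-up curves sampled at different numbers of points
t1 = np.linspace(0, 2*np.pi, 100)
t2 = np.linspace(0, 2*np.pi, 87)
P = np.column_stack((t1, np.sin(t1)))
Q = np.column_stack((t2, np.sin(t2) + 0.1))
dtw, d = similaritymeasures.dtw(P, Q)
df = similaritymeasures.frechet_dist(P, Q)
print(dtw, df)  # small values, since the curves nearly coincide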

Since your arrays are not the same size (and I am assuming they span the same real time interval), you need to interpolate them so you can compare them across a related set of points.
The following code does that, and calculates correlation measures:
#!/usr/bin/python
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import scipy.spatial.distance as ssd
import scipy.stats as ss
x = np.linspace(0, 10, num=11)
x2 = np.linspace(1, 11, num=13)
y = 2*np.cos(x) + 4 + np.random.random(len(x))
y2 = 2*np.cos(x2) + 5 + np.random.random(len(x2))
# Interpolating now, using linear, but you can do better based on your data
f = interp1d(x, y)
f2 = interp1d(x2, y2)
points = 15
xnew = np.linspace(min(x), max(x), num=points)
xnew2 = np.linspace(min(x2), max(x2), num=points)
ynew = f(xnew)
ynew2 = f2(xnew2)
plt.plot(x, y, 'r', x2, y2, 'g', xnew, ynew, 'r--', xnew2, ynew2, 'g--')
plt.show()
# Now compute correlations
print(ssd.correlation(ynew, ynew2))             # Distance measure based on correlation between the two vectors
print(np.correlate(ynew, ynew2, mode='valid'))  # Cross-correlation of same-sized arrays
print(np.corrcoef(ynew, ynew2))                 # Correlation matrix for the two arrays
print(ss.spearmanr(ynew, ynew2))                # Spearman rank correlation for the two arrays
Output:
0.499028272458
[ 363.48984942]
[[ 1.          0.50097173]
 [ 0.50097173  1.        ]]
SpearmanrResult(correlation=0.45357142857142857, pvalue=0.089485900143027278)
Remember that the first three measures are parametric, Pearson-type correlations, which assume a linear relationship between the arrays. If that is not the case, and you only expect the arrays to rise and fall together, you can use Spearman's rank correlation, as in the last example.
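To see the difference, consider a relationship that is monotone but far from linear: Pearson understates the association, while Spearman reports perfect rank agreement.
import numpy as np
import scipy.stats as ss
a = np.linspace(1, 10, 50)
b = np.exp(a)  # monotone but strongly non-linear in a
print(np.corrcoef(a, b)[0, 1])         # noticeably below 1
print(ss.spearmanr(a, b).correlation)  # exactly 1.0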

I'm not aware of an inbuilt function, but it sounds like you could adapt the Levenshtein distance. The following code is adapted from the code at Wikibooks.
def point_distance(p1, p2):
    # Define a distance between two [x, y] points; if they are the same, the
    # distance should be 0. Euclidean distance is one reasonable choice (an
    # assumption; pick whatever matches your notion of "similar" peaks).
    return ((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2) ** 0.5

def levenshtein_point(l1, l2):
    if len(l1) < len(l2):
        return levenshtein_point(l2, l1)
    # len(l1) >= len(l2)
    if len(l2) == 0:
        return len(l1)
    previous_row = range(len(l2) + 1)
    for i, p1 in enumerate(l1):
        current_row = [i + 1]
        for j, p2 in enumerate(l2):
            # previous_row and current_row are one entry longer than l2,
            # hence j+1 for the insertion cost
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + point_distance(p1, p2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
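For example, on two short made-up coordinate lists, a small total distance indicates similar curves:
curve_a = [[0, 1.0], [1, 2.5], [2, 1.2]]
curve_b = [[0, 1.1], [1, 2.4], [2, 1.3], [3, 1.0]]
print(levenshtein_point(curve_a, curve_b))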

Related

Poincare Section of a system of second order odes

This is the first time I am trying to write Poincaré section code in Python.
I borrowed the piece of code from here:
https://github.com/williamgilpin/rk4/blob/master/rk4_demo.py
and I have tried to run it for my system of second-order coupled ODEs. The problem is that I do not see what I was expecting to. Specifically, I need the Poincaré section at x=0 and px>0.
I believe that my implementation is not the best out there. I would like to:
Improve the way the initial conditions are chosen.
Apply the correct conditions (x=0 and px>0) in order to acquire the correct Poincaré section.
Create one plot with all the collected Poincaré section data, not four separate ones.
I would appreciate any help.
This is the code:
from matplotlib.pyplot import *
from scipy import *
from numpy import *
import numpy as np  # the functions below also use the np alias

# a simple Runge-Kutta integrator for multiple dependent variables and one independent variable
def rungekutta4(yprime, time, y0):
    # yprime is a function returning the vector of derivatives, y0 is a list of initial values of y
    # time is a list of t-values at which solutions are computed
    #
    # Dependency: numpy
    N = len(time)
    y = array([thing*ones(N) for thing in y0]).T
    for ii in range(N-1):
        dt = time[ii+1] - time[ii]
        k1 = dt*yprime(y[ii], time[ii])
        k2 = dt*yprime(y[ii] + 0.5*k1, time[ii] + 0.5*dt)
        k3 = dt*yprime(y[ii] + 0.5*k2, time[ii] + 0.5*dt)
        k4 = dt*yprime(y[ii] + k3, time[ii+1])
        y[ii+1] = y[ii] + (k1 + 2.0*(k2 + k3) + k4)/6.0
    return y

# Miscellaneous functions
n = 1.0/3.0
kappa1 = 0.1
kappa2 = 0.1
kappa3 = 0.1

def total_energy(valpair):
    (x, y, px, py) = tuple(valpair)
    return .5*(px**2 + py**2) + (1.0/(1.0*(n+1)))*(kappa1*np.absolute(x)**(n+1) + kappa2*np.absolute(y-x)**(n+1) + kappa3*np.absolute(y)**(n+1))

def pqdot(valpair, tval):
    # input: [x, y, px, py], t
    # takes the phase-space values and returns [xdot, ydot, pxdot, pydot] according to the Hamiltonian
    (x, y, px, py) = tuple(valpair)
    return np.array([px, py, -kappa1*np.sign(x)*np.absolute(x)**n + kappa2*np.sign(y-x)*np.absolute(y-x)**n, kappa2*np.sign(y-x)*np.absolute(y-x)**n - kappa3*np.sign(y)*np.absolute(y)**n]).T

def findcrossings(data, data1):
    # returns indices in a 1D data set where the data crossed zero. Useful for generating the Poincare map at 0
    prb = list()
    for ii in range(len(data)-1):
        if (((data[ii] > 0) and (data[ii+1] < 0)) or ((data[ii] < 0) and (data[ii+1] > 0))) and data1[ii] > 0:
            prb.append(ii)
    return array(prb)

t = linspace(0, 1000.0, 100000)
print("step size is " + str(t[1]-t[0]))

# Representative initial conditions for E=1
E = 1
x0 = 0
y0 = 0
init_cons = [[x0, y0, np.sqrt(2*E - (1.0*i/10.0)*(1.0*i/10.0) - 2.0/(n+1)*(kappa1*np.absolute(x0)**(n+1) + kappa2*np.absolute(y0-x0)**(n+1) + kappa3*np.absolute(y0)**(n+1))), 1.0*i/10.0] for i in range(-10, 11)]
outs = list()
for con in init_cons:
    outs.append(rungekutta4(pqdot, t, con))

# plot the results
fig1 = figure(1)
for ii in range(4):
    subplot(2, 2, ii+1)
    plot(outs[ii][:, 1], outs[ii][:, 3])
    ylabel("py")
    xlabel("y")
    title("Full trajectory projected onto the plane")
fig1.suptitle('Full trajectories E = 1', fontsize=10)

# Plot Poincare sections at x=0 and px>0
fig2 = figure(2)
for ii in range(4):
    subplot(2, 2, ii+1)
    xcrossings = findcrossings(outs[ii][:, 0], outs[ii][:, 3])
    yints = [.5*(outs[ii][cross, 1] + outs[ii][cross+1, 1]) for cross in xcrossings]
    pyints = [.5*(outs[ii][cross, 3] + outs[ii][cross+1, 3]) for cross in xcrossings]
    plot(yints, pyints, '.')
    ylabel("py")
    xlabel("y")
    title("Poincare section x = 0")
fig2.suptitle('Poincare Sections E = 1', fontsize=10)
show()
You need to compute the derivatives of the Hamiltonian correctly. The derivative of |y-x|^n with respect to x is
n*(x-y)*|x-y|^(n-2) = n*sign(x-y)*|x-y|^(n-1)
and the derivative with respect to y is almost, but not exactly (as in your code), the same,
n*(y-x)*|x-y|^(n-2) = n*sign(y-x)*|x-y|^(n-1),
note the sign difference. With this correction you can take larger time steps, and with correct linear interpolation probably even larger ones.
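A sketch of pqdot with that sign corrected (same constants and conventions as the code above):
def pqdot(valpair, tval):
    # [x, y, px, py] -> [xdot, ydot, pxdot, pydot] = [px, py, -dV/dx, -dV/dy]
    x, y, px, py = valpair
    dVdx = kappa1*np.sign(x)*np.absolute(x)**n - kappa2*np.sign(y-x)*np.absolute(y-x)**n
    dVdy = kappa2*np.sign(y-x)*np.absolute(y-x)**n + kappa3*np.sign(y)*np.absolute(y)**n
    return np.array([px, py, -dVdx, -dVdy])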
I changed the integration of the ODE to
t = linspace(0, 1000.0, 2000+1)
...
E_kin = E-total_energy([x0,y0,0,0])
init_cons = [[x0, y0, (2*E_kin-py**2)**0.5, py] for py in np.linspace(-10,10,8)]
outs = [odeint(pqdot, con, t, atol=1e-9, rtol=1e-8) for con in init_cons[:8]]
Obviously the number and parametrization of initial conditions may change.
The computation and display of the zero-crossings was changed to
def refine_crossing(a, b):
    tf = -a[0]/a[2]
    while abs(b[0]) > 1e-6:
        b = odeint(pqdot, a, [0, tf], atol=1e-8, rtol=1e-6)[-1]
        # Newton step using that b[0]=x(tf) and b[2]=x'(tf)
        tf -= b[0]/b[2]
    return [b[1], b[3]]
# Plot Poincare sections at x=0 and px>0
fig2 = figure(2)
for ii in range(8):
    #subplot(4, 2, ii+1)
    xcrossings = findcrossings(outs[ii][:, 0], outs[ii][:, 3])
    ycrossings = [refine_crossing(outs[ii][cross], outs[ii][cross+1]) for cross in xcrossings]
    yints, pyints = array(ycrossings).T
    plot(yints, pyints, '.')
    ylabel("py")
    xlabel("y")
    title("Poincare section x = 0")
and evaluated the result over a longer integration interval (the resulting plots are not reproduced here).

Fit a curve for data made up of two distinct regimes

I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but ideally I'd like to fit with a continuous line, which should look like two lines joined by a smooth curve around the threshold (~5000 in the data).
I attempted this using scipy.optimize.curve_fit, trying a function which included the sum of a straight line and an exponential:
y = a*x + b + c*np.exp((x-d)/e)
but despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.
If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
    one = a*x + b
    two = c*x + d
    return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
    """ hyperbola(x) with parameters
        a/b = asymptotic slope
         c  = curvature at vertex
         d  = offset to vertex
         e  = vertical offset
    """
    return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e

def rot_hyperbola(x, a, b, c, d, e, th):
    pars = a, b, c, 0, 0  # do the shifting after rotation
    xd = x - d
    hsin = hyperbola(xd, *pars)*np.sin(th)
    xcos = xd*np.cos(th)
    return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. This is because it's not a good descriptor of your data: the data is piecewise linear, and an exponential is very far from linear.
def line_exp(x, a, b, c, d, e):
    return a*x + b + c*np.exp((x-d)/e)
e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))
I researched this a little; Applied Linear Regression by Sanford Weisberg and the Correlation and Regression lecture by Steiger had some good info on it. They all, however, lack the right model: the piecewise function should be y = th0 + th1*x + th2*max(0, x - gamma), i.e. a second slope term that switches on past the breakpoint gamma, which is what the code below fits:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit

dfseg = pd.read_csv('segreg.csv')

def err(w):
    th0 = w['th0'].value
    th1 = w['th1'].value
    th2 = w['th2'].value
    gamma = w['gamma'].value
    fit = th0 + th1*dfseg.Temp + th2*np.maximum(0, dfseg.Temp - gamma)
    return fit - dfseg.C

p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)
b0 = mi.params['th0']; b1 = mi.params['th1']; b2 = mi.params['th2']
gamma = int(mi.params['gamma'].value)

import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma, gamma), data=dfseg).fit()
print(reslin.summary())

x0 = np.array(range(0, gamma, 1))
x1 = np.array(range(0, 80 - gamma, 1))
y0 = b0 + b1*x0
y1 = b0 + b1*float(gamma) + (b1 + b2)*x1
plt.scatter(dfseg.Temp, dfseg.C)
plt.plot(x0, y0)
plt.plot(x1 + gamma, y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
I used @user423805's answer (found via google groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU ) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function or adding (b1 + b2) as in @user423805's answer, I used the same linear spline calculation for both the minimizer and end usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, two gamma params):
#
# for          x < gamma0 : y = th0 + (th1 * x)
# for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0))
# for gamma1 < x          : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))
#
def least_splines_calc(x, thresholds, gammas):
    if len(thresholds) < 2:
        print("Error: expected at least two thresholds")
        return None
    applicable_gammas = [gamma for gamma in gammas if x > gamma]
    # base result
    y = thresholds[0] + (thresholds[1] * x)
    # additional terms, depending on which gammas x has passed
    for i in range(0, len(applicable_gammas)):
        y = y + (thresholds[i + 2] * (x - applicable_gammas[i]))
    return y

def least_splines_calc_array(x_array, thresholds, gammas):
    return [least_splines_calc(x, thresholds, gammas) for x in x_array]

def err(params, x, data):
    th0 = params['th0'].value
    th1 = params['th1'].value
    th2 = params['th2'].value
    th3 = params['th3'].value
    gamma1 = params['gamma1'].value
    gamma2 = params['gamma2'].value
    thresholds = np.array([th0, th1, th2, th3])
    gammas = np.array([gamma1, gamma2])
    fit = least_splines_calc_array(x, thresholds, gammas)
    return np.array(fit) - np.array(data)

p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('th3', 0.0), ('gamma1', 9.), ('gamma2', 9.3))  # NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
mi = lmfit.minimize(err, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into an array of thresholds and gammas to re-use least_splines_calc to plot the linear spline regression.
Reference: While there are various places that explain least splines (I think @user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , which has the (b1 + b2) addition I disagree with in its sample code despite similar equations), the one that made the most sense to me was the one by Rob Schapire / Zia Khan at Princeton: https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines.
If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta. Such an hyperbola can be written y = beta_o + beta_1*(x - x_o) + beta_2* SQRT[(x - x_o)^2 + delta^2/4], where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1 *x and line 2 is y_2 = b_2 + m_2 * x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1 * x*. So, to connect with the formalism above, x_o = x*, beta_o = y* and the two m_*'s are the two thetas.
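As a minimal sketch (the function name, parameter order, and starting guesses here are mine, not from the paper), the model drops straight into curve_fit:
import numpy as np
from scipy.optimize import curve_fit

def watts_bacon(x, beta0, beta1, beta2, x0, delta):
    # y = beta_o + beta_1*(x - x_o) + beta_2*sqrt((x - x_o)^2 + delta^2/4)
    return beta0 + beta1*(x - x0) + beta2*np.sqrt((x - x0)**2 + delta**2/4)

# beta1 = (theta_1 + theta_2)/2 and beta2 = (theta_2 - theta_1)/2, where the
# thetas are the asymptote slopes; x0 is the intersection abscissa and delta
# controls how sharply the curve turns at the transition.
# popt, pcov = curve_fit(watts_bacon, x, y, p0=(30., 0.1, 0.1, 5000., 500.))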
There is a straightforward method (not iterative, no initial guess) on pp. 12-13 of https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The data comes from scanning the figure published by IanRoberts in his question. Scanning for the coordinates of the pixels is not accurate, so don't be surprised by some additional deviation.
Note that the abscissa and ordinate scales have been divided by 1000.
The equations of the two segments and the approximate values of the five parameters are given in the figure of the referenced document (not reproduced here).

Failure of non-linear fit to sine curve

I've been trying to fit the amplitude, frequency and phase of a sine curve given some generated two dimensional toy data. (Code at the end)
To get estimates for the three parameters, I first perform an FFT. I use the values from the FFT as initial guesses for the actual frequency and phase and then fit for them (row by row). I wrote my code such that I input which bin of the FFT I want the frequency to be in, so I can check if the fitting is working well. But there's some pretty strange behaviour. If my input bin is say 3.1 (a non integral bin, so the FFT won't give me the right frequency) then the fit works wonderfully. But if the input bin is 3 (so the FFT outputs the exact frequency) then my fit fails, and I'm trying to understand why.
Here's the output when I give the input bins (in the X and Y direction) as 3.0 and 2.1 respectively, and the output when I give the input bins as 3.0 and 2.0 (plots not reproduced here; in each, the right-hand panel showed data minus fit).
Question: Why does the non-linear fit fail when I input the exact frequency of the curve?
Code:
#! /usr/bin/python
# For the purposes of this code, it's easier to think of the X-Y axes as transposed,
# so the X axis is vertical and the Y axis is horizontal
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optimize
import sys

PI = np.pi

# Function which accepts parameters to define a sine curve
# Used for the non-linear fit
def sineFit(t, a, f, p):
    return a * np.sin(2.0 * PI * f*t + p)

xSize = 18
ySize = 60
npt = xSize * ySize

# Get frequency bin from user input
xFreq = float(sys.argv[1])
yFreq = float(sys.argv[2])
xPeriod = xSize/xFreq
yPeriod = ySize/yFreq

# arrays should be defined here (sineGen and xDat are never defined in this post)

# Generate the 2D sine curve
for jj in range(0, xSize):
    for ii in range(0, ySize):
        sineGen[jj, ii] = np.cos(2.0*PI*(ii/xPeriod + jj/yPeriod))

# Compute 2-dim FFT as well as freq bins along each axis
fftData = np.fft.fft2(sineGen)
fftMean = np.mean(fftData)
fftRMS = np.std(fftData)
xFreqArr = np.fft.fftfreq(fftData.shape[1])  # Frequency bins along x
yFreqArr = np.fft.fftfreq(fftData.shape[0])  # Frequency bins along y

# Find peak of FFT, and position of peak
maxVal = np.amax(np.abs(fftData))
maxPos = np.where(np.abs(fftData) == maxVal)

# Iterate through peaks in the FFT
# For this example, number of loops will always be only one
prevPhase = -1000
for col, row in zip(maxPos[0], maxPos[1]):
    # Initial guesses for fit parameters from FFT
    init_phase = np.angle(fftData[col, row])
    init_amp = 2.0 * maxVal/npt
    init_freqY = yFreqArr[col]
    init_freqX = xFreqArr[row]
    cntr = 0
    if prevPhase == -1000:
        prevPhase = init_phase
    guess = [init_amp, init_freqX, prevPhase]
    # Fit each row of the 2D sine curve independently
    for rr in sineGen:
        (amp, freq, phs), pcov = optimize.curve_fit(sineFit, xDat, rr, guess)
        # xDat is a linspace array, containing a list of numbers from 0 to xSize-1
        # Subtract fit from original data and plot
        fitData = sineFit(xDat, amp, freq, phs)
        sub1 = rr - fitData
        # Plot
        fig1 = plt.figure()
        ax1 = fig1.add_subplot(121)
        p1, = ax1.plot(rr, 'g')
        p2, = ax1.plot(fitData, 'b')
        plt.legend([p1, p2], ["data", "fit"])
        ax2 = fig1.add_subplot(122)
        p3, = ax2.plot(sub1)
        plt.legend([p3], ['residual1'])
        fig1.tight_layout()
        plt.show()
        cntr += 1
        prevPhase = phs  # Update guess for phase of sine curve
I've tried to distill the important parts of your question into this answer.
First of all, try fitting a single block of data, not an array. Once you are confident that your model is sufficient you can move on.
Your fit is only going to be as good as your model; if you move on to something not "sine"-like, you'll need to adjust accordingly.
Fitting is an "art", in that the initial conditions can greatly change the convergence of the error function. In addition there may be more than one minimum in your fits, so you often have to worry about the uniqueness of your proposed solution.
While you were on the right track with your FFT idea, I think your implementation wasn't quite correct. The code below should be a great toy system. It generates random data of the type f(x) = a0*sin(a1*x + a2). Sometimes a random initial guess will work, sometimes it will fail spectacularly. However, using the FFT guess for the frequency, the convergence should always work for this system.
import numpy as np
import pylab as plt
import scipy.optimize as optimize

# This is your target function
def sineFit(t, params):
    a, f, p = params
    return a * np.sin(2.0*np.pi*f*t + p)

# This is our "error" function
def err_func(p0, X, Y, target_function):
    err = ((Y - target_function(X, p0))**2).sum()
    return err

# Try out different parameters, sometimes the random guess works
# sometimes it fails. The FFT solution should always work for this problem
initial_args = np.random.random(3)
X = np.linspace(0, 10, 1000)
Y = sineFit(X, initial_args)

# Use a random initial guess
initial_guess = np.random.random(3)

# Fit
sol = optimize.fmin(err_func, initial_guess, args=(X, Y, sineFit))

# Plot the fit
Y2 = sineFit(X, sol)
plt.figure(figsize=(15, 10))
plt.subplot(211)
plt.title("Random Initial Guess: Final Parameters: %s" % sol)
plt.plot(X, Y)
plt.plot(X, Y2, 'r', alpha=.5, lw=10)

# Use an improved "fft" guess for the frequency
# this will be the max in k-space
timestep = X[1] - X[0]
guess_k = np.argmax(np.abs(np.fft.rfft(Y)))
guess_f = np.fft.fftfreq(X.size, timestep)[guess_k]
initial_guess[1] = guess_f

# Guess the amplitude by taking the max of the absolute values
initial_guess[0] = np.abs(Y).max()

sol = optimize.fmin(err_func, initial_guess, args=(X, Y, sineFit))
Y2 = sineFit(X, sol)

plt.subplot(212)
plt.title("FFT Guess: Final Parameters: %s" % sol)
plt.plot(X, Y)
plt.plot(X, Y2, 'r', alpha=.5, lw=10)
plt.show()
The problem is due to a bad initial guess of the phase, not the frequency. While cycling through the rows of sineGen (inner loop) you use the fit result of the previous row as the initial guess for the next row, which does not always work. If you determine the phase from an FFT of the current row and use that as the initial guess, the fit will succeed.
You could change the inner loop as follows:
for n, rr in enumerate(sineGen):
    fftx = np.fft.fft(rr)
    fftx = fftx[:len(fftx)//2]
    idx = np.argmax(np.abs(fftx))
    init_phase = np.angle(fftx[idx])
    print(fftx[idx], init_phase)
    ...
...
Also you need to change

def sineFit(t, a, f, p):
    return a * np.sin(2.0 * np.pi * f*t + p)

to

def sineFit(t, a, f, p):
    return a * np.cos(2.0 * np.pi * f*t + p)

since phase=0 means that the imaginary part of the FFT is zero, and thus the function is cosine-like.
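A quick numerical check of that statement:
import numpy as np
# For a pure cosine the FFT peak has phase ~0; for a pure sine it is ~ -pi/2.
t = np.arange(64)
for wave in (np.cos, np.sin):
    spec = np.fft.rfft(wave(2*np.pi*4*t/64))
    print(wave.__name__, np.angle(spec[4]))  # cos -> ~0.0, sin -> ~ -1.5708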
By the way, your sample code above is still lacking definitions of sineGen and xDat.
Without understanding much of your code, according to http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat, sub1, guess2)
should become:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat, sub1, p0=guess2)
Assuming that tDat and sub1 are x and y, that should do the trick. But, once again, it is quite difficult to understand such complex code with so many interlinked variables and no comments at all. Code should always be built from the bottom up, meaning that you don't do a loop of fits when a single one is not working, and you don't add noise until the code fits the noise-free examples. Good luck!
By "nothing fancy" I meant something like removing EVERYTHING that is not related to the fit, and doing a simplified mock example such as:
import numpy as np
import scipy.optimize as optimize

def sineFit(t, a, f, p):
    return a * np.sin(2.0 * np.pi * f*t + p)

# Create array of x and y with given parameters
x = np.asarray(range(100))
y = sineFit(x, 1, 0.05, 0)

# Give a guess and fit, printing result of the fitted values
guess = [1., 0.05, 0.]
print(optimize.curve_fit(sineFit, x, y, guess)[0])
The result of this is exactly the answer:
[1. 0.05 0.]
But if you change the guess just slightly:
# Give a guess and fit, printing result of the fitted values
guess = [1., 0.06, 0.]
print(optimize.curve_fit(sineFit, x, y, guess)[0])
the result gives absurdly wrong numbers:
[ 0.00823701 0.06391323 -1.20382787]
Can you explain this behavior?
You can use curve_fit with a series of trigonometric functions, usually very robust and adjustable to the precision that you need just by increasing the number of terms... here is an example:
import numpy as np
import matplotlib.pyplot as plt
from numpy import sin, cos, linspace  # scipy no longer re-exports these names
from scipy.optimize import curve_fit

def f(x, a0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12,
         c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12):
    return a0 + s1*sin(1*x) + c1*cos(1*x) \
              + s2*sin(2*x) + c2*cos(2*x) \
              + s3*sin(3*x) + c3*cos(3*x) \
              + s4*sin(4*x) + c4*cos(4*x) \
              + s5*sin(5*x) + c5*cos(5*x) \
              + s6*sin(6*x) + c6*cos(6*x) \
              + s7*sin(7*x) + c7*cos(7*x) \
              + s8*sin(8*x) + c8*cos(8*x) \
              + s9*sin(9*x) + c9*cos(9*x) \
              + s10*sin(10*x) + c10*cos(10*x) \
              + s11*sin(11*x) + c11*cos(11*x) \
              + s12*sin(12*x) + c12*cos(12*x)

# Normalize x into a range suited to the trig basis; x and y are your data arrays
norm_factor = np.pi/2. / (x.max() - x.min())
x_norm = x * norm_factor

popt, pcov = curve_fit(f, x_norm, y)

x_fit = linspace(x_norm.min(), x_norm.max(), 1000)
y_fit = f(x_fit, *popt)
plt.plot(x_fit/norm_factor, y_fit)

Bézier curve fitting with SciPy

I have a set of points which approximate a 2D curve. I would like to use Python with numpy and scipy to find a cubic Bézier path which approximately fits the points, where I specify the exact coordinates of two endpoints, and it returns the coordinates of the other two control points.
I initially thought scipy.interpolate.splprep() might do what I want, but it seems to force the curve to pass through each one of the data points (as I suppose you would want for interpolation). I'll assume that I was on the wrong track with that.
My question is similar to this one: How can I fit a Bézier curve to a set of data?, except that they said they didn't want to use numpy. My preference would be to find what I need already implemented somewhere in scipy or numpy. Otherwise, I plan to implement the algorithm linked from one of the answers to that question, using numpy: An algorithm for automatically fitting digitized curves (PDF, page 622).
Thank you for any suggestions!
Edit: I understand that a cubic Bézier curve is not guaranteed to pass through all the points; I want one which passes through two given endpoints, and which is as close as possible to the specified interior points.
Here's a way to do Bezier curves with numpy:
import numpy as np
from scipy.special import comb

def bernstein_poly(i, n, t):
    """
     The Bernstein polynomial of n, i as a function of t
    """
    return comb(n, i) * (t**(n-i)) * (1 - t)**i

def bezier_curve(points, nTimes=1000):
    """
       Given a set of control points, return the
       bezier curve defined by the control points.

       points should be a list of lists, or list of tuples
       such as [ [1,1],
                 [2,3],
                 [4,5], ..[Xn, Yn] ]
       nTimes is the number of time steps, defaults to 1000

       See http://processingjs.nihongoresources.com/bezierinfo/
    """
    nPoints = len(points)
    xPoints = np.array([p[0] for p in points])
    yPoints = np.array([p[1] for p in points])

    t = np.linspace(0.0, 1.0, nTimes)

    polynomial_array = np.array([bernstein_poly(i, nPoints-1, t) for i in range(0, nPoints)])

    xvals = np.dot(xPoints, polynomial_array)
    yvals = np.dot(yPoints, polynomial_array)

    return xvals, yvals

if __name__ == "__main__":
    from matplotlib import pyplot as plt

    nPoints = 4
    points = np.random.rand(nPoints, 2)*200
    xpoints = [p[0] for p in points]
    ypoints = [p[1] for p in points]

    xvals, yvals = bezier_curve(points, nTimes=1000)
    plt.plot(xvals, yvals)
    plt.plot(xpoints, ypoints, "ro")
    for nr in range(len(points)):
        plt.text(points[nr][0], points[nr][1], nr)

    plt.show()
Here is a piece of python code for fitting points:
'''least square qbezier fit using penrose pseudoinverse
    >>> V = array
    >>> E, W, N, S = V((1,0)), V((-1,0)), V((0,1)), V((0,-1))
    >>> cw = 100
    >>> ch = 300
    >>> cpb = V((0, 0))
    >>> cpe = V((cw, 0))
    >>> xys = [cpb, cpb+ch*N+E*cw/8, cpe+ch*N+E*cw/8, cpe]
    >>>
    >>> ts = V(range(11), dtype='float')/10
    >>> M = bezierM(ts)
    >>> points = M*xys  # produces the points on the bezier curve at t in ts
    >>>
    >>> control_points = lsqfit(points, M)
    >>> linalg.norm(control_points-xys) < 10e-5
    True
    >>> control_points.tolist()[1]
    [12.500000000000037, 300.00000000000017]
'''
from numpy import array, linalg, matrix
from scipy.special import comb as nOk  # scipy.misc.comb was removed in newer SciPy
Mtk = lambda n, t, k: t**k * (1-t)**(n-k) * nOk(n, k)
bezierM = lambda ts: matrix([[Mtk(3, t, k) for k in range(4)] for t in ts])
def lsqfit(points, M):
    M_ = linalg.pinv(M)
    return M_ * points
Generally on bezier curves check out
Animated bezier and
bezierinfo
Building upon the answers from @reptilicus and @Guillaume P., here is the complete code to:
Get the Bezier parameters, i.e. the control points, from a list of points.
Create the Bezier curve from the Bezier parameters, i.e. the control points.
Plot the original points, the control points and the resulting Bezier curve.
Getting the Bezier parameters, i.e. the control points, from a set of X, Y points or coordinates. The other parameter needed is the degree of the approximation; the result will be (degree + 1) control points.
import numpy as np
from scipy.special import comb

def get_bezier_parameters(X, Y, degree=3):
    """ Least square qbezier fit using penrose pseudoinverse.

    Parameters:

    X: array of x data.
    Y: array of y data. Y[0] is the y point for X[0].
    degree: degree of the Bézier curve. 2 for quadratic, 3 for cubic.

    Based on https://stackoverflow.com/questions/12643079/b%C3%A9zier-curve-fitting-with-scipy
    and probably on the 1998 thesis by Tim Andrew Pastva, "Bézier Curve Fitting".
    """
    if degree < 1:
        raise ValueError('degree must be 1 or greater.')
    if len(X) != len(Y):
        raise ValueError('X and Y must be of the same length.')
    if len(X) < degree + 1:
        raise ValueError(f'There must be at least {degree + 1} points to '
                         f'determine the parameters of a degree {degree} curve. '
                         f'Got only {len(X)} points.')

    def bpoly(n, t, k):
        """ Bernstein polynomial when a = 0 and b = 1. """
        return t ** k * (1 - t) ** (n - k) * comb(n, k)

    def bmatrix(T):
        """ Bernstein matrix for Bézier curves. """
        return np.matrix([[bpoly(degree, t, k) for k in range(degree + 1)] for t in T])

    def least_square_fit(points, M):
        M_ = np.linalg.pinv(M)
        return M_ * points

    T = np.linspace(0, 1, len(X))
    M = bmatrix(T)
    points = np.array(list(zip(X, Y)))
    final = least_square_fit(points, M).tolist()
    # Pin the fitted end control points to the data end points
    final[0] = [X[0], Y[0]]
    final[len(final)-1] = [X[len(X)-1], Y[len(Y)-1]]
    return final
Create the Bezier curve given the Bezier Parameters i.e. control points.
def bernstein_poly(i, n, t):
    """
     The Bernstein polynomial of n, i as a function of t
    """
    return comb(n, i) * (t**(n-i)) * (1 - t)**i

def bezier_curve(points, nTimes=50):
    """
       Given a set of control points, return the
       bezier curve defined by the control points.

       points should be a list of lists, or list of tuples
       such as [ [1,1],
                 [2,3],
                 [4,5], ..[Xn, Yn] ]
       nTimes is the number of time steps, defaults to 50

       See http://processingjs.nihongoresources.com/bezierinfo/
    """
    nPoints = len(points)
    xPoints = np.array([p[0] for p in points])
    yPoints = np.array([p[1] for p in points])

    t = np.linspace(0.0, 1.0, nTimes)

    polynomial_array = np.array([bernstein_poly(i, nPoints-1, t) for i in range(0, nPoints)])

    xvals = np.dot(xPoints, polynomial_array)
    yvals = np.dot(yPoints, polynomial_array)

    return xvals, yvals
Sample data used (can be replaced with any data, this is GPS data).
points = []
xpoints = [19.21270, 19.21269, 19.21268, 19.21266, 19.21264, 19.21263, 19.21261, 19.21261, 19.21264, 19.21268,19.21274, 19.21282, 19.21290, 19.21299, 19.21307, 19.21316, 19.21324, 19.21333, 19.21342]
ypoints = [-100.14895, -100.14885, -100.14875, -100.14865, -100.14855, -100.14847, -100.14840, -100.14832, -100.14827, -100.14823, -100.14818, -100.14818, -100.14818, -100.14818, -100.14819, -100.14819, -100.14819, -100.14820, -100.14820]
for i in range(len(xpoints)):
    points.append([xpoints[i], ypoints[i]])
Plot the original points, the control points and the resulting Bezier Curve.
import matplotlib.pyplot as plt
# Plot the original points
plt.plot(xpoints, ypoints, "ro",label='Original Points')
# Get the Bezier parameters based on a degree.
data = get_bezier_parameters(xpoints, ypoints, degree=4)
x_val = [x[0] for x in data]
y_val = [x[1] for x in data]
print(data)
# Plot the control points
plt.plot(x_val,y_val,'k--o', label='Control Points')
# Plot the resulting Bezier curve
xvals, yvals = bezier_curve(data, nTimes=1000)
plt.plot(xvals, yvals, 'b-', label='B Curve')
plt.legend()
plt.show()
@keynesiancross asked for "comments in [Roland's] code as to what the variables are" and others completely missed the stated problem. Roland started with a Bézier curve as input (to get a perfect match), which made it harder to understand both the problem and (at least for me) the solution. The difference from interpolation is easier to see for input that leaves residuals. Here is both paraphrased code and non-Bézier input -- and an unexpected outcome.
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import comb as n_over_k
Mtk = lambda n, t, k: t**k * (1-t)**(n-k) * n_over_k(n,k)
BézierCoeff = lambda ts: [[Mtk(3,t,k) for k in range(4)] for t in ts]
fcn = np.log
tPlot = np.linspace(0. ,1. , 81)
xPlot = np.linspace(0.1,2.5, 81)
tData = tPlot[0:81:10]
xData = xPlot[0:81:10]
data = np.column_stack((xData, fcn(xData))) # shapes (9,2)
Pseudoinverse = np.linalg.pinv(BézierCoeff(tData)) # (9,4) -> (4,9)
control_points = Pseudoinverse.dot(data) # (4,9)*(9,2) -> (4,2)
Bézier = np.array(BézierCoeff(tPlot)).dot(control_points)
residuum = fcn(Bézier[:,0]) - Bézier[:,1]
fig, ax = plt.subplots()
ax.plot(xPlot, fcn(xPlot), 'r-')
ax.plot(xData, data[:,1], 'ro', label='input')
ax.plot(Bézier[:,0],
Bézier[:,1], 'k-', label='fit')
ax.plot(xPlot, 10.*residuum, 'b-', label='10*residuum')
ax.plot(control_points[:,0],
control_points[:,1], 'ko:', fillstyle='none')
ax.legend()
fig.show()
This works well for fcn = np.cos but not for log. I kind of expected that the fit would use the t-component of the control points as additional degrees of freedom, as we would do by dragging the control points:
manual_points = np.array([[0.1,np.log(.1)],[.27,-.6],[.82,.23],[2.5,np.log(2.5)]])
Bézier = np.array(BézierCoeff(tPlot)).dot(manual_points)
residuum = fcn(Bézier[:,0]) - Bézier[:,1]
fig, ax = plt.subplots()
ax.plot(xPlot, fcn(xPlot), 'r-')
ax.plot(xData, data[:,1], 'ro', label='input')
ax.plot(Bézier[:,0],
Bézier[:,1], 'k-', label='fit')
ax.plot(xPlot, 10.*residuum, 'b-', label='10*residuum')
ax.plot(manual_points[:,0],
manual_points[:,1], 'ko:', fillstyle='none')
ax.legend()
fig.show()
The cause of failure, I guess, is that the norm measures the distance between points on the curves instead of the distance between a point on one curve to the nearest point on the other curve.
Short answer: you don't, because that's not how Bezier curves work. Longer answer: have a look at Catmull-Rom splines instead. They're pretty easy to form (the tangent vector at any point P, barring start and end, is parallel to the line {P-1, P+1}, so they're easy to program, too) and always pass through the points that define them, unlike Bezier curves, which merely pass "somewhere" inside the convex hull set up by all the control points.
A Bezier curve isn't guaranteed to pass through every point you supply it with; control points are arbitrary (in the sense that there is no specific algorithm for finding them, you simply choose them yourself) and only pull the curve in a direction.
If you want a curve which will pass through every point you supply it with, you need something like a natural cubic spline, and due to the limitations of those (you must supply them with increasing x co-ordinates, or they tend to infinity), you'll probably want a parametric natural cubic spline (a sketch follows the tutorial links below).
There are nice tutorials here:
Cubic Splines
Parametric Cubic Splines
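For completeness, a small sketch (with made-up points) of a parametric natural cubic spline using scipy.interpolate.CubicSpline, parametrized by cumulative chord length so the x values need not be increasing:
import numpy as np
from scipy.interpolate import CubicSpline

pts = np.array([[0., 0.], [1., 2.], [0.5, 3.], [2., 4.]])  # made-up points
d = np.r_[0, np.cumsum(np.linalg.norm(np.diff(pts, axis=0), axis=1))]
t = d / d[-1]  # parameter in [0, 1]
sx = CubicSpline(t, pts[:, 0], bc_type='natural')
sy = CubicSpline(t, pts[:, 1], bc_type='natural')

tt = np.linspace(0, 1, 200)
curve = np.column_stack((sx(tt), sy(tt)))  # passes through every input point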
I had the same problem as detailed in the question. I took the code provided by Roland Puntaier and was able to make it work. Here (with the same imports as the earlier answer):
import numpy as np
from scipy.special import comb

def get_bezier_parameters(X, Y, degree=2):
    """ Least square qbezier fit using penrose pseudoinverse.

    Parameters:

    X: array of x data.
    Y: array of y data. Y[0] is the y point for X[0].
    degree: degree of the Bézier curve. 2 for quadratic, 3 for cubic.

    Based on https://stackoverflow.com/questions/12643079/b%C3%A9zier-curve-fitting-with-scipy
    and probably on the 1998 thesis by Tim Andrew Pastva, "Bézier Curve Fitting".
    """
    if degree < 1:
        raise ValueError('degree must be 1 or greater.')
    if len(X) != len(Y):
        raise ValueError('X and Y must be of the same length.')
    if len(X) < degree + 1:
        raise ValueError(f'There must be at least {degree + 1} points to '
                         f'determine the parameters of a degree {degree} curve. '
                         f'Got only {len(X)} points.')

    def bpoly(n, t, k):
        """ Bernstein polynomial when a = 0 and b = 1. """
        return t ** k * (1 - t) ** (n - k) * comb(n, k)

    def bmatrix(T):
        """ Bernstein matrix for Bézier curves. """
        return np.matrix([[bpoly(degree, t, k) for k in range(degree + 1)] for t in T])

    def least_square_fit(points, M):
        M_ = np.linalg.pinv(M)
        return M_ * points

    T = np.linspace(0, 1, len(X))
    M = bmatrix(T)
    points = np.array(list(zip(X, Y)))
    return least_square_fit(points, M).tolist()
To fix the end points of the curve, ignore the first and last parameter returned by the function and use your own points.
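For example, using the xpoints and ypoints from the sample data above:
params = get_bezier_parameters(xpoints, ypoints, degree=3)
params[0] = [xpoints[0], ypoints[0]]     # pin the first control point to the first data point
params[-1] = [xpoints[-1], ypoints[-1]]  # pin the last control point to the last data point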
What Mike Kamermans said is true, but I also wanted to point out that, as far as I know, Catmull-Rom splines can be defined in terms of cubic Béziers. So, if you only have a library that works with cubics, you should still be able to do Catmull-Rom splines (see the sketch after the links below):
http://schepers.cc/getting-to-the-point
https://github.com/DmitryBaranovskiy/raphael/blob/d8fbe4be81d362837f95e33886b80fb41de443b4/dev/raphael.core.js#L1021
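A sketch of that equivalence for the standard uniform parametrization (the function name is mine): each Catmull-Rom segment from p1 to p2 maps to one cubic Bézier.
import numpy as np

def catmull_rom_to_bezier(p0, p1, p2, p3):
    # Cubic Bezier control points for the Catmull-Rom segment from p1 to p2
    p0, p1, p2, p3 = (np.asarray(p) for p in (p0, p1, p2, p3))
    b0 = p1
    b1 = p1 + (p2 - p0) / 6.0
    b2 = p2 - (p3 - p1) / 6.0
    b3 = p2
    return b0, b1, b2, b3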

Python: two-curve gaussian fitting with non-linear least-squares

My knowledge of maths is limited, which is why I am probably stuck. I have a spectrum to which I am trying to fit two Gaussian peaks. I can fit to the largest peak, but I cannot fit to the smallest peak. I understand that I need to sum the Gaussian function for the two peaks, but I do not know where I have gone wrong. In my current output (image not reproduced here), the blue line is my data and the green line is my current fit. There is a shoulder to the left of the main peak in my data which I am currently trying to fit, using the following code:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import leastsq
from pylab import *

time = []
counts = []
for i in open('/some/folder/to/file.txt', 'r'):
    segs = i.split()
    time.append(float(segs[0]))
    counts.append(float(segs[1]))
time_array = arange(len(time), dtype=float)
counts_array = arange(len(counts), dtype=float)
time_array[0:] = time
counts_array[0:] = counts

def model(time_array0, coeffs0):
    a = coeffs0[0] + coeffs0[1] * np.exp(-((time_array0 - coeffs0[2])/coeffs0[3])**2)
    b = coeffs0[4] + coeffs0[5] * np.exp(-((time_array0 - coeffs0[6])/coeffs0[7])**2)
    c = a + b
    return c

def residuals(coeffs, counts_array, time_array):
    return counts_array - model(time_array, coeffs)

# 0 = baseline, 1 = amplitude, 2 = centre, 3 = width
peak1 = np.array([0, 6337, 16.2, 4.47, 0, 2300, 13.5, 2], dtype=float)
#peak2 = np.array([0, 2300, 13.5, 2], dtype=float)

x, flag = leastsq(residuals, peak1, args=(counts_array, time_array))
#z, flag = leastsq(residuals, peak2, args=(counts_array, time_array))

plt.plot(time_array, counts_array)
plt.plot(time_array, model(time_array, x), color='g')
#plt.plot(time_array, model(time_array, z), color='r')
plt.show()
This code worked for me, provided that you are only fitting a function that is a combination of two Gaussian distributions.
I just made a residuals function that adds two Gaussian functions and then subtracts them from the real data.
The parameters (p) that I passed to SciPy's least squares function include: the mean of the first Gaussian function (m), the difference in the mean between the first and second Gaussian functions (dm, i.e. the horizontal shift), the standard deviation of the first (sd1), and the standard deviation of the second (sd2).
import numpy as np
from scipy.optimize import leastsq
import matplotlib.pyplot as plt

######################################
# Setting up test data
def norm(x, mean, sd):
    norm = []
    for i in range(x.size):
        norm += [1.0/(sd*np.sqrt(2*np.pi))*np.exp(-(x[i] - mean)**2/(2*sd**2))]
    return np.array(norm)

mean1, mean2 = 0, -2
std1, std2 = 0.5, 1
x = np.linspace(-20, 20, 500)
y_real = norm(x, mean1, std1) + norm(x, mean2, std2)

######################################
# Solving
m, dm, sd1, sd2 = [5, 10, 1, 1]
p = [m, dm, sd1, sd2]  # Initial guesses for leastsq
y_init = norm(x, m, sd1) + norm(x, m + dm, sd2)  # For final comparison plot

def res(p, y, x):
    m, dm, sd1, sd2 = p
    m1 = m
    m2 = m1 + dm
    y_fit = norm(x, m1, sd1) + norm(x, m2, sd2)
    err = y - y_fit
    return err

plsq = leastsq(res, p, args=(y_real, x))

y_est = norm(x, plsq[0][0], plsq[0][2]) + norm(x, plsq[0][0] + plsq[0][1], plsq[0][3])

plt.plot(x, y_real, label='Real Data')
plt.plot(x, y_init, 'r.', label='Starting Guess')
plt.plot(x, y_est, 'g.', label='Fitted')
plt.legend()
plt.show()
You can use Gaussian mixture models from scikit-learn:
from sklearn import mixture
import matplotlib.pyplot
import matplotlib.mlab
import numpy as np
clf = mixture.GMM(n_components=2, covariance_type='full')
clf.fit(yourdata)
m1, m2 = clf.means_
w1, w2 = clf.weights_
c1, c2 = clf.covars_
histdist = matplotlib.pyplot.hist(yourdata, 100, normed=True)
plotgauss1 = lambda x: matplotlib.pyplot.plot(x, w1*matplotlib.mlab.normpdf(x, m1, np.sqrt(c1))[0], linewidth=3)
plotgauss2 = lambda x: matplotlib.pyplot.plot(x, w2*matplotlib.mlab.normpdf(x, m2, np.sqrt(c2))[0], linewidth=3)
plotgauss1(histdist[1])
plotgauss2(histdist[1])
You can also use the function below to fit as many Gaussians as you want with the ncomp parameter:
from sklearn import mixture
%pylab

def fit_mixture(data, ncomp=2, doplot=False):
    clf = mixture.GMM(n_components=ncomp, covariance_type='full')
    clf.fit(data)
    ml = clf.means_
    wl = clf.weights_
    cl = clf.covars_
    ms = [m[0] for m in ml]
    cs = [numpy.sqrt(c[0][0]) for c in cl]
    ws = [w for w in wl]
    if doplot:
        histo = hist(data, 200, normed=True)
        for w, m, c in zip(ws, ms, cs):
            plot(histo[1], w*matplotlib.mlab.normpdf(histo[1], m, np.sqrt(c)), linewidth=3)
    return ms, cs, ws
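Note that mixture.GMM was removed from scikit-learn long ago; a rough modern equivalent of the snippet above (assuming yourdata is a 1-D array of samples) would be:
import numpy as np
from sklearn.mixture import GaussianMixture

data = yourdata.reshape(-1, 1)  # GaussianMixture expects shape (n_samples, n_features)
clf = GaussianMixture(n_components=2, covariance_type='full').fit(data)
means = clf.means_.ravel()
stds = np.sqrt(clf.covariances_.ravel())  # covars_ is now covariances_
weights = clf.weights_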
Coeffs 0 and 4 are degenerate: there is absolutely nothing in the data that can decide between them. You should use a single zero-level parameter instead of two (i.e. remove one of them from your code). This is probably what is stopping your fit (ignore the comments here saying this is not possible; there are clearly at least two peaks in that data, and you should certainly be able to fit to that).
(It may not be clear why I am suggesting this, but what is happening is that coeffs 0 and 4 can cancel each other out. They can both be zero, or one could be 100 and the other -100; either way, the fit is just as good. This "confuses" the fitting routine, which spends its time trying to work out what they should be, when there is no single right answer, because whatever value one is, the other can just be the negative of that, and the fit will be the same.)
In fact, from the plot, it looks like there may be no need for a zero level at all. I would try dropping both of those and seeing how the fit looks.
Also, there is no need to fit coeffs 1 and 5 (or the zero point) in the least squares. Instead, because the model is linear in those, you could calculate their values each loop. This will make things faster, but is not critical. I just noticed you say your maths is not so good, so feel free to ignore this last point.
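A minimal sketch of that last idea (the helper names are mine): with the centres and widths held fixed, the model is linear in the amplitudes, so they can be solved for directly with linear least squares instead of being searched for by the non-linear fitter.
import numpy as np

def gaussian(x, centre, width):
    return np.exp(-((x - centre)/width)**2)

def best_amplitudes(x, y, c1, w1, c2, w2):
    # Columns are the two Gaussian shapes; solve y ~ a1*g1 + a2*g2 exactly
    A = np.column_stack((gaussian(x, c1, w1), gaussian(x, c2, w2)))
    amps, *_ = np.linalg.lstsq(A, y, rcond=None)
    return amps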
