How to compute and plot a LOWESS curve in Python?

How can I find and plot a LOWESS curve that looks like the following using Python?
I'm aware of the LOWESS implementation in statsmodels, but it doesn't seem to be able to give me 95% confidence interval lines that I can shade between. Seaborn has a method that calls the statsmodels implementation, but it can't plot the confidence intervals.
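For reference, these are the calls I mean (x and y stand for my data arrays); as far as I can tell, the statsmodels smoother returns the fitted curve but no band, and seaborn only draws the line when lowess=True:

import statsmodels.api as sm
import seaborn as sns

smoothed = sm.nonparametric.lowess(y, x, frac=1./3.)  # two columns: sorted x, smoothed y -- no interval
sns.regplot(x=x, y=y, lowess=True)                    # draws the LOWESS line, but no shaded interval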
Other StackOverflow answers give code to draw a LOESS/LOWESS line, but none with a confidence interval. Can anyone assist with this? Is anyone aware of an existing implementation that would enable me to do this?
Thanks in advance.

I found the link here useful, and I put the code below:
def lowess(x, y, f=1./3.):
    """
    Basic LOWESS smoother with uncertainty.
    Note:
    - Not robust (so no iteration) and
      only normally distributed errors.
    - No higher order polynomials, d=1,
      so a linear smoother.
    """
    # get some parameters
    xwidth = f*(x.max()-x.min())  # effective width after reduction factor
    N = len(x)                    # number of observations
    # Don't assume the data is sorted
    order = np.argsort(x)
    # storage
    y_sm = np.zeros_like(y)
    y_stderr = np.zeros_like(y)
    # define the weighting function -- clipping too!
    tricube = lambda d: np.clip((1 - np.abs(d)**3)**3, 0, 1)
    # run the regression for each observation i
    for i in range(N):
        dist = np.abs((x[order][i] - x[order]))/xwidth
        w = tricube(dist)
        # form linear system with the weights
        A = np.stack([w, x[order]*w]).T
        b = w * y[order]
        ATA = A.T.dot(A)
        ATb = A.T.dot(b)
        # solve the system
        sol = np.linalg.solve(ATA, ATb)
        # predict for the observation only
        yest = A[i].dot(sol)  # equivalent of A.dot(sol), evaluated just for observation i
        place = order[i]
        y_sm[place] = yest
        sigma2 = np.sum((A.dot(sol) - y[order])**2) / N
        # Calculate the standard error
        y_stderr[place] = np.sqrt(sigma2 * A[i].dot(np.linalg.inv(ATA)).dot(A[i]))
    return y_sm, y_stderr
import numpy as np
import matplotlib.pyplot as plt
# make some data
x = 5*np.random.random(100)
y = np.sin(x) * 3*np.exp(-x) + np.random.normal(0, 0.2, 100)
order = np.argsort(x)
# run it
y_sm, y_std = lowess(x, y, f=1./5.)
# plot it
plt.plot(x[order], y_sm[order], color='tomato', label='LOWESS')
plt.fill_between(x[order], y_sm[order] - 1.96*y_std[order],
                 y_sm[order] + 1.96*y_std[order], alpha=0.3, label='LOWESS uncertainty')
plt.plot(x, y, 'k.', label='Observations')
plt.legend(loc='best')
The same plot with a ±1 standard-error band instead of the ±1.96 band:
# run it
y_sm, y_std = lowess(x, y, f=1./5.)
# plot it
plt.plot(x[order], y_sm[order], color='tomato', label='LOWESS')
plt.fill_between(x[order], y_sm[order] - y_std[order],
                 y_sm[order] + y_std[order], alpha=0.3, label='LOWESS uncertainty')
plt.plot(x, y, 'k.', label='Observations')
plt.legend(loc='best')
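If you would rather reuse the statsmodels implementation than hand-roll the smoother, one common workaround (a sketch, not from the linked post) is to bootstrap the LOWESS fit and take percentiles of the resampled curves as an approximate confidence band:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

def lowess_with_band(x, y, frac=1./5., n_boot=200, alpha=0.05):
    # Bootstrap an approximate confidence band around the statsmodels LOWESS curve.
    xs = np.sort(x)
    boot = np.empty((n_boot, len(xs)))
    for b in range(n_boot):
        idx = np.random.randint(0, len(x), len(x))            # resample (x, y) pairs with replacement
        sm_fit = sm.nonparametric.lowess(y[idx], x[idx], frac=frac)
        boot[b] = np.interp(xs, sm_fit[:, 0], sm_fit[:, 1])   # evaluate each bootstrap curve on a common grid
    centre = sm.nonparametric.lowess(y, x, frac=frac)[:, 1]
    lower = np.percentile(boot, 100*alpha/2, axis=0)
    upper = np.percentile(boot, 100*(1 - alpha/2), axis=0)
    return xs, centre, lower, upper

xs, centre, lower, upper = lowess_with_band(x, y)
plt.fill_between(xs, lower, upper, alpha=0.3, label='bootstrap band')
plt.plot(xs, centre, color='tomato', label='LOWESS')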

Related

Interpolating non-uniformly distributed points on a 3D sphere

I have several points on the unit sphere that are distributed according to the algorithm described in https://www.cmu.edu/biolphys/deserno/pdf/sphere_equi.pdf (and implemented in the code below). On each of these points, I have a value that in my particular case represents 1 minus a small error. The errors are in [0, 0.1] if this is important, so my values are in [0.9, 1].
Sadly, computing the errors is a costly process and I cannot do this for as many points as I want. Still, I want my plots to look like I am plotting something "continuous".
So I want to fit an interpolation function to my data, to be able to sample as many points as I want.
After a little bit of research I found scipy.interpolate.SmoothSphereBivariateSpline which seems to do exactly what I want. But I cannot make it work properly.
Question: what can I use to interpolate (spline, linear interpolation, anything would be fine for the moment) my data on the unit sphere? An answer can be either "you misused scipy.interpolation, here is the correct way to do this" or "this other function is better suited to your problem".
Sample code that should be executable with numpy and scipy installed:
import typing as ty
import numpy
import scipy.interpolate

def get_equidistant_points(N: int) -> ty.List[numpy.ndarray]:
    """Generate approximately N points evenly distributed across the 3-d sphere.

    This function tries to find approximately N points (might be a little less
    or more) that are evenly distributed across the 3-dimensional unit sphere.
    The algorithm used is described in
    https://www.cmu.edu/biolphys/deserno/pdf/sphere_equi.pdf.
    """
    # Unit sphere
    r = 1
    points: ty.List[numpy.ndarray] = list()
    a = 4 * numpy.pi * r ** 2 / N
    d = numpy.sqrt(a)
    m_v = int(numpy.round(numpy.pi / d))
    d_v = numpy.pi / m_v
    d_phi = a / d_v
    for m in range(m_v):
        v = numpy.pi * (m + 0.5) / m_v
        m_phi = int(numpy.round(2 * numpy.pi * numpy.sin(v) / d_phi))
        for n in range(m_phi):
            phi = 2 * numpy.pi * n / m_phi
            points.append(
                numpy.array(
                    [
                        numpy.sin(v) * numpy.cos(phi),
                        numpy.sin(v) * numpy.sin(phi),
                        numpy.cos(v),
                    ]
                )
            )
    return points
def cartesian2spherical(x: float, y: float, z: float) -> numpy.ndarray:
    r = numpy.linalg.norm([x, y, z])
    theta = numpy.arccos(z / r)
    phi = numpy.arctan2(y, x)
    return numpy.array([r, theta, phi])

n = 100
points = get_equidistant_points(n)
# Random here, but costly in real life.
errors = numpy.random.rand(len(points)) / 10
# Change everything to spherical to use the interpolator from scipy.
ideal_spherical_points = numpy.array([cartesian2spherical(*point) for point in points])
r_interp = 1 - errors
theta_interp = ideal_spherical_points[:, 1]
phi_interp = ideal_spherical_points[:, 2]
# Change phi coordinate from [-pi, pi] to [0, 2pi] to please scipy.
phi_interp[phi_interp < 0] += 2 * numpy.pi
# Create the interpolator.
interpolator = scipy.interpolate.SmoothSphereBivariateSpline(
    theta_interp, phi_interp, r_interp
)
# Creating the finer theta and phi values for the final plot
theta = numpy.linspace(0, numpy.pi, 100, endpoint=True)
phi = numpy.linspace(0, numpy.pi * 2, 100, endpoint=True)
# Creating the coordinate grid for the unit sphere.
X = numpy.outer(numpy.sin(theta), numpy.cos(phi))
Y = numpy.outer(numpy.sin(theta), numpy.sin(phi))
Z = numpy.outer(numpy.cos(theta), numpy.ones(100))
thetas, phis = numpy.meshgrid(theta, phi)
heatmap = interpolator(thetas, phis)
Issue with the code above:
With the code as-is, I have a
ValueError: The required storage space exceeds the available storage space: nxest or nyest too small, or s too small. The weighted least-squares spline corresponds to the current set of knots.
that is raised when initialising the interpolator instance.
The error message seems to say that I should change the value of s, which is one of the parameters of scipy.interpolate.SmoothSphereBivariateSpline. I tested different values of s ranging from 0.0001 to 100000; the code above always raises either the exception described above or:
ValueError: Error code returned by bispev: 10
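For completeness, this is the kind of call I tried (the value of s here is just an example):

# Create the interpolator with an explicit smoothing factor s (example value).
interpolator = scipy.interpolate.SmoothSphereBivariateSpline(
    theta_interp, phi_interp, r_interp, s=3.5
)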
Edit: I am including my findings here. They can't really be considered a solution, which is why I am editing rather than posting an answer.
With more research I found this question: Using Radial Basis Functions to Interpolate a Function on a Sphere. The author has exactly the same problem as me and uses a different interpolator: scipy.interpolate.Rbf. I changed the above code by replacing the interpolator and plotting:
# Create the interpolator.
interpolator = scipy.interpolate.Rbf(theta_interp, phi_interp, r_interp)
# Creating the finer theta and phi values for the final plot
plot_points = 100
theta = numpy.linspace(0, numpy.pi, plot_points, endpoint=True)
phi = numpy.linspace(0, numpy.pi * 2, plot_points, endpoint=True)
# Creating the coordinate grid for the unit sphere.
X = numpy.outer(numpy.sin(theta), numpy.cos(phi))
Y = numpy.outer(numpy.sin(theta), numpy.sin(phi))
Z = numpy.outer(numpy.cos(theta), numpy.ones(plot_points))
thetas, phis = numpy.meshgrid(theta, phi)
heatmap = interpolator(thetas, phis)
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import cm

colormap = cm.inferno
normaliser = mpl.colors.Normalize(vmin=numpy.min(heatmap), vmax=1)
scalar_mappable = cm.ScalarMappable(cmap=colormap, norm=normaliser)
scalar_mappable.set_array([])
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(
    X,
    Y,
    Z,
    facecolors=colormap(normaliser(heatmap)),
    alpha=0.7,
    cmap=colormap,
)
plt.colorbar(scalar_mappable)
plt.show()
This code runs smoothly and gives the following result:
The interpolation seems OK except along one line where it is discontinuous, just like in the question that led me to this class. One of the answers there suggests using a different distance, better adapted to spherical coordinates: the Haversine distance.
def haversine(x1, x2):
    theta1, phi1 = x1
    theta2, phi2 = x2
    return 2 * numpy.arcsin(
        numpy.sqrt(
            numpy.sin((theta2 - theta1) / 2) ** 2
            + numpy.cos(theta1) * numpy.cos(theta2) * numpy.sin((phi2 - phi1) / 2) ** 2
        )
    )

# Create the interpolator.
interpolator = scipy.interpolate.Rbf(theta_interp, phi_interp, r_interp, norm=haversine)
which, when executed, gives a warning:
LinAlgWarning: Ill-conditioned matrix (rcond=1.33262e-19): result may not be accurate.
self.nodes = linalg.solve(self.A, self.di)
and a result that is not at all the one expected: the interpolated function reaches values of -1, which is clearly wrong.
You can use Cartesian coordinates instead of spherical coordinates.
The default norm parameter ('euclidean') used by Rbf is sufficient:
# interpolation
x, y, z = numpy.array(points).T
interpolator = scipy.interpolate.Rbf(x, y, z, r_interp)
# predict
heatmap = interpolator(X, Y, Z)
Here is the result:
ax.plot_surface(
    X, Y, Z,
    rstride=1, cstride=1,
    # or rcount=50, ccount=50,
    facecolors=colormap(normaliser(heatmap)),
    cmap=colormap,
    alpha=0.7, shade=False
)
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
ax.set_zlabel('z axis')
You can also use a cosine distance if you want (norm parameter):
import scipy.spatial

def cosine(XA, XB):
    if XA.ndim == 1:
        XA = numpy.expand_dims(XA, axis=0)
    if XB.ndim == 1:
        XB = numpy.expand_dims(XB, axis=0)
    return scipy.spatial.distance.cosine(XA, XB)
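Used the same way as before, for example (sticking with the Cartesian variables above; the call itself is just illustrative):

# Pass the custom distance through Rbf's norm parameter.
interpolator = scipy.interpolate.Rbf(x, y, z, r_interp, norm=cosine)
heatmap = interpolator(X, Y, Z)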
In order to better see the differences, I stacked the two images, subtracted them, and inverted the layer.

Scipy optimize curve_fit gives different plots for same parameters when fitting custom function

I have a problem with fitting a custom function using scipy.optimize in Python, and I do not know why it is happening. I generate data from a centered and normalized binomial distribution (a Gaussian curve) and then fit a curve to it. The expected outcome is in the picture when I plot my function over the fitted data. But when I do the fitting, it fails.
I'm convinced it is a Python thing, because the fit should give the parameter a = 1 (that's how I define it), and it does, but then the fit is bad (see picture). However, if I change sigma to 0.65*sigma in:
p_halfg, p_halfg_cov = optimize.curve_fit(lambda x, a:piecewise_half_gauss(x, a, sigma = 0.65*sigma_fit), x, y, p0=[1])
it gives an almost perfect fit (a is then 5/3, as predicted by the math). Those fits should be the same, and they are not!
I give more comments below. Could you please tell me what is happening and where the problem could be?
Plot with a=1 and sigma = sigma_fit
Plot with sigma = 0.65*sigma_fit
I generate data from a normalized binomial distribution (I can provide my code, but the values are more important now). It is a distribution with N = 10 and p = 0.5, and I'm centering it and taking only the right side of the curve. Then I'm fitting it with my half-Gauss function, which should be the same distribution as the binomial if its parameter a = 1 (and sigma is equal to the sigma of the distribution, sqrt(np(1-p))). The first problem is that it does not fit the data, as shown in the picture, despite getting the correct value of parameter a.
Notice the weird pattern: if I set sigma = 3*sigma_fit, I get a = 1/3 and a very bad fit (underestimate). If I set it to 0.2*sigma_fit, I also get a bad fit and a = 1/0.2 = 5 (overestimate). And so on. Why? (By the way, a = 1/(sigma scale factor), so the fitting procedure itself seems to work.)
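A note on why that scaling appears (my reading of the model, not something stated in the original post): inside piecewise_half_gauss the parameters only enter through the product a*sigma, both in the exponent and in the prefactor,

exp(-x**2 / (2*(a*sigma)**2)) / (sqrt(2*pi) * sigma * a)

so fitting with k*sigma instead of sigma simply drives the optimizer to a ≈ 1/k, and the fitted curve itself is unchanged; the plot only goes wrong when a different sigma is used for fitting than for plotting, which is exactly what the EDIT below resolves.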
import numpy as np
import matplotlib.pyplot as plt
import math
pi = math.pi
import scipy.optimize as optimize

# define my function
sigma_fit = 1

def piecewise_half_gauss(x, a, sigma=sigma_fit):
    """Half of normal distribution curve, defined as gaussian centered at 0 with constant value of preexponential factor for x < 0

    Arguments: x values as ndarray whose numbers MUST be float type (use linspace or np.arange(start, end, step, dtype=float)),
               a as a parameter of width of the distribution,
               sigma being the deviation, second moment

    Returns: Half gaussian curve

    Ex:
    >>> piecewise_half_gauss(5., 1)
    array(0.04839414)
    >>> x = np.linspace(0, 10, 11)
    ... piecewise_half_gauss(x, 2, 3)
    array([0.06649038, 0.06557329, 0.0628972 , 0.05867755, 0.05324133,
           0.04698531, 0.04032845, 0.03366645, 0.02733501, 0.02158627,
           0.01657952])
    >>> piecewise_half_gauss(np.arange(0, 11, 1, dtype=float), 1, 2.4)
    array([1.66225950e-01, 1.52405153e-01, 1.17463281e-01, 7.61037856e-02,
           4.14488078e-02, 1.89766470e-02, 7.30345854e-03, 2.36286717e-03,
           6.42616248e-04, 1.46914868e-04, 2.82345875e-05])
    """
    return np.piecewise(x, [x >= 0, x < 0],
                        [lambda x: np.exp(-x ** 2 / (2 * ((a * sigma) ** 2))) / (np.sqrt(2 * pi) * sigma * a),
                         lambda x: 1 / (np.sqrt(2 * pi) * sigma)])

# Create normalized data for binomial distribution Bin(N,p)
n = 10
p = 0.5
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0.25231325, 0.20657662, 0.11337165, 0.0417071 , 0.01028484,
              0.00170007])
# Get the estimate for sigma parameter
sigma_fit = (n*p*(1-p))**0.5
# Get fitting parameters
p_halfg, p_halfg_cov = optimize.curve_fit(lambda x, a: piecewise_half_gauss(x, a, sigma=sigma_fit), x, y, p0=[1])
print(sigma_fit, p_halfg, p_halfg_cov)

## Plot the result
# unpack fitting parameters
a = np.float64(p_halfg)
# unpack uncertainties in fitting parameters from diagonal of covariance matrix
# da = [np.sqrt(p_halfg_cov[j,j]) for j in range(p_halfg.size)]  # if we fit more parameters
da = np.float64(np.sqrt(p_halfg_cov[0]))
# create fitting function from fitted parameters
f_fit = np.linspace(0, 10, 50)
y_fit = piecewise_half_gauss(f_fit, a)
# Create figure window to plot data
fig = plt.figure(1, figsize=(10, 10))
plt.scatter(x, y, color='r', label='Original points')
plt.plot(f_fit, y_fit, label='Fit')
plt.xlabel('My x values')
plt.ylabel('My y values')
plt.text(5.8, .25, 'a = {0:0.5f}$\pm${1:0.6f}'.format(a, da))
plt.legend()
However, if I plot it manually, it fits EXACTLY!
plt.scatter(x, y, c = 'r', label = 'Original points')
plt.plot(np.linspace(0,5,50), piecewise_half_gauss(np.linspace(0,5,50), 1, sigma_fit), label = 'Fit')
plt.legend()
EDIT -- solved:
It was a plotting problem; I needed to use
y_fit = piecewise_half_gauss(f_fit, a, sigma=0.6*sigma_fit)
The problem was in how I plotted versus how I fitted the parameters: if I fit with a different sigma, I also need to use that same sigma in the plotting section when I generate y_fit:
# Get fitting parameters
p_halfg, p_halfg_cov = optimize.curve_fit(lambda x, a:piecewise_half_gauss(x, a, sigma = 0.6*sigma_fit), x, y, p0=[1])
...
y_fit = piecewise_half_gauss(f_fit, a, sigma = 0.6*sigma_fit)
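One more detail worth spelling out (my observation, not part of the original edit): Python evaluates default argument values once, at function definition time, so sigma=sigma_fit inside def piecewise_half_gauss was bound to the earlier sigma_fit = 1, not to the sqrt(n*p*(1-p)) computed later. The fit passed sigma explicitly while the original plot relied on the stale default, which is why passing sigma explicitly in the plotting call fixes it. A minimal illustration:

sigma_fit = 1
def g(x, sigma=sigma_fit):      # default captured here: sigma = 1
    return sigma
sigma_fit = 2.5
print(g(0))                     # 1 -- still the old value
print(g(0, sigma=sigma_fit))    # 2.5 -- only when passed explicitly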

Gaussian fit for python with confidence interval

I'd like to make a Gaussian fit for some data that already has a roughly Gaussian shape. I'd like the peak amplitude (A), center position (mu), and standard deviation (sigma), along with 95% confidence intervals for these values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.stats import norm

# gaussian function
def gaussian_func(x, A, mu, sigma):
    return A * np.exp(-(x - mu)**2 / (2 * sigma**2))

# generate toy data
x = np.arange(50)
y = [ 97.04421053, 96.53052632, 96.85684211, 96.33894737, 96.85052632,
      96.30526316, 96.87789474, 96.75157895, 97.05052632, 96.73473684,
      96.46736842, 96.23368421, 96.22526316, 96.11789474, 96.41263158,
      96.32631579, 96.33684211, 96.44421053, 96.48421053, 96.49894737,
      97.30105263, 98.58315789, 100.07368421, 101.43578947, 101.92210526,
      102.26736842, 101.80421053, 101.91157895, 102.07368421, 102.02105263,
      101.35578947, 99.83578947, 98.28, 96.98315789, 96.61473684,
      96.82947368, 97.09263158, 96.82105263, 96.24210526, 95.95578947,
      95.84210526, 95.67157895, 95.83157895, 95.37894737, 95.25473684,
      95.32842105, 95.45684211, 95.31578947, 95.42526316, 95.30526316]
plt.scatter(x, y)

# initial guess of the parameters
# (these values should really be estimated with a solver or similar)
parameter_initial = np.array([652, 2.9, 1.3])

# estimate optimal parameters & parameter covariance
popt, pcov = curve_fit(gaussian_func, x, y, p0=parameter_initial)

# plot result
xd = np.arange(x.min(), x.max(), 0.01)
estimated_curve = gaussian_func(xd, popt[0], popt[1], popt[2])
plt.plot(xd, estimated_curve, label="Estimated curve", color="r")
plt.legend()
plt.savefig("gaussian_fitting.png")
plt.show()

# estimate standard error
StdE = np.sqrt(np.diag(pcov))
# estimate 95% confidence interval
alpha = 0.025
lwCI = popt + norm.ppf(q=alpha)*StdE
upCI = popt + norm.ppf(q=1-alpha)*StdE

# print result
mat = np.vstack((popt, StdE, lwCI, upCI)).T
df = pd.DataFrame(mat, index=("A", "mu", "sigma"),
                  columns=("Estimate", "Std. Error", "lwCI", "upCI"))
print(df)
Data Plot with Fitted Curve
The data peak and center position seems correct, but the standard deviation is off. Any input is greatly appreciated.
Your scatter indeed looks similar to a Gaussian distribution, but it is not centered around zero. Given the specifics of the Gaussian function, it will therefore be hard to nicely fit a Gaussian distribution to the data the way you gave it to us. I would therefore propose starting by demeaning the x series:
x = np.arange(0, 50) - 24.5
Next I would add one additional parameter to your gaussian function, the offset. Since the regular Gaussian function will always have its tails close to zero it is impossible to otherwise nicely fit your scatterplot:
def gaussian_function(x, A, mu, sigma, offset):
    return A * np.exp(-np.power((x - mu)/sigma, 2.)/2.) + offset
Next you should define an error_loss_function to minimise:
def error_loss_function(params):
    gaussian = gaussian_function(x, params[0], params[1], params[2], params[3])
    errors = gaussian - y
    return sum(np.power(errors, 2))  # You can also pick a different error loss function!
All that remains is fitting our curve now:
import scipy.optimize
fit = scipy.optimize.minimize(fun=error_loss_function, x0=[2, 0, 0.2, 97])
params = fit.x # A: 6.57592661, mu: 1.95248855, sigma: 3.93230503, offset: 96.12570778
xd = np.arange(x.min(), x.max(), 0.01)
estimated_curve = gaussian_function(xd, params[0], params[1], params[2], params[3])
plt.plot(xd, estimated_curve, label="Estimated curve", color="b")
plt.legend()
plt.show(block=False)
Hopefully this helps. Looks like a fun project, let me know if my answer is not clear.
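If you also want the covariance-based 95% confidence intervals asked for in the question, the same offset idea can be fed to curve_fit, which returns pcov directly. A sketch along the lines of the question's own CI code (the p0 values are just rough guesses; x, y, and gaussian_function are the demeaned data and four-parameter function from above):

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

popt, pcov = curve_fit(gaussian_function, x, y, p0=[6, 0, 4, 96])
StdE = np.sqrt(np.diag(pcov))           # standard errors of A, mu, sigma, offset
lwCI = popt + norm.ppf(0.025) * StdE    # lower 95% bounds
upCI = popt + norm.ppf(0.975) * StdE    # upper 95% bounds
print(np.vstack((popt, StdE, lwCI, upCI)).T)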

How does one implement a subsampled RBF (Radial Basis Function) in Numpy?

I was trying to implement a Radial Basis Function in Python and NumPy as described in the Caltech lecture here. The mathematics seems clear to me, so I find it strange that it's not working (or it seems not to work). The idea is simple: one chooses a subsampled set of centers, forms the Gaussian kernel (Gram) matrix K, and finds the best coefficients by solving Kc = y with least squares. For that I did:
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
but when I try to plot my interpolation with the original data they don't look at all alike:
This is with 100 random centers (from the data set). I also tried 10 centers, which produces essentially the same graph, as does using every data point in the training set. I assumed that using every data point in the data set should more or less perfectly copy the curve (overfit), but it didn't. It produces:
which doesn't seem correct. I will provide the full code (that runs without error):
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt

## Data sets
def get_labels_improved(X, f):
    N_train = X.shape[0]
    Y = np.zeros((N_train, 1))
    for i in range(N_train):
        Y[i] = f(X[i])
    return Y

def get_kernel_matrix(x, W, S):
    beta = get_beta_np(S)
    # beta = 0.5*tf.pow(tf.div( tf.constant(1.0,dtype=tf.float64),S), 2)
    Z = -beta*euclidean_distances(X=x, Y=W, squared=True)
    K = np.exp(Z)
    return K

N = 5000
low_x = -2*np.pi
high_x = 2*np.pi
X = low_x + (high_x - low_x) * np.random.rand(N, 1)
# f(x) = 2*(2(cos(x)^2 - 1)^2 -1
f = lambda x: 2*np.power(2*np.power(np.cos(x), 2) - 1, 2) - 1
Y = get_labels_improved(X, f)

K = 2  # number of centers for RBF
indices = np.random.choice(a=N, size=K)  # choose numbers from 0 to D^(1)
subsampled_data_points = X[indices, :]   # M_sub x D
stddev = 100
beta = 0.5*np.power(1.0/stddev, 2)
Kern = np.exp(-beta*euclidean_distances(X=X, Y=subsampled_data_points, squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot(np.linalg.pinv(Kern), Y)
Y_pred = np.dot(Kern, C)

plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()
Since the plots look strange, I decided to read the docs for the plotting functions, but I couldn't find anything obvious that was wrong.
Scaling of interpolating functions
The main problem is an unfortunate choice of the standard deviation of the functions used for interpolation:
stddev = 100
The features of your function (its humps) are of size about 1. So, use
stddev = 1
Order of X values
The mess of red lines is there because plt from matplotlib connects consecutive data points, in the order given. Since your X values are in random order, this results in chaotic left-right movements. Use sorted X:
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
Efficiency issues
Your get_labels_improved method is inefficient, looping over the elements of X. Use Y = f(X), leaving the looping to low-level NumPy internals.
Also, the least-squares solution of an overdetermined system should be computed with lstsq instead of computing the pseudoinverse (computationally expensive) and multiplying by it.
Here is the cleaned-up code; using 30 centers gives a good fit.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import matplotlib.pyplot as plt
N = 5000
low_x = -2*np.pi
high_x = 2*np.pi
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = f(X)
K = 30 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 1
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X, Y=subsampled_data_points,squared=True))
C = np.linalg.lstsq(Kern, Y)[0]
Y_pred = np.dot(Kern, C)
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()

Python: two-curve gaussian fitting with non-linear least-squares

My knowledge of maths is limited, which is why I am probably stuck. I have a spectrum to which I am trying to fit two Gaussian peaks. I can fit to the largest peak, but I cannot fit to the smallest peak. I understand that I need to sum the Gaussian functions for the two peaks, but I do not know where I have gone wrong. An image of my current output is shown:
The blue line is my data and the green line is my current fit. There is a shoulder to the left of the main peak in my data which I am currently trying to fit, using the following code:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import leastsq
from pylab import *

time = []
counts = []
for i in open('/some/folder/to/file.txt', 'r'):
    segs = i.split()
    time.append(float(segs[0]))
    counts.append(segs[1])
time_array = arange(len(time), dtype=float)
counts_array = arange(len(counts))
time_array[0:] = time
counts_array[0:] = counts

def model(time_array0, coeffs0):
    a = coeffs0[0] + coeffs0[1] * np.exp(-((time_array0 - coeffs0[2])/coeffs0[3])**2)
    b = coeffs0[4] + coeffs0[5] * np.exp(-((time_array0 - coeffs0[6])/coeffs0[7])**2)
    c = a + b
    return c

def residuals(coeffs, counts_array, time_array):
    return counts_array - model(time_array, coeffs)

# 0 = baseline, 1 = amplitude, 2 = centre, 3 = width
peak1 = np.array([0, 6337, 16.2, 4.47, 0, 2300, 13.5, 2], dtype=float)
#peak2 = np.array([0,2300,13.5,2], dtype=float)
x, flag = leastsq(residuals, peak1, args=(counts_array, time_array))
#z, flag = leastsq(residuals, peak2, args=(counts_array, time_array))

plt.plot(time_array, counts_array)
plt.plot(time_array, model(time_array, x), color='g')
#plt.plot(time_array, model(time_array, z), color='r')
plt.show()
This code worked for me, provided that you are only fitting a function that is a combination of two Gaussian distributions.
I just made a residuals function that adds two Gaussian functions and then subtracts them from the real data.
The parameters (p) that I passed to SciPy's least squares function include: the mean of the first Gaussian function (m), the difference in the mean between the first and second Gaussian functions (dm, i.e. the horizontal shift), the standard deviation of the first (sd1), and the standard deviation of the second (sd2).
import numpy as np
from scipy.optimize import leastsq
import matplotlib.pyplot as plt

######################################
# Setting up test data
def norm(x, mean, sd):
    norm = []
    for i in range(x.size):
        norm += [1.0/(sd*np.sqrt(2*np.pi))*np.exp(-(x[i] - mean)**2/(2*sd**2))]
    return np.array(norm)

mean1, mean2 = 0, -2
std1, std2 = 0.5, 1
x = np.linspace(-20, 20, 500)
y_real = norm(x, mean1, std1) + norm(x, mean2, std2)

######################################
# Solving
m, dm, sd1, sd2 = [5, 10, 1, 1]
p = [m, dm, sd1, sd2]  # Initial guesses for leastsq
y_init = norm(x, m, sd1) + norm(x, m + dm, sd2)  # For final comparison plot

def res(p, y, x):
    m, dm, sd1, sd2 = p
    m1 = m
    m2 = m1 + dm
    y_fit = norm(x, m1, sd1) + norm(x, m2, sd2)
    err = y - y_fit
    return err

plsq = leastsq(res, p, args=(y_real, x))

y_est = norm(x, plsq[0][0], plsq[0][2]) + norm(x, plsq[0][0] + plsq[0][1], plsq[0][3])

plt.plot(x, y_real, label='Real Data')
plt.plot(x, y_init, 'r.', label='Starting Guess')
plt.plot(x, y_est, 'g.', label='Fitted')
plt.legend()
plt.show()
You can use Gaussian mixture models from scikit-learn:
from sklearn import mixture
import matplotlib.pyplot
import matplotlib.mlab
import numpy as np
clf = mixture.GMM(n_components=2, covariance_type='full')
clf.fit(yourdata)
m1, m2 = clf.means_
w1, w2 = clf.weights_
c1, c2 = clf.covars_
histdist = matplotlib.pyplot.hist(yourdata, 100, normed=True)
plotgauss1 = lambda x: plot(x,w1*matplotlib.mlab.normpdf(x,m1,np.sqrt(c1))[0], linewidth=3)
plotgauss2 = lambda x: plot(x,w2*matplotlib.mlab.normpdf(x,m2,np.sqrt(c2))[0], linewidth=3)
plotgauss1(histdist[1])
plotgauss2(histdist[1])
You can also use the function below to fit the number of Gaussian you want with ncomp parameter:
from sklearn import mixture
%pylab

def fit_mixture(data, ncomp=2, doplot=False):
    clf = mixture.GMM(n_components=ncomp, covariance_type='full')
    clf.fit(data)
    ml = clf.means_
    wl = clf.weights_
    cl = clf.covars_
    ms = [m[0] for m in ml]
    cs = [numpy.sqrt(c[0][0]) for c in cl]
    ws = [w for w in wl]
    if doplot == True:
        histo = hist(data, 200, normed=True)
        for w, m, c in zip(ws, ms, cs):
            plot(histo[1], w*matplotlib.mlab.normpdf(histo[1], m, np.sqrt(c)), linewidth=3)
    return ms, cs, ws
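Usage would look something like this (yourdata here is a placeholder for a 1-D array of samples):

# Hypothetical call: fit two components and plot them over a histogram of the data.
ms, cs, ws = fit_mixture(yourdata, ncomp=2, doplot=True)
print(ms, cs, ws)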
Coeffs 0 and 4 are degenerate - there is absolutely nothing in the data that can decide between them. You should use a single zero-level parameter instead of two (i.e. remove one of them from your code). This is probably what is stopping your fit (ignore the comments here saying this is not possible - there are clearly at least two peaks in that data and you should certainly be able to fit to that).
(It may not be clear why I am suggesting this, but what is happening is that coeffs 0 and 4 can cancel each other out. They can both be zero, or one could be 100 and the other -100 - either way, the fit is just as good. This "confuses" the fitting routine, which spends its time trying to work out what they should be, when there is no single right answer, because whatever value one is, the other can just be the negative of that, and the fit will be the same.)
In fact, from the plot, it looks like there may be no need for a zero level at all. I would try dropping both of those and seeing how the fit looks (see the sketch below).
Also, there is no need to fit coeffs 1 and 5 (or the zero point) in the least squares. Instead, because the model is linear in those, you could calculate their values on each iteration. This will make things faster, but it is not critical. I just noticed you say your maths is not so good, so you can probably ignore this one.
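A sketch of what the reduced model could look like, reusing the question's variable names; model_single_baseline and the starting guess below are illustrative, with the duplicate zero level removed:

def model_single_baseline(time_array0, coeffs0):
    # coeffs0: [baseline, amp1, centre1, width1, amp2, centre2, width2]
    baseline = coeffs0[0]
    peak_a = coeffs0[1] * np.exp(-((time_array0 - coeffs0[2]) / coeffs0[3])**2)
    peak_b = coeffs0[4] * np.exp(-((time_array0 - coeffs0[5]) / coeffs0[6])**2)
    return baseline + peak_a + peak_b

def residuals(coeffs, counts_array, time_array):
    return counts_array - model_single_baseline(time_array, coeffs)

# the question's guesses, minus the second zero level
guess = np.array([0, 6337, 16.2, 4.47, 2300, 13.5, 2], dtype=float)
x, flag = leastsq(residuals, guess, args=(counts_array, time_array))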
