Based on existing topics on Stack Overflow, I have managed to fit a Gaussian curve to my dataset. However, one tail of the fitted Gaussian does not return to base level (in the example below, the right tail stops at a higher y-value than the left tail). This surprises me, since by definition a Gaussian is a perfectly symmetrical bell-shaped curve.
How can I generate a Gaussian curve whose tails are equally long (i.e., they stop at the same distance from the plume center-line) and end at the same base level (i.e., the same y-value)? I would like this because in my data a second peak sometimes starts to rise before the first peak has returned to base level. I want to separate these peaks by fitting a Gaussian that returns to base level, as theoretically each peak should. Thanks a lot in advance!
import numpy as np
from lmfit import Model
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
x = np.array([-20.0,-17.0,-14.0,-11.0,-8.0,-5.0,-2.0,1.0,4.0,7.0,10.0,13.0,16.0,19.0,22.0,25.0,28.0,31.0,34.0,37.0,40.0,43.0,46.0,49.0,52.0,55.0,58.0,61.0,64.0,67.0,70.0,73.0,76.0,79.0,82.0])
y = np.array([1.90269,1.93535,2.62402,3.08949,2.82409,3.07588,3.22015,3.18884,5.14053,10.5111,18.6118,28.6343,37.7625,46.3641,53.9163,60.7622,66.5765,71.0596,74.4948,77.7177,80.373,82.5833,83.9021,83.4652,79.0229,71.4679,61.93,52.113,43.8517,36.211,29.3815,23.8966,19.31,15.5209,12.4532])
def gaussian(x, amp, cen, wid):
    return (amp / (np.sqrt(2*np.pi) * wid)) * np.exp(-(x-cen)**2 / (2*wid**2))
def line(x, slope, intercept):
    return slope*x + intercept
peak_index = find_peaks(y,height=27.6)[0][0]
mean = sum(x*y)/np.sum(y) #weighted arithmetic mean
mod = Model(gaussian) + Model(line)
pars = mod.make_params(amp=max(y), cen=x[peak_index],
wid=np.sqrt(sum((x-mean)**2 * y)/sum(y)), slope=0, intercept=1)
result = mod.fit(y, pars, x=x)
comps = result.eval_components()
plt.plot(x, y, 'bo')
plt.plot(x, comps['gaussian'], 'k--')
Edit: The following example hopefully illustrates why I am interested in this. I have a long data set in which the signals of different sources are measured. The data set is processed into the arrays x_measured and y_measured, which contain the measured values belonging to one source. My program automatically detects the plume within the measured values and stores its values in the arrays x and y. I then fit a Gaussian to these x and y arrays.
However, the measured values sometimes show two overlapping plumes, so there is no measured plume that starts from and returns to base level. An example is given in the code below. For these measured values, my program currently produces a Gaussian fit whose right tail goes to around y=0, but whose left tail stops around y=4.5. I would like the left tail to return to around y=0 as well, because theoretically each plume should start from and return to the same base level, and I want to compute the width of such a Gaussian plume. Since the left tail in the example below does not return to around y=0, I cannot determine the plume width. I would like a Gaussian fit in which both tails return to the same base level of y=0, so that I can determine the width of the plume.
x_measured = np.arange(-20,245,3)
y_measured = np.array([38.7586,38.2323,37.2958,35.9924,34.4196,32.7123,31.0257,29.5169,28.3244,27.5502,27.2458,27.4078,27.9815,28.8728,29.9643,31.1313,32.2545,33.2276,33.9594,34.373,34.4041,34.0009,33.1267,31.7649,29.9247,27.6458,24.9992,22.0845,19.0215,15.9397,12.966,10.2127,7.76834,5.69046,4.00296,2.69719,1.73733,1.06907,0.629744,0.358021,0.201123,0.11878,0.0839719,0.0813392,0.104295,0.151634,0.224209,0.321912,0.441478,0.575581,0.713504,0.843351,0.954777,1.04109,1.09974,1.13118,1.13683,1.11758,1.07369,1.0059,0.917066,0.81321,0.703288,0.597775,0.506678,0.437843,0.396256,0.384633,0.405147,0.461496,0.560387,0.71144,0.925262,1.21022,1.56925,1.99788,2.48458,3.01314,3.56626,4.12898,4.69031,5.24283,5.78014,6.29365,6.77004,7.19071,7.53399,7.78019,7.91889])
x = np.arange(10,104,3)
y = np.array([22.4548,23.4302,25.3389,27.9929,30.486,32.0528,33.5527,35.1304,35.9941,36.8606,37.1889,37.723,36.4069,35.9751,33.8824,31.0909,27.4247,23.3213,18.8772,14.3363,11.1075,7.68792,4.54899,2.2057,0,0,0,0,0,0,0.179834,0])
def gaussian(x, amp, cen, wid):
    return (amp / (np.sqrt(2*np.pi) * wid)) * np.exp(-(x-cen)**2 / (2*wid**2))
def line(x, slope, intercept):
    return slope*x + intercept
peak_index = find_peaks(y,height=27.6)[0][0]
mean = sum(x*y)/np.sum(y) #weighted arithmetic mean
mod = Model(gaussian) + Model(line)
pars = mod.make_params(amp=max(y), cen=x[peak_index],
wid=np.sqrt(sum((x-mean)**2 * y)/sum(y)), slope=0, intercept=1)
result = mod.fit(y, pars, x=x)
comps = result.eval_components()
plt.plot(x, y, 'bo')
plt.plot(x, comps['gaussian'], 'k--')
plt.plot(x_measured,y_measured)
It is unclear why you expect a bimodal fit from the model you defined, which contains only one Gaussian. Use two separate Gaussian functions for your fit, then evaluate the fitted functions over a longer interval x_fit to see the curves return to baseline:
import numpy as np
from lmfit import Model
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
x = np.array([-20.0,-17.0,-14.0,-11.0,-8.0,-5.0,-2.0,1.0,4.0,7.0,10.0,13.0,16.0,19.0,22.0,25.0,28.0,31.0,34.0,37.0,40.0,43.0,46.0,49.0,52.0,55.0,58.0,61.0,64.0,67.0,70.0,73.0,76.0,79.0,82.0])
y = np.array([1.90269,1.93535,2.62402,3.08949,2.82409,3.07588,3.22015,3.18884,5.14053,10.5111,18.6118,28.6343,37.7625,46.3641,53.9163,60.7622,66.5765,71.0596,74.4948,77.7177,80.373,82.5833,83.9021,83.4652,79.0229,71.4679,61.93,52.113,43.8517,36.211,29.3815,23.8966,19.31,15.5209,12.4532])
def gaussian1(x, amp1, cen1, wid1):
    return (amp1 / (np.sqrt(2*np.pi) * wid1)) * np.exp(-(x-cen1)**2 / (2*wid1**2))
def gaussian2(x, amp2, cen2, wid2):
    return (amp2 / (np.sqrt(2*np.pi) * wid2)) * np.exp(-(x-cen2)**2 / (2*wid2**2))
#peak_index = find_peaks(y,height=27.6)[0][0]
#mean = sum(x*y)/np.sum(y) #weighted arithmetic mean
mod = Model(gaussian1) + Model(gaussian2)
#I just filled in some start values, the details of educated guesses can be filled in later by you
pars = mod.make_params(amp1=30, amp2=40, cen1=20, cen2=40, wid1=2, wid2=2)
result = mod.fit(y, pars, x=x)
print(result.params)
x_fit=np.linspace(-30, 120, 500)
comps_elem = result.eval_components(x=x_fit)
comps_comb = result.eval(x=x_fit)
plt.plot(x, y, 'bo')
plt.plot(x_fit, comps_comb, 'k')
plt.plot(x_fit, comps_elem['gaussian1'], 'k-.')
plt.plot(x_fit, comps_elem['gaussian2'], 'k--')
plt.show()
The corresponding approach with scipy.optimize.curve_fit would look like this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
x = [-20.0,-17.0,-14.0,-11.0,-8.0,-5.0,-2.0,1.0,4.0,7.0,10.0,13.0,16.0,19.0,22.0,25.0,28.0,31.0,34.0,37.0,40.0,43.0,46.0,49.0,52.0,55.0,58.0,61.0,64.0,67.0,70.0,73.0,76.0,79.0,82.0]
y = [1.90269,1.93535,2.62402,3.08949,2.82409,3.07588,3.22015,3.18884,5.14053,10.5111,18.6118,28.6343,37.7625,46.3641,53.9163,60.7622,66.5765,71.0596,74.4948,77.7177,80.373,82.5833,83.9021,83.4652,79.0229,71.4679,61.93,52.113,43.8517,36.211,29.3815,23.8966,19.31,15.5209,12.4532]
def gauss(x, mu, sigma, A):
    return A*np.exp(-(x-mu)**2/2/sigma**2)
def bimodal(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return gauss(x, mu1, sigma1, A1) + gauss(x, mu2, sigma2, A2)
expected = (20, 2, 30, 40, 2, 40)
params, cov = curve_fit(bimodal, x, y, expected)
sigma=np.sqrt(np.diag(cov))
x_fit = np.linspace(-20, 120, 500)
plt.plot(x_fit, bimodal(x_fit, *params), color='red', lw=3, label='model')
plt.plot(x_fit, gauss(x_fit, *params[:3]), color='red', lw=1, ls="--", label='distribution 1')
plt.plot(x_fit, gauss(x_fit, *params[3:]), color='red', lw=1, ls=":", label='distribution 2')
plt.scatter(x, y, marker="X", color="black", label="original data")
plt.legend()
print(pd.DataFrame(data={'params': params, 'sigma': sigma}, index=bimodal.__code__.co_varnames[1:]))
plt.show()
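Since the goal in the question is a plume width, it may help to note that a Gaussian only approaches its baseline asymptotically, so a width has to be defined at some chosen fraction of the peak height. The following is a minimal sketch of that calculation from a fitted width parameter (wid in the lmfit version, sigma in the curve_fit version). The names gaussian_peak and full_width_at_fraction and the value sigma_demo are hypothetical and are not taken from the fits above.

```python
import numpy as np

def gaussian_peak(x, height, center, sigma):
    """Gaussian written with its peak height as the amplitude."""
    return height * np.exp(-(x - center) ** 2 / (2 * sigma ** 2))

def full_width_at_fraction(sigma, fraction):
    """Full width of a Gaussian at `fraction` of its peak height (0 < fraction < 1)."""
    return 2.0 * sigma * np.sqrt(2.0 * np.log(1.0 / fraction))

sigma_demo = 12.0  # hypothetical fitted width, for illustration only

# Full width at half maximum (fraction = 0.5)
fwhm = full_width_at_fraction(sigma_demo, 0.5)
# Width between the points where the tails have dropped to 1 % of the peak
width_1pct = full_width_at_fraction(sigma_demo, 0.01)

print(f"FWHM: {fwhm:.2f}, width at 1% of peak: {width_1pct:.2f}")
```

Which fraction counts as "back at base level" is a choice for the application; the formula itself only depends on the fitted sigma.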
I am trying to fit a curve-smoothing function to several of my data sets, but I have to enter the initial guesses for lambda, theta, sigma and the other parameters manually for each set; otherwise the fit is relatively poor.
This leads to two questions:
1) Is there a way to program the estimates, or to get curve_fit to find the best initial guess to work with?
2) If this is not possible, how can I make curve_fit work with one fixed set of initial guesses across different data sets and still produce the best possible fit for all of them?
To give some context for the questions: an initial lambda value of 0.25 for both data sets produced the following fits:
But set 1 works better with an initial lambda value of 0.75 (changed manually). This is clearly a better fit, but it was not found while the initial guess was set to 0.25.
The following is my sample code:
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
rawDataList = [0.76,0.77,0.81,0.84,0.83,0.85,0.77,0.66,0.64,0.72,0.69,0.59,0.74,0.65,0.76,
0.76,0.88,0.75,0.53,0.72,0.53,0.74,0.72,0.62,0.73,0.77,0.74,0.54,0.58,0.70,0.83,0.67,0.84,0.62]
rawDataList_2 = [0.74,0.77,0.75,0.66,0.6,0.63,0.76,0.73,0.56,0.68,0.74,0.56,0.76,0.70,0.72,
0.83,0.76,0.69,0.64,0.68,0.71,0.71,0.61,0.78,0.65,0.61,0.72]
def GaussianSmooth(x, c1, c3, Lambda, theta, sigma):
    x0 = 0.
    return c1 + c3 * np.cos((2*np.pi*(x/Lambda)) - theta) * np.exp(-(x - x0)**2 / (2 * sigma**2))
## For Binned Data of rawDataList
x = np.arange(len(rawDataList))
x = x*0.06 #Convert x-axis to seconds.
y = np.array(rawDataList)
popt,pcov = curve_fit(GaussianSmooth, x, y, p0=[np.mean(rawDataList),np.max(rawDataList) - np.mean(rawDataList),0.75,0.0,1.5], bounds=((0., 0., 0. ,0., 0.), (1.0, 1.0, 10.0, 10.0, 10.0)), method='trf',maxfev=10000)
plt.xlabel('Time (s)')
plt.ylabel('Performance from 0-100%')
plt.title('Fit for Performance')
plt.plot(x, y, '+:', color='blue', label='data')
plt.plot(x, GaussianSmooth(x, *popt), '-', color='red', label='fit')
plt.legend()
plt.show()
## For Binned Data of rawDataList_2
x = np.arange(len(rawDataList_2))
x = x*0.06 #Convert x-axis to seconds.
y = np.array(rawDataList_2)
popt,pcov = curve_fit(GaussianSmooth, x, y, p0=[np.mean(rawDataList_2),np.max(rawDataList_2) - np.mean(rawDataList_2),0.25,0.0,1.5], bounds=((0., 0., 0. ,0., 0.), (1.0, 1.0, 10.0, 10.0, 10.0)), method='trf',maxfev=10000)
plt.xlabel('Time (s)')
plt.ylabel('Performance from 0-100%')
plt.title('Fit for Performance')
plt.plot(x, y, '+:', color='red', label='data')
plt.plot(x, GaussianSmooth(x, *popt), '-', color='blue', label='fit')
plt.legend()
plt.show()
POST EDIT IN RESPONSE TO COMMENT 1:
from scipy.optimize import differential_evolution
def generate_Initial_Parameters():
    global parameterBounds2
    parameterBounds = []
    parameterBounds.append([np.mean(rawDataList) - 0.05, np.mean(rawDataList) + 0.05]) # parameter bounds for c1; 0.05 arbitrary just to give it a small window to form proper lower and upper bound
    parameterBounds.append([np.max(rawDataList) - np.mean(rawDataList) - 0.05, np.max(rawDataList) - np.mean(rawDataList) + 0.05]) # parameter bounds c3
    parameterBounds.append([0.125, 10.0]) # parameter bounds for Lambda; Nyquist limit, can't detect more than 8Hz in current data set. So 1/8 = 0.125. 1/0.1 = 10.
    parameterBounds.append([0.0, 2*np.pi]) # parameter bounds for theta; Phase offset in radians.
    parameterBounds.append([0.0, 3.0]) # parameter bounds for sigma
    # curve_fit expects bounds as (all lower limits, all upper limits)
    parameterBounds2 = ((parameterBounds[0][0], parameterBounds[1][0], parameterBounds[2][0],
                         parameterBounds[3][0], parameterBounds[4][0]), (parameterBounds[0][1],
                         parameterBounds[1][1], parameterBounds[2][1], parameterBounds[3][1],
                         parameterBounds[4][1]))
    # "seed" the numpy random number generator for repeatable results
    # sumOfSquaredError (not shown here) is the objective that differential_evolution minimises
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x
initialParameters = generate_Initial_Parameters()
popt,pcov = curve_fit(GaussianSmooth, x, y, initialParameters, bounds=parameterBounds2, maxfev=10000)
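One way to approach the first question is to let a global optimizer search the bounded parameter space and pass its result to curve_fit as the initial guess. The following is a minimal, self-contained sketch of that idea, not a definitive implementation. It rests on several assumptions: the data are synthetic and generated from made-up parameters rather than taken from the lists above, and damped_cosine, sum_of_squared_error and search_bounds are illustrative names. The objective function in particular is an assumption, since the snippet above does not show its own.

```python
import numpy as np
from scipy.optimize import curve_fit, differential_evolution

def damped_cosine(x, c1, c3, Lambda, theta, sigma):
    # Same functional form as GaussianSmooth, with x0 fixed at 0
    return c1 + c3 * np.cos(2 * np.pi * x / Lambda - theta) * np.exp(-x**2 / (2 * sigma**2))

# Synthetic demo data from made-up parameters (not the measured lists above)
rng = np.random.default_rng(0)
x_demo = np.arange(34) * 0.06
y_demo = damped_cosine(x_demo, 0.7, 0.1, 0.75, 0.5, 1.5) + rng.normal(0.0, 0.005, x_demo.size)

def sum_of_squared_error(params):
    # Objective for the global search: sum of squared residuals of the model
    return np.sum((y_demo - damped_cosine(x_demo, *params)) ** 2)

# One (min, max) pair per parameter, in the order c1, c3, Lambda, theta, sigma
search_bounds = [(0.0, 1.0), (0.0, 1.0), (0.125, 10.0), (0.0, 2 * np.pi), (0.1, 3.0)]

# Global search over the bounded space; the seed only makes the run repeatable
de_result = differential_evolution(sum_of_squared_error, search_bounds, seed=3)

# curve_fit expects bounds as (all lower limits, all upper limits)
lower, upper = zip(*search_bounds)
p0 = np.clip(de_result.x, lower, upper)  # keep the start point inside the bounds
popt_demo, pcov_demo = curve_fit(damped_cosine, x_demo, y_demo, p0=p0, bounds=(lower, upper))

sse_global = sum_of_squared_error(de_result.x)
sse_refined = sum_of_squared_error(popt_demo)
print(popt_demo, sse_refined)
```

The global search is slower than a single curve_fit call, but it removes the need to hand-tune the guess for each data set; only the bounds have to be chosen once.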
I have a data set in which x and y are the known quantities in my function (they appear in the function as x=x and y=x1), and I need to fit the data to obtain values for the unknown parameters (E, B0, S0).
This is what I have so far, but when I run it I get the error:
ValueError: x and y must have same first dimension, but have shapes (4L,) and (1L,)
This error occurs when I try to plot the data against the fitted curve. I also get this error from the bounds I have set up:
lb, ub = [np.asarray(b, dtype=float) for b in bounds]
ValueError: too many values to unpack
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func (x, x1, E, B0, S0):
    # function to optimize where x and x1 are known
    # E, B0, S0 need to be fitted
    return sum((x-np.power((E*B0*(1+((x1-S0)/(B0)))),(1/2)))**2)
#define the data to be fit
xdata = [0.00, 3.42, 4.56, 5.31] #distance
ydata = [335.4, 149.1, 167.1, 292.2] # beam size
plt.plot(xdata, ydata, 'b-', label='data')
plt.show()
# fit for parameters E, B0, and S0
popt, pcov = curve_fit(func, xdata, ydata)
plt.plot(xdata, func(xdata, *popt), 'r-', label='fit')
#put bounds on the optimization: 0.5<E<5, 0.1<B0<10, 1<S0<10
bnds= [(0.5,5.0),(0.1,10.0),(1,10)]
popt, pcov = curve_fit(func, xdata, ydata, bounds = [(0.5,5.0),(0.1,10.0),
(1.0,10.0)])
plt.plot(xdata,func(xdata, *popt),'g--', label='fit-with-bounds')
plt.xlabel('distance')
plt.ylabel('beam size')
plt.legend()
plt.show()
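It's not clear what the sum in the func function is supposed to do. It collapses the result to a single value, which is why the plot call complains about shapes (4,) and (1,). You may leave it out to get rid of the first error.
Second, the bounds argument of curve_fit does apply to the parameters, but it must be a 2-tuple of the form (lower_bounds, upper_bounds) with one entry per parameter in each, not a list of (min, max) pairs. Passing three pairs is what triggers "too many values to unpack". Leave the bounds out, as in the code below, and you'll get rid of the second error.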
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func (x, x1, E, B0, S0):
    # function to optimize where x and x1 are known
    # E, B0, S0 need to be fitted
    return (x-np.power((E*B0*(1.+((x1-S0)/(B0)))),(1/2.)))**2
#define the data to be fit
xdata = [0.00, 3.42, 4.56, 5.31] #distance
ydata = [335.4, 149.1, 167.1, 292.2] # beam size
plt.plot(xdata, ydata, 'b-', label='data')
# fit for parameters E, B0, and S0
popt, pcov = curve_fit(func, xdata, ydata)
plt.plot(xdata, func(xdata, *popt), 'r-', label='fit')
popt, pcov = curve_fit(func, xdata, ydata)
plt.plot(xdata,func(xdata, *popt),'g--', label='fit-with-bounds')
plt.xlabel('distance')
plt.ylabel('beam size')
plt.legend()
plt.show()
Now obviously "fit" and "fit-with-bounds" are the same.
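If the bounds are still wanted, curve_fit accepts them as a pair of sequences: all lower limits first, then all upper limits. The following is a minimal sketch of converting a list of per-parameter (min, max) pairs into that form. The model decay, the synthetic data and the numeric limits are assumptions chosen only for illustration and are unrelated to the beam-size function above.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, k, c):
    # Hypothetical model with three fit parameters
    return a * np.exp(-k * x) + c

# Noise-free synthetic data generated from made-up parameters
x_demo = np.linspace(0.0, 4.0, 30)
y_demo = decay(x_demo, 2.0, 1.3, 0.5)

# Per-parameter (min, max) pairs, in the order a, k, c
pairs = [(0.5, 5.0), (0.1, 10.0), (0.0, 10.0)]

# Transpose into the (lower_bounds, upper_bounds) form curve_fit expects
lower, upper = zip(*pairs)

popt_demo, pcov_demo = curve_fit(decay, x_demo, y_demo,
                                 p0=[1.0, 1.0, 1.0], bounds=(lower, upper))
print(popt_demo)
```

The initial guess p0 has to lie within the bounds; supplying it explicitly avoids relying on the default start values.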
Edit: To fit for E, B0, S0 only, the fit function should only take those values as arguments.
funcwithx1 = lambda x,x1, E, B0, S0: (x-np.power((E*B0*(1.+((x1-S0)/(B0)))),(1/2.)))**2
x1 = 4.6
func = lambda x, E, B0, S0: funcwithx1(x, x1, E, B0, S0)
The function is defined incorrectly. You know both the independent and the dependent variable, but the fit function should receive only the independent one, together with the parameters, and return the predicted dependent value:
y = func(x; params)
As it stands now, your objective function has 4 parameters to be determined.
Later, when calling curve_fit, you supply both the independent and the dependent variable, as you correctly do in
popt, pcov = curve_fit(func, xdata, ydata)
Thus, popt is an array of length 4, which is probably causing part of your problems.
I don't know your exact objective function, so I won't attempt to fix it. I hope this helps you solve the issue.
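To illustrate the point about the signature: curve_fit treats the first argument of the model as the independent variable and every remaining argument as a parameter to fit, so the length of popt follows directly from the function definition. The sketch below uses a hypothetical straight_line model and made-up data purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def straight_line(x, slope, intercept):
    # First argument: independent variable; the rest: parameters to fit
    return slope * x + intercept

# Made-up, noise-free data lying exactly on y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + 1.0

# The dependent variable is passed to curve_fit, not to the model
popt_demo, pcov_demo = curve_fit(straight_line, x_demo, y_demo)
print(len(popt_demo), popt_demo)
```

A known quantity that should not be fitted therefore must not appear as an extra positional argument of the model.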
I am trying to fit some data using the following code:
xdata = [0.03447378, 0.06894757, 0.10342136, 0.13789514, 0.17236893,
0.20684271, 0.24131649, 0.27579028, 0.31026407, 0.34473785,
0.37921163, 0.41368542, 0.44815921, 0.48263299]
ydata = [ 2.5844 , 2.87449, 3.01929, 3.10584, 3.18305, 3.24166,
3.28897, 3.32979, 3.35957, 3.39193, 3.41662, 3.43956,
3.45644, 3.47135]
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func(x, a, b, c, d):
    return a + b*x - c*np.exp(-d*x)
popt, pcov = curve_fit(func, xdata, ydata))
plt.figure()
plt.plot(xdata, ydata, 'ko', label="Original Noised Data")
plt.plot(xdata, func(xdata, *popt), 'r-', label="Fitted Curve")
plt.legend()
plt.show()
The curve is not being fitted:
Data fit with straight line - should be curve
What should I be doing to correctly fit the data?
It looks like the optimizer is getting stuck in a local minimum, or perhaps just a very flat area of the objective function. A better fit can be found by tweaking the initial guess of the parameters that is used by curve_fit. For example, I get a reasonable-looking fit with p0=[1, 1, 1, 2.0] (the default is [1, 1, 1, 1]):
Here's the modified version of your script that I used:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func(x, a, b, c, d):
    return a + b*x - c*np.exp(-d*x)
xdata = np.array([0.03447378, 0.06894757, 0.10342136, 0.13789514, 0.17236893,
0.20684271, 0.24131649, 0.27579028, 0.31026407, 0.34473785,
0.37921163, 0.41368542, 0.44815921, 0.48263299])
ydata = np.array([ 2.5844 , 2.87449, 3.01929, 3.10584, 3.18305, 3.24166,
3.28897, 3.32979, 3.35957, 3.39193, 3.41662, 3.43956,
3.45644, 3.47135])
p0 = [1, 1, 1, 2.0]
popt, pcov = curve_fit(func, xdata, ydata, p0=p0)
print(popt)
plt.figure()
plt.plot(xdata, ydata, 'ko', label="Original Noised Data")
plt.plot(xdata, func(xdata, *popt), 'r-', label="Fitted Curve")
plt.legend(loc='best')
plt.show()
The printed output is:
[ 3.13903988 0.71827903 0.97047248 15.40936232]
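Because the result depends on the starting point, one simple option is to try a few initial guesses and keep the fit with the smallest sum of squared residuals. The sketch below does this for the data and model from the question. The particular list of starting values for d is an arbitrary assumption, and best_popt and best_sse are illustrative names.

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c, d):
    return a + b*x - c*np.exp(-d*x)

xdata = np.array([0.03447378, 0.06894757, 0.10342136, 0.13789514, 0.17236893,
                  0.20684271, 0.24131649, 0.27579028, 0.31026407, 0.34473785,
                  0.37921163, 0.41368542, 0.44815921, 0.48263299])
ydata = np.array([2.5844, 2.87449, 3.01929, 3.10584, 3.18305, 3.24166,
                  3.28897, 3.32979, 3.35957, 3.39193, 3.41662, 3.43956,
                  3.45644, 3.47135])

best_popt, best_sse = None, np.inf
# Arbitrary set of starting values for d; a, b and c start at 1
for d_guess in (1.0, 2.0, 5.0, 10.0, 20.0):
    try:
        popt_try, _ = curve_fit(func, xdata, ydata, p0=[1, 1, 1, d_guess])
    except RuntimeError:
        # curve_fit raises RuntimeError when the optimization does not converge
        continue
    sse = np.sum((ydata - func(xdata, *popt_try)) ** 2)
    if sse < best_sse:
        best_popt, best_sse = popt_try, sse

print(best_popt, best_sse)
```

This does not guarantee the global optimum, but it makes the outcome less sensitive to any single starting value.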
Please try to be more specific about the issue you're having.
Two things I noticed that will prevent your code from working as it is:
On line 15 (the curve_fit() call), there is an extra right parenthesis at the end of the line.
xdata is a Python list, so it may not work once you try to multiply it by a parameter in func; turn it into a numpy array with
xdata = np.array(xdata)
If you fix these two issues, the code should run.
Edit: Warren is of course right. Even with the above issues fixed, the optimizer still ends up in the wrong minimum when it starts from the default initial guess.