Based on existing topics on Stack Overflow, I have managed to fit a Gaussian curve to my dataset. However, the fitted Gaussian shows one tail that does not return to the base level (in the example below, the right tail suddenly stops at a higher y-value than the left tail). This surprises me, since by definition a Gaussian is a perfectly symmetrical bell-shaped curve.
How can I generate a Gaussian curve whose tails are equally long (i.e., they stop at the same distance from the plume centre-line) and end at the same base level (i.e., the same y-value)? The reason I want this is that in my data a second peak sometimes starts to rise before the first peak has returned to the base level. I would like to separate these peaks by fitting a Gaussian that returns to the base level, as theoretically each peak should. Thanks a lot in advance!
import numpy as np
from lmfit import Model
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
x = np.array([-20.0,-17.0,-14.0,-11.0,-8.0,-5.0,-2.0,1.0,4.0,7.0,10.0,13.0,16.0,19.0,22.0,25.0,28.0,31.0,34.0,37.0,40.0,43.0,46.0,49.0,52.0,55.0,58.0,61.0,64.0,67.0,70.0,73.0,76.0,79.0,82.0])
y = np.array([1.90269,1.93535,2.62402,3.08949,2.82409,3.07588,3.22015,3.18884,5.14053,10.5111,18.6118,28.6343,37.7625,46.3641,53.9163,60.7622,66.5765,71.0596,74.4948,77.7177,80.373,82.5833,83.9021,83.4652,79.0229,71.4679,61.93,52.113,43.8517,36.211,29.3815,23.8966,19.31,15.5209,12.4532])
def gaussian(x, amp, cen, wid):
    return (amp / (np.sqrt(2*np.pi) * wid)) * np.exp(-(x-cen)**2 / (2*wid**2))
def line(x, slope, intercept):
    return slope*x + intercept
peak_index = find_peaks(y,height=27.6)[0][0]
mean = sum(x*y)/np.sum(y) #weighted arithmetic mean
mod = Model(gaussian) + Model(line)
pars = mod.make_params(amp=max(y), cen=x[peak_index],
                       wid=np.sqrt(sum((x-mean)**2 * y)/sum(y)), slope=0, intercept=1)
result = mod.fit(y, pars, x=x)
comps = result.eval_components()
plt.plot(x, y, 'bo')
plt.plot(x, comps['gaussian'], 'k--')
Edit: The following example hopefully illustrates why I am interested in this. I have a long dataset in which the signals of different sources are measured. The dataset is processed so that it produces the arrays x_measured and y_measured, which contain the measured values belonging to one source. My program automatically detects the plume that occurs within the measured values and stores its values in arrays called x and y, to which I then fit a Gaussian.
However, sometimes the measured values show that two plumes overlap, so there is no measured plume that starts from and returns to the base level. An example is given in the code below. For these measured values my program now produces a Gaussian fit whose right tail goes to around y=0, but whose left tail stops around y=4.5. I would like the left tail to also return to around y=0, because theoretically I know that each plume should start from and return to the same base level, and I want to compute the plume width of such a Gaussian plume. In the example below the left tail does not return to around y=0, so I cannot determine the width of the plume. I would like a Gaussian fit whose tails both return to the same base level of y=0, so that I can determine the plume width.
x_measured = np.arange(-20,245,3)
y_measured = np.array([38.7586,38.2323,37.2958,35.9924,34.4196,32.7123,31.0257,29.5169,28.3244,27.5502,27.2458,27.4078,27.9815,28.8728,29.9643,31.1313,32.2545,33.2276,33.9594,34.373,34.4041,34.0009,33.1267,31.7649,29.9247,27.6458,24.9992,22.0845,19.0215,15.9397,12.966,10.2127,7.76834,5.69046,4.00296,2.69719,1.73733,1.06907,0.629744,0.358021,0.201123,0.11878,0.0839719,0.0813392,0.104295,0.151634,0.224209,0.321912,0.441478,0.575581,0.713504,0.843351,0.954777,1.04109,1.09974,1.13118,1.13683,1.11758,1.07369,1.0059,0.917066,0.81321,0.703288,0.597775,0.506678,0.437843,0.396256,0.384633,0.405147,0.461496,0.560387,0.71144,0.925262,1.21022,1.56925,1.99788,2.48458,3.01314,3.56626,4.12898,4.69031,5.24283,5.78014,6.29365,6.77004,7.19071,7.53399,7.78019,7.91889])
x = np.arange(10,104,3)
y = np.array([22.4548,23.4302,25.3389,27.9929,30.486,32.0528,33.5527,35.1304,35.9941,36.8606,37.1889,37.723,36.4069,35.9751,33.8824,31.0909,27.4247,23.3213,18.8772,14.3363,11.1075,7.68792,4.54899,2.2057,0,0,0,0,0,0,0.179834,0])
def gaussian(x, amp, cen, wid):
    return (amp / (np.sqrt(2*np.pi) * wid)) * np.exp(-(x-cen)**2 / (2*wid**2))
def line(x, slope, intercept):
    return slope*x + intercept
peak_index = find_peaks(y,height=27.6)[0][0]
mean = sum(x*y)/np.sum(y) #weighted arithmetic mean
mod = Model(gaussian) + Model(line)
pars = mod.make_params(amp=max(y), cen=x[peak_index],
                       wid=np.sqrt(sum((x-mean)**2 * y)/sum(y)), slope=0, intercept=1)
result = mod.fit(y, pars, x=x)
comps = result.eval_components()
plt.plot(x, y, 'bo')
plt.plot(x, comps['gaussian'], 'k--')
plt.plot(x_measured,y_measured)
It is unclear why you expect a bimodal fit from the model you defined. Use two different Gaussian functions for your fit, then evaluate the fitted functions over a longer interval x_fit to see the curves return to baseline:
import numpy as np
from lmfit import Model
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
x = np.array([-20.0,-17.0,-14.0,-11.0,-8.0,-5.0,-2.0,1.0,4.0,7.0,10.0,13.0,16.0,19.0,22.0,25.0,28.0,31.0,34.0,37.0,40.0,43.0,46.0,49.0,52.0,55.0,58.0,61.0,64.0,67.0,70.0,73.0,76.0,79.0,82.0])
y = np.array([1.90269,1.93535,2.62402,3.08949,2.82409,3.07588,3.22015,3.18884,5.14053,10.5111,18.6118,28.6343,37.7625,46.3641,53.9163,60.7622,66.5765,71.0596,74.4948,77.7177,80.373,82.5833,83.9021,83.4652,79.0229,71.4679,61.93,52.113,43.8517,36.211,29.3815,23.8966,19.31,15.5209,12.4532])
def gaussian1(x, amp1, cen1, wid1):
    return (amp1 / (np.sqrt(2*np.pi) * wid1)) * np.exp(-(x-cen1)**2 / (2*wid1**2))
def gaussian2(x, amp2, cen2, wid2):
    return (amp2 / (np.sqrt(2*np.pi) * wid2)) * np.exp(-(x-cen2)**2 / (2*wid2**2))
#peak_index = find_peaks(y,height=27.6)[0][0]
#mean = sum(x*y)/np.sum(y) #weighted arithmetic mean
mod = Model(gaussian1) + Model(gaussian2)
#I just filled in some start values, the details of educated guesses can be filled in later by you
pars = mod.make_params(amp1=30, amp2=40, cen1=20, cen2=40, wid1=2, wid2=2)
result = mod.fit(y, pars, x=x)
print(result.params)
x_fit=np.linspace(-30, 120, 500)
comps_elem = result.eval_components(x=x_fit)
comps_comb = result.eval(x=x_fit)
plt.plot(x, y, 'bo')
plt.plot(x_fit, comps_comb, 'k')
plt.plot(x_fit, comps_elem['gaussian1'], 'k-.')
plt.plot(x_fit, comps_elem['gaussian2'], 'k--')
plt.show()
Sample output (plot of the data points, the combined fit, and the two Gaussian components, both returning to baseline):
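With the two components separated like this, the plume width asked about in the question can be derived from the fitted wid values (the Gaussian sigma). This is not part of the original answer, just a minimal sketch that reuses the result object from the fit above and takes the full width at half maximum as the width measure:
fwhm_factor = 2 * np.sqrt(2 * np.log(2))  # FWHM of a Gaussian = 2*sqrt(2*ln(2))*sigma
wid1 = result.params['wid1'].value
wid2 = result.params['wid2'].value
print('FWHM of plume 1:', fwhm_factor * wid1)
print('FWHM of plume 2:', fwhm_factor * wid2)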
The corresponding fit with scipy.optimize.curve_fit would look like this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
x = [-20.0,-17.0,-14.0,-11.0,-8.0,-5.0,-2.0,1.0,4.0,7.0,10.0,13.0,16.0,19.0,22.0,25.0,28.0,31.0,34.0,37.0,40.0,43.0,46.0,49.0,52.0,55.0,58.0,61.0,64.0,67.0,70.0,73.0,76.0,79.0,82.0]
y = [1.90269,1.93535,2.62402,3.08949,2.82409,3.07588,3.22015,3.18884,5.14053,10.5111,18.6118,28.6343,37.7625,46.3641,53.9163,60.7622,66.5765,71.0596,74.4948,77.7177,80.373,82.5833,83.9021,83.4652,79.0229,71.4679,61.93,52.113,43.8517,36.211,29.3815,23.8966,19.31,15.5209,12.4532]
def gauss(x, mu, sigma, A):
    return A*np.exp(-(x-mu)**2/2/sigma**2)
def bimodal(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return gauss(x, mu1, sigma1, A1) + gauss(x, mu2, sigma2, A2)
expected = (20, 2, 30, 40, 2, 40)
params, cov = curve_fit(bimodal, x, y, expected)
sigma=np.sqrt(np.diag(cov))
x_fit = np.linspace(-20, 120, 500)
plt.plot(x_fit, bimodal(x_fit, *params), color='red', lw=3, label='model')
plt.plot(x_fit, gauss(x_fit, *params[:3]), color='red', lw=1, ls="--", label='distribution 1')
plt.plot(x_fit, gauss(x_fit, *params[3:]), color='red', lw=1, ls=":", label='distribution 2')
plt.scatter(x, y, marker="X", color="black", label="original data")
plt.legend()
print(pd.DataFrame(data={'params': params, 'sigma': sigma}, index=bimodal.__code__.co_varnames[1:]))
plt.show()
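Since the original goal was to separate two overlapping peaks, a simple follow-up (not part of the original answer, just a sketch reusing the params fitted above) is to look for the point between the two fitted centres where the two Gaussian components are equal and use it as the boundary between the peaks:
# approximate intersection of the two fitted components between their centres
centres = sorted([params[0], params[3]])          # mu1 and mu2
x_between = np.linspace(centres[0], centres[1], 1000)
g1 = gauss(x_between, *params[:3])
g2 = gauss(x_between, *params[3:])
x_split = x_between[np.argmin(np.abs(g1 - g2))]
print('split the data at x =', x_split)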
I am new to Python and was trying to fit a distribution to my dataset using the following code. The actual data is a list that contains two columns: predicted market price and actual market price. I was trying to use scipy.optimize.curve_fit, but it gave me many lines plotted in the same place. Any help is appreciated.
# import the necessary modules and define a func.
from scipy.optimize import curve_fit
from matplotlib import pyplot as plt
def func(x, a, b, c):
    return a * x**b + c
# my data
pred_data = [3.0,1.0,1.0,7.0,6.0,1.0,7.0,4.0,9.0,3.0,5.0,5.0,2.0,6.0,8.0]
actu_data =[ 3.84,1.55,1.15,7.56,6.64,1.09,7.12,4.17,9.45,3.12,5.37,5.65,1.92,6.27,7.63]
popt, pcov = curve_fit(func, pred_data, actu_data)
#adjusting y
yaj = func(pred_data, popt[0],popt[1], popt[2])
# plot the data
plt.plot(pred_data,actu_data, 'ro', label = 'Data')
plt.plot(pred_data,yaj,'b--', label = 'Best fit')
plt.legend()
plt.show()
SciPy doesn't produce multiple lines; the strange output is caused by the way you pass your unsorted data to matplotlib. Sort your x-values and you get the desired output:
from scipy.optimize import curve_fit
from matplotlib import pyplot as plt
def func(x, a, b, c):
    return a * x**b + c
# my data
pred_data = [3.0,1.0,1.0,7.0,6.0,1.0,7.0,4.0,9.0,3.0,5.0,5.0,2.0,6.0,8.0]
actu_data =[ 3.84,1.55,1.15,7.56,6.64,1.09,7.12,4.17,9.45,3.12,5.37,5.65,1.92,6.27,7.63]
popt, pcov = curve_fit(func, pred_data, actu_data)
#adjusting y
yaj = func(sorted(pred_data), *popt)
# plot the data
plt.plot(pred_data,actu_data, 'ro', label = 'Data')
plt.plot(sorted(pred_data),yaj,'b--', label = 'Best fit')
plt.legend()
plt.show()
A better way, of course, is to define an evenly spaced, high-resolution array for your x-values and calculate the fit for that array, so you get a smoother representation of your fit function:
from scipy.optimize import curve_fit
import numpy as np
from matplotlib import pyplot as plt
def func(x, a, b, c):
    return a * x**b + c
# my data
pred_data = [3.0,1.0,1.0,7.0,6.0,1.0,7.0,4.0,9.0,3.0,5.0,5.0,2.0,6.0,8.0]
actu_data =[ 3.84,1.55,1.15,7.56,6.64,1.09,7.12,4.17,9.45,3.12,5.37,5.65,1.92,6.27,7.63]
popt, pcov = curve_fit(func, pred_data, actu_data)
xaj = np.linspace(min(pred_data), max(pred_data), 1000)
yaj = func(xaj, *popt)
# plot the data
plt.plot(pred_data,actu_data, 'ro', label = 'Data')
plt.plot(xaj, yaj,'b--', label = 'Best fit')
plt.legend()
plt.show()
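If you also want a single number for how well the fitted power law describes the data, you can compute the root-mean-square error at the measured points. This is not part of the original answer, just a common sanity check using the variables defined above:
# root-mean-square error of the fit at the measured x-values
residuals = np.array(actu_data) - func(np.array(pred_data), *popt)
rmse = np.sqrt(np.mean(residuals**2))
print('RMSE of the fit:', rmse)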
I have a function f(theta) = a + b*cos(theta - c) as well as sampled data. I'd like to find the coefficients a, b, and c that minimize the mean square error. Any idea if there's an efficient way to do this in Python?
EDIT:
import numpy as np
from scipy.optimize import curve_fit
#definition of the function
def myfunc(x, a, b, c):
    return a + b * np.cos(x - c)
#sample data
x_data = [0, 60, 120, 180, 240, 300]
y_data = [25, 40, 70, 30, 10, 15]
#the actual curve fitting procedure, a, b, c are stored in popt
popt, _pcov = curve_fit(myfunc, x_data, y_data)
print(popt)
print(np.degrees(popt[2]))
#the rest is just a graphic representation of the data points and the fitted curve
from matplotlib import pyplot as plt
#x_fit = np.linspace(-1, 6, 1000)
y_fit = myfunc(x_data, *popt)
plt.plot(x_data, y_data, "ro")
plt.plot(x_data, y_fit, "b")
plt.xlabel(r'$\theta$ (degrees)');
plt.ylabel(r'$f(\theta)$');
plt.legend()
plt.show()
Here is a picture showing how the curve doesn't really fit the points. It seems like the amplitude should be higher. The local mins and maxes appear to be in the right places.
scipy.optimize.curve_fit makes it really easy to fit data points to your custom function:
import numpy as np
from scipy.optimize import curve_fit
#definition of the function
def myfunc(x, a, b, c):
    return a + b * np.cos(x - c)
#sample data
x_data = np.arange(5)
y_data = 2.34 + 1.23 * np.cos(x_data + .23)
#the actual curve fitting procedure, a, b, c are stored in popt
popt, _pcov = curve_fit(myfunc, x_data, y_data)
print(popt)
#the rest is just a graphic representation of the data points and the fitted curve
from matplotlib import pyplot as plt
x_fit = np.linspace(-1, 6, 1000)
y_fit = myfunc(x_fit, *popt)
plt.plot(x_data, y_data, "ro", label = "data points")
plt.plot(x_fit, y_fit, "b", label = "fitted curve\na = {}\nb = {}\nc = {}".format(*popt))
plt.legend()
plt.show()
Output:
[ 2.34 1.23 -0.23]
Edit:
Your question update introduces several problems. First, your x-values are in degrees, while np.cos expects radians, so we should convert the values with np.deg2rad (the reverse function is np.rad2deg).
Second, it is a good idea to fit for different frequencies as well, so let's introduce an additional parameter d for that.
Third, fits are usually quite sensitive to initial guesses; you can provide them via the p0 argument of curve_fit.
Fourth, you reduced the resolution of the fitted curve to the low resolution of your data points, which is why it looks so undersampled. If we address all these problems:
import numpy as np
from scipy.optimize import curve_fit
#sample data
x_data = [0, 60, 120, 180, 240, 300]
y_data = [25, 40, 70, 30, 10, 15]
#definition of the function with additional frequency value d
def myfunc(x, a, b, c, d):
    return a + b * np.cos(d * np.deg2rad(x) - c)
#initial guess of parameters a, b, c, d
p_initial = [np.average(y_data), np.average(y_data), 0, 1]
#the actual curve fitting procedure, a, b, c, d are stored in popt
popt, _pcov = curve_fit(myfunc, x_data, y_data, p0 = p_initial)
print(popt)
#we have to convert the phase shift back into degrees
print(np.rad2deg(popt[2]))
#graphic representation of the data points and the fitted curve
from matplotlib import pyplot as plt
#define x_values for a smooth curve representation
x_fit = np.linspace(np.min(x_data), np.max(x_data), 1000)
y_fit = myfunc(x_fit, *popt)
plt.plot(x_data, y_data, "ro", label = "data")
plt.plot(x_fit, y_fit, "b", label = "fit")
plt.xlabel(r'$\theta$ (degrees)');
plt.ylabel(r'$f(\theta)$');
plt.legend()
plt.show()
we get this output:
[34.31293761 26.92479369 2.20852009 1.18144319]
126.53888003953764
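As a side note (not part of the original answer): if the frequency is known and fixed, the original model a + b*cos(theta - c) can be rewritten as a + p1*cos(theta) + p2*sin(theta) with p1 = b*cos(c) and p2 = b*sin(c), which is linear in its parameters. It can then be solved directly with linear least squares, without any initial guesses. A minimal sketch under that assumption, reusing x_data and y_data from above:
theta = np.deg2rad(x_data)                                  # degrees to radians
M = np.column_stack([np.ones_like(theta), np.cos(theta), np.sin(theta)])
(a_lin, p1, p2), *_ = np.linalg.lstsq(M, y_data, rcond=None)
b_lin = np.hypot(p1, p2)                                    # recovered amplitude
c_lin = np.arctan2(p2, p1)                                  # recovered phase shift in radians
print(a_lin, b_lin, np.rad2deg(c_lin))
Keep in mind that this does not include the extra frequency parameter d used in the fit above, so it will only match the data as well as a fixed-frequency cosine can.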
I am trying to fit an exponential law to my data. My (x, y) sample is rather complicated to explain, so for general understanding and reproducibility I will say that both variables are floats and continuous, with 0<=x<=100 and 0<=y<=1.
from scipy.optimize import curve_fit
import numpy
import matplotlib.pyplot as plt
#ydata=[...] is my list with y values, which contains 0 values
#xdata=[...] is my list with x values
transf_y=[]
for i in range(len(ydata)):
    transf_y.append(ydata[i]+0.00001) #Adding something to avoid zero values
x=numpy.array(xdata,dtype=float)
y=numpy.array(transf_y,dtype=float)
def func(x, a, c, d):
    return a * numpy.exp(-c*x)+d
popt, pcov = curve_fit(func, x, y,p0 = (1, 1e-6, 1))
print ("a = %s , c = %s, d = %s" % (popt[0], popt[1], popt[2]))
xx = numpy.linspace(300, 6000, 1000)
yy = func(xx, *popt)
plt.plot(x,y,label='Original Data')
plt.plot(xx, yy, label="Fitted Curve")
plt.legend(loc='upper left')
plt.show()
Now my fitted curve doesn't look anything like a fitted exponential curve. Rather, it looks like a moving-average curve, as if it had been added as a trendline in Excel. What could be the problem? If necessary I'll find a way to make the datasets available so that the example is reproducible.
This is what I get out of my code (I don't even know why I am getting three elements in the legend while only two are plotted, at least apparently):
A multitude of things:
- your plot depicts the original data twice and no discernible fitted data
- your data does not seem to be ordered; I assume that is why you get zigzag lines
- in your example, your predicted curve covers the range from 300 to 6000, whereas your raw data lie in 0<=x<=100
That aside, your code is more or less correct and works.
from scipy.optimize import curve_fit
import numpy
import matplotlib.pyplot as plt
xdata=[100.0, 0.0, 90.0, 20.0, 80.0] #my list with x values - edit: you need some raw data to fit, so I inserted some
ydata=[0.001, 1.0, 0.02, 0.56, 0.03] #my list with y values, which contains values close to 0
transf_y=[]
for i in range(len(ydata)):
    transf_y.append(ydata[i]+0.00001) #Adding something to avoid zero values
x1=numpy.array(xdata,dtype=float)
y1=numpy.array(transf_y,dtype=float)
def func(x, a, c, d):
    return a * numpy.exp(-c*x)+d
popt, pcov = curve_fit(func, x1, y1,p0 = (1, 1e-6, 1))
print ("a = %s , c = %s, d = %s" % (popt[0], popt[1], popt[2]))
#ok, sorting your data
pairs = []
for i, j in zip(x1, y1):
    pairs.append([i,j])
sortedList = sorted(pairs, key = lambda x:x[0])
sorted_x = numpy.array(sortedList)[:,0]
sorted_y = numpy.array(sortedList)[:,1]
#adjusting interval to the limits of your raw data
xx = numpy.linspace(0, 100.0, 1000)
yy = func(xx, *popt)
#and everything looks fine
plt.plot(sorted_x,sorted_y, 'o',label='Original Data')
plt.plot(xx,yy,label='Fitted Data')
plt.legend(loc='upper left')
plt.show()
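One more practical note, not part of the original answer: for exponential fits, curve_fit often converges more reliably with data-driven start values than with the fixed p0 = (1, 1e-6, 1). A rough heuristic for a decaying signal, sketched with the sorted arrays from above (the names a0, c0, d0 and popt2 are just for illustration):
# crude, data-driven initial guesses for a*exp(-c*x) + d on a decaying signal
d0 = float(numpy.min(sorted_y))              # baseline ~ smallest y value
a0 = float(sorted_y[0] - d0)                 # amplitude ~ drop from the first (smallest-x) point
c0 = 1.0 / (sorted_x[-1] - sorted_x[0])      # decay rate ~ 1 / x-range
popt2, pcov2 = curve_fit(func, x1, y1, p0=(a0, c0, d0))
print(popt2)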