I have some measured data which can be either a well-established Gaussian or something that looks more like a gamma distribution. I currently have the following code snippet, which performs quite well for data that is nicely Gaussian:
import numpy
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def gaussFunction(x, A, mu, sigma):
    return A*numpy.exp(-(x-mu)**2/(2.*sigma**2))

# Snippet of the code that does the fitting
p0 = [numpy.max(y_points), x_points[numpy.argmax(y_points)], 0.1]
# Attempt to fit a gaussian function to the calibrant space
try:
    coeff, var_matrix = curve_fit(gaussFunction, x_points, y_points, p0)
    newX = numpy.linspace(x_points[0], x_points[-1], 1000)
    newY = gaussFunction(newX, *coeff)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    plt.plot(x_points, y_points, 'b*')
    plt.plot(newX, newY, '--')
    plt.show()
except RuntimeError:
    pass  # fit did not converge; error handling omitted in the original snippet
A demonstration that it works well for data points which are nicely Gaussian:
The problem, however, is that some of my data sets do not match a Gaussian well, and I get this:
I would be tempted to try a cubic spline, but conceptually I would like to stick with a Gaussian curve fit, since that is the structure that should be in the data (which can occur with a knee or a tail, as shown in the second figure). I would highly appreciate any tip or suggestion on how to deal with this 'issue'.
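For illustration, one direction I have considered (a rough sketch only; the erf-based skew term is just one possible choice, not something I know to be right for this data) is to add a skew parameter to the Gaussian so that a tail or knee can be absorbed by the fit:

import scipy.special

def skewedGaussFunction(x, A, mu, sigma, alpha):
    # plain Gaussian multiplied by a skew term; alpha = 0 recovers the symmetric case
    gauss = A*numpy.exp(-(x-mu)**2/(2.*sigma**2))
    skew = 1 + scipy.special.erf(alpha*(x-mu)/(sigma*numpy.sqrt(2.)))
    return gauss*skew

p0 = [numpy.max(y_points), x_points[numpy.argmax(y_points)], 0.1, 0.0]
coeff, var_matrix = curve_fit(skewedGaussFunction, x_points, y_points, p0)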
I am kind of new to scipy and curve_fit.
I have 2 lists:
x values:
[0.723938224, 0.965250965, 1.206563707, 1.447876448, 1.689189189,
1.930501931, 2.171814672]
y values:
[2.758, 2.443, 2.142333333, 1.911, 1.817666667, 1.688333333, 1.616]
I would like to perform a curve_fit on these two data sets, but I cannot seem to figure out the relationship. I know roughly the equation that fits them together:
0.74/((9.81*(x/100))^(1/2))
But how would I show that the equation above is the right one using Python curve fits? If I do a similar thing in Excel, it automatically gives me the equation. How would it work in Python?
I am not sure how to perform the curve_fit and draw the trendline. Could someone help? Thanks.
For a start, let's define the curve-fit function. You already know that the function is of the form a/((b*(x/c))**d) (and I tell you, Excel knows nothing about anything apart from autofill); this can easily be transformed into (a*(c/b)**d)/x**d, so the function we actually have to consider is of the form a/x**b.
Now to the actual curve fitting with scipy:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = [0.723938224, 0.965250965, 1.206563707, 1.447876448, 1.689189189, 1.930501931, 2.171814672]
y = [2.758, 2.443, 2.142333333, 1.911, 1.817666667, 1.688333333, 1.616]
def func(x, a, b):
    return a/(x**b)
#start values, not really necessary here but good to know the concept
p0 = [2, 0.5]
#the actual curve fitting, returns the parameters in popt and the covariance matrix in pcov
popt, pcov = curve_fit(func, np.asarray(x), np.asarray(y), p0)
#print out the parameters a, b
print(*popt)
#a=2.357411406488454, b=0.5027391574181408
#plot the function to see, if the fit is any good
#first the raw data
plt.scatter(x, y, marker="x", color="red", label="raw data")
#then the fitted curve
x_fit = np.linspace(0.9*min(x), 1.1*max(x), 1000)
y_fit = func(x_fit, *popt)
plt.plot(x_fit, y_fit, color="blue", label="fitted data")
plt.legend()
plt.show()
Output:
Looks good to me. And one shouldn't let Excel near any statistical data, if you ask me.
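As a quick sanity check (my own arithmetic, not part of the original fit): collapsing the constants of the proposed equation 0.74/((9.81*(x/100))^(1/2)) into the a/x**b form gives values very close to the fitted parameters:

a_expected = 0.74 / np.sqrt(9.81 / 100)   # ≈ 2.3626, compare with a ≈ 2.357
b_expected = 0.5                          # compare with b ≈ 0.503
print(a_expected, b_expected)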
When I use a piecewise function to fit my data, I don't know how to ensure that the fitted function is continuous at the breakpoint and that its first derivative matches there, i.e. how to make the piecewise function smooth.
I tried to fit the data separately, but as shown in the figure, I didn't know how to add constraints to the fitting process.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from scipy.optimize import curve_fit

# speed_low, ddma_low, speed_high, ddma_high, Good_ddma, speed_cygnss,
# eeeeeefffff and windspeed_queue are data arrays defined earlier in my script.

def func(x, a0, a1, a2):
    return a0 + (a1/x) + a2/(x*x)

def highfunc(x1, b0, b1, b2):
    return b0 + b1*x1 + b2*x1*x1

# low-speed branch
x = speed_low
y = ddma_low
popt, pcov = curve_fit(func, x, y)
a0, a1, a2 = popt
yvals = func(x, a0, a1, a2)  # fitted y values

# high-speed branch
x1 = speed_high
y1 = ddma_high
popt1, pcov1 = curve_fit(highfunc, x1, y1)
b0, b1, b2 = popt1
yvals1 = highfunc(x1, b0, b1, b2)  # fitted y values

xxxxx = np.hstack((x, x1))
yyyyy = np.hstack((yvals, yvals1))

######################### plotting ####################################
fig, ax = plt.subplots()
levels = np.linspace(0, 3, 7)
gci = ax.hist2d(Good_ddma, speed_cygnss, bins=400, cmap="jet", norm=LogNorm())
plot2 = plt.scatter(yvals, x, s=0.8, c='k', label='polyfit values')
plot3 = plt.scatter(yvals1, x1, s=0.8, c='r', label='polyfit values')
plot4 = plt.scatter(eeeeeefffff, windspeed_queue, s=20, c='k', marker='+')
The two curves I want to fit should be smooth and continuous, that is, the first derivative at the breakpoint should be equal on both sides. After thinking about it for a long time, I find that a spline fit cannot work here, because my piecewise function has a fixed functional form. This problem has troubled me for many days.
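For reference, one way the constraints could be imposed (a sketch of a reparameterization idea, not taken from an existing answer; x_b below is a hypothetical breakpoint you would replace with your own): fold both branches into a single model function and derive b0 and b1 from the continuity and slope conditions at x_b, so that curve_fit only ever sees parameter combinations that satisfy them.

import numpy as np
from scipy.optimize import curve_fit

x_b = 20.0  # hypothetical breakpoint, replace with your own value

def piecewise(x, a0, a1, a2, b2):
    # low branch: a0 + a1/x + a2/x**2; high branch: b0 + b1*x + b2*x**2
    f_xb = a0 + a1/x_b + a2/x_b**2        # value of the low branch at x_b
    df_xb = -a1/x_b**2 - 2*a2/x_b**3      # slope of the low branch at x_b
    b1 = df_xb - 2*b2*x_b                 # first derivatives match at x_b
    b0 = f_xb - b1*x_b - b2*x_b**2        # values match at x_b
    low = a0 + a1/x + a2/x**2
    high = b0 + b1*x + b2*x**2
    return np.where(x <= x_b, low, high)

# fit both regimes in a single call on the combined data, e.g.:
# x_all = np.hstack((speed_low, speed_high))
# y_all = np.hstack((ddma_low, ddma_high))
# popt, pcov = curve_fit(piecewise, x_all, y_all, p0=[1.0, 1.0, 1.0, 0.0])

This keeps the exact functional forms of func and highfunc while guaranteeing a smooth join.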
I have a data set and I am trying to see which distribution it follows best.
In a first attempt I tried to fit it with a Rayleigh distribution:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import rayleigh

# data is my raw array of measured distances
y, x = np.histogram(data, bins=45, density=True)
param = rayleigh.fit(y)  # distribution fitting

# fitted distribution
xx = np.linspace(0, 45, 1000)
pdf_fitted = rayleigh.pdf(xx, loc=param[0], scale=param[1])
pdf = rayleigh.pdf(xx, loc=0, scale=8.5)

fig, ax = plt.subplots(figsize=(7, 5))
plt.plot(xx, pdf, 'r-', lw=5, alpha=0.6, label='rayleigh pdf')
plt.plot(xx, pdf, 'k-', label='Data')
plt.bar(x[1:], y)
ax.set_xlabel('Distance, ' + r'$x [km]$', size=15)
ax.set_ylabel('Frequency, ' + r'$P(x)$', size=15)
ax.legend(loc='best', frameon=False)
I am trying to do the same with a gamma distribution, without succeeding:
from scipy.stats import gamma

y, x = np.histogram(net1['distance'], bins=45, density=True)
xx = np.linspace(0, 45, 1000)
ag, bg, cg = gamma.fit(y)
pdf_gamma = gamma.pdf(xx, ag, bg, cg)

fig, ax = plt.subplots(figsize=(7, 5))
# fitted distribution
plt.plot(xx, pdf_gamma, 'r-', lw=5, alpha=0.6, label='gamma pdf')
plt.plot(xx, pdf_gamma, 'k-')
plt.bar(x[1:], y, label='Data')
ax.set_xlabel('Distance, ' + r'$x [km]$', size=15)
ax.set_ylabel('Frequency, ' + r'$P(x)$', size=15)
ax.legend(loc='best', frameon=False)
Unfortunately scipy.stats.gamma is not well documented.
Suppose you have some "raw" data in the form data = array([a1, a2, a3, ...]); these can be the results of an experiment of yours.
You can give these raw values to the fit method, gamma.fit(data), and it will return three parameters, a, b, c = gamma.fit(data). These are the "shape", the "loc"ation and the "scale" of the gamma curve that best fits the DISTRIBUTION HISTOGRAM of your data (not the actual data).
I noticed from questions online that many people get confused: they have a distribution (histogram) of their data and try to fit that with gamma.fit. This is wrong.
The method gamma.fit expects your raw data, not the distribution of your data.
This will presumably solve the problem for a few of us.
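A minimal sketch of the distinction, using made-up gamma samples (the shape and scale values below are purely illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

data = gamma.rvs(2.0, loc=0, scale=8.5, size=5000)  # made-up "raw" data

# correct: fit the raw samples, not a histogram of them
a, loc, scale = gamma.fit(data)

# overlay the fitted pdf on the normalized histogram of the data
xx = np.linspace(0, data.max(), 1000)
plt.hist(data, bins=45, density=True, alpha=0.5, label='data histogram')
plt.plot(xx, gamma.pdf(xx, a, loc, scale), 'r-', label='fitted gamma pdf')
plt.legend()
plt.show()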
My guess is that you have much of the original data at 0, so the alpha of the fit ends up lower than 1 (0.34) and you get the decreasing shape with singularity at 0. The bar plot does not include the zero (x[1:]) so you don't see the huge bar on the left.
Could I be right?
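For what it's worth, a quick way to see the effect (the scale here just reuses the illustrative 8.5 from the question): a gamma pdf with shape below 1 is monotonically decreasing and blows up at 0, unlike the humped shape you get for shape above 1.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

xx = np.linspace(0.01, 45, 1000)
plt.plot(xx, gamma.pdf(xx, 0.34, loc=0, scale=8.5), label='shape = 0.34')
plt.plot(xx, gamma.pdf(xx, 2.0, loc=0, scale=8.5), label='shape = 2.0')
plt.legend()
plt.show()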
I have seen several questions on Stack Overflow regarding how to fit a log-normal distribution. Still, there are two clarifications that I need.
I have sample data, the logarithm of which follows a normal distribution, so I can fit the data using scipy.stats.lognorm.fit (i.e. a log-normal distribution).
The fit is working fine, and also gives me the standard deviation. Here is my piece of code with the results.
import numpy as np
from scipy import stats

sample = np.log10(data)  # data is my raw sample; taking the log10 of it
scatter, loc, mean = stats.lognorm.fit(sample)  # gives the parameters of the fit

x_fit = np.linspace(13.0, 15.0, 100)
pdf_fitted = stats.lognorm.pdf(x_fit, scatter, loc, mean)  # gives the PDF

print("scatter for data is %s" % scatter)
print("mean of data is %s" % mean)
THE RESULT
scatter for data is 0.186415047243
mean of data is 1.15731050926
From the image you can clearly see that the mean is around 14.2, but what I get is 1.15?! Why is this so? Clearly log(mean) is not near 14.2 either!
In THIS POST and in THIS QUESTION it is mentioned that the log(mean) is the actual mean.
But you can see from my code above that the fit I obtained uses sample = log(data), and it also seems to fit well. However, when I tried
sample = data
pdf_fitted = stats.lognorm.pdf(x_fit,scatter,loc,np.log10(mean))
The fit does not seem to work.
1) Why is the mean not 14.2?
2) How do I draw/fill vertical lines showing the 1-sigma confidence region?
You say
I have sample data, the logarithm of which follows a normal distribution.
Suppose data is the array containing the samples. To fit this data to
a log-normal distribution using scipy.stats.lognorm, use:
s, loc, scale = stats.lognorm.fit(data, floc=0)
Now suppose mu and sigma are the mean and standard deviation of the
underlying normal distribution. To get the estimate of those values
from this fit, use:
estimated_mu = np.log(scale)
estimated_sigma = s
(These are not the estimates of the mean and standard deviation of
the samples in data. See the wikipedia page for the formulas
for the mean and variance of a log-normal distribution in terms of mu and sigma.)
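For example, the mean and standard deviation of the log-normal variable itself (rather than of the underlying normal) follow from those formulas as:

lognormal_mean = np.exp(estimated_mu + estimated_sigma**2 / 2)
lognormal_std = lognormal_mean * np.sqrt(np.exp(estimated_sigma**2) - 1)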
To combine the histogram and the PDF, you can use, for example,
import matplotlib.pyplot as plt
plt.hist(data, bins=50, density=True, color='c', alpha=0.75)
xmin = data.min()
xmax = data.max()
x = np.linspace(xmin, xmax, 100)
pdf = stats.lognorm.pdf(x, s, scale=scale)
plt.plot(x, pdf, 'k')
If you want to see the log of the data, you could do something like
the following. Note that the PDF of the normal distribution is used
here.
logdata = np.log(data)
plt.hist(logdata, bins=40, density=True, color='c', alpha=0.75)
xmin = logdata.min()
xmax = logdata.max()
x = np.linspace(xmin, xmax, 100)
pdf = stats.norm.pdf(x, loc=estimated_mu, scale=estimated_sigma)
plt.plot(x, pdf, 'k')
By the way, an alternative to fitting with stats.lognorm is to fit log(data)
using stats.norm.fit:
logdata = np.log(data)
estimated_mu, estimated_sigma = stats.norm.fit(logdata)
Related questions:
Fitting lognormal distribution using Scipy vs Matlab
Lognormal Random Numbers Centered around a high value
If I have plotted a histogram in Python using matplotlib, how can I easily extract the functional form of the histogram, or, I suppose, the function of the best-fit curve to the histogram? I'm not sure how to plot this best-fit curve. Any help is appreciated, thanks.
The shape of my histogram is like an inverted Lennard-Jones potential.
I'm just going to answer both parts to be thorough. These are two separate problems: fitting a function to your histogram data and then plotting that function. First of all, scipy has an optimization module that you can use to fit your function. Among its functions, curve_fit is probably the easiest to use.
To give an example,
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
# Model function
def f(x, a, b):
    return a * x + b
# Example data
x = np.linspace(0, 10, 20)
y = f(x, 0.2, 3.4) + 0.2 * np.random.normal(size=len(x))
# Do the fit
popt, pcov = curve_fit(f, x, y, [1.0, 1.0])
From curve_fit you get the optimized parameters a, b of your function f and their covariance matrix. You can also pass measurement errors as statistical weights via the sigma argument.
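If you have per-point uncertainties, it could look like this (y_err is a hypothetical array of errors; sigma and absolute_sigma are regular curve_fit arguments):

y_err = np.full_like(y, 0.2)  # hypothetical measurement errors, one per data point
popt, pcov = curve_fit(f, x, y, p0=[1.0, 1.0], sigma=y_err, absolute_sigma=True)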
Now you can plot the data and the fitted curve. It usually makes sense to use a higher resolution in x for the curve.
# Plot data
plt.plot(x, y, 'o')
# Plot fit curve
fit_x = np.linspace(0, 10, 200)
plt.plot(fit_x, f(fit_x, *popt))
plt.show()
I haven't specifically dealt with a histogram or a Lennard-Jones potential here, in order to limit the complexity of the code and focus on the part you asked about. But this example can be adapted to any kind of least-squares optimization problem.
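To connect it back to your histogram, here is a minimal sketch with made-up data and a Gaussian stand-in for the model function (you would swap in a function whose shape actually matches your inverted-Lennard-Jones-like histogram): fit the model to the bin centers and bin heights returned by np.histogram.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def gauss(x, A, mu, sigma):
    return A * np.exp(-(x - mu)**2 / (2 * sigma**2))

data = np.random.normal(5.0, 1.5, size=2000)        # made-up sample
heights, edges = np.histogram(data, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])             # bin centers as x values

popt, pcov = curve_fit(gauss, centers, heights,
                       p0=[heights.max(), centers[heights.argmax()], 1.0])

plt.bar(centers, heights, width=edges[1] - edges[0], alpha=0.5)
fit_x = np.linspace(edges[0], edges[-1], 300)
plt.plot(fit_x, gauss(fit_x, *popt), 'k-')
plt.show()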