Would somebody be able to explain to me how to use the location parameter with the gamma.fit function in Scipy?
It seems to me that a location parameter (μ) shifts the support of the distribution from x ≥ 0 to y = (x − μ) ≥ 0. If μ is positive, then aren't we losing all the data that don't satisfy x − μ ≥ 0?
Thanks!
The fit function takes all of the data into consideration when finding a fit. Adding noise to your data will alter the fit parameters and can give a distribution that does not represent the data very well. So we have to be a bit clever when we are using fit.
Below is some code that uses numpy to generate data, y1, with loc=2 and scale=1. It also adds noise over the range 0 to 10 to create y2. Fitting y1 yields excellent results, but attempting to fit the noisy y2 is problematic: the added noise smears out the distribution. However, we can also hold one or more parameters constant when fitting. In this case we pass floc=2 to fit, which forces the location to be held at 2 during the fit and returns much better results.
from scipy.stats import gamma
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0,10,.1)
y1 = np.random.gamma(shape=1, scale=1, size=1000) + 2 # sets loc = 2
y2 = np.hstack((y1, 10*np.random.rand(100))) # add noise from 0 to 10
# fit the distributions, get the PDF distribution using the parameters
shape1, loc1, scale1 = gamma.fit(y1)
g1 = gamma.pdf(x=x, a=shape1, loc=loc1, scale=scale1)
shape2, loc2, scale2 = gamma.fit(y2)
g2 = gamma.pdf(x=x, a=shape2, loc=loc2, scale=scale2)
# again fit the distribution, but force loc=2
shape3, loc3, scale3 = gamma.fit(y2, floc=2)
g3 = gamma.pdf(x=x, a=shape3, loc=loc3, scale=scale3)
And make some plots...
# plot the distributions and fits. too lazy to do iteration today
fig, axes = plt.subplots(1, 3, figsize=(13,4))
ax = axes[0]
ax.hist(y1, bins=40, density=True)
ax.plot(x, g1, 'r-', linewidth=6, alpha=.6)
ax.annotate('shape = %.3f\nloc = %.3f\nscale = %.3f' % (shape1, loc1, scale1), xy=(6, .2))
ax.set_title('gamma fit')
ax = axes[1]
ax.hist(y2, bins=40, density=True)
ax.plot(x, g2, 'r-', linewidth=6, alpha=.6)
ax.annotate('shape = %.3f\nloc = %.3f\nscale = %.3f' % (shape2, loc2, scale2), xy=(6, .2))
ax.set_title('gamma fit with noise')
ax = axes[2]
ax.hist(y2, bins=40, density=True)
ax.plot(x, g3, 'r-', linewidth=6, alpha=.6)
ax.annotate('shape = %.3f\nloc = %.3f\nscale = %.3f' % (shape3, loc3, scale3), xy=(6, .2))
ax.set_title('gamma fit w/ noise, location forced')
I have written code to fit a Gaussian function to a dataset using scipy's curve_fit. There are a few different datasets: one set with 19 points and one with 21 points, each covering x-ranges of roughly 0.5-0.7, 1.0-1.2 and 1.5-1.7.
Surprisingly, when I ran the code on the 19-point datasets, all three fits succeeded, but for the 21-point datasets only the 1.5-1.7 range produced the right fit. All the others gave horribly wrong fits.
Here is the code.
# imports needed to run this snippet
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# x, y are the measured data arrays; gas and z are metadata used in the title

# function declaration
def gauss(x, amp, mu, sigma):
    return amp * np.exp(-(x - mu)**2 / (2 * sigma**2))

# fitting
popt, pcov = curve_fit(f=gauss, xdata=x, ydata=y)
amp, mu, sigma = popt
print(amp, mu, sigma)

# krypton value: the 1/e^2 level of the peak and the x positions where it is reached
krypton_y = amp / math.e**2
krypton_x1 = mu + math.sqrt(-2 * sigma**2 * math.log(krypton_y / amp))
krypton_x2 = mu - math.sqrt(-2 * sigma**2 * math.log(krypton_y / amp))
print(krypton_x1 - krypton_x2)

# horizontal line at the 1/e^2 level
horizontal_x = np.arange(min(x) - 0.01, max(x) + 0.02, 0.01)
y_krypton = np.repeat(krypton_y, len(horizontal_x))

# build fit set (very fine grid)
x_test = np.arange(min(x), max(x), 0.0000001)
y_test = gauss(x_test, *popt)

# vertical line through the mean
vertical_y = np.arange(-20, amp + 20, 0.01)
vertical_mean = np.repeat(mu, len(vertical_y))

# plot data, fit curve, guide lines and annotations
plt.scatter(x, y, label='original data', color='red', marker='x')
plt.plot(x_test, y_test, label='Gaussian fit curve')
plt.plot(horizontal_x, y_krypton, color='#830000', linewidth=1)
plt.plot(vertical_mean, vertical_y, color='#0011ed')
plt.xlabel('Distance in mm')
plt.ylabel('Current in nA')
plt.title('Intensity Profile for ' + gas + ' laser | Z = ' + str(z) + 'cm')
plt.scatter(mu, amp, s=25, color='#0011ed')
plt.scatter(krypton_x1, krypton_y, s=25, color='#830000')
plt.scatter(krypton_x2, krypton_y, s=25, color='#830000')
plt.annotate('({:.4f},{:.4f})'.format(mu, amp), (mu, amp), xytext=(mu + 0.002, amp + 0.5))
plt.annotate('({:.4f},{:.4f})'.format(krypton_x1, krypton_y), (krypton_x1, krypton_y), xytext=(krypton_x1 + 0.002, krypton_y + 0.5))
plt.annotate('({:.4f},{:.4f})'.format(krypton_x2, krypton_y), (krypton_x2, krypton_y), xytext=(krypton_x2 + 0.002, krypton_y + 0.5))
plt.legend()
plt.margins(0)
plt.show()
I am also adding two images, the correct fit and the wrong fit.
To make the difficulty clear, we will use an elementary regression method.
We see that the fitting involves ln(y), which is infinite at the points k < 6 and k > 16, so those points cannot be used in the numerical computation. The point k = 16 is also unreliable because the small value y = 0.001 is not accurate enough (only one significant digit). So we use only the points from k = 6 to k = 15 in the calculation that follows.
This shows that the non-significant points have to be eliminated. Of course, more sophisticated iterative methods implemented in nonlinear regression packages give better fits, according to the particular fitting criteria specified in the software.
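As a minimal sketch of this linearization (synthetic data stands in for the measured points, which are not shown here): for a Gaussian, ln y = a x² + b x + c, so an ordinary polynomial fit on the significant points recovers the parameters directly.

import numpy as np

# synthetic data standing in for the question's measurements (an assumption)
rng = np.random.default_rng(0)
x = np.linspace(0.5, 0.7, 21)
y = 5.0 * np.exp(-(x - 0.6)**2 / (2 * 0.03**2)) + rng.normal(0, 0.01, x.size)

mask = y > 0.05 * y.max()            # drop non-significant points
a, b, c = np.polyfit(x[mask], np.log(y[mask]), 2)

sigma = np.sqrt(-1.0 / (2.0 * a))    # from ln y = a x^2 + b x + c
mu = -b / (2.0 * a)
amp = np.exp(c - b**2 / (4.0 * a))   # since ln(amp) = c - b^2/(4a)
print(amp, mu, sigma)

The recovered values also make good p0 starting guesses for curve_fit, which is often all a stubborn dataset needs.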
I am using gaussian_kde from scipy.stats to fit a joint PDF from a multivariate data on, let's say, X and Y.
Now I want to resample from this PDF conditionally on a value of X. For example, once my X=x, generate Y from its conditional distribution.
Let's use the example from the documentation here. kernel.resample(1) would generate a pair of (X,Y) over all of the distribution. How could I generate Y once X is, for example, 0?
An approach could be to create a custom continuous distribution from a pdf.
The pdf can be created from the kernel function. As the pdf must integrate to 1, the kernel restricted to a given x0 should be rescaled by its area.
The custom distribution seems to be quite slow though. A faster solution could be to create a histogram from ys = np.linspace(-10, 10, 1000); kernel(np.vstack([np.full_like(ys, x0), ys])) and use rv_histogram. Still faster (but much less random) would be to use np.random.choice(..., p=...) with p calculated from the constrained kernel.
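A rough sketch of those faster alternatives, using a small stand-in kde (the real kernel and x0 appear in the code below):

import numpy as np
from scipy import stats

# stand-in 2D kde; in practice use the kernel built from your data
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(size=500), rng.normal(size=500)])
kernel = stats.gaussian_kde(data)
x0 = 0.678

# evaluate the kde along the slice x = x0
ys = np.linspace(-10, 10, 1000)
pdf_vals = kernel(np.vstack([np.full_like(ys, x0), ys]))

# rv_histogram takes (heights, bin_edges) with len(edges) = len(heights) + 1
hist_dist = stats.rv_histogram((pdf_vals[:-1], ys))
samples = hist_dist.rvs(size=500)

# still faster but discrete: draw grid points with probability ~ pdf
samples2 = np.random.choice(ys, size=500, p=pdf_vals / pdf_vals.sum())

Note that np.random.choice only ever returns grid points, which is why it is "much less random" than the other two approaches.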
The following code starts from an adaptation of the linked example code for a 2D kde.
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np
def measure(n):
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1 + m2, m1 - m2 ** 2
m1, m2 = measure(2000)
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)
x0 = 0.678
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))
ax1.imshow(np.rot90(Z), cmap=plt.cm.magma_r, alpha=0.4, extent=[xmin, xmax, ymin, ymax])
ax1.plot(m1, m2, 'k.', markersize=2)
ax1.axvline(x0, color='dodgerblue', ls=':')
ax1.set_xlim([xmin, xmax])
ax1.set_ylim([ymin, ymax])
# create a distribution given the kernel function limited to x=x0
class Special_distrib(stats.rv_continuous):
    def _pdf(self, y, x0, area_x0):
        return kernel(np.vstack([np.full_like(y, x0), y])) / area_x0
ys = np.linspace(-10, 10, 1000)
area_x0 = np.trapz(kernel(np.vstack([np.full_like(ys, x0), ys])), ys)
special_distr = Special_distrib(name="special")
vals = special_distr.rvs(x0, area_x0, size=500)
ax2.hist(vals, bins=20, color='dodgerblue')
plt.show()
I'm trying to fit a second order polynomial to raw data and output the results using Matplotlib. There are about a million points in the data set that I'm trying to fit. It is supposed to be simple, with many examples available around the web. However for some reason I cannot get it right.
I get the following warning message:
RankWarning: Polyfit may be poorly conditioned
This is my output:
This is output using Excel:
See below for my code. What am I missing?
import numpy as np
import matplotlib.pyplot as plt

# df is a DataFrame holding the raw data (about a million rows)
xData = df['X']
yData = df['Y']
xTitle = 'X'
yTitle = 'Y'
title = ''
minX = 100
maxX = 300
minY = 500
maxY = 2200
title_font = {'fontname':'Arial', 'size':'30', 'color':'black', 'weight':'normal',
'verticalalignment':'bottom'} # Bottom vertical alignment for more space
axis_font = {'fontname':'Arial', 'size':'18'}
#Poly fit
# calculate polynomial
z = np.polyfit(xData, yData, 2)
f = np.poly1d(z)
print(f)
# calculate new x's and y's
x_new = xData
y_new = f(x_new)
#Plot
plt.scatter(xData, yData,c='#002776',edgecolors='none')
plt.plot(x_new,y_new,c='#C60C30')
plt.ylim([minY,maxY])
plt.xlim([minX,maxX])
plt.xlabel(xTitle,**axis_font)
plt.ylabel(yTitle,**axis_font)
plt.title(title,**title_font)
plt.show()
The array to plot must be sorted. Here is a comparison between plotting a sorted and an unsorted array. The plot in the unsorted case looks completely distorted; however, the fitted function is of course the same. Both calls print the same fitted polynomial:
-3.496 x² + 2.18 x + 17.26
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
x = (np.random.normal(size=300)+1)
fo = lambda x: -3*x**2+ 1.*x +20.
f = lambda x: fo(x) + (np.random.normal(size=len(x))-0.5)*4
y = f(x)
fig, (ax, ax2) = plt.subplots(1,2, figsize=(6,3))
ax.scatter(x,y)
ax2.scatter(x,y)
def fit(ax, x, y, sort=True):
    z = np.polyfit(x, y, 2)
    fit = np.poly1d(z)
    print(fit)
    ax.set_title("unsorted")
    if sort:
        x = np.sort(x)
        ax.set_title("sorted")
    ax.plot(x, fo(x), label="original func", color="k", alpha=0.6)
    ax.plot(x, fit(x), label="fit func", color="C3", alpha=1, lw=2.5)
    ax.legend()
fit(ax, x,y, sort=False)
fit(ax2, x,y, sort=True)
plt.show()
The problem is probably using a power basis for data that is displaced some distance from zero along the x axis. If you use the Polynomial class from numpy.polynomial it will scale and shift the data before the fit, which will help, and also keep track of the scale and shift used. Note that if you want the coefficients in the normal form you will need to convert to that form.
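A minimal sketch of that approach, with made-up data in the question's 100-300 x-range (the real df['X'] and df['Y'] are not shown):

import numpy as np
from numpy.polynomial import Polynomial

# made-up data far from x = 0, mimicking the question's range
rng = np.random.default_rng(0)
x = np.linspace(100, 300, 1000)
y = 0.01 * x**2 - 2.0 * x + 700 + rng.normal(scale=5, size=x.size)

p = Polynomial.fit(x, y, 2)  # shifts/scales the data internally
print(p.convert())           # coefficients back in the plain power basis

The convert() call is needed because fit() keeps its coefficients in the internal scaled domain.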
I have a range of particle size distribution data arranged by percentage volume fraction, like so:

size     %
6.68     0.05
9.92     1.15
etc.
I need to fit this data to a lognormal distribution, which I planned to do using python's stats.lognorm.fit function, but this seems to expect the input as an array of variates rather than binned data, judging by what I've read.
I was planning to use a for loop to iterate through the data and .extend each size entry to a placeholder array the required number of times to create an array with a list of variates that corresponds to the binned data.
This seems really ugly and inefficient though, and the kind of thing that there's probably an easy way to do. Is there a way to input binned data into the stats.lognorm.fit function?
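For reference, the expansion the question describes can be done without a loop; a hedged sketch with made-up bin values, scaling the percentages to integer counts:

import numpy as np
from scipy import stats

sizes = np.array([6.68, 9.92, 14.10])    # hypothetical bin midpoints
percents = np.array([0.05, 1.15, 2.40])  # hypothetical volume fractions

counts = np.round(percents * 100).astype(int)  # whole pseudo-samples per bin
variates = np.repeat(sizes, counts)            # expand bins to variates
shape, loc, scale = stats.lognorm.fit(variates, floc=0)
print(shape, loc, scale)

Rounding the percentages to counts loses some information, which is one reason the curve_fit approach below may be preferable.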
I guess one possible workaround is to manually fit a pdf to your bin data, assuming the x values are the midpoints of each interval and the y values are the corresponding bin frequencies, and then fit a curve based on those x and y values using scipy.optimize.curve_fit. I think the accuracy of the results will depend on the number of bins you have. An example is shown below:
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
def pdf(x, mu, sigma):
    """pdf of lognormal distribution"""
    return np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2)) / (x * sigma * np.sqrt(2 * np.pi))

mu, sigma = 3., 1.  # actual parameter values
data = np.random.lognormal(mu, sigma, size=1000)  # data generation
h = plt.hist(data, bins=30, density=True)
y = h[0]  # frequencies for each bin, this is the y value to fit
xs = h[1]  # boundaries for each bin
delta = xs[1] - xs[0]  # width of bins
x = xs[:-1] + delta / 2  # midpoints of bins, this is the x value to fit
popt, pcov = curve_fit(pdf, x, y, p0=[1, 1]) # data fitting, popt contains the fitted parameters
print(popt)
# [ 3.13048122 1.01360758] fitting results
fig, ax = plt.subplots()
ax.hist(data, bins=30, density=True, align='mid', label='Histogram')
xr = np.linspace(min(xs), max(xs), 10000)
yr = pdf(xr, mu, sigma)
yf = pdf(xr, *popt)
ax.plot(xr, yr, label="Actual")
ax.plot(xr, yf, linestyle = 'dashed', label="Fitted")
ax.legend()
I have two variables that I have plotted using matplotlib scatter function.
I would like to show the 68% confidence region by highlighting it in the plot. I know how to show it in a histogram, but I don't know how to do it for a 2D plot like this (x vs y). In my case, the x is Mass and y is Ngal Mstar+2.
An example image of what I am looking for looks like this:
Here they have showed the 68% confidence region using dark blue and 95% confidence region using light blue.
Can it be achieved using one of the scipy.stats modules?
To plot a region between two curves, you could use pyplot.fill_between().
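As a minimal sketch of fill_between on its own, with made-up curves:

import numpy as np
import matplotlib.pyplot as plt

# made-up center curve and band half-width
x = np.linspace(0, 10, 200)
center = np.sin(x)
half_width = 0.5

# shade the region between the lower and upper curves
plt.fill_between(x, center - half_width, center + half_width, color='blue', alpha=0.3)
plt.plot(x, center, 'k-')
plt.show()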
As for your confidence region, I was not sure what you wanted to achieve, so I exemplified with simultaneous confidence bands, by modifying the code from:
https://en.wikipedia.org/wiki/Confidence_and_prediction_bands#cite_note-2
import numpy as np
import matplotlib.pyplot as plt
import scipy.special as sp
## Sample size.
n = 50
## Predictor values.
XV = np.random.uniform(low=-4, high=4, size=n)
XV.sort()
## Design matrix.
X = np.ones((n,2))
X[:,1] = XV
## True coefficients.
beta = np.array([0, 1.], dtype=np.float64)
## True response values.
EY = np.dot(X, beta)
## Observed response values.
Y = EY + np.random.normal(size=n)*np.sqrt(20)
## Get the coefficient estimates.
u,s,vt = np.linalg.svd(X,0)
v = np.transpose(vt)
bhat = np.dot(v, np.dot(np.transpose(u), Y)/s)
## The fitted values.
Yhat = np.dot(X, bhat)
## The MSE and RMSE.
MSE = ((Y-EY)**2).sum()/(n-X.shape[1])
s = np.sqrt(MSE)
## These multipliers are used in constructing the intervals.
XtX = np.dot(np.transpose(X), X)
V = [np.dot(X[i,:], np.linalg.solve(XtX, X[i,:])) for i in range(n)]
V = np.array(V)
## The F quantile used in constructing the Scheffe interval.
QF = sp.fdtri(X.shape[1], n-X.shape[1], 0.95)
QF_2 = sp.fdtri(X.shape[1], n-X.shape[1], 0.68)
## The lower and upper bounds of the Scheffe band.
D = s*np.sqrt(X.shape[1]*QF*V)
LB,UB = Yhat-D,Yhat+D
D_2 = s*np.sqrt(X.shape[1]*QF_2*V)
LB_2,UB_2 = Yhat-D_2,Yhat+D_2
## Make the plot.
plt.clf()
plt.plot(XV, Y, 'o', ms=3, color='grey')
plt.plot(XV, EY, '-', color='black', zorder=4)
plt.fill_between(XV, LB_2, UB_2, where=UB_2 >= LB_2, facecolor='blue', alpha=0.3, zorder=0)
plt.plot(XV, LB_2, '-', color='blue', zorder=1)
plt.plot(XV, UB_2, '-', color='blue', zorder=1)
plt.fill_between(XV, LB, UB, where=UB >= LB, facecolor='blue', alpha=0.3, zorder=2)
plt.plot(XV, LB, '-', color='blue', zorder=3)
plt.plot(XV, UB, '-', color='blue', zorder=3)
plt.plot(XV, Yhat, '-', color='red', zorder=4)
plt.ylim([-8,8])
plt.xlim([-4,4])
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
The output looks like this:
First of all, thank you @snake_charmer for your answer, but I have found a simpler way of solving the issue, using curve_fit from scipy.optimize.
I fit my data sample using curve_fit, which gives me my best-fit parameters. It also gives me the estimated covariance of those parameters; the diagonal of this matrix provides the variances of the parameter estimates. To compute one-standard-deviation errors on the parameters we can use np.sqrt(np.diag(pcov)), where pcov is the covariance matrix.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# logx, logn and logy below are the data in log space (not shown here)

def fitfunc(M, p1, p2):
    N = p1 + M * p2
    return N
The above is the fit function I use for the data.
Now, to fit the data using curve_fit:
popt_1,pcov_1 = curve_fit(fitfunc,logx,logn,p0=(10.0,1.0),maxfev=2000)
p1_1 = popt_1[0]
p1_2 = popt_1[1]
sigma1 = [np.sqrt(pcov_1[0,0]),np.sqrt(pcov_1[1,1])] #THE 1 SIGMA CONFIDENCE INTERVALS
residuals1 = (logy) - fitfunc((logx),p1_1,p1_2)
xi_sq_1 = sum(residuals1**2) #THE CHI-SQUARE OF THE FIT
curve_y_1 = fitfunc((logx),p1_1,p1_2)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(logx,logy,c='r',label='$0.0<z<0.5$')
ax1.plot(logx,curve_y_1,'y')
ax1.plot(logx,fitfunc(logx,p1_1+sigma1[0],p1_2+sigma1[1]),'m',label='68% conf limits')
ax1.plot(logx,fitfunc(logx,p1_1-sigma1[0],p1_2-sigma1[1]),'m')
So just by using the square roots of the diagonal elements of the covariance matrix, I can obtain the 1-sigma confidence lines.