Fitting binned lognormal data in Python

I have a range of particle size distribution data arranged by percentage volume fraction, like so:
size %
6.68 0.05
9.92 1.15
etc.
I need to fit this data to a lognormal distribution, which I planned to do using SciPy's stats.lognorm.fit function, but this seems to expect the input as an array of variates rather than binned data, judging by what I've read.
I was planning to use a for loop to iterate through the data and .extend each size entry to a placeholder array the required number of times to create an array with a list of variates that corresponds to the binned data.
This seems really ugly and inefficient though, and the kind of thing that there's probably an easy way to do. Is there a way to input binned data into the stats.lognorm.fit function?
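One way to avoid the explicit loop: np.repeat expands the bin midpoints by integer counts in a single vectorized call, after which the expanded array can be passed to stats.lognorm.fit. A minimal sketch, assuming the percentages are first scaled and rounded to integer pseudo-counts (the sizes and percents values below are placeholders, not your real data):
import numpy as np
from scipy import stats

sizes = np.array([6.68, 9.92, 13.5])    # bin midpoints (placeholder values)
percents = np.array([0.05, 1.15, 3.2])  # % volume fraction per bin (placeholder)

# scale percentages to integer pseudo-counts; the factor trades off resolution
counts = np.round(percents * 100).astype(int)

variates = np.repeat(sizes, counts)     # vectorized version of the .extend() loop

# floc=0 pins the location parameter at zero, the usual choice for particle sizes
shape, loc, scale = stats.lognorm.fit(variates, floc=0)
mu, sigma = np.log(scale), shape        # scipy's (s, scale) correspond to (sigma, exp(mu))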

I guess one possible workaround is to manually fit a pdf to your binned data, assuming the x values are the midpoints of each interval and the y values are the corresponding bin frequencies, and then fit a curve to those x and y values using scipy.optimize.curve_fit. I think the accuracy of the results will depend on the number of bins you have. An example is shown below:
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
def pdf(x, mu, sigma):
    """pdf of the lognormal distribution"""
    return (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))
            / (x * sigma * np.sqrt(2 * np.pi)))
mu, sigma = 3., 1. # actual parameter values
data = np.random.lognormal(mu, sigma, size=1000) # data generation
h = plt.hist(data, bins=30, density=True) # 'normed' was removed in newer matplotlib
y = h[0] # density for each bin, this is y value to fit
xs = h[1] # boundaries for each bin
delta = xs[1] - xs[0] # width of bins
x = xs[:-1] + delta / 2 # midpoints of bins, this is x value to fit
popt, pcov = curve_fit(pdf, x, y, p0=[1, 1]) # data fitting, popt contains the fitted parameters
print(popt)
# [ 3.13048122 1.01360758] fitting results
fig, ax = plt.subplots()
ax.hist(data, bins=30, density=True, align='mid', label='Histogram')
xr = np.linspace(min(xs), max(xs), 10000)
yr = pdf(xr, mu, sigma)
yf = pdf(xr, *popt)
ax.plot(xr, yr, label="Actual")
ax.plot(xr, yf, linestyle = 'dashed', label="Fitted")
ax.legend()
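Note that the fitted (mu, sigma) map directly onto scipy's lognorm parametrization, where the shape s is sigma and the scale is exp(mu). A short sketch turning the curve_fit result above into a frozen distribution:
from scipy import stats

mu_fit, sigma_fit = popt
dist = stats.lognorm(s=sigma_fit, scale=np.exp(mu_fit))

print(dist.median())              # equals exp(mu), the geometric mean
print(dist.ppf([0.1, 0.5, 0.9]))  # percentiles of the fitted distribution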

Related

Clarification of log-normal distribution using Python

What's the meaning of count, bins, ignored in the code below, which I found on the NumPy website (https://numpy.org/doc/stable/reference/random/generated/numpy.random.lognormal.html)?
import numpy as np
import matplotlib.pyplot as plt
mu, sigma = 0.2, 0.5 # mean and standard deviation
s = np.random.lognormal(mu, sigma, 1000)
count, bins, ignored = plt.hist(s, 100, density=True, align='mid')
x = np.linspace(min(bins), max(bins), 10000)
pdf = (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))/ (x * sigma * np.sqrt(2 * np.pi)))
plt.plot(x, pdf, linewidth=2, color='r')
plt.axis('tight')
plt.show()
count holds the density value for each bin (100 values in this example). bins contains the bin edges (101 in this example); each pair of consecutive edges [i, i+1] bounds one bin. ignored is not important for the purpose of that plot; according to the documentation of plt.hist, it is a "Container of individual artists used to create the histogram or list of such containers if there are multiple input datasets".
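A quick sketch to verify those shapes and the density normalization (same distribution parameters as above):
import numpy as np
import matplotlib.pyplot as plt

s = np.random.lognormal(0.2, 0.5, 1000)
count, bins, ignored = plt.hist(s, 100, density=True)

print(len(count), len(bins))   # 100 bin heights, 101 bin edges
widths = np.diff(bins)         # bin widths from consecutive edges
print(np.sum(count * widths))  # ~1.0: with density=True the bar areas sum to one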

Gaussian fitted curve showing a tail that does not go back to base-level

Based upon existing topics on Stack Overflow, I have managed to fit a Gaussian curve to my dataset. However, the fitted Gaussian shows one tail that does not go back to base-level (i.e., in the example below, the right tail suddenly stops at a higher y-value than the left tail). This surprises me, as by definition a Gaussian should be a perfectly symmetrical bell-shaped curve. How can I generate a Gaussian curve whose tails are equally long (i.e., stop at the same width measured from the plume center-line) and end at the same base-level (i.e., the same y-value)? The reason I would like this is that in my data a second peak sometimes starts to rise before the first peak has returned to base-level. I would like to separate these peaks by fitting a Gaussian that goes back to base-level, as theoretically each peak should return to its base-level. Thanks a lot in advance!
import numpy as np
from lmfit import Model
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
x = np.array([-20.0,-17.0,-14.0,-11.0,-8.0,-5.0,-2.0,1.0,4.0,7.0,10.0,13.0,16.0,19.0,22.0,25.0,28.0,31.0,34.0,37.0,40.0,43.0,46.0,49.0,52.0,55.0,58.0,61.0,64.0,67.0,70.0,73.0,76.0,79.0,82.0])
y = np.array([1.90269,1.93535,2.62402,3.08949,2.82409,3.07588,3.22015,3.18884,5.14053,10.5111,18.6118,28.6343,37.7625,46.3641,53.9163,60.7622,66.5765,71.0596,74.4948,77.7177,80.373,82.5833,83.9021,83.4652,79.0229,71.4679,61.93,52.113,43.8517,36.211,29.3815,23.8966,19.31,15.5209,12.4532])
def gaussian(x, amp, cen, wid):
    return (amp / (np.sqrt(2*np.pi) * wid)) * np.exp(-(x-cen)**2 / (2*wid**2))
def line(x, slope, intercept):
    return slope*x + intercept
peak_index = find_peaks(y,height=27.6)[0][0]
mean = sum(x*y)/np.sum(y) #weighted arithmetic mean
mod = Model(gaussian) + Model(line)
pars = mod.make_params(amp=max(y), cen=x[peak_index],
                       wid=np.sqrt(sum((x-mean)**2 * y)/sum(y)),
                       slope=0, intercept=1)
result = mod.fit(y, pars, x=x)
comps = result.eval_components()
plt.plot(x, y, 'bo')
plt.plot(x, comps['gaussian'], 'k--')
Edit: The following example hopefully illustrates why I am interested in this. I have a long dataset in which the signals of different sources are measured. The dataset is processed to generate the arrays x_measured and y_measured, which contain the measured values belonging to one source. My program automatically detects the plume that occurs within the measured values and stores its values in arrays called x and y, to which I fit a Gaussian.
However, sometimes the measured values show two overlapping plumes, so no single measured plume rises from and returns to base-level. An example is given in the code below. For these measured values, my program now produces a Gaussian fit whose right tail goes to around y=0, but whose left tail stops around y=4.5. Theoretically I know that each plume should start from and return to the same base-level, and I want to compute the plume-width of such a Gaussian plume. Because the left tail here does not go back to around y=0, I cannot determine the width of the plume. I would like a Gaussian fit of which both tails return to the same base-level of y=0, so that I can determine the plume width.
x_measured = np.arange(-20,245,3)
y_measured = np.array([38.7586,38.2323,37.2958,35.9924,34.4196,32.7123,31.0257,29.5169,28.3244,27.5502,27.2458,27.4078,27.9815,28.8728,29.9643,31.1313,32.2545,33.2276,33.9594,34.373,34.4041,34.0009,33.1267,31.7649,29.9247,27.6458,24.9992,22.0845,19.0215,15.9397,12.966,10.2127,7.76834,5.69046,4.00296,2.69719,1.73733,1.06907,0.629744,0.358021,0.201123,0.11878,0.0839719,0.0813392,0.104295,0.151634,0.224209,0.321912,0.441478,0.575581,0.713504,0.843351,0.954777,1.04109,1.09974,1.13118,1.13683,1.11758,1.07369,1.0059,0.917066,0.81321,0.703288,0.597775,0.506678,0.437843,0.396256,0.384633,0.405147,0.461496,0.560387,0.71144,0.925262,1.21022,1.56925,1.99788,2.48458,3.01314,3.56626,4.12898,4.69031,5.24283,5.78014,6.29365,6.77004,7.19071,7.53399,7.78019,7.91889])
x = np.arange(10,104,3)
y = np.array([22.4548,23.4302,25.3389,27.9929,30.486,32.0528,33.5527,35.1304,35.9941,36.8606,37.1889,37.723,36.4069,35.9751,33.8824,31.0909,27.4247,23.3213,18.8772,14.3363,11.1075,7.68792,4.54899,2.2057,0,0,0,0,0,0,0.179834,0])
def gaussian(x, amp, cen, wid):
    return (amp / (np.sqrt(2*np.pi) * wid)) * np.exp(-(x-cen)**2 / (2*wid**2))
def line(x, slope, intercept):
    return slope*x + intercept
peak_index = find_peaks(y,height=27.6)[0][0]
mean = sum(x*y)/np.sum(y) #weighted arithmetic mean
mod = Model(gaussian) + Model(line)
pars = mod.make_params(amp=max(y), cen=x[peak_index],
                       wid=np.sqrt(sum((x-mean)**2 * y)/sum(y)),
                       slope=0, intercept=1)
result = mod.fit(y, pars, x=x)
comps = result.eval_components()
plt.plot(x, y, 'bo')
plt.plot(x, comps['gaussian'], 'k--')
plt.plot(x_measured,y_measured)
It is unclear why you expect a bimodal fit with the model you defined. Use two different Gaussian functions for your fit, then evaluate the fitted functions for a longer interval x_fit to see the curves returning to baseline:
import numpy as np
from lmfit import Model
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
x = np.array([-20.0,-17.0,-14.0,-11.0,-8.0,-5.0,-2.0,1.0,4.0,7.0,10.0,13.0,16.0,19.0,22.0,25.0,28.0,31.0,34.0,37.0,40.0,43.0,46.0,49.0,52.0,55.0,58.0,61.0,64.0,67.0,70.0,73.0,76.0,79.0,82.0])
y = np.array([1.90269,1.93535,2.62402,3.08949,2.82409,3.07588,3.22015,3.18884,5.14053,10.5111,18.6118,28.6343,37.7625,46.3641,53.9163,60.7622,66.5765,71.0596,74.4948,77.7177,80.373,82.5833,83.9021,83.4652,79.0229,71.4679,61.93,52.113,43.8517,36.211,29.3815,23.8966,19.31,15.5209,12.4532])
def gaussian1(x, amp1, cen1, wid1):
    return (amp1 / (np.sqrt(2*np.pi) * wid1)) * np.exp(-(x-cen1)**2 / (2*wid1**2))
def gaussian2(x, amp2, cen2, wid2):
    return (amp2 / (np.sqrt(2*np.pi) * wid2)) * np.exp(-(x-cen2)**2 / (2*wid2**2))
#peak_index = find_peaks(y,height=27.6)[0][0]
#mean = sum(x*y)/np.sum(y) #weighted arithmetic mean
mod = Model(gaussian1) + Model(gaussian2)
#I just filled in some start values, the details of educated guesses can be filled in later by you
pars = mod.make_params(amp1=30, amp2=40, cen1=20, cen2=40, wid1=2, wid2=2)
result = mod.fit(y, pars, x=x)
print(result.params)
x_fit=np.linspace(-30, 120, 500)
comps_elem = result.eval_components(x=x_fit)
comps_comb = result.eval(x=x_fit)
plt.plot(x, y, 'bo')
plt.plot(x_fit, comps_comb, 'k')
plt.plot(x_fit, comps_elem['gaussian1'], 'k-.')
plt.plot(x_fit, comps_elem['gaussian2'], 'k--')
plt.show()
Sample output: (figure omitted)
The corresponding scipy.optimize.curve_fit version would look like this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
x = [-20.0,-17.0,-14.0,-11.0,-8.0,-5.0,-2.0,1.0,4.0,7.0,10.0,13.0,16.0,19.0,22.0,25.0,28.0,31.0,34.0,37.0,40.0,43.0,46.0,49.0,52.0,55.0,58.0,61.0,64.0,67.0,70.0,73.0,76.0,79.0,82.0]
y = [1.90269,1.93535,2.62402,3.08949,2.82409,3.07588,3.22015,3.18884,5.14053,10.5111,18.6118,28.6343,37.7625,46.3641,53.9163,60.7622,66.5765,71.0596,74.4948,77.7177,80.373,82.5833,83.9021,83.4652,79.0229,71.4679,61.93,52.113,43.8517,36.211,29.3815,23.8966,19.31,15.5209,12.4532]
def gauss(x, mu, sigma, A):
    return A*np.exp(-(x-mu)**2/2/sigma**2)
def bimodal(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return gauss(x, mu1, sigma1, A1) + gauss(x, mu2, sigma2, A2)
expected = (20, 2, 30, 40, 2, 40)
params, cov = curve_fit(bimodal, x, y, expected)
sigma=np.sqrt(np.diag(cov))
x_fit = np.linspace(-20, 120, 500)
plt.plot(x_fit, bimodal(x_fit, *params), color='red', lw=3, label='model')
plt.plot(x_fit, gauss(x_fit, *params[:3]), color='red', lw=1, ls="--", label='distribution 1')
plt.plot(x_fit, gauss(x_fit, *params[3:]), color='red', lw=1, ls=":", label='distribution 2')
plt.scatter(x, y, marker="X", color="black", label="original data")
plt.legend()
print(pd.DataFrame(data={'params': params, 'sigma': sigma}, index=bimodal.__code__.co_varnames[1:]))
plt.show()
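Since the goal is a plume width, one common measure can be read straight off the fitted parameters: the full width at half maximum of a Gaussian is 2*sqrt(2*ln 2)*sigma, about 2.355*sigma. A short sketch using the params array from the curve_fit call above:
# params holds (mu1, sigma1, A1, mu2, sigma2, A2) from the fit above
fwhm1 = 2 * np.sqrt(2 * np.log(2)) * abs(params[1])  # abs: sigma may fit negative
fwhm2 = 2 * np.sqrt(2 * np.log(2)) * abs(params[4])
print(f"FWHM of peak 1: {fwhm1:.2f}, peak 2: {fwhm2:.2f}")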

How to draw a matching Bell curve over a histogram?

My code so far is below; I'm very new to programming and have been trying for a while.
Here I apply the Box-Muller transform to approximate two Gaussian normal distributions starting from a random uniform sampling. Then, I create a histogram for both of them.
Now, I would like to compare the obtained histograms with "the real thing": a standard Bell curve. How to draw such a curve to match the histograms?
import numpy as np
import matplotlib.pyplot as plt
N = 10000
z1 = np.random.uniform(0, 1.0, N)
z2 = np.random.uniform(0, 1.0, N)
R_sq = -2 * np.log(z1)
theta = 2 * np.pi * z2
z1 = np.sqrt(R_sq) * np.cos(theta)
z2 = np.sqrt(R_sq) * np.sin(theta)
fig = plt.figure()
ax = fig.add_subplot(2, 1, 1)
ax.hist(z1, bins=40, range=(-4, 4), color='red')
plt.title("Histgram")
plt.xlabel("z1")
plt.ylabel("frequency")
ax2 = fig.add_subplot(2, 1, 2)
ax2.hist(z2, bins=40, range=(-4, 4), color='blue')
plt.xlabel("z2")
plt.show()
scipy.stats.gaussian_kde calculates a "kernel density estimate": a smooth function fitted to the data.
To just draw a Gaussian normal curve, there is scipy.stats.norm. Subtracting the mean and dividing by the standard deviation adapts the position of the curve to the given data.
Both curves are drawn such that the area below the curve sums to one. To adjust them to the size of the histogram, they need to be scaled by the length of the data times the bin width. Alternatively, this scaling can stay at 1 and the histogram can be normalized instead by passing hist(..., density=True).
In the demo code the data is deliberately distorted to illustrate the difference between the kde and the Gaussian normal.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.linspace(-4,4,1000)
N = 10000
z1 = np.random.randint(1, 3, N) * np.random.uniform(0, .4, N)
z2 = np.random.uniform(0, 1, N)
R_sq = -2 * np.log(z1)
theta = 2 * np.pi * z2
z1 = np.sqrt(R_sq) * np.cos(theta)
z2 = np.sqrt(R_sq) * np.sin(theta)
fig = plt.figure(figsize=(12,4))
for ind_subplot, zi, col in zip((1, 2), (z1, z2), ('crimson', 'dodgerblue')):
    ax = fig.add_subplot(1, 2, ind_subplot)
    ax.hist(zi, bins=40, range=(-4, 4), color=col, label='histogram')
    ax.set_xlabel("z" + str(ind_subplot))
    ax.set_ylabel("frequency")
    binwidth = 8 / 40
    scale_factor = len(zi) * binwidth
    gaussian_kde_zi = stats.gaussian_kde(zi)  # kde of the current subplot's data
    ax.plot(x, gaussian_kde_zi(x)*scale_factor, color='springgreen', linewidth=3, label='kde')
    std_zi = np.std(zi)
    mean_zi = np.mean(zi)
    # divide by std_zi so the curve is the pdf of N(mean_zi, std_zi) before scaling
    ax.plot(x, stats.norm.pdf((x - mean_zi) / std_zi) / std_zi * scale_factor,
            color='black', linewidth=2, label='normal')
    ax.legend()
plt.show()
The original values for z1 and z2 very much resemble a normal distribution, and so the black line (the Gaussian normal for the data) and the green line (the KDE) very much resemble each other.
The current code first calculates the real mean and the real standard deviation of the data. As you want to mimic a perfect Gaussian normal, you should compare against the curve with mean zero and standard deviation one. You'll see they're almost identical on the plot.
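As a sketch, one extra line inside the loop (before ax.legend()) draws that ideal standard normal for comparison, scaled the same way as the other curves:
    # reference curve for the ideal standard normal (mu=0, sigma=1),
    # scaled like the fitted curves; add inside the loop before ax.legend()
    ax.plot(x, stats.norm.pdf(x) * scale_factor, color='grey',
            linewidth=1, linestyle='--', label='standard normal')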

non-random sampling versions of np.random.normal

I'm trying to generate a single array that follows an exact gaussian distribution. np.random.normal sort of does this by randomly sampling from a gaussian, but how can I reproduce an exact gaussian given some mean and sigma, so that the array produces a histogram that follows an exact gaussian, not just an approximate gaussian as shown below?
import numpy as np
import matplotlib.pyplot as plt
mu, sigma = 10, 1
s = np.random.normal(mu, sigma, 1000)
fig = plt.figure()
ax = plt.axes()
totaln, bbins, patches = ax.hist(s, 10, density=True, histtype='stepfilled', linewidth=1.2)
plt.show()
If you'd like an exact gaussian histogram, don't generate points. You can never get an "exact" gaussian distribution from observed points, simply because you can't have a fraction of a point within a histogram bin.
Instead, plot the curve in the form of a bar graph.
import numpy as np
import matplotlib.pyplot as plt
def gaussian(x, mean, std):
    scale = 1.0 / (std * np.sqrt(2 * np.pi))
    return scale * np.exp(-(x - mean)**2 / (2 * std**2))
mean, std = 2.0, 5.0
nbins = 30
npoints = 1000
x = np.linspace(mean - 3 * std, mean + 3 * std, nbins + 1)
centers = np.vstack([x[:-1], x[1:]]).mean(axis=0)
y = npoints * gaussian(centers, mean, std)
fig, ax = plt.subplots()
ax.bar(x[:-1], y, width=np.diff(x), color='lightblue')
# Optional...
ax.margins(0.05)
ax.set_ylim(bottom=0)
plt.show()
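If an actual array of points is still wanted (e.g., for code that expects samples), a deterministic alternative is to take evenly spaced quantiles through scipy.stats.norm.ppf; the histogram of such a "sample" is as close to Gaussian as the bin resolution allows. A sketch under that assumption:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

mu, sigma = 10, 1
N = 1000

# deterministic "sample": the N evenly spaced quantiles of N(mu, sigma);
# unlike np.random.normal, this always yields the same, maximally regular points
s = stats.norm.ppf((np.arange(N) + 0.5) / N, loc=mu, scale=sigma)

plt.hist(s, bins=10, density=True, histtype='stepfilled', linewidth=1.2)
plt.show()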

Confidence regions of 1sigma for a 2D plot

I have two variables that I have plotted using matplotlib scatter function.
I would like to show the 68% confidence region by highlighting it in the plot. I know to show it in a histogram, but I don't know how to do it for a 2D plot like this (x vs y). In my case, the x is Mass and y is Ngal Mstar+2.
An example image of what I am looking for looks like this:
Here they show the 68% confidence region in dark blue and the 95% confidence region in light blue.
Can it be achieved using one of the scipy.stats modules?
To plot a region between two curves, you could use pyplot.fill_between().
As for your confidence region, I was not sure what you wanted to achieve, so I exemplified with simultaneous confidence bands, by modifying the code from:
https://en.wikipedia.org/wiki/Confidence_and_prediction_bands#cite_note-2
import numpy as np
import matplotlib.pyplot as plt
import scipy.special as sp
## Sample size.
n = 50
## Predictor values.
XV = np.random.uniform(low=-4, high=4, size=n)
XV.sort()
## Design matrix.
X = np.ones((n,2))
X[:,1] = XV
## True coefficients.
beta = np.array([0, 1.], dtype=np.float64)
## True response values.
EY = np.dot(X, beta)
## Observed response values.
Y = EY + np.random.normal(size=n)*np.sqrt(20)
## Get the coefficient estimates.
u,s,vt = np.linalg.svd(X,0)
v = np.transpose(vt)
bhat = np.dot(v, np.dot(np.transpose(u), Y)/s)
## The fitted values.
Yhat = np.dot(X, bhat)
## The MSE and RMSE.
MSE = ((Y-EY)**2).sum()/(n-X.shape[1])
s = np.sqrt(MSE)
## These multipliers are used in constructing the intervals.
XtX = np.dot(np.transpose(X), X)
V = [np.dot(X[i,:], np.linalg.solve(XtX, X[i,:])) for i in range(n)]
V = np.array(V)
## The F quantile used in constructing the Scheffe interval.
QF = sp.fdtri(X.shape[1], n-X.shape[1], 0.95)
QF_2 = sp.fdtri(X.shape[1], n-X.shape[1], 0.68)
## The lower and upper bounds of the Scheffe band.
D = s*np.sqrt(X.shape[1]*QF*V)
LB,UB = Yhat-D,Yhat+D
D_2 = s*np.sqrt(X.shape[1]*QF_2*V)
LB_2,UB_2 = Yhat-D_2,Yhat+D_2
## Make the plot.
plt.clf()
plt.plot(XV, Y, 'o', ms=3, color='grey')
# plt.hold() was removed from matplotlib; successive plot calls overlay by default
a = plt.plot(XV, EY, '-', color='black', zorder = 4)
plt.fill_between(XV, LB_2, UB_2, where = UB_2 >= LB_2, facecolor='blue', alpha= 0.3, zorder = 0)
b = plt.plot(XV, LB_2, '-', color='blue', zorder=1)
plt.plot(XV, UB_2, '-', color='blue', zorder=1)
plt.fill_between(XV, LB, UB, where = UB >= LB, facecolor='blue', alpha= 0.3, zorder = 2)
b = plt.plot(XV, LB, '-', color='blue', zorder=3)
plt.plot(XV, UB, '-', color='blue', zorder=3)
d = plt.plot(XV, Yhat, '-', color='red',zorder=4)
plt.ylim([-8,8])
plt.xlim([-4,4])
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
The output looks like this: (figure omitted)
First of all, thank you @snake_charmer for your answer, but I have found a simpler way of solving the issue, using curve_fit from scipy.optimize.
I fit my data sample using curve_fit, which gives me my best-fit parameters. It also gives me the estimated covariance of the parameters; the diagonal of that matrix provides the variance of each parameter estimate. To compute one-standard-deviation errors on the parameters we can use np.sqrt(np.diag(pcov)), where pcov is the covariance matrix.
def fitfunc(M, p1, p2):
    N = p1 + (M * p2)
    return N
The above is the fit function I use for the data.
Now to fit the data using curve_fit
popt_1,pcov_1 = curve_fit(fitfunc,logx,logy,p0=(10.0,1.0),maxfev=2000) # fit to logy, matching the residuals below
p1_1 = popt_1[0]
p1_2 = popt_1[1]
sigma1 = [np.sqrt(pcov_1[0,0]),np.sqrt(pcov_1[1,1])] #THE 1 SIGMA CONFIDENCE INTERVALS
residuals1 = (logy) - fitfunc((logx),p1_1,p1_2)
xi_sq_1 = sum(residuals1**2) #THE CHI-SQUARE OF THE FIT
curve_y_1 = fitfunc((logx),p1_1,p1_2)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(logx,logy,c='r',label='$0.0<z<0.5$')
ax1.plot(logx,curve_y_1,'y')
ax1.plot(logx,fitfunc(logx,p1_1+sigma1[0],p1_2+sigma1[1]),'m',label='68% conf limits')
ax1.plot(logx,fitfunc(logx,p1_1-sigma1[0],p1_2-sigma1[1]),'m')
So just by using the square root of the diagonal elements of the covariance matrix, I can obtain the 1-sigma confidence lines.
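To highlight the region itself rather than only the limit lines, the same arrays can be shaded with fill_between; a sketch assuming logx is sorted in ascending order:
upper = fitfunc(logx, p1_1 + sigma1[0], p1_2 + sigma1[1])
lower = fitfunc(logx, p1_1 - sigma1[0], p1_2 - sigma1[1])
ax1.fill_between(logx, lower, upper, facecolor='m', alpha=0.2)  # shaded 68% band
ax1.legend()
plt.show()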
