Scaling Lognormal Fit - python

I have two arrays with x- and y- data.
This data shows lognormal behavior. I need a graph of the fit, as well as the mu and the sigma to do some statistics.
I did a fit, in order to calculate the mu, the sigma, and further on some statistical values of it. (See code below)
I obtain the scaling factor, with which I have to multiply the distribution with an integral over the datapoints.
The code below, does work. My question now is, if (I am sure) there is a better way to do this? It feels like a workaround, that will work sometimes. I want a better way to do this, because I have to plot hundreds of these.
My code (sorry, that it is this long, wanted to include everything except import of crude data):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# produce plot True/False
ploton = True
x0=np.array([3.58381e+01, 3.27125e+01, 2.98680e+01, 2.72888e+01, 2.49364e+01,
2.27933e+01, 2.08366e+01, 1.90563e+01, 1.74380e+01, 1.59550e+01,
1.45904e+01, 1.33460e+01, 1.22096e+01, 1.11733e+01, 1.02262e+01,
9.35893e+00, 8.56556e+00, 7.86688e+00, 7.20265e+00, 6.59782e+00,
6.01571e+00, 5.53207e+00, 5.03979e+00, 4.64415e+00, 4.19920e+00,
3.83595e+00, 3.50393e+00, 3.28070e+00, 3.00930e+00, 2.75634e+00,
2.52050e+00, 2.31349e+00, 2.12280e+00, 1.92642e+00, 1.77820e+00,
1.61692e+00, 1.49094e+00, 1.36233e+00, 1.22935e+00, 1.14177e+00,
1.03078e+00, 9.39603e-01, 8.78425e-01, 1.01490e+00, 1.07461e-01,
4.81523e-02, 4.81523e-02, 1.00000e-02, 1.00000e-02])
y0=np.array([3.94604811e+04, 2.78223936e+04, 1.95979179e+04, 2.14447807e+04,
1.68677487e+04, 1.79429516e+04, 1.73589776e+04, 2.16101026e+04,
3.79705638e+04, 6.83622301e+04, 1.73687772e+05, 5.74854475e+05,
1.69497465e+06, 3.79135941e+06, 7.76757753e+06, 1.33429094e+07,
1.96096415e+07, 2.50403065e+07, 2.72818618e+07, 2.53120387e+07,
1.93102362e+07, 1.22219224e+07, 4.96725699e+06, 1.61174658e+06,
3.19352386e+05, 1.80305856e+05, 1.41728002e+05, 1.66191809e+05,
1.33223816e+05, 1.31384905e+05, 2.49100945e+05, 2.28300583e+05,
3.01063903e+05, 1.84271914e+05, 1.26412781e+05, 8.57488083e+04,
1.35536571e+05, 4.50076293e+04, 1.98080100e+05, 2.27630303e+05,
1.89484527e+05, 0.00000000e+00, 1.36543525e+05, 2.20677520e+05,
3.60100586e+05, 1.62676486e+05, 1.90105093e+04, 9.27461467e+05,
1.58373542e+05])
Dnm = x0
dndlndp = y0
#lognormal PDF:
def f(x, mu, sigma) :
return 1/(np.sqrt(2*np.pi)*sigma*x)*np.exp(-((np.log(x)-mu)**2)/(2*sigma**2))
#normalizing y-values to obtain lognormal distributed data:
y0_normalized = y0/np.trapz(x0.ravel(), y0.ravel())
#calculating mu/sigma of this distribution:
params, extras = curve_fit(f, x0.ravel(), y0_normalized.ravel())
median = np.exp(params[0])
mu = params[0]
sigma = params[1]
#output of mu / sigma / calculated median:
print "mu=%g, sigma=%g" % (params[0], params[1])
print "median=%g" % median
#new variable z for smooth fit-curve:
z = np.linspace(0.1, 100, 10000)
#######################
Dnm = np.ravel(Dnm)
dndlndp = np.ravel(dndlndp)
Dnm_rev = list(reversed(Dnm))
dndlndp_rev = list(reversed(dndlndp))
scalingfactor = np.trapz(dndlndp_rev, Dnm_rev, dx = np.log(Dnm_rev))
#####################
#plotting
if ploton:
plt.plot(z, f(z, mu, sigma)*scalingfactor, label="fit", color = "red")
plt.scatter(x0, y0, label="data")
plt.xlim(3,20)
plt.xscale("log")
plt.legend()
EDIT1: Maybe I should add that I have no idea, why the scaling factor calculated with
scalingfactor = np.trapz(dndlndp_rev, Dnm_rev, dx = np.log(Dnm_rev))
is right. It was simply try and error. I really want to know, why this does the trick, since the "area" of all bins combined is:
N = np.trapz(dndlndp_rev, np.log(Dnm_rev), dx = np.log(Dnm_rev))
because the width of the bins is log(Dnm).
EDIT2: Thank you for all answers. I copied the arrays into the code, which is now runable. I want to simplify the question, since i think, due to my poor english, i was not able to say what i really want:
I have lognormal set of data. The code above allows me to calculate the mu and the sigma. To do so, i need to normalize the data, and the area under the function is from now on = 1.
In order to plot a lognormal function with the calculated mu and sigma, i need to multiply the function with an (unknown) factor, because the area under the real function is something like 1e8, but sure not one. I did a workaround by calculating this "scalingfactor" via the trapz integral of the diskrete crude data.
There has to be a better way to plot the fitted function, when mu and sigma are already known.

Related

How do i plot 2 separate average lines on my residuals between different x values - python

In the residual plot resulting from the below code, there is substantial drop in values around the halfway point
I would like to help visualise this for those less statistically inclined by plotting 2 average lines of the residual plot
one from x(0, 110)
and the second from x(110, 240)
Here is the code
FINAL LINEAR MODEL
x = merged[['Imp_Col_LNG', 'AveSH_LNG']].values
y = merged['Unproductive_LNG'].values
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x,y)
# plt.scatter(x, y)
yp=reg.predict(x)
# plt.plot(xp,yp)
# plt.text(x.max()*0.7,y.max()*0.1,'$R^2$ =
{score:.4f}'.format(score=reg.score(x,y)))
# plt.show()
plt.scatter(yp, y)
s = yp.argsort()
plt.plot(yp[s], yp[s],color='k',ls='--')
from scipy.stats import norm
ub = yp + norm.ppf(0.5+0.95/2) * res.std(ddof=1)
lb = yp - norm.ppf(0.5+0.95/2) * res.std(ddof=1)
plt.plot(yp[s], ub[s],color='k',ls='--')
plt.plot(yp[s], lb[s],color='k',ls='--')
plt.text(x.max()*0.7,y.max()*0.1,'$R^2$ =
{score:.4f}'.format(score=reg.score(x,y)))
plt.xlabel('Predicted Values')
plt.ylabel('Observed Values')
plt.title('LNG_Shuffles')
plt.show()
RESIDUAL PLOTS
res = pd.Series(y - yp)
checkresiduals(res)
plt.plot(res)
Since we are trying to plot the average of the residuals from (0, 110) and (110, 240), we first have to calculate the averages for each section.
Here, res stores the residual data in the form of a pd.Series object. To get the array information from it, we can use the to_numpy method of the pd.Series objects.
res_data = res.to_numpy()
Now, let's calculate the mean for each part.
first_average = res_data[:110].mean()
second_average = res_data[110:].mean()
Now, since we are going to plot this over two different ranges, we will have to convert these to numpy arrays before plotting.
plt.plot(np.arange(110), np.ones(110) * first_average)
plt.plot(np.arange(110, 240), np.ones(130) * second_average)
This should give you the piecewise residual average plot.

Python: Kernel Density Estimation for positive values

I want to get kernel density estimation for positive data points. Using Python Scipy Stats package, I came up with the following code.
def get_pdf(data):
a = np.array(data)
ag = st.gaussian_kde(a)
x = np.linspace(0, max(data), max(data))
y = ag(x)
return x, y
This works perfectly for most data sets, but it gives an erroneous result for "all positive" data points. To make sure this works correctly, I use numerical integration to compute the area under this curve.
def trapezoidal_2(ag, a, b, n):
h = np.float(b - a) / n
s = 0.0
s += ag(a)[0]/2.0
for i in range(1, n):
s += ag(a + i*h)[0]
s += ag(b)[0]/2.0
return s * h
Since the data is spread in the region (0, int(max(data))), we should get a value close to 1, when executing the following line.
b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data)
a = np.array(data)
ag = st.gaussian_kde(a)
trapezoidal_2(ag, 0, int(max(data)), int(max(data))*2)
But it gives a value close to 0.5 when I test.
But when I intergrate from -100 to max(data), it provides a value close to 1.
trapezoidal_2(ag, -100, int(max(data)), int(max(data))*2+200)
The reason is, ag (KDE) is defined for values less than 0, even though the original data set contains only positive values.
So how can I get a kernel density estimation that considers only positive values, such that area under the curve in the region (o, max(data)) is close to 1?
The choice of the bandwidth is quite important when performing kernel density estimation. I think the Scott's Rule and Silverman's Rule work well for distribution similar to a Gaussian. However, they do not work well for the Pareto distribution.
Quote from the doc:
Bandwidth selection strongly influences the estimate obtained from
the KDE (much more so than the actual shape of the kernel). Bandwidth selection
can be done by a "rule of thumb", by cross-validation, by "plug-in
methods" or by other means; see [3], [4] for reviews. gaussian_kde
uses a rule of thumb, the default is Scott's Rule.
Try with different bandwidth values, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
b = 1
sample = stats.pareto.rvs(b, size=3000)
kde_sample_scott = stats.gaussian_kde(sample, bw_method='scott')
kde_sample_scalar = stats.gaussian_kde(sample, bw_method=1e-3)
# Compute the integrale:
print('integrale scott:', kde_sample_scott.integrate_box_1d(0, np.inf))
print('integrale scalar:', kde_sample_scalar.integrate_box_1d(0, np.inf))
# Graph:
x_span = np.logspace(-2, 1, 550)
plt.plot(x_span, stats.pareto.pdf(x_span, b), label='theoretical pdf')
plt.plot(x_span, kde_sample_scott(x_span), label="estimated pdf 'scott'")
plt.plot(x_span, kde_sample_scalar(x_span), label="estimated pdf 'scalar'")
plt.xlabel('X'); plt.legend();
gives:
integrale scott: 0.5572130540733236
integrale scalar: 0.9999999999968957
and:
We see that the kde using the Scott method is wrong.

Python: two normal distribution

I have two data sets where two values where measured. I am interested in the difference between the value and the standard deviation of the difference. I made a histogram which I would like to fit two normal distributions. To calculate the difference between the maxima. I also would like to evaluate the effect that in on data set I have much less data on one value. I've already looked at this link but it is not really what I need:
Python: finding the intersection point of two gaussian curves
for ii in range(2,8):
# Kanal = ii - 1
file = filepath + '\Mappe1.txt'
data = np.loadtxt(file, delimiter='\t', skiprows=1)
data = data[:,ii]
plt.hist(data,bins=100)
plt.xlabel("bins")
plt.ylabel("Counts")
plt.tight_layout()
plt.grid()
plt.figure()
plt.show()
Quick and dirty fitting can be readily achieved using scipy:
from scipy.optimize import curve_fit #non linear curve fitting tool
from matplotlib import pyplot as plt
def func2fit(x1,x2,m_1,m_2,std_1,std_2,height1, height2): #define a simple gauss curve
return height1*exp(-(x1-m_1)**2/2/std_1**2)+height2*exp(-(x2-m_2)**2/2/std_2**2)
init_guess=(-.3,.3,.5,.5,3000,3000)
#contains the initial guesses for the parameters (m_1, m_2, std_1, std_2, height1, height2) using your first figure
#do the fitting
fit_pars, pcov =curve_fit(func2fit,xdata,ydata,init_guess)
#fit_pars contains the mean, the heights and the SD values, pcov contains the estimated covariance of these parameters
plt.plot(xdata,func2fit(xdata,*fit_pars),label='fit') #plot the fit
For further reference consult the scipy manual page:
curve_fit
Assuming that the two samples are independent there is no need to handle this problem using curve fitting. It's basic statistics. Here's some code that does the calculations required, with the source attributed in a comment.
## adapted from http://onlinestatbook.com/2/estimation/difference_means.html
from random import gauss
from numpy import sqrt
sample_1 = [ gauss(0,1) for _ in range(10) ]
sample_2 = [ gauss(1,.5) for _ in range(20) ]
n_1 = len(sample_1)
n_2 = len(sample_2)
mean_1 = sum(sample_1)/n_1
mean_2 = sum(sample_2)/n_2
SSE = sum([(_-mean_1)**2 for _ in sample_1]) + sum([(_-mean_2)**2 for _ in sample_2])
df = (n_1-1) + (n_2-1)
MSE = SSE/df
n_h = 2 / ( 1/n_1 + 1/n_2 )
s_mean_diff = sqrt( 2* MSE / n_h )
print ( 'difference between means', abs(n_1-n_2))
print ( 'std dev of this difference', s_mean_diff )

How to overplot fit results for discrete values in pymc3?

I am completely new to pymc3, so please excuse the fact that this is likely trivial. I have a very simple model where I am predicting a binary response function. The model is almost a verbatim copy of this example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/gelman_bioassay.py
I get back the model parameters (alpha, beta, and theta), but I can't seem to figure out how to overplot the predictions of the model vs. the input data. I tried doing this (using the parlance of the bioassay model):
from scipy.stats import binom
mean_alpha = mean(trace['alpha'])
mean_beta = mean(trace['beta'])
pred_death = binom.rvs(n, 1./(1.+np.exp(-(mean_alpha + mean_beta * dose))))
and then plotting dose vs. pred_death, but this is manifestly not correct as I get different draws of the binomial distribution every time.
Related to this is another question, how do I evaluate the goodness of fit? I couldn't seem to find anything to that effect in the "getting started" pymc3 tutorial.
Thanks very much for any advice!
Hi a simple way to do it is as follows:
from pymc3 import *
from numpy import ones, array
# Samples for each dose level
n = 5 * ones(4, dtype=int)
# Log-dose
dose = array([-.86, -.3, -.05, .73])
def invlogit(x):
return np.exp(x) / (1 + np.exp(x))
with Model() as model:
# Logit-linear model parameters
alpha = Normal('alpha', 0, 0.01)
beta = Normal('beta', 0, 0.01)
# Calculate probabilities of death
theta = Deterministic('theta', invlogit(alpha + beta * dose))
# Data likelihood
deaths = Binomial('deaths', n=n, p=theta, observed=[0, 1, 3, 5])
start = find_MAP()
step = NUTS(scaling=start)
trace = sample(2000, step, start=start, progressbar=True)
import matplotlib.pyplot as plt
death_fit = np.percentile(trace.theta,50,axis=0)
plt.plot(dose, death_fit,'g', marker='.', lw='1.25', ls='-', ms=5, mew=1)
plt.show()
If you want to plot dose vs pred_death, where pred_death is computed from the mean estimated values of alpha and beta, then do:
pred_death = 1./(1. + np.exp(-(mean_alpha + mean_beta * dose)))
plt.plot(dose, pred_death)
instead if you want to plot dose vs pred_death, where pred_death is computed taking into account the uncertainty in posterior for alpha and beta. Then probably the easiest way is to use the function sample_ppc:
May be something like
ppc = pm.sample_ppc(trace, samples=100, model=pmmodel)
for i in range(100):
plt.plot(dose, ppc['deaths'][i], 'bo', alpha=0.5)
Using Posterior Predictive Checks (ppc) is a way to check how well your model behaves by comparing the predictions of the model to your actual data. Here you have an example of sample_ppc
Other options could be to plot the mean value plus some interval of interest.

Python interp1d vs. UnivariateSpline

I'm trying to port some MatLab code over to Scipy, and I've tried two different functions from scipy.interpolate, interp1d and UnivariateSpline. The interp1d results match the interp1d MatLab function, but the UnivariateSpline numbers come out different - and in some cases very different.
f = interp1d(row1,row2,kind='cubic',bounds_error=False,fill_value=numpy.max(row2))
return f(interp)
f = UnivariateSpline(row1,row2,k=3,s=0)
return f(interp)
Could anyone offer any insight? My x vals aren't equally spaced, although I'm not sure why that would matter.
I just ran into the same issue.
Short answer
Use InterpolatedUnivariateSpline instead:
f = InterpolatedUnivariateSpline(row1, row2)
return f(interp)
Long answer
UnivariateSpline is a 'one-dimensional smoothing spline fit to a given set of data points' whereas InterpolatedUnivariateSpline is a 'one-dimensional interpolating spline for a given set of data points'. The former smoothes the data whereas the latter is a more conventional interpolation method and reproduces the results expected from interp1d. The figure below illustrates the difference.
The code to reproduce the figure is shown below.
import scipy.interpolate as ip
#Define independent variable
sparse = linspace(0, 2 * pi, num = 20)
dense = linspace(0, 2 * pi, num = 200)
#Define function and calculate dependent variable
f = lambda x: sin(x) + 2
fsparse = f(sparse)
fdense = f(dense)
ax = subplot(2, 1, 1)
#Plot the sparse samples and the true function
plot(sparse, fsparse, label = 'Sparse samples', linestyle = 'None', marker = 'o')
plot(dense, fdense, label = 'True function')
#Plot the different interpolation results
interpolate = ip.InterpolatedUnivariateSpline(sparse, fsparse)
plot(dense, interpolate(dense), label = 'InterpolatedUnivariateSpline', linewidth = 2)
smoothing = ip.UnivariateSpline(sparse, fsparse)
plot(dense, smoothing(dense), label = 'UnivariateSpline', color = 'k', linewidth = 2)
ip1d = ip.interp1d(sparse, fsparse, kind = 'cubic')
plot(dense, ip1d(dense), label = 'interp1d')
ylim(.9, 3.3)
legend(loc = 'upper right', frameon = False)
ylabel('f(x)')
#Plot the fractional error
subplot(2, 1, 2, sharex = ax)
plot(dense, smoothing(dense) / fdense - 1, label = 'UnivariateSpline')
plot(dense, interpolate(dense) / fdense - 1, label = 'InterpolatedUnivariateSpline')
plot(dense, ip1d(dense) / fdense - 1, label = 'interp1d')
ylabel('Fractional error')
xlabel('x')
ylim(-.1,.15)
legend(loc = 'upper left', frameon = False)
tight_layout()
The reason why the results are different (but both likely correct) is that the interpolation routines used by UnivariateSpline and interp1d are different.
interp1d constructs a smooth B-spline using the x-points you gave to it as knots
UnivariateSpline is based on FITPACK, which also constructs a smooth B-spline. However, FITPACK tries to choose new knots for the spline, to fit the data better (probably to minimize chi^2 plus some penalty for curvature, or something similar). You can find out what knot points it used via g.get_knots().
So the reason why you get different results is that the interpolation algorithm is different. If you want B-splines with knots at data points, use interp1d or splmake. If you want what FITPACK does, use UnivariateSpline. In the limit of dense data, both methods give same results, but when data is sparse, you may get different results.
(How do I know all this: I read the code :-)
Works for me,
from scipy import allclose, linspace
from scipy.interpolate import interp1d, UnivariateSpline
from numpy.random import normal
from pylab import plot, show
n = 2**5
x = linspace(0,3,n)
y = (2*x**2 + 3*x + 1) + normal(0.0,2.0,n)
i = interp1d(x,y,kind=3)
u = UnivariateSpline(x,y,k=3,s=0)
m = 2**4
t = linspace(1,2,m)
plot(x,y,'r,')
plot(t,i(t),'b')
plot(t,u(t),'g')
print allclose(i(t),u(t)) # evaluates to True
show()
This gives me,
UnivariateSpline: A more recent
wrapper of the FITPACK routines.
this might explain the slightly different values? (I also experienced that UnivariateSpline is much faster than interp1d.)

Categories

Resources