Why is a shape parameter required in scipy.stats.lognorm? - python

I'm trying to generate the probability density function for a log-normal distribution. In theory, two parameters are enough to describe it: mu, the mean of the logarithm of the values, and sigma, the standard deviation of the logarithm of the values. I use scipy to generate the PDF, namely lognorm.pdf. The issue is that this function requires a shape parameter. I don't think I need this parameter, but I'm forced to supply it, and I don't know what exactly it describes.
As I understand it, the log-normal's scale-like parameter is the standard deviation of the logarithm of the values. When I generate the log-normal PDF with the required shape parameter, that shape appears to behave like the standard deviation, at least from the plots. I know that some other distributions take a shape parameter that controls the overall form of the distribution, but for the log-normal I have no idea why it's needed when the theory doesn't call for it.
In short, I'm confused about what this shape parameter does in this library. And if I really do have to use it, what value should I pass, or how do I calculate it?
This is my code, where I fixed the scale parameter to show how the shape changes the plots:
from scipy.stats import lognorm
import numpy as np
import matplotlib.pyplot as plt

# With loc=0 the log-normal PDF is zero for x <= 0, so only the positive
# part of this grid contributes to the curves
data_x = np.arange(-4, 4, 0.001)
# The scale parameter is fixed at 1; only the shape s varies
pdf_log1 = lognorm.pdf(data_x, s=0.2, loc=0, scale=1)
pdf_log2 = lognorm.pdf(data_x, s=1, loc=0, scale=1)
pdf_log3 = lognorm.pdf(data_x, s=2, loc=0, scale=1)
plt.plot(data_x, pdf_log1, color='red', label='shape=0.2')
plt.plot(data_x, pdf_log2, color='blue', label='shape=1')
plt.plot(data_x, pdf_log3, color='black', label='shape=2')
plt.legend()
plt.show()
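In fact the scipy docs spell out this mapping: lognorm's shape parameter s is the sigma of the underlying normal (the standard deviation of log(x)), and scale is exp(mu). Here is a minimal sketch checking that against the textbook log-normal PDF:
import numpy as np
from scipy.stats import lognorm

# s plays the role of sigma (std dev of log(X)) and scale = exp(mu);
# loc stays 0 for a textbook log-normal
mu, sigma = 0.5, 0.8
x = np.linspace(0.01, 5, 500)

scipy_pdf = lognorm.pdf(x, s=sigma, loc=0, scale=np.exp(mu))
textbook_pdf = (np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2))
                / (x * sigma * np.sqrt(2 * np.pi)))

print(np.allclose(scipy_pdf, textbook_pdf))  # True
So the shape parameter is not extra information: it is exactly the sigma the theory calls for, while scale carries exp(mu).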

Related

Using scipy gaussian kernel density estimation to calculate CDF inverse

The gaussian_kde class in scipy.stats has an evaluate method that returns the value of the estimated PDF at an input point. I'm trying to use gaussian_kde to estimate the inverse CDF. The motivation is generating Monte Carlo realizations of some input data whose statistical distribution is numerically estimated using KDE. Is there a method bound to gaussian_kde that serves this purpose?
The example below shows how this should work for the case of a Gaussian distribution. First I show how to do the PDF calculation to set up the specific API I'm trying to achieve:
import numpy as np
from scipy.stats import norm, gaussian_kde

# Build a KDE from normal samples
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)

# Evaluate the KDE and the exact PDF on a grid
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)
Is there an analogously simple way to compute the inverse CDF? The norm distribution has a very handy isf method that does exactly this:
cdf_value = np.sort(np.random.rand(npts_sample))
cdf_inv = norm.isf(1 - cdf_value)  # equivalent to norm.ppf(cdf_value)
Does such a function exist for kde_gaussian? Or is it straightforward to construct such a function from the already implemented methods?
The method integrate_box_1d can be used to compute the CDF, but it is not vectorized; you'll need to loop over points. If memory is not an issue, rewriting its source code (which is essentially just a call to special.ndtr) in vector form may speed things up.
from scipy.special import ndtr
import matplotlib.pyplot as plt

# Kernel bandwidth (standard deviation of the Gaussian kernel)
stdev = np.sqrt(kde.covariance)[0, 0]
# Average the kernel CDFs centered at each data point
pde_cdf = ndtr(np.subtract.outer(x, n) / stdev).mean(axis=1)
plt.plot(x, pde_cdf)
The plot of the inverse function would be plt.plot(pde_cdf, x). If the goal is to compute the inverse function at a specific point, consider building an interpolating spline through the computed values of the CDF and inverting it.
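For completeness, here is a sketch of the non-vectorized integrate_box_1d route mentioned at the top of this answer (kde and x as defined in the question); it computes the same values, just more slowly:
import numpy as np

# Integrate the KDE from -inf up to each grid point; one call per point,
# since integrate_box_1d is not vectorized
kde_cdf_loop = np.array([kde.integrate_box_1d(-np.inf, xi) for xi in x])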
You can use some Python tricks for a fast and memory-efficient estimation of the CDF (based on this answer):
from scipy.special import ndtr

# kde.factor is the bandwidth factor; for the roughly unit-variance data in
# this example it matches the kernel standard deviation np.sqrt(kde.covariance)
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / kde.factor).mean()
            for item in x)
It is as fast as the answer above, but has linear space complexity (len(kde.dataset)) instead of quadratic (actually, len(kde.dataset) * len(x)).
All you have to do next is invert the estimated CDF by approximation, for instance with the tools in statsmodels.
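If you'd rather avoid the extra dependency, a plain-numpy sketch of that inversion step (using cdf and x from the snippet above): since the estimated CDF is monotone in x, np.interp with the roles of x and cdf swapped approximates the inverse.
import numpy as np

# Swap (x, cdf) to (cdf, x): linear interpolation of the monotone CDF
# evaluated on the grid gives an approximate inverse CDF
u = np.random.uniform(low=min(cdf), high=max(cdf), size=1000)
approx_samples = np.interp(u, cdf, x)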
The question has been answered in the other answers but it took me a while to wrap my mind around everything. Here is a complete example of the final solution:
import numpy as np
from scipy import interpolate
from scipy.special import ndtr
import matplotlib.pyplot as plt
from scipy.stats import norm, gaussian_kde
# create kde
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)
# grid for plotting
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
# evaluate pdfs
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)
# cdf is available directly from scipy; the inverse cdf (ppf) needs
# probabilities in (0, 1) as input, so use a separate probability grid for it
norm_cdf = norm.cdf(x)
p = np.linspace(0.01, 0.99, npts_sample)
norm_inv = norm.ppf(p)
# estimate cdf
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / kde.factor).mean()
for item in x)
# estimate inv cdf
inversefunction = interpolate.interp1d(cdf, x, kind='cubic', bounds_error=False)
fig, ax = plt.subplots(1, 3, figsize=(6, 3))
ax[0].plot(x, norm_pdf, c='k')
ax[0].plot(x, kde_pdf, c='r', ls='--')
ax[0].set_title('PDF')
ax[1].plot(x, norm_cdf, c='k')
ax[1].plot(x, cdf, c='r', ls='--')
ax[1].set_title('CDF')
ax[2].plot(p, norm_inv, c='k')
ax[2].plot(p, inversefunction(p), c='r', ls='--')
ax[2].set_title("Inverse CDF")

Why doesn't the `normed` parameter for matplotlib histograms do anything?

I'm confused by the normed argument from matplotlib.pyplot.hist and why it does not change the plot output:
If True, the first element of the return tuple will be the counts
normalized to form a probability density, i.e., n/(len(x)*dbin), i.e.,
the integral of the histogram will sum to 1. If stacked is also True,
the sum of the histograms is normalized to 1.
Default is False
Seems pretty clear. I've seen it called a density function, probability density, etc.
That is, given 1000 draws from a random uniform distribution on [0, 1):
Specifying normed=True should change the y-axis to a density axis, where the sum of the bars is 1.0:
But in reality it does nothing of the sort:
r = np.random.uniform(size=1000)
plt.hist(r, normed=True)
And furthermore:
print(plt.hist(r, normed=True)[0].sum())
# definitely not 1.0
10.012123595
So, I have seen @Carsten König's answers to similar questions and am not asking for a workaround. My question is, what then is the purpose of normed? Am I misinterpreting what this parameter actually does?
The matplotlib documentation even gives an example named "histogram_percent_demo", where the integral looks like it would be over a thousand percent.
The heights of the bars do not necessarily sum to one.
What equals one is the area under the histogram, i.e. its integral:
import numpy as np
import matplotlib.pyplot as plt
r = np.random.uniform(size=1000)
hist, bins, patches = plt.hist(r, normed=True)
print((hist * np.diff(bins)).sum())
# 1.0
normed=True thus returns a histogram that can be interpreted as a probability density. In the example above there are ten default bins of width ≈ 0.1 spanning [0, 1), so the bar heights alone sum to ≈ 10 even though the total area is 1, which matches the value the question observed.
According to the matplotlib 3.0.2 documentation:
normed : bool, optional
Deprecated; use the density keyword argument instead.
So if you want a density plot, use density=True instead.
Or you can use seaborn.distplot, which by default plots a histogram using density rather than frequency.
What normed=True does is scale the area under the curve to 1, as @unutbu has shown.
density=True keeps the same property (the area under the curve is 1) and is more meaningful and useful.
r = np.random.uniform(size=1000)
hist, bins, patches = plt.hist(r, density=True)
print((hist * np.diff(bins)).sum())
[Out] 1

How to estimate density function and calculate its peaks?

I have started to use Python for analysis. I would like to do the following:
Get the distribution of a dataset
Get the peaks of this distribution
I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumption about the data? I am using data that change over time, so the data may have one distribution now (e.g. Gaussian) and another distribution later. Does gaussian_kde have any drawbacks in this scenario? In this question it was suggested to try fitting the data against every candidate distribution in order to find the data's distribution, so what's the difference between using gaussian_kde and the answer provided there? I used the code below; I would also like to know whether gaussian_kde is a good way to estimate the PDF if the data change over time. I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb, as described here. Also, how can I get its peaks?
import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats

df = pd.read_csv(r'D:\dataset.csv')  # raw string, so the backslash is not an escape
data = df.values.ravel()             # gaussian_kde expects a 1-d array here
pdf = scipy.stats.gaussian_kde(data)
x = np.linspace(data.min() - 1, data.max() + 1, len(data))
y = pdf(x)
pl.plot(x, y, color='r')
pl.hist(data, normed=True)
pl.show(block=True)
I think you need to distinguish non-parametric density (the one implemented in scipy.stats.kde) from parametric density (the one in the StackOverflow question you mention). To illustrate the difference between these two, try the following code.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])
df = pd.DataFrame(gaussian_mixture, columns=['data'])
# non-parametric pdf
nparam_density = stats.kde.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)
# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df)
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values, bins=30, normed=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, a particular observation x = x0 is represented by a bar (all probability mass is put on that single point and zero elsewhere), whereas in non-parametric density estimation each point is represented by a bell-shaped curve (the Gaussian kernel) that spreads over its neighbourhood; the result is a smoothed density curve. This internal Gaussian kernel has nothing to do with any distributional assumption on the underlying data x; its sole purpose is smoothing.
To get the mode of the non-parametric density, we need an exhaustive search, as the density is not guaranteed to be unimodal. As the example above shows, if a quasi-Newton optimization algorithm starts between [5, 10], it is very likely to end up at a local optimum rather than the global one.
# get mode: exhaustive search over the evaluated grid
x[np.argmax(nparam_density)]
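To find all the peaks rather than just the global mode, one option (a sketch, assuming scipy >= 1.1 for scipy.signal.find_peaks) is to look for local maxima of the density evaluated on the same grid:
from scipy.signal import find_peaks

# Indices of all local maxima of the density on the grid x
peak_indices, _ = find_peaks(nparam_density)
print(x[peak_indices])  # should land near the two component means, -6 and 4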

How to make a line for the density of the distribution of my data in matplotlib (Python)

How can I make a figure like the following one, but with a smooth curve, using matplotlib in Python?
Instead of using a histogram to bin your data, have a look at using a KDE for a continuous estimate of the probability distribution. There is an implementation using a Gaussian kernel in scipy.stats.gaussian_kde.
As an example:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

data = np.random.normal(0.0, 1.0, 10000)  # generate some data
kde = gaussian_kde(data)
xplot = np.linspace(-5, 5, 1000)
plt.plot(xplot, kde(xplot), label='KDE')
plt.hist(data, bins=50, histtype='step', normed=True, label='histogram')
plt.legend()
plt.show()
This produces a plot of the histogram with the KDE curve overlaid.
Note that when using KDEs, the bandwidth of the kernel you choose can have a very big impact on the representation of the data; this is similar to the effect the bin size has when making a histogram. Both the scipy documentation linked above and the Wikipedia page have good write-ups on how to make this selection in a well-motivated way.
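To experiment with that (a sketch reusing data and xplot from the example above), gaussian_kde takes a bw_method argument, which can be 'scott' (the default rule), 'silverman', or a scalar used directly as the bandwidth factor:
# Compare a deliberately narrow bandwidth against the Silverman rule
kde_narrow = gaussian_kde(data, bw_method=0.1)
kde_silverman = gaussian_kde(data, bw_method='silverman')
plt.plot(xplot, kde_narrow(xplot), label='bw_method=0.1')
plt.plot(xplot, kde_silverman(xplot), label="bw_method='silverman'")
plt.legend()
plt.show()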

A lognormal distribution in python

I have seen several questions on Stack Overflow about how to fit a log-normal distribution. Still, there are two clarifications that I need.
I have sample data whose logarithm follows a normal distribution, so I can fit the data using scipy.stats.lognorm.fit (i.e. a log-normal distribution).
The fit works fine, and it also gives me the standard deviation. Here is my code with the results.
import numpy as np
from scipy import stats

sample = np.log10(data)  # taking the log10 of the data
scatter, loc, mean = stats.lognorm.fit(sample)  # gives the parameters of the fit
x_fit = np.linspace(13.0, 15.0, 100)
pdf_fitted = stats.lognorm.pdf(x_fit, scatter, loc, mean)  # gives the PDF
print("scatter for data is %s" % scatter)
print("mean of data is %s" % mean)
THE RESULT
scatter for data is 0.186415047243
mean of data is 1.15731050926
From the image you can clearly see that the mean is around 14.2, but what I get is 1.15! Why is this so? Clearly log(mean) is not near 14.2 either!
In THIS POST and in THIS QUESTION it is mentioned that log(mean) is the actual mean.
But you can see from my code above that the fit I obtained uses sample = log(data), and it also seems to fit well. However, when I tried
sample = data
pdf_fitted = stats.lognorm.pdf(x_fit,scatter,loc,np.log10(mean))
The fit does not seem to work.
1) Why is the mean not 14.2?
2) How to fill/draw vertical lines showing the 1-sigma confidence region?
You say
I have a sample data, the logarithm of which follows a normal distribution.
Suppose data is the array containing the samples. To fit this data to
a log-normal distribution using scipy.stats.lognorm, use:
s, loc, scale = stats.lognorm.fit(data, floc=0)
Now suppose mu and sigma are the mean and standard deviation of the
underlying normal distribution. To get the estimate of those values
from this fit, use:
estimated_mu = np.log(scale)
estimated_sigma = s
(These are not the estimates of the mean and standard deviation of
the samples in data. See the wikipedia page for the formulas
for the mean and variance of a log-normal distribution in terms of mu and sigma.)
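For instance, a quick sketch of those Wikipedia formulas using the estimates above; scipy's lognorm.mean and lognorm.var compute the same quantities from s and scale:
# Mean and variance of the log-normal distribution itself (not of log(data))
dist_mean = np.exp(estimated_mu + estimated_sigma ** 2 / 2)
dist_var = (np.exp(estimated_sigma ** 2) - 1) * np.exp(2 * estimated_mu + estimated_sigma ** 2)

# scipy computes the same quantities directly from s and scale
print(np.isclose(dist_mean, stats.lognorm.mean(s, scale=scale)))  # True
print(np.isclose(dist_var, stats.lognorm.var(s, scale=scale)))    # True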
To combine the histogram and the PDF, you can use, for example:
import matplotlib.pyplot as plt
plt.hist(data, bins=50, normed=True, color='c', alpha=0.75)
xmin = data.min()
xmax = data.max()
x = np.linspace(xmin, xmax, 100)
pdf = stats.lognorm.pdf(x, s, scale=scale)
plt.plot(x, pdf, 'k')
If you want to see the log of the data, you could do something like
the following. Note that the PDF of the normal distribution is used
here.
logdata = np.log(data)
plt.hist(logdata, bins=40, normed=True, color='c', alpha=0.75)
xmin = logdata.min()
xmax = logdata.max()
x = np.linspace(xmin, xmax, 100)
pdf = stats.norm.pdf(x, loc=estimated_mu, scale=estimated_sigma)
plt.plot(x, pdf, 'k')
By the way, an alternative to fitting with stats.lognorm is to fit log(data)
using stats.norm.fit:
logdata = np.log(data)
estimated_mu, estimated_sigma = stats.norm.fit(logdata)
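As a quick consistency check (a sketch), the two routes should agree closely, since fixing floc=0 makes the log-normal fit equivalent to a normal fit of the logged data:
s, loc, scale = stats.lognorm.fit(data, floc=0)
estimated_mu, estimated_sigma = stats.norm.fit(np.log(data))

# Both lines should print (approximately) the same pair of numbers
print(np.log(scale), s)
print(estimated_mu, estimated_sigma)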
Related questions:
Fitting lognormal distribution using Scipy vs Matlab
Lognormal Random Numbers Centered around a high value
