arange produces evenly spaced, stepwise values and is not a random function, so why does it give a random distribution?
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
x = np.arange(-3, 3, 0.001)
plt.plot(x, norm.pdf(x))
I expected a uniform distribution.
scipy.stats.norm implements the normal distribution, not the uniform distribution. When you apply its probability density function (pdf), you are not applying a constant function but something else entirely (also known as the bell curve):
https://en.wikipedia.org/wiki/Normal_distribution
So in the end, what you are seeing are the points between -3 and 3 evaluated with the probability density function of the normal distribution. If you want to see the uniform distribution:
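import numpy as np
from scipy.stats import uniform
import matplotlib.pyplot as plt
x = np.arange(-3, 3, 0.001)
# uniform defaults to the interval [0, 1]; loc and scale move it to [-3, 3]
plt.plot(x, uniform.pdf(x, loc=-3, scale=6))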
But that is just a very fancy way to draw a constant line.
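To make the point concrete, here is a minimal sketch showing that the plotted values are a deterministic function of x, with nothing random involved. The closed-form bell curve and the `loc=-3, scale=6` arguments are added for illustration and are not part of the original question.

```python
import numpy as np
from scipy.stats import norm, uniform

x = np.arange(-3, 3, 0.001)  # evenly spaced evaluation points, nothing random

# norm.pdf is just the bell-curve formula evaluated at each x
bell = norm.pdf(x)
manual_bell = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# a uniform density on [-3, 3] is the constant 1/6 at every x
flat = uniform.pdf(x, loc=-3, scale=6)

# with default arguments, uniform lives on [0, 1]: 1 inside, 0 outside
default_flat = uniform.pdf(x)

print(np.allclose(bell, manual_bell), flat.min(), flat.max())
```

The shape of the curve comes from the function applied to x, not from how x was generated.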
Related
I am trying to generate the probability density function of a log-normal distribution. In theory we only need the location and scale parameters to describe the distribution (mu, the mean of the logarithmic values, and sigma, the standard deviation of the logarithmic values). I use the scipy library to generate the PDF, namely lognorm.pdf. The issue is that this function requires a shape parameter. I don't need this parameter (I guess), but I'm forced to use it. I don't know what exactly this parameter describes.
The log-normal scale parameter is in fact the standard deviation of the logarithmic values. When I generate a log-normal PDF with the required shape parameter, it turns out that shape behaves like the standard deviation. At least it looks like this. I know that some other distributions have to be given a shape parameter that defines the overall shape of the distribution, but in the case of the log-normal I have no clue why we need it if it's not required by theory.
I'm confused about what's going on with this shape parameter in this library. Another question is: if I really have to use this parameter, what value should I use? Or how do I calculate it?
This is my code where I fixed scale parameter to show how shape changes the plots:
from scipy.stats import lognorm
import numpy as np
import matplotlib.pyplot as plt
data_x = np.arange(-4, 4, 0.001)
pdf_log1 = lognorm.pdf(data_x, s=0.2, loc=0, scale=1)
pdf_log2 = lognorm.pdf(data_x, s=1, loc=0, scale=1)
pdf_log3 = lognorm.pdf(data_x, s=2, loc=0, scale=1)
plt.plot(data_x, pdf_log1, color='red', label='shape=0.2')
plt.plot(data_x, pdf_log2, color='blue', label='shape=1')
plt.plot(data_x, pdf_log3, color='black', label='shape=2')
plt.legend()
plt.show()
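As a hedged sketch of how the parameters line up: as far as the SciPy documentation describes it, the shape parameter `s` of `lognorm` is sigma (the standard deviation of the log values) and `scale` is exp(mu), with `loc` left at 0. The values of `mu` and `sigma` below are hypothetical and chosen only for illustration. The check compares `lognorm.pdf` against the textbook log-normal density.

```python
import numpy as np
from scipy.stats import lognorm

mu, sigma = 0.5, 0.75  # hypothetical mean and std of log(x), for illustration only

x = np.linspace(0.01, 10, 500)

# SciPy parametrization: s = sigma, scale = exp(mu), loc = 0
scipy_pdf = lognorm.pdf(x, s=sigma, loc=0, scale=np.exp(mu))

# textbook log-normal density written in terms of mu and sigma
textbook_pdf = (np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2))
                / (x * sigma * np.sqrt(2 * np.pi)))

print(np.allclose(scipy_pdf, textbook_pdf))
```

Under this reading, the shape parameter is not an extra quantity: it is the sigma from theory, while mu enters through `scale`.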
I visualize a density function (PDF) using two plotting approaches: displot() and plot(). I don't understand why displot() doesn't produce a normally distributed plot whereas plot() does this perfectly. The density plots should look alike, but they don't. What's wrong with displot() here?
from scipy.stats import norm
import seaborn as sns
import numpy as np
data_x= np.arange(-4, 4, 0.001)
norm_pdf = norm.pdf(data_x)
sns.displot(data = norm_pdf, x = data_x, kind='kde')
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
data_x= np.arange(-4, 4, 0.001)
plt.plot(data_x, norm.pdf(data_x))
plt.show()
displot (or the underlying kdeplot) creates an approximation of a probability density function (pdf) to resemble the function that might have generated the given random data. As input, you'll need random data. The function will mimic these data as a sum of Gaussian bell shapes (a "kernel density estimation" with a Gaussian kernel).
Here is an example using 8000 random points as input. You'll notice the curve resembles the normal pdf, but is also a bit "bumpier" (that's what randomness looks like).
from scipy.stats import norm
import seaborn as sns
data_x = norm.rvs(size=8000)  # 8000 random draws from a standard normal
sns.kdeplot(x=data_x)
When you call kdeplot (or displot(..., kind='kde')) with both data= and x=, and x= isn't a column name in a dataframe, data= gets ignored. So you are using 8000 evenly spaced values between -4 and 4. The kde of such data looks like a flat line between -4 and 4. But as the kde supposes the underlying function locally resembles a Gaussian, the start and end are smoothed out.
data_x = np.arange(-4, 4, 0.001)
sns.kdeplot(x=data_x)
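The flat-line behaviour can be checked numerically without plotting. This is a minimal sketch that uses scipy.stats.gaussian_kde as a stand-in for seaborn's kde (an assumption: seaborn's defaults may differ in detail). For 8000 evenly spaced values on an interval of length 8, the estimate should sit near 1/8 in the middle and drop to roughly half of that at the edge.

```python
import numpy as np
from scipy.stats import gaussian_kde

# evenly spaced "data", as in the question (not random draws)
evenly_spaced = np.arange(-4, 4, 0.001)
kde = gaussian_kde(evenly_spaced)

# evaluate the estimate in the middle of the range and at the left edge
center_density, edge_density = kde([0.0, -4.0])

print(center_density, edge_density)
```

The value at the edge is about half the central value because only half of each Gaussian kernel centred there has neighbouring data to add to it.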
The gaussian_kde class in scipy.stats has a method evaluate that returns the value of the PDF at an input point. I'm trying to use gaussian_kde to estimate the inverse CDF. The motivation is to generate Monte Carlo realizations of some input data whose statistical distribution is numerically estimated using KDE. Is there a method bound to gaussian_kde that serves this purpose?
The example below shows how this should work for the case of a Gaussian distribution. First I show how to do the PDF calculation to set up the specific API I'm trying to achieve:
import numpy as np
from scipy.stats import norm, gaussian_kde
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)
Is there an analogously simple way to compute the inverse CDF? The norm object has a very handy isf method that does exactly this:
cdf_value = np.sort(np.random.rand(npts_sample))
cdf_inv = norm.isf(1 - cdf_value)
Does such a function exist for gaussian_kde? Or is it straightforward to construct such a function from the already implemented methods?
The method integrate_box_1d can be used to compute the CDF, but it is not vectorized; you'll need to loop over points. If memory is not an issue, rewriting its source code (which is essentially just a call to special.ndtr) in vector form may speed things up.
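import matplotlib.pyplot as plt
from scipy.special import ndtr
stdev = np.sqrt(kde.covariance)[0, 0]  # standard deviation of each Gaussian kernel
# CDF at each x: average of the kernel CDFs centred on the data points n
pde_cdf = ndtr(np.subtract.outer(x, n) / stdev).mean(axis=1)
plt.plot(x, pde_cdf)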
The plot of the inverse function would be plt.plot(pde_cdf, x). If the goal is to compute the inverse function at a specific point, consider fitting an interpolating spline to the computed CDF values and inverting it.
You can use some python tricks for fast and memory-effective estimation of the CDF (based on this answer):
from scipy.special import ndtr
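bw = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation (kde.factor alone ignores the data's spread)
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / bw).mean()
            for item in x)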
It works as fast as this answer, but has linear (len(kde.dataset)) space complexity instead of the quadratic (actually, len(kde.dataset) * len(x)) one.
All you have to do next is to use inverse approximation, for instance, from statsmodels.
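The inversion step can also be sketched with plain NumPy interpolation instead of statsmodels. This is an illustrative sketch, not the method of either answer: the helper name `kde_ppf`, the grid size, and the 5-bandwidth padding are assumptions chosen for the example.

```python
import numpy as np
from scipy.special import ndtr
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=2000)
kde = gaussian_kde(data)

# kernel standard deviation in data units
bw = np.sqrt(kde.covariance[0, 0])

# tabulate the KDE's CDF on a grid that extends a little beyond the data
grid = np.linspace(data.min() - 5 * bw, data.max() + 5 * bw, 400)
cdf_grid = ndtr((grid[:, None] - data[None, :]) / bw).mean(axis=1)

def kde_ppf(q):
    """Approximate inverse CDF (hypothetical helper): interpolate grid vs. CDF."""
    return np.interp(q, cdf_grid, grid)

# inverse-transform sampling: push uniform draws through the inverse CDF
u = rng.uniform(size=1000)
samples = kde_ppf(u)
median = float(kde_ppf(0.5))
print(median)
```

If the only goal is Monte Carlo draws rather than the inverse CDF itself, gaussian_kde also provides a resample method that draws directly from the estimated density.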
The question has been answered in the other answers but it took me a while to wrap my mind around everything. Here is a complete example of the final solution:
import numpy as np
from scipy import interpolate
from scipy.special import ndtr
import matplotlib.pyplot as plt
from scipy.stats import norm, gaussian_kde
# create kde
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)
# grid for plotting
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
# evaluate pdfs
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)
# cdf and inverse cdf are available directly from scipy;
# the inverse cdf (ppf) takes probabilities in (0, 1), not x values
p = np.linspace(0.01, 0.99, npts_sample)
norm_cdf = norm.cdf(x)
norm_inv = norm.ppf(p)
# estimate cdf: average of the Gaussian kernel CDFs centred on each data point
bw = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / bw).mean()
            for item in x)
# estimate inverse cdf by interpolating x as a function of the cdf values
inversefunction = interpolate.interp1d(cdf, x, kind='cubic', bounds_error=False)
fig, ax = plt.subplots(1, 3, figsize=(6, 3))
ax[0].plot(x, norm_pdf, c='k')
ax[0].plot(x, kde_pdf, c='r', ls='--')
ax[0].set_title('PDF')
ax[1].plot(x, norm_cdf, c='k')
ax[1].plot(x, cdf, c='r', ls='--')
ax[1].set_title('CDF')
ax[2].plot(p, norm_inv, c='k')
ax[2].plot(p, inversefunction(p), c='r', ls='--')
ax[2].set_title("Inverse CDF")
plt.show()
I have started to use python for analysis. I would like to do the following:
Get the distribution of dataset
Get the peaks in this distribution
I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumptions about the data? My data change over time, so if the data follow one distribution now (e.g. Gaussian), they could follow another one later. Does gaussian_kde have any drawbacks in this scenario? In another question it was suggested to try fitting the data to every distribution in order to find the data's distribution. What is the difference between using gaussian_kde and the answer given there? I used the code below. Is gaussian_kde a good way to estimate the pdf if the data change over time? I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb. Also, how can I get its peaks?
import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats
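df = pd.read_csv(r'D:\dataset.csv')  # raw string so the backslash is kept literally
data_column = df.iloc[:, 0]  # assumption: the data of interest is the first column
pdf = scipy.stats.gaussian_kde(data_column)
x = np.linspace(data_column.min() - 1, data_column.max() + 1, len(data_column))
y = pdf(x)
pl.plot(x, y, color='r')
pl.hist(data_column, density=True)  # density=True replaces the removed normed=True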
pl.show(block=True)
I think you need to distinguish non-parametric density (the one implemented in scipy.stats.kde) from parametric density (the one in the StackOverflow question you mention). To illustrate the difference between these two, try the following code.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])
df = pd.DataFrame(gaussian_mixture, columns=['data'])
# non-parametric pdf
nparam_density = stats.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)
# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df)
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values, bins=30, density=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, a particular observation x=x0 is represented by a bar (all probability mass is put on that single point x=x0 and zero elsewhere), whereas in non-parametric density estimation it is represented by a bell-shaped curve (the Gaussian kernel) that spreads over its neighbourhood. The result is a smoothed density curve. This internal Gaussian kernel has nothing to do with any distributional assumption about the underlying data x. Its sole purpose is smoothing.
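The "one bell-shaped curve per observation" picture can be verified directly. The sketch below is an illustration under the assumption that gaussian_kde uses a Gaussian kernel whose variance is stored in `kde.covariance`. It builds the density by hand as the average of one normal pdf per data point and compares it with the library's output. The sample and grid are made up for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
sample = rng.normal(size=200)  # illustrative data, not the mixture above

kde = gaussian_kde(sample)
bw = np.sqrt(kde.covariance[0, 0])  # width of each bell-shaped bump

grid = np.linspace(-4, 4, 101)

# one Gaussian bump per observation, centred on that observation
bumps = norm.pdf(grid[:, None], loc=sample[None, :], scale=bw)

# the KDE is the average of all the bumps
manual_density = bumps.mean(axis=1)
library_density = kde(grid)

print(np.allclose(manual_density, library_density))
```

The normal pdf appears here only as the smoothing kernel. Nothing in the construction assumes that the data themselves are normally distributed.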
To get the mode of the non-parametric density, we need to do an exhaustive search, as the density is not guaranteed to be unimodal. As shown in the example above, if a quasi-Newton optimization algorithm starts in [5, 10], it is very likely to end up at a local optimum rather than the global one.
# get mode: exhaustive search over the grid
x[np.argmax(nparam_density)]
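The exhaustive search returns only the global mode. To list every peak, one option (a sketch, not part of the original answer) is to run scipy.signal.find_peaks on the gridded density. The `prominence=0.005` threshold is an assumption picked for this mixture to suppress tiny bumps in the tails, and it would need tuning for other data.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

# same Gaussian mixture as in the example above
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
mixture = np.hstack([gaussian1, gaussian2])

# evaluate the KDE on a grid
grid = np.linspace(-20, 10, 600)
density = gaussian_kde(mixture)(grid)

# local maxima that stand out from their surroundings by at least 0.005
peak_idx, _ = find_peaks(density, prominence=0.005)
peak_x = grid[peak_idx]

# global mode, for comparison with the exhaustive search
mode_x = grid[np.argmax(density)]
print(peak_x, mode_x)
```

The resolution of the peak locations is limited by the grid spacing, so a finer grid gives finer estimates at a higher evaluation cost.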
I've fitted a Frechet distribution in R and would like to use it in a Python script. However, inputting the same distribution parameters into scipy.stats.frechet_r gives me a very different curve. Is this a mistake in my implementation or a fault in scipy?
[Two plots omitted: the R distribution vs. the SciPy distribution]
R frechet parameters: loc=17.440, shape=0.198, scale=8.153
python code:
from scipy.stats import frechet_r
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(1, 1)
F=frechet_r(loc=17.440 ,scale= 8.153, c= 0.198)
x=np.arange(0.01,120,0.01)
ax.plot(x, F.pdf(x), 'k-', lw=2)
plt.show()
edit - relevant documentation.
The Frechet parameters were calculated in R using the fgev function in the 'evd' package http://cran.r-project.org/web/packages/evd/evd.pdf (page 40)
Link to the scipy documentation:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.frechet_r.html#scipy.stats.frechet_r
I haven't used the frechet_r function from scipy.stats (when quickly testing it I got the same plot as you), but you can get the required behaviour from genextreme in scipy.stats. It is worth noting that for genextreme the Frechet and Weibull shape parameters have the 'opposite' sign to usual. That is, in your case you would need to use a shape parameter of -0.198:
from scipy.stats import genextreme as gev
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(1, 1)
x=np.arange(0.01,120,0.01)
# The order for this is array, shape, loc, scale
F=gev.pdf(x,-0.198,loc=17.44,scale=8.153)
plt.plot(x,F,'g',lw=2)
plt.show()
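The sign flip can be checked numerically. The sketch below writes out the GEV density in the convention where a positive shape xi gives the Frechet type (which, as far as I understand, is the convention the evd package uses) and compares it with genextreme called with c = -xi. The helper `gev_pdf` is hypothetical and handles only xi != 0 on points inside the support.

```python
import numpy as np
from scipy.stats import genextreme

def gev_pdf(x, loc, scale, xi):
    """GEV density with xi > 0 meaning Frechet type (hypothetical helper, xi != 0)."""
    z = (x - loc) / scale
    t = (1.0 + xi * z) ** (-1.0 / xi)  # valid where 1 + xi * z > 0
    return t ** (xi + 1.0) * np.exp(-t) / scale

x = np.arange(0.01, 120, 0.01)  # all points lie inside the support for these parameters

manual = gev_pdf(x, loc=17.440, scale=8.153, xi=0.198)
scipy_pdf = genextreme.pdf(x, -0.198, loc=17.440, scale=8.153)  # note the negated shape

max_abs_diff = float(np.max(np.abs(manual - scipy_pdf)))
print(max_abs_diff)
```

A difference at the level of floating-point rounding supports the reading that only the sign of the shape parameter differs between the two conventions.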