I have a set of 1D data in Python. I can get the probability density function using the gaussian_kde function from scipy. I want to know whether the returned distribution matches a theoretical distribution such as the normal distribution. Can I use KL divergence for that? If so, how can I do it in Python?
This is my Python code to get the probability density function:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

array = np.array(values)          # `values` holds the 1D data
KDEpdf = gaussian_kde(array)      # kernel density estimate of the data
x = np.linspace(0, 50, 1500)
kdepdf = KDEpdf.evaluate(x)       # estimated density on the grid
plt.plot(x, kdepdf, label="KDE", color="blue")
plt.legend()
plt.show()
There are a couple of ways to do it:
1. Plot it against a fitted normal distribution: fit mu and std to the data and overlay plt.plot(x, norm.pdf(x, mu, std)) on the KDE curve (or on a normalized histogram).
2. Compare the kdepdf distribution with a reference dataset drawn from the theoretical distribution (e.g. a normal), for example with a Q-Q plot of the two datasets.
3. Use a chi-square test, and be cautious with the bin size you choose. Basically, this tests whether the number of draws that fall into the various intervals is consistent with the counts expected under the theoretical distribution.
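As a rough sketch of options 1 and 3, assuming values is the 1D data from the question and the bin counts are arbitrary choices:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, chisquare

data = np.asarray(values)            # the 1D data from the question
mu, std = norm.fit(data)             # maximum-likelihood normal fit

# Option 1: overlay the fitted normal density on a normalized histogram
x = np.linspace(0, 50, 1500)
plt.hist(data, bins=30, density=True, alpha=0.4, label="data")
plt.plot(x, norm.pdf(x, mu, std), color="red", label="fitted normal")
plt.legend()
plt.show()

# Option 3: chi-square goodness-of-fit test on binned counts
observed, edges = np.histogram(data, bins=20)
expected = len(data) * np.diff(norm.cdf(edges, mu, std))
expected *= observed.sum() / expected.sum()        # chisquare needs equal totals
stat, p = chisquare(observed, expected, ddof=2)    # 2 parameters were estimated
print(stat, p)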
I have a 40-year time series of surge levels in the ocean to which I'm trying to fit a lognormal distribution using scipy.stats. However, as far as I know (and have read), a lognormal distribution cannot have negative values by definition. The scipy implementation uses a generalized version with three parameters, shape, location and scale, which allows 'shifting and scaling' the distribution and thus makes it possible to fit negative values. However, can it then still be considered a lognormal distribution?
The surge data in the example below (grey histogram) has around half its values below 0, and the computed lognorm fit is actually very good (orange line; shape = 0.27, loc = -0.57, scale = 0.56). However, if I try to use a lognorm with the mu / sigma parameterization (i.e. mu = log(scale), sigma = shape, and loc fixed at 0), see also Wikipedia, it returns an error (due to the negative values).
What I don't really understand is whether a 'shifted' three-parameter lognorm still qualifies as a lognormal distribution. I would prefer to use the standard parameterization, but for many time series this is not possible, and the obtained fit is generally worse.
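For illustration, a minimal sketch contrasting the two fits, where surge is a hypothetical array standing in for the 40-year series:

import numpy as np
from scipy.stats import lognorm

surge = np.asarray(surge)   # hypothetical array holding the surge time series

# Three-parameter fit: shape, loc and scale are all free, so the distribution
# can be shifted left of zero and negative data is accepted
shape, loc, scale = lognorm.fit(surge)

# Standard two-parameter fit: loc fixed at 0, i.e. sigma = shape, mu = log(scale);
# as noted above, this fails when the data contains values <= 0
try:
    shape0, loc0, scale0 = lognorm.fit(surge, floc=0)
except Exception as exc:
    print("standard parameterization failed:", exc)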
I have this task:
1. Choose your favorite continuous distribution (the less it looks normal, the more interesting; try to choose one of the distributions we have not discussed in the course).
2. Generate a sample of 1000 from it, build a histogram of the sample, and draw the theoretical density of your random variable on top of it (so that the values are on the same scale, don't forget to set the histogram to normed=True). Your task is to estimate the distribution of the sample average for different sample sizes.
3. To do this, generate 1000 samples of size n and build histograms of their sample averages for three or more values of n (for example, 5, 10, 50).
4. Using the information on the mean and variance of the original distribution (easily found on Wikipedia), calculate the parameters of the normal distribution that, according to the central limit theorem, approximates the distribution of the sample averages. Note: to calculate these parameters, use the theoretical mean and variance of your random variable, not their sample estimates.
5. On top of each histogram, draw the density of the corresponding normal distribution (be careful with the parameters of the function: it takes the standard deviation as input, not the variance).
6. Describe the difference between the distributions obtained at different values of n. How does the accuracy of the normal approximation to the distribution of sample averages change as n grows?
So if I want to pick the exponential distribution in Python, do I need to do it like this?
from scipy.stats import expon
import matplotlib.pyplot as plt
exdist=sc.expon(loc=2,scale=3) # loc and scale - shift and scale parameters, default values 0 and 1.
mean, var, skew, kurt = exdist.stats(moments='mvsk') # Let's see the moments of our distribution.
x = np.linspace((0,2,100))
ax.plot(x, exdist.pdf(x)) # Let's draw it
arr=exdist.rvc(size=1000) # generation of thousand of random numbers. (Is it for task 3?)
And I constantly get this error; here is a screenshot from Jupyter:
https://i.stack.imgur.com/zwUtu.png
Could you please explain to me how to write the right code? I can't figure out where to start or where I'm making a mistake. Do I have to use arr.mean() to get the sample mean and plt.hist(arr, bins=...) to build a histogram? I would be very grateful for an explanation.
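Here is a hedged sketch of how the snippet above could be corrected (the loc/scale values and sample sizes are just the ones mentioned in the question; recent matplotlib versions use density=True instead of normed=True):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon, norm

exdist = expon(loc=2, scale=3)               # frozen exponential distribution
mean, var = exdist.stats(moments='mv')       # theoretical mean and variance

# Task 2: histogram of one sample of 1000 with the theoretical density on top
arr = exdist.rvs(size=1000)                  # rvs, not rvc
x = np.linspace(2, 20, 100)                  # plain arguments, no extra parentheses
plt.hist(arr, bins=30, density=True, label="sample")
plt.plot(x, exdist.pdf(x), label="theoretical pdf")
plt.legend()
plt.show()

# Tasks 3-5: distribution of sample averages for several n
for n in (5, 10, 50):
    means = exdist.rvs(size=(1000, n)).mean(axis=1)   # 1000 sample averages of size n
    plt.hist(means, bins=30, density=True, alpha=0.5, label=f"n={n}")
    # CLT approximation: Normal(mean, sqrt(var / n)) -- pass the std, not the variance
    grid = np.linspace(means.min(), means.max(), 200)
    plt.plot(grid, norm.pdf(grid, loc=mean, scale=np.sqrt(var / n)))
plt.legend()
plt.show()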
Can anyone please explain what goes on behind the scenes of the norm.pdf function in Python?
I saw a uniform distribution (formed using x = np.arange(-3, 3, 0.001)) being used to plot a normal distribution using plt.plot(x, norm.pdf(x)). So how does norm.pdf convert uniformly distributed values into a normal distribution?
pdf is short for 'probability density function'; it represents the density of a distribution at a given value, i.e. roughly how likely the distribution is to produce values near that point. This is the most commonly plotted chart for most distributions, since peaks (on the y axis) mark commonly output values (on the x axis).
Statistical distributions, like the normal/Gaussian distribution, have a nice, often parametrized, closed-form function that performs this mapping; norm.pdf is that PDF for the normal distribution.
Hence, plt.plot(x, norm.pdf(x)) plots, for a range of x values, the density of the normally distributed random variable at each x, which is why you see the familiar bell curve. The x values are just an evenly spaced grid of evaluation points; norm.pdf does not convert them into anything, it simply evaluates the density at each one.
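As a quick sanity check of what norm.pdf computes, you can evaluate the standard normal density formula by hand and compare (this is just the closed-form expression, nothing specific to the question):

import numpy as np
from scipy.stats import norm

x = np.arange(-3, 3, 0.001)                        # evenly spaced grid, as in the question
manual = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # standard normal density formula
print(np.allclose(manual, norm.pdf(x)))            # True: norm.pdf evaluates the same formula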
I can easily generate a random number along a gaussian/normal probability distribution.
slice = random.gauss(50.0, 15.0)
But that probability distribution is the inverse of what I want: I want the inverse probability, where the values that are unlikely under the Gaussian become the likely ones, and I would like to capture high probability not just on the left side, but on the right as well.
So literally whatever the probability is of a result along a normal distribution...I want the actual probability of my results to be the opposite. So if it's 90% probability on a normal distribution, I want that result to appear 10% of the time, etc.
While you work on editing your question, the answer can be found in the docs for scipy - check invgauss. Specifically,
from scipy.stats import invgauss
r = invgauss.rvs(mu, size=1000)
will generate 1000 numbers drawn from an inverse Gaussian distribution centered around mu (your mean). To draw the pdf:
rv = invgauss(mu)
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')
for some axis object. To allow for more control you have:
invgauss.pdf(x, mu, loc, scale)
where scale in particular has a mathematical relationship to the STD, though I don't remember it offhand. The canonical form usually only depends on the mean.
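Putting the snippets above together into a runnable sketch (the value of mu here is an arbitrary assumption):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import invgauss

mu = 0.5                                   # arbitrary mean parameter
r = invgauss.rvs(mu, size=1000)            # 1000 draws from the inverse Gaussian

fig, ax = plt.subplots()
x = np.linspace(invgauss.ppf(0.01, mu), invgauss.ppf(0.99, mu), 200)
rv = invgauss(mu)                          # frozen distribution
ax.hist(r, bins=50, density=True, alpha=0.4, label='samples')
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')
ax.legend()
plt.show()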
I have data in a python/numpy/scipy environment that needs to be fit to a probability density function. A way to do this is to create a histogram of the data and then fit a curve to this histogram. The method scipy.optimize.leastsq does this by minimizing the sum of (y - f(x))**2, where (x,y) would in this case be the histogram's bin centers and bin contents.
In statistical terms, this least-squares fit maximizes the likelihood of obtaining that histogram under the assumption that each bin count is sampled from a Gaussian centered on the fit function at that bin's position. You can easily see this: up to constants, each term (y - f(x))**2 is -log(gauss(y | mean=f(x))), and the sum is the negative logarithm of the product of the Gaussian likelihoods over all the bins.
That's however not always accurate: for the type of statistical data I'm looking at, each bin count is the result of a Poissonian process, so I want to maximize the (logarithm of the) product over all the bins (x, y) of poisson(y | mean=f(x)), i.e. minimize its negative logarithm. The Poissonian comes very close to the Gaussian distribution for large values of f(x), but if my histogram doesn't have such good statistics, the difference becomes relevant and influences the fit.
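A minimal sketch of that idea, minimizing the negative Poisson log-likelihood of the bin counts with scipy.optimize.minimize; the model f and the simulated histogram are hypothetical stand-ins:

import numpy as np
from scipy.optimize import minimize

def f(x, a, b):
    # hypothetical model for the expected bin content
    return a * np.exp(-x / b)

def neg_log_likelihood(params, x, y):
    mu = f(x, *params)
    # -log of the product of Poisson probabilities, dropping the log(y!) term,
    # which does not depend on the parameters
    return np.sum(mu - y * np.log(mu))

# simulated histogram: bin centers x and Poisson-distributed counts y
rng = np.random.default_rng(0)
x = np.linspace(0.5, 9.5, 10)
y = rng.poisson(f(x, 20.0, 3.0))

result = minimize(neg_log_likelihood, x0=[10.0, 1.0], args=(x, y),
                  bounds=[(1e-6, None), (1e-6, None)])
print(result.x)   # fitted (a, b), to be compared with the least-squares result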
If I understood correctly, you have data and want to see whether or not some probability distribution fits your data.
Well, if that's the case, you need a Q-Q plot. Take a look at this StackOverflow question and answer; however, that one is about the normal distribution, and you need code for the Poisson distribution. All you need to do is create some random data from a Poisson random generator and test your samples against it. Here you can find an example of a Q-Q plot for the Poisson distribution; here is the code from that web site:
#! /usr/bin/env python
from pylab import *

p = poisson(lam=10, size=4000)              # 4000 draws from a Poisson(lambda=10)
m = mean(p)
s = std(p)
n = normal(loc=m, scale=s, size=p.shape)    # normal reference with matching mean/std
a = m - 4*s
b = m + 4*s

figure()
plot(sort(n), sort(p), 'o', color='0.85')   # Q-Q plot: sorted samples against each other
plot([a, b], [a, b], 'k-')                  # reference line y = x
xlim(a, b)
ylim(a, b)
xlabel('Normal Distribution')
ylabel(r'Poisson Distribution with $\lambda=10$')
grid(True)
savefig('qq.pdf')
show()