I have a 40-year time series of ocean surge levels to which I'm trying to fit a lognormal distribution using scipy.stats. However, as far as I know (and have read), a lognormal distribution cannot have negative values by definition. The scipy implementation uses a generalized version with three parameters (shape, location and scale), which allows the distribution to be 'shifted and scaled' and therefore fitted to negative values. However, can it then still be considered a lognormal distribution?
The surge data in the example below (grey histogram) has around half its values below 0, and the computed lognorm fit is actually very good (orange line; shape = 0.27, loc = -0.57, scale = 0.56). However, if I try to use a lognorm with the mu/sigma parameterization (i.e. mu = log(scale), sigma = shape, and loc fixed at 0; see also Wikipedia), it returns an error due to the negative values.
What I don't really understand is whether a 'shifted' 3-parameter lognorm still qualifies as a lognormal distribution. I would prefer to use the standard parameterization, but for many time series this is not possible, and in general the resulting fit is worse.
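For reference, this is roughly what the two fits look like in code; the sample below is only a placeholder generated from the quoted fit parameters (shape = 0.27, loc = -0.57, scale = 0.56), so about half of its values are negative:
from scipy import stats
# Placeholder for the 40-year surge series: random draws from the 3-parameter
# lognormal with the parameters quoted above; roughly half the values are < 0
surge = stats.lognorm.rvs(0.27, loc=-0.57, scale=0.56, size=14600)
# 3-parameter ("shifted") fit: shape, loc and scale are all free
shape, loc, scale = stats.lognorm.fit(surge)
# Standard 2-parameter fit: loc fixed at 0, i.e. mu = log(scale), sigma = shape.
# This fails because the data contain values <= 0:
try:
    shape0, loc0, scale0 = stats.lognorm.fit(surge, floc=0)
except Exception as err:
    print(err)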
Related
I would like to calculate a one-sided tolerance bound based on the normal distribution, given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation I am bootstrapping samples, but if I weren't, I would refer to this Stack Overflow post, Correct way to obtain confidence interval with scipy, and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but as a one-sided bound (i.e. 95% of values are above, or below, some bound x)?
I assume that you are interested in computing a one-sided tolerance bound based on the normal distribution (given that you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper; paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. There are GitHub repositories containing such calculators that you can draw inspiration from, for example this one, from which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n = 1000
# proportion of the population to be covered by the tolerance bound (the percentile to estimate)
p = 0.9
# confidence level
g = 0.95
# a demo sample: n draws from a normal distribution with mean 100 and std 1
x = np.random.normal(loc=100, size=n)
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
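As a quick sanity check on the demo sample (drawn from N(100, 1)), the bound should land slightly above the true 90th percentile of that distribution, and the corresponding lower bound follows by symmetry. Continuing from the snippet above:
print("one-sided upper tolerance bound:", conf_upper_bound)
print("true 90th percentile of N(100, 1):", norm.ppf(p, loc=100, scale=1))
# One-sided lower tolerance bound: 95% confidence that at least 90% of the
# population lies above it
conf_lower_bound = mu_est - (k * sigma_est)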
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimensional points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph
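To see the effect in numbers rather than pictures, here is a rough comparison reusing the demo setup from the previous answer (a sample of n = 1000 draws from N(100, 1)); the quantile of the fitted Normal and the exact noncentral-t tolerance bound for 95% content at 95% confidence should be close at this sample size:
import numpy as np
import openturns as ot
from scipy.stats import norm, nct

# Demo sample, as in the previous answer: 1000 draws from N(100, 1)
n = 1000
x = np.random.normal(loc=100, size=n)
# Approximate bound: 95th percentile of the fitted Normal (ignores estimation error)
approx = ot.NormalFactory().build(x.reshape(-1, 1)).computeQuantile(0.95)[0]
# Exact one-sided tolerance bound (p = 0.95 content, g = 0.95 confidence)
p, g = 0.95, 0.95
k = nct.ppf(g, df=n - 1, nc=np.sqrt(n) * norm.ppf(p)) / np.sqrt(n)
exact = x.mean() + k * x.std(ddof=1)
print(approx, exact)  # the difference is small for n = 1000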
I have a set of 1D data in Python. I can get the probability density function using the gaussian_kde function from scipy. I want to know whether the returned distribution matches a theoretical distribution such as the normal distribution. Can I use the KL divergence for that? If so, how can I do it in Python?
This is my python code to get the probability density function.
array = np.array(values)
KDEpdf = gaussian_kde(array)
x = np.linspace(0, 50, 1500)
kdepdf = KDEpdf.evaluate(x)
plt.plot(x, kdepdf, label="", color="blue")
plt.legend()
plt.show()
There are a couple of ways to do it:
Plot it against a fitted normal probability distribution, e.g. plot the histogram of the data with plt.hist(array, density=True) and overlay plt.plot(x, norm.pdf(x, mu, std)).
Compare the kdepdf distribution with a uniform random dataset using something like a Q-Q plot for both datasets.
Use a chi-square test, but be cautious with the bin size you choose. Basically, this tests whether the number of draws that fall into the various intervals is consistent with a uniform random distribution.
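As for the KL divergence mentioned in the question, a rough sketch (using a stand-in for the values array from the question) is to evaluate both the KDE and the PDF of a fitted normal on a common grid and pass them to scipy.stats.entropy, which normalizes its inputs and returns the KL divergence. Note that the result depends on the grid and the KDE bandwidth, so it is better suited to ranking candidate distributions than as an absolute goodness-of-fit test:
import numpy as np
from scipy.stats import gaussian_kde, norm, entropy

# Stand-in for the question's data
values = np.random.normal(25, 5, size=1000)
x = np.linspace(0, 50, 1500)
# KDE of the data, evaluated on the grid (as in the question)
kde_pdf = gaussian_kde(values).evaluate(x)
# Theoretical candidate: a normal distribution fitted to the same data
mu, std = norm.fit(values)
norm_pdf = norm.pdf(x, mu, std)
# KL divergence D(KDE || Normal); entropy() normalizes both inputs to sum to 1
print(entropy(kde_pdf, norm_pdf))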
I can easily generate a random number from a Gaussian/normal probability distribution:
slice = random.gauss(50.0, 15.0)
But the probability distribution is the inverse of what I want: I want the 'inverse' of the Gaussian, where the high-probability regions lie not just on the left side but on the right side as well.
So whatever the probability of a result is under a normal distribution, I want the actual probability of my results to be the opposite: if a value has 90% probability under the normal distribution, I want it to appear 10% of the time, and so on.
While you work on editing your question, the answer can be found in the docs for scipy - check invgauss. Specifically,
from scipy.stats import invgauss
r = invgauss.rvs(mu, size=1000)
will generate 1000 numbers drawn from an inverse Gaussian distribution centered around mu (your mean). To draw the pdf:
rv = invgauss(mu)
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')
for some axis object. To allow for more control you have:
invgauss.pdf(x, mu, loc, scale)
where scale in particular has a mathematical relationship to the STD, though I don't remember it offhand. The canonical form usually only depends on the mean.
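For completeness, a minimal self-contained version of the pdf sketch above could look like this (the value of mu is purely illustrative):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import invgauss

mu = 0.5           # shape parameter, illustrative choice
rv = invgauss(mu)  # frozen distribution
fig, ax = plt.subplots()
x = np.linspace(invgauss.ppf(0.01, mu), invgauss.ppf(0.99, mu), 200)
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')
ax.hist(invgauss.rvs(mu, size=1000), bins=50, density=True, alpha=0.3)
ax.legend()
plt.show()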
OK, so I'm trying to use scipy's implementation of kstest as a way of evaluating which distribution best fits the data. My understanding of how kstest works is that the statistic represents the probability of the null hypothesis (i.e. the probability returned is the probability that the model in question is wrong for the data). This works about as expected for a uniform distribution between 0.0 and 1.0:
a = np.random.uniform(size=4999)
print(scipy.stats.kstest(a, 'uniform', args=(0.0,1.0)))
KstestResult(statistic=0.010517039009963702, pvalue=0.63796173656227928)
However, when I shift the uniform distribution's bounds from (0.0, 1.0) to (2.0, 3.0), the K-S statistic is oddly high:
a = np.random.uniform(2.0, 3.0,size=4999)
print(scipy.stats.kstest(a, 'uniform', args=(2.0,3.0)))
KstestResult(statistic=0.66671700832788283, pvalue=0.0)
Shouldn't the value of the test statistic in the second case be low as well, since the parameters passed approximate the distribution as closely as before?
The numpy version of uniform (used by you) and the scipy.stats version (used by kstest) work differently:
>>> np.random.uniform(2,3,5000).max()
2.9999333044165271
>>> stats.uniform(2,3).rvs(5000).max()
4.9995316751114043
In numpy the second parameter is interpreted as the upper bound; in scipy.stats it is the scale parameter, i.e. the width of the interval.
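So the second call compares the sample against a uniform distribution on [2, 5] rather than [2, 3]. A minimal fix is to pass the width (scale), not the upper bound:
import numpy as np
import scipy.stats
a = np.random.uniform(2.0, 3.0, size=4999)
# loc=2.0 and scale=1.0 describe a uniform distribution on [2.0, 3.0] in scipy.stats
print(scipy.stats.kstest(a, 'uniform', args=(2.0, 1.0)))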
I have data in a python/numpy/scipy environment that needs to be fit to a probability density function. A way to do this is to create a histogram of the data and then fit a curve to this histogram. The method scipy.optimize.leastsq does this by minimizing the sum of (y - f(x))**2, where (x,y) would in this case be the histogram's bin centers and bin contents.
In statistical terms, this least-squares fit maximizes the likelihood of obtaining that histogram by sampling each bin count from a Gaussian centered on the fit function at that bin's position. You can easily see this: each term (y - f(x))**2 is, up to constants, -log(gauss(y | mean=f(x))), and the sum is the negative logarithm of the product of the Gaussian likelihoods of all the bins.
That is however not always accurate: for the type of statistical data I'm looking at, each bin count is the result of a Poissonian process, so I want to minimize the negative logarithm of the product over all the bins (x, y) of poisson(y | mean=f(x)). The Poisson distribution comes very close to the Gaussian for large values of f(x), but if my histogram doesn't have such good statistics, the difference is relevant and influences the fit.
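To make the goal concrete, here is a rough sketch of what I have in mind, minimizing the Poisson negative log-likelihood with scipy.optimize.minimize (the exponential model and the generated counts are only placeholders):
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Illustrative histogram: bin centers x and Poisson-distributed counts y
rng = np.random.default_rng(0)
x = np.linspace(0.5, 9.5, 10)
y = rng.poisson(50 * np.exp(-x / 3.0))

def model(x, a, b):
    # Placeholder fit function f(x); replace with the model of interest
    return a * np.exp(-x / b)

def poisson_nll(params):
    # Negative log-likelihood: sum over bins of f(x) - y*log(f(x)) + log(y!)
    f = model(x, *params)
    if np.any(f <= 0):
        return np.inf
    return np.sum(f - y * np.log(f) + gammaln(y + 1))

res = minimize(poisson_nll, x0=[40.0, 2.0], method='Nelder-Mead')
print(res.x)  # estimated (a, b)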
If I understood correctly, you have data and want to see whether or not some probability distribution fits your data.
Well, if that's the case, you need a Q-Q plot. Take a look at this StackOverflow question and answer. However, that one is about the normal distribution, and you need code for the Poisson distribution. All you need to do is create some random data from a Poisson distribution and test your sample against it. Here you can find an example of a Q-Q plot for the Poisson distribution; here's the code from that website:
#!/usr/bin/env python
from pylab import *

# Poisson sample, plus a normal sample with matching mean and standard deviation
p = poisson(lam=10, size=4000)
m = mean(p)
s = std(p)
n = normal(loc=m, scale=s, size=p.shape)

# Plot range: mean +/- 4 standard deviations
a = m - 4 * s
b = m + 4 * s

figure()
# Q-Q plot: sorted normal quantiles against sorted Poisson quantiles
plot(sort(n), sort(p), 'o', color='0.85')
plot([a, b], [a, b], 'k-')
xlim(a, b)
ylim(a, b)
xlabel('Normal Distribution')
ylabel(r'Poisson Distribution with $\lambda=10$')
grid(True)
savefig('qq.pdf')
show()