If we have discrete cdf for quantiles like
quantiles = array([1.000e-04, 1.000e-03, 1.000e-02, 2.000e-02, 3.000e-02, 4.000e-02,
5.000e-02, 6.000e-02, 7.000e-02, 8.000e-02, 9.000e-02, 1.000e-01,
2.000e-01, 3.000e-01, 4.000e-01, 5.000e-01, 6.000e-01, 7.000e-01,
8.000e-01, 9.000e-01, 9.100e-01, 9.200e-01, 9.300e-01, 9.400e-01,
9.500e-01, 9.600e-01, 9.700e-01, 9.800e-01, 9.900e-01, 9.990e-01,
9.999e-01])
Is it valid to create a reverse mapping linear interpolation ? That is from the cdf quantiles, we estimate the value of the random variable satifying cdf condition p(x < a) = p_a. Then we get uniformly distributed values from 0 to 1 and generate random variable in question (think of mapping from y to x axis on a cdf plot). Would the PDF from this be a good approximation ?
f = interp1d(quantiles, matching_discrete_cdf, kind='linear')
uni_rv = stats.uniform.rvs(loc=percentiles.min(),
scale=percentiles.max() - percentiles.min(), size=nof_items)
pdf = f(uni_rv)
I assume that when you write "pdf" you mean "sample" and not an actual probability density function; and when you write "matching_discrete_cdf", you mean the "percent point function" (PPF) which is the inverse of CDF. The terminological confusion aside, the idea is sound: generating a sample for a custom distribution by transforming a uniform sample by the PPF is a standard approach.
The interpolation will distort the distribution a little, and so will the fact that the quantiles 1.000e-04 and 9.999e-01 of the original distribution will become the min and max of the generated numbers (the original distribution had some small chance of being outside these limits). But this should be acceptable, and is inevitable given the data you have. Maybe use the cubic interpolation instead of linear one?
On the chance that you actually want the PDF and not a sample -- the PDF is the derivative of the CDF. I would use cubic spline interpolation on the CDF values (InterpolatedUnivariateSpline), and then take its derivative.
Related
After calculating the Fast Fourier Transform (FFT) of a time series in Python/Scipy, I am trying to plot the 95% confidence level that for which the power spectrum is different from red or white noise, but haven't found a straightforward way to do so. I tried following this thread: Power spectrum in python - significance levels
and wrote the following code to test for a sine function with random noise:
import numpy as np
from scipy.stats import chi2
from scipy.fft import rfft, rfftfreq
x=np.linspace(0,10,500)
data = np.sin(20*np.pi*x)+np.random.rand(500) - 0.5
yf = rfft(data)
xf = rfftfreq(len(data), 1)
n=len(data)
var=np.var(data)
### degrees of freedom
M=n/2
phi=(2*(n-1)-M/2.)/M
###values of chi-squared
chi_val_99 = chi2.isf(q=0.01/2, df=phi) #/2 for two-sided test
chi_val_95 = chi2.isf(q=0.05/2, df=phi)
### normalization of power spectrum with 1/n
plt.figure(figsize=(5,5))
plt.plot(xf,np.abs(yf)/n, color='k')
plt.axhline(y=(var/n)*(chi_val_95/phi),color='r',linestyle='--')
But the resulting line lies below all of the power spectrum, as in Fig. 1. What am I doing wrong? Is there another way to get the significance of the FFT power spectrum ?
Background considerations
I did not read the entire references included in the answer you linked to (and in particular Pankofsky et. al.), but couldn't find an explicit derivation of the formula and exactly under which conditions the results applied. On the other hand I've found a few other references where a derivation could more readily be confirmed.
Based on the answer to this question on dsp.stackexchange.com, if you only had white gaussian noise with unit variance, the squared-amplitude of each Fourier coefficients would have Chi-squared distribution with degree of freedom asymptotically 2 (sum of 2 Gaussians, one for each of the real and imaginary parts of the complex Fourier coefficient, when n >> 1). When the noise does not have unit variance, it follows a more general Gamma distribution (although in this case you can simply think of it as scaling the survival function). For noise with a uniform distribution in the [-0.5,0.5] range, and a sufficiently large number of samples, the distribution can also be approximated by a Gamma distribution thanks to the Central Limit Theorem.
To illustrate and better understand these distribution, we can go through gradually more complex cases.
Frequency domain distribution of random noise
For sake of comparing with the later case of uniformly distributed data we will use a gaussian noise with a matching variance. Since the variance of uniformly distributed data is in the range [-0.5,0.5] is 1/12, this gives us the following data:
data = np.sqrt(1.0/12)*np.random.randn(500)
Now let us check the statistics on the power spectrum. As indicated earlier, the squared magnitude of each frequency coefficient is a random variable with an approximately Gamma distribution. The shape parameter is half the degrees of freedom of a Chi-Squared distribution that could have been used for a unit-variance case (so 1 in this case), and the scale parameter corresponds to the square of the scaling of the time-domain (from linearity the variate yf scales as data, such that np.abs(yf)**2 scales as the square of data).
We can validate this by plotting the histogram of the data against the probability density function:
yf = rfft(data)
spectrum = np.abs(yf)**2/len(data)
plt.figure(figsize=(5,5))
plt.hist(spectrum, bins=100, density=True, label='data')
z = np.linspace(0, np.max(spectrum), 100)
plt.plot(z, gamma.pdf(z, 1, scale=1.0/12), 'k', label='$\Gamma(1,{:.3f})$'.format(1.0/12))
As you can see the values are in pretty good agreement:
Going back to the spectrum plot:
# degrees of freedom
phi = 2
###values of chi-squared
chi_val_95 = chi2.isf(q=0.05/2, df=phi) #/2 for two-sided test
### normalization of power spectrum with 1/n
plt.figure(figsize=(5,5))
plt.plot(xf,np.abs(yf)**2/n, color='k')
# the following two lines should overlap
plt.axhline(y=var*(chi_val_95/phi),color='r',linestyle='--')
plt.axhline(y=gamma.isf(q=0.05/2, a=1, scale=var),color='b')
Just changing the data to use a uniform distribution in the [-0.5,0.5] range (with data = np.random.rand(500) - 0.5) gives an almost identical plot, with the confidence level remaining unchanged.
Frequency domain distribution of signal with noise
To get a single threshold value corresponding to a 95% confidence interval where the noise part would fall if you could separate it from the data containing a sinusoidal component and noise (or otherwise stated as the 95% confidence interval of the null-hypothesis that the data is white noise), you would need the variance of the noise. While trying to estimate this variance you may quickly realize that the sinusoidal contributes a non-negligible portion of the overall data's variance. To remove this contribution we could take advantage of the fact that sinusoidal signals are more readily separated in the frequency-domain.
So we could simply discard the x% largest values of the spectrum, under the assumption that those are mostly contributed by spike of the sinusoidal component in the frequency-domain. Note that 95 percentile choice below for the outliers is somewhat arbitrary:
# remove outliers
threshold = np.percentile(np.abs(yf)**2, 95)
filtered = [x for x in np.abs(yf)**2 if x <= threshold]
Then we can get the time-domain variance using Parseval's theorem:
# estimate variance
# In time-domain variance ~ np.sum(data**2)/len(data))
# In frequency-domain, using Parseval's theorem we get np.sum(data**2)/len(data) = np.mean(np.abs(spectrum)**2)/len(data)
var = np.mean(filtered)/len(data)
Note that due to the dynamic range of values across the spectrum, you may prefer to visualize the results on a logarithmic scale:
plt.figure(figsize=(5,5))
plt.plot(xf,10*np.log10(np.abs(yf)**2/n), color='k')
plt.axhline(y=10*np.log10(gamma.isf(q=0.05/2, a=1, scale=var)),color='r',linestyle='--')
If on the other hand you are trying to obtain a frequency-dependent 95% confidence interval, then you'd need to consider the contribution of the sinusoidal component at each frequency. For sake of simplicity we will assume here that the amplitude of the sinusoidal component and the variance of the noise are known (otherwise we'd first need to estimate these). In this case the distribution gets shifted by the sinusoidal component's contribution:
signal = np.sin(20*np.pi*x)
data = signal + np.random.rand(500) - 0.5
Sf = rfft(signal) # Assuming perfect knowledge of the sinusoidal component
yf = rfft(data)
noiseVar = 1.0/12 # Assuming perfect knowledge of the noise variance
threshold95 = np.abs(Sf)**2/n + gamma.isf(q=0.05/2, a=1, scale=noiseVar)
plt.figure(figsize=(5,5))
plt.plot(xf, 10*np.log10(np.abs(yf)**2/n), color='k')
plt.plot(xf, 10*np.log10(threshold95), color='r',linestyle='--')
Finally, while I kept the final plots in squared-amplitude units, nothing prevents you from taking the square root and view the corresponding thresholds in amplitude units.
Edit : I've used a gamma(1,s) distribution which is an asymptotically good distribution for data with sufficient number of samples n. For really small data sizes the distribution more closely match a gamma(0.5*(n/(n//2+1)),s) (due to the DC and Nyquist coefficients being purely real, thus having 1 degree of freedom unlike all other coefficients).
I would like to calculate a one sided tolerance bound based on the normal distribution given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but to calculate this as a one sided bound (95% of values are above or below x<--bound)?
I assume that you are interested in computing one-side tolerance bound using the normal distribution (based on the fact you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper. Besides paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. You can find on Github repositories that contain such a calculator from which you can find inspiration, for example that one from which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n=1000
# Percentile for the TI to estimate
p=0.9
# confidence level
g = 0.95
# a demo sample
x = np.array([np.random.normal(100) for k in range(n)])
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph
I have a numpy array in range [0,1000] with exponential distribution with lambda_x, I want to transform this numpy array to an array with different exponential distribution lambda_y. How can I find a function that does this mapping?
tried to use the inverse function but it didnt work.
def inverse(X, lambd):
"""Inverse of exponential distribution """
return -np.log(1-X)/lambd
It should be as simple as taking your original exponentials X and scaling them by multiplying by λx/λy to produce Y's.
A well known mechanism is to generate exponentials via inverse transform sampling. The second example on that page shows that if you generate U's which are uniformly distributed between 0 and 1 (where 100*U corresponds to the percentiles of the distribution) and transform them using the formula -ln(1 - U) / λ, you will get exponentials with rate λ. If λ is λx it yields your X distribution, and if λ is λy it yields the Y distribution. Hence rescaling by the ratio of the lambdas will convert from one to the other for a given percentile.
I can easily generate a random number along a gaussian/normal probability distribution.
slice = random.gauss(50.0, 15.0)
But the probability distribution is the inverse of what I want:
But what I want is the inverse probability.
Inverse of Gaussian probability looks like this:
And I would actually like to capture high probability not just for the left side, but also the right so.
So literally whatever the probability is of a result along a normal distribution...I want the actual probability of my results to be the opposite. So if it's 90% probability on a normal distribution, I want that result to appear 10% of the time, etc.
While you work on editing your question, the answer can be found in the docs for scipy - check invgauss. Specifically,
from scipy.stats import invgauss
r = invgauss.rvs(mu, size=1000)
will generate a 1000 numbers drawn from an inverse Gaussian distribution centered around mu (your mean). To draw the pdf:
rv = invgauss(mu)
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')
for some axis object. To allow for more control you have:
invgauss.pdf(x, mu, loc, scale)
where scale in particular has a mathematical relationship to the STD, though I don't remember it offhand. The canonical form usually only depends on the mean.
I have data in a python/numpy/scipy environment that needs to be fit to a probability density function. A way to do this is to create a histogram of the data and then fit a curve to this histogram. The method scipy.optimize.leastsq does this by minimizing the sum of (y - f(x))**2, where (x,y) would in this case be the histogram's bin centers and bin contents.
In statistical terms, this least-square maximizes the likelihood of obtaining that histogram by sampling each bin count from a gaussian centered around the fit function at that bin's position. You can easily see this: each term (y-f(x))**2 is -log(gauss(y|mean=f(x))), and the sum is the logarithm of the multiplying the gaussian likelihood for all the bins together.
That's however not always accurate: for the type of statistical data I'm looking at, each bin count would be the result of a Poissonian process, so I want to minimize (the logarithm of the product over all the bins (x,y) of) poisson(y|mean=f(x)). The Poissonian comes very close to the Gaussian distribution for large values of f(x), but if my histogram doesn't have as good statistics, the difference would be relevant and influencing the fit.
If I understood correctly, you have data and want to see whether or not some probability distribution fits your data.
Well, if that's the case - you need QQ-Plot. If that's the case, then take a look at this StackOverflow question-answer. However, that is about normal distribution function, and you need a code for Poisson distribution function. All you need to do is create some random data according to Poisson random function and test your samples against it. Here you can find an example of QQ-plot for Poisson distribution function. Here's the code from this web-site:
#! /usr/bin/env python
from pylab import *
p = poisson(lam=10, size=4000)
m = mean(p)
s = std(p)
n = normal(loc=m, scale=s, size=p.shape)
a = m-4*s
b = m+4*s
figure()
plot(sort(n), sort(p), 'o', color='0.85')
plot([a,b], [a,b], 'k-')
xlim(a,b)
ylim(a,b)
xlabel('Normal Distribution')
ylabel('Poisson Distribution with $\lambda=10$')
grid(True)
savefig('qq.pdf')
show()