I can easily generate a random number from a Gaussian/normal probability distribution.
slice = random.gauss(50.0, 15.0)
But the probability distribution is the inverse of what I want: high probability not just in the left tail but in the right tail as well, and low probability around the mean.
So literally, whatever the probability of a result is under a normal distribution, I want the actual probability of my results to be the opposite. If a result has a 90% probability under a normal distribution, I want it to appear 10% of the time, etc.
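One way to realize that inversion, for illustration only (this is a rejection-sampling sketch, not what the answer below uses; the 3-sigma window is an arbitrary assumption):

import random
from scipy.stats import norm

def inverted_gauss(mu=50.0, sigma=15.0):
    # Accept x with probability 1 - pdf(x)/pdf(mu), so values near the
    # mean are rare and values in the tails are common.
    peak = norm.pdf(mu, mu, sigma)
    while True:
        x = random.uniform(mu - 3*sigma, mu + 3*sigma)
        if random.random() < 1 - norm.pdf(x, mu, sigma)/peak:
            return x

samples = [inverted_gauss() for _ in range(1000)]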
While you work on editing your question, the answer can be found in the docs for scipy - check invgauss. Specifically,
from scipy.stats import invgauss
r = invgauss.rvs(mu, size=1000)
will generate 1000 numbers drawn from an inverse Gaussian distribution centered around mu (your mean). To draw the pdf:
rv = invgauss(mu)
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')
for some axis object. To allow for more control you have:
invgauss.pdf(x, mu, loc, scale)
where scale in particular has a mathematical relationship to the standard deviation, though I don't remember it offhand. The canonical form usually depends only on the mean.
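Putting the snippets together, a minimal self-contained sketch might look like this (mu = 0.5 is an arbitrary illustrative value; the question does not fix one):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import invgauss

mu = 0.5                                  # illustrative value, not from the question
r = invgauss.rvs(mu, size=1000)           # 1000 draws from invgauss(mu)

fig, ax = plt.subplots()
x = np.linspace(invgauss.ppf(0.01, mu), invgauss.ppf(0.99, mu), 200)
rv = invgauss(mu)                         # frozen distribution
ax.hist(r, bins=50, density=True, alpha=0.5, label='samples')
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')
ax.legend()
plt.show()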
After calculating the Fast Fourier Transform (FFT) of a time series in Python/SciPy, I am trying to plot the 95% confidence level at which the power spectrum differs from red or white noise, but haven't found a straightforward way to do so. I tried following this thread: Power spectrum in python - significance levels
and wrote the following code to test for a sine function with random noise:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
from scipy.fft import rfft, rfftfreq
x=np.linspace(0,10,500)
data = np.sin(20*np.pi*x)+np.random.rand(500) - 0.5
yf = rfft(data)
xf = rfftfreq(len(data), 1)
n=len(data)
var=np.var(data)
### degrees of freedom
M=n/2
phi=(2*(n-1)-M/2.)/M
### values of chi-squared
chi_val_99 = chi2.isf(q=0.01/2, df=phi) #/2 for two-sided test
chi_val_95 = chi2.isf(q=0.05/2, df=phi)
### normalization of power spectrum with 1/n
plt.figure(figsize=(5,5))
plt.plot(xf,np.abs(yf)/n, color='k')
plt.axhline(y=(var/n)*(chi_val_95/phi),color='r',linestyle='--')
But the resulting line lies below all of the power spectrum, as in Fig. 1. What am I doing wrong? Is there another way to get the significance of the FFT power spectrum?
Background considerations
I did not read all of the references included in the answer you linked to in their entirety (in particular Pankofsky et al.), but I couldn't find an explicit derivation of the formula, nor exactly under which conditions the results apply. On the other hand, I found a few other references where a derivation could be more readily confirmed.
Based on the answer to this question on dsp.stackexchange.com, if you only had white Gaussian noise with unit variance, the squared amplitude of each Fourier coefficient would have a chi-squared distribution with asymptotically 2 degrees of freedom (the sum of 2 Gaussians, one for each of the real and imaginary parts of the complex Fourier coefficient, when n >> 1). When the noise does not have unit variance, it follows a more general Gamma distribution (although in this case you can simply think of it as scaling the survival function). For noise with a uniform distribution in the [-0.5, 0.5] range and a sufficiently large number of samples, the distribution can also be approximated by a Gamma distribution thanks to the Central Limit Theorem.
To illustrate and better understand these distributions, we can go through gradually more complex cases.
Frequency domain distribution of random noise
For the sake of comparison with the later case of uniformly distributed data, we will use Gaussian noise with a matching variance. Since the variance of data uniformly distributed in the range [-0.5, 0.5] is 1/12, this gives us the following data:
data = np.sqrt(1.0/12)*np.random.randn(500)
Now let us check the statistics of the power spectrum. As indicated earlier, the squared magnitude of each frequency coefficient is a random variable with an approximately Gamma distribution. The shape parameter is half the number of degrees of freedom of the chi-squared distribution that could have been used in the unit-variance case (so 1 in this case), and the scale parameter corresponds to the square of the time-domain scaling (by linearity, the variate yf scales as data, so np.abs(yf)**2 scales as the square of data).
We can validate this by plotting the histogram of the squared-magnitude spectrum against the Gamma probability density function:
from scipy.stats import gamma

yf = rfft(data)
spectrum = np.abs(yf)**2/len(data)
plt.figure(figsize=(5,5))
plt.hist(spectrum, bins=100, density=True, label='data')
z = np.linspace(0, np.max(spectrum), 100)
plt.plot(z, gamma.pdf(z, 1, scale=1.0/12), 'k', label=r'$\Gamma(1,{:.3f})$'.format(1.0/12))
plt.legend()
As you can see, the values are in pretty good agreement.
Going back to the spectrum plot:
# degrees of freedom
phi = 2
### values of chi-squared
chi_val_95 = chi2.isf(q=0.05/2, df=phi) #/2 for two-sided test
### normalization of power spectrum with 1/n
plt.figure(figsize=(5,5))
plt.plot(xf,np.abs(yf)**2/n, color='k')
# the following two lines should overlap
plt.axhline(y=var*(chi_val_95/phi),color='r',linestyle='--')
plt.axhline(y=gamma.isf(q=0.05/2, a=1, scale=var),color='b')
Just changing the data to use a uniform distribution in the [-0.5,0.5] range (with data = np.random.rand(500) - 0.5) gives an almost identical plot, with the confidence level remaining unchanged.
Frequency domain distribution of signal with noise
To get a single threshold value corresponding to a 95% confidence interval where the noise part would fall if you could separate it from data containing both a sinusoidal component and noise (or, stated otherwise, the 95% confidence interval of the null hypothesis that the data is white noise), you would need the variance of the noise. While trying to estimate this variance you may quickly realize that the sinusoidal component contributes a non-negligible portion of the overall data's variance. To remove this contribution we could take advantage of the fact that sinusoidal signals are more readily separated in the frequency domain.
So we could simply discard the x% largest values of the spectrum, under the assumption that those are mostly contributed by the spike of the sinusoidal component in the frequency domain. Note that the 95th-percentile choice below for the outliers is somewhat arbitrary:
# remove outliers
threshold = np.percentile(np.abs(yf)**2, 95)
filtered = [x for x in np.abs(yf)**2 if x <= threshold]
Then we can get the time-domain variance using Parseval's theorem:
# estimate variance
# In the time domain: variance ~ np.sum(data**2)/len(data)
# In the frequency domain, Parseval's theorem gives:
# np.sum(data**2)/len(data) = np.mean(np.abs(yf)**2)/len(data)
var = np.mean(filtered)/len(data)
Note that due to the dynamic range of values across the spectrum, you may prefer to visualize the results on a logarithmic scale:
plt.figure(figsize=(5,5))
plt.plot(xf,10*np.log10(np.abs(yf)**2/n), color='k')
plt.axhline(y=10*np.log10(gamma.isf(q=0.05/2, a=1, scale=var)),color='r',linestyle='--')
If on the other hand you are trying to obtain a frequency-dependent 95% confidence interval, then you'd need to consider the contribution of the sinusoidal component at each frequency. For the sake of simplicity we will assume here that the amplitude of the sinusoidal component and the variance of the noise are known (otherwise we'd first need to estimate them). In this case the distribution gets shifted by the sinusoidal component's contribution:
signal = np.sin(20*np.pi*x)
data = signal + np.random.rand(500) - 0.5
Sf = rfft(signal) # Assuming perfect knowledge of the sinusoidal component
yf = rfft(data)
noiseVar = 1.0/12 # Assuming perfect knowledge of the noise variance
threshold95 = np.abs(Sf)**2/n + gamma.isf(q=0.05/2, a=1, scale=noiseVar)
plt.figure(figsize=(5,5))
plt.plot(xf, 10*np.log10(np.abs(yf)**2/n), color='k')
plt.plot(xf, 10*np.log10(threshold95), color='r',linestyle='--')
Finally, while I kept the final plots in squared-amplitude units, nothing prevents you from taking the square root and viewing the corresponding thresholds in amplitude units.
Edit: I've used a gamma(1,s) distribution, which is asymptotically a good fit for data with a sufficient number of samples n. For really small data sizes the distribution more closely matches a gamma(0.5*(n/(n//2+1)),s) (due to the DC and Nyquist coefficients being purely real, thus having 1 degree of freedom, unlike all the other coefficients).
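As a small sketch of that correction (reusing the known noise variance of this example; the shape parameter tends to 1 as n grows):

from scipy.stats import gamma

n = 500                              # number of time-domain samples
shape = 0.5 * (n / (n // 2 + 1))     # small-n shape parameter; tends to 1 for large n
noiseVar = 1.0/12                    # noise variance, known in this example
threshold95 = gamma.isf(q=0.05/2, a=shape, scale=noiseVar)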
I have this task:
1. Choose your favorite continuous distribution (the less it looks normal, the more interesting; try to choose one of the distributions we have not discussed in the course).
2. Generate a sample of 1000 from it, build a histogram of the sample, and draw the theoretical density of your random variable on top of it (so that the values are on the same scale, don't forget to set the histogram to normed=True). Your task is to estimate the distribution of your sample average for different sample sizes.
3. To do this, generate 1000 samples of size n and build histograms of their sample averages for three or more values of n (for example, 5, 10, 50).
4. Using the information on the mean and variance of the original distribution (easily found on Wikipedia), calculate the parameters of the normal distribution which, according to the central limit theorem, approximates the distribution of the sample averages. Note: to calculate these parameters, use the theoretical mean and variance of your random variable, not their sample estimates.
5. On top of each histogram, draw the density of the corresponding normal distribution (be careful with the parameters of the function: it takes not the dispersion (variance), but the standard deviation).
6. Describe the difference between the distributions obtained at different values of n. How does the accuracy of the approximation of the distribution of sample averages change as n grows?
So if I want to pick the exponential distribution in Python, should I do it like this?
from scipy.stats import expon
import matplotlib.pyplot as plt
exdist=sc.expon(loc=2,scale=3) # loc and scale - shift and scale parameters, default values 0 and 1.
mean, var, skew, kurt = exdist.stats(moments='mvsk') # Let's see the moments of our distribution.
x = np.linspace((0,2,100))
ax.plot(x, exdist.pdf(x)) # Let's draw it
arr=exdist.rvc(size=1000) # generation of thousand of random numbers. (Is it for task 3?)
And I constantly get this error; here is a screenshot from Jupyter:
https://i.stack.imgur.com/zwUtu.png
Could you please explain to me how to write the right code? I can't figure out where to start or where the mistake is. Do I have to use arr.mean() to compute a sample mean and plt.hist(arr, bins=...) to build a histogram? I would be very grateful for an explanation.
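For what it's worth, here is a hedged sketch of how the snippet could look with the apparent typos fixed (expon instead of sc.expon, rvs instead of rvc, and np.linspace called with three separate arguments), plus the sample-means part of the task; and yes, arr.mean() gives a sample mean and plt.hist(arr, bins=..., density=True) builds the normalized histogram:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon, norm

exdist = expon(loc=2, scale=3)             # shift and scale parameters
mean, var = exdist.stats(moments='mv')     # theoretical mean and variance

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

arr = exdist.rvs(size=1000)                # sample of 1000 (task 2)
ax1.hist(arr, bins=30, density=True)       # density=True is the current name for normed=True
x = np.linspace(exdist.ppf(0.001), exdist.ppf(0.999), 100)
ax1.plot(x, exdist.pdf(x))                 # theoretical density on top

n = 5                                      # repeat for n = 10, 50, ... (task 3)
means = exdist.rvs(size=(1000, n)).mean(axis=1)
ax2.hist(means, bins=30, density=True)
z = np.linspace(means.min(), means.max(), 100)
# CLT (tasks 4 and 5): sample mean ~ Normal(mean, sqrt(var/n))
ax2.plot(z, norm.pdf(z, loc=mean, scale=np.sqrt(var / n)))
plt.show()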
I have a set of 1D data in Python. I can get its probability density function using the gaussian_kde function from scipy. I want to know whether the returned distribution matches a theoretical distribution such as the normal distribution. Can I use KL divergence for that? If so, how can I do that in Python?
This is my python code to get the probability density function.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

array = np.array(values)  # `values` holds the 1D data
KDEpdf = gaussian_kde(array)
x = np.linspace(0, 50, 1500)
kdepdf = KDEpdf.evaluate(x)
plt.plot(x, kdepdf, label="kde", color="blue")
plt.legend()
plt.show()
There are a couple of ways to do it:
Plot it against a fitted normal probability distribution: e.g. plot a normalized histogram of the data and overlay norm.pdf(x, mu, std) on top of it.
Compare the kdepdf distribution with a uniform random dataset using something like a Q-Q plot for both datasets.
Use a chi-square test, and be cautious with the bin size you choose. Basically, this tests whether the number of draws that fall into various intervals is consistent with a uniform random distribution.
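On the KL-divergence part of the question: a minimal sketch, using synthetic data as a stand-in for the real values, could evaluate both densities on a common grid and pass them to scipy.stats.entropy, which returns the KL divergence when given two distributions:

import numpy as np
from scipy.stats import gaussian_kde, norm, entropy

rng = np.random.default_rng(0)
values = rng.normal(25, 5, size=1000)   # synthetic stand-in for the real data

kde = gaussian_kde(values)
x = np.linspace(0, 50, 1500)
kdepdf = kde.evaluate(x)

mu, std = values.mean(), values.std()
normal_pdf = np.clip(norm.pdf(x, mu, std), 1e-12, None)  # avoid log(0)

# entropy(pk, qk) normalizes both inputs to sum to 1 and returns KL(pk || qk)
print(entropy(kdepdf, normal_pdf))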
Can anyone please explain what goes on behind the scenes of the norm.pdf function in Python?
I saw a uniform distribution (formed using x = np.arange(-3, 3, 0.001)) being used to plot a normal distribution using plt.plot(x, norm.pdf(x)). So how does norm.pdf convert uniformly distributed values into a normal distribution?
pdf is short for 'probability density function'; it represents the density of a random distribution at a given value, that is, how likely the distribution is to output that value. This is the most commonly plotted chart for most distributions, since peaks (on the y axis) represent commonly output values (on the x axis).
Statistical distributions, like the normal/Gaussian distribution, have a nice, often parametrized, function that performs this mapping. norm.pdf refers to the PDF of a normal distribution, which you can find here.
Hence, plt.plot(x, norm.pdf(x)) plots, for a range of x values, how likely the normally distributed random variable norm is to output the value x, which is why a bell curve gets plotted. Nothing is being converted from one distribution to another: the x values are just the points at which the density is evaluated.
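To make that concrete, a minimal sketch of the plot the question describes (the key point being that x is an evenly spaced evaluation grid, not random samples):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.arange(-3, 3, 0.001)   # evenly spaced grid of evaluation points
plt.plot(x, norm.pdf(x))      # density of the standard normal at each x
plt.show()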
I have data in a python/numpy/scipy environment that needs to be fit to a probability density function. A way to do this is to create a histogram of the data and then fit a curve to this histogram. The method scipy.optimize.leastsq does this by minimizing the sum of (y - f(x))**2, where (x,y) would in this case be the histogram's bin centers and bin contents.
In statistical terms, this least-squares fit maximizes the likelihood of obtaining that histogram by sampling each bin count from a Gaussian centered on the fit function at that bin's position. You can easily see this: each term (y-f(x))**2 is, up to constants, -log(gauss(y|mean=f(x))), and the sum is the negative logarithm of the product of the Gaussian likelihoods over all the bins.
That's however not always accurate: for the type of statistical data I'm looking at, each bin count is the result of a Poissonian process, so I want to maximize the product over all the bins (x,y) of poisson(y|mean=f(x)) (i.e., minimize its negative logarithm). The Poisson distribution comes very close to the Gaussian for large values of f(x), but if my histogram doesn't have such good statistics, the difference is relevant and influences the fit.
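A minimal sketch of that idea, under stated assumptions (a made-up Gaussian-bump model and synthetic data; scipy.optimize.minimize stands in for leastsq since the objective is no longer a sum of squares):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(5.0, 1.0, size=200)          # synthetic data
counts, edges = np.histogram(data, bins=30)
centers = 0.5 * (edges[:-1] + edges[1:])

def model(x, params):
    a, mu, sigma = params                      # made-up Gaussian-bump model
    return a * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def neg_log_likelihood(params):
    f = np.clip(model(centers, params), 1e-12, None)   # avoid log(0)
    # Poisson NLL up to a constant: sum over bins of f - y*log(f)
    return np.sum(f - counts * np.log(f))

res = minimize(neg_log_likelihood,
               x0=[counts.max(), data.mean(), data.std()],
               method='Nelder-Mead')
print(res.x)                                   # fitted amplitude, mean, sigma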
If I understood correctly, you have data and want to see whether or not some probability distribution fits your data.
Well, if that's the case, you need a Q-Q plot. Take a look at this StackOverflow question-answer. However, that one is about the normal distribution, and you need code for the Poisson distribution. All you need to do is create some random data according to a Poisson random function and test your samples against it. Here you can find an example of a Q-Q plot for the Poisson distribution. Here's the code from that website:
#! /usr/bin/env python
from pylab import *
p = poisson(lam=10, size=4000)
m = mean(p)
s = std(p)
n = normal(loc=m, scale=s, size=p.shape)
a = m-4*s
b = m+4*s
figure()
plot(sort(n), sort(p), 'o', color='0.85')
plot([a,b], [a,b], 'k-')
xlim(a,b)
ylim(a,b)
xlabel('Normal Distribution')
ylabel(r'Poisson Distribution with $\lambda=10$')
grid(True)
savefig('qq.pdf')
show()