PDF of the sum of Gaussian distributions using FFT - python

I am trying to derive the PDF of the sum of independent random variables. As a first step I would like to do this for a simple case: the sum of Gaussian random variables.
I was surprised to see that I don't get a Gaussian density function when I sum an even number of Gaussian random variables. I actually get:
which looks like two halves of a Gaussian distribution.
On the other hand, when I sum an odd number of Gaussian distributions I get the right distribution:
Below is the code I used to produce the results above:
import numpy as np
from scipy.stats import norm
from scipy.fftpack import fft,ifft
import matplotlib.pyplot as plt
%matplotlib inline
a=10**(-15)
end=norm(0,1).ppf(a)
sample=np.linspace(end,-end,1000)
pdf=norm(0,1).pdf(sample)
plt.subplot(211)
plt.plot(np.real(ifft(fft(pdf)**2)))
plt.subplot(212)
plt.plot(np.real(ifft(fft(pdf)**3)))
Could someone help me understand why I get odd results for even sums of Gaussian distributions?

Even though your code creates a zero-mean Gaussian PDF:
sample=np.linspace(end,-end,1000)
pdf=norm(0,1).pdf(sample)
the FFT does not know about sample, and only sees pdf with samples at 0, 1, 2, 3, ... 999. The FFT expects the origin to be the first sample of the signal. To the FFT function, your PDF is not zero mean, but has a mean of 500.
Thus, what is going on here is that you are summing two variables whose PDFs each have a mean of 500, leading to a sum with a mean of 1000. And because the FFT imposes periodicity on the spatial-domain signal, you are seeing the PDF exit the graph on the right and come back in on the left.
Summing three such variables shifts the mean to 1500, which due to periodicity is the same as 500, meaning it ends up in the same place as the original PDF.
The solution is to shift the origin to the first sample for the FFT, and shift the result back:
from scipy.fftpack import fftshift, ifftshift
pdf2 = fftshift(ifft(fft(ifftshift(pdf))**2))
ifftshift shifts the signal so that the center sample ends up at the first sample, and fftshift shifts it back to where you wanted it for display.
But do note that the way you generate the PDF, the origin is not at a sample, and so the above will not work exactly. Instead, use:
sample=np.linspace(end,-end,1001)
pdf=norm(0,1).pdf(sample)
By picking 1001 samples instead of 1000, zero is exactly at the middle sample.
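For reference, here is a minimal end-to-end sketch that combines both fixes. The multiplication by dx (and dx**2 for the triple sum) is my own addition so that the discrete circular convolution approximates a properly normalized continuous density; the answer above only addresses the centering.
import numpy as np
from scipy.stats import norm
from scipy.fftpack import fft, ifft, fftshift, ifftshift
import matplotlib.pyplot as plt
a = 10**(-15)
end = norm(0, 1).ppf(a)
sample = np.linspace(end, -end, 1001)   # 1001 points puts zero exactly on a sample
dx = sample[1] - sample[0]
pdf = norm(0, 1).pdf(sample)
# ifftshift moves the origin to the first sample; fftshift moves it back.
# Squaring/cubing the spectrum convolves the PDF with itself once/twice.
pdf2 = np.real(fftshift(ifft(fft(ifftshift(pdf))**2))) * dx
pdf3 = np.real(fftshift(ifft(fft(ifftshift(pdf))**3))) * dx**2
plt.plot(sample, pdf2, label='sum of 2')
plt.plot(sample, norm(0, np.sqrt(2)).pdf(sample), '--', label='exact N(0, 2)')
plt.plot(sample, pdf3, label='sum of 3')
plt.plot(sample, norm(0, np.sqrt(3)).pdf(sample), '--', label='exact N(0, 3)')
plt.legend()
plt.show()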

Use R!
library(ggplot2)
f <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  ds <- data.frame(X = x1 + x2, n = factor(n))
  return(ds)
}
ds.list <- lapply(10^(2:5),f)
ds <- Reduce(rbind,ds.list)
ggplot(ds,aes(X,fill = n)) + geom_density(alpha = 0.5) + xlab("")
Here's the distribution plot:

Related

FFT on the csv data gives peak at 0

I have a dataset in a csv file representing a wave, as shown below. I would like to find the frequency of the oscillations, so I computed the FFT. But the FFT output peaks at zero. I am new to Python and FFT, so I am not sure what I am doing wrong.
The data is captured at 300 Hz (300 data points per second). The data set contains 6317 values.
[image1]
Every peak has a wave following it. Here is an example at data points from 250 to 350
[image2]
import matplotlib.pyplot as plt
import csv
import numpy as np
csvfile=open('./abc.csv')
csvreader=csv.reader(csvfile)
readdata=next(csvreader)
csvfile.close()
data=np.array([readdata],dtype='float')
data1=data.reshape(6317,)
sp = np.fft.fft(data1)
sp_mag=np.abs(sp)/data1.size
freq = np.fft.fftfreq(data1.shape[-1])
plt.subplot(2,1,1)
plt.plot(data1)
plt.subplot(2,1,2)
plt.plot(freq,sp_mag)
plt.show()
The csv is available here.
The frequency associated with the first three and the next three peaks is the same. So in the FFT I expect two peaks at different frequencies.
Any help is really appreciated. Kindly let me know if any other data is needed to answer this question.
The value of the FFT at 0 is proportional to the sum of the data. Probably the easiest fix is to subtract off the mean of the data before taking the FFT (assuming you don't care about the constant offset).
Adopting the notation from Wikipedia:
X[m] = sum[ x[n]*exp(-i*2*pi*n*m/N) ]
(X is the FFT, x is the original data)
For m=0, the exponential factors are all ==1, so X[0] == sum[x[n]] (for this convention on where to put the normalization factors).
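As a concrete sketch of that fix (my own illustration: the np.loadtxt call and the d=1/fs argument, which labels the axis in Hz rather than cycles per sample, are assumptions):
import numpy as np
import matplotlib.pyplot as plt
fs = 300.0                                    # sampling rate stated in the question
data1 = np.loadtxt('abc.csv', delimiter=',')  # assumes one row of comma-separated floats
detrended = data1 - data1.mean()              # subtract the mean to remove the DC component
sp_mag = np.abs(np.fft.rfft(detrended)) / detrended.size
freq = np.fft.rfftfreq(detrended.size, d=1.0/fs)
plt.plot(freq, sp_mag)                        # no dominant spike at 0 Hz any more
plt.xlabel('Frequency (Hz)')
plt.show()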

Generate random samples for each sample length for a distribution

My goal is to draw 500 sample points from a distribution, take their mean, and then repeat this 6000 times. Basically:
Take sample lengths ranging from N = 1 to 500. For each sample length, draw 6000 samples and estimate the mean from each of the samples. Calculate the standard deviation from these means for each sample length, and show graphically that the decrease in standard deviation corresponds to a square root reduction.
I am trying to do this on a gamma distribution, but all of my standard deviations are coming out as zero... and I'm not sure why.
This is the program so far:
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import gamma
# now taking random gamma samples
stdevs = []
length = np.arange(1, 401,1)
mean=[]
for i in range(400):
sample = np.random.gamma(shape=i,size=1000)
mean.append(np.mean(sample))
stdevs.append(np.std(mean))
# then trying to plot the standard deviations but it's just a line..
# thought there should be a decrease
plt.plot(length, stdevs,label='sampling')
plt.show()
I thought there should be a decrease in the standard deviation, not an increase. What might I be doing wrong when trying to draw 1000 samples from a gamma distribution and estimate the mean and standard deviation?
I think you are misusing shape. Shape is the shape of the distribution not the number of independent draws.
import numpy as np
import matplotlib.pyplot as plt
# Reproducible
gen = np.random.default_rng(20210513)
# Generate 400 (max sample size) by 1000 (number of indep samples)
sample = gen.gamma(shape=2, size=(400, 1000))
# Use cumsum to compute the cumulative sum
means = np.cumsum(sample, axis=0)
# Divide the cumulative sum by the number of observations used in each mean
# A little care is needed to get broadcasting to work right
means = means / np.arange(1,401)[:,None]
# Compute the std dev using the observations in each row
stdevs = means.std(axis=1)
# Plot
plt.plot(np.arange(1,401), stdevs,label='sampling')
plt.show()
This produces the picture.
The problem is with the line stdevs.append(np.std(sample.mean(axis=0)))
This takes the standard deviation of a single value i.e. the mean of your sample array, so it will always be 0.
You need to pass np.std() all the values in your sample not just its mean.
stdevs.append(np.std(sample)) will give you your array of standard deviations for each sampling.
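For completeness, here is a loop-based sketch of the task as quoted (my own reading, not code from either answer): for each sample length N, draw 6000 independent samples, average each one, and record the standard deviation of those 6000 means. A Gamma(2, 1) distribution has variance 2, so the curve should track sqrt(2/N).
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
lengths = np.arange(1, 501)
stdevs = []
for N in lengths:
    # 6000 independent samples of length N, one mean per sample
    means = rng.gamma(shape=2.0, size=(6000, N)).mean(axis=1)
    stdevs.append(means.std())
plt.loglog(lengths, stdevs, label='simulated')
plt.loglog(lengths, np.sqrt(2.0 / lengths), '--', label='sqrt(2/N)')
plt.legend()
plt.show()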

Weird FFT plot with numpy random set

Code below:
import numpy as np
from numpy import random_intel
import mkl_fft
import matplotlib.pyplot as plt
n = 10**5
a = np.random_intel.rand(n)
b = mkl_fft.fft(a)
plt.scatter(b.real,b.imag)
plt.show()
print(b)
for i in b:
    if i.real > n/2:
        print("Weird FFT Number is ",i)
Result is :
You can see:
Weird FFT Number is (50020.99077289924+0j)
Why does the FFT of a random set produce this one particular number?
(Thanks to Paul Panzer & SleuthEye)
With mkl_fft.fft(a-0.5) the final result is:
[2019/03/29 Updated]
With normalized data everything went well
b = mkl_fft.fft((a - np.mean(a))/np.std(a))
The average value of (a - np.mean(a))/np.std(a) is near zero
That is the constant or zero frequency mode, which is essentially the mean of your signal. You are sampling uniformly from the unit interval, so the mean is ~0.5. Some fft implementations scale this with the number of points to save a multiplication.
The large value in the FFT output happens to be the very first one which corresponds to the DC component. This indicates that the input has a non-zero average value over the entire data set.
Indeed if you look closer at the input data, you might notice that the values are always between 0 and 1, with an average value around 0.5. This is consistent with the rand function implementation which provides pseudo-random samples drawn from a uniform distribution over [0, 1).
You may confirm this to be the case by subtracting the average value with
b = mkl_fft.fft(a - np.mean(a))
and noting that the large initial value b[0] should be near zero.
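A quick numerical check of that explanation, using np.fft in place of mkl_fft (both compute the same transform, so this is only a sketch of the idea):
import numpy as np
n = 10**5
a = np.random.rand(n)                    # uniform on [0, 1), mean ~0.5
b = np.fft.fft(a)
print(b[0].real, a.sum())                # the DC bin equals the plain sum, ~n/2
print(abs(np.fft.fft(a - a.mean())[0]))  # near zero once the mean is removed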

extracting phase information using numpy fft

I am trying to use a fast Fourier transform to extract the phase shift of a single sinusoidal function. I know that on paper, if we denote the transform of our function as T, the location of the peak of |T| gives the frequency and arctan2(Im T, Re T) at that peak gives the phase.
However, I am finding that while I am able to accurately capture the frequency of my cosine wave, the phase is inaccurate unless I sample at an extremely high rate. For example:
import numpy as np
import pylab as pl
num_t = 100000
t = np.linspace(0,1,num_t)
dt = 1.0/num_t
w = 2.0*np.pi*30.0
phase = np.pi/2.0
amp = np.fft.rfft(np.cos(w*t+phase))
freqs = np.fft.rfftfreq(t.shape[-1],dt)
print(np.arctan2(amp.imag, amp.real)[30])
pl.subplot(211)
pl.plot(freqs[:60],np.sqrt(amp.real**2+amp.imag**2)[:60])
pl.subplot(212)
pl.plot(freqs[:60],(np.arctan2(amp.imag,amp.real))[:60])
pl.show()
Using num=100000 points I get a phase of 1.57173880459.
Using num=10000 points I get a phase of 1.58022110476.
Using num=1000 points I get a phase of 1.6650441064.
What's going wrong? Even with 1000 points I have 33 points per cycle, which should be enough to resolve it. Is there maybe a way to increase the number of computed frequency points? Is there any way to do this with a "low" number of points?
EDIT: from further experimentation it seems that I need ~1000 points per cycle in order to accurately extract a phase. Why?!
EDIT 2: further experiments indicate that accuracy is related to number of points per cycle, rather than absolute numbers. Increasing the number of sampled points per cycle makes phase more accurate, but if both signal frequency and number of sampled points are increased by the same factor, the accuracy stays the same.
Your points are not distributed equally over the interval: you have the endpoint doubled, since 0 is the same point as 1 for a periodic signal. This gets less important the more points you take, obviously, but it still introduces some error. You can avoid it entirely; linspace has a flag for this. It also has a flag to return dt directly along with the array.
Do
t, dt = np.linspace(0, 1, num_t, endpoint=False, retstep=True)
instead of
t = np.linspace(0,1,num_t)
dt = 1.0/num_t
then it works :)
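As a quick check (a sketch using the same 30 Hz, pi/2-phase cosine as in the question):
import numpy as np
num_t = 1000
t, dt = np.linspace(0, 1, num_t, endpoint=False, retstep=True)
w = 2.0 * np.pi * 30.0
phase = np.pi / 2.0
amp = np.fft.rfft(np.cos(w * t + phase))
print(np.arctan2(amp.imag, amp.real)[30])   # ~1.5708 even with only 1000 points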
The phase value in the result bin of an unrotated FFT is only correct if the input signal is exactly integer periodic within the FFT length. Your test signal is not, thus the FFT measures something partially related to the phase difference of the signal discontinuity between end-points of the test sinusoid. A higher sample rate will create a slightly different last end-point from the sinusoid, and thus a possibly smaller discontinuity.
If you want to decrease this FFT phase measurement error, create your test signal so that your test phase is referenced to the exact center (sample N/2) of the test vector (not the 1st sample), and then do an fftshift operation (rotate by N/2) so that there will be no signal discontinuity between the 1st and last point in your resulting FFT input vector of length N.
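As a rough illustration of that suggestion (my own sketch, not code from the answer above): the test phase is referenced to the centre sample, ifftshift rotates the vector by N/2 before the FFT, and even with a deliberately non-integer number of cycles the phase at the peak bin comes out close to pi/2.
import numpy as np
N = 1000
n = np.arange(N)
f0 = 30.5                    # non-integer number of cycles over the window
phase = np.pi / 2.0
x = np.cos(2.0 * np.pi * f0 * (n - N // 2) / N + phase)   # phase referenced to sample N/2
X = np.fft.rfft(np.fft.ifftshift(x))                      # rotate by N/2, then transform
k = np.argmax(np.abs(X))
print(k, np.arctan2(X[k].imag, X[k].real))                # phase near pi/2 at the peak bin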
This snippet of code might help:
import numpy as np
from numpy.fft import rfft, irfft

THRESHOLD = 0.01  # keep only modes above this fraction of the peak amplitude

def reconstruct_ifft(data):
    """
    Take in a signal, compute its FFT, retain the dominant modes and
    reconstruct the signal from them.
    Parameters
    ----------
    data : signal to transform and reconstruct
    Returns
    -------
    reconstructed_signal : the reconstructed signal
    """
    N = data.size
    yf = rfft(data)
    amp_yf = np.abs(yf)  # amplitude
    yf = yf * (amp_yf > (THRESHOLD * np.amax(amp_yf)))  # zero out the weak modes
    reconstructed_signal = irfft(yf)
    return reconstructed_signal
THRESHOLD (0.01 above) is the fraction of the peak FFT amplitude below which modes are discarded. Making THRESHOLD larger (more than 1 does not make any sense) retains fewer modes and causes a higher RMS error, but ensures higher frequency selectivity.
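Example usage (my own sketch, assuming the THRESHOLD = 0.01 defined above and a lightly noisy 5 Hz sine):
import numpy as np
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 500, endpoint=False)
noisy = np.sin(2 * np.pi * 5 * t) + 0.02 * rng.standard_normal(t.size)
clean = reconstruct_ifft(noisy)                         # keeps only bins above 1% of the peak
print(np.abs(clean - np.sin(2 * np.pi * 5 * t)).max())  # small residual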

Bandwidth of an EEG signal

I'm trying to perform an FFT of an EEG signal in Python, and then, based on the bandwidth, determine whether it's an alpha or beta signal. It looked fine, but the resulting plots are nothing like they should be; the frequencies and magnitude values are not what I expected. Any help appreciated, here's the code:
from scipy.io import loadmat
import scipy
import numpy as np
from pylab import *
import matplotlib.pyplot as plt
eeg = loadmat("eeg_2013.mat");
eeg1=eeg['eeg1'][0]
eeg2=eeg['eeg2'][0]
fs = eeg['fs'][0][0]
fft1 = scipy.fft(eeg1)
f = np.linspace (fs,len(eeg1), len(eeg1), endpoint=False)
plt.figure(1)
plt.subplot(211)
plt.plot (f, abs (fft1))
plt.title ('Magnitude spectrum of the signal')
plt.xlabel ('Frequency (Hz)')
show()
plt.subplot(212)
fft2 = scipy.fft(eeg2)
f = np.linspace (fs,len(eeg2), len(eeg2), endpoint=False)
plt.plot (f, abs (fft2))
plt.title ('Magnitude spectrum of the signal')
plt.xlabel ('Frequency (Hz)')
show()
And the plots:
In order to get an array of the FFT frequencies, you should use fftfreq; it gives you an array of frequencies to use as the abscissa:
from scipy.fftpack import fftfreq
eeg = loadmat("eeg_2013.mat");
eeg1=eeg['eeg1'][0]
eeg2=eeg['eeg2'][0]
fs = eeg['fs'][0][0]
fft1 = scipy.fft(eeg1)
f=fftfreq(eeg1.size,1/fs)
Sorry, I can't test this code in real conditions because you didn't post a data sample, but I hope this should work.
Concerning how to determine the bandwidth, as far as I understand, you want to get the fundamental frequency. There are different ways, more or less complicated whether your signal is noisy or not, ... In your case, you only want to know if the fundamental frequency f0 is in the range 8-13Hz (alpha) or 13-30Hz (beta); one very simple way is to compute the maximum of the fft in the range 8-13Hz: fft1[(f>8) & (f<13)].max() and if it's more than, say, 1000, it's an alpha wave, otherwise it's beta. If your signals are less similar, please post some examples of different kinds of samples and the result you would have, so that we can try more complicated algorithms.
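A toy, self-contained version of that check (my own sketch: the 256 Hz sampling rate and the synthetic 10 Hz signal are placeholders, and it compares the alpha and beta band maxima instead of using a fixed threshold of 1000):
import numpy as np
from scipy.fftpack import fftfreq
fs = 256.0                               # assumed sampling rate
t = np.arange(0, 4.0, 1.0 / fs)
eeg1 = np.sin(2 * np.pi * 10.0 * t)      # 10 Hz, so it should be classified as alpha
fft1 = np.fft.fft(eeg1)
f = fftfreq(eeg1.size, 1.0 / fs)
mag = np.abs(fft1)
alpha_peak = mag[(f > 8) & (f < 13)].max()
beta_peak = mag[(f > 13) & (f < 30)].max()
print('alpha' if alpha_peak > beta_peak else 'beta')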
If your sampling frequency is fs and you have N=len(eeg1) samples, then the fft procedure will, of course, return an array of N values. The first N/2 of them correspond to the frequency range 0..fs/2, the second half of the frequency corresponds to the mirrored frequency range -fs/2..0. For real input signals the mirrored half is just the complex conjugate of the positive half, so it can be disregarded in further analysis (but not in the inverse fft).
So essentially, you should use
f=linspace(0,N-1,N)*fs/N
Edit: or even more simple with minimal changes to the inital code
f = np.linspace (0,fs,len(eeg1), endpoint=False)
so f ranges from 0 to just before fs and disregard the second half of the fft result in the output:
plt.plot(f[0:N//2], abs(fft1[0:N//2]))
Added: You can use fftshift to exchange both halves, then the correct frequency range is
f = np.linspace (-fs/2,fs/2,len(eeg1), endpoint=False)
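Putting the one-sided version of that together in a minimal sketch (synthetic data; fs and N are placeholders for the values stored in eeg_2013.mat):
import numpy as np
import matplotlib.pyplot as plt
fs = 256.0
N = 1024
t = np.arange(N) / fs
eeg1 = np.sin(2 * np.pi * 10.0 * t)
fft1 = np.fft.fft(eeg1)
f = np.linspace(0, fs, N, endpoint=False)     # 0 .. just below fs
plt.plot(f[:N // 2], np.abs(fft1[:N // 2]))   # keep only the positive-frequency half
plt.xlabel('Frequency (Hz)')
plt.show()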
