FFT on the csv data gives peak at 0 - python

I have a dataset, a CSV file, representing a wave as shown below. I would like to find the frequency of the oscillations, so I computed an FFT, but the FFT output has its peak at zero. I am new to Python and FFTs, so I am not sure what I am doing wrong.
The data is captured at 300 Hz (300 data points per second). The dataset contains 6317 values.
[image1]
Every peak has a wave following it. Here is an example at data points from 250 to 350
[image2]
import matplotlib.pyplot as plt
import csv
import numpy as np
csvfile=open('./abc.csv')
csvreader=csv.reader(csvfile)
readdata=next(csvreader)
csvfile.close()
data=np.array([readdata],dtype='float')
data1=data.reshape(6317,)
sp = np.fft.fft(data1)
sp_mag=np.abs(sp)/data1.size
freq = np.fft.fftfreq(data1.shape[-1])
plt.subplot(2,1,1)
plt.plot(data1)
plt.subplot(2,1,2)
plt.plot(freq,sp_mag)
plt.show()
The csv is available here .
The frequency associated with the first three and the next three peaks is the same. So in the FFT I expect two peaks at different frequencies.
Any help is really appreciated. Kindly let me know if any other data is needed to answer this question.

The value of the FFT at 0 is proportional to the sum of the data. Probably the easiest fix is to subtract off the mean of the data before taking the FFT (assuming you don't care about the constant offset).
Adopting the notation from wikipedia
X[m] = sum[ x[n]*exp(-i*2*pi*n*m/N) ]
(X is the FFT, x is the original data)
For m=0, the exponential factors are all ==1, so X[0] == sum[x[n]] (for this convention on where to put the normalization factors).
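To make this concrete, here is a minimal sketch using NumPy. The 300 Hz sampling rate and the 6317 samples come from the question; the signal itself (a 12 Hz sine riding on a constant offset of 5) is a made-up stand-in for the CSV data, so the numbers are only illustrative.

```python
import numpy as np

fs = 300     # sampling rate from the question, in Hz
n = 6317     # number of samples from the question
t = np.arange(n) / fs

# Hypothetical stand-in for the CSV data: 12 Hz oscillation plus a constant offset
x = 5.0 + np.sin(2 * np.pi * 12.0 * t)

freqs = np.fft.rfftfreq(n, d=1 / fs)   # frequency axis in Hz (note the d argument)

# Bin 0 of the FFT is the plain sum of the samples
X_raw = np.fft.rfft(x)
bin0_is_sum = np.isclose(X_raw[0].real, x.sum())

# With the offset present, the largest bin is the one at 0 Hz
peak_raw = freqs[np.argmax(np.abs(X_raw))]

# After subtracting the mean, the largest bin is near the true frequency
X_detrended = np.fft.rfft(x - x.mean())
peak_detrended = freqs[np.argmax(np.abs(X_detrended))]

print(bin0_is_sum, peak_raw, peak_detrended)
```

Passing `d=1/fs` to the frequency helper is what puts the axis in Hz; without it the axis is in cycles per sample.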

Related

Is the Fourier transform I am computing in Python displayed in the time domain? If so, how can I display this in the Frequency domain?

My goal is to detect if a certain frequency is present in an audio recording and output a binary response. To do this, I plan on performing a Fourier transform on the audio file, and querying the values contained in the frequency bins. If I find that the bin associated with the frequency I am looking for has a high value, this should mean that it is present (if my thinking is correct). However, I am having trouble generating my transform correctly. My code is below:
from scipy.io import wavfile
from scipy.fft import fft, fftfreq
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
user_in = input("Please enter the relative path to your wav file --> ")
sampling_rate, data = wavfile.read(user_in)
print("sampling rate:", sampling_rate)
duration = len(data) / float(sampling_rate)
print("duration:", duration)
number_samples_in_seg = int(sampling_rate * duration)
fft_of_data = fft(data)
fft_bins_from_data = fftfreq(number_samples_in_seg, 1 / sampling_rate)
print(fft_bins_from_data.size)
plt.plot(fft_bins_from_data, fft_of_data, label="Real part")
plt.show()
Trying this code using a few different wav files leads me to wonder whether I am displaying my transform in the time domain, rather than the frequency domain, which I need:
Input: 200hz.wav
Output:
sampling rate: 48000
duration: 60.000375
2880018
Input: 8000hz.wav
Output:
sampling rate: 48000
duration: 60.000375
2880018
With these files, which should each contain a pure tone, I would expect to see only one spike on my plot, at x = 200 or x = 8000. One final file adds to my concern that I am not viewing the frequency domain:
Input: beep.wav
Output:
sampling rate: 48000
duration: 5.061958333333333
24297
This appears to show the distinct beeping as it progresses over an x-axis of time.
I attempted to clean up the plotting by only plotting the magnitude of the positive values. Unfortunately, I am still not seeing the frequencies isolated on a frequency spectrum:
plt.plot(fft_bins_from_data[0:number_samples_in_seg//2], abs(fft_of_data[0:number_samples_in_seg//2]))
plt.show()
beep output updated
I have referred to these resources before posting:
How to get a list of frequencies in a wav file
Python frequency detection
Fourier Transforms With scipy.fft: Python Signal Processing
Calculate the magnitude and phase of a signal at a particular frequency in python
What is the difference between numpy.fft.fft and numpy.fft.fftfreq
A summary of my questions:
Are my plots displaying the time domain or frequency domain of the signal?
Why is the number of samples equal to the number of bins, and should this be the case for frequency domain?
If these plots are indeed the frequency domain, how do I interpret them and query the values in the bins?
Try this:
import scipy as sp
import scipy.signal as sig
import numpy as np
from numpy import fft
import matplotlib.pyplot as plt
number_samples_in_seg = len(data)
time_axis = np.arange(0, number_samples_in_seg)/sampling_rate
win = sig.windows.hann(number_samples_in_seg)
windowed_data = win*data
plt.plot(time_axis, windowed_data)
That will plot the signal in the time domain if that's not obvious. I applied a Hann window to the signal, which will reduce artifacts if the start and end of the signal don't match up (as the FFT assumes that the snippet of the signal is periodic).
For the plotting of the FFT:
fft_data = fft.fft(windowed_data)[0:number_samples_in_seg // 2]
# Use the same sampling_rate variable as above; keep only the non-negative frequencies
freq_axis = fft.fftfreq(number_samples_in_seg, 1.0 / sampling_rate)[0:number_samples_in_seg // 2]
plt.plot(freq_axis, 20.0*np.log10(np.abs(fft_data)))
The square bracket indexing on fft_data and freq_axis eliminates the negative-frequency portion of the FFT. I generated a 200 Hz sine wave in Audacity with a length of 4096 samples (just so that it fit within a power of two for nice FFT-ing) and there is a peak at 200 Hz in my plot. Also note the 20*log10(abs(fft_data)) expression, which plots the magnitude in dB.
The above should answer your question #3. As for question #2, the FFT always has the same number of time and frequency points. Not sure about question #1, but again, the above code should sort that out.
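For the original goal of a binary "is this frequency present?" answer, one possible approach is sketched below. The function name `tone_present`, the `ratio` threshold, and the synthetic test signal are all assumptions made for illustration. It also assumes mono data; a stereo WAV file would need one channel selected first.

```python
import numpy as np

def tone_present(data, sampling_rate, target_hz, ratio=10.0):
    """Hypothetical helper: True if the spectrum near target_hz stands well
    above the typical (median) bin magnitude."""
    data = np.asarray(data, dtype=float)
    # Remove the constant offset and apply a Hann window to limit leakage
    windowed = (data - data.mean()) * np.hanning(len(data))
    magnitude = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(data), d=1.0 / sampling_rate)
    # Index of the bin closest to the frequency of interest
    k = int(np.argmin(np.abs(freqs - target_hz)))
    # Look at a few neighbouring bins too, since the window spreads the peak
    lo, hi = max(k - 2, 0), k + 3
    return bool(magnitude[lo:hi].max() > ratio * np.median(magnitude))

# Synthetic one-second test signal: 200 Hz tone plus a little noise
sampling_rate = 48000
rng = np.random.default_rng(0)
t = np.arange(sampling_rate) / sampling_rate
data = np.sin(2 * np.pi * 200 * t) + 0.01 * rng.standard_normal(t.size)

has_200 = tone_present(data, sampling_rate, 200)
has_8000 = tone_present(data, sampling_rate, 8000)
print(has_200, has_8000)
```

The threshold is relative to the median bin, so it adapts to the overall level of the recording. A suitable `ratio` depends on how noisy the real audio is and would need tuning.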

Understanding scipy.signal.spectrogram()'s output

I am trying to understand what the outputs of scipy.signal.spectrogram() are, and how to use them. Currently, I read a .wav file and generate a spectrogram.
from scipy.io import wavfile as wav
from scipy import signal
sample_rate, data = wav.read('sound.wav')
f, t, Sxx = signal.spectrogram(data, sample_rate)
--
In case I am understanding this completely wrong, my idea of a spectrogram is a 3D graph consisting of:
x-axis: time
y-axis: frequency
pixel colour/brightness: amplitude
So I'm wondering how f, t and Sxx relate to the time, frequency, and amplitude.
Thanks for reading, any help is appreciated!
f is the frequency array, containing the center frequency of every FFT bin. It can be used as the labels for the frequency axis of a graph.
t is the time array, containing the time at which each FFT segment was taken, relative to the start of the source signal. Again, it can be used for labels.
The Sxx array contains the spectral values (power spectral density by default, not raw amplitudes) and is a 2D array whose shape is the length of f by the length of t.
Therefore the axis whose length matches the time array is the time axis, and the other is the frequency axis.
You will need to find the min and max values of the Sxx array yourself, if you want to normalise for display.
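As a rough illustration of how the three outputs fit together, the sketch below runs scipy.signal.spectrogram() on a synthetic signal instead of sound.wav. The 8000 Hz sample rate and the two test tones are assumptions chosen only to make the result easy to check.

```python
import numpy as np
from scipy import signal

# Synthetic stand-in for the WAV data: 500 Hz for the first second,
# then 1500 Hz for the second one
sample_rate = 8000
time = np.arange(2 * sample_rate) / sample_rate
data = np.where(time < 1.0,
                np.sin(2 * np.pi * 500 * time),
                np.sin(2 * np.pi * 1500 * time))

f, t, Sxx = signal.spectrogram(data, sample_rate)

# Rows of Sxx follow f (frequency), columns follow t (time)
print(f.shape, t.shape, Sxx.shape)

# Strongest frequency in each time slice
dominant = f[np.argmax(Sxx, axis=0)]
print(dominant[0], dominant[-1])

# Values for normalising before display
vmin, vmax = Sxx.min(), Sxx.max()
```

To draw it, something like `plt.pcolormesh(t, f, Sxx)` with matplotlib puts time on the x-axis and frequency on the y-axis.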

PDF of the sum of Gaussian distributions using FFT

I am trying to derive the PDF of the sum of independent random variables. At first I would like to do this for a simple case: the sum of Gaussian random variables.
I was surprised to see that I don't get a Gaussian density function when I sum an even number of Gaussian random variables. I actually get:
which looks like two halves of a Gaussian distribution.
On the other hand, when I sum an odd number of Gaussian random variables I get the right distribution:
Below is the code I used to produce the results above:
import numpy as np
from scipy.stats import norm
from scipy.fftpack import fft,ifft
import matplotlib.pyplot as plt
%matplotlib inline
a=10**(-15)
end=norm(0,1).ppf(a)
sample=np.linspace(end,-end,1000)
pdf=norm(0,1).pdf(sample)
plt.subplot(211)
plt.plot(np.real(ifft(fft(pdf)**2)))
plt.subplot(212)
plt.plot(np.real(ifft(fft(pdf)**3)))
Could someone help me understand why I get odd results for even sums of Gaussian distributions?
Even though your code creates a zero-mean Gaussian PDF:
sample=np.linspace(end,-end,1000)
pdf=norm(0,1).pdf(sample)
the FFT does not know about sample, and only sees pdf with samples at 0, 1, 2, 3, ... 999. The FFT expects the origin to be the first sample of the signal. To the FFT function, your PDF is not zero mean, but has a mean of 500.
Thus, what is going on here is that you are adding two PDFs with a 500 mean, leading to one with a 1000 mean. And because the FFT imposes a periodicity to the spatial domain signal, you are seeing the PDF exiting the graph on the right and coming back in on the left.
Adding 3 PDFs shifts the mean to 1500, which due to periodicity is the same as 500, meaning it ends up in the same place as the original PDF.
The solution is to shift the origin to the first sample for the FFT, and shift the result back:
from scipy.fftpack import fftshift, ifftshift
pdf2 = fftshift(ifft(fft(ifftshift(pdf))**2))
ifftshift shifts the signal so that the center sample ends up at the first sample, and fftshift shifts it back to where you wanted it for display.
But do note that the way you generate the PDF, the origin is not at a sample, and so the above will not work exactly. Instead, use:
sample=np.linspace(end,-end,1001)
pdf=norm(0,1).pdf(sample)
By picking 1001 samples instead of 1000, zero is exactly at the middle sample.
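Putting the pieces together, here is a minimal self-contained sketch. It uses numpy.fft rather than scipy.fftpack and a hand-written Gaussian instead of scipy.stats.norm, and the grid half-width of 8.0 is an assumption standing in for the roughly ±7.94 range of the question. The result is multiplied by the grid spacing so that the discrete convolution approximates the continuous one.

```python
import numpy as np

def gauss_pdf(x, sigma):
    # Zero-mean Gaussian density
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

half_width = 8.0                                    # assumed grid half-width
sample = np.linspace(-half_width, half_width, 1001) # odd count: zero is the middle sample
dx = sample[1] - sample[0]
pdf = gauss_pdf(sample, 1.0)

# Move the origin to index 0, convolve via the FFT, then shift back for display
spectrum = np.fft.fft(np.fft.ifftshift(pdf))
pdf2 = np.fft.fftshift(np.real(np.fft.ifft(spectrum**2))) * dx

# The sum of two independent N(0, 1) variables is N(0, 2)
expected = gauss_pdf(sample, np.sqrt(2.0))
print(np.argmax(pdf2), np.max(np.abs(pdf2 - expected)))
```

The grid has to be wide enough that the summed density is still negligible at the edges. Otherwise the periodicity of the FFT wraps the tails around, as described above.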
Use R!
library(ggplot2)
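f <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  # Sum of two independent standard normals, tagged with the sample size
  ds <- data.frame(X = x1 + x2, n = factor(n))
  return(ds)
}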
ds.list <- lapply(10^(2:5),f)
ds <- Reduce(rbind,ds.list)
ggplot(ds,aes(X,fill = n)) + geom_density(alpha = 0.5) + xlab("")
Here's the distribution plot:

scipy/numpy FFT on data from file

I looked into many examples of scipy.fft and numpy.fft. Specifically this example Scipy/Numpy FFT Frequency Analysis is very similar to what I want to do. Therefore, I used the same subplot positioning and everything looks very similar.
I want to import data from a file, which contains just one column to make my first test as easy as possible.
My code looks like this:
import numpy as np
import scipy.fftpack as syfp
import pylab as pyl
# Read in data from file here
array = np.loadtxt("data.csv")
length = len(array)
# Create time data for x axis based on array length
x = np.linspace(0.00001, length*0.00001, num=length)
# Do FFT analysis of array
FFT = np.fft.fft(array)
# Getting the related frequencies
freqs = syfp.fftfreq(array.size, d=(x[1]-x[0]))
# Create subplot windows and show plot
pyl.subplot(211)
pyl.plot(x, array)
pyl.subplot(212)
# Plot the log of the magnitude of the (complex) FFT
pyl.plot(freqs, np.log10(np.abs(FFT)), 'x')
pyl.show()
The problem is that I will always get my peak at exactly zero, which should not be the case at all. It really should appear at around 200 Hz.
With smaller range:
Still biggest peak at zero.
As already mentioned, it seems like your signal has a DC component, which will cause a peak at f=0. Try removing the mean with, e.g., arr2 = array - np.mean(array).
Furthermore, for analyzing signals, you might want to try plotting the power spectral density:
import matplotlib.pylab as plt
import matplotlib.mlab as mlb
Fs = 1./(x[1] - x[0]) # sampling frequency, from the time axis x in the question
plt.psd(array, Fs=Fs, detrend=mlb.detrend_mean)
plt.show()
Take a look at the documentation of plt.psd(), since there are quite a lot of options to fiddle with. For investigating the change of the spectrum over time, plt.specgram() comes in handy.
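If you want the PSD as numbers rather than as a plot, a similar estimate can be computed with scipy.signal.welch. Note that this is a different function from plt.psd(), swapped in here only for illustration. The sketch below assumes the 0.00001 s sample spacing from the question and uses a made-up signal (a 200 Hz tone on a constant offset) in place of data.csv.

```python
import numpy as np
from scipy import signal

dt = 0.00001                 # sample spacing from the question, in seconds
fs = 1.0 / dt                # sampling frequency in Hz
t = np.arange(50000) * dt

# Hypothetical stand-in for data.csv: 200 Hz tone plus a DC offset
array = 5.0 + np.sin(2 * np.pi * 200 * t)

# Without detrending, the DC offset dominates the estimate
f, pxx_raw = signal.welch(array, fs=fs, nperseg=10000, detrend=False)
peak_raw = f[np.argmax(pxx_raw)]

# Removing the mean of each segment leaves the 200 Hz peak
f, pxx = signal.welch(array, fs=fs, nperseg=10000, detrend='constant')
peak = f[np.argmax(pxx)]

print(peak_raw, peak)
```

The segment length sets the frequency resolution (fs / nperseg, here 10 Hz), so it has to be long enough to separate the peak of interest from 0 Hz.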

Converting excel files to python to frequency

Essentially I've got an Excel file with voltage in the first column and time in the second. I want to find the period of the voltage: plotted with voltage on the y axis and time on the x axis, it is periodic and looks similar to a sine function.
To find the frequency I have loaded my Excel file into Python, as I think this will make it easier, though there may be something I've missed that would simplify this.
So far in python I have:
import xlrd
import numpy as N
import numpy.fft as F
import matplotlib.pyplot as P
wb = xlrd.open_workbook('temp7.xls') #LOADING EXCEL FILE
wb.sheet_names()
sh = wb.sheet_by_index(0)
first_column = sh.col_values(1) #VALUES FROM EXCEL
second_column = sh.col_values(2) #VALUES FROM EXCEL
Now how do I find the frequency from this?
I'm not sure how much you know about the Fourier transform, so forgive me if this is too much background.
Your signal does not have "a frequency", but it can be thought of as the sum of many frequencies. The Fourier transform will tell you the weights of all the frequencies that make up your signal. Unfortunately, information may be lost when sampling from the analog (continuous time) to the digital (discrete time) domain. This puts a constraint on the information we can get about frequency, namely that the maximum frequency component we can determine is related to the digital sampling rate (Nyquist-Shannon criterion):
fs > 2B
Where fs is your sampling rate (samples/unit time, typically in Hz or something like it), and B is the maximum frequency of your signal. If your signal actually has frequencies higher than fs/2, they will be "aliased" to some value lower than fs/2.
For your problem, all you have to do is this:
x = N.array(first_column)
X = F.fft(x)
Now X is the frequency-domain representation of your voltage signal. The corresponding frequency axis covers [0, fs), based on the sampling theorem. So, what is fs? You need to calculate that by looking at the number of samples you have divided by the total duration of your sampled signal (note your units here):
fs = len(second_column) / second_column[-1]
Note that this representation of your signal will also (probably) be complex, i.e. each frequency will have an associated amplitude and phase.
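As a sketch of how one might then read a single dominant frequency (and hence a period) off the spectrum, the example below uses synthetic arrays in place of the two spreadsheet columns. The 50 Hz test signal and the variable names are assumptions, and the approach only makes sense if the time samples are evenly spaced.

```python
import numpy as np

# Hypothetical stand-ins for the two spreadsheet columns
time = np.linspace(0.0, 2.0, 2000, endpoint=False)   # seconds
voltage = 1.5 + 0.8 * np.sin(2 * np.pi * 50 * time)  # volts

dt = time[1] - time[0]   # sample spacing (assumes evenly spaced samples)
fs = 1.0 / dt            # sampling rate

# Subtract the mean so the 0 Hz bin does not dominate, then take magnitudes
spectrum = np.abs(np.fft.rfft(voltage - voltage.mean()))
freqs = np.fft.rfftfreq(len(voltage), d=dt)

dominant = freqs[np.argmax(spectrum)]  # frequency of the strongest component
period = 1.0 / dominant                # corresponding period, in seconds
print(fs, dominant, period)
```

Using rfft keeps only the non-negative frequencies, which is enough for real-valued input such as a voltage trace.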
Hopefully this helps, and hopefully I didn't cover a bunch of stuff you already knew.
