Applying Fourier Transform on Time Series data and avoiding aliasing

Applying Fourier Transform on Time Series data and avoiding aliasing - python

I am willing to apply Fourier transform on a time series data to convert data into frequency domain. I am not sure if the method I've used to apply Fourier Transform is correct or not? Following is the link to data that I've used.
After reading the data file I've plotted original data using
t = np.linspace(0,55*24*60*60, 55)
s = df.values
sns.set_style("darkgrid")
plt.ylabel("Amplitude")
plt.xlabel("Time [s]")
plt.plot(t, s)
plt.show()
Since the data is on a daily frequency I've converted it into seconds using 24*60*60 and for a period of 55 days using 55*24*60*60
The graph looks as follows:
Next I've implemeted Fourier Transform using following piece of code and obtained the image as follows:
#Applying Fourier Transform
fft = fftpack.fft(s)
#Time taken by one complete cycle of wave (seconds)
T = t[1] - t[0]
#Calculating sampling frequency
F = 1/T
N = s.size
#Avoid aliasing by multiplying sampling frequency by 1/2
f = np.linspace(0, 0.5*F, N)
#Convert frequency to mHz
f = f * 1000
#Plotting frequency domain against amplitude
sns.set_style("darkgrid")
plt.ylabel("Amplitude")
plt.xlabel("Frequency [mHz]")
plt.plot(f[:N // 2], np.abs(fft)[:N // 2])
plt.show()
I've following questions:
I am not sure if my above methodology is correct to implement Fourier Transform.
I am not sure if the method I am using to avoid aliasing is correct.
If, what I've done is correct than how to interpret the three peaks in Frequency domain plot.
Finally, how would I invert transform using only frequencies that are significant.

While I'd refrain from answering your first two questions (it looks okay to me but I'd love an expert's input), I can weigh in on the latter two:
If, what I've done is correct than how to interpret the three peaks in Frequency domain plot.
Well, that means you've got three main components to your signal at frequencies roughly 0.00025 mHz (not the best choice of units here, possibly!), 0.00125 mHz and 0.00275 mHz.
Finally, how would I invert transform using only frequencies that are significant.
You could just zero out every frequency below a cutoff you decide (say, absolute value of 3 - that should cover your peaks here). Then you can do:
below_cutoff = np.abs(fft) < 3
fft[below_cutoff] = 0
cleaner_signal = fftpack.ifft(fft)
And that should do it, really!

Related

How to remove a zero frequency artefact from FFT using numpy.fft.fft() when detrending or subtracting the mean does not work

I am trying to calculate the FFT of the following data : data.txt
y_array = np.loadtxt('data.txt',dtype='complex')
plt.plot(np.real(y_array))
y_array_fft = np.fft.fftshift(np.fft.fft(y_array))
x_array = np.linspace(-125,125,len(y_array))
The FFT plot :
I want to remove the artefact at zero frequency which I think is the DC offset. For this, I tried subtracting the mean from the original signal and also use scipy.signal.detrend() for removing a linear trend from data before fft. However, both operations do not seem to have any effect on the FFT.
y_array_detrend = signal.detrend(y_array)
y_array_mean_subtracted = y_array-np.mean(y_array)
y_array_detrend_fft = np.fft.fftshift(np.fft.fft(y_array_detrend))
y_array_mean_subtracted_fft = np.fft.fftshift(np.fft.fft(y_array_mean_subtracted))
FFT after transforming the data using scipy.detrend():
FFT after subtracting the mean from data :
Any help or comments is greatly appreciated!

If you subtract the mean, then what's left isn't a 0 Hz artifact, but some low frequency spectrum (perhaps between 2 to 10 Hz in your plot, depending on your dimensions). Try a high pass filter.
Also, since it's complex data, make sure you subtracted the complex mean.

Area under the peak of a FFT in Python

I'm trying to do some tests before I proceed analyzing some real dataset via FFT, and I've found the following problem.
First, I create a signal as the sum of two cosines and then use rfft to to the transformation (since it has only real values):
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import rfft, rfftfreq
# Number of sample points
N = 800
# Sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N*T, N)
y = 0.5*np.cos(10*2*np.pi*x) + 0.5*np.cos(200*2*np.pi*x)
# FFT
yf = rfft(y)
xf = rfftfreq(N, T)
fig, ax = plt.subplots(1,2,figsize=(15,5))
ax[0].plot(x,y)
ax[1].plot(xf, 2.0/N*np.abs(yf))
As it can be seen from the definition of the signal, I have two oscillations with amplitude 0.5 and frequency 10 and 200. Now, I would expect the FFT spectrum to be something like two deltas at those points, but apparently increasing the frequency broadens the peaks:
From the first peak it can be infered that the amplitude is 0.5, but not for the second. I've tryied to obtain the area under the peak using np.trapz and use that as an estimate for the amplitude, but as it is close to a dirac delta it's very sensitive to the interval I choose. My problem is that I need to get the amplitude as exact as possible for my data analysis.
EDIT: As it seems to be something related with the number of points, I decided to increment (now that I can) the sample frequency. This seems to solve the problem, as it can be seen in the figure:
However, it still seems strange that for a certain number of points and sample frequency, the high frequency peaks broaden...

It is not strange , you have leakage of the frequency bins. When you discretize the signal (sampling) needed for the Fourier transfrom , frequency bins are created which are frequency intervals where the the amplitude is calculated. And each bin has wide which is given by the sample_rate / num_points . So , the less the number of bins the more difficult is to assign precise amplitudes to every frequency. Other problems in choosing the best sampling rate exist such as the shannon-nyquist theorem to prevent aliasing. https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem . But depending on the problem sometimes there some custom rates used for sampling. E.g. when dealing with audio a sampling rate of 44,100 Hz is widely used , cause is based on the limits of the human hearing. So it depends also on nature of the data you want to perform analysis as you wrote. Anyway , since this question has also theoretical value , you can also check https://dsp.stackexchange.com for some useful info.

I would comment to George's answer, but yet I cannot.
Maybe a starting point for your research are the properties of the Discrete Fourier Transform.
The signal in the time domain is actual the cosines multiplied by a box window which transforms into the frequency domain as the convolution of the deltas with the sinc function. The sinc functions will smear the spectrum.
However, I am not sure we are observing spectral leakage here, since the window fits exactly to the full period of cosines. The discretization of the bins might still play a role here.

How to do Histogram Equalization based on audio frequency?

I have already tried histogram equalization based on image and it works just fine.
But now I want to implement the same approach using audio frequency instead of image gray scale. Which means I would like to make the spectrum flatter. The sampling rate I use is 44.1kHz and want to make the frequency evenly spread to range 0-22050Hz, but the peak is still the highest.
Here is the spectrum:
And this is what I have tried:
I think the original histogram I plot is already wrong, I can't count the number of occurrences per frequency, or maybe I shouldn't do this at all. Somebody told me I need to use fft() but I have no idea how to do it.
Any help would be appreciated! Thanks
Here is the code for how I plot the spectrum :
import librosa
import numpy as np
import matplotlib.pyplot as plt
import math
file = 'example.wav'
y, sr = librosa.load(file, sr=None)
n_fft = 2048
S = librosa.stft(y, n_fft=n_fft, hop_length=n_fft//2)
S = abs(S)
D_AVG = np.mean(S, axis=1)
plt.figure(figsize=(25, 12))
plt.bar(np.arange(D_AVG.shape[0]), D_AVG)
x_ticks_positions = [n for n in range(0, n_fft // 2, n_fft // 16)]
x_ticks_labels = [str(sr / 2048 * n) + 'Hz' for n in x_ticks_positions]
plt.xticks(x_ticks_positions, x_ticks_labels)
plt.xlabel('Frequency')
plt.ylabel('dB')
plt.savefig('spectrum.png')

"Equalization" in the sense of making a flat frequency spectrum is usually done by a whitening transformation. This post on dsp.stackexchange might also be helpful. As Mark mentioned, this spectral equalization is different from histogram equalization in image processing.
Equalizing/whitening the spectrum of a signal:
Estimate the PSD. Given an array of samples x with sample rate fs, you can compute a robust estimate of the power spectral density (PSD) with scipy.signal.welch:
f, psd = scipy.signal.welch(x, fs=fs)
This function performs Welch's method. Basically, it divides up the signal into several segments, does FFT on each one, and averages the power spectra to get a good estimate of how much power x has at each frequency on average. The point of all this is it gets a more reliable frequency characterization than just taking one FFT of x as a whole.
Compute equalizer gain. Use eq_gain = 1 / (1e-6 + psd)**0.5, or something similar, to determine the gain of the equalizer. The 1e-6 denominator offset is to avoid division by zero. It often happens that the PSD extremely small for some frequencies because, say, x went through an anti-aliasing filter that made some high frequency powers nearly zero.
Apply the equalizer gain. Finally, eq_gain needs to be applied to the signal x to equalize it. There are many ways this could be done, but one way is to use scipy.signal.firwin2 to turn the gains into an FIR filter,
eq_filter = scipy.signal.firwin2(99, f, eq_gain, fs=fs)
and use a convolution or scipy.signal.lfilter to apply the filter to x. You can then use scipy.signal.welch again to check that the PSD is flatter than before.

How to interpret this fft graph

I want to apply Fourier transformation using fft function to my time series data to find "patterns" by extracting the dominant frequency components in the observed data, ie. the lowest 5 dominant frequencies to predict the y value (bacteria count) at the end of each time series.
I would like to preserve the smallest 5 coefficients as features, and eliminate the rest.
My code is as below:
df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
X = df.iloc[0:2,0:10000]
dft_X = np.fft.fft(X)
print(dft_X)
print(len(dft_X))
plt.plot(dft_X)
plt.grid(True)
plt.show()
# What is the graph about(freq/amplitude)? How much data did it use?
for i in dft_X:
m = i[np.argpartition(i,5)[:5]]
n = i[np.argpartition(i,range(5))[:5]]
print(m,'\n',n)
Here is the output:
But I am not sure how to interpret this graph. To be precise,
1) Does the graph show the transformed values of the input data? I only used 2 rows of data(each row is a time series), thus data is 2x10000, why are there so many lines in the graph?
2) To obtain frequency value, should I use np.fft.fftfreq(n, d=timestep)?
Parameters:
n : int
Window length.
d : scalar, optional
Sample spacing (inverse of the sampling rate). Defaults to 1.
Returns:
f : ndarray
Array of length n containing the sample frequencies.
How to determine n(window length) and sample spacing?
3) Why are transformed values all complex numbers?
Thanks

I'm gonna answer in reverse order of your questions
3) Why are transformed values all complex numbers?
The output of a Fourier Transform is always complex numbers. To get around this fact, you can either apply the absolute value on the output of the transform, or only plot the real part using:
plt.plot(dft_X.real)
2) To obtain frequency value, should I use np.fft.fftfreq(n, d=timestep)?
No, the "frequency values" will be visible on the output of the FFT.
1) Does the graph show the transformed values of the input data? I only used 2 rows of data(each row is a time series), thus data is 2x10000, why are there so many lines in the graph?
Your graph has so many lines because it's making a line for each column of your data set. Apply the FFT on each row separately (or possibly just transpose your dataframe) and then you'll get more actual frequency domain plots.
Follow up
Would using absolute value or real part of the output as features for a later model have different effect than using the original output?
Absolute values are easier to work with usually.
Using real part
Using absolute value
Here's the Octave code that generated this:
Fs = 4000; % Sampling rate of signal
T = 1/Fs; % Period
L = 4000; % Length of signal
t = (0:L-1)*T; % Time axis
freq = 1000; % Frequency of our sinousoid
sig = sin(freq*2*pi*t); % Fill Time-Domain with 1000 Hz sinusoid
f_sig = fft(sig); % Apply FFT
f = Fs*(0:(L/2))/L; % Frequency axis
figure
plot(f,abs(f_sig/L)(1:end/2+1)); % peak at 1kHz)
figure
plot(f,real(f_sig/L)(1:end/2+1)); % main peak at 1kHz)
In my example, you can see the absolute value returned no noise at frequencies other than the sinusoid of frequency 1kHz I generated while the real part had a bigger peak at 1kHz but also had much more noise.
As for effects, I don't know what you mean by that.
is it expected that "frequency values" always be complex numbers
Always? No. The Fourier series represents the frequency coefficients at which the sum of sines and cosines completely equate any continuous periodic function. Sines and cosines can be written in complex forms through Euler's formula. This is the most convenient way to store Fourier coefficients. In truth, the imaginary part of your frequency-domain signal represents the phase of the signal. (i.e if I have 2 sine functions of the same frequency, they can have different complex forms depending on the time shifting). However, most libraries that provide an FFT function will, by default, store FFT coefficients as complex numbers, to facilitate phase and magnitude calculations.
Is it convention that FFT use each column of dataset when plotting a line
I think it is an issue with mathplotlib.plot, not np.fft.
Could you please show me how to apply FFT on each row separately
There are many ways to go around this and I don't want to force you down one path, so I will propose the general solution to iterate over each row of your dataframe and apply the FFT on each specific row. Otherwise, in your case, I believe transposing your output could also work.

How do I get the frequencies from a signal?

I am look for a way to obtain the frequency from a signal. Here's an example:
signal = [numpy.sin(numpy.pi * x / 2) for x in range(1000)]
This Array will represent the sample of a recorded sound (x = miliseconds)
sin(pi*x/2) => 250 Hrz
How can we go from the signal (list of points), to obtaining the frequencies form this array?
Note:
I have read many Stackoverflow threads and watch many youtube videos. I am yet to find an answer. Please use simple words.
(I am Thankfull for every answer)

What you're looking for is known as the Fourier Transform
A bit of background
Let's start with the formal definition:
The Fourier transform (FT) decomposes a function (often a function of time, or a signal) into its constituent frequencies
This is in essence a mathematical operation that when applied over a signal, gives you an idea of how present each frequency is in the time series. In order to get some intuition behind this, it might be helpful to look at the mathematical definition of the DFT:
Where k here is swept all the way up t N-1 to calculate all the DFT coefficients.
The first thing to notice is that, this definition resembles somewhat that of the correlation of two functions, in this case x(n) and the negative exponential function. While this may seem a little bit abstract, by using Euler's formula and by playing a bit around with the definition, the DFT can be expressed as the correlation with both a sine wave and a cosine wave, which will account for the imaginary and the real parts of the DFT.
So keeping in mind that this is in essence computing a correlation, whenever a corresponding sine or cosine from the decomposition of the complex exponential matches with that of x(n), there will be a peak in X(K), meaning that, such frequency is present in the signal.
How can we do the same with numpy?
So having given a very brief theoretical background, let's consider an example to see how this can be implemented in python. Lets consider the following signal:
import numpy as np
import matplotlib.pyplot as plt
Fs = 150.0; # sampling rate
Ts = 1.0/Fs; # sampling interval
t = np.arange(0,1,Ts) # time vector
ff = 50; # frequency of the signal
y = np.sin(2*np.pi*ff*t)
plt.plot(t, y)
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.show()
Now, the DFT can be computed by using np.fft.fft, which as mentioned, will be telling you which is the contribution of each frequency in the signal now in the transformed domain:
n = len(y) # length of the signal
k = np.arange(n)
T = n/Fs
frq = k/T # two sides frequency range
frq = frq[:len(frq)//2] # one side frequency range
Y = np.fft.fft(y)/n # dft and normalization
Y = Y[:n//2]
Now, if we plot the actual spectrum, you will see that we get a peak at the frequency of 50Hz, which in mathematical terms it will be a delta function centred in the fundamental frequency of 50Hz. This can be checked in the following Table of Fourier Transform Pairs table.
So for the above signal, we would get:
plt.plot(frq,abs(Y)) # plotting the spectrum
plt.xlabel('Freq (Hz)')
plt.ylabel('|Y(freq)|')
plt.show()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.