Currently I am working on a project that requires me to pick out audio clips and compare them based on their FFT results (i.e. spectrogram). All of my audio clips are 0.200 s long, but when I process them through the transform, they are no longer the same length. The code I am using for the transform uses the numpy and librosa libraries:
def extractFFT(audioArr):
    fourierArr = []
    fourierComplex = []
    for x in range(len(audioArr)):
        y, sr = lb.load(audioArr[x])
        fourier = np.fft.fft(y)
        fourier = fourier.real
        fourierArr.append(fourier)
    return fourierArr
I am only taking the real part of the transform because I also wanted to pass this through a PCA, which does not allow complex numbers. Regardless, I can perform neither LDA (linear discriminant analysis) nor PCA on this FFT array of audio clips, since some are of different lengths.
The code I have for the LDA is as follows, where the labels are given for a frequencyArr of length 4:
def LDA(frequencyArr):
    splitMark = int(len(frequencyArr)*0.8)
    trainingData = frequencyArr[:splitMark]
    validationData = frequencyArr[splitMark:]
    labels = [1,1,2,2]
    lda = LinearDiscriminantAnalysis()
    lda.fit(trainingData, labels[:splitMark])
    print(f"prediction: {lda.predict(validationData)}")
This throws the following value error, coming from the lda.fit(trainingData,labels[:splitMark]) line:
ValueError: setting an array element with a sequence.
I know this error stems from the array not having a consistent 2-dimensional shape, since I don't receive this error when the FFT outputs are all of equal length and the code works as intended.
Does this have something to do with the audio clips? After the transform, some audio clips are of equal lengths, others are not. If someone could explain why these same-length audio clips can return different-length FFTs, that would be great!
Note that they normally only differ by a few points; say, for 3 of the audio clips the FFT length is 4410 but for the 4th it is 4409. I know I can probably just trim the lengths down to the smallest length in the group, but I'd prefer a cleaner method that won't leave out any values.
First of all: Do not only take the real part of the transform result. It won't do you any good. Use the power (r^2+i^2) or magnitude (sqrt(power)) to get the strength of the signal for a frequency bin.
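For example, sticking with numpy and the variable names from your code, a quick sketch:
y, sr = lb.load(audioArr[x])
fourier = np.fft.fft(y)
magnitude = np.abs(fourier)     # sqrt(real^2 + imag^2)
power = magnitude ** 2          # real^2 + imag^2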
Does this have something to do with the audio clips? After the transform, some audio clips are of equal lengths, others are not. If someone could explain why these same-length audio clips can return different-length FFTs, that would be great!
They are simply not the same length. I bet the sample counts of your clips aren't exactly identical.
After y, sr = lb.load(audioArr[x]) do print('sample count = {}'.format(len(y))) and you will most likely see different values (you've stated as much yourself).
As you already point out, of course you could simply cut off the signal at min(len(y)) and then feed it into the FFT. But typically, what you do to get around this is to use a discrete STFT, which has a fixed window size. This ensures the same input length for each FFT. You can use librosa's implementation as an easy starting point. The docs also explain how to get magnitude/power.
So instead of:
y, sr = lb.load(audioArr[x])
fourier = np.fft.fft(y)
fourier = fourier.real
fourierArr.append(fourier)
You do:
y, sr = lb.load(audioArr[x])
# get the magnitudes
D = np.abs(librosa.stft(y, n_fft=4096)) # use 4096 as window length
fourierArr.append(D[:, 0]) # only use the first frame of the STFT
In essence, if you feed the Fourier transform different length input, you will get different length output, which is something that LDA does not forgive when using this output as training data. So you have to make sure your input has the same length. The easiest way to do this is to use the STFT (or simply cut all your input down to the minimum length). IMO, there is nothing unclean about this, and it will not affect the results much if you are only missing a couple of samples.
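Putting the pieces together, the whole helper might look something like this sketch (the 4096 window and taking only the first frame are the same assumptions as above, lb is your librosa alias, and np.vstack stacks the fixed-length features into the 2-D array that LDA/PCA expect):
def extractSTFT(audioArr):
    features = []
    for path in audioArr:
        y, sr = lb.load(path)
        D = np.abs(lb.stft(y, n_fft=4096))   # magnitude spectrogram, 2049 bins per frame
        features.append(D[:, 0])             # first frame only, same length for every clip
    return np.vstack(features)               # shape (n_clips, 2049), ready for LDA/PCA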
Related
I want to apply the Fourier transform, using the fft function, to my time series data to find "patterns" by extracting the dominant frequency components in the observed data, i.e. the lowest 5 dominant frequencies, to predict the y value (bacteria count) at the end of each time series.
I would like to preserve the smallest 5 coefficients as features, and eliminate the rest.
My code is as below:
df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
X = df.iloc[0:2,0:10000]
dft_X = np.fft.fft(X)
print(dft_X)
print(len(dft_X))
plt.plot(dft_X)
plt.grid(True)
plt.show()
# What is the graph about (freq/amplitude)? How much data did it use?
for i in dft_X:
    m = i[np.argpartition(i,5)[:5]]
    n = i[np.argpartition(i,range(5))[:5]]
    print(m,'\n',n)
Here is the output:
But I am not sure how to interpret this graph. To be precise,
1) Does the graph show the transformed values of the input data? I only used 2 rows of data (each row is a time series), so the data is 2x10000; why are there so many lines in the graph?
2) To obtain frequency value, should I use np.fft.fftfreq(n, d=timestep)?
Parameters:
    n : int
        Window length.
    d : scalar, optional
        Sample spacing (inverse of the sampling rate). Defaults to 1.
Returns:
    f : ndarray
        Array of length n containing the sample frequencies.
How do I determine n (the window length) and the sample spacing?
3) Why are transformed values all complex numbers?
Thanks
I'm gonna answer in reverse order of your questions
3) Why are transformed values all complex numbers?
The output of a Fourier Transform is always complex numbers. To get around this fact, you can either apply the absolute value on the output of the transform, or only plot the real part using:
plt.plot(dft_X.real)
2) To obtain frequency value, should I use np.fft.fftfreq(n, d=timestep)?
No; np.fft.fftfreq only builds the frequency axis. The frequency content itself is already in the output of the FFT, one coefficient per frequency bin.
1) Does the graph show the transformed values of the input data? I only used 2 rows of data (each row is a time series), so the data is 2x10000; why are there so many lines in the graph?
Your graph has so many lines because it's making a line for each column of your data set. Apply the FFT on each row separately (or possibly just transpose your dataframe) and then you'll get one frequency-domain curve per time series.
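For instance, something along these lines should give one line per series (a sketch, assuming X is the 2x10000 slice from your code):
dft_X = np.fft.fft(X.to_numpy(), axis=1)   # transform each row, i.e. each time series
plt.plot(np.abs(dft_X).T)                  # transpose so each series becomes one plotted line
plt.show()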
Follow up
Would using the absolute value or the real part of the output as features for a later model have a different effect than using the original output?
Absolute values are easier to work with usually.
[Figure: spectrum plotted using the real part]
[Figure: spectrum plotted using the absolute value]
Here's the Octave code that generated this:
Fs = 4000; % Sampling rate of signal
T = 1/Fs; % Period
L = 4000; % Length of signal
t = (0:L-1)*T; % Time axis
freq = 1000; % Frequency of our sinusoid
sig = sin(freq*2*pi*t); % Fill Time-Domain with 1000 Hz sinusoid
f_sig = fft(sig); % Apply FFT
f = Fs*(0:(L/2))/L; % Frequency axis
figure
plot(f,abs(f_sig/L)(1:end/2+1)); % peak at 1kHz
figure
plot(f,real(f_sig/L)(1:end/2+1)); % main peak at 1kHz
In my example, you can see that the absolute value shows essentially no energy at frequencies other than the 1 kHz sinusoid I generated, while the real part has a larger peak at 1 kHz but also much more noise.
As for effects, I don't know what you mean by that.
Is it expected that "frequency values" are always complex numbers?
Always? No. The Fourier series represents the frequency coefficients at which a sum of sines and cosines exactly reproduces any continuous periodic function. Sines and cosines can be written in complex form through Euler's formula, and this is the most convenient way to store Fourier coefficients. The phase of the signal is encoded in the relationship between the real and imaginary parts of each coefficient (i.e. if I have two sine waves of the same frequency, they will have different complex coefficients depending on their time shift). Most libraries that provide an FFT function will, by default, store FFT coefficients as complex numbers, to facilitate magnitude and phase calculations.
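As a toy illustration of that last point, two sine waves of the same frequency but shifted in time give the same magnitude and different phases (a sketch):
t = np.arange(1000) / 1000.0                           # 1 s at 1000 Hz sampling
a = np.fft.fft(np.sin(2 * np.pi * 50 * t))             # 50 Hz sine
b = np.fft.fft(np.sin(2 * np.pi * 50 * t + np.pi/4))   # same sine, shifted by pi/4
print(np.abs(a[50]), np.abs(b[50]))      # magnitudes at the 50 Hz bin are equal
print(np.angle(a[50]), np.angle(b[50]))  # phases differ by pi/4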
Is it a convention that the FFT uses each column of the dataset when plotting a line?
I think it is an issue with matplotlib's plot, not np.fft.
Could you please show me how to apply FFT on each row separately
There are many ways to go about this and I don't want to force you down one path, so I will propose the general solution: iterate over each row of your dataframe and apply the FFT to that specific row. Otherwise, in your case, I believe transposing your output could also work.
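A minimal sketch of the row-by-row version (assuming, as in your code, that X holds one time series per row):
for idx, row in X.iterrows():
    spectrum = np.fft.fft(row.to_numpy())
    plt.plot(np.abs(spectrum), label='series {}'.format(idx))   # one frequency-domain curve per series
plt.legend()
plt.show()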
I have tried to create a low pass filter for aiff files, but the sound coming out is white noise. I only understand the broad overview of how an FFT works, so I'm guessing my problems are related to that.
Basically I open the audio file (of say a piano loop), convert it to mono, then perform an FFT on the samples, then I tried to remove the upper frequencies by setting them to zero. And finally I perform an IFTT and save the results to a new file.
import aifc
import struct
import numpy as np
def getMonoSamples(fileName):
    obj = aifc.open(fileName, 'r')
    obj.setpos(0)
    numFrames = obj.getnframes()
    myFrames = obj.readframes(numFrames)
    samplingRate = obj.getframerate()
    # two interleaved 16-bit samples (left/right) per frame
    data = struct.unpack('{n}h'.format(n=numFrames*2), myFrames)
    data = np.array(data)
    dataLeft = []
    for i, x in enumerate(data):
        if i % 2 == 1:
            dataLeft.append(x)
    obj.close()
    return dataLeft, numFrames, samplingRate

def writeMonoFile(fileName, samples, nframes):
    mono_file = aifc.open(fileName, 'w')
    comptype = "NONE"
    compname = "not compressed"
    nchannels = 1
    sampwidth = 2
    mono_file.setparams((nchannels, sampwidth, int(samplingRate), nframes, comptype, compname))
    print("writing sample aif...")
    for s in samples:
        mono_file.writeframes(struct.pack('h', s))
    mono_file.close()

def lpFilter(dataFft):
    new = [None]*len(dataFft)
    for i, x in enumerate(dataFft):
        # if the bin index is above 5000, zero it out
        if i > 5000:
            new[i] = 0
        else:
            new[i] = x
    return new

# get audio samples from a function that converts stereo to mono
sampleData, numFrames, samplingRate = getMonoSamples('beetP2.aif')

dataFft = np.fft.fft(sampleData)
filtered = lpFilter(dataFft)
invFft = np.fft.ifft(filtered)
invFft = [int(x) for x in invFft]

file = "test.aif"
writeMonoFile(file, invFft, numFrames)
I do get a warning: "ComplexWarning: Casting complex values to real discards the imaginary part" but I also get this warning when simply performing a stereo to mono conversion and saving. The audio seems to sound fine until I try to filter it. I'm guessing this is related, but not sure how to get around it.
Any audio sample I filter winds up sounding like white noise instead of a filtered version of itself.
Switching to real-to-complex numpy.fft.rfft and its inverse numpy.fft.irfft likely resolves the issue.
As the complex-to-complex DFT is applied to the real array sampleData, the output is a complex array dataFft of the same size. The first item of this array corresponds to the DC component, the second item to frequency 1/N, the third to 2/N... Nevertheless, the second half of the array should rather be described as components of negative frequencies. Hence, the frequency of the last item of the array is -1/N, of the item before -2/N... As described in What FFTW Really Computes:
For those who like to think in terms of positive and negative frequencies, this means that the positive frequencies are stored in the first half of the output and the negative frequencies are stored in backwards order in the second half of the output. (The frequency -k/n is the same as the frequency (n-k)/n.)
As the signal is real, the component of frequency -k/N must be the complex conjugate of the component of frequency k/N. For instance, a cosine wave of frequency k/N gives birth to two equal real components of frequencies k/N and -k/N.
By zeroing the second half of the array, components featuring low negative frequencies are discarded and the array does not correspond to the DFT of a real array anymore. It is not a low pass filter and might explain the resulting white noise. As the inverse DFT is applied, invFft = np.fft.ifft(filtered), its outcome invFft is complex, featuring the same size as the original array sampleData.
Using the real-to-complex DFT turns the real array sampleData into a complex array dataFft of about half the size. Zeroing one component of this array means zeroing both the positive and the negative frequency, making sure that the array can still be viewed as the DFT of a real array. That real array can finally be recovered by applying the inverse transform irfft.
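A minimal sketch of that change, reusing the variable names from the question (note that 5000 is still a bin index here, as in lpFilter, not a frequency in Hz):
spectrum = np.fft.rfft(sampleData)                           # length len(sampleData)//2 + 1
spectrum[5001:] = 0                                          # keep bins 0..5000, zero the rest
filteredSignal = np.fft.irfft(spectrum, n=len(sampleData))   # real-valued, original length
filteredSamples = [int(x) for x in filteredSignal]           # no ComplexWarning this time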
I perform a Short Time Fourier Transform as described here.
from scipy.signal import stft
f, t, Zxx = stft(data)
As far as I understand, I get the following objects: (1) a 1D array containing the frequency values, (2) a 1D array containing the time values, and (3) a 2D array containing the intensity of a given frequency at a given moment in time.
My question is about how to control / modify the grid of frequencies. By default I get a grid of 129 frequencies. The first thing I would like to do is increase the number of frequencies (to have a more granular grid).
In addition to that, it would be nice to be able to specify what frequencies range should be used.
As Uvar said, the granularity of the frequency grid is set by the parameter nperseg. Given n samples per segment, one can observe only n/2 + 1 frequencies, namely the frequencies fs*k/n with k = 0, 1, 2, ..., n/2, where fs is the sampling frequency and n is nperseg. Anything above fs/2 is lost due to aliasing. This is a mathematical limitation, nothing SciPy can do about it. To have a sufficiently granular list of frequencies, increase nperseg. The default value nperseg = 256 gives (256/2) + 1 = 129 frequencies.
The discrete Fourier transform gives you all observable frequencies at once, it is not possible to choose a custom range. Of course, you can slice the output f to select the range of frequencies of interest.
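A short sketch, assuming data and fs hold your samples and sampling rate (the 100-2000 Hz band is just an illustration):
f, t, Zxx = stft(data, fs=fs, nperseg=1024)   # 1024/2 + 1 = 513 frequency bins instead of 129
band = (f >= 100.0) & (f <= 2000.0)           # select the range of interest afterwards
f_band, Zxx_band = f[band], Zxx[band, :]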
I'm trying to apply machine learning algorithms on raw audio. My training would be on the Fourier coefficient of the audio signal.
I was trying to get those coefficients and apply ifft to get my audio back, but it doesn't work with my implementation, which is:
import numpy as np
from scipy.io import wavfile
from scipy.fftpack import fft, ifft

fs, data = wavfile.read('dataset piano/wav/music (1).wav')
Te = 0.25
T = 40
a = data.T[0] #retrieve first channel
#put the information in a matrix: one row contains the Fourier coefficients of 0.25 s of music.
#The whole matrix, which has 40 rows, contains the information of 10 s of the wav file.
X = np.array([fft(a[int(i*fs*Te):int((i+1)*fs*Te)]) for i in range(T)])
Z = ifft(X.flatten())
Z = Z.astype(data.dtype)
wavfile.write('test3.wav',fs,Z)
It should reproduce the first 10 s of the wav file, but it doesn't, and I really don't understand why. All I get is a high-pitched sound. I am using the fft and ifft from scipy.
You were very close. Just change
Z = ifft(X.flatten())
to
Z = ifft(X).flatten()
What you are doing is computing an inverse Fourier transform of a concatenation of spectra, which really makes no sense. What you rather want to do is concatenate the inverse Fourier transforms of the individual spectra. This is what I have done, and I managed to reconstitute a signal that sounds right.
ifft(X) runs an IFFT on each row along the last dimension, which is the spectrum dimension in your case, and returns an array of the same shape (40, 11025). flatten then concatenates the rows, producing a sensible time-domain signal.
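A quick shape check of the fix, using the names from the question (a sketch, assuming fs = 44100 so each 0.25 s block has 11025 samples):
blocks = ifft(X)       # shape (40, 11025): one reconstructed 0.25 s block per row
Z = blocks.flatten()   # shape (441000,): the blocks concatenated back into 10 s of audio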
I have a WAV file which I would like to visualize in the frequency domain. Next, I would like to write a simple script that takes in a WAV file and outputs whether the energy at a certain frequency "F" exceeds a threshold "Z" (whether a certain tone has a strong presence in the WAV file). There are a bunch of code snippets online that show how to plot an FFT spectrum in Python, but I don't understand a lot of the steps.
I know that wavfile.read(myfile) returns the sampling rate (fs) and the data array (data), but when I run an FFT on it (y = numpy.fft.fft(data)), what units is y in?
To get the array of frequencies for the x-axis, some posters do this where n = len(data):
X = numpy.linspace(0.0, 1.0/(2.0*T), n/2)
and others do this:
X = (numpy.fft.fftfreq(n) * fs)[range(n/2)]
Is there a difference between these two methods and is there a good online explanation for what these operations do conceptually?
Some of the online tutorials about FFTs mention windowing, but not a lot of posters use windowing in their code snippets. I see that numpy has a numpy.hamming(N), but what should I use as the input to that method and how do I "apply" the output window to my FFT arrays?
For my threshold computation, is it correct to find the frequency in X that's closest to my desired tone/frequency and check if the corresponding element (same index) in Y has an amplitude greater than the threshold?
The FFT bins are spaced in normalized frequency: the first bin is 0 Hz and one past the last bin corresponds to fs Hz. You can create the frequency axis yourself with linspace(0.0, (1.0 - 1.0/n)*fs, n). You can also use fftfreq, but half of the components will come out as negative frequencies.
These are the same if n is even. You can also use rfftfreq, I think. Note that this is only the "positive half" of your frequencies, which is probably what you want for audio (which is real-valued). You can also use rfft to produce just the positive half of the spectrum, and then get the frequencies with rfftfreq(n, 1.0/fs).
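As a sketch (assuming data and fs are what wavfile.read returned):
n = len(data)
Y = numpy.fft.rfft(data)                 # positive half of the spectrum, n//2 + 1 bins
X = numpy.fft.rfftfreq(n, d=1.0/fs)      # matching frequency axis in Hz, from 0 up to fs/2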
Windowing will decrease sidelobe levels, at the cost of widening the mainlobe of any frequencies that are present. N is the length of your signal, and you apply the window by multiplying your signal by it. However, if you are analyzing a long signal you might want to "chop" it up into pieces, window them, and then add the absolute values of their spectra.
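For example (a sketch, reusing data from above):
window = numpy.hamming(len(data))        # window has the same length N as the signal
Y = numpy.fft.rfft(data * window)        # multiply sample by sample, then transform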
"is it correct" is hard to answer. The simple approach is as you said, find the bin closest to your frequency and check its amplitude.