I'm trying to apply machine learning algorithms to raw audio. My training would be on the Fourier coefficients of the audio signal.
I was trying to get those coefficients and apply ifft to get my audio back, but it doesn't work with my implementation, which is:
import numpy as np
from scipy.fftpack import fft, ifft
from scipy.io import wavfile

fs, data = wavfile.read('dataset piano/wav/music (1).wav')
Te = 0.25  # window length in seconds
T = 40     # number of windows
a = data.T[0]  # retrieve first channel

# Put the information in a matrix: one row contains the Fourier coefficients of 0.25 s of music.
# The whole matrix, which has 40 rows, covers 10 s of the wav file.
X = np.array([fft(a[int(i*fs*Te):int((i+1)*fs*Te)]) for i in range(T)])

Z = ifft(X.flatten())
Z = Z.astype(data.dtype)
wavfile.write('test3.wav', fs, Z)
Normally it should play back the first 10 s of the wav file, but it doesn't, and I really don't understand why. All I get is a high-pitched sound. I am using the fft and ifft from scipy.
You were very close. Just change
Z = ifft(X.flatten())
to
Z = ifft(X).flatten()
What you are doing is computing an inverse Fourier transform of a concatenation of spectra, which really makes no sense. What you rather want to do is concatenate the inverse Fourier transforms of the individual spectra. This is what I did, and I managed to reconstruct a signal that sounds right.
ifft(X) will run an IFFT on every array along the last dimension, which is the spectrum dimension in your case, and return an array of the same shape (40, 11025). Then flatten will concatenate every row, making a sensible signal.
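Putting the fix into the pipeline from the question, a minimal sketch (taking .real before the cast is an extra step I'd suggest to silence the ComplexWarning, since the imaginary parts are only numerical noise here):

Z = ifft(X).flatten()            # one IFFT per row of X, then concatenate the 40 reconstructed blocks
Z = Z.real.astype(data.dtype)    # drop the numerically tiny imaginary residue before casting
wavfile.write('test3.wav', fs, Z)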
Related
Currently I am working on a project that requires me to pick out audio clips and compare them based on their FFT results (i.e. spectrograms). All of my audio clips are 0.200 s long, but when I process them through the transform they are no longer the same length. The code I am using for the transform uses the numpy and librosa libraries:
import numpy as np
import librosa as lb

def extractFFT(audioArr):
    fourierArr = []
    for x in range(len(audioArr)):
        y, sr = lb.load(audioArr[x])
        fourier = np.fft.fft(y)
        fourier = fourier.real   # keep only the real part (see below)
        fourierArr.append(fourier)
    return fourierArr
I am only taking the real part of the transform because I also want to pass this through a PCA, which does not allow complex numbers. Regardless, I can perform neither LDA (linear discriminant analysis) nor PCA on this FFT array of audio clips, since some of them are of different lengths.
The code I have for the LDA is as follows, where the labels are given for a frequencyArr of length 4:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def LDA(frequencyArr):
    splitMark = int(len(frequencyArr)*0.8)
    trainingData = frequencyArr[:splitMark]
    validationData = frequencyArr[splitMark:]
    labels = [1, 1, 2, 2]

    lda = LinearDiscriminantAnalysis()
    lda.fit(trainingData, labels[:splitMark])

    print(f"prediction: {lda.predict(validationData)}")
This throws the following value error, coming from the lda.fit(trainingData,labels[:splitMark]) line:
ValueError: setting an array element with a sequence.
I know this error stems from the array not having a consistent 2-dimensional shape, since I don't receive this error when the FFT elements are all of equal length and the code works as intended.
Does this have something to do with the audio clips? After the transform, some audio clips are of equal length, others are not. If someone could explain why these same-length audio clips can return different-length FFTs, that would be great!
Note, they normally only differ by a few points, say for 3 of the audio clips the FFT length is 4410 but for the 4th it is 4409. I know I can probably just trim the lengths down to the smallest length out of the group, but I'd prefer a cleaner method that won't leave out any values.
First of all: Do not only take the real part of the transform result. It won't do you any good. Use the power (r^2+i^2) or magnitude (sqrt(power)) to get the strength of the signal for a frequency bin.
Does this have something to do with the audio clips? After the transform, some audio clips are of equal length, others are not. If someone could explain why these same-length audio clips can return different-length FFTs, that would be great!
They are simply not the same length. I bet the number of samples in your clips isn't exactly identical.
After y, sr = lb.load(audioArr[x]) do print('sample count = {}'.format(len(y))) and you will most likely see different values (you've stated as much yourself).
As you already point out, you could of course simply cut off the signal at min(len(y)) and then feed it into the FFT. But typically, the way to get around this is to use a discrete STFT, which has a fixed window size. This ensures the same input length to the FFT. You can use librosa's implementation as an easy starting point. The docs also explain how to get magnitude/power.
So instead of:
y, sr = lb.load(audioArr[x])
fourier = np.fft.fft(y)
fourier = fourier.real
fourierArr.append(fourier)
You do:
y, sr = lb.load(audioArr[x])
# get the magnitudes
D = np.abs(librosa.stft(y, n_fft=4096)) # use 4096 as window length
fourierArr.append(D[:, 0]) # only use the first frame of the STFT
In essence, if you feed the Fourier transform inputs of different lengths, you will get outputs of different lengths, which is something LDA does not forgive when this output is used as training data. So you have to make sure your input has the same length. The easiest way to do this is to use the STFT (or simply cut all your inputs down to the minimum length). IMO there is nothing unclean about this, and it will not affect the results much if you are missing a couple of samples.
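Putting the pieces together, a minimal sketch of the whole extraction step under the assumptions above (window length 4096, keeping only the first STFT frame of each clip); the helper name extractSTFTFeatures is just for illustration:

import numpy as np
import librosa

def extractSTFTFeatures(audioArr, n_fft=4096):
    features = []
    for path in audioArr:
        y, sr = librosa.load(path)
        D = np.abs(librosa.stft(y, n_fft=n_fft))  # magnitudes, shape (1 + n_fft//2, n_frames)
        features.append(D[:, 0])                  # first frame only: always 1 + n_fft//2 values
    return np.array(features)                     # 2D array, safe to pass to LDA or PCA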
I have tried to create a low pass filter for aiff files, but the sound coming out is white noise. I only understand the broad overview of how an FFT works, so I'm guessing my problems are related to that.
Basically I open the audio file (say, a piano loop), convert it to mono, then perform an FFT on the samples, then try to remove the upper frequencies by setting them to zero. Finally I perform an IFFT and save the result to a new file.
import aifc
import struct
import numpy as np

def getMonoSamples(fileName):
    obj = aifc.open(fileName, 'r')
    obj.setpos(0)
    numFrames = obj.getnframes()
    myFrames = obj.readframes(numFrames)
    samplingRate = obj.getframerate()
    data = struct.unpack('{n}h'.format(n=numFrames*2), myFrames)
    data = np.array(data)
    dataLeft = []
    for i, x in enumerate(data):
        if i % 2 == 1:
            dataLeft.append(x)
    obj.close()
    return dataLeft, numFrames, samplingRate
def writeMonoFile(fileName, samples, nframes):
    mono_file = aifc.open(fileName, 'w')
    comptype = "NONE"
    compname = "not compressed"
    nchannels = 1
    sampwidth = 2
    mono_file.setparams((nchannels, sampwidth, int(samplingRate), nframes, comptype, compname))
    print("writing sample aif...")
    for s in samples:
        mono_file.writeframes(struct.pack('h', s))
    mono_file.close()
def lpFilter(dataFft):
    new = [None]*len(dataFft)
    for i, x in enumerate(dataFft):
        #if the frequency is above 5000, remove it
        if i > 5000:
            new[i] = 0
        else:
            new[i] = x
    return new
# get audio samples from a function that converts stereo to mono
sampleData,numFrames,samplingRate = getMonoSamples('beetP2.aif')
dataFft = np.fft.fft(sampleData)
filtered = lpFilter(dataFft)
invFft = np.fft.ifft(filtered)
invFft = [int(x) for x in invFft]
file = "test.aif"
writeMonoFile(file,invFft,numFrames)
I do get a warning: "ComplexWarning: Casting complex values to real discards the imaginary part" but I also get this warning when simply performing a stereo to mono conversion and saving. The audio seems to sound fine until I try to filter it. I'm guessing this is related, but not sure how to get around it.
Any audio sample I filter winds up sounding like white noise instead of a filtered version of itself.
Switching to real-to-complex numpy.fft.rfft and its inverse numpy.fft.irfft likely resolves the issue.
As the complex-to-complex DFT is applied to the real array sampleData, the output is a complex array dataFft of the same size. The first item of this array corresponds to the DC component, the second item to frequency 1/N, the third to 2/N... Nevertheless, the second half of the array is better described as components of negative frequencies. Hence, the frequency of the last item of the array is -1/N, the item before it -2/N... As described in what FFTW really computes:
For those who like to think in terms of positive and negative frequencies, this means that the positive frequencies are stored in the first half of the output and the negative frequencies are stored in backwards order in the second half of the output. (The frequency -k/n is the same as the frequency (n-k)/n.)
As the signal is real, the component of frequency -k/N must be the complex conjugate of the component of frequency k/N. For instance, a cosine wave of frequency k/N gives birth to two equal real components of frequencies k/N and -k/N.
By zeroing the second half of the array, components featuring low negative frequencies are discarded and the array does not correspond to the DFT of a real array anymore. It is not a low pass filter and might explain the resulting white noise. As the inverse DFT is applied, invFft = np.fft.ifft(filtered), its outcome invFft is complex, featuring the same size as the original array sampleData.
Using a real-to-complex DFT turns the real array sampleData into a complex array dataFft of about half the size. Zeroing one component of this array means zeroing both the positive and the negative frequency, so the array can still be viewed as the DFT of a real array. The real signal can finally be recovered by applying the inverse transform irfft.
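A minimal sketch of the filtering stage rewritten with the real-to-complex transforms, keeping the question's cut-off at bin 5000 (which is a bin index, not a frequency in Hz); the output filename is arbitrary:

import numpy as np

# sampleData, numFrames and samplingRate come from getMonoSamples(), as in the question
dataFft = np.fft.rfft(sampleData)                    # about half the length of sampleData
dataFft[5000:] = 0                                   # zeroing a bin also zeroes its negative-frequency twin
filtered = np.fft.irfft(dataFft, n=len(sampleData))  # real-valued, same length as the input
invFft = [int(x) for x in filtered]                  # no ComplexWarning: the values are already real
writeMonoFile("test_rfft.aif", invFft, numFrames)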
We are trying to build a program that gets the amplitude and frequency lists from a .wav file, in Python.
We tried pyaudio for that, but I don't know much about pyaudio, so I need some suggestions on it.
import scipy
import scipy.fftpack
import numpy as np
from scipy.io import wavfile as wav

file = '123.wav'
fs, data = wav.read(file)

length = len(data.shape)
#if length==2:
#    data = data.sum(axis=1)/2

n = data.shape[0]
sec = n/float(fs)
ts = 1.00/fs
t = scipy.arange(0, sec, ts)

FFT = abs(scipy.fft(data))
FFT_size = FFT[range(n//2)]
freq = scipy.fftpack.fftfreq(data.size, t[1]-t[0])
max_freq = max(freq)
min_freq = min(freq)

plot_freq(freq, n, t, data)  # plot_freq is our own plotting helper (not shown)
The actual result returned is the frequency list. I also want the amplitude list but don't know how to get it.
Typically a call to an FFT API returns an array of complex numbers, where each element holds a complex number of the form (Areal, Aimaginary) and each element represents a frequency bin (the value of the frequency is implied by the array index).
In the complex array, element 0 represents frequency 0, which is your direct-current offset; the frequency of each subsequent bin is calculated using the increment
incr_freq := sample_rate / number_of_samples
So for that to be meaningful you must know the sample rate of your source time series (audio or whatever); the number of samples is just the length of the floating-point raw audio array you fed into your FFT call.
As you iterate across this array of complex numbers, calculate the amplitude from each frequency bin's Areal and Aimaginary using the formula
curr_mag = 2.0 * math.sqrt(curr_real*curr_real + curr_imag*curr_imag) / number_of_samples
As you iterate across the complex array returned from your FFT call, be aware of the Nyquist limit, which means you only consume the first half of the elements of that array (and double the magnitude of each frequency, as in the formula above).
... see the full pseudocode at Get frequency with highest amplitude from FFT
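A minimal sketch of those steps in Python, assuming data and fs come from wav.read as in the question (the stereo mix-down is an assumption for multi-channel files):

import numpy as np
from scipy.io import wavfile as wav

fs, data = wav.read('123.wav')
if data.ndim == 2:                          # stereo: average the channels down to mono
    data = data.mean(axis=1)

n = len(data)
spectrum = np.fft.fft(data)

freqs = np.fft.fftfreq(n, d=1.0/fs)[:n//2]  # frequency of each bin in Hz (positive half only)
amps = 2.0 * np.abs(spectrum[:n//2]) / n    # amplitude of each bin, doubled per the formula above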
... I ran your code and nothing happened ... what is the meaning of this piece of your Python:
[range(n//2)]
You possibly want pitch, not spectral frequency, which is a different algorithm than just using an FFT to find the highest magnitude. An FFT returns the entire spectral frequency range (every frequency up to Fs/2, not just one frequency), in your case for the entire file. And the highest magnitude is often not for the pitch frequency (possibly for some high overtone instead).
You also took the FFT of the entire file, not a bunch of FFTs for time slices (usually small overlapping windows) at the time increment you desire for your list's temporal resolution. This will produce a time array of all the FFT frequency arrays (thus, a 2D array), usually called a spectrogram. There may be a built-in function for this in some library.
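For example, a short sketch of that windowed approach using scipy (the window length of 1024 samples is just an illustrative choice):

from scipy import signal
from scipy.io import wavfile as wav

fs, data = wav.read('123.wav')
# one FFT per overlapping window: freqs is 1D, times is 1D, Sxx is 2D (frequency x time)
freqs, times, Sxx = signal.spectrogram(data, fs, nperseg=1024)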
Can I get the amplitude from this formula?
the frequency of the wave is set by whatever is driving the oscillation in the medium. Examples are a speaker that sets up a sound wave, or the hand that shakes the end of a stretched string.
the speed of the wave is a property of the medium.
the wavelength of the wave is then determined by the frequency and speed:
λ = v/f
I don't know whether that is the right process or not.
I am looking for a way to obtain the frequency from a signal. Here's an example:
import numpy
signal = [numpy.sin(numpy.pi * x / 2) for x in range(1000)]
This array represents the samples of a recorded sound (x = milliseconds).
sin(pi*x/2) => 250 Hz
How can we go from the signal (list of points) to obtaining the frequencies from this array?
Note:
I have read many Stack Overflow threads and watched many YouTube videos. I have yet to find an answer. Please use simple words.
(I am thankful for every answer.)
What you're looking for is known as the Fourier Transform
A bit of background
Let's start with the formal definition:
The Fourier transform (FT) decomposes a function (often a function of time, or a signal) into its constituent frequencies
This is in essence a mathematical operation that, when applied to a signal, gives you an idea of how present each frequency is in the time series. In order to get some intuition behind this, it might be helpful to look at the mathematical definition of the DFT:
X(k) = sum_{n=0..N-1} x(n) * exp(-j*2*pi*k*n/N)
where k is swept all the way up to N-1 to calculate all the DFT coefficients.
The first thing to notice is that this definition somewhat resembles the correlation of two functions, in this case x(n) and the negative exponential function. While this may seem a little abstract, by using Euler's formula and playing around with the definition a bit, the DFT can be expressed as the correlation with both a sine wave and a cosine wave, which account for the imaginary and real parts of the DFT.
So, keeping in mind that this is in essence computing a correlation: whenever a corresponding sine or cosine from the decomposition of the complex exponential matches that of x(n), there will be a peak in X(k), meaning that such a frequency is present in the signal.
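As a tiny illustrative sketch of that correlation view (the test signal and the bin index are arbitrary choices, not part of the question):

import numpy as np

N = 64
n = np.arange(N)
x = np.sin(2*np.pi*3*n/N)                        # test signal: 3 cycles over the window

k = 3
# correlating with the cosine and sine of bin k gives the real and imaginary parts of X(k)
real_part = np.sum(x * np.cos(2*np.pi*k*n/N))
imag_part = -np.sum(x * np.sin(2*np.pi*k*n/N))
print(real_part + 1j*imag_part)                  # matches np.fft.fft(x)[3] up to rounding
print(np.fft.fft(x)[3])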
How can we do the same with numpy?
Having given a very brief theoretical background, let's see how this can be implemented in Python. Consider the following signal:
import numpy as np
import matplotlib.pyplot as plt
Fs = 150.0               # sampling rate
Ts = 1.0/Fs              # sampling interval
t = np.arange(0, 1, Ts)  # time vector
ff = 50                  # frequency of the signal
y = np.sin(2*np.pi*ff*t)
plt.plot(t, y)
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.show()
Now, the DFT can be computed by using np.fft.fft, which, as mentioned, tells you the contribution of each frequency to the signal, now in the transformed domain:
n = len(y) # length of the signal
k = np.arange(n)
T = n/Fs
frq = k/T # two sides frequency range
frq = frq[:len(frq)//2] # one side frequency range
Y = np.fft.fft(y)/n # dft and normalization
Y = Y[:n//2]
Now, if we plot the actual spectrum, you will see that we get a peak at a frequency of 50 Hz, which in mathematical terms is a delta function centred at the fundamental frequency of 50 Hz. This can be checked in this Table of Fourier Transform Pairs.
So for the above signal, we would get:
plt.plot(frq,abs(Y)) # plotting the spectrum
plt.xlabel('Freq (Hz)')
plt.ylabel('|Y(freq)|')
plt.show()
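If, beyond the plot, you want the dominant frequency as a single number (one reasonable reading of "obtaining the frequencies"), you can read off the location of the largest peak:

dominant_freq = frq[np.argmax(abs(Y))]  # bin with the largest magnitude, mapped back to Hz
print(dominant_freq)                    # ~50.0 Hz for the example above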
I am trying to understand what scipy.signal.spectrogram()'s outputs are, and how to use them. Currently, I read a .wav file and generate a spectrogram.
from scipy.io import wavfile as wav
from scipy import signal
sample_rate, data = wav.read('sound.wav')
f, t, Sxx = signal.spectrogram(data, sample_rate)
In case I'm understanding this completely wrong, my idea of a spectrogram is a 3D graph consisting of:
x-axis: time
y-axis: frequency
pixel colour/brightness: amplitude
So I'm wondering how f, t and Sxx relate to the time, frequency, and amplitude.
Thanks for reading, any help is appreciated!
f is the frequency array, containing the frequency of every band of the FFT, which can be used as the labels for a graph.
t is the time array, containing the time at which each FFT was taken, relative to the source signal. Again, it can be used for labels.
The Sxx array contains the amplitudes and is a 2D array whose shape is the length of f by the length of t.
Therefore the axis which matches the length of the time array is the time axis and the other the frequency.
You will need to find the min and max values of the Sxx array yourself, if you want to normalise for display.
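For display, a minimal sketch of how the three outputs fit together (pcolormesh is just one reasonable choice for rendering Sxx):

import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile as wav

sample_rate, data = wav.read('sound.wav')
f, t, Sxx = signal.spectrogram(data, sample_rate)

# t along x, f along y, Sxx (shape: len(f) x len(t)) as the colour/brightness
plt.pcolormesh(t, f, Sxx, shading='gouraud')
plt.xlabel('Time [s]')
plt.ylabel('Frequency [Hz]')
plt.colorbar(label='Amplitude')
plt.show()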