Getting MFCC from a spectrogram time/ frequency series array - python

I have several spectrogra time/frequency [500,1024] files.
I need to calculate the MFCC of these files. There are lot's of the library for calculating MFCC on a raw audio file but I'm looking a method in python for calculating directly from np.array.

This can be done with librosa, as it allows to pass in spectrograms instead of audio waveform using the parameter S.
I am assuming that you have a STFT magnitude spectrogram (linear spectrogram with phase discarded). Then need to convert this into a mel-filtered spectrogram, perform log-scaling, and then do the DCT-2 and truncation to obtain MFCC coefficients. Skeleton code below:
import librosa
import numpy
# TODO: you need to provide these
sr = my_samplerate
my_stft
mels = librosa.feature.melspectrogram(S=my_stft, sr=sr, n_mels=64)
log_mels = librosa.core.amplitude_to_db(mels, ref=numpy.max)
mfcc = librosa.feature.mfcc(S=log_mels, sr=sr, n_mfcc=20)
See the librosa API reference for more details.

Related

Is there a way to add gain to an audio signal with Librosa in python?

I am currently working on augmenting audio in Python. I've been using Librosa due to its speed and simplicity but need to fallback on PyDub for some other utilities such as applying gain.
Is there a mathematical way to add gain to the Numpy array provided with librosa.load? In PyDub it is quite easy but I have to constantly convert back between Pydub's get_array_of_samples() to np.array then to the proper 32 bit float representation on the [-1,1) scale (that Librosa uses by default). I'd rather keep it all in one library for simplicity.
Also a normalization of an audio signal to 0 db gain beforehand would be useful too. I am a bit new to a lot of the terminology used in audio signal processing.
This is what I am currently doing. Down the road I would like to make this a class method which starts with using librosa's numpy array, so if there is a way to mathematically add specified gain in a certain unit to a numpy array from librosa that would be ideal.
Thanks
import librosa
import numpy as np
from pydub import AudioSegment, effects
pydub_audio = AudioSegment.from_file(audio_file_path)
pydub_audio = pydub_audio.set_frame_rate(16000) # make file 16k khz frame rate
print("Original dBFS is {}".format(pydub_audio.dBFS))
pydub_audio = pydub_audio.apply_gain(20) # apply 20db of gain to introduce clipping
#pydub_audio = effects.normalize(pydub_audio)
print("New dBFS is {}".format(pydub_audio.dBFS))
pydub_array = pydub_audio.get_array_of_samples()
pydub_array = np.array(pydub_array)
print("PyDub audio type is {}".format(pydub_array.dtype))
pydub_array_32bitfloat = pydub_array.astype(np.float32, order = 'C') / 32768 # rescaling to between [-1, 1] like librosa
print("Rescaled Pydub type is {}".format(pydub_array_32bitfloat.dtype))
import soundfile as sf
sf.write(r"test_pydub_gain.wav", pydub_array_32bitfloat, samplerate = 16000, format = 'wav')
thinking about it, (if i am not wrong), mathematicaly the gain is:
dBFS = 20 * log (level2 / level1)
so i would multiply all elements of the array by
10**(dBFS/20) to apply the gain

Difference between load of librosa and read of scipy.io.wavfile

I have a question about the difference between the load function of librosa and the read function of scipy.io.wavfile.
from scipy.io import wavfile
import librosa
fs, data = wavfile.read(name)
data, fs = librosa.load(name)
The imported voice file is the same file. If you run the code above, the values ​​of the data come out of the two functions differently. I want to know why the value of the data is different.
From the docstring of librosa.core.load:
Load an audio file as a floating point time series.
Audio will be automatically resampled to the given rate (default sr=22050).
To preserve the native sampling rate of the file, use sr=None.
scipy.io.wavfile.read does not automatically resample the data, and the samples are not converted to floating point if they are integers in the file.
It's worth also mentioning that librosa.load() normalizes the data (so that all the data points are between 1 and -1), whereas wavfile.read() does not.
The data is different because scipy does not normalize the input signal.
Here is a snippet showing how to change scipy output to match librosa's:
nbits = 16
l_wave, rate = librosa.core.load(path, sr=None)
rate, s_wave = scipy.io.wavfile.read(path)
s_wave /= 2 ** (nbits - 1)
all(s_wave == l_wave)
# True
librosa.core.load has support for 24 bit audio files and 96kHz sample rates. Because of this, converting to float and default resampling, it can be considerably slower than scipy.io.wavfile.read in many cases.

How should audio be pre-processed for classification?

I am currently developing an audio classifier with the Python API of TensorFlow, using the UrbanSound8K dataset and trying to distinguish between 10 mutually exclusive classes.
The audio files are 4 seconds long and contain 176400 data points which results in serious memory issues. How should the audio be pre-processed to reduce memory usage?
And how can more useful features be extracted from the audio (using convolution and pooling)?
I personally prefer spectrograms as input for neural nets when it comes to sound classification. This way, raw audio data is transformed into an image representation and you can treat it like a basic image classification task.
There are a number of ways to choose from, here is what I usually do using scipy, python_speech_features and pydub:
import numpy as np
import scipy.io.wavfile as wave
import python_speech_features as psf
from pydub import AudioSegment
#your sound file
filepath = 'my-sound.wav'
def convert(path):
#open file (supports all ffmpeg supported filetypes)
audio = AudioSegment.from_file(path, path.split('.')[-1].lower())
#set to mono
audio = audio.set_channels(1)
#set to 44.1 KHz
audio = audio.set_frame_rate(44100)
#save as wav
audio.export(path, format="wav")
def getSpectrogram(path, winlen=0.025, winstep=0.01, NFFT=512):
#open wav file
(rate,sig) = wave.read(path)
#get frames
winfunc=lambda x:np.ones((x,))
frames = psf.sigproc.framesig(sig, winlen*rate, winstep*rate, winfunc)
#Magnitude Spectrogram
magspec = np.rot90(psf.sigproc.magspec(frames, NFFT))
#noise reduction (mean substract)
magspec -= magspec.mean(axis=0)
#normalize values between 0 and 1
magspec -= magspec.min(axis=0)
magspec /= magspec.max(axis=0)
#show spec dimensions
print magspec.shape
return magspec
#convert file if you need to
convert(filepath)
#get spectrogram
spec = getSpectrogram(filepath)
First, you need to standardize your audio files in terms of sample rate and channels. You can do that (and more) with the excellent pydub package.
After that, you need to transform your audio signal into an image with FFT. You can do that with scipy.io.wavefile and the sigproc modul of python_speech_features. I like the magnitude spectrogram, rotate it 90 degrees, normalize it and use the resulting NumPy array as input for my convnets. You can change the spatial dimensions of the spectrogram by adjusting the values of winstep and NFFT to fit your input size.
There might be easier ways to do all that; I achieved good overall classification results using the code above.

Python Convert an array to Wav

In Python, I have an array of floats representing the voltages of an analog signal.
Can anyone explain how I can change the array into a .wav format? I have seen this
Do I first need to change the data format from [1.23,1.24,1.25,1.26] (for example) to 1.231.241.251.26 before adding the headers so that it's read correctly?
I eventually plan on using FFT on the values to derive the fundamental frequencies is there a better way to store the values in this case?
Thank you
If you know the sampling frequency of your signal and data is already scaled appropriately by max(abs(data)) then you can do it very easily using scipy:
from __future__ import print_function
import scipy.io.wavfile as wavf
import numpy as np
if __name__ == "__main__":
samples = np.random.randn(44100)
fs = 44100
out_f = 'out.wav'
wavf.write(out_f, fs, samples)
You can also use the standard wave module.

Audio spectrum extraction from audio file by python

Sorry if I submit a duplicate, but I wonder if there is any lib in python which makes you able to extract sound spectrum from audio files. I want to be able to take an audio file and write an algoritm which will return a set of data {TimeStampInFile; Frequency-Amplitude}.
I heard that this is usually called Beat Detection, but as far as I see beat detection is not a precise method, it is good only for visualisation, while I want to manipulate on the extracted data and then convert it back to an audio file. I don't need to do this real-time.
I will appreciate any suggestions and recommendations.
You can compute and visualize the spectrum and the spectrogram this using scipy, for this test i used this audio file: vignesh.wav
from scipy.io import wavfile # scipy library to read wav files
import numpy as np
AudioName = "vignesh.wav" # Audio File
fs, Audiodata = wavfile.read(AudioName)
# Plot the audio signal in time
import matplotlib.pyplot as plt
plt.plot(Audiodata)
plt.title('Audio signal in time',size=16)
# spectrum
from scipy.fftpack import fft # fourier transform
n = len(Audiodata)
AudioFreq = fft(Audiodata)
AudioFreq = AudioFreq[0:int(np.ceil((n+1)/2.0))] #Half of the spectrum
MagFreq = np.abs(AudioFreq) # Magnitude
MagFreq = MagFreq / float(n)
# power spectrum
MagFreq = MagFreq**2
if n % 2 > 0: # ffte odd
MagFreq[1:len(MagFreq)] = MagFreq[1:len(MagFreq)] * 2
else:# fft even
MagFreq[1:len(MagFreq) -1] = MagFreq[1:len(MagFreq) - 1] * 2
plt.figure()
freqAxis = np.arange(0,int(np.ceil((n+1)/2.0)), 1.0) * (fs / n);
plt.plot(freqAxis/1000.0, 10*np.log10(MagFreq)) #Power spectrum
plt.xlabel('Frequency (kHz)'); plt.ylabel('Power spectrum (dB)');
#Spectrogram
from scipy import signal
N = 512 #Number of point in the fft
f, t, Sxx = signal.spectrogram(Audiodata, fs,window = signal.blackman(N),nfft=N)
plt.figure()
plt.pcolormesh(t, f,10*np.log10(Sxx)) # dB spectrogram
#plt.pcolormesh(t, f,Sxx) # Lineal spectrogram
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [seg]')
plt.title('Spectrogram with scipy.signal',size=16);
plt.show()
i tested all the code and it works, you need, numpy, matplotlib and scipy.
cheers
I think your question has three separate parts:
How to load audio files into python?
How to calculate spectrum in python?
What to do with the spectrum?
1. How to load audio files in python?
You are probably best off by using scipy, as it provides a lot of signal processing functions. For loading audio files:
import scipy.io.wavfile
samplerate, data = scipy.io.wavfile.read("mywav.wav")
Now you have the sample rate (samples/s) in samplerate and data as a numpy.array in data. You may want to transform the data into floating point, depending on your application.
There is also a standard python module wave for loading wav-files, but numpy/scipy offers a simpler interface and more options for signal processing.
2. How to calculate the spectrum
Brief answer: Use FFT. For more words of wisdom, see:
Analyze audio using Fast Fourier Transform
Longer answer is quite long. Windowing is very important, otherwise you'll have strange spectra.
3. What to do with the spectrum
This is a bit more difficult. Filtering is often performed in time domain for longer signals. Maybe if you tell us what you want to accomplish, you'll receive a good answer for this one. Calculating the frequency spectrum is one thing, getting meaningful results with it in signal processing is a bit more complicated.
(I know you did not ask this one, but I see it coming with a probability >> 0. Of course, it may be that you have good knowledge on audio signal processing, in which case this is irrelevant.)

Categories

Resources