I am currently developing an audio classifier with the Python API of TensorFlow, using the UrbanSound8K dataset and trying to distinguish between 10 mutually exclusive classes.
The audio files are 4 seconds long and contain 176,400 data points (4 s at 44.1 kHz), which leads to serious memory issues. How should the audio be pre-processed to reduce memory usage?
And how can more useful features be extracted from the audio (using convolution and pooling)?
I personally prefer spectrograms as input for neural nets when it comes to sound classification. This way, raw audio data is transformed into an image representation and you can treat it like a basic image classification task.
There are a number of ways to do this; here is what I usually do with scipy, python_speech_features and pydub:
import numpy as np
import scipy.io.wavfile as wave
import python_speech_features as psf
from pydub import AudioSegment

# your sound file
filepath = 'my-sound.wav'

def convert(path):
    # open file (supports all ffmpeg-supported filetypes)
    audio = AudioSegment.from_file(path, path.split('.')[-1].lower())
    # set to mono
    audio = audio.set_channels(1)
    # set to 44.1 kHz
    audio = audio.set_frame_rate(44100)
    # save as wav
    audio.export(path, format="wav")

def getSpectrogram(path, winlen=0.025, winstep=0.01, NFFT=512):
    # open wav file
    (rate, sig) = wave.read(path)
    # get frames
    winfunc = lambda x: np.ones((x,))
    frames = psf.sigproc.framesig(sig, winlen * rate, winstep * rate, winfunc)
    # magnitude spectrogram
    magspec = np.rot90(psf.sigproc.magspec(frames, NFFT))
    # noise reduction (mean subtraction)
    magspec -= magspec.mean(axis=0)
    # normalize values between 0 and 1
    magspec -= magspec.min(axis=0)
    magspec /= magspec.max(axis=0)
    # show spec dimensions
    print(magspec.shape)
    return magspec

# convert file if you need to
convert(filepath)
# get spectrogram
spec = getSpectrogram(filepath)
First, you need to standardize your audio files in terms of sample rate and channels. You can do that (and more) with the excellent pydub package.
After that, you need to transform your audio signal into an image with the FFT. You can do that with scipy.io.wavfile and the sigproc module of python_speech_features. I take the magnitude spectrogram, rotate it 90 degrees, normalize it, and use the resulting NumPy array as input for my convnets. You can change the spatial dimensions of the spectrogram by adjusting the values of winstep and NFFT to fit your input size.
There might be easier ways to do all that; I achieved good overall classification results using the code above.
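Since the question also asks about convolution and pooling: below is a minimal, untested sketch of how such a spectrogram could feed a small conv/pool network in tf.keras. The input shape (257 frequency bins x 398 frames) is an assumption based on the defaults above; adjust it to whatever getSpectrogram() actually returns for your files.

# Hypothetical sketch: treat the (freq, time) spectrogram as a 1-channel image.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(257, 398, 1)),            # spectrogram plus channel axis
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu'),  # convolution extracts local features
    tf.keras.layers.MaxPooling2D((2, 2)),                   # pooling shrinks the feature maps
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')          # 10 UrbanSound8K classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# spec from getSpectrogram(); add batch and channel axes before feeding it in:
# x = spec[np.newaxis, ..., np.newaxis]

Downsampling to a single spectrogram per clip like this also keeps memory usage far below that of the raw 176,400-sample waveform.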
Related
I am trying to write a model for audio classification, using the torchaudio.datasets.SPEECHCOMMANDS dataset. I have trained the model, but when using custom sounds for prediction and evaluation, I am having issues converting them to the same format used by my model.
Following is the torchaudio.info command and its result for one of my dataset audio files:
metadata = torchaudio.info("./data/SpeechCommands/speech_commands_v0.02/yes/00f0204f_nohash_0.wav")
print("Origional :", metadata)
Result:
Original : AudioMetaData(sample_rate=16000, num_frames=16000, num_channels=1, bits_per_sample=16, encoding=PCM_S)
I am using pydub to apply transformations to my data. Following is the code I have used:
from pydub import AudioSegment
sound = AudioSegment.from_wav("/content/yes.wav") # yes.wav is my recorded file
sound = sound.set_channels(1)
sound = sound.set_frame_rate(16000)
sound.export("/content/yes_mono.wav", format="wav")
Even after these transformations, my custom audio file's format is still different from the one in the dataset.
metadata = torchaudio.info("/content/yes_mono.wav")
print("Custom Mono:", metadata)
#Result: Custom Mono: AudioMetaData(sample_rate=16000, num_frames=36864, num_channels=1, bits_per_sample=16, encoding=PCM_S)
Here num_frames is different, and I have been unable to find a way to solve this. Please suggest a solution or point out the issue in my understanding.
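For context on the mismatch: the dataset clips are 1 second at 16 kHz (16000 samples), while the custom recording is roughly 2.3 seconds (36864 samples), which is exactly what num_frames reports. Below is a hedged sketch of trimming or zero-padding the mono file to that fixed length; the 16000-sample target is an assumption about what the trained model expects.

# Hypothetical sketch (not from the original post): force the custom recording
# to the 16000-sample length of the SPEECHCOMMANDS clips.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("/content/yes_mono.wav")   # shape: (1, num_frames)
target_len = 16000                                                  # 1 s at 16 kHz

if waveform.shape[1] > target_len:
    waveform = waveform[:, :target_len]                             # trim to 1 second
elif waveform.shape[1] < target_len:
    pad = target_len - waveform.shape[1]
    waveform = torch.nn.functional.pad(waveform, (0, pad))          # zero-pad at the end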
I am segmenting drum audio files at each transient and exporting the audio to individual wav files. The problem is that all of my files have a DC offset that I cannot seem to get rid of, which causes popping sounds at the end of the file. I am able to use Audacity's built-in high-pass filter to verify that applying a filter would fix my problem, but I have not yet been able to replicate those results with code.
My preference is to use torchaudio's highpass_biquad() method, but I am open to using scipy filters too. The main goal is to remove the offset so that the audio files do not have a popping sound at the end.
How do I implement a high-pass filter that corrects the DC offset the way Audacity's high-pass filter does, as shown in the pictures?
torchaudio approach
from torchaudio.functional import highpass_biquad
import torch
import librosa

wav, sample_rate = librosa.load(path, sr=None, mono=False)  # files are 24 bit 44.1k
wav_tensor = torch.from_numpy(wav)
cutoff_freq = 15.0
wav_filtered = highpass_biquad(wav_tensor, sample_rate, cutoff_freq)
scipy approach
from scipy import signal
import librosa
wav, sample_rate = librosa.load(path, sr=None, mono=False) # files are 24 bit 44.1k
cutoff_freq = 15
# METHOD 1: Butterworth high-pass with zero-phase filtering via filtfilt
b, a = signal.butter(N=5, Wn=cutoff_freq, btype='high', fs=sample_rate)
wav_filtered = signal.filtfilt(b, a, wav)
# METHOD 2: same filter as second-order sections (numerically more stable)
sos = signal.butter(N=5, Wn=cutoff_freq, btype='hp', fs=sample_rate, output='sos')
wav_filtered = signal.sosfilt(sos, wav)
Picture 1 is the output of the torch highpass_biquad method. The scipy approach yields similar results.
Picture 2 is the audio after applying the high-pass effect in Audacity. This is the desired output of my code.
Picture 3 is an example of the output with no high-pass filtering applied. Most files come out centered below 0 dB.
It turns out the DC offset was produced by a scaling function in wavio when writing the file. The high-pass filters were working correctly after all.
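For reference, a minimal sketch of a write path that sidesteps implicit scaling, using the soundfile package instead of wavio (an assumption, as is the output filename; a torch tensor would need .numpy() first):

import soundfile as sf

# soundfile writes float arrays in [-1, 1] as-is, so no extra scaling step is applied;
# librosa.load(mono=False) returns (channels, samples), hence the transpose.
sf.write("segment_filtered.wav", wav_filtered.T, sample_rate, subtype='PCM_24')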
I am currently working on augmenting audio in Python. I've been using librosa due to its speed and simplicity but need to fall back on PyDub for some other utilities, such as applying gain.
Is there a mathematical way to add gain to the NumPy array provided by librosa.load? In PyDub it is quite easy, but I have to constantly convert from PyDub's get_array_of_samples() to an np.array and then to the proper 32-bit float representation on the [-1, 1) scale that librosa uses by default. I'd rather keep it all in one library for simplicity.
Normalizing the audio signal to 0 dB beforehand would be useful too. I am a bit new to a lot of the terminology used in audio signal processing.
This is what I am currently doing. Down the road I would like to make this a class method that starts from librosa's NumPy array, so if there is a way to mathematically add a specified gain (in a given unit) to a NumPy array from librosa, that would be ideal.
Thanks
import librosa
import numpy as np
from pydub import AudioSegment, effects
pydub_audio = AudioSegment.from_file(audio_file_path)
pydub_audio = pydub_audio.set_frame_rate(16000) # resample to a 16 kHz frame rate
print("Original dBFS is {}".format(pydub_audio.dBFS))
pydub_audio = pydub_audio.apply_gain(20) # apply 20db of gain to introduce clipping
#pydub_audio = effects.normalize(pydub_audio)
print("New dBFS is {}".format(pydub_audio.dBFS))
pydub_array = pydub_audio.get_array_of_samples()
pydub_array = np.array(pydub_array)
print("PyDub audio type is {}".format(pydub_array.dtype))
pydub_array_32bitfloat = pydub_array.astype(np.float32, order = 'C') / 32768 # rescaling to between [-1, 1] like librosa
print("Rescaled Pydub type is {}".format(pydub_array_32bitfloat.dtype))
import soundfile as sf
sf.write(r"test_pydub_gain.wav", pydub_array_32bitfloat, samplerate = 16000, format = 'wav')
Thinking about it (if I am not wrong), mathematically the gain is:
dBFS = 20 * log10(level2 / level1)
so I would multiply all elements of the array by
10**(dBFS / 20)
to apply the gain.
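A minimal sketch of what that looks like on the float array from librosa.load (the 20 dB value mirrors the example above; the peak-normalization step is optional):

import numpy as np
import librosa

y, sr = librosa.load(audio_file_path, sr=16000)   # float32 in [-1, 1]

# optional: peak-normalize so the largest sample magnitude sits at 0 dBFS
y = y / np.max(np.abs(y))

gain_db = 20.0
y_gained = y * 10 ** (gain_db / 20.0)             # apply +20 dB of gain

# clip to the valid range to mimic integer clipping behaviour
y_gained = np.clip(y_gained, -1.0, 1.0)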
I have several spectrogram time/frequency [500, 1024] files.
I need to calculate the MFCCs of these files. There are lots of libraries for calculating MFCCs on a raw audio file, but I'm looking for a method in Python to calculate them directly from an np.array.
This can be done with librosa, as it allows you to pass in spectrograms instead of an audio waveform using the parameter S.
I am assuming that you have an STFT magnitude spectrogram (a linear spectrogram with the phase discarded). You then need to convert this into a mel-filtered spectrogram, perform log-scaling, and finally apply the DCT-II and truncation to obtain the MFCC coefficients. Skeleton code below:
import librosa
import numpy
# TODO: you need to provide these
sr = my_samplerate   # sample rate of the original audio
my_stft              # your STFT magnitude spectrogram, shape (frequency_bins, time_frames)
mels = librosa.feature.melspectrogram(S=my_stft, sr=sr, n_mels=64)
log_mels = librosa.core.amplitude_to_db(mels, ref=numpy.max)
mfcc = librosa.feature.mfcc(S=log_mels, sr=sr, n_mfcc=20)
See the librosa API reference for more details.
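One usage note, assuming your [500, 1024] arrays are stored time-first: librosa expects spectrograms shaped (frequency_bins, time_frames), so the array needs to be transposed before filling in the placeholders above. A hedged sketch with a placeholder filename and an assumed sample rate of 22050 Hz:

import numpy as np
import librosa

# placeholder filename and assumed sample rate -- substitute your real values
arr = np.load("my_spectrogram.npy")          # shape (500, 1024): time x frequency
my_stft = arr.T                              # librosa wants (frequency, time)
sr = 22050

mels = librosa.feature.melspectrogram(S=my_stft, sr=sr, n_mels=64)
log_mels = librosa.amplitude_to_db(mels, ref=np.max)
mfcc = librosa.feature.mfcc(S=log_mels, sr=sr, n_mfcc=20)
print(mfcc.shape)                            # (20, 500): 20 coefficients per time frame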
Sorry if this is a duplicate, but I wonder whether there is any library in Python that lets you extract the sound spectrum from audio files. I want to be able to take an audio file and write an algorithm that will return a set of data {TimeStampInFile; Frequency-Amplitude}.
I heard that this is usually called beat detection, but as far as I can tell beat detection is not a precise method; it is good only for visualisation, while I want to manipulate the extracted data and then convert it back to an audio file. I don't need to do this in real time.
I will appreciate any suggestions and recommendations.
You can compute and visualize the spectrum and the spectrogram using scipy; for this test I used this audio file: vignesh.wav
from scipy.io import wavfile # scipy library to read wav files
import numpy as np
AudioName = "vignesh.wav" # Audio File
fs, Audiodata = wavfile.read(AudioName)
# Plot the audio signal in time
import matplotlib.pyplot as plt
plt.plot(Audiodata)
plt.title('Audio signal in time',size=16)
# spectrum
from scipy.fftpack import fft # fourier transform
n = len(Audiodata)
AudioFreq = fft(Audiodata)
AudioFreq = AudioFreq[0:int(np.ceil((n+1)/2.0))] #Half of the spectrum
MagFreq = np.abs(AudioFreq) # Magnitude
MagFreq = MagFreq / float(n)
# power spectrum
MagFreq = MagFreq**2
if n % 2 > 0:  # fft odd
    MagFreq[1:len(MagFreq)] = MagFreq[1:len(MagFreq)] * 2
else:  # fft even
    MagFreq[1:len(MagFreq) - 1] = MagFreq[1:len(MagFreq) - 1] * 2
plt.figure()
freqAxis = np.arange(0,int(np.ceil((n+1)/2.0)), 1.0) * (fs / n);
plt.plot(freqAxis/1000.0, 10*np.log10(MagFreq)) #Power spectrum
plt.xlabel('Frequency (kHz)'); plt.ylabel('Power spectrum (dB)');
#Spectrogram
from scipy import signal
N = 512 #Number of point in the fft
f, t, Sxx = signal.spectrogram(Audiodata, fs, window=signal.windows.blackman(N), nfft=N)
plt.figure()
plt.pcolormesh(t, f,10*np.log10(Sxx)) # dB spectrogram
#plt.pcolormesh(t, f,Sxx) # Lineal spectrogram
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [seg]')
plt.title('Spectrogram with scipy.signal',size=16);
plt.show()
I tested all the code and it works; you need NumPy, Matplotlib and SciPy.
Cheers
I think your question has three separate parts:
How to load audio files into Python?
How to calculate the spectrum in Python?
What to do with the spectrum?
1. How to load audio files in Python?
You are probably best off using scipy, as it provides a lot of signal processing functions. For loading audio files:
import scipy.io.wavfile
samplerate, data = scipy.io.wavfile.read("mywav.wav")
Now you have the sample rate (samples/s) in samplerate and data as a numpy.array in data. You may want to transform the data into floating point, depending on your application.
There is also a standard python module wave for loading wav-files, but numpy/scipy offers a simpler interface and more options for signal processing.
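For example, a 16-bit PCM wav comes back as an int16 array; a minimal sketch of scaling it to floats in [-1, 1], reusing the snippet above:

import numpy as np
import scipy.io.wavfile

samplerate, data = scipy.io.wavfile.read("mywav.wav")
if data.dtype == np.int16:
    data = data.astype(np.float32) / 32768.0   # scale 16-bit PCM to [-1, 1]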
2. How to calculate the spectrum
Brief answer: Use FFT. For more words of wisdom, see:
Analyze audio using Fast Fourier Transform
The longer answer is quite long. Windowing is very important; otherwise you'll get strange spectra.
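As a minimal sketch of what a windowed spectrum looks like (reusing data and samplerate from the loading snippet above; the 2048-sample frame size is an arbitrary choice):

import numpy as np

frame_size = 2048
# take one frame (first channel if the file is stereo) and apply a Hann window
frame = data[:frame_size] if data.ndim == 1 else data[:frame_size, 0]
windowed = frame.astype(np.float64) * np.hanning(frame_size)
spectrum = np.abs(np.fft.rfft(windowed))                  # magnitude per frequency bin
freqs = np.fft.rfftfreq(frame_size, d=1.0 / samplerate)   # bin centre frequencies in Hz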
3. What to do with the spectrum
This is a bit more difficult. Filtering is often performed in the time domain for longer signals. Maybe if you tell us what you want to accomplish, you'll receive a good answer for this one. Calculating the frequency spectrum is one thing; getting meaningful results from it in signal processing is a bit more complicated.
(I know you did not ask this one, but I see it coming with a probability >> 0. Of course, it may be that you have good knowledge of audio signal processing, in which case this is irrelevant.)