I am trying to load a .wav file in Python using the librosa library. Let's assume my code is as simple as:
import librosa
import numpy as np
pcm_data, spl_rate = librosa.core.load(resource_file, sr=None)
In general it works; however, I am experiencing strange quantization problems when reading audio files with amplitudes below 1e-5. I need some really low-amplitude noise samples for my project (VERY little ambient noise, yet not complete silence).
For instance, when I generate white noise of amplitude 0.00001 in Audacity, its waveform is visible in the Audacity preview when fully magnified. It is also visible after exporting the waveform as 32-bit float and re-importing it into an empty Audacity project. However, when I read that file using the code presented above, np.max(np.abs(pcm_data)) is 0.0. Have I just reached the limits of Python here? How do I read my data (without pre-scaling and rescaling at runtime)?
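For what it's worth, a 32-bit float can easily represent amplitudes around 1e-5, so this is not a floating-point limit of Python itself. A minimal sketch of a workaround, assuming the quiet samples are being quantized away somewhere in the decoding path, is to read the file directly with the soundfile package (the file name below is a placeholder):
import numpy as np
import soundfile as sf

# Read the WAV directly as 32-bit float samples; 'quiet_noise.wav' is a placeholder name.
pcm_data, spl_rate = sf.read('quiet_noise.wav', dtype='float32')
print(np.max(np.abs(pcm_data)))  # should show the ~1e-5 peak if the file contains it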
I am loading a room impulse response h from a .mat file in both MATLAB and Python. After this, I load an audio file and convolve it with h. In MATLAB, I am using the following code:
sound(conv(x,h(1,:)'),Fs)
where h is a 4x2048 room impulse response for 4 microphones (4 channels with 2048 samples each), x is the audio signal with 27077 samples, and Fs is the sample rate.
In Python, I am using the following code:
import numpy as np
import sounddevice as sd

# x, h and Fs are loaded earlier: h (4 x 2048) from the .mat file,
# x and its sample rate Fs from the audio file.
output = np.convolve(x, h[0,:])
sd.play(output, Fs)
sd.wait()
The output in Python is simply noise and garbage, whereas the MATLAB output sounds reasonable.
I have checked the sizes of the audio and of h after loading, as well as the size of the convolved signal, and they all match. I do not know how to debug this.
Strangely, the values are not the same: in MATLAB, h(1,1) holds -0.0128, while in Python, h[0,0] holds -0.012767... (a long tail). Is that reasonable? If the numbers are different, is that why I am getting noise? And why should the numbers differ at all when I am loading both from the same file?
Am I convolving these two correctly?
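For reference, here is a sketch of the Python side under two explicit assumptions: the .mat file stores the impulse responses under the key 'h' (hypothetical, as are the file names), and the audio may decode to 16-bit integers, in which case the convolution output falls far outside [-1, 1] and playback clips to what sounds like noise.
import numpy as np
import sounddevice as sd
from scipy.io import loadmat, wavfile

h = loadmat('rir.mat')['h']          # hypothetical file/key: 4 x 2048 impulse responses
Fs, x = wavfile.read('speech.wav')   # hypothetical file; x may come back as int16

# If the WAV was decoded to integers, rescale to floats in [-1, 1].
if x.dtype.kind == 'i':
    x = x.astype(np.float64) / np.iinfo(x.dtype).max

output = np.convolve(x, h[0, :])

# Normalize so the peak sits within [-1, 1] before playback.
output = output / np.max(np.abs(output))

sd.play(output, Fs)
sd.wait()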
So here's the idea: you can generate a spectrogram from an audio file using the short-time Fourier transform (STFT). Some people have then used something called a "binary mask" to generate different audio (i.e. with background noise removed, etc.) from the inverse STFT.
Here's what I understand:
The STFT is a transform applied to the audio signal, which generates the information that can easily be displayed as a spectrogram.
By multiplying the STFT matrix element-wise by a matrix of the same size (the binary mask) and then taking the inverse STFT, you can create a new matrix with the information needed to generate an audio file of the masked sound.
Once I do the matrix multiplication, how is the new audio file created?
It's not much but here's what I've got in terms of code:
from librosa import load
from librosa.core import stft, istft
y, sample_rate = load('1.wav')
spectrum = stft(y)
back_y = istft(spectrum)
Thank you, and here are some slides that got me this far. I'd appreciate it if you could give me an example/demo in Python.
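Not an authoritative recipe, but here is a minimal sketch of the masking idea on top of that code; the threshold is arbitrary, and the soundfile package (assumed here) is only one way to write the result back to disk:
import numpy as np
import soundfile as sf
from librosa import load
from librosa.core import stft, istft

y, sample_rate = load('1.wav', sr=None)

spectrum = stft(y)                       # complex STFT matrix (frequency x frames)
magnitude = np.abs(spectrum)

# Toy binary mask: keep only bins louder than the median magnitude.
mask = (magnitude > np.median(magnitude)).astype(float)

masked_spectrum = spectrum * mask        # element-wise multiplication
back_y = istft(masked_spectrum)          # back to a time-domain signal

sf.write('masked.wav', back_y, sample_rate)   # write the masked audio to a new file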
I was just getting started with some code to pre-process audio data in order to later feed it to a neural network. Before explaining my actual problem in more depth, let me mention that I took the approach for the project from this site. I also used some code taken from this post, and read the signal.spectrogram docs and this post for more info.
So far, with all of the sources mentioned above, I have managed to load the WAV audio file as a numpy array and plot both its amplitude and its spectrogram. These represent a recording of me saying the word "command" in Spanish.
The strange thing is that I searched the internet and found that the human voice spectrum lies roughly between 80 Hz and 8 kHz, so just to be sure I compared my output with the spectrogram Audacity returned. As you can see, Audacity's version seems more consistent with that information, as its frequency range is the one expected for human voice.
So that brings me to my final question: am I doing something wrong when reading the audio or generating the spectrogram, or am I just having plotting issues?
By the way, I'm new to both Python and signal processing, so thanks in advance for your patience.
Here is the code I'm currently using:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

def espectrograma(wav):
    # Read the WAV file and compute its spectrogram.
    sample_rate, samples = wavfile.read(wav)
    frequencies, times, spectrogram = signal.spectrogram(samples, sample_rate, nperseg=320, noverlap=16, scaling='density')
    #dBS = 10 * np.log10(spectrogram) # convert to dB

    # Top panel: raw waveform (first 3100 samples).
    plt.subplot(2,1,1)
    plt.plot(samples[0:3100])

    # Bottom panel: spectrogram.
    plt.subplot(2,1,2)
    plt.pcolormesh(times, frequencies, spectrogram)
    plt.imshow(spectrogram,aspect='auto',origin='lower',cmap='rainbow')
    plt.ylim(0,30)
    plt.ylabel('Frecuencia [kHz]')
    plt.xlabel('Fragmento[20ms]')
    plt.colorbar()
    plt.show()
The computation of the spectrogram seems fine to me. If you plot the spectrogram on a log scale, you should see something more similar to the Audacity plot you referenced. So uncomment your line
#dBS = 10 * np.log10(spectrogram) # convert to dB
and then use the variable dBS for the plotting instead of spectrogram in
plt.pcolormesh(times, frequencies, spectrogram)
plt.imshow(spectrogram,aspect='auto',origin='lower',cmap='rainbow')
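For reference, a minimal sketch of that change, reusing the variables from the question's espectrograma function (the small offset is only there to avoid log10(0) on empty bins):
dBS = 10 * np.log10(spectrogram + 1e-10)   # convert to dB, avoiding log10(0)

plt.pcolormesh(times, frequencies, dBS)
plt.colorbar(label='Power [dB]')
plt.show()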
The spectrogram uses a Fourier transform to convert your time-series data into the frequency domain.
The maximum frequency that can be measured is (sampling frequency) / 2, so in this case it looks like your sampling frequency might be 60 kHz?
Anyway, regarding your question: it may be correct that the human voice spectrum lies within this range, but the Fourier transform is never perfect. I would simply adjust your y-axis to look specifically at these frequencies.
It seems to me that you are calculating your spectrogram correctly, at least as long as you are reading sample_rate and samples correctly.
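To zoom in on the voice band, for instance (note that the frequencies returned by signal.spectrogram are in Hz, not kHz):
# Zoom the y-axis in on the rough 80 Hz - 8 kHz voice band.
plt.ylim(80, 8000)
plt.ylabel('Frequency [Hz]')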
I'm trying to modify this example: https://svn.enthought.com/enthought/browser/Chaco/trunk/examples/advanced/spectrum.py. Unfortunately I have not been able to get it to scale: if I double the sampling rate, the graph lags behind the sound input. I'd like to find out which part of the code is the bottleneck. I tried cProfile but didn't investigate very far.
I wrote the original version of spectrum.py, and I believe that the bottleneck is in the drawing, in particular the spectrogram plot. If you change the code to not draw every time it computes an FFT, it should keep up better.
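The original spectrum.py isn't reproduced here, so this is only a generic sketch of that idea: keep a counter and push a redraw only every few FFTs (compute_fft and update_plot are hypothetical stand-ins for the example's own functions).
DRAW_EVERY = 4      # redraw the plot once per 4 FFTs; tune to taste
fft_count = 0

def on_new_audio_block(samples):
    """Called for each incoming block of audio samples."""
    global fft_count
    spectrum = compute_fft(samples)     # hypothetical FFT helper
    fft_count += 1
    if fft_count % DRAW_EVERY == 0:
        update_plot(spectrum)           # hypothetical plot update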
How would I go about using Python to read the frequency peaks from a WAV PCM file and then generate an image of it, for spectrogram analysis?
I'm trying to make a program that lets you read any audio file, convert it to WAV PCM, and then find the peaks and frequency cutoffs.
Python's wave library will let you import the audio. After that, you can use numpy to take an FFT of the audio.
Then, matplotlib makes very nice charts and graphs - absolutely comparable to MATLAB.
It's old as dirt, but this article would probably get you started on almost exactly the problem you're describing (article in Python of course).
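As a starting point, here is a minimal sketch along those lines, assuming a 16-bit mono WAV (the file name is a placeholder): read the samples with wave, take an FFT with numpy, and plot the magnitude spectrum with matplotlib.
import wave
import numpy as np
import matplotlib.pyplot as plt

wf = wave.open('test.wav', 'rb')                  # placeholder file name
framerate = wf.getframerate()
raw = wf.readframes(wf.getnframes())
wf.close()

samples = np.frombuffer(raw, dtype=np.int16)      # 16-bit mono PCM assumed

spectrum = np.abs(np.fft.rfft(samples))           # magnitude spectrum
freqs = np.fft.rfftfreq(len(samples), d=1.0 / framerate)

plt.plot(freqs, spectrum)
plt.xlabel('Frequency [Hz]')
plt.ylabel('Magnitude')
plt.show()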
Loading WAV files is easy using audiolab:
from audiolab import wavread
signal, fs, enc = wavread('test.wav')
or for reading any general audio format and converting to WAV:
from audiolab import Sndfile
sound_file = Sndfile('test.w64', 'r')
signal = sound_file.read_frames(sound_file.nframes)
The spectrogram is built into PyLab:
from pylab import *
specgram(signal)
Specifically, it's part of matplotlib. Here's a better example.
from pylab import *
specgram(signal)
is the easiest. Also quite handy in this context:
subplot
But be warned: Matplotlib is very slow, even though it creates beautiful images. You should not use it for demanding animation, even less when you are dealing with 3D.
If you need to convert from PCM format to integers, you'll want to use struct.unpack.
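For example, a small sketch assuming 16-bit mono PCM (the file name is again a placeholder):
import wave
import struct

wf = wave.open('test.wav', 'rb')       # placeholder file name
raw = wf.readframes(wf.getnframes())
wf.close()

n_samples = len(raw) // 2                           # 2 bytes per 16-bit sample
samples = struct.unpack('<%dh' % n_samples, raw)    # little-endian signed shorts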