Given a wav file (mono, 16 kHz sampling rate) of an audio recording of a human talking, is there a way to extract just the voice, thereby filtering out most mechanical and background noise? I'm trying to use the librosa package in Python 3.6 for this, but I can't figure out how piptrack works (or whether there is a simpler way).
When I tried using an FFT/IFFT to restrict frequencies to the 300-3400 Hz range, the resulting sound was severely distorted.
import numpy as np
import scipy.io.wavfile

sr, y = scipy.io.wavfile.read(wav_file_path)
# note: this treats FFT bin indices as Hz; bin k is actually k * sr / len(y) Hz
x = np.fft.rfft(y)[0:3400]
x[0:300] = 0
x = np.fft.irfft(x)
Extracting the human voice from an audio file is an actively researched problem. It's often referred to as 'speech enhancement' in the scientific literature. The latest developments in the field tend to be presented at the Interspeech and IEEE ICASSP conferences. You can also check out the Deep Noise Suppression Challenge from Microsoft.
The complexity of removing unwanted sound from a speech recording depends heavily on the unwanted sound and on how much you know about it. If, as your attempt suggests, you are only interested in filtering out low-frequency noise, then you may be able to get some noise reduction with a proper high-pass filter (or a band-pass filter, if you also want to cut the high end as in your attempt). Librosa has some filter implementations, and numpy/scipy will give you even more options.
Simply zeroing FFT coefficients will give terrible distortion. See this Stack Overflow answer for why this is never a good idea.
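If you want to stay in that 300-3400 Hz band, here is a minimal sketch of a proper band-pass with scipy, assuming a mono WAV; the band edges are just the values from your attempt:

import scipy.io.wavfile
import scipy.signal

sr, y = scipy.io.wavfile.read(wav_file_path)  # wav_file_path as in the question

# 4th-order Butterworth band-pass, 300-3400 Hz, as second-order sections
sos = scipy.signal.butter(4, [300, 3400], btype='bandpass', fs=sr, output='sos')

# zero-phase filtering avoids the phase distortion of a plain forward pass
filtered = scipy.signal.sosfiltfilt(sos, y.astype(float))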
I need to build software that does audio recognition of a small audio sample (A) inside other audio samples (B), and outputs how many times A appears inside the audio from B (if there is a match).
What I have: A database with hundreds of audio files
Input: A new audio file
Expected output: A boolean indicating whether the input matches an audio file from the database, and how many times the input appeared inside the matched audio (from the DB)
Any code, open source projects, guides, books, videos, tutorials, etc. are useful! Thanks everyone!
This is a very broad question, but let me try to back up and describe a little bit about how audio recognition works generally, and how you might perform this yourself.
I'm going to assume the audio comes from an audio file and not a stream, but it should be relatively easy to understand either way.
The Basics of Digital Audio
An audio file is a series of samples which are recorded into a device through a process called sampling. Sampling is the process by which a continuous analog signal (for instance, the electrical signal from a microphone or an electric guitar) is turned into a discrete, digital signal.
With audio signals, sampling is almost always done at a single sampling rate, which is generally somewhere between 8 kHz and 192 kHz. The only particularly important things to know about sampling for you are:
The highest frequency that a digital audio system can represent is called the Nyquist frequency, which is half the sampling rate. So if you're using a sampling rate of 48 kHz, the highest possible represented frequency is 24 kHz. This is generally plenty, because humans can only hear up to 20 kHz, so you're safe to use any sampling rate over 40 kHz unless you're trying to record something that isn't for humans.
After being sampled, the digital audio file is stored as either floating point or integer values. Most often, an audio file is represented as 32-bit floating point, 24-bit integer, or 16-bit integer. In any case, most modern audio processing is done with floating point numbers, generally scaled within the window (-1.0, 1.0). In this system, alternating -1.0s and 1.0s is the loudest possible square wave at the highest possible frequency, and a series of 0.0s is silence.
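To make that concrete, here's a quick sketch that loads a 16-bit WAV and rescales it into the (-1.0, 1.0) window; 'audio.wav' is a placeholder:

import numpy as np
import scipy.io.wavfile

sr, samples = scipy.io.wavfile.read('audio.wav')  # placeholder file name
print(sr, samples.dtype)  # e.g. 48000, int16

# rescale 16-bit integer samples into the (-1.0, 1.0) floating point window
floats = samples.astype(np.float32) / 32768.0
print(floats.min(), floats.max())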
Audio Recognition
General algorithms for audio recognition are complex and often inefficient for many use cases. For instance, are you trying to determine whether an audio file exactly matches another audio file, or whether the two would merely sound nearly identical? Let's look at the simplest audio comparison algorithm (at least the simplest I can come up with).
def compareAudioFiles(a, b):
    if len(a) != len(b):
        return False
    for idx in range(len(a)):
        # if the current sample in a isn't equal to the current sample in b
        if a[idx] != b[idx]:
            return False
    return True  # if the two returns above aren't triggered, a and b are the same
This works **only under specific circumstances** -- if the audio files are even slightly different, they won't be matched as identical. Let's talk about a few ways this could fail:
Floating point comparison -- it is risky to use == between floats, because floats are compared exactly, so a tiny change to the samples makes them register as different (a tolerance-based fix is sketched after this list). For instance:
SamplesA, sr = librosa.core.load('audio_file_A.wav')
SamplesB, sr = librosa.core.load('audio_file_A.wav')
SamplesB *= 1.0000000001  # an imperceptibly small gain change
compareAudioFiles(SamplesA, SamplesB)  # will be False
Even though the slight change to SamplesB is imperceptible, it is recognized by compareAudioFiles.
Zero padding -- a single sample of 0 before or after the file will cause failure:
SamplesA, sr = librosa.core.load('audio_file_A.wav')
SamplesB = numpy.append(SamplesA, 0)  # adds one zero to the end
# will be False because len(SamplesA) != len(SamplesB)
compareAudioFiles(SamplesA, SamplesB)  # False
There are tons of other reasons this wouldn't work, like phase mismatch, bias (DC offset), and filtered-out low or high frequency content that isn't audible anyway.
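For the floating point problem in particular, a tolerance-based comparison helps. A minimal sketch using numpy.allclose; the tolerance value here is just an illustration you would tune:

import numpy

def compareAudioFilesApprox(a, b, tolerance=1e-6):
    if len(a) != len(b):
        return False
    # compare element-wise within a tolerance instead of demanding exact equality
    return numpy.allclose(a, b, atol=tolerance)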
You could continue to improve this algorithm to make up for some things like these, but it would still probably never work well enough to match perceived sounds to others. In short, if you want to do this in a way that compares how the audio sounds, you need to use an acoustic fingerprinting library. One such library is pyacoustid. Otherwise, if you want to compare audio samples from files on their own, you can probably come up with a relatively stable algorithm which measures the difference between sounds in the time domain, taking into account zero padding, imprecision, bias, and other noise.
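For the counting part of your question, one rough time-domain approach is sliding cross-correlation. A sketch follows; the 0.9 threshold is a made-up value you would have to tune, the normalization here is global rather than properly local, and a real system would use locally normalized correlation or fingerprinting instead:

import numpy as np
import scipy.signal

def count_occurrences(a, b, threshold=0.9):
    # roughly normalize both signals so correlation scores are comparable
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    # sliding similarity of the short clip `a` against every offset in `b`
    scores = scipy.signal.correlate(b, a, mode='valid')
    # count well-separated peaks above the threshold
    peaks, _ = scipy.signal.find_peaks(scores, height=threshold, distance=len(a))
    return len(peaks)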
For general-purpose audio operations in Python, I'd recommend LibROSA.
Good luck!
I have many .wav files of heart sounds recorded through the microphone by placing a phone directly on a person's chest. I want to calculate the BPM from these sounds. Could you please help with this? Any library, algorithm, or tutorial?
Can you (are you allowed to) put some sample somewhere?
I've played with some ECG (up to 12-electrode) and neural signals (spikes look a lot like the R-S transition). Those spikes were so big that a simple find_peaks from scipy.signal was enough to detect them. I used a Butterworth filter before that, though. You might need that too; filtering out the 50/60 Hz mains is common, and there may be similar noises in audio as well.
After finding the peaks, beats per minute is a division (and probably some averaging).
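Translated into code, that pipeline might look roughly like this; the band edges, smoothing cutoff, and peak parameters are all guesses you would have to tune to your recordings, and 'heart.wav' is a placeholder:

import numpy as np
import scipy.io.wavfile
import scipy.signal

sr, y = scipy.io.wavfile.read('heart.wav')  # placeholder file name
y = y.astype(float)

# Butterworth band-pass around typical heart-sound energy (roughly 20-150 Hz)
sos = scipy.signal.butter(4, [20, 150], btype='bandpass', fs=sr, output='sos')
filtered = scipy.signal.sosfiltfilt(sos, y)

# rectify and smooth into an envelope so each beat becomes one broad peak
smooth = scipy.signal.butter(2, 10, btype='lowpass', fs=sr, output='sos')
envelope = scipy.signal.sosfiltfilt(smooth, np.abs(filtered))

# peaks at least 0.3 s apart (caps the detectable rate at ~200 BPM)
peaks, _ = scipy.signal.find_peaks(envelope, distance=int(0.3 * sr),
                                   height=0.5 * envelope.max())

minutes = len(y) / sr / 60.0
print('Estimated BPM:', len(peaks) / minutes)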
What you're trying to do is essentially calculate the Fourier transform of the given sound file and then identify the strongest peak. That's likely going to be the frequency of your dominant signal (which in this case should be the heart rate).
Thankfully, someone else has already asked/answered this on Stack Overflow.
The only caveat with this approach is that if there are other repetitive signals that dominate the heartbeat, you may need to clean your data first.
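A sketch of that frequency-domain idea, assuming a mono recording; note that the beat rate shows up as a periodicity of the amplitude envelope rather than of the raw waveform, so this takes the FFT of the envelope and searches 0.5-3.5 Hz (30-210 BPM). 'heart.wav' is again a placeholder:

import numpy as np
import scipy.io.wavfile
import scipy.signal

sr, y = scipy.io.wavfile.read('heart.wav')  # placeholder file name
envelope = np.abs(scipy.signal.hilbert(y.astype(float)))

spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
freqs = np.fft.rfftfreq(len(envelope), d=1.0 / sr)

# restrict the peak search to plausible heart rates
band = (freqs >= 0.5) & (freqs <= 3.5)
dominant_hz = freqs[band][np.argmax(spectrum[band])]
print('Estimated BPM:', dominant_hz * 60)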
I want to use Python to process an audio file so that only my voice is recognized. For example, if I say "forward" to a Raspberry Pi car, it will go straight, but other people who say "forward" cannot control my car.
Or, put differently, I want to treat other people's voices as noise and eliminate them. How can I do this? Someone told me I could use PCA or ICA to reduce that noise.
You first recognize the command, then extract an i-vector or d-vector speaker embedding to verify that the speaker is you.
You can find a description of the algorithms in Apple's blog, for example. You can find implementations of the mentioned algorithms in Kaldi, though they are not very easy to integrate.
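To show the shape of such a verification step, here is a sketch. The mean-MFCC "embedding" below is only a crude stand-in for a real i-vector/d-vector extractor, and the file names and the 0.9 threshold are made up:

import numpy as np
import librosa

def embed(path):
    # crude stand-in for an i-vector/d-vector extractor: the mean MFCC vector
    y, sr = librosa.load(path, sr=16000)
    return np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20), axis=1)

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# enrollment: average the embeddings of a few clips of your own voice
enrollment = np.mean([embed(p) for p in ['me1.wav', 'me2.wav', 'me3.wav']], axis=0)

def is_me(path, threshold=0.9):
    # accept a command only if the speaker sounds close enough to the enrolled voice
    return cosine_similarity(embed(path), enrollment) >= threshold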
I'm looking to calculate the loudness of a piece of audio using Python — probably by extracting the peak volume of a piece of audio, or possibly using a more accurate measure (RMS?).
What's the best way to do this? I've had a look at pyaudio, but that didn't seem to do what I wanted. What looked good was ruby-audio, as this seemingly has sound.abs.max built into it.
The input audio will be taken from various local MP3 files that are around 30s in duration.
I think that RMS would be the most accurate measure. One thing to note is that we perceive loudness differently at different frequencies, so convert the audio to frequency space with an FFT (numpy.fft should work great on only 30 s of audio). Now compute a power spectral density from this, and weight the PSD by frequency using some loudness curve. Pay particular attention to frequencies below 10 Hz, since there will be a lot of power there (it would dominate the RMS calculation in the time domain), yet we can't hear it. Finally, integrate the weighted PSD and take the square root; that will give a perceived RMS.
You can also break the MP3 into sections or windows and apply this technique to get the volume of particular sections.
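A sketch of that recipe, using the standard closed-form A-weighting curve as the loudness curve and librosa to decode the file (assuming your install can decode MP3s; 'track.mp3' is a placeholder, and the normalization here is only approximate since rfft drops the conjugate bins):

import numpy as np
import librosa

def a_weighting_db(f):
    # standard closed-form A-weighting approximation, in dB
    f2 = f ** 2
    num = (12194.0 ** 2) * f2 ** 2
    den = ((f2 + 20.6 ** 2)
           * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    return 20 * np.log10(num / den) + 2.0

y, sr = librosa.load('track.mp3', sr=None, mono=True)  # placeholder file name

spectrum = np.fft.rfft(y)
freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
psd = np.abs(spectrum) ** 2 / len(y)  # power spectral density, up to a constant

# convert the dB curve to a power weight; clamp f=0 to avoid log(0)
weight = 10 ** (a_weighting_db(np.maximum(freqs, 1e-6)) / 10)

perceived_rms = np.sqrt(np.sum(psd * weight) / len(y))
plain_rms = np.sqrt(np.mean(y ** 2))
print(plain_rms, perceived_rms)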
I have a guitar and I need my PC to be able to tell what note is being played, recognizing the tone. Is it possible to do this in Python, and is it possible with pygame? Being able to do it in pygame would be very helpful.
To recognize the frequency of an audio signal, you would use the FFT (fast Fourier transform) algorithm. As far as I can tell, PyGame has no means to record audio, nor does it support the FFT.
First, you need to capture the raw sampled data from the sound card; this kind of data is called PCM (Pulse Code Modulation). The simplest way to capture audio in Python is using the PyAudio library (Python bindings to PortAudio). GStreamer can also do it, but it's probably overkill for your purposes. Capturing 16-bit samples at a rate of 48000 Hz is pretty typical and probably the best a normal sound card will give you.
Once you have raw PCM audio data, you can use the fftpack module from the scipy library to run the samples through the FFT. This will give you a frequency distribution of the analysed audio signal, i.e., how strong the signal is in certain frequency bands. Then it's a matter of finding the frequency that has the strongest signal.
You might need some additional filtering to avoid harmonic frequencies, but I am not sure.
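Putting those pieces together, here is a minimal sketch, reading from a WAV file rather than the sound card to keep it short ('guitar_note.wav' is a placeholder):

import numpy as np
import scipy.io.wavfile

sr, samples = scipy.io.wavfile.read('guitar_note.wav')  # placeholder file name
samples = samples.astype(float)

# window the signal to reduce spectral leakage, then look at the magnitude spectrum
spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)

# the strongest bin; as noted above, overtones can beat out the fundamental
print('Dominant frequency: %.1f Hz' % freqs[np.argmax(spectrum)])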
I once wrote a utility that does exactly that - it analyses what sounds are being played.
You can look at the code here (or you can download the whole project; it's integrated with Frets on Fire, an open-source Guitar Hero clone, to create a real guitar hero). It was tested using a guitar, a harmonica, and whistles. :) The code is ugly, but it works. :)
I used pymedia to record, and scipy for the FFT.
Except for the basics that others already noted, I can give you some tips:
If you record from a mic, there is a lot of noise. You'll have to use a lot of trial and error to set thresholds and sound clean-up methods to get it working. One possible solution is to use an electric guitar and plug its output into the audio-in. This worked best for me.
Specifically, there is a lot of noise around 50 Hz. That's not so bad, but its overtones (see below) are at 100 Hz and 150 Hz, and those are close to the guitar's G2 and D3. As I said, my solution was to switch to an electric guitar.
There is a tradeoff between speed of detection and accuracy. The more samples you take, the longer it will take to detect sounds, but you'll be more accurate in detecting the exact pitch. If you really want to make a project out of this, you will probably need to use several time scales.
When a tone is played, it has overtones. Sometimes, after a few seconds, the overtones might even be more powerful than the base tone. If you don't deal with this, your program will think it heard E2 for a few seconds and then E3. To overcome this, I used a list of currently playing sounds, and as long as a note or one of its overtones had energy in it, I assumed it was the same note being played.
It is especially hard to detect when someone plays the same note two (or more) times in a row, because it's hard to distinguish between that and random fluctuations of sound level. You'll see in my code that I had to use a constant that had to be configured to match the guitar used (apparently every guitar has its own pattern of power fluctuations).
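For what it's worth, once you do have a frequency estimate, mapping it to a note name is the easy part. A small sketch assuming equal temperament with A4 = 440 Hz:

import math

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def frequency_to_note(freq):
    # MIDI note numbering: 69 is A4 at 440 Hz, 12 semitones per octave
    midi = int(round(69 + 12 * math.log2(freq / 440.0)))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

print(frequency_to_note(82.4))   # E2, the low E string
print(frequency_to_note(146.8))  # D3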
You will need to use an audio library such as the built-in audioop.
Analyzing the specific note being played is not trivial, but can be done using those APIs.
Also could be of use: http://wiki.python.org/moin/PythonInMusic
Very similar questions:
Audio Processing - Tone Recognition
Real time pitch detection
Real-time pitch detection using FFT
Turning sound into a sequence of notes is not an easy thing to do, especially with multiple notes at once. Read through Google results for "frequency estimation" and "note recognition".
I have some Python frequency estimation examples, but this is only a portion of what you need to solve to get notes from guitar recordings.
This link shows someone doing it in VB.NET, but the basics of what needs to be done to achieve your goal are captured in the links below.
STFT
Cooley–Tukey
FFT