Detecting audio inside audio [Audio Recognition] [closed] - python

I need to build software that recognizes a small audio sample (A) inside other audio samples (B), and outputs how many times A appears inside the audio from B (if there is a match).
What I have: A database with hundreds of audio files
Input: New audio
Expected output: A boolean indicating whether the input matches a sample from the database, and how many times the input appears inside the matched audio (from the db).
Any code, open source projects, guides, books, videos, tutorials, etc. are useful! Thanks everyone!

This is a very broad question, but let me try to back up and describe a little bit about how audio recognition works generally, and how you might perform this yourself.
I'm going to assume the audio comes from an audio file and not a stream, but it should be relatively easy to understand either way.
The Basics of Digital Audio
An audio file is a series of samples which are recorded into a device through a process called sampling. Sampling is the process by which a continuous analog signal (for instance, the electrical signal from a microphone or an electric guitar) is turned into a discrete, digital signal.
With audio signals, sampling is almost always done at a single sampling rate, which is generally somewhere between 8kHz and 192kHz. The only particularly important things to know about sampling for you are:
The highest frequency that a digital audio system can represent is called the Nyquist frequency, which is half the sampling rate. So if you're using a sampling rate of 48kHz, the highest representable frequency is 24kHz. This is generally plenty, because humans can only hear up to about 20kHz, so you're safe with any sampling rate over 40kHz unless you're trying to record something that isn't for humans.
After being sampled, the digital audio file is stored in terms of either floating point or integer values. Most often, an audio file is represented as either 32-bit floating point, 24-bit integer, or 16-bit integer. In any case, most modern audio processing is done with floating point numbers, generally scaled to the range [-1.0, 1.0]. In this system, alternating -1.0s and 1.0s is the loudest possible square wave at the highest possible frequency, and a series of 0.0s is silence.
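As a quick illustration of these basics (a minimal sketch; the file name is a placeholder), librosa loads audio as floating point samples in exactly that range:

import librosa

# a minimal sketch; 'audio_file_A.wav' is a placeholder file name
samples, sample_rate = librosa.load('audio_file_A.wav', sr=None)  # sr=None keeps the native rate

print(sample_rate)                   # e.g. 44100 or 48000
print(samples.dtype)                 # float32 -- librosa converts to floating point
print(samples.min(), samples.max())  # values normally fall inside [-1.0, 1.0]
print(sample_rate / 2)               # the Nyquist frequency: highest representable frequency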
Audio Recognition
General algorithms for audio recognition are complex and often inefficient for many use cases. For instance, are you trying to determine whether an audio file exactly matches another audio file, or whether the two would merely sound nearly identical? Let's look at the simplest audio comparison algorithm (at least the simplest I can come up with).
def compareAudioFiles(a, b):
    if len(a) != len(b):
        return False
    for idx in range(len(a)):
        # if the current sample in a isn't equal to the current sample in b
        if a[idx] != b[idx]:
            return False
    return True  # if the two returns above aren't triggered, a and b are the same
This works *only under specific circumstances* -- if the audio files are even slightly different, they won't be matched as identical. Let's talk about a few ways that this could fail:
Floating point comparison -- it is risky to use == between floats, because the comparison is exact: a tiny change to a sample causes the files to register as different. For instance:
import librosa

SamplesA, _ = librosa.core.load('audio_file_A.wav')
SamplesB, _ = librosa.core.load('audio_file_A.wav')
SamplesB[0] *= 1.0000000001  # an imperceptibly small change to the first sample
compareAudioFiles(SamplesA, SamplesB)  # will be False
Even though the slight change to SamplesB is imperceptible, it is recognized by compareAudioFiles.
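One way to make the comparison more tolerant (a minimal sketch, reusing the arrays above; it doesn't address the other failure modes below) is to compare within a tolerance, for example with numpy.allclose:

import numpy

# compare within a tolerance instead of with exact equality
numpy.array_equal(SamplesA, SamplesB)          # False -- exact comparison still fails
numpy.allclose(SamplesA, SamplesB, atol=1e-6)  # True -- the tiny perturbation is within tolerance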
Zero padding -- a single sample of 0 before or after the file will cause failure:
import numpy

SamplesA, _ = librosa.core.load('audio_file_A.wav')
SamplesB = numpy.append(SamplesA, 0)  # adds one zero to the end
compareAudioFiles(SamplesA, SamplesB)  # False, because len(SamplesA) != len(SamplesB)
There are tons of other reasons this wouldn't work, like phase mismatch, DC bias, and filtered-out low- or high-frequency content which isn't audible.
You could continue to improve this algorithm to compensate for some of these issues, but it would still probably never work well enough to match sounds the way a listener perceives them. In short, if you want to compare how audio sounds, you need to use an acoustic fingerprinting library. One such library is pyacoustid. Otherwise, if you want to compare audio samples from files on their own, you can probably come up with a relatively stable algorithm which measures the difference between sounds in the time domain, taking into account zero padding, imprecision, bias, and other noise.
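To make the original question concrete, here is a rough sketch of that time-domain idea -- sliding the short clip over the longer recording with cross-correlation and counting strong peaks. The file names, threshold, and normalization are placeholders, not a robust recognizer; for real-world robustness you'd still want fingerprinting:

import librosa
from scipy.signal import correlate, find_peaks

# a rough sketch; file names and the 0.5 threshold are placeholders to tune
clip, sr = librosa.load('sample_A.wav', sr=None)
recording, _ = librosa.load('recording_B.wav', sr=sr)  # resample B to A's rate

# crude normalized cross-correlation of the clip against the recording
clip = (clip - clip.mean()) / (clip.std() + 1e-12)
corr = correlate(recording, clip, mode='valid') / len(clip)

# count similarity peaks above the threshold, at least one clip-length apart
peaks, _ = find_peaks(corr, height=0.5, distance=len(clip))
print(f"matched: {len(peaks) > 0}, occurrences: {len(peaks)}")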
For general purpose audio operations in Python, I'd recommend LibROSA.
Good luck!

Related

How to calculate beats per minute from heart sounds recorded through android MIC?

I have many .wav files of heart sounds recorded through the MIC by putting the phone directly on people's chests. I want to calculate BPM from these sounds. Could you please help with this? Any library, algorithm or tutorials?
Can you (are you allowed to) put some sample somewhere?
I've played with some ECG (up to 12 electrodes) and neural signals (the spikes look a lot like the R-S transition). Those spikes were so big that a simple find_peaks from scipy.signal was enough to detect them. I used a Butterworth filter before that, though. You might need that too; filtering out the 50/60Hz mains hum is common, and there might be similar noise in audio as well.
After finding the peaks, beats per minute is a division (and probably some averaging).
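A rough sketch of that pipeline (Butterworth band-pass plus scipy.signal.find_peaks); the band edges, peak spacing, and file name are assumptions to tune per recording, and it assumes a mono file:

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt, find_peaks

# a rough sketch; band edges and peak spacing are assumptions, not tuned values
sr, samples = wavfile.read('heart_sounds.wav')
samples = samples.astype(float)

# band-pass roughly 20-150 Hz, where most heart-sound energy lives
b, a = butter(4, [20, 150], btype='bandpass', fs=sr)
filtered = filtfilt(b, a, samples)

# peaks in the signal envelope, at least 0.4 s apart (max ~150 BPM)
envelope = np.abs(filtered)
peaks, _ = find_peaks(envelope, distance=int(0.4 * sr), height=envelope.max() * 0.3)

intervals = np.diff(peaks) / sr  # seconds between beats
bpm = 60.0 / intervals.mean()
print(round(bpm, 1))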
What you're trying to do is essentially compute the Fourier transform of the given sound file, and then identify the strongest peak. That's likely going to be the frequency of your dominant signal (which in this case should be the heart rate).
Thankfully, someone else has already asked / answered this on Stack Overflow.
The only caveat with this approach, is if there are other repetitive signals that dominate the heart-beat, in which case you may need to clean your data first.

Determining "noise" in bandwidth data

I have bandwidth data which identifies protocol usage by tonnage and hour. Based on the protocols, you can tell when something is just connected vs actually being used (1000 bits compared to millions or billions of bits) in that hour for that specific protocol. The problem is that when looking at each protocol, they are all heavily right skewed, where 80% of the records are the just-connected cases, or what I'm calling "noise".
The task I have is to separate out this noise and focus only on when the protocol is actually being used. My classmates are all just doing this manually, removing records below a low threshold. I was hoping there was a way to automate this using statistics instead of just picking a threshold that "looks good." We have something like 30 different protocols, each with a different number of bits that would represent "noise", i.e. a download protocol might have 1000 bits where a messaging app might have 75 bits when they are connected but not in full use. Similarly, they will have different means and gaps between them, e.g. the download mean is 215,000,000 and messaging is 5,000,000. There isn't any set pattern between them.
Also, this "noise" has many connections but only accounts for 1-3% of the total bandwidth being used, which is why we are tasked with identifying actual usage vs passive usage.
I don't want any actual code, as I'd like to practice with the implementation and solution building myself. But the logic, process, or name of a statistical method would be very helpful.
Do you have labeled examples, and do you have other data besides the bandwidth? One way to do this would be to train some kind of ML classifier, if you have a decent amount of data where you know it's either in use or not in use. If you have enough data, you might also be able to do this unsupervised. For a start, a simple Naive Bayes classifier works well for binary decisions. As you may be aware, NB was the original basis for spam detection (is it spam or not), so your case of "is it noise or not" should also work, but you will get more robust results if you have other data in addition to the bandwidth to train on. Also, I am wondering if there isn't a way to improve the title of your post so that it communicates your question more quickly.

Python | librosa: how to extract human voice from an audio wav file?

Given a wav file (mono, 16 kHz sampling rate) of an audio recording of a human talking, is there a way to extract just the voice, thereby filtering out most mechanical and background noise? I'm trying to use the librosa package in Python 3.6 for this, but can't figure out how piptrack works (or if there is a simpler way).
When I tried using an FFT/IFFT to restrict frequencies to the 300-3400 Hz range, the resulting sound was severely distorted.
import numpy as np
import scipy.io.wavfile

sr, y = scipy.io.wavfile.read(wav_file_path)
x = np.fft.rfft(y)[0:3400]
x[0:300] = 0
x = np.fft.irfft(x)
Extracting the human voice from an audio file is an actively researched problem. It's often referred to as 'Speech Enhancement' in the scientific literature. The latest developments in the field tend to be presented at the Interspeech and IEEE ICASSP conferences. You can also check out the Deep Noise Suppression Challenge from Microsoft.
The complexity of removing unwanted sound from a speech recording is highly dependent on the unwanted sound, and how much you know about it. If, as your attempt suggests, you are only interested in filtering out low-frequency noise, then you may be able to get some noise reduction with a proper high-pass or band-pass filter. Librosa has some filter implementations, and numpy/scipy will give you even more options.
Simply zeroing FFT coefficients will give terrible distortion. See this Stack Overflow answer as to why this is never a good idea.
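As a sketch of that alternative (not the answerer's code), a 300-3400 Hz Butterworth band-pass applied with scipy avoids the distortion from zeroing bins; the filter order is just a starting point:

from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

# a sketch of a 300-3400 Hz band-pass as an alternative to zeroing FFT bins;
# the 8th-order Butterworth is an assumption, not a tuned choice
sr, y = wavfile.read(wav_file_path)           # wav_file_path as in the question
sos = butter(8, [300, 3400], btype='bandpass', fs=sr, output='sos')
filtered = sosfiltfilt(sos, y.astype(float))  # zero-phase filtering, no phase distortion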

Detecting regions in (x, y) data [closed]

I need to be able to detect regions of a list of (x, y) data based on the features of the data. Some example data is shown in the first image. Right now, I need to be able to find the region between the black marks (sorry for the poor quality, imgur's editor isn't very accurate). Unfortunately, the problem is complicated by the data being different lengths and shapes each time it is collected, as seen in the second image. The sharp drop from ~98 to ~85 is consistent, and the two-dip/two-peak feature between ~1e-9 and ~1.5e-9 should be fairly consistent.
My question is, what is the best approach for detecting events in a signal, based on features of the signal? If I can get this sliced into the three regions marked (beginning to first mark, first to second mark, second mark to end), then I believe I can extend the method to handle my more complex situations.
I've solved similar problems before, but this one is unique in the amount of variation that occurs from one set of data to another. Last time I simply wrote a hand-crafted algorithm to find local extrema and use them to locate the edge, but I feel like it's a rather ugly and inefficient solution that can't be easily reused.
I'm using Python 2.7.5, but ideally this should be a language agnostic solution so that I can implement it in other environments like VB.NET.
Just based on the two examples that you posted, I have a couple of different suggestions: thresholding or template matching.
Thresholds
Because you mentioned that the vertical drop in the signal is relatively constant, especially for the first event you're detecting, it seems like you could use a thresholding method, where you place the event at the first occurrence of the signal crossing some threshold of interest. For instance, to place the first event (in Python, and assuming that your measurement data live in a sequence of tuples containing x-y pairs) :
def detect_onset_event(measurements):
    armed = False
    for offset, (timestamp, value) in enumerate(measurements):
        if value > 90:
            armed = True
        if armed and value < 85:
            return offset
    return -1  # failure condition, might want to raise ValueError() instead
So here we trigger at the first sample offset that drops below 85 after the signal has gone above 90.
You could do something similar for the second event, but it looks like the signal levels that are significant for that event might be a little less clear-cut. Depends on your application and measurement data. This is a good example of what makes thresholding approaches not so great -- they can be brittle and rely on hard-coded values. But if your measurements are quite regular, then this can work well, with little coding effort.
Templates
In this method, you can create a template for each signal event of interest, and then convolve the templates over your signal to identify similar regions of the signal.
import numpy

def detect_twopeak_event(measurements, template):
    data = numpy.asarray(measurements)  # convert to numpy array
    activations = numpy.convolve(
        data[:, 1],  # convolve over all "value" elements
        template)
    return activations.argmax()
Here you'll need to create a list of the sample measurements that constitute the event you're trying to detect -- for example, you might extract the measurements from the two-peak area of an example signal to use as your template. Then by convolving this template over the measurement data, you'll get a metric for how similar the measurements are to your template. You can just return the index of the best match (as in the code above) or pass these similarity estimates to some other process to pick a "best."
There are many ways to create templates, but I think one of the most promising approaches is to use an average of a bunch of neighborhoods from labeled training events. That is, suppose you have a database of signals paired with the sample offset where a given event happens. You could create a template by averaging a windowed region around these labeled events :
def create_mean_template(signals, offsets, radius=20):
    w = numpy.hanning(2 * radius)
    return numpy.mean(
        [s[o-radius:o+radius] * w for s, o in zip(signals, offsets)],
        axis=0)
This has been used successfully in many signal processing domains like face recognition (e.g., you can create a template for an eye by averaging the pixels around a bunch of labeled eyes).
One place where the template approach will start to fail is if your signal has a lot of areas that look like the template, but these areas don't correspond to events you want to detect. It's tricky to deal with this, so the template method works best if there's a distinctive signal pattern that happens near your event.
Another way the template method will fail is if your measurement data contain, say, a two-peak area that's interesting but occurs at a different frequency than the samples you use as your template. In this case, you might be able to make your templates more robust to slight frequency changes by working in the time-frequency domain rather than the time-amplitude domain. There, instead of making 1D templates that correspond to the temporal pattern of amplitude changes you're interested in, you can run a windowed FFT on your measurements and then come up with kD templates that correspond to the k-dimensional frequency changes over a small region surrounding the event you're interested in.
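A minimal sketch of that time-frequency idea, assuming scipy and reusing the measurements/template names from above; the sampling rate and spectrogram parameters are placeholders:

import numpy
from scipy.signal import spectrogram, correlate2d

# a minimal sketch of 2-D (time-frequency) template matching; fs, nperseg and
# noverlap are placeholder values to adapt to your measurement rate
fs = 1000.0
data = numpy.asarray(measurements)
_, _, spec = spectrogram(data[:, 1], fs=fs, nperseg=64, noverlap=48)
_, _, template_spec = spectrogram(template, fs=fs, nperseg=64, noverlap=48)

# correlate the 2-D template against the full spectrogram and take the best match
activations = correlate2d(spec, template_spec, mode='valid')
best_time_bin = activations.argmax()  # index of the strongest match along time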
Hope some of these suggestions are helpful !
You could probably use a Hidden Markov Model with 6+ states. I am no math genius, so I would use one with discrete states and round your data to the nearest integer. My model would look something like:
state 1: start blob (emissions around 97)
state 2: 'fall' (emissions between 83 and 100)
state 3: interesting stuff (emissions between 82 and 86)
state 4: peak (80-88)
state 5: last peak (80-94)
state 6: base line (85-87)
HMMs are not the perfect tool, because they mostly capture ranges of emissions in each state, but they are good at tolerating features coming much earlier or later, because they only care about the transition probabilities between states.
I hope this helps and makes sense.
If you are feeling lazy, you could probably just label 6 traces by hand, cut the data accordingly, and calculate the emission probabilities for each value in each state.
# pseudo code
from collections import defaultdict, Counter

emissions = defaultdict(Counter)  # per-state counts of observed values
for state_label, value in data:
    emissions[state_label][value] += 1
# then normalize each state's counts to sum to 1 and voila, you have the emission probabilities
The above is super oversimplified, but it should be much better and more robust than the usual if-statement approach :). HMMs usually also have a transition matrix, but because the signal in your data is so strong, you could 'skip' that one and go for my pragmatic solution :)
Then subsequently use the Viterbi path to label all your future experiments.

Recognising tone of the audio

I have a guitar and I need my PC to be able to tell what note is being played, recognizing the tone. Is it possible to do this in Python, and is it possible with pygame? Being able to do it in pygame would be very helpful.
To recognize the frequency of an audio signal, you would use the FFT (fast Fourier transform) algorithm. As far as I can tell, PyGame has no means to record audio, nor does it support the FFT transform.
First, you need to capture the raw sampled data from the sound card; this kind of data is called PCM (Pulse Code Modulation). The simplest way to capture audio in Python is using the PyAudio library (Python bindings to PortAudio). GStreamer can also do it, but it's probably overkill for your purposes. Capturing 16-bit samples at a rate of 48000 Hz is pretty typical and probably the best a normal sound card will give you.
Once you have raw PCM audio data, you can use the fftpack module from the scipy library to run the samples through the FFT transform. This will give you a frequency distribution of the analysed audio signal, i.e., how strong the signal is in certain frequency bands. Then, it's a matter of finding the frequency that has the strongest signal.
You might need some additional filtering to avoid harmonic frequencies, but I am not sure.
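A bare-bones sketch of that pipeline (PyAudio capture, then picking the strongest FFT bin); the buffer size and rate are assumptions, the default input device is used, and overtones are not handled:

import numpy as np
import pyaudio

# a bare-bones sketch: record one short buffer and report the strongest frequency;
# CHUNK and RATE are assumptions, and no overtone handling is done
RATE, CHUNK = 48000, 16384

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
samples = np.frombuffer(stream.read(CHUNK), dtype=np.int16).astype(float)
stream.close()
pa.terminate()

# FFT magnitude spectrum and the frequency of its strongest bin
spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
freqs = np.fft.rfftfreq(len(samples), d=1.0 / RATE)
print(f"dominant frequency: {freqs[spectrum.argmax()]:.1f} Hz")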
I once wrote a utility that does exactly that - it analyses what sounds are being played.
You can look at the code here (or you can download the whole project; it's integrated with Frets On Fire, an open source Guitar Hero clone, to create a real guitar hero). It was tested using a guitar, a harmonica and whistles :) The code is ugly, but it works :)
I used pymedia to record, and scipy for the FFT.
Except for the basics that others already noted, I can give you some tips:
If you record from the mic, there is a lot of noise. You'll have to use a lot of trial-and-error to set thresholds and sound clean-up methods to get it working. One possible solution is to use an electric guitar and plug its output into the audio-in. This worked best for me.
Specifically, there is a lot of noise around 50Hz. That's not so bad, but its overtones (see below) are at 100 Hz and 150 Hz, and those are close to the guitar's G2 and D3... As I said, my solution was to switch to an electric guitar.
There is a tradeoff between speed of detection, and accuracy. The more samples you take, the longer it will take you to detect sounds, but you'll be more accurate detecting the exact pitch. If you really want to make a project out of this, you probably need to use several time scales.
When a tone is played, it has overtones. Sometimes, after a few seconds, the overtones might even be more powerful than the base tone. If you don't deal with this, your program will think it heard E2 for a few seconds, and then E3. To overcome this, I kept a list of currently playing sounds, and as long as a note, or one of its overtones, still had energy in it, I assumed it was the same note being played...
It is specifically hard to detect when someone plays the same note 2 (or more) times in a row, because it's hard to distinguish between that, and random fluctuations of sound level. You'll see in my code that I had to use a constant that had to be configured to match the guitar used (apparently every guitar has its own pattern of power fluctuations).
You will need to use an audio library such as the built-in audioop.
Analyzing the specific note being played is not trivial, but can be done using those APIs.
Also could be of use: http://wiki.python.org/moin/PythonInMusic
Very similar questions:
Audio Processing - Tone Recognition
Real time pitch detection
Real-time pitch detection using FFT
Turning sound into a sequence of notes is not an easy thing to do, especially with multiple notes at once. Read through Google results for "frequency estimation" and "note recognition".
I have some Python frequency estimation examples, but this is only a portion of what you need to solve to get notes from guitar recordings.
This link shows someone doing it in VB.NET, but the basics of what needs to be done to achieve your goal are captured in the links below.
STFT
Cooley-Tukey
FFT
