Categorizing short audio samples - Python

I have a small number of similar types of sounds (I shall refer to these as DB_sounds) against which I need to match recordings (Rec_sounds). Each Rec_sound is short and unique and needs to be matched to its corresponding DB_sound. How do I go about matching them?
To illustrate my problem, consider the following:
Bob, with a deep voice, in room A (with some background noise), says Ma.
Alice, with a high voice, in room B, says Eh.
A baby is learning to speak. His first word is Eh.
Ma and Eh are 2 different types of DB_sounds, so I have to return 2 different results. I have several DB_sound samples of different people saying Ma and Eh to compare the Rec_sounds against.
The sounds that I am dealing with are voice recordings of single syllables like la, ba, ne, eh, ma etc.
How should I tackle this?
I don't think audio fingerprinting will work (see the spectrograms below), and existing voice recognition software, like this Google API integration in Python, won't work, since I am not trying to recognize human language, just sounds.
I don't mind building something from the ground up; just point me in a direction you think will work, and please add plenty of justification for why you think so.
[Spectrograms of 8 samples of a baby saying EH]
[Time domain graphs of 8 samples of a baby saying EH]

If you just want to recognize sounds, I would start with a simple procedure:
Crop silence from each sound sample (simple energy threshold).
Compute Audio Features for each sample of your database (e.g. MFCCs).
Perform a cross-validated classification procedure to map the audio features to the sound category you want to recognize.
Helpful Python libraries: scipy for reading WAV files, essentia for audio feature extraction, and scikit-learn for classification and other machine learning.
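For concreteness, here is a minimal sketch of that pipeline. It uses librosa in place of essentia for the feature extraction, and the db_sounds/ folder plus the label-prefix filename convention (ma_01.wav, eh_03.wav, ...) are assumptions made up for this example:

import glob
import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def features(path):
    # Load, crop leading/trailing silence with an energy threshold, take MFCC stats.
    y, sr = librosa.load(path, sr=None)
    y, _ = librosa.effects.trim(y, top_db=25)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

paths = sorted(glob.glob("db_sounds/*.wav"))
X = np.array([features(p) for p in paths])
labels = [p.split("/")[-1].split("_")[0] for p in paths]  # "ma", "eh", ...

# 5-fold cross-validated accuracy of an SVM on the DB_sound categories.
print(cross_val_score(SVC(), X, labels, cv=5))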

Related

Recognize start of piano music in an MP3 file which starts with a spoken introduction, and remove spoken part, using Python

I have a number of .mp3 files which all start with a short voice introduction followed by piano music. I would like to remove the voice part and just be left with the piano part, preferably using a Python script. The voice part is of variable length, i.e. I cannot use ffmpeg to remove a fixed number of seconds from the start of each file.
Is there a way of detecting the start of the piano part, and then knowing how many seconds to remove, using ffmpeg or even Python itself?
Thank you
This is a non-trivial problem if you want a good outcome.
Quick and dirty solutions would involve inferred parameters like:
"there's usually 15 seconds of no or low-db audio between the speaker and the piano"
"there's usually not 15 seconds of no or low-db audio in the middle of the piano piece"
and then use those parameters to try to get something "good enough" using audio analysis libraries (a rough sketch of this heuristic follows below).
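For example, a rough sketch of that heuristic using pydub; the file name, the 15-second gap length, and the -40 dBFS threshold are all assumptions you would tune per collection:

from pydub import AudioSegment
from pydub.silence import detect_silence

# Hypothetical input file; gap length and threshold are guessed parameters.
audio = AudioSegment.from_mp3("intro_then_piano.mp3")
gaps = detect_silence(audio, min_silence_len=15000, silence_thresh=-40)

if gaps:
    start_ms = gaps[0][1]  # end of the first long gap = assumed start of the piano
    audio[start_ms:].export("piano_only.mp3", format="mp3")
else:
    print("No long gap found; the heuristic failed for this file.")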
I suspect you'll be disappointed with that approach, given that I can think of many piano pieces with long pauses, and this reads like a classic ML problem.
The best solution here is to use ML with a classification model and a large data set. Here's a walk-through that might help you get started. However, this isn't going to be a few minutes of coding: it's a typical ML task that will involve collecting and tagging lots of data (or having access to pre-tagged data), building an ML pipeline, training a neural net, and so forth.
Here's another link that may be helpful. He's using a pretrained model to reduce the amount of data required to get started, but you're still going to put in quite a bit of work to get this going.

How can I detect presence (and calculate extent) of overlapping speakers in audio file?

I have a collection of WAV audio files that contain broadcast recordings, mostly the audio parts of video recordings of news broadcasts etc. (I don't have the original videos). I need to estimate what percentage of those files contain overlapping speakers, i.e. when 2 or more people are talking more or less at the same time, and, for the files where overlap does occur, what percentage of the audio is overlapping speech. I don't care if it's 2, 3 or 23 people talking at the same time, as long as it's more than 1. Gender, age etc. don't matter either. On the other hand, those recordings are in many different languages, of varying quality, and may also contain background noise (street sounds, music etc.). So this problem seems simpler than speaker diarization, but has complicating factors.
So is there a library (preferably Python) or a command-line tool that can do this out of the box, one that does not require any supervised training (that is, I don't have any labeled data to train it with)? Unsupervised training might be OK, but I'd prefer to avoid it too.
Thank you
UPDATE: Downstream processing of these files might define the task a bit better: Ultimately, we'll process them with ASR in order to index resulting transcripts for keyword search. When we search for a keyword "blah" in a multi-speaker recording, we won't care which speaker said it as long as any one of them did. Intuitively, getting "blah" correctly from a recording where there are multiple speakers but everyone carefully waits for their turn to speak would be easier than when everyone is speaking at the same time. I am trying to measure how much overlap is in those recordings. Among other things, this will allow me to quantitatively compare 2 sets of such recordings and conclude that one is harder than the other.

How to analyze music MFCC?

I'm trying to make a program that recognizes musical instruments and notes (like C, C#, B, ...) using machine learning in Python.
I got data from IRMAS and a philharmonic orchestra's homepage.
How can I analyze the music? I want to remove noise and extract MFCC values; from 20 seconds of music, I want to end up with about 20 feature values. I'm planning to train an SVM on these data.
Sorry for the overly broad question... If there is something else I should mention, let me know and I'll answer immediately.
I also have Mathematica. I tried its 'MFCC encoder', but I have no idea how to normalize these data and set a threshold.
Take a look at this Mathematica example of using Neural Networks and MFCC encoding to classify music genre.
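Since you mention Python, here is a minimal sketch of the MFCC-plus-SVM idea with librosa and scikit-learn; averaging each MFCC coefficient over time yields one value per coefficient (n_mfcc=20 gives the ~20 features you mention), and StandardScaler stands in for the manual normalization you asked about. File names and labels are made up for illustration:

import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def clip_features(path, n_mfcc=20):
    # Summarize each MFCC coefficient by its mean over time: 20 values per clip.
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Hypothetical file names and labels, just to show the shape of the pipeline.
paths = ["violin_01.wav", "violin_02.wav", "flute_01.wav", "flute_02.wav"]
labels = ["violin", "violin", "flute", "flute"]

X = np.array([clip_features(p) for p in paths])
clf = make_pipeline(StandardScaler(), SVC())  # scaling replaces manual normalization
clf.fit(X, labels)
print(clf.predict(X[:1]))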

finding speed and tone of speech in an audio using python

Given an audio file, I want to calculate the pace of the speech, i.e. how fast or slow it is.
Currently I am doing the following:
- convert speech to text to obtain a transcript (using a free tool).
- count number of words in transcript.
- calculate length or duration of file.
- finally, pace = (number of words in transcript / duration of file).
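In code, that procedure boils down to something like this sketch (the transcript string is a placeholder for whatever the speech-to-text tool returns, and the file name is hypothetical):

import wave

with wave.open("speech.wav", "rb") as w:
    duration_s = w.getnframes() / w.getframerate()  # length of the file in seconds

transcript = "so this is what was said in the recording"  # placeholder text
pace_wpm = len(transcript.split()) / (duration_s / 60)  # words per minute
print(round(pace_wpm, 1), "words per minute")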
However, the accuracy of the pace obtained depends purely on the transcription, which I think is an unnecessary step.
Is there any Python-library/SoX/ffmpeg way that will enable me to calculate, in a straightforward way:
- the speed/pace of speech in an audio file?
- the dominant pitches/tones of that audio?
I referred to http://sox.sourceforge.net/sox.html and https://digitalcardboard.com/blog/2009/08/25/the-sox-of-silence/
Your method sounds interesting as a quick first-order approximation, but it is limited by the transcript resolution. You can analyze the audio file directly.
I'm not familiar with SoX, but their manual says the stat option gives "... time and frequency domain statistical information about the audio".
SoX claims to be the "Swiss Army knife of audio manipulation", and just from skimming the docs it seems it might suit you for finding the general tempo.
If you want to run pitch analysis too, you can develop your own algorithm in Python - I recently used librosa and found it very useful and well documented.
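As a starting point, here is a hedged sketch with librosa; onset rate is only a rough proxy for syllables per second, and the file name is hypothetical:

import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)
duration = len(y) / sr

# Onset density as a rough, transcript-free proxy for speech pace.
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
print("onsets per second (rough pace proxy):", len(onsets) / duration)

# Fundamental-frequency track; the median over voiced frames gives a dominant pitch.
f0, voiced, _ = librosa.pyin(y, sr=sr, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"))
print("median pitch in Hz:", np.nanmedian(f0[voiced]))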

Analyse audio files with Python

I currently have a photodiode connected to my PC and do the capturing with Audacity.
I want to improve this by using an old RPi 1 as a dedicated test station. As a result, the shutter speed should appear on the console. I would prefer a Python solution for capturing the signal and analysing it.
Can anyone give me some suggestions? I played around with oct2py, but I don't really understand how to calculate the time between the two peaks of the signal.
I have no expertise in sound analysis with Python; this is what I found through some internet research, as I am interested in this topic myself.
You can use pyAudioAnalysis, developed by Theodoros Giannakopoulos.
Towards your end, the function mtFileClassification() from audioSegmentation.py can be a good start. This function:
splits an audio signal into successive mid-term segments and extracts mid-term feature statistics from each of these segments, using mtFeatureExtraction() from audioFeatureExtraction.py
classifies each segment using a pre-trained supervised model
merges successive fix-sized segments that share the same class label to larger segments
visualizes statistics regarding the results of the segmentation-classification process.
For instance:
from pyAudioAnalysis import audioSegmentation as aS

# Segment "scottish.wav" with a pre-trained SVM model ("data/svmSM"), plot the
# result, and evaluate against the optional ground-truth .segments file.
[flagsInd, classesAll, acc, CM] = aS.mtFileClassification(
    "data/scottish.wav", "data/svmSM", "svm", True, "data/scottish.segments")
Note that the last argument of this function is a .segments file. This is used as ground truth (if available) in order to estimate the overall performance of the classification-segmentation method. If this file does not exist, the performance measure is not calculated. These files are simple comma-separated files of the format: <start time in seconds>,<end time in seconds>,<class label>. For example:
0.01,9.90,speech
9.90,10.70,silence
10.70,23.50,speech
23.50,184.30,music
184.30,185.10,silence
185.10,200.75,speech
...
If I have understood your question correctly, this is at least the kind of file you want to generate, isn't it? I rather think you have to provide it yourself, though.
Most of this information is quoted directly from the project's wiki, which I suggest you read. Don't hesitate to reach out, as I am really interested in this topic.
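If you need to create such a ground-truth file yourself, here is a minimal sketch using the standard csv module, with the segment times and labels copied from the example above:

import csv

# Each row is (start time in seconds, end time in seconds, class label).
segments = [(0.01, 9.90, "speech"), (9.90, 10.70, "silence"), (10.70, 23.50, "speech")]
with open("data/scottish.segments", "w", newline="") as f:
    csv.writer(f).writerows(segments)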
Other libraries for audio analysis are also available.
