Separate music instruments and estimate note density - python

I have a .wav file of a musical performance with three instruments: a clarinet, a bass and drums. My goal is to detect how fast the clarinet player is playing, i.e. the number of notes played every five seconds (e.g. between 0s and 5s = 4 notes played; between 5s and 10s = 2 notes played, etc.).
To separate the instruments I used Spleeter by Deezer, and to calculate the note density I used the librosa function librosa.onset.onset_detect(). However, I am not quite satisfied with the result.
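For reference, this is roughly how I am counting onsets at the moment (a minimal sketch; the stem filename is just an example, and I bin the onset times into 5-second windows):

```python
import numpy as np
import librosa

# example filename: the clarinet stem produced by the separation step
y, sr = librosa.load("clarinet_stem.wav", sr=None)

# onset times in seconds, then a count per 5-second window
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")
duration = len(y) / sr
bins = np.arange(0, duration + 5, 5)
notes_per_window, _ = np.histogram(onset_times, bins=bins)
for start, stop, n in zip(bins[:-1], bins[1:], notes_per_window):
    print(f"{start:.0f}s-{stop:.0f}s: {n} notes")
```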
Does anybody have a better idea to solve my problem?

This problem can be divided into two parts:
- Transform audio data to musical data (trying to keep the three instruments separated during this step)
- Analyze the density of musical notes
To make the first step as consistent as possible, you should use a single library. For instance, you could use an audio-to-MIDI converter (there is a Python library here) to get the musical data. Then you could do the rest of the job yourself, counting MIDI notes and so on (which is a lot simpler than working with raw audio).
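To illustrate the second half of that (the MIDI-counting part), here is a minimal sketch using pretty_midi; the filename is hypothetical and assumes the converter has already produced a MIDI file for the clarinet stem:

```python
import numpy as np
import pretty_midi

# hypothetical: MIDI file produced by the audio-to-MIDI converter for the clarinet stem
midi = pretty_midi.PrettyMIDI("clarinet.mid")

# collect note onset times (in seconds) from every instrument track in the file
onsets = np.array([note.start for inst in midi.instruments for note in inst.notes])

# count notes in consecutive 5-second windows
bins = np.arange(0, midi.get_end_time() + 5, 5)
counts, _ = np.histogram(onsets, bins=bins)
for start, stop, n in zip(bins[:-1], bins[1:], counts):
    print(f"{start:.0f}s-{stop:.0f}s: {n} notes")
```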

Related

Recognize start of piano music in an MP3 file which starts with a spoken introduction, and remove spoken part, using Python

I have a number of .mp3 files which all start with a short voice introduction followed by piano music. I would like to remove the voice part and just be left with the piano part, preferably using a Python script. The voice part is of variable length, i.e. I cannot use ffmpeg to remove a fixed number of seconds from the start of each file.
Is there a way of detecting the start of the piano part, so that I know how many seconds to remove, using ffmpeg or even Python itself?
Thank you
This is a non-trivial problem if you want a good outcome.
Quick and dirty solutions would involve inferred parameters like:
"there's usually 15 seconds of no or low-db audio between the speaker and the piano"
"there's usually not 15 seconds of no or low-db audio in the middle of the piano piece"
and then use those parameters to try to get something "good enough" using audio analysis libraries.
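A minimal sketch of that heuristic, assuming librosa and soundfile are acceptable; the input filename is just an example, and the "15 seconds of low-dB audio" threshold is the inferred parameter above:

```python
import librosa
import soundfile as sf

# example input: spoken introduction followed by piano
y, sr = librosa.load("intro_then_piano.mp3", sr=None)

# non-silent regions (sample indices) relative to a -40 dB threshold
intervals = librosa.effects.split(y, top_db=40)

# cut at the first gap between non-silent regions that is at least 15 s long
cut_sample = None
for (start0, end0), (start1, end1) in zip(intervals[:-1], intervals[1:]):
    if (start1 - end0) / sr >= 15:
        cut_sample = start1
        break

if cut_sample is not None:
    sf.write("piano_only.wav", y[cut_sample:], sr)
```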
I suspect you'll be disappointed with that approach given that I can think of many piano pieces with long pauses and this reads like a classic ML problem.
The best solution here is to use ML with a classification model and a large data set. Here's a walk-through that might help you get started. However, this isn't going to be a few minutes of coding. This is a typical ML task that will involve collecting and tagging lots of data (or having access to pre-tagged data), building an ML pipeline, training a neural net, and so forth.
Here's another link that may be helpful. He's using a pretrained model to reduce the amount of data required to get started, but you're still going to put in quite a bit of work to get this going.

Detecting a noise in an audio stream

My goal is to be able to detect a specific noise that comes through the speakers of a PC using Python. That means the following, in pseudo code:
Sound is being played out of the speakers, by applications such as games for example,
ny "audio to detect" sound happens, and I want to detect that, and take an action
The specific sound I want to detect can be found here.
If I break that down, I believe I need two things:
A way to sample the audio that is being streamed to an audio device
I actually have this bit working - with the code found here: https://gist.github.com/renegadeandy/8424327f471f52a1b656bfb1c4ddf3e8 - it is based on the sounddevice example plot, which I combine with an audio loopback device. This allows my code to receive a callback with data that is played to the speakers.
A way to compare each sample with my "audio to detect" sound file.
The detection does not need to be exact - it just needs to be close. For example, there will be lots of other noises happening at the same time, so it's more about being able to detect the footprint of the "audio to detect" within the audio stream of a variety of sounds.
Having investigated this, I found technologies mentioned in this post on SO and also this interesting article on Chromaprint. The Chromaprint article uses fpcalc to generate fingerprints, but because my "audio to detect" is around 1-2 seconds, fpcalc can't generate the fingerprint. I need something which works across smaller timespans.
Can somebody help me with problem #2 as detailed above?
How should I attempt this comparison (ideally with a little example), based upon my sampling using sounddevice in the audio_callback function?
Many thanks in advance.
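For what it's worth, one way to attempt that comparison is a sliding, normalized correlation over mel-spectrogram "fingerprints". This is only a sketch: librosa is just one option, and it assumes the captured buffer from audio_callback and the reference clip are NumPy arrays at the same sample rate.

```python
import numpy as np
import librosa

def fingerprint(y, sr, n_mels=64, hop_length=512):
    """Log-mel spectrogram, normalized to zero mean and unit norm."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    S = librosa.power_to_db(S, ref=np.max)
    S = S - S.mean()
    return S / (np.linalg.norm(S) + 1e-9)

def best_match_score(buffer_y, target_y, sr):
    """Slide the target fingerprint over the buffer; higher score = closer match."""
    S_buf = fingerprint(buffer_y, sr)
    S_tgt = fingerprint(target_y, sr)
    n = S_tgt.shape[1]
    if S_buf.shape[1] < n:
        return 0.0
    scores = []
    for offset in range(S_buf.shape[1] - n + 1):
        win = S_buf[:, offset:offset + n]
        win = win - win.mean()
        win = win / (np.linalg.norm(win) + 1e-9)
        scores.append(float((win * S_tgt).sum()))
    return max(scores)

# e.g. from the audio_callback side: trigger the action when the score crosses a tuned threshold
# if best_match_score(latest_buffer, reference_clip, samplerate) > 0.6: ...
```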

How can I detect presence (and calculate extent) of overlapping speakers in audio file?

I have a collection of WAV audio files that contain broadcast recordings, mostly the audio tracks of video recordings of news broadcasts etc. (I don't have the original videos). I need to estimate what percentage of those files contain overlapping speakers, i.e. when two or more people are talking more or less at the same time, and, for the files where overlap does occur, what percentage of each file is overlapping speech. I don't care if it's 2, 3 or 23 people talking at the same time, as long as it's more than one. Gender, age etc. don't matter either. On the other hand, those recordings are in many different languages, of varying quality, and may also contain background noise (street sounds, music etc.). So this problem seems to be simpler than speaker diarization, but has complicating factors.
So is there a library (preferably Python) or a command-line tool that can do this out of the box? One that does not require any supervised training (that is, I don't have any labeled data to train it with). Unsupervised training might be OK, but I prefer to avoid it too.
Thank you
UPDATE: Downstream processing of these files might define the task a bit better: Ultimately, we'll process them with ASR in order to index resulting transcripts for keyword search. When we search for a keyword "blah" in a multi-speaker recording, we won't care which speaker said it as long as any one of them did. Intuitively, getting "blah" correctly from a recording where there are multiple speakers but everyone carefully waits for their turn to speak would be easier than when everyone is speaking at the same time. I am trying to measure how much overlap is in those recordings. Among other things, this will allow me to quantitatively compare 2 sets of such recordings and conclude that one is harder than the other.

Accurately mixing two notes over each other

I have a large library of many pre-recorded music notes (around 1,200), which are all of consistent amplitude.
I'm researching methods of layering two notes over each other so that it sounds like a chord where both notes are played at the same time.
Samples with different attack times:
As you can see, these samples have different peak amplitude points, which need to line up in order to sound like a human-played chord.
Manually aligned attack points:
The second image shows the attack points manually aligned by ear, but this is an unfeasible method for such a large data set where I wish to create many permutations of chord samples.
I'm considering a method whereby I identify the time of peak amplitude of two audio samples, and then align those two peak amplitude times when mixing the notes to create the chord. But I am unsure of how to go about such an implementation.
I'm thinking of using a Python mixing solution such as the one found here: Mixing two audio files together with python, with some tweaking to mix audio samples over each other.
I'm looking for ideas on how I can identify the times of peak amplitude in my audio samples, or if you have any thoughts on other ways this idea could be implemented I'd be very interested.
In case anyone is actually interested in this question, I have found a solution to my problem. It's a little convoluted, but it has yielded excellent results.
To find the time of peak amplitude of a sample, I found this thread: Finding the 'volume' of a .wav at a given time, where the top answer provided links to a Scala library called AudioFile, which provided a method to find the peak amplitude by going through a sample in frame-buffer windows. However, this library required all files to be in .aiff format, so a second library of samples was created consisting of all the old .wav samples converted to .aiff.
After reducing the frame buffer window, I was able to determine in which frame the highest amplitude was found. Dividing this frame by the sample rate of the audio samples (which was known to be 48000), I was able to accurately find the time of peak amplitude. This information was used to create a file which stored both the name of the sample file, along with its time of peak amplitude.
Once this was accomplished, a Python script was written using the pydub library http://pydub.com/ which would pair up two samples and find the difference (t) in their times of peak amplitude. The sample with the earlier time of peak amplitude would have silence of length (t) prepended to it from a .wav containing only silence.
These two samples were then overlaid onto each other to produce the accurately mixed chord!
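For anyone looking for a pure-pydub version of the same idea, here is a minimal sketch; the note filenames are just examples, and it assumes .wav input rather than the .aiff detour described above:

```python
import numpy as np
from pydub import AudioSegment

def peak_time_ms(seg):
    """Return the time (in ms) of the loudest sample in a pydub AudioSegment."""
    samples = np.array(seg.get_array_of_samples())
    # samples are interleaved per channel: divide by channel count to get the frame index
    frame = int(np.argmax(np.abs(samples)) // seg.channels)
    return 1000.0 * frame / seg.frame_rate

def mix_aligned(path_a, path_b):
    a = AudioSegment.from_wav(path_a)
    b = AudioSegment.from_wav(path_b)
    t_a, t_b = peak_time_ms(a), peak_time_ms(b)
    # prepend silence to the sample whose peak comes earlier so both peaks line up
    delay = abs(t_a - t_b)
    if t_a < t_b:
        a = AudioSegment.silent(duration=delay, frame_rate=a.frame_rate) + a
    else:
        b = AudioSegment.silent(duration=delay, frame_rate=b.frame_rate) + b
    # overlay onto the longer segment so neither note is truncated
    return a.overlay(b) if len(a) >= len(b) else b.overlay(a)

# example usage with hypothetical filenames
# chord = mix_aligned("note_c4.wav", "note_e4.wav")
# chord.export("chord_c4_e4.wav", format="wav")
```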

Finding speed and tone of speech in an audio file using Python

Given an audio file, I want to calculate the pace of the speech, i.e. how fast or slow it is.
Currently I am doing the following:
- convert speech to text and obtain a transcript (using a free tool).
- count number of words in transcript.
- calculate length or duration of file.
- finally, pace = (number of words in transcript / duration of file).
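In code, that pipeline boils down to something like this (the transcript string and the filename are placeholders):

```python
import wave

# placeholders: transcript returned by the free speech-to-text tool, plus the original recording
transcript = "so today we will look at ..."
with wave.open("speech.wav", "rb") as wav_file:
    duration = wav_file.getnframes() / wav_file.getframerate()

# pace = words per second (multiply by 60 for words per minute)
pace = len(transcript.split()) / duration
```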
However, the accuracy of the pace obtained depends purely on the transcription, which I think is an unnecessary step.
Is there any Python-library/SoX/ffmpeg way that will enable me to calculate, in a straightforward way:
- the speed/pace of talk in an audio file
- the dominant pitches/tones of that audio?
I referred to http://sox.sourceforge.net/sox.html and https://digitalcardboard.com/blog/2009/08/25/the-sox-of-silence/
Your method sounds interesting as a quick first-order approximation, but it is limited by the transcript resolution. You can analyze the audio file directly.
I'm not familiar with SoX, but from the manual it seems like the stat option gives "... time and frequency domain statistical information about the audio".
SoX claims to be a "Swiss Army knife of audio manipulation", and just from skimming through the docs it seems like it might suit you for finding the general tempo.
If you want to run pitch analysis too, you can develop your own algorithm with Python - I recently used librosa and found it very useful and well documented.
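As a starting point for a transcript-free approach, here is a minimal librosa sketch; the filename is a placeholder, onsets per second are only a rough proxy for syllable rate, and pYIN is just one way to get a pitch track:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)  # placeholder filename
duration = len(y) / sr

# rough pace proxy: onset (syllable-like) events per second, no transcript needed
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
pace = len(onsets) / duration

# pitch: fundamental-frequency track via pYIN, keeping only voiced frames
f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C7"), sr=sr)
dominant_pitch = float(np.nanmedian(f0[voiced_flag]))

print(f"~{pace:.1f} onsets/sec, median pitch ~{dominant_pitch:.0f} Hz")
```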
