Get sound input & find similar sound with Python

What I want to do is something like 'Shazam' or 'SoundHound' in Python, but for general sounds rather than music.
For example, when I make a sound (e.g. a door slam), I want to find the most similar sound in a list of sound data.
Apologies if this is unclear, my English is not great, but imagine a sound version of 'Shazam'.
I know that 'Shazam' doesn't have an open API.
Is there any API like 'Shazam'?
Or,
How can I implement it myself?

There are several libraries you can use, but none of them will classify a sample as, say, a 'door slam' out of the box. You can, however, use these libraries for feature extraction, then build or obtain a data set of sounds, build a classifier, train it, and use it for sound classification.
The libraries:
Friture - Friture is a graphical program designed to do time-frequency analysis on audio input in real-time. It provides a set of visualization widgets to display audio data, such as a scope, a spectrum analyser, a rolling 2D spectrogram.
LibXtract - LibXtract is a simple, portable, lightweight library of audio feature extraction functions. The purpose of the library is to provide a relatively exhaustive set of feature extraction primitives that are designed to be 'cascaded' to create extraction hierarchies.
Yaafe - Yet Another Audio Feature Extractor is a toolbox for audio analysis. Easy to use and efficient at extracting a large number of audio features simultaneously. WAV and MP3 files supported, or embedding in C++, Python or Matlab applications.
Aubio - Aubio is a tool designed for the extraction of annotations from audio signals. Its features include segmenting a sound file before each of its attacks, performing pitch detection, tapping the beat and producing midi streams from live audio.
LibROSA - A python module for audio and music analysis. It is easy to use, and implements many commonly used features for music analysis.
If you do choose to follow the advice above, I recommend scikit-learn as the machine learning library. It contains a lot of classifiers you may want to use.
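For instance, a minimal sketch of that pipeline (assuming librosa and scikit-learn are installed; the file names and labels are made-up placeholders) could look like this:
import numpy as np
import librosa
from sklearn.svm import SVC

def extract_features(path):
    # Load the clip and compute MFCCs, a common compact audio feature
    audio, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    # Summarise over time so every clip yields a fixed-length vector
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labelled training files
train_files = [("door_slam_01.wav", "door_slam"),
               ("glass_break_01.wav", "glass_break")]
X = np.array([extract_features(f) for f, _ in train_files])
y = [label for _, label in train_files]

clf = SVC(kernel="rbf")
clf.fit(X, y)

# Classify a new, unknown recording
print(clf.predict([extract_features("unknown.wav")]))
In practice you would need many labelled examples per class, and you would likely cross-validate the classifier and experiment with features beyond MFCCs.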

The problem here is that music has structure, while the sounds you want to find may have very different signatures. Using the door as an example, the weight, size and material of the door alone will influence the kinds of acoustic signatures it produces. If you want to search by similarity, a bag-of-features approach may be the easy(ish) way to go. However, there are other approaches, such as sliding a window along the spectrogram of a sound and trying to match it (by similarity) against a previously recorded sound, decomposing the sound, etc.
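As a rough illustration of the sliding-window idea (librosa and numpy assumed; the file names are placeholders), you could slide the spectrogram of a short reference sound across the spectrogram of a longer recording and look for the position with the smallest distance:
import numpy as np
import librosa

ref, sr = librosa.load("reference_sound.wav", sr=22050)
rec, _ = librosa.load("long_recording.wav", sr=22050)

# Log-magnitude spectrograms (time frames along the second axis)
S_ref = librosa.amplitude_to_db(np.abs(librosa.stft(ref)))
S_rec = librosa.amplitude_to_db(np.abs(librosa.stft(rec)))

win = S_ref.shape[1]                      # width of the reference in frames
scores = []
for start in range(S_rec.shape[1] - win + 1):
    window = S_rec[:, start:start + win]
    # Euclidean distance between the reference and the current window
    scores.append(np.linalg.norm(window - S_ref))

best = int(np.argmin(scores))
print("Best match starts around frame", best)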

Related

Band-pass filter for removing noise from a set of sounds (collection of sounds) in Python

How can we use a band-pass filter to remove noise from a set of sounds (a collection of sounds), not just a single sound, using the Python programming language?
I have a set of audio recordings (for example, lung sound signals). Each audio file has an accompanying .text file. I want to remove the noise from all of the audio, then link each audio file with its .text file, and that will finish my work.
The main point is that I need to remove the noise as a preprocessing step for deep learning. How should I do it?
Please help.
There's a recipe on Scipy cookbook for a butterworth bandpass:
https://scipy-cookbook.readthedocs.io/items/ButterworthBandpass.html
You might be able to adapt that as your band-pass filter, but it depends a bit on the frequencies you want to filter out.
I'd say it would be easier to do your audio pre-processing in an audio-specific program (there are free ones out there, like Audacity) and then feed the processed data into your deep learning module. Good luck!
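Adapting that cookbook recipe to a whole set of files might look roughly like this (a sketch only; the folder names and the 100-2000 Hz pass band are assumptions you would replace with values suited to your lung-sound recordings):
import glob
import os
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

def butter_bandpass(lowcut, highcut, fs, order=5):
    # Butterworth band-pass design, as in the SciPy cookbook recipe
    nyq = 0.5 * fs
    return butter(order, [lowcut / nyq, highcut / nyq], btype="band")

os.makedirs("clean_sounds", exist_ok=True)
for path in glob.glob("raw_sounds/*.wav"):              # assumed input folder
    fs, data = wavfile.read(path)
    b, a = butter_bandpass(100.0, 2000.0, fs)           # assumed pass band in Hz
    filtered = filtfilt(b, a, data.astype(float), axis=0)   # filter along time
    out = os.path.join("clean_sounds", os.path.basename(path))
    wavfile.write(out, fs, filtered.astype(data.dtype))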

Detecting a noise in an audio stream

My goal is to be able to detect a specific noise that comes through the speakers of a PC using Python. That means the following, in pseudo code:
Sound is being played out of the speakers, by applications such as games for example,
ny "audio to detect" sound happens, and I want to detect that, and take an action
The specific sound I want to detect can be found here.
If I break that down, I believe I need two things:
A way to sample the audio that is being streamed to an audio device
I actually have this bit working, with the code found here: https://gist.github.com/renegadeandy/8424327f471f52a1b656bfb1c4ddf3e8 -- it is based on the sounddevice example plot, which I combine with an audio loopback device. This allows my code to receive a callback with the data that is played to the speakers.
A way to compare each sample with my "audio to detect" sound file.
The detection does not need to be exact; it just needs to be close. For example, there will be lots of other noises happening at the same time, so it's more about being able to detect the footprint of the "audio to detect" within an audio stream containing a variety of sounds.
Having investigated this, I found technologies mentioned in this post on SO and also this interesting article on Chromaprint. The Chromaprint article uses fpcalc to generate fingerprints, but because my "audio to detect" is around 1-2 seconds, fpcalc can't generate the fingerprint. I need something which works over shorter timespans.
Can somebody help me with the problem #2 as detailed above?
How should I attempt this comparison (ideally with a little example), based upon my sampling using sounddevice in the audio_callback function?
Many thanks in advance.
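One simple way to attempt part #2 (a rough sketch, not a tested solution; it assumes you accumulate enough callback blocks into a mono buffer at the same sample rate as the reference clip) is to cross-correlate the buffer with the "audio to detect" and trigger when the normalised peak exceeds a threshold:
import numpy as np
import soundfile as sf
from scipy.signal import correlate

# The 1-2 second clip to look for, assumed to be at the stream's sample rate
reference, ref_sr = sf.read("audio_to_detect.wav")
if reference.ndim > 1:
    reference = reference[:, 0]        # use a single channel

def detect(buffer, threshold=0.6):
    # buffer: mono samples accumulated from audio_callback; it must be at
    # least as long as the reference clip for mode="valid" to work
    ref = reference / (np.max(np.abs(reference)) + 1e-9)
    buf = buffer / (np.max(np.abs(buffer)) + 1e-9)
    corr = correlate(buf, ref, mode="valid")
    score = np.max(np.abs(corr)) / len(ref)   # crude normalisation
    return score > threshold                  # threshold is a guess to tune
Something fingerprint-based over spectrograms (closer to what Chromaprint does) would likely cope better with other sounds playing at the same time, but the idea of scoring each buffered window against the reference stays the same.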

Datasets like "The LJ Speech Dataset"

I am trying to find datasets like the LJ Speech Dataset made by Keith Ito. I need to use these datasets with Tacotron 2 (Link), so I think the datasets need to be structured in a certain way. The LJ dataset is linked directly from the Tacotron 2 GitHub page, so I think it's safe to assume it's made to work with it, and therefore that other datasets should have the same structure as LJ Speech. I downloaded the dataset and found that it's structured like this:
main folder:
-wavs
-001.wav
-002.wav
-etc
-metadata.csv: a CSV file which contains the text spoken in every .wav, in a form like this: 001.wav | hello etc.
So, my question is: are there other datasets like this one for further training?
But I think there might be problems; for example, the voice in one dataset would be different from the voice in another. Would this cause too many problems?
And could different slang or things like that also cause problems?
There are a few resources:
The main ones I would look at are Festvox (aka CMU Arctic) http://www.festvox.org/dbs/index.html and LibriVox https://librivox.org/
These guys seem to be maintaining a list:
https://github.com/candlewill/Speech-Corpus-Collection
And I am part of a project that is collecting more (shameless self plug): https://github.com/Idlak/Living-Audio-Dataset
Mozilla includes a database of several datasets you can download and use, if you don't need your own custom language or voice: https://voice.mozilla.org/data
Alternatively, you could create your own dataset following the structure you outlined in your OP. The metadata.csv file needs to contain at least two columns -- the first is the path/name of the WAV file (without the .wav extension), and the second is the text that was spoken.
Unless you are training Tacotron with speaker embeddings/a multi-speaker model, you'd want all the recordings to come from the same speaker. Ideally the audio quality should be very consistent, with a minimal amount of background noise. Some background noise can be removed using RNNoise. There's a script in the Mozilla Discourse group that you can use as a reference. All the recording files need to be short 22050 Hz, 16-bit audio clips.
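As a rough sketch of preparing such a dataset (librosa and soundfile assumed; the folder names and the transcripts dictionary are placeholders for wherever your text actually comes from), you could resample each clip to 22050 Hz 16-bit mono and write the pipe-delimited metadata.csv in one pass:
import csv
import glob
import os
import librosa
import soundfile as sf

# Placeholder transcripts; in practice these come from your own text source
transcripts = {"001": "Hello there.", "002": "A second sentence."}

os.makedirs("wavs", exist_ok=True)
with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for path in sorted(glob.glob("raw/*.wav")):          # assumed input folder
        clip_id = os.path.splitext(os.path.basename(path))[0]
        # Resample to 22050 Hz mono and store as 16-bit PCM, as LJ Speech does
        audio, _ = librosa.load(path, sr=22050, mono=True)
        sf.write(os.path.join("wavs", clip_id + ".wav"), audio, 22050, subtype="PCM_16")
        writer.writerow([clip_id, transcripts[clip_id]])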
As for slang or local colloquialisms -- I'm not sure; I suspect that as long as the word sounds match what's written (i.e. the phonemes match up), the system should be able to handle it. Tacotron is able to handle/train on multiple languages.
If you don't have the resources to produce your own recordings, you could use audio from a permissively licensed audiobook in the target language. There's a tutorial on this very topic here: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf
The tutorial has you:
Download the audio from the audiobook.
Remove any parts that aren't useful (e.g. the introduction, foreword, etc.) with Audacity.
Use Aeneas to fine-tune and then export a forced alignment between the audio and the text of the e-book, so that the audio can be exported sentence by sentence.
Create the metadata.csv file containing the map from audio to segments. (The format that the post describes seems to include extra columns that aren't really needed for training and are mainly for use by Mozilla's online database).
You can then use this dataset with systems that support LJSpeech, like Mozilla TTS.

Analyse audio files with Python

I currently have a photodiode connected to my PC and do the capturing with Audacity.
I want to improve this by using an old RPi 1 as a dedicated test station. As a result, the shutter speed should appear on the console. I would prefer a Python solution for acquiring the signal and analysing it.
Can anyone give me some suggestions? I played around with oct2py, but I don't really understand how to calculate the time between the two peaks of the signal.
I have no expertise in sound analysis with Python; this is just what I found through some internet research, as I'm interested in the topic myself.
pyAudioAnalysis
You can use pyAudioAnalysis, developed by Theodoros Giannakopoulos.
Towards your end, the function mtFileClassification() from audioSegmentation.py could be a good start. This function:
splits an audio signal into successive mid-term segments and extracts mid-term feature statistics from each of these segments, using mtFeatureExtraction() from audioFeatureExtraction.py
classifies each segment using a pre-trained supervised model
merges successive fixed-size segments that share the same class label into larger segments
visualizes statistics regarding the results of the segmentation-classification process.
For instance
from pyAudioAnalysis import audioSegmentation as aS
[flagsInd, classesAll, acc, CM] = aS.mtFileClassification("data/scottish.wav","data/svmSM", "svm", True, 'data/scottish.segments')
Note that the last argument of this function is a .segments file. This is used as ground truth (if available) in order to estimate the overall performance of the classification-segmentation method. If this file does not exist, the performance measure is not calculated. These files are simple comma-separated files of the form: start time, end time, label. For example:
0.01,9.90,speech
9.90,10.70,silence
10.70,23.50,speech
23.50,184.30,music
184.30,185.10,silence
185.10,200.75,speech
...
If I have understood your question correctly, this is at least part of what you want to generate, isn't it? Though I rather think you have to provide that file yourself.
Most of this information is quoted directly from the pyAudioAnalysis wiki, which I suggest you read. Don't hesitate to reach out, as I'm really interested in this topic myself.
Other available libraries for audio analysis:

What's a good way to examine audio with Python and split it between high, mid and low pitches for visualization?

So, I'm planning to try making a light organ with an Arduino and Python, communicating over serial to control the brightness of several LEDs. The computer will use the microphone or a playing MP3 to generate the data.
I'm not so sure how to handle the audio processing. What's a good option for Python that can take either a playing audio file or microphone data (I'd prefer the microphone), split it into different frequency ranges, and write the intensity to variables? Do I need to worry about overtones if I use the microphone?
If you're not committed to using Python, you should also look at using PureData (PD) to handle the audio analysis. Interfacing PD to the Arduino is already a solved problem, and there are a lot of pre-existing components that make working with audio easy.
Try http://wiki.python.org/moin/Audio for links to various Python audio processing packages.
The audioop package has some basic waveform manipulation functions.
See also:
Detect and record a sound with python
Detect & Record Audio in Python
Portaudio has a Python interface that would let you read data off the microphone.
For the band splitting, you could use something like a band-pass filter feeding into an envelope follower -- one filter+follower for each frequency band of interest.
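A minimal sketch of that idea (SciPy/NumPy assumed; the band edges and sample rate are guesses to tune for your setup): three Butterworth band-pass filters, each followed by an RMS level that acts as a crude envelope follower.
import numpy as np
from scipy.signal import butter, lfilter

FS = 44100                                   # assumed sample rate of the input
BANDS = {"low": (60, 250), "mid": (250, 2000), "high": (2000, 8000)}   # Hz, rough guesses

def bandpass(data, low, high, fs=FS, order=4):
    # Butterworth band-pass filter for one frequency band
    nyq = 0.5 * fs
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return lfilter(b, a, data)

def band_levels(block):
    # One value per band: the RMS of the band-passed block, acting as a
    # crude envelope follower you can map to LED brightness
    return {name: float(np.sqrt(np.mean(bandpass(block, lo, hi) ** 2)))
            for name, (lo, hi) in BANDS.items()}

# 'block' would be a chunk of microphone samples (e.g. from PyAudio or sounddevice);
# scale the returned levels to 0-255 and send them over serial to the Arduino.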
