I'm preparing an experiment, and I want to write a Python program to recognize certain words spoken by the participants.
I have searched a lot about speech recognition in Python, but the results (e.g. CMUSphinx) are complicated.
What I want is a program that receives a sound file (containing only one word, not in English), where I tell the program what the sound means and what output I want to see.
I have seen the sklearn example on recognizing hand-written digits, and I want to know if I can do something similar:
train the program to return a certain output (e.g. a number) from sound files of different people saying the same word;
when given new sound files of another person saying the same word, return the same value.
Can I do this with Python and sklearn?
If so, where should I start?
Thank you!
I've written a similar program for text recognition. I can tell you that if you choose to "teach" your program manually, you will have a lot of work; think about the variation in voices due to accents, etc.
You could start by looking for a sound analyzer (Musical Analysis). Try to identify the waveform of a simple word like "yes" and write an algorithm that measures the percentage variation between sound files. That way you can build in a margin to protect yourself against false positives and false negatives.
You might also need to remove background noise from the sound files first, as it may interfere with your identification patterns.
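To connect this back to the sklearn digits example from the question: below is a minimal sketch, assuming the recordings are WAV files sorted into one folder per word and using librosa for MFCC features (librosa and the folder layout are assumptions, not part of the question), that trains a classifier the same way the digits example does.

```python
# Sketch: classify single-word recordings with sklearn, analogous to the digits example.
# Assumes a layout like data/<label>/<recording>.wav (hypothetical paths).
import glob
import os
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def mfcc_vector(path, sr=16000, n_mfcc=13):
    """Load a short recording and summarize it as a fixed-length MFCC vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Average over time so every file yields the same number of features.
    return mfcc.mean(axis=1)

X, labels = [], []
for wav in glob.glob("data/*/*.wav"):
    X.append(mfcc_vector(wav))
    labels.append(os.path.basename(os.path.dirname(wav)))  # folder name = word label

X = np.array(X)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
print("prediction for a new file:", clf.predict([mfcc_vector("new_recording.wav")]))
```

Averaging the MFCCs over time is a crude way to get a fixed-length vector; padding or truncating to a fixed number of frames usually works better, but averaging keeps the sketch close to the digits workflow.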
I want to write code to analyze short protein sequences and determine their similarity. I have no reference sequence; rather, I want to write some sort of for loop to compare them all to each other, to see how many duplicate sequences I have as well as regions where they are similar.
I currently have all of the sequences in a CSV file.
I have taken a bioinformatics course and have done something similar with Illumina sequencing data, but I started from an SRA table and had FASTA files.
Also, I am trying to use CD-HIT, but I am running into problems with the makefile and the compatibility of my compiler. I installed Homebrew to get around the issue, but I am still running into the problem, and the make CXX=g++-9 CC=gcc-9 command won't work.
I was wondering if there is a more up-to-date method than CD-HIT, because I have noticed that hardly anyone has used CD-HIT since 2020.
Also, the only languages I know are R and shell, but I am currently learning Python.
https://bioinfo.lifl.fr/yass/index.php
I have used it for SARS-CoV-2 and found similarity to many viruses.
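Since the question mentions currently learning Python, here is a minimal sketch of the all-vs-all comparison loop using only the standard library; the file name sequences.csv and the column name sequence are assumptions, and difflib's ratio is a rough string similarity, not a real alignment.

```python
# Sketch: count duplicate protein sequences and score pairwise similarity.
# Assumes a file sequences.csv with a header row and a "sequence" column (hypothetical).
import csv
import itertools
from collections import Counter
from difflib import SequenceMatcher

with open("sequences.csv", newline="") as f:
    seqs = [row["sequence"].strip().upper() for row in csv.DictReader(f)]

# Exact duplicates
dupes = {s: n for s, n in Counter(seqs).items() if n > 1}
print(f"{len(dupes)} sequences appear more than once")

# All-vs-all similarity; ratio() is a rough measure, not a biological alignment.
for (i, a), (j, b) in itertools.combinations(enumerate(seqs), 2):
    m = SequenceMatcher(None, a, b)
    ratio = m.ratio()
    if ratio > 0.8:  # arbitrary threshold
        block = m.find_longest_match(0, len(a), 0, len(b))
        print(f"seq {i} vs seq {j}: similarity {ratio:.2f}, "
              f"longest shared region {block.size} residues")
```

For biologically meaningful similarity (substitution matrices, gaps), Biopython's pairwise aligner or an external tool like the one above would be the more standard choice, but the loop structure stays the same.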
I have a number of .mp3 files which all start with a short voice introduction followed by piano music. I would like to remove the voice part and just be left with the piano part, preferably using a Python script. The voice part is of variable length, i.e. I cannot use ffmpeg to remove a fixed number of seconds from the start of each file.
Is there a way of detecting the start of the piano part so that I know how many seconds to remove, using ffmpeg or even Python itself?
Thank you
This is a non-trivial problem if you want a good outcome.
Quick and dirty solutions would involve inferred parameters like:
"there's usually 15 seconds of no or low-db audio between the speaker and the piano"
"there's usually not 15 seconds of no or low-db audio in the middle of the piano piece"
and then use those parameters to try to get something "good enough" using audio analysis libraries.
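For example, here is a rough sketch of that quick-and-dirty idea using pydub's silence detection; the 5-second gap and the -40 dBFS threshold are exactly the kind of inferred parameters described above and will need tuning per collection.

```python
# Sketch: cut everything before the first long silence gap, assuming that gap
# separates the spoken intro from the piano. Thresholds are guesses to tune.
# pydub needs ffmpeg installed to decode mp3.
from pydub import AudioSegment
from pydub.silence import detect_silence

audio = AudioSegment.from_mp3("example.mp3")  # hypothetical file name

# Find silences of at least 5 s that are quieter than -40 dBFS.
silences = detect_silence(audio, min_silence_len=5000, silence_thresh=-40)

if silences:
    start_ms, end_ms = silences[0]   # first long gap = end of the intro
    piano = audio[end_ms:]           # keep everything after it
    piano.export("example_piano.mp3", format="mp3")
else:
    print("No long silence found; this file needs manual inspection.")
```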
I suspect you'll be disappointed with that approach, given that I can think of many piano pieces with long pauses, and this reads like a classic ML problem.
The best solution here is to use ML with a classification model and a large data set. Here's a walk-through that might help you get started. However, this isn't going to be a few minutes of coding. This is a typical ML task that will involve collecting and tagging lots of data (or having access to pre-tagged data), building an ML pipeline, training a neural net, and so forth.
Here's another link that may be helpful. He's using a pretrained model to reduce the amount of data required to get started, but you're still going to put in quite a bit of work to get this going.
I've built a simple CNN word detector that is accurately able to predict a given word when using a 1-second .wav as input. As seems to be the standard, I'm using the MFCC of the audio files as input for the CNN.
However, my goal is to be able to apply this to longer audio files with multiple words being spoken, and to have the model predict if and when a given word is spoken. I've been searching online for the best approach, but I seem to be hitting a wall, and I apologize if the answer could easily have been found through Google.
My first thought is to cut the audio file into several overlapping 1-second windows, and then convert each window into an MFCC and use these as input for the model's prediction.
My second thought would be instead to use onset detection to try to isolate each word, pad it if it is shorter than 1 second, and then feed these as input for the model's prediction.
Am I way off here? Any references or recommendations would be hugely appreciated. Thank you.
Cutting the audio up into analysis windows is the way to go, and it is common to use some overlap. The MFCC features can be calculated first and then split using the integer number of frames that gets you closest to the window length you want (1 s).
See "How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all the audios)?" for example code.
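Below is a minimal sketch of that overlapping-window segmentation, assuming 1-second windows with 50% overlap and using librosa for the MFCCs; the file name is a placeholder, and model stands in for the already-trained CNN.

```python
# Sketch: slide a 1-second window (with 50% overlap) over a longer recording
# and compute an MFCC patch per window for the existing word detector.
import numpy as np
import librosa

SR = 16000            # must match the sample rate the detector was trained on
WIN = SR              # 1-second window
HOP = SR // 2         # 50% overlap

y, _ = librosa.load("long_recording.wav", sr=SR)  # hypothetical file

patches, offsets = [], []
for start in range(0, max(len(y), 1), HOP):
    chunk = y[start:start + WIN]
    if len(chunk) < WIN:
        chunk = np.pad(chunk, (0, WIN - len(chunk)))  # zero-pad windows past the end
    mfcc = librosa.feature.mfcc(y=chunk, sr=SR, n_mfcc=13)
    patches.append(mfcc)
    offsets.append(start / SR)                        # window start time in seconds

X = np.stack(patches)[..., np.newaxis]                # shape: (windows, n_mfcc, frames, 1)
# scores = model.predict(X)                           # model is the already-trained CNN
# A word is "detected at offsets[i]" when scores[i] exceeds a chosen threshold.
```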
I am trying to find datasets like the LJ Speech Dataset made by Keith Ito. I need to use these datasets with Tacotron 2 (Link), so I think they need to be structured in a certain way. The LJ dataset is linked directly from the Tacotron 2 GitHub page, so I think it's safe to assume it is made to work with it, and that other datasets should therefore have the same structure. I downloaded the dataset and found that it is structured like this:
main folder:
-wavs
    -001.wav
    -002.wav
    -etc.
-metadata.csv: a CSV file that contains everything said in each .wav, in a form like **001.wav | hello etc.**
So, my question is: Are There other datasets like this one for further training?
But I think there might be problems; for example, the voice in one dataset will be different from the voice in another. Would this cause too many problems?
And could different slang, or things like that, also cause problems?
There are a few resources:
The main ones I would look at are Festvox (aka CMU ARCTIC) http://www.festvox.org/dbs/index.html and LibriVox https://librivox.org/
These guys seem to be maintaining a list:
https://github.com/candlewill/Speech-Corpus-Collection
And I am part of a project that is collecting more (shameless self plug): https://github.com/Idlak/Living-Audio-Dataset
Mozilla includes a database of several datasets you can download and use, if you don't need your own custom language or voice: https://voice.mozilla.org/data
Alternatively, you could create your own dataset following the structure you outlined in your OP. The metadata.csv file needs to contain at least two columns -- the first is the path/name of the WAV file (without the .wav extension), and the second column is the text that has been spoken.
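As an illustration, here is a minimal sketch that writes a metadata.csv in that pipe-delimited layout; the clip ids and transcripts are placeholders for your own data.

```python
# Sketch: write an LJSpeech-style metadata.csv (pipe-delimited, id without .wav).
# The clip ids and transcripts below are placeholders.
clips = [
    ("001", "hello world"),
    ("002", "this is the second recording"),
]

with open("metadata.csv", "w", encoding="utf-8") as f:
    for clip_id, text in clips:
        # LJSpeech adds a normalized transcription as a third column; if the text
        # is already plain words, it can simply repeat the second column.
        f.write(f"{clip_id}|{text}|{text}\n")
```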
Unless you are training Tacotron with a speaker embedding/multi-speaker model, you'd want all the recordings to be from the same speaker. Ideally, the audio quality should be very consistent, with a minimum amount of background noise. Some background noise can be removed using RNNoise. There's a script in the Mozilla Discourse group that you can use as a reference. All the recording files need to be short 22050 Hz, 16-bit audio clips.
As for slang or local colloquialisms -- not sure; as long as the word sounds match what's written (i.e. the phonemes match up), I would expect the system to be able to handle it. Tacotron is able to handle/train on multiple languages.
If you don't have the resources to produce your own recordings, you could use audio from a permissively licensed audiobook in the target language. There's a tutorial on this very topic here: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf
The tutorial has you:
Download the audio from the audiobook.
Remove any parts that aren't useful (e.g. the introduction, foreword, etc.) with Audacity.
Use Aeneas to fine-tune and then export a forced alignment between the audio and the text of the e-book, so that the audio can be exported sentence by sentence (see the sketch after this list).
Create the metadata.csv file containing the map from audio to segments. (The format that the post describes seems to include extra columns that aren't really needed for training and are mainly for use by Mozilla's online database).
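For step 3, Aeneas can also be driven directly from Python rather than from the command line; here is a rough sketch based on its documented library usage, with placeholder paths and English assumed as the task language.

```python
# Rough sketch of the forced-alignment step as an Aeneas library call.
# Paths and the task language are placeholders for your own setup.
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = u"/path/to/audiobook.mp3"
task.text_file_path_absolute = u"/path/to/book_text.txt"
task.sync_map_file_path_absolute = u"/path/to/syncmap.json"

# Run the alignment and write the sync map (start/end time per text fragment).
ExecuteTask(task).execute()
task.output_sync_map_file()
```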
You can then use this dataset with systems that support LJSpeech, like Mozilla TTS.
I'm new to programming and do not have much experience yet. I understand some Python code, but not in detail.
I have an Excel file which contains log files of problems people encountered. The description of each problem is pasted in as an email (so it's a bunch of text). I want to analyze all of these texts (almost 1,000 rows in Excel) at once, and I think Python can do this.
The type of analysis I want to do is sentiment analysis (positive, neutral, negative), or I want to extract the main problem from the text. I don't know if the second one is possible.
I copied the emails listed in the Excel file to a .txt file, so now every line is one message. How can I use Python to analyze each line as one message and have it show me the sentiment or the main problem?
I'd appreciate the help
Sentiment analysis is a fairly large problem in computer science/natural language processing. How specific do you want to get?
I'd recommend looking into Text-Processing for simple SA.
Their API docs are here: http://text-processing.com/docs/sentiment.html. The API returns simple pos and neg scores for your text.
If you want anything more specific, I'd recommend looking into IBM Watson, specifically Natural Language Understanding: https://www.ibm.com/watson/developercloud/natural-language-understanding.html
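As a starting point, here is a minimal sketch that loops over the lines of the .txt file described in the question and sends each one to the sentiment endpoint from those docs; the file name messages.txt is a placeholder, and the free endpoint is rate-limited.

```python
# Sketch: score each line of messages.txt with the Text-Processing sentiment API.
# One line = one message, as described in the question.
import requests

API_URL = "http://text-processing.com/api/sentiment/"

with open("messages.txt", encoding="utf-8") as f:
    for line in f:
        message = line.strip()
        if not message:
            continue
        resp = requests.post(API_URL, data={"text": message})
        resp.raise_for_status()
        result = resp.json()   # e.g. {"label": "neg", "probability": {...}}
        print(f"{result['label']}\t{message[:60]}")
```

Finding "the main problem" in each email is a different task (keyword extraction or topic modelling rather than sentiment), so the sketch above only covers the first half of the question.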