How to train CNN on common voice dataset - python

I am trying to train a CNN with the Common Voice dataset. I am new to speech recognition and have not been able to find any resources on how to use the dataset with Keras. I followed this article to build a simple word classification network, but I want to scale it up to the Common Voice dataset. Any help is appreciated.
Thank you

What you can do is look at MFCCs (Mel-frequency cepstral coefficients). In short, these are features extracted from the audio waveform using signal processing techniques that approximate the way humans perceive sound. In Python, you can use python-speech-features to compute MFCCs.
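As a minimal sketch (assuming a mono WAV file; the file name and frame parameters are placeholders), extracting MFCCs might look like this:

```python
# Minimal MFCC extraction sketch with python_speech_features.
# "clip.wav" and the frame parameters are placeholders to adapt to your data.
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("clip.wav")          # load a mono WAV file
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01,  # 25 ms windows with a 10 ms hop
                numcep=13)                   # 13 coefficients per frame
print(features.shape)                        # (num_frames, 13)
```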
Once you have prepared your data, you can build a CNN; for example, something like the sketch below.
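This is only a rough Keras sketch assuming fixed-size MFCC inputs; the input shape, filter counts, and number of classes are placeholders you would adapt to your data:

```python
# Rough CNN sketch for classifying fixed-size MFCC "images" with Keras.
# Input shape (99 frames x 13 coefficients) and num_classes are placeholders.
from tensorflow.keras import layers, models

num_classes = 10
model = models.Sequential([
    layers.Input(shape=(99, 13, 1)),              # MFCC matrix as a 1-channel image
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```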
You can also use RNNs (LSTM or GRU for example), but this is a bit more advanced.
EDIT: A very good dataset to start, if you want:
Speech Commands Dataset

Related

How to create pre-trained model for LSTM with non-image data in python?

I have Data A from accelerometer and gyroscope sensors like this
I want to create a pre-trained model to be used for classification with the LSTM method using Data A in Python. Is that possible? From what I have read, pre-trained models are used for image data and use methods such as CNNs for classification. In addition, I tried to find data that has gone through a pre-training process but have not found any, so I doubt whether it is possible.
And if I do the classification using an LSTM, can I use Data A with such a pre-trained model?
Is there a tutorial I can study? Any help would be very much appreciated.
I advise you to look at some HAR (human activity recognition) datasets; you can find plenty of examples on Kaggle that use an LSTM. The first one that comes to mind is the excellent tutorial by Jason Brownlee: https://machinelearningmastery.com/how-to-develop-rnn-models-for-human-activity-recognition-time-series-classification/
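If it helps to see the shape of such a model, here is a minimal Keras LSTM sketch for windowed sensor data; the window length, channel count, and class count are assumptions, not taken from your Data A:

```python
# Minimal LSTM sketch for windowed accelerometer/gyroscope data.
# timesteps, channels, and num_classes are placeholders.
from tensorflow.keras import layers, models

timesteps, channels, num_classes = 128, 6, 6   # e.g. 128 samples x 6 sensor axes
model = models.Sequential([
    layers.Input(shape=(timesteps, channels)),
    layers.LSTM(100),
    layers.Dropout(0.5),
    layers.Dense(100, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer class labels
              metrics=["accuracy"])
```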

Clustering algorithm for Voice clustering

What is the best clustering methodology we can use in the voice domain?
For example, if we have voice utterances from multiple speakers and we need to cluster them into specific baskets where each basket corresponds to one speaker, what is the best clustering algorithm we can use?
I'd suggest an RNN-LSTM. There is a great tutorial explaining music genre classification using this kind of network. I've watched it and it's very didactic:
First you have to understand your audio data (take a look here). In this link he explains MFCCs (Mel Frequency Cepstral Coefficients), which allow you to extract features of your audio data into a spectrogram. In the image below, each amplitude of the MFCC represents a feature of the audio (e.g. features of the speaker's voice).
Then you have to preprocess the data for the classification (practical example here)
Then train your neural network to predict which speaker the audio belongs to. He shows how here, but I'd recommend you watch the entire series. I think it's the best I've seen on this topic, giving all the background, code and data necessary to solve such a speaker classification problem.
Hope you enjoy the links; they've really helped me and I'm sure they will answer your question.
There are two approaches here: supervised classification as Eduardo suggests, or unsupervised clustering. Supervised requires training data (audio clips labeled with who is speaking) while unsupervised does not (although you do need some labeled examples to evaluate the method). Here I'll discuss unsupervised clustering.
The biggest difference is that an unsupervised model that works for this task can be applied to audio clips from new speakers, and any number of speakers. Supervised models will only work on the speakers, and the number of speakers, on which they were trained, which is a huge limitation.
The most important element is a way to encode each audio clip into a fixed-length vector such that the encoding contains the information you need, namely who is speaking. If you transcribed the clips into text, this could be TF-IDF or BERT, which would pick out differences in topic, speech style, etc., but this would perform poorly if the clips of different speakers come from the same conversation. There are probably pretrained encoders for voice clips that would work well here, though I'm less familiar with those.
Clustering method: Simple k-means may work here, where k would be the number of people in the dataset if known. If it is not known, you could use clustering metrics such as inertia and silhouette with the elbow heuristic to pick the optimal k, which may correspond to the number of speakers if your encoding is really good. Additionally, you could use a hierarchical method like agglomerative clustering if there is some inherent hierarchy in the voice clips, such as half of the people talking only about science while the other half talk only about literature, or separating first by gender or age.
Evaluation: Use PCA to project each fixed-length vector encoding onto 2D so you can visualize it and assign each cluster's voice clips a unique color. This will show you which clusters are more similar to each other, and the organization of these clusters will show you what features are being represented by the encodings.
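To make the clustering and evaluation steps concrete, here is a minimal sketch assuming you already have an (n_clips, dim) array of embeddings; the file name, k range, and plotting details are placeholders:

```python
# Sketch: cluster fixed-length voice embeddings with k-means, pick k by silhouette,
# then project to 2D with PCA for a visual sanity check.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

embeddings = np.load("voice_embeddings.npy")   # placeholder path, shape (n_clips, dim)

best_k, best_score = None, -1.0
for k in range(2, 11):                          # candidate numbers of speakers
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score

labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(embeddings)
points = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=10)
plt.title(f"k-means clusters (k={best_k}) in PCA space")
plt.show()
```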
Pros and Cons of Unsupervised:
Pros:
Flexible to the number of unique speakers and their voices. If you successfully build a clusterer that groups audio clips by speaker, you can apply the same approach to a totally different set of clips from different people, even a different number of people, and it will likely work similarly. A classifier would need to be trained on the same people, and the same number of people, that it is applied to; otherwise it will not work.
No need for a large labeled dataset, only enough examples to verify that the approach works. You can even do this after the fact by listening to samples in one cluster and checking whether they sound like one person.
Cons:
It may not work. You have little control over which features are represented in the embedding, and thus over what determines cluster assignment; the only way to control this is by choosing an embedding method that captures the right information. An embedding could be as simple as the average volume of the clip, but what would work better is taking the front half of a supervised model that someone else has trained on a voice task, effectively taking a hidden state from that model and using it as your embedding. If that task is similar to yours, such as a classifier that identifies the speaker, it will probably work well.
Hard to evaluate objectively unless you have a labeled test set.
My suggestion: If you have a labeled set of voices, use half of it to train a classifier as Eduardo suggests, use that model's hidden states as your embedding method, feed those embeddings to k-means, and use the other half of the labeled examples as a test set.

Transfer learning for facial identification using classifiers

I wish to know whether I can use an Inception or ResNet model to identify faces. I want to know whether transfer learning and training is even considerable for my task.
I just want to be able to identify faces but I am also curious whether I can retrain/optimize a pre-trained model for my task.
Or have I been reading things wrong; do I need a pre-trained model that was designed specifically for faces?
I have tried poking around with Inception and VGG16 but I have not trained them for faces. I am working on it but I want to know whether this is even viable or simply a waste of time. If I use transfer learning with FaceNet I think I'll be better off.
Transfer learning for facial recognition would be a great way to go. And yes, transfer learning with FaceNet is a great idea.
Also, for transfer learning to work, the model does not have to have been pre-trained only on faces, as FaceNet was. A model pre-trained on ImageNet would also be pretty darn good! This is a very hot topic, so do not try to reinvent the wheel. There are many repositories that have already done this using transfer learning from the ImageNet dataset with ResNet50, with astonishingly good results.
Here is a link to one such repository:
https://github.com/loheden/face_recognition_with_siamese_network
Also note that Siamese networks are a technique that is especially good for the facial recognition use case. The concept is really simple: take two images and compare their features. If the similarity of the features is above a set threshold, the two images match (the two faces are the same); otherwise they do not (the face is not recognized).
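To make the idea concrete, here is a rough Keras sketch of that comparison on top of an ImageNet-pretrained ResNet50; the input size, projection size, and thresholding step are assumptions, not code from the linked repository:

```python
# Rough Siamese sketch: a shared embedder built on a frozen ResNet50 backbone,
# with the distance between two embeddings used to decide "same face or not".
import tensorflow as tf
from tensorflow.keras import layers, models, applications

def build_embedder():
    base = applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
    base.trainable = False                       # transfer learning: freeze the backbone
    inp = layers.Input(shape=(224, 224, 3))
    x = applications.resnet50.preprocess_input(inp)
    x = base(x)
    x = layers.Dense(128)(x)                     # small projection head for face features
    return models.Model(inp, x)

embedder = build_embedder()

img_a = layers.Input(shape=(224, 224, 3))
img_b = layers.Input(shape=(224, 224, 3))
emb_a, emb_b = embedder(img_a), embedder(img_b)
distance = layers.Lambda(
    lambda t: tf.norm(t[0] - t[1], axis=1, keepdims=True))([emb_a, emb_b])
siamese = models.Model([img_a, img_b], distance)
# Train with a contrastive loss on (same-face, different-face) pairs, then declare
# "same person" when the predicted distance falls below a chosen threshold.
```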
Here is a research paper on siamese networks for facial recognition.
Also, here is a two-part tutorial on how to implement the siamese network for facial recognition using transfer learning:
http://www.loheden.com/2018/07/face-recognition-with-siamese-network.html
http://www.loheden.com/2018/07/face-recognition-with-siamese-network_29.html
The above tutorial's code is in the first Github link I shared at the beginning of this answer.

how to predict non discrete value from training a dataset of images?

I'm doing some research on predicting sound-wave speed using thousands of pictures of rocks. Each picture of a rock layer is tagged with a value of sound-wave speed. I'm trying to use TensorFlow to train on this dataset, but it's not a classification problem; it's more like a regression problem, except the training data are images rather than numeric data. I'm a beginner with TensorFlow. Please, could someone show me how to solve this problem?
If you're looking for a step-by-step guide on how to obtain a regressed continuous output from an image-based input using TensorFlow, take a look at this guide:
https://www.pyimagesearch.com/2019/01/28/keras-regression-and-cnns/
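In case it helps to see the core idea, here is a minimal Keras sketch of a CNN with a single continuous output; the input size and hyperparameters are placeholders:

```python
# Minimal sketch of a CNN that regresses one continuous value (e.g. wave speed)
# from an image. Input size and layer sizes are placeholders.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="linear"),        # single continuous output, no softmax
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# model.fit(train_images, train_speeds, validation_split=0.2, epochs=20)
```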
Cheers!

Machine Learning - Features design for Images

I just started learning about machine learning recently and have a project where I have to develop a program for QR code localization so that a QR code can be detected and read at any angle of rotation. Development will be done in Python.
The plan is to gather various images of the QR codes at different angles with different backgrounds. From this I would like to create a dataset for training with neural networks and then testing.
The issue that I'm having is that I can't seem to figure out a correct feature design for the dataset and how to identify the QR code from the images for feature processing. Would I use ground-truth images to isolate the QR code or edge magnitude maps? Feature design for images seems to confuse me.
Any help with this would be amazing. Thanks for your time.
You mention that you want to train neural networks. Instead of starting with your problem, start with a beginner example.
Start with the MNIST example for deep learning.
Then train your neural network on the notMNIST dataset used in the Udacity Deep Learning course.
In these two examples, you will see that you do not design features yourself; the network learns the right features on its own. The easiest approach would be to use the same technique for the QR codes in your dataset.
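As a starting point, here is a minimal Keras sketch of the MNIST example; the same recipe could later be pointed at a labeled QR-code dataset:

```python
# Minimal MNIST example in Keras: the convolutional layers learn the features
# themselves rather than requiring hand-designed features.
from tensorflow.keras import layers, models, datasets

(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0             # add channel dim, scale to [0, 1]
x_test = x_test[..., None] / 255.0

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))
```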
