Clustering algorithm for Voice clustering - python

What is the best clustering methodology to use in the voice domain?
For example, if we have voice utterances from multiple speakers and we need to cluster them into specific baskets where each basket corresponds to one speaker, what is the best clustering algorithm to use?

I'd suggest an RNN-LSTM. There is a great tutorial series explaining music genre classification using this kind of neural network. I've watched it and it's very didactic:
First you have to understand your audio data (take a look here). In this link he explains MFCCs (Mel Frequency Cepstral Coefficients), which let you extract features of your audio data as a spectrogram-like representation. Each amplitude of the MFCC represents a feature of the audio (e.g. features of the speaker's voice).
Then you have to preprocess the data for the classification (practical example here)
And then train your neural network to predict which speaker the audio belongs to. He shows this here, but I'd recommend you watch the entire series. I think it's the best I've seen on this topic, giving all the background, code and dataset necessary to solve such a speaker classification problem.
Hope you enjoy the links; they've really helped me and I'm sure they will answer your question.

There are two approaches here: supervised classification as Eduardo suggests, or unsupervised clustering. Supervised requires training data (audio clips labeled with who is speaking) while unsupervised does not (although you do need some labeled examples to evaluate the method). Here I'll discuss unsupervised clustering.
The biggest difference is that an unsupervised model that works for this task can be applied to audio clips from new speakers, and to any number of speakers. Supervised models will only work on the speakers, and the number of speakers, on which they were trained, which is a huge limitation.
The most important element will be a way to encode each audio clip into a fixed-length vector such that the encoding captures the information you need, namely who is speaking. If you transcribed the clips into text, this could be TF-IDF or BERT, which would pick out differences in topic, speech style, etc., but this would perform poorly if clips from different speakers come from the same conversation. There are probably pretrained encoders for voice clips that would work well here, but I'm not as familiar with those.
Clustering method: Simple k-means may work here, where k would be the number of people in the dataset, if known. If it is not known, you could use clustering metrics such as inertia and silhouette with the elbow heuristic to pick the optimal k, which may correspond to the number of speakers if your encoding is really good. Additionally, you could use a hierarchical method like agglomerative clustering if there is some inherent hierarchy in the voice clips, such as half of the people talking only about science while the other half talk only about literature, or separating first by gender or age.
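As a rough sketch of that step (assuming scikit-learn and an array X holding one fixed-length encoding per clip; the candidate range of k is illustrative, not something prescribed):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X: (n_clips, n_features) array of fixed-length clip encodings
best_k, best_score = None, -1.0
for k in range(2, 15):                          # candidate numbers of speakers
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, km.labels_)     # higher is better
    print(f"k={k}: inertia={km.inertia_:.1f}, silhouette={score:.3f}")
    if score > best_score:
        best_k, best_score = k, score

labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)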
Evaluation: Use PCA to project each fixed-length vector encoding onto 2D so you can visualize it and assign each cluster's voice clips a unique color. This will show you which clusters are more similar to each other, and the organization of these clusters will show you what features are being represented by the encodings.
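A minimal version of that visualization, assuming matplotlib and the X and labels from the clustering sketch above:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(X)   # project each encoding to 2D
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=10)
plt.title("Clip encodings in 2D, colored by cluster")
plt.show()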
Pros and Cons of Unsupervised:
Pros:
Flexible to the number of unique speakers and their voices. Meaning that if you successfully build a clusterer that groups audio clips by speaker, you can take this model and apply it to a totally different set of clips from different people, even a different number of people, and it will likely work similarly. A classifier would need to be trained on the same people, and the same number of people, that it is applied to; otherwise it will not work.
No need for large labeled dataset, only enough examples to verify the program works. You can even do this after the fact by just listening to samples in one cluster and seeing if they sound like one person.
Cons:
It may not work. You have little control over which features are represented in the embedding and thus determine cluster assignment; the only control you have is your choice of embedding method. An embedding could be as simple as the average volume of the clip, but what would work better is taking the front half of a supervised model that someone else has trained on a voice task, effectively using a hidden state from that model as your embedding. If that task is similar to yours, such as a classifier that identifies speakers, it will probably work well.
Hard to compare objectively unless you have a labeled test set.
My suggestion: If you have a labeled set of voices, use half of it to train a classifier as Eduardo suggests, use that model's hidden states as your embedding method, then send that to k-means, and use the other half of the labeled examples as a test set.
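A minimal sketch of that pipeline, assuming Keras and scikit-learn; the MFCC feature matrices, layer sizes, and the labelled/unlabelled variable names are illustrative placeholders, not anything from the question:

import numpy as np
from tensorflow import keras
from sklearn.cluster import KMeans

# X_labelled, y_labelled: MFCC vectors and speaker ids for the labelled half
# X_unlabelled: MFCC vectors for the clips you actually want to cluster
n_speakers = len(np.unique(y_labelled))
clf = keras.Sequential([
    keras.Input(shape=(X_labelled.shape[1],)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu", name="embedding"),
    keras.layers.Dense(n_speakers, activation="softmax"),
])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
clf.fit(X_labelled, y_labelled, epochs=30, verbose=0)

# Reuse the penultimate layer as the embedding, then cluster the embeddings.
embedder = keras.Model(clf.input, clf.get_layer("embedding").output)
emb = embedder.predict(X_unlabelled)
clusters = KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit_predict(emb)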

Related

Intent classification with large number of intent classes

I am working on a data set of approximately 3000 questions and I want to perform intent classification. The data set is not labelled yet, but from a business perspective there is a requirement to identify approximately 80 different intent classes. Let's assume my training data has an approximately equal number of examples per class and is not heavily skewed towards some of the classes. I intend to convert the text to word2vec or GloVe embeddings and then feed it into my classifier.
I am familiar with cases that have a smaller number of intent classes, such as 8 or 10, and with the choice of machine learning classifiers such as SVM, naive Bayes, or deep learning (CNN or LSTM).
My question is whether you have had experience with such a large number of intent classes before, and which machine learning algorithm you think would perform reasonably. Do you think that if I use deep learning frameworks, the large number of labels will still cause poor performance given the training data described above?
We need to start labelling the data, and it would be laborious to come up with 80 label classes only to realise that the model does not perform well, so I want to make the right decision about the maximum number of intent classes I should consider, and which machine learning algorithm you suggest.
Thanks in advance...
First, word2vec and GloVe are almost dead. You should probably consider using more recent embeddings like BERT or ELMo (both of which are sensitive to context; in other words, you get different embeddings for the same word in different contexts). Currently, BERT is my own preference since it's completely open-source and available (GPT-2 was released a couple of days ago and is apparently a little better, but it's not completely available to the public).
Second, when you use BERT's pre-trained embeddings, your model has the advantage of having seen a massive amount of text (Google massive) and thus can be trained on small amounts of data, which will increase its performance drastically.
Finally, if you could classify your intents into some coarse-grained classes, you could train a classifier to specify which of these coarse-grained classes your instance belongs to. Then, for each coarse-grained class train another classifier to specify the fine-grained one. This hierarchical structure will probably improve the results. Also for the type of classifier, I believe a simple fully connected layer on top of BERT would suffice.
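A rough sketch of that hierarchical setup, using frozen BERT embeddings plus scikit-learn linear classifiers rather than fine-tuning a layer on top of BERT (a simplification; the model name, the mean-pooling choice and the coarse/fine label lists are assumptions for illustration):

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Mean-pool the last hidden state into one vector per question.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc).last_hidden_state          # (batch, tokens, 768)
    mask = enc["attention_mask"].unsqueeze(-1)
    return ((out * mask).sum(1) / mask.sum(1)).numpy()

# train_texts: list of questions; coarse_labels / fine_labels: assumed label lists
X = embed(train_texts)
coarse_clf = LogisticRegression(max_iter=1000).fit(X, coarse_labels)

fine_clfs = {}                                       # one fine-grained classifier per coarse class
for c in set(coarse_labels):
    idx = [i for i, y in enumerate(coarse_labels) if y == c]
    fine_clfs[c] = LogisticRegression(max_iter=1000).fit(X[idx], [fine_labels[i] for i in idx])

def predict(text):
    v = embed([text])
    c = coarse_clf.predict(v)[0]
    return c, fine_clfs[c].predict(v)[0]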

NLP - which technique to use to classify labels of a paragraph?

I'm fairly new to NLP and trying to learn the techniques that can help me get my job done.
Here is my task: I have to classify stages of a drilling process based on text memos.
I have to classify labels for "Activity", "Activity Detail" and "Operation" based on what's written in the "Com" column.
I've been reading a lot of articles online, and all the different techniques I've read about really confuse me.
The buzz words that I'm trying to understand are
Skip-gram (prediction based method, Word2Vec)
TF-IDF (frequency based method)
Co-Occurrence Matrix (frequency based method)
I am given about ~40,000 rows of data (pretty small, I know), and I came across an article that says neural-net based models like Skip-gram might not be a good choice if I have a small amount of training data. So I was also looking into frequency-based methods. Overall, I am unsure which technique is best for me.
Here's what I understand:
Skip-gram: a technique used to represent words in a vector space. But I don't understand what to do next once I have vectorized my corpus.
TF-IDF: tells how important each word is in each sentence. But I still don't know how it can be applied on my problem
Co-occurrence matrix: I don't really understand what it is.
All the three techniques are to numerically represent texts. But I am unsure what step I should take next to actually classify labels.
What approach & sequence of techniques should I use to tackle my problem? If there's any open source Jupyter notebook project, or link to an article (hopefully with codes) that did the similar job done, please share it here.
Let's make things a bit clearer. Your task is to create a system that will predict labels for given texts, right? Label prediction (classification) can't be done directly on unstructured data (texts), so you need to make your data structured and then train and apply your classifier. Therefore, you need to build two separate components:
Text vectorizer (as you said, it helps to numerically represent texts).
Classifier (to predict the labels for numerically represented texts).
Skip-gram and a co-occurrence matrix are ways to vectorize your texts (here is a nice article that explains their difference). In the case of skip-gram, you could download and use a third-party model that already maps most words to vectors; in the case of a co-occurrence matrix, you need to build it from your own texts (if your texts use specific vocabulary, this may work better). In this matrix you can use different measures to represent the degree of co-occurrence of words with words or documents with documents. TF-IDF is one such measure (it gives a score for every word-document pair); there are many others (PMI, BM25, etc.). This article should help you implement classification with a co-occurrence matrix on your data, and this one gives an idea of how to do the same with Word2Vec.
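For concreteness, here is a minimal sketch of the vectorizer + classifier pipeline with scikit-learn, assuming your memos are in a DataFrame df with the "Com" and "Activity" columns mentioned in the question (the model and parameters are just a reasonable starting point, not something from the linked articles):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["Com"], df["Activity"], test_size=0.2, random_state=0)

# TF-IDF turns each memo into a sparse vector; the linear SVM predicts the label.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), LinearSVC())
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))

# Repeat with df["Activity Detail"] and df["Operation"] as targets.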
Hope it helped!

Clustering a set of images

I have a folder with hundreds/thousands of images, some of which look alike. I would like to create clusters separating those images (those which look alike in the same cluster).
I can't determine the number of clusters that will be needed, it depends on the images.
Does anyone have an idea on how to do this using Python, OpenCV and which algorithm to use?
I've done some research and found that AffinityPropagation or DBSCAN could be useful, but I don't know where to start (how to encode my images, what I should pass to those algorithms, etc.).
Unfortunately it is not that simple with images, since naive clustering would result in clusters of images with the same colors, not the same "content". You can use a neural network as a feature extractor for the images; I see two options:
Use a pre-trained network and get the features from an intermediate layer
Train an autoencoder on your dataset, and use the latent features
Option 1 is cheaper since you can easily find pre-trained models; option 2 is much more computationally expensive but should work better, especially if there is no pre-trained model for your domain. A sketch of option 1 is given below.
This tutorial (randomly found on the internet) seems to be a good introduction to method 2.
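A minimal sketch of option 1, assuming tensorflow/keras and scikit-learn; the folder path, image format and the DBSCAN parameters are placeholders you would need to tune:

import glob
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# Pre-trained CNN with the classification head removed: outputs a 2048-d feature vector.
model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def features(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x, verbose=0)[0]

paths = sorted(glob.glob("images/*.jpg"))                # placeholder folder
X = normalize(np.array([features(p) for p in paths]))    # L2-normalize before distance-based clustering

# DBSCAN does not need the number of clusters in advance; eps/min_samples need tuning.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
for path, label in zip(paths, labels):
    print(label, path)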

Predicting Energy Consumption of different buildings

I have a dataset (you can find the updated file here) containing many different characteristics of office buildings, including their surface area and the number of people working in them. In total there are about 200 records. I want to use an algorithm, trained on the dataset above, to predict the electricity consumption (given in the column 'kwh') of a building that is not in the set.
I have tried most of the applicable machine learning algorithms in the scikit-learn library in Python (linear regression, Ridge, Lasso, SVC, etc.) to predict a continuous variable. Surface area and number of workers had a correlation with the target variable of between 0.3 and 0.4, so I assumed they were good features and included them in training the model. However, I got a mean absolute error of about 13350 and an R-squared value of about 0.22-0.35, which is not good at all.
I would be very grateful if someone could give me some advice, or if you could examine the dataset a little and run some algorithms on it. What type of preprocessing should I use, and what type of algorithm? Is the number of records too low to train a regression model for predicting continuous variables?
Any feedback would be helpful as I am new to machine learning :)
The first thing that should be done in these kinds of machine learning problems is to understand the data. Yes, the number of features in your dataset is small, and yes, the number of data samples is very small, but it is important to do the best we can with what we have.
The dataset header is in a language other than English; it is important to convert it to a language most people in the community will understand (in this case English). After a bit of tinkering, I found out that the language used is Dutch.
There are some key features missing from the dataset, from something as obvious as the number of floors in the building to something less obvious like the number of working hours. Surface area and the number of workers seem to me to be the most important features, but you are missing out on a feature called building_function, which (after using Google Translate) tells you what the purpose of the building is. Intuitively, this should have a large correlation with the power consumption; industrial buildings tend to use more power than normal households. After translation, I found that the main types were Residential, Office, Accommodation and Meeting. This feature thus has to be encoded as a nominal (categorical) variable to train the model.
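As an illustration of that encoding step (the file name and the column names surface_area, num_workers, building_function and kwh are placeholders for whatever the translated headers end up being; the model choice is only a reasonable default):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("buildings.csv")                         # placeholder path
# One-hot encode the nominal building_function feature alongside the numeric ones.
X = pd.get_dummies(df[["surface_area", "num_workers", "building_function"]],
                   columns=["building_function"])
y = df["kwh"]

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)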
Another feature, hoofsbi, also seems to have some variance, but I do not know what that feature means.
If you could translate the headers in the data and share it, I will be able to provide you some code to perform this regression task. It is very important in such tasks to understand what the data is and thus perform feature engineering.

twitter/facebook comments classification into various categories

I have a dataset of comments which I want to classify into five categories:
jewelries, clothes, shoes, electronics, food & beverages
So if someone is talking about pork, steak, wine, soda or eating, it is classified into food & beverages,
whereas if somebody is talking about, say, gold, pendants, lockets, etc., it is classified into jewelries.
I want to know what tags/tokens I should be looking for in a comment/tweet in order to classify it into any of these categories, and finally which classifier to use. I just need some guidance and suggestions; I'll take it from there.
Please help. Thanks
This answer can be a bit long and perhaps I abstract a few things away, but it's just to give you an idea and some advice.
Supervised Vs Unsupervised
As others have already mentioned, in the land of machine learning there are two main roads: supervised and unsupervised learning. As you probably already know by now, if your corpus (documents) is labeled, you are talking about supervised learning. The labels are the categories, and in this case they are boolean values.
For instance if a text is related to clothes and shoes the labels for those categories should be true.
Since a text can be related to multiple categories (multiple labels), we are looking at multi-label classification.
What to use?
I presume that the dataset is not yet labeled, since twitter does not do this categorisation for you. So here comes a big decision on your part.
You label the data manually, which means you look at as many tweets/FB messages in your dataset as you can, and for each of them you consider the 5 categories and answer True/False for each.
You decide to use an unsupervised learning algorithm and hope that it discovers these 5 categories. Approaches like clustering will just try to find categories on their own, and these don't necessarily match your 5 predefined categories.
I've used quite some supervised learning in the past and have had good experience with this type of learning, therefore I will continue explaining this path.
Feature Engineering
You have to come up with the features that you want to use. For text classification, a good approach is to use each possible word in the documents as a feature. A value of True indicates that the word is present in the document; False indicates absence.
Before doing this, you need to do some preprocessing. This can be done by using various features provided by the NLTK library.
Tokenization this will break your text up into a list of words. You can use this module.
Stopword removal this will remove common words from the tokens, words like 'a', 'the', ... You can take a look at this.
Stemming stemming will transform words to their stem-form. For example: the words 'working','worked','works' will be transformed to 'work'. Take a look at this.
Once you have preprocessed the data, generate a featureset over the words that exist in the documents. There are automatic methods and filters for this, but I'm not sure how to do it in Python; one possible way is sketched below.
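A small sketch of the preprocessing and featureset steps with NLTK (assuming a list documents of (text, labels) pairs; the helper names are illustrative):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                                  # tokenization
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # stopword removal
    return [stemmer.stem(t) for t in tokens]                              # stemming

# Boolean "word present" features, as described above.
vocabulary = sorted({w for text, _ in documents for w in preprocess(text)})

def featureset(text):
    words = set(preprocess(text))
    return {w: (w in words) for w in vocabulary}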
Classification
There are multiple classifiers that you can use for this purpose. I suggest taking a deeper look at the ones that exist and their benefits. You can use the NLTK classifier, which supports multi-label classification, but to be honest I have never tried that one before. In the past I've used Logistic Regression and SVM.
Training & testing
You will use part of your data for training and part for validating whether the trained model performs well. I suggest you use cross-validation, because you will have a small dataset (you have to label the data manually, which is cumbersome). The benefit of cross-validation is that you don't have to split your dataset into a single training set and testing set; instead it runs multiple rounds, each time using one part of the data for training and another part for testing, so that all of the data is used at least once for training. A short sketch is given below.
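One way the classification and cross-validation could be wired together with scikit-learn (a sketch only, reusing featureset and documents from the preprocessing sketch above; the choice of Logistic Regression and micro-averaged F1 are assumptions):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

X = [featureset(text) for text, _ in documents]
y = MultiLabelBinarizer().fit_transform([labels for _, labels in documents])  # one column per category

# One binary classifier per category (multi-label), scored with 5-fold cross-validation.
clf = make_pipeline(DictVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000)))
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_micro")
print("cross-validated micro-F1:", scores.mean())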
Predicting
Once your model is built and the predictions on the test data look plausible, you can use your model in the wild to predict the categories of new Facebook messages/tweets.
Tools
The NLTK library is great for preprocessing and natural language processing, but I have never used it for classification. I've heard a lot of great things about the scikit-learn Python library. But to be honest, I prefer to use Weka, a data mining tool written in Java, which offers a great UI and speeds up your task a lot!
From a different angle: Topic modelling
In your question you state that you want to classify the dataset into five categories. I would like to show you the idea of topic modelling. It might not be useful in your scenario if you are really only targeting those categories (that's why I leave this part at the end of my answer). However if your goal is to categorise the tweets/fb messages into non-predefined categories, topic modelling is the way to go.
Topic modelling is an unsupervised learning method where you decide in advance the number of topics (categories) you want to 'discover'. This number can be high (e.g. 40). The cool thing is that the algorithm will find 40 topics whose words are related to each other. It will also output, for each document, a distribution indicating which topics the document is related to. This way you can discover a lot more categories than your 5 predefined ones.
I'm not going to go much deeper into this, but just google it if you want more information. In addition, you could consider using MALLET, which is an excellent tool for topic modelling.
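If you want to try it in Python rather than MALLET, a tiny sketch with gensim's LDA could look like this (the number of topics is illustrative; preprocess and documents are from the sketches above):

from gensim import corpora, models

texts = [preprocess(text) for text, _ in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=40, id2word=dictionary, passes=10)
print(lda.print_topics(num_topics=5, num_words=8))   # inspect a few discovered topics
print(lda[corpus[0]])                                 # topic distribution of the first document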
Well this is kind of a big subject.
You mentioned Python, so you should have a look at the NLTK library which allows you to process natural language, such as your comments.
After this step, you should have a classifier which maps the words you retrieved to a certain class. NLTK also has tools for classification, and it is linked to knowledge databases. If you are lucky, the categories you are looking for are already available; otherwise you may have to build them yourself. You can have a look at this example, which uses NLTK and the WordNet database. You have access to the synsets, which seem to be pretty broad, and you can also look at the hypernyms (see for example list(dog.closure(hyper))).
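Spelled out, that WordNet example looks roughly like this (NLTK assumed; you may need nltk.download("wordnet") on first use):

from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]               # first synset for 'dog'
hyper = lambda s: s.hypernyms()          # move one level up the hierarchy
print(list(dog.closure(hyper)))          # all hypernyms up to the root (canine, carnivore, ..., entity)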
Basically you should consider using a multi-label classifier on the whole tokenized text (comments on Facebook and tweets are usually short; you might also decide to only consider FB comments below 200 characters, your choice). The choice of a multi-label classifier is motivated by the non-orthogonality of your classification set (clothes, shoes and jewelries can be the same object; you could have electronic jewelry [i.e. smartwatches], etc.). This is a fairly simple setup, but it's an interesting first step whose strengths and weaknesses will allow you to iterate easily (if needed).
Good luck!
What you're looking for is in the subject of
Natural Language Processing (NLP) : processing text data and
Machine learning (where the classification models are built)
First I would suggest going through NLP tutorials and then text classification tutorials, the most appropriate being https://class.coursera.org/nlp/lecture
If you're looking for libraries available in python or java, take a look at Java or Python for Natural Language Processing
If you're new to text processing, please take a look at the NLTK library that provides a nice introduction to doing NLP, see http://www.nltk.org/book/ch01.html
Now to the hard core details:
First, ask yourself whether you have twitter/facebook comments (let's call them documents from now on) that are manually labelled with the categories you want.
1a. If YES, look at supervised machine learning, see http://scikit-learn.org/stable/tutorial/basic/tutorial.html
1b. If NO, look at UNsupervised machine learning; I suggest clustering and topic modelling, http://radimrehurek.com/gensim/
After knowing which kind of machine learning you need, split the documents up into at least a training (70-90%) and a testing (10-30%) set.
Note: I say "at least" because there are other ways to split up your documents, e.g. for a development set or cross-validation. (If you don't understand this, it's all right; just follow step 2.)
Finally, Train and Test your model
3a. If supervised, use the training set to train your supervised model. Apply your model onto the test set and then see how well you performed.
3b. If unsupervised, use the training set to generate document clusters (that means grouping similar documents), but they still have no labels, so you need to think of some smart way to label the groups of documents correctly. (To date, there is no really good solution to this; even highly effective neural networks cannot tell you what their neurons are firing for, only that each neuron fires for something specific.)
