I am new to clustering and doing a minor project on clustering tweets. I used TF-IDF and then hierarchical clustering. I am confused about setting the threshold value for hierarchical clustering: what should its value be, and how do I decide it?
I used the Python scikit-learn module for the implementation.
While several methods exist to help decide when to terminate hierarchical clustering (or clustering in general), there is no single best way to do this. This stems from the fact that there is no "correct" clustering of arbitrary data. Rather, "correctness" is very domain- and application-specific.
So while you can try out different methods (e.g., the elbow method, among others), they will in turn have their own parameters that you will have to "tune" to obtain a clustering that you deem "correct". This video might help you out a bit (though it mainly deals with k-means, the concepts extend to other clustering approaches): https://www.youtube.com/watch?v=3JPGv0XC6AE
I assume you are talking about choosing the amount of clusters to extract from your hierarchical clustering algorithm. There are several ways of doing this, and there is a nice Wikipedia article about it for some theory: http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
For practical examples take a look at this question: Tutorial for scipy.cluster.hierarchy
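For example, with scipy.cluster.hierarchy you can build the linkage matrix, inspect the dendrogram, and then cut the tree at a distance threshold. A minimal sketch (the random data and the threshold of 1.5 are purely illustrative; read a sensible cut height off your own dendrogram):

```python
# A minimal sketch, assuming your TF-IDF vectors are available as a dense array X.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 5)            # stand-in for your (dense) TF-IDF matrix

Z = linkage(X, method="ward")         # build the merge tree

# Plot the dendrogram and look for a height where cutting gives sensible groups.
dendrogram(Z)
plt.show()

# Cut the tree at a chosen distance threshold; every merge above it is undone.
threshold = 1.5                       # hypothetical value, read off your dendrogram
labels = fcluster(Z, t=threshold, criterion="distance")
print(labels)
```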
I have N three-dimensional vectors (x, y, z).
I want a simple yet effective approach for clustering these vectors (I do not know a priori the number of clusters, nor can I guess a valid number). I am not familiar with classical machine learning so any advice would be helpful.
The general sklearn clustering page does a decent job of providing useful background on clustering methods and a nice overview of the differences between them. Importantly for your case, the table in section 2.3.1 lists the parameters of each method.
The differences between methods tend to come down to how well your knowledge of the dataset matches the assumptions of each model. Some expect you to know the number of clusters up front (such as K-Means), while others will attempt to determine the number of clusters from other input parameters (like DBSCAN).
While focusing on methods which attempt to find the number of clusters seems like it might be preferable, it is also possible to use a method which expects the number of clusters and simply test many different reasonable cluster counts to determine which one is optimal. One such example with K-Means is this.
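For instance, a minimal sketch of that idea, looping over candidate values of K and scoring each clustering with the silhouette coefficient (the data here is made up):

```python
# Try several cluster counts and keep the one with the best silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 3)                 # stand-in for your N (x, y, z) vectors

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```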
The easiest clustering algorithms are probably K-Means (if your three features are numerical) and K-Medoids (which allows any type of feature).
These algorithms are quite easy to understand. In a few words, by calculating some distance measure between the observations in the dataset, they try to assign each observation to the cluster closest (in distance) to it. The main issue with these algorithms is that you have to specify how many clusters (K) you want, but there are techniques such as the elbow method or the silhouette score that allow you to determine numerically which value of K would be a reasonable number of clusters.
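A small sketch of the elbow method with K-Means (random stand-in data; plot the inertia and look for the point where the curve flattens out):

```python
# Elbow method: inertia (within-cluster sum of squares) for a range of K.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 3)    # stand-in for your numerical features

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K")
plt.ylabel("inertia")
plt.show()
```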
I'm quite new to clustering and the like, so I'm a bit lost on the final bit of programming. I'm working on a project which clusters students based on the semantic similarity of their topics using a hierarchical algorithm.
What I understand is that I have to collect all the topics in a list and then apply a clustering technique such as hierarchical clustering.
How can I write code in Python to cluster the students based on the semantic similarity between their topics?
There are lots of clustering algorithms. I suggest NLMF (non-linear matrix factorisation), which is efficient and widely used.
BASIC: If you just want to use Python to achieve clustering, have a look at the Python library nimfa: https://nimfa.biolab.si/
There are many others.
You'll have to massage your input data so it fits the expected input format; a sketch of the overall flow is below.
ADVANCED: If you want to understand, learn and maybe code an existing algorithm, look at these slides: https://perso.telecom-paristech.fr/essid/teach/NMF_tutorial_ICME-2014.pdf
RESEARCH TOPIC: If you want to do your own algorithm, I can't help you in this SO answer ;)
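To make the BASIC route concrete, here is a rough sketch of the overall flow. It uses scikit-learn's NMF purely as a stand-in for nimfa, and the student/topic data is entirely made up:

```python
# Sketch: vectorise each student's topics, factorise the matrix, and treat
# the strongest latent component per student as that student's "cluster".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Hypothetical data: one string of topics per student.
students_topics = [
    "machine learning neural networks",
    "deep learning computer vision",
    "databases sql query optimisation",
    "sql nosql data warehousing",
]

X = TfidfVectorizer().fit_transform(students_topics)   # students x terms matrix

model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(X)        # student-to-component weights

clusters = W.argmax(axis=1)       # cluster = component with the largest weight
print(clusters)
```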
In my case I have a dataset of letters and symbols detected in an image. The detected items are represented by their coordinates, type (letter, number, etc.), value and orientation, not by the actual bounding box in the image. My goal is to use this dataset to group them into different "words", or contextual groups in general.
So far I achieved ok-ish results by applying classic unsupervised clustering with the DBSCAN algorithm, but this is still limited to the geometric distance between samples, so the resulting groups cannot resemble the "words" I am aiming for. I am therefore searching for a way to influence the results of the clustering algorithm using the knowledge I have about the "word-like" nature of the clusters I need.
The approach I thought of was to create a dataset of true and false clusters and train an SVM (or any classifier) to detect whether a proposed cluster is correct or not. But I have no solid proof that I can train a model well enough to discriminate between good and bad clusters, and I find it difficult to efficiently and consistently represent the clusters based on the features of their members. Moreover, since my "testing data" would be a huge number of all possible combinations of the letters and symbols I have, the whole approach seems too complicated to attempt without any indication that it will work in the end.
To conclude, my question is whether anyone has prior experience with this kind of task (in my mind it sounds like a rather simple task, but apparently it is not). Do you know of any supervised clustering algorithm, and if so, what is the proper way to represent clusters of data so that you can efficiently train a model with them?
Any idea/suggestion or even hint towards where I can research about it will be much appreciated.
There are papers on supervised clustering. A nice, clear one is Eick et al., which is available for free. Unfortunately, I do not think any off-the-shelf libraries in python support this. There is also this in the specific realm of text, but it is a much more domain-specific approach compared to Eick.
But there is a very simple solution that is effectively a type of supervised clustering. Decision Trees essentially chop feature space into regions of high-purity, or at least attempt to. So you can do this as a quick type of supervised clustering:
Create a Decision Tree using the label data.
Think of each leaf as a "cluster."
In sklearn, you can retrieve the leaves of a Decision Tree by using the apply() method.
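A small sketch of that recipe (toy data; the max_depth value is just an example):

```python
# Decision tree as a quick form of supervised clustering: leaves = clusters.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(100, 4)                 # features of candidate items
y = np.random.randint(0, 2, size=100)      # your "good"/"bad" labels

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# apply() returns, for each sample, the index of the leaf it falls into;
# treat each leaf index as a cluster id.
leaf_ids = tree.apply(X)
print(leaf_ids[:10])
```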
A standard approach would be to use the dendrogram.
Then merge branches only if they agree with your positive examples and don't violate any of your negative examples.
I have a dataset (~80k rows) that contains a comma-separated list of tags (skills), for example:
python, java, javascript,
marketing, communications, leadership,
web development, node.js, react
...
Some lists are as short as one skill, others as long as 50+ skills. I would like to cluster groups of skills together (intuitively, people in the same cluster would have a very similar set of skills).
First, I use CountVectorizer from sklearn to vectorise the lists of words and perform dimensionality reduction with SVD, reducing it to 50 dimensions (from 500+). Finally, I perform KMeans clustering with n=50, but the results are not optimal: groups of skills clustered together seem to be very unrelated.
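For reference, my current pipeline boils down to roughly this (simplified to three example rows; on the real data I use 50 components and 50 clusters):

```python
# CountVectorizer -> TruncatedSVD -> KMeans, as described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rows = [
    "python, java, javascript",
    "marketing, communications, leadership",
    "web development, node.js, react",
]

# Split on commas so multi-word skills like "web development" stay intact.
vectorizer = CountVectorizer(
    tokenizer=lambda s: [t.strip() for t in s.split(",") if t.strip()],
    token_pattern=None,
)
X = vectorizer.fit_transform(rows)

X_reduced = TruncatedSVD(n_components=2).fit_transform(X)   # 50 on the real data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)  # 50 on the real data
print(labels)
```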
How should I go about improving the results? I'm also not sure if SVD is the most appropriate form of dimension reduction for this use case.
I would start with the following approaches:
If you have enough data, try something like word2vec to get an embedding for each tag. You can use pre-trained models, but it is probably better to train on your own data since it has its own semantics. Make sure you have an OOV embedding for tags that don't appear often enough. Then use K-Means, agglomerative hierarchical clustering, or other known clustering methods.
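A rough sketch of that approach, assuming gensim 4.x (where the embedding size parameter is vector_size) and toy tag lists:

```python
# Train tag embeddings on the tag lists, then cluster the tag vectors.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Each row of your data becomes one "sentence" of tags.
tag_lists = [
    ["python", "java", "javascript"],
    ["marketing", "communications", "leadership"],
    ["web development", "node.js", "react"],
]

model = Word2Vec(sentences=tag_lists, vector_size=50, window=5,
                 min_count=1, sg=1, epochs=50)

tags = model.wv.index_to_key          # all tags that made it into the vocabulary
vectors = model.wv[tags]              # one embedding per tag

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for tag, label in zip(tags, labels):
    print(label, tag)
```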
I would construct a weighted undirected graph, where each tag is a node, and edge weights represent the number of times two tags appeared in the same list. Once the graph is constructed, I would use a community detection algorithm for clustering. NetworkX is a very nice Python library that lets you do that.
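A sketch of that idea with NetworkX (the choice of greedy modularity as the community detection algorithm is mine, and the data is made up):

```python
# Build a tag co-occurrence graph and detect communities in it.
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

tag_lists = [
    ["python", "java", "javascript"],
    ["python", "javascript", "react"],
    ["marketing", "communications", "leadership"],
]

G = nx.Graph()
for tags in tag_lists:
    for a, b in combinations(sorted(set(tags)), 2):
        # Edge weight = number of lists in which the two tags co-occur.
        w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=w)

communities = greedy_modularity_communities(G, weight="weight")
for i, community in enumerate(communities):
    print(i, sorted(community))
```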
For any approach (including yours), don't give up before you do some hyper-parameter tuning. Maybe all you need is a smaller representation, or another K (for the KMeans).
Good luck!
TF-IDF, cosine similarity, etc. only work well for very long texts, where the vectors can be seen as modelling a term frequency distribution with reasonable numeric accuracy. For short texts, this is not reliable enough to produce useful clusters.
Furthermore, k-means needs to put every record into a cluster. But what about nonsense data, say someone whose only skill is "Klingon"?
Instead, use
Frequent Itemset Mining
This makes perfect sense for tags. It identifies groups of tags that occur frequently together. So one pattern is, e.g., "python, sklearn, numpy", and the cluster is all the users that have these skills.
Note that these clusters will overlap, and some records may be in no cluster at all. That is of course harder to use, but for most applications it makes sense that records can belong to multiple clusters, or to none.
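For example, a sketch of frequent itemset mining on the skill lists; the mlxtend package and the min_support value are my choices here, not part of the answer:

```python
# Find groups of skills that frequently occur together.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [
    ["python", "sklearn", "numpy"],
    ["python", "numpy", "pandas"],
    ["marketing", "communications"],
    ["python", "sklearn"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Each frequent itemset is a "pattern"; the users containing it form an (overlapping) cluster.
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
print(itemsets)
```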
Recently I worked on image clustering, which found similar images and grouped them together. I used Python's skimage module to calculate SSIM and then clustered all images based on a chosen threshold.
I want to do something similar for text. I want to create automatic clusters containing similar text. For example, cluster 1 could have all text that represents working mothers, cluster 2 could have all text representing people talking about food, and so on. I understand this has to be unsupervised learning. Are there similar Python modules that could help achieve this task? I also checked out Google's TensorFlow to see if I could get something from it, but did not find anything relating to text clustering in its documentation.
There are numerous ways you can approach the task. In most cases the clustering algorithms are very similar to those used for image clustering, but what you need to define is the distance metric, in this case a semantic similarity metric of some kind.
For this purpose you can use the approaches I list in another question on the topic of semantic similarity (even if that one is a bit more detailed).
One additional approach worth mentioning is the "automatic clustering" provided by topic modelling tools like LSA, which you can run fairly easily using the gensim package.
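For example, a rough sketch of LSA with gensim on toy documents (the documents and parameter values are illustrative); documents that share their strongest topic can be treated as a cluster:

```python
# LSA (LSI) topic model with gensim; each document's strongest topic acts as its cluster.
from gensim import corpora, models

docs = [
    "working mom balancing job and kids",
    "mothers juggling work and family",
    "best street food and restaurants",
    "people talking about food recipes",
]
texts = [d.lower().split() for d in docs]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

for i, doc_bow in enumerate(corpus):
    print(i, lsi[doc_bow])   # (topic id, weight) pairs; the strongest topic is the "cluster"
```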