I am a newbie in text mining; here is my situation.
Suppose I have a list of words ['car', 'dog', 'puppy', 'vehicle'] and I would like to cluster them into k groups; I want the output to be [['car', 'vehicle'], ['dog', 'puppy']].
I first calculate the similarity score of each pair of words to obtain a 4x4 matrix (in this case) M, where Mij is the similarity score of words i and j.
After transforming the words into numeric data, I use a clustering library (such as sklearn), or implement the clustering myself, to get the word clusters.
Does this approach make sense? Also, how do I determine the value of k? More importantly, I know that there exist different clustering techniques; should I use k-means or k-medoids for word clustering?
Following up on the answer by Brian O'Donnell: once you've computed the semantic similarity with word2vec (or FastText or GloVe, ...), you can then cluster the matrix using sklearn.cluster. I've found that for small matrices, spectral clustering gives the best results.
It's worth keeping in mind that the word vectors are often embedded on a high-dimensional sphere. K-means with a Euclidean distance matrix fails to capture this, and may lead to poor results for the similarity of words that aren't immediate neighbors.
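As a rough sketch of that pipeline, here is spectral clustering run on a precomputed similarity matrix built from pre-trained embeddings (the model name is just one example from gensim's downloader; any word2vec/GloVe vectors would do):

    import numpy as np
    from sklearn.cluster import SpectralClustering
    import gensim.downloader as api

    words = ['car', 'dog', 'puppy', 'vehicle']

    # Pre-trained vectors; the model name is only an example.
    model = api.load('glove-wiki-gigaword-50')

    # Pairwise cosine similarities, shifted into [0, 1] so the matrix can be
    # used directly as a non-negative affinity matrix.
    vectors = np.array([model[w] for w in words])
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    M = (vectors @ vectors.T + 1.0) / 2.0

    labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                                random_state=0).fit_predict(M)

    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(label, []).append(word)
    print(list(clusters.values()))  # e.g. [['car', 'vehicle'], ['dog', 'puppy']]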
If you want to cluster words by their "semantic similarity" (i.e. likeness of their meaning), take a look at Word2Vec and GloVe. Gensim has an implementation of Word2Vec. The web page "Word2Vec Tutorial" by Radim Rehurek shows how to use Word2Vec to determine similar words.
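As a rough illustration of the gensim API (the toy corpus below is made up purely to show the calls; on such a tiny corpus the similarity numbers are essentially noise):

    from gensim.models import Word2Vec

    # Toy corpus: one tokenized sentence per list.
    sentences = [
        ['the', 'dog', 'chased', 'the', 'puppy'],
        ['the', 'puppy', 'barked', 'at', 'the', 'dog'],
        ['she', 'drove', 'the', 'car', 'to', 'work'],
        ['the', 'vehicle', 'was', 'parked', 'next', 'to', 'the', 'car'],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

    # Cosine similarity between two words, and the nearest neighbours of a word.
    print(model.wv.similarity('dog', 'puppy'))
    print(model.wv.most_similar('car', topn=3))

In practice you would train on a large corpus or load one of the pre-trained models.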
Adding on to what's already been said regarding similarity scores, finding k in clustering applications is generally aided by scree plots (also known as "elbow curves"). In these plots, you usually have some measure of cluster dispersion on the y-axis and the number of clusters on the x-axis. The elbow of the curve, i.e. the point where its slope changes most sharply, gives you a more objective measure of cluster "uniqueness."
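A minimal sketch of such an elbow plot with scikit-learn, using synthetic data in place of your word vectors:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data stands in for your word/document vectors.
    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    ks = range(1, 11)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]

    plt.plot(ks, inertias, marker='o')
    plt.xlabel('number of clusters k')
    plt.ylabel('within-cluster sum of squares (inertia)')
    plt.title('Scree / elbow plot')
    plt.show()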
Related
I have N three-dimensional vectors
(x,y,z)
I want a simple yet effective approach for clustering these vectors (I do not know a priori the number of clusters, nor can I guess a valid number). I am not familiar with classical machine learning so any advice would be helpful.
The general Sklearn clustering page does a decent job of providing useful background on clustering methods and gives a nice overview of the differences between them. Importantly for your case, the table in section 2.3.1 lists the parameters of each method.
The differences between methods tend to come down to how well your knowledge of the dataset matches the assumptions of each model. Some expect you to know the number of clusters (such as K-Means), while others will attempt to determine the number of clusters from other input parameters (like DBSCAN).
While focusing on methods which attempt to find the number of clusters might seem preferable, it is also possible to use a method which expects the number of clusters and simply test many different reasonable values to determine which one is optimal. One such example with K-Means is this; a sketch of the idea follows.
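A hedged sketch of that idea, scoring each candidate k with the silhouette coefficient:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # Synthetic 3-D points stand in for the real (x, y, z) vectors.
    X, _ = make_blobs(n_samples=500, n_features=3, centers=5, random_state=0)

    scores = {}
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)

    best_k = max(scores, key=scores.get)
    print(scores)
    print('best k by silhouette:', best_k)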
The easiest clustering algorithms are probably K-Means (if your three features are numerical) and K-Medoids (which allows any type of features).
These algorithms are quite easy to understand. In a few words, they calculate some distance measure between the observations of the dataset and try to assign each observation to the cluster closest (in distance) to it. The main issue with these algorithms is that you have to specify how many clusters (K) you want, but there are techniques such as the elbow method or silhouette analysis that let you determine numerically which value of K would be a reasonable number of clusters.
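For completeness, a small sketch of both fitted on toy (x, y, z) data; K-Medoids is not in scikit-learn itself, so this assumes the scikit-learn-extra package is installed:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is installed

    # A handful of made-up (x, y, z) observations forming three rough groups.
    X = np.array([[0.1, 0.2, 0.1],
                  [0.2, 0.1, 0.0],
                  [5.0, 5.1, 4.9],
                  [5.2, 4.8, 5.0],
                  [9.9, 0.1, 0.2],
                  [10.1, 0.0, 0.1]])

    print(KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X))
    print(KMedoids(n_clusters=3, random_state=0).fit_predict(X))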
I have an array of thousands of doc2vec vectors with 90 dimensions. For my current purposes I would like to find a way to "sample" the different regions of this vector space, to get a sense of the diversity of the corpus. For example, I would like to partition my space into n regions, and get the most relevant word vectors for each of these regions.
I've tried clustering with hdbscan (after reducing the dimensionality with UMAP) to carve the vector space at its natural joints, but it really doesn't work well.
So now I'm wondering whether there is a way to sample the "far out regions" of the space (n vectors that are most distant from each other).
Would that be a good strategy?
How could I do this?
Many thanks in advance!
Wouldn't a random sample from all vectors necessarily encounter any of the various 'regions' in the set?
If there are "natural joints" and clusters to the documents, some clustering algorithm should be able to find the N clusters; then the smaller number of NxN distances between each pair of cluster centroids might identify those "furthest out" clusters.
Note that for any vector, you can use the Doc2Vec doc-vectors' most_similar() with a falsy topn value (in recent gensim versions, topn=None) to get the (unsorted) similarities to all other model doc-vectors. You could then find the least-similar vectors in that set. If your dataset is small enough that it is practical to do this for all (or a large sampling) of the doc-vectors, then perhaps the docs that appear in the "bottom N" least-similar set for the largest number of other vectors would be the most "far out".
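A hedged sketch of that tally, assuming model is an already-trained gensim Doc2Vec model (note this is quadratic in the number of documents, so it is only practical for modest corpora):

    from collections import Counter
    import numpy as np

    # `model` is assumed to be a trained gensim Doc2Vec model.
    def far_out_candidates(model, bottom_n=10):
        # Count how often each doc appears in another doc's bottom-N similarities.
        counts = Counter()
        for tag in model.dv.index_to_key:
            sims = model.dv.most_similar(tag, topn=None)  # unsorted sims to all docs
            for idx in np.argsort(sims)[:bottom_n]:       # the N least similar docs
                counts[model.dv.index_to_key[idx]] += 1
        return counts.most_common()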
Whether this idea of "far out" is actually shown in the data, or useful, isn't clear. (In high-dimensional spaces, everything can be quite "far" from everything else in ways that don't match our 2d/3d intuitions, and slight differences in some vectors being a little "further" might not correspond to useful distinctions.)
I have a dataset (~80k rows) that contains a comma-separated list of tags (skills), for example:
python, java, javascript,
marketing, communications, leadership,
web development, node.js, react
...
Some lists are as short as one skill, others as long as 50+ skills. I would like to cluster groups of skills together (intuitively, people in the same cluster would have a very similar set of skills).
First, I use CountVectorizer from sklearn to vectorize the lists of words and perform dimensionality reduction using SVD, reducing the data to 50 dimensions (from 500+). Finally, I perform KMeans clustering with n=50, but the results are not optimal: groups of skills clustered together seem to be very unrelated.
How should I go about improving the results? I'm also not sure if SVD is the most appropriate form of dimension reduction for this use case.
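For reference, the pipeline described above roughly corresponds to something like this (toy rows and small parameter values stand in for the real ~80k rows, 50 SVD components, and 50 clusters):

    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer

    # Toy rows standing in for the real comma-separated skill lists.
    docs = ['python, java, javascript',
            'python, javascript, react, web development',
            'marketing, communications, leadership',
            'marketing, leadership, sales']

    # Split on commas instead of the default word tokenizer.
    vec = CountVectorizer(tokenizer=lambda s: [t.strip() for t in s.split(',') if t.strip()],
                          token_pattern=None)
    X_counts = vec.fit_transform(docs)

    # The real setup reduces 500+ features to 50; the toy vocabulary is tiny.
    X_reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(X_counts)

    # n_clusters=50 in the real setup; 2 for this toy example.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
    print(dict(zip(docs, labels)))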
I would start with the following approaches:
If you have enough data, try something like word2vec to get an embedding for each tag. You can use pre-trained models, but it is probably better to train on your own data since it has unique semantics. Make sure you have an OOV embedding for tags that don't appear enough times. Then use K-means, agglomerative hierarchical clustering, or other known clustering methods.
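A rough sketch of that first idea, treating each person's tag list as a "sentence" for word2vec and then clustering the tag vectors (the tag lists are toy placeholders, and on data this small the clusters are meaningless):

    from gensim.models import Word2Vec
    from sklearn.cluster import KMeans

    # One list of tags per row of the dataset (toy placeholder data).
    tag_lists = [['python', 'java', 'javascript'],
                 ['python', 'javascript', 'react'],
                 ['marketing', 'communications', 'leadership'],
                 ['marketing', 'leadership', 'sales']]

    w2v = Word2Vec(tag_lists, vector_size=50, window=10, min_count=1, seed=0)

    tags = list(w2v.wv.index_to_key)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(w2v.wv[tags])
    print(dict(zip(tags, labels)))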
I would construct a weighted undirected graph, where each tag is a node and edges represent the number of times two tags appeared in the same list. Once the graph is constructed, I would use a community detection algorithm for clustering. Networkx is a very nice Python library that lets you do that.
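A hedged sketch of the graph approach, using greedy modularity maximization as one of the community detection algorithms networkx provides:

    from collections import Counter
    from itertools import combinations

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # One list of tags per row of the dataset (toy placeholder data).
    tag_lists = [['python', 'java', 'javascript'],
                 ['python', 'javascript', 'react'],
                 ['marketing', 'communications', 'leadership'],
                 ['marketing', 'leadership', 'sales']]

    # Edge weight = number of lists in which the two tags co-occur.
    weights = Counter()
    for tags in tag_lists:
        weights.update(combinations(sorted(set(tags)), 2))

    G = nx.Graph()
    G.add_weighted_edges_from((a, b, w) for (a, b), w in weights.items())

    communities = greedy_modularity_communities(G, weight='weight')
    print([sorted(c) for c in communities])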
For any approach (including yours), don't give up before you do some hyper-parameter tuning. Maybe all you need is a smaller representation, or another K (for the KMeans).
Good luck!
TF-IDF, cosine similarity, etc. only work well for fairly long texts, where the vectors can be seen as modeling a term-frequency distribution with reasonable numeric accuracy. For short texts, this is not reliable enough to produce useful clusters.
Furthermore, k-means needs to put every record into a cluster. But what about nonsense data - say someone with the only skill "Klingon"?
Instead, use
Frequent Itemset Mining
This makes perfect sense for tags. It identifies groups of tags that occur frequently together. So one pattern is, e.g., "python, sklearn, numpy", and the cluster is all the users that have these skills.
Note that these clusters will overlap, and some records may be in no cluster at all. That is of course harder to use, but for most applications it makes sense that records can belong to multiple clusters, or to none.
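A small sketch of what frequent itemset mining looks like in practice, assuming the mlxtend package for its apriori implementation (the transactions are toy placeholders):

    import pandas as pd
    from mlxtend.frequent_patterns import apriori
    from mlxtend.preprocessing import TransactionEncoder

    # Each row is one person's tag list (toy placeholder data).
    transactions = [['python', 'sklearn', 'numpy'],
                    ['python', 'numpy', 'pandas'],
                    ['marketing', 'communications'],
                    ['python', 'sklearn']]

    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                          columns=te.columns_)

    # Itemsets that appear in at least 50% of the rows.
    itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
    print(itemsets)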
How do I cluster only the words in a given set of data? I have been going through a few algorithms online, like the k-means algorithm, but they seem to be about document clustering rather than word clustering. Can anyone suggest a way to cluster only the words in a given set of data?
Please note I am new to Python.
Since my last answer was off the mark (it was about document clustering and not word clustering), here is the real answer.
What you are looking for is word2vec.
Indeed, word2vec is a Google tool based on deep learning that works really well. It transforms words into vector representations, and therefore allows you to do multiple things with them.
For example, one thing that works well is algebraic relations between words:
vector('puppy') - vector('dog') + vector('cat') is close to vector('kitten')
vector('king') - vector('man') + vector('woman') is close to vector('queen')
What this means is that it can, to some extent, capture the context of a word, which is why it works really well for numerous applications.
When you have vectors instead of words, you can pretty much do anything you want. You can for example do a k-means clustering with a cosine distance as the measure of dissimilarity...
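For example, a hedged sketch of the "k-means with cosine dissimilarity" idea: sklearn's KMeans only minimizes Euclidean distance, but L2-normalizing the vectors first makes Euclidean distance track cosine distance (a common approximation of spherical k-means). The pre-trained model name below is just one example from gensim's downloader:

    import numpy as np
    from sklearn.cluster import KMeans
    import gensim.downloader as api

    words = ['car', 'vehicle', 'dog', 'puppy', 'cat', 'kitten']

    model = api.load('glove-wiki-gigaword-50')  # example pre-trained vectors
    vectors = np.array([model[w] for w in words])

    # L2-normalize so Euclidean distances reflect cosine distances.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    print(dict(zip(words, labels)))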
Hope this answers your question. You can read more about word2vec in various papers or websites if you'd like; I won't link them here since it is not the subject of the question.
Word clustering will be really disappointing because the computer does not understand language.
You could use levenshtein distance and then do hierarchical clustering.
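A rough sketch of what that would look like, with a small pure-Python edit-distance function and scipy's hierarchical clustering (keep the caveat below in mind: the clusters reflect spelling, not meaning):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    words = ['dog', 'fog', 'cat', 'car', 'cart']
    n = len(words)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = levenshtein(words[i], words[j])

    # Average-linkage hierarchical clustering on the condensed distance matrix.
    Z = linkage(squareform(dist), method='average')
    print(dict(zip(words, fcluster(Z, t=2, criterion='distance'))))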
But:
dog and fog have a distance of 1, i.e. are highly similar.
dog and cat have 3 out of 3 letters different.
So unless you can define a good measure of similarity, don't cluster words.
I am new to clustering and need some advice on how to approach this problem...
Let's say I have thousands of sentences, but a few from the sample could be:
Experience In Networking
STRONG Sales Experience
Strong Networking Skills Preferred
Sales Expertise REquired
Chocolate Apples
Jobs are crucial for Networking Majors
In order to cluster these the best way, what approach could I take?
I have looked into k-means with word vectorization, but with thousands of sentences that may all contain different words, would it be efficient to build a vector of that size and then go through each sentence to see which of those words it contains?
What other approaches are out there that I have not found?
What I have done so far:
Imported the sentences from CSV into a dict of {id: sentence}
I am removing stop words from each sentence
I am then counting all words individually to build a master vector, keeping a count of how many times each word appears.
There are two related (but technique-wise distinct) questions here; the first relates to the choice of clustering technique for this data.
The second, prerequisite question relates to the data model: for each sentence in the raw data, how do you transform it into a data vector suitable as input to a clustering algorithm?
Clustering Technique
k-means is probably the most popular clustering technique, but there are many better ones; still, consider how k-means works: the user selects a small number of data points from the data (the cluster centers for the initial iteration of the k-means algorithm, aka centroids). Next, the distance between each data point and the set of centroids is determined, and each data point is assigned to the centroid it is closest to; then new centroids are determined from the mean value of the data points assigned to the same cluster. These two steps are repeated until some convergence criterion is reached (e.g., between two consecutive iterations, the centroids' combined movement falls below some threshold).
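To make that two-step loop concrete, here is a bare-bones (and deliberately unoptimized) sketch of the assign/update iteration just described:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # Minimal k-means: assign points to the nearest centroid, then move each
        # centroid to the mean of its assigned points, until centroids stop moving.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
        for _ in range(n_iter):
            # Step 1: assign each point to its closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 2: recompute each centroid as the mean of its assigned points.
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):  # convergence criterion
                break
            centroids = new_centroids
        return labels, centroids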
The better clustering techniques do much more than just move the cluster centers around. For instance, spectral clustering techniques rotate and stretch/squeeze the data to find a single axis of maximum variance, then determine additional axes orthogonal to the first and to each other, i.e., a transformed feature space. PCA (principal component analysis), LDA (linear discriminant analysis), and kernel PCA are all members of this class; their defining characteristic is the calculation of eigenvalue/eigenvector pairs, either of the original data or of its covariance matrix. Scikit-learn has a module for PCA computation.
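For instance, a minimal sketch of scikit-learn's PCA applied to a small bag-of-words matrix built from the example sentences above (PCA wants a dense array, hence the toarray() call):

    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = ['Experience In Networking',
                 'STRONG Sales Experience',
                 'Strong Networking Skills Preferred',
                 'Sales Expertise REquired']

    X = CountVectorizer(stop_words='english').fit_transform(sentences)
    X_2d = PCA(n_components=2).fit_transform(X.toarray())
    print(X_2d)  # each sentence projected onto the two top axes of variance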
Data Model
As you have observed, the common dilemma in constructing a data model from unstructured text data is this: including a feature for every word in the entire corpus (minus stop words) often results in very high sparsity over the dataset (each sentence includes only a small fraction of the total words across all sentences, so each data vector is mostly zeros); on the other hand, if the corpus is trimmed so that, for instance, only the top 10% of words are used as features, then some or many of the sentences end up with completely unpopulated data vectors.
Here's one common sequence of techniques to help solve this problem, which might be particularly effective given your data: Combine related terms into a single term using the common processing sequence of normalizing, stemming and synonymizing.
This is intuitive; e.g.:
Normalize: transform all words to lowercase (Python strings have a lower method, so REquired.lower() returns required).
Obviously, this prevents Required, REquired, and required from comprising three separate features in your data vector, and instead collapses them into a single term.
Stem: after stemming, required, require, and requiring are collapsed to a single token, requir.
Two of the most common stemmers are the Porter and Lancaster stemmers (the NLTK, discussed below, has both).
Synonymize: terms like fluent, capable, and skilled can, depending on context, all be collapsed to a single term by looking them up in a common synonym list.
The excellent Python NLP library NLTK has (at least) several excellent synonym compilations, or digital thesauri, to help you do all three of these steps programmatically.
For instance, nltk.corpus.reader.lin is one (just one; there are at least several more synonym-finders in NLTK), and it's simple to use: just import the module and call synonym, passing in a term.
Multiple stemmers are in NLTK's stem package.
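A small sketch of the normalize and stem steps with NLTK; WordNet is shown here as one readily available synonym source (a stand-in for the thesaurus readers mentioned above), and the corpora may need a one-time nltk.download:

    from nltk.corpus import wordnet
    from nltk.stem import PorterStemmer

    # nltk.download('wordnet') may be needed the first time this runs.
    stemmer = PorterStemmer()

    tokens = ['Required', 'REquired', 'require', 'requiring']
    print({t: stemmer.stem(t.lower()) for t in tokens})  # all collapse to 'requir'

    # One simple way to pull synonyms: lemma names of a word's WordNet synsets.
    synonyms = {lemma.name() for syn in wordnet.synsets('skilled') for lemma in syn.lemmas()}
    print(synonyms)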
I actually just recently put together a guide to document clustering in Python. I would suggest using a combination of k-means and latent Dirichlet allocation. Take a look and let me know if I can explain anything further: http://brandonrose.org/clustering