Troubleshooting tips for clustering word2vec output with DBSCAN - python

I'm analyzing a corpus of roughly 2M raw words. I build a model using gensim's word2vec, embed the vectors in 2D using sklearn's TSNE, and cluster the vectors (from word2vec, not TSNE) using sklearn's DBSCAN. The TSNE output looks about right: the layout of the words in 2D space seems to reflect their semantic meaning. There's a group of misspellings, a group of clothing words, and so on.
However, I'm having trouble getting DBSCAN to output meaningful results. It seems to put almost everything in the "0" cluster (colored teal in the images). As I increase epsilon, the "0" cluster takes over everything. Here are screenshots with epsilon=10 and epsilon=12.5. With epsilon=20, almost everything is in the same cluster.
I would expect, for instance, the group of "clothing" words to all get clustered together (they're unclustered at eps=10). I would also expect more on the order of 100 clusters, as opposed to 5-12 clusters, and to be able to control the size and number of the clusters using epsilon.
A few questions, then. Am I understanding the use of DBSCAN correctly? Is there another clustering algorithm that might be a better choice? How can I know what a good clustering algorithm for my data is?
Is it safe to assume my model is tuned pretty well, given that the TSNE looks about right?
What other techniques can I use in order to isolate the issue with clustering? How do I know if it's my word2vec model, my use of DBSCAN, or something else?
Here's the code I'm using to perform DBSCAN:
import sys
import gensim
import json
from optparse import OptionParser
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# snip option parsing
model = gensim.models.Word2Vec.load(options.file)
words = sorted(model.vocab.keys())
# Standardize each dimension of the word vectors before clustering.
vectors = StandardScaler().fit_transform([model[w] for w in words])
db = DBSCAN(eps=options.epsilon).fit(vectors)
labels = db.labels_
core_indices = db.core_sample_indices_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated {:d} clusters".format(n_clusters), file=sys.stderr)
output = [{'word': w, 'label': int(l), 'isCore': i in core_indices} for i, (l, w) in enumerate(zip(labels, words))]
print(json.dumps(output))

I'm having the same problem and trying these solutions; posting them here hoping they could help you or someone else:
Adapting the min_samples value in DBSCAN to your problem: in my case the default value, 4, was too high, since some clusters could also be formed by just 2 words.
Obviously, starting from a better corpus could be the solution to your problem: if the model is badly initialized, it won't perform well.
Perhaps DBSCAN is not the best choice; I am also trying K-Means for this problem.
Iterating over the creation of the model also helped me understand better which parameters to choose:
import operator
import numpy as np
from sklearn.cluster import DBSCAN

# mat_words is the matrix of word vectors; words_found are the corresponding words.
for eps in np.arange(0.1, 50, 0.1):
    dbscan_model = DBSCAN(eps=eps, min_samples=3, metric_params=None,
                          algorithm="auto", leaf_size=30, p=None, n_jobs=1)
    labels = dbscan_model.fit_predict(mat_words)
    clusters = {}
    for i, w in enumerate(words_found):
        clusters[w] = labels[i]
    dbscan_clusters = sorted(clusters.items(), key=operator.itemgetter(1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = len([lab for lab in labels if lab == -1])
    print("EPS:", eps, "\tClusters:", n_clusters, "\tNoise:", n_noise)

As far as I can tell from the various visualizations of word2vec, the vectors probably won't cluster well.
First of all, there is nothing in the word2vec objective that would encourage clustering. On the contrary, it optimizes word vectors to resemble those of their neighbors, so words that occur in similar contexts get similar vectors. That is necessary for the word-substitution aim.
Secondly, based on the plots, I am not sure there are "dense" regions separated by areas of low density in there. Instead, the data usually looks more like one big blob. And when almost all the vectors are in that big blob, they will almost all be in the same cluster!
Last but not least, most words probably don't cluster. Yes, numbers will likely cluster. You'd expect verbs to cluster vs. nouns, but "to bear" and "a bear" is the same to word2vec, and so is "bar" (verb and noun) etc. - there are too many polysemies for such clusters to be well separated even if the embedding were perfect!
Your best bet is to increase min_samples and lower epsilon until most data is noise, and you find some remaining clusters.
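A minimal sketch of that strategy, reusing the standardized vectors array from the question's code (the eps and min_samples values below are only illustrative starting points, not tuned values):
import numpy as np
from sklearn.cluster import DBSCAN

# Demand denser neighborhoods (higher min_samples) and a smaller radius
# (lower eps) so the big central blob is labeled as noise (-1).
db = DBSCAN(eps=5.0, min_samples=15).fit(vectors)
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_fraction = np.mean(labels == -1)
print("clusters:", n_clusters, "noise fraction:", noise_fraction)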

Related

How do you correctly cluster document names & find similarities between documents based on Word2Vec model?

I have a set of documents (3000) which each contain a short description. I want to use Word2Vec model to see if I can cluster these documents based on the description.
I'm doing it in the following way, but I am not sure if this is a "good" way to do it. Would love to get feedback.
I'm using Google's trained w2v model.
wv = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz',binary=True,encoding="ISO-8859-1", limit = 100000)
Each document is split into words, stop words are removed, and I have used stemming as well.
My initial idea was to fetch the word vector for each word in each document's description, average them, and then cluster based on this.
import numpy as np

doc2vecs = []
for i in range(0, len(documents_df['Name'])):
    vec = np.zeros(300)
    n_words = 0
    for j in range(0, len(documents_df['Description'][i])):
        if documents_df['Description'][i][j] in wv:
            vec += wv[documents_df['Description'][i][j]]
            n_words += 1
    # Average of the word vectors found in this description.
    doc2vecs.append(vec / max(n_words, 1))
I'm then finding similarities using
similarities = squareform(pdist(doc2vecs, 'cosine'))
Which returns a matrix of the cosine between each vector in doc2vec.
I then try to cluster the documents.
from sklearn import cluster

num_clusters = 2
km = cluster.KMeans(n_clusters=num_clusters)
km.fit(doc2vecs)
So basically what I am wondering is:
Is this method of clustering the average word vector for each word in the document a reasonable way to cluster the documents?
In 2019, unless you have serious resource constraints, you don't need to vectorize documents by averaging word embeddings. You can use Universal Sentence Encoder to vectorize documents in a few lines of code.
Most clustering algorithms do better in low dimensions, so from here you want to do dimensionality reduction, then clustering. AFAIK, you'll get the best results from UMAP. Their docs explain how to do this very clearly.
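A rough sketch of that pipeline (Universal Sentence Encoder to vectorize, UMAP to reduce dimensionality, then clustering), assuming the tensorflow_hub and umap-learn packages are installed; the TF Hub URL is the standard USE module, raw_descriptions is a hypothetical list holding the untokenized description strings, and the parameter values are only placeholders:
import tensorflow_hub as hub
import umap
from sklearn.cluster import KMeans

# Encode each raw description directly; no stemming or stop-word removal needed.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
doc_vectors = use(raw_descriptions).numpy()

# Reduce to a handful of dimensions before clustering, since most clustering
# algorithms behave better in low dimensions.
low_dim = umap.UMAP(n_components=5, metric='cosine').fit_transform(doc_vectors)

labels = KMeans(n_clusters=2).fit_predict(low_dim)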

How to correctly translate Kmeans labels to category labels

I have been using sklearn's KMeans implementation to cluster a labeled dataset, and I have been using sklearn's clustering metrics to test the clustering performance.
Sklearn's KMeans clustering output is, as you know, a list of integers in the range of n_clusters. However, my labels are strings.
So far I had no problems with them since the metrics from sklearn.metrics.cluster work with mixed inputs (int & str label lists).
However, now I want to use some of the classification metrics, and from what I gather the inputs k_true and k_pred need to come from the same label set: either integers in the range of k, or the string labels that my dataset is using. If I try it as is, it returns the following error:
AttributeError: 'bool' object has no attribute 'sum'
So, how could I translate the k-means labels into another type of labels? Or even the other way around (string labels -> integer labels)?
How could I even begin implementing it? Since k-means is pretty non-deterministic, the labels might change from iteration to iteration. Is there a legit way to correctly translate the KMeans labels?
EDIT:
EXAMPLE
for k = 4
kmeans output: [0,3,3,2,........0]
class labels : ['CAT','DOG','DOG','BIRD',.......'CHICKEN']
Clustering is not classification.
The methods do not predict a label, so you must not use a classification evaluation measure. That would be like measuring the quality of an apple in miles per gallon...
If you insist on doing the wrong thing(tm), then use the Hungarian algorithm to find the best mapping. But beware: the number of clusters and the number of classes will usually not be the same. If that is the case, such a mapping will either be unfairly negative (extra clusters are left unmapped) or unfairly positive (mapping multiple clusters to the same label would consider the "N points in N clusters" solution optimal). It's better to only use clustering measures.
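For completeness, here is a minimal sketch of such a Hungarian-algorithm mapping using scipy's linear_sum_assignment; the function name and the example labels are only illustrative, and note that when there are more clusters than classes the extra clusters simply stay unmapped:
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_clusters_to_labels(cluster_ids, class_labels):
    clusters = np.unique(cluster_ids)
    classes = np.unique(class_labels)
    # Contingency table: rows are clusters, columns are classes.
    table = np.zeros((len(clusters), len(classes)), dtype=int)
    for c, y in zip(cluster_ids, class_labels):
        table[np.searchsorted(clusters, c), np.searchsorted(classes, y)] += 1
    # Maximize total agreement by minimizing the negated counts.
    rows, cols = linear_sum_assignment(-table)
    return {clusters[r]: classes[c] for r, c in zip(rows, cols)}

# Example from the question:
mapping = map_clusters_to_labels([0, 3, 3, 2], ['CAT', 'DOG', 'DOG', 'BIRD'])
mapped = [mapping[c] for c in [0, 3, 3, 2]]  # ['CAT', 'DOG', 'DOG', 'BIRD']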
You can create a mapping using a dictionary, say
mapping_dict = { 0: 'cat', 1: 'chicken', 2:'bird', 3:'dog'}
Then you can simply apply this mapping using, say, a list comprehension.
Suppose your labels are stored in a list kmeans_predictions
mapped_predictions = [ mapping_dict[x] for x in kmeans_predictions]
Then use mapped_predictions as your predictions.
Update: Based on your comments, I believe you have to do it the other way round, i.e. convert your labels into int mappings.
Also, you cannot use just any classification metric here. Use completeness score, v-measure, and homogeneity, as these are more suited for clustering problems. It would be incorrect to just blindly use any random classification metric here.
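A short sketch of that suggestion, using the example labels from the question; LabelEncoder handles the string-to-int conversion, and the three clustering metrics then accept the result directly:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

true_int = LabelEncoder().fit_transform(['CAT', 'DOG', 'DOG', 'BIRD'])
kmeans_labels = [0, 3, 3, 2]  # example k-means output from the question

print("homogeneity:", homogeneity_score(true_int, kmeans_labels))
print("completeness:", completeness_score(true_int, kmeans_labels))
print("v-measure:", v_measure_score(true_int, kmeans_labels))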

How to use FLANN for labeling and clustering?

I read a paper whose retrieval system is based on SIFT descriptors and fast approximate k-means clustering. I installed pyflann. If I am not mistaken, the following commands only find the indices of the closest data points to a specific sample (for example, here, the indices of the 5 nearest points from dataset to testset):
from pyflann import *
from numpy import *
from numpy.random import *
dataset = rand(10000, 128)
testset = rand(1000, 128)
flann = FLANN()
result, dists = flann.nn(dataset, testset, 5, algorithm="kmeans",
                         branching=32, iterations=7, checks=16)
I went through the user manual; however, I could not find how to do k-means clustering with FLANN, nor how to fit the test set based on the cluster centers. For comparison, in scikit-learn we can use k-means++ clustering and then fit the dataset based on the model:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=100, init='k-means++', random_state=0, verbose=0)
kmeans.fit(dataset)
and later we can assign labels to the test set by using a KDTree, for example:
from sklearn.neighbors import KDTree

kdt = KDTree(kmeans.cluster_centers_)
Q = testset  # query
kdt_dist, kdt_idx = kdt.query(Q, k=1)  # knn
test_labels = kdt_idx  # knn=1 labels
Could someone please help me with how I can use the same procedure with FLANN? (I mean clustering the dataset (finding the cluster centers and quantizing the features), and then quantizing the test set based on the cluster centers found in the previous step.)
You won't be able to do the best variations with FLANN, because these use two indexes at the same time, and are ugly to implement.
But you can build a new index on the centers for every iteration, as sketched below. Unless you have k > 1000, though, it probably will not help much.
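A rough sketch of that idea, reusing only the flann.nn call from the question; this is just a plain Lloyd-style k-means loop in which FLANN does the approximate nearest-center assignment, and the index parameters are simply the ones from the question:
import numpy as np
from pyflann import FLANN

flann = FLANN()
k = 100
# Pick k random data points as the initial centers.
centers = dataset[np.random.choice(len(dataset), k, replace=False)]

for _ in range(10):  # a fixed, small number of Lloyd iterations
    # Build a fresh index on the current centers and assign every point
    # to its approximate nearest center.
    assign, _ = flann.nn(centers, dataset, 1, algorithm="kmeans",
                         branching=32, iterations=7, checks=16)
    assign = np.asarray(assign).ravel()
    # Recompute each center as the mean of its assigned points.
    for j in range(k):
        members = dataset[assign == j]
        if len(members):
            centers[j] = members.mean(axis=0)

# Quantize the test set against the final centers (1-nearest-center labels).
test_labels, _ = flann.nn(centers, testset, 1, algorithm="kmeans",
                          branching=32, iterations=7, checks=16)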

How to improve distance function in python

I am trying to do a classification exercise on email docs (strings containing words).
I defined the distance function as following:
def distance(wordset1, wordset2):
    if len(wordset1) < len(wordset2):
        return len(wordset2) - len(wordset1)
    elif len(wordset1) > len(wordset2):
        return len(wordset1) - len(wordset2)
    elif len(wordset1) == len(wordset2):
        return 0
However, the accuracy in the end is pretty low (0.8). I guess this is because of the not so accurate distance function. How can I improve the function? Or what are other ways to calculate the "distance" between email docs?
One common measure of similarity for use in this situation is the Jaccard similarity. It ranges from 0 to 1, where 0 indicates complete dissimilarity and 1 means the two documents are identical. It is defined as
wordSet1 = set(wordSet1)
wordSet2 = set(wordSet2)
sim = len(wordSet1.intersection(wordSet2))/len(wordSet1.union(wordSet2))
Essentially, it is the ratio of the size of the intersection of the word sets to the size of their union. This helps control for emails that are of different sizes while still giving a good measure of similarity.
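Since the question asks for a distance, here is the same idea wrapped as one (a minimal sketch; the handling of two empty documents is an assumption on my part):
def jaccard_distance(wordset1, wordset2):
    wordset1, wordset2 = set(wordset1), set(wordset2)
    union = wordset1 | wordset2
    if not union:
        return 0.0  # treat two empty documents as identical
    # 0 for identical word sets, 1 for completely disjoint ones.
    return 1 - len(wordset1 & wordset2) / len(union)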
You didn't mention the type of wordset1 and wordset2. I'll assume they are both strings.
You defined your distance as the difference in word counts and got a bad score. It's obvious that text length is not a good dissimilarity measure: two emails of different sizes can talk about the same thing, while two emails of the same size can be talking about completely different things.
So, as suggested above, you could try and check for SIMILAR WORDS instead:
import numpy as np

def distance(wordset1, wordset2):
    wordset1 = set(wordset1.split())
    wordset2 = set(wordset2.split())
    common_words = wordset1 & wordset2
    if common_words:
        return 1 / len(common_words)
    else:
        # They don't share any word. They are infinitely different.
        return np.inf
The problem with that is that two big emails are more likely to share words than two small ones, and this metric would favor those, making them "more similar to each other" in comparison to the small ones. How do we solve this? Well, we normalize the metric somehow:
import numpy as np

def distance(wordset1, wordset2):
    wordset1 = set(wordset1.split())
    wordset2 = set(wordset2.split())
    common_words = wordset1 & wordset2
    if common_words:
        # The distance, normalized by the total
        # number of different words in the emails.
        return 1 / len(common_words) / len(wordset1 | wordset2)
    else:
        # They don't share any word. They are infinitely different.
        return np.inf
This seems cool, but it completely ignores the FREQUENCY of the words. To account for this, we can use the bag-of-words model. That is, create a list of all possible words and histogram their appearance in each document. Let's use the CountVectorizer implementation from scikit-learn to make our job easier:
from sklearn.feature_extraction.text import CountVectorizer

def distance(wordset1, wordset2):
    model = CountVectorizer()
    X = model.fit_transform([wordset1, wordset2]).toarray()
    # Uses Euclidean distance between bags.
    return np.linalg.norm(X[0] - X[1])
But now consider two pairs of emails. The emails in the first pair are composed of perfectly written English, full of "small" words (e.g. a, an, is, and, that) necessary for it to be grammatically correct. The emails in the second pair are different: containing only keywords, they're extremely dry. You see, chances are the first pair will be more similar than the second one. That happens because we are currently counting all words the same, while we should be prioritizing the MEANINGFUL words in each text. To do that, let's use term frequency-inverse document frequency. Luckily, there's a very similar implementation in scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

def distance(wordset1, wordset2):
    model = TfidfVectorizer()
    X = model.fit_transform([wordset1, wordset2]).toarray()
    # Rows are L2-normalized by default, so this dot product gives cosine similarities.
    similarity_matrix = X.dot(X.T)
    # The dissimilarity between samples wordset1 and wordset2.
    return 1 - similarity_matrix[0, 1]
Read more about this in this question. Also, duplicate?
You should now have a fairly good accuracy. Try it out. If it's still not as good as you want, then we have to go deeper... (get it? Because... Deep-learning). The first thing is that we need either a dataset to train over or an already trained model. That's required because networks have many parameters that MUST be adjusted in order to provide useful transformations.
What's been missing so far is UNDERSTANDING. We histogrammed the words, stripping them of any context or meaning. Instead, let's keep them where they are and try to recognize blocks of patterns. How can this be done?
Embed the words into numbers, which deals with the different sizes of words.
Pad every number (word embedding) sequence to a single length.
Use convolutional networks to extract meaningful features from the sequences.
Use fully-connected networks to project the extracted features to a space that minimizes the distance between similar emails and maximizes the distance between non-similar ones.
Let's use Keras to simplify our lives. It should look something like this:
# ... imports and params definitions
model = Sequential()
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen,
                    dropout=0.2))
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
# Pool over the whole sequence length produced by the convolution.
model.add(MaxPooling1D(pool_length=model.output_shape[1]))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
# ... train or load model weights.
def distance(wordset1, wordset2):
    global model
    # X = ...  # Embed both emails.
    X = sequence.pad_sequences(X, maxlen=maxlen)
    y = model.predict(X)
    # Euclidean distance between the emails.
    return np.linalg.norm(y[0] - y[1])
There's a practical example on sentence processing which you can check in the Keras GitHub repo. Also, someone solves this exact same problem using a siamese recurrent network in this stackoverflow question.
Well, I hope this gives you some direction. :-)

Sequential k-means clustering using scikit-learn

Is there a way to perform sequential k-means clustering using scikit-learn? I can't seem to find a proper way to add new data, without re-fitting all the data.
Thank you
scikit-learn's KMeans class has a predict method that, given some (new) points, determines which of the clusters these points would belong to. Calling this method does not change the cluster centroids.
If you do want the centroids to be changed by the addition of new data, i.e. you want to do clustering in an online setting, use the MiniBatchKMeans estimator and its partial_fit method.
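A minimal sketch of that online setting; initial_batch and new_batch here are just placeholder arrays of shape (n_samples, n_features), and n_clusters=5 is arbitrary:
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=5)

# Fit on the data available so far...
mbk.partial_fit(initial_batch)

# ...then update the centroids as new batches arrive, without refitting everything.
mbk.partial_fit(new_batch)

labels_for_new_points = mbk.predict(new_batch)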
You can pass in initial values for the centroids with the init parameter to sklearn.cluster.k_means. So then you can just do:
import numpy as np
from sklearn.cluster import k_means

centroids, labels, inertia = k_means(data, k)
new_data = np.append(data, extra_pts, axis=0)  # stack the new points as extra rows
new_centroids, new_labels, new_inertia = k_means(new_data, k, init=centroids)
assuming you're just adding data points and not changing k.
I think this will sometimes mean you get a suboptimal result, but it should usually be faster. You might want to occasionally redo the fit with, say, 10 random seeds and take the best one.
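For that occasional refit, something along these lines should work (a sketch; n_init tells k_means how many random initializations to try before keeping the best result):
best_centroids, best_labels, best_inertia = k_means(new_data, k, n_init=10)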
It's also relatively easy to write your own function that finds out which centroid is closest to a point that you are considering. Assuming you have some matrix X that is ready for kmeans:
import numpy as np
from sklearn import cluster

centroids, labels, inertia = cluster.k_means(X, 5)

def pred(arr):
    return np.argmin([np.linalg.norm(arr - b) for b in centroids])
You can confirm that this works via:
[pred(X[i]) == labels[i] for i in range(len(X))]
