I read a paper whose retrieval system is based on SIFT descriptors and fast approximate k-means clustering, so I installed pyflann. If I am not mistaken, the following commands only find the indices of the datapoints closest to a given sample (here, the indices of the 5 nearest points in dataset for each point in testset):
from pyflann import *
from numpy import *
from numpy.random import *
dataset = rand(10000, 128)
testset = rand(1000, 128)
flann = FLANN()
result,dists = flann.nn(dataset,testset,5,algorithm="kmeans",
branching=32, iterations=7, checks=16)
I went through the user manual but could not find how to do k-means clustering with FLANN, or how to fit the test set based on the cluster centers. For comparison, we can use k-means++ clustering in scikit-learn and fit the dataset with the model:
kmeans=KMeans(n_clusters=100,init='k-means++',random_state = 0, verbose=0)
kmeans.fit(dataset)
Later we can assign labels to the test set, for example with a KDTree:
kdt=KDTree(kmeans.cluster_centers_)
Q=testset #query
kdt_dist,kdt_idx=kdt.query(Q,k=1) #knn
test_labels=kdt_idx #knn=1 labels
Could someone please show me how to follow the same procedure with FLANN? I mean clustering the dataset (finding the cluster centers and quantizing the features) and then quantizing testset against the cluster centers found in the previous step.
You won't be able to do the best variations with FLANN, because these use two indexes at the same time and are ugly to implement.
You can, however, build a new index on the centers in every iteration. Unless you have k > 1000, though, it probably will not help much.
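To make "a new index on the centers for every iteration" concrete, here is a minimal sketch of Lloyd's k-means in which the assignment step queries an index built on the current centers. It assumes scipy's cKDTree as a stand-in for a FLANN index (it is not the pyflann API), and the function names, data sizes and k are made up for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree


def kmeans_with_center_index(data, k, n_iter=10, seed=0):
    """Lloyd's k-means; the assignment step queries an index built on the centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Rebuild the index on the current centers (stand-in for a FLANN index).
        index = cKDTree(centers)
        _, labels = index.query(data, k=1)
        # Move every non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers


def quantize(points, centers):
    """Assign each point to the index of its nearest center."""
    _, labels = cKDTree(centers).query(points, k=1)
    return labels


rng = np.random.default_rng(0)
dataset = rng.random((2000, 16))
testset = rng.random((200, 16))

centers = kmeans_with_center_index(dataset, k=20)
test_labels = quantize(testset, centers)
```

The quantization of testset is the same nearest-center lookup as the KDTree step in the question; only the structure that answers the query would change with FLANN.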
Excuse me if the questions are too simple, but I am getting into machine learning under time constraints.
I must apply a mixed classification to a df with the following steps:
1. Apply the KMeans with 50 clusters
2. From the barycenters and labels obtained for each cluster, a dendrogram must be displayed, in order to choose the right k.
3. Then apply an HCA algorithm from the barycenters obtained in step 1 with the number of clusters from step 2.
4. Calculate the barycenters of each new group
5. Use the calculated barycenters to consolidate the clusters by the KMeans algorithm.
What I do is:
clf = KMeans(n_clusters=50).fit(df)
centroids = clf.cluster_centers_
labels = clf.labels_
From there I get confused with the dendrogram. So far I have used it only over the whole df and I am not certain how to involve the barycenters and labels from the KMeans correctly.
Z = linkage(df, method='ward', metric='euclidean')
dendrogram(Z, labels=df.index, leaf_rotation=90., color_threshold=0)
plt.show()
Last but not least, I do not know how to get the barycenters from AgglomerativeClustering.
Any clarification would be of help. Thanks in advance!
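One possible way to chain the five steps is sketched below. It is only an illustration under assumptions: synthetic blobs stand in for df, k = 3 is an assumed value read off the dendrogram, scipy's linkage and fcluster are used for the HCA step instead of AgglomerativeClustering, and weighting each barycenter by the KMeans cluster sizes is a design choice, not something the steps prescribe.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three blobs standing in for the real df.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 4)) for c in (0.0, 5.0, 10.0)])

# Step 1: coarse k-means with 50 clusters.
km50 = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)
centroids = km50.cluster_centers_

# Step 2: hierarchical clustering on the 50 barycenters only.
Z = linkage(centroids, method="ward")
# no_plot=True only computes the layout; drop it and call plt.show() to draw.
tree = dendrogram(Z, no_plot=True)

# Step 3: cut the tree at the k chosen from the dendrogram (k = 3 is assumed here).
k = 3
group_of_centroid = fcluster(Z, t=k, criterion="maxclust")  # group labels start at 1

# Step 4: barycenter of each group, weighted by the size of each k-means cluster
# so that it equals the mean of the underlying points.
sizes = np.bincount(km50.labels_, minlength=50)
barycenters = np.vstack([
    np.average(centroids[group_of_centroid == g], axis=0,
               weights=sizes[group_of_centroid == g])
    for g in range(1, k + 1)
])

# Step 5: consolidate with k-means started from those barycenters.
final = KMeans(n_clusters=k, init=barycenters, n_init=1).fit(X)
```

The dendrogram is built from the 50 centroids rather than from the whole df, which is what keeps it readable; final.labels_ then holds the consolidated cluster of every original row.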
Scikit documentation states that:
Method for initialization:
‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
My data has 10 (predicted) clusters and 7 features. However, I would like to pass an array of shape 10 by 6, i.e. I want 6 dimensions of each centroid to be predefined by me, while the 7th dimension is initialized freely by k-means++. (In other words, I do not want to specify the full initial centroids, but rather fix 6 dimensions and leave only one dimension to vary for the initial clusters.)
I tried to pass a 10x6 array in the hope that it would work, but it just threw an error.
Sklearn does not allow this kind of fine-grained operation.
The only possibility is to provide a 7th feature value that is random, or similar to what k-means++ would have achieved.
So basically you can estimate a good value for this as follows:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
nb_clust = 10
# your data
X = np.random.randn(1000, 7)
# your 6-column centroids
cent_6cols = np.random.randn(nb_clust, 6)
# assign each point to its nearest partial centroid, using the first 6 columns only
# (setting km.cluster_centers_ by hand on an unfitted KMeans and calling predict
# may fail on recent scikit-learn versions)
initial_prediction = pairwise_distances_argmin(X[:, 0:6], cent_6cols)
# for the 7th column, use the average value of the points
# lying in the cluster given by your partial centroids
cent_7cols = np.zeros((nb_clust, 7))
cent_7cols[:, 0:6] = cent_6cols
for i in range(nb_clust):
    members = X[initial_prediction == i, 6]
    # fall back to the global mean if no point was assigned to this centroid
    cent_7cols[i, 6] = members.mean() if members.size else X[:, 6].mean()
# the 7th column is now initialized in a k-means++-like way,
# so you can use cent_7cols as your initial centroids
truekm = KMeans(n_clusters=nb_clust, init=cent_7cols, n_init=1)
That is a very nonstandard variation of k-means. So you cannot expect sklearn to be prepared for every exotic variation. That would make sklearn slower for everybody else.
In fact, your approach is more like certain regression approaches (predicting the last value of the cluster centers) rather than clustering. I also doubt the results will be much better than simply setting the last value to the average of all points assigned to the cluster center using the other 6 dimensions only. Try partitioning your data based on the nearest center (ignoring the last column) and then setting the last column to be the arithmetic mean of the assigned data.
However, sklearn is open source.
So get the source code, and modify k-means. Initialize the last component randomly, and while running k-means only update the last column. It's easy to modify it this way, but it's very hard to design an efficient API that allows such customizations through trivial parameters. Use the source code to customize at this level.
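As a rough illustration of "only update the last column", here is a small NumPy sketch of such a variant written from scratch rather than by patching sklearn. The function name is hypothetical, and unlike the random initialization suggested above it starts the free coordinate from an assignment that ignores the last column; both choices are assumptions.

```python
import numpy as np


def kmeans_last_column_free(X, fixed_part, n_iter=20):
    """k-means variant: all center coordinates but the last stay fixed."""
    k = fixed_part.shape[0]
    centers = np.zeros((k, X.shape[1]))
    centers[:, :-1] = fixed_part
    # Initial assignment ignores the last column entirely.
    d2 = ((X[:, None, :-1] - fixed_part[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    for _ in range(n_iter):
        # Update step: only the last coordinate of each center moves.
        for j in range(k):
            members = X[labels == j, -1]
            if members.size:
                centers[j, -1] = members.mean()
        # Assignment step: nearest center using all columns.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return centers, labels


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))
cent_6cols = rng.normal(size=(10, 6))
centers, labels = kmeans_last_column_free(X, cent_6cols)
```

The first six columns of centers come back unchanged, so the result can be compared directly with the predefined partial centroids.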
I am using the scikit-learn's feature agglomeration to use a hierarchical clustering procedure on features rather than on the observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had the shape (990, 15), after using feature agglomeration, df_reduced now has (990, 5).
How do I now find out how the original 15 features have been clustered together? In other words, which original features from df make up each of the 5 new features in df_reduced?
How the features within each cluster are combined during transform is determined by the way you perform the hierarchical clustering. The reduced feature set simply consists of the n_clusters cluster centers (which are n_samples-dimensional vectors). For certain applications you might compute the centers manually using a different definition of cluster center (e.g. the median instead of the mean, to reduce the influence of outliers).
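import numpy as np
n_clusters = agglo.n_clusters
n_features = df.shape[1]  # 15 here
feature_identifier = np.arange(n_features)
feature_groups = [feature_identifier[agglo.labels_ == i] for i in range(n_clusters)]
# average over the features of each group (axis=1): one n_samples-long vector per cluster
new_features = [df.iloc[:, group].mean(axis=1) for group in feature_groups]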
Don't forget to standardize the features beforehand (for example using sklearn's scaler). Otherwise you are rather grouping the scales of the quantities than clustering similar behavior.
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting the clusterer, agglo.labels_ is an array that tells which cluster of the reduced dataset each feature of the original dataset belongs to.
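A short sketch of reading that mapping off labels_ follows. The data and the feature names f0 to f14 are made up for illustration; with a DataFrame, df.columns would take the place of feature_names.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(0)
# Hypothetical data: 15 features built as noisy copies of 5 latent signals.
latent = rng.normal(size=(990, 5))
X = np.repeat(latent, 3, axis=1) + 0.01 * rng.normal(size=(990, 15))
feature_names = [f"f{i}" for i in range(15)]

agglo = FeatureAgglomeration(n_clusters=5).fit(X)

# labels_[i] is the cluster (column of the reduced data) that feature i was merged into.
groups = {
    c: [name for name, lab in zip(feature_names, agglo.labels_) if lab == c]
    for c in range(agglo.n_clusters)
}

# With the default pooling_func (np.mean), each reduced column is the mean of its group.
X_reduced = agglo.transform(X)
manual = np.column_stack(
    [X[:, agglo.labels_ == c].mean(axis=1) for c in range(agglo.n_clusters)]
)
```

Printing groups lists the original feature names behind each of the 5 reduced columns, and manual reproduces X_reduced column by column.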
I'm analyzing a corpus of roughly 2M raw words. I build a model using gensim's word2vec, embed the vectors using sklearn TSNE, and cluster the vectors (from word2vec, not TSNE) using sklearn DBSCAN. The TSNE output looks about right: the layout of the words in 2D space seems to reflect their semantic meaning. There's a group of misspellings, clothes, etc.
However, I'm having trouble getting DBSCAN to output meaningful results. It seems to label almost everything in the "0" group (colored teal in the images). As I increase epsilon, the "0" group takes over everything. Here are screenshots with epsilon=10, and epsilon=12.5. With epsilon=20, almost everything is in the same group.
I would expect, for instance, the group of "clothing" words to all get clustered together (they're unclustered at eps=10). I would also expect more on the order of 100 clusters, as opposed to 5 - 12 clusters, and to be able to control the size and number of the clusters using epsilon.
A few questions, then. Am I understanding the use of DBSCAN correctly? Is there another clustering algorithm that might be a better choice? How can I know what a good clustering algorithm for my data is?
Is it safe to assume my model is tuned pretty well, given that the TSNE looks about right?
What other techniques can I use in order to isolate the issue with clustering? How do I know if it's my word2vec model, my use of DBSCAN, or something else?
Here's the code I'm using to perform DBSCAN:
import sys
import gensim
import json
from optparse import OptionParser
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# snip option parsing
model = gensim.models.Word2Vec.load(options.file);
words = sorted(model.vocab.keys())
vectors = StandardScaler().fit_transform([model[w] for w in words])
db = DBSCAN(eps=options.epsilon).fit(vectors)
labels = db.labels_
core_indices = db.core_sample_indices_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated {:d} clusters".format(n_clusters), file=sys.stderr)
output = [{'word': w, 'label': int(l), 'isCore': i in core_indices} for i, (l, w) in enumerate(zip(labels, words))]
print(json.dumps(output))
I'm having the same problem and am trying these solutions; I'm posting them here in the hope they help you or someone else:
Adapt the min_samples value in DBSCAN to your problem. In my case the default value (5 in scikit-learn) was too high, as some clusters could also be formed by just 2 words.
Obviously, starting from a better corpus could be the solution to your problem. If the model is badly initialized, it won't perform well.
Perhaps DBSCAN is not the best choice; I am also trying K-Means for this problem.
Iterating over the creation of the model also helped me better understand which parameters to choose:
import operator
for eps in np.arange(0.1, 50, 0.1):
    dbscan_model = DBSCAN(eps=eps, min_samples=3, metric_params=None, algorithm="auto", leaf_size=30, p=None, n_jobs=1)
    labels = dbscan_model.fit_predict(mat_words)
    # map each word to its cluster label, then sort the pairs by label
    clusters = {}
    for i, w in enumerate(words_found):
        clusters[w] = labels[i]
    dbscan_clusters = sorted(clusters.items(), key=operator.itemgetter(1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = len([lab for lab in labels if lab == -1])
    print("EPS: ", eps, "\tClusters: ", n_clusters, "\tNoise: ", n_noise)
As far as I can tell from the various visualizations of word2vec, the vectors probably won't cluster well.
First of all, there is nothing in the word2vec objective that would encourage clustering. On the contrary, it optimizes words to resemble the neighbors, so nearby words will get similar vectors. That is necessary for the word substitution aim.
Secondly, based on the plots, I am not sure there are "dense" regions separated by areas of low density in there. Instead, the data usually looks more like one big blob. But when almost all the vectors are in that big blob, they will almost all be in the same cluster!
Last but not least, most words probably don't cluster. Yes, numbers will likely cluster. You'd expect verbs to cluster vs. nouns, but "to bear" and "a bear" are the same to word2vec, and so is "bar" (verb and noun) etc. There are too many polysemies for such clusters to be well separated even if the embedding were perfect!
Your best bet is to increase min_samples (minPts) and lower epsilon until most data is noise and only a few clusters remain.
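The advice to raise min_samples and lower epsilon until most points are noise can be tried as a simple scan. The sketch below uses made-up data (one diffuse blob plus two tight groups) as a stand-in for word vectors; the helper name scan_eps, the eps grid and min_samples=10 are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical stand-in for word vectors: one diffuse blob plus two tight groups.
blob = rng.normal(scale=1.0, size=(500, 10))
tight_a = rng.normal(loc=6.0, scale=0.05, size=(30, 10))
tight_b = rng.normal(loc=-6.0, scale=0.05, size=(30, 10))
vectors = np.vstack([blob, tight_a, tight_b])


def scan_eps(vectors, eps_values, min_samples):
    """Return (eps, n_clusters, noise_fraction) for each eps, largest eps first."""
    rows = []
    for eps in sorted(eps_values, reverse=True):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(vectors)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        rows.append((eps, n_clusters, float(np.mean(labels == -1))))
    return rows


results = scan_eps(vectors, eps_values=[8.0, 4.0, 2.0, 1.0, 0.5], min_samples=10)
for eps, n_clusters, noise in results:
    print(f"eps={eps:<4} clusters={n_clusters:<3} noise={noise:.2f}")
```

On this toy data the diffuse blob dissolves into noise at small eps while the two tight groups survive, which is the behaviour the answer suggests looking for in real embeddings.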
I am trying to do hierarchical clustering on an m*n array.
Input array : 500 * 1000 (1000 features, 500 observations)
Calculate distance matrix using a self-defined pdist function
Feed this distance matrix to linkage function :
clusters = sch.linkage(distanceMatrix,'single')
Form flat clusters :
fc = sch.fcluster(clusters,cutoff,'distance')
This gives me some clusters (around 80, using a cutoff value of 6.0).
Now, is there any way I can also get the 1000 features corresponding to each cluster (like the features of the centroids we get from k-means clustering)?
Clusters in hierarchical clustering (or pretty much anything except k-means and Gaussian Mixture EM, which are restricted to "spherical" - actually: convex - clusters) do not necessarily have sensible means, because these methods allow for non-spherical clusters. That actually is a feature...
https://en.wikipedia.org/wiki/Cluster_analysis#Connectivity_based_clustering_.28hierarchical_clustering.29
Have a look at the right image titled "Linkage clustering examples". What good is a cluster in this "banana" example? The centroid might not even be in the cluster!
Note that you can still compute the centroid yourself, if you need it. As the clustering algorithm does not need the centroid, it will not be computing it for you automatically, obviously.
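Computing the centroid yourself amounts to averaging the rows that share a flat-cluster label. A minimal sketch, with made-up data, the default Euclidean pdist in place of the self-defined distance function, and an assumed cutoff of 1.0:

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Hypothetical data: 3 well-separated groups, 20 observations x 5 features each.
data = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 5)) for c in (0.0, 3.0, 6.0)])

distance_matrix = pdist(data)                 # condensed distances; a custom metric would go here
clusters = sch.linkage(distance_matrix, "single")
fc = sch.fcluster(clusters, 1.0, "distance")  # flat cluster label per observation

# Mean feature vector of every flat cluster, keyed by cluster label.
centroids = {label: data[fc == label].mean(axis=0) for label in np.unique(fc)}
```

Each value in centroids has one entry per feature, so with the 500 x 1000 array from the question it would be a 1000-element vector per cluster. Whether that mean is a meaningful representative depends on the cluster shape, as noted above.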