I wrote my own Shared Nearest Neighbor (SNN) clustering algorithm, following the original paper. Essentially, I get the nearest neighbors of each data point, precompute the distance matrix with the Jaccard distance, and pass that matrix to DBSCAN.
To accelerate the algorithm, I only compute the Jaccard distance between two data points if they are nearest neighbors of each other and share more than a certain number of neighbors. I also exploit the symmetry of the distance matrix and compute only half of it.
However, my algorithm is slow and takes much longer than common clustering algorithms such as K-Means or DBSCAN. Can someone look at my code and suggest how to make it faster?
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def jaccard(a, b):
    """
    Computes the Jaccard distance between two arrays.

    Parameters
    ----------
    a: an array.
    b: an array.
    """
    A = np.array(a, dtype='int')
    B = np.array(b, dtype='int')
    # keep only valid (non-negative) neighbor indices
    A = A[np.where(A > -1)[0]]
    B = B[np.where(B > -1)[0]]
    union = np.union1d(A, B)
    intersection = np.intersect1d(A, B)
    return 1.0 - len(intersection) * 1.0 / len(union)


def iterator_dist(indices, k_min=5):
    """
    An iterator that computes the Jaccard distance for any pair of stars.

    Parameters:
    indices: the indices of nearest neighbors in the chemistry-velocity
    space.
    """
    for n in range(len(indices)):
        # only visit neighbors m > n, so that each pair is handled once
        for m in indices[n][indices[n] > n]:
            if len(np.intersect1d(indices[n], indices[m])) > k_min:
                dist = jaccard(indices[n], indices[m])
                yield (n, m, dist)


# load data here
data =
# hyperparameters
n_neighbors =
eps =
min_samples =
k_min =
# K Nearest Neighbors
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()
# distance matrix
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
    # fill both halves, since the matrix is symmetric
    S[n, m] = dist
    S[m, n] = dist
db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed',
            n_jobs=-1).fit(S)
labels = db.labels_
Writing fast Python code is hard. The key is to avoid interpreted Python wherever possible and instead use BLAS routines via NumPy or compiled code such as Cython. So at some point you will need to switch from "real" Python to at least typed Cython code, unless you can find a library that already implements these operations at a low enough level.
But the obvious first step is to run a profiler to identify the slow operations!
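As a minimal sketch of that profiling step, the standard-library cProfile and pstats modules can wrap a single call and report the most expensive functions. The helper name profile_call and its top parameter are hypothetical, chosen only for this illustration.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func once under cProfile; return (result, text report).

    Hypothetical helper for illustration, not part of the original post.
    """
    profiler = cProfile.Profile()
    # runcall executes func(*args, **kwargs) with profiling enabled
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # sort by cumulative time so the slowest call chains appear first
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Calling something like profile_call(list, iterator_dist(indices, k_min)) and printing the report should show whether the time goes into np.intersect1d, np.union1d, or the sparse matrix assignments.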
Secondly, consider avoiding a distance matrix. Anything involving a distance matrix tends to scale as O(n²) unless done very carefully, which is of course much slower than k-means and Euclidean DBSCAN.
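One way to push the shared-neighbor counting out of the Python loop is a sparse matrix product. This is only a sketch under stated assumptions: the function name snn_jaccard_distances is hypothetical, every row of indices is assumed to hold exactly k distinct valid neighbor indices (no -1 padding), and, unlike the question's code, it keeps every pair that shares more than k_min neighbors rather than only pairs that appear in each other's neighbor lists.

```python
import numpy as np
from scipy import sparse


def snn_jaccard_distances(indices, k_min=5):
    """Return (rows, cols, dist) for pairs sharing more than k_min neighbors.

    Hypothetical helper; assumes indices has shape (n, k) with k distinct
    valid neighbor indices per row.
    """
    indices = np.asarray(indices)
    n, k = indices.shape
    row_ids = np.repeat(np.arange(n), k)
    # binary membership matrix: A[i, j] = 1 if j is a neighbor of i
    A = sparse.csr_matrix(
        (np.ones(n * k, dtype=np.int32), (row_ids, indices.ravel())),
        shape=(n, n),
    )
    # (A @ A.T)[i, j] counts the neighbors shared by i and j
    shared = (A @ A.T).tocoo()
    keep = (shared.row != shared.col) & (shared.data > k_min)
    s = shared.data[keep].astype(float)
    # with k neighbors each, |union| = 2k - |intersection|
    dist = 1.0 - s / (2 * k - s)
    return shared.row[keep], shared.col[keep], dist
```

The returned triplets could then be loaded into a sparse matrix in place of the lil_matrix loop. The product still touches every pair with at least one common neighbor, so it is worth profiling on real data before relying on it.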
This question is about creating a K-nearest neighbor graph [KNNG] from a dataset with an unknown number of centroids (which is not the same as K-means clustering).
Suppose that you have a dataset of observations stored in a data matrix X[n_samples, n_features] with each row being an observation or feature vector and each column being a feature. Now suppose you want to compute the (weighted) k-Neighbors graph for points in X using sklearn.neighbors.kneighbors_graph.
What are the basic methods to pick the number of neighbors to use for each sample? What algorithms scale well when you have lots of observations?
I have seen the brute-force method below, but it does not scale well when the dataset becomes large, and you have to pick a good starting upper bound for n_neighbors_max. Does this algorithm have a name?
import numpy
import sklearn.metrics


def autoselect_K(X, n_neighbors_max, threshold):
    # get the pairwise euclidean distance between every observation
    D = sklearn.metrics.pairwise.euclidean_distances(X, X)
    chosen_k = n_neighbors_max
    for k in range(2, n_neighbors_max):
        k_avg = []
        # loop over each row in the distance matrix
        for row in D:
            # sort the row from smallest distance to largest distance
            sorted_row = numpy.sort(row)
            # calculate the mean of the k smallest distances
            # (this includes the zero distance of the point to itself)
            k_avg.append(numpy.mean(sorted_row[0:k]))
        # find the median of the averages
        kmedian_dist = numpy.median(k_avg)
        if kmedian_dist >= threshold:
            chosen_k = k
            break
    # return the number of nearest neighbors to use
    return chosen_k
From your code, it appears that you are looking for a classification result based on the nearest neighbours.
In that case, searching the full distance matrix is akin to a brute-force search and defeats the purpose of nearest-neighbour algorithms.
Perhaps what you are looking for is KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Regarding the choice of the number of nearest neighbours, this depends on the sparsity of your data. It helps to view nearest-neighbour search as a way to bound your search: rather than looking over all samples, it allows you to narrow the search to the top-N (nearest-neighbour) samples. Afterward you can apply a domain-specific technique to these N samples to get the desired result.
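To illustrate narrowing the search to the top-N neighbours, here is a rough sketch that applies the same selection rule as autoselect_K but asks sklearn.neighbors.NearestNeighbors for only the n_neighbors_max closest points instead of building the full distance matrix. The name autoselect_k_knn is hypothetical, and the sketch assumes n_neighbors_max does not exceed the number of samples.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def autoselect_k_knn(X, n_neighbors_max, threshold):
    """Hypothetical variant of autoselect_K without the full distance matrix."""
    nn = NearestNeighbors(n_neighbors=n_neighbors_max).fit(X)
    # distances are sorted per row; querying the training data itself
    # returns a zero distance first, as in the sorted rows of the original
    dist, _ = nn.kneighbors(X)
    # running mean of the k smallest distances for k = 1..n_neighbors_max
    running_mean = np.cumsum(dist, axis=1) / np.arange(1, n_neighbors_max + 1)
    # median over all samples, one value per k
    medians = np.median(running_mean, axis=0)
    for k in range(2, n_neighbors_max):
        if medians[k - 1] >= threshold:
            return k
    return n_neighbors_max
```

Sorting happens once inside the neighbour query rather than once per value of k, which removes the nested Python loops; the choice of threshold and upper bound remains up to you.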
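The k-means clustering algorithm objective is to find:
$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where \mu_i is the mean of the points in cluster S_i.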
I looked at several implementations of it in python, and in some of them the norm is not squared.
For example (taken from here):
import numpy as np


def form_clusters(labelled_data, unlabelled_centroids):
    """
    given some data and centroids for the data, allocate each
    datapoint to its closest centroid. This forms clusters.
    """
    # enumerate because centroids are arrays which are unhashable
    centroids_indices = range(len(unlabelled_centroids))
    # initialize an empty list for each centroid. The list will
    # contain all the datapoints that are closer to that centroid
    # than to any other. That list is the cluster of that centroid.
    clusters = {c: [] for c in centroids_indices}
    for (label, Xi) in labelled_data:
        # for each datapoint, pick the closest centroid.
        smallest_distance = float("inf")
        for cj_index in centroids_indices:
            cj = unlabelled_centroids[cj_index]
            distance = np.linalg.norm(Xi - cj)
            if distance < smallest_distance:
                closest_centroid_index = cj_index
                smallest_distance = distance
        # allocate that datapoint to the cluster of that centroid.
        clusters[closest_centroid_index].append((label, Xi))
    return clusters.values()
By contrast, here is the implementation I expected (taken from here; this is just the distance calculation):
import numpy as np
from numpy.linalg import norm

def compute_distance(self, X, centroids):
    distance = np.zeros((X.shape[0], self.n_clusters))
    for k in range(self.n_clusters):
        row_norm = norm(X - centroids[k, :], axis=1)
        distance[:, k] = np.square(row_norm)
    return distance
Now, I know there are several ways to calculate the norm/distance, but I looked only at implementations that use np.linalg.norm with ord=None or ord=2 and, as I said, some of them do not square the norm, yet they cluster correctly.
Why?
In my experience, using the norm or the squared norm as the objective function of an optimization algorithm yields similar results. The minimum value of the objective function changes, but the parameters obtained are the same. I always assumed that the inner product generates a quadratic function and that taking its root only changes the magnitude, not the topology of the objective function. In the assignment step in particular, the square root is monotonic, so the centroid with the smallest norm is also the one with the smallest squared norm. A more detailed answer can be found here: https://math.stackexchange.com/questions/2253443/difference-between-least-squares-and-minimum-norm-solution
Hope it helps.
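A small check of the assignment step makes the point concrete. This is only an illustrative sketch (the function name assign_labels and the toy arrays are made up): since squaring is monotonic for non-negative numbers, argmin over the norms and argmin over the squared norms pick the same centroid.

```python
import numpy as np


def assign_labels(X, centroids, squared):
    """Index of the nearest centroid for every row of X (illustrative helper)."""
    # pairwise Euclidean norms, shape (n_samples, n_centroids)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    if squared:
        d = np.square(d)
    return d.argmin(axis=1)


X = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]])
centroids = np.array([[0.0, 1.0], [6.0, 5.0], [10.0, 0.0]])
labels_norm = assign_labels(X, centroids, squared=False)
labels_squared = assign_labels(X, centroids, squared=True)
```

Both calls give the same labels here. Only the assignment step is compared; the centroid update by averaging is what ties k-means to the squared norm.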
I am trying to implement a very simple greedy clustering algorithm in Python, but am hard-pressed to optimize it for speed. The algorithm takes a distance matrix, finds the column with the most components less than a predetermined distance cutoff, and stores the row indices (of the components less than the cutoff) as the members of the cluster. The centroid of the cluster is the column index. The columns and rows of each member index are then removed from the distance matrix (resulting in a smaller, but still square, matrix), and the algorithm iterates through successively smaller distance matrices until all clusters are found.
Because each iteration depends on the last (a new distance matrix is formed so that there are no overlapping members between clusters), I think I cannot avoid a slow for loop in Python. I have tried numba (jit) to speed it up, but I think it falls back to Python mode and so does not result in any speed gains.
Here are two implementations of the algorithm. The first is slower than the second. Any suggestions for speedups are most welcome. I am aware of other clustering algorithms as implemented in scipy or sklearn (such as DBSCAN, kmeans/medoids, etc.), but am very keen to use the current one for my application. Thanks in advance for any suggestions.
Method 1 (slower):
import numpy as np


def cluster(distance_matrix, cutoff=1):
    indices = np.arange(0, len(distance_matrix))
    # True wherever two points lie within the cutoff of each other
    boolean_distance_matrix = distance_matrix <= cutoff
    centroids = []
    members = []
    while boolean_distance_matrix.any():
        # the column with the most neighbors becomes the next centroid
        centroid = np.argmax(np.sum(boolean_distance_matrix, axis=0))
        mem_indices = boolean_distance_matrix[:, centroid]
        mems = indices[mem_indices]
        # remove the members from all further consideration
        boolean_distance_matrix[mems, :] = False
        boolean_distance_matrix[:, mems] = False
        centroids.append(centroid)
        members.append(mems)
    return members, centroids
Method 2 (faster, but still slow for large matrices):
It takes as input a sparse adjacency matrix built with sklearn's nearest neighbors implementation. This is the simplest and fastest way I could think of to get the relevant distance matrix for clustering. I believe working with the sparse matrix also speeds up the clustering algorithm.
import numpy as np
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(metric='euclidean', radius=1.5,
                        algorithm='kd_tree')
nbrs.fit(data)
adjacency_matrix = nbrs.radius_neighbors_graph(data)


def cluster(adjacency_matrix, gt=1):
    # nonzero() is computed once and unpacked into row and column indices
    rows, cols = adjacency_matrix.nonzero()
    members = []
    member = np.ones(gt + 1)  # dummy value so that the loop is entered
    centroids = []
    appendc = centroids.append
    appendm = members.append
    # stop once a cluster has gt or fewer members, or no entries remain
    while len(member) > gt and len(cols) > 0:
        un, coun = np.unique(cols, return_counts=True)
        centroid = un[np.argmax(coun)]
        appendc(centroid)
        member = rows[cols == centroid]
        appendm(member)
        # drop every entry whose row belongs to the new cluster
        keep = np.isin(rows, member, invert=True)
        cols = cols[keep]
        rows = rows[keep]
    return members, centroids
I am trying to implement a K-means algorithm in Python that uses cosine distance instead of Euclidean distance as the distance metric.
I understand that using a different distance function can be fatal and should be done carefully. Using cosine distance as the metric forces me to change the averaging function (the average under cosine distance must be an element-by-element average of the normalized vectors).
I have seen this elegant solution of manually overriding the distance function of sklearn, and I want to use the same technique to override the averaging part of the code, but I couldn't find it.
Does anyone know how this can be done?
How critical is it that the distance metric doesn't satisfy the triangle inequality?
If anyone knows a different, efficient implementation of k-means where I can use the cosine metric or supply my own distance and averaging functions, that would also be really helpful.
Thank you very much!
Edit:
After switching to the angular distance instead of the cosine distance, the code looks something like this:
import numpy as np
from sklearn.cluster import k_means_  # private module of older scikit-learn releases
from sklearn.metrics.pairwise import cosine_similarity


def KMeans_cosine_fit(sparse_data, nclust = 10, njobs=-1, randomstate=None):
    # Manually override euclidean
    def euc_dist(X, Y = None, Y_norm_squared = None, squared = False):
        #return pairwise_distances(X, Y, metric = 'cosine', n_jobs = 10)
        # clip to [-1, 1] so that rounding errors cannot make arccos return NaN
        return np.arccos(np.clip(cosine_similarity(X, Y), -1.0, 1.0))/np.pi
    k_means_.euclidean_distances = euc_dist
    kmeans = k_means_.KMeans(n_clusters = nclust, n_jobs = njobs, random_state = randomstate)
    _ = kmeans.fit(sparse_data)
    return kmeans
I noticed (by working through the mathematics) that if the vectors are normalized, the standard average works well for the angular metric. As far as I understand, I have to change _mini_batch_step() in k_means_.py, but the function is pretty complicated and I couldn't understand how to do it.
Does anyone know of an alternative solution?
Or does anyone know how I can replace this function with one that always forces the centroids to be normalized?
So it turns out you can just normalise X to be of unit length and use K-means as normal. The reason is that, if X1 and X2 are unit vectors, then in the following equation the term inside the brackets in the last line is the cosine distance.
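The identity can be written out as the standard expansion of the squared Euclidean distance, assuming unit vectors x_1 and x_2 (the lower-case symbols are notation chosen for this sketch):

```latex
\begin{aligned}
\lVert x_1 - x_2 \rVert^2
  &= \lVert x_1 \rVert^2 + \lVert x_2 \rVert^2 - 2\, x_1^\top x_2 \\
  &= 2 - 2\, x_1^\top x_2 \\
  &= 2 \left( 1 - \cos(x_1, x_2) \right)
\end{aligned}
```

So on unit vectors the squared Euclidean distance is exactly twice the cosine distance, and ranking points by one ranks them by the other.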
So in terms of using k-means, simply do:
import numpy as np
from sklearn.cluster import KMeans
length = np.sqrt((X**2).sum(axis=1))[:,None]
X = X / length
kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
And if you need the centroids and distance matrix do:
len_ = np.sqrt(np.square(kmeans.cluster_centers_).sum(axis=1)[:,None])
centers = kmeans.cluster_centers_ / len_
dist = 1 - np.dot(centers, X.T) # K x N matrix of cosine distances
Notes:
I just realised that you are trying to minimise the distance between the mean vector of the cluster and its constituents. The mean vector has a length of less than one when you simply average unit vectors. In practice, though, it is still worth running the normal sklearn algorithm and checking the length of the mean vectors. In my case the mean vectors were close to unit length (averaging around 0.9, but this depends on how dense your data is).
TLDR: Use the spherecluster package, as @σηγ pointed out.
You can normalize your data and then use KMeans.
from sklearn import preprocessing
from sklearn.cluster import KMeans
kmeans = KMeans().fit(preprocessing.normalize(X))
Unfortunately, no.
scikit-learn's current implementation of k-means only uses Euclidean distance.
The reason is that k-means alternates between computing each cluster center and assigning every sample to its closest center, and the arithmetic mean is only the natural center of a set of samples under Euclidean distance.
If you want to use k-means with cosine distance, you need to write your own function or class. Alternatively, try another clustering algorithm such as DBSCAN.
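A hand-written version might look roughly like the following sketch of spherical k-means, in which samples are assigned by highest cosine similarity and every center is re-normalized to unit length after averaging. The function name, the random initialization from data points, and the stopping rule are assumptions made for illustration, not an established API.

```python
import numpy as np


def spherical_kmeans(X, n_clusters, n_iter=50, seed=0):
    """Illustrative spherical k-means; returns (labels, unit-length centers)."""
    rng = np.random.default_rng(seed)
    # work on unit vectors so that a dot product equals cosine similarity
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # assumption: initialize the centers from distinct random samples
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # assignment step: pick the center with the highest cosine similarity
        labels = np.argmax(X @ centers.T, axis=1)
        new_centers = centers.copy()
        for j in range(n_clusters):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the old center for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                # update step: mean direction, projected back to unit length
                new_centers[j] = total / norm
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers
```

Like ordinary k-means, this only finds a local optimum and depends on the initialization, so several restarts with different seeds are advisable on real data.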
I wrote my own clustering algorithm (bad, I know) for my problem. It works well, but could be faster.
The algorithm takes a list of values (1D) as input and works like this:
1. For each cluster, calculate the distance to the closest neighbor cluster
2. Select the cluster A which has the smallest distance to its neighbor B
3. If the distance between A and B is not less than the threshold, return
4. Combine A and B
5. Goto 1.
I probably reinvented the wheel here...
This is my brute-force code; how can I make it faster? I have SciPy and NumPy installed, in case there is something ready-made.
#cluster center as simple average value
def cluster_center(cluster):
    return sum(cluster) / len(cluster)

#Distance between clusters
def cluster_distance(a, b):
    return abs(cluster_center(a) - cluster_center(b))

# clusters is assumed to be defined beforehand as a list of lists of values
while True:
    cluster_distances = []
    #If nothing to cluster, ready
    if len(clusters) < 2:
        break
    #Go thru all clusters, calculate shortest distance to neighbor
    for cluster in clusters:
        # compare by identity so that clusters with equal contents stay distinct
        nearest = min(((cluster_distance(cluster, c), c) for c in clusters if c is not cluster), key=lambda pair: pair[0])
        cluster_distances.append((cluster, nearest))
    #Find out closest pair (Python 3 replacement for the cmp-based sort)
    cluster_distances.sort(key=lambda item: item[1][0])
    #Check if distance is under threshold 15
    if cluster_distances[0][1][0] < 15:
        a = cluster_distances[0][0]
        b = cluster_distances[0][1][1]
        #Combine clusters (combine lists)
        a.extend(b)
        #Form a new cluster list
        clusters = [c[0] for c in cluster_distances if c[0] is not b]
    else:
        break
Usually, the term "cluster analysis" is only used for multivariate partitions, because in 1D you can actually sort your data and solve many of these problems much more easily that way.
So to speed up your approach, sort your data! And reconsider what you then need to do.
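One simple reading of "sort your data" is to split the sorted values wherever the gap between consecutive values reaches the threshold. Note that this is a gap-based rule (equivalent to single linkage), not the centroid-distance rule of the question, and the function name split_sorted_1d is made up for this sketch.

```python
import numpy as np


def split_sorted_1d(values, threshold):
    """Split 1D values into clusters at gaps of at least threshold (illustrative)."""
    v = np.sort(np.asarray(values, dtype=float))
    if v.size == 0:
        return []
    # positions where the gap to the previous sorted value is large enough
    breaks = np.flatnonzero(np.diff(v) >= threshold) + 1
    return [chunk.tolist() for chunk in np.split(v, breaks)]
```

For example, split_sorted_1d([1, 2, 3, 50, 51, 100], 15) gives three groups. The cost is dominated by the sort, so it runs in O(n log n) instead of repeatedly comparing all pairs of clusters.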
As for a more advanced method: do kernel density estimation, and use the local minima of the density as splitting points.
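A rough sketch of the density-based idea, using scipy.stats.gaussian_kde to estimate the density on a grid and scipy.signal.argrelextrema to locate its local minima. The function name kde_split_points, the grid size, and the bandwidth handling are assumptions for illustration; the bandwidth in particular strongly affects how many splits are found.

```python
import numpy as np
from scipy.signal import argrelextrema
from scipy.stats import gaussian_kde


def kde_split_points(values, grid_size=512, bw_method=None):
    """Return grid positions of local density minima (illustrative helper)."""
    v = np.asarray(values, dtype=float)
    kde = gaussian_kde(v, bw_method=bw_method)
    # evaluate the estimated density on an evenly spaced grid
    grid = np.linspace(v.min(), v.max(), grid_size)
    density = kde(grid)
    # indices whose density is strictly lower than both neighbors
    minima = argrelextrema(density, np.less)[0]
    return grid[minima]
```

Values can then be assigned to clusters with np.digitize(values, split_points). A flat valley with exactly equal grid values would not be reported by the strict comparison, so treat the output as a starting point rather than a guaranteed partition.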