Computing a nearest neighbor graph using sklearn? - python

This question is about creating a K-nearest-neighbor graph (KNNG) from a dataset with an unknown number of centroids (which is not the same as K-means clustering).
Suppose that you have a dataset of observations stored in a data matrix X[n_samples, n_features], with each row being an observation or feature vector and each column being a feature. Now suppose you want to compute the (weighted) k-neighbors graph for the points in X using sklearn.neighbors.kneighbors_graph.
What are the basic methods for picking the number of neighbors to use for each sample? Which algorithms scale well when you have lots of observations?
I have seen the brute-force method below, but it does not do well when the dataset becomes large, and you have to pick a good upper bound for n_neighbors_max up front. Does this algorithm have a name?
import numpy
import sklearn.metrics.pairwise

def autoselect_K(X, n_neighbors_max, threshold):
    # get the pairwise euclidean distance between every pair of observations
    D = sklearn.metrics.pairwise.euclidean_distances(X, X)
    chosen_k = n_neighbors_max
    for k in range(2, n_neighbors_max):
        k_avg = []
        # loop over each row in the distance matrix
        for row in D:
            # sort the row from smallest distance to largest distance
            sorted_row = numpy.sort(row)
            # calculate the mean of the smallest k+1 distances (the first is the zero self-distance)
            k_avg.append(numpy.mean(sorted_row[0:k + 1]))
        # find the median of the per-observation averages
        kmedian_dist = numpy.median(k_avg)
        if kmedian_dist >= threshold:
            chosen_k = k
            break
    # return the number of nearest neighbors to use
    return chosen_k
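For reference, a sketch of how the selected k could then feed into sklearn.neighbors.kneighbors_graph (the toy data and the threshold value are placeholders for illustration, not part of the question):
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

# toy data standing in for X; threshold=1.0 is an arbitrary choice for illustration
X, _ = make_blobs(n_samples=200, n_features=2, random_state=0)
k = autoselect_K(X, n_neighbors_max=20, threshold=1.0)

# weighted k-neighbors graph: nonzero entries hold euclidean distances
graph = kneighbors_graph(X, n_neighbors=k, mode='distance')
print(k, graph.shape)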

From your code, it appears that you are looking for a classification result based on the nearest neighbours.
In that case, your search over the full distance matrix amounts to a brute-force search and defeats the purpose of nearest-neighbour algorithms.
Perhaps what you are looking for is KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Regarding the choice of the number of nearest neighbours, this depends on the sparsity of your data. It helps to view nearest-neighbour search as a way to bound your search: rather than looking over all samples, it lets you narrow the search to the top-N (nearest-neighbour) samples. Afterwards you can apply a domain-specific technique on these N samples to get the desired result.
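A minimal sketch of that route (the toy data and labels are assumptions, only there to make the example runnable):
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# toy labelled data; in practice X and y come from your own problem
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# a small sweep over n_neighbors instead of guessing it
for k in (3, 5, 11):
    clf = KNeighborsClassifier(n_neighbors=k)
    print(k, cross_val_score(clf, X, y, cv=5).mean())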

Related

Implementation details of K-means++ without sklearn

I am implementing k-means on the MNIST dataset. However, I am having difficulties with the initialization and some of the later steps.
For the initialization, I first have to pick one random data point as the first centroid. Then, for the remaining centroids, data points are also picked randomly, but from a weighted probability distribution, until all centroids are chosen.
I am stuck at this step: how can I sample from this distribution? I mean, how do I implement it? For D_{k-1}(x), can I just use np.linalg.norm to compute it and then square it?
In my implementation, I have so far only initialized the first centroid:
self.centroids = np.zeros((self.num_clusters, input_x.shape[1]))
# first centroid: one randomly chosen sample
ran_num = np.random.choice(input_x.shape[0])
self.centroids[0] = input_x[ran_num]
for k in range(1, self.num_clusters):
    # ... how do I pick the k-th centroid here?
For the next step, do I need to find the next centroid by taking the largest distance between the previous centroid and all sample points?
You need to create a distribution where the probability of selecting an observation is proportional to the squared distance between the observation and its closest already-chosen cluster center (normalized so the probabilities sum to one). Thus, when selecting a new cluster center, there is a high probability of selecting observations that are far from all existing cluster centers, and a low probability of selecting observations that are close to them.
This would look like this:
import numpy as np

centers = []
centers.append(X[np.random.randint(X.shape[0])])  # initial center = one random sample
distance = np.full(X.shape[0], np.inf)
for j in range(1, self.n_clusters):
    # distance of every sample to its closest center chosen so far
    distance = np.minimum(np.linalg.norm(X - centers[-1], axis=1), distance)
    p = np.square(distance) / np.sum(np.square(distance))  # probability vector [p1, ..., pn]
    sample = np.random.choice(X.shape[0], p=p)
    centers.append(X[sample])

Is k-means++ meant to be perfect every time? What other initialization strategies can yield the best k-means?

I've implemented a k-means algorithm and its performance is highly dependent on how the centroids are initialized. I'm finding that random uniform initialization gives a good k-means about 5% of the time, whereas with k-means++ it's closer to 50%. Why is the yield of good k-means runs so low? I should note that I've only used a handful of data sets, and my good/bad rates are indicative of only those, not of k-means++ broadly.
Here's an example using k-means++ where the end result was not great. The Dunn Index of this clustering is 0.16.
And an example where it worked perfectly with a Dunn Index of 0.67.
I was maybe under the naive impression k-means++ produced a good k-means every time. Is there perhaps something wrong with my code?
from math import inf
from random import choice

def initialize_centroids(points, k):
    """
    Parameters:
        points : a list of Points.
        k : how many centroids to place.
    Returns:
        A list of centroids.
    """
    clusters = []
    clusters.append(choice(points))  # first centroid is a random point
    for _ in range(k - 1):  # for the other centroids
        distances = []
        for p in points:
            d = inf
            for c in clusters:  # find the minimal distance between p and the chosen centroids
                d = min(d, distance(p, c))
            distances.append(d)
        # pick the point with the maximum of these minimal distances
        clusters.append(points[distances.index(max(distances))])
    return clusters
This is adapted from the algorithm as found on Wikipedia:
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
The difference is that my code always picks the point at the maximum distance deterministically, rather than sampling a point with probability proportional to its squared distance.
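For reference, a sketch of what the weighted selection would look like with the same Points list and distance helper from my code above, using random.choices with the squared distances as weights:
from random import choice, choices

def initialize_centroids_weighted(points, k):
    """Like initialize_centroids, but draws each new centroid with
    probability proportional to D(x)^2 instead of taking the farthest point."""
    clusters = [choice(points)]
    for _ in range(k - 1):
        # D(p)^2: squared distance from p to its nearest already-chosen centroid
        weights = [min(distance(p, c) for c in clusters) ** 2 for p in points]
        # weighted draw, as in step 3 of the Wikipedia description
        clusters.append(choices(points, weights=weights, k=1)[0])
    return clusters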
My intention is to compare the Dunn Index over different values of k; empirically, a higher Dunn Index means a better clustering. I can't collect (good) data if half of the time it doesn't work, so my results are skewed by the faultiness of k-means++ or of my implementation of it.
What other initialization strategies can be employed to get a more consistent result?

Speeding up nearest Neighbor with scipy.spatial.cKDTree

I'm trying to optimize my nearest neighbor distance code, which I need to run for many iterations of the same dataset.
I am calculating, for each point in dataset A, the nearest neighbor distance to the points in dataset B. Both datasets contain roughly 1000-2000 two-dimensional points. While the points in dataset A stay the same, I have many different versions of dataset B (~100000 of them): B0, B1, ..., B100000. I wonder if I can somehow speed this up given that A stays the same.
To calculate the nearest neighbor distances I use
import numpy as np
from scipy import spatial

for i in range(100000):
    tree = spatial.cKDTree(B[i])
    mindist1, minid = tree.query(A)
    score[i] = (np.mean(mindist1**4))**0.25
    # And some other calculations
    ...
I wonder if there is a way to speed this up given A stays the same throughout the entire loop. It seems to me like there should be a smarter way to do this given that A is the same.
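One incremental speedup that does not require restructuring the problem is to parallelize the queries themselves; a sketch, assuming SciPy >= 1.6 where cKDTree.query takes a workers argument (older releases call it n_jobs), with toy data standing in for A and the B_i:
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
A = rng.random((1500, 2))                          # fixed dataset A
Bs = [rng.random((1500, 2)) for _ in range(10)]    # stand-in for the many B_i

score = np.empty(len(Bs))
for i, B_i in enumerate(Bs):
    tree = spatial.cKDTree(B_i)
    # workers=-1 spreads the queries for all points of A over every CPU core
    mindist, _ = tree.query(A, workers=-1)
    score[i] = np.mean(mindist**4) ** 0.25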

How can I improve my Shared Nearest Neighbor clustering algorithm?

I wrote my own Shared Nearest Neighbor (SNN) clustering algorithm, following the original paper. Essentially, I get the nearest neighbors for each data point, precompute the distance matrix with Jaccard distance, and pass the distance matrix to DBSCAN.
To accelerate the algorithm, I only compute the Jaccard distance between two data points if they are nearest neighbors of each other and have more than a certain number of shared neighbors. I also take advantage of the symmetry of the distance matrix and only compute half of it.
However, my algorithm is slow and takes much longer than common clustering algorithms such as K-Means or DBSCAN. Can someone look at my code and suggest how I can improve it and make the algorithm faster?
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def jaccard(a, b):
    """
    Computes the Jaccard distance between two arrays.
    Parameters
    ----------
    a: an array.
    b: an array.
    """
    A = np.array(a, dtype='int')
    B = np.array(b, dtype='int')
    A = A[np.where(A > -1)[0]]
    B = B[np.where(B > -1)[0]]
    union = np.union1d(A, B)
    intersection = np.intersect1d(A, B)
    return 1.0 - len(intersection) * 1.0 / len(union)

def iterator_dist(indices, k_min=5):
    """
    An iterator that computes the Jaccard distance for any pair of stars.
    Parameters:
        indices: the indices of nearest neighbors in the chemistry-velocity
            space.
        k_min: minimum number of shared neighbors required to compute a distance.
    """
    for n in range(len(indices)):
        for m in indices[n][indices[n] > n]:
            if len(np.intersect1d(indices[n], indices[m])) > k_min:
                dist = jaccard(indices[n], indices[m])
                yield (n, m, dist)

# load data here
data = ...
# hyperparameters
n_neighbors = ...
eps = ...
min_samples = ...
k_min = ...

# K nearest neighbors
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()

# sparse distance matrix
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
    S[n, m] = dist
    S[m, n] = dist

db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed',
            n_jobs=-1).fit(S)
labels = db.labels_
Writing fast Python code is hard. The key is to avoid Python wherever possible, and instead use BLAS routines via numpy or, for example, Cython, which is compiled rather than interpreted. So at some point you'll need to switch from "plain" Python to at least typed Cython code, unless you can find a library that already implements these operations at a low enough level for you.
But the obvious first step is to run a profiler to identify the slow operations!
Secondly, consider avoiding a distance matrix. Anything involving a distance matrix tends to scale as O(n²) unless done very carefully, and that is of course much slower than k-means or Euclidean DBSCAN.
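To illustrate the "numpy/BLAS instead of Python loops" point, here is a rough sketch of how the pairwise shared-neighbour counts could come out of a single sparse matrix product instead of the nested iterator_dist loops. It deliberately skips the k_min and mutual-neighbour filtering of the original, so the result is denser than the question's matrix, and the function name is made up for the example:
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def snn_jaccard_matrix(data, n_neighbors):
    """Sparse Jaccard-distance matrix between k-nearest-neighbour lists,
    computed with one sparse product instead of a Python double loop."""
    nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
    indices = nbrs.kneighbors(return_distance=False)         # shape (n, k)
    n, k = indices.shape
    # binary adjacency: M[i, j] = 1 iff j is one of i's k nearest neighbours
    rows = np.repeat(np.arange(n), k)
    M = csr_matrix((np.ones(n * k), (rows, indices.ravel())), shape=(n, n))
    # intersection sizes for all pairs that share at least one neighbour
    shared = (M @ M.T).tocoo()                               # shared[i, j] = |N(i) ∩ N(j)|
    union = 2 * k - shared.data                              # every list has exactly k entries
    jaccard = 1.0 - shared.data / union
    return csr_matrix((jaccard, (shared.row, shared.col)), shape=(n, n))
Pairs that share no neighbours simply stay absent from the sparse result, so no time is spent on them.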

How to compute cluster assignments from linkage/distance matrices

If you have this hierarchical clustering call in scipy in Python:
from scipy.cluster.hierarchy import linkage
# dist_matrix is long form distance matrix
linkage_matrix = linkage(squareform(dist_matrix), linkage_method)
then what's an efficient way to go from this to cluster assignments for individual points? That is, a vector of length N (where N is the number of points) in which entry i is the cluster number of point i, given the number of clusters generated by a given threshold thresh on the resulting clustering.
To clarify: the cluster number of a point is the cluster it ends up in after applying a threshold to the tree. Each point then belongs to exactly one "most specific cluster", which is defined by the threshold at which you cut the dendrogram.
I know that scipy.cluster.hierarchy.fclusterdata gives you this cluster assignment as its return value, but I am starting from a custom-made distance matrix and distance metric, so I cannot use fclusterdata. The question boils down to: how can I compute what fclusterdata is computing -- the cluster assignments?
If I understand you right, that is what fcluster does:
scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)
Forms flat clusters from the hierarchical clustering defined by the linkage matrix Z.
...
Returns: An array of length n. T[i] is the flat cluster number to which original observation i belongs.
So just call fcluster(linkage_matrix, t), where t is your threshold (use criterion='distance' if t is meant as a cut height in the dendrogram, since the default criterion is 'inconsistent').
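A short end-to-end sketch, with a toy distance matrix standing in for the custom one and an arbitrary cut height of 1.0:
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# toy stand-in for the custom square-form distance matrix
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))
dist_matrix = squareform(pdist(points))

# linkage expects the condensed form, hence the squareform() call
linkage_matrix = linkage(squareform(dist_matrix), 'average')

# labels[i] is the flat cluster number of observation i, as fclusterdata would return
labels = fcluster(linkage_matrix, t=1.0, criterion='distance')
print(labels)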
