I wrote my own clustering algorithm (bad, I know) for my problem. It works well, but could work faster.
Algorithm takes list of values (1D) as in input, and works like this:
For each cluster, calculate distance to closest neighbor cluster
Select the cluster A which has smallest distance to neighbor B
If distance between A and B is less then threshold, return
Combine A and B
Goto 1.
I probably reinvented a wheel here..
This is my brute foce code, how to make it faster? I've Scipy and Numpy installed, if there's something ready made
#cluster center as simple average value
def cluster_center(cluster):
return sum(cluster) / len(cluster)
#Distance between clusters
def cluster_distance(a, b):
return abs(cluster_center(a) - cluster_center(b))
while True:
cluster_distances = []
#If nothing to cluster, ready
if len(clusters) < 2:
break
#Go thru all clusters, calculate shortest distance to neighbor
for cluster in clusters:
cluster_distances.append((cluster, sorted([(cluster_distance(cluster, c), c) for c in clusters if c != cluster])[0]))
#Find out closest pair
cluster_distances.sort(cmp=lambda a,b:cmp(a[1], b[1]))
#Check if distance is under threshold 15
if cluster_distances[0][1][0] < 15:
a = cluster_distances[0][0]
b = cluster_distances[0][1][1]
#Combine clusters (combine lists)
a.extend(b)
#Form a new cluster list
clusters = [c[0] for c in cluster_distances if c[0] != b]
else:
break
Usually, the term "cluster analysis" is only used for multi-variate partitions. Because in 1d, you can actually sort your data, and solve much of these problems much easier this way.
So to speed up your approach, sort your data! And reconsider what you then need to do.
As for a more advanced method: do kernel density estimation, and look for local minima as splitting points.
Related
I'm trying to implement Integer Programming for Nearest Neighbor Classifier in python using cvxpy.
Short intro
Given a dataset of n points with a color (red or blue) we would like to choose the minimal number of candidate points, s.t for each point that isn`t a candidate, its closest candidate has the same color.
My flow
Given a set of n points (with colors) define an indicator vector I (|I| = n),
I_i = 1 if and only if vertex i is chosen as a candidate
In addition, I defined two more vectors, named as A and B (|A| = |B| = n) as follow:
A_i = the distance between v_i to it's closest candidate with the **same** color
B_i = the distance between v_i to it's closest candidate with a **different** color
Therefore, I have n constrains which are:
B_i > A_i
for any i
My target is to minimize the sum of vector I (which represents the number of candidates)
My Issue
Its seems that the vectors A, B are changing because they affected by I, since when a candidate is chosen, it is affecting its entry in I which affects A and B and the constrains are dependent on those vectors..
Any suggestions?
Thanks !
To recap: you want to find the smallest set of examples belonging to a given training set such that the resulting nearest neighbor classifier achieves perfect accuracy on that training set.
I would suggest that you formulate this as follows. Create a 0–1 variable x(e) for each example e indicating whether e is chosen. For each ordered pair of examples e and e′ with different labels, write a constraint
x(e′) ≤ ∑e′′∈C(e,e′) x(e′′)
where C(e, e′) is the set of examples e′′ with the same label as e such that e′′ is closer to e than e′ is to e (including e′′ = e). This means that, if e′ is chosen, then it is not the nearest chosen example to e.
We also need
∑e x(e) ≥ 1
to disallow the empty set. Finally, the objective is
minimize ∑e x(e).
I've implemented a k-means algorithm and performance is highly dependent on how centroids were initialized. I'm finding random uniform initialization to give a good k-means about 5% of the time, whereas with k-means++, it's closer to 50%. Why is the yield for good k-means so low? I should disclaim I've only used a handful of data sets and my good/bad rates are indicative of only those, not broadly.
Here's an example using k-means++ where the end result was not great. The Dunn Index of this clustering is 0.16.
And an example where it worked perfectly with a Dunn Index of 0.67.
I was maybe under the naive impression k-means++ produced a good k-means every time. Is there perhaps something wrong with my code?
def initialize_centroids(points, k):
"""
Parameters:
points : a list of Points.
k : how many centroids to place.
Returns:
A list of centroids.
"""
clusters = []
clusters.append(choice(points)) # first centroid is random point
for _ in range(k - 1): # for other centroids
distances = []
for p in points:
d = inf
for c in clusters: # find the minimal distance between p and c
d = min(d, distance(p, c))
distances.append(d)
# find maximum distance index from minimal distances
clusters.append(points[distances.index(max(distances))])
return clusters
This is adapted from the algorithm as found on Wikipedia:
Choose one center uniformly at random from among the data points.
For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
Repeat Steps 2 and 3 until k centers have been chosen.
Now that the initial centers have been chosen, proceed using standard k-means clustering.
The difference is the centroids are chosen such that it is the furthest distance, not a probability to choose between furthest distances.
My intention is to compare the Dunn Index over different values of k, and empirically the Dunn Index being higher means better clustering. I can't collect (good) data if half of the time it doesn't work, so my results are skewed due to the faultiness of k-means++ or my implementation thereof.
What other initialization strategies can be employed to get a more consistent result?
This question is about creating a K-nearest neighbor graph [KNNG] from a dataset with an unknown number of centroids (which is not the same as K-means clustering).
Suppose that you have a dataset of observations stored in a data matrix X[n_samples, n_features] with each row being an observation or feature vector and each column being a feature. Now suppose you want to compute the (weighted) k-Neighbors graph for points in X using sklearn.neighbors.kneighbors_graph.
What are the basic methods to pick the number of neighbors to use for each sample? What algorithms scale well when you have lots of observations?
I have seen this brute force method below but it doesn't do well when the sample dataset size becomes large and you have to pick a good starting upper bound for n_neighbors_max. Does this algorithm have a name?
def autoselect_K(X, n_neighbors_max, threshold):
# get the pairwise euclidean distance between every observation
D = sklearn.metrics.pairwise.euclidean_distances(X, X)
chosen_k = n_neighbors_max
for k in range(2, n_neighbors_max):
k_avg = []
# loop over each row in the distance matrix
for row in D:
# sort the row from smallest distance to largest distance
sorted_row = numpy.sort(row)
# calculate the mean of the smallest k+1 distances
k_avg.append(numpy.mean(sorted_row[0:k]))
# find the median of the averages
kmedian_dist = numpy.median(k_avg)
if kmedian_dist >= threshold:
chosen_k = k
break
# return the number of nearest neighbors to use
return chosen_k
From your code, it appears that you are looking for a classification result based on the nearest neighbour.
In such a case your search over the distance matrix is akin to a brute force search and defeats the purpose of Nearest neighbour algorithms.
Perhaps what you are looking for is the NNClassifier. Here https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Regarding the choice of the number of nearest neighbours, this depends on the sparsity of your data. It helps to view Nearest Neighbour as a way to bound your search. Rather than look over all samples. It will allow you
to narrow the search to the top-N (nearest neighbour) samples. Afterward
you can apply a domain specific technique on these N samples to get the desired result.
I wrote my own Shared Nearest Neighbor(SNN) clustering algorithm, according to the original paper. Essentially, I get the nearest neighbors for each data point, precompute the distance matrix with Jaccard distance, and pass the distance matrix to DBSCAN.
To accelerate the algorithm, I only compute the Jaccard distance between two data points if they are nearest neighbors of each other and have over a certain number of shared neighbors. I also take advantage of the symmetry of the distance matrix, as I only compute half the matrix.
However, my algorithm is slow and takes much longer than common clustering algorithms, such as K-Means or DBSCAN. Can someone look at my codes and suggest how I can improve my codes and make the algorithm faster?
def jaccard(a,b):
"""
Computes the Jaccard distance between two arrays.
Parameters
----------
a: an array.
b: an array.
"""
A = np.array(a, dtype='int')
B = np.array(b, dtype='int')
A = A[np.where(A > -1)[0]]
B = B[np.where(B > -1)[0]]
union = np.union1d(A,B)
intersection = np.intersect1d(A,B)
return 1.0 - len(intersection)*1.0 / len(union)
def iterator_dist(indices, k_min=5):
"""
An iterator that computes the Jaccard distance for any pair of stars.
Parameters:
indices: the indices of nearest neighbors in the chemistry-velocity
space.
"""
for n in range(len(indices)):
for m in indices[n][indices[n] > n]:
if len(np.intersect1d(indices[n], indices[m])) > k_min:
dist = jaccard(indices[n], indices[m])
yield (n, m, dist)
# load data here
data =
# hyperparameters
n_neighbors =
eps =
min_samples =
k_min =
# K Nearest Neighbors
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()
# distance matrix
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
S[n,m] = dist
S[m,n] = dist
db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed',
n_jobs=-1).fit(S)
labels = db.labels_
Writing fast python code is hard. The key is to avoid python wherever possible, and instead either use BLAS routines via numpy or, e.g., cython that is compiled code not interpreted. So at some point you'll need to switch from "real" python at least to typed cython code. Unless you can find a library that already implemented these operations low level enough for you.
But the obvious first step to do is to run a profiler to identify slow operations!
Secondly, consider avoiding a distance matrix. Anything involving a distance matrix tends to scale with O(n²) unless done very carefully. That is of course much slower than k-means and Euclidean DBSCAN.
I am trying to implement a very simple greedy clustering algorithm in python, but am hard-pressed to optimize it for speed. The algorithm will take a distance matrix, find the column with the most components less than a predetermined distance cutoff, and store the row indices (with components less than the cutoff) as the members of the cluster. The centroid of the cluster is the column index. The columns and rows of each member index are then removed from the distance matrix (resulting in a smaller --but still square-- matrix), and the algorithm iterates through successively smaller distance matrices until all clusters are found. Because each iteration depends on the last (a new distance matrix is formed so that there are no overlapping members between clusters), I think I can not avoid a slow for loop in python. I've tried numba (jit) to speed it up but I think it is reverting to python mode and so does not result in any speed gains. Here are two implementations of the algorithm. The first is slower than the latter. Any suggestions for speedups are most welcome. I am aware of other clustering algorithms as implemented in scipy or sklearn (such as DBSCAN, kmeans/medoids, etc), but am very keen to use the current one for my application. Thanks in advance for any suggestions.
Method 1 (slower):
def cluster(distance_matrix, cutoff=1):
indices = np.arange(0, len(distance_matrix))
boolean_distance_matrix = distance_matrix <= cutoff
centroids = []
members = []
while boolean_distance_matrix.any():
centroid = np.argmax(np.sum(boolean_distance_matrix, axis=0))
mem_indices = boolean_distance_matrix[:, centroid]
mems = indices[mem_indices]
boolean_distance_matrix[mems, :] = False
boolean_distance_matrix[:, mems] = False
centroids.append(centroid)
members.append(mems)
return members, centroids
Method 2 (faster, but still slow for large matrices):
It takes as input an adjacency (sparse) matrix formed from sklearn's nearest neighbors implementation. This is the simplest and fastest way I could think to get the relevant distance matrix for clustering. I believe working with the sparse matrix also speeds up the clustering algorithm.
nbrs = NearestNeighbors(metric='euclidean', radius=1.5,
algorithm='kd_tree')
nbrs.fit(data)
adjacency_matrix = nbrs.radius_neighbors_graph(data)
def cluster(adjacency_matrix, gt=1):
rows = adjacency_matrix.nonzero()[0]
cols = adjacency_matrix.nonzero()[1]
members = []
member = np.ones(len(range(gt+1)))
centroids = []
appendc = centroids.append
appendm = members.append
while len(member) > gt:
un, coun = np.unique(cols, return_counts=True)
centroid = un[np.argmax(coun)]
appendc(centroid)
member = rows[cols == centroid]
appendm(member)
cols = cols[np.in1d(rows, member, invert=True)]
rows = rows[np.in1d(rows, member, invert=True)]
return members, centroids