I am trying to implement a very simple greedy clustering algorithm in Python, but I am struggling to optimize it for speed. The algorithm takes a distance matrix, finds the column with the most entries below a predetermined distance cutoff, and stores the row indices (those with entries below the cutoff) as the members of the cluster. The column index is the cluster centroid. The rows and columns of every member index are then removed from the distance matrix (leaving a smaller, but still square, matrix), and the algorithm iterates over successively smaller distance matrices until all clusters are found.
Because each iteration depends on the previous one (a new distance matrix is formed so that clusters never share members), I don't think I can avoid a slow for loop in Python. I have tried numba (jit) to speed it up, but I believe it falls back to object mode and therefore gives no speed gain.
Here are two implementations of the algorithm; the first is slower than the second. Any suggestions for speedups are most welcome. I am aware of other clustering algorithms in scipy and sklearn (such as DBSCAN, k-means/medoids, etc.), but I am keen to use this particular one for my application. Thanks in advance for any suggestions.
Method 1 (slower):
import numpy as np

def cluster(distance_matrix, cutoff=1):
    indices = np.arange(0, len(distance_matrix))
    boolean_distance_matrix = distance_matrix <= cutoff
    centroids = []
    members = []
    while boolean_distance_matrix.any():
        # the column with the most entries below the cutoff becomes the next centroid
        centroid = np.argmax(np.sum(boolean_distance_matrix, axis=0))
        mem_indices = boolean_distance_matrix[:, centroid]
        mems = indices[mem_indices]
        # remove the members' rows and columns from further consideration
        boolean_distance_matrix[mems, :] = False
        boolean_distance_matrix[:, mems] = False
        centroids.append(centroid)
        members.append(mems)
    return members, centroids
Method 2 (faster, but still slow for large matrices):
It takes as input a sparse adjacency matrix built with sklearn's nearest-neighbors implementation. This is the simplest and fastest way I could think of to obtain the relevant distance matrix for clustering, and I believe working with the sparse matrix also speeds up the clustering itself.
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(metric='euclidean', radius=1.5, algorithm='kd_tree')
nbrs.fit(data)
adjacency_matrix = nbrs.radius_neighbors_graph(data)
def cluster(adjacency_matrix, gt=1):
    rows, cols = adjacency_matrix.nonzero()
    members = []
    member = np.ones(gt + 1)  # dummy value so the loop is entered
    centroids = []
    appendc = centroids.append
    appendm = members.append
    while len(member) > gt:
        # the most frequent column index is the next centroid
        un, coun = np.unique(cols, return_counts=True)
        centroid = un[np.argmax(coun)]
        appendc(centroid)
        member = rows[cols == centroid]
        appendm(member)
        # drop all edges that touch the new cluster's members
        keep = np.in1d(rows, member, invert=True)
        cols = cols[keep]
        rows = rows[keep]
    return members, centroids
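For context, here is a minimal usage sketch of Method 2 on synthetic 2D points (the data below is made up purely for illustration):
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
data = rng.random((1000, 2)) * 10  # synthetic 2D points

nbrs = NearestNeighbors(metric='euclidean', radius=1.5, algorithm='kd_tree')
nbrs.fit(data)
adjacency_matrix = nbrs.radius_neighbors_graph(data)

members, centroids = cluster(adjacency_matrix)
print(len(centroids), "clusters found")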
I'm working on document clustering where I first build a distance matrix from the tf-idf results. I use the below code to get my tf-idf matrix:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')  # pass the string, not a set
X = vectorizer.fit_transform(models)
This results in a matrix of shape (9069, 22210). Now I want to build a 9069 x 9069 distance matrix from it. I'm using the following code for that:
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from scipy.spatial import distance

arrX = X.toarray()
rowSize = X.shape[0]
distMatrix = np.zeros(shape=(rowSize, rowSize))

# build distance matrix
for i, x in enumerate(arrX):
    for j, y in enumerate(arrX):
        distMatrix[i][j] = distance.braycurtis(x, y)

np.savetxt("dist.csv", distMatrix, delimiter=",")
The problem with this code is that it's extremely slow for this matrix size. Is there a faster way of doing this?
The biggest issue is that the algorithm runs in roughly cubic time: each call to distance.braycurtis processes two arrays of length 22210, and the call is made 9069 * 9069 times. That means trillions of scalar operations are required to complete the computation, which is huge. The complexity itself probably cannot be improved, but there are several ways to speed the computation up:
The first thing to do is not to compute each distance twice. This distance is symmetric, so distMatrix[i][j] == distMatrix[j][i]: you can compute only the upper triangular part and then copy it into the lower triangular part.
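One simple way to get this for free (a sketch, separate from the Numba version below) is scipy's pdist, which evaluates each pair (i, j) with i < j only once, combined with squareform to mirror the condensed result into the full symmetric matrix:
from scipy.spatial.distance import pdist, squareform

# pdist computes only one triangle; squareform expands it to the full n x n matrix
condensed = pdist(arrX, metric='braycurtis')
distMatrix = squareform(condensed)
This removes the Python-level double loop and the duplicated evaluations, but it is still single-threaded.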
Another optimization is simply not to use distance.braycurtis, because it is slow: it takes about 10 µs per call on my machine. This is mainly because it creates several temporary arrays, is mostly memory-bound due to the Numpy operations, and because np.sum is not very fast (it uses a fairly precise summation algorithm that is hard to optimize). Moreover, it is sequential, while nearly all mainstream processors have multiple cores nowadays. We can use Numba to massively speed up this operation:
import numpy as np
import numba as nb

# Signatures are provided so that the function is compiled eagerly
# for 1D contiguous 32-bit and 64-bit floating-point arrays.
@nb.njit(['float32(float32[::1], float32[::1])', 'float64(float64[::1], float64[::1])'], fastmath=True)
def fastBrayCurtis(arr1, arr2):
    assert arr1.size == arr2.size
    assert arr1.size > 0
    zero = arr1[0] * 0  # Trick to give `zero` the same dtype as `arr1`
    df, sm = zero, zero
    for k in range(arr1.size):
        df += np.abs(arr1[k] - arr2[k])
        sm += np.abs(arr1[k] + arr2[k])
    return df / sm

# The signatures are provided so that the function is compiled eagerly
# for both 32-bit and 64-bit floating-point 2D contiguous arrays.
@nb.njit(['float32[:,::1](float32[:,::1])', 'float64[:,::1](float64[:,::1])'], fastmath=True, parallel=True)
def brayCurtisDistMatrix(arr):
    n = arr.shape[0]
    distance = np.empty((n, n), dtype=arr.dtype)
    # Compute the distance matrix in parallel while balancing the work between threads
    for i in nb.prange((n + 1) // 2):
        # Top of the upper triangular part (many items)
        for j in range(i, n):
            distance[j, i] = distance[i, j] = fastBrayCurtis(arr[i], arr[j])
        # Bottom of the upper triangular part (few items)
        for j in range(n - 1 - i, n):
            distance[j, n - 1 - i] = distance[n - 1 - i, j] = fastBrayCurtis(arr[n - 1 - i], arr[j])
    return distance
This code is about 440 times faster than the initial one on my 6-core i5-9600KF processor. A quick theoretical analysis combined with profiling shows that the algorithm is close to optimal (more than 75% of my processor's computing power is used). If this is not enough, consider the single-precision implementation. If that is still not enough, consider writing an optimized GPU implementation (or simply reconsider whether you need to compute such a huge distance matrix at all).
You see, the elements of the NumPy array you pass in can be stored in memory in two ways:
ROW MAJOR (C order)
COLUMN MAJOR (Fortran order)
Each layout has its advantages and disadvantages, mainly for cache locality depending on whether you iterate over rows or over columns, and you can control which one is used.
I hope you find this helpful
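As a small illustration of what this means in NumPy (this only shows how to inspect and control the layout; it is not, by itself, a fix for the speed problem):
import numpy as np

a = np.arange(6).reshape(2, 3)        # row-major (C order) by default
f = np.asfortranarray(a)              # column-major (Fortran order) copy of the same data
print(a.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # True True

# the layout can also be requested at creation time
z = np.zeros((1000, 1000), order='F')  # column-major array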
I have two lists containing x and y n-dimensional points, respectively. I need to calculate the sum of the minimum distances from each point in the first list (containing x points) to the points in the second list (containing y points). The distance I am calculating is the Euclidean distance, and an optimized solution is needed.
I have already implemented a naive solution in Python, but its time complexity is too high for it to be usable anywhere. Can the time complexity be reduced below that of my implementation?
I was reading this paper, which I was trying to implement. The authors face a similar problem, which they describe as a special case of the Earth Mover's Distance. Since no code is given, I could not see how they implemented it; hence the naive implementation below, which is too slow to work on a data set of 11k documents. I used Google Colab to run my code.
# Calculating Euclidean distance between two points
def euclidean_dist(x, y):
    # len(x) is the number of dimensions; x and y are lists
    # containing the coordinates of a point
    dd = 0.0
    for i in range(len(x)):
        dd = dd + (x[i] - y[i]) ** 2
    return dd ** (1 / 2)

# Calculating the desired solution to our problem
def dist(l1, l2):
    min_dd = 0.0
    for j in range(len(l1)):
        # distance from l1[j] to its nearest point in l2
        dd = euclidean_dist(l1[j], l2[0])
        for k in range(1, len(l2)):
            temp = euclidean_dist(l1[j], l2[k])
            if dd > temp:
                dd = temp
        min_dd = min_dd + dd
    return min_dd
To reduce the runtime, I would suggest first computing Manhattan distances (delta x + delta y), sorting the resulting array for each point, and then keeping a buffer of +20% above the lowest Manhattan distance; for the candidates that fall within that +20% range, compute the Euclidean distances and take the minimum as the correct answer.
This will save some time, but the 20% figure might not help if the points are all close together, since most of them will then fall inside the buffer region; try fine-tuning the 20% parameter to see what works best for your dataset. Keep in mind that reducing it too much might lead to inaccurate answers, because Manhattan and Euclidean distances do not rank points identically.
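A rough sketch of that heuristic for a single query point (the function name and the 1.2 buffer factor are illustrative assumptions, and, as noted above, the result is approximate):
import numpy as np

def min_euclidean_with_manhattan_filter(p, candidates, buffer=1.2):
    # approximate nearest Euclidean distance from point p to the rows of candidates
    p = np.asarray(p, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    manhattan = np.abs(candidates - p).sum(axis=1)
    # keep only candidates within +20% of the smallest Manhattan distance
    keep = candidates[manhattan <= buffer * manhattan.min()]
    return np.sqrt(((keep - p) ** 2).sum(axis=1)).min()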
This is similar to a k-nearest-neighbor problem, so finding the closest point to a given point costs O(N), and for your problem the total cost is O(N^2).
Sometimes using a k-d tree MAY improve performance if your data is low-dimensional, as sketched below.
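For example, a sketch with scipy's cKDTree (the array shapes below are made up); it answers exactly the "nearest point in the other list" query and works best in low dimensions:
import numpy as np
from scipy.spatial import cKDTree

l1 = np.random.random((1000, 3))  # x points, n-dimensional
l2 = np.random.random((2000, 3))  # y points, same dimension

tree = cKDTree(l2)
min_dists, _ = tree.query(l1)     # nearest distance in l2 for every point of l1
min_dd = min_dists.sum()          # sum of minimum distances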
To calculate the distance between two points, you can use the Euclidean distance formula d = sqrt((x1 - x2)^2 + (y1 - y2)^2), which you can implement like this in Python:
import math

def dist(x1, y1, x2, y2):
    return math.sqrt(pow(x1 - x2, 2) + pow(y1 - y2, 2))
Then all you need to do is loop over the points, compute the distance between each pair, and keep it if it is below the currently stored minimum distance. You end up with an O(n²) algorithm, which seems to be what you want. Here is a working example:
# here l1 and l2 are treated as the x and y coordinates of one set of points
min_dd = None
for i in range(len(l1)):
    for j in range(i + 1, len(l1)):
        dd = dist(l1[i], l2[i], l1[j], l2[j])
        if min_dd is None or dd < min_dd:
            min_dd = dd
With this you can get pretty good performances even with large list of points.
Small arrays
For two numpy arrays x and y of shape (n,) and (m,) respectively, you can vectorize the distance calculations and then get the minimum distance:
import numpy as np
n = 10
m = 20
x = np.random.random(n)
y = np.random.random(m)
# Using squared distance matrix and taking the
# square root at the minimum value
distance_matrix = (x[:,None]-y[None,:])**2
minimum_distance_sum = np.sum(np.sqrt(np.min(distance_matrix, axis=1)))
For arrays of shape (n,l) and (m,l), you just need to calculate the distance_matrix as:
distance_matrix = np.sum((x[:,None]-y[None,:])**2, axis=2)
Alternatively, you could use np.linalg.norm, scipy.spatial.distance.cdist, np.einsum etc., but in many cases they are not faster.
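For example, a sketch of the cdist variant for the (n, l) and (m, l) case:
from scipy.spatial.distance import cdist

# full (n, m) Euclidean distance matrix, then the row-wise minimum, then the sum
minimum_distance_sum = cdist(x, y).min(axis=1).sum()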
Large arrays
If l, n and m above are too large for you to keep the distance_matrix in memory, you can use mathematical lower and upper bounds on the Euclidean distance to increase the speed (see this paper). Since this relies on for loops, it would be very slow in pure Python, but one can wrap the functions with numba to counter this:
import numpy as np
import numba

@numba.jit(nopython=True, fastmath=True)
def get_squared_distance(a, b):
    return np.sum((a - b)**2)

def get_minimum_distance_sum(x, y):
    n = x.shape[0]
    m = y.shape[0]
    l = x.shape[1]
    # Calculate mean and standard deviation of both arrays
    mx = np.mean(x, axis=1)
    my = np.mean(y, axis=1)
    sx = np.std(x, axis=1)
    sy = np.std(y, axis=1)
    return _get_minimum_distance_sum(x, y, n, m, l, mx, my, sx, sy)

@numba.jit(nopython=True, fastmath=True)
def _get_minimum_distance_sum(x, y, n, m, l, mx, my, sx, sy):
    min_distance_sum = 0
    for i in range(n):
        min_distance = get_squared_distance(x[i], y[0])
        for j in range(1, m):
            # Skip candidates whose lower bound already exceeds the current minimum
            lower_bound = l * ((mx[i] - my[j])**2 + (sx[i] - sy[j])**2)
            if lower_bound >= min_distance:
                continue
            distance = get_squared_distance(x[i], y[j])
            if distance < min_distance:
                min_distance = distance
        min_distance_sum += np.sqrt(min_distance)
    return min_distance_sum

def test_minimum_distance_sum():
    # Will likely need to be much larger for this to be faster than the other method
    n = 10
    m = 20
    l = 100
    x = np.random.random((n, l))
    y = np.random.random((m, l))
    return get_minimum_distance_sum(x, y)
This approach should become faster than the former as the array size grows. The algorithm can be improved slightly, as described in the paper, but any speedup will depend heavily on the shape of the arrays.
Timings
On my laptop, on two arrays of shape (1000,100), your approach takes ~1 min, the "small arrays" approach takes 690 ms and the "large arrays" approach takes 288 ms. For two arrays of shape (100, 3), your approach takes 28 ms, the "small arrays" approach takes 429 μs and the "large arrays" approach takes 578 μs.
I wrote my own Shared Nearest Neighbor (SNN) clustering algorithm, following the original paper. Essentially, I get the nearest neighbors for each data point, precompute the distance matrix with the Jaccard distance, and pass the distance matrix to DBSCAN.
To accelerate the algorithm, I only compute the Jaccard distance between two data points if they are nearest neighbors of each other and have more than a certain number of shared neighbors. I also take advantage of the symmetry of the distance matrix and only compute half of it.
However, my algorithm is slow and takes much longer than common clustering algorithms such as k-means or DBSCAN. Can someone look at my code and suggest how I can improve it and make the algorithm faster?
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

def jaccard(a, b):
    """
    Computes the Jaccard distance between two arrays.

    Parameters
    ----------
    a: an array.
    b: an array.
    """
    A = np.array(a, dtype='int')
    B = np.array(b, dtype='int')
    A = A[np.where(A > -1)[0]]
    B = B[np.where(B > -1)[0]]
    union = np.union1d(A, B)
    intersection = np.intersect1d(A, B)
    return 1.0 - len(intersection) * 1.0 / len(union)

def iterator_dist(indices, k_min=5):
    """
    An iterator that computes the Jaccard distance for any pair of stars.

    Parameters:
    indices: the indices of nearest neighbors in the chemistry-velocity
        space.
    """
    for n in range(len(indices)):
        for m in indices[n][indices[n] > n]:
            if len(np.intersect1d(indices[n], indices[m])) > k_min:
                dist = jaccard(indices[n], indices[m])
                yield (n, m, dist)

# load data here
data =

# hyperparameters
n_neighbors =
eps =
min_samples =
k_min =

# K Nearest Neighbors
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()

# distance matrix
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
    S[n, m] = dist
    S[m, n] = dist

db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed',
            n_jobs=-1).fit(S)
labels = db.labels_
Writing fast Python code is hard. The key is to avoid Python-level loops wherever possible and instead either use BLAS routines via numpy or, e.g., Cython, which is compiled rather than interpreted. So at some point you'll need to switch from plain Python to at least typed Cython code, unless you can find a library that already implements these operations at a low enough level for you.
But the obvious first step is to run a profiler to identify the slow operations!
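For example, a minimal profiling sketch with the standard library (cluster_my_data is a placeholder for your own clustering call):
import cProfile
import pstats

# run the clustering under the profiler and dump the stats to a file
cProfile.run('cluster_my_data()', 'profile.out')
# print the ten functions with the largest cumulative time
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)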
Secondly, consider avoiding a distance matrix altogether. Anything involving a full distance matrix tends to scale with O(n²) unless done very carefully, and that is of course much slower than k-means or Euclidean DBSCAN.
I wrote my own clustering algorithm (bad, I know) for my problem. It works well, but it could be faster.
The algorithm takes a list of values (1D) as input and works like this:
1. For each cluster, calculate the distance to the closest neighboring cluster.
2. Select the cluster A that has the smallest distance to its neighbor B.
3. If the distance between A and B is not below the threshold, stop.
4. Combine A and B.
5. Go to 1.
I probably reinvented the wheel here.
This is my brute force code; how can I make it faster? I have Scipy and Numpy installed, in case there is something ready-made.
# cluster center as simple average value
def cluster_center(cluster):
    return sum(cluster) / len(cluster)

# distance between clusters
def cluster_distance(a, b):
    return abs(cluster_center(a) - cluster_center(b))

while True:
    cluster_distances = []
    # If nothing to cluster, we are done
    if len(clusters) < 2:
        break
    # Go through all clusters and find the shortest distance to a neighbor
    for cluster in clusters:
        cluster_distances.append((cluster, sorted([(cluster_distance(cluster, c), c) for c in clusters if c != cluster])[0]))
    # Find the closest pair
    cluster_distances.sort(key=lambda item: item[1])
    # Check if the distance is under the threshold 15
    if cluster_distances[0][1][0] < 15:
        a = cluster_distances[0][0]
        b = cluster_distances[0][1][1]
        # Combine clusters (combine lists)
        a.extend(b)
        # Form a new cluster list
        clusters = [c[0] for c in cluster_distances if c[0] != b]
    else:
        break
Usually, the term "cluster analysis" is only used for multi-variate partitions. Because in 1d, you can actually sort your data, and solve much of these problems much easier this way.
So to speed up your approach, sort your data! And reconsider what you then need to do.
As for a more advanced method: do kernel density estimation, and look for local minima as splitting points.
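A sketch of the sort-based idea (the gap rule below is related to, but not identical to, the centre-distance merging in the original code; the threshold of 15 is reused from the question):
import numpy as np

def cluster_1d(values, threshold=15):
    # sort the values and start a new cluster wherever the gap
    # between consecutive values exceeds the threshold
    values = np.sort(np.asarray(values, dtype=float))
    split_points = np.where(np.diff(values) > threshold)[0] + 1
    return np.split(values, split_points)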
I have 250,000 lists containing an average of 100 strings each, stored across 10 dictionaries. I need to calculate the pairwise similarity of all lists (the similarity metric isn't relevant here; but, briefly, it involves taking the intersection of the two lists and normalizing the result by some constant).
The code I've come up with for the pairwise comparisons is quite straightforward: I'm just using itertools.product to compare every list to every other list. The problem is performing these calculations on 250,000 lists in a time-efficient way. For anyone who's dealt with a similar problem: which of the usual options (scipy, PyTables) is best in terms of the following criteria?
supports python data types
smartly stores a very sparse matrix (approx 80% of the values will be 0)
efficient (can do the calculations in under 10 hours)
Do you just want the most efficient way to determine the distance between any two points in your data?
Or do you actually need this m x m distance matrix that stores all pair-wise similarity values for all rows in your data?
Usually it's far more efficient to persist your data in some metric space, using a data structure optimized for rapid retrieval, than it is to pre-calculate the pairwise similarity values in advance and just look them up. Needless to say, the distance-matrix option scales horribly: n data points require an n x n distance matrix to store the pairwise similarity scores.
A kd-tree is the technique of choice for data of small dimension ("small" here means something like fewer than about 20 features); Voronoi tessellation is often preferred for higher-dimensional data. Much more recently, the ball tree has been used as a superior alternative to both: it has the performance of the kd-tree without the degradation at high dimension.
scikit-learn has an excellent implementation which includes unit tests. It is well documented and currently under active development. scikit-learn is built on NumPy and SciPy, so both are dependencies; the various installation options for scikit-learn are provided on the site.
The most common use case for ball trees is k-nearest neighbors, but it will work quite well on its own, e.g. in cases like the one described in the OP.
You can use the scikit-learn BallTree implementation like so:
>>> # create some fake data--a 2D NumPy array having 10,000 rows and 10 columns
>>> import numpy as NP
>>> D = NP.random.randn(10000 * 10).reshape(10000, 10)
>>> # import the BallTree class (here bound to a local variable of same name)
>>> from sklearn.neighbors import BallTree as BallTree
>>> # call the constructor, passing in the data array and a 'leaf size'
>>> # the ball tree is instantiated and populated in the single step below:
>>> BT = BallTree(D, leaf_size=5, p=2)
>>> # 'leaf size' specifies the number of points at which
>>> # brute-force search is triggered
>>> # 'p' specifies the distance metric: p=2 (the default) is Euclidean,
>>> # p=1 is Manhattan (aka 'taxi cab') distance
>>> type(BT)
<type 'sklearn.neighbors.ball_tree.BallTree'>
Instantiating and populating the ball tree is very fast (timed using Corey Goldberg's timer class):
>>> with Timer() as t:
...     BT = BallTree(D, leaf_size=5)
>>> "ball tree instantiated & populated in {0:2f} milliseconds".format(t.elapsed)
'ball tree instantiated & populated in 13.90 milliseconds'
Querying the ball tree is also fast. An example query: find the three data points closest to the data point at row index 500, and for each of them return its index and its distance from the reference point D[500,:]:
>>> # ball tree has an instance method, 'query' which returns pair-wise distance
>>> # and an index; one distance and index is returned per 'pair' of data points
>>> dx, idx = BT.query(D[500,:], k=3)
>>> dx # distance
array([[ 0. , 1.206, 1.58 ]])
>>> idx # index
array([[500, 556, 373]], dtype=int32)
>>> with Timer() as t:
...     dx, idx = BT.query(D[500,:], k=3)
>>> "query results returned in {0:2f} milliseconds".format(t.elapsed)
'query results returned in 15.85 milliseconds'
The default distance metric in the scikit-learn BallTree implementation is Minkowski, which is a generalization of both Euclidean and Manhattan (the Minkowski expression has a parameter p, which gives Euclidean distance for p=2 and Manhattan distance for p=1).
If you define an appropriate distance (similarity) function, then some functions from scipy.spatial.distance might help:
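For example, a sketch with pdist and a callable metric (my_distance below is only a placeholder for your own similarity function):
import numpy as np
from scipy.spatial.distance import pdist, squareform

def my_distance(u, v):
    # placeholder metric: 1 minus the normalized overlap of the two vectors
    shared = np.minimum(u, v).sum()
    total = np.maximum(u, v).sum()
    return 1.0 - shared / total if total else 0.0

X = np.random.randint(0, 2, size=(100, 50)).astype(float)
D = squareform(pdist(X, metric=my_distance))  # full 100 x 100 distance matrix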