Most efficient way to calculate pairwise similarity of 250k lists

I have 250,000 lists containing an average of 100 strings each, stored across 10 dictionaries. I need to calculate the pairwise similarity of all lists (the similarity metric isn't relevant here; but, briefly, it involves taking the intersection of the two lists and normalizing the result by some constant).
The code I've come up with for the pairwise comparisons is quite straightforward. I'm just using itertools.product to compare every list to every other list. The problem is performing these calculations on 250,000 lists in a time-efficient way. To anyone who's dealt with a similar problem: Which of the usual options (scipy, PyTables) is best for this in terms of the following criteria:
supports python data types
smartly stores a very sparse matrix (approx 80% of the values will be 0)
efficient (can do the calculations in under 10 hours)

Do you just want the most efficient way to determine the distance between any two points in your data?
Or do you actually need this m x m distance matrix that stores all pair-wise similarity values for all rows in your data?
Usually it's far more efficient to persist your data in some metric space,
using a data structure optimized for rapid retrieval, than it is to
pre-calculate the pair-wise similarity values in advance and just look them up.
Needless to say, the distance matrix option scales horribly--
n data points requires an n x n distance matrix to store the pair-wise
similarity scores.
A kd-tree is the technique of choice for data of small dimension
("small" here means something like fewer than about 20 features);
Voronoi tessellation is often preferred for higher dimension data.
Much more recently, the ball tree has been used as a superior alternative
to both--it has the performance of the kd-tree but without the degradation
at high dimension.
scikit-learn has an excellent implementation which includes
unit tests. It is well-documented and currently under active development.
scikit-learn is built on NumPy and SciPy, so both are dependencies. The various installation options for scikit-learn are listed on the project site.
The most common use case for ball trees is k-Nearest Neighbors, but the structure
works quite well on its own, e.g., in cases like the one described in the OP.
You can use the scikit-learn Ball Tree implementation like so:
>>> import numpy as NP
>>> # create some fake data--a 2D NumPy array having 10,000 rows and 10 columns
>>> D = NP.random.randn(10000 * 10).reshape(10000, 10)
>>> # import the BallTree class
>>> from sklearn.neighbors import BallTree
>>> # call the constructor, passing in the data array and a 'leaf size'
>>> # the ball tree is instantiated and populated in the single step below:
>>> BT = BallTree(D, leaf_size=5, p=2)
>>> # 'leaf size' specifies the data (number of points) at which
>>> # point brute force search is triggered
>>> # 'p' specifies the distance metric, p=2 (the default) for Euclidean;
>>> # setting p equal to 1, sets Manhattan (aka 'taxi cab' or 'checkerboard' dist)
>>> type(BT)
<type 'sklearn.neighbors.ball_tree.BallTree'>
Instantiating and populating the ball tree is very fast
(timed using Corey Goldberg's timer class):
>>> with Timer() as t:
...     BT = BallTree(D, leaf_size=5)
>>> "ball tree instantiated & populated in {0:.2f} milliseconds".format(t.elapsed)
'ball tree instantiated & populated in 13.90 milliseconds'
Querying the ball tree is also fast.
An example query: find the three data points closest to the data point at row index 500, and for each of them return its index and its distance from the reference point D[500,:].
>>> # the ball tree has an instance method, 'query', which returns two arrays:
>>> # the distances to the k nearest neighbors and their row indices
>>> # (recent scikit-learn versions expect a 2D query array, hence the reshape)
>>> dx, idx = BT.query(D[500,:].reshape(1, -1), k=3)
>>> dx # distance
array([[ 0. , 1.206, 1.58 ]])
>>> idx # index
array([[500, 556, 373]], dtype=int32)
>>> with Timer() as t:
...     dx, idx = BT.query(D[500,:].reshape(1, -1), k=3)
>>> "query results returned in {0:.2f} milliseconds".format(t.elapsed)
'query results returned in 15.85 milliseconds'
The default distance metric in the scikit-learn Ball Tree implementation is Minkowski, which is just a generalization of Euclidean and Manhattan distance (i.e., the Minkowski expression has a parameter, p, which gives Euclidean distance when set to 2 and Manhattan distance when set to 1).
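To make the effect of the metric concrete, here is a minimal sketch that builds two ball trees over the same two points, one with the Euclidean metric and one with the Manhattan metric, and queries both. The toy points and variable names are made up for illustration; they are not from the data above.

```python
import numpy as np
from sklearn.neighbors import BallTree

# two toy points: the origin and (3, 4)
points = np.array([[0.0, 0.0], [3.0, 4.0]])
query = np.array([[0.0, 0.0]])  # query arrays must be 2D

# same data, two different metrics
euclid_tree = BallTree(points, metric="euclidean")     # Minkowski with p=2
manhattan_tree = BallTree(points, metric="manhattan")  # Minkowski with p=1

d_euclid, _ = euclid_tree.query(query, k=2)
d_manhattan, _ = manhattan_tree.query(query, k=2)

print(d_euclid)     # distances from the origin under the Euclidean metric
print(d_manhattan)  # distances from the origin under the Manhattan metric
```

The second neighbor is the same point in both cases, but its reported distance differs (the straight-line length versus the sum of the coordinate differences).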

If you define appropriate distance (similarity) function then some functions from scipy.spatial.distance might help
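As a rough illustration of that suggestion, and assuming that a Jaccard-style overlap is close enough to the metric in the question (the question does not specify it), the lists can be encoded as rows of a boolean membership matrix and passed to scipy.spatial.distance.pdist. The three toy lists are invented for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# toy stand-ins for the lists of strings (hypothetical data)
lists = [["a", "b", "c"], ["b", "c", "d"], ["x", "y"]]

# map every distinct string to a column index
vocab = sorted({s for lst in lists for s in lst})
col = {s: j for j, s in enumerate(vocab)}

# boolean membership matrix: one row per list, one column per string
X = np.zeros((len(lists), len(vocab)), dtype=bool)
for i, lst in enumerate(lists):
    X[i, [col[s] for s in lst]] = True

# condensed pairwise Jaccard distances, expanded to a square matrix
D = squareform(pdist(X, metric="jaccard"))
print(D)
```

Note that pdist produces a dense result, so this form is only practical for a sample of the data; for all 250,000 lists the full set of pairs would not fit in memory.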

Related

How can I improve my Shared Nearest Neighbor clustering algorithm?

I wrote my own Shared Nearest Neighbor(SNN) clustering algorithm, according to the original paper. Essentially, I get the nearest neighbors for each data point, precompute the distance matrix with Jaccard distance, and pass the distance matrix to DBSCAN.
To accelerate the algorithm, I only compute the Jaccard distance between two data points if they are nearest neighbors of each other and have over a certain number of shared neighbors. I also take advantage of the symmetry of the distance matrix, as I only compute half the matrix.
However, my algorithm is slow and takes much longer than common clustering algorithms, such as K-Means or DBSCAN. Can someone look at my code and suggest how I can improve it and make the algorithm faster?
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def jaccard(a, b):
    """
    Computes the Jaccard distance between two arrays.
    Parameters
    ----------
    a: an array.
    b: an array.
    """
    A = np.array(a, dtype='int')
    B = np.array(b, dtype='int')
    A = A[np.where(A > -1)[0]]
    B = B[np.where(B > -1)[0]]
    union = np.union1d(A, B)
    intersection = np.intersect1d(A, B)
    return 1.0 - len(intersection) * 1.0 / len(union)

def iterator_dist(indices, k_min=5):
    """
    An iterator that computes the Jaccard distance for any pair of stars.
    Parameters:
    indices: the indices of nearest neighbors in the chemistry-velocity
    space.
    """
    for n in range(len(indices)):
        for m in indices[n][indices[n] > n]:
            if len(np.intersect1d(indices[n], indices[m])) > k_min:
                dist = jaccard(indices[n], indices[m])
                yield (n, m, dist)

# load data here
data =
# hyperparameters
n_neighbors =
eps =
min_samples =
k_min =
# K Nearest Neighbors
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()
# distance matrix
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
    S[n, m] = dist
    S[m, n] = dist
db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed',
            n_jobs=-1).fit(S)
labels = db.labels_

Writing fast Python code is hard. The key is to avoid interpreted Python wherever possible and instead use either BLAS routines via NumPy or compiled code, e.g., Cython. So at some point you'll need to switch from "real" Python at least to typed Cython code, unless you can find a library that already implements these operations at a low enough level for you.
But the obvious first step is to run a profiler to identify the slow operations!
Secondly, consider avoiding a distance matrix. Anything involving a distance matrix tends to scale with O(n²) unless done very carefully. That is of course much slower than k-means and Euclidean DBSCAN.
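One possible way to follow the "avoid Python loops" advice, sketched here as an illustration and not as the asker's exact algorithm: encode the neighbor lists as a sparse indicator matrix and let a single sparse matrix product count the shared neighbors of every pair. The function name shared_neighbor_jaccard is hypothetical. The sketch assumes each row of indices holds k distinct neighbor ids with no -1 padding, and it omits the mutual-neighbor and k_min filters from the question.

```python
import numpy as np
from scipy.sparse import csr_matrix

def shared_neighbor_jaccard(indices):
    """Sparse Jaccard distances between k-nearest-neighbor lists.

    indices: (n, k) integer array; row i holds the k distinct neighbor
    ids of point i (assumption: no duplicates, no -1 padding).
    Only pairs that share at least one neighbor get an entry.
    """
    n, k = indices.shape
    rows = np.repeat(np.arange(n), k)
    # M[i, j] = 1 if j is one of the neighbors of i
    M = csr_matrix((np.ones(n * k), (rows, indices.ravel())), shape=(n, n))
    # (M @ M.T)[i, j] = number of neighbors shared by i and j
    inter = (M @ M.T).tocoo()
    # |union| = k + k - |intersection| because both lists have k members
    dist = 1.0 - inter.data / (2 * k - inter.data)
    return csr_matrix((dist, (inter.row, inter.col)), shape=(n, n))

# toy neighbor lists for three points (hypothetical data)
indices = np.array([[1, 2], [0, 2], [0, 1]])
D = shared_neighbor_jaccard(indices).toarray()
print(D)
```

Whether this is faster in practice depends on n and k, so it is still worth profiling both versions.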

Finding the nearest neighbours for a subset of samples

I have a dataset of about 3 million samples (each with just 3 features). I'm using scikit's sklearn.neighbors module - specifically radius_neighbor_graph - to find which samples fall within a small radius of a specific sample.
This works fine, but unsurprisingly it's really, really slow to compute this graph.
It's also very wasteful, because I only ever need to know the neighbors for a small subset of my samples (~ 100,000 of them) - and I know this subset in advance.
So... is there any way of being more efficient by calculating the neighbours within a given radius for just this subset of samples? It seems like it should be simple, but I can't think of an easy way of doing it.

First of all, the task of creating a radius-neighborhood-graph involves reading the N by N distance-matrix associated to your dataset. Since distance matrices have nice properties you can save some time, but still complexity lies somewhere in O(N^2). Here N is the number of data points in your data set X.
So one could say, that only a small number of n < N points are of interest as the center of a neighborhood, but the majority of points are just interesting as neighbors. This would result in an n by N distance matrix, where row i contains the distances of data point i to each other data point j, 1 <= i <= n, 1 <= j <= N. But this "distance matrix" has none of the desirable properties of a normal distance matrix (it is not even a square matrix), that you could use to speed up the process of creating an epsilon-neighborhood-graph.
Therefore I don't think that you will find a predefined function for your case. If you want to build one on your own, the steps should be as follows. Let X be your data set and i be the data point of interest.
1. Create the distance matrix D associated with your data set: use scipy.spatial.distance_matrix and take as x the small subset of your data set and as y the whole data set.
2. Create a list, neighbors = []
3. Loop over the i'th row of the distance matrix. If D(i,j) < epsilon, then save j in neighbors. It is the index of a data point in the epsilon neighborhood of i.
4. Return neighbors
Of course the computation of the distance matrix should happen once at the beginning (maybe in __init__() if you wrap everything up in a class), and the function/method that returns all epsilon neighbors of a data point should only depend on the index of the data point in question.
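The steps above can be sketched roughly as follows. The function name epsilon_neighbors and the toy data are made up for illustration. Note that the n by N matrix is held in memory all at once, so for the sizes in the question (100,000 by 3 million) the subset would have to be processed in chunks.

```python
import numpy as np
from scipy.spatial import distance_matrix

def epsilon_neighbors(X, subset_idx, epsilon):
    """For each point in the subset, return the indices of all points
    in X that lie closer than epsilon (the point itself included)."""
    # n x N matrix: rows are subset points, columns are all points
    D = distance_matrix(X[subset_idx], X)
    return [np.flatnonzero(row < epsilon) for row in D]

# toy data set with two tight pairs of points (hypothetical data)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
subset_idx = np.array([0, 2])  # the points whose neighborhoods we need

neighbors = epsilon_neighbors(X, subset_idx, epsilon=0.5)
for i, nb in zip(subset_idx, neighbors):
    print(i, nb)
```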
Hope this helps!

Clustering a sparse co-occurrence matrix

I have two N x N co-occurrence matrices (484x484 and 1060x1060) that I have to analyze. The matrices are symmetrical along the diagonal and contain lots of zero values. The non-zero values are integers.
I want to group together the positions that are non-zero. In other words, what I want to do is the algorithm on this link. When order by cluster is selected, the matrix gets re-arranged in rows and columns to group the non-zero values together.
Since I am using Python for this task, I looked into SciPy Sparse Linear Algebra library, but couldn't find what I am looking for.
Any help is much appreciated. Thanks in advance.

If you have a matrix dist with pairwise distances between objects, then you can find the order on which to rearrange the matrix by applying a clustering algorithm on this matrix (http://scikit-learn.org/stable/modules/clustering.html). For example it might be something like:
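from sklearn import cluster
import numpy as np
# linkage must not be the default "ward", which only accepts Euclidean input;
# newer scikit-learn releases call the affinity parameter metric
model = cluster.AgglomerativeClustering(n_clusters=20, affinity="precomputed", linkage="average").fit(dist)
new_order = np.argsort(model.labels_)
ordered_dist = dist[new_order] # can be your original matrix instead of dist[]
ordered_dist = ordered_dist[:,new_order]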
The order is given by the variable model.labels_, which has the number of the cluster to which each sample belongs. A few observations:
You have to find a clustering algorithm that accepts a distance matrix as input. AgglomerativeClustering is such an algorithm (notice the affinity="precomputed" option to tell it that we are using pre-computed distances).
What you have seems to be a pairwise similarity matrix, in which case you need to transform it to a distance matrix (e.g. dist=1 - data/data.max())
In the example I assumed 20 clusters, you may have to play with this variable a bit. Alternatively, you might try to find the best one-dimensional representation of your data (using e.g. MDS) to describe the optimal ordering of samples.
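As a small end-to-end illustration of the similarity-to-distance conversion and the reordering, here is a sketch that swaps in SciPy's hierarchical clustering (scipy.cluster.hierarchy) in place of AgglomerativeClustering and uses the dendrogram leaf order instead of a fixed number of clusters. The function name cluster_order and the 4x4 matrix are invented for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

def cluster_order(similarity):
    """Return a row/column order that places similar items side by side."""
    sim = np.asarray(similarity, dtype=float)
    # turn co-occurrence counts into distances in [0, 1]
    dist = 1.0 - sim / sim.max()
    np.fill_diagonal(dist, 0.0)
    # linkage expects the condensed (upper-triangle) form
    Z = linkage(squareform(dist, checks=False), method="average")
    return leaves_list(Z)

# toy symmetric co-occurrence matrix: items 0 and 2 co-occur, as do 1 and 3
sim = np.array([[0, 0, 5, 0],
                [0, 0, 0, 4],
                [5, 0, 0, 0],
                [0, 4, 0, 0]])

order = cluster_order(sim)
reordered = sim[np.ix_(order, order)]
print(order)
print(reordered)
```

After reordering, the non-zero entries sit in blocks next to the diagonal instead of being scattered across the matrix.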

Because your data is sparse, treat it as a graph, not a matrix.
Then try the various graph clustering methods. For example cliques are interesting on such data.
Note that not everything may cluster.

Finding nearest neighbors within a distance and taking the average of those neighbors using cKDTree

I'm using python scripting to read in two large (millions of points) point clouds as arrays ("A" and "B").
I need to find the nearest "B" neighbors of the points in "A", but within 5 cm of each point in "A". I also want to average the neighbors within the 5 cm radius of the points in "A."
Is there a way to do this using cKDTree all at once, including the averaging?

I'm not sure exactly what you want to do, but if I understand you correctly, you can follow these steps:
import numpy as np
from scipy.spatial import cKDTree
# these are just random arrays for testing
A = 20 * np.random.rand(1000, 3)
B = 20 * np.random.rand(1000, 3)
Compute a cKDTree for each point cloud:
tree_A = cKDTree(A)
tree_B = cKDTree(B)
Find the points in A that are at most 5 units from each point in B:
# faster than loop + query_ball_point
neighbourhood = tree_B.query_ball_tree(tree_A, 5)
Compute the mean over all of those groups of points:
# one row per point in B, because neighbourhood has one entry per point in B
means = np.zeros_like(B)
for i in range(len(neighbourhood)):
    means[i] = A[neighbourhood[i]].mean(0)

cKDTree does not have any units; I hope your measurements are all in the same units (cm) as your desired radius.
What do you mean by "average the neighbors"? Is this simply the mean location of all the neighbors within the 5-unit ball?
From what you've posted, I believe that the critical operation for you is
tree_B = cKDTree(B)
for A_point in A:
    hood = tree_B.query_ball_point(A_point, 5)
Now, just "average" the points in hood. I assume that you know how to do that part; cKDTree doesn't have such an operation, since NumPy and Python supply those on the base types.
You could do this with the whole of A as the first argument to query_ball_point, but then you'd get a huge list of neighbourhoods, and perhaps blow your memory limit.
Does that get you moving?
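Putting the pieces together, here is one way the lookup and the averaging might be combined. It is a sketch under the assumption that "average" means the mean location of the B points inside the ball; the function name mean_of_neighbors and the toy clouds are hypothetical. Points in A with no B neighbor in range get NaN.

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_of_neighbors(A, B, radius):
    """For each point in A, average the B points within `radius` of it."""
    tree_B = cKDTree(B)
    # one list of B indices per point in A
    hoods = tree_B.query_ball_point(A, r=radius)
    means = np.full(A.shape, np.nan)  # NaN marks "no neighbor found"
    for i, hood in enumerate(hoods):
        if hood:
            means[i] = B[hood].mean(axis=0)
    return means

# toy clouds: the first A point has two B neighbors, the second has none
A = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
B = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [50.0, 50.0, 50.0]])

means = mean_of_neighbors(A, B, radius=5.0)
print(means)
```

For clouds with millions of points, A can be passed in chunks so that the list of neighbourhoods never has to be held in memory all at once.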

Problems in performing K means clustering

I am trying to cluster the following data from a CSV file with K means clustering.
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
It is basically a graph where Samples are nodes and the numbers are the edges (weights).
I read the file as follows:
import csv
fileopening = open('data.csv', 'r')
reading = csv.reader(fileopening, delimiter=',')
L = list(reading)
I used this code: https://gist.github.com/betzerra/8744068
Here clusters are built based on the following:
num_points, dim, k, cutoff, lower, upper = 10, 2, 3, 0.5, 0, 200
points = map( lambda i: makeRandomPoint(dim, lower, upper), range(num_points) )
clusters = kmeans(points, k, cutoff)
for i,c in enumerate(clusters):
    for p in c.points:
        print " Cluster: ",i,"\t Point :", p
I replaced points with list L. But I got lots of errors: AttributeError, 'int' object has no attribute 'n', etc.
I need to perform K means clustering based on the third column (the edge weights) of my CSV file. The tutorial uses randomly generated points, and I am not sure how to use my CSV data as input to this k means function instead. How can I perform k means (k=2) on my data? How can I pass the CSV file data as input to this k means function?

In short: you can't.
Long answer:
K-means is defined for Euclidean spaces only and requires valid point positions, while you only have distances between the points, probably not distances in a strict mathematical sense but rather some kind of "similarity". K-means is not designed to work with similarity matrices.
What can you do?
1. Use some other method to embed your points in Euclidean space in such a way that they closely resemble your distances. One such tool is multidimensional scaling (MDS): http://en.wikipedia.org/wiki/Multidimensional_scaling
2. Once step 1 is done, you can run k-means.
3. Alternatively, you can construct a kernel (valid in Mercer's sense) by applying some kernel learning technique so that it resembles your data, and then run kernel k-means on the resulting Gram matrix.
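The embed-then-cluster route might look like the sketch below. It uses classical (Torgerson) MDS written directly in NumPy, which is one particular variant of MDS chosen here for brevity, followed by scikit-learn's KMeans. The function name classical_mds and the toy distance matrix are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def classical_mds(dist, n_components=2):
    """Embed points in Euclidean space from a symmetric distance matrix
    using classical (Torgerson) multidimensional scaling."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J        # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:n_components]
    # negative eigenvalues (non-Euclidean input) are clipped to zero
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

# toy distances: items 0 and 1 are close, items 2 and 3 are close
dist = np.array([[0, 1, 9, 10],
                 [1, 0, 8, 9],
                 [9, 8, 0, 1],
                 [10, 9, 1, 0]], dtype=float)

coords = classical_mds(dist)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
print(labels)
```

The input has to be a symmetric matrix of dissimilarities, so similarity weights like those in the question would need to be symmetrized and converted first.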

As lejlot said, only distances between points are not enough to run k-means in the classic sense. It's easy to understand if you understand the nature of k-means. On a high level, k-means works as follows:
1) Randomly assign points to cluster.
(Technically, there are more sophisticated ways of initial partitioning,
but that's not essential right now).
2) Compute centroids of the cluster.
(This is where you need the actual coordinates of the points.)
3) Reassign each point to a cluster with the closest centroid.
4) Repeat steps 2)-3) until stop condition is met.
So, as you can see, in the classic interpretation, k-means will not work, because it is unclear how to compute centroids. However, I have several suggestions of what you could do.
Suggestion 1.
Embed your points in N-dimensional space, where N is the number of points, so that the coordinates of each point are the distances to all the other points.
For example the data you showed:
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
becomes:
Sample1: (0,45,69,12,...)
Sample2: (78,46,0,0,...)
Then you can legitimately use Euclidean distance. Note that the actual distances between points will not be preserved, but this could be a simple and reasonable approximation that preserves relative distances between the points. Another disadvantage is that if you have a lot of points, then your memory (and running time) requirements will be of order N^2.
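A minimal sketch of building that embedding from the edge list follows. The function name edges_to_matrix is hypothetical, and the CSV text is inlined so the example is self-contained; with a real file, the rows would come from csv.reader on the opened file.

```python
import csv
import io
import numpy as np

def edges_to_matrix(rows):
    """Turn (source, target, weight) rows into one coordinate vector per
    sample; missing pairs are left at 0."""
    names = sorted({name for a, b, _ in rows for name in (a, b)})
    index = {name: i for i, name in enumerate(names)}
    M = np.zeros((len(names), len(names)))
    for a, b, w in rows:
        M[index[a], index[b]] = float(w)
    return names, M

# the sample rows from the question, inlined instead of read from data.csv
text = """Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
"""
rows = list(csv.reader(io.StringIO(text)))
names, M = edges_to_matrix(rows)
print(names)
print(M)
```

Each row of M is then the coordinate vector of one sample and can be handed to any standard k-means implementation.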
Suggestion 2.
Instead of k-means, try k-medoids. For this one, you do not need the actual coordinates of the points, because instead of centroids you compute medoids. The medoid of a cluster is the point from that cluster which has the smallest average distance to all other points in the cluster. You could look for implementations online, or write one yourself, since it is pretty easy to implement. The running time will be proportional to N^2 as well.
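To show how little is needed, here is a bare-bones k-medoids sketch that alternates between assigning points to the nearest medoid and re-picking each cluster's medoid. It is a simple alternating scheme and not the full PAM algorithm; the function name k_medoids and the toy distance matrix are made up for illustration.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Cluster points given only a pairwise distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)  # random initial medoids
    for _ in range(n_iter):
        # assign every point to its closest medoid
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # the medoid minimizes the total distance to its cluster
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# toy distances: items 0 and 1 are close, items 2 and 3 are close
dist = np.array([[0, 1, 9, 10],
                 [1, 0, 8, 9],
                 [9, 8, 0, 1],
                 [10, 9, 1, 0]], dtype=float)

medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

Like k-means, the result can depend on the random initialization, so running it with several seeds and keeping the best partition is a common precaution.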
Final remark.
Why do you want to use k-means at all? It seems like you have a weighted directed graph. There are clustering algorithms specifically intended for graphs. This is beyond the scope of your question, but maybe it is something worth considering?
