Given two sets of points of equal size in n-dimensional space, one can map the points of one set onto the other, such that each point is used only once and the total Euclidean distance between the pairs of points is minimized, using linear_sum_assignment from scipy (an example can be found here).
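For reference, a minimal sketch of that baseline approach (the array sizes and random data are purely illustrative):

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
a = rng.random((500, 3))  # first point set: n points in 3D
b = rng.random((500, 3))  # second point set, same size

cost = cdist(a, b)  # explicit n-by-n cost matrix: the memory bottleneck
row_ind, col_ind = linear_sum_assignment(cost)
total_distance = cost[row_ind, col_ind].sum()  # minimized total distance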
However, this requires explicitly setting up the cost matrix, which can become prohibitive for large point sets.
What would be the best way to solve this problem in Python if the distance between any two points can be computed, but the point sets are so large that an explicit cost matrix is prohibitive?
Pointcloud of rope with desired start and end point
I have a pointcloud of a rope-like object with about 300 points. I'd like to sort the 3D coordinates of that pointcloud so that one end of the rope has index 0 and the other end has the last index (~300), as shown in the image. Other pointclouds of that object might be U-shaped, so I can't sort by the X, Y, or Z coordinate; for the same reason I also can't sort by the distance to a single point.
I have looked at the KDTree implementations in sklearn and scipy to compute the nearest neighbour of each point, but I don't know how to go on from there and sort the points into an array without getting duplicate entries.
Is there a way to sort these coordinates into an array so that, starting from one end, the array is appended with the coordinates of the next closest point?
First of all, there is obviously no exact solution to this problem (there is not even a strict definition of what you want to get), so anything you write will be a heuristic of some sort, which will fail in some cases, especially as your point cloud takes a non-trivial form (do you allow loops in your rope, for example?).
This said, a simple approach may be to build a graph with the points as vertices, where every two points are connected by an edge whose weight equals the straight-line distance between them.
Then build a minimum spanning tree of this graph. This provides a kind of skeleton for your point cloud, and you can devise any simple algorithm atop this skeleton.
For example, sort all points by their distance to the start of the rope, measured along this tree. There is only one path between any two vertices of a tree, so for each vertex calculate the length of its unique path to the rope start, and sort the vertices by this distance.
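A minimal sketch of this idea using scipy's csgraph utilities (the function name and the choice of start index are illustrative, and the dense distance matrix limits it to moderately sized clouds):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

def order_along_rope(points, start=0):
    dists = squareform(pdist(points))   # dense pairwise distance matrix
    mst = minimum_spanning_tree(dists)  # sparse minimum spanning tree
    # length of the unique tree path from `start` to every other vertex
    along = shortest_path(mst, directed=False, indices=start)
    return np.argsort(along)            # vertex indices sorted by that length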
As suggested in the other answer, there is no exact solution to this problem, and there can be edge cases such as loops, spirals, or tubes, but you can use heuristic approaches to solve your use case. Read about heuristic approaches such as hill climbing, simulated annealing, and genetic algorithms.
For any heuristic approach you need a way to measure how good a solution is: if I give you two orderings of the 300 points, how will you identify which one is better than the other? This measure depends on your use case.
One approach off the top of my head: hill climbing.
Method to measure the goodness of a solution: take the Euclidean distance between every pair of adjacent elements of the array and sum those distances.
Steps (a sketch follows below):
Create a randomised array of all the points.
Select two random indices, swap the elements at those indices, and check whether this improves your answer (i.e., whether the sum of Euclidean distances between adjacent elements decreases).
If it improves your answer, keep the elements swapped.
Repeat steps 2-3 for a large number of epochs (e.g., 10^6).
Plain hill climbing tends to stagnate in local optima because it lacks diversity; for better results use simulated annealing or genetic algorithms.
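A minimal sketch of the basic hill-climbing steps above (names are illustrative; recomputing the full sum each epoch is wasteful, and an incremental update of only the affected neighbours would be much faster):

import random
import numpy as np

def tour_length(order, points):
    # sum of Euclidean distances between adjacent points in this ordering
    path = points[order]
    return np.linalg.norm(np.diff(path, axis=0), axis=1).sum()

def hill_climb(points, epochs=10**6, seed=0):
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)                               # step 1: random ordering
    best = tour_length(order, points)
    for _ in range(epochs):
        i, j = rng.sample(range(len(order)), 2)      # step 2: two random indices
        order[i], order[j] = order[j], order[i]      # swap them
        cand = tour_length(order, points)
        if cand < best:
            best = cand                              # step 3: keep the improvement
        else:
            order[i], order[j] = order[j], order[i]  # otherwise undo the swap
    return order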
I've got a clustering problem that I believe requires a custom distance function. Each instance has an x, y coordinate but also a set of attributes that describe it (varying in number per instance). Ideally it would be possible to pass it Python objects (instances of a class) and compare them arbitrarily based on their content.
I want to represent the distance as a weighted sum of the Euclidean distance between the x, y values and something like a Jaccard index to measure the set overlap of the other attributes. Something like:
dist = 0.6 * euclidean(x1, y1, x2, y2) + 0.4 * (1 - jaccard(attrs1, attrs2))
Most of the clustering algorithms and implementations I've found convert instance features into numbers. For example, with DBSCAN in sklearn, to use my distance function I would need to convert the numbers back into the original representation somehow.
It would be great if it were possible to do clustering using a distance function that can compare instances in any arbitrary way. For example imagine a euclidean distance function that would evaluate objects as closer if they matched on another non-spatial feature.
def dist(ins1, ins2):
    euc = euclidean(ins1.x, ins1.y, ins2.x, ins2.y)
    if ins1.feature1 == ins2.feature1:
        euc = euc * 0.9
    return euc
Is there a method that would suit this? It would also be nice if the number of clusters didn't have to be set upfront (but this is not critical for me).
Actually, almost all clustering algorithms (except for k-means, which obviously needs numbers to compute the mean) can be used with arbitrary distance functions.
In sklearn, most algorithms accept metric="precomputed" and a distance matrix instead of the original input data. Please check the documentation more carefully. For example DBSCAN:
If metric is “precomputed”, X is assumed to be a distance matrix and must be square.
What you lose is the ability to accelerate some algorithms with index structures. Computing the distance matrix is O(n^2), so your algorithm cannot be faster than that. In sklearn, you would need to modify the Cython code to add a new distance function (using a pyfunc will yield very bad performance, unfortunately). Java tools such as ELKI can be extended with little overhead because the Java just-in-time compiler optimizes this well. If your distance is a metric, then many index structures can be used to accelerate e.g. DBSCAN.
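As a minimal sketch of the precomputed route, assuming each instance carries x, y and a set-valued attrs as in the question (eps and min_samples are placeholder values):

import numpy as np
from sklearn.cluster import DBSCAN

def dist(a, b):
    euc = np.hypot(a.x - b.x, a.y - b.y)
    jac = len(a.attrs & b.attrs) / len(a.attrs | b.attrs)  # Jaccard index (assumes non-empty sets)
    return 0.6 * euc + 0.4 * (1 - jac)

def cluster(instances, eps=0.5, min_samples=3):
    n = len(instances)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = dist(instances[i], instances[j])  # O(n^2) matrix
    return DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed").fit_predict(d)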
I have a collection of Point objects containing latitude and longitude (along with a few other, irrelevant properties). I want to form clusters, i.e. collections of points that are close together relative to other points.
Alternatively, I would like an algorithm which, if given a list of clusters containing close-by points and a new point, determines which cluster the new point belongs to (and adds it to a new cluster if it doesn't belong to an existing cluster).
I looked at hierarchical clustering algorithms, but those run too slowly. The k-means algorithm requires you to know the number of clusters beforehand, which is not really very helpful.
Thanks!
Try density based clustering methods.
DBSCAN is one of the most popular of those.
I am assuming you are using python.
Check out these:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
When you cluster based on GPS lat/lon, you may want a different distance calculation than DBSCAN's default (Euclidean). Use its metric parameter to supply your own distance calculation function or a precomputed distance matrix. For geographic distance calculations, check out the haversine formula.
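For example, sklearn's DBSCAN accepts metric="haversine" on coordinates given in radians; a small sketch with made-up coordinates and parameters:

import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000.0
latlon_deg = np.array([[40.7128, -74.0060],
                       [40.7130, -74.0058],
                       [34.0522, -118.2437]])  # sample (lat, lon) pairs in degrees

coords = np.radians(latlon_deg)  # the haversine metric works on radians
eps = 200.0 / EARTH_RADIUS_M     # 200 m expressed as a central angle
labels = DBSCAN(eps=eps, min_samples=2, metric="haversine",
                algorithm="ball_tree").fit_predict(coords)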
In a Python application I'm developing, I have an array of 3D points (of size between 2 and 100000) and I have to find the points that are within a certain distance of each other (say between two values, like 0.1 and 0.2). I need this for a graphics application, and the search should be very fast (~1/10 of a second for a sample of 10000 points).
As a first experiment I tried the scipy.spatial.KDTree.query_pairs implementation, and with a sample of 5000 points it takes 5 seconds to return the indices. Do you know of any approach that may work for this specific case?
A bit more about the application:
The points represent atom coordinates, and the distance search is used to determine the bonds between atoms. Bonds are not necessarily fixed and may change at each step, as in the case of hydrogen bonds.
Great question! Here is my suggestion:
Divide each coordinate by your "epsilon" value of 0.1/0.2/whatever and round the result to an integer. This creates a "quotient space" of points where distance no longer needs to be determined by the distance formula, but simply by comparing the integer coordinates of each point. If all three integer coordinates are equal, the original points were within approximately sqrt(3) * epsilon of each other. This process is O(n) and should take 0.001 seconds or less.
(Note: you would want to augment the original point with the three additional integers that result from this division and rounding, so that you don't lose the exact coordinates.)
Sort the points in numeric order using dictionary-style rules, treating the three integer coordinates as letters in a word. This process is O(n log n) and should take well under your 1/10th-of-a-second requirement.
Now simply walk through this sorted list and compare each point's integer coordinates with those of the previous and following points. If all coordinates match, both matching points go into your "keep" list, and all the others are marked "throw away." This is an O(n) pass which should take very little time.
The result is a subset of the original points containing only those that could possibly be involved in any bond, with a bond defined as two points approximately epsilon or less apart in your original set.
This process is not mathematically exact, but I think it is definitely fast and suited to your purpose.
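A compact sketch of the three steps (the function name is illustrative, and as noted above it only catches points that land in the same cell):

import numpy as np

def candidate_points(points, eps):
    q = np.round(points / eps).astype(int)  # step 1: quantize to integer cells, O(n)
    order = np.lexsort(q.T)                 # step 2: dictionary-style sort, O(n log n)
    qs = q[order]
    same_as_prev = np.all(qs[1:] == qs[:-1], axis=1)  # step 3: compare sorted neighbours
    keep = np.zeros(len(points), dtype=bool)
    keep[order[1:]] |= same_as_prev
    keep[order[:-1]] |= same_as_prev
    return points[keep]  # only points sharing a cell with another point survive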
The first thing that comes to my mind:
Calculating the distance between every two atoms in the set takes O(N^2) operations, which is very slow.
What about introducing a static orthogonal grid with some cell size (for example, close to the distance you are interested in) and then determining which atoms belong to each cell (this takes O(N) operations)? After this procedure you only need to search the neighbouring cells, which greatly reduces the time spent searching for neighbours.
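A rough sketch of such a grid; with the cell size equal to the cutoff distance, only the 27 surrounding cells need checking (names are illustrative):

from collections import defaultdict
from itertools import product
import numpy as np

def grid_pairs(points, cutoff):
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple((p // cutoff).astype(int))].append(i)  # O(N) binning
    pairs = []
    for cell, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):  # this cell plus its 26 neighbours
            other = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in cells.get(other, ()):
                    if i < j and np.linalg.norm(points[i] - points[j]) < cutoff:
                        pairs.append((i, j))
    return pairs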
I've got millions of geographic points. For each one of these, I want to find all "neighboring points," i.e., all other points within some radius, say a few hundred meters.
There is a naive O(N^2) solution to this problem---simply calculate the distance of all pairs of points. However, because I'm dealing with a proper distance metric (geographic distance), there should be a quicker way to do this.
I would like to do this within Python. One solution that comes to mind is to use some database (MySQL with GIS extensions, PostGIS) and hope that the database would take care of efficiently performing the operation described above using some index. I would prefer something simpler, though, that doesn't require me to build and learn such technologies.
A couple of points:
I will perform the "find neighbors" operation millions of times
The data will remain static
Because the problem is in a sense simple, I'd like to see the Python code that solves it.
Put in terms of Python code, I want something along the lines of:
points = [(lat1, long1), (lat2, long2) ... ]  # this list contains millions of lat/long tuples
points_index = magical_indexer(points)
neighbors = []
for point in points:
    point_neighbors = points_index.get_points_within(point, 200)  # get all points within 200 meters of point
    neighbors.append(point_neighbors)
scipy
First things first: there are preexisting algorithms for this kind of thing, such as the k-d tree. Scipy has an implementation, cKDTree, that can find all points within a given range.
Binary Search
Depending on what you're doing, however, implementing something like that may be nontrivial. Furthermore, creating a tree is fairly complex (potentially quite a bit of overhead), and you may be able to get away with a simple hack I've used before:
Compute the PCA of the dataset. You want to rotate the dataset such that the most significant direction comes first and the orthogonal, less significant direction second. You can skip this and just choose X or Y, but PCA is computationally cheap and usually easy to implement. If you do just choose X or Y, choose the direction with the greater variance.
Sort the points by the major direction (call this direction X).
To find the nearest neighbor of a given point, find the index of the point nearest in X by binary search (if the point is already in your collection, you may already know this index and can skip the search). Iteratively look at the next and previous points, maintaining the best match so far and its distance from your search point. You can stop looking when the difference in X is greater than or equal to the distance to the best match so far (in practice, this usually means looking at very few points).
To find all points within a given range, do the same as in step 3, except don't stop until the difference in X exceeds the range.
Effectively, you're doing O(N log N) preprocessing and, for each point, roughly O(sqrt(N)) work, or more if the distribution of your points is poor. If the points are roughly uniformly distributed, the number of points nearer in X than the nearest neighbor is on the order of sqrt(N). This is less efficient if many points are within your range, but never much worse than brute force.
One advantage of this method is that it's all executable in very few memory allocations and can mostly be done with very good memory locality, which means that it performs quite well despite the obvious limitations.
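A sketch of steps 2-4, skipping the optional PCA rotation (names are illustrative):

import numpy as np

def range_search(pts, query, radius):
    # pts must already be sorted by its first column (the major direction X)
    xs = pts[:, 0]
    mid = np.searchsorted(xs, query[0])  # step 3: binary search in X
    hits = []
    for step in (1, -1):                 # scan to the right, then to the left
        i = mid if step == 1 else mid - 1
        while 0 <= i < len(pts):
            if abs(xs[i] - query[0]) > radius:
                break                    # the X gap alone already exceeds the range
            if np.linalg.norm(pts[i] - query) <= radius:
                hits.append(i)
            i += step
    return hits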
Delaunay triangulation
Another idea: a Delaunay triangulation could work. In a Delaunay triangulation, any point's nearest neighbor is an adjacent node. The intuition is that during a search you can maintain a heap (priority queue) keyed on absolute distance from the query point: pick the nearest point, check that it's in range, and if so add all its neighbors. I suspect that it's impossible to miss any points this way, but you'd need to look at it more carefully to be sure...
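A sketch of that heap-based walk over scipy's Delaunay graph; note it inherits the same unverified assumption that stopping at the first out-of-range vertex misses nothing:

import heapq
import numpy as np
from scipy.spatial import Delaunay

def neighbors_within(points, start, radius):
    tri = Delaunay(points)
    indptr, neigh = tri.vertex_neighbor_vertices  # CSR-style adjacency of the triangulation
    heap = [(0.0, start)]
    seen = {start}
    found = []
    while heap:
        d, v = heapq.heappop(heap)  # always the nearest vertex seen so far
        if d > radius:
            break                   # assumption: nothing in range remains on the heap
        found.append(v)
        for w in neigh[indptr[v]:indptr[v + 1]]:
            if w not in seen:
                seen.add(w)
                heapq.heappush(heap, (np.linalg.norm(points[w] - points[start]), w))
    return found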
Tipped off by Eamon, I've come up with a simple solution using the k-d tree implemented in SciPy.
from scipy.spatial import cKDTree
from numpy import inf  # scipy no longer re-exports inf; use numpy's

max_distance = 0.0001  # assuming lats and longs are in decimal degrees, this corresponds to 11.1 meters

points = [(lat1, long1), (lat2, long2) ... ]

tree = cKDTree(points)

point_neighbors_list = []  # put the neighbors of each point here
for point in points:
    distances, indices = tree.query(point, len(points), p=2, distance_upper_bound=max_distance)
    point_neighbors = []
    for index, distance in zip(indices, distances):
        if distance == inf:
            break  # the remaining entries are padding for points beyond max_distance
        point_neighbors.append(points[index])
    point_neighbors_list.append(point_neighbors)
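A possibly simpler variant: cKDTree also provides query_ball_tree, which returns for every point the indices of all points within r in a single call:

tree = cKDTree(points)
neighbor_indices = tree.query_ball_tree(tree, r=max_distance)
point_neighbors_list = [[points[i] for i in idxs] for idxs in neighbor_indices]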