I've got millions of geographic points. For each one of these, I want to find all "neighboring points," i.e., all other points within some radius, say a few hundred meters.
There is a naive O(N^2) solution to this problem: simply calculate the distance between all pairs of points. However, because I'm dealing with a proper distance metric (geographic distance), there should be a quicker way to do this.
I would like to do this in Python. One solution that comes to mind is to use some database (MySQL with GIS extensions, PostGIS) and hope that such a database would take care of efficiently performing the operation described above using some index. I would prefer something simpler though, that doesn't require me to build and learn about such technologies.
A couple of points
I will perform the "find neighbors" operation millions of times
The data will remain static
Because the problem is in a sense simple, I'd like to see the Python code that solves it.
Put in terms of Python code, I want something along the lines of:
points = [(lat1, long1), (lat2, long2) ... ] # this list contains millions of lat/long tuples
points_index = magical_indexer(points)
neighbors = []
for point in points:
    point_neighbors = points_index.get_points_within(point, 200) # get all points within 200 meters of point
    neighbors.append(point_neighbors)
scipy
First things first: there are preexisting algorithms to do this kind of thing, such as the k-d tree. SciPy has a Python implementation, cKDTree, that can find all points within a given range.
Binary Search
Depending on what you're doing however, implementing something like that may be nontrivial. Furthermore, creating a tree is fairly complex (potentially quite a bit of overhead), and you may be able to get away with a simple hack I've used before:
1. Compute the PCA of the dataset. You want to rotate the dataset such that the most significant direction is first, and the orthogonal (less large) second direction is, well, second. You can skip this and just choose X or Y, but it's computationally cheap and usually easy to implement. If you do just choose X or Y, choose the direction with greater variance.
2. Sort the points by the major direction (call this direction X).
3. To find the nearest neighbor of a given point, find the index of the point nearest in X by binary search (if the point is already in your collection, you may already know this index and don't need the search). Iteratively look at the next and previous points, maintaining the best match so far and its distance from your search point. You can stop looking when the difference in X is greater than or equal to the distance to the best match so far (in practice, usually very few points).
4. To find all points within a given range, do the same as step 3, except don't stop until the difference in X exceeds the range.
Effectively, you're doing O(N log(N)) preprocessing, and roughly O(sqrt(N)) work per point - or more, if the distribution of your points is poor. If the points are roughly uniformly distributed, the number of points nearer in X than the nearest neighbor will be on the order of the square root of N. It's less efficient if many points are within your range, but never much worse than brute force.
One advantage of this method is that it's all executable with very few memory allocations, and can mostly be done with very good memory locality, which means that it performs quite well despite the obvious limitations.
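A minimal sketch of steps 2-4 (skipping the PCA rotation and using plain 2D Euclidean distance rather than geographic distance), assuming pts is an (N, 2) NumPy array; the names build_index and points_within are made up for illustration:
import numpy as np
from bisect import bisect_left

def build_index(pts):
    axis = int(np.argmax(pts.var(axis=0)))   # choose X or Y, whichever has greater variance
    order = np.argsort(pts[:, axis])
    return pts[order], axis                  # points sorted along the chosen direction

def points_within(sorted_pts, axis, query, radius):
    keys = sorted_pts[:, axis]
    i = bisect_left(keys, query[axis])       # binary search for the nearest position in X
    hits = []
    # scan outward in both directions; stop once the gap in X alone exceeds the radius
    for direction in (range(i, len(sorted_pts)), range(i - 1, -1, -1)):
        for j in direction:
            if abs(keys[j] - query[axis]) > radius:
                break
            if np.hypot(*(sorted_pts[j] - query)) <= radius:
                hits.append(sorted_pts[j])
    return hits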
Delaunay triangulation
Another idea: a Delaunay triangulation could work. For the Delaunay triangulation, it's given that any point's nearest neighbor is an adjacent node. The intuition is that during a search, you can maintain a heap (priority queue) based on absolute distance from the query point. Pick the nearest point, check that it's in range, and if so add all its neighbors. I suspect that it's impossible to miss any points like this, but you'd need to look at it more carefully to be sure...
Tipped off by Eamon, I've come up with a simple solution using KD-trees implemented in SciPy.
from scipy.spatial import cKDTree
from numpy import inf
max_distance = 0.0001 # Assuming lats and longs are in decimal degrees, this corresponds to 11.1 meters
points = [(lat1, long1), (lat2, long2) ... ]
tree = cKDTree(points)
point_neighbors_list = [] # Put the neighbors of each point here
for point in points:
    distances, indices = tree.query(point, len(points), p=2, distance_upper_bound=max_distance)
    point_neighbors = []
    for index, distance in zip(indices, distances):
        if distance == inf:
            break
        point_neighbors.append(points[index])
    point_neighbors_list.append(point_neighbors)
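As an aside, the same tree also supports cKDTree.query_ball_point, which returns the indices of all points within a given radius directly, so the inf check above isn't needed; a shorter variant reusing tree, points, and max_distance from above:
# For every point, collect the coordinates of all neighbors within max_distance.
# Note that query_ball_point includes the query point itself in the result.
point_neighbors_list = [
    [points[i] for i in tree.query_ball_point(point, max_distance)]
    for point in points
]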
Related
Pointcloud of rope with desired start and end point
I have a pointcloud of a rope-like object with about 300 points. I'd like to sort the 3D coordinates of that pointcloud so that one end of the rope has index 0 and the other end has index 300, as shown in the image. Other pointclouds of that object might be U-shaped, so I can't sort by the X, Y, or Z coordinate. Because of that I also can't sort by the distance to a single point.
I have looked at KDTree by sklearn or scipy to compute the nearest neighbour of each point but I don't know how to go from there and sort the points in an array without getting double entries.
Is there a way to sort these coordinates in an array, so that from a starting point the array gets appended with the coordinates of the next closest point?
First of all, obviously, there is no strict solution to this problem (and there is not even a strict definition of what you want to get). So anything you write will be a heuristic of some sort, which will fail in some cases, especially as your point cloud takes on some non-trivial form (do you allow loops in your rope, for example?).
This said, a simple approach may be to build a graph with the points being the vertices, and every two points connected by an edge with a weight equal to the straight-line distance between these two points.
And then build a minimum spanning tree of this graph. This will provide a kind of skeleton for your point cloud, and you can devise any simple algorithm on top of this skeleton.
For example, sort all points by their distance to the start of the rope, measured along this tree. There is only one path between any two vertices of the tree, so for each vertex of the tree calculate the length of the single path to the rope start, and sort all the vertices by this distance.
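A minimal sketch of this idea using SciPy, assuming points is an (N, 3) NumPy array and start_idx is the index of one end of the rope (how you pick that end is left open here):
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

dists = squareform(pdist(points))                   # dense pairwise distance matrix (complete graph)
mst = minimum_spanning_tree(dists)                  # sparse skeleton of the point cloud
along = shortest_path(mst, directed=False, indices=start_idx)  # path length along the tree to each point
order = np.argsort(along)                           # rope order: start_idx comes first
sorted_points = points[order]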
As suggested in the other answer, there is no strict solution to this problem, and there can be edge cases such as loops, spirals, or tubes, but you can go with heuristic approaches to solve your use case. Read about heuristic approaches such as hill climbing, simulated annealing, genetic algorithms, etc.
For any heuristic approach you need a method to measure how good a solution is. Say I give you two arrays of 3000 elements: how will you identify which solution is better than the other? That method depends on your use case.
One approach off the top of my head: hill climbing.
Method to measure the goodness of a solution: take the Euclidean distance between all adjacent elements of the array and sum those distances.
Steps:
1. Create a randomised array of all the 3000 elements.
2. Select two random indices out of these 3000 and swap the elements at those indices; check whether it improves your answer (i.e. whether the sum of Euclidean distances between adjacent elements decreases).
3. If it improves your answer, keep those elements swapped.
4. Repeat steps 2-3 for a large number of epochs (10^6).
This solution can stagnate because it lacks diversity. For better results use simulated annealing or genetic algorithms.
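A rough sketch of that hill-climbing loop, assuming points is an (N, 3) NumPy array; recomputing the full sum on every swap is wasteful but keeps the example short:
import numpy as np

def tour_length(points, order):
    # Sum of Euclidean distances between consecutive points in the ordering
    ordered = points[order]
    return np.linalg.norm(np.diff(ordered, axis=0), axis=1).sum()

def hill_climb(points, epochs=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(points))              # step 1: random ordering
    best = tour_length(points, order)
    for _ in range(epochs):
        i, j = rng.integers(len(points), size=2)      # step 2: pick two random positions
        order[i], order[j] = order[j], order[i]       # swap them
        new = tour_length(points, order)
        if new < best:                                # step 3: keep improving swaps
            best = new
        else:
            order[i], order[j] = order[j], order[i]   # otherwise revert
    return order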
Does there exist nearest neighbor data structure that supports delete and add operations along with exact nearest neighbor queries? Looking for a Python implementation ideally.
Attempts:
Found MANY implementations for approximate nearest neighbor queries in high dimensional spaces.
Found KD Trees and Ball Trees but they do not allow for dynamic rebalancing.
Thinking an algorithm could be possible with locality sensitive hashing.
Looking at octrees.
Context:
For each point of 10,000 points, query for its nearest neighbor
Evaluate each pair of neighbors
Pick one and delete the pair of points and add a merged point.
Repeat for some number of iterations
Yes. There exists such a data structure. I invented one; I had exactly this problem at hand. The data structure makes KD-trees seem excessively complex. It consists of nothing more than lists of the points sorted along each of the dimensions the points have.
Obviously you can add and remove an n-dimensional point from n lists sorted by their respective dimensions rather trivially. A few tricks let you iterate these lists and mathematically prove you have found the shortest distance to a point. See my answer here for elaboration and code.
I must note, though, that your context is wrong. The closest point to A may be B, but it doesn't follow that B's closest point is A. You could rig a chain of points such that the distance between each link is less than the one before it but still farther than all the other points, resulting in there being only one pair of points that are each other's nearest neighbor.
I have 10,000 64-dimensional vectors, and I need to find the vector with the least Euclidean distance to an arbitrary point.
The tricky part is that these 10,000 vectors move. Most of the algorithms I have seen assume stationary points and thus can make good use of indexes. I imagine it will be too expensive to rebuild indexes on every timestep.
Below is the pseudo code.
for timestep in range(100000):
    data = get_new_input()
    nn = find_nearest_neighbor(data)
    nn.move_towards(data)
One thing to note is that the vectors only move a little bit on each timestep, about 1%-5%. One non-optimal solution is to rebuild the indexes every ~1000 timesteps. It is OK if the nearest neighbor is approximate. Maybe using each vector's momentum would be useful?
I am wondering what is the best algorithm to use in this scenario?
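As a rough illustration of the periodic-rebuild workaround mentioned above (not necessarily the best algorithm), assuming vectors is an (N, 64) NumPy array, and that rebuild_every and the 5% update rule are placeholders:
import numpy as np
from scipy.spatial import cKDTree

rebuild_every = 1000                                # placeholder: tune to taste
tree = cKDTree(vectors)
for timestep in range(100000):
    data = get_new_input()
    if timestep % rebuild_every == 0:
        tree = cKDTree(vectors)                     # periodic rebuild; in-between queries are slightly stale
    dist, idx = tree.query(data)                    # approximate nearest neighbor under the stale index
    vectors[idx] += 0.05 * (data - vectors[idx])    # move the matched vector ~5% toward the input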
I have a very large data set comprised of (x,y) coordinates. I need to know which of these points are in certain regions of the 2D space. These regions are bounded by 4 lines in the 2D domain (some of the sides are slightly curved).
For smaller datasets I have used a cumbersome for loop to test each individual point for membership of each region. This doesn't seem like a good option any more due to the size of data set.
Is there a better way to do this?
For example:
If I have a set of points:
(0,1)
(1,2)
(3,7)
(1,4)
(7,5)
and a region bounded by the lines:
y=2
y=5
y=5*sqrt(x) +1
x=2
I want to find a way to identify the point (or points) in that region.
Thanks.
The exact code is on another computer but from memory it was something like:
from math import sqrt

point_list = []
for i in range(num_po):
    a = 5*sqrt(points[i,0]) + 1
    b = 2
    c = 2
    d = 5
    if (points[i,1] < a) and (points[i,0] < b) and (points[i,1] > c) and (points[i,1] < d):
        point_list.append(points[i])
This isn't the exact code but should give an idea of what I've tried.
If you have a single (or small number) of regions, then it is going to be hard to do much better than to check every point. The check per point can be fast, particularly if you choose the fastest or most discriminating check first (eg in your example, perhaps, x > 2).
If you have many regions, then speed can be gained by using a spatial index (perhaps an R-Tree), which rapidly identifies a small set of candidates that are in the right area. Then each candidate is checked one by one, much as you are checking already. You could choose to index either the points or the regions.
I use the python Rtree package for spatial indexing and find it very effective.
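For instance, a hedged sketch of the region-indexing variant with the Rtree package, where regions, region.bounding_box, and region.contains are hypothetical stand-ins for however you represent your four-sided regions:
from rtree import index

idx = index.Index()
for i, region in enumerate(regions):
    idx.insert(i, region.bounding_box)             # (xmin, ymin, xmax, ymax) of each region

hits = []
for x, y in points:
    for i in idx.intersection((x, y, x, y)):       # candidate regions whose bounding box contains the point
        if regions[i].contains(x, y):              # exact test against the (possibly curved) sides
            hits.append((x, y, i))
            break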
This is called the range searching problem and is a much-studied problem in computational geometry. The topic is rather involved (with your square root making things nonlinear hence more difficult). Here is a nice blog post about using SciPy to do computational geometry in Python.
Long comment:
You are not telling us the whole story.
If you have this big set of points (say N of them) and one set of these curvilinear quadrilaterals (say M of them) and you need to solve the problem once, you cannot avoid exhaustively testing all points against the acceptance area.
Anyway, you can probably preprocess the M regions in such a way that testing a point against the acceptance area takes less than M operations (closer to Log(M)). But due to the small value of M, big savings are unlikely.
Now if you don't just have one acceptance area but many of them to be applied in turn on the same point set, then more sophisticated solutions are possible (namely range searching), that can trade N comparisons to about Log(N) of them, a quite significant improvement.
It may also be that the point set is not completely random and there is some property of the point set that can be exploited.
You should tell us more and show a sample case.
In a Python application I'm developing I have an array of 3D points (of size between 2 and 100000), and I have to find the points that are within a certain distance from each other (say between two values, like 0.1 and 0.2). I need this for a graphic application, and this search should be very fast (~1/10 of a second for a sample of 10000 points).
As a first experiment I tried to use the scipy.spatial.KDTree.query_pairs implementation, and with a sample of 5000 points it takes 5 seconds to return the indices. Do you know any approach that may work for this specific case?
A bit more about the application:
The points represents atom coordinates and the distance search is useful to determine the bonds between atoms. Bonds are not necessarily fixed but may change at each step, such as in the case of hydrogen bonds.
Great question! Here is my suggestion:
Divide each coordinate by your "epsilon" value of 0.1/0.2/whatever and round the result to an integer. This creates a "quotient space" of points where distance no longer needs to be determined using the distance formula, but simply by comparing the integer coordinates of each point. If all coordinates are the same, then the original points were within approximately the square root of three times epsilon from each other (for example). This process is O(n) and should take 0.001 seconds or less.
(Note: you would want to augment the original point with the three additional integers that result from this division and rounding, so that you don't lose the exact coordinates.)
Sort the points in numeric order using dictionary-style rules, considering the three integers in the coordinates as letters in words. This process is O(n * log(n)) and should certainly take less than your 1/10th-of-a-second requirement.
Now you simply proceed through this sorted list and compare each point's integer coordinates with the previous and following points. If all coordinates match, then both of the matching points can be moved into your "keep" list of points, and all the others can be marked as "throw away." This is an O(n) process which should take very little time.
The result will be a subset of all the original points, which contains only those points that could be possibly involved in any bond, with a bond being defined as approximately epsilon or less apart from some other point in your original set.
This process is not mathematically exact, but I think it is definitely fast and suited for your purpose.
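A compact sketch of that rounding-and-sorting recipe, assuming points is an (N, 3) NumPy array and eps is your upper distance; it is approximate in exactly the way described above:
import numpy as np

eps = 0.2
cells = np.round(points / eps).astype(int)       # integer "quotient space" coordinates
order = np.lexsort(cells.T[::-1])                # dictionary-style sort on (x, y, z)
sorted_cells = cells[order]

same_as_next = np.all(sorted_cells[:-1] == sorted_cells[1:], axis=1)
keep = np.zeros(len(points), dtype=bool)
keep[order[:-1]] |= same_as_next                 # points sharing a cell with the next sorted point...
keep[order[1:]] |= same_as_next                  # ...and that next point are both kept
candidates = points[keep]                        # subset that could possibly be involved in a bond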
The first thing that comes to my mind is:
If we calculate the distance between every two atoms in the set, it takes O(N^2) operations. That is very slow.
What about introducing a static orthogonal grid with some cell size (for example, close to the distance you are interested in) and then determining which atoms belong to each cell of the grid (this takes O(N) operations)? After this procedure you can greatly reduce the time spent searching for neighbors.
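A minimal sketch of that grid idea, assuming points is an (N, 3) NumPy array of atom coordinates and the bond distance of interest lies between r_min and r_max:
from collections import defaultdict
import numpy as np

def neighbor_pairs(points, r_min=0.1, r_max=0.2):
    # Bin atoms into cubic cells of side r_max, then compare only atoms
    # in the same or adjacent cells instead of all O(N^2) pairs.
    keys = np.floor(points / r_max).astype(int)
    cells = defaultdict(list)
    for i, k in enumerate(map(tuple, keys)):
        cells[k].append(i)                          # O(N) binning
    offsets = [(dx, dy, dz) for dx in (-1, 0, 1)
                            for dy in (-1, 0, 1)
                            for dz in (-1, 0, 1)]
    pairs = []
    for i, (kx, ky, kz) in enumerate(map(tuple, keys)):
        for dx, dy, dz in offsets:
            for j in cells.get((kx + dx, ky + dy, kz + dz), ()):
                if j > i:                           # consider each pair once
                    d = np.linalg.norm(points[i] - points[j])
                    if r_min <= d <= r_max:
                        pairs.append((i, j))
    return pairs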