Finding which points are in a 2D region - python

I have a very large data set composed of (x, y) coordinates. I need to know which of these points lie in certain regions of the 2D space. Each region is bounded by four lines in the 2D domain (some of the sides are slightly curved).
For smaller datasets I have used a cumbersome for loop to test each individual point for membership in each region. This no longer seems like a good option, given the size of the data set.
Is there a better way to do this?
For example:
If I have a set of points:
(0,1)
(1,2)
(3,7)
(1,4)
(7,5)
and a region bounded by the lines:
y=2
y=5
y=5*sqrt(x) +1
x=2
I want to find a way to identify the point (or points) in that region.
Thanks.
The exact code is on another computer but from memory it was something like:
from math import sqrt

point_list = []
for i in range(len(points)):        # points is assumed to be an (N, 2) NumPy array
    a = 5 * sqrt(points[i, 0]) + 1  # curved boundary: y = 5*sqrt(x) + 1
    b = 2                           # boundary: x = 2
    c = 2                           # boundary: y = 2
    d = 5                           # boundary: y = 5
    if points[i, 1] < a and points[i, 0] < b and points[i, 1] > c and points[i, 1] < d:
        point_list.append(points[i])
This isn't the exact code but should give an idea of what I've tried.

If you have a single region (or a small number of them), then it is going to be hard to do much better than checking every point. The check per point can be fast, particularly if you run the cheapest or most discriminating test first (e.g. in your example, perhaps the x < 2 test).
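For a single region, the per-point check can also be vectorized with NumPy so the Python-level loop disappears entirely. A minimal sketch, assuming points is an (N, 2) array and using the example boundaries from the question:

import numpy as np

points = np.array([(0, 1), (1, 2), (3, 7), (1, 4), (7, 5)], dtype=float)
x, y = points[:, 0], points[:, 1]

# Evaluate all four boundary tests at once over the whole array
mask = (y > 2) & (y < 5) & (x < 2) & (y < 5 * np.sqrt(x) + 1)
inside = points[mask]
print(inside)  # [[1. 4.]]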
If you have many regions, then speed can be gained by using a spatial index (perhaps an R-Tree), which rapidly identifies a small set of candidates that are in the right area. Then each candidate is checked one by one, much as you are checking already. You could choose to index either the points or the regions.
I use the Python Rtree package for spatial indexing and find it very effective.
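A hedged sketch of the candidate-then-check pattern with the Rtree package: index the points as degenerate boxes, query a region's bounding box, then apply the exact test only to the few candidates. The region and its bounding box below are taken from the question's example.

from math import sqrt
from rtree import index

points = [(0, 1), (1, 2), (3, 7), (1, 4), (7, 5)]

# Index each point as a degenerate rectangle (left, bottom, right, top)
idx = index.Index()
for i, (px, py) in enumerate(points):
    idx.insert(i, (px, py, px, py))

def in_region(px, py):
    # Exact membership test for the example region
    return 2 < py < 5 and px < 2 and py < 5 * sqrt(px) + 1

# Cheap bounding-box query first, exact check on the candidates only
candidates = idx.intersection((0, 2, 2, 5))
matches = [points[i] for i in candidates if in_region(*points[i])]
print(matches)  # [(1, 4)]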

This is called the range searching problem, a much-studied problem in computational geometry. The topic is rather involved (your square root makes the boundary nonlinear, hence more difficult). Here is a nice blog post about using SciPy to do computational geometry in Python.

Long comment:
You are not telling us the whole story.
If you have this big set of points (say N of them) and a set of these curvilinear quadrilaterals (say M of them), and you need to solve the problem once, you cannot avoid exhaustively testing all points against the acceptance area.
Still, you can probably preprocess the M regions in such a way that testing a point against the acceptance area takes fewer than M operations (closer to log(M)). But given a small value of M, big savings are unlikely.
Now if you don't just have one acceptance area but many of them to be applied in turn to the same point set, then more sophisticated solutions are possible (namely range searching) that can trade N comparisons for about log(N) of them, quite a significant improvement.
It may also be that the point set is not completely random and there is some property of the point set that can be exploited.
You should tell us more and show a sample case.

Related

Pointcloud of rope with desired start and end point
I have a pointcloud of a rope-like object with about 300 points. I'd like to sort the 3D coordinates of that pointcloud so that one end of the rope has index 0 and the other end has index 300, as shown in the image. Other pointclouds of that object might be U-shaped, so I can't sort by the X, Y, or Z coordinate. Because of that, I also can't sort by the distance to a single point.
I have looked at KDTree from sklearn or scipy to compute the nearest neighbour of each point, but I don't know how to go on from there and sort the points into an array without getting duplicate entries.
Is there a way to sort these coordinates in an array, so that from a starting point the array gets appended with the coordinates of the next closest point?
First of all, obviously, there is no strict solution to this problem (there is not even a strict definition of what you want to get). So anything you write will be a heuristic of some sort, which will fail in some cases, especially as your point cloud takes some non-trivial form (do you allow loops in your rope, for example?).
This said, a simple approach may be to build a graph with the points being the vertices, and every two points connected by an edge with a weight equal to the straight-line distance between these two points.
And then build a minimal spanning tree of this graph. This will provide a kind of skeleton for your point cloud, and you can devise any simple algorithm atop this skeleton.
For example, sort all points by their distance to the start of the rope, measured along this tree. There is only one path between any two vertices of the tree, so for each vertex of the tree calculate the length of the single path to the rope start, and sort all the vertices by this distance.
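A minimal sketch of this MST idea with SciPy. The random cloud and the start index are placeholder assumptions; in practice you would pick the start as one end of the rope (e.g. a vertex of maximal along-tree distance).

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

rng = np.random.default_rng(0)
points = rng.random((300, 3))  # stand-in for the rope point cloud
start_idx = 0                  # assumed known end of the rope

# Complete graph: edge weight = straight-line distance between two points
dist_matrix = squareform(pdist(points))

# Minimum spanning tree of that graph (sparse matrix of the kept edges)
mst = minimum_spanning_tree(dist_matrix)

# Length of the unique tree path from the start to every other vertex
tree_dists = shortest_path(mst, directed=False, indices=start_idx)

# Order the points by along-tree distance from the start
order = np.argsort(tree_dists)
sorted_points = points[order]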
As suggested in the other answer, there is no strict solution to this problem, and there can be edge cases such as loops, spirals, or tubes, but you can go with heuristic approaches for your use case. Read about heuristic approaches such as hill climbing, simulated annealing, genetic algorithms, etc.
For any heuristic approach you need a method to measure how good a solution is: if I give you two arrays of 3000 elements, how will you identify which solution is better than the other? That method depends on your use case.
One approach off the top of my mind: hill climbing.
Method to measure the goodness of a solution: take the Euclidean distance between all adjacent elements of the array and sum those distances.
Steps:
Create a randomized array of all the 3000 elements.
Select two random indices out of these 3000 and swap the elements at those indices; see whether this improves your answer (i.e. whether the sum of Euclidean distances between adjacent elements decreases).
If it improves your answer, keep those elements swapped.
Repeat steps 2/3 for a large number of epochs (e.g. 10^6).
This solution will stagnate because it lacks diversity. For better results use simulated annealing or genetic algorithms.
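A hedged sketch of that hill climb in NumPy. The score is recomputed from scratch each step for clarity (so the epoch count here is reduced from the suggested 10^6); incrementally updating only the distances affected by the swap would be much faster.

import numpy as np

def path_length(points, order):
    # Sum of Euclidean distances between consecutive points in the ordering
    return np.linalg.norm(np.diff(points[order], axis=0), axis=1).sum()

def hill_climb(points, epochs=100_000, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(points))
    best = path_length(points, order)
    for _ in range(epochs):
        i, j = rng.integers(len(points), size=2)
        order[i], order[j] = order[j], order[i]      # try a random swap
        candidate = path_length(points, order)
        if candidate < best:
            best = candidate                         # keep the improvement
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
    return order

cloud = np.random.default_rng(0).random((300, 3))  # stand-in point cloud
print(path_length(cloud, hill_climb(cloud)))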

Approximate maximum of an unknown curve

I have a data set that looks like this: [plot of a noisy signal with three peaks; image not reproduced here]
I used the scipy.signal.find_peaks function to determine the peaks of the data set, and it works well enough, but since this function finds the local maxima of the data, it does not ignore the noise, which causes overshoot. So what I'm determining isn't actually the location of the most likely maximum, but rather the location of an 'outlier'.
Is there another, more exact way to approximate the local maxima?
I'm not sure that you can consider those points to be outliers so easily, as they look to be close to where I would expect them to be. But if you don't think they are a valid approximation, let me suggest three other ways you can use.
First option
I would construct a physical model of these peaks (a mathematical formula) and do a fitting analysis around the peaks. You can for instance, suppose that the shape of the plot is the sum of some background model (maybe constant or maybe more complicated) plus some Gaussian peaks (or Lorentzian).
This is what we usually do in physics. Of course it will be more accurate taking knowledge from the underlying processes, which I don't have.
If you have a good model, this approach is definitely better than taking the maximum values, since even if those values are not outliers, they still carry errors which you want to reduce.
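A hedged sketch of the first option with scipy.optimize.curve_fit, using synthetic data and a constant-background-plus-Gaussian model (both assumptions for illustration); the fitted mean gives the peak location, with an uncertainty read off the covariance matrix:

import numpy as np
from scipy.optimize import curve_fit

def model(x, bg, amp, mu, sigma):
    # Constant background plus a single Gaussian peak
    return bg + amp * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Synthetic noisy data standing in for the real measurements
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = model(x, 1.0, 5.0, 4.2, 0.6) + rng.normal(0, 0.3, x.size)

# Initial guesses taken from the raw data (e.g. from find_peaks output)
p0 = [y.min(), y.max() - y.min(), x[np.argmax(y)], 1.0]
popt, pcov = curve_fit(model, x, y, p0=p0)
mu, mu_err = popt[2], np.sqrt(pcov[2, 2])
print(f"fitted peak location: {mu:.3f} +/- {mu_err:.3f}")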
Second option
But if you want an easier way, just a rough estimate, and you have already found the locations of the three peaks programmatically, you can average a few points around each maximum, as sketched below. If you do so, the functions np.where or np.argwhere tend to be useful for this kind of thing.
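A quick sketch of that second option: average a small window of samples around each peak index returned by find_peaks (here as an intensity-weighted centroid, one reasonable choice; the window half-width and synthetic data are assumptions):

import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 500)
y = np.exp(-((x - 4.2) ** 2) / 0.5) + rng.normal(0, 0.05, x.size)

peaks, _ = find_peaks(y, height=0.5, distance=50)
half_width = 10  # samples to average on each side (arbitrary)
for p in peaks:
    window = slice(max(p - half_width, 0), p + half_width + 1)
    # Weighted average of x over the window smooths out the noise spikes
    print(np.average(x[window], weights=y[window]))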
Third option
The easiest option is reading the value off by hand. That may sound unacceptable for academic purposes, and it probably is. Even worse, it is not a programmatic way, and this is SO. But in the end, it depends on why and for what you need those values, and on the confidence interval you need for your measurement.

Finding points in space closer than a certain value

In a Python application I'm developing, I have an array of 3D points (of size between 2 and 100000), and I have to find the points that are within a certain distance from each other (say between two values, like 0.1 and 0.2). I need this for a graphics application, and this search should be very fast (~1/10 of a second for a sample of 10000 points).
As a first experiment I tried the scipy.spatial.KDTree.query_pairs implementation; with a sample of 5000 points it takes 5 seconds to return the indices. Do you know of any approach that may work for this specific case?
A bit more about the application:
The points represents atom coordinates and the distance search is useful to determine the bonds between atoms. Bonds are not necessarily fixed but may change at each step, such as in the case of hydrogen bonds.
Great question! Here is my suggestion:
Divide each coordinate by your "epsilon" value of 0.1/0.2/whatever and round the result to an integer. This creates a "quotient space" of points where distance no longer needs to be determined using the distance formula, but simply by comparing the integer coordinates of each point. If all coordinates are the same, then the original points were within approximately the square root of three times epsilon from each other (for example). This process is O(n) and should take 0.001 seconds or less.
(Note: you would want to augment the original point with the three additional integers that result from this division and rounding, so that you don't lose the exact coordinates.)
Sort the points in numeric order using dictionary-style rules, treating the three integer coordinates as letters in words. This process is O(n log n) and should certainly take less than your 1/10th-of-a-second requirement.
Now simply proceed through this sorted list and compare each point's integer coordinates with those of the previous and following points. If all coordinates match, then both of the matching points can be moved into your "keep" list of points, and all the others can be marked "throw away." This is an O(n) process which should take very little time.
The result will be a subset of all the original points, which contains only those points that could be possibly involved in any bond, with a bond being defined as approximately epsilon or less apart from some other point in your original set.
This process is not mathematically exact, but I think it is definitely fast and suited for your purpose.
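A hedged sketch of this quantize-sort-compare scheme with NumPy (floor is used for the quantization, one reasonable choice; eps and the random coordinates are assumptions):

import numpy as np

rng = np.random.default_rng(0)
points = rng.random((10000, 3))  # stand-in atom coordinates
eps = 0.2

# Quantize each coordinate to an integer cell index
cells = np.floor(points / eps).astype(np.int64)

# Lexicographic sort on (x, y, z) cell indices; lexsort's last key is primary
order = np.lexsort((cells[:, 2], cells[:, 1], cells[:, 0]))
sorted_cells = cells[order]

# Adjacent rows with identical cells are within ~sqrt(3)*eps of each other
same_as_next = np.all(sorted_cells[:-1] == sorted_cells[1:], axis=1)
keep = np.zeros(len(points), dtype=bool)
keep[order[:-1][same_as_next]] = True
keep[order[1:][same_as_next]] = True
candidates = points[keep]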
The first thing that comes to my mind:
If we calculate the distance between every two atoms in the set, it takes O(N^2) operations. That is very slow.
What about introducing a static orthogonal grid with some cell size (for example, close to the distance you are interested in) and then determining which atoms belong to each cell of the grid (this takes O(N) operations)? After this procedure you can reduce the time spent searching for neighbors; a sketch follows.
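A minimal sketch of that grid idea in plain Python (the cell size, radii, and random coordinates are assumptions); each atom is compared only against atoms in its own and the 26 adjacent cells:

import numpy as np
from collections import defaultdict
from itertools import product

rng = np.random.default_rng(0)
points = rng.random((5000, 3))  # stand-in atom coordinates
r_min, r_max = 0.1, 0.2

# Bucket atoms into cubic cells of side r_max: O(N)
grid = defaultdict(list)
for i, p in enumerate(points):
    grid[tuple((p // r_max).astype(int))].append(i)

pairs = []
for cell, members in grid.items():
    # Gather atoms from this cell and the 26 neighboring cells
    nearby = []
    for offset in product((-1, 0, 1), repeat=3):
        nearby.extend(grid.get(tuple(c + o for c, o in zip(cell, offset)), []))
    for i in members:
        for j in nearby:
            if i < j:  # each pair is considered once
                d = np.linalg.norm(points[i] - points[j])
                if r_min <= d <= r_max:
                    pairs.append((i, j))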

Fastest way to approximately compare values in large numpy arrays?

I have two arrays: array A with ~1M lines and array B with ~400K lines. Each contains, among other things, the coordinates of a point. For each point in array A, I need to find how many points in array B are within a certain distance of it. How do I avoid naively comparing everything to everything? Based on its speed at the start, running naively would take 10+ days on my machine. That approach required nested loops, and the arrays are too large to construct a distance matrix (400G entries!).
I thought the way to go would be to check only a limited set of B coordinates against each A coordinate. However, I haven't found an easy way of doing that. That is, what's the easiest/quickest way to make a selection that doesn't require checking all the values in B (which is exactly the task I'm trying to avoid)?
EDIT: I should've mentioned these aren't 2D (or nD) Cartesian, but spherical surface (lat/long), and distance is great-circle distance.
I cannot give a full answer right now, but some hints to get you started. It will be much more efficient to organise the points in B in a k-d tree. You can use the class scipy.spatial.KDTree to do this easily, and use its query_ball_point() method (or query() with a distance_upper_bound) to request the points within a given distance.
Here is one possible implementation of a cross-match between lists of points on the sphere using a k-d tree.
http://code.google.com/p/astrolibpy/source/browse/my_utils/match_lists.py
Another way is to use the healpy module and its get_neighbors method.
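A hedged sketch of the k-d tree route for lat/long data: convert to 3D unit vectors so that the Euclidean (chord) distance cKDTree uses maps monotonically onto great-circle distance. The 1-degree threshold and random data are assumptions for illustration.

import numpy as np
from scipy.spatial import cKDTree

def to_unit_vectors(lat_deg, lon_deg):
    # Lat/long in degrees -> points on the unit sphere
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

rng = np.random.default_rng(0)
a = to_unit_vectors(rng.uniform(-90, 90, 100_000), rng.uniform(-180, 180, 100_000))
b = to_unit_vectors(rng.uniform(-90, 90, 40_000), rng.uniform(-180, 180, 40_000))

# A great-circle angle theta corresponds to a chord of length 2*sin(theta/2)
max_angle = np.radians(1.0)  # example threshold
chord = 2 * np.sin(max_angle / 2)

tree = cKDTree(b)
# Count of B points within the threshold of each A point
# (return_length needs a reasonably recent SciPy)
counts = tree.query_ball_point(a, r=chord, return_length=True)
print(counts[:10])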

Find all coordinates within a circle in geographic data in python

I've got millions of geographic points. For each one of these, I want to find all "neighboring points," i.e., all other points within some radius, say a few hundred meters.
There is a naive O(N^2) solution to this problem---simply calculate the distance of all pairs of points. However, because I'm dealing with a proper distance metric (geographic distance), there should be a quicker way to do this.
I would like to do this within Python. One solution that comes to mind is to use some database (MySQL with GIS extensions, PostGIS) and hope that such a database would take care of efficiently performing the operation described above using some index. I would prefer something simpler, though, that doesn't require me to build and learn about such technologies.
A couple of points
I will perform the "find neighbors" operation millions of times
The data will remain static
Because the problem is, in a sense, simple, I'd like to see the Python code that solves it.
Put in terms of Python code, I want something along the lines of:
points = [(lat1, long1), (lat2, long2) ... ]  # this list contains millions of lat/long tuples
points_index = magical_indexer(points)
neighbors = []
for point in points:
    point_neighbors = points_index.get_points_within(point, 200)  # all points within 200 meters of point
    neighbors.append(point_neighbors)
scipy
First things first: there are preexisting algorithms to do this kind of thing, such as the k-d tree. SciPy has a Python implementation, cKDTree, that can find all points in a given range.
Binary Search
Depending on what you're doing however, implementing something like that may be nontrivial. Furthermore, creating a tree is fairly complex (potentially quite a bit of overhead), and you may be able to get away with a simple hack I've used before:
Compute the PCA of the dataset. You want to rotate the dataset such that the most significant direction is first, and the orthogonal (less large) second direction is, well, second. You can skip this and just choose X or Y, but it's computationally cheap and usually easy to implement. If you do just choose X or Y, choose the direction with greater variance.
Sort the points by the major direction (call this direction X).
To find the nearest neighbor of a given point, find the index of the point nearest in X by binary search (if the point is already in your collection, you may already know this index and don't need the search). Iteratively look to the next and previous points, maintaining the best match so far and its distance from your search point. You can stop looking when the difference in X is greater than or equal to the distance to the best match so far (in practice, usually very few points).
To find all points within a given range, do the same as step 3, except don't stop until the difference in X exceeds the range.
Effectively, you're doing O(N log N) preprocessing, and for each point roughly O(sqrt(N)) work, or more if the distribution of your points is poor. If the points are roughly uniformly distributed, the number of points nearer in X than the nearest neighbor will be on the order of the square root of N. It's less efficient if many points are within your range, but never much worse than brute force.
One advantage of this method is that it's all executable with very few memory allocations, and can mostly be done with very good memory locality, which means that it performs quite well despite the obvious limitations.
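A minimal sketch of steps 2-4 of this scheme (skipping the PCA rotation and just sorting on raw X; the points and radius are assumptions):

import numpy as np

def neighbors_within(points, order, query, radius):
    # points sorted once by X via `order`; scan outward from the binary-search position
    xs = points[order, 0]
    start = np.searchsorted(xs, query[0])
    found = []
    for scan in (range(start, len(order)), range(start - 1, -1, -1)):
        for j in scan:
            if abs(xs[j] - query[0]) > radius:
                break  # the X gap alone already exceeds the radius
            if np.linalg.norm(points[order[j]] - query) <= radius:
                found.append(order[j])
    return found

rng = np.random.default_rng(0)
pts = rng.random((100_000, 2))
order = np.argsort(pts[:, 0])  # the O(N log N) preprocessing step
print(neighbors_within(pts, order, np.array([0.5, 0.5]), 0.01))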
Delaunay triangulation
Another idea: a Delaunay triangulation could work. For the Delaunay triangulation, it's given that any point's nearest neighbor is an adjacent node. The intuition is that during a search you can maintain a heap (priority queue) based on absolute distance from the query point. Pick the nearest point, check that it's in range, and if so add all its neighbors. I suspect that it's impossible to miss any points this way, but you'd need to look at it more carefully to be sure...
Tipped off by Eamon, I've come up with a simple solution using the kd-tree implemented in SciPy.

from scipy.spatial import cKDTree
from numpy import inf  # scipy.inf has been removed from recent SciPy releases

max_distance = 0.0001  # assuming lats and longs are in decimal degrees, this corresponds to 11.1 meters

points = [(lat1, long1), (lat2, long2) ... ]
tree = cKDTree(points)

point_neighbors_list = []  # put the neighbors of each point here
for point in points:
    # k = len(points) allows every point to match; distance_upper_bound prunes the search
    distances, indices = tree.query(point, len(points), p=2, distance_upper_bound=max_distance)
    point_neighbors = []
    for index, distance in zip(indices, distances):
        if distance == inf:
            break
        point_neighbors.append(points[index])
    point_neighbors_list.append(point_neighbors)
