Fastest way to approximately compare values in large numpy arrays?

I have two arrays, array A with ~1M lines and array B with ~400K lines. Each contains, among other things, coordinates of a point. For each point in array A, I need to find how many points in array B are within a certain distance of it. How do I avoid naively comparing everything to everything? Based on its speed at the start, running naively would take 10+ days on my machine. That required nested loops, but the arrays are too large to construct a distance matrix (400G entries!)
I thought the way would be to check only a limited set of B coordinates against each A coordinates. However, I haven't determined an easy way of doing that. That is, what's the easiest/quickest way to make a selection that doesn't require checking all the values in B (which is exactly the same task I'm trying to avoid)?
EDIT: I should've mentioned these aren't 2D (or nD) Cartesian, but spherical surface (lat/long), and distance is great-circle distance.

I cannot give a full answer right now, but here are some hints to get you started. It will be much more efficient to organise the points in B in a k-d tree. You can use the class scipy.spatial.KDTree to do this easily, and its query_ball_point() method returns the points within a given distance (query() returns the nearest neighbours instead).
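Because the question's coordinates are lat/long and the distance is great-circle, a k-d tree cannot be applied to the raw angles directly. One common workaround is to convert each point to a 3D unit vector and to convert the angular radius into a straight-line (chord) length. The following is a minimal sketch under those assumptions: the function names are hypothetical, inputs are in degrees, the search radius is given as an angle (great-circle distance divided by the sphere's radius), and scipy.spatial.cKDTree is used as the tree.

```python
import numpy as np
from scipy.spatial import cKDTree


def lonlat_to_unit_xyz(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to 3D unit vectors."""
    lon = np.radians(lon_deg)
    lat = np.radians(lat_deg)
    cos_lat = np.cos(lat)
    return np.column_stack((cos_lat * np.cos(lon),
                            cos_lat * np.sin(lon),
                            np.sin(lat)))


def count_within_angle(lon_a, lat_a, lon_b, lat_b, max_angle_deg):
    """For each point in A, count the points in B within max_angle_deg."""
    # Two unit vectors separated by an angle t are a chord 2*sin(t/2) apart,
    # so a great-circle radius maps to an ordinary Euclidean radius.
    chord = 2.0 * np.sin(np.radians(max_angle_deg) / 2.0)
    tree_b = cKDTree(lonlat_to_unit_xyz(lon_b, lat_b))
    neighbours = tree_b.query_ball_point(lonlat_to_unit_xyz(lon_a, lat_a),
                                         r=chord)
    # query_ball_point returns one list of B indices per A point.
    return np.array([len(found) for found in neighbours])
```

Building the tree on B once and querying it with all of A avoids the full distance matrix; for ~1M query points it may still be worth processing A in chunks to limit memory used by the index lists.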

Here is one possible implementation of a cross-match between lists of points on the sphere using a k-d tree.
http://code.google.com/p/astrolibpy/source/browse/my_utils/match_lists.py
Another way is to use the healpy module and its get_neighbors method.

Related

Sort coordinates of pointcloud by distance to previous point

[Image: pointcloud of rope with desired start and end point]
I have a pointcloud of a rope-like object with about 300 points. I'd like to sort the 3D coordinates of that pointcloud, so that one end of the rope has index 0 and the other end has index 300 like shown in the image. Other pointclouds of that object might be U-shaped so I can't sort by X,Y or Z coordinate. Because of that I also can't sort by the distance to a single point.
I have looked at KDTree by sklearn or scipy to compute the nearest neighbour of each point but I don't know how to go from there and sort the points in an array without getting double entries.
Is there a way to sort these coordinates in an array, so that from a starting point the array gets appended with the coordinates of the next closest point?

First of all, there is obviously no strict solution to this problem (there is not even a strict definition of what you want to get). So anything you write will be a heuristic of some sort, which will fail in some cases, especially as your point cloud takes on a non-trivial form (do you allow loops in your rope, for example?).
That said, a simple approach may be to build a graph with the points as vertices, and every two points connected by an edge with a weight equal to the straight-line distance between them.
Then build a minimum spanning tree of this graph. This will provide a kind of skeleton for your point cloud, and you can devise any simple algorithm on top of this skeleton.
For example, sort all points by their distance to the start of the rope, measured along this tree. There is only one path between any two vertices of the tree, so for each vertex of the tree calculate the length of the single path to the rope start, and sort all the vertices by this distance.
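The spanning-tree idea above can be sketched with SciPy's graph routines. This is only an illustration under a few assumptions: the function name order_along_rope is hypothetical, the index of the starting point is already known, the cloud is small enough (a few hundred points) for a dense distance matrix, and no two points coincide (the MST routine treats zero entries as missing edges).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, dijkstra


def order_along_rope(points, start_index):
    """Return point indices sorted by distance from the start along the MST."""
    # Complete graph: edge weight = straight-line distance between two points.
    dist = squareform(pdist(points))
    # Skeleton of the cloud; returned as a sparse matrix of tree edges.
    mst = minimum_spanning_tree(dist)
    # Path length from the start to every vertex, following tree edges only.
    along_tree = dijkstra(mst, directed=False, indices=start_index)
    return np.argsort(along_tree)
```

Calling points[order_along_rope(points, start_index)] then gives the coordinates in rope order. Side branches of the tree (caused by noise or rope thickness) are interleaved by distance rather than removed, so this remains a heuristic.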

As suggested in the other answer, there is no strict solution to this problem, and there can be edge cases such as loops, spirals, or tubes, but you can use a heuristic approach for your use case. Read about heuristic approaches such as hill climbing, simulated annealing, and genetic algorithms.
For any heuristic approach you need a way to measure how good a solution is. Say I give you two orderings of the same 300 points: how would you decide which one is better? This measure depends on your use case.
One approach off the top of my head is hill climbing.
Measure of the goodness of a solution: take the Euclidean distance between each pair of adjacent elements of the array and sum these distances (lower is better).
Steps:
1. Create a randomised array of all 300 elements.
2. Select two random indices and swap the elements at those indices, then check whether this improves the answer (that is, whether the sum of Euclidean distances between adjacent elements decreases).
3. If it improves the answer, keep the elements swapped; otherwise swap them back.
4. Repeat steps 2 and 3 for a large number of iterations (about 10^6).
This approach tends to stagnate in a local optimum because it lacks diversity. For better results use simulated annealing or genetic algorithms.
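A rough sketch of the hill-climbing steps described in this answer is given below. The function names, the iteration count, and the fixed seed are illustrative assumptions, and the whole path length is recomputed after every swap for clarity rather than speed.

```python
import numpy as np


def path_length(points, order):
    """Sum of Euclidean distances between consecutive points in the ordering."""
    steps = np.diff(points[order], axis=0)
    return np.linalg.norm(steps, axis=1).sum()


def hill_climb_order(points, n_iter=20000, seed=0):
    """Random-swap hill climbing; returns (ordering, its path length)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(points))      # start from a random ordering
    best = path_length(points, order)
    for _ in range(n_iter):
        i, j = rng.choice(len(points), size=2, replace=False)
        order[i], order[j] = order[j], order[i]      # try a swap
        candidate = path_length(points, order)
        if candidate < best:
            best = candidate                         # keep the improvement
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
    return order, best
```

The result is a permutation of indices into points. Nothing here fixes which end of the rope comes first, so the ordering may need to be reversed afterwards.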

Finding which points are in a 2D region

I have a very large data set comprised of (x,y) coordinates. I need to know which of these points are in certain regions of the 2D space. These regions are bounded by 4 lines in the 2D domain (some of the sides are slightly curved).
For smaller datasets I have used a cumbersome for loop to test each individual point for membership of each region. This doesn't seem like a good option any more due to the size of data set.
Is there a better way to do this?
For example:
If I have a set of points:
(0,1)
(1,2)
(3,7)
(1,4)
(7,5)
and a region bounded by the lines:
y=2
y=5
y=5*sqrt(x) +1
x=2
I want to find a way to identify the point (or points) in that region.
Thanks.
The exact code is on another computer but from memory it was something like:
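from math import sqrt

point_list = []
for i in range(num_po):
    a = 5 * sqrt(points[i, 0]) + 1  # curved boundary y = 5*sqrt(x) + 1
    b = 2  # boundary x = 2
    c = 2  # boundary y = 2
    d = 5  # boundary y = 5
    if (points[i, 1] < a) and (points[i, 0] < b) and (points[i, 1] > c) and (points[i, 1] < d):
        point_list.append(points[i])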
This isn't the exact code but should give an idea of what I've tried.

If you have a single region (or a small number of regions), then it is going to be hard to do much better than to check every point. The check per point can be fast, particularly if you put the cheapest or most discriminating check first (e.g. in your example, perhaps rejecting points with x > 2).
If you have many regions, then speed can be gained by using a spatial index (perhaps an R-Tree), which rapidly identifies a small set of candidates that are in the right area. Then each candidate is checked one by one, much as you are checking already. You could choose to index either the points or the regions.
I use the python Rtree package for spatial indexing and find it very effective.
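For the single-region case, the per-point check can also be done for all points at once with NumPy boolean masks instead of a Python loop. A small sketch using the example points and the four boundaries from the question (the variable names are my own, and the inequalities follow the asker's loop):

```python
import numpy as np

# Example points from the question, one (x, y) pair per row.
points = np.array([[0, 1], [1, 2], [3, 7], [1, 4], [7, 5]], dtype=float)
x, y = points[:, 0], points[:, 1]

# Each comparison yields a boolean array; & combines them element-wise.
inside = (y < 5 * np.sqrt(x) + 1) & (x < 2) & (y > 2) & (y < 5)

points_in_region = points[inside]
print(points_in_region)  # [[1. 4.]]
```

If the data can contain negative x values, np.sqrt produces NaN for them (with a warning), and those points fail the first comparison.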

This is called the range searching problem and is a much-studied problem in computational geometry. The topic is rather involved (with your square root making things nonlinear hence more difficult). Here is a nice blog post about using SciPy to do computational geometry in Python.

Long comment:
You are not telling us the whole story.
If you have this big set of points (say N of them) and one set of these curvilinear quadrilaterals (say M of them) and you need to solve the problem once, you cannot avoid exhaustively testing all points against the acceptance area.
Anyway, you can probably preprocess the M regions in such a way that testing a point against the acceptance area takes less than M operations (closer to Log(M)). But due to the small value of M, big savings are unlikely.
Now if you don't just have one acceptance area but many of them to be applied in turn to the same point set, then more sophisticated solutions are possible (namely range searching) that can reduce the N comparisons per region to about Log(N), a quite significant improvement.
It may also be that the point set is not completely random and there is some property of the point set that can be exploited.
You should tell us more and show a sample case.

Numpy ndarray containing objects of variable size (arrays of objects)

Good evening,
I am currently working on a first year university project to simulate continuum percolation. This involves randomly distributing some discs/spheres/hyperspheres across a square/cube/hypercube in n dimensional space and finding a cluster of connected particles that spans the boundaries.
In order to speed up what is essentially collision detection between all these particles to group them up into connected clusters, I have decided to use spatial partitioning so my program scales nicely with the number of particles. This requires me to divide the n dimensional space up with evenly sized boxes/cubes/hypercubes and place particles inside the relevant boxes so that an optimised collision check may be done, which requires fewer comparisons since only particles lying in the boxes/cubes/hypercubes adjacent to the one in which the new particle lies need to be checked. All the detail has been worked out algorithmically.
However, it seemed like a good idea to use an ndarray which has "dimension" equal to that of the space being studied. Then each "point" in the ndarray would itself contain an array of particle objects. It would be easy to look at the objects in the ndarray existing in coordinates around that of the new particle and cycle through the arrays contained in those which would themselves contain the other particles against which the check must be done. I then found out that ndarray can only contain objects of a fixed size, which these arrays of particles are not since they grow as particles are randomly added to the system.
Would a normal numpy array of array of array (etc..) be the only solution, or do structures similar to ndarray but able to accommodate objects of variable size exist? Ndarray seemed great because it is part of numpy, which is written in the compiled language C, so it would be fast. Furthermore, an ndarray would not require any loops to construct, as I believe an array of arrays of arrays (etc...) would (NB: dimensionality of space and the increments of spatial division are not constant, as particles of different radii can be added, meaning a change in the size of the spatial division squares/cubes/hypercubes).
Speed is very important in this program and it would be a shame to see the algorithmically good optimisations I have found be ruined by bad implementation!

Have you considered using a kd-tree instead? kd-trees support fast enumeration of the neighbours of a point by splitting the space (much like you suggested with the multidimensional arrays).
As a nice bonus, there's already a decent kd-tree implementation in SciPy, the companion project to NumPy: scipy.spatial.KDTree.
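As a rough illustration of how this could replace the hand-built grid, the sketch below finds all overlapping pairs with a k-d tree and then groups them into connected clusters. It makes a simplifying assumption that differs from the question: every particle has the same radius, so two particles touch when their centres are at most 2 * radius apart. The function name is hypothetical, and cKDTree is used as the tree.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def cluster_labels(centres, radius):
    """Label each particle with the index of its connected cluster."""
    tree = cKDTree(centres)
    # All index pairs (i, j), i < j, whose centres are within 2 * radius.
    pairs = tree.query_pairs(r=2.0 * radius, output_type='ndarray')
    n = len(centres)
    # Sparse adjacency matrix with one entry per overlapping pair.
    adjacency = coo_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                           shape=(n, n))
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels
```

centres is an (N, d) array, so the same code works in any number of dimensions. Checking whether a cluster spans the box is then a matter of comparing the minimum and maximum coordinates of the particles that share a label.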

Calculate 3D variant for summed area table using numpy cumsum

In case of a 2D array array.cumsum(0).cumsum(1) gives the Integral image of the array.
What happens if I compute array.cumsum(0).cumsum(1).cumsum(2) over a 3D array?
Do I get a 3D extension of Integral Image i.e, Integral volume over the array?
It's hard to visualize what happens in the 3D case.
I have gone through this discussion: 3D variant for summed area table (SAT)
This gives a recursive way on how to compute the Integral volume. What if I use the cumsum along the 3 axes. Will it give me the same thing?
Will it be more efficient than the recursive method?

Yes, the formula you give, array.cumsum(0).cumsum(1).cumsum(2), will work.
What the formula does is compute a few partial sums so that the sum of these sums is the volume sum. That is, every element needs to be summed exactly once, or, in other words, no element can be skipped and no element counted twice. I think going through each of these questions (is any element skipped or counted twice) is a good way to verify to yourself that this will work. And also run a small test:
import numpy as np

x = np.ones((20, 20, 20)).cumsum(0).cumsum(1).cumsum(2)
print(x[2, 6, 10])  # 231.0
print(3 * 7 * 11)   # 231
Of course, with all ones there could be two errors that cancel each other out, but this wouldn't happen everywhere, so it's a reasonable test.
As for efficiency, I would guess that the single pass approach is probably faster, but not by a lot. Also, the above could be sped up using an output array, eg, cumsum(n, out=temp) as otherwise three arrays will be created for this calculation. The best way to know is to test (but only if you need to).
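A slightly stronger check than the all-ones test is to use random values and compare an entry of the cumulative array with a direct sum. The sketch below also shows how the integral volume would typically be used: the sum over any axis-aligned box follows from eight lookups by inclusion-exclusion. The helper box_sum and the half-open box convention are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.integers(0, 10, size=(4, 5, 6))
integral = volume.cumsum(0).cumsum(1).cumsum(2)

# integral[i, j, k] is the sum of volume[:i+1, :j+1, :k+1].
assert integral[2, 3, 4] == volume[:3, :4, :5].sum()


def box_sum(integral, lo, hi):
    """Sum of the original array over the half-open box [lo, hi)."""
    # Pad one zero plane in front of each axis so that padded[a, b, c]
    # equals the sum of volume[:a, :b, :c], including a, b or c equal to 0.
    padded = np.pad(integral, ((1, 0), (1, 0), (1, 0)), mode="constant")
    (x0, y0, z0), (x1, y1, z1) = lo, hi
    return (padded[x1, y1, z1]
            - padded[x0, y1, z1] - padded[x1, y0, z1] - padded[x1, y1, z0]
            + padded[x0, y0, z1] + padded[x0, y1, z0] + padded[x1, y0, z0]
            - padded[x0, y0, z0])


assert box_sum(integral, (1, 1, 2), (3, 4, 5)) == volume[1:3, 1:4, 2:5].sum()
```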

Python - Best way to compare Arrays (SURF Descriptors)

I currently have arrays that look something like this:
[ 5.23324730e-03 1.01221129e-04 5.23324730e-03 ...,]
There are 500 such rows and 64 columns. I would like to compare a row like the one above, to other rows in a similar format. That is, I want to compare the 1st element in one array to the first element in the second array and so on.
The idea is to work out how closely they match... Would anyone have any ideas how I might go about this efficiently? I should note that values may not be identical.... But if I could find values that differ by amounts under a certain threshold, that would be fine.
If anyone is wondering - I'm trying to compare SURF descriptors...
Thanks so much for your help!

You can store the descriptors as a numpy array and then calculate the cosine similarity between rows. This can be done efficiently using the numpy dot product.
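A minimal sketch of that idea, assuming the descriptors are held in two 2D arrays with one descriptor per row (for example 500 x 64 each) and that no row is all zeros; the function name is hypothetical:

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    # Scale each row to unit length, then a single matrix product
    # gives all pairwise dot products at once.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T
```

Entry [i, j] of the result compares row i of a with row j of b; values close to 1 mean the two descriptors point in nearly the same direction.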

The answer depends on your definition of a close match. One common way would be to calculate the Euclidean distance.
How can the euclidean distance be calculated with numpy?
or
Distance between numpy arrays, columnwise
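As a sketch of the Euclidean variant, scipy.spatial.distance.cdist computes the distance between every row of one array and every row of another, after which a threshold picks out the close pairs. The function name and the threshold value are assumptions to be tuned for the descriptors in question.

```python
import numpy as np
from scipy.spatial.distance import cdist


def close_row_pairs(a, b, threshold):
    """Return (i, j) index pairs where row i of a is within threshold of row j of b."""
    distances = cdist(a, b)  # shape (len(a), len(b)), Euclidean by default
    return np.argwhere(distances < threshold)
```

For two 500 x 64 descriptor arrays this produces a 500 x 500 distance matrix, which is small enough to hold in memory.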
