Optimize scipy nearest neighbor search - python

I am trying to find all the nearest neighbours that are within a 1 km radius. Here is my script to construct the tree and search for the nearest points:
import pysal
from pysal.cg.kdtree import KDTree

def construct_tree(s):
    # Build the KD-tree from (longitude, latitude) tuples, using arc distance in km
    data_geopoints = [tuple(x) for x in s[['longitude','latitude']].to_records(index=False)]
    tree = KDTree(data_geopoints, distance_metric='Arc', radius=pysal.cg.RADIUS_EARTH_KM)
    return tree

def get_neighbors(s, tree):
    # Indices of all points within 1 km of s
    indices = tree.query_ball_point(s, 1)
    return indices
# Constructing the tree for search
tree = construct_tree(data)
# Finding the nearest neighbours within 1 km
data['neighborhood'] = data['lat_long'].apply(lambda row: get_neighbors(row, tree))
From what I read on the pysal page, it says:
kd-tree built on top of kd-tree functionality in scipy. If using scipy 0.12 or greater uses the scipy.spatial.cKDTree, otherwise uses scipy.spatial.KDTree.
In my case it should be using cKDTree. This works fine for a sample dataset, but tree.query_ball_point returns a list of indices for each point, and each list can have hundreds of elements. For my data (2 million records) the result keeps growing, and the process eventually stops with a memory error. Any idea on how to solve this?

In case anyone is looking for an answer to this: I solved it by finding the nearest neighbours for one group at a time (tree.query_ball_point can handle batches of points), writing the results to a database, and then processing the next group, rather than keeping everything in memory. Thanks.
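A minimal sketch of that batching idea (not the exact code used; it assumes the DataFrame data and the tree from the question above, a batch size chosen to taste, and a hypothetical write_to_db helper for persistence):

def process_in_batches(data, tree, batch_size=50000):
    points = [tuple(x) for x in data[['longitude', 'latitude']].to_records(index=False)]
    for start in range(0, len(points), batch_size):
        batch = points[start:start + batch_size]
        # query_ball_point accepts a sequence of points and returns one list of indices per point
        neighbor_lists = tree.query_ball_point(batch, 1)
        write_to_db(start, neighbor_lists)  # hypothetical: persist this batch's results
        # nothing from this batch needs to stay in memory once it has been written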

Related

Is there a non-brute-force solution to optimise the minimum sum of a 2D array, using only one value from each row and column?

I have 2 arrays: one is an ordered array generated from a set of previous positions of connected points; the second is a new set of points specifying the new positions of those points. The task is to match each old point with the best-fitting new position. The differential between each pair of points is stored in a new array of size n*n. The objective is to find a mapping from each previous point to a new point that gives the smallest total sum. Each old point corresponds to a row of the matrix and must be matched to a single column.
I have already looked into an exhaustive search. Although this works, it has complexity O(n!), which is just not a viable solution.
The code below can be used to generate test data for the 2D array.
import numpy as np

def make_data():
    org = np.random.randint(5000, size=(100, 2))
    new = np.random.randint(5000, size=(100, 2))
    arr = []
    for i, j in enumerate(org):
        # distance from old point j to every new point
        values = np.linalg.norm(new - j, axis=1)
        arr.append(values)
    arr = np.array(arr)
    return arr
Here are some small examples of the array and the expected output.
Ex. 1
1 3 5
0 2 3
5 2 6
The above should return [0,2,1] to signify that row 0 maps to column 0, row 1 to column 2, and row 2 to column 1, as the optimal selection would be 1, 3, 2 (total 6).
Ideally the algorithm would be 100% accurate, although something much quicker that is 85%+ accurate would also be acceptable.
Google search terms: "weighted graph minimum matching". You can consider your array to be a weighted graph, and you're looking for a matching that minimizes edge length.
The assignment problem is a fundamental combinatorial optimization problem. It consists of finding, in a weighted bipartite graph, a matching in which the sum of weights of the edges is as large as possible. A common variant consists of finding a minimum-weight perfect matching.
https://en.wikipedia.org/wiki/Assignment_problem
The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time and which anticipated later primal-dual methods.
https://en.wikipedia.org/wiki/Hungarian_algorithm
I'm not sure whether to post the whole algorithm here; it's several paragraphs and in wikipedia markup. On the other hand I'm not sure whether leaving it out makes this a "link-only answer". If people have strong feelings either way, they can mention them in the comments.
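Since scipy is already in use here, it is worth noting that scipy.optimize.linear_sum_assignment solves exactly this assignment problem. A short sketch on the 3x3 example above (the expected [0, 2, 1] comes from the question itself):

import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[1, 3, 5],
                 [0, 2, 3],
                 [5, 2, 6]])
row_ind, col_ind = linear_sum_assignment(cost)   # Hungarian-style solver, polynomial time
print(col_ind)                        # [0 2 1] -> row i is assigned to column col_ind[i]
print(cost[row_ind, col_ind].sum())   # 6, the minimum total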

The fastest way of checking the closest distance between points

I have 2 dictionaries. Both have key value pairs of an index and a world space location.
Something like:
{
    "vertices": {
        1: "(0.004700, 130.417480, -13.546420)",
        2: "(0.1, 152.4, 13.521)",
        3: "(58.21, 998.412, -78.0051)"
    }
}
Dictionary 1 will always have about 20 - 100 entries, dictionary 2 will always have around 10,000 entries.
For every point in dictionary 1, I want to find the point in dictionary 2 that's closest to it. What is the fastest way of doing that? The obvious approach is, for each entry in dictionary 1, to loop through all entries in dictionary 2 and return the one that's closest.
Some untested pseudo code:
for index, position in dict_1.iteritems():
    closest_point = get_closest_point(position)

def get_closest_point(self, start_point):
    closest_distance = 2000000
    closest_point = 0
    for index, end_point in dict_2.iteritems():
        distance = get_distance(self, start_point, end_point)
        if distance < closest_distance:
            closest_distance = distance
            closest_point = end_point
    return closest_point
I think something like this will work. The "problem" is that if I have 100 entries in dictionary 1, it will be 100 x 10,000 = 1,000,000 iterations. That just doesn't seem very fast or elegant to me.
Is there a better way of doing this in Maya/Python?
EDIT:
Just want to comment that I've used a closestPointOnMesh node before, which works just fine and is a lot easier if the points you're checking against are actually part of a mesh. You could do something like this:
selected_object = pm.PyNode(pm.selected()[0])
cpom = pm.createNode("closestPointOnMesh", name="cpom")
for vertex, distance in dict_1.iteritems():
    selected_object.worldMesh >> cpom.inMesh
    cpom.inPosition.set(dict_1.get(vertex))
    print "closest vertex is %s " % cpom.closestVertexIndex.get()
Instant reply from the node and all is dandy. However, if the list of points you're checking against is not part of a mesh, you can't use this. Would it actually be possible/quicker to:
Construct a mesh out of the points in dictionary 2
Use mesh with closestPointOnMesh node
Delete mesh
You definitely need an acceleration structure for non-trivial amounts of points. A KD tree or an octree is what you want -- KD trees are more performant on search but slower to build and can be harder to code. Also, since octrees are spatial rather than binary, they may make it easier to do trivial tests.
You can get a python octree here: http://code.activestate.com/recipes/498121-python-octree-implementation/
If you're doing a lot of distance checks you'll definitely want to use the Maya API vector classes to do the actual math; that will be much, much faster than the equivalent pure Python. You can get these from pymel.datatypes if you don't know the API well, although using the newer API2 versions is pretty painless.
You need what is called a KD Tree. Build a KD Tree with points in your second dictionary and query for the closest point to each point in first dictionary.
I am not familiar with Maya, but if you can use scipy, you can use scipy.spatial.cKDTree for this.
PS: There seems to be an implementation in C++ here.
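A minimal sketch of that approach, assuming dict_1 (~20-100 entries) and dict_2 (~10,000 entries) map index -> position string as in the question; the parse helper is a hypothetical stand-in for however the strings are decoded:

import numpy as np
from scipy.spatial import cKDTree

def parse(s):
    # hypothetical helper: '(0.1, 152.4, 13.521)' -> (0.1, 152.4, 13.521)
    return tuple(float(v) for v in s.strip('()').split(','))

keys_2 = list(dict_2)
tree = cKDTree(np.array([parse(dict_2[k]) for k in keys_2]))    # build the tree once

query_pts = np.array([parse(dict_1[k]) for k in dict_1])
dists, idxs = tree.query(query_pts)                             # nearest dict_2 point for each dict_1 point
for k, i, dist in zip(dict_1, idxs, dists):
    print(k, "->", keys_2[i], dist)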

python find the intersection point of two numpy array

I have two numpy arrays that describe a spatial curve and intersect at one point. I want to find the nearest value in each array to that intersection point. I have this code that works fine, but it is too slow for a large number of points.
from scipy import spatial

def nearest(arr0, arr1):
    ptos = []
    j = 0
    for i in arr0:
        distance, index = spatial.KDTree(arr1).query(i)
        ptos.append([distance, index, j])
        j += 1
    ptos.sort()
    return (arr1[ptos[0][1]].tolist(), ptos[0][1], ptos[0][2])
the result will be (<point coordinates>,<position in arr1>,<position in arr0>)
Your code is doing a lot of things you don't need. First, you're rebuilding the KDTree on every loop iteration, which is wasteful. Also, query takes an array of points, so there is no need to write your own loop. ptos is an odd data structure, and you don't need it (or the sort). Try something like this:
from scipy import spatial

def nearest(arr0, arr1):
    tree = spatial.KDTree(arr1)
    distance, arr1_index = tree.query(arr0)
    best_arr0 = distance.argmin()
    best_arr1 = arr1_index[best_arr0]
    two_closest_points = (arr0[best_arr0], arr1[best_arr1])
    return two_closest_points, best_arr1, best_arr0
If that still isn't fast enough, you'll need to describe your problem in more detail and figure out if another search algorithm will work better for your problem.
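For example, on two small made-up curves (the data here is illustrative, not from the original post):

import numpy as np

arr0 = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]])
arr1 = np.array([[2.1, 3.9], [5.0, 5.0]])

points, i1, i0 = nearest(arr0, arr1)
print(points)   # (array([2., 4.]), array([2.1, 3.9])) -- the closest pair
print(i1, i0)   # 0 2 -> position in arr1, position in arr0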

Speeding up distance between all possible pairs in an array

I have an array of x,y,z coordinates of several (~10^10) points (only 5 shown here)
a = np.array([[34.45, 14.13,  2.17],
              [32.38, 24.43, 23.12],
              [33.19,  3.28, 39.02],
              [36.34, 27.17, 31.61],
              [37.81, 29.17, 29.94]])
I want to make a new array containing only those points that are at least some distance d away from all other points in the list. I wrote code using while loops:
import numpy as np
from scipy.spatial import distance

d = 0.1  # or some distance
i = 0
selected_points = []
while i < len(a):
    interdist = []
    j = i + 1
    while j < len(a):
        interdist.append(distance.euclidean(a[i], a[j]))
        j += 1
    if all(dis >= d for dis in interdist):
        selected_points.append(a[i])
    i += 1
This works, but it is taking really long to perform this calculation. I read somewhere that while loops are very slow.
I was wondering if anyone has any suggestions on how to speed up this calculation.
EDIT: While my objective of finding the particles that are at least some distance away from all the others stays the same, I just realized that there is a serious flaw in my code. Say I have 3 particles: for the first iteration of i, my code calculates the distances 1->2 and 1->3. Say 1->2 is less than the threshold distance d, so the code throws away particle 1. For the next iteration of i it only does 2->3, and say it finds that this is greater than d, so it keeps particle 2. But this is wrong! 2 should also be discarded along with particle 1. The solution by @svohara is the correct one!
For big data sets and low-dimensional points (such as your 3-dimensional data), sometimes there is a big benefit to using a spatial indexing method. One popular choice for low-dimensional data is the k-d tree.
The strategy is to index the data set. Then query the index using the same data set, to return the 2-nearest neighbors for each point. The first nearest neighbor is always the point itself (with dist=0), so we really want to know how far away the next closest point is (2nd nearest neighbor). For those points where the 2-NN is > threshold, you have the result.
from scipy.spatial import cKDTree as KDTree
import numpy as np
#a is the big data as numpy array N rows by 3 cols
a = np.random.randn(10**8, 3).astype('float32')
# This will create the index, prepare to wait...
# NOTE: took 7 minutes on my mac laptop with 10^8 rand 3-d numbers
# there are some parameters that could be tweaked for faster indexing,
# and there are implementations (not in scipy) that can construct
# the kd-tree using parallel computing strategies (GPUs, e.g.)
k = KDTree(a)
#ask for the 2-nearest neighbors by querying the index with the
# same points
(dists, idxs) = k.query(a, 2)
# (dists, idxs) = k.query(a, 2, n_jobs=4) # to use more CPUs on query...
#Note: 9 minutes for query on my laptop, 2 minutes with n_jobs=6
# So less than 10 minutes total for 10^8 points.
# If the second NN is > thresh distance, then there is no other point
# in the data set closer.
thresh_d = 0.1 #some threshold, equiv to 'd' in O.P.'s code
d_slice = dists[:, 1] #distances to second NN for each point
res = np.flatnonzero( d_slice >= thresh_d )
Here's a vectorized approach using distance.pdist -
# Store number of pts (number of rows in a)
m = a.shape[0]
# Get the first of pairwise indices formed with the pairs of rows from a
# Simpler version, but a bit slow : idx1,_ = np.triu_indices(m,1)
shifts_arr = np.zeros(m*(m-1)//2,dtype=int)
shifts_arr[np.arange(m-1,1,-1).cumsum()] = 1
idx1 = shifts_arr.cumsum()
# Get the IDs of pairs of rows that are more than "d" apart and thus select
# the rest of the rows using a boolean mask created with np.in1d for the
# entire range of number of rows in a. Index into a to get the selected points.
selected_pts = a[~np.in1d(np.arange(m),idx1[distance.pdist(a) < d])]
For a huge dataset like 10e10, we might have to perform the operations in chunks based on the system memory available.
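One way to chunk the work (a sketch, using the k-d tree from the earlier answer rather than pdist, since pdist itself materializes the full pairwise-distance array; the chunk size is an assumption to be tuned to available memory):

import numpy as np
from scipy.spatial import cKDTree

def select_in_chunks(a, d, chunk=10**6):
    tree = cKDTree(a)                              # built once over all points
    keep = []
    for start in range(0, len(a), chunk):
        block = a[start:start + chunk]
        dists, _ = tree.query(block, k=2)          # 2-NN: the first neighbour is the point itself
        keep.append(start + np.flatnonzero(dists[:, 1] >= d))
    return a[np.concatenate(keep)]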
Your algorithm is quadratic (~10^20 operations). Here is a linear approach if the distribution is nearly random.
Split your space into boxes of side d/sqrt(3), so that the diagonal of a box is d and any two points sharing a box are at most d apart. Put each point in its box.
Then, for each box:
if it contains just one point, you only have to calculate distances against the points in a small neighbourhood of boxes;
otherwise there is nothing to compute: every point in that box is within d of another point and can be dropped.
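A minimal sketch of that bucketing idea (my interpretation of the answer above, not the answerer's code); it assumes a is an (N, 3) numpy array and d is the threshold from the question:

import numpy as np
from collections import defaultdict
from scipy.spatial import distance

def select_isolated(a, d):
    cell = d / np.sqrt(3)                      # box side: any two points sharing a box are at most d apart
    boxes = defaultdict(list)
    for i, p in enumerate(a):
        boxes[tuple(int(c) for c in np.floor(p / cell))].append(i)

    keep = []
    for key, idxs in boxes.items():
        if len(idxs) > 1:                      # box holds 2+ points -> each is within d of another point
            continue
        i = idxs[0]
        # boxes more than 2 cells away in any axis are guaranteed farther than d,
        # so it is enough to scan the 5x5x5 neighbourhood of cells
        nearby = []
        for dx in range(-2, 3):
            for dy in range(-2, 3):
                for dz in range(-2, 3):
                    nearby.extend(boxes.get((key[0] + dx, key[1] + dy, key[2] + dz), []))
        nearby = [j for j in nearby if j != i]
        if not nearby or distance.cdist(a[[i]], a[nearby]).min() >= d:
            keep.append(i)
    return a[keep]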
Drop the append; it can be really slow. You can preallocate a vector of distances and use [] to put each number in the right position.
Use min instead of all. You only need to check whether the minimum distance is bigger than d.
Actually, you can break out of the inner loop the moment you find a distance smaller than your limit, and then you can drop both points. That way you don't even have to save any distances (unless you need them later).
Since d(a,b) = d(b,a), you can run the inner loop only over the following points and forget about the distances you already calculated. If you need them later, you can fetch them from the array.
From your comment, I believe this would do, if you have no repeated points.
import numpy as np
from scipy.spatial import distance

def get_selected_points(a, d):
    selected_points = []
    for p1 in a:
        save_point = True
        for p2 in a:
            # p1 != p2 on numpy rows gives an element-wise array, so compare with array_equal
            if not np.array_equal(p1, p2) and distance.euclidean(p1, p2) < d:
                save_point = False
                break
        if save_point:
            selected_points.append(p1)
    return selected_points
In the end I check both a,b and b,a, because you should not modify a list while iterating over it, but you can be smarter with some additional variables.

Python: nearest neighbour (or closest match) filtering on data records (list of tuples)

I am trying to write a function that will filter a list of tuples (mimicking an in-memory database), using a "nearest neighbour" or "nearest match" type algorithm.
I want to know the best (i.e. most Pythonic) way to do this. The sample code below hopefully illustrates what I am trying to do.
datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]

filter_record = (9, 1.9, 2.9, 99)  # record that we are seeking to retrieve from 'database' (or nearest match)
weights = (1, 1, 1, 1)  # weights to apportion to each field in the filter

def get_nearest_neighbour(data, criteria, weights):
    for row in data:
        # calculate 'distance metric' (e.g. simple differencing) and multiply by relevant weight
        # determine the row which was either an exact match or was 'least dissimilar'
        # return the match (or nearest match)
        pass

if __name__ == '__main__':
    result = get_nearest_neighbour(datarows, filter_record, weights)
    print result
For the snippet above, the output should be:
(10,2.0,3.4,100)
since it is the 'nearest' to the sample data passed to the function get_nearest_neighbour().
My question then is, what is the best way to implement get_nearest_neighbour()? For the purpose of brevity etc., assume that we are only dealing with numeric values, and that the 'distance metric' we use is simply an arithmetic subtraction of the input data from the current row.
Simple out-of-the-box solution:
import math

def distance(row_a, row_b, weights):
    diffs = [math.fabs(a - b) for a, b in zip(row_a, row_b)]
    return sum(v * w for v, w in zip(diffs, weights))

def get_nearest_neighbour(data, criteria, weights):
    def sort_func(row):
        return distance(row, criteria, weights)
    return min(data, key=sort_func)
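Called with the question's sample data, this returns the record expected in the question:

result = get_nearest_neighbour(datarows, filter_record, weights)
print(result)   # (10, 2.0, 3.4, 100) -- its weighted distance (2.6) is the smallest of the three rows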
If you need to work with huge datasets, you should consider switching to NumPy arrays and using scipy's KDTree (scipy.spatial.cKDTree) to find nearest neighbours. The advantage is that not only does it use a more advanced algorithm, it is also implemented on top of highly optimized compiled code.
About naive-NN:
Many of these other answers propose "naive nearest-neighbor", which is an O(N*d)-per-query algorithm (d is the dimensionality, which in this case seems constant, so it's O(N)-per-query).
While an O(N)-per-query algorithm is pretty bad, you might be able to get away with it, if you have less than any of (for example):
10 queries and 100000 points
100 queries and 10000 points
1000 queries and 1000 points
10000 queries and 100 points
100000 queries and 10 points
Doing better than naive-NN:
Otherwise you will want to use one of the techniques (especially a nearest-neighbor data structure) listed in:
http://en.wikipedia.org/wiki/Nearest_neighbor_search (most likely linked off from that page), some examples linked:
http://en.wikipedia.org/wiki/K-d_tree
http://en.wikipedia.org/wiki/Locality_sensitive_hashing
http://en.wikipedia.org/wiki/Cover_tree
especially if you plan to run your program more than once. There are most likely libraries available. Otherwise, not using a NN data structure would take too much time if you have a large product of #queries * #points. As user 'dsign' points out in the comments, you can probably squeeze out a large additional constant factor of speed by using the numpy library.
However, if you can get away with using the simple-to-implement naive-NN, you should use it.
Use heapq.nsmallest on a generator calculating the weighted distance for each record (nsmallest rather than nlargest, since the nearest match is the one with the smallest distance).
Something like:
import heapq
import operator

heapq.nsmallest(N, ((row, dist_function(row, criteria, weights)) for row in data),
                key=operator.itemgetter(1))
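For instance, with the question's sample data and the distance helper defined in the earlier answer, N = 1 picks the single nearest record:

import heapq
import operator

best = heapq.nsmallest(1, ((row, distance(row, filter_record, weights)) for row in datarows),
                       key=operator.itemgetter(1))
print(best[0][0])   # (10, 2.0, 3.4, 100)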
