Finding optimal nearest neighbour pairs - python

Goal
I am writing a "colocalization" script to identify unique co-localized pairs of coordinates between two sets of data. My data is quite large, with up to ~100k points in each set, so performance is important.
For example, I have two sets of points:
import numpy as np
points_a = np.array([[1, 1],[2, 2],[3, 3],[6, 6]])
points_b = np.array([[1, 1],[2, 3],[3, 5],[6, 6], [7,6]]) # may be longer than points_a
For each point in points_a I want to find the nearest point in points_b. However, I don't want any point in points_b used in more than one pair. I can easily find the nearest neighbors using NearestNeighbors or one of the similar routines:
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, indices = neigh.kneighbors(points_a)
print(indices)
>>> [0, 1, 1, 3]
As above, this can give me a solution where a point in points_b is used twice. I would like to instead find the solution where each point is used once while minimizing the total distance across all pairs. In the above case:
[0, 1, 2, 3]
I figure a start would be to use NearestNeighbors or similar to find nearest neighbor candidates:
from sklearn.neighbors import NearestNeighbors
max_search_radius = 3
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, indices = neigh.radius_neighbors(points_a, max_search_radius)
print(distances)
print(indices)
>>> [[0, 2.24], [1.41, 1], [2.83, 1, 2], [0, 1]]
>>> [[0, 1], [0, 1], [0, 1, 2], [3, 4]]
This narrows down the candidate pairs, but I am unclear how I can then compute the global optimum. I stumbled across this post: Find optimal unique neighbour pairs based on closest distance
but that solution is for a single set of points and I am unclear how to translate the method to my case.
Any advice would be greatly appreciated!
Update
Hey all. With everyone's advice I found a somewhat working solution:
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix, csgraph
def colocalize_points(points_a: np.ndarray, points_b: np.ndarray, r: int):
    """Find pairs that minimize global distance. Filters out anything outside radius `r`."""
    neigh = NearestNeighbors(n_neighbors=1)
    neigh.fit(points_b)
    distances, b_indices = neigh.radius_neighbors(points_a, radius=r)

    # flatten and get indices for A; this also drops points in A with no matches in range
    # (+1 so that genuine zero distances are not treated as missing edges in the sparse matrix)
    d_flat = np.hstack(distances) + 1
    b_flat = np.hstack(b_indices)
    a_flat = np.array([i for i, neighbors in enumerate(distances) for _ in neighbors])

    # filter out A points that cannot be matched
    sm = csr_matrix((d_flat, (a_flat, b_flat)))
    a_matchable = csgraph.maximum_bipartite_matching(sm, perm_type='column')
    sm_filtered = sm[a_matchable != -1]

    # now run the distance-minimizing matching
    row_match, col_match = csgraph.min_weight_full_bipartite_matching(sm_filtered)
    return row_match, col_match
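For what it's worth, calling this on the toy points from the top of the question seems to give the hoped-for pairing. A quick check (note the returned row indices refer to the filtered matrix, which here is all of points_a since nothing gets dropped):
points_a = np.array([[1, 1], [2, 2], [3, 3], [6, 6]])
points_b = np.array([[1, 1], [2, 3], [3, 5], [6, 6], [7, 6]])
rows, cols = colocalize_points(points_a, points_b, r=3)
print(rows)  # [0 1 2 3]
print(cols)  # [0 1 2 3]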
Only issue I have is that by filtering the matrix with maximum_bipartite_matching I cannot be sure I truly have the best result, since it just returns the first match it finds. For example, if I have two points in A, [[2, 2], [3, 3]], whose only candidate match is [3, 3], maximum_bipartite_matching will keep whichever appears first. So if [2, 2] appears first in the matrix, [3, 3] will be dropped despite being the better match.
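One alternative sketch (not the function above, and only feasible when a dense len(points_a) x len(points_b) matrix fits in memory): build the full distance matrix, replace out-of-radius entries with a large penalty, and let scipy.optimize.linear_sum_assignment decide which A points to leave unmatched. The helper name colocalize_dense is made up for illustration, and it assumes points_a is no longer than points_b as in the question:
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

def colocalize_dense(points_a, points_b, r):
    cost = cdist(points_a, points_b)          # all pairwise distances
    penalty = cost.max() * len(points_a) + 1  # larger than any feasible total distance
    cost[cost > r] = penalty                  # price out-of-radius pairs out of the solution
    rows, cols = linear_sum_assignment(cost)
    keep = cost[rows, cols] <= r              # drop A points forced onto a penalty entry
    return rows[keep], cols[keep]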
Update 1
To address comments below, here is my reasoning why maximum_bipartite_matching does not give me the desired solution. Consider points:
points_a = np.array([(1, 1), (2, 2), (3, 3)])
points_b = np.array([(1, 1), (2, 2), (3, 5), (2, 3)])
The optimal a,b point pairing that minimizes distance will be:
[(1, 1): (1, 1),
(2, 2): (2, 2),
(3, 3): (2, 3)]
However if I run the following:
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, b_indices = neigh.radius_neighbors(points_a, radius=3)
# flatten and get indices for A. This will also drop points in A with no matches in range
d_flat = np.hstack(distances) + 1
b_flat = np.hstack(b_indices)
a_flat = np.array([i for i, neighbors in enumerate(distances) for n in neighbors])
# filter out A points that cannot be matched
sm = csr_matrix((d_flat, (a_flat, b_flat)))
a_matchable = csgraph.maximum_bipartite_matching(sm, perm_type='column')
print([(points_a[i], points_b[a]) for i, a in enumerate(a_matchable)])
I get solution:
[(1, 1): (1, 1),
(2, 2): (2, 2),
(3, 3): (3, 5)]
Swapping the last two points in points_b will give me the expected solution. This indicates to me that the algorithm is not taking the distance (weight) into account and instead just tries to maximize the number of connections. I could very well have made a mistake though so please let me know.
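For what it's worth, running the weight-aware matcher on this same example does return the expected pairing. A minimal check, using the same +1 trick so the zero distances survive as explicit edges:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import min_weight_full_bipartite_matching
from scipy.spatial.distance import cdist

points_a = np.array([(1, 1), (2, 2), (3, 3)])
points_b = np.array([(1, 1), (2, 2), (3, 5), (2, 3)])

weights = csr_matrix(cdist(points_a, points_b) + 1)  # +1: keep zero distances as edges
rows, cols = min_weight_full_bipartite_matching(weights)
for r, c in zip(rows, cols):
    print(points_a[r].tolist(), points_b[c].tolist())
# [1, 1] [1, 1]
# [2, 2] [2, 2]
# [3, 3] [2, 3]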

Related

How do I find coordinates in a 2D matrix?

Given the 3x4 matrix shown below,
a=[[1,2,3,4], [5,6,7,8], [9,10,11,12]]
I want to find 7 and store its coordinates (2,3) in a variable.
Is there a built-in function for this?
In MATLAB, [row, col] = find(a==7) gives row=2, col=3.
I'm curious how to do this in Python.
After initializing the value you want to find,
val = 7
here is a nice one-liner:
array = [(ix,iy) for ix, row in enumerate(a) for iy, i in enumerate(row) if i == val]
Output of print(array):
[(1, 2)]
Note the one-liner will catch all instances of the number 7 in a matrix, not just one. Also note the indexes start at 0, so row 2 will be displayed as 1 and column 3 will be displayed as 2. If, say, you have more than one instance of 7 in a row and want the actual row and column numbers (not starting at 0), this may be helpful:
a=[[1,7,7,4], [5,6,7,8], [9,10,11,7]]
val = 7
array = [(ix+1,iy+1) for ix, row in enumerate(a) for iy, i in enumerate(row) if i == val]
print(array)
Output:
[(1, 2), (1, 3), (2, 3), (3, 4)]
To do it similarly to MATLAB, you would have to use numpy:
import numpy as np
a = [[1,2,3,4], [5,6,7,8], [9,10,11,12]]
a = np.array(a)
rows, cols = np.where(a == 7)
print(rows[0], cols[0])
np.where finds every 7 in the matrix, so it returns rows and cols as arrays.
It also counts rows/cols starting at 0, so you may have to add 1 to get the same result as MATLAB.
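For example, combining the 0-based indices from np.where with a shift of 1 reproduces the MATLAB call (a minimal check):
import numpy as np
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
rows, cols = np.where(a == 7)
print(rows[0] + 1, cols[0] + 1)   # 2 3, matching MATLAB's [row, col] = find(a==7)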
I would use numpy's where function. Here's another post that displays its use nicely. I'd apply it to your use case like so:
import numpy as np
arr = np.array([[1, 2, 3], [4, 100, 6], [100, 8, 9]])
positions = np.where(arr == 100)
# positions = (array([1, 2], dtype=int64), array([1, 0], dtype=int64))
positions = list(zip(*(pos.tolist() for pos in positions)))
# positions = [(1, 1), (2, 0)]
Note that this solution allows for the possibility that the desired pattern might occur more than once.

I want to calculate slope and intercept of a linear fit using pykalman module

Consider the linear regression of Y on X, where (xi, yi) = (2, 7), (0, 2), (5, 14) for i = 1, 2, 3. The solution is slope a = 2.395 and intercept b = 2.079, obtained using the regression function on a hand-held calculator.
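(As a quick sanity check outside pykalman, an ordinary least-squares fit with numpy reproduces those numbers:)
import numpy as np
x = np.array([2, 0, 5])
y = np.array([7, 2, 14])
slope, intercept = np.polyfit(x, y, 1)   # ordinary least squares, degree-1 polynomial
print(slope, intercept)                  # approximately 2.395 and 2.079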
I want to calculate the slope and the intercept of a linear fit using
the pykalman module. I'm getting
ValueError: The shape of all parameters is not consistent. Please re-check their values.
I'd really appreciate if someone would help me.
Here is my code:
from pykalman import KalmanFilter
import numpy as np
measurements = np.asarray([[7], [2], [14]])
initial_state_matrix = [[1], [1]]
transition_matrix = [[1, 0], [0, 1]]
observation_covariance_matrix = [[1, 0], [0, 1]]
observation_matrix = [[2, 1], [0, 1], [5, 1]]
kf1 = KalmanFilter(n_dim_state=2, n_dim_obs=6,
                   transition_matrices=transition_matrix,
                   observation_matrices=observation_matrix,
                   initial_state_mean=initial_state_matrix,
                   observation_covariance=observation_covariance_matrix)
kf1 = kf1.em(measurements, n_iter=0)
(smoothed_state_means, smoothed_state_covariances) = kf1.smooth(measurements)
print(smoothed_state_means)
Here's the code snippet:
from pykalman import KalmanFilter
import numpy as np
kf = KalmanFilter()
(filtered_state_means, filtered_state_covariances) = kf.filter_update(
    filtered_state_mean=[[0], [0]], filtered_state_covariance=[[90000, 0], [0, 90000]],
    observation=np.asarray([[7], [2], [14]]), transition_matrix=np.asarray([[1, 0], [0, 1]]),
    observation_matrix=np.asarray([[2, 1], [0, 1], [5, 1]]),
    observation_covariance=np.asarray([[.1622, 0, 0], [0, .1622, 0], [0, 0, .1622]]))
print(filtered_state_means)
print(filtered_state_covariances)
for x in range(0, 1000):
    (filtered_state_means, filtered_state_covariances) = kf.filter_update(
        filtered_state_mean=filtered_state_means, filtered_state_covariance=filtered_state_covariances,
        observation=np.asarray([[7], [2], [14]]), transition_matrix=np.asarray([[1, 0], [0, 1]]),
        observation_matrix=np.asarray([[2, 1], [0, 1], [5, 1]]),
        observation_covariance=np.asarray([[.1622, 0, 0], [0, .1622, 0], [0, 0, .1622]]))
print(filtered_state_means)
print(filtered_state_covariances)
filtered_state_covariance was chosen large because initially we have no idea where our filtered_state_mean lies, and the observations are just [[y1],[y2],[y3]]. observation_matrix is [[x1,1],[x2,1],[x3,1]], so the second element of the state is our intercept. Think of it as y1 = m*x1 + c, where m and c are the slope and intercept respectively; in our case filtered_state_mean = [[m],[c]]. Notice that the new filtered_state_means is used as filtered_state_mean in the next kf.filter_update() call (in the loop), together with filtered_state_covariance = filtered_state_covariances, because after each update we know better where the mean lies. Iterating 1000 times converges the mean to the real value. If you want to know more about the function/method used, the link is: https://pykalman.github.io/
If the system state does not change between measurements (also called vacuous movement step), then transition_matrix φ = I.
I'm not sure if what I'm going to say now is true or not. So please correct me if I am wrong
The observation_covariance matrix must be of size m x m, where m is the number of observations (in our case 3). The diagonal elements are the variances (variance_y1, variance_y2 and variance_y3) and the off-diagonal elements are covariances; for example, element (1,2) is the covariance of y1 and y2 and is equal to element (2,1), and similarly for the other elements. Can someone help me include uncertainty in x1, x2 and x3? I mean, how do you implement uncertainties in x in the above code?

Find all nearest neighbors within a specific distance

I have a large list of x and y coordinates, stored in a numpy array.
Coordinates = [[ 60037633 289492298]
[ 60782468 289401668]
[ 60057234 289419794]]
...
...
What I want is to find all nearest neighbors within a specific distance (let's say 3 meters) and store the result so that I can later do some further analysis on it.
For most packages I found, it is necessary to decide how many NNs should be found, but I just want all neighbors within the set distance.
How can I achieve this, and what is the fastest way to do it for a large dataset (several million points)?
You could use a scipy.spatial.cKDTree:
import numpy as np
import scipy.spatial as spatial
points = np.array([(1, 2), (3, 4), (4, 5)])
point_tree = spatial.cKDTree(points)
# This finds the index of all points within distance 1 of [1.5,2.5].
print(point_tree.query_ball_point([1.5, 2.5], 1))
# [0]
# This gives the point in the KDTree which is within 1 unit of [1.5, 2.5]
print(point_tree.data[point_tree.query_ball_point([1.5, 2.5], 1)])
# [[1 2]]
# More than one point is within 3 units of [1.5, 1.6].
print(point_tree.data[point_tree.query_ball_point([1.5, 1.6], 3)])
# [[1 2]
# [3 4]]
Here is an example showing how you can find all the nearest neighbors to an array of points, with one call to point_tree.query_ball_point:
import numpy as np
import scipy.spatial as spatial
import matplotlib.pyplot as plt

np.random.seed(2015)

centers = [(1, 2), (3, 4), (4, 5)]
points = np.concatenate([pt + np.random.random((10, 2))*0.5
                         for pt in centers])

point_tree = spatial.cKDTree(points)

cmap = plt.get_cmap('copper')
colors = cmap(np.linspace(0, 1, len(centers)))
for center, group, color in zip(centers, point_tree.query_ball_point(centers, 0.5), colors):
    cluster = point_tree.data[group]
    x, y = cluster[:, 0], cluster[:, 1]
    plt.scatter(x, y, c=color, s=200)
plt.show()
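Since the question is about all neighbors within a fixed distance over the whole (potentially huge) array, it may also be worth knowing about query_pairs and sparse_distance_matrix on the same tree. A small sketch; the radius of 3 and the random coordinates are just placeholders:
import numpy as np
import scipy.spatial as spatial

coords = np.random.random((1000, 2)) * 100   # stand-in for your coordinate array
tree = spatial.cKDTree(coords)

# All unordered index pairs (i, j) whose points lie within 3 units of each other.
pairs = tree.query_pairs(r=3)

# The same information as a sparse matrix of actual distances, handy for later analysis.
dists = tree.sparse_distance_matrix(tree, max_distance=3)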

List all faces for all edges

A very simple question:
how to compute the following efficiently in Python (or Cython).
There is a list of polygons in 3D, given in the following form:
vertex = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0],[1, 0, 0],[0.5, 0.5, 0.5]], order = 'F').T
polygons = np.array([3, 0, 1, 2, 4, 1, 2, 3 ,4])
i.e. polygons is a 1D array containing entries of the form [N, i1, i2, i3, i4, ...],
where N is the number of vertices in a polygon, followed by the id numbers of those vertices in the vertex array (in the example above there is one triangle with 3 vertices [0, 1, 2] and one polygon with 4 vertices [1, 2, 3, 4]).
I need to compute a list of all edges and, for each edge, which faces contain it.
And I need to do it fast: the number of vertices can be large.
Update
Each polygon is closed, i.e. a polygon [4, 0, 1, 5, 7] means there are 4 vertices and the edges are 0-1, 1-5, 5-7, 7-0.
"Face" is in fact a synonym for polygon.
Dunno if this is the fastest option, most probably not, but it works. I think the slowest part is edges.index((v, polygon[i + 1])), where we have to check whether this edge is already in the list. The vertex array is not really needed, since an edge is just a pair of vertex indexes. I used face_index as a reference to the polygon index, since you didn't write what a face is.
vertex = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0.5, 0.5, 0.5]]
polygons = [3, 0, 1, 2, 4, 1, 2, 3, 4]

_polygons = polygons
edges = []
faces = []
face_index = 0
while _polygons:
    polygon = _polygons[1:_polygons[0] + 1]
    polygon.append(polygon[0])
    _polygons = _polygons[_polygons[0] + 1:]
    for i, v in enumerate(polygon[0:-1]):
        if not (v, polygon[i + 1]) in edges:
            edges.append((v, polygon[i + 1]))
            faces.append([face_index, ])
        else:
            faces[edges.index((v, polygon[i + 1]))].append(face_index)
    face_index += 1

edges = list(zip(edges, faces))
print(edges)
<<< [((0, 1), [0]), ((1, 2), [0, 1]), ((2, 0), [0]), ((2, 3), [1]), ((3, 4), [1]), ((4, 1), [1])]
You can make it a bit faster by removing the line polygon.append(polygon[0]) and appending the first vertex of each polygon to its vertex list manually, which shouldn't be a problem.
I mean, change polygons = [3,0,1,2,4,1,2,3,4] into polygons = [3,0,1,2,0,4,1,2,3,4,1].
PS: Try to follow PEP 8, the Python style guide. Among other things, it says you should put a space after every comma so the code is easier to read.
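If the edges.index lookup does turn out to be the bottleneck, a dictionary keyed by the (sorted) vertex pair avoids the linear search. A rough sketch along the same lines; note it normalizes edge direction, so the triangle's edge (2, 0) shows up as (0, 2):
from collections import defaultdict

vertex = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0.5, 0.5, 0.5]]
polygons = [3, 0, 1, 2, 4, 1, 2, 3, 4]

edge_faces = defaultdict(list)   # edge (as a sorted vertex pair) -> list of face ids
i = 0
face_index = 0
while i < len(polygons):
    n = polygons[i]
    ids = polygons[i + 1:i + 1 + n]
    for a, b in zip(ids, ids[1:] + ids[:1]):        # closed polygon: last vertex connects to first
        edge_faces[tuple(sorted((a, b)))].append(face_index)
    face_index += 1
    i += n + 1

print(dict(edge_faces))
# {(0, 1): [0], (1, 2): [0, 1], (0, 2): [0], (2, 3): [1], (3, 4): [1], (1, 4): [1]}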

A 3-D grid of regularly spaced points

I want to create a list containing the 3-D coords of a grid of regularly spaced points, each as a 3-element tuple. I'm looking for advice on the most efficient way to do this.
In C++ for instance, I would simply use three nested loops, one for each coordinate. In Matlab, I would probably use the meshgrid function (which would do it in one command). I've read about meshgrid and mgrid in Python, and I've also read that using numpy's broadcasting rules is more efficient. It seems to me that using the zip function in combination with the numpy broadcast rules might be the most efficient way, but zip doesn't seem to be overloaded in numpy.
Use ndindex:
import numpy as np
ind = np.ndindex(3, 3, 2)
for i in ind:
    print(i)
# (0, 0, 0)
# (0, 0, 1)
# (0, 1, 0)
# (0, 1, 1)
# (0, 2, 0)
# (0, 2, 1)
# (1, 0, 0)
# (1, 0, 1)
# (1, 1, 0)
# (1, 1, 1)
# (1, 2, 0)
# (1, 2, 1)
# (2, 0, 0)
# (2, 0, 1)
# (2, 1, 0)
# (2, 1, 1)
# (2, 2, 0)
# (2, 2, 1)
Instead of meshgrid and mgrid, you can use ogrid, which is a "sparse" version of mgrid. That is, only the dimension along which the values change is filled in. The others are simply broadcast. This uses much less memory for large grids than the non-sparse alternatives.
For example:
>>> import numpy as np
>>> x, y = np.ogrid[-1:2, -2:3]
>>> x
array([[-1],
[ 0],
[ 1]])
>>> y
array([[-2, -1, 0, 1, 2]])
>>> x**2 + y**2
array([[5, 2, 1, 2, 5],
[4, 1, 0, 1, 4],
[5, 2, 1, 2, 5]])
I would say go with meshgrid or mgrid, in particular if you need non-integer coordinates. I'm surprised that Numpy's broadcasting rules would be more efficient, as meshgrid was designed especially for the problem that you want to solve.
For multi-dimensional (greater than 2-D) meshgrids, use numpy.lib.index_tricks.nd_grid like so:
import numpy
grid = numpy.lib.index_tricks.nd_grid()
g1 = grid[:3,:3,:3]
g2 = grid[0:1:0.5, 0:1, 0:2]
g3 = grid[0:1:3j, 0:1:2j, 0:2:2j]
where g1 has x values of [0, 1, 2],
g2 has x values of [0, 0.5],
and g3 has x values of [0.0, 0.5, 1.0] (the 3j defines the number of points instead of the step increment; see the documentation for more details).
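If the end goal is the flat list of 3-element tuples from the question, the mgrid output can be reshaped into one directly. A small sketch using the same 3j-style step counts (the bounds here are just examples):
import numpy as np

g = np.mgrid[0:1:3j, 0:1:3j, 0:2:2j]                       # shape (3, 3, 3, 2)
points = [tuple(p) for p in g.reshape(3, -1).T.tolist()]   # 18 tuples of (x, y, z)
print(points[:3])
# [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0), (0.0, 0.5, 0.0)]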
Here's an efficient option similar to your C++ solution, which I've used for exactly the same purpose:
import numpy, itertools, collections

def grid(xmin, xmax, xstep, ymin, ymax, ystep, zmin, zmax, zstep):
    "return nested tuples of grid-sampled coordinates that include maxima"
    return collections.deque(itertools.product(
        numpy.arange(xmin, xmax + xstep, xstep).tolist(),
        numpy.arange(ymin, ymax + ystep, ystep).tolist(),
        numpy.arange(zmin, zmax + zstep, zstep).tolist()))
Performance is best (in my tests) when using a.tolist(), as shown above, but you can use a.flat instead and drop the deque() to get an iterator that will sip memory. Of course, you can also use a plain old tuple() or list() instead of deque() for a slight performance penalty (again, in my tests).
