Consider the following two sets of points. I would like to find the optimal 2D translation and rotation that aligns the largest number of points between the blue dataset and the orange dataset, where a point is considered aligned if the distance to its nearest neighbor in the other dataset is smaller than a threshold.
I understand that this is related to "Iterative Closest Point" algorithms, but in this case the situation is a bit harder, because not all points from one dataset are present in the other, and also because some points may turn out to be "false positives" (noise).
Is there an efficient way of doing this?
I came across the same problem when comparing CCD star-observation figures, and found a solution: the basic idea is to find the best match between the triangles formed by the two sets of points.
I then use the astroalign package to calculate the transformation matrix and align all the points. Thank the Lord, it works pretty well.
import itertools
import numpy as np
import matplotlib.pyplot as plt
import astroalign as aa
def getTriangles(set_X, X_combs):
    """
    Inefficient way of obtaining the lengths of each triangle's sides.
    Normalized so that the minimum length is 1.
    """
    triang = []
    for p0, p1, p2 in X_combs:
        d1 = np.sqrt((set_X[p0][0] - set_X[p1][0]) ** 2 +
                     (set_X[p0][1] - set_X[p1][1]) ** 2)
        d2 = np.sqrt((set_X[p0][0] - set_X[p2][0]) ** 2 +
                     (set_X[p0][1] - set_X[p2][1]) ** 2)
        d3 = np.sqrt((set_X[p1][0] - set_X[p2][0]) ** 2 +
                     (set_X[p1][1] - set_X[p2][1]) ** 2)
        d_min = min(d1, d2, d3)
        d_unsort = [d1 / d_min, d2 / d_min, d3 / d_min]
        triang.append(sorted(d_unsort))
    return triang
def sumTriangles(ref_triang, in_triang):
    """
    For each normalized triangle in the reference set, compare with each
    normalized triangle in the input set: find the differences between their
    sides, sum their absolute values, and select the two triangles with the
    smallest sum of absolute differences.
    """
    tr_sum, tr_idx = [], []
    for i, ref_tr in enumerate(ref_triang):
        for j, in_tr in enumerate(in_triang):
            # Absolute values of the side-length differences.
            tr_diff = abs(np.array(ref_tr) - np.array(in_tr))
            # Sum the differences.
            tr_sum.append(sum(tr_diff))
            tr_idx.append([i, j])
    # Indexes of the ref and input triangles with the smallest sum of
    # absolute length differences.
    tr_idx_min = tr_idx[tr_sum.index(min(tr_sum))]
    ref_idx, in_idx = tr_idx_min[0], tr_idx_min[1]
    print("Smallest difference: {}".format(min(tr_sum)))
    return ref_idx, in_idx
set_ref = np.array([[2511.268821,   44.864124],
                    [2374.085032,  201.922566],
                    [1619.282942,  216.089335],
                    [1655.866502,  221.127787],
                    [ 804.171659, 2133.549517]])
set_in = np.array([[1992.438563,   63.727282],
                   [2285.793346,  255.402548],
                   [1568.915358,  279.144544],
                   [1509.720134,  289.434629],
                   [1914.255205,  349.477788],
                   [2370.786382,  496.026836],
                   [ 482.702882,  508.685952],
                   [2089.691026,  523.18825 ],
                   [ 216.827439,  561.807396],
                   [ 614.874621, 2007.304727],
                   [1286.639124, 2155.264827],
                   [ 729.566116, 2190.982364]])
# All possible triangles.
ref_combs = list(itertools.combinations(range(len(set_ref)), 3))
in_combs = list(itertools.combinations(range(len(set_in)), 3))
# Obtain normalized triangles.
ref_triang, in_triang = getTriangles(set_ref, ref_combs), getTriangles(set_in, in_combs)
# Index of the ref and in triangles with the smallest difference.
ref_idx, in_idx = sumTriangles(ref_triang, in_triang)
# Indexes of points in ref and in of the best match triangles.
ref_idx_pts, in_idx_pts = ref_combs[ref_idx], in_combs[in_idx]
print('triangle ref %s matches triangle in %s' % (ref_idx_pts, in_idx_pts))
print("ref:", [set_ref[_] for _ in ref_idx_pts])
print("input:", [set_in[_] for _ in in_idx_pts])
ref_pts = np.array([set_ref[_] for _ in ref_idx_pts])
in_pts = np.array([set_in[_] for _ in in_idx_pts])
transf, (in_list,ref_list) = aa.find_transform(in_pts, ref_pts)
transf_in = transf(set_in)
print(f'transformation matrix: {transf}')
plt.scatter(set_ref[:,0],set_ref[:,1], s=100,marker='.', c='r',label='Reference')
plt.scatter(set_in[:,0],set_in[:,1], s=100,marker='.', c='b',label='Input')
plt.scatter(transf_in[:,0],transf_in[:,1], s=100,marker='+', c='b',label='Input Aligned')
plt.plot(ref_pts[:,0],ref_pts[:,1], c='r')
plt.plot(in_pts[:,0],in_pts[:,1], c='b')
plt.legend()
plt.tight_layout()
plt.savefig('align_coordinates.png', format='png')
plt.show()
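As a side note, and going by my reading of the astroalign docs (so treat this as an assumption rather than a tested claim): find_transform can also be given the full coordinate sets directly, in which case it performs the triangle matching internally:

# Assumed usage; astroalign would match the triangles itself.
transf, (matched_in, matched_ref) = aa.find_transform(set_in, set_ref)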
Using python/numpy, I would like to create a 2D matrix M whose components are:
M_{i,j} = \sum_{k=g(i,j)}^{h(i,j)} f(i,j,k)
I know I can do this with a bunch of for loops but is there a better way to do this by using numpy (not using for loops)?
This is what I tried, which ends up giving me a ValueError.
I tried to first define a function that takes the sum over k:
def sum_function(i, j):
    initial_array = np.arange(g(i, j), h(i, j) + 1)
    applied_array = f(i, j, initial_array)
    return applied_array.sum()
Then I tried to create the M matrix with np.mgrid as follows:
ii, jj = np.mgrid[start:fin, start:fin]
M_matrix = sum_function(ii, jj)
--
(Edited)
Let me write down the concrete form of a matrix as an example:
M_{i,j} = \sum_{k=\min(i,j)}^{i+j} \sin\left( (i+j)^k \right)
If i, j = 0, 1, then this matrix is 2 by 2 and its form will be
\bigl(\begin{smallmatrix} \sin(0) & \sin(1) \\ \sin(1) & \sin(2)+\sin(4) \end{smallmatrix}\bigr)
Now if the matrix gets really big, how would I create this matrix without using for loops?
To simplify thinking, let's ravel the i,j dimensions into one ij dimension. Can we evaluate 3 arrays:
G = g(ij) # for all ij values
H = h(ij)
F = f(ij, kk) # for all ij, and all kk
In other words, can g,h,f be evaluated at multiple values, to produce whole-arrays?
If the G and H values were the same for all ij, or subsets (preferably slices), then
F[:, G:H].sum(axis=1)
would be the value for all ij.
If the H-G difference, the size of each slice, were the same, then we could construct a 2D indexing array GH such that
F[:, GH].sum(axis=1)
gives the value for all ij. In other words, we are summing constant-size windows of the F rows.
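A minimal sketch of that constant-width case (the F, G, and window width w below are made-up stand-ins):

import numpy as np

n_ij, n_k, w = 6, 10, 4                     # made-up sizes
F = np.random.rand(n_ij, n_k)               # F[ij, k] for all ij and all k
G = np.array([0, 1, 2, 3, 2, 1])            # g(ij): window start per ij
GH = G[:, None] + np.arange(w)              # (n_ij, w) window of k indices per row
# Pair each row with its own window of columns via advanced indexing:
sums = F[np.arange(n_ij)[:, None], GH].sum(axis=1)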
But if the H-G differences vary across ij, I think we are stuck with doing the sum for each ij element separately - with Python-level loops, or ones compiled with numba or cython.
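For the concrete g, h, f from the edited question above, a compiled-loop sketch along those lines (my sketch, untested against the original) could look like:

import numpy as np
import numba

@numba.njit
def build_M(n):
    # M[i, j] = sum over k from min(i, j) to i+j of sin((i+j)**k)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            base = float(i + j)   # float base avoids integer overflow for large k
            for k in range(min(i, j), i + j + 1):
                s += np.sin(base ** k)
            M[i, j] = s
    return M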
I think I found an answer to this myself. I first create a 3D array F_{i,j,k} = f(i,j,k), and then create a mask array whose component is True if g(i,j) <= k <= h(i,j) and False otherwise. Then I compute the element-wise multiplication of these two arrays, F*mask_array, and take the sum over the k axis.
For example, this matrix can be efficiently created by the following code.
M_{i,j} = \sum_{k=\min(i,j)}^{i+j} \sin\left( (i+j)^k \right)
# In this example, g(i,j) = min(i,j), h(i,j) = i+j, and f(i,j,k) = sin((i+j)^k),
# with 0 <= i, j <= 2.
# kk should range from the overall min of g(i,j) to the overall max of h(i,j).
ii, jj, kk = np.mgrid[0:3, 0:3, 0:5]
# k >= g(i,j), i.e. k >= min(i,j), i.e. k >= i or k >= j.
frm1 = kk >= jj
frm2 = kk >= ii
frm = np.logical_or(frm1, frm2)
# k <= h(i,j)
to = kk <= ii + jj
# Mask selecting the valid k range for each (i,j).
k_mask = np.logical_and(frm, to)

def f(i, j, k):
    return np.sin((i + j) ** k)

M_before_mask = f(ii, jj, kk)
# Matrix created: terms outside the k range are zeroed out, then summed over k.
M_matrix = (M_before_mask * k_mask).sum(axis=2)
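As a quick sanity check (my addition, not part of the original answer), the masked result can be compared against a direct triple loop:

M_loop = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        for k in range(min(i, j), i + j + 1):
            M_loop[i, j] += np.sin((i + j) ** k)
assert np.allclose(M_matrix, M_loop)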
I am looking for an efficient way to find the mean of values within a certain radius of an element in a 2D NumPy array, excluding the center point and values < 0.
My current method is to create a disc-shaped mask (using the method here) and find the mean of points within this mask. This is taking too long, however... over 10 minutes to calculate ~18000 points within my 300x300 array.
The array I want to find means within is titled "arr" here:
def radMask(index, radius, array, insert):
    a, b = index
    nx, ny = array.shape
    y, x = np.ogrid[-a:nx - a, -b:ny - b]
    mask = x * x + y * y <= radius * radius
    array[mask] = insert
    return array

arr_mask = np.zeros(arr.shape, dtype=bool)
arr_mask = radMask(center, radius, arr_mask, True)
arr_mask[arr < 0] = False   # Exclude points with no echo
arr_mask[ind] = False       # Exclude center point
arr_mean = 0
if np.any(arr_mask):
    arr_mean = np.mean(arr[arr_mask])
Is there any more efficient way to do this? I've looked into some of the image processing filters/tools but can't quite wrap my head around it.
Is this helpful? This takes only a couple of seconds on my laptop for ~18000 points:
import numpy as np

# Generate a random 300x300 matrix for testing.
inputMat = np.random.random((300, 300))
radius = 50

def radMask(index, radius, array):
    a, b = index
    nx, ny = array.shape
    y, x = np.ogrid[-a:nx - a, -b:ny - b]
    mask = x * x + y * y <= radius * radius
    return mask

# meanAll is going to store ~18000 points.
meanAll = np.zeros((130, 130))
for x in range(130):
    for y in range(130):
        centerMask = (x, y)
        mask = radMask(centerMask, radius, inputMat)
        # Un-mask the center and values below 0.
        mask[centerMask] = False
        mask[inputMat < 0] = False
        # Get the mean.
        meanAll[x, y] = np.mean(inputMat[mask])
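Since the question mentions image-processing filters, here is a possible filter-based alternative (a sketch of mine, not from the original answer): convolve both the valid values and a valid-pixel indicator with a disc-shaped footprint, then divide the two to get all the means at once:

from scipy import ndimage

y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
footprint = (x * x + y * y <= radius * radius).astype(float)

valid = inputMat >= 0                     # excludes values below 0
vals = np.where(valid, inputMat, 0.0)
sums = ndimage.convolve(vals, footprint, mode='constant')
counts = ndimage.convolve(valid.astype(float), footprint, mode='constant')
# Remove each center point's own contribution before averaging.
sums -= vals
counts -= valid
means = sums / np.maximum(counts, 1.0)    # mean around every pixel at once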
So I'm running a KNN in order to create clusters. From each cluster, I would like to obtain the medoid of the cluster.
I'm employing a fractional distance metric in order to calculate distances:
\delta(x, y) = \left( \sum_{i=1}^{d} |x^i - y^i|^f \right)^{1/f}
where d is the number of dimensions, the first data point's coordinates are x^i, the second data point's coordinates are y^i, and f is an arbitrary number between 0 and 1.
I would then calculate the medoid as:
x_{\text{medoid}} = \operatorname{arg\,min}_{y \in S} \sum_{x \in S} \delta(x, y)
where S is the set of data points, and δ is the absolute value of the distance metric used above.
I've looked online, to no avail, trying to find implementations of medoid (even with other distance metrics), but most things were specifically k-means or k-medoids, which (I think) is rather different from what I want.
Essentially this boils down to me being unable to translate the math into effective programming. Any help or pointers in the right direction would be much appreciated! Here's a short list of what I have so far:
I have figured out how to calculate the fractional distance metric (the first equation) so I think I'm good there.
I know numpy has an argmin() function (documented here).
Extra points for increased efficiency without loss of accuracy (I'm trying not to brute-force by calculating every single pairwise fractional distance, since the number of point pairs grows quadratically with the dataset...).
Compute the pairwise distance matrix, compute the column or row sums, and use argmin to find the medoid index, i.e. numpy.argmin(distMatrix.sum(axis=0)) or similar.
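A minimal sketch of that recipe with the fractional metric (the function names are mine, and f = 0.3 is just an example value):

import numpy as np

def fractional_distance_matrix(X, f=0.3):
    # delta(x, y) = (sum_i |x_i - y_i|**f) ** (1/f), all pairs via broadcasting.
    diff = np.abs(X[:, None, :] - X[None, :, :])    # shape (n, n, d)
    return (diff ** f).sum(axis=-1) ** (1.0 / f)

def fractional_medoid(X, f=0.3):
    D = fractional_distance_matrix(X, f)
    idx = np.argmin(D.sum(axis=0))    # point minimizing total distance to all others
    return idx, X[idx]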
So I've accepted the answer here, but I thought I'd provide my implementation if anyone else was trying to do something similar:
(1) This is the distance function:
def fractional(p_coord_array, q_coord_array):
    # f is an arbitrary value, but must be greater than zero and
    # less than one. In this case, I used 3/10. I took advantage
    # of the difference of cubes here, so that I wouldn't
    # encounter an overflow error.
    a = np.sum(np.array(p_coord_array, dtype=np.float64))
    b = np.sum(np.array(q_coord_array, dtype=np.float64))
    a2 = np.sum(np.power(p_coord_array, 2))
    ab = np.sum(p_coord_array) * np.sum(q_coord_array)
    b2 = np.sum(np.power(q_coord_array, 2))
    diffab = a - b
    suma2abb2 = a2 + ab + b2
    temp_dist = abs(diffab * suma2abb2)
    temp_dist = np.power(temp_dist, 1. / 10)
    dist = np.power(temp_dist, 10. / 3)
    return dist
(2) The medoid function (if the length of the dataset was less than 6000; greater than that, and I ran into overflow errors... I'm still working on that bit, to be perfectly honest):
def medoid(dataset):
    w = len(dataset)
    if len(dataset) < 6000:
        h = len(dataset)
        dist_matrix = [[0 for x in range(w)] for y in range(h)]
        list_combinations = [(counter_1, counter_2, data_1, data_2)
                             for counter_1, data_1 in enumerate(dataset)
                             for counter_2, data_2 in enumerate(dataset)
                             if counter_1 < counter_2]
        for combo in list_combinations:
            temp_dist = fractional(combo[2], combo[3])
            dist_matrix[combo[0]][combo[1]] = abs(temp_dist)
            dist_matrix[combo[1]][combo[0]] = abs(temp_dist)
        # Per the accepted answer: the medoid is the point whose row of
        # distances has the smallest sum.
        return np.argmin(np.array(dist_matrix).sum(axis=0))
Any questions, feel free to comment!
If you don't mind using brute force this might help:
def calc_medoid(X, Y, f=2):
    n = len(X)
    m = len(Y)
    dist_mat = np.zeros((m, n))
    # Compute the distance matrix.
    for j in range(n):
        center = X[j, :]
        for i in range(m):
            if i != j:
                dist_mat[i, j] = np.linalg.norm(Y[i, :] - center, ord=f)
    medoid_id = np.argmin(dist_mat.sum(axis=0))  # sum over y
    return medoid_id, X[medoid_id, :]
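A possible usage sketch (the data here is made up, with the cluster compared against itself):

X = np.random.rand(50, 4)            # hypothetical cluster of 50 points in 4D
idx, med = calc_medoid(X, X, f=2)
print(f'medoid index: {idx}, medoid: {med}')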
Here is an example of computing a medoid for a single cluster with Euclidean distance.
import numpy as np, pandas as pd, matplotlib.pyplot as plt

a, b, c, d = np.array([0, 1]), np.array([1, 3]), np.array([4, 2]), np.array([3, 1.5])
vCentroid = np.mean([a, b, c, d], axis=0)

def GetMedoid(vX):
    vMean = np.mean(vX, axis=0)                                # compute centroid
    return vX[np.argmin([sum((x - vMean)**2) for x in vX])]    # pick the point closest to the centroid

vMedoid = GetMedoid([a, b, c, d])
print(f'centroid = {vCentroid}')
print(f'medoid = {vMedoid}')

df = pd.DataFrame([a, b, c, d], columns=['x', 'y'])
ax = df.plot.scatter('x', 'y', grid=True, title='Centroid in 2D plane', s=100);
plt.plot(vCentroid[0], vCentroid[1], 'ro', ms=10);  # plot centroid as a red circle
plt.plot(vMedoid[0], vMedoid[1], 'rx', ms=20);      # plot medoid as a red cross
You can also use the following package to compute the medoid for one or more clusters:
!pip -q install scikit-learn-extra > log
from sklearn_extra.cluster import KMedoids
GetMedoid = lambda vX: KMedoids(n_clusters=1).fit(vX).cluster_centers_
GetMedoid([a, b, c, d])[0]
I would say that you just need to compute the median.
np.median(np.asarray(points), axis=0)
Your median is the point with the biggest centrality.
Note: if you are using distances other than Euclidean, this doesn't hold.
I have the following code, where points is a list of lists with many rows and 3 columns, coorRadius is a radius within which I want to find the local coordinate maxima, and localCoordinateMaxima is an array where I store the i's of these maxima:
for i, x in enumerate(points):
    check = 1
    for j, y in enumerate(points):
        if linalg.norm(x - y) <= coorRadius and x[2] < y[2]:
            check = 0
    if check == 1:
        localCoordinateMaxima.append(i)
print(localCoordinateMaxima)
Unfortunately, this takes forever when I have several thousand points, and I am looking for a way to speed it up. I tried to do it with an all() condition, but I didn't manage it, and I am not even sure it would be more efficient. Could you guys propose a way to make it faster?
Best!
Your search for neighbors is best done using a KDTree.
from scipy.spatial import cKDTree
tree = cKDTree(points)
pairs = tree.query_pairs(coorRadius)
Now pairs is a set of two-item tuples (i, j), where i < j and points[i] and points[j] are within coorRadius of each other. You can now simply iterate over these pairs, which will likely be a much smaller set than the len(points)**2 combinations you are currently iterating over:
is_maximum = [True] * len(points)
for i, j in pairs:
    if points[i][2] < points[j][2]:
        is_maximum[i] = False
    elif points[j][2] < points[i][2]:
        is_maximum[j] = False
localCoordinateMaxima, = np.nonzero(is_maximum)
This can be further sped up by vectorizing it:
pairs = np.array(list(pairs))
pairs = np.vstack((pairs, pairs[:, ::-1]))
pairs = pairs[np.argsort(pairs[:, 0])]
# True where the first point's z is at least its neighbor's z.
is_z_ge = points[pairs[:, 0], 2] >= points[pairs[:, 1], 2]
# Start index of each group of rows sharing the same first point.
bins, = np.nonzero(pairs[:-1, 0] != pairs[1:, 0])
bins = np.concatenate(([0], bins + 1))
is_maximum = np.logical_and.reduceat(is_z_ge, bins)
localCoordinateMaxima, = np.nonzero(is_maximum)
The above code assumes that every point has at least one neighbor within coorRadius. If that is not the case, you need to slightly complicate things:
pairs = np.array(list(pairs))
pairs = np.vstack((pairs, pairs[:, ::-1]))
pairs = pairs[np.argsort(pairs[:, 0])]
is_z_ge = points[pairs[:, 0], 2] >= points[pairs[:, 1], 2]
bins, = np.nonzero(pairs[:-1, 0] != pairs[1:, 0])
bins = np.concatenate(([0], bins + 1))
# Points that appear as the first item of some pair have at least one neighbor.
has_neighbors = pairs[bins, 0]
is_maximum = np.ones((len(points),), bool)
is_maximum[has_neighbors] &= np.logical_and.reduceat(is_z_ge, bins)
localCoordinateMaxima, = np.nonzero(is_maximum)
Here is the version of your code just tightened-up a bit:
for i, x in enumerate(points):
    x2 = x[2]
    for y in points:
        if linalg.norm(x - y) <= coorRadius and x2 < y[2]:
            break
    else:
        localCoordinateMaxima.append(i)
print(localCoordinateMaxima)
Changes:
Factor-out the x[2] lookup.
The j variable was unused.
Add a break for an early-out
Use a for-else construct instead of a flag variable
With numpy this is not too hard. You can do it with a single (long) expression, if you want:
import numpy as np
points = np.array(points)
localCoordinateMaxima, = np.where(
    np.all((np.linalg.norm(points[:, None] - points[None, :], axis=-1) >
            coorRadius) |
           (points[:, None, 2] >= points[None, :, 2]),
           axis=-1))
The algorithm your current code implements is essentially where(not(any(w <= x and y < z))). If you distribute the not through the logical operations inside of it (using De Morgan's laws), you can avoid one level of nesting by flipping the inequalities, getting where(all(w > x or y >= z)).
w is a matrix of norms applied to the differences of the points broadcast together. x is a constant. y and z are both arrays with the third coordinates of the points, shaped so that they broadcast together into the same shape as w.
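To make the broadcasting explicit, here is a shape-annotated breakdown of the same expression (with n = len(points)):

diffs = points[:, None, :] - points[None, :, :]     # (n, n, 3) pairwise differences
w = np.linalg.norm(diffs, axis=-1)                  # (n, n) pairwise distances
far = w > coorRadius                                # the flipped "w <= x"
z_ge = points[:, None, 2] >= points[None, :, 2]     # the flipped "y < z"
localCoordinateMaxima, = np.where(np.all(far | z_ge, axis=-1))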