So I'm running a KNN in order to create clusters. From each cluster, I would like to obtain the medoid of the cluster.
I'm employing a fractional distance metric in order to calculate distances:
dist_f(x, y) = ( sum_{i=1}^{d} |x^i - y^i|^f )^(1/f)
where d is the number of dimensions, the first data point's coordinates are x^i, the second data point's coordinates are y^i, and f is an arbitrary number between 0 and 1.
I would then calculate the medoid as:
medoid = argmin_{y in S} sum_{x in S} delta(x, y)
where S is the set of datapoints, and δ is the absolute value of the distance metric used above.
I've looked online to no avail trying to find implementations of the medoid (even with other distance metrics), but most things were specifically k-means or k-medoids, which [I think] is relatively different from what I want.
Essentially this boils down to me being unable to translate the math into effective programming. Any help or pointers in the right direction would be much appreciated! Here's a short list of what I have so far:
I have figured out how to calculate the fractional distance metric (the first equation) so I think I'm good there.
I know numpy has an argmin() function (documented here).
Extra points for increased efficiency without loss of accuracy (I'm trying not to brute force by calculating every single fractional distance, because the number of point pairs grows quadratically with the dataset size...).
1. compute the pairwise distance matrix
2. compute the column or row sums
3. argmin to find the medoid index
i.e. numpy.argmin(distMatrix.sum(axis=0)) or similar; a sketch follows below.
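A minimal sketch of that recipe, assuming some pairwise distance function dist(p, q) (for example the fractional metric described in the question); the helper name medoid_index is mine, not from the original post:

import numpy as np

def medoid_index(points, dist):
    """Index of the medoid of `points` (an n x d array) under the distance `dist`."""
    n = len(points)
    dist_matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = dist(points[i], points[j])
            dist_matrix[i, j] = dist_matrix[j, i] = d_ij
    # the medoid minimizes the total distance to all other points
    return np.argmin(dist_matrix.sum(axis=0))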
So I've accepted the answer here, but I thought I'd provide my implementation in case anyone else is trying to do something similar:
(1) This is the distance function:
import numpy as np

def fractional(p_coord_array, q_coord_array):
# f is an arbitrary value, but must be greater than zero and
# less than one. In this case, I used 3/10. I took advantage
# of the difference of cubes in this case, so that I wouldn't
# encounter an overflow error.
a = np.sum(np.array(p_coord_array, dtype=np.float64))
b = np.sum(np.array(q_coord_array, dtype=np.float64))
a2 = np.sum(np.power(p_coord_array, 2))
ab = np.sum(p_coord_array) * np.sum(q_coord_array)
b2 = np.sum(np.power(q_coord_array, 2))
diffab = a - b
suma2abb2 = a2 + ab + b2
temp_dist = abs(diffab * suma2abb2)
temp_dist = np.power(temp_dist, 1./10)
dist = np.power(temp_dist, 10./3)
return dist
(2) The medoid function (if the length of the dataset was less than 6000 [if greater than that, I ran into overflow errors... I'm still working on that bit to be perfectly honest...]):
def medoid(dataset):
    w = len(dataset)
    if w < 6000:
        h = len(dataset)
        dist_matrix = [[0 for x in range(w)] for y in range(h)]
        list_combinations = [(counter_1, counter_2, data_1, data_2)
                             for counter_1, data_1 in enumerate(dataset)
                             for counter_2, data_2 in enumerate(dataset)
                             if counter_1 < counter_2]
        for pair in list_combinations:
            temp_dist = fractional(pair[2], pair[3])
            dist_matrix[pair[0]][pair[1]] = abs(temp_dist)
            dist_matrix[pair[1]][pair[0]] = abs(temp_dist)
        # the medoid is the point whose total distance to all other points is smallest
        return dataset[np.argmin(np.array(dist_matrix).sum(axis=0))]
Any questions, feel free to comment!
If you don't mind using brute force this might help:
import numpy as np

def calc_medoid(X, Y, f=2):
    # f is the order of the norm passed to np.linalg.norm (f=2 gives the Euclidean distance)
n = len(X)
m = len(Y)
dist_mat = np.zeros((m, n))
# compute distance matrix
for j in range(n):
center = X[j, :]
for i in range(m):
if i != j:
dist_mat[i, j] = np.linalg.norm(Y[i, :] - center, ord=f)
medoid_id = np.argmin(dist_mat.sum(axis=0)) # sum over y
return medoid_id, X[medoid_id, :]
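A hypothetical usage for a single cluster would pass the same array twice (the cluster's points are both the candidates and the reference set); the example points below are made up:

import numpy as np

pts = np.array([[0.0, 1.0], [1.0, 3.0], [4.0, 2.0], [3.0, 1.5]])  # made-up cluster
idx, center = calc_medoid(pts, pts)   # f=2, i.e. Euclidean distance
print(idx, center)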
Here is an example of computing a medoid for a single cluster with Euclidean distance.
import numpy as np, pandas as pd, matplotlib.pyplot as plt
a, b, c, d = np.array([0,1]), np.array([1, 3]), np.array([4,2]), np.array([3, 1.5])
vCentroid = np.mean([a, b, c, d], axis=0)
def GetMedoid(vX):
vMean = np.mean(vX, axis=0) # compute centroid
return vX[np.argmin([sum((x - vMean)**2) for x in vX])] # pick a point closest to centroid
vMedoid = GetMedoid([a, b, c, d])
print(f'centroid = {vCentroid}')
print(f'medoid = {vMedoid}')
df = pd.DataFrame([a, b, c, d], columns=['x', 'y'])
ax = df.plot.scatter('x', 'y', grid=True, title='Centroid in 2D plane', s=100);
plt.plot(vCentroid[0], vCentroid[1], 'ro', ms=10); # plot centroid as red circle
plt.plot(vMedoid[0], vMedoid[1], 'rx', ms=20); # plot medoid as red x
You can also use the following package to compute the medoid for one or more clusters:
!pip -q install scikit-learn-extra > log
from sklearn_extra.cluster import KMedoids
GetMedoid = lambda vX: KMedoids(n_clusters=1).fit(vX).cluster_centers_
GetMedoid([a, b, c, d])[0]
I would say that you just need to compute the median.
np.median(np.asarray(points), axis=0)
Your median is the point with the greatest centrality. Note, though, that the coordinate-wise median is generally not one of the original data points, so it only approximates the medoid.
Note: if you are using distances other than Euclidean, this doesn't hold.
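If you do need an actual member of the dataset (which is what a medoid is), one rough sketch is to snap the coordinate-wise median back to the nearest point; here points is the same n x d collection used above:

import numpy as np

pts = np.asarray(points)                       # n x d array of the cluster's points
center = np.median(pts, axis=0)                # coordinate-wise median
medoid = pts[np.argmin(np.linalg.norm(pts - center, axis=1))]  # nearest actual data point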
Related
Consider the following two sets of points. I would like to find the optimal 2D translation and rotation that aligns the largest number of points between dataset blue and dataset orange, where a point is considered aligned if the distance to its nearest neighbor in the other dataset is smaller than a threshold.
I understand that this is related to "Iterative Closest Point" algorithms, but in this case the situation is a bit harder because not all points from one dataset are in the other, and also because some points may turn out to be "false positives" (noise).
Is there an efficient way of doing this?
I came across the same problem when comparing CCD star observation figures and found a solution: the basic idea is to find the best match between the triangles of the two sets of points.
I then use the astroalign package to calculate the transformation matrix and align all the points. Thank the Lord, it works pretty well.
import itertools
import numpy as np
import matplotlib.pyplot as plt
import astroalign as aa
def getTriangles(set_X, X_combs):
"""
Inefficient way of obtaining the lengths of each triangle's side.
Normalized so that the minimum length is 1.
"""
triang = []
for p0, p1, p2 in X_combs:
d1 = np.sqrt((set_X[p0][0] - set_X[p1][0]) ** 2 +
(set_X[p0][1] - set_X[p1][1]) ** 2)
d2 = np.sqrt((set_X[p0][0] - set_X[p2][0]) ** 2 +
(set_X[p0][1] - set_X[p2][1]) ** 2)
d3 = np.sqrt((set_X[p1][0] - set_X[p2][0]) ** 2 +
(set_X[p1][1] - set_X[p2][1]) ** 2)
d_min = min(d1, d2, d3)
d_unsort = [d1 / d_min, d2 / d_min, d3 / d_min]
triang.append(sorted(d_unsort))
return triang
def sumTriangles(ref_triang, in_triang):
"""
For each normalized triangle in ref_triang, compare it with each normalized
triangle in in_triang. Find the differences between their sides, sum their
absolute values, and select the two triangles with the smallest sum of absolute differences.
"""
tr_sum, tr_idx = [], []
for i, ref_tr in enumerate(ref_triang):
for j, in_tr in enumerate(in_triang):
# Absolute value of lengths differences.
tr_diff = abs(np.array(ref_tr) - np.array(in_tr))
# Sum the differences
tr_sum.append(sum(tr_diff))
tr_idx.append([i, j])
# Index of the triangles in ref and in with the smallest sum of absolute
# length differences.
tr_idx_min = tr_idx[tr_sum.index(min(tr_sum))]
ref_idx, in_idx = tr_idx_min[0], tr_idx_min[1]
print("Smallest difference: {}".format(min(tr_sum)))
return ref_idx, in_idx
set_ref = np.array([[2511.268821,44.864124],
[2374.085032,201.922566],
[1619.282942,216.089335],
[1655.866502,221.127787],
[ 804.171659,2133.549517], ])
set_in = np.array([[1992.438563,63.727282],
[2285.793346,255.402548],
[1568.915358, 279.144544],
[1509.720134, 289.434629],
[1914.255205, 349.477788],
[2370.786382, 496.026836],
[ 482.702882, 508.685952],
[2089.691026, 523.18825 ],
[ 216.827439, 561.807396],
[ 614.874621, 2007.304727],
[1286.639124, 2155.264827],
[ 729.566116, 2190.982364]])
# All possible triangles.
ref_combs = list(itertools.combinations(range(len(set_ref)), 3))
in_combs = list(itertools.combinations(range(len(set_in)), 3))
# Obtain normalized triangles.
ref_triang, in_triang = getTriangles(set_ref, ref_combs), getTriangles(set_in, in_combs)
# Index of the ref and in triangles with the smallest difference.
ref_idx, in_idx = sumTriangles(ref_triang, in_triang)
# Indexes of points in ref and in of the best match triangles.
ref_idx_pts, in_idx_pts = ref_combs[ref_idx], in_combs[in_idx]
print ('triangle ref %s matches triangle in %s' % (ref_idx_pts, in_idx_pts))
print ("ref:", [set_ref[_] for _ in ref_idx_pts])
print ("input:", [set_in[_] for _ in in_idx_pts])
ref_pts = np.array([set_ref[_] for _ in ref_idx_pts])
in_pts = np.array([set_in[_] for _ in in_idx_pts])
transf, (in_list,ref_list) = aa.find_transform(in_pts, ref_pts)
transf_in = transf(set_in)
print(f'transformation matrix: {transf}')
plt.scatter(set_ref[:,0],set_ref[:,1], s=100,marker='.', c='r',label='Reference')
plt.scatter(set_in[:,0],set_in[:,1], s=100,marker='.', c='b',label='Input')
plt.scatter(transf_in[:,0],transf_in[:,1], s=100,marker='+', c='b',label='Input Aligned')
plt.plot(ref_pts[:,0],ref_pts[:,1], c='r')
plt.plot(in_pts[:,0],in_pts[:,1], c='b')
plt.legend()
plt.tight_layout()
plt.savefig( 'align_coordinates.png', format = 'png')
plt.show()
I'm trying to calculate the weighted topological overlap for an adjacency matrix but I cannot figure out how to do it correctly using numpy. The R function that does the correct implementation is from WGCNA (https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity). The formula for computing this (I THINK) is detailed in equation 4, which I believe is correctly reproduced below:
w_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij = sum_u a_iu * a_uj and k_i = sum_u a_iu
Does anyone know how to implement this correctly so it reflects the WGCNA version?
Yes, I know about rpy2 but I'm trying to go lightweight on this if possible.
For starters, my diagonal is not 1, and the values show no consistent offset from the original (e.g. they are not all off by the same amount).
When I computed this in R, I used the following:
> library(WGCNA, quiet=TRUE)
> df_adj = read.csv("https://pastebin.com/raw/sbAZQsE6", row.names=1, header=TRUE, check.names=FALSE, sep="\t")
> df_tom = TOMsimilarity(as.matrix(df_adj), TOMType="unsigned", TOMDenom="min")
# ..connectivity..
# ..matrix multiplication (system BLAS)..
# ..normalization..
# ..done.
# I've uploaded it to this url: https://pastebin.com/raw/HT2gBaZC
I'm not sure where my code is incorrect. The source code for the R version is here, but it uses C backend scripts, which are very difficult for me to interpret.
Here is my implementation in Python:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
def get_iris_data():
iris = load_iris()
# Iris dataset
X = pd.DataFrame(iris.data,
index = [*map(lambda x:f"iris_{x}", range(150))],
columns = [*map(lambda x: x.split(" (cm)")[0].replace(" ","_"), iris.feature_names)])
y = pd.Series(iris.target,
index = X.index,
name = "Species")
return X, y
# Get data
X, y = get_iris_data()
# Create an adjacency network
# df_adj = np.abs(X.T.corr()) # I've uploaded this part to this url: https://pastebin.com/raw/sbAZQsE6
df_adj = pd.read_csv("https://pastebin.com/raw/sbAZQsE6", sep="\t", index_col=0)
A_adj = df_adj.values
# Correct TOM from WGCNA for the A_adj
# See above for code
# https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity
df_tom__wgcna = pd.read_csv("https://pastebin.com/raw/HT2gBaZC", sep="\t", index_col=0)
# My attempt
A = A_adj.copy()
dimensions = A.shape
assert dimensions[0] == dimensions[1]
d = dimensions[0]
# np.fill_diagonal(A, 0)
# Equation (4) from http://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf
A_tom = np.zeros_like(A)
for i in range(d):
a_iu = A[i]
k_i = a_iu.sum()
for j in range(i+1, d):
a_ju = A[:,j]
k_j = a_ju.sum()
l_ij = np.dot(a_iu, a_ju)
a_ij = A[i,j]
numerator = l_ij + a_ij
denominator = min(k_i, k_j) + 1 - a_ij
w_ij = numerator/denominator
A_tom[i,j] = w_ij
A_tom = (A_tom + A_tom.T)
There is a package called GTOM (https://github.com/benmaier/gtom), but it is not for weighted adjacencies. The author of GTOM also took a look at this problem (with a much more sophisticated/efficient NumPy implementation, but it is still not producing the expected results).
Does anyone know how to reproduce the WGCNA implementation?
EDIT: 2019.06.20
I've adapted some of the code from @scleronomic and @benmaier with credits in the doc string. The function is available in soothsayer from v2016.06 on. Hopefully this will make it easier for people to use topological overlap in Python instead of only being able to use R.
https://github.com/jolespin/soothsayer/blob/master/soothsayer/networks/networks.py
import numpy as np
import soothsayer as sy
df_adj = sy.io.read_dataframe("https://pastebin.com/raw/sbAZQsE6")
df_tom = sy.networks.topological_overlap_measure(df_adj)
df_tom__wgcna = sy.io.read_dataframe("https://pastebin.com/raw/HT2gBaZC")
np.allclose(df_tom, df_tom__wgcna)
# True
First let's look at the parts of the equation for the case of a binary adjacency matrix a_ij:
a_ij: indicates if node i is connected to node j
k_i: count of the neighbors of node i (connectivity)
l_ij: count of the common neighbors of node i and node j
so w_ij measures how many of the neighbors of the node with the lower connectivity are also neighbors of the other node (i.e. w_ij measures "their relative inter-connectedness").
My guess is that they define the diagonal of A to be zero instead of one.
With this assumption I can reproduce the values of WGCNA.
A[range(d), range(d)] = 0 # Assumption
L = A @ A  # Could be done smarter by using the symmetry
K = A.sum(axis=1)
A_tom = np.zeros_like(A)
for i in range(d):
for j in range(i+1, d):
numerator = L[i, j] + A[i, j]
denominator = min(K[i], K[j]) + 1 - A[i, j]
A_tom[i, j] = numerator / denominator
A_tom += A_tom.T
A_tom[range(d), range(d)] = 1 # Set diagonal to 1 by default
A_tom__wgcna = np.array(pd.read_csv("https://pastebin.com/raw/HT2gBaZC",
sep="\t", index_col=0))
print(np.allclose(A_tom, A_tom__wgcna))
An intuition why the diagonal of A should be zero instead of one can be seen for a simple example with a binary A:
Graph             Case Zero          Case One
    B               A B C D            A B C D
   / \           A  0 1 1 1         A  1 1 1 1
  A-----D        B  1 0 0 1         B  1 1 0 1
   \ /           C  1 0 0 1         C  1 0 1 1
    C            D  1 1 1 0         D  1 1 1 1
The given description of equation 4 explains:
Note that w_ij = 1 if the node with fewer connections satisfies two conditions:
(a) all of its neighbors are also neighbors of the other node and
(b) it is connected to the other node.
In contrast, w_ij = 0 if i and j are un-connected and the two nodes do not share any neighbors.
So the connection between A-D should fulfill this criterion and be w_14=1.
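A quick numeric sketch checking the A-D pair for Case Zero (using the zero-diagonal matrix from the example above):

import numpy as np

A = np.array([[0, 1, 1, 1],   # A
              [1, 0, 0, 1],   # B
              [1, 0, 0, 1],   # C
              [1, 1, 1, 0]])  # D
L = A @ A                     # L[i, j] = number of shared neighbors
k = A.sum(axis=1)             # connectivity of each node
i, j = 0, 3                   # nodes A and D
w_AD = (L[i, j] + A[i, j]) / (min(k[i], k[j]) + 1 - A[i, j])
print(w_AD)  # 1.0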
[Figures omitted: the resulting TOM matrices for the Case Zero and Case One diagonals.]
What is still missing when applying the formula is that the diagonal values don't match. I set them to one by default. What is the inter-connectedness of a node with itself anyway? A value different than one (or zero, depending on definition) doesn't make sense to me.
Neither Case Zero nor Case One result in w_ii=1 in the simple example.
In Case Zero it would be necessary that k_i + 1 == l_ii, and in Case One that k_i == l_ii + 1, both of which seem wrong to me.
So to summarize I would set the diagonal of the adjacency matrix to zero, use the given equation and set the diagonal of the result to one by default.
Given an adjacency matrix A, it's possible to calculate the TOM matrix W without using for loops, which speeds up the process tremendously:
L = np.dot(A, A)                                       # shared-neighbor counts l_ij (assumes diag(A) == 0, as discussed above)
k = np.sum(A, axis=0)                                  # connectivity k_i
d = len(k)
tile = np.tile(k, (d, 1))
K = np.min(np.stack((tile, tile.T), axis=2), axis=2)   # K[i, j] = min(k_i, k_j)
W = (L + A) / (K + 1 - A)                              # equation (4)
np.fill_diagonal(W, 1)                                 # diagonal set to 1 by convention
I have two arrays of x-y coordinates, and I would like to find the minimum Euclidean distance between each point in one array with all the points in the other array. The arrays are not necessarily the same size. For example:
xy1=numpy.array(
[[ 243, 3173],
[ 525, 2997]])
xy2=numpy.array(
[[ 682, 2644],
[ 277, 2651],
[ 396, 2640]])
My current method loops through each coordinate xy in xy1 and calculates the distances between that coordinate and the other coordinates.
mindist=numpy.zeros(len(xy1))
minid=numpy.zeros(len(xy1))
for i,xy in enumerate(xy1):
dists=numpy.sqrt(numpy.sum((xy-xy2)**2,axis=1))
mindist[i],minid[i]=dists.min(),dists.argmin()
Is there a way to eliminate the for loop and somehow do element-by-element calculations between the two arrays? I envision generating a distance matrix for which I could find the minimum element in each row or column.
Another way to look at the problem: say I concatenate xy1 (length m) and xy2 (length p) into xy (length n), and I store the lengths of the original arrays. Theoretically, I should then be able to generate an n x n distance matrix from those coordinates, from which I can grab an m x p submatrix. Is there a way to efficiently generate this submatrix?
(Months later)
scipy.spatial.distance.cdist( X, Y )
gives all pairs of distances,
for X and Y 2 dim, 3 dim ...
It also does 22 different norms, detailed here.
# cdist example: (nx,dim) (ny,dim) -> (nx,ny)
from __future__ import division
import sys
import numpy as np
from scipy.spatial.distance import cdist
#...............................................................................
dim = 10
nx = 1000
ny = 100
metric = "euclidean"
seed = 1
# change these params in sh or ipython: run this.py dim=3 ...
for arg in sys.argv[1:]:
exec( arg )
np.random.seed(seed)
np.set_printoptions( 2, threshold=100, edgeitems=10, suppress=True )
title = "%s dim %d nx %d ny %d metric %s" % (
__file__, dim, nx, ny, metric )
print("\n", title)
#...............................................................................
X = np.random.uniform( 0, 1, size=(nx,dim) )
Y = np.random.uniform( 0, 1, size=(ny,dim) )
dist = cdist( X, Y, metric=metric ) # -> (nx, ny) distances
#...............................................................................
print("scipy.spatial.distance.cdist: X %s Y %s -> %s" % (
    X.shape, Y.shape, dist.shape ))
print("dist average %.3g +- %.2g" % (dist.mean(), dist.std()))
print("check: dist[0,3] %.3g == cdist( [X[0]], [Y[3]] ) %.3g" % (
    dist[0,3], cdist( [X[0]], [Y[3]] )))
# (trivia: how do pairwise distances between uniform-random points in the unit cube
# depend on the metric ? With the right scaling, not much at all:
# L1 / dim ~ .33 +- .2/sqrt dim
# L2 / sqrt dim ~ .4 +- .2/sqrt dim
# Lmax / 2 ~ .4 +- .2/sqrt dim
To compute the m by p matrix of distances, this should work:
>>> def distances(xy1, xy2):
... d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
... d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
... return numpy.hypot(d0, d1)
the .outer calls make two such matrices (of scalar differences along the two axes), and the .hypot call turns those into a same-shape matrix (of scalar Euclidean distances).
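For the xy1 and xy2 arrays from the question, a usage sketch would be:

dists = distances(xy1, xy2)      # shape (2, 3): all pairwise distances
mindist = dists.min(axis=1)      # closest distance for each point in xy1
minid = dists.argmin(axis=1)     # index of the closest point in xy2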
The accepted answer does not fully address the question, which requests to find the minimum distance between the two sets of points, not the distance between every point in the two sets.
Although a straightforward solution to the original question indeed consists of computing the distance between every pair and subsequently finding the minimum one, this is not necessary if one is only interested in the minimum distances. A much faster solution exists for the latter problem.
All the proposed solutions have a running time that scales as m*p = len(xy1)*len(xy2). This is OK for small datasets, but an optimal solution can be written that scales as m*log(p), producing huge savings for large xy2 datasets.
This optimal execution time scaling can be achieved using scipy.spatial.KDTree as follows
import numpy as np
from scipy import spatial
xy1 = np.array(
[[243, 3173],
[525, 2997]])
xy2 = np.array(
[[682, 2644],
[277, 2651],
[396, 2640]])
# This solution is optimal when xy2 is very large
tree = spatial.KDTree(xy2)
mindist, minid = tree.query(xy1)
print(mindist)
# This solution by #denis is OK for small xy2
mindist = np.min(spatial.distance.cdist(xy1, xy2), axis=1)
print(mindist)
where mindist is the minimum distance between each point in xy1 and the set of points in xy2
For what you're trying to do:
dists = numpy.sqrt((xy1[:, 0, numpy.newaxis] - xy2[:, 0])**2 + (xy1[:, 1, numpy.newaxis] - xy2[:, 1])**2)
mindist = numpy.min(dists, axis=1)
minid = numpy.argmin(dists, axis=1)
Edit: Instead of calling sqrt, doing squares, etc., you can use numpy.hypot:
dists = numpy.hypot(xy1[:, 0, numpy.newaxis]-xy2[:, 0], xy1[:, 1, numpy.newaxis]-xy2[:, 1])
import numpy as np
# squared-distance expansion: |x - y|^2 = |x|^2 + |y|^2 - 2*x.y
P = np.add.outer(np.sum(xy1**2, axis=1), np.sum(xy2**2, axis=1))
N = np.dot(xy1, xy2.T)
dists = np.sqrt(P - 2*N)
I think the following function also works.
import numpy as np
from typing import Optional
def pairwise_dist(X: np.ndarray, Y: Optional[np.ndarray] = None) -> np.ndarray:
Y = X if Y is None else Y
xx = (X ** 2).sum(axis = 1)[:, None]
yy = (Y ** 2).sum(axis = 1)[:, None]
return xx + yy.T - 2 * (X @ Y.T)
Explanation
Suppose each row of X and Y is the coordinate vector of a point, so their shapes are m X d and n X d respectively.
The result is a numpy array of size m X n with the (i, j)-th entry being the squared distance between the i-th row of X and the j-th row of Y (take np.sqrt if you need actual distances).
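A usage sketch with the xy1 and xy2 arrays from the question (taking a square root, and clamping tiny negative values that floating-point error can produce):

sq = pairwise_dist(xy1.astype(float), xy2.astype(float))  # squared distances, shape (2, 3)
dists = np.sqrt(np.maximum(sq, 0.0))
mindist = dists.min(axis=1)     # minimum distance for each point in xy1
minid = dists.argmin(axis=1)    # index of the closest point in xy2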
I highly recommend using built-in methods for calculating squares and roots, since they are optimized and very safe against overflows.
@alex's answer below is the safest in terms of overflow and should also be very fast. Also, for single points you can use math.hypot, which (since Python 3.8) supports more than 2 dimensions.
>>> def distances(xy1, xy2):
... d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
... d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
... return numpy.hypot(d0, d1)
Safety concerns
import math
import numpy as np

i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)     # works for any number of dimensions (np.hypot only takes 2 arguments)
# 1.7320508075688773e+200
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square
I think that the most straightforward and efficient solution is to do it like this:
import numpy as np

# pairwise Euclidean distances via broadcasting: shape (len(xy1), len(xy2))
distances = np.linalg.norm(xy1[:, np.newaxis, :] - xy2[np.newaxis, :, :], axis=2)
min_dist = np.min(distances, axis=1)    # the minimum distance for each point in xy1
min_id = np.argmin(distances, axis=1)   # index of the closest point in xy2
Although many answers here are great, there is another way which has not been mentioned here: using numpy's vectorization / broadcasting properties to compute the distance between each point of two arrays of different lengths (and, if wanted, the closest matches). I publish it here because it can be very handy to master broadcasting, and it also solves this problem elegantly while remaining very efficient.
Assuming you have two arrays like so:
# two arrays of different length, but with the same dimension
a = np.random.randn(6,2)
b = np.random.randn(4,2)
You can't do the operation a-b: numpy complains with operands could not be broadcast together with shapes (6,2) (4,2). The trick to allow broadcasting is to manually add a dimension for numpy to broadcast along to. By leaving the dimension 2 in both reshaped arrays, numpy knows that it must perform the operation over this dimension.
deltas = a.reshape(6, 1, 2) - b.reshape(1, 4, 2)
# contains the squared distance between each pair of points
distance_matrix = (deltas ** 2).sum(axis=2)
The distance_matrix has shape (6, 4): for each point in a, the (squared) distances to all points in b are computed. Then, if you want the "minimum Euclidean distance between each point in one array with all the points in the other array", you would do:
distance_matrix.argmin(axis=1)
This returns the index of the point in b that is closest to each point of a.
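To also recover the minimum distances themselves (a sketch; remember that distance_matrix holds squared distances):

closest_idx = distance_matrix.argmin(axis=1)          # index in b of the closest point for each point in a
closest_dist = np.sqrt(distance_matrix.min(axis=1))   # the corresponding Euclidean distances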
I read one paper and they mentioned:
flipped data with 50% chance along x-axis.
The input data is 40x40x24. How can I implement this? I am trying the code below using Python 2.7, but I am not sure what "50% chance" means.
data_flip = np.flipud(data)
data_flip = data[:, ::-1, :]
First, in order to choose each of n elements with probability p you can simply use np.random.rand(n) < p. r = np.random.rand() generates a number from a uniform distribution over [0, 1), so the probability that r is smaller than some constant p (where p is in [0, 1]) is exactly p. This probability is the CDF of the distribution, which in this case, with a=0 and b=1, is:
F(p) = 0,  p < 0
       p,  0 <= p <= 1
       1,  p > 1
Second, to flip the data along the x axis use np.fliplr rather than np.flipud (which flips along the y axis):
# generate a 3D array size 3x3x5
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
A = np.tile( np.expand_dims(A, axis=2), (1,1,5) )
# index the 3rd axis with probability 0.5
p = 0.5
idxs = np.random.rand(A.shape[2]) < p
# flip left-right the chosen arrays in the 3rd dimension
A[:,:,idxs] = np.fliplr(A[:,:,idxs])
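Note that the paper might also mean flipping the entire 40x40x24 volume with probability 0.5, which is a common reading for data augmentation; a minimal sketch of that interpretation, assuming data holds one sample:

if np.random.rand() < 0.5:      # a single coin flip for the whole sample
    data = data[:, ::-1, :]     # flip along the x axis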
I implemented a basic nearest-neighbors search in my study work.
The fact is that the basic numpy implementation works well, but just adding the '@jit' decorator (compiling with Numba), the outputs are different (it duplicates some neighbors at the end for some unknown reason...).
Here is the basic algorithm:
import numpy as np
from numba import jit
@jit(nopython=True)
def knn(p, points, k):
'''Find the k nearest neighbors (brute force) of the point p
in the list points (each row is a point)'''
n = p.size # Length (dimension) of the point p
M = points.shape[0] # Number of points
neighbors = np.zeros((k,n))
distances = 1e6*np.ones(k)
for i in xrange(M):
d = 0
pt = points[i, :] # Point to compare
for r in xrange(n): # For each coordinate
aux = p[r] - pt[r]
d += aux * aux
if d < distances[k-1]: # We find a new neighbor
pos = k-1
while pos>0 and d<distances[pos-1]: # Find the position
pos -= 1
pt = points[i, :]
# Insert neighbor and distance:
neighbors[pos+1:, :] = neighbors[pos:-1, :]
neighbors[pos, :] = pt
distances[pos+1:] = distances[pos:-1]
distances[pos] = d
return neighbors, distances
For testing:
p = np.random.rand(10)
points = np.random.rand(250, 10)
k = 5
neighbors, distances = knn(p, points, k)
WITHOUT the @jit decorator, one gets the correct answer:
In [1]: distances
Out[1]: array([ 0.3933974 , 0.44754336, 0.54548715, 0.55619749, 0.5657846 ])
But the Numba compilation gives weird outputs:
In [2]: distances
Out[2]: array([ 0.3933974 , 0.44754336, 0.54548715, 0.54548715, 0.54548715])
Can somebody help? I don't understand why this happens...
Thank you.
I believe the issue is that Numba handles writing one slice into another differently from NumPy when those slices overlap. I'm not familiar with the internals of numpy, but perhaps there is special logic there to handle overlapping memory operations like this that isn't present in Numba. Change the following lines and the results with the @jit decorator become consistent with the plain Python version:
neighbors[pos+1:, :] = neighbors[pos:-1, :].copy()
...
distances[pos+1:] = distances[pos:-1].copy()
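For context, a sketch of the insertion block from the question with those two .copy() calls applied:

# Insert neighbor and distance (copying avoids writing overlapping slices in place):
neighbors[pos+1:, :] = neighbors[pos:-1, :].copy()
neighbors[pos, :] = pt
distances[pos+1:] = distances[pos:-1].copy()
distances[pos] = d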