I have a 2D array and I want to find for each (x, y) point the distance to its nearest neighbor as fast as possible.
I can do this using scipy.spatial.distance.cdist:
import numpy as np
from scipy.spatial.distance import cdist
# Random data
data = np.random.uniform(0., 1., (1000, 2))
# Distance between the array and itself
dists = cdist(data, data)
# Sort by distances
dists.sort()
# Select column 1: column 0 is always 0
# (the distance of each point to itself)
nn_dist = dists[:, 1]
This works, but I feel like it's too much work, and KDTree should be able to handle this, but I'm not sure how. I'm not interested in the coordinates of the nearest neighbor; I just want the distance (and to be as fast as possible).
KDTree can do this, and the process is almost the same as with cdist. However, cdist is much faster than KDTree here, and, as pointed out in the comments, cKDTree is faster still:
import numpy as np
from scipy.spatial.distance import cdist
from scipy.spatial import KDTree
from scipy.spatial import cKDTree
import timeit
# Random data
data = np.random.uniform(0., 1., (1000, 2))
def scipy_method():
    # Distance between the array and itself
    dists = cdist(data, data)
    # Sort each row by distance
    dists.sort()
    # Select column 1: column 0 is always 0
    # (the distance of each point to itself)
    nn_dist = dists[:, 1]
    return nn_dist
def KDTree_method():
    # You have to build the tree to use this method.
    tree = KDTree(data)
    # Query the two nearest neighbors: the first is the point itself.
    dists = tree.query(data, 2)
    nn_dist = dists[0][:, 1]
    return nn_dist
def cKDTree_method():
    # Same approach with the C implementation of the tree.
    tree = cKDTree(data)
    dists = tree.query(data, 2)
    nn_dist = dists[0][:, 1]
    return nn_dist
print(timeit.timeit('cKDTree_method()', number=100, globals=globals()))
print(timeit.timeit('scipy_method()', number=100, globals=globals()))
print(timeit.timeit('KDTree_method()', number=100, globals=globals()))
Output:
0.34952507635557595
7.904083715193579
20.765962179145546
Once again, thoroughly unneeded proof that C is awesome!
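A side note, under the assumption that you are on a newer SciPy (1.6 or later; worth verifying against your installed version): there, KDTree was reimplemented on top of cKDTree, so the large gap above should disappear, and query accepts a workers argument for parallel lookups. A minimal sketch:
import numpy as np
from scipy.spatial import KDTree
data = np.random.uniform(0., 1., (1000, 2))
tree = KDTree(data)
# k=2 because each point's nearest hit is the point itself;
# workers=-1 uses all available CPU cores (SciPy 1.6+).
dists, _ = tree.query(data, k=2, workers=-1)
nn_dist = dists[:, 1]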
I have two large numpy arrays for which I want to calculate the Euclidean distance using sklearn. The following MRE achieves what I want in the final result, but since my real-life arrays are large, I really want a vectorized solution as opposed to using a for loop.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
n = 3
sample_size = 5
X = np.random.randint(0, 10, size=(sample_size, n))
Y = np.random.randint(0, 10, size=(sample_size, n))
lst = []
for f in range(0, sample_size):
    ed = euclidean_distances([X[f]], [Y[f]])
    lst.append(ed[0][0])
print(lst)
euclidean_distances computes the distance for each combination of X,Y points; this will grow large in memory and is totally unnecessary if you just want the distance between each respective row. Sklearn includes a different function called paired_distances that does what you want:
from sklearn.metrics.pairwise import paired_distances
d = paired_distances(X,Y)
# array([5.83095189, 9.94987437, 7.34846923, 5.47722558, 4. ])
If you need the full pairwise distances, you can get the same result from the diagonal (as pointed out in the comments):
d = euclidean_distances(X,Y).diagonal()
Lastly: the arrays are a numpy type, so it is useful to know the numpy API itself (probably what sklearn calls under the hood). Here are two examples:
d = np.linalg.norm(X-Y, axis=1)
d = np.sqrt(np.sum((X-Y)**2, axis=1))
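As a quick sanity check (a sketch, reusing the X and Y from the question above), all three approaches should agree:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, paired_distances
d1 = paired_distances(X, Y)
d2 = euclidean_distances(X, Y).diagonal()
d3 = np.linalg.norm(X - Y, axis=1)
print(np.allclose(d1, d2) and np.allclose(d2, d3))  # True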
I have an array of data points (1000 rows, 2 columns).
Link of the data: https://mega.nz/#!MMlhWbbT!bwsu4_t98hLNX-A7IYnWipPydtWILkKxgMzXhu3ytHE
I want to calculate the distances between the points (without repeating or counting any pair twice). I am using:
import math
def D(x1,x2,y1,y2):
    return math.sqrt((x2-x1)**2+(y2-y1)**2)
x1=dt1[0][0]
x2=dt1[1][0]
y1=dt1[0][1]
y2=dt1[1][1]
print(D(x1,x2,y1,y2))
But there are 1000 points; how can I determine the distances using a for loop or something like that?
This will calculate the distance between each pair of consecutive points over the whole array:
for x in range(0, len(dt1) - 1):
    print(D(dt1[x][0], dt1[x+1][0], dt1[x][1], dt1[x+1][1]))
If you want to calculate the distance between any two points in the array without repetitions, this should do it (includes the new request by the OP):
distances = []
for x in range(0, len(dt1)):
    for y in range(x+1, len(dt1)):
        dist = D(dt1[x][0], dt1[y][0], dt1[x][1], dt1[y][1])
        distances.append(dist)
print(distances)
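If dt1 is a numpy array, the same all-pairs result can be obtained without an explicit Python loop (a sketch, assuming dt1 has shape (N, 2)):
from scipy.spatial.distance import pdist
# Condensed distance vector: every unordered pair exactly once,
# in the same order as the nested loops above.
distances = pdist(dt1)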
You can use np.linalg.norm to compute Euclidean distance:
In [1]: import numpy as np
In [2]: dt1 = np.random.rand(2, 2)
In [3]: dt1
Out[3]:
array([[0.79791459, 0.71415415],
       [0.52647092, 0.894041  ]])
In [4]: np.linalg.norm(dt1[0] - dt1[1])
Out[4]: 0.3256392880558975
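The same call also handles the consecutive-point case from the first answer in vectorized form (a sketch, assuming dt1 is an (N, 2) array):
import numpy as np
# Differences between consecutive rows, then the row-wise norm
consecutive = np.linalg.norm(np.diff(dt1, axis=0), axis=1)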
Let's say I have the following numpy matrix (simplified):
matrix = np.array([[1, 1],
                   [2, 2],
                   [5, 5],
                   [6, 6]])
And now I want to get the vector from the matrix closest to a "search" vector:
search_vec = np.array([3, 3])
What I have done is the following:
min_dist = None
result_vec = None
for ref_vec in matrix:
    distance = np.linalg.norm(search_vec - ref_vec)
    distance = abs(distance)
    print(ref_vec, distance)
    if min_dist is None or min_dist > distance:
        min_dist = distance
        result_vec = ref_vec
This works, but is there a native numpy solution that does it more efficiently?
My problem is that the bigger the matrix becomes, the slower the entire process gets.
Are there other solutions that handle this in a more elegant and efficient way?
Approach #1
We can use SciPy's Cython-powered kd-tree for quick nearest-neighbor lookup, which is very efficient in both memory and performance -
In [276]: from scipy.spatial import cKDTree
In [277]: matrix[cKDTree(matrix).query(search_vec, k=1)[1]]
Out[277]: array([2, 2])
Approach #2
With SciPy's cdist -
In [286]: from scipy.spatial.distance import cdist
In [287]: matrix[cdist(matrix, np.atleast_2d(search_vec)).argmin()]
Out[287]: array([2, 2])
Approach #3
With Scikit-learn's Nearest Neighbors -
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=1).fit(matrix)
closest_vec = matrix[nbrs.kneighbors(np.atleast_2d(search_vec))[1][0,0]]
Approach #4
With Scikit-learn's KDTree -
from sklearn.neighbors import KDTree
kdt = KDTree(matrix, metric='euclidean')
cv = matrix[kdt.query(np.atleast_2d(search_vec), k=1, return_distance=False)[0,0]]
Approach #5
Following the eucl_dist package (disclaimer: I am its author) and the wiki contents, we could leverage matrix multiplication -
M = matrix.dot(search_vec)
d = np.einsum('ij,ij->i',matrix,matrix) + np.inner(search_vec,search_vec) -2*M
closest_vec = matrix[d.argmin()]
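Note that d in Approach #5 is the squared Euclidean distance (the expansion |x|^2 - 2*x.s + |s|^2), which is fine for argmin since squaring preserves the ordering of non-negative distances. A quick sanity check of the expansion, reusing matrix and search_vec from the question:
import numpy as np
d = (np.einsum('ij,ij->i', matrix, matrix)
     + np.inner(search_vec, search_vec) - 2*matrix.dot(search_vec))
print(np.allclose(np.sqrt(d), np.linalg.norm(matrix - search_vec, axis=1)))  # True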
I have a list of feature vectors, and I would like to compute the L2 distance of each feature vector to all other feature vectors, as a uniqueness measure. Here, min_distances[i] gives the minimum L2 distance from the i-th feature vector to every other feature vector.
import numpy as np
# Generate data
nrows = 2000
feature_length = 128
feature_vecs = np.random.rand(nrows, feature_length)
# Calculate min L2 norm from each feature vector
# to all other feature vectors
min_distances = np.zeros(nrows)
indices = np.arange(nrows)
for i in indices:
    min_distances[i] = np.min(np.linalg.norm(
        feature_vecs[i != indices] - feature_vecs[i],
        axis=1))
Being O(n^2), it's painfully slow, and I would like to optimize it. Can I get rid of the for loop / vectorize this so that min and linalg.norm are called only once?
Approach #1
Here's one with pdist and squareform -
from scipy.spatial.distance import pdist, squareform
d = squareform(pdist(feature_vecs))
np.fill_diagonal(d,np.nan)
min_distances = np.nanmin(d,axis=0)
Approach #2
Another with cKDTree -
from scipy.spatial import cKDTree
min_distances = cKDTree(feature_vecs).query(feature_vecs, k=2)[0][:,1]
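A quick sanity check that the two approaches agree (a sketch, reusing d from Approach #1 and feature_vecs from the question):
import numpy as np
d1 = np.nanmin(d, axis=0)
d2 = cKDTree(feature_vecs).query(feature_vecs, k=2)[0][:, 1]
print(np.allclose(d1, d2))  # True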
In Python, is there an efficient, vectorized way to calculate the cosine distance of a sparse array u to a sparse matrix v, resulting in an array of elements [1, 2, ..., n] corresponding to cosine(u,v[0]), cosine(u,v[1]), ..., cosine(u,v[n])?
Not natively. However, you can use the scipy library, which can compute the cosine distance between two vectors for you: http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.cosine.html. You can build a version that takes a matrix using this as a stepping stone, as sketched below.
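A minimal sketch of that stepping-stone version, assuming u and v are dense numpy arrays (scipy's cosine works on 1-D arrays, so sparse rows would need converting with .toarray() first):
import numpy as np
from scipy.spatial.distance import cosine
# One scipy call per row of v; simple, but still a Python-level loop.
distances = np.array([cosine(u, row) for row in v])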
Add the vector onto the end of the matrix, calculate a pairwise distance matrix using sklearn.metrics.pairwise_distances() and then extract the relevant column/row.
So for vector v (with shape (D,)) and matrix m (with shape (N,D)) do:
import numpy as np
from sklearn.metrics import pairwise_distances
new_m = np.concatenate([m, v[None,:]], axis=0)
distance_matrix = pairwise_distances(new_m, metric="cosine")
distances = distance_matrix[-1,:-1]
Not ideal, but better than iterating!
This method can be extended if you are querying more than one vector: concatenate a whole block of query vectors onto the matrix instead, as in the sketch below.
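A sketch of that extension, where vs is a hypothetical (K, D) array of query vectors (the name is illustrative):
# Stack all K query vectors under the matrix in one go
new_m = np.concatenate([m, vs], axis=0)
distance_matrix = pairwise_distances(new_m, metric="cosine")
# (K, N) block: each query row against the original N rows
distances = distance_matrix[-len(vs):, :-len(vs)]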
I think there is a way using the definition and the numpy library:
Definition: cosine(u, v) = 1 - (u . v) / (|u| * |v|)
import numpy as np
#just creating random data
u = np.random.random(100)
v = np.random.random((100,100))
#dot product: for every row in v, multiply u and sum the elements
u_dot_v = np.sum(u*v,axis = 1)
#find the norm of u and each row of v
mod_u = np.sqrt(np.sum(u*u))
mod_v = np.sqrt(np.sum(v*v,axis = 1))
#just apply the definition
final = 1 - u_dot_v/(mod_u*mod_v)
#verify with the cosine function from scipy
from scipy.spatial.distance import cosine
final2 = np.array([cosine(u,i) for i in v])
I found the definition of cosine distance here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine
Use scipy.spatial.distance.cosine():
http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.cosine.html
The below worked for me; you have to provide the correct signature:
import numpy as np
from scipy.spatial.distance import cosine
def cosine_distances(embedding_matrix, extracted_embedding):
    return cosine(embedding_matrix, extracted_embedding)
# np.vectorize broadcasts the scalar function over the rows of the matrix
cosine_distances = np.vectorize(cosine_distances, signature='(m),(d)->()')
cosine_distances(corpus_embeddings, extracted_embedding)
In my case, corpus_embeddings is a (10000, 128) matrix and extracted_embedding is a 128-dimensional vector.
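Note that np.vectorize still loops in Python under the hood, so for large matrices a fully vectorized routine should be faster; sklearn ships one (a sketch, assuming dense inputs; the import is aliased to avoid clashing with the function defined above):
from sklearn.metrics.pairwise import cosine_distances as sk_cosine_distances
# (10000, 1) result against the single query vector; ravel flattens it
d = sk_cosine_distances(corpus_embeddings, extracted_embedding.reshape(1, -1)).ravel()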