I have an array of data (1000 rows, 2 columns).
Link of the data: https://mega.nz/#!MMlhWbbT!bwsu4_t98hLNX-A7IYnWipPydtWILkKxgMzXhu3ytHE
I want to calculate the distances between the points (without repeating or counting any pair twice). I am using:
import math
def D(x1,x2,y1,y2):
    return math.sqrt((x2-x1)**2+(y2-y1)**2)
x1=dt1[0][0]
x2=dt1[1][0]
y1=dt1[0][1]
y2=dt1[1][1]
print(D(x1,x2,y1,y2))
But there are 1000 points. How can I compute the distances using a for loop or something like that?
This will calculate the distance between each pair of consecutive points over the whole array:
# Stop at len(dt1)-1 so that dt1[x+1] stays in bounds
for x in range(0,len(dt1)-1):
    print(D(dt1[x][0],dt1[x+1][0],dt1[x][1],dt1[x+1][1]))
If you want to calculate the distance between any two points in the array without repetitions, this should do it (includes the new request by the OP):
distances = []
for x in range(0,len(dt1)):
    for y in range(x+1,len(dt1)):
        dist = D(dt1[x][0],dt1[y][0],dt1[x][1],dt1[y][1])
        distances.append(dist)
print(distances)
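The nested loop above can also be written with itertools.combinations, which yields each unordered pair exactly once. This is only a minimal sketch: it assumes dt1 is a sequence of (x, y) pairs, and the small sample list and the helper name pair_distances are made up for illustration.

```python
import math
from itertools import combinations

# Hypothetical sample data standing in for the 1000-point array dt1.
dt1 = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

def pair_distances(points):
    """Return the distance for every unordered pair (i < j), each pair once."""
    return [math.hypot(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in combinations(points, 2)]

distances = pair_distances(dt1)
print(distances)  # [5.0, 10.0, 5.0]
```

For n points this produces n*(n-1)/2 distances, the same count as the nested loop.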
You can use np.linalg.norm to compute Euclidean distance:
In [1]: import numpy as np
In [2]: dt1 = np.random.rand(2, 2)
In [3]: dt1
Out[3]:
array([[0.79791459, 0.71415415],
[0.52647092, 0.894041 ]])
In [4]: np.linalg.norm(dt1[0] - dt1[1])
Out[4]: 0.3256392880558975
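The same call extends to the whole array through its axis argument. The following is a rough sketch rather than a definitive solution: the sample points are invented, and it assumes dt1 holds one (x, y) point per row.

```python
import numpy as np

# Hypothetical sample data: one (x, y) point per row.
dt1 = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [6.0, 0.0]])

# Distances between consecutive points: norm of each row-to-row difference.
consecutive = np.linalg.norm(np.diff(dt1, axis=0), axis=1)

# All pairwise distances via broadcasting, then keep only the pairs with
# i < j so that no pair is counted twice.
diff = dt1[:, None, :] - dt1[None, :, :]
full = np.linalg.norm(diff, axis=2)
i, j = np.triu_indices(len(dt1), k=1)
unique_pairs = full[i, j]

print(consecutive)
print(unique_pairs)
```

The broadcast step builds an N x N x 2 intermediate array, which is fine for 1000 points but grows quadratically with N.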
I have two large 2D arrays (3x100,000) holding 3D coordinates, and I would like to find the index of each correspondence between them.
An example:
mat1 = np.array([[0,0,0],[0,0,0],[0,0,0],[10,11,12],[1,2,3]]).T
mat2 = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]).T
So here I need to obtain the indexes 3 and 0, and I need to find each correspondence among around 100,000 coordinates. Is there a specific function in Python to do this? Applying a for loop could be probl
res = [3,0]
To sum up, my need:
We can use a Cython-powered kd-tree for quick nearest-neighbor lookup -
In [77]: from scipy.spatial import cKDTree
In [78]: d,idx = cKDTree(mat2.T).query(mat1.T, k=1)
In [79]: idx[np.isclose(d,0)]
Out[79]: array([3, 0])
I have a 2D array and I want to find for each (x, y) point the distance to its nearest neighbor as fast as possible.
I can do this using scipy.spatial.distance.cdist:
import numpy as np
from scipy.spatial.distance import cdist
# Random data
data = np.random.uniform(0., 1., (1000, 2))
# Distance between the array and itself
dists = cdist(data, data)
# Sort by distances
dists.sort()
# Select the distance at index 1, since the distance at index 0 is always 0.
# (distance of a point with itself)
nn_dist = dists[:, 1]
This works, but I feel like it's too much work, and KDTree should be able to handle this, but I'm not sure how. I'm not interested in the coordinates of the nearest neighbor; I just want the distance (as fast as possible).
KDTree can do this, and the process is almost the same as when using cdist. In the timings below, however, cdist is much faster than KDTree, and, as pointed out in the comments, cKDTree is faster still:
import numpy as np
from scipy.spatial.distance import cdist
from scipy.spatial import KDTree
from scipy.spatial import cKDTree
import timeit
# Random data
data = np.random.uniform(0., 1., (1000, 2))
def scipy_method():
    # Distance between the array and itself
    dists = cdist(data, data)
    # Sort by distances
    dists.sort()
    # Select the distance at index 1, since the distance at index 0 is always 0
    # (distance of a point with itself)
    nn_dist = dists[:, 1]
    return nn_dist
def KDTree_method():
    # You have to create the tree to use this method.
    tree = KDTree(data)
    # Then you find the closest two, as the first is the point itself
    dists = tree.query(data, 2)
    nn_dist = dists[0][:, 1]
    return nn_dist
def cKDTree_method():
    tree = cKDTree(data)
    dists = tree.query(data, 2)
    nn_dist = dists[0][:, 1]
    return nn_dist
print(timeit.timeit('cKDTree_method()', number=100, globals=globals()))
print(timeit.timeit('scipy_method()', number=100, globals=globals()))
print(timeit.timeit('KDTree_method()', number=100, globals=globals()))
Output:
0.34952507635557595
7.904083715193579
20.765962179145546
Once again, the very unneeded proof that C is awesome!
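Before trusting the timings, it may be worth checking that the brute-force and tree-based methods return the same distances. A small sketch, assuming the data contains no duplicate points (with duplicates, the "second closest" entry would be another zero); the array size and seed are arbitrary:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, (200, 2))

# Brute force: full distance matrix, sorted row by row. Column 0 is the
# zero self-distance, column 1 the nearest-neighbor distance.
nn_brute = np.sort(cdist(data, data), axis=1)[:, 1]

# kd-tree: ask for the 2 closest points; the first hit is the point itself.
tree_dists, _ = cKDTree(data).query(data, k=2)
nn_tree = tree_dists[:, 1]

print(np.allclose(nn_brute, nn_tree))
```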
I have a list of feature vectors, and would like to compute the L2 distance from each feature vector to all other feature vectors, as a uniqueness measure. Here, min_distances[i] gives the smallest L2 distance from the i-th feature vector to any other feature vector.
import numpy as np
# Generate data
nrows = 2000
feature_length = 128
feature_vecs = np.random.rand(nrows, feature_length)
# Calculate min L2 norm from each feature vector
# to all other feature vectors
min_distances = np.zeros(nrows)
indices = np.arange(nrows)
for i in indices:
    min_distances[i] = np.min(np.linalg.norm(
        feature_vecs[i != indices] - feature_vecs[i],
        axis=1))
Being O(n^2), it's painfully slow, and I would like to optimize it. Can I get rid of the for-loop / vectorize this such that min and linalg.norm are called only once?
Approach #1
Here's one with pdist and squareform -
from scipy.spatial.distance import pdist,squareform
d = squareform(pdist(feature_vecs))
np.fill_diagonal(d,np.nan)
min_distances = np.nanmin(d,axis=0)
Approach #2
Another with cKDTree -
from scipy.spatial import cKDTree
min_distances = cKDTree(feature_vecs).query(feature_vecs, k=2)[0][:,1]
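If SciPy is not available, a NumPy-only variant is possible through the identity |a-b|^2 = |a|^2 + |b|^2 - 2 a.b, which needs a single matrix product instead of a loop. This is a sketch of a different technique from the two approaches above, not a replacement for them; the smaller array size is chosen only to keep the example quick, and the identity can lose precision when two vectors are nearly identical.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_vecs = rng.random((300, 16))  # smaller than the question's 2000 x 128

# Squared distances from the Gram matrix: |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
sq_norms = np.einsum('ij,ij->i', feature_vecs, feature_vecs)
gram = feature_vecs @ feature_vecs.T
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram

# Rounding can leave tiny negative values; clip them before the square root.
np.maximum(sq_dists, 0.0, out=sq_dists)
# Exclude each vector's distance to itself from the minimum.
np.fill_diagonal(sq_dists, np.inf)
min_distances = np.sqrt(sq_dists.min(axis=1))
```

It is still O(n^2) in time and memory; only the Python-level loop is gone.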
In the following code we calculate magnitudes of vectors between all pairs of given points. To speed up this operation in NumPy we can use broadcasting
import numpy as np
points = np.random.rand(10,3)
pair_vectors = points[:,np.newaxis,:] - points[np.newaxis,:,:]
pair_dists = np.linalg.norm(pair_vectors,axis=2)
or outer product iteration
it = np.nditer([points,points,None], flags=['external_loop'], op_axes=[[0,-1,1],[-1,0,1],None])
for a,b,c in it:
    c[...] = b - a
pair_vectors = it.operands[2]
pair_dists = np.linalg.norm(pair_vectors,axis=2)
My question is how could one use broadcasting or outer product iteration to create an array with the form 10x10x6 where the last axis contains the coordinates of both points in a pair (extension). And in a related way, is it possible to calculate pair distances using broadcasting or outer product iteration directly, i.e. produce a matrix of form 10x10 without first calculating the difference vectors (reduction).
To clarify, the following code creates the desired matrices using slow looping.
# np.zeros takes the shape as a single tuple
pair_coords = np.zeros((10,10,6))
pair_dists = np.zeros((10,10))
for i in range(10):
    for j in range(10):
        pair_coords[i,j,0:3] = points[i,:]
        pair_coords[i,j,3:6] = points[j,:]
        pair_dists[i,j] = np.linalg.norm(points[i,:]-points[j,:])
This is a failed attempt to calculate distances (or apply any other function that takes the 6 coordinates of both points in a pair and produces a scalar) using outer product iteration.
res = np.zeros((10,10))
it = np.nditer([points,points,res], flags=['reduce_ok','external_loop'], op_axes=[[0,-1,1],[-1,0,1],None])
for a,b,c in it: c[...] = np.linalg.norm(b-a)
pair_dists = it.operands[2]
Here's an approach to produce those arrays in vectorized ways -
from itertools import product
from scipy.spatial.distance import pdist, squareform
N = points.shape[0]
# Get indices for selecting rows off points array and stacking them
idx = np.array(list(product(range(N),repeat=2)))
p_coords = np.column_stack((points[idx[:,0]],points[idx[:,1]])).reshape(N,N,6)
# Get the distances for upper triangular elements.
# Then create a symmetric one for the final dists array.
p_dists = squareform(pdist(points))
A few other vectorized approaches are discussed in this post, so have a look there too!
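The 10x10x6 array asked for in the question can also be built with broadcasting alone, without index lists. A minimal sketch, assuming points has shape (N, 3) as in the question; the names first and second are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((10, 3))
N = points.shape[0]

# Broadcast the points along a new axis instead of building index lists:
# first[i, j] = points[i], second[i, j] = points[j]
first = np.broadcast_to(points[:, None, :], (N, N, 3))
second = np.broadcast_to(points[None, :, :], (N, N, 3))
pair_coords = np.concatenate([first, second], axis=2)  # shape (N, N, 6)

# Any function of the 6 coordinates can then be applied along the last
# axis, here the Euclidean distance between the two points of each pair.
pair_dists = np.linalg.norm(pair_coords[..., 3:] - pair_coords[..., :3], axis=2)
```

np.broadcast_to returns read-only views, so memory is only allocated once, by np.concatenate. Note that the distance step here still forms the difference vectors.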
In Python, is there an efficient vectorized way to calculate the cosine distance of a sparse array u to a sparse matrix v, resulting in an array of elements [1, 2, ..., n] corresponding to cosine(u,v[0]), cosine(u,v[1]), ..., cosine(u, v[n])?
Not natively. You can however use the library scipy that can compute the cosine distance between two vectors for you: http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.cosine.html. You can build a version that takes a matrix using this as a stepping stone.
Add the vector onto the end of the matrix, calculate a pairwise distance matrix using sklearn.metrics.pairwise_distances() and then extract the relevant column/row.
So for vector v (with shape (D,)) and matrix m (with shape (N,D)) do:
import numpy as np
from sklearn.metrics import pairwise_distances
new_m = np.concatenate([m,v[None,:]], axis=0)
distance_matrix = pairwise_distances(new_m, metric="cosine")
distances = distance_matrix[-1,:-1]
Not ideal, but better than iterating!
This method can be extended if you are querying more than one vector. To do this, a list of vectors can be concatenated instead.
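As a possible shortcut for the multi-vector case, pairwise_distances also accepts a second array, so the query vectors do not have to be concatenated onto the matrix at all. A sketch with made-up dense data; the name queries and the shapes are assumptions for illustration:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
m = rng.random((50, 8))        # one vector per row, shape (N, D)
queries = rng.random((3, 8))   # several query vectors, shape (Q, D)

# Passing the queries as the second argument yields only the N x Q block,
# so no concatenation and no full (N+Q) x (N+Q) matrix are needed.
distances = pairwise_distances(m, queries, metric="cosine")  # shape (N, Q)
```

Column k of distances then holds the cosine distance from every row of m to queries[k].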
I think there is a way using the definition and the numpy library:
Definition: cosine distance = 1 - (u . v) / (||u|| * ||v||)
import numpy as np
#just creating random data
u = np.random.random(100)
v = np.random.random((100,100))
#dot product: for every row in v, multiply u and sum the elements
u_dot_v = np.sum(u*v,axis = 1)
#find the norm of u and each row of v
mod_u = np.sqrt(np.sum(u*u))
mod_v = np.sqrt(np.sum(v*v,axis = 1))
#just apply the definition
final = 1 - u_dot_v/(mod_u*mod_v)
#verify with the cosine function from scipy
from scipy.spatial.distance import cosine
final2 = np.array([cosine(u,i) for i in v])
I found the definition of cosine distance here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine
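Since the question asks about sparse inputs, the same definition can be applied with scipy.sparse so that only the stored non-zeros are touched. This is a sketch under assumptions: u is a 1 x D and v an N x D CSR matrix, the random test data is invented, and no row is entirely zero (otherwise the division would fail).

```python
import numpy as np
from scipy import sparse

# Hypothetical sparse inputs: u is 1 x D, v is N x D, both in CSR format.
u = sparse.random(1, 200, density=0.2, format="csr", random_state=0)
v = sparse.random(50, 200, density=0.2, format="csr", random_state=1)

# Dot product of u with every row of v (stays sparse until the last step).
u_dot_v = (v @ u.T).toarray().ravel()

# Euclidean norms computed from the stored non-zeros only.
mod_u = np.sqrt(u.multiply(u).sum())
mod_v = np.sqrt(np.asarray(v.multiply(v).sum(axis=1)).ravel())

# Same definition as in the dense code: 1 - u.v / (|u| |v|)
final = 1.0 - u_dot_v / (mod_u * mod_v)
```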
In scipy.spatial.distance.cosine()
http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.cosine.html
The code below worked for me; you have to provide the correct signature to np.vectorize:
import numpy as np
from scipy.spatial.distance import cosine
def cosine_distances(embedding_matrix, extracted_embedding):
    return cosine(embedding_matrix, extracted_embedding)
cosine_distances = np.vectorize(cosine_distances, signature='(m),(d)->()')
cosine_distances(corpus_embeddings, extracted_embedding)
In my case
corpus_embeddings is a (10000,128) matrix
extracted_embedding is a 128-dimensional vector