In python, is there a vectorized efficient way to calculate the cosine distance of a sparse array u to a sparse matrix v, resulting in an array of elements [1, 2, ..., n] corresponding to cosine(u,v[0]), cosine(u,v[1]), ..., cosine(u, v[n])?
Not natively. You can however use the library scipy that can compute the cosine distance between two vectors for you: http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.cosine.html. You can build a version that takes a matrix using this as a stepping stone.
Add the vector onto the end of the matrix, calculate a pairwise distance matrix using sklearn.metrics.pairwise_distances() and then extract the relevant column/row.
So for vector v (with shape (D,)) and matrix m (with shape (N,D)) do:
import sklearn
from sklearn.metrics import pairwise_distances
new_m = np.concatenate([m,v[None,:]], axis=0)
distance_matrix = sklearn.metrics.pairwise_distances(new_m, axis=0), metric="cosine")
distances = distance_matrix[-1,:-1]
Not ideal, but better than iterating!
This method can be extended if you are querying more than one vector. To do this, a list of vectors can be concatenated instead.
I think there is a way using the definition and the numpy library:
Definition:
import numpy as np
#just creating random data
u = np.random.random(100)
v = np.random.random((100,100))
#dot product: for every row in v, multiply u and sum the elements
u_dot_v = np.sum(u*v,axis = 1)
#find the norm of u and each row of v
mod_u = np.sqrt(np.sum(u*u))
mod_v = np.sqrt(np.sum(v*v,axis = 1))
#just apply the definition
final = 1 - u_dot_v/(mod_u*mod_v)
#verify with the cosine function from scipy
from scipy.spatial.distance import cosine
final2 = np.array([cosine(u,i) for i in v])
The definition of cosine distance i found here :https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine
In scipy.spatial.distance.cosine()
http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.cosine.html
Below worked for me, have to provide correct signature
from scipy.spatial.distance import cosine
def cosine_distances(embedding_matrix, extracted_embedding):
return cosine(embedding_matrix, extracted_embedding)
cosine_distances = np.vectorize(cosine_distances, signature='(m),(d)->()')
cosine_distances(corpus_embeddings, extracted_embedding)
In my case
corpus_embeddings is a (10000,128) matrix
extracted_embedding is a 128-dimensional vector
Related
I have two large numpy arrays for which I want to calculate an Euclidean Distance using sklearn. The following MRE achieves what I want in the final result, but since my RL usage is large, I really want a vectorized solution as opposed to using a for loop.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
n = 3
sample_size = 5
X = np.random.randint(0, 10, size=(sample_size, n))
Y = np.random.randint(0, 10, size=(sample_size, n))
lst = []
for f in range(0, sample_size):
ed = euclidean_distances([X[f]], [Y[f]])
lst.append(ed[0][0])
print(lst)
euclidean_distances computes the distance for each combination of X,Y points; this will grow large in memory and is totally unnecessary if you just want the distance between each respective row. Sklearn includes a different function called paired_distances that does what you want:
from sklearn.metrics.pairwise import paired_distances
d = paired_distances(X,Y)
# array([5.83095189, 9.94987437, 7.34846923, 5.47722558, 4. ])
If you need the full pairwise distances, you can get the same result from the diagonal (as pointed out in the comments):
d = euclidean_distances(X,Y).diagonal()
Lastly: arrays are a numpy type, so it is useful to know the numpy api itself (prob. what sklearn calls under the hood). Here are two examples:
d = np.linalg.norm(X-Y, axis=1)
d = np.sqrt(np.sum((X-Y)**2, axis=1))
I have a list of feature vectors, and would like to compute the L2 distance of a feature vector to all other feature vectors, as a uniqueness measure. Here, min_distances[i] gives the L2 norm of the i-th feature vector.
import numpy as np
# Generate data
nrows = 2000
feature_length = 128
feature_vecs = np.random.rand(nrows, feature_length)
# Calculate min L2 norm from each feature vector
# to all other feature vectors
min_distances = np.zeros(nrows)
indices = np.arange(nrows)
for i in indices:
min_distances[i] = np.min(np.linalg.norm(
feature_vecs[i != indices] - feature_vecs[i],
axis=1))
Being O(n^2) it's painfully slow, and would like to optimize it. Can I get rid of the for-loop / vectorize this such that min and linalg.norm are called only once?
Approach #1
Here's one with cdist -
from scipy.spatial.distance import cdist,pdist,squareform
d = squareform(pdist(feature_vecs))
np.fill_diagonal(d,np.nan)
min_distances = np.nanmin(d,axis=0)
Approach #2
Another with cKDTree -
from scipy.spatial import cKDTree
min_distances = cKDTree(feature_vecs).query(feature_vecs, k=2)[0][:,1]
Let`s say I have a matrix like this:
[[5.05537647 4.96643654 4.88792309 4.48089566 4.4469417 3.7841264]
[4.81800568 4.75527558 4.69862751 3.81999698 3.7841264 3.68258605]
[4.64717983 4.60021917 4.55716111 4.07718641 4.0245128 4.69862751]
[4.51752158 4.35840703 4.30839634 3.97312429 3.9655597 3.68258605]
[4.38592909 4.33261686 4.2856032 4.26411249 4.24381326 3.7841264]]
I need to calculate cosine similarity between rows of matrix but without using cosine similarity from "scipy" and "sklearn.metrics.pairwise". But I can use "math".
I tried it with this code, but I can`t understand how can I iterate over each row of matrix.
import math
def cosine_similarity(matrix):
for row1 in matrix:
for row2 in matrix:
sum1, sum2, sum3 = 0, 0, 0
for i in range(len(row1)):
a = row1[i]; b = row2[i]
sum1 += a*a
sum2 += b*b
sum3 += a*b
return sum3 / math.sqrt(sum1*sum2)
cosine_similarity(matrix)
Do you have any ideas how can I do that? Thank you!
You can use the vectorized operation since you have a numpy matrix. Furthermore, math.sqrt doesn't allow vectorized operation therefore, you can use np.sqrt to vectorize the square root operation. Following is the code where you store the similarity indices in a list and return it.
import numpy as np
def cosine_similarity(matrix):
sim_index = []
for row1 in matrix:
for row2 in matrix:
sim_index.append(sum(row1*row2)/np.sqrt(sum(row1**2) * sum(row2**2)))
return sim_index
cosine_similarity(matrix)
# 1.0,0.9985287276116063,0.9943589065201967,0.9995100043150523,0.9986115804314727,0.9985287276116063,1.0,0.9952419798474134,0.9984515542959852,0.9957338741601842,0.9943589065201967,0.9952419798474134,1.0,0.9970632589904104,0.9962784686967592,0.9995100043150523,0.9984515542959852,0.9970632589904104,1.0,0.9992584450362125,0.9986115804314727,0.9957338741601842,0.9962784686967592,0.9992584450362125,1.0
Further short code using list comprehension
sim_index = np.array([sum(r1*r2)/np.sqrt(sum(r1**2) * sum(r2**2)) for r1 in matrix for r2 in matrix])
The final list is converted to array for reshaping for plotting purpose.
Visualizing the similarity matrix : Here since each row is completely identical to itself, the similarity index is 1 (yellow color). Hence the diagonal of the matrix plotted is fully yellow (index = 1).
import matplotlib.pyplot as plt
plt.imshow(sim_index.reshape((5,5)))
plt.colorbar()
In the following code we calculate magnitudes of vectors between all pairs of given points. To speed up this operation in NumPy we can use broadcasting
import numpy as np
points = np.random.rand(10,3)
pair_vectors = points[:,np.newaxis,:] - points[np.newaxis,:,:]
pair_dists = np.linalg.norm(pair_vectors,axis=2).shape
or outer product iteration
it = np.nditer([points,points,None], flags=['external_loop'], op_axes=[[0,-1,1],[-1,0,1],None])
for a,b,c in it:
c[...] = b - a
pair_vectors = it.operands[2]
pair_dists = np.linalg.norm(pair_vectors,axis=2)
My question is how could one use broadcasting or outer product iteration to create an array with the form 10x10x6 where the last axis contains the coordinates of both points in a pair (extension). And in a related way, is it possible to calculate pair distances using broadcasting or outer product iteration directly, i.e. produce a matrix of form 10x10 without first calculating the difference vectors (reduction).
To clarify, the following code creates the desired matrices using slow looping.
pair_coords = np.zeros(10,10,6)
pair_dists = np.zeros(10,10)
for i in range(10):
for j in range(10):
pair_coords[i,j,0:3] = points[i,:]
pair_coords[i,j,3:6] = points[j,:]
pair_dists[i,j] = np.linalg.norm(points[i,:]-points[j,:])
This is a failed attempt to calculate distanced (or apply any other function that takes 6 coordinates of both points in a pair and produce a scalar) using outer product iteration.
res = np.zeros((10,10))
it = np.nditer([points,points,res], flags=['reduce_ok','external_loop'], op_axes=[[0,-1,1],[-1,0,1],None])
for a,b,c in it: c[...] = np.linalg.norm(b-a)
pair_dists = it.operands[2]
Here's an approach to produce those arrays in vectorized ways -
from itertools import product
from scipy.spatial.distance import pdist, squareform
N = points.shape[0]
# Get indices for selecting rows off points array and stacking them
idx = np.array(list(product(range(N),repeat=2)))
p_coords = np.column_stack((points[idx[:,0]],points[idx[:,1]])).reshape(N,N,6)
# Get the distances for upper triangular elements.
# Then create a symmetric one for the final dists array.
p_dists = squareform(pdist(points))
Few other vectorized approaches are discussed in this post, so have a look there too!
I'm looking for dynamically growing vectors in Python, since I don't know their length in advance. In addition, I would like to calculate distances between these sparse vectors, preferably using the distance functions in scipy.spatial.distance (although any other suggestions are welcome). Any ideas how to do this? (Initially, it doesn't need to be efficient.)
Thanks a lot in advance!
You can use regular python lists (which are dynamic) as vectors. Trivial example follows.
from scipy.spatial.distance import sqeuclidean
a = [1,2,3]
b = [0,0,0]
print sqeuclidean(a,b) # 14
As per aganders3's suggestion, do note that you can also use numpy arrays if needed:
import numpy
a = numpy.array([1,2,3])
If the sparse part of your question is crucial I'd use scipy for that - it has support for sparse matrixes. You can define a 1xn matrix and use it as a vector. This works (the parameter is the size of the matrix, filled with zeroes by default):
sqeuclidean(scipy.sparse.coo_matrix((1,3)),scipy.sparse.coo_matrix((1,3))) # 0
There are many kinds of sparse matrixes, some dictionary based (see comment). You can define a row sparse matrix from a list like this:
scipy.sparse.csr_matrix([1,2,3])
Here is how you can do it in numpy:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([0, 0, 0])
c = np.sum(((a - b) ** 2)) # 14