Let's say I have a matrix like this:
[[5.05537647 4.96643654 4.88792309 4.48089566 4.4469417 3.7841264]
[4.81800568 4.75527558 4.69862751 3.81999698 3.7841264 3.68258605]
[4.64717983 4.60021917 4.55716111 4.07718641 4.0245128 4.69862751]
[4.51752158 4.35840703 4.30839634 3.97312429 3.9655597 3.68258605]
[4.38592909 4.33261686 4.2856032 4.26411249 4.24381326 3.7841264]]
I need to calculate the cosine similarity between the rows of the matrix, but without using the cosine similarity functions from "scipy" or "sklearn.metrics.pairwise". I can use "math", though.
I tried this code, but I can't figure out how to iterate over each row of the matrix.
import math

def cosine_similarity(matrix):
    for row1 in matrix:
        for row2 in matrix:
            sum1, sum2, sum3 = 0, 0, 0
            for i in range(len(row1)):
                a = row1[i]; b = row2[i]
                sum1 += a*a
                sum2 += b*b
                sum3 += a*b
            return sum3 / math.sqrt(sum1*sum2)

cosine_similarity(matrix)
Do you have any ideas how I can do that? Thank you!
You can use vectorized operations since you have a NumPy array. Note that math.sqrt does not work on arrays, so use np.sqrt for a vectorized square root instead. The following code stores the similarity values in a list and returns it.
import numpy as np

def cosine_similarity(matrix):
    sim_index = []
    for row1 in matrix:
        for row2 in matrix:
            sim_index.append(sum(row1*row2)/np.sqrt(sum(row1**2) * sum(row2**2)))
    return sim_index

cosine_similarity(matrix)
# [1.0, 0.9985287276116063, 0.9943589065201967, 0.9995100043150523, 0.9986115804314727,
#  0.9985287276116063, 1.0, 0.9952419798474134, 0.9984515542959852, 0.9957338741601842,
#  0.9943589065201967, 0.9952419798474134, 1.0, 0.9970632589904104, 0.9962784686967592,
#  0.9995100043150523, 0.9984515542959852, 0.9970632589904104, 1.0, 0.9992584450362125,
#  0.9986115804314727, 0.9957338741601842, 0.9962784686967592, 0.9992584450362125, 1.0]
A shorter version using a list comprehension:
sim_index = np.array([sum(r1*r2)/np.sqrt(sum(r1**2) * sum(r2**2)) for r1 in matrix for r2 in matrix])
The final list is converted to an array so it can be reshaped for plotting.
Visualizing the similarity matrix: since each row is identical to itself, its self-similarity index is 1 (yellow). Hence the diagonal of the plotted matrix is fully yellow (index = 1).
import matplotlib.pyplot as plt
plt.imshow(sim_index.reshape((5,5)))
plt.colorbar()
plt.show()
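As an aside, the whole 5x5 similarity matrix can also be computed without any Python loops; here is a sketch, assuming matrix is the NumPy array from the question:

import numpy as np

# Normalize each row to unit length; the pairwise dot products of the
# normalized rows are then exactly the cosine similarities.
normalized = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
sim_matrix = normalized @ normalized.T  # shape (5, 5)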
Related
I have two large numpy arrays for which I want to calculate the Euclidean distance using sklearn. The following MRE achieves what I want in the final result, but since my real-life arrays are large, I really want a vectorized solution as opposed to a for loop.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
n = 3
sample_size = 5
X = np.random.randint(0, 10, size=(sample_size, n))
Y = np.random.randint(0, 10, size=(sample_size, n))
lst = []
for f in range(0, sample_size):
    ed = euclidean_distances([X[f]], [Y[f]])
    lst.append(ed[0][0])
print(lst)
euclidean_distances computes the distance for each combination of X,Y points; this will grow large in memory and is totally unnecessary if you just want the distance between each respective row. Sklearn includes a different function called paired_distances that does what you want:
from sklearn.metrics.pairwise import paired_distances
d = paired_distances(X,Y)
# array([5.83095189, 9.94987437, 7.34846923, 5.47722558, 4. ])
If you need the full pairwise distances, you can get the same result from the diagonal (as pointed out in the comments):
d = euclidean_distances(X,Y).diagonal()
Lastly: arrays are a numpy type, so it is useful to know the numpy API itself (probably what sklearn calls under the hood). Here are two examples:
d = np.linalg.norm(X-Y, axis=1)
d = np.sqrt(np.sum((X-Y)**2, axis=1))
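As a quick sanity check (assuming the X and Y from the question above), all three approaches agree:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, paired_distances

d1 = paired_distances(X, Y)                # row-wise distances
d2 = euclidean_distances(X, Y).diagonal()  # diagonal of the full matrix
d3 = np.linalg.norm(X - Y, axis=1)         # plain numpy
assert np.allclose(d1, d2) and np.allclose(d1, d3)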
I have a matrix called inverseJ, which is a 2x2 matrix ([[0.07908312, 0.03071918], [-0.12699082, -0.0296126]]), and a one-dimensional vector deltaT of length two ([-31.44630082, -16.9922145]). In NumPy, multiplying these should yield a one-dimensional vector again, as in this example. However, when I multiply them using inverseJ.dot(deltaT), I get a two-dimensional array ([[-3.00885838, 4.49657509]]) whose only element is the vector I am actually looking for. Does anyone know why I am not simply getting a vector? Any help is greatly appreciated!
Whole script for reference
from __future__ import division
import sys
import io
import os
from math import *
import numpy as np
if __name__ == "__main__":
    # Fingertip position
    x = float(sys.argv[1])
    y = float(sys.argv[2])
    # Initial guesses
    q = np.array([0., 0.])
    q[0] = float(sys.argv[3])
    q[1] = float(sys.argv[4])
    error = 0.01
    while error > 0.001:
        # Configuration matrix
        T = np.array([17.3*cos(q[0] + (5/3)*q[1])+25.7*cos(q[0] + q[1])+41.4*cos(q[0]),
                      17.3*sin(q[0] + (5/3)*q[1])+25.7*sin(q[0] + q[1])+41.4*sin(q[0])])
        # Deviation
        deltaT = np.subtract(np.array([x, y]), T)
        error = deltaT[0]**2 + deltaT[1]**2
        # Jacobian
        J = np.matrix([[-25.7*sin(q[0]+q[1])-17.3*sin(q[0]+(5/3)*q[1])-41.4*sin(q[0]), -25.7*sin(q[0]+q[1])-28.8333*sin(q[0]+(5/3)*q[1])],
                       [25.7*cos(q[0]+q[1])+17.3*cos(q[0]+(5/3)*q[1])+41.4*cos(q[0]), 25.7*cos(q[0]+q[1])+28.8333*cos(q[0]+(5/3)*q[1])]])
        # Inverse of the Jacobian
        det = J.item((0, 0))*J.item((1, 1)) - J.item((0, 1))*J.item((1, 0))
        inverseJ = 1/det * np.matrix([[J.item((1, 1)), -J.item((0, 1))],
                                      [-J.item((1, 0)), J.item((0, 0))]])
        ### THE PROBLEMATIC MATRIX VECTOR MULTIPLICATION IN QUESTION
        q = q + inverseJ.dot(deltaT)
When a matrix is involved in an operation, the output is another matrix. matrix objects are matrices in the strict linear-algebra sense: they are always 2D, even if they have only one element.
By contrast, the example you mention uses arrays, not matrices. Arrays are more "loosely behaved". One of the differences is that "useless" dimensions are removed, yielding a 1D vector in this example.
This simply seems to be the way numpy.dot() works. It does a straightforward array multiplication which, since one of the operands is two-dimensional, returns a two-dimensional array. dot() is not a smart method; it just does what it's told, without sanity checks, from what I can gather in the documentation here. Note that this is not an error in your code, but you will have to extract the inner list yourself.
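To illustrate the difference, here is a minimal sketch using the values from the question; np.asarray(...).ravel() (or the matrix attribute .A1) flattens the result back to a 1-D vector:

import numpy as np

M = np.matrix([[0.07908312, 0.03071918], [-0.12699082, -0.0296126]])
v = np.array([-31.44630082, -16.9922145])

res = M.dot(v)                 # np.matrix result, shape (1, 2)
vec = np.asarray(res).ravel()  # plain 1-D array, shape (2,)
# Equivalently: vec = res.A1

# With plain ndarrays the extra dimension never appears:
np.asarray(M).dot(v)           # already shape (2,)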
In the following code we calculate the magnitudes of the vectors between all pairs of given points. To speed up this operation in NumPy we can use broadcasting:
import numpy as np
points = np.random.rand(10,3)
pair_vectors = points[:,np.newaxis,:] - points[np.newaxis,:,:]
pair_dists = np.linalg.norm(pair_vectors, axis=2)
or outer product iteration
it = np.nditer([points,points,None], flags=['external_loop'], op_axes=[[0,-1,1],[-1,0,1],None])
for a, b, c in it:
    c[...] = b - a
pair_vectors = it.operands[2]
pair_dists = np.linalg.norm(pair_vectors,axis=2)
My question is how could one use broadcasting or outer product iteration to create an array with the form 10x10x6 where the last axis contains the coordinates of both points in a pair (extension). And in a related way, is it possible to calculate pair distances using broadcasting or outer product iteration directly, i.e. produce a matrix of form 10x10 without first calculating the difference vectors (reduction).
To clarify, the following code creates the desired matrices using slow looping.
pair_coords = np.zeros((10, 10, 6))
pair_dists = np.zeros((10, 10))
for i in range(10):
    for j in range(10):
        pair_coords[i, j, 0:3] = points[i, :]
        pair_coords[i, j, 3:6] = points[j, :]
        pair_dists[i, j] = np.linalg.norm(points[i, :] - points[j, :])
This is a failed attempt to calculate distances (or to apply any other function that takes the 6 coordinates of both points in a pair and produces a scalar) using outer product iteration.
res = np.zeros((10,10))
it = np.nditer([points,points,res], flags=['reduce_ok','external_loop'], op_axes=[[0,-1,1],[-1,0,1],None])
for a,b,c in it: c[...] = np.linalg.norm(b-a)
pair_dists = it.operands[2]
Here's an approach to produce those arrays in vectorized ways -
import numpy as np
from itertools import product
from scipy.spatial.distance import pdist, squareform
N = points.shape[0]
# Get indices for selecting rows off points array and stacking them
idx = np.array(list(product(range(N),repeat=2)))
p_coords = np.column_stack((points[idx[:,0]],points[idx[:,1]])).reshape(N,N,6)
# Get the distances for upper triangular elements.
# Then create a symmetric one for the final dists array.
p_dists = squareform(pdist(points))
A few other vectorized approaches are discussed in this post, so have a look there too!
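For the 10x10x6 array specifically, a pure-broadcasting construction is also possible; here is a sketch, assuming points has shape (10, 3):

import numpy as np

N = points.shape[0]
# Repeat point i along axis 1 and point j along axis 0, then stack them.
first = np.broadcast_to(points[:, np.newaxis, :], (N, N, 3))   # point i of each pair
second = np.broadcast_to(points[np.newaxis, :, :], (N, N, 3))  # point j of each pair
pair_coords = np.concatenate([first, second], axis=2)          # shape (N, N, 6)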
In Python, is there an efficient vectorized way to calculate the cosine distance of a sparse array u to a sparse matrix v, resulting in an array of elements [1, 2, ..., n] corresponding to cosine(u, v[0]), cosine(u, v[1]), ..., cosine(u, v[n])?
Not natively. You can, however, use the scipy library, which can compute the cosine distance between two vectors for you: http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.cosine.html. You can build a version that takes a matrix using this as a stepping stone.
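For example, a minimal sketch of such a wrapper (the helper name cosine_to_rows is made up here; note that scipy's cosine expects dense vectors, so sparse rows would need .toarray() first):

import numpy as np
from scipy.spatial.distance import cosine

def cosine_to_rows(u, v):
    # Cosine distance from vector u to each row of matrix v.
    return np.array([cosine(u, row) for row in v])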
Add the vector onto the end of the matrix, calculate a pairwise distance matrix using sklearn.metrics.pairwise_distances() and then extract the relevant column/row.
So for vector v (with shape (D,)) and matrix m (with shape (N,D)) do:
import numpy as np
from sklearn.metrics import pairwise_distances

new_m = np.concatenate([m, v[None, :]], axis=0)
distance_matrix = pairwise_distances(new_m, metric="cosine")
distances = distance_matrix[-1,:-1]
Not ideal, but better than iterating!
This method can be extended if you are querying more than one vector. To do this, a list of vectors can be concatenated instead.
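For instance, with a (K, D) matrix vs of query vectors (a hedged sketch; vs is an assumed name):

import numpy as np
from sklearn.metrics import pairwise_distances

K = vs.shape[0]
new_m = np.concatenate([m, vs], axis=0)
distance_matrix = pairwise_distances(new_m, metric="cosine")
# The last K rows are the queries; the leading columns are the original rows of m.
distances = distance_matrix[-K:, :-K]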
I think there is a way using the definition and the numpy library:
Definition: cosine_distance(u, v) = 1 - (u · v) / (‖u‖ ‖v‖)
import numpy as np
#just creating random data
u = np.random.random(100)
v = np.random.random((100,100))
#dot product: for every row in v, multiply u and sum the elements
u_dot_v = np.sum(u*v,axis = 1)
#find the norm of u and each row of v
mod_u = np.sqrt(np.sum(u*u))
mod_v = np.sqrt(np.sum(v*v,axis = 1))
#just apply the definition
final = 1 - u_dot_v/(mod_u*mod_v)
#verify with the cosine function from scipy
from scipy.spatial.distance import cosine
final2 = np.array([cosine(u,i) for i in v])
I found the definition of cosine distance here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine
Use scipy.spatial.distance.cosine():
http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.cosine.html
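A quick illustration of what it returns (a distance, not a similarity):

from scipy.spatial.distance import cosine

cosine([1, 0], [0, 1])  # 1.0 -- orthogonal vectors
cosine([1, 0], [2, 0])  # 0.0 -- parallel vectors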
The code below worked for me; you have to provide the correct signature:
import numpy as np
from scipy.spatial.distance import cosine

def cosine_distances(embedding_matrix, extracted_embedding):
    return cosine(embedding_matrix, extracted_embedding)

# Vectorize over the rows of the first argument; the second stays fixed.
cosine_distances = np.vectorize(cosine_distances, signature='(m),(d)->()')
cosine_distances(corpus_embeddings, extracted_embedding)
In my case
corpus_embeddings is a (10000,128) matrix
extracted_embedding is a 128-dimensional vector
I am using Python with the numpy, scipy and scikit-learn modules.
I'd like to classify the rows of a very big sparse matrix (100,000 × 100,000).
The values in the matrix are 0 or 1, and the only thing I have are the indices where the value is 1.
a = [1,3,5,7,9]
b = [2,4,6,8,10]
which means
a = [0,1,0,1,0,1,0,1,0,1,0]
b = [0,0,1,0,1,0,1,0,1,0,1]
How can I turn the index arrays into a scipy sparse array?
And how can I classify those arrays quickly?
Thank you very much.
If you choose the sparse coo_matrix format, you can create it by passing the indices like this:
import numpy as np
from scipy.sparse import coo_matrix

nrows = 100000
ncols = 100000
row = np.array([1, 3, 5, 7, 9])
col = np.array([2, 4, 6, 8, 10])
values = np.ones(col.size)
m = coo_matrix((values, (row, col)), shape=(nrows, ncols), dtype=float)
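For the classification part of the question, most scikit-learn estimators accept scipy sparse matrices directly once they are converted to CSR; a minimal sketch, assuming a label vector y (one label per row) exists:

from sklearn.linear_model import LogisticRegression

X = m.tocsr()                 # CSR supports fast row slicing and is what sklearn expects
clf = LogisticRegression()
clf.fit(X, y)                 # y: one label per row, assumed given
predictions = clf.predict(X)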