Return cosine similarity not as a single value - Python

How can I write a pure NumPy function that returns the cosine similarities of all pairwise comparisons of the rows of two input arrays, as an array of the corresponding shape? I don't want to return a single value.

dataSet1 = [5, 6, 7, 2]
dataSet2 = [2, 3, 1, 15]

def cosine_similarity(list1, list2):
    # How to?
    pass

print(cosine_similarity(dataSet1, dataSet2))

You can use SciPy for this, computing the similarity as one minus the cosine distance:
from scipy import spatial
dataSet1 = [5, 6, 7, 2]
dataSet2 = [2, 3, 1, 15]
result = 1 - spatial.distance.cosine(dataSet1, dataSet2)
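If you specifically want to stay in pure NumPy and get the full matrix of pairwise row similarities, a minimal sketch could look like the following (the helper name pairwise_cosine is just illustrative; inputs are promoted to 2-D so single vectors also work):

import numpy as np

def pairwise_cosine(a, b):
    # promote to 2-D: each row is one vector
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    # normalize rows to unit length, then the dot products are the cosines
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T  # shape (rows of a, rows of b)

dataSet1 = [5, 6, 7, 2]
dataSet2 = [2, 3, 1, 15]
print(pairwise_cosine(dataSet1, dataSet2))  # 1x1 matrix for two single vectors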

You can also use the cosine_similarity function from sklearn:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer  # if the documents are text
from sklearn.metrics.pairwise import cosine_similarity

def cos(docs):
    if len(docs) == 1:
        return []
    count_vectorizer = CountVectorizer()  # pass tokenizer=... if you need a custom tokenizer
    docs = ['missing' if x is np.nan else x for x in docs]
    count_vec = count_vectorizer.fit_transform(docs)
    cosine_sim_matrix = cosine_similarity(count_vec)
    return cosine_sim_matrix
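A quick illustrative call (the example documents are made up; np.nan entries are mapped to the token 'missing' as in the function above):

docs = ["the cat sat on the mat", "the dog sat on the log", np.nan]
print(cos(docs))  # 3x3 matrix of pairwise cosine similarities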

What you are searching for is cosine_similarity from the sklearn library.
Here is a simple example:
Suppose x holds three 5-dimensional vectors and y holds a single one. We can compute the cosine similarity as follows:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
x = np.random.rand(3,5)
y = np.random.rand(1,5)
# >>> x
# array([[0.21668023, 0.05705532, 0.6391782 , 0.97990692, 0.90601101],
#        [0.82725409, 0.30221347, 0.98101159, 0.13982621, 0.88490538],
#        [0.09895812, 0.19948788, 0.12710054, 0.61409403, 0.56001643]])
# >>> y
# array([[0.70531146, 0.10222257, 0.6027328 , 0.87662291, 0.27053804]])
cosine_similarity(x, y)
Then the output is the cosine similarity of each of the 3 vectors in x with the single vector in y, so it has 3x1 values:
array([[0.84139047],
       [0.75146312],
       [0.75255157]])

Related

Python - Library function that, given an X,Y point, finds the Xn,Yn pair closest to it

Disclaimer: I'm looking for a library or pre-existing function that accomplishes this. Similar questions ask about the fundamental algorithm, whereas I am looking for a quick implementation, so I apologize if this appears to be a duplicate question; I'm just looking for a black-box answer.
Given a pair of geo coordinate points:
[34.232,-119.123]
And an array of other points:
[ [36.232,-117.123], [35.232,-119.123], [33.232,-112.123] ]
I'm looking for a function out there that would return a pair from the list above that is closest to the original coordinate
Edited from simple integers to float values
Per comment:
from scipy.spatial.distance import cdist
import numpy as np

def closest(point, ref):
    dist = cdist(ref, [point])
    return ref[np.argmin(dist)]

point = [1, 2]
ref = [[3, 1], [4, 1], [2, 5]]
closest(point, ref)
# out: [3, 1]
My two cents:
from scipy.spatial.distance import euclidean
from functools import partial
key = partial(euclidean, [1,2])
lst = [[3, 1], [4, 1], [2, 5]]
res = min(lst, key=key)
print(res)
Output
[3, 1]
One more:
from sklearn.neighbors import KDTree
import numpy as np
X = np.array([[3,1], [4,1], [2,5]])
tree = KDTree(X, leaf_size=2)
dist, ind = tree.query(np.array([1,2]).reshape(1,-1), k=1)
X[ind][0][0]
# array([3, 1])
Using NumPy's norm for the Euclidean distance:

import numpy as np

def fun(x, points):
    points = np.array(points)
    return points[np.argmin(np.linalg.norm(points - np.array(x), axis=1))]
print (fun([1,2], [[3,1], [4,1], [2,5]]))
print (fun([1,2], [[3,1], [2,1], [2,5]]))
Output:
[3 1]
[2 1]
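Since the question is about geographic coordinates, note that all of the above use straight Euclidean distance on latitude/longitude, which is only an approximation. A minimal pure NumPy sketch using the haversine formula (the helper name closest_geo is illustrative; Earth's radius assumed to be 6371 km):

import numpy as np

def closest_geo(point, ref):
    # haversine distance (km) from `point` to each row of `ref`; both as [lat, lon] in degrees
    lat1, lon1 = np.radians(point)
    lat2, lon2 = np.radians(np.asarray(ref, dtype=float)).T
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    dist = 2 * 6371.0 * np.arcsin(np.sqrt(a))
    return ref[int(np.argmin(dist))]

point = [34.232, -119.123]
ref = [[36.232, -117.123], [35.232, -119.123], [33.232, -112.123]]
print(closest_geo(point, ref))  # [35.232, -119.123]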

Scipy library similarity score calculations

I'm trying to compute similarity scores using vectors:
from scipy.spatial import distance
x = [1,2,4]
y = [1,3,5]
d = distance.cdist(x, y, 'seuclidean', V=None)
However, I keep getting this error:
ValueError: XA must be a 2-dimensional array.
First off, you need NumPy arrays as input, and the error is telling you they need to be 2-D (column vectors in this case). So:
from scipy.spatial import distance
import numpy as np
x = [1,2,4]
y = [1,3,5]
x = np.array(x).reshape(-1, 1)
y = np.array(y).reshape(-1, 1)
x
array([[1],
       [2],
       [4]])
y
array([[1],
       [3],
       [5]])
d = distance.cdist(x, y, 'seuclidean', V=None)
d
array([[ 0.        ,  1.22474487,  2.44948974],
       [ 0.61237244,  0.61237244,  1.83711731],
       [ 1.83711731,  0.61237244,  0.61237244]])
There are many methods in the distance module that calculate a distance (or, from it, a similarity) between two vectors. A common example is cosine:
x = [1,2,4]
y = [1,3,5]
distance.cosine(x, y)
0.0040899966895213691

interpolation between arrays in python

What is the easiest and fastest way to interpolate between two arrays to get a new array?
For example, I have 3 arrays:
x = np.array([0,1,2,3,4,5])
y = np.array([5,4,3,2,1,0])
z = np.array([0,5])
x and y are the data points and z is the argument: at z=0 the x array is valid, and at z=5 the y array is valid. But I need to get a new array for z=1. In the linear case it can easily be solved by:
a = (y-x)/(z[1]-z[0])*1 + x
The problem is that the data is not linearly dependent and there are more than 2 arrays of data. Maybe spline interpolation could somehow be used?
This is a univariate-to-multivariate interpolation problem. SciPy supports univariate-to-univariate and multivariate-to-univariate interpolation, but you can iterate over the outputs, so this is not a big problem. Below is an example of how it can be done; I've changed the variable names a bit and added a new point:
import numpy as np
from scipy.interpolate import interp1d

X = np.array([0, 5, 10])
Y = np.array([[0, 1, 2, 3, 4, 5],
              [5, 4, 3, 2, 1, 0],
              [8, 6, 5, 1, -4, -5]])

XX = np.array([0, 1, 5])  # Find YY for these
YY = np.zeros((len(XX), Y.shape[1]))
for i in range(Y.shape[1]):
    f = interp1d(X, Y[:, i])
    for j in range(len(XX)):
        YY[j, i] = f(XX[j])
So YY holds the results for XX. Hope it helps.
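Since the question mentions splines and the dependence on z may not be linear, the same layout can also be fed to scipy.interpolate.CubicSpline with axis=0, which removes the explicit loops; a minimal sketch under the same X, Y layout as above:

import numpy as np
from scipy.interpolate import CubicSpline

X = np.array([0, 5, 10])
Y = np.array([[0, 1, 2, 3, 4, 5],
              [5, 4, 3, 2, 1, 0],
              [8, 6, 5, 1, -4, -5]])

cs = CubicSpline(X, Y, axis=0)  # one cubic spline per output column, along the z axis
print(cs(1))                    # interpolated array at z = 1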

How to normalize a NumPy array to a unit vector?

I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function:
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm
This function handles the situation where vector v has a norm of 0.
Are there any similar functions provided in sklearn or numpy?
If you're using scikit-learn you can use sklearn.preprocessing.normalize:
import numpy as np
from sklearn.preprocessing import normalize
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:,np.newaxis], axis=0).ravel()
print(np.all(norm1 == norm2))
# True
I agree that it would be nice if such a function were part of the included libraries. But it isn't, as far as I know. So here is a version for arbitrary axes that gives optimal performance.
import numpy as np

def normalized(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)

A = np.random.randn(3, 3, 3)
print(normalized(A, 0))
print(normalized(A, 1))
print(normalized(A, 2))
print(normalized(np.arange(3)[:, None]))
print(normalized(np.arange(3)))
This might also work for you
import numpy as np
normalized_v = v / np.sqrt(np.sum(v**2))
but fails when v has length 0.
In that case, introducing a small constant to prevent the zero division solves this.
As proposed in the comments, one could also use
v / np.linalg.norm(v)
To avoid division by zero I use eps, but that's maybe not great:

def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        norm = np.finfo(v.dtype).eps
    return v / norm
If you have multidimensional data and want each axis normalized to its max or its sum:
def normalize(_d, to_sum=True, copy=True):
    # d is an (n x dimension) np array
    d = _d if not copy else np.copy(_d)
    d -= np.min(d, axis=0)
    d /= (np.sum(d, axis=0) if to_sum else np.ptp(d, axis=0))
    return d

This uses NumPy's peak-to-peak function (np.ptp).

a = np.random.random((5, 3))
b = normalize(a, copy=False)
b.sum(axis=0)  # array([1., 1., 1.]), each column sums to 1
c = normalize(a, to_sum=False, copy=False)
c.max(axis=0)  # array([1., 1., 1.]), the max of each column is 1
If you don't need utmost precision, your function can be reduced to:
v_norm = v / (np.linalg.norm(v) + 1e-16)
You mentioned scikit-learn, so I want to share another solution.
scikit-learn's MinMaxScaler
In scikit-learn there is an API called MinMaxScaler, which lets you customize the value range as you like.
It also deals with NaN issues for us:
NaNs are treated as missing values: disregarded in fit, and maintained in transform. ... see reference [1]
Code sample
The code is simple, just type:

# Let's say X_train is your input dataframe
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# create a MinMaxScaler object
min_max_scaler = MinMaxScaler()
# feed in a numpy array
X_train_norm = min_max_scaler.fit_transform(X_train.values)
# wrap it up in a dataframe if you need one
df = pd.DataFrame(X_train_norm)

Reference
[1] sklearn.preprocessing.MinMaxScaler
There is also the function unit_vector() to normalize vectors in the popular transformations module by Christoph Gohlke:
import transformations as trafo
import numpy as np
data = np.array([[1.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [1.0, 2.0, 3.0]])
print(trafo.unit_vector(data, axis=1))
If you work with multidimensional arrays, the following fast solution is possible.
Say we have a 2D array which we want to normalize along the last axis, while some rows have zero norm.

import numpy as np

arr = np.array([
    [1, 2, 3],
    [0, 0, 0],
    [5, 6, 7]
], dtype=float)

lengths = np.linalg.norm(arr, axis=-1)
print(lengths)  # [ 3.74165739  0.         10.48808848]
arr[lengths > 0] = arr[lengths > 0] / lengths[lengths > 0][:, np.newaxis]
print(arr)
print(arr)
# [[0.26726124 0.53452248 0.80178373]
# [0. 0. 0. ]
# [0.47673129 0.57207755 0.66742381]]
If you want to normalize n dimensional feature vectors stored in a 3D tensor, you could also use PyTorch:
import numpy as np
from torch import FloatTensor
from torch.nn.functional import normalize
vecs = np.random.rand(3, 16, 16, 16)
norm_vecs = normalize(FloatTensor(vecs), dim=0, eps=1e-16).numpy()
If you're working with 3D vectors, you can do this concisely using the toolbelt vg. It's a light layer on top of numpy and it supports single values and stacked vectors.
import numpy as np
import vg
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = vg.normalize(x)
print(np.all(norm1 == norm2))
# True
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are way too verbose in NumPy.
Without sklearn, using just NumPy, just define a function.
Assuming that the rows are the variables and the columns the samples (axis=1):

import numpy as np

# Example array
X = np.array([[1, 2, 3], [4, 5, 6]])

def stdmtx(X):
    means = X.mean(axis=1)
    stds = X.std(axis=1, ddof=1)
    X = X - means[:, np.newaxis]
    X = X / stds[:, np.newaxis]
    return np.nan_to_num(X)

Output:

X
array([[1, 2, 3],
       [4, 5, 6]])

stdmtx(X)
array([[-1.,  0.,  1.],
       [-1.,  0.,  1.]])
For a 2D array, you can use the following one-liner to normalize across rows. To normalize across columns, simply set axis=0.
a / np.linalg.norm(a, axis=1, keepdims=True)
If you want all values in [0, 1] for a 1-D array, then just use
(a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
where a is your 1-D array.
An example:
>>> a = np.array([0, 1, 2, 4, 5, 2])
>>> (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
array([0. , 0.2, 0.4, 0.8, 1. , 0.4])
Note on this method: to preserve the proportions between values, the 1-D array must contain at least one 0 and consist only of 0 and positive numbers.
A simple dot product would do the job. No need for any extra package.
x = x/np.sqrt(x.dot(x))
By the way, if the norm of x is zero, it is inherently a zero vector, and cannot be converted to a unit vector (which has norm 1). If you want to catch the case of np.array([0,0,...0]), then use
norm = np.sqrt(x.dot(x))
x = x/norm if norm != 0 else x

What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.
Say the input matrix is:
A=
[0 1 0 0 1
0 0 1 1 1
1 1 0 1 0]
The sparse representation is:
A =
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3
In Python, it's straightforward to work with the matrix-input format:
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
A = np.array([[0, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 1, 0]])
dist_out = 1 - pairwise_distances(A, metric="cosine")
dist_out
Gives:
array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])
That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished?
You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

# also can output sparse matrices
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))
Results:
pairwise dense output:
[[ 1.          0.40824829  0.40824829]
 [ 0.40824829  1.          0.33333333]
 [ 0.40824829  0.33333333  1.        ]]
pairwise sparse output:
(0, 1) 0.408248290464
(0, 2) 0.408248290464
(0, 0) 1.0
(1, 0) 0.408248290464
(1, 2) 0.333333333333
(1, 1) 1.0
(2, 1) 0.333333333333
(2, 0) 0.408248290464
(2, 2) 1.0
If you want column-wise cosine similarities, simply transpose your input matrix beforehand:
A_sparse.transpose()
The following method is about 30 times faster than scipy.spatial.distance.pdist. It works pretty quickly on large matrices (assuming you have enough RAM).
See below for a discussion of how to optimize for sparsity.
import numpy as np
# base similarity matrix (all dot products)
# replace this with A.dot(A.T).toarray() for sparse representation
similarity = np.dot(A, A.T)
# squared magnitude of preference vectors (number of occurrences)
square_mag = np.diag(similarity)
# inverse squared magnitude
inv_square_mag = 1 / square_mag
# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[np.isinf(inv_square_mag)] = 0
# inverse of the magnitude
inv_mag = np.sqrt(inv_square_mag)
# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag
If your problem is typical for large scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.
If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the first line as indicated.
If your problem is atypical you'll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.
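For illustration, a hedged sketch of the sparse variant described above, assuming the 'items' are the rows of a scipy.sparse CSR matrix (only the first dot product changes; the item-by-item similarity matrix itself is small enough to keep dense):

import numpy as np
from scipy import sparse

# items as rows, same toy matrix as in the question
A_items = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                      [0, 0, 1, 1, 1],
                                      [1, 1, 0, 1, 0]], dtype=float))

# sparse dot product, then back to a small dense item-by-item matrix
similarity = A_items.dot(A_items.T).toarray()
square_mag = np.diag(similarity)
with np.errstate(divide='ignore'):
    inv_square_mag = 1 / square_mag
inv_square_mag[np.isinf(inv_square_mag)] = 0   # empty items get similarity 0
inv_mag = np.sqrt(inv_square_mag)
cosine = (similarity * inv_mag).T * inv_mag
print(cosine)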
I have tried some of the methods above. However, the experiment by @zbinsd has its limitations: the sparsity of the matrix used in the experiment is extremely low, while real sparsity is usually over 90%.
In my case, the sparse matrix has shape (7000, 25000) and a sparsity of 97%. Method 4 is extremely slow and I couldn't wait for the results; method 6 finishes in 10 s. Amazingly, the method below finishes in only 0.247 s:
import sklearn.preprocessing as pp

def cosine_similarities(mat):
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat
This efficient method simply normalizes the columns and multiplies the normalized matrix by its transpose.
I took all these answers and wrote a script to 1. validate each of the results (see assertion below) and 2. see which is the fastest.
Code and results are below:
# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T

# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape=(rows.max()+1, cols.max()+1))

print("Input data shape:", Asp.shape)

# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        return 1 - squareform(pdist(A, metric='cosine'))
    if method == 2:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return np.dot(Anorm, Anorm.T)
    if method == 3:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return linear_kernel(Anorm)
    if method == 4:
        similarity = np.dot(A, A.T)
        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag
        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0
        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)
        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 5:
        '''
        Just a version of method 4 that takes in sparse arrays
        '''
        similarity = A * A.T
        square_mag = np.array(A.sum(axis=1))
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag
        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0
        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag).T
        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = np.array(similarity.multiply(inv_mag))
        return cosine * inv_mag.T
    if method == 6:
        return cosine_similarity(A)

# Assert that all results are consistent with the first model ("truth")
for m in range(1, 7):
    if m in [5]:  # The sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))

# Time them:
print("Method 1")
%timeit calc_sim(A, method=1)
print("Method 2")
%timeit calc_sim(A, method=2)
print("Method 3")
%timeit calc_sim(A, method=3)
print("Method 4")
%timeit calc_sim(A, method=4)
print("Method 5")
%timeit calc_sim(Asp, method=5)
print("Method 6")
%timeit calc_sim(A, method=6)
Results:
Input data shape: (100, 10000)
Method 1
10 loops, best of 3: 71.3 ms per loop
Method 2
100 loops, best of 3: 8.2 ms per loop
Method 3
100 loops, best of 3: 8.6 ms per loop
Method 4
100 loops, best of 3: 2.54 ms per loop
Method 5
10 loops, best of 3: 73.7 ms per loop
Method 6
10 loops, best of 3: 77.3 ms per loop
You can do it this way (assuming data, row and col already describe your sparse matrix in COO format):
temp = sp.coo_matrix((data, (row, col)), shape=(3, 59))
temp1 = temp.tocsr()
#Cosine similarity
row_sums = ((temp1.multiply(temp1)).sum(axis=1))
rows_sums_sqrt = np.array(np.sqrt(row_sums))[:,0]
row_indices, col_indices = temp1.nonzero()
temp1.data /= rows_sums_sqrt[row_indices]
temp2 = temp1.transpose()
temp3 = temp1*temp2
Building off of Vaali's solution:
import numpy as np
from scipy.sparse import csr_matrix

def sparse_cosine_similarity(sparse_matrix):
    out = (sparse_matrix.copy() if type(sparse_matrix) is csr_matrix else
           sparse_matrix.tocsr())
    squared = out.multiply(out)
    sqrt_sum_squared_rows = np.array(np.sqrt(squared.sum(axis=1)))[:, 0]
    row_indices, col_indices = out.nonzero()
    out.data /= sqrt_sum_squared_rows[row_indices]
    return out.dot(out.T)
This takes a sparse matrix (preferably a csr_matrix) and returns a csr_matrix. It should do the more intensive parts using sparse calculations with pretty minimal memory overhead. I haven't tested it extensively though, so caveat emptor (Update: I feel confident in this solution now that I've tested and benchmarked it)
Also, here is the sparse version of Waylon's solution in case it helps anyone; I'm not sure which solution is actually better:

def sparse_cosine_similarity_b(sparse_matrix):
    input_csr_matrix = sparse_matrix.tocsr()
    similarity = input_csr_matrix * input_csr_matrix.T
    square_mag = similarity.diagonal()
    inv_square_mag = 1 / square_mag
    inv_square_mag[np.isinf(inv_square_mag)] = 0
    inv_mag = np.sqrt(inv_square_mag)
    return similarity.multiply(inv_mag).T.multiply(inv_mag)
Both solutions seem to have parity with sklearn.metrics.pairwise.cosine_similarity
:-D
Update:
Now I have tested both solutions against my existing Cython implementation: https://github.com/davidmashburn/sparse_dot/blob/master/test/benchmarks_v3_output_table.txt
and it looks like the first algorithm performs the best of the three most of the time.
You should check out scipy.sparse. You can apply operations on those sparse matrices just as you would on a normal matrix.
from math import sqrt

def norm(vector):
    return sqrt(sum(x * x for x in vector))

def cosine_similarity(vec_a, vec_b):
    norm_a = norm(vec_a)
    norm_b = norm(vec_b)
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    return dot / (norm_a * norm_b)
This method seems to be somewhat faster than using sklearn's implementation if you pass in one pair of vectors at a time.
I suggest running it in two steps:
1) generate a mapping A that maps A: column index -> non-zero objects
2) for each object i (row) with non-zero occurrences (columns) {k1, ..., kn}, calculate the cosine similarity just for the elements in the union set A[k1] U A[k2] U ... U A[kn]
Assuming a big sparse matrix with high sparsity, this will give a significant boost over brute force; a minimal sketch follows below.
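A minimal sketch of that two-step idea (the names col_to_rows and sparse_cosine_candidates are just illustrative; it assumes a scipy.sparse CSR matrix with objects as rows):

import numpy as np
from scipy import sparse
from collections import defaultdict

def sparse_cosine_candidates(A):
    """Cosine similarity only between rows that share at least one non-zero column."""
    A = sparse.csr_matrix(A, dtype=float)
    norms = np.sqrt(A.multiply(A).sum(axis=1)).A1  # row norms

    # step 1: column index -> rows that are non-zero in that column
    col_to_rows = defaultdict(list)
    rows, cols = A.nonzero()
    for r, c in zip(rows, cols):
        col_to_rows[c].append(r)

    # step 2: for each row, only compare against rows sharing a column
    sims = {}
    for i in range(A.shape[0]):
        candidates = set()
        for c in A[i].indices:
            candidates.update(col_to_rows[c])
        candidates.discard(i)
        for j in candidates:
            if (j, i) in sims:
                continue
            dot = A[i].multiply(A[j]).sum()
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
print(sparse_cosine_candidates(A))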
Adapting @jeff's solution: as of scikit-learn 1.1.2, you don't need to convert to a scipy sparse matrix before calling cosine_similarity.
All you need is cosine_similarity:
from typing import Tuple

import numpy as np
import perfplot
import scipy
from sklearn.metrics.pairwise import cosine_similarity as cosine_similarity_sklearn_internal
from scipy import spatial
from scipy import sparse
import sklearn.preprocessing as pp

target_dtype = "float16"

class prettyfloat(float):
    def __repr__(self):
        return "%.2f" % self

def cosine_similarity_sklearn(x):
    return cosine_similarity_sklearn_internal(x)

def cosine_similarity_sklearn_sparse(x):
    x_sparse = sparse.csr_matrix(x)
    return cosine_similarity_sklearn_internal(x_sparse)

def cosine_similarity_einsum(x, y=None):
    """
    Calculate the cosine similarity between two sets of vectors.
    If y is None, only x is used.
    """
    # cosine_similarity in einsum notation without astype
    normed_x = x / np.linalg.norm(x, axis=1)[:, None]
    normed_y = y / np.linalg.norm(y, axis=1)[:, None] if y is not None else normed_x
    return np.einsum("ik,jk->ij", normed_x, normed_y)

def cosine_similarity_scipy(x, y=None):
    """
    Calculate the cosine similarity between two vectors.
    If y is None, only x is used.
    """
    return 1 - spatial.distance.cosine(x, x)

def setup_n(n) -> Tuple[np.ndarray, np.ndarray]:
    nd_arr = np.random.randn(int(2 ** n), 512).astype(target_dtype)
    return nd_arr

def equality_check(a, b):
    if type(a) != np.ndarray:
        a = a.todense()
    if type(b) != np.ndarray:
        b = b.todense()
    return np.isclose(a.astype(target_dtype), b.astype(target_dtype), atol=1e-3).all()

fig = perfplot.show(
    setup=setup_n,
    n_range=[k for k in range(1, 10)],
    kernels=[
        cosine_similarity_sklearn,
        cosine_similarity_sklearn_sparse,
        cosine_similarity_einsum,
        # cosine_similarity_scipy,
    ],
    labels=["sk-def", "sk+sparse", "einsum"],
    logx=False,
    logy=False,
    xlabel='2^n',
    equality_check=equality_check,
)
Using perfplot, this shows that the plain sklearn cosine_similarity (the "sk-def" kernel above) performs best, in scikit-learn 1.1.2 and 1.1.3.
The result can differ between float64 and float16 (perfplot figures for float64 and float16).
