Fastest way to calculate cosine similarity between two 2D arrays - python

I have one array A containing 64,000 embeddings and another array B containing 12,000 embeddings (each embedding is 1024 floats long).
Now I want to calculate the cosine similarity for all the pairs between array A and array B (cartesian product).
To perform that (using pandas), I merge array A with array B using .merge(how="cross").
It gives me 768,000,000 pairs.
Now I am looking for the fastest way of calculating the cosine similarity. For now I use something like this with NumPy:
from numpy import dot
from numpy.linalg import norm

def compute_cosine_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

np.vectorize(compute_cosine_sim)(df.embedding_A.to_numpy(), df.embedding_B.to_numpy())
To keep the RAM at reasonable level, I use pandas Dataframe chunking.
The problem is that my method is not fast enough, and I was wondering if there isn't something to change here, especially regarding the efficiency of the NumPy function I use.
To give some details, I reach about 130,000 iterations/sec with this function. Is that normal?
Also, could this kind of operation be run on a GPU easily?
Thanks for the help

You could matrix-multiply embedding array A with the transpose of embedding array B to get the dot products of all pairs, then divide each row of the result by the norm of the corresponding vector in A and each column by the norm of the corresponding vector in B:
import numpy as np
a = np.random.randn(10, 4)
b = np.random.randn(8, 4)
out = a @ b.T / np.linalg.norm(a, axis=1, keepdims=True) / np.linalg.norm(b.T, axis=0, keepdims=True)
# out has shape (10, 8)
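If materializing all 768,000,000 pairs at once is too heavy on RAM, the same matrix product can be applied block by block over A, which also covers the chunking part of the question without pandas. A minimal sketch, with the function name and chunk size as my own assumptions:

import numpy as np

def cosine_sim_chunks(A, B, chunk=4096):
    # Normalize rows once; each chunked matmul then yields cosine similarities directly.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    for start in range(0, A.shape[0], chunk):
        # One block of A against all of B: shape (block size, len(B)).
        yield A[start:start + chunk] @ B.T

# Example with small random data; swap in the real (64000, 1024) and (12000, 1024) arrays.
A = np.random.randn(1000, 1024).astype(np.float32)
B = np.random.randn(500, 1024).astype(np.float32)
for block in cosine_sim_chunks(A, B):
    pass  # process each block here instead of keeping all 768,000,000 pairs in memory

As for the GPU question: the same matmul maps directly onto GPU array libraries such as CuPy or PyTorch, so yes, this kind of operation runs well on a GPU with essentially the same code.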

Related

Most efficient way to calculate every L2 distance between vectors of vector array A and vectors of vector array B?

I need to implement an algorithm. But it takes a lot of time to compute and I need to make it as fast as possible.
Right now I have two numpy arrays:
Array A -> 2000 vectors of 512 elements,
Array B -> 1000 vectors of 512 elements.
I need to calculate every single distance between the vectors from Array A and Array B. Right now, I take one vector from Array A and calculate its distances to all vectors in Array B as follows:
np.sum(np.abs(B-A[0])**2,axis=-1)**(0.5)
But using this I have to loop for 2000 cycles and it takes a lot of time.
Any alternatives?
sklearn.metrics.pairwise_distances solves exactly this problem.
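A minimal sketch of how that could look with the shapes from the question (SciPy's cdist computes the same thing and is shown for comparison):

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cdist

A = np.random.randn(2000, 512)
B = np.random.randn(1000, 512)

# Both return a (2000, 1000) matrix of Euclidean (L2) distances between every pair.
D1 = pairwise_distances(A, B, metric="euclidean")
D2 = cdist(A, B, metric="euclidean")
print(np.allclose(D1, D2))  # True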

Fastest way to generate and sum arrays

I am generating a series of Gaussian arrays given an x vector of length 1400, and arrays for the sigma, center, and amplitude (amp), all of length 100. I thought the best way to speed this up would be to use numpy and a list comprehension:
g = np.sum([(amp[i]*np.exp(-0.5*(x - (center[i]))**2/(sigma[i])**2)) for i in range(len(center))],axis=0)
Each row is a Gaussian along the vector x, and then I sum the columns into a single array with the same length as x.
But this doesn't seem to speed things up at all. I think there is a faster way to do this while avoiding the for loop but I can't quite figure out how.
You should use vectorized computation instead of a comprehension so the loops are all performed at C speed.
In order to do so you have to reshape x to be a column vector. For example you could do x = x.reshape((1400,1)).
Then you can operate directly on the arrays, like this:
v = amp*np.exp(-0.5*(x - center)**2/sigma**2)
Then you obtain an array of shape (1400, 100), which you can sum down to a vector of length 1400 with np.sum(v, axis=1).
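Put together, a runnable sketch of that approach (random data standing in for the real amp/center/sigma arrays), checked against the comprehension from the question:

import numpy as np

x = np.linspace(-5, 5, 1400)
amp = np.random.rand(100)
center = np.random.uniform(-5, 5, 100)
sigma = np.random.uniform(0.1, 1.0, 100)

# Comprehension version from the question.
g_loop = np.sum([amp[i]*np.exp(-0.5*(x - center[i])**2/sigma[i]**2) for i in range(len(center))], axis=0)

# Broadcasted version: x as a (1400, 1) column against the (100,) parameter arrays.
xc = x.reshape(-1, 1)
v = amp*np.exp(-0.5*(xc - center)**2/sigma**2)   # shape (1400, 100)
g_vec = v.sum(axis=1)                            # shape (1400,)

print(np.allclose(g_loop, g_vec))  # True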
You should try to vectorize all the operations. IMHO the most efficient approach is to first convert your input data to NumPy arrays (if they were plain Python lists) and then let NumPy process the computations:
np_amp = np.array(amp)
np_center = np.array(center)
np_sigma = np.array(sigma)
g = np.sum(np_amp[:, None]*np.exp(-0.5*(x - np_center[:, None])**2/np_sigma[:, None]**2), axis=0)
The [:, None] reshapes let the (100,) parameter arrays broadcast against the (1400,) x vector, giving a (100, 1400) array that collapses to length 1400 when summed over axis 0.

Tensorflow efficient pairwise inner product

In Tensorflow (python), given a matrix X of shape (n x d), where each row is a data point, I would like to compute the pairwise inner products of these n data points, i.e., the upper triangle of XX'.
Of course I could compute the whole XX' and fetch its upper triangle, but this means I would compute the off-diagonal elements twice. How to compute these efficiently in Tensorflow (python) by computing the inner product only once per pair?
With numpy, you can do this:
import numpy as np
A = np.random.randn(5, 3)
inds = np.triu_indices(5) # upper triangle indices
# expensive way to do it
ipu1 = np.dot(A, A.T)[inds]
# possibly less expensive way to do it.
ipu2 = np.einsum('ij,ij->i', A[inds[0]], A[inds[1]])
print(np.allclose(ipu1, ipu2))
This outputs True. TensorFlow does not have a triu_indices function built in, but it is not hard to write one if needed by looking at the NumPy code. It does have einsum.
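For reference, a sketch of the same idea on the TensorFlow side, reusing NumPy's triu_indices for the index generation (written for TensorFlow 2.x eager mode; this is my adaptation, not code from the answer):

import numpy as np
import tensorflow as tf

n, d = 5, 3
X = tf.random.normal((n, d))

# Upper-triangle index pairs, computed once with NumPy.
rows, cols = np.triu_indices(n)

# Inner product of each (i, j) pair, computed only once per pair.
pairwise = tf.einsum('ij,ij->i', tf.gather(X, rows), tf.gather(X, cols))

# Check against the full Gram matrix.
full = tf.linalg.matmul(X, X, transpose_b=True)
print(np.allclose(pairwise.numpy(), tf.gather_nd(full, np.stack([rows, cols], axis=1)).numpy()))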

Vectorized Portfolio Risk

I have N pairs of portfolio weights stored in a NumPy array and would like to calculate portfolio risk, which is w * E * w_T, where E is the covariance matrix and w_T is the weight transpose. The way I came up with is to loop through each weight pair and apply the matrix multiplication. Is there a vectorized approach to this, such that given a weight pair (or, if possible, N weights that all sum to 1), I apply a single covariance matrix to each row to get the risk (i.e., without a loop)?
import numpy as np
w = np.array([[0.2,0.8],[0.5,0.5]])
covar = np.array([0.000046,0.000017,0.000017,0.000032]).reshape([2,2])
w1 = w[0].reshape([1,2]) # each row in w
#portfolio risk
np.dot(np.dot(w1,covar),w1.T)
@Adam's answer is valid, but for big arrays it can result in very big temporary arrays (N x N) and unnecessary computations (computing the off-diagonal elements).
Here's a similar, yet much more efficient solution:
(I added another weight-pair, to distinguish between the different dimensions of the problem)
w = np.array([[0.2,0.8],[0.5,0.5], [0.33, 0.67]])
covar = np.array([0.000046,0.000017,0.000017,0.000032]).reshape([2,2])
(np.dot(w, covar) * w).sum(axis=-1)
=> array([ 2.77600000e-05, 2.80000000e-05, 2.68916000e-05])
By using plain element-wise multiplication in the second step, I'm avoiding the unnecessary computations of the off-diagonals.
EDIT: explaining the temporary arrays
# first multiplication (in both solutions)
np.dot(w, covar).shape
(3, 2)
# second, my solution
(np.dot(w, covar) * w).shape
(3, 2)
# second, Adam's solution
np.dot(np.dot(w,covar),w.T).shape
(3, 3)
Now, if you have N sets of weights you want to compute risk for (in this example N=3), and M instruments in your portfolio (here M=2), and N >> M, you get a much bigger array with Adam's solution (N x N). Not only will it consume more memory, but the computation populating the off-diagonal elements (a matrix multiplication) is expensive and unnecessary.
It seems like your code is already set up for a vectorized approach, but you are only dealing with one row at a time. Grabbing the diagonals from the result when using your full weight matrix should give you what you want.
# portfolio risk
np.diagonal(np.dot(np.dot(w,covar),w.T))
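A quick sketch checking that the two answers agree; the einsum spelling at the end is an extra equivalent I'm adding for illustration, not something from either answer:

import numpy as np

w = np.array([[0.2, 0.8], [0.5, 0.5], [0.33, 0.67]])
covar = np.array([[0.000046, 0.000017], [0.000017, 0.000032]])

risk_elementwise = (np.dot(w, covar) * w).sum(axis=-1)      # avoids the N x N intermediate
risk_diagonal = np.diagonal(np.dot(np.dot(w, covar), w.T))  # builds the full N x N matrix first
risk_einsum = np.einsum('ij,jk,ik->i', w, covar, w)         # same result in a single call

print(np.allclose(risk_elementwise, risk_diagonal), np.allclose(risk_elementwise, risk_einsum))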

Python: Cosine Similarity m * n matrices

I have two M x N matrices which I construct after extracting data from images. Both have a lengthy first row, and after the 3rd row the rows thin out to a single column.
For example, the raw vector looks like this:
1,23,2,5,6,2,2,6,2,
12,4,5,5,
1,2,4,
1,
2,
2
:
Both vectors have a similar pattern: the first three rows are lengthy, and then they thin out as they progress. To do cosine similarity, I was thinking of using a padding technique to add zeros and make these two vectors N x N. I looked at Python options for cosine similarity, but some examples were using a package called numpy. I couldn't figure out how exactly numpy can do this type of padding and carry out a cosine similarity. Any guidance would be greatly appreciated.
If both arrays have the same dimension, I would flatten them using NumPy. NumPy (and SciPy) is a powerful scientific computational tool that makes matrix manipulations way easier.
Here is an example of how I would do it with NumPy and SciPy:
import numpy as np
from scipy.spatial import distance
A = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
B = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
Aflat = np.hstack(A)
Bflat = np.hstack(B)
dist = distance.cosine(Aflat, Bflat)
The result here is dist = 1.10e-16 (i.e., 0).
Note that I've used here the dtype=object because that's the only way I know to be able to store different shapes into an array in NumPy. That's why later I used hstack() in order to flatten the array (instead of using the more common flatten() function).
I would make them into a SciPy sparse matrix (http://docs.scipy.org/doc/scipy/reference/sparse.html) and then run cosine similarity from the scikit-learn module.
from scipy import sparse
from sklearn.metrics import pairwise_distances
sparse_matrix = sparse.csr_matrix(your_np_array)
distance_matrix = pairwise_distances(sparse_matrix, metric="cosine")
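Since the question mentions zero-padding, here is a minimal sketch of that route (the jagged rows reuse the example from the question; padding to a common length and then taking pairwise cosine distances is my reading of the intent rather than something spelled out above):

import numpy as np
from sklearn.metrics import pairwise_distances

rows_A = [[1, 23, 2, 5, 6, 2, 2, 6, 2], [12, 4, 5, 5], [1, 2, 4], [1], [2], [2]]
rows_B = [[1, 23, 2, 5, 6, 2, 2, 6, 2], [12, 4, 5, 5], [1, 2, 4], [1], [2], [2]]

def pad(rows, width):
    # Right-pad every jagged row with zeros so all rows share the same length.
    return np.array([r + [0]*(width - len(r)) for r in rows], dtype=float)

width = max(len(r) for r in rows_A + rows_B)
A = pad(rows_A, width)
B = pad(rows_B, width)

# (6, 6) matrix of cosine distances between every row of A and every row of B.
dist = pairwise_distances(A, B, metric="cosine")
print(dist.shape, dist[0, 0])  # (6, 6) and ~0.0 for identical rows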
Why can't you just run a nested loop over both jagged lists (presumably), summing each row using the Euclidean/vector dot product and using the result as a similarity measure? This assumes that the jagged dimensions are identical.
Although I'm not quite sure how you are getting a jagged array from a bitmap image (I would have assumed it would be a proper dense matrix of M x N form), how the jagged array of arrays above is meant to represent an M x N matrix/image, or therefore how padding the data with zeros would make sense. If this were a sparse matrix representation, one would expect row/col information annotated with the values.
