Efficiently compute cosine similarity

Efficiently compute cosine similarity - python

I have a bank of about 100k strings and when I get a new string, I want to match it to the most similar string.
My thoughts were to use tf-idf (makes sense as keywords are quite important), then match using the cosine distance. Is there an efficient way to do this using pandas/scikit-learn/scipy etc? I'm currently doing this:
df['cosine_distance'] = df.apply(lambda x: cosine_distances(x["tf-idf"], x["new_string"]), axis=1)
which is obviously quite slow. I was thinking of maybe a KD-tree, but it takes a lot of memory as the tf-idf vectors have a dimension of 2000.

Consider using vectorized computations rather than looping over DataFrame rows (which is very slow and should be avoided).
I'm not sure how the arrays are represented in the dataframe, so make sure you're starting out with two arrays of the same shape.
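import numpy as np
from numpy import einsum
from numpy.linalg import norm
# Assumes each cell holds a 1-D array of the same length; stack the cells into 2-D arrays.
arr_a = np.stack(df["tf_idf"].values)
arr_b = np.stack(df["new_string"].values)
# Row-wise dot products divided by the product of the row norms.
cos_sim = einsum('ij,ij->i', arr_a, arr_b) / (norm(arr_a, axis=1)*norm(arr_b, axis=1))
df["cosine_distance"] = 1 - cos_sim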
This code calculates the cosine distance directly using vector operations (see the numpy.einsum documentation) and will run orders of magnitude faster than the DataFrame.apply() method.
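For the original goal of matching one new string against the whole bank, a single matrix-vector product may be enough. Here is a minimal sketch, assuming the tf-idf vectors are already available as a dense 2-D array; the names `bank`, `query` and `most_similar` are hypothetical and do not come from the question.

```python
import numpy as np

def most_similar(bank, query):
    """Return (index, cosine similarity) of the bank row closest to query."""
    bank = np.asarray(bank, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    bank_norms = np.linalg.norm(bank, axis=1)
    query_norm = np.linalg.norm(query)
    # Guard against all-zero vectors to avoid dividing by zero.
    denom = np.maximum(bank_norms * query_norm, 1e-12)
    sims = (bank @ query) / denom
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Toy data standing in for 100k tf-idf vectors of dimension 2000.
bank = np.array([[1.0, 0.0, 0.0],
                 [0.0, 2.0, 2.0],
                 [3.0, 0.0, 1.0]])
query = np.array([0.0, 1.0, 1.0])
idx, sim = most_similar(bank, query)
```

The bank norms do not depend on the query, so in practice they could be computed once and reused for every new string.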

Related

Vectorizing Computation of Cosine Similarity Matrix

I have a matrix of 63695 row vectors of dim 384.
I would like to compute the cosine similarity model for this matrix.
I was thinking of vectorizing it.
How would one proceed to that objective?

If you look in scikit-learn's source code you will see that X and Y are first normalized and then X_norm @ Y_norm.T (dot product) is returned. Or if, as in your case, no Y exists, it is X_norm @ X_norm.T.
Normalizing and transposing can be neglected when looking at the runtime, but the matrix multiplication of a (63695 x 384) matrix with its transpose should take somewhere in the neighbourhood of 63695*63695 (elements in the result matrix) times 384 (multiplications needed to calculate one element) calculations, so something like 63695*63695*384 = 1,557,908,361,600 multiplications, plus roughly the same number of additions.
And as you already mentioned it requires 4 (Bytes for float32) * 63695 * 63695 = ~16.2 GB of memory to handle that result matrix.
Do you really need that enormous matrix? What type of data are you handling and what are you trying to do? If we are talking about e.g. vector representations of text data, then you should look at removing duplicates, processing the data in chunks, or reducing the dimensionality before analysing similarity. If you are looking for something like ranking these cosine similarities and finding the k most similar ones, you'd be much better off using algorithms for finding similar data points instead of doing it all by hand yourself.
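To illustrate the "processing it in chunks" suggestion, here is a rough sketch that keeps only the k most similar rows for each vector, so the full 63695 x 63695 matrix is never held in memory at once. The function name `top_k_cosine` and the `chunk_size` parameter are made up for this example, and it assumes k is smaller than the number of rows.

```python
import numpy as np

def top_k_cosine(X, k, chunk_size=1024):
    """For each row of X, return indices of its k most similar other rows."""
    X = np.asarray(X, dtype=np.float32)
    # Normalize rows once so that dot products equal cosine similarities.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_norm = X / np.maximum(norms, 1e-12)
    n = X_norm.shape[0]
    result = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a (chunk_size x n) block of the similarity matrix is in memory.
        sims = X_norm[start:stop] @ X_norm.T
        # Exclude each row's similarity with itself.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # argpartition finds the k largest per row without a full sort.
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        # Order those k candidates from most to least similar.
        order = np.argsort(-np.take_along_axis(sims, part, axis=1), axis=1)
        result[start:stop] = np.take_along_axis(part, order, axis=1)
    return result

# Tiny example: rows 0 and 1 point roughly the same way, as do rows 2 and 3.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
neighbors = top_k_cosine(X, k=1, chunk_size=2)
```

The total number of operations is unchanged; only the peak memory drops, to roughly chunk_size * n floats per block.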

Fastest way generate and sum arrays

I am generating a series of Gaussian arrays given a x vector of length (1400), and arrays for the sigma, center, amplitude (amp), all with length (100). I thought the best way to speed this up would be to use numpy and list comprehension:
g = np.sum([(amp[i]*np.exp(-0.5*(x - (center[i]))**2/(sigma[i])**2)) for i in range(len(center))],axis=0)
Each row is a gaussian along a vector x, and then I sum the columns into a single array of length x.
But this doesn't seem to speed things up at all. I think there is a faster way to do this while avoiding the for loop but I can't quite figure out how.

You should use vectorized computation instead of a comprehension so that the loops are all performed at C speed.
In order to do so you have to reshape x to be a column vector. For example you could do x = x.reshape((1400,1)).
Then you can operate directly on the arrays, like this:
v = amp*np.exp(-0.5*(x - center)**2/sigma**2)
Then you obtain an array of shape (1400,100), which you can sum into a vector of length 1400 with np.sum(v, axis=1).

You should try to vectorize all the operations. IMHO the most efficient approach is to first convert your input data to numpy arrays (if they were plain Python lists) and then let numpy process the computations:
np_amp = np.array(amp)
np_center = np.array(center)
np_sigma = np.array(sigma)
# x[:, None] turns x into a column so it broadcasts against the length-100 arrays
g = np.sum(np_amp*np.exp(-0.5*(x[:, None] - np_center)**2/np_sigma**2), axis=1)
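The broadcasting idea can be checked against the list comprehension from the question. This is only a sketch on randomly generated parameters (the value ranges are arbitrary assumptions), using the array lengths stated in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-10.0, 10.0, 1400)       # shape (1400,)
amp = rng.uniform(0.5, 2.0, size=100)    # shape (100,)
center = rng.uniform(-8.0, 8.0, size=100)
sigma = rng.uniform(0.2, 1.5, size=100)

# x[:, None] has shape (1400, 1); broadcasting against the (100,) parameter
# arrays yields a (1400, 100) array with one column per Gaussian.
v = amp * np.exp(-0.5 * (x[:, None] - center) ** 2 / sigma ** 2)
g_vectorized = v.sum(axis=1)             # shape (1400,)

# Reference: the list-comprehension version from the question.
g_loop = np.sum([amp[i] * np.exp(-0.5 * (x - center[i]) ** 2 / sigma[i] ** 2)
                 for i in range(len(center))], axis=0)
```

Both versions should agree up to floating-point rounding; the vectorized one trades a (1400, 100) temporary array for the removal of the Python-level loop.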

python - How do I calculate the similarity between pairs of documents and queries?

I have a very large dataset which is essentially document - search query pairs and I want to calculate the similarity for each pair. I've calculated the TF-IDF for each of the documents and queries. I realize that given two vectors you can calculate the similarity using linear_kernel. However, I'm not sure how to do this on a very large set of data (i.e. no for loops).
Here is what I have so far:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
df_train = pd.read_csv('train.csv')
vectorizer = TfidfVectorizer()
doc_tfidf = vectorizer.fit_transform(df_train["document"])
query_tfidf = vectorizer.transform(df_train["query"])
linear_kernel(doc_tfidf, query_tfidf)
Now this gives me an NxN matrix, where N is the number of document-query pairs I have. What I am looking for is N-size vector with a single value per document-query pair.
I realize I could do this with a for loop, but with a dataset of about 500K pairs this would not work. Is there some way that I could vectorize this calculation?
UPDATE: So I think I have a solution that works and seems to be fast. In the code above I replace:
linear_kernel(doc_tfidf, query_tfidf)
with
df_train['similarity'] = doc_tfidf.multiply(query_tfidf).sum(axis=1)
Does this seem like a sane approach? Is there a better way to do this?

Cosine similarity is typically used to compute the similarity between text documents, which in scikit-learn is implemented in sklearn.metrics.pairwise.cosine_similarity.
However, because TfidfVectorizer also performs an L2 normalization of the results by default (i.e. norm='l2'), in this case it is sufficient to compute the dot product to get the cosine similarity.
In your example, you should therefore use,
similarity = doc_tfidf.dot(query_tfidf.T).T
instead of an element-wise multiplication.
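It may help to see how the two expressions in this thread relate. The sketch below uses small hand-written matrices as stand-ins for L2-normalized tf-idf rows (the values are made up): the row-wise multiply-and-sum gives one value per document-query pair, while the full dot product gives an N x N matrix whose diagonal holds those same values.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for L2-normalized tf-idf rows (3 pairs, 4 terms).
doc = sparse.csr_matrix(np.array([[0.6, 0.8, 0.0, 0.0],
                                  [0.0, 1.0, 0.0, 0.0],
                                  [0.0, 0.0, 0.6, 0.8]]))
query = sparse.csr_matrix(np.array([[0.6, 0.8, 0.0, 0.0],
                                    [1.0, 0.0, 0.0, 0.0],
                                    [0.0, 0.0, 0.8, 0.6]]))

# Row-wise dot products: one value per (document, query) pair.
pairwise = np.asarray(doc.multiply(query).sum(axis=1)).ravel()

# Full N x N product; its diagonal holds the same per-pair values.
full = doc.dot(query.T).toarray()
```

With around 500K pairs the full product would compute every document against every query, so if only the per-pair values are needed, the row-wise form is likely the cheaper of the two.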

Numpy, avoid loop in 3d array difference nested summation

I have a simple problem for Numpy: I have 3d coordinates and I want to compute the overlap between two distinct configurations with the following function
def Overlap(rt, r0, a):
    s = 0
    for i in range(len(rt)):
        s += ((pl.norm(r0[i] - rt, axis=1) <= a).astype('int')).sum()
    return s
Where rt and r0 represent two m by 3 tables, the configurations.
Practically, it computes the distance between a vector in the first configuration and any other vector in the second, checks for a threshold value a, and returns the total sum after a loop over all the positions. Is there a smart way to avoid the explicit for loop? I have the feeling that the complexity cannot really be changed, but there is maybe a way to avoid the slowness of the native for construct.

How about the following:
from scipy.spatial.distance import cdist
import numpy as np
overlap = np.sum(cdist(rt, r0) <= a)
When m is 1000 on my machine, this is about 9x faster. It's much faster for small arrays.
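A quick way to convince yourself that the cdist one-liner counts the same pairs as the loop is to run both on random coordinates. This is a sketch only: the loop below uses numpy.linalg.norm in place of pl.norm, and the function names, array sizes and threshold are chosen arbitrarily.

```python
import numpy as np
from scipy.spatial.distance import cdist

def overlap_loop(rt, r0, a):
    """Loop version from the question, using numpy.linalg.norm."""
    s = 0
    for i in range(len(rt)):
        s += int((np.linalg.norm(r0[i] - rt, axis=1) <= a).sum())
    return s

def overlap_cdist(rt, r0, a):
    """All pairwise Euclidean distances at once, then count those within a."""
    return int(np.sum(cdist(rt, r0) <= a))

rng = np.random.default_rng(0)
rt = rng.random((200, 3))
r0 = rng.random((200, 3))
a = 0.3
count_loop = overlap_loop(rt, r0, a)
count_cdist = overlap_cdist(rt, r0, a)
```

Note that cdist materializes the full m x m distance matrix, so for very large m the memory cost grows quadratically even though the Python loop is gone.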

Python: Cosine Similarity m * n matrices

I have two M X N matrices which I construct after extracting data from images. Both have a long first row, and after the 3rd row every row has only a first column.
For example, a raw vector looks like this:
1,23,2,5,6,2,2,6,2,
12,4,5,5,
1,2,4,
1,
2,
2
:
Both vectors have a similar pattern, where the first three rows are long and then the rows thin out as they progress. To do cosine similarity I was thinking of using a padding technique to add zeros and make these two vectors N X N. I looked at Python options for cosine similarity, but some examples were using a package called numpy. I couldn't figure out how exactly numpy can do this type of padding and carry out a cosine similarity. Any guidance would be greatly appreciated.

If both arrays have the same dimension, I would flatten them using NumPy. NumPy (and SciPy) is a powerful scientific computational tool that makes matrix manipulations way easier.
Here is an example of how I would do it with NumPy and SciPy:
import numpy as np
from scipy.spatial import distance
A = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
B = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
Aflat = np.hstack(A)
Bflat = np.hstack(B)
dist = distance.cosine(Aflat, Bflat)
The result here is dist = 1.10e-16 (i.e., 0).
Note that I've used here the dtype=object because that's the only way I know to be able to store different shapes into an array in NumPy. That's why later I used hstack() in order to flatten the array (instead of using the more common flatten() function).
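If the two jagged inputs do not have identical row lengths, flattening with hstack() would misalign the entries, and the zero padding mentioned in the question becomes relevant. Below is a minimal sketch of that padding using plain NumPy; the helper names `pad_rows` and `cosine_similarity` are hypothetical, and the second input `B` is invented for illustration.

```python
import numpy as np

def pad_rows(rows, width=None):
    """Zero-pad a list of variable-length rows into a 2-D float array."""
    if width is None:
        width = max(len(r) for r in rows)
    out = np.zeros((len(rows), width), dtype=float)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out

def cosine_similarity(u, v):
    """Cosine similarity of two arrays of identical shape, after flattening."""
    u, v = np.ravel(u), np.ravel(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

A = [[1, 23, 2, 5, 6, 2, 2, 6, 2], [12, 4, 5, 5], [1, 2, 4], [1], [2], [2]]
B = [[1, 23, 2, 5, 6, 2, 2, 6], [12, 4, 5, 5, 7], [1, 2, 4], [1], [2], [3]]

# Pad both to a common width so that entries line up position by position.
width = max(max(len(r) for r in A), max(len(r) for r in B))
A_pad, B_pad = pad_rows(A, width), pad_rows(B, width)
sim = cosine_similarity(A_pad, B_pad)
```

Whether aligning by row and column position is meaningful depends on what the rows represent, which the question leaves open.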

I would make them into a scipy sparse matrix (http://docs.scipy.org/doc/scipy/reference/sparse.html) and then run cosine similarity from the scikit learn module.
from scipy import sparse
sparse_matrix = sparse.csr_matrix(your_np_array)
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
distance_matrix= pairwise_distances(sparse_matrix, metric="cosine")

Why can't you just run a nested loop over both jagged lists (presumably), summing each row's Euclidean/vector dot product and using the result as a similarity measure? This assumes that the jagged dimensions are identical.
That said, I'm not quite sure how you are getting a jagged array from a bitmap image (I would have assumed it would be a proper dense matrix of MxN form), how the jagged array of arrays above is meant to represent an MxN matrix of image data, and therefore how padding the data with zeros would make sense. If this were a sparse matrix representation, one would expect row/col information annotated with the values.
