I have a matrix of 63695 row vectors of dim 384.
I would like to compute the cosine similarity model for this matrix.
I was thinking of vectorizing it.
How would one proceed to that objective?
If you look in scikit-learns source code you will see that X and Y are first normalized and then X_norm # Y_norm.T (dot product) is returned. Or if as in your case no Y exists it is X_norm # X_norm.T.
Normalizing and transposing can be discarded when looking at the runtime, but the matrix multiplaction of a (63695 x 384) matrix should take somewhere in the neighbourhood of 63695*63695 (elements in result matrix) times 384*384 (element-wise multiplactions and additions to calculate one element) calculations, so something like 63695*63695*384*384 = 598,236,810,854,400 operations. (Or strictly, that number of multiplications plus that same number of additions.)
And as you already mentioned it requires 4 (Bytes for float32) * 63695 * 63695 = ~16.2 GB of memory to handle that result matrix.
Do you really need that enormous matrix? What type of data are you handling and what are you trying to do? If we are talking about e.g. vector represenations of text data then you should look at removing duplicates, processing it in chunks or reducing the dimensionality before analysing similarity. If you are looking for something like ranking these cosine similarities and finding then k most similar ones you'd be much better of using algorithms for finding similar data points instead of doing it all by hand yourself.
I am generating a series of Gaussian arrays given a x vector of length (1400), and arrays for the sigma, center, amplitude (amp), all with length (100). I thought the best way to speed this up would be to use numpy and list comprehension:
g = np.sum([(amp[i]*np.exp(-0.5*(x - (center[i]))**2/(sigma[i])**2)) for i in range(len(center))],axis=0)
Each row is a gaussian along a vector x, and then I sum the columns into a single array of length x.
But this doesn't seem to speed things up at all. I think there is a faster way to do this while avoiding the for loop but I can't quite figure out how.
You should use vectorized computation instead of comprehension so the loops are all performed at c speed.
In order to do so you have to reshape x to be a column vector. For example you could do x = x.reshape((1400,1)).
Then you can operate directly on the arrays, like this:
v=(amp*np.exp(-0.5*(x - (center))**2/(sigma)**2
Then you obtain an array of shape (1400,100) which you can sum up to a vector by np.sum(v, axe=1)
You should try to vectorize all the operations. IMHO the most efficient to first converts your input data to numpy arrays (if they were plain Python lists) and then let numpy process the computations:
np_amp = np.array(amp)
np_center = np.array(center)
np_sigma = np.array(sigma)
g = np.sum((np_amp*np.exp(-0.5*(x - (np_center))**2/(np_sigma)**2)),axis=0)
I have a very large dataset which is essentially document - search query pairs and I want to calculate the similarity for each pair. I've calculated the TF-IDF for each of the documents and queries. I realize that given two vectors you can calculate the similarity using linear_kernel. However, I'm not sure how to do this on a very large set of data (i.e. no for loops).
Here is what I have so far:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
df_train = pd.read_csv('train.csv')
vectorizer = TfidfVectorizer()
doc_tfidf = vectorizer.fit_transform(df_train["document"])
query_tfidf = vectorizer.transform(df_train["query"])
linear_kernel(doc_tfidf, query_tfidf)
Now this gives me an NxN matrix, where N is the number of document-query pairs I have. What I am looking for is N-size vector with a single value per document-query pair.
I realize I could do this with a for loop, but with a dataset of about 500K pairs this would not work. Is there some way that I could vectorize this calculation?
UPDATE: So I think I have a solution that works and seems to be fast. In the code above I replace:
linear_kernel(doc_tfidf, query_tfidf)
with
df_train['similarity'] = desc_tfidf.multiply(query_tfidf).sum(axis=1)
Does this seem like a sane approach? Is there a better way to do this?
Cosine similarity is typically used to compute the similarity between text documents, which in scikit-learn is implemented in sklearn.metrics.pairwise.cosine_similarity.
However, because TfidfVectorizer also performs a L2 normalization of the results by default (i.e. norm='l2'), in this case it is sufficient to compute the dot product to get the cosine similarity.
In your example, you should therefore use,
similarity = doc_tfidf.dot(query_tfidf.T).T
instead of an element-wise multiplication.
I have a simple problem for Numpy: I have 3d coordinates and I want to compute the overlap between two distinct configurations with the following function
def Overlap(rt, r0,a):
s=0
for i in range(len(rt)):
s+=(( pl.norm(r0[i]-rt ,axis=1) <=a).astype('int')).sum()
return s`
Where rt and r0 represent two m by 3 tables, the configurations.
Practically, it computes the distance between a vector in the first configuration and any other vector in the second, checks for a threshold value a, and returns the total sum after a loop over all the positions. Is there a smart way to avoid the explicit for loop? I have the feeling that the complexity cannot really be changed, but there is maybe a way to avoid the slowness of the native for construct.
How about the following:
from scipy.spatial.distance import cdist
import numpy as np
overlap = np.sum(cdist(rt, r0) <= a)
When m is 1000 on my machine, this is about 9x faster. It's much faster for small arrays
I have two M X N matrices which I construct after extracting data from images. Both the vectors have lengthy first row and after the 3rd row they all become only first column.
for example raw vector looks like this
1,23,2,5,6,2,2,6,2,
12,4,5,5,
1,2,4,
1,
2,
2
:
Both vectors have a similar pattern where first three rows have lengthy row and then thin out as it progress. Do do cosine similarity I was thinking to use a padding technique to add zeros and make these two vectors N X N. I looked at Python options of cosine similarity but some examples were using a package call numpy. I couldn't figure out how exactly numpy can do this type of padding and carry out a cosine similarity. Any guidance would be greatly appreciated.
If both arrays have the same dimension, I would flatten them using NumPy. NumPy (and SciPy) is a powerful scientific computational tool that makes matrix manipulations way easier.
Here an example of how I would do it with NumPy and SciPy:
import numpy as np
from scipy.spatial import distance
A = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
B = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
Aflat = np.hstack(A)
Bflat = np.hstack(B)
dist = distance.cosine(Aflat, Bflat)
The result here is dist = 1.10e-16 (i.e., 0).
Note that I've used here the dtype=object because that's the only way I know to be able to store different shapes into an array in NumPy. That's why later I used hstack() in order to flatten the array (instead of using the more common flatten() function).
I would make them into a scipy sparse matrix (http://docs.scipy.org/doc/scipy/reference/sparse.html) and then run cosine similarity from the scikit learn module.
from scipy import sparse
sparse_matrix= scipy.sparse.csr_matrix(your_np_array)
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
distance_matrix= pairwise_distances(sparse_matrix, metric="cosine")
Why cant you just run a nested loop over both jagged lists (presumably), summating each row using Euclidian/vector dot product and using the result as a similarity measure. This assumes that the jagged dimensions are identical.
Although I'm not quite sure how you are getting a jagged array from a bitmap image (I would of assumed it would be a proper dense matrix of MxN form) or how the jagged array of arrays above is meant to represent an MxN matrix/image data, and therefore, how padding the data with zeros would make sense? If this was a sparse matrix representation, one would expect row/col information annotated with the values.