Determining cosine similarity for large datasets - python

I am currently working with a dataset of over 2.5 million images, which I compare against each other for use in a content-based recommendation engine.
I use the following code to calculate the cosine similarity using some precomputed embeddings.
    from sklearn.metrics import pairwise_distances

    cosine_similarity = 1 - pairwise_distances(embeddings, metric='cosine')
However, I've estimated that creating this similarity matrix would require around 11,000 GB of memory.
Are there any alternatives to computing a similarity metric between every pair of data points in my dataset, or is there another way to go about this whole process?

You have 2,500,000 entries, so the resulting matrix would have 6.25e+12 entries. You need to ask yourself what you plan to do with this data, compute only what you need, and then the storage will follow. Computing a cosine distance is almost free (it is literally a dot product), so you can always do it "on the fly" with no need to precompute; the question really boils down to how much actual time/compute you can spend.
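As a minimal sketch of the "on the fly" idea (assuming embeddings is an (N, d) NumPy array and you only ever need the scores for one query item at a time):

    import numpy as np

    def cosine_scores(embeddings, query_idx):
        """Cosine similarity of one item against all others, computed on demand.

        embeddings: (N, d) array of precomputed embeddings.
        query_idx:  row index of the item to compare against the rest.
        Memory cost is O(N) per query instead of O(N^2) for the full matrix.
        """
        # Normalize once (this can be cached across queries) so cosine
        # similarity reduces to a plain dot product.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        unit = embeddings / norms
        return unit @ unit[query_idx]          # shape (N,)

    # Example: top-10 most similar items to item 0 (excluding itself).
    # scores = cosine_scores(embeddings, 0)
    # top10 = np.argsort(-scores)[1:11]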

If you have a recommendation business problem built on these 2.5 million images, you may want to check TensorFlow Recommenders, which basically uses around 30% of the data for retrieval, and you can then run a second ranking classifier on top of the initial model to explore more. This two-step approach is key to working within memory constraints and is already battle-tested by Instagram and others.
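A rough sketch of that two-stage idea in plain NumPy/scikit-learn (the reduced dimensionality, candidate count, and the cosine re-ranking step are illustrative assumptions, not the TensorFlow Recommenders API; there the ranking stage would be a learned model):

    import numpy as np
    from sklearn.decomposition import PCA

    # Stage 1: retrieval on cheap, low-dimensional embeddings.
    pca = PCA(n_components=32)
    coarse = pca.fit_transform(embeddings)          # (N, 32), much cheaper to score
    coarse /= np.linalg.norm(coarse, axis=1, keepdims=True)

    def recommend(query_idx, k_retrieve=500, k_final=10):
        # Retrieve a small candidate pool by dot product in the reduced space.
        candidate_scores = coarse @ coarse[query_idx]
        candidates = np.argpartition(-candidate_scores, k_retrieve)[:k_retrieve]

        # Stage 2: re-rank only the candidates using the full embeddings.
        full = embeddings[candidates]
        full = full / np.linalg.norm(full, axis=1, keepdims=True)
        q = embeddings[query_idx] / np.linalg.norm(embeddings[query_idx])
        reranked = candidates[np.argsort(-(full @ q))]
        return reranked[reranked != query_idx][:k_final]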

Related

Community detection for larger than memory embeddings dataset

I currently have a dataset of textual embeddings (768 dimensions). The current number of records is ~1 million. I am looking to detect related embeddings through a community detection algorithm. For small data sets, I have been able to use this one:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py
It works great, but it doesn't really scale once the dataset grows larger than memory.
The key here is that I am able to specify a threshold for community matches. I have been able to find clustering algorithms that scale to larger-than-memory data, but I always have to specify a fixed number of clusters ahead of time. I need the system to detect the number of clusters for me.
I'm certain there is a class of algorithms - and hopefully a Python library - that can handle this situation, but I have been unable to locate one. Does anyone know of an algorithm or a solution I could use?
That seems small enough that you could just rent a bigger computer.
Nevertheless, to answer the question: typically the play is to cluster the data into a few chunks (overlapping or not) that fit in memory and then apply a higher-quality in-memory clustering algorithm to each chunk. One typical strategy for cosine similarity is to cluster by SimHashes, as sketched below, but there's a whole literature out there; if you already have a scalable clustering algorithm you like, you can use that.
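For the SimHash route, here is a minimal sketch (random hyperplanes; the number of bits is a tunable assumption) that partitions embeddings into buckets you can then cluster in memory:

    import numpy as np
    from collections import defaultdict

    def simhash_buckets(embeddings, n_bits=16, seed=0):
        """Bucket embeddings by the sign pattern of random projections.

        Vectors with small cosine distance tend to land in the same bucket,
        so each bucket can be clustered in memory with a better algorithm.
        """
        rng = np.random.default_rng(seed)
        hyperplanes = rng.standard_normal((embeddings.shape[1], n_bits))
        bits = (embeddings @ hyperplanes) > 0                  # (N, n_bits) booleans
        keys = (bits * (1 << np.arange(n_bits))).sum(axis=1)   # integer bucket id per row
        buckets = defaultdict(list)
        for i, key in enumerate(keys):
            buckets[key].append(i)
        return buckets

Each bucket (or the union of a bucket and its Hamming-distance-1 neighbours, if you want overlap) can then be fed to an in-memory method such as the fast_clustering example linked above.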

Clustering Parallelization Approach in Python for Strings (using Levenshtein as distance) on a bigger data set

Recently, at work, I was given the task of clustering ~1 million strings by similarity. As the distance function I chose Levenshtein, and for clustering, agglomerative clustering.
So far so good; the code worked easily enough, but the problem I'm now facing is that the distance matrix just gets too big. To parallelize it, I could do batches of, say, 10k x 10k, but merging those classification vectors is not possible.
My understanding is that bottom-up approaches cannot be easily merged. Could there be a top-down approach to cluster them? From my research, k-means does not seem to work with Levenshtein distance. I hope someone can help me with a new idea for approaching the parallelization problem and for merging those subset results into a single one.
Greetings, Tony

Compare similarity of multiple texts using Python

So I have about 300-500 text articles whose similarity I would like to compare, to figure out which are related or duplicates; some articles might address the same topics without being identical. To tackle this I started experimenting with spaCy and its similarity function. The problem is that similarity only compares two documents at a time, so I think I would need to loop over every single text and compare it to every other one, which is a very slow and memory-consuming process. Is there a way around this?
I don't know how you plan to compare similarities between texts, but let's say you are going to compare each one to the others using Jaccard or cosine similarity.
Then, you could use the all-pairs similarity search proposed in this paper, which has an implementation here. This algorithm is extremely fast, especially for such a small data size.
The all-pairs search returns pairs of documents and their similarity, so if you want to find a "family" of similar documents, you will further need to apply a graph traversal such as DFS. A Stack Overflow post on Python tuples shows how to do this with adjacency lists in O(n + m) time.
Here's an example that uses the all-pairs algorithm to try to find reposts in the Reddit jokes subreddit.
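As a sketch of that grouping step, assuming the all-pairs search yields (doc_a, doc_b, score) tuples (the threshold value is an illustrative assumption):

    from collections import defaultdict

    def similarity_families(pairs, threshold=0.8):
        """Group documents into families from (doc_a, doc_b, score) tuples
        using a depth-first search over the similarity graph. O(n + m) time."""
        graph = defaultdict(set)
        for a, b, score in pairs:
            if score >= threshold:
                graph[a].add(b)
                graph[b].add(a)

        seen, families = set(), []
        for start in graph:
            if start in seen:
                continue
            stack, family = [start], []
            while stack:
                node = stack.pop()
                if node in seen:
                    continue
                seen.add(node)
                family.append(node)
                stack.extend(graph[node] - seen)
            families.append(family)
        return families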

Scaling t-SNE to millions of observations in scikit-learn

t-SNE can supposedly scale to millions of observations (see here), but I'm curious how that can be true, at least in the scikit-learn implementation.
I'm trying it on a dataset with ~100k items, each with ~190 features. Now, I'm aware that I can do a first pass of dimensionality reduction with, e.g. PCA, but the problem seems more fundamental.
t-SNE computes and stores the full, dense similarity matrix for the input observations (I've confirmed this by looking at the source). In my case, this is a 10-billion-element dense matrix, which by itself requires 80+ GB of memory. Extrapolate this to just one million observations, and you're looking at 8 terabytes of RAM just to store the distance matrix (let alone the computation time...).
So, how can we possibly scale t-SNE to millions of datapoints in the sklearn implementation? Am I missing something? The sklearn docs at least imply that it's possible:
By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples.
That's my emphasis, but I would certainly read that as implying the Barnes-Hut method can scale to millions of examples. I'll reiterate, though, that the code requires calculating the full distance matrix well before we even get to any of the actual t-SNE transformations (with or without Barnes-Hut).
So am I missing something? Is it possible to scale this up to millions of datapoints?
Barnes-Hut does NOT require you to compute and store the full, dense similarity matrix for the input observations.
Also, take a look at the references mentioned in the documentation. In particular, this one. Quoting that page:
The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets. We applied it on data sets with up to 30 million examples.
That page also links to this talk about how the approximation works: Visualizing Data Using t-SNE.
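For reference, a short usage sketch with scikit-learn (assuming a reasonably recent version, where method='barnes_hut' works on sparse nearest-neighbour affinities rather than a dense N x N matrix; X is a placeholder for your data):

    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # Optional first pass: PCA keeps the neighbour search cheap.
    X_reduced = PCA(n_components=50).fit_transform(X)   # X: (n_samples, n_features)

    # Barnes-Hut approximation (the default method) runs in O(N log N)
    # and only needs approximate nearest-neighbour affinities.
    X_embedded = TSNE(n_components=2, method='barnes_hut',
                      perplexity=30).fit_transform(X_reduced)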
I recommend using another algorithm called UMAP. It has been shown to perform at least as well as t-SNE, and in most cases it performs better. Most importantly, it scales significantly better. The two approaches are similar, so they generate similar results, but UMAP is a lot faster (look at the last graph here: https://umap-learn.readthedocs.io/en/latest/benchmarking.html). You can look at the original paper and the following links for details.
https://www.nature.com/articles/nbt.4314.pdf
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
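A minimal usage sketch with the umap-learn package (parameter values here are just illustrative defaults; X is a placeholder for your data):

    import umap

    # n_neighbors and min_dist are the two main knobs; the defaults work reasonably well.
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
    X_embedded = reducer.fit_transform(X)   # X: (n_samples, n_features)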
OpenVisuMap (on GitHub) implements t-SNE without resorting to approximation. It uses the GPU to calculate the distance matrix on the fly. It still has O(N^2) computation complexity, but only O(N) memory complexity.
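The O(N) memory idea can be illustrated on the CPU with plain NumPy: compute the distance matrix one block of rows at a time, keep only what you need (here the k nearest neighbours per row), and discard the block. This is only a sketch of the principle, not OpenVisuMap's GPU implementation:

    import numpy as np

    def knn_cosine_blocked(embeddings, k=10, block=1024):
        """k nearest neighbours by cosine similarity without storing the N x N matrix."""
        unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        n = unit.shape[0]
        neighbours = np.empty((n, k), dtype=np.int64)
        for start in range(0, n, block):
            stop = min(start + block, n)
            sims = unit[start:stop] @ unit.T           # (block, N), freed each iteration
            # Exclude each point's similarity to itself before taking the top k.
            sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
            neighbours[start:stop] = np.argpartition(-sims, k, axis=1)[:, :k]
        return neighbours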

How can Latent Semantic Indexing be used for feature selection?

I am studying some machine-learning and I have come across, in several places, that Latent Semantic Indexing may be used for feature selection. Can someone please provide a brief, simplified explanation of how this is done? Ideally both theoretically and in commented code. How does it differ from Principal Component Analysis?
The language it's written in doesn't really matter to me, as long as I can understand both the code and the theory.
LSA is conceptually similar to PCA, but is used in different settings.
The goal of PCA is to transform data into a new, possibly lower-dimensional space. For example, if you wanted to recognize faces and used 640x480-pixel images (i.e. vectors in a 307,200-dimensional space), you would probably try to reduce this space to something reasonable, both to make computation simpler and to make the data less noisy. PCA does exactly this: it "rotates" the axes of your high-dimensional space and assigns a "weight" to each of the new axes, so that you can throw away the least important ones.
LSA, on the other hand, is used to analyze the semantic similarity of words. It can't handle images, bank data, or other custom datasets; it is designed specifically for text processing and works specifically with term-document matrices. Such matrices, however, are often too large, so they are reduced to lower-rank matrices in a way very similar to PCA (both use SVD). Feature selection, though, is not performed here. Instead, what you get is a feature-vector transformation: the SVD provides a transformation matrix (call it S) which, multiplied by an input vector x, gives a new vector x' in a smaller space with a more meaningful basis.
This new basis gives you your new features. They are not selected, though, but rather obtained by transforming the old, larger basis.
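As a short commented sketch of that transformation (using scikit-learn's TruncatedSVD on a TF-IDF term-document matrix, which is one common way to do LSA; the tiny corpus is made up):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the cat sat on the mat",
            "dogs and cats are pets",
            "stocks fell sharply on monday",
            "the market rallied after the report"]

    # Term-document matrix (documents x terms, as scikit-learn produces it).
    X = TfidfVectorizer().fit_transform(docs)

    # LSA: low-rank SVD of the matrix. Each document is re-expressed in a small
    # "topic" basis; the features are transformed, not selected.
    lsa = TruncatedSVD(n_components=2)
    X_lsa = lsa.fit_transform(X)          # shape: (n_docs, 2)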
For more details on LSA, as well as implementation tips, see this article.
