I have a pre-made database full of 512-dimensional vectors and want to implement an efficient search algorithm over them.
Research
Cosine similarity:
The best approach in this case would be based on the cosine similarity measure, which is basically a normalized dot product. In Python:
import numpy

def cossim(a, b):
    # cosine similarity = dot product divided by the product of the norms
    return numpy.inner(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))
Linear search:
The most obvious and simple search in this case would be a linear O(n) scan, which iterates over the whole database and eventually picks the most similar result:
def linear_search(query_vec, db):  # where db is a collection of 512-D vectors
    most_similar = (None, -1.0)  # (vector, similarity); -1 is the lowest possible cosine similarity
    for vector in db:
        current_sim = cossim(query_vec, vector)  # cossim function defined above
        if current_sim > most_similar[1]:
            most_similar = (vector, current_sim)
    return most_similar[0]
As you can see, the whole database has to be scanned, which can be quite inefficient if the database contains hundreds of thousands of vectors.
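As a side note, the scan itself does not have to be a Python loop: if the database fits in memory as a single NumPy matrix (an assumption on my part), the same O(n) search collapses into one matrix-vector product, which is usually far faster in practice. A minimal sketch:
import numpy as np

def linear_search_vectorized(query_vec, db_matrix):
    # db_matrix: (n, 512) array, query_vec: (512,) array
    # Normalize rows and query once so plain dot products become cosine similarities
    db_unit = db_matrix / np.linalg.norm(db_matrix, axis=1, keepdims=True)
    q_unit = query_vec / np.linalg.norm(query_vec)
    sims = db_unit @ q_unit              # shape (n,), one similarity per database vector
    return db_matrix[np.argmax(sims)]    # most similar original vector
This is still linear in n; it just trades the interpreted loop for BLAS.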
Quasilinear search: (partially resolved)
There is a fundamental relation between cosine similarity and Euclidean distance (explained very well in this answer): once the vectors are normalized to unit length, the Euclidean distance follows from the equation
|a - b|² = 2(1 - cossim(a, b))
As mentioned in the answer, the Euclidean distance gets smaller as the cosine between two vectors gets larger, therefore we can turn this into the closest pair of points problem, which can be solved in quasilinear O(n log n) time using a recursive divide-and-conquer algorithm.
Thus I would have to implement my own divide-and-conquer algorithm that finds the closest pair of 512-dimensional vectors.
But unfortunately, this problem can't be solved that way directly due to the high dimensionality of the vectors: the classical divide-and-conquer algorithm is specialized for two dimensions.
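The equivalence itself is still useful, though: after normalizing everything to unit length, any Euclidean nearest-neighbour method also answers the cosine query. A minimal sanity check of the identity above (my own sketch, not from the linked answer):
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=512)
b = rng.normal(size=512)
a /= np.linalg.norm(a)   # normalize to unit length
b /= np.linalg.norm(b)

# |a - b|^2 == 2 * (1 - cos(a, b)) holds once both vectors have unit norm
lhs = np.sum((a - b) ** 2)
rhs = 2 * (1 - np.dot(a, b))
assert np.isclose(lhs, rhs)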
Indexing for binary search (unresolved):
To my knowledge, the best way to optimize cosine similarity search in terms of speed would be to build an index and then perform a binary-search-style lookup over it.
The main problem here is that indexing 512-dimensional vectors is quite difficult, and I'm not yet aware of anything other than locality-sensitive hashing (LSH) that may or may not be useful for indexing my database (my main concern is dimensionality reduction, which could cause a corresponding loss of accuracy).
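For what it's worth, the classic LSH family for cosine similarity is random-hyperplane (sign-of-random-projection) hashing: two vectors land in the same bucket with a probability that grows with the cosine between them, so only a small candidate set needs an exact comparison. A rough sketch of the idea (my own illustration, not a production index; n_bits and the single hash table are arbitrary choices):
import numpy as np

class RandomHyperplaneLSH:
    def __init__(self, dim=512, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # one random hyperplane per bit
        self.buckets = {}

    def _hash(self, vec):
        # Each bit records on which side of a random hyperplane the vector falls
        return tuple((self.planes @ vec > 0).astype(int))

    def add(self, vec):
        self.buckets.setdefault(self._hash(vec), []).append(vec)

    def candidates(self, query):
        # Candidates share all hash bits with the query; rank them with exact cossim afterwards
        return self.buckets.get(self._hash(query), [])
In practice one would use several such tables (or probe neighbouring buckets) to trade memory for recall.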
There is a newer Angular Multi-Index Hashing method, but unfortunately it only works for binary vectors, and dimension-independent similarity computation only applies when the vectors are sparse, which mine are not.
Finally, there is also An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions, which at first glance might be the best solution, but the paper states:
Unfortunately, exponential factors in query time do imply that our algorithm is not practical for large values of d. However, our empirical evidence in Section 6 shows that the constant factors are much smaller than the bound given in Theorem 1 for the many distributions that we have tested. Our algorithm can provide significant improvements over brute-force search in dimensions as high as 20, with a relatively small average error.
We are trying to query vectors with 20 × 25.6 = 512 dimensions, far beyond the 20 quoted above, which makes the algorithm above highly inefficient for our case.
There is a similar question with similar concerns, but unfortunately no indexing solution was found there.
Problem
Is there any way to optimize cosine similarity search for such vectors other than quasilinear search? Perhaps there is some other way of indexing high-dimensional vectors? I believe something like this has already been done before.
Closest solution
I believe I have found something that might be a solution: randomized partition trees for indexing vector databases with a few hundred dimensions, which seems to be exactly what I need (see here).
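For reference, the Annoy library appears to implement this family of ideas (a forest of random-projection trees with an angular metric); a minimal usage sketch, assuming the database is available as a Python list called vectors and the query as query_vec:
from annoy import AnnoyIndex

index = AnnoyIndex(512, "angular")   # "angular" corresponds to cosine distance
for i, vec in enumerate(vectors):    # vectors: the pre-made 512-D database
    index.add_item(i, vec)
index.build(10)                      # 10 trees; more trees -> better recall, bigger index

nearest_ids = index.get_nns_by_vector(query_vec, 10)  # 10 approximate nearest neighbours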
Thank you!
Related
I have N three-dimensional vectors (x, y, z).
I want a simple yet effective approach for clustering these vectors (I do not know a priori the number of clusters, nor can I guess a valid number). I am not familiar with classical machine learning so any advice would be helpful.
The general Sklearn clustering page does a decent job of providing useful background on clustering methods and provides a nice overview of what the differences are between methods. Importantly for your case the table in section 2.3.1 lists the parameters of each method.
The differences between methods tend to be based on how the knowledge you have of the dataset matches the assumptions of each model. Some expect you to know the number of clusters (such as K-Means) while others will attempt to determine the number of clusters based on other input parameters (like DBSCAN).
While focusing on methods which attempt to find the number of clusters seems like it might be preferable, it is also possible to use a method which expects the number of clusters and simply test many different reasonable cluster counts to determine which one is optimal. One such example with K-Means is this.
The easiest clustering algorithms would be K-Means (if your three features are numerical) and K-Medoids (which allows any type of feature).
These algorithms are quite easy to understand. In a few words, by calculating some distance measure between each observation in the dataset, they try to assign each observation to the cluster closest (in distance) to it. The main issue with these algorithms is that you have to specify how many clusters (K) you want, but there are techniques such as the Elbow method or the Silhouette that allow us to determine numerically which value of K would be a reasonable number of clusters.
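For instance, a minimal sketch of the Silhouette approach with sklearn (assuming X is your N×3 array of vectors; the range of K values is just a guess):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(2, 11):                   # try a range of plausible cluster counts
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # in [-1, 1]; higher means better-separated clusters
    if score > best_score:
        best_k, best_score = k, score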
I've got a clustering problem that I believe requires an intuitive distance function. Each instance has an x, y coordinate but also has a set of attributes that describe it (varying in number per instance). Ideally it would be possible to pass it Python objects (instances of a class) and compare them arbitrarily based on their content.
I want to represent the distance as a weighted sum of the Euclidean distance between the x, y values and something like a Jaccard index to measure the set overlap of the other attributes. Something like:
dist = (euclidean(x1, y1, x2, y2) * 0.6) + ((1 - jaccard(attrs1, attrs2)) * 0.4)
Most of the clustering algorithms and implementations I've found convert instance features into numbers. For example, with DBSCAN in sklearn, to use my distance function I would need to convert the numbers back into the original representation somehow.
It would be great if it were possible to do the clustering using a distance function that can compare instances in any arbitrary way. For example, imagine a Euclidean distance function that evaluates objects as closer if they match on another, non-spatial feature:
import math

def dist(ins1, ins2):
    # Plain 2-D Euclidean distance between the instances' coordinates
    euc = math.hypot(ins1.x - ins2.x, ins1.y - ins2.y)
    # Treat instances as 10% closer when they share a non-spatial feature
    if ins1.feature1 == ins2.feature1:
        euc = euc * 0.9
    return euc
Is there a method that would suit this? It would also be nice if the number of clusters didn't have to be set upfront (but this is not critical for me).
Actually, almost all the clustering algorithms (except for k-means, which needs numbers to compute the mean, obviously) can be used with arbitrary distance functions.
In sklearn, most algorithms accept metric="precomputed" and a distance matrix instead of the original input data. Please check the documentation more carefully. For example DBSCAN:
If metric is “precomputed”, X is assumed to be a distance matrix and must be square.
What you lose is the ability to accelerate some algorithms by indexing. Computing a distance matrix is O(n^2), so your algorithm cannot be faster than that. In sklearn, you would need to modify the sklearn Cython code to add a new distance function (using a pyfunc will yield very bad performance, unfortunately). Java tools such as ELKI can be extended with little overhead because the Just-in-time compiler of Java optimizes this well. If your distance is metric then many indexes can be used for acceleration of e.g. DBSCAN.
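For example, a minimal sketch of that precomputed-matrix route, reusing the dist function from the question (eps and min_samples are placeholders you would have to tune):
import numpy as np
from sklearn.cluster import DBSCAN

def distance_matrix(instances, dist):
    n = len(instances)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):            # O(n^2) pairwise custom distances
            D[i, j] = D[j, i] = dist(instances[i], instances[j])
    return D

D = distance_matrix(instances, dist)         # instances: your list of Python objects
labels = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit_predict(D)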
I am given 10,000 64-dimensional vectors and need to find the vector with the least Euclidean distance to an arbitrary point.
The tricky part is that these 10,000 vectors move. Most of the algorithms I have seen assume stationary points and thus can make good use of indexes. I imagine it will be too expensive to rebuild indexes on every timestep.
Below is the pseudo code.
for timestep in range(100000):
    data = get_new_input()
    nn = find_nearest_neighbor(data)
    nn.move_towards(data)
One thing to note is that the vectors only move a little bit on each timestep, about 1%-5%. One non-optimal solution is to rebuild the index every ~1000 timesteps. It is OK if the nearest neighbor is approximate. Maybe using each vector's momentum would be useful?
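Here is a minimal sketch of that periodic-rebuild idea, assuming the vectors sit in a (10000, 64) NumPy array called points, using scipy's cKDTree, and treating move_towards from the pseudocode as a small step of the matched vector towards the input (the step size and rebuild interval are placeholders):
import numpy as np
from scipy.spatial import cKDTree

REBUILD_EVERY = 1000                      # rebuild the index only occasionally
tree = cKDTree(points)                    # points: (10000, 64) array of moving vectors

for timestep in range(100000):
    data = get_new_input()                # placeholder from the pseudocode above
    if timestep % REBUILD_EVERY == 0:
        tree = cKDTree(points)            # refresh the index with current positions
    _, idx = tree.query(data)             # approximate: the index may be slightly stale
    points[idx] += 0.05 * (data - points[idx])   # move the neighbour ~5% towards the input
Note that KD-trees lose much of their advantage in 64 dimensions, so an approximate index may be a better fit than an exact one here.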
I am wondering what is the best algorithm to use in this scenario?
t-SNE can supposedly scale to millions of observations (see here), but I'm curious how that can be true, at least in the Sklearn implementation.
I'm trying it on a dataset with ~100k items, each with ~190 features. Now, I'm aware that I can do a first pass of dimensionality reduction with, e.g. PCA, but the problem seems more fundamental.
t-SNE computes and stores the full, dense similarity matrix calculated for the input observations (I've confirmed this by looking at the source). In my case, this is a 10-billion-element dense matrix, which by itself requires 80 GB+ of memory. Extrapolate this to just one million observations, and you're looking at 8 terabytes of RAM just to store the distance matrix (let alone computation time...).
So, how can we possibly scale t-SNE to millions of datapoints in the sklearn implementation? Am I missing something? The sklearn docs at least imply that it's possible:
By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples.
That's my emphasis, but I would certainly read that as implying the Barnes-Hut method can scale to millions of examples. However, I'll reiterate that the code requires calculating the full distance matrix well before we even get to any of the actual t-SNE transformations (with or without Barnes-Hut).
So am I missing something? Is it possible to scale this up to millions of datapoints?
Barnes-Hut does NOT require you to compute and store the full, dense similarity matrix calculated for the input observations.
Also, take a look at the references mentioned in the documentation. In particular, this one. Quoting that page:
The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets. We applied it on data sets with up to 30 million examples.
That page also links to this talk about how the approximation works: Visualizing Data Using t-SNE.
I recommend using another algorithm called UMAP. It has been shown to perform at least as well as t-SNE and in most cases it performs better. Most importantly, it scales significantly better. The two approaches to the problem are similar, so they produce similar results, but UMAP is a lot faster (look at the last graph here: https://umap-learn.readthedocs.io/en/latest/benchmarking.html). You can look at the original paper and the following links for details.
https://www.nature.com/articles/nbt.4314.pdf
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
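A minimal usage sketch, assuming the umap-learn package and a data matrix X (the parameter values here are just the common defaults):
import umap

# n_neighbors and min_dist control the local/global balance of the embedding
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2).fit_transform(X)
# embedding has shape (n_samples, 2) and can be scatter-plotted directly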
OpenVisuMap (on GitHub) implements t-SNE without resorting to approximation. It uses the GPU to calculate the distance matrix on the fly. It still has O(N^2) computation complexity, but only O(N) memory complexity.
I'm curious if it is possible to specify your own distance function between two points for scipy clustering. I have datapoints with 3 values: GPS-lat, GPS-lon, and posix-time. I want to cluster these points using some algorithm: either agglomerative clustering, meanshift, or something else.
The problem is that the distance between GPS points needs to be calculated with the Haversine formula, and that distance then needs to be weighted appropriately so it is comparable with a distance in seconds for clustering purposes.
Looking at the documentation for scipy I don't see anything that jumps out as a way to specify a custom distance between two points.
Is there another way I should be going about this? I'm curious what the Pythonic thing to do is.
You asked for sklearn, but I don't have a good answer for you there. Basically, you could build a distance matrix the way you like, and many algorithms will process the distance matrix. The problem is that this needs O(n^2) memory.
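For illustration, a sketch of that distance-matrix route with SciPy (the time weight and the cut threshold are placeholders you would need to tune; points is assumed to be an (n, 3) array of lat, lon, posix-time rows):
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def geo_time_dist(p, q, time_weight=0.001):
    # p, q are (lat, lon, posix_time) rows; Haversine for the spatial part
    lat1, lon1, lat2, lon2 = map(np.radians, (p[0], p[1], q[0], q[1]))
    h = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    meters = 2 * 6371000 * np.arcsin(np.sqrt(h))
    seconds = abs(p[2] - q[2])
    return meters + time_weight * seconds   # heuristic weighting, very use-case dependent

D = pdist(points, metric=geo_time_dist)     # condensed O(n^2) pairwise distance matrix
labels = fcluster(linkage(D, method="average"), t=500, criterion="distance")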
For my attempts at clustering geodata, I have instead used ELKI (which is Java, not Python). First of all, it includes geodetic distance functions; but it also includes index acceleration for many algorithms and for this distance function.
I have not used an additional attribute such as time. As you already noticed, you need to weight them appropriately, as 1 meter does not equal 1 second. Weights will be very much use-case dependent, and heuristic.
Why I'm suggesting ELKI is because they have a nice Tutorial on implementing custom distance functions that then can be used in most algorithms. They can't be used in every algorithm - some don't use distance at all, or are constrained to e.g. Minkowski metrics only. But a lot of algorithms can use arbitrary (even non-metric) distance functions.
There is also a follow-up tutorial on index-accelerated distance functions. For my geodata, indexes were tremendously useful, speeding things up by a factor of over 100 and thus enabling me to process 10 times more data.