I have 10,000 64-dimensional vectors, and I need to find the vector with the smallest Euclidean distance to an arbitrary query point.
The tricky part is that these 10,000 vectors move. Most of the algorithms I have seen assume stationary points and thus can make good use of indexes. I imagine it will be too expensive to rebuild indexes on every timestep.
Below is the pseudocode:

    for timestep in range(100000):
        data = get_new_input()
        nn = find_nearest_neighbor(data)
        nn.move_towards(data)
One thing to note is that the vectors only move a little bit on each timestep, about 1%-5%. One non-optimal solution would be to rebuild the indexes every ~1000 timesteps. It is OK if the nearest neighbor is approximate. Maybe using each vector's momentum would be useful?
I am wondering what is the best algorithm to use in this scenario?
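For reference, the periodic-rebuild baseline mentioned above might look roughly like this sketch (get_new_input, the 1000-step interval, and the 5% step size are placeholders, and scipy's cKDTree is just one possible index):

    import numpy as np
    from scipy.spatial import cKDTree

    vectors = np.random.rand(10_000, 64)   # the moving vectors
    REBUILD_EVERY = 1_000                  # placeholder rebuild interval

    tree = None
    for timestep in range(100_000):
        if timestep % REBUILD_EVERY == 0:
            tree = cKDTree(vectors)        # rebuild the (slightly stale) index
        data = get_new_input()             # placeholder: assumed to return a length-64 array
        _, idx = tree.query(data)          # approximate NN: the index lags the moving vectors
        vectors[idx] += 0.05 * (data - vectors[idx])   # move ~5% towards the query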
Related
Pointcloud of rope with desired start and end point
I have a point cloud of a rope-like object with about 300 points. I'd like to sort the 3D coordinates of that point cloud so that one end of the rope has index 0 and the other end has index 300, as shown in the image. Other point clouds of that object might be U-shaped, so I can't sort by the X, Y, or Z coordinate. Because of that I also can't sort by the distance to a single point.
I have looked at KDTree from sklearn and scipy to compute the nearest neighbour of each point, but I don't know how to go from there to sorting the points into an array without getting duplicate entries.
Is there a way to sort these coordinates into an array so that, starting from one end, the array gets appended with the coordinates of the next closest point?
First of all, there is obviously no strict solution to this problem (there is not even a strict definition of what you want to get). So anything you write will be a heuristic of some sort, which will fail in some cases, especially as your point cloud takes on a non-trivial shape (do you allow loops in your rope, for example?).
That said, a simple approach is to build a graph with the points as vertices, where every two points are connected by an edge whose weight is the straight-line distance between them.
Then build a minimum spanning tree of this graph. This provides a kind of skeleton for your point cloud, and you can devise simple algorithms on top of it.
For example, sort all points by their distance to the start of the rope measured along the tree: there is only one path between any two vertices of a tree, so for each vertex compute the length of its path to the rope's start, and sort the vertices by that distance.
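A minimal sketch of that idea with scipy, assuming points is an (N, 3) numpy array and start is the index of the rope end you want at position 0:

    import numpy as np
    from scipy.spatial import distance_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

    def sort_along_rope(points, start=0):
        dists = distance_matrix(points, points)        # complete graph of pairwise distances
        mst = minimum_spanning_tree(dists)             # skeleton of the point cloud
        # length of the unique tree path from `start` to every other point
        path_len = shortest_path(mst, directed=False, indices=start)
        order = np.argsort(path_len)
        return points[order], order

If you don't know which point is an end of the rope, one heuristic is to run shortest_path from an arbitrary point first and take the farthest point as start; in a tree that farthest point is an endpoint of the longest path, which for a rope-like cloud should be one of its ends.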
As suggested in the other answer, there is no strict solution to this problem, and there can be edge cases such as loops, spirals, or tubes, but you can go with a heuristic approach for your use case. Read about heuristic approaches such as hill climbing, simulated annealing, genetic algorithms, etc.
For any heuristic approach you need a way to measure how good a solution is: if I give you two arrays of 300 elements, how will you identify which one is better than the other? That measure depends on your use case.
One approach off the top of my mind: hill climbing.
Measure of solution quality: take the Euclidean distance between all adjacent elements of the array and sum those distances.
Steps:
1. Create a randomly ordered array of all 300 elements.
2. Select two random indices and swap the elements at those positions, then check whether this improves your answer (i.e., whether the sum of Euclidean distances between adjacent elements decreases).
3. If it improves the answer, keep the swap; otherwise undo it.
4. Repeat steps 2-3 for a large number of epochs (e.g., 10^6).
This solution can stagnate in local optima because there is a lack of diversity; for better results use simulated annealing or a genetic algorithm.
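A bare-bones sketch of that hill-climbing loop, assuming points is an (N, 3) numpy array; the number of epochs and the plain random-swap move are arbitrary choices:

    import random
    import numpy as np

    def path_length(order, points):
        # sum of Euclidean distances between adjacent elements of the ordering
        return np.linalg.norm(np.diff(points[order], axis=0), axis=1).sum()

    def hill_climb(points, epochs=10**6):
        order = list(range(len(points)))
        random.shuffle(order)                              # step 1: random initial ordering
        best = path_length(order, points)
        for _ in range(epochs):
            i, j = random.sample(range(len(order)), 2)
            order[i], order[j] = order[j], order[i]        # step 2: swap two elements
            cand = path_length(order, points)              # recomputing everything keeps the sketch simple
            if cand < best:
                best = cand                                # step 3: keep an improving swap
            else:
                order[i], order[j] = order[j], order[i]    # ... otherwise undo it
        return np.asarray(order)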
I have an array of thousands of doc2vec vectors with 90 dimensions. For my current purposes I would like to find a way to "sample" the different regions of this vector space, to get a sense of the diversity of the corpus. For example, I would like to partition my space into n regions, and get the most relevant word vectors for each of these regions.
I've tried clustering with hdbscan (after reducing the dimensionality with UMAP) to carve the vector space at its natural joints, but it really doesn't work well.
So now I'm wondering whether there is a way to sample the "far out regions" of the space (n vectors that are most distant from each other).
Would that be a good strategy?
How could I do this?
Many thanks in advance!
Wouldn't a random sample from all vectors necessarily encounter any of the various 'regions' in the set?
If there are "natural joints" and clusters to the documents, some clustering algorithm should be able to find the N clusters; then the much smaller number of NxN distances between the cluster centroids might identify the "furthest out" clusters.
Note that for any vector, you can use the Doc2Vec doc-vectors' most_similar() with a topn value of 0/false-ish to get the (unsorted) similarities to all other doc-vectors in the model, and then find the least-similar vectors in that set. If your dataset is small enough that it's practical to do this for all (or a large sample) of the doc-vectors, then the docs that appear in the "bottom N" least-similar of the most other vectors would arguably be the most "far out".
Whether this idea of "far out" is actually shown in the data, or useful, isn't clear. (In high-dimensional spaces, everything can be quite "far" from everything else in ways that don't match our 2d/3d intuitions, and slight differences in some vectors being a little "further" might not correspond to useful distinctions.)
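If it helps, here's a rough sketch of the centroid idea above; KMeans and cosine distance are arbitrary choices here, and n_regions is up to you:

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.spatial.distance import cdist

    def far_out_clusters(doc_vectors, n_regions=20):
        km = KMeans(n_clusters=n_regions, n_init=10).fit(doc_vectors)
        centroids = km.cluster_centers_
        pairwise = cdist(centroids, centroids, metric='cosine')   # NxN centroid distances
        outlyingness = pairwise.mean(axis=1)                      # average distance to the other centroids
        # cluster indices from most to least "far out", plus each doc's cluster label
        return np.argsort(outlyingness)[::-1], km.labels_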
I have a pre-made database of 512-dimensional vectors and want to implement an efficient search algorithm over them.
Research
Cosine similarity:
The natural similarity measure in this case is cosine similarity, which is basically a normalized dot product; in Python:

    import numpy

    def cossim(a, b):
        return numpy.inner(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))
Linear search:
The most obvious and simple search in this case is a linear search, O(n), which iterates over the whole database and picks the most similar result:

    def linear_search(query, db):  # db is a collection of 512-D vectors
        most_similar = (None, -1.0)  # (vector, similarity); cosine similarity is never below -1
        for vector in db:
            current_sim = cossim(query, vector)  # cossim function defined above
            if current_sim > most_similar[1]:
                most_similar = (vector, current_sim)
        return most_similar[0]
As you can see, the whole database has to be scanned, which can be quite inefficient if the database contains hundreds of thousands of vectors.
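Even without any indexing, the linear scan can be made much faster in practice by stacking the database into one matrix, normalizing the rows once, and doing a single matrix-vector product per query; this is still O(n), but one BLAS call instead of a Python loop (a sketch, assuming the database fits in an (n, 512) array):

    import numpy as np

    db_matrix = np.asarray(list(db))                       # shape (n, 512)
    db_unit = db_matrix / np.linalg.norm(db_matrix, axis=1, keepdims=True)

    def brute_force_cosine(query):
        q = query / np.linalg.norm(query)
        sims = db_unit @ q              # cosine similarity to every vector at once
        return db_matrix[np.argmax(sims)]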
Quasilinear search: (partially resolved)
There is a fundamental relation between cosine similarity and Euclidean distance (explained very well in this answer): for unit-length vectors,
|a - b|² = 2(1 - cossim(a,b))
As mentioned in that answer, the Euclidean distance gets smaller as the cosine between the two vectors gets larger, so the task can be turned into the closest pair of points problem, which can be solved in quasilinear O(n log n) time with a recursive divide-and-conquer algorithm.
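A quick numerical check of that identity (just a sketch with random unit vectors):

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=512), rng.normal(size=512)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)     # unit-length vectors

    cos_ab = a @ b
    assert np.isclose(np.sum((a - b) ** 2), 2 * (1 - cos_ab))

One practical consequence is that if every vector is L2-normalized once, ranking by Euclidean distance and ranking by cosine similarity become identical, so any Euclidean nearest-neighbor structure can be reused as-is.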
Thus I would have to implement my own divide-and-conquer algorithm to find the closest pair of 512-dimensional vectors.
Unfortunately, this can't be applied directly because of the high dimensionality of the vectors: the classical divide-and-conquer algorithm is specialized for two dimensions.
Indexing for binary search (unresolved):
To my knowledge, the best way to speed up cosine similarity search would be to build an index and then perform a binary search.
The main problem is that indexing 512-dimensional vectors is quite difficult, and I'm not aware of anything other than locality-sensitive hashing that may or may not be useful for indexing my database (my main concern is the dimensionality reduction involved, which could cause a noticeable drop in accuracy).
There is the newer Angular Multi-index Hashing method, which unfortunately only works for binary vectors, and dimension-independent similarity computation only applies when the vectors are sparse, which mine are not.
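For completeness, the classic random-hyperplane LSH for cosine similarity (SimHash) is simple to sketch; the number of hyperplanes below is an arbitrary choice, and real implementations use several hash tables to boost recall:

    import numpy as np

    class CosineLSH:
        def __init__(self, dim=512, n_planes=16, seed=0):
            rng = np.random.default_rng(seed)
            self.planes = rng.normal(size=(n_planes, dim))   # random hyperplanes
            self.buckets = {}

        def _key(self, v):
            # one bit per hyperplane: which side of it the vector falls on
            return tuple((self.planes @ v > 0).tolist())

        def add(self, idx, v):
            self.buckets.setdefault(self._key(v), []).append(idx)

        def candidates(self, query):
            # only vectors sharing the query's bucket need an exact cossim check
            return self.buckets.get(self._key(query), [])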
Finally, there is also An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions, which at first glance might be the best solution, but the paper states:
"Unfortunately, exponential factors in query time do imply that our algorithm is not practical for large values of d. However, our empirical evidence in Section 6 shows that the constant factors are much smaller than the bound given in Theorem 1 for the many distributions that we have tested. Our algorithm can provide significant improvements over brute-force search in dimensions as high as 20, with a relatively small average error."
We are trying to query 512-dimensional vectors, 25.6 times the 20 dimensions quoted above, which makes the algorithm above highly inefficient for this case.
There was a similar question raising the same concerns, but unfortunately no solution for the indexing was found there.
Problem
Is there any way to optimize cosine similarity search for such vectors other than quasilinear search? Perhaps there is some other way of indexing high-dimensional vectors? I believe something like this has already been done before.
Closest solution
I believe I have found an approach that might work: randomized partition trees for indexing vector databases with a few hundred dimensions, which seems to be exactly what I need (see here).
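To make the idea concrete, here is a much-simplified sketch of a randomized partition (random projection) tree; the leaf size and the single-tree, no-backtracking query are simplifications, and real implementations build a small forest of such trees and merge the candidates:

    import numpy as np

    def build_rp_tree(indices, vectors, leaf_size=50, rng=np.random.default_rng(0)):
        if len(indices) <= leaf_size:
            return ('leaf', indices)
        direction = rng.normal(size=vectors.shape[1])          # random split direction
        proj = vectors[indices] @ direction
        median = np.median(proj)
        left, right = indices[proj <= median], indices[proj > median]
        if len(left) == 0 or len(right) == 0:                  # degenerate split
            return ('leaf', indices)
        return ('node', direction, median,
                build_rp_tree(left, vectors, leaf_size, rng),
                build_rp_tree(right, vectors, leaf_size, rng))

    def query_rp_tree(tree, q, vectors):
        while tree[0] == 'node':                               # descend to one leaf only
            _, direction, median, left, right = tree
            tree = left if q @ direction <= median else right
        leaf = tree[1]
        sims = vectors[leaf] @ q / (np.linalg.norm(vectors[leaf], axis=1) * np.linalg.norm(q))
        return leaf[np.argmax(sims)]                           # approximate best match

    # usage sketch, with `vectors` an (n, 512) array:
    # tree = build_rp_tree(np.arange(len(vectors)), vectors)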
Thank you!
Does there exist a nearest-neighbor data structure that supports delete and add operations along with exact nearest-neighbor queries? Ideally I'm looking for a Python implementation.
Attempts:
Found MANY implementations for approximate nearest neighbor queries in high dimensional spaces.
Found KD Trees and Ball Trees but they do not allow for dynamic rebalancing.
Thinking an algorithm could be possible with locality sensitive hashing.
Looking at OctTrees.
Context:
For each point of 10,000 points, query for its nearest neighbor
Evaluate each pair of neighbors
Pick one and delete the pair of points and add a merged point.
Repeat for some number of iterations
Yes, such a data structure exists. I invented one, because I had exactly this problem at hand. It makes k-d trees seem excessively complex: it consists of nothing more than one sorted list of the points per dimension.
Obviously you can add and remove an n-dimensional point from n lists sorted by their respective dimensions rather trivially. A few tricks then let you iterate these lists and mathematically prove that you have found the shortest distance to a point. See my answer here for elaboration and code.
I must note, though, that your context is flawed: the closest point to A may be B, but it does not follow that B's closest point is A. You could rig a chain of points in which each link is shorter than the one before it, so that only one pair of points in the whole chain are each other's nearest neighbors.
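This is not the linked implementation, just a toy sketch of the idea (points as plain tuples of floats); the real structure interleaves all the per-dimension lists, whereas this sketch scans only one dimension, which is already enough for an exact answer and keeps the pruning argument visible:

    import bisect
    import math

    class SortedListNN:
        def __init__(self, dim):
            self.dim = dim
            self.by_dim = [[] for _ in range(dim)]   # each list holds (coordinate, point), kept sorted

        def add(self, p):
            for d in range(self.dim):
                bisect.insort(self.by_dim[d], (p[d], p))

        def remove(self, p):                         # assumes p was previously added
            for d in range(self.dim):
                lst = self.by_dim[d]
                lst.pop(bisect.bisect_left(lst, (p[d], p)))

        def nearest(self, q):
            # Scan outward from q along dimension 0.  Once the dimension-0 gap alone
            # exceeds the best full distance seen so far, that side is exhausted,
            # because the per-coordinate gap is a lower bound on the Euclidean distance.
            lst = self.by_dim[0]
            start = bisect.bisect_left(lst, (q[0],))
            best, best_d = None, math.inf
            lo, hi = start - 1, start
            while lo >= 0 or hi < len(lst):
                if lo >= 0:
                    if q[0] - lst[lo][0] > best_d:
                        lo = -1
                    else:
                        d = math.dist(lst[lo][1], q)
                        if d < best_d:
                            best, best_d = lst[lo][1], d
                        lo -= 1
                if hi < len(lst):
                    if lst[hi][0] - q[0] > best_d:
                        hi = len(lst)
                    else:
                        d = math.dist(lst[hi][1], q)
                        if d < best_d:
                            best, best_d = lst[hi][1], d
                        hi += 1
            return best

Adds and removes are O(n) per dimension here because of the plain-list insertions; swapping the lists for a balanced structure (or something like sortedcontainers.SortedList) brings that down to roughly O(log n) per dimension.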
I've got millions of geographic points. For each one of these, I want to find all "neighboring points," i.e., all other points within some radius, say a few hundred meters.
There is a naive O(N^2) solution to this problem---simply calculate the distance of all pairs of points. However, because I'm dealing with a proper distance metric (geographic distance), there should be a quicker way to do this.
I would like to do this within Python. One solution that comes to mind is to use a database (MySQL with GIS extensions, PostGIS) and hope that it would take care of efficiently performing the operation described above using some index. I would prefer something simpler though, that doesn't require me to build and learn about such technologies.
A couple of points
I will perform the "find neighbors" operation millions of times
The data will remain static
Because the problem is in a sense simple, I'd like to see the Python code that solves it.
Put in terms of Python code, I want something along the lines of:

    points = [(lat1, long1), (lat2, long2) ... ]  # this list contains millions of lat/long tuples
    points_index = magical_indexer(points)
    neighbors = []
    for point in points:
        point_neighbors = points_index.get_points_within(point, 200)  # get all points within 200 meters of point
        neighbors.append(point_neighbors)
scipy
First things first: there are preexisting algorithms to do this kind of thing, such as the k-d tree. SciPy has a Python implementation, cKDTree, that can find all points within a given range.
Binary Search
Depending on what you're doing however, implementing something like that may be nontrivial. Furthermore, creating a tree is fairly complex (potentially quite a bit of overhead), and you may be able to get away with a simple hack I've used before:
1. Compute the PCA of the dataset. You want to rotate the dataset so that the most significant direction comes first and the orthogonal (smaller) direction second. You can skip this and just choose X or Y, but PCA is computationally cheap and usually easy to implement. If you do just choose X or Y, pick the direction with the greater variance.
2. Sort the points by the major direction (call this direction X).
3. To find the nearest neighbor of a given point, find the index of the point nearest in X by binary search (if the point is already in your collection, you may already know this index and don't need the search). Iteratively look at the next and previous points, maintaining the best match so far and its distance from your search point. You can stop looking when the difference in X is greater than or equal to the distance to the best match so far (in practice, usually very few points).
4. To find all points within a given range, do the same as in step 3, except don't stop until the difference in X exceeds the range.
Effectively, you're doing O(N log N) preprocessing, and each query costs roughly O(sqrt(N)), or more if the distribution of your points is poor: if the points are roughly uniformly distributed, the number of points nearer in X than the true nearest neighbor is on the order of the square root of N. It's less efficient if many points fall within your range, but it is never much worse than brute force.
One advantage of this method is that it's all executable with very few memory allocations, and it can mostly be done with very good memory locality, which means it performs quite well despite its obvious limitations.
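A compressed sketch of step 4 (the range query), assuming the PCA rotation has already been applied so that column 0 of a points array is the major axis, and treating lat/long as planar coordinates for simplicity:

    import numpy as np

    order = np.argsort(points[:, 0])        # step 2: sort by the major direction
    pts = points[order]
    xs = pts[:, 0]

    def within_range(query, radius):
        # candidate slab selected by the major axis alone (two binary searches)...
        lo = np.searchsorted(xs, query[0] - radius)
        hi = np.searchsorted(xs, query[0] + radius, side='right')
        cand = pts[lo:hi]
        # ...then the exact distance check on just those candidates
        d = np.linalg.norm(cand - np.asarray(query), axis=1)
        return cand[d <= radius]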
Delaunay triangulation
Another idea: a Delaunay triangulation could work. In a Delaunay triangulation, any point's nearest neighbor is guaranteed to be an adjacent node. The intuition is that during a search you can maintain a heap (priority queue) keyed on absolute distance from the query point: pop the nearest point, check that it's in range, and if so add all its neighbors. I suspect it's impossible to miss any points this way, but you'd need to look at it more carefully to be sure...
Tipped off by Eamon, I've come up with a simple solution using the k-d tree implemented in SciPy.
    from scipy.spatial import cKDTree
    from numpy import inf  # scipy.inf was just an alias for numpy's inf and has since been removed

    max_distance = 0.0001  # assuming lats and longs are in decimal degrees, this corresponds to 11.1 meters

    points = [(lat1, long1), (lat2, long2) ... ]
    tree = cKDTree(points)

    point_neighbors_list = []  # put the neighbors of each point here
    for point in points:
        # ask for up to len(points) neighbors but cut off beyond max_distance;
        # missing neighbors come back with distance == inf
        distances, indices = tree.query(point, len(points), p=2, distance_upper_bound=max_distance)
        point_neighbors = []
        for index, distance in zip(indices, distances):
            if distance == inf:
                break
            point_neighbors.append(points[index])
        point_neighbors_list.append(point_neighbors)
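As a side note, cKDTree also exposes query_ball_point and query_ball_tree, which return all neighbors within a radius directly and avoid asking for k=len(points) results; something along these lines should be equivalent (like the loop above, each point's own entry is included, at distance 0):

    from scipy.spatial import cKDTree

    tree = cKDTree(points)
    # neighbors_idx[i] is the list of indices within max_distance of points[i]
    neighbors_idx = tree.query_ball_tree(tree, max_distance)
    point_neighbors_list = [[points[j] for j in idx] for idx in neighbors_idx]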