KD-Tree Implementation in SQL (Python)

Is anyone aware of a KD-Tree, or similar spatial index, implemented in SQL? I was considering writing my own using Python and Django's ORM, but I'd like to avoid reinventing the wheel.
I have a table containing millions of rows, with each row containing 128 columns representing image feature data. Given an arbitrary 128-element list of image features, I want to use a KD-Tree to find the N most similar images in the database. I've found a lot of KD-Tree implementations, but they all appear to work only in local memory and don't scale or talk to databases.

A KD-tree does not work well for high-dimensional data, and 128 dimensions is quite high. A KD-tree splits on a different dimension at each level of the tree, and in high dimensions a query does a lot of back-tracking (searching both sides of a branch) and ends up visiting most of the points in the tree. When this happens the advantage of the tree structure disappears, and an exhaustive comparison ends up running faster.
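To make the exhaustive-comparison baseline concrete, here is a minimal numpy sketch (the array sizes, names, and the Euclidean metric are assumptions; with millions of rows you would stream the features from the database in batches rather than hold them all in memory):

import numpy as np

def knn_bruteforce(features, query, n=10):
    # features: (num_rows, 128) array, e.g. one batch fetched from the database
    # query:    (128,) array of image features
    # returns the indices of the n closest rows by Euclidean distance
    diffs = features - query                      # broadcast: (num_rows, 128)
    dists = np.einsum("ij,ij->i", diffs, diffs)   # squared distances; sqrt not needed for ranking
    return np.argpartition(dists, n)[:n]          # n smallest distances (unordered)

# usage sketch with random data standing in for the image-feature table
features = np.random.rand(100_000, 128).astype(np.float32)
query = np.random.rand(128).astype(np.float32)
nearest = knn_bruteforce(features, query, n=10)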
You may want to find an existing image similarity search system that you can map your data into. Here is one called Lire which extracts features from images and indexes them using Lucene.
If your work is more research-oriented you may want to read up on metric space indexes and approximate k-nearest neighbor search.

I might be a little off here, but your best bet may be the GiST / GIN indexes in PostgreSQL.
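If you go down that road, a rough sketch using the cube extension with a GiST index might look like the following (the table and column names are made up, psycopg2 is assumed as the client, KNN ordering with <-> needs a reasonably recent PostgreSQL, and note that cube has a compile-time dimension limit, 100 by default, so 128-dimensional features may require rebuilding the extension):

import psycopg2

conn = psycopg2.connect("dbname=images")  # hypothetical connection string
cur = conn.cursor()

# one-time setup: a cube column plus a GiST index for nearest-neighbour ordering
cur.execute("CREATE EXTENSION IF NOT EXISTS cube")
cur.execute("CREATE TABLE IF NOT EXISTS image_features (id serial PRIMARY KEY, features cube)")
cur.execute("CREATE INDEX IF NOT EXISTS image_features_gist ON image_features USING gist (features)")

# N most similar images to an arbitrary 128-element feature vector
# (<-> is the cube extension's Euclidean distance operator)
query_vector = [0.1] * 128  # placeholder feature vector
cur.execute(
    "SELECT id FROM image_features ORDER BY features <-> cube(%s::float8[]) LIMIT 10",
    (query_vector,),
)
print(cur.fetchall())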

Related

Community detection for larger than memory embeddings dataset

I currently have a dataset of textual embeddings (768 dimensions). The current number of records is ~1 million. I am looking to detect related embeddings through a community detection algorithm. For small data sets, I have been able to use this one:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py
It works great, but it doesn't really scale as the data set grows larger than memory.
The key here is that I am able to specify a threshold for community matches. I have been able to find clustering algorithms that scale to larger than memory, but I always have to specify a fixed number of clusters ahead of time. I need the system to detect the number of clusters for me.
I'm certain there is a class of algorithms, and hopefully a Python library, that can handle this situation, but I have been unable to locate it. Does anyone know of an algorithm or a solution I could use?
That seems small enough that you could just rent a bigger computer.
Nevertheless, to answer the question: typically the play is to cluster the data into a few chunks (overlapping or not) that fit in memory, and then apply a higher-quality in-memory clustering algorithm to each chunk. One typical strategy for cosine similarity is to cluster by SimHashes, but there's a whole literature out there; if you already have a scalable clustering algorithm you like, you can use that.
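Here is a minimal numpy sketch of the SimHash-chunking idea (the bit count, function names, and the 768-dimensional toy data are assumptions; each bucket would then be handed to the in-memory community-detection step):

import numpy as np

def simhash_buckets(embeddings, n_bits=8, seed=0):
    # random-hyperplane SimHash: vectors that are close in cosine similarity
    # tend to share the same sign pattern, hence the same bucket id
    rng = np.random.default_rng(seed)
    dim = embeddings.shape[1]
    hyperplanes = rng.standard_normal((dim, n_bits))
    bits = (embeddings @ hyperplanes) > 0            # (n, n_bits) sign pattern
    weights = 1 << np.arange(n_bits)
    return bits.astype(np.int64) @ weights           # integer bucket id per embedding

# usage sketch: split embeddings into 2**8 = 256 buckets, then cluster each bucket in memory
embeddings = np.random.randn(10_000, 768).astype(np.float32)
bucket_ids = simhash_buckets(embeddings, n_bits=8)
for b in np.unique(bucket_ids):
    chunk = embeddings[bucket_ids == b]
    # ... run the fast_clustering-style community detection on `chunk` here ...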

Geohashing vs SearchAPI for geospatial querying using datastore

I am creating an App Engine application in Python that will need to perform efficient geospatial queries on Datastore data. An example use case: I need to find the first 20 posts within a 10-mile radius of the current user. Having done some research into my options, the two best approaches for achieving this type of functionality currently seem to be:
Indexing geoHashed geopoint data using Python's GeoModel library
Creating/deleting documents of structured data using Google's newer SearchAPI
From a high-level perspective, it seems that indexing geohashes and querying them directly would be less costly and much faster than having to create and delete a document for every geospatial query. However, I've also read that geohashing can be very inaccurate along the equator or along 'faultlines' created by the hashing algorithm. I've seen very few posts contrasting these methods in detail, and I think Stack Overflow is a good place to have this conversation, so my questions are as follows:
Has anyone implemented similar features and had positive experiences with either method?
Which method would be the cheaper alternative?
Which would be the faster alternative?
Is there another important method I'm leaving out?
Thanks in advance.
Geohashing does not have to be inaccurate at all; it's all in the implementation details. What I mean is that you can check the neighbouring geocells as well to handle border cases, and make sure that includes neighbours on the other side of the equator.
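As a rough illustration of that neighbour-cell check (this assumes the python-geohash package for encode()/neighbors(), and an in-memory dict standing in for the datastore index; the precision and helper names are made up for the example):

import math
import geohash  # python-geohash package (an assumption)

def haversine_m(lat1, lon1, lat2, lon2):
    # great-circle distance in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371000 * math.asin(math.sqrt(a))

posts_by_cell = {}  # geohash prefix -> list of (entity_id, lat, lon); stands in for the datastore index

def nearby_posts(lat, lon, radius_m, precision=5):
    # query the cell containing the point plus its 8 neighbours, then filter by
    # true distance, so matches just across a cell boundary are not missed
    center = geohash.encode(lat, lon, precision=precision)
    cells = [center] + geohash.neighbors(center)
    hits = []
    for cell in cells:
        for entity_id, plat, plon in posts_by_cell.get(cell, []):
            if haversine_m(lat, lon, plat, plon) <= radius_m:
                hits.append(entity_id)
    return hits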
If your use case is finding other entities within a radius as you suggest, I would definitely recommend using the Search API. They have a distance function tailored for that use.
Search API queries are more expensive than Datastore queries, yes. But if you weigh in the computation time of doing these calculations on your instance, and probably iterating through all entities for each geohash to make sure the distance is actually less than the desired radius, then I would say the Search API is the winner. And don't forget the implementation time.
You can have a look at this post, it can be another great alternative.
I have used this within my app and it works great for my requirement of finding app users within a provided radius.

Performance: finding all points within a certain distance by lat/long

I have a CSV file with points tagged by lat/long (~10K points). I'd like to search for all points within a given distance of a user-specified lat/long coordinate, for example the centroid of Manhattan.
I'm pretty new to programming and databases, so this may be a basic question. If so, I apologize. Is it performant to do this search in pure Python without using a database? As in, could I simply read the CSV into memory and do the search with a Python script? If it is performant, would it scale well as the number of points increases?
Or is this simply infeasible in Python, and I need to investigate using a database that supports geospatial queries?
Additionally, how do I go about understanding the performance of these types of calculations so that I can develop a good intuition for this?
This is definitely possible in Python without any database. I would recommend using numpy, and would do the following:
Read all points from the CSV into a numpy array
Calculate the distance of each point to your given point
Sort the distances, or simply find the closest point using argmin
Because all calculations are vectorized, they happen at close to C speed.
With an okay computer, the I/O will take like 2-3 seconds and the calculation will take less than 100-200 milliseconds.
In terms of math, you can try http://en.wikipedia.org/wiki/Haversine_formula
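A minimal numpy sketch of that approach (the CSV column layout, radius, and centre coordinates are assumptions), using the haversine formula to keep only points within a given distance:

import numpy as np

EARTH_RADIUS_KM = 6371.0

def points_within(latlon, lat0, lon0, radius_km):
    # latlon: (N, 2) array of [lat, lon] in degrees; returns indices of rows within radius_km
    lat, lon = np.radians(latlon[:, 0]), np.radians(latlon[:, 1])
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lat) * np.sin((lon - lon0) / 2) ** 2)
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return np.nonzero(dist_km <= radius_km)[0]

# usage sketch: CSV assumed to have lat,lon as its first two columns,
# centre near the middle of Manhattan (approximate), 5 km radius
latlon = np.genfromtxt("points.csv", delimiter=",", skip_header=1, usecols=(0, 1))
idx = points_within(latlon, 40.7831, -73.9712, radius_km=5.0)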

Optimizing DBSCAN for neo4j in Cypher / Python

Hi, I have been trying to implement the DBSCAN algorithm for Neo4j, but I am running into serious performance bottlenecks. I'll describe the implementation and then ask for help.
I discretized the possible epsilon values and put counts of the number of neighbors under each discretization in each node in order to be able to retrieve all of the core nodes in one query.
START a = node(*)
WHERE a.rel<cutoff threshold>! >= {minp}
RETURN a
This part is fast; the part that isn't fast is the follow-up query:
START a = node({i})
SET a.label<cutoff threshold>_<minpoints> = {clust}
WITH a
MATCH a -[:'|'.join(<valid distance relations>)]- (x)
WHERE not(has(x.label<cutoff threshold>_<minpoints>))
WITH x
SET x.label<cutoff threshold>_<minpoints>={clust}
RETURN x
I then pick a core node to start from, and as long as there are still core node neighbors, run the above query to label their neighbors.
I think the problem is that my graph has very different levels of sparsity - starting from only weak similarity it is almost fully connected, with ~50M relations between ~10k nodes, whereas at strong similarity there are as few as ~20k relations between ~10k nodes (or fewer). No matter what, it is always REALLY slow. What is the best way for me to handle this? Is it to index on relationship type and starting node? I haven't been able to find any resources on this problem, and surprisingly there isn't already an implementation, since this is a pretty standard graph algorithm. I could use scikit-learn, but then I would be restricted to in-memory distance matrices only :(
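For reference, the driver loop around that labelling query looks roughly like this (run_label_query is a hypothetical helper wrapping whatever Neo4j client is in use; the frontier bookkeeping is an assumption about the described approach, not code from the question):

from collections import deque

def expand_cluster(seed_core_node, cluster_id, run_label_query, core_node_ids):
    # breadth-first DBSCAN expansion around one core node;
    # run_label_query(node_id, cluster_id) is assumed to execute the follow-up
    # Cypher query above for one node and return the ids of the neighbours it just labelled
    frontier = deque([seed_core_node])
    while frontier:
        node_id = frontier.popleft()
        newly_labelled = run_label_query(node_id, cluster_id)
        # only core nodes are allowed to keep growing the cluster
        frontier.extend(n for n in newly_labelled if n in core_node_ids)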
What version of neo4j did you try this with?
Up until 1.8, performance was not a design goal of Cypher (the focus was the language itself).
Have a look at a recent snapshot (1.9-SNAP).
Also make sure that your hot dataset is not just being loaded from disk (otherwise you are measuring disk I/O), so check that your memory-mapped settings and JVM heap are large enough.
You might also want to check out the GCR cache from Neo4j enterprise which has a smaller memory footprint.
What is the cardinality of count(x) in your query? If it is too small, you have too many small transactions going on. Depending on whether you run Python embedded or via REST, use a larger transaction scope or REST batch operations.
You're already using parameters, which is great. What is the variability of your rel-types?
Any chance to share your dataset/generator and the code with us (Neo4j) for performance testing on our side?
There are DBSCAN implementations around that use indexing. I don't know about neo4j so I can't really tell if your approach is efficient. The thing you might need to precompute is actually a sparse version of your graph, with only the edges that are within the epsilon threshold.
What I'd like to point out is that you apparently have different densities in your data set, so you might want to use OPTICS instead, which is a variant of DBSCAN that does away with the epsilon parameter (and also doesn't need to distinguish "core" nodes, as every node is a core node for a certain epsilon). Do not use the Weka version (or the Weka-inspired Python version that is floating around); they are half OPTICS and half DBSCAN.
When you have efficient sorted updatable heaps available, OPTICS can be pretty fast.
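To illustrate the sparse-precomputation idea mentioned above, here is a scikit-learn sketch (it assumes you can export the eps-thresholded edges from Neo4j as (i, j, distance) triples, and that a reasonably recent scikit-learn is available, which accepts sparse precomputed distance matrices; the toy numbers are made up):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# edges exported from the graph: only pairs whose distance is already <= eps
# (hypothetical arrays; i/j are 0-based node indices, d the precomputed distances)
i = np.array([0, 0, 1, 2, 3])
j = np.array([1, 2, 2, 3, 4])
d = np.array([0.10, 0.20, 0.15, 0.05, 0.30])

n_nodes = 5
# symmetric sparse distance matrix; entries that are not stored are treated as "further than eps"
dist = csr_matrix((np.concatenate([d, d]),
                   (np.concatenate([i, j]), np.concatenate([j, i]))),
                  shape=(n_nodes, n_nodes))

labels = DBSCAN(eps=0.25, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)  # cluster id per node, -1 for noise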

Using PyLucene as a K-NN Classifier

I have a dataset composed of millions of examples, where each example contains 128 continuous-value features classified with a name. I'm trying to find a large, robust database/index to use as a k-NN classifier for high-dimensional data. I tried Weka's IBk classifier, but it chokes on this much data, and even then it has to be loaded into memory. Would Lucene, specifically through the PyLucene interface, be a possible alternative?
I've found Lire, which seems to use Lucene in a similar way, but after reviewing the code, I'm not sure how they're pulling it off, or if it's the same thing I'm trying to do.
I realize Lucene is designed as a text indexing tool, and not as a general purpose classifier, but is it possible to use it in this way?
Lucene doesn't seem like the right choice given what you've told us. Lucene would give you a way to store the data, but in terms of retrieval, it's not designed to do anything but search over textual strings.
Since k-NN is so simple, you might be better off creating your own data store in a typical RDBMS or something like Berkeley DB. You could create keys/indices based on sub-hypercubes of the various dimensions to speed things up - start at the bucket of the item to be classified and move outward...
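A toy Python sketch of that bucketing idea (the bin width, dict-based store, and brute-force re-ranking inside the bucket are all assumptions; a real version would expand outward to neighbouring buckets when the home bucket has too few candidates, and would live in the RDBMS / Berkeley DB rather than a dict):

import numpy as np
from collections import defaultdict

BIN_WIDTH = 0.25  # coarse quantisation step per dimension (assumed)

def bucket_key(vec):
    # quantise each dimension into a coarse bin; the tuple of bin ids
    # identifies a sub-hypercube of feature space
    return tuple(np.floor(vec / BIN_WIDTH).astype(int))

store = defaultdict(list)  # bucket key -> list of (name, vector)

def insert(name, vec):
    store[bucket_key(vec)].append((name, vec))

def classify(vec, k=5):
    # k-NN within the query's own bucket; moving outward to neighbouring
    # buckets when there are too few candidates is not shown
    candidates = store.get(bucket_key(vec), [])
    candidates.sort(key=lambda item: np.linalg.norm(item[1] - vec))
    return [name for name, _ in candidates[:k]]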
This is done in Lucene already with geospatial searches. Of course, the built-in geospatial searches only use two dimensions, so you'll have to modify it a bit. But the basic idea of using numeric range queries will work.
(Note: I'm not aware of anyone doing high-dimensional kNN with Lucene. So I can't comment on how fast it will be.)
