Hi, I have been trying to implement the DBSCAN algorithm for Neo4j, but I am running into serious performance bottlenecks. I'll describe the implementation and then ask for help.
I discretized the possible epsilon values and put counts of the number of neighbors under each discretization in each node in order to be able to retrieve all of the core nodes in one query.
START a = node(*)
WHERE a.rel<cutoff threshold>! >= {minp}
RETURN a
This part is fast. The part that isn't fast is the follow-up query:
START a = node({i})
SET a.label<cutoff threshold>_<minpoints> = {clust}
WITH a
MATCH a -[:'|'.join(<valid distance relations>)]- (x)
WHERE not(has(x.label<cutoff threshold>_<minpoints>))
WITH x
SET x.label<cutoff threshold>_<minpoints>={clust}
RETURN x
I then pick a core node to start from, and as long as there are still core node neighbors, run the above query to label their neighbors.
I think the problem is that my graph has very different levels of sparsity: starting from only weak similarity it is almost fully connected, with ~50M relations between ~10k nodes, whereas at strong similarity there are as few as ~20k relations between ~10k nodes (or fewer). No matter what, it is always REALLY slow. What is the best way for me to handle this? Is it to index on relationship type and starting node? I haven't been able to find any resources on this problem, and surprisingly there isn't already an implementation, since this is a pretty standard graph algorithm. I could use scikit-learn, but then I would be restricted to in-memory distance matrices only :(
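For comparison, here is a minimal in-memory sketch of the expansion step described above. It assumes the epsilon-neighborhood graph already fits in a dict (the `neighbors` mapping is a hypothetical input, not something Neo4j returns in this shape) and that a node is core when it has at least `min_pts` neighbors, which mirrors the neighbor counts stored on the nodes.

```python
from collections import deque

def dbscan_on_graph(neighbors, min_pts):
    """Label nodes of an epsilon-neighborhood graph with cluster ids.

    neighbors: dict mapping node -> set of nodes within epsilon (assumed input).
    Returns dict node -> cluster id; noise nodes are absent from the result.
    """
    core = {n for n, nbrs in neighbors.items() if len(nbrs) >= min_pts}
    labels = {}
    cluster_id = 0
    for start in neighbors:
        if start not in core or start in labels:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in neighbors[node]:
                if nbr in labels:
                    continue
                labels[nbr] = cluster_id   # border and core nodes both get the label
                if nbr in core:            # only core nodes keep expanding
                    queue.append(nbr)
        cluster_id += 1
    return labels
```

Each node is labeled once and each edge is looked at a bounded number of times, so the whole pass is linear in the size of the thresholded graph. That is a useful baseline for judging how much time the per-node queries add.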
What version of neo4j did you try this with?
Up until 1.8, performance was not a design goal of Cypher (the focus was on the language itself).
Have a look at a recent snapshot (1.9-SNAP).
Also make sure that your hot dataset is already in memory rather than being loaded from disk (otherwise you measure disk I/O), so check that your memory-mapped settings and JVM heap are large enough.
You might also want to check out the GCR cache from Neo4j enterprise which has a smaller memory footprint.
What is the cardinality of count(x) in your query? If it is too small, you have too many small transactions going on. Depending on whether you run Python embedded or REST, use a larger transaction scope or REST batch operations.
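A small sketch of what a larger transaction scope means on the client side: group the work into batches and send each batch in one go. The helper below is plain Python; `run_in_one_transaction` and `core_node_ids` in the comment are hypothetical placeholders for whatever your driver and data offer.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield lists of at most `size` items so each list can go into one transaction."""
    if size < 1:
        raise ValueError("size must be positive")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Hypothetical usage (names are placeholders, not a real driver API):
# for batch in chunked(core_node_ids, 1000):
#     run_in_one_transaction(batch)
```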
You're already using parameters which is great. What is the variability of your rel-types ?
Any chance to share your dataset/generator and the code with us (Neo4j) for performance testing on our side?
There are DBSCAN implementations around that use indexing. I don't know about neo4j so I can't really tell if your approach is efficient. The thing you might need to precompute is actually a sparse version of your graph, with only the edges that are within the epsilon threshold.
What I'd like to point out is that you apparently have different densities in your data set, so you might want to use OPTICS instead, which is a variant of DBSCAN that does away with the epsilon parameter (and also doesn't need to distinguish "core" nodes, as every node is a core node for a certain epsilon). Do not use the Weka version (or the Weka-inspired Python version that is floating around). They are half OPTICS and half DBSCAN.
When you have efficient sorted updatable heaps available, OPTICS can be pretty fast.
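Python's `heapq` has no decrease-key operation, so one common workaround is lazy invalidation: push a new entry and skip stale ones when popping. The sketch below shows that idea for the kind of seed list OPTICS keeps, where a point's reachability distance can only go down. The class name and its methods are made up for illustration.

```python
import heapq
import itertools

class UpdatableMinHeap:
    """Min-heap with decrease-key done by lazy invalidation of stale entries."""

    def __init__(self):
        self._heap = []
        self._best = {}                     # item -> current best priority
        self._counter = itertools.count()   # tie-breaker so items are never compared

    def push_or_decrease(self, item, priority):
        """Insert item, or lower its priority; higher priorities are ignored."""
        if item in self._best and self._best[item] <= priority:
            return
        self._best[item] = priority
        heapq.heappush(self._heap, (priority, next(self._counter), item))

    def pop(self):
        """Return (item, priority) with the smallest priority, skipping stale entries."""
        while self._heap:
            priority, _, item = heapq.heappop(self._heap)
            if self._best.get(item) == priority:
                del self._best[item]
                return item, priority
        raise KeyError("pop from empty heap")

    def __len__(self):
        return len(self._best)
```

Stale entries stay in the underlying list until they are popped, so memory grows with the number of updates rather than the number of live items. For OPTICS that is usually acceptable.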
I currently have a dataset of textual embeddings (768 dimensions). The current number of records is ~1 million. I am looking to detect related embeddings through a community detection algorithm. For small data sets, I have been able to use this one:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py
It works great, but, it doesn't really scale as the data set grows larger than memory.
The key here is that I am able to specify a threshold for community matches. I have been able to find clustering algorithms that scale to larger than memory, but I always have to specify a fixed number of clusters ahead of time. I need the system to detect the number of clusters for me.
I'm certain there is a class of algorithms, and hopefully a Python library, that can handle this situation, but I have been unable to locate it. Does anyone know of an algorithm or a solution I could use?
That seems small enough that you could just rent a bigger computer.
Nevertheless, to answer the question, typically the play is to cluster the data into a few chunks (overlapping or not) that fit in memory and then apply a higher-quality in-memory clustering algorithm to each chunk. One typical strategy for cosine similarity is to cluster by SimHashes, but there's a whole literature out there; if you already have a scalable clustering algorithm you like, you can use that.
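A rough sketch of the SimHash chunking idea, in pure Python so it stays self-contained: each embedding is reduced to the sign pattern of a few random projections, and embeddings with the same pattern land in the same chunk. The function names, the number of bits, and the seed are illustrative choices; for a million 768-dimensional vectors you would want a vectorized version.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def simhash(vector, hyperplanes):
    """Return an integer whose bits are the signs of the projections."""
    key = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        key = (key << 1) | (1 if dot >= 0 else 0)
    return key

def bucket_by_simhash(vectors, hyperplanes):
    """Group vector indices by SimHash key; each bucket is one chunk to cluster in memory."""
    buckets = defaultdict(list)
    for index, vector in enumerate(vectors):
        buckets[simhash(vector, hyperplanes)].append(index)
    return dict(buckets)
```

Vectors with a small angle between them agree on most bits, so they tend to share a bucket. Using fewer bits gives larger, more overlapping chunks. Pairs that straddle a hyperplane are split, which is why overlapping chunks (several hash tables) are often used.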
I am trying to use the BK-tree data structure in python to store a corpus with ~10 billion entries (1e10) in order to implement a fast fuzzy search engine.
Once I add over ~10 million (1e7) values to a single BK-tree, I start to see a significant degradation in the performance of querying.
I was thinking to store the corpus into a forest of a thousand BK-trees and to query them in parallel.
Does this idea sound feasible? Should I create and query 1,000 BK-trees simultaneously? What else can I do in order to use BK-tree for this corpus.
I use pybktree.py and my queries are intended to find all entries within an edit distance d.
Is there some architecture or database which will allow me to store those trees?
Note: I don’t run out of memory, rather the tree begins to be inefficient (presumably each node has too many children).
FuzzyWuzzy
Since you are mentioning your usage of FuzzyWuzzy as distance metric I will concentrate on efficient ways to implement the fuzz.ratio algorithm used by FuzzyWuzzy. FuzzyWuzzy provides the following two implementations for fuzz.ratio:
difflib, which is completely implemented in Python
python-Levenshtein which uses a weighted Levenshtein distance with the weight 2 for substitutions (substitutions are deletion + insertion). Python-Levenshtein is implemented in C and a lot faster than the pure Python implementation.
Implementation in python-Levenshtein
python-Levenshtein is implemented as follows:
removes common prefix and suffix of the two strings, since they do not have any influence on the end result. This can be done in linear time, so matching similar strings is very fast.
The Levenshtein distance between the trimmed strings is implemented with quadratic runtime and linear memory usage.
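The two steps above can be sketched in a few lines of Python. This is an illustration of the technique, not the C code of python-Levenshtein; the function names and the `substitution_cost` parameter are my own.

```python
def trim_common_affix(s1, s2):
    """Strip the shared prefix and suffix; neither changes the edit distance."""
    start = 0
    limit = min(len(s1), len(s2))
    while start < limit and s1[start] == s2[start]:
        start += 1
    end1, end2 = len(s1), len(s2)
    while end1 > start and end2 > start and s1[end1 - 1] == s2[end2 - 1]:
        end1 -= 1
        end2 -= 1
    return s1[start:end1], s2[start:end2]

def weighted_levenshtein(s1, s2, substitution_cost=2):
    """Edit distance with insert/delete cost 1, using two rows (linear memory)."""
    s1, s2 = trim_common_affix(s1, s2)
    if len(s1) < len(s2):
        s1, s2 = s2, s1                       # keep the row as short as possible
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else substitution_cost
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]
```

With `substitution_cost=2` this gives the weighted distance used for fuzz.ratio; with `substitution_cost=1` it gives the classic Levenshtein distance.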
RapidFuzz
I am the author of the library RapidFuzz which implements the algorithms used by FuzzyWuzzy in a more performant way. RapidFuzz uses the following interface for fuzz.ratio:
def ratio(s1, s2, processor = None, score_cutoff = 0)
The additional score_cutoff parameter can be used to provide a score threshold as a float between 0 and 100. For ratio < score_cutoff, 0 is returned instead. This allows the implementation to use a more optimized algorithm in some cases. In the following I will describe the optimizations used by RapidFuzz depending on the input parameters. Here, max distance refers to the maximum distance that is possible without getting a ratio below the score threshold.
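To make the relation between score_cutoff and max distance concrete, here is a small worked sketch. It assumes the ratio is normalized as 100 * (len1 + len2 - distance) / (len1 + len2), which is how the weighted distance is usually turned into a score; the helper names are hypothetical and the exact rounding inside RapidFuzz may differ.

```python
import math

def max_distance(len1, len2, score_cutoff):
    """Largest weighted distance that still yields ratio >= score_cutoff.

    Assumes ratio = 100 * (lensum - distance) / lensum. Fractional cutoffs
    may need extra care with floating-point rounding.
    """
    lensum = len1 + len2
    return math.floor(lensum * (100 - score_cutoff) / 100)

def ratio_from_distance(len1, len2, distance):
    """Normalized similarity in [0, 100] for a given weighted distance."""
    lensum = len1 + len2
    return 100.0 if lensum == 0 else 100.0 * (lensum - distance) / lensum
```

For two strings of length 10 and a cutoff of 90, the allowed distance is 2, so any pair that needs a substitution plus one more edit can be rejected without computing the full distance.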
max distance == 0
The similarity can be calculated using a direct comparison, since no difference between the strings is allowed. The time complexity of this algorithm is O(N).
max distance == 1 and len(s1) == len(s2)
The similarity can be calculated using a direct comparison as well, since a substitution (weight 2) would cause an edit distance higher than max distance. The time complexity of this algorithm is O(N).
Remove common prefix
A common prefix/suffix of the two compared strings does not affect the Levenshtein distance, so the affix is removed before calculating the similarity. This step is performed for any of the following algorithms.
max distance <= 4
The mbleven algorithm is used. This algorithm checks all possible edit operations that are possible under the threshold max distance. A description of the original algorithm can be found here. I changed this algorithm to support the weight of 2 for substitutions. Unlike with the normal Levenshtein distance, this algorithm can even be used up to a threshold of 4 here, since the higher weight of substitutions decreases the number of possible edit operations. The time complexity of this algorithm is O(N).
len(shorter string) <= 64 after removing common affix
The BitPAl algorithm is used, which calculates the Levenshtein distance in parallel. The algorithm is described here and is extended with support for UTF32 in this implementation. The time complexity of this algorithm is O(N).
Strings with a length > 64
The Levenshtein distance is calculated using Wagner-Fischer with Ukkonen's optimization. The time complexity of this algorithm is O(N * M). This could be replaced with a blockwise implementation of BitPal in the future.
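To give a feel for what "calculating the distance in parallel" means, here is a sketch of a related and simpler bit-parallel technique. It is not BitPAl itself, but a bit-parallel longest common subsequence in the style usually attributed to Hyyrö. With a substitution weight of 2, the weighted distance equals len(s1) + len(s2) - 2 * LCS, so the LCS length is enough. Python integers have arbitrary width, so the 64-character limit of a machine word does not show up here.

```python
def lcs_length_bitparallel(s1, s2):
    """Length of the longest common subsequence, one big-integer step per char of s2."""
    if not s1 or not s2:
        return 0
    width = len(s1)
    full = (1 << width) - 1
    masks = {}
    for position, char in enumerate(s1):
        masks[char] = masks.get(char, 0) | (1 << position)
    state = full                           # each zero bit marks one LCS match
    for char in s2:
        matches = state & masks.get(char, 0)
        state = ((state + matches) | (state - matches)) & full
    return width - bin(state).count("1")

def indel_distance(s1, s2):
    """Levenshtein distance with substitution weight 2 (one insert plus one delete)."""
    return len(s1) + len(s2) - 2 * lcs_length_bitparallel(s1, s2)
```

The inner loop does a constant number of word operations per character of s2, which is where the O(N) running time for short strings comes from.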
Improvements to processors
FuzzyWuzzy provides multiple processors like process.extractOne that are used to calculate the similarity between a query and multiple choices. Implementing this in C++ as well allows two more important optimizations:
when a scorer is used that is implemented in C++ as well we can directly call the C++ implementation of the scorer and do not have to go back and forth between Python and C++, which provides a massive speedup
We can preprocess the query depending on the scorer that is used. As an example when fuzz.ratio is used as scorer it only has to store the query into the 64bit blocks used by BitPal once, which saves around 50% of the runtime when calculating the Levenshtein distance
So far only extractOne and extract_iter are implemented in C++, while extract, which you would use, is still implemented in Python and uses extract_iter. So it can already use the second optimization, but still has to switch a lot between Python and C++, which is not optimal (this will probably be added in v1.0.0 as well).
Benchmarks
I performed benchmarks for extractOne and the individual scorers during development that show the performance difference between RapidFuzz and FuzzyWuzzy. Keep in mind that the performance for your case (all strings length 20) is probably not as good, since many of the strings in the dataset used are very small.
The source of the data, for reproducibility:
words.txt ( dataset with 99171 words )
The hardware the graphed benchmarks were run on (specification) :
CPU: single core of a i7-8550U
RAM: 8 GB
OS: Fedora 32
Benchmark Scorers
The code for this benchmark can be found here
Benchmark extractOne
For this benchmark the code of process.extractOne is slightly changed to remove the score_cutoff parameter. This is done because in extractOne the score_cutoff is increased whenever a better match is found (and it exits once it finds a perfect match). In the future it would make more sense to benchmark process.extract, which does not have this behavior (the benchmark is performed using process.extractOne, since process.extract is not fully implemented in C++ yet). The benchmark code can be found here
This shows that when possible the scorers should not be used directly but through the processors, that can perform a lot more optimizations.
Alternative
As an alternative you could use a C++ implementation. The library RapidFuzz is available for C++ here. The implementation in C++ is relatively simple as well:
// function to load words into vector
std::vector<std::string> choices = load("words.txt");
std::string query = choices[0];
std::vector<double> results;
results.reserve(choices.size());
rapidfuzz::fuzz::CachedRatio<decltype(query)> scorer(query);
for (const auto& choice : choices)
{
results.push_back(scorer.ratio(choice));
}
or in parallel using OpenMP:
// function to load words into vector
std::vector<std::string> choices = load("words.txt");
std::string query = choices[0];
// pre-size the results so every iteration writes to its own slot;
// push_back on a shared vector is not safe inside a parallel loop
std::vector<double> results(choices.size());
rapidfuzz::fuzz::CachedRatio<decltype(query)> scorer(query);
#pragma omp parallel for
for (std::size_t i = 0; i < choices.size(); ++i)
{
results[i] = scorer.ratio(choices[i]);
}
On my machine (see Benchmark above) this evaluates 43 million words/sec, and 123 million words/sec in the parallel version. This is around 1.5 times as fast as the Python implementation (due to conversions between Python and C++ types). However, the main advantage of the C++ version is that you are relatively free to combine algorithms whichever way you want, while in the Python version you're forced to use the process functions that are implemented in C++ to achieve good performance.
Few thoughts
BK-trees
Kudos to Ben Hoyt and his link to the issue, which I will draw from. That being said, the first observation from the mentioned issue is that the BK-tree isn't exactly logarithmic. From what you told us your usual d is ~6, which is 3/10 of your string length. Unfortunately, that means that if we look at the tables from the issue you will get a complexity somewhere between O(N^0.8) and O(N). In the optimistic case of the exponent being 0.8 (it will likely be slightly worse) you get an improvement factor of ~100 on your 10B entries, since (10^10)^0.8 = 10^8. So if you have a reasonably fast implementation of BK-trees it can still be worth it to use them, or to use them as a basis for a further optimization.
The downside of this is that even if you use 1000 trees in parallel, you will only get the improvement from the parallelization, as the performance of the trees depends on d rather than on the number of nodes within the tree. However, even if you run all 1000 trees at once on a massive machine, we are at the ~10M nodes/tree which you reported as slow. Still, computation-wise, this seems doable.
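To see why the query cost depends so strongly on d, it helps to look at how a BK-tree prunes. The following is a minimal sketch (not pybktree's code; class and attribute names are my own): a child stored at distance c from its parent can only contain matches if c lies within d of the query's distance to that parent, so a large d lets almost every child through.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (unit costs) with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

class BKTree:
    def __init__(self, distance=edit_distance):
        self.distance = distance
        self.root = None      # node = (word, {distance_to_child: child_node})
        self.visited = 0      # distance evaluations during the last query

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return        # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def find(self, query, max_dist):
        """Return sorted (distance, word) pairs within max_dist of query."""
        self.visited = 0
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            self.visited += 1
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only children stored at a distance in
            # [d - max_dist, d + max_dist] can contain matches.
            for child_d, child in children.items():
                if d - max_dist <= child_d <= d + max_dist:
                    stack.append(child)
        return sorted(results)
```

Comparing `tree.visited` with the number of stored words for different values of `max_dist` is a quick way to measure the effective pruning on your own data before committing to a forest of trees.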
A brute force approach
If you don't mind paying a little, I would look into something like Google Cloud BigQuery, if that doesn't clash with some kind of data confidentiality. They will brute force the solution for you, for a fee. The current rate is $5/TB of data queried. Your dataset is ~10B rows * 20 chars. Taking one byte per char, one query would scan 200GB, so ~$1 per query if you went the lazy way.
However, since the charge is per byte of data in a column and not per complexity of a question, you could improve on this by storing your strings as bits: at 2 bits per letter, this would save you 75% of the expenses.
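The arithmetic above, plus the 2-bit packing, as a short sketch. The four-letter alphabet "ACGT" is an assumption made only for illustration (2 bits per letter implies at most four distinct letters), and the price uses decimal terabytes at the $5/TB rate quoted above.

```python
def query_cost_usd(rows, bytes_per_row, usd_per_tb=5.0):
    """Cost of scanning the whole column once, using decimal terabytes."""
    return rows * bytes_per_row / 1e12 * usd_per_tb

def pack_2bit(text, alphabet="ACGT"):
    """Pack a string over a 4-letter alphabet into an integer, 2 bits per letter."""
    code = {letter: index for index, letter in enumerate(alphabet)}
    packed = 0
    for letter in text:
        packed = (packed << 2) | code[letter]
    return packed

plain = query_cost_usd(10_000_000_000, 20)    # 20 one-byte chars per row
packed = query_cost_usd(10_000_000_000, 5)    # 20 letters * 2 bits = 5 bytes per row
```

Here `plain` comes out at about 1 dollar and `packed` at about 25 cents per full scan, which matches the 75% saving mentioned above.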
Improving further, you can write your query in such a way that it will ask for a dozen strings at once. You might need to be a bit careful to use a batch of similar strings for the purpose of the query to avoid clogging of the result with too many one-offs though.
Brute forcing of the BK-trees
If you go with the route above, you pay depending on the volume of data scanned, so the ~100-fold decrease in the computations needed becomes a ~100-fold decrease in price, which might be useful, especially if you have a lot of queries to run.
However, you would need to figure out a way to store this tree in several layers of databases to query recursively, as the BigQuery pricing depends on the volume of the data in the queried table.
Building a smart batch engine for recursive processing of the queries to minimize the costs could be a fun optimization exercise.
A choice of language
One more thing. While I think that Python is a good language for fast prototyping, analysis and thinking about code in general, you are past that stage. You are currently looking for a way to do a specific, well-defined and well-thought-out operation as fast as possible. Python is not a great language for this, as this example shows. While I used all the tricks I could think of in Python, the Java and C solutions were still several times faster. (Not to mention the Rust one that beat us all, but he beat us by algorithm as well, so it's hard to compare.) So if you go from Python to a faster language, you might gain another factor of ten or maybe even more in performance. This could be another fun optimization exercise.
Note: I am being rather conservative with the estimate, as fuzzywuzzy already offers to use a C library in the background, so I'm not too sure how much of the work still depends on Python. My experience in similar cases is that the performance gain can be a factor of 100 from pure Python (or worse, pure R) to a compiled language.
Quite late to the party, but here is a possible solution which I would implement if I were in your situation:
Save the dataset as a text file, and put that file on a very fast disk region (preferably on tmpfs).
Prepare a beefy computer with many physical CPU cores (such as a Threadripper 3990X, which has 64 cores).
Use this implementation and GNU parallel to grok the dataset.
Here is a bit of technical info behind this solution:
The optimized version of Myers' algorithm (linked above) can process about 14 million entries per sec on a single CPU core.
If you can fully utilize all the 64 physical cores, you can achieve a throughput of 896 million entries per sec (= 14M * 64 cores).
At this speed, you can perform a single query on a dataset of 10 billion entries in about 12 seconds using a single machine.
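For readers who have not seen it, here is a compact Python sketch of Myers' bit-vector algorithm in its global-distance form (the formulation usually credited to Hyyrö). It is not the linked optimized implementation, and Python will be orders of magnitude slower than the figures above; it only shows why one column of the dynamic programming table costs a handful of word operations.

```python
def myers_edit_distance(pattern, text):
    """Levenshtein distance via Myers' bit-vector algorithm (global variant)."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1          # keep only the m bits that belong to the pattern
    last = 1 << (m - 1)          # bit of the last pattern row
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv, score = mask, 0, m   # vertical +1 / -1 deltas of the first column
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask    # horizontal +1 deltas
        mh = pv & xh                     # horizontal -1 deltas
        if ph & last:
            score += 1
        elif mh & last:
            score -= 1
        ph = ((ph << 1) | 1) & mask      # the top boundary row grows by 1 per column
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score
```

In C, with 20-character strings, all of these vectors fit in a single 64-bit register, which is what makes throughputs of millions of comparisons per second per core plausible.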
I posted a more detailed analysis in this article.
As shown in the article, I could perform a query against a dataset of 100 million records in 1.04s with my cheap desktop machine.
By using a more performant CPU (or splitting the task between multiple computers), I believe you can achieve the desired result.
Hope this helps.
This is my first question, so I will do my best to be as clear as possible.
If I can provide UMAP with a distance function that also outputs a gradient or some other relevant information, can I apply UMAP to non-traditional looking data? (I.e., a data set with points of inconsistent dimension, data points that are non-uniformly sized matrices, etc.) The closest I have gotten to finding something that looks vaguely close to my question is in the documentation here (https://umap-learn.readthedocs.io/en/latest/embedding_space.html), but this seems to be sort of the opposite process, and as far as I can tell still supposes you are starting with tuple-based data of uniform dimension.
I'm aware that one way around this is just to calculate a full pairwise distance matrix ahead of time and give that to UMAP, but from what I understand of the way UMAP is coded, it only performs a subset of all possible distance calculations, and is thus much faster for the same amount of data than if I were to take the full pre-calculation route.
I am working in python3, but if there is an implementation of UMAP dimension reduction in some other environment that permits this, I would be willing to make a detour in my workflow to obtain this greater flexibility with incoming data types.
Thank you.
Algorithmically this is quite possible, but in practice most implementations do not support anything other than fixed dimension vectors. If computing the all pairs distances is not tractable another option is to try to find a way to featurize or vectorize the data in a way that will allow for easy distance computations. This is, of course, not always possible. The final option is to implement things yourself, but this requires handling the nearest neighbour search, which is likely a non-trivial coding project in and of itself.
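As a sketch of the all-pairs route for data that cannot be vectorized: define a distance that accepts records of different sizes and build the full matrix up front. The Hausdorff distance on lists of numbers is only a stand-in for whatever distance fits your data, and the helper names are hypothetical.

```python
def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty lists of numbers."""
    def directed(xs, ys):
        return max(min(abs(x - y) for y in ys) for x in xs)
    return max(directed(a, b), directed(b, a))

def pairwise_distances(items, distance):
    """Full symmetric distance matrix as nested lists; O(n^2) distance calls."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = float(distance(items[i], items[j]))
    return matrix

# Records of inconsistent size: each one is a list with a different length.
ragged = [[0.0, 1.0], [0.0], [5.0, 6.0, 7.0]]
dist_matrix = pairwise_distances(ragged, hausdorff)
# A (much larger) matrix of this shape can be passed to umap-learn with
# metric="precomputed", for example umap.UMAP(metric="precomputed").
```

The quadratic number of distance calls is exactly the cost the question hopes to avoid, so this is only practical up to a few tens of thousands of records.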
I'm looking for some general advice on how to either re-write application code to be non-naive, or whether to abandon neo4j for another data storage model. This is not only "subjective", as it relates significantly to specific, correct usage of the neo4j driver in Python and why it performs the way it does with my code.
Background:
My team and I have been using neo4j to store graph-friendly data that is initially stored in Python objects. Originally, we were advised by a local/in-house expert to use neo4j, as it seemed to fit our data storage and manipulation/querying requirements. The data are always specific instances of a set of carefully-constructed ontologies. For example (pseudo-data):
Superclass1 -contains-> SubclassA
Superclass1 -implements->SubclassB
Superclass1 -isAssociatedWith-> Superclass2
SubclassB -hasColor-> Color1
Color1 -hasLabel-> string::"Red"
...and so on, to create some rather involved and verbose hierarchies.
For prototyping, we were storing these data as sequences of grammatical triples (subject->verb/predicate->object) using RDFLib, and using RDFLib's graph-generator to construct a graph.
Now, since this information is just a complicated hierarchy, we just store it in some custom Python objects. We also do this in order to provide an easy API to others devs that need to interface with our core service. We hand them a Python library that is our Object model, and let them populate it with data, or, we populate it and hand it to them for easy reading, and they do what they want with it.
To store these objects permanently, and to hopefully accelerate the writing and reading (querying/filtering) of these data, we've built custom object-mapping code that utilizes the official neo4j python driver to write and read these Python objects, recursively, to/from a neo4j database.
The Problem:
For large and complicated data sets (e.g. 15k+ nodes and 15k+ relations), the object relational mapping (ORM) portion of our code is too slow, and scales poorly. But neither I, nor my colleague are experts in databases or neo4j. I think we're being naive about how to accomplish this ORM. We began to wonder if it even made sense to use neo4j, when more traditional ORMs (e.g. SQL Alchemy) might just be a better choice.
For example, the ORM commit algorithm we have now is a recursive function that commits an object like this (pseudo code):
def commit(object):
    for childstr in object: # For each child object
        child = getattr(object, childstr) # Get the actual object
        if child is <our object base type>: # Open transaction, make nodes and relationship
            with session.begin_transaction() as tx:
                <construct Cypher query with:
                    MERGE object (make object node)
                    MERGE child (make its child node)
                    MERGE object-[]->child (create relation)
                >
                tx.run(<All 3 merges>)
            commit(child) # Recursively write the child and its children to neo4j
Is it naive to do it like this? Would an OGM library like Py2neo's OGM be better, despite ours being customized? I've seen this and similar questions that recommend this or that OGM method, but in this article, it says not to use OGMs at all.
Must we really just implement every method and benchmark for performance? It seems like there must be some best-practices (other than using the batch IMPORT, which doesn't fit our use cases). And we've read through articles like those linked, and seen the various tips on writing better queries, but it seems better to step back and examine the case more generally before attempting to optimize code line-by line. Although it's clear that we can improve the ORM algorithm to some degree.
Does it make sense to write and read large, deep hierarchical objects to/from neo4j using a recursive strategy like this? Is there something in Cypher, or the neo4j drivers that we're missing? Or is it better to use something like Py2neo's OGM? Is it best to just abandon neo4j altogether? The benefits of neo4j and Cypher are difficult to ignore, and our data does seem to fit well in a graph. Thanks.
It's hard to know without looking at all the code and knowing the class hierarchy, but at the moment I'd hazard a guess that your code is slow in the OGM bit because every relationship is created in its own transaction. So you're doing a huge number of transactions for a larger graph which is going to slow things down.
I'd suggest for an initial import where you're creating every class/object, rather than just adding a new one or editing the relationships for one class, that you use your class inspectors to simply create a graph representation of the data, and then use Cypher to construct it in a lot fewer transactions in Neo4J. Using some basic topological graph theory you could then optimise it by reducing the number of lookups you need to do, too.
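A rough sketch of that first step: walk the object hierarchy once in Python and collect plain node and edge rows, without touching the database. `Entity` is a hypothetical stand-in for your object base type, and the `uid` / `label` fields are assumptions about what identifies a node.

```python
class Entity:
    """Hypothetical stand-in for the object-model base type."""
    def __init__(self, uid, label, **children):
        self.uid = uid
        self.label = label
        self.children = children       # relation name -> child Entity

def flatten(root):
    """Walk the hierarchy once and return (nodes, edges) as lists of plain dicts."""
    nodes, edges, seen = [], [], set()
    stack = [root]
    while stack:
        obj = stack.pop()
        if obj.uid in seen:
            continue                   # shared or cyclic references are written once
        seen.add(obj.uid)
        nodes.append({"uid": obj.uid, "label": obj.label})
        for relation, child in obj.children.items():
            edges.append({"start": obj.uid, "type": relation, "end": child.uid})
            stack.append(child)
    return nodes, edges
```

Using an explicit stack instead of recursion also avoids Python's recursion limit on deep hierarchies, and the resulting lists can be sent to the database in a handful of transactions instead of one per relationship.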
You can create a NetworkX MultiDiGraph in your python code to model the structure of your classes. From there on in there are a few different strategies to put the data into Neo4J - I also just found this but have no idea about whether it works or how efficient it is.
The most efficient way to import your graph will depend on the topology of the graph, and whether it is cyclic or not. Some options are below.
1. Create the Graph in Two Sets of Queries
Run one query for every node label to create every node, and then another to create every edge between every combination of node labels (the efficiency of this will depend on how many different node labels you're using).
2. Starting from the topologically highest or lowest point in the graph, create the graph as a series of paths
If you have lots of different edge labels and node labels, this might involve writing a lot of Cypher logic combining UNWIND and FOREACH (_ IN CASE WHEN r.label = 'SomeLabel' THEN [1] ELSE [] END | CREATE (n:SomeLabel {node_unique_id: x})-> ...), but if the graph is very hierarchical you could also use Python to keep track of which nodes already have all their lower nodes and relationships created, and then use that knowledge to limit the size of the paths that get sent to Neo4j in a query.
3. Use APOC to import the whole graph
Another option, which may or may not fit your use case and may or may not be more performant would be to export the graph to GraphML using NetworkX and then use the APOC GraphML import tool.
Again, it's hard to offer a precise solution without seeing all your data, but I hope this is somewhat useful as a steer in the right direction! Happy to help / answer any other questions based on more data.
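To illustrate option 1, here is a sketch that turns node and edge rows into one UNWIND query per node label and one per (start label, relationship type, end label) combination. The row shapes (`uid`, `label`, `start`, `type`, `end`) and the `uid` property are assumptions for this example. Labels and relationship types are formatted into the query text because they cannot normally be passed as query parameters, so they must come from trusted code, not user input.

```python
from collections import defaultdict

def build_batched_queries(nodes, edges):
    """Return (cypher, rows) pairs: node queries first, then relationship queries."""
    by_label = defaultdict(list)
    by_pattern = defaultdict(list)
    label_of = {node["uid"]: node["label"] for node in nodes}
    for node in nodes:
        by_label[node["label"]].append({"uid": node["uid"]})
    for edge in edges:
        pattern = (label_of[edge["start"]], edge["type"], label_of[edge["end"]])
        by_pattern[pattern].append({"start": edge["start"], "end": edge["end"]})
    queries = []
    for label, rows in by_label.items():
        queries.append((f"UNWIND $rows AS row MERGE (n:`{label}` {{uid: row.uid}})", rows))
    for (start_label, rel_type, end_label), rows in by_pattern.items():
        queries.append((
            "UNWIND $rows AS row "
            f"MATCH (a:`{start_label}` {{uid: row.start}}), (b:`{end_label}` {{uid: row.end}}) "
            f"MERGE (a)-[:`{rel_type}`]->(b)",
            rows,
        ))
    return queries
```

Each pair can then be executed with the rows passed as the `rows` parameter, ideally inside one transaction per batch. An index or uniqueness constraint on the `uid` property for each label keeps the MATCH lookups fast.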
There is a lot going on here so I'll try to address this in smaller questions
Would an OGM library like Py2neo's OGM be better
With any ORM/OGM library, the reality is that you can always get better performance by bypassing them and delving into the belly of the beast. That is not really an ORM's entire job, though. An ORM is meant to save you time and effort by making relatively efficient DB use easy.
So it depends, if you want best performance, skip the ORM, and invest your time working on as low a level as you can (*Requires advanced low level knowledge of the beast you are working with, and a lot of your time). Otherwise, an ORM library is usually your best bet.
Our code is too slow, and scales poorly
Databases are complex. If at all possible, I would recommend bringing someone on board to be a company-wide database admin/expert. (This is harder when you don't already have one to vet whether new hires actually know what they are talking about.)
Assuming that is not an option, here are some things to consider.
IO is expensive. Especially over the network. Minimize data that has to be sent in either direction. (This is why you page return results. Only return the data you need, as you actually need it)
Caveat to that, creating request connections is very expensive. Minimize calls to the DB. (Have fun balancing the two ^_^) (Note: ORMs usually have built in mechanics to only commit what has changed)
Get to the data you want fast. Create indexes in the database to vastly improve fetch speed. The more unique and consistent the id is, the better.
Caveat, indexes have to be updated on writes that alter a value in them. So indexes reduce write speed and eat more memory to gain read speed. Minimize indexes.
Transactions are a memory operation. Committing a transaction is a disk IO operation. This is why batch jobs are far more efficient.
Caveat, Memory isn't infinite. Keep your jobs a reasonable size.
As you can probably tell, scaling DB operations to production levels is not fun. It's too easy to burn yourself over-optimizing on any axis, and this is just surface level over simplifications.
For prototyping, we were storing these data as sequences of grammatical triples
Less a question and more a statement, but different types of databases have different strengths and weaknesses. Schema-less DBs are more specialized for cache stores; graph DBs are specialized for querying based on relationships (edges); relational DBs are specialized for fetching/updating records (tables); and triplestores are more specialized for, well, triples (RDF). (There are more types, etc.)
I mention this because it sounds like your data might be mostly "write once, read many". In this case, you probably actually should be using a triplestore. You can use any DB type for anything, but picking the best DB requires you to know how you use your data, and how that use can possibly evolve.
Must we really just implement every method and benchmark for performance?
Well, this is part of why stored procedures are so important. ORMs help abstract this part, and having an in house domain expert would really help. It could just be that you are pushing the limits of what 1 machine can do. Maybe you just need to upgrade to a cluster; or maybe you have horrible code inefficiencies that have you touching a node 10k times in 1 save operation when no (or 1) value changed. To be honest though, bench-marking doesn't do much unless you know what you are looking for. For example, usually the difference between 5 hours and 0.5 seconds could be as simple as creating 1 index.
(To be fair, while buying bigger and better database servers/clusters may be the inefficient solution, it is sometimes the most cost-effective compared to the salary of one database admin. So, again, it depends on your priorities. And I'm sure your boss would probably prioritize differently from what you'd like.)
TL;DR
You should hire a domain expert to help you.
If that is not an option, go to the bookstore (or Google), pick up Databases 4 dummies (or hands-on online database tutorial classes), and become the domain expert yourself. (Which you can then use to boost your worth to the company.)
If you don't have time for that, probably your only saving grace would be to just upgrade your hardware to solve the problem with brute force. (*As long as growth isn't exponential)
Is anyone aware of a KD-Tree, or similar spatial index, implemented in SQL? I was considering writing my own using Python and Django's ORM, but I'd like to avoid reinventing the wheel.
I have a table containing millions of rows, with each row containing 128 columns representing image feature data. Given an arbitrary 128-element long list of image features, I want to use a KD-Tree to find the N most similar images in the database. I've found a lot of KD-Tree implementations, but they all appear to only load in local memory and don't scale or talk to databases.
KD-tree does not work well for high-dimensional data, and 128 dimensions would be quite high. The KD-tree indexes each dimension at a different level of the tree, and when performing a query the algorithm will do a lot of back-tracking (searching both sides of a branch) and ends up searching most of the points in the tree. When this happens the advantages of using a tree structure disappear and an exhaustive comparison ends up running faster.
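For reference, the exhaustive comparison mentioned above is short to write. This sketch streams rows (for example from a database cursor) and keeps only the k best candidates, so memory use does not grow with the table. The function names and the choice of squared Euclidean distance are illustrative.

```python
import heapq

def squared_euclidean(a, b):
    """Squared Euclidean distance; it ranks neighbors the same as the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbors(query, rows, k):
    """Exhaustive scan: return the k (distance, row_id) pairs closest to query.

    rows is any iterable of (row_id, feature_vector) pairs.
    """
    return heapq.nsmallest(
        k, ((squared_euclidean(query, vector), row_id) for row_id, vector in rows)
    )
```

For millions of 128-dimensional rows you would normally do the same scan with a vectorized library or inside the database, but the structure stays the same.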
You may want to find an existing image similarity search system that you can map your data into. Here is one called Lire which extracts features from images and indexes them using Lucene.
If your work is more research-oriented you may want to read up on metric space indexes and approximate k-nearest neighbor search.
I might be a little off here, but your best bet may be using the GiST / GIN indexes inside of PostgreSQL.