I currently have a dataset of textual embeddings (768 dimensions each), with ~1 million records at the moment. I am looking to detect related embeddings through a community detection algorithm. For small data sets, I have been able to use this one:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py
It works great, but it doesn't really scale once the data set grows larger than memory.
The key here is that I am able to specify a threshold for community matches. I have found clustering algorithms that scale to larger-than-memory data, but they all require a fixed number of clusters to be specified ahead of time. I need the system to detect the number of clusters for me.
I'm certain there is a class of algorithms (and hopefully a Python library) that can handle this situation, but I have been unable to locate it. Does anyone know of an algorithm or a solution I could use?
That seems small enough that you could just rent a bigger computer.
Nevertheless, to answer the question: the typical play is to split the data into a few chunks (overlapping or not) that fit in memory and then apply a higher-quality in-memory clustering algorithm to each chunk. One common strategy for cosine similarity is to bucket points by their SimHashes, but there is a whole literature out there; if you already have a scalable clustering algorithm you like, you can use that.
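A minimal sketch of the SimHash bucketing in Python (the bit width and the reuse of the fast_clustering-style community detection inside each bucket are illustrative choices, not a prescription):

import numpy as np

def simhash_buckets(embeddings, n_bits=8, seed=0):
    # The sign pattern of projections onto n_bits random hyperplanes gives each
    # embedding an integer signature; vectors with high cosine similarity tend
    # to share a signature, so every bucket is a chunk that fits in memory.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], n_bits))
    bits = (embeddings @ planes) > 0
    return bits @ (1 << np.arange(n_bits))        # one integer bucket id per row

# buckets = simhash_buckets(embeddings)           # embeddings: (1_000_000, 768) array
# for b in np.unique(buckets):
#     chunk = embeddings[buckets == b]
#     # run the threshold-based community detection from fast_clustering.py on `chunk`,
#     # e.g. sentence_transformers.util.community_detection(chunk, threshold=0.75)

A single hash table can split a true community across buckets, so in practice you would use overlapping chunks, e.g. several hash seeds, and merge communities that share members afterwards.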
I am currently working with a dataset of over 2.5 million images, where each image is compared against every other image, for use in a content-based recommendation engine.
I use the following code to calculate the cosine similarity using some precomputed embeddings.
from sklearn.metrics import pairwise_distances
cosine_similarity = 1 - pairwise_distances(embeddings, metric='cosine')
However, my issue is that I've estimated that creating this similarity matrix would require around 11,000 GB of memory.
Are there any alternatives for getting a similarity metric between every data point in my dataset, or is there another way to go about this whole process?
You have 2,500,000 entries, so the resulting matrix has 6.25e+12 entries. You need to ask yourself what you plan to do with this data, compute only what you need, and then the storage will follow. Computing a cosine distance is almost free (it is literally a dot product), so you can always just do it "on the fly" with no need to precompute, and the question really boils down to how much actual time/compute you can afford.
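For example, if what you ultimately need is just the top-k most similar images per image, you can compute that block-wise on the fly and never hold more than one slice of the matrix. A rough NumPy sketch (the function and the block size are illustrative; embeddings is assumed to be an (n, d) float32 array):

import numpy as np

def topk_cosine(embeddings, k=10, block=256):
    # Normalize once; cosine similarity is then a plain dot product.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = X.shape[0]
    neighbours = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, block):
        sims = X[start:start + block] @ X.T                   # one (block, n) slice, never the full matrix
        np.fill_diagonal(sims[:, start:start + block], -1.0)  # mask self-similarity
        neighbours[start:start + block] = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    return neighbours                                          # top-k ids per row (unordered within the k)

With 2.5 million float32 embeddings, each (block, n) slice is about block * 2,500,000 * 4 bytes, so block=256 stays around 2.5 GB; approximate-nearest-neighbour libraries such as Faiss or Annoy cut the compute further if exact results are not required.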
If you have a recommendation business problem built on these 2.5 million images, you may want to check out TensorFlow Recommenders, which basically uses 30% of the data for retrieval, and you can run a second ranking classifier on top of the initial model to explore more. This two-step approach is key under memory constraints and is already battle-tested by Instagram and others.
I am building a massive network that is filled with isolated nodes but also some rather large clusters. I have used the Louvain algorithm to obtain the best partition; however, some communities are too large. I am curious which algorithms (preferably with Python frameworks) have a run time similar to Louvain but penalize overly large communities while still achieving good modularity.
You may try to iterate the community detection algorithm (Louvain or another) by running it again on the overly large communities you find in the first pass. This will partition them into smaller ones.
Notice also that Louvain and other community detection algorithms generally do not produce the best partition, but a good partition with respect to a given quality function. In most cases, finding the best partition is NP-hard.
With this in mind, one may include a scale parameter in the quality function and detect relevant communities at different scales; see "Post-Processing Hierarchical Community Structures: Quality Improvements and Multi-scale View".
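As a concrete illustration of the iteration idea, here is a rough sketch assuming a recent NetworkX (2.8 or later, which ships louvain_communities); the size cutoff and resolution are placeholders to tune:

import networkx as nx
from networkx.algorithms.community import louvain_communities

def split_large_communities(G, max_size=1000, resolution=1.0):
    # One Louvain pass, then re-run Louvain on the subgraph of every
    # community that comes out larger than max_size.
    final, queue = [], list(louvain_communities(G, resolution=resolution))
    while queue:
        comm = queue.pop()
        if len(comm) <= max_size:
            final.append(comm)
            continue
        parts = louvain_communities(G.subgraph(comm), resolution=resolution)
        if len(parts) == 1:          # Louvain refuses to split this one further
            final.append(comm)
        else:
            queue.extend(parts)
    return final

# communities = split_large_communities(nx.karate_club_graph(), max_size=10)

The resolution parameter plays the role of the scale parameter mentioned above: values above 1 bias the quality function toward smaller communities.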
I am trying to cluster a data set with about 1,100,000 observations, each with three values.
The code is pretty simple in R:
df11.dist <- dist(df11cl)
where df11cl is a data frame with three columns and 1,100,000 rows, and all the values in it are standardized.
The error I get is:
Error: cannot allocate vector of size 4439.0 Gb
Recommendations for similar problems include increasing RAM or chunking the data. I already have 64 GB of RAM and my virtual memory is 171 GB, so I don't think increasing RAM is a feasible solution. Also, as far as I know, hierarchical clustering on chunked data yields different results, so it seems that using only a sample of the data is out of the question.
I have also found this solution, but the answers there actually alter the question: they essentially advise k-means. K-means could work if one knew the number of clusters beforehand, but I do not. That said, I ran k-means with different numbers of clusters, but now I don't know how to justify selecting one over another. Is there any test that can help?
Can you recommend anything in either R or python?
For trivial reasons, the function dist needs quadratic memory.
So if you have 1 million (10^6) points, a quadratic matrix needs 10^12 entries. With double precision, you need 8 bytes per entry. Exploiting symmetry, you only need to store half of the entries, but that is still 4*10^12 bytes, i.e. 4 terabytes just to store this matrix. Even if you stored it on SSD or upgraded your system to 4 TB of RAM, computing all of these distances would take an insane amount of time.
And 1 million is still pretty small, isn't it?
Using dist on big data is impossible. End of story.
For larger data sets, you'll need to
use methods such as k-means that do not use pairwise distances
use methods such as DBSCAN that do not need a distance matrix, and where in some cases an index can reduce the effort to O(n log n)
subsample your data to make it smaller
In particular that last thing is a good idea if you don't have a working solution yet. There is no use in struggling with scalability of a method that does not work.
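In Python, the first two options might look like this with scikit-learn (a sketch only; the random array stands in for your standardized three-column data, and n_clusters, eps and min_samples all need tuning):

import numpy as np
from sklearn.cluster import DBSCAN, MiniBatchKMeans

X = np.random.default_rng(0).standard_normal((1_100_000, 3))   # stand-in for the standardized df11cl

# Option 1: mini-batch k-means: linear memory, no distance matrix, but k must be chosen.
km_labels = MiniBatchKMeans(n_clusters=50, batch_size=10_000).fit_predict(X)

# Option 2: DBSCAN: no k, and in 3 dimensions a kd-tree index keeps neighbor queries cheap;
# eps and min_samples have to be tuned instead, and label -1 marks noise.
db_labels = DBSCAN(eps=0.1, min_samples=20, algorithm="kd_tree").fit_predict(X)

Note that sklearn's DBSCAN still stores every point's eps-neighbourhood, so choosing eps too large can itself exhaust memory.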
I am testing out a few clustering algorithms on a dataset of text documents (with word frequencies as features). Running some of the methods of scikit-learn's clustering module one after the other, below is how long they take on ~50,000 files with 26 features per file. There are big differences in how long each takes to converge, and these get more extreme the more data I put in; some of them (e.g. MeanShift) just stop working once the dataset grows beyond a certain size.
(Times given below are totals from the start of the script, i.e. KMeans took 0.004 minutes, Meanshift (2.56 - 0.004) minutes, etc. )
shape of input: (4957, 26)
KMeans: 0.00491824944814
MeanShift: 2.56759268443
AffinityPropagation: 4.04678163528
SpectralClustering: 4.1573699673
DBSCAN: 4.16347868443
Gaussian: 4.16394021908
AgglomerativeClustering: 5.52318491936
Birch: 5.52657626867
I know that some clustering algorithms are inherently more computationally intensive (e.g. the chapter here outlines that k-means' cost is linear in the number of data points, while hierarchical models are O(m^2 log m)).
So I was wondering:

How can I determine how many data points each of these algorithms can handle, and are the number of input files and the number of input features per file equally relevant here?

How much does the computational cost depend on the clustering settings, e.g. the distance metric in KMeans or the eps parameter in DBSCAN?

Does clustering success influence computation time? Some algorithms such as DBSCAN finish very quickly, maybe because they don't find any clustering in the data; MeanShift does not find clusters either and still takes forever (I'm using the default settings here). Might that change drastically once they discover structure in the data?

How much is raw computing power a limiting factor for these kinds of algorithms? Will I be able to cluster ~300,000 files with ~30 features each on a regular desktop computer, or does it make sense to use a computer cluster for this kind of thing?
Any help is greatly appreciated! The tests were run on a Mac mini, 2.6 GHz, 8 GB. The data input is a NumPy array.
This is too broad a question.
In fact, most of these questions are unanswered.
For example, k-means is not simply linear, O(n): because the number of iterations needed until convergence tends to grow with data set size, it is more expensive than that (if run until convergence).
Hierarchical clustering can be anywhere from O(n log n) to O(n^3) mostly depending on the way it is implemented and on the linkage. If I recall correctly, the sklearn implementation is the O(n^3) algorithm.
Some algorithms have parameters that make them stop early, before they are actually finished. For k-means, you should use tol=0 if you want the algorithm to really run to convergence; otherwise it stops early once the relative improvement drops below this factor, which can be much too early. MiniBatchKMeans never converges: because it only looks at random parts of the data each time, it would just go on forever unless you choose a fixed number of iterations.
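For instance, with scikit-learn (the parameter values here are arbitrary):

from sklearn.cluster import KMeans, MiniBatchKMeans

# Run standard k-means to actual convergence instead of stopping at the default tolerance.
km = KMeans(n_clusters=20, tol=0, max_iter=10_000)

# MiniBatchKMeans only ever sees random batches, so the amount of work is bounded explicitly.
mbk = MiniBatchKMeans(n_clusters=20, batch_size=10_000, max_iter=500)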
Never try to draw conclusions from small data sets. You need to go to your limits, i.e. what is the largest data set you can still process within, say, 1, 2, 4, and 12 hours with each algorithm?
To get meaningful results, your runtimes should be hours, except if the algorithms simply run out of memory before that; then you might be interested in predicting how far you could scale before you run out of memory: assuming you had 1 TB of RAM, how large would the data be that you could still process?
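A crude sketch of that kind of scaling experiment (the doubling schedule, the KMeans settings and the time budget are arbitrary choices):

import time
from sklearn.cluster import KMeans

def largest_size_within(X, budget_seconds, make_estimator, start_n=1_000):
    # Double the sample size until a single fit exceeds the time budget.
    n, largest = start_n, 0
    while n <= X.shape[0]:
        t0 = time.perf_counter()
        make_estimator().fit(X[:n])
        if time.perf_counter() - t0 > budget_seconds:
            break
        largest, n = n, n * 2
    return largest

# e.g. largest_size_within(X, 3600, lambda: KMeans(n_clusters=20, tol=0))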
The problem is, you can't simply use the same parameters for data sets of different sizes. If you do not choose the parameters well (e.g. DBSCAN puts everything into noise, or everything into one cluster), then you cannot draw conclusions from that either.
And then, there might simply be an implementation issue. DBSCAN in sklearn has become a lot faster recently. It's still the same algorithm, so most benchmark results from two years ago were simply wrong, because the implementation of DBSCAN in sklearn was bad. Now it is much better, but is it optimal? Probably not. And similar problems might be present in any of these algorithms!
Thus, doing a good benchmark of clustering is really difficult. In fact, I have not seen a good benchmark in a looong time.
Hi, I have been trying to implement the DBSCAN algorithm on Neo4j, but I am running into serious performance bottlenecks. I'll describe the implementation and then ask for help.
I discretized the possible epsilon values and stored, on each node, a count of neighbors under each discretization, so that all of the core nodes can be retrieved in one query:
START a = node(*)
WHERE a.rel<cutoff threshold>! >= {minp}
RETURN a
This part is fast; the part that isn't fast is the follow-up query:
START a = node({i})
SET a.label<cutoff threshold>_<minpoints> = {clust}
WITH a
MATCH a -[:'|'.join(<valid distance relations>)]- (x)
WHERE not(has(x.label<cutoff threshold>_<minpoints>))
WITH x
SET x.label<cutoff threshold>_<minpoints>={clust}
RETURN x
I then pick a core node to start from, and as long as there are still core node neighbors, run the above query to label their neighbors.
I think the problem is that my graph has very different levels of sparsity: starting from only weak similarity it is almost fully connected, with ~50M relationships between ~10k nodes, whereas at strong similarity there are as few as ~20k relationships between ~10k nodes (or fewer). No matter what, it is always REALLY slow. What is the best way for me to handle this? Is it to index on relationship type and starting node? I haven't been able to find any resources on this problem, and surprisingly there isn't already an implementation, since this is a pretty standard graph algorithm. I could use scikit-learn, but then I would be restricted to in-memory distance matrices only :(
What version of neo4j did you try this with?
Up until 1.8, performance was not a design goal of Cypher (the focus was rather on the language itself).
Have a look at a recent snapshot (1.9-SNAP).
Also make sure that your hot dataset is not just being loaded from disk (otherwise you are measuring disk I/O), i.e. that your memory-mapped settings and your JVM heap are large enough.
You might also want to check out the GCR cache from Neo4j enterprise which has a smaller memory footprint.
What is the cardinality of count(x) in your query? If it is too small, you have too many small transactions going on. Depending on whether you run Python embedded or go via REST, use a larger transaction scope or REST batch operations.
You're already using parameters, which is great. What is the variability of your relationship types?
Any chance to share your dataset/generator and the code with us (Neo4j) for performance testing on our side?
There are DBSCAN implementations around that use indexing. I don't know about Neo4j, so I can't really tell whether your approach is efficient. What you might need to precompute is actually a sparse version of your graph, with only the edges that are within the epsilon threshold.
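In Python, for example, scikit-learn can precompute exactly that sparse epsilon graph (a sketch; the random array and eps stand in for your data and threshold):

import numpy as np
from sklearn.neighbors import radius_neighbors_graph

X = np.random.default_rng(0).standard_normal((10_000, 16))    # stand-in for the node feature vectors
eps = 0.5                                                      # placeholder distance threshold

# Sparse adjacency matrix keeping only the edges within eps: the only structure a
# DBSCAN-style expansion actually needs. sklearn's DBSCAN can consume it directly
# with metric="precomputed".
A = radius_neighbors_graph(X, radius=eps, mode="distance")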
What I'd like to point out is that you apparently have different densities in your data set, so you might instead want to use OPTICS, a variant of DBSCAN that does away with the epsilon parameter (and also doesn't need to distinguish "core" nodes, as every node is a core node for some epsilon). Do not use the Weka version (or the Weka-inspired Python version that is floating around); they are half OPTICS and half DBSCAN.
When you have efficient sorted updatable heaps available, OPTICS can be pretty fast.
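For reference, scikit-learn nowadays ships an OPTICS implementation as sklearn.cluster.OPTICS; a minimal sketch (the random data and min_samples are stand-ins):

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.default_rng(0).standard_normal((10_000, 16))    # stand-in for the feature vectors

# OPTICS orders points by reachability distance instead of fixing a single epsilon,
# so clusters of different densities can be extracted from one run.
labels = OPTICS(min_samples=20).fit_predict(X)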