scikit-learn DBSCAN memory usage

scikit-learn DBSCAN memory usage - python

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implimentation to do my clustering rather than scikit-learn's. It can be run from the command line and with proper indexing, performs this task within a few hours. Use the GUI and small sample datasets to work out the options you want to use and then go to town. Worth looking into. Anywho, read on for a description of my original problem and some interesting discussion.
I have a dataset with ~2.5 million samples, each with 35 features (floating point values) that I'm trying to cluster. I've been trying to do this with scikit-learn's implementation of DBSCAN, using the Manhattan distance metric and a value of epsilon estimated from some small random samples drawn from the data. So far, so good. (here is the snippet, for reference)
db = DBSCAN(eps=40, min_samples=10, metric='cityblock').fit(mydata)
My issue at the moment is that I easily run out of memory. (I'm currently working on a machine with 16 GB of RAM)
My question is, is DBSCAN calculating the pairwise distance matrix on the fly as it runs, and that's what's gobbling up my memory? (2.5 million ^ 2) * 8 bytes is obviously stupidly large, I would understand that. Should I not be using the fit() method? And more generally, is there a way around this issue, or am I generally barking up the wrong tree here?
Apologies if the answer winds up being obvious. I've been puzzling over this for a few days. Thanks!
Addendum: Also if anyone could explain the difference between fit(X) and fit_predict(X) to me more explicitly I'd also appreciate that--I'm afraid I just don't quite get it.
Addendum #2: To be sure, I just tried this on a machine with ~550 GB of RAM and it still blew up, so I feel like DBSCAN is likely trying to make a pairwise distance matrix or something I clearly don't want it to do. I guess now the big question is how to stop that behavior, or find other methods that might suit my needs more. Thanks for bearing with me here.
Addendum #3(!): I forgot to attach the traceback, here it is,
Traceback (most recent call last):
File "tDBSCAN.py", line 34, in <module>
db = DBSCAN(eps=float(sys.argv[2]), min_samples=10, metric='cityblock').fit(mydata)
File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/base.py", line 329, in fit_predict
self.fit(X)
File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 186, in fit
**self.get_params())
File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 69, in dbscan
D = pairwise_distances(X, metric=metric)
File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 651, in pairwise_distances
return func(X, Y, **kwds)
File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 237, in manhattan_distances
D = np.abs(X[:, np.newaxis, :] - Y[np.newaxis, :, :])
MemoryError

The problem apparently is a non-standard DBSCAN implementation in scikit-learn.
DBSCAN does not need a distance matrix. The algorithm was designed around using a database that can accelerate a regionQuery function, and return the neighbors within the query radius efficiently (a spatial index should support such queries in O(log n)).
The implementation in scikit however, apparently, computes the full O(n^2) distance matrix, which comes at a cost both memory-wise and runtime-wise.
So I see two choices:
You may want to try the DBSCAN implementation in ELKI instead, which when used with an R*-tree index usually is substantially faster than a naive implementation.
Otherwise, you may want to reimplement DBSCAN, as the implementation in scikit apparently isn't too good. Don't be scared of that: DBSCAN is really simple to implement yourself. The trickiest part of a good DBSCAN implementation is actually the regionQuery function. If you can get this query fast, DBSCAN will be fast. And you can actually reuse this function for other algorithms, too.
Update: by now, sklearn no longer computes a distance matrix and can, e.g., use a kd-tree index. However, because of "vectorization" it will still precompute the neighbors of every point, so the memory usage of sklearn for large epsilon is O(n²), whereas to my understanding the version in ELKI will only use O(n) memory. So if you run out of memory, choose a smaller epsilon and/or try ELKI.

You can do this using scikit-learn's DBSCAN with the haversine metric and ball-tree algorithm. You do not need to precompute a distance matrix.
This example clusters over a million GPS latitude-longitude points with DBSCAN/haversine and avoids memory usage problems:
df = pd.read_csv('gps.csv')
coords = df.as_matrix(columns=['lat', 'lon'])
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
Note that this specifically uses scikit-learn v0.15, as some earlier/later versions seem to require a full distance matrix to be computed, which blows up your RAM real quick. But if you use Anaconda, you can quickly set this up with:
conda install scikit-learn=0.15
Or, create a clean virtual environment for this clustering task:
conda create -n clusterenv python=3.4 scikit-learn=0.15 matplotlib pandas jupyter
activate clusterenv

This issue with sklearn is discussed here:
https://github.com/scikit-learn/scikit-learn/issues/5275
There are two options presented there;
One is to use OPTICS (which requires sklearn v21+), which is an alternative but closely related algorithm to DBSCAN:
https://scikit-learn.org/dev/modules/generated/sklearn.cluster.OPTICS.html
The others are to precompute the adjacency matrix, or to use sample weights.
Some more details on these options can be found under Notes here:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

I faced the same problem when i was using old version on sklearn 0.19.1 because the complexity was O(N^2).
But now the problem has been resolved in new version 0.20.2 and no memory error anymore, and the complexity become O(n.d) where d is the average number of neighbours.
it's not the ideal complexity but much better than old versions.
Check the notes in this release, to avoid high memory usage:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

The DBSCAN algorithm actually does compute the distance matrix, so no chance here.
For this much data, I would recommend using MiniBatchKMeans.
You can not use the Manhattan metric there out of the box, but you could do your own implementation. Maybe try the standard implementation with the euclidean metric first.
I don't know many clustering algorithms that don't perform pairwise distances.
Using the newly embedded cheat-sheet bottom center: though luck.

Related

How to implement custom metric in UMAP?

I am looking to linearly combine features to be used by UMAP. Some of them are GCS coordinates and require a haversine treatment while others can be compared using their euclidean distance.
distance(v1, v2) = alpha * euclidean(f1_eucl, f2_eucl) + beta * haversine(f1_hav, f2_hav)
So far, I have tried:
Creating a custom distance matrix. The squared matrix takes 70GB using float64, 35GB with float32. Using fastdist, I get a computation time of 7min, which is quite slow compared to UMAP's 2-3min -- all included. This approach falls apart as soon as I try adding the euclidean and haversine matrices together (140GB which is massive compared to UMAP's 5GB). I also tried chunking the computation using dask. The result is memory-friendly but my session kept crashing so I couldn't even tell how long that would have taken.
Using a custom callable to be ingested by UMAP. Thanks to the jit compilation using numba, I get the results quite quickly and have no memory problem. The major problem here is it looks like UMAP ignores my callable when the dataset reaches 4096 in size. If I set the callable to return 0, UMAP still shows the patterns of the original dataset in the graphs. If somebody could explain me what this is due to, that'd be great.
In summary, how do you go about, practically speaking, implementing a custom metric in UMAP? And bonus question, do you think this is a sound approach? Thanks.

The custom metric in numba should work for more than 4096 points. That's a relevant number because that's the stage at which it cuts over to using approximate nearest neighbor search (which is passes of to pynndescent). Now pynndescent does support custom metrics compiled with numba, so if it is somehow going astray it is because that is not getting passed to pynndescent correctly. Still, I would have expected an error, not defaulting to euclidean distance.

Performing UMAP dimension reduction on inconsistently shaped data - python

first question, I will do my best to be as clear as possible.
If I can provide UMAP with a distance function that also outputs a gradient or some other relevant information, can I apply UMAP to non-traditional looking data? (I.e., a data set with points of inconsistent dimension, data points that are non-uniformly sized matrices, etc.) The closest I have gotten to finding something that looks vaguely close to my question is in the documentation here (https://umap-learn.readthedocs.io/en/latest/embedding_space.html), but this seems to be sort of the opposite process, and as far as I can tell still supposes you are starting with tuple-based data of uniform dimension.
I'm aware that one way around this is just to calculate a full pairwise distance matrix ahead of time and give that to UMAP, but from what I understand of the way UMAP is coded, it only performs a subset of all possible distance calculations, and is thus much faster for the same amount of data than if I were to take the full pre-calculation route.
I am working in python3, but if there is an implementation of UMAP dimension reduction in some other environment that permits this, I would be willing to make a detour in my workflow to obtain this greater flexibility with incoming data types.
Thank you.

Algorithmically this is quite possible, but in practice most implementations do not support anything other than fixed dimension vectors. If computing the all pairs distances is not tractable another option is to try to find a way to featurize or vectorize the data in a way that will allow for easy distance computations. This is, of course, not always possible. The final option is to implement things yourself, but this requires handling the nearest neighbour search, which is likely a non-trivial coding project in and of itself.

varying number of clusters without recalculating tree every time

We are using sklearn in python and trying to run agglomerative clustering (Wards) on a range of cluster numbers (i.e. N=2-9) using full_tree without having to re-compute the tree for each individual value of N, by using the cache. This was answered in an old post from 2016 but that answer doesn't seem to work anymore. See (sklearn agglomerative clustering: dynamically updating the number of clusters).
In other words, run fit over different values of N, without re-clustering every time. However, we are getting syntax errors and not able to call up the labels for any of the clusters stored in cache afterwards. Code looks something like:
x = AgglomerativeClustering(memory="mycachedir", compute_full_tree=True
but x.fit_predict(inputDF{2}) does not fit the syntax of the memory access command
Anybody know the syntax for calling up labels from the cache in this scenario? Thanks
P.S. I'm a newbie so apologies in advance if I am not being clear.
Tried solution posted in 2016 (sklearn agglomerative clustering: dynamically updating the number of clusters).
Code looks something like:
x = AgglomerativeClustering(memory="mycachedir", compute_full_tree=True
but x.fit_predict(inputDF{2}) does not fit the syntax of the memory access command
We expect to run clustering on a given array input and retrieve labels of each cluster when we vary the number of clusters "N" over a range, using the cache memory rather than re-computing the tree every time

The sklearn API is badly suited for this.
It's much better to use agglomerative clustering from scipy. Because it consists of two steps: Building the linkage / dendrogram and then extracting a flat clustering from this. The first step is O(n³) with Ward, but the second step is only O(n) I think. A similar approach can be found in ELKI, too. But unfortunately, sklearn follows this narrow "fit-predict" view originating from classification, and that does not support such a two-step approach.
There is also other functionality available in scipy, but not in sklearn, if I am not mistaken. Just have a look.

efficient filtering near/inside clusters after they found - python

essentially I applied a DBSCAN algorithm (sklearn) with an euclidean distance on a subset of my original data. I found my clusters and all is fine: except for the fact that I want to keep only values that are far enough from those on which I did not run my analysis on. I have a new distance to test such new stuff with and I wanted to understand how to do it WITHOUT numerous nested loops.
in a picture:
my found clusters are in blue whereas the red ones are the points to which I don't want to be near. the crosses are the points belonging to the cluster that are carved out as they are within the new distance I specified.
now, as much I could do something of the sort:
for i in red_points:
for j in blu_points:
if dist(i,j) < given_dist:
original_dataframe.remove(j)
I refuse to believe there isn't a vectorized method. also, I can't afford to do as above simply because I'll have huge tables to operate upon and I'd like to avoid my CPU to evaporate away.
any and all suggestions welcome

Of course you can vectoriue this, but it will then still be O(n*m). Better neighbor search algorithms are not vectorized. e.g. kd-tree and ball-tree.
Both are available in sklearn, and used by the DBSCAN module. Please see the sklearn.neighbors package.

If you need exact answers, the fastest implementation should be sklearn's pairwise distance calculator:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
If you can accept an approximate answer, you can do better with the kd tree's queryradius(): http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html

Scaling t-SNE to millions of observations in scikit-learn

t-SNE can supposedly scale to millions of observations (see here), but I'm curious how that can be true, at least in the Sklearn implementation.
I'm trying it on a dataset with ~100k items, each with ~190 features. Now, I'm aware that I can do a first pass of dimensionality reduction with, e.g. PCA, but the problem seems more fundamental.
t-SNE computes and stores the full, dense similarity matrix calculated for the input observations (
I've confirmed this by looking at the source). In my case, this is a 10 billion element dense matrix, which by itself requires 80 GB+ of memory. Extrapolate this to just one million observations, and you're looking at 8 terabytes of RAM just to store the distance matrix (let alone computation time...)
So, how can we possibly scale t-SNE to millions of datapoints in the sklearn implementation? Am I missing something? The sklearn docs at least imply that it's possible:
By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples.
That's my emphasis, but I would certainly read that as implying the Barnes-hut method can scale to millions of examples, but I'll reiterate that the code requires calculating the full distance matrix well before we even get to any of the actual t-sne transformations (with or without Barnes-hut).
So am I missing something? Is it possible to scale this up to millions of datapoints?

Barnes-Hut does NOT require you to compute and storex the full, dense similarity matrix calculated for the input observations.
Also, take a look at the references mentioned at the documentation. In particular, this one. Quoting that page:
The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets. We applied it on data sets with up to 30 million examples.
That page also links to this talk about how the approximation works: Visualizing Data Using t-SNE.

I recommend you using another algorithm called UMAP. It is proven to perform at least as well as t-SNE and in most cases, it performs better. Most importantly, it scales significantly better. Their approach to the problem is similar so they generate similar results but UMAP is a lot faster (Look at the last graph here: https://umap-learn.readthedocs.io/en/latest/benchmarking.html). You can look at the original paper and the following link for details.
https://www.nature.com/articles/nbt.4314.pdf
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668#:~:text=tSNE%20is%20Dead.&text=Despite%20tSNE%20made%20a%20dramatic,be%20fixed%20sooner%20or%20later.

OpenVisuMap (at github) implements t-SNE without resorting to approximation. It uses GPU to calculate the distance matrix on-fly. It still has O(N^2) calculation complexity, but only O(N) memory complexity.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.