t-SNE can supposedly scale to millions of observations (see here), but I'm curious how that can be true, at least in the sklearn implementation.
I'm trying it on a dataset with ~100k items, each with ~190 features. Now, I'm aware that I can do a first pass of dimensionality reduction with, e.g. PCA, but the problem seems more fundamental.
t-SNE computes and stores the full, dense similarity matrix for the input observations (I've confirmed this by looking at the source). In my case, that's a 10-billion-element dense matrix, which by itself requires 80+ GB of memory. Extrapolate to just one million observations and you're looking at 8 terabytes of RAM just to store the distance matrix (to say nothing of computation time...).
So, how can we possibly scale t-SNE to millions of datapoints in the sklearn implementation? Am I missing something? The sklearn docs at least imply that it's possible:
By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples.
That's my emphasis, and I would certainly read it as implying that the Barnes-Hut method can scale to millions of examples. But I'll reiterate that the code requires calculating the full distance matrix well before we get to any of the actual t-SNE transformations (with or without Barnes-Hut).
So am I missing something? Is it possible to scale this up to millions of datapoints?
Barnes-Hut does NOT require you to compute and store the full, dense similarity matrix for the input observations.
Also, take a look at the references mentioned in the documentation. In particular, this one. Quoting that page:
The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets. We applied it on data sets with up to 30 million examples.
That page also links to this talk about how the approximation works: Visualizing Data Using t-SNE.
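For reference, selecting the Barnes-Hut approximation in sklearn is just a matter of the `method` parameter. A minimal sketch on random data (the sizes and `perplexity` value here are illustrative, not tuned):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))  # stand-in for a larger dataset

# method='barnes_hut' (the default) approximates the O(N^2) gradient;
# method='exact' is the variant the docs say cannot scale to millions
tsne = TSNE(n_components=2, method='barnes_hut', perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)  # (500, 2)
```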
I recommend another algorithm, UMAP. It has been shown to perform at least as well as t-SNE, and in most cases it performs better. Most importantly, it scales significantly better. The two approaches to the problem are similar, so they generate similar results, but UMAP is a lot faster (look at the last graph here: https://umap-learn.readthedocs.io/en/latest/benchmarking.html). You can look at the original paper and the following links for details.
https://www.nature.com/articles/nbt.4314.pdf
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668#:~:text=tSNE%20is%20Dead.&text=Despite%20tSNE%20made%20a%20dramatic,be%20fixed%20sooner%20or%20later.
OpenVisuMap (on GitHub) implements t-SNE without resorting to approximation. It uses the GPU to calculate the distance matrix on the fly. It still has O(N^2) computational complexity, but only O(N) memory complexity.
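That memory-for-recomputation trade-off is easy to sketch in plain NumPy: distance rows are generated chunk by chunk and discarded, so the full matrix never exists in memory. This is an illustrative CPU sketch of the idea (not OpenVisuMap's GPU code), shown here finding each point's nearest neighbor:

```python
import numpy as np

def nearest_neighbors_chunked(X, chunk=256):
    # O(N^2) work but only O(chunk * N) memory: each block of distance rows
    # is computed on the fly and discarded, never materializing the matrix
    n = X.shape[0]
    nn = np.empty(n, dtype=int)
    sq = (X ** 2).sum(axis=1)
    for start in range(0, n, chunk):
        block = X[start:start + chunk]
        # squared distances from this block to every point
        d2 = sq[start:start + chunk, None] - 2 * block @ X.T + sq[None, :]
        for i in range(block.shape[0]):
            d2[i, start + i] = np.inf  # exclude each point itself
        nn[start:start + chunk] = d2.argmin(axis=1)
    return nn
```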
I am looking to linearly combine features to be used by UMAP. Some of them are GCS coordinates and require a haversine treatment, while others can be compared using their Euclidean distance.
distance(v1, v2) = alpha * euclidean(f1_eucl, f2_eucl) + beta * haversine(f1_hav, f2_hav)
So far, I have tried:
Creating a custom distance matrix. The square matrix takes 70 GB using float64, 35 GB with float32. Using fastdist, I get a computation time of 7 min, which is quite slow compared to UMAP's 2-3 min, all included. This approach falls apart as soon as I try adding the euclidean and haversine matrices together (140 GB, which is massive compared to UMAP's 5 GB). I also tried chunking the computation using dask. The result is memory-friendly, but my session kept crashing, so I couldn't even tell how long it would have taken.
Using a custom callable to be ingested by UMAP. Thanks to jit compilation with numba, I get the results quite quickly and have no memory problems. The major problem is that UMAP appears to ignore my callable once the dataset reaches 4096 points: if I set the callable to return 0, UMAP still shows the patterns of the original dataset in the graphs. If somebody could explain what causes this, that'd be great.
In summary, how do you go about, practically speaking, implementing a custom metric in UMAP? And bonus question, do you think this is a sound approach? Thanks.
The custom metric in numba should work for more than 4096 points. That number is relevant because it is the stage at which UMAP cuts over to approximate nearest-neighbor search (which it passes off to pynndescent). Now, pynndescent does support custom metrics compiled with numba, so if it is somehow going astray, it is because the metric is not getting passed to pynndescent correctly. Still, I would have expected an error, not a silent fallback to euclidean distance.
This is my first question, so I will do my best to be as clear as possible.
If I can provide UMAP with a distance function that also outputs a gradient or some other relevant information, can I apply UMAP to non-traditional-looking data? (i.e., a dataset with points of inconsistent dimension, data points that are non-uniformly sized matrices, etc.) The closest I have gotten to finding something vaguely close to my question is in the documentation here (https://umap-learn.readthedocs.io/en/latest/embedding_space.html), but this seems to be more or less the opposite process, and as far as I can tell it still supposes you are starting with tuple-based data of uniform dimension.
I'm aware that one way around this is just to calculate a full pairwise distance matrix ahead of time and give that to UMAP, but from what I understand of the way UMAP is coded, it only performs a subset of all possible distance calculations, and is thus much faster for the same amount of data than if I were to take the full pre-calculation route.
I am working in python3, but if there is an implementation of UMAP dimension reduction in some other environment that permits this, I would be willing to make a detour in my workflow to obtain this greater flexibility with incoming data types.
Thank you.
Algorithmically this is quite possible, but in practice most implementations do not support anything other than fixed dimension vectors. If computing the all pairs distances is not tractable another option is to try to find a way to featurize or vectorize the data in a way that will allow for easy distance computations. This is, of course, not always possible. The final option is to implement things yourself, but this requires handling the nearest neighbour search, which is likely a non-trivial coding project in and of itself.
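As a toy illustration of the featurize-then-embed route, ragged inputs can be mapped to fixed-length summary vectors first. The four statistics chosen below are arbitrary stand-ins for a domain-appropriate featurization:

```python
import numpy as np

def featurize(seqs, dim=4):
    # hypothetical featurizer: summary statistics give every item the same
    # fixed dimension, whatever its original size or shape
    out = np.zeros((len(seqs), dim))
    for i, s in enumerate(seqs):
        a = np.asarray(s, dtype=float).ravel()
        out[i] = [a.mean(), a.std(), a.min(), a.max()]
    return out

ragged = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
X = featurize(ragged)  # shape (3, 4), ready for any fixed-dimension embedding
```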
I am trying to cluster a data set with about 1,100,000 observations, each with three values.
The code is pretty simple in R:
df11.dist <- dist(df11cl), where df11cl is a data frame with three columns and 1,100,000 rows, and all the values in this data frame are standardized.
The error I get is:
Error: cannot allocate vector of size 4439.0 Gb
Recommendations for similar problems include increasing RAM or chunking the data. I already have 64 GB of RAM and my virtual memory is 171 GB, so I don't think increasing RAM is a feasible solution. Also, as far as I know, chunking the data in hierarchical cluster analysis yields different results, so using a sample of the data seems out of the question.
I have also found this solution, but the answers there actually alter the question: they advise k-means instead. K-means could work if one knew the number of clusters beforehand, which I do not. That said, I ran k-means with different numbers of clusters, but now I don't know how to justify choosing one over another. Is there any test that can help?
Can you recommend anything in either R or python?
For trivial reasons, the function dist needs quadratic memory.
So if you have 1 million (10^6) points, a quadratic matrix needs 10^12 entries. With double precision, each entry needs 8 bytes. Exploiting symmetry, you only need to store half of the entries, but that is still 4*10^12 bytes, i.e. 4 terabytes just to store this matrix. Even if you stored it on SSD, or upgraded your system to 4 TB of RAM, computing all these distances would take an insane amount of time.
And 1 million is still pretty small, isn't it?
Using dist on big data is impossible. End of story.
For larger data sets, you'll need to
use methods such as k-means that do not use pairwise distances
use methods such as DBSCAN that do not need a distance matrix, and where in some cases an index can reduce the effort to O(n log n)
subsample your data to make it smaller
In particular that last thing is a good idea if you don't have a working solution yet. There is no use in struggling with scalability of a method that does not work.
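As an example of the first option, scikit-learn's MiniBatchKMeans handles a million three-dimensional points comfortably because it never forms pairwise distances; only each mini-batch is compared against the k centroids. The cluster count and batch size below are illustrative:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1_100_000, 3))  # same shape as the standardized data frame

# memory stays O(n): no pairwise distance matrix is ever built
km = MiniBatchKMeans(n_clusters=8, batch_size=10_000, n_init=3, random_state=0)
labels = km.fit_predict(X)
print(labels.shape)  # (1100000,)
```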
I am trying to investigate the distribution of maximum likelihood estimates, specifically for a large number of covariates p in a high-dimensional regime (meaning that p/n, with n the sample size, is about 1/5). I am generating the data and then using statsmodels.api.Logit to fit the parameters to my model.
The problem is, this only seems to work in a low-dimensional regime (like 300 covariates and 40,000 observations). Specifically, I hit the maximum number of iterations, the log-likelihood is inf, i.e. it has diverged, and I get a 'singular matrix' error.
I am not sure how to remedy this. Initially, when I was still working with smaller values (say 80 covariates, 4000 observations), and I got this error occasionally, I set a maximum of 70 iterations rather than 35. This seemed to help.
However, it clearly will not help now, because my log-likelihood function is diverging; it is not just a matter of non-convergence within the maximum number of iterations.
It would be easy to answer that these packages are simply not meant to handle such numbers, however there have been papers specifically investigating this high dimensional regime, say here where p=800 covariates and n=4000 observations are used.
Granted, that paper used R rather than Python. Unfortunately, I do not know R. However, I would think that Python's optimisation routines should be of comparable 'quality'?
My questions:
Might it be the case that R is better suited to handle data in this high p/n regime than python statsmodels? If so, why and can the techniques of R be used to modify the python statsmodels code?
How could I modify my code to work for numbers around p=800 and n=4000?
In the code you currently use (from several other questions), you implicitly use the Newton-Raphson method; this is the default for the sm.Logit model. It computes and inverts the Hessian matrix to speed up estimation, but that is incredibly costly for large matrices, and it often results in numerical instability when the matrix is near-singular, as you have already witnessed. This is briefly explained in the relevant Wikipedia entry.
You can work around this by using a different solver, e.g. bfgs (or lbfgs), like so:
model = sm.Logit(y, X)
result = model.fit(method='bfgs')
This runs perfectly well for me even with n = 10000, p = 2000.
Aside from estimation, and more problematically, your code for generating samples results in data that suffer from a large degree of quasi-separability, in which case the whole MLE approach is questionable at best. You should urgently look into this, as it suggests your data may not be as well-behaved as you might like them to be. Quasi-separability is very well explained here.
Currently, I am using a Python implementation of NMF. I'm thinking of ways to improve NMF, since it can become slow if you have a lot of documents. Since NMF works with matrix multiplications, I was thinking to maybe use GPUs (Graphics Processing Units). I found a solution that implements NMF on GPUs.
The question is: would it be a good solution to use NMF with GPU support in order to speed up performance of NMF? Or should I take a different approach?
Currently, the Alternating Nonnegative Least Squares with block principal pivoting is the fastest way to compute NMF.
You can find an implementation for Python here: https://github.com/kimjingu/nonnegfac-python
If you are sure that the GPU implementation uses one of the fastest methods, then go for it. Slower methods (e.g. Multiplicative Update) can be orders of magnitude slower, and it might not be worth using a GPU for them.
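Before reaching for a GPU, it is also worth checking which update rule your CPU implementation uses. In scikit-learn, for instance, the solver is a single parameter; the sizes below are illustrative:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(200, 50)))  # term-document-style nonnegative matrix

# 'cd' (coordinate descent) is typically much faster than 'mu' (multiplicative update)
model = NMF(n_components=10, solver='cd', init='nndsvda', random_state=0, max_iter=500)
W = model.fit_transform(X)
H = model.components_
print(W.shape, H.shape)  # (200, 10) (10, 50)
```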