Currently, I am using a Python implementation of NMF (non-negative matrix factorization). I'm looking for ways to speed it up, since NMF can become slow if you have a lot of documents. Since NMF works with matrix multiplications, I was thinking of using GPUs (Graphics Processing Units), and I found a solution that implements NMF on GPUs.
The question is: would using NMF with GPU support be a good way to speed up NMF's performance, or should I take a different approach?
Currently, Alternating Nonnegative Least Squares (ANLS) with block principal pivoting is the fastest way to compute NMF.
You can find an implementation for Python here: https://github.com/kimjingu/nonnegfac-python
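For reference, here is a minimal usage sketch of that library. The class name NMF_ANLS_BLOCKPIVOT and the run() signature follow the repository's README, so double-check them against the current source:

import numpy as np
from nonnegfac.nmf import NMF_ANLS_BLOCKPIVOT

A = np.random.rand(10000, 2000)  # e.g. a nonnegative document-term matrix
# rank-20 factorization via ANLS with block principal pivoting
# (check the README for the exact W/H orientation convention)
W, H, info = NMF_ANLS_BLOCKPIVOT().run(A, 20, max_iter=50)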
If you are sure that the GPU implementation is using one of the fastest methods, then go for it. Slower methods (e.g. Multiplicative Update) can be orders of magnitude slower, and then it might not be worth using a GPU.
I am trying to use the BK-tree data structure in python to store a corpus with ~10 billion entries (1e10) in order to implement a fast fuzzy search engine.
Once I add over ~10 million (1e7) values to a single BK-tree, I start to see a significant degradation in the performance of querying.
I was thinking of storing the corpus in a forest of a thousand BK-trees and querying them in parallel.
Does this idea sound feasible? Should I create and query 1,000 BK-trees simultaneously? What else can I do to use BK-trees with a corpus of this size?
I use pybktree.py and my queries are intended to find all entries within an edit distance d.
Is there some architecture or database which will allow me to store those trees?
Note: I don’t run out of memory, rather the tree begins to be inefficient (presumably each node has too many children).
FuzzyWuzzy
Since you mention that you use FuzzyWuzzy as your distance metric, I will concentrate on efficient ways to implement the fuzz.ratio algorithm used by FuzzyWuzzy. FuzzyWuzzy provides the following two implementations for fuzz.ratio:
difflib, which is completely implemented in Python
python-Levenshtein, which uses a weighted Levenshtein distance with weight 2 for substitutions (a substitution is counted as a deletion + an insertion). python-Levenshtein is implemented in C and is a lot faster than the pure Python implementation.
Implementation in python-Levenshtein
The python-Levenshtein implementation works in two steps:
It removes the common prefix and suffix of the two strings, since they have no influence on the end result. This can be done in linear time, so matching similar strings is very fast.
The Levenshtein distance between the trimmed strings is then computed with quadratic runtime and linear memory usage.
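A rough Python sketch of these two steps (the real python-Levenshtein code is C; this is only meant to illustrate the idea):

def weighted_levenshtein(s1, s2):
    # step 1: strip the common prefix and suffix in linear time
    pfx = 0
    while pfx < len(s1) and pfx < len(s2) and s1[pfx] == s2[pfx]:
        pfx += 1
    sfx = 0
    while (sfx < len(s1) - pfx and sfx < len(s2) - pfx
           and s1[len(s1) - 1 - sfx] == s2[len(s2) - 1 - sfx]):
        sfx += 1
    s1, s2 = s1[pfx:len(s1) - sfx], s2[pfx:len(s2) - sfx]
    # step 2: quadratic-time DP over the trimmed strings, keeping only one
    # row in memory; substitutions cost 2 (deletion + insertion)
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (0 if c1 == c2 else 2)))  # substitution
        prev = cur
    return prev[-1]

fuzz.ratio then normalizes this distance into a similarity, roughly 100 * (1 - dist / (len(s1) + len(s2))).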
RapidFuzz
I am the author of the library RapidFuzz, which implements the algorithms used by FuzzyWuzzy in a more performant way. RapidFuzz uses the following interface for fuzz.ratio:
def ratio(s1, s2, processor=None, score_cutoff=0)
The additional score_cutoff parameter can be used to provide a score threshold as a float between 0 and 100. For ratios below score_cutoff, 0 is returned instead. The implementation can use this threshold to select a more optimized algorithm in some cases. In the following I will describe the optimizations used by RapidFuzz depending on the input parameters; max distance refers to the maximum edit distance that is possible without the ratio dropping below the score threshold.
max distance == 0
The similarity can be calculated using a direct comparison,
since no difference between the strings is allowed. The time complexity of
this algorithm is O(N).
max distance == 1 and len(s1) == len(s2)
The similarity can be calculated using a direct comparison as well, since a substitution would cause an edit distance higher than max distance. The time complexity of this algorithm is O(N).
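Both fast paths boil down to a plain equality check, as in this illustrative sketch (the real implementation is C++; the function name is made up):

def fast_path_ratio(s1, s2):
    # with max distance 0 no edit is allowed; with max distance 1 and equal
    # lengths, an insertion/deletion would change the length and a
    # substitution costs 2, so in both cases only identical strings match
    return 100.0 if s1 == s2 else 0.0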
Remove common prefix/suffix
A common prefix/suffix of the two compared strings does not affect
the Levenshtein distance, so the affix is removed before calculating the similarity. This step is performed for any of the following algorithms.
max distance <= 4
The mbleven algorithm is used. This algorithm checks all the edit operations that are possible under the threshold max distance. A description of the original algorithm can be found here. I changed this algorithm to support the weight of 2 for substitutions. Unlike with the normal Levenshtein distance, this algorithm can be used up to a threshold of 4 here, since the higher weight of substitutions decreases the number of possible edit operations. The time complexity of this algorithm is O(N).
len(shorter string) <= 64 after removing common affix
The BitPAl algorithm is used, which calculates the Levenshtein distance in a bit-parallel way. The algorithm is described here and is extended with support for UTF-32 in this implementation. The time complexity of this algorithm is O(N).
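BitPAl itself (and the weight-2 adaptation) is fairly involved, but the closely related bit-parallel algorithm by Myers gives the flavor. The sketch below computes the plain unit-cost Levenshtein distance for patterns of up to 64 characters in O(N); it illustrates the bit-parallel idea and is not RapidFuzz's actual BitPAl code:

from collections import defaultdict

def myers_levenshtein(pattern, text):
    # Myers' 1999 bit-vector algorithm; requires len(pattern) <= 64
    m = len(pattern)
    if m == 0:
        return len(text)
    peq = defaultdict(int)            # per-character bitmask of positions
    for i, ch in enumerate(pattern):
        peq[ch] |= 1 << i
    mask = (1 << m) - 1
    top = 1 << (m - 1)
    vp, vn, dist = mask, 0, m         # vertical delta bit-vectors
    for ch in text:
        eq = peq[ch]
        d0 = (((eq & vp) + vp) ^ vp) | eq | vn
        hp = (vn | ~(d0 | vp)) & mask
        hn = vp & d0
        if hp & top:                  # the distance changes at the last row
            dist += 1
        elif hn & top:
            dist -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
    return dist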
Strings with a length > 64
The Levenshtein distance is calculated using
Wagner-Fischer with Ukkonen's optimization. The time complexity of this algorithm is O(N * M).
This could be replaced with a blockwise implementation of BitPAl in the future.
Improvements to processors
FuzzyWuzzy provides multiple processors like process.extractOne that are used to calculate the similarity between a query and multiple choices. Implementing this in C++ as well allows two more important optimizations:
when a scorer that is itself implemented in C++ is used, we can call the C++ implementation of the scorer directly and do not have to go back and forth between Python and C++, which provides a massive speedup
We can preprocess the query depending on the scorer that is used. For example, when fuzz.ratio is used as the scorer, the query only has to be packed into the 64-bit blocks used by BitPAl once, which saves around 50% of the runtime when calculating the Levenshtein distance
So far only extractOne and extract_iter are implemented in C++, while extract, which you would use, is still implemented in Python and uses extract_iter. So it can already use the second optimization, but it still has to switch a lot between Python and C++, which is not optimal (this will probably be added in v1.0.0 as well).
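For completeness, this is what routing the work through a processor looks like from Python (the exact return shape of extractOne has varied between RapidFuzz versions, so treat the comment below as illustrative):

from rapidfuzz import fuzz, process

choices = ["New York", "Newark", "New Orleans", "York"]
# the whole loop over the choices runs in C++; only the final result
# crosses back into Python
best = process.extractOne("new yrok", choices, scorer=fuzz.ratio, score_cutoff=70)
print(best)  # best match and its score, or None if nothing reaches the cutoff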
Benchmarks
I performed benchmarks for extractOne and the individual scorers during development that show the performance difference between RapidFuzz and FuzzyWuzzy. Keep in mind that the performance for your case (all strings of length 20) is probably not as good, since many of the strings in the dataset used are very short.
The data used (for reproducibility):
words.txt (a dataset of 99,171 words)
The hardware the graphed benchmarks were run on:
CPU: a single core of an i7-8550U
RAM: 8 GB
OS: Fedora 32
Benchmark Scorers
The code for this benchmark can be found here
Benchmark extractOne
For this benchmark the code of process.extractOne is slightly changed to remove the score_cutoff parameter. This is done because in extractOne the score_cutoff is increased whenever a better match is found (and it exits once it finds a perfect match). In the future it would make more sense to benchmark process.extract, which does not have this behavior (the benchmark is performed using process.extractOne, since process.extract is not fully implemented in C++ yet). The benchmark code can be found here
This shows that, when possible, the scorers should not be used directly but through the processors, which can perform a lot more optimizations.
Alternative
As an alternative you could use a C++ implementation. The library RapidFuzz is available for C++ here. The implementation in C++ is relatively simple as well:
#include <string>
#include <vector>
#include <rapidfuzz/fuzz.hpp>

// load() (definition omitted) reads the words into a vector
std::vector<std::string> choices = load("words.txt");
std::string query = choices[0];
std::vector<double> results;
results.reserve(choices.size());
// CachedRatio preprocesses the query once and reuses it for every choice
rapidfuzz::fuzz::CachedRatio<decltype(query)> scorer(query);
for (const auto& choice : choices)
{
    results.push_back(scorer.ratio(choice));
}
or in parallel using OpenMP:
#include <cstddef>
#include <string>
#include <vector>
#include <rapidfuzz/fuzz.hpp>

// load() (definition omitted) reads the words into a vector
std::vector<std::string> choices = load("words.txt");
std::string query = choices[0];
// pre-size the vector: concurrent push_back would be a data race
std::vector<double> results(choices.size());
rapidfuzz::fuzz::CachedRatio<decltype(query)> scorer(query);
#pragma omp parallel for
for (std::size_t i = 0; i < choices.size(); ++i)
{
    results[i] = scorer.ratio(choices[i]);
}
On my machine (see benchmarks above) this evaluates 43 million words/sec, or 123 million words/sec in the parallel version. This is around 1.5 times as fast as the Python implementation (the difference is due to conversions between Python and C++ types). However, the main advantage of the C++ version is that you are relatively free to combine algorithms whichever way you want, while in the Python version you're forced to use the process functions that are implemented in C++ to achieve good performance.
A few thoughts
BK-trees
Kudos to Ben Hoyt and his link to the issue, which I will draw from. That being said, the first observation from the mentioned issue is that the BK-tree isn't exactly logarithmic. From what you told us, your usual d is ~6, which is 3/10 of your string length. Unfortunately, that means that if we look at the tables from the issue, you get a complexity somewhere between O(N^0.8) and O(N). In the optimistic case of the exponent being 0.8 (it will likely be slightly worse), you get an improvement factor of ~100 on your 10B entries. So if you have a reasonably fast implementation of BK-trees, it can still be worth it to use them, or to use them as a basis for further optimization.
The downside of this is that even if you use 1,000 trees in parallel, you will only get the improvement from the parallelization, as the performance of the trees depends on d rather than on the number of nodes within each tree. However, even if you run all 1,000 trees at once on a massive machine, we are at the ~10M nodes/tree which you reported as slow. Still, computation-wise, this seems doable.
A brute force approach
If you don't mind paying a little, I would look into something like Google Cloud BigQuery, if that doesn't clash with some kind of data confidentiality. They will brute-force the solution for you, for a fee. The current rate is $5 per TB scanned by a query. Your dataset is ~10B rows × 20 chars; taking one byte per char, one query would scan about 200 GB, so ~$1 per query if you went the lazy way.
However, since the charge is per byte of data in a column and not per complexity of the question, you could improve on this by storing your strings as bits: at 2 bits per letter, this would save you 75% of the expenses.
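A small packing sketch, assuming the four-letter alphabet that a 2-bit encoding implies (the concrete mapping is made up):

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(s):
    # 2 bits per character: a 20-char string becomes 5 bytes instead of 20
    value = 0
    for ch in s:
        value = (value << 2) | CODE[ch]
    return value.to_bytes((2 * len(s) + 7) // 8, "big")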
Improving further, you can write your query in such a way that it asks for a dozen strings at once. You might need to be a bit careful to use a batch of similar strings for the purpose of the query, though, to avoid clogging the result with too many one-offs.
Brute forcing of the BK-trees
If you go the route above, you will pay in proportion to the data volume, so the ~100-fold decrease in the computations needed becomes a ~100-fold decrease in price, which might be useful, especially if you have a lot of queries to run.
However, you would need to figure out a way to store this tree in several layers of tables to query recursively, as BigQuery pricing depends on the volume of data in the queried table.
Building a smart batch engine for recursive processing of the queries to minimize the costs could be a fun optimization exercise.
A choice of language
One more thing. While I think that Python is a good language for fast prototyping, analysis and thinking about code in general, you are past that stage. You are currently looking for a way to run a specific, well-defined and well-thought-out operation as fast as possible. Python is not a great language for this, as this example shows. While I used all the tricks I could think of in Python, the Java and C solutions were still several times faster. (Not to mention the Rust one that beat us all, but it beat us by algorithm as well, so it's hard to compare.) So if you go from Python to a faster language, you might gain another factor of ten, or maybe even more, in performance. This could be another fun optimization exercise.
Note: I am being rather conservative with the estimate, as FuzzyWuzzy already offers to use a C library in the background, so I'm not too sure about how much of the work still depends on Python. My experience in similar cases is that the performance gain can be a factor of 100 from pure Python (or worse, pure R) to a compiled language.
Quite late to the party, but here is a possible solution which
I would implement if I were in your situation:
Save the dataset as a text file, and put that file on a very
fast disk region (preferably on tmpfs).
Prepare a beefy computer with many physical CPU cores (such
as a Threadripper 3990X, which has 64 cores).
Use this implementation and GNU parallel to grok the dataset.
Here is a bit of technical info behind this solution:
The optimized version of Myers' algorithm (linked above) can
process about 14 million entries per sec on a single CPU core.
If you can fully utilize all 64 physical cores, you can
achieve a throughput of 896 million entries per sec (= 14M * 64 cores).
At this speed, you can perform a single query over the
10-billion-entry dataset in about 12 seconds on a single machine.
I posted a more detailed analysis in this article.
As shown in the article, I could perform a query against a dataset of 100 million records
in 1.04 s on my cheap desktop machine.
By using a more performant CPU (or splitting the task between
multiple computers), I believe you can achieve the desired result.
Hope this helps.
I was wondering if there is a way to calculate the first few eigenvectors of a very large sparse matrix in TensorFlow, hoping that it might be faster than SciPy's implementation of ARPACK, which doesn't seem to support parallel computing, at least as far as I have noticed.
I believe you should rather look into petsc4py and slepc4py.
They are Python bindings of PETSc (Portable, Extensible Toolkit for
Scientific Computation) and SLEPc (Scalable Library for Eigenvalue Problem Computations).
PETSc and SLEPc support MPI, and therefore petsc4py and slepc4py do too.
I believe you will find useful examples in the demos that ship with slepc4py.
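A minimal sketch of what this looks like with slepc4py, written in the style of its bundled demos (treat the exact calls and options as assumptions to verify against the documentation):

import sys
import slepc4py
slepc4py.init(sys.argv)               # initialize SLEPc/PETSc
from petsc4py import PETSc
from slepc4py import SLEPc

n = 10000
A = PETSc.Mat().createAIJ([n, n])     # sparse (AIJ) matrix, distributed over MPI ranks
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):         # fill the locally owned rows (a 1D Laplacian here)
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()

eps = SLEPc.EPS().create()            # eigenproblem solver
eps.setOperators(A)
eps.setProblemType(SLEPc.EPS.ProblemType.HEP)  # Hermitian (symmetric) problem
eps.setDimensions(nev=5)              # ask for the first few eigenpairs
eps.solve()
for k in range(eps.getConverged()):
    print(eps.getEigenvalue(k).real)

Run it with e.g. mpiexec -n 4 python eigs.py to spread the work over four MPI ranks.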
I'm trying to analyze text, but my Mac's RAM is only 8 GB, and the RidgeRegressor just stops after a while with Killed: 9. I reckon this is because it would need more memory.
Is there a way to disable the stack size limiter so that the algorithm could use some kind of swap memory?
You will need to do it manually.
There are probably two different core-problems here:
A: holding your training-data
B: training the regressor
For A, you can try numpy's memmap which abstracts swapping away.
As an alternative, consider preparing your data to HDF5 or some DB. For HDF5, you can use h5py or pytables, both allowing numpy-like usage.
For B: it's a good idea to use some out-of-core ready algorithm. In scikit-learn those are the ones supporting partial_fit.
Keep in mind that this training process decomposes into at least two new requirements:
Memory efficiency: swapping is slow; you don't want to use something which holds O(N^2) auxiliary memory during learning
Efficient convergence
Those algorithms in the link above should be okay for both.
SGDRegressor can be parameterized to resemble RidgeRegression.
Also: you may need to use partial_fit manually, obeying the rules of the algorithm (often some kind of random ordering is needed for the convergence proofs). The problem with abstracting away swapping is: if your regressor does a permutation in each epoch, without knowing how costly that is, you might be in trouble!
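A minimal out-of-core sketch combining A and B (filenames, shapes and hyperparameters are placeholders; older sklearn versions spell the loss "squared_loss"):

import numpy as np
from sklearn.linear_model import SGDRegressor

n_samples, n_features = 1_000_000, 100
# A: the training data lives on disk; the OS pages chunks in on demand
X = np.memmap("X.dat", dtype=np.float32, mode="r", shape=(n_samples, n_features))
y = np.memmap("y.dat", dtype=np.float32, mode="r", shape=(n_samples,))

# B: squared loss + l2 penalty makes SGDRegressor resemble ridge regression
reg = SGDRegressor(loss="squared_error", penalty="l2", alpha=1e-4)

chunk = 10_000
for epoch in range(5):
    # visit the chunks in a fresh random order each epoch (a cheap surrogate
    # for the per-sample shuffling that convergence proofs usually assume)
    for block in np.random.permutation(n_samples // chunk):
        sl = slice(block * chunk, (block + 1) * chunk)
        reg.partial_fit(X[sl], y[sl])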
Because the problem itself is quite hard, there are some specialized libraries built for this, while sklearn needs some more manual work, as explained. One of the most extreme ones (a lot of crazy tricks) might be vowpal_wabbit (where IO is often the bottleneck!). Of course there are other popular libs like pyspark, serving a slightly different purpose (distributed computing).
t-SNE can supposedly scale to millions of observations (see here), but I'm curious how that can be true, at least in the Sklearn implementation.
I'm trying it on a dataset with ~100k items, each with ~190 features. Now, I'm aware that I can do a first pass of dimensionality reduction with, e.g. PCA, but the problem seems more fundamental.
t-SNE computes and stores the full, dense similarity matrix calculated for the input observations (I've confirmed this by looking at the source). In my case, this is a 10-billion-element dense matrix, which by itself requires 80+ GB of memory. Extrapolate this to just one million observations, and you're looking at 8 terabytes of RAM just to store the distance matrix (let alone computation time...).
So, how can we possibly scale t-SNE to millions of datapoints in the sklearn implementation? Am I missing something? The sklearn docs at least imply that it's possible:
By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples.
That's my emphasis, but I would certainly read that as implying that the Barnes-Hut method can scale to millions of examples. I'll reiterate, though, that the code requires calculating the full distance matrix well before we even get to any of the actual t-SNE transformations (with or without Barnes-Hut).
So am I missing something? Is it possible to scale this up to millions of datapoints?
Barnes-Hut does NOT require you to compute and store the full, dense similarity matrix for the input observations.
Also, take a look at the references mentioned at the documentation. In particular, this one. Quoting that page:
The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets. We applied it on data sets with up to 30 million examples.
That page also links to this talk about how the approximation works: Visualizing Data Using t-SNE.
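With a recent scikit-learn you simply rely on the default Barnes-Hut path; a minimal sketch (whether a dense matrix is ever materialized depends on your sklearn version, which is the crux of the question above):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).rand(10_000, 190).astype(np.float32)
# method="barnes_hut" (the default) approximates the gradient in O(N log N)
emb = TSNE(method="barnes_hut", angle=0.5, init="pca").fit_transform(X)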
I recommend using another algorithm called UMAP. It performs at least as well as t-SNE and in most cases it performs better. Most importantly, it scales significantly better. Its approach to the problem is similar, so it generates similar results, but UMAP is a lot faster (look at the last graph here: https://umap-learn.readthedocs.io/en/latest/benchmarking.html). You can look at the original paper and the following links for details.
https://www.nature.com/articles/nbt.4314.pdf
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
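For reference, umap-learn exposes the usual scikit-learn-style API, so trying it is only a few lines (the parameter values below are just the common defaults):

import numpy as np
import umap

X = np.random.RandomState(0).rand(100_000, 190).astype(np.float32)
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)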
OpenVisuMap (on GitHub) implements t-SNE without resorting to approximation. It uses the GPU to calculate the distance matrix on the fly. It still has O(N^2) computational complexity, but only O(N) memory complexity.
Is anyone aware of an implemented version (perhaps using scipy/numpy) of parallel exact matrix diagonalization (equivalently, finding the eigensystem)? If it helps, my matrices are symmetric and sparse. I would hate to spend a day reinventing the wheel.
EDIT:
My matrices are at least 10,000x10,000 (but, preferably, at least 20 times larger). For now, I only have access to a 4-core Intel machine (with hyperthreading, so 2 hardware threads per core), ~3.0 GHz each, with 12 GB of RAM. I may later have access to a 128-core node with ~3.6 GHz/core and 256 GB of RAM, so single machine/multiple cores should do it (for my other parallel tasks I have been using multiprocessing). I would prefer the algorithms to scale well.
I do need exact diagonalization, so the scipy.sparse routines are no good for me (tried them, didn't work well). I have been using numpy.linalg.eigh (where I see only a single core doing all the computations).
Alternatively (to the original question): is there an online resource where I can find out more about compiling SciPy so as to ensure parallel execution?
For symmetric sparse matrix eigenvalue/eigenvector finding, you may use scipy.sparse.linalg.eigsh. It uses ARPACK behind the scenes, and there are parallel ARPACK implementations. AFAIK, SciPy can be compiled against one of those if your installation currently uses the serial version.
However, this is not a good answer if you need all eigenvalues and eigenvectors of the matrix, as the sparse version uses the Lanczos algorithm.
If your matrix is not overwhelmingly large, then just use numpy.linalg.eigh. It uses LAPACK or BLAS and may use parallel code internally.
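A short sketch contrasting the two routes (matrix size and density are arbitrary):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 10_000
A = sp.random(n, n, density=1e-4, format="csr", random_state=0)
A = (A + A.T) * 0.5                    # symmetrize
# ARPACK/Lanczos: returns only k extremal eigenpairs, not the full spectrum
vals, vecs = eigsh(A, k=6, which="LM")

# full exact diagonalization goes through LAPACK instead (dense, so it needs
# the whole matrix in memory):
# vals, vecs = np.linalg.eigh(A.toarray())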
If you end up rolling your own, please note that SciPy/NumPy does all the heavy lifting with different highly optimized linear algebra packages, not in pure Python. Due to this, the performance and degree of parallelism depend heavily on the libraries your SciPy/NumPy installation is compiled with.
(Your question does not reveal if you just want to have parallel code running on several processors, or on several computers. Also, the size of your matrix has a big impact on the best method. So, this answer may be completely off-the-mark.)