I want to perform an SVD on a big array M[159459,159459].
Since the SVD computation depends on the input matrix of shape (159459, 159459), that approach does not address my goal.
I have tried to use:
scipy.linalg.svd
scipy.linalg.svd(check_finite=False)
change the driver to lapack_driver='gesvd'
numpy.linalg.svd
However, I always get a MemoryError. Finally, I want to compute the full SVD, because I want to perform a Procrustes analysis, i.e. if M is the matrix that I have now, I need M = USV'
import numpy as np
from scipy import linalg
#M = np.load("M.npy")
M = np.random.rand(159459,159459)
U, s, Vh = linalg.svd(M, check_finite=False, lapack_driver='gesvd')
Everything fails.
My system details:
$ cat /proc/meminfo
MemTotal: 527842404 kB
MemFree: 523406068 kB
MemAvailable: 521659112 kB
Memory size matters, latency costs will hurt you next:
Given mM.shape == [159459, 159459] and mM.dtype defaulting to float64, there will be a need for about:
203.42 [GB] for the original mM[159459, 159459], plus
203.42 [GB] for the computed mU[159459, 159459], plus
203.42 [GB] for the computed Vh[159459, 159459], plus
0.0013 [GB] for the computed vS[159459]
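A quick back-of-the-envelope check of those figures, assuming nothing beyond the default float64 itemsize of 8 bytes:

>>> 159459 * 159459 * 8 / 1E9   # [GB] for one float64 [159459, 159459] array
203.417381448
>>> 159459 * 8 / 1E9            # [GB] for the float64 vector vS[159459]
0.001275672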
The cheapest ever step, trying a linear-only downscaling by a factor of 2 (and not more than 4) by going from float64 to float32 or even float16, is not the game changer, and it is even heavily penalised by numpy inefficiencies (if not by internal back-conversions up to float64 again). My own attempts were so bleeding on this that I share the resulting dissatisfaction here, so as to save you from repeating my error of starting with the lowest-hanging fruit first ...
In case your analysis may work with just the vector vS, the .svd( ..., compute_uv = False, ... ) flag alone will avoid making space for about ~ 1/2 [TB] of RAM allocations, by not returning (and thus not reserving space for) the instances of mU and Vh.
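A minimal sketch of that mode, continuing the snippet from the question (still a heavyweight run at these shapes, so treat it as an illustration only):

vS = linalg.svd(M, compute_uv=False, check_finite=False, lapack_driver='gesvd')   # returns the singular values only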
Even such a case does not mean your SLOC will survive as-is on the just-about-reported 0.5 TB RAM system. The scipy.linalg.svd() processing will allocate internal working resources that are outside of your scope of coding (sure, unless you re-factor and re-design the scipy.linalg module on your own, which is fair to consider rather improbable) and outside of your configuration control. So, be warned that even when you test the compute_uv = False mode of processing, the .svd() may still throw an error if it fails to allocate the internally used data structures that do not fit the current RAM.
This also means that even using numpy.memmap(), which may be a successful trick to off-load the in-RAM representation of the original mM (avoiding a remarkable part of the first needed 203.4 [GB] from sitting in and blocking the usage of the host's RAM), comes with costs you will have to pay for using this trick.
My experiments, at smaller scales of the .memmap-s used for matrix processing and in ML optimisations, yield about 1E4 ~ 1E6 times slower processing because, in spite of the smart caching, the numpy.memmap() instances are dependent on the disk I/O.
The best results will come from using advanced, TB-sized, SSD-only storage devices, hosted right on the computing device over some fast, low-latency access bus (M.2 or PCIe x16).
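For completeness, a minimal sketch of the numpy.memmap() off-loading trick mentioned above, assuming the matrix already exists on disk as M.npy (the file name is illustrative); expect the disk-I/O-bound slow-downs just described:

import numpy as np
from scipy import linalg

mM = np.load("M.npy", mmap_mode="r")                        # read-only, on-disk memory-map instead of a ~203 [GB] in-RAM copy
vS = linalg.svd(mM, compute_uv=False, check_finite=False)   # singular values only, no mU / Vh allocations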
The last piece of experience, one might not yet want to hear here:
Using larger host-based RAM, which means using a multi-TB computing device, is the safest way to go further. Testing the above proposed steps will help, if reduced performance and additional expenses are within your Project's budget. If not, go for the use of an HPC centre at your Alma Mater or at your Project's closest research centre, where such multi-TB computing devices are in common use.
Related
I have a rather large rectangular (>1G rows, 1K columns) Fortran-style NumPy matrix, which I want to transpose to C-style.
So far, my approach has been relatively trivial with the following Rust snippet, using memory-mapped (mmapped) slices of the source and destination matrices, where both original_matrix and target_matrix are mmapped PyArray2 instances, with Rayon handling the parallelization.
Since the target_matrix has to be modified by multiple threads, I wrap it in an UnsafeCell.
// Wrap the target in an UnsafeCell so multiple Rayon workers may write to it.
let shared_target_matrix = std::cell::UnsafeCell::new(target_matrix);
// Each chunk is one Fortran-order column of the source; j is its column index.
original_matrix.as_ref().par_chunks(number_of_nodes).enumerate().for_each(|(j, feature)| {
    feature.iter().copied().enumerate().for_each(|(i, feature_value)| unsafe {
        // Scatter the column into the C-order target: one strided write per row.
        *(*shared_target_matrix.get()).uget_mut([i, j]) = feature_value;
    });
});
This approach transposes a matrix with shape (~1G, 100), ~120GB, in ~3 hours on an HDD. Transposing a (~1G, 1000), ~1200GB matrix does not scale linearly to 30 hours, as one might naively expect, but explodes to several weeks. As it stands, I have managed to transpose roughly 100 features in 2 days, and it keeps slowing down.
There are several aspects, such as the employed file system, the HDD fragmentation, and how MMAPed handles page loading, which my solution is currently ignoring.
Are there known, more holistic solutions that take into account these issues?
Note on sequential and parallel approaches
While intuitively this sort of operation should likely be limited only by IO, and therefore not benefit from any parallelization, we have observed experimentally that the parallel approach is indeed around three times faster (on a machine with 12 cores and 24 threads) than a sequential approach when transposing a matrix with shape (1G, 100). We are not sure why this is the case.
Note on using two HDDs
We also experimented with using two devices, one providing the Fortran-style matrix and a second one to which we write the target matrix. Both HDDs were connected through SATA cables directly to the computer's motherboard. We expected at least a doubling of the performance, but it remained unchanged.
While intuitively this sort of operation should likely be limited only by IO, and therefore not benefit from any parallelization, we have observed experimentally that the parallel approach is indeed around three times faster
This may be due to poor IO queue utilization. With an entirely sequential workload without prefetching you'll be alternating the device between working and idle. If you keep multiple operations in flight it'll be working all the time.
Check with iostat -x <interval>
But parallelism is a suboptimal way to achieve the best utilization of an HDD, because it'll likely cause more head-seeks than necessary.
We also experimented with using two devices, one providing the Fortran-style matrix and a second one to which we write the target matrix. Both HDDs were connected through SATA cables directly to the computer's motherboard. We expected at least a doubling of the performance, but it remained unchanged.
This may be due to the operating system's write cache which means it can batch writes very efficiently and you're mostly bottlenecked on reads. Again, check with iostat.
There are several aspects, such as the employed file system, the HDD fragmentation, and how MMAPed handles page loading, which my solution is currently ignoring.
Are there known, more holistic solutions that take into account these issues?
Yes, if the underlying filesystem supports it, you can use FIEMAP to get the physical layout of the data on disk and then optimize your read order to follow the physical layout rather than the logical layout. You can use the filefrag CLI tool to inspect the fragmentation data manually, but there are Rust bindings for that ioctl, so you can use it programmatically too.
Additionally you can use madvise(MADV_WILLNEED) to inform the kernel to prefetch data in the background for the next few loop iterations. For HDDs this should be ideally done in batches worth a few megabytes at a time. And the next batch should be issued when you're half-way through the current one.
Issuing them in batches minimizes syscall overhead and starting the next one half-way through ensures there's enough time left to actually complete the IO before you reach the end of the current one.
And since you'll be manually issuing prefetches in physical instead of logical order, you can also disable the default readahead heuristics (which would be getting in the way) via madvise(MADV_RANDOM).
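A rough sketch of that batched-prefetch pattern, shown here in Python (mmap.madvise needs CPython 3.8+ on Linux); the file name and the 8 MB batch size are illustrative assumptions, and for simplicity the next prefetch is issued at the start of each batch rather than half-way through. A Rust version with the memmap and libc crates would follow the same shape:

import mmap

BATCH = 8 * 1024 * 1024                                # prefetch a few megabytes at a time
with open("matrix.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    mm.madvise(mmap.MADV_RANDOM)                       # turn off the default readahead heuristics
    offset = 0
    while offset < len(mm):
        nxt = offset + BATCH
        if nxt < len(mm):                              # kick off background IO for the *next* batch
            mm.madvise(mmap.MADV_WILLNEED, nxt, min(BATCH, len(mm) - nxt))
        # ... process mm[offset:offset + BATCH] here, ideally in physical-extent order ...
        offset = nxt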
If you have enough free diskspace you could also try a simpler approach: defragmenting the file before operating on it. But even then you should still use madvise to ensure that there always are IO requests in flight.
I routinely need to populate matrices A[i,j] by evaluating a function between pairs of vectors; as the computation of every (i,j) pair is independent of the others, I want to parallelize this:
A = np.zeros((n, n))
for i in range(n):
    for j in range(i+1, n):
        A[i,j] = function(X[i], X[j])
How could this computation be elegantly parallelized via joblib or another widely used library?
Q : "How this computation could be elegantly parallelized via joblib or other widely used library?"
If using joblib, the main python interpreter will spawn other, GIL-independent copies of itself (yes, a huge memory-I/O to copy the whole python interpreter state, including all data structures, on O/S Windows; a somewhat less horrible initial latency hit on linux-type O/S), yet the worse is only to come: any "remote" modification of the spawned/distributed replicas of the original data has to somehow make it back to the main python process (yes, huge memory-I/O plus cache-(de)coherency hardware workloads, plus per-core L1-data cache efficiency almost for sure devastated).
So this trick does not easily pay for its own add-on costs, unless the function() computation is indeed many times above the costs of process instantiation plus process-to-process data interchange (SER/DES on the way "there" (one can imagine the pickle.dumps() memory allocation plus pickling compression/decompression costs), plus SER/DES on the way "back", plus the actual p2p-communication latencies (costs) to move the pickled data elements).
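For reference, a minimal joblib sketch of the loop above (assuming function, X and n are defined as in the question); it dispatches one row per task so that each worker has enough work to amortise the spawn and SER/DES costs just described:

from joblib import Parallel, delayed
import numpy as np

def upper_row(i):
    # compute the strict upper-triangular part of row i
    return [function(X[i], X[j]) for j in range(i + 1, n)]

rows = Parallel(n_jobs=-1)(delayed(upper_row)(i) for i in range(n))

A = np.zeros((n, n))
for i, values in enumerate(rows):
    A[i, i + 1:] = values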
Is There Any Better Way Forward?
We have all surely heard about numpy and smart numpy-vectorised processing. Many thousands of man*years of top-level HPC experience were put into numpy's smart, vectorised data-I/O processing.
So in most cases, if you try to redesign a function( scalarA, scalarB ) returning a single scalarResult, stored into an externally 2D-looped A[i,j], into an in-place modifying function( vectorX_data, matrixA_results ), and let the code inside it do both the i,j-looping over the actual matrixA_results.shape[0] and the actual computing, the results may get astonishingly faster, if the numpy code can harness the smart CPU vector instructions, which pay less than 0.5 [ns] L1-data access latency, compared to as much as 300 ~ 380 [ns] RAM access latency (if the memory-I/O channel were free and permitting an unenqueued data transfer from the slow and far RAM memory), not to mention the somewhat latency-masked 10.000.000+ [ns] access costs of the numpy.memmap() file-based data proxy.
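As an illustration only, here is a minimal vectorised sketch that fills the whole upper triangle of A in one numpy pass, assuming (purely for the example) that function( X[i], X[j] ) is a squared Euclidean distance; any other kernel needs its own vectorised formulation:

import numpy as np

n, d = 1000, 16
X = np.random.rand(n, d)

diff = X[:, None, :] - X[None, :, :]        # all pairwise differences, shape (n, n, d)
A = np.einsum('ijk,ijk->ij', diff, diff)    # all squared distances in one vectorised call
A = np.triu(A, k=1)                         # keep only the j > i entries, as in the original loop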
If one has never visited the domain of numpy-tricks with smart-vectorised processing, do not hesitate to read as many posts as possible from a true master of this domain, guru #Divakar - all respect to them!
I have some large (N-by-N for N=10,000 to N=36,000,000) sparse, complex, Hermitian matrices, typically nonsingular, for which I have a spectral slicing problem. Specifically I need to know the exact number of positive eigenvalues.
I require a sparse LDLT decomposition -- is there one? Ideally, it will be a multifrontal algorithm and so well parallelized, and have the option of computing only D and not the upper triangular or permutation matrices.
I currently use ldl() in Matlab. This only works for real matrices so I need to create a larger real matrix. Also, it always computes L as well as D. I need a better algorithm to fit with 64GB RAM. I am hoping Python will be more customizable. (If so, I will learn Python.) I should add: I can get 64GB RAM per node, and can get 6 nodes. Even with a single machine with 64GB RAM, I would like to stop wasting RAM storing L only to delete it.
Perhaps someone wrote a Python front-end for MUMPS (MUltifrontal Massively Parallel Solver)?
I would have use for a non-parallel Python version of LDLT, as a lot of my research involves many matrices of moderate size.
I need a better algorithm to fit with 64GB RAM. I am hoping Python will be more customizable. (If so, I will learn Python.)
If this were ever possible:
>>> ( 2. * 8 ) * 10E3 ** 2 / 1E12   # [TB] for a [10k,10k] complex128 matrix ( 2 x 8 [B] per element )
0.0016                              # ~  1.6 [GB] -- fits easily, as would a [20k,20k] float64 ( ~3.2 [GB] )
>>> ( 2. * 8 ) * 63E3 ** 2 / 1E12   # [TB] for a [63k,63k] complex128 matrix
0.063504                            # ~ 63.5 [GB] -- about the largest that still fits in a 64 GB RAM
>>> ( 2. * 8 ) * 36E6 ** 2 / 1E12   # [TB] for the [36M,36M] one
20736.0                             # ~ 21 [PB] of data ... Houston, we have a problem
                                    # ( a [2M7,2M7] case has already taken ~ a month on an HPC cluster )
The research needs are clear, yet there is no such language (be it Matlab, Python, assembler, Julia or LISP) that can ever store 21 petabytes of data in a space of just 64 gigabytes of physical RAM, so as to make the eigenvalue computations of a complex matrix of such a scale possible, let alone as fast as possible. By this I also mean that "off-loading" data from in-RAM computation into any form of out-of-RAM storage is so prohibitively expensive (about ~ 1E2 ~ 1E5 orders of magnitude slower) that any such computational process would take ages just to "read through" the 21 PB of element data for the first time.
If your research has funding or sponsorship for using some rather specific computing infrastructure, there might be some tricks to process such heaps of numbers, yet do not expect to receive 21 PB of worms (data) put into a 64 GB large (well, rather small) can "for free".
You may enjoy Python for many other reasons and/or motivations, but not for any cheaper-yet-faster HPC-grade parallel computing, nor for easily processing 21 PB of data inside a 64 GB device, nor for any kind of annihilation of the principal and immense [TIME]-domain add-on costs of sparse-matrix manipulations, which become visible only during their use in computations. Having made some xTB sparse-matrix processing feasible so as to yield results in less than 1E2 [min] instead of 2E3, I dare say I know how immensely hard it is to increase the [PSPACE]-data scaling and to shorten the often [EXPTIME] processing durations at the same time ... the true Hell-corner of computational complexity ... where the sparse-matrix representation often creates even more headaches (again, both in [SPACE] and, worse, in [TIME], as new types of penalties appear) than it helps by delivering at least some potential [PSPACE]-savings.
Given the scope of the parameters, I may safely bet that even the algorithmic part will not help, and that even the promises of Quantum-Computing devices will, within our lifetime expectancy, remain unable to augment such a vast parameter space into the QC annealer-based, unstructured, quantum-driven minimiser (processing) for any reasonably short sequence of parameter-block translations into a (physically limited size) qubit-field problem-augmentation process, as is currently in use by the QC community, thanks to LLNL et al research innovations.
Sorry, no such magic seems to be anywhere near.
Using the available python front-ends for MUMPS does not change the HPC nature of the game, but if you wish to use one, yes, there are several available.
The efficient HPC-grade number-crunching at scale is still the root cause of the problem: the product of [ processing-time ] x [ (whatever) data-representation's efficient storage and retrieval ].
Hope you will get and enjoy the right mix of the comfort (which pythonic users are keen to stay in) and the HPC-grade performance (of whatever backend type) you wish to have.
I'm trying to compute the matrix product Y=XX^T for a matrix X of size 10,000 * 800,000. The matrix X is stored on-disk in an h5py file. The resulting Y should be a 10,000*10,000 matrix stored in the same h5py file. Here is a reproducible sample code.
import dask.array as da
from blaze import into
into("h5py:///tmp/dummy::/X", da.ones((10**4, 8*10**5), chunks=(10**4,10**4)))
x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**4,10**4))
y = x.dot(x.T)
into("h5py:///tmp/dummy::/Y", y)
I expected this computation to go smoothly as each (10,000*10,000) chunk should be individually transposed, followed by a dot product and then summed up to the final result. However, running this computation fills both my RAM and swap memory until the process eventually gets killed.
Here is a sample of the computation graph plotted with dot_graph:
(figure: computation graph sample)
According to the scheduling doc at http://dask.pydata.org/en/latest/scheduling-policy.html,
I would expect the upper tensordot intermediary results to be summed up one by one into the last sum result as soon as they have been individually computed. This would free the memory of these tensordot intermediary results, so that we would not face memory errors.
Playing around with a smaller toy example:
from dask.diagnostics import Profiler, CacheProfiler, ResourceProfiler
# Experiment on a (1,000 * 5,000) slice of the matrix X, split into 500 chunks of size (1,000 * 10)
x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**3, 10))[:10**3, :5000]
y = x.T.dot(x)
with Profiler() as prof, CacheProfiler() as cprof, ResourceProfiler() as rprof:
    into("h5py:///tmp/dummy::/Y", y)
rprof.visualize()
I get the following display:
(figure: resource profiler output)
Here the green bar represents the sum operation, while the yellow and purple bars represent the get_array and tensordot operations respectively. This seems to indicate that the sum operation waits for all intermediary tensordot operations to be performed before summing them. This would also explain my process running out of memory and getting killed.
So my questions are:
Is that the normal behavior of the sum operation?
Is there a way to force it to compute intermediary sums before all the intermediary tensordot products are computed and kept in memory?
If not, is there a work around that does not involve spilling to disk?
Any help much much appreciated!
Generally speaking, performing a dense matrix-matrix multiply in small space is hard. This is because every intermediate chunk will be used by several of the output chunks.
According to the scheduling doc at http://dask.pydata.org/en/latest/scheduling-policy.html, I would expect the upper tensordot intermediary results to be summed up one by one into the last sum result as soon as they have been individually computed.
The graph that you have shown has many inputs to a sum function. Dask will wait until all of those inputs are complete before running the sum function. The task scheduler has no idea that sum is associative and can be run piece by piece. This lack of semantic information is the price you pay for using a general task scheduling system like Dask rather than a dedicated linear algebra library. If your goal is to perform dense linear algebra as efficiently as possible then you might want to look elsewhere; this is a well covered field.
So as written your memory requirements are at least 8e5 * 1e4 * dtype.itemsize, assuming that Dask proceeds in exactly the right order (which it should mostly do).
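For concreteness, with the default float64 itemsize of 8 bytes that lower bound works out to roughly 64 GB:

>>> 8e5 * 1e4 * 8 / 1e9   # the full 10,000 x 800,000 float64 input held in memory, in GB
64.0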
You might try the following:
Reduce the chunksize along the non-contracting dimension (see the rechunk sketch after the code below)
Use a version of Dask later than 0.14.1 (0.14.2 should be released by May 5th, 2017), where we break down those large sum calls into many smaller ones explicitly in the graph.
Use the distributed scheduler, which handles writing data to disk more efficiently.
from dask.distributed import Client
client = Client(processes=False) # create a local cluster in this process
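And, for the first suggestion, a minimal rechunking sketch reusing the into/h5py target from the question; the 10**3 chunk size along the non-contracting axis is an illustrative value, not a tuned one:

x = x.rechunk((10**3, 10**4))      # smaller chunks along the 10,000-long, non-contracting axis
y = x.dot(x.T)
into("h5py:///tmp/dummy::/Y", y)   # each (1,000 x 1,000) output block now needs smaller intermediates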
Currently I'm implementing this paper for my undergraduate thesis with Python, but I only use the Mahalanobis metric learning (in case you're curious).
In short, I face a problem when I need to learn a matrix of size 67K*67K consisting of integers, by simply computing numpy.dot(A.T, A) where A is a random vector of size (1, 67K). When I do that, it simply throws a MemoryError, since my PC only has 8 GB of RAM and the raw calculation of the memory needed to initialize it is 16 GB. Then I searched for an alternative and found dask.
So I moved on to dask with dask.array.dot(A.T, A) and it worked. But then I need to apply a whitening transformation to that matrix, and in dask I can achieve that by getting the SVD. But every time I do that SVD, the ipython kernel dies (I assume due to lack of memory).
This is what I do so far, from init until the kernel dies:
import dask.array as da

fv_length = 512*2*66                                               # 67,584
W = da.random.randint(10, 20, size=(fv_length,), chunks=(1000,))
W = da.reshape(W, (1, fv_length))
W_T = W.T
Wt = da.dot(W_T, W); del W, W_T                                    # (67584, 67584) outer product
Wt = da.reshape(Wt, (fv_length*fv_length//2, 2))                   # a very tall and skinny matrix
U, S, Vt = da.linalg.svd(Wt); del Wt
I haven't gotten U, S, and Vt yet.
Is my memory simply not enough to do these sorts of things, even when I'm using dask?
Or is this actually not a spec problem, but rather my bad memory management?
Or something else?
At this point I'm desperate enough to try a bigger-spec PC, so I am planning to rent a bare-metal server with 32 GB of RAM. Even if I do so, will it be enough?
Generally speaking, dask.array does not guarantee out-of-core operation for all computations. A square matrix-matrix multiply (or any L3 BLAS operation) is more or less impossible to do efficiently in small memory.
You can ask Dask to use an on-disk cache for intermediate values. See the FAQ under the question My computation fills memory, how do I spill to disk?. However this will be limited by disk-writing speeds, which are generally fairly slow.
A large memory machine and NumPy is probably the simplest way to resolve this problem. Alternatively you could try to find a different formulation of your problem.