memory exhaust on big matrix operation using dask

memory exhaust on big matrix operation using dask - python

Currently I'm implementing this paper for my undergraduate theses with python, but I only use the mahalanobis metric learning (in case you're curious).
In a shortcut, I face a problem when I need to learn a matrix with the size of 67K*67K consisting of integer, by simply numpy.dot(A.T,A) where A is a random vector sized (1,67K). When I do that it's simply throw MemoryError since my PC only have 8gb ram, and the raw calculation of the memory needed is 16gb to init. Than I search for alternative and found dask.
so i moved on to dask with this dask.array.dot(A.T,A) and it's done. But than I need to do whitening transformation to that matrix, and in dask I can achieve it by get the SVD. But everytime I do that SVD, the ipython kernel dies (I assume it due to lack of memory).
this is what I do so far from init, until the kernel dies:
fv_length=512*2*66
W = da.random.randint(10,20,(fv_length),(1000,1000))
W = da.reshape(W,(1,fv_length))
W_T = W.T
Wt = da.dot(W_T,W); del W,W_T
Wt = da.reshape(Wt,(fv_length*fv_length/2,2))
U,S,Vt = da.linalg.svd(Wt); del Wt
I didn't get the U,S,and Vt yet.
Is my memory simply not enough to do these sort of things, even when I'm using dask?
or actually this is not a spec problem, but my bad memory management?
or something else?
At this point I'm desperately trying in other bigger spec PC, so I am planning to rent a bare metal server with a 32gb ram. Even if I do so, is it enough?

Generally speaking dask.array does not guarantee out-of-core operation on all computations. A square matrix-matrix multiply (or any L3 BLAS operation) is more-or-less impossible to do efficiently in small memory.
You can ask Dask to use an on-disk cache for intermediate values. See the FAQ under the question My computation fills memory, how do I spill to disk?. However this will be limited by disk-writing speeds, which are generally fairly slow.
A large memory machine and NumPy is probably the simplest way to resolve this problem. Alternatively you could try to find a different formulation of your problem.

Related

Transpose large numpy matrix on disk

I have a rather large rectangular (>1G rows, 1K columns) Fortran-style NumPy matrix, which I want to transpose to C-style.
So far, my approach has been relatively trivial with the following Rust snippet, using MMAPed slices of the source and destination matrix, where both original_matrix and target_matrix are MMAPPed PyArray2, with Rayon handling the parallelization.
Since the target_matrix has to be modified by multiple threads, I wrap it in an UnsafeCell.
let shared_target_matrix = std::cell::UnsafeCell::new(target_matrix);
original_matrix.as_ref().par_chunks(number_of_nodes).enumerate().for_each(|(j, feature)|{
feature.iter().copied().enumerate().for_each(|(i, feature_value)| unsafe {
*(shared_target_matrix.uget_mut([i, j])) = feature_value;
});
});
This approach transposes a matrix with shape (~1G, 100), ~120GB takes ~3 hours on an HDD disk. Transposing a (~1G, 1000), ~1200GB matrix does not scale linearly to 30 hours, as one may naively expect, but explode to several weeks. As it stands, I have managed to transpose roughly 100 features in 2 days, and it keeps slowing down.
There are several aspects, such as the employed file system, the HDD fragmentation, and how MMAPed handles page loading, which my solution is currently ignoring.
Are there known, more holistic solutions that take into account these issues?
Note on sequential and parallel approaches
While intuitively, this sort of operation should be likely only limited by IO and therefore not benefit from any parallelization, we have observed experimentally that the parallel approach is indeed around three times faster (on a machine with 12 cores and 24 threads) than a sequential approach when transposing a matrix with shape (1G, 100). We are not sure why this is the case.
Note on using two HDDs
We also experimented with using two devices, one providing the Fortran-style matrix and a second one where we write the target matrix. Both HDDs were connected through SATA cables directly to the computer motherboard. We expected at least a doubling of the performance, but they remained unchanged.

While intuitively, this sort of operation should be likely only limited by IO and therefore not benefit from any parallelization, we have observed experimentally that the parallel approach is indeed around three times faster
This may be due to poor IO queue utilization. With an entirely sequential workload without prefetching you'll be alternating the device between working and idle. If you keep multiple operations in flight it'll be working all the time.
Check with iostat -x <interval>
But parallelism is a suboptimal way to achieve best utilization of a HDD because it'll likely cause more head-seeks than necessary.
We also experimented with using two devices, one providing the Fortran-style matrix and a second one where we write the target matrix. Both HDDs were connected through SATA cables directly to the computer motherboard. We expected at least a doubling of the performance, but they remained unchanged.
This may be due to the operating system's write cache which means it can batch writes very efficiently and you're mostly bottlenecked on reads. Again, check with iostat.
There are several aspects, such as the employed file system, the HDD fragmentation, and how MMAPed handles page loading, which my solution is currently ignoring.
Are there known, more holistic solutions that take into account these issues?
Yes, if the underlying filesystem supports it you can use FIEMAP to get the physical layout of the data on disk and then optimize your read order to follow the physical layout rather than the logical layout. You can use the filefrag CLI tool to inspect the fragmentation data manually, but there are rust bindings for that ioctl so you can use it programmatically too.
Additionally you can use madvise(MADV_WILLNEED) to inform the kernel to prefetch data in the background for the next few loop iterations. For HDDs this should be ideally done in batches worth a few megabytes at a time. And the next batch should be issued when you're half-way through the current one.
Issuing them in batches minimizes syscall overhead and starting the next one half-way through ensures there's enough time left to actually complete the IO before you reach the end of the current one.
And since you'll be manually issuing prefetches in physical instead of logical order you can also disable the default readahead heuristics (which would be getting in the way) via madvise(MADV_RANDOM)
If you have enough free diskspace you could also try a simpler approach: defragmenting the file before operating on it. But even then you should still use madvise to ensure that there always are IO requests in flight.

SVD MemoryError in Python

I want to perform an SVD on a big array M[159459,159459].
Since SVD computation depends on the input matrix of shape (159459,159459), this here does not address my goal.
I have tried to use:
scipy.linalg.svd
scipy.linalg.svd(check_finite=False)
change the driver to lapack_driver='gesvd
numpy.linalg.svd
However, I always get a MemoryError. Finally, I want to compute the full SVD, because I want to perform a Procrustes analysis, i.e. if M is the matrix that I have now, I need M = USV'
import numpy as np
from scipy import linalg
#M = np.load("M.npy")
M = np.random.rand(159459,159459)
U, s, Vh = linalg.svd(M, check_finite=False, lapack_driver='gesvd)
Everything fails.
My system details:
$ cat /proc/meminfo
MemTotal: 527842404 kB
MemFree: 523406068 kB
MemAvailable: 521659112 kB

Memory size matters, latency costs will hurt you next:
Given mM.shape == [159459, 159459],given mM.dtype is by default float(64)there will be a need to have about:203.42 [GB] for the ori ginal of mM[159459, 159459], plus203.42 [GB] for the computed mU[159459, 159459], plus203.42 [GB] for the computed Vh[159459, 159459]0.0013 [GB] for the computed vS[159459]
A cheapest ever step, by trying a linear-only downscaling by a factor of 2 ( and not more than 4 ) from float64 to float32 or even float16 is not the game changer and is even heavily penalised by numpy inefficiencies ( if not internally performed back-conversions up to float64 again - my own attempts were so bleeding on this, that I share the resulting dissatisfaction here, so as to avoid repeating my own errors on trying to start with the lowest hanging fruit first ... )
In case your analysis may work just with the vector vS, only the .svd( ..., compute_uv = False, ... ) flag will avoid making a space for about ~ 1/2 [TB] RAM-allocations by not returning ( and thus not reserving space for them ) instances of mU and Vh.
Even such a case does not mean your SLOC will survive as is in the just about the reported 0.5 TB RAM-system. The scipy.linalg.svd() processing will allocated internally working resources, that are outside of your scope of coding ( sure, unless you re-factor and re-design the scipy.linalg module on your own, which is fair to consider very probable if not sure ) and configuration control. So, be warned that even when you test the compute_uv = False-mode of processing, the .svd() may still throw an error, if it fails to internally allocate required internally used data-structures, that do not fit the current RAM.
This also means that even using the numpy.memmap(), which may be a successful trick to off-load the in-RAM representation of the original mM ( avoiding some remarkable part of the first needed 203.4 [GB] from sitting and blocking the usage of the hosts RAM ), yet there are costs you will have to pay for using this trick.
My experiments, at smaller scales of the .memmap-s, used for matrix processing and in ML-optimisations, yield about 1E4 ~ 1E6 slower processing because, in spite of the smart caching, the numpy.memmap()-instances are dependent on the disk-I/O.
Best result will come from using advanced, TB-sized, SSD-only-storage devices, hosted right on the computing device on some fast and low-latency access-bus M.2 or PCIx16.
The last piece of experience, one might not yet want to hear here:
Using larger host-based RAM, which means using a multi-TB computing device, is the safest way to go further. Testing the above proposed steps will help, if reduced performance and additional expenses are within your Project's budget. If not, go for the use of an HPC centre at your Alma Mater or at your Project's closest research centre, where such multi-TB computing devices are being used in common.

Python - go beyond RAM limits?

I'm trying to analyze text, but my Mac's RAM is only 8 gigs, and the RidgeRegressor just stops after a while with Killed: 9. I recon this is because it'd need more memory.
Is there a way to disable the stack size limiter so that the algorithm could use some kind of swap memory?

You will need to do it manually.
There are probably two different core-problems here:
A: holding your training-data
B: training the regressor
For A, you can try numpy's memmap which abstracts swapping away.
As an alternative, consider preparing your data to HDF5 or some DB. For HDF5, you can use h5py or pytables, both allowing numpy-like usage.
For B: it's a good idea to use some out-of-core ready algorithm. In scikit-learn those are the ones supporting partial_fit.
Keep in mind, that this training-process decomposes into at least two new elements:
Efficient being in regards to memory
Swapping is slow; you don't want to use something which holds N^2 aux-memory during learning
Efficient convergence
Those algorithms in the link above should be okay for both.
SGDRegressor can be parameterized to resemble RidgeRegression.
Also: it might be needed to use partial_fit manually, obeying the rules of the algorithm (often some kind of random-ordering needed for convergence-proofs). The problem with abstracting-away swapping is: if your regressor is doing a permutation in each epoch, without knowing how costly that is, you might be in trouble!
Because the problem itself is quite hard, there are some special libraries built for this, while sklearn needs some more manual work as explained. One of the most extreme ones (a lot of crazy tricks) might be vowpal_wabbit (where IO is often the bottleneck!). Of course there are other popular libs like pyspark, serving a slightly different purpose (distributed computing).

Efficient Matrix-Vector Multiplication: Multithreading directly in Python vs. using ctypes to bind a multithreaded C function

I have a simple problem: multiply a matrix by a vector. However, the implementation of the multiplication is complicated because the matrix is 18 gb (3000^2 by 500).
Some info:
The matrix is stored in HDF5 format. It's Matlab output. It's dense so no sparsity savings there.
I have to do this matrix multiplication roughly 2000 times over the course of my algorithm (MCMC Bayesian Inversion)
My program is a combination of Python and C, where the Python code handles most of the MCMC procedure: keeping track of the random walk, generating perturbations, checking MH Criteria, saving accepted proposals, monitoring the burnout, etc. The C code is simply compiled into a separate executable and called when I need to solve the forward (acoustic wave) problem. All communication between the Python and C is done via the file system. All this is to say I don't already have ctype stuff going on.
The C program is already parallelized using MPI, but I don't think that's an appropriate solution for this MV multiplication problem.
Our program is run mainly on linux, but occasionally on OSX and Windows. Cross-platform capabilities without too much headache is a must.
Right now I have a single-thread implementation where the python code reads in the matrix a few thousand lines at a time and performs the multiplication. However, this is a significant bottleneck for my program since it takes so darn long. I'd like to multithread it to speed it up a bit.
I'm trying to get an idea of whether it would be faster (computation-time-wise, not implementation time) for python to handle the multithreading and to continue to use numpy operations to do the multiplication, or to code an MV multiplication function with multithreading in C and bind it with ctypes.
I will likely do both and time them since shaving time off of an extremely long running program is important. I was wondering if anyone had encountered this situation before, though, and had any insight (or perhaps other suggestions?)
As a side question, I can only find algorithmic improvements for nxn matrices for m-v multiplication. Does anyone know of one that can be used on an mxn matrix?

Hardware
As Sven Marnach wrote in the comments, your problem is most likely I/O bound since disk access is orders of magnitude slower than RAM access.
So the fastest way is probably to have a machine with enough memory to keep the whole matrix multiplication and the result in RAM. It would save lots of time if you read the matrix only once.
Replacing the harddisk with an SSD would also help, because that can read and write a lot faster.
Software
Barring that, for speeding up reads from disk, you could use the mmap module. This should help, especially once the OS figures out you're reading pieces of the same file over and over and starts to keep it in the cache.
Since the calculation can be done by row, you might benefit from using numpy in combination with a multiprocessing.Pool for that calculation. But only really if a single process cannot use all available disk read bandwith.

Minimising reading from and writing to disk in Python for a memory-heavy operation

Background
I am working on a fairly computationally intensive project for a computational linguistics project, but the problem I have is quite general and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
Process the data on each line.
From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is operation 3 (and by connection 4)
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have its basis weights updated.
If you think of them in terms of python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index).
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
Attempted Solutions
There are three extrema when it comes to how to do this:
I could build all the vectors in memory. Then write them to disk.
I could build all the vectors directly on the disk, using shelf of pickle or some such library.
I could build the vectors in memory one at a time and writing it to disk, passing through the corpus once per vector.
All these options are fairly intractable. 1 just uses up all the system memory, and it panics and slows to a crawl. 2 is way too slow as IO operations aren't fast. 3 is possibly even slower than 2 for the same reasons.
Goals
A good solution would involve:
Building as much as possible in memory.
Once memory is full, dump everything to disk.
If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
Go back to 1 until all vectors are built.
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? I python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?

take a look at pytables. One of the advantages is you can work with very large amounts of data, stored on disk, as if it were in memory.
edit: Because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seeking times. The size of your project is perfect for todays affordable SSD 'drives'.

A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.

Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then another pass over all the data with second 1M vectors.
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly, RAM) is available to you for this task. If there are 100k vectors, each of them with 1M ints, this gives ~370GB. If multiple passes method is viable and you've got a machine with 16GB RAM, then it is about ~25 passes -- should be easy to parallelize if you've got a cluster.

Think about using an existing in-memory DB solution like Redis. The problem of switching to disk once RAM is gone and tricks to tweak this process should already be in place. Python client as well.
Moreover this solution could scale vertically without much effort.

You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much greater than available RAM.

Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.

I'd suggest to do it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB. ( Redis and MongoDB are good candidates)
3) Determine how much memory this procedure consumes and parallelize accordingly ( or even better use a map/reduce approach, or a distributed task queue like celery)
Plus all the tips mentioned before (numPy etc..)

Hard to say exactly because there are a few details missing, eg. is this a dedicated box? Does the process run on several machines? Does the avail memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache and see what happens. With plenty of cache performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps disk, memcached, etc), and then look there first before calculating again. The Django docs have a good introduction.

From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Find a method to have your corpus in memory. This might be a sort of ram disk with file system, or a database. No idea, which one is best for you.
Have a smallish shell script monitor ram usage, and spawn every second another process of the following, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwith to disk):
iterate through the corpus and build and write some vectors
in the end you can collect and combine all vectors, if needed (this would be the reduce part)

Split the corpus evenly in size between parallel jobs (one per core) - process in parallel, ignoring any incomplete line (or if you cannot tell if it is incomplete, ignore the first and last line of that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to loose information from 2*N lines where N is the number of parallel processes, but you gain by not adding complicated logic to try and capture these lines for processing.

Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example say I want to have a while-loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The follow pseudocode will do the trick:
while (someCondition):
if psutil.phymem_usage().percent > 80.0:
dumpQueue(myQueue,somefile)
else:
addSomeStufftoQueue(myQueue,stuff)
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS. Props to to Sean for suggesting this module.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.