dask out-of-core matrix multiply scheduling

dask out-of-core matrix multiply scheduling - python

I'm trying to compute the matrix product Y=XX^T for a matrix X of size 10,000 * 800,000. The matrix X is stored on-disk in an h5py file. The resulting Y should be a 10,000*10,000 matrix stored in the same h5py file. Here is a reproducible sample code.
import dask.array as da
from blaze import into
into("h5py:///tmp/dummy::/X", da.ones((10**4, 8*10**5), chunks=(10**4,10**4)))
x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**4,10**4)))
y = x.dot(x.T)
into("h5py:///tmp/dummy::/Y", y)
I expected this computation to go smoothly as each (10,000*10,000) chunk should be individually transposed, followed by a dot product and then summed up to the final result. However, running this computation fills both my RAM and swap memory until the process eventually gets killed.
Here is a sample of the computation graph plotted with dot_graph: Computation graph sample
According to the sheduling doc that http: //dask.pydata.org/en/latest/scheduling-policy.html
I would expect the upper tensordot intermediary results to be summed up one by one into the last sum result as soon as they have been individually computed. This would free the memory of these tensordot intermediary results, so that we would not face memory errors.
Playing around with a smaller toy example:
from dask.diagnostics import Profiler, CacheProfiler, ResourceProfiler
# Experiment on a (1,0000 * 5,000) matrix X split into 500 chunks of size (1,000 * 10)
x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**3,10)))[:10**3,5000]
y = x.T.dot(x)
with Profiler() as prof, CacheProfiler() as cprof, ResourceProfiler() as rprof:
into("h5py:///tmp/dummy::/X", y)
rprof.visualize()
I get the following display:
Ressource profiler
Where the green bar represents the sum operation, while yellow and purple bars represent respectively get_array and tensordot operations. This seems to indicate that the sum operation waits for all intermediary tensordot operations to be performed before summing them. This would also explain my process running out of memory and getting killed.
So my questions are:
Is that the normal behavior of the sum operation?
Is there a way to force it to compute intermediary sums before all
the intermediary tensordot products are computed and kept in memory?
If not, is there a work around that does not involve spilling to disk?
Any help much much appreciated!

Generally speaking performing a dense matrix-matrix multiply in small space is hard. This is because every intermediate chunk will by used by several of the output chunks.
According to the sheduling doc that http: //dask.pydata.org/en/latest/scheduling-policy.html I would expect the upper tensordot intermediary results to be summed up one by one into the last sum result as soon as they have been individually computed.
The graph that you have shown has many inputs to a sum function. Dask will wait until all of those inputs are complete before running the sum function. The task scheduler has no idea that sum is associative and can be run piece by piece. This lack of semantic information is the price you pay for using a general task scheduling system like Dask rather than a dedicated linear algebra library. If your goal is to perform dense linear algebra as efficiently as possible then you might want to look elsewhere; this is a well covered field.
So as written your memory requirements are at least 8e5 * 1e4 * dtype.itemsize, assuming that Dask proceeds in exactly the right order (which it should mostly do).
You might try the following:
Reduce the chunksize along the non-contracting dimension
Use a version of Dask later than 0.14.1 (0.14.2 should be released by May 5th, 2017), where we break down those large sum calls into many smaller ones explicitly in the graph.
Use the distributed scheduler, which handles writing data to disk more efficiently.
from dask.distributed import Client
client = Client(processes=False) # create a local cluster in this process

Related

3D volume processing using dask

I’m exploring 3D interactive volume convolution with some simple stencils using dask right now.
Let me explain what I mean:
Assume that you have a 3D data which you would like to process through Sobel Transform (for example to get L1 or L2 gradient).
Then you divide your input 3D image into subvolumes (with some overlapping boundaries – for 3x3x3 stencil Sobel it will demand +2 samples overlap/padding)
Now let’s assume that you create a delayed computation of the Sobel 3D transform on entire 3D volume – but not executing it yet.
And now the most important part:
I want to write a function which will extract some particular 2D section from the virtually transformed data.
And then finally let dask everything to compute:
But what I need dask to do is not to compute the entire transform for me and then provide a section.
I need it to execute only those tasks which are needed to compute that particular 2D transformed image slice.
Do you think – it’s possible?
In order to explain it with image – please consider this to be a 3D domain decomposition (this is from DWT – but good for illustration from here):
illistration of domain decomposition
And assume that there is a function which computes 3D transform of the entire volume using dask.
But what I would like to get – for example – is 2D image of the transformed 3D data which consists from LLL1,LLH1,HLH1,HLL1 planes (essentially a single slice).
The important part is not to compute the whole subcubes – but let dask somehow automatically track the dependencies in the compute graph and evaluate only those.
Please don’t worry about compute v.s. copy time.
Assume that it has perfect ratio.
Let me know if more clarification is needed!
Thanks for your help!

I'm hearing a few questions. I'll answer each individually
Can Dask track which tasks are required for a subset of outputs and only compute those?
Yes. Lazy Dask operations produce a dependency graph. In the case of dask.arrays this graph is per-chunk. If your output only depends on a subset of the graph then Dask will remove tasks that are not necessary. The in-depth docs for this are here and the cull optimization in particular.
As an example consider this 100,000 by 100,000 array
>>> x = da.random.random((100000, 100000), chunks=(1000, 1000))
And lets say that I add a couple of 1d slices from it
>>> y = x[5000, :] + x[:, 5000].T
The resulting optimized graph is only large enough to compute the output
>>> graph = y._optimize(y.dask, y._keys()) # you don't need to do this
>>> len(graph) # it happens automatically
301
And we can compute the result quite quickly:
In [8]: %time y.compute()
CPU times: user 3.18 s, sys: 120 ms, total: 3.3 s
Wall time: 936 ms
Out[8]:
array([ 1.59069994, 0.84731881, 1.86923216, ..., 0.45040813,
0.86290539, 0.91143427])
Now, this wasn't perfect. It did have to produce all of the 1000x1000 chunks that our two slices touched. But you can control the granularity there.
Short answer: Dask will automatically inspect the graph and only run those tasks that are necessary to evaluate the output. You don't need to do anything special to do this.
Is it a good idea to do overlapping array computations with dask.array?
Maybe. The relevant doc page is here on Overlapping Blocks with Ghost Cells. Dask.array has convenience functions to make this easy to write down. However it will create in-memory copies. Many people in your position find memcopy too slow. Dask generally doesn't support in-place computation so we can't be as efficient as proper MPI code. I'll leave the performance question here to you though.

Not to detract from the nicely laid out answer from #MRocklin, but more to add to it.
I also regularly find myself needing to do things like edge detection and other image processing techniques on large scale array data. As Dask is a very nice library for constructing and exploring such computational workflows on large array data, have put together some utility libraries for some common image processing techniques in a GitHub organization called dask-image. They have largely been designed to mimic SciPy's ndimage API.
As to using a Sobel operator with Dask, one can use this sobel function from dask-ndfilters (permissively licensed) to perform this operation on a Dask Array. It handles proper haloing on the blocks underneath the hood returning a new Dask Array.
As SciPy's sobel function (and dask-ndfilters' sobel as well) operate on one dimension, one will need to map over each axis and stack to get the full Sobel operator result. That said, this is quite straightforward to do. Below is a brief snippet showing how to do this on a random Dask Array. Also included is taking a slice along the XZ-plane. Though one could just as easily take any other slice or perform additional operations on the data.
Hope this helps. :)
import dask.array as da
import dask_ndfilters as da_ndfilt
d = da.random.random((100, 120, 140), chunks=(25, 30, 35))
ds = da.stack([da_ndfilt.sobel(d, axis=i) for i in range(d.ndim)])
dsp = ds[:, :, 0, :]
asp = dsp.compute()

Memory-efficient normalization of scipy sparse arrays

I have a script that reads in a corpus of many short documents and vectorizes them using sklearn. The result is a large, sparse matrix (specifically, a scipy.sparse.csr.csr_matrix) with dimensions 371k x 100k. My goal is to normalize this such that each row sums to 1, i.e. divide each entry by the sum of the entries in its row. I've tried several ways of doing this and each has given me a MemoryError:
M /= M.sum(axis=1)
M_normalized = sklearn.preprocessing.normalize(M, axis=1, norm='l1')
a for loop which sums and divides the rows one at a time and adds the result to an all-zero matrix
For the first option, I used pdb to stop the script right before the normalization step so I could monitor memory consumption with htop. Interestingly, as soon as I stepped forward in the script to execute M /= M.sum(axis=1), the memory error was thrown immediately, in under a second.
My machine has 16GB of memory + 16GB swap, but only around 8GB + 16GB free at the point in the program where the normalization takes place.
Can anyone explain why all these methods are running into memory problems? Surely the third one at least should only use a small amount of memory, since it's only looking at one row at a time. Is there a more memory-efficient way to achieve this?

How do I work with large, memory hungry numpy array?

I have a program which creates an array:
List1 = zeros((x, y), dtype=complex_)
Currently I am using x = 500 and y = 1000000.
I will initialize the first column of the list by some formula. Then the subsequent columns will calculate their own values based on the preceding column.
After the list is completely filled, I will then display this multidimensional array using imshow().
The size of each value (item) in the list is 24 bytes.
A sample value from the code is: 4.63829355451e-32
When I run the code with y = 10000000, it takes up too much RAM and the system stops the run. How do I solve this problem? Is there a way to save my RAM while still being able to process the list using imshow() easily? Also, how large a list can imshow() display?

There's no way to solve this problem (in any general way).
Computers (as commonly understood) have a limited amount of RAM, and they require elements to be in RAM in order to operate on them.
An complex128 array size of 10000000x500 would require around 74GiB to store. You'll need to somehow reduce the amount of data you're processing if you hope to use a regular computer to do it (as opposed to a supercomputer).
A common technique is partitioning your data and processing each partition individually (possibly on multiple computers). Depending on the problem you're trying to solve, there may be special data structures that you can use to reduce the amount of memory needed to represent the data - a good example is a sparse matrix.
It's very unusual to need this much memory - make sure to carefully consider if it's actually needed before you dwell into the extremely complex workarounds.

Python NUMPY HUGE Matrices multiplication

I need to multiply two big matrices and sort their columns.
import numpy
a= numpy.random.rand(1000000, 100)
b= numpy.random.rand(300000,100)
c= numpy.dot(b,a.T)
sorted = [argsort(j)[:10] for j in c.T]
This process takes a lot of time and memory. Is there a way to fasten this process? If not how can I calculate RAM needed to do this operation? I currently have an EC2 box with 4GB RAM and no swap.
I was wondering if this operation can be serialized and I dont have to store everything in the memory.

One thing that you can do to speed things up is compile numpy with an optimized BLAS library like e.g. ATLAS, GOTO blas or Intel's proprietary MKL.
To calculate the memory needed, you need to monitor Python's Resident Set Size ("RSS"). The following commands were run on a UNIX system (FreeBSD to be precise, on a 64-bit machine).
> ipython
In [1]: import numpy as np
In [2]: a = np.random.rand(1000, 1000)
In [3]: a.dtype
Out[3]: dtype('float64')
In [4]: del(a)
To get the RSS I ran:
ps -xao comm,rss | grep python
[Edit: See the ps manual page for a complete explanation of the options, but basically these ps options make it show only the command and resident set size of all processes. The equivalent format for Linux's ps would be ps -xao c,r, I believe.]
The results are;
After starting the interpreter: 24880 kiB
After importing numpy: 34364 kiB
After creating a: 42200 kiB
After deleting a: 34368 kiB
Calculating the size;
In [4]: (42200 - 34364) * 1024
Out[4]: 8024064
In [5]: 8024064/(1000*1000)
Out[5]: 8.024064
As you can see, the calculated size matches the 8 bytes for the default datatype float64 quite well. The difference is internal overhead.
The size of your original arrays in MiB will be approximately;
In [11]: 8*1000000*100/1024**2
Out[11]: 762.939453125
In [12]: 8*300000*100/1024**2
Out[12]: 228.8818359375
That's not too bad. However, the dot product will be way too large:
In [19]: 8*1000000*300000/1024**3
Out[19]: 2235.1741790771484
That's 2235 GiB!
What you can do is split up the problem and perfrom the dot operation in pieces;
load b as an ndarray
load every row from a as an ndarray in turn.
multiply the row by every column of b and write the result to a file.
del() the row and load the next row.
This wil not make it faster, but it would make it use less memory!
Edit: In this case I would suggest writing the output file in binary format (e.g. using struct or ndarray.tofile). That would make it much easier to read a column from the file with e.g. a numpy.memmap.

What DrV and Roland Smith said are good answers; they should be listened to. My answer does nothing more than present an option to make your data sparse, a complete game-changer.
Sparsity can be extremely powerful. It would transform your O(100 * 300000 * 1000000) operation into an O(k) operation with k non-zero elements (sparsity only means that the matrix is largely zero). I know sparsity has been mentioned by DrV and disregarded as not applicable but I would guess it is.
All that needs to be done is to find a sparse representation for computing this transform (and interpreting the results is another ball game). Easy (and fast) methods include the Fourier transform or wavelet transform (both rely on similarity between matrix elements) but this problem is generalizable through several different algorithms.
Having experience with problems like this, this smells like a relatively common problem that is typically solved through some clever trick. When in a field like machine learning where these types of problems are classified as "simple," that's often the case.

YOu have a problem in any case. As Roland Smith shows you in his answer, the amount of data and number of calculations is enormous. You may not be very familiar with linear algebra, so a few words of explanation might help in understanding (and then hopefully solving) the problem.
Your arrays are both a collection of vectors with length 100. One of the arrays has 300 000 vectors, the other one 1 000 000 vectors. The dot product between these arrays means that you calculate the dot product of each possible pair of vectors. There are 300 000 000 000 such pairs, so the resulting matrix is either 1.2 TB or 2.4 TB depending on whether you use 32 or 64-bit floats.
On my computer dot multiplying a (300,100) array with a (100,1000) array takes approximately 1 ms. Extrapolating from that, you are looking at a 1000 s calculation time (depending on the number of cores).
The nice thing about taking a dot product is that you can do it piecewise. Keeping the output is then another problem.
If you were running it on your own computer, calculating the resulting matrix could be done in the following way:
create an output array as a np.memmap array onto the disk
calculate the results one row at a time (as explained by Roland Smith)
This would result in a linear file write with a largish (2.4 TB) file.
This does not require too many lines of code. However, make sure everything is transposed in a suitable way; transposing the input arrays is cheap, transposing the output is extremely expensive. Accessing the resulting huge array is cheap if you can access elements close to each other, expensive, if you access elements far away from each other.
Sorting a huge memmapped array has to be done carefully. You should use in-place sort algorithms which operate on contiguous chunks of data. The data is stored in 4 KiB chunks (512 or 1024 floats), and the fewer chunks you need to read, the better.
Now that you are not running the code in our own machine but on a cloud platform, things change a lot. Usually the cloud SSD storage is very fast with random accesses, but IO is expensive (also in terms of money). Probably the least expensive option is to calculate suitable chunks of data and send them to S3 storage for further use. The "suitable chunk" part depends on how you intend to use the data. If you need to process individual columns, then you send one or a few columns at a time to the cloud object storage.
However, a lot depends on your sorting needs. Your code looks as if you are finally only looking at a few first items of each column. If this is the case, then you should only calculate the first few items and not the full output matrix. That way you can do everything in memory.
Maybe if you tell a bit more about your sorting needs, there can be a viable way to do what you want.
Oh, one important thing: Are your matrices dense or sparse? (Sparse means they mostly contain 0's.) If your expect your output matrix to be mostly zero, that may change the game completely.

Segmentation Fault on ndarray matrices dot product

I am performing dot product of a matrix with 50000 rows and 100 columns with it's transpose. The values of the matrix is in float.
A(50000, 100)
B(100, 50000)
Basically I get the matrix after performing SVD on a larger sparse matrix.
The matrix is of numpy.ndarray type.
I use dot method of numpy for multiplying the two matrices. And I get segmentation fault.
numpy.dot(A, B)
The dot product works well on matrix with 30000 rows but fails for 50000 rows.
Is there any limit on numpy's dot product?
Any problem above while using dot product?
Is there any other good python linear algebra tool which is efficient on large matrices.

As you have been told, there is a memory problem. You want to do this:
numpy.dot(A, A.T)
which requires a lot of memory for the result (not the operands). However the operation is easy to perform in pieces. You may use a loop-based approach to produce one output row at a time:
def trans_multi(A):
rows = A.shape[0]
result = numpy.empty((rows, rows), dtype=A.dtype)
for r in range(rows):
result[r,:] = numpy.dot(A, A[r,:].T)
return result
This, as such, is just a slower and equally memory consuming version (numpy.dot is well-optimized). However, what you most probably want to do is to write the result into a file, as you do not have the memory to hold the result:
def trans_multi(A, filename):
with open(filename, "wb") as f:
rows = A.shape[0]
for r in range(rows):
f.write(numpy.dot(A, A[r,:].T).tostring())
Yes, it is not exactly lightning-fast. However, it is most probably the fastest you can hope for. Sequential writes are usually well-optimized. I tried:
a=numpy.random.random((50000,100)).astype('float32')
trans_multi(a,"/tmp/large.dat")
This took approximately 60 seconds, but it really depends on your HDD performance.
Why not memmap?
I like mmap, and numpy.memmap is a great thing. However, numpy.memmap is optimized for having large tables and calculating small results from them. There is, for example memmap.dot which is optimized for taking dot products of memmapped arrays. The scenario is that the operands are memmapped, but the result is in RAM. Exactly the other way round, that is.
Memmapping is very useful when you have random access. Here the access is not random access but sequential write. Also, yf you try to use numpy.memmap to create a (50000,50000) array of float32's, it will take some time (for some reason I do not get, maybe it initializes the data even though there is no need).
However, after the file has been created, it is a very good idea to use numpy.memmap to analyze the huge table, as it gives the best random read performance and a very convenient interface.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.