I’m exploring 3D interactive volume convolution with some simple stencils using dask right now.
Let me explain what I mean:
Assume that you have 3D data which you would like to process with a Sobel transform (for example, to get the L1 or L2 gradient magnitude).
Then you divide your input 3D image into subvolumes (with some overlapping boundaries; a 3x3x3 Sobel stencil demands 2 samples of overlap/padding per axis, one on each side).
Now let’s assume that you create a delayed computation of the 3D Sobel transform on the entire volume, without executing it yet.
And now the most important part:
I want to write a function which will extract some particular 2D section from the virtually transformed data.
And then finally let dask compute everything.
But what I need dask to do is not to compute the entire transform for me and then provide a section.
I need it to execute only those tasks which are needed to compute that particular 2D transformed image slice.
Do you think it’s possible?
To explain it with an image, please consider this 3D domain decomposition (the figure is from a DWT, but it works well as an illustration here):
[illustration of domain decomposition]
And assume that there is a function which computes the 3D transform of the entire volume using dask.
But what I would like to get, for example, is a 2D image of the transformed 3D data which consists of the LLL1, LLH1, HLH1, HLL1 planes (essentially a single slice).
The important part is not to compute the whole subcubes, but to let dask automatically track the dependencies in the compute graph and evaluate only those tasks.
Please don’t worry about compute vs. copy time; assume that it has a perfect ratio.
Let me know if more clarification is needed!
Thanks for your help!
I'm hearing a few questions; I'll answer each individually.
Can Dask track which tasks are required for a subset of outputs and only compute those?
Yes. Lazy Dask operations produce a dependency graph. In the case of dask.array this graph is per-chunk. If your output only depends on a subset of the graph then Dask will remove tasks that are not necessary. The in-depth docs for this are in the graph-optimization documentation; see the cull optimization in particular.
As an example, consider this 100,000 by 100,000 array
>>> x = da.random.random((100000, 100000), chunks=(1000, 1000))
And let's say that I add a couple of 1D slices from it
>>> y = x[5000, :] + x[:, 5000].T
The resulting optimized graph is only large enough to compute the output
>>> graph = y._optimize(y.dask, y._keys()) # you don't need to do this
>>> len(graph) # it happens automatically
301
And we can compute the result quite quickly:
In [8]: %time y.compute()
CPU times: user 3.18 s, sys: 120 ms, total: 3.3 s
Wall time: 936 ms
Out[8]:
array([ 1.59069994, 0.84731881, 1.86923216, ..., 0.45040813,
0.86290539, 0.91143427])
Now, this wasn't perfect. It did have to produce all of the 1000x1000 chunks that our two slices touched. But you can control that granularity with your choice of chunks.
Short answer: Dask will automatically inspect the graph and only run those tasks that are necessary to evaluate the output. You don't need to do anything special to do this.
Is it a good idea to do overlapping array computations with dask.array?
Maybe. The relevant doc page is the one on Overlapping Blocks with Ghost Cells. Dask.array has convenience functions to make this easy to write down. However, it will create in-memory copies, and many people in your position find memcpy too slow. Dask generally doesn't support in-place computation, so we can't be as efficient as proper MPI code. I'll leave the performance question there to you, though.
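For illustration, here is a minimal sketch of that ghosting approach combined with the dependency culling from the first answer, using scipy.ndimage.sobel as the 3x3x3 stencil. The array sizes are hypothetical, and it assumes a dask version where map_overlap is available as an array method:

import dask.array as da
from scipy import ndimage

# Hypothetical volume; a 3x3x3 stencil needs a halo of depth 1 on each side
x = da.random.random((256, 256, 256), chunks=(64, 64, 64))

# map_overlap adds the halo, applies the filter per block, and trims it again
gx = x.map_overlap(ndimage.sobel, depth=1, boundary='reflect', axis=0)
gy = x.map_overlap(ndimage.sobel, depth=1, boundary='reflect', axis=1)
gz = x.map_overlap(ndimage.sobel, depth=1, boundary='reflect', axis=2)

grad = da.sqrt(gx**2 + gy**2 + gz**2)  # L2 gradient magnitude, still lazy

# Slicing before compute() means only the tasks feeding this plane are run
plane = grad[:, :, 128].compute()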
Not to detract from the nicely laid out answer from @MRocklin, but more to add to it.
I also regularly find myself needing to do things like edge detection and other image processing techniques on large scale array data. As Dask is a very nice library for constructing and exploring such computational workflows on large array data, I have put together some utility libraries for some common image processing techniques in a GitHub organization called dask-image. They have largely been designed to mimic SciPy's ndimage API.
As for using a Sobel operator with Dask, one can use the sobel function from dask-ndfilters (permissively licensed) to perform this operation on a Dask array. It handles proper haloing of the blocks under the hood, returning a new Dask array.
As SciPy's sobel function (and dask-ndfilters' sobel as well) operates on one dimension at a time, one will need to map over each axis and stack the results to get the full Sobel operator result. That said, this is quite straightforward to do. Below is a brief snippet showing how to do this on a random Dask array. Also included is taking a slice along the XZ-plane, though one could just as easily take any other slice or perform additional operations on the data.
Hope this helps. :)
import dask.array as da
import dask_ndfilters as da_ndfilt

# A random 3D volume split into chunks
d = da.random.random((100, 120, 140), chunks=(25, 30, 35))

# Apply the Sobel filter along each axis and stack the three results
ds = da.stack([da_ndfilt.sobel(d, axis=i) for i in range(d.ndim)])

# Fix the middle (Y) axis at 0 to take an XZ-plane slice, then compute
dsp = ds[:, :, 0, :]
asp = dsp.compute()
Related
I am trying to run indexing and assignment on a 3D netCDF array loaded into Dask for multiprocessing purposes. At the moment Dask does not appear to support direct 3D indexing and assignment, so I have been trying to ravel the 3D array to 1D, perform the indexing and assignment, and then reshape the result back to 3D at the end. However, this process has proven to be highly memory intensive, to the point where my Dask workers with 4 GB of available memory are unable to complete the task.
The data I am using are the GridRAD set (http://gridrad.org/data.html; netCDF). Getting the data and saving the files are straightforward enough through xarray; however, the data itself needs additional processing to be suitable for analysis. (The data source creators have written filtering routines that allow for such analysis using numpy, here: http://gridrad.org/zip/gridrad_python_software.zip, provided to show what operations are being performed in Dask.) I have converted this script from using numpy arrays to using Dask arrays, namely by just switching numpy functions (i.e. np.function) to dask.array functions (i.e. da.function), which share the same names and calling conventions.
I stumbled across this GitHub issue while researching potential solutions: https://github.com/dask/dask/issues/3096. It seems to describe what I am trying to do; however, the examples provided there are not descriptive enough for me to understand. I wrote a utility function to remove the .compute() call on the index and have tried the lambda approach detailed in the issue, but I do not know how to use it properly. Here is my current code:
def contains_grid0(in_struct, grid):
    lX = len(in_struct["X"])
    lY = len(in_struct["Y"])
    highest = lX * lY
    # If there is a grid, we can find the lowest index.
    if len(grid) > 0:
        return grid[0] <= highest
    return False

inan = da.flatnonzero(da.asarray(da.isnan(in_struct["ref"])))  # Find bins with NaNs
if contains_grid0(in_struct, inan):
    da.map_blocks(lambda block, idxs=None: idxs[block] = 0.0, block=in_struct["ref"], idxs=inan)
This does not work because you cannot assign inside a lambda (and I do not know how to adapt the example above so the assignment can be tested), but it establishes the proof of concept of what I am trying to accomplish.
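For reference, one assignment-free way to express the same replacement, sketched below; the random array here is a hypothetical stand-in for in_struct["ref"]:

import dask.array as da

# Hypothetical stand-in for in_struct["ref"]
ref = da.random.random((100, 100, 100), chunks=(50, 50, 50))

# Replace NaN bins with 0.0 lazily, block by block; no ravel, no index array
cleaned = da.where(da.isnan(ref), 0.0, ref)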
Therefore, I am curious whether anyone has had to index and assign on a 3D array using Dask, and how you succeeded in doing so in the most memory-efficient manner.
Thanks!
My problem is to perform 3 matrix multiplications on a 3D numpy array A that is too large to fit on a single processor. In tensorial form I want A_ijk B_km C_jn D_ip (B, C, and D can all fit in memory). I want to know if Dask is appropriate for this task (or if another tool might be better suited).
I believe the best approach is to split this operation into the individual multiplications and make sure that each one is local. This link has a really useful diagram that summarises what I'm talking about: http://www.2decomp.org/1d_mode.html.
In more detail: First, to do A_ijk B_km, I should distribute A over the first two axes, and perform the matrix multiplication over each pencil locally (the first step in the diagram).
Then, I need to transpose the array, making the j axis local to each processor (and splitting over the k (now m) axis), to then perform the next multiplication. (So going from the first to the second step in the diagram). This is where I wonder if dask could help.
I'm aware that this can be done in principle using mpi4py, but the steps are pretty non-trivial, whereas dask arrays have helpful rechunk and transpose methods, which feel relevant to this application.
Does this seem like something well-suited to dask?
If not, is anyone aware of any Python libraries that can perform these steps? I know that FFTW has routines for doing just this, but I don't know how to write the necessary C code, or how to get it to interface with Python and numpy.
Thanks for any help.
For anyone else in the future: mpi4py does have a transpose capability, but it's called Alltoall/Alltoallv. It's not explained in the mpi4py documentation or tutorial; I found out about it from another tutorial: https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:mpi4py.
Dask implements einsum, which may be what you are after, and there is, of course, matmul if you want to write out the operation. So long as your large array A is a Dask array with reasonable chunk sizes, Dask will parcel out your work without running out of memory.
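As a minimal sketch of the einsum route (all sizes here are hypothetical, and it assumes a dask version that provides da.einsum; only A is chunked, while B, C, and D are plain in-memory numpy arrays):

import numpy as np
import dask.array as da

# Hypothetical sizes: A is the large chunked array; B, C, D fit in memory
A = da.random.random((400, 400, 400), chunks=(100, 100, 100))
B = np.random.random((400, 40))
C = np.random.random((400, 50))
D = np.random.random((400, 60))

# A_ijk B_km C_jn D_ip, with the free indices ordered as (p, n, m)
out = da.einsum('ijk,km,jn,ip->pnm', A, B, C, D)
result = out.compute()  # Dask schedules the contractions chunk by chunk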
In image processing and classification networks, a common task is the convolution or cross-correlation of input images with some fixed filters. In convolutional neural nets (CNNs), for example, this is an extremely common operation. I have reduced the general version of the task to this:
Given: a batch of N images [N,H,W,D,...] and a set of K filters [K,H,W,D,...]
Return: an ndarray that represents the m-dimensional cross-correlation (xcorr) of image N_i with filter K_j for every N_i in N and K_j in K
Currently, I am using scipy.spatial.cdist with a custom function that returns the max of the xcorr of two m-dim images, computed with scipy.signal.correlate. The code looks something like this:
import numpy as np
from scipy.spatial.distance import cdist
from scipy.signal import correlate

def xcorr(u, v):
    '''unfortunately, cdist only takes 2D arrays, so need to do this'''
    u = np.reshape(u, [96, 96, 3])
    v = np.reshape(v, [96, 96, 3])
    return np.max(correlate(u, v, mode='same', method='fft'))

batch_images = np.random.random([500, 96, 96, 3])
my_filters = np.random.random([1000, 96, 96, 3])

# unfortunately, cdist only takes 2D arrays, so need to do this
batch_vec = np.reshape(batch_images, [-1, np.prod(batch_images.shape[1:])])
filt_vec = np.reshape(my_filters, [-1, np.prod(my_filters.shape[1:])])
answer = cdist(batch_vec, filt_vec, xcorr)
The method works, and it's nice that cdist is automatically parallelized across threads, but it is actually quite slow. I am guessing this is due to a number of reasons, including non-optimal use of the cache between threads (e.g. keeping one filter fixed in cache while correlating all the images against it, or vice versa), the reshape operation inside xcorr, and so on.
Does the community have any ideas on how to speed this up? I realize that in my example xcorr takes the maximum over the cross-correlation of both images, but this was just an example fitted to work with cdist. Ideally, you could perform this batch operation and use some other function (or none) to get the output you wanted. Ideal solutions could handle (R,G,B,D,...) data.
Any/all help appreciated, including but not limited to wrapping C, although Python/numpy solutions are preferred. I saw some posts related to einsum notation, but I am not super familiar with it, so any help would be appreciated. I welcome TensorFlow solutions if they are able to get the same answer (within reasonable precision) as the corresponding slow numpy version.
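Not from the original thread, but one common speed-up, sketched below under stated assumptions, is to compute each image's and filter's FFT once and reuse it across all pairs, instead of re-doing both FFTs for every pair as the cdist approach does. Note that this takes the max over all lags (a 'full' correlation), which differs numerically from the mode='same' version above, and batch_max_xcorr is a hypothetical helper name:

import numpy as np
from numpy.fft import rfftn, irfftn

def batch_max_xcorr(images, filters):
    '''Max of the full cross-correlation of every image with every filter.
    A sketch only: each FFT is computed once and reused across all pairs.'''
    s = tuple(2 * n - 1 for n in images.shape[1:])  # linear-correlation padding
    axes = tuple(range(1, images.ndim))
    F_img = rfftn(images, s=s, axes=axes)
    F_flt = np.conj(rfftn(filters, s=s, axes=axes))
    out = np.empty((images.shape[0], filters.shape[0]))
    for i in range(images.shape[0]):  # chunk this loop if memory is tight
        corr = irfftn(F_img[i][None] * F_flt, s=s, axes=axes)
        out[i] = corr.reshape(filters.shape[0], -1).max(axis=1)
    return out

answer_fast = batch_max_xcorr(batch_images, my_filters)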
I'm trying to compute the matrix product Y = XX^T for a matrix X of size 10,000 x 800,000. The matrix X is stored on disk in an h5py file. The resulting Y should be a 10,000 x 10,000 matrix stored in the same h5py file. Here is reproducible sample code.
import dask.array as da
from blaze import into

# Write a dummy 10,000 x 800,000 array of ones to disk
into("h5py:///tmp/dummy::/X", da.ones((10**4, 8 * 10**5), chunks=(10**4, 10**4)))

# Load it lazily, compute Y = X X^T, and store the result
x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**4, 10**4))
y = x.dot(x.T)
into("h5py:///tmp/dummy::/Y", y)
I expected this computation to go smoothly, as each (10,000 x 10,000) chunk should be individually transposed, followed by a dot product, and then summed up into the final result. However, running this computation fills both my RAM and swap memory until the process eventually gets killed.
Here is a sample of the computation graph plotted with dot_graph:
[Computation graph sample]
According to the scheduling docs (http://dask.pydata.org/en/latest/scheduling-policy.html), I would expect the upper tensordot intermediary results to be summed up one by one into the last sum result as soon as they have been individually computed. This would free the memory of those tensordot intermediary results, so that we would not face memory errors.
Playing around with a smaller toy example:
from dask.diagnostics import Profiler, CacheProfiler, ResourceProfiler

# Experiment on a (1,000 x 5,000) matrix X split into 500 chunks of size (1,000 x 10)
x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**3, 10))[:10**3, :5000]
y = x.T.dot(x)
with Profiler() as prof, CacheProfiler() as cprof, ResourceProfiler() as rprof:
    into("h5py:///tmp/dummy::/Y", y)
rprof.visualize()
I get the following display:
[Resource profiler output]
There, the green bar represents the sum operation, while the yellow and purple bars represent the get_array and tensordot operations, respectively. This seems to indicate that the sum operation waits for all intermediary tensordot operations to be performed before summing them, which would also explain my process running out of memory and getting killed.
So my questions are:
Is this the normal behavior of the sum operation?
Is there a way to force it to compute intermediary sums before all the intermediary tensordot products are computed and kept in memory?
If not, is there a workaround that does not involve spilling to disk?
Any help much much appreciated!
Generally speaking, performing a dense matrix-matrix multiply in small space is hard. This is because every intermediate chunk will be used by several of the output chunks.
According to the scheduling docs (http://dask.pydata.org/en/latest/scheduling-policy.html), I would expect the upper tensordot intermediary results to be summed up one by one into the last sum result as soon as they have been individually computed.
The graph that you have shown has many inputs to a sum function. Dask will wait until all of those inputs are complete before running the sum function. The task scheduler has no idea that sum is associative and can be run piece by piece. This lack of semantic information is the price you pay for using a general task scheduling system like Dask rather than a dedicated linear algebra library. If your goal is to perform dense linear algebra as efficiently as possible then you might want to look elsewhere; this is a well covered field.
So, as written, your memory requirements are at least 8e5 * 1e4 * dtype.itemsize bytes, assuming that Dask proceeds in exactly the right order (which it should mostly do).
You might try the following:
Reduce the chunksize along the non-contracting dimension (see the sketch after this list)
Use a version of Dask later than 0.14.1 (0.14.2 should be released by May 5th, 2017), where we break down those large sum calls into many smaller ones explicitly in the graph.
Use the distributed scheduler, which handles writing data to disk more efficiently.
from dask.distributed import Client
client = Client(processes=False) # create a local cluster in this process
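As a sketch of the first suggestion, with hypothetical chunk sizes: shrinking the chunks along the 10,000-long non-contracting axis shrinks each output tile and the set of intermediaries that must be alive at once.

import dask.array as da

# 2,500-row chunks along the non-contracting axis,
# instead of the original (10**4, 10**4) chunking
x = da.ones((10**4, 8 * 10**5), chunks=(2500, 10**4))
y = x.dot(x.T)  # each output chunk is now 2,500 x 2,500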
I am currently using version 1.9.2 of numpy, which I believe is the latest publicly available one. I wish to quickly apply the central difference method to an n-dimensional array, specifying an axis over which to perform the calculation.
My first thought was to use numpy.diff, which allows axis specification; however, this returns a forward difference rather than a central difference and has limited functionality.
I understand the following method works using numpy.gradient:

import numpy

num_vectors = 10  # number of 3-vectors in the 2D array
vectorarray = numpy.empty((num_vectors, 3))
vectorarray[0] = [4, 5, 6]
vectorarray[1] = [1, 4, 4]
vectorarray[2] = [8, 8, 1]  # add some arbitrary data for illustrative purposes
c1, c2 = numpy.gradient(vectorarray)
So c1 stores the useful information that I require. The problem is that I also have to generate c2, and since I want to do this sort of calculation on arrays with many dimensions, I will incur a time loss generating all this useless data.
Is there any other method by which I can achieve this same result without the redundancy, preferably without resorting to nested for loops?
You can read Numpy's gradient code at https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L1119 and use the algorithm therein.
You can also copy it and change for i, axis in enumerate(axes): to for i, axis in [[0,0]]: so that it only runs once, but keep in mind that your modified code may fall under Numpy's license.
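For reference, here is a minimal sketch of the per-axis algorithm (second-order central differences in the interior, one-sided differences at the two edges), written so that only the requested axis is computed. Note that later NumPy releases (1.11 and up) added an axis argument to numpy.gradient directly, which makes this workaround unnecessary:

import numpy as np

def central_diff(a, axis=0):
    '''Central difference along a single axis: second-order in the interior,
    one-sided at the two boundary slices (mirroring numpy.gradient).'''
    a = np.asarray(a, dtype=float)
    out = np.empty_like(a)

    def sl(s):
        idx = [slice(None)] * a.ndim
        idx[axis] = s
        return tuple(idx)

    out[sl(slice(1, -1))] = (a[sl(slice(2, None))] - a[sl(slice(None, -2))]) / 2.0
    out[sl(0)] = a[sl(1)] - a[sl(0)]     # forward difference at the leading edge
    out[sl(-1)] = a[sl(-1)] - a[sl(-2)]  # backward difference at the trailing edge
    return out

# Matches the axis-0 output of numpy.gradient on a 2D array
vectorarray = np.random.random((10, 3))
assert np.allclose(central_diff(vectorarray, axis=0), np.gradient(vectorarray)[0])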