I am trying to run indexing and assignment on a 3D netCDF array loaded into dask for multiprocessing purposes. At the moment Dask does not appear to support direct 3D indexing and assignment, so I have been trying to ravel the 3D array to 1D, perform the indexing and assignment, and then reshape the result back to 3D at the end. However, this process has proven so memory intensive that my dask workers, with 4GB of available memory each, are unable to complete the task.
The data I am using are the GridRAD set (http://gridrad.org/data.html; netCDF). Getting the data and saving the files is straightforward enough through XArray, but the data itself needs additional processing to be suitable for analysis. (The data source creators have written filtering routines that allow for such analysis using numpy here: http://gridrad.org/zip/gridrad_python_software.zip [provided to show what operations are being performed in dask].) I have converted this script from using numpy arrays to using dask arrays, namely by switching numpy functions (i.e. np.function) to dask.array functions (i.e. da.function), which share the same names and calling conventions.
I stumbled across this git issue while researching potential solutions: https://github.com/dask/dask/issues/3096. The issue seems to describe what I am trying to do, but the examples provided there are not descriptive enough for me to understand. I wrote a utility function to remove the .compute() call on the index and have tried to use the lambda approach detailed in the issue, but I do not know how to use it properly. Here is my current code:
import dask.array as da

def contains_grid0(in_struct, grid):
    lX = len(in_struct["X"])
    lY = len(in_struct["Y"])
    highest = lX * lY
    # If there is a grid, we can find the lowest index.
    if len(grid) > 0:
        return grid[0] <= highest
    return False

inan = da.flatnonzero(da.asarray(da.isnan(in_struct["ref"])))  # Find bins with NaNs
if contains_grid0(in_struct, inan):
    # Invalid Python: an assignment cannot appear inside a lambda.
    da.map_blocks(lambda block, idxs=None: idxs[block] = 0.0, block=in_struct["ref"], idxs=inan)
This does not work because you cannot assign in a lambda (I do not know how to take the above example and assign from it to test if this will work), but it establishes the proof of concept to what I am trying to accomplish.
Therefore, I am curious if anyone has had to index and assign on a 3D array using Dask and how you have succeeded in doing so in the most memory efficient manner.
Thanks!
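For what it's worth, here is a minimal sketch of one way to avoid the ravel/reshape round trip, assuming the goal is simply to replace NaN bins with 0.0. The small `ref` array is a hypothetical stand-in for in_struct["ref"], and the chunk sizes are arbitrary. It shows two options: da.where, which builds the result lazily chunk by chunk, and map_blocks with a named function (a named function can do the assignment that a lambda cannot).

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for in_struct["ref"]: a small 3D array with a few NaNs.
ref_np = np.arange(24, dtype=float).reshape(2, 3, 4)
ref_np[0, 1, 2] = np.nan
ref_np[1, 0, 0] = np.nan
ref = da.from_array(ref_np, chunks=(1, 3, 2))

# Option 1: build a new lazy array that is 0.0 where ref is NaN and the
# original value elsewhere. Nothing is raveled; each chunk is handled alone.
filled = da.where(da.isnan(ref), 0.0, ref)

# Option 2: apply a named function to every block. The function works on a
# copy of the NumPy block, so ordinary boolean-mask assignment is available.
def zero_nans(block):
    out = block.copy()
    out[np.isnan(out)] = 0.0
    return out

filled_blocks = ref.map_blocks(zero_nans, dtype=ref.dtype)

result = filled.compute()
result_blocks = filled_blocks.compute()
```

Both variables hold the same 3D array with the NaNs zeroed. Whether this fits the full GridRAD filtering routine depends on how the other index-based steps are written, which I have not checked.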
Related
I want to create a vector with the size 10^15 with numpy and fill it with random numbers, but I get the following error:
Maximum allowed dimension exceeded.
Would it help if I use MPI?
Thank you
The Message Passing Interface (MPI) is mainly used to do parallel computations across multiple machines (nodes). Large arrays can be split into smaller arrays and stored on different machines. However, while it's of course possible to distribute the data to different nodes, you should carefully think about the necessity of doing this for your particular task. Additionally, if you are able to split your array, you could also do this on one machine. If performance is not an issue, avoid parallel computing.
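To put numbers on this, here is a rough sketch, assuming 8-byte float64 values. It first computes the memory a 10^15-element vector would need, then shows the "split it on one machine" idea with a hypothetical helper, random_chunks, that yields the vector piece by piece so that only one chunk is ever in memory. The total of one million elements is just a small stand-in so the example runs quickly.

```python
import numpy as np

# Memory needed for 10**15 float64 values.
n = 10**15
bytes_needed = n * np.dtype(np.float64).itemsize  # 8 bytes per value
print(f"{bytes_needed / 1e15:.0f} PB")  # 8 PB

def random_chunks(total, chunk_size, seed=0):
    """Yield `total` random numbers in pieces of at most `chunk_size`."""
    rng = np.random.default_rng(seed)
    remaining = total
    while remaining > 0:
        size = min(chunk_size, remaining)
        yield rng.random(size)
        remaining -= size

# Process a (much smaller) vector chunk by chunk, here to get its mean.
total = 1_000_000
running_sum = 0.0
count = 0
for chunk in random_chunks(total, chunk_size=100_000):
    running_sum += chunk.sum()
    count += chunk.size
mean = running_sum / count
```

This only works when the computation can be expressed as a pass over chunks, which is the same condition MPI would need anyway.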
My problem is to perform 3 matrix multiplications on a 3D numpy array A too large to fit in a single processor. In tensorial form I want A_ijk B_km C_jn D_ip (B, C, and D can all fit in memory). I want to know if dask is appropriate for this task (or if another tool might be more suited).
I believe the best approach is to split this operation into each multiplication, and make sure that they are all local. This link has a really useful diagram that summarises what I'm talking about http://www.2decomp.org/1d_mode.html.
In more detail: First, to do A_ijk B_km, I should distribute A over the first two axes, and perform the matrix multiplication over each pencil locally (the first step in the diagram).
Then, I need to transpose the array, making the j axis local to each processor (and splitting over the k (now m) axis), to then perform the next multiplication. (So going from the first to the second step in the diagram). This is where I wonder if dask could help.
I'm aware that this can be done in principle using mpi4py, but the steps are pretty non-trivial, whereas dask arrays have helpful rechunk and transpose methods, which feel relevant to this application.
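The pencil steps described above can be sketched with dask roughly as follows. This is only an illustration under assumptions: the array sizes and chunk sizes are made up and tiny, and the output index order (m, n, p) is my choice since the question does not fix one. Each tensordot contracts over an axis that is held whole within a chunk, and rechunk plays the role of the transpose between steps.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
ni, nj, nk, nm, nn, n_p = 6, 8, 10, 4, 5, 3   # made-up sizes
A_np = rng.random((ni, nj, nk))
B_np = rng.random((nk, nm))
C_np = rng.random((nj, nn))
D_np = rng.random((ni, n_p))

B = da.from_array(B_np, chunks=B_np.shape)
C = da.from_array(C_np, chunks=C_np.shape)
D = da.from_array(D_np, chunks=D_np.shape)

# Step 1: split A over i and j, keep k whole in each chunk, contract over k.
A = da.from_array(A_np, chunks=(3, 4, -1))
AB = da.tensordot(A, B, axes=([2], [0]))       # shape (i, j, m)

# Step 2: make j local to each chunk (split over i and m), contract over j.
AB = AB.rechunk({0: 3, 1: -1, 2: 2})
ABC = da.tensordot(AB, C, axes=([1], [0]))     # shape (i, m, n)

# Step 3: make i local to each chunk, contract over i.
ABC = ABC.rechunk({0: -1, 1: 2, 2: 5})
out = da.tensordot(ABC, D, axes=([0], [0]))    # shape (m, n, p)

result = out.compute()
expected = np.einsum("ijk,km,jn,ip->mnp", A_np, B_np, C_np, D_np)
```

On a real problem the chunk sizes would have to be chosen so that a full pencil fits in worker memory; the sketch says nothing about how this performs compared with an MPI implementation.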
Does this seem like something well-suited to dask?
If not, is anyone aware of any python libraries that can perform these steps? I know that fftw has routines for doing just this, but I don't know how to write the C-code necessary, or how to get it to interface with python and numpy.
Thanks for any help.
For anyone else in the future: mpi4py does have a transpose method, but it is called Alltoall/Alltoallv. It's not explained in the documentation or tutorial on mpi4py. I found out about it in another tutorial: https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:mpi4py.
Dask implements einsum, which may be what you are after, and there is, of course, matmul, if you want to write out the operation. So long as your large matrix A is a Dask array with reasonable chunk sizes, Dask will parcel out your work without running out of memory.
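A minimal sketch of the einsum route, with made-up sizes and chunking and with (m, n, p) chosen as the output index order (an assumption, since the question leaves it open):

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(1)
A_np = rng.random((6, 8, 10))   # A_ijk
B_np = rng.random((10, 4))      # B_km
C_np = rng.random((8, 5))       # C_jn
D_np = rng.random((6, 3))       # D_ip

# Only A is chunked finely; B, C and D are small enough to be one chunk each
# along their second axis.
A = da.from_array(A_np, chunks=(3, 4, 5))
B = da.from_array(B_np, chunks=(5, 4))
C = da.from_array(C_np, chunks=(4, 5))
D = da.from_array(D_np, chunks=(3, 3))

# One lazy expression for A_ijk B_km C_jn D_ip, contracted over i, j and k.
out = da.einsum("ijk,km,jn,ip->mnp", A, B, C, D)

result = out.compute()
expected = np.einsum("ijk,km,jn,ip->mnp", A_np, B_np, C_np, D_np)
```

Writing it as a single einsum is compact, but it hands the choice of intermediate products to dask, so for a very large A it may be worth comparing against doing the three contractions one at a time.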
I’m exploring 3D interactive volume convolution with some simple stencils using dask right now.
Let me explain what I mean:
Assume that you have 3D data which you would like to process with a Sobel transform (for example, to get the L1 or L2 gradient).
Then you divide your input 3D image into subvolumes (with some overlapping boundaries; for a 3x3x3 Sobel stencil this requires +2 samples of overlap/padding).
Now let's assume that you create a delayed computation of the 3D Sobel transform on the entire 3D volume, but do not execute it yet.
And now the most important part:
I want to write a function which will extract some particular 2D section from the virtually transformed data.
And then finally let dask compute everything.
But what I need dask to do is not to compute the entire transform for me and then provide a section.
I need it to execute only those tasks which are needed to compute that particular 2D transformed image slice.
Do you think this is possible?
To explain it with an image, please consider this to be a 3D domain decomposition (it is taken from a DWT, but it works well as an illustration):
[Image: illustration of domain decomposition]
And assume that there is a function which computes the 3D transform of the entire volume using dask.
But what I would like to get, for example, is a 2D image of the transformed 3D data which consists of the LLL1, LLH1, HLH1, and HLL1 planes (essentially a single slice).
The important part is not to compute the whole subcubes, but to let dask somehow automatically track the dependencies in the compute graph and evaluate only those.
Please don't worry about compute vs. copy time.
Assume that it has a perfect ratio.
Let me know if more clarification is needed!
Thanks for your help!
I'm hearing a few questions. I'll answer each individually.
Can Dask track which tasks are required for a subset of outputs and only compute those?
Yes. Lazy Dask operations produce a dependency graph. In the case of dask.arrays this graph is per-chunk. If your output only depends on a subset of the graph then Dask will remove tasks that are not necessary. The in-depth docs for this are here and the cull optimization in particular.
As an example consider this 100,000 by 100,000 array
>>> x = da.random.random((100000, 100000), chunks=(1000, 1000))
And let's say that I add a couple of 1d slices from it
>>> y = x[5000, :] + x[:, 5000].T
The resulting optimized graph is only large enough to compute the output
>>> graph = y._optimize(y.dask, y._keys()) # you don't need to do this
>>> len(graph) # it happens automatically
301
And we can compute the result quite quickly:
In [8]: %time y.compute()
CPU times: user 3.18 s, sys: 120 ms, total: 3.3 s
Wall time: 936 ms
Out[8]:
array([ 1.59069994, 0.84731881, 1.86923216, ..., 0.45040813,
0.86290539, 0.91143427])
Now, this wasn't perfect. It did have to produce all of the 1000x1000 chunks that our two slices touched. But you can control the granularity there.
Short answer: Dask will automatically inspect the graph and only run those tasks that are necessary to evaluate the output. You don't need to do anything special to do this.
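To see this for a 3D volume, here is a small sketch. fake_transform is a hypothetical stand-in for an expensive per-block transform, and the `calls` list exists only to record which blocks were actually processed. The volume has 2 x 2 x 2 = 8 blocks, and a single 2D section along the first axis touches only four of them.

```python
import numpy as np
import dask.array as da

calls = []  # shape of every block the transform was applied to

def fake_transform(block):
    # Hypothetical stand-in for an expensive per-block transform.
    calls.append(block.shape)
    return block * 2.0

volume = da.ones((8, 8, 8), chunks=(4, 4, 4))      # 2 x 2 x 2 = 8 blocks
transformed = volume.map_blocks(fake_transform, dtype=volume.dtype)
section = transformed[0, :, :]                      # one lazy 2D slice

# dask may call the function on a dummy array while building the graph,
# so reset the record before the real computation.
calls.clear()
image = section.compute(scheduler="synchronous")
n_blocks_run = len(calls)
```

After the compute, n_blocks_run counts only the blocks that the section depends on, not all eight. A transform with overlapping boundaries would additionally need the neighbouring halo data as input, which this sketch leaves out.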
Is it a good idea to do overlapping array computations with dask.array?
Maybe. The relevant doc page is here on Overlapping Blocks with Ghost Cells. Dask.array has convenience functions to make this easy to write down. However it will create in-memory copies. Many people in your position find memcopy too slow. Dask generally doesn't support in-place computation so we can't be as efficient as proper MPI code. I'll leave the performance question here to you though.
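As a rough illustration of the convenience functions mentioned above, here is a sketch using map_overlap. The stencil is a simple periodic three-point central difference along axis 0 (my stand-in, not a Sobel filter), and the sizes and chunks are arbitrary. Each block is extended by one ghost row on either side along axis 0, the function runs on the extended block, and dask trims the ghost rows from the result.

```python
import numpy as np
import dask.array as da

def central_diff_axis0(block):
    # Three-point central difference along axis 0. np.roll wraps around at the
    # ends of the block, so the outermost rows are only correct if the block
    # itself is periodic; on an extended block they are the trimmed ghost rows.
    return (np.roll(block, -1, axis=0) - np.roll(block, 1, axis=0)) / 2.0

rng = np.random.default_rng(0)
full = rng.random((12, 6, 6))
x = da.from_array(full, chunks=(4, 3, 3))

# depth: one ghost row on each side along axis 0, none along the other axes.
# boundary="periodic": the ghost rows at the outer edges wrap around.
result = x.map_overlap(
    central_diff_axis0,
    depth={0: 1, 1: 0, 2: 0},
    boundary="periodic",
    dtype=full.dtype,
)

computed = result.compute()
expected = central_diff_axis0(full)   # same stencil on the whole array
```

The ghost rows are copies, which is the in-memory copying cost mentioned above.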
Not to detract from the nicely laid out answer from @MRocklin, but more to add to it.
I also regularly find myself needing to do things like edge detection and other image processing techniques on large scale array data. As Dask is a very nice library for constructing and exploring such computational workflows on large array data, I have put together some utility libraries for some common image processing techniques in a GitHub organization called dask-image. They have largely been designed to mimic SciPy's ndimage API.
As to using a Sobel operator with Dask, one can use this sobel function from dask-ndfilters (permissively licensed) to perform this operation on a Dask Array. It handles proper haloing on the blocks under the hood, returning a new Dask Array.
As SciPy's sobel function (and dask-ndfilters' sobel as well) operates on one dimension, one will need to map over each axis and stack to get the full Sobel operator result. That said, this is quite straightforward to do. Below is a brief snippet showing how to do this on a random Dask Array. Also included is taking a slice along the XZ-plane, though one could just as easily take any other slice or perform additional operations on the data.
Hope this helps. :)
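import dask.array as da
import dask_ndfilters as da_ndfilt

# Random 3D volume split into 4 x 4 x 4 chunks.
d = da.random.random((100, 120, 140), chunks=(25, 30, 35))
# Apply the 1D Sobel filter along each axis and stack the results.
ds = da.stack([da_ndfilt.sobel(d, axis=i) for i in range(d.ndim)])
# Take the XZ-plane at Y index 0 (axis 0 of ds indexes the filter direction).
dsp = ds[:, :, 0, :]
# Nothing is computed until this call.
asp = dsp.compute()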
I am currently using version 1.9.2 of numpy, which I believe is the latest publicly available one. I wish to apply the central difference method quickly to an n-dimensional array and specify an axis over which to perform the calculation.
My first thought was to use numpy.diff which allows axis specification, however this returns a right difference rather than a central difference and has limited functionality.
I understand the following method works using numpy.gradient
import numpy

num_vectors=10 #number of 3-vectors in the 2D array
vectorarray=numpy.empty((num_vectors,3))
vectorarray[0]=[4,5,6]
vectorarray[1]=[1,4,4]
vectorarray[2]=[8,8,1] #add some arbitrary data for illustrative purposes
c1,c2=numpy.gradient(vectorarray)
So c1 stores the useful information that I require. The problem is that I also have to generate c2; I want to do this sort of calculation with many-dimensional arrays, and I will lose time generating all this useless data.
Is there any other method by which I can achieve this same result without the redundancy? Preferably this also includes using nested for loops.
You can read Numpy's gradient code at https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L1119 and use the algorithm therein.
You can also copy it and change for i, axis in enumerate(axes): to for i, axis in [[0,0]]: so that it only runs once, but keep in mind that your modified code may fall under Numpy's license.
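As an alternative to copying the source, a short sketch assuming a newer NumPy than the one in the question (the axis keyword of numpy.gradient was added in 1.11): passing axis=0 computes only the derivative along that axis, so the unwanted c2 is never generated. The three rows are the illustrative data from the question.

```python
import numpy as np

vectorarray = np.array([[4.0, 5.0, 6.0],
                        [1.0, 4.0, 4.0],
                        [8.0, 8.0, 1.0]])

# Gradient along axis 0 only; no second array is produced.
c1 = np.gradient(vectorarray, axis=0)

# For comparison: the full call returns one array per axis.
c1_full, c2_full = np.gradient(vectorarray)
```

The middle row of c1 is the central difference, ([8, 8, 1] - [4, 5, 6]) / 2; the first and last rows use one-sided differences, as numpy.gradient does at array edges by default.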
Just a short question that I can't find the answer to before I head off for the day.
When I do something like this:
v1 = float_list_python = ... # <some list of floats>
v2 = float_array_NumPy = ... # <some numpy.ndarray of floats>
# I guess they don't have to be floats -
# but some object that also has a native
# object in C, so that numpy can just use
# that
If I want to multiply these vectors by a scalar, my understanding has always been that the Python list is a list of object references, so looping through the list to do the multiplication must first fetch the location of each float and then fetch the float itself, which is one of the reasons it's slow.
If I do the same thing in NumPy, then, well, I'm not sure what happens. There are a number of things I imagine could happen:
1. It splits the multiplication up across the cores.
2. It vectorises the multiplications (as well?).
The documentation I've found suggests that many of the primitives in numpy take advantage of the first option whenever they can (I don't have a computer on hand at the moment that I can test it on). And my intuition tells me that number 2 should happen whenever it's possible.
So my question is: if I create a NumPy array of Python objects, will it still at least perform operations on the list in parallel? I know that if you create an array of objects that have native C types, then it will actually create a contiguous array in memory of the actual objects, and that if you create a numpy array of Python objects it will create an array of references, but I don't see why this would rule out parallel operations on said list, and I cannot find anywhere that explicitly states that.
EDIT: I feel there's a bit of confusion over what I'm asking. I understand what vectorisation is, and I understand that it is a compiler optimisation and not something you necessarily program in (though aligning the data such that it's contiguous in memory is important). On the grounds of vectorisation, all I wanted to know was whether or not numpy uses it. If I do something like np_array1 * np_array2, does the underlying library call use vectorisation (presuming that dtype is a compatible type)?
For the splitting up over the cores, all I mean is: if I again do something like np_array1 * np_array2, but this time with dtype=object, would it divide that work up amongst the cores?
numpy is fast because it performs numeric operations like this in fast compiled C code. In contrast the list operation operates at the interpreted Python level (streamlined as much as possible with Python bytecodes etc).
A numpy array of numeric type stores those numbers in a data buffer. At least in the simple cases this is just a block of bytes that C code can step through efficiently. The array also has shape and strides information that allows multidimensional access.
When you multiply the array by a scalar, it, in effect, calls a C function titled something like 'multiply_array_by_scalar', which does the multiplication in fast compiled code. So this kind of numpy operation is fast (compared to Python list code) regardless of the number of cores or other multi-processing/threading enhancements.
Arrays of objects do not have any special speed advantage (compared to lists), at least not at this time.
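A small sketch of why that is, using a hypothetical Counted class whose only job is to count how often its Python-level __mul__ runs. With dtype=object, NumPy stores references and multiplies by calling each element's own method once per element; with a float dtype, it works directly on the raw 8-byte values in its buffer.

```python
import numpy as np

class Counted:
    """Hypothetical wrapper that counts calls to its __mul__ method."""
    calls = 0

    def __init__(self, value):
        self.value = value

    def __mul__(self, other):
        Counted.calls += 1
        return Counted(self.value * other)

# Object array: an array of references to Python objects.
objs = np.array([Counted(1.0), Counted(2.0), Counted(3.0)], dtype=object)
scaled_objs = objs * 2            # calls Counted.__mul__ once per element

# Numeric array: a contiguous buffer of float64 values.
floats = np.array([1.0, 2.0, 3.0])
scaled_floats = floats * 2        # handled entirely in compiled code
```

After this runs, Counted.calls equals the number of elements, which shows the object-array multiply going back through the interpreter for each item. The sketch does not measure speed; timing both cases on larger arrays would be the next step.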
Look at my answer to a question about creating an array of arrays, https://stackoverflow.com/a/28284526/901925
I had to use iteration to initialize the values.
Have you done any time experiments? For example, construct an array, say (1000,2). Use tolist() to create an equivalent list of lists. And make a similar array of objects, with each object being a (2,) array or list (how much work did that take?). Now do something simple like len(x) for each of those sub lists.
@hpaulj provided a good answer to your question. In general, from reading your question it occurred to me that you do not actually understand what "vectorization" does under the hood. This writeup is a pretty decent explanation of vectorization and how it enables faster computations - http://quantess.net/2013/09/30/vectorization-magic-for-your-computations/
With regards to point 1 - Distributing computations across multiple cores, this is not always the case with Numpy. However, there are libraries like numexpr that enable multithreaded, highly efficient Numpy array computations with support for several basic logical and arithmetic operators. Numexpr can be used to turbo charge critical computations when used in conjunction with Numpy as it avoids replicating large arrays in memory for vectorization routines (as is the case for Numpy) and can use all cores on your system to perform computations.