My problem is to perform three matrix multiplications on a 3D numpy array A that is too large to fit on a single processor. In tensor notation I want A_ijk B_km C_jn D_ip (B, C, and D can all fit in memory). I want to know whether dask is appropriate for this task (or whether another tool might be better suited).
I believe the best approach is to split this operation into the three separate multiplications and make sure each one is performed locally. This link has a really useful diagram that summarises what I'm talking about: http://www.2decomp.org/1d_mode.html.
In more detail: first, to do A_ijk B_km, I should distribute A over the first two axes and perform the matrix multiplication over each pencil locally (the first step in the diagram).
Then I need to transpose the array, making the j axis local to each processor (and splitting over the k, now m, axis), in order to perform the next multiplication (i.e. going from the first to the second step in the diagram). This is where I wonder if dask could help.
I'm aware that this can be done in principle using mpi4py, but the steps are pretty non-trivial, whereas dask arrays have helpful rechunk and transpose methods, which feel relevant to this application.
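To make the idea concrete, here is a rough sketch of how the first two steps might look with dask arrays (the sizes and chunk shapes below are made up, and in practice A would come from a file or delayed loaders rather than random data):

import dask.array as da
import numpy as np

ni, nj, nk = 1000, 1000, 1000
A = da.random.random((ni, nj, nk), chunks=(100, 100, nk))  # k kept whole in each chunk: "pencils" along k
B = np.random.random((nk, 50))

# Step 1: contract over k; each chunk already holds all of k, so this is local.
AB = da.tensordot(A, B, axes=([2], [0]))                   # shape (ni, nj, 50), indices i, j, m

# Step 2: the "transpose" of the decomposition, so that j becomes local for the next contraction.
AB = AB.rechunk((100, nj, 25))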
Does this seem like something well-suited to dask?
If not, is anyone aware of any Python libraries that can perform these steps? I know that FFTW has routines for doing just this, but I don't know how to write the necessary C code, or how to get it to interface with Python and numpy.
Thanks for any help.
For anyone else in the future: mpi4py does have a transpose operation, but it's called Alltoall/Alltoallv. It's not explained in the mpi4py documentation or tutorial. I found out about it from another tutorial: https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:mpi4py.
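For reference, a minimal Alltoall sketch (buffer sizes are arbitrary; the real transpose also needs the send buffer laid out so that the piece destined for rank p is contiguous):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

n = 4  # elements sent to each destination rank
sendbuf = np.arange(size * n, dtype='d') + 100 * rank
recvbuf = np.empty(size * n, dtype='d')

# Piece p of every rank's sendbuf ends up on rank p: a global "transpose" of the data layout.
comm.Alltoall(sendbuf, recvbuf)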
Dask implements einsum, which may be what you are after, and there is, of course, matmul if you want to write out the operation step by step. So long as your large array A is a Dask array with reasonable chunk sizes, Dask will parcel out your work without running out of memory.
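If I have read your index convention correctly, the whole contraction can be expressed in a single call along these lines (untested sketch; the output index order 'pnm' is an arbitrary choice, and A is assumed to be a chunked Dask array while B, C and D are plain in-memory numpy arrays):

import dask.array as da

# A_ijk B_km C_jn D_ip, contracting over k, j and i
result = da.einsum('ijk,km,jn,ip->pnm', A, B, C, D)
out = result.compute()  # or persist() / to_zarr(...) on a cluster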
I am trying to run indexing and assignment on a 3D netCDF array loaded into Dask for multiprocessing purposes. At the moment Dask does not appear to support direct 3D indexing and assignment, so I have been trying to ravel the 3D array to 1D, perform the indexing and assignment, then reshape the result back to 3D at the end. However, this process has proven to be highly memory intensive, to the point where my Dask workers with 4 GB of available memory are unable to complete the task.
The data I am using are the GridRAD set (http://gridrad.org/data.html; netCDF). Getting the data and saving the files is straightforward enough through xarray, but the data itself needs additional processing to be suitable for analysis (the data source creators have written numpy filtering routines that perform this processing: http://gridrad.org/zip/gridrad_python_software.zip, provided to show what operations are being performed in Dask). I have converted this script from numpy arrays to dask arrays, essentially just by switching numpy functions (i.e. np.function) to dask.array functions (i.e. da.function), which share the same names and calling conventions.
I stumbled across this GitHub issue while researching potential solutions: https://github.com/dask/dask/issues/3096. It seems to describe what I am trying to do, but the examples provided there are not descriptive enough for me to understand. I wrote a utility function to remove the .compute() call on the index and have tried to use the lambda approach detailed in the issue, but I do not know how to use it properly. Here is my current code:
def contains_grid0(in_struct, grid):
    lX = len(in_struct["X"])
    lY = len(in_struct["Y"])
    highest = lX * lY
    # If there is a grid, we can find the lowest index.
    if len(grid) > 0:
        return grid[0] <= highest
    return False

inan = da.flatnonzero(da.asarray(da.isnan(in_struct["ref"])))  # Find bins with NaNs
if contains_grid0(in_struct, inan):
    da.map_blocks(lambda block, idxs=None: idxs[block] = 0.0, block=in_struct["ref"], idxs=inan)
This does not work because you cannot assign inside a lambda (and I do not know how to adapt the example above to test whether this approach will work), but it establishes a proof of concept for what I am trying to accomplish.
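For reference, a functional way to express the same masked assignment, without mutating in place and without the ravel/reshape round trip, might look like the following sketch (in_struct is the same dict used above; untested):

import numpy as np
import dask.array as da

ref = da.asarray(in_struct["ref"])

# Functional equivalent of ref[da.isnan(ref)] = 0.0, evaluated chunk by chunk:
in_struct["ref"] = da.where(da.isnan(ref), 0.0, ref)

# Or, staying closer to the map_blocks idea (each block is replaced, not mutated):
# in_struct["ref"] = ref.map_blocks(lambda b: np.where(np.isnan(b), 0.0, b))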
Therefore, I am curious whether anyone has had to index and assign on a 3D array using Dask, and how you succeeded in doing so in the most memory-efficient manner.
Thanks!
Are there genuinely good methods in Python to vectorize operations on matrix-like data constructs/containers? And which data constructs are used for this?
(I have observed and read that element-wise operations in pandas and numpy using vectorize or applymap (and likely also apply/apply along axis for rows/columns) do not offer much of a speed improvement over for loops.
Given that, when trying to use them, you sometimes have to wrestle with datatype specifics that are usually a little easier to handle in for loops, what are the benefits? Readability?)
Are there ways to achieve a performance gap similar to what you see in MATLAB when comparing for loops and vectorized operations?
(Note that this is not meant to bash numpy or pandas; they are great, and whole-matrix operations are fine. It is just that element-wise operations on them become slow.)
EDIT to explain the context:
I was only wondering because I have received more than one answer mentioning that apply and the like are essentially for loops. That's why I was wondering whether similar functions exist that are implemented in a way that performs better. The actual problems were varied; they just had to be element-wise rather than "the sum, product, or whatever of the whole matrix". I did a lot of comparisons with differential outputs, sometimes based on other matrices, so I had to use complex functions for this. But since the matrices are huge and the implementation depended on for-loop-like mechanisms, in the end I felt that my program would not work well on a larger dataset. Hence my question. But I was not looking for a review, only for knowledge.
You need to provide a specific example.
Normal per-element MATLAB or Python functions cannot be vectorized in general. The whole point of vectorizing, in both MATLAB and Python, is to off-load the operation onto the much faster underlying C or Fortran libraries that are designed to work on arrays of uniform data. This cannot be done on functions that operate on scalars, either in MATLAB or Python.
For functions that operate on arrays or matrices as a whole (such as mathematical operators, sum, square, etc), MATLAB and Python behave the same. In fact they use most of the same underlying C and Fortran libraries to do their calculations.
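A toy illustration of that distinction (sketch with made-up data): the whole-array form dispatches to compiled code, while wrapping a per-element Python function, even with np.vectorize, still runs the function once per element at interpreter speed.

import numpy as np

a = np.random.random(1_000_000)

# Whole-array operation: one call into compiled code.
s1 = np.sum(a ** 2)

# Per-element Python function: np.vectorize is only a convenience wrapper, not a speedup.
f = lambda x: x ** 2 if x > 0.5 else 0.0
s2 = np.sum(np.vectorize(f)(a))

# Often the per-element logic can be rewritten as whole-array operations instead:
s3 = np.sum(np.where(a > 0.5, a ** 2, 0.0))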
So you need to show the actual operation you want to do, and then we can see if there is a way to vectorize it.
If it is working code and you just want to improve its performance, then Code Review stack exchange site is probably a better choice.
Just a short question that I can't find the answer to before I head off for the day.
When I do something like this:
v1 = float_list_python = ... # <some list of floats>
v2 = float_array_NumPy = ... # <some numpy.ndarray of floats>
# I guess they don't have to be floats -
# but some object that also has a native
# object in C, so that numpy can just use
# that
If I want to multiply these vectors by a scalar, my understanding has always been that the Python list is a list of object references, so looping through the list to do the multiplication must first fetch the locations of all the floats and then dereference them to do the arithmetic, which is one of the reasons it's slow.
If I do the same thing in NumPy, then, well, I'm not sure what happens. There are a number of things I imagine could happen:
It splits the multiplication up across the cores.
It vectorises the multiplications (as well?)
The documentation I've found suggests that many of the primitives in numpy take advantage of the first option there whenever they can (I don't have a computer on hand at the moment that I can test it on). And my intuition tells me that number 2 should happen whenever it's possible.
So my question is: if I create a NumPy array of Python objects, will it still at least perform operations on the list in parallel? I know that if you create an array of objects that have native C types, then it will actually create a contiguous array in memory of the actual values, and that if you create a numpy array of Python objects it will create an array of references, but I don't see why this would rule out parallel operations on said list, and I cannot find anywhere that explicitly states that.
EDIT: I feel there's a bit of confusion over what I'm asking. I understand what vectorisation is, and I understand that it is a compiler optimisation, not something you necessarily program in yourself (though aligning the data so that it's contiguous in memory is important). On the subject of vectorisation, all I wanted to know was whether or not numpy uses it. If I do something like np_array1 * np_array2, does the underlying library call use vectorisation (presuming that dtype is a compatible type)?
As for splitting the work over the cores, all I mean there is: if I again do something like np_array1 * np_array2, but this time with dtype=object, would it divide that work up amongst the cores?
numpy is fast because it performs numeric operations like this in fast compiled C code. In contrast the list operation operates at the interpreted Python level (streamlined as much as possible with Python bytecodes etc).
A numpy array of numeric type stores those numbers in a data buffer. At least in the simple cases this is just a block of bytes that C code can step through efficiently. The array also has shape and strides information that allows multidimensional access.
When you multiply the array by a scalar, it, in effect, calls a C function titled something like 'multiply_array_by_scalar', which does the multiplication in fast compiled code. So this kind of numpy operation is fast (compared to Python list code) regardless of the number of cores or other multi-processing/threading enhancements.
Arrays of objects do not have any special speed advantage (compared to lists), at least not at this time.
Look at my answer to a question about creating an array of arrays, https://stackoverflow.com/a/28284526/901925
I had to use iteration to initialize the values.
Have you done any time experiments? For example, construct an array, say (1000,2). Use tolist() to create an equivalent list of lists. And make a similar array of objects, with each object being a (2,) array or list (how much work did that take?). Now do something simple like len(x) for each of those sub lists.
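Something along these lines, for example (a sketch; the sizes are arbitrary):

import timeit
import numpy as np

arr = np.random.random((1000, 2))      # contiguous float64 buffer
lst = arr.tolist()                     # list of lists of Python floats
obj = np.empty(1000, dtype=object)     # array of references to (2,) arrays
for i, row in enumerate(arr):
    obj[i] = row

print(timeit.timeit(lambda: arr * 2.0, number=1000))                              # compiled loop
print(timeit.timeit(lambda: [[x * 2.0 for x in r] for r in lst], number=1000))    # Python-level loop
print(timeit.timeit(lambda: obj * 2.0, number=1000))                              # Python __mul__ per element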
@hpaulj provided a good answer to your question. In general, from reading your question it occurred to me that you do not actually understand what "vectorization" does under the hood. This writeup is a pretty decent explanation of vectorization and how it enables faster computations: http://quantess.net/2013/09/30/vectorization-magic-for-your-computations/
With regard to point 1, distributing computations across multiple cores: this is not generally the case with NumPy. However, there are libraries like numexpr that enable multithreaded, highly efficient NumPy array computations, with support for several basic logical and arithmetic operators. numexpr can be used to turbocharge critical computations when used in conjunction with NumPy, as it avoids replicating large arrays in memory for vectorization routines (as NumPy does) and can use all cores on your system to perform the computations.
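A minimal numexpr sketch (array sizes arbitrary):

import numpy as np
import numexpr as ne

a = np.random.random(10_000_000)
b = np.random.random(10_000_000)

# Plain NumPy materialises the intermediate a * b as a full temporary array.
c1 = a * b + 1.0

# numexpr compiles the expression, evaluates it in cache-sized blocks, and uses multiple threads.
c2 = ne.evaluate("a * b + 1.0")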
Hi there, I am writing data acquisition and analysis software for a physical measurement setup in Python. In the process I gather massive numbers of data points (easily on the order of 1,000,000 or more) which I subsequently analyze. So far I am using arrays of floats, which in principle do the job.
However, I am getting strange effects in the acquired data as I use more and more data points per measurement, which makes me wonder whether the handling of the arrays is so inefficient that writing into them introduces a significant time delay in the data acquisition loop.
Is that a possibility? Do you have any suggestions on how to reduce the time spent writing into the arrays (it is a matter of microseconds), or is that not a plausible influence and I need to look somewhere else?
Thanks in advance!
Do you mean lists? You can use NumPy to handle numerical arrays efficiently and performantly.
From the NumPy website:
First of all, they are great for performing calculation relying heavily on mathematical and numerical operations. They can work natively with matrices and arrays, perform operations on them, find eigenvectors, compute integrals, solve differential equations.
NumPy’s array class (which is used to implement the matrix class) is implemented with speed in mind, so accessing NumPy arrays is faster than accessing Python lists. Further, NumPy implements an array language, so that most loops are not needed.
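As a concrete sketch of what that means for an acquisition loop: preallocate the NumPy array once and write into it by index, rather than growing it inside the loop (read_sample below is only a stand-in for whatever your actual measurement call is):

import numpy as np

read_sample = lambda: 0.0                      # stand-in for the real acquisition call

n_samples = 1_000_000
data = np.empty(n_samples, dtype=np.float64)   # allocate once, outside the timing-critical loop

for i in range(n_samples):
    data[i] = read_sample()                    # indexed writes are cheap and constant-time

# By contrast, data = np.append(data, x) inside the loop copies the whole
# array on every iteration, so each write gets slower as the measurement grows.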
In my code I have a for loop that iterates over a multidimensional numpy array and does some operation using the sub-array obtained at each iteration. It looks like this:
for sub in Arr:
    # do stuff using sub
Now the stuff that is done using sub is fully vectorized, so it should be efficient. On the other hand, this loop iterates about 10^5 times and is the bottleneck. Do you think I will get an improvement by offloading this part to C? I am somewhat reluctant to do so because the "do stuff using sub" part uses broadcasting, slicing, and smart-indexing tricks that would be tedious to write in plain C. I would also welcome thoughts and suggestions about how to deal with broadcasting, slicing, and smart indexing when offloading computation to C.
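A toy version of the pattern, and the batched form it can sometimes collapse to when the per-sub operation broadcasts over a leading axis (the actual operation will of course differ):

import numpy as np

Arr = np.random.random((100_000, 8, 3))   # ~10^5 sub-arrays of shape (8, 3)
w = np.random.random(3)

# Looped version: vectorized inside, but 10^5 Python-level iterations outside.
out_loop = np.empty((100_000, 8))
for i, sub in enumerate(Arr):
    out_loop[i] = sub @ w

# Batched version: the same contraction applied to the whole stack at once.
out_batch = Arr @ w                       # matmul broadcasts over the leading axis

assert np.allclose(out_loop, out_batch)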
If you can't 'vectorize' the entire operation and looping is indeed the bottleneck, then I highly recommend using Cython. I've been dabbling with it recently, and it is straightforward to work with and has a decent interface with numpy. For something like a Langevin integrator I saw a 115x speedup over a decent implementation in numpy. See the documentation here:
http://docs.cython.org/src/tutorial/numpy.html
and I also recommend looking at the following paper
You may see satisfactory speedups just by typing the input array and the loop counter, but if you want to leverage the full potential of Cython, you are going to have to hand-code the equivalent of the broadcasting.
San, you can take a look at scipy.weave. You can use scipy.weave.blitz to transparently translate your expression into C++ code and run it. It will handle slicing automatically and get rid of temporaries, but you say that the body of your for loop does not create temporaries, so your mileage may vary.
However, if you want to replace your entire for loop with something more efficient, then you could make use of scipy.weave.inline. The drawback is that you have to write C++ code. This should not be too hard, because you can use Blitz++ syntax, which is very close to numpy array expressions. Slicing is directly supported; broadcasting, however, is not.
There are two workarounds:
One is to use the NumPy C API with multi-dimensional iterators. They handle broadcasting transparently; however, you are invoking the NumPy runtime, so there might be some overhead. The other, and possibly simpler, option is to use the usual matrix notation for broadcasting: broadcast operations can be written as outer products with a vector of all ones. The good thing is that Blitz++ will not actually create these temporary broadcast arrays in memory; it will figure out how to fold them into an equivalent loop.
For the second option, take a look at http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC88 for index placeholders. As long as your array has fewer than 11 dimensions you are fine. This link shows how they can be used to form outer products: http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC99 (search for "outer products" to go to the relevant part of the document).
Besides using Cython, you can write the bottleneck part(s) in Fortran, then use f2py to compile it into a Python extension module (a .pyd/.so file).