Truly vectorized routines in python?

Truly vectorized routines in python? - python

Are there really good methods in Python to vectorize matrix like data constructs/containers -operations? What are the according data constructs used?
(I could observe and read that pandas and numpy element-wise operations using vectorize or applymap (may also be the case of apply/apply along axis for rows/columns) are not much of a speed progress compared to for loops.
Given that when trying to use them, you have sometimes to mess with the specificities of the datatypes when it is usually a little bit easier in for loops, what are the benefits? Readability?)
Are there ways to achieve a gap of performance similar to what happens in Matlab when comparing for loops and vectorized operations?
(note it is not to bash numpy or pandas, these are great, whole matrix operations are ok, it is just that when you have to do element-wise operations, it becomes slow).
EDIT to explain the context:
I was only wondering because I received more than once answers mentionning the fact that apply and so on are actually similar to for loops. That's why I was wondering if there were similar functions implemented in such way that it would perform better. The actual problems were varied. They just had to be element-wise, actually, not "doing the sum, product, whatever of the whole matrix". I did a lot of comparisons with differential outputs sometimes based on other matrices, so I had to use complex functions for this. But since the matrices are huge and the implementation depended on "for loop like" mechanisms, in the end I felt that my program would not work well on a more important dataset. Hence my question. But I was not looking for review, only knowledge.

You need to provide a specific example.
Normal per-element MATLAB or Python functions cannot be vectorized in general. The whole point of vectorizing, in both MATLAB and Python, is to off-load the operation onto the much faster underlying C or Fortran libraries that are designed to work on arrays of uniform data. This cannot be done on functions that operate on scalars, either in MATLAB or Python.
For functions that operate on arrays or matrices as a whole (such as mathematical operators, sum, square, etc), MATLAB and Python behave the same. In fact they use most of the same underlying C and Fortran libraries to do their calculations.
So you need to show the actual operation you want to do, and then we can see if there is a way to vectorize it.
If it is working code and you just want to improve its performance, then Code Review stack exchange site is probably a better choice.

Related

How(/if) to use dask to transpose distributed 3D numpy arrays?

My problem is to perform 3 matrix multiplications on a 3D numpy array A too large to fit in a single processor. In tensorial form I want A_ijk B_km C_jn D_ip (B, C, and D can all fit in memory). I want to know if dask is appropriate for this task (or if another tool might be more suited).
I believe the best approach is to split this operation into each multiplication, and make sure that they are all local. This link has a really useful diagram that summarises what I'm talking about http://www.2decomp.org/1d_mode.html.
In more detail: First, to do A_ijk B_km, I should distribute A over the first two axes, and perform the matrix multiplication over each pencil locally (the first step in the diagram).
Then, I need to transpose the array, making the j axis local to each processor (and splitting over the k (now m) axis), to then perform the next multiplication. (So going from the first to the second step in the diagram). This is where I wonder if dask could help.
I'm aware that this can be done in principle using mpi4py, but the steps are pretty non-trivial, whereas dask arrays have helpful rechunk and transpose methods, which feel relevant to this application.
Does this seem like something well-suited to dask?
If not, is anyone aware of any python libraries that can perform these steps? I know that fftw has routines for doing just this, but I don't know how to write the C-code necessary, or how to get it to interface with python and numpy.
Thanks for any help.

For anyone else in the future, mpi4py does have a transpose method. But it's called Alltoall/Alltoallv. It's not explained in the documentation or tutorial on mpi4py. I found out about it at another tutorial: https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:mpi4py.

Dask implements einsum, which may be what you are after, and there is, of course matmul, if you want to write out the operation. So long as your large matrix A is a Dask array, with reasonable chunk sizes, Dask will parcel out your work without running out of memory.

Can whole line NumpyArrays indexing operations be cythonized?

In the Cython documentation under Efficient Indexing, on the gotcha part it says that:
This efficient indexing only affects certain index operations, namely
those with exactly ndim number of typed integer indices.
Does this mean that operations like
f[:, w] = something
are not optimized?

It probably meant "optimized [compared to pure Python code]". There are different kinds of slicings and most of them are already really fast in Python, there just is not much you can speed up. For example if you use f[:,w] you'll get a view of the array f. It involves a bit of overhead because a "view" has to be created but it's really fast already because it (excluding certain advanced indexing operations) just a memoryview.
However what Cython can speed up significantly is: accessing single elements of an array. That is a really inefficient operation in Python code because the element has to be "boxed as Python object" when accessed. Cython can avoid this "boxing", when "exactly ndim number of typed integer indices" are used.
So it's not like f[:,w] isn't optimized. It is already optimized by numpy. Cython can't improve (much) there.

Is there a more efficient method to process large amounts of data than through arrays?

Hej there, I am writing a data aquisition and analysis software to a physical measurement set up with Python. In the process I gather massive amounts of data points (easily in the order of 1.000.000 or more) which I subsequently will analyze. So far I am using arrays of float numbers, which in principle do the job.
However, I am getting strange effects on the aquired data as I use more and more data points per measurement, which makes me wonder wether the handling of the arrays is so inefficient, that writing into them makes for a significant time delay in the data aquisition loop.
Is that a possibility? Do you have any suggestions about how to improve the handling time in the writing process (it is a matter of microseconds) or is that not a possible influence and I need to look somewhere else?
Thanks in advance!

Do you mean lists? You can use NumPy to handle numerical arrays efficient and performant.
From the NumyPy website:
First of all, they are great for performing calculation relying
heavily on mathematical and numerical operations. They can work
natively with matrices and arrays, perform operations on them, find
eigenvectors, compute integrals, solve differential equations.
NumPy’s array class (which is used to implement the matrix class) is
implemented with speed in mind, so accessing NumPy arrays is faster
than accessing Python lists. Further, NumPy implements an array
language, so that most loops are not needed.

Speeding up loops over a Numpy array

In my code I have for loop that indexes over a multidimensional numpy array and does some operation using the sub-array that is obtained at each iteration. It looks like this
for sub in Arr:
#do stuff using sub
Now the stuff that is done using sub is fully vectorized, so it should be efficient. On the other hand this loop iterates about ~10^5 times and is the bottleneck. Do you think I will get an improvement by offloading this part to C. I am somewhat reluctant to do so because the do stuff using sub uses broadcasting, slicing, smart-indexing tricks that would be tedious to write in plain C. I would also welcome thoughts and suggestions about how to deal with broadcasting, slicing, smart-indexing when offloading computation to C.

If you can't 'vectorize' the entire operation and looping is indeed the bottleneck, then I highly recommend using Cython. I've been dabbling around with it recently and it is straightforward to work with and has a decent interface with numpy. For something like a langevin integrator I saw a 115x speedup over a decent implementation in numpy. See the documentation here:
http://docs.cython.org/src/tutorial/numpy.html
and I also recommend looking at the following paper
You may see satisfactory speedups by just typing the input array and the loop counter, but if you want to leverage the full potential of cython, then you are going to have to hardcode the equivalent broadcasting.

San you can take a look at scipy.weave. You can use scipy.weave.blitz to transparently translate your expression into C++ code and run it. It will handle slicing automatically and get rid of temporaries, but you claim that the body of your for loop does not create temporaries so your milage may vary.
However if you want to replace your entire for loop with something more efficient then you could make use of scipy.inline. The drawback is that you have to write C++ code. This should not be too hard because you can use Blitz++ syntax which is very close to numpy array expressions. Slicing is directly supported, broadcasting however is not.
There are two work arounds:
is to use the numpy-C api and use multi-dimensional iterators. They transparently handle broadcasting. However you are invoking the Numpy runtime so there might be some overhead. The other option, and possibly the simpler option is to use the usual matrix notation for broadcasting. Broadcast operations can be written as outer-products with vector of all ones. The good thing is that Blitz++ will not actually create this temporary broadcasted arrays in memory, it will figure out how to wrap it into an equivalent loop.
For the second option take a look at http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC88 for index place holders. As long as your matrix has less than 11 dimensions you are fine. This link shows how they can be used to form outer products http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC99 (search for outer products to go to the relevant part of the document).

Besides using Cython, you can write the bottle-neck part(s) in Fortran. Then use f2py to compile it to Python .pyd file.

Convolution of two functions in Python

I will have to implement a convolution of two functions in Python, but SciPy/Numpy appear to have functions only for the convolution of two arrays.
Before I try to implement this by using the the regular integration expression of convolution, I would like to ask if someone knows of an already available module that performs these operations.
Failing that, which of the several kinds of integration that SciPy provides is the best suited for this?
Thanks!

You could try to implement the Discrete Convolution if you need it point by point.

Yes, SciPy/Numpy is mostly concerned about arrays.
If you can tolerate an approximate solution, and your functions only operate over a range of value (not infinite) you can fill an array with the values and convolve the arrays.
If you want something more "correct" calculus-wise you would probably need a powerful solver (mathmatica, maple...)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.