I need to perform operations on very large arrays (several million entries), with the cumulative size of these arrays close to the available memory.
I understand that when doing a naive operation with numpy like a = a*3 + b - c**2, several temporary arrays are created and therefore occupy additional memory.
Since I plan to work at the limit of memory occupancy, I'm afraid this simple approach won't work, so I'd like to start my development with the right approach.
I know that packages like numba or pythran can help improve performance when manipulating arrays, but it is not clear to me whether they can automatically handle in-place operations and avoid temporary objects.
As a simple example, here's one function I'll have to use on large arrays:
import numpy as np

def find_bins(a, indices):
    global offset, width, nstep
    i = (a - offset) * nstep / width    # fractional bin position
    i = np.where(i < 0, 0, i)           # clamp below
    i = np.where(i >= nstep, nstep, i)  # clamp above
    indices[:] = i.astype(int)          # write the bin indices in place
So something that mixes arithmetic operations and calls to numpy functions.
How easy would it be to write such functions using numba or pythran (or something else)?
What would be the pros and cons in each case?
Thanks for any hint!
PS: I know about numexpr, but I'm not sure it is convenient or well suited to functions more complex than a single arithmetic expression.
You can avoid the temporary arrays by using numexpr. For example:
import numexpr
numexpr.evaluate("a+b*c", out=a)  # result is written block by block directly into a
This could help you avoid the temporary variables; for more background you can refer to High Performance Python by M. Gorelick and I. Ozsvald.
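For a function like the one in the question, a rough sketch of what that could look like (find_bins_ne and the explicit parameters are mine, not from the question; numexpr's where() stands in for np.where, and a is assumed to be a float array):

import numpy as np
import numexpr as ne

def find_bins_ne(a, indices, offset, width, nstep):
    # one reusable buffer instead of a chain of full-size temporaries
    buf = np.empty_like(a)
    ne.evaluate("(a - offset) * nstep / width", out=buf)
    # clamp into [0, nstep] using numexpr's where()
    ne.evaluate("where(buf < 0, 0, where(buf >= nstep, nstep, buf))", out=buf)
    indices[:] = buf  # NumPy casts to the integer dtype of indices on assignment

numexpr evaluates each expression block by block into the out buffer, so apart from that single buffer no array-sized temporaries are materialised.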
Pythran avoids many temporary arrays by design. For the simple expression you're pointing at, that would be
#pythran export find_bins(float[], int[], float, float, int)
import numpy as np

def find_bins(a, indices, offset, width, nstep):
    i = (a - offset) * nstep / width
    i = np.where(i < 0, 0, i)
    i = np.where(i >= nstep, nstep, i)
    indices[:] = i.astype(int)
This both avoids the temporaries and speeds up the computation.
Note that you could use the np.clip function here; it is supported by Pythran as well.
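For illustration, assuming np.clip is indeed available in your Pythran version, the clipped variant could look roughly like this (my sketch, not spelled out in the answer):

#pythran export find_bins(float[], int[], float, float, int)
import numpy as np

def find_bins(a, indices, offset, width, nstep):
    i = (a - offset) * nstep / width
    # np.clip replaces the two np.where calls, clamping into [0, nstep]
    indices[:] = np.clip(i, 0, nstep).astype(int)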
I am looking to parallelise a function which takes multiple 1-dimensional ranges (which are of the form np.linspace(x,y,t)) of numerical input values (this is variable, but let's say it takes five), creates a mesh out of these ranges, and then evaluates some (5-dimensional) cost function over this mesh. In its current form it looks something like this:
import itertools

def func_5d(a, b, c, d, e):
    return a + b + c + d + e

def range_search(a_range, b_range, c_range, d_range, e_range):
    mesh = itertools.product(a_range, b_range, c_range, d_range, e_range)
    # unpack each 5-tuple into the cost function, keeping the point alongside its value
    func_eval = map(lambda x: (func_5d(*x), x), mesh)
    return func_eval
So, here I would be looking to parallelise the function range_search using dask. Ideally, this would be done by creating a dask mesh, which could then be chunked and mapped through to our cost function using either multi-threading or multi-core processing.
Looking through the dask documentation, it does not appear that dask.array contains any suitable mechanism to achieve this. There is a dask.array.meshgrid function, extended from the numpy library, but it does not support chunking. Additionally, dask.array does not seem to contain a parallelised map function. There is one in dask.bag, but the documentation seems to suggest that dask.bag is intended only for preliminary processing of raw data (in formats such as CSV, JSON, etc.). dask.bag objects do also have a product() method which seems to imitate itertools.product; however, it only takes one other dask.bag object as an argument, so meshing 5 arrays requires this method call to be stacked (4 times), which aside from being hideously ugly is also inefficient when the number of inputs is variable.
From here, I don't really know where to go. I have worked through the Jupyter notebooks that the dask team have put together, but they do not seem to hold an answer to my question. Any suggestions on the best approach to parallelising functions of the above form would be much appreciated.
I would use NumPy-style broadcasting (indexing with None, i.e. np.newaxis) for this:
a[:, None, None] + b[None, :, None] + c[None, None, :]
You will want to make sure that your input vectors are chunked finely enough that the products of them will still fit comfortably in memory.
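A minimal sketch of that approach with dask.array (the sizes, chunking and the min() reduction are made-up placeholders; extending to five inputs follows the same pattern):

import numpy as np
import dask.array as da

# chunked 1-D inputs; pick chunks small enough that each broadcast block fits in memory
a = da.from_array(np.linspace(0.0, 1.0, 1000), chunks=100)
b = da.from_array(np.linspace(0.0, 1.0, 1000), chunks=100)
c = da.from_array(np.linspace(0.0, 1.0, 1000), chunks=100)

# broadcasting builds the 3-D "mesh" lazily, block by block
cost = a[:, None, None] + b[None, :, None] + c[None, None, :]

best = cost.min().compute()  # evaluated in parallel by dask's scheduler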
Are numpy array functions, such as x.max(), buffered (i.e. cached) when applied several times?
So should one write:
bincount=np.apply_along_axis(lambda x: np.bincount(x, minlength=data.max()+1), axis=0, arr=data)
or better
data_max=data.max()+1
bincount=np.apply_along_axis(lambda x: np.bincount(x, minlength=data_max), axis=0, arr=data)
where data is e.g.
data=np.array([[1,2,5,4,8,7,8,9,14,8,14,5,2,1],
[5,8,7,13,7,8,9,21,5,7,9,24,3,2]])
or of course even much larger
After the question update, it seems that you are asking whether numpy implements some form of caching of its results. While there is no general answer to this question, for a method like ndarray.max it is clear that no caching is done.
How can we know that without looking at the implementation? Consider that a caching scheme must resolve two problems:
find a place to store the cached result(s);
have a strategy to invalidate the cache once it no longer applies.
Although the first issue is non-trivial, the second one is the real killer. Not only can a numpy array be changed at any time, but the contents of the array can be shared by many objects. Additionally, C code can obtain the address of the internal buffers, and implement its own modifications to the underlying memory. Caching results would effectively disable many interesting uses of numpy.
You can consider numpy as a low-level library that doesn't concern itself with optimizations of that nature. If caching is needed, it should be implemented at a higher level, such as shown in your second example.
As Slater Tyranus pointed out, only a benchmark will show any results:
import numpy as np
import timeit
def func_a(data):
    return np.apply_along_axis(lambda x: np.bincount(x, minlength=data.max()+1), axis=0, arr=data)

def func_b(data):
    data_max = data.max() + 1
    return np.apply_along_axis(lambda x: np.bincount(x, minlength=data_max), axis=0, arr=data)
setup = '''import numpy as np
data=np.array([[1,2,5,4,8,7,8,9,14,8,14,5,2,1],
[5,8,7,13,7,8,9,21,5,7,9,24,3,2]])
from __main__ import func_a, func_b'''
min(timeit.Timer('func_a(data)', setup=setup).repeat(100,100))
0.02922797203063965
min(timeit.Timer('func_b(data)', setup=setup).repeat(100,100))
0.018524169921875
I also tested with much larger data. Overall, one can say that it pays off to calculate data_max = data.max() + 1 beforehand; with much bigger arrays the discrepancy gets even larger.
I'm doing some scientific computing in Python with a lot of geometric calculations, and I ran across a significant difference between using numpy versus the standard math library.
>>> x = timeit.Timer('v = np.arccos(a)', 'import numpy as np; a = 0.6')
>>> x.timeit(100000)
0.15387153439223766
>>> y = timeit.Timer('v = math.acos(a)', 'import math; a = 0.6')
>>> y.timeit(100000)
0.012333301827311516
That's more than a 10x speedup! I'm using numpy for almost all standard math functions, and I just assumed it was optimized and at least as fast as math. For long enough vectors, numpy.arccos() will eventually win vs. looping with math.acos(), but since I only use the scalar case, is there any downside to using math.acos(), math.asin(), math.atan() across the board, instead of the numpy versions?
Using the functions from the math module for scalars is perfectly fine. The numpy.arccos function is likely to be slower due to
conversion to an array (and a C data type)
C function call overhead
conversion of the result back to a python type
If this difference in performance is important for your problem, you should check if you really can't use array operations. As user2357112 said in the comments, arrays are what numpy really is great at.
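If you want to see where the trade-off flips, a rough, machine-dependent check along these lines (my own sketch, not benchmark data from the answer) compares the two on an array:

import math
import timeit
import numpy as np

a = np.random.uniform(-1.0, 1.0, 1000000)

t_numpy = timeit.timeit(lambda: np.arccos(a), number=10)
t_math = timeit.timeit(lambda: [math.acos(v) for v in a], number=10)
print(t_numpy, t_math)  # the vectorised call is typically far faster than the Python-level loop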
Good day, I'm writing a Python module for some numeric work. Since there's a lot of stuff going on, I've been spending the last few days optimizing code to improve calculation times.
However, I have a question concerning Numba.
Basically, I have a class with some fields which are numpy arrays, which I initialize in the following way:
def __init__(self):
    a = numpy.arange(0, self.max_i, 1)
    self.vibr_energy = self.calculate_vibr_energy(a)

def calculate_vibr_energy(self, i):
    return numpy.exp(-self.harmonic * i - self.anharmonic * (i ** 2))
So, the code is vectorized, and using Numba's JIT results in some improvement. However, sometimes I need to access the calculate_vibr_energy function from outside the class, and pass a single integer instead of an array in place of i.
As far as I understand, if I use Numba's JIT on the calculate_vibr_energy, it will have to always take an array as an argument.
So, which of the following options is better:
1) Create a new function calculate_vibr_energy_single(i), which will only take a single integer number, and use Numba on it too
2) Replace all usages of the function that are similar to this one:
myclass.calculate_vibr_energy(1)
with this:
tmp = np.array([1])
myclass.calculate_vibr_energy(tmp)[0]
Or are there other, more efficient (or at least more Pythonic) ways of doing that?
I have only played a little with numba so far, so I may be mistaken, but as far as I understand it, using the "autojit" decorator should give you functions that can take arguments of any type.
See e.g. http://numba.pydata.org/numba-doc/dev/pythonstuff.html
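Note that autojit belongs to older Numba releases. As an illustration only (the function name and constants below are made up, not the asker's class), a @vectorize ufunc in current Numba is one way to get a single compiled function that accepts either a scalar or an array:

import math
import numpy as np
from numba import vectorize

# a ufunc version of the energy formula; ufuncs broadcast over scalars and arrays alike
@vectorize(["float64(float64, float64, float64)"])
def vibr_energy(harmonic, anharmonic, i):
    return math.exp(-harmonic * i - anharmonic * i ** 2)

single = vibr_energy(0.5, 0.01, 3.0)              # scalar call
table = vibr_energy(0.5, 0.01, np.arange(10.0))   # array call, same function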
I have to iterate over all items in a two-dimensional array of integers and change each value (according to some rule; the rule itself is not important).
I'm surprised how significant the performance difference is between the Python runtime and the C# or Java runtimes. Did I write totally wrong Python code (v2.7.2)?
import numpy
a = numpy.ndarray((5000, 5000), dtype=numpy.int32)
for x in numpy.nditer(a.T):
    x = 123
>python -m timeit -n 2 -r 2 -s "import numpy; a = numpy.ndarray((5000,5000), dtype=numpy.int32)" "for x in numpy.nditer(a.T):" " x = 123"
2 loops, best of 2: 4.34 sec per loop
For example, the equivalent C# code takes only about 50 ms, i.e. Python is almost 100 times slower (assume the matrix variable is already initialized):
for (y = 0; y < 5000; y++)
    for (x = 0; x < 5000; x++)
        matrix[y][x] = 123;
Yep! Iterating through numpy arrays in python is slow. (Slower than iterating through a python list, as well.)
Typically, you avoid iterating through them directly.
If you can give us an example of the rule you're changing things based on, there's a good chance that it's easy to vectorize.
As a toy example:
import numpy as np
x = np.linspace(0, 8*np.pi, 100)
y = np.cos(x)
x[y > 0] = 100
However, in many cases you have to iterate, either due to the algorithm (e.g. finite difference methods) or to lessen the memory cost of temporary arrays.
In that case, have a look at Cython, Weave, or something similar.
The example you gave was presumably meant to set all items of a two-dimensional NumPy array to 123. This can be done efficiently like this:
a.fill(123)
or
a[:] = 123
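If you want to convince yourself of the gap, a rough, machine-dependent comparison (my own sketch, not from the answer; the loop uses the corrected readwrite form discussed further down) could look like this:

import numpy as np
import timeit

a = np.zeros((1000, 1000), dtype=np.int32)

def fill_loop():
    # element-by-element Python-level loop that actually writes
    for x in np.nditer(a, op_flags=[["readwrite"]]):
        x[...] = 123

def fill_vectorised():
    a[:] = 123  # single C-level pass over the buffer

print(timeit.timeit(fill_loop, number=1))
print(timeit.timeit(fill_vectorised, number=1))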
Python is a much more dynamic language than C or C#. The main reason why the loop is so slow is that on every pass, the CPython interpreter is doing some extra work that wastes time: specifically, it is binding the name x with the next object from the iterator, then when it evaluates the assignment it has to look up the name x again.
As @Sven Marnach noted, you can call the ndarray.fill() method and it is fast. That method is compiled C (or maybe Fortran), and it will simply loop over the addresses of the numpy array's data buffer and fill in the values. Much less dynamic than Python, which is good for this simple case.
But now consider PyPy. Once you run your program under PyPy, a JIT analyzes what your code is actually doing. In this example, it notes that the name x isn't used for anything but the assignment, and it can optimize away binding the name. This example should be one that PyPy speeds up tremendously; likely PyPy will be ten times faster than plain Python (so only one-tenth as fast as C, rather than 1/100 as fast).
http://pypy.org
As I understand it, PyPy won't be working with Numpy for a while yet, so you can't just run your existing Numpy code under PyPy yet. But the day is coming.
I'm excited about PyPy. It offers the hope that we can write in a very high-level language (Python) and yet get nearly the performance of writing things in "portable assembly language" (C). For examples like this one, Numpy might even beat the performance of naive C code, by using SIMD instructions from the CPU (SSE2, NEON, or whatever). For this example, with SIMD, you could set four integers to 123 on each loop iteration, and that would be faster than a plain C loop. (Unless the C compiler also used a SIMD optimization! Which, come to think of it, is likely in this case. So we are back to "nearly the speed of C" rather than faster for this example. But we can imagine trickier cases that the C compiler isn't smart enough to optimize, where a future PyPy might.)
But never mind PyPy for now. If you will be working with Numpy, it is a good idea to learn all the functions like ndarray.fill() that are there to speed up your code.
C++ emphasizes machine time over programmer time.
Python emphasizes programmer time over machine time.
PyPy is a Python implementation written in Python, and it has the beginnings of numpy support; you might try that. PyPy has a nice JIT that makes things quite fast.
You could also try Cython, which allows you to translate a dialect of Python to C and compile the C into a Python C extension module; this lets you continue using CPython for most of your code while still getting a bit of a speedup. However, in the one microbenchmark where I compared PyPy and Cython, PyPy was quite a bit faster than Cython.
Cython uses a highly Python-like syntax, but it allows you to fairly freely intermix Python datatypes with C datatypes. If you redo your hotspots with C datatypes, it should be pretty fast. Continuing to use Python datatypes is sped up by Cython too, but not as much.
The nditer code does not assign a value to the elements of a. This doesn't affect the timing issue, but I mention it because it should not be taken as a good use of nditer.
A correct version is:
for i in np.nditer(a, op_flags=[["readwrite"]]):
    i[...] = 123
The [...] is needed to assign through the reference to the loop value, which is a 0-d array (shape ()); a plain i = 123 would just rebind the name.
There's no point in using a.T, since it's the values of the base array a that get changed.
I agree that the proper way of doing this assignment is a[:] = 123.
If you need to do operations on a multidimensional array that depend on the values of the array but not on their positions inside the array, then .itemset() is about 5 times faster than nditer for me.
So instead of doing something like
import numpy as np

image = np.random.random_sample((200, 200, 3))
with np.nditer(image, op_flags=['readwrite']) as it:
    for x in it:
        x[...] = x*4.5 if x < 0.2 else x
You can do this
image2 = np.random.random_sample((200, 200, 3))
for i in range(image2.size):
    x = image2.item(i)
    image2.itemset(i, x*4.5 if x < 0.2 else x)
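For completeness (not part of the original answer), this particular per-element rule can also be written without any Python-level loop at all, at the cost of a temporary array:

import numpy as np

image3 = np.random.random_sample((200, 200, 3))
# vectorised form of "multiply by 4.5 wherever the value is below 0.2"
image3 = np.where(image3 < 0.2, image3 * 4.5, image3)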