Previously, I asked a question about a relatively simple loop that Numba was failing to parallelize. A solution turned out to make all the loops explicit.
Now I need to do a simpler version of the same task: I have arrays alpha and beta of shape (m,n) and (b,m,n) respectively, and I want to compute the Frobenius product of alpha with each 2D slice of beta and find the slice of beta that maximizes this product. Previously, alpha had an additional, large first dimension, so that was the dimension I parallelized over; now I want to parallelize over the first dimension of beta, since the calculation becomes expensive when b > 1000.
If I naively modify the code that worked for the previous problem, I obtain:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_value_numba(alpha, beta):
    dot = np.zeros(beta.shape[0])
    for i in prange(beta.shape[0]):
        for j in prange(beta.shape[1]):
            for k in prange(beta.shape[2]):
                dot[i] += alpha[j, k] * beta[i, j, k]
    index = np.argmax(dot)
    value = dot[index]
    return value, index
But Numba doesn't like this for some reason and complains:
numba.core.errors.LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
scalar type memoryview(float64, 2d, C) given for non scalar argument #3
So instead, I tried
@njit(parallel=True)
def parallel_value_numba_2(alpha, beta):
    product = np.multiply(alpha, beta)
    dot1 = np.sum(product, axis=2)
    dot2 = np.sum(dot1, axis=1)
    index = np.argmax(dot2)
    value = dot2[index]
    return value, index
This compiles as long as you broadcast alpha to beta.shape before passing it to the function, and in principle Numba is capable of parallelizing the NumPy operations. But it runs painfully slowly, much slower than the serial, plain NumPy code
def einsum_value(alpha, beta):
    dot = np.einsum('kl,jkl->j', alpha, beta)
    index = np.argmax(dot)
    value = dot[index]
    return value, index
So, my current working code uses this last implementation, but this function is still bottlenecking the runtime and I'd like to speed it up. Can anyone convince Numba to parallelize this function with an appreciable speedup?
This is not exactly an answer with a solution, but code is hard to format in comments.
Numba generates different code depending on the arguments passed to the function. For instance, your code works with the following inputs:
>>> alpha = np.random.random((5, 4))
>>> beta = np.random.random((3, 5, 4))
>>> parallel_value_numba(alpha, beta)
(5.89447648574048, 0)
To diagnose the problem, it's necessary to have an example of the specific argument values that trigger it.
Reading the error message, it seems you are passing a memoryview object, but Numba may not have full support for it.
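As a hypothetical illustration (assuming the memoryview wraps an ordinary float array), converting it to a plain ndarray with np.asarray before calling the jitted function sidesteps the typing issue:

import numpy as np

# hypothetical setup: beta arrives as a memoryview rather than an ndarray
beta_view = memoryview(np.random.random((3, 5, 4)))
beta_arr = np.asarray(beta_view)          # zero-copy view as a regular ndarray
# parallel_value_numba(alpha, beta_arr)   # Numba now sees a plain array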
As a side comment, you don't need to use prange in every loop. It's normally enough to use it in the outer loop, as long as the number of expected iterations is larger than the number of cores in your machine.
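A minimal sketch of what that looks like for the function above (whether it actually beats the einsum version will depend on b and on your hardware):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_value_outer(alpha, beta):
    # parallelize only over the first axis of beta; the inner loops stay serial
    dot = np.zeros(beta.shape[0])
    for i in prange(beta.shape[0]):
        acc = 0.0
        for j in range(beta.shape[1]):
            for k in range(beta.shape[2]):
                acc += alpha[j, k] * beta[i, j, k]
        dot[i] = acc
    index = np.argmax(dot)
    return dot[index], index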
I'm porting a (mathematically complicated/involved but few operations) homebrew empirical distribution class from C++/MATLAB (I have both) to Python.
The file has some 1100 lines of code, including comments and test data, with an
if __name__ == "__main__":
block at the bottom of the file.
Line 83 has the function declaration: def cdf(self, x):
It compiled and ran fine, it's just very slow, so I want to compile it with @numba.jit(nopython=True) to make it run faster.
However, the compilation died on one of the earliest lines of the function (only comments in front of it), line 85 of the file: npts = len(x).
The message ends with :
[1] During: typing of argument at
C:\Users\kdalbey\Canopy\scripts\empDist.py (85)
--%<-----------------------------------------------------------------
File "Canopy\scripts\empDist.py", line 85
This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of <class '__main__.empDist'>
Note that I really did import numpy as np at the top of the file, but for clarity in this post I've tried to replace np with numpy below; I might have missed a few.
If I use npts=x.size, I get the same error message.
So I try to type x as:
@numba.jit(nopython=True)
def cdf(self, x: numpy.ndarray(dtype=numpy.float64)):
And I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\kdalbey\Canopy\scripts\empDist.py in <module>()
15 np.set_printoptions(precision=16)
16
---> 17 class empDist:
18 def __init__(self, xdata):
19 npts=len(xdata)
C:\Users\kdalbey\Canopy\scripts\empDist.py in empDist()
81
82 @numba.jit(nopython=True)
---> 83 def cdf(self, x: np.ndarray(dtype=np.float64)):
84 # compute the value of cdf at vector of points x
85 npts = x.size
TypeError: Required argument 'shape' (pos 1) not found
But I don't know how many elements the 1D numpy.ndarray has in advance (it's arbitrary)
I guessed that I might be able to do a
@numba.jit(nopython=True)
def cdf(self, x: numpy.ndarray(shape=(), dtype=numpy.float64)):
and it gets past that error only to go back to the
[1] During: typing of argument at
C:\Users\kdalbey\Canopy\scripts\empDist.py (85)
--%<-----------------------------------------------------------------
File "Canopy\scripts\empDist.py", line 85
This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of <class '__main__.empDist'>
And it's the same error if I do either npts = int(x.size) or npts = numpy.int32(x.size), so I figure the problem is with x.
Your approach is problematic because of multiple issues (as of numba version 0.46.0):
The numpy.ndarray(shape=(), dtype=numpy.float64) really tries to create a NumPy array. It doesn't matter that you used it as a type hint; it's still executed (and fails).
Instead of type hints you should use the more appropriate (for numba) signature in the jit. Or even better: Omit the signature entirely and just let numba figure it out. In most cases numba is better at it and it takes you much less effort (if you don't need to restrict the types).
You cannot jit a method in nopython mode. A better approach is to make a function and call it from your method.
So in your case:
import numba as nb

@nb.njit
def _cdf(x):
    # do something with x
    ...

class empDist:
    def cdf(self, x):
        result = _cdf(x)
        ...
Your example might be more complicated however that should give you a good place to start from. If you need to use instance attributes, then simply pass them along to _cdf (if numba supports them).
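A hedged sketch of that pattern (the attribute names xdata/weights and the simple step-function body are placeholders for illustration, not your actual empirical-distribution math):

import numpy as np
import numba as nb

@nb.njit
def _cdf(x, xdata, weights):
    # placeholder body: a plain ECDF evaluated at each point of x
    out = np.empty(x.size)
    for i in range(x.size):
        out[i] = np.sum(weights[xdata <= x[i]])
    return out

class empDist:
    def __init__(self, xdata):
        self.xdata = np.sort(np.asarray(xdata, dtype=np.float64))
        self.weights = np.full(self.xdata.size, 1.0 / self.xdata.size)

    def cdf(self, x):
        # instance attributes are handed to the jitted helper as plain arrays
        x = np.asarray(x, dtype=np.float64)
        return _cdf(x, self.xdata, self.weights)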
In general it's not really a good idea to try to use numba on everything. Numba has a very limited scope but where it's applicable it can be amazing.
In your case you said that it's slow. The first step then should be to profile your code and find out why it's slow and where. Then try to find out if you can wrap this bottleneck in a faster approach. Often the problem isn't the code itself but the algorithm/approach, so check whether it uses a sub-optimal approach. If it doesn't, and it's a numerically heavy part, then it might make sense to use numba - but be warned: often you don't really need numba at all, because you can get sufficient performance just by optimizing the NumPy parts.
OK... the problem was that it was a method (member function); I got that from MrFuppes. Isolating it in its own function that the method calls worked great (with almost no mods to the function that worked pre-Numba).
BTW, I will try to get approval to release/publish the empirical distribution code, but it'll be a ways off. I also might want to learn Cython and recode it for speed in Cython; the compilation takes O(seconds) on my machine because the operations are mathematically complicated/involved, but there aren't a lot of them from a flop-count perspective.

Selling points compared to sklearn.neighbors.kde: my empirical distribution is significantly faster (after/discounting the @numba.jit(nopython=True) compilation caching). Running in Canopy (with numba 0.36.2, so np.interp didn't benefit from numba) on Windows, building this empirical distribution took 5.72e-5 seconds, compared to 2.03e-4 seconds to fit the sklearn KDE, for 463 points. Moreover, it should scale quite well to very large numbers of points: apart from a quicksort, which is O(n log(n)), and an interpolation, which is O(n), the construction (and memory needed to store the object) cost is O(n^(1/3)) (with a significant coefficient on the O(n^(1/3))). It has "simple" analytical formulas for the PDF, CDF and inverse CDF, so the empirical distribution is a lot faster to evaluate too.

It has comparable/slightly better accuracy than the sklearn KDE for the Gaussian (using bandwidth = (maxx-minx)*0.015, which I copied out of someone else's code, someone who is presumably better with the sklearn KDE than I am; obviously the accuracy of a KDE depends significantly on the bandwidth, whereas my empirical distribution does not take any parameters other than the data during construction and algorithmically figures out everything it needs to know about the data), and substantially better accuracy for distributions with finite tails (e.g. uniform or exponential). The improved accuracy comes in part from it being smoother/less oscillatory than the sklearn KDE.
I am currently looking into the use of Numba to speed up my python software. I am entirely new to the concept and currently trying to learn the absolute basics. What I am stuck on for now is:
I don't understand what the big benefit of the vectorize decorator is.
The documentation explains that the decorator is used to turn a normal Python function into a NumPy ufunc. From what I understand, the benefit of a ufunc is that it can take NumPy arrays (instead of scalars) and provide features such as broadcasting.
But all the examples I can find online can be solved just as easily without this decorator.
Take for instance, this example from the numba documentation.
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def f(x, y):
    return x + y
They claim that the function now works like a NumPy ufunc. But doesn't it anyway, even without the decorator? If I were to just run the following code:
import numpy as np

def f(x, y):
    return x + y

x = np.arange(10)
y = np.arange(10)
print(f(x, y))
That works just fine. The function already takes arguments other than scalars.
What am I misunderstanding here?
Just read the docs a few lines below:
You might ask yourself, “why would I go through this instead of compiling a simple iteration loop using the #jit decorator?”. The answer is that NumPy ufuncs automatically get other features such as reduction, accumulation or broadcasting.
For example, f.reduce(arr) will sum all the elements of arr at C speed, which the plain Python f cannot do.
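A small sketch of those ufunc-only features (reduce, accumulate, outer-style broadcasting) using the example function from the docs:

import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def f(x, y):
    return x + y

arr = np.arange(5.0)
print(f(arr, arr))           # elementwise, same result as the plain Python version
print(f.reduce(arr))         # 10.0 -- reduction, only available on a ufunc
print(f.accumulate(arr))     # [ 0.  1.  3.  6. 10.] -- accumulation
print(f(arr[:, None], arr))  # 5x5 result from outer-style broadcasting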
I'm writing an njit function to speed up a very slow reservoir-operations optimization code. The function returns the maximum value for spill releases based on the reservoir level and gate availability. I am passing in a parameter size that specifies the number of flows to calculate (in some calls it's one and in some it's many). I'm also passing in a numpy.zeros array that I can then fill with the function output. A simplified version of the function is written as follows:
import numpy as np
from numba import njit
@njit(cache=True)
def fncMaxFlow(elev, flag, size, MaxQ):
    if flag == 1:  # SPOG2 running
        if size == 0:
            if elev > 367.28:
                return 861.1
            else:
                return 0
        else:
            for i in range(size):
                if (elev[i] > 367.28) & (elev[i] < 385):
                    MaxQ[i] = 861.1
            return MaxQ
    else:
        if size == 0:
            return 0
        else:
            return MaxQ
fncMaxFlow(np.random.randint(368, 380, 3), 1, 3, np.zeros(3))
The error I'm getting:
Can't unify return type from the following types: array(float64, 1d, C), float64, int32
What is the reason for this? Is there any workaround or some step I'm missing so I can use numba to speed things up? This function and others like it are being called millions of times so they are a major factor in the computational efficiency. Any advice would help - I'm pretty new to python.
A variable within a numba function must have a consistent type, and that includes the return value. In your code you can return MaxQ (an array), 861.1 (a float) or 0 (an int), depending on the code path.
You need to refactor this code so that it always returns a consistent type regardless of code path.
Also note that in several places you are comparing a numpy array to a scalar (elev > 367.28); what you get back is an array of boolean values, which is going to cause you issues. Your example function doesn't even run as a pure Python function (dropping the numba decorator) because of this.
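A hedged sketch of one possible refactor, which always fills and returns the MaxQ array and lets the caller pass a length-1 array for the single-flow case (the exact thresholds and branches would need to mirror your full logic):

import numpy as np
from numba import njit

@njit(cache=True)
def fncMaxFlow(elev, flag, size, MaxQ):
    # always return the MaxQ array so every code path has the same type
    if flag == 1:  # SPOG2 running
        for i in range(size):
            if elev[i] > 367.28 and elev[i] < 385.0:
                MaxQ[i] = 861.1
    return MaxQ

elev = np.random.uniform(368.0, 380.0, 3)
print(fncMaxFlow(elev, 1, 3, np.zeros(3)))
# single-flow case: pass length-1 arrays and read back element 0
print(fncMaxFlow(np.array([368.5]), 1, 1, np.zeros(1))[0])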
I have read this blog which shows how an algorithm had a 250x speed-up by using numpy. I have tried to improve the following code by using numpy but I couldn't make it work:
for i in nodes[1:]:
    for lb in range(2, diameter + 1):
        not_valid_colors = set()
        valid_colors = set()
        for j in nodes:
            if j == i:
                break
            if distances[i-1, j-1] >= lb:
                not_valid_colors.add(c[j, lb])
            else:
                valid_colors.add(c[j, lb])
        c[i, lb] = choose_color(not_valid_colors, valid_colors)
return c
Explanation
The code above is part of an algorithm used to calculate the self-similar dimension of a graph. It basically works by constructing dual graphs G' in which a node is connected to every other node whose distance from it is greater than or equal to a given value (Lb), and then computing a graph coloring on those dual networks.
The algorithm description is the following:
Assign a unique id from 1 to N to all network nodes, without assigning any colors yet.
For all Lb values, assign a color value 0 to the node with id = 1, i.e. C[1][Lb] = 0.
Set the id value i = 2. Repeat the following until i = N.
a) Calculate the distance l_ij from i to all the nodes in the network with id j less than i.
b) Set Lb = 1
c) Select one of the unused colors C[ j][l_ij] from all nodes j < i for which l_ij ≥ Lb . This is the color C[i][Lb] of node i for the given Lb value.
d) Increase Lb by one and repeat (c) until Lb = Lb_max.
e) Increase i by 1.
I wrote it in Python, but it takes more than a minute when I try to use it on small networks with 100 nodes and p=0.9.
As I'm still new to python and numpy I did not find the way to improve its efficiency.
Is it possible to remove the loops by using numpy.where to find where the paths are longer than the given Lb? I tried to implement it, but it didn't work...
Vectorized operations with numpy arrays are fast since actual calculations are done with underlying libraries such as BLAS and LAPACK without Python overheads. With loop-intensive operations, you will not see those benefits.
You usually have to figure out a way to vectorize operations (usually possible with a smart use of array slicing). Some operations are inherently loop-intensive, however, and sometimes it is not easy to vectorize them (which seems to be the case for your code).
In those cases, you can first try Numba, which generates optimized machine code from a Python function without any modifications. (You just annotate the function and it will automatically do it for you). I do not have a lot of experience with it, and have not tried using this for complicated functions.
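As a minimal, hedged illustration of what "just annotate the function" means (this is a toy loop-heavy function, not your coloring code, which would first need its Python sets replaced by array operations):

import numpy as np
from numba import njit

@njit
def smallest_pairwise_distance(distances):
    # plain nested Python loops, compiled to machine code by Numba
    n = distances.shape[0]
    best = np.inf
    for i in range(n):
        for j in range(i + 1, n):
            if distances[i, j] < best:
                best = distances[i, j]
    return best

print(smallest_pairwise_distance(np.random.random((100, 100))))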
If this does not work, then you can use Cython, which converts Python-like code (with typed variables) into efficient C code automatically and generates a Python extension module that you can import and use in Python. That will usually give you at least an order of magnitude (usually two orders of magnitude) speedup for loop-intensive operations. I generally find Cython easy to use since unlike pure C, one can access your numpy arrays directly in Cython code.
I recommend using Anaconda Python distribution, since you will be able to install these packages easily. I'm sorry I don't have a specific answer for your code.
If you want to go to numpy, you can just change the lists into arrays;
for example distances[i-1][j-1] becomes distances[i-1, j-1] once you declare distances as a numpy array, and the same goes for c[i][lb]. For valid_colors and not_valid_colors you should think a bit more, because you cannot append to numpy arrays: arrays have a fixed length, so you would have to fix a maximum size beforehand. Another idea is that once you have everything in numpy, you can cythonize your code (http://docs.cython.org/src/tutorial/cython_tutorial.html), which means that all your loops will become very fast. In any case, if you don't want Cython and you look at the blog, you will see that distances is declared as an array in main().
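To make the array idea concrete, here is a rough, hedged sketch of how the inner j loop could be expressed with a boolean mask (assuming nodes are the consecutive ids 1..N, so the loop over j < i becomes a slice, and that c is a 2-D integer array):

import numpy as np

def colors_for_node(i, lb, distances, c):
    prev = np.arange(1, i)                      # all node ids j < i
    far = distances[i - 1, prev - 1] >= lb      # mask of "distant" previous nodes
    not_valid_colors = set(c[prev[far], lb])
    valid_colors = set(c[prev[~far], lb])
    return not_valid_colors, valid_colors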
Good day, I'm writing a Python module for some numeric work. Since there's a lot of stuff going on, I've been spending the last few days optimizing code to improve calculation times.
However, I have a question concerning Numba.
Basically, I have a class with some fields which are numpy arrays, which I initialize in the following way:
def __init__(self):
    a = numpy.arange(0, self.max_i, 1)
    self.vibr_energy = self.calculate_vibr_energy(a)

def calculate_vibr_energy(self, i):
    return numpy.exp(-self.harmonic * i - self.anharmonic * (i ** 2))
So, the code is vectorized, and using Numba's JIT results in some improvement. However, sometimes I need to access the calculate_vibr_energy function from outside the class, and pass a single integer instead of an array in place of i.
As far as I understand, if I use Numba's JIT on the calculate_vibr_energy, it will have to always take an array as an argument.
So, which of the following options is better:
1) Create a new function calculate_vibr_energy_single(i), which will only take a single integer number, and use Numba on it too
2) Replace all usages of the function that are similar to this one:
myclass.calculate_vibr_energy(1)
with this:
tmp = np.array([1])
myclass.calculate_vibr_energy(tmp)[0]
Or are there other, more efficient (or at least more Pythonic) ways of doing that?
I have only played a little with numba so far, so I may be mistaken, but as far as I understand it, using the "autojit" decorator should give you functions that can take arguments of any type.
See e.g. http://numba.pydata.org/numba-doc/dev/pythonstuff.html
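For what it's worth, autojit has since been folded into the default lazy @jit/@njit: without an explicit signature, Numba compiles a separate specialization for each argument type it sees, so the same function can accept a scalar or an array, provided the body works for both. A hedged sketch (passing harmonic/anharmonic explicitly instead of reading them from self is an illustrative choice, not your class design):

import numpy as np
import numba

@numba.njit
def calculate_vibr_energy(harmonic, anharmonic, i):
    # works for scalar i and for array i, since np.exp handles both
    return np.exp(-harmonic * i - anharmonic * (i ** 2))

print(calculate_vibr_energy(0.1, 0.01, 3))               # single integer
print(calculate_vibr_energy(0.1, 0.01, np.arange(5.0)))  # whole array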