Numba's prange gives wrong result - python

Updating a list inside a prange loop gives different results than the same loop written with range.
from numba import jit, prange
import numpy as np

@jit(parallel=True)
def prange_test(A):
    s = [0,0,0,0]
    b = 0.
    for i in prange(A.shape[0]):
        s[i%4] += A[i]
        b += A[i]
    return s,b

def range_test(A):
    s = [0,0,0,0]
    b = 0.
    for i in range(A.shape[0]):
        s[i%4] += A[i]
        b += A[i]
    return s,b

A = np.random.random(100000)
print(prange_test(A))
print(range_test(A))
The sum b is the same, but the partial sums in s are wrong:
(array([7013.98962611, 6550.90312863, 7232.49698366, 7246.53627734]), 49955.32870429267)
([12444.683249345742, 12432.449908902432, 12596.461028432543, 12481.734517611982], 49955.32870429247)

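Although the documentation is a little unclear on this point, you cannot safely accumulate into an array-like object when different iterations of a prange parallel loop write to the same data elements. The iterations run concurrently, so the unsynchronized read-modify-write in s[i%4] += A[i] is a race condition and some updates are lost. The scalar b comes out right because Numba recognizes b += A[i] as a reduction. I submitted a GitHub issue earlier this year that asks about this specific problem.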
The fact that this has been raised again reminds me that I want to submit a PR to the numba docs to clarify this.
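One way to avoid the race is to give each parallel iteration its own private accumulators and combine them afterwards. The following is a minimal sketch of that idea, not the fix from the GitHub issue: the function name partial_sums and the parameter n_chunks are made up for illustration, and the try/except fallback is only there so the sketch still runs (serially) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption for illustration: fall back to plain Python if Numba is missing.
    prange = range

    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@njit(parallel=True)
def partial_sums(A, n_chunks):
    # One private row of four accumulators per chunk, so no two
    # parallel iterations ever write to the same element.
    partial = np.zeros((n_chunks, 4))
    chunk = (A.shape[0] + n_chunks - 1) // n_chunks
    for c in prange(n_chunks):
        start = c * chunk
        stop = min(start + chunk, A.shape[0])
        for i in range(start, stop):
            partial[c, i % 4] += A[i]
    # Combine the per-chunk rows serially after the parallel loop.
    return partial.sum(axis=0)


A = np.random.random(100000)
s = partial_sums(A, 8)
```

Only the outer loop over chunks is parallel; the inner loop is an ordinary serial range, so each row of partial is touched by exactly one iteration.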

Related

Comparing numba njit/vectorize/guvectorize

I have been testing the following block for numba speed up:
import numpy as np
import timeit
from numba import njit
import numba

@numba.guvectorize(["void(float64[:],float64[:],float64[:],float64, float64, float64[:])"],
                   "(m),(m),(m),(),()->(m)",nopython=True,target="parallel")
def func_diff_calc_numba_v2(X,refY,Y,lower,upper,arr):
    fac=1000
    for i in range(len(X)):
        if X[i] >=lower and X[i] <upper:
            diff=Y[i]-refY[i]
            arr[i] = diff**2*fac
        else:
            arr[i] = 0

@numba.vectorize('(float64, float64, float64, float64, float64)',nopython=True,target="parallel")
def func_diff_calc_numba_v3(X,refY,Y,lower,upper):
    fac=1000
    if X >= lower and X < upper:
        return (Y-refY)**2*fac
    else:
        return 0.0

@njit
def func_diff_calc_numba(X,refY,Y,lower,upper):
    fac=1000
    arr=np.zeros(len(X))
    for i in range(len(X)):
        if X[i] >=lower and X[i] <upper:
            arr[i]=(Y[i]-refY[i])**2*fac
        else:
            arr[i] = 0
    return arr

np.random.seed(69)
X=np.arange(10000)
refY = np.random.rand(10000)
Y = np.random.rand(10000)
lower=1
upper=10000
print("func_diff_calc_numba: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba(X,refY,Y,lower,upper)", number=10000, globals=globals())))
print("func_diff_calc_numba_v2: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba_v2(X,refY,Y,lower,upper)", number=10000, globals=globals())))
print("func_diff_calc_numba_v3: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba_v3(X,refY,Y,lower,upper)", number=10000, globals=globals())))
The speedups for the v2 and v3 are significantly different:
func_diff_calc_numba: 0.58257
func_diff_calc_numba_v2: 0.49573
func_diff_calc_numba_v3: 1.07519
and if I change the number of iterations from 10,000 to 100,000 then:
func_diff_calc_numba: 1.67251
func_diff_calc_numba_v2: 4.85828
func_diff_calc_numba_v3: 11.63361
I was expecting vectorize and guvectorize to give similar speedups, but while njit and guvectorize are almost equal to each other in time, vectorize is ~2 and ~10 times slower than guvectorize and njit respectively. Is there something wrong in my implementation, or is something else going on?
The task (function + inputs) is probably too small/simple to be effectively parallelized, causing the overhead of doing so to increase total runtime. If you compile both to the default cpu target the difference disappears I assume?
Because your input is 1D, with the given ufunc signature, the guvectorize doesn't parallelize anything, because there's only one task.
A like-for-like parallel comparison can be done by setting the signature to "(),(),(),(),()->()" basically telling it to also (like vectorize) apply the function element-wise. And those results should be very close again. But then you'll see that the overhead of parallelization makes it worse for both in this case.
For me timings are:
Using target="parallel" for both, and "(m),(m),(m),(),()->(m)":
numba_guvec : 0.26364
numba_vec : 3.26960
Using target="cpu" for both, and "(m),(m),(m),(),()->(m)":
numba_guvec : 0.21886
numba_vec : 0.26198
Using target="parallel" for both, and "(),(),(),(),()->()":
numba_guvec : 3.05748
numba_vec : 3.15587
You'll probably find similar behavior if you also compare @njit(parallel=True) combined with numba.prange.
At the end, there's just some extra work involved for parallelizing something, and that's only worth it for a sufficiently large (slow) task.

Why is numpy list access slower than vanilla python?

I was under the impression that numpy would be faster for list operations, but the following example seems to indicate otherwise:
import numpy as np
import time

def ver1():
    a = [i for i in range(40)]
    b = [0 for i in range(40)]
    for i in range(1000000):
        for j in range(40):
            b[j]=a[j]

def ver2():
    a = np.array([i for i in range(40)])
    b = np.array([0 for i in range(40)])
    for i in range(1000000):
        for j in range(40):
            b[j]=a[j]

t0 = time.time()
ver1()
t1 = time.time()
ver2()
t2 = time.time()
print(t1-t0)
print(t2-t1)
Output is:
4.872278928756714
9.120521068572998
(I'm running 64-bit Python 3.4.3 in Windows 7, on an i7 920)
I do understand that this isn't the fastest way to copy a list, but I'm trying to find out if I'm using numpy incorrectly. Or is it the case that numpy is slower for this kind of operation and is only more efficient in more complex operations?
EDIT:
I also tried the following, which just does a direct copy via b[:] = a, and numpy is still twice as slow:
import numpy as np
import time

def ver6():
    a = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    b = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    for i in range(1000000):
        b[:] = a

def ver7():
    a = np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
    b = np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
    for i in range(1000000):
        b[:] = a

t0 = time.time()
ver6()
t1 = time.time()
ver7()
t2 = time.time()
print(t1-t0)
print(t2-t1)
Output is:
0.36202096939086914
0.6750380992889404
You're using NumPy wrong. NumPy's efficiency relies on doing as much work as possible in C-level loops instead of interpreted code. When you do
for j in range(40):
    b[j]=a[j]
That's an interpreted loop, with all the intrinsic interpreter overhead and more, because NumPy's indexing logic is way more complex than list indexing, and NumPy needs to create a new element wrapper object on every element retrieval. You're not getting any of the benefits of NumPy when you write code like this.
You need to write the code in such a way that the work happens in C:
b[:] = a
This would also improve the efficiency of the list operation, but it's much more important for NumPy.
Most of what you are seeing is Python object creation from C native types.
A Python list is, at its heart, an array of PyObject pointers. When a and b are both Python lists, doing b[i] = a[i] implies:
decreasing the reference count of the object pointed to by b[i],
increasing the reference count of the object pointed to by a[i], and
copying the address stored in a[i] into b[i].
But if a and b are NumPy arrays, things are a little more elaborate, and the same b[i] = a[i] then requires:
creating a Python integer object from the native C integer type stored at a[i], see this,
converting the Python integer object into a native C integer type, and storing its value in b[i], see here, and
decreasing the reference count of the temporary Python integer object.
So the difference is mostly in creating and disposing of that intermediate Python object, which lists do not need.
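The object-creation difference described above can be observed directly from Python. This is a small sketch for illustration only; the variable names are arbitrary, and the values start at 1000 simply to stay clear of CPython's cache of small integers.

```python
import numpy as np

lst = list(range(1000, 1040))
arr = np.array(lst)

# A list stores pointers to existing Python objects, so indexing
# hands back the very same object every time.
assert lst[5] is lst[5]

# A NumPy array stores raw C integers, so every index operation has
# to build a fresh scalar object that wraps the value.
x = arr[5]
y = arr[5]
assert isinstance(x, np.integer)
assert x is not y   # two distinct wrapper objects
assert x == y       # holding the same value

# Slice assignment copies the whole buffer inside NumPy's C code
# without creating any per-element Python objects.
dst = np.zeros_like(arr)
dst[:] = arr
assert np.array_equal(dst, arr)
```

The element-by-element loop pays for one wrapper object per access, which is why it loses to the list version, while the slice assignment avoids that cost entirely.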

Numba #jit fails to optimise simple function

I have a pretty simple function which uses Numpy arrays and for loops, but adding the Numba @jit decorator gives absolutely no speed up:
# @jit(float64[:](int32,float64,float64,float64,int32))
@jit
def Ising_model_1D(N=200,J=1,T=1e-2,H=0,n_iter=1e6):
    beta = 1/T
    s = randn(N,1) > 10
    s[N-1] = s[0]
    mag = zeros((n_iter,1))
    aux_idx = randint(low=0,high=N,size=(n_iter,1))
    for i1 in arange(n_iter):
        rnd_idx = aux_idx[i1]
        s_1 = s[rnd_idx]*2 - 1
        s_2 = s[(rnd_idx+1)%(N)]*2 - 1
        s_3 = s[(rnd_idx-1)%(N)]*2 - 1
        delta_E = 2.0*J*(s_2+s_3)*s_1 + 2.0*H*s_1
        if(delta_E < 0):
            s[rnd_idx] = np.logical_not(s[rnd_idx])
        elif(np.exp(-1*beta*delta_E) >= rand()):
            s[rnd_idx] = np.logical_not(s[rnd_idx])
        s[N-1] = s[0]
        mag[i1] = (s*2-1).sum()*1.0/N
    return mag
MATLAB on the other hand takes less than 0.5 seconds to run this!
Why is Numba missing something so basic?
Here is a reworking of your code that runs in about 0.4 seconds on my machine:
import numpy as np
from numpy import zeros
from numpy.random import rand, randn, randint
from numba import jit

def ising_model_1d(N=200,J=1,T=1e-2,H=0,n_iter=1e6):
    # Gateway function: prepare every array with NumPy, then call the jitted loop.
    n_iter = int(n_iter)
    beta = 1/T
    s = randn(N) > 10
    s[N-1] = s[0]
    mag = zeros(n_iter)
    aux_idx = randint(low=0,high=N,size=n_iter)
    pre_rand = rand(n_iter)
    _ising_jitted(n_iter, aux_idx, s, J, N, H, beta, pre_rand, mag)
    return mag

@jit(nopython=True)
def _ising_jitted(n_iter, aux_idx, s, J, N, H, beta, pre_rand, mag):
    for i1 in range(n_iter):
        rnd_idx = aux_idx[i1]
        s_1 = s[rnd_idx]*2 - 1
        s_2 = s[(rnd_idx+1)%(N)]*2 - 1
        s_3 = s[(rnd_idx-1)%(N)]*2 - 1
        delta_E = 2.0*J*(s_2+s_3)*s_1 + 2.0*H*s_1
        if delta_E < 0:
            s[rnd_idx] = not s[rnd_idx]
        elif np.exp(-1*beta*delta_E) >= pre_rand[i1]:
            # Random numbers are pre-generated in the gateway function.
            s[rnd_idx] = not s[rnd_idx]
        s[N-1] = s[0]
        mag[i1] = (s*2-1).sum()*1.0/N
Please make sure the results are as expected! I changed much of what you had, and can't guarantee that the calculations are correct!
Working with numba requires a little care. Python functions, as well as most numpy functions, cannot be optimized by the compiler. One thing I find helpful is to use the nopython option to @jit. This means that the compiler will complain whenever you give it some code that it can't really optimize. You can then look at the error message and find the line that will likely slow down your code.
The trick, I find, is to write a "gateway" function in Python that does as much of the work as possible using numpy and its vectorized functions. It should create the empty arrays that you'll need to store the results in. It should package all of the data you'll need during the computation. Then it should pass all of these into your jitted function in one big, long argument list.
Case in point: notice how I handle random number generation in the jitted code. In your original code, you called rand():
elif(np.exp(-1*beta*delta_E) >= rand()):
But rand() can't be optimized by numba (in older versions of numba, at least; in newer versions it can, provided that rand is called without arguments). The observation is that you need a single random number for every one of the n_iter iterations. So we simply create a random array using numpy in our wrapper function, then feed this random array to the jitted function. Getting a random number is then as simple as indexing into this array.
Lastly, for a list of the numpy functions that can be optimized by the latest version of the compiler, see here. In my reworking of your code I was aggressive in removing calls to numpy functions so that the code would work over more versions of numba.

Reduce python loop to array calculation

I am trying to fill an array with calculated values from functions defined earlier in my code. I started with a code that has a similar structure to the following:
from numpy import cos, sin, arange, zeros

a = arange(1000)
b = arange(1000)

def defcos(x):
    return cos(x)

def defsin(x):
    return sin(x)

a_len = len(a)
b_len = len(b)
result = zeros((a_len,b_len))
for i in xrange(b_len):
    for j in xrange(a_len):
        a_res = defcos(a[j])
        b_res = defsin(b[i])
        result[i,j] = a_res * b_res
I tried to use array representations of the functions, which ended up in the following change for the loop
a_res = defsin(a)
b_res = defcos(b)
for i in xrange(b_len):
    for j in xrange(a_len):
        result[i,j] = a_res[i] * b_res[j]
This is already significantly faster than the first version. But is there a way to avoid the loop entirely? I have encountered such loops a couple of times in the past but never bothered, as they were not critical in terms of speed. But this time it is the core component of something which is itself looped through a couple more times. :)
Any help would be appreciated, thanks in advance!
Like so:
from numpy import newaxis
a_res = sin(a)
b_res = cos(b)
result = a_res[:, newaxis] * b_res
To understand how this works, have a look at the rules for array broadcasting. And please don't define useless functions like defsin, just use sin itself! Another minor detail, you get i from range(b_len), but you use it to index a_res! This is a bug if a_len != b_len.
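As a quick check of the broadcasting answer, the sketch below compares it against the explicit double loop on deliberately different lengths (5 and 7, chosen only for illustration), so that a mix-up between the two axes would show up as a shape error. np.outer is included as an equivalent spelling for this particular case.

```python
import numpy as np

a = np.arange(5)
b = np.arange(7)
a_res = np.sin(a)
b_res = np.cos(b)

# Explicit double loop, with i indexing a_res and j indexing b_res.
loop_result = np.zeros((len(a), len(b)))
for i in range(len(a)):
    for j in range(len(b)):
        loop_result[i, j] = a_res[i] * b_res[j]

# Broadcasting: a (5, 1) column times a (7,) row expands to (5, 7).
broadcast_result = a_res[:, np.newaxis] * b_res

# For a product of two 1-D arrays, np.outer computes the same thing.
outer_result = np.outer(a_res, b_res)

assert broadcast_result.shape == (5, 7)
assert np.allclose(loop_result, broadcast_result)
assert np.allclose(loop_result, outer_result)
```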

Numpy sum of operator results without allocating an unnecessary array

I have two numpy boolean arrays (a and b). I need to find how many of their elements are equal. Currently, I do len(a) - (a ^ b).sum(), but the xor operation creates an entirely new numpy array, as I understand. How do I efficiently implement this desired behavior without creating the unnecessary temporary array?
I've tried using numexpr, but I can't quite get it to work right. It doesn't support the notion that True is 1 and False is 0, so I have to use ne.evaluate("sum(where(a==b, 1, 0))"), which takes about twice as long.
Edit: I forgot to mention that one of these arrays is actually a view into another array of a different size, and both arrays should be considered immutable. Both arrays are 2-dimensional and tend to be somewhere around 25x40 in size.
Yes, this is the bottleneck of my program and is worth optimizing.
On my machine this is faster:
(a == b).sum()
If you don't want to use any extra storage, then I would suggest using numba.
I'm not too familiar with it, but this seems to work well.
I ran into some trouble getting Cython to take a boolean NumPy array.
from numba import autojit
from numpy.random import rand

def pysumeq(a, b):
    tot = 0
    for i in xrange(a.shape[0]):
        for j in xrange(a.shape[1]):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot

# make numba version
nbsumeq = autojit(pysumeq)
A = (rand(10,10)<.5)
B = (rand(10,10)<.5)
# do a simple dry run to get it to compile
# for this specific use case
nbsumeq(A, B)
If you don't have numba, I would suggest using the answer by @user2357112
Edit: Just got a Cython version working, here's the .pyx file. I'd go with this.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysumeq(ar[np.uint8_t,ndim=2,cast=True] a, ar[np.uint8_t,ndim=2,cast=True] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    for i in xrange(h):
        for j in xrange(w):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot
To start with you can skip the A*B step:
>>> a
array([ True, False, True, False, True], dtype=bool)
>>> b
array([False, True, True, False, True], dtype=bool)
>>> np.sum(~(a^b))
3
If you do not mind destroying array a or b, I am not sure you will get faster than this:
>>> a^=b #In place xor operator
>>> np.sum(~a)
3
If the problem is allocation and deallocation, maintain a single output array and tell numpy to put the results there every time:
out = np.empty_like(a) # Allocate this outside a loop and use it every iteration
num_eq = np.equal(a, b, out).sum()
This'll only work if the inputs are always the same dimensions, though. You may be able to make one big array and slice out a part that's the size you need for each call if the inputs have varying sizes, but I'm not sure how much that slows you down.
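A slightly fuller sketch of the preallocated-output idea follows. The helper name count_equal and the random test arrays are assumptions made for illustration; the 25x40 shape is taken from the question. np.count_nonzero is used here in place of .sum(), which is a substitution of mine and not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((25, 40)) < 0.5
b = rng.random((25, 40)) < 0.5

# Allocate the boolean scratch buffer once, outside any hot loop.
out = np.empty_like(a)


def count_equal(a, b, out):
    # Write the elementwise comparison into the existing buffer
    # so that no new temporary array is allocated per call.
    np.equal(a, b, out=out)
    return int(np.count_nonzero(out))


n_eq = count_equal(a, b, out)

# Cross-check against the allocating versions from the question and answers.
assert n_eq == int((a == b).sum())
assert n_eq == a.size - int((a ^ b).sum())
```

Neither input is modified, which matters here because the question states that both arrays should be treated as immutable.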
Improving upon IanH's answer, it's also possible to get access to the underlying C array in a numpy array from within Cython, by supplying mode="c" to ndarray.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cy_sum_eq(ar[np.uint8_t,ndim=2,cast=True,mode="c"] a, ar[np.uint8_t,ndim=2,cast=True,mode="c"] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    cdef np.uint8_t* adata = &a[0, 0]
    cdef np.uint8_t* bdata = &b[0, 0]
    for i in xrange(h):
        for j in xrange(w):
            if adata[j] == bdata[j]:
                tot += 1
        adata += w
        bdata += w
    return tot
This is about 40% faster on my machine than IanH's Cython version, and I've found that rearranging the loop contents doesn't seem to make much of a difference at this point, probably due to compiler optimizations. From here, one could potentially link to a C function optimized with SSE and such to perform this operation, passing adata and bdata as uint8_t* pointers.
