I have been testing the following block for Numba speed-up:
import numpy as np
import timeit
from numba import njit
import numba

@numba.guvectorize(["void(float64[:], float64[:], float64[:], float64, float64, float64[:])"],
                   "(m),(m),(m),(),()->(m)", nopython=True, target="parallel")
def func_diff_calc_numba_v2(X, refY, Y, lower, upper, arr):
    fac = 1000
    for i in range(len(X)):
        if X[i] >= lower and X[i] < upper:
            diff = Y[i] - refY[i]
            arr[i] = diff**2 * fac
        else:
            arr[i] = 0

@numba.vectorize(["float64(float64, float64, float64, float64, float64)"],
                 nopython=True, target="parallel")
def func_diff_calc_numba_v3(X, refY, Y, lower, upper):
    fac = 1000
    if X >= lower and X < upper:
        return (Y - refY)**2 * fac
    else:
        return 0.0

@njit
def func_diff_calc_numba(X, refY, Y, lower, upper):
    fac = 1000
    arr = np.zeros(len(X))
    for i in range(len(X)):
        if X[i] >= lower and X[i] < upper:
            arr[i] = (Y[i] - refY[i])**2 * fac
        else:
            arr[i] = 0
    return arr

np.random.seed(69)
X = np.arange(10000)
refY = np.random.rand(10000)
Y = np.random.rand(10000)
lower = 1
upper = 10000
print("func_diff_calc_numba: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba(X,refY,Y,lower,upper)", number=10000, globals=globals())))
print("func_diff_calc_numba_v2: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba_v2(X,refY,Y,lower,upper)", number=10000, globals=globals())))
print("func_diff_calc_numba_v3: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba_v3(X,refY,Y,lower,upper)", number=10000, globals=globals())))
The speedups for v2 and v3 are significantly different:
func_diff_calc_numba: 0.58257
func_diff_calc_numba_v2: 0.49573
func_diff_calc_numba_v3: 1.07519
and if I change the number of iterations from 10,000 to 100,000 then:
func_diff_calc_numba: 1.67251
func_diff_calc_numba_v2: 4.85828
func_diff_calc_numba_v3: 11.63361
I was expecting vectorize and guvectorize to give similar speedups, but while njit and guvectorize are almost equal to each other in time, vectorize is ~2 and ~10 times slower than guvectorize and njit respectively. Is there something wrong in my implementation, or is it something else?
The task (function + inputs) is probably too small/simple to be effectively parallelized, causing the overhead of doing so to increase the total runtime. I assume the difference disappears if you compile both for the default "cpu" target?
Because your input is 1D, with the given ufunc signature, the guvectorize version doesn't parallelize anything, because there's only one task.
A like-for-like parallel comparison can be done by setting the signature to "(),(),(),(),()->()", basically telling it to also (like vectorize) apply the function element-wise. Those results should be very close again, but then you'll see that the overhead of parallelization makes it worse for both in this case.
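For example, a minimal sketch of that element-wise gufunc (an illustration only, not the original benchmark code; exact scalar-argument handling can vary between Numba versions):
import numpy as np
from numba import guvectorize

# Core signature "(),(),(),(),()->()": every element becomes its own task, so the
# gufunc behaves like the vectorized version; the scalar output is written via out[0].
@guvectorize(["void(float64, float64, float64, float64, float64, float64[:])"],
             "(),(),(),(),()->()", nopython=True, target="parallel")
def func_diff_calc_guvec_elementwise(x, refy, y, lower, upper, out):
    fac = 1000.0
    if x >= lower and x < upper:
        out[0] = (y - refy)**2 * fac
    else:
        out[0] = 0.0

# Called just like the vectorized version:
# arr = func_diff_calc_guvec_elementwise(X, refY, Y, lower, upper)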
For me timings are:
Using target="parallel" for both, and "(m),(m),(m),(),()->(m)":
numba_guvec : 0.26364
numba_vec : 3.26960
Using target="cpu" for both, and "(m),(m),(m),(),()->(m)":
numba_guvec : 0.21886
numba_vec : 0.26198
Using target="parallel" for both, and "(),(),(),(),()->()":
numba_guvec : 3.05748
numba_vec : 3.15587
You'll probably find similar behavior if you also compare @njit(parallel=True) with numba.prange.
In the end, there's just some extra work involved in parallelizing something, and that's only worth it for a sufficiently large (slow) task.
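For reference, an explicit-loop equivalent of the original function using @njit(parallel=True) and prange might look like this (a sketch, not part of the benchmark above):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def func_diff_calc_numba_prange(X, refY, Y, lower, upper):
    fac = 1000.0
    arr = np.zeros(len(X))
    # Each iteration writes a distinct arr[i], so the parallel loop is safe.
    for i in prange(len(X)):
        if X[i] >= lower and X[i] < upper:
            arr[i] = (Y[i] - refY[i])**2 * fac
    return arr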
Related
Suppose I have a NumPy array arr that I want to element-wise filter (reduce) depending on the truth value of a (broadcastable) function, e.g.
I want to get only values below a certain threshold value k:
def cond(x):
    return x < k
There are a couple of methods, e.g.:
Using a generator: np.fromiter((x for x in arr if cond(x)), dtype=arr.dtype) (which is a memory efficient version of using a list comprehension: np.array([x for x in arr if cond(x)]) because np.fromiter() will produce a NumPy array directly, without the need to allocate an intermediate Python list)
Using boolean masking: arr[cond(arr)]
Using integer indexing: arr[np.nonzero(cond(arr))] (or equivalently using np.where() as it defaults to np.nonzero() with only one condition)
Using explicit looping with:
single pass and final copying/resizing
two passes: one to determine the size of the result and one to actually perform the computation
(The last two approaches to be accelerated with Cython or Numba)
Which is the fastest? What about memory efficiency?
(EDITED: to use np.nonzero() directly instead of np.where(), as per @ShadowRanger's comment)
Summary
Using a loop-based approach with single pass and copying, accelerated with Numba, offers the best overall trade-off in terms of speed, memory efficiency and flexibility.
If the execution of the condition function is sufficiently fast, the two-pass approach (filter2_nb()) may be faster, and it is more memory efficient regardless.
Also, for sufficiently large inputs, resizing instead of copying (filter_resize_xnb()) leads to faster execution.
If the data type (and the condition function) is known ahead of time and can be compiled, the Cython acceleration seems to be faster.
It is likely that similar hard-coding of the condition would lead to a comparable speed-up with Numba acceleration as well.
When it comes to NumPy-only based approaches, boolean masking or integer indexing are of comparable speed, and which one comes out faster depends largely on the filtering factor, i.e. the portion of values that passes through the filtering condition.
The np.fromiter() approach is much slower (it would be off-charts in the plot), but does not produce large temporary objects.
Note that the following tests are meant to give some insights into the different approaches and should be taken with a grain of salt.
The most relevant assumptions are that the condition is broadcastable and that it eventually computes very fast.
Definitions
Using a generator:
def filter_fromiter(arr, cond):
    return np.fromiter((x for x in arr if cond(x)), dtype=arr.dtype)
Using boolean masking:
def filter_mask(arr, cond):
    return arr[cond(arr)]
Using integer indexing:
def filter_idx(arr, cond):
    return arr[np.nonzero(cond(arr))]
4a. Using explicit looping, with single pass and final copying/resizing
Cython-accelerated with copying (pre-compiled condition)
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np

cdef long NUM = 1048576
cdef long MAX_VAL = 1048576
cdef long K = 1048576 // 2

cdef int cond_cy(long x, long k=K):
    return x < k

cdef size_t _filter_cy(long[:] arr, long[:] result, size_t size):
    cdef size_t j = 0
    for i in range(size):
        if cond_cy(arr[i]):
            result[j] = arr[i]
            j += 1
    return j

def filter_cy(arr):
    result = np.empty_like(arr)
    new_size = _filter_cy(arr, result, arr.size)
    return result[:new_size].copy()
Cython-accelerated with resizing (pre-compiled condition)
def filter_resize_cy(arr):
    result = np.empty_like(arr)
    new_size = _filter_cy(arr, result, arr.size)
    result.resize(new_size)
    return result
Numba-accelerated with copying
import numba as nb

K = 1048576 // 2  # same threshold as in the Cython cell above

@nb.njit
def cond_nb(x, k=K):
    return x < k

@nb.njit
def filter_nb(arr, cond_nb):
    result = np.empty_like(arr)
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            result[j] = arr[i]
            j += 1
    return result[:j].copy()
Numba-accelerated with resizing
@nb.njit
def _filter_out_nb(arr, out, cond_nb):
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            out[j] = arr[i]
            j += 1
    return j

def filter_resize_xnb(arr, cond_nb):
    result = np.empty_like(arr)
    j = _filter_out_nb(arr, result, cond_nb)
    result.resize(j, refcheck=False)  # unsupported in NoPython mode
    return result
Numba-accelerated with a generator and np.fromiter()
@nb.njit
def filter_gen_nb(arr, cond_nb):
    for i in range(arr.size):
        if cond_nb(arr[i]):
            yield arr[i]

def filter_gen_xnb(arr, cond_nb):
    return np.fromiter(filter_gen_nb(arr, cond_nb), dtype=arr.dtype)
4b. Using explicit looping with two passes: one to determine the size of the result and one to actually perform the computation
Cython-accelerated (pre-compiled condition)
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

cdef size_t _filtered_size_cy(long[:] arr, size_t size):
    cdef size_t j = 0
    for i in range(size):
        if cond_cy(arr[i]):
            j += 1
    return j

def filter2_cy(arr):
    cdef size_t new_size = _filtered_size_cy(arr, arr.size)
    result = np.empty(new_size, dtype=arr.dtype)
    new_size = _filter_cy(arr, result, arr.size)
    return result
Numba-accelerated
@nb.njit
def filter2_nb(arr, cond_nb):
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            j += 1
    result = np.empty(j, dtype=arr.dtype)
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            result[j] = arr[i]
            j += 1
    return result
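For orientation, the definitions above can be exercised along these lines (a hypothetical setup: the array size and threshold are assumptions mirroring the constants in the Cython cell, and the actual benchmark harness is in the linked full code):
import numpy as np

np.random.seed(0)
arr = np.random.randint(0, 2 * K, size=1_048_576)  # roughly 50% of values fall below K

# All approaches should agree with plain boolean masking.
reference = filter_mask(arr, cond_nb)
assert np.array_equal(filter_idx(arr, cond_nb), reference)
assert np.array_equal(filter_nb(arr, cond_nb), reference)
assert np.array_equal(filter2_nb(arr, cond_nb), reference)
assert np.array_equal(filter_resize_xnb(arr, cond_nb), reference)
assert np.array_equal(filter_gen_xnb(arr, cond_nb), reference)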
Timing Benchmarks
(The generator-based filter_fromiter() method is much slower than the others -- by approx. 2 orders of magnitude.
Similar (and perhaps slightly worse) performances can be expected from a list comprehension.
This would be true for any explicit looping with non-accelerated code.)
The timings depend on both the input array size and the percentage of filtered items.
As a function of input size
The first graph addresses the timings as a function of the input size (for ~50% filtering factor -- i.e. 50% of the elements appear in the result):
In general, explicit looping with one form of acceleration leads to the fastest execution, with slight variations depending on input size.
Within NumPy, the integer indexing approaches are basically on par with boolean masking.
The benefits of using np.fromiter() (no pre-allocation) can be reaped by writing a Numba-accelerated generator, which would come out slower than the other approaches (within an order of magnitude), but much faster than pure-Python looping.
As a function of filling
The second graph addresses the timings as a function of items passing through the filter (for a fixed input size of ~1 million elements):
The first observation is that all methods are slowest when approaching ~50% filling; with less (or more) filling they are faster, and they are fastest towards no filling (i.e. the highest percentage of filtered-out values and the lowest percentage of values passing through, as indicated on the x-axis of the graph).
Again, explicit looping with some means of acceleration leads to the fastest execution.
Within NumPy, the integer indexing and boolean masking approaches are again basically the same.
(Full code available here)
Memory Considerations
The generator-based filter_fromiter() method requires only minimal temporary storage, independently of the size of the input.
Memory-wise this is the most efficient method.
This approach can be effectively sped up with a Numba-accelerated generator.
Of similar memory efficiency are the Cython / Numba two-passes methods, because the size of the output is determined during the first pass.
The caveat here is that computing the condition has to be fast for these methods to be fast.
On the memory side, the single-pass solutions for both Cython and Numba require a temporary array of the size of the input.
Hence, these are not very memory-efficient compared to two-passes or the generator-based one.
Yet their asymptotic temporary memory footprint is similar to that of masking, although the constant term is typically larger than for masking.
The boolean masking solution requires a temporary array of the size of the input but of type bool, which in NumPy is 1 byte, so this is ~8 times smaller than the default size of a NumPy array on a typical 64-bit system.
The integer indexing solution has the same requirement as boolean mask slicing in the first step (inside the np.nonzero() call), which gets converted to a series of ints (typically int64 on a 64-bit system) in the second step (the output of np.nonzero()).
This second step, therefore, has variable memory requirements, depending on the number of filtered elements.
Remarks
both boolean masking and integer indexing require some form of conditioning that is capable of producing a boolean mask (or, alternatively, a list of indices); in the above implementation, the condition is broadcastable
the generator and the Numba-accelerated methods are also the most flexible when it comes to specifying a different filtering condition
the Numba-accelerated methods require the condition to be Numba-compatible to access the Numba acceleration in NoPython mode
the Cython solution requires specifying the data types for it to be fast, or extra effort for multiple types dispatching, and extra effort (not explored here) to get the same level of flexibility as the other methods
for both Numba and Cython, the filtering condition can be hard-coded, leading to marginal but appreciable speed differences
the single-pass solutions require additional code to handle the unused (but otherwise initially allotted) memory.
the NumPy methods do NOT return a view of the input, but a copy, as a result of advanced indexing:
arr = np.arange(100)
k = 50
print('`arr[arr > k]` is a copy: ', arr[arr > k].base is None)
# `arr[arr > k]` is a copy: True
print('`arr[np.where(arr > k)]` is a copy: ', arr[np.where(arr > k)].base is None)
# `arr[np.where(arr > k)]` is a copy: True
print('`arr[:k]` is a copy: ', arr[:k].base is None)
# `arr[:k]` is a copy: False
(EDITED: various improvements based on comments from @ShadowRanger, @PaulPanzer, @max9111 and @DavidW.)
Updating a list in a prange loop gives wrong results when using prange compared to range.
from numba import jit, prange
import numpy as np

@jit(parallel=True)
def prange_test(A):
    s = [0, 0, 0, 0]
    b = 0.
    for i in prange(A.shape[0]):
        s[i % 4] += A[i]
        b += A[i]
    return s, b

def range_test(A):
    s = [0, 0, 0, 0]
    b = 0.
    for i in range(A.shape[0]):
        s[i % 4] += A[i]
        b += A[i]
    return s, b

A = np.random.random(100000)
print(prange_test(A))
print(range_test(A))
The sum b is the same, but the partial sums in s are wrong:
(array([7013.98962611, 6550.90312863, 7232.49698366, 7246.53627734]), 49955.32870429267)
([12444.683249345742, 12432.449908902432, 12596.461028432543, 12481.734517611982], 49955.32870429247)
Although it's a little unclear in the documentation, you cannot safely accumulate into an array-like object when you are writing to the same data elements from different iterations of a prange parallel loop. This GitHub issue, which I actually submitted earlier this year, asks about this specific problem.
The fact that this has been raised again reminds me that I want to submit a PR to the numba docs to clarify this.
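One way to get correct partial sums in the meantime (a sketch of a workaround, not taken from that issue) is to accumulate into plain scalar variables, since scalar reductions are the pattern that prange does support:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def prange_test_reduced(A):
    # Four scalar reduction variables instead of a shared list; the boolean
    # factor (i % 4 == r) selects which bucket each element contributes to.
    s0 = 0.0
    s1 = 0.0
    s2 = 0.0
    s3 = 0.0
    b = 0.0
    for i in prange(A.shape[0]):
        v = A[i]
        r = i % 4
        b += v
        s0 += v * (r == 0)
        s1 += v * (r == 1)
        s2 += v * (r == 2)
        s3 += v * (r == 3)
    return (s0, s1, s2, s3), b

# print(prange_test_reduced(np.random.random(100000)))  # partial sums now match range()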
I have a pretty simple function which uses NumPy arrays and for loops, but adding the Numba @jit decorator gives absolutely no speed-up:
# @jit(float64[:](int32,float64,float64,float64,int32))
@jit
def Ising_model_1D(N=200,J=1,T=1e-2,H=0,n_iter=1e6):
    beta = 1/T
    s = randn(N,1) > 10
    s[N-1] = s[0]
    mag = zeros((n_iter,1))
    aux_idx = randint(low=0,high=N,size=(n_iter,1))
    for i1 in arange(n_iter):
        rnd_idx = aux_idx[i1]
        s_1 = s[rnd_idx]*2 - 1
        s_2 = s[(rnd_idx+1)%(N)]*2 - 1
        s_3 = s[(rnd_idx-1)%(N)]*2 - 1
        delta_E = 2.0*J*(s_2+s_3)*s_1 + 2.0*H*s_1
        if(delta_E < 0):
            s[rnd_idx] = np.logical_not(s[rnd_idx])
        elif(np.exp(-1*beta*delta_E) >= rand()):
            s[rnd_idx] = np.logical_not(s[rnd_idx])
        s[N-1] = s[0]
        mag[i1] = (s*2-1).sum()*1.0/N
    return mag
MATLAB on the other hand takes less than 0.5 seconds to run this!
Why is Numba missing something so basic?
Here is a reworking of your code that runs in about 0.4 seconds on my machine:
def ising_model_1d(N=200,J=1,T=1e-2,H=0,n_iter=1e6):
    n_iter = int(n_iter)
    beta = 1/T
    s = randn(N) > 10
    s[N-1] = s[0]
    mag = zeros(n_iter)
    aux_idx = randint(low=0,high=N,size=n_iter)
    pre_rand = rand(n_iter)
    _ising_jitted(n_iter, aux_idx, s, J, N, H, beta, pre_rand, mag)
    return mag

@jit(nopython=True)
def _ising_jitted(n_iter, aux_idx, s, J, N, H, beta, pre_rand, mag):
    for i1 in range(n_iter):
        rnd_idx = aux_idx[i1]
        s_1 = s[rnd_idx]*2 - 1
        s_2 = s[(rnd_idx+1)%(N)]*2 - 1
        s_3 = s[(rnd_idx-1)%(N)]*2 - 1
        delta_E = 2.0*J*(s_2+s_3)*s_1 + 2.0*H*s_1
        if delta_E < 0:
            s[rnd_idx] = not s[rnd_idx]
        elif np.exp(-1*beta*delta_E) >= pre_rand[i1]:
            s[rnd_idx] = not s[rnd_idx]
        s[N-1] = s[0]
        mag[i1] = (s*2-1).sum()*1.0/N
Please make sure the results are as expected! I changed much of what you had, and can't guarantee that the calculations are correct!
Working with Numba requires a little care. Python functions, as well as most NumPy functions, cannot be optimized by the compiler. One thing I find helpful is to use the nopython option to @jit. This means that the compiler will complain whenever you give it some code that it can't really optimize. You can then look at the error message and find the line that will likely slow down your code.
The trick, I find, is to write a "gateway" function in Python that does as much of the work as possible using numpy and its vectorized functions. It should create the empty arrays that you'll need to store the results in. It should package all of the data you'll need during the computation. Then it should pass all of these into your jitted function in one big, long argument list.
Case in point: notice how I handle random number generation in the jitted code. In your original code, you called rand():
elif(np.exp(-1*beta*delta_E) >= rand()):
But rand() can't be optimized by Numba (in older versions of Numba, at least; in newer versions it can, provided that rand() is called without arguments). The observation is that you need a single random number for every one of the n_iter iterations. So we simply create a random array using NumPy in our wrapper function, then feed this random array to the jitted function. Getting a random number is then as simple as indexing into this array.
Lastly, for a list of the numpy functions that can be optimized by the latest version of the compiler, see here. In my reworking of your code I was aggressive in removing calls to numpy functions so that the code would work over more versions of numba.
Suppose I have the following function:
def f(x,y):
    return x*y
How do I apply the function to each element of an NxM 2D NumPy array using the multiprocessing module? Using serial iteration, the code might look as follows:
import numpy as np

N = 10
M = 12
results = np.zeros(shape=(N,M))
for x in range(N):
    for y in range(M):
        results[x,y] = f(x,y)
Here's how you might parallelize your example function using multiprocessing. I've also included an almost identical pure Python function that uses non-parallel for loops, and a NumPy one-liner that achieves the same result:
import numpy as np
from multiprocessing import Pool

def f(x,y):
    return x * y

# this helper function is needed because map() can only be used for functions
# that take a single argument (see http://stackoverflow.com/q/5442910/1461210)
def splat_f(args):
    return f(*args)

# a pool of 8 worker processes
pool = Pool(8)

def parallel(M, N):
    results = pool.map(splat_f, ((i, j) for i in range(M) for j in range(N)))
    return np.array(results).reshape(M, N)

def nonparallel(M, N):
    out = np.zeros((M, N), int)
    for i in range(M):
        for j in range(N):
            out[i, j] = f(i, j)
    return out

def broadcast(M, N):
    # outer product of the two index ranges via broadcasting
    return np.multiply(*np.ogrid[:M, :N])
Now let's look at the performance:
%timeit parallel(1000, 1000)
# 1 loops, best of 3: 1.67 s per loop
%timeit nonparallel(1000, 1000)
# 1 loops, best of 3: 395 ms per loop
%timeit broadcast(1000, 1000)
# 100 loops, best of 3: 2 ms per loop
The non-parallel pure Python version beats the parallelized version by a factor of about 4, and the version using numpy array broadcasting absolutely crushes the other two.
The problem is that starting and stopping Python subprocesses carries quite a lot of overhead, and your test function is so trivial that each worker process spends only a tiny proportion of its lifetime doing useful work. Multiprocessing only makes sense if each process has a substantial amount of work to do before it is killed. You might, for example, give each worker a bigger chunk of the output array to compute (try messing around with the chunksize= parameter to pool.map()), but with such a trivial example I doubt you'll see a big improvement.
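For illustration, a hypothetical chunked variant of parallel() (the chunksize value is just something to experiment with, not a recommendation; splat_f comes from the listing above):
import numpy as np
from multiprocessing import Pool

def parallel_chunked(M, N, processes=8, chunksize=10000):
    # Larger chunks mean fewer, bigger tasks per worker, which amortizes the
    # inter-process communication overhead of pool.map().
    with Pool(processes) as pool:
        results = pool.map(splat_f,
                           ((i, j) for i in range(M) for j in range(N)),
                           chunksize=chunksize)
    return np.array(results).reshape(M, N)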
I don't know what your actual code looks like - maybe your function is big and expensive enough to warrant using multiprocessing. However, I would bet that there are much better ways to improve its performance.
Not sure multiprocessing is needed in your case. In the simple example above, you can do
X, Y = numpy.meshgrid(numpy.arange(10), numpy.arange(12))
result = X*Y
I need to create a matrix starting from the values of a weight matrix. Which is the best structure to hold the matrix in terms of speed, both when creating it and when iterating over it? I was thinking about a list of lists or a NumPy 2D array, but they both seem slow to me.
What I need:
numpy array
A = np.zeros((dim, dim))
for r in range(A.shape[0]):
    for c in range(A.shape[0]):
        if(r==c):
            A.itemset((r, c), node_degree[r])
        else:
            A.itemset((r, c), arc_weight[r,c])
or
list of lists
l = []
for r in range(dim):
    l.append([])
    for c in range(dim):
        if(r==c):
            l[r].append(node_degree[r])
        else:
            l[r].append(arc_weight[r,c])
where dim can also be 20000, node_degree is a vector and arc_weight is another matrix. I wrote it in C++ and it takes less than 0.5 seconds, while the other two in Python take more than 20 seconds. I know Python is not C++, but I need to be as fast as possible.
Thank you all.
One thing: you shouldn't be appending to the list if you already know its size.
Preallocate the memory first using a list comprehension and generate the r, c values using xrange() instead of range(), since you are using Python < 3.x (see here):
l = [[0 for c in xrange(dim)] for r in xrange(dim)]
Better yet, you can build what you need in one shot using:
l = [[node_degree[r] if r == c else arc_weight[r,c]
for c in xrange(dim)] for r in xrange(dim)]
Compared to your original implementation, this should use less memory (because of the xrange() generators) and less time, because you remove the need to reallocate memory by specifying the dimensions up front.
Numpy matrices are generally faster as they know their dimensions and entry type.
In your particular situation, since you already have the arc_weight and node_degree matrices created, you can create your matrix directly from arc_weight and then replace the diagonal:
A = np.matrix(arc_weight)
np.fill_diagonal(A, node_degree)
Another option is to replace the double loop with a function that puts the right element in each position and create a matrix from the function:
def fill_matrix(r, c):
    return arc_weight[r,c] if r != c else node_degree[r]

A = np.fromfunction(fill_matrix, (dim, dim))
As a rule of thumb, with NumPy you must avoid loops at all costs. The first method should be faster, but you should profile both to see which works for you. You should also take into account that you seem to be duplicating your data set in memory, so if it is really huge you might get into trouble. The best idea would be to create your matrix directly, avoiding arc_weight and node_degree altogether.
Edit: Some simple time comparisons between list comprehension and numpy matrix creation. Since I don't know how your arc_weight and node_degree are defined, I just made up two random functions. It seems that numpy.fromfunction complains a bit if the function has a conditional on it, so I construct the matrix in two steps.
import numpy as np

def arc_weight(a,b):
    return a+b

def node_degree(a):
    return a*a

def create_as_list(N):
    return [[arc_weight(c,r) if c!=r else node_degree(c) for c in xrange(N)] for r in xrange(N)]

def create_as_numpy(N):
    A = np.fromfunction(arc_weight, (N,N))
    np.fill_diagonal(A, node_degree(np.arange(N)))
    return A
And here the timings for N=2000:
%time A = create_as_list(2000)
CPU times: user 839 ms, sys: 16.5 ms, total: 856 ms
Wall time: 845 ms
%time A = create_as_numpy(2000)
CPU times: user 83.1 ms, sys: 12.9 ms, total: 96 ms
Wall time: 95.3 ms
Make a copy of arc_weight and fill the diagonal with values from node_degree. For a 20000-by-20000 output, it takes about 1.6 seconds on my machine:
>>> import numpy
>>> dim = 20000
>>> arc_weight = numpy.arange(dim**2).reshape([dim, dim])
>>> node_degree = numpy.arange(dim)
>>> import timeit
>>> timeit.timeit('''
... A = arc_weight.copy()
... A.flat[::dim+1] = node_degree
... ''', '''
... from __main__ import dim, arc_weight, node_degree''',
... number=1)
1.6081738501125764
Once you have your array, try not to iterate over it. Compared to broadcasted operators and NumPy built-in functions, Python-level loops are a performance disaster.