This is a contrived test case but, hopefully, it can suffice to convey the point and ask the question. Inside of a Numba njit function, I noticed that it is very costly to assign a locally computed value to an array element. Here are two example functions:
from numba import njit
import numpy as np
@njit
def slow_func(x, y):
    result = y.sum()
    for i in range(x.shape[0]):
        if x[i] > result:
            x[i] = result
        else:
            x[i] = result

@njit
def fast_func(x, y):
    result = y.sum()
    for i in range(x.shape[0]):
        if x[i] > result:
            z = result
        else:
            z = result

if __name__ == "__main__":
    x = np.random.rand(100_000_000)
    y = np.random.rand(100_000_000)

    %timeit slow_func(x, y)  # 177 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    %timeit fast_func(x, y)  # 407 ns ± 12.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I understand that the two functions aren't quite doing the same thing, but let's not worry about that for now and stay focused on the "slow assignment". Also, because Numba compiles lazily on the first call, the timings above were taken after JIT compilation. Notice that both functions assign result either to x[i] or to z, and the number of assignments is the same in both cases. However, the assignment of result to z is substantially faster. Is there a way to make slow_func as fast as fast_func?
As @PaulPanzer has already pointed out, your fast function does nothing once optimized - so what you see is basically the overhead of calling a numba function.
The interesting part is that, in order to do this optimization, numba must be replacing np.sum with its own sum implementation - otherwise the optimizer would not be able to throw the call to this function away, as it cannot look into the implementation of np.sum and would have to assume that calling it has side effects.
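A quick way to convince yourself of this is to time an njit function with an empty body and, if you are curious, to look at the optimized IR. This is only a sketch (empty_func is a made-up name); the empty function should land in the same few-hundred-nanosecond range as fast_func:

from numba import njit
import numpy as np

@njit
def empty_func(x, y):
    # no body at all - only the dispatch/call overhead remains
    pass

x = np.random.rand(100_000_000)
y = np.random.rand(100_000_000)
empty_func(x, y)               # warm-up call to trigger compilation

# %timeit empty_func(x, y)     # expected: a few hundred ns, like fast_func
# fast_func.inspect_llvm()     # optional: dict of optimized LLVM IR per signature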
Let's measure only the summation with numba:
from numba import njit

@njit
def only_sum(x, y):
    return y.sum()

%timeit only_sum(x, y)
# 112 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Well, that is disappointing: I know my machine can do more than 10^9 additions per second and read up to 13 GB/s from RAM (there are about 0.8 GB of data, so it doesn't fit in the cache), which means I would expect the summation to take between 60 and 80 ms (0.8 GB / 13 GB/s ≈ 62 ms).
And if I use numpy's version, it really does:
%timeit y.sum()
# 57 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That sounds about right! I assume numba doesn't use pairwise addition and is therefore slower (at least when the RAM bandwidth isn't the limiting factor) and less precise than numpy's version.
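Just to illustrate the idea (this is not what numpy actually does internally, and chunked_sum is only a hand-written sketch), a block-wise summation in numba breaks the single dependency chain of the naive loop and usually closes part of that gap:

from numba import njit

@njit(fastmath=True)
def chunked_sum(y):
    # sum in fixed-size blocks; the short inner loops can be vectorized,
    # which (very roughly) mimics the effect of numpy's pairwise summation
    n = y.shape[0]
    block = 2048
    total = 0.0
    i = 0
    while i + block <= n:
        s = 0.0
        for j in range(i, i + block):
            s += y[j]
        total += s
        i += block
    for j in range(i, n):  # remainder
        total += y[j]
    return total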
If we just look at the writing of the values:
@njit
def only_assign(x, y):
    res = y[0]
    for i in range(x.shape[0]):
        x[i] = res

%timeit only_assign(x, y)
# 85.2 ms ± 417 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So we see that writing is really slower than reading. The reason for that (and how it can be fixed) is explained in this great answer: the updating of caches, which numba (rightly?) doesn't bypass.
In a nutshell: while assigning values in numba isn't really slow (even if it could be sped up by using non-temporal memory accesses), the really slow part is the summation (which seems not to use pairwise summation) - it is inferior to numpy's version.
My problem is that I have an ndarray of shape (N, M, 3) and I am trying to check each element in the array using a low-level approach. Currently I am doing something like:

for i in range(N):
    for j in range(M):
        if ndarr[i][j][2] == 3:
            ndarr[i][j][0] = var1

Most of the time the ndarray I need to process is very large, usually around 1000x1000.
The same idea runs in C++ within a couple of milliseconds; in Python it takes around 30 seconds at best.
I would really appreciate it if someone could explain to me, or point me towards reading material on, how to efficiently iterate through an ndarray.
There is no way of doing that efficiently with a plain Python loop.
NumPy is a thin Python wrapper around C code/datatypes. So an ndarray is actually a multidimensional C array. That means the memory address of the array is the address of its first element, and all other elements are stored consecutively in memory.
What your Python for loop does is grab each element of the array and temporarily save it somewhere else (as a Python data structure) before stuffing it back into the C array. As I said, there is no way of doing that efficiently with a Python loop.
What you could do is use Numba's @jit to speed up the for loop (see the sketch below), or look for a NumPy routine that can iterate over the array.
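A rough sketch of the Numba route (set_channel is just an illustrative name; the answer further below shows a more complete version with timings):

import numba as nb
import numpy as np

@nb.njit
def set_channel(ndarr, var1):
    # the plain double loop is fine once it is compiled by numba
    for i in range(ndarr.shape[0]):
        for j in range(ndarr.shape[1]):
            if ndarr[i, j, 2] == 3:
                ndarr[i, j, 0] = var1

arr = np.random.randint(0, 4, size=(1000, 1000, 3))
set_channel(arr, 7)   # the first call compiles; subsequent calls are fast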
You can use logical (boolean) indexing to do this more efficiently; it might be interesting to see how it compares with your C++ implementation.
import numpy as np
a = np.random.randn(2, 4, 3)
print(a)
idx = a[:, :, 2] > 0
a[idx, 0] = 9
print(a)
In NumPy you have to use vectorized commands (usually calling a C or Cython function) to achieve good performance. As an alternative you can use Numba or Cython.
Two possible implementations:
import numba as nb
import numpy as np

def calc_np(ndarr, var1):
    ndarr[ndarr[:, :, 2] == 3, 0] = var1
    return ndarr

@nb.njit(parallel=True, cache=True)
def calc_nb(ndarr, var1):
    for i in nb.prange(ndarr.shape[0]):
        for j in range(ndarr.shape[1]):
            if ndarr[i, j, 2] == 3:
                ndarr[i, j, 0] = var1
    return ndarr
Timings
ndarr=np.random.randint(low=0,high=3,size=(1000,1000,3))
%timeit calc_np(ndarr,2)
#780 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#first call takes longer due to compilation overhead
res=calc_nb(ndarr,2)
%timeit calc_nb(ndarr,2)
#55.2 µs ± 160 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Edit
You are also using an inefficient indexing method. ndarr[i] gives a 2D view of the original 3D array, and the next indexing operation [j] creates another view of that view. This also has quite an impact on performance.
def calc_1(ndarr, var1):
    for i in range(ndarr.shape[0]):
        for j in range(ndarr.shape[1]):
            if ndarr[i][j][2] == 3:
                ndarr[i][j][0] = var1
    return ndarr

def calc_2(ndarr, var1):
    for i in range(ndarr.shape[0]):
        for j in range(ndarr.shape[1]):
            if ndarr[i, j, 2] == 3:
                ndarr[i, j, 0] = var1
    return ndarr
%timeit calc_1(ndarr,2)
#549 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_2(ndarr,2)
#321 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There is a drastic performance hit when using a keyfunc in heapq.nlargest:
>>> from random import random
>>> from heapq import nlargest
>>> data = [random() for _ in range(1234567)]
>>> %timeit nlargest(10, data)
30.2 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit nlargest(10, data, key=lambda n: n)
159 ms ± 6.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I expected a small extra cost, perhaps something like 30% - not 400%. This degradation seems to be reproducible over a few different data sizes. You can see in the source code that there is special-case handling for key is None, but otherwise the implementation looks more or less the same.
Why is performance so degraded by using a key function? Is it only due to the extra function call overhead, or is the algorithm fundamentally changed somehow by using a keyfunc?
For comparison, sorted takes about a 30% hit with the same data and lambda.
Say your iterable has N elements. Whether sorting or doing nlargest, the key function will be called N times. When sorting, that overhead is largely buried under roughly N * log2(N) other operations. But when doing nlargest of k items, there are only roughly N * log2(k) other operations, which is much smaller when k is much smaller than N.
In your example, N = 1234567 and k = 10, and so the ratio of other operations, sorting over nlargest, is roughly:
>>> log2(1234567) / log2(10)
6.0915146640862625
That this is close to 6 is purely coincidence ;-) It's the qualitative point that matters: the overhead of using a key function is much more significant for nlargest than for sorting randomly ordered data, provided k is much smaller than N.
In fact, that greatly understates the relative burden for nlargest, because the O(log2(k)) heapreplace is called in the latter only when the next element is larger than the k'th largest seen so far. Most of the time it isn't, and so the loop on such an iteration is nearly pure overhead, calling a Python-level key function just to discover that the result isn't interesting.
Quantifying that is beyond me, though; for example, on my Win10 box under Python 3.6.5, I only see a timing difference in your code a bit less than a factor of 3. That doesn't surprise me - calling a Python-level function is much more expensive than poking a list iterator and doing an integer compare (both "at C speed").
The extra overhead of calling lambda n: n so many times is really just that expensive.
In [17]: key = lambda n: n
In [18]: x = [random() for _ in range(1234567)]
In [19]: %timeit nlargest(10, x)
33.1 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [20]: %timeit nlargest(10, x, key=key)
133 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: %%timeit
    ...: for i in x:
    ...:     key(i)
    ...:
93.2 ms ± 978 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [22]: %%timeit
    ...: for i in x:
    ...:     pass
    ...:
10.1 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, the cost of calling key on all the elements accounts for almost the entirety of the overhead.
Key evaluations are equally expensive for sorted, but because the total work of sorting is more expensive, the overhead of key calls is a smaller percentage of the total. You should have compared the absolute overhead of using a key with nlargest or sorted, rather than the overhead as a percentage of the base.
In [23]: %timeit sorted(x)
542 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [24]: %timeit sorted(x, key=key)
683 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, the cost of key calls accounts for about half the overhead of using this key with sorted on this input, the rest of the overhead probably coming from the work of shuffling more data around in the sort itself.
You might wonder how nlargest manages to do so little work per element. For the no-key case, most iteration happens in the following loop:
for elem in it:
    if top < elem:
        _heapreplace(result, (elem, order))
        top = result[0][0]
        order -= 1
or for the case with a key:
for elem in it:
    k = key(elem)
    if top < k:
        _heapreplace(result, (k, order, elem))
        top = result[0][0]
        order -= 1
The crucial realization is that the top < elem and top < k branches are almost never taken. Once the algorithm has found 10 fairly large elements, most of the remaining elements are going to be smaller than the 10 current candidates. On the rare occasions where a heap element needs to be replaced, that just makes it even harder for further elements to pass the bar needed to call heapreplace.
On a random input, the number of heapreplace calls nlargest makes is expected to be logarithmic in the size of the input. Specifically, for nlargest(10, x), aside from the first 10 elements of x, element x[i] has a 10/(i+1) probability of being in the top 10 elements of x[:i+1], which is the condition necessary for a heapreplace call. By linearity of expectation, the expected number of heapreplace calls is the sum of these probabilities, and that sum is O(log(len(x))). (This analysis holds with 10 replaced by any constant, but a slightly more sophisticated analysis is needed for a variable n in nlargest(n, x).)
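That expectation is easy to check empirically; the following sketch mirrors the structure of the no-key loop shown above and simply counts how often the branch is taken (count_replacements is a made-up helper, not part of heapq):

import heapq
import random

def count_replacements(data, k=10):
    # build the initial heap of (elem, order) pairs like nlargest does,
    # then count how often a later element displaces the current k-th largest
    result = [(elem, -i) for i, elem in enumerate(data[:k])]
    heapq.heapify(result)
    top = result[0][0]
    order = -k
    count = 0
    for elem in data[k:]:
        if top < elem:
            heapq.heapreplace(result, (elem, order))
            top = result[0][0]
            order -= 1
            count += 1
    return count

random.seed(0)
data = [random.random() for _ in range(1234567)]
print(count_replacements(data))   # roughly 10 * log(1234567 / 10), i.e. around 120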
The performance story would be very different for a sorted input, where every element would pass the if check:
In [25]: sorted_x = sorted(x)
In [26]: %timeit nlargest(10, sorted_x)
463 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Over 10 times as expensive as the unsorted case!
import numpy as np
from datetime import datetime
import math

def norm(l):
    s = 0
    for i in l:
        s += i**2
    return math.sqrt(s)

def foo(a, b, f):
    l = range(a)
    s = datetime.now()
    for i in range(b):
        f(l)
    e = datetime.now()
    return e - s
print(foo(10**4, 10**5, norm))
print(foo(10**4, 10**5, np.linalg.norm))
print(foo(10**2, 10**7, norm))
print(foo(10**2, 10**7, np.linalg.norm))
I got the following output:
0:00:43.156278
0:00:23.923239
0:00:44.184835
0:01:00.343875
It seems like when np.linalg.norm is called many times for small-sized data, it runs slower than my norm function.
What is the cause of that?
First of all: datetime.now() isn't appropriate for measuring performance. It measures wall-clock time, so you may just pick a bad moment (for your computer) when a high-priority process runs or Python's GC kicks in, ...
There are dedicated timing functions/modules available in Python: the built-in timeit module, %timeit in IPython/Jupyter, and several other external modules (like perf, ...).
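For example, with the plain timeit module (outside of IPython) the same kind of measurement looks roughly like this sketch:

import math
import timeit

def norm(l):
    s = 0
    for i in l:
        s += i**2
    return math.sqrt(s)

# run the statement 100 times, repeat the whole measurement 5 times,
# and report the best run per call - similar in spirit to %timeit
best = min(timeit.repeat(lambda: norm(range(10**4)), number=100, repeat=5)) / 100
print(f"{best * 1e3:.3f} ms per call")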
Let's see what happens if I use these on your data:
import numpy as np
import math
def norm(l):
s = 0
for i in l:
s += i**2
return math.sqrt(s)
r1 = range(10**4)
r2 = range(10**2)
%timeit norm(r1)
3.34 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.linalg.norm(r1)
1.05 ms ± 3.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit norm(r2)
30.8 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.linalg.norm(r2)
14.2 µs ± 313 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
It isn't slower for short iterables; it's still faster. However, note that the real advantage of NumPy functions comes if you already have NumPy arrays:
a1 = np.arange(10**4)
a2 = np.arange(10**2)
%timeit np.linalg.norm(a1)
18.7 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.linalg.norm(a2)
4.03 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Yeah, it's quite a lot faster now: 18.7 µs vs. 1.05 ms - almost 100 times faster for 10000 elements. That means most of the time of np.linalg.norm in your examples was spent converting the range to an np.array.
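One can check that claim by timing only the conversion; this is just a sketch (np.asarray is essentially what np.linalg.norm does with a non-array input):

import numpy as np

r1 = range(10**4)

%timeit np.asarray(r1)
# this conversion alone should account for a large part of the ~1 ms
# measured for np.linalg.norm(r1) above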
You are on the right track.
np.linalg.norm has quite a high overhead on small arrays. On large arrays both the jit-compiled function and np.linalg.norm run into a memory bottleneck, which is expected for a function that does simple multiplications most of the time.
If the jitted function is called from another jitted function, it might get inlined, which can lead to an even larger advantage over the numpy norm function (a sketch of this follows after the timings below).
Example
import numba as nb
import numpy as np
@nb.njit(fastmath=True)
def norm(l):
    s = 0.
    for i in range(l.shape[0]):
        s += l[i]**2
    return np.sqrt(s)
Performance
r1 = np.array(np.arange(10**2),dtype=np.int32)
Numba:0.42µs
linalg:4.46µs
r2 = np.array(np.arange(10**4),dtype=np.int32)
Numba:8.9µs
linalg:13.4µs
r1 = np.array(np.arange(10**2),dtype=np.float64)
Numba:0.35µs
linalg:3.71µs
r2 = np.array(np.arange(10**4), dtype=np.float64)
Numba:1.4µs
linalg:5.6µs
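As a quick illustration of the inlining remark above, the jitted norm can also be called from inside another jitted function (normalize is just a made-up example reusing the norm defined in the example above):

import numba as nb
import numpy as np

@nb.njit(fastmath=True)
def normalize(l):
    # calls the jitted norm() defined above; numba can inline it,
    # so there is no Python-level call overhead at all
    n = norm(l)
    out = np.empty_like(l)
    for i in range(l.shape[0]):
        out[i] = l[i] / n
    return out

r = np.arange(10**2, dtype=np.float64)
print(normalize(r)[:3])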
Measuring Performance
Call the jit-compiled function once before the measurement (there is a compilation overhead on the first call).
Make sure the measurement is valid: since small arrays stay in the processor cache, you may get overly optimistic results that exceed your RAM throughput on realistic examples (e.g. this example).
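A minimal sketch of the first point, again reusing the jitted norm from above (exact numbers will of course differ per machine):

import timeit
import numpy as np

r = np.arange(10**4, dtype=np.float64)

norm(r)   # warm-up call: triggers the JIT compilation, so it is excluded below
t = min(timeit.repeat(lambda: norm(r), number=1000, repeat=5)) / 1000
print(f"{t * 1e6:.2f} µs per call")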
As a background, please read this quick post and clear answer:
What is the difference between np.sum and np.add.reduce?
So, for a small array, using add.reduce is faster. Let's take the following code, which I experimented with for learning; it sums a 2D array:
import numpy as np

a = np.array([[1, 4, 6], [3, 1, 2]])
print('Sum function result =', np.sum(a))

# faster for a small array:
# print(np.add.reduce(a))
# but that only reduces the dimension by 1, so do it repeatedly.
# I create a copy of a since I keep reducing it:
x = np.copy(a)
while x.size > 1:
    x = np.add.reduce(x)
print('Sum with add.reduce =', x)
So, the above seems like overkill - I assume it's better to just use sum when you don't know the size of your array, and definitely if it's more than one dimension. Does anyone use add.reduce in production code if your array isn't obvious/small? If so, why?
Any comments for code improvement are welcome.
I don't think I've used np.add.reduce when np.sum or arr.sum would do just as well. Why type something longer for a trivial speedup?
Consider a 1 axis sum on a modest size array:
In [299]: arr = np.arange(10000).reshape(100,10,5,2)
In [300]: timeit np.sum(arr,axis=0).shape
20.1 µs ± 547 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [301]: timeit arr.sum(axis=0).shape
17.6 µs ± 22.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [302]: timeit np.add.reduce(arr,axis=0).shape
18 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
arr.sum is fastest. Obviously it beats np.sum because there's one less level of function call. np.add.reduce isn't faster.
The ufunc.reduce has its place, especially for ufuncs that don't have an equivalent of sum or prod. (It seems I commented about this recently.)
I suspect you'll find more uses of np.add.at or np.add.reduceat than np.add.reduce in SO answers. Those are ufunc constructs that don't have a method equivalent.
Or search for a keyword like keepdims. That's available with all 3 constructs, but almost all examples will be using it with sum, not reduce.
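For illustration, here is a small sketch of np.add.at and np.add.reduceat mentioned above, with made-up toy data:

import numpy as np

# np.add.at: unbuffered in-place addition - repeated indices all take effect
counts = np.zeros(5, dtype=int)
idx = np.array([0, 1, 1, 3, 3, 3])
np.add.at(counts, idx, 1)
print(counts)                              # [1 2 0 3 0]

# np.add.reduceat: sums over segments given by start indices
vals = np.arange(10)
print(np.add.reduceat(vals, [0, 3, 7]))    # [ 3 18 24]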
When I was setting up those tests, I stumbled on a difference I wasn't aware of:
In [307]: np.add.reduce(arr).shape # default axis 0
Out[307]: (10, 5, 2)
In [308]: np.sum(arr) # default axis None
Out[308]: 49995000
In [309]: arr.sum()
Out[309]: 49995000
def nonzero(a):
    row, colum = a.shape
    nonzero_row = np.array([], dtype=int)
    nonzero_col = np.array([], dtype=int)
    for i in range(0, row):
        for j in range(0, colum):
            if a[i, j] != 0:
                nonzero_row = np.append(nonzero_row, i)
                nonzero_col = np.append(nonzero_col, j)
    return (nonzero_row, nonzero_col)
The above code is much slower compared to
(row, col) = np.nonzero(edges_canny)
It would be great if I could get some direction on how to increase the speed, and on why NumPy functions are so much faster.
There are 2 reasons why NumPy functions can outperform plain Python approaches:
The values inside the array are native types, not Python objects. This means NumPy doesn't need to go through the abstraction layer that Python has.
NumPy functions are (mostly) written in C. That actually only matters in some cases, because a lot of Python functions are also written in C, for example sum.
In your case you also do something really inefficient: you append to an array. That's a really expensive operation in the middle of a double loop, and an obvious (and unnecessary) bottleneck. You would get amazing speedups just by using lists for nonzero_row and nonzero_col and only converting them to arrays just before you return:
def nonzero_list_based(a):
    row, colum = a.shape
    a = a.tolist()
    nonzero_row = []
    nonzero_col = []
    for i in range(0, row):
        for j in range(0, colum):
            if a[i][j] != 0:
                nonzero_row.append(i)
                nonzero_col.append(j)
    return (np.array(nonzero_row), np.array(nonzero_col))
The timings:
import numpy as np

def nonzero_original(a):
    row, colum = a.shape
    nonzero_row = np.array([], dtype=int)
    nonzero_col = np.array([], dtype=int)
    for i in range(0, row):
        for j in range(0, colum):
            if a[i, j] != 0:
                nonzero_row = np.append(nonzero_row, i)
                nonzero_col = np.append(nonzero_col, j)
    return (nonzero_row, nonzero_col)
arr = np.random.randint(0, 10, (100, 100))
%timeit np.nonzero(arr)
# 315 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit nonzero_original(arr)
# 759 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nonzero_list_based(arr)
# 13.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even though it's 40 times slower than the NumPy operation, it's still almost 60 times faster than your approach. There's an important lesson here: avoid np.append whenever possible!
One additional reason NumPy outperforms alternative approaches is that it (mostly) uses state-of-the-art algorithms (or "imports" them, i.e. BLAS/LAPACK/ATLAS/MKL) to solve the problems. These algorithms have been optimized for correctness and speed over years (if not decades). You shouldn't expect to find a faster or even comparable solution.