Why numpy.where is much faster than alternatives - python

I'm trying to speed up the following code:
import time
import numpy as np
np.random.seed(10)
b = np.random.rand(10000, 1000)
def f(a=1):
    tott = 0
    for _ in range(a):
        q = np.array(b)
        t1 = time.time()
        for i in range(len(q)):
            for j in range(len(q[0])):
                if q[i][j] > 0.5:
                    q[i][j] = 1
                else:
                    q[i][j] = -1
        t2 = time.time()
        tott += t2 - t1
    print(tott / a)
As you can see, the function mainly iterates in a double loop. So I tried np.nditer, np.vectorize and map instead. These give some speedup (about 4-5x, except np.nditer), but with np.where(q>0.5,1,-1) the speedup is almost 100x.
How can I iterate over numpy arrays as fast as np.where does? And why is it so much faster?

It's because the core of numpy is implemented in C. You're basically comparing the speed of C with Python.
If you want to use the speed advantage of numpy, you should make as few calls as possible in your Python code. If you use a Python loop, you have already lost, even if you only call numpy functions inside that loop. Use the higher-level functions provided by numpy (that's why they ship so many special functions). Internally, these use a much more efficient (C) loop.
You can implement a function in C (with loops) yourself and call that from Python. That should give comparable speeds.
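As a small illustration of the point above, the sketch below runs the question's double loop and a single np.where call on the same (deliberately tiny) array and checks that they produce the same values. The helper name threshold_loop and the array size are made up for this example.

```python
import numpy as np

def threshold_loop(arr):
    """Element-by-element version: every comparison and store goes through Python."""
    out = np.array(arr)  # copy, so the input is left untouched
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = 1 if out[i, j] > 0.5 else -1
    return out

rng = np.random.default_rng(10)
small = rng.random((50, 40))

looped = threshold_loop(small)
# One Python-level call; the loop over all elements runs in NumPy's compiled code.
vectorized = np.where(small > 0.5, 1, -1)

print(np.array_equal(looped, vectorized))
```

The results hold the same values, but the loop pays Python's per-element overhead once for every entry, while np.where pays it once for the whole array.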

To answer this question, you can gain the same speed (100x acceleration) by using the numba library:
import numpy as np
from numba import njit
def f(b):
    q = np.zeros_like(b)
    for i in range(b.shape[0]):
        for j in range(b.shape[1]):
            # compare the input b, not the freshly zeroed output q
            if b[i][j] > 0.5:
                q[i][j] = 1
            else:
                q[i][j] = -1
    return q
@njit
def f_jit(b):
    q = np.zeros_like(b)
    for i in range(b.shape[0]):
        for j in range(b.shape[1]):
            if b[i][j] > 0.5:
                q[i][j] = 1
            else:
                q[i][j] = -1
    return q
Compare the speed:
Plain Python
%timeit f(b)
592 ms ± 5.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numba (just-in-time compiled using LLVM ~ C speed)
%timeit f_jit(b)
5.97 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Related

Numba Slow Array Element Assignment to Variable

This is a contrived test case but, hopefully, it can suffice to convey the point and ask the question. Inside of a Numba njit function, I noticed that it is very costly to assign a locally computed value to an array element. Here are two example functions:
from numba import njit
import numpy as np
@njit
def slow_func(x, y):
    result = y.sum()
    for i in range(x.shape[0]):
        if x[i] > result:
            x[i] = result
        else:
            x[i] = result
@njit
def fast_func(x, y):
    result = y.sum()
    for i in range(x.shape[0]):
        if x[i] > result:
            z = result
        else:
            z = result
if __name__ == "__main__":
    x = np.random.rand(100_000_000)
    y = np.random.rand(100_000_000)
    %timeit slow_func(x, y) # 177 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    %timeit fast_func(x, y) # 407 ns ± 12.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I understand that the two functions aren't quite doing the same thing, but let's not worry about that for now and stay focused on the "slow assignment". Also, because Numba compiles lazily, the timings above were re-run after JIT compilation. Notice that both functions assign result to either x[i] or z, and the number of assignments is the same in both cases. However, the assignment of result to z is substantially faster. Is there a way to make slow_func as fast as fast_func?
As @PaulPanzer has already pointed out, your fast function does nothing once optimized, so what you see is basically the overhead of calling a numba function.
The interesting part is, that in order to do this optimization, numba must be replacing np.sum with its own sum-implementation - otherwise the optimizer would not be able to throw the call to this function away, as it cannot look into the implementation of np.sum and must assume that there are side effects from calling this function.
Let's measure only the summation with numba:
from numba import njit
@njit
def only_sum(x, y):
    return y.sum()
%timeit only_sum(y,x)
# 112 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Well, that is disappointing: I know my machine can do more than 10^9 additions per second and read up to 13GB/s from RAM (there are about 0.8GB of data, so it doesn't fit in the cache), which means I would expect the summation to take between 60 and 80 ms.
And if I use numpy's version, it really does:
%timeit y.sum()
# 57 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That sounds about right! I assume, numba doesn't use the pairwise addition and thus is slower (if the RAM is fast enough to be the bottleneck) and less precise than numpy's version.
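To make the precision remark concrete, here is a minimal sketch of pairwise summation next to plain left-to-right accumulation, both in float32. It only illustrates the idea: NumPy's real implementation is in C and differs in detail, and the function names and the block size of 128 are choices made for this example.

```python
import math
import numpy as np

def naive_sum(values):
    """Left-to-right accumulation, keeping the running total in float32."""
    total = np.float32(0.0)
    for v in values:
        total = np.float32(total + v)
    return total

def pairwise_sum(values, block=128):
    """Recursively split the input in halves; sum small blocks naively."""
    n = len(values)
    if n <= block:
        return naive_sum(values)
    mid = n // 2
    left = pairwise_sum(values[:mid], block)
    right = pairwise_sum(values[mid:], block)
    return np.float32(left + right)

values = np.full(100_000, 0.1, dtype=np.float32)
# Reference value: exactly rounded sum of the float32 inputs, computed in double precision.
exact = math.fsum(float(v) for v in values)

err_naive = abs(float(naive_sum(values)) - exact)
err_pairwise = abs(float(pairwise_sum(values)) - exact)
print(err_naive, err_pairwise)
```

In the naive version the running total becomes much larger than each addend, so every addition rounds away part of the small value. In the pairwise version the two operands of each addition have similar magnitude, which keeps the accumulated rounding error much smaller.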
If we just look at the writing of the values:
@njit
def only_assign(x, y):
    res = y[0]
    for i in range(x.shape[0]):
        x[i] = res
%timeit only_assign(x,y)
# 85.2 ms ± 417 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So we see that writing really is slower than reading. The reason for that (and how it can be fixed) is explained in this great answer: the update of caches, which numba (rightly?) doesn't bypass.
In a nutshell: while assigning values in numba isn't really slow (even if it could be sped up by using non-temporal memory accesses), the really slow part is the summation, which seems not to use pairwise summation and is inferior to numpy's version.

Building a numpy array as a function of previous element

I would like to create a numpy array where the first element is a defined constant, and every next element is defined as the function of the previous element in the following way:
import numpy as np
def build_array_recursively(length, V_0, function):
    returnList = np.empty(length)
    returnList[0] = V_0
    for i in range(1, length):
        returnList[i] = function(returnList[i-1])
    return returnList
d_t = 0.05
print(build_array_recursively(20, 0.3, lambda x: x-x*d_t+x*x/2*d_t*d_t-x*x*x/6*d_t*d_t*d_t))
The print method above outputs
[0.3 0.28511194 0.27095747 0.25750095 0.24470843 0.23254756 0.22098752
0.20999896 0.19955394 0.18962586 0.18018937 0.17122037 0.16269589
0.15459409 0.14689418 0.13957638 0.13262186 0.1260127 0.11973187 0.11376316]
Is there a fast way of doing this in numpy without a for loop?
If so, is there a way to handle the two elements before the current one, e.g. can a Fibonacci array be constructed similarly?
I found a similar question here:
Is it possible to vectorize recursive calculation of a NumPy array where each element depends on the previous one?
but it was not answered in general. In my example, the difference equation is difficult to solve manually.
This is faster for what you want to do. You don't have to use recursion for the function.
Calculate the element based on previous element. Append calculated element to a list, and then change the list to numpy.
def method2(length, V_0, d_t):
    k = [V_0]
    x = V_0
    for i in range(1, length):
        x = x - x * d_t + x * x / 2 * d_t * d_t - x * x * x / 6 * d_t * d_t * d_t
        k.append(x)
    return np.asarray(k)
print(method2(20,0.3, 0.05))
Running your existing method 10000 times takes 0.438 seconds, while method2 takes 0.097 seconds.
Using a function to make the code clearer (instead of the inline lambda):
def fn(x):
    return x-x*d_t+x*x/2*d_t*d_t-x*x*x/6*d_t*d_t*d_t
And a function that combines elements of build_array_recursively and method2:
def foo1(length, V_0, function):
    returnList = np.empty(length)
    returnList[0] = x = V_0
    for i in range(1, length):
        returnList[i] = x = function(x)
    return returnList
In [887]: timeit build_array_recursively(20,0.3, fn);
61.4 µs ± 63 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [888]: timeit method2(20,0.3, fn);
16.9 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [889]: timeit foo1(20,0.3, fn);
13 µs ± 29.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The main time saver in method2 and foo1 is carrying over x, the last value, from one iteration to the next, rather than indexing with returnList[i-1].
The accumulation method, assigning to a preallocated array, or list append, is less important. Performance is usually similar.
Here the calculation is simple enough that details of what you do in the loop makes a big difference in the overall time.
All of these are loops. Some ufuncs have a reduce (and accumulate) method that can apply the function repeatedly to the elements of the input array. np.sum, np.cumsum, etc. make use of this. But you can't do that with a general Python function.
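For reference, this is what accumulate looks like on built-in ufuncs, together with a special case where it does replace the loop: a purely linear recurrence x[i] = x[i-1] * factor is just a running product. The function names linear_loop and linear_accumulate are invented for this sketch, and the trick does not carry over to the nonlinear update in the question.

```python
import numpy as np

# accumulate applies the ufunc cumulatively along the array.
print(np.add.accumulate([1, 2, 3, 4]))       # running sum, like np.cumsum
print(np.multiply.accumulate([1, 2, 3, 4]))  # running product, like np.cumprod

def linear_loop(length, v0, factor):
    """x[0] = v0, x[i] = x[i-1] * factor, written as an explicit loop."""
    out = np.empty(length)
    out[0] = x = v0
    for i in range(1, length):
        out[i] = x = x * factor
    return out

def linear_accumulate(length, v0, factor):
    """Same recurrence expressed as a running product, with no Python loop."""
    steps = np.full(length, factor, dtype=float)
    steps[0] = v0
    return np.multiply.accumulate(steps)

print(np.allclose(linear_loop(20, 0.3, 0.95), linear_accumulate(20, 0.3, 0.95)))
```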
You have to use some sort of compilation tool like numba to perform this sort of loop much faster.
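A minimal sketch of the numba route, assuming the update rule is written directly inside the compiled function (passing an ordinary Python lambda into an njit function is not something to rely on). The name build_array_jit is hypothetical, and the try/except fallback is only there so the snippet still runs, uncompiled, when numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without numba, fall back to plain Python.
    def njit(func):
        return func

@njit
def build_array_jit(length, v0, d_t):
    """Fill out[i] from out[i-1] using the update rule from the question."""
    out = np.empty(length)
    out[0] = v0
    x = v0
    for i in range(1, length):
        x = x - x * d_t + x * x / 2 * d_t * d_t - x * x * x / 6 * d_t * d_t * d_t
        out[i] = x
    return out

result = build_array_jit(20, 0.3, 0.05)
print(result)
```

The first call includes compilation time, so any timing comparison should be made on a second call.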

Is there an efficient way to iterate through all ndarray elements

My problem is that I have an ndarray of shape (N,M,3) and I am trying to check each element in the array using a low-level approach. Currently I am doing something like:
for i in range(N):
    for j in range(M):
        if ndarr[i][j][2] == 3:
            ndarr[i][j][0] = var1
Most of the time the ndarray I need to process is very large, usually around 1000x1000.
The same idea runs in C++ within a couple of milliseconds; in Python it takes around 30 seconds at best.
I would really appreciate it if someone could explain, or point me towards reading material on, how to efficiently iterate through an ndarray.
There is no way of doing that efficiently.
NumPy is a small Python wrapper around C code/datatypes. So an ndarray is actually a multidimensional C array. That means the memory address of the array is the address of the first element of the array. All other elements are stored consecutively in memory.
What your Python for loop does is grab each element of the array and temporarily store it somewhere else (as a Python data structure) before stuffing it back into the C array. As I have said, there is no way of doing that efficiently with a Python loop.
What you could do is use Numba's @jit to speed up the for loop, or look for a NumPy routine that can iterate over the array.
You can use logical indexing to do this more efficiently. It might be interesting to see how it compares with your C++ implementation.
import numpy as np
a = np.random.randn(2, 4, 3)
print(a)
idx = a[:, :, 2] > 0
a[idx, 0] = 9
print(a)
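Applied to the condition from the question (third component equal to 3, first component overwritten), the same masking idea could look like the sketch below, which also checks the result against an explicit loop. The function names and the test array are illustrative only.

```python
import numpy as np

def set_with_loop(arr, var1):
    """Reference version with explicit Python loops."""
    out = arr.copy()
    n, m, _ = out.shape
    for i in range(n):
        for j in range(m):
            if out[i, j, 2] == 3:
                out[i, j, 0] = var1
    return out

def set_with_mask(arr, var1):
    """Vectorized version using a boolean mask."""
    out = arr.copy()
    mask = out[:, :, 2] == 3   # boolean array of shape (N, M)
    out[mask, 0] = var1        # write only where the mask is True
    return out

rng = np.random.default_rng(0)
arr = rng.integers(0, 5, size=(30, 20, 3))

print(np.array_equal(set_with_loop(arr, 99), set_with_mask(arr, 99)))
```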
In Numpy you have to use vectorized-commands (usually calling a C or Cython-function) to achieve good performance. As an alternative you can use Numba or Cython.
Two possible implementations
import numba as nb
import numpy as np
def calc_np(ndarr, var1):
    # same condition as the loop: test component 2, write component 0
    ndarr[ndarr[:, :, 2] == 3, 0] = var1
    return ndarr
@nb.njit(parallel=True, cache=True)
def calc_nb(ndarr, var1):
    for i in nb.prange(ndarr.shape[0]):
        for j in range(ndarr.shape[1]):
            if ndarr[i, j, 2] == 3:
                ndarr[i, j, 0] = var1
    return ndarr
Timings
ndarr = np.random.randint(low=0, high=3, size=(1000, 1000, 3))
%timeit calc_np(ndarr, 2)
# 780 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# first call takes longer due to compilation overhead
res = calc_nb(ndarr, 2)
%timeit calc_nb(ndarr, 2)
# 55.2 µs ± 160 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Edit
You also use a wrong indexing method. ndarr[i] gives a 2d view on the original 3d-array, the next indexing operation [j] gives the next view on the previous view. This also has quite an impact on performance.
def calc_1(ndarr, var1):
    for i in range(ndarr.shape[0]):
        for j in range(ndarr.shape[1]):
            if ndarr[i][j][2] == 3:
                ndarr[i][j][0] = var1
    return ndarr
def calc_2(ndarr, var1):
    for i in range(ndarr.shape[0]):
        for j in range(ndarr.shape[1]):
            if ndarr[i, j, 2] == 3:
                ndarr[i, j, 0] = var1
    return ndarr
%timeit calc_1(ndarr,2)
# 549 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_2(ndarr,2)
# 321 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
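The view behaviour described in this edit can be checked directly. In the short sketch below (array contents chosen arbitrarily), chained indexing first builds an intermediate view that shares memory with the original array, while the tuple form reaches the element in one indexing operation.

```python
import numpy as np

arr = np.arange(24).reshape(2, 4, 3)

# First step of arr[1][2][0]: a 2-D view of the original data, no copy.
row = arr[1]
print(row.shape, np.shares_memory(row, arr))

# Both spellings address the same element...
print(arr[1][2][0] == arr[1, 2, 0])

# ...and a write through the chained form still lands in the original array,
# it just creates two temporary view objects on the way.
arr[1][2][0] = -1
print(arr[1, 2, 0])
```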

Numba complains about typing - but all types are being provided

I have a problem with Numba typing - I read the manual, but eventually hit a brick wall.
The function in question is a part of a bigger project - though it needs to run fast - Python lists are out of the question, hence I've decided on trying Numba. Sadly, the function fails in nopython=True mode, despite the fact that - according to my understanding - all types are being provided.
The code is as follows:
import numpy as np
from numba import jit, njit, uint8, int64, typeof
@jit(uint8[:,:,:](int64))
def findWhite(cropped):
    h1 = int64(0)
    for i in cropped:
        for j in i:
            if np.sum(j) == 765:
                h1 = h1 + int64(1)
            else:
                pass
    return h1
also, separately:
print(typeof(cropped))
array(uint8, 3d, C)
print(typeof(h1))
int64
In this case 'cropped' is a large uint8 3D C matrix (RGB tiff file comprehension - PIL.Image). Could someone please explain to a Numba newbie what am I doing wrong?
Have you considered using Numpy? That's often a good intermediate between Python lists and Numba, something like:
h1 = (cropped.sum(axis=-1) == 765).sum()
or
h1 = (cropped == 255).all(axis=-1).sum()
The example code you provide is not valid Numba. Your signature is also incorrect, since the input is a 3D array and the output an integer, it should probably be:
@njit(int64(uint8[:,:,:]))
Looping over the array like you do is not valid code. A close translation of your code would be something like this:
@njit(int64(uint8[:,:,:]))
def findWhite(cropped):
    h1 = int64(0)
    ys, xs, n_bands = cropped.shape
    for i in range(ys):
        for j in range(xs):
            if cropped[i, j, :].sum() == 765:
                h1 += 1
    return h1
But that isn't very fast and doesn't beat Numpy on my machine. With Numba it's fine to explicitly loop over every element in an array, this is already a lot faster:
@njit(int64(uint8[:,:,:]))
def findWhite_numba(cropped):
    h1 = int64(0)
    ys, xs, zs = cropped.shape
    for i in range(ys):
        for j in range(xs):
            incr = 1
            for k in range(zs):
                if cropped[i, j, k] != 255:
                    incr = 0
                    break
            h1 += incr
    return h1
For a 5000x5000x3 array these are the result for me:
Numpy (h1 = (cropped == 255).all(axis=-1).sum()):
427 ms ± 6.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
findWhite:
612 ms ± 6.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
findWhite_numba:
31 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A benefit of the Numpy method is that it generalizes to any amount of dimensions.
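As a quick sanity check of the two NumPy one-liners from this answer, the sketch below builds a tiny made-up uint8 image with exactly two white pixels and confirms that both expressions count them. The image contents are invented for the example.

```python
import numpy as np

img = np.zeros((4, 5, 3), dtype=np.uint8)
img[0, 0] = 255                # white
img[2, 3] = 255                # white
img[1, 1] = (255, 255, 0)      # two bands saturated, not white
img[3, 4] = (254, 255, 255)    # almost white, not white

# Sum the three bands per pixel; NumPy widens the uint8 accumulator, so 765 fits.
by_sum = (img.sum(axis=-1) == 765).sum()
# Require every band to be 255.
by_all = (img == 255).all(axis=-1).sum()

print(by_sum, by_all)
```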

Why is numpy faster at finding non-zero elements in a matrix?

def nonzero(a):
    row,colum = a.shape
    nonzero_row = np.array([],dtype=int)
    nonzero_col = np.array([],dtype=int)
    for i in range(0,row):
        for j in range(0,colum):
            if a[i,j] != 0:
                nonzero_row = np.append(nonzero_row,i)
                nonzero_col = np.append(nonzero_col,j)
    return (nonzero_row,nonzero_col)
The above code is much slower compared to
(row,col) = np.nonzero(edges_canny)
It would be great if I could get some direction on how to increase the speed, and on why numpy functions are so much faster.
There are two reasons why NumPy functions can outperform Python's types:
1. The values inside the array are native types, not Python types. This means NumPy doesn't need to go through the abstraction layer that Python has.
2. NumPy functions are (mostly) written in C. That actually only matters in some cases, because a lot of Python functions are also written in C, for example sum.
In your case you also do something really inefficient: You append to an array. That's one really expensive operation in the middle of a double loop. That's an obvious (and unnecessary) bottleneck right there. You would get amazing speedups just by using lists as nonzero_row and nonzero_col and only convert them to array just before you return:
def nonzero_list_based(a):
    row,colum = a.shape
    a = a.tolist()
    nonzero_row = []
    nonzero_col = []
    for i in range(0,row):
        for j in range(0,colum):
            if a[i][j] != 0:
                nonzero_row.append(i)
                nonzero_col.append(j)
    return (np.array(nonzero_row), np.array(nonzero_col))
The timings:
import numpy as np
def nonzero_original(a):
    row,colum = a.shape
    nonzero_row = np.array([],dtype=int)
    nonzero_col = np.array([],dtype=int)
    for i in range(0,row):
        for j in range(0,colum):
            if a[i,j] != 0:
                nonzero_row = np.append(nonzero_row,i)
                nonzero_col = np.append(nonzero_col,j)
    return (nonzero_row,nonzero_col)
arr = np.random.randint(0, 10, (100, 100))
%timeit np.nonzero(arr)
# 315 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit nonzero_original(arr)
# 759 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nonzero_list_based(arr)
# 13.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even though it's 40 times slower than the NumPy operation it's still more than 60 times faster than your approach. There's an important lesson here: Avoid np.append whenever possible!
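Why np.append is so costly inside a loop can be seen from a small sketch: it never grows the array in place but returns a newly allocated array containing a copy of the old contents. The second part checks, on a small made-up array, that collecting indices in Python lists gives the same result as np.nonzero. The helper name nonzero_lists is hypothetical.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)
# a is unchanged; b is a separate array holding a copy of a plus the new value.
print(a, b, np.shares_memory(a, b))

def nonzero_lists(arr):
    """Collect indices in Python lists and convert to arrays once at the end."""
    rows, cols = [], []
    for i, row in enumerate(arr.tolist()):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
    return np.array(rows, dtype=int), np.array(cols, dtype=int)

rng = np.random.default_rng(1)
small = rng.integers(0, 3, size=(8, 6))

rows, cols = nonzero_lists(small)
ref_rows, ref_cols = np.nonzero(small)
print(np.array_equal(rows, ref_rows) and np.array_equal(cols, ref_cols))
```

Since every np.append call copies everything collected so far, the total work of the original loop grows roughly quadratically with the number of hits, whereas list.append is amortized constant time.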
One additional reason why NumPy outperforms alternative approaches is that it (mostly) uses state-of-the-art algorithms, or "imports" them (i.e. BLAS/LAPACK/ATLAS/MKL), to solve the problems. These algorithms have been optimized for correctness and speed over years (if not decades). You shouldn't expect to find a faster or even comparable solution.
