I'm trying to find the best way to compute the minimum of the element-wise products between two sets of vectors. The usual matrix multiplication C = A·B computes C_ij as the sum of the pairwise products of the elements of the vectors A_i (the i-th row of A) and B^T_j (the j-th column of B). I would like to compute the minimum of those pairwise products instead of their sum. I can't find an efficient way to do this between two matrices with numpy.
One way to achieve this would be to generate the 3D array of the pairwise products between A and B (before the sum) and then take the minimum over the third dimension. But this would lead to a huge memory footprint (and I actually don't know how to do it).
Do you have any idea how I could achieve this operation ?
Example:
A = [[1,1],[1,1]]
B = [[0,2],[2,1]]
matrix matmul:
C = [[1*0+1*2, 1*2+1*1], [1*0+1*2, 1*2+1*1]] = [[2,3],[2,3]]
minimum matmul:
C = [[min(1*0,1*2), min(1*2,1*1)], [min(1*0,1*2), min(1*2,1*1)]] = [[0,1],[0,1]]
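For reference, a minimal (slow) sketch of the operation exactly as defined above, useful for checking the vectorized versions. Note that the solutions below pair rows of A with rows of B (effectively computing against B.T); for the symmetric B of this example the results coincide.
import numpy as np

# Naive reference implementation, for checking only (Python loops, slow):
def min_matmul_naive(A, B):
    A, B = np.asarray(A), np.asarray(B)
    C = np.empty((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            C[i, j] = np.min(A[i, :] * B[:, j])  # min instead of sum
    return C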
Use broadcasting after extending A to 3D -
A = np.asarray(A)
B = np.asarray(B)
C_out = np.min(A[:,None]*B,axis=2)
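To see why this works (and where the memory goes), these are the shapes involved for A of shape (m, n) and B of shape (k, n):
# A[:, None]          -> (m, 1, n)
# A[:, None] * B      -> (m, k, n)   full cube of pairwise products
# np.min(..., axis=2) -> (m, k)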
If you care about memory footprint, use numexpr module to be efficient about it -
import numexpr as ne
C_out = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
Timings on large arrays -
In [12]: A = np.random.rand(200,200)
In [13]: B = np.random.rand(200,200)
In [14]: %timeit np.min(A[:,None]*B,axis=2)
34.4 ms ± 614 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
29.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: A = np.random.rand(300,300)
In [17]: B = np.random.rand(300,300)
In [18]: %timeit np.min(A[:,None]*B,axis=2)
113 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
102 ms ± 691 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, there's some improvement with numexpr, but maybe not as much as I was expecting.
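If memory footprint is the real constraint, another option (my sketch, not from the original answer) is to process A in blocks of rows, so that only a (blk, k, n) temporary is live at any time; the block size blk is a tunable assumption:
def min_matmul_blocked(A, B, blk=64):
    # Only a (blk, k, n) product cube exists at any moment.
    out = np.empty((A.shape[0], B.shape[0]), dtype=np.result_type(A, B))
    for i in range(0, A.shape[0], blk):
        out[i:i + blk] = np.min(A[i:i + blk, None] * B, axis=2)
    return out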
Numba can also be an option
I was a bit surprised by the not particularly good numexpr timings, so I tried a Numba version. For large arrays this can be optimized further (much the same principles as for a dgemm apply).
import numpy as np
import numba as nb
import numexpr as ne
@nb.njit(fastmath=True, parallel=True)
def min_pairwise_prod(A, B):
    assert A.shape[1] == B.shape[1]
    res = np.empty((A.shape[0], B.shape[0]))
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            min_prod = A[i, 0] * B[j, 0]
            for k in range(B.shape[1]):
                prod = A[i, k] * B[j, k]
                if prod < min_prod:
                    min_prod = prod
            res[i, j] = min_prod
    return res
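A quick sanity check of the jitted version against the broadcasting one (illustrative; the first call also triggers compilation):
A = np.random.rand(300, 300)
B = np.random.rand(300, 300)
assert np.allclose(min_pairwise_prod(A, B), np.min(A[:, None] * B, axis=2))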
Timings
A=np.random.rand(300,300)
B=np.random.rand(300,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
5.56 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
26 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
87.7 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
110 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A=np.random.rand(1000,300)
B=np.random.rand(1000,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
50.6 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
296 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
992 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
1.27 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Related
Conjugating a complex number appears to be about 30 times faster if the type() of the complex number is complex rather than numpy.complex128; see the minimal example below. However, taking the absolute value takes about the same time, and taking the real and the imaginary parts is only about 3 times faster.
Why is the conjugate slower by that much? When I take a single value from a large complex-valued array, it seems I should cast it to complex first (the complex conjugation is part of a larger code which has many (> 10^6) iterations).
import numpy as np
np.random.seed(100)
a = (np.random.rand(1) + 1j*np.random.rand(1))[0]
b = complex(a)
%timeit a.conjugate() # 2.95 µs ± 24 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a.conj() # 2.86 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b.conjugate() # 82.8 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit abs(a) # 112 ns ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit abs(b) # 99.6 ns ± 0.623 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.real # 145 ns ± 0.259 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit b.real # 54.8 ns ± 0.121 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.imag # 144 ns ± 0.771 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit b.imag # 55.4 ns ± 0.297 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Calling NumPy routines always comes at a fixed cost, which in this case is more expensive than the cost of the Python-native routine.
As soon as you start processing more than one number (possibly millions) at once, NumPy will be much faster:
import numpy as np
N = 10
a = np.random.rand(N) + 1j*np.random.rand(N)
b = [complex(x) for x in a]
%timeit a.conjugate() # 481 ns ± 1.39 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
%timeit [x.conjugate() for x in b] # 605 ns ± 6.11 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
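Scaling the same comparison up makes the point: at this size the single vectorized call amortizes its fixed overhead over the whole array, while the list pays the per-call cost a million times (timings omitted, as they are machine-dependent):
N = 1000000
a = np.random.rand(N) + 1j*np.random.rand(N)
b = [complex(x) for x in a]
%timeit a.conjugate()               # one vectorized call over all N values
%timeit [x.conjugate() for x in b]  # N Python-level calls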
I have these two pieces of code that do the same thing but for different data structures:
res = np.array([np.array([2.0, 4.0, 6.0]), np.array([8.0, 10.0, 12.0])], dtype=int)
%timeit np.sum(res, axis=1)
4.08 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
list_obj_array = np.ndarray((2,), dtype=object)
list_obj_array[0] = [2.0, 4.0, 6.0]
list_obj_array[1] = [8.0, 10.0, 12.0]
v_func = np.vectorize(np.sum, otypes=[int])
%timeit v_func(list_obj_array)
20.6 µs ± 486 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The second one is 5 times slower. Is there a better way to optimize this?
@nb.jit()
def nb_np_sum(arry_list):
    return [np.sum(row) for row in arry_list]
%timeit nb_np_sum(list_obj_array)
30.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
@nb.jit()
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
%timeit nb_sum(list_obj_array)
13.6 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Best so far (thanks @hpaulj):
%timeit [sum(l) for l in list_obj_array]
850 ns ± 115 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
@nb.njit()
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'sum': cannot determine Numba type of <class 'builtin_function_or_method'>
File "<ipython-input-54-3bb48c5273bb>", line 3:
def nb_sum(arry_list):
return [sum(row) for row in arry_list]
For a longer array:
list_obj_array = np.ndarray((n,), dtype=object)
for i in range(n):
list_obj_array[i] = list(range(7))
the vectorized version comes closer to the best option (the list comprehension):
%timeit [sum(l) for l in list_obj_array]
23.4 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit v_func(list_obj_array)
29.6 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numba is still slower:
%timeit nb_sum(list_obj_array)
74.4 µs ± 6.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Since you used otypes, you've read enough of the vectorize docs to know that it is not a performance tool.
In [430]: timeit v_func(list_obj_array)
38.3 µs ± 894 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
A list comprehension is faster:
In [431]: timeit [sum(l) for l in list_obj_array]
2.08 µs ± 62.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Even better if you start with a list of lists instead of an object dtype array:
In [432]: alist = list_obj_array.tolist()
In [433]: timeit [sum(l) for l in alist]
542 ns ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Edit
np.frompyfunc is faster than np.vectorize, especially when working with object dtype arrays:
In [459]: np.frompyfunc(sum,1,1)(list_obj_array)
Out[459]: array([12.0, 30.0], dtype=object)
In [460]: timeit np.frompyfunc(sum,1,1)(list_obj_array)
2.22 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As I've seen elsewhere, frompyfunc is competitive with the list comprehension.
Interestingly, using np.sum instead of sum slows it down. I think that's because np.sum applied to lists has the overhead of converting the lists to arrays, while sum applied to lists of numbers is quite fast, since it uses Python's own compiled code.
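That conversion overhead is easy to isolate; a small illustration (variable names are mine):
lst = [2.0, 4.0, 6.0]
arr = np.asarray(lst)
%timeit sum(lst)     # pure Python, no conversion
%timeit np.sum(lst)  # pays an implicit np.asarray(lst) on every call
%timeit np.sum(arr)  # no conversion, only the fixed NumPy call overhead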
In [461]: timeit np.frompyfunc(np.sum,1,1)(list_obj_array)
30.3 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So let's try sum in your vectorize:
In [462]: v_func = np.vectorize(sum, otypes=[int])
In [463]: timeit v_func(list_obj_array)
8.7 µs ± 331 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Much better.
I have to do a lot of dot products in my data processing pipeline, so I was experimenting with the following two pieces of code, where one is 3 times as efficient (in terms of runtime) as its slower counterpart.
slowest method (with arrays created on the fly)
In [33]: %timeit np.dot(np.arange(200000), np.arange(200000, 400000))
352 µs ± 958 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
fastest method (with static arrays)
In [34]: vec1_arr = np.arange(200000)
In [35]: vec2_arr = np.arange(200000, 400000)
In [36]: %timeit np.dot(vec1_arr, vec2_arr)
121 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Why is the first method, which generates the arrays dynamically, 3x slower than the second? Is it because in the first method much of this extra time is spent allocating memory for the elements? Or are other factors contributing to the degradation?
To gain a little more understanding, I also replicated the setup in pure Python. Surprisingly, there is no performance difference between doing it one way or the other, although it is slower than the numpy implementation, which is obvious and expected.
In [42]: %timeit sum(map(operator.mul, range(200000), range(200000, 400000)))
12.5 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [38]: vec1 = range(200000)
In [39]: vec2 = range(200000, 400000)
In [40]: %timeit sum(map(operator.mul, vec1, vec2))
12.5 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The behaviour in the case of pure Python is clear, because the range function doesn't actually create all those elements; it evaluates lazily (they are generated on the fly).
Note: the pure Python implementation is just to convince myself that the array allocation might be the factor causing the drag. It's not meant to be compared with the NumPy implementation.
The pure Python test is not fair, because np.arange(200000) really returns an array while range(200000) only returns a lazy range object. So in the pure Python test, both methods generate their values on the fly.
import operator
%timeit sum(map(operator.mul, range(200000), range(200000, 400000)))
# 15.1 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vec1 = range(200000)
vec2 = range(200000, 400000)
%timeit sum(map(operator.mul, vec1, vec2))
# 15.2 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vec1 = list(range(200000))
vec2 = list(range(200000, 400000))
%timeit sum(map(operator.mul, vec1, vec2))
# 12.4 ms ± 716 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And we can see the time cost of allocation:
import numpy as np
%timeit np.arange(200000), np.arange(200000, 400000)
# 632 µs ± 9.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.dot(np.arange(200000), np.arange(200000, 400000))
# 703 µs ± 5.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vec1_arr = np.arange(200000)
vec2_arr = np.arange(200000, 400000)
%timeit np.dot(vec1_arr, vec2_arr)
# 77.7 µs ± 427 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Makes sense.
The difference in speed is due to allocating the arrays in the slower case. Below is %timeit output that takes the allocation of the arrays into account in both cases; the OP's timeit commands only accounted for allocation in the slower case, not in the faster one.
%timeit np.dot(np.arange(200000), np.arange(200000, 400000))
# 524 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vec1_arr = np.arange(200000); vec2_arr = np.arange(200000, 400000); np.dot(vec1_arr, vec2_arr)
# 523 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The allocation of both arrays takes about 360 microseconds on my machine, and the np.dot operation takes 169 microseconds. The sum of those two durations is 529 microseconds, which matches the %timeit output above.
%timeit vec1_arr = np.arange(200000); vec2_arr = np.arange(200000, 400000)
# 360 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vec1_arr = np.arange(200000)
vec2_arr = np.arange(200000, 400000)
%timeit np.dot(vec1_arr, vec2_arr)
# 169 µs ± 5.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
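The practical takeaway for a pipeline doing many dot products: hoist the allocation out of the hot loop and reuse the arrays. A sketch (the loop itself is hypothetical):
vec1_arr = np.arange(200000)          # allocate once ...
vec2_arr = np.arange(200000, 400000)
results = [np.dot(vec1_arr, vec2_arr) for _ in range(1000)]  # ... reuse many times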
What is the best way to count the rows in a 2d numpy array that include all values of another 1d numpy array? The 2d array can have more columns than the length of the 1d array.
elements = np.arange(4).reshape((2, 2))
test_elements = [2, 3]
somefunction(elements, test_elements)
I would expect the function to return 1.
elements = np.arange(15).reshape((5, 3))
# array([[ 0, 1, 2],
# [ 3, 4, 5],
# [ 6, 7, 8],
# [ 9, 10, 11],
# [12, 13, 14]])
test_elements = [4, 3]
somefunction(elements, test_elements)
Should also return 1.
All elements of the 1d array must be included. If only a few elements are found in a row, it doesn't count. Hence:
elements = np.arange(15).reshape((5, 3))
# array([[ 0, 1, 2],
# [ 3, 4, 5],
# [ 6, 7, 8],
# [ 9, 10, 11],
# [12, 13, 14]])
test_elements = [3, 4, 10]
somefunction(elements, test_elements)
Should also return 0.
Create a boolean array of the elements found, then use any row-wise (so multiple matches in the same row are counted only once), and finally count the rows with sum:
np.any(np.isin(elements, test), axis=1).sum()
Output
>>> elements
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
>>> test = [1, 6, 7, 4]
>>> np.any(np.isin(elements, test), axis=1).sum()
3
There's probably a more efficient solution, but if you want the rows where all elements of test_elements are present, you can reverse np.isin and apply it along each row:
np.apply_along_axis(lambda x: np.isin(test_elements, x), 1, elements).all(1).sum()
A slightly more efficient (but less readable) variant of @norok2's solution is the following.
sum(map(set(test_elements).issubset, elements))
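Applied to the examples from the question:
elements = np.arange(15).reshape((5, 3))
print(sum(map(set([4, 3]).issubset, elements)))      # 1 (only row [3, 4, 5])
print(sum(map(set([3, 4, 10]).issubset, elements)))  # 0 (no row has all three)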
(EDIT: OK, now I actually had a bit more time to figure out what is going on.)
There are two issues here:
the computational complexity depends on the sizes of both inputs and is not captured well by a 1D benchmark plot
the actual timings are dominated by variation in the inputs
The problem can be separated into two parts:
looping through the rows
performing the subset check, which is basically a nested-loop quadratic operation (in the worst-case scenario)
We know that, for sufficiently large inputs, looping through the rows is faster in NumPy and slower in pure Python.
For reference, let's consider these two approaches:
# pure Python approach
def all_in_by_row_flt(arr, elems=ELEMS):
    return sum(1 for row in arr if all(e in row for e in elems))

# NumPy approach (based on @Mstaino's answer)
def all_in_by_row_np(arr, elems=ELEMS):
    def _aaa_helper(row, e=elems):
        return np.isin(e, row)
    return np.sum(np.all(np.apply_along_axis(_aaa_helper, 1, arr), 1))
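(ELEMS is the benchmark input; the original setup is not shown, so here is a hypothetical one matching the shapes reported below:)
import numpy as np
np.random.seed(0)  # hypothetical setup, values not from the original post
arr = np.random.randint(0, 1000, (100, 1000))
ELEMS = np.random.randint(0, 1000, 1000)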
Then, considering the subset-check operation: if the input is such that the check completes within few iterations, pure Python looping is faster than NumPy. Conversely, if a sufficiently large number of iterations is required, NumPy can actually be faster.
On top of this comes the looping over the rows; because the subset check is quadratic AND the two approaches have different constant coefficients, there are situations where, despite the row-looping being faster in NumPy (because the number of rows is sufficiently large), the overall operation is faster in pure Python.
This was the situation I was running into in the earlier benchmarks; it corresponds to the case where the subset check is always (or almost always) False and fails within a few iterations.
As soon as the subset check starts requiring more iterations, the Python-only approach begins to lag behind, and in the situation where the subset check is actually True for most (if not all) rows, the NumPy approach is actually faster.
Another key difference is that the pure Python approach uses lazy evaluation (short-circuiting) while NumPy does not, and actually requires the creation of potentially large intermediate objects that slow down the computation.
On top of this, NumPy iterates over the rows twice (once in sum() and once in np.apply_along_axis()), while the pure Python approach loops over them only once.
Other approaches use set().issubset(), e.g. this one from @GZ0's answer:
def all_in_by_row_set(arr, elems=ELEMS):
    elems = set(elems)
    return sum(map(elems.issubset, arr))
These have different timings than the explicitly written nested loops when it comes to the subset checking, but they still suffer from slower outer looping.
So, what's next?
The answer is to use Cython or Numba.
The idea is to get NumPy-like (read: C) speed at all times (not only for sufficiently large inputs), together with lazy evaluation and a minimal number of passes over the rows.
An example of a Cython approach (as implemented in IPython, using the %load_ext Cython magic) is:
%%cython --cplus -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True

cdef long all_in_by_row_c(long[:, :] arr, long[:] elems) nogil:
    # explicit C declarations so the nogil function compiles
    cdef long result = 0
    cdef Py_ssize_t i, j, k
    cdef Py_ssize_t I = arr.shape[0]
    cdef Py_ssize_t J = arr.shape[1]
    cdef Py_ssize_t K = elems.shape[0]
    cdef bint is_subset, is_contained
    for i in range(I):
        is_subset = True
        for k in range(K):
            is_contained = False
            for j in range(J):
                if elems[k] == arr[i, j]:
                    is_contained = True
                    break
            if not is_contained:
                is_subset = False
                break
        result += 1 if is_subset else 0
    return result

def all_in_by_row_cy(long[:, :] arr, long[:] elems):
    return all_in_by_row_c(arr, elems)
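Calling the Cython version requires arrays matching the long[:, :] / long[:] signatures; a hedged usage sketch, assuming arr and ELEMS as above (np.int64 matches a C long on typical Linux/macOS builds, but not on Windows, where long is 32-bit):
arr_c = np.ascontiguousarray(arr, dtype=np.int64)
elems_c = np.ascontiguousarray(ELEMS, dtype=np.int64)
print(all_in_by_row_cy(arr_c, elems_c))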
While a similar Numba code reads:
import numba as nb

@nb.jit(nopython=True, nogil=True)
def all_in_by_row_jit(arr, elems=ELEMS):
    result = 0
    n_rows, n_cols = arr.shape
    for i in range(n_rows):
        is_subset = True
        for e in elems:
            is_contained = False
            for r in arr[i, :]:
                if e == r:
                    is_contained = True
                    break
            if not is_contained:
                is_subset = False
                break
        result += 1 if is_subset else 0
    return result
Now, time-wise we get the following (for a relatively small number of rows):
arr.shape=(100, 1000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy 120 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_jit 129 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_flt 2.44 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_set 9.98 ms ± 52.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_np 13.7 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
arr.shape=(100, 2000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy 1.45 ms ± 24.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Func: all_in_by_row_jit 1.52 ms ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Func: all_in_by_row_flt 30.1 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_set 19.8 ms ± 56.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_np 18 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
arr.shape=(100, 3000) elems.shape=(1000,) result=37
Func: all_in_by_row_cy 10.4 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_jit 10.9 ms ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_flt 226 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 30.5 ms ± 92.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_np 21.9 ms ± 87.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
arr.shape=(100, 4000) elems.shape=(1000,) result=86
Func: all_in_by_row_cy 16.8 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_jit 17.7 ms ± 42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_flt 385 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 39.5 ms ± 588 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_np 25.7 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that the slow-down in the last block cannot be explained by the increased input size in the second dimension alone.
Actually, if the short-circuit rate is increased (e.g. by changing the range of values in the random arrays), then for the last block (same input sizes) one gets:
arr.shape=(100, 4000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy 152 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_jit 173 µs ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_flt 556 µs ± 8.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Func: all_in_by_row_set 39.7 ms ± 287 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_np 31.5 ms ± 315 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that the set()-based method is more or less independent of the short-circuit rate (because of the hash-based implementation, which has ~O(1) membership-check complexity; but this comes at the expense of a hashing pre-computation, and these results indicate it may not be faster than the direct nested-loop approach).
Finally, for larger row counts:
arr.shape=(100000, 1000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy 141 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_jit 150 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_flt 2.6 s ± 28.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 10.1 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np 13.7 s ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
arr.shape=(100000, 2000) elems.shape=(1000,) result=34
Func: all_in_by_row_cy 1.2 s ± 753 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_jit 1.27 s ± 7.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_flt 24.1 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 19.5 s ± 270 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np 18 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
arr.shape=(100000, 3000) elems.shape=(1000,) result=33859
Func: all_in_by_row_cy 9.79 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_jit 10.3 s ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_flt 3min 30s ± 1.13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 30 s ± 57.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np 21.9 s ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
arr.shape=(100000, 4000) elems.shape=(1000,) result=86376
Func: all_in_by_row_cy 17 s ± 30.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_jit 17.9 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_flt 6min 29s ± 293 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 38.9 s ± 33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np 25.7 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Finally, note that the Cython/Numba code may be algorithmically optimized.
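For instance, one such optimization (my sketch, not from the original answer): sort each row once and replace the inner linear scan with binary searches, turning the per-row cost from O(J*K) into O((J + K) log J):
def all_in_by_row_sorted(arr, elems):
    # Sort each row once, then binary-search every element of elems in it.
    arr_sorted = np.sort(arr, axis=1)
    result = 0
    for row in arr_sorted:
        pos = np.searchsorted(row, elems)          # candidate positions
        pos[pos == row.size] = row.size - 1        # clamp out-of-range hits
        result += int(np.all(row[pos] == elems))   # subset iff every element found
    return result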
Since fast indexing of NumPy arrays is quite important for my program and fancy indexing doesn't have a good reputation performance-wise, I decided to run a few tests. Especially since Numba is developing quite fast, I tested which methods work well with Numba.
As inputs I've been using the following arrays for my small-arrays-test:
import numpy as np
import numba as nb
x = np.arange(0, 100, dtype=np.float64) # array to be indexed
idx = np.array((0, 4, 55, -1), dtype=np.int32) # fancy indexing array
bool_mask = np.zeros(x.shape, dtype=bool) # boolean indexing mask
bool_mask[idx] = True # set same elements as in idx True
y = np.zeros(idx.shape, dtype=np.float64) # output array
y_bool = np.zeros(bool_mask[bool_mask == True].shape, dtype=np.float64) #bool output array (only for convenience)
And the following arrays for my large-arrays test (y_bool is needed here to cope with duplicate numbers from randint):
x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000/50))
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True
y = np.zeros(idx.shape, dtype=np.float64)
y_bool = np.zeros(bool_mask[bool_mask == True].shape, dtype=np.float64)
This yields the following timings without using numba:
%timeit x[idx]
#1.08 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#large arrays: 129 µs ± 3.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x[bool_mask]
#482 ns ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#large arrays: 621 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.take(x, idx)
#2.27 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 112 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.take(x, idx, out=y)
#2.65 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 134 µs ± 4.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x.take(idx)
#919 ns ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 108 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x.take(idx, out=y)
#1.79 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 131 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.compress(bool_mask, x)
#1.93 µs ± 95.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 618 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.compress(bool_mask, x, out=y_bool)
#2.58 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 637 µs ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x.compress(bool_mask)
#900 ns ± 82.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 628 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x.compress(bool_mask, out=y_bool)
#1.78 µs ± 59.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 628 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.extract(bool_mask, x)
#5.29 µs ± 194 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 641 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And with numba, using jitting in nopython mode with caching and nogil, I decorated the indexing methods that numba supports:
@nb.jit(nopython=True, cache=True, nogil=True)
def fancy(x, idx):
    x[idx]

@nb.jit(nopython=True, cache=True, nogil=True)
def fancy_bool(x, bool_mask):
    x[bool_mask]

@nb.jit(nopython=True, cache=True, nogil=True)
def taker(x, idx):
    np.take(x, idx)

@nb.jit(nopython=True, cache=True, nogil=True)
def ndtaker(x, idx):
    x.take(idx)
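One caveat when timing these (my note): the first call to each jitted function includes compilation, so warm them up once before running %timeit:
# trigger compilation so %timeit measures execution only
fancy(x, idx); fancy_bool(x, bool_mask); taker(x, idx); ndtaker(x, idx)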
This yields the following results for small and large arrays:
%timeit fancy(x, idx)
#686 ns ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 84.7 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit fancy_bool(x, bool_mask)
#845 ns ± 31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 843 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit taker(x, idx)
#814 ns ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 87 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit ndtaker(x, idx)
#831 ns ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 85.4 µs ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Summary
While for numpy without numba it is clear that small arrays are by far best indexed with boolean masks (about a factor of 2 compared to ndarray.take(idx)), for larger arrays ndarray.take(idx) performs best, in this case around 6 times faster than boolean indexing. The break-even point is at an array size of around 1000 cells with an index-array size of around 20 cells.
For arrays with 1e5 elements and a 5e3 index-array size, ndarray.take(idx) will be around 10 times faster than boolean mask indexing. So boolean indexing slows down considerably with array size, but catches up a little after some array-size threshold is reached.
For the numba-jitted functions there is a small speedup for all indexing methods except boolean mask indexing. Simple fancy indexing works best here, but is still slower than boolean masking without jitting.
For larger arrays, boolean mask indexing is a lot slower than the other methods, and even slower than the non-jitted version. The three other methods all perform quite well, around 15% faster than the non-jitted version.
For my case with many arrays of different sizes, fancy indexing with numba is the best way to go. Perhaps some other people can also find some useful information in this quite lengthy post.
Edit:
I'm sorry that I forgot to ask my question, which I in fact have. I was just rapidly typing this at the end of my workday and completely forgot it...
Well, do you know any better and faster method than those I tested? Using Cython, my timings were between those of Numba and Python.
As the index array is defined once and used without alteration over long iterations, any way of pre-defining the indexing process would be great. For this I thought about using strides, but I wasn't able to pre-define a custom set of strides. Is it possible to get a predefined view into the memory using strides?
Edit 2:
I guess I'll move my question about predefined constant index arrays, which will be used on the same value array (where only the values change but not the shape) a few million times in iterations, to a new and more specific question. This question was too general and perhaps I also formulated it a little misleadingly. I'll post the link here as soon as I've opened the new question!
Here is the link to the followup question.
Your summary isn't completely correct: you already did tests with differently sized arrays, but one thing you didn't do was change the number of elements indexed.
I restricted it to pure indexing and omitted take (which effectively is integer array indexing) and compress and extract (because these are effectively boolean array indexing). The only difference for those is the constant factor: the constant factor for the methods take and compress is smaller than the overhead of the numpy functions np.take and np.compress, but otherwise the effects are negligible for reasonably sized arrays.
Just let me present it with different numbers:
# ~ every 500th element
x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000/500)) # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True
%timeit x[idx]
# 51.6 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x[bool_mask]
# 1.03 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# ~ every 50th element
idx = np.random.randint(0, 1000000, size=int(1000000/50)) # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True
%timeit x[idx]
# 1.46 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x[bool_mask]
# 2.69 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ~ every 5th element
idx = np.random.randint(0, 1000000, size=int(1000000/5)) # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True
%timeit x[idx]
# 14.9 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit x[bool_mask]
# 8.31 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So what happened here? It's simple: integer array indexing only needs to access as many elements as there are values in the index array. That means that if there are few matches it will be quite fast, but slow if there are many indices. Boolean array indexing, however, always needs to walk through the whole boolean array and check for True values, so its cost should be roughly constant for a given array size.
But wait: it's not really constant for boolean arrays, and why does integer array indexing take longer (last case) than boolean array indexing, even though it has to process ~5 times fewer elements?
That's where it gets more complicated. In this case the boolean array had True at random places, which means it is subject to branch-prediction failures. These are more likely when True and False occur equally often but at random places. That's why the boolean array indexing got slower: the ratio of True to False became more equal and thus more "random". The result array will also be larger if there are more Trues, which likewise consumes more time.
As an example of this branch-prediction effect, consider the following (results could differ with different systems/compilers):
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[:1000000//2] = True # first half True, second half False
%timeit x[bool_mask]
# 5.92 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[::2] = True # True and False alternating
%timeit x[bool_mask]
# 16.6 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[::2] = True
np.random.shuffle(bool_mask) # shuffled
%timeit x[bool_mask]
# 18.2 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the distribution of True and False will critically affect the runtime with boolean masks, even if they contain the same number of Trues! The same effect is visible for the compress functions.
For integer array indexing (and likewise np.take) another effect is visible: cache locality. The indices in your case are randomly distributed, so your computer has to do a lot of "RAM" to "processor cache" loads, because it's very unlikely that two consecutive indices are near each other.
Compare this:
idx = np.random.randint(0, 1000000, size=int(1000000/5))
%timeit x[idx]
# 15.6 ms ± 703 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
idx = np.random.randint(0, 1000000, size=int(1000000/5))
idx = np.sort(idx) # sort them
%timeit x[idx]
# 4.33 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
By sorting the indices, the chance immensely increases that the next value will already be in the cache, and this can lead to huge speedups. That's a very important factor if you know that the indices will be sorted (for example, if they were created by np.where they are sorted, which makes the result of np.where especially efficient for indexing).
So, it's not that integer array indexing is slower for small arrays and faster for large arrays; it depends on many more factors. Both have their use cases and, depending on the circumstances, one may be (considerably) faster than the other.
Let me also talk a bit about the numba functions. First some general statements:
cache won't make a difference: it just avoids recompiling the function. In interactive environments this is essentially useless. It's faster if you package the functions in a module though.
nogil by itself won't provide any speed boost. It will be faster only if it's called in different threads, because each function execution can release the GIL and multiple calls can then run in parallel.
Otherwise I don't know how numba effectively implements these functions. However, when you use NumPy features in numba it could be slower or faster, but even if it's faster it won't be much faster (except maybe for small arrays), because if it could be made faster the NumPy developers would implement it as well. My rule of thumb is: if you can do it (vectorized) with NumPy, don't bother with numba. Only if you can't do it with vectorized NumPy functions, or NumPy would use too many temporary arrays, will numba shine!