I don't understand why such a basic optimization hasn't been done yet:
In [1]: one_million_ones = np.ones(10**6)
In [2]: %timeit one_million_ones.any()
100 loops, best of 3: 693 µs per loop
In [3]: ten_millions_ones = np.ones(10**7)
In [4]: %timeit ten_millions_ones.any()
10 loops, best of 3: 7.03 ms per loop
The whole array is scanned, even though the conclusion is evident from the first item.
It's an unfixed performance regression: NumPy issue 3446. There actually is short-circuiting logic, but a change to the ufunc.reduce machinery introduced an unnecessary chunk-based outer loop around it, and that outer loop doesn't know how to short-circuit. You can see some explanation of the chunking machinery here.
The short-circuiting effects wouldn't have shown up in your test even without the regression, though. First, you're timing the array creation, and second, I don't think the short-circuit logic was ever put in for any input dtype but boolean. From the discussion, it sounds like the details of the ufunc reduction machinery behind numpy.any would have made that difficult.
The discussion does bring up the surprising point that the argmin and argmax methods appear to short-circuit for boolean input. A quick test shows that as of NumPy 1.12 (not quite the most recent version, but the version currently on Ideone), x[x.argmax()] short-circuits, and it outcompetes x.any() and x.max() for 1-dimensional boolean input no matter whether the input is small or large and no matter whether the short-circuiting pays off. Weird!
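A minimal sketch of that argmax trick for boolean input (my own example; relative speed will vary with NumPy version):
import numpy as np

x = np.zeros(10**7, dtype=bool)
x[0] = True  # early hit: argmax can stop at the first True

# For boolean input, x.argmax() is the index of the first True
# (or 0 if there is none), so x[x.argmax()] matches x.any().
assert bool(x[x.argmax()]) == bool(x.any())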
There's a price you pay for short-circuiting. You need to introduce branches in your code.
The problem with branches (e.g. if statements) is that they can be slower than equivalent branchless operations, and branch misprediction can add significant overhead.
Also, depending on the compiler and processor, branchless code can be vectorized by the processor. I'm not an expert on this, but it may use some form of SIMD instructions such as SSE.
I'll use numba here because the code is easy to read and fast enough that these small differences actually show up in the timings:
import numba as nb
import numpy as np

@nb.njit
def any_sc(arr):
    # short-circuiting: return as soon as a True is found
    for item in arr:
        if item:
            return True
    return False

@nb.njit
def any_not_sc(arr):
    # branchless: always scans the whole array
    res = False
    for item in arr:
        res |= item
    return res
arr = np.zeros(100000, dtype=bool)
assert any_sc(arr) == any_not_sc(arr)
%timeit any_sc(arr)
# 126 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit any_not_sc(arr)
# 15.5 µs ± 962 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.1 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The branchless version is about 8 times faster in the worst case. But in the best case the short-circuiting function is much faster:
arr = np.zeros(100000, dtype=bool)
arr[0] = True
%timeit any_sc(arr)
# 1.97 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit any_not_sc(arr)
# 15.1 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.2 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So it's a question of which case should be optimized: the best case? The worst case? The average case (and what is the average case for any)?
It could be that the NumPy developers wanted to optimize the worst case rather than the best case. Or maybe they just didn't care, or they wanted "predictable" performance in every case.
Just a note on your code: you measure the time it takes to create an array as well as the time it takes to execute any. If any did short-circuit, you wouldn't have noticed it with your code!
%timeit np.ones(10**6)
# 9.12 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.ones(10**7)
# 86.2 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For conclusive timings supporting your question you should have used this instead:
arr1 = np.ones(10**6)
arr2 = np.ones(10**7)
%timeit arr1.any()
# 4.04 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr2.any()
# 39.8 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have a program whose main performance bottleneck involves multiplying matrices which have one dimension of size 1 and another large dimension, e.g. 1000:
large_dimension = 1000
a = np.random.random((1,))
b = np.random.random((1, large_dimension))
c = np.matmul(a, b)
In other words, multiplying matrix b with the scalar a[0].
I am looking for the most efficient way to compute this, since this operation is repeated millions of times.
I tested the performance of the two trivial ways to do this, and they are practically equivalent:
%timeit np.matmul(a, b)
>> 1.55 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a[0] * b
>> 1.77 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Is there a more efficient way to compute this?
Note: I cannot move these computations to a GPU since the program is using multiprocessing and many such computations are done in parallel.
large_dimension = 1000
a = np.random.random((1,))
B = np.random.random((1, large_dimension))
%timeit np.matmul(a, B)
5.43 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a[0] * B
5.11 µs ± 6.92 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Use a plain Python float:
%timeit float(a[0]) * B
3.48 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To avoid memory allocation, use a preallocated output "buffer":
buffer = np.empty_like(B)
%timeit np.multiply(float(a[0]), B, buffer)
2.96 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To avoid unnecessary attribute lookups, use an "alias":
mul = np.multiply
%timeit mul(float(a[0]), B, buffer)
2.73 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And I don't recommend using NumPy scalars at all; if you avoid them, computation will be faster:
a_float = float(a[0])
%timeit mul(a_float, B, buffer)
1.94 µs ± 5.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Furthermore, if possible, initialize the buffer outside the loop, once (assuming you do have something like a loop):
rng = range(1000)

%%timeit
for i in rng:
    pass
24.4 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
for i in rng:
    mul(a_float, B, buffer)
1.91 ms ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So,
best_iteration_time = (1.91 ms - 0.02 ms) / 1000 ≈ 1.89 µs
speedup = 5.43 µs / 1.89 µs ≈ 2.87
In this case, it is probably faster to work with an element-wise multiplication, but most of the time you see is NumPy overhead (calling C functions from the CPython interpreter, wrapping/unwrapping types, making checks, performing the operation, array allocations, etc.).
since this operation is repeated millions of times
This is the problem. Indeed, the CPython interpreter is very bad at doing things with low latency. This is especially true when you work on NumPy types, since calling C code and performing checks for a trivial operation is much slower than doing it in pure Python, which in turn is much slower than compiled native C/C++ code. If you really need this, and you cannot vectorize your code using NumPy (because you have a loop iterating over timesteps), then you need to move away from CPython, or at least away from pure Python code. Instead, you can use Numba or Cython to mitigate the cost of doing C calls, wrapping types, etc. If that is not enough, then you will need to write native C/C++ code (or a similar language), unless you find a dedicated Python package doing exactly that for you. Note that Numba is fast only when it works on native types or NumPy arrays (containing native types). If you work with a lot of pure Python types and do not want to rewrite your code, you can try the PyPy JIT.
Here is a simple example in Numba avoiding the (costly) creation/allocation of a new array (as well as many NumPy internal checks and calls), written specifically for your case:
import numba as nb

@nb.njit('void(float64[::1],float64[:,::1],float64[:,::1])')
def fastMul(a, b, out):
    val = a[0]
    for i in range(b.shape[1]):
        out[0, i] = b[0, i] * val

res = np.empty(b.shape, dtype=b.dtype)
%timeit fastMul(a, b, res)
# 397 ns ± 0.587 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
At the time of writing, this solution is faster than all the others. Since most of the time is spent in calling Numba and performing internal checks, applying Numba directly to the function containing the iteration loop should result in even faster code.
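As a hedged sketch of that last point (the function name and the n_steps loop are my own assumptions, not the original code):
import numba as nb
import numpy as np

@nb.njit
def run_steps(a, b, n_steps):
    # keeping the whole timestep loop inside Numba avoids paying
    # the CPython call overhead once per iteration
    out = np.empty_like(b)
    for _ in range(n_steps):
        val = a[0]
        for i in range(b.shape[1]):
            out[0, i] = b[0, i] * val
    return out

a = np.random.random((1,))
b = np.random.random((1, 1000))
res = run_steps(a, b, 1000)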
import numpy as np
import numba

def matmult_numpy(matrix, c):
    return np.matmul(c, matrix)

@numba.jit(nopython=True)
def matmult_numba(matrix, c):
    return c * matrix

if __name__ == "__main__":
    large_dimension = 1000
    a = np.random.random((1, large_dimension))
    c = np.random.random((1,))
About a factor of 3 speedup using Numba. Numba cognoscenti may be able to do better by explicitly casting the parameter c as a scalar.
Check the timings:
%timeit matmult_numpy(a, c)
2.32 µs ± 50 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit matmult_numba(a, c)
763 ns ± 6.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
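A hedged sketch of the scalar-casting suggestion above (the function name and the idea of passing c[0] directly are mine, and untimed):
import numba
import numpy as np

@numba.jit(nopython=True)
def matmult_numba_scalar(matrix, c_scalar):
    # c_scalar arrives as a plain float64, so Numba specializes
    # on a scalar instead of a length-1 array
    return c_scalar * matrix

a = np.random.random((1, 1000))
c = np.random.random((1,))
result = matmult_numba_scalar(a, c[0])  # pass the scalar explicitly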
I ran a few experiments and found a number of cases where Python's standard random and math libraries are faster than their NumPy counterparts.
I think there is a tendency for Python's standard library to be about 10x faster for small-scale operations, while NumPy is much faster for large-scale (vector) operations. My guess is that NumPy has some overhead which becomes dominant for small cases.
My question is: is my intuition correct? And will it in general be advisable to use the standard library rather than NumPy for small (typically scalar) operations?
Examples are below.
import math
import random
import numpy as np
Log and exponential
%timeit math.log(10)
# 158 ns ± 6.16 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit np.log(10)
# 1.64 µs ± 93.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit math.exp(3)
# 146 ns ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit np.exp(3)
# 1.72 µs ± 78.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Generate normal distribution
%timeit random.gauss(0, 1)
# 809 ns ± 12.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.random.normal()
# 2.57 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Choosing a random element
%timeit random.choices([1,2,3], k=1)
# 1.56 µs ± 55.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.random.choice([1,2,3], size=1)
# 23.1 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The same with a NumPy array
arr = np.array([1,2,3])
%timeit random.choices(arr, k=1)
# 1.72 µs ± 33.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.random.choice(arr, size=1)
# 18.4 µs ± 502 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With a big array
arr = np.arange(10000)
%timeit random.choices(arr, k=1000)
# 401 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.random.choice(arr, size=1000)
# 41.7 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
numpy is only really a performance improvement for large blocks of data. The overhead of making sure the memory blocks line up correctly before pouring an ndarray into a C-compiled numpy function will generally overwhelm any time benefit if the array isn't relatively large. This is why so many numpy questions are basically "how do I make this loopy code fast", and why it is considered a valid question in this tag, where nearly any other tag will toss you to Code Review before they get past the title.
So, yes, your observation is generalizable. Vectorizing is the whole point of numpy. numpy code that isn't vectorized is always slower than bare python code, and is arguably just as "wrong" as cracking a single walnut with a jackhammer. Either find the right tool or get more nuts.
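A minimal sketch of that point (my own example; exact numbers depend on your machine):
import numpy as np

data = np.random.random(10**6)

def loop_sum(arr):
    # pays Python-level overhead for every single element
    total = 0.0
    for v in arr:
        total += v
    return total

# a single call into compiled C code; far faster on large arrays,
# while for a handful of elements the call overhead dominates
vectorized = data.sum()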
NumPy is used primarily for performance with arrays. This relies on the use of contiguous memory blocks and more efficient lower-level iteration. Applying a NumPy mathematical function on a scalar or calculating a random number are not vectorisable operations. This explains the behaviour you are seeing.
See also What are the advantages of NumPy over regular Python lists?
And will it be in general advisable to use the standard library rather
than NumPy for small (typically scalar) operations?
It's rare that the bottleneck for a program is caused by operations on scalars. In practice, the differences are negligible. So either way is fine. If you are already using NumPy there's no harm in continuing to use NumPy operations on scalars.
It's worth making a special case of calculating random numbers. As you might expect, the random number selected via random vs NumPy may not be the same:
assert random.gauss(0, 1) == np.random.normal() # AssertionError
assert random.choices(arr, k=1)[0] == np.random.choice(arr, size=1)[0] # AssertionError
You have additional functionality in NumPy to make random numbers "predictable". For example, running the below script repeatedly will only ever generate the same result:
np.random.seed(0)
np.random.normal()
The same applies to np.random.choice. So there are differences in how the random number is derived and in the functionality available. For testing or other purposes, you may wish to be able to produce consistent "random" numbers.
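For example, a minimal sketch of seeding np.random.choice:
import numpy as np

np.random.seed(0)
first = np.random.choice(np.arange(10), size=3)

np.random.seed(0)
second = np.random.choice(np.arange(10), size=3)

# identical seeds give identical "random" draws
assert (first == second).all()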
If we measure the execution time of building a Pascal's triangle in Python for a given n with two different implementations, one using a Python for loop and one using NumPy array addition, and plot the time required for inputs n = 2^i, the result looks like the plot in the linked post:
[Plot: execution time vs. input size n = 2^i for the loop-based and NumPy implementations; see the source below.]
source: https://algorithmdotcpp.blogspot.com/2022/01/prove-numpy-is-faster-than-normal-list.html
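The linked post compares implementations along these lines (a sketch under my own naming, not the post's exact code):
import numpy as np

def pascal_row_list(n):
    # pure Python: row n of Pascal's triangle via a loop
    row = [1]
    for _ in range(n):
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
    return row

def pascal_row_numpy(n):
    # NumPy: vectorized neighbour sums of the previous row
    row = np.ones(1, dtype=np.int64)
    for _ in range(n):
        new = np.ones(row.size + 1, dtype=np.int64)
        new[1:-1] = row[:-1] + row[1:]
        row = new
    return row

assert pascal_row_list(5) == list(pascal_row_numpy(5))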
Since fast indexing of NumPy arrays is quite necessary for my program, and fancy indexing doesn't have a good reputation performance-wise, I decided to run a few tests. Especially since Numba is developing quite fast, I tried to find out which methods work well with numba.
As inputs I've been using the following arrays for my small-arrays-test:
import numpy as np
import numba as nb
x = np.arange(0, 100, dtype=np.float64) # array to be indexed
idx = np.array((0, 4, 55, -1), dtype=np.int32) # fancy indexing array
bool_mask = np.zeros(x.shape, dtype=bool) # boolean indexing mask
bool_mask[idx] = True # set same elements as in idx True
y = np.zeros(idx.shape, dtype=np.float64) # output array
y_bool = np.zeros(bool_mask[bool_mask == True].shape, dtype=np.float64) #bool output array (only for convenience)
And the following arrays for my large-arrays-test (y_bool needed here to cope with dupe numbers from randint):
x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000/50))
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True
y = np.zeros(idx.shape, dtype=np.float64)
y_bool = np.zeros(bool_mask[bool_mask == True].shape, dtype=np.float64)
This yields the following timings without using numba:
%timeit x[idx]
#1.08 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#large arrays: 129 µs ± 3.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x[bool_mask]
#482 ns ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#large arrays: 621 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.take(x, idx)
#2.27 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 112 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.take(x, idx, out=y)
#2.65 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 134 µs ± 4.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x.take(idx)
#919 ns ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 108 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x.take(idx, out=y)
#1.79 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 131 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.compress(bool_mask, x)
#1.93 µs ± 95.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 618 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.compress(bool_mask, x, out=y_bool)
#2.58 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 637 µs ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x.compress(bool_mask)
#900 ns ± 82.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 628 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x.compress(bool_mask, out=y_bool)
#1.78 µs ± 59.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 628 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.extract(bool_mask, x)
#5.29 µs ± 194 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 641 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And with numba, using jitting in nopython mode, caching, and nogil, I decorated the indexing methods that are supported by numba:
@nb.jit(nopython=True, cache=True, nogil=True)
def fancy(x, idx):
    x[idx]

@nb.jit(nopython=True, cache=True, nogil=True)
def fancy_bool(x, bool_mask):
    x[bool_mask]

@nb.jit(nopython=True, cache=True, nogil=True)
def taker(x, idx):
    np.take(x, idx)

@nb.jit(nopython=True, cache=True, nogil=True)
def ndtaker(x, idx):
    x.take(idx)
This yields the following results for small and large arrays:
%timeit fancy(x, idx)
#686 ns ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 84.7 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit fancy_bool(x, bool_mask)
#845 ns ± 31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 843 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit taker(x, idx)
#814 ns ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 87 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit ndtaker(x, idx)
#831 ns ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 85.4 µs ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Summary
While for numpy without numba it is clear that small arrays are by far best indexed with boolean masks (about a factor of 2 compared to ndarray.take(idx)), for larger arrays ndarray.take(idx) will perform best, in this case around 6 times faster than boolean indexing. The break-even point is at an array size of around 1000 cells with an index-array size of around 20 cells.
For arrays with 1e5 elements and an index-array size of 5e3, ndarray.take(idx) will be around 10 times faster than boolean mask indexing. So boolean indexing seems to slow down considerably with array size, but catches up a little after some array-size threshold is reached.
For the numba-jitted functions there is a small speedup for all indexing functions except boolean mask indexing. Simple fancy indexing works best here, but is still slower than boolean masking without jitting.
For larger arrays, boolean mask indexing is a lot slower than the other methods, and even slower than the non-jitted version. The three other methods all perform quite well, around 15% faster than the non-jitted version.
For my case with many arrays of different sizes, fancy indexing with numba is the best way to go. Perhaps some other people can also find some useful information in this quite lengthy post.
Edit:
I'm sorry that I forgot to ask my question, which I in fact have. I was just rapidly typing this at the end of my workday and completely forgot it...
Well, do you know of any better and faster method than those I tested? Using Cython, my timings were between Numba and Python.
As the index array is predefined once and used without alteration over long iterations, any way of pre-defining the indexing process would be great. For this I thought about using strides, but I wasn't able to pre-define a custom set of strides. Is it possible to get a predefined view into the memory using strides?
Edit 2:
I guess I'll move my question about predefined constant index arrays, which will be used on the same value array (where only the values change but not the shape) a few million times in iterations, to a new and more specific question. This question was too general, and perhaps I also formulated it a little misleadingly. I'll post the link here as soon as I've opened the new question!
Here is the link to the followup question.
Your summary isn't completely correct: you did run tests with differently sized arrays, but one thing you didn't do was change the number of elements indexed.
I restricted it to pure indexing and omitted take (which is effectively integer array indexing) and compress and extract (because these are effectively boolean array indexing). The only difference for these is the constant factor: the constant factor for the methods take and compress will be less than the overhead of the NumPy functions np.take and np.compress, but otherwise the effects will be negligible for reasonably sized arrays.
Just let me present it with different numbers:
# ~ every 500th element
x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000/500)) # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True
%timeit x[idx]
# 51.6 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x[bool_mask]
# 1.03 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# ~ every 50th element
idx = np.random.randint(0, 1000000, size=int(1000000/50)) # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True
%timeit x[idx]
# 1.46 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x[bool_mask]
# 2.69 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ~ every 5th element
idx = np.random.randint(0, 1000000, size=int(1000000/5)) # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True
%timeit x[idx]
# 14.9 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit x[bool_mask]
# 8.31 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So what happened here? It's simple: integer array indexing only needs to access as many elements as there are values in the index array. That means if there are few matches it will be quite fast, but slow if there are many indices. Boolean array indexing, however, always needs to walk through the whole boolean array and check for True values. That means its runtime should be roughly "constant" for a given array size.
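In pseudo-Python, the two strategies look roughly like this (a sketch of the idea, not NumPy's actual C implementation):
import numpy as np

def integer_index(x, idx):
    # touches exactly len(idx) elements of x
    out = np.empty(len(idx), dtype=x.dtype)
    for k, i in enumerate(idx):
        out[k] = x[i]
    return out

def boolean_index(x, mask):
    # must inspect every element of the mask
    out = []
    for i in range(len(x)):
        if mask[i]:  # the branch that branch prediction acts on
            out.append(x[i])
    return np.array(out, dtype=x.dtype)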
But wait: it's not really constant for boolean arrays. And why does integer array indexing take longer (last case) than boolean array indexing even though it has to process roughly 5 times fewer elements?
That's where it gets more complicated. In this case the boolean array had True at random places, which means it will be subject to branch prediction failures. These are more likely when True and False occur equally often and at random places. That's why the boolean array indexing got slower: the ratio of True to False became more equal and thus more "random". Also, the result array will be larger if there are more Trues, which also consumes more time.
As an example of this branch-prediction effect, consider the following (results could differ between systems/compilers):
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[:1000000//2] = True # first half True, second half False
%timeit x[bool_mask]
# 5.92 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[::2] = True # True and False alternating
%timeit x[bool_mask]
# 16.6 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[::2] = True
np.random.shuffle(bool_mask) # shuffled
%timeit x[bool_mask]
# 18.2 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the distribution of True and False will critically affect the runtime with boolean masks, even if they contain the same number of Trues! The same effect will be visible for the compress functions.
For integer array indexing (and likewise np.take) another effect will be visible: cache locality. The indices in your case are randomly distributed, so your computer has to do a lot of RAM-to-processor-cache loads, because it's very unlikely that two indices will be near each other.
Compare this:
idx = np.random.randint(0, 1000000, size=int(1000000/5))
%timeit x[idx]
# 15.6 ms ± 703 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
idx = np.random.randint(0, 1000000, size=int(1000000/5))
idx = np.sort(idx) # sort them
%timeit x[idx]
# 4.33 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Sorting the indices immensely increases the chance that the next value will already be in the cache, and this can lead to huge speedups. That's a very important factor if you know the indices will be sorted (for example, indices created by np.where are sorted, which makes the result of np.where especially efficient for indexing).
So it's not that integer array indexing is slower for small arrays and faster for large arrays; it depends on many more factors. Both have their use cases, and depending on the circumstances one might be (considerably) faster than the other.
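For example, a minimal sketch of the np.where point:
import numpy as np

x = np.random.random(10**6)
mask = x > 0.999

# np.where returns indices in ascending order, so the subsequent
# integer indexing is cache-friendly
idx = np.where(mask)[0]
assert (np.diff(idx) > 0).all()
result = x[idx]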
Let me also talk a bit about the numba functions. First some general statements:
cache won't make a difference; it just avoids recompiling the function. In interactive environments this is essentially useless. It is faster if you package the functions in a module, though.
nogil by itself won't provide any speed boost. It will be faster only if it's called from different threads, because each function execution can release the GIL, so multiple calls can run in parallel (see the sketch after these notes).
Otherwise I don't know how numba effectively implements these functions. However, when you use NumPy features in numba it could be slower or faster, and even if it's faster it won't be much faster (except maybe for small arrays), because if it could be made faster, the NumPy developers would have implemented it as well. My rule of thumb is: if you can do it (vectorized) with NumPy, don't bother with numba. Only if you can't do it with vectorized NumPy functions, or NumPy would use too many temporary arrays, will numba shine!
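Here is a sketch of the nogil note above (array sizes and names are my own):
import numba as nb
import numpy as np
from concurrent.futures import ThreadPoolExecutor

@nb.njit(nogil=True)
def fancy_nogil(x, idx):
    return x[idx]

x = np.random.random(10**7)
idx_lists = [np.random.randint(0, 10**7, 10**5) for _ in range(4)]

# because the GIL is released inside fancy_nogil, the four calls
# can genuinely run in parallel threads
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda i: fancy_nogil(x, i), idx_lists))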