I want to test whether all elements of an array are zero. According to the StackOverflow posts Test if numpy array contains only zeros and https://stackoverflow.com/a/72976775/5269892, compared to (array == 0).all(), not array.any() should be the both most memory-efficient and fastest method.
I tested the performance with a random-number floating array, see below. Somehow though, at least for the given array size, not array.any() and even casting the array to boolean type appear to be slower than (array == 0).all(). How comes?
np.random.seed(100)
a = np.random.rand(10418*144)
%timeit (a == 0)
%timeit (a == 0).all()
%timeit a.astype(bool)
%timeit a.any()
%timeit not a.any()
# 711 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 740 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 1.69 ms ± 587 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 1.71 ms ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 1.71 ms ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The problem is due to the first two operations being vectorized using SIMD instructions while the three last are not. More specifically, the three last calls do an implicit conversion to bool (_aligned_contig_cast_double_to_bool) which is not yet vectorized. This is a known issue and I have already proposed a pull request for this (which revealed some unexpected issues due to undefined behaviors now fixed). If everything is fine, it should be available in the next major release of Numpy.
Note that a.any() and not a.any() implicitly perform a cast to an array of boolean so to then perform the any operation faster. This is not very efficient, but this is done that way so to reduce the number of generated function variants (Numpy is written in C and so a different implementation has to be generated for each type and optimizing many variants is hard so we prefer so perform implicit casts here, not to mention that this also reduce the size of the generated binaries). If this is not enough, not you can use Cython so to generate a faster specific optimized code.
Related
I create an array that does not contain a single zero (let's ignore that it does, with zero probability, as np.random.rand() samples [0,1) uniformly). I want to check whether all values are equal to zero (for some other purpose the arrays may contain all zeros). Below are some timings.
Surprisingly to me, checking a single (nonzero) element is about 2000 times faster than using np.all() or np.any(). I would assume that the compiler internally replaces np.all() by np.any() of the inverse condition and that np.any()/np.all() returns True/False at the first instance that the condition is fulfilled/violated (i.e. the compiler does not create the entire array of True or False values first).
How comes np.all() or np.any() are that much slower when it would only have to check one element? Or is this because of the external knowledge I put that the array does not contain all zeros? In the case of an all-zeros array, I guess it might be too slow to do the boolean comparison separately for each element. I don't know about the performance of the underlying low-level algorithms, but each element needs to be accessed once independent of whether it goes one by one or creates the whole boolean array once.
import numpy as np
np.random.seed(100)
a = np.random.rand(10418,144)
%timeit a[0,0] == 0
%timeit (a == 0).all()
%timeit np.all(a == 0)
%timeit (a != 0).any()
%timeit np.any(a != 0)
# 400 ns ± 2.08 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 713 µs ± 382 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 720 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 711 µs ± 407 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 723 µs ± 630 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
When you write a == 0, numpy creates a new array of type boolean, compares each element in a with 0 and stores the result in the array. This allocation, initialization, and subsequent deallocation is the reason for the high cost.
Note that you don't need the explicit a == 0 in the first place. Integers that are zero always evauate to False, nonzero integers to True. np.all(a) is equivalent to np.all(a != 0). So np.all(a==0) is equivalent to not np.any(a)
I don't understand why a so basic optimization has not yet be done:
In [1]: one_million_ones = np.ones(10**6)
In [2]: %timeit one_million_ones.any()
100 loops, best of 3: 693µs per loop
In [3]: ten_millions_ones = np.ones(10**7)
In [4]: %timeit ten_millions_ones.any()
10 loops, best of 3: 7.03 ms per loop
The whole array is scanned, even if the conclusion is an evidence at first item.
It's an unfixed performance regression. NumPy issue 3446. There actually is short-circuiting logic, but a change to the ufunc.reduce machinery introduced an unnecessary chunk-based outer loop around the short-circuiting logic, and that outer loop doesn't know how to short circuit. You can see some explanation of the chunking machinery here.
The short-circuiting effects wouldn't have showed up in your test even without the regression, though. First, you're timing the array creation, and second, I don't think they ever put in the short-circuit logic for any input dtype but boolean. From the discussion, it sounds like the details of the ufunc reduction machinery behind numpy.any would have made that difficult.
The discussion does bring up the surprising point that the argmin and argmax methods appear to short-circuit for boolean input. A quick test shows that as of NumPy 1.12 (not quite the most recent version, but the version currently on Ideone), x[x.argmax()] short-circuits, and it outcompetes x.any() and x.max() for 1-dimensional boolean input no matter whether the input is small or large and no matter whether the short-circuiting pays off. Weird!
There's a price you pay for short-circuiting. You need to introduce branches in your code.
The problem with branches (e.g. if statements) is that they can be slower than using alternative operations (without branches) and then you also have branch prediction which could include a significant overhead.
Also depending on the compiler and processor the branchless code could use processor vectorization. I'm not an expert in this but maybe some sort of SIMD or SSE?
I'll use numba here because the code is easy to read and it's fast enough so the performance will change based on these small differences:
import numba as nb
import numpy as np
#nb.njit
def any_sc(arr):
for item in arr:
if item:
return True
return False
#nb.njit
def any_not_sc(arr):
res = False
for item in arr:
res |= item
return res
arr = np.zeros(100000, dtype=bool)
assert any_sc(arr) == any_not_sc(arr)
%timeit any_sc(arr)
# 126 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit any_not_sc(arr)
# 15.5 µs ± 962 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.1 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It's almost 10 times faster in the worst case without branches. But in the best case the short-circuit function is much faster:
arr = np.zeros(100000, dtype=bool)
arr[0] = True
%timeit any_sc(arr)
# 1.97 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit any_not_sc(arr)
# 15.1 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.2 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So it's a question what case should be optimized: The best case? The worst case? The average case (what's the average case with any)?
It could be that the NumPy developers wanted to optimize the worst case and not the best case. Or they just didn't care? Or maybe they just wanted "predictable" performance in any case.
Just a note on your code: You measure the time it takes to create an array as well as the time it takes to execute any. If any were short-circuit you wouldn't have noticed it with your code!
%timeit np.ones(10**6)
# 9.12 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.ones(10**7)
# 86.2 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For conclusive timings supporting your question you should have used this instead:
arr1 = np.ones(10**6)
arr2 = np.ones(10**7)
%timeit arr1.any()
# 4.04 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr2.any()
# 39.8 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I don't understand why a so basic optimization has not yet be done:
In [1]: one_million_ones = np.ones(10**6)
In [2]: %timeit one_million_ones.any()
100 loops, best of 3: 693µs per loop
In [3]: ten_millions_ones = np.ones(10**7)
In [4]: %timeit ten_millions_ones.any()
10 loops, best of 3: 7.03 ms per loop
The whole array is scanned, even if the conclusion is an evidence at first item.
It's an unfixed performance regression. NumPy issue 3446. There actually is short-circuiting logic, but a change to the ufunc.reduce machinery introduced an unnecessary chunk-based outer loop around the short-circuiting logic, and that outer loop doesn't know how to short circuit. You can see some explanation of the chunking machinery here.
The short-circuiting effects wouldn't have showed up in your test even without the regression, though. First, you're timing the array creation, and second, I don't think they ever put in the short-circuit logic for any input dtype but boolean. From the discussion, it sounds like the details of the ufunc reduction machinery behind numpy.any would have made that difficult.
The discussion does bring up the surprising point that the argmin and argmax methods appear to short-circuit for boolean input. A quick test shows that as of NumPy 1.12 (not quite the most recent version, but the version currently on Ideone), x[x.argmax()] short-circuits, and it outcompetes x.any() and x.max() for 1-dimensional boolean input no matter whether the input is small or large and no matter whether the short-circuiting pays off. Weird!
There's a price you pay for short-circuiting. You need to introduce branches in your code.
The problem with branches (e.g. if statements) is that they can be slower than using alternative operations (without branches) and then you also have branch prediction which could include a significant overhead.
Also depending on the compiler and processor the branchless code could use processor vectorization. I'm not an expert in this but maybe some sort of SIMD or SSE?
I'll use numba here because the code is easy to read and it's fast enough so the performance will change based on these small differences:
import numba as nb
import numpy as np
#nb.njit
def any_sc(arr):
for item in arr:
if item:
return True
return False
#nb.njit
def any_not_sc(arr):
res = False
for item in arr:
res |= item
return res
arr = np.zeros(100000, dtype=bool)
assert any_sc(arr) == any_not_sc(arr)
%timeit any_sc(arr)
# 126 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit any_not_sc(arr)
# 15.5 µs ± 962 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.1 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It's almost 10 times faster in the worst case without branches. But in the best case the short-circuit function is much faster:
arr = np.zeros(100000, dtype=bool)
arr[0] = True
%timeit any_sc(arr)
# 1.97 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit any_not_sc(arr)
# 15.1 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.2 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So it's a question what case should be optimized: The best case? The worst case? The average case (what's the average case with any)?
It could be that the NumPy developers wanted to optimize the worst case and not the best case. Or they just didn't care? Or maybe they just wanted "predictable" performance in any case.
Just a note on your code: You measure the time it takes to create an array as well as the time it takes to execute any. If any were short-circuit you wouldn't have noticed it with your code!
%timeit np.ones(10**6)
# 9.12 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.ones(10**7)
# 86.2 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For conclusive timings supporting your question you should have used this instead:
arr1 = np.ones(10**6)
arr2 = np.ones(10**7)
%timeit arr1.any()
# 4.04 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr2.any()
# 39.8 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I made a few experiment and found a number of cases where python's standard random and math library is faster than numpy counterpart.
I think there is a tendency that python's standard library is about 10x faster for small scale operation, while numpy is much faster for large scale (vector) operations. My guess is that numpy has some overhead which becomes dominant for small cases.
My question is: Is my intuition correct? And will it be in general advisable to use the standard library rather than numpy for small (typically scalar) operations?
Examples are below.
import math
import random
import numpy as np
Log and exponential
%timeit math.log(10)
# 158 ns ± 6.16 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit np.log(10)
# 1.64 µs ± 93.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit math.exp(3)
# 146 ns ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit np.exp(3)
# 1.72 µs ± 78.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Generate normal distribution
%timeit random.gauss(0, 1)
# 809 ns ± 12.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.random.normal()
# 2.57 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Choosing a random element
%timeit random.choices([1,2,3], k=1)
# 1.56 µs ± 55.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.random.choice([1,2,3], size=1)
# 23.1 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Same with numpy array
arr = np.array([1,2,3])
%timeit random.choices(arr, k=1)
# 1.72 µs ± 33.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.random.choice(arr, size=1)
# 18.4 µs ± 502 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With big array
arr = np.arange(10000)
%timeit random.choices(arr, k=1000)
# 401 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.random.choice(arr, size=1000)
# 41.7 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
numpy is only really a performance improvement for large blocks of data. The overhead of making sure the memory blocks line up correctly before pouring an ndarray into a c-compiled numpy function will generally overwhelm any time benefit if the array isn't relatively large. This is why so many numpy questions are basically "How do I take this loopy code and make it fast," and why it is considered a valid question in this tag where nearly any other tag will toss you to Code review before they get past the title.
So, yes, your observation is generalizable. Vectorizing is the whole point of numpy. numpy code that isn't vectorized is always slower than bare python code, and is arguably just as "wrong" as cracking a single walnut with a jackhammer. Either find the right tool or get more nuts.
NumPy is used primarily for performance with arrays. This relies on the use of contiguous memory blocks and more efficient lower-level iteration. Applying a NumPy mathematical function on a scalar or calculating a random number are not vectorisable operations. This explains the behaviour you are seeing.
See also What are the advantages of NumPy over regular Python lists?
And will it be in general advisable to use the standard library rather
than NumPy for small (typically scalar) operations?
It's rare that the bottleneck for a program is caused by operations on scalars. In practice, the differences are negligible. So either way is fine. If you are already using NumPy there's no harm in continuing to use NumPy operations on scalars.
It's worth making a special case of calculating random numbers. As you might expect, the random number selected via random vs NumPy may not be the same:
assert random.gauss(0, 1) == np.random.normal() # AssertionError
assert random.choices(arr, k=1)[0] == np.random.choice(arr, size=1)[0] # AssertionError
You have additional functionality in NumPy to make random numbers "predictable". For example, running the below script repeatedly will only ever generate the same result:
np.random.seed(0)
np.random.normal()
The same applies to np.random.choice. So there are differences in how the random number is derived and the functionality available. For testing, or other, purposes you may wish to be able to produce consistent "random" numbers.
If we compute time of execution for given n to create a pascal triangle in python with two different implementations one with python for loop and one with numpy array addition then plotting time required for input n = 2^i the output will be like
source : https://algorithmdotcpp.blogspot.com/2022/01/prove-numpy-is-faster-than-normal-list.html
I don't understand why a so basic optimization has not yet be done:
In [1]: one_million_ones = np.ones(10**6)
In [2]: %timeit one_million_ones.any()
100 loops, best of 3: 693µs per loop
In [3]: ten_millions_ones = np.ones(10**7)
In [4]: %timeit ten_millions_ones.any()
10 loops, best of 3: 7.03 ms per loop
The whole array is scanned, even if the conclusion is an evidence at first item.
It's an unfixed performance regression. NumPy issue 3446. There actually is short-circuiting logic, but a change to the ufunc.reduce machinery introduced an unnecessary chunk-based outer loop around the short-circuiting logic, and that outer loop doesn't know how to short circuit. You can see some explanation of the chunking machinery here.
The short-circuiting effects wouldn't have showed up in your test even without the regression, though. First, you're timing the array creation, and second, I don't think they ever put in the short-circuit logic for any input dtype but boolean. From the discussion, it sounds like the details of the ufunc reduction machinery behind numpy.any would have made that difficult.
The discussion does bring up the surprising point that the argmin and argmax methods appear to short-circuit for boolean input. A quick test shows that as of NumPy 1.12 (not quite the most recent version, but the version currently on Ideone), x[x.argmax()] short-circuits, and it outcompetes x.any() and x.max() for 1-dimensional boolean input no matter whether the input is small or large and no matter whether the short-circuiting pays off. Weird!
There's a price you pay for short-circuiting. You need to introduce branches in your code.
The problem with branches (e.g. if statements) is that they can be slower than using alternative operations (without branches) and then you also have branch prediction which could include a significant overhead.
Also depending on the compiler and processor the branchless code could use processor vectorization. I'm not an expert in this but maybe some sort of SIMD or SSE?
I'll use numba here because the code is easy to read and it's fast enough so the performance will change based on these small differences:
import numba as nb
import numpy as np
#nb.njit
def any_sc(arr):
for item in arr:
if item:
return True
return False
#nb.njit
def any_not_sc(arr):
res = False
for item in arr:
res |= item
return res
arr = np.zeros(100000, dtype=bool)
assert any_sc(arr) == any_not_sc(arr)
%timeit any_sc(arr)
# 126 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit any_not_sc(arr)
# 15.5 µs ± 962 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.1 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It's almost 10 times faster in the worst case without branches. But in the best case the short-circuit function is much faster:
arr = np.zeros(100000, dtype=bool)
arr[0] = True
%timeit any_sc(arr)
# 1.97 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit any_not_sc(arr)
# 15.1 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.2 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So it's a question what case should be optimized: The best case? The worst case? The average case (what's the average case with any)?
It could be that the NumPy developers wanted to optimize the worst case and not the best case. Or they just didn't care? Or maybe they just wanted "predictable" performance in any case.
Just a note on your code: You measure the time it takes to create an array as well as the time it takes to execute any. If any were short-circuit you wouldn't have noticed it with your code!
%timeit np.ones(10**6)
# 9.12 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.ones(10**7)
# 86.2 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For conclusive timings supporting your question you should have used this instead:
arr1 = np.ones(10**6)
arr2 = np.ones(10**7)
%timeit arr1.any()
# 4.04 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr2.any()
# 39.8 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)