Performance of numpy all/any vs testing a single element - python

I create an array that does not contain a single zero (strictly speaking, np.random.rand() samples [0, 1) uniformly, so a zero could occur, but with essentially zero probability; let's ignore that). I want to check whether all values are equal to zero (for another purpose the arrays may contain all zeros). Below are some timings.
Surprisingly to me, checking a single (nonzero) element is about 2000 times faster than using np.all() or np.any(). I would have assumed that NumPy internally replaces np.all() by np.any() of the inverse condition, and that np.any()/np.all() returns True/False at the first element for which the condition is fulfilled/violated (i.e. it does not create the entire array of True or False values first).
How come np.all() and np.any() are that much slower when they would only have to check one element? Or is this because of the external knowledge I bring, namely that the array does not consist of all zeros? In the case of an all-zeros array, I guess it might simply be slow to do the boolean comparison separately for each element. I don't know about the performance of the underlying low-level algorithms, but each element needs to be accessed once either way, whether the check goes element by element or builds the whole boolean array first.
import numpy as np
np.random.seed(100)
a = np.random.rand(10418,144)
%timeit a[0,0] == 0
%timeit (a == 0).all()
%timeit np.all(a == 0)
%timeit (a != 0).any()
%timeit np.any(a != 0)
# 400 ns ± 2.08 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 713 µs ± 382 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 720 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 711 µs ± 407 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 723 µs ± 630 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

When you write a == 0, NumPy creates a new boolean array, compares each element of a with 0, and stores the result in that array. This allocation, initialization, and subsequent deallocation is the reason for the high cost.
Note that you don't need the explicit a == 0 in the first place: zero values always evaluate to False and nonzero values to True, so np.all(a) is equivalent to np.all(a != 0), and np.all(a == 0) is equivalent to not np.any(a).
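As a quick illustration of that equivalence (a minimal sketch; keep in mind that, as the next answer explains, whether not a.any() is actually faster depends on the NumPy version, because the implicit float-to-bool cast was not always SIMD-vectorized):
import numpy as np

np.random.seed(100)
a = np.random.rand(10418, 144)

# Same truth value, but the right-hand side skips the explicit a == 0 temporary:
assert bool(np.all(a == 0)) == (not np.any(a))
assert bool(np.any(a != 0)) == bool(a.any())

# %timeit (a == 0).all()  # allocates a temporary boolean array
# %timeit not a.any()     # relies on the implicit nonzero test instead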

Related

Python comparing array to zero faster than np.any(array)

I want to test whether all elements of an array are zero. According to the StackOverflow posts Test if numpy array contains only zeros and https://stackoverflow.com/a/72976775/5269892, compared to (array == 0).all(), not array.any() should be both the most memory-efficient and the fastest method.
I tested the performance with an array of random floating-point numbers, see below. Somehow though, at least for the given array size, not array.any() and even casting the array to boolean type appear to be slower than (array == 0).all(). How come?
np.random.seed(100)
a = np.random.rand(10418*144)
%timeit (a == 0)
%timeit (a == 0).all()
%timeit a.astype(bool)
%timeit a.any()
%timeit not a.any()
# 711 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 740 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 1.69 ms ± 587 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 1.71 ms ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 1.71 ms ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The problem is due to the first two operations being vectorized using SIMD instructions while the last three are not. More specifically, the last three calls do an implicit conversion to bool (_aligned_contig_cast_double_to_bool) which is not yet vectorized. This is a known issue and I have already proposed a pull request for it (which revealed some unexpected issues due to undefined behaviour, now fixed). If everything goes well, it should be available in the next major release of NumPy.
Note that a.any() and not a.any() implicitly cast the array to an array of booleans first, so that the any reduction itself can then run faster. This is not very efficient, but it is done this way to reduce the number of generated function variants (NumPy is written in C, so a different implementation has to be generated for each type; optimizing many variants is hard, so implicit casts are preferred here, which also reduces the size of the generated binaries). If this is not fast enough, you can use Cython to generate specialized, faster code.
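Until such a fix lands, a specialized routine can also help. The answer suggests Cython; here is merely a comparable, hedged sketch using Numba (not the answerer's code), which short-circuits at the first nonzero element instead of casting the whole array:
import numba as nb
import numpy as np

@nb.njit(cache=True)
def all_zero(arr):
    # Walk the array and bail out at the first nonzero value;
    # no temporary boolean array and no full cast is needed.
    for x in arr:
        if x != 0.0:
            return False
    return True

np.random.seed(100)
a = np.random.rand(10418 * 144)
print(all_zero(a))              # False (bails out almost immediately)
print(all_zero(np.zeros(100)))  # True (scans everything)
Note that the first call includes the JIT compilation time, so any speedup only pays off for repeated calls or large arrays.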

Is function any() in pandas.groupby short-circuited? [duplicate]

I don't understand why such a basic optimization has not been done yet:
In [1]: one_million_ones = np.ones(10**6)
In [2]: %timeit one_million_ones.any()
100 loops, best of 3: 693µs per loop
In [3]: ten_millions_ones = np.ones(10**7)
In [4]: %timeit ten_millions_ones.any()
10 loops, best of 3: 7.03 ms per loop
The whole array is scanned, even though the conclusion is evident from the very first item.
It's an unfixed performance regression. NumPy issue 3446. There actually is short-circuiting logic, but a change to the ufunc.reduce machinery introduced an unnecessary chunk-based outer loop around the short-circuiting logic, and that outer loop doesn't know how to short circuit. You can see some explanation of the chunking machinery here.
The short-circuiting effects wouldn't have shown up in your test even without the regression, though. First, you're timing the array creation, and second, I don't think they ever put in the short-circuit logic for any input dtype but boolean. From the discussion, it sounds like the details of the ufunc reduction machinery behind numpy.any would have made that difficult.
The discussion does bring up the surprising point that the argmin and argmax methods appear to short-circuit for boolean input. A quick test shows that as of NumPy 1.12 (not quite the most recent version, but the version currently on Ideone), x[x.argmax()] short-circuits, and it outcompetes x.any() and x.max() for 1-dimensional boolean input no matter whether the input is small or large and no matter whether the short-circuiting pays off. Weird!
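For completeness, a small sketch of that argmax trick for a boolean array (hedged: whether it actually beats x.any() depends on the NumPy version and on the data):
import numpy as np

x = np.zeros(10**7, dtype=bool)
x[100] = True  # an early True lets argmax stop scanning almost immediately

# argmax on a boolean array returns the index of the first True,
# or 0 if there is none, so indexing with it answers "any True?".
print(bool(x[x.argmax()]))  # True
print(bool(x.any()))        # True, the straightforward way, for comparison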
There's a price you pay for short-circuiting. You need to introduce branches in your code.
The problem with branches (e.g. if statements) is that they can be slower than equivalent branchless operations, and branch misprediction can add significant overhead.
Also, depending on the compiler and processor, the branchless code can be vectorized. I'm not an expert in this, but maybe some sort of SIMD or SSE?
I'll use numba here because the code is easy to read and it's fast enough that the performance actually changes based on these small differences:
import numba as nb
import numpy as np
@nb.njit
def any_sc(arr):
    # short-circuiting version: returns at the first truthy element
    for item in arr:
        if item:
            return True
    return False

@nb.njit
def any_not_sc(arr):
    # branchless version: always scans the whole array
    res = False
    for item in arr:
        res |= item
    return res
arr = np.zeros(100000, dtype=bool)
assert any_sc(arr) == any_not_sc(arr)
%timeit any_sc(arr)
# 126 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit any_not_sc(arr)
# 15.5 µs ± 962 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.1 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It's almost 10 times faster in the worst case without branches. But in the best case the short-circuit function is much faster:
arr = np.zeros(100000, dtype=bool)
arr[0] = True
%timeit any_sc(arr)
# 1.97 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit any_not_sc(arr)
# 15.1 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.2 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So it's a question of which case should be optimized: the best case? The worst case? The average case (and what is the average case for any)?
It could be that the NumPy developers wanted to optimize the worst case and not the best case. Or they just didn't care? Or maybe they just wanted "predictable" performance in any case.
Just a note on your code: you measure the time it takes to create an array as well as the time it takes to execute any. If any did short-circuit, you wouldn't have noticed it with your code!
%timeit np.ones(10**6)
# 9.12 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.ones(10**7)
# 86.2 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For conclusive timings supporting your question you should have used this instead:
arr1 = np.ones(10**6)
arr2 = np.ones(10**7)
%timeit arr1.any()
# 4.04 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr2.any()
# 39.8 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

NumPy reshape array in-place

I need to do an in-place resizing of a NumPy array, so I'd prefer numpy.resize() over numpy.reshape(). I find that numpy.resize() returns an array with the wrong dimensions if I specify -1 in one of the dimensions of the required shape. Does anyone know why that is? What is an alternative way to do in-place resizing of an array?
The in-place resize you get with ndarray.resize does not allow for negative dimensions. You can easily check yourself:
a = np.array([[0,1],[2,3]])
a.resize((4,-1))
# ValueError: negative dimensions not allowed
In most cases, np.reshape returns a view of the array, so there is no unnecessary copying or additional memory allocation involved (though it doesn't modify the array in place):
a_view = a.reshape(4,-1)
np.shares_memory(a, a_view)
# True
But even though reshape does not allow for in-place operations, what you can do is assign the new shape to the shape attribute of the array, which does allow for negative dimensions:
a.shape = (4,-1)
This is an in-place operation, and just as efficient as a.resize((4,1)) would be. Note that this method raises an error when the reshape cannot be done without copying the data.
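For example (a hedged illustration; the exact exception text may vary between NumPy versions), assigning to .shape fails when the data would have to be copied, e.g. for a transposed, non-contiguous view:
import numpy as np

c = np.arange(12).reshape(3, 4).T  # non-contiguous view
try:
    c.shape = (12,)  # cannot be done without copying the data
except AttributeError as e:
    print(e)  # e.g. "Incompatible shape for in-place modification. ..."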
Here are some timings for efficiency comparison with a larger array, including the timings for reassigning from a view:
def inplace_reshape(a):
    a.shape = (10000,-1)

def inplace_resize(a):
    a.resize((10000,3))

def reshaped_view(a):
    a = np.reshape(a, (10000,-1))

def resized_copy(a):
    a = np.resize(a, (10000,3))
a = np.random.random((30000,1))
%timeit inplace_reshape(a)
# 383 ns ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit inplace_resize(a)
# 294 ns ± 20.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit reshaped_view(a)
# 1.5 µs ± 25.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit resized_copy(a)
# 21.5 µs ± 289 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Where both of them produce the same result:
b = np.copy(a)
a.shape = (10000,-1)
b.resize((10000,3))
np.array_equal(a,b)
# True

Improve performance in lists

I have a problem where I am given a randomly ordered list, and for each element I want to know how many elements with a greater index are smaller in value than that element.
For example:
[1,2,5,3,7,6,8,4]
should return:
[0,0,2,0,2,1,1,0]
This is the code I have that is currently working.
bribe_array = [0] * len(q)
for i in range(0, len(bribe_array)-1):
    bribe_array[i] = sum(j<q[i] for j in q[(i+1):])
This does produce the desired array but it runs slowly. What is the more pythonic way to get this accomplished?
We could fiddle around with the code in the question, but it would still be an O(n^2) algorithm. Truly improving the performance is not a matter of making the implementation more or less pythonic, but of using a different approach with a helper data structure.
Here's an outline for an O(n log n) solution: implement a self-balancing BST (AVL or red-black are good options), and additionally store in each node an attribute with the size of the subtree rooted in it. Now traverse the list from right to left and insert all its elements in the tree as new nodes. We also need an extra output list of the same size of the input list to keep track of the answer.
For every node we insert in the tree, we compare its key with the root. If it's greater than the value in the root, it means that it's greater than all the nodes in the left subtree, hence we need to add the size of the left subtree to the answer list at the position of the element we're trying to insert.
We keep doing this recursively and updating the size attribute in each node we visit, until we find the right place to insert the new node, and proceed to the next element in the input list. In the end the output list will contain the answer.
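A minimal sketch of that idea (hedged: for brevity this uses a plain, unbalanced BST augmented with subtree sizes, so its worst case degrades to O(n^2); a real implementation would use an AVL or red-black tree as described above):
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted at this node

def insert(node, key):
    # Insert key into the subtree rooted at node. Returns the new subtree root
    # and the number of existing keys in this subtree that are strictly smaller than key.
    if node is None:
        return Node(key), 0
    node.size += 1
    if key > node.key:
        # the node itself and its whole left subtree are smaller than key
        left_size = node.left.size if node.left else 0
        node.right, smaller = insert(node.right, key)
        return node, smaller + left_size + 1
    else:
        node.left, smaller = insert(node.left, key)
        return node, smaller

def count_smaller_to_the_right(items):
    root, answer = None, []
    for x in reversed(items):        # traverse the list from right to left
        root, smaller = insert(root, x)
        answer.append(smaller)
    return answer[::-1]

print(count_smaller_to_the_right([1, 2, 5, 3, 7, 6, 8, 4]))
# [0, 0, 2, 0, 2, 1, 1, 0]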
Another option that's much simpler than implementing a balanced BST is to adapt merge sort to count inversions and accumulate them during the process. Clearly, any swap of two adjacent elements is an inversion, so the lower-indexed element gets one count. Then, during the merge traversal, simply keep track of how many elements from the right group have already been merged in and add that count to each element subsequently taken from the left group (a code sketch follows the illustration below).
Here's a very crude illustration :)
[1,2,5,3,7,6,8,4]
sort 1,2 | 5,3
3,5 -> 5: 1
merge
1,2,3,5
sort 7,6 | 8,4
6,7 -> 7: 1
4,8 -> 8: 1
merge
4 -> 6: 1, 7: 2
4,6,7,8
merge 1,2,3,5 | 4,6,7,8
1,2,3,4 -> 1 moved
5 -> +1 -> 5: 2
6,7,8
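And a hedged sketch of that merge-sort variant (not the answerer's code): it sorts indices by value and, while merging, adds to each left-half element the number of right-half elements placed before it, i.e. the smaller values originally to its right:
def smaller_to_the_right(values):
    counts = [0] * len(values)

    def merge_sort(idx):
        # idx is a list of original indices; returns it sorted by value
        if len(idx) <= 1:
            return idx
        mid = len(idx) // 2
        left, right = merge_sort(idx[:mid]), merge_sort(idx[mid:])
        merged, j = [], 0
        for i in left:
            # move right-half elements that are strictly smaller first
            while j < len(right) and values[right[j]] < values[i]:
                merged.append(right[j])
                j += 1
            counts[i] += j  # j right-half elements were smaller than values[i]
            merged.append(i)
        merged.extend(right[j:])
        return merged

    merge_sort(list(range(len(values))))
    return counts

print(smaller_to_the_right([1, 2, 5, 3, 7, 6, 8, 4]))
# [0, 0, 2, 0, 2, 1, 1, 0]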
There are several ways of speeding up your code without touching the overall computational complexity.
This is so because there are several ways of writing this very algorithm.
Let's start with your code:
def bribe_orig(q):
    bribe_array = [0] * len(q)
    for i in range(0, len(bribe_array)-1):
        bribe_array[i] = sum(j<q[i] for j in q[(i+1):])
    return bribe_array
This is of somewhat mixed style: firstly, you generate a list of zeros, which is not really needed, as you can append items on demand; secondly, the outer loop uses range(), which is sub-optimal given that you access a specific item multiple times, so binding it to a local name would be faster; thirdly, you pass a generator to sum(), which is also sub-optimal since it sums up booleans and hence performs implicit conversions all the time.
A cleaner approach would be:
def bribe(items):
    result = []
    for i, item in enumerate(items):
        partial_sum = 0
        for x in items[i + 1:]:
            if x < item:
                partial_sum += 1
        result.append(partial_sum)
    return result
This is somewhat simpler, and since it does a number of things explicitly and only performs an addition when necessary (skipping the cases where you would be adding 0), it may be faster.
Another way of writing your code in a more compact way is:
def bribe_compr(items):
    return [sum(x < item for x in items[i + 1:]) for i, item in enumerate(items)]
This involves the use of generators and list comprehensions, but also the outer loop is written with enumerate() following the typical Python style.
But Python is infamously slow in raw looping, therefore when possible, vectorization can be helpful. One way of doing this (only for the inner loop) is with numpy:
import numpy as np
def bribe_np(items):
    items = np.array(items)
    return [np.sum(items[i + 1:] < item) for i, item in enumerate(items)]
Finally, it is possible to use a JIT compiler to speed up the plain Python loops using Numba:
import numba as nb
bribe_jit = nb.jit(bribe)
As with any JIT, it has some cost for the just-in-time compilation, which is eventually offset for large enough loops.
Unfortunately, Numba's JIT does not support all Python code, but when it does, like in this case, it can be pretty rewarding.
Let's look at some numbers.
Consider the input generated with the following:
import numpy as np
np.random.seed(0)
n = 10
q = np.random.randint(1, n, n)
On small-sized input (n = 10):
%timeit bribe_orig(q)
# 228 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe(q)
# 20.3 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bribe_compr(q)
# 216 µs ± 5.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_np(q)
# 133 µs ± 9.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bribe_jit(q)
# 1.11 µs ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
On medium-sized input (n = 100):
%timeit bribe_orig(q)
# 20.5 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bribe(q)
# 741 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_compr(q)
# 18.9 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bribe_np(q)
# 1.22 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_jit(q)
# 7.54 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
On larger input (n = 10000):
%timeit bribe_orig(q)
# 1.99 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit bribe(q)
# 60.6 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bribe_compr(q)
# 1.8 s ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit bribe_np(q)
# 12.8 ms ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bribe_jit(q)
# 182 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
From these results, we observe that we gain the most from substituting sum() with the explicit construct involving only Python loops.
The use of comprehensions does not get you more than roughly a 10% improvement over your code.
For larger inputs, the use of NumPy can be even faster than the explicit construct involving only Python loops.
However, you get the real deal when you use the Numba-JITed version of bribe().
You can get better performance by progressively building a sorted list, going from last to first in your array. Using a binary search on the sorted list for each element in the array, you get the index at which the element will be inserted, which also happens to be the number of smaller elements among the ones already processed.
Collecting these insertion points will give you the expected result (in reverse order).
Here's an example:
a = [1,2,5,3,7,6,8,4]
from bisect import bisect_left
s = []
r = []
for x in reversed(a):
    p = bisect_left(s,x)
    r.append(p)
    s.insert(p,x)
r = r[::-1]
print(r) # [0,0,2,0,2,1,1,0]
For this example, the progression will be as follows:
step 1: x = 4, p=0 ==> r=[0] s=[4]
step 2: x = 8, p=1 ==> r=[0,1] s=[4,8]
step 3: x = 6, p=1 ==> r=[0,1,1] s=[4,6,8]
step 4: x = 7, p=2 ==> r=[0,1,1,2] s=[4,6,7,8]
step 5: x = 3, p=0 ==> r=[0,1,1,2,0] s=[3,4,6,7,8]
step 6: x = 5, p=2 ==> r=[0,1,1,2,0,2] s=[3,4,5,6,7,8]
step 7: x = 2, p=0 ==> r=[0,1,1,2,0,2,0] s=[2,3,4,5,6,7,8]
step 8: x = 1, p=0 ==> r=[0,1,1,2,0,2,0,0] s=[1,2,3,4,5,6,7,8]
Reverse r: r = r[::-1] ==> r=[0,0,2,0,2,1,1,0]
You will be performing N iterations (the size of the array), and each binary search runs in log(i) where i goes from 1 to N, so the searches alone stay within O(N*log(N)). The main caveat is the cost of s.insert(p,x), which introduces some variability depending on the order of the original list.
Overall, the performance profile should fall between O(N) and O(N*log(N)), with a worst case of O(n^2) when the array is already sorted (every insertion then lands at the front of the list, making each insert O(N)).
If you only need to make your code a little faster and more concise, you could use a list comprehension (but that'll still be O(n^2) time):
r = [sum(v<p for v in a[i+1:]) for i,p in enumerate(a)]
