Say I have an array x = np.arange(6).reshape(3, 2).
What is the meaning of x[False], or x[np.asanyarray(False)]? Both result in array([], shape=(0, 3, 2), dtype=int64), which is unexpected.
I expected to get an IndexError because of an improperly sized mask, as from something like x[np.ones((2, 2), dtype=np.bool)].
This behavior is consistent for x[True] and x[np.asanyarray(True)], as both result in an additional dimension: array([[[0, 1], [2, 3], [4, 5]]]).
I am using numpy 1.13.1. It appears that the behavior has changed recently, so while it is nice to have answers for older versions, please mention your version in the answers.
EDIT
Just for completeness, I filed https://github.com/numpy/numpy/issues/9515 based on the commentary on this question.
EDIT 2
And closed it almost immediately.
There's technically no requirement that the dimensionality of a mask match the dimensionality of the array you index with it. (In previous versions, there were even fewer restrictions, and you could get away with some extreme shape mismatches.)
The docs describe boolean indexing as
A single boolean index array is practically identical to x[obj.nonzero()] where, as described above, obj.nonzero() returns a tuple (of length obj.ndim) of integer index arrays showing the True elements of obj.
but nonzero is weird for 0-dimensional input, so this case is one of the ways that "practically identical" turns out to be not identical:
the nonzero equivalence for Boolean arrays does not hold for zero dimensional boolean arrays.
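To see the weirdness, here is what nonzero does with a 0-dimensional boolean in the version the question targets (outputs from NumPy 1.13; later versions deprecate calling nonzero on 0-d arrays):

import numpy as np

# nonzero treats the 0-d boolean as if it were a 1-element 1-d array:
print(np.nonzero(np.asanyarray(True)))   # (array([0]),)
print(np.nonzero(np.asanyarray(False)))  # (array([], dtype=int64),)

Taken literally, the equivalence would therefore make x[True] behave like x[[0]] (select row 0, giving shape (1, 2)), which is not what actually happens.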
NumPy has a special case for a 0-dimensional boolean index, motivated by the desire to have the following behavior:
In [3]: numpy.array(3)[True]
Out[3]: array([3])
In [4]: numpy.array(3)[False]
Out[4]: array([], dtype=int64)
I'll refer to a comment in the source code that handles a 0-dimensional boolean index:
if (PyArray_NDIM(arr) == 0) {
/*
* This can actually be well defined. A new axis is added,
* but at the same time no axis is "used". So if we have True,
* we add a new axis (a bit like with np.newaxis). If it is
* False, we add a new axis, but this axis has 0 entries.
*/
While this is primarily intended for a 0-dimensional index to a 0-dimensional array, it also applies to indexing multidimensional arrays with booleans. Thus,
x[True]
is equivalent to x[np.newaxis], producing a result with a new length-1 axis in front, and
x[False]
produces a result with a new axis in front of length 0, selecting no elements.
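A minimal sketch confirming this on the question's array (outputs as of NumPy 1.13):

import numpy as np

x = np.arange(6).reshape(3, 2)

print(x[True].shape)         # (1, 3, 2) -- same as x[np.newaxis]
print(x[np.newaxis].shape)   # (1, 3, 2)
print(x[False].shape)        # (0, 3, 2) -- new leading axis with 0 entries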
Related
I find this behaviour utter nonsense. It happens only with numpy arrays; ordinary Python lists will just throw an error.
Let's create two arrays:
randomNumMatrix = np.random.randint(0, 20, (3, 3, 3), dtype=np.int)
randRow = np.array([0, 1, 2], dtype=np.int)
If we pass an array as an index to get something from another array, the original array is returned.
randomNumMatrix[randRow]
The code above returns an equivalent of randomNumMatrix. I find this unintuitive. I would expect it not to work, or at least to return the equivalent of
randomNumMatrix[randRow[0]][randRow[1]][randRow[2]].
Additional observations:
A)
The code below does not work; indexing with it throws IndexError: index 3 is out of bounds for axis 0 with size 3:
randRow = np.array([0, 1, 3], dtype=np.int)
B)
To my surprise, the code below works:
randRow = np.array([0, 1, 2, 2, 0, 1, 2], dtype=np.int)
Can somebody please explain what the advantages of this feature are?
In my opinion it only creates confusion.
What is randomNumMatrix[randRow[0]][randRow[1]][randRow[2]]?
That's not valid Python.
In numpy there is a difference between
arr[(x,y,z)] # equivalent to arr[x,y,z]
and
arr[np.array([x,y,z])] # equivalent to arr[np.array([x,y,z]),:,:]
The tuple provides a scalar index for each dimension. The array (or list) provides multiple indices for one dimension.
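A quick illustration of the difference, using an arbitrary 3x3x3 array:

import numpy as np

arr = np.arange(27).reshape(3, 3, 3)

# Tuple: one scalar index per dimension -> a single element
print(arr[(0, 1, 2)])                   # same as arr[0, 1, 2]

# Array: several indices for the first dimension -> three subarrays
print(arr[np.array([0, 1, 2])].shape)   # (3, 3, 3), same as arr[[0, 1, 2], :, :]

With randRow = np.array([0, 1, 2]), the second form selects rows 0, 1 and 2 along the first axis, which is exactly why randomNumMatrix[randRow] reproduces the whole array.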
You may need to study the numpy docs on indexing, especially advanced indexing.
I know similar questions have been asked before (e.g.), but AFAIK nobody has answered my specific question...
My question is about the numpy mixed advanced / basic indexing described here:
... Two cases of index combination need to be distinguished:
The advanced indexes are separated by a slice, ellipsis or newaxis. For example x[arr1,:,arr2].
The advanced indexes are all next to each other. For example x[...,arr1,arr2,:] but not x[arr1,:,1] since 1 is an advanced index in this regard.
In the first case, the dimensions resulting from the advanced indexing operation come first in the result array, and the subspace dimensions after that. In the second case, the dimensions from the advanced indexing operations are inserted into the result array at the same spot as they were in the initial array (the latter logic is what makes simple advanced indexing behave just like slicing).
Why is this distinction necessary?
I was expecting the behaviour described for case 2 to be used in all cases. Why does it matter whether indexes are next to each other?
I understand you may want the behaviour of case 1 in some situations; for example, "vectorization" of index results along new dimensions. But this behaviour can and should be defined by the user. That is, if case 2 behaviour were the default, case 1 behaviour would be possible using only:
x[arr1,:,arr2].reshape((len(arr1),x.shape[1]))
I know you can achieve the behaviour described in case 2 using np.ix_(), but this inconsistency in default indexing behaviour is unexpected and unjustified, in my opinion. Can someone justify it?
Thanks,
The behavior for case 2 isn't well-defined for case 1. There's a subtlety you're probably missing in the following sentence:
In the second case, the dimensions from the advanced indexing operations are inserted into the result array at the same spot as they were in the initial array
You're probably imagining a one-to-one correspondence between input and output dimensions, perhaps because you're imagining Matlab-style indexing. NumPy doesn't work like that. If you have four arrays with the following shapes:
a.shape == (2, 3, 4, 5, 6)
b.shape == (20, 30)
c.shape == (20, 30)
d.shape == (20, 30)
then a[b, :, c, :, d] has four dimensions, with lengths 3, 5, 20, and 30. There is no unambiguous place to put the 20 and the 30. NumPy defaults to sticking them in front.
On the other hand, with a[:, b, c, d, :], the 20 and 30 can go where the 3, 4, and 5 were, because the 3, 4, and 5 were next to each other. The whole block of new dimensions goes where the whole block of original dimensions was, which only works if the original dimensions were in a single block in the original shape.
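A sketch verifying those shapes (the arrays here are zero-filled placeholders, since only the shapes matter):

import numpy as np

a = np.zeros((2, 3, 4, 5, 6))
b = np.zeros((20, 30), dtype=int)
c = np.zeros((20, 30), dtype=int)
d = np.zeros((20, 30), dtype=int)

# Advanced indices separated by slices: broadcast dims go to the front.
print(a[b, :, c, :, d].shape)    # (20, 30, 3, 5)

# Advanced indices all adjacent: broadcast dims replace the indexed block.
print(a[:, b, c, d, :].shape)    # (2, 20, 30, 6)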
Related to this question, I came across an indexing behaviour via Boolean arrays and broadcasting I do not understand. We know it's possible to index a NumPy array in 2 dimensions using integer indices and broadcasting. This is specified in the docs:
a = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])
c1 = np.where(b1)[0] # i.e. [1, 2]
c2 = np.where(b2)[0] # i.e. [0, 2]
a[c1[:, np.newaxis], c2] # or a[c1[:, None], c2]
array([[ 4,  6],
       [ 8, 10]])
However, the same does not work for Boolean arrays.
a[b1[:, None], b2]
IndexError: too many indices for array
The alternative numpy.ix_ works for both integer and Boolean arrays. This seems to be because ix_ performs specific manipulation for Boolean arrays to ensure consistent treatment.
a[np.ix_(b1, b2)]
array([[ 4,  6],
       [ 8, 10]])
assert np.array_equal(a[np.ix_(b1, b2)], a[np.ix_(c1, c2)])
So my question is: why does broadcasting work with integers, but not with Boolean arrays? Is this behaviour documented? Or am I misunderstanding a more fundamental issue?
As @Divakar noted in the comments, Boolean advanced indices behave as if they were first fed through np.nonzero and then broadcast together; see the relevant documentation for extensive explanations. To quote the docs,
In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above. x[ind_1, boolean_array, ind_2] is equivalent to x[(ind_1,) + boolean_array.nonzero() + (ind_2,)].
[...]
Combining multiple Boolean indexing arrays or a Boolean with an integer indexing array can best be understood with the obj.nonzero() analogy. The function ix_ also supports boolean arrays and will work without any surprises.
In your case broadcasting would not necessarily be a problem, since both arrays have only two nonzero elements. The problem is the number of dimensions in the result:
>>> len(b1[:,None].nonzero())
2
>>> len(b2.nonzero())
1
Consequently the indexing expression a[b1[:,None], b2] would be equivalent to a[b1[:,None].nonzero() + b2.nonzero()], which passes a length-3 tuple to a, i.e. an index for a 3d array. Hence the error you see about "too many indices".
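A sketch of that equivalence using the question's arrays:

import numpy as np

a = np.arange(12).reshape(3, 4)
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])

idx = b1[:, None].nonzero() + b2.nonzero()
print(len(idx))   # 3 -- three index arrays, i.e. an index for a 3d array
# a[idx] raises the same IndexError as a[b1[:, None], b2]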
The surprises mentioned in the docs are very close to your example: what if you hadn't injected that singleton dimension? Starting from a length-3 and a length-4 Boolean array you would've ended up with a length-2 advanced index, i.e. a 1d result of shape (2,). This is never what you'd want, which leads us to another piece of trivia on the subject.
There's been a lot of discussion about plans to revamp advanced indexing; see the work-in-progress draft NEP 21. The gist of the issue is that fancy indexing in numpy, while clearly documented, has some very quirky features which aren't practically useful for anything, but which can bite you if you make a mistake, by producing surprising results rather than errors.
A relevant quote from the NEP:
Mixed cases involving multiple array indices are also surprising, and only less problematic because the current behavior is so useless that it is rarely encountered in practice. When a boolean array index is mixed with another boolean or integer array, boolean array is converted to integer array indices (equivalent to np.nonzero()) and then broadcast. For example, indexing a 2D array of size (2, 2) like x[[True, False], [True, False]] produces a 1D vector with shape (1,), not a 2D sub-matrix with shape (1, 1).
Now, I emphasize that the NEP is very much work-in-progress, but one of the suggestions in the current state of the NEP is to forbid Boolean arrays in advanced indexing cases such as the above, and only allow them in "outer indexing" scenarios, i.e. exactly what np.ix_ would help you do with your Boolean array:
Boolean indexing is conceptionally outer indexing. Broadcasting together with other advanced indices in the manner of legacy indexing [i.e. the current behaviour] is generally not helpful or well defined. A user who wishes the "nonzero" plus broadcast behaviour can thus be expected to do this manually.
My point is that the behaviour of Boolean advanced indices and their deprecation status (or lack thereof) may change in the not-so-distant future.
What is the rationale behind the seemingly inconsistent behaviour of the following lines of code?
import numpy as np
# standard list
print(bool([])) # False - expected
print(bool([0])) # True - expected
print(bool([1])) # True - expected
print(bool([0,0])) # True - expected
# numpy arrays
print(bool(np.array([]))) # False - expected, deprecation warning: The
# truth value of an empty array is ambiguous...
print(bool(np.array([0]))) # False - unexpected, no warning
print(bool(np.array([1]))) # True - unexpected, no warning
print(bool(np.array([0,0]))) # ValueError: The truth value of an array
# with more than one element is ambiguous...
There are at least two inconsistencies from my point of view:
Standard python containers can be tested for emptiness with bool(container). Why do numpy arrays not follow this pattern? (bool(np.array([0])) yields False.)
Why is there an exception/deprecation warning when converting an empty numpy array or an array of length > 1, but it is okay to do so when the numpy array contains just one element?
Note that the deprecation warning for empty numpy arrays was added somewhere between numpy 1.11 and 1.14.
For the first problem, the reason is that it's not at all clear what you want to do with if np.array([1, 2]):.
This isn't a problem for if [1, 2]: because Python lists don't do element-wise anything. The only thing you can be asking is whether the list itself is truthy (non-empty).
But Numpy arrays do everything element-wise that possibly could be element-wise. Notice that this is hardly the only place, or even the most common place, where element-wise semantics mean that arrays work differently from normal Python sequences. For example:
>>> [1, 2] * 3
[1, 2, 1, 2, 1, 2]
>>> np.array([1, 2]) * 3
array([3, 6])
And, for this case in particular, boolean arrays are a very useful thing, especially since you can index with them:
>>> arr = np.array([1, 2, 3, 4])
>>> arr > 2 # all values > 2
array([False, False, True, True])
>>> arr[arr > 2] = 2 # clamp the values at <= 2
>>> arr
array([1, 2, 2, 2])
And once you have that feature, it becomes ambiguous what an array should mean in a boolean context. Normally, you want the bool array. But when you write if arr:, you could mean any of multiple things:
Do the body of the if for each element that's truthy. (Rewrite the body as an expression on arr indexed by the bool array.)
Do the body of the if if any element is truthy. (Use any.)
Do the body of the if if all elements are truthy. (Use all.)
A hybrid over some axis—e.g., do the body for each row where any element is truthy.
Do the body of the if if the array is nonempty—acting like a normal Python sequence but violating the usual element-wise semantics of an array. (Explicitly check for emptiness.)
So, rather than guess, and be wrong more often than not, numpy gives you an error and forces you to be explicit.
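Being explicit might look like this (a small sketch with an arbitrary array):

import numpy as np

arr = np.array([1, 2, 0])

print(arr.any())      # True  -- at least one element is truthy
print(arr.all())      # False -- not every element is truthy
print(arr.size > 0)   # True  -- the array is non-empty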
For the second problem, doesn't the warning text answer this for you? The truth value of a single element is obviously not ambiguous.
And single-element arrays—especially 0D ones—are often used as pseudo-scalars, so being able to do this isn't just harmless, it's also sometimes useful.
By contrast, asking "is this array empty" is rarely useful. A list is a variable-sized thing that you usually build up by adding one element at a time, zero or more times (possibly implicitly in a comprehension), so it's very often worth asking whether you added zero elements. But an array is a fixed-size thing, where you usually explicitly specified the size somewhere nearby in the code.
That's why it's allowed. And why it operates on the single value, not on the size of the array.
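For instance, repeating the question's own outputs plus the 0D case:

import numpy as np

print(bool(np.array([0])))   # False -- the single value decides, not emptiness
print(bool(np.array([1])))   # True
print(bool(np.array(0)))     # False -- a 0D pseudo-scalar behaves the same way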
For empty arrays (which you didn't ask about, but did bring up): here, instead of there being multiple reasonable things you could mean, it's hard to think of anything reasonable you could mean. Which is probably why this is the only case that's changed recently (see issue 9583), rather than being the same since the days when Python added __nonzero__.
I'm wondering about the order of indices returned by numpy.nonzero / numpy.flatnonzero.
I couldn't find anything in the docs about it. It just says:
A[nonzero(flag)] == A[flag]
While in most cases this is enough, there are some where you need a sorted list of indices. Is it guaranteed that the returned indices are sorted in the 1-D case, or do I need to sort them explicitly? (A related question is the order of elements returned by selecting with a boolean array directly (A[flag]), which must be the same according to the docs.)
Example: finding the "gaps" between True elements in flag:
flag = np.array([True, False, False, True], dtype=bool)
iflag = np.flatnonzero(flag)
gaps = iflag[1:] - iflag[:-1]
Thanks.
Given the specification for advanced (or "fancy") indexing with integers, the guarantee that A[nonzero(flag)] == A[flag] is also a guarantee that the values are sorted low-to-high in the 1-d case. However, in higher dimensions, the result (while "sorted") has a different structure than you might expect.
In short, given a 1-dimensional array of integers ind and a 1-dimensional array x to be indexed, we have the following for all valid i defined for ind:
result[i] = x[ind[i]]
result takes the shape of ind, and contains the values of x at the indices indicated by ind. This means that we can deduce that if x[flag] maintains the original order of x, and if x[nonzero(flag)] is the same as x[flag], then nonzero(flag) must always produce indices in sorted order.
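A quick empirical check of that deduction (any 1-d array and mask would do):

import numpy as np

x = np.arange(5) * 10
flag = np.array([True, False, True, False, True])

idx = np.flatnonzero(flag)
print(idx)                                # [0 2 4] -- ascending
assert np.array_equal(x[idx], x[flag])    # the documented guarantee
assert np.array_equal(idx, np.sort(idx))  # hence sorted order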
The only catch is that for multidimensional arrays, the indices are stored as distinct arrays for each dimension being indexed. So in other words,
x[array([0, 1, 2]), array([0, 0, 0])]
is equal to
array([x[0, 0], x[1, 0], x[2, 0]])
The values are still sorted, but each dimension is broken out into its own array. (You can do interesting things with broadcasting as a result; but that's beyond the scope of this answer.)
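A runnable version of the multidimensional case above, with arbitrary contents:

import numpy as np

x = np.arange(12).reshape(3, 4)
rows = np.array([0, 1, 2])
cols = np.array([0, 0, 0])

assert np.array_equal(x[rows, cols],
                      np.array([x[0, 0], x[1, 0], x[2, 0]]))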
The only problem with this line of reasoning is that -- to my great surprise -- I can't find an explicit statement guaranteeing that boolean indexing preserves the original order of the array. Nonetheless, I'm quite certain from experience that it does. More generally, it would be unbelievably perverse to have x[[True, True, True]] return a reversed version of x.