I know similar questions have been asked before, but AFAIK nobody has answered my specific question...
My question is about the numpy mixed advanced / basic indexing described here:
... Two cases of index combination need to be distinguished:
The advanced indexes are separated by a slice, ellipsis or newaxis. For example x[arr1,:,arr2].
The advanced indexes are all next to each other. For example x[...,arr1,arr2,:] but not x[arr1,:,1] since 1 is an advanced index in this regard.
In the first case, the dimensions resulting from the advanced indexing operation come first in the result array, and the subspace dimensions after that. In the second case, the dimensions from the advanced indexing operations are inserted into the result array at the same spot as they were in the initial array (the latter logic is what makes simple advanced indexing behave just like slicing).
Why is this distinction necessary?
I was expecting the behaviour described for case 2 to be used in all cases. Why does it matter whether indexes are next to each other?
I understand you may want the behaviour of case 1 in some situations; for example, "vectorization" of index results along new dimensions. But this behaviour can and should be defined by the user. That is, if case 2 behaviour was the default, case 1 behaviour would be possible using only:
x[arr1,:,arr2].reshape((len(arr1),x.shape[1]))
I know you can achieve the behaviour described in case 2 using np.ix_(), but this inconsistency in default indexing behaviour is unexpected and unjustified, in my opinion. Can someone justify it?
The behavior for case 2 isn't well-defined for case 1. There's a subtlety you're probably missing in the following sentence:
In the second case, the dimensions from the advanced indexing operations are inserted into the result array at the same spot as they were in the initial array
You're probably imagining a one-to-one correspondence between input and output dimensions, perhaps because you're imagining Matlab-style indexing. NumPy doesn't work like that. If you have four arrays with the following shapes:
a.shape == (2, 3, 4, 5, 6)
b.shape == (20, 30)
c.shape == (20, 30)
d.shape == (20, 30)
then a[b, :, c, :, d] has four dimensions, with lengths 3, 5, 20, and 30. There is no unambiguous place to put the 20 and the 30. NumPy defaults to sticking them in front.
On the other hand, with a[:, b, c, d, :], the 20 and 30 can go where the 3, 4, and 5 were, because the 3, 4, and 5 were next to each other. The whole block of new dimensions goes where the whole block of original dimensions was, which only works if the original dimensions were in a single block in the original shape.
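A quick sketch makes the shapes concrete (the index arrays here are all zeros, which is a valid index for any axis; only the shapes matter):

>>> import numpy as np
>>> a = np.zeros((2, 3, 4, 5, 6))
>>> b = c = d = np.zeros((20, 30), dtype=int)   # placeholder indices
>>> a[b, :, c, :, d].shape   # separated: broadcast dims go in front
(20, 30, 3, 5)
>>> a[:, b, c, d, :].shape   # adjacent: inserted where the 3, 4, 5 were
(2, 20, 30, 6)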
I'm just learning Python, but have decided to do so by recoding and improving an old Java-based school AI project.
My project involves a mathematical operation that is basically a discrete convolution, but without one of the functions time-reversed (i.e. a cross-correlation).
So, while in my original Java project I just wrote all the code to do the operation myself, since I'm working in Python, with its great math libraries like numpy and scipy, I figured I could just make use of an existing convolution function like scipy.convolve. However, this would require me to pre-reverse one of the two arrays, so that when scipy.convolve reverses one of them to perform the convolution, it is really un-reversing it. (I also still don't know how to be sure to pre-reverse the right one of the two arrays so that they slide past each other forwards rather than backwards, but I assume I should ask that as a separate question.)
Unlike my Java code, which only handled one-dimensional data, I want to extend this project to multidimensional data. I have learned that for a numpy array of known dimension, such as a three-dimensional array a, I can fully reverse the array (or rather get back a view that is reversed, which is much faster) with
a = a[::-1, ::-1, ::-1]
However, this requires a ::-1 for every dimension. How can I perform this same reversal within a method, for an array of arbitrary dimension, with the same result as the above code?
You can use np.flip. From the documentation:
numpy.flip(m, axis=None)
Reverse the order of elements in an array along the given axis.
The shape of the array is preserved, but the elements are reordered.
Note: flip(m) corresponds to m[::-1,::-1,...,::-1] with ::-1 at all positions.
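For example (this assumes a NumPy version where axis=None is accepted, i.e. 1.15 or later):

>>> import numpy as np
>>> a = np.arange(8).reshape(2, 4)
>>> np.flip(a)   # reverses every axis and returns a view
array([[7, 6, 5, 4],
       [3, 2, 1, 0]])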
This is a possible solution:
slices = tuple([slice(-1, -n-1, -1) for n in a.shape])
result = a[slices]
This extends to an arbitrary number of axes. Verification:
a = np.arange(8).reshape(2, 4)
slices = tuple([slice(-1, -n-1, -1) for n in a.shape])
result = a[slices]
yields:
>>> a
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
>>> result
array([[7, 6, 5, 4],
       [3, 2, 1, 0]])
Related to this question, I came across indexing behaviour with Boolean arrays and broadcasting that I do not understand. We know it is possible to index a NumPy array in 2 dimensions using integer indices and broadcasting. This is specified in the docs:
a = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])
c1 = np.where(b1)[0] # i.e. [1, 2]
c2 = np.where(b2)[0] # i.e. [0, 2]
a[c1[:, np.newaxis], c2]  # or a[c1[:, None], c2]
array([[ 4,  6],
       [ 8, 10]])
However, the same does not work for Boolean arrays.
a[b1[:, None], b2]
IndexError: too many indices for array
The alternative numpy.ix_ works for both integer and Boolean arrays. This seems to be because ix_ performs specific manipulation for Boolean arrays to ensure consistent treatment.
assert np.array_equal(a[np.ix_(b1, b2)], a[np.ix_(c1, c2)])

a[np.ix_(b1, b2)]
array([[ 4,  6],
       [ 8, 10]])
So my question is: why does broadcasting work with integers, but not with Boolean arrays? Is this behaviour documented? Or am I misunderstanding a more fundamental issue?
As @Divakar noted in the comments, Boolean advanced indices behave as if they were first fed through np.nonzero and then broadcast together; see the relevant documentation for extensive explanations. To quote the docs,
In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above. x[ind_1, boolean_array, ind_2] is equivalent to x[(ind_1,) + boolean_array.nonzero() + (ind_2,)].
[...]
Combining multiple Boolean indexing arrays or a Boolean with an integer indexing array can best be understood with the obj.nonzero() analogy. The function ix_ also supports boolean arrays and will work without any surprises.
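A quick check of the quoted equivalence, using the a and b1 from the question:

>>> np.array_equal(a[b1], a[b1.nonzero()])
True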
In your case broadcasting would not necessarily be a problem, since both arrays have only two nonzero elements. The problem is the number of dimensions in the result:
>>> len(b1[:,None].nonzero())
2
>>> len(b2.nonzero())
1
Consequently, the indexing expression a[b1[:,None], b2] is equivalent to a[b1[:,None].nonzero() + b2.nonzero()], which indexes the 2d array a with a length-3 tuple of index arrays, i.e. a 3d index. Hence the error you see about "too many indices".
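You can verify this directly; feeding the expanded tuple to a raises the same error:

>>> a[b1[:, None].nonzero() + b2.nonzero()]
IndexError: too many indices for array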
The surprises mentioned in the docs are very close to your example: what if you hadn't injected that singleton dimension? Starting from a length-3 and a length-4 Boolean array, you would have ended up with a length-2 advanced index, i.e. a 1d array of shape (2,); see the check below. This is never what you'd want, which leads us to another piece of trivia on the subject.
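To see that surprise concretely (continuing with the a, b1 and b2 from the question):

>>> a[b1, b2]  # nonzero indices [1, 2] and [0, 2] broadcast pairwise
array([ 4, 10])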
There has been a lot of discussion about plans to revamp advanced indexing; see the work-in-progress draft NEP 21. The gist of the issue is that fancy indexing in numpy, while clearly documented, has some very quirky features which aren't practically useful for anything, but which can bite you when you make a mistake, producing surprising results rather than errors.
A relevant quote from the NEP:
Mixed cases involving multiple array indices are also surprising, and only less problematic because the current behavior is so useless that it is rarely encountered in practice. When a boolean array index is mixed with another boolean or integer array, boolean array is converted to integer array indices (equivalent to np.nonzero()) and then broadcast. For example, indexing a 2D array of size (2, 2) like x[[True, False], [True, False]] produces a 1D vector with shape (1,), not a 2D sub-matrix with shape (1, 1).
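The NEP's example is easy to reproduce (using explicit boolean arrays, to avoid any ambiguity about boolean lists):

>>> x = np.arange(4).reshape(2, 2)
>>> t = np.array([True, False])
>>> x[t, t]
array([0])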
Now, I emphasize that the NEP is very much work-in-progress, but one of the suggestions in the current state of the NEP is to forbid Boolean arrays in advanced indexing cases such as the above, and only allow them in "outer indexing" scenarios, i.e. exactly what np.ix_ would help you do with your Boolean array:
Boolean indexing is conceptually outer indexing. Broadcasting together with other advanced indices in the manner of legacy indexing [i.e. the current behaviour] is generally not helpful or well defined. A user who wishes the "nonzero" plus broadcast behaviour can thus be expected to do this manually.
My point is that the behaviour of Boolean advanced indices and their deprecation status (or lack thereof) may change in the not-so-distant future.
In Python/Numpy I can slice arrays in this form:
arr = np.ones((3,4,5))
arr[2]
and the shape will be maintained:
(arr[2]).shape # prints (4, 5)
Which means that, if I want to keep the shape of the array, the following code works for N-dimensional arrays
arr = np.ones((3,4,5,2,2))
(arr[2]).shape # prints (4, 5, 2, 2)
This is great if I want to write functions that work for N-dim arrays while preserving the shape of the result.
In Julia, however, the same action does not preserve the structure:
arr = ones(3,4,5)
size(arr[3]) # prints () (0-dimensional)
size(arr[3,:]) # prints (20,)
because of partial linear indexing. So if I want to keep the original dimensions I need to write arr[3,:,:], which only works for 3D arrays. If I want a 4D array I would have to use arr[3,:,:,:], and so on. The code isn't general.
Furthermore, when you get to arrays of 5 dimensions or more (which is the case I'm working with now), this notation gets extremely cumbersome.
Is there any way I can write code like I do in Python and make it general? I couldn't even think of a nice clean way with reshape, let alone a way that's as clean as Python.
Notice that in Python the shape is only preserved if you index along the first dimension of the array. In Julia you can use slicedim(A, d, i) to slice dimension d of array A at index i.
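For comparison, on the NumPy side np.take(arr, i, axis=d) generalizes this in much the same way slicedim does (shown here only as an illustration):

>>> import numpy as np
>>> arr = np.ones((3, 4, 5))
>>> arr[2].shape                    # shape preserved only for the first dimension
(4, 5)
>>> np.take(arr, 2, axis=1).shape   # any dimension, analogous to slicedim
(3, 5)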
Say I have an array x = np.arange(6).reshape(3, 2).
What is the meaning of x[False], or x[np.asanyarray(False)]? Both result in array([], shape=(0, 3, 2), dtype=int64), which is unexpected.
I expected to get an IndexError because of an improperly sized mask, as from something like x[np.ones((2, 2), dtype=np.bool)].
This behavior is consistent for x[True] and x[np.asanyarray(True)], as both result in an additional dimension: array([[[0, 1], [2, 3], [4, 5]]]).
I am using numpy 1.13.1. It appears that the behavior has changed recently, so while it is nice to have answers for older versions, please mention your version in the answers.
EDIT
Just for completeness, I filed https://github.com/numpy/numpy/issues/9515 based on the commentary on this question.
EDIT 2
And it was closed almost immediately.
There's technically no requirement that the dimensionality of a mask match the dimensionality of the array you index with it. (In previous versions, there were even fewer restrictions, and you could get away with some extreme shape mismatches.)
The docs describe boolean indexing as
A single boolean index array is practically identical to x[obj.nonzero()] where, as described above, obj.nonzero() returns a tuple (of length obj.ndim) of integer index arrays showing the True elements of obj.
but nonzero is weird for 0-dimensional input, so this case is one of the ways that "practically identical" turns out to be not identical:
the nonzero equivalence for Boolean arrays does not hold for zero dimensional boolean arrays.
NumPy has a special case for a 0-dimensional boolean index, motivated by the desire to have the following behavior:
In [3]: numpy.array(3)[True]
Out[3]: array([3])
In [4]: numpy.array(3)[False]
Out[4]: array([], dtype=int64)
I'll refer to a comment in the source code that handles a 0-dimensional boolean index:
if (PyArray_NDIM(arr) == 0) {
    /*
     * This can actually be well defined. A new axis is added,
     * but at the same time no axis is "used". So if we have True,
     * we add a new axis (a bit like with np.newaxis). If it is
     * False, we add a new axis, but this axis has 0 entries.
     */
While this comment primarily addresses a 0-dimensional index applied to a 0-dimensional array, the same logic applies to indexing multidimensional arrays with booleans. Thus,
x[True]
is equivalent to x[np.newaxis], producing a result with a new length-1 axis in front, and
x[False]
produces a result with a new axis in front of length 0, selecting no elements.
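With the x from the question, the shapes confirm both cases:

>>> x = np.arange(6).reshape(3, 2)
>>> x[True].shape       # same as x[np.newaxis].shape
(1, 3, 2)
>>> x[False].shape      # new leading axis with 0 entries
(0, 3, 2)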
This seems like a pretty basic question, but I didn't find anything related to it on stack. Apologies if I missed an existing question.
I've seen some mathematical/linear algebraic reasons why one might want to use numpy vectors "proper" (i.e. ndim 1), as opposed to row/column vectors (i.e. ndim 2).
But now I'm wondering: are there any (significant) efficiency reasons why one might pick one over the other? Or is the choice pretty much arbitrary in that respect?
(edit) To clarify: By "ndim 1 vs ndim 2 vectors" I mean representing a vector that contains, say, numbers 3 and 4 as either:
np.array([3, 4]) # ndim 1
np.array([[3, 4]]) # ndim 2
The numpy documentation seems to lean towards the first case as the default, but like I said, I'm wondering if there's any performance difference.
If you use numpy properly, then no - it is not a consideration.
If you look at the numpy internals documentation, you can see that
Numpy arrays consist of two major components, the raw array data (from now on, referred to as the data buffer), and the information about the raw array data. The data buffer is typically what people think of as arrays in C or Fortran, a contiguous (and fixed) block of memory containing fixed sized data items. Numpy also contains a significant set of data that describes how to interpret the data in the data buffer.
So, irrespective of the dimensions of the array, all data is stored in a contiguous buffer. Now consider
a = np.array([1, 2, 3, 4])
and
b = np.array([[1, 2], [3, 4]])
It is true that accessing a[1] requires (slightly) fewer operations than b[1, 1] (as translating 1, 1 to the flat index requires some calculation), but, for high performance, vectorized operations are required anyway.
If you want to sum all elements in the arrays, then in both cases you would use the same thing, a.sum() and b.sum(), and the sum would run over elements in contiguous memory either way. Conversely, if the data is inherently 2d, you can do things like b.sum(axis=1) to sum over rows. Doing this yourself on a 1d array would be error prone, and no more efficient.
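A minimal illustration of that point:

>>> a = np.array([1, 2, 3, 4])
>>> b = np.array([[1, 2], [3, 4]])
>>> a.sum(), b.sum()    # same contiguous buffer, same work
(10, 10)
>>> b.sum(axis=1)       # row sums come for free with the 2d shape
array([3, 7])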
So, basically, a 2d array, if it is natural for the problem, just gives greater functionality, with zero or negligible overhead.