One of the great features of NumPy arrays is that you can perform multidimensional slicing. I am wondering exactly how it is implemented. Let me lay out what I am thinking so far; then hopefully someone can fill in the gaps, answer some questions I have, and (probably) tell me why I'm wrong.
import numpy as np
arr = np.array([ [1, 2, 3], [4, 5, 6] ])
# retrieve the rightmost column of values for all rows
print(arr[:, 2])  # [3 6]
# indexing a normal multidimensional list
not_an_arr = [ [1, 2, 3], [4, 5, 6] ]
print(not_an_arr[:, 2]) # TypeError: indices must be integers or slices, not tuple
At first, [:, 2] seemed like a violation of Python syntax to me. If I tried to index a normal multidimensional list in Python, I would get an error. Of course, upon actually reading the error message, I realized that the issue isn't with the syntax, as I originally thought, but with the type of object passed in. So the conclusion I've come to is that [:, 2] implicitly creates a tuple, so that what's really happening in [:, 2] is [(:, 2)]. Is that what's happening?
I next tried to read the source code for the numpy.ndarray class, which is linked from the ndarray documentation, but it's all in C, which I'm not proficient in, so I can't make much sense of it.
I then noticed that there was documentation for ndarray.__getitem__. I was hoping this would lead me to the implementation of __getitem__ for the class, since my understanding is that implementing __getitem__ is where the behavior for indexing an object should be defined. My hope was that I would be able to see that they unpack the tuple and then use the slice objects or integers included in it to do the indexing on the underlying data structure however that may need to be done.
So... what really goes on behind the scenes to make multidimensional slicing work on numpy arrays?
TLDR: How is multidimensional array slicing implemented for numpy arrays?
We can verify your first-level inferences with a simple class:
In [137]: class Foo():
     ...:     def __getitem__(self, arg):
     ...:         print(arg)
     ...:         return None
     ...:
In [138]: f=Foo()
In [139]: f[1]
1
In [140]: f[::3]
slice(None, None, 3)
In [141]: f[,]
File "<ipython-input-141-d115e3c638fb>", line 1
f[,]
^
SyntaxError: invalid syntax
In [142]: f[:,]
(slice(None, None, None),)
In [143]: f[:,:3,[1,2,3]]
(slice(None, None, None), slice(None, 3, None), [1, 2, 3])
numpy uses code like this in numpy/lib/index_tricks.py to implement "functions" like np.r_ and np.s_. They are actually class instances that make use of this indexing syntax.
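For instance, np.s_ has a __getitem__ that just returns whatever index object it receives, which makes it a handy way to inspect the tuple the indexing syntax builds (a quick demonstration):
import numpy as np
print(np.s_[:, 2])         # (slice(None, None, None), 2)
print(np.s_[1:10:2, ...])  # (slice(1, 10, 2), Ellipsis)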
It's worth noting that it's the comma, more so than the (), that creates a tuple:
In [145]: 1,
Out[145]: (1,)
In [146]: 1,2
Out[146]: (1, 2)
In [147]: ()   # the exception: an empty tuple needs no comma
Out[147]: ()
That explains the syntax. But the implementation details are left up to the object class. list (and other sequences like string) can work with integers and slice objects, but give an error when given a tuple.
numpy is happy with the tuple. In fact, passing a tuple via __getitem__ was added to base Python years ago because numpy needed it. No built-in classes use it (that I know of), but user classes can accept a tuple, as my example shows.
As for the numpy details, that requires some knowledge of numpy array storage, including the role of the shape, the strides, and the data buffer. I'm not sure I want to get into those now.
A few days ago I explored one example of multidimensional indexing, and discovered some nuances that I wasn't aware of (or had ever seen documented):
view of numpy with 2D slicing
For most of us, understanding the how-to of indexing is more important than knowing the implementation details. I suspect there are textbooks, papers and even Wiki pages that describe 'strided' multidimensional indexing. numpy isn't the only place that uses it.
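To give a taste of the mechanism without the C details, here is a minimal sketch of the strided-view idea (the byte counts assume the default 8-byte integer dtype):
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)    # (2, 3)
print(arr.strides)  # (24, 8): step 24 bytes to the next row, 8 to the next column

# A basic slice like arr[:, 2] returns a view: the same data buffer with
# adjusted shape, strides and offset, and no copying.
col = arr[:, 2]
print(col.base is arr)  # True
print(col.strides)      # (24,): one row-stride between the two elements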
https://numpy.org/doc/stable/reference/arrays.indexing.html
This looks like a nice intro to numpy arrays
https://ajcr.net/stride-guide-part-1/
I am trying to randomly select a set of integers in numpy and am encountering a strange error. If I define a numpy array with two sets of different sizes, np.random.choice chooses between them without issue:
Set1 = np.array([[1, 3, 5], [2, 4]])
In: np.random.choice(Set1)
Out: [2, 4]
However, once the sets in the numpy array are the same size, I get a ValueError:
Set2 = np.array([[1, 3, 5], [2, 4, 6]])
In: np.random.choice(Set2)
ValueError: a must be 1-dimensional
Could be user error, but I've checked several times and the only difference is the size of the sets. I realize I can do something like:
Chosen = np.random.choice(N, k)
Selection = Set[Chosen]
where N is the number of sets and k is the number of samples. I'm just wondering if there is a better way, and specifically what I am doing wrong to raise a ValueError when the sets are the same size.
Printout of Set1 and Set2 for reference:
In: Set1
Out: array([list([1, 3, 5]), list([2, 4])], dtype=object)
In: type(Set1)
Out: numpy.ndarray
In: Set2
Out:
array([[1, 3, 5],
[2, 4, 6]])
In: type(Set2)
Out: numpy.ndarray
Your issue is caused by a misunderstanding of how numpy arrays work. The first example cannot "really" be turned into an array, because numpy does not support ragged arrays. You end up with a 1D array of object references pointing to two Python lists. The second example is a proper 2xN numerical array. I can think of two types of solutions here.
The obvious approach (which would work in both cases, by the way), would be to choose the index instead of the sublist. Since you are sampling with replacement, you can just generate the index and use it directly:
Set[np.random.randint(N, size=k)]
This is the same as
Set[np.random.choice(N, k)]
If you want to choose without replacement, your best bet is to use np.random.choice with replace=False. This is similar to, but less efficient than, shuffling. In either case, you can write a one-liner for the index:
Set[np.random.choice(N, k, replace=False)]
Or:
index = np.arange(Set.shape[0])
np.random.shuffle(index)
Set[index[:k]]
The nice thing about np.random.shuffle, though, is that you can apply it to Set directly, whether it is a one- or many-dimensional array. Shuffling will always happen along the first axis, so you can just take the top k elements afterwards:
np.random.shuffle(Set)
Set[:k]
The shuffling operation works only in-place, so you have to write it out the long way. It's also less efficient for large arrays, since you have to create the entire range up front, no matter how small k is.
The other solution is to turn the second example into an array of list objects like the first one. I do not recommend this solution unless the only reason you are using numpy is the choice function. In fact, I wouldn't recommend it at all, since you can, and probably should, use Python's standard random module at this point. Disclaimers aside, you can coerce the datatype of the second array to be object. It will remove any benefits of using numpy, and can't be done directly: simply setting dtype=object will still create a 2D array, just one that stores references to Python int objects instead of primitives. You have to do something like this:
Set = np.zeros(N, dtype=object)
for i, sublist in enumerate([[1, 3, 5], [2, 4, 6]]):
    Set[i] = sublist  # assigning the nested list in one go would be broadcast as a 2D array
You will now get an object essentially equivalent to the one in the first example, and can therefore apply np.random.choice directly.
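A quick check that the 1d object array satisfies choice (using the same-size sublists from above):
np.random.choice(Set)  # returns one of the stored lists, e.g. [2, 4, 6]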
Note
I show the legacy np.random methods here because of personal inertia if nothing else. The correct way, as suggested in the documentation I link to, is to use the new Generator API. This is especially true for the choice method, which is much more efficient in the new implementation. The usage is not any more difficult:
Set[np.random.default_rng().choice(N, k, replace=False)]
There are additional advantages, like the fact that you can now choose directly, even from a multidimensional array:
np.random.default_rng().choice(Set2, k, replace=False)
The same goes for shuffle, which, like choice, now allows you to select the axis you want to rearrange:
np.random.default_rng().shuffle(Set)
Set[:k]
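For example (a small sketch assuming Set is 2D), axis=1 permutes whole columns instead of rows:
rng = np.random.default_rng()
rng.shuffle(Set, axis=1)  # axis=0, the default, permutes the rows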
I find this behaviour utter nonsense. It happens only with numpy arrays; typical Python sequences just throw an error.
Let's create two arrays:
randomNumMatrix = np.random.randint(0, 20, (3, 3, 3), dtype=int)
randRow = np.array([0, 1, 2], dtype=int)
If we pass an array as an index into another array, the original array is returned.
randomNumMatrix[randRow]
The code above returns an equivalent of randomNumMatrix. I find this unintuitive. I would expect it not to work, or at least to return an equivalent of
randomNumMatrix[randRow[0]][randRow[1]][randRow[2]].
Additional observations:
A)
The code below does not work; it throws IndexError: index 3 is out of bounds for axis 0 with size 3:
randRow = np.array([0, 1, 3], dtype=int)
B)
To my surprise, the code below works:
randRow = np.array([0, 1, 2, 2, 0, 1, 2], dtype=int)
Can somebody please explain what are the advantages of this feature?
In my opinion it only creates much confusion.
What is
randomNumMatrix[randRow[0]][randRow[1]][randRow[2]]
supposed to mean? That chain of scalar indices is valid Python, but it selects a single element, which is not what indexing with an array does.
In numpy there is a difference between
arr[(x,y,z)] # equivalent to arr[x,y,z]
and
arr[np.array([x,y,z])] # equivalent to arr[np.array([x,y,z]),:,:]
The tuple provides a scalar index for each dimension. The array (or list) provides multiple indices for one dimension.
You may need to study the numpy docs on indexing, especially advanced indexing.
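A small illustration of the difference:
import numpy as np

arr = np.arange(27).reshape(3, 3, 3)

# A tuple supplies one scalar index per dimension: a single element.
print(arr[(0, 1, 2)])  # 5, same as arr[0, 1, 2]

# An array supplies several indices for the first dimension.
print(arr[np.array([0, 1, 2])].shape)     # (3, 3, 3): subarrays 0, 1 and 2
# Repeats are allowed, which is why [0, 1, 2, 2, 0, 1, 2] also works.
print(arr[np.array([0, 1, 2, 2])].shape)  # (4, 3, 3)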
Related to this question, I came across an indexing behaviour with Boolean arrays and broadcasting that I do not understand. We know it's possible to index a NumPy array in 2 dimensions using integer indices and broadcasting. This is specified in the docs:
a = np.array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])
c1 = np.where(b1)[0] # i.e. [1, 2]
c2 = np.where(b2)[0] # i.e. [0, 2]
a[c1[:, np.newaxis], c2] # or a[c1[:, None], c2]
array([[ 4, 6],
[ 8, 10]])
However, the same does not work for Boolean arrays.
a[b1[:, None], b2]
IndexError: too many indices for array
The alternative numpy.ix_ works for both integer and Boolean arrays. This seems to be because ix_ performs specific manipulation for Boolean arrays to ensure consistent treatment.
assert np.array_equal(a[np.ix_(b1, b2)], a[np.ix_(c1, c2)])  # passes
a[np.ix_(b1, b2)]
array([[ 4,  6],
       [ 8, 10]])
So my question is: why does broadcasting work with integers, but not with Boolean arrays? Is this behaviour documented? Or am I misunderstanding a more fundamental issue?
As @Divakar noted in the comments, Boolean advanced indices behave as if they were first fed through np.nonzero and then broadcast together; see the relevant documentation for extensive explanations. To quote the docs,
In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above. x[ind_1, boolean_array, ind_2] is equivalent to x[(ind_1,) + boolean_array.nonzero() + (ind_2,)].
[...]
Combining multiple Boolean indexing arrays or a Boolean with an integer indexing array can best be understood with the obj.nonzero() analogy. The function ix_ also supports boolean arrays and will work without any surprises.
In your case broadcasting would not necessarily be a problem, since both arrays have only two nonzero elements. The problem is the number of dimensions in the result:
>>> len(b1[:,None].nonzero())
2
>>> len(b2.nonzero())
1
Consequently the indexing expression a[b1[:,None], b2] would be equivalent to a[b1[:,None].nonzero() + b2.nonzero()], which passes a length-3 tuple of index arrays to a, corresponding to a 3d index for a 2d array. Hence the error you see about "too many indices".
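Spelling the equivalence out with your arrays makes the dimension count concrete:
idx = b1[:, None].nonzero() + b2.nonzero()
len(idx)  # 3 index arrays, i.e. a 3d index for the 2d array a
a[idx]    # IndexError: too many indices for array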
The surprises mentioned in the docs are very close to your example: what if you hadn't injected that singleton dimension? Starting from a length-3 and a length-4 Boolean array, you would've ended up with a length-2 advanced index, i.e. a 1d array of shape (2,). This is never what you'd want, which leads us to another piece of trivia on the subject.
There's been a lot of discussion about plans to revamp advanced indexing; see the work-in-progress draft NEP 21. The gist of the issue is that fancy indexing in numpy, while clearly documented, has some very quirky features which aren't practically useful for anything, but which can bite you if you make a mistake, by producing surprising results rather than errors.
A relevant quote from the NEP:
Mixed cases involving multiple array indices are also surprising, and
only less problematic because the current behavior is so useless that
it is rarely encountered in practice. When a boolean array index is
mixed with another boolean or integer array, boolean array is
converted to integer array indices (equivalent to np.nonzero()) and
then broadcast. For example, indexing a 2D array of size (2, 2) like
x[[True, False], [True, False]] produces a 1D vector with shape (1,),
not a 2D sub-matrix with shape (1, 1).
Now, I emphasize that the NEP is very much work-in-progress, but one of the suggestions in the current state of the NEP is to forbid Boolean arrays in advanced indexing cases such as the above, and only allow them in "outer indexing" scenarios, i.e. exactly what np.ix_ would help you do with your Boolean array:
Boolean indexing is conceptually outer indexing. Broadcasting together with other advanced indices in the manner of legacy indexing [i.e. the current behaviour] is generally not helpful or well defined. A user who wishes the "nonzero" plus broadcast behaviour can thus be expected to do this manually.
My point is that the behaviour of Boolean advanced indices and their deprecation status (or lack thereof) may change in the not-so-distant future.
I'm hoping someone can explain to me the following behavior I observe with a numpy array:
>>> import numpy as np
>>> data_block=np.zeros((26,480,1000))
>>> indices=np.arange(1000)
>>> indices.shape
(1000,)
>>> data_block[0,:,:].shape
(480, 1000) #fine and dandy
>>> data_block[0,:,indices].shape
(1000, 480) #what happened???? why the transpose????
>>> ind_slice=np.arange(300) # this is more what I really want.
>>> data_block[0,:,ind_slice].shape
(300, 480) # transpose again! arghhh!
I don't understand this transposing behavior and it is very inconvenient for what I want to do. Could anyone explain it to me? An alternative method for getting that subset of data_block would be a great bonus.
You can achieve your desired result this way:
>>> data_block[0,:,:][:,ind_slice].shape
(480L, 300L)
I confess I don't have a complete understanding of how complicated numpy indexing works, but the documentation seems to hint at the trouble you're having:
Basic slicing with more than one non-: entry in the slicing tuple, acts like repeated application of slicing using a single non-: entry, where the non-: entries are successively taken (with all other non-: entries replaced by :). Thus, x[ind1,...,ind2,:] acts like x[ind1][...,ind2,:] under basic slicing.
Warning: The above is not true for advanced slicing.
and. . .
Advanced indexing is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool).
Thus you are triggering that behavior by indexing with your ind_slice array instead of a regular slice.
The documentation itself says that this kind of indexing "can be somewhat mind-boggling to understand", so it's not surprising we both have trouble with this :-).
There really is not much to be surprised about once you understand how fancy indexing works. If you have lists or arrays as indices, they must all have the same shape, or be broadcastable to a common shape. That broadcast shape becomes the base shape of the returned array. If some of the indices are slices, each entry selected by the fancy indices is itself multidimensional, so the slice dimensions are appended to that base shape. While this may seem a weird choice, it really is the only one consistent with multidimensional fancy indexing. As an example, try to figure out what you would expect the return shape to be if you did the following:
>>> ind_slice=np.arange(16).reshape(4, 4)
>>> data_block[ind_slice, :, ind_slice].shape
(4, 4, 480) # No, (4, 4, 480, 4, 4) is not a better option
There are several ways to get what you are after. For the particular case in your question, the most obvious would be to not use fancy indexing, as you can get what you ask with slices:
>>> data_block[0, :, :300].shape
(480, 300)
If you do need fancy indexing, you can replace slices with broadcastable arrays:
>>> ind_slice = np.arange(300)  # back to the question's 1d index
>>> data_block[0, np.arange(480)[:, None], ind_slice].shape
(480, 300)
You may want to take a look at np.ogrid and np.mgrid if you need to replace more complicated slices with arrays.
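As a sketch, np.ogrid produces exactly the broadcastable index arrays used above:
rows, cols = np.ogrid[0:480, 0:300]
print(rows.shape, cols.shape)           # (480, 1) (1, 300)
print(data_block[0, rows, cols].shape)  # (480, 300)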
I'm new to python and wanted to do something I normally do in matlab/R all the time, but couldn't figure it out from the docs.
I'd like to slice an array not as 0:3, which includes elements 0, 1 and 2, but with an explicit vector of indices such as 0, 3.
For example, say I had this data structure
a = [1, 2, 3, 4, 5]
I'd like the second and third element
so I thought something like this would work
a[[1, 3]]
but that gives me this error
TypeError: list indices must be integers
This happens for most other data types as well, such as numpy arrays.
In MATLAB, you could even say a([2, 1]), which would return the second and then the first element.
There is an alternative implementation I am considering, but I think it would be slow for large arrays. At least it would be damn slow in matlab. I'm primarily using numpy arrays.
[ a[i] for i in [1,3] ]
What's the python way oh wise ones?
Thanks!!
NumPy allows you to use lists as indices:
import numpy
a = numpy.array([1, 2, 3, 4, 5])
a[[1, 3]]  # array([2, 4])
Note that this makes a copy instead of a view.
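The copy/view distinction matters as soon as you assign through the result; a quick demonstration:
import numpy
a = numpy.array([1, 2, 3, 4, 5])

b = a[[1, 3]]  # fancy indexing returns a copy
b[0] = 99      # a is unchanged

c = a[1:4]     # a basic slice returns a view
c[0] = 99      # a is now array([ 1, 99,  3,  4,  5])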
I believe you want numpy.take:
newA = numpy.take(a, [1, 3])  # array([2, 4])