Funky behavior with numpy arrays - python

I'm hoping someone can explain the following behavior I observe with a numpy array:
>>> import numpy as np
>>> data_block=np.zeros((26,480,1000))
>>> indices=np.arange(1000)
>>> indices.shape
(1000,)
>>> data_block[0,:,:].shape
(480, 1000) #fine and dandy
>>> data_block[0,:,indices].shape
(1000, 480) #what happened???? why the transpose????
>>> ind_slice=np.arange(300) # this is more what I really want.
>>> data_block[0,:,ind_slice].shape
(300, 480) # transpose again! arghhh!
I don't understand this transposing behavior and it is very inconvenient for what I want to do. Could anyone explain it to me? An alternative method for getting that subset of data_block would be a great bonus.

You can achieve your desired result this way:
>>> data_block[0,:,:][:,ind_slice].shape
(480L, 300L)
I confess I don't have a complete understanding of how complicated numpy indexing works, but the documentation seems to hint at the trouble you're having:
Basic slicing with more than one non-: entry in the slicing tuple, acts like repeated application of slicing using a single non-: entry, where the non-: entries are successively taken (with all other non-: entries replaced by :). Thus, x[ind1,...,ind2,:] acts like x[ind1][...,ind2,:] under basic slicing.
Warning: The above is not true for advanced slicing.
and...
Advanced indexing is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool).
Thus you are triggering that behavior by indexing with your ind_slice array instead of a regular slice.
The documentation itself says that this kind of indexing "can be somewhat mind-boggling to understand", so it's not surprising we both have trouble with this :-).
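To see that trigger in action, here is a minimal sketch contrasting a plain slice with an equivalent index array on the same data_block (the shapes are what numpy reports):
>>> data_block[0, :, 100:400].shape   # slice: basic indexing, shape preserved
(480, 300)
>>> data_block[0, :, np.arange(100, 400)].shape   # array: advanced indexing kicks in
(300, 480)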

There really is not much to be surprised about once you understand how fancy indexing works. If you have lists or arrays as indices, they must all be of the same shape, or be broadcastable to a common shape. That shape will be the base shape of the returned array. If some of the indices are slices, then every entry selected by the fancy indices is itself multidimensional, so the base shape gets extended with extra dimensions. While this may seem a weird choice, it really is the only one consistent with multidimensional fancy indexing. As an example, try to figure out what you would expect the return shape to be if you did the following:
>>> ind_slice=np.arange(16).reshape(4, 4)
>>> data_block[ind_slice, :, ind_slice].shape
(4, 4, 480) # No, (4, 4, 480, 4, 4) is not a better option
There are several ways to get what you are after. For the particular case in your question, the most obvious would be not to use fancy indexing at all, since you can get what you ask for with slices:
>>> data_block[0, :, :300].shape
(480, 300)
If you do need fancy indexing, you can replace slices with broadcastable arrays:
>>> data_block[0, np.arange(480)[:, None], ind_slice].shape
(480, 300)
You may want to take a look at np.ogrid and np.mgrid if you need to replace more complicated slices with arrays.
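For example, here is a small sketch of the same selection built with np.ogrid, which hands you ready-to-broadcast index arrays:
>>> I, J = np.ogrid[:480, :300]
>>> I.shape, J.shape   # (480, 1) and (1, 300) broadcast to (480, 300)
((480, 1), (1, 300))
>>> data_block[0, I, J].shape
(480, 300)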

Related

How is multidimensional array slicing/indexing implemented in numpy?

One of the great features of NumPy arrays is that you can perform multidimensional slicing. I am wondering exactly how it is implemented. Let me lay out what I am thinking so far, then hopefully someone can fill in the gaps, answer some questions I have, and (probably) tell me why I'm wrong.
import numpy as np
arr = np.array([ [1, 2, 3], [4, 5, 6] ])
# retrieve the rightmost column of values for all rows
print(arr[:, 2])
# indexing a normal multidimensional list
not_an_arr = [ [1, 2, 3], [4, 5, 6] ]
print(not_an_arr[:, 2]) # TypeError: indices must be integers or slices, not tuple
At first, [:, 2] seemed like a violation of Python syntax to me. If I tried to index a normal multidimensional list in Python I would get an error. Of course, upon actually reading the error message, I realize that the issue isn't with the syntax as I originally thought, but the type of object passed in. So the conclusion I've come to is that [:, 2] implicitly creates a tuple, so that what's really happening in [:, 2] is [(:, 2)]. Is that what's happening?
I next tried to read the source code for the numpy.ndarray class which is linked to by the ndarray documentation, but that's all in C, which I'm not proficient in, so I can't make much sense of this.
I then noticed that there was documentation for ndarray.__getitem__. I was hoping this would lead me to the implementation of __getitem__ for the class, since my understanding is that implementing __getitem__ is where the behavior for indexing an object should be defined. My hope was that I would be able to see that they unpack the tuple and then use the slice objects or integers included in it to do the indexing on the underlying data structure however that may need to be done.
So... what really goes on behind the scenes to make multidimensional slicing work on numpy arrays?
TLDR: How is multidimensional array slicing implemented for numpy arrays?
We can verify your first level inferences with a simple class:
In [137]: class Foo():
...: def __getitem__(self,arg):
...: print(arg)
...: return None
...:
In [138]: f=Foo()
In [139]: f[1]
1
In [140]: f[::3]
slice(None, None, 3)
In [141]: f[,]
File "<ipython-input-141-d115e3c638fb>", line 1
f[,]
^
SyntaxError: invalid syntax
In [142]: f[:,]
(slice(None, None, None),)
In [143]: f[:,:3,[1,2,3]]
(slice(None, None, None), slice(None, 3, None), [1, 2, 3])
numpy uses code like this in numpy/lib/index_tricks.py to implement "functions" like np.r_ and np.s_. They are actually class instances that are used with indexing syntax.
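For example, np.s_ is one of those instances; indexing it simply hands back the key that __getitem__ received:
In [144]: np.s_[1:5, ::2]
Out[144]: (slice(1, 5, None), slice(None, None, 2))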
It's worth noting that it's the comma, more so than the (), that creates a tuple:
In [145]: 1,
Out[145]: (1,)
In [146]: 1,2
Out[146]: (1, 2)
In [147]: ()  # the empty tuple is the exception - no comma needed
Out[147]: ()
That explains the syntax. But the implementation details are left up to the object class. list (and other sequences like string) can work with integers and slice objects, but give an error when given a tuple.
numpy is happy with the tuple. In fact, passing a tuple via __getitem__ was added to base Python years ago because numpy needed it. None of the builtin classes use it (that I know of), but user classes can accept a tuple, as my example shows.
As for the numpy details, that requires some knowledge of numpy array storage, including the role of the shape, strides, and the data buffer. I'm not sure I want to get into all of that now.
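Still, a rough sketch of the core idea may help: an index tuple maps to a byte offset into the flat data buffer via the strides. This is a simplified model that ignores negative strides, bounds checks and advanced indexing:
import numpy as np

arr = np.arange(12, dtype=np.int16).reshape(3, 4)
print(arr.strides)  # (8, 2): one row ahead = 4 elements * 2 bytes; one column = 2 bytes

def byte_offset(index, strides):
    # byte offset of arr[index] from the start of the data buffer
    return sum(i * s for i, s in zip(index, strides))

print(byte_offset((2, 1), arr.strides))  # 18, i.e. flat element 18 // 2 = 9
print(arr[2, 1])                         # 9: the same element, found via strides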
A few days ago I explored one example of multidimensional indexing, and discovered some nuances that I wasn't aware of (or had ever seen documented):
view of numpy with 2D slicing
For most of us, understanding the how-to of indexing is more important than knowing the implementation details. I suspect there are textbooks, papers and even Wiki pages that describe 'strided' multidimensional indexing. numpy isn't the only place that uses it.
https://numpy.org/doc/stable/reference/arrays.indexing.html
This looks like a nice intro to numpy arrays
https://ajcr.net/stride-guide-part-1/

Why does a numpy array return the original array when passing an array as index?

I find this behaviour utter nonsense. It happens only with numpy arrays; ordinary Python lists just throw an error.
Let's create two arrays:
import numpy as np

randomNumMatrix = np.random.randint(0, 20, (3, 3, 3), dtype=int)
randRow = np.array([0, 1, 2], dtype=int)
If we pass an array as an index to get something from another array, the original array is returned.
randomNumMatrix[randRow]
The code above returns the equivalent of randomNumMatrix. I find this unintuitive. I would expect it not to work, or at least to return the equivalent of
randomNumMatrix[randRow[0]][randRow[1]][randRow[2]].
Additional observations:
A)
The code below does not work; it throws this error: IndexError: index 3 is out of bounds for axis 0 with size 3
randRow = np.array([0, 1, 3], dtype=int)
B)
To my surprise, the code below works:
randRow = np.array([0, 1, 2, 2, 0, 1, 2], dtype=int)
Can somebody please explain what the advantages of this feature are?
In my opinion it only creates confusion.
What is?
randomNumMatrix[randRow[0]][randRow[1]][randRow[2]]
That's not valid Python.
In numpy there is a difference between
arr[(x,y,z)] # equivalent to arr[x,y,z]
and
arr[np.array([x,y,z])] # equivalent to arr[np.array([x,y,z]),:,:]
The tuple provides a scalar index for each dimension. The array (or list) provides multiple indices for one dimension.
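A quick sketch of the difference:
>>> arr = np.arange(27).reshape(3, 3, 3)
>>> arr[(0, 1, 2)]                   # tuple: one scalar index per dimension
5
>>> arr[np.array([0, 1, 2])].shape   # array: three indices along the first dimension
(3, 3, 3)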
You may need to study the numpy docs on indexing, especially advanced indexing.

Why does numpy mixed basic / advanced indexing depend on slice adjacency?

I know similar questions have been asked before, but AFAIK nobody has answered my specific question...
My question is about the numpy mixed advanced / basic indexing described here:
... Two cases of index combination need to be distinguished:
The advanced indexes are separated by a slice, ellipsis or newaxis. For example x[arr1,:,arr2].
The advanced indexes are all next to each other. For example x[...,arr1,arr2,:] but not x[arr1,:,1] since 1 is an advanced index in this regard.
In the first case, the dimensions resulting from the advanced indexing operation come first in the result array, and the subspace dimensions after that. In the second case, the dimensions from the advanced indexing operations are inserted into the result array at the same spot as they were in the initial array (the latter logic is what makes simple advanced indexing behave just like slicing).
Why is this distinction necessary?
I was expecting the behaviour described for case 2 to be used in all cases. Why does it matter whether indexes are next to each other?
I understand you may want the behaviour of case 1 in some situations; for example, "vectorization" of index results along new dimensions. But this behaviour can and should be defined by the user. That is, if case 2 behaviour were the default, case 1 behaviour would be possible using only:
x[arr1,:,arr2].reshape((len(arr1),x.shape[1]))
I know you can achieve the behaviour described in case 2 using np.ix_(), but this inconsistency in default indexing behaviour is unexpected and unjustified, in my opinion. Can someone justify it?
Thanks,
The behavior for case 2 isn't well-defined for case 1. There's a subtlety you're probably missing in the following sentence:
In the second case, the dimensions from the advanced indexing operations are inserted into the result array at the same spot as they were in the initial array
You're probably imagining a one-to-one correspondence between input and output dimensions, perhaps because you're thinking of Matlab-style indexing. NumPy doesn't work like that. If you have four arrays with the following shapes:
a.shape == (2, 3, 4, 5, 6)
b.shape == (20, 30)
c.shape == (20, 30)
d.shape == (20, 30)
then a[b, :, c, :, d] has four dimensions, with lengths 3, 5, 20, and 30. There is no unambiguous place to put the 20 and the 30. NumPy defaults to sticking them in front.
On the other hand, with a[:, b, c, d, :], the 20 and 30 can go where the 3, 4, and 5 were, because the 3, 4, and 5 were next to each other. The whole block of new dimensions goes where the whole block of original dimensions was, which only works if the original dimensions were in a single block in the original shape.
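You can check both cases with a small sketch (zeros standing in for real data):
>>> a = np.zeros((2, 3, 4, 5, 6))
>>> b = c = d = np.zeros((20, 30), dtype=int)
>>> a[b, :, c, :, d].shape   # case 1: the broadcast dimensions go to the front
(20, 30, 3, 5)
>>> a[:, b, c, d, :].shape   # case 2: they replace axes 1-3 in place
(2, 20, 30, 6)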

Numpy view contiguous part of non-contiguous array as dtype of bigger size

I was trying to generate an array of trigrams (i.e. continuous three-letter combinations) from a super long char array:
# data is actually loaded from a source file
a = np.random.randint(0, 256, 2**28, 'B').view('c')
Since making a copy is not efficient (and it creates problems like cache misses), I directly generated the trigrams using stride tricks:
tri = np.lib.stride_tricks.as_strided(a, (len(a) - 2, 3), a.strides * 2)
This generates a trigram array with shape (2**28 - 2, 3), where each row is a trigram. Now I want to convert the trigrams to an array of strings (i.e. S3) so that numpy displays them more "reasonably" (instead of as individual chars).
tri = tri.view('S3')
It gives the exception:
ValueError: To change to a dtype of a different size, the array must be C-contiguous
I understand that generally data should be contiguous in order to create a meaningful view, but this data is contiguous where it should be: each group of three elements is contiguous.
So I'm wondering: how do I view a contiguous part of a non-contiguous np.ndarray as a dtype of bigger size? A more "standard" way would be better, though hackish ways are also welcome. It seems that I can set the shape and strides freely with np.lib.stride_tricks.as_strided, but I can't force the dtype, which is the problem here.
EDIT
A non-contiguous array can be made by simple slicing. For example:
np.empty((8, 4), 'uint32')[:, :2].view('uint64')
will throw the same exception as above (even though, from a memory point of view, I should be able to do this). This case is much more common than my example above.
If you have access to a contiguous array from which your non-contiguous one is derived, it should typically be possible to work around this limitation.
For example your trigrams can be obtained like so:
>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b')', b'\xf2', b'\xf7', ..., b'\xf4', b'\xf1', b'z'], dtype='|S1')
>>> np.lib.stride_tricks.as_strided(a[:0].view('S3'), ((2**28)-2,), (1,))
array([b')\xf2\xf7', b'\xf2\xf7\x14', b'\xf7\x14\x1b', ...,
b'\xc9\x14\xf4', b'\x14\xf4\xf1', b'\xf4\xf1z'], dtype='|S3')
In fact, this example demonstrates that all we need is a contiguous "stub" at the base of the memory buffer for view casting, since afterwards, because as_strided does not do many checks, we are essentially free to do whatever we like.
It seems we can always get such a stub by slicing to a size 0 array. For your second example:
>>> X = np.empty((8, 4), 'uint32')[:, :2]
>>> np.lib.stride_tricks.as_strided(X[:0].view(np.uint64), (8, 1), X.strides)
array([[140133325248280],
[ 32],
[ 32083728],
[ 31978800],
[ 0],
[ 29686448],
[ 32],
[ 32362720]], dtype=uint64)
As of numpy 1.23.0, you will be able to do exactly what you want without jumping through extra hoops. I've added PR#20722 to numpy to address pretty much this exact issue. The idea is that if your new dtype is smaller than the current one, you can clearly expand a unit or contiguous axis without any problems. If the new dtype is larger, you can shrink a contiguous axis.
With the update, your code runs out of the box:
>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b'\x19', b'\xf9', b'\r', ..., b'\xc3', b'\xa3', b'{'], dtype='|S1')
>>> tri = np.lib.stride_tricks.as_strided(a, (len(a)-2,3), a.strides*2)
>>> tri.view('S3')
array([[b'\x9dB\xeb'],
[b'B\xebU'],
[b'\xebU\xa4'],
...,
[b'-\xcbM'],
[b'\xcbM\x97'],
[b'M\x97o']], dtype='|S3')
The array has to have a unit dimension or be contiguous in the last axis, which is true in your case.
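Since the view keeps a unit axis, you can index it away if you prefer a flat array of trigrams:
>>> tri.view('S3')[:, 0].shape
(268435454,)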
I've also added PR#20694 to introduce slicing to the np.char module. If that PR gets accepted as-is, you will be able to do:
>>> np.char.slice_(a.view(f'U{len(a)}'), step=1, chunksize=3)

Numpy-like slicing in Julia

In Python/Numpy I can slice arrays in this form:
arr = np.ones((3,4,5))
arr[2]
and the shape will be maintained:
(arr[2]).shape # prints (4, 5)
Which means that, if I want to keep the shape of the array, the following code works for N-dimensional arrays
arr = np.ones((3,4,5,2,2))
(arr[2]).shape # prints (4, 5, 2, 2)
This is great if I want to write functions that work for N-dim arrays preserving their output.
In Julia, however, the same action does not preserve the structure:
arr = ones(3,4,5)
size(arr[3]) # prints () (0-dimensional)
size(arr[3,:]) # prints (20,)
because of partial linear indexing. So if I want to keep the original dimensions I need to write arr[3,:,:], which only works for 3D arrays. If I want a 4D array I would have to use arr[3,:,:,:] and so on. The code isn't general.
Furthermore, when you get to arrays with 5 dimensions or more (which is the case I'm working with now), this notation gets extremely cumbersome.
Is there any way I can write code like I do in Python and make it general? I couldn't even think of a nice clean way with reshape, let alone a way that's as clean as Python.
Notice that in Python the shape is only preserved if you slice the first dimension of the array. In Julia you can use slicedim(A, d, i) to slice dimension d of array A at index i.
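For comparison, the dimension-generic counterpart on the numpy side is np.take (note that, unlike basic indexing, it returns a copy rather than a view):
>>> arr = np.ones((3, 4, 5, 2, 2))
>>> np.take(arr, 2, axis=0).shape   # same as arr[2]
(4, 5, 2, 2)
>>> np.take(arr, 2, axis=2).shape   # select index 2 along dimension 2
(3, 4, 2, 2)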
