Understanding NumPy's copy behaviour when using advanced indexing - python

I am currently struggling writing tidy code in NumPy using advanced indexing.
arr = np.arange(100).reshape(10,10) # array I want to manipulate
sl1 = arr[:,-1] # basic indexing
# Do stuff with sl1...
sl1[:] = -1
# arr has been changed as well
sl2 = arr[arr >= 50] # advanced indexing
# Do stuff with sl2...
sl2[:] = -2
# arr has not been changed,
# changes must be written back into it
arr[arr >= 50] = sl2 # What I'd like to avoid
I'd like to avoid this "write back" operation because it feels superfluous and I often forget it. Is there a more elegant way to accomplish the same thing?

Both boolean and integer array indexing fall under the category of advanced indexing. In the second example (boolean indexing), the original array is not updated because advanced indexing always returns a copy of the data (see the second paragraph of the advanced indexing section of the docs). This means that arr[arr >= 50] is already a copy of the selected elements of arr, and whatever changes you apply to it won't affect arr.
The reason why it does not return a view is that advanced indexing cannot be expressed as a slice, and hence cannot be addressed with offsets, strides, and counts, which is required to be able to take a view of the array's elements.
We can easily verify that we are viewing different objects in the case of advanced indexing with:
np.shares_memory(arr, arr[arr>50])
# False
np.shares_memory(arr, arr[:,-1])
# True
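To make the offsets-and-strides argument concrete, here is a minimal sketch (the int64 dtype is chosen here only so that the byte counts are predictable). It shows that the basic slice is described by nothing more than a start offset and a stride into the same buffer, whereas the boolean selection has no such regular layout and ends up in a fresh buffer:

```python
import numpy as np

arr = np.arange(100, dtype=np.int64).reshape(10, 10)
col = arr[:, -1]                       # basic indexing: last column

# The parent steps 80 bytes per row and 8 bytes per column.
print(arr.strides)                     # (80, 8)
# The view only needs one stride: jump a full row each step.
print(col.strides)                     # (80,)

# The view starts 9 elements (72 bytes) into the parent's buffer.
start_arr = arr.__array_interface__['data'][0]
start_col = col.__array_interface__['data'][0]
offset = start_col - start_arr
print(offset)                          # 72

# A boolean selection is an arbitrary scatter of elements,
# so NumPy has to gather them into new memory.
picked = arr[arr >= 50]
print(np.shares_memory(arr, picked))   # False
```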
Views are only returned when performing basic slicing operations, so you'll have to assign back as you're doing in the last example. Regarding the question in the comments about assigning within the same expression:
arr[arr >= 50] = -2
This is translated by the python interpreter as:
arr.__setitem__(arr >= 50, -2)
The key point is that __setitem__ writes directly into arr at the selected positions, so no copy of the selected elements is created; there is simply no need for one.
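One way to make the write-back harder to forget is to keep the mask in a variable and always assign through it. This is a small sketch, not the only possible pattern; the name mask is just an illustration:

```python
import numpy as np

arr = np.arange(100).reshape(10, 10)
mask = arr >= 50              # compute the boolean mask once and reuse it

sl2 = arr[mask]               # advanced indexing: an independent copy
sl2[:] = -2                   # modifies only the copy
print((arr >= 50).sum())      # 50, arr is untouched

arr[mask] = -2                # __setitem__: writes straight into arr
arr[~mask] *= 10              # augmented assignment reads, then writes back
print(arr[0, :3])             # [ 0 10 20]
print(arr[-1, -1])            # -2
```

The augmented form works because Python expands it into a __getitem__ followed by a __setitem__ on arr, so the write-back happens implicitly.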

Related

Assigning values to subset of an array defined by multiple conditions

I want to assign values to part of an array which is specified by multiple conditions.
For example:
import numpy as np
temp = np.arange(120).reshape((2,3,4,5))
mask = temp > 22
submask = temp[mask] < 43
(temp[mask])[submask] = 0 # assign zeros to the part of the array specified via mask and submask
print(temp) # notice that temp is unchanged - I want it to be changed
This example is artificial. Generally I want to do something more complex which involves a combination of indexing and boolean masks. Using a list index fails in similar circumstances. For example: temp[:,:,0,[1,3,2]]=0 is a valid assignment, but temp[:,:,0,[1,3,2]][mask]=0 will fail.
My understanding is that the assignment is failing because the complex indexing is prompting numpy to make a copy of the array object and assigning to that, rather than to the original array. So it isn't that the assignment is failing per se, just that the assignment is directed towards the "wrong" object.
I have tried using functions such as np.copyto and np.putmask but these also fail, presumably because the backend implementation mimics the original problem. For example: np.putmask(temp, mask, 0) will work but np.putmask(temp[mask], submask, 0) will not.
Is there a good way to do what I want to do?
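One possible approach, sketched here under the assumption that every step can be expressed as a boolean mask, is to fold the sub-mask back into a full-shape mask and index the original array exactly once. The name combined is hypothetical:

```python
import numpy as np

temp = np.arange(120).reshape((2, 3, 4, 5))
mask = temp > 22
submask = temp[mask] < 43          # 1-D, one entry per True in mask

# Scatter the sub-mask into the True positions of a full-shape mask.
combined = mask.copy()
combined[mask] = submask

# A single indexing step on the left-hand side goes through
# __setitem__ and therefore modifies temp itself.
temp[combined] = 0
print(np.count_nonzero(temp == 0))  # 21 (the original 0 plus 23..42)
```

For this artificial case, temp[(temp > 22) & (temp < 43)] = 0 would do the same thing; the scatter step is only needed when the sub-mask is computed from the already selected values.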

Which numpy index is copy and which is view?

Question
Regarding Numpy Indexing which can either return copy or view, please confirm if my understandings are correct. If not, please provide explanations and provide pointers to the related specifications.
Q: Basic slicing
The Numpy Indexing documentation has Basic Slicing and Indexing section. I believe this section talks only about basic slicing as mentioned one of three ways in Indexing.
Indexing
There are three kinds of indexing available: field access, basic slicing, advanced indexing. Which one occurs depends on obj.
It uses only Python slice objects and always returns a view. Even when a slice is inside a tuple, which may fall under the section below, it still returns a view.
Dealing with variable numbers of indices within programs
If one supplies to the index a tuple, the tuple will be interpreted as a list of indices.
This code will return a view.
>>> indices = (1,1,1,slice(0,2)) # same as [1,1,1,0:2]
>>> z[indices]
array([39, 40])
Is this understanding correct?
Q: Combining basic slicing and advanced indexing
There is a section in the document:
Combining advanced and basic indexing
When there is at least one slice (:), ellipsis (...) or newaxis in the index (or the array has more dimensions than there are advanced indexes), then the behavior can be more complicated.
I believe, basic indexing in the section title means basic slicing. There is no such basic indexing that is different from basic slicing because Indexing says there are only three ways, field access, basic slicing, advanced indexing.
When advanced indexing and basic indexing are combined, the result is a copy.
Are these correct understandings?
Background
Numpy indexing has many options; some return a view, others return a copy. It is hard to form a clear mental classification of which operations return a reference/view and which return a copy.
Indexing
There are three kinds of indexing available: field access, basic slicing, advanced indexing. Which one occurs depends on obj.
Basic Slicing and Indexing
Basic slicing occurs when obj is a slice object (constructed by start:stop:step notation inside of brackets), an integer, or a tuple of slice objects and integers. Ellipsis and newaxis objects can be interspersed with these as well.
NumPy slicing creates a view instead of a copy as in the case of builtin Python sequences such as string, tuple and list.
Advanced Indexing
Advanced indexing is triggered when the selection object, obj, is a
non-tuple sequence object, an ndarray (of data type integer or bool),
or a tuple with at least one sequence object or ndarray (of data type
integer or bool). There are two types of advanced indexing: integer
and Boolean.
Advanced indexing always returns a copy of the data (contrast with
basic slicing that returns a view).
Dealing with variable numbers of indices within programs
If one supplies to the index a tuple, the tuple will be interpreted as a list of indices.
>>> z = np.arange(81).reshape(3,3,3,3)
>>> indices = (1,1,1,1)
>>> z[indices]
40
Field Access
If the ndarray object is a structured array the fields of the array can be accessed by indexing the array with strings, dictionary-like. Indexing x['field-name'] returns a new view to the array
Note
Which numpy operations copy and which mutate?
It's true that to get a good grasp of what returns a view and what returns a copy, you need to read the documentation thoroughly (and sometimes it doesn't mention it at all). I can't give you a complete list of operations and their output types (view or copy), but maybe this will help you on your quest.
You can use np.shares_memory() to check whether a function returns a view or a copy of the original array.
x = np.array([1, 2, 3, 4])
x1 = x
x2 = np.sqrt(x)
x3 = x[1:2]
x4 = x[1::2]
x5 = x.reshape(-1,2)
x6 = x[:,None]
x7 = x[None,:]
x8 = x6+x7
x9 = x5[1,0:2]
x10 = x5[[0,1],0:2]
print(np.shares_memory(x, x1))
print(np.shares_memory(x, x2))
print(np.shares_memory(x, x3))
print(np.shares_memory(x, x4))
print(np.shares_memory(x, x5))
print(np.shares_memory(x, x6))
print(np.shares_memory(x, x7))
print(np.shares_memory(x, x8))
print(np.shares_memory(x, x9))
print(np.shares_memory(x, x10))
True
False
True
True
True
True
True
False
True
False
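Notice the last two examples. x9 combines an integer with a slice, which is still basic indexing and returns a view; x10 combines an integer list with a slice, which triggers advanced indexing and returns a copy. The documentation describes the combined case as follows (this also gives some insight into how it is implemented):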
When there is at least one slice (:), ellipsis (...) or newaxis in the index (or the array has more dimensions than there are advanced indexes), then the behaviour can be more complicated. It is like concatenating the indexing result for each advanced index element
From Python to Numpy. Nicolas P. Rougier : Views and copies.
First, we have to distinguish between indexing and fancy indexing. The first will always return a view while the second will return a copy. This difference is important because in the first case, modifying the view modifies the base array while this is not true in the second case:
If you are unsure if the result of your indexing is a view or a copy, you can check what is the base of your result. If it is None, then your result is a copy:
>>> Z = np.random.uniform(0,1,(5,5))
>>> Z1 = Z[:3,:]
>>> Z2 = Z[[0,1,2], :]
>>> print(np.allclose(Z1,Z2))
True
>>> print(Z1.base is Z)
True
>>> print(Z2.base is Z)
False
>>> print(Z2.base is None)
True
If the base refers to the source, it is NOT None, hence a view.

Use of np.where()[0]

My code detects all the points under a threshold, then locates the start and end points.
below = np.where(self.data < self.threshold)[0]
startandend = np.diff(below)
startpoints = np.insert(startandend, 0, 2)
endpoints = np.insert(startandend, -1, 2)
startpoints = np.where(startpoints>1)[0]
endpoints = np.where(endpoints>1)[0]
startpoints = below[startpoints]
endpoints = below[endpoints]
I don't really get the use of [0] after np.where() function here
below = np.where(self.data < self.threshold)[0]
means:
take the first element from the tuple of ndarrays returned by np.where() and
assign it to below.
np.where is tricky. Called with only a condition, it returns a tuple of index arrays, one per dimension of the input, even if the condition is never satisfied. In the case of np.where(my_numpy_array==some_value)[0] specifically, this means that you want the first element of that tuple, which is the array of indices along the first axis of the condition-meeting cells.
Quite a mouthful. In simple terms, for a 1-D array, np.where(array==x)[0] returns an array of the indices where the condition has been met. The tuple is there so that the same call also works for multi-dimensional arrays.
Keep in mind that no matches still returns an empty array; errors like "only size-1 arrays can be converted to Python scalars" may be attributed to that.
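A short sketch of what np.where returns may help here. The arrays data and grid are made up for illustration and are not from the question:

```python
import numpy as np

data = np.array([5, 1, 7, 2, 9])

result = np.where(data < 3)
print(type(result).__name__, len(result))  # tuple 1

below = result[0]             # the index array for the only axis
print(below)                  # [1 3]

# No matches: still a tuple, holding an empty index array.
empty = np.where(data < 0)[0]
print(empty.size)             # 0

# For a 2-D input the tuple has one index array per axis.
grid = np.array([[1, 9], [9, 1]])
rows, cols = np.where(grid > 5)
print(rows, cols)             # [0 1] [1 0]
```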

python/numpy: negate or complement a slice

I have some functions, part of a big analysis software, that require a boolean mask to divide array items in two groups. These functions are like this:
def process(data, a_mask):
    b_mask = -a_mask
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b
Now, I need to use these functions (with no modification) with a big array that has items of only class "a", but I would like to save RAM and avoid passing an all-True boolean mask. For example I could pass a slice like slice(None, None).
The problem is that the line b_mask = -a_mask will fail if a_mask is a slice. Ideally -a_mask should give a 0-items selection.
I was thinking of creating a "modified" slice object that implements the __neg__() method as a null slice (for example slice(0, 0)). I don't know if this is possible.
Other solutions that leave the process() function unmodified while avoiding the allocation of an all-True boolean array will be accepted as well.
Unfortunately we can't add a __neg__() method to slice, since it cannot be subclassed. However, tuple can be subclassed, and we can use it to hold a single slice object.
This leads me to a very, very nasty hack which should just about work for you:
class NegTuple(tuple):
    def __neg__(self):
        return slice(0)
We can create a NegTuple containing a single slice object:
nt = NegTuple((slice(None),))
This can be used as an index, and negating it will yield an empty slice resulting in a 0-length array being indexed:
a = np.arange(5)
print(a[nt])
# [0 1 2 3 4]
print(a[-nt])
# []
You would have to be very desperate to resort to something like this, though. Is it totally out of the question to modify process like this?
def process(data, a_mask=None):
    if a_mask is None:
        a_mask = slice(None)  # every element
        b_mask = slice(0)     # no elements
    else:
        b_mask = -a_mask
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b
This is way more explicit, and should not have any effect on its behavior for your current use cases.
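For reference, here is a runnable sketch of that modified function. Two things in it are assumptions rather than part of the question: func_a and func_b are hypothetical stand-ins, and the complement is written as ~a_mask because current NumPy releases raise a TypeError for unary minus on boolean arrays:

```python
import numpy as np

def func_a(x):
    # hypothetical stand-in for the real class "a" computation
    return int(x.sum())

def func_b(x):
    # hypothetical stand-in for the real class "b" computation
    return int(x.size)

def process(data, a_mask=None):
    if a_mask is None:
        a_mask = slice(None)   # every element, selected as a view
        b_mask = slice(0)      # no elements
    else:
        b_mask = ~a_mask       # logical complement of a boolean mask
    return func_a(data[a_mask]), func_b(data[b_mask])

data = np.arange(5)
print(process(data))                       # (10, 0)

a_mask = np.array([True, False, True, False, True])
print(process(data, a_mask))               # (6, 2)
```

In the a_mask=None branch both selections are basic slices, so no boolean array and no copy of data is allocated.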
Your solution is very similar to a degenerate sparse boolean array, although I don't know of any implementations of the same. My knee-jerk reaction is one of dislike, but if you really can't modify process it's probably the best way.
If you are concerned about memory use, then advanced indexing may be a bad idea. From the docs
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
As it stands, the process function has:
data of size n say
a_mask of size n (assuming advanced indexing)
And creates:
b_mask of size n
data[a_mask] of size m say
data[b_mask] of size n - m
This is effectively 4 arrays of size n.
Basic slicing seems to be your best option then, however Python doesn't seem to allow subclassing slice:
TypeError: Error when calling the metaclass bases
type 'slice' is not an acceptable base type
See @ali_m's answer for a solution that incorporates slicing.
Alternatively, you could just bypass process and get your results as
result = func_a(data), func_b([])

numpy ravel versus flat in slice assignment

According to the documentation, ndarray.flat is an iterator over the array while ndarray.ravel returns a flattened array (when possible). So my question is, when should we use one or the other?
Which one would be preferred as the rvalue in an assignment like the one in the code below?
import numpy as np
x = np.arange(2).reshape((2,1,1))
y = np.arange(3).reshape((1,3,1))
z = np.arange(5).reshape((1,1,5))
mask = np.random.choice([True, False], size=(2,3,5))
# netCDF4 module wants this kind of boolean indexing:
nc4slice = tuple(mask.any(axis=axis) for axis in ((1,2),(2,0),(0,1)))
indices = np.ix_(*nc4slice)
ncrds = 3
npnts = (np.broadcast(*indices)).size
points = np.empty((npnts, ncrds))
for i,crd in enumerate(np.broadcast_arrays(x,y,z)):
    # Should we use ndarray.flat ...
    points[:,i] = crd[indices].flat
    # ... or ndarray.ravel():
    points[:,i] = crd[indices].ravel()
You don't need either. crd[mask] is already 1-d. If you did, numpy always calls np.asarray(rhs) first, so it is the same if no copy is needed for ravel. When the copy is needed, I would guess that ravel may be faster currently (I did not time it).
If you knew that a copy might be needed, and here you know that nothing is needed, reshaping points could actually be the fastest. Since you usually don't need the fastest, I would say it is more a matter of taste, and would personally probably use ravel.
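To illustrate the difference in isolation, here is a small sketch. The arrays a and out are made up for the example, and no timing claims are intended:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# flat is an iterator object; ravel returns an ndarray.
print(type(a.flat).__name__)               # flatiter
print(type(a.ravel()).__name__)            # ndarray

# ravel avoids a copy when the data is already laid out in C order...
print(np.shares_memory(a, a.ravel()))      # True
# ...and has to copy when it is not (here: a transposed array).
print(np.shares_memory(a, a.T.ravel()))    # False

# As the right-hand side of an assignment, both give the same result.
out = np.empty((6, 2))
out[:, 0] = a.flat
out[:, 1] = a.ravel()
print(np.array_equal(out[:, 0], out[:, 1]))  # True
```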
