Which numpy index is copy and which is view?

Question
Regarding NumPy indexing, which can return either a copy or a view, please confirm whether my understandings below are correct. If not, please provide explanations and pointers to the relevant specifications.
Q: Basic slicing
The Numpy Indexing documentation has a Basic Slicing and Indexing section. I believe this section covers only basic slicing, one of the three kinds of indexing mentioned under Indexing.
Indexing
There are three kinds of indexing available: field access, basic slicing, advanced indexing. Which one occurs depends on obj.
It uses only Python slice objects, and it returns views only. Even when a slice appears inside a tuple, which may fall under the section below, it still returns a view.
Dealing with variable numbers of indices within programs
If one supplies to the index a tuple, the tuple will be interpreted as a list of indices.
This code will return a view.
>>> z = np.arange(81).reshape(3,3,3,3)
>>> indices = (1,1,1,slice(0,2)) # same as z[1,1,1,0:2]
>>> z[indices]
array([39, 40])
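A quick check appears to confirm the view claim (a minimal sketch continuing the session above; it assumes numpy is imported as np):
>>> v = z[indices]
>>> np.shares_memory(z, v)
True
>>> v[0] = -1
>>> print(z[1,1,1,0])
-1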
Are these correct understandings?
Q: Combining basic slicing and advanced indexing
There is a section in the document:
Combining advanced and basic indexing
When there is at least one slice (:), ellipsis (...) or newaxis in the index (or the array has more dimensions than there are advanced indexes), then the behavior can be more complicated.
I believe "basic indexing" in the section title means basic slicing; there is no basic indexing distinct from basic slicing, because Indexing says there are only three kinds: field access, basic slicing, and advanced indexing.
When advanced indexing and basic indexing are combined, the result is a copy.
Are these correct understandings?
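A quick check also appears to confirm the copy claim (a minimal sketch; both selections pick the same elements of z, but only the purely basic one aliases it):
>>> basic = z[0:2, 0:2]       # basic slicing only
>>> mixed = z[[0, 1], 0:2]    # advanced + basic indexing
>>> np.array_equal(basic, mixed)
True
>>> np.shares_memory(z, basic)
True
>>> np.shares_memory(z, mixed)
False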
Background
NumPy indexing has so many options, some returning a view and others a copy, that it is confusing to keep a clear mental classification of which operations return a reference/view and which return a copy.
Indexing
There are three kinds of indexing available: field access, basic slicing, advanced indexing. Which one occurs depends on obj.
Basic Slicing and Indexing
Basic slicing occurs when obj is a slice object (constructed by start:stop:step notation inside of brackets), an integer, or a tuple of slice objects and integers. Ellipsis and newaxis objects can be interspersed with these as well.
NumPy slicing creates a view instead of a copy as in the case of builtin Python sequences such as string, tuple and list.
Advanced Indexing
Advanced indexing is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool). There are two types of advanced indexing: integer and Boolean.
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
Dealing with variable numbers of indices within programs
If one supplies to the index a tuple, the tuple will be interpreted as a list of indices.
>>> z = np.arange(81).reshape(3,3,3,3)
>>> indices = (1,1,1,1)
>>> z[indices]
40
Field Access
If the ndarray object is a structured array, the fields of the array can be accessed by indexing the array with strings, dictionary-like. Indexing x['field-name'] returns a new view to the array.
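For instance (a minimal sketch with a made-up two-field dtype), accessing a single field shares memory with the parent array:
import numpy as np

s = np.zeros(3, dtype=[('a', 'i4'), ('b', 'f8')])
f = s['a']                     # single-field access
print(np.shares_memory(s, f))  # True: it is a view
f[:] = 7
print(s['a'])                  # [7 7 7]: writes show up in the parent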
Note
Which numpy operations copy and which mutate?

It's true that to get a good grasp of what returns a view and what returns a copy, you need to be thorough with the documentation (which sometimes doesn't really mention it, either). I won't be able to provide a complete list of operations and their result types (view or copy); however, maybe this can help you on your quest.
You can use np.shares_memory() to check whether a function returns a view or a copy of the original array.
import numpy as np

x = np.array([1, 2, 3, 4])
x1 = x                # same object
x2 = np.sqrt(x)       # ufunc result: new array
x3 = x[1:2]           # basic slice: view
x4 = x[1::2]          # basic slice with step: view
x5 = x.reshape(-1,2)  # reshape without copying: view
x6 = x[:,None]        # newaxis: view
x7 = x[None,:]        # newaxis: view
x8 = x6+x7            # arithmetic result: new array
x9 = x5[1,0:2]        # basic indexing only: view
x10 = x5[[0,1],0:2]   # advanced + basic indexing: copy
print(np.shares_memory(x, x1))
print(np.shares_memory(x, x2))
print(np.shares_memory(x, x3))
print(np.shares_memory(x, x4))
print(np.shares_memory(x, x5))
print(np.shares_memory(x, x6))
print(np.shares_memory(x, x7))
print(np.shares_memory(x, x8))
print(np.shares_memory(x, x9))
print(np.shares_memory(x, x10))
True
False
True
True
True
True
True
False
True
False
Notice the last two advanced+basic indexing examples: one is a view while the other is a copy. The explanation of this difference, as given in the documentation (which also provides insight into how these are implemented), is:
When there is at least one slice (:), ellipsis (...) or newaxis in the index (or the array has more dimensions than there are advanced indexes), then the behaviour can be more complicated. It is like concatenating the indexing result for each advanced index element
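That "concatenating" description can be reproduced by hand for the copying case above (a minimal sketch reusing x5 from the earlier snippet): x5[[0,1],0:2] yields the same values as stacking the per-index slices into a fresh array, which is why it cannot be a view.
import numpy as np

x = np.array([1, 2, 3, 4])
x5 = x.reshape(-1, 2)

fancy = x5[[0, 1], 0:2]                       # advanced + basic: copy
stacked = np.stack([x5[0, 0:2], x5[1, 0:2]])  # concatenate per-index results
print(np.array_equal(fancy, stacked))         # True
print(np.shares_memory(x5, fancy))            # False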

From Python to Numpy, Nicolas P. Rougier: Views and copies.
First, we have to distinguish between indexing and fancy indexing. The first will always return a view while the second will return a copy. This difference is important because in the first case, modifying the view modifies the base array while this is not true in the second case:
If you are unsure whether the result of your indexing is a view or a copy, you can check the base of your result. If it is None, then your result is a copy:
>>> Z = np.random.uniform(0,1,(5,5))
>>> Z1 = Z[:3,:]
>>> Z2 = Z[[0,1,2], :]
>>> print(np.allclose(Z1,Z2))
True
>>> print(Z1.base is Z)
True
>>> print(Z2.base is Z)
False
>>> print(Z2.base is None)
True
If the base refers to the source, it is NOT None, hence a view.
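The practical consequence is the write-through behavior mentioned above (a short continuation of the same session): modifying the view Z1 changes Z, while modifying the copy Z2 does not.
>>> Z1[0, 0] = 999
>>> print(Z[0, 0])
999.0
>>> Z2[0, 0] = -1
>>> print(Z[0, 0])
999.0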

Related

Understanding NumPy's copy behaviour when using advanced indexing

I am currently struggling writing tidy code in NumPy using advanced indexing.
arr = np.arange(100).reshape(10,10) # array I want to manipulate
sl1 = arr[:,-1] # basic indexing
# Do stuff with sl1...
sl1[:] = -1
# arr has been changed as well
sl2 = arr[arr >= 50] # advanced indexing
# Do stuff with sl2...
sl2[:] = -2
# arr has not been changed,
# changes must be written back into it
arr[arr >= 50] = sl2 # What I'd like to avoid
I'd like to avoid this "write back" operation because it feels superfluous and I often forget it. Is there a more elegant way to accomplish the same thing?
Both boolean and integer array indexing fall under the category of advanced indexing. In the second example (boolean indexing), you'll see that the original array is not updated; this is because advanced indexing always returns a copy of the data (see the second paragraph of the advanced indexing section of the docs). This means that arr[arr >= 50] is already a copy of arr, and whatever changes you apply to it won't affect arr.
The reason why it does not return a view is that advanced indexing cannot be expressed as a slice, and hence cannot be addressed with offsets, strides, and counts, which is required to be able to take a view of the array's elements.
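That offsets/strides point can be made concrete (a minimal sketch; the exact byte counts depend on your platform's default integer size):
import numpy as np

arr = np.arange(100).reshape(10, 10)

view = arr[::2, -1]          # start + one constant step: a view
print(view.strides)          # a single fixed byte-step, e.g. (160,)
print(arr[[0, 3, 4], -1])    # [ 9 39 49]: steps of 3 then 1 fit no single
                             # stride, so NumPy must copy the elements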
We can easily verify that we are viewing different objects in the case of advanced indexing with:
np.shares_memory(arr, arr[arr>50])
# False
np.shares_memory(arr, arr[:,-1])
# True
Views are only returned when performing basic slicing operations, so you'll have to assign back as you do in the last example. In reference to the question in the comments, when assigning back in the same expression:
arr[arr >= 50] = -2
This is translated by the Python interpreter as:
arr.__setitem__(arr >= 50, -2)
Here the thing to understand is that the expression is evaluated in place: there's no new object creation involved, since there is no need for one.
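So a masked assignment written as one expression updates arr directly and needs no write-back (a minimal sketch):
import numpy as np

arr = np.arange(100).reshape(10, 10)
arr[arr >= 50] = -2        # one expression: arr.__setitem__(mask, -2)
print((arr >= 50).any())   # False: arr itself was modified in place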

Seemingly inconsistent behavior when converting numpy arrays to bool

What is the rationale behind the seemingly inconsistent behaviour of the following lines of code?
import numpy as np
# standard list
print(bool([])) # False - expected
print(bool([0])) # True - expected
print(bool([1])) # True - expected
print(bool([0,0])) # True - expected
# numpy arrays
print(bool(np.array([]))) # False - expected, deprecation warning: The
# truth value of an empty array is ambiguous...
print(bool(np.array([0]))) # False - unexpected, no warning
print(bool(np.array([1]))) # True - unexpected, no warning
print(bool(np.array([0,0]))) # ValueError: The truth value of an array
# with more than one element is ambiguous...
There are at least two inconsistencies from my point of view:
Standard Python containers can be tested for emptiness with bool(container). Why do numpy arrays not follow this pattern? (bool(np.array([0])) yields False.)
Why is there an exception/deprecation warning when converting an empty numpy array or an array of length > 1, but it is okay to do so when the numpy array contains just one element?
Note that the deprecation warning for empty numpy arrays was added somewhere between numpy 1.11 and 1.14.
For the first problem, the reason is that it's not at all clear what you want to do with if np.array([1, 2]):.
This isn't a problem for if [1, 2]: because Python lists don't do element-wise anything. The only thing you can be asking is whether the list itself is truthy (non-empty).
But Numpy arrays do everything element-wise that possibly could be element-wise. Notice that this is hardly the only place, or even the most common place, where element-wise semantics mean that arrays work differently from normal Python sequences. For example:
>>> [1, 2] * 3
[1, 2, 1, 2, 1, 2]
>>> np.array([1, 2]) * 3
array([3, 6])
And, for this case in particular, boolean arrays are a very useful thing, especially since you can index with them:
>>> arr = np.array([1, 2, 3, 4])
>>> arr > 2 # all values > 2
array([False, False, True, True])
>>> arr[arr > 2] = 2 # clamp the values at <= 2
>>> arr
array([1, 2, 2, 2])
And once you have that feature, it becomes ambiguous what an array should mean in a boolean context. Normally, you want the bool array. But when you write if arr:, you could mean any of multiple things:
Do the body of the if for each element that's truthy. (Rewrite the body as an expression on arr indexed by the bool array.)
Do the body of the if if any element is truthy. (Use any.)
Do the body of the if if all elements are truthy. (Use all.)
A hybrid over some axis—e.g., do the body for each row where any element is truthy.
Do the body of the if if the array is nonempty—acting like a normal Python sequence but violating the usual element-wise semantics of an array. (Explicitly check for emptiness.)
So, rather than guess, and be wrong more often than not, numpy gives you an error and forces you to be explicit.
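In practice, being explicit means picking one of the reductions listed above (a short sketch of the alternatives):
import numpy as np

arr = np.array([[0, 1], [0, 0]])
print(arr.any())         # True: at least one element is truthy
print(arr.all())         # False: not every element is truthy
print(arr.any(axis=1))   # [ True False]: per-row reduction (the hybrid case)
print(arr.size == 0)     # False: the explicit emptiness check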
For the second problem, doesn't the warning text answer this for you? The truth value of a single element is obviously not ambiguous.
And single-element arrays—especially 0D ones—are often used as pseudo-scalars, so being able to do this isn't just harmless, it's also sometimes useful.
By contrast, asking "is this array empty" is rarely useful. A list is a variable-sized thing that you usually build up by adding one element at a time, zero or more times (possibly implicitly in a comprehension), so it's very often worth asking whether you added zero elements. But an array is a fixed-size thing, where you usually explicitly specified the size somewhere nearby in the code.
That's why it's allowed. And why it operates on the single value, not on the size of the array.
For empty arrays (which you didn't ask about, but did bring up): here, instead of there being multiple reasonable things you could mean, it's hard to think of anything reasonable you could mean. Which is probably why this is the only case that's changed recently (see issue 9583), rather than being the same since the days when Python added __nonzero__.

masking a series with a boolean array

This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas series. When I create a boolean array using a series, for instance
import numpy as np
import pandas as pd

x = np.array([1,2,3,4,5,6,7])
y = pd.Series([1,2,3,4,5,6,7])
delta = np.percentile(x, 50)
deltamask = x- y > delta
deltamask is a boolean pandas Series.
However, if you do
x[deltamask]
y[deltamask]
You find that the array completely ignores the mask. No error is raised, but you end up with two objects of different lengths. This means that an operation like
x[deltamask]*y[deltamask]
results in an error:
print(type(x - y))
print(type(x[deltamask]), len(x[deltamask]))
print(type(y[deltamask]), len(y[deltamask]))
Even more perplexing, I noticed that the operator < is treated differently. For instance
print(type(2*x < x*y))
print(type(2 < x*y))
will give you a pd.Series and an np.ndarray, respectively.
Also,
5 < x - y
results in a Series, so it seems that the Series takes precedence, whereas the boolean elements of a Series mask are promoted to integers when passed to a numpy array, resulting in a sliced array.
What is the reason for this?
Fancy Indexing
As numpy currently stands, fancy indexing in numpy works as follows:
If the thing between brackets is a tuple (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of x. For example, both x[(True, True)] and x[True, True] will raise IndexError: too many indices for array in this case because x is 1D. However, before the exception happens, a telling warning will be raised too: VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future.
If the thing between brackets is exactly an ndarray, not a subclass or other array-like, and has a boolean type, it will be applied as a mask. This is why x[deltamask.values] gives the expected result (an empty array, since deltamask is all False).
If the thing between brackets is any array-like, whether a subclass like Series or just a list, or something else, it is converted to an np.intp array (if possible) and used as an integer index. So x[deltamask] yields something equivalent to x[[False] * 7] or just x[[0] * 7]. In this case, len(deltamask) == 7 and x[0] == 1, so the result is [1, 1, 1, 1, 1, 1, 1].
This behavior is counterintuitive, and the FutureWarning: in the future, boolean array-likes will be handled as a boolean array index it generates indicates that a fix is in the works. I will update this answer as I find out about/make any changes to numpy.
This information can be found in Sebastian Berg's response to my initial query on Numpy discussion here.
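For readers on a current NumPy: that fix has since landed (around NumPy 1.13, boolean array-likes started being treated as boolean indexes), so a check along these lines should now show mask behavior (a minimal sketch):
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
mask = [False] * 7            # a boolean array-like (a plain list)
print(x[mask])                # []: treated as a mask, not as integers
print(x[np.asarray(mask)])    # []: same as the explicit ndarray case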
Relational Operators
Now let's address the second part of your question, about how the comparison works. Relational operators (<, >, <=, >=) work by calling the corresponding method on one of the objects being compared. For < this is __lt__. However, instead of just calling x.__lt__(y) for the expression x < y, Python first checks the types of the objects being compared. If y's type is a subclass of x's type and implements the comparison, Python prefers to call y.__gt__(x) instead, regardless of how you wrote the original comparison. In that case, x.__lt__(y) only gets called if y.__gt__(x) returns NotImplemented to indicate that the comparison is not supported in that direction.
A similar thing happens when you do 5 < x - y. While ndarray is not a subclass of int, the comparison int.__lt__(ndarray) returns NotImplemented, so Python actually ends up calling (x - y).__gt__(5), which is of course defined and works just fine.
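That dispatch rule is easy to demonstrate with a toy class pair (a minimal sketch; Base and Sub are made-up names, not pandas' actual implementation):
class Base:
    def __lt__(self, other):
        return "Base.__lt__ ran"

class Sub(Base):
    def __gt__(self, other):
        return "Sub.__gt__ ran"

# Sub subclasses Base, so Python tries the reflected method first:
print(Base() < Sub())   # Sub.__gt__ ran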
A much more succinct explanation of all this can be found in the Python docs.

python/numpy: negate or complement a slice

I have some functions, part of a big analysis software, that require a boolean mask to divide array items in two groups. These functions are like this:
def process(data, a_mask):
    b_mask = -a_mask   # note: modern NumPy needs ~a_mask for boolean arrays
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b
Now I need to use these functions (without modification) on a big array whose items are all of class "a", but I would like to save RAM and not pass a boolean mask that is all True. For example, I could pass a slice like slice(None, None).
The problem is that the line b_mask = -a_mask will fail if a_mask is a slice. Ideally, -a_mask would give a 0-item selection.
I was thinking of creating a "modified" slice object that implements the __neg__() method as a null slice (for example slice(0, 0)). I don't know if this is possible.
Other solutions that avoid modifying the process() function while also avoiding the allocation of an all-True boolean array would be accepted as well.
Unfortunately we can't add a __neg__() method to slice, since it cannot be subclassed. However, tuple can be subclassed, and we can use it to hold a single slice object.
This leads me to a very, very nasty hack which should just about work for you:
class NegTuple(tuple):
    def __neg__(self):
        return slice(0)
We can create a NegTuple containing a single slice object:
nt = NegTuple((slice(None),))
This can be used as an index, and negating it will yield an empty slice resulting in a 0-length array being indexed:
a = np.arange(5)
print(a[nt])
# [0 1 2 3 4]
print(a[-nt])
# []
You would have to be very desperate to resort to something like this, though. Is it totally out of the question to modify process like this?
def process(data, a_mask=None):
    if a_mask is None:
        a_mask = slice(None)  # every element
        b_mask = slice(0)     # no elements
    else:
        b_mask = -a_mask
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b
This is far more explicit and should not have any effect on its behavior for your current use cases.
Your solution is very similar to a degenerate sparse boolean array, although I don't know of any implementations of one. My knee-jerk reaction is one of dislike, but if you really can't modify process, it's probably the best way.
If you are concerned about memory use, then advanced indexing may be a bad idea. From the docs
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
As it stands, the process function has:
data of size n, say
a_mask of size n (assuming advanced indexing)
And creates:
b_mask of size n
data[a_mask] of size m, say
data[b_mask] of size n - m
This is effectively 4 arrays of size n.
Basic slicing seems to be your best option, then. However, Python doesn't allow subclassing slice:
TypeError: Error when calling the metaclass bases
type 'slice' is not an acceptable base type
See @ali_m's answer for a solution that incorporates slicing.
Alternatively, you could just bypass process and get your results as
result = func_a(data), func_b([])

Named dtype array: Difference between a[0]['name'] and a['name'][0]?

I came across the following oddity in numpy which may or may not be a bug:
import numpy as np
dt = np.dtype([('tuple', (int, 2))])
a = np.zeros(3, dt)
type(a['tuple'][0]) # ndarray
type(a[0]['tuple']) # ndarray
a['tuple'][0] = (1,2) # ok
a[0]['tuple'] = (1,2) # ValueError: shape-mismatch on array construction
I would have expected both of the options above to work.
Opinions?
I asked that on the numpy-discussion list. Travis Oliphant answered here.
Citing his answer:
The short answer is that this is not really a "normal" bug, but it could be considered a "design" bug (although the issues may not be straightforward to resolve). What that means is that it may not be changed in the short term --- and you should just use the first spelling.
Structured arrays can be a confusing area of NumPy for several of reasons. You've constructed an example that touches on several of them. You have a data-type that is a "structure" array with one member ("tuple"). That member contains a 2-vector of integers.
First of all, it is important to remember that with Python, doing
a['tuple'][0] = (1,2)
is equivalent to
b = a['tuple']; b[0] = (1,2)
In like manner,
a[0]['tuple'] = (1,2)
is equivalent to
b = a[0]; b['tuple'] = (1,2)
To understand the behavior, we need to dissect both code paths and what happens. You built a (3,) array of those elements in 'a'. When you write b = a['tuple'] you should probably be getting a (3,) array of (2,)-integers, but as there is currently no formal dtype support for (n,)-integers as a general dtype in NumPy, you get back a (3,2) array of integers which is the closest thing that NumPy can give you. Setting the [0] row of this object via
a['tuple'][0] = (1,2)
works just fine and does what you would expect.
On the other hand, when you type:
b = a[0]
you are getting back an array-scalar which is a particularly interesting kind of array scalar that can hold records. This new object is formally of type numpy.void and it holds a "scalar representation" of anything that fits under the "VOID" basic dtype.
For some reason:
b['tuple'] = [1,2]
is not working. On my system I'm getting a different error: TypeError: object of type 'int' has no len()
I think this should be filed as a bug on the issue tracker which is for the time being here: http://projects.scipy.org/numpy
The problem is ultimately the void->copyswap function being called in voidtype_setfields if someone wants to investigate. I think this behavior should work.
An explanation for this is given in a numpy bug report.
I get a different error than you do (using numpy 1.7.0.dev):
ValueError: setting an array element with a sequence.
so the explanation below may not be correct for your system (or it could even be the wrong explanation for what I see).
First, notice that indexing a row of a structured array gives you a numpy.void object (see the data type docs):
import numpy as np
dt = np.dtype([('tuple', (int, 2))])
a = np.zeros(3, dt)
print(type(a[0])) # = numpy.void
From what I understand, void is sort of like a Python list, since it can hold objects of different data types, which makes sense because the columns in a structured array can have different data types.
If, instead of indexing, you slice out the first row, you get an ndarray:
print(type(a[:1])) # = numpy.ndarray
This is analogous to how Python lists work:
b = [1, 2, 3]
print(b[0]) # 1
print(b[:1]) # [1]
Slicing returns a shortened version of the original sequence, but indexing returns an element (here, an int; above, a void type).
So when you slice into the rows of the structured array, you should expect it to behave just like your original array (only with fewer rows). Continuing with your example, you can now assign to the 'tuple' columns of the first row:
a[:1]['tuple'] = (1, 2)
So,... why doesn't a[0]['tuple'] = (1, 2) work?
Well, recall that a[0] returns a void object. So, when you call
a[0]['tuple'] = (1, 2) # this line fails
you're assigning a tuple to the 'tuple' element of that void object. Note: despite the fact you've called this index 'tuple', it was stored as an ndarray:
print(type(a[0]['tuple'])) # = numpy.ndarray
So, this means the tuple needs to be cast into an ndarray. But, the void object can't cast assignments (this is just a guess) because it can contain arbitrary data types so it doesn't know what type to cast to. To get around this you can cast the input yourself:
a[0]['tuple'] = np.array((1, 2))
The fact that we get different errors suggests that the above line might not work for you since casting addresses the error I received---not the one you received.
Addendum:
So why does the following work?
a[0]['tuple'][:] = (1, 2)
Here, you're indexing into the array when you add [:], but without it, you're indexing into the void object. In other words, a[0]['tuple'][:] says "replace the elements of the stored array" (which is handled by the array), while a[0]['tuple'] says "replace the stored array" (which is handled by void).
Epilogue:
Strangely enough, accessing the row (i.e. indexing with 0) seems to drop the base array, but it still allows you to assign to the base array.
print(a['tuple'].base is a) # = True
print(a[0].base is a) # = False
a[0] = ((1, 2),) # `a` is changed
Maybe void is not really an array so it doesn't have a base array,... but then why does it have a base attribute?
This was an upstream bug, fixed by NumPy PR #5947; the fix shipped in release 1.9.3.
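On a fixed NumPy, both spellings should therefore behave the same (a minimal sketch):
import numpy as np

dt = np.dtype([('tuple', (int, 2))])
a = np.zeros(3, dt)
a['tuple'][0] = (1, 2)   # always worked
a[1]['tuple'] = (3, 4)   # works once the fix is in (NumPy >= 1.9.3)
print(a['tuple'][:2])    # rows (1, 2) and (3, 4)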
