What is the rationale behind the seemingly inconsistent behaviour of the following lines of code?
import numpy as np
# standard list
print(bool([])) # False - expected
print(bool([0])) # True - expected
print(bool([1])) # True - expected
print(bool([0,0])) # True - expected
# numpy arrays
print(bool(np.array([]))) # False - expected, deprecation warning: The
# truth value of an empty array is ambiguous...
print(bool(np.array([0]))) # False - unexpected, no warning
print(bool(np.array([1]))) # True - unexpected, no warning
print(bool(np.array([0,0]))) # ValueError: The truth value of an array
# with more than one element is ambiguous...
There are at least two inconsistencies from my point of view:
Standard Python containers can be tested for emptiness with bool(container). Why do numpy arrays not follow this pattern? (bool(np.array([0])) yields False.)
Why is there an exception/deprecation warning when converting an empty numpy array or an array of length > 1, but it is okay to do so when the numpy array contains just one element?
Note that the deprecation warning for empty numpy arrays was added somewhere between numpy 1.11 and 1.14.
For the first problem, the reason is that it's not at all clear what you want to do with if np.array([1, 2]):.
This isn't a problem for if [1, 2]: because Python lists don't do element-wise anything. The only thing you can be asking is whether the list itself is truthy (non-empty).
But Numpy arrays do everything element-wise that possibly could be element-wise. Notice that this is hardly the only place, or even the most common place, where element-wise semantics mean that arrays work differently from normal Python sequences. For example:
>>> [1, 2] * 3
[1, 2, 1, 2, 1, 2]
>>> np.array([1, 2]) * 3
array([3, 6])
And, for this case in particular, boolean arrays are a very useful thing, especially since you can index with them:
>>> arr = np.array([1, 2, 3, 4])
>>> arr > 2 # all values > 2
array([False, False, True, True])
>>> arr[arr > 2] = 2 # clamp the values at <= 2
>>> arr
array([1, 2, 2, 2])
And once you have that feature, it becomes ambiguous what an array should mean in a boolean context. Normally, you want the bool array. But when you write if arr:, you could mean any of multiple things:
Do the body of the if for each element that's truthy. (Rewrite the body as an expression on arr indexed by the bool array.)
Do the body of the if if any element is truthy. (Use any.)
Do the body of the if if all elements are truthy. (Use all.)
A hybrid over some axis—e.g., do the body for each row where any element is truthy.
Do the body of the if if the array is nonempty—acting like a normal Python sequence but violating the usual element-wise semantics of an array. (Explicitly check for emptiness.)
So, rather than guess, and be wrong more often than not, numpy gives you an error and forces you to be explicit.
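For instance, the explicit spellings of the most common intents look like this:
import numpy as np
arr = np.array([1, 0, 2])
# bool(arr) raises ValueError: The truth value of an array with more than
# one element is ambiguous. Use a.any() or a.all()
print(arr.any())     # True  - at least one element is truthy
print(arr.all())     # False - not every element is truthy
print(arr.size > 0)  # True  - explicit emptiness check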
For the second problem, doesn't the error text answer this for you? The truth value of a single element is obviously not ambiguous.
And single-element arrays—especially 0D ones—are often used as pseudo-scalars, so being able to do this isn't just harmless, it's also sometimes useful.
By contrast, asking "is this array empty" is rarely useful. A list is a variable-sized thing that you usually build up by adding one element at a time, zero or more times (possibly implicitly in a comprehension), so it's very often worth asking whether you added zero elements. But an array is a fixed-size thing, where you usually explicitly specified the size somewhere nearby in the code.
That's why it's allowed. And why it operates on the single value, not on the size of the array.
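For example:
import numpy as np
print(bool(np.array([0])))  # False - follows the single element, not the size
print(bool(np.array([7])))  # True
print(bool(np.array(0)))    # False - a 0D pseudo-scalar obeys the same rule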
For empty arrays (which you didn't ask about, but did bring up): here, instead of there being multiple reasonable things you could mean, it's hard to think of anything reasonable you could mean. Which is probably why this is the only case that's changed recently (see issue 9583), rather than being the same since the days when Python added __nonzero__.
Related
I'm learning Python right now and I'm stuck on this line of code I found on the internet. I cannot understand what this line of code actually does.
Suppose I have this array:
import numpy as np
x = np.array([[1, 5], [8, 1], [10, 0.5]])
y = x[np.sqrt(x[:,0]**2+x[:,1]**2) < 1]
print (y)
The result is an empty array. What I want to know is: what does y actually do? I've never encountered this kind of code before. It seems like the square brackets act like an if-conditional statement. Instead of that code, if I write this line of code:
import numpy as np
x = np.array([[1, 5], [8, 1], [10, 0.5]])
y = x[0 < 1]
print (y)
It will return exactly what x is (because zero IS less than one).
Assuming that this is a way to write an if-conditional statement, I find it really absurd, because I'm comparing an array with an integer.
Thank you for your answer!
In Numpy:
np.array([1, 1, 2, 3, 4]) < 2
is (very roughly) equivalent to the list comprehension:
[x < 2 for x in [1, 1, 2, 3, 4]]
for vanilla Python lists. In both cases, the result is element-wise:
[True, True, False, False, False]
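A runnable version of that comparison (a minimal sketch):
import numpy as np
print(np.array([1, 1, 2, 3, 4]) < 2)     # [ True  True False False False]
print([x < 2 for x in [1, 1, 2, 3, 4]])  # [True, True, False, False, False]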
The same holds true for other operations, like addition, multiplication and so on; this element-wise broadcasting is actually a major selling point of Numpy.
Now, another thing you can do in Numpy is boolean indexing: you provide an array of bools that is interpreted as 'keep this value: yes/no?'. So:
arr = np.array([1, 1, 2, 3, 4])
res = arr[arr < 2]
# res evaluates to:
# array([1, 1])
numpy works differently when you index an array using a boolean than when you use an int.
From the docs:
This advanced indexing occurs when obj is an array object of Boolean type, such as may be returned from comparison operators. A single boolean index array is practically identical to x[obj.nonzero()] where, as described above, obj.nonzero() returns a tuple (of length obj.ndim) of integer index arrays showing the True elements of obj. However, it is faster when obj.shape == x.shape.
If obj.ndim == x.ndim, x[obj] returns a 1-dimensional array filled with the elements of x corresponding to the True values of obj. The search order will be row-major, C-style. If obj has True values at entries that are outside of the bounds of x, then an index error will be raised. If obj is smaller than x it is identical to filling it with False.
When you index an array using booleans, you are telling numpy to select the data corresponding to True; therefore array[True] is not the same as array[1]. In the first case, numpy interprets it as a zero-dimensional boolean array which, based on how masks work, is the same as selecting all data.
Therefore:
x[True]
will return the full array, just as
x[False]
will return an empty array.
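To see the contrast with integer indexing, a quick check with the array from the question:
import numpy as np
x = np.array([[1, 5], [8, 1], [10, 0.5]])
print(x[1])            # [8. 1.]   - integer index: the second row
print(x[True].shape)   # (1, 3, 2) - 0-d boolean: all data, plus a new axis
print(x[False].shape)  # (0, 3, 2) - 0-d boolean: nothing selected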
Say I have an array x = np.arange(6).reshape(3, 2).
What is the meaning of x[False], or x[np.asanyarray(False)]? Both result in array([], shape=(0, 3, 2), dtype=int64), which is unexpected.
I expected to get an IndexError because of an improperly sized mask, as from something like x[np.ones((2, 2), dtype=np.bool)].
This behavior is consistent for x[True] and x[np.asanyarray(True)], as both result in an additional dimension: array([[[0, 1], [2, 3], [4, 5]]]).
I am using numpy 1.13.1. It appears that the behavior has changed recently, so while it is nice to have answers for older versions, please mention your version in the answers.
EDIT
Just for completeness, I filed https://github.com/numpy/numpy/issues/9515 based on the commentary on this question.
EDIT 2
And it was closed almost immediately.
There's technically no requirement that the dimensionality of a mask match the dimensionality of the array you index with it. (In previous versions, there were even fewer restrictions, and you could get away with some extreme shape mismatches.)
The docs describe boolean indexing as
A single boolean index array is practically identical to x[obj.nonzero()] where, as described above, obj.nonzero() returns a tuple (of length obj.ndim) of integer index arrays showing the True elements of obj.
but nonzero is weird for 0-dimensional input, so this case is one of the ways that "practically identical" turns out to be not identical:
the nonzero equivalence for Boolean arrays does not hold for zero dimensional boolean arrays.
NumPy has a special case for a 0-dimensional boolean index, motivated by the desire to have the following behavior:
In [3]: numpy.array(3)[True]
Out[3]: array([3])
In [4]: numpy.array(3)[False]
Out[4]: array([], dtype=int64)
I'll refer to a comment in the source code that handles a 0-dimensional boolean index:
if (PyArray_NDIM(arr) == 0) {
/*
* This can actually be well defined. A new axis is added,
* but at the same time no axis is "used". So if we have True,
* we add a new axis (a bit like with np.newaxis). If it is
* False, we add a new axis, but this axis has 0 entries.
*/
While this is primarily intended for a 0-dimensional index to a 0-dimensional array, it also applies to indexing multidimensional arrays with booleans. Thus,
x[True]
is equivalent to x[np.newaxis], producing a result with a new length-1 axis in front, and
x[False]
produces a result with a new axis in front of length 0, selecting no elements.
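A minimal check of that equivalence:
import numpy as np
x = np.arange(6).reshape(3, 2)
print(x[True].shape)                           # (1, 3, 2)
print(x[np.newaxis].shape)                     # (1, 3, 2)
print(np.array_equal(x[True], x[np.newaxis]))  # True
print(x[False].shape)                          # (0, 3, 2) - new axis, 0 entries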
This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas series. When I create a boolean array using a series, for instance
x = np.array([1,2,3,4,5,6,7])
y = pd.Series([1,2,3,4,5,6,7])
delta = np.percentile(x, 50)
deltamask = x - y > delta
deltamask is a boolean pandas Series.
However, if you do
x[deltamask]
y[deltamask]
You find that the array completely ignores the mask. No error is raised, but you end up with two objects of different lengths. This means that an operation like
x[deltamask]*y[deltamask]
results in an error:
print(type(x - y))
print(type(x[deltamask]), len(x[deltamask]))
print(type(y[deltamask]), len(y[deltamask]))
Even more perplexing, I noticed that the operator < is treated differently. For instance,
print(type(2 * x < x * y))
print(type(2 < x * y))
will give you a pd.Series and an np.ndarray, respectively.
Also,
5 < x - y
results in a Series, so it seems that the Series takes precedence, whereas the boolean elements of a Series mask are promoted to integers when passed to a numpy array, resulting in a sliced array.
What is the reason for this?
Fancy Indexing
As numpy currently stands, fancy indexing in numpy works as follows:
If the thing between brackets is a tuple (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of x. For example, both x[(True, True)] and x[True, True] will raise IndexError: too many indices for array in this case because x is 1D. However, before the exception happens, a telling warning will be raised too: VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future.
If the thing between brackets is exactly an ndarray, not a subclass or other array-like, and has a boolean type, it will be applied as a mask. This is why x[deltamask.values] gives the expected result (an empty array, since deltamask is all False).
If the thing between brackets is any array-like, whether a subclass like Series or just a list, or something else, it is converted to an np.intp array (if possible) and used as an integer index. So x[deltamask] yields something equivalent to x[[False] * 7] or just x[[0] * 7]. In this case, len(deltamask) == 7 and x[0] == 1, so the result is [1, 1, 1, 1, 1, 1, 1].
This behavior is counterintuitive, and the FutureWarning it generates (FutureWarning: in the future, boolean array-likes will be handled as a boolean array index) indicates that a fix is in the works. I will update this answer as I find out about/make any changes to numpy.
This information can be found in Sebastian Berg's response to my initial query on Numpy discussion here.
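As a sketch of the workaround described above (note that in newer NumPy releases this deprecation has run its course, and boolean array-likes such as a Series are now applied as boolean masks directly):
import numpy as np
import pandas as pd
x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])
deltamask = x - y > np.percentile(x, 50)  # all-False boolean Series
# Passing the underlying ndarray guarantees boolean-mask semantics:
print(x[deltamask.values])  # array([], dtype=...) - empty, as expected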
Relational Operators
Now let's address the second part of your question about how the comparison works. Relational operators (<, >, <=, >=) work by calling the corresponding method on one of the objects being compared. For < this is __lt__. However, instead of just calling x.__lt__(y) for the expression x < y, Python actually checks the types of the objects being compared. If y is a subtype of x that implements the comparison, then Python prefers to call y.__gt__(x) instead, regardless of how you wrote the original comparison. The only way that x.__lt__(y) will get called if y is a subclass of x is if y.__gt__(x) returns NotImplemented to indicate that the comparison is not supported in that direction.
A similar thing happens when you do 5 < x - y. While ndarray is not a subclass of int, the comparison int.__lt__(ndarray) returns NotImplemented, so Python actually ends up calling (x - y).__gt__(5), which is of course defined and works just fine.
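A quick illustration of the dispatch (a minimal sketch; modern pandas Series is no longer literally an ndarray subclass, but NumPy still defers to it in mixed comparisons):
import numpy as np
import pandas as pd
x = np.array([1, 2, 3])
y = pd.Series([1, 2, 3])
print(type(2 * x < x * y))  # pandas Series - NumPy defers to the pandas operand
print(type(5 < x - y))      # pandas Series - int.__lt__ returns NotImplemented,
                            # so the reflected (x - y).__gt__(5) is used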
A much more succinct explanation of all this can be found in the Python docs.
import numpy as np
a = np.array([0])
b = np.array([None])
c = np.array([''])
d = np.array([' '])
Why should we have this inconsistency:
>>> bool(a)
False
>>> bool(b)
False
>>> bool(c)
True
>>> bool(d)
False
For arrays with one element, the array's truth value is determined by the truth value of that element.
The main point to make is that np.array(['']) is not an array containing one empty Python string. This array is created to hold strings of exactly one character each, and NumPy pads strings that are too short with the null character. This means that the array is equal to np.array(['\0']).
In this regard, NumPy is being consistent with Python which evaluates bool('\0') as True.
In fact, the only strings which are False in NumPy arrays are strings which do not contain any non-whitespace characters ('\0' is not a whitespace character).
Details of this Boolean evaluation are presented below.
Navigating NumPy's labyrinthine source code is not always easy, but we can find the code governing how values in different datatypes are mapped to Boolean values in the arraytypes.c.src file. This will explain how bool(a), bool(b), bool(c) and bool(d) are determined.
Before we get to the code in that file, we can see that calling bool() on a NumPy array invokes the internal _array_nonzero() function. If the array is empty, we get False. If there are two or more elements we get an error. But if the array has exactly one element, we hit the line:
return PyArray_DESCR(mp)->f->nonzero(PyArray_DATA(mp), mp);
Now, PyArray_DESCR is a struct holding various properties for the array. f is a pointer to another struct PyArray_ArrFuncs that holds the array's nonzero function. In other words, NumPy is going to call upon the array's own special nonzero function to check the Boolean value of that one element.
Determining whether an element is nonzero or not is obviously going to depend on the datatype of the element. The code implementing the type-specific nonzero functions can be found in the "nonzero" section of the arraytypes.c.src file.
As we'd expect, floats, integers and complex numbers are False if they're equal with zero. This explains bool(a). In the case of object arrays, None is similarly going to be evaluated as False because NumPy just calls the PyObject_IsTrue function. This explains bool(b).
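For example, with arrays like those in the question:
import numpy as np
print(bool(np.array([0])))     # False - integer zero
print(bool(np.array([0.0])))   # False - float zero
print(bool(np.array([None])))  # False - PyObject_IsTrue(None) is false
print(bool(np.array([1])))     # True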
To understand the results of bool(c) and bool(d), we see that the nonzero function for string type arrays is mapped to the STRING_nonzero function:
static npy_bool
STRING_nonzero (char *ip, PyArrayObject *ap)
{
    int len = PyArray_DESCR(ap)->elsize;    // size of dtype (not string length)
    int i;
    npy_bool nonz = NPY_FALSE;

    for (i = 0; i < len; i++) {
        if (!Py_STRING_ISSPACE(*ip)) {      // if it isn't whitespace, it's True
            nonz = NPY_TRUE;
            break;
        }
        ip++;
    }
    return nonz;
}
(The unicode case is more or less the same idea.)
So in arrays with a string or unicode datatype, a string is only False if it contains only whitespace characters:
>>> bool(np.array([' ']))
False
In the case of array c in the question, there is really a null character \0 padding the seemingly-empty string:
>>> np.array(['']) == np.array(['\0'])
array([ True], dtype=bool)
The STRING_nonzero function sees this non-whitespace character and so bool(c) is True.
As noted at the start of this answer, this is consistent with Python's evaluation of strings containing a single null character: bool('\0') is also True.
Update: Wim has fixed the behaviour detailed above in NumPy's master branch by making strings which contain only null characters, or a mix of only whitespace and null characters, evaluate to False. This means that NumPy 1.10+ will see that bool(np.array([''])) is False, which is much more in line with Python's treatment of "empty" strings.
I'm pretty sure the answer is, as explained in Scalars, that:
Array scalars have the same attributes and methods as ndarrays. [1] This allows one to treat items of an array partly on the same footing as arrays, smoothing out rough edges that result when mixing scalar and array operations.
So, if it's acceptable to call bool on a scalar, it must be acceptable to call bool on an array of shape (1,), because they are, as far as possible, the same thing.
And, while it isn't directly said anywhere in the docs that I know of, it's pretty obvious from the design that NumPy's scalars are supposed to act like native Python objects.
So, that explains why np.array([0]) is falsey rather than truthy, which is what you were initially surprised about.
So, that explains the basics. But what about the specifics of case c?
First, note that your array np.array(['']) is not an array of one Python object, but an array of one NumPy <U1 null-padded character string of length 1. Fixed-length-string values don't have the same truthiness rule as Python strings—and they really couldn't; for a fixed-length-string type, "false if empty" doesn't make any sense, because they're never empty. You could argue about whether NumPy should have been designed that way or not, but it clearly does follow that rule consistently, and I don't think the opposite rule would be any less confusing here, just different.
But there seems to be something else weird going on with strings. Consider this:
>>> np.array(['a', 'b']) != 0
True
That's not doing an elementwise comparison of the <U2 strings to 0 and returning array([True, True]) (as you'd get from np.array(['a', 'b'], dtype=object)), it's doing an array-wide comparison and deciding that no array of strings is equal to 0, which seems odd… I'm not sure whether this deserves a separate answer here or even a whole separate question, but I am pretty sure I'm not going to be the one who writes that answer, because I have no clue what's going on here. :)
Beyond arrays of shape (1,), arrays of shape () are treated the same way, but anything else is a ValueError, because otherwise it would be very easy to misuse arrays with and and other Python operators that NumPy can't automagically convert into elementwise operations.
I personally think being consistent with other arrays would be more useful than being consistent with scalars here—in other words, just raise a ValueError. I also think that, if being consistent with scalars were important here, it would be better to be consistent with the unboxed Python values. In other words, if bool(array([v])) and bool(array(v)) are going to be allowed at all, they should always return exactly the same thing as bool(v), even if that's not consistent with np.nonzero. But I can see the argument the other way.
It's fixed in master now.
I thought this was a bug, and the numpy devs agreed, so this patch was merged earlier today. We should see new behaviour in the upcoming 1.10 release.
Numpy seems to be following the same casting rules as builtin Python; in this context, it seems to come down to which values return true for calls to nonzero. Apparently len can also be used, but here none of these arrays are empty (length 0), so that's not directly relevant. Note that calling bool([False]) also returns True according to these rules.
import numpy as np
a = np.array([0])
b = np.array([None])
c = np.array([''])
>>> np.nonzero(a)
(array([], dtype=int64),)
>>> np.nonzero(b)
(array([], dtype=int64),)
>>> np.nonzero(c)
(array([0]),)
This also seems consistent with the more enumerative description of bool casting --- where your examples are all explicitly discussed.
Interestingly, there does seem to be systematically different behavior with string arrays, e.g.
>>> a.astype(bool)
array([False], dtype=bool)
>>> b.astype(bool)
array([False], dtype=bool)
>>> c.astype(bool)
ERROR: ValueError: invalid literal for int() with base 10: ''
I think, when numpy converts something into a bool it uses the PyArray_BoolConverter function which, in turn, just calls the PyObject_IsTrue function --- i.e. the exact same function that builtin python uses, which is why numpy's results are so consistent.
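A small sketch of that consistency for object arrays, where the element's own Python truthiness decides the result:
import numpy as np
a = np.empty(1, dtype=object)
a[0] = []          # single-element object array holding an empty list
print(bool(a))     # False - PyObject_IsTrue([]) is false
a[0] = [0]
print(bool(a))     # True - a non-empty list is truthy in Python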
I'm wondering about the order of indices returned by numpy.nonzero / numpy.flatnonzero.
I couldn't find anything in the docs about it. It just says:
A[nonzero(flag)] == A[flag]
While in most cases this is enough, there are some cases when you need a sorted list of indices. Is it guaranteed that the returned indices are sorted in the 1-D case, or do I need to sort them explicitly? (A similar question is the order of elements returned simply by selecting with a boolean array (A[flag]), which must be the same according to the docs.)
Example: finding the "gaps" between True elements in flag:
flag = np.array([True, False, False, True], dtype=bool)
iflag = np.flatnonzero(flag)
gaps = iflag[1:] - iflag[:-1]
Thanks.
Given the specification for advanced (or "fancy") indexing with integers, the guarantee that A[nonzero(flag)] == A[flag] is also a guarantee that the values are sorted low-to-high in the 1-d case. However, in higher dimensions, the result (while "sorted") has a different structure than you might expect.
In short, given a 1-dimensional array of integers ind and a 1-dimensional array x to be indexed, we have the following for all valid i defined for ind:
result[i] = x[ind[i]]
result takes the shape of ind, and contains the values of x at the indices indicated by ind. This means that we can deduce that if x[flag] maintains the original order of x, and if x[nonzero(flag)] is the same as x[flag], then nonzero(flag) must always produce indices in sorted order.
The only catch is that for multidimensional arrays, the indices are stored as distinct arrays for each dimension being indexed. So in other words,
x[array([0, 1, 2]), array([0, 0, 0])]
is equal to
array([x[0, 0], x[1, 0], x[2, 0]])
The values are still sorted, but each dimension is broken out into its own array. (You can do interesting things with broadcasting as a result; but that's beyond the scope of this answer.)
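A quick check of both points, using the example from the question (a minimal sketch):
import numpy as np
flag = np.array([True, False, False, True])
iflag = np.flatnonzero(flag)
print(iflag)                   # [0 3] - ascending in the 1-D case
print(iflag[1:] - iflag[:-1])  # [3]  - the gaps between True elements
# In 2-D, nonzero returns one index array per dimension,
# traversed in row-major (C-style) order:
m = np.array([[True, False], [False, True]])
rows, cols = np.nonzero(m)
print(rows, cols)              # [0 1] [0 1]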
The only problem with this line of reasoning is that -- to my great surprise -- I can't find an explicit statement guaranteeing that boolean indexing preserves the original order of the array. Nonetheless, I'm quite certain from experience that it does. More generally, it would be unbelievably perverse to have x[[True, True, True]] return a reversed version of x.