boolean indexing vs np.where - python

Let's say I have a numpy array arr and a boolean array mask of the same shape (for example mask = arr >= 20).
I want an array containing all values of arr where mask is True. I don't really care about the order (I am going to take the sum of it afterwards).
From what I gather from the numpy docs, I can just use boolean indexing:
arr[mask]
Nevertheless, on the internet, I see a lot of code along the lines of:
arr[np.where(mask)]
which, I think, does the same thing, but using index arrays.
Do these two lines really do the same thing? And if so, is one of them faster?

As for performance: why not simply measure it? Take a simple example:
In [11]: y = np.arange(35).reshape(5,7)
In [12]: mask = (y % 2 == 0)
In [13]: mask
Out[13]:
array([[ True, False,  True, False,  True, False,  True],
       [False,  True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False,  True],
       [False,  True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False,  True]])
Then %timeit:
In [14]: %timeit y[mask]
534 ns ± 1.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit y[np.where(mask)]
2.18 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Unsurprisingly - even if there were no functional differences between the two lines - the function call overhead makes np.where slower. As to "are they identical?": not exactly. From the np.where docstring:
where(condition, [x, y]):
Return elements chosen from x or y depending on condition.
Note: When only condition is provided, this function is a shorthand for
np.asarray(condition).nonzero(). Using nonzero directly should be
preferred, as it behaves correctly for subclasses. The rest of this
documentation covers only the case where all three arguments are
provided.
Looking back at the example:
While y[mask] directly selects all matching (True) elements of y, np.where(mask) takes the detour of calculating all (here 2D) index positions for True elements in mask:
In [26]: np.where(mask)
Out[26]:
(array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4], dtype=int64),
array([0, 2, 4, 6, 1, 3, 5, 0, 2, 4, 6, 1, 3, 5, 0, 2, 4, 6], dtype=int64))
In other words: using the boolean mask directly is not only simpler, but avoids a lot of extra computation.
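Since the end goal in the question is a sum, here is a minimal sketch (reusing the question's names arr and mask, with assumed values) confirming that both forms select the same elements:
import numpy as np

arr = np.arange(35).reshape(5, 7)
mask = arr >= 20          # boolean mask, as in the question

# Both select the same elements; boolean indexing just skips the index detour.
assert arr[mask].sum() == arr[np.where(mask)].sum()  # both give 405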

Related

How to quickly compare every element of an ndarray with every element of a sorted list/array?

I have an ndarray A (a 2D image, for example) whose values are integers from 0 to N.
I have another list or array B containing numbers in the range 0 to N.
I want to compare the first array with every element of the second list, to obtain a new ndarray indicating whether each pixel's value is in the list.
A is around 10000 x 10000.
B is a list with 10000-100000 values.
N goes up to 500,000.
I already tried for loops; it works, but it's really slow since I have really big matrices. I also tried .any() and numpy's comparison functions but did not manage to obtain the desired result.
Here is an example of the result I wish to obtain:
a = np.array([2, 23, 15, 0, 7, 5, 3])
b = np.array([3,7,17])
c = np.array([False, False, False, False, True, False, True])
You could use numpy.in1d:
>>> np.in1d(a, b)
array([False, False, False, False, True, False, True], dtype=bool)
There's also numpy.isin, which is recommended for new code.
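For the question's 2D case, np.isin is convenient because it keeps the shape of its first argument, so no reshaping is needed. A minimal sketch with made-up (much smaller) sizes:
import numpy as np

A = np.random.randint(0, 500, size=(100, 100))   # stand-in for the 10000 x 10000 image
B = np.random.randint(0, 500, size=50)           # stand-in for the list of values

mask = np.isin(A, B)   # boolean array with the same shape as A
print(mask.shape)      # (100, 100)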
You can reshape the array a to have an extra dimension which will be used for comparing with b and then use np.any along that dimension:
>>> np.any(a[..., None] == b, axis=-1)
array([False, False, False, False, True, False, True])
This approach is flexible since it works with other element-wise comparison functions too. For example, for two float arrays we typically want np.isclose rather than np.equal, and we can switch by simply exchanging the comparison function:
>>> np.any(np.isclose(a[..., None], b), axis=-1)
If equality is the criterion, however, then np.isin will perform better, since it doesn't need to go through an intermediate broadcast array of shape a.shape + (b.size,) that gets reduced along the last axis anyway. That saves both memory and compute, since it neither allocates that array nor performs all the element-wise comparisons:
In [2]: a = np.random.randint(0, 100, size=(100, 100))
In [3]: b = np.random.randint(0, 100, size=1000)
In [4]: %timeit np.any(a[..., None] == b, axis=-1)
12.1 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit np.isin(a, b)
608 µs ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
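To make the memory point concrete, a small sketch (using the question's toy a and b) showing the intermediate array that the broadcasting approach materializes and np.isin avoids:
import numpy as np

a = np.array([2, 23, 15, 0, 7, 5, 3])
b = np.array([3, 7, 17])

# The broadcasting approach builds this intermediate array first ...
intermediate = a[..., None] == b
print(intermediate.shape)              # (7, 3), i.e. a.shape + (b.size,)

# ... and then reduces it along the last axis.
print(np.any(intermediate, axis=-1))   # [False False False False  True False  True]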

Create mask for numpy array based on values' set membership

I want to create a 'mask' index array for an array, based on whether the elements of that array are members of some set. What I want can be achieved as follows:
x = np.arange(20)
interesting_numbers = {1, 5, 7, 17, 18}
x_mask = np.array([xi in interesting_numbers for xi in x])
I'm wondering if there's a faster way to execute that last line. As it is, it builds a list in Python by repeatedly calling a __contains__ method, then converts that list to a numpy array.
I want something like x_mask = x[x in interesting_numbers] but that's not valid syntax.
You can use np.in1d:
np.in1d(x, list(interesting_numbers))
#array([False, True, False, False, False, True, False, True, False,
# False, False, False, False, False, False, False, False, True,
# True, False], dtype=bool)
Timing shows it is faster if the array x is large:
x = np.arange(10000)
interesting_numbers = {1, 5, 7, 17, 18}
%timeit np.in1d(x, list(interesting_numbers))
# 10000 loops, best of 3: 41.1 µs per loop
%timeit x_mask = np.array([xi in interesting_numbers for xi in x])
# 1000 loops, best of 3: 1.44 ms per loop
Here's one approach with np.searchsorted -
def set_membership(x, interesting_numbers):
    b = np.sort(list(interesting_numbers))
    idx = np.searchsorted(b, x)
    idx[idx==b.size] = 0
    return b[idx] == x
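A quick usage sketch with the question's inputs (not part of the original answer); the idx[idx==b.size] = 0 line guards against values of x larger than every element of b, which searchsorted would otherwise index past the end of b:
import numpy as np

x = np.arange(20)
interesting_numbers = {1, 5, 7, 17, 18}

print(set_membership(x, interesting_numbers))
# -> True exactly at positions 1, 5, 7, 17 and 18, matching x_mask above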
Runtime test -
# Setup inputs with random numbers that are not necessarily sorted
In [353]: x = np.random.choice(100000, 10000, replace=0)
In [354]: interesting_numbers = set(np.random.choice(100000, 1000, replace=0))
In [355]: x_mask = np.array([xi in interesting_numbers for xi in x])
# Verify output with set_membership
In [356]: np.allclose(x_mask, set_membership(x, interesting_numbers))
Out[356]: True
# @Psidom's solution
In [357]: %timeit np.in1d(x, list(interesting_numbers))
1000 loops, best of 3: 1.04 ms per loop
In [358]: %timeit set_membership(x, interesting_numbers)
1000 loops, best of 3: 682 µs per loop

are elements of an array in a set?

import numpy
data = numpy.random.randint(0, 10, (6,8))
test = set(numpy.random.randint(0, 10, 5))
I want an expression whose value is a Boolean array, with the same shape as data (or, at least, one that can be reshaped to the same shape), that tells me if the corresponding term in data is in the set test.
E.g., if I want to know which elements of data are strictly less than 6, I can use a single vectorized expression,
a = data < 6
that computes a 6x8 boolean ndarray. On the contrary, when I try an apparently equivalent boolean expression
b = data in test
what I get is an exception:
TypeError: unhashable type: 'numpy.ndarray'
Addendum: benchmarking different solutions
Edit: the possibility #4 below gives wrong results, thanks to hpaulj
and Divakar for getting me on the right track.
Here I compare four different possibilities:
1. What was proposed by Divakar, np.in1d(data, np.hstack(test)).
2. One proposal by hpaulj, np.in1d(data, np.array(list(test))).
3. Another proposal by hpaulj, np.in1d(data, np.fromiter(test, int)).
4. What was proposed in an answer removed by its author, whose name I don't remember, np.in1d(data, test).
Here is the IPython session, slightly edited to avoid blank lines:
In [1]: import numpy as np
In [2]: nr, nc = 100, 100
In [3]: top = 3000
In [4]: data = np.random.randint(0, top, (nr, nc))
In [5]: test = set(np.random.randint(0, top, top//3))
In [6]: %timeit np.in1d(data, np.hstack(test))
100 loops, best of 3: 5.65 ms per loop
In [7]: %timeit np.in1d(data, np.array(list(test)))
1000 loops, best of 3: 1.4 ms per loop
In [8]: %timeit np.in1d(data, np.fromiter(test, int))
1000 loops, best of 3: 1.33 ms per loop
In [9]: %timeit np.in1d(data, test)
1000 loops, best of 3: 687 µs per loop
In [10]: nr, nc = 1000, 1000
In [11]: top = 300000
In [12]: data = np.random.randint(0, top, (nr, nc))
In [13]: test = set(np.random.randint(0, top, top//3))
In [14]: %timeit np.in1d(data, np.hstack(test))
1 loop, best of 3: 706 ms per loop
In [15]: %timeit np.in1d(data, np.array(list(test)))
1 loop, best of 3: 269 ms per loop
In [16]: %timeit np.in1d(data, np.fromiter(test, int))
1 loop, best of 3: 274 ms per loop
In [17]: %timeit np.in1d(data, test)
10 loops, best of 3: 67.9 ms per loop
The best times are given by the (now) anonymous poster's answer.
It turns out that the anonymous poster had a good reason to remove their answer, the results being wrong!
As commented by hpaulj, the documentation of in1d warns against using a set as the second argument, but I would prefer an explicit failure if the computed results could be wrong.
That said, the solution using numpy.fromiter() has the best numbers...
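For completeness, a small sketch of that conversion (not from the original posts); passing count to np.fromiter lets NumPy preallocate the output array:
import numpy as np

data = np.random.randint(0, 3000, (100, 100))
test = set(np.random.randint(0, 3000, 1000))

test_arr = np.fromiter(test, dtype=int, count=len(test))   # set -> 1D int array
result = np.in1d(data, test_arr).reshape(data.shape)       # boolean, same shape as data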
I am assuming you are looking for a boolean array that detects the presence of the set elements in the data array. To do so, you can extract the elements from the set with np.hstack and then use np.in1d to detect the presence of any element from the set at each position in data, giving us a boolean array of the same size as data. Since np.in1d flattens its input before processing, as a final step we need to reshape the output from np.in1d back to the original 2D shape. Thus, the final implementation would be -
np.in1d(data,np.hstack(test)).reshape(data.shape)
Sample run -
In [125]: data
Out[125]:
array([[7, 0, 1, 8, 9, 5, 9, 1],
       [9, 7, 1, 4, 4, 2, 4, 4],
       [0, 4, 9, 6, 6, 3, 5, 9],
       [2, 2, 7, 7, 6, 7, 7, 2],
       [3, 4, 8, 4, 2, 1, 9, 8],
       [9, 0, 8, 1, 6, 1, 3, 5]])
In [126]: test
Out[126]: {3, 4, 6, 7, 9}
In [127]: np.in1d(data,np.hstack(test)).reshape(data.shape)
Out[127]:
array([[ True, False, False, False,  True, False,  True, False],
       [ True,  True, False,  True,  True, False,  True,  True],
       [False,  True,  True,  True,  True,  True, False,  True],
       [False, False,  True,  True,  True,  True,  True, False],
       [ True,  True, False,  True, False, False,  True, False],
       [ True, False, False, False,  True, False,  True, False]], dtype=bool)
The expression a = data < 6 returns a new array because < is a value comparison operator.
Arithmetic, matrix multiplication, and comparison operations
Arithmetic and comparison operations on ndarrays are defined as
element-wise operations, and generally yield ndarray objects as
results.
Each of the arithmetic operations (+, -, *, /, //, %, divmod(), ** or
pow(), <<, >>, &, ^, |, ~) and the comparisons (==, <, >, <=, >=, !=)
is equivalent to the corresponding universal function (or ufunc for
short) in Numpy.
Note that the in operator is not in this list, probably because it works in the opposite direction to most operators.
While a + b is the same as a.__add__(b), a in b works right to left: it calls b.__contains__(a). In this case Python tries to call set.__contains__(), which only accepts hashable/immutable types. Arrays are mutable, so they can't be members of a set.
A solution is to use numpy.vectorize instead of in directly; it can apply any Python function to each element of the array.
It's a kind of map() for numpy arrays.
numpy.vectorize
Define a vectorized function which takes a nested sequence of objects
or numpy arrays as inputs and returns a numpy array as output. The
vectorized function evaluates pyfunc over successive tuples of the
input arrays like the python map function, except it uses the
broadcasting rules of numpy.
>>> import numpy
>>> data = numpy.random.randint(0, 10, (3, 3))
>>> test = set(numpy.random.randint(0, 10, 5))
>>> numpy.vectorize(test.__contains__)(data)
array([[False, False,  True],
       [ True,  True, False],
       [ True, False,  True]], dtype=bool)
Benchmarks
This approach is fast when n is large, since set.__contains__() is a constant-time operation. ("Large" means that top > 13000 or so.)
>>> import numpy as np
>>> nr, nc = 100, 100
>>> top = 300000
>>> data = np.random.randint(0, top, (nr, nc))
>>> test = set(np.random.randint(0, top, top//3))
>>> %timeit -n10 np.in1d(data, list(test)).reshape(data.shape)
10 loops, best of 3: 26.2 ms per loop
>>> %timeit -n10 np.in1d(data, np.hstack(test)).reshape(data.shape)
10 loops, best of 3: 374 ms per loop
>>> %timeit -n10 np.vectorize(test.__contains__)(data)
10 loops, best of 3: 3.16 ms per loop
However, when n is small, the other solutions are significantly faster.
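For reference, newer NumPy also offers np.isin, which keeps the shape of data, so no reshape is needed. A minimal sketch (not from the original answers):
import numpy as np

data = np.random.randint(0, 10, (6, 8))
test = set(np.random.randint(0, 10, 5))

# Convert the set to a list first; passing a set directly is the pitfall
# that the in1d documentation warns about.
result = np.isin(data, list(test))   # boolean array, same shape as data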

Remove empty 'rows' and 'columns' from 3D numpy pixel array

I essentially want to crop an image with numpy. I have a 3-dimensional numpy.ndarray object, i.e.:
[ [ [0,0,0,0], [255,255,255,255], ... ],
  [ [0,0,0,0], [255,255,255,255], ... ] ]
where I want to remove whitespace, which, in context, is known to be either entire rows or entire columns of [0,0,0,0].
Letting each pixel just be a number for this example, I'm trying to essentially do this:
Given this (EDIT: I chose a slightly more complex example to clarify):
[ [0,0,0,0,0,0]
[0,0,1,1,1,0]
[0,1,1,0,1,0]
[0,0,0,1,1,0]
[0,0,0,0,0,0]]
I'm trying to create this:
[ [0,1,1,1],
[1,1,0,1],
[0,0,1,1] ]
I can brute force this with loops, but intuitively I feel like numpy has a better means of doing this.
In general, you'd want to look into scipy.ndimage.label and scipy.ndimage.find_objects to extract the bounding box of contiguous regions fulfilling a condition.
However, in this case, you can do it fairly easily with "plain" numpy.
I'm going to assume you have a nrows x ncols x nbands array here. The other convention of nbands x nrows x ncols is also quite common, so have a look at the shape of your array.
With that in mind, you might do something similar to:
mask = im == 0
all_white = mask.all(axis=2)
rows = np.flatnonzero((~all_white).sum(axis=1))
cols = np.flatnonzero((~all_white).sum(axis=0))
crop = im[rows.min():rows.max()+1, cols.min():cols.max()+1, :]
For your 2D example, it would look like:
import numpy as np
im = np.array([[0,0,0,0,0,0],
[0,0,1,1,1,0],
[0,1,1,0,1,0],
[0,0,0,1,1,0],
[0,0,0,0,0,0]])
mask = im == 0
rows = np.flatnonzero((~mask).sum(axis=1))
cols = np.flatnonzero((~mask).sum(axis=0))
crop = im[rows.min():rows.max()+1, cols.min():cols.max()+1]
print(crop)
Let's break down the 2D example a bit.
In [1]: import numpy as np
In [2]: im = np.array([[0,0,0,0,0,0],
...: [0,0,1,1,1,0],
...: [0,1,1,0,1,0],
...: [0,0,0,1,1,0],
...: [0,0,0,0,0,0]])
Okay, now let's create a boolean array that meets our condition:
In [3]: mask = im == 0
In [4]: mask
Out[4]:
array([[ True,  True,  True,  True,  True,  True],
       [ True,  True, False, False, False,  True],
       [ True, False, False,  True, False,  True],
       [ True,  True,  True, False, False,  True],
       [ True,  True,  True,  True,  True,  True]], dtype=bool)
Also, note that the ~ operator functions as logical_not on boolean arrays:
In [5]: ~mask
Out[5]:
array([[False, False, False, False, False, False],
       [False, False,  True,  True,  True, False],
       [False,  True,  True, False,  True, False],
       [False, False, False,  True,  True, False],
       [False, False, False, False, False, False]], dtype=bool)
With that in mind, to find rows where all elements are false, we can sum across columns:
In [6]: (~mask).sum(axis=1)
Out[6]: array([0, 3, 3, 2, 0])
If no elements are True, we'll get a 0.
And similarly to find columns where all elements are false, we can sum across rows:
In [7]: (~mask).sum(axis=0)
Out[7]: array([0, 1, 2, 2, 3, 0])
Now all we need to do is find the first and last of these that are not zero. np.flatnonzero is a bit easier than nonzero, in this case:
In [8]: np.flatnonzero((~mask).sum(axis=1))
Out[8]: array([1, 2, 3])
In [9]: np.flatnonzero((~mask).sum(axis=0))
Out[9]: array([1, 2, 3, 4])
Then, you can easily slice out the region based on min/max nonzero elements:
In [10]: rows = np.flatnonzero((~mask).sum(axis=1))
In [11]: cols = np.flatnonzero((~mask).sum(axis=0))
In [12]: im[rows.min():rows.max()+1, cols.min():cols.max()+1]
Out[12]:
array([[0, 1, 1, 1],
       [1, 1, 0, 1],
       [0, 0, 1, 1]])
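A slightly more direct variant of the same idea (a sketch, not from the original answer) uses any() instead of counting with sum(), since we only care whether a row or column contains at least one non-background element:
# Reusing im and mask from the session above.
rows = np.flatnonzero((~mask).any(axis=1))   # rows with at least one nonzero pixel
cols = np.flatnonzero((~mask).any(axis=0))   # columns with at least one nonzero pixel
crop = im[rows.min():rows.max()+1, cols.min():cols.max()+1]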
One way of implementing this for arbitrary dimensions would be:
import numpy as np
def trim(arr, mask):
    bounding_box = tuple(
        slice(np.min(indexes), np.max(indexes) + 1)
        for indexes in np.where(mask))
    return arr[bounding_box]
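A quick usage sketch on the question's 2D example (the mask argument should be True where there is content to keep):
im = np.array([[0,0,0,0,0,0],
               [0,0,1,1,1,0],
               [0,1,1,0,1,0],
               [0,0,0,1,1,0],
               [0,0,0,0,0,0]])

print(trim(im, im != 0))
# [[0 1 1 1]
#  [1 1 0 1]
#  [0 0 1 1]]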
A slightly more flexible solution (where you could indicate which axis to act on) is available in FlyingCircus (Disclaimer: I am the main author of the package).
You could use the np.nonzero function to find your non-zero values, then slice the nonzero elements from your original array and reshape to what you want:
import numpy as np
n = np.array([ [0,0,0,0,0,0],
[0,0,1,1,1,0],
[0,0,1,1,1,0],
[0,0,1,1,1,0],
[0,0,0,0,0,0]])
elems = n[n.nonzero()]
In [415]: elems
Out[415]: array([1, 1, 1, 1, 1, 1, 1, 1, 1])
In [416]: elems.reshape(3,3)
Out[416]:
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])

remove items with low frequency

Let's consider the array of length n:
y=np.array([1,1,1,1,2,2,2,3,3,3,3,3,2,2,2,2,1,4,1,1,1])
and the matrix X of size n x m.
I want to remove items of y and rows of X, for which the corresponding value of y has low frequency.
I figured out this would give me the values of y which should be removed:
>>> items, count = np.unique(y, return_counts=True)
>>> to_remove = items[count < 3] # array([4])
and this would remove the items:
>>> X=X[y != to_remove,:]
>>> y=y[y != to_remove]
array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1])
While the code above works when there is only one label to remove, it fails when there are multiple values of y with low frequency (e.g. y=np.array([1,1,1,1,2,2,2,3,3,3,3,3,2,2,2,2,1,4,1,1,1,5,5,1,1]) would cause to_remove to be array([4, 5])):
>>> y[y != to_remove,:]
Traceback (most recent call last):
File "<input>", line 1, in <module>
IndexError: too many indices for array
How to fix this in a concise way?
You can use an additional output parameter return_inverse in np.unique like so -
def unique_where(y):
    _, idx, count = np.unique(y, return_inverse=True, return_counts=True)
    return y[np.in1d(idx, np.where(count>=3)[0])]

def unique_arange(y):
    _, idx, count = np.unique(y, return_inverse=True, return_counts=True)
    return y[np.in1d(idx, np.arange(count.size)[count>=3])]
You can also use np.bincount for the counting, which is quite efficient and might suit this better, assuming y contains non-negative integers, like so -
def bincount_where(y):
    counts = np.bincount(y)
    return y[np.in1d(y, np.where(counts>=3)[0])]

def bincount_arange(y):
    counts = np.bincount(y)
    return y[np.in1d(y, np.arange(counts.size)[counts>=3])]
Runtime tests -
This section times the four approaches listed above along with the approach in @Ashwini Chaudhary's solution -
In [85]: y = np.random.randint(0,100000,50000)
In [90]: def unique_items_indexed(y): # @Ashwini Chaudhary's solution
    ...:     items, count = np.unique(y, return_counts=True)
    ...:     return y[np.in1d(y, items[count >= 3])]
    ...:
In [115]: %timeit unique_items_indexed(y)
10 loops, best of 3: 19.8 ms per loop
In [116]: %timeit unique_where(y)
10 loops, best of 3: 26.9 ms per loop
In [117]: %timeit unique_arange(y)
10 loops, best of 3: 26.5 ms per loop
In [118]: %timeit bincount_where(y)
100 loops, best of 3: 16.7 ms per loop
In [119]: %timeit bincount_arange(y)
100 loops, best of 3: 16.5 ms per loop
You're looking for numpy.in1d:
>>> y = np.array([1,1,1,1,2,2,2,3,3,3,3,3,2,2,2,2,1,4,1,1,1,5,5,1,1])
>>> items, count = np.unique(y, return_counts=True)
>>> to_remove = items[count < 3]
>>> y[~np.in1d(y, to_remove)]
array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1])
If you have more than one value in to_remove, the comparison y != to_remove is ill-defined:
>>> to_remove
array([4, 5])
>>> y != to_remove
True
Use np.in1d instead:
>>> ~np.in1d(y, to_remove)
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True, False, False,  True,  True], dtype=bool)
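To tie this back to the question's X, the same boolean mask can index both arrays; a minimal sketch (the 3-column X is just an illustrative stand-in):
import numpy as np

y = np.array([1,1,1,1,2,2,2,3,3,3,3,3,2,2,2,2,1,4,1,1,1,5,5,1,1])
X = np.random.rand(y.size, 3)          # stand-in for the n x m matrix

items, count = np.unique(y, return_counts=True)
keep = ~np.in1d(y, items[count < 3])   # True for rows to keep

X, y = X[keep], y[keep]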
