Create mask for numpy array based on values' set membership - python

I want to create a 'mask' index array for an array, based on whether the elements of that array are members of some set. What I want can be achieved as follows:
x = np.arange(20)
interesting_numbers = {1, 5, 7, 17, 18}
x_mask = np.array([xi in interesting_numbers for xi in x])
I'm wondering if there's a faster way to execute that last line. As it is, it builds a list in Python by repeatedly calling a __contains__ method, then converts that list to a numpy array.
I want something like x_mask = x[x in interesting_numbers] but that's not valid syntax.

You can use np.in1d:
np.in1d(x, list(interesting_numbers))
#array([False, True, False, False, False, True, False, True, False,
# False, False, False, False, False, False, False, False, True,
# True, False], dtype=bool)
Timing shows it is faster when the array x is large:
x = np.arange(10000)
interesting_numbers = {1, 5, 7, 17, 18}
%timeit np.in1d(x, list(interesting_numbers))
# 10000 loops, best of 3: 41.1 µs per loop
%timeit x_mask = np.array([xi in interesting_numbers for xi in x])
# 1000 loops, best of 3: 1.44 ms per loop
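In newer NumPy (1.13+), np.isin is the recommended successor to np.in1d; a minimal sketch with the question's data:
import numpy as np

x = np.arange(20)
interesting_numbers = {1, 5, 7, 17, 18}

# np.isin supersedes np.in1d and preserves the shape of its first
# argument; the set still needs converting to a list or array first.
x_mask = np.isin(x, list(interesting_numbers))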

Here's one approach with np.searchsorted -
def set_membership(x, interesting_numbers):
    # Sort the set's elements so np.searchsorted can binary-search them
    b = np.sort(list(interesting_numbers))
    # For each element of x, find its insertion index into b
    idx = np.searchsorted(b, x)
    # Elements larger than b's max would index out of bounds; clamp to 0
    idx[idx == b.size] = 0
    # A match at the (clamped) index means the element is in the set
    return b[idx] == x
Runtime test -
# Setup inputs with random numbers that are not necessarily sorted
In [353]: x = np.random.choice(100000, 10000, replace=False)
In [354]: interesting_numbers = set(np.random.choice(100000, 1000, replace=False))
In [355]: x_mask = np.array([xi in interesting_numbers for xi in x])
# Verify output with set_membership
In [356]: np.allclose(x_mask, set_membership(x, interesting_numbers))
Out[356]: True
# @Psidom's solution
In [357]: %timeit np.in1d(x, list(interesting_numbers))
1000 loops, best of 3: 1.04 ms per loop
In [358]: %timeit set_membership(x, interesting_numbers)
1000 loops, best of 3: 682 µs per loop

Related

boolean indexing vs np.where

Ok, let's say I have a numpy array arr and a boolean array mask of the same shape (for example mask = arr >= 20).
I want an array containing all values of arr where mask is True. I don't really care about the order (I am going to take the sum of it afterwards).
From what I gather from the numpy docs, I can just use boolean indexing:
arr[mask]
Nevertheless, on the internet I saw a lot of code along the lines of:
arr[np.where(mask)]
which, I think, does the same thing, but using index arrays.
Do these two lines really do the same thing? And if so, is one of them faster?
As for performance: why not simply measure it? Here's a simple example:
In [11]: y = np.arange(35).reshape(5,7)
In [12]: mask = (y % 2 == 0)
In [13]: mask
Out[13]:
array([[ True, False, True, False, True, False, True],
[False, True, False, True, False, True, False],
[ True, False, True, False, True, False, True],
[False, True, False, True, False, True, False],
[ True, False, True, False, True, False, True]])
Then %timeit:
In [14]: %timeit y[mask]
534 ns ± 1.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit y[np.where(mask)]
2.18 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Unsurprisingly, even if there were no functional differences between the two lines, the extra function call overhead makes np.where slower. As to "are they identical"? Not exactly. From the np.where docstring:
where(condition, [x, y]):
Return elements chosen from x or y depending on condition.
Note: When only condition is provided, this function is a shorthand for
np.asarray(condition).nonzero(). Using nonzero directly should be
preferred, as it behaves correctly for subclasses. The rest of this
documentation covers only the case where all three arguments are
provided.
Looking back at the example:
While y[mask] directly selects all matching (True) elements of y, np.where(mask) takes the detour of calculating all (here 2D) index positions for True elements in mask:
In [26]: np.where(mask)
Out[26]:
(array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4], dtype=int64),
array([0, 2, 4, 6, 1, 3, 5, 0, 2, 4, 6, 1, 3, 5, 0, 2, 4, 6], dtype=int64))
In other words: using the boolean mask directly is not only simpler, but avoids a lot of extra computation.
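As a sanity check, a minimal sketch using the example above: both indexing forms select the same elements in the same row-major order.
import numpy as np

y = np.arange(35).reshape(5, 7)
mask = (y % 2 == 0)

# Boolean indexing and where/nonzero-based fancy indexing pick the
# same elements in the same (row-major) order.
assert np.array_equal(y[mask], y[np.where(mask)])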

How to quickly compare every element of an ndarray with every element of a sorted list/array?

I have an ndarray A (a 2D image, for example) whose values are integers going from 0 to N.
I have another list or array B containing numbers in the range 0 to N.
I want to compare the first array against every element of the second list, in order to obtain a new ndarray indicating whether each pixel's value is in the list.
A is around 10000 x 10000.
B is a list of 10000-100000 values.
N goes up to 500 000.
Here is an example of the result I wish to obtain.
I already tried for loops; they work, but are really slow since my matrices are really big. I also tried .any() and numpy's comparison functions but did not manage to obtain the desired result.
a = np.array([2, 23, 15, 0, 7, 5, 3])
b = np.array([3,7,17])
c = np.array([False, False, False, False, True, False, True])
You could use numpy.in1d:
>>> np.in1d(a, b)
array([False, False, False, False, True, False, True], dtype=bool)
There's also numpy.isin, which is recommended for new code.
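For example, with the same a and b:
>>> np.isin(a, b)
array([False, False, False, False, True, False, True])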
You can reshape the array a to have an extra dimension which will be used for comparing with b and then use np.any along that dimension:
>>> np.any(a[..., None] == b, axis=-1)
array([False, False, False, False, True, False, True])
This approach is flexible since it works with other element-wise comparison functions too. For example, for two float arrays we typically want to compare with np.isclose instead of np.equal, and we can do so by simply swapping the comparison function:
>>> np.any(np.isclose(a[..., None], b), axis=-1)
If equality is the criterion, however, then np.isin will perform better, since it doesn't need to go through an intermediate broadcast array of shape a.shape + (b.size,) that gets reduced along the last axis anyway. It saves both memory and compute: that array is never allocated and its element-wise comparisons are never performed:
In [2]: a = np.random.randint(0, 100, size=(100, 100))
In [3]: b = np.random.randint(0, 100, size=1000)
In [4]: %timeit np.any(a[..., None] == b, axis=-1)
12.1 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit np.isin(a, b)
608 µs ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
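Scaled toward the question's setting, here is a sketch of the np.isin approach on a 2D array; the sizes are assumptions, reduced from the stated 10000 x 10000 so the snippet runs quickly.
import numpy as np

# A is a 2D "image" of integer labels, B the values of interest
# (sizes here are assumptions, smaller than the question's).
A = np.random.randint(0, 500_000, size=(1000, 1000))
B = np.random.randint(0, 500_000, size=10_000)

# np.isin preserves A's 2D shape, so the result is a boolean "image"
# directly, with no reshape step.
mask = np.isin(A, B)
print(mask.shape, mask.dtype)  # (1000, 1000) bool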

Efficient numpy subarrays extraction from a mask

I am looking for a pythonic way to extract multiple subarrays from a given array, using a mask, as shown in the example:
a = np.array([10, 5, 3, 2, 1])
m = np.array([True, True, False, True, True])
The output will be a collection of arrays like the following, where each contiguous "region" of True values (True values next to each other) in the mask m yields the indices generating one subarray.
L[0] = np.array([10, 5])
L[1] = np.array([2, 1])
Here's one approach -
def separate_regions(a, m):
    # Pad the mask with False on both sides so every True island has
    # a well-defined start and stop edge
    m0 = np.concatenate(([False], m, [False]))
    # Indices where the padded mask flips mark the island boundaries
    idx = np.flatnonzero(m0[1:] != m0[:-1])
    # Consecutive (start, stop) pairs slice out each True island
    return [a[idx[i]:idx[i+1]] for i in range(0, len(idx), 2)]
Sample run -
In [41]: a = np.array([10, 5, 3, 2, 1])
...: m = np.array([True, True, False, True, True])
...:
In [42]: separate_regions(a, m)
Out[42]: [array([10, 5]), array([2, 1])]
Runtime test
Other approach(es) -
# @kazemakase's solution
def zip_split(a, m):
    # Cut wherever the mask flips between True and False
    d = np.diff(m)
    cuts = np.flatnonzero(d) + 1
    asplit = np.split(a, cuts)
    msplit = np.split(m, cuts)
    # Keep only the segments whose mask values are all True
    L = [aseg for aseg, mseg in zip(asplit, msplit) if np.all(mseg)]
    return L
Timings -
In [49]: a = np.random.randint(0,9,(100000))
In [50]: m = np.random.rand(100000)>0.2
# @kazemakase's solution
In [51]: %timeit zip_split(a,m)
10 loops, best of 3: 114 ms per loop
# @Daniel Forsman's solution
In [52]: %timeit splitByBool(a,m)
10 loops, best of 3: 25.1 ms per loop
# Proposed in this post
In [53]: %timeit separate_regions(a, m)
100 loops, best of 3: 5.01 ms per loop
Increasing the average length of islands -
In [58]: a = np.random.randint(0,9,(100000))
In [59]: m = np.random.rand(100000)>0.1
In [60]: %timeit zip_split(a,m)
10 loops, best of 3: 64.3 ms per loop
In [61]: %timeit splitByBool(a,m)
100 loops, best of 3: 14 ms per loop
In [62]: %timeit separate_regions(a, m)
100 loops, best of 3: 2.85 ms per loop
def splitByBool(a, m):
    # Cut points are where the mask flips between True and False;
    # whether the True chunks are the even or odd segments depends
    # on whether the mask starts with True
    if m[0]:
        return np.split(a, np.nonzero(np.diff(m))[0] + 1)[::2]
    else:
        return np.split(a, np.nonzero(np.diff(m))[0] + 1)[1::2]
This will return a list of arrays, one for each chunk of True values in m.
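For instance, on the question's data:
a = np.array([10, 5, 3, 2, 1])
m = np.array([True, True, False, True, True])
splitByBool(a, m)  # [array([10, 5]), array([2, 1])]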
Sounds like a natural application for np.split.
You first have to figure out where to cut the array, which is wherever the mask changes between True and False. Then discard all segments where the mask is False.
a = np.array([10, 5, 3, 2, 1])
m = np.array([True, True, False, True, True])
d = np.diff(m)
cuts = np.flatnonzero(d) + 1
asplit = np.split(a, cuts)
msplit = np.split(m, cuts)
L = [aseg for aseg, mseg in zip(asplit, msplit) if np.all(mseg)]
print(L[0]) # [10 5]
print(L[1]) # [2 1]

Python pairwise comparison of elements in a array or list

Let me elaborate my question using a simple example. I have a = [a1, a2, a3, a4], with all ai being numerical values.
What I want are the pairwise comparisons within a, such as
I(a1>=a2), I(a1>=a3), I(a1>=a4), ..., I(a4>=a1), I(a4>=a2), I(a4>=a3), where I is an indicator function. So I used the following code.
res = [x >= y for x in a for y in a]
But it also gives comparison results like I(a1>=a1), ..., I(a4>=a4), which are always one. To get rid of this nuisance, I convert res into a numpy array and find the off-diagonal elements.
res1 = numpy.array(res)
This gives the result I want, but I think there should be a more efficient or simpler way to do the pairwise comparison and extract the off-diagonal elements. Do you have any ideas? Thanks in advance.
You could use NumPy broadcasting -
# Get the mask of comparisons in a vectorized manner using broadcasting
mask = a[:,None] >= a
# Select the elements other than diagonal ones
out = mask[~np.eye(a.size,dtype=bool)]
If you would rather set the diagonal elements to False in mask, so that mask itself is the output, like so -
mask[np.eye(a.size,dtype=bool)] = 0
Sample run -
In [56]: a
Out[56]: array([3, 7, 5, 8])
In [57]: mask = a[:,None] >= a
In [58]: mask
Out[58]:
array([[ True, False, False, False],
[ True, True, True, False],
[ True, False, True, False],
[ True, True, True, True]], dtype=bool)
In [59]: mask[~np.eye(a.size,dtype=bool)] # Selecting non-diag elems
Out[59]:
array([False, False, False, True, True, False, True, False, False,
True, True, True], dtype=bool)
In [60]: mask[np.eye(a.size,dtype=bool)] = 0 # Setting diag elems as False
In [61]: mask
Out[61]:
array([[False, False, False, False],
[ True, False, True, False],
[ True, False, False, False],
[ True, True, True, False]], dtype=bool)
Runtime test
Reasons to use NumPy broadcasting? Performance! Let's see how with a large dataset -
In [34]: def pairwise_comp(A): # Using NumPy broadcasting
    ...:     a = np.asarray(A) # Convert to array if not already so
    ...:     mask = a[:,None] >= a
    ...:     out = mask[~np.eye(a.size,dtype=bool)]
    ...:     return out
    ...:
In [35]: a = np.random.randint(0,9,(1000)).tolist() # Input list
In [36]: %timeit [x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i != j]
1 loop, best of 3: 185 ms per loop # @Sixhobbits's loopy solution
In [37]: %timeit pairwise_comp(a)
100 loops, best of 3: 5.76 ms per loop
Perhaps you want:
[x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i != j]
This will not compare any item against itself, but compare each of the others against each other.
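An equivalent formulation (a sketch, not part of the original answer) uses itertools.permutations, which generates exactly the ordered pairs of distinct positions:
from itertools import permutations

a = [3, 7, 5, 8]

# permutations(a, 2) yields every ordered pair with i != j,
# matching the double loop with the explicit guard.
res = [x >= y for x, y in permutations(a, 2)]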
I'd like to apply @Divakar's solution to pandas objects. Here are two approaches for calculating pairwise absolute differences.
(IPython 6.1.0 on Python 3.6.2)
In [1]: import pandas as pd
...: import numpy as np
...: import itertools
In [2]: n = 256
...: labels = range(n)
...: ser = pd.Series(np.random.randn(n), index=labels)
...: ser.head()
Out[2]:
0 1.592248
1 -1.168560
2 -1.243902
3 -0.133140
4 -0.714133
dtype: float64
Loops
In [3]: %%time
   ...: result = dict()
   ...: for pair in itertools.combinations(labels, 2):
   ...:     a, b = pair
   ...:     a = ser[a]  # retrieve values
   ...:     b = ser[b]
   ...:     result[pair] = a - b
   ...: result = pd.Series(result).abs().reset_index()
   ...: result.columns = list('ABC')
   ...: df1 = result.pivot('A', 'B', 'C').reindex(index=labels, columns=labels)
   ...: df1 = df1.fillna(df1.T).fillna(0.)
CPU times: user 18.2 s, sys: 468 ms, total: 18.7 s
Wall time: 18.7 s
NumPy broadcast
In [4]: %%time
...: arr = ser.values
...: arr = arr[:, None] - arr
...: df2 = pd.DataFrame(arr, labels, labels).abs()
CPU times: user 816 µs, sys: 432 µs, total: 1.25 ms
Wall time: 675 µs
Verify they're equal:
In [5]: df1.equals(df2)
Out[5]: True
Using loops is roughly 27,000 times slower here (18.7 s vs. 675 µs) than the clever NumPy approach. NumPy has many optimizations, but sometimes they need a different way of thinking. :-)
You may achieve that by using:
[x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i != j]
Issue with your code:
You are iterating over the list twice. If you convert your comprehension to loops, it works like:
for x in a:
    for y in a:
        x >= y  # which is your condition
Hence, the order of execution is: (a1, a1), (a1, a2), ..., (a2, a1), (a2, a2), ..., (a4, a4).
Why are you worried about the a1>=a1 comparison? It may be predictable, but skipping it might not be worth the extra work.
Make a list of 100 numbers
In [17]: a=list(range(100))
Compare them with the simple double loop; producing a 10000 values (100*100)
In [18]: len([x>=y for x in a for y in a])
Out[18]: 10000
In [19]: timeit [x>=y for x in a for y in a]
1000 loops, best of 3: 1.04 ms per loop
Now use @Moinuddin Quadri's enumerated loop to skip the 100 diagonal ("eye") values:
In [20]: len([x>=y for i,x in enumerate(a) for j, y in enumerate(a) if i!=j])
Out[20]: 9900
In [21]: timeit [x>=y for i,x in enumerate(a) for j, y in enumerate(a) if i!=j]
100 loops, best of 3: 2.12 ms per loop
It takes 2x longer. Half the extra time comes from the enumerate calls, and half from the if test.
In this case working with numpy arrays is much faster, even when including the time to create the array.
xa = np.array(a); Z = xa[:,None] >= xa
But you can't get rid of the diagonal values. They will be True; they can be flipped to False, but why bother? In a boolean array there are only 2 values.
The fastest solution is to write an indicator function that isn't bothered by these diagonal values.
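One reading of that advice, as a hedged sketch (count_pairs_ge is a hypothetical helper, not from the answer): keep the full broadcast matrix and account for the n predictable diagonal True values in the aggregate instead of removing them.
import numpy as np

def count_pairs_ge(a):
    a = np.asarray(a)
    # The diagonal contributes exactly a.size True values (each
    # element is >= itself), so subtract them from the total count.
    return int((a[:, None] >= a).sum()) - a.size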

are elements of an array in a set?

import numpy
data = numpy.random.randint(0, 10, (6,8))
test = set(numpy.random.randint(0, 10, 5))
I want an expression whose value is a boolean array, with the same shape as data (or at least one that can be reshaped to the same shape), that tells me whether the corresponding element of data is in test.
E.g., if I want to know which elements of data are strictly less than 6, I can use a single vectorized expression,
a = data < 6
that computes a 6x8 boolean ndarray. On the contrary, when I try an apparently equivalent boolean expression
b = data in test
what I get is an exception:
TypeError: unhashable type: 'numpy.ndarray'
Addendum: benchmarking different solutions
Edit: possibility 4 below gives wrong results; thanks to hpaulj
and Divakar for getting me on the right track.
Here I compare four different possibilities:
1. What was proposed by Divakar, np.in1d(data, np.hstack(test)).
2. One proposal by hpaulj, np.in1d(data, np.array(list(test))).
3. Another proposal by hpaulj, np.in1d(data, np.fromiter(test, int)).
4. What was proposed in an answer removed by its author, whose name I don't remember, np.in1d(data, test).
Here is the IPython session, slightly edited to avoid blank lines.
In [1]: import numpy as np
In [2]: nr, nc = 100, 100
In [3]: top = 3000
In [4]: data = np.random.randint(0, top, (nr, nc))
In [5]: test = set(np.random.randint(0, top, top//3))
In [6]: %timeit np.in1d(data, np.hstack(test))
100 loops, best of 3: 5.65 ms per loop
In [7]: %timeit np.in1d(data, np.array(list(test)))
1000 loops, best of 3: 1.4 ms per loop
In [8]: %timeit np.in1d(data, np.fromiter(test, int))
1000 loops, best of 3: 1.33 ms per loop
In [9]: %timeit np.in1d(data, test)
1000 loops, best of 3: 687 µs per loop
In [10]: nr, nc = 1000, 1000
In [11]: top = 300000
In [12]: data = np.random.randint(0, top, (nr, nc))
In [13]: test = set(np.random.randint(0, top, top//3))
In [14]: %timeit np.in1d(data, np.hstack(test))
1 loop, best of 3: 706 ms per loop
In [15]: %timeit np.in1d(data, np.array(list(test)))
1 loop, best of 3: 269 ms per loop
In [16]: %timeit np.in1d(data, np.fromiter(test, int))
1 loop, best of 3: 274 ms per loop
In [17]: %timeit np.in1d(data, test)
10 loops, best of 3: 67.9 ms per loop
The best times are given by the (now) anonymous poster's answer.
It turns out the anonymous poster had a good reason to remove their answer: the results are wrong!
As commented by hpaulj, the documentation of in1d warns against using a set as the second argument, but I would prefer an explicit failure over silently wrong results.
That said, among the correct solutions, the one using numpy.fromiter() has the best numbers...
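For reference, a minimal sketch of the fromiter conversion; the count argument is optional but avoids intermediate resizing:
import numpy as np

test = {3, 4, 6, 7, 9}

# np.fromiter consumes the set directly, without building an
# intermediate Python list first.
arr = np.fromiter(test, dtype=int, count=len(test))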
I am assuming you are looking for a boolean array that detects the presence of the set elements in the data array. To do so, you can extract the elements from the set with np.hstack and then use np.in1d to detect the presence of any element from the set at each position in data, giving us a boolean array of the same size as data. Since np.in1d flattens its input before processing, as a final step we need to reshape its output back to the original 2D shape. Thus, the final implementation would be -
np.in1d(data,np.hstack(test)).reshape(data.shape)
Sample run -
In [125]: data
Out[125]:
array([[7, 0, 1, 8, 9, 5, 9, 1],
[9, 7, 1, 4, 4, 2, 4, 4],
[0, 4, 9, 6, 6, 3, 5, 9],
[2, 2, 7, 7, 6, 7, 7, 2],
[3, 4, 8, 4, 2, 1, 9, 8],
[9, 0, 8, 1, 6, 1, 3, 5]])
In [126]: test
Out[126]: {3, 4, 6, 7, 9}
In [127]: np.in1d(data,np.hstack(test)).reshape(data.shape)
Out[127]:
array([[ True, False, False, False, True, False, True, False],
[ True, True, False, True, True, False, True, True],
[False, True, True, True, True, True, False, True],
[False, False, True, True, True, True, True, False],
[ True, True, False, True, False, False, True, False],
[ True, False, False, False, True, False, True, False]], dtype=bool)
The expression a = data < 6 returns a new array because < is a value comparison operator.
Arithmetic, matrix multiplication, and comparison operations
Arithmetic and comparison operations on ndarrays are defined as
element-wise operations, and generally yield ndarray objects as
results.
Each of the arithmetic operations (+, -, *, /, //, %, divmod(), ** or
pow(), <<, >>, &, ^, |, ~) and the comparisons (==, <, >, <=, >=, !=)
is equivalent to the corresponding universal function (or ufunc for
short) in Numpy.
Note that the in operator is not in this list, probably because it works in the opposite direction to most operators.
While a + b is the same as a.__add__(b), a in b works right to left: b.__contains__(a). In this case Python tries to call set.__contains__(), which only accepts hashable/immutable types. Arrays are mutable, so they can't be members of a set.
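A minimal sketch reproducing the failure mode described above:
import numpy as np

data = np.arange(4)
s = {1, 2}

# "data in s" calls s.__contains__(data); a set hashes its argument,
# and ndarrays are unhashable, hence the TypeError.
try:
    data in s
except TypeError as e:
    print(e)  # unhashable type: 'numpy.ndarray'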
A solution to this is to use numpy.vectorize instead of in directly; it calls a Python function on each element of the array.
It's a kind of map() for numpy arrays.
numpy.vectorize
Define a vectorized function which takes a nested sequence of objects
or numpy arrays as inputs and returns a numpy array as output. The
vectorized function evaluates pyfunc over successive tuples of the
input arrays like the python map function, except it uses the
broadcasting rules of numpy.
>>> import numpy
>>> data = numpy.random.randint(0, 10, (3, 3))
>>> test = set(numpy.random.randint(0, 10, 5))
>>> numpy.vectorize(test.__contains__)(data)
array([[False, False, True],
[ True, True, False],
[ True, False, True]], dtype=bool)
Benchmarks
This approach is fast when n is large, since set.__contains__() is a constant-time operation. ("Large" means that top > 13000 or so.)
>>> import numpy as np
>>> nr, nc = 100, 100
>>> top = 300000
>>> data = np.random.randint(0, top, (nr, nc))
>>> test = set(np.random.randint(0, top, top//3))
>>> %timeit -n10 np.in1d(data, list(test)).reshape(data.shape)
10 loops, best of 3: 26.2 ms per loop
>>> %timeit -n10 np.in1d(data, np.hstack(test)).reshape(data.shape)
10 loops, best of 3: 374 ms per loop
>>> %timeit -n10 np.vectorize(test.__contains__)(data)
10 loops, best of 3: 3.16 ms per loop
However, when n is small, the other solutions are significantly faster.
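A takeaway sketch combining the observations above: convert the set explicitly (passing it directly to in1d/isin silently gives wrong results) and use np.isin, which preserves the 2D shape:
import numpy as np

data = np.random.randint(0, 10, (6, 8))
test = set(np.random.randint(0, 10, 5))

# Explicit conversion avoids the documented set pitfall, and np.isin
# keeps data's shape, so no reshape is needed.
mask = np.isin(data, list(test))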
