Similar questions have already been asked on SO, but they have more specific constraints and their answers don't apply to my question.
Generally speaking, what is the most pythonic way to determine if an arbitrary numpy array is a subset of another array? More specifically, I have a roughly 20000x3 array and I need to know the indices of the 1x3 elements that are entirely contained within a set. More generally, is there a more pythonic way of writing the following:
master = [12, 155, 179, 234, 670, 981, 1054, 1209, 1526, 1667, 1853] # some indices of interest
triangles = np.random.randint(2000, size=(20000, 3)) # some data
for i, x in enumerate(triangles):
    if x[0] in master and x[1] in master and x[2] in master:
        print(i)
For my use case, I can safely assume that len(master) << 20000. (Consequently, it is also safe to assume that master is sorted because this is cheap).
You can do this easily by iterating over the array in a list comprehension. A toy example is as follows:
import numpy as np
x = np.arange(30).reshape(10,3)
searchKey = [4,5,8]
x[[0,3,7],:] = searchKey
x
gives
array([[ 4, 5, 8],
[ 3, 4, 5],
[ 6, 7, 8],
[ 4, 5, 8],
[12, 13, 14],
[15, 16, 17],
[18, 19, 20],
[ 4, 5, 8],
[24, 25, 26],
[27, 28, 29]])
Now iterate over the elements:
ismember = [row==searchKey for row in x.tolist()]
The result is
[True, False, False, True, False, False, False, True, False, False]
You can modify it to test for the subset relation, as in your question:
searchKey = [2,4,10,5,8,9] # Add more elements for testing
setSearchKey = set(searchKey)
ismember = [setSearchKey.issuperset(row) for row in x.tolist()]
If you need the indices, then use
np.where(ismember)[0]
It gives
array([0, 3, 7])
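Applied to the original master/triangles setup from the question, the same idea might look like this (a sketch):
import numpy as np

master = [12, 155, 179, 234, 670, 981, 1054, 1209, 1526, 1667, 1853]
triangles = np.random.randint(2000, size=(20000, 3))

master_set = set(master)  # constant-time membership tests
ismember = [master_set.issuperset(row) for row in triangles.tolist()]
print(np.where(ismember)[0])  # indices of rows fully contained in master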
One can also use np.isin, which might be more efficient than the list comprehension in @petrichor's answer. Using the same setup:
import numpy as np
x = np.arange(30).reshape(10, 3)
searchKey = [4, 5, 8]
x[[0, 3, 7], :] = searchKey
x
array([[ 4, 5, 8],
[ 3, 4, 5],
[ 6, 7, 8],
[ 4, 5, 8],
[12, 13, 14],
[15, 16, 17],
[18, 19, 20],
[ 4, 5, 8],
[24, 25, 26],
[27, 28, 29]])
Now one can use np.isin; by default, it works element-wise:
np.isin(x, searchKey)
array([[ True, True, True],
[False, True, True],
[False, False, True],
[ True, True, True],
[False, False, False],
[False, False, False],
[False, False, False],
[ True, True, True],
[False, False, False],
[False, False, False]])
We now have to filter the rows where all entries evaluate to True, for which we can use all:
np.isin(x, searchKey).all(1)
array([ True, False, False, True, False, False, False, True, False,
False])
If one now wants the corresponding indices, one can use np.where:
np.where(np.isin(x, searchKey).all(1))
(array([0, 3, 7]),)
EDIT:
Just realized that one has to be careful here. For example, if I do
x[4, :] = [8, 4, 5]
that is, assign the same values as in searchKey but in a different order, that row will still be returned when doing
np.where(np.isin(x, searchKey).all(1))
which prints
(array([0, 3, 4, 7]),)
That may be undesirable.
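If the order of the values matters, one way around this (a sketch) is to compare rows directly instead of going through np.isin:
np.where((x == searchKey).all(axis=1))
which, with the reordered row at index 4 still in place, prints
(array([0, 3, 7]),)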
Here are two approaches you could try:
1. Use sets. Sets are implemented much like Python dictionaries and have constant-time lookups. That would look much like the code you already have; just create a set from master:
master = [12,155,179,234,670,981,1054,1209,1526,1667,1853]
master_set = set(master)
triangles = np.random.randint(2000,size=(20000,3)) #some data
for i, x in enumerate(triangles):
    if master_set.issuperset(x):
        print(i)
2. Use searchsorted. This is nice because it doesn't require you to use hashable types and uses numpy builtins. searchsorted is O(log N) in the size of master and O(N) in the size of triangles, so it should also be pretty fast, maybe faster, depending on the size of your arrays and such.
master = [12,155,179,234,670,981,1054,1209,1526,1667,1853]
master = np.asarray(master)
triangles = np.random.randint(2000,size=(20000,3)) #some data
idx = master.searchsorted(triangles)
idx.clip(max=len(master) - 1, out=idx)  # searchsorted returns len(master) for values beyond the last element; clamp to stay in bounds
print(np.where(np.all(triangles == master[idx], axis=1)))
This second case assumes master is sorted, as searchsorted implies.
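To see why the equality check works, here is searchsorted on a tiny input (a sketch): for each query value it returns the insertion index, and only values actually present land on an index where master[idx] equals the query.
import numpy as np

master = np.array([12, 155, 179, 234])
queries = np.array([155, 200, 5000])
idx = master.searchsorted(queries)   # [1, 3, 4]
idx = idx.clip(max=len(master) - 1)  # 4 is out of bounds; clamp to 3
print(master[idx] == queries)        # [ True False False]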
A more natural (and possibly faster) solution for set operations in numpy is to use the functions in numpy.lib.arraysetops. These generally allow you to avoid having to convert back and forth between Python's set type. To check if one array is a subset of another, use numpy.setdiff1d() and test if the returned array has 0 length:
import numpy as np
a = np.arange(10)
b = np.array([1, 5, 9])
c = np.array([-5, 5, 9])
# is `a` a subset of `b`?
len(np.setdiff1d(a, b)) == 0 # gives False
# is `b` a subset of `a`?
len(np.setdiff1d(b, a)) == 0 # gives True
# is `c` a subset of `a`?
len(np.setdiff1d(c, a)) == 0 # gives False
You can also optionally set assume_unique=True for a potential speed boost.
I'm actually a bit surprised that numpy doesn't have something like a built-in issubset() function to do the above (analogous to set.issubset()).
Another option is to use numpy.in1d() (see https://stackoverflow.com/a/37262010/2020363)
Edit: I just realized that at some point in the distant past this bothered me enough that I wrote my own simple function:
def issubset(a, b):
    """Return whether sequence `a` is a subset of sequence `b`"""
    return len(np.setdiff1d(a, b)) == 0
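For example, with the arrays defined above:
issubset(b, a) # gives True
issubset(c, a) # gives False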
Starting with:
master = [12, 155, 179, 234, 670, 981, 1054, 1209, 1526, 1667, 1853]  # some indices of interest
triangles = np.random.randint(2000, size=(20000, 3))  # some data
What's the most pythonic way to find the indices of triplets fully contained in master? Try np.in1d with a list comprehension:
inds = [j for j in range(len(triangles)) if all(np.in1d(triangles[j], master))]
%timeit says ~0.5 s = half a second
A MUCH faster way (a factor of 1000!) that avoids Python's slow looping: try np.isin with np.sum to get a boolean mask over the rows:
inds = np.where(
    np.sum(np.isin(triangles, master), axis=-1) == triangles.shape[-1])
%timeit says ~0.0005 s = half a millisecond!
Advice: avoid looping over lists whenever possible, because for the same price as a single iteration of a Python loop containing one arithmetic operation, you can call a numpy function that performs thousands of that same operation.
Conclusion
It seems that np.isin(triangles, master) is the function you were looking for: it gives a boolean mask of the same shape as its first argument, telling whether each element is also an element of the second. From there, requiring that the sum of a mask row equal 3 (the full length of a row in triangles) gives a 1-D mask for the desired rows of triangles; np.where (or np.arange) turns it into indices.
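For completeness, a sketch of the mask-plus-np.arange formulation mentioned above (equivalent to the np.where version):
mask = np.sum(np.isin(triangles, master), axis=-1) == triangles.shape[-1]
inds = np.arange(len(triangles))[mask]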
Related
I have an array in Python that looks like this:
array = [[UUID('0d9ba9c6-632b-4dd4-912c-e8ff0a7134f7'), array([['1', '1']], dtype='<U21')], [UUID('9cb1feb6-0ef4-4e15-9070-7735545d12c9'), array([['2', '1']], dtype='<U21')], [UUID('955d308b-3570-4166-895e-81a077e6b9f9'), array([['3', '1']], dtype='<U21')]]
I also have a query that looks like this:
query = UUID('0d9ba9c6-632b-4dd4-912c-e8ff0a7134f7')
I am trying to find the sub-array in the main array that contains this UUID. So this query should return:
[UUID('0d9ba9c6-632b-4dd4-912c-e8ff0a7134f7'), array([['1', '1']], dtype='<U21')]
I found this syntax online to do this:
out = array[array[:, 0] == query]
I know that this only works in NumPy if the array itself is a NumPy array. But why and how does this work? I am extremely confused by the syntax.
You might want to read the numpy tutorial on indexing on ndarrays. But here are some basic explanations; understanding the examples below is a good starting point.
So you have an array:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
The most basic indexing is with slices, as for plain lists, but you can access nested dimensions with a tuple (arr[i1:i2, j1:j2]) instead of the chained indexing used with nested lists (arr[i1:i2][j1:j2]):
arr[0] # array([1, 2, 3])
arr[:, 0] # array([1, 4])
arr[0, 0] # 1
Another way of indexing with numpy is to use boolean arrays.
You can use a tuple of lists of booleans, one list per dimension, each list
having the same size as the dimension.
Notice that you can use booleans on one dimension ([False, True]) and a slice on the other (:):
arr[[False, True], :] # array([[4, 5, 6]])
arr[[False, True]] # array([[4, 5, 6]])
arr[[False, True], [True, False, True]] # array([4, 6])
You can also use a single big boolean numpy array that has the same shape as
the array you are indexing:
arr[np.array([[False, True, False], [False, False, True]])] # array([2, 6])
Also, the usual elementwise operations (+, /, %, ...) are redefined by numpy so that they work on whole arrays at once:
def is_even(x):
    return x % 2 == 0
is_even(2) # True
is_even(arr) # array([[False, True, False], [True, False, True]])
Here we just constructed a big boolean array, so it can be used on your original array:
arr[is_even(arr)] # array([2, 4, 6])
But in your case you were only indexing on the first dimension, so using booleans on the first dimension and a slice on the second:
arr[is_even(arr[:, 0]), :] # array([[4, 5, 6]])
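Coming back to your UUID example, a minimal sketch (the payload strings are made up for illustration; the object array is built element by element to avoid ragged-array pitfalls):
import numpy as np
from uuid import uuid4

rows = [[uuid4(), 'payload-%d' % i] for i in range(3)]  # hypothetical data
array = np.empty((3, 2), dtype=object)
for i, row in enumerate(rows):
    array[i, 0], array[i, 1] = row

query = rows[0][0]           # a UUID we know is present
mask = array[:, 0] == query  # elementwise comparison -> boolean mask
out = array[mask]            # the rows where the mask is True
print(out)                   # [[UUID('...') 'payload-0']]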
I have a 1D numpy array of False booleans, and a 2D numpy array containing the min,max indices of values in the first array to change to True.
An example:
my_data = numpy.zeros((10,), dtype=bool)
inds2true = numpy.array([[1, 3], [8, 9]])
And I want the following result:
out = numpy.array([False, True, True, True, False, False, False, False, True, True])
How is this possible in Python with Numpy?
Edit: I would like this to be performed in one step (i.e. no looping).
There's one rule-breaking hack:
my_data[inds2true] = True
my_data = np.cumsum(my_data) % 2 == 1
my_data
>>> array([False, True, True, False, False, False, False, False, True, False])
The most common practice is to change the indices within np.arange(1, 3) and np.arange(8, 9), i.e. not including 3 or 9. If you still want to include them, additionally do: my_data[inds2true[:, 1]] = True
If you're looking for other options to do it in one go, they will most probably involve np.cumsum tricks.
import numpy as np
my_data = np.zeros((10,), dtype=bool)
inds2true = np.array([[1, 3], [8, 9]])
indices = []
for ix_range in inds2true:
    indices += list(range(ix_range[0], ix_range[1] + 1))
my_data[indices] = True
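If the loop really must go, here is a fully vectorized sketch that expands each (start, stop) pair into its full index range with np.repeat and np.arange arithmetic:
import numpy as np

my_data = np.zeros((10,), dtype=bool)
inds2true = np.array([[1, 3], [8, 9]])

starts, stops = inds2true[:, 0], inds2true[:, 1]
lengths = stops - starts + 1  # ranges include both ends
# Per-range offsets 0..length-1, laid out back to back:
offsets = np.arange(lengths.sum()) - np.repeat(lengths.cumsum() - lengths, lengths)
my_data[np.repeat(starts, lengths) + offsets] = True
print(my_data)
# [False  True  True  True False False False False  True  True]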
I'm solving some problems, and one asked me:
"In Python solve these problems with where() in Numpy without using for() or if()"
There are two arrays. The first one is
[1,2,3,5,3,4,3,6,9,7,0,8,7,10]
The second one is
[7,2,10,5,7,4,9,1,8,0,3,7,6]
The expected results are the values that appear in both arrays, [2 5 4 9 0 7], and their indices, (array([1, 3, 5, 8, 10, 12])).
So I need to find the matching values along with their indices. I tried to solve it, but I can't figure out how to get both: I found the indices but not the values.
First, make sure the arrays are of equal length (the first array in the question has one element more than the second). Then you can use == to do an elementwise comparison:
a == b
# array([False, True, False, True, False, True, False, False, False,
# False, False, False, False])
get the equal values using
b[a == b]
# array([2, 5, 4])
and the indices using
np.where(a == b)[0]
# array([1, 3, 5])
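Putting it all together as a runnable sketch (the first array trimmed by one element so the lengths match):
import numpy as np

a = np.array([1, 2, 3, 5, 3, 4, 3, 6, 9, 7, 0, 8, 7])
b = np.array([7, 2, 10, 5, 7, 4, 9, 1, 8, 0, 3, 7, 6])

print(b[a == b])            # array([2, 5, 4])
print(np.where(a == b)[0])  # array([1, 3, 5])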
I need to copy elements from one numpy array to another, but only if a condition is met. Let's say I have two arrays:
x = ([1,2,3,4,5,6,7,8,9])
y = ([])
I want to add numbers from x to y, but only if they match a condition; let's say, check if they are divisible by two. I know I can do the following:
y = x%2 == 0
which makes y an array of 'true' and 'false' values. This is not what I am trying to accomplish, however; I want the actual values (2, 4, 6, 8), and only those that evaluate to true.
You can get the values you want like this:
import numpy as np
x = np.array([1,2,3,4,5,6,7,8,9])
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = x[x%2==0]
# y is now: array([2, 4, 6, 8])
And, you can sum them like this:
np.sum(x[x%2==0])
# 20
Explanation: As you noticed, x%2==0 gives you a boolean array array([False, True, False, True, False, True, False, True, False], dtype=bool). You can use this as a "mask" on your original array, by indexing it with x[x%2==0], returning the values of x where your "mask" is True. Take a look at the numpy indexing documentation for more info.
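An equivalent spelling that some find more explicit is np.extract, which pulls out the elements of an array where a condition holds:
np.extract(x % 2 == 0, x)
# array([2, 4, 6, 8])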
I have written a script that evaluates whether some entry of arr is in check_elements. My approach does not compare single entries, but whole vectors inside arr. Thus, the script checks whether [8, 3], [4, 5], ... is in check_elements.
Here's an example:
import numpy as np
# arr.shape -> (2, 3, 2)
arr = np.array([[[8, 3],
[4, 5],
[6, 2]],
[[9, 0],
[1, 10],
[7, 11]]])
# check_elements.shape -> (3, 2)
# generally: (n, 2)
check_elements = np.array([[4, 5], [9, 0], [7, 11]])
# rslt.shape -> (2, 3)
rslt = np.zeros((arr.shape[0], arr.shape[1]), dtype=bool)
for i, j in np.ndindex((arr.shape[0], arr.shape[1])):
    if arr[i, j] in check_elements:  # <-- condition is checked against
                                     #     the whole last dimension
        rslt[i, j] = True
    else:
        rslt[i, j] = False
Now:
print(rslt)
...would print:
[[False True False]
[ True False True]]
For getting the indices I use:
print(np.transpose(np.nonzero(rslt)))
...which prints the following:
[[0 1] # arr[0, 1] -> [4, 5] -> is in check_elements
[1 0] # arr[1, 0] -> [9, 0] -> is in check_elements
[1 2]] # arr[1, 2] -> [7, 11] -> is in check_elements
This task would be easy and performant if I would check a condition on single values, like arr > 3 or np.where(...), but I am not interested in single values. I want to check a condition against the whole last dimension (or slices of it).
My question is: is there a faster way to achieve the same result? Am I right that vectorized attempts and things like np.where cannot be used for my problem, because they always operate on single values and not on a whole dimension or slices of that dimension?
Here is a Numpythonic approach using broadcasting:
>>> (check_elements == arr[:, :, None]).all(axis=-1).any(axis=-1)
array([[False, True, False],
[ True, False, True]], dtype=bool)
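Broken into steps, the broadcasting works like this (a sketch using the arrays from the question):
eq = check_elements == arr[:, :, None]  # shape (2, 3, 3, 2): every arr row vs every check row
row_match = eq.all(axis=-1)             # shape (2, 3, 3): full-row equality per candidate
rslt = row_match.any(axis=-1)           # shape (2, 3): True where any candidate matched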
The numpy_indexed package (disclaimer: I am its author) contains functionality to perform these kind of queries; specifically, containment relations for nd (sub)arrays:
import numpy_indexed as npi
flatidx = npi.indices(arr.reshape(-1, 2), check_elements)
idx = np.unravel_index(flatidx, arr.shape[:-1])
Note that the implementation is fully vectorized under the hood.
Also, note that with this approach the order of the indices in idx matches the order of check_elements; the first item in idx gives the row and col of the first item in check_elements. This information is lost when using an approach along the lines you posted above, or one of the alternative suggested answers, which give you the indices sorted by their order of appearance in arr instead, which is often undesirable.
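For the example above this gives (a sketch; assuming npi.indices returns the positions of check_elements within its first argument, as documented):
print(idx)
# (array([0, 1, 1]), array([1, 0, 2]))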
You can use np.in1d even though it is meant for 1D arrays, by giving it a 1D view of your array that contains one (void-typed) element per row along the last axis:
arr_view = arr.view((np.void, arr.dtype.itemsize * arr.shape[-1])).ravel()
check_view = check_elements.view((np.void,
                                  check_elements.dtype.itemsize * check_elements.shape[-1])).ravel()
This gives you two 1D arrays, each containing a void-type version of your 2-element rows along the last axis. Now you can check which of the elements in arr_view are also in check_view by doing:
flatResult = np.in1d(arr_view, check_view)
This will give a flattened array, which you can then reshape to the shape of arr, dropping the last axis:
print(flatResult.reshape(arr.shape[:-1]))
which will give you the desired result:
array([[False, True, False],
[ True, False, True]], dtype=bool)
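To recover the index pairs as in the question, the same np.nonzero trick applies; here is a self-contained sketch:
import numpy as np

arr = np.array([[[8, 3], [4, 5], [6, 2]],
                [[9, 0], [1, 10], [7, 11]]])
check_elements = np.array([[4, 5], [9, 0], [7, 11]])

void = (np.void, arr.dtype.itemsize * arr.shape[-1])  # one void scalar per row
arr_view = arr.view(void).ravel()
check_view = check_elements.view(void).ravel()

rslt = np.in1d(arr_view, check_view).reshape(arr.shape[:-1])
print(np.transpose(np.nonzero(rslt)))  # [[0 1] [1 0] [1 2]]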