Membership checking in Numpy ndarray - python

I have written a script that evaluates whether some entry of arr is in check_elements. My approach does not compare single entries, but whole vectors inside arr. Thus, the script checks whether [8, 3], [4, 5], ... are in check_elements.
Here's an example:
import numpy as np
# arr.shape -> (2, 3, 2)
arr = np.array([[[8, 3],
                 [4, 5],
                 [6, 2]],

                [[9, 0],
                 [1, 10],
                 [7, 11]]])
# check_elements.shape -> (3, 2)
# generally: (n, 2)
check_elements = np.array([[4, 5], [9, 0], [7, 11]])
# rslt.shape -> (2, 3)
rslt = np.zeros((arr.shape[0], arr.shape[1]), dtype=bool)
for i, j in np.ndindex((arr.shape[0], arr.shape[1])):
    # the condition is checked against the whole last dimension
    if arr[i, j] in check_elements:
        rslt[i, j] = True
    else:
        rslt[i, j] = False
Now:
print(rslt)
...would print:
[[False  True False]
 [ True False  True]]
For getting the indices of the True entries I use:
print(np.transpose(np.nonzero(rslt)))
...which prints the following:
[[0 1] # arr[0, 1] -> [4, 5] -> is in check_elements
[1 0] # arr[1, 0] -> [9, 0] -> is in check_elements
[1 2]] # arr[1, 2] -> [7, 11] -> is in check_elements
This task would be easy and performant if I were checking a condition on single values, like arr > 3 or np.where(...), but I am not interested in single values. I want to check a condition against the whole last dimension (or slices of it).
My question is: is there a faster way to achieve the same result? Am I right that vectorized attempts and things like np.where cannot be used for my problem, because they always operate on single values and not on a whole dimension or slices of that dimension?

Here is a Numpythonic approach using broadcasting:
>>> (check_elements == arr[:, :, None]).all(axis=-1).any(axis=-1)
array([[False,  True, False],
       [ True, False,  True]], dtype=bool)
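
A shape-agnostic version of the same idea, written out step by step (a sketch using the arrays from the question):

import numpy as np

arr = np.array([[[8, 3], [4, 5], [6, 2]],
                [[9, 0], [1, 10], [7, 11]]])
check_elements = np.array([[4, 5], [9, 0], [7, 11]])

# arr[:, :, None, :] has shape (2, 3, 1, 2); comparing against the
# (3, 2) check_elements broadcasts to (2, 3, 3, 2)
matches = arr[:, :, None, :] == check_elements
# a vector matches a row only if *all* of its components match,
# and it is a member if it matches *any* of the rows
rslt = matches.all(axis=-1).any(axis=-1)   # shape (2, 3)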

The numpy_indexed package (disclaimer: I am its author) contains functionality to perform this kind of query; specifically, containment relations for nd (sub)arrays:
import numpy_indexed as npi
flatidx = npi.indices(arr.reshape(-1, 2), check_elements)
idx = np.unravel_index(flatidx, arr.shape[:-1])
Note that the implementation is fully vectorized under the hood.
Also, note that with this approach the order of the indices in idx matches the order of check_elements: the first entries in idx are the row and column of the first item in check_elements. This information is lost with an approach along the lines you posted above, or with the alternative answers suggested here, which give you the indices sorted by their order of appearance in arr instead; that is often undesirable.
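
For completeness, a hedged sketch of how the returned indices line up (it assumes every row of check_elements actually occurs in arr; if I recall the API correctly, npi.indices raises a KeyError otherwise):

import numpy as np
import numpy_indexed as npi

flatidx = npi.indices(arr.reshape(-1, 2), check_elements)
rows, cols = np.unravel_index(flatidx, arr.shape[:-1])
# rows[k], cols[k] locate check_elements[k]: arr[rows[k], cols[k]] == check_elements[k]

# the boolean mask from the question can be recovered from the indices
rslt = np.zeros(arr.shape[:-1], dtype=bool)
rslt[rows, cols] = True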

You can use np.in1d even though it is meant for 1D arrays, by giving it a 1D view of your array that contains one element per vector along the last axis:
arr_view = arr.view((np.void, arr.dtype.itemsize * arr.shape[-1])).ravel()
check_view = check_elements.view((np.void,
                                  check_elements.dtype.itemsize * check_elements.shape[-1])).ravel()
This gives you two 1D arrays, which contain a void-type version of your 2-element vectors along the last axis. Now you can check which of the elements in arr_view are also in check_view by doing:
flatResult = np.in1d(arr_view, check_view)
This will give a flattened array, which you can then reshape to the shape of arr, dropping the last axis:
print(flatResult.reshape(arr.shape[:-1]))
which will give you the desired result:
array([[False, True, False],
[ True, False, True]], dtype=bool)
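
Putting the steps together as one helper (a sketch; it assumes both arrays share the same dtype, and adds np.ascontiguousarray because the void view only works when the last axis is contiguous in memory):

import numpy as np

def in_last_axis(arr, check_elements):
    """Mask telling which vectors along arr's last axis occur as rows of check_elements."""
    a = np.ascontiguousarray(arr)
    c = np.ascontiguousarray(check_elements)
    # pack each vector along the last axis into one opaque void scalar,
    # so that np.in1d compares whole vectors instead of single numbers
    dt = np.dtype((np.void, a.dtype.itemsize * a.shape[-1]))
    return np.in1d(a.view(dt).ravel(), c.view(dt).ravel()).reshape(a.shape[:-1])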

Related

Finding Element in Array: Indexing Syntax Explanation

I have an array in Python that looks like this:
array = [[UUID('0d9ba9c6-632b-4dd4-912c-e8ff0a7134f7'), array([['1', '1']], dtype='<U21')], [UUID('9cb1feb6-0ef4-4e15-9070-7735545d12c9'), array([['2', '1']], dtype='<U21')], [UUID('955d308b-3570-4166-895e-81a077e6b9f9'), array([['3', '1']], dtype='<U21')]]
I also have a query that looks like this:
query = UUID('0d9ba9c6-632b-4dd4-912c-e8ff0a7134f7')
I am trying to find the sub-array in the main array that contains this UUID. So, querying this would return in:
[UUID('0d9ba9c6-632b-4dd4-912c-e8ff0a7134f7'), array([['1', '1']], dtype='<U21')]
I found this syntax online to do this:
out = array[array[:, 0] == query]
I know that this only works if the array itself is a NumPy array. But why does this work, and how? I am extremely confused by the syntax.
You might want to read the numpy tutorial on indexing on ndarrays.
But here are some basic explanations; understanding the examples would be a
good starting point.
So you have an array:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
The most basic indexing is with slices, as for basic lists, but you can access
nested dimensions with tuples (arr[i1:i2, j1:j2]) instead of chained indexing
as with basic lists (arr[i1:i2][j1:j2]):
arr[0] # array([1, 2, 3])
arr[:, 0] # array([1, 4])
arr[0, 0] # 1
Another way of indexing with numpy is to use boolean arrays.
You can use a tuple of lists of booleans, one list per dimension, each list
having the same size as the dimension.
And you can notice that you can use booleans on one dimension ([False, True])
and slices on the other dimension (:):
arr[[False, True], :] # array([[4, 5, 6]])
arr[[False, True]] # array([[4, 5, 6]])
arr[[False, True], [True, False, True]] # array([4, 6])
You can also use a single big boolean numpy array that has the same shape as
the array you are indexing:
arr[np.array([[False, True, False], [False, False, True]])] # array([2, 6])
Also, the usual elementwise operations (+, /, %, ...) are overloaded by
numpy so that they work on whole arrays at once:
def is_even(x):
    return x % 2 == 0
is_even(2) # True
is_even(arr) # array([[False, True, False], [True, False, True]])
Here we just constructed a big boolean array, so it can be used on your original array:
arr[is_even(arr)] # array([2, 4, 6])
But in your case you were only indexing on the first dimension, so using the tuple-of-boolean-lists indexing method:
arr[is_even(arr[:, 0]), :] # array([[4, 5, 6]])
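Putting this back into the context of the original question, a sketch (the dtype=object is needed because the rows mix UUIDs and arrays):

import numpy as np
from uuid import UUID

arr2 = np.array(array, dtype=object)  # `array` and `query` as defined in the question
mask = arr2[:, 0] == query            # boolean array: True where the UUID column matches
out = arr2[mask]                      # keeps only the matching rows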

Select all rows from Numpy array where each column satisfies some condition

I have an array x of the form,
x = [[1,2,3,...,7,8,9],
     [1,2,3,...,7,9,8],
     ...,
     [9,8,7,...,3,1,2],
     [9,8,7,...,3,2,1]]
I also have an array of non-allowed numbers for each column. I want to select all of the rows which have only allowed values in each column. For instance, I might want only the rows which do not have any of [1,2,3] in the first column; I can do this by:
x[~np.in1d(x[:,0], [1,2,3])]
And for any single column, I can do this. But I'm looking to essentially do this for all columns at once, selecting only the rows for which every element is an allowed number for its column. I can't seem to get x.any or x.all to do this well - how should I go about this?
EDIT: To clarify, the non-allowed numbers are different for each column. In actuality, I will have some array y,
y = [[1,4,...,7,8],
     [2,5,...,9,4],
     [3,6,...,8,6]]
Where I want rows from x for which column 1 cannot be in [1,2,3], column 2 cannot be in [4,5,6], and so on.
You can broadcast the comparison, then use all to check:
x[(x != y[:,None,:]).all(axis=(0,-1))]
Break down:
# compare each element of `x` to each element of `y`
# mask.shape == (y.shape[0], x.shape[0], x.shape[1])
mask = (x != y[:,None,:])
# `all(0)` checks, for each element of `x`, that it doesn't match any element in the same column of `y`
# `all(-1)` checks along the rows of `x`
mask = mask.all(axis=(0,-1))
# slice
x[mask]
For example, consider:
x = np.array([[1, 2],
              [9, 8],
              [5, 6],
              [7, 8]])
y = np.array([[1, 4],
              [2, 5],
              [3, 7]])
Then mask = (x != y[:,None,:]).all(axis=(0,-1)) gives
array([False, True, True, True])
It's recommended to use np.isin rather than np.in1d these days. This lets you (a) compare the entire array all at once, and (b) invert the mask more efficiently.
x[np.isin(x, [1, 2, 3], invert=True).all(1)]
np.isin preserves the shape of x, so you can then use .all across the columns. It also has an invert argument which allows you to do the equivalent of ~isin(x, [1, 2, 3]), but more efficiently.
This solution vectorizes, much more efficiently, a computation similar to the one the other answer suggests (although it's still a linear search), and it avoids creating temporary arrays as well.
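
To handle the edited question, where each column has its own list of disallowed values, one adaptation is to apply np.isin column by column and AND the masks together (a sketch, assuming x and y are numpy arrays and column j of y holds the values forbidden in column j of x):

import numpy as np

mask = np.ones(len(x), dtype=bool)
for j in range(x.shape[1]):
    # drop rows whose j-th entry appears in the j-th blacklist column
    mask &= np.isin(x[:, j], y[:, j], invert=True)
result = x[mask]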

Deleting elements at specific positions in a M X N numpy array

I am trying to implement the seam carving algorithm, wherein we have to delete a seam from an image. The image is stored as an M x N numpy array. I have found the seam, which is nothing but an array of M integers whose values specify the column to be deleted in each row.
Eg: a 2 X 3 array
import numpy
img_array = numpy.array([[1, 2, 3],[4, 5, 6]])
and
seam = numpy.array([1,2])
This means that we have to delete the 1st element of the 1st row (value 1) and the 2nd element of the 2nd row (value 5). After deletion, img_array will be:
print(img_array)
[[2 3]
 [4 6]]
Work done:
I am new to Python. I have found solutions that deal with one-dimensional arrays, or with deleting an entire row or column, but I could not find a way to delete single elements at specific positions in each row.
Will you always delete one element from each row? If you try to delete one element from one row, but not another, you will end up with a ragged array. That is why there isn't a general purpose way of removing single elements from a 2d array.
One option is to figure out which ones you want to delete, remove them from a flattened array, and then reshape it back to the correct shape. Then it is your responsibility to ensure that the correct number of elements are removed.
All of these 'delete' methods actually copy the 'keep' values to a new array. Nothing actually deletes elements from the original array. So you could just as easily (and just as fast) do your own copy to a new array.
Another option is to work with lists of lists. Those are more tolerant of becoming ragged.
Here's an example of using a boolean mask to remove selected elements from an array (making a copy of course):
In [100]: x=np.arange(1,7).reshape(2,3)
In [101]: x
Out[101]:
array([[1, 2, 3],
[4, 5, 6]])
In [102]: mask=np.ones_like(x,bool)
In [103]: mask
Out[103]:
array([[ True, True, True],
[ True, True, True]], dtype=bool)
In [104]: mask[0,0]=False
In [105]: mask[1,1]=False
In [106]: mask
Out[106]:
array([[False, True, True],
[ True, False, True]], dtype=bool)
In [107]: x[mask]
Out[107]: array([2, 3, 4, 6]) # it's flat
In [108]: x[mask].reshape(2,2)
Out[108]:
array([[2, 3],
[4, 6]])
Notice that even though both x and mask are 2d, the indexing result is flattened. Such a mask could easily have produced an array that couldn't be reshaped back to 2d.
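For the seam use case specifically, the mask can be built directly from the seam instead of setting entries by hand; a sketch, assuming the seam holds one 0-based column index per row:

import numpy as np

img_array = np.array([[1, 2, 3],
                      [4, 5, 6]])
seam = np.array([0, 1])  # 0-based column to drop in each row

mask = np.ones(img_array.shape, dtype=bool)
mask[np.arange(img_array.shape[0]), seam] = False   # knock out one entry per row
carved = img_array[mask].reshape(img_array.shape[0], -1)
# carved -> [[2, 3], [4, 6]]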
Each row in your matrix is a single dimensional array.
import numpy
ary=numpy.array([[1,2,3],[4,5,6]])
print(ary[0])
Gives
array([1, 2, 3])
You could iterate over your matrix, using the values from your seam to remove an element from the current row, and append the result to a modified matrix you are building:
seam = numpy.array([1, 2])
for i in range(2):
    tmp = numpy.delete(ary[i], seam[i] - 1)
    if i == 0:
        modified_ary = tmp
    else:
        modified_ary = numpy.vstack((modified_ary, tmp))
print(modified_ary)
Gives
[[2 3]
[4 6]]

check if numpy array is subset of another array

Similar questions have already been asked on SO, but they have more specific constraints and their answers don't apply to my question.
Generally speaking, what is the most pythonic way to determine if an arbitrary numpy array is a subset of another array? More specifically, I have a roughly 20000x3 array and I need to know the indices of the 1x3 elements that are entirely contained within a set. More generally, is there a more pythonic way of writing the following:
master = [12, 155, 179, 234, 670, 981, 1054, 1209, 1526, 1667, 1853]  # some indices of interest
triangles = np.random.randint(2000, size=(20000, 3))  # some data
for i, x in enumerate(triangles):
    if x[0] in master and x[1] in master and x[2] in master:
        print(i)
For my use case, I can safely assume that len(master) << 20000. (Consequently, it is also safe to assume that master is sorted because this is cheap).
You can do this easily by iterating over the array in a list comprehension. A toy example is as follows:
import numpy as np
x = np.arange(30).reshape(10,3)
searchKey = [4,5,8]
x[[0,3,7],:] = searchKey
x
gives
array([[ 4, 5, 8],
[ 3, 4, 5],
[ 6, 7, 8],
[ 4, 5, 8],
[12, 13, 14],
[15, 16, 17],
[18, 19, 20],
[ 4, 5, 8],
[24, 25, 26],
[27, 28, 29]])
Now iterate over the elements:
ismember = [row==searchKey for row in x.tolist()]
The result is
[True, False, False, True, False, False, False, True, False, False]
You can modify it for being a subset as in your question:
searchKey = [2,4,10,5,8,9] # Add more elements for testing
setSearchKey = set(searchKey)
ismember = [setSearchKey.issuperset(row) for row in x.tolist()]
If you need the indices, then use
np.where(ismember)[0]
It gives
array([0, 3, 7])
One can also use np.isin, which might be more efficient than the list comprehension in @petrichor's answer. Using the same setup:
import numpy as np
x = np.arange(30).reshape(10, 3)
searchKey = [4, 5, 8]
x[[0, 3, 7], :] = searchKey
array([[ 4, 5, 8],
[ 3, 4, 5],
[ 6, 7, 8],
[ 4, 5, 8],
[12, 13, 14],
[15, 16, 17],
[18, 19, 20],
[ 4, 5, 8],
[24, 25, 26],
[27, 28, 29]])
Now one can use np.isin; by default, it will work element wise:
np.isin(x, searchKey)
array([[ True, True, True],
[False, True, True],
[False, False, True],
[ True, True, True],
[False, False, False],
[False, False, False],
[False, False, False],
[ True, True, True],
[False, False, False],
[False, False, False]])
We now have to filter the rows where all entries evaluate to True for which we could use all:
np.isin(x, searchKey).all(1)
array([ True, False, False, True, False, False, False, True, False,
False])
If one now wants the corresponding indices, one can use np.where:
np.where(np.isin(x, searchKey).all(1))
(array([0, 3, 7]),)
EDIT:
Just realized that one has to be careful, though. For example, if I do
x[4, :] = [8, 4, 5]
so that in the assignment I use the same values as in searchKey but in a different order, I will still get that row's index returned when doing
np.where(np.isin(x, searchKey).all(1))
which prints
(array([0, 3, 4, 7]),)
That can be undesired.
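If the order inside each row matters, one workaround is to compare complete rows instead of testing element membership (a sketch):

# True only where a row of x equals searchKey element-for-element, in order
exact = (x == searchKey).all(axis=1)
np.where(exact)[0]  # array([0, 3, 7]) -- row 4, [8, 4, 5], no longer matches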
Here are two approaches you could try:
1. Use sets. Sets are implemented much like Python dictionaries and have constant-time lookups. That would look much like the code you already have; just create a set from master:
master = [12,155,179,234,670,981,1054,1209,1526,1667,1853]
master_set = set(master)
triangles = np.random.randint(2000,size=(20000,3)) #some data
for i, x in enumerate(triangles):
    if master_set.issuperset(x):
        print(i)
2. Use searchsorted. This is nice because it doesn't require you to use hashable types and uses numpy builtins. searchsorted is log(N) in the size of master and O(N) in the size of triangles, so it should also be pretty fast, maybe faster, depending on the size of your arrays and such.
master = [12,155,179,234,670,981,1054,1209,1526,1667,1853]
master = np.asarray(master)
triangles = np.random.randint(2000,size=(20000,3)) #some data
idx = master.searchsorted(triangles)
# searchsorted returns insertion points in [0, len(master)], so clip
# before using idx to index into master
idx.clip(max=len(master) - 1, out=idx)
print(np.where(np.all(triangles == master[idx], axis=1)))
This second case assumes master is sorted, as searchsorted implies.
A more natural (and possibly faster) solution for set operations in numpy is to use the functions in numpy.lib.arraysetops. These generally allow you to avoid having to convert back and forth between Python's set type. To check if one array is a subset of another, use numpy.setdiff1d() and test if the returned array has 0 length:
import numpy as np
a = np.arange(10)
b = np.array([1, 5, 9])
c = np.array([-5, 5, 9])
# is `a` a subset of `b`?
len(np.setdiff1d(a, b)) == 0 # gives False
# is `b` a subset of `a`?
len(np.setdiff1d(b, a)) == 0 # gives True
# is `c` a subset of `a`?
len(np.setdiff1d(c, a)) == 0 # gives False
You can also optionally set assume_unique=True for a potential speed boost.
I'm actually a bit surprised that numpy doesn't have something like a built-in issubset() function to do the above (analogous to set.issubset()).
Another option is to use numpy.in1d() (see https://stackoverflow.com/a/37262010/2020363)
Edit: I just realized that at some point in the distant past this bothered me enough that I wrote my own simple function:
def issubset(a, b):
    """Return whether sequence `a` is a subset of sequence `b`"""
    return len(np.setdiff1d(a, b)) == 0
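For example, with the arrays from above:

issubset(b, a)  # True
issubset(c, a)  # False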
starting with:
master=[12,155,179,234,670,981,1054,1209,1526,1667,1853] #some indices of interest
triangles=np.random.randint(2000,size=(20000,3)) #some data
What's the most pythonic way to find the indices of the triplets contained in master? Try using np.in1d with a list comprehension:
inds = [j for j in range(len(triangles)) if all(np.in1d(triangles[j], master))]
%timeit says ~0.5 s = half a second
--> MUCH faster way (factor of 1000!) that avoids python's slow looping? Try using np.isin with np.sum to get a boolean mask for np.arange:
inds = np.where(
    np.sum(np.isin(triangles, master), axis=-1) == triangles.shape[-1])
%timeit says ~0.0005 s = half a millisecond!
Advice: avoid looping over lists whenever possible, because for the same price as a single iteration of a Python loop containing one arithmetic operation, you can call a numpy function that does thousands of that same arithmetic operation.
Conclusion
It seems that np.isin(triangles, master) is the function you were looking for: it gives a boolean mask of the same shape as its first argument, telling whether each element is also an element of the second argument. From there, requiring that the sum of a mask row is 3 (i.e., the full length of a row in triangles) gives a 1d mask for the desired rows of triangles (or their indices, using np.where).

Indexing NumPy 2D array with another 2D array

I have something like
m = array([[1, 2],
           [4, 5],
           [7, 8],
           [6, 2]])
and
select = array([0,1,0,0])
My target is
result = array([1, 5, 7, 6])
I tried np.ix_, as I read at Simplfy row AND column extraction, numpy, but this did not give me what I wanted.
p.s. Please change the title of this question if you can think of a more precise one.
The numpy way to do this is by using np.choose or fancy indexing/take (see below):
m = array([[1, 2],
           [4, 5],
           [7, 8],
           [6, 2]])
select = array([0,1,0,0])
result = np.choose(select, m.T)
So there is no need for python loops or anything, with all the speed advantages numpy gives you. m.T is just needed because choose is really more a choice between two arrays, as in np.choose(select, (m[:,0], m[:,1])), but it's straightforward to use it like this.
Using fancy indexing:
result = m[np.arange(len(select)), select]
And if speed is very important, use np.take, which works on a 1D view (it's quite a bit faster for some reason, but maybe not for these tiny arrays):
result = m.take(select+np.arange(0, len(select) * m.shape[1], m.shape[1]))
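A quick sanity check that the three variants agree on the example (a sketch):

import numpy as np

m = np.array([[1, 2], [4, 5], [7, 8], [6, 2]])
select = np.array([0, 1, 0, 0])

a = np.choose(select, m.T)
b = m[np.arange(len(select)), select]
c = m.take(select + np.arange(0, len(select) * m.shape[1], m.shape[1]))
assert (a == b).all() and (b == c).all()  # all three give array([1, 5, 7, 6])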
I prefer to use NP.where for indexing tasks of this sort (rather than NP.ix_)
What is not mentioned in the OP is whether the result is selected by location (row/col in the source array) or by some condition (e.g., m >= 5). In any event, the code snippet below covers both scenarios.
Three steps:
1. create the condition array;
2. generate an index array by calling NP.where, passing in this condition array; and
3. apply this index array against the source array
>>> import numpy as NP
>>> cnd = (m==1) | (m==5) | (m==7) | (m==6)
>>> cnd
matrix([[ True, False],
        [False,  True],
        [ True, False],
        [ True, False]], dtype=bool)
>>> # generate the index array/matrix
>>> # by calling NP.where, passing in the condition (cnd)
>>> ndx = NP.where(cnd)
>>> ndx
(matrix([[0, 1, 2, 3]]), matrix([[0, 1, 0, 0]]))
>>> # now apply it against the source array
>>> m[ndx]
matrix([[1, 5, 7, 6]])
The argument passed to NP.where, cnd, is a boolean array, which in this case is the result of a single expression composed of compound conditional expressions (first line above).
If constructing such a value filter doesn't apply to your particular use case, that's fine, you just need to generate the actual boolean matrix (the value of cnd) some other way (or create it directly).
What about using python?
result = array([subarray[index] for subarray, index in zip(m, select)])
IMHO, this is the simplest variant:
m[np.arange(4), select]
Since the title is referring to indexing a 2D array with another 2D array, the actual general numpy solution can be found here.
In short:
A 2D array of indices of shape (n,m), with arbitrarily large m, named inds, is used to access elements of another 2D array of shape (n,k), named B:
# array of index offsets to be added to each row of inds;
# the flat index of B[i, j] is i * B.shape[1] + j
offset = np.arange(0, B.size, B.shape[1])
# numpy.take(B, C) "flattens" arrays B and C and selects elements from B based on indices in C
Result = np.take(B, offset[:,np.newaxis] + inds)
Another solution, which doesn't use np.take and I find more intuitive, is the following:
B[np.expand_dims(np.arange(B.shape[0]), -1), inds]
The advantage of this syntax is that it can be used both for reading elements from B based on inds (like np.take), as well as for assignment.
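For instance, the same index expression can be used on the left-hand side of an assignment (a sketch with hypothetical small arrays):

import numpy as np

B = np.arange(12).reshape(3, 4)
inds = np.array([[0, 1],
                 [2, 3],
                 [1, 2]])  # two column picks per row

rows = np.expand_dims(np.arange(B.shape[0]), -1)  # shape (3, 1), broadcasts against inds
picked = B[rows, inds]   # read:  shape (3, 2)
B[rows, inds] = -1       # write: the same indexing assigns in place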
result = array([m[j][0] if i==0 else m[j][1] for i,j in zip(select, range(0, len(m)))])
