How do I get a row-wise comparison between two arrays, with the result being a row-wise true/false array?
Given data:
a = np.array([[1,0],[2,0],[3,1],[4,2]])
b = np.array([[1,0],[2,0],[4,2]])
Result step 1:
c = np.array([True, True,False,True])
Result final:
a = a[c]
So how do I get the array c?
P.S.: In this example the arrays a and b are sorted. Please also mention whether your solution requires the arrays to be sorted.
Here's a vectorised solution:
res = (a[:, None] == b).all(-1).any(-1)
print(res)
array([ True, True, False, True])
Note that a[:, None] == b compares each row of a with b element-wise. We then use all + any to deduce if there are any rows which are all True for each sub-array:
print(a[:, None] == b)
[[[ True  True]
  [False  True]
  [False False]]

 [[False  True]
  [ True  True]
  [False False]]

 [[False False]
  [False False]
  [False False]]

 [[False False]
  [False False]
  [ True  True]]]
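Applying the mask then gives the filtered array from the question:
print(a[res])
# [[1 0]
#  [2 0]
#  [4 2]]
One thing to keep in mind: the broadcasted comparison materialises a len(a) x len(b) x n_cols boolean array, so memory use grows with the product of the two row counts.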
Approach #1
We could use a view based vectorized solution -
# https://stackoverflow.com/a/45313353/ #Divakar
def view1D(a, b):  # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()
A,B = view1D(a,b)
out = np.isin(A,B)
Sample run -
In [8]: a
Out[8]:
array([[1, 0],
       [2, 0],
       [3, 1],
       [4, 2]])

In [9]: b
Out[9]:
array([[1, 0],
       [2, 0],
       [4, 2]])

In [10]: A,B = view1D(a,b)

In [11]: np.isin(A,B)
Out[11]: array([ True,  True, False,  True])
Approach #2
Alternatively, for the case when all rows in b are in a and the rows are lexicographically sorted, we can use the same views with searchsorted -
out = np.zeros(len(A), dtype=bool)
out[np.searchsorted(A,B)] = 1
If the rows are not necessarily lexicographically sorted -
sidx = A.argsort()
out[sidx[np.searchsorted(A,B,sorter=sidx)]] = 1
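A minimal end-to-end sketch of Approach #2 for the unsorted case, reusing view1D from Approach #1 and assuming every row of b occurs in a (the precondition stated above):
import numpy as np

a = np.array([[1,0],[2,0],[3,1],[4,2]])
b = np.array([[1,0],[2,0],[4,2]])

A, B = view1D(a, b)                  # 1D void views from Approach #1
out = np.zeros(len(A), dtype=bool)
sidx = A.argsort()                   # sorter for the void view
out[sidx[np.searchsorted(A, B, sorter=sidx)]] = True
print(out)                           # [ True  True False  True]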
You can use numpy's apply_along_axis, which applies a function along the given axis (axis=1 means it is applied to every row):
import numpy as np
a = np.array([[1,0],[2,0],[3,1],[4,2]])
b = np.array([[1,0],[2,0],[4,2]])
c = np.apply_along_axis(lambda x, y: x in y, 1, a, b)
Be aware that for ndarrays, x in y evaluates (y == x).any(), so it can return True when only a single element matches positionally. An exact row test such as lambda x, y: (y == x).all(axis=1).any() is stricter.
You can do it as a list comprehension:
c = np.array([row in b for row in a])
though this approach will be slower than a pure numpy approach (if it exists). The same caveat as above applies: row in b is (b == row).any(), not an exact row match.
We can take advantage of the fact that the arrays are sorted and do this in O(n) time. Using two pointers, we simply advance whichever pointer has fallen behind:
a = np.array([[1,0],[2,0],[3,1],[4,2]])
b = np.array([[1,0],[2,0],[4,2]])
i = 0
j = 0
result = []
while i < len(a) and j < len(b):
    if tuple(a[i]) == tuple(b[j]):
        result.append(True)
        i += 1
        j += 1  # get rid of this depending on how you want to handle duplicates
    elif tuple(a[i]) > tuple(b[j]):
        j += 1
    else:
        result.append(False)
        i += 1
Pad with False if it ends early.
if len(result) < len(a):
    result.extend([False] * (len(a) - len(result)))
print(result)  # [True, True, False, True]
This answer is adapted from Better way to find matches in two sorted lists than using for loops? (Java)
You can use scipy's cdist, which has a few advantages:
from scipy.spatial.distance import cdist
a = np.array([[1,0],[2,0],[3,1],[4,2]])
b = np.array([[1,0],[2,0],[4,2]])
c = cdist(a, b)==0
print(c.any(axis=1))
[ True True False True]
print(a[c.any(axis=1)])
[[1 0]
[2 0]
[4 2]]
Also, cdist accepts a callable, so you can supply your own distance function to do whatever comparison you need:
c = cdist(a, b, lambda u, v: (u==v).all())
print(c)
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 0.]
[0. 0. 1.]]
Now you can find which indices match, which will also reveal whether there are multiple matches.
# Array with multiple instances
a2 = np.array([[1,0],[2,0],[3,1],[4,2],[3,1],[4,2]])
c2 = cdist(a2, b, lambda u, v: (u==v).all())
print(c2)
idx = np.where(c2==1)
print(idx)
print(idx[0][idx[1]==2])
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 0.]
[0. 0. 1.]
[0. 0. 0.]
[0. 0. 1.]]
(array([0, 1, 3, 5], dtype=int64), array([0, 1, 2, 2], dtype=int64))
[3 5]
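One caveat that is not in the original answer: cdist(a, b) == 0 tests for exactly zero distance, which is fine for integer data but can miss nearly-equal rows with floating point inputs. A tolerance is safer there, for example:
c = cdist(a, b) < 1e-12   # tolerance instead of exact zero for float inputs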
The recommended answer is good, but it will struggle with arrays that have a large number of rows, since the broadcasted comparison materialises a large intermediate array. An alternative is:
baseval = np.max([a.max(), b.max()]) + 1
a[:,1] = a[:,1] * baseval
b[:,1] = b[:,1] * baseval
c = np.isin(np.sum(a, axis=1), np.sum(b, axis=1))
This uses the maximum value contained in either array plus 1 as a numeric base and treats the columns as baseval^0 and baseval^1 digits. This ensures that the sum of the columns is unique for each possible pair of values. If the order of the columns is not important, each row of both input arrays can be sorted with np.sort(a, axis=1) beforehand.
This can be extended to arrays with more columns using:
baseval = np.max([a.max(), b.max()]) + 1
n_cols = a.shape[1]
a = a * baseval ** np.arange(n_cols)
b = b * baseval ** np.arange(n_cols)
c = np.isin(np.sum(a, axis=1), np.sum(b, axis=1))
Overflow can occur when baseval ** (n_cols+1) > 9223372036854775807 (the int64 maximum). This can be avoided by making the numpy arrays use Python integers via dtype=object.
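As a small sketch of the encoding on the question's arrays (baseval is 5 here, so a row [x, y] maps to x + 5*y, e.g. [4, 2] -> 14; the @ form below is just a compact way of taking the weighted sum):
import numpy as np

a = np.array([[1,0],[2,0],[3,1],[4,2]])
b = np.array([[1,0],[2,0],[4,2]])

baseval = max(a.max(), b.max()) + 1          # 5
weights = baseval ** np.arange(a.shape[1])   # [1, 5]
c = np.isin(a @ weights, b @ weights)        # encode each row as one integer, then compare
print(c)                                     # [ True  True False  True]
Unlike the in-place version above, this leaves a and b unmodified.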
Related
Let's say there is a matrix as follows:
a = np.array([[74, 0, 2],
[ 0, 73, 8],
[ 0, 10, 72]])
I want to find mirror elements that are zero in both upper and lower triangles and set them to nan. E.g. In this case a[0, 1] and a[1, 0]. I can write a loop like:
m = np.zeros((3, 3))
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        if i == j:
            m[i, j] = a[i, j]
            continue
        if (a[i, j] == 0) & (a[j, i] == 0):
            m[i, j] = np.nan
            m[j, i] = np.nan
            continue
        m[i, j] = a[i, j]
        m[j, i] = a[j, i]
print(m)
[[74. nan 2.]
[nan 73. 8.]
[ 0. 10. 72.]]
This does the job. But I have millions of these matrices and I am wondering what would be a better and faster approach.
Here's another alternative, based on my comment suggestion. Note that the "ndiag" thing is not required if there will never be zeros along the diagonal.
import numpy as np
ndiag = 1 - np.eye(3)  # 0 on the diagonal, 1 everywhere else
print(ndiag)
a = np.array([[74, 0, 2], [0, 73, 8], [0, 10, 72]]).astype(float)
m = a == 0
print(m)
m = np.logical_and(ndiag, np.logical_and(m, m.T))  # off-diagonal AND zero in both mirror positions
print(m)
a[m] = np.nan
print(a)
Output:
[[0 1 1]
[1 0 1]
[1 1 0]]
[[False True False]
[ True False False]
[ True False False]]
[[False True False]
[ True False False]
[False False False]]
[[74. nan 2.]
[nan 73. 8.]
[ 0. 10. 72.]]
I've always had a preference for triu_indices and tril_indices for this sort of task. The nice thing is that they're just indices, so if all your matrices are the same size, you can cache them once without copying any specific data. The other nice thing is that for a given size n, you have that triu_indices(n, 1) is the swapped result of tril_indices(n, -1) up to some sorting that you don't generally care about.
So if all your matrices are of shape (n, n),
rows, cols = np.triu_indices(n, 1)
mask = (a[rows, cols] == 0) & (a[cols, rows] == 0)  # zero in both mirror positions
a[rows[mask], cols[mask]] = a[cols[mask], rows[mask]] = np.nan
Keep in mind that you can't assign np.nan to an array unless it's a floating point type. Also, you may get a tiny bit of mileage out of pre-computing rows[mask] and cols[mask]:
rm = rows[mask]
cm = cols[mask]
a[rm, cm] = a[cm, rm] = np.nan
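A runnable sketch putting the pieces together on the question's matrix (float dtype is assumed so nan can be stored):
import numpy as np

a = np.array([[74, 0, 2],
              [ 0, 73, 8],
              [ 0, 10, 72]], dtype=float)

n = a.shape[0]
rows, cols = np.triu_indices(n, 1)                   # strict upper triangle
mask = (a[rows, cols] == 0) & (a[cols, rows] == 0)   # zero in both mirror positions
rm, cm = rows[mask], cols[mask]
a[rm, cm] = a[cm, rm] = np.nan
print(a)
# [[74. nan  2.]
#  [nan 73.  8.]
#  [ 0. 10. 72.]]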
Here is a completely vectorised approach to solve this -
np.where(np.logical_and(np.tril(a) == np.triu(a).T, a==0), np.nan, a)
array([[74., nan, 2.],
[nan, 73., 8.],
[ 0., 10., 72.]])
Explanation -
Lets see what happens in the first step -
np.tril(a) #keeps only the lower triangular, and others become 0
array([[74, 0, 0],
[ 0, 73, 0],
[ 0, 10, 72]])
np.triu(a).T #keeps only the upper triangular and others become 0. Then flips it to become lower triangular
array([[74, 0, 0],
[ 0, 73, 0],
[ 2, 8, 72]])
Equating these gives the upper triangular part as always True, while the lower triangular part contains True only for mirror-matching elements.
np.tril(a) == np.triu(a).T
array([[ True, True, True],
[ True, True, True],
[False, False, True]])
Now, when you take a logical_and of this boolean with the a==0 matrix, only the values where the original matrix had 0 and were mirror elements remain.
np.logical_and(np.tril(a) == np.triu(a).T, a==0)
array([[False, True, False],
[ True, False, False],
[False, False, False]])
Now you can use np.where to replace True values with nan and keep the remaining values intact.
np.where(np.logical_and(np.tril(a) == np.triu(a).T, a==0), np.nan, a)
array([[74., nan, 2.],
[nan, 73., 8.],
[ 0., 10., 72.]])
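For comparison, the same "zero in both mirror positions" condition can be written directly against the transpose, which avoids the triangular decomposition entirely (a sketch, not part of the original answer):
np.where((a == 0) & (a.T == 0), np.nan, a)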
I have an n x m matrix X and an n x p matrix Y, where Y is binary data. In the end I want a p x n matrix Z, where each entry of Z applies a function to a column of X subset to the entries corresponding to 1's in a column of Y.
For example
>>> X
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> Y
array([[1, 0],
[1, 0],
[0, 1]])
n_x, m = X.shape
n_y, p = Y.shape
Z = np.zeros([p, n_x])
for i in range(n_x):
    col = X[:, [i]]
    for j in range(p):
        # this is where I subset col with Y[:,[j]]
        Z[j][i] = my_func(subsetted_column)
The iterations would produce
i=0, j=0: subsetted_column = [[1],[4]]
i=0, j=1: subsetted_column = [[7]]
i=1, j=0: subsetted_column = [[2],[5]]
i=1, j=1: subsetted_column = [[8]]
i=2, j=0: subsetted_column = [[3],[6]]
i=2, j=1: subsetted_column = [[9]]
I assume there is some way to do that nested loop in a single list comprehension. The function my_func also takes a long time so would be nice to parallelize that somehow.
Edit: I could do something like
for i in range(n_x):
    for j in range(p):
        subsetted_column = np.trim_zeros(np.multiply(X[:, i], Y[:, j]))
        Z[j][i] = my_func(subsetted_column)
But I still believe there is an easier solution.
Does this do what you want?
import numpy as np
N, M, P = 4, 3, 2
a = np.random.random((N, M))
b = np.random.randint(2, size=(N, P)).astype(bool)
your_func = lambda x: x # insert proper function here
flat = [your_func(ai[bj]) for bj in b.T for ai in a.T]
out = np.empty((P, M), dtype=object)
out.ravel()[:] = flat
print(a)
print(b)
print(out)
Remarks:
It is easiest to convert your masking array to dtype bool because this allows you to use logical indexing.
If your_func returns just a number it's better not to use dtype=object for out.
If you want to parallelise, a list comprehension is perhaps not the best construct, but the loop is an obvious parallelisation target, since the order of iterations is irrelevant; see the sketch after the sample output.
Sample output:
[[ 0.62739382 0.85774837 0.81958524]
[ 0.99690996 0.71202879 0.97636715]
[ 0.89235107 0.91739852 0.39537849]
[ 0.0413107 0.11662271 0.72419308]]
[[False True]
[ True True]
[False False]
[ True True]]
[[array([ 0.99690996, 0.0413107 ]) array([ 0.71202879, 0.11662271])
array([ 0.97636715, 0.72419308])]
[array([ 0.62739382, 0.99690996, 0.0413107 ])
array([ 0.85774837, 0.71202879, 0.11662271])
array([ 0.81958524, 0.97636715, 0.72419308])]]
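Following up on the parallelisation remark above: since each subsetted column is processed independently, the flat list can also be built with a process pool. A sketch, assuming your_func is a picklable top-level function (run_parallel is a hypothetical name):
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def your_func(col):
    return col.sum()  # placeholder: substitute the real function here

def run_parallel(a, b):
    # one task per (column of b, column of a) pair, same order as the flat list comp above
    tasks = [ai[bj] for bj in b.T for ai in a.T]
    with ProcessPoolExecutor() as ex:
        flat = list(ex.map(your_func, tasks))
    out = np.empty((b.shape[1], a.shape[1]), dtype=object)
    out.ravel()[:] = flat
    return out

if __name__ == '__main__':  # guard required for process pools on some platforms
    a = np.random.random((4, 3))
    b = np.random.randint(2, size=(4, 2)).astype(bool)
    print(run_parallel(a, b))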
It may help to perform the subsetting in a pre-processing loop
In [112]: xs = [X[y,:] for y in Y.astype(bool).T]
In [113]: xs
Out[113]:
[array([[1, 2, 3],
[4, 5, 6]]),
array([[7, 8, 9]])]
(.T is used to iterate on columns in the list comprehension; bool allows 'masked' selection)
Let's say, for example, that my_func takes the mean along axis=0 of each subset:
In [116]: [np.mean(s, axis=0) for s in xs]
Out[116]: [array([ 2.5, 3.5, 4.5]), array([ 7., 8., 9.])]
In [117]: np.array(_)
Out[117]:
array([[ 2.5, 3.5, 4.5],
[ 7. , 8. , 9. ]])
I could combine it into one loop, but it's harder to think about:
np.array([np.mean(X[y,:],axis=0) for y in Y.astype(bool).T])
With this xs list, you can focus your efforts on applying my_func efficiently to all the columns of xs[i] as np.mean(xs[i], axis=0) does.
The double loop version of this mean
In [121]: p=np.zeros((2,3))
In [122]: for i in range(2):
...: for j in range(3):
...: p[i,j] = np.mean(xs[i][:,j])
...:
In [123]: p
Out[123]:
array([[ 2.5, 3.5, 4.5],
[ 7. , 8. , 9. ]])
Equivalent double list comprehension
In [125]: [[np.mean(i) for i in j.T] for j in xs]
Out[125]: [[2.5, 3.5, 4.5], [7.0, 8.0, 9.0]]
How can I get the indices of intersection points between two numpy arrays? I can get intersecting values with intersect1d:
import numpy as np
a = np.array(range(11))  # xrange in the original; range for Python 3
b = np.array([2, 7, 10])
inter = np.intersect1d(a, b)
# inter == array([ 2, 7, 10])
But how can I get the indices into a of the values in inter?
You could use the boolean array produced by in1d to index an arange. Reversing a so that the indices are different from the values:
>>> a[::-1]
array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
>>> a = a[::-1]
intersect1d still returns the same values...
>>> numpy.intersect1d(a, b)
array([ 2, 7, 10])
But in1d returns a boolean array:
>>> numpy.in1d(a, b)
array([ True, False, False, True, False, False, False, False, True,
False, False], dtype=bool)
Which can be used to index a range:
>>> numpy.arange(a.shape[0])[numpy.in1d(a, b)]
array([0, 3, 8])
>>> indices = numpy.arange(a.shape[0])[numpy.in1d(a, b)]
>>> a[indices]
array([10, 7, 2])
To simplify the above, though, you could use nonzero -- this is probably the most correct approach, because it returns a tuple of index arrays, one per dimension:
>>> numpy.nonzero(numpy.in1d(a, b))
(array([0, 3, 8]),)
Or, equivalently:
>>> numpy.in1d(a, b).nonzero()
(array([0, 3, 8]),)
The result can be used as an index to arrays of the same shape as a with no problems.
>>> a[numpy.nonzero(numpy.in1d(a, b))]
array([10, 7, 2])
But note that under many circumstances, it makes sense just to use the boolean array itself, rather than converting it into a set of non-boolean indices.
Finally, you can also pass the boolean array to argwhere, which produces a slightly differently-shaped result that's not as suitable for indexing, but might be useful for other purposes.
>>> numpy.argwhere(numpy.in1d(a, b))
array([[0],
[3],
[8]])
If you need to get unique values as given by intersect1d:
import numpy as np
a = np.array([range(11,21), range(11,21)]).reshape(20)
b = np.array([12, 17, 20])
print(np.intersect1d(a, b))   # unique values
inter = np.in1d(a, b)
print(a[inter])               # you can see these values are not unique
indices = np.arange(len(a))[inter]   # these are the non-unique indices
_, unique = np.unique(a[inter], return_index=True)
uniqueIndices = indices[unique]      # this grabs the unique indices
print(uniqueIndices)
print(a[uniqueIndices])       # now they are unique, as from np.intersect1d()
Output:
[12 17 20]
[12 17 20 12 17 20]
[1 6 9]
[12 17 20]
indices = np.argwhere(np.in1d(a, b))
Note that argwhere returns the indices as a (k, 1) column; apply .ravel() if you need a flat index array.
For Python >= 3.5, there's another way to do this. Let's go through it step by step.
Based on the original code from the question
import numpy as np
a = np.array(range(11))
b = np.array([2, 7, 10])
inter = np.intersect1d(a, b)
First, we create a numpy array with zeros
c = np.zeros(len(a))
print (c)
output
>>> [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Second, set the entries of c selected by inter to 1. Note that this indexes c with the intersected values, which works here only because the values in a equal their own indices:
c[inter] = 1
print (c)
output
>>>[ 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1.]
The last step uses np.nonzero(), which returns exactly the indices of the nonzero entries:
inter_with_idx = np.nonzero(c)
print (inter_with_idx)
Final output
(array([ 2,  7, 10]),)
Reference
[1] numpy.nonzero
As of numpy version 1.15.0, intersect1d has a return_indices option:
numpy.intersect1d(ar1, ar2, assume_unique=False, return_indices=False)
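A quick usage sketch with the arrays from the question; with return_indices=True the function also returns the indices of the common values in each input:
import numpy as np

a = np.array(range(11))
b = np.array([2, 7, 10])

inter, ind_a, ind_b = np.intersect1d(a, b, return_indices=True)
print(inter)   # [ 2  7 10]
print(ind_a)   # [ 2  7 10]  -> indices into a
print(ind_b)   # [0 1 2]     -> indices into b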