I tried searching for an answer, but couldn't find what I needed. Apologies if this is a duplicate question.
Suppose I have a 2d-array with shape (n, n*m). What I want to do is an outer sum of this array with its transpose, resulting in an array with shape (n*m, n*m). For example, suppose I have
A = array([[1., 1., 2., 2.],
           [1., 1., 2., 2.]])
I want to do an outer sum of A and A.T such that the output is:
array([[2., 2., 3., 3.],
       [2., 2., 3., 3.],
       [3., 3., 4., 4.],
       [3., 3., 4., 4.]])
Note that np.add.outer does not work because it ravels the inputs into vectors. I can achieve something similar by doing
np.tile(A, (2, 1)) + np.tile(A.T, (1, 2))
but this does not scale well when n and m are large (n > 100 and m > 1000). Is it possible to write this sum using einsum? I just can't seem to figure out einsum.
To leverage broadcasting, we need to break it down to 3D and then permute axes and add -
n = A.shape[0]
m = A.shape[1]//n
a = A.reshape(n,m,n) # reshape to 3D
out = (a[None,:,:,:] + a.transpose(1,2,0)[:,:,None,:]).reshape(n*m,-1)
Sample run for verification -
In [359]: # Setup input array
...: np.random.seed(0)
...: n,m = 3,4
...: A = np.random.randint(1,10,(n,n*m))
In [360]: # Original soln
...: out0 = np.tile(A, (m, 1)) + np.tile(A.T, (1, m))
In [361]: # Posted soln
...: n = A.shape[0]
...: m = A.shape[1]//n
...: a = A.reshape(n,m,n)
...: out = (a[None,:,:,:] + a.transpose(1,2,0)[:,:,None,:]).reshape(n*m,-1)
In [362]: np.allclose(out0, out)
Out[362]: True
Timings with large n,m -
In [363]: # Setup input array
...: np.random.seed(0)
...: n,m = 100,100
...: A = np.random.randint(1,10,(n,n*m))
In [364]: %timeit np.tile(A, (m, 1)) + np.tile(A.T, (1, m))
1 loop, best of 3: 407 ms per loop
In [365]: %%timeit
...: # Posted soln
...: n = A.shape[0]
...: m = A.shape[1]//n
...: a = A.reshape(n,m,n)
...: out = (a[None,:,:,:] + a.transpose(1,2,0)[:,:,None,:]).reshape(n*m,-1)
1 loop, best of 3: 219 ms per loop
Further performance boost with numexpr
We can leverage multi-core processing with the numexpr module for large data, gaining memory efficiency and hence performance -
import numexpr as ne
n = A.shape[0]
m = A.shape[1]//n
a = A.reshape(n,m,n)
p1 = a[None,:,:,:]
p2 = a.transpose(1,2,0)[:,:,None,:]
out = ne.evaluate('p1+p2').reshape(n*m,-1)
Timings with the same large n, m setup -
In [367]: %%timeit
...: # Posted soln
...: n = A.shape[0]
...: m = A.shape[1]//n
...: a = A.reshape(n,m,n)
...: p1 = a[None,:,:,:]
...: p2 = a.transpose(1,2,0)[:,:,None,:]
...: out = ne.evaluate('p1+p2').reshape(n*m,-1)
10 loops, best of 3: 152 ms per loop
One way is
(A.reshape(-1,*A.shape).T+A)[:,0,:]
I think this will take a lot of memory with n>100 and m>1000.
but isn't this the same as
np.add.outer(A,A)[:,0,:].reshape(4,-1)
Say I have a 2D pytorch tensor and a 2D numpy boolean as follows,
a = torch.tensor([[ 0.,  1.,  2.],
                  [ 3.,  4.,  5.],
                  [ 6.,  7.,  8.],
                  [ 9., 10., 11.],
                  [12., 13., 14.]])
m = numpy.array([[False,  True, False],
                 [ True, False,  True],
                 [False,  True,  True],
                 [False, False, False],
                 [ True, False, False]])
They have the same dimension and the number of True's in each column of m is the same.
I need to get the 2x3 tensor that is
a.transpose(0,1).masked_select(torch.from_numpy(m.transpose())).reshape(a.shape[1],-1).transpose(0,1)
which is
tensor([[ 3.,  1.,  5.],
        [12.,  7.,  8.]])
The actual tensor is very large, and the operation needs to be performed many times. So I want to ask what is an efficient way of doing this (or the most efficient way).
In my benchmarks, a jitted numba solution is the fastest I could find.
My benchmarks for a, m with shape (10000, 200) (all approaches return equal result tensors):
1. @numba.jit: 13.2 ms (3.46x)
2. list comprehension: 31.3 ms (1.46x)
3. baseline: 45.7 ms (1.00x)
Generation of sufficiently large sample data for benchmarking
import torch
import numpy as np
def generate_data(rows=500, columns=100):
    a = torch.from_numpy(np.random.uniform(1, 10, (rows, columns)).astype(np.float32))

    # argsort trick by @Divakar https://stackoverflow.com/a/55317373/14277722
    def shuffle_along_axis(a, axis):
        idx = np.random.rand(*a.shape).argsort(axis=axis)
        return np.take_along_axis(a, idx, axis=axis)

    m = shuffle_along_axis(np.full((columns, rows), np.random.randint(2, size=rows)), 1).astype('bool').T
    return a, np.ascontiguousarray(m)
a, m = generate_data(10000,200)
A jitted numba implementation
import numba as nb
@nb.njit
def gather2d(arr1, arr2):
    res = np.zeros((np.count_nonzero(arr2[:, 0]), arr1.shape[1]), np.float32)
    counter = np.zeros(arr1.shape[1], dtype=np.intp)
    for i in range(arr1.shape[0]):
        for j in range(arr1.shape[1]):
            if arr2[i, j]:
                res[counter[j], j] = arr1[i, j]
                counter[j] += 1
    return res
torch.from_numpy(gather2d(a.numpy(),m))
Output
# %timeit 10 loops, best of 5: 13.2 ms per loop
tensor([[2.1846, 7.8890, 8.8218, ..., 4.8309, 9.2853, 6.4404],
[5.8842, 3.7332, 6.7436, ..., 1.2914, 3.2983, 3.5627],
[9.5128, 2.4283, 2.2152, ..., 4.9512, 9.7335, 9.6252],
...,
[7.3193, 7.8524, 9.6654, ..., 3.3665, 8.8926, 4.7660],
[1.3829, 1.3347, 6.6436, ..., 7.1956, 4.0446, 6.4633],
[6.4264, 3.6283, 3.6385, ..., 8.4152, 5.8498, 5.0281]])
Against a vectorized baseline solution
# %timeit 10 loops, best of 5: 45.7 ms per loop
a.gather(0, torch.from_numpy(np.nonzero(m.T)[1].reshape(-1, m.shape[1], order='F')))
A Python list comprehension turns out to be surprisingly fast:
def g(arr1, arr2):
    return np.array([i[j] for i, j in zip(arr1.T, arr2.T)]).T
# %timeit 10 loops, best of 5: 31.3 ms per loop
torch.from_numpy(g(a.numpy(), m))
You can try it this way, using only NumPy and PyTorch:
b,c = m.nonzero()
b = torch.tensor(b)
c = torch.tensor(c)
a[b,c].reshape(2,3)
#output
tensor([[ 1., 3., 5.],
[ 7., 8., 12.]]) # True values are taken on axis=1
I used the same example provided by @Michael Szczesny to measure the time:
import numpy as np
import timeit
import torch
rows, columns = (10000,200)
a = torch.from_numpy(np.random.uniform(1,10, (rows,columns)).astype(np.float32))
m = np.random.choice([False, True], size=(rows, columns))
starttime = timeit.default_timer()
b,c = m.nonzero()
b = torch.tensor(b)
c = torch.tensor(c)
a[b,c]
print(f"The time difference is :{(timeit.default_timer() - starttime)*1000} ms")
#output
The time difference is : 26.4 ms
It is better than the second and third approaches of @Michael Szczesny.
I am trying to compute a convolution on a scipy.sparse matrix. Here is the code:
import numpy as np
import scipy.sparse, scipy.signal
M = scipy.sparse.csr_matrix([[0,1,0,0],[1,0,0,1],[1,0,1,0],[0,0,0,0]])
kernel = np.ones((3,3))
kernel[1,1]=0
X = scipy.signal.convolve(M, kernel, mode='same')
Which produces the following error:
ValueError: volume and kernel should have the same dimensionality
Computing scipy.signal.convolve(M.todense(), kernel, mode='same') provides the expected result. However, I would like to keep the computation sparse.
More generally speaking, my goal is to compute the 1-hop neighbourhood sum of the sparse matrix M. If you have any good idea how to calculate this on a sparse matrix, I would love to hear it!
EDIT:
I just tried a solution for this specific kernel (sum of neighbors) that is not really faster than the dense version (I didn't try in a very high dimension though). Here is the code:
row_ind, col_ind = M.nonzero()
X = scipy.sparse.csr_matrix((M.shape[0]+2, M.shape[1]+2))
for i in [0, 1, 2]:
    for j in [0, 1, 2]:
        if i != 1 or j != 1:
            X += scipy.sparse.csr_matrix((M.data, (row_ind+i, col_ind+j)), (M.shape[0]+2, M.shape[1]+2))
X = X[1:-1, 1:-1]
In [1]: import numpy as np; from scipy import sparse, signal
In [2]: M = sparse.csr_matrix([[0,1,0,0],[1,0,0,1],[1,0,1,0],[0,0,0,0]])
...: kernel = np.ones((3,3))
...: kernel[1,1]=0
In [3]: X = signal.convolve(M.A, kernel, mode='same')
In [4]: X
Out[4]:
array([[2., 1., 2., 1.],
[2., 4., 3., 1.],
[1., 3., 1., 2.],
[1., 2., 1., 1.]])
Why do posters show runnable code, but not the results? Most of us can't run code like this in our heads.
In [5]: M.A
Out[5]:
array([[0, 1, 0, 0],
[1, 0, 0, 1],
[1, 0, 1, 0],
[0, 0, 0, 0]])
Your alternative - while the result is a sparse matrix, all values are filled. Even if M is larger and sparser, X will be denser.
In [7]: row_ind, col_ind = M.nonzero()
   ...: X = sparse.csr_matrix((M.shape[0]+2, M.shape[1]+2))
   ...: for i in [0, 1, 2]:
   ...:     for j in [0, 1, 2]:
   ...:         if i != 1 or j != 1:
   ...:             X += sparse.csr_matrix((M.data, (row_ind+i, col_ind+j)), (M.shape[0]+2, M.shape[1]+2))
   ...: X = X[1:-1, 1:-1]
In [8]: X
Out[8]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 16 stored elements in Compressed Sparse Row format>
In [9]: X.A
Out[9]:
array([[2., 1., 2., 1.],
[2., 4., 3., 1.],
[1., 3., 1., 2.],
[1., 2., 1., 1.]])
Here's an alternative that builds the COO-style inputs, and only makes the matrix at the end. Keep in mind that repeated coordinates are summed; that's handy in FEM stiffness matrix construction, and fits nicely here as well (a small standalone illustration follows the transcript below).
In [10]: row_ind, col_ind = M.nonzero()
    ...: data, row, col = [], [], []
    ...: for i in [0, 1, 2]:
    ...:     for j in [0, 1, 2]:
    ...:         if i != 1 or j != 1:
    ...:             data.extend(M.data)
    ...:             row.extend(row_ind+i)
    ...:             col.extend(col_ind+j)
    ...: X = sparse.csr_matrix((data, (row, col)), (M.shape[0]+2, M.shape[1]+2))
    ...: X = X[1:-1, 1:-1]
In [11]: X
Out[11]:
<4x4 sparse matrix of type '<class 'numpy.int64'>'
with 16 stored elements in Compressed Sparse Row format>
In [12]: X.A
Out[12]:
array([[2, 1, 2, 1],
[2, 4, 3, 1],
[1, 3, 1, 2],
[1, 2, 1, 1]])
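To isolate the duplicate-summing behaviour mentioned above, here is a minimal standalone sketch (toy data chosen just for this illustration, not part of the original answer):
from scipy import sparse

# repeated (row, col) coordinates are summed when the matrix is built
data = [1, 1, 1]
row = [0, 0, 1]
col = [0, 0, 1]
X = sparse.csr_matrix((data, (row, col)), shape=(2, 2))
print(X.toarray())
# [[2 0]
#  [0 1]]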
===
My approach is noticeably faster (but still well behind the dense convolution). sparse.csr_matrix(...) is pretty slow, so it isn't a good idea to call it repeatedly. And sparse addition isn't very fast either.
In [13]: %%timeit
    ...: row_ind, col_ind = M.nonzero()
    ...: data, row, col = [], [], []
    ...: for i in [0, 1, 2]:
    ...:     for j in [0, 1, 2]:
    ...:         if i != 1 or j != 1:
    ...:             data.extend(M.data)
    ...:             row.extend(row_ind+i)
    ...:             col.extend(col_ind+j)
    ...: X = sparse.csr_matrix((data, (row, col)), (M.shape[0]+2, M.shape[1]+2))
    ...: X = X[1:-1, 1:-1]
793 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [14]: %%timeit
    ...: row_ind, col_ind = M.nonzero()
    ...: X = sparse.csr_matrix((M.shape[0]+2, M.shape[1]+2))
    ...: for i in [0, 1, 2]:
    ...:     for j in [0, 1, 2]:
    ...:         if i != 1 or j != 1:
    ...:             X += sparse.csr_matrix((M.data, (row_ind+i, col_ind+j)), (M.shape[0]+2, M.shape[1]+2))
    ...: X = X[1:-1, 1:-1]
4.72 ms ± 92.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [15]: timeit X = signal.convolve(M.A, kernel, mode='same')
85.9 µs ± 339 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Consider the function below:
import numpy as np
a = np.ones(16).reshape(4,4)
def fn(a):
    b = np.array(a)
    for i in range(b.shape[0]):
        for j in range(b.shape[1] - 1):
            b[i][j+1] += b[i][j]
    return b
print(fn(a))
That is, for a general function that calculates step t+1 based on step t of an array, can I make this faster? I'm aware of np.vectorize, but it doesn't seem appropriate for this case.
You can use cumsum; I think that would be helpful.
import numpy as np
import pandas as pd
a = np.ones(16).reshape(4,4)
df =pd.DataFrame(a)
df.cumsum(axis=1)
Or you can use np.cumsum():
np.cumsum(a,axis=1)
It is possible to reduce the two for loops to one for loop, at the cost of a little extra copying.
In [86]: a
Out[86]:
array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
In [87]: b = a.copy()
In [88]: for col in range(b.shape[1]-1):
    ...:     b[:, col+1] = np.sum(a[:, :col+2], axis=1)
In [89]: b
Out[89]:
array([[1., 2., 3., 4.],
[1., 2., 3., 4.],
[1., 2., 3., 4.],
[1., 2., 3., 4.]])
To make this work for a generic function, you can look for an equivalent function in numpy or implement one using vectorized numpy operations. For the example you provided, I just used numpy.sum(), which does the job for us.
In terms of performance, this approach is much better than operating with two for loops at the index level, particularly for larger arrays. In the approach used above, we work with column slices.
Here are the timings, which suggest more than a 3x speedup over the native Python implementation.
Native Python:
def fn(a):
    b = np.array(a)
    for i in range(b.shape[0]):
        for j in range(b.shape[1] - 1):
            b[i][j+1] += b[i][j]
    return b
Slightly vectorized:
In [104]: def slightly_vectorized(b):
     ...:     for col in range(b.shape[1]-1):
     ...:         b[:, col+1] = np.sum(a[:, :col+2], axis=1)
     ...:     return b
In [100]: a = np.ones(625).reshape(25, 25)
In [101]: %timeit fn(a)
303 µs ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [102]: b = a.copy()
In [103]: %timeit slightly_vectorized(b)
99.8 µs ± 501 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
What you're looking for is called accumulate; here's an example:
import numpy as np
from itertools import accumulate
def fn(a):
    acc = accumulate(a, lambda prev, row: prev + row)
    return np.array(list(acc))
a = np.arange(16).reshape(4, 4)
print(fn(a))
# [[ 0 1 2 3]
# [ 4 6 8 10]
# [12 15 18 21]
# [24 28 32 36]]
There is no optimized general-purpose accumulate function in numpy, because it's not really possible to write accumulate in a way that's both performant and general. The Python implementation is general, but will perform much like a hand-coded loop.
To get optimal performance you'll probably need to find or write a low-level implementation of the specific accumulate function that you need. You've already mentioned numba, and you could also look into cython; a rough numba sketch follows below.
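For illustration only, here is a minimal numba sketch of the specific accumulation from the question (a running sum along each row). The name row_accumulate and the test array are placeholders for this example, not part of the original answer:
import numpy as np
import numba as nb

@nb.njit
def row_accumulate(a):
    # jitted version of the question's fn: running sum along each row
    b = a.copy()
    for i in range(b.shape[0]):
        for j in range(b.shape[1] - 1):
            b[i, j + 1] += b[i, j]
    return b

a = np.ones((4, 4))
print(row_accumulate(a))  # each row becomes [1., 2., 3., 4.]
print(np.allclose(row_accumulate(a), np.cumsum(a, axis=1)))  # True
For this particular accumulation np.cumsum is of course already optimal; a jitted version only pays off for functions without a ready-made numpy equivalent.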
I have two 1D numpy arrays A and B of size (n, ) and (m, ) respectively, which correspond to the x positions of points on a line. I want to calculate the distance from every point in A to every point in B. I then need to use these distances, at a set y distance d, to work out the potential at each point in A.
I'm currently using the following:
V = numpy.zeros(n)
for i in range(n):
    xdist = A[i] - B
    r = numpy.sqrt(xdist**2 + d**2)
    dV = 1/r
    V[i] = numpy.sum(dV)
This works, but for large data sets it can take a while. I would like to use a function similar to scipy.spatial.distance.cdist, which doesn't work for 1D arrays, and I don't want to add another dimension to the arrays as they become too large.
Vectorized approach
One vectorized approach is to extend A to 2D by introducing a new axis with np.newaxis/None and then make use of broadcasting -
(1/(np.sqrt((A[:,None] - B)**2 + d**2))).sum(1)
Hybrid approach for large arrays
Now, for large arrays, we might have to divide the data into chunks.
Thus, with BSZ as the block size, we would have a hybrid approach, like so -
dsq = d**2
V = np.zeros((n//BSZ,BSZ))
for i in range(n//BSZ):
    V[i] = (1/(np.sqrt((A[i*BSZ:(i+1)*BSZ, None] - B)**2 + dsq))).sum(1)
Runtime test
Approaches -
def original_app(A, B, d):
    V = np.zeros(n)
    for i in range(n):
        xdist = A[i] - B
        r = np.sqrt(xdist**2 + d**2)
        dV = 1/r
        V[i] = np.sum(dV)
    return V

def vectorized_app1(A, B, d):
    return (1/(np.sqrt((A[:, None] - B)**2 + d**2))).sum(1)

def vectorized_app2(A, B, d, BSZ=100):
    dsq = d**2
    V = np.zeros((n//BSZ, BSZ))
    for i in range(n//BSZ):
        V[i] = (1/(np.sqrt((A[i*BSZ:(i+1)*BSZ, None] - B)**2 + dsq))).sum(1)
    return V.ravel()
Timings and verification -
In [203]: # Setup inputs
...: n,m = 10000,2000
...: A = np.random.rand(n)
...: B = np.random.rand(m)
...: d = 10
...:
In [204]: out1 = original_app(A,B,d)
...: out2 = vectorized_app1(A,B,d)
...: out3 = vectorized_app2(A,B,d, BSZ = 100)
...:
...: print np.allclose(out1, out2)
...: print np.allclose(out1, out3)
...:
True
True
In [205]: %timeit original_app(A,B,d)
10 loops, best of 3: 133 ms per loop
In [206]: %timeit vectorized_app1(A,B,d)
10 loops, best of 3: 138 ms per loop
In [207]: %timeit vectorized_app2(A,B,d, BSZ = 100)
10 loops, best of 3: 65.2 ms per loop
We can play around with the parameter block size BSZ -
In [208]: %timeit vectorized_app2(A,B,d, BSZ = 200)
10 loops, best of 3: 74.5 ms per loop
In [209]: %timeit vectorized_app2(A,B,d, BSZ = 50)
10 loops, best of 3: 67.4 ms per loop
Thus, the best one seems to give a 2x speedup with a block size of 100 at my end.
EDIT: My answer turned out to be nearly identical to Divakar's after a closer look. However, you can save some memory by doing the operations in-place. Taking the sum along the second axis is more efficient than along the first.
import numpy

d = 2.0  # example value for the fixed y-distance from the question
a = numpy.random.randint(0, 10, 10) * 1.
b = numpy.random.randint(0, 10, 10) * 1.

xdist = a[:, None] - b
xdist **= 2
xdist += d**2
xdist **= -0.5   # 1/sqrt(xdist**2 + d**2), computed in place
V = numpy.sum(xdist, axis=1)
which gives the same solution as your code.
I would like to use a function similar to scipy.spatial.distance.cdist which doesn't work for 1D arrays and I don't want to add another dimension to the arrays as they become too large.
cdist works fine, you just have to reshape the arrays to have shape (n, 1) instead of (n,). You can add another dimension to a one-dimensional array A without copying the underlying data by using A[:, None] or A.reshape(-1, 1).
For example,
In [56]: from scipy.spatial.distance import cdist
In [57]: A
Out[57]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [58]: B
Out[58]: array([0, 2, 4, 6, 8])
In [59]: A[:, None]
Out[59]:
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]])
In [60]: cdist(A[:, None], B[:, None])
Out[60]:
array([[ 0., 2., 4., 6., 8.],
[ 1., 1., 3., 5., 7.],
[ 2., 0., 2., 4., 6.],
[ 3., 1., 1., 3., 5.],
[ 4., 2., 0., 2., 4.],
[ 5., 3., 1., 1., 3.],
[ 6., 4., 2., 0., 2.],
[ 7., 5., 3., 1., 1.],
[ 8., 6., 4., 2., 0.],
[ 9., 7., 5., 3., 1.]])
To compute V as shown in your code, you can use cdist with metric='sqeuclidean', as follows:
In [72]: d = 3.
In [73]: r = np.sqrt(cdist(A[:,None], B[:,None], metric='sqeuclidean') + d**2)
In [74]: V = (1/r).sum(axis=1)
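As a quick aside, the no-copy claim above (that A[:, None] and A.reshape(-1, 1) are views of the same data) can be checked with np.shares_memory; a minimal sketch:
import numpy as np

A = np.arange(10)

# both forms are views over A's buffer; no data is copied
print(np.shares_memory(A, A[:, None]))        # True
print(np.shares_memory(A, A.reshape(-1, 1)))  # True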
Given two numpy arrays of shape nx3 and mx3, what is an efficient way to determine the row indices (counter) where the rows are common to the two arrays? For instance, I have the following solution, which becomes significantly slow even for moderately larger arrays:
def arrangment(arr1, arr2):
    hits = []
    for i in range(arr2.shape[0]):
        current_row = np.repeat(arr2[i, :][None, :], arr1.shape[0], axis=0)
        x = current_row - arr1
        for j in range(arr1.shape[0]):
            if np.isclose(x[j, 0], 0.0) and np.isclose(x[j, 1], 0.0) and np.isclose(x[j, 2], 0.0):
                hits.append(j)
    return hits
It checks whether rows of arr2 exist in arr1 and returns the row indices of arr1 where the rows match. I need this arrangement to always follow, in order, the rows of arr2. For instance, given
arr1 = np.array([[-1., -1., -1.],
                 [ 1., -1., -1.],
                 [ 1.,  1., -1.],
                 [-1.,  1., -1.],
                 [-1., -1.,  1.],
                 [ 1., -1.,  1.],
                 [ 1.,  1.,  1.],
                 [-1.,  1.,  1.]])
arr2 = np.array([[-1.,  1., -1.],
                 [ 1.,  1., -1.],
                 [ 1.,  1.,  1.],
                 [-1.,  1.,  1.]])
The function should return:
[3, 2, 6, 7]
Quick and dirty answer
(arr1[:, None] == arr2).all(-1).argmax(0)
array([3, 2, 6, 7])
Better answer
Takes care of the case where a row in arr2 doesn't match anything in arr1 (a small illustration follows below):
t = (arr1[:, None] == arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)
array([ 3., 2., 6., 7.])
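A small illustration of why the guard matters, with a row added to arr2 that has no match in arr1 (values chosen just for this example):
import numpy as np

arr1 = np.array([[-1., -1., -1.],
                 [ 1., -1., -1.]])
arr2 = np.array([[ 1., -1., -1.],
                 [ 9.,  9.,  9.]])  # no match in arr1

t = (arr1[:, None] == arr2).all(-1)
print(t.argmax(0))                              # [1 0] -- the unmatched row silently maps to index 0
print(np.where(t.any(0), t.argmax(0), np.nan))  # [ 1. nan]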
As pointed out by @Divakar, np.isclose accounts for rounding error when comparing floats:
t = np.isclose(arr1[:, None], arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)
I had a similar problem in the past and I came up with a fairly optimised solution for it.
First you need a generalisation of numpy.unique for multidimensional arrays, which, for the sake of completeness, I copy here:
def unique2d(arr, consider_sort=False, return_index=False, return_inverse=False):
    """Get unique values along an axis for 2D arrays.

    input:
        arr:
            2D array
        consider_sort:
            Does permutation of the values within the axis matter?
            Two rows can contain the same values but with
            different arrangements. If consider_sort
            is True then those rows would be considered equal.
        return_index:
            Similar to numpy unique
        return_inverse:
            Similar to numpy unique

    returns:
        2D array of unique rows
        If return_index is True also returns indices
        If return_inverse is True also returns the inverse array
    """
    if consider_sort is True:
        a = np.sort(arr, axis=1)
    else:
        a = arr
    b = np.ascontiguousarray(a).view(np.dtype((np.void,
        a.dtype.itemsize * a.shape[1])))

    if return_inverse is False:
        _, idx = np.unique(b, return_index=True)
    else:
        _, idx, inv = np.unique(b, return_index=True, return_inverse=True)

    if return_index == False and return_inverse == False:
        return arr[idx]
    elif return_index == True and return_inverse == False:
        return arr[idx], idx
    elif return_index == False and return_inverse == True:
        return arr[idx], inv
    else:
        return arr[idx], idx, inv
Now all you need is to concatenate (np.vstack) your arrays and find the unique rows. The reverse mapping together with np.searchsorted will give you the indices you need. So let's write another function, similar to numpy.in1d but for multidimensional (2D) arrays:
def in2d_unsorted(arr1, arr2, axis=1, consider_sort=False):
    """Find the elements in arr1 which are also in
    arr2 and sort them as they appear in arr2"""

    assert arr1.dtype == arr2.dtype

    if axis == 0:
        arr1 = np.copy(arr1.T, order='C')
        arr2 = np.copy(arr2.T, order='C')

    if consider_sort is True:
        sorter_arr1 = np.argsort(arr1)
        arr1 = arr1[np.arange(arr1.shape[0])[:, None], sorter_arr1]
        sorter_arr2 = np.argsort(arr2)
        arr2 = arr2[np.arange(arr2.shape[0])[:, None], sorter_arr2]

    arr = np.vstack((arr1, arr2))
    _, inv = unique2d(arr, return_inverse=True)

    size1 = arr1.shape[0]
    size2 = arr2.shape[0]

    arr3 = inv[:size1]
    arr4 = inv[-size2:]

    # Sort the indices as they appear in arr2
    sorter = np.argsort(arr3)
    idx = sorter[arr3.searchsorted(arr4, sorter=sorter)]

    return idx
Now all you need to do is call in2d_unsorted with your input parameters
>>> in2d_unsorted(arr1,arr2)
array([ 3, 2, 6, 7])
While it may not be fully optimised, this approach is much faster. Let's benchmark it against @piRSquared's solution
def indices_piR(arr1, arr2):
    t = np.isclose(arr1[:, None], arr2).all(-1)
    return np.where(t.any(0), t.argmax(0), np.nan)
with the following arrays
n=150
arr1 = np.random.permutation(n).reshape(n//3, 3)
idx = np.random.permutation(n//3)
arr2 = arr1[idx]
In [13]: np.allclose(in2d_unsorted(arr1,arr2),indices_piR(arr1,arr2))
True
In [14]: %timeit indices_piR(arr1,arr2)
10000 loops, best of 3: 181 µs per loop
In [15]: %timeit in2d_unsorted(arr1,arr2)
10000 loops, best of 3: 85.7 µs per loop
Now, for n=1500
In [24]: %timeit indices_piR(arr1,arr2)
100 loops, best of 3: 10.3 ms per loop
In [25]: %timeit in2d_unsorted(arr1,arr2)
1000 loops, best of 3: 403 µs per loop
and for n=15000
In [28]: %timeit indices_piR(A,B)
1 loop, best of 3: 1.02 s per loop
In [29]: %timeit in2d_unsorted(arr1,arr2)
100 loops, best of 3: 4.65 ms per loop
So for largeish arrays this is over 200x faster compared to @piRSquared's vectorised solution.