Numpy: find row-wise common element efficiently - python

Suppose we are given two 2D numpy arrays a and b with the same number of rows. Assume furthermore that we know that each row i of a and b has at most one element in common, though this element may occur multiple times. How can we find this element as efficiently as possible?
An example:
import numpy as np
a = np.array([[1, 2, 3],
[2, 5, 2],
[5, 4, 4],
[2, 1, 3]])
b = np.array([[4, 5],
[3, 2],
[1, 5],
[0, 5]])
desiredResult = np.array([[np.nan],
[2],
[5],
[np.nan]])
It is easy to come up with a straightforward implementation by applying intersect1d along the first axis:
from itertools import starmap
# rows yield results of different lengths, so recent NumPy needs an object array here
desiredResult = np.array(list(starmap(np.intersect1d, zip(a, b))), dtype=object)
Apparently, using Python's built-in set operations is even quicker. Converting the result to the desired form is easy.
However, I need an implementation that is as efficient as possible. Hence, I do not like the starmap, as I suppose that it requires a Python call for every row. I would like a purely vectorized option, and would be happy if it even exploited our additional knowledge that there is at most one common value per row.
Does anyone have ideas on how I could speed up the task and implement the solution more elegantly? I would be okay with using C code or Cython, but the coding effort should not be too much.

Approach #1
Here's a vectorized one based on searchsorted2d -
# Sort each row of a and b in-place
a.sort(1)
b.sort(1)
# Use 2D searchsorted row-wise between a and b
idx = searchsorted2d(a,b)
# "Clip-out" out of bounds indices
idx[idx==a.shape[1]] = 0
# Get mask of valid ones i.e. matches
mask = np.take_along_axis(a,idx,axis=1)==b
# Use argmax to get first match as we know there's at most one match
match_val = np.take_along_axis(b,mask.argmax(1)[:,None],axis=1)
# Finally use np.where to choose between valid match
# (decided by any one True in each row of mask)
out = np.where(mask.any(1)[:,None],match_val,np.nan)
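The snippet above calls searchsorted2d, which is not defined in this answer. A minimal sketch of such a helper follows, assuming integer inputs whose rows are already sorted and whose value range is small enough that the row offsets do not overflow. It shifts each row into its own disjoint value range so that one np.searchsorted call handles all rows. The names searchsorted2d_sketch and common_per_row are hypothetical:

```python
import numpy as np

def searchsorted2d_sketch(a, b):
    # Hypothetical helper: row-wise np.searchsorted(a[i], b[i]) for row-sorted a.
    m, n = a.shape
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    # Shift row i by i * span so value ranges of different rows cannot overlap.
    span = hi - lo + 1
    r = span * np.arange(m)[:, None]
    p = np.searchsorted((a + r).ravel(), (b + r).ravel()).reshape(m, -1)
    # Convert positions in the flattened array back to per-row positions.
    return p - n * np.arange(m)[:, None]

def common_per_row(a, b):
    # Work on sorted copies so the caller's arrays are left untouched.
    a = np.sort(a, axis=1)
    b = np.sort(b, axis=1)
    idx = searchsorted2d_sketch(a, b)
    idx[idx == a.shape[1]] = 0  # clip out-of-bounds insertion points
    mask = np.take_along_axis(a, idx, axis=1) == b
    match_val = np.take_along_axis(b, mask.argmax(1)[:, None], axis=1)
    return np.where(mask.any(1)[:, None], match_val, np.nan)

a = np.array([[1, 2, 3], [2, 5, 2], [5, 4, 4], [2, 1, 3]])
b = np.array([[4, 5], [3, 2], [1, 5], [0, 5]])
out = common_per_row(a, b)
```

For the example from the question, out has shape (4, 1) with NaN in the rows that share no element.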
Approach #2
Numba-based one for memory efficiency -
from numba import njit
@njit(parallel=True)
def numba_f1(a,b,out):
    n,a_ncols = a.shape
    b_ncols = b.shape[1]
    for i in range(n):
        for j in range(a_ncols):
            # initialise so that m is defined even if b has no columns
            m = False
            for k in range(b_ncols):
                m = a[i,j]==b[i,k]
                if m:
                    break
            if m:
                # first common element found for this row
                out[i] = a[i,j]
                break
    return out
def find_first_common_elem_per_row(a,b):
    # rows without a common element keep the NaN fill value
    out = np.full(len(a),np.nan)
    numba_f1(a,b,out)
    return out
Approach #3
Here's another vectorized one based on stacking and sorting -
r = np.arange(len(a))
ab = np.hstack((a,b))
idx = ab.argsort(1)
ab_s = ab[r[:,None],idx]
m = ab_s[:,:-1] == ab_s[:,1:]
m2 = (idx[:,1:]*m)>=a.shape[1]
m3 = m & m2
out = np.where(m3.any(1),b[r,idx[r,m3.argmax(1)+1]-a.shape[1]],np.nan)
Approach #4
For an elegant one, we can make use of broadcasting for a resource-hungry method -
m = (a[:,None]==b[:,:,None]).any(2)
out = np.where(m.any(1),b[np.arange(len(a)),m.argmax(1)],np.nan)
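To make the shapes in the broadcasting approach explicit, here is a small annotated run on the example from the question. It is only an illustration; the intermediate name eq is introduced here and is not part of the answer:

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 5, 2], [5, 4, 4], [2, 1, 3]])
b = np.array([[4, 5], [3, 2], [1, 5], [0, 5]])

# a[:, None] has shape (rows, 1, a_cols); b[:, :, None] has shape (rows, b_cols, 1).
eq = a[:, None] == b[:, :, None]   # broadcasts to (rows, b_cols, a_cols)
m = eq.any(2)                      # m[i, k] is True if b[i, k] occurs in a[i]
# Pick the first matching element of b per row, or NaN if the row has none.
out = np.where(m.any(1), b[np.arange(len(a)), m.argmax(1)], np.nan)
```

The temporary eq holds rows * a_cols * b_cols booleans, which is why this variant is described as resource-hungry.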

Doing some research, I found that checking whether two lists are disjoint runs in O(n+m), where n and m are the lengths of the lists (see here). The idea is that insertion and lookup of elements run in constant time on average for hash maps. Therefore, inserting all elements from the first list into a hash map takes O(n) operations, and checking for each element in the second list whether it is already in the hash map takes O(m) operations. Therefore, solutions based on sorting, which run in O(n log(n) + m log(m)), are not asymptotically optimal.
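The hash-based idea can be sketched in plain Python, where the built-in set plays the role of the hash map. This only illustrates the O(n + m) strategy for one pair of rows; the function name first_common is hypothetical:

```python
import math

def first_common(row_a, row_b):
    # Insert every element of the first row into a hash set: O(n) on average.
    seen = set(row_a)
    # Probe the set with each element of the second row: O(m) on average.
    for x in row_b:
        if x in seen:
            return x
    return math.nan

a = [[1, 2, 3], [2, 5, 2], [5, 4, 4], [2, 1, 3]]
b = [[4, 5], [3, 2], [1, 5], [0, 5]]
result = [first_common(ra, rb) for ra, rb in zip(a, b)]
```

Returning at the first hit uses the assumption from the question that each row pair has at most one common value.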
Though the solutions by @Divakar are highly efficient in many use cases, they are less efficient if the second dimension is large. In that case, a solution based on hash maps is better suited. I have implemented it in Cython as follows:
import numpy as np
cimport numpy as np
import cython
from libc.math cimport NAN
from libcpp.unordered_map cimport unordered_map
np.import_array()
@cython.boundscheck(False)
@cython.wraparound(False)
def get_common_element2d(np.ndarray[double, ndim=2] arr1,
                         np.ndarray[double, ndim=2] arr2):
    cdef np.ndarray[double, ndim=1] result = np.empty(arr1.shape[0])
    cdef int dim1 = arr1.shape[1]
    cdef int dim2 = arr2.shape[1]
    cdef int i, j
    cdef unordered_map[double, int] tmpset = unordered_map[double, int]()
    for i in range(arr1.shape[0]):
        for j in range(dim1):
            # insert arr1[i, j] as key without assigned value
            tmpset[arr1[i, j]]
        for j in range(dim2):
            # check whether arr2[i, j] is in tmpset
            if tmpset.count(arr2[i,j]):
                result[i] = arr2[i,j]
                break
        else:
            # for-else: no element of arr2[i] was found in tmpset
            result[i] = NAN
        tmpset.clear()
    return result
I have created test cases as follows:
import numpy as np
import timeit
from itertools import starmap
from mycythonmodule import get_common_element2d
m, n = 3000, 3000
a = np.random.rand(m, n)
b = np.random.rand(m, n)
for i, row in enumerate(a):
    if np.random.randint(2):
        common = np.random.choice(row, 1)
        b[i][np.random.choice(np.arange(n), np.random.randint(min(n,20)), False)] = common
# we need to copy the arrays on each test run, otherwise they
# will remain sorted, which would bias the results
%timeit [set(aa).intersection(bb) for aa, bb in zip(a.copy(), b.copy())]
# returns 3.11 s ± 56.8 ms
%timeit list(starmap(np.intersect1d, zip(a.copy(), b.copy())))
# returns 1.83 s ± 55.4
# test sorting method
# divakarsMethod1 is approach #1 in @Divakar's answer
%timeit divakarsMethod1(a.copy(), b.copy())
# returns 1.88 s ± 18 ms
# test hash map method
%timeit get_common_element2d(a.copy(), b.copy())
# returns 1.46 s ± 22.6 ms
These results seem to indicate that the naive approach is actually better than some vectorized versions. However, the vectorized algorithms show their strengths when many rows with fewer columns are considered (a different use case). In these cases, the vectorized approaches are more than 5 times faster than the naive approach, and the sorting method turns out to be best.
Conclusion: I will go with the hash-map-based Cython version, because it is among the most efficient variants in both use cases. If I had to set up Cython first, I would use the sorting-based method.

Not sure if this is faster, but we can try a couple things here:
Method 1 np.intersect1d with list comprehension
[np.intersect1d(arr[0], arr[1]) for arr in list(zip(a,b))]
# Out
[array([], dtype=int32), array([2]), array([5]), array([], dtype=int32)]
Or to list:
[np.intersect1d(arr[0], arr[1]).tolist() for arr in list(zip(a,b))]
# Out
[[], [2], [5], []]
Method 2 set with list comprehension:
[list(set(arr[0]) & set(arr[1])) for arr in list(zip(a,b))]
# Out
[[], [2], [5], []]

Related

Fill an array using the values of another array as the indices. If an index is repeated, prioritize according to a parallel array

Description
I have an array a with N integer elements that range from 0 to M-1. I have another array b with N positive numbers.
Then, I want to create an array c with M elements. The i-th element of c should be the index of a that has a value of i.
If more than one of these indices existed, then we take the one with a higher value in b.
If none existed, the i-th element of c should be -1.
Example
N = 5, M = 3
a = [2, 1, 1, 2, 2]
b = [1, 3, 5, 7, 3]
Then, c should be...
c = [-1, 2, 3]
My Solution 1
A possible approach would be to initialize an array d that stores the current max and then loop through a and b updating the maximums.
c = -np.ones(M)
d = np.zeros(M)
for i, (idx, val) in enumerate(zip(a, b)):
    if d[idx] <= val:
        c[idx] = i
        d[idx] = val
This solution is O(N) in time but requires iterating the array with Python, making it slow.
My Solution 2
Another solution would be to sort a using b as the key. Then, we can just assign a indices to c (max elements will be last).
sort_idx = np.argsort(b)
a_idx = np.arange(len(a))
a = a[sort_idx]
a_idx = a_idx[sort_idx]
c = -np.ones(M)
c[a] = a_idx
This solution does not require Python loops but requires sorting b, making it O(N*log(N)).
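Solution 2 writes to the same position of c several times and relies on the last write winning, an ordering that the NumPy indexing documentation does not promise for advanced-index assignment. A sketch of a variant that writes each position at most once is shown below. It is still O(N log N), and the function name argmax_index_per_bucket is hypothetical:

```python
import numpy as np

def argmax_index_per_bucket(a, b, M):
    # c[i] = index k with a[k] == i and the largest b[k]; -1 if no such k exists.
    a = np.asarray(a)
    b = np.asarray(b)
    c = np.full(M, -1, dtype=np.intp)
    if a.size == 0:
        return c
    # Sort primarily by a, then by b, then by position (lexsort uses the LAST key first).
    order = np.lexsort((np.arange(a.size), b, a))
    a_sorted = a[order]
    # The last entry of each run of equal a values carries that bucket's largest b.
    is_last = np.append(a_sorted[1:] != a_sorted[:-1], True)
    c[a_sorted[is_last]] = order[is_last]
    return c

c = argmax_index_per_bucket([2, 1, 1, 2, 2], [1, 3, 5, 7, 3], M=3)
```

Ties in b are broken in favour of the larger index, which matches the `<=` comparison in Solution 1.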
Ideal Solution
Is there a solution to this problem in linear time without having to loop the array in Python?
AFAIK, this cannot currently be implemented in O(n) with NumPy (mainly because the index table is both read and written depending on the values of another array). Note that np.argsort(b) can theoretically be implemented in O(n) using a radix sort, but such a sort is not implemented yet in NumPy (it would not be much faster in practice due to the poor cache locality of the algorithm on big arrays).
One solution is to use Numba to speed up your algorithmically-efficient solution. Numba uses a JIT compiler to speed up loops. Here is an example (working with np.int32 types):
import numpy as np
import numba as nb
# M is taken from the example in the question; it must be defined before
# compute() is compiled, because Numba treats globals as compile-time constants
M = 3
@nb.njit('int32[:](int32[:], int32[:])')
def compute(a, b):
    c = np.full(M, -1, dtype=np.int32)
    d = np.zeros(M, dtype=np.int32)
    for i, (idx, val) in enumerate(zip(a, b)):
        if d[idx] <= val:
            c[idx] = i
            d[idx] = val
    return c
a = np.array([2, 1, 1, 2, 2], dtype=np.int32)
b = np.array([1, 3, 5, 7, 3], dtype=np.int32)
c = compute(a, b)

Increment all entries in an array by 'n' without a for loop

I have an array:
arr = [5,5,5,5,5,5]
I want to increment a particular range of arr by 'n'. For example, if n=2 and the range covers indices 2 through 4,
the array should look like this:
arr = [5,5,7,7,7,5]
I need to do this without a for loop, for a problem I'm trying to solve.
Tried:
arr[2:5] = [n]*3
but that obviously replaces the entries and becomes:
arr = [5,5,2,2,2,5]
Any suggestions would be highly appreciated.
n = 2
arr_range = slice(2, 5)
arr = [5,5,7,7,7,5]
arr[arr_range] = map(lambda x: x+n, arr[arr_range])
# arr
# [5, 5, 9, 9, 9, 5]
But I would recommend using numpy...
import numpy as np
n = 2
arr_range = slice(2, 5)
arr = np.array([5,5,7,7,7,5])
arr[arr_range] += n
You actually have a list, not an array. If you convert it to a Numpy array it is simple.
>>> n=3
>>> arr = np.array([5,5,5,5,5,5])
>>> arr[2:5] += n
>>> arr
array([5, 5, 8, 8, 8, 5])
You have basically two options (for code see below):
Use slice assignment via a list comprehension (a[:] = [x+1 for x in a]),
Use a for-loop (even though you exclude this in your question, I don't see a legitimate reason for doing so).
They come with pros and cons. Let's assume you are going to replace some fraction of the list items (as opposed to a fixed number of items). The for-loop runs in Python and hence might be slower but it has O(1) memory usage. The list comprehension and slice assignment both operate in C (assuming you are using CPython) but it has O(N) memory usage due to the temporary list.
Using a generator doesn't buy anything since it is converted to a list anyway before the assignment happens (this is necessary because if the generator had fewer or more items than the slice, the list would need to be resized accordingly; see the source code).
Using a map adds even more overhead since it needs to call the mapped function on every item.
The following is a performance comparison of the different methods. The for-loop is fastest for very small lists since it has minimal overhead (just the range object). For more than about a dozen items, the list comprehension clearly outperforms the other methods and especially for larger lists (len(a) > 3e5) the difference to the generator becomes noticeable (the generator cannot provide information about its size, so the generated list needs to be resized as more items are fetched). For very large lists the difference between for-loop and list comprehension seems to shrink again since the memory overhead tends to outweigh the loop cost, but reaching that point would require unusually large lists (where you'd be better off using something like Numpy anyway).
This is the code using the perfplot package:
import numpy
import perfplot
def use_generator(a):
    i = slice(0, len(a)//2)
    a[i] = (x+1 for x in a[i])
def use_map(a):
    i = slice(0, len(a)//2)
    a[i] = map(lambda x: x+1, a[i])
def use_list(a):
    i = slice(0, len(a)//2)
    a[i] = [x+1 for x in a[i]]
def use_loop(a):
    for i in range(len(a)//2):
        a[i] += 1
perfplot.show(
    setup=lambda n: [0]*n,
    kernels=[use_generator, use_map, use_list, use_loop],
    n_range=[2**k for k in range(1, 26)],
    xlabel="len(a)",
    equality_check=None,
)

Efficiently adding numpy arrays with duplicate destination indices [duplicate]

Suppose I have 2 matrices M and N (both have > 1 columns). I also have an index matrix w with 2 columns: one for M and one for N. The indices for N are unique, but the indices for M may appear more than once. The operation I would like to perform is,
for i,j in w:
    M[i] += N[j]
Is there a more efficient way to do this other than a for loop?
For completeness, in numpy >= 1.8 you can also use np.add's at method:
In [8]: m, n = np.random.rand(2, 10)
In [9]: m_idx, n_idx = np.random.randint(10, size=(2, 20))
In [10]: m0 = m.copy()
In [11]: np.add.at(m, m_idx, n[n_idx])
In [13]: m0 += np.bincount(m_idx, weights=n[n_idx], minlength=len(m))
In [14]: np.allclose(m, m0)
Out[14]: True
In [15]: %timeit np.add.at(m, m_idx, n[n_idx])
100000 loops, best of 3: 9.49 us per loop
In [16]: %timeit np.bincount(m_idx, weights=n[n_idx], minlength=len(m))
1000000 loops, best of 3: 1.54 us per loop
Aside from the obvious performance disadvantage, it has a couple of advantages:
np.bincount converts its weights to double precision floats, .at will operate with your array's native type. This makes it the simplest option for dealing e.g. with complex numbers.
np.bincount only adds weights together, you have an at method for all ufuncs, so you can repeatedly multiply, or logical_and, or whatever you feel like.
But for your use case, np.bincount is probably the way to go.
Using also m_ind, n_ind = w.T, just do M += np.bincount(m_ind, weights=N[n_ind], minlength=len(M))
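np.bincount only accepts one-dimensional weights, whereas the question mentions matrices with more than one column. Below is a minimal sketch of how np.add.at covers that case, assuming M and N are 2D arrays with the same number of columns and w is the two-column index matrix from the question. The sample data is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((4, 3))
N = rng.random((5, 3))
# Made-up index matrix: column 0 indexes rows of M (repeats allowed),
# column 1 indexes rows of N (unique).
w = np.array([[0, 2], [0, 3], [1, 0], [3, 1], [3, 4]])
m_ind, n_ind = w.T

# Reference result from the explicit loop in the question.
expected = M.copy()
for i, j in w:
    expected[i] += N[j]

# Unbuffered in-place add: repeated row indices in m_ind accumulate correctly.
result = M.copy()
np.add.at(result, m_ind, N[n_ind])
```

A plain `result[m_ind] += N[n_ind]` would apply only one update per repeated index, which is why the unbuffered ufunc method is used here.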
For clarity, let's define
>>> m_ind, n_ind = w.T
Then the for loop
for i, j in zip(m_ind, n_ind):
    M[i] += N[j]
updates the entries M[np.unique(m_ind)]. The values that get written to it are N[n_ind], which must be grouped by m_ind. (The fact that there's an n_ind in addition to m_ind is actually tangential to the question; you could just set N = N[n_ind].) There happens to be a SciPy class that does exactly this: scipy.sparse.csr_matrix.
Example data:
>>> m_ind, n_ind = np.array([[0, 0, 1, 1], [2, 3, 0, 1]])
>>> M = np.arange(2, 6)
>>> N = np.logspace(2, 5, 4)
The result of the for loop is that M becomes [110002 1103 4 5]. We get the same result with a csr_matrix as follows. As I said earlier, n_ind isn't relevant, so we get rid of that first.
>>> N = N[n_ind]
>>> from scipy.sparse import csr_matrix
>>> update = csr_matrix((N, m_ind, [0, len(N)])).toarray()
The CSR constructor builds a matrix with the required values at the required indices. The tuple is (data, indices, indptr): m_ind supplies the column indices, and the row-pointer array [0, len(N)] says that the single row 0 holds the values N[0:len(N)] at the columns m_ind[0:len(N)]. Duplicates are summed:
>>> update
array([[ 110000., 1100.]])
This has shape (1, m_ind.max() + 1), which here equals (1, len(np.unique(m_ind))), and can be added in directly:
>>> M[np.unique(m_ind)] += update.ravel()
>>> M
array([110002, 1103, 4, 5])

Better alternative to nested for loops through arrays in numpy?

Often I need to traverse an array and perform some operation on each entry, where the operation may depend on the indices and the value of the entry. Here is a simple example.
import numpy as np
N=10
M = np.zeros((N,N))
for i in range(N):
    for j in range(N):
        M[i,j] = 1/((i+2*j+1)**2)
Is there a shorter, cleaner, or more pythonic way to perform such tasks?
What you show is 'pythonic' in the sense that it uses a Python list and iteration approach. The only use of numpy is in assigning the values, M[i,j] =. Lists don't take that kind of index.
To make most use of numpy, make index grids or arrays, and calculate all values at once, without explicit loop. For example, in your case:
In [333]: N=10
In [334]: I,J = np.ogrid[0:10,0:10]
In [335]: I
Out[335]:
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]])
In [336]: J
Out[336]: array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
In [337]: M = 1/((I + 2*J + 1)**2)
In [338]: M
Out[338]:
array([[ 1. , 0.11111111, 0.04 , 0.02040816, 0.01234568,
0.00826446, 0.00591716, 0.00444444, 0.00346021, 0.00277008],
...
[ 0.01 , 0.00694444, 0.00510204, 0.00390625, 0.00308642,
0.0025 , 0.00206612, 0.00173611, 0.00147929, 0.00127551]])
ogrid is one of several ways of constructing sets of arrays that can be 'broadcast' together. meshgrid is another common function.
In your case, the equation is one that works well with 2 arrays like this. It depends very much on broadcasting rules, which you should study.
If the function only takes scalar inputs, we will have to use some form of iteration. That has been a frequent SO question; search for [numpy] vectorize.
np.fromfunction is intended for that :
def f(i,j) : return 1/((i+2*j+1)**2)
M = np.fromfunction(f,(N,N))
It's slightly slower than the 'hand made' vectorised way, but easy to understand.
I would say that's the most straight forward and universally understood way of performing that iteration.
An alternative would be to iterate over the values and call a function for a given (i, j) pair
import itertools
N = 10
M = np.zeros((N,N))
def do_work(i, j):
    M[i,j] = 1/((i+2*j+1)**2)
[do_work(i, j) for (i, j) in itertools.product(range(N), range(N))]
Here I just used itertools.product to create a generator for all possible (i, j) values; you can just as well use a for loop.
for (i, j) in itertools.product(range(N), range(N)):
    M[i,j] = 1/((i+2*j+1)**2)
Yes, you can do this in pure NumPy without using any loops:
import numpy as np
N = 10
i = np.arange(N)[:, np.newaxis]
j = np.arange(N)
M = 1/((i+2*j+1)**2)
This works because NumPy broadcasts the column vector i of shape (N, 1) against the row vector j of shape (N,), producing an (N, N) result, much like an outer product.
Moreover, since this is pure NumPy, the code will also run a lot faster.
For example, for N=10**4, the double for loop version takes 48.3 seconds on my computer, whereas this code is already finished after only 1.2 seconds.
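As a quick check of the broadcasting claim, the sketch below compares the vectorized expression with the double loop on a small N. The names M_vec and M_loop are introduced here for illustration only:

```python
import numpy as np

N = 4
i = np.arange(N)[:, np.newaxis]     # shape (N, 1): one entry per row
j = np.arange(N)                    # shape (N,): one entry per column
M_vec = 1 / ((i + 2 * j + 1) ** 2)  # shapes (N, 1) and (N,) broadcast to (N, N)

# Reference: the explicit double loop from the question.
M_loop = np.zeros((N, N))
for r in range(N):
    for c in range(N):
        M_loop[r, c] = 1 / ((r + 2 * c + 1) ** 2)
```

Both arrays have shape (N, N) and agree element by element, so the loop can be dropped without changing the result.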
