Python equivalent of MATLAB's "ismember" function

After many attempts at optimizing this code, it seems that one last resort would be to run it on multiple cores. I don't know exactly how to convert/restructure my code so that it can run much faster using multiple cores. I would appreciate any guidance toward the end goal, which is to run this code as fast as possible for arrays A and B where each array holds about 700,000 elements. Here is the code using small arrays; the 700k-element arrays are commented out.
import numpy as np

def ismember(a, b):
    for i in a:
        index = np.where(b == i)[0]
        if index.size == 0:
            yield 0
        else:
            yield index

def f(A, gen_obj):
    my_array = np.arange(len(A))
    for i in my_array:
        my_array[i] = gen_obj.next()
    return my_array

#A = np.arange(700000)
#B = np.arange(700000)
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])

gen_obj = ismember(A, B)
f(A, gen_obj)
print 'done'
# if we print f(A, gen_obj) the output will be: [4 0 0 4 3]
# notice that the output array needs to be kept the same size as array A.
What I am trying to do is to mimic a MATLAB function called ismember (the one formatted as [Lia,Locb] = ismember(A,B)). I am only trying to get the Locb part.
From the MATLAB documentation: Locb contains the lowest index in B for each value in A that is a member of B; it contains 0 wherever A is not a member of B.
One of the main problems is that I need to perform this operation as efficiently as possible. For testing I have two arrays of 700k elements each, and creating a generator and iterating through its values doesn't get the job done fast.

Before worrying about multiple cores, I would eliminate the linear scan in your ismember function by using a dictionary:
def ismember(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value
Your original implementation requires a full scan of the elements in B for each element in A, making it O(len(A)*len(B)). The above code requires one full scan of B to generate the dict bind. By using a dict, you effectively make the lookup of each element of A in B constant time, making the whole operation O(len(A)+len(B)). If this is still too slow, then worry about making the above function run on multiple cores.
Edit: I've also modified your indexing slightly. MATLAB uses 0 as the "not found" value because all of its arrays start at index 1. Python/numpy arrays start at 0, so if your data set looks like this
A = [2378, 2378, 2378, 2378]
B = [2378, 2379]
and you return 0 for a missing element, then your results will effectively exclude all elements of A. The above routine returns None for a missing index instead of 0. Returning -1 is an option, but Python will interpret that as the last element in the array, whereas None will raise an exception if it's used as an index into the array. If you'd like different behavior, change the second argument in the bind.get(itm, None) expression to the value you want returned.
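For reference, a quick run on the sample arrays from the question (my own check; it assumes the ismember defined just above is in scope):
import numpy as np

A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])
print(ismember(A, B))
# [4, None, None, 4, 3]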

sfstewman's excellent answer most likely solved the issue for you.
I'd just like to add how you can achieve the same exclusively in numpy.
I make use of numpy's unique and in1d functions.
B_unique_sorted, B_idx = np.unique(B, return_index=True)
B_in_A_bool = np.in1d(B_unique_sorted, A, assume_unique=True)
B_unique_sorted contains the unique values in B sorted.
B_idx holds for these values the indices into the original B.
B_in_A_bool is a boolean array the size of B_unique_sorted that
stores whether a value in B_unique_sorted is in A.
Note: I need to look for (unique vals from B) in A because I need the output to be returned with respect to B_idx
Note: I assume that A is already unique.
Now you can use B_in_A_bool to either get the common vals
B_unique_sorted[B_in_A_bool]
and their respective indices in the original B
B_idx[B_in_A_bool]
Finally, I assume that this is significantly faster than the pure Python for-loop although I didn't test it.
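If you also need the output aligned with A (one entry per element of A, like MATLAB's Locb), one way to combine these pieces is via np.searchsorted. This is a sketch of my own that I haven't benchmarked; here -1 stands in for "not found":
import numpy as np

A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])

B_unique_sorted, B_idx = np.unique(B, return_index=True)
# position of each element of A within the sorted unique values of B
pos = np.searchsorted(B_unique_sorted, A)
pos = np.clip(pos, 0, len(B_unique_sorted) - 1)
found = B_unique_sorted[pos] == A
# index into the original B where found, -1 otherwise
Locb = np.where(found, B_idx[pos], -1)
print(Locb)   # [ 4 -1 -1  4  3]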

Try the ismember library.
pip install ismember
Simple example:
# Import library
from ismember import ismember
import numpy as np
# data
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])
# Lookup
Iloc,idx = ismember(A, B)
# Iloc is a boolean array indicating, for each element of A, whether it is present in B
print(Iloc)
# [ True False False True True]
# idx contains, for each element of A found in B, its index into B
print(idx)
# [4 4 3]
print(B[idx])
# [3 3 6]
print(A[Iloc])
# [3 3 6]
# These vectors will match
A[Iloc]==B[idx]
Speed check:
import numpy as np
from ismember import ismember
from datetime import datetime

t1 = []
t2 = []

# Create some random vectors
ns = np.random.randint(10, 10000, 1000)

for n in ns:
    a_vec = np.random.randint(0, 100, n)
    b_vec = np.random.randint(0, 100, n)

    # Run stack version
    start = datetime.now()
    out1 = ismember_stack(a_vec, b_vec)
    end = datetime.now()
    t1.append(end - start)

    # Run ismember
    start = datetime.now()
    out2 = ismember(a_vec, b_vec)
    end = datetime.now()
    t2.append(end - start)
print(np.sum(t1))
# 0:00:07.778331
print(np.sum(t2))
# 0:00:04.609801
# %%
def ismember_stack(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value
The ismember function from pypi is almost 2x faster.
Large vectors, eg 700000 elements:
import numpy as np
from ismember import ismember
from datetime import datetime

A = np.random.randint(0, 100, 700000)
B = np.random.randint(0, 100, 700000)
# Lookup
start = datetime.now()
Iloc,idx = ismember(A, B)
end = datetime.now()
# Print time
print(end-start)
# 0:00:01.194801

Try using a list comprehension;
In [1]: import numpy as np
In [2]: A = np.array([3,4,4,3,6])
In [3]: B = np.array([2,5,2,6,3])
In [4]: [x for x in A if x not in B]
Out[4]: [4, 4]
List comprehensions are generally faster than explicit for-loops.
To get an equal length-list;
In [19]: map(lambda x: x if x not in B else False, A)
Out[19]: [False, 4, 4, False, False]
This is quite fast for small datasets:
In [20]: C = np.arange(10000)
In [21]: D = np.arange(15000, 25000)
In [22]: %timeit map(lambda x: x if x not in D else False, C)
1 loops, best of 3: 756 ms per loop
For large datasets, you could try using a multiprocessing.Pool.map() to speed up the operation.
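Note that map only returns a list like this on Python 2; on Python 3 it returns a lazy iterator, so the equivalent list-comprehension form would be (a small sketch, shown with plain lists for readability):
A = [3,4,4,3,6]
B = [2,5,2,6,3]
print([x if x not in B else False for x in A])
# [False, 4, 4, False, False]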

Here is an exact MATLAB equivalent that returns both output arguments [Lia, Locb] matching MATLAB, except that in Python 0 is also a valid index, so this function doesn't return the 0s; it essentially returns Locb(Locb>0). The performance is also comparable to MATLAB.
import numpy as np

def ismember(a_vec, b_vec):
    """ MATLAB equivalent ismember function """
    bool_ind = np.isin(a_vec, b_vec)
    common = a_vec[bool_ind]
    common_unique, common_inv = np.unique(common, return_inverse=True)  # common = common_unique[common_inv]
    b_unique, b_ind = np.unique(b_vec, return_index=True)  # b_unique = b_vec[b_ind]
    common_ind = b_ind[np.isin(b_unique, common_unique, assume_unique=True)]
    return bool_ind, common_ind[common_inv]
An alternate implementation that is a bit (~5x) slower but doesn't use the unique function is here:
def ismember(a_vec, b_vec):
    ''' MATLAB equivalent ismember function. Slower than the implementation above. '''
    # build the lookup in reverse so that the lowest index in b_vec wins, as in MATLAB
    b_dict = {b_vec[i]: i for i in range(len(b_vec) - 1, -1, -1)}
    indices = [b_dict.get(x) for x in a_vec if b_dict.get(x) is not None]
    booleans = np.in1d(a_vec, b_vec)
    return booleans, np.array(indices, dtype=int)
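As a quick sanity check (my own addition, on the arrays from the original question; both implementations above give the same result here):
import numpy as np

A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])

Lia, Locb = ismember(A, B)
print(Lia)    # [ True False False  True  True]
print(Locb)   # [4 4 3] -- indices into B for the elements of A that were found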

Related

Efficiently find the indices of shared values (with repeats) between two large arrays

Problem description
Let's take this simple array set
import numpy as np

# indices: 0,1,2,3,4
a = np.array([1,1,3,4,6])
b = np.array([6,6,1,3])
From these two arrays I want to get the indices of all possible matches. So for number 1 we get 0,2 and 1,2, with the complete output looking like:
0,2 # 1
1,2 # 1
2,3 # 3
4,0 # 6
4,1 # 6
Note that the arrays are not (yet) sorted, nor do they contain only unique elements - two conditions often assumed in other answers (see bottom). The above example is very small; however, I have to apply this to ~40K-element arrays.
Tried approaches
1. Python loop approach
indx = []
for i, aval in enumerate(a):
    for j, bval in enumerate(b):
        if aval == bval:
            indx.append([i, j])
# [[0, 2], [1, 2], [2, 3], [4, 0], [4, 1]]
2. Python dict approach
from collections import defaultdict

adict = defaultdict(list)
bdict = defaultdict(list)

for i, aval in enumerate(a): adict[aval].append(i)
for j, bval in enumerate(b): bdict[bval].append(j)

for val, a_positions in adict.items():
    for b_position in bdict[val]:
        for a_position in a_positions:
            print(a_position, b_position)
3. Numpy where
print(np.where(a.reshape(-1,1) == b))
4. Polars dataframes
Converting it to a dataframe and then using Polars
import polars as pl
a = pl.DataFrame( {'x': a, 'apos': list(range(len(a)))} )
b = pl.DataFrame( {'x': b, 'bpos': list(range(len(b)))} )
a.join(b, how='inner', on='x')
"Big data"
On "big" data using Polars seems the fastest now with around 0.02 secs. I'm suprised that creating DataFrames first and then joining them is faster than any other approach I could come up with so curious if there is any other way to beat it :)
a = np.random.randint(0,1000, 40000)
b = np.random.randint(0,1000, 40000)
Using the above data:
python loop: 218s
python dict: 0.03s
numpy.where: 4.5s
polars: 0.02s
How related questions didn't solve this
Return common element indices between two numpy arrays: only returns the indices of matches in one of the arrays, not both.
Find indices of common values in two arrays: returns the matching indices of A with B and B with A, but not the paired indices (see example).
I'm very surprised that a DataFrame library is currently the fastest, so I'm curious to see whether there are other approaches that beat this speed :) Anything is fine: Cython, Numba, Pythran, etc.
NOTE: this post is now superseded by the faster alternative sort-based solution.
The dict-based approach is an algorithmically efficient solution compared to the others (I guess Polars uses a similar approach). However, the overhead of CPython makes it a bit slow. You can speed it up a bit using Numba. Here is an implementation:
import numba as nb
import numpy as np
from numba.typed.typeddict import Dict
from numba.typed.typedlist import ListType
from numba.typed.typedlist import List

IntList = ListType(nb.int32)

@nb.njit('(int32[:], int32[:])')
def numba_dict_based_compute(a, b):
    adict = Dict.empty(nb.int32, IntList)
    bdict = Dict.empty(nb.int32, IntList)

    for i, val in enumerate(a):
        if val in adict: adict[val].append(i)
        else: adict[val] = List([nb.int32(i)])

    for i, val in enumerate(b):
        if val in bdict: bdict[val].append(i)
        else: bdict[val] = List([nb.int32(i)])

    count = 0
    for val, a_positions in adict.items():
        if val not in bdict:
            continue
        b_positions = bdict[val]
        count += len(a_positions) * len(b_positions)

    result = np.empty((count, 2), dtype=np.int32)
    cur = 0
    for val, a_positions in adict.items():
        if val not in bdict:
            continue
        for b_position in bdict[val]:
            for a_position in a_positions:
                result[cur, 0] = a_position
                result[cur, 1] = b_position
                cur += 1

    return result

result = numba_dict_based_compute(a.astype(np.int32), b.astype(np.int32))
Note that consuming the values in place is a bit faster than storing them (which requires pre-computing the size of the array). However, if nothing is done in the loop, Numba can optimize it out completely and the benchmark would be biased; alternatively, printing the values is so slow that it would also bias the benchmark. Note also that the implementation assumes the numbers are 32-bit; a 64-bit implementation can be obtained trivially by replacing the 32-bit types with 64-bit ones, though it decreases performance.
This solution is about twice as fast on my machine, though it is a bit verbose and not very easy to read. The performance of the operation is mainly bounded by the speed of dictionary lookups. This implementation is also a bit faster than the Polars one on my machine.
Here are timings:
Naive python loop: >100_000 ms
Numpy where: 3_451 ms
Python dict: 24.7 ms
Polars: 12.3 ms
This implementation: 11.3 ms (takes 13.2 ms on 64-bit values)
A completely different alternative solution is to sort the arrays, retrieve the locations of the sorted arrays with np.argsort, get the sorted values, and then walk in lockstep over the two sets of locations sorted by value. This last operation can (again) be implemented efficiently in Numba or Cython. It can actually be split in two parts: one finding the slices in a and b matching the same value (similar to a merge operation), and one doing the actual cartesian product for each pair of matching slices. Splitting this into two steps enables the second one (which is expensive) to be computed in parallel if possible (and in place if possible too). The complexity of finding the matching offsets is O(n log n) with Numpy (one could reach the theoretically optimal O(n) time using a radix sort).
Here is the resulting implementation:
import numba as nb
import numpy as np

# Support both 32-bit and 64-bit integers
@nb.njit(['(int64[::1],int64[::1],int64[::1],int64[::1])', '(int64[::1],int64[::1],int32[::1],int32[::1])'], debug=True)
def find_matching_offsets(a_positions, b_positions, a_sorted_values, b_sorted_values):
    n, m = a_positions.size, b_positions.size

    result = np.empty((n, 4), dtype=np.int32)
    a_pos, b_pos, cur = 0, 0, 0

    while a_pos < n and b_pos < m:
        a_val = a_sorted_values[a_pos]
        b_val = b_sorted_values[b_pos]

        if a_val < b_val:
            a_pos += 1
            continue
        if a_val > b_val:
            b_pos += 1
            continue

        a_end = n
        for i in range(a_pos + 1, n):
            if a_sorted_values[i] != a_val:
                a_end = i
                break

        b_end = m
        for i in range(b_pos + 1, m):
            if b_sorted_values[i] != b_val:
                b_end = i
                break

        result[cur, 0] = a_pos
        result[cur, 1] = a_end
        result[cur, 2] = b_pos
        result[cur, 3] = b_end
        cur += 1

        a_pos = a_end
        b_pos = b_end

    return result[:cur]

@nb.njit(['(int64[::1],int64[::1],int32[:,::1])'], parallel=True)
def do_cartesian_product(a_positions, b_positions, offsets):
    size = 0
    cur = 0
    result_offsets = np.empty(offsets.shape[0], dtype=np.int32)

    # Compute the size of the output
    for i in range(offsets.shape[0]):
        a_pos, a_end, b_pos, b_end = offsets[i]
        assert a_end > a_pos and b_end > b_pos
        result_offsets[cur] = size
        size += (a_end - a_pos) * (b_end - b_pos)
        cur += 1

    assert size > 0
    result = np.empty((size, 2), dtype=np.int32)

    # Generate the output in parallel (or in-place if possible)
    for i in nb.prange(offsets.shape[0]):
        a_pos, a_end, b_pos, b_end = offsets[i]
        offset = result_offsets[i]
        local_cur = 0

        for j in range(a_pos, a_end):
            for k in range(b_pos, b_end):
                local_offset = offset + local_cur
                result[local_offset, 0] = a_positions[j]
                result[local_offset, 1] = b_positions[k]
                local_cur += 1

    return result

def sorted_based_compute(a, b):
    a_positions = np.argsort(a)
    b_positions = np.argsort(b)
    a_sorted_values = a[a_positions]
    b_sorted_values = b[b_positions]
    offsets = find_matching_offsets(a_positions, b_positions, a_sorted_values, b_sorted_values)
    return do_cartesian_product(a_positions, b_positions, offsets)
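For completeness, the sort-based driver can be called the same way as the dict-based version earlier (a sketch mirroring that call; it assumes the same 32-bit a and b):
result = sorted_based_compute(a.astype(np.int32), b.astype(np.int32))
print(result)   # rows of (index in a, index in b)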
This solution is faster than the previous one and certainly reaches the limit of what is possible with Numpy/Numba (without making additional assumptions on the input). Here are the performance results (on my 6-core machine):
Python dict: 24.7 ms
Polars: 12.3 ms
Dict-based Numba version: 11.3 ms
Sort-based Numpy+Numba version: 5.0 ms <----
Note that ~60% of the time is spent in the argsort functions and the rest is basically the cartesian product. It can theoretically be improved using a parallel sort but AFAIK this is not possible with Numpy yet (and pretty hard to do in Numba).

How to efficiently find the indices of a first array's values that match a second array's values?

I have two numpy arrays A and B. A has shape (10000000, 3) and B has shape (1000000, 3). Both arrays contain XYZ coordinates, and B corresponds to some region of A. I have to find the indices of A that correspond to the values of B.
Right now I am solving as below. I would like some help in optimizing this using Numpy or other python packages.
extract_BinA = np.empty(B.shape[0])
for i in range(B.shape[0]):
    for j in range(A.shape[0]):
        if (A[j][0]==B[i][0] and A[j][1]==B[i][1] and A[j][2]==B[i][2]):
            extract_BinA[i] = j
The issue here is not the speed of pure-Python code, but the algorithm itself. You can use sorted arrays or hash tables to improve the complexity of the algorithm to O(n log n) or even O(n), rather than the current slow O(n^2) solution (the solution proposed by @Mazen is also O(n^2)). An O(n^2) algorithm cannot be efficient here since it results in roughly 10,000,000 * 1,000,000 = 10,000 billion operations, which is too much for any modern computer.
Here is a hash-table solution in pure Python:
table = {tuple(A[i]): i for i in range(A.shape[0])}

extract_BinA = np.empty(B.shape[0])
for i in range(B.shape[0]):
    val = tuple(B[i])
    if val in table:
        extract_BinA[i] = table[val]
Note that the result may differ if there are multiple points at the same location in A (the dict comprehension keeps the last matching index rather than the first).
Here is a benchmark with two random arrays of size 10,000:
Initial solution: 53.82 s
Mazen solution: 1.76 s
This solution: 0.02 s
On this small input, the above code is about 2700 times faster than the initial solution and 88 times faster than the proposed alternative solution. On bigger inputs the gap will be much larger, and the above code becomes many orders of magnitude faster than the other two solutions (i.e. >10000 times faster).
Update:
If there are multiple equal points in A, then the dictionary can be modified to store a list of indices rather than a single value. Alternatively, the dictionary can be built so that the first index is kept, as in the original code. Here are examples of the two solutions:
table = dict()
for i in range(A.shape[0]):
    key = tuple(A[i])
    if key in table:
        table[key].append(i)
    else:
        table[key] = [i]

extract_BinA = np.empty(B.shape[0])
for i in range(B.shape[0]):
    val = tuple(B[i])
    if val in table:
        # Here table[val] is a list and thus you
        # can do whatever you want with the indices.
        # For example you can take the first one like here,
        # or possibly the last as you want.
        extract_BinA[i] = table[val][0]
# Always select the first index directly
table = dict()
for i in range(A.shape[0]):
    key = tuple(A[i])
    if key not in table:
        table[key] = i

extract_BinA = np.empty(B.shape[0])
for i in range(B.shape[0]):
    val = tuple(B[i])
    if val in table:
        extract_BinA[i] = table[val]
Note that these solutions are a bit slower than the above code, but their complexity is still linear (and thus still very fast).
Solution
extract_BinA = np.ones(B.shape[0]) * -1
for i, b in enumerate(B):
    idx = np.argwhere((A == b) == [True, True, True])
    if idx.any():
        extract_BinA[i] = idx[0][0]
print(extract_BinA)
Explanation
Set extract_BinA to an array of negative values the size of B:
extract_BinA = np.ones(B.shape[0]) * -1
In order to get the indices of the elements where B elements equal A elements, we need the following:
(A == b)
Compares x,y,z for a row in B with every x,y,z row in A
(A == b) == [True, True, True]
Keeps only the positions where x_a==x_b, y_a==y_b, and z_a==z_b all yield True
np.argwhere((A == b) == [True, True, True])
Returns the set of indices where the condition is true
A full example to test:
import numpy as np

A = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]])
B = np.array([[0,0,0],[4,5,6],[13,14,15]])

# your code
extract_BinA = np.ones(B.shape[0]) * -1
for i in range(B.shape[0]):
    for j in range(A.shape[0]):
        if (A[j] == B[i]).all():
            extract_BinA[i] = j
print(extract_BinA)

# my code
extract_BinA = np.ones(B.shape[0]) * -1
for i, b in enumerate(B):
    idx = np.argwhere((A == b) == [True, True, True])
    if idx.any():
        extract_BinA[i] = idx[0][0]  # changed extract_BinB to extract_BinA
print(extract_BinA)

While loop with Cython? or Better way to remove the elements that fall into a given range

I am basically looking for a faster/more efficient way to perform a piece of my Python code.
Here goes a simpler version of my part of code.
import numpy as np
A = np.random.choice(100,80) # randomly select integers
A = np.sort(A) # sort it
B = np.unique(A) # drop the duplicate values
What I want to do with this vector B is to remove the elements that fall within a given range of the previously kept value. For example, if I have a sorted vector B = [1,2,5,7,8,11,20,25,30] and the range value is 10, then my code should output C = [1,11,25]. (2, 5, 7 and 8 were removed because their distance from the element 1 is less than 10; the next kept element is 11; 20 is removed because its distance from 11 is less than 10; the next kept element is 25, so 30 is removed.) You get the idea.
I wrote the code as following:
def RemoveViolations(vec, L):
    S = []
    P = 0  # pointer
    C = 0  # counter
    while C < vec.size:
        S.append(vec[C])
        preC = np.where(vec > S[P]+L)[0]
        if preC.size:
            C = preC[0]
        else:
            C = vec.size+1
        P = P+1
    return np.asarray(S)
So, now, I can do this C = RemoveViolations(B,10), which works like a charm.
Now, the issue is that this code is very slow in Python. I have a sorted vector with about 1 million elements, and it takes quite some time to finish. Is there a better way to do this task?
If I need to implement this with Cython, how would I change the code to work in a C/C++ environment? I've heard it's not really complicated, but a quick search didn't work out well.
Thank you!
The complexity of your algorithm is the problem: here is a solution in pure Python that executes in under 0.15 s on my 8-year-old laptop (your implementation needed 200 seconds, i.e. a ~1300x improvement for n=1000000):
import random

def get_filtered_values(dist, seq):
    prev_val = seq[0]
    compare_to = prev_val + dist
    filtered = [prev_val]
    for elt in seq[1:]:
        if elt <= compare_to:  # <-- change to `<` to match desired results;
                               #     this matches the results of your implementation
            continue
        else:
            compare_to = elt + dist
            filtered.append(elt)
    return filtered

B = [1,2,5,7,8,11,20,25,30]
print(get_filtered_values(10, B))

n = 1000000
C = sorted(list(set([random.randint(0, n) for _ in range(n)])))
get_filtered_values(10, C)
You can cythonize this code, or numpyize it as you wish, but it probably will not be necessary.
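If you do end up wanting a compiled version, a minimal Numba sketch of the same idea might look like the following (my own addition, not part of the original answer; it assumes a non-empty, sorted numpy array and uses the strict-distance rule that produces the desired C = [1, 11, 25]):
import numba as nb
import numpy as np

@nb.njit
def remove_violations_nb(vec, dist):
    # vec is assumed sorted and non-empty; an element is kept only if its
    # distance from the last kept element is at least `dist`
    out = np.empty_like(vec)
    out[0] = vec[0]
    count = 1
    limit = vec[0] + dist
    for i in range(1, vec.size):
        if vec[i] >= limit:
            out[count] = vec[i]
            count += 1
            limit = vec[i] + dist
    return out[:count]

B = np.array([1, 2, 5, 7, 8, 11, 20, 25, 30])
print(remove_violations_nb(B, 10))   # [ 1 11 25]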

Comparing multiple numpy arrays

How should I compare more than 2 numpy arrays?
import numpy
a = numpy.zeros((512,512,3), dtype=numpy.uint8)
b = numpy.zeros((512,512,3), dtype=numpy.uint8)
c = numpy.zeros((512,512,3), dtype=numpy.uint8)
if (a==b==c).all():
    pass
This gives a ValueError, and I am not interested in comparing arrays two at a time.
For three arrays, you can check for elementwise equality between the first and second arrays and then between the second and third arrays, giving two boolean scalars, and finally check whether both of these scalars are True for the final scalar output, like so -
np.logical_and( (a==b).all(), (b==c).all() )
For more arrays, you could stack them, take the difference along the stacking axis and check whether all of those differences are zero. If they are, all input arrays are equal, otherwise not. The implementation would look like so -
L = [a,b,c] # List of input arrays
out = (np.diff(np.vstack(L).reshape(len(L),-1),axis=0)==0).all()
For three arrays, you should really just compare them two at a time:
if np.array_equal(a, b) and np.array_equal(b, c):
    do_whatever()
For a variable number of arrays, let's suppose they're all combined into one big array arrays. Then you could do
if np.all(arrays[:-1] == arrays[1:]):
    do_whatever()
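A small usage sketch of that last idea (my own illustration; it assumes all arrays share the same shape so that np.stack works):
import numpy as np

a = np.zeros((512, 512, 3), dtype=np.uint8)
b = np.zeros((512, 512, 3), dtype=np.uint8)
c = np.zeros((512, 512, 3), dtype=np.uint8)

arrays = np.stack([a, b, c])           # shape (3, 512, 512, 3)
if np.all(arrays[:-1] == arrays[1:]):  # compares consecutive pairs
    print("all equal")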
To expand on previous answers, I would use combinations from itertools to construct all pairs, then run your comparison on each pair. For example, if I have three arrays and want to confirm that they're all equal, I'd use:
from itertools import combinations
from itertools import combinations

for pair in combinations([a, b, c], 2):
    assert np.array_equal(pair[0], pair[1])
Solution supporting different shapes and nans
Compare against the first element of the array list:
import numpy as np

a = np.arange(3)
b = np.arange(3)
c = np.arange(3)
d = np.arange(4)

lst_eq = [a, b, c]
lst_neq = [a, b, d]

def all_equal(lst):
    for arr in lst[1:]:
        if not np.array_equal(lst[0], arr, equal_nan=True):
            return False
    return True

print('all_equal(lst_eq)=', all_equal(lst_eq))
print('all_equal(lst_neq)=', all_equal(lst_neq))
output
all_equal(lst_eq)= True
all_equal(lst_neq)= False
For equal shapes and without nan support
Combine everything into one array, calculate the absolute difference along the new axis, and check whether the maximum element along the new dimension is equal to 0 or below some threshold. This should be quite fast.
import numpy as np

a = np.arange(3)
b = np.arange(3)
c = np.arange(3)
d = np.array([0, 1, 3])

lst_eq = [a, b, c]
lst_neq = [a, b, d]

def all_equal(lst, threshold=0):
    arr = np.stack(lst, axis=0)
    return np.max(np.abs(np.diff(arr, axis=0))) <= threshold

print('all_equal(lst_eq)=', all_equal(lst_eq))
print('all_equal(lst_neq)=', all_equal(lst_neq))
output
all_equal(lst_eq)= True
all_equal(lst_neq)= False
This might work.
import numpy as np

x = np.random.rand(10)
arrays = [x for _ in range(10)]
print(np.allclose(arrays[:-1], arrays[1:]))  # True

arrays.append(np.random.rand(10))
print(np.allclose(arrays[:-1], arrays[1:]))  # False
one-liner solution:
arrays = [a, b, c]
all([np.array_equal(a, b) for a, b in zip(arrays, arrays[1:])])
We test the equality of consecutive pairs of arrays

combine two arrays and sort

Given two sorted arrays like the following:
a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])
I would like the output to be:
c = array([1,2,3,4,5,6,7,8,9,10])
or:
c = array([1,2,3,4,4,5,6,7,8,9,10])
I'm aware that I can do the following:
c = unique(concatenate((a,b)))
I'm just wondering if there is a faster way to do it as the arrays I'm dealing with have millions of elements.
Any ideas are welcome. Thanks
Since you use numpy, I doubt that bisect helps you at all... So instead I would suggest two smaller things:
Do not use np.sort; use the c.sort() method instead, which sorts the array in place and avoids the copy.
np.unique must use np.sort, which is not in place. So instead of using np.unique, do the logic by hand: first sort (in place), then do what np.unique does by hand (check its Python code too), with flag = np.concatenate(([True], ar[1:] != ar[:-1])), from which unique = ar[flag] (with ar being sorted). To do a bit better, you should probably make the flag operation in place itself, i.e. flag = np.ones(len(ar), dtype=bool) and then np.not_equal(ar[1:], ar[:-1], out=flag[1:]), which basically avoids one full copy of flag.
I am not sure about this, but .sort has 3 different algorithms; since your arrays may already be almost sorted, changing the sorting method might make a speed difference.
This would make the full thing close to what you got (without doing a unique beforehand):
import numpy as np

def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
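A quick usage sketch on the arrays from the question (my own addition; it relies on the insort defined just above):
a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])
print(insort(a, b))   # [ 1  2  3  4  5  6  7  8  9 10]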
Inserting elements into the middle of an array is a very inefficient operation as they're flat in memory, so you'll need to shift everything along whenever you insert another element. As a result, you probably don't want to use bisect. The complexity of doing so would be around O(N^2).
Your current approach is O(n*log(n)), so that's a lot better, but it's not perfect.
Inserting all the elements into a hash table (such as a set) is an option. That's going to take O(N) time to uniquify, but then you need to sort, which will take O(n*log(n)). Still not great.
The real O(N) solution involves allocating an array and then populating it one element at a time by taking the smaller head of your input lists, i.e. a merge. Unfortunately neither numpy nor Python seems to have such a thing. The solution may be to write one in Cython.
It would look vaguely like the following:
def foo(numpy.ndarray[int, ndim=1] out,
        numpy.ndarray[int, ndim=1] in1,
        numpy.ndarray[int, ndim=1] in2):
    cdef int i = 0
    cdef int j = 0
    cdef int k = 0
    while (i != len(in1)) or (j != len(in2)):
        # set out[k] to the smaller of in1[i] or in2[j]
        # increment k
        # increment one of i or j
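For illustration, a plain-Python sketch of the merge loop those comments describe (my own addition, ignoring the Cython typing; it keeps duplicates, as a raw merge would):
import numpy as np

def merge_sorted(in1, in2):
    # classic two-pointer merge of two sorted arrays
    out = np.empty(len(in1) + len(in2), dtype=in1.dtype)
    i = j = k = 0
    while i < len(in1) and j < len(in2):
        if in1[i] <= in2[j]:
            out[k] = in1[i]
            i += 1
        else:
            out[k] = in2[j]
            j += 1
        k += 1
    # copy whatever remains from the non-exhausted input
    out[k:] = in1[i:] if i < len(in1) else in2[j:]
    return out

a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])
print(merge_sorted(a, b))   # [ 1  2  3  4  4  5  6  7  8  9 10]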
When curious about timings, it's always best to just use timeit. Below, I've listed a subset of the various methods and their timings:
import numpy as np
import timeit
import heapq

def insort(a, x, lo=0, hi=None):
    if hi is None: hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo, np.insert(a, lo, [x])

size = 10000
a = np.array(range(size))
b = np.array(range(size))

def op(a, b):
    return np.unique(np.concatenate((a, b)))

def martijn(a, b):
    c = np.copy(a)
    lo = 0
    for i in b:
        lo, c = insort(c, i, lo)
    return c

def martijn2(a, b):
    c = np.zeros(len(a) + len(b), a.dtype)
    for i, v in enumerate(heapq.merge(a, b)):
        c[i] = v

def larsmans(a, b):
    return np.array(sorted(set(a) | set(b)))

def larsmans_mod(a, b):
    return np.array(set.union(set(a), b))

def sebastian(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
Results:
martijn2 25.1079499722
OP 1.44831800461
larsmans 9.91507601738
larsmans_mod 5.87612199783
sebastian 3.50475311279e-05
My specific contribution here is larsmans_mod which avoids creating 2 sets -- it only creates 1 and in doing so cuts execution time nearly in half.
EDIT: removed martijn as it was too slow to compete. Also tested with slightly bigger (sorted) input arrays. I have not tested the outputs for correctness ...
In addition to the other answer on using bisect.insort, if you are not content with performance, you may try using the blist module with bisect. It should improve the performance.
Traditional list insertion complexity is O(n), while blist's insertion complexity is O(log(n)).
Also, your arrays seem to be sorted. If so, you can use the merge function from the heapq module to exploit the fact that both arrays are presorted. This approach has some overhead because it creates a new array in memory, but it may be an option to consider, as this solution's time complexity is O(n+m), while the insort-based solutions have O(n*m) complexity (n elements * m insertions).
import heapq
a = [1,2,4,5,6,8,9]
b = [3,4,7,10]
it = heapq.merge(a,b) #iterator consisting of merged elements of a and b
L = list(it) #list made of it
print(L)
Output:
[1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]
If you want to delete repeating values, you can use groupby:
import heapq
import itertools
a = [1,2,4,5,6,8,9]
b = [3,4,7,10]
it = heapq.merge(a,b) #iterator consisting of merged elements of a and b
it = (k for k,v in itertools.groupby(it))
L = list(it) #list made of it
print(L)
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
You could use the bisect module for such merges, merging the second python list into the first.
The bisect* functions work for numpy arrays but the insort* functions don't. It's easy enough to use the module source code to adapt the algorithm; it's quite basic:
from numpy import array, copy, insert

def insort(a, x, lo=0, hi=None):
    if hi is None: hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo, insert(a, lo, [x])

a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])

c = copy(a)
lo = 0
for i in b:
    lo, c = insort(c, i, lo)
Not that the custom insort is really adding anything here; the default bisect.bisect works just fine too:
import bisect

c = copy(a)
lo = 0
for i in b:
    lo = bisect.bisect(c, i)
    c = insert(c, lo, i)
Using this adapted insort is much more efficient than a combine and sort. Because b is sorted as well, we can track the lo insertion point and search for the next point starting there instead of considering the whole array each loop.
If you don't need to preserve a, just operate directly on that array and save yourself the copy.
More efficient still: because both lists are sorted, we can use heapq.merge:
from numpy import zeros
import heapq

c = zeros(len(a) + len(b), a.dtype)
for i, v in enumerate(heapq.merge(a, b)):
    c[i] = v
Use the bisect module for this:
import bisect
from numpy import array, insert

a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])

for i in b:
    pos = bisect.bisect(a, i)
    a = insert(a, pos, i)
I can't test this right now, but it should work
The sortednp package implements an efficient merge of sorted numpy arrays: it just merges the values, without making them unique:
import numpy as np
import sortednp
a = np.array([1,2,4,5,6,8,9])
b = np.array([3,4,7,10])
c = sortednp.merge(a, b)
I measured the times and compared them in this answer to a similar post where it outperforms numpy's mergesort (v1.17.4).
Seems like no one mentioned np.union1d. Currently it is a shortcut for unique(concatenate((ar1, ar2))), but it's a short name to remember and it has the potential to be optimized by the numpy developers since it's a library function. It performs very similarly to insort from seberg's accepted answer for large arrays. Here is my benchmark:
import numpy as np

def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]

size = int(1e7)
a = np.random.randint(np.iinfo(np.int).min, np.iinfo(np.int).max, size)
b = np.random.randint(np.iinfo(np.int).min, np.iinfo(np.int).max, size)

np.testing.assert_array_equal(insort(a, b), np.union1d(a, b))

import timeit
repetitions = 20
print("insort: %.5fs" % (timeit.timeit("insort(a, b)", "from __main__ import a, b, insort", number=repetitions)/repetitions,))
print("union1d: %.5fs" % (timeit.timeit("np.union1d(a, b)", "from __main__ import a, b; import numpy as np", number=repetitions)/repetitions,))
Output on my machine:
insort: 1.69962s
union1d: 1.66338s
