Given two sorted arrays like the following:
a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])
I would like the output to be:
c = array([1,2,3,4,5,6,7,8,9,10])
or:
c = array([1,2,3,4,4,5,6,7,8,9,10])
I'm aware that I can do the following:
c = unique(concatenate((a, b)))
I'm just wondering if there is a faster way to do it as the arrays I'm dealing with have millions of elements.
Any ideas are welcome. Thanks.
Since you use numpy, I doubt that bisect helps you at all... So instead I would suggest a few smaller things:
Do not use np.sort; use the c.sort() method instead, which sorts the array in place and avoids the copy.
np.unique must use np.sort, which is not in place. So instead of using np.unique, do the logic by hand: first sort (in place), then do what np.unique does by hand (check its Python code too), with flag = np.concatenate(([True], ar[1:] != ar[:-1])) so that unique = ar[flag] (with ar being sorted). To do a bit better, you should probably make the flag operation itself in place, i.e. flag = np.ones(len(ar), dtype=bool) and then np.not_equal(ar[1:], ar[:-1], out=flag[1:]), which basically avoids one full copy of flag.
I am not sure about this, but .sort has 3 different algorithms; since your arrays may be almost sorted already, changing the sorting method might make a speed difference.
This would make the full thing close to what you got (without doing a unique beforehand):
import numpy as np

def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
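A quick usage check with the arrays from the question:

a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])
print(insort(a, b))
# [ 1  2  3  4  5  6  7  8  9 10]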
Inserting elements into the middle of an array is a very inefficient operation as they're flat in memory, so you'll need to shift everything along whenever you insert another element. As a result, you probably don't want to use bisect. The complexity of doing so would be around O(N^2).
Your current approach is O(n*log(n)), so that's a lot better, but it's not perfect.
Inserting all the elements into a hash table (such as a set) is another option. That's going to take O(N) time to uniquify, but then you need to sort, which will take O(n*log(n)). Still not great.
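As a concrete sketch of that set route (this is essentially the larsmans variant benchmarked further down):

import numpy as np

a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])
c = np.array(sorted(set(a) | set(b)))  # uniquify via a set, then sort the survivors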
The real O(N) solution involves allocating an array and then populating it one element at a time by taking the smallest head of your input lists, i.e. a merge. Unfortunately neither numpy nor Python seems to have such a thing. The solution may be to write one in Cython.
It would look vaguely like the following:
def foo(numpy.ndarray[int, ndim=1] out,
        numpy.ndarray[int, ndim=1] in1,
        numpy.ndarray[int, ndim=1] in2):
    cdef int i = 0
    cdef int j = 0
    cdef int k = 0
    while (i != len(in1)) or (j != len(in2)):
        # set out[k] to the smaller of in1[i] or in2[j]
        # increment k
        # increment one of i or j
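For illustration, here is a plain-Python sketch of the merge logic those comments describe (just to show the branch structure; only the Cython version above would actually be fast):

import numpy as np

def merge_sorted(a, b):
    # Two-pointer merge: repeatedly take the smaller head element
    # until one input is exhausted, then copy the remainder.
    out = np.empty(len(a) + len(b), dtype=a.dtype)
    i = j = k = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out[k] = a[i]
            i += 1
        else:
            out[k] = b[j]
            j += 1
        k += 1
    out[k:] = a[i:] if i < len(a) else b[j:]
    return out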
When curious about timings, it's always best to just timeit. Below, I've listed a subset of the various methods and their timings:
import numpy as np
import timeit
import heapq

def insort(a, x, lo=0, hi=None):
    if hi is None: hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo, np.insert(a, lo, [x])

size = 10000
a = np.array(range(size))
b = np.array(range(size))

def op(a, b):
    return np.unique(np.concatenate((a, b)))

def martijn(a, b):
    c = np.copy(a)
    lo = 0
    for i in b:
        lo, c = insort(c, i, lo)
    return c

def martijn2(a, b):
    c = np.zeros(len(a) + len(b), a.dtype)
    for i, v in enumerate(heapq.merge(a, b)):
        c[i] = v

def larsmans(a, b):
    return np.array(sorted(set(a) | set(b)))

def larsmans_mod(a, b):
    return np.array(set.union(set(a), b))

def sebastian(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
Results:
martijn2 25.1079499722
OP 1.44831800461
larsmans 9.91507601738
larsmans_mod 5.87612199783
sebastian 3.50475311279e-05
My specific contribution here is larsmans_mod which avoids creating 2 sets -- it only creates 1 and in doing so cuts execution time nearly in half.
EDIT: removed martijn as it was too slow to compete. Also tested with slightly bigger (sorted) input arrays. I also have not tested the output for correctness...
In addition to the other answer on using bisect.insort, if you are not content with performance, you may try using the blist module together with bisect. It should improve the performance.
Traditional list insertion complexity is O(n), while blist's complexity on insertion is O(log(n)).
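A rough sketch of that combination (assuming the blist package is installed; bisect.insort works on anything with a list-like insert method, so a blist can be dropped in directly):

import bisect
from blist import blist

a = blist([1, 2, 4, 5, 6, 8, 9])
for x in [3, 4, 7, 10]:
    bisect.insort(a, x)  # insertion shifts cost O(log n) on a blist instead of O(n) on a plain list
print(list(a))
# [1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]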
Also, your arrays seem to be sorted. If so, you can use the merge function from the heapq module to take advantage of the fact that both arrays are presorted. This approach incurs some overhead from creating a new array in memory, but it may be an option to consider, as this solution's time complexity is O(n+m), while the solutions with insort have O(n*m) complexity (n elements * m insertions).
import heapq
a = [1,2,4,5,6,8,9]
b = [3,4,7,10]
it = heapq.merge(a,b) #iterator consisting of merged elements of a and b
L = list(it) #list made of it
print(L)
Output:
[1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]
If you want to delete repeating values, you can use groupby:
import heapq
import itertools
a = [1,2,4,5,6,8,9]
b = [3,4,7,10]
it = heapq.merge(a,b) #iterator consisting of merged elements of a and b
it = (k for k,v in itertools.groupby(it))
L = list(it) #list made of it
print(L)
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
You could use the bisect module for such merges, merging the second python list into the first.
The bisect* functions work for numpy arrays but the insort* functions don't. It's easy enough to use the module source code to adapt the algorithm; it's quite basic:
from numpy import array, copy, insert
def insort(a, x, lo=0, hi=None):
    if hi is None: hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo, insert(a, lo, [x])
a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])
c = copy(a)
lo = 0
for i in b:
    lo, c = insort(c, i, lo)
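For the example arrays this leaves c as the merged array with the duplicate kept:

print(c)
# [ 1  2  3  4  4  5  6  7  8  9 10]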
Not that the custom insort is really adding anything here; the default bisect.bisect works just fine too:
import bisect

c = copy(a)
lo = 0
for i in b:
    lo = bisect.bisect(c, i)
    c = insert(c, lo, i)
Using this adapted insort is much more efficient than a combine and sort. Because b is sorted as well, we can track the lo insertion point and search for the next point starting there instead of considering the whole array each loop.
If you don't need to preserve a, just operate directly on that array and save yourself the copy.
More efficient still: because both lists are sorted, we can use heapq.merge:
from numpy import zeros
import heapq
c = zeros(len(a) + len(b), a.dtype)
for i, v in enumerate(heapq.merge(a, b)):
    c[i] = v
Use the bisect module for this:
import bisect
from numpy import array, insert

a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])

for i in b:
    pos = bisect.bisect(a, i)
    a = insert(a, pos, i)  # np.insert returns a new array, so reassign
I can't test this right now, but it should work
The sortednp package implements an efficient merge of sorted numpy-arrays, just sorting the values, not making them unique:
import numpy as np
import sortednp
a = np.array([1,2,4,5,6,8,9])
b = np.array([3,4,7,10])
c = sortednp.merge(a, b)
I measured the times and compared them in this answer to a similar post where it outperforms numpy's mergesort (v1.17.4).
Seems like no one mentioned np.union1d. Currently it is a shortcut for unique(concatenate((ar1, ar2))), but it's a short name to remember and it has the potential to be optimized by numpy developers since it's a library function. It performs very similarly to insort from seberg's accepted answer for large arrays. Here is my benchmark:
import numpy as np
def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]

size = int(1e7)
a = np.random.randint(np.iinfo(np.int64).min, np.iinfo(np.int64).max, size)
b = np.random.randint(np.iinfo(np.int64).min, np.iinfo(np.int64).max, size)
np.testing.assert_array_equal(insort(a, b), np.union1d(a, b))
import timeit
repetitions = 20
print("insort: %.5fs" % (timeit.timeit("insort(a, b)", "from __main__ import a, b, insort", number=repetitions)/repetitions,))
print("union1d: %.5fs" % (timeit.timeit("np.union1d(a, b)", "from __main__ import a, b; import numpy as np", number=repetitions)/repetitions,))
Output on my machine:
insort: 1.69962s
union1d: 1.66338s
Problem description
Let's take this simple array set
# 0,1,2,3,4,5
a = np.array([1,1,3,4,6])
b = np.array([6,6,1,3])
From these two arrays I want to get the indices of all possible matches. So for number 1 we get 0,2 and 1,2, with the complete output looking like:
0,2 # 1
1,2 # 1
2,3 # 3
4,0 # 6
4,1 # 6
Note that the arrays are not (yet) sorted, nor do they contain only unique elements - two conditions often assumed in other answers (see bottom). The above example is very small; however, I have to apply this to ~40K element arrays.
Tried approaches
1. Python loop approach
indx = []
for i, aval in enumerate(a):
    for j, bval in enumerate(b):
        if aval == bval:
            indx.append([i,j])
# [[0, 2], [1, 2], [2, 3], [4, 0], [4, 1]]
2. Python dict approach
from collections import defaultdict

adict = defaultdict(list)
bdict = defaultdict(list)
for i, aval in enumerate(a): adict[aval].append(i)
for j, bval in enumerate(b): bdict[bval].append(j)

for val, a_positions in adict.items():
    for b_position in bdict[val]:
        for a_position in a_positions:
            print(a_position, b_position)
3. Numpy where
print(np.where(a.reshape(-1,1) == b))
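As a side note on approach 3, np.where already returns both index arrays for the broadcast comparison, so the pairs can be stacked directly; a small sketch:

import numpy as np

a = np.array([1, 1, 3, 4, 6])
b = np.array([6, 6, 1, 3])
ai, bj = np.where(a.reshape(-1, 1) == b)
pairs = np.column_stack((ai, bj))
# [[0 2]
#  [1 2]
#  [2 3]
#  [4 0]
#  [4 1]]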
4. Polars dataframes
Converting it to a dataframe and then using Polars
import polars as pl
a = pl.DataFrame( {'x': a, 'apos':list(range(len(a)))} )
b = pl.DataFrame( {'x': b, 'apos':list(range(len(b)))} )
a.join(b, how='inner', on='x')
"Big data"
On "big" data using Polars seems the fastest now with around 0.02 secs. I'm suprised that creating DataFrames first and then joining them is faster than any other approach I could come up with so curious if there is any other way to beat it :)
a = np.random.randint(0,1000, 40000)
b = np.random.randint(0,1000, 40000)
Using the above data:
python loop: 218s
python dict: 0.03s
numpy.where: 4.5s
polars: 0.02s
How related questions didn't solve this
Return common element indices between two numpy arrays only returns the indices of matches in one of the arrays, not both
Find indices of common values in two arrays, returns the matching indices of A with B and B with A, but not the paired indices (see example)
I'm very surprised a DataFrame library is currently the fastest, so I'm curious to see if there are other approaches that beat this speed :) Anything goes: cython, numba, pythran, etc.
NOTE: this post is now superseded by the faster alternative sort-based solution.
The dict-based approach is an algorithmically efficient solution compared to the others (I guess Polars uses a similar approach). However, the overhead of CPython makes it a bit slow. You can speed it up a bit using Numba. Here is an implementation:
import numba as nb
import numpy as np
from numba.typed.typeddict import Dict
from numba.typed.typedlist import ListType
from numba.typed.typedlist import List

IntList = ListType(nb.int32)

@nb.njit('(int32[:], int32[:])')
def numba_dict_based_compute(a, b):
    adict = Dict.empty(nb.int32, IntList)
    bdict = Dict.empty(nb.int32, IntList)

    for i, val in enumerate(a):
        if val in adict: adict[val].append(i)
        else: adict[val] = List([nb.int32(i)])

    for i, val in enumerate(b):
        if val in bdict: bdict[val].append(i)
        else: bdict[val] = List([nb.int32(i)])

    count = 0
    for val, a_positions in adict.items():
        if val not in bdict:
            continue
        b_positions = bdict[val]
        count += len(a_positions) * len(b_positions)

    result = np.empty((count, 2), dtype=np.int32)
    cur = 0
    for val, a_positions in adict.items():
        if val not in bdict:
            continue
        for b_position in bdict[val]:
            for a_position in a_positions:
                result[cur, 0] = a_position
                result[cur, 1] = b_position
                cur += 1

    return result

result = numba_dict_based_compute(a.astype(np.int32), b.astype(np.int32))
Note that computing the values in place is a bit faster than storing them (and pre-computing the size of the output array). However, if nothing is done in the loop, Numba can completely optimize it out and the benchmark would be biased. Alternatively, printing values is so slow that it would also bias the benchmark. Note also that the implementation assumes the numbers are 32-bit. A 64-bit implementation can be written trivially by replacing the 32-bit types with 64-bit ones, though it decreases the performance.
This solution is about twice as fast on my machine, though it is a bit verbose and not very easy to read. The performance of the operation is mainly bounded by the speed of the dictionary lookups. This implementation is a bit faster than the Polars one on my machine.
Here are timings:
Naive python loop: >100_000 ms
Numpy where: 3_451 ms
Python dict: 24.7 ms
Polars: 12.3 ms
This implementation: 11.3 ms (takes 13.2 ms on 64-bit values)
An alternative, completely different solution is to sort the arrays and retrieve the locations of the sorted arrays with np.argsort, then get the sorted values, and then walk in lockstep over the two sets of locations sorted by value. This last operation can be (again) implemented efficiently in Numba or Cython. It can actually be split in two parts: one finding the slices in a and b matching the same value (similar to a merge operation), and one doing the actual cartesian product for each pair of matching slices. Splitting this in two steps enables the second one (which is expensive) to be computed in parallel if possible (and in place if possible too). The complexity of finding the matching offsets is O(n log n) with Numpy (one could reach the theoretically optimal O(n) time using a radix sort).
Here is the resulting implementation:
import numba as nb
import numpy as np

# Support both 32-bit and 64-bit integers
@nb.njit(['(int64[::1],int64[::1],int64[::1],int64[::1])', '(int64[::1],int64[::1],int32[::1],int32[::1])'], debug=True)
def find_matching_offsets(a_positions, b_positions, a_sorted_values, b_sorted_values):
    n, m = a_positions.size, b_positions.size
    result = np.empty((n, 4), dtype=np.int32)
    a_pos, b_pos, cur = 0, 0, 0
    while a_pos < n and b_pos < m:
        a_val = a_sorted_values[a_pos]
        b_val = b_sorted_values[b_pos]
        if a_val < b_val:
            a_pos += 1
            continue
        if a_val > b_val:
            b_pos += 1
            continue
        a_end = n
        for i in range(a_pos + 1, n):
            if a_sorted_values[i] != a_val:
                a_end = i
                break
        b_end = m
        for i in range(b_pos + 1, m):
            if b_sorted_values[i] != b_val:
                b_end = i
                break
        result[cur, 0] = a_pos
        result[cur, 1] = a_end
        result[cur, 2] = b_pos
        result[cur, 3] = b_end
        cur += 1
        a_pos = a_end
        b_pos = b_end
    return result[:cur]

@nb.njit(['(int64[::1],int64[::1],int32[:,::1])'], parallel=True)
def do_cartesian_product(a_positions, b_positions, offsets):
    size = 0
    cur = 0
    result_offsets = np.empty(offsets.shape[0], dtype=np.int32)

    # Compute the size of the output
    for i in range(offsets.shape[0]):
        a_pos, a_end, b_pos, b_end = offsets[i]
        assert a_end > a_pos and b_end > b_pos
        result_offsets[cur] = size
        size += (a_end - a_pos) * (b_end - b_pos)
        cur += 1
    assert size > 0
    result = np.empty((size, 2), dtype=np.int32)

    # Generate the output in parallel (or in-place if possible)
    for i in nb.prange(offsets.shape[0]):
        a_pos, a_end, b_pos, b_end = offsets[i]
        offset = result_offsets[i]
        local_cur = 0
        for j in range(a_pos, a_end):
            for k in range(b_pos, b_end):
                local_offset = offset + local_cur
                result[local_offset, 0] = a_positions[j]
                result[local_offset, 1] = b_positions[k]
                local_cur += 1
    return result

def sorted_based_compute(a, b):
    a_positions = np.argsort(a)
    b_positions = np.argsort(b)
    a_sorted_values = a[a_positions]
    b_sorted_values = b[b_positions]
    offsets = find_matching_offsets(a_positions, b_positions, a_sorted_values, b_sorted_values)
    return do_cartesian_product(a_positions, b_positions, offsets)
This solution is faster than the previous one and certainly reaches the limit of what is possible with Numpy/Numba (without making additional assumptions on the input). Here are the performance results (on my 6-core machine):
Python dict: 24.7 ms
Polars: 12.3 ms
Dict-based Numba version: 11.3 ms
Sort-based Numpy+Numba version: 5.0 ms <----
Note that ~60% of the time is spent in the argsort functions and the rest is basically the cartesian product. It can theoretically be improved using a parallel sort but AFAIK this is not possible with Numpy yet (and pretty hard to do in Numba).
In this snippet of Python code,
fun iterates through the array arr and counts the number of identical integers in two array sections for every section pair. (It simulates a matrix.) This makes n*(n-1)/2*m comparisons in total, giving a time complexity of O(n^2).
Are there programming solutions or ways of reframing this problem that would yield equivalent results but have reduced time complexity?
# n > 500000, 0 < i < n, m = 100
# dim(arr) = n*m, 0 < arr[x] < 4294967311
import ctypes
import multiprocessing as mp

arr = mp.RawArray(ctypes.c_uint, n*m)

def fun(i):
    for j in range(i-1, 0, -1):
        count = 0
        for k in range(0, m):
            count += (arr[i*m+k] == arr[j*m+k])
        if count/m > 0.7:
            return (i, j)
    return ()
arr is a shared memory array, therefore it's best kept read-only for simplicity and performance reasons.
arr is implemented as a 1D RawArray from multiprocessing. The reason for this is that it has by far the fastest performance according to my tests. Using a numpy 2D array, for example, like this:
arr = np.ctypeslib.as_array(mp.RawArray(ctypes.c_uint, n*m)).reshape(n,m)
would provide vectorization capabilities, but increases the total runtime by an order of magnitude - 250s vs. 30s for n = 1500, which amounts to 733%.
Since you can't change the array characteristics at all, I think you're stuck with O(n^2). numpy would gain some vectorization, but would change the access for others sharing the array. Start with the innermost operation:
for k in range(0, m):
    count += (arr[i][k] == arr[j][k])
Change this to a one-line assignment:
count = sum(arr[i][k] == arr[j][k] for k in range(m))
Now, if this is truly an array, rather than a list of lists, use the array package's vectorization to simplify the loops, one at a time:
count = sum(arr[i] == arr[j]) # results in a vector of counts
You can now return the j indices where count[j] / m > 0.7. Note that there's no real need to return i for each one: it's constant within the function, and the calling program already has the value. Your array package likely has a pair of vectorized indexing operations that can return those indices. If you're using numpy, those are easy enough to look up on this site.
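A small self-contained sketch of that last step in NumPy (the names arr2d, i, and counts here are made up for the illustration):

import numpy as np

n, m = 4, 5
arr2d = np.random.randint(0, 3, size=(n, m))   # stand-in for the shared (n, m) data
i = 0
counts = (arr2d == arr2d[i]).sum(axis=1)       # counts[j] = identical positions between rows i and j
matching_j = np.flatnonzero(counts > 0.7 * m)  # the j indices referred to above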
So after fiddling around some more, I was able to cut down the running time greatly with help from NumPy's vectorization and Numba's JIT compiler. Going back to the original code:
arr = mp.RawArray(ctypes.c_uint, n*m)

def fun(i):
    for j in range(i-1, 0, -1):
        count = 0
        for k in range(0, m):
            count += (arr[i*m+k] == arr[j*m+k])
        if count/m > 0.7:
            return (i, j)
    return ()
We can leave out the bottom return statement as well as dismiss the idea of using count entirely, leaving us with:
def fun(i):
    for j in range(i-1, 0, -1):
        if sum(arr[i*m+k] == arr[j*m+k] for k in range(m)) > 0.7*m:
            return (i, j)
Then, we change the array arr to a NumPy format:
np_arr = np.frombuffer(arr, dtype='int32').reshape(n, m)
The important thing to note here is that we do not use a NumPy array as a shared memory array to be written from multiple processes, avoiding the overhead pitfall.
Finally, we apply Numba's decorator and rewrite the sum function in vector form so that it works with the new array:
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def fun(i):
    for j in range(i-1, 0, -1):
        if np.sum(np_arr[i] == np_arr[j]) > 0.7*m:
            return (i, j)
This reduced the running time to 7.9s, which is definitely a victory for me.
How can I make this function faster? (I call it a lot of times and it could result in some speed improvements)
def vectorr(I, J, K):
    vect = []
    for k in range(0, K):
        for j in range(0, J):
            for i in range(0, I):
                vect.append([i, j, k])
    return vect
You can try to take a look at itertools.product
Equivalent to nested for-loops in a generator expression. For example,
product(A, B) returns the same as ((x,y) for x in A for y in B).
The nested loops cycle like an odometer with the rightmost element
advancing on every iteration. This pattern creates a lexicographic
ordering so that if the input’s iterables are sorted, the product
tuples are emitted in sorted order.
Also, there's no need for the 0 when calling range(0, I) etc. - just use range(I).
So in your case it can be:
import itertools

def vectorr(I, J, K):
    return itertools.product(range(K), range(J), range(I))
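Note that this returns an iterator of (k, j, i) tuples rather than the list of [i, j, k] lists the original function builds; if that exact layout is needed, one way (just a sketch, with a hypothetical helper name) is to unpack and reorder:

import itertools

def vectorr_as_lists(I, J, K):
    # reproduces the original [i, j, k] ordering
    return [[i, j, k] for k, j, i in itertools.product(range(K), range(J), range(I))]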
You said you want it to be faster. Let's use NumPy!
import numpy as np

def vectorr(I, J, K):
    arr = np.empty((I*J*K, 3), int)
    arr[:,0] = np.tile(np.arange(I), J*K)
    arr[:,1] = np.tile(np.repeat(np.arange(J), I), K)
    arr[:,2] = np.repeat(np.arange(K), I*J)
    return arr
There may be even more elegant tweaks possible here, but that's a basic tiling that gives the same result (but as a 2D array rather than a list of lists). The code for this is all implemented in C, so it's very, very fast--this may be important if the input values may get somewhat large.
The other answers are more thorough and, in this specific case at least, better, but in general, if you're using Python 2, and for large values of I, J, or K, use xrange() instead of range(). xrange gives a generator-like object, instead of constructing a list, so you don't have to allocate memory for the entire list.
In Python 3, range works like Python 2's xrange.
import numpy

def vectorr(I, J, K):
    val = numpy.indices((I, J, K))
    val.shape = (3, -1)
    return val.transpose()  # or val.transpose().tolist()
After many attempts at optimizing the code, it seems the last resort is to try running the code below using multiple cores. I don't know exactly how to convert/restructure my code so that it can run much faster using multiple cores. I would appreciate guidance on achieving the end goal, which is to be able to run this code as fast as possible for arrays A and B where each array holds about 700,000 elements. Here is the code using small arrays. The 700k element arrays are commented out.
import numpy as np

def ismember(a, b):
    for i in a:
        index = np.where(b == i)[0]
        if index.size == 0:
            yield 0
        else:
            yield index

def f(A, gen_obj):
    my_array = np.arange(len(A))
    for i in my_array:
        my_array[i] = gen_obj.next()
    return my_array

#A = np.arange(700000)
#B = np.arange(700000)

A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])

gen_obj = ismember(A, B)

f(A, gen_obj)

print 'done'
# if we print f(A, gen_obj) the output will be: [4 0 0 4 3]
# notice that the output array needs to be kept the same size as array A.
What I am trying to do is mimic a MATLAB function called ismember (the one with the signature [Lia, Locb] = ismember(A, B)). I am just trying to get the Locb part only.
From Matlab: Locb, contain the lowest index in B for each value in A that is a member of B. The output array, Locb, contains 0 wherever A is not a member of B
One of the main problems is that I need to be able to perform this operation as efficiently as possible. For testing I have two arrays of 700k elements. Creating a generator and going through the values of the generator doesn't seem to get the job done fast.
Before worrying about multiple cores, I would eliminate the linear scan in your ismember function by using a dictionary:
def ismember(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value
Your original implementation requires a full scan of the elements in B for each element in A, making it O(len(A)*len(B)). The above code requires one full scan of B to generate the dict bind. By using a dict, you effectively make the lookup of each element in B constant for each element of A, making the operation O(len(A)+len(B)). If this is still too slow, then worry about making the above function run on multiple cores.
Edit: I've also modified your indexing slightly. Matlab uses 0 because all of its arrays start at index 1. Python/numpy start arrays at 0, so if your data set looks like this
A = [2378, 2378, 2378, 2378]
B = [2378, 2379]
and you return 0 for no element, then your results will exclude all elements of A. The above routine returns None for no index instead of 0. Returning -1 is an option, but Python will interpret that as the last element in the array. None will raise an exception if it's used as an index into the array. If you'd like different behavior, change the second argument in the bind.get(itm, None) expression to the value you want returned.
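For example, with the arrays from the question, the function above gives (None marking values of A not found in B):

A = [3, 4, 4, 3, 6]
B = [2, 5, 2, 6, 3]
print(ismember(A, B))
# [4, None, None, 4, 3]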
sfstewman's excellent answer most likely solved the issue for you.
I'd just like to add how you can achieve the same exclusively in numpy.
I make use of numpy's unique and in1d functions.
B_unique_sorted, B_idx = np.unique(B, return_index=True)
B_in_A_bool = np.in1d(B_unique_sorted, A, assume_unique=True)
B_unique_sorted contains the unique values in B sorted.
B_idx holds for these values the indices into the original B.
B_in_A_bool is a boolean array the size of B_unique_sorted that
stores whether a value in B_unique_sorted is in A.
Note: I need to look for (unique vals from B) in A because I need the output to be returned with respect to B_idx
Note: I assume that A is already unique.
Now you can use B_in_A_bool to either get the common vals
B_unique_sorted[B_in_A_bool]
and their respective indices in the original B
B_idx[B_in_A_bool]
Finally, I assume that this is significantly faster than the pure Python for-loop although I didn't test it.
Try the ismember library.
pip install ismember
Simple example:
# Import library
from ismember import ismember
import numpy as np
# data
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])
# Lookup
Iloc,idx = ismember(A, B)
# Iloc is a boolean array marking which elements of A occur in B
print(Iloc)
# [ True False False True True]
# idx holds the indices into B of those matching elements
print(idx)
# [4 4 3]
print(B[idx])
# [3 3 6]
print(A[Iloc])
# [3 3 6]
# These vectors will match
A[Iloc]==B[idx]
Speed check:
from ismember import ismember
from datetime import datetime
import numpy as np

t1 = []
t2 = []

# Create some random vectors
ns = np.random.randint(10, 10000, 1000)

for n in ns:
    a_vec = np.random.randint(0, 100, n)
    b_vec = np.random.randint(0, 100, n)

    # Run stack version
    start = datetime.now()
    out1 = ismember_stack(a_vec, b_vec)
    end = datetime.now()
    t1.append(end - start)

    # Run ismember
    start = datetime.now()
    out2 = ismember(a_vec, b_vec)
    end = datetime.now()
    t2.append(end - start)
print(np.sum(t1))
# 0:00:07.778331
print(np.sum(t2))
# 0:00:04.609801
# %%
def ismember_stack(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value
The ismember function from pypi is almost 2x faster.
Large vectors, eg 700000 elements:
from ismember import ismember
from datetime import datetime
import numpy as np
A = np.random.randint(0,100,700000)
B = np.random.randint(0,100,700000)
# Lookup
start = datetime.now()
Iloc,idx = ismember(A, B)
end = datetime.now()
# Print time
print(end-start)
# 0:00:01.194801
Try using a list comprehension;
In [1]: import numpy as np
In [2]: A = np.array([3,4,4,3,6])
In [3]: B = np.array([2,5,2,6,3])
In [4]: [x for x in A if not x in B]
Out[4]: [4, 4]
Generally, list comprehensions are much faster than for-loops.
To get an equal length-list;
In [19]: map(lambda x: x if x not in B else False, A)
Out[19]: [False, 4, 4, False, False]
This is quite fast for small datasets:
In [20]: C = np.arange(10000)
In [21]: D = np.arange(15000, 25000)
In [22]: %timeit map(lambda x: x if x not in D else False, C)
1 loops, best of 3: 756 ms per loop
For large datasets, you could try using a multiprocessing.Pool.map() to speed up the operation.
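A minimal sketch of that multiprocessing route (names are illustrative; note that the worker must be a module-level function so it can be pickled):

import numpy as np
from multiprocessing import Pool

D = np.arange(15000, 25000)

def lookup(x):
    # same per-element test as the lambda above, but picklable
    return x if x not in D else False

if __name__ == '__main__':
    C = np.arange(10000)
    pool = Pool()
    result = pool.map(lookup, C)
    pool.close()
    pool.join()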
Here is the exact MATLAB equivalent, returning both output arguments [Lia, Locb]. It matches MATLAB except that in Python 0 is also a valid index, so this function doesn't return the 0s; it essentially returns Locb(Locb>0). The performance is also equivalent to MATLAB.
import numpy as np

def ismember(a_vec, b_vec):
    """ MATLAB equivalent ismember function """
    bool_ind = np.isin(a_vec, b_vec)
    common = a_vec[bool_ind]
    common_unique, common_inv = np.unique(common, return_inverse=True)  # common = common_unique[common_inv]
    b_unique, b_ind = np.unique(b_vec, return_index=True)  # b_unique = b_vec[b_ind]
    common_ind = b_ind[np.isin(b_unique, common_unique, assume_unique=True)]
    return bool_ind, common_ind[common_inv]
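For instance, on the arrays from the original question this gives the expected Locb-style indices:

A = np.array([3, 4, 4, 3, 6])
B = np.array([2, 5, 2, 6, 3])
bool_ind, common_ind = ismember(A, B)
print(bool_ind)    # [ True False False  True  True]
print(common_ind)  # [4 4 3] -> lowest index in B for each member of A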
An alternate implementation that is a bit (~5x) slower but doesn't use the unique function is here:
def ismember(a_vec, b_vec):
    ''' MATLAB equivalent ismember function. Slower than the above implementation. '''
    b_dict = {b_vec[i]: i for i in range(0, len(b_vec))}
    indices = [b_dict.get(x) for x in a_vec if b_dict.get(x) is not None]
    booleans = np.in1d(a_vec, b_vec)
    return booleans, np.array(indices, dtype=int)
Given:
x = ['a','b','c','d','e']
y = ['1','2','3']
I'd like to iterate, resulting in:
a, 1
b, 2
c, 3
d, 1
e, 2
a, 3
b, 1
... where the two iterables cycle independently until a given count.
Python's cycle(iterable) can do this with one iterable. Functions such as map and itertools.izip_longest can take a function to handle None, but do not provide the built-in auto-repeat.
A not-so-crafty idea is to just concatenate each list to a certain size from which I can iterate evenly. (Boooo!)
Suggestions? Thanks in advance.
The simplest way to do this is cyclezip1 below. It is fast enough for most purposes.
import itertools

def cyclezip1(it1, it2, count):
    pairs = itertools.izip(itertools.cycle(it1),
                           itertools.cycle(it2))
    return itertools.islice(pairs, 0, count)
Here is another implementation that is about twice as fast when count is significantly larger than the least common multiple of the lengths of it1 and it2.
import fractions

def cyclezip2(co1, co2, count):
    l1 = len(co1)
    l2 = len(co2)
    lcm = l1 * l2 // fractions.gcd(l1, l2)
    pairs = itertools.izip(itertools.cycle(co1),
                           itertools.cycle(co2))
    pairs = itertools.islice(pairs, 0, lcm)
    pairs = itertools.cycle(pairs)
    return itertools.islice(pairs, 0, count)
Here we take advantage of the fact that the pairs will cycle after the first n of them, where n is the least common multiple of len(it1) and len(it2). This of course assumes that the iterables are collections, so that asking for their length makes sense. A further optimization that can be made is to replace the line

pairs = itertools.islice(pairs, 0, lcm)

with

pairs = list(itertools.islice(pairs, 0, lcm))

This is not nearly as dramatic an improvement (about 2% in my testing) and not nearly as consistent. It also requires more memory. If it1 and it2 are known in advance to be small enough that the additional memory is negligible, then you could squeeze that extra performance out of it.
It's interesting to note that the obvious thing to do in the case of a collection is about four times slower than the first option presented.
def cyclezip3(co1, co2, count):
    l1 = len(co1)
    l2 = len(co2)
    return ((co1[i % l1], co2[i % l2]) for i in xrange(count))
import itertools

x = ['a','b','c','d','e']
y = ['1','2','3']

for a, b in itertools.izip(itertools.cycle(x), itertools.cycle(y)):
    print a, b
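Note that this pairs the two cycles but never stops on its own; to cap it at a given count, as the question asks, you can wrap it in itertools.islice, e.g.:

for a, b in itertools.islice(itertools.izip(itertools.cycle(x), itertools.cycle(y)), 10):
    print a, b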