I'm trying to optimize the performance of my Python program, and I think I have identified this piece of code as the bottleneck:
for i in range(len(green_list)):
    rgb_list = []
    for j in range(len(green_list[i])):
        rgb_list.append('%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]))
    write_file(str(i), rgb_list)
Where red_list, green_list and blue_list are numpy arrays with values like this:
red_list = [[1, 2, 3, 4, 5], [51, 52, 53, 54, 55]]
green_list = [[6, 7, 8, 9, 10], [56, 57, 58, 59, 60]]
blue_list = [[11, 12, 13, 14, 15], [61, 62, 63, 64, 65]]
At the end of each run of the inner for loop, rgb_list contains the hex values (shown here for the first row):
rgb_list = ['01060b', '02070c', '03080d', '04090e', '050a0f']
Now, it is not clear to me how to exploit the potential of numpy arrays but I think there is a way to optimize those two nested loops. Any suggestions?
I assume the essential traits of your code could be summarized in the following generator:
import numpy as np
def as_str_OP(r_arr, g_arr, b_arr):
    n, m = r_arr.shape
    for i in range(n):
        rgb = []
        for j in range(m):
            rgb.append('%02x%02x%02x' % (r_arr[i, j], g_arr[i, j], b_arr[i, j]))
        yield rgb
which can be consumed with a for loop, for example to write to disk:
for x in as_str_OP(r_arr, g_arr, b_arr):
    write_to_disk(x)
The generator itself can be written either with the core computation vectorized in NumPy or in a Numba-friendly way.
The key is to replace the relatively slow string interpolation with a custom-made int-to-hex computation.
This results in a substantial speed-up, especially as the size of the input grows (particularly the second dimension).
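To make the int-to-hex idea concrete, here is a minimal scalar sketch. The helper names nibble_to_ascii and byte_to_hex are illustrative and not part of the code in this answer. Each byte is split into two 4-bit digits with divmod, and each digit is shifted into the ASCII range of '0'-'9' or 'a'-'f'. The result is then checked against '%02x' for every possible byte value:

```python
def nibble_to_ascii(nibble):
    # Hypothetical helper: map a hex digit 0..15 to the code point of
    # '0'-'9' or 'a'-'f'. ord('0') == 48; ord('a') == 97 and 97 - 10 == 87.
    return nibble + (48 if nibble < 10 else 87)


def byte_to_hex(value):
    # Hypothetical helper: two-character lowercase hex string for 0 <= value <= 255.
    high, low = divmod(value, 16)
    return chr(nibble_to_ascii(high)) + chr(nibble_to_ascii(low))


# Compare the manual arithmetic with string interpolation for every byte.
all_match = all(byte_to_hex(v) == '%02x' % v for v in range(256))
print(all_match)
```

The array-based versions apply the same arithmetic to whole rows at once and write the resulting code points into a uint32 buffer instead of calling chr().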
Below is the NumPy-vectorized version:
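def hex_to_ascii_np(x):
    # Array-aware helper (an assumption added here): the scalar hex_to_ascii
    # shown with the Numba version uses if/else and cannot take whole arrays.
    ascii_num_offset = 48  # ord(b'0') == 48
    ascii_alp_offset = 87  # ord(b'a') == 97, (num of non-alpha digits) == 10
    return x + np.where(x < 10, ascii_num_offset, ascii_alp_offset)
def as_str_np(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for i in range(n):
        rgb = np.empty((m, 2 * l), dtype=np.uint32)
        r0, r1 = divmod(r_arr[i, :], 16)
        g0, g1 = divmod(g_arr[i, :], 16)
        b0, b1 = divmod(b_arr[i, :], 16)
        rgb[:, 0] = hex_to_ascii_np(r0)
        rgb[:, 1] = hex_to_ascii_np(r1)
        rgb[:, 2] = hex_to_ascii_np(g0)
        rgb[:, 3] = hex_to_ascii_np(g1)
        rgb[:, 4] = hex_to_ascii_np(b0)
        rgb[:, 5] = hex_to_ascii_np(b1)
        yield rgb.view(f'<U{2 * l}').reshape(m).tolist()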
and the Numba-accelerated version:
import numba as nb
@nb.njit
def hex_to_ascii(x):
    ascii_num_offset = 48  # ord(b'0') == 48
    ascii_alp_offset = 87  # ord(b'a') == 97, (num of non-alpha digits) == 10
    return x + (ascii_num_offset if x < 10 else ascii_alp_offset)
@nb.njit
def _to_hex_2d(x):
    a, b = divmod(x, 16)
    return hex_to_ascii(a), hex_to_ascii(b)
@nb.njit
def _as_str_nb(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for i in range(n):
        rgb = np.empty((m, 2 * l), dtype=np.uint32)
        for j in range(m):
            rgb[j, 0:2] = _to_hex_2d(r_arr[i, j])
            rgb[j, 2:4] = _to_hex_2d(g_arr[i, j])
            rgb[j, 4:6] = _to_hex_2d(b_arr[i, j])
        yield rgb
def as_str_nb(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for x in _as_str_nb(r_arr, g_arr, b_arr):
        yield x.view(f'<U{2 * l}').reshape(m).tolist()
This essentially involves manually writing the number, correctly converted to hexadecimal ASCII chars, into a properly typed array, which can then be converted to give the desired output.
Note that the final numpy.ndarray.tolist() call can be skipped if whatever consumes the generator can handle the NumPy array itself, which saves an appreciable (and potentially large) amount of time, e.g.:
def as_str_nba(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for x in _as_str_nb(r_arr, g_arr, b_arr):
        yield x.view(f'<U{2 * l}').reshape(m)
Overcoming IO-bound bottleneck
However, if you are IO-bound, you should modify your code to write in blocks, e.g. using the grouper recipe from the itertools recipes:
from itertools import zip_longest
def grouper(iterable, n, *, incomplete='fill', fillvalue=None):
    "Collect data into non-overlapping fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, fillvalue='x') --> ABC DEF Gxx
    # grouper('ABCDEFG', 3, incomplete='strict') --> ABC DEF ValueError
    # grouper('ABCDEFG', 3, incomplete='ignore') --> ABC DEF
    args = [iter(iterable)] * n
    if incomplete == 'fill':
        return zip_longest(*args, fillvalue=fillvalue)
    if incomplete == 'strict':
        return zip(*args, strict=True)
    if incomplete == 'ignore':
        return zip(*args)
    else:
        raise ValueError('Expected fill, strict, or ignore')
to be used like:
group_size = 3
for x in grouper(as_str_OP(r_arr, g_arr, b_arr), group_size):
    write_many_to_disk(x)
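What a block-wise writer might look like is sketched below. This is only an illustration under stated assumptions: chunks and write_block are hypothetical names, the comma-separated line format is made up, and io.StringIO stands in for a real file handle. Unlike grouper with incomplete='fill', this chunking variant yields a shorter last block instead of padding it with None:

```python
import io
from itertools import islice


def chunks(iterable, size):
    # Hypothetical helper: yield lists of up to `size` items, without padding.
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def write_block(handle, block):
    # Hypothetical writer: serialize the whole block, then issue a single write.
    handle.write(''.join(','.join(row) + '\n' for row in block))


rows = [['01060b', '02070c'], ['33383d', '34393e'], ['ffffff', '000000']]
buffer = io.StringIO()  # stand-in for an open file
n_writes = 0
for block in chunks(rows, 2):
    write_block(buffer, block)
    n_writes += 1
print(n_writes)
```

With three rows and a block size of two, only two write calls are issued instead of three.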
Testing out the output
Some dummy input can be produced easily (r_arr is essentially red_list, etc.):
def gen_color(n, m):
    return np.random.randint(0, 2 ** 8, (n, m))
N, M = 10, 3
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)
and tested by consuming the generator to produce a list:
res_OP = list(as_str_OP(r_arr, g_arr, b_arr))
res_np = list(as_str_np(r_arr, g_arr, b_arr))
res_nb = list(as_str_nb(r_arr, g_arr, b_arr))
res_nba = list(as_str_nba(r_arr, g_arr, b_arr))
print(np.array(res_OP))
# [['1f6984' '916d98' 'f9d779']
# ['65f895' 'ded23e' '332fdc']
# ['b9e059' 'ce8676' 'cb75e9']
# ['bca0fc' '3289a9' 'cc3d3a']
# ['6bb0be' '07134a' 'c3cf05']
# ['152d5c' 'bac081' 'c59a08']
# ['97efcc' '4c31c0' '957693']
# ['15247e' 'af8f0a' 'ffb89a']
# ['161333' '8f41ce' '187b01']
# ['d811ae' '730b17' 'd2e269']]
print(res_OP == res_np)
# True
print(res_OP == res_nb)
# True
print(res_OP == [x.tolist() for x in res_nba])
# True
optionally passing through some grouping:
k = 3
res_OP = list(grouper(as_str_OP(r_arr, g_arr, b_arr), k))
res_np = list(grouper(as_str_np(r_arr, g_arr, b_arr), k))
res_nb = list(grouper(as_str_nb(r_arr, g_arr, b_arr), k))
res_nba = list(grouper(as_str_nba(r_arr, g_arr, b_arr), k))
print(np.array(res_OP))
# [[list(['1f6984', '916d98', 'f9d779'])
# list(['65f895', 'ded23e', '332fdc'])
# list(['b9e059', 'ce8676', 'cb75e9'])]
# [list(['bca0fc', '3289a9', 'cc3d3a'])
# list(['6bb0be', '07134a', 'c3cf05'])
# list(['152d5c', 'bac081', 'c59a08'])]
# [list(['97efcc', '4c31c0', '957693'])
# list(['15247e', 'af8f0a', 'ffb89a'])
# list(['161333', '8f41ce', '187b01'])]
# [list(['d811ae', '730b17', 'd2e269']) None None]]
print(res_OP == res_np)
# True
print(res_OP == res_nb)
# True
print(res_OP == [tuple(y.tolist() if y is not None else y for y in x) for x in res_nba])
# True
Benchmarks
To give an idea of the numbers involved, let us use %timeit on much larger inputs:
N, M = 1000, 1000
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)
%timeit -n 1 -r 1 list(as_str_OP(r_arr, g_arr, b_arr))
# 1 loop, best of 1: 1.1 s per loop
%timeit -n 4 -r 4 list(as_str_np(r_arr, g_arr, b_arr))
# 4 loops, best of 4: 279 ms per loop
%timeit -n 4 -r 4 list(as_str_nb(r_arr, g_arr, b_arr))
# 1 loop, best of 1: 96.5 ms per loop
%timeit -n 4 -r 4 list(as_str_nba(r_arr, g_arr, b_arr))
# 4 loops, best of 4: 10.4 ms per loop
To simulate disk writing we could use the following consumer:
import time
import math
def consumer(gen, timeout_sec=0.001, weight=1):
    result = []
    for x in gen:
        result.append(x)
        time.sleep(timeout_sec * weight)
    return result
where disk writing is simulated with a time.sleep() call with a timeout depending on the logarithm of the object size:
N, M = 1000, 1000
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)
%timeit -n 1 -r 1 consumer(as_str_OP(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 2.37 s per loop
%timeit -n 1 -r 1 consumer(as_str_np(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.48 s per loop
%timeit -n 1 -r 1 consumer(as_str_nb(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.27 s per loop
%timeit -n 1 -r 1 consumer(as_str_nba(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.13 s per loop
k = 100
%timeit -n 1 -r 1 consumer(grouper(as_str_OP(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 1.17 s per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_np(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 368 ms per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_nb(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 173 ms per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_nba(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 87.4 ms per loop
Ignoring the disk-writing simulation, the NumPy-vectorized approach is ~4x faster with the test input sizes, while the Numba-accelerated approach is ~10x to ~100x faster, depending on whether the potentially unnecessary conversion to a list with numpy.ndarray.tolist() is present or not.
With the simulated disk-writing and no grouping, the faster versions are all more or less equivalent and noticeably less effective, giving at most a ~2x speed-up.
Grouping alone also gives a ~2x speed-up, but combining it with the faster approaches yields a further ~3x for the NumPy-vectorized version and ~7x or ~13x for the Numba-accelerated approaches (with or without numpy.ndarray.tolist()).
Again, this is with the given input, and under the test conditions.
The actual mileage may vary.
You could use reduce for the inner loop. Note that reduce runs sequentially in a single thread (it does not divide the computation between threads), and in Python 3 it has to be imported from functools:
from functools import reduce
for i in range(len(green_list)):
    rgb_list = reduce(lambda ls, j: ls + ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j])], range(len(green_list[i])), list())
    print(rgb_list)
Or you could try to achieve the same goal with a one-liner:
for i in range(len(green_list)):
    rgb_list = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]
    print(rgb_list)
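A possible variant of the list comprehension, sketched here under the assumption that the three inputs are equally shaped nested sequences, iterates with zip instead of indexing with i and j. It is not benchmarked here, so whether it is measurably faster on real data is an open question:

```python
red_list = [[1, 2, 3, 4, 5], [51, 52, 53, 54, 55]]
green_list = [[6, 7, 8, 9, 10], [56, 57, 58, 59, 60]]
blue_list = [[11, 12, 13, 14, 15], [61, 62, 63, 64, 65]]

rows = []
for reds, greens, blues in zip(red_list, green_list, blue_list):
    # Each zip item is an (r, g, b) tuple, which '%' unpacks directly.
    rgb_list = ['%02x%02x%02x' % rgb for rgb in zip(reds, greens, blues)]
    rows.append(rgb_list)

print(rows[0])
```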
Hope it does the trick for you.
In the code you show, the slow bit is the string formatting. That we can improve somewhat.
A hex colour consists of eight bits for the red field, eight for the green, and eight for the blue (since your data does not seem to have an alpha channel, I am going to ignore that option). So we need at least twenty-four bits to store each RGB colour.
You can create hex values using numpy's bitwise operators. The advantage is that this is completely vectorised. You then only have one value to format into a hex string for each (i, j), instead of three:
for i in range(len(green_list)):
    hx = red_list[i] << 16 | green_list[i] << 8 | blue_list[i]
    hex_list = ['%06x' % val for val in hx]
When the numpy arrays have dimensions (10, 1_000_000), this is about 5.5x faster than your original method (on my machine).
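One thing to watch with the bit-packing approach is the array dtype. The sketch below assumes the channels are stored as uint8 (an assumption, since the question does not state the dtype) and widens them to uint32 first, because an 8-bit integer has no room for bits shifted left by 16. It then checks the packed result against the three-field format string on the sample data:

```python
import numpy as np

red = np.array([[1, 2, 3, 4, 5], [51, 52, 53, 54, 55]], dtype=np.uint8)
green = np.array([[6, 7, 8, 9, 10], [56, 57, 58, 59, 60]], dtype=np.uint8)
blue = np.array([[11, 12, 13, 14, 15], [61, 62, 63, 64, 65]], dtype=np.uint8)

# Widen before shifting so the 24-bit packed value fits.
r32 = red.astype(np.uint32)
g32 = green.astype(np.uint32)
b32 = blue.astype(np.uint32)
packed = (r32 << 16) | (g32 << 8) | b32

# One format operation per pixel instead of three.
hex_rows = [['%06x' % v for v in row] for row in packed.tolist()]

# Reference result using the original three-field format string.
reference = [['%02x%02x%02x' % rgb for rgb in zip(r, g, b)]
             for r, g, b in zip(red.tolist(), green.tolist(), blue.tolist())]
print(hex_rows == reference)
```

If the arrays already use a wider integer dtype, such as the default produced by np.array on Python ints, the astype calls are unnecessary.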
1. for-loop
Changing how rgb_list is built (list comprehension, preallocation, or append) does not affect performance much.
import timeit
n = 1000000
red_list = [list(range(1, n+0)), list(range(1, n+2))]
green_list = [list(range(2, n+1)), list(range(2, n+3))]
blue_list = [list(range(3, n+2)), list(range(3, n+4))]
def test_1():
    for i in range(len(green_list)):
        rgb_list = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]
def test_2():
    for i in range(len(green_list)):
        rgb_list = [None] * len(green_list[i])
        for j in range(len(green_list[i])):
            rgb_list[j] = '%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j])
def test_3():
    for i in range(len(green_list)):
        rgb_list = []
        for j in range(len(green_list[i])):
            rgb_list.append('%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]))
%timeit -n 1 -r 7 test_1(): 1.31 s ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit -n 1 -r 7 test_2(): 1.33 s ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit -n 1 -r 7 test_3(): 1.39 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2. disk IO
Changing how the disk IO is done (one file per row versus a single file at the end) does not affect performance much either.
import pickle
n = 20000000
def test_write_each():
    for i in range(len(green_list)):
        rgb_list = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]
        with open("test_%d" % i, "wb") as f:
            pickle.dump(rgb_list, f)
def test_write_once():
    rgb_list_list = [None] * len(green_list)
    for i in range(len(green_list)):
        rgb_list_list[i] = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]
    with open("test_all", "wb") as f:
        pickle.dump(rgb_list_list, f)
%timeit -n 1 -r 3 test_write_each(): 35.2 s ± 74.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
%timeit -n 1 -r 3 test_write_once(): 35.4 s ± 54.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Conclusion
From the benchmark results, there seems to be no avoidable bottleneck in the code from the question.
If the disk IO itself is the problem, I would suggest running the disk-IO code only once, after every other job (including the ones not mentioned in this question) has finished.
Calculate the Euclidean distance of a vector from each column of a matrix.
Is this correct?
distances=np.sqrt(np.sum(np.square(new_v-val.reshape(10,1)),axis=0))
new_v is a matrix.
val.reshape(10,1) is a column vector.
Are there any other/better ways to do it?
What you have is correct. There is a simpler method available in numpy.linalg:
from numpy.linalg import norm
norm(new_v.T-val, axis=1, ord=2)
You can make use of the efficient np.einsum -
subs = new_v - val[:,None]
out = np.sqrt(np.einsum('ij,ij->j',subs,subs))
Alternatively, using (a-b)^2 = a^2 + b^2 - 2ab formula -
out = np.sqrt(np.einsum('ij,ij->j',new_v, new_v) + val.dot(val) - 2*val.dot(new_v))
If the second axis of new_v is a large one, we can also use the numexpr module to compute the sqrt part at the end.
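As a quick sanity check, the expressions discussed so far can be compared on small random floating-point data. This is a sketch with made-up inputs (the shapes and the seed are arbitrary). The np.maximum call in the expanded formula is an added precaution, not part of the answer above: with floats, a^2 + b^2 - 2ab can come out slightly negative through cancellation, which would make sqrt return nan:

```python
import numpy as np

rng = np.random.default_rng(0)
new_v = rng.random((10, 50))   # each column is one vector
val = rng.random(10)

# Expression from the question.
ref = np.sqrt(np.sum(np.square(new_v - val.reshape(10, 1)), axis=0))

# einsum on the differences.
subs = new_v - val[:, None]
d_einsum = np.sqrt(np.einsum('ij,ij->j', subs, subs))

# Expanded (a - b)^2 form, clipped at zero before the square root.
sq = np.einsum('ij,ij->j', new_v, new_v) + val.dot(val) - 2 * val.dot(new_v)
d_dot = np.sqrt(np.maximum(sq, 0))

# numpy.linalg.norm on the transposed differences.
d_norm = np.linalg.norm(new_v.T - val, axis=1)

print(np.allclose(ref, d_einsum), np.allclose(ref, d_dot), np.allclose(ref, d_norm))
```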
Runtime test
Approaches -
import numexpr as ne
def einsum_based(new_v, val):
    subs = new_v - val[:,None]
    return np.sqrt(np.einsum('ij,ij->j',subs,subs))
def dot_based(new_v, val):
    return np.sqrt(np.einsum('ij,ij->j',new_v, new_v) + \
                   val.dot(val) - 2*val.dot(new_v))
def einsum_numexpr_based(new_v, val):
    subs = new_v - val[:,None]
    sq_dists = np.einsum('ij,ij->j',subs,subs)
    return ne.evaluate('sqrt(sq_dists)')
def dot_numexpr_based(new_v, val):
    sq_dists = np.einsum('ij,ij->j',new_v, new_v) + val.dot(val) - 2*val.dot(new_v)
    return ne.evaluate('sqrt(sq_dists)')
Timings -
In [85]: # Inputs
...: new_v = np.random.randint(0,9,(10,100000))
...: val = np.random.randint(0,9,(10))
In [86]: %timeit np.sqrt(np.sum(np.square(new_v-val.reshape(10,1)),axis=0))
...: %timeit einsum_based(new_v, val)
...: %timeit dot_based(new_v, val)
...: %timeit einsum_numexpr_based(new_v, val)
...: %timeit dot_numexpr_based(new_v, val)
...:
100 loops, best of 3: 2.91 ms per loop
100 loops, best of 3: 2.1 ms per loop
100 loops, best of 3: 2.12 ms per loop
100 loops, best of 3: 2.26 ms per loop
100 loops, best of 3: 2.43 ms per loop
In [87]: from numpy.linalg import norm
# @wim's solution
In [88]: %timeit norm(new_v.T-val, axis=1, ord=2)
100 loops, best of 3: 5.88 ms per loop
I have an array A and a reference array B. A is at least as large as B, e.g.
A = [2,100,300,793,1300,1500,1810,2400]
B = [4,305,789,1234,1890]
B is in fact the position of peaks in a signal at a specified time, and A contains position of peaks at a later time. But some of the elements in A are actually not the peaks I want (might be due to noise, etc), and I want to find the 'real' one in A based on B. The 'real' elements in A should be close to those in B, and in the example given above, the 'real' ones in A should be A'=[2,300,793,1300,1810]. It should be obvious in this example that 100,1500,2400 are not the ones we want as they are quite far off from any of the elements in B. How can I code this in the most efficient/accurate way in python/matlab?
Approach #1: With NumPy broadcasting, we can look for absolute element-wise subtractions between the input arrays and use an appropriate threshold to filter out unwanted elements from A. It seems for the given sample inputs, a threshold of 90 works.
Thus, we would have an implementation, like so -
thresh = 90
Aout = A[(np.abs(A[:,None] - B) < thresh).any(1)]
Sample run -
In [69]: A
Out[69]: array([ 2, 100, 300, 793, 1300, 1500, 1810, 2400])
In [70]: B
Out[70]: array([ 4, 305, 789, 1234, 1890])
In [71]: A[(np.abs(A[:,None] - B) < 90).any(1)]
Out[71]: array([ 2, 300, 793, 1300, 1810])
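To see what the broadcasting step produces, the one-liner can be unrolled into its intermediate arrays. This is a sketch on the sample data; the names diffs, nearest and keep are illustrative. Taking the row-wise minimum and comparing it with the threshold selects the same elements as (diffs < thresh).any(1):

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])
thresh = 90

# A[:, None] has shape (8, 1); subtracting B (shape (5,)) broadcasts to (8, 5),
# so diffs[i, j] is the distance between A[i] and B[j].
diffs = np.abs(A[:, None] - B)

# Distance from each element of A to its closest element of B.
nearest = diffs.min(axis=1)

keep = nearest < thresh
Aout = A[keep]
print(nearest)
print(Aout)
```

The intermediate array has len(A) * len(B) entries, which is why this approach becomes memory-hungry for large inputs.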
Approach #2: Based on this post, here's a memory efficient approach using np.searchsorted, which could be crucial for large arrays -
def searchsorted_filter(a, b, thresh):
    choices = np.sort(b) # if b is already sorted, skip it
    lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
    ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
    cl = np.take(choices,lidx) # Or choices[lidx]
    cr = np.take(choices,ridx) # Or choices[ridx]
    return a[np.minimum(np.abs(a - cl), np.abs(a - cr)) < thresh]
Sample run -
In [95]: searchsorted_filter(A,B, thresh = 90)
Out[95]: array([ 2, 300, 793, 1300, 1810])
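The intermediate indices make it clearer why this works. In the sketch below (sample data, with B already sorted), lidx points at the first element of B that is not smaller than each element of A, ridx at the last element that is not larger, and both are clipped to stay inside the array. The nearest neighbour in B must be one of those two candidates:

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
choices = np.array([4, 305, 789, 1234, 1890])  # B, already sorted

# Candidate on the right-hand side (clipped for values beyond the last element).
lidx = np.searchsorted(choices, A, 'left').clip(max=choices.size - 1)
# Candidate on the left-hand side (clipped for values before the first element).
ridx = (np.searchsorted(choices, A, 'right') - 1).clip(min=0)

# Distance to the closer of the two candidates.
nearest = np.minimum(np.abs(A - choices[lidx]), np.abs(A - choices[ridx]))
print(lidx, ridx, nearest)
```

Only arrays of length len(A) are created, in contrast to the len(A) x len(B) matrix of the broadcasting approach.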
Runtime test
In [104]: A = np.sort(np.random.randint(0,100000,(1000)))
In [105]: B = np.sort(np.random.randint(0,100000,(400)))
In [106]: out1 = A[(np.abs(A[:,None] - B) < 10).any(1)]
In [107]: out2 = searchsorted_filter(A,B, thresh = 10)
In [108]: np.allclose(out1, out2) # Verify results
Out[108]: True
In [109]: %timeit A[(np.abs(A[:,None] - B) < 10).any(1)]
100 loops, best of 3: 2.74 ms per loop
In [110]: %timeit searchsorted_filter(A,B, thresh = 10)
10000 loops, best of 3: 85.3 µs per loop
Jan 2018 Update with further performance boost
We can avoid the second usage of np.searchsorted(..., 'right') by making use of the indices obtained from np.searchsorted(..., 'left') and also the absolute computations, like so -
def searchsorted_filter_v2(a, b, thresh):
    N = len(b)
    choices = np.sort(b) # if b is already sorted, skip it
    l = np.searchsorted(choices, a, 'left')
    l_invalid_mask = l==N
    l[l_invalid_mask] = N-1
    left_offset = choices[l]-a
    left_offset[l_invalid_mask] *= -1
    r = (l - (left_offset!=0))
    r_invalid_mask = r<0
    r[r_invalid_mask] = 0
    r += l_invalid_mask
    right_offset = a-choices[r]
    right_offset[r_invalid_mask] *= -1
    out = a[(left_offset < thresh) | (right_offset < thresh)]
    return out
Updated timings to test the further speedup -
In [388]: np.random.seed(0)
...: A = np.random.randint(0,1000000,(100000))
...: B = np.unique(np.random.randint(0,1000000,(40000)))
...: np.random.shuffle(B)
...: thresh = 10
...:
...: out1 = searchsorted_filter(A, B, thresh)
...: out2 = searchsorted_filter_v2(A, B, thresh)
...: print np.allclose(out1, out2)
True
In [389]: %timeit searchsorted_filter(A, B, thresh)
10 loops, best of 3: 24.2 ms per loop
In [390]: %timeit searchsorted_filter_v2(A, B, thresh)
100 loops, best of 3: 13.9 ms per loop
Digging deeper -
In [396]: a = A; b = B
In [397]: N = len(b)
...:
...: choices = np.sort(b) # if b is already sorted, skip it
...:
...: l = np.searchsorted(choices, a, 'left')
In [398]: %timeit np.sort(B)
100 loops, best of 3: 2 ms per loop
In [399]: %timeit np.searchsorted(choices, a, 'left')
100 loops, best of 3: 10.3 ms per loop
Seems like searchsorted and sort are taking almost all of the runtime and they seem essential to this method. So, doesn't seem like it could be improved any further staying with this sort-based approach.
You could find the distance of each point in A from each value in B using bsxfun and then find the index of the point in A which is closest to each value in B using min.
[dists, ind] = min(abs(bsxfun(@minus, A, B.')), [], 2)
If you're on R2016b or later, bsxfun can be removed thanks to automatic broadcasting
[dists, ind] = min(abs(A - B.'), [], 2);
If you suspect that some values in B are not real peaks, then you can set a threshold value and remove any distances that were greater than this value.
threshold = 90;
ind = ind(dists < threshold);
Then we can use ind to index into A
output = A(ind);
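Since the question also mentions Python, here is a rough NumPy translation of the MATLAB steps above. It is a sketch rather than a tested port: the variable names mirror the MATLAB ones, and indices are zero-based in NumPy:

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])

# Row j holds the distances from B[j] to every element of A,
# like abs(A - B.') in MATLAB.
dist_matrix = np.abs(A - B[:, None])

# Index (into A) and distance of the closest element for each value in B.
ind = dist_matrix.argmin(axis=1)
dists = dist_matrix[np.arange(B.size), ind]

# Drop matches that are too far away, then index into A.
threshold = 90
output = A[ind[dists < threshold]]
print(output)
```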
You can use MATLAB's interp1 function, which does exactly what you want.
The 'nearest' option finds the nearest points, so there is no need to specify a threshold.
out = interp1(A, A, B, 'nearest', 'extrap');
Comparing with the other method:
A = sort(randi([0,1000000],1,10000));
B = sort(randi([0,1000000],1,4000));
disp('---interp1----------------')
tic
out = interp1(A, A, B, 'nearest', 'extrap');
toc
disp('---subtraction with threshold------')
%numpy version is the same
tic
[dists, ind] = min(abs(bsxfun(@minus, A, B.')), [], 2);
toc
Result:
---interp1----------------
Elapsed time is 0.00778699 seconds.
---subtraction with threshold------
Elapsed time is 0.445485 seconds.
interp1 can be used for inputs larger than 10000 and 4000 elements, whereas the subtraction method ran out of memory at those sizes.
I have a list that models a phenomenon that is a function of radius. I want to convert this to a 2D array. I wrote some code that does exactly what I want, but since it uses nested for loops, it is quite slow.
l = len(profile1D)/2
critDim = int((l**2 /2.)**(1/2.))
profile2D = np.empty([critDim, critDim])
for x in xrange(0, critDim):
    for y in xrange(0,critDim):
        r = ((x**2 + y**2)**(1/2.))
        profile2D[x,y] = profile1D[int(l+r)]
Is there a more efficient way to do the same thing by avoiding these loops?
Here's a vectorized approach using broadcasting -
a = np.arange(critDim)**2
r2D = np.sqrt(a[:,None] + a)
out = profile1D[(l+r2D).astype(int)]
If there are many repeated indices generated by l+r2D, we can use np.take for some further performance boost, like so -
out = np.take(profile1D,(l+r2D).astype(int))
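The code in this thread targets Python 2 (xrange, integer division with /). A minimal Python 3 sketch of the same comparison is given below, assuming a made-up profile1D of 100 samples. It uses // so that l stays an integer, and math.sqrt in the reference loop so that both versions compute the square root the same way:

```python
import math

import numpy as np

profile1D = np.arange(100.0)          # made-up radial profile
l = len(profile1D) // 2               # integer division in Python 3
critDim = int((l ** 2 / 2.0) ** 0.5)

# Reference: the nested loops from the question.
looped = np.empty((critDim, critDim))
for x in range(critDim):
    for y in range(critDim):
        looped[x, y] = profile1D[int(l + math.sqrt(x * x + y * y))]

# Broadcasting: a[:, None] + a builds the full grid of x**2 + y**2 in one step.
a = np.arange(critDim) ** 2
r2D = np.sqrt(a[:, None] + a)
vectorized = profile1D[(l + r2D).astype(int)]

print(np.array_equal(looped, vectorized))
```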
Runtime test
Function definitions -
def org_app(profile1D,l,critDim):
    profile2D = np.empty([critDim, critDim])
    for x in xrange(0, critDim):
        for y in xrange(0,critDim):
            r = ((x**2 + y**2)**(1/2.))
            profile2D[x,y] = profile1D[int(l+r)]
    return profile2D
def vect_app1(profile1D,l,critDim):
    a = np.arange(critDim)**2
    r2D = np.sqrt(a[:,None] + a)
    out = profile1D[(l+r2D).astype(int)]
    return out
def vect_app2(profile1D,l,critDim):
    a = np.arange(critDim)**2
    r2D = np.sqrt(a[:,None] + a)
    out = np.take(profile1D,(l+r2D).astype(int))
    return out
Timings and verification -
In [25]: # Setup input array and params
...: profile1D = np.random.randint(0,9,(1000))
...: l = len(profile1D)/2
...: critDim = int((l**2 /2.)**(1/2.))
...:
In [26]: np.allclose(org_app(profile1D,l,critDim),vect_app1(profile1D,l,critDim))
Out[26]: True
In [27]: np.allclose(org_app(profile1D,l,critDim),vect_app2(profile1D,l,critDim))
Out[27]: True
In [28]: %timeit org_app(profile1D,l,critDim)
10 loops, best of 3: 154 ms per loop
In [29]: %timeit vect_app1(profile1D,l,critDim)
1000 loops, best of 3: 1.69 ms per loop
In [30]: %timeit vect_app2(profile1D,l,critDim)
1000 loops, best of 3: 1.68 ms per loop
In [31]: # Setup input array and params
...: profile1D = np.random.randint(0,9,(5000))
...: l = len(profile1D)/2
...: critDim = int((l**2 /2.)**(1/2.))
...:
In [32]: %timeit org_app(profile1D,l,critDim)
1 loops, best of 3: 3.76 s per loop
In [33]: %timeit vect_app1(profile1D,l,critDim)
10 loops, best of 3: 59.8 ms per loop
In [34]: %timeit vect_app2(profile1D,l,critDim)
10 loops, best of 3: 59.5 ms per loop
I'm using Python 2.7.
I have two arrays, A and B.
To find the indices of the elements in A that are present in B, I can do
A_inds = np.in1d(A,B)
I also want to get the indices of the elements in B that are present in A, i.e. the indices in B of the same overlapping elements I found using the above code.
Currently I am running the same line again as follows:
B_inds = np.in1d(B,A)
but this extra calculation seems like it should be unnecessary. Is there a more computationally efficient way of obtaining both A_inds and B_inds?
I am open to using either list or array methods.
np.unique and np.searchsorted could be used together to solve it -
def unq_searchsorted(A,B):
    # Get unique elements of A and B and the indices based on the uniqueness
    unqA,idx1 = np.unique(A,return_inverse=True)
    unqB,idx2 = np.unique(B,return_inverse=True)
    # Create mask equivalent to np.in1d(A,B) and np.in1d(B,A) for unique elements
    mask1 = (np.searchsorted(unqB,unqA,'right') - np.searchsorted(unqB,unqA,'left'))==1
    mask2 = (np.searchsorted(unqA,unqB,'right') - np.searchsorted(unqA,unqB,'left'))==1
    # Map back to all non-unique indices to get equivalent of np.in1d(A,B),
    # np.in1d(B,A) results for non-unique elements
    return mask1[idx1],mask2[idx2]
Runtime tests and verify results -
In [233]: def org_app(A,B):
...: return np.in1d(A,B), np.in1d(B,A)
...:
In [234]: A = np.random.randint(0,10000,(10000))
...: B = np.random.randint(0,10000,(10000))
...:
In [235]: np.allclose(org_app(A,B)[0],unq_searchsorted(A,B)[0])
Out[235]: True
In [236]: np.allclose(org_app(A,B)[1],unq_searchsorted(A,B)[1])
Out[236]: True
In [237]: %timeit org_app(A,B)
100 loops, best of 3: 7.69 ms per loop
In [238]: %timeit unq_searchsorted(A,B)
100 loops, best of 3: 5.56 ms per loop
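A small worked example may help to follow the mapping back from unique values to the original positions. The sketch below repeats the function so that it runs on its own, uses made-up inputs with duplicates, and compares against np.isin, which is the currently recommended spelling of np.in1d in recent NumPy releases:

```python
import numpy as np


def unq_searchsorted(A, B):
    # Same logic as above, repeated so the block is self-contained.
    unqA, idx1 = np.unique(A, return_inverse=True)
    unqB, idx2 = np.unique(B, return_inverse=True)
    # A value is present in the other sorted array exactly when the 'left'
    # and 'right' insertion points differ by one.
    mask1 = (np.searchsorted(unqB, unqA, 'right') - np.searchsorted(unqB, unqA, 'left')) == 1
    mask2 = (np.searchsorted(unqA, unqB, 'right') - np.searchsorted(unqA, unqB, 'left')) == 1
    # idx1 / idx2 map every original element to its unique value.
    return mask1[idx1], mask2[idx2]


A = np.array([3, 1, 4, 1, 5])
B = np.array([5, 9, 2, 6, 5, 3])
A_inds, B_inds = unq_searchsorted(A, B)
print(A_inds)
print(B_inds)
```

Only 3 and 5 occur in both arrays, so the masks mark those positions in A and B respectively, including the repeated 5 in B.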
If the two input arrays are already sorted and unique, the performance boost would be substantial. Thus, the solution function would simplify to -
def unq_searchsorted_v1(A,B):
    out1 = (np.searchsorted(B,A,'right') - np.searchsorted(B,A,'left'))==1
    out2 = (np.searchsorted(A,B,'right') - np.searchsorted(A,B,'left'))==1
    return out1,out2
Subsequent runtime tests -
In [275]: A = np.random.randint(0,100000,(20000))
...: B = np.random.randint(0,100000,(20000))
...: A = np.unique(A)
...: B = np.unique(B)
...:
In [276]: np.allclose(org_app(A,B)[0],unq_searchsorted_v1(A,B)[0])
Out[276]: True
In [277]: np.allclose(org_app(A,B)[1],unq_searchsorted_v1(A,B)[1])
Out[277]: True
In [278]: %timeit org_app(A,B)
100 loops, best of 3: 8.83 ms per loop
In [279]: %timeit unq_searchsorted_v1(A,B)
100 loops, best of 3: 4.94 ms per loop
A simple multiprocessing implementation will get you a little more speed:
import time
import numpy as np
from multiprocessing import Process, Queue
a = np.random.randint(0, 20, 1000000)
b = np.random.randint(0, 20, 1000000)
def original(a, b, q):
    q.put( np.in1d(a, b) )
if __name__ == '__main__':
    t0 = time.time()
    q = Queue()
    q2 = Queue()
    p = Process(target=original, args=(a, b, q,))
    p2 = Process(target=original, args=(b, a, q2))
    p.start()
    p2.start()
    res = q.get()
    res2 = q2.get()
    print time.time() - t0
>>> 0.21398806572
Divakar's unq_searchsorted(A,B) method took 0.271834135056 seconds on my machine.