Indices that intersect and sort two numpy arrays - python

I have two numpy arrays of integers, both several hundred million elements long. Within each array the values are unique, and each array is initially unsorted.
I would like the indices into each array that produce their sorted intersection. For example:
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
The sorted intersection of these is [4, 5, 11], so we want the indices that turn each of x and y into that array; the function should return:
mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])
since then x[mx] == y[my] == np.intersect1d(x, y)
The only solution we have so far involves three different argsorts, so it seems unlikely to be optimal.
Each value represents a galaxy, in case that makes the problem more fun.

Here's an option based on intersect1d's implementation, which is fairly straightforward. It requires one call to argsort.
The admittedly simplistic test passes.
import numpy as np

def my_intersect(x, y):
    """my_intersect(x, y) -> xm, ym
    x, y: 1-d arrays of unique values
    xm, ym: indices into x and y giving sorted intersection
    """
    # basic idea taken from numpy.lib.arraysetops.intersect1d
    aux = np.concatenate((x, y))
    sidx = aux.argsort()
    # Note: intersect1d uses aux[:-1][aux[1:] == aux[:-1]] here -
    # I don't know why the first [:-1] is necessary
    inidx = aux[sidx[1:]] == aux[sidx[:-1]]
    # quicksort is not stable, so must do some work to extract indices
    # (if stable, sidx[inidx.nonzero()] would be the x indices)
    # interlace the two sets of indices, and check against lengths
    xym = np.vstack((sidx[inidx.nonzero()],
                     sidx[1:][inidx.nonzero()])).T.flatten()
    xm = xym[xym < len(x)]
    ym = xym[xym >= len(x)] - len(x)
    return xm, ym
def check_my_intersect(x, y):
    mx, my = my_intersect(x, y)
    assert (x[mx] == np.intersect1d(x, y)).all()
    # not really necessary: np.intersect1d returns a sorted array
    assert (x[mx] == sorted(x[mx])).all()
    assert (x[mx] == y[my]).all()
def random_unique_unsorted(n):
    while True:
        x = np.unique(np.random.randint(2*n, size=n))
        if len(x):
            break
    np.random.shuffle(x)
    return x

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
check_my_intersect(x, y)

for i in range(20):
    x = random_unique_unsorted(100+i)
    y = random_unique_unsorted(200+i)
    check_my_intersect(x, y)
Edit: the "Note" comment was confusing (I had used "..." as a speech ellipsis, forgetting that it is also a Python operator).
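Incidentally, a stable sort would let the interlacing step be skipped, as the comment above hints. A minimal sketch of that variant (my own untested addition, assuming unique values within each array, as in the question):
def my_intersect_stable(x, y):
    aux = np.concatenate((x, y))
    sidx = aux.argsort(kind='stable')
    inidx = aux[sidx[1:]] == aux[sidx[:-1]]
    # with a stable sort, the x index of each matching pair comes first
    xm = sidx[:-1][inidx]
    ym = sidx[1:][inidx] - len(x)
    return xm, ym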

You could also use np.searchsorted, like so -
def searchsorted_based(x, y):
    # Get argsort for both x and y
    xsort_idx = x.argsort()
    ysort_idx = y.argsort()

    # Sort x and y and store them
    X = x[xsort_idx]
    Y = y[ysort_idx]

    # Find positions of Y in X; the matches are the positions that
    # shift between 'left'- and 'right'-based searches.
    # Use the match positions to get the corresponding argsort for X.
    x1 = np.searchsorted(X, Y, 'left')
    x2 = np.searchsorted(X, Y, 'right')
    out1 = xsort_idx[x1[x2 != x1]]

    # Repeat for X in Y
    y1 = np.searchsorted(Y, X, 'left')
    y2 = np.searchsorted(Y, X, 'right')
    out2 = ysort_idx[y1[y2 != y1]]
    return out1, out2
Sample run -
In [100]: x = np.array([4, 1, 10, 5, 8, 13, 11])
...: y = np.array([20, 5, 4, 9, 11, 7, 25])
...:
In [101]: searchsorted_based(x,y)
Out[101]: (array([0, 3, 6]), array([2, 1, 4]))
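A quick sanity check against np.intersect1d (assuming, as in the question, that the values within each array are unique):
mx, my = searchsorted_based(x, y)
assert np.array_equal(x[mx], np.intersect1d(x, y))
assert np.array_equal(x[mx], y[my])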

For a pure numpy solution you could do something like this:
Use np.unique to get the unique values and corresponding indices in x and y separately:
# sorted unique values in x and y and the indices corresponding to their first
# occurrences, such that u_x == x[u_idx_x]
u_x, u_idx_x = np.unique(x, return_index=True)
u_y, u_idx_y = np.unique(y, return_index=True)
Find the intersection of the unique values using np.intersect1d:
# we can assume_unique, which can be faster for large arrays
i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
Finally, use np.in1d to select only the indices that correspond to unique values in x or y that also happen to be in the intersection of x and y:
# it is also safe to assume_unique here
i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
To pull all that together into a single function:
def intersect_indices(x, y):
    u_x, u_idx_x = np.unique(x, return_index=True)
    u_y, u_idx_y = np.unique(y, return_index=True)
    i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
    i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
    i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
    return i_idx_x, i_idx_y
For example:
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
i_idx_x, i_idx_y = intersect_indices(x, y)
print(i_idx_x, i_idx_y)
# (array([0, 3, 6]), array([2, 1, 4]))
Speed test:
In [1]: k = 1000000
In [2]: %%timeit x, y = np.random.randint(k, size=(2, k))
intersect_indices(x, y)
....:
1 loops, best of 3: 597 ms per loop
Update:
I initially missed the fact that in your case both x and y contain only unique values. Taking that into account, it's possible to do slightly better by using an indirect sort:
def intersect_indices_unique(x, y):
    u_idx_x = np.argsort(x)
    u_idx_y = np.argsort(y)
    i_xy = np.intersect1d(x, y, assume_unique=True)
    i_idx_x = u_idx_x[x[u_idx_x].searchsorted(i_xy)]
    i_idx_y = u_idx_y[y[u_idx_y].searchsorted(i_xy)]
    return i_idx_x, i_idx_y
Here's a more realistic test case, where x and y both contain unique (but partially overlapping) values:
In [1]: n, k = 10000000, 1000000
In [2]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices(x, y)
....:
1 loops, best of 3: 593 ms per loop
In [3]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices_unique(x, y)
....:
1 loops, best of 3: 453 ms per loop
@Divakar's solution is very similar in terms of performance:
In [4]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
searchsorted_based(x, y)
....:
1 loops, best of 3: 472 ms per loop

Maybe a pure Python solution using a dict works for you:
def indices_from_values(a, intersect):
    idx = {value: index for index, value in enumerate(a)}
    return np.array([idx[x] for x in intersect])
intersect = np.intersect1d(x, y)
mx = indices_from_values(x, intersect)
my = indices_from_values(y, intersect)
np.allclose(x[mx], y[my]) and np.allclose(x[mx], np.intersect1d(x, y))

Can we vectorize this for-loop with an update variable in each iteration?

Can we vectorize this for-loop, where the result is updated in each iteration and the updated value is used in the next one, like below:
x = np.array([[1,2],[2,3],[3,4]])
y = np.array([1,-1,1])
z = np.array([-1,1,-1])
w = 2
out = np.array([10,20])
for xi, yi, zi in zip(x, y, z):
    out = out + (w*(yi-zi))*xi
Output:
>>> out
array([ 18, 32])
You could simply use:
(w*(y-z)*x.T).sum(1)
output: array([ 8, 12])
Adding to the existing array:
out = np.array([10,20])
out += (w*(y-z)*x.T).sum(1)
output: array([18, 32])
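For a quick check that the vectorized expression matches the original loop (broadcasting scales each row xi by w*(yi - zi), and the sum over rows reproduces the accumulation):
import numpy as np

x = np.array([[1, 2], [2, 3], [3, 4]])
y = np.array([1, -1, 1])
z = np.array([-1, 1, -1])
w = 2

# loop version from the question
expected = np.array([10, 20])
for xi, yi, zi in zip(x, y, z):
    expected = expected + (w * (yi - zi)) * xi

vectorized = np.array([10, 20]) + (w * (y - z) * x.T).sum(1)
assert np.array_equal(vectorized, expected)  # both give [18, 32]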

Pythonic way of finding indexes of unique elements in two arrays

I have two sorted numpy arrays similar to these:
x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
Elements never repeat in the same array. I want a Pythonic way of finding a list of index pairs giving the locations in both arrays at which the same element exists.
For instance, 1 exists in x and y at index 0. Element 2 in x doesn't exist in y, so I don't care about that item. However, 8 does exist in both arrays, at index 2 in x but index 1 in y. Similarly, 15 exists in both, at index 4 in x but index 2 in y. So in this case my function should return [[0, 0], [2, 1], [4, 2]].
So far what I'm doing is:
def get_indexes(x, y):
    indexes = []
    for i in range(len(x)):
        # Find index where item x[i] is in y:
        j = np.where(x[i] == y)[0]
        # If it exists, save it:
        if len(j) != 0:
            indexes.append([i, j[0]])
    return indexes
But the problem is that arrays x and y are very large (millions of items), so it takes quite a while. Is there a better pythonic way of doing this?
Without Python loops
Code
def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect1d to find common elements between the two arrays
    overlap = np.intersect1d(x, y)
    # Indexes of common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    # Zip the two 1d arrays into a 2d array of index pairs
    return np.dstack((loc1, loc2))[0]
Usage
x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
result = get_indexes_darrylg(x, y)
# result: array([[0, 0],
#                [2, 1],
#                [4, 2]], dtype=int64)
Timing Posted Solutions
Results show that the darrylg code has the fastest run time.
Code Adjustment
Each posted solution is wrapped as a function.
Slight modification so that each solution outputs a numpy array.
Each curve is named after its poster.
Code
import numpy as np
import perfplot

def create_arr(n):
    ' Creates pair of 1d numpy arrays with half the elements equal '
    max_val = 100000  # One more than largest value in output arrays
    arr1 = np.random.randint(0, max_val, (n,))
    arr2 = arr1.copy()
    # Change half the elements in arr2
    all_indexes = np.arange(0, n, dtype=int)
    indexes = np.random.choice(all_indexes, size=n//2, replace=False)  # locations to make changes
    np.put(arr2, indexes, np.random.randint(0, max_val, (n//2,)))      # assign new random values at change locations
    arr1 = np.sort(arr1)
    arr2 = np.sort(arr2)
    return (arr1, arr2)

def get_indexes_lllrnr101(x, y):
    ' lllrnr101 answer '
    ans = []
    i = 0
    j = 0
    while (i < len(x) and j < len(y)):
        if x[i] == y[j]:
            ans.append([i, j])
            i += 1
            j += 1
        elif (x[i] < y[j]):
            i += 1
        else:
            j += 1
    return np.array(ans)

def get_indexes_joostblack(x, y):
    'joostblack'
    indexes = []
    for idx, val in enumerate(x):
        idy = np.searchsorted(y, val)
        try:
            if y[idy] == val:
                indexes.append([idx, idy])
        except IndexError:
            continue  # ignore index errors
    return np.array(indexes)

def get_indexes_mustafa(x, y):
    indices_in_x = np.flatnonzero(np.isin(x, y))                # array([0, 2, 4])
    indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))  # array([0, 1, 2])
    return np.array(list(zip(indices_in_x, indices_in_y)))

def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect1d to find common elements between the two arrays
    overlap = np.intersect1d(x, y)
    # Indexes of common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    # Zip the two 1d arrays into a 2d array of index pairs
    return np.dstack((loc1, loc2))[0]

def get_indexes_akopcz(x, y):
    ' akopcz answer '
    return np.array([
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ])

perfplot.show(
    setup=create_arr,  # tuple of two 1D random arrays
    kernels=[
        lambda a: get_indexes_lllrnr101(*a),
        lambda a: get_indexes_joostblack(*a),
        lambda a: get_indexes_mustafa(*a),
        lambda a: get_indexes_darrylg(*a),
        lambda a: get_indexes_akopcz(*a),
    ],
    labels=["lllrnr101", "joostblack", "mustafa", "darrylg", "akopcz"],
    n_range=[2 ** k for k in range(5, 21)],
    xlabel="Array Length",
    # More optional arguments with their default values:
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    equality_check=None,  # np.allclose; set to None to disable "correctness" assertion
    # show_progress=True,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)
What you are doing is O(n log n), which is decent enough.
If you want, you can do it in O(n) by iterating over both arrays with two pointers: since they are sorted, you advance the pointer for the array whose current element is smaller.
See below:
x = [1, 2, 8, 11, 15]
y = [1, 8, 15, 17, 20, 21]

def get_indexes(x, y):
    ans = []
    i = 0
    j = 0
    while (i < len(x) and j < len(y)):
        if x[i] == y[j]:
            ans.append([i, j])
            i += 1
            j += 1
        elif (x[i] < y[j]):
            i += 1
        else:
            j += 1
    return ans

print(get_indexes(x, y))
which gives me:
[[0, 0], [2, 1], [4, 2]]
Note that this function searches for all occurrences of x[i] in the y array; since duplicates are not allowed in y, it will find x[i] at most once.
def get_indexes(x, y):
    return [
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ]
You can use numpy.searchsorted:
def get_indexes(x, y):
    indexes = []
    for idx, val in enumerate(x):
        idy = np.searchsorted(y, val)
        # guard against val being larger than every element of y
        if idy < len(y) and y[idy] == val:
            indexes.append([idx, idy])
    return indexes
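For reference, with the example arrays from the question this should give:
x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
print(get_indexes(x, y))  # [[0, 0], [2, 1], [4, 2]]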
One solution is to first look from x's side to find which of its values are included in y, getting their indices through np.isin and np.flatnonzero, and then repeat the same procedure from the other side; but instead of passing x entirely, we pass only the (already found) intersecting elements, to save time:
indices_in_x = np.flatnonzero(np.isin(x, y)) # array([0, 2, 4])
indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x])) # array([0, 1, 2])
Now you can zip them to get the result:
result = list(zip(indices_in_x, indices_in_y)) # [(0, 0), (2, 1), (4, 2)]
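Note that pairing the two index arrays positionally like this relies on x and y both being sorted and free of duplicates, as in the question; a quick check:
assert np.array_equal(x[indices_in_x], y[indices_in_y])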

Vectorized groupby with NumPy

Pandas has a widely-used groupby facility to split up a DataFrame based on a corresponding mapping, from which you can apply a calculation on each subgroup and recombine the results.
Can this be done flexibly in NumPy without a native Python for-loop? With a Python loop, this would look like:
>>> import numpy as np
>>> X = np.arange(10).reshape(5, 2)
>>> groups = np.array([0, 0, 0, 1, 1])
# Split up elements (rows) of `X` based on their element wise group
>>> np.array([X[groups==i].sum() for i in np.unique(groups)])
array([15, 30])
Above, 15 is the sum of the first three rows of X, and 30 is the sum of the remaining two.
By "flexibly", I just mean that we aren't focusing on one particular computation such as sum, count, or maximum, but rather passing any computation to the grouped arrays.
If not, is there a faster approach than the above?
How about using a scipy sparse matrix?
import numpy as np
from scipy import sparse
import time
x_len = 500000
g_len = 100
X = np.arange(x_len * 2).reshape(x_len, 2)
groups = np.random.randint(0, g_len, x_len)
# original
s = time.time()
a = np.array([X[groups==i].sum() for i in np.unique(groups)])
print(time.time() - s)
# using scipy sparse matrix
s = time.time()
x_sum = X.sum(axis=1)
b = np.array(sparse.coo_matrix(
    (
        x_sum,
        (groups, np.arange(len(x_sum)))
    ),
    shape=(g_len, x_len)
).sum(axis=1)).ravel()
print(time.time() - s)
#compare
print(np.abs((a-b)).sum())
result on my PC
0.15915322303771973
0.012875080108642578
0
More than 10 times faster.
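The trick is that each value is placed in the row given by its group label (and its own column), so a row-wise sum of the sparse matrix is exactly the per-group sum. A tiny self-contained illustration (values chosen only for this example):
import numpy as np
from scipy import sparse

vals = np.array([1., 2., 3., 4.])
rows = np.array([0, 0, 1, 1])   # group labels used as row indices
cols = np.arange(4)             # one column per element
m = sparse.coo_matrix((vals, (rows, cols)), shape=(2, 4))
print(np.asarray(m.sum(axis=1)).ravel())  # [3. 7.]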
Update!
Let's benchmark the answers of @Paul Panzer and @Daniel F. It is a summation-only benchmark.
import numpy as np
from scipy import sparse
import time
# by #Daniel F
def groupby_np(X, groups, axis=0, uf=np.add, out=None, minlength=0, identity=None):
    if minlength < groups.max() + 1:
        minlength = groups.max() + 1
    if identity is None:
        identity = uf.identity
    i = list(range(X.ndim))
    del i[axis]
    i = tuple(i)
    n = out is None
    if n:
        if identity is None:  # fallback to loops over 0-index for identity
            assert np.all(np.in1d(np.arange(minlength), groups)), "No valid identity for unassigned groups"
            s = [slice(None)] * X.ndim
            for i_ in i:
                s[i_] = 0
            out = np.array([uf.reduce(X[tuple(s)][groups == i]) for i in range(minlength)])
        else:
            out = np.full((minlength,), identity, dtype=X.dtype)
    uf.at(out, groups, uf.reduce(X, i))
    if n:
        return out
x_len = 500000
g_len = 200
X = np.arange(x_len * 2).reshape(x_len, 2)
groups = np.random.randint(0, g_len, x_len)
print("original")
s = time.time()
a = np.array([X[groups==i].sum() for i in np.unique(groups)])
print(time.time() - s)
print("use scipy coo matrix")
s = time.time()
x_sum = X.sum(axis=1)
b = np.array(sparse.coo_matrix(
    (
        x_sum,
        (groups, np.arange(len(x_sum)))
    ),
    shape=(g_len, x_len)
).sum(axis=1)).ravel()
print(time.time() - s)
#compare
print(np.abs((a-b)).sum())
print("use scipy csr matrix #Daniel F")
s = time.time()
x_sum = X.sum(axis=1)
c = np.array(sparse.csr_matrix(
    (
        x_sum,
        groups,
        np.arange(len(groups)+1)
    ),
    shape=(len(groups), g_len)
).sum(axis=0)).ravel()
print(time.time() - s)
#compare
print(np.abs((a-c)).sum())
print("use bincount #Paul Panzer #Daniel F")
s = time.time()
d = np.bincount(groups, X.sum(axis=1), g_len)
print(time.time() - s)
#compare
print(np.abs((a-d)).sum())
print("use ufunc #Daniel F")
s = time.time()
e = groupby_np(X, groups)
print(time.time() - s)
#compare
print(np.abs((a-e)).sum())
STDOUT
original
0.2882847785949707
use scipy coo matrix
0.012301445007324219
0
use scipy csr matrix #Daniel F
0.01046299934387207
0
use bincount #Paul Panzer #Daniel F
0.007468223571777344
0.0
use ufunc #Daniel F
0.04431319236755371
0
The winner is the bincount solution. But the csr matrix solution is also very interesting.
@klim's sparse matrix solution would at first sight appear to be tied to summation. We can, however, use it in the general case by converting between the csr and csc formats:
Let's look at a small example:
>>> m, n = 3, 8
>>> idx = np.random.randint(0, m, (n,))
>>> data = np.arange(n)
>>>
>>> M = sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m))
>>>
>>> idx
array([0, 2, 2, 1, 1, 2, 2, 0])
>>>
>>> M = M.tocsc()
>>>
>>> M.indptr, M.indices
(array([0, 2, 4, 8], dtype=int32), array([0, 7, 3, 4, 1, 2, 5, 6], dtype=int32))
As we can see after conversion the internal representation of the sparse matrix yields the indices grouped and sorted:
>>> groups = np.split(M.indices, M.indptr[1:-1])
>>> groups
[array([0, 7], dtype=int32), array([3, 4], dtype=int32), array([1, 2, 5, 6], dtype=int32)]
>>>
We could have obtained the same using a stable argsort:
>>> np.argsort(idx, kind='mergesort')
array([0, 7, 3, 4, 1, 2, 5, 6])
>>>
But sparse matrices are actually faster, even when we allow argsort to use a faster non-stable algorithm:
>>> m, n = 1000, 100000
>>> idx = np.random.randint(0, m, (n,))
>>> data = np.arange(n)
>>>
>>> timeit('sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m)).tocsc()', **kwds)
2.250748165184632
>>> timeit('np.argsort(idx)', **kwds)
5.783584725111723
If we require argsort to keep groups sorted, the difference is even larger:
>>> timeit('np.argsort(idx, kind="mergesort")', **kwds)
10.507467685034499
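A sketch of how those grouped indices could then drive an arbitrary per-group reduction (np.median here, which has no ufunc equivalent), reusing the small example from above; the explicit loop runs over the m groups only, not over all n elements:
import numpy as np
from scipy import sparse

m, n = 3, 8
idx = np.array([0, 2, 2, 1, 1, 2, 2, 0])   # group labels, as in the small example
data = np.arange(n)
M = sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m)).tocsc()
groups_idx = np.split(M.indices, M.indptr[1:-1])
per_group_median = np.array([np.median(data[g]) for g in groups_idx])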
If you want a more flexible implementation of groupby that can group using any of numpy's ufuncs:
def groupby_np(X, groups, axis=0, uf=np.add, out=None, minlength=0, identity=None):
    if minlength < groups.max() + 1:
        minlength = groups.max() + 1
    if identity is None:
        identity = uf.identity
    i = list(range(X.ndim))
    del i[axis]
    i = tuple(i)
    n = out is None
    if n:
        if identity is None:  # fallback to loops over 0-index for identity
            assert np.all(np.in1d(np.arange(minlength), groups)), "No valid identity for unassigned groups"
            s = [slice(None)] * X.ndim
            for i_ in i:
                s[i_] = 0
            out = np.array([uf.reduce(X[tuple(s)][groups == i]) for i in range(minlength)])
        else:
            out = np.full((minlength,), identity, dtype=X.dtype)
    uf.at(out, groups, uf.reduce(X, i))
    if n:
        return out
groupby_np(X, groups)
array([15, 30])
groupby_np(X, groups, uf = np.multiply)
array([ 0, 3024])
groupby_np(X, groups, uf = np.maximum)
array([5, 9])
groupby_np(X, groups, uf = np.minimum)
array([0, 6])
There's probably a faster way than this (both of the operands are making copies right now), but:
np.bincount(np.broadcast_to(groups, X.T.shape).ravel(), X.T.ravel())
array([ 15., 30.])
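Here np.bincount's weights argument does the per-group accumulation; a minimal check against the loop from the question:
import numpy as np

X = np.arange(10).reshape(5, 2)
groups = np.array([0, 0, 0, 1, 1])

via_bincount = np.bincount(np.broadcast_to(groups, X.T.shape).ravel(), X.T.ravel())
via_loop = np.array([X[groups == i].sum() for i in np.unique(groups)])
assert np.allclose(via_bincount, via_loop)  # both give [15, 30]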
If you want to extend the answer to an ndarray and still have a fast computation, you could extend Daniel's solution:
x_len = 500000
g_len = 200
y_len = 2
X = np.arange(x_len * y_len).reshape(x_len, y_len)
groups = np.random.randint(0, g_len, x_len)
# original
a = np.array([X[groups==i].sum(axis=0) for i in np.unique(groups)])
# alternative
bins = [0] + list(np.bincount(groups, minlength=g_len).cumsum())
Z = np.argsort(groups)
d = np.array([X.take(Z[bins[i]:bins[i+1]],0).sum(axis=0) for i in range(g_len)])
In this example it took about 30 ms (15 ms for creating the bins + 15 ms for summing) instead of 280 ms for the original approach.
d.shape
>>> (200, 2)

Perform summation at indices saved as x and y arrays

I have a basic question regarding indexing. I have two arrays of length 9 million that encode vectorized image coordinates, which were extracted by a previous function. Now I want to decrement a heatmap at these coordinates. I could use a for loop and zip the coordinates. However, I would prefer a faster solution like
T = [L[i] +=1 for i in zip(X,Y)]
or something. Is this possible?
coord = [x_coords,y_coords]
Heatmap[coord[0],coord[1]] -= 1
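One caveat (my addition): if the same (x, y) pair can occur more than once among your 9 million coordinates, which is what the counting-based answers below assume, plain fancy-index assignment subtracts only once per position; np.subtract.at accumulates the repeats. A sketch, using the example heatmap and coordinates from the answers below:
import numpy as np

heat_map = np.arange(32).reshape(8, 4)
x_coords = np.array([2, 3, 0, 5, 6, 2, 3, 4, 3])
y_coords = np.array([3, 1, 2, 0, 3, 3, 1, 1, 1])

np.subtract.at(heat_map, (x_coords, y_coords), 1)  # subtract 1 per occurrence of each pair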
This is one solution using collections. I have also added a performance comparison versus @Piinthesky's pandas solution.
import numpy as np
import pandas as pd
from collections import Counter, OrderedDict

# your pre-existing heatmap as a numpy array
heat_map = np.arange(32).reshape(8, 4)

# your x and y pairs as lists
x = [2, 3, 0, 5, 6, 2, 3, 4, 3]
y = [3, 1, 2, 0, 3, 3, 1, 1, 1]

def jp_data_analysis(heat_map, x, y):
    # count occurrences of x, y pairs
    c = OrderedDict(Counter(zip(x, y)))
    # create numpy array with count as value at position x, y
    pic_occur = np.zeros([heat_map.shape[0], heat_map.shape[1]], dtype=int)
    x_c, y_c = list(zip(*c))
    pic_occur[x_c, y_c] = list(c.values())
    # subtract this from heatmap
    heat_map -= pic_occur
    return heat_map

def piinthesky(heat_map, x, y):
    # count occurrences of x, y pairs
    df = pd.DataFrame({"x": x, "y": y}).groupby(["x", "y"]).size().reset_index(name='count')
    # create numpy array with count as value at position x, y
    pic_occur = np.zeros([heat_map.shape[0], heat_map.shape[1]], dtype=int)
    pic_occur[df["x"], df["y"]] = df["count"]
    # and subtract this from heatmap
    heat_map -= pic_occur
    return heat_map
%timeit jp_data_analysis(heat_map, x, y)
# 10000 loops, best of 3: 43.8 µs per loop
%timeit piinthesky(heat_map, x, y)
# 100 loops, best of 3: 4.45 ms per loop
This is a solution using numpy/pandas. The x, y convention follows the usual convention for images, but you had better check this against your dataset.
import numpy as np
import pandas as pd
#your pre-existing heatmap as a numpy array
heat_map = np.arange(32).reshape(8, 4)
#your x and y pairs as lists
x = [2, 3, 0, 5, 6, 2, 3, 4, 3]
y = [3, 1, 2, 0, 3, 3, 1, 1, 1]
#count occurrences of x, y pairs
df = pd.DataFrame({"x": x, "y": y}).groupby(["x", "y"]).size().reset_index(name='count')
#create numpy array with count as value at position x, y
pic_occur = np.zeros([heat_map.shape[0], heat_map.shape[1]], dtype = int)
pic_occur[df["x"], df["y"]] = df["count"]
#and subtract this from heatmap
heat_map -= pic_occur

Numpy: get the lowest N elements of an array X, considering only elements whose index is not an element in another array Y

To get the lowest 10 values of an array X I do something like:
lowest10 = np.argsort(X)[:10]
What is the most efficient way, avoiding loops, to filter the results so that I get the lowest 10 values whose index is not an element of another array Y?
So for example if the array Y is:
[2,20,51]
X[2], X[20] and X[51] shouldn't be taken into consideration to compute the lowest 10.
After some benchmarking here is my humble recommendation:
Swapping out appears to be more or less always faster than masking (even if 99% of X is forbidden). So use something along the lines of
swap = X[Y]
X[Y] = np.inf
Sorting is expensive, therefore use argpartition and only sort what's necessary, like
lowest10 = np.argpartition(X, 10)[:10]
lowest10 = lowest10[np.argsort(X[lowest10])]
(afterwards, restore the original values with X[Y] = swap).
Here are some benchmarks:
import numpy as np
from timeit import timeit
def swap_out():
    global sol
    swap = X[Y]
    X[Y] = np.inf
    sol = np.argpartition(X, K)[:K]
    sol = sol[np.argsort(X[sol])]
    X[Y] = swap

def app1():
    sidx = X.argsort()
    return sidx[~np.in1d(sidx, Y)][:K]

def app2():
    sidx = np.argpartition(X, range(K+Y.size))
    return sidx[~np.in1d(sidx, Y)][:K]

def app3():
    sidx = np.argpartition(X, K+Y.size)
    return sidx[~np.in1d(sidx, Y)][:K]
K = 10 # number of small elements wanted
N = 10000 # size of X
M = 10 # size of Y
S = 10 # number of repeats in benchmark
X = np.random.random((N,))
Y = np.random.choice(N, (M,))
so = timeit(swap_out, number=S)
print(sol)
print(X[sol])
d1 = timeit(app1, number=S)
print(sol)
print(X[sol])
d2 = timeit(app2, number=S)
print(sol)
print(X[sol])
d3 = timeit(app3, number=S)
print(sol)
print(X[sol])
print('pp', f'{so:8.5f}', ' d1(um)', f'{d1:8.5f}', ' d2', f'{d2:8.5f}', ' d3', f'{d3:8.5f}')
# pp 0.00053 d1(um) 0.00731 d2 0.00313 d3 0.00149
Here's one approach -
sidx = X.argsort()
idx_out = sidx[~np.in1d(sidx, Y)][:10]
Sample run -
# Setup inputs
In [141]: X = np.random.choice(range(60), 60)
In [142]: Y = np.array([2,20,51])
# For testing, let's set the Y positions as 0s and
# we want to see them skipped in o/p
In [143]: X[Y] = 0
# Use proposed approach
In [144]: sidx = X.argsort()
In [145]: X[sidx[~np.in1d(sidx, Y)][:10]]
Out[145]: array([ 0, 2, 4, 5, 5, 9, 9, 10, 12, 14])
# Print the first 13 numbers and skip three 0s and
# that should match up with the output from proposed approach
In [146]: np.sort(X)[:13]
Out[146]: array([ 0, 0, 0, 0, 2, 4, 5, 5, 9, 9, 10, 12, 14])
Alternatively, for performance, we might want to use np.argpartition, like so -
sidx = np.argpartition(X,range(10+Y.size))
idx_out = X[sidx[~np.in1d(sidx, Y)][:10]]
This would be beneficial if the length of X is a much larger number than 10.
If you don't care about the order of elements in that list of 10 indices, for a further boost, we can simply pass the scalar length instead of the range array to np.argpartition: np.argpartition(X, 10+Y.size).
We can optimize np.in1d with searchsorted to have one more approach (listing next).
Listing below all the discussed approaches in this post -
def app1(X, Y, n=10):
    sidx = X.argsort()
    return sidx[~np.in1d(sidx, Y)][:n]

def app2(X, Y, n=10):
    sidx = np.argpartition(X, range(n+Y.size))
    return sidx[~np.in1d(sidx, Y)][:n]

def app3(X, Y, n=10):
    sidx = np.argpartition(X, n+Y.size)
    return sidx[~np.in1d(sidx, Y)][:n]

def app4(X, Y, n=10):
    n_ext = n+Y.size
    sidx = np.argpartition(X, np.arange(n_ext))[:n_ext]
    ssidx = sidx.argsort()
    mask = np.ones(ssidx.size, dtype=bool)
    search_idx = np.searchsorted(sidx, Y, sorter=ssidx)
    search_idx[search_idx == sidx.size] = 0
    idx = ssidx[search_idx]
    mask[idx[sidx[idx] == Y]] = 0
    return sidx[mask][:n]
You can work on a subset of the original array using numpy.delete():
lowest10 = np.argsort(np.delete(X, Y))[:10]
Note that np.delete builds a new array containing only the kept elements, so this step costs an extra pass over (and copy of) X.
Warning: this solution uses a subset of the original X array (X without the elements indexed in Y), so the resulting values are the lowest 10 of that subset, and the returned indices refer to the reduced array rather than to the original X.
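If you do need indices into the original X, one possible fix (a sketch, not benchmarked) is to record which positions survive the deletion and map back through them:
import numpy as np

keep = np.setdiff1d(np.arange(len(X)), Y)   # indices of X that are allowed
lowest10_sub = np.argsort(X[keep])[:10]     # indices into the reduced array
lowest10 = keep[lowest10_sub]               # indices into the original X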
