Suppose I have a numpy array of arrays of length 4:
In [41]: arr
Out[41]:
array([[  1,  15,   0,   0],
       [ 30,  10,   0,   0],
       [ 30,  20,   0,   0],
       ...,
       [104, 139, 146,  75],
       [  9,  11, 146,  74],
       [  9, 138, 146,  75]], dtype=uint8)
I want to know:
Does arr contain the row [1, 2, 3, 4]?
If so, at what index does [1, 2, 3, 4] occur in arr?
I want to find this out as fast as possible.
Suppose arr contains 8550420 elements. I've checked several methods with timeit:
Just checking for membership, without getting the index: any(all([1, 2, 3, 4] == elt) for elt in arr). It took 15.5 sec on average over 10 runs on my machine.
A for-based solution:
for i, e in enumerate(arr):
    if list(e) == [1, 2, 3, 4]:
        break
It took about 5.7 sec on average.
Is there a faster solution, for example a numpy-based one?
This is Jaime's idea, I just love it:
import numpy as np

def asvoid(arr):
    """View the array as dtype np.void (bytes).

    This collapses ND-arrays to 1D-arrays, so you can perform 1D operations on them.
    https://stackoverflow.com/a/16216866/190597 (Jaime)
    """
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def find_index(arr, x):
    arr_as1d = asvoid(arr)
    x = asvoid(x)
    return np.nonzero(arr_as1d == x)[0]

arr = np.array([[  1,  15,   0,   0],
                [ 30,  10,   0,   0],
                [ 30,  20,   0,   0],
                [  1,   2,   3,   4],
                [104, 139, 146,  75],
                [  9,  11, 146,  74],
                [  9, 138, 146,  75]], dtype='uint8')
arr = np.tile(arr, (1221488, 1))
x = np.array([1, 2, 3, 4], dtype='uint8')
print(find_index(arr, x))
yields
[ 3 10 17 ..., 8550398 8550405 8550412]
The idea is to view each row of the array as a string. For example,
In [15]: x
Out[15]:
array([^A^B^C^D],
      dtype='|V4')
The strings look like garbage, but they are really just the underlying data in each row viewed as bytes. You can then compare arr_as1d == x to find which rows equal x.
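For instance, a minimal sketch on a tiny array (small and target are hypothetical names, just for illustration):
small = np.array([[1, 2], [3, 4], [1, 2]], dtype=np.uint8)
target = np.array([1, 2], dtype=np.uint8)
print(asvoid(small) == asvoid(target))  # expected: [ True False  True] -- rowwise byte equality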
There is another way to do it:
def find_index2(arr, x):
    return np.where((arr == x).all(axis=1))[0]
but it turns out not to be as fast:
In [34]: %timeit find_index(arr, x)
1 loops, best of 3: 209 ms per loop
In [35]: %timeit find_index2(arr, x)
1 loops, best of 3: 370 ms per loop
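If you only need the first matching index, a small sketch along the same lines (find_first is a hypothetical helper; it relies on np.argmax returning the position of the first True):
def find_first(arr, x):
    matches = asvoid(arr) == asvoid(x)
    idx = int(np.argmax(matches))       # position of the first True
    return idx if matches[idx] else -1  # argmax gives 0 on all-False, so guard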
If you perform the search more than once and you don't mind using some extra memory, you can create a set from your array (I'm using a list here, but the code is almost the same):
>>> elem = [1, 2, 3, 4]
>>> elements = [[ 1, 15, 0, 0], [ 30, 10, 0, 0], [1, 2, 3, 4]]
>>> index = {tuple(x) for x in elements}
>>> tuple(elem) in index
True
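If you also need the index, the same idea works with a dict that maps each row tuple to its first position (a sketch along the same lines, not part of the original answer):
>>> index_of = {}
>>> for i, x in enumerate(elements):
...     if tuple(x) not in index_of:
...         index_of[tuple(x)] = i
...
>>> index_of.get(tuple(elem), -1)
2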
Simple case-
I have two arrays:
x1 = np.arange(1,10) and x2 = np.array([0,0,4,0,0,5,0,0,0])
I would like to merge or combine these two arrays such that each 0 in x2 is replaced by the corresponding value in x1, while the non-zero elements of x2 remain. np.union1d seems to do a union, but I don't want the result sorted/ordered.
Then
Actual case-
I would then like to perform this on multi-dimensional arrays, e.g. x.shape = (xx, yy, zz). Both array objects will have the same shape: x.shape = y.shape.
Is this possible, or should I try something with masked arrays (np.ma)?
---------------------------Example-----------------------------
k_angle = khan(_angle)
e_angle = emss(_angle)
_angle.shape = (3647, 16)
e_angle.shape = (2394, 3647, 16)
k_angle.shape = (2394, 3647, 16)
_angle contains values from 0 to 180 degrees; if angle < 5, only the khan function should be used, anything else uses the emss function.
Any value larger than 5 for khan becomes 0, while emss works for all values.
Attempt 1: I tried splitting the angle values but recombining them proved tricky
khan = bm.Khans_beam_model(freq=f, theta=None)
emss = bm.emss_beam_model(f=f)
test = np.array([[0,1,2], [3,4,5], [6,7,8], [9,10,11]])
gt_idx = test > 5
le_idx = test <= 5
# then update the array
test[gt_idx] = khan(test[gt_idx])
test[le_idx] = emss(test[le_idx])
But this gets an error TypeError: NumPy boolean array indexing assignment requires a 0 or 1-dimensional input, input has 2 dimensions
khan and emss are lambda functions.
So I thought it would be easier to execute khan and emss first and then merge the results after the fact.
I included the simple case above to help simplify the question.
The np.where(boolean_mask, value_if_true, value_otherwise) function should be sufficient as long as x1 and x2 are the same shape.
Here, you could use np.where(x2, x2, x1) where the condition is simply x2, which means that truthy values (non-zero) will be preserved and falsy values will be replaced by the corresponding values in x1. In general, any boolean mask will work as a condition, and it is better to be explicit here: np.where(x2 == 0, x1, x2).
1D
In [1]: import numpy as np
In [2]: x1 = np.arange(1, 10)
In [3]: x2 = np.array([0,0,4,0,0,5,0,0,0])
In [4]: np.where(x2 == 0, x1, x2)
Out[4]: array([1, 2, 4, 4, 5, 5, 7, 8, 9])
2D
In [5]: x1 = x1.reshape(3, 3)
In [6]: x2 = x2.reshape(3, 3)
In [7]: x1
Out[7]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
In [8]: x2
Out[8]:
array([[0, 0, 4],
       [0, 0, 5],
       [0, 0, 0]])
In [9]: np.where(x2 == 0, x1, x2)
Out[9]:
array([[1, 2, 4],
       [4, 5, 5],
       [7, 8, 9]])
3D
In [10]: x1 = np.random.randint(1, 9, (2, 3, 3))
In [11]: x2 = np.random.choice((0, 0, 0, 0, 0, 0, 0, 0, 99), (2, 3, 3))
In [12]: x1
Out[12]:
array([[[3, 7, 4],
        [1, 4, 3],
        [7, 4, 3]],

       [[5, 7, 1],
        [5, 7, 6],
        [1, 8, 8]]])
In [13]: x2
Out[13]:
array([[[ 0, 99, 99],
        [ 0, 99,  0],
        [ 0, 99,  0]],

       [[99,  0,  0],
        [ 0,  0, 99],
        [ 0, 99,  0]]])
In [14]: np.where(x2 == 0, x1, x2)
Out[14]:
array([[[ 3, 99, 99],
        [ 1, 99,  3],
        [ 7, 99,  3]],

       [[99,  7,  1],
        [ 5,  7, 99],
        [ 1, 99,  8]]])
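Applied to the actual case above, the same pattern should carry over via broadcasting, since _angle's shape (3647, 16) matches the trailing dimensions of the (2394, 3647, 16) outputs. A minimal sketch, assuming k_angle and e_angle have been computed as in the question:
merged = np.where(_angle < 5, k_angle, e_angle)  # the (3647, 16) condition broadcasts over the leading axis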
Say I have two arrays arr1 and arr2:
arr1 = [0, 1, 2]
arr2 = [
[0, 1, 2],
[3, 4, 5],
[6, 7, 8],
]
And say I have a function that does something to the elements of this array:
def func(arr):
    new_arr = arr.copy()
    new_arr[0] = new_arr[0] * 2
    new_arr[1] = new_arr[1] * 10
    new_arr[2] = new_arr[2] * 100
    return new_arr
Now I want to vectorize this, so that it works for both arr1 and arr2:
func(arr1)
# returns [0, 10, 200]
func(arr2)
# returns
# [0, 10, 200],
# [6, 40, 500],
# [12, 70, 800],
np.vectorize doesn't work because it breaks down each and every element in my array parameter. I want it to apply the function only along the first axis.
np.apply_along_axis almost works, except it won't consider 1-D array parameter to be a single parameter.
What's the best way to do this?
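One workaround sketch for that apply_along_axis limitation, for what it's worth: promote the input to 2-D first and drop the added axis afterwards (apply_rows is a hypothetical helper, not a numpy function):
import numpy as np

def apply_rows(f, arr):
    a = np.asarray(arr)
    out = np.apply_along_axis(f, -1, np.atleast_2d(a))  # input is always 2-D here
    return out[0] if a.ndim == 1 else out               # unwrap the 1-D case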
You can just directly multiply the arrays. It works thanks to numpy broadcasting:
factor = np.array([2, 10, 100])

arr1 * factor
array([  0,  10, 200])

arr2 * factor
array([[  0,  10, 200],
       [  6,  40, 500],
       [ 12,  70, 800]])
If you take time to read the np.vectorize docs, you'll eventually encounter the signature option:
In [27]: f = np.vectorize(func, signature='(n)->(n)')
In [28]: f(arr1)
Out[28]: array([  0,  10, 200])
In [29]: f(arr2)
Out[29]:
array([[  0,  10, 200],
       [  6,  40, 500],
       [ 12,  70, 800]])
And reading a bit further you'll encounter the caveats about performance.
Just do this:
import numpy as np
a = np.array([0, 1, 2])
b = np.array([
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
])
c = np.array([2, 10, 100])
print(a*c)
print(b*c)
Output:
[  0  10 200]
[[  0  10 200]
 [  6  40 500]
 [ 12  70 800]]
I have a function that produces an array like this:
my_array = np.array([list(str(i).zfill(4)) for i in range(10000)], dtype=int)
Which outputs:
array([[0, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 2],
       ...,
       [9, 9, 9, 7],
       [9, 9, 9, 8],
       [9, 9, 9, 9]])
As you can see, converting ints to strings and lists and then back to ints is highly inefficient, and my real need is for a much larger array (a larger range). I tried looking into numpy for a more efficient way to generate this array, but could not find one. The best I've got so far is arange, which will give a range from 1...9999 but not separated into lists.
Any ideas?
Here's one based on cartesian_product_broadcasted -
import functools
import numpy as np

def cartesian_product_ranges(shape, out_dtype='int'):
    arrays = [np.arange(s, dtype=out_dtype) for s in shape]
    broadcastable = np.ix_(*arrays)
    broadcasted = np.broadcast_arrays(*broadcastable)
    rows, cols = functools.reduce(np.multiply, broadcasted[0].shape), len(broadcasted)
    out = np.empty(rows * cols, dtype=out_dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    N = len(shape)
    return np.moveaxis(out.reshape((-1,) + tuple(shape)), 0, -1).reshape(-1, N)
Sample run -
In [116]: cartesian_product_ranges([3,2,4])
Out[116]:
array([[0, 0, 0],
       [0, 0, 1],
       [0, 0, 2],
       [0, 0, 3],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 2],
       [0, 1, 3],
       [1, 0, 0],
       [1, 0, 1],
       [1, 0, 2],
       [1, 0, 3],
       [1, 1, 0],
       [1, 1, 1],
       [1, 1, 2],
       [1, 1, 3],
       [2, 0, 0],
       [2, 0, 1],
       [2, 0, 2],
       [2, 0, 3],
       [2, 1, 0],
       [2, 1, 1],
       [2, 1, 2],
       [2, 1, 3]])
Run and timings on 10-ranged array with 4 cols -
In [119]: cartesian_product_ranges([10]*4)
Out[119]:
array([[0, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 2],
       ...,
       [9, 9, 9, 7],
       [9, 9, 9, 8],
       [9, 9, 9, 9]])
In [120]: cartesian_product_ranges([10]*4).shape
Out[120]: (10000, 4)
In [121]: %timeit cartesian_product_ranges([10]*4)
10000 loops, best of 3: 105 µs per loop
In [122]: %timeit np.array([list(str(i).zfill(4)) for i in range(10000)], dtype=int)
100 loops, best of 3: 16.7 ms per loop
In [123]: 16700.0/105
Out[123]: 159.04761904761904
Around 160x speedup!
For 10-ranged array with 9 columns, we can use lower-precision uint8 dtype -
In [7]: %timeit cartesian_product_ranges([10]*9, out_dtype=np.uint8)
1 loop, best of 3: 3.36 s per loop
You can use itertools.product for this.
Simply provide range(10) as an argument, and the number of digits you want as the argument for repeat.
Conveniently, the itertools iterator returns the elements in sorted order, so you do not have to perform a secondary sorting step by yourself.
Below is an evaluation of my code:
import timeit

if __name__ == "__main__":
    # time run: 14.20635
    print(timeit.timeit("np.array([list(str(i).zfill(4)) for i in range(10000)], dtype=int)",
                        "import numpy as np",
                        number=1000))
    # time run: 5.00319
    print(timeit.timeit("np.array(list(itertools.product(range(10), repeat=4)))",
                        "import itertools; import numpy as np",
                        number=1000))
I would solve this with a combination of np.tile and np.repeat and try to assemble the rows, then np.column_stack them.
This pure NumPy solution then becomes nearly a one-liner:
n = 10000
x = np.arange(10)
a = [np.tile(np.repeat(x, 10 ** k), n // (10 ** (k + 1))) for k in range(int(np.log10(n)))]
y = np.column_stack(a[::-1])  # flip the list; the first entry is the rightmost column
A more verbose version, to see what happens, can be written like this:
n = 10000
x = np.arange(10)
x0 = np.tile(np.repeat(x, 1), n // 10)
x1 = np.tile(np.repeat(x, 10), n // 100)
x2 = np.tile(np.repeat(x, 100), n // 1000)
x3 = np.tile(np.repeat(x, 1000), n // 10000)  # the fourth column, completing the pattern for n = 10000
Now replace the numbers with exponents and get the number of columns using the log10.
Speed test:
import timeit

s = """
n = 10000
x = np.arange(10)
a = [np.tile(np.repeat(x, 10 ** k), n // (10 ** (k + 1))) for k in range(int(np.log10(n)))]
y = np.column_stack(a[::-1])
"""
n_runs = 100000
t = timeit.timeit(s, "import numpy as np", number=n_runs)
print(t, t / n_runs)
About 260 µs on my slow machine (7 years old).
A fast solution is to use np.meshgrid to create all the columns, then sort the columns on, for instance, element 123 or 1234 so that they are in the right order, and finally make an array out of them.
n_digits = 4
digits = np.arange(10)
columns = [c.ravel() for c in np.meshgrid(*[digits] * n_digits)]
columns.sort(key=lambda x: x[int("".join(str(d) for d in range(n_digits)))])  # in-place sort; list.sort returns None
out_array = np.array(columns).T
np.all(out_array == my_array)
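An alternative sketch that avoids the sort entirely: with indexing='ij' the first meshgrid argument varies slowest, so the raveled columns already come out in lexicographic order (assuming that ordering is what you want):
cols = np.meshgrid(*[digits] * n_digits, indexing='ij')
out_array = np.stack([c.ravel() for c in cols], axis=-1)  # shape (10**n_digits, n_digits)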
There are other one-liners to solve this
import numpy as np
y = np.array([index for index in np.ndindex(10, 10, 10, 10)])
This seems to be much slower.
Or
import numpy as np
from sklearn.utils.extmath import cartesian
x = np.arange(10)
y = cartesian((x, x, x, x))
This seems to be slightly slower than the accepted answer.
I have the following list:
>>> indices
[21, 43, 58, 64, 88, 104, 113, 115, 120]
I want every occurrence of these values minus 1 (so 20, 42, 57, etc.) to be zeroed out in a 3D array q that I have.
I have tried list comprehensions, for and if loops (see below), but I always get the following error:
ValueError: The truth value of an array with more than one element is
ambiguous. Use a.any() or a.all()
I haven't been able to resolve this.
Any help would be amazing!
>>> for b in q:
...     for u in indices:
...         if b == u:
...             b == 0
>>> for u in indices:
...     q = [0 if x == u else x for x in q]
I think this is a short and efficient way:
b = b * np.logical_not(np.reshape(np.in1d(b, indices), b.shape))
With np.in1d() we get a boolean array that is True where an element of b is in indices. We reshape it to the same shape as b and then negate it, so that we have False (or, if you like, 0) where we want to zero b. Just multiply this mask elementwise with b and you've got it.
It has the advantage that it works for 1D, 2D, 3D, ... arrays.
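A quick sketch of it in action, with the question's minus-1 adjustment applied up front (q here is a small stand-in array; np.isin is the shape-preserving successor to np.in1d):
q = np.arange(24).reshape(2, 3, 4)           # stand-in 3D array
targets = [i - 1 for i in indices]           # the values to zero out
q = q * np.logical_not(np.isin(q, targets))  # same trick; isin already keeps q's shape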
How about this?
indices = range(1, 10)
>>> [1, 2, 3, 4, 5, 6, 7, 8, 9]
q = np.arange(12).reshape(2, 2, 3)
array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])
def zeroed(row):
    new_indices = [x - 1 for x in indices]  # a list, not map(): a map iterator would be exhausted after the first membership test
    nrow = [0 if elem in new_indices else elem for elem in row]
    return nrow

np.apply_along_axis(zeroed, 1, q)
array([[[ 0,  0,  0],
        [ 0,  0,  0]],

       [[ 0,  0,  0],
        [ 9, 10, 11]]])
I tried this and it worked for me:
>>> arr_2D = [3,4,5,6]
>>> arr_3D = [[3,4,5,6],[2,3,4,5],[4,5,6,7,8,8]]
>>> for el in arr_2D:
...     for x in arr_3D:
...         for y in x:
...             if y == el - 1:
...                 x.remove(y)
...
>>> arr_3D
[[6], [], [6, 7, 8, 8]]
Doing it with list comprehensions seems like it might be overkill in this situation.
Or, to zero out instead of removing:
>>> for el in arr_2D:
...     for x in range(len(arr_3D)):
...         for y in range(len(arr_3D[x])):
...             if arr_3D[x][y] == el - 1:
...                 arr_3D[x][y] = 0
...
>>> arr_3D
[[0, 0, 0, 6], [0, 0, 0, 0], [0, 0, 6, 7, 8, 8]]
Here is the list comprehension:
zero_out = lambda arr_2D, arr_3D: [[0 if x in [el-1 for el in arr_2D] else x for x in y] for y in arr_3D]
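A slightly more efficient variant (a sketch) hoists the decremented values into a set, so the membership test is O(1) and computed once instead of per element:
def zero_out(arr_2D, arr_3D):
    targets = {el - 1 for el in arr_2D}  # build once, not per element
    return [[0 if x in targets else x for x in y] for y in arr_3D]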
I frequently want to pixel bin/pixel bucket a numpy array, meaning, replace groups of N consecutive pixels with a single pixel which is the sum of the N replaced pixels. For example, start with the values:
x = np.array([1, 3, 7, 3, 2, 9])
with a bucket size of 2, this transforms into:
bucket(x, bucket_size=2)
= [1+3, 7+3, 2+9]
= [4, 10, 11]
As far as I know, there's no numpy function that specifically does this (please correct me if I'm wrong!), so I frequently roll my own. For 1d numpy arrays, this isn't bad:
import numpy as np

def bucket(x, bucket_size):
    return x.reshape(x.size // bucket_size, bucket_size).sum(axis=1)
bucket_me = np.array([3, 4, 5, 5, 1, 3, 2, 3])
print(bucket(bucket_me, bucket_size=2)) #[ 7 10 4 5]
...however, I get confused easily for the multidimensional case, and I end up rolling my own buggy, half-assed solution to this "easy" problem over and over again. I'd love it if we could establish a nice N-dimensional reference implementation.
Preferably the function call would allow different bin sizes along different axes (perhaps something like bucket(x, bucket_size=(2, 2, 3)))
Preferably the solution would be reasonably efficient (reshape and sum are fairly quick in numpy)
Bonus points for handling edge effects when the array doesn't divide nicely into an integer number of buckets.
Bonus points for allowing the user to choose the initial bin edge offset.
As suggested by Divakar, here's my desired behavior in a sample 2-D case:
x = np.array([[1, 2, 3, 4],
              [2, 3, 7, 9],
              [8, 9, 1, 0],
              [0, 0, 3, 4]])

bucket(x, bucket_size=(2, 2))
= [[1 + 2 + 2 + 3, 3 + 4 + 7 + 9],
   [8 + 9 + 0 + 0, 1 + 0 + 3 + 4]]
= [[8, 23],
   [17, 8]]
...hopefully I did my arithmetic correctly ;)
I think you can do most of the fiddly work with skimage's view_as_blocks. This function is implemented using as_strided so it is very efficient (it just changes the stride information to reshape the array). Because it's written in Python/NumPy, you can always copy the code if you don't have skimage installed.
After applying that function, you just need to sum the N trailing axes of the reshaped array (where N is the length of the bucket_size tuple). Here's a new bucket() function:
from skimage.util import view_as_blocks

def bucket(x, bucket_size):
    blocks = view_as_blocks(x, bucket_size)
    tup = tuple(range(-len(bucket_size), 0))  # the N trailing (block) axes
    return blocks.sum(axis=tup)
Then for example:
>>> x = np.array([1, 3, 7, 3, 2, 9])
>>> bucket(x, bucket_size=(2,))
array([ 4, 10, 11])
>>> x = np.array([[1, 2, 3, 4],
...               [2, 3, 7, 9],
...               [8, 9, 1, 0],
...               [0, 0, 3, 4]])
>>> bucket(x, bucket_size=(2, 2))
array([[ 8, 23],
       [17,  8]])
>>> y = np.arange(6*6*6).reshape(6, 6, 6)
>>> bucket(y, bucket_size=(2, 2, 3))
array([[[ 264,  300],
        [ 408,  444],
        [ 552,  588]],

       [[1128, 1164],
        [1272, 1308],
        [1416, 1452]],

       [[1992, 2028],
        [2136, 2172],
        [2280, 2316]]])
Natively, with as_strided:
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.array([[1, 2, 3, 4],
              [2, 3, 7, 9],
              [8, 9, 1, 0],
              [0, 0, 3, 4]])

def bucket(x, bucket_size):
    x = np.ascontiguousarray(x)
    oldshape = np.array(x.shape)
    newshape = np.concatenate((oldshape // bucket_size, bucket_size))
    oldstrides = np.array(x.strides)
    newstrides = np.concatenate((oldstrides * bucket_size, oldstrides))
    axis = tuple(range(x.ndim, 2 * x.ndim))
    return as_strided(x, newshape, newstrides).sum(axis)
If a bucket size does not divide evenly into the corresponding dimension of x, the remaining elements are lost.
Verification:
In [9]: bucket(x, (2, 2))
Out[9]:
array([[ 8, 23],
       [17,  8]])
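For the "edge effects" bonus point, one option is to trim each dimension down to a multiple of its bucket size first, so the dropped remainder is at least explicit (a sketch; trim_to_buckets is a hypothetical helper):
def trim_to_buckets(x, bucket_size):
    # slice each axis to the largest multiple of its bucket size
    slices = tuple(slice(0, (s // b) * b) for s, b in zip(x.shape, bucket_size))
    return x[slices]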
To specify different bin sizes along each axis for the ndarray case, you can iteratively use np.add.reduceat along each axis, like so -
def bucket(x, bin_size):
    ndims = x.ndim
    out = x.copy()
    for i in range(ndims):
        idx = np.append(0, np.cumsum(bin_size[i][:-1]))  # start offsets of each bin along axis i
        out = np.add.reduceat(out, idx, axis=i)
    return out
Sample run -
In [126]: x
Out[126]:
array([[165, 107, 133,  82, 199],
       [ 35, 138,  91, 100, 207],
       [ 75,  99,  40, 240, 208],
       [166, 171,  78,   7, 141]])
In [127]: bucket(x, bin_size=[[2, 2], [3, 2]])
Out[127]:
array([[669, 588],
       [629, 596]])
# [2, 2] are the bin sizes along axis=0
# [3, 2] are the bin sizes along axis=1
# array([[165, 107, 133, |  82, 199],
#        [ 35, 138,  91, | 100, 207],
#        ----------------------------
#        [ 75,  99,  40, | 240, 208],
#        [166, 171,  78, |   7, 141]])
In [128]: x[:2,:3].sum()
Out[128]: 669
In [129]: x[:2,3:].sum()
Out[129]: 588
In [130]: x[2:,:3].sum()
Out[130]: 629
In [131]: x[2:,3:].sum()
Out[131]: 596