Cycling Slicing in Python - python

I've come up with this question while trying to apply a Cesar Cipher to a matrix with different shift values for each row, i.e. given a matrix X
array([[1, 0, 8],
[5, 1, 4],
[2, 1, 1]])
with shift values of S = array([0, 1, 1]), the output needs to be
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])
This is easy to implement by the following code:
Y = []
for i in range(X.shape[0]):
if (S[i] > 0):
Y.append( X[i,S[i]::].tolist() + X[i,:S[i]:].tolist() )
else:
Y.append(X[i,:].tolist())
Y = np.array(Y)
This is a left-cycle-shift. I wonder how to do this in a more efficient way using numpy arrays?
Update: This example applies the shift to the columns of a matrix. Suppose that we have a 3D array
array([[[8, 1, 8],
[8, 6, 2],
[5, 3, 7]],
[[4, 1, 0],
[5, 9, 5],
[5, 1, 7]],
[[9, 8, 6],
[5, 1, 0],
[5, 5, 4]]])
Then, the cyclic right shift of S = array([0, 0, 1]) over the columns leads to
array([[[8, 1, 7],
[8, 6, 8],
[5, 3, 2]],
[[4, 1, 7],
[5, 9, 0],
[5, 1, 5]],
[[9, 8, 4],
[5, 1, 6],
[5, 5, 0]]])

Approach #1 : Use modulus to implement the cyclic pattern and get the new column indices and then simply use advanced-indexing to extract the elements, giving us a vectorized solution, like so -
def cyclic_slice(X, S):
m,n = X.shape
idx = np.mod(np.arange(n) + S[:,None],n)
return X[np.arange(m)[:,None], idx]
Approach #2 : We can also leverage the power of strides for further speedup. The idea would be to concatenate the sliced off portion from the start and append it at the end, then create sliding windows of lengths same as the number of cols and finally index into the appropriate window numbers to get the same rolled over effect. The implementation would be like so -
def cyclic_slice_strided(X, S):
X2 = np.column_stack((X,X[:,:-1]))
s0,s1 = X2.strides
strided = np.lib.stride_tricks.as_strided
m,n1 = X.shape
n2 = X2.shape[1]
X2_3D = strided(X2, shape=(m,n2-n1+1,n1), strides=(s0,s1,s1))
return X2_3D[np.arange(len(S)),S]
Sample run -
In [34]: X
Out[34]:
array([[1, 0, 8],
[5, 1, 4],
[2, 1, 1]])
In [35]: S
Out[35]: array([0, 1, 1])
In [36]: cyclic_slice(X, S)
Out[36]:
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])
Runtime test -
In [75]: X = np.random.rand(10000,100)
...: S = np.random.randint(0,100,(10000))
# #Moses Koledoye's soln
In [76]: %%timeit
...: Y = []
...: for i, x in zip(S, X):
...: Y.append(np.roll(x, -i))
10 loops, best of 3: 108 ms per loop
In [77]: %timeit cyclic_slice(X, S)
100 loops, best of 3: 14.1 ms per loop
In [78]: %timeit cyclic_slice_strided(X, S)
100 loops, best of 3: 4.3 ms per loop
Adaption for 3D case
Adapting approach #1 for the 3D case, we would have -
shift = 'left'
axis = 1 # axis along which S is to be used (axis=1 for rows)
n = X.shape[axis]
if shift == 'left':
Sa = S
else:
Sa = -S
# For rows
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[:,idx, np.arange(len(S))]
# For columns
idx = np.mod(Sa[:,None] + np.arange(n),n)
out = X[:,np.arange(len(S))[:,None], idx]
# For axis=0
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[idx, np.arange(len(S))]
There could be a way to have a generic solution for a generic axis, but I will keep it to this point.

You could shift each row using np.roll and use the new rows to build the output array:
Y = []
for i, x in zip(S, X):
Y.append(np.roll(x, -i))
print(np.array(Y))
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])

Related

compare entire row and column in numpy array and delete selected rows and columns

I have 2 square array with shape = (25, 25) and I want to check if an entire row is filled with zeros and if the corresponding column is filled with zeros. If this is the case I want to remove those columns and rows from the array.
For example:
array = np.array([[1, 0, 1, 1],
[0, 0, 0, 0],
[1, 0, 1, 1],
[1, 0, 1, 1]])
I want it manipulated to
array=np.array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1]])
I hope you can understand what I am aiming at. In this example row and column two have been removed as they are zero rows/columns.
I could do that by iterating through all of those arrays, as I have 10 million of those arrays I would like to have a pythonic/efficient way to solve this issue.
The second array is a tensorflow array manipulating that should be no problem if I know the index of the rows columns I want removed.
Edit:
I have now found following solution, but it is using for-looping:
def removepadding(y_true, y_pred):
shape = np.shape(y_true)
y_true_cleaned=[]
for i in range(shape[0]):
x = y_true[i]
for n in range(shape[1] - 1, -1, -1):
if sum(x[n, :]) == 0 and sum(x[:, n]) == 0:
x = np.delete(np.delete(x, n, 0), n, 1)
y_true_cleaned.append(x)
return y_true_cleaned
You can do it in one line:
array[array.any(axis = 1)][:, array.any(axis = 0)]
#array([[1, 1, 1],
# [1, 1, 1],
# [1, 1, 1]])
if there is negative values in the arr, np.sum may fail.
for 2d array:
import numpy as np
a = np.array([[1,0,2,3,0,4],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[2,0,3,4,0,5],
[3,0,4,5,0,6],
[4,0,5,6,0,7],
[5,0,6,7,0,8]])
row = np.all(a==0, axis=1)
col = np.all(a==0, axis=0)
a[~row][:,~col]
output
array([[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8]])
for 3d array:
a = np.ones((3,3,3))
a[1,:,1] = 0
a[1,1,:] = 0
a[:,1,1] = 0
z = np.all(a==0, axis=2)
y = np.all(a==0, axis=1)
x = np.all(a==0, axis=0)
Z = ~np.array([z]*a.shape[2])
Y = ~np.array([y]*a.shape[1])
X = ~np.array([x]*a.shape[0])
ZZ, YY, XX = (Z*Y*X).nonzero()
a[ZZ, YY, XX]
You can use np.count_nonzero to get the indices in one step per dimension:
nnz_row = np.count_nonzero(array, axis=1)
nnz_col = np.count_nonzero(array, axis=0)
Now you make a mask of where both are zero:
mask = (nnz_row == 0) & (nnz_col == 9)
You can turn the mask into indices and pass it to np.delete:
ind = np.flatnonzero(mask)
array = np.delete(np.delete(array, ind, axis=0), ind, axis=1)
Alternatively, you can compute the positive mask:
pmask = nnz_row.astype(bool) | nnz_col.astype(bool)
This mask can select directly, analogously to what delete did with the negative mask:
array = array[pmask, :][:, pmask]
Edit: Thanks to #mad physicist, we can use np.flatnonzero. Here's the 2d case:
import numpy as np
a=np.array([[1,0,2,3,0,4],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[2,0,3,4,0,5],
[3,0,4,5,0,6],
[4,0,5,6,0,7],
[5,0,6,7,0,8]])
cols_to_keep = np.flatnonzero(a.sum(axis=0))
rows_to_keep = np.flatnonzero(a.sum(axis=1))
a = a[:, cols_to_keep]
a = a[rows_to_keep, :]
a
>>>
array([[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8]])
Here's the 3d case:
import numpy as np
a=np.array([
[[1,0,2,3,0,4],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[2,0,3,4,0,5],
[3,0,4,5,0,6],
[4,0,5,6,0,7],
[5,0,6,7,0,8]],
[[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0]],
[[5,0,5,5,0,5],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[2,0,3,4,0,5],
[3,0,4,5,0,6],
[4,0,5,6,0,7],
[5,0,6,7,0,8]],
])
ix_keep_axis_0 = np.flatnonzero(a.sum(axis=(1, 2)))
ix_keep_axis_1 = np.flatnonzero(a.sum(axis=(0, 2)))
ix_keep_axis_2 = np.flatnonzero(a.sum(axis=(0, 1)))
a = a[ix_keep_axis_0, :, :]
a = a[:, ix_keep_axis_1, :]
a = a[:, :, ix_keep_axis_2]
a
>>>
array([[[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8]],
[[5, 5, 5, 5],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8]]])

Getting a range by rows in numpy

I have a function that produces an array like this:
my_array = np.array([list(str(i).zfill(4)) for i in range(10000)], dtype=int)
Which outputs:
array([[0, 0, 0, 0],
[0, 0, 0, 1],
[0, 0, 0, 2],
...,
[9, 9, 9, 7],
[9, 9, 9, 8],
[9, 9, 9, 9]])
As you can see by converting ints to strings and lists, and then back to int, this is highly inefficient, and my real needs is for a much larger array (larger range). I tried looking into numpy to find a more efficient way to generate this array / list, but could not find a way. The best i've got so far is arange which will give a range from 1...9999 but not separated into lists.
Any ideas?
Here's one based on cartesian_product_broadcasted -
import functools
def cartesian_product_ranges(shape, out_dtype='int'):
arrays = [np.arange(s, dtype=out_dtype) for s in shape]
broadcastable = np.ix_(*arrays)
broadcasted = np.broadcast_arrays(*broadcastable)
rows, cols = functools.reduce(np.multiply, broadcasted[0].shape), \
len(broadcasted)
out = np.empty(rows * cols, dtype=out_dtype)
start, end = 0, rows
for a in broadcasted:
out[start:end] = a.reshape(-1)
start, end = end, end + rows
N = len(shape)
return np.moveaxis(out.reshape((-1,) + tuple(shape)),0,-1).reshape(-1,N)
Sample run -
In [116]: cartesian_product_ranges([3,2,4])
Out[116]:
array([[0, 0, 0],
[0, 0, 1],
[0, 0, 2],
[0, 0, 3],
[0, 1, 0],
[0, 1, 1],
[0, 1, 2],
[0, 1, 3],
[1, 0, 0],
[1, 0, 1],
[1, 0, 2],
[1, 0, 3],
[1, 1, 0],
[1, 1, 1],
[1, 1, 2],
[1, 1, 3],
[2, 0, 0],
[2, 0, 1],
[2, 0, 2],
[2, 0, 3],
[2, 1, 0],
[2, 1, 1],
[2, 1, 2],
[2, 1, 3]])
Run and timings on 10-ranged array with 4 cols -
In [119]: cartesian_product_ranges([10]*4)
Out[119]:
array([[0, 0, 0, 0],
[0, 0, 0, 1],
[0, 0, 0, 2],
...,
[9, 9, 9, 7],
[9, 9, 9, 8],
[9, 9, 9, 9]])
In [120]: cartesian_product_ranges([10]*4).shape
Out[120]: (10000, 4)
In [121]: %timeit cartesian_product_ranges([10]*4)
10000 loops, best of 3: 105 µs per loop
In [122]: %timeit np.array([list(str(i).zfill(4)) for i in range(10000)], dtype=int)
100 loops, best of 3: 16.7 ms per loop
In [123]: 16700.0/105
Out[123]: 159.04761904761904
Around 160x speedup!
For 10-ranged array with 9 columns, we can use lower-precision uint8 dtype -
In [7]: %timeit cartesian_product_ranges([10]*9, out_dtype=np.uint8)
1 loop, best of 3: 3.36 s per loop
You can user itertools.product for this.
Simply provide range(10) as an argument, and the number of digits you want as the argument for repeat.
Conveniently, the itertools iterator returns the elements in sorted order, so you do not have to perform a secondary sorting step by yourself.
Below is an evaluation of my code:
import timeit
if __name__ == "__main__":
# time run: 14.20635
print(timeit.timeit("np.array([list(str(i).zfill(4)) for i in range(10000)], dtype=int)",
"import numpy as np",
number=1000))
# time run: 5.00319
print(timeit.timeit("np.array(list(itertools.product(range(10), r=4)))",
"import itertools; import numpy as np",
number=1000))
I would solve this with a combination of np.tile and np.repeat and try to assemble the rows, then np.column_stack them.
This pure Numpy solution becomes nearly a one-liner then:
n = 10000
x = np.arange(10)
a = [np.tile(np.repeat(x, 10 ** k), n/(10 ** (k+1))) for k in range(int(np.log10(n)))]
y = np.column_stack(a[::-1]) # flip the list, first entry is rightmost row
A more verbose version to see what happens can be written like that
n = 10000
x = np.arange(10)
x0 = np.tile(np.repeat(x, 1), n/10)
x1 = np.tile(np.repeat(x, 10), n/100)
x2 = np.tile(np.repeat(x, 100), n/1000)
Now replace the numbers with exponents and get the number of columns using the log10.
Speed test:
import timeit
s = """
n = 10000
x = np.arange(10)
a = [np.tile(np.repeat(x, 10 ** k), n/(10 ** (k+1))) for k in range(int(np.log10(n)))]
y = np.column_stack(a[::-1])
"""
n_runs = 100000
t = timeit.timeit(s,
"import numpy as np",
number=n_runs)
print(t, t/n_runs)
About 260 µs on my slow machine (7 years old).
A fast solution is to use np.meshgrid to create all the columns. Then sort the columns on for instance element 123 or 1234 so that they are in the right order. And then just make an array out of them.
n_digits = 4
digits = np.arange(10)
columns = [c.ravel() for c in np.meshgrid(*[digits]*n_digits)]
out_array = columns.sort(key=lambda x: x[int("".join(str(d) for d in range(n_digits)))])
out_array = np.array(columns).T
np.all(out_array==my_array)
There are other one-liners to solve this
import numpy as np
y = np.array([index for index in np.ndindex(10, 10, 10, 10)])
This seems to be much slower.
Or
import numpy as np
from sklearn.utils.extmath import cartesian
x = np.arange(10)
y = cartesian((x, x, x, x))
This seems to be slightly slower than the accepted answer.

Create a 2D array from another array and its indices with NumPy

Given an array:
arr = np.array([[1, 3, 7], [4, 9, 8]]); arr
array([[1, 3, 7],
[4, 9, 8]])
And given its indices:
np.indices(arr.shape)
array([[[0, 0, 0],
[1, 1, 1]],
[[0, 1, 2],
[0, 1, 2]]])
How would I be able to stack them neatly one against the other to form a new 2D array? This is what I'd like:
array([[0, 0, 1],
[0, 1, 3],
[0, 2, 7],
[1, 0, 4],
[1, 1, 9],
[1, 2, 8]])
This is my current solution:
def foo(arr):
return np.hstack((np.indices(arr.shape).reshape(2, arr.size).T, arr.reshape(-1, 1)))
It works, but is there something shorter/more elegant to carry this operation out?
Using array-initialization and then broadcasted-assignment for assigning indices and the array values in subsequent steps -
def indices_merged_arr(arr):
m,n = arr.shape
I,J = np.ogrid[:m,:n]
out = np.empty((m,n,3), dtype=arr.dtype)
out[...,0] = I
out[...,1] = J
out[...,2] = arr
out.shape = (-1,3)
return out
Note that we are avoiding the use of np.indices(arr.shape), which could have slowed things down.
Sample run -
In [10]: arr = np.array([[1, 3, 7], [4, 9, 8]])
In [11]: indices_merged_arr(arr)
Out[11]:
array([[0, 0, 1],
[0, 1, 3],
[0, 2, 7],
[1, 0, 4],
[1, 1, 9],
[1, 2, 8]])
Performance
arr = np.random.randn(100000, 2)
%timeit df = pd.DataFrame(np.hstack((np.indices(arr.shape).reshape(2, arr.size).T,\
arr.reshape(-1, 1))), columns=['x', 'y', 'value'])
100 loops, best of 3: 4.97 ms per loop
%timeit pd.DataFrame(indices_merged_arr_divakar(arr), columns=['x', 'y', 'value'])
100 loops, best of 3: 3.82 ms per loop
%timeit pd.DataFrame(indices_merged_arr_eric(arr), columns=['x', 'y', 'value'], dtype=np.float32)
100 loops, best of 3: 5.59 ms per loop
Note: Timings include conversion to pandas dataframe, that is the eventual use case for this solution.
A more generic answer for nd arrays, that handles other dtypes correctly:
def indices_merged_arr(arr):
out = np.empty(arr.shape, dtype=[
('index', np.intp, arr.ndim),
('value', arr.dtype)
])
out['value'] = arr
for i, l in enumerate(arr.shape):
shape = (1,)*i + (-1,) + (1,)*(arr.ndim-1-i)
out['index'][..., i] = np.arange(l).reshape(shape)
return out.ravel()
This returns a structured array with an index column and a value column, which can be of different types.

Assign same lexicographic rank to duplicate elements of 2d array

I'm trying to lexicographically rank array components. The below code works fine, but I'd like to assign equal ranks to equal elements.
import numpy as np
values = np.asarray([
[1, 2, 3],
[1, 1, 1],
[2, 2, 3],
[1, 2, 3],
[1, 1, 2]
])
# need to flip, because for `np.lexsort` last
# element has highest priority.
values_reversed = np.fliplr(values)
# this returns the order, i.e. the order in
# which the elements should be in a sorted
# array (not the rank by index).
order = np.lexsort(values_reversed.T)
# convert order to ranks.
n = values.shape[0]
ranks = np.empty(n, dtype=int)
# use order to assign ranks.
ranks[order] = np.arange(n)
The rank variable contains [2, 0, 4, 3, 1], but a rank array of [2, 0, 4, 2, 1] is required because elements [1, 2, 3] (index 0 and 3) share the same rank. Continuous rank numbers are ok, so [2, 0, 3, 2, 1] is also an acceptable rank array.
Here's one approach -
# Get lexsorted indices and hence sorted values by those indices
lexsort_idx = np.lexsort(values.T[::-1])
lexsort_vals = values[lexsort_idx]
# Mask of steps where rows shift (there are no duplicates in subsequent rows)
mask = np.r_[True,(lexsort_vals[1:] != lexsort_vals[:-1]).any(1)]
# Get the stepped indices (indices shift at non duplicate rows) and
# the index values are scaled corresponding to row numbers
stepped_idx = np.maximum.accumulate(mask*np.arange(mask.size))
# Re-arrange the stepped indices based on the original order of rows
# This is basically same as the original code does in last 4 steps,
# just in a concise manner
out_idx = stepped_idx[lexsort_idx.argsort()]
Sample step-by-step intermediate outputs -
In [55]: values
Out[55]:
array([[1, 2, 3],
[1, 1, 1],
[2, 2, 3],
[1, 2, 3],
[1, 1, 2]])
In [56]: lexsort_idx
Out[56]: array([1, 4, 0, 3, 2])
In [57]: lexsort_vals
Out[57]:
array([[1, 1, 1],
[1, 1, 2],
[1, 2, 3],
[1, 2, 3],
[2, 2, 3]])
In [58]: mask
Out[58]: array([ True, True, True, False, True], dtype=bool)
In [59]: stepped_idx
Out[59]: array([0, 1, 2, 2, 4])
In [60]: lexsort_idx.argsort()
Out[60]: array([2, 0, 4, 3, 1])
In [61]: stepped_idx[lexsort_idx.argsort()]
Out[61]: array([2, 0, 4, 2, 1])
Performance boost
For more performance efficiency to compute lexsort_idx.argsort(), we could use and this is identical to the original code in last 4 lines -
def argsort_unique(idx):
# Original idea : http://stackoverflow.com/a/41242285/3293881 by #Andras
n = idx.size
sidx = np.empty(n,dtype=int)
sidx[idx] = np.arange(n)
return sidx
Thus, lexsort_idx.argsort() could be alternatively computed with argsort_unique(lexsort_idx).
Runtime test
Applying few more optimization tricks, we would have a version like so -
def numpy_app(values):
lexsort_idx = np.lexsort(values.T[::-1])
lexsort_v = values[lexsort_idx]
mask = np.concatenate(( [False],(lexsort_v[1:] == lexsort_v[:-1]).all(1) ))
stepped_idx = np.arange(mask.size)
stepped_idx[mask] = 0
np.maximum.accumulate(stepped_idx, out=stepped_idx)
return stepped_idx[argsort_unique(lexsort_idx)]
#Warren Weckesser's rankdata based method as a func for timings -
def scipy_app(values):
v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1])))
return rankdata(v, method='min') - 1
Timings -
In [97]: a = np.random.randint(0,9,(10000,3))
In [98]: out1 = numpy_app(a)
In [99]: out2 = scipy_app(a)
In [100]: np.allclose(out1, out2)
Out[100]: True
In [101]: %timeit scipy_app(a)
100 loops, best of 3: 5.32 ms per loop
In [102]: %timeit numpy_app(a)
100 loops, best of 3: 1.96 ms per loop
Here's a way to do it using scipy.stats.rankdata (with method='min'), by viewing the 2-d array as a 1-d structured array:
In [15]: values
Out[15]:
array([[1, 2, 3],
[1, 1, 1],
[2, 2, 3],
[1, 2, 3],
[1, 1, 2]])
In [16]: v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1])))
In [17]: rankdata(v, method='min') - 1
Out[17]: array([2, 0, 4, 2, 1])

How to select value from array that is closest to value in array using vectorization?

I have an array of values that I want to replace with from an array of choices based on which choice is linearly closest.
The catch is the size of the choices is defined at runtime.
import numpy as np
a = np.array([[0, 0, 0], [4, 4, 4], [9, 9, 9]])
choices = np.array([1, 5, 10])
If choices was static in size, I would simply use np.where
d = np.where(np.abs(a - choices[0]) > np.abs(a - choices[1]),
np.where(np.abs(a - choices[0]) > np.abs(a - choices[2]), choices[0], choices[2]),
np.where(np.abs(a - choices[1]) > np.abs(a - choices[2]), choices[1], choices[2]))
To get the output:
>>d
>>[[1, 1, 1], [5, 5, 5], [10, 10, 10]]
Is there a way to do this more dynamically while still preserving the vectorization.
Subtract choices from a, find the index of the minimum of the result, substitute.
a = np.array([[0, 0, 0], [4, 4, 4], [9, 9, 9]])
choices = np.array([1, 5, 10])
b = a[:,:,None] - choices
np.absolute(b,b)
i = np.argmin(b, axis = -1)
a = choices[i]
print a
>>>
[[ 1 1 1]
[ 5 5 5]
[10 10 10]]
a = np.array([[0, 3, 0], [4, 8, 4], [9, 1, 9]])
choices = np.array([1, 5, 10])
b = a[:,:,None] - choices
np.absolute(b,b)
i = np.argmin(b, axis = -1)
a = choices[i]
print a
>>>
[[ 1 1 1]
[ 5 10 5]
[10 1 10]]
>>>
The extra dimension was added to a so that each element of choices would be subtracted from each element of a. choices was broadcast against a in the third dimension, This link has a decent graphic. b.shape is (3,3,3). EricsBroadcastingDoc is a pretty good explanation and has a graphic 3-d example at the end.
For the second example:
>>> print b
[[[ 1 5 10]
[ 2 2 7]
[ 1 5 10]]
[[ 3 1 6]
[ 7 3 2]
[ 3 1 6]]
[[ 8 4 1]
[ 0 4 9]
[ 8 4 1]]]
>>> print i
[[0 0 0]
[1 2 1]
[2 0 2]]
>>>
The final assignment uses an Index Array or Integer Array Indexing.
In the second example, notice that there was a tie for element a[0,1] , either one or five could have been substituted.
To explain wwii's excellent answer in a little more detail:
The idea is to create a new dimension which does the job of comparing each element of a to each element in choices using numpy broadcasting. This is easily done for an arbitrary number of dimensions in a using the ellipsis syntax:
>>> b = np.abs(a[..., np.newaxis] - choices)
array([[[ 1, 5, 10],
[ 1, 5, 10],
[ 1, 5, 10]],
[[ 3, 1, 6],
[ 3, 1, 6],
[ 3, 1, 6]],
[[ 8, 4, 1],
[ 8, 4, 1],
[ 8, 4, 1]]])
Taking argmin along the axis you just created (the last axis, with label -1) gives you the desired index in choices that you want to substitute:
>>> np.argmin(b, axis=-1)
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
Which finally allows you to choose those elements from choices:
>>> d = choices[np.argmin(b, axis=-1)]
>>> d
array([[ 1, 1, 1],
[ 5, 5, 5],
[10, 10, 10]])
For a non-symmetric shape:
Let's say a had shape (2, 5):
>>> a = np.arange(10).reshape((2, 5))
>>> a
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
Then you'd get:
>>> b = np.abs(a[..., np.newaxis] - choices)
>>> b
array([[[ 1, 5, 10],
[ 0, 4, 9],
[ 1, 3, 8],
[ 2, 2, 7],
[ 3, 1, 6]],
[[ 4, 0, 5],
[ 5, 1, 4],
[ 6, 2, 3],
[ 7, 3, 2],
[ 8, 4, 1]]])
This is hard to read, but what it's saying is, b has shape:
>>> b.shape
(2, 5, 3)
The first two dimensions came from the shape of a, which is also (2, 5). The last dimension is the one you just created. To get a better idea:
>>> b[:, :, 0] # = abs(a - 1)
array([[1, 0, 1, 2, 3],
[4, 5, 6, 7, 8]])
>>> b[:, :, 1] # = abs(a - 5)
array([[5, 4, 3, 2, 1],
[0, 1, 2, 3, 4]])
>>> b[:, :, 2] # = abs(a - 10)
array([[10, 9, 8, 7, 6],
[ 5, 4, 3, 2, 1]])
Note how b[:, :, i] is the absolute difference between a and choices[i], for each i = 1, 2, 3.
Hope that helps explain this a little more clearly.
I love broadcasting and would have gone that way myself too. But, with large arrays, I would like to suggest another approach with np.searchsorted that keeps it memory efficient and thus achieves performance benefits, like so -
def searchsorted_app(a, choices):
lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
cl = np.take(choices,lidx) # Or choices[lidx]
cr = np.take(choices,ridx) # Or choices[ridx]
mask = np.abs(a - cl) > np.abs(a - cr)
cl[mask] = cr[mask]
return cl
Please note that if the elements in choices are not sorted, we need to add in the additional argument sorter with np.searchsorted.
Runtime test -
In [160]: # Setup inputs
...: a = np.random.rand(100,100)
...: choices = np.sort(np.random.rand(100))
...:
In [161]: def broadcasting_app(a, choices): # #wwii's solution
...: return choices[np.argmin(np.abs(a[:,:,None] - choices),-1)]
...:
In [162]: np.allclose(broadcasting_app(a,choices),searchsorted_app(a,choices))
Out[162]: True
In [163]: %timeit broadcasting_app(a, choices)
100 loops, best of 3: 9.3 ms per loop
In [164]: %timeit searchsorted_app(a, choices)
1000 loops, best of 3: 1.78 ms per loop
Related post : Find elements of array one nearest to elements of array two

Categories

Resources