Fastest way to check if two arrays have equivalent rows - Python

I am trying to figure out a better way to check if two 2D arrays contain the same rows. Take the following case for a short example:
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> b
array([[6, 7, 8],
       [3, 4, 5],
       [0, 1, 2]])
In this case b = a[::-1]. To check whether the two arrays contain the same rows:
>>> a = a[np.lexsort((a[:,0], a[:,1], a[:,2]))]
>>> b = b[np.lexsort((b[:,0], b[:,1], b[:,2]))]
>>> np.all(a - b == 0)
True
This is great and fairly fast. However, the issue arises when two rows are merely "close":
array([[-1.57839867,  2.355354  , -1.4225235 ],
       [-0.94728367,  0.        , -1.4225235 ],
       [-1.57839867, -2.355354  , -1.4225215 ]])   <--- note: ends in 215, not 235

array([[-1.57839867, -2.355354  , -1.4225225 ],
       [-1.57839867,  2.355354  , -1.4225225 ],
       [-0.94728367,  0.        , -1.4225225 ]])
Within a tolerance of 1E-5 these two arrays are equal by row, but the lexsort will tell you otherwise. This could be solved with a different sort order, but I would like a more general approach.
I was toying with the idea of:
>>> a = a.reshape(-1, 1, 3)
>>> a - b
array([[[-6, -6, -6],
        [-3, -3, -3],
        [ 0,  0,  0]],

       [[-3, -3, -3],
        [ 0,  0,  0],
        [ 3,  3,  3]],

       [[ 0,  0,  0],
        [ 3,  3,  3],
        [ 6,  6,  6]]])
>>> np.all(np.around(a - b, 5) == 0, axis=2)
array([[False, False,  True],
       [False,  True, False],
       [ True, False, False]], dtype=bool)
>>> np.all(np.any(np.all(np.around(a - b, 5) == 0, axis=2), axis=1))
True
This doesn't tell you whether the arrays are equal by row, just whether every row in b is close to some row in a. The arrays can have several hundred rows, and I need to run this check many times. Any ideas?

Your last code doesn't do what you think it is doing. What it tells you is whether every row in b is close to some row in a. If you change the axes used in the outer calls to np.any and np.all, you can check whether every row in a is close to some row in b. If every row in b is close to a row in a, and every row in a is close to a row in b, then the sets of rows are equal. It is probably not very computationally efficient, but it is fast in numpy for moderately sized arrays:
def same_rows(a, b, tol=5):
    # rows_close[i, j] is True when row j of a equals row i of b to tol decimals
    rows_close = np.all(np.round(a - b[:, None], tol) == 0, axis=-1)
    # every row of b must match some row of a, and vice versa
    return (np.all(np.any(rows_close, axis=-1), axis=-1) and
            np.all(np.any(rows_close, axis=0), axis=0))
>>> rows, cols = 5, 3
>>> a = np.arange(rows * cols).reshape(rows, cols)
>>> b = np.arange(rows)
>>> np.random.shuffle(b)
>>> b = a[b]
>>> a
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
>>> b
array([[ 9, 10, 11],
       [ 3,  4,  5],
       [ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
>>> same_rows(a, b)
True
>>> b[0] = b[1]
>>> b
array([[ 3,  4,  5],
       [ 3,  4,  5],
       [ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
>>> same_rows(a, b) # not all rows in a are close to a row in b
False
And for not too big arrays, performance is reasonable, even though it has to build an array of shape (rows, rows, cols):
In [2]: rows, cols = 1000, 10
In [3]: a = np.arange(rows * cols).reshape(rows, cols)
In [4]: b = np.arange(rows)
In [5]: np.random.shuffle(b)
In [6]: b = a[b]
In [7]: %timeit same_rows(a, b)
10 loops, best of 3: 103 ms per loop
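If the O(rows²) memory of the pairwise comparison becomes a problem, a sort-based variant is possible. Here is a minimal sketch (my addition, not from the answer above) that quantizes to the tolerance first and then compares lexsorted rows; beware that two values straddling a rounding boundary can land in different bins, so it is slightly stricter than a true tolerance test:
import numpy as np

def same_rows_sorted(a, b, tol=5):
    # round to tol decimals so near-equal values compare equal,
    # then sort the rows lexicographically and compare elementwise
    ar = np.round(a, tol)
    br = np.round(b, tol)
    ar = ar[np.lexsort(ar.T)]
    br = br[np.lexsort(br.T)]
    return ar.shape == br.shape and bool(np.all(ar == br))
This runs in O(rows * log(rows)) time and never builds the (rows, rows, cols) intermediate.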

Related

Unexpected result from Numpy matrix insert: how does this work?

My goal was to insert a column to the right of a numpy matrix. However, I found that the code I was using was inserting two columns rather than just one.
# This one results in a 4x1 matrix, as expected.
>>> np.insert(np.matrix([[0],[0]]), 1, np.matrix([[0],[0]]), 0)
matrix([[0],
        [0],
        [0],
        [0]])

# I would expect this line to return a 2x2 matrix, but it returns a 2x3 matrix instead.
>>> np.insert(np.matrix([[0],[0]]), 1, np.matrix([[0],[0]]), 1)
matrix([[0, 0, 0],
        [0, 0, 0]])
Why do I get the above, in the second example, instead of [[0,0], [0,0]]?
While new use of np.matrix is discouraged, we get the same result with np.array:
In [41]: np.insert(np.array([[1],[2]]), 1, np.array([[10],[20]]), 0)
Out[41]:
array([[ 1],
       [10],
       [20],
       [ 2]])

In [42]: np.insert(np.array([[1],[2]]), 1, np.array([[10],[20]]), 1)
Out[42]:
array([[ 1, 10, 20],
       [ 2, 10, 20]])

In [44]: np.insert(np.array([[1],[2]]), 1, np.array([10,20]), 1)
Out[44]:
array([[ 1, 10],
       [ 2, 20]])
Inserting at [1] instead:
In [46]: np.insert(np.array([[1],[2]]), [1], np.array([[10],[20]]), 1)
Out[46]:
array([[ 1, 10],
       [ 2, 20]])

In [47]: np.insert(np.array([[1],[2]]), [1], np.array([10,20]), 1)
Out[47]:
array([[ 1, 10, 20],
       [ 2, 10, 20]])
np.insert is a complex function written in Python, so we need to look at its code to see how values get mapped onto the target space. The docs elaborate on the difference between inserting at 1 and at [1], but offhand I don't see an explanation of how the shape of values matters.
Difference between sequence and scalars (this is the example from the np.insert docstring, where a = np.array([[1, 1], [2, 2], [3, 3]])):
>>> np.insert(a, [1], [[1],[2],[3]], axis=1)
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])
>>> np.array_equal(np.insert(a, 1, [1, 2, 3], axis=1),
...                np.insert(a, [1], [[1],[2],[3]], axis=1))
True
When adding an array at the end of another, I'd use concatenate (or one of its stack variants) rather than insert. None of these operate in-place.
In [48]: np.concatenate([np.array([[1],[2]]), np.array([[10],[20]])], axis=1)
Out[48]:
array([[ 1, 10],
       [ 2, 20]])
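For completeness, a small sketch of those stack variants on the same inputs; for 2-D inputs they reduce to concatenate along axis=1:
import numpy as np

a = np.array([[1],[2]])
b = np.array([[10],[20]])
np.hstack([a, b])         # array([[ 1, 10], [ 2, 20]])
np.column_stack([a, b])   # same result for 2-D inputs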

Python: building an iterator over a grid from the grid nodes

I am using Python 3. I would like to start from a list of nodes in 3 dimensions and build a grid. I would like to avoid the construct:
import numpy as np

l = np.zeros((len(xv), len(yv), len(zv)))
for (i, x) in zip(range(len(xv)), xv):
    for (j, y) in zip(range(len(yv)), yv):
        for (k, z) in zip(range(len(zv)), zv):
            l[i, j, k] = func(x, y, z)
I am looking for a more compact version of the above lines: an iterator like zip, but one that iterates over all possible tuples in the grid.
You can use something like np.meshgrid to construct your grid. Assuming that func is properly vectorized, that should be good enough to construct l:
X, Y, Z = np.meshgrid(xv, yv, zv)
l = func(X, Y, Z)
If func isn't vectorized, you can construct a vectorized version using np.vectorize.
Also note that you might even be able to get away without using np.meshgrid through judicious use of np.newaxis:
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> y
array([0, 1, 2])
>>> z
array([0, 1])
>>> def func(x, y, z):
...     return x + y + z
...
>>> vfunc = np.vectorize(func)
>>> vfunc(x[:, np.newaxis, np.newaxis], y[np.newaxis, :, np.newaxis], z[np.newaxis, np.newaxis, :])
array([[[ 0,  1],
        [ 1,  2],
        [ 2,  3]],

       [[ 1,  2],
        [ 2,  3],
        [ 3,  4]],

       [[ 2,  3],
        [ 3,  4],
        [ 4,  5]],

       [[ 3,  4],
        [ 4,  5],
        [ 5,  6]],

       [[ 4,  5],
        [ 5,  6],
        [ 6,  7]],

       [[ 5,  6],
        [ 6,  7],
        [ 7,  8]],

       [[ 6,  7],
        [ 7,  8],
        [ 8,  9]],

       [[ 7,  8],
        [ 8,  9],
        [ 9, 10]],

       [[ 8,  9],
        [ 9, 10],
        [10, 11]],

       [[ 9, 10],
        [10, 11],
        [11, 12]]])
As pointed out in the comments, np.ix_ can be used as a shortcut instead of np.newaxis:
vfunc(*np.ix_(xv, yv, zv))
Also note that with a function this simple, np.vectorize isn't necessary and will actually hurt performance a lot...
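To illustrate (a sketch restating the example above without np.vectorize): since func is built from ufuncs, plain broadcasting already evaluates it over the whole grid:
import numpy as np

x = np.arange(10)
y = np.arange(3)
z = np.arange(2)
# same values as vfunc(...) above, but without the per-element Python call
res = x[:, None, None] + y[None, :, None] + z[None, None, :]
res.shape    # (10, 3, 2)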
Say your func is something like:
def func(x, y, z, indices):
    xv, yv, zv = [i[j] for i, j in zip((x, y, z), indices)]
    # do a calc with the values for the specific x, y, z points
Hook the lists you want to it using partial:
from functools import partial
f = partial(func, x=xv, y=yv, z=zv)
Now just do a map supplying the index tuples and you're set!
import itertools
l = list(map(lambda idx: f(indices=idx),
             itertools.product(range(len(xv)), range(len(yv)), range(len(zv)))))
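A runnable sketch of this pattern, with a hypothetical toy func that just sums the selected grid values:
import itertools
from functools import partial
import numpy as np

xv, yv, zv = np.arange(2), np.arange(3), np.arange(4)

def func(x, y, z, indices):
    # pick out the grid point named by the index triple
    xi, yi, zi = [arr[j] for arr, j in zip((x, y, z), indices)]
    return xi + yi + zi    # stand-in for the real calculation

f = partial(func, x=xv, y=yv, z=zv)
idx = itertools.product(range(len(xv)), range(len(yv)), range(len(zv)))
l = np.array([f(indices=t) for t in idx]).reshape(len(xv), len(yv), len(zv))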
With a simple function:
def foo(x, y, z):
    return x**2 + y*2 + z
and space defined by:
In [328]: xv, yv, zv = [np.arange(i) for i in [2,3,4]]
This iteration is as fast as any, even if it is a bit wordy:
In [329]: res = np.zeros((xv.shape[0], yv.shape[0], zv.shape[0]), dtype=int)
In [330]: for i,x in enumerate(xv):
     ...:     for j,y in enumerate(yv):
     ...:         for k,z in enumerate(zv):
     ...:             res[i,j,k] = foo(x,y,z)

In [331]: res
Out[331]:
array([[[0, 1, 2, 3],
        [2, 3, 4, 5],
        [4, 5, 6, 7]],

       [[1, 2, 3, 4],
        [3, 4, 5, 6],
        [5, 6, 7, 8]]])
As @mgilson explains, you can generate 3 arrays that define the 3d space with:
In [332]: I,J,K = np.meshgrid(xv,yv,zv,indexing='ij',sparse=True)
In [333]: I.shape
Out[333]: (2, 1, 1)
In [334]: J.shape
Out[334]: (1, 3, 1)
In [335]: I,J,K = np.ix_(xv,yv,zv) # equivalently
In [336]: I.shape
Out[336]: (2, 1, 1)
foo was written so it works with arrays just as well as with scalars, so:
In [337]: res1 = foo(I,J,K)
In [338]: res1
Out[338]:
array([[[0, 1, 2, 3],
        ...
        [5, 6, 7, 8]]])
So if your function fits this pattern, use it. Look at those I,J,K arrays, with and without sparse.
There are other tools for generating the i,j,k sets. For example:
for i,j,k in np.ndindex(res.shape):
    res[i,j,k] = foo(xv[i], yv[j], zv[k])

for i,j,k in itertools.product(range(2), range(3), range(4)):
    res[i,j,k] = foo(xv[i], yv[j], zv[k])
itertools.product is fast, especially when used as list(product(...)). But the iteration mechanism isn't that important; it's the repeated call to foo that takes up most of the time.
ndindex actually uses nditer, which can also be used directly:
it = np.nditer([I, J, K, None], flags=['external_loop', 'buffered'])
for x, y, z, r in it:
    r[...] = foo(x, y, z)
it.operands[-1]    # the result array allocated for the None operand
nditer is best described in:
https://docs.scipy.org/doc/numpy/reference/arrays.nditer.html. It is best used as a stepping stone toward a cython version; otherwise it doesn't have any speed advantage (though with this foo and 'external_loop', it is as fast as foo(I,J,K)). Note that this doesn't need the indices (but see 'multi_index').
And yes, there's vectorize. Convenient, but not a speedy solution.
vfoo = np.vectorize(foo, otypes=['int'])
vfoo(I, J, K)

How to select value from array that is closest to value in array using vectorization?

I have an array of values that I want to replace with values from an array of choices, based on which choice is linearly closest. The catch is that the size of choices is defined at runtime.
import numpy as np
a = np.array([[0, 0, 0], [4, 4, 4], [9, 9, 9]])
choices = np.array([1, 5, 10])
If choices were static in size, I would simply use np.where:
d = np.where(np.abs(a - choices[0]) > np.abs(a - choices[1]),
             np.where(np.abs(a - choices[0]) > np.abs(a - choices[2]), choices[0], choices[2]),
             np.where(np.abs(a - choices[1]) > np.abs(a - choices[2]), choices[1], choices[2]))
To get the output:
>>> d
[[1, 1, 1], [5, 5, 5], [10, 10, 10]]
Is there a way to do this more dynamically while still preserving the vectorization?
Subtract choices from a, find the index of the minimum of the result, substitute.
a = np.array([[0, 0, 0], [4, 4, 4], [9, 9, 9]])
choices = np.array([1, 5, 10])

b = a[:,:,None] - choices
np.absolute(b, b)              # in-place absolute value
i = np.argmin(b, axis=-1)      # index of the closest choice for each element
a = choices[i]
print(a)
>>>
[[ 1  1  1]
 [ 5  5  5]
 [10 10 10]]
a = np.array([[0, 3, 0], [4, 8, 4], [9, 1, 9]])
choices = np.array([1, 5, 10])

b = a[:,:,None] - choices
np.absolute(b, b)
i = np.argmin(b, axis=-1)
a = choices[i]
print(a)
>>>
[[ 1  1  1]
 [ 5 10  5]
 [10  1 10]]
The extra dimension was added to a so that each element of choices would be subtracted from each element of a; choices was broadcast against a in the third dimension, giving b a shape of (3, 3, 3). EricsBroadcastingDoc is a pretty good explanation of broadcasting and has a graphic 3-d example at the end.
For the second example:
>>> print(b)
[[[ 1  5 10]
  [ 2  2  7]
  [ 1  5 10]]

 [[ 3  1  6]
  [ 7  3  2]
  [ 3  1  6]]

 [[ 8  4  1]
  [ 0  4  9]
  [ 8  4  1]]]
>>> print(i)
[[0 0 0]
 [1 2 1]
 [2 0 2]]
The final assignment uses an index array, i.e. integer array indexing. In the second example, notice that there was a tie for element a[0,1]: either 1 or 5 could have been substituted.
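To spell out that last step: indexing choices with the integer array i returns an array of i's shape, holding choices[i[m, n]] at each position:
choices = np.array([1, 5, 10])
i = np.array([[0, 0, 0],
              [1, 2, 1],
              [2, 0, 2]])
choices[i]
# array([[ 1,  1,  1],
#        [ 5, 10,  5],
#        [10,  1, 10]])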
To explain wwii's excellent answer in a little more detail:
The idea is to create a new dimension which does the job of comparing each element of a to each element in choices using numpy broadcasting. This is easily done for an arbitrary number of dimensions in a using the ellipsis syntax:
>>> b = np.abs(a[..., np.newaxis] - choices)
>>> b
array([[[ 1,  5, 10],
        [ 1,  5, 10],
        [ 1,  5, 10]],

       [[ 3,  1,  6],
        [ 3,  1,  6],
        [ 3,  1,  6]],

       [[ 8,  4,  1],
        [ 8,  4,  1],
        [ 8,  4,  1]]])
Taking argmin along the axis you just created (the last axis, with label -1) gives you the desired index in choices that you want to substitute:
>>> np.argmin(b, axis=-1)
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2]])
Which finally allows you to choose those elements from choices:
>>> d = choices[np.argmin(b, axis=-1)]
>>> d
array([[ 1,  1,  1],
       [ 5,  5,  5],
       [10, 10, 10]])
For a non-symmetric shape:
Let's say a had shape (2, 5):
>>> a = np.arange(10).reshape((2, 5))
>>> a
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
Then you'd get:
>>> b = np.abs(a[..., np.newaxis] - choices)
>>> b
array([[[ 1,  5, 10],
        [ 0,  4,  9],
        [ 1,  3,  8],
        [ 2,  2,  7],
        [ 3,  1,  6]],

       [[ 4,  0,  5],
        [ 5,  1,  4],
        [ 6,  2,  3],
        [ 7,  3,  2],
        [ 8,  4,  1]]])
This is hard to read, but what it says is that b has shape:
>>> b.shape
(2, 5, 3)
The first two dimensions came from the shape of a, which is also (2, 5). The last dimension is the one you just created. To get a better idea:
>>> b[:, :, 0]    # = abs(a - 1)
array([[1, 0, 1, 2, 3],
       [4, 5, 6, 7, 8]])
>>> b[:, :, 1]    # = abs(a - 5)
array([[5, 4, 3, 2, 1],
       [0, 1, 2, 3, 4]])
>>> b[:, :, 2]    # = abs(a - 10)
array([[10,  9,  8,  7,  6],
       [ 5,  4,  3,  2,  1]])
Note how b[:, :, i] is the absolute difference between a and choices[i], for each i = 0, 1, 2.
Hope that helps explain this a little more clearly.
I love broadcasting and would have gone that way myself too. But with large arrays, I would like to suggest another approach with np.searchsorted that keeps things memory efficient and thus achieves performance benefits, like so -
def searchsorted_app(a, choices):
    lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
    ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
    cl = np.take(choices, lidx)    # or choices[lidx]
    cr = np.take(choices, ridx)    # or choices[ridx]
    mask = np.abs(a - cl) > np.abs(a - cr)
    cl[mask] = cr[mask]
    return cl
Please note that if the elements in choices are not sorted, we need to add in the additional argument sorter with np.searchsorted.
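A minimal sketch of that unsorted case (my addition): searchsorted with sorter returns positions in the sorted order, so lookups have to go through the sorted copy:
order = np.argsort(choices)              # permutation that sorts choices
sorted_choices = choices[order]
lidx = np.searchsorted(choices, a, 'left', sorter=order).clip(max=choices.size-1)
cl = np.take(sorted_choices, lidx)       # left-side candidates, as before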
Runtime test -
In [160]: # Setup inputs
     ...: a = np.random.rand(100,100)
     ...: choices = np.sort(np.random.rand(100))

In [161]: def broadcasting_app(a, choices):   # @wwii's solution
     ...:     return choices[np.argmin(np.abs(a[:,:,None] - choices), -1)]
In [162]: np.allclose(broadcasting_app(a,choices),searchsorted_app(a,choices))
Out[162]: True
In [163]: %timeit broadcasting_app(a, choices)
100 loops, best of 3: 9.3 ms per loop
In [164]: %timeit searchsorted_app(a, choices)
1000 loops, best of 3: 1.78 ms per loop
Related post: Find elements of array one nearest to elements of array two

Numpy: Flatten some columns of a 2D array

Suppose I have a numpy array as below:
>>> a = np.asarray([[1,2,3],[1,4,3],[2,5,4],[2,7,5]])
>>> a
array([[1, 2, 3],
       [1, 4, 3],
       [2, 5, 4],
       [2, 7, 5]])
How can I flatten columns 2 and 3 for each unique element in column 1, like below:
array([[1, 2, 3, 4, 3],
       [2, 5, 4, 7, 5]])
Thank you for your help.
Another option using list comprehension:
np.array([np.insert(a[a[:,0] == k, 1:].flatten(), 0, k) for k in np.unique(a[:,0])])
# array([[1, 2, 3, 4, 3],
#        [2, 5, 4, 7, 5]])
import numpy as np

a = np.asarray([[1,2,3],[1,4,3],[2,5,4],[2,7,5]])

d = {}
for row in a:
    # append columns 1: of each row to the entry for its column-0 key
    d[row[0]] = np.concatenate((d.get(row[0], []), row[1:]))
r = np.array([np.concatenate(([key], d[key])) for key in d])
print(r)
This prints (as floats, since the empty default list in d.get promotes the concatenation to float64):
[[ 1.  2.  3.  4.  3.]
 [ 2.  5.  4.  7.  5.]]
Since, as posted in the comments, we know that each unique element in column-0 has a fixed number of rows (by which I assume the same number of rows is meant), we can use a vectorized approach. We sort the rows based on column-0 and look for shifts along it; a shift signifies a group change, and its position gives us the exact number of rows per unique element in column-0. Let's call it L. Finally, we slice the sorted array to select columns 1 and 2, and group L rows together by reshaping. Thus, the implementation would be -
sa = a[a[:,0].argsort()]
L = np.unique(sa[:,0], return_index=True)[1][1]
out = np.column_stack((sa[::L,0], sa[:,1:].reshape(-1,2*L)))
For a further performance boost, we can use np.diff to calculate L, like so -
L = np.where(np.diff(sa[:,0])>0)[0][0]+1
Sample run -
In [103]: a
Out[103]:
array([[1, 2, 3],
       [3, 7, 8],
       [1, 4, 3],
       [2, 5, 4],
       [3, 8, 2],
       [2, 7, 5]])

In [104]: sa = a[a[:,0].argsort()]
     ...: L = np.unique(sa[:,0], return_index=True)[1][1]
     ...: out = np.column_stack((sa[::L,0], sa[:,1:].reshape(-1,2*L)))

In [105]: out
Out[105]:
array([[1, 2, 3, 4, 3],
       [2, 5, 4, 7, 5],
       [3, 7, 8, 8, 2]])
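Since this approach assumes every key in column 0 appears the same number of times, a quick sanity check (my addition, not part of the original answer) can guard the reshape:
# all groups must be the same size, or the reshape will mix groups
_, counts = np.unique(a[:,0], return_counts=True)
assert (counts == counts[0]).all(), "unequal group sizes"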

How to create a sub-matrix in numpy

I have a two-dimensional NxM numpy array:
a = np.ndarray((N,M), dtype=np.float32)
I would like to make a sub-matrix with a selected set of rows and columns. For each dimension I have as input either a binary vector or a vector of indices. How can I do this most efficiently?
Examples
a = array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])
cols = [True, False, True]
rows = [False, False, True, True]
cols_i = [0,2]
rows_i = [2,3]

result = wanted_function(a, cols, rows) or wanted_function_i(a, cols_i, rows_i)
result = array([[ 2,  3],
                [10, 11]])
There are several ways to get a submatrix in numpy:
In [35]: ri = [0,2]
    ...: ci = [2,3]
    ...: a[np.reshape(ri, (-1, 1)), ci]
Out[35]:
array([[ 2,  3],
       [10, 11]])

In [36]: a[np.ix_(ri, ci)]
Out[36]:
array([[ 2,  3],
       [10, 11]])

In [37]: s = a[np.ix_(ri, ci)]

In [38]: np.may_share_memory(a, s)
Out[38]: False
Note that the submatrix you get is a new copy, not a view of the original matrix.
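A quick way to see the copy semantics, continuing the session above (a sketch):
In [39]: s[0, 0] = -1

In [40]: a[0, 2]    # unchanged: writing to the copy s leaves a alone
Out[40]: 2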
You only need to make cols and rows numpy arrays, and then you can just index with []:
import numpy as np

a = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])
# note: with the question's naming, cols (length 3) masks the rows of a
# and rows (length 4) masks its columns
cols = np.array([True, False, True])
rows = np.array([False, False, True, True])

result = a[cols][:, rows]
print(result)
print(type(result))
# [[ 2  3]
#  [10 11]]
# <class 'numpy.ndarray'>
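Note that np.ix_ (from the first answer) also accepts boolean masks directly, collapsing the two indexing steps into one:
result = a[np.ix_(cols, rows)]    # same [[2, 3], [10, 11]] result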
