Intersection of 2-d numpy arrays - python

I am looking for a way to get the intersection between two 2-dimensional numpy.array of shape (n_1, m) and (n_2, m). Note that n_1 and n_2 can differ but m is the same for both arrays. Here are two minimal examples with the expected results:
import numpy as np
array1a = np.array([[2], [2], [5], [1]])
array1b = np.array([[5], [2]])
array_intersect(array1a, array1b)
## array([[2],
## [5]])
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
array_intersect(array2a, array2b)
## array([[2, 1],
## [3, 3]])
If someone has a clue on how I should implement the array_intersect function, I would be very grateful!

How about using sets?
import numpy as np
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
a = set((tuple(i) for i in array2a))
b = set((tuple(i) for i in array2b))
a.intersection(b) # {(2, 1), (3, 3)}
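The set intersection above yields a set of tuples rather than an array. A minimal sketch of converting it back, assuming a lexicographically sorted row order is acceptable (the names `common` and `result` are only illustrative):

```python
import numpy as np

array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])

# Rows become hashable tuples so they can be stored in a set.
common = {tuple(row) for row in array2a} & {tuple(row) for row in array2b}

# sorted() gives a deterministic, lexicographic row order.
result = np.array(sorted(common))
print(result)
```

Note that an empty intersection produces an array of shape (0,) here, not (0, m), so that case may need separate handling.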

Another approach would be to harness the broadcasting feature
import numpy as np
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
test = (array2a[:, None] == array2b).all(axis=-1) # test[i, j]: row i of array2a equals row j of array2b
print(array2b[test.any(axis=0)]) # [[2 1]
# [3 3]]
but this is less readable imo. [edit]: or use the unique and set combination. In short, there are many options!
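The broadcasting idea can be wrapped into a function. This is only a sketch: the name `array_intersect_broadcast` is made up here, and the intermediate boolean array has shape (n_1, n_2, m), so it is only practical for moderately sized inputs. The important detail is that all columns must match within the same pair of rows.

```python
import numpy as np

def array_intersect_broadcast(a, b):
    # eq[i, j, k] is True when a[i, k] == b[j, k].
    eq = a[:, None, :] == b[None, :, :]
    # A pair of rows matches only if every column matches within that pair.
    row_match = eq.all(axis=-1)  # shape (n_1, n_2)
    # Keep rows of b matched by at least one row of a, then drop duplicates.
    return np.unique(b[row_match.any(axis=0)], axis=0)

array1a = np.array([[2], [2], [5], [1]])
array1b = np.array([[5], [2]])
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
print(array_intersect_broadcast(array1a, array1b))
print(array_intersect_broadcast(array2a, array2b))
```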

Here's a way to do without any loops or list comprehensions, assuming you have scipy installed (I haven't tested for speed):
In [31]: from scipy.spatial.distance import cdist
In [32]: np.unique(array1a[np.where(cdist(array1a, array1b) == 0)[0]], axis=0)
Out[32]:
array([[2],
[5]])
In [33]: np.unique(array2a[np.where(cdist(array2a, array2b) == 0)[0]], axis=0)
Out[33]:
array([[2, 1],
[3, 3]])

Construct a set of tuples from the first array and test each line of the second array. Or vice versa.
def array_intersect(a, b):
    s = {tuple(x) for x in a}
    return np.unique([x for x in b if tuple(x) in s], axis=0)

The numpy-indexed package (disclaimer: I am its author) was created with the exact purpose of providing such functionality in an expressive and efficient manner:
import numpy_indexed as npi
npi.intersect(a, b)
Note that the implementation is fully vectorized; that is, there are no loops over the arrays in Python.
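For readers who do not have that package installed, a NumPy-only vectorized sketch is possible by viewing each row as one structured element and calling np.intersect1d. This is not how numpy-indexed is implemented; the function name `intersect_rows` is hypothetical, and the sketch assumes both arrays are 2-D with the same dtype and the same number of columns.

```python
import numpy as np

def intersect_rows(a, b):
    # Assumption: a and b are 2-D, share one dtype, and have equal column counts.
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    ncols = a.shape[1]
    # One structured field per column, so a whole row compares as a single item.
    row_dtype = np.dtype([(f"f{i}", a.dtype) for i in range(ncols)])
    common = np.intersect1d(a.view(row_dtype), b.view(row_dtype))
    # View the structured result back as ordinary rows.
    return common.view(a.dtype).reshape(-1, ncols)

array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
print(intersect_rows(array2a, array2b))
```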

arr1 = np.arange(20000).reshape(-1,2)
arr2 = arr1.copy()
np.random.shuffle(arr2)
print(len(arr1)) #10000
%%timeit
res = np.array([x
                for x in set(tuple(x) for x in arr1) & set(tuple(x) for x in arr2)
                ])
83.7 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Related

Is there a better way to vstack a numpy array from an empty array, like a list array?

I wish to vstack a numpy.array (like building a list), but I cannot initialize the numpy.array with the correct shape to use numpy.append (numpy.empty, numpy.zeros, numpy.empty_like, etc. did not do the trick). In the end I came up with the two pieces of code below. Is there something more pythonic? I am using Python 3.6.9.
import numpy as np
a = []
n = 4
for i in range(n):
    '''
    some calculation resulting, for example, in a numpy.array([i, i+1, i+2])
    '''
    a.append(np.array([i, i+1, i+2]))
a = np.array(a).reshape(n, 3)  # n rows of 3 values each
print(a)
or, because I prefer to maintain it as a numpy array inside the loop:
import numpy as np
a = np.array([])
n = 4
for i in range(n):
    '''
    some calculation resulting, for example, in a numpy.array [i, i+1, i+2]
    '''
    if a.size == 0:
        a = np.array([i, i+1, i+2])
    else:
        a = np.vstack((a, np.array([i, i+1, i+2])))
print(a)
both output:
[[0 1 2]
[1 2 3]
[2 3 4]
[3 4 5]]
Your first use, with list append:
In [146]: alist=[]
In [147]: for i in range(4):
     ...:     alist.append(np.arange(i,i+3))
     ...:
In [148]: alist
Out[148]: [array([0, 1, 2]), array([1, 2, 3]), array([2, 3, 4]), array([3, 4, 5])]
and make the array:
In [149]: np.array(alist)
Out[149]:
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4],
[3, 4, 5]])
or since vstack is happy with a list of arrays:
In [150]: np.vstack(alist)
Out[150]:
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4],
[3, 4, 5]])
You could use vstack in the loop:
In [151]: arr = np.zeros((0,3),int)
In [152]: for i in range(4):
     ...:     arr = np.vstack((arr, np.arange(i,i+3)))
     ...:
In [153]: arr
Out[153]:
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4],
[3, 4, 5]])
This has two problems:
it is slower; list append operates in place, simply adding a pointer to the list, whereas vstack makes a whole new array each time, with a full copy!
it is harder to initialize, as you found out. You actually have to understand array shapes, and what concatenate does when it combines 2 or more arrays. Here I started with a (0,3) array.
np.array([np.arange(i, i+3) for i in range(n)])
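If the number of rows is known before the loop starts, another option is to preallocate the array and fill it row by row, which avoids both the list and the repeated copies made by vstack. A minimal sketch, assuming each iteration yields exactly 3 values (the name `out` is illustrative):

```python
import numpy as np

n = 4
# Allocate the final (n, 3) shape once; the contents are overwritten below.
out = np.empty((n, 3), dtype=int)
for i in range(n):
    # Stand-in for the real per-row calculation.
    out[i] = np.arange(i, i + 3)
print(out)
```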

Numpy sort two arrays together with one array as the keys in axis 1 [duplicate]

I'm trying to get the indices to sort a multidimensional array by the last axis, e.g.
>>> a = np.array([[3,1,2],[8,9,2]])
And I'd like indices i such that,
>>> a[i]
array([[1, 2, 3],
[2, 8, 9]])
Based on the documentation of numpy.argsort I thought it should do this, but I'm getting the error:
>>> a[np.argsort(a)]
IndexError: index 2 is out of bounds for axis 0 with size 2
Edit: I need to rearrange other arrays of the same shape (e.g. an array b such that a.shape == b.shape) in the same way... so that
>>> b = np.array([[0,5,4],[3,9,1]])
>>> b[i]
array([[5, 4, 0],
       [1, 3, 9]])
Solution:
>>> a[np.arange(np.shape(a)[0])[:,np.newaxis], np.argsort(a)]
array([[1, 2, 3],
[2, 8, 9]])
You got it right, though I wouldn't describe it as cheating the indexing.
Maybe this will help make it clearer:
In [544]: i=np.argsort(a,axis=1)
In [545]: i
Out[545]:
array([[1, 2, 0],
[2, 0, 1]])
i is the order that we want, for each row. That is:
In [546]: a[0, i[0,:]]
Out[546]: array([1, 2, 3])
In [547]: a[1, i[1,:]]
Out[547]: array([2, 8, 9])
To do both indexing steps at once, we have to use a 'column' index for the 1st dimension.
In [548]: a[[[0],[1]],i]
Out[548]:
array([[1, 2, 3],
[2, 8, 9]])
Another array that could be paired with i is:
In [560]: j=np.array([[0,0,0],[1,1,1]])
In [561]: j
Out[561]:
array([[0, 0, 0],
[1, 1, 1]])
In [562]: a[j,i]
Out[562]:
array([[1, 2, 3],
[2, 8, 9]])
If i identifies the column for each element, then j specifies the row for each element. The [[0],[1]] column array works just as well because it can be broadcast against i.
I think of
np.array([[0],
[1]])
as 'short hand' for j. Together they define the source row and column of each element of the new array. They work together, not sequentially.
The full mapping from a to the new array is:
[a[0,1] a[0,2] a[0,0]
a[1,2] a[1,0] a[1,1]]
def foo(a):
    i = np.argsort(a, axis=1)
    return (np.arange(a.shape[0])[:,None], i)
In [61]: foo(a)
Out[61]:
(array([[0],
[1]]), array([[1, 2, 0],
[2, 0, 1]], dtype=int32))
In [62]: a[foo(a)]
Out[62]:
array([[1, 2, 3],
[2, 8, 9]])
The above answers are now a bit outdated, since new functionality was added in numpy 1.15 to make it simpler; take_along_axis (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.take_along_axis.html) allows you to do:
>>> a = np.array([[3,1,2],[8,9,2]])
>>> np.take_along_axis(a, a.argsort(axis=-1), axis=-1)
array([[1, 2, 3],
       [2, 8, 9]])
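The same call also covers the edit in the question, where a second array b has to be rearranged in the order that sorts a. A short sketch: compute the argsort once and reuse it (the names `order`, `a_sorted` and `b_sorted` are illustrative).

```python
import numpy as np

a = np.array([[3, 1, 2], [8, 9, 2]])
b = np.array([[0, 5, 4], [3, 9, 1]])

# Per-row ordering that sorts a along its last axis.
order = np.argsort(a, axis=-1)

# Apply the same ordering to both arrays.
a_sorted = np.take_along_axis(a, order, axis=-1)
b_sorted = np.take_along_axis(b, order, axis=-1)
print(a_sorted)
print(b_sorted)
```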
I found the answer here, with someone having the same problem. The key is just cheating the indexing to work properly...
>>> a[np.arange(np.shape(a)[0])[:,np.newaxis], np.argsort(a)]
array([[1, 2, 3],
[2, 8, 9]])
You can also use linear indexing, which might be better with performance, like so -
M,N = a.shape
out = b.ravel()[a.argsort(1)+(np.arange(M)[:,None]*N)]
So, a.argsort(1)+(np.arange(M)[:,None]*N) basically are the linear indices that are used to map b to get the desired sorted output for b. The same linear indices could also be used on a for getting the sorted output for a.
Sample run -
In [23]: a = np.array([[3,1,2],[8,9,2]])
In [24]: b = np.array([[0,5,4],[3,9,1]])
In [25]: M,N = a.shape
In [26]: b.ravel()[a.argsort(1)+(np.arange(M)[:,None]*N)]
Out[26]:
array([[5, 4, 0],
[1, 3, 9]])
Runtime tests -
In [27]: a = np.random.rand(1000,1000)
In [28]: b = np.random.rand(1000,1000)
In [29]: M,N = a.shape
In [30]: %timeit b[np.arange(np.shape(a)[0])[:,np.newaxis], np.argsort(a)]
10 loops, best of 3: 133 ms per loop
In [31]: %timeit b.ravel()[a.argsort(1)+(np.arange(M)[:,None]*N)]
10 loops, best of 3: 96.7 ms per loop

Fill 2-D numpy array with index location

I've been trying to figure out a clean, pythonic way to fill each element of an empty numpy array with the index value(s) of that element, without using for loops. For 1-D, it's easy, you can just use something like np.arange or just a basic range. But at 2-D and higher dimensions, I'm stumped on how to easily do this.
(Edit: Or just build a regular list like this, then np.array(lst) it. I think I just answered my question - use a list comprehension?)
Example:
rows = 4
cols = 4
arr = np.empty((rows, cols, 2)) # 4x4 matrix with [x,y] location
for y in range(rows):
    for x in range(cols):
        arr[y, x] = [y, x]
'''
Expected output:
[[[0,0], [0,1], [0,2], [0,3]],
[[1,0], [1,1], [1,2], [1,3]],
[[2,0], [2,1], [2,2], [2,3]],
[[3,0], [3,1], [3,2], [3,3]]]
'''
What you are showing is a meshgrid of a 4x4 matrix. You can either use np.mgrid, then move its first axis to the end:
np.moveaxis(np.mgrid[:rows,:cols], 0, -1)
#array([[[0, 0],
# [0, 1],
# [0, 2],
# [0, 3]],
# [[1, 0],
# [1, 1],
# [1, 2],
# [1, 3]],
# [[2, 0],
# [2, 1],
# [2, 2],
# [2, 3]],
# [[3, 0],
# [3, 1],
# [3, 2],
# [3, 3]]])
Or use np.meshgrid with matrix indexing ij:
np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij'))
#array([[[0, 0],
# [0, 1],
# [0, 2],
# [0, 3]],
# [[1, 0],
# [1, 1],
# [1, 2],
# [1, 3]],
# [[2, 0],
# [2, 1],
# [2, 2],
# [2, 3]],
# [[3, 0],
# [3, 1],
# [3, 2],
# [3, 3]]])
Another way, using np.indices and concatenate:
np.concatenate([x.reshape(4,4,1) for x in np.indices((4,4))],2)
or with np.dstack
np.dstack(np.indices((4,4)))
Some benchmarking, since you have a ton of possibilities:
def Psidom_mrgid(rows,cols):
    np.mgrid[:rows, :cols].transpose((1, 2, 0))
def Psidom_mesh(rows,cols):
    np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij'))
def Mad_tile(rows,cols):
    r = np.tile(np.arange(rows).reshape(rows, 1), (1, cols))
    c = np.tile(np.arange(cols), (rows, 1))
    result = np.stack((r, c), axis=-1)
def bora_comp(rows,cols):
    x = [[[i, j] for j in range(cols)] for i in range(rows)]
def djk_ind(rows,cols):
    np.concatenate([x.reshape(rows, cols, 1) for x in np.indices((rows, cols))], 2)
def devdev_mgrid(rows,cols):
    index_tuple = np.mgrid[0:rows, 0:cols]
    np.dstack(index_tuple).reshape((rows, cols, 2))
In[8]: %timeit Psidom_mrgid(1000,1000)
100 loops, best of 3: 15 ms per loop
In[9]: %timeit Psidom_mesh(1000,1000)
100 loops, best of 3: 9.98 ms per loop
In[10]: %timeit Mad_tile(1000,1000)
100 loops, best of 3: 15.3 ms per loop
In[11]: %timeit bora_comp(1000,1000)
1 loop, best of 3: 221 ms per loop
In[12]: %timeit djk_ind(1000,1000)
100 loops, best of 3: 9.72 ms per loop
In[13]: %timeit devdev_mgrid(1000,1000)
10 loops, best of 3: 20.6 ms per loop
I guess that's pretty pythonic:
[[[i,j] for j in range(5)] for i in range(5)]
Output:
[[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4]],
[[1, 0], [1, 1], [1, 2], [1, 3], [1, 4]],
[[2, 0], [2, 1], [2, 2], [2, 3], [2, 4]],
[[3, 0], [3, 1], [3, 2], [3, 3], [3, 4]],
[[4, 0], [4, 1], [4, 2], [4, 3], [4, 4]]]
Check out numpy.mgrid, which will return two arrays with the i and j indices. To combine them you can stack the arrays and reshape them. Something like this:
import numpy as np
def index_pair_array(rows, cols):
    index_tuple = np.mgrid[0:rows, 0:cols]
    return np.dstack(index_tuple).reshape((rows, cols, 2))
There are a few ways of doing this numpythonically.
One way is using np.tile and np.stack:
r = np.tile(np.arange(rows).reshape(rows, 1), (1, cols))
c = np.tile(np.arange(cols), (rows, 1))
result = np.stack((r, c), axis=-1)
A better way of getting the coordinates might be np.meshgrid:
rc = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
result = np.stack(rc, axis=-1)
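Since the question also asks about higher dimensions, the np.indices approach extends to any number of axes. A minimal sketch, where the function name `index_grid` is made up for illustration:

```python
import numpy as np

def index_grid(shape):
    # np.indices returns one index array per dimension, stacked on axis 0;
    # moving that axis to the end leaves the full index of each cell last.
    return np.moveaxis(np.indices(shape), 0, -1)

grid2d = index_grid((4, 4))     # grid2d[y, x] == [y, x]
grid3d = index_grid((2, 3, 4))  # grid3d[i, j, k] == [i, j, k]
print(grid2d[2, 3], grid3d[1, 2, 3])
```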

Create a 2D array from another array and its indices with NumPy

Given an array:
arr = np.array([[1, 3, 7], [4, 9, 8]]); arr
array([[1, 3, 7],
[4, 9, 8]])
And given its indices:
np.indices(arr.shape)
array([[[0, 0, 0],
[1, 1, 1]],
[[0, 1, 2],
[0, 1, 2]]])
How would I be able to stack them neatly one against the other to form a new 2D array? This is what I'd like:
array([[0, 0, 1],
[0, 1, 3],
[0, 2, 7],
[1, 0, 4],
[1, 1, 9],
[1, 2, 8]])
This is my current solution:
def foo(arr):
    return np.hstack((np.indices(arr.shape).reshape(2, arr.size).T, arr.reshape(-1, 1)))
It works, but is there something shorter/more elegant to carry this operation out?
Using array-initialization and then broadcasted-assignment for assigning indices and the array values in subsequent steps -
def indices_merged_arr(arr):
    m,n = arr.shape
    I,J = np.ogrid[:m,:n]
    out = np.empty((m,n,3), dtype=arr.dtype)
    out[...,0] = I
    out[...,1] = J
    out[...,2] = arr
    out.shape = (-1,3)
    return out
Note that we are avoiding the use of np.indices(arr.shape), which could have slowed things down.
Sample run -
In [10]: arr = np.array([[1, 3, 7], [4, 9, 8]])
In [11]: indices_merged_arr(arr)
Out[11]:
array([[0, 0, 1],
[0, 1, 3],
[0, 2, 7],
[1, 0, 4],
[1, 1, 9],
[1, 2, 8]])
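To see why the broadcasted assignment in indices_merged_arr works without np.indices, it can help to look at what np.ogrid returns: two "open" arrays of shape (m, 1) and (1, n) that only expand to (m, n) when they are assigned. A small sketch with illustrative sizes:

```python
import numpy as np

m, n = 2, 3
I, J = np.ogrid[:m, :n]
print(I.shape, J.shape)  # open grids: a column of row indices, a row of column indices

out = np.empty((m, n, 2), dtype=int)
out[..., 0] = I  # the (m, 1) column is broadcast across all n columns
out[..., 1] = J  # the (1, n) row is broadcast across all m rows
print(out[1, 2])
```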
Performance
arr = np.random.randn(100000, 2)
%timeit df = pd.DataFrame(np.hstack((np.indices(arr.shape).reshape(2, arr.size).T,\
arr.reshape(-1, 1))), columns=['x', 'y', 'value'])
100 loops, best of 3: 4.97 ms per loop
%timeit pd.DataFrame(indices_merged_arr_divakar(arr), columns=['x', 'y', 'value'])
100 loops, best of 3: 3.82 ms per loop
%timeit pd.DataFrame(indices_merged_arr_eric(arr), columns=['x', 'y', 'value'], dtype=np.float32)
100 loops, best of 3: 5.59 ms per loop
Note: Timings include conversion to pandas dataframe, that is the eventual use case for this solution.
A more generic answer for nd arrays, that handles other dtypes correctly:
def indices_merged_arr(arr):
    out = np.empty(arr.shape, dtype=[
        ('index', np.intp, arr.ndim),
        ('value', arr.dtype)
    ])
    out['value'] = arr
    for i, l in enumerate(arr.shape):
        shape = (1,)*i + (-1,) + (1,)*(arr.ndim-1-i)
        out['index'][..., i] = np.arange(l).reshape(shape)
    return out.ravel()
This returns a structured array with an index column and a value column, which can be of different types.

How to pythonically get the max of a numpy argwhere function

I want to use numpy argwhere to find where a maximum in my data is. Below is a sample set that describes what I am doing:
bins = np.arange(10)
data = np.array([[6],[4],[8],[5]])
np.argwhere(bins<data)
array([[0, 0],
[0, 1],
[0, 2],
[0, 3],
[0, 4],
[0, 5],
[1, 0],
[1, 1],
[1, 2],
[1, 3],
[2, 0],
[2, 1],
[2, 2],
[2, 3],
[2, 4],
[2, 5],
[2, 6],
[2, 7],
[3, 0],
[3, 1],
[3, 2],
[3, 3],
[3, 4]])
What I want from this data is
array([[0,5],
[1,3],
[2,7],
[3,4]])
This could be done with a for loop, but I was wondering if there was a more pythonic way to do this.
EDIT:
What I have now done was use Pandas and groupby. I am still wondering if this is the best method.
t = pd.DataFrame(np.argwhere(bins<data))
time = t.groupby(0)
time.max()
   1
0
0  5
1  3
2  7
3  4
Now that I have this, I have a new problem. Let's say I have another set of data:
BigData = np.array([[0,1,2,3,4,5,6,7,8,9],
[0,1,2,3,4,5,6,7,8,9],
[0,1,2,3,4,5,6,7,8,9],
[0,1,2,3,4,5,6,7,8,9]])
How can I use the array I obtained
array([[0,5],
       [1,3],
       [2,7],
       [3,4]])
with this new data to get the average of each BigData row up to the index in the second column? I.e.
(0+1+2+3+4) / 5
(0+1+2) / 3
(0+1+2+3+4+5+6) / 7
(0+1+2+3) / 4
would be the values returned for BigData, given that column two holds the index up to which each row is averaged.
Here's a fairly short Numpy solution that's also pretty fast:
A = np.argwhere(bins<data)
print(A[np.r_[A[1:,0] != A[:-1,0], True]])
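The one-liner relies on np.argwhere returning its rows in row-major order, so the last entry of each group of equal first-column values carries the largest second-column value. A step-by-step sketch of the same idea (the names `is_last` and `result` are illustrative):

```python
import numpy as np

bins = np.arange(10)
data = np.array([[6], [4], [8], [5]])
A = np.argwhere(bins < data)

# True wherever the row index changes between consecutive entries,
# plus True for the very last entry: these mark the end of each group.
is_last = np.r_[A[1:, 0] != A[:-1, 0], True]

result = A[is_last]
print(result)
```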
Here's a NumPy solution. It is not as readable as the Pandas version, but timing suggests it is much faster:
>>> arr = np.argwhere(bins<data)
>>> arr[np.where(np.diff(np.vstack((arr, [arr[-1][0]+1, arr[-1][1]])), axis=0)[:,0] > 0)[0]]
array([[0, 5],
[1, 3],
[2, 7],
[3, 4]])
>>> %timeit arr[np.where(np.diff(np.vstack((arr, [arr[-1][0]+1, arr[-1][1]])), axis=0)[:,0] > 0)[0]]
10000 loops, best of 3: 32.7 µs per loop
>>> %%timeit
... t = pd.DataFrame(arr)
... time = t.groupby(0)
... time.max()
...
1000 loops, best of 3: 1 ms per loop
The following seems to be pretty fast for me, taking advantage of argmax working left -> right:
>>> bins[::-1][(bins[::-1] < data).argmax(axis=1)]
array([5, 3, 7, 4])
For me %timeit shows that this takes around 11µs.
However, manipulating the array to have the index as the first column (as follows) increases time to around 25µs:
>>> np.column_stack(
... [np.arange(data.shape[0]), bins[::-1][(bins[::-1] < data).argmax(axis=1)]])
array([[0, 5],
[1, 3],
[2, 7],
[3, 4]])
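For the follow-up in the question about averaging each BigData row up to the index found, one possible sketch is to build a boolean mask from the indices and take a masked mean. The names `stop`, `mask` and `means` are illustrative, and the sketch assumes every index is greater than zero so that no division by zero occurs.

```python
import numpy as np

BigData = np.tile(np.arange(10), (4, 1))
stop = np.array([5, 3, 7, 4])  # second column of the result above

# mask[r, c] is True for the first stop[r] columns of row r.
mask = np.arange(BigData.shape[1]) < stop[:, None]

# Sum only the selected entries in each row, then divide by how many were selected.
means = np.where(mask, BigData, 0).sum(axis=1) / stop
print(means)
```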
