Map numpy's `in1d` over 2D array - python

I have two 2D numpy arrays,
import numpy as np
a = np.array([[ 1, 15, 16, 200, 10],
[ -1, 10, 17, 11, -1],
[ -1, -1, 20, -1, -1]])
g = np.array([[ 1, 12, 15, 100, 11],
[ 2, 13, 16, 200, 12],
[ 3, 14, 17, 300, 13],
[ 4, 17, 18, 400, 14],
[ 5, 20, 19, 500, 16]])
What I want to do is, for each column of g, to check if it contains any element from the corresponding column of a. For the first column, I want to check if any of the values [1,2,3,4,5] appears in [1,-1,-1] and return True. For the second, I want to return False because no element in [12,13,14,17,20] appears in [15,10,-1]. At the moment, I do this using Python's list comprehension. Running
result = [np.any(np.in1d(g[:,i], a[:, i])) for i in range(5)]
calculates the correct result, but is getting slow when a has a lot of columns. Is there a more "pure numpy" way of doing this same thing? I feel like there should be an axis keyword one could add to the numpy.in1d function, but there isn't any...

I'd use broadcasting tricks, but this depends very much on the size of your arrays and the amount of RAM available to you:
M = g.reshape(g.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0]))
np.any(np.any(M == 0, axis=0), axis=1)
# returns:
# array([ True, False, True, True, False], dtype=bool)
It's easier to explain with a piece of paper and a pen (and smaller test arrays) (see below), but basically you're making copies of each column in g (one copy for each row in a) and subtracting single elements taken from the corresponding column in a from these copies. Similar to the original algorithm, just vectorized.
Caveat: if any of the arrays g or a is 1D, you'll need to force it to become 2D, such that its shape is at least (1,n).
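The broadcasting idea can also be sketched with an equality comparison instead of a subtraction. This is a minimal illustration using the arrays from the question; the names `hits` and `result` are chosen here for illustration and are not part of the original answer.

```python
import numpy as np

a = np.array([[ 1, 15, 16, 200, 10],
              [-1, 10, 17,  11, -1],
              [-1, -1, 20,  -1, -1]])
g = np.array([[1, 12, 15, 100, 11],
              [2, 13, 16, 200, 12],
              [3, 14, 17, 300, 13],
              [4, 17, 18, 400, 14],
              [5, 20, 19, 500, 16]])

# g[:, :, None] has shape (5, 5, 1): rows of g, columns, one trailing unit axis.
# a.T[None, :, :] has shape (1, 5, 3): unit axis, columns, rows of a.
# Broadcasting compares every element of a column of g with every element
# of the same column of a, giving a boolean array of shape (5, 5, 3).
hits = g[:, :, None] == a.T[None, :, :]

# A column matches if any (g-row, a-row) pair in it is equal.
result = hits.any(axis=(0, 2))
print(result)
```

The intermediate array holds `g.shape[0] * g.shape[1] * a.shape[0]` booleans, which is where the memory cost mentioned above comes from.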
Speed gains:
based only on your arrays: a factor ~20
python for loops: 301us per loop
vectorized: 15.4us per loop
larger arrays: factor ~80
In [2]: a = np.random.random_integers(-2, 3, size=(4, 50))
In [3]: b = np.random.random_integers(-20, 30, size=(35, 50))
In [4]: %timeit np.any(np.any(b.reshape(b.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0])) == 0, axis=0), axis=1)
10000 loops, best of 3: 39.5 us per loop
In [5]: %timeit [np.any(np.in1d(b[:,i], a[:, i])) for i in range(a.shape[1])]
100 loops, best of 3: 3.13 ms per loop
Image attached to explain broadcasting:

Instead of processing the input by column, you can process it by rows. For example you find out if any element of the first row of a is present in the columns of g, so that you can stop processing the columns where the element is found.
idx = np.arange(a.shape[1])
result = np.zeros(idx.size, dtype=bool)
for j in range(a.shape[0]):
    # delete this print in production
    print("%d line, I look only at columns" % (j + 1), idx)
    line_pruned = np.take(a[j], idx)
    g_pruned = np.take(g, idx, axis=1)
    # np.where gives positions inside the pruned arrays;
    # idx[...] maps them back to the original column numbers
    positive_idx = idx[np.where((g_pruned - line_pruned) == 0)[1]]
    # delete this print in production
    print("positive hit on the", positive_idx, "-th columns")
    np.put(result, positive_idx, True)
    idx = np.setdiff1d(idx, positive_idx)
    if not idx.size:
        break
To understand how it works, we can consider a different input:
a = np.array([[ 0, 15, 16, 200, 10],
[ -1, 10, 17, 11, -1],
[ 1, -1, 20, -1, -1]])
g = np.array([[ 1, 12, 15, 100, 11],
[ 2, 13, 16, 200, 12],
[ 3, 14, 17, 300, 13],
[ 4, 17, 18, 400, 14],
[ 5, 20, 19, 500, 16]])
The output of the script is:
1 line, I look only at columns [0 1 2 3 4]
positive hit on the [2 3] -th columns
2 line, I look only at columns [0 1 4]
positive hit on the [] -th columns
3 line, I look only at columns [0 1 4]
positive hit on the [0] -th columns
Basically you can see how in the 2nd and 3rd round of the loop you're not processing the 3rd and 4th columns (indices 2 and 3).
The performance of this solution really depends on many factors, but it will be faster if it is likely that you hit many True values, and the problem has many rows. This of course depends also on the input, not just on the shape.
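As a sketch of the same row-wise pruning idea without the debugging prints, the loop can be wrapped in a function. The name `columns_with_match` is made up for this illustration; the arrays are the second example from this answer.

```python
import numpy as np

def columns_with_match(a, g):
    """For each column, True if g[:, col] contains any value of a[:, col]."""
    result = np.zeros(a.shape[1], dtype=bool)
    idx = np.arange(a.shape[1])          # columns still undecided
    for row in a:
        # compare the remaining columns of g with this row of a
        hit = (g[:, idx] == row[idx]).any(axis=0)
        result[idx[hit]] = True
        idx = idx[~hit]                  # drop columns that already matched
        if idx.size == 0:
            break
    return result

a = np.array([[ 0, 15, 16, 200, 10],
              [-1, 10, 17,  11, -1],
              [ 1, -1, 20,  -1, -1]])
g = np.array([[1, 12, 15, 100, 11],
              [2, 13, 16, 200, 12],
              [3, 14, 17, 300, 13],
              [4, 17, 18, 400, 14],
              [5, 20, 19, 500, 16]])

result = columns_with_match(a, g)
reference = [bool(np.isin(g[:, i], a[:, i]).any()) for i in range(a.shape[1])]
print(result)
```

Keeping `idx` as an array of original column numbers avoids mixing up positions in the pruned arrays with positions in the full arrays.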

Related

how to delete rows and columns in numpy python?

I am having trouble creating a function which takes a matrix M as an input and deletes BOTH rows and columns containing the number 0 and giving an output containing the remaining numbers. Any help is much appreciated as I have my programming exam coming up soon.
By "deleting both rows and columns" this is what I mean:
import numpy as np
x = np.array([[1,2,3,4,5],
[6,0,8,9,10],
[11,12,13,14,15],
[16,0,0,19,20]])
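# unique row indices and unique column indices of the zeros, in ascending order
idxs_array = [sorted(set(ix.tolist())) for ix in np.where(x==0)]
for axis, idxs in enumerate(idxs_array):
    sub_factor = 0
    for idx in idxs:
        # every earlier deletion shifts the remaining indices down by one
        x = np.delete(x,idx-sub_factor,axis)
        sub_factor += 1
print(x)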
# x = [[ 1, 4, 5],
# [11, 14, 15]]
1. Locate zero elements
First of all, we need to identify the location of the zero elements in the matrix, which can be done easily with np.where().
np.where will return the row/column indices of the elements that match a given condition (doc).
row_idx, col_idx = np.where(arr == 0)
2. Remove corresponding rows/columns
To remove the corresponding rows and columns, there is an easy way to do this, which is boolean indexing (doc).
That is, you mark each row (or column) you want to keep with True and every other one with False.
print(np.arange(4)[[True, False, True, False]])
# [0 2]
3. Put two things together
Here is a minimal example.
arr = np.array([[ 1, 2, 3, 4, 5],
[ 6, 0, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 0, 0, 19, 20]])
row_idx, col_idx = np.where(arr == 0)
rm_row_idx = set(row_idx.tolist())
rm_col_idx = set(col_idx.tolist())
row_mask = [i not in rm_row_idx for i in range(arr.shape[0])]
col_mask = [i not in rm_col_idx for i in range(arr.shape[1])]
arr = arr[row_mask, :]
arr = arr[:, col_mask]
print(arr)
# Shall be:
# array([[ 1, 4, 5],
# [11, 14, 15]])
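The same two steps can also be written with boolean masks built directly by NumPy, without the intermediate sets. This is a sketch of an equivalent formulation, not the answer's original code; the names `zero`, `keep_rows` and `keep_cols` are illustrative.

```python
import numpy as np

arr = np.array([[ 1,  2,  3,  4,  5],
                [ 6,  0,  8,  9, 10],
                [11, 12, 13, 14, 15],
                [16,  0,  0, 19, 20]])

zero = (arr == 0)
keep_rows = ~zero.any(axis=1)   # True for rows that contain no zero
keep_cols = ~zero.any(axis=0)   # True for columns that contain no zero

# index rows first, then columns, exactly as in the example above
out = arr[keep_rows][:, keep_cols]
print(out)
```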

Map an element in a multi-dimension array to its index

I am using the function get_tuples(length, total) from here
to generate an array of all tuples of given length and sum, an example and the function are shown below. After I have created the array I need to find a way to return the indices of a given number of elements in the array. I was able to do that using .index() by changing the array to a list, as shown below. However, this solution or another solution that is also based on searching (for example using np.where) takes a lot of time to find the indices. Since all elements in the array (array s in the example) are different, I was wondering if we can construct a one-to-one mapping, i.e., a function such that given the element in the array it returns the index of the element by doing some addition and multiplication on the values of this element. Any ideas if that is possible? Thanks!
import numpy as np
def get_tuples(length, total):
    if length == 1:
        yield (total,)
        return
    for i in range(total + 1):
        for t in get_tuples(length - 1, total - i):
            yield (i,) + t
#example
s = np.array(list(get_tuples(4, 20)))
# array s
In [1]: s
Out[1]:
array([[ 0, 0, 0, 20],
[ 0, 0, 1, 19],
[ 0, 0, 2, 18],
...,
[19, 0, 1, 0],
[19, 1, 0, 0],
[20, 0, 0, 0]])
#example of element to find the index for. (Note in reality this is 1000+ elements)
elements_to_find =np.array([[ 0, 0, 0, 20],
[ 0, 0, 7, 13],
[ 0, 5, 5, 10],
[ 0, 0, 5, 15],
[ 0, 2, 4, 14]])
#change array to list
s_list = s.tolist()
#find the indices
indx=[s_list.index(i) for i in elements_to_find.tolist()]
#output
In [2]: indx
Out[2]: [0, 7, 100, 5, 45]
Here is a formula that calculates the index based on the tuple alone, i.e. it needn't see the full array. To compute the index of an N-tuple it needs to evaluate N-1 binomial coefficients. The following implementation is partly vectorized: it accepts ND-arrays, but the tuples must be in the last dimension.
import numpy as np
from scipy.special import comb
# unfortunately, comb with option exact=True is not vectorized
def bc(N,k):
    return np.round(comb(N,k)).astype(int)
def get_idx(s):
    N = s.shape[-1] - 1
    R = np.arange(1,N)
    ps = s[...,::-1].cumsum(-1)
    B = bc(ps[...,1:-1]+R,1+R)
    return bc(ps[...,-1]+N,N) - ps[...,0] - 1 - B.sum(-1)
# OP's generator
def get_tuples(length, total):
    if length == 1:
        yield (total,)
        return
    for i in range(total + 1):
        for t in get_tuples(length - 1, total - i):
            yield (i,) + t
#example
s = np.array(list(get_tuples(4, 20)))
# compute each index
r = get_idx(s)
# expected: 0,1,2,3,...
assert (r == np.arange(len(r))).all()
print("all ok")
#example of element to find the index for. (Note in reality this is 1000+ elements)
elements_to_find =np.array([[ 0, 0, 0, 20],
[ 0, 0, 7, 13],
[ 0, 5, 5, 10],
[ 0, 0, 5, 15],
[ 0, 2, 4, 14]])
print(get_idx(elements_to_find))
Sample run:
all ok
[ 0 7 100 5 45]
How to derive formula:
1. Use stars and bars to express the full partition count #part(N,k) (N is total, k is length) as a single binomial coefficient (N + k - 1) choose (k - 1).
2. Count back-to-front: It is not hard to verify that after the i-th full iteration of the outer loop of OP's generator exactly #part(N-i,k) have not yet been enumerated. Indeed, what's left are all partitions p1+p2+... = N with p1>=i; we can write p1=q1+i such that q1+p2+... = N-i and this latter partition is constraint-free so we can use 1. to count.
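Both counting claims can be checked by brute force for the example size. The sketch below assumes Python 3.8+ for `math.comb`; the helper name `n_part` is introduced here only to mirror the #part(N,k) notation.

```python
from math import comb  # exact integer binomial coefficients, Python 3.8+

def get_tuples(length, total):
    # OP's generator, repeated so the check is self-contained
    if length == 1:
        yield (total,)
        return
    for i in range(total + 1):
        for t in get_tuples(length - 1, total - i):
            yield (i,) + t

def n_part(total, length):
    # stars and bars: (N + k - 1) choose (k - 1)
    return comb(total + length - 1, length - 1)

tuples = list(get_tuples(4, 20))
count = len(tuples)

# number of tuples whose first entry is >= i, i.e. those not yet
# enumerated after i full iterations of the outer loop
remaining = [sum(1 for t in tuples if t[0] >= i) for i in range(21)]

print(count, n_part(20, 4))
print(all(remaining[i] == n_part(20 - i, 4) for i in range(21)))
```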
You can use binary search to make the search a lot faster.
Binary search makes each lookup O(log(n)) rather than the O(n) of list.index().
We do not need to sort the tuples since the generator already yields them in sorted order.
import bisect
def get_tuples(length, total):
    " Generates tuples "
    if length == 1:
        yield (total,)
        return
    yield from ((i,) + t for i in range(total + 1) for t in get_tuples(length - 1, total - i))
def find_indexes(x, indexes):
    if len(indexes) > 100:
        # Faster to generate all indexes when we have a large
        # number to check
        d = dict(zip(x, range(len(x))))
        return [d[tuple(i)] for i in indexes]
    else:
        return [bisect.bisect_left(x, tuple(i)) for i in indexes]
# Generate tuples (in this case 4, 20)
x = list(get_tuples(4, 20))
# Tuples are generated in sorted order [(0,0,0,20), ...(20,0,0,0)]
# which allows binary search to be used
indexes = [[ 0, 0, 0, 20],
[ 0, 0, 7, 13],
[ 0, 5, 5, 10],
[ 0, 0, 5, 15],
[ 0, 2, 4, 14]]
y = find_indexes(x, indexes)
print('Found indexes:', *y)
print('Indexes & Tuples:')
for i in y:
    print(i, x[i])
Output
Found indexes: 0 7 100 5 45
Indexes & Tuples:
0 (0, 0, 0, 20)
7 (0, 0, 7, 13)
100 (0, 5, 5, 10)
5 (0, 0, 5, 15)
45 (0, 2, 4, 14)
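The binary search relies on the generator emitting the tuples in strictly ascending lexicographic order. A quick way to convince yourself of that precondition is sketched below; the variable names `is_sorted` and `pos` are illustrative.

```python
import bisect

def get_tuples(length, total):
    # same generator as above, repeated so the check is self-contained
    if length == 1:
        yield (total,)
        return
    for i in range(total + 1):
        for t in get_tuples(length - 1, total - i):
            yield (i,) + t

x = list(get_tuples(4, 20))

# every tuple must compare strictly smaller than its successor
is_sorted = all(p < q for p, q in zip(x, x[1:]))

# with that guarantee, bisect_left returns the exact position of a tuple
pos = bisect.bisect_left(x, (0, 5, 5, 10))
print(is_sorted, pos, x[pos])
```

Note that `bisect_left` returns an insertion point even for a tuple that is not in the list, so comparing `x[pos]` with the query is a cheap sanity check.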
Performance
Scenario 1--Tuples already computed and we just want to find the index of certain tuples
For instance x = list(get_tuples(4, 20)) has already been performed.
Search for
indexes = [[ 0, 0, 0, 20],
[ 0, 0, 7, 13],
[ 0, 5, 5, 10],
[ 0, 0, 5, 15],
[ 0, 2, 4, 14]]
Binary Search
%timeit find_indexes(x, indexes)
100000 loops, best of 3: 11.2 µs per loop
Calculating the index from the tuple alone (courtesy of @PaulPanzer's approach)
%timeit get_idx(indexes)
10000 loops, best of 3: 92.7 µs per loop
In this scenario, binary search is ~8x faster when tuples have already been pre-computed.
Scenario 2--the tuples have not been pre-computed.
%%timeit
import bisect
def find_indexes(x, t):
    " finds the index of each tuple in list t (assumes x is sorted) "
    return [bisect.bisect_left(x, tuple(i)) for i in t]
# Generate tuples (in this case 4, 20)
x = list(get_tuples(4, 20))
indexes = [[ 0, 0, 0, 20],
[ 0, 0, 7, 13],
[ 0, 5, 5, 10],
[ 0, 0, 5, 15],
[ 0, 2, 4, 14]]
y = find_indexes(x, indexes)
100 loops, best of 3: 2.69 ms per loop
@PaulPanzer's approach has the same timing in this scenario (92.97 µs)
=> @PaulPanzer's approach is ~29 times faster here, because it does not need the tuples to be computed first
Scenario 3--Large number of indexes (@PJORR)
A large number of random indexes is generated
x = list(get_tuples(4, 20))
xnp = np.array(x)
indices = xnp[np.random.randint(0,len(xnp), 2000)]
indexes = indices.tolist()
%timeit find_indexes(x, indexes)
#Result: 1000 loops, best of 3: 1.1 ms per loop
%timeit get_idx(indices)
#Result: 1000 loops, best of 3: 716 µs per loop
In this case, @PaulPanzer's approach is about 53% faster.

Numpy vectorization: comparing array against multiple values [duplicate]

Let's say I have an array like this:
import numpy as np
base_array = np.array([-13, -9, -11, -3, -3, -4, 2, 2,
2, 5, 7, 7, 8, 7, 12, 11])
Suppose I want to know: "how many elements in base_array are greater than 4?" This can be done simply by exploiting broadcasting:
np.sum(4 < base_array)
For which the answer is 7. Now, suppose instead of comparing to a single value, I want to do this over an array. In other words, for each value c in the comparison_array, find out how many elements of base_array are greater than c. If I do this the naive way, it obviously fails because it doesn't know how to broadcast it properly:
comparison_array = np.arange(-13, 13)
comparison_result = np.sum(comparison_array < base_array)
Output:
Traceback (most recent call last):
File "<pyshell#87>", line 1, in <module>
np.sum(comparison_array < base_array)
ValueError: operands could not be broadcast together with shapes (26,) (16,)
If I could somehow have each element of comparison_array get broadcast to base_array's shape, that would solve this. But I don't know how to do such an "element-wise broadcasting".
Now, I do know how to implement this for both cases using list comprehension:
first = sum([4 < i for i in base_array])
second = [sum([c < i for i in base_array])
for c in comparison_array]
print(first)
print(second)
Output:
7
[15, 15, 14, 14, 13, 13, 13, 13, 13, 12, 10, 10, 10, 10, 10, 7, 7, 7, 6, 6, 3, 2, 2, 2, 1, 0]
But as we all know, this will be orders of magnitude slower than a correctly-vectorized numpy implementation on larger arrays. So, how should I do this in numpy so that it's fast? Ideally this solution should extend to any kind of operation where broadcasting works, not just greater-than or less-than in this example.
You can simply add a dimension to the comparison array, so that the comparison is "stretched" across all values along the new dimension.
>>> np.sum(comparison_array[:, None] < base_array)
228
This is the fundamental principle with broadcasting, and works for all kinds of operations.
If you need the sum done along an axis, you just specify the axis along which you want to sum after the comparison.
>>> np.sum(comparison_array[:, None] < base_array, axis=1)
array([15, 15, 14, 14, 13, 13, 13, 13, 13, 12, 10, 10, 10, 10, 10, 7, 7,
7, 6, 6, 3, 2, 2, 2, 1, 0])
You will want to transpose one of the arrays for broadcasting to work correctly. When you broadcast two arrays together, the dimensions are lined up and any unit dimensions are effectively expanded to the non-unit size that they match. So two arrays of size (16, 1) (the original array) and (1, 26) (the comparison array) would broadcast to (16, 26).
Don't forget to sum across the dimension of size 16:
(base_array[:, None] > comparison_array).sum(axis=0)
None in a slice is equivalent to np.newaxis: it's one of many ways to insert a new unit dimension at the specified index. The reason that you don't need to do comparison_array[None, :] is that broadcasting lines up the highest dimensions, and fills in the lowest with ones automatically.
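To make the shapes concrete, here is a small sketch comparing both orientations against the naive list comprehension from the question. Both orientations give the same counts as long as the sum runs over the axis that indexes base_array. The names `by_rows`, `by_cols` and `naive` are illustrative.

```python
import numpy as np

base_array = np.array([-13, -9, -11, -3, -3, -4, 2, 2,
                       2, 5, 7, 7, 8, 7, 12, 11])
comparison_array = np.arange(-13, 13)

# (26, 1) against (16,) broadcasts to (26, 16); base_array runs along axis 1
by_rows = (comparison_array[:, np.newaxis] < base_array).sum(axis=1)

# (16, 1) against (26,) broadcasts to (16, 26); base_array runs along axis 0
by_cols = (base_array[:, np.newaxis] > comparison_array).sum(axis=0)

# reference: the slow pure-Python version
naive = [sum(1 for x in base_array if c < x) for c in comparison_array]

print(by_rows)
```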
Here's one with np.searchsorted with focus on memory efficiency and hence performance -
def get_comparative_sum(base_array, comparison_array):
    n = len(base_array)
    base_array_sorted = np.sort(base_array)
    # with side='right', idx is the number of elements <= each comparison value
    idx = np.searchsorted(base_array_sorted, comparison_array, 'right')
    return n - idx
Timings -
In [40]: np.random.seed(0)
...: base_array = np.random.randint(-1000,1000,(10000))
...: comparison_array = np.random.randint(-1000,1000,(20000))
# @miradulo's soln
In [41]: %timeit np.sum(comparison_array[:, None] < base_array, axis=1)
1 loop, best of 3: 386 ms per loop
In [42]: %timeit get_comparative_sum(base_array, comparison_array)
100 loops, best of 3: 2.36 ms per loop
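The sort-and-search idea can be cross-checked against plain broadcasting on a small input. The sketch below uses a standalone helper, `count_greater`, which is a name made up for this illustration; the comparison values deliberately extend past both ends of base_array to exercise the edge cases.

```python
import numpy as np

def count_greater(base_array, comparison_array):
    """For each c in comparison_array, count the elements of base_array > c."""
    base_sorted = np.sort(base_array)
    # side='right' returns how many sorted elements are <= c
    n_le = np.searchsorted(base_sorted, comparison_array, side='right')
    return len(base_array) - n_le

base_array = np.array([-13, -9, -11, -3, -3, -4, 2, 2,
                       2, 5, 7, 7, 8, 7, 12, 11])
comparison_array = np.arange(-15, 16)   # below the minimum and above the maximum

fast = count_greater(base_array, comparison_array)
slow = (comparison_array[:, None] < base_array).sum(axis=1)
print(np.array_equal(fast, slow))
```

The sorted version needs O(n log n + m log n) time and no (m, n) intermediate array, which is why it scales better than the broadcast comparison in the timings above.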

Is there any way to find top-left and right-bottom pixels of cropped image from a full image using Python? [duplicate]

I have a large NumPy.array field_array and a smaller array match_array, both consisting of int values. Using the following example, how can I check if any match_array-shaped segment of field_array contains values that exactly correspond to the ones in match_array?
import numpy
raw_field = ( 24, 25, 26, 27, 28, 29, 30, 31, 23, \
33, 34, 35, 36, 37, 38, 39, 40, 32, \
-39, -38, -37, -36, -35, -34, -33, -32, -40, \
-30, -29, -28, -27, -26, -25, -24, -23, -31, \
-21, -20, -19, -18, -17, -16, -15, -14, -22, \
-12, -11, -10, -9, -8, -7, -6, -5, -13, \
-3, -2, -1, 0, 1, 2, 3, 4, -4, \
6, 7, 8, 4, 5, 6, 7, 13, 5, \
15, 16, 17, 8, 9, 10, 11, 22, 14)
field_array = numpy.array(raw_field, int).reshape(9,9)
match_array = numpy.arange(12).reshape(3,4)
These examples ought to return True since the pattern described by match_array aligns over [6:9,3:7].
Approach #1
This approach derives from a solution to Implement Matlab's im2col 'sliding' in python that was designed to rearrange sliding blocks from a 2D array into columns. Thus, to solve our case here, those sliding blocks from field_array could be stacked as columns and compared against column vector version of match_array.
Here's a formal definition of the function for the rearrangement/stacking -
def im2col(A,BLKSZ):
    # Parameters
    M,N = A.shape
    col_extent = N - BLKSZ[1] + 1
    row_extent = M - BLKSZ[0] + 1
    # Get Starting block indices
    start_idx = np.arange(BLKSZ[0])[:,None]*N + np.arange(BLKSZ[1])
    # Get offsetted indices across the height and width of input array
    offset_idx = np.arange(row_extent)[:,None]*N + np.arange(col_extent)
    # Get all actual indices & index into input array for final output
    return np.take(A,start_idx.ravel()[:,None] + offset_idx.ravel())
To solve our case, here's the implementation based on im2col -
# Get sliding blocks of shape same as match_array from field_array into columns
# Then, compare them with a column vector version of match array.
col_match = im2col(field_array,match_array.shape) == match_array.ravel()[:,None]
# Shape of output array that has field_array compared against a sliding match_array
out_shape = np.asarray(field_array.shape) - np.asarray(match_array.shape) + 1
# Now, see if all elements in a column are ONES and reshape to out_shape.
# Finally, find the position of TRUE indices
R,C = np.where(col_match.all(0).reshape(out_shape))
The output for the given sample in the question would be -
In [151]: R,C
Out[151]: (array([6]), array([3]))
Approach #2
Given that opencv already has template matching function that does square of differences, you can employ that and look for zero differences, which would be your matching positions. So, if you have access to cv2 (opencv module), the implementation would look something like this -
import cv2
from cv2 import matchTemplate as cv2m
M = cv2m(field_array.astype('uint8'),match_array.astype('uint8'),cv2.TM_SQDIFF)
R,C = np.where(M==0)
giving us -
In [204]: R,C
Out[204]: (array([6]), array([3]))
Benchmarking
This section compares runtimes for all the approaches suggested to solve the question. The credit for the various methods listed in this section goes to their contributors.
Method definitions -
def seek_array(search_in, search_for, return_coords = False):
    si_x, si_y = search_in.shape
    sf_x, sf_y = search_for.shape
    for y in range(si_y-sf_y+1):
        for x in range(si_x-sf_x+1):
            if numpy.array_equal(search_for, search_in[x:x+sf_x, y:y+sf_y]):
                return (x,y) if return_coords else True
    return None if return_coords else False
def skimage_based(field_array,match_array):
    windows = view_as_windows(field_array, match_array.shape)
    return (windows == match_array).all(axis=(2,3)).nonzero()
def im2col_based(field_array,match_array):
    col_match = im2col(field_array,match_array.shape)==match_array.ravel()[:,None]
    out_shape = np.asarray(field_array.shape) - np.asarray(match_array.shape) + 1
    return np.where(col_match.all(0).reshape(out_shape))
def cv2_based(field_array,match_array):
    M = cv2m(field_array.astype('uint8'),match_array.astype('uint8'),cv2.TM_SQDIFF)
    return np.where(M==0)
Runtime tests -
Case # 1 (Sample data from question):
In [11]: field_array
Out[11]:
array([[ 24, 25, 26, 27, 28, 29, 30, 31, 23],
[ 33, 34, 35, 36, 37, 38, 39, 40, 32],
[-39, -38, -37, -36, -35, -34, -33, -32, -40],
[-30, -29, -28, -27, -26, -25, -24, -23, -31],
[-21, -20, -19, -18, -17, -16, -15, -14, -22],
[-12, -11, -10, -9, -8, -7, -6, -5, -13],
[ -3, -2, -1, 0, 1, 2, 3, 4, -4],
[ 6, 7, 8, 4, 5, 6, 7, 13, 5],
[ 15, 16, 17, 8, 9, 10, 11, 22, 14]])
In [12]: match_array
Out[12]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [13]: %timeit seek_array(field_array, match_array, return_coords = False)
1000 loops, best of 3: 465 µs per loop
In [14]: %timeit skimage_based(field_array,match_array)
10000 loops, best of 3: 97.9 µs per loop
In [15]: %timeit im2col_based(field_array,match_array)
10000 loops, best of 3: 74.3 µs per loop
In [16]: %timeit cv2_based(field_array,match_array)
10000 loops, best of 3: 30 µs per loop
Case #2 (Bigger random data):
In [17]: field_array = np.random.randint(0,4,(256,256))
In [18]: match_array = field_array[100:116,100:116].copy()
In [19]: %timeit seek_array(field_array, match_array, return_coords = False)
1 loops, best of 3: 400 ms per loop
In [20]: %timeit skimage_based(field_array,match_array)
10 loops, best of 3: 54.3 ms per loop
In [21]: %timeit im2col_based(field_array,match_array)
10 loops, best of 3: 125 ms per loop
In [22]: %timeit cv2_based(field_array,match_array)
100 loops, best of 3: 4.08 ms per loop
There's no such search function built in to NumPy, but it is certainly possible to do in NumPy.
As long as your arrays are not too massive*, you could use a rolling window approach:
from skimage.util import view_as_windows
windows = view_as_windows(field_array, match_array.shape)
The function view_as_windows is written purely in NumPy so if you don't have skimage you can always copy the code from here.
Then to see if the sub-array appears in the larger array, you can write:
>>> (windows == match_array).all(axis=(2,3)).any()
True
To find the indices of where the top-left corner of the sub-array matches, you can write:
>>> (windows == match_array).all(axis=(2,3)).nonzero()
(array([6]), array([3]))
This approach should also work for arrays of higher dimensions.
*although the array windows takes up no additional memory (only the strides and shape are changed to create a new view of the data), writing windows == match_array creates a boolean array of size (7, 6, 3, 4) which is 504 bytes of memory. If you're working with very large arrays, this approach might not be feasible.
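If scikit-image is not available, recent NumPy versions (1.20 and later) ship `numpy.lib.stride_tricks.sliding_window_view`, which produces the same kind of windowed view. The sketch below applies it to the arrays from the question; it is an alternative to `view_as_windows`, not the code the answer used.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

raw_field = ( 24,  25,  26,  27,  28,  29,  30,  31,  23,
              33,  34,  35,  36,  37,  38,  39,  40,  32,
             -39, -38, -37, -36, -35, -34, -33, -32, -40,
             -30, -29, -28, -27, -26, -25, -24, -23, -31,
             -21, -20, -19, -18, -17, -16, -15, -14, -22,
             -12, -11, -10,  -9,  -8,  -7,  -6,  -5, -13,
              -3,  -2,  -1,   0,   1,   2,   3,   4,  -4,
               6,   7,   8,   4,   5,   6,   7,  13,   5,
              15,  16,  17,   8,   9,  10,  11,  22,  14)
field_array = np.array(raw_field, int).reshape(9, 9)
match_array = np.arange(12).reshape(3, 4)

# read-only view of every 3x4 window; no data is copied at this point
windows = sliding_window_view(field_array, match_array.shape)

# the comparison is what allocates the (7, 6, 3, 4) boolean array
hits = (windows == match_array).all(axis=(2, 3))
rows, cols = hits.nonzero()
print(windows.shape, rows, cols)
```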
One solution is to search the entire search_in array block-at-a-time (a 'block' being a search_for-shaped slice) until either a matching segment is found or the search_in array is exhausted. I can use it to get coordinates for the matching block, or just a bool result by sending True or False for the return_coords optional argument...
def seek_array(search_in, search_for, return_coords = False):
    """Searches for a contiguous instance of a 2d array `search_for` within a larger `search_in` 2d array.
    If the optional argument return_coords is True, the xy coordinates of the zeroth value of the first matching segment of search_in will be returned, or None if there is no matching segment.
    If return_coords is False, a boolean will be returned.
    * Both arrays must be sent as two-dimensional!"""
    si_x, si_y = search_in.shape
    sf_x, sf_y = search_for.shape
    for y in range(si_y-sf_y+1):
        for x in range(si_x-sf_x+1):
            if numpy.array_equal(search_for, search_in[x:x+sf_x, y:y+sf_y]):
                return (x,y) if return_coords else True # don't forget that coordinates are transposed when viewing NumPy arrays!
    return None if return_coords else False
I wonder if NumPy doesn't already have a function that can do the same thing, though...
To add to the answers already posted, I'd like to add one that takes into account errors due to floating point precision, for the case where the matrices come from, say, image processing, where numbers are subject to floating point operations.
You can iterate over the indexes of the larger matrix, searching for the smaller matrix. At each position you extract a submatrix of the larger matrix matching the size of the smaller matrix.
You have a match if the contents of that submatrix of 'large' and of the 'small' matrix match.
The following example shows how to return the indexes of the first location in the large matrix found to match. It would be trivial to extend this function to return an array of all matching locations if that's the intent.
import numpy as np
def find_submatrix(a, b):
    """ Searches the first instance at which 'b' is a submatrix of 'a', iterates
    rows first. Returns the indexes of a at which 'b' was found, or None if
    'b' is not contained within 'a'"""
    a_rows=a.shape[0]
    a_cols=a.shape[1]
    b_rows=b.shape[0]
    b_cols=b.shape[1]
    row_diff = a_rows - b_rows
    col_diff = a_cols - b_cols
    # +1 so that the last valid top-left position is also checked
    for idx_row in range(row_diff + 1):
        for idx_col in range(col_diff + 1):
            row_indexes = [idx + idx_row for idx in np.arange(b_rows)]
            col_indexes = [idx + idx_col for idx in np.arange(b_cols)]
            submatrix_indexes = np.ix_(row_indexes, col_indexes)
            a_submatrix = a[submatrix_indexes]
            are_equal = np.allclose(a_submatrix, b) # allclose is used for floating point numbers, if they
                                                    # are close while comparing, they are considered equal.
                                                    # Useful if your matrices come from operations that produce
                                                    # floating point numbers.
                                                    # You might want to fine tune the parameters to allclose()
            if are_equal:
                return [idx_col, idx_row]
    return None
Using the function above you can run the following example:
large_mtx = np.array([[1, 2, 3, 7, 4, 2, 6],
[4, 5, 6, 2, 1, 3, 11],
[10, 4, 2, 1, 3, 7, 6],
[4, 2, 1, 3, 7, 6, -3],
[5, 6, 2, 1, 3, 11, -1],
[0, 0, -1, 5, 4, -1, 2],
[10, 4, 2, 1, 3, 7, 6],
[10, 4, 2, 1, 3, 7, 6]
])
# Example 1: An intersection at column 1 and row 2 of large_mtx
small_mtx_1 = np.array([[4, 2], [2,1]])
intersect = find_submatrix(large_mtx, small_mtx_1)
print("Example 1, intersection (col,row): " + str(intersect))
# Example 2: No intersection
small_mtx_2 = np.array([[-14, 2], [2,1]])
intersect = find_submatrix(large_mtx, small_mtx_2)
print("Example 2, intersection (col,row): " + str(intersect))
Which would print:
Example 1, intersection (col,row): [1, 2]
Example 2, intersection (col,row): None
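The tolerance-based comparison can also be vectorized over all positions at once. The following is a sketch under the assumption that NumPy 1.20+ is available (for `sliding_window_view`); it swaps the explicit double loop for a windowed view plus `np.isclose`, and the function name `find_submatrix_all` is made up for this illustration. Unlike the function above, it returns (row, col) pairs for every match.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def find_submatrix_all(a, b, rtol=1e-05, atol=1e-08):
    """Return a list of (row, col) top-left positions where b matches a
    within the given tolerances (same defaults as np.isclose)."""
    windows = sliding_window_view(a, b.shape)
    close = np.isclose(windows, b, rtol=rtol, atol=atol).all(axis=(2, 3))
    return [(int(r), int(c)) for r, c in np.argwhere(close)]

large_mtx = np.array([[ 1, 2,  3, 7, 4,  2,  6],
                      [ 4, 5,  6, 2, 1,  3, 11],
                      [10, 4,  2, 1, 3,  7,  6],
                      [ 4, 2,  1, 3, 7,  6, -3],
                      [ 5, 6,  2, 1, 3, 11, -1],
                      [ 0, 0, -1, 5, 4, -1,  2],
                      [10, 4,  2, 1, 3,  7,  6],
                      [10, 4,  2, 1, 3,  7,  6]])

found = find_submatrix_all(large_mtx, np.array([[4, 2], [2, 1]]))
missing = find_submatrix_all(large_mtx, np.array([[-14, 2], [2, 1]]))
print(found, missing)
```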
Here's a solution using the as_strided() function from stride_tricks module
import numpy as np
from numpy.lib.stride_tricks import as_strided
# field_array (I modified it to have two matching arrays)
A = np.array([[ 24, 25, 26, 27, 28, 29, 30, 31, 23],
[ 33, 0, 1, 2, 3, 38, 39, 40, 32],
[-39, 4, 5, 6, 7, -34, -33, -32, -40],
[-30, 8, 9, 10, 11, -25, -24, -23, -31],
[-21, -20, -19, -18, -17, -16, -15, -14, -22],
[-12, -11, -10, -9, -8, -7, -6, -5, -13],
[ -3, -2, -1, 0, 1, 2, 3, 4, -4],
[ 6, 7, 8, 4, 5, 6, 7, 13, 5],
[ 15, 16, 17, 8, 9, 10, 11, 22, 14]])
# match_array
B = np.arange(12).reshape(3,4)
# Window view of A
A_w = as_strided(A, shape=(A.shape[0] - B.shape[0] + 1,
A.shape[1] - B.shape[1] + 1,
B.shape[0], B.shape[1]),
strides=2*A.strides).reshape(-1, B.shape[0], B.shape[1])
match = (A_w == B).all(axis=(1,2))
We can also find the indices of the first element of each matching block in A
where = np.where(match)[0]
ind_flat = where + (B.shape[1] - 1)*(np.floor(where/(A.shape[1] - B.shape[1] + 1)).astype(int))
ind = [tuple(row) for row in np.array(np.unravel_index(ind_flat, A.shape)).T]
Result
print(match.any())
True
print(ind)
[(1, 1), (6, 3)]
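The index arithmetic can be avoided by keeping the window view four-dimensional, so that the first two axes of the match mask are already the row and column of each block's top-left corner. This is a sketch of that variant using the same modified field array; `out_shape` and `ind` are illustrative names.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

A = np.array([[ 24,  25,  26,  27,  28,  29,  30,  31,  23],
              [ 33,   0,   1,   2,   3,  38,  39,  40,  32],
              [-39,   4,   5,   6,   7, -34, -33, -32, -40],
              [-30,   8,   9,  10,  11, -25, -24, -23, -31],
              [-21, -20, -19, -18, -17, -16, -15, -14, -22],
              [-12, -11, -10,  -9,  -8,  -7,  -6,  -5, -13],
              [ -3,  -2,  -1,   0,   1,   2,   3,   4,  -4],
              [  6,   7,   8,   4,   5,   6,   7,  13,   5],
              [ 15,  16,  17,   8,   9,  10,  11,  22,  14]])
B = np.arange(12).reshape(3, 4)

# number of valid top-left positions along each axis
out_shape = (A.shape[0] - B.shape[0] + 1, A.shape[1] - B.shape[1] + 1)

# 4-D view: (window row, window col, row inside window, col inside window)
A_w = as_strided(A, shape=out_shape + B.shape, strides=A.strides * 2)

match = (A_w == B).all(axis=(2, 3))          # shape (7, 6)
ind = [tuple(int(v) for v in rc) for rc in np.argwhere(match)]
print(ind)
```

`as_strided` does no bounds checking, so the shape must be computed exactly as above; a wrong shape reads memory outside the array.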

numpy find two dimensional array within another [duplicate]

I have a large NumPy.array field_array and a smaller array match_array, both consisting of int values. Using the following example, how can I check if any match_array-shaped segment of field_array contains values that exactly correspond to the ones in match_array?
import numpy
raw_field = ( 24, 25, 26, 27, 28, 29, 30, 31, 23, \
33, 34, 35, 36, 37, 38, 39, 40, 32, \
-39, -38, -37, -36, -35, -34, -33, -32, -40, \
-30, -29, -28, -27, -26, -25, -24, -23, -31, \
-21, -20, -19, -18, -17, -16, -15, -14, -22, \
-12, -11, -10, -9, -8, -7, -6, -5, -13, \
-3, -2, -1, 0, 1, 2, 3, 4, -4, \
6, 7, 8, 4, 5, 6, 7, 13, 5, \
15, 16, 17, 8, 9, 10, 11, 22, 14)
field_array = numpy.array(raw_field, int).reshape(9,9)
match_array = numpy.arange(12).reshape(3,4)
These examples ought to return True since the pattern described by match_array aligns over [6:9,3:7].
Approach #1
This approach derives from a solution to Implement Matlab's im2col 'sliding' in python that was designed to rearrange sliding blocks from a 2D array into columns. Thus, to solve our case here, those sliding blocks from field_array could be stacked as columns and compared against column vector version of match_array.
Here's a formal definition of the function for the rearrangement/stacking -
def im2col(A,BLKSZ):
# Parameters
M,N = A.shape
col_extent = N - BLKSZ[1] + 1
row_extent = M - BLKSZ[0] + 1
# Get Starting block indices
start_idx = np.arange(BLKSZ[0])[:,None]*N + np.arange(BLKSZ[1])
# Get offsetted indices across the height and width of input array
offset_idx = np.arange(row_extent)[:,None]*N + np.arange(col_extent)
# Get all actual indices & index into input array for final output
return np.take (A,start_idx.ravel()[:,None] + offset_idx.ravel())
To solve our case, here's the implementation based on im2col -
# Get sliding blocks of shape same as match_array from field_array into columns
# Then, compare them with a column vector version of match array.
col_match = im2col(field_array,match_array.shape) == match_array.ravel()[:,None]
# Shape of output array that has field_array compared against a sliding match_array
out_shape = np.asarray(field_array.shape) - np.asarray(match_array.shape) + 1
# Now, see if all elements in a column are ONES and reshape to out_shape.
# Finally, find the position of TRUE indices
R,C = np.where(col_match.all(0).reshape(out_shape))
The output for the given sample in the question would be -
In [151]: R,C
Out[151]: (array([6]), array([3]))
Approach #2
Given that opencv already has template matching function that does square of differences, you can employ that and look for zero differences, which would be your matching positions. So, if you have access to cv2 (opencv module), the implementation would look something like this -
import cv2
from cv2 import matchTemplate as cv2m
M = cv2m(field_array.astype('uint8'),match_array.astype('uint8'),cv2.TM_SQDIFF)
R,C = np.where(M==0)
giving us -
In [204]: R,C
Out[204]: (array([6]), array([3]))
Benchmarking
This section compares runtimes for all the approaches suggested to solve the question. The credit for the various methods listed in this section goes to their contributors.
Method definitions -
def seek_array(search_in, search_for, return_coords = False):
si_x, si_y = search_in.shape
sf_x, sf_y = search_for.shape
for y in xrange(si_y-sf_y+1):
for x in xrange(si_x-sf_x+1):
if numpy.array_equal(search_for, search_in[x:x+sf_x, y:y+sf_y]):
return (x,y) if return_coords else True
return None if return_coords else False
def skimage_based(field_array,match_array):
windows = view_as_windows(field_array, match_array.shape)
return (windows == match_array).all(axis=(2,3)).nonzero()
def im2col_based(field_array,match_array):
col_match = im2col(field_array,match_array.shape)==match_array.ravel()[:,None]
out_shape = np.asarray(field_array.shape) - np.asarray(match_array.shape) + 1
return np.where(col_match.all(0).reshape(out_shape))
def cv2_based(field_array,match_array):
M = cv2m(field_array.astype('uint8'),match_array.astype('uint8'),cv2.TM_SQDIFF)
return np.where(M==0)
Runtime tests -
Case # 1 (Sample data from question):
In [11]: field_array
Out[11]:
array([[ 24, 25, 26, 27, 28, 29, 30, 31, 23],
[ 33, 34, 35, 36, 37, 38, 39, 40, 32],
[-39, -38, -37, -36, -35, -34, -33, -32, -40],
[-30, -29, -28, -27, -26, -25, -24, -23, -31],
[-21, -20, -19, -18, -17, -16, -15, -14, -22],
[-12, -11, -10, -9, -8, -7, -6, -5, -13],
[ -3, -2, -1, 0, 1, 2, 3, 4, -4],
[ 6, 7, 8, 4, 5, 6, 7, 13, 5],
[ 15, 16, 17, 8, 9, 10, 11, 22, 14]])
In [12]: match_array
Out[12]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [13]: %timeit seek_array(field_array, match_array, return_coords = False)
1000 loops, best of 3: 465 µs per loop
In [14]: %timeit skimage_based(field_array,match_array)
10000 loops, best of 3: 97.9 µs per loop
In [15]: %timeit im2col_based(field_array,match_array)
10000 loops, best of 3: 74.3 µs per loop
In [16]: %timeit cv2_based(field_array,match_array)
10000 loops, best of 3: 30 µs per loop
Case #2 (Bigger random data):
In [17]: field_array = np.random.randint(0,4,(256,256))
In [18]: match_array = field_array[100:116,100:116].copy()
In [19]: %timeit seek_array(field_array, match_array, return_coords = False)
1 loops, best of 3: 400 ms per loop
In [20]: %timeit skimage_based(field_array,match_array)
10 loops, best of 3: 54.3 ms per loop
In [21]: %timeit im2col_based(field_array,match_array)
10 loops, best of 3: 125 ms per loop
In [22]: %timeit cv2_based(field_array,match_array)
100 loops, best of 3: 4.08 ms per loop
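For anyone who wants to repeat this kind of comparison outside IPython, here is a rough sketch using the standard timeit module. The two functions are simplified stand-ins written for this example (loop_based and window_based are not the exact definitions benchmarked above), sliding_window_view assumes NumPy 1.20+, and the array sizes are smaller than in Case #2 so that it finishes quickly.

```python
import timeit
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

def loop_based(field_array, match_array):
    """Stand-in for the explicit double loop; returns the first (row, col) or None."""
    fr, fc = field_array.shape
    mr, mc = match_array.shape
    for r in range(fr - mr + 1):
        for c in range(fc - mc + 1):
            if np.array_equal(match_array, field_array[r:r + mr, c:c + mc]):
                return (r, c)
    return None

def window_based(field_array, match_array):
    """Stand-in for the windowed comparison; returns all matches as (rows, cols)."""
    windows = sliding_window_view(field_array, match_array.shape)
    return (windows == match_array).all(axis=(2, 3)).nonzero()

# Random field with a known block planted as the template (sizes are arbitrary).
rng = np.random.default_rng(0)
field_array = rng.integers(0, 4, size=(128, 128))
match_array = field_array[40:48, 40:48].copy()

for func in (loop_based, window_based):
    # Best of 3 single runs, in the spirit of %timeit's "best of 3".
    best = min(timeit.repeat(lambda: func(field_array, match_array),
                             number=1, repeat=3))
    print("%-12s %.2f ms" % (func.__name__, best * 1e3))
```

The absolute numbers will depend on the machine and library versions, so only the relative ordering is worth reading into.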
There's no such search function built into NumPy, but it is certainly possible to do in NumPy.
As long as your arrays are not too massive*, you could use a rolling window approach:
from skimage.util import view_as_windows
windows = view_as_windows(field_array, match_array.shape)
The function view_as_windows is written purely in NumPy, so if you don't have skimage you can always copy the code from the scikit-image source.
Then to see if the sub-array appears in the larger array, you can write:
>>> (windows == match_array).all(axis=(2,3)).any()
True
To find the indices of where the top-left corner of the sub-array matches, you can write:
>>> (windows == match_array).all(axis=(2,3)).nonzero()
(array([6]), array([3]))
This approach should also work for arrays of higher dimensions.
*although the array windows takes up no additional memory (only the strides and shape are changed to create a new view of the data), writing windows == match_array creates a boolean array of size (7, 6, 3, 4) which is 504 bytes of memory. If you're working with very large arrays, this approach might not be feasible.
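To make the memory point concrete, here is a small sketch that checks the size of the comparison array and shows one possible way to keep it small: compare only the top-left element first, then test full windows at those candidate positions. It uses NumPy's sliding_window_view (NumPy 1.20+) as a stand-in for view_as_windows, and find_with_prefilter is a name invented for this example. It assumes match_array is no larger than field_array.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # stand-in for view_as_windows

def find_with_prefilter(field_array, match_array):
    """Hypothetical helper: full window comparison only at candidate positions."""
    h, w = match_array.shape
    rows = field_array.shape[0] - h + 1
    cols = field_array.shape[1] - w + 1
    # Cheap first pass: positions whose top-left element equals match_array[0, 0].
    cand_r, cand_c = np.nonzero(field_array[:rows, :cols] == match_array[0, 0])
    windows = sliding_window_view(field_array, (h, w))  # still just a view
    # Fancy indexing copies only the candidate windows: shape (n_candidates, h, w).
    hits = (windows[cand_r, cand_c] == match_array).all(axis=(1, 2))
    return cand_r[hits], cand_c[hits]

# Sample data from the question.
field_array = np.array([
    [ 24,  25,  26,  27,  28,  29,  30,  31,  23],
    [ 33,  34,  35,  36,  37,  38,  39,  40,  32],
    [-39, -38, -37, -36, -35, -34, -33, -32, -40],
    [-30, -29, -28, -27, -26, -25, -24, -23, -31],
    [-21, -20, -19, -18, -17, -16, -15, -14, -22],
    [-12, -11, -10,  -9,  -8,  -7,  -6,  -5, -13],
    [ -3,  -2,  -1,   0,   1,   2,   3,   4,  -4],
    [  6,   7,   8,   4,   5,   6,   7,  13,   5],
    [ 15,  16,  17,   8,   9,  10,  11,  22,  14]])
match_array = np.arange(12).reshape(3, 4)

# The full comparison: one boolean (1 byte) per element of every window.
full = sliding_window_view(field_array, match_array.shape) == match_array
print(full.shape, full.nbytes)  # (7, 6, 3, 4) 504

print(find_with_prefilter(field_array, match_array))
```

How much the prefilter saves depends on the data: if match_array[0, 0] is a very common value in field_array, almost every position is a candidate and little is gained.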
One solution is to search the entire search_in array block-at-a-time (a 'block' being a search_for-shaped slice) until either a matching segment is found or the search_in array is exhausted. I can use it to get coordinates for the matching block, or just a bool result by sending True or False for the return_coords optional argument...
import numpy
def seek_array(search_in, search_for, return_coords = False):
    """Searches for a contiguous instance of a 2d array `search_for` within a larger `search_in` 2d array.
    If the optional argument return_coords is True, the xy coordinates of the zeroeth value of the first matching segment of search_in will be returned, or None if there is no matching segment.
    If return_coords is False, a boolean will be returned.
    * Both arrays must be sent as two-dimensional!"""
    si_x, si_y = search_in.shape
    sf_x, sf_y = search_for.shape
    for y in range(si_y-sf_y+1):
        for x in range(si_x-sf_x+1):
            if numpy.array_equal(search_for, search_in[x:x+sf_x, y:y+sf_y]):
                return (x,y) if return_coords else True # don't forget that coordinates are transposed when viewing NumPy arrays!
    return None if return_coords else False
I wonder if NumPy doesn't already have a function that can do the same thing, though...
To add to the answers already posted, here is one that accounts for floating point error, for cases where the matrices come from, say, image processing, where the values are the result of floating point operations.
You can iterate over the indexes of the larger matrix, extracting at each position a submatrix of the same size as the smaller matrix.
You have a match if the contents of that submatrix of 'large' and of the 'small' matrix match.
The following example shows how to return the first indexes of the location in the large matrix found to match. It would be trivial to extend this function to return an array of locations found to match if that's the intent.
import numpy as np
def find_submatrix(a, b):
    """ Searches for the first instance at which 'b' is a submatrix of 'a', iterating
    rows first. Returns the indexes [col, row] of a at which 'b' was found, or None if
    'b' is not contained within 'a'"""
    a_rows=a.shape[0]
    a_cols=a.shape[1]
    b_rows=b.shape[0]
    b_cols=b.shape[1]
    row_diff = a_rows - b_rows
    col_diff = a_cols - b_cols
    # +1 so that the last valid offset (b flush against the edge of a) is also checked
    for idx_row in np.arange(row_diff + 1):
        for idx_col in np.arange(col_diff + 1):
            row_indexes = [idx + idx_row for idx in np.arange(b_rows)]
            col_indexes = [idx + idx_col for idx in np.arange(b_cols)]
            submatrix_indexes = np.ix_(row_indexes, col_indexes)
            a_submatrix = a[submatrix_indexes]
            are_equal = np.allclose(a_submatrix, b) # allclose is used for floating point numbers, if they
                                                    # are close while comparing, they are considered equal.
                                                    # Useful if your matrices come from operations that produce
                                                    # floating point numbers.
                                                    # You might want to fine tune the parameters to allclose()
            if (are_equal):
                return [idx_col, idx_row]
    return None
Using the function above you can run the following example:
large_mtx = np.array([[1, 2, 3, 7, 4, 2, 6],
[4, 5, 6, 2, 1, 3, 11],
[10, 4, 2, 1, 3, 7, 6],
[4, 2, 1, 3, 7, 6, -3],
[5, 6, 2, 1, 3, 11, -1],
[0, 0, -1, 5, 4, -1, 2],
[10, 4, 2, 1, 3, 7, 6],
[10, 4, 2, 1, 3, 7, 6]
])
# Example 1: An intersection at column 1 and row 2 of large_mtx (zero-based)
small_mtx_1 = np.array([[4, 2], [2,1]])
intersect = find_submatrix(large_mtx, small_mtx_1)
print("Example 1, intersection (col,row): " + str(intersect))
# Example 2: No intersection
small_mtx_2 = np.array([[-14, 2], [2,1]])
intersect = find_submatrix(large_mtx, small_mtx_2)
print("Example 2, intersection (col,row): " + str(intersect))
Which would print:
Example 1, intersection (col,row): [1, 2]
Example 2, intersection (col,row): None
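As mentioned, the function can be extended to return every matching location. One possible sketch is below. The name find_all_submatrices is made up for this example, it uses plain slicing instead of np.ix_, and it returns (row, col) tuples rather than the [col, row] order used above. Any keyword arguments are passed straight to np.allclose so the tolerances can be tuned.

```python
import numpy as np

def find_all_submatrices(a, b, **allclose_kwargs):
    """Hypothetical variant: list of (row, col) positions where b approximately matches a."""
    b_rows, b_cols = b.shape
    matches = []
    # The +1 keeps the last valid offset (b flush against the edge) in the search.
    for row in range(a.shape[0] - b_rows + 1):
        for col in range(a.shape[1] - b_cols + 1):
            if np.allclose(a[row:row + b_rows, col:col + b_cols], b, **allclose_kwargs):
                matches.append((row, col))
    return matches

large_mtx = np.array([[1, 2, 3, 7, 4, 2, 6],
                      [4, 5, 6, 2, 1, 3, 11],
                      [10, 4, 2, 1, 3, 7, 6],
                      [4, 2, 1, 3, 7, 6, -3],
                      [5, 6, 2, 1, 3, 11, -1],
                      [0, 0, -1, 5, 4, -1, 2],
                      [10, 4, 2, 1, 3, 7, 6],
                      [10, 4, 2, 1, 3, 7, 6]])

# A block that occurs twice in large_mtx.
small_mtx = np.array([[2, 1], [1, 3]])
print(find_all_submatrices(large_mtx, small_mtx))         # [(1, 3), (2, 2)]

# Tiny floating point noise is still within allclose's default tolerances.
print(find_all_submatrices(large_mtx, small_mtx + 1e-9))  # [(1, 3), (2, 2)]
```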
Here's a solution using the as_strided() function from the stride_tricks module:
import numpy as np
from numpy.lib.stride_tricks import as_strided
# field_array (I modified it to have two matching arrays)
A = np.array([[ 24, 25, 26, 27, 28, 29, 30, 31, 23],
[ 33, 0, 1, 2, 3, 38, 39, 40, 32],
[-39, 4, 5, 6, 7, -34, -33, -32, -40],
[-30, 8, 9, 10, 11, -25, -24, -23, -31],
[-21, -20, -19, -18, -17, -16, -15, -14, -22],
[-12, -11, -10, -9, -8, -7, -6, -5, -13],
[ -3, -2, -1, 0, 1, 2, 3, 4, -4],
[ 6, 7, 8, 4, 5, 6, 7, 13, 5],
[ 15, 16, 17, 8, 9, 10, 11, 22, 14]])
# match_array
B = np.arange(12).reshape(3,4)
# Window view of A
A_w = as_strided(A, shape=(A.shape[0] - B.shape[0] + 1,
A.shape[1] - B.shape[1] + 1,
B.shape[0], B.shape[1]),
strides=2*A.strides).reshape(-1, B.shape[0], B.shape[1])
match = (A_w == B).all(axis=(1,2))
We can also find the indices of the first element of each matching block in A
where = np.where(match)[0]
ind_flat = where + (B.shape[1] - 1)*(np.floor(where/(A.shape[1] - B.shape[1] + 1)).astype(int))
ind = [tuple(row) for row in np.array(np.unravel_index(ind_flat, A.shape)).T]
Result
print(match.any())
True
print(ind)
[(1, 1), (6, 3)]
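The index arithmetic can probably be avoided by keeping the window view 4D instead of flattening it, so that the positions of the True entries in match are already the (row, col) coordinates in A. A sketch under that assumption (match_positions is a name invented for this example, and writeable=False is added only as a safety measure):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def match_positions(A, B):
    """Hypothetical helper: (row, col) of the top-left corner of every block of A equal to B."""
    out_shape = (A.shape[0] - B.shape[0] + 1, A.shape[1] - B.shape[1] + 1)
    # Same window view as above, but left 4D: A_w[i, j] is the block starting at A[i, j].
    A_w = as_strided(A, shape=out_shape + B.shape, strides=2 * A.strides,
                     writeable=False)
    match = (A_w == B).all(axis=(2, 3))  # shape == out_shape
    return [tuple(idx) for idx in np.argwhere(match).tolist()]

# field_array modified to contain two matching blocks, as above.
A = np.array([[ 24,  25,  26,  27,  28,  29,  30,  31,  23],
              [ 33,   0,   1,   2,   3,  38,  39,  40,  32],
              [-39,   4,   5,   6,   7, -34, -33, -32, -40],
              [-30,   8,   9,  10,  11, -25, -24, -23, -31],
              [-21, -20, -19, -18, -17, -16, -15, -14, -22],
              [-12, -11, -10,  -9,  -8,  -7,  -6,  -5, -13],
              [ -3,  -2,  -1,   0,   1,   2,   3,   4,  -4],
              [  6,   7,   8,   4,   5,   6,   7,  13,   5],
              [ 15,  16,  17,   8,   9,  10,  11,  22,  14]])
B = np.arange(12).reshape(3, 4)

print(match_positions(A, B))  # [(1, 1), (6, 3)]
```

As with any use of as_strided, a wrong shape or strides argument can read outside the array's memory, so the shape computation is worth double-checking if you adapt this.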
