How can I find the intersection of two multidimensional arrays faster? - python

There are two multidimensional boolean arrays with a different number of rows. I want to quickly find the indexes of the True values in the rows the two arrays have in common. I wrote the following code, but it is too slow.
Is there a faster way to do this?
import numpy as np
a = np.random.choice(a=[False, True], size=(100, 100))
b = np.random.choice(a=[False, True], size=(1000, 100))
for i in a:
    for j in b:
        if np.array_equal(i, j):
            print(np.where(i))

Let's start with an edited version of the question's code that makes sense and usually prints something:
import numpy as np
a = np.random.choice(a=[False, True], size=(2, 2))
b = np.random.choice(a=[False, True], size=(4, 2))
print(f"a: \n {a}")
print(f"b: \n {b}")
# Collect (row index in a, row index in b) for every pair of identical rows.
matches = []
for i, x in enumerate(a):
    for j, y in enumerate(b):
        if np.array_equal(x, y):
            matches.append((i, j))
And the solution using scipy.spatial.distance.cdist, which compares all rows in a against all rows in b, using the Hamming distance for Boolean vector comparison (a distance of 0 means the two rows are identical):
import numpy as np
from scipy.spatial.distance import cdist
d = cdist(a, b, metric='hamming')
cdist_matches = np.where(d == 0)
matches_values = [(a[i], b[j]) for (i, j) in matches]
cdist_values = a[cdist_matches[0]], b[cdist_matches[1]]
print(f"matches_inds = \n{matches}")
print(f"matches = \n{matches_values}")
print(f"cdist_inds = \n{cdist_matches}")
print(f"cdist_matches =\n {cdist_values}")
out:
a:
[[ True False]
[False False]]
b:
[[ True True]
[ True False]
[False False]
[False True]]
matches_inds =
[(0, 1), (1, 2)]
matches =
[(array([ True, False]), array([ True, False])), (array([False, False]), array([False, False]))]
cdist_inds =
(array([0, 1], dtype=int64), array([1, 2], dtype=int64))
cdist_matches =
(array([[ True, False],
[False, False]]), array([[ True, False],
[False, False]]))
See this for a pure numpy implementation if you don't want to import scipy
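If the full pairwise distance matrix is too large, or scipy is not available, another option is to hash each row. The sketch below is not taken from the answers here: the helper name matching_row_pairs is hypothetical, and it assumes both arrays have the same dtype and the same number of columns. It does one dictionary lookup per row of a instead of scanning all of b.

```python
import numpy as np

def matching_row_pairs(a, b):
    """Return (i, j) pairs with a[i] equal to b[j] (hypothetical helper)."""
    # Map the raw bytes of each row of b to the indices where that row occurs.
    index_of_b_row = {}
    for j, row in enumerate(b):
        index_of_b_row.setdefault(row.tobytes(), []).append(j)
    # One dictionary lookup per row of a, instead of comparing against every row of b.
    return [(i, j)
            for i, row in enumerate(a)
            for j in index_of_b_row.get(row.tobytes(), [])]

a = np.array([[True, False], [False, False]])
b = np.array([[True, True], [True, False], [False, False], [False, True]])
print(matching_row_pairs(a, b))  # [(0, 1), (1, 2)]
```

The work grows roughly with len(a) + len(b) rather than len(a) * len(b), at the cost of a Python-level loop over the rows.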

The comparison of each row of a to each row of b can be made by making the shape of a broadcastable to the shape of b with the use of np.newaxis and np.tile:
import numpy as np
a=np.random.choice(a=[True, False], size=(2,5))
b=np.random.choice(a=[True, False], size=(10,5))
broadcastable_a = np.tile(a[:, np.newaxis, :], (1, b.shape[0], 1))
a_equal_b = np.equal(b, broadcastable_a)
indexes = np.where(a_equal_b)
indexes = np.stack(np.array(indexes[1:]), axis=1)
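The comparison above is elementwise, so to decide whether whole rows match, it still has to be reduced along the last axis. A minimal sketch of that step, assuming the small example arrays shown earlier (np.tile is not needed because a[:, None, :] already broadcasts against b):

```python
import numpy as np

a = np.array([[True, False], [False, False]])
b = np.array([[True, True], [True, False], [False, False], [False, True]])

# Shape (2, 1, 2) against (4, 2) broadcasts to (2, 4, 2); all() over the last
# axis leaves one boolean per (row of a, row of b) pair.
rows_equal = (a[:, None, :] == b).all(axis=-1)

# Each row of pairs is (index in a, index in b) of two identical rows.
pairs = np.argwhere(rows_equal)
print(pairs)

# Column indexes of the True values in each matched row of a.
true_columns = [np.flatnonzero(a[i]) for i, _ in pairs]
print(true_columns)
```

The intermediate array holds len(a) * len(b) * n_columns booleans, so for very large inputs this trades memory for speed.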

Related

NumPy: Find first n columns according to mask

Say I have an array arr in shape (m, n) and a boolean array mask in the same shape as arr. I would like to obtain the first N columns from arr that are True in mask as well.
An example:
arr = np.array([[1,2,3,4,5],
[6,7,8,9,10],
[11,12,13,14,15]])
mask = np.array([[False, True, True, True, True],
[True, False, False, True, False],
[True, True, False, False, False]])
N = 2
Given the above, I would like to write a (vectorized) function that outputs the following:
output = maskify_n_columns(arr, mask, N)
output = np.array(([2,3],[6,9],[11,12]))
You can use broadcasting, numpy.cumsum() and numpy.argmax().
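def maskify_n_columns(arr, mask, N):
    # For k = 1..N, find the first column in each row where the running count of True values reaches k.
    m = (mask.cumsum(axis=1)[..., None] == np.arange(1,N+1)).argmax(axis=1)
    # Pick those columns row by row.
    r = arr[np.arange(arr.shape[0])[:, None], m]
    return r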
maskify_n_columns(arr, mask, 2)
Output:
[[ 2 3]
[ 6 9]
[11 12]]
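One way to gain confidence in the vectorized version is to compare it against a plain per-row loop. The sketch below does that; the reference function maskify_loop is hypothetical and only for checking. Note the assumption that every row of mask has at least N True values: if a row has fewer, argmax over an all-False slice returns 0, so the vectorized version silently returns column 0 for the missing entries.

```python
import numpy as np

def maskify_n_columns(arr, mask, N):
    # First column per row where the running count of True values reaches 1..N.
    m = (mask.cumsum(axis=1)[..., None] == np.arange(1, N + 1)).argmax(axis=1)
    return arr[np.arange(arr.shape[0])[:, None], m]

def maskify_loop(arr, mask, N):
    # Hypothetical reference: select the masked values row by row, keep the first N.
    return np.array([row[row_mask][:N] for row, row_mask in zip(arr, mask)])

arr = np.arange(1, 16).reshape(3, 5)
mask = np.array([[False, True, True, True, True],
                 [True, False, False, True, False],
                 [True, True, False, False, False]])

print(np.array_equal(maskify_n_columns(arr, mask, 2), maskify_loop(arr, mask, 2)))
```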

how to implement this array algorithm in a more efficient way?

Assuming I have n = 3 lists of same length for example:
R1 = [7,5,8,6,0,6,7]
R2 = [8,0,2,2,0,2,2]
R3 = [1,7,5,9,0,9,9]
I need to find the first index t that satisfies the following n = 3 conditions for a period p = 2.
Edit: the meaning of period p is the number of consecutive "boxes".
R1[t] >= 5, R1[t+1] >= 5. Here t + p - 1 = t + 1, so we only need to check two boxes, t and t+1. If p were equal to 3, we would need to check t, t+1 and t+2. Note that the threshold for a given list is always the same: for R1 we always test whether the value is at least 5, whatever the index. The condition is the same for all the "boxes".
R2[t] >= 2, R2[t+1] >= 2
R3[t] >= 9, R3[t+1] >= 9
In total there are 3 * p conditions.
Here the t I am looking for is 5 (indexing is starting from 0).
The basic way to do this is by looping on all the indexes using a for loop. If the condition is found for some index t we store it in some local variable temp and we verify the conditions still hold for every element whose index is between t+1 and t+p -1. If while checking, we find an index that does not satisfy a condition, we forget about the temp and we keep going.
What is the most efficient way to do this in Python if I have large lists (like of 10000 elements)? Is there a more efficient way than the for loop?
Since all your conditions are the same (>=), we could leverage this.
This solution will work for any number of conditions and any size of analysis window, and no for loop is used.
You have an array:
>>> R = np.array([R1, R2, R3]).T
>>> R
array([[7, 8, 1],
[5, 0, 7],
[8, 2, 5],
[6, 2, 9],
[0, 0, 0],
[6, 2, 9],
[7, 2, 9]])
and you have thresholds:
>>> thresholds = [5, 2, 9]
So you can check where the conditions are met:
>>> R >= thresholds
array([[ True, True, False],
[ True, False, False],
[ True, True, False],
[ True, True, True],
[False, False, False],
[ True, True, True],
[ True, True, True]])
And where they are all met at the same time:
>>> R_cond = np.all(R >= thresholds, axis=1)
>>> R_cond
array([False, False, False, True, False, True, True])
From there, you want the conditions to be met for a given window.
We'll use the fact that booleans can sum together, and convolution to apply the window:
>>> win_size = 2
>>> R_conv = np.convolve(R_cond, np.ones(win_size), mode="valid")
>>> R_conv
array([0., 0., 1., 1., 1., 2.])
The resulting array will have values equal to win_size at the indices where all conditions are met on the window range.
So let's retrieve the first of those indices:
>>> index = np.where(R_conv == win_size)[0][0]
>>> index
5
If no such index exists, this raises an IndexError; I'm leaving it to you to handle that.
So, as a one-liner function, it gives:
def idx_conditions(arr, thresholds, win_size, condition):
    return np.where(
        np.convolve(
            np.all(condition(arr, thresholds), axis=1),
            np.ones(win_size),
            mode="valid"
        )
        == win_size
    )[0][0]
I added the condition as an argument to the function, to be more general.
>>> from operator import ge
>>> idx_conditions(R, thresholds, win_size, ge)
5
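The IndexError case mentioned above can be handled by checking whether any window qualifies before indexing. A minimal sketch, where the function name first_window_index and the choice to return None are assumptions made for illustration:

```python
import numpy as np
from operator import ge

def first_window_index(arr, thresholds, win_size, condition=ge):
    """First row index t where all conditions hold on rows t..t+win_size-1, else None."""
    # One boolean per row: True when every column meets its threshold.
    ok = np.all(condition(arr, thresholds), axis=1)
    # Sliding sum of the booleans; a full window sums to win_size.
    counts = np.convolve(ok.astype(int), np.ones(win_size, dtype=int), mode="valid")
    hits = np.flatnonzero(counts == win_size)
    return int(hits[0]) if hits.size else None

R = np.array([[7, 5, 8, 6, 0, 6, 7],
              [8, 0, 2, 2, 0, 2, 2],
              [1, 7, 5, 9, 0, 9, 9]]).T
print(first_window_index(R, [5, 2, 9], 2))     # 5
print(first_window_index(R, [10, 10, 10], 2))  # None
```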
This could be a way:
R1 = [7,5,8,6,0,6,7]
R2 = [8,0,2,2,0,2,2]
R3 = [1,7,5,9,0,9,9]
for i, inext in zip(range(len(R1)), range(len(R1))[1:]):
    if (R1[i]>=5 and R1[inext]>=5) and (R2[i]>=2 and R2[inext]>=2) and (R3[i]>=9 and R3[inext]>=9):
        print(i)
Output:
5
Edit: Generalization could be:
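def foo(ls, conditions):
    # Return the first index i where every list meets its threshold at i and i+1.
    n = len(ls[0])  # use the lists passed in, not the global R1
    for i, inext in zip(range(n), range(1, n)):
        if all((ls[j][i]>=conditions[j] and ls[j][inext]>=conditions[j]) for j in range(len(ls))):
            return i
    return None  # no index satisfies the conditions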
R1 = [7,5,8,6,0,6,7]
R2 = [8,0,2,2,0,2,2]
R3 = [1,7,5,9,0,9,9]
R4 = [1,7,5,9,0,1,1]
R5 = [1,7,5,9,0,3,3]
conditions=[5,2,9,1,3]
ls=[R1,R2,R3,R4,R5]
print(foo(ls,conditions))
Output:
5
And, maybe if the arrays match the conditions multiple times, you could return a list of the indexes:
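def foo(ls, conditions):
    # Collect every index i where every list meets its threshold at i and i+1.
    index = []
    n = len(ls[0])  # use the lists passed in, not the global R1
    for i, inext in zip(range(n), range(1, n)):
        if all((ls[j][i]>=conditions[j] and ls[j][inext]>=conditions[j]) for j in range(len(ls))):
            index.append(i)
    return index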
R1 = [6,7,8,6,0,6,7]
R2 = [2,2,2,2,0,2,2]
R3 = [9,9,5,9,0,9,9]
R4 = [1,1,5,9,0,1,1]
R5 = [3,3,5,9,0,3,3]
conditions=[5,2,9,1,3]
ls=[R1,R2,R3,R4,R5]
print(foo(ls,conditions))
Output:
[0, 5]
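The loops above are written for p = 2. A sketch of the same idea for an arbitrary period p, stopping at the first match, could look like this (the name first_index and the p parameter are assumptions, not part of the answer above):

```python
def first_index(ls, conditions, p=2):
    """First t such that ls[j][t..t+p-1] >= conditions[j] for every list j, else None."""
    n = len(ls[0])
    for t in range(n - p + 1):
        # Check every list against its own threshold on the p consecutive boxes.
        if all(row[k] >= c
               for row, c in zip(ls, conditions)
               for k in range(t, t + p)):
            return t
    return None

R1 = [7, 5, 8, 6, 0, 6, 7]
R2 = [8, 0, 2, 2, 0, 2, 2]
R3 = [1, 7, 5, 9, 0, 9, 9]
print(first_index([R1, R2, R3], [5, 2, 9], p=2))  # 5
```

Because all() short-circuits, a window is abandoned as soon as one box fails.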
Here is a solution using numpy, without for loops:
import numpy as np
R1 = np.array([7,5,8,6,0,6,7])
R2 = np.array([8,0,2,2,0,2,2])
R3 = np.array([1,7,5,9,0,9,9])
a = np.logical_and(np.logical_and(R1>=5,R2>=2),R3>=9)
np.where(np.logical_and(a[:-1],a[1:]))[0].item()
output:
5
Edit:
Generalization
Say you have a list of lists R and a list of conditions c:
R = [[7,5,8,6,0,6,7],
[8,0,2,2,0,2,2],
[1,7,5,9,0,9,9]]
c = [5,2,9]
First we convert them to numpy arrays. The reshape(-1,1) converts c to a column matrix so that we can use NumPy's broadcasting in the >= operator:
R = np.array(R)
c = np.array(c).reshape(-1,1)
R>=c
output:
array([[ True, True, True, True, False, True, True],
[ True, False, True, True, False, True, True],
[False, False, False, True, False, True, True]])
Then we perform a logical AND across all rows using the reduce function:
a = np.logical_and.reduce(R>=c)
a
output:
array([False, False, False, True, False, True, True])
Next we create two arrays by removing the last and the first element of a, respectively, and perform a logical AND between them, which shows where two consecutive elements satisfy the conditions in all lists:
np.logical_and(a[:-1],a[1:])
output:
array([False, False, False, False, False, True])
Now np.where just gives the index of the True element (.item() assumes there is exactly one):
np.where(np.logical_and(a[:-1],a[1:]))[0].item()
output:
5
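The a[:-1], a[1:] trick covers a window of two. For a longer period p, one possible generalization (an assumption on my part, not from the answer above) uses numpy.lib.stride_tricks.sliding_window_view, which requires NumPy 1.20 or later:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

R = np.array([[7, 5, 8, 6, 0, 6, 7],
              [8, 0, 2, 2, 0, 2, 2],
              [1, 7, 5, 9, 0, 9, 9]])
c = np.array([5, 2, 9]).reshape(-1, 1)
p = 2  # number of consecutive boxes

# True where every list meets its threshold at that index.
a = np.logical_and.reduce(R >= c)
# Row t of the view is a[t:t+p]; the window qualifies when all of it is True.
windows_ok = sliding_window_view(a, p).all(axis=1)
first = int(np.argmax(windows_ok)) if windows_ok.any() else None
print(first)  # 5
```

The any() check matters because argmax returns 0 when no window qualifies.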

numpy mask for 2d array with all values in 1d array

I want to convert a 2d matrix of dates to a boolean matrix based on the dates in a 1d matrix, i.e.,
[[20030102, 20030102, 20070102],
[20040102, 20040102, 20040102],
[20050102, 20050102, 20050102]]
should become
[[True, True, False],
[False, False, False],
[True, True, True]]
if I provide a 1d array [20010203, 20030102, 20030501, 20050102, 20060101]
import numpy as np
dateValues = np.array(
[[20030102, 20030102, 20030102],
[20040102, 20040102, 20040102],
[20050102, 20050102, 20050102]])
requestedDates = [20010203, 20030102, 20030501, 20050102, 20060101]
ix = np.in1d(dateValues.ravel(), requestedDates).reshape(dateValues.shape)
print(ix)
Returns:
[[ True True True]
[False False False]
[ True True True]]
Refer to numpy.in1d for more information (documentation):
http://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html
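On newer NumPy versions, np.isin (added in NumPy 1.13) does the same membership test while keeping the shape of the input, so the ravel/reshape step is not needed; np.in1d itself is deprecated as of NumPy 2.0. A short sketch using the array from the question:

```python
import numpy as np

dateValues = np.array([[20030102, 20030102, 20070102],
                       [20040102, 20040102, 20040102],
                       [20050102, 20050102, 20050102]])
requestedDates = [20010203, 20030102, 20030501, 20050102, 20060101]

# Elementwise membership test; the result has the same shape as dateValues.
ix = np.isin(dateValues, requestedDates)
print(ix)
```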
a = np.array([[20030102, 20030102, 20070102],
[20040102, 20040102, 20040102],
[20050102, 20050102, 20050102]])
b = np.array([20010203, 20030102, 20030501, 20050102, 20060101])
>>> a.shape
(3, 3)
>>> b.shape
(5,)
>>>
For the comparison, you need to broadcast b onto a by adding an axis to a. This compares each element of a with each element of b:
>>> mask = a[...,None] == b
>>> mask.shape
(3, 3, 5)
>>>
Then use np.any() to see if there are any matches
>>> np.any(mask, axis = 2, keepdims = False)
array([[ True, True, False],
[False, False, False],
[ True, True, True]], dtype=bool)
timeit.Timer comparison with in1d:
>>> from timeit import Timer
>>> t = Timer("np.any(a[...,None] == b, axis = 2)","from __main__ import np, a, b")
>>> t.timeit(10000)
0.13268041338812964
>>> t = Timer("np.in1d(a.ravel(), b).reshape(a.shape)","from __main__ import np, a, b")
>>> t.timeit(10000)
0.26060646913566643
>>>
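The timings above are for a 3x3 array and five dates. The broadcasting approach builds an intermediate array of shape a.shape + b.shape, so the ranking may change as the inputs grow. A sketch for rerunning the comparison on larger, made-up data (the sizes are arbitrary assumptions, and np.isin stands in for np.in1d plus reshape):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 1000, size=(200, 200))
b = rng.integers(0, 1000, size=50)

def by_broadcast():
    # Builds a (200, 200, 50) boolean intermediate before reducing.
    return np.any(a[..., None] == b, axis=2)

def by_isin():
    return np.isin(a, b)

# Both approaches must agree before their timings are worth comparing.
same = np.array_equal(by_broadcast(), by_isin())
print("results agree:", same)
for fn in (by_broadcast, by_isin):
    print(fn.__name__, timeit.timeit(fn, number=20))
```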

Getting a grid of a matrix via logical indexing in Numpy

I'm trying to rewrite a function using numpy which is originally in MATLAB. There's a logical indexing part which is as follows in MATLAB:
X = reshape(1:16, 4, 4).';
idx = [true, false, false, true];
X(idx, idx)
ans =
1 4
13 16
When I try to make it in numpy, I can't get the correct indexing:
X = np.arange(1, 17).reshape(4, 4)
idx = [True, False, False, True]
X[idx, idx]
# Output: array([6, 1, 1, 6])
What's the proper way of getting a grid from the matrix via logical indexing?
You could also write:
>>> X[np.ix_(idx,idx)]
array([[ 1, 4],
[13, 16]])
In [1]: X = np.arange(1, 17).reshape(4, 4)
In [2]: idx = np.array([True, False, False, True]) # note that here idx has to
# be an array (not a list)
# or boolean values will be
# interpreted as integers
In [3]: X[idx][:,idx]
Out[3]:
array([[ 1, 4],
[13, 16]])
In numpy this is called fancy indexing. To get the items you want you should use a 2D array of indices.
You can use an outer product to turn your 1D idx into a proper 2D boolean mask. An outer operation, when applied to two 1D sequences, combines each element of one sequence with each element of the other. Recalling that True*True=True and False*True=False, np.multiply.outer(), which gives the same result as np.outer() for 1D inputs, can give you the 2D mask:
idx_2D = np.outer(idx,idx)
#array([[ True, False, False, True],
# [False, False, False, False],
# [False, False, False, False],
# [ True, False, False, True]], dtype=bool)
Which you can use:
X[idx_2D]
array([ 1, 4, 13, 16])
Note that the result is a flat 1D array rather than a 2x2 grid. In your real code you can use X[np.outer(idx,idx)] directly, but it does not save memory; it works the same as if you included a del idx_2D after doing the indexing.
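To make the difference between the approaches concrete, the sketch below compares np.ix_ with the outer-product mask on the same matrix; reshaping the flat result to (n, n), where n is the number of True entries in idx, recovers the grid:

```python
import numpy as np

X = np.arange(1, 17).reshape(4, 4)
idx = np.array([True, False, False, True])

# np.ix_ builds an open mesh, so the result keeps its 2D grid shape.
grid = X[np.ix_(idx, idx)]

# A 2D boolean mask selects the same elements but returns them flattened.
flat = X[np.outer(idx, idx)]

n = int(idx.sum())
print(grid)
print(flat.reshape(n, n))
```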

Check if values in a set are in a numpy array in python

I want to check whether a NumPy array has values in it that are in a set, and if so, set that area in an array = 1. If not, set a keepRaster = 2.
numpyArray = #some imported array
repeatSet= ([3, 5, 6, 8])
confusedRaster = numpyArray[numpy.where(numpyArray in repeatSet)]= 1
Yields:
<type 'exceptions.TypeError'>: unhashable type: 'numpy.ndarray'
Is there a way to loop through it?
for numpyArray
if numpyArray in repeatSet
confusedRaster = 1
else
keepRaster = 2
To clarify and ask for a bit further help:
What I am trying to get at, and am currently doing, is putting a raster input into an array. I need to read values in the 2-d array and create another array based on those values. If the array value is in a set then the value will be 1. If it is not in a set then the value will be derived from another input, but I'll say 77 for now. This is what I'm currently using. My test input has about 1500 rows and 3500 columns. It always freezes at around row 350.
for rowd in range(0, width):
    for cold in range (0, height):
        if numpyarray.item(rowd,cold) in repeatSet:
            confusedArray[rowd][cold] = 1
        else:
            if numpyarray.item(rowd,cold) == 0:
                confusedArray[rowd][cold] = 0
            else:
                confusedArray[rowd][cold] = 2
In versions 1.4 and higher, numpy provides the in1d function.
>>> test = np.array([0, 1, 2, 5, 0])
>>> states = [0, 2]
>>> np.in1d(test, states)
array([ True, False, True, False, True], dtype=bool)
You can use that as a mask for assignment.
>>> test[np.in1d(test, states)] = 1
>>> test
array([1, 1, 1, 5, 1])
Here are some more sophisticated uses of numpy's indexing and assignment syntax that I think will apply to your problem. Note the use of bitwise operators to replace if-based logic:
>>> import numpy
>>> repeat_set = [3, 5, 6, 8]
>>> numpy_array = numpy.arange(9).reshape((3, 3))
>>> confused_array = numpy.arange(9).reshape((3, 3)) % 2
>>> mask = numpy.in1d(numpy_array, repeat_set).reshape(numpy_array.shape)
>>> mask
array([[False, False, False],
[ True, False, True],
[ True, False, True]], dtype=bool)
>>> ~mask
array([[ True, True, True],
[False, True, False],
[False, True, False]], dtype=bool)
>>> numpy_array == 0
array([[ True, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
>>> numpy_array != 0
array([[False, True, True],
[ True, True, True],
[ True, True, True]], dtype=bool)
>>> confused_array[mask] = 1
>>> confused_array[~mask & (numpy_array == 0)] = 0
>>> confused_array[~mask & (numpy_array != 0)] = 2
>>> confused_array
array([[0, 2, 2],
[1, 2, 1],
[1, 2, 1]])
Another approach would be to use numpy.where, which creates a brand new array, using values from the second argument where mask is true, and values from the third argument where mask is false. (As with assignment, the argument can be a scalar or an array of the same shape as mask.) This might be a bit more efficient than the above, and it's certainly more terse:
>>> numpy.where(mask, 1, numpy.where(numpy_array == 0, 0, 2))
array([[0, 2, 2],
[1, 2, 1],
[1, 2, 1]])
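When there are more than two cases, nested numpy.where calls get hard to read. An alternative, not mentioned in the answer above, is numpy.select, which takes a list of conditions and a list of values and uses the first condition that is True for each element. A sketch with the same data, using np.isin for the mask:

```python
import numpy as np

numpy_array = np.arange(9).reshape((3, 3))
repeat_set = [3, 5, 6, 8]

mask = np.isin(numpy_array, repeat_set)
# Conditions are tried in order: in the set -> 1, equal to zero -> 0, otherwise 2.
result = np.select([mask, numpy_array == 0], [1, 0], default=2)
print(result)
```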
Here is one possible way of doing what you want:
numpyArray = np.array([1, 8, 35, 343, 23, 3, 8]) # could be n-Dimensional array
repeatSet = np.array([3, 5, 6, 8])
mask = (numpyArray[...,None] == repeatSet[None,...]).any(axis=-1)
print(mask)
# [False  True False False False  True  True]
In recent versions of numpy you can use a combination of np.isin and np.where to achieve this result. The first function returns a boolean array that is True where an element of the input equals one of the given test elements (see doc), while the second lets you create a new array that takes one value where the specified condition evaluates to True and another value where it evaluates to False.
Example
I'll make an example with a random array but using the specific values you provided.
import numpy as np
repeatSet = ([2, 5, 6, 8])
arr = np.array([[1,5,1],
[0,1,0],
[0,0,0],
[2,2,2]])
out = np.where(np.isin(arr, repeatSet), 1, 77)
> out
array([[77, 1, 77],
[77, 77, 77],
[77, 77, 77],
[ 1, 1, 1]])
