NumPy: Find first n columns according to mask - python

Say I have an array arr in shape (m, n) and a boolean array mask in the same shape as arr. I would like to obtain the first N columns from arr that are True in mask as well.
An example:
arr = np.array([[1,2,3,4,5],
[6,7,8,9,10],
[11,12,13,14,15]])
mask = np.array([[False, True, True, True, True],
[True, False, False, True, False],
[True, True, False, False, False]])
N = 2
Given the above, I would like to write a (vectorized) function that outputs the following:
output = maskify_n_columns(arr, mask, N)
output = np.array(([2,3],[6,9],[11,12]))

You can use broadcasting, numpy.cumsum() and numpy.argmax().
def maskify_n_columns(arr, mask, N):
    m = (mask.cumsum(axis=1)[..., None] == np.arange(1,N+1)).argmax(axis=1)
    r = arr[np.arange(arr.shape[0])[:, None], m]
    return r
maskify_n_columns(arr, mask, 2)
Output:
[[ 2 3]
[ 6 9]
[11 12]]
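To see why this works, the one-liner can be split into its intermediate steps. The following is only a sketch of the same technique; the names counts, hits, cols and enough are illustrative and not part of the original answer, and the final check is an added assumption about how you might want to guard against rows with fewer than N True values.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5],
                [6, 7, 8, 9, 10],
                [11, 12, 13, 14, 15]])
mask = np.array([[False, True, True, True, True],
                 [True, False, False, True, False],
                 [True, True, False, False, False]])
N = 2

# Running count of True values along each row.
counts = mask.cumsum(axis=1)                      # shape (3, 5)

# hits[i, j, k] is True where the running count in row i, column j equals k + 1.
hits = counts[..., None] == np.arange(1, N + 1)   # shape (3, 5, N)

# argmax returns the first True along the column axis, i.e. the column
# in which the (k + 1)-th True value of each row appears.
cols = hits.argmax(axis=1)                        # shape (3, N)

# Pair every row index with its N selected column indices.
result = arr[np.arange(arr.shape[0])[:, None], cols]
print(result)

# Illustrative guard (not in the original answer): if a row has fewer than
# N True values, argmax finds no match and silently returns column 0.
enough = mask.sum(axis=1) >= N
```

Rows for which enough is False would need separate handling, for example by filling them with a sentinel value.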

Related

Compare matrix values columnwise with the corresponding mean

Having a matrix with d features and n samples, I would like to compare each feature of a sample (row) against the mean of the column corresponding to that feature and then assign a corresponding label 1 or 0.
Eg. for a matrix X = [x11, x12; x21, x22] I compute the mean of the two columns (mu1, mu2) and then I keep on comparing (x11, x21 with mu1 and so on) to check whether these are greater or smaller than mu and to then assign a label to them according to the if statement (see below).
I have the mean vector for each column i.e. of length d.
I am currently using for-loops, but these are not computationally efficient.
X_copy = X_train.copy()  # copy, so that X_train itself is not overwritten
mu = np.mean(X_train, axis=0)
for i in range(X_train.shape[0]):
    for j in range(X_train.shape[1]):
        if X_train[i, j] < mu[j]:  # less than the column mean: assign 0
            X_copy[i, j] = 0
        else:
            X_copy[i, j] = 1  # greater than or equal to the column mean: assign 1
Is there any better alternative?
I don't have much experience with Python, so thank you for your understanding.
Compare the array directly with the vector of column means; the mean vector is then compared against each row of the original array. Then convert the data type of the result to int:
>>> X_train = np.random.rand(3, 4)
>>> X_train
array([[0.4789953 , 0.84095907, 0.53538172, 0.04880835],
[0.64554335, 0.50904539, 0.34069036, 0.5290601 ],
[0.84664389, 0.63984867, 0.66111495, 0.89803495]])
>>> (X_train >= X_train.mean(0)).astype(int)
array([[0, 1, 1, 0],
[0, 0, 0, 1],
[1, 0, 1, 1]])
Update:
There is a broadcast mechanism for operations between numpy arrays. For example, when an array is compared with a number, the number is broadcast to every element of the array and compared with each one:
>>> X_train > 0.5
array([[False, True, True, False],
[ True, True, False, True],
[ True, True, True, True]])
>>> X_train > np.full(X_train.shape, 0.5) # Equivalent effect.
array([[False, True, True, False],
[ True, True, False, True],
[ True, True, True, True]])
Similarly, you can compare a vector with a 2D array, as long as the length of the vector is the same as that of the last dimension of the array (here, the number of columns):
>>> mu = X_train.mean(0)
>>> X_train > mu
array([[False, True, True, False],
[False, False, False, True],
[ True, False, True, True]])
>>> X_train > np.tile(mu, (X_train.shape[0], 1)) # Equivalent effect.
array([[False, True, True, False],
[False, False, False, True],
[ True, False, True, True]])
How do you compare along other axes? That is harder to explain briefly, so I refer you to the official NumPy explanation, which is a good place to start: Broadcasting
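As a small illustration of comparing along the other axis, the sketch below labels each element against its row mean instead of its column mean. The array X and the names col_labels and row_labels are made up for this example; keepdims=True is used so that the row means keep a shape of (rows, 1) and broadcast across the columns.

```python
import numpy as np

# Small hand-made array so the means are easy to check by eye.
X = np.array([[1.0, 2.0, 6.0],
              [4.0, 4.0, 4.0]])

# Column means have shape (3,), matching the last axis, so they
# broadcast against every row without any reshaping.
col_labels = (X >= X.mean(axis=0)).astype(int)

# Row means would have shape (2,), which does not match the last axis.
# keepdims=True keeps them as shape (2, 1) so they broadcast across columns.
row_labels = (X >= X.mean(axis=1, keepdims=True)).astype(int)

print(col_labels)
print(row_labels)
```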

Mask of boolean 2D numpy array with True values for elements contained in another 1D numpy array

Take the following example. I have an array test and want to get a boolean mask with True's for all elements that are equal to elements of ref.
import numpy as np
test = np.array([[2, 3, 1, 0], [5, 4, 2, 3], [6, 7, 5 ,4]])
ref = np.array([3, 4, 5])
I am looking for something equivalent to
mask = (test == ref[0]) | (test == ref[1]) | (test == ref[2])
which in this case should yield
>>> print(mask)
[[False, True, False, False],
[ True, True, False, True],
[False, False, True, True]]
but without having to resort to any loops.
NumPy comes with a function, isin, that does exactly this:
np.isin(test, ref)
which returns
array([[False, True, False, False],
[ True, True, False, True],
[False, False, True, True]])
You can use numpy broadcasting:
mask = (test[:,None] == ref[:,None]).any(1)
output:
array([[False, True, False, False],
[ True, True, False, True],
[False, False, True, True]])
NB: this is faster than numpy.isin here, but it creates an intermediate array of shape (X, R, Y), where (X, Y) is the shape of test and R is the length of ref, so it will consume a lot of memory on very large arrays.
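The shape of that intermediate array is easy to check. In the sketch below, ref has been given a fourth value (an assumption made only for this illustration) so that its length differs from the number of rows in test; the name pairwise is also illustrative.

```python
import numpy as np

test = np.array([[2, 3, 1, 0], [5, 4, 2, 3], [6, 7, 5, 4]])
# Extra value 7 added for illustration, so len(ref) != test.shape[0].
ref = np.array([3, 4, 5, 7])

# test[:, None] has shape (3, 1, 4) and ref[:, None] has shape (4, 1);
# together they broadcast to (3, 4, 4), one comparison layer per ref value.
pairwise = test[:, None] == ref[:, None]
print(pairwise.shape)

# Collapse the ref axis: True wherever any reference value matched.
mask = pairwise.any(axis=1)
print(mask)

# The broadcasting result agrees with np.isin.
print(np.array_equal(mask, np.isin(test, ref)))
```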

How can I find the intersection of two multidimensional arrays faster?

There are two multidimensional boolean arrays with different numbers of rows. I want to quickly find the indexes of the True values in the rows they have in common. I wrote the following code, but it is too slow.
Is there a faster way to do this?
a=np.random.choice(a=[False, True], size=(100,100))
b=np.random.choice(a=[False, True], size=(1000,100))
for i in a:
    for j in b:
        if np.array_equal(i, j):
            print(np.where(i))
Let's start with a version of the question that is small enough to usually produce matches and print something:
a = np.random.choice(a=[False, True], size=(2, 2))
b = np.random.choice(a=[False, True], size=(4, 2))
print(f"a: \n {a}")
print(f"b: \n {b}")
matches = []
for i, x in enumerate(a):
    for j, y in enumerate(b):
        if np.array_equal(x, y):
            matches.append((i, j))
And here is a solution using scipy.spatial.distance.cdist, which compares all rows in a against all rows in b, using the Hamming distance to compare Boolean vectors:
import numpy as np
from scipy.spatial.distance import cdist
# the Hamming distance is 0 exactly when two rows are identical
d = cdist(a, b, metric='hamming')
cdist_matches = np.where(d == 0)
matches_values = [(a[i], b[j]) for (i, j) in matches]
cdist_values = a[cdist_matches[0]], b[cdist_matches[1]]
print(f"matches_inds = \n{matches}")
print(f"matches = \n{matches_values}")
print(f"cdist_inds = \n{cdist_matches}")
print(f"cdist_matches =\n {cdist_values}")
out:
a:
[[ True False]
[False False]]
b:
[[ True True]
[ True False]
[False False]
[False True]]
matches_inds =
[(0, 1), (1, 2)]
matches =
[(array([ True, False]), array([ True, False])), (array([False, False]), array([False, False]))]
cdist_inds =
(array([0, 1], dtype=int64), array([1, 2], dtype=int64))
cdist_matches =
(array([[ True, False],
[False, False]]), array([[ True, False],
[False, False]]))
See this for a pure numpy implementation if you don't want to import scipy
The comparison of each row of a to each row of b can be made by making the shape of a broadcastable to the shape of b, using np.newaxis and np.tile:
import numpy as np
a=np.random.choice(a=[True, False], size=(2,5))
b=np.random.choice(a=[True, False], size=(10,5))
broadcastable_a = np.tile(a[:, np.newaxis, :], (1, b.shape[0], 1))
a_equal_b = np.equal(b, broadcastable_a)
indexes = np.where(a_equal_b)
indexes = np.stack(np.array(indexes[1:]), axis=1)
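Note that np.where applied to an element-wise comparison reports every position where two elements agree, not only the rows that match in full. If whole-row matches are what you need, one possible pure NumPy sketch reduces the comparison over the last axis first. The helper name matching_row_pairs and the small arrays below are made up for illustration, and np.tile is left out because plain broadcasting already aligns the shapes.

```python
import numpy as np

def matching_row_pairs(a, b):
    """Return (i, j) index pairs for which a[i] equals b[j] in every column."""
    # (len(a), 1, k) == (1, len(b), k) broadcasts to (len(a), len(b), k).
    equal = a[:, None, :] == b[None, :, :]
    # A pair of rows matches only if all k columns agree.
    rows_match = equal.all(axis=2)
    return np.argwhere(rows_match)

a = np.array([[True, False],
              [False, False]])
b = np.array([[True, True],
              [True, False],
              [False, False],
              [False, True]])

pairs = matching_row_pairs(a, b)
print(pairs)

# Column indexes of the True values in each matched row of a.
true_cols = [np.flatnonzero(a[i]) for i, _ in pairs]
print(true_cols)
```

Like the other broadcasting approaches, this builds an intermediate array of shape (len(a), len(b), k), so memory use grows with the product of the two row counts.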

Mask minimum values in matrix rows

I have this 3x3 matrix:
a=array([[ 1, 11, 5],
[ 3, 9, 9],
[ 5, 7, -3]])
I need to mask the minimum values in each row in order to calculate the mean of each row discarding the minimum values. Is there a general solution?
I have tried with
a_masked=np.ma.masked_where(a==np.ma.min(a,axis=1),a)
This masks the minimum value in the first and third rows, but not in the second row. Why?
I would appreciate any help. Thanks!
The issue is that the comparison a == a.min(axis=1) compares column j with the minimum of row j, rather than comparing each row with its own minimum. This is because a.min(axis=1) returns a 1D vector of length N, which broadcasts like a 1 x N array rather than an N x 1 array. As such, the == operator lines the minima up with the columns instead of the rows.
a == a.min(axis=1)
# array([[ True, False, False],
# [False, False, False],
# [False, False, True]], dtype=bool)
One potential way to fix this is to resize the result of a.min(axis=1) into a column vector (e.g. a 3 x 1 2D array).
a == np.resize(a.min(axis=1), [a.shape[0],1])
# array([[ True, False, False],
# [ True, False, False],
# [False, False, True]], dtype=bool)
Or more simply, as @ColonelBeuvel has shown:
a == a.min(axis=1)[:,None]
Now applying this to your entire line of code.
a_masked = np.ma.masked_where(a == np.resize(a.min(axis=1),[a.shape[0],1]), a)
# masked_array(data =
# [[-- 11 5]
# [-- 9 9]
# [5 7 --]],
# mask =
# [[ True False False]
# [ True False False]
# [False False True]],
# fill_value = 999999)
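With the mask in place, the row means that the question asks for follow directly, because masked entries are skipped by the masked array's mean. The sketch below uses keepdims=True as an alternative way of getting the column-shaped minimum; the names row_min and row_means are illustrative.

```python
import numpy as np

a = np.array([[1, 11, 5],
              [3, 9, 9],
              [5, 7, -3]])

# keepdims=True keeps the minima as a (3, 1) column, so each row is
# compared with its own minimum.
row_min = a.min(axis=1, keepdims=True)
a_masked = np.ma.masked_where(a == row_min, a)

# Masked entries are ignored when computing the mean of each row.
row_means = a_masked.mean(axis=1)
print(row_means)
```

If a row contains its minimum more than once, every occurrence is masked, so all of them are left out of that row's mean.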
What about the built-in min() function?
For every row, min(row) gives you the minimum of that row. Simply append each row's minimum to a list:
minList = []
for row in a:
    minList.append(min(row))

Getting a grid of a matrix via logical indexing in Numpy

I'm trying to rewrite a function using numpy which is originally in MATLAB. There's a logical indexing part which is as follows in MATLAB:
X = reshape(1:16, 4, 4).';
idx = [true, false, false, true];
X(idx, idx)
ans =
1 4
13 16
When I try to make it in numpy, I can't get the correct indexing:
X = np.arange(1, 17).reshape(4, 4)
idx = [True, False, False, True]
X[idx, idx]
# Output: array([6, 1, 1, 6])
What's the proper way of getting a grid from the matrix via logical indexing?
You could also write:
>>> X[np.ix_(idx,idx)]
array([[ 1, 4],
[13, 16]])
In [1]: X = np.arange(1, 17).reshape(4, 4)
In [2]: idx = np.array([True, False, False, True])  # note that here idx has to be an array (not a list), or the boolean values will be interpreted as integers
In [3]: X[idx][:,idx]
Out[3]:
array([[ 1, 4],
[13, 16]])
In numpy this is called fancy indexing. To get the items you want you should use a 2D array of indices.
You can use an outer product to turn your 1D idx into a proper 2D boolean mask. An outer operation, applied to two 1D sequences, combines each element of one sequence with each element of the other. Recalling that True*True=True and False*True=False, np.multiply.outer(), which gives the same result as np.outer() for 1D inputs, can give you the 2D mask:
idx_2D = np.outer(idx,idx)
#array([[ True, False, False, True],
# [False, False, False, False],
# [False, False, False, False],
# [ True, False, False, True]], dtype=bool)
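Which you can use:
X[idx_2D]
array([ 1, 4, 13, 16])
In your real code you can write X[np.outer(idx, idx)] directly, but it does not save memory; it works the same as if you included a del idx_2D after doing the slice. Note that indexing with a 2D boolean mask returns a flat 1D array, so reshape the result if you need the 2 x 2 grid.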
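The three approaches from the answers above can be compared side by side. This is only a sketch: the variable names are illustrative, and the reshape at the end is an added step that assumes the same boolean selector is used for both rows and columns.

```python
import numpy as np

X = np.arange(1, 17).reshape(4, 4)
idx = np.array([True, False, False, True])

# np.ix_ builds an open mesh, selecting every (row, column) combination.
grid_ix = X[np.ix_(idx, idx)]

# Chained indexing: pick the rows first, then the columns.
grid_chain = X[idx][:, idx]

# A 2D boolean mask selects the same elements but flattens them to 1D,
# so the grid shape has to be restored by hand.
flat = X[np.outer(idx, idx)]
grid_outer = flat.reshape(idx.sum(), idx.sum())

print(grid_ix)
print(flat)
```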
