Numpy subset matrix based on another with binary data - python

I have an n x m matrix X and an n x p matrix Y, where Y holds binary data. In the end I want a p x n matrix Z, where each entry Z[j][i] applies a function to column i of X, subset to the rows where column j of Y is 1.
For example
>>> X
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> Y
array([[1, 0],
       [1, 0],
       [0, 1]])
n_x, m = X.shape
n_y, p = Y.shape
Z = np.zeros([p, n_x])
for i in range(n_x):
    col = X[:, [i]]
    for j in range(p):
        # this is where I subset col with Y[:, [j]]
        Z[j][i] = my_func(subsetted_column)
The iterations would produce
i=0, j=0: subsetted_column = [[1],[4]]
i=0, j=1: subsetted_column = [[7]]
i=1, j=0: subsetted_column = [[2],[5]]
i=1, j=1: subsetted_column = [[8]]
i=2, j=0: subsetted_column = [[3],[6]]
i=2, j=1: subsetted_column = [[9]]
I assume there is some way to do that nested loop in a single list comprehension. The function my_func also takes a long time, so it would be nice to parallelize it somehow.
Edit: I could do something like
for i in range(n_x):
    for j in range(p):
        subsetted_column = np.trim_zeros(np.multiply(X[:, i], Y[:, j]))
        Z[j][i] = my_func(subsetted_column)
But I still believe there is an easier solution.

Does this do what you want?
import numpy as np
N, M, P = 4, 3, 2
a = np.random.random((N, M))
b = np.random.randint(2, size=(N, P)).astype(bool)
your_func = lambda x: x # insert proper function here
flat = [your_func(ai[bj]) for bj in b.T for ai in a.T]
out = np.empty((P, M), dtype=object)
out.ravel()[:] = flat
print(a)
print(b)
print(out)
Remarks:
It is easiest to convert your masking array to dtype bool because this allows you to use logical indexing.
If your_func returns just a number it's better not to use dtype=object for out.
If you want to parallelise, a list comprehension is perhaps not the best construct, but I'm no expert on that. The loop is an obvious parallelisation target, though, since the order of iterations is irrelevant; see the sketch after the sample output.
Sample output:
[[ 0.62739382  0.85774837  0.81958524]
 [ 0.99690996  0.71202879  0.97636715]
 [ 0.89235107  0.91739852  0.39537849]
 [ 0.0413107   0.11662271  0.72419308]]
[[False  True]
 [ True  True]
 [False False]
 [ True  True]]
[[array([ 0.99690996,  0.0413107 ]) array([ 0.71202879,  0.11662271])
  array([ 0.97636715,  0.72419308])]
 [array([ 0.62739382,  0.99690996,  0.0413107 ])
  array([ 0.85774837,  0.71202879,  0.11662271])
  array([ 0.81958524,  0.97636715,  0.72419308])]]
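On the parallelisation point, here is a minimal sketch using multiprocessing.Pool, assuming your_func is picklable and expensive enough to amortise the process-pool overhead (your_func below is still just a hypothetical stand-in):
import numpy as np
from multiprocessing import Pool

def your_func(x):  # hypothetical stand-in for the real, slow function
    return x.sum()

if __name__ == "__main__":
    N, M, P = 4, 3, 2
    a = np.random.random((N, M))
    b = np.random.randint(2, size=(N, P)).astype(bool)
    # same iteration order as the list comprehension above,
    # so the final reshape to (P, M) stays valid
    cols = [ai[bj] for bj in b.T for ai in a.T]
    with Pool() as pool:
        flat = pool.map(your_func, cols)
    out = np.array(flat).reshape(P, M)
    print(out)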

It may help to perform the subsetting in a pre-processing loop
In [112]: xs = [X[y,:] for y in Y.astype(bool).T]
In [113]: xs
Out[113]:
[array([[1, 2, 3],
        [4, 5, 6]]),
 array([[7, 8, 9]])]
(.T is used so the list comprehension iterates over columns; casting to bool allows boolean-mask selection.)
Let's say, for example, that my_func takes the mean on axis=0 of the subsets:
In [116]: [np.mean(s, axis=0) for s in xs]
Out[116]: [array([ 2.5,  3.5,  4.5]), array([ 7.,  8.,  9.])]
In [117]: np.array(_)
Out[117]:
array([[ 2.5,  3.5,  4.5],
       [ 7. ,  8. ,  9. ]])
I could combine it into one loop, but it's harder to think about:
np.array([np.mean(X[y,:],axis=0) for y in Y.astype(bool).T])
With this xs list, you can focus your efforts on applying my_func efficiently to all the columns of xs[i] as np.mean(xs[i], axis=0) does.
The double loop version of this mean
In [121]: p = np.zeros((2,3))
In [122]: for i in range(2):
     ...:     for j in range(3):
     ...:         p[i,j] = np.mean(xs[i][:,j])
     ...:
In [123]: p
Out[123]:
array([[ 2.5,  3.5,  4.5],
       [ 7. ,  8. ,  9. ]])
The equivalent double list comprehension:
In [125]: [[np.mean(i) for i in j.T] for j in xs]
Out[125]: [[2.5, 3.5, 4.5], [7.0, 8.0, 9.0]]
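Putting it together for an arbitrary slow function, a minimal sketch (my_func here is a hypothetical stand-in; substitute the real one):
import numpy as np

def my_func(col):  # hypothetical stand-in for the slow function
    return col.sum() / col.size

X = np.arange(1, 10).reshape(3, 3)
Y = np.array([[1, 0], [1, 0], [0, 1]])

# row j of Z corresponds to column j of Y; column i to column i of X
Z = np.array([[my_func(col) for col in X[y, :].T]
              for y in Y.astype(bool).T])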

Related

Replacing values in n-dimensional tensor given indices from np.argwhere()

I'm somewhat new to numpy so this might be a dumb question, but here goes:
Let's say I have a tensor of any shape and size, say (100,5,5) or (3,3,10,15,4). I have a randomly generated list of indices for points I want to replace with np.nan. For a (3,3,3) test case, it would be as follows:
>>> data = np.random.randn(3,3,3)
>>> data
array([[[ 0.21368315, -1.42814113,  1.23021783],
        [ 0.25835315,  0.44775156, -1.20489094],
        [ 0.25928972,  0.39486046, -1.79189447]],

       [[ 2.24080908, -0.89617961, -0.29550817],
        [ 0.21756087,  1.33996913, -1.24418745],
        [-0.63617598,  0.56848439,  0.8175564 ]],

       [[ 0.61367002, -1.16104071, -0.53488283],
        [ 1.0363354 , -0.76888041,  1.24524786],
        [-0.84329375, -0.61744489,  1.50502058]]])
>>> idxs = np.argwhere(np.isfinite(data))
>>> dropidxs = idxs[np.random.choice(idxs.shape[0], 3, replace=False)]
>>> dropidxs
array([[1, 1, 1],
       [2, 0, 2],
       [2, 1, 0]])
How do I replace the corresponding values? Previously, when I was only dealing with the 3D case, I did it using the following.
for idx in dropidxs:
    i, j, k = idx
    missingCube[i, j, k] = np.nan
But now, I want the function to be able to handle tensors of any size.
I've tried
for idx in dropidxs:
    missingCube[idx] = np.nan
and
missingCube[dropidxs] = np.nan
But both (unsurprisingly) end up removing a corresponding slice along axis=0. How should I approach this? Is there an easier way to achieve what I'm trying to do?
In [486]: data = np.random.randn(3,3,3)
With this creation all terms are finite, so nonzero returns a tuple of (27,) arrays:
In [487]: idx = np.nonzero(np.isfinite(data))
In [488]: len(idx)
Out[488]: 3
In [489]: idx[0].shape
Out[489]: (27,)
argwhere produces the same numbers, but in a 2d array:
In [490]: idxs = np.argwhere(np.isfinite(data))
In [491]: idxs.shape
Out[491]: (27, 3)
So you select a subset.
In [492]: dropidxs = idxs[np.random.choice(idxs.shape[0], 3, replace=False)]
In [493]: dropidxs.shape
Out[493]: (3, 3)
In [494]: dropidxs
Out[494]:
array([[1, 1, 0],
       [2, 1, 2],
       [2, 1, 1]])
We could have generated the same subset with x = np.random.choice(...) and applied that x to the arrays in the idx tuple. But in this case, the argwhere array is easier to work with.
But to apply that array to indexing we still need a tuple of arrays:
In [495]: tup = tuple([dropidxs[:,i] for i in range(3)])
In [496]: tup
Out[496]: (array([1, 2, 2]), array([1, 1, 1]), array([0, 2, 1]))
In [497]: data[tup]
Out[497]: array([-0.27965058, 1.2981397 , 0.4501406 ])
In [498]: data[tup]=np.nan
In [499]: data
Out[499]:
array([[[-0.4899279 ,  0.83352547, -1.03798762],
        [-0.91445783,  0.05777183,  0.19494065],
        [ 0.6835925 , -0.47846423,  0.13513958]],

       [[-0.08790631,  0.30224828, -0.39864576],
        [        nan, -0.77424244,  1.4788093 ],
        [ 0.41915952, -0.09335664, -0.47359613]],

       [[-0.40281937,  1.64866377, -0.40354504],
        [ 0.74884493,         nan,         nan],
        [ 0.13097487, -1.63995208, -0.98857852]]])
Or we could index with:
In [500]: data[dropidxs[:,0],dropidxs[:,1],dropidxs[:,2]]
Out[500]: array([nan, nan, nan])
Actually, a transpose of dropidxs might be more convenient:
In [501]: tdrop = dropidxs.T
In [502]: tuple(tdrop)
Out[502]: (array([1, 2, 2]), array([1, 1, 1]), array([0, 2, 1]))
In [503]: data[tuple(tdrop)]
Out[503]: array([nan, nan, nan])
Sometimes we can use * to expand a list/array into a tuple, but not when indexing:
In [504]: data[*tdrop]
  File "<ipython-input-504-cb619d907adb>", line 1
    data[*tdrop]
         ^
SyntaxError: invalid syntax
but we can create the tuple with:
In [506]: data[(*tdrop,)]
Out[506]: array([nan, nan, nan])
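As a hedged aside, the whole argwhere detour can be skipped by sampling positions directly from the nonzero-style tuple (pick is a name introduced here for illustration):
# sample 3 positions along the (27,) index arrays, then build the
# tuple of fancy indices directly
pick = np.random.choice(idx[0].size, 3, replace=False)
data[tuple(ix[pick] for ix in idx)] = np.nan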
Is this what you're searching for?
import numpy as np
x = np.random.randn(10, 3, 3, 3)
new_value = 0
x[x < 0] = new_value
or
x[x == -np.inf] = 0
You can choose from the flattened indices and convert back to data indices to set elements to np.nan. Here a generator seeded with 41 makes the result reproducible; we choose 3 elements.
import numpy as np
data = np.random.randn(3,3,3)
rng = np.random.default_rng(41)
idx = rng.choice(np.arange(data.size), 3, replace=False)
data[np.unravel_index(idx, data.shape)] = np.nan
data
Output
array([[[ 0.13180452, -0.81228319, -0.04456739],
        [ 0.53060077, -0.2246579 ,  1.83926463],
        [-0.38670047, -0.53703577,  0.49275628]],

       [[ 0.36671354,  1.44012848, -0.57209412],
        [ 0.53960111, -1.06578638,  1.10669842],
        [ 1.1772824 ,         nan, -0.82792041]],

       [[-0.03352594,  0.29351109,  0.57021538],
        [-0.33291872,         nan,  0.04675677],
        [        nan,  2.59450517, -1.9579655 ]]])
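Since the code above only uses data.size and data.shape, it is shape-agnostic; a quick sketch for one of the larger shapes mentioned in the question:
# works unchanged for any rank, e.g. the (3, 3, 10, 15, 4) case
big = np.random.randn(3, 3, 10, 15, 4)
drop = rng.choice(big.size, 3, replace=False)
big[np.unravel_index(drop, big.shape)] = np.nan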

Apply custom function/operator between numpy arrays

I have two arrays and I want to compute the distance between them, based on known distances between individual elements.
dist = {(4,3): 0.25, (4,1):0.75, (0,0):0, (3,3):0, (2,1):0.25, (1,0): 0.25}
a = np.array([[4, 4, 0], [3, 2, 1]])
b = np.array([[3, 1, 0]])
a
array([[4, 4, 0],
       [3, 2, 1]])
b
array([[3, 1, 0]])
expected output based on dictionary dist:
array([[0.25, 0.75, 0.  ],
       [0.  , 0.25, 0.25]])
So, if we need to know which elements are different, we can do a != b. Similarly, instead of !=, I want to apply the function below
def get_distance(a, b):
    return dist[(a, b)]
to get the expected output above.
I tried np.vectorize(get_distance)(a, b) and it works. But I am not sure if it is the best way to do the above in a vectorized way. So, for two numpy arrays, what is the best way to apply a custom function/operator?
Instead of storing your distance mapping as a dict, use a np.array for lookup (or possibly a sparse matrix if size becomes an issue).
d = np.zeros((5, 4))
for (x, y), z in dist.items():
    d[x, y] = z
Then, simply index.
>>> d[a, b]
array([[0.25, 0.75, 0.  ],
       [0.  , 0.25, 0.25]])
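If you'd rather not hardcode the (5, 4) shape, it can be derived from the dictionary keys; a minimal sketch:
# size the lookup table from the largest key along each axis
shape = tuple(max(k[i] for k in dist) + 1 for i in range(2))
d = np.zeros(shape)  # (5, 4) for the dist above
for (x, y), z in dist.items():
    d[x, y] = z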
For a sparse solution (code is almost identical):
In [14]: from scipy import sparse
In [15]: d = sparse.dok_matrix((5, 4))
In [16]: for (x, y), z in dist.items():
    ...:     d[x, y] = z
    ...:
In [17]: d[a, b].A
Out[17]:
array([[0.25, 0.75, 0.  ],
       [0.  , 0.25, 0.25]])

How to apply to a numpy array special function based on neighbour elements?

I want to apply a function to a 2d array that works like a cumulative sum, but instead applies max() over the upper, left, and upper-left-diagonal neighbours and the element itself. It should start from the upper-left element and accumulate previous results when calculating the next element. Is there a way to do it without nested loops?
Example:
>>> x = np.arange(9)[np.random.permutation(9)].reshape((3,3))
>>> x
array([[3, 4, 8],
       [0, 6, 5],
       [7, 2, 1]])
>>> res = np.zeros(x.shape)
>>> for i in range(0, x.shape[0]):
...     for j in range(0, x.shape[1]):
...         if i==0:
...             if j==0:
...                 res[i,j] = x[i,j]
...             else:
...                 res[i,j] = max(x[i,j], res[i,j-1])
...         else:
...             if j==0:
...                 res[i,j] = max(x[i,j], res[i-1,j])
...             else:
...                 res[i,j] = max(x[i,j], res[i-1,j], res[i,j-1], res[i-1,j-1])
>>> res
array([[3., 4., 8.],
       [3., 6., 8.],
       [7., 7., 8.]])
Because "later" res elements depend on "earlier" ones computed before them, you cannot use ordinary Numpy vectorization, which works only from elements of the source array. But you can simplify your code as follows:
res = x.copy()
for i in range(x.shape[0]):
    i1 = max(i-1, 0)  # start of row range
    for j in range(x.shape[1]):
        j1 = max(j-1, 0)  # start of column range
        res[i,j] = res[i1:i+1, j1:j+1].max()
The result is:
array([[3, 4, 8],
       [3, 6, 8],
       [7, 7, 8]])
Note that due to res = x.copy() the result has the same dtype as
the original array.
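As an aside (a hedged observation that holds for max specifically, not for arbitrary recurrences): by induction, res[i,j] equals the maximum of x over the rectangle x[:i+1, :j+1], so this particular case does vectorize with two passes of np.maximum.accumulate:
import numpy as np

x = np.array([[3, 4, 8],
              [0, 6, 5],
              [7, 2, 1]])
# running max down the rows, then across the columns
res = np.maximum.accumulate(np.maximum.accumulate(x, axis=0), axis=1)
# array([[3, 4, 8],
#        [3, 6, 8],
#        [7, 7, 8]])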
Numpy has a built-in cumsum() function: with axis=1 it accumulates across the columns of each row, and with axis=0 it accumulates down the rows.
import numpy as np
x = np.arange(9)[np.random.permutation(9)].reshape((3,3))
print(x)
x = np.cumsum(x, dtype=float, axis=1)
print(x)
Output:
[[0 2 8]
 [3 1 6]
 [5 4 7]]
[[ 0.  2. 10.]
 [ 3.  4. 10.]
 [ 5.  9. 16.]]

How to sum every 2 consecutive vectors using numpy

How to sum every 2 consecutive vectors using numpy. Or the mean of every 2 consecutive vectors.
The list of lists can have an even or uneven number of vectors.
example:
[[2,2], [1,2], [1,1], [2,2]] --> [[3,4], [3,3]]
Maybe something like this, but using numpy and something that actually works on an array of vectors and not an array of integers. Or maybe some sort of array comprehension, if that exists.
def pairwiseSum(lst, n):
    sum = 0
    for i in range(len(lst)-1):
        # adding the alternate numbers
        sum = lst[i] + lst[i + 1]
def mean_consecutive_vectors(lst, step):
    idx_list = list(range(step, len(lst), step))
    new_lst = np.split(lst, idx_list)
    return np.mean(new_lst, axis=1)
Same could be done with np.sum() instead of np.mean().
You can reshape your array into pairs, which will allow you to use np.sum() or np.mean() directly by providing the correct axis:
import numpy as np
a = np.array([[2,2], [1,2], [1,1], [2,2]])
np.sum(a.reshape(-1, 2, 2), axis=1)
# array([[3, 4],
# [3, 3]])
Edit to address comment:
To get the means of each adjacent pair, you can add slices of the original array and broadcast division by 2:
>>> a = np.array([[2,2], [1,2], [1,1], [2,2], [11, 10], [20, 30]])
>>> (a[:-1] + a[1:])/2
array([[ 1.5,  2. ],
       [ 1. ,  1.5],
       [ 1.5,  1.5],
       [ 6.5,  6. ],
       [15.5, 20. ]])
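Note that the reshape trick above assumes an even number of vectors. Since the question allows an uneven count, a hedged alternative is np.add.reduceat, which sums each non-overlapping pair and leaves a lone trailing vector as its own group:
import numpy as np

a = np.array([[2, 2], [1, 2], [1, 1], [2, 2], [5, 5]])  # odd count
np.add.reduceat(a, np.arange(0, len(a), 2), axis=0)
# array([[3, 4],
#        [3, 3],
#        [5, 5]])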

Compute inverse of 2D arrays along the third axis in a 3D array without loops

I have an array A whose shape is (N, N, K) and I would like to compute another array B with the same shape where B[:, :, i] = np.linalg.inv(A[:, :, i]).
As solutions, I see map and for loops, but I am wondering if numpy provides a function to do this (I have tried np.apply_over_axes but it seems it can only handle 1D arrays).
with a for loop:
B = np.zeros(shape=A.shape)
for i in range(A.shape[2]):
    B[:, :, i] = np.linalg.inv(A[:, :, i])
with map:
B = np.asarray(list(map(np.linalg.inv, np.squeeze(np.dsplit(A, A.shape[2]))))).transpose(1, 2, 0)
For an invertible matrix M we have inv(M).T == inv(M.T) (the transpose of the inverse is equal to the inverse of the transpose).
Since np.linalg.inv operates on stacks of matrices (it is applied over the last two axes), your problem can be solved by simply transposing A, calling inv, and transposing the result:
B = np.linalg.inv(A.T).T
For example:
>>> N, K = 2, 3
>>> A = np.random.randint(1, 5, (N, N, K))
>>> A
array([[[4, 2, 3],
        [2, 3, 1]],

       [[3, 3, 4],
        [4, 4, 4]]])
>>> B = np.linalg.inv(A.T).T
>>> B
array([[[ 0.4  , -4.   ,  0.5  ],
        [-0.2  ,  3.   , -0.125]],

       [[-0.3  ,  3.   , -0.5  ],
        [ 0.4  , -2.   ,  0.375]]])
You can check the values of B match the inverses of the arrays in A as expected:
>>> all(np.allclose(B[:, :, i], np.linalg.inv(A[:, :, i])) for i in range(K))
True
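An equivalent, arguably more explicit formulation (a matter of taste, not correctness): move the K axis to the front so inv sees a (K, N, N) stack, then move it back:
# inv works on the trailing two axes, so stack the K matrices first
B = np.moveaxis(np.linalg.inv(np.moveaxis(A, 2, 0)), 0, 2)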
