Replace subarrays in numpy

Given an array,
>>> n = 2
>>> a = numpy.array([[[1,1,1],[1,2,3],[1,3,4]]]*n)
>>> a
array([[[1, 1, 1],
[1, 2, 3],
[1, 3, 4]],
[[1, 1, 1],
[1, 2, 3],
[1, 3, 4]]])
I know that it's possible to replace values in it succinctly like so,
>>> a[a==2] = 0
>>> a
array([[[1, 1, 1],
[1, 0, 3],
[1, 3, 4]],
[[1, 1, 1],
[1, 0, 3],
[1, 3, 4]]])
Is it possible to do the same for an entire row (last axis) in the array? I know that a[a==[1,2,3]] = 11 will work and replace all the elements of the matching subarrays with 11, but I'd like to substitute a different subarray. My intuition tells me to write the following, but an error results,
>>> a[a==[1,2,3]] = [11,22,33]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: array is not broadcastable to correct shape
In summary, what I'd like to get is:
array([[[1, 1, 1],
[11, 22, 33],
[1, 3, 4]],
[[1, 1, 1],
[11, 22, 33],
[1, 3, 4]]])
... and n of course is, in general, a lot larger than 2, and the other axes are also larger than 3, so I don't want to loop over them if I don't need to.
Update: The [1,2,3] (or whatever else I'm looking for) is not always at index 1. An example:
a = numpy.array([[[1,1,1],[1,2,3],[1,3,4]], [[1,2,3],[1,1,1],[1,3,4]]])

You can do this efficiently, without a Python loop, by using np.all to check that every element of a row (the last axis) matches your comparison, and then using the resulting mask to replace the values:
mask = np.all(a==[1,2,3], axis=2)
a[mask] = [11, 22, 33]
print(a)
#array([[[ 1, 1, 1],
# [11, 22, 33],
# [ 1, 3, 4]],
#
# [[ 1, 1, 1],
# [11, 22, 33],
# [ 1, 3, 4]]])
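
As a minimal sketch, here is the same mask approach applied to the array from the question's update, where the matching row sits at a different position in each block. The names `target` and `replacement` are illustrative and not part of the original answer.

```python
import numpy as np

# Array from the question's update: [1, 2, 3] is at a different index in each block.
a = np.array([[[1, 1, 1], [1, 2, 3], [1, 3, 4]],
              [[1, 2, 3], [1, 1, 1], [1, 3, 4]]])

target = np.array([1, 2, 3])          # row to look for (illustrative name)
replacement = np.array([11, 22, 33])  # row to write in its place (illustrative name)

# a == target broadcasts over the last axis; a row matches only if all
# of its elements match, so reduce with all() along that axis.
mask = np.all(a == target, axis=-1)   # shape (2, 3), one flag per row

a[mask] = replacement                 # replacement broadcasts to every selected row
print(a)
```

Because the mask has one boolean per row, the position of the match within each block does not matter.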

You have to do something a little more complicated to achieve what you want.
You can't select slices of arrays as such, but you can select all the specific indexes you want.
So first you need to construct an array that represents the rows you wish to select. For example:
data = numpy.array([[1,2,3],[55,56,57],[1,2,3]])
to_select = numpy.array([1,2,3]*3).reshape(3,3) # three rows of [1,2,3]
selected_indices = data == to_select
# array([[ True, True, True],
# [False, False, False],
# [ True, True, True]], dtype=bool)
data = numpy.where(selected_indices, [4,5,6], data)
# array([[4, 5, 6],
# [55, 56, 57],
# [4, 5, 6]])
# done in one step, but perhaps not very clear as to its intent
data = numpy.where(data == numpy.array([1,2,3]*3).reshape(3,3), [4,5,6], data)
numpy.where works elementwise, selecting from the second argument where the condition is true and from the third argument where it is false.
The second and third arguments can each be one of three kinds of data. The first is an array with the same shape as selected_indices, the second is a single value on its own (like 2 or 7), and the third, the most complicated, is an array of any shape that can be broadcast to the shape of selected_indices. In this case we provided [4,5,6], which is repeated for every row to give an array with shape 3x3. Because the selection is elementwise, a row that matches [1,2,3] in only some positions has just those positions replaced.
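
The following sketch contrasts the elementwise condition with a whole-row condition. The third row of `data` and the names `elementwise`, `row_matches` and `whole_row` are made up for illustration; reducing with `all(..., keepdims=True)` is one possible way to require a full-row match, not something the answer above prescribes.

```python
import numpy as np

# The last row matches [1, 2, 3] in two positions only.
data = np.array([[1, 2, 3], [55, 56, 57], [1, 9, 3]])

# Elementwise condition: partially matching rows are partially rewritten.
elementwise = np.where(data == [1, 2, 3], [4, 5, 6], data)

# Whole-row condition: keepdims=True leaves a (3, 1) column of flags
# that broadcasts against the (3, 3) data inside np.where.
row_matches = (data == [1, 2, 3]).all(axis=1, keepdims=True)
whole_row = np.where(row_matches, [4, 5, 6], data)

print(elementwise)
print(whole_row)
```

With the row-level condition, only rows equal to [1, 2, 3] in every position are replaced.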

Not sure if this is what you want, since your code example does not create the array you say it does. But:
>>> a = np.array([[[1,1,1],[1,2,3],[1,3,4]], [[1,1,1],[1,2,3],[1,3,4]]])
>>> a
array([[[1, 1, 1],
[1, 2, 3],
[1, 3, 4]],
[[1, 1, 1],
[1, 2, 3],
[1, 3, 4]]])
>>> a[:,1,:] = [[8, 8, 8], [8,8,8]]
>>> a
array([[[1, 1, 1],
[8, 8, 8],
[1, 3, 4]],
[[1, 1, 1],
[8, 8, 8],
[1, 3, 4]]])
>>> a[:,1,:] = [88, 88, 88]
>>> a
array([[[ 1, 1, 1],
[88, 88, 88],
[ 1, 3, 4]],
[[ 1, 1, 1],
[88, 88, 88],
[ 1, 3, 4]]])

Related

How to remove certain rows from 2d Numpy array when they match a given criteria?

I have a very large 2d Numpy array (a few columns but billions of rows). As the program runs, I get more of them; thousands are generated.
For each one, I'd like to remove all rows that contain certain values in certain positions. For example, if I had:
arr = np.array([
[10, 1, 1, 1],
[1, 2, 1, 2],
[1, 2, 1, 2],
[3, 1, 1, 1],
[2, 2, 1, 2],
[3, 4, 2, 7],
[3, 2, 1, 9],
[3, 2, 2, 2],
])
I'd like to remove all rows that contain the value 2 in both positions 1 and 3, so that I would end up with:
print(arr)
>>> ([
[10, 1, 1, 1],
[3, 1, 1, 1],
[3, 4, 2, 7],
[3, 2, 1, 9],
]),
Because I have such large 2d arrays and so many of them, I'm trying to do this with a Numpy call so that it runs in C, instead of iterating and selecting rows in Python which is much, much slower.
Is there a Numpy way of accomplishing this?
Thanks!
Eduardo

You can use boolean array indexing: i.e. select the 2nd and 4th column and then check that not all of them are equal to 2:
arr[(arr[:, [1,3]] != 2).any(1)]
array([[10, 1, 1, 1],
[ 3, 1, 1, 1],
[ 3, 4, 2, 7],
[ 3, 2, 1, 9]])
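
An equivalent way to write this, sketched here with the question's data, is to build a mask of the rows to drop and invert it. The names `drop` and `kept` are illustrative.

```python
import numpy as np

arr = np.array([
    [10, 1, 1, 1],
    [1, 2, 1, 2],
    [1, 2, 1, 2],
    [3, 1, 1, 1],
    [2, 2, 1, 2],
    [3, 4, 2, 7],
    [3, 2, 1, 9],
    [3, 2, 2, 2],
])

# True for rows that have a 2 in both column 1 and column 3.
drop = (arr[:, [1, 3]] == 2).all(axis=1)

# Boolean indexing with the inverted mask keeps everything else.
# This builds a new array; it does not shrink arr in place.
kept = arr[~drop]
print(kept)
```

Since boolean indexing copies the selected rows, memory use is worth keeping in mind for arrays with billions of rows.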

How do I do masking in PyTorch / Numpy with different dimensions?

I have a mask with a size of torch.Size([20, 1, 199]) and a tensor, reconstruct_output and inputs both with a size of torch.Size([20, 1, 161, 199]).
I want to set reconstruct_output to inputs where the mask is 0. I tried:
reconstruct_output[mask == 0] = inputs[mask == 0]
But I get an error:
IndexError: The shape of the mask [20, 1, 199] at index 2 does not match the shape of the indexed tensor [20, 1, 161, 199] at index 2

We can use advanced indexing here. To obtain the indexing arrays used to index both reconstruct_output and inputs, we need the indices along the mask's axes where mask == 0. For that we can use np.where, and use the resulting indices to update reconstruct_output as:
m = mask == 0
i, _, l = np.where(m)
reconstruct_output[i, ..., l] = inputs[i, ..., l]
Here's a small example which I've checked with:
mask = np.random.randint(0,3, (2, 1, 4))
reconstruct_output = np.random.randint(0,10, (2, 1, 3, 4))
inputs = np.random.randint(0,10, (2, 1, 3, 4))
Giving for instance:
print(reconstruct_output)
array([[[[8, 9, 7, 2],
[5, 4, 6, 1],
[1, 4, 0, 3]]],
[[[4, 3, 3, 4],
[0, 9, 9, 7],
[3, 4, 9, 3]]]])
print(inputs)
array([[[[7, 3, 9, 8],
[3, 1, 0, 8],
[0, 5, 4, 8]]],
[[[3, 7, 5, 8],
[2, 5, 3, 8],
[3, 6, 7, 5]]]])
And the mask:
print(mask)
array([[[0, 1, 2, 1]],
[[1, 0, 1, 0]]])
By using np.where to find the indices where there are zeroes in mask we get:
m = mask == 0
i, _, l = np.where(m)
i
# array([0, 1, 1])
l
# array([0, 1, 3])
Hence we'll be replacing the 0th column from the first 2D array and the 1st and 3rd from the second 2D array.
We can now use these arrays to replace along the corresponding axes indexing as:
reconstruct_output[i, ..., l] = inputs[i, ..., l]
Getting:
reconstruct_output
array([[[[7, 9, 7, 2],
[3, 4, 6, 1],
[0, 4, 0, 3]]],
[[[4, 7, 3, 8],
[0, 5, 9, 8],
[3, 6, 9, 5]]]])
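
As an alternative sketch, the mask can be given an extra length-1 axis so that it broadcasts against the 4D arrays, which avoids building index arrays. This is a different technique from the answer above and it returns a new array instead of updating in place; the seed and the names `expected` and `result` are assumptions for the example. PyTorch's `torch.where` follows the same broadcasting idea, but that is not exercised here.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
mask = rng.integers(0, 3, size=(2, 1, 4))
reconstruct_output = rng.integers(0, 10, size=(2, 1, 3, 4))
inputs = rng.integers(0, 10, size=(2, 1, 3, 4))

# Reference result from the index-based approach above.
expected = reconstruct_output.copy()
i, _, l = np.where(mask == 0)
expected[i, ..., l] = inputs[i, ..., l]

# Insert a length-1 axis so the mask has shape (2, 1, 1, 4), which
# broadcasts against (2, 1, 3, 4) along the missing axis.
m = (mask == 0)[:, :, None, :]
result = np.where(m, inputs, reconstruct_output)

print(np.array_equal(result, expected))
```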

Why does np.argwhere's result shape not match its input?

Suppose I pass a 1D array:
>>> np.arange(0,20)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
>>> np.arange(0,20).shape
(20,)
into argwhere:
>>> np.argwhere(np.arange(0,20)<10)
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]])
>>> np.argwhere(np.arange(0,20)<10).shape
(10, 1)
Why has the result changed into a 2D array? What's the benefit of this?

argwhere returns the coordinates of where condition is True. In general, coordinates are tuples, therefore the output should be 2D.
>>> np.argwhere(np.arange(0,20).reshape(2,2,5)<10)
array([[0, 0, 0],
[0, 0, 1],
[0, 0, 2],
[0, 0, 3],
[0, 0, 4],
[0, 1, 0],
[0, 1, 1],
[0, 1, 2],
[0, 1, 3],
[0, 1, 4]])
For consistency, this also applies to the case of 1D input.
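
A small sketch of that consistency: for inputs of different dimensionality, the result always has one row per match and one column per input dimension. The shapes chosen here are only examples.

```python
import numpy as np

for shape in [(20,), (4, 5), (2, 2, 5)]:
    x = np.arange(20).reshape(shape)
    coords = np.argwhere(x < 10)
    # One row per match, one column per input dimension.
    print(shape, "->", coords.shape)
    # Each row is a full coordinate, so it can index x once converted to a tuple.
    assert all(x[tuple(c)] < 10 for c in coords)
```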

numpy.argwhere finds the indices of the elements that fulfill the condition. In your example the indices happen to be the same as the values themselves, because the array comes from np.arange.
In particular, your input is one-dimensional, so the output has one row per matching element and a single column holding that element's index.
I hope this is clear. If not, take this example of a two-dimensional input array from the numpy documentation:
>>> x = np.arange(6).reshape(2,3)
>>> x
array([[0, 1, 2],
[3, 4, 5]])
>>> np.argwhere(x>1)
array([[0, 2],
[1, 0],
[1, 1],
[1, 2]])

argwhere is simply the transpose of where (actually np.nonzero):
In [17]: np.where(np.arange(0,20)<10)
Out[17]: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),)
In [18]: np.transpose(_)
Out[18]:
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]])
where produces a tuple of arrays, one array per dimension (here a 1 element tuple). transpose turns that tuple into an array (e.g. (1,10) shape), and then transposes it. So its number of columns is the ndim of the input condition, and its number of rows is the number of finds.
argwhere can be useful in visualizing the finds, but is not as useful in programs as the where itself. The where tuple can be used to index the condition array directly. The argwhere array is usually used iteratively. For example:
In [19]: x = np.arange(10).reshape(2,5)
In [20]: x %2
Out[20]:
array([[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1]])
In [21]: np.where(x%2)
Out[21]: (array([0, 0, 1, 1, 1]), array([1, 3, 0, 2, 4]))
In [22]: np.argwhere(x%2)
Out[22]:
array([[0, 1],
[0, 3],
[1, 0],
[1, 2],
[1, 4]])
In [23]: x[np.where(x%2)]
Out[23]: array([1, 3, 5, 7, 9])
In [24]: for i in np.argwhere(x%2):
...: print(x[tuple(i)])
...:
1
3
5
7
9
In [25]: [x[tuple(i)] for i in np.argwhere(x%2)]
Out[25]: [1, 3, 5, 7, 9]

Is there a way to loop through the return value of np.where?

Is there a way to loop-through this tuple(?) where the left array are positions in an array and the right array is the value I would like to insert into the given positions:
(array([ 0, 4, 6, ..., 9992, 9996, 9997]), array([3, 3, 3, ..., 3, 3, 3]))
The output above is generated from the following piece of code:
np.where(h2 == h2[i,:].max())[1]
I would like the result to be like this:
array[0] = 3
array[4] = 3
...
array[9997] = 3

Just use a simple indexing:
indices, values = my_tuple
array[indices] = values
If you don't have the final array yet, you can create it with a function such as np.zeros or np.ones, giving it a size of at least the maximum index plus one.
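
A minimal sketch of this, using short stand-in arrays in place of the tuple from the question (the actual data is not shown there):

```python
import numpy as np

# Stand-ins for the (positions, values) tuple in the question.
my_tuple = (np.array([0, 4, 6]), np.array([3, 3, 3]))
indices, values = my_tuple

# The largest index must be valid, so the length is max index + 1.
array = np.zeros(indices.max() + 1, dtype=values.dtype)

# One vectorized assignment replaces the explicit loop.
array[indices] = values
print(array)
```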

I think you want the transpose of the where tuple:
In [204]: x=np.arange(1,13).reshape(3,4)
In [205]: x
Out[205]:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
In [206]: idx=np.where(x)
In [207]: idx
Out[207]:
(array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2], dtype=int32),
array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], dtype=int32))
In [208]: ij=np.transpose(idx)
In [209]: ij
Out[209]:
array([[0, 0],
[0, 1],
[0, 2],
[0, 3],
[1, 0],
[1, 1],
[1, 2],
[1, 3],
[2, 0],
[2, 1],
[2, 2],
[2, 3]], dtype=int32)
In fact there's a function that does just that:
np.argwhere(x)
Iterating on ij, I can print:
In [213]: for i,j in ij:
...: print('array[{}]={}'.format(i,j))
...:
array[0]=0
array[0]=1
array[0]=2
zip(*) is a list version of transpose:
for i,j in zip(*idx):
    print(i,j)

Selecting specific rows and columns from NumPy array

I've been going crazy trying to figure out what stupid thing I'm doing wrong here.
I'm using NumPy, and I have specific row indices and specific column indices that I want to select from. Here's the gist of my problem:
import numpy as np
a = np.arange(20).reshape((5,4))
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11],
# [12, 13, 14, 15],
# [16, 17, 18, 19]])
# If I select certain rows, it works
print(a[[0, 1, 3], :])
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [12, 13, 14, 15]])
# If I select certain rows and a single column, it works
print(a[[0, 1, 3], 2])
# array([ 2, 6, 14])
# But if I select certain rows AND certain columns, it fails
print(a[[0,1,3], [0,2]])
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# ValueError: shape mismatch: objects cannot be broadcast to a single shape
Why is this happening? Surely I should be able to select the 1st, 2nd, and 4th rows, and 1st and 3rd columns? The result I'm expecting is:
a[[0,1,3], [0,2]] => [[0, 2],
[4, 6],
[12, 14]]

As Toan suggests, a simple hack would be to just select the rows first, and then select the columns over that.
>>> a[[0,1,3], :] # Returns the rows you want
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[12, 13, 14, 15]])
>>> a[[0,1,3], :][:, [0,2]] # Selects the columns you want as well
array([[ 0, 2],
[ 4, 6],
[12, 14]])
[Edit] The built-in method: np.ix_
I recently discovered that numpy gives you a built-in one-liner for doing exactly what @Jaime suggested, but without having to use broadcasting syntax (which is less readable). From the docs:
Using ix_ one can quickly construct index arrays that will index the
cross product. a[np.ix_([1,3],[2,5])] returns the array [[a[1,2] a[1,5]], [a[3,2] a[3,5]]].
So you use it like this:
>>> a = np.arange(20).reshape((5,4))
>>> a[np.ix_([0,1,3], [0,2])]
array([[ 0, 2],
[ 4, 6],
[12, 14]])
And the way it works is that it takes care of aligning arrays the way Jaime suggested, so that broadcasting happens properly:
>>> np.ix_([0,1,3], [0,2])
(array([[0],
[1],
[3]]), array([[0, 2]]))
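Also, as MikeC says in a comment, np.ix_ has the advantage that you can assign through it, which my first (pre-edit) answer did not allow. Reading a[np.ix_(...)] still returns a copy rather than a view, as all advanced indexing does, but assigning to it is a single indexing operation on a, so it writes into a itself. The chained form a[[0,1,3], :][:, [0,2]] = -1 would only modify a temporary copy. This means you can now assign to the indexed array: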
>>> a[np.ix_([0,1,3], [0,2])] = -1
>>> a
array([[-1, 1, -1, 3],
[-1, 5, -1, 7],
[ 8, 9, 10, 11],
[-1, 13, -1, 15],
[16, 17, 18, 19]])
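
A short check of how reading and assigning behave here, as a sketch. The variable names `sub`, `after_copy_write` and `after_chained_write` are illustrative.

```python
import numpy as np

a = np.arange(20).reshape((5, 4))
rows, cols = [0, 1, 3], [0, 2]

# Reading through np.ix_ yields a new array that owns its data.
sub = a[np.ix_(rows, cols)]
sub[:] = -1                       # changes the copy only
after_copy_write = a.copy()

# Chained indexing assigns into a temporary copy, so a is left untouched.
a[rows, :][:, cols] = -1
after_chained_write = a.copy()

# A single indexed assignment writes into a directly.
a[np.ix_(rows, cols)] = -1
print(a)
```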

Fancy indexing requires you to provide all indices for each dimension. You are providing 3 indices for the first one, and only 2 for the second one, hence the error. You want to do something like this:
>>> a[[[0, 0], [1, 1], [3, 3]], [[0,2], [0,2], [0, 2]]]
array([[ 0, 2],
[ 4, 6],
[12, 14]])
That is of course a pain to write, so you can let broadcasting help you:
>>> a[[[0], [1], [3]], [0, 2]]
array([[ 0, 2],
[ 4, 6],
[12, 14]])
This is much simpler to do if you index with arrays, not lists:
>>> row_idx = np.array([0, 1, 3])
>>> col_idx = np.array([0, 2])
>>> a[row_idx[:, None], col_idx]
array([[ 0, 2],
[ 4, 6],
[12, 14]])
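
To make the broadcasting step visible, this sketch expands the two index arrays to their common shape with np.broadcast_arrays and compares the result with the explicit form from earlier in this answer. The names `rows_b` and `cols_b` are illustrative.

```python
import numpy as np

a = np.arange(20).reshape((5, 4))
row_idx = np.array([0, 1, 3])
col_idx = np.array([0, 2])

# Shapes (3, 1) and (2,) broadcast to a common (3, 2) shape.
rows_b, cols_b = np.broadcast_arrays(row_idx[:, None], col_idx)
print(rows_b)  # every row index repeated across the columns
print(cols_b)  # the column indices repeated for every row

# Indexing with the expanded arrays matches the broadcast shorthand.
explicit = a[rows_b, cols_b]
shorthand = a[row_idx[:, None], col_idx]
print(np.array_equal(explicit, shorthand))
```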

USE:
>>> a[[0,1,3]][:,[0,2]]
array([[ 0, 2],
[ 4, 6],
[12, 14]])
OR:
>>> a[[0,1,3],::2]
array([[ 0, 2],
[ 4, 6],
[12, 14]])

Using np.ix_ is the most convenient way to do it (as answered by others), but it also can be done as follows:
>>> rows = [0, 1, 3]
>>> cols = [0, 2]
>>> (a[rows].T)[cols].T
array([[ 0, 2],
[ 4, 6],
[12, 14]])
