Say I have an array like this:
x = [1, 2, 3]
[4, 5, 6]
[7, 8, 9]
And I want to delete both the ith row and column. So if i=1, I'd create (with 0-indexing):
[1, 3]
[7, 9]
Is there an easy way of doing this with a one-liner? I know I can call np.delete() twice, but that seems a little unclean.
It'd be exactly equivalent to np.delete(np.delete(x, idx, 0), idx, 1), where idx is the index of the row/column pair to delete - it'd just look cleaner.
In [196]: x = np.arange(1,10).reshape(3,3)
If you look at the np.delete code, you'll see that it's Python (not compiled) and takes different approaches depending on how the delete values are specified. One is to make a result array of the right size and copy two slices into it.
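For a single scalar index, that slice-copy approach is roughly like this sketch (a simplification, not np.delete's actual code):
import numpy as np

def delete_row_sketch(x, idx):
    # Allocate a result array one row shorter, then copy the slices
    # on either side of the deleted row (simplified sketch).
    res = np.empty((x.shape[0] - 1, x.shape[1]), dtype=x.dtype)
    res[:idx] = x[:idx]        # rows before idx
    res[idx:] = x[idx + 1:]    # rows after idx
    return res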
Another is to make a boolean mask. For example:
In [197]: mask = np.ones(x.shape[0], bool)
In [198]: mask[1] = 0
In [199]: mask
Out[199]: array([ True, False, True])
Since you are deleting the same row and column, use this indexing:
In [200]: x[mask,:][:,mask]
Out[200]:
array([[1, 3],
[7, 9]])
A 1d boolean mask like this can't be 'broadcasted' in the same way an integer array can.
We can do the 2d advanced indexing with:
In [201]: idx = np.nonzero(mask)[0]
In [202]: idx
Out[202]: array([0, 2])
In [203]: np.ix_(idx,idx)
Out[203]:
(array([[0],
[2]]),
array([[0, 2]]))
In [204]: x[np.ix_(idx,idx)]
Out[204]:
array([[1, 3],
[7, 9]])
Actually ix_ can work directly from the boolean array(s):
In [207]: np.ix_(mask,mask)
Out[207]:
(array([[0],
[2]]),
array([[0, 2]]))
This isn't a one-liner, but it probably is faster than the double delete, since it strips off all the extra baggage that the more general function requires.
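If you do want a one-liner at the call site, you could wrap the mask approach in a small helper (delete_row_col is just a hypothetical name for this sketch):
import numpy as np

def delete_row_col(x, i):
    # Drop row i and column i from a 2-D array using a boolean mask and np.ix_.
    mask = np.ones(x.shape[0], dtype=bool)
    mask[i] = False
    return x[np.ix_(mask, mask)]

# delete_row_col(np.arange(1, 10).reshape(3, 3), 1)
# -> array([[1, 3],
#           [7, 9]])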
This can be easily achieved with numpy's delete function:
arr = np.delete(arr, index, 0) # deletes the desired row
arr = np.delete(arr, index, 1) # deletes the desired column at index
The third argument is the axis.
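For readability, the axis can also be passed by keyword. Here is the same nested one-liner from the question applied to the example array (just a sketch of the same call):
import numpy as np

x = np.arange(1, 10).reshape(3, 3)
i = 1
result = np.delete(np.delete(x, i, axis=0), i, axis=1)  # drop row i, then column i
# result -> array([[1, 3],
#                  [7, 9]])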
Given:
test = numpy.array([[1, 2], [3, 4], [5, 6]])
test[i] gives the ith row (e.g. [1, 2]). How do I access the ith column? (e.g. [1, 3, 5]). Also, would this be an expensive operation?
To access column 0:
>>> test[:, 0]
array([1, 3, 5])
To access row 0:
>>> test[0, :]
array([1, 2])
This is covered in Section 1.4 (Indexing) of the NumPy reference. This is quick, at least in my experience. It's certainly much quicker than accessing each element in a loop.
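On the cost question: basic slicing like this returns a view rather than a copy, so no data is moved. A quick check (a sketch using the ndarray.base attribute):
import numpy as np

test = np.array([[1, 2], [3, 4], [5, 6]])
col = test[:, 0]
print(col.base is test)   # True - the column is a view into test, not a copy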
>>> test[:,0]
array([1, 3, 5])
This command gives you a 1-D array (shape (3,)). If you just want to loop over it, that's fine, but if you want to hstack it with some other array of shape 3xN, you will get
ValueError: all the input arrays must have same number of dimensions
while
>>> test[:,[0]]
array([[1],
[3],
[5]])
gives you a column vector (shape (3, 1)), so that you can do a concatenate or hstack operation.
e.g.
>>> np.hstack((test, test[:,[0]]))
array([[1, 2, 1],
[3, 4, 3],
[5, 6, 5]])
And if you want to access more than one column at a time you could do:
>>> test = np.arange(9).reshape((3,3))
>>> test
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> test[:,[0,2]]
array([[0, 2],
[3, 5],
[6, 8]])
You could also transpose and return a row:
In [4]: test.T[0]
Out[4]: array([1, 3, 5])
Although the question has been answered, let me mention some nuances.
Let's say you are interested in a particular column of the array, say the one at index 1:
arr = numpy.array([[1, 2],
[3, 4],
[5, 6]])
As you already know from other answers, to get it in the form of "row vector" (array of shape (3,)), you use slicing:
arr_col1_view = arr[:, 1]        # creates a view of column 1 of arr
arr_col1_copy = arr[:, 1].copy() # creates a copy of column 1 of arr
To check if an array is a view or a copy of another array you can do the following:
arr_col1_view.base is arr # True
arr_col1_copy.base is arr # False
see ndarray.base.
Besides the obvious difference between the two (modifying arr_col1_view will affect arr), the number of byte-steps for traversing each of them is different:
arr_col1_view.strides[0] # 8 bytes
arr_col1_copy.strides[0] # 4 bytes
see strides and this answer.
Why is this important? Imagine that you have a very big array A instead of arr:
A = np.random.randint(2, size=(10000, 10000), dtype='int32')
A_col1_view = A[:, 1]
A_col1_copy = A[:, 1].copy()
and you want to compute the sum of all the elements of that column, i.e. A_col1_view.sum() or A_col1_copy.sum(). Using the copied version is much faster:
%timeit A_col1_view.sum() # ~248 µs
%timeit A_col1_copy.sum() # ~12.8 µs
This is due to the different number of strides mentioned before:
A_col1_view.strides[0] # 40000 bytes
A_col1_copy.strides[0] # 4 bytes
Although it might seem that using column copies is better, this is not always true, because making a copy also takes time and uses more memory (in this case it took me approx. 200 µs to create A_col1_copy). However, if we needed the copy in the first place, or if we need to do many different operations on a specific column of the array and we are OK with sacrificing memory for speed, then making a copy is the way to go.
If we are mostly interested in working with columns, it could be a good idea to create our array in column-major ('F') order instead of the default row-major ('C') order, and then do the slicing as before to get a column without copying it:
A = np.asfortranarray(A) # or np.array(A, order='F')
A_col1_view = A[:, 1]
A_col1_view.strides[0] # 4 bytes
%timeit A_col1_view.sum() # ~12.6 µs vs ~248 µs
Now, performing the sum operation (or any other) on a column-view is as fast as performing it on a column copy.
Finally let me note that transposing an array and using row-slicing is the same as using the column-slicing on the original array, because transposing is done by just swapping the shape and the strides of the original array.
A[:, 1].strides[0] # 40000 bytes
A.T[1, :].strides[0] # 40000 bytes
To get several independent columns, just:
>>> test[:,[0,2]]
You will get columns 0 and 2.
>>> test
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
>>> ncol = test.shape[1]
>>> ncol
5L
Then you can select the 2nd - 4th column this way:
>>> test[0:, 1:(ncol - 1)]
array([[1, 2, 3],
[6, 7, 8]])
This is not a higher-dimensional problem; it is a 2-dimensional array, and you can slice out whichever columns you wish:
test = numpy.array([[1, 2], [3, 4], [5, 6]])
test[:, a:b] # provide indices in place of a and b
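For example, with the test array from the question, a slice keeps the result 2-D (a small sketch):
import numpy as np

test = np.array([[1, 2], [3, 4], [5, 6]])
print(test[:, 0:1])   # columns 0 (inclusive) to 1 (exclusive), still 2-D
# [[1]
#  [3]
#  [5]]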
I'm a newbie in Python and I don't understand the following code.
I expected test1 and test2 to give me the same result (8, the sum of the second row). Instead, the following code
a=np.matrix([[1,2,3],[1,3, 4]])
b=np.matrix([[0,1]])
print(np.where(b==1))
test1=a[np.nonzero(b==1),:]
print(test1.sum())
ind,_=np.nonzero(b==1); # found in code that I'm trying to understand (why the _ ?)
test2=a[ind,:]
print(test2.sum())
gives me
(array([0]), array([1]))
14
6
In the first case I get the sum of the full matrix, and in the second case I get the sum of the first row (instead of the 2nd).
I don't understand why it behaves this way.
In [869]: a
Out[869]:
matrix([[1, 2, 3],
[1, 3, 4]])
In [870]: b
Out[870]: matrix([[0, 1]])
In this use where is the same as nonzero:
In [871]: np.where(b==1)
Out[871]: (array([0], dtype=int32), array([1], dtype=int32))
In [872]: np.nonzero(b==1)
Out[872]: (array([0], dtype=int32), array([1], dtype=int32))
It gives a tuple, with one indexing array for each dimension (2 for an np.matrix). The ind,_= just unpacks those arrays and throws away the 2nd. _ is a conventional throwaway name (in an interactive session like this one, _ also holds the last output).
In [873]: ind,_ =np.nonzero(b==1)
In [874]: ind
Out[874]: array([0], dtype=int32)
Selecting with where returns the (0,1) value from a. But is that what you want?
In [875]: a[np.where(b==1)]
Out[875]: matrix([[2]])
Adding the : does index the whole array, but with an added dimension; again probably not what we want
In [876]: a[np.where(b==1),:]
Out[876]:
matrix([[[1, 2, 3]],
[[1, 3, 4]]])
ind is a single indexing array, and so selects row 0 from a.
In [877]: a[ind,:]
Out[877]: matrix([[1, 2, 3]])
In [878]:
But is the b==1 supposed to find the 2nd element of b, and then select the 2nd row of a? To do that we have to use the 2nd indexing array from where:
In [878]: a[np.where(b==1)[1],:]
Out[878]: matrix([[1, 3, 4]])
Or the 2nd column from a corresponding to the 2nd column of b
In [881]: a[:,np.where(b==1)[1]]
Out[881]:
matrix([[2],
[3]])
Because a and b are np.matrix, the indexing result is always 2d.
For an array c, where produces a single-element tuple:
In [882]: c=np.array([0,1])
In [883]: np.where(c==1)
Out[883]: (array([1], dtype=int32),)
In [884]: a[_,:] # here _ is the last result, Out[883]
Out[884]: matrix([[1, 3, 4]])
We generally advise using np.array to construct new arrays, even 2d. np.matrix is a convenience for wayward MATLAB users, and often confuses new numpy users.
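For comparison, here is a sketch of the same selection done with a plain np.array, where b is 1-d and nonzero returns a single indexing array:
import numpy as np

a = np.array([[1, 2, 3], [1, 3, 4]])
b = np.array([0, 1])

rows = np.nonzero(b == 1)[0]   # array([1]) - one indexing array for the 1-d b
print(a[rows, :])              # [[1 3 4]]
print(a[rows, :].sum())        # 8, the sum of the second row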
ndarray objects in numpy have a flat property (e.g. array.flat) that allows one to iterate through its elements. For example:
>>> x = np.arange(1, 7).reshape(2, 3)
>>> x
array([[1, 2, 3],
[4, 5, 6]])
>>> x.flat[3]
4
But how can I return a column-major 1D iterator, so that the example above returns 5 instead of 4?
Approach #1
You can use .ravel('F') to have column major order and then index -
x.ravel('F')[3]
Sample run -
In [100]: x
Out[100]:
array([[1, 2, 3],
[4, 5, 6]])
In [101]: x.ravel('F')[3]
Out[101]: 5
This will create a copy of the entire array before selecting elements -
In [161]: np.may_share_memory(x, x.ravel())
Out[161]: True
In [162]: np.may_share_memory(x, x.ravel('F'))
Out[162]: False
As such, this may not be the most memory-efficient option. For a better one, let's move on to another approach.
Approach #2
We can get the row and column indices from the column-major ordered index and then simply index into the array with it -
x[np.unravel_index(3, x.shape, order='F')]
Sample run -
In [147]: x
Out[147]:
array([[1, 2, 3],
[4, 5, 6]])
In [148]: idx = np.unravel_index(3, x.shape, order='F')
In [149]: idx
Out[149]: (1, 1) # row, col indices corresponding to the column-major flat index
In [150]: x[idx]
Out[150]: 5
There isn't any copying, flattening or ravel-ing here, just plain indexing, so it should be efficient in terms of both memory and performance.
Not sure if this is the best way, but it seems that simply
array.T.flat
will give the result I'm looking for. Although I wish there were some appropriate method where I could specify order='F', which would be easier to understand at a glance.
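A quick check of the array.T.flat idea on the example array from the question (just a sketch):
import numpy as np

x = np.arange(1, 7).reshape(2, 3)
print(x.T.flat[3])       # 5 - the 4th element in column-major order
print(list(x.T.flat))    # [1, 4, 2, 5, 3, 6]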
I want to apply boolean masking both to rows and columns.
With
X = np.array([[1,2,3],[4,5,6]])
mask1 = np.array([True, True])
mask2 = np.array([True, True, False])
X[mask1, mask2]
I expect the output to be
array([[1,2],[4,5]])
instead of
array([1,5])
It's known that
X[:, mask2]
can be used here but that's not a solution for the general case.
I would like to know how it works under the hood and why in this case the result is array([1,5]).
X[mask1, mask2] is described in Boolean Array Indexing Doc as the equivalent of
In [249]: X[mask1.nonzero()[0], mask2.nonzero()[0]]
Out[249]: array([1, 5])
In [250]: X[[0,1], [0,1]]
Out[250]: array([1, 5])
In effect it is giving you X[0,0] and X[1,1] (pairing the 0s and 1s).
What you want instead is:
In [251]: X[[[0],[1]], [0,1]]
Out[251]:
array([[1, 2],
[4, 5]])
np.ix_ is a handy tool for creating the right mix of dimensions
In [258]: np.ix_([0,1],[0,1])
Out[258]:
(array([[0],
[1]]), array([[0, 1]]))
In [259]: X[np.ix_([0,1],[0,1])]
Out[259]:
array([[1, 2],
[4, 5]])
That's effectively a column vector for the 1st axis and row vector for the second, together defining the desired rectangle of values.
But trying to broadcast boolean arrays like this does not work: X[mask1[:,None], mask2]
But that reference section says:
Combining multiple Boolean indexing arrays or a Boolean with an integer indexing array can best be understood with the obj.nonzero() analogy. The function ix_ also supports boolean arrays and will work without any surprises.
In [260]: X[np.ix_(mask1, mask2)]
Out[260]:
array([[1, 2],
[4, 5]])
In [261]: np.ix_(mask1, mask2)
Out[261]:
(array([[0],
[1]], dtype=int32), array([[0, 1]], dtype=int32))
The boolean section of ix_:
if issubdtype(new.dtype, _nx.bool_):
    new, = new.nonzero()
So it works with a mix like X[np.ix_(mask1, [0,2])]
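For example, mixing the boolean mask1 with an integer column list (a sketch using X and mask1 from the question):
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])
mask1 = np.array([True, True])

print(X[np.ix_(mask1, [0, 2])])
# [[1 3]
#  [4 6]]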
One solution would be to use sequential integer indexing, getting the integers for example from np.where:
>>> X[:, np.where(mask1)[0]][np.where(mask2)[0]]
array([[1, 2],
[4, 5]])
or as #user2357112 pointed out in the comments np.ix_ could be used as well. For example:
>>> X[np.ix_(np.where(mask1)[0], np.where(mask2)[0])]
array([[1, 2],
[4, 5]])
Another idea would be to broadcast your masks and do it in one step, which requires a reshape afterwards:
>>> X[np.where(mask1[:, None] * mask2)]
array([1, 2, 4, 5])
>>> X[np.where(mask1[:, None] * mask2)].reshape(2, 2)
array([[1, 2],
[4, 5]])
In a more general sense, your question is about finding the subpart of an array containing certain rows and columns.
main_array = np.array([[1,2,3],[4,5,6]])
mask_ax_0 = np.array([True, True]) # which rows I want
mask_ax_1 = np.array([True, True, False]) # which columns I want
Answer:
mask_2d = np.logical_and(mask_ax_0.reshape(-1,1), mask_ax_1.reshape(1,-1))
sub_array = main_array[mask_2d].reshape(np.sum(mask_ax_0), np.sum(mask_ax_1))
print(sub_array)
You should be using the numpy.ma module.
In particular, you could use mask_rowcols:
import numpy as np
import numpy.ma as ma

X = np.array([[1, 2, 3], [4, 5, 6]])
linesmask = np.array([True, True])
colsmask = np.array([True, True, False])

X = X.view(ma.MaskedArray)
X.mask = np.zeros(X.shape, dtype=bool)  # start with nothing masked
for i in range(len(linesmask)):
    X.mask[i, 0] = not linesmask[i]
for j in range(len(colsmask)):
    X.mask[0, j] = not colsmask[j]
X = ma.mask_rowcols(X)