I'm new to Python and I don't understand the following code.
I expected test1 and test2 to give the same result (8, the sum of the second row). Instead,
a=np.matrix([[1,2,3],[1,3, 4]])
b=np.matrix([[0,1]])
print(np.where(b==1))
test1=a[np.nonzero(b==1),:]
print(test1.sum())
ind,_=np.nonzero(b==1)  # found in code that I'm trying to understand (why the _ ?)
test2=a[ind,:]
print(test2.sum())
gives me
(array([0]), array([1]))
14
6
In the first case I get the sum of the full matrix; in the second case I get the sum of the first row (instead of the second).
I don't understand why it behaves this way.
In [869]: a
Out[869]:
matrix([[1, 2, 3],
[1, 3, 4]])
In [870]: b
Out[870]: matrix([[0, 1]])
Used this way, where is the same as nonzero:
In [871]: np.where(b==1)
Out[871]: (array([0], dtype=int32), array([1], dtype=int32))
In [872]: np.nonzero(b==1)
Out[872]: (array([0], dtype=int32), array([1], dtype=int32))
It gives a tuple, one indexing array for each dimension (2 for an np.matrix). The ind,_= just unpacks those arrays and throws away the 2nd; _ is a conventional name for a value you don't need. In an interactive session such as the one I'm using, _ is also used for the last output.
In [873]: ind,_ =np.nonzero(b==1)
In [874]: ind
Out[874]: array([0], dtype=int32)
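To make the unpacking concrete, here is a minimal sketch. It assumes a plain 2d np.array standing in for the np.matrix b from the question, and the names result, rows and cols are only illustrative.

```python
import numpy as np

# A 2d array standing in for the np.matrix b from the question.
b = np.array([[0, 1]])

# nonzero returns one index array per dimension: (row indices, column indices).
result = np.nonzero(b == 1)
rows, cols = result   # same unpacking as `ind, _ = ...`, but keeping both parts

print(len(result))    # number of index arrays, equal to b.ndim
print(rows, cols)     # the row index and the column index of the match
```

So ind in the question's code is the row index of the match (0), not its column index (1).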
Indexing with the where tuple returns the element at (0,1) of a. But is that what you want?
In [875]: a[np.where(b==1)]
Out[875]: matrix([[2]])
Adding the : does index the whole array, but with an added dimension; again probably not what we want
In [876]: a[np.where(b==1),:]
Out[876]:
matrix([[[1, 2, 3]],
[[1, 3, 4]]])
ind is a single indexing array, and so selects row 0 of a.
In [877]: a[ind,:]
Out[877]: matrix([[1, 2, 3]])
But is the b==1 supposed to find the 2nd element of b, and then select the 2nd row of a? To do that we have to use the 2nd indexing array from where:
In [878]: a[np.where(b==1)[1],:]
Out[878]: matrix([[1, 3, 4]])
Or the 2nd column from a corresponding to the 2nd column of b
In [881]: a[:,np.where(b==1)[1]]
Out[881]:
matrix([[2],
[3]])
Because a and b are np.matrix, the indexing result is always 2d.
For a 1d array c, where produces a single-element tuple:
In [882]: c=np.array([0,1])
In [883]: np.where(c==1)
Out[883]: (array([1], dtype=int32),)
In [884]: a[_,:] # here _ is the last result, Out[883]
Out[884]: matrix([[1, 3, 4]])
We generally advise using np.array to construct new arrays, even 2d. np.matrix is a convenience for wayward MATLAB users, and often confuses new numpy users.
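As a rough sketch of what the question's code might look like with np.array, assuming b is meant to flag which rows of a to keep (the 1d shape of b is my assumption, not something stated in the question):

```python
import numpy as np

a = np.array([[1, 2, 3], [1, 3, 4]])
b = np.array([0, 1])            # 1d selector: one entry per row of a

# Option 1: use the boolean mask directly on the rows of a.
test1 = a[b == 1, :]

# Option 2: integer row indices; a 1d input gives a one-element tuple.
(ind,) = np.nonzero(b == 1)
test2 = a[ind, :]

print(test1.sum(), test2.sum())  # both sum the second row: 1 + 3 + 4
```

With a 1d b there is only one index array to unpack, so there is no second value to throw away.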
Say I have an array like this:
x = [1, 2, 3]
[4, 5, 6]
[7, 8, 9]
And I want to delete both the ith row and column. So if i=1, I'd create (with 0-indexing):
[1, 3]
[7, 9]
Is there an easy way of doing this with a one-liner? I know I can call np.delete() twice, but that seems a little unclean.
It'd be exactly equivalent to np.delete(np.delete(x, idx, 0), idx, 1), where idx is the index of the row/column pair to delete - it'd just look cleaner.
In [196]: x = np.arange(1,10).reshape(3,3)
If you look at np.delete code, you'll see that it's python (not compiled) and takes different approaches depending on how the delete values are specified. One is to make a res array of right size, and copy two slices to it.
Another is to make a boolean mask. For example:
In [197]: mask = np.ones(x.shape[0], bool)
In [198]: mask[1] = 0
In [199]: mask
Out[199]: array([ True, False, True])
Since you are deleting the same row and column, use this indexing:
In [200]: x[mask,:][:,mask]
Out[200]:
array([[1, 3],
[7, 9]])
A 1d boolean mask like this can't be 'broadcast' in the same way an integer array can.
We can do the 2d advanced indexing with:
In [201]: idx = np.nonzero(mask)[0]
In [202]: idx
Out[202]: array([0, 2])
In [203]: np.ix_(idx,idx)
Out[203]:
(array([[0],
[2]]),
array([[0, 2]]))
In [204]: x[np.ix_(idx,idx)]
Out[204]:
array([[1, 3],
[7, 9]])
Actually ix_ can work directly from the boolean array(s):
In [207]: np.ix_(mask,mask)
Out[207]:
(array([[0],
[2]]),
array([[0, 2]]))
This isn't a one-liner, but it probably is faster than the double delete, since it strips off all the extra baggage that the more general function requires.
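The mask approach can be wrapped up as a small function. This is only a sketch: delete_row_col is a hypothetical helper name, and it assumes x is a square 2d array so that one mask fits both axes.

```python
import numpy as np

def delete_row_col(x, i):
    """Hypothetical helper: copy of square 2d array x without row i and column i."""
    mask = np.ones(x.shape[0], dtype=bool)
    mask[i] = False
    # np.ix_ builds an open mesh from the two masks, so rows and columns
    # are selected independently rather than paired element by element.
    return x[np.ix_(mask, mask)]

x = np.arange(1, 10).reshape(3, 3)
result = delete_row_col(x, 1)
print(result)

# Passing the same mask twice in a single index pairs the entries up instead,
# which picks only x[0, 0] and x[2, 2].
mask = np.array([True, False, True])
paired = x[mask, mask]
print(paired)
```

The last two lines show why the mask is not simply used twice in one index: the result is the two paired elements, not the 2x2 block.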
This can be easily achieved by numpy's delete function. It would be:
arr = np.delete(arr, index, 0) # deletes the desired row
arr = np.delete(arr, index, 1) # deletes the desired column at index
The third argument is the axis.
I'm learning numpy from a YouTube tutorial. In a video he demonstrated that
wine_data_arr[:, 0].shape
where wine_data_arr is a two-dimensional numpy array imported from sklearn. The result is (178,), and he said "it is a one-dimensional array". But in math, for example, this
[1,2,3]
can represent a 1 by 3 matrix, which has dimension 2. So my question is: why is wine_data_arr[:, 0] a one-dimensional array? I guess this definition must be useful in some situation. So what's that situation?
To be more specific: when I write wine_data_arr[:, 0] I provide two arguments, : and 0, and the result is one-dimensional. When I write wine_data_arr[:, (0,4)], I still provide two arguments, : and the tuple (0,4), and the result is two-dimensional. Why don't both produce a two-dimensional matrix?
Even if they "look" the same, a vector is not the same as a matrix. Consider:
>>> np.array([1,2,3,4])
array([1, 2, 3, 4])
>>> np.matrix([1,2,3,4])
matrix([[1, 2, 3, 4]])
>>> np.matrix([[1,2],[3,4]])
matrix([[1, 2],
[3, 4]])
When slicing a two-dimensional array like
>>> wine_data_arr = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> wine_data_arr
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
you can request a lower-dimensional component (a single row or column) using an integer index
>>> wine_data_arr[:,0]
array([1, 4, 7])
>>> wine_data_arr[0,:]
array([1, 2, 3])
or a same-dimensional "piece" using a slice index:
>>> wine_data_arr[:, 0:1]
array([[1],
[4],
[7]])
If you use two integer indices, you get a single zero-dimensional element of the array:
>>> wine_data_arr[0,0]
1
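Checking .shape on each kind of index makes the rule easier to see. The sketch below reuses the small 3x3 stand-in array from above (not the real wine data), and the variable names are only illustrative; the list index plays the same role as the tuple (0,4) in the question.

```python
import numpy as np

wine_data_arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

col_int = wine_data_arr[:, 0]        # integer index: that axis is removed
col_slice = wine_data_arr[:, 0:1]    # slice: axis kept, with length 1
col_list = wine_data_arr[:, [0]]     # list index: axis kept, with length 1
two_cols = wine_data_arr[:, [0, 2]]  # list of two columns: axis kept, length 2
element = wine_data_arr[0, 0]        # two integers: both axes removed

for item in (col_int, col_slice, col_list, two_cols, element):
    print(item.shape)
```

Only an integer index drops an axis, which is why [:, 0] is 1d while an index that lists several columns stays 2d.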
In numpy, arrays can have 0, 1, 2 or more dimensions. In contrast to MATLAB, there isn't a 2d lower bound. Also, numpy is generally consistent with Python lists in display and indexing. MATLAB generally follows linear algebra conventions, but I'm sure there are other math definitions for arrays and vectors. In physics a vector represents a point in space, or a direction, not a 2d 'column vector' or 'row vector'.
A list:
In [159]: alist = [1, 2, 3]
In [160]: len(alist)
Out[160]: 3
An array made from this list:
In [161]: arr = np.array(alist)
In [162]: arr.shape
Out[162]: (3,)
Indexing a list removes a level of nesting. Scalar indexing an array removes a dimension. See
https://numpy.org/doc/stable/user/basics.indexing.html
In [163]: alist[0]
Out[163]: 1
In [164]: arr[0]
Out[164]: 1
A 2d array:
In [166]: marr = np.arange(4).reshape(2, 2)
In [167]: marr
Out[167]:
array([[0, 1],
[2, 3]])
Again, scalar indexing removes a dimension:
In [169]: marr[0,:]
Out[169]: array([0, 1])
In [170]: marr[:, 0]
Out[170]: array([0, 2])
In [172]: marr[1, 1]
Out[172]: 3
Indexing with a list or slice preserves the dimension it is applied to (here the scalar 0 still removes the first one):
In [173]: marr[0, [1]]
Out[173]: array([1])
In [174]: marr[0, 0:1]
Out[174]: array([0])
Count the [] to determine the dimensionality.
The short answer: this is a convention.
Before I go into further details, let me clarify that the "dimensionality" in NumPy, for example, is not the same as that in math. In math, [1, 2, 3] is a three-dimensional vector, or a one by three dimensional matrix if you want. However, here, the dimensionality really means the "physical" dimension of the array, i.e., how many axes are present in your array (or matrix, or tensor, etc.).
Now let me get back to your question of why "this" particular definition of array dimension is helpful. What I'm going to say next is somewhat philosophical and is my own take on it. Essentially, it all boils down to communication between programmers. For example, when you are reading the documentation of some Python code and wondering about the dimensionality of the output array, the documentation could of course write "N x M x ..." and then carefully define what N, M, etc. are. But in many cases, just the number of axes (the "dimensionality" in the NumPy sense) is enough to inform you. In that case, the documentation becomes much cleaner and easier to read while still providing enough information about the expected outcome.
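A quick way to see the "number of axes" meaning is to compare ndim and shape. This is just an illustrative sketch with made-up arrays:

```python
import numpy as np

vector = np.array([1, 2, 3])     # three components, but a single axis
row = np.array([[1, 2, 3]])      # a 1 x 3 array: two axes
stack = np.zeros((2, 3, 4))      # three axes

for arr in (vector, row, stack):
    # ndim is the number of axes; shape gives the length along each axis
    print(arr.ndim, arr.shape)
```

So [1, 2, 3] as a NumPy array has ndim 1 even though it could describe a point in three-dimensional space.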
I understand the output of np.where() when the input is a one-row array. However, when a two-row array is used as the input, I don't understand why the output b consists of two arrays.
The output for a[b] makes sense.
a = np.array([[1, 2, 3],[4,5,6]])
print(a)
print ('Indices of elements <4')
b = np.where(a<4)
print(b)
print(a[b])
output for b:
(array([0, 0, 0], dtype=int64), array([0, 1, 2], dtype=int64))
output for a[b]:
[1 2 3]
We need two indices to access each element of a 2D array, e.g. i and j.
Hence, if the elements satisfying the condition a<4 are at (i1,j1), (i2,j2) and (i3,j3), then np.where() returns a tuple of arrays of the form (array([i1,i2,i3]), array([j1,j2,j3])).
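If (i, j) pairs are easier to read, the two arrays can be zipped together, or np.argwhere can be used to get the pairs directly. A small sketch using the array from the question (the names rows, cols, pairs and same_pairs are only illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

rows, cols = np.where(a < 4)          # one index array per axis
pairs = list(zip(rows.tolist(), cols.tolist()))   # one (i, j) per match
same_pairs = np.argwhere(a < 4)       # the same pairs as an (n, 2) array

print(pairs)
print(a[rows, cols])                  # indexing with both arrays gives the values
```

The tuple-of-arrays form is what makes a[b] work, since it supplies one index array for each axis.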
If I have a and b:
a=[[1,2,3],
[4,5,6],
[7,8,9]]
b=8.1
and I want to find the index of the value b in a, I can do:
nonzero(abs(a-b)<0.5)
to get (2,1) as the index, but what do I do if b is a 1d or 2d array? Say,
b=[8.1,3.1,9.1]
and I want to get (2,1),(0,2),(2,2)
In general I expect only one match in a for every value of b. Can I avoid a for loop?
Use a list comprehension:
[nonzero(abs(x-a)<0.5) for x in b]
Vectorized approach with NumPy's broadcasting -
np.argwhere((np.abs(a - b[:,None,None])<0.5))[:,1:]
Explanation -
Extend b from a 1D to a 3D case with None/np.newaxis, keeping the elements along the first axis.
Perform absolute subtractions with the 2D array a, thus bringing in broadcasting and leading to a 3D array of elementwise subtractions.
Compare against the threshold of 0.5 and get the indices corresponding to matches along the last two axes and sorted by the first axis with np.argwhere(...)[:,1:].
Sample run -
In [71]: a
Out[71]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [72]: b
Out[72]: array([ 8.1, 3.1, 9.1, 0.7])
In [73]: np.argwhere((np.abs(a - b[:,None,None])<0.5))[:,1:]
Out[73]:
array([[2, 1],
[0, 2],
[2, 2],
[0, 0]])
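The three steps can also be written out one at a time to see the intermediate shapes. This is a sketch that assumes, as the question does, exactly one match in a for every value of b; diff, hits and indices are just illustrative names.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([8.1, 3.1, 9.1])

# Step 1 and 2: b becomes shape (3, 1, 1), so the subtraction broadcasts
# to shape (3, 3, 3), one 2d slab of differences per value of b.
diff = np.abs(a - b[:, None, None])

# Step 3: each row of hits is (index into b, row in a, column in a).
hits = np.argwhere(diff < 0.5)
indices = hits[:, 1:]                 # drop the b index, keep (row, column)

print(diff.shape)
print(indices)
```

If some value of b had no match or several matches, the rows of indices would no longer line up one to one with b, so the first column of hits would be needed to tell them apart.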