I'm learning numpy from a YouTube tutorial. In a video he demonstrated that
wine_data_arr[:, 0].shape
where wine_data_arr is a two-dimensional numpy array imported from sklearn. The result is (178,), and he said "it is a one-dimensional array". But in math, for example, this
[1,2,3]
can represent a 1-by-3 matrix, which has dimension 2. So my question is: why is wine_data_arr[:, 0] a one-dimensional array? I guess this definition must be useful in some situation. So what's that situation?
To be more specific: when writing wine_data_arr[:, 0] I provide two arguments, i.e. : and 0, and the result is one-dimensional. When I write wine_data_arr[:, (0,4)], I still provide two arguments, : and (0,4) (a tuple), and the result is two-dimensional. Why don't both produce a two-dimensional matrix?
Even if they "look" the same, a vector is not the same as a matrix. Consider:
>>> np.array([1,2,3,4])
array([1, 2, 3, 4])
>>> np.matrix([1,2,3,4])
matrix([[1, 2, 3, 4]])
>>> np.matrix([[1,2],[3,4]])
matrix([[1, 2],
        [3, 4]])
When slicing a two-dimensional array like
>>> wine_data_arr = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> wine_data_arr
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
you can request a lower-dimensional component (a single row or column) using an integer index
>>> wine_data_arr[:,0]
array([1, 4, 7])
>>> wine_data_arr[0,:]
array([1, 2, 3])
or a same-dimensional "piece" using a slice index:
>>> wine_data_arr[:, 0:1]
array([[1],
       [4],
       [7]])
If you use two integer indices, you get a single zero-dimensional element of the array:
>>> wine_data_arr[0,0]
1
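To make the pattern concrete, the ndim attribute reports the number of axes left after each kind of indexing; a small check (my own sketch, using the same toy array as above):

```python
import numpy as np

wine_data_arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(wine_data_arr.ndim)          # 2 -> the full array
print(wine_data_arr[:, 0].ndim)    # 1 -> one integer index removes one axis
print(wine_data_arr[:, 0:1].ndim)  # 2 -> a slice preserves the axis
print(wine_data_arr[0, 0].ndim)    # 0 -> two integer indices leave a scalar
```

Each integer index removes exactly one axis; each slice or list index keeps its axis, which is why the result's dimensionality follows directly from how you index.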
In numpy, arrays can have 0, 1, 2 or more dimensions; in contrast to MATLAB, there is no lower bound of 2 dimensions. numpy is also generally consistent with Python lists, in both display and indexing. MATLAB generally follows linear algebra conventions, but there are other mathematical definitions of arrays and vectors; in physics, for example, a vector represents a point in space or a direction, not a 2d 'column vector' or 'row vector'.
A list:
In [159]: alist = [1, 2, 3]
In [160]: len(alist)
Out[160]: 3
An array made from this list:
In [161]: arr = np.array(alist)
In [162]: arr.shape
Out[162]: (3,)
Indexing a list removes a level of nesting. Scalar indexing an array removes a dimension. See
https://numpy.org/doc/stable/user/basics.indexing.html
In [163]: alist[0]
Out[163]: 1
In [164]: arr[0]
Out[164]: 1
A 2d array:
In [166]: marr = np.arange(4).reshape(2, 2)
In [167]: marr
Out[167]:
array([[0, 1],
       [2, 3]])
Again, scalar indexing removes a dimension:
In [169]: marr[0,:]
Out[169]: array([0, 1])
In [170]: marr[:, 0]
Out[170]: array([0, 2])
In [172]: marr[1, 1]
Out[172]: 3
Indexing with a list or slice preserves the dimension:
In [173]: marr[0, [1]]
Out[173]: array([1])
In [174]: marr[0, 0:1]
Out[174]: array([0])
Count the [] to determine the dimensionality.
The short answer: this is a convention.
Before I go into further details, let me clarify that "dimensionality" in NumPy is not the same as in math. In math, [1, 2, 3] is a three-dimensional vector, or a 1-by-3 matrix if you want. Here, however, dimensionality really means the "physical" dimension of the array, i.e., how many axes are present in your array (or matrix, or tensor, etc.).
Now let me get back to your question of why this particular definition of array dimension is helpful. What I'm going to say next is somewhat philosophical and is my own take on it. Essentially, it all boils down to communication between programmers. For example, when you are reading the documentation of some Python code and wondering about the dimensionality of the output array, sure, the documentation can write "N x M x ..." and then carefully define what N, M, etc. are. But in many cases, just the number of axes (the "dimensionality" referred to in NumPy) is sufficient to inform you. In that case, the documentation becomes much cleaner and easier to read while still providing enough information about the expected outcome.
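As a quick illustration of the distinction (my own sketch, not from the original answer): NumPy's ndim counts axes, while the math-style "dimension" of a vector counts components:

```python
import numpy as np

v = np.array([1, 2, 3])
# In math this might be called a "three-dimensional vector",
# but NumPy counts axes, not components:
print(v.ndim)    # 1 -> one axis
print(v.shape)   # (3,) -> that axis has length 3
print(len(v))    # 3

m = np.array([[1, 2, 3]])   # an explicit 1-by-3 "matrix"
print(m.ndim)    # 2
print(m.shape)   # (1, 3)
```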
Related
One would think that indexing with slices and equivalent lists is equivalent in the result, and it mostly is:
>>> b = np.array([[0,1,2],[3,4,5]])
>>> b[0:2,0:2]  # slice & slice
array([[0, 1],
       [3, 4]])
>>> b[0:2,[0,1]]  # slice & list
array([[0, 1],
       [3, 4]])
>>> b[[0,1],0:2]  # list & slice
array([[0, 1],
       [3, 4]])
However:
>>> b[[0,1],[0,1]]  # list & list
array([0, 4])
Okay, there is a work-around:
>>> b[[0,1],:][:,[0,1]]
array([[0, 1],
       [3, 4]])
But why is this necessary?
(Note: I originally referred to slices as "ranges".)
One would think that indexing with ranges and equivalent lists is equivalent in the result
One would be wrong. When you say "ranges" you actually mean "slices". In any case, indexing with a slice is considered "basic indexing", and indexing with a list (or array) is "advanced indexing".
The rules for how these two mix together, and the shape of the resulting array, are pretty arcane. However, your examples are the simplest case: a single advanced index plus a basic index (the slice). From the docs:
In the simplest case, there is only a single advanced index. A single
advanced index can for example replace a slice and the result array
will be the same, however, it is a copy and may have a different
memory layout. A slice is preferable when it is possible.
So, you'll notice, advanced indexing always creates a copy, while with only basic indexing you get a view. This is an important distinction (1).
That explains why the slicing seems equivalent in that specific case. However, in the case of:
>>> b = np.array([[0,1,2],[3,4,5]])
>>> b[[0,1],[0,1]]
array([0, 4])
This is purely integer-array advanced indexing:
When the index consists of as many integer arrays as the array being
indexed has dimensions, the indexing is straight forward, but
different from slicing.
Advanced indexes always are broadcast and iterated as one
result[i_1, ..., i_M] == x[ind_1[i_1, ..., i_M], ind_2[i_1, ..., i_M], ..., ind_N[i_1, ..., i_M]]
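A small sketch of what that broadcast-and-iterate rule means for the example above (my own illustration of the quoted formula):

```python
import numpy as np

b = np.array([[0, 1, 2], [3, 4, 5]])
rows = np.array([0, 1])
cols = np.array([0, 1])

result = b[rows, cols]
# Element by element, this is b[rows[i], cols[i]] for each i,
# exactly as the docs formula says:
manual = np.array([b[rows[i], cols[i]] for i in range(len(rows))])
print(result)   # [0 4]
print(manual)   # [0 4]
```

The two index arrays are paired up elementwise, which is why `b[[0,1],[0,1]]` gives the "diagonal" rather than the block.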
(1) The difference between views and copies, basic vs. advanced indexing
>>> b = np.array([[0,1,2],[3,4,5]])
>>> view = b[0:2,0:2]
>>> copy = b[0:2,[0,1]]
>>> view
array([[0, 1],
       [3, 4]])
>>> copy
array([[0, 1],
       [3, 4]])
>>> view[0,0] = 1337
>>> b
array([[1337,    1,    2],
       [   3,    4,    5]])
>>> view
array([[1337,    1],
       [   3,    4]])
>>> copy
array([[0, 1],
       [3, 4]])
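The same view-vs-copy distinction can also be checked programmatically with np.shares_memory (a small sketch):

```python
import numpy as np

b = np.array([[0, 1, 2], [3, 4, 5]])
view = b[0:2, 0:2]       # basic indexing (slices only) -> a view
copy = b[0:2, [0, 1]]    # advanced indexing (a list)   -> a copy

print(np.shares_memory(b, view))   # True
print(np.shares_memory(b, copy))   # False
```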
As an addendum to Juanpa's answer, https://stackoverflow.com/a/70212077/901925
you can use two arrays to get the block - but they have to broadcast against each other.
In [348]: b = np.array([[0,1,2],[3,4,5]])
In [349]: b[[[0],[1]], [0,1]]
Out[349]:
array([[0, 1],
       [3, 4]])
np.ix_ is a convenience function for creating this kind of indexing set:
In [350]: np.ix_([0,1],[0,1])
Out[350]:
(array([[0],
        [1]]),
 array([[0, 1]]))
In [351]: b[np.ix_([0,1],[0,1])]
Out[351]:
array([[0, 1],
       [3, 4]])
The first index array has (n,1) shape, the second (1,m); together they define an (n,m) space.
You specified an (n,) and (n,) pair, which broadcasts to (n,). Think of this as the 'diagonal' of [351].
In some languages like MATLAB, pairs of indexing arrays like this do specify the block; to get the diagonal you have to use an extra step, something like sub2ind, to convert the pairs to a 1d indexing array. In numpy the block index requires the extra ix_-like step, but logically it's all explainable by broadcasting.
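A compact, runnable recap of that broadcasting logic (my own sketch of the examples above):

```python
import numpy as np

b = np.array([[0, 1, 2], [3, 4, 5]])

# (2,1) and (1,2) index arrays broadcast to a (2,2) index -> the block:
rows = np.array([[0], [1]])   # shape (2, 1)
cols = np.array([[0, 1]])     # shape (1, 2)
block = b[rows, cols]         # [[0, 1], [3, 4]]

# (2,) and (2,) broadcast to (2,) -> the diagonal of that block:
diag = b[np.array([0, 1]), np.array([0, 1])]   # [0, 4]

# np.ix_ builds the (n,1)/(1,m) pair for you:
ix_block = b[np.ix_([0, 1], [0, 1])]
print(block)
print(diag)
```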
In Python, I need to split two rows in half, take the first half from row 1 and the second half from row 2, and concatenate them into an array that is then saved as a row in another 2d array. For example,
values=np.array([[1,2,3,4],[5,6,7,8]])
will become
Y[2,:] = [1,2,7,8]  # 2 is arbitrarily chosen
I tried doing this with concatenate but got an error
only integer scalar arrays can be converted to a scalar index
x=values.shape[1]
pop[y,:]=np.concatenate(values[temp0,0:int((x-1)/2)],values[temp1,int((x-1)/2):x+1])
temp0 and temp1 are integers, and values is a 2d integer array of dimensions (100,x)
np.concatenate takes a list of arrays, plus a scalar axis parameter (optional)
In [411]: values=np.array([[1,2,3,4],[5,6,7,8]])
Nothing wrong with how you split values:
In [412]: x=values.shape[1]
In [413]: x
Out[413]: 4
In [415]: values[0,0:int((x-1)/2)],values[1,int((x-1)/2):x+1]
Out[415]: (array([1]), array([6, 7, 8]))
wrong:
In [416]: np.concatenate(values[0,0:int((x-1)/2)],values[1,int((x-1)/2):x+1])
----
TypeError: only integer scalar arrays can be converted to a scalar index
It's trying to interpret the 2nd argument as an axis parameter, hence the scalar error message.
right:
In [417]: np.concatenate([values[0,0:int((x-1)/2)],values[1,int((x-1)/2):x+1]])
Out[417]: array([1, 6, 7, 8])
There are other concatenate front ends. Here hstack would work the same. np.append takes exactly two arrays, so it would also work - but too often people use it wrongly. np.r_ is another front end with different syntax.
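For the 1d case here, those front ends are interchangeable; a quick sketch comparing them:

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

print(np.concatenate([a, b]))   # [1 2 3 4] -- takes a list of arrays
print(np.hstack([a, b]))        # [1 2 3 4] -- same, list of arrays
print(np.append(a, b))          # [1 2 3 4] -- joins exactly two arrays
print(np.r_[a, b])              # [1 2 3 4] -- its own slice-like syntax
```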
The indexing might be clearer with:
In [423]: idx = (x-1)//2
In [424]: np.concatenate([values[0,:idx],values[1,idx:]])
Out[424]: array([1, 6, 7, 8])
Try numpy.append
numpy.append Documentation
np.append(values[temp0,0:int((x-1)/2)],values[temp1,int((x-1)/2):x+1])
You don't need splitting and/or concatenation. Just use indexing:
In [47]: values=np.array([[1,2,3,4],[5,6,7,8]])
In [48]: values[[[0], [1]],[[0, 1], [-2, -1]]]
Out[48]:
array([[1, 2],
       [7, 8]])
Or ravel to get the flattened version:
In [49]: values[[[0], [1]],[[0, 1], [-2, -1]]].ravel()
Out[49]: array([1, 2, 7, 8])
As a more general approach you can also utilize np.r_ as follows:
In [61]: x, y = values.shape
In [62]: values[np.arange(x)[:,None],[np.r_[0:y//2], np.r_[-y//2:0]]].ravel()
Out[62]: array([1, 2, 7, 8])
Reshape to split the second dimension in two; stack the parts you want.
a = np.array([[1,2,3,4],[5,6,7,8]])
b = a.reshape(a.shape[0], 2, a.shape[1]//2)  # each row becomes two halves
new_row = np.hstack([b[0, 0], b[1, 1]])      # first half of row 0, second half of row 1
I'm a newbie in Python and I don't understand the following code;
I expected test1 and test2 to give me the same result (8, the sum of the second row); instead,
a=np.matrix([[1,2,3],[1,3, 4]])
b=np.matrix([[0,1]])
print(np.where(b==1))
test1=a[np.nonzero(b==1),:]
print(test1.sum())
ind,_=np.nonzero(b==1); # found in code that I'm trying to understand (why the _ ?)
test2=a[ind,:]
print(test2.sum())
gives me
(array([0]), array([1]))
14
6
In the first case I get the sum of the full matrix; in the second case I get the sum of the first row (instead of the 2nd). I don't understand this behavior.
In [869]: a
Out[869]:
matrix([[1, 2, 3],
        [1, 3, 4]])
In [870]: b
Out[870]: matrix([[0, 1]])
In this use where is the same as nonzero:
In [871]: np.where(b==1)
Out[871]: (array([0], dtype=int32), array([1], dtype=int32))
In [872]: np.nonzero(b==1)
Out[872]: (array([0], dtype=int32), array([1], dtype=int32))
It gives a tuple, with one indexing array for each dimension (2 for an np.matrix). The ind,_ = just unpacks those arrays and throws away the 2nd. (_ is also reused for the last result in an interactive session such as the one I'm using.)
In [873]: ind,_ =np.nonzero(b==1)
In [874]: ind
Out[874]: array([0], dtype=int32)
Selecting with where returns the (0,1) value from a. But is that what you want?
In [875]: a[np.where(b==1)]
Out[875]: matrix([[2]])
Adding the : does index the whole array, but with an added dimension; again probably not what we want
In [876]: a[np.where(b==1),:]
Out[876]:
matrix([[[1, 2, 3]],
        [[1, 3, 4]]])
ind is a single indexing array, and so selects row 0 from a.
In [877]: a[ind,:]
Out[877]: matrix([[1, 2, 3]])
In [878]:
But is b==1 supposed to find the 2nd element of b, and then select the 2nd row of a? To do that we have to use the 2nd indexing array from where:
In [878]: a[np.where(b==1)[1],:]
Out[878]: matrix([[1, 3, 4]])
Or the 2nd column from a, corresponding to the 2nd column of b:
In [881]: a[:,np.where(b==1)[1]]
Out[881]:
matrix([[2],
        [3]])
Because a and b are np.matrix, the indexing result is always 2d.
For an array c, where produces a single-element tuple:
In [882]: c=np.array([0,1])
In [883]: np.where(c==1)
Out[883]: (array([1], dtype=int32),)
In [884]: a[_,:] # here _ is the last result, Out[883]
Out[884]: matrix([[1, 3, 4]])
We generally advise using np.array to construct new arrays, even 2d. np.matrix is a convenience for wayward MATLAB users, and often confuses new numpy users.
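To illustrate that advice with this question's data (my own sketch): with plain ndarrays, b is naturally 1d, nonzero returns a one-element tuple, and the unpacking step becomes unnecessary:

```python
import numpy as np

a = np.array([[1, 2, 3], [1, 3, 4]])
b = np.array([0, 1])          # a plain 1d array, not a 1x2 matrix

# nonzero on a 1d array returns a one-element tuple, so the index
# of the matching row is directly available:
idx = np.nonzero(b == 1)[0]   # array([1])
print(a[idx, :])              # [[1 3 4]]
print(a[idx, :].sum())        # 8 -- the sum of the 2nd row, as intended
```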
ndarray objects in numpy have a flat attribute (e.g. array.flat) that allows one to iterate through their elements. For example:
>>> x = np.arange(1, 7).reshape(2, 3)
>>> x
array([[1, 2, 3],
       [4, 5, 6]])
>>> x.flat[3]
4
But how can I return a column-major 1D iterator, so that the example above returns 5 instead of 4?
Approach #1
You can use .ravel('F') to have column major order and then index -
x.ravel('F')[3]
Sample run -
In [100]: x
Out[100]:
array([[1, 2, 3],
       [4, 5, 6]])
In [101]: x.ravel('F')[3]
Out[101]: 5
This will create a copy of the entire array before selecting elements -
In [161]: np.may_share_memory(x, x.ravel())
Out[161]: True
In [162]: np.may_share_memory(x, x.ravel('F'))
Out[162]: False
As such, this may not be the most memory-efficient option. For a better one, let's move on to another approach.
Approach #2
We can get the row and column indices corresponding to the column-major ordered index and then simply index into the array with them -
x[np.unravel_index(3, x.shape, order='F')]
Sample run -
In [147]: x
Out[147]:
array([[1, 2, 3],
       [4, 5, 6]])
In [148]: idx = np.unravel_index(3, x.shape, order='F')
In [149]: idx
Out[149]: (1, 1) # row, col indices in column-major (Fortran) order
In [150]: x[idx]
Out[150]: 5
There isn't any copying or flattening or ravel-ing here; it uses just indexing, and as such should be efficient both in terms of memory and performance.
Not sure if this is the best way, but it seems that simply
array.T.flat
will give the result I'm looking for, although I wish there were some appropriate method that I could specify with order='F', which would be easier to understand at a glance.
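A quick check of that idea (my own sketch): since x.T is a view with the axes swapped, its row-major flat iterator walks the original array in column-major order:

```python
import numpy as np

x = np.arange(1, 7).reshape(2, 3)
# Transposing first makes the row-major flat iterator walk columns:
print(x.T.flat[3])                   # 5
print([int(v) for v in x.T.flat])    # [1, 4, 2, 5, 3, 6]
```

Because x.T is only a view, no copy of the data is made to build the iterator.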
I am attempting to create a simple neural network using Python (I know there are libraries, but I'm building a simple one from scratch to get more familiar with each step taken), and one part of it is to calculate the difference between the true label and the predicted label.
I have the true label with dim <2059 x 1>, and the predicted label also in <2059 x 1>.
Both are in np.array
I would expect a simple
l2_error=tag_train-l2
would do the job. (l2 is the predicted label, tag_train is the true label)
but what I got in return is a <2059x2059> matrix. It seems like this operation is doing a subtraction of every possible combination of elements. Why would this happen? I know I can probably run a for loop to get the job done, but I'm wondering why the program produces this result.
Both dtypes are float64, btw. I don't think it matters, but just in case this info is needed.
As you indicated in the comments, what is happening is that tag_train is a one-dimensional array of length 2059 (shape (2059,)), whereas l2 is a two-dimensional array with 2059 rows and 1 column (shape (2059, 1)).
So when you subtract them, broadcasting produces a two-dimensional array with 2059 rows and 2059 columns.
If you are 100% sure that l2 will only ever have one column, then you can reshape that array to make it one-dimensional before doing the subtraction, like -
l2.reshape((l2.shape[0],))
Example/Demo -
In [1]: import numpy as np
In [2]: l1 = np.array([1,2,3,4])
In [3]: l2 = np.array([[5],[6],[7],[8]])
In [7]: l2.shape
Out[7]: (4, 1)
In [8]: l2-l1
Out[8]:
array([[4, 3, 2, 1],   # Just to show that you get this behaviour
       [5, 4, 3, 2],   # when the arrays have different dimensions.
       [6, 5, 4, 3],
       [7, 6, 5, 4]])
In [25]: l2 = l2.reshape((l2.shape[0],))
In [26]: l2-l1
Out[26]: array([4, 4, 4, 4])
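Besides reshape, a few equivalent idioms drop the trailing length-1 axis (a small sketch using the same arrays):

```python
import numpy as np

l1 = np.array([1, 2, 3, 4])
l2 = np.array([[5], [6], [7], [8]])   # shape (4, 1)

# Each of these flattens l2 to shape (4,) before subtracting:
print(l2.ravel() - l1)     # [4 4 4 4]
print(l2.squeeze() - l1)   # [4 4 4 4]
print(l2[:, 0] - l1)       # [4 4 4 4]
```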