What Is the Logic Behind Advanced Indexing in Numpy? - python

When the following lines of code are run, the same results are expected. Is the logic behind advanced indexing in NumPy literally zipping different iterables together? If so, I am also curious about what data structure the zipped result is converted into. I am using a tuple in my example, but it seems like there are other possibilities. Thanks in advance for the help!
import numpy as np
a = np.array([[1,2],[3,4],[5,6]])
print(a[[0,1],[1,1]])
>>> [2 4]
result = zip([0,1],[1,1])
print(a[tuple(result)])
>>> [2 4]

Lists and tuples are basically the same: both hold items, but a list is mutable (i.e., you can change its elements) while a tuple is immutable.
As far as NumPy indexing is concerned, you can use either to hold the indices for an axis, as long as they hold integer values.
The only advantage of using a tuple for indexing is that it cannot be changed midway and mess up the data extraction (as shown in Example 1), if that is one of your requirements.
Example 1 (immutable):
import numpy as np
arr = np.random.randint(0, 10, 6).reshape((2, 3))
idx = tuple(np.random.randint(0, 2, 10).reshape((5, 2)))
for i in range(3):
    np.random.shuffle(idx)  # fails: shuffling needs item assignment, which a tuple lacks
    print(arr[idx])
Output of Example 1:
TypeError: 'tuple' object does not support item assignment
On the other hand, if you want more flexible indexing (as in Example 2), i.e., changing the indices during the run, tuples won't work for you.
Example 2 (Mutable):
import numpy as np
arr = np.random.randint(0, 10, 6).reshape((2, 3))
idx = np.random.randint(0, 2, 10).reshape((5, 2))
for i in range(3):
    np.random.shuffle(idx)
    print(arr[idx])
Output of Example 2:
[[[0 6 8]
[0 5 5]]
[[0 5 5]
[0 5 5]]
[[0 6 8]
[0 5 5]]...
So, whether to use one or another depends on the outcome you desire.
Cheers.
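To return to the original question, here is a minimal sketch of how integer index lists are combined. The names rows, cols, picked, paired and as_tuple are made up for illustration. The index lists are paired element by element, so the result matches a zip over the two lists, and passing the lists inside a tuple is the same as separating them with a comma.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
rows = [0, 1]  # index along axis 0 for each selected element
cols = [1, 1]  # index along axis 1 for each selected element

# Advanced indexing: element k of the result is a[rows[k], cols[k]].
picked = a[rows, cols]

# The same selection written out with an explicit zip.
paired = np.array([a[r, c] for r, c in zip(rows, cols)])

# A tuple of index lists is equivalent to the comma-separated form.
as_tuple = a[(rows, cols)]

print(picked, paired, as_tuple)
```

In the question, zip([0,1],[1,1]) happens to produce the pairs (0, 1) and (1, 1), which are the same values as the original two lists, so both expressions select the same elements.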

Related

What is x[ : , 0]?

I don't understand what it does, and only saw it used by a sklearn-object.
On trying some testing like this
x = [(1, 2), (2, 3), (3, 4)]
y = [[1, 2], [2, 3], [3, 4]]
print(y[:, 0])
I got this error (with both x and y):
TypeError: list indices must be integers or slices, not tuple
My assumption was that the : before the comma tells Python to take all entries, while the 0 specifies which 'subset'.
How does one use this expression properly and what does it do?
As explained here this is a numpy-specific notation to index arrays, not plain python. This is why it does not work on your code. In your (initial) case, the sklearn object probably wraps a numpy array that supports the numpy slicing syntax.
In your specific case, it would work as follows:
import numpy as np
y = np.array([[1, 2], [2, 3], [3, 4]])
print(y[:, 0])
# prints: [1 2 3]
This takes every index along the first axis (all rows) but only index 0 along the second axis, i.e. it returns the first column.
The expression you wrote works with richer objects such as NumPy arrays (and their well-known slicing). Plain Python lists do not support it. To access a specific number (in your case) you would have to do x[2][1] (yielding 4 in your example).
To achieve what you want (take the first item of each tuple), you would do
[item[0] for item in y]. This is list comprehension: iterate through y, take the object at the first index of each item, and make a list out of that.
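As a small side-by-side sketch of the two approaches described above (the variable names first_items, y_arr, first_column and second_row are chosen here for illustration):

```python
import numpy as np

y = [[1, 2], [2, 3], [3, 4]]

# Plain Python: a list comprehension picks the first item of each inner list.
first_items = [item[0] for item in y]

# NumPy: convert once, then index with [rows, columns].
y_arr = np.array(y)
first_column = y_arr[:, 0]  # all rows, column 0
second_row = y_arr[1, :]    # row 1, all columns

print(first_items)
print(first_column)
print(second_row)
```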

Reverse sort a column-based numpy array

I want to sort (descending) a numpy array that has been reshaped into a single column. However, the following code does not seem to work.
a = array([5,1,2,4,9,2]).reshape(-1, 1)
a_sorted = np.sort(a)[::-1]
print("a=",a)
print("a_sorted=",a_sorted)
Output is
a= [[5]
[1]
[2]
[4]
[9]
[2]]
a_sorted= [[2]
[9]
[4]
[2]
[1]
[5]]
That is due to the reshape function. If I remove that, the sort works fine. How can I fix that?
Here you need axis=0 (column-wise sorting):
np.sort(a,axis=0)[::-1]
Discussion:
a = np.array([[4,1],[23,2]])
print(a)
Output:
[[ 4 1]
[23 2]]
# axis=None (sort as a flattened array)
print(np.sort(a,axis=None))
Output:
[ 1 2 4 23]
# axis=1 (row-wise sort; for a 2D array this is the same as the default, axis=-1)
print(np.sort(a,axis=1))
Output:
[[ 1 4]
[ 2 23]]
# axis=0 (column-wise sort)
print(np.sort(a,axis=0))
Output:
[[ 4 1]
[23 2]]
For more details have a look in:
https://numpy.org/doc/stable/reference/generated/numpy.sort.html
As @tmdavison pointed out in the comments, you forgot to use the axis option: by default, np.sort sorts along the last axis, i.e. within each row of a matrix. By calling reshape you turn the array into a one-column matrix, and sorting each single-element row trivially leaves the matrix unchanged.
This would do the job
import numpy as np
a = np.array([5,1,2,4,9,2]).reshape(-1, 1)
a_sorted = np.sort(a, axis = 0)[::-1]
print("a=",a)
print("a_sorted=",a_sorted)
Extra points:
reference to the doc of sort
Next time, remember to make the code reproducible (your example has no np before array and no imports). This was an easy case, but that is not always so.
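The effect of the axis argument on a one-column array can be checked with a short sketch like the following (row_sorted and col_desc are illustrative names, not from the original posts):

```python
import numpy as np

a = np.array([5, 1, 2, 4, 9, 2]).reshape(-1, 1)

# Default axis=-1: each row holds a single element, so nothing moves.
row_sorted = np.sort(a)

# axis=0 sorts down the column; [::-1] then reverses the row order.
col_desc = np.sort(a, axis=0)[::-1]

print(np.array_equal(row_sorted, a))
print(col_desc.ravel())
```

This also explains the output in the question: np.sort(a) left the column untouched, and [::-1] simply reversed the original order.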

Stable conversion of a multi-column (2D) numpy array to an indicator vector

I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order preserved) manner.
For example, I have the following numpy array:
import numpy as np
arr = np.array([
[2, 20, 1],
[1, 10, 3],
[2, 20, 2],
[2, 20, 1],
[1, 20, 3],
[2, 20, 2],
])
The output I would like to have is:
indicator = [0, 1, 2, 0, 3, 2]
How can I do this (preferably using numpy only)?
Notes:
I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none is ideal. It would be nice to hear an expert's opinion.
My thoughts so far:
1. Numpy's unique: This would not work, as it is not stable:
arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
print(arr_unq)
# output 1:
# [[ 1 10 3]
# [ 1 20 3]
# [ 2 20 1]
# [ 2 20 2]]
print(indicator)
# output 2:
# [2 0 3 2 1 3]
Notice how the indicator starts from 2. This is because unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.
Of course I can use LabelEncoder from sklearn to convert the items in a manner that they start from 0 but I feel that there is a simple numpy trick that I can use and therefore avoid adding sklearn dependency to my program.
Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:
dct = {}
for idx, item in enumerate(indicator):
    if item not in dct:
        dct[item] = len(dct)
    indicator[idx] = dct[item]
print(indicator)
# outputs:
# [0 1 2 0 3 2]
2. Stabilizing numpy's unique output: This solution is already posted on Stack Overflow and correctly returns a stable unique array. But I do not know how to convert the returned indicator vector (returned when return_inverse=True) to represent the values in a stable order starting from 0.
3. Pandas's get_dummies function: It returns a "hot encoding" (a matrix of indicator values), whereas I would like an indicator vector. It is possible to convert the "hot encoding" to an indicator vector with a few lines of code and data manipulation, but that approach is not going to be highly efficient.
In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
Now you can use the fact that np.argsort is effectively its own inverse to fix the order. Note that idx.argsort() places unq into order of first occurrence. The corrected result is therefore
indicator = idx.argsort().argsort()[inv]
And of course the byproduct
unq = unq[idx.argsort()]
Of course, nothing about these operations is specific to 2D.
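Put together on the array from the question, the steps above can be sketched as follows. The names order, rank and unq_stable are introduced here for readability, and the ravel() call is a defensive assumption in case a NumPy version returns the inverse with an extra dimension when axis is given.

```python
import numpy as np

arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])

unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
inv = inv.ravel()       # keep the inverse 1-D regardless of NumPy version

order = idx.argsort()   # sorted-unique positions, ordered by first occurrence
rank = order.argsort()  # stable label for each sorted-unique row

indicator = rank[inv]
unq_stable = unq[order]

print(indicator)
print(unq_stable)
```

Indexing unq_stable with indicator reproduces arr, which is a convenient sanity check.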
A Note on the Intuition
Let's say you have an array x:
x = np.array([7, 3, 0, 1, 4])
x.argsort() is the index that tells you what elements of x are placed at each of the locations in the sorted array. So
i = x.argsort() # 2, 3, 1, 4, 0
But how would you get from np.sort(x) back to x (which is the problem you express in #2)?
Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at 3, ..., the last (largest) element was at index 0. This means that to place np.sort(x) back into its original order, you need the index that puts i into sorted order. That means that you can write x as
np.sort(x)[i.argsort()]
Which is equivalent to
x[i][i.argsort()]
OR
x[x.argsort()][x.argsort().argsort()]
So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.
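The intuition can be verified on the small array x from above. The second half of this sketch shows an alternative way to build the same inverse permutation by scattering positions instead of sorting a second time; that alternative is an addition for illustration and not part of the original answer.

```python
import numpy as np

x = np.array([7, 3, 0, 1, 4])

i = x.argsort()        # positions in x of the sorted elements
inverse = i.argsort()  # where each original element lands in the sorted array

restored = np.sort(x)[inverse]

# Alternative: write position k of the sorted array into slot i[k].
inverse_scatter = np.empty_like(i)
inverse_scatter[i] = np.arange(i.size)

print(i, inverse, restored)
```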

Is there any way of getting multiple ranges of values in numpy array at once?

Let's say we have a simple 1D ndarray. That is:
import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9,10])
I want to get the first 3 and the last 2 values, so that the output would be [ 1 2 3 9 10].
I have already solved this by slicing twice and concatenating the slices as follows:
b= a[:2]
c= a[-2:]
a=np.concatenate([b,c])
However I would like to know if there is a more direct way to achieve this using slices, such as a[:2 and -2:] for instance. As an alternative I already tried this :
a = a[np.r_[:2, -2:]]
but it does not seem to work. It returns only the first 2 values, that is [1 2].
Thanks in advance!
Slices of a numpy array need to be contiguous, AFAIK. The np.r_[-2:] part does not work because np.r_ does not know how big the array a is. You could do np.r_[:2, len(a)-2:len(a)], but this will still copy the data since you are indexing with another array.
If you want to avoid copying data or doing any concatenation operation you could use np.lib.stride_tricks.as_strided:
ds = a.dtype.itemsize
np.lib.stride_tricks.as_strided(a, shape=(2,2), strides=(ds * 8, ds)).ravel()
Output:
array([ 1, 2, 9, 10])
But since you want the first 3 and last 2 values the stride for accessing the elements will not be equal. This is a bit trickier, but I suppose you could do:
np.lib.stride_tricks.as_strided(a, shape=(2,3), strides=(ds * 8, ds)).ravel()[:-1]
Output:
array([ 1, 2, 3, 9, 10])
However, this is a potentially dangerous operation because the last element is read from outside the allocated memory.
On reflection, I cannot find a way to do this operation without copying the data somehow. The ravel call in the code snippets above is forced to make a copy of the data. If you can live with the shapes (2,2) or (2,3) it might work in some cases, but you should only read from a strided view, and this should be enforced by setting the keyword writeable=False.
You could try to access the elements with a list of indices.
import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9,10])
b = a[[0,1,2,8,9]] # b should now be array([ 1, 2, 3, 9, 10])
Obviously, if your array is long, you would not want to type out all the indices.
Instead, you could build the index list with list comprehensions.
Something like this:
index_list = [i for i in range(3)] + [i for i in range(8, 10)]
b = a[index_list] # b should now be array([ 1, 2, 3, 9, 10])
Therefore, as long as you know where your desired elements are, you can access them individually.
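As a further sketch along the same lines, np.r_ can build the index array directly once both ends of each range are given explicitly. The names idx_neg and idx_len are illustrative. A range such as -2:0 expands to the negative indices -2 and -1, which NumPy counts from the end of the array.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Explicit stop values let np.r_ expand both ranges into concrete indices.
idx_neg = np.r_[:3, -2:0]        # 0, 1, 2, -2, -1
n = len(a)
idx_len = np.r_[:3, n - 2:n]     # 0, 1, 2, 8, 9

b = a[idx_neg]
c = a[idx_len]

print(b)
print(c)
```

Both forms index with an integer array, so the result is a copy rather than a view.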

Slice an array to exclude a single element

I would like to slice a numpy array so that I can exclude a single element from it.
For example, like this:
a = numpy.array([1,2,3,4,5])
b = a[0:1::3:4]
b = [1 2 4 5]
Only that this does not work, so either I am doing something wrong, or it isn't possible.
If you are going to repeatedly 'delete' one item at a time, I'd suggest using a boolean mask:
In [493]: a = np.arange(100)
In [494]: mask = np.ones(a.shape, dtype=bool)
In [495]: for i in [2,5,9,20,3,26,40,60]:
     ...:     mask[i]=0
     ...: a1 = a[mask]
In [496]: a1.shape
Out[496]: (92,)
That's effectively what np.delete does when given a list or array of deletes
In [497]: a2 = np.delete(a, [2,5,9,20,3,26,40,60])
In [498]: np.allclose(a1,a2)
Out[498]: True
For a single element it joins two slices, either by concatenating or by copying into a result array of the right size. In all cases we have to make a new array.
Whether you exclude one element or many, you are seeking a discontiguous selection of the elements of the original. That can't be produced with a view, which uses shape and strides to select a regular subset of the original.
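Applied to the five-element array from the question, the mask approach might look like the sketch below (the names via_mask, via_compare and via_delete are made up for illustration). All three variants return a new array and leave a unchanged.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Boolean mask: True for every element to keep.
mask = np.ones(a.shape, dtype=bool)
mask[2] = False
via_mask = a[mask]

# The same mask built in one expression from the positions.
via_compare = a[np.arange(a.size) != 2]

# np.delete with a single index.
via_delete = np.delete(a, 2)

print(via_mask, via_compare, via_delete)
```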
You need to do something like the following:
a = np.array([1,2,3,4,5])
b = a[:2]
c = a[3:]
print ( b )
print ( c )
z= np.concatenate((b,c),axis=None)
print ( z )
Output:
[1 2]
[4 5]
[1 2 4 5]
Hence everything other than 3 ends up in the new numpy.ndarray z.
Another way is to use the np.delete function, as shown in one of the answers, passing a list of comma-separated indexes to delete:
a = np.array([15,14,13,12,11])
a4=np.delete(a,[1,4])
print(a4)
output is :
[15 13 12]
import numpy as np
a = np.array([1,2,3,4,5])
result = np.delete(a,2)
# result is now array([1, 2, 4, 5])
If a is a plain Python list, you can always join two slices with +:
b = a[:2]+a[3:]
This returns [1, 2, 4, 5]. (On NumPy arrays, + adds elementwise instead of joining.)
For a numpy return value, you could take two slices and concatenate the results.
b = a[3:]
c = a[:2]
numpy.concatenate([c,b])
Will return
array([1, 2, 4, 5])
