Reverse sort a column-based numpy array - python

I want to sort (descending) a numpy array that has been reshaped into a single-column structure. However, the following code doesn't seem to work.
a = array([5,1,2,4,9,2]).reshape(-1, 1)
a_sorted = np.sort(a)[::-1]
print("a=",a)
print("a_sorted=",a_sorted)
Output is
a= [[5]
[1]
[2]
[4]
[9]
[2]]
a_sorted= [[2]
[9]
[4]
[2]
[1]
[5]]
That is due to the reshape function. If I remove that, the sort works fine. How can I fix that?

Here the axis should be 0 (column-wise sorting):
np.sort(a,axis=0)[::-1]
Discussion:
a = np.array([[4,1],[23,2]])
print(a)
Output:
[[ 4 1]
[23 2]]
# axis=None (sort as a flattened array)
print(np.sort(a,axis=None))
Output:
[ 1 2 4 23]
# axis=1 (row-wise sort; this is the default, since the default axis is -1, the last axis)
print(np.sort(a,axis=1))
[[ 1 4]
[ 2 23]]
# axis=0 (column-wise sort)
print(np.sort(a,axis=0))
[[ 4 1]
[23 2]]
For more details, have a look at:
https://numpy.org/doc/stable/reference/generated/numpy.sort.html
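Putting it together for the original single-column array, here is a quick sketch of two equivalent ways to sort descending (the negate-sort-negate variant is a common idiom, but note it applies to numeric data only):
import numpy as np

a = np.array([5, 1, 2, 4, 9, 2]).reshape(-1, 1)

# Option 1: sort ascending along the column axis, then reverse the rows.
desc1 = np.sort(a, axis=0)[::-1]

# Option 2: negate, sort, negate back.
desc2 = -np.sort(-a, axis=0)

assert np.array_equal(desc1, desc2)
print(desc1.ravel())  # [9 5 4 2 2 1]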

As @tmdavison pointed out in the comments, you forgot to use the axis option, since by default numpy sorts matrices along the last axis (row-wise). By calling the reshape function you are in fact transforming the array into a one-column matrix, for which row-wise sorting trivially returns the matrix itself.
This would do the job
import numpy as np
a = np.array([5,1,2,4,9,2]).reshape(-1, 1)
a_sorted = np.sort(a, axis=0)[::-1]
print("a=",a)
print("a_sorted=",a_sorted)
Extra points:
A reference to the documentation of numpy.sort.
Next time, remember to make your code reproducible (there was no np before array and no imports in your example). This was an easy case, but it's not always like this.

Related

Stable conversion of a multi-column (2D) numpy array to an indicator vector

I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order preserved) manner.
For example, I have the following numpy array:
import numpy as np
arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])
The output I'd like to have is:
indicator = [0, 1, 2, 0, 3, 2]
How can I do this (preferably using numpy only)?
Notes:
I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none is ideal. It would be nice to hear an expert's opinion.
My thoughts so far:
1. Numpy's unique: This would not work, as it is not stable:
arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
print(arr_unq)
# output 1:
# [[ 1 10 3]
# [ 1 20 3]
# [ 2 20 1]
# [ 2 20 2]]
print(indicator)
# output 2:
# [2 0 3 2 1 3]
Notice how the indicator starts from 2. This is because the unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.
Of course I can use LabelEncoder from sklearn to convert the items so that they start from 0, but I feel that there is a simple numpy trick for this that would let me avoid adding the sklearn dependency to my program.
Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:
dct = {}
for idx, item in enumerate(indicator):
    if item not in dct:
        dct[item] = len(dct)
    indicator[idx] = dct[item]
print(indicator)
# outputs:
# [0 1 2 0 3 2]
2. Stabilizing numpy's unique output: This solution has already been posted on Stack Overflow and correctly returns a stable unique array. But I do not know how to convert the returned indicator vector (returned when return_inverse=True) so that it represents the values in a stable order starting from 0.
3. Pandas's get_dummies function: But it returns a "one-hot encoding" (a matrix of indicator values). In contrast, I would like to have an indicator vector. It is indeed possible to convert the one-hot encoding to an indicator vector with a few lines of code and data manipulation, but again that approach is not going to be highly efficient.
In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
Now you can use the fact that np.argsort is effectively its own inverse to fix the order. Note that idx.argsort() is the permutation that rearranges unq into stable (first-appearance) order. The corrected result is therefore
indicator = idx.argsort().argsort()[inv]
And of course, as a byproduct, you get the unique rows in stable order:
unq = unq[idx.argsort()]
Of course, there is nothing about these operations that is specific to 2D.
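Putting the pieces together on the question's example (a sketch, assuming a reasonably recent numpy; the variable names perm and indicator are mine):
import numpy as np

arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])

unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
perm = idx.argsort()             # orders the unique rows by first appearance
indicator = perm.argsort()[inv]  # relabels the inverse indices in stable order

print(indicator)  # [0 1 2 0 3 2]
print(unq[perm])  # the unique rows, in order of first appearance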
A Note on the Intuition
Let's say you have an array x:
x = np.array([7, 3, 0, 1, 4])
x.argsort() is the index array that tells you which element of x is placed at each location of the sorted array. So
i = x.argsort() # 2, 3, 1, 4, 0
But how would you get from np.sort(x) back to x (which is the problem you express in #2)?
Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at 3, ..., the last (largest) element was at index 0. This means that to place np.sort(x) back into its original order, you need the index that puts i into sorted order. That means that you can write x as
np.sort(x)[i.argsort()]
Which is equivalent to
x[i][i.argsort()]
OR
x[x.argsort()][x.argsort().argsort()]
So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.
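A minimal check of that identity:
import numpy as np

x = np.array([7, 3, 0, 1, 4])
i = x.argsort()  # [2 3 1 4 0]

# Argsorting twice yields the index that restores the original order.
assert np.array_equal(np.sort(x)[i.argsort()], x)
assert np.array_equal(x[i][i.argsort()], x)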

What Is the Logic Behind Advanced Indexing in Numpy?

When the following lines of code are run, the same results are expected. Is the logic behind advanced indexing in Numpy literally zipping different iterables together? If so, I am also curious about what data structure the iterables are converted into after zipping. I am using a tuple in my example, but it seems like there are other possibilities. Thanks in advance for the help!
a = np.array([[1,2],[3,4],[5,6]])
print(a[[0,1],[1,1]])
>>> [2 4]
result = zip([0,1],[1,1])
print(a[tuple(result)])
>>> [2 4]
A list and a tuple are basically the same - both hold items - but while a list is mutable (i.e., you can change its elements), a tuple is immutable.
As far as numpy indexing is concerned, you can use both, as long as they hold integer values.
The only advantage of using a tuple for indexing is that it cannot be changed mid-run and mess up the data extraction (as shown in Example 1), if that is one of your requirements.
Example 1 (immutable):
import numpy as np

arr = np.random.randint(0, 10, 6).reshape((2, 3))
idx = tuple(np.random.randint(0, 2, 10).reshape((5, 2)))
for i in range(3):
    np.random.shuffle(idx)
    print(arr[idx])
Output of Example 1:
TypeError: 'tuple' object does not support item assignment
On the other hand, if you want more flexible indexing (as in Example 2), i.e., changing the indices during the run, tuples won't work for you.
Example 2 (Mutable):
import numpy as np

arr = np.random.randint(0, 10, 6).reshape((2, 3))
idx = np.random.randint(0, 2, 10).reshape((5, 2))
for i in range(3):
    np.random.shuffle(idx)
    print(arr[idx])
Output of Example 2:
[[[0 6 8]
[0 5 5]]
[[0 5 5]
[0 5 5]]
[[0 6 8]
[0 5 5]]...
So, whether to use one or the other depends on the outcome you desire.
Cheers.
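To connect this back to the zip intuition in the question, here is a minimal sketch confirming that integer-array indexing pairs the two index sequences element-wise, just like zip does:
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
rows, cols = [0, 1], [1, 1]

# Advanced indexing pairs rows[k] with cols[k] element-wise...
fancy = a[rows, cols]
# ...which matches looking up each zipped (row, col) pair one at a time.
zipped = np.array([a[r, c] for r, c in zip(rows, cols)])

assert np.array_equal(fancy, zipped)  # both are [2 4]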

Why is fancy indexing not the same as slicing in numpy?

I have been learning fancy indexing, but when I observed the behavior of the following code, a couple of questions came up...
According to my understanding,
Fancy Indexing is:
ndArray[ [0,1,2] ] i.e. passing a list of rows / columns
and
Slicing is:
ndArray[ 0:3 ] i.e. giving a range of rows / columns
Now, the problem
A numpy array,
arr = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
When I try fancy indexing:
arr[ [0,1], [1,2] ]
>>> [2, 6]
And when slice it,
arr[:2, 1:]
>>> [ [2, 3],
[5, 6] ]
Essentially both of them should return a two-dimensional array, as both of them mean the same thing and are used interchangeably!
:2 should be equivalent to [0,1] #For rows
1: should be equivalent to [1,2] #For cols
The question:
Why does fancy indexing not return the same result as the slice notation? And how can I achieve that?
Please enlighten me.
Thanks
Fancy indexing and slicing behave differently by definition / by numpy specification.
So, instead of questioning why that is so, it is better to:
Be able to recognize / distinguish / tell them apart (i.e., have a clear understanding of when indexing becomes fancy indexing, and when it is slicing).
Be aware of the differences in their semantics (outcomes).
In your example:
In the case of fancy indexing, the indices generated for the two axes are combined "in tandem", similar to how the zip function combines two input sequences "in tandem" (in the words of the official numpy documentation, the two index arrays are "iterated together"). We are passing the list [0, 1] to index the array on axis 0, and the list [1, 2] to index the array on axis 1. The index 0 from the index array [0, 1] is combined only with the corresponding index 1 of the index array [1, 2]. Similarly, the index 1 of the index array [0, 1] is combined only with the corresponding index 2 of the index array [1, 2]. In other words, the index arrays do not combine with each other in a many-to-many fashion.
In the case of slicing, the slice :2 specified for axis 0 conceptually generates indices 0 and 1 for axis 0, and the slice 1: specified for axis 1 conceptually generates indices 1 and 2 for axis 1. These generated indices combine in a many-to-many fashion, unlike in the case of fancy indexing, so they produce four combinations rather than just two.
So, the crucial difference in the defined semantics of fancy indexing and slicing is that in the case of fancy indexing, the fancy index arrays are iterated together.
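And to answer the "how to achieve that" part: if you want the many-to-many combination from index lists (i.e., the same result as the slice), numpy provides np.ix_ for exactly this purpose. A short sketch:
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Paired ("in tandem") combination: only (0,1) and (1,2) are selected.
print(arr[[0, 1], [1, 2]])          # [2 6]

# Many-to-many combination via np.ix_, matching the slice arr[:2, 1:].
print(arr[np.ix_([0, 1], [1, 2])])  # [[2 3]
                                    #  [5 6]]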

Delete line from 2D array in Python

I have a numpy array of shape (585, 2) in Python.
The data is organized like in this example.
0,0
1,0
2,1
3,0
4,0
5,1
...
I would like to create an array deleting all the rows where 0 is present in the second column. Thus, the final intended result is:
2,1
5,1
I have read some related examples but I'm still struggling with this.
Since you mention that your structure is a numpy array rather than a list, I would use numpy's boolean (logical) indexing to select only the values you care about.
>>> import numpy as np
>>> x = [[0,0], [1,0], [2,1], [3,0], [4,0], [5,1]] # Create dummy list
>>> x = np.asarray(x) # Convert list to numpy array
>>> x[x[:, 1] != 0] # Keep only the rows whose second column is not zero
array([[2, 1],
[5, 1]])
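Since the question is phrased in terms of deleting rows, np.delete can express the same operation, though the boolean indexing above is usually the more idiomatic choice. A sketch:
import numpy as np

x = np.array([[0, 0], [1, 0], [2, 1], [3, 0], [4, 0], [5, 1]])

# Delete the rows whose second column is zero.
result = np.delete(x, np.where(x[:, 1] == 0)[0], axis=0)
print(result)  # [[2 1]
               #  [5 1]]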
Here is my answer.
If your list looks like this: [[0,0],[2,1],[4,3],[2,0]]
(if your list structure isn't like this, please tell me), then my answer prints the elements whose second column does not equal 0:
print([x for x in your_list if x[1] != 0])  # your_list is the variable holding the list
You could use a list comprehension. These are described on the Python Data Structures page (Python 2, Python 3).
If your array is:
x = [[0, 0],
     [1, 0],
     [2, 1],
     [3, 0],
     [4, 0],
     [5, 1]]
Then the following command
[y for y in x if y[1] != 0]
would return the desired result of:
[[2, 1], [5, 1]]
Edit: I overlooked that it was a numpy array. Taking that into account, JoErNanO's answer is better.

Convert row vector to column vector in NumPy

import numpy as np
matrix1 = np.array([[1,2,3],[4,5,6]])
vector1 = matrix1[:,0] # This should have shape (2,1) but actually has (2,)
matrix2 = np.array([[2,3],[5,6]])
np.hstack((vector1, matrix2))
ValueError: all the input arrays must have same number of dimensions
The problem is that when I select the first column of matrix1 and put it in vector1, it gets converted to a row vector, so when I try to concatenate with matrix2, I get a dimension error. I could do this.
np.hstack((vector1.reshape(matrix2.shape[0],1), matrix2))
But this looks too ugly for me to do every time I have to concatenate a matrix and a vector. Is there a simpler way to do this?
The easier way is
vector1 = matrix1[:,0:1]
For the reason, let me refer you to another answer of mine:
When you write something like a[4], that's accessing the fifth element of the array, not giving you a view of some section of the original array. So for instance, if a is an array of numbers, then a[4] will be just a number. If a is a two-dimensional array, i.e. effectively an array of arrays, then a[4] would be a one-dimensional array. Basically, the operation of accessing an array element returns something with a dimensionality of one less than the original array.
Here are three other options:
You can tidy up your solution a bit by allowing the row dimension of the vector to be set implicitly:
np.hstack((vector1.reshape(-1, 1), matrix2))
You can index with np.newaxis (or equivalently, None) to insert a new axis of size 1:
np.hstack((vector1[:, np.newaxis], matrix2))
np.hstack((vector1[:, None], matrix2))
You can use np.matrix, for which indexing a column with an integer always returns a column vector (note, though, that np.matrix is discouraged in modern NumPy in favor of regular arrays):
matrix1 = np.matrix([[1, 2, 3],[4, 5, 6]])
vector1 = matrix1[:, 0]
matrix2 = np.matrix([[2, 3], [5, 6]])
np.hstack((vector1, matrix2))
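A quick verification sketch showing that the reshape, np.newaxis/None, and slicing approaches all give the same result (np.matrix is left out, since it is discouraged in modern NumPy):
import numpy as np

matrix1 = np.array([[1, 2, 3], [4, 5, 6]])
matrix2 = np.array([[2, 3], [5, 6]])
vector1 = matrix1[:, 0]  # shape (2,)

a = np.hstack((vector1.reshape(-1, 1), matrix2))
b = np.hstack((vector1[:, np.newaxis], matrix2))
c = np.hstack((vector1[:, None], matrix2))
d = np.hstack((matrix1[:, 0:1], matrix2))  # slicing keeps the column axis

assert all(np.array_equal(a, m) for m in (b, c, d))
print(a)
# [[1 2 3]
#  [4 5 6]]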
Subsetting
The even simpler way is to subset the matrix.
>>> print(matrix1)
[[1 2 3]
 [4 5 6]]
>>> print(matrix1[:, [0]])  # Subsetting
[[1]
 [4]]
>>> print(matrix1[:, 0])    # Indexing
[1 4]
>>> print(matrix1[:, 0:1])  # Slicing
[[1]
 [4]]
I also mentioned this in a similar question.
It works somewhat similarly to a Pandas dataframe. If you index the dataframe, it gives you a Series. If you subset or slice the dataframe, it gives you a dataframe.
Your approach uses indexing, David Z's approach uses slicing, and my approach uses subsetting.
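A minimal shape check summarizing the three behaviors:
import numpy as np

matrix1 = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix1[:, [0]].shape)  # (2, 1) - subsetting keeps the column axis
print(matrix1[:, 0].shape)    # (2,)   - indexing drops a dimension
print(matrix1[:, 0:1].shape)  # (2, 1) - slicing keeps the column axis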
