Python NumPy: determine the array index which contains the overall max/min value

Say I have a Numpy nd array like below.
arr = np.array([[2, 3, 9], [1,6,7], [4, 5, 8]])
The numpy array is a 3 x 3 matrix.
What I want to do is find the row index which contains the overall maximum value.
For instance in the above example, the maximum value for the whole ndarray is 9. Since it is located in the first row, I want to return index 0.
What would be the most efficient way to do this?
I know that
np.argmax(arr)
can return the index of the maximum for the whole array, per row, or per column. However, is there a method that can return the index of the row which contains the overall maximum value?
It would also be nice if this could easily be changed to work column-wise, or to find the minimum, etc.
I am unsure how to go about this without using a loop that finds the overall max while keeping a variable for the index. However, I feel that this is inefficient for numpy and that there must be a better method.
Any help or thoughts are appreciated.
Thank you for reading.

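Use np.where with np.amax and np.amin. In each result, the first array holds the row indices and the second holds the column indices.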
import numpy as np
arr = np.array([[2, 3, 9], [1,6,7], [4, 5, 8]])
max_value_index = np.where(arr == np.amax(arr))
min_value_index = np.where(arr == np.amin(arr))
print(max_value_index)
print(min_value_index)
Output:
(array([0]), array([2]))
(array([1]), array([0]))
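If only the row (or column) index of the overall extreme is needed, one possible sketch, which is not part of the answer above, combines np.argmax or np.argmin with np.unravel_index. Note that argmax and argmin report only the first occurrence when the extreme value appears more than once.

```python
import numpy as np

arr = np.array([[2, 3, 9], [1, 6, 7], [4, 5, 8]])

# argmax/argmin without an axis return an index into the flattened array;
# unravel_index converts that flat index back to (row, column) coordinates.
max_row, max_col = np.unravel_index(np.argmax(arr), arr.shape)
min_row, min_col = np.unravel_index(np.argmin(arr), arr.shape)

print(max_row, max_col)  # row and column of the overall maximum
print(min_row, min_col)  # row and column of the overall minimum
```

Taking max_row answers the row question and max_col the column-wise variant; swapping argmax for argmin gives the minimum.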

Related

Stable conversion of a multi-column (2D) numpy array to an indicator vector

I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order preserved) manner.
For example, I have the following numpy array:
import numpy as np
arr = np.array([
[2, 20, 1],
[1, 10, 3],
[2, 20, 2],
[2, 20, 1],
[1, 20, 3],
[2, 20, 2],
])
The output I would like to have is:
indicator = [0, 1, 2, 0, 3, 2]
How can I do this (preferably using numpy only)?
Notes:
I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none is ideal. It would be nice to hear expert's opinion.
My thoughts so far:
1. Numpy's unique: This would not work, as it is not stable:
arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
print(arr_unq)
# output 1:
# [[ 1 10 3]
# [ 1 20 3]
# [ 2 20 1]
# [ 2 20 2]]
print(indicator)
# output 2:
# [2 0 3 2 1 3]
Notice how the indicator starts from 2. This is because unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.
Of course I can use LabelEncoder from sklearn to convert the items in a manner that they start from 0 but I feel that there is a simple numpy trick that I can use and therefore avoid adding sklearn dependency to my program.
Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:
dct = {}
for idx, item in enumerate(indicator):
    if item not in dct:
        dct[item] = len(dct)
    indicator[idx] = dct[item]
print(indicator)
# outputs:
# [0 1 2 0 3 2]
2. Stabilizing numpy's unique output: This solution is already posted on Stack Overflow and correctly returns a stable unique array. But I do not know how to convert the returned indicator vector (returned when return_inverse=True) to represent the values in a stable order starting from 0.
3. Pandas's get_dummies function: It returns a "hot encoding" (matrix of indicator values), whereas I would like to have an indicator vector. It is indeed possible to convert the "hot encoding" to the indicator vector with a few lines of code and data manipulation, but again that approach is not going to be highly efficient.
In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
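Now you can use the fact that argsorting a permutation yields its inverse permutation to fix the order. Note that unq[idx.argsort()] lists the unique rows in order of first appearance, so idx.argsort().argsort() gives, for each row of the sorted unq, its position in that stable order. The corrected result is therefore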
indicator = idx.argsort().argsort()[inv]
And of course the byproduct
unq = unq[idx.argsort()]
Of course, nothing about these operations is specific to 2D.
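Put together on the example array from the question, the steps above can be sketched as follows. The names order, rank and unq_stable are introduced here only for illustration, and the ravel call is a defensive assumption to guarantee a 1-D inverse regardless of NumPy version.

```python
import numpy as np

arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])

unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
inv = inv.ravel()  # defensive: make sure the inverse is 1-D

order = idx.argsort()   # sorted-unique rows, listed in order of first appearance
rank = order.argsort()  # rank[k] is the stable label of sorted-unique row k
indicator = rank[inv]   # stable label for every row of arr
unq_stable = unq[order] # unique rows in order of first appearance

print(indicator)
print(unq_stable)
```

As a sanity check, unq_stable[indicator] reconstructs arr.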
A Note on the Intuition
Let's say you have an array x:
x = np.array([7, 3, 0, 1, 4])
x.argsort() is the index that tells you what elements of x are placed at each of the locations in the sorted array. So
i = x.argsort() # 2, 3, 1, 4, 0
But how would you get from np.sort(x) back to x (which is the problem you express in #2)?
Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at 3, ..., the last (largest) element was at index 0. This means that to place np.sort(x) back into its original order, you need the index that puts i into sorted order. That means that you can write x as
np.sort(x)[i.argsort()]
Which is equivalent to
x[i][i.argsort()]
OR
x[x.argsort()][x.argsort().argsort()]
So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.
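The round trip described above can be checked directly. This is a small sketch on the same x; the names inverse and restored are illustrative only.

```python
import numpy as np

x = np.array([7, 3, 0, 1, 4])

i = x.argsort()        # permutation that sorts x
inverse = i.argsort()  # permutation that undoes the sort

# Applying the inverse permutation to the sorted array restores x.
restored = np.sort(x)[inverse]

print(i)
print(inverse)
print(restored)
```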

np.where to compute age group index

There is a part of the following code that I don't quite understand.
Here is the code:
import numpy as np
medalNames = np.array(['none', 'bronze', 'silver', 'gold'])
ageGroupCategories = np.array(['B','P','G','T'])
allLowerThresholds = np.array([[-1,0,5,10], [0,5,10,15], [0,11,14,17], [0,15,17,19]])
ageGroupIndex = np.where(ageGroup[0] == ageGroupCategories)[0][0]
In the last line, what does the [0][0] do, why doesn't the code work without it?
A few things:
Use embedded Code boxes
Your code isn't working at all because the variable ageGroup doesn't exist
Now to your question:
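np.where(condition) returns a tuple with one index array per dimension. The first [0] selects the index array for the first (here, the only) dimension, and the second [0] takes the first matching index from that array. Without it you get the whole tuple rather than a single integer.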
Your question is general and related to the numpy.where function.
Let's take a simple example as follows:
A=np.array([[3,2,1],[4,5,1]])
# array([[3, 2, 1],
# [4, 5, 1]])
print(np.where(A==1))
# (array([0, 1]), array([2, 2]))
As you can see the np.where function returns a tuple. The first element (it's a numpy array) of the tuple is the row/line index, and the second element (it's again a numpy array), is the column index.
np.where(A==1)[0] # this is the first element of the tuple thus,
# the numpy array containing all the row/line
# indices where the value is = 1.
#array([0, 1])
The above tells you that there is a value = 1 in the first (0) and second (1) row of the matrix A.
Next:
np.where(A==1)[0][0]
# 0
returns the index of the first row that contains a value equal to 1. Here 0 refers to the first row of matrix A.
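Applied to the code from the question, the two indexing steps can be sketched as below. The value assigned to ageGroup is purely hypothetical, since the original post never defines that variable; matches and first_axis_indices are illustrative names.

```python
import numpy as np

ageGroupCategories = np.array(['B', 'P', 'G', 'T'])
ageGroup = 'P2'  # hypothetical value; not given in the original post

# Comparing a single character against the array yields a boolean array.
matches = np.where(ageGroup[0] == ageGroupCategories)

first_axis_indices = matches[0]        # first [0]: index array for the only axis
ageGroupIndex = first_axis_indices[0]  # second [0]: first matching position

print(matches)        # a one-element tuple holding an index array
print(ageGroupIndex)  # a single integer usable as an index
```

Without the [0][0], ageGroupIndex would be a tuple of arrays rather than one integer.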

Iterate over numpy with index (numpy equivalent of python enumerate)

I'm trying to create a function that will calculate the lattice distance (number of horizontal and vertical steps) between elements in a multi-dimensional numpy array. For this I need to retrieve the actual numbers from the indexes of each element as I iterate through the array. I want to store those values as numbers that I can run through a distance formula.
For the example array A
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
I'd like to create a loop that iterates through each element and for the first element 1 it would retrieve a=0, b=0 since 1 is at A[0,0], then a=0, b=1 for element 2 as it is located at A[0,1], and so on...
My envisioned output is two numbers (corresponding to the two index values for that element) for each element in the array. So in the example above, it would be the two values that I am assigning to be a and b. I only will need to retrieve these two numbers within the loop (rather than save separately as another data object).
Any thoughts on how to do this would be greatly appreciated!
As I've become more familiar with the numpy and pandas ecosystem, it's become clearer to me that iteration is usually the wrong choice because of how slow it is in comparison, and that writing a vectorized operation is best whenever possible. Though the style is not as obvious/Pythonic at first, I've (anecdotally) gained ridiculous speedups with vectorized operations: more than 1000x in one case, by replacing a row-wise .apply(lambda) iteration.
@MSeifert's answer provides this much better and will be significantly more performant on a dataset of any real size
More general answer by @cs95 covering and comparing alternatives to iteration in Pandas
Original Answer
You can iterate through the values in your array with numpy.ndenumerate to get the indices of the values in your array.
Following the numpy.ndenumerate documentation:
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
for index, values in np.ndenumerate(A):
    print(index, values)  # operate here
You can do it using np.ndenumerate but generally you don't need to iterate over an array.
You can simply create a meshgrid (or open grid) to get all indices at once and you can then process them (vectorized) much faster.
For example
>>> x, y = np.mgrid[slice(A.shape[0]), slice(A.shape[1])]
>>> x
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
>>> y
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
and these can be processed like any other array. So if your function that needs the indices can be vectorized you shouldn't do the manual loop!
For example to calculate the lattice distance for each point to a point say (2, 3):
>>> abs(x - 2) + abs(y - 3)
array([[5, 4, 3],
[4, 3, 2],
[3, 2, 1]])
For distances an ogrid would be faster. Just replace np.mgrid with np.ogrid:
>>> x, y = np.ogrid[slice(A.shape[0]), slice(A.shape[1])]
>>> np.hypot(x - 2, y - 3) # cartesian distance this time! :-)
array([[ 3.60555128, 2.82842712, 2.23606798],
[ 3.16227766, 2.23606798, 1.41421356],
[ 3. , 2. , 1. ]])
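As a self-contained variant of the vectorized approach, np.indices returns the same row and column index grids in one call. The reference cell target below is a hypothetical choice made for illustration, not something from the original question.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# One index grid per axis, each with the same shape as A.
rows, cols = np.indices(A.shape)

target = (0, 0)  # hypothetical reference cell inside A

# Lattice (Manhattan) distance from every cell to the reference cell.
lattice = np.abs(rows - target[0]) + np.abs(cols - target[1])
print(lattice)
```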
Another possible solution:
import numpy as np
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
for _, val in np.ndenumerate(A):
    ind = np.argwhere(A == val)
    print(val, ind)
With this approach, if a value appears in the array more than once, you obtain an array of all its indices.

Most Efficient/Pythonic Way to Do Calculations on a Specific Subset of a 2D Numpy Array

So here is a breakdown of the task:
1) I have a 197x10 2D numpy array. I scan through this and identify specific cells that are of interest (criteria that goes into choosing these cells is not important.) These cells are not restricted to one specific area of the matrix.
2) I have 3247 other 2D Numpy arrays with the same dimension. For a single one of these other arrays, I need to take the cell locations of interest specified by step 1) and take the average of all of these (sum them all together and divide by the number of cell locations of interest.)
3) I need to repeat 2) for each of the other 3246 remaining arrays.
What is the best/most efficient way to "mark" the cells of interest and look at them quickly in the 3247 arrays?
--sample on smaller set--
Let's say given a 2x2 array:
[1, 2]
[3, 4]
Perhaps the cells of interest are the ones that contain 1 and 4. Therefore, for the following arrays:
[5, 6]
[7, 8]
and
[9, 10]
[11, 12]
I would want to take (5+8)/2 and record that somewhere.
I would also want to take (9+12)/2 and record that somewhere.
EDIT
Now if I wanted to find these cells of interest in a pythonic way (using Numpy) with the following criteria:
-start at the first row and check the first element
-continue to go down rows in that column marking elements that satisfy condition
-Stop on the first element that does not satisfy the condition and then go to the next column.
So basically now I want to just keep the row-wise (for a specific column) contiguous cells that are of interest. So for 1), if the array looks like:
[1 2 3]
[4 5 6]
[7 8 9]
And 1,4, 2, 8, and 3 were of interest, I'd only mark 1, 4, 2, 3, since 5 disqualifies 8 as being included.
Pythonic way:
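answers = []
# Boolean mask that is True where the condition on the matrix from 1) is met.
# (np.argwhere returns an (N, 2) coordinate array, which cannot index directly.)
mask = first_array > 0.9  # placeholder names and condition; substitute your own
for array2d in your_3247_arrays:
    answer = array2d[mask].mean()
    answers.append(answer)
print(answers)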
Here is an example:
import numpy as np
A = np.random.rand(197, 10)
B = np.random.rand(3247, 197, 10)
loc = np.where(A > 0.9)
B[:, loc[0], loc[1]].mean(axis=1)
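The small sample from the question can be worked through with a boolean mask, as sketched below. The second half is one possible reading of the EDIT, which the answers above do not address: it keeps only cells that are contiguous from the top of each column, using np.logical_and.accumulate. The names first, others, grid, interest and contiguous are illustrative.

```python
import numpy as np

first = np.array([[1, 2], [3, 4]])
others = np.array([[[5, 6], [7, 8]],
                   [[9, 10], [11, 12]]])

# Cells of interest in the first array (here: the cells holding 1 and 4).
mask = np.isin(first, [1, 4])

# Select those cells in every other array at once, then average per array.
means = others[:, mask].mean(axis=1)
print(means)

# EDIT case: walk down each column and stop at the first cell that fails.
grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
interest = np.isin(grid, [1, 4, 2, 8, 3])
contiguous = np.logical_and.accumulate(interest, axis=0)
print(grid[contiguous])
```

The contiguous mask can be used in place of mask in the averaging step.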

Finding the minimum value in a numpy array and the corresponding values for the rest of that array's row

Consider the following NumPy array:
a = np.array([[1,4], [2,1],(3,10),(4,8)])
This gives an array that looks like the following:
array([[ 1, 4],
[ 2, 1],
[ 3, 10],
[ 4, 8]])
What I'm trying to do is find the minimum value of the second column (which in this case is 1), and then report the other value of that pair (in this case 2). I've tried using something like argmin, but that gets tripped up by the 1 in the first column.
Is there a way to do this easily? I've also considered sorting the array, but I can't seem to get that to work in a way that keeps the pairs together. The data is being generated by a loop like the following, so if there's an easier way to do this that isn't a numpy array, I'd take that as an answer too:
results = np.zeros((100,2))
# Loop over search range, change kappa each time
for i in range(100):
    results[i,0] = function1(x)
    results[i,1] = function2(y)
How about
a[np.argmin(a[:, 1]), 0]
Break-down
a. Grab the second column
>>> a[:, 1]
array([ 4, 1, 10, 8])
b. Get the index of the minimum element in the second column
>>> np.argmin(a[:, 1])
1
c. Index a with that to get the corresponding row
>>> a[np.argmin(a[:, 1])]
array([2, 1])
d. And take the first element
>>> a[np.argmin(a[:, 1]), 0]
2
Using np.argmin is probably the best way to tackle this. To do it in pure python, you could use:
min(tuple(r[::-1]) for r in a)[::-1]
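The question also mentions sorting while keeping the pairs together. A minimal sketch of that, which is not part of the answers above, indexes the rows with the argsort of the second column; order and sorted_pairs are illustrative names.

```python
import numpy as np

a = np.array([[1, 4], [2, 1], [3, 10], [4, 8]])

# Row order that sorts by the second column; indexing with it moves whole rows.
order = a[:, 1].argsort()
sorted_pairs = a[order]

print(sorted_pairs)
print(sorted_pairs[0, 0])  # first-column value paired with the smallest second-column value
```

For a single lookup, argmin avoids the full sort; sorting is only worthwhile when the whole ranking is needed.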
