I got a numpy matrix and I want to get the index of the maximum value in each row. E.g.
[[1,2,3],[1,3,2],[3,2,1]]
will return
[0,1,2]
However, when there is more than one maximum value in a row, numpy.argmax only returns the smallest of their indices. E.g.
[[0,0,0],[0,0,0],[0,0,0]]
will return
[0,0,0]
Can I change this default (smallest index) to something else? E.g. when there are multiple equal maximum values, return 1 or None, so that the above result would be
[1,1,1]
or
[None, None, None]
If I can do this in TensorFlow, that would be even better.
Thanks!
You can use np.partition to find the two largest values per row and check whether they are equal, then use that as a mask in np.where to set the default value:
In [228]: a = np.array([[1, 2, 3, 2], [3, 1, 3, 2], [3, 5, 2, 1]])
In [229]: twomax = np.partition(a, -2)[:, -2:].T
In [230]: default = -1
In [231]: argmax = np.where(twomax[0] != twomax[1], np.argmax(a, -1), default)
In [232]: argmax
Out[232]: array([ 2, -1, 1])
A convenient value for "default" is -1, since argmax will never return that on its own. None does not fit in an integer array. A masked array is also an option, but I didn't go that far.

Here is a NumPy implementation:
def my_argmax(a):
    rows = np.where(a == a.max(axis=1)[:, None])[0]
    rows_multiple_max = rows[:-1][rows[:-1] == rows[1:]]
    my_argmax = a.argmax(axis=1)
    my_argmax[rows_multiple_max] = -1
    return my_argmax
Example of use:
import numpy as np
a = np.array([[0, 0, 0], [4, 5, 3], [3, 4, 4], [6, 2, 1]])
my_argmax(a) # array([-1, 1, -1, 0])
Explanation: where selects the indices of all maximal elements in each row. If a row has multiple maxima, its row number appears more than once in the rows array. Since this array is already sorted, such repetition is detected by comparing consecutive elements. This identifies the rows with multiple maxima, which are then masked in the output of NumPy's argmax method.
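As for the TensorFlow part of the question, the same count-the-maxima idea carries over directly. A minimal sketch, assuming TF 2.x eager execution (my addition, not from the answers above; the input values and the -1 default are illustrative):

import tensorflow as tf

a = tf.constant([[1, 2, 3], [0, 0, 0], [3, 2, 1]])
max_per_row = tf.reduce_max(a, axis=1, keepdims=True)
# count how many entries of each row are equal to that row's maximum
num_max = tf.reduce_sum(tf.cast(tf.equal(a, max_per_row), tf.int64), axis=1)
# fall back to -1 wherever the maximum is not unique
argmax = tf.where(num_max > 1, -tf.ones_like(num_max), tf.argmax(a, axis=1))
print(argmax.numpy())  # [ 2 -1  0]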
Suppose I have the following numpy array
>>> arr = np.array([[0,1,0,1],
[0,0,0,1],
[1,0,1,0]])
where values of 0 indicate "valid" values and values of 1 indicate "invalid" values. Next, suppose that I filter out the invalid 1 values as follows:
>>> mask = arr == 1
>>> arr_filt = arr[~mask]
>>> arr_filt
array([0, 0, 0, 0, 0, 0, 0])
If I want to go from a linear index of a valid value (0) in the original arr to the linear index of the same value in the new filtered arr_filt, I can proceed as follows:
# record the cumulative total number of invalid values
num_invalid_prev = np.cumsum(mask.flatten())
# example of going from the linear index of a valid value in the
# original array to the linear index of the same value in the
# filtered array
ij_idx = (0,2)
linear_idx_orig = np.ravel_multi_index(ij_idx,arr.shape)
linear_idx_filt = linear_idx_orig - num_invalid_prev[linear_idx_orig]
However, I'm interested in going the other way. That is, given the linear index of a valid value in the filtered arr_filt and the same num_invalid_prev array, can I get back the linear index of the same valid value in the unfiltered arr?
You can use np.nonzero() to get the indices of the valid values:
ix = np.c_[np.nonzero(~mask)]
>>> ix
array([[0, 0],
[0, 2],
[1, 0],
[1, 1],
[1, 2],
[2, 1],
[2, 3]])
Then, you can for example look up the index of the second valid value:
>>> tuple(ix[1])
(0, 2)
>>> arr[tuple(ix[1])]
0
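To go back from a linear index in arr_filt to a linear index in the original arr, the same mask can also be turned into a lookup table with np.flatnonzero. A small sketch along those lines (my addition, not part of the answer above):

import numpy as np

arr = np.array([[0, 1, 0, 1],
                [0, 0, 0, 1],
                [1, 0, 1, 0]])
mask = arr == 1

# linear indices (into arr.ravel()) of the valid values, in filtered order
valid_linear = np.flatnonzero(~mask)

linear_idx_filt = 1                       # second valid value in arr_filt
linear_idx_orig = valid_linear[linear_idx_filt]
print(linear_idx_orig, np.unravel_index(linear_idx_orig, arr.shape))  # 2 (0, 2)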
I have a 2D array:
A = np.array([[2,3,4],
[2,0,4],
[1,3,7]])
I am searching, per column, for the indices that represent the maximum value of that column, without using a for loop.
What I would like to have, is something like:
max_rowIndices_perColumn = np.array([[0,1],[0,2],[2]])
I had the idea to use:
np.where(A== np.amax(A,axis=0))
but since in the second step I would like to work with each column separately, I am not really happy with this idea.
Thank you in advance
You need some deeper knowledge of NumPy's indexing behaviour.
Basically, np.where returns advanced indices of True cells in C order (row by row):
>>> np.where(mask)
(array([0, 0, 1, 2, 2]), array([0, 1, 0, 1, 2]))
but you need to do it in Fortran order (column by column) like so:
>>> np.where(mask, order='F')  # hypothetical: np.where doesn't support an order parameter
(array([0, 1, 0, 2, 2]), array([0, 0, 1, 1, 2]))
Since that parameter doesn't exist, you can pass mask.T instead (note that the two returned arrays swap roles):
>>> np.where(mask.T) # fix
(array([0, 0, 1, 1, 2]), array([0, 1, 0, 2, 2]))
The remaining part is to split the row indices into per-column groups. Putting it all together, you could solve your problem like so:
mask = A == np.amax(A, axis=0)
x, y = np.where(mask.T)
div_points = np.flatnonzero(np.diff(x)) + 1
np.split(y, div_points)
[array([0, 1]), array([0, 2]), array([2])]
Define a function to get the indices of the maximum value in a column:
def idxMax(col):
    # sorting -col puts the maximum first, so inv == 0 marks every
    # element equal to that maximum
    _, _, inv = np.unique(-col, return_index=True, return_inverse=True)
    return np.where(inv == 0)[0].tolist()
Then generate the result as:
result = np.array([ idxMax(col) for col in A.T ], dtype=object)
For your source data, the result is:
array([list([0, 1]), list([0, 2]), list([2])], dtype=object)
Note that in the general case there is no guarantee that each column will yield the same number of max indices, so the result array is a "ragged" one, and in this case NumPy requires dtype=object to be passed.
But if it is enough for you to get a plain pythonic list of lists
(instead of a Numpy array), you can shrink the above code to:
result = [ idxMax(col) for col in A.T ]
In this case the result is:
[[0, 1], [0, 2], [2]]
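For comparison, the same per-column result can also be obtained without np.unique, by comparing each column with its maximum directly. A minimal sketch of that variant (my addition, not part of the answer above):

import numpy as np

A = np.array([[2, 3, 4],
              [2, 0, 4],
              [1, 3, 7]])

# flatnonzero gives the positions where the column equals its maximum
result = [np.flatnonzero(col == col.max()).tolist() for col in A.T]
print(result)  # [[0, 1], [0, 2], [2]]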
I would like to get all indices of the maximum values in each row of an ndarray.
For example, I have
arr = np.array([[1, 3, 3], [1, 5, 4]])
and I would like to get the indices of all the 3's in the first row, and of all the 5's in the second row. I tried:
np.where(((arr == arr[0].max()) | (arr == arr[1].max())))
And it returns
(array([0, 0, 1], dtype=int64), array([1, 2, 1], dtype=int64))
I want something like that, but more universal, for any number of rows, because np.where(arr == arr.argmax()) doesn't work the way I want it to: argmax only returns the index of the first maximum it meets in each row.
#Paul's answer in the comments might be the best you can find; writing it up for readers. arr.max(1) finds the max in each row, and arr == arr.max(1, keepdims=True) finds all elements in each row that are equal to the corresponding max of that row. Finally, nonzero returns the indices of those elements:
(arr==arr.max(axis=1,keepdims=True)).nonzero()
output for OP's example:
(array([0, 0, 1]), array([1, 2, 1]))
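If per-row lists of column indices are wanted instead of the paired index arrays, the rows array can be split wherever the row number changes, as in the earlier per-column question. A minimal sketch (my addition, not from the answer above):

import numpy as np

arr = np.array([[1, 3, 3], [1, 5, 4]])
rows, cols = (arr == arr.max(axis=1, keepdims=True)).nonzero()
# every row has at least one maximum, so each row number appears in `rows`;
# split the column indices at the points where the row number changes
per_row = np.split(cols, np.flatnonzero(np.diff(rows)) + 1)
print(per_row)  # [array([1, 2]), array([1])]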
I have the following code:
import numpy as np
sample = np.random.random((10,10,3))
argmax_indices = np.argmax(sample, axis=2)
i.e. I take the argmax along axis=2 and it gives me a (10,10) matrix. Now I want to set the values at these argmax positions to 0. To do so, I want to index the sample array. I tried:
max_values = sample[argmax_indices]
but it doesn't work. I want something like
max_values = sample[argmax_indices]
sample[argmax_indices] = 0
I simply validate by checking that max_values - np.max(sample, axis=2) gives a zero matrix of shape (10,10).
Any help will be appreciated.
Here's one approach -
m,n = sample.shape[:2]
I,J = np.ogrid[:m,:n]
max_values = sample[I,J, argmax_indices]
sample[I,J, argmax_indices] = 0
Sample step-by-step run
1) Sample input array :
In [261]: a = np.random.randint(0,9,(2,2,3))
In [262]: a
Out[262]:
array([[[8, 4, 6],
[7, 6, 2]],
[[1, 8, 1],
[4, 6, 4]]])
2) Get the argmax indices along axis=2 :
In [263]: idx = a.argmax(axis=2)
3) Get the shape and arrays for indexing into first two dims :
In [264]: m,n = a.shape[:2]
In [265]: I,J = np.ogrid[:m,:n]
4) Index using I, J and idx for storing the max values using advanced-indexing :
In [267]: max_values = a[I,J,idx]
In [268]: max_values
Out[268]:
array([[8, 7],
[8, 6]])
5) Verify that we are getting an all zeros array after subtracting np.max(a,axis=2) from max_values :
In [306]: max_values - np.max(a, axis=2)
Out[306]:
array([[0, 0],
[0, 0]])
6) Again using advanced-indexing assign those places as zeros and do one more level of visual verification :
In [269]: a[I,J,idx] = 0
In [270]: a
Out[270]:
array([[[0, 4, 6], # <=== Compare this against the original version
[0, 6, 2]],
[[1, 0, 1],
[4, 0, 4]]])
An alternative to np.ogrid is np.indices.
I, J = np.indices(argmax_indices.shape)
sample[I,J,argmax_indices] = 0
This can also be generalized to handle arrays of any dimensionality. The resulting function sets the largest value of every 1-d vector along any desired dimension d (dimension 2 in the case of the original question) to 0, or to any other value:
def set_zero(sample, d, val):
    """Set the max value along dimension d in array sample to value val."""
    argmax_idxs = sample.argmax(d)
    idxs = [np.indices(argmax_idxs.shape)[j].flatten() for j in range(len(argmax_idxs.shape))]
    idxs.insert(d, argmax_idxs.flatten())
    # tuple() is needed here: indexing with a list of arrays is deprecated in newer NumPy
    sample[tuple(idxs)] = val
    return sample
set_zero(sample, d=2, val=0)
(Tested for numpy 1.14.1 on python 3.6.4 and python 2.7.14)
I am working on a sparse matrix stored in COO format. What would be the fastest way to get the number of groups of consecutive non-zero elements in each row?
For example consider the following matrix:
a = [[0,1,2,0],[1,0,0,2],[0,0,0,0],[1,0,1,0]]
Its COO representation would be
(0, 1) 1
(0, 2) 2
(1, 0) 1
(1, 3) 2
(3, 0) 1
(3, 2) 1
I need the result to be [1,2,0,2]. The first row contains two non-zero elements that lie next to each other, hence they form one group. The second row also has two non-zero elements, but they are not adjacent, so they form two groups. The third row has no non-zeros and hence no groups. The fourth row again has two non-zeros, separated by zeros, so we count two groups. It would be like the number of clusters per row. Iterating through the rows is an option, but only if there is no faster solution. Any help in this regard is appreciated.
Another simple example: consider the following row:
[1,2,3,0,0,0,2,0,0,8,7,6,0,0]
The above row should return [3], since there are three groups of non-zeros separated by zeros.
Convert it to a dense array, and apply your logic row by row, keeping in mind that you want the number of groups per row, that the zeros count when defining groups, and that row iteration is faster with arrays.
In coo format your matrix looks like:
In [623]: M=sparse.coo_matrix(a)
In [624]: M.data
Out[624]: array([1, 2, 1, 2, 1, 1])
In [625]: M.row
Out[625]: array([0, 0, 1, 1, 3, 3], dtype=int32)
In [626]: M.col
Out[626]: array([1, 2, 0, 3, 0, 2], dtype=int32)
This format does not implement row indexing; csr and lil do
In [627]: M.tolil().data
Out[627]: array([[1, 2], [1, 2], [], [1, 1]], dtype=object)
In [628]: M.tolil().rows
Out[628]: array([[1, 2], [0, 3], [], [0, 2]], dtype=object)
So the sparse information for the 1st row is a list of nonzero data values, [1,2], and list of their column numbers, [1,2]. Compare that with the row of the dense array, [0, 1, 2, 0]. Which is easier to analyze?
Your first task is to write a function that analyzes one row. I haven't studied your logic enough to say whether the dense form is better than the sparse one or not. It is easy to get the column information from the dense form with M.A[0,:].nonzero().
In your last example, I can get the nonzero indices:
In [631]: np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])
Out[631]: (array([ 0, 1, 2, 6, 9, 10, 11], dtype=int32),)
In [632]: idx=np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])[0]
In [633]: idx
Out[633]: array([ 0, 1, 2, 6, 9, 10, 11], dtype=int32)
In [634]: np.diff(idx)
Out[634]: array([1, 1, 4, 3, 1, 1], dtype=int32)
We may be able to get the desired count from the number of diff values >1, though I'd have to look at more examples to define the details.
Extension of the analysis to multiple rows depends on first thoroughly understanding the single row case.
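For what it's worth, the diff idea can also be vectorized over the whole dense array at once: a group starts wherever a non-zero cell is not preceded by a non-zero in the same row. A sketch of that (my addition, not part of the answer above):

import numpy as np

a = np.array([[0, 1, 2, 0],
              [1, 0, 0, 2],
              [0, 0, 0, 0],
              [1, 0, 1, 0]])

nz = a != 0
# shift the non-zero mask one column to the right, padding with False
prev_nz = np.pad(nz, ((0, 0), (1, 0)))[:, :-1]
# a group starts at a non-zero whose left neighbour is zero (or missing)
starts = nz & ~prev_nz
print(starts.sum(axis=1))  # [1 2 0 2]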
With the help of #hpaulj's comment, I came up with the following snippet to do this:
M = m.tolil()
r = []
for i in range(M.shape[0]):
    idx = M.rows[i]  # sorted column indices of the non-zeros in row i
    if len(idx) > 2:
        tempidx = np.diff(idx)
        # every gap larger than 1 between consecutive columns starts a new
        # group; list() is needed so len() works on Python 3's lazy filter
        temp = list(filter(lambda a: a != 1, tempidx))
        counts = len(temp) + 1
        r.append(counts)
    elif len(idx) == 2:
        tempidx = np.diff(idx)
        if tempidx[0] == 1:
            counts = 1  # adjacent pair: one group
        else:
            counts = 2  # separated pair: two groups
        r.append(counts)
    elif len(idx) == 1:
        r.append(1)  # a single non-zero is one group
    else:
        r.append(0)  # empty row: no groups
# average number of groups per row; `cluster` is an external list
tempcluster = np.sum(r) / float(M.shape[0])
cluster.append(tempcluster)