Numpy advanced selection not working - python

Can someone please help me to understand why sometimes the advanced selection doesn't work and what I can do to get it to work (2nd case)?
>>> import numpy as np
>>> b = np.random.rand(5, 14, 3, 2)
# advanced selection works as expected
>>> b[[0,1],[0,1]]
array([[[ 0.7575555 , 0.18989068],
[ 0.06816789, 0.95760398],
[ 0.88358107, 0.19558106]],
[[ 0.62122898, 0.95066355],
[ 0.62947885, 0.00297711],
[ 0.70292323, 0.2109297 ]]])
# doesn't work - why?
>>> b[[0,1],[0,1,2]]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: shape mismatch: objects cannot be broadcast to a single shape
# but this seems to work
>>> b[:,[0,1,2]]
array([[[[ 7.57555496e-01, 1.89890676e-01],
[ 6.81678915e-02, 9.57603975e-01],
[ 8.83581071e-01, 1.95581063e-01]],
[[ 2.24896112e-01, 4.77818599e-01],
[ 4.29313861e-02, 8.61578045e-02],
[ 4.80092364e-01, 3.66821618e-01]],
...
Update
Breaking up the selection seems to resolve the problem, but I am unsure why this is necessary (or if there's a better way to achieve this).
>>> b.shape
(5, 14, 3, 2)
>>> b[[0,1]].shape
(2, 14, 3, 2)
# trying to separate indexing by dimension.
>>> b[[0,1]][:,[0,1,2]]
array([[[[ 0.7575555 , 0.18989068],
[ 0.06816789, 0.95760398],
[ 0.88358107, 0.19558106]],
[[ 0.22489611, 0.4778186 ],
[ 0.04293139, 0.0861578 ],

You want
b[np.ix_([0, 1], [0, 1, 2])]
You also need to do the same thing for b[[0, 1], [0, 1]], because that's not actually doing what you think it is:
b[np.ix_([0, 1], [0, 1])]
The problem here is that advanced indexing does something completely different from what you think it does. You've made the mistake of thinking that b[[0, 1], [0, 1, 2]] means "take all parts b[i, j] of b where i is 0 or 1 and j is 0, 1, or 2". This is a reasonable mistake to make, considering that it seems to work that way when you have one list in the indexing expression, like
b[:, [1, 3, 5], 2]
In fact, for an array A and one-dimensional integer arrays I and J, A[I, J] is an array where
A[I, J][n] == A[I[n], J[n]]
This generalizes in the natural way to more index arrays, so for example
A[I, J, K][n] == A[I[n], J[n], K[n]]
and to higher-dimensional index arrays, so if I and J are two-dimensional, then
A[I, J][m, n] == A[I[m, n], J[m, n]]
It also applies the broadcasting rules to the index arrays, and converts lists in the indexes to arrays. This is much more powerful than what you expected to happen, but it means that to do what you were trying to do, you need something like
b[[[0],
[1]], [[0, 1, 2]]]
np.ix_ is a helper that will do that for you so you don't have to write a dozen brackets.
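To make the difference between paired indexing and the "all combinations" selection concrete, here is a minimal sketch using the same array shape as the question. The seeded generator and the names paired, block and chained are introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)         # seed chosen arbitrarily for the sketch
b = rng.random((5, 14, 3, 2))

# Paired indexing: picks only b[0, 0] and b[1, 1].
paired = b[[0, 1], [0, 1]]
print(paired.shape)                    # (2, 3, 2)

# np.ix_ builds index arrays of shape (2, 1) and (1, 3) that broadcast
# against each other, so every combination of i and j is selected.
block = b[np.ix_([0, 1], [0, 1, 2])]
print(block.shape)                     # (2, 3, 3, 2)

# Same result as indexing one axis at a time, as in the question's update.
chained = b[[0, 1]][:, [0, 1, 2]]
print(np.array_equal(block, chained))  # True
```

Note that paired[1] is the same sub-array as block[1, 1]: the paired form returns only the "diagonal" of the rectangular selection.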

I think you misunderstood the advanced selection syntax for this case. I used your example, just made it smaller to be easier to see.
import numpy as np
b = np.random.rand(5, 4, 3, 2)
# advanced selection works as expected
print(b[[0,1],[0,1]]) # http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
# this picks two elements: b[0, 0] (a 3x2 matrix) and b[1, 1] (another 3x2 matrix)
# doesn't work - why?
#print(b[[0,1],[0,1,2]]) # this doesn't work because [0,1] and [0,1,2] have different lengths and cannot be broadcast together
print(b[[0,1,2],[0,1,2]]) # works
Output:
[[[ 0.27334558 0.90065184]
[ 0.8624593 0.34324983]
[ 0.19574819 0.2825373 ]]
[[ 0.38660087 0.63941692]
[ 0.81522421 0.16661912]
[ 0.81518479 0.78655536]]]
[[[ 0.27334558 0.90065184]
[ 0.8624593 0.34324983]
[ 0.19574819 0.2825373 ]]
[[ 0.38660087 0.63941692]
[ 0.81522421 0.16661912]
[ 0.81518479 0.78655536]]
[[ 0.65336551 0.1435357 ]
[ 0.91380873 0.45225145]
[ 0.57255923 0.7645396 ]]]

Related

Getting rows corresponding to label, for many labels

I have a 2D array, where each row has a label that is stored in a separate array (not necessarily unique). For each label, I want to extract the rows from my 2D array that have this label. A basic working example of what I want would be this:
import numpy as np
data=np.array([[1,2],[3,5],[7,10], [20,32],[0,0]])
label=np.array([1,1,1,0,1])
#very simple approach
label_values=np.unique(label)
res=[]
for la in label_values:
    data_of_this_label_val=data[label==la]
    res+=[data_of_this_label_val]
print(res)
The result (res) can have any format, as long as it is easily accessible. In the above example, it would be
[array([[20, 32]]), array([[ 1, 2],
[ 3, 5],
[ 7, 10],
[ 0, 0]])]
Note that I can easily associate each element in my list to one of the unique labels in label_values (that is, by index).
While this works, using a for loop can take quite a lot of time, especially if my label vector is large. Can this be sped up or coded more elegantly?
You can argsort the labels (which is what unique does under the hood I believe).
If your labels are small nonnegative integers, as in the example, you can get it a bit cheaper; see https://stackoverflow.com/a/53002966/7207392.
>>> import numpy as np
>>>
>>> data=np.array([[1,2],[3,5],[7,10], [20,32],[0,0]])
>>> label=np.array([1,1,1,0,1])
>>>
>>> idx = label.argsort()
# use kind='mergesort' if you require a stable sort, i.e. one that
# preserves the order of equal labels
>>> ls = label[idx]
>>> split = 1 + np.where(ls[1:] != ls[:-1])[0]
>>> np.split(data[idx], split)
[array([[20, 32]]), array([[ 1, 2],
[ 3, 5],
[ 7, 10],
[ 0, 0]])]
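The argsort-and-split steps above can be wrapped in a small reusable function. This is only a sketch: the function name group_rows_by_label is hypothetical, it assumes label is a non-empty 1-D array with one entry per row of data, and it uses a stable sort so that rows keep their original order within each group.

```python
import numpy as np

def group_rows_by_label(data, label):
    """Return (unique_labels, groups), where groups[k] holds the rows labelled unique_labels[k]."""
    idx = np.argsort(label, kind="stable")   # stable: preserves row order within a label
    sorted_labels = label[idx]
    # positions where the sorted label changes mark the group boundaries
    boundaries = 1 + np.flatnonzero(sorted_labels[1:] != sorted_labels[:-1])
    unique_labels = sorted_labels[np.r_[0, boundaries]]
    return unique_labels, np.split(data[idx], boundaries)

data = np.array([[1, 2], [3, 5], [7, 10], [20, 32], [0, 0]])
label = np.array([1, 1, 1, 0, 1])

labels, groups = group_rows_by_label(data, label)
for la, rows in zip(labels, groups):
    print(int(la), rows.tolist())
# 0 [[20, 32]]
# 1 [[1, 2], [3, 5], [7, 10], [0, 0]]
```

Returning the unique labels alongside the groups keeps the association between each group and its label explicit, which the question asked for.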
Unfortunately, there isn't a built-in groupby function in numpy, though you could write alternatives. However, your problem could be solved more succinctly using pandas, if that's available to you:
import pandas as pd
res = pd.DataFrame(data).groupby(label).apply(lambda x: x.values).tolist()
# or, if performance is important, the following will be faster on large arrays,
# but less readable IMO:
res = [data[i] for i in pd.DataFrame(data).groupby(label).groups.values()]
[array([[20, 32]]), array([[ 1, 2],
[ 3, 5],
[ 7, 10],
[ 0, 0]])]

Sum of outer product of corresponding lists in two arrays - NumPy

I am trying to find the numpy matrix operations to get the same result as in the following for loop code. I believe it will be much faster but I am missing some python skills to do it.
It works line by line, each value from a line of x is multiplied by each value of the same line in e and then summed.
The first item of result would be (2*0+2*1+2*4+2*2+2*3)+(0*0+...)+...+(1*0+1*1+1*4+1*2+1*3)=30
Any idea would be much appreciated :).
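import numpy as np
# the rows have different lengths, so recent NumPy versions require dtype=object here
e = np.array([[0,1,4,2,3],[2,0,2,3,0,1]], dtype=object)
x = np.array([[2,0,0,0,1],[0,3,0,0,4,0]], dtype=object)
result = np.zeros(len(x))
for key, j in enumerate(x):
    for jj in j:
        for i in e[key]:
            result[key] += jj*i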
>>> result
Out[1]: array([ 30., 56.])
Those are ragged arrays, as they hold lists of different lengths. So a fully vectorized approach, even if possible, won't be straightforward. Here's one using np.einsum in a list comprehension -
[np.einsum('i,j->',x[n],e[n]) for n in range(len(x))]
Sample run -
In [381]: x
Out[381]: array([[2, 0, 0, 0, 1], [0, 3, 0, 0, 4, 0]], dtype=object)
In [382]: e
Out[382]: array([[0, 1, 4, 2, 3], [2, 0, 2, 3, 0, 1]], dtype=object)
In [383]: [np.einsum('i,j->',x[n],e[n]) for n in range(len(x))]
Out[383]: [30, 56]
If you still want a fully vectorized approach, you could make a regular array by padding the shorter lists with zeros. For the same, here's a post that lists a NumPy based approach to do the filling.
Once we have the regular-shaped arrays as x and e, the final result would simply be -
np.einsum('ik,il->i',x,e)
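A minimal sketch of the padded approach follows. The helper pad_rows is hypothetical (it is not the method from the linked post) and simply fills shorter rows with zeros, which does not change any of the sums.

```python
import numpy as np

def pad_rows(rows, fill=0):
    """Pad a list of 1-D sequences with `fill` into one 2-D float array (hypothetical helper)."""
    width = max(len(r) for r in rows)
    out = np.full((len(rows), width), fill, dtype=float)
    for n, r in enumerate(rows):
        out[n, :len(r)] = r
    return out

e = pad_rows([[0, 1, 4, 2, 3], [2, 0, 2, 3, 0, 1]])
x = pad_rows([[2, 0, 0, 0, 1], [0, 3, 0, 0, 4, 0]])

# Sum over all pairs (k, l) of x[i, k] * e[i, l], separately for each row i.
result = np.einsum('ik,il->i', x, e)
print(result)                           # [30. 56.]

# The sum of an outer product factorizes into the product of the two sums.
shortcut = x.sum(axis=1) * e.sum(axis=1)
print(shortcut)                         # [30. 56.]
```

Because the sum factorizes, the row-sum product gives the same numbers without forming any outer product, which may be the simpler option for large inputs.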
Is this close to what you are looking for?
https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html
It seems like you are trying to get the dot product of matrices.

pythonic way for axis-wise winner-take-all in numpy

I am wondering what the most concise and pythonic way is to keep only the maximum element in each row of a 2D numpy array while setting all other elements to zero. Example:
given the following numpy array:
a = [ [1, 8, 3 ,6],
[5, 5, 60, 1],
[63,9, 9, 23] ]
I want the answer to be:
b = [ [0, 8, 0, 0],
[0, 0, 60, 0],
[63,0, 0, 0 ] ]
I can think of several ways to solve that, but what interests me is whether there are Python functions to do this quickly.
Thank you in advance
You can use np.max to take the maximum along one axis, then use np.where to zero out the non-maximal elements:
np.where(a == a.max(axis=1, keepdims=True), a, 0)
The keepdims=True argument keeps the singleton dimension after taking the max (i.e. so that a.max(1, keepdims=True).shape == (3, 1)), which simplifies broadcasting it against a.
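As a runnable illustration of the np.where approach, here is a sketch that converts the question's data to a NumPy array first. The second half, using argmax, is an addition not taken from the answer: it shows one possible way to keep exactly one entry per row when a row contains its maximum more than once.

```python
import numpy as np

a = np.array([[1, 8, 3, 6],
              [5, 5, 60, 1],
              [63, 9, 9, 23]])

# Keep entries equal to their row maximum, zero out everything else.
b = np.where(a == a.max(axis=1, keepdims=True), a, 0)
print(b.tolist())   # [[0, 8, 0, 0], [0, 0, 60, 0], [63, 0, 0, 0]]

# The comparison above keeps every tied maximum in a row.
# Variant keeping only the first maximum per row, via paired integer indexing:
rows = np.arange(a.shape[0])
cols = a.argmax(axis=1)
c = np.zeros_like(a)
c[rows, cols] = a[rows, cols]
print(c.tolist())   # [[0, 8, 0, 0], [0, 0, 60, 0], [63, 0, 0, 0]]
```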
I don't know exactly what counts as pythonic, so I assume the way that uses the most Python-specific syntax is pythonic.
This uses two nested list comprehensions, which are a Python feature, though it might not be that concise.
b = [[y if y == max(x) else 0 for y in x] for x in a ]

How to select inverse of indexes of a numpy array?

I have a large set of data in which I need to compare the distances of a set of samples from this array with all the other elements of the array. Below is a very simple example of my data set.
import numpy as np
import scipy.spatial.distance as sd
data = np.array(
[[ 0.93825827, 0.26701143],
[ 0.99121108, 0.35582816],
[ 0.90154837, 0.86254049],
[ 0.83149103, 0.42222948],
[ 0.27309625, 0.38925281],
[ 0.06510739, 0.58445673],
[ 0.61469637, 0.05420098],
[ 0.92685408, 0.62715114],
[ 0.22587817, 0.56819403],
[ 0.28400409, 0.21112043]]
)
sample_indexes = [1,2,3]
# I'd rather not make this
other_indexes = list(set(range(len(data))) - set(sample_indexes))
sample_data = data[sample_indexes]
other_data = data[other_indexes]
# compare them
dists = sd.cdist(sample_data, other_data)
Is there a way to index a numpy array for indexes that are NOT the sample indexes? In my above example I make a list called other_indexes. I'd rather not have to do this for various reasons (large data set, threading, a very VERY low amount of memory on the system this is running on etc. etc. etc.). Is there a way to do something like..
other_data = data[ indexes not in sample_indexes]
I read that numpy masks can do this but I tried...
other_data = data[~sample_indexes]
And this gives me an error. Do I have to create a mask?
mask = np.ones(len(data), dtype=bool)  # np.bool was removed from recent NumPy versions; use the builtin bool
mask[sample_indexes] = 0
other_data = data[mask]
Not the most elegant for what perhaps should be a single-line statement, but it's fairly efficient, and the memory overhead is minimal too.
If memory is your prime concern, np.delete would avoid the creation of the mask, and fancy-indexing creates a copy anyway.
On second thought: np.delete does not modify the existing array, so it's pretty much exactly the single-line statement you are looking for.
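A short sketch comparing the two options. The array built with np.arange is only a stand-in for the question's 10x2 data, and the variable names are chosen for this illustration.

```python
import numpy as np

data = np.arange(20).reshape(10, 2)   # stand-in for the question's 10x2 data
sample_indexes = [1, 2, 3]

# Option 1: boolean mask that is False at the sample rows.
mask = np.ones(len(data), dtype=bool)
mask[sample_indexes] = False
other_via_mask = data[mask]

# Option 2: np.delete returns a new array without the given rows.
other_via_delete = np.delete(data, sample_indexes, axis=0)

print(np.array_equal(other_via_mask, other_via_delete))  # True
print(other_via_delete.shape)                            # (7, 2)
print(data.shape)                                        # (10, 2), the original is untouched
```

Passing axis=0 matters here: without it, np.delete flattens the array before removing elements.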
You may want to try in1d
In [5]:
select = np.in1d(range(data.shape[0]), sample_indexes)
In [6]:
print(data[select])
[[ 0.99121108 0.35582816]
[ 0.90154837 0.86254049]
[ 0.83149103 0.42222948]]
In [7]:
print(data[~select])
[[ 0.93825827 0.26701143]
[ 0.27309625 0.38925281]
[ 0.06510739 0.58445673]
[ 0.61469637 0.05420098]
[ 0.92685408 0.62715114]
[ 0.22587817 0.56819403]
[ 0.28400409 0.21112043]]
You may also use setdiff1d:
In [11]: data[np.setdiff1d(np.arange(data.shape[0]), sample_indexes)]
Out[11]:
array([[ 0.93825827, 0.26701143],
[ 0.27309625, 0.38925281],
[ 0.06510739, 0.58445673],
[ 0.61469637, 0.05420098],
[ 0.92685408, 0.62715114],
[ 0.22587817, 0.56819403],
[ 0.28400409, 0.21112043]])
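For newer code, the NumPy documentation recommends np.isin over np.in1d. The sketch below applies the same membership test with np.isin and checks it against the setdiff1d variant. The np.arange array is only a stand-in for the question's data.

```python
import numpy as np

data = np.arange(20).reshape(10, 2)   # stand-in for the question's 10x2 data
sample_indexes = [1, 2, 3]

# True for every row number that appears in sample_indexes.
select = np.isin(np.arange(data.shape[0]), sample_indexes)
sample_data = data[select]
other_data = data[~select]
print(sample_data.shape, other_data.shape)   # (3, 2) (7, 2)

# setdiff1d returns the sorted row numbers that are not in sample_indexes.
other_idx = np.setdiff1d(np.arange(data.shape[0]), sample_indexes)
print(np.array_equal(other_data, data[other_idx]))  # True
```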
I'm not familiar with the specifics on numpy, but here's a general solution. Suppose you have the following list: a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
You create another list of indices you don't want: inds = [1, 3, 6].
Now simply do this: good_data = [x for i, x in enumerate(a) if i not in inds], which filters by position rather than by value and results in good_data = [0, 2, 4, 5, 7, 8, 9].
