pytorch batchwise indexing - python

I am searching for a way to do some batchwise indexing for tensors.
If I have a variable Q of size 1000, I can get the elements I want by
Q[index], where index is a vector of the wanted elements.
Now I would like to do the same for more dimensional tensors.
So suppose Q is of shape n x m and I have a index matrix of shape n x p.
My goal is to get for each of the n rows the specific p elements out of the m elements.
But Q[index] is not working for this situation.
Do you have any thoughts how to handle this?

You can seem to be a simple application of torch.gather which doesn't require any additional reshaping of the data or index tensor:
>>> Q = torch.rand(5, 4)
tensor([[0.8462, 0.3064, 0.2549, 0.2149],
[0.6801, 0.5483, 0.5522, 0.6852],
[0.1587, 0.4144, 0.8843, 0.6108],
[0.5265, 0.8269, 0.8417, 0.6623],
[0.8549, 0.6437, 0.4282, 0.2792]])
>>> index
tensor([[0, 1, 2],
[2, 3, 1],
[0, 1, 2],
[2, 2, 2],
[1, 1, 2]])
The following gather operation applied on dim=1 return a tensor out, such that:
out[i, j] = Q[i, index[i,j]]
This is done with the following call of torch.Tensor.gather on Q:
>>> Q.gather(dim=1, index=index)
tensor([[0.8462, 0.3064, 0.2549],
[0.5522, 0.6852, 0.5483],
[0.1587, 0.4144, 0.8843],
[0.8417, 0.8417, 0.8417],
[0.6437, 0.6437, 0.4282]])

Related

PyTorch: Dynamic Programming as Tensor Operation

Is it possible to get the following loop with a Tensor operation?
a = torch.Tensor([1, 0, 0, 0])
b = torch.Tensor([1, 2, 3, 4])
for i in range(1, a.shape[0]):
a[i] = b[i] + a[i-1]
print(a) # [1, 3, 6, 10]
The operation depends on the previous values in a and the values that are computed on the way (in a dynamic programming fashion).
Is it possible to get this type of sequential computation with a tensor operation?
You can achieve this with a cumulative sum:
b.cumsum(0)

What is a best way to intersect multiple arrays with numpy array?

Suppose I have an example of numpy array:
import numpy as np
X = np.array([2,5,0,4,3,1])
And I also have a list of arrays, like:
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]
I want to leave only these items of each list that are also in X. I expect also to do it in a most efficient/common way.
Solution I have tried so far:
Sort X using X.sort().
Find locations of items of each array in X using:
locations = [np.searchsorted(X, n) for n in A]
Leave only proper ones:
masks = [X[locations[i]] == A[i] for i in range(len(A))]
result = [A[i][masks[i]] for i in range(len(A))]
But it doesn't work because locations of third array is out of bounds:
locations = [array([0, 0, 2], dtype=int64), array([0, 1, 2, 3, 4, 5], dtype=int64), array([2, 5, 4, 6], dtype=int64)]
How to solve this issue?
Update
I ended up with idx[idx==len(Xs)] = 0 solution. I've also noticed two different approaches posted between the answers: transforming X into set vs np.sort. Both of them has plusses and minuses: set operations uses iterations which is quite slow in compare with numpy methods; however np.searchsorted speed increases logarithmically unlike acceses of set items which is instant. That why I decided to compare performance using data with huge sizes, especially 1 million items for X, A[0], A[1], A[2].
One idea would be less compute and minimal work when looping. So, here's one with those in mind -
a = np.concatenate(A)
m = np.isin(a,X)
l = np.array(list(map(len,A)))
a_m = a[m]
cut_idx = np.r_[0,l.cumsum()]
l_m = np.add.reduceat(m,cut_idx[:-1])
cl_m = np.r_[0,l_m.cumsum()]
out = [a_m[i:j] for (i,j) in zip(cl_m[:-1],cl_m[1:])]
Alternative #1 :
We can also use np.searchsorted to get the isin mask, like so -
Xs = np.sort(X)
idx = np.searchsorted(Xs,a)
idx[idx==len(Xs)] = 0
m = Xs[idx]==a
Another way with np.intersect1d
If you are looking for the most common/elegant one, think it would be with np.intersect1d -
In [43]: [np.intersect1d(X,A_i) for A_i in A]
Out[43]: [array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 4, 5])]
Solving your issue
You can also solve your out-of-bounds issue, with a simple fix -
for l in locations:
l[l==len(X)]=0
How about this, very simple and efficent:
import numpy as np
X = np.array([2,5,0,4,3,1])
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]
X_set = set(X)
A = [np.array([a for a in arr if a in X_set]) for arr in A]
#[array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 5, 4])]
According to the docs, set operations all have O(1) complexity, therefore the overall is O(N)

Sort invariant for numpy.argsort with multiple dimensions

numpy.argsort docs state
Returns:
index_array : ndarray, int
Array of indices that sort a along the specified axis. If a is one-dimensional, a[index_array] yields a sorted a.
How can I apply the result of numpy.argsort for a multidimensional array to get back a sorted array? (NOT just a 1-D or 2-D array; it could be an N-dimensional array where N is known only at runtime)
>>> import numpy as np
>>> np.random.seed(123)
>>> A = np.random.randn(3,2)
>>> A
array([[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471],
[-0.57860025, 1.65143654]])
>>> i=np.argsort(A,axis=-1)
>>> A[i]
array([[[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471]],
[[ 0.2829785 , -1.50629471],
[-1.0856306 , 0.99734545]],
[[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471]]])
For me it's not just a matter of using sort() instead; I have another array B and I want to order B using the results of np.argsort(A) along the appropriate axis. Consider the following example:
>>> A = np.array([[3,2,1],[4,0,6]])
>>> B = np.array([[3,1,4],[1,5,9]])
>>> i = np.argsort(A,axis=-1)
>>> BsortA = ???
# should result in [[4,1,3],[5,1,9]]
# so that corresponding elements of B and sort(A) stay together
It looks like this functionality is already an enhancement request in numpy.
The numpy issue #8708 has a sample implementation of take_along_axis that does what I need; I'm not sure if it's efficient for large arrays but it seems to work.
def take_along_axis(arr, ind, axis):
"""
... here means a "pack" of dimensions, possibly empty
arr: array_like of shape (A..., M, B...)
source array
ind: array_like of shape (A..., K..., B...)
indices to take along each 1d slice of `arr`
axis: int
index of the axis with dimension M
out: array_like of shape (A..., K..., B...)
out[a..., k..., b...] = arr[a..., inds[a..., k..., b...], b...]
"""
if axis < 0:
if axis >= -arr.ndim:
axis += arr.ndim
else:
raise IndexError('axis out of range')
ind_shape = (1,) * ind.ndim
ins_ndim = ind.ndim - (arr.ndim - 1) #inserted dimensions
dest_dims = list(range(axis)) + [None] + list(range(axis+ins_ndim, ind.ndim))
# could also call np.ix_ here with some dummy arguments, then throw those results away
inds = []
for dim, n in zip(dest_dims, arr.shape):
if dim is None:
inds.append(ind)
else:
ind_shape_dim = ind_shape[:dim] + (-1,) + ind_shape[dim+1:]
inds.append(np.arange(n).reshape(ind_shape_dim))
return arr[tuple(inds)]
which yields
>>> A = np.array([[3,2,1],[4,0,6]])
>>> B = np.array([[3,1,4],[1,5,9]])
>>> i = A.argsort(axis=-1)
>>> take_along_axis(A,i,axis=-1)
array([[1, 2, 3],
[0, 4, 6]])
>>> take_along_axis(B,i,axis=-1)
array([[4, 1, 3],
[5, 1, 9]])
This argsort produces a (3,2) array
In [453]: idx=np.argsort(A,axis=-1)
In [454]: idx
Out[454]:
array([[0, 1],
[1, 0],
[0, 1]], dtype=int32)
As you note applying this to A to get the equivalent of np.sort(A, axis=-1) isn't obvious. The iterative solution is sort each row (a 1d case) with:
In [459]: np.array([x[i] for i,x in zip(idx,A)])
Out[459]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
While probably not the fastest, it is probably the clearest solution, and a good starting point for conceptualizing a better solution.
The tuple(inds) from the take solution is:
(array([[0],
[1],
[2]]),
array([[0, 1],
[1, 0],
[0, 1]], dtype=int32))
In [470]: A[_]
Out[470]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
In other words:
In [472]: A[np.arange(3)[:,None], idx]
Out[472]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
The first part is what np.ix_ would construct, but it does not 'like' the 2d idx.
Looks like I explored this topic a couple of years ago
argsort for a multidimensional ndarray
a[np.arange(np.shape(a)[0])[:,np.newaxis], np.argsort(a)]
I tried to explain what is going on. The take function does the same sort of thing, but constructs the indexing tuple for a more general case (dimensions and axis). Generalizing to more dimensions, but still with axis=-1 should be easy.
For the first axis, A[np.argsort(A,axis=0),np.arange(2)] works.
We just need to use advanced-indexing to index along all axes with those indices array. We can use np.ogrid to create open grids of range arrays along all axes and then replace only for the input axis with the input indices. Finally, index into data array with those indices for the desired output. Thus, essentially, we would have -
# Inputs : arr, ind, axis
idx = np.ogrid[tuple(map(slice, ind.shape))]
idx[axis] = ind
out = arr[tuple(idx)]
Just to make it functional and do error checks, let's create two functions - One to get those indices and second one to feed in the data array and simply index. The idea with the first function is to get the indices that could be re-used for indexing into any arbitrary array which would support the necessary number of dimensions and lengths along each axis.
Hence, the implementations would be -
def advindex_allaxes(ind, axis):
axis = np.core.multiarray.normalize_axis_index(axis,ind.ndim)
idx = np.ogrid[tuple(map(slice, ind.shape))]
idx[axis] = ind
return tuple(idx)
def take_along_axis(arr, ind, axis):
return arr[advindex_allaxes(ind, axis)]
Sample runs -
In [161]: A = np.array([[3,2,1],[4,0,6]])
In [162]: B = np.array([[3,1,4],[1,5,9]])
In [163]: i = A.argsort(axis=-1)
In [164]: take_along_axis(A,i,axis=-1)
Out[164]:
array([[1, 2, 3],
[0, 4, 6]])
In [165]: take_along_axis(B,i,axis=-1)
Out[165]:
array([[4, 1, 3],
[5, 1, 9]])
Relevant one.

InvalidArgumentError TensorFlow sparse_to_dense

I have a sparse tensor that I'm building up from a collection of indices and values. I'm trying to implement some code to take a full row slice. Although this functionality doesn't appear to be directly supported in TensorFlow, it seems that there are work arounds that can return the indices and values for a specified row as follows:
def sparse_slice(self, indices, values, needed_row_ids):
needed_row_ids = tf.reshape(needed_row_ids, [1, -1])
num_rows = tf.shape(indices)[0]
partitions = tf.cast(tf.reduce_any(tf.equal(tf.reshape(indices[:, 0], [-1, 1]), needed_row_ids), 1),
tf.int32)
rows_to_gather = tf.dynamic_partition(tf.range(num_rows), partitions, 2)[1]
slice_indices = tf.gather(indices, rows_to_gather)
slice_values = tf.gather(values, rows_to_gather)
return slice_indices, slice_values
Then called directly like so on a sparse 4x4 matrix where I am interested in accessing all of the elements in row 3:
with tf.Session().as_default():
indices = tf.constant([[0, 0], [1, 0], [2, 0], [2, 1], [3, 0], [3, 3]])
values = tf.constant([10, 19, 1, 1, 6, 5], dtype=tf.int64)
needed_row_ids = tf.constant([3])
slice_indices, slice_values = self.sparse_slice(indices, values, needed_row_ids)
print('indicies: {} and rows: {}'.format(slice_indices.eval(), slice_values.eval()))
Which outputs the following:
indicies: [[3 0]
[3 3]] and rows: [6 5]
So far so good, I then figure I can use this information to construct a 1x4 dense tensor with the values at the indexes and 0s for the missing columns.
dense_representation = tf.sparse_to_dense(sparse_values=slice_values, sparse_indices=slice_indices,
output_shape=(1,4))
However the moment I run the tensor in a session.
sess = tf.Session()
sess.run(dense_representation)
I receive the following exception:
InvalidArgumentError (see above for traceback): indices[0] = [3,0] is out of bounds: need 0 <= index < [1,4]
[[Node: SparseToDense = SparseToDense[T=DT_INT64, Tindices=DT_INT32, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](Gather_2, SparseToDense/output_shape, Gather_3, SparseToDense/default_value)]]
I'm not quite sure what I'm doing wrong or if this has something to do with the output_shape not being properly formed. Essentially I'd like to stuff this all back into a 1 x 4 vector. I haven't been able to find any good examples online for how to do this. Any help would be appreciated.

Indexing from an ndimensional array - numpy/ python

I am working with matrices of (x,y,z) dimensions, and would like to index numerous values from this matrix simultaneously.
ie. if the index A[0,0,0] = 5
and A[1,1,1] = 10
A[[1,1,1], [5,5,5]] = [5, 10]
however indexing like this seems to return huge chunks of the matrix.
Does anyone know how I can accomplish this? I have a large array of indices (n, x, y, z) that i need to use to index from A)
Thanks
You are trying to use 1 as the first index 3 times and 5 as the index into the second dimension (again three times). This will give you the element at A[1,5,:] repeated three times.
A = np.random.rand(6,6,6);
B = A[[1,1,1], [5,5,5]]
# [[ 0.17135991, 0.80554887, 0.38614418, 0.55439258, 0.66504806, 0.33300839],
# [ 0.17135991, 0.80554887, 0.38614418, 0.55439258, 0.66504806, 0.33300839],
# [ 0.17135991, 0.80554887, 0.38614418, 0.55439258, 0.66504806, 0.33300839]]
B.shape
# (3, 6)
Instead, you will want to specify [1,5] for each axis of your matrix.
A[[1,5], [1,5], [1,5]] = [5, 10]
Advanced indexing works like this:
A[I, J, K][n] == A[I[n], J[n], K[n]]
with A, I, J, and K all arrays. That's not the full, general rule, but it's what the rules simplify down to for what you need.
For example, if you want output[0] == A[0, 0, 0] and output[1] == A[1, 1, 1], then your I, J, and K arrays should look like np.array([0, 1]). Lists also work:
A[[0, 1], [0, 1], [0, 1]]

Categories

Resources