Combining slicing and broadcasted indexing for multi-dimensional numpy arrays


I have an N-D numpy array (say 3x3x3, for instance) from which I'd like to extract a sub-array, combining slices and index arrays. For instance:
import numpy as np
A = np.arange(3*3*3).reshape((3,3,3))
i0, i1, i2 = ([0,1], [0,1,2], [0,2])
ind1 = j0, j1, j2 = np.ix_(i0, i1, i2)
ind2 = (j0, slice(None), j2)
B1 = A[ind1]
B2 = A[ind2]
I would expect B1 == B2, but the shapes are actually different:
>>> B1.shape
(2, 3, 2)
>>> B2.shape
(2, 1, 2, 3)
>>> B1
array([[[ 0, 2],
[ 3, 5],
[ 6, 8]],
[[ 9, 11],
[12, 14],
[15, 17]]])
>>> B2
array([[[[ 0, 3, 6],
[ 2, 5, 8]]],
[[[ 9, 12, 15],
[11, 14, 17]]]])
Does anyone understand why? Any idea how I could get 'B1' by manipulating only the 'A' and 'ind2' objects? The goal is that it would work for any nD array, and that I would not have to look up the shape of the dimensions I want to keep entirely (I hope I'm clear enough :)). Thanks!!
---EDIT---
To be clearer, I would like to have a function 'fun' such that
A[fun(ind2)] == B1

This is the closest I can get to your specs. I haven't been able to devise a solution that can compute the correct indices without knowing A (or, more precisely, its shape...).
import numpy as np
def index(A, s):
    # Build an np.ix_ index from a string such as "0,1;:;0,2"
    ind = []
    groups = s.split(';')
    for i, group in enumerate(groups):
        if group == ":":
            # a full slice becomes the whole range of that axis
            ind.append(range(A.shape[i]))
        else:
            ind.append([int(n) for n in group.split(',')])
    return np.ix_(*ind)
A = np.arange(3*3*3).reshape((3,3,3))
ind2 = index(A,"0,1;:;0,2")
print(A[ind2])
A shorter version
def index2(A,s):return np.ix_(*[range(A.shape[i])if g==":"else[int(n)for n in g.split(',')]for i,g in enumerate(s.split(';'))])
ind3 = index2(A,"0,1;:;0,2")
print(A[ind3])

The indexing subspaces of ind1 are (2,),(3,),(2,), and the resulting B is (2,3,2). This is a simple case of advanced indexing.
ind2 is a case of (advanced) partial indexing. There are 2 indexed arrays, and 1 slice. The advanced indexing documentation states:
If the indexing subspaces are separated (by slice objects), then the broadcasted indexing space is first, followed by the sliced subspace of x.
In this case advanced indexing broadcasts the 1st and 3rd indexes, which ix_ produced with shapes (2,1,1) and (1,1,2), into a (2,1,2) indexing space, and appends the slice dimension at the end, resulting in a (2,1,2,3) array.
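The placement rule quoted above can be checked on the question's 3x3x3 array. This is only a sketch; the names rows and cols are illustrative and do not appear in the post:

```python
import numpy as np

A = np.arange(27).reshape(3, 3, 3)
rows = np.array([0, 1])
cols = np.array([0, 2])

# Advanced indexes separated by a slice: broadcast dims come first, slice last
separated = A[rows[:, None], :, cols[None, :]]
print(separated.shape)  # (2, 2, 3)

# Advanced indexes next to each other: broadcast dims stay where they are
adjacent = A[:, rows[:, None], cols[None, :]]
print(adjacent.shape)  # (3, 2, 2)
```

In the separated case, element [a, b, s] of the result is A[rows[a], s, cols[b]], which is why the sliced axis ends up last.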
I explain the reasoning in more detail in https://stackoverflow.com/a/27097133/901925
A way to fix a tuple like ind2, is to expand each slice into an array. I recently saw this done in np.insert.
np.arange(*ind2[1].indices(3))
expands : to [0,1,2]. But the replacement has to have the right shape.
ind=list(ind2)
ind[1]=np.arange(*ind2[1].indices(3)).reshape(1,-1,1)
A[tuple(ind)] # index with a tuple; current numpy no longer treats a list as one index per axis
I'm leaving off the details of determining which term is a slice, its dimension, and the relevant reshape. The goal is to reproduce j1.
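Those details could be filled in roughly as follows. This is a sketch, not a general solution: expand_slices is a hypothetical helper name, and it assumes the index tuple has one entry per axis of A and that the non-slice entries came from np.ix_, so each array varies along its own axis only.

```python
import numpy as np

def expand_slices(shape, index):
    """Replace every slice in an ix_-style index tuple by an open-mesh array."""
    ndim = len(index)
    expanded = []
    for axis, item in enumerate(index):
        if isinstance(item, slice):
            # slice.indices() resolves None bounds against the axis length
            values = np.arange(*item.indices(shape[axis]))
            new_shape = [1] * ndim
            new_shape[axis] = -1          # vary along this axis only
            item = values.reshape(new_shape)
        expanded.append(item)
    return tuple(expanded)

A = np.arange(27).reshape(3, 3, 3)
j0, j1, j2 = np.ix_([0, 1], [0, 1, 2], [0, 2])
ind2 = (j0, slice(None), j2)

B1 = A[j0, j1, j2]
B_fixed = A[expand_slices(A.shape, ind2)]
print(B_fixed.shape)                 # (2, 3, 2)
print(np.array_equal(B_fixed, B1))   # True
```

The helper still needs the shape of A, which matches the earlier remark that the indices cannot be computed without it.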
If indices were generated by something other than ix_, reshaping this slice could be more difficult. For example
A[np.array([0,1])[None,:,None],:,np.array([0,2])[None,None,:]] # (1,2,2,3)
A[np.array([0,1])[None,:,None],np.array([0,1,2])[:,None,None],np.array([0,2])[None,None,:]]
# (3,2,2)
The expanded slice has to be compatible with the other arrays under broadcasting.
Swapping axes after indexing is another option. The logic, though, might be more complex.
But in some cases transposing might actually be simpler:
A[np.array([0,1])[:,None],:,np.array([0,2])[None,:]].transpose(2,0,1)
# (3,2,2)
A[np.array([0,1])[:,None],:,np.array([0,2])[None,:]].transpose(0,2,1)
# (2, 3, 2)

In restricted indexing cases like this using ix_, it is possible to do the indexing in successive steps.
A[ind1]
is the same as
A[i0][:,i1][:,:,i2]
and since i1 is the full range,
A[i0][...,i2]
If you only have ind2 available
A[ind2[0].flatten()][...,ind2[2].flatten()]
In more general contexts you have to know how j0,j1,j2 broadcast with each other, but when they are generated by ix_, the relationship is simple.
I can imagine circumstances in which it would be convenient to assign A1 = A[i0], followed by a variety of actions involving A1, including, but not limited to A1[...,i2]. You have to be aware of when A1 is a view, and when it is a copy.
Another indexing tool is take:
A.take(i0,axis=0).take(i2,axis=2)
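A short check that these routes agree, using the question's names i0, i1, i2 for the three index lists. It is a sketch on the 3x3x3 example only:

```python
import numpy as np

A = np.arange(27).reshape(3, 3, 3)
i0, i1, i2 = [0, 1], [0, 1, 2], [0, 2]
B1 = A[np.ix_(i0, i1, i2)]

step_by_step = A[i0][:, i1][:, :, i2]    # one axis at a time
skip_full_axis = A[i0][..., i2]          # i1 covers the whole axis, so skip it
with_take = A.take(i0, axis=0).take(i2, axis=2)

print(np.array_equal(step_by_step, B1))    # True
print(np.array_equal(skip_full_axis, B1))  # True
print(np.array_equal(with_take, B1))       # True

# Indexing with a list makes a copy; a basic slice gives a view
print(np.shares_memory(A, A[i0]))    # False
print(np.shares_memory(A, A[0:2]))   # True
```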

Related

Get a vector from a matrix and a vector of indices in numpy

I have a matrix m = [[1,2,3],[4,5,6],[7,8,9]] and a vector v=[1,2,0] that contains the index of the row I want to return for each column of my matrix.
The result I expect is r=[4,8,3], but I cannot find out how to get this result using numpy.
By applying the vector as the row index, with one entry per column, I get this: m[v,[0,1,2]] = [4, 8, 3], which is roughly what I'm after.
To avoid hardcoding the columns, I'm using np.arange(m.shape[1]), and my final formula looks like r=m[v,np.arange(m.shape[1])]
This looks weird to me and a little complicated for something that should be quite common.
Is there a clean way to get such a result?

In [157]: m = np.array([[1,2,3],[4,5,6],[7,8,9]]);v=np.array([1,2,0])
In [158]: m
Out[158]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [159]: v
Out[159]: array([1, 2, 0])
In [160]: m[v,np.arange(3)]
Out[160]: array([4, 8, 3])
We are choosing 3 elements, with indices (1,0),(2,1),(0,2).
Closer to the MATLAB approach:
In [162]: np.ravel_multi_index((v,np.arange(3)),(3,3))
Out[162]: array([3, 7, 2])
In [163]: m.flat[_]
Out[163]: array([4, 8, 3])
Octave/MATLAB equivalent
>> m = [1 2 3;4 5 6;7 8 9];
>> v = [2 3 1]
v =
2 3 1
>> sub2ind([3,3],v,[1 2 3])
ans =
2 6 7
>> m(sub2ind([3,3],v,[1 2 3]))
ans =
4 8 3
The same broadcasting is used to access a block, as illustrated in this recent question:
Is there a way in Python to get a sub matrix as in Matlab?

Well, this 'weird/complicated' thing is actually mentioned as a "straight forward" scenario in the documentation of Integer array indexing, which is a sub-topic under the broader topic of "Advanced Indexing".
To quote some extract:
When the index consists of as many integer arrays as the array being
indexed has dimensions, the indexing is straight forward, but
different from slicing. Advanced indexes always are broadcast and iterated as one. Note that the result shape is identical to the (broadcast) indexing array shapes
If it makes it seem any less complicated/weird, you could use range(m.shape[1]) instead of np.arange(m.shape[1]). It just needs to be any array or array-like structure.
Visualization / Intuition:
When I was learning this (integer array indexing), it helped me to visualize things in the following way:
I visualized the indexing arrays standing side-by-side, all having exactly the same shape (perhaps as a consequence of getting broadcasted together). I also visualized the result array, which also has the same shape as the indexing arrays. In each of these indexing arrays and the result array, I visualized a monkey, capable of doing a walk-through of its own array, hopping to successive elements of its own array. Note that, in general, this identical shape of the indexing arrays and the result array, can be n-dimensional, and this identical shape can be very different from the shape of the source array whose values are actually being indexed.
In your own example, the source array m has shape (3,3), and the indexing arrays and the result array each have a shape of (3,).
In your example, there is a monkey in each of those three arrays (the two indexing arrays and the result array). We then visualize the monkeys doing a walk-through of their respective array elements in tandem. Here, "in tandem" means all three monkeys start at the first element of their respective arrays, and whenever a monkey hops to the next element of its own array, the other monkeys also hop to the next element in their respective arrays. As it hops to each successive element, the monkey in each indexing array calls out the value of the element it has just visited. The monkey in the result array hops in tandem with the monkeys in the indexing arrays. It hears the values being called out, uses those values as indices into the source array m, and thus determines the value to be picked from m. It picks up this value from m and stores it in the result array, at the location it has just hopped to. Thus, for example, when all three monkeys are at the second element of their respective arrays, the indexing monkeys call out 2 and 1, so m[2,1] = 8 is stored at the second position of the result array.
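The walk-through described above can be written out as an explicit loop. This is a sketch for intuition only (numpy does not execute Python loops internally); rows_b, cols_b and result are illustrative names:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rows = np.array([1, 2, 0])
cols = np.arange(m.shape[1])

# The indexing arrays are first broadcast to one common shape
rows_b, cols_b = np.broadcast_arrays(rows, cols)
result = np.empty(rows_b.shape, dtype=m.dtype)

# Visit every position of that common shape "in tandem"
for pos in np.ndindex(rows_b.shape):
    result[pos] = m[rows_b[pos], cols_b[pos]]

print(result)                                 # [4 8 3]
print(np.array_equal(result, m[rows, cols]))  # True
```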

As stated by the numpy documentation, I think the way you mentioned is the standard way to do this task:
Example
From each row, a specific element should be selected. The row index is just [0, 1, 2] and the column index specifies the element to choose for the corresponding row, here [0, 1, 0]. Using both together the task can be solved using advanced indexing:
x = np.array([[1, 2], [3, 4], [5, 6]])
x[[0, 1, 2], [0, 1, 0]]
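If the explicit arange feels clumsy, np.take_along_axis may be an alternative worth considering. It is not mentioned in the answers above, so treat this as a sketch; it requires the index array to have the same number of dimensions as the source array, hence the added axis:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
v = np.array([1, 2, 0])

# One row index per column: pair v with the column numbers
fancy = m[v, np.arange(m.shape[1])]
# Same selection; v[None, :] has shape (1, 3), so the result has shape (1, 3)
along = np.take_along_axis(m, v[None, :], axis=0)[0]
print(fancy, along)  # [4 8 3] [4 8 3]

# One column index per row, as in the documentation example
x = np.array([[1, 2], [3, 4], [5, 6]])
col = np.array([0, 1, 0])
print(x[np.arange(len(col)), col])                       # [1 4 5]
print(np.take_along_axis(x, col[:, None], axis=1)[:, 0])  # [1 4 5]
```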

Divide numpy array into multiple arrays using indices array (Python)

I have an array:
a = [1, 3, 5, 7, 29 ... 5030, 6000]
This array gets created from a previous process, and the length of the array could be different (it is depending on user input).
I also have an array:
b = [3, 15, 67, 78, 138]
(Which could also be completely different)
I want to use the array b to slice the array a into multiple arrays.
More specifically, I want the result arrays to be:
array1 = a[:3]
array2 = a[3:15]
...
arrayn = a[138:]
Where n = len(b) + 1.
My first thought was to create a 2D array slices with dimension (len(b), something). However we don't know this something beforehand so I assigned it the value len(a) as that is the maximum amount of numbers that it could contain.
I have this code:
slices = np.zeros((len(b), len(a)))
for i in range(1, len(b)):
    slices[i] = a[b[i-1]:b[i]]
But I get this error:
ValueError: could not broadcast input array from shape (518) into shape (2253412)

You can use numpy.split:
np.split(a, b)
Example:
np.split(np.arange(10), [3,5])
# [array([0, 1, 2]), array([3, 4]), array([5, 6, 7, 8, 9])]
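Applied to split points like those in the question, this could look as follows. The contents of a are made up for the sketch (the question only shows a fragment of the real data):

```python
import numpy as np

a = np.arange(200)            # stand-in data, not the question's array
b = [3, 15, 67, 78, 138]

pieces = np.split(a, b)       # a list of arrays of different lengths
print(len(pieces))                # 6, i.e. len(b) + 1
print([len(p) for p in pieces])   # [3, 12, 52, 11, 60, 62]
print(np.array_equal(pieces[1], a[3:15]))   # True
print(np.array_equal(pieces[-1], a[138:]))  # True
```

The result is a plain Python list, which avoids having to pad the pieces into one rectangular 2D array.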

b.insert(0,0)
result = []
for i in range(1,len(b)):
    sub_list = a[b[i-1]:b[i]]
    result.append(sub_list)
result.append(a[b[-1]:])

You are getting the error because you are attempting to create a ragged array. This is not allowed in numpy.
An improvement on @Bohdan's answer:
from itertools import zip_longest
result = [a[start:end] for start, end in zip_longest(np.r_[0, b], b)]
The trick here is that zip_longest makes the final slice go from b[-1] to None, which is equivalent to a[b[-1]:], removing the need for special processing of the last element.
Please do not select this. This is just a thing I added for fun. The "correct" answer is @Psidom's answer.

Most efficient way to implement numpy.in1d for multiple arrays

What is the best way to implement a function which takes an arbitrary number of 1d arrays and returns a tuple containing the indices of the matching values (if any)?
Here is some pseudo-code of what I want to do:
a = np.array([1, 0, 4, 3, 2])
b = np.array([1, 2, 3, 4, 5])
c = np.array([4, 2])
(ind_a, ind_b, ind_c) = return_equals(a, b, c)
# ind_a = [2, 4]
# ind_b = [1, 3]
# ind_c = [0, 1]
(ind_a, ind_b, ind_c) = return_equals(a, b, c, sorted_by=a)
# ind_a = [2, 4]
# ind_b = [3, 1]
# ind_c = [0, 1]
def return_equals(*args, sorted_by=None):
    ...

You can use numpy.intersect1d with reduce for this:
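from functools import reduce  # reduce is not a builtin in Python 3
def return_equals(*arrays):
    matched = reduce(np.intersect1d, arrays)
    # return a list, since the per-array index arrays can differ in length
    return [np.where(np.in1d(array, matched))[0] for array in arrays]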
reduce may be a little slow here because it creates intermediate NumPy arrays (for a large number of inputs it may be very slow). We can avoid this by using Python's set and its .intersection() method:
matched = np.array(list(set(arrays[0]).intersection(*arrays[1:])))
Related GitHub ticket: n-array versions of set operations, especially intersect1d
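The question's sorted_by option is not covered above. One possible reading is that matches should be reported in the order in which they appear in the reference array. The sketch below implements that reading; it assumes every input holds unique values, and it uses np.isin, the newer spelling of np.in1d:

```python
from functools import reduce
import numpy as np

def return_equals(*arrays, sorted_by=None):
    arrays = [np.asarray(arr) for arr in arrays]
    common = reduce(np.intersect1d, arrays)        # sorted common values
    if sorted_by is None:
        # indices in ascending order within each array
        return [np.flatnonzero(np.isin(arr, common)) for arr in arrays]
    ref = np.asarray(sorted_by)
    common = ref[np.isin(ref, common)]             # order of appearance in ref
    result = []
    for arr in arrays:
        order = np.argsort(arr, kind="stable")
        # position of each common value in the sorted view, mapped back
        result.append(order[np.searchsorted(arr, common, sorter=order)])
    return result

a = np.array([1, 0, 4, 3, 2])
b = np.array([1, 2, 3, 4, 5])
c = np.array([4, 2])

print([idx.tolist() for idx in return_equals(a, b, c)])
# [[2, 4], [1, 3], [0, 1]]
print([idx.tolist() for idx in return_equals(a, b, c, sorted_by=a)])
# [[2, 4], [3, 1], [0, 1]]
```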

This solution concatenates all input 1D arrays into one big 1D array so that the required operations can be performed in a vectorized manner. The only loop is at the start, where it gets the lengths of the input arrays, so its runtime cost should be minimal. Note that it assumes each input array has no repeated values.
Here's the function implementation -
import numpy as np
def return_equals(*argv):
    # Concatenate input arrays into one big array for vectorized processing
    A = np.concatenate((argv[:]))
    # lengths of input arrays
    narr = len(argv)
    lens = np.zeros((1,narr),int).ravel()
    for i in range(narr):
        lens[i] = len(argv[i])
    N = A.size
    # Start indices of each group of identical elements from different input arrays
    # in a sorted version of the huge concatenated input array
    start_idx = np.where(np.append([True],np.diff(np.sort(A))!=0))[0]
    # Runlengths of islands of identical elements
    runlens = np.diff(np.append(start_idx,N))
    # Starting and all indices of the positions in concatenate array that has
    # islands of identical elements which are present across all input arrays
    good_start_idx = start_idx[runlens==narr]
    good_all_idx = good_start_idx[:,None] + np.arange(narr)
    # Get offsetted indices and sort them to get the desired output.
    # A stable sort keeps equal values in input-array order, which the
    # per-column offset subtraction relies on.
    idx = np.argsort(A,kind='stable')[good_all_idx] - np.append([0],lens[:-1].cumsum())
    return np.sort(idx.T,1)

In Python:
def return_equal(*args):
    rtr=[]
    for i, arr in enumerate(args):
        rtr.append([j for j, e in enumerate(arr) if
                    all(e in a for a in args[0:i]) and
                    all(e in a for a in args[i+1:])])
    return rtr
>>> return_equal(a,b,c)
[[2, 4], [1, 3], [0, 1]]

For a start, I'd try:
def return_equals(*args):
    x=[]
    c=args[-1]
    for a in args:
        x.append(np.nonzero(np.in1d(a,c))[0])
    return x
If I add a d=np.array([1,0,4,3,0]) (it has only 1 match; what if there are no matches?)
then
return_equals(a,b,d,c)
produces:
[array([2, 4], dtype=int32),
array([1, 3], dtype=int32),
array([2], dtype=int32),
array([0, 1], dtype=int32)]
Since the length of both input and returned arrays can differ, you really can't vectorize the problem. That is, it takes some special gymnastics to perform the operation across all inputs at once. And if the number of arrays is small compared to their typical length, I wouldn't worry about speed. Iterating a few times is not expensive. It's iterating over a 100 values that's expensive.
You could, of course, pass the keyword arguments on to in1d.
It's not clear what you are trying to do with the sorted_by parameter. Is that something that you could just as easily apply to the arrays before you pass them to this function?
List comprehension version of this iteration:
[np.nonzero(np.in1d(x,c))[0] for x in [a,b,d,c]]
I can imagine concatenating the arrays into one longer one, applying in1d, and then splitting it up into subarrays. There is a np.split, but it requires that you tell it where to split, i.e. the running total of elements that go into the sublists. That means, somehow, determining how many matches there are for each argument. Doing that without looping could be tricky.
The pieces for this (that still need to be packed as function) are:
args=[a,b,d,c]
lens=[len(x) for x in args]
abc=np.concatenate(args)
C=np.cumsum(lens)
I=np.nonzero(np.in1d(abc,c))[0]
S=np.split(I,(2,4,5))
[S[0],S[1]-C[0],S[2]-C[1],S[3]-C[2]]
I
# array([ 2, 4, 6, 8, 12, 15, 16], dtype=int32)
C
# array([ 5, 10, 15, 17], dtype=int32)
The (2,4,5) are the split points: the cumulative number of elements of I that fall below each successive value of C, i.e. the running total of the number of elements that match for each of a,b,...
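One way to get those split points without hardcoding them might be np.searchsorted, since I is already sorted. This is a sketch that packs the pieces above together; split_points, starts and result are illustrative names, and np.isin stands in for np.in1d:

```python
import numpy as np

a = np.array([1, 0, 4, 3, 2])
b = np.array([1, 2, 3, 4, 5])
d = np.array([1, 0, 4, 3, 0])
c = np.array([4, 2])

args = [a, b, d, c]
lens = [len(x) for x in args]
abc = np.concatenate(args)
C = np.cumsum(lens)                   # end position of each input inside abc
I = np.flatnonzero(np.isin(abc, c))   # matches in the concatenated array

# How many entries of I lie before each boundary between inputs
split_points = np.searchsorted(I, C[:-1])
S = np.split(I, split_points)

# Shift each group back to indices relative to its own input
starts = np.r_[0, C[:-1]]
result = [s - start for s, start in zip(S, starts)]

print(split_points.tolist())          # [2, 4, 5]
print([r.tolist() for r in result])   # [[2, 4], [1, 3], [2], [0, 1]]
```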

How does `numpy.einsum` work?

The correct way of writing a summation in terms of Einstein summation is a puzzle to me, so I want to try it in my code. I have succeeded in a few cases but mostly with trial and error.
Now there is a case that I cannot figure out. First, a basic question. For two matrices A and B that are Nx1 and 1xN, respectively, AB is NxN but BA is 1x1. When I want to calculate the NxN case with np.einsum I can do:
import numpy as np
a = np.asarray([[1,2]])
b = np.asarray([[2,3]])
print(np.einsum('ij,ji->ij', a, b))
and the final array is 2x2. However
a = np.asarray([[1,2]])
b = np.asarray([[2,3]])
print(np.einsum('ij,ij->ij', a, b))
returns a 1x2 array. I don't quite understand why this does not give the correct result.
For example for the above case numpy's guide says that arrows can be used to force summation or stop it from taking place. But that seems quite vague to me; in the above case I don't understand how numpy decides about the final size of the output array based on the order of indices (which apparently changes).
Formally I know the following: When there is nothing on the right side of the arrow, one can write the summation mathematically as $\sum\limits_{i=0}^{N}\sum\limits_{j=0}^{M} A_{ij}B_{ij}$
for np.einsum('ij,ij',A,B), but when there is an arrow I am clueless how to interpret it in terms of a formal mathematical expression.

In [22]: a
Out[22]: array([[1, 2]])
In [23]: b
Out[23]: array([[2, 3]])
In [24]: np.einsum('ij,ij->ij',a,b)
Out[24]: array([[2, 6]])
In [29]: a*b
Out[29]: array([[2, 6]])
Here the repetition of the indices in all parts, including output, is interpreted as element by element multiplication. Nothing is summed. a[i,j]*b[i,j] = c[i,j] for all i,j.
In [25]: np.einsum('ij,ji->ij',a,b)
Out[25]:
array([[2, 4],
[3, 6]])
In [28]: np.dot(a.T,b).T
Out[28]:
array([[2, 4],
[3, 6]])
In [38]: np.outer(a,b)
Out[38]:
array([[2, 3],
[4, 6]])
Again no summation because the same indices appear on left and right sides. a[i,j]*b[j,i] = c[i,j], in other words:
[[1*2, 2*2],
[1*3, 2*3]]
In effect an outer product. A look at how a is broadcasted against b.T might help:
In [69]: np.broadcast_arrays(a,b.T)
Out[69]:
[array([[1, 2],
[1, 2]]),
array([[2, 2],
[3, 3]])]
On the left side of the statement, repeated indices indicate which dimensions are multiplied. Matching left and right sides determines whether they are summed or not.
np.einsum('ij,ji->j',a,b) # array([ 5, 10]) sum on i only
np.einsum('ij,ji->i',a,b) # array([6, 9]) sum on j only
np.einsum('ij,ji',a,b) # 15 sum on i and j
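The same rules can be spelled out with explicit loops on arrays that need no broadcasting. This is a sketch with made-up 2x3 and 3x2 arrays, so that i and j have unambiguous sizes:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)        # indexed as A[i, j]
B = np.arange(6, 12).reshape(3, 2)    # indexed as B[j, i]

# 'ij,ji->ij': every index is kept, so nothing is summed
out = np.zeros((2, 3), dtype=A.dtype)
for i in range(2):
    for j in range(3):
        out[i, j] = A[i, j] * B[j, i]

print(np.array_equal(out, np.einsum('ij,ji->ij', A, B)))   # True
print(np.array_equal(out, A * B.T))                        # True

# An index missing from the right of the arrow is summed over
print(np.array_equal(out.sum(axis=1), np.einsum('ij,ji->i', A, B)))  # True
print(np.array_equal(out.sum(axis=0), np.einsum('ij,ji->j', A, B)))  # True
print(out.sum() == np.einsum('ij,ji', A, B))                         # True
```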
A while back I worked out a pure Python equivalent to einsum, with most of the focus on how it parses the string. The goal is to create an nditer with which it does a sum of products calculation. But it's not a trivial script to follow, even in Python:
https://github.com/hpaulj/numpy-einsum/blob/master/einsum_py.py
A simpler sequence showing these summation rules:
In [53]: c=np.array([[1,2],[3,4]])
In [55]: np.einsum('ij',c)
Out[55]:
array([[1, 2],
[3, 4]])
In [56]: np.einsum('ij->i',c)
Out[56]: array([3, 7])
In [57]: np.einsum('ij->j',c)
Out[57]: array([4, 6])
In [58]: np.einsum('ij->',c)
Out[58]: 10
Using arrays that don't have a 1 dimension removes the broadcasting complication:
In [71]: b2=np.arange(1,7).reshape(2,3)
In [72]: np.einsum('ij,ji',a2,b2)
...
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (2,3)->(2,3) (2,3)->(3,2)
Or should I say, it exposes the attempted broadcasting.
Ellipsis adds a level of complexity to the einsum interpretation. I developed the above mentioned github code when I solved a bug in the uses of .... But I didn't put much effort into refining the documentation.
Ellipsis broadcasting in numpy.einsum
The ellipses are most useful when you want an expression that can handle various sizes of arrays. If your arrays are always 2D, it doesn't do anything extra.
By way of example, consider a generalization of the dot, one that multiplies the last dimension of A with the 2nd to the last of B. With ellipsis we can write an expression that can handle a mix of 2d, 3D and larger arrays:
np.einsum('...ij,...jk',np.ones((2,3)),np.ones((3,4))) # (2,4)
np.einsum('...ij,...jk',np.ones((5,2,3)),np.ones((3,4))) # (5,2,4)
np.einsum('...ij,...jk',np.ones((5,2,3)),np.ones((5,3,4))) # (5,2,4)
np.einsum('...ij,...jk',np.ones((5,2,3)),np.ones((7,5,3,4))) # (7,5,2,4)
np.einsum('...ij,...jk->...ik',np.ones((5,2,3)),np.ones((7,5,3,4))) # (7, 5, 2, 4)
The last expression spells out the default right hand side indexing ...ik: the ellipsis plus the non-summed indices.
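For comparison, this ellipsis expression appears to give the same result as np.matmul, which also broadcasts the leading dimensions. np.matmul is not part of the answer above, and the random inputs are only for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((5, 2, 3))
y = rng.random((7, 5, 3, 4))

via_einsum = np.einsum('...ij,...jk->...ik', x, y)
via_matmul = np.matmul(x, y)

print(via_einsum.shape)                      # (7, 5, 2, 4)
print(np.allclose(via_einsum, via_matmul))   # True
```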
Your original example could be written as
np.einsum('...j,j...->...j',a,b)
Effectively the ellipsis fills in the i (or more dimensions) to match the dimensions of the arrays, so this would also work if a or b was 1d:
np.einsum('...j,j...->...j',a,b[0,:])
The np.dot way of generalizing to larger dimensions is different:
dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
is expressed in einsum as:
np.einsum('ijo,kom->ijkm',np.ones((2,3,4)),np.ones((3,4,2)))
which can be generalized with
np.einsum('...o,kom->...km',np.ones((4,)),np.ones((3,4,2)))
or
np.einsum('ijo,...om->ij...m',np.ones((2,3,4)),np.ones((3,4,2)))
But I don't think I can completely replicate it in einsum. That is, I can't tell it to fill in indices for A, followed by different ones for B.
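For fixed numbers of dimensions the correspondence can at least be checked numerically. The sketch below also compares against np.tensordot, which is an addition here and not something the answer uses:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2, 3, 4))
b = rng.random((3, 4, 2))

via_dot = np.dot(a, b)     # sums last axis of a with second-to-last of b
via_einsum = np.einsum('ijo,kom->ijkm', a, b)
via_tensordot = np.tensordot(a, b, axes=([-1], [-2]))

print(via_dot.shape)                         # (2, 3, 3, 2)
print(np.allclose(via_dot, via_einsum))      # True
print(np.allclose(via_dot, via_tensordot))   # True
```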

Indexing NumPy 2D array with another 2D array

I have something like
m = array([[1, 2],
[4, 5],
[7, 8],
[6, 2]])
and
select = array([0,1,0,0])
My target is
result = array([1, 5, 7, 6])
I tried np.ix_ as I read at Simplfy row AND column extraction, numpy, but this did not result in what I wanted.
p.s. Please change the title of this question if you can think of a more precise one.

The numpy way to do this is by using np.choose or fancy indexing/take (see below):
m = array([[1, 2],
[4, 5],
[7, 8],
[6, 2]])
select = array([0,1,0,0])
result = np.choose(select, m.T)
So there is no need for Python loops or anything similar, and you keep all the speed advantages numpy gives you. m.T is needed because choose is really a choice between the two arrays, np.choose(select, (m[:,0], m[:,1])), but it's straightforward to use it like this.
Using fancy indexing:
result = m[np.arange(len(select)), select]
And if speed is very important, there is np.take, which works on a 1D view (it's quite a bit faster for some reason, but maybe not for these tiny arrays):
result = m.take(select+np.arange(0, len(select) * m.shape[1], m.shape[1]))
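A quick check that the three variants agree on the question's data. The np.take_along_axis line is an extra option not mentioned in the answer, shown here only as a sketch:

```python
import numpy as np

m = np.array([[1, 2], [4, 5], [7, 8], [6, 2]])
select = np.array([0, 1, 0, 0])

by_choose = np.choose(select, m.T)
by_fancy = m[np.arange(len(select)), select]
# flat position of m[i, select[i]] is i * m.shape[1] + select[i]
by_take = m.take(select + np.arange(0, len(select) * m.shape[1], m.shape[1]))
by_along = np.take_along_axis(m, select[:, None], axis=1)[:, 0]

print(by_choose, by_fancy, by_take, by_along)
# [1 5 7 6] [1 5 7 6] [1 5 7 6] [1 5 7 6]
```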

I prefer to use NP.where for indexing tasks of this sort (rather than NP.ix_)
What is not mentioned in the OP is whether the result is selected by location (row/col in the source array) or by some condition (e.g., m >= 5). In any event, the code snippet below covers both scenarios.
Three steps:
create the condition array;
generate an index array by calling NP.where, passing in this condition array; and
apply this index array against the source array
>>> import numpy as NP
>>> cnd = (m==1) | (m==5) | (m==7) | (m==6)
>>> cnd
matrix([[ True, False],
[False, True],
[ True, False],
[ True, False]], dtype=bool)
>>> # generate the index array/matrix
>>> # by calling NP.where, passing in the condition (cnd)
>>> ndx = NP.where(cnd)
>>> ndx
(matrix([[0, 1, 2, 3]]), matrix([[0, 1, 0, 0]]))
>>> # now apply it against the source array
>>> m[ndx]
matrix([[1, 5, 7, 6]])
The argument passed to NP.where, cnd, is a boolean array, which in this case, is the result from a single expression comprised of compound conditional expressions (first line above)
If constructing such a value filter doesn't apply to your particular use case, that's fine, you just need to generate the actual boolean matrix (the value of cnd) some other way (or create it directly).
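The session above uses a matrix object. With a plain ndarray the same three steps might look like this; np.isin is used here as a shorter way to build the condition, which is an assumption about intent and not the original code:

```python
import numpy as np

m = np.array([[1, 2], [4, 5], [7, 8], [6, 2]])

cnd = np.isin(m, [1, 5, 7, 6])   # step 1: boolean condition array
ndx = np.where(cnd)              # step 2: tuple of (row indices, column indices)
picked = m[ndx]                  # step 3: apply it to the source array

print(ndx[0].tolist(), ndx[1].tolist())   # [0, 1, 2, 3] [0, 1, 0, 0]
print(picked)                             # [1 5 7 6]
print(np.array_equal(picked, m[cnd]))     # True: the boolean mask works directly
```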

What about using python?
result = array([subarray[index] for subarray, index in zip(m, select)])

IMHO, this is simplest variant:
m[np.arange(4), select]

Since the title is referring to indexing a 2D array with another 2D array, the actual general numpy solution can be found here.
In short:
A 2D array of indices of shape (n,m) with arbitrary large dimension m, named inds, is used to access elements of another 2D array of shape (n,k), named B:
# array of row offsets into the flattened B, one per row of inds
offset = np.arange(0, B.size, B.shape[1])
# numpy.take(B, C) "flattens" arrays B and C and selects elements from B based on indices in C
Result = np.take(B, offset[:,np.newaxis]+inds)
Another solution, which doesn't use np.take and I find more intuitive, is the following:
B[np.expand_dims(np.arange(B.shape[0]), -1), inds]
The advantage of this syntax is that it can be used both for reading elements from B based on inds (like np.take), as well as for assignment.
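A small check of both forms, as a sketch with made-up B and inds. The offsets are computed from B's row length, which is what indexing into the flattened B needs when inds has fewer columns than B:

```python
import numpy as np

B = np.arange(12).reshape(3, 4)            # shape (n, k) = (3, 4)
inds = np.array([[0, 3], [1, 1], [2, 0]])  # shape (n, m) = (3, 2)

# Flat start position of each row of B
offset = np.arange(0, B.size, B.shape[1])
via_take = np.take(B, offset[:, np.newaxis] + inds)

# Row numbers as a column vector broadcast against inds
rows = np.expand_dims(np.arange(B.shape[0]), -1)
via_index = B[rows, inds]

print(via_index.tolist())                   # [[0, 3], [5, 5], [10, 8]]
print(np.array_equal(via_take, via_index))  # True

# The indexing form also supports assignment
C = B.copy()
C[rows, inds] = -1
print(int((C == -1).sum()))                 # 5 (row 1 names column 1 twice)
```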

result = array([m[j][0] if i==0 else m[j][1] for i,j in zip(select, range(0, len(m)))])
