I have the following Numpy array.
I want to reshape it to (3, 5, 3), so it should look like:
[
[
[1,6,11],
[2,7,12],
[3,8,13],
[4,9,14],
[5,10,15]
],.......
]
I tried reshape(3, 5, 3), but it doesn't give the wanted result.
Your input array is of shape (3, 3, 5) and you want it reshaped to (3, 5, 3). There are several ways of doing this; below are some, as also mentioned in the comments.
First would be numpy.reshape(), which accepts newshape as a parameter. One caveat: reshape changes only the shape while keeping the flat element order, so on its own it will not rearrange elements into the layout you show (a quick check below demonstrates this):
In [77]: arr = np.arange(3*3*5).reshape(3, 3, 5)
# reshape to desired shape
In [78]: arr = arr.reshape((3, 5, 3))
In [79]: arr.shape
Out[79]: (3, 5, 3)
Or you can use numpy.transpose() as in:
In [80]: arr = np.arange(3*3*5).reshape(3, 3, 5)
In [81]: arr.shape
Out[81]: (3, 3, 5)
# now, we want to move the last axis which is 2 to second position
# thus our new shape would be `(3, 5, 3)`
In [82]: arr = np.transpose(arr, (0, 2, 1))
In [83]: arr.shape
Out[83]: (3, 5, 3)
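To see that caveat in action, here is a minimal check (using the same arr) that reshape keeps the flat element order while transposing rearranges it:
import numpy as np

arr = np.arange(3*3*5).reshape(3, 3, 5)

# reshape keeps the flat C-order sequence: the first row stays [0, 1, 2]
print(arr.reshape(3, 5, 3)[0, 0])          # [0 1 2]

# transposing the last two axes turns columns into rows: [0, 5, 10]
print(np.transpose(arr, (0, 2, 1))[0, 0])  # [0 5 10]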
Another way would be to use numpy.moveaxis():
In [87]: arr = np.arange(3*3*5).reshape(3, 3, 5)
# move the last axis (-1) to 2nd position (1)
In [88]: arr = np.moveaxis(arr, -1, 1)
In [89]: arr.shape
Out[89]: (3, 5, 3)
Yet another way would be to just swap the axes using numpy.swapaxes():
In [90]: arr = np.arange(3*3*5).reshape(3, 3, 5)
In [91]: arr.shape
Out[91]: (3, 3, 5)
# swap the position of ultimate and penultimate axes
In [92]: arr = np.swapaxes(arr, -1, 1)
In [93]: arr.shape
Out[93]: (3, 5, 3)
Among numpy.transpose(), numpy.moveaxis(), and numpy.swapaxes(), choose whichever is most intuitive to you: all three return a view with the elements arranged as you wanted, whereas reshape only matches the shape, not the arrangement.
Even though these all return views, there are some timing differences, so the preferred way (for efficiency) would be:
In [124]: arr = np.arange(3*3*5).reshape(3, 3, 5)
In [125]: %timeit np.swapaxes(arr, -1, 1)
456 ns ± 6.79 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [126]: %timeit np.transpose(arr, (0, 2, 1))
458 ns ± 6.93 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [127]: %timeit np.reshape(arr, (3, 5, 3))
635 ns ± 9.06 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [128]: %timeit np.moveaxis(arr, -1, 1)
3.42 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
numpy.swapaxes() and numpy.transpose() take almost the same time, with numpy.reshape() being a bit slower and numpy.moveaxis() the slowest of all. So it'd be wise to use either swapaxes or transpose.
I found a way of doing it using a list comprehension and numpy.transpose().
Code:
import numpy as np
database = [
    [
        [1,2,3,4,5],
        [6,7,8,9,10],
        [11,12,13,14,15]
    ],
    [
        [16,17,18,19,20],
        [21,22,23,24,25],
        [26,27,28,29,30]
    ],
    [
        [31,32,33,34,35],
        [36,37,38,39,40],
        [41,42,43,44,45]
    ]
]
ans = [np.transpose(data) for data in database]
print(ans)
Output:
[array([[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14],
[ 5, 10, 15]]),
array([[16, 21, 26],
[17, 22, 27],
[18, 23, 28],
[19, 24, 29],
[20, 25, 30]]),
array([[31, 36, 41],
[32, 37, 42],
[33, 38, 43],
[34, 39, 44],
[35, 40, 45]])]
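For reference, the same result can be obtained without the Python-level loop by swapping the last two axes in one call. A minimal sketch, rebuilding the same values with arange instead of retyping the nested list:
import numpy as np

database = np.arange(1, 46).reshape(3, 3, 5)  # same values as above
ans = np.transpose(database, (0, 2, 1))       # swap the last two axes
print(ans[0])
# [[ 1  6 11]
#  [ 2  7 12]
#  [ 3  8 13]
#  [ 4  9 14]
#  [ 5 10 15]]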
I have an array in numpy. I want to roll the first column by 1, second column by 2, etc.
Here is an example.
>>> x = np.reshape(np.arange(15), (5, 3))
>>> x
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
What I want to do:
>>> y = roll(x)
>>> y
array([[12, 10, 8],
[ 0, 13, 11],
[ 3, 1, 14],
[ 6, 4, 2],
[ 9, 7, 5]])
What is the best way to do it?
The real array will be very big. I'm using CuPy, the GPU version of NumPy. I'd prefer the solution that is fastest on the GPU, but of course any idea is welcome.
You could use advanced indexing:
import numpy as np
x = np.reshape(np.arange(15), (5, 3))
h, w = x.shape
rows, cols = np.arange(h), np.arange(w)
offsets = cols + 1                              # column j is rolled down by j + 1
shifted = np.subtract.outer(rows, offsets) % h  # source row for each output cell
y = x[shifted, cols]
y:
array([[12, 10, 8],
[ 0, 13, 11],
[ 3, 1, 14],
[ 6, 4, 2],
[ 9, 7, 5]])
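Since the question mentions CuPy: ufunc method support (such as .outer) has historically been patchier there than in NumPy, so a plain broadcasting form of the same idea may port more cleanly. A sketch under that assumption; the CuPy import is untested here:
import numpy as np  # or: import cupy as np, to run the same code on the GPU

x = np.reshape(np.arange(15), (5, 3))
h, w = x.shape
rows = np.arange(h)[:, None]   # column vector of row indices
shifts = np.arange(w) + 1      # column j rolls down by j + 1
shifted = (rows - shifts) % h  # same as np.subtract.outer(rows, shifts) % h
y = x[shifted, np.arange(w)]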
I implemented a naive solution (roll_for) and compared it to @Chrysophylaxs's solution (roll_indexing).
Conclusion: roll_indexing is faster for small arrays, but the difference shrinks as the array grows, and it is eventually slower than roll_for for very large arrays.
Implementations:
import numpy as np
def roll_for(x, shifts=None, axis=-1):
    if shifts is None:
        shifts = np.arange(1, x.shape[axis] + 1)  # OP requirement
    xt = x.swapaxes(axis, 0)  # https://stackoverflow.com/a/31094758/13636407
    yt = np.empty_like(xt)
    for idx, shift in enumerate(shifts):
        yt[idx] = np.roll(xt[idx], shift=shift)
    return yt.swapaxes(0, axis)
def roll_indexing(x):
    h, w = x.shape
    rows, cols = np.arange(h), np.arange(w)
    offsets = cols + 1
    shifted = np.subtract.outer(rows, offsets) % h  # fix
    return x[shifted, cols]
Tests:
M, N = 5, 3
x = np.arange(M * N).reshape(M, N)
expected = np.array([[12, 10, 8], [0, 13, 11], [3, 1, 14], [6, 4, 2], [9, 7, 5]])
assert np.array_equal(expected, roll_for(x))
assert np.array_equal(expected, roll_indexing(x))
M, N = 100, 200
# roll_indexing didn't work when M < N before the fix
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
Benchmark:
M, N = 100, 100
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 859 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit roll_indexing(x) # 81 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
M, N = 1_000, 1_000
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 12.7 ms ± 56.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit roll_indexing(x) # 12.4 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
M, N = 10_000, 10_000
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 1.3 s ± 6.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit roll_indexing(x) # 1.61 s ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I am trying to convert the vanilla Python standard deviation function, which uses windows of number elements for its calculations, into NumPy form. However, the NumPy code is faulty, saying "only integer scalar arrays can be converted to a scalar index". Is there any way I could bypass this?
Variables
import numpy as np
number = 5
list_= np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000])
Vanilla Python
std= np.array([list_[i:i+number].std() for i in range(0, len(list_)-number)])
NumPy form
counter = np.arange(0, len(list_)-number, 1)
std = list_[counter:counter+number].std()
In [46]: std = np.array([arr[i:i+number].std() for i in range(0, len(arr)-number)])
In [47]: std
Out[47]:
array([22.67653383, 10.3940773 , 14.60076482, 13.82801944, 13.68038469,
12.54834004, 13.13574418, 15.24698722, 14.65383773, 11.62092989,
8.57331689, 4.76392583, 9.49404494, 21.20874383, 24.91417226,
20.84991841, 13.22152789, 10.83343482, 16.01294245, 13.80007894,
10.51866421, 8.29287433, 11.24933733, 15.43661128, 13.65945978])
We can move the std out of the loop. Make a 2d array of windows, and apply std with axis:
In [48]: np.array([arr[i:i+number] for i in range(0, len(arr)-number)]).std(axis=1)
Out[48]:
array([22.67653383, 10.3940773 , 14.60076482, 13.82801944, 13.68038469,
12.54834004, 13.13574418, 15.24698722, 14.65383773, 11.62092989,
8.57331689, 4.76392583, 9.49404494, 21.20874383, 24.91417226,
20.84991841, 13.22152789, 10.83343482, 16.01294245, 13.80007894,
10.51866421, 8.29287433, 11.24933733, 15.43661128, 13.65945978])
We could also generate the windows with indexing. A convenient way is to use linspace:
In [63]: idx = np.arange(0,len(arr)-number)
In [64]: idx = np.linspace(idx,idx+number,number, endpoint=False,dtype=int)
In [65]: idx
Out[65]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24],
...
[ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28]])
In [66]: arr[idx].std(axis=0)
Out[66]:
array([22.67653383, 10.3940773 , 14.60076482, 13.82801944, 13.68038469,
12.54834004, 13.13574418, 15.24698722, 14.65383773, 11.62092989,
8.57331689, 4.76392583, 9.49404494, 21.20874383, 24.91417226,
20.84991841, 13.22152789, 10.83343482, 16.01294245, 13.80007894,
10.51866421, 8.29287433, 11.24933733, 15.43661128, 13.65945978])
The rolling-window approach using as_strided will probably be faster, but it may be harder to understand.
In [67]: timeit std = np.array([arr[i:i+number].std() for i in range(0, len(arr)-number)])
1.05 ms ± 7.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [68]: timeit np.array([arr[i:i+number] for i in range(0, len(arr)-number)]).std(axis=1)
74.7 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [69]: %%timeit
...: idx = np.arange(0,len(arr)-number)
...: idx = np.linspace(idx,idx+number,number, endpoint=False,dtype=int)
...: arr[idx].std(axis=0)
117 µs ± 240 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [73]: timeit np.std(rolling_window(arr, 5), 1)
74.5 µs ± 625 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using a more direct way to generate the rolling index:
In [81]: %%timeit
...: idx = np.arange(len(arr)-number)[:,None]+np.arange(number)
...: arr[idx].std(axis=1)
57.9 µs ± 87.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Your error comes from slicing with arrays:
In [82]: arr[np.array([1,2,3]):np.array([4,5,6])]
Traceback (most recent call last):
File "<ipython-input-82-3358e59f8fb5>", line 1, in <module>
arr[np.array([1,2,3]):np.array([4,5,6])]
TypeError: only integer scalar arrays can be converted to a scalar index
The rolling_window used above is taken from Rolling window for 1D arrays in Numpy?:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
np.std(rolling_window(list_, 5), 1)
By the way, your vanilla Python code is wrong; to include the last window, it should be:
std= np.array([list_[i:i+number].std() for i in range(0, len(list_)-number+1)])
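On NumPy 1.20+, numpy.lib.stride_tricks.sliding_window_view packages the same as_strided trick behind a safer interface; a minimal equivalent sketch:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# one row per window of length 5; shape is (len(list_) - 4, 5)
windows = sliding_window_view(list_, 5)
std = windows.std(axis=1)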
Just wondering if there is a way to construct a tumbling window in Python. For example, if I have a list/ndarray, listA = [3,2,5,9,4,6,3,8,7,9], how could I find the maximum of the first 3 items (3,2,5) -> 5, then the next 3 items (9,4,6) -> 9, and so on? Sort of like breaking it up into sections and finding the max, so the final result would be the list [5,9,8,9].
Approach #1: One-liner for windowed-max using np.maximum.reduceat -
In [118]: np.maximum.reduceat(listA,np.arange(0,len(listA),3))
Out[118]: array([5, 9, 8, 9])
Becomes more compact with np.r_ -
np.maximum.reduceat(listA,np.r_[:len(listA):3])
Approach #2: Generic ufunc way
Here's a function that works for generic ufuncs, taking the window length as a parameter -
def windowed_ufunc(a, ufunc, W):
    a = np.asarray(a)
    n = len(a)
    L = W*(n//W)
    out = ufunc(a[:L].reshape(-1,W), axis=1)
    if n > L:
        out = np.hstack((out, ufunc(a[L:])))
    return out
Sample run -
In [81]: a = [3,2,5,9,4,6,3,8,7,9]
In [82]: windowed_ufunc(a, ufunc=np.max, W=3)
Out[82]: array([5, 9, 8, 9])
On other ufuncs -
In [83]: windowed_ufunc(a, ufunc=np.min, W=3)
Out[83]: array([2, 4, 3, 9])
In [84]: windowed_ufunc(a, ufunc=np.sum, W=3)
Out[84]: array([10, 19, 18, 9])
In [85]: windowed_ufunc(a, ufunc=np.mean, W=3)
Out[85]: array([3.33333333, 6.33333333, 6. , 9. ])
Benchmarking
Timings of the NumPy solutions on array data, with the sample data tiled 10000x -
In [159]: a = [3,2,5,9,4,6,3,8,7,9]
In [160]: a = np.tile(a, 10000)
# #yatu's soln
In [162]: %timeit moving_maxima(a, w=3)
435 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# From this post - app#1
In [167]: %timeit np.maximum.reduceat(a,np.arange(0,len(a),3))
353 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# From this post - app#2
In [165]: %timeit windowed_ufunc(a, ufunc=np.max, W=3)
379 µs ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you want a one-liner, you can use a list comprehension:
listA = [3,2,5,9,4,6,3,8,7,9]
listB=[max(listA[i:i+3]) for i in range(0,len(listA),3)]
print (listB)
it returns:
[5, 9, 8, 9]
Of course, the code can be written more dynamically: if you want a different window size, just change 3 to any integer, as in the sketch below.
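For instance, a hypothetical tumbling_max helper with the window size as a parameter:
def tumbling_max(lst, w):
    # step by w so the windows don't overlap (tumbling, not sliding)
    return [max(lst[i:i + w]) for i in range(0, len(lst), w)]

tumbling_max(listA, 3)  # [5, 9, 8, 9]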
Using numpy, you can extend the list with zeroes so its length is divisible by the window size, then reshape and compute the max along the second axis:
def moving_maxima(a, w):
    mod = len(a) % w
    d = w if mod else mod
    x = np.r_[a, [0]*(d-mod)]
    return x.reshape(-1, w).max(1)
Some examples:
moving_maxima(listA,2)
# array([3., 9., 6., 8., 9.])
moving_maxima(listA,3)
#array([5, 9, 8, 9])
moving_maxima(listA,4)
#array([9, 8, 9])
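One caveat, not from the original answer: padding with zeros silently changes the result if the data can contain negative values, since a padded 0 may beat a genuine maximum. Padding with the identity of the reduction (-inf for max) avoids that; a sketch:
import numpy as np

def moving_maxima_safe(a, w):
    a = np.asarray(a, dtype=float)       # float so -inf is representable
    pad = (-len(a)) % w                  # elements needed to reach a multiple of w
    x = np.r_[a, np.full(pad, -np.inf)]  # -inf can never win the max
    return x.reshape(-1, w).max(1)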
I have 3 NumPy arrays, and I want to create tuples of the i-th element of each list. These tuples represent keys for a dictionary I had previously defined.
Ex:
List 1: [1, 2, 3, 4, 5]
List 2: [6, 7, 8, 9, 10]
List 3: [11, 12, 13, 14, 15]
Desired output: [mydict[(1,6,11)],mydict[(2,7,12)],mydict[(3,8,13)],mydict[(4,9,14)],mydict[(5,10,15)]]
These tuples represent keys of a dictionary I have previously defined (essentially, as input variables to a previously calculated function). I had read that this is the best way to store function values for lookup.
My current method of doing this is as follows:
[dict[x] for x in zip(l1, l2, l3)]
This works, but is obviously slow. Is there a way to vectorize this operation, or make it faster in any way? I'm open to changing the way I've stored the function values as well, if that is necessary.
EDIT: My apologies for the question being unclear. I do in fact, have NumPy arrays. My mistake for referring to them as lists and displaying them as such. They are of the same length.
Your question is a bit confusing: you're calling these NumPy arrays and asking for a way to vectorize things, but then you show lists, label them as lists in your example, and use list in the title. I'm going to assume you do have arrays.
>>> l1 = np.array([1, 2, 3, 4, 5])
>>> l2 = np.array([6, 7, 8, 9, 10])
>>> l3 = np.array([11, 12, 13, 14, 15])
If so, you can stack these up in a 2D array:
>>> ll = np.stack((l1, l2, l3))
And then you can just transpose that:
>>> lt = ll.T
This is better than vectorized; it's constant-time. NumPy is just creating another view of the same data, with different striding so it reads in column order instead of row order.
>>> lt
array([[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14],
[ 5, 10, 15]])
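A quick way to confirm that it really is a view rather than a copy (np.shares_memory is a stock NumPy helper):
>>> np.shares_memory(ll, lt)
True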
As miradulo points out, you can do both of these in one step with column_stack:
>>> lt = np.column_stack((l1, l2, l3))
But I suspect you're actually going to want ll as a value in its own right. (Although I admit I'm just guessing here at what you're trying to do…)
And of course if you want to loop over these rows as 1D arrays instead of doing further vectorized work, you can:
>>> for row in lt:
...     print(row)
[ 1 6 11]
[ 2 7 12]
[ 3 8 13]
[ 4 9 14]
[ 5 10 15]
Of course, you can convert them from 1D arrays to tuples just by calling tuple on each row. Or, whatever that mydict is supposed to be (it doesn't look like a dictionary: there are no key-value pairs, just values), you can do that.
>>> mydict = collections.namedtuple('mydict', list('abc'))
>>> tups = [mydict(*row) for row in lt]
>>> tups
[mydict(a=1, b=6, c=11),
mydict(a=2, b=7, c=12),
mydict(a=3, b=8, c=13),
mydict(a=4, b=9, c=14),
mydict(a=5, b=10, c=15)]
If you're worried about the time to look up a tuple of keys in a dict, itemgetter in the operator module has a C-accelerated version. If keys is a np.array, or a tuple, or whatever, you can do this:
for row in lt:
myvals = operator.itemgetter(*row)(mydict)
# do stuff with myvals
Meanwhile, I decided to slap together a C extension that should be as fast as possible. There's no error handling, because I'm lazy and it should be a tiny bit faster that way; this code will probably segfault if you give it anything but a dict and a tuple or list:
static PyObject *
itemget_itemget(PyObject *self, PyObject *args) {
    PyObject *d;
    PyObject *keys;
    PyArg_ParseTuple(args, "OO", &d, &keys);
    PyObject *seq = PySequence_Fast(keys, "keys must be an iterable");
    PyObject **arr = PySequence_Fast_ITEMS(seq);
    int seqlen = PySequence_Fast_GET_SIZE(seq);
    PyObject *result = PyTuple_New(seqlen);
    PyObject **resarr = PySequence_Fast_ITEMS(result);
    for (int i = 0; i != seqlen; ++i) {
        resarr[i] = PyDict_GetItem(d, arr[i]);
        Py_INCREF(resarr[i]);
    }
    return result;
}
Times for looking up 100 random keys out of a 10000-key dictionary on my laptop with python.org CPython 3.7 on macOS:
itemget.itemget: 1.6µs
operator.itemgetter: 1.8µs
comprehension: 3.4µs
pure-Python operator.itemgetter: 6.7µs
So, I'm pretty sure anything you do is going to be fast enough; that's only 34ns/key that we're trying to optimize. But if that really is too slow, operator.itemgetter does a good enough job moving the loop to C and cuts it roughly in half, which is pretty close to the best possible result you could expect. (It's hard to imagine looking up a bunch of boxed-value keys in a hash table in much less than 16ns/key, after all.)
Define your 3 lists. You mention 3 arrays, but show lists (and call them that as well):
In [112]: list1,list2,list3 = list(range(1,6)),list(range(6,11)),list(range(11,16))
Now create a dictionary with tuple keys:
In [114]: dd = {x:i for i,x in enumerate(zip(list1,list2,list3))}
In [115]: dd
Out[115]: {(1, 6, 11): 0, (2, 7, 12): 1, (3, 8, 13): 2, (4, 9, 14): 3, (5, 10, 15): 4}
Accessing elements from that dictionary with your code:
In [116]: [dd[x] for x in zip(list1,list2,list3)]
Out[116]: [0, 1, 2, 3, 4]
In [117]: timeit [dd[x] for x in zip(list1,list2,list3)]
1.62 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Now for an array equivalent - turn the lists into a 2d array:
In [118]: arr = np.array((list1,list2,list3))
In [119]: arr
Out[119]:
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15]])
Access the same dictionary elements. If I had used column_stack I could have omitted the .T, but that's slower (array transpose is fast):
In [120]: [dd[tuple(x)] for x in arr.T]
Out[120]: [0, 1, 2, 3, 4]
In [121]: timeit [dd[tuple(x)] for x in arr.T]
15.7 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Notice that this is substantially slower. Iteration over an array is slower than iteration over a list. You can't access elements of a dictionary in any sort of numpy 'vectorized' fashion - you have to use a Python iteration.
I can improve on the array iteration by first turning it into a list:
In [124]: arr.T.tolist()
Out[124]: [[1, 6, 11], [2, 7, 12], [3, 8, 13], [4, 9, 14], [5, 10, 15]]
In [125]: timeit [dd[tuple(x)] for x in arr.T.tolist()]
3.21 µs ± 9.67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Array construction times:
In [122]: timeit arr = np.array((list1,list2,list3))
3.54 µs ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [123]: timeit arr = np.column_stack((list1,list2,list3))
18.5 µs ± 11.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With the pure Python itemgetter (from v3.6.3) there's no savings:
In [149]: timeit operator.itemgetter(*[tuple(x) for x in arr.T.tolist()])(dd)
3.51 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and if I move the getter definition out of the time loop:
In [151]: %%timeit idx = operator.itemgetter(*[tuple(x) for x in arr.T.tolist()])
     ...: idx(dd)
     ...:
482 ns ± 1.85 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I want to sum the rows of a 2d array dat, grouped by the row index array idx. The following example works but is slow for large arrays. Any idea how to speed it up?
import numpy as np
dat = np.arange(18).reshape(6, 3, order = 'F')
idx = np.array([0, 1, 1, 1, 2, 2])
for i in np.unique(idx):
    print(np.sum(dat[idx == i], axis=0))
Output
[ 0 6 12]
[ 6 24 42]
[ 9 21 33]
Approach #1
We can leverage matrix-multiplication with np.dot -
In [56]: mask = idx[:,None] == np.unique(idx)
In [57]: mask.T.dot(dat)
Out[57]:
array([[ 0, 6, 12],
[ 6, 24, 42],
[ 9, 21, 33]])
Approach #2
For the case with idx already sorted, we can use np.add.reduceat -
In [52]: p = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
In [53]: np.add.reduceat(dat, p, axis=0)
Out[53]:
array([[ 0, 6, 12],
[ 6, 24, 42],
[ 9, 21, 33]])
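As an aside (not from the original answer), np.add.at performs an unbuffered scatter-add and handles unsorted idx directly; a sketch:
import numpy as np

dat = np.arange(18).reshape(6, 3, order='F')
idx = np.array([0, 1, 1, 1, 2, 2])

out = np.zeros((idx.max() + 1, dat.shape[1]), dtype=dat.dtype)
np.add.at(out, idx, dat)  # adds each row of dat into out[idx[i]]
print(out)
# [[ 0  6 12]
#  [ 6 24 42]
#  [ 9 21 33]]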
A bit faster approach uses a set object and the ndarray.sum() method:
In [216]: for i in set(idx):
...: print(dat[idx == i].sum(axis=0))
...:
[ 0 6 12]
[ 6 24 42]
[ 9 21 33]
Execution time comparison:
In [217]: %timeit for i in np.unique(idx): r = np.sum(dat[idx==i], axis = 0)
109 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [218]: %timeit for i in set(idx): r = dat[idx == i].sum(axis=0)
71.1 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)