I have an array in numpy. I want to roll the first column by 1, second column by 2, etc.
Here is an example.
>>> x = np.reshape(np.arange(15), (5, 3))
>>> x
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
What I want to do:
>>> y = roll(x)
>>> y
array([[12, 10, 8],
[ 0, 13, 11],
[ 3, 1, 14],
[ 6, 4, 2],
[ 9, 7, 5]])
What is the best way to do it?
The real array will be very big. I'm using cupy, the GPU version of numpy. I would prefer the solution that is fastest on the GPU, but of course any idea is welcome.
You could use advanced indexing:
import numpy as np
x = np.reshape(np.arange(15), (5, 3))
h, w = x.shape
rows, cols = np.arange(h), np.arange(w)
offsets = cols + 1
shifted = np.subtract.outer(rows, offsets) % h
y = x[shifted, cols]
y:
array([[12, 10, 8],
[ 0, 13, 11],
[ 3, 1, 14],
[ 6, 4, 2],
[ 9, 7, 5]])
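The same gather can also be spelled with np.take_along_axis, which some readers find easier to follow than a pair of index arrays. This is only a minimal sketch on the 5x3 example from the question; the name src_rows is mine. CuPy documents a function with the same name, but whether it is any faster on a GPU is an assumption that would need measuring.

```python
import numpy as np

x = np.arange(15).reshape(5, 3)
h, w = x.shape

# Column j is rolled down by j + 1, so output row i reads source row (i - (j + 1)) % h.
src_rows = (np.arange(h)[:, None] - (np.arange(w) + 1)) % h  # shape (h, w)

# take_along_axis picks x[src_rows[i, j], j] for every output cell (i, j).
y = np.take_along_axis(x, src_rows, axis=0)
print(y)
```

Taking the index modulo h keeps every source row in range even when there are more columns than rows.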
I implemented a naive solution (roll_for) and compared it to @Chrysophylaxs's solution (roll_indexing).
Conclusion: roll_indexing is faster for small arrays, but the gap shrinks as the array grows, and for very large arrays roll_indexing ends up slower than roll_for.
Implementations:
import numpy as np
def roll_for(x, shifts=None, axis=-1):
    if shifts is None:
        shifts = np.arange(1, x.shape[axis] + 1)  # OP requirement
    xt = x.swapaxes(axis, 0)  # https://stackoverflow.com/a/31094758/13636407
    yt = np.empty_like(xt)
    for idx, shift in enumerate(shifts):
        yt[idx] = np.roll(xt[idx], shift=shift)
    return yt.swapaxes(0, axis)

def roll_indexing(x):
    h, w = x.shape
    rows, cols = np.arange(h), np.arange(w)
    offsets = cols + 1
    shifted = np.subtract.outer(rows, offsets) % h  # fix
    return x[shifted, cols]
Tests:
M, N = 5, 3
x = np.arange(M * N).reshape(M, N)
expected = np.array([[12, 10, 8], [0, 13, 11], [3, 1, 14], [6, 4, 2], [9, 7, 5]])
assert np.array_equal(expected, roll_for(x))
assert np.array_equal(expected, roll_indexing(x))
M, N = 100, 200
# roll_indexing didn't work when M < N before the fix
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
Benchmark:
M, N = 100, 100
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 859 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit roll_indexing(x) # 81 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
M, N = 1_000, 1_000
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 12.7 ms ± 56.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit roll_indexing(x) # 12.4 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
M, N = 10_000, 10_000
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 1.3 s ± 6.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit roll_indexing(x) # 1.61 s ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
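roll_for accepts arbitrary shifts while roll_indexing hard-codes 1, 2, 3, ... A possible way to give the indexing version the same flexibility is sketched below. roll_indexing_shifts is a hypothetical helper, not part of either answer, and it was not included in the benchmark above.

```python
import numpy as np

def roll_indexing_shifts(x, shifts):
    """Roll column j of a 2-D array down by shifts[j] (hypothetical helper)."""
    h, w = x.shape
    shifts = np.asarray(shifts)
    rows = np.arange(h)[:, None]        # shape (h, 1)
    src = (rows - shifts[None, :]) % h  # shape (h, w): source row for each output cell
    return x[src, np.arange(w)]

x = np.arange(15).reshape(5, 3)
y = roll_indexing_shifts(x, [1, 2, 3])
print(y)
```

Negative shifts and shifts larger than the number of rows work as well, because the modulo wraps them the same way np.roll does.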
I am trying to convert a plain Python standard deviation calculation, which works on windows of consecutive elements whose length is given by the variable number, into pure NumPy form. However, the NumPy code is faulty and raises "only integer scalar arrays can be converted to a scalar index". Is there any way I could bypass this?
Variables
import numpy as np
number = 5
list_= np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000])
Vanilla python
std= np.array([list_[i:i+number].std() for i in range(0, len(list_)-number)])
Numpy form
counter = np.arange(0, len(list_)-number, 1)
std = list_[counter:counter+number].std()
In [46]: std= np.array([arr[i:i+number].std() for i in range(0, len(arr)-number)])
In [47]: std
Out[47]:
array([22.67653383, 10.3940773 , 14.60076482, 13.82801944, 13.68038469,
12.54834004, 13.13574418, 15.24698722, 14.65383773, 11.62092989,
8.57331689, 4.76392583, 9.49404494, 21.20874383, 24.91417226,
20.84991841, 13.22152789, 10.83343482, 16.01294245, 13.80007894,
10.51866421, 8.29287433, 11.24933733, 15.43661128, 13.65945978])
We can move the std out of the loop. Make a 2d array of windows, and apply std with axis:
In [48]: np.array([arr[i:i+number] for i in range(0, len(arr)-number)]).std(axis=1)
Out[48]:
array([22.67653383, 10.3940773 , 14.60076482, 13.82801944, 13.68038469,
12.54834004, 13.13574418, 15.24698722, 14.65383773, 11.62092989,
8.57331689, 4.76392583, 9.49404494, 21.20874383, 24.91417226,
20.84991841, 13.22152789, 10.83343482, 16.01294245, 13.80007894,
10.51866421, 8.29287433, 11.24933733, 15.43661128, 13.65945978])
We could also generate the windows with indexing. A convenient way is to use linspace:
In [63]: idx = np.arange(0,len(arr)-number)
In [64]: idx = np.linspace(idx,idx+number,number, endpoint=False,dtype=int)
In [65]: idx
Out[65]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24],
...
[ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28]])
In [66]: arr[idx].std(axis=0)
Out[66]:
array([22.67653383, 10.3940773 , 14.60076482, 13.82801944, 13.68038469,
12.54834004, 13.13574418, 15.24698722, 14.65383773, 11.62092989,
8.57331689, 4.76392583, 9.49404494, 21.20874383, 24.91417226,
20.84991841, 13.22152789, 10.83343482, 16.01294245, 13.80007894,
10.51866421, 8.29287433, 11.24933733, 15.43661128, 13.65945978])
The rolling-windows using as_strided will probably be faster, but may be harder to understand.
In [67]: timeit std= np.array([arr[i:i+number].std() for i in range(0, len(arr)-number)])
1.05 ms ± 7.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [68]: timeit np.array([arr[i:i+number] for i in range(0, len(arr)-number)]).std(axis=1)
74.7 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [69]: %%timeit
...: idx = np.arange(0,len(arr)-number)
...: idx = np.linspace(idx,idx+number,number, endpoint=False,dtype=int)
...: arr[idx].std(axis=0)
117 µs ± 240 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [73]: timeit np.std(rolling_window(arr, 5), 1)
74.5 µs ± 625 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using a more direct way to generate the rolling index:
In [81]: %%timeit
...: idx = np.arange(len(arr)-number)[:,None]+np.arange(number)
...: arr[idx].std(axis=1)
57.9 µs ± 87.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Your error: the start and stop of a slice must each be a single integer, not an array:
In [82]: arr[np.array([1,2,3]):np.array([4,5,6])]
Traceback (most recent call last):
File "<ipython-input-82-3358e59f8fb5>", line 1, in <module>
arr[np.array([1,2,3]):np.array([4,5,6])]
TypeError: only integer scalar arrays can be converted to a scalar index
As taken from "Rolling window for 1D arrays in Numpy?":
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
np.std(rolling_window(list_, 5), 1)
By the way, your vanilla Python code is off by one and drops the last window. It should be:
std= np.array([list_[i:i+number].std() for i in range(0, len(list_)-number+1)])
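On NumPy 1.20 or later, the strided windows can also be built with sliding_window_view, which avoids hand-written strides. A minimal sketch follows; the random list_ is stand-in data rather than the values from the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

number = 5
rng = np.random.default_rng(0)
list_ = rng.uniform(300.0, 460.0, size=30)    # stand-in for the question's 30 values

windows = sliding_window_view(list_, number)  # read-only view, shape (26, 5)
std = windows.std(axis=1)                     # one value per full window
print(std.shape)
```

This yields len(list_) - number + 1 windows, so it matches the corrected loop rather than the original one.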
Just wondering if there is a way to construct a tumbling window in python. So for example if I have list/ndarray , listA = [3,2,5,9,4,6,3,8,7,9]. Then how could I find the maximum of the first 3 items (3,2,5) -> 5, and then the next 3 items (9,4,6) -> 9 and so on... Sort of like breaking it up to sections and finding the max. So the final result would be list [5,9,8,9]
Approach #1: One-liner for windowed-max using np.maximum.reduceat -
In [118]: np.maximum.reduceat(listA,np.arange(0,len(listA),3))
Out[118]: array([5, 9, 8, 9])
Becomes more compact with np.r_ -
np.maximum.reduceat(listA,np.r_[:len(listA):3])
Approach #2: Generic ufunc way
Here's a function for generic ufuncs and that window length as a parameter -
def windowed_ufunc(a, ufunc, W):
    a = np.asarray(a)
    n = len(a)
    L = W*(n//W)
    out = ufunc(a[:L].reshape(-1,W),axis=1)
    if n>L:
        out = np.hstack((out, ufunc(a[L:])))
    return out
Sample run -
In [81]: a = [3,2,5,9,4,6,3,8,7,9]
In [82]: windowed_ufunc(a, ufunc=np.max, W=3)
Out[82]: array([5, 9, 8, 9])
On other ufuncs -
In [83]: windowed_ufunc(a, ufunc=np.min, W=3)
Out[83]: array([2, 4, 3, 9])
In [84]: windowed_ufunc(a, ufunc=np.sum, W=3)
Out[84]: array([10, 19, 18, 9])
In [85]: windowed_ufunc(a, ufunc=np.mean, W=3)
Out[85]: array([3.33333333, 6.33333333, 6. , 9. ])
Benchmarking
Timings on NumPy solutions on array data with sample data scaled up by 10000x -
In [159]: a = [3,2,5,9,4,6,3,8,7,9]
In [160]: a = np.tile(a, 10000)
# @yatu's soln
In [162]: %timeit moving_maxima(a, w=3)
435 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# From this post - app#1
In [167]: %timeit np.maximum.reduceat(a,np.arange(0,len(a),3))
353 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# From this post - app#2
In [165]: %timeit windowed_ufunc(a, ufunc=np.max, W=3)
379 µs ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you want a one-liner, you can use list comprehension:
listA = [3,2,5,9,4,6,3,8,7,9]
listB=[max(listA[i:i+3]) for i in range(0,len(listA),3)]
print (listB)
it returns:
[5, 9, 8, 9]
Of course, the code can be made more flexible: if you want a different window size, just change 3 to any positive integer.
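For instance, the window size can be turned into a parameter. tumbling_max below is a hypothetical wrapper around the same list comprehension, not a library function.

```python
def tumbling_max(values, w):
    """Max of each consecutive, non-overlapping block of w items.

    The last block may be shorter than w. Hypothetical helper for illustration.
    """
    if w < 1:
        raise ValueError("window size must be a positive integer")
    return [max(values[i:i + w]) for i in range(0, len(values), w)]

listA = [3, 2, 5, 9, 4, 6, 3, 8, 7, 9]
print(tumbling_max(listA, 3))
print(tumbling_max(listA, 4))
```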
Using numpy, you can extend the list with zeroes so its length is divisible by the window size, then reshape and compute the max along the second axis:
def moving_maxima(a, w):
    mod = len(a)%w
    d = w if mod else mod
    x = np.r_[a, [0]*(d-mod)]
    return x.reshape(-1,w).max(1)
Some examples:
moving_maxima(listA,2)
# array([3., 9., 6., 8., 9.])
moving_maxima(listA,3)
#array([5, 9, 8, 9])
moving_maxima(listA,4)
#array([9, 8, 9])
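One caveat with zero padding: if the trailing window contains only negative values, the padded zeros become its maximum. A possible workaround, sketched here under the assumption that the input is non-empty, is to pad with copies of the last element instead. tumbling_max_np is a hypothetical name.

```python
import numpy as np

def tumbling_max_np(a, w):
    """Tumbling-window max that pads with the last element (hypothetical helper)."""
    a = np.asarray(a)
    pad = (-len(a)) % w  # number of items missing from the last window
    # Copies of a[-1] cannot change the max of the window that already contains a[-1].
    padded = np.concatenate([a, np.full(pad, a[-1], dtype=a.dtype)])
    return padded.reshape(-1, w).max(axis=1)

print(tumbling_max_np([3, 2, 5, 9, 4, 6, 3, 8, 7, 9], 3))
print(tumbling_max_np([-3, -2, -5, -9], 3))
```

Because nothing is mixed in from a float array, the result keeps the dtype of the input.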
I have a M x N matrix X and a 1 x N matrix Y. What I would like to do is replace any 0-entry in X with the appropriate value from Y based on its column.
So if
X = np.array([[0, 1, 2], [3, 0, 5]])
and
Y = np.array([10, 20, 30])
The desired end result would be [[10, 1, 2], [3, 20, 5]].
This can be done straightforwardly by generating a M x N matrix where every row is Y and then using filter arrays:
Y = np.ones((X.shape[0], 1)) * Y.reshape(1, -1)
X[X==0] = Y[X==0]
But could this be done using numpy's broadcasting functionality?
Sure. Instead of physically repeating Y, create a broadcasted view of Y with the shape of X, using numpy.broadcast_to:
expanded = np.broadcast_to(Y, X.shape)
mask = X == 0
X[mask] = expanded[mask]
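If the goal is simply to let broadcasting do the work, np.where and np.copyto both broadcast Y against X without an explicit expansion step. A minimal sketch on the arrays from the question:

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 0, 5]])
Y = np.array([10, 20, 30])

# Out of place: Y is broadcast along the rows of X inside np.where.
result = np.where(X == 0, Y, X)
print(result)

# In place: copy Y into X only where the mask is True.
np.copyto(X, Y, where=(X == 0))
print(X)
```

np.where returns a new array, while np.copyto modifies X itself.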
Expand X to make it a bit more general:
In [306]: X = np.array([[0, 1, 2], [3, 0, 5],[0,1,0]])
np.where identifies the 0s; the second array of the returned tuple holds their column indices
In [307]: idx = np.where(X==0)
In [308]: idx
Out[308]: (array([0, 1, 2, 2]), array([0, 1, 0, 2]))
In [309]: Z = X.copy()
In [310]: Z[idx]
Out[310]: array([0, 0, 0, 0]) # flat list of where to put the values
In [311]: Y[idx[1]]
Out[311]: array([10, 20, 10, 30]) # matching list of values by column
In [312]: Z[idx] = Y[idx[1]]
In [313]: Z
Out[313]:
array([[10, 1, 2],
[ 3, 20, 5],
[10, 1, 30]])
Not doing broadcasting, but reasonably clean numpy.
Times compared to broadcast_to approach
In [314]: %%timeit
...: idx = np.where(X==0)
...: Z[idx] = Y[idx[1]]
...:
9.28 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [315]: %%timeit
...: exp = np.broadcast_to(Y,X.shape)
...: mask=X==0
...: Z[mask] = exp[mask]
...:
19.5 µs ± 513 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Faster, though the sample size is small.
Another way to make the expanded Y, is with repeat:
In [319]: %%timeit
...: exp = np.repeat(Y[None,:],3,0)
...: mask=X==0
...: Z[mask] = exp[mask]
...:
10.8 µs ± 55.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That time is close to my where version. It turns out that broadcast_to is relatively slow:
In [321]: %%timeit
...: exp = np.broadcast_to(Y,X.shape)
...:
10.5 µs ± 52.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [322]: %%timeit
...: exp = np.repeat(Y[None,:],3,0)
...:
3.76 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We'd have to do more tests to see whether that is just due to a setup cost, or if the relative times still apply with much larger arrays.
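A harness along these lines could be used for that follow-up test. It is only a sketch: the array size, the helper names and the number of repetitions are arbitrary choices, and no timings are claimed here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(2000, 500))    # roughly a third of the entries are 0
Y = rng.integers(10, 100, size=X.shape[1])

def fill_where(X, Y):
    Z = X.copy()
    idx = np.where(Z == 0)
    Z[idx] = Y[idx[1]]  # idx[1] holds the column of each zero
    return Z

def fill_broadcast_to(X, Y):
    Z = X.copy()
    mask = Z == 0
    Z[mask] = np.broadcast_to(Y, Z.shape)[mask]
    return Z

def fill_repeat(X, Y):
    Z = X.copy()
    mask = Z == 0
    Z[mask] = np.repeat(Y[None, :], Z.shape[0], axis=0)[mask]
    return Z

for f in (fill_where, fill_broadcast_to, fill_repeat):
    t = timeit.timeit(lambda: f(X, Y), number=5)
    print(f"{f.__name__}: {t / 5 * 1e3:.3f} ms per call")
```

Each function copies X first, so the copy cost is included equally in all three measurements.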
I want to sum columns of a 2d array dat by row index idx. The following example works but is slow for large arrays. Any idea to speed it up?
import numpy as np
dat = np.arange(18).reshape(6, 3, order = 'F')
idx = np.array([0, 1, 1, 1, 2, 2])
for i in np.unique(idx):
print(np.sum(dat[idx==i], axis = 0))
Output
[ 0 6 12]
[ 6 24 42]
[ 9 21 33]
Approach #1
We can leverage matrix-multiplication with np.dot -
In [56]: mask = idx[:,None] == np.unique(idx)
In [57]: mask.T.dot(dat)
Out[57]:
array([[ 0, 6, 12],
[ 6, 24, 42],
[ 9, 21, 33]])
Approach #2
For the case with idx already sorted, we can use np.add.reduceat -
In [52]: p = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
In [53]: np.add.reduceat(dat, p, axis=0)
Out[53]:
array([[ 0, 6, 12],
[ 6, 24, 42],
[ 9, 21, 33]])
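When idx is not sorted, one option is to sort it first and then apply the same reduceat idea. A minimal sketch, with idx shuffled on purpose and variable names of my own choosing:

```python
import numpy as np

dat = np.arange(18).reshape(6, 3, order="F")
idx = np.array([2, 1, 0, 1, 2, 1])  # unsorted on purpose

order = np.argsort(idx, kind="stable")  # row order that groups equal labels together
idx_sorted = idx[order]
# Start position of each run of equal labels in the sorted order.
starts = np.flatnonzero(np.r_[True, idx_sorted[:-1] != idx_sorted[1:]])
sums = np.add.reduceat(dat[order], starts, axis=0)
print(sums)
```

The rows of sums follow the sorted labels, which is the same order as np.unique(idx).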
A bit faster approach with set object and ndarray.sum() method:
In [216]: for i in set(idx):
...: print(dat[idx == i].sum(axis=0))
...:
[ 0 6 12]
[ 6 24 42]
[ 9 21 33]
Time execution comparison:
In [217]: %timeit for i in np.unique(idx): r = np.sum(dat[idx==i], axis = 0)
109 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [218]: %timeit for i in set(idx): r = dat[idx == i].sum(axis=0)
71.1 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
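A loop-free alternative that does not need idx to be sorted is np.add.at, combined with np.unique(..., return_inverse=True) to map arbitrary labels to row numbers. This is a sketch only; its speed relative to the approaches above is not measured here.

```python
import numpy as np

dat = np.arange(18).reshape(6, 3, order="F")
idx = np.array([0, 1, 1, 1, 2, 2])

# inverse[k] is the position of idx[k] among the sorted unique labels.
labels, inverse = np.unique(idx, return_inverse=True)
out = np.zeros((len(labels), dat.shape[1]), dtype=dat.dtype)
np.add.at(out, inverse, dat)  # unbuffered add: repeated row indices accumulate
print(out)
```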
I have a NumPy array:
arr = [[1, 2],
[3, 4]]
I want to create a new array that contains powers of arr up to a power order:
# arr_new = [arr^0, arr^1, arr^2, arr^3,...arr^order]
arr_new = [[1, 1, 1, 2, 1, 4, 1, 8],
[1, 1, 3, 4, 9, 16, 27, 64]]
My current approach uses for loops:
# Pre-allocate an array for powers
arr = np.array([[1, 2],[3,4]])
order = 3
rows, cols = arr.shape
arr_new = np.zeros((rows, (order+1) * cols))
# Iterate over each exponent
for i in range(order + 1):
arr_new[:, (i * cols) : (i + 1) * cols] = arr**i
print(arr_new)
Is there a faster (i.e. vectorized) approach to creating powers of an array?
Benchmarking
Thanks to @hpaulj, @Divakar and @Paul Panzer for the answers. I benchmarked the loop-based and broadcasting-based operations on the following test arrays.
arr = np.array([[1, 2],
[3,4]])
order = 3
arrLarge = np.random.randint(0, 10, (100, 100)) # 100 x 100 array
orderLarge = 10
The loop_based function is:
def loop_based(arr, order):
    # pre-allocate an array for powers
    rows, cols = arr.shape
    arr_new = np.zeros((rows, (order+1) * cols))
    # iterate over each exponent
    for i in range(order + 1):
        arr_new[:, (i * cols) : (i + 1) * cols] = arr**i
    return arr_new
The broadcast_based function using hstack is:
def broadcast_based_hstack(arr, order):
    # Create a 3D exponent array for a 2D input array to force broadcasting
    powers = np.arange(order + 1)[:, None, None]
    # Generate values (first axis contains the array at each power)
    exponentiated = arr ** powers
    # Stack the per-power 2D blocks side by side and return
    return np.hstack(exponentiated)  # <== using hstack function
The broadcast_based function using reshape is:
def broadcast_based_reshape(arr, order):
    # Create a column of exponents, shape (order + 1, 1), to force broadcasting
    powers = np.arange(order + 1)[:, None]
    # Generate values (middle axis contains the array at each power)
    exponentiated = arr[:, None] ** powers
    # reshape and return array
    return exponentiated.reshape(arr.shape[0], -1)  # <== using reshape function
The broadcast_based function using cumulative product cumprod and reshape:
def broadcast_cumprod_reshape(arr, order):
    rows, cols = arr.shape
    # Create 3D empty array where the middle dimension is
    # the array at powers 0 through order
    out = np.empty((rows, order + 1, cols), dtype=arr.dtype)
    out[:, 0, :] = 1  # 0th power is always 1
    a = np.broadcast_to(arr[:, None], (rows, order, cols))
    # Cumulatively multiply arrays so each multiplication produces the next order
    np.cumprod(a, axis=1, out=out[:,1:,:])
    return out.reshape(rows, -1)
On Jupyter notebook, I used the timeit command and got these results:
Small arrays (2x2):
%timeit -n 100000 loop_based(arr, order)
7.41 µs ± 174 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit -n 100000 broadcast_based_hstack(arr, order)
10.1 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit -n 100000 broadcast_based_reshape(arr, order)
3.31 µs ± 61.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit -n 100000 broadcast_cumprod_reshape(arr, order)
11 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Large arrays (100x100):
%timeit -n 1000 loop_based(arrLarge, orderLarge)
261 µs ± 5.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 broadcast_based_hstack(arrLarge, orderLarge)
225 µs ± 4.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 broadcast_based_reshape(arrLarge, orderLarge)
223 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 broadcast_cumprod_reshape(arrLarge, orderLarge)
157 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Conclusions:
It seems that the broadcast based approach using reshape is faster for smaller arrays. However, for large arrays, the cumprod approach scales better and is faster.
Extend arrays to higher dims and let broadcasting do its magic with some help from reshaping -
In [16]: arr = np.array([[1, 2],[3,4]])
In [17]: order = 3
In [18]: (arr[:,None]**np.arange(order+1)[:,None]).reshape(arr.shape[0],-1)
Out[18]:
array([[ 1, 1, 1, 2, 1, 4, 1, 8],
[ 1, 1, 3, 4, 9, 16, 27, 64]])
Note that arr[:,None] is equivalent to arr[:,None,:]; trailing full slices can be omitted for brevity.
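To see why the one-liner produces the requested layout, it can help to print the shape at each step. A small sketch on the 2x2 example, with intermediate names of my own:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
order = 3

base = arr[:, None]                      # shape (2, 1, 2)
exps = np.arange(order + 1)[:, None]     # shape (4, 1)
powers = base ** exps                    # broadcasts to (2, 4, 2): [row, power, col]
flat = powers.reshape(arr.shape[0], -1)  # shape (2, 8): powers laid out side by side

print(base.shape, exps.shape, powers.shape, flat.shape)
print(flat)
```

The exponent ends up on the middle axis, so flattening the last two axes puts all columns of power 0 first, then power 1, and so on.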
Timings on a bigger dataset -
In [40]: np.random.seed(0)
...: arr = np.random.randint(0,9,(100,100))
...: order = 10
# @hpaulj's soln with broadcasting and stacking
In [41]: %timeit np.hstack(arr **np.arange(order+1)[:,None,None])
1000 loops, best of 3: 734 µs per loop
In [42]: %timeit (arr[:,None]**np.arange(order+1)[:,None]).reshape(arr.shape[0],-1)
1000 loops, best of 3: 401 µs per loop
The reshaping part is practically free, and that is where we gain performance here, along with the broadcasting part of course, as seen in the breakdown below -
In [52]: %timeit (arr[:,None]**np.arange(order+1)[:,None])
1000 loops, best of 3: 390 µs per loop
In [53]: %timeit (arr[:,None]**np.arange(order+1)[:,None]).reshape(arr.shape[0],-1)
1000 loops, best of 3: 401 µs per loop
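One way to check why the reshape costs so little is to confirm that it returns a view rather than a copy. A minimal sketch, using a random test array similar to the one in the timings:

```python
import numpy as np

arr = np.random.default_rng(0).integers(0, 9, size=(100, 100))
order = 10

powers = arr[:, None] ** np.arange(order + 1)[:, None]  # fresh array, shape (100, 11, 100)
flat = powers.reshape(arr.shape[0], -1)                 # shape (100, 1100)

print(powers.flags["C_CONTIGUOUS"])    # the broadcast result is laid out contiguously
print(np.shares_memory(powers, flat))  # the reshape reuses the same buffer
```

np.hstack, by contrast, has to concatenate the per-power blocks into a new array.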
Use broadcasting to generate the values, and reshape or rearrange the values as desired:
In [34]: arr **np.arange(4)[:,None,None]
Out[34]:
array([[[ 1, 1],
[ 1, 1]],
[[ 1, 2],
[ 3, 4]],
[[ 1, 4],
[ 9, 16]],
[[ 1, 8],
[27, 64]]])
In [35]: np.hstack(_)
Out[35]:
array([[ 1, 1, 1, 2, 1, 4, 1, 8],
[ 1, 1, 3, 4, 9, 16, 27, 64]])
Here is a solution using cumulative multiplication which scales better than power based approaches, especially if the input array is of float dtype:
import numpy as np
def f_mult(a, k):
    m, n = a.shape
    out = np.empty((m, k, n), dtype=a.dtype)
    out[:, 0, :] = 1
    a = np.broadcast_to(a[:, None], (m, k-1, n))
    a.cumprod(axis=1, out=out[:, 1:])
    return out.reshape(m, -1)
Timings:
int up to power 9
divakar: 0.4342731796205044 ms
hpaulj: 0.794165057130158 ms
pp: 0.20520629966631532 ms
float up to power 39
divakar: 29.056487752124667 ms
hpaulj: 31.773792404681444 ms
pp: 1.0329263447783887 ms
Code for timings, thanks @Divakar:
def f_divakar(a, k):
    return (a[:,None]**np.arange(k)[:,None]).reshape(a.shape[0],-1)
def f_hpaulj(a, k):
    return np.hstack(a**np.arange(k)[:,None,None])
from timeit import timeit
np.random.seed(0)
a = np.random.randint(0,9,(100,100))
k = 10
print('int up to power 9')
print('divakar:', timeit(lambda: f_divakar(a, k), number=1000), 'ms')
print('hpaulj: ', timeit(lambda: f_hpaulj(a, k), number=1000), 'ms')
print('pp: ', timeit(lambda: f_mult(a, k), number=1000), 'ms')
a = np.random.uniform(0.5,2.0,(100,100))
k = 40
print('float up to power 39')
print('divakar:', timeit(lambda: f_divakar(a, k), number=1000), 'ms')
print('hpaulj: ', timeit(lambda: f_hpaulj(a, k), number=1000), 'ms')
print('pp: ', timeit(lambda: f_mult(a, k), number=1000), 'ms')
You are creating a Vandermonde matrix with a reshape, so it is probably best to use numpy.vander to make it, and let someone else take care of the best algorithm.
This way your code is just:
np.vander(arr.ravel(), order + 1, increasing=True).reshape(*arr.shape, -1).swapaxes(1, 2).reshape(arr.shape[0], -1)
That said, it seems like they use something like Paul Panzer's cumprod method under the hood so it should scale well.
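One thing to watch with np.vander: it returns one row per element, with powers in decreasing order by default and N columns covering powers 0 through N - 1. Matching the layout asked for in the question therefore takes increasing=True, order + 1 columns, and an axis swap before the final reshape. A minimal sketch on the 2x2 example:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
order = 3
rows, cols = arr.shape

# One row per element of arr, columns are powers 0..order.
v = np.vander(arr.ravel(), order + 1, increasing=True)  # shape (rows*cols, order+1)

# Regroup as [row, col, power], move power ahead of col, then flatten.
arr_new = v.reshape(rows, cols, order + 1).swapaxes(1, 2).reshape(rows, -1)
print(arr_new)
```

The swapaxes step makes the final reshape copy the data, so this is not free the way the reshape in the broadcasting answer is.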