I am performing data analysis with a Python script, and profiling shows that more than 95% of the computation time is spent on a single line: np.sum(C[np.isin(A, b)]), where A and C are 2D NumPy arrays of equal dimensions m x n, and b is a 1D array of variable length. If there is no dedicated NumPy function for this, is there another way to accelerate the computation?
Typical sizes of A (int64), C (float64): 10M x 100
Typical size of b (int64): 1000
As your labels come from a small integer range, you should get a sizeable speedup from np.bincount (pp below). Alternatively, you can speed up the lookup by creating a boolean mask (p2). Like your original code, this allows replacing np.sum with math.fsum, which tracks exact partial sums and so avoids accumulated rounding error (p3). Alternatively, we can pythranize it for another 40% speedup (p4).
On my rig the Numba solution (mx) is only about as fast as pp, but maybe I'm not doing it right.
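As a quick illustration of what math.fsum changes, here is a minimal sketch using only the standard library (it is separate from the benchmark in this answer): the built-in sum rounds after every addition, while math.fsum rounds only once at the end.

```python
import math

values = [0.1] * 10

naive = sum(values)        # left-to-right float addition, rounding at every step
exact = math.fsum(values)  # tracks exact partial sums, rounds once at the end

print(naive == 1.0, exact == 1.0)
```

The price is speed: math.fsum iterates over the values in Python-level fashion, which is consistent with p3 being somewhat slower than p2 in the sample run.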
import numpy as np
import math
from subsum import pflat
MAXIND = 120_000
def OP():
    return sum(C[np.isin(A, b)])
def pp():
    return np.bincount(A.reshape(-1), C.reshape(-1), MAXIND)[np.unique(b)].sum()
def p2():
    grid = np.zeros(MAXIND, bool)
    grid[b] = True
    return C[grid[A]].sum()
def p3():
    grid = np.zeros(MAXIND, bool)
    grid[b] = True
    return math.fsum(C[grid[A]])
def p4():
    return pflat(A.ravel(), C.ravel(), b, MAXIND)
import numba as nb
@nb.njit(parallel=True,fastmath=True)
def nb_ss(A,C,b):
    s=set(b)
    sum=0.
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            if A[i,j] in s:
                sum+=C[i,j]
    return sum
def mx():
    return nb_ss(A,C,b)
sh = 100_000, 100
A = np.random.randint(0, MAXIND, sh)
C = np.random.random(sh)
b = np.random.randint(0, MAXIND, 1000)
print(OP(), pp(), p2(), p3(), p4(), mx())
from timeit import timeit
print("OP", timeit(OP, number=4)*250)
print("pp", timeit(pp, number=10)*100)
print("p2", timeit(p2, number=10)*100)
print("p3", timeit(p3, number=10)*100)
print("p4", timeit(p4, number=10)*100)
print("mx", timeit(mx, number=10)*100)
The code for the pythran module:
[subsum.py]
import numpy as np
#pythran export pflat(int[:], float[:], int[:], int)
def pflat(A, C, b, MAXIND):
    grid = np.zeros(MAXIND, bool)
    grid[b] = True
    return C[grid[A]].sum()
Compilation is as simple as pythran subsum.py
Sample run:
41330.15849965791 41330.15849965748 41330.15849965747 41330.158499657475 41330.15849965791 41330.158499657446
OP 1963.3807722493657
pp 53.23419079941232
p2 21.8758742994396
p3 26.829131800332107
p4 12.988955597393215
mx 52.37018179905135
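For readers who want to convince themselves that the mask lookup and the np.bincount variant compute the same quantity as the original np.isin expression, here is a small self-contained check. It is only a sketch on tiny random data; the sizes and the seed are arbitrary choices, and it assumes, like the answer above, that all labels are non-negative integers below MAXIND.

```python
import numpy as np

rng = np.random.default_rng(0)
MAXIND = 50
A = rng.integers(0, MAXIND, size=(200, 10))
C = rng.random((200, 10))
b = rng.integers(0, MAXIND, size=8)

# Reference: the original expression from the question.
reference = C[np.isin(A, b)].sum()

# Mask lookup: grid[v] is True exactly when label v occurs in b.
grid = np.zeros(MAXIND, dtype=bool)
grid[b] = True
via_mask = C[grid[A]].sum()

# bincount: sum C per label, then add up the labels listed in b (deduplicated).
per_label = np.bincount(A.ravel(), weights=C.ravel(), minlength=MAXIND)
via_bincount = per_label[np.unique(b)].sum()

print(np.isclose(reference, via_mask), np.isclose(reference, via_bincount))
```

The np.unique call matters: b may contain duplicates, and without it a repeated label would be counted more than once.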
I assume you have changed int64 to int8 wherever required.
You can use Numba's parallel and JIT features to speed up NumPy computations and make use of all cores.
import numba
import numpy as np
# isin_sum is a placeholder name
@numba.jit(nopython=True, parallel=True)
def isin_sum(A, C, b):
    return np.sum(C[np.isin(A, b)])
Documentation for Numba Parallel
I don't know why np.isin is that slow, but you can implement your function quite a lot faster.
The following Numba solution uses a set for fast lookup of values and is parallelized. The memory footprint is also smaller than in the Numpy implementation.
Code
import numpy as np
import numba as nb
@nb.njit(parallel=True,fastmath=True)
def nb_pp(A,C,b):
    s=set(b)
    sum=0.
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            if A[i,j] in s:
                sum+=C[i,j]
    return sum
Timings
The pp implementation and the first data sample are from Paul Panzer's answer above.
MAXIND = 120_000
sh = 100_000, 100
A = np.random.randint(0, MAXIND, sh)
C = np.random.random(sh)
b = np.random.randint(0, MAXIND, 1000)
MAXIND = 120_000
%timeit res_1=np.sum(C[np.isin(A, b)])
1.5 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pp(A,C,b)
62.5 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=nb_pp(A,C,b)
17.1 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
MAXIND = 10_000_000
%timeit res_1=np.sum(C[np.isin(A, b)])
2.06 s ± 27.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pp(A,C,b)
206 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=nb_pp(A,C,b)
17.6 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
MAXIND = 100
%timeit res_1=np.sum(C[np.isin(A, b)])
1.01 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pp(A,C,b)
46.8 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=nb_pp(A,C,b)
3.88 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Related
I implemented code to find the maximum number of occurrences of any value in a NumPy array. The Numba version performs satisfactorily but has limitations, and I wonder whether it can be generalized.
numba implementation
import numba as nb
import numpy as np
import collections
@nb.njit("int64(int64[:])")
def max_count_unique_num(x):
    """
    Counts the maximum number of occurrences of any integer in x.
    Args:
        x (numpy array): Array of non-negative integers.
    Returns:
        Int
    """
    # get maximum value
    m = x[0]
    for v in x:
        if v > m:
            m = v
    if m == 0:
        return x.size
    # count each unique value
    num = np.zeros(m + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1
    # maximum count
    m = 0
    for k in num:
        if k > m:
            m = k
    return m
For comparisons, I also implemented numpy's unique and collections.Counter
def np_unique(x):
    """ Counts maximum occurrence using numpy's unique. """
    ux, uc = np.unique(x, return_counts=True)
    return uc.max()
def counter(x):
    """ Counts maximum occurrence using collections.Counter. """
    counts = collections.Counter(x)
    return max(counts.values())
timeit
Edit: Added np.bincount for additional comparison, as suggested by @MechanicPig.
In [1]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [2]: %timeit max_count_unique_num(x)
30 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: %timeit np_unique(x)
1.14 ms ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit counter(x)
2.68 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [6]: %timeit counter(x)
3.07 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit np_unique(x)
1.3 ms ± 7.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit max_count_unique_num(x)
490 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [10]: %timeit np.bincount(x).max()
32.3 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [11]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [12]: %timeit np.bincount(x).max()
830 µs ± 6.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The limitations of the Numba implementation are quite obvious: it is efficient only when all values in x are small non-negative integers, it slows down significantly for very large integers, and it is not applicable to floats or negative values.
Is there any way to generalize the implementation while keeping the speed?
Update
After checking the source code of np.unique, an implementation for general cases can be:
@nb.njit(["int64(int64[:])", "int64(float64[:])"])
def max_count_unique_num_2(x):
    # note: this sorts x in place
    x.sort()
    n = 0
    k = 0
    x0 = x[0]
    for v in x:
        if x0 == v:
            k += 1
        else:
            if k > n:
                n = k
            k = 1
            x0 = v
    # for last item in x if it equals to previous one
    if k > n:
        n = k
    return n
timeit
In [154]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [155]: %timeit max_count_unique_num(x)
519 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [156]: %timeit np_unique(x)
1.3 ms ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [157]: %timeit max_count_unique_num_2(x)
240 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [158]: x = np.random.randint(0, 200000, size=300000).astype(np.int64)
In [159]: %timeit max_count_unique_num(x)
1.01 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [160]: %timeit np_unique(x)
18.1 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [161]: %timeit max_count_unique_num_2(x)
3.58 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So:
If x contains large integers and its size is not large, max_count_unique_num_2 beats max_count_unique_num.
Both max_count_unique_num and max_count_unique_num_2 are significantly faster than np.unique.
A small modification to max_count_unique_num_2 can return the item with the maximum number of occurrences, or all such items if several share the maximum.
max_count_unique_num_2 can be accelerated further by removing x.sort() if x is already sorted.
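The modification mentioned above (returning the most frequent item or items, not just the count) can be sketched in plain NumPy with the same sort-then-scan idea. This is an illustration under assumptions, not the Numba function itself: most_frequent is a hypothetical name, the input is assumed to be a non-empty 1D array, and np.sort is used so that the caller's array is left untouched.

```python
import numpy as np

def most_frequent(x):
    """Return (values, count): all values that occur most often, and how often."""
    s = np.sort(x)  # sorted copy; equal values become adjacent runs
    # A run starts at position 0 and wherever the value changes.
    starts = np.flatnonzero(np.concatenate(([True], s[1:] != s[:-1])))
    # Run length = distance to the next run start (or to the end of the array).
    counts = np.diff(np.append(starts, s.size))
    best = counts.max()
    return s[starts[counts == best]], int(best)

values, count = most_frequent(np.array([3.5, -1.0, 3.5, 2.0, -1.0]))
print(values, count)
```

Because it relies only on sorting and comparisons, the same function accepts floats and negative values.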
What about shortening your code:
@nb.njit("int64(int64[:])", fastmath=True)
def shortened(x):
    num = np.zeros(x.max() + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1
    return num.max()
or a parallel version:
@nb.njit("int64(int64[:])", parallel=True, fastmath=True)
def shortened_paralleled(x):
    num = np.zeros(x.max() + 1, dtype=x.dtype)
    for k in nb.prange(x.size):
        num[x[k]] += 1
    return num.max()
The parallel version wins for larger data sizes. Note, however, that it can return different results between runs, because the threads update num concurrently without synchronization (a race condition); this needs to be fixed, if possible, before the result can be trusted.
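One common way to avoid this kind of race is to give every worker its own private histogram and merge the histograms afterwards. The sketch below shows the idea with plain NumPy and a thread pool instead of Numba; max_count_chunked and n_chunks are illustrative names, non-negative integer input is assumed, and no claim is made about its speed relative to the Numba versions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def max_count_chunked(x, n_chunks=4):
    """Maximum occurrence count, computed from per-chunk private histograms."""
    size = int(x.max()) + 1
    chunks = np.array_split(x, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # Each task writes only to its own histogram, so nothing is shared.
        partial = list(pool.map(lambda c: np.bincount(c, minlength=size), chunks))
    # Merging happens once, in a single thread.
    return int(np.sum(partial, axis=0).max())

rng = np.random.default_rng(0)
x_demo = rng.integers(0, 50, size=1000)
print(max_count_chunked(x_demo) == np.bincount(x_demo).max())
```

The cost is extra memory (one histogram per chunk), which matters when the value range is large.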
For handling the floats (or negative values) using Numba:
@nb.njit("int64(float64[:])", fastmath=True)
def shortened_float(x):
    # int64 counters: int8 would overflow once a value occurs more than 127 times
    num = np.zeros(x.size, dtype=np.int64)
    for k in x:
        for j in range(x.shape[0]):
            if k == x[j]:
                num[j] += 1
    return num.max()
IMO, np.unique(x, return_counts=True)[1].max() is the best choice: it handles both integers and floats and is very fast. Numba can be faster for integers (depending on the data size, since performance degrades as the data grows; AFAIK this is due to looping over elements instead of operating on whole arrays), but for floats the code would need further optimization, if that is possible at all. I don't think Numba can beat np.unique, particularly for large data.
Note: np.bincount can handle only non-negative integers.
You can do that without using numpy too.
arr = [1,1,2,2,3,3,4,5,6,1,3,5,7,1]
counts = list(map(list(arr).count, set(arr)))
list(set(arr))[counts.index(max(counts))]
If you want to use numpy then try this,
arr = np.array([1,1,2,2,3,3,4,5,6,1,3,5,7,1])
uniques, counts = np.unique(arr, return_counts = True)
uniques[np.where(counts == counts.max())]
Both do exactly the same job. To check which method is more efficient, just do this (after import time):
time_i = time.time()
<arr declaration> # Creating a new array each iteration can cause the total time to increase which would be biased against the numpy method.
for i in range(10**5):
<method you want>
time_f = time.time()
When I ran this I got 0.39 seconds for the first method and 2.69 seconds for the second one. So for this small 14-element input, the first method is more efficient.
What I want to say is that your implementation is almost the same as numpy.bincount. If you want to make it universal, you can consider encoding the original data:
def encode(ar):
    # Equivalent to numpy.unique(ar, return_inverse=True)[1] when ar.ndim == 1
    flatten = ar.ravel()
    perm = flatten.argsort()
    sort = flatten[perm]
    mask = np.concatenate(([False], sort[1:] != sort[:-1]))
    encoded = np.empty(sort.shape, np.int64)
    encoded[perm] = mask.cumsum()
    encoded.shape = ar.shape
    return encoded
def count_max(ar):
    return max_count_unique_num(encode(ar))
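The same encoding idea can be tried without Numba by letting np.unique produce the codes and np.bincount count them. This is a minimal sketch (max_count_any is a hypothetical name) meant to show that the encoding step makes floats and negative values countable; it is not a drop-in replacement for the functions above.

```python
import numpy as np

def max_count_any(ar):
    """Maximum occurrence count for any dtype that np.unique can sort."""
    # inverse maps every element to the index of its value among the sorted
    # unique values, i.e. to a small non-negative integer code.
    _, inverse = np.unique(ar, return_inverse=True)
    # ravel() keeps this working for multi-dimensional input.
    return int(np.bincount(inverse.ravel()).max())

print(max_count_any(np.array([-1.5, 2.0, -1.5, 3.0, -1.5])))
```

The sort inside np.unique dominates the cost, so this trades the value-range limitation of the counting approach for O(n log n) work.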
Consider this Python code, where I try to compute the Euclidean distance from a vector to every row of a matrix. It's very slow compared to the best Julia version I can find, which uses Tullio.jl.
The python version takes 30s but the Julia version only takes 75ms.
I am sure I am not doing the best in Python. Are there faster solutions? Numba and numpy solutions welcome.
import numpy as np
# generate
a = np.random.rand(4000000, 128)
b = np.random.rand(128)
print(a.shape)
print(b.shape)
def lin_norm_ever(a, b):
    return np.apply_along_axis(lambda x: np.linalg.norm(x - b), 1, a)
import time
t = time.time()
res = lin_norm_ever(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)
The Julia version:
using Tullio
function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j,i])^2
    dist
end
@time comp_tullio(a, c)
@benchmark comp_tullio(a, c) # 75ms on my computer
I would use Numba in this example for best performance. I also added two approaches from Divakar's linked answer for comparison.
Code
import numpy as np
import numba as nb
from scipy.spatial.distance import cdist
@nb.njit(fastmath=True,parallel=True,cache=True)
def dist_1(mat,vec):
    res=np.empty(mat.shape[0],dtype=mat.dtype)
    for i in nb.prange(mat.shape[0]):
        acc=0
        for j in range(mat.shape[1]):
            acc+=(mat[i,j]-vec[j])**2
        res[i]=np.sqrt(acc)
    return res
#from https://stackoverflow.com/a/52364284/4045774
def dist_2(mat,vec):
    return cdist(mat, np.atleast_2d(vec)).ravel()
#from https://stackoverflow.com/a/52364284/4045774
def dist_3(mat,vec):
    M = mat.dot(vec)
    d = np.einsum('ij,ij->i',mat,mat) + np.inner(vec,vec) -2*M
    return np.sqrt(d)
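For a NumPy-only baseline, the per-row apply_along_axis call from the question can be replaced by a single broadcast, and the expanded form used in dist_3 can be guarded against rounding. The sketch below is an illustration with hypothetical names (dist_direct, dist_expanded) on small random data; the clamping step is an addition, motivated by the fact that the expanded formula subtracts nearly equal numbers when a row is very close to the vector.

```python
import numpy as np

def dist_direct(mat, vec):
    # Broadcasting subtracts vec from every row; allocates one temporary of mat's shape.
    return np.linalg.norm(mat - vec, axis=1)

def dist_expanded(mat, vec):
    # ||m - v||^2 = ||m||^2 + ||v||^2 - 2 m.v, evaluated without forming mat - vec.
    sq = np.einsum('ij,ij->i', mat, mat) + vec @ vec - 2.0 * (mat @ vec)
    # Cancellation can leave tiny negative values; clamp so sqrt does not return NaN.
    np.maximum(sq, 0.0, out=sq)
    return np.sqrt(sq)

rng = np.random.default_rng(0)
mat = rng.random((1000, 128))
vec = rng.random(128)
print(np.allclose(dist_direct(mat, vec), dist_expanded(mat, vec)))
```

The direct form is the more accurate of the two; the expanded form avoids the large temporary at the price of some precision for near-zero distances.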
Timings
#Float64
a = np.random.rand(4000000, 128)
b = np.random.rand(128)
%timeit dist_1(a,b)
#122 ms ± 3.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist_2(a,b)
#484 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist_3(a,b)
#432 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Float32
a = np.random.rand(4000000, 128).astype(np.float32)
b = np.random.rand(128).astype(np.float32)
%timeit dist_1(a,b)
#68.6 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist_2(a,b)
#2.2 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#looks like there is a costly type-casting to float64
%timeit dist_3(a,b)
#228 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As part of a statistical programming package, I need to add log-transformed values together with the LogSumExp Function. This is significantly less efficient than adding unlogged values together.
Furthermore, I need to add values together using the numpy.ufunc.reduceat functionality.
There are various options I've considered, with code below:
1. (for comparison in non-log-space) use numpy.add.reduceat
2. NumPy's ufunc for adding logged values together: np.logaddexp.reduceat
3. Handwritten reduceat function with the following logsumexp functions:
   - scipy's implementation of logsumexp
   - logsumexp function in Python (with numba)
   - Streaming logsumexp function in Python (with numba)
import numpy as np
import numba
import scipy.special
def logsumexp_reduceat(arr, indices, logsum_exp_func):
    res = list()
    i_start = indices[0]
    for i in indices[1:]:
        res.append(logsum_exp_func(arr[i_start:i]))
        i_start = i
    # last segment runs to the end of the array
    res.append(logsum_exp_func(arr[i_start:]))
    return res
@numba.jit(nopython=True)
def logsumexp(X):
    r = 0.0
    for x in X:
        r += np.exp(x)
    return np.log(r)
@numba.jit(nopython=True)
def logsumexp_stream(X):
    alpha = -np.inf
    r = 0.0
    for x in X:
        if x != -np.inf:
            if x <= alpha:
                r += np.exp(x - alpha)
            else:
                r *= np.exp(alpha - x)
                r += 1.0
                alpha = x
    return np.log(r) + alpha
arr = np.random.uniform(0,0.1, 10000)
log_arr = np.log(arr)
indices = sorted(np.random.randint(0, 10000, 100))
# approach 1
%timeit np.add.reduceat(arr, indices)
12.7 µs ± 503 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# approach 2
%timeit np.logaddexp.reduceat(log_arr, indices)
462 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# approach 3, scipy function
%timeit logsumexp_reduceat(log_arr, indices, scipy.special.logsumexp)
3.69 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# approach 3 handwritten logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp)
139 µs ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# approach 3 streaming logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp_stream)
164 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The timeit results show that handwritten logsumexp functions with numba are the fastest options, but are still 10x slower than numpy.add.reduceat.
A few questions:
Are there any other approaches (or tweaks to the options I've presented) which are faster? For instance, is there a way to use a lookup table to compute the logsumexp function?
Why is Sebastian Nowozin's "streaming logsumexp" function not faster than the naive approach?
There is some room for improvement.
But never expect logsumexp to be as fast as a standard summation, because exp is quite an expensive operation.
Example
import numpy as np
#from version 0.43 until 0.47 this has to be set before importing numba
#Bug: https://github.com/numba/numba/issues/4689
from llvmlite import binding
binding.set_option('SVML', '-vector-library=SVML')
import numba as nb
@nb.njit(fastmath=True,parallel=False)
def logsum_exp_reduceat(arr, indices):
    res = np.empty(indices.shape[0],dtype=arr.dtype)
    for i in nb.prange(indices.shape[0]-1):
        r = 0.
        for j in range(indices[i],indices[i+1]):
            r += np.exp(arr[j])
        res[i]=np.log(r)
    r = 0.
    for j in range(indices[-1],arr.shape[0]):
        r += np.exp(arr[j])
    res[-1]=np.log(r)
    return res
Timings
#small example where parallelization doesn't make sense
arr = np.random.uniform(0,0.1, 10_000)
log_arr = np.log(arr)
#use arrays if possible
indices = np.sort(np.random.randint(0, 10_000, 100))
%timeit logsum_exp_reduceat(arr, indices)
#without parallelization 22 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#with parallelization 84.7 µs ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.add.reduceat(arr, indices)
#4.46 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#large example where parallelization makes sense
arr = np.random.uniform(0,0.1, 1000_000)
log_arr = np.log(arr)
indices = np.sort(np.random.randint(0, 1000_000, 100))
%timeit logsum_exp_reduceat(arr, indices)
#without parallelization 1.57 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#with parallelization 409 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.add.reduceat(arr, indices)
#340 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
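The loops above accumulate exp(x) directly, which can overflow when the log-values are large. A segment-wise max shift avoids that and can be written with reduceat calls only. The following is a sketch under stated assumptions, not a tuned implementation: logsumexp_reduceat_vec is a hypothetical name, the indices are assumed to be strictly increasing and within bounds, and every segment is assumed to contain at least one finite value.

```python
import numpy as np

def logsumexp_reduceat_vec(log_arr, indices):
    """Segment-wise log(sum(exp(x))) with a per-segment max shift."""
    indices = np.asarray(indices)
    start = indices[0]
    tail = log_arr[start:]  # elements before the first index belong to no segment
    idx = indices - start
    seg_max = np.maximum.reduceat(tail, idx)
    # Length of each segment, so the max can be broadcast back to the elements.
    lengths = np.diff(np.append(idx, tail.size))
    shifted = np.exp(tail - np.repeat(seg_max, lengths))  # every value is <= 1
    return np.log(np.add.reduceat(shifted, idx)) + seg_max

rng = np.random.default_rng(0)
log_demo = np.log(rng.uniform(0.01, 0.1, 1000))
idx_demo = np.array([5, 10, 250, 999])
print(np.allclose(logsumexp_reduceat_vec(log_demo, idx_demo),
                  np.logaddexp.reduceat(log_demo, idx_demo)))
```

It still evaluates one exp per element, so the cost argument made in this answer applies to it as well; what it adds is robustness to large inputs.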
I'm trying to use different weights for my model, and I need those weights to add up to 1, like this:
def func(length):
    return ['a list of numbers add up to 1 with given length']
func(4) returns [0.1, 0.2, 0.3, 0.4]
The numbers should be linearly spaced and they should not start from 0. Is there any way to achieve this with numpy or scipy?
This can be done quite simply using numpy arrays:
import numpy as np
def func(length):
    linArr = np.arange(1, length+1)
    return linArr/linArr.sum()
First we create an array of length length ranging from 1 to length. Then we divide by its sum so that the weights add up to 1.
Thanks to Paul Panzer for pointing out that the efficiency of this function can be improved by using Gauss's formula for the sum of the first n integers:
def func(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum
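As a quick sanity check of the normalization, the following sketch reproduces the example from the question. The name linear_weights is only illustrative; the body is the Gauss-formula version shown above.

```python
import numpy as np

def linear_weights(length):
    # 1, 2, ..., length divided by their sum length*(length+1)/2
    ranks = np.arange(1, length + 1)
    return ranks / (length * (length + 1) // 2)

w = linear_weights(4)
print(w, w.sum())
```

The integer division is exact here because length * (length + 1) is always even.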
For large inputs, you might find that using np.linspace is faster than the accepted answer
def f1(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum
def f2(l):
    delta = 2/(l*(l+1))
    return np.linspace(delta, l*delta, l)
Ensure that the two things produce the same result:
In [39]: np.allclose(f1(1000000), f2(1000000))
Out[39]: True
Check timing of both:
In [68]: %timeit f1(10000000)
515 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [69]: %timeit f2(10000000)
247 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's tempting to just use np.arange(delta, l*delta, delta) which should be even faster, but this does present the risk of rounding errors causing the array to have lengths different from l (as will happen e.g. for l = 10000000).
If speed is more important than code style, it might also be possible to squeeze out a bit more by using Numba:
from numba import jit
@jit
def f3(l):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for n in range(l):
        a[n] = (n+1)*delta
    return a
In [96]: %timeit f3(10000000)
216 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While we're at it, let's note that it's possible to parallelize this loop. Doing so naively with Numba doesn't appear to give much, but helping it out a bit and pre-splitting the array into num_parallel parts does give further improvement on a quad core system:
from numba import njit, prange
@njit(parallel=True)
def f4(l, num_parallel=4):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for j in prange(num_parallel):
        # The last iteration gets whatever's left from rounding
        offset = 0 if j != num_parallel - 1 else l % num_parallel
        for n in range(l//num_parallel + offset):
            i = j*(l//num_parallel) + n
            a[i] = (i+1)*delta
    return a
In [171]: %timeit f4(10000000, 4)
163 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [172]: %timeit f4(10000000, 8)
158 ms ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [173]: %timeit f4(10000000, 12)
157 ms ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have a large 2D numpy array. I would like to be able to efficiently run row-wise operations on subsets of the columns, without copying the data.
In what follows,
a = np.arange(10000000).reshape(1000, 10000) and columns = np.arange(1, 1000, 2). For reference,
In [4]: %timeit a.sum(axis=1)
7.26 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The approaches I am aware of are:
fancy indexing with list of columns
In [5]: %timeit a[:, columns].sum(axis=1)
42.5 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
fancy indexing with mask of columns
In [6]: cols_mask = np.zeros(10000, dtype=bool)
...: cols_mask[columns] = True
In [7]: %timeit a[:, cols_mask].sum(axis=1)
42.1 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
masked array
In [8]: cells_mask = np.ones((1000, 10000), dtype=bool)
In [9]: cells_mask[:, columns] = False
In [10]: am = np.ma.masked_array(a, mask=cells_mask)
In [11]: %timeit am.sum(axis=1)
80 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
python loop
In [12]: %timeit sum([a[:, i] for i in columns])
31.2 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Somewhat surprisingly to me, the last approach is the most efficient: moreover, it avoids copying the full data, which for me is a prerequisite. However, it is still much slower than the simple sum (on double the data size), and most importantly, it is not trivial to generalize to other operations (e.g., cumsum).
Is there any approach I am missing? I would be fine with writing some cython code, but I would like the approach to work for any numpy function, not just sum.
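Two NumPy-only options are worth checking before reaching for compiled code; both are sketches on a smaller array than the one in the question, and both come with restrictions. If the wanted columns are regularly spaced (as columns = np.arange(1, 1000, 2) is), a basic slice selects them as a view, so any NumPy function can run on it without copying. For arbitrary columns, multiplying by a 0/1 selector vector avoids copying a, but only works for sums.

```python
import numpy as np

a = np.arange(200_000).reshape(200, 1000)
columns = np.arange(1, 1000, 2)

# Regularly spaced columns can be written as a slice, which is a view (no copy).
view = a[:, 1:1000:2]
sum_view = view.sum(axis=1)

# Arbitrary columns: multiply by a 0/1 selector so `a` is never copied.
selector = np.zeros(a.shape[1], dtype=a.dtype)
selector[columns] = 1
sum_matmul = a @ selector

reference = a[:, columns].sum(axis=1)  # fancy indexing copies the selected columns
print(np.shares_memory(a, view), np.shares_memory(a, a[:, columns]))
```

Whether either option is actually faster on a given machine has to be measured; the view has strided memory access, and integer matrix products do not go through BLAS.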
On this one pythran seems a bit faster than numba at least on my rig:
import numpy as np
#pythran export col_sum(float[:,:], int[:])
#pythran export col_sum(int[:,:], int[:])
def col_sum(data, idx):
    return data.T[idx].sum(0)
Compile with pythran <filename.py>
Timings:
timeit(lambda:cs_pythran.col_sum(a, columns),number=1000)
# 1.644187423051335
timeit(lambda:cs_numba.col_sum(a, columns),number=1000)
# 2.635075871949084
If you want to beat c-compiled block summation, you're probably best off with numba. Any indexing that stays in python (numba creates c-compiled functions with jit) is going to have python overhead.
from numba import jit
@jit
def col_sum(block, idx):
    return block[:, idx].sum(1)
%timeit a.sum(axis=1)
100 loops, best of 3: 5.25 ms per loop
%timeit a[:, columns].sum(axis=1)
100 loops, best of 3: 7.24 ms per loop
%timeit col_sum(a, columns)
100 loops, best of 3: 2.46 ms per loop
You can use Numba. For best performance it is usually necessary to write simple loops, as you would in C.
(Numba is basically a Python-to-LLVM-IR code translator, much like Clang is for C.)
Code
import numpy as np
import numba as nb
@nb.njit(fastmath=True,parallel=True)
def row_sum(arr,columns):
    res=np.empty(arr.shape[0],dtype=arr.dtype)
    for i in nb.prange(arr.shape[0]):
        sum=0.
        for j in range(columns.shape[0]):
            sum+=arr[i,columns[j]]
        res[i]=sum
    return res
Timings
a = np.arange(1_000_000).reshape(1_000, 1_000)
columns = np.arange(1, 1000, 2)
%timeit res_1=a[:, columns].sum(axis=1)
1.29 ms ± 8.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res_2=row_sum(a,columns)
59.3 µs ± 4.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.allclose(res_1,res_2)
True
With Transonic (https://transonic.readthedocs.io), it's easy to write code that can be accelerated by different Python accelerators (in practice Cython, Pythran and Numba).
For example, with the boost decorator, one can write
import numpy as np
from transonic import boost
T0 = "int[:, :]"
T1 = "int[:]"
@boost
def row_sum_loops(arr: T0, columns: T1):
    # locals type annotations are used only by Cython
    i: int
    j: int
    sum_: int
    res: "int[]" = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res
@boost
def row_sum_transpose(arr: T0, columns: T1):
    return arr.T[columns].sum(0)
On my computer, I obtain:
TRANSONIC_BACKEND="python" python row_sum_boost.py
Checks passed: results are consistent
Python
row_sum_loops 108.57 s
row_sum_transpose 1.38 s
TRANSONIC_BACKEND="cython" python row_sum_boost.py
Checks passed: results are consistent
Cython
row_sum_loops 0.45 s
row_sum_transpose 1.32 s
TRANSONIC_BACKEND="numba" python row_sum_boost.py
Checks passed: results are consistent
Numba
row_sum_loops 0.27 s
row_sum_transpose 1.16 s
TRANSONIC_BACKEND="pythran" python row_sum_boost.py
Checks passed: results are consistent
Pythran
row_sum_loops 0.27 s
row_sum_transpose 0.76 s
See https://transonic.readthedocs.io/en/stable/examples/row_sum/txt.html for the full code and a more complete comparison on the example of this question.
Note that Pythran is also very efficient with the transonic.jit decorator.