When using numba and accessing elements in multiple 2D numpy arrays, is it better to use the index or to iterate the arrays directly? I'm finding that a combination of the two is the fastest, which seems counterintuitive to me. Or is there another, better way to do it?
For context, I am trying to speed up the implementation of the raytracing approach in this paper https://iopscience.iop.org/article/10.1088/1361-6560/ac1f38/pdf.
I have a function which takes the intensity before propagation and the displacement maps that result from the propagation. The resulting intensity is then the original intensity displaced, pixel by pixel, by the displacement maps, with sub-pixel displacements shared proportionately between the respective adjacent pixels. On a side note, can this be implemented directly in numpy or in another library? I've noticed it is similar to opencv's remap function (see the sketch after the code below).
import numpy as np
from numba import njit

@njit
def raytrace_range(intensity_0, d_y, d_x):
    """
    Args:
        intensity_0 (2d numpy array): intensity before propagation
        d_y (2d numpy array): Displacement along y in pixels
        d_x (2d numpy array): Displacement along x in pixels
    Returns:
        intensity_z (2d numpy array): intensity after propagation
    """
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i in range(n_x):
        for j in range(n_y):
            i_ij = intensity_0[i, j]
            dx_ij = d_x[i, j]
            dy_ij = d_y[i, j]
            # Always the same from here down
            if not dx_ij and not dy_ij:
                intensity_z[i, j] += i_ij
                continue
            i_new = i
            j_new = j
            # Calculating displacement bigger than a pixel
            if np.abs(dx_ij) > 1:
                x = np.floor(dx_ij)
                i_new = int(i + x)
                dx_ij = dx_ij - x
            if np.abs(dy_ij) > 1:
                y = np.floor(dy_ij)
                j_new = int(j + y)
                dy_ij = dy_ij - y
            # Calculating sub-pixel displacement
            if 0 <= i_new and i_new < n_y and 0 <= j_new and j_new < n_x:
                intensity_z[i_new, j_new] += i_ij*(1 - np.abs(dx_ij))*(1 - np.abs(dy_ij))
                if i_new < n_y - 1 and dx_ij >= 0:
                    if j_new < n_y - 1 and dy_ij >= 0:
                        intensity_z[i_new+1, j_new] += i_ij*dx_ij*(1 - dy_ij)
                        intensity_z[i_new+1, j_new+1] += i_ij*dx_ij*dy_ij
                        intensity_z[i_new, j_new+1] += i_ij*(1 - dx_ij)*dy_ij
                    if j_new and dy_ij < 0:
                        intensity_z[i_new+1, j_new] += i_ij*dx_ij*(1 - np.abs(dy_ij))
                        intensity_z[i_new+1, j_new-1] += i_ij*dx_ij*np.abs(dy_ij)
                        intensity_z[i_new, j_new-1] += i_ij*(1 - dx_ij)*np.abs(dy_ij)
                if i_new and dx_ij < 0:
                    if j_new < n_x - 1 and dy_ij >= 0:
                        intensity_z[i_new-1, j_new] += i_ij*np.abs(dx_ij)*(1 - dy_ij)
                        intensity_z[i_new-1, j_new+1] += i_ij*np.abs(dx_ij)*dy_ij
                        intensity_z[i_new, j_new+1] += i_ij*(1 - np.abs(dx_ij))*dy_ij
                    if j_new and dy_ij < 0:
                        intensity_z[i_new-1, j_new] += i_ij*np.abs(dx_ij)*(1 - np.abs(dy_ij))
                        intensity_z[i_new-1, j_new-1] += i_ij*dx_ij*dy_ij
                        intensity_z[i_new, j_new-1] += i_ij*(1 - np.abs(dx_ij))*np.abs(dy_ij)
    return intensity_z
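On the remap side note: here is a rough gather-style sketch with OpenCV (my assumption, not an equivalent implementation; cv2.remap pulls intensity from source coordinates with interpolation, whereas the function above scatters intensity to displaced positions, so the results will only be approximately comparable):

import cv2  # assuming OpenCV is installed

n_y, n_x = intensity_0.shape
ys, xs = np.mgrid[0:n_y, 0:n_x]
# Pull each output pixel from the location it was displaced from
# (the x/y axis convention must match how d_x and d_y were generated).
map_x = (xs - d_x).astype(np.float32)
map_y = (ys - d_y).astype(np.float32)
intensity_approx = cv2.remap(intensity_0.astype(np.float32), map_x, map_y,
                             interpolation=cv2.INTER_LINEAR)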
I've tried a few other approaches, of which the enumerate version below is the fastest (each includes the code from above after the comment # Always the same from here down, which I've omitted to keep the question relatively short):
@njit
def raytrace_enumerate(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, i_i in enumerate(intensity_0):
        for j, i_ij in enumerate(i_i):
            dx_ij = d_x[i, j]
            dy_ij = d_y[i, j]

@njit
def raytrace_npndenumerate(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for (i, j), i_ij in np.ndenumerate(intensity_0):
        dx_ij = d_x[i, j]
        dy_ij = d_y[i, j]

@njit
def raytrace_zip(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, (i_i, dy_i, dx_i) in enumerate(zip(intensity_0, d_y, d_x)):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(zip(i_i, dy_i, dx_i)):

@njit
def raytrace_stack1(idydx):
    n_y, _, n_x = idydx.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, (i_i, dy_i, dx_i) in enumerate(idydx):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(zip(i_i, dy_i, dx_i)):

@njit
def raytrace_stack2(idydx):
    n_y, n_x, _ = idydx.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, k in enumerate(idydx):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(k):
Make up some test data and time:
import timeit
rng = np.random.default_rng()
size = (2010, 2000)
margin = 10
test_data = np.pad(10000*rng.random(size=size), margin)
dx = np.pad(10*(rng.random(size=size)-0.5), margin)
dy = np.pad(10*(rng.random(size=size)-0.5), margin)
# Check results are the same
L = [raytrace_range(test_data, dy, dx),
     raytrace_enumerate(test_data, dy, dx),
     raytrace_npndenumerate(test_data, dy, dx),
     raytrace_zip(test_data, dy, dx),
     raytrace_stack1(np.stack([test_data, dy, dx], axis=1)),
     raytrace_stack2(np.stack([test_data, dy, dx], axis=2))]
print((np.diff(np.vstack(L).reshape(len(L),-1),axis=0)==0).all())
%timeit raytrace_range(test_data, dy, dx)
%timeit raytrace_enumerate(test_data, dy, dx)
%timeit raytrace_npndenumerate(test_data, dy, dx)
%timeit raytrace_zip(test_data, dy, dx)
%timeit raytrace_stack1(np.stack([test_data, dy, dx], axis=1)) #Note this would be the fastest if the arrays were pre-stacked
%timeit raytrace_stack2(np.stack([test_data, dy, dx], axis=2))
Output:
True
40.4 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
37.5 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.8 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
38.6 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
42 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) #Note this would be the fastest if the arrays were pre-stacked
47.4 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit 3: Turns out that removing the if statements makes range faster than enumerate. See Edit 2 below.
Interestingly, on my machine the times get awful for the stack1 and stack2 options, and indeed enumerate seems to be fastest. Maybe thanks to enumerate Numba understands it is a loop variable?:
In [1]: %timeit raytrace_range(test_data, dy, dx)
...: %timeit raytrace_enumerate(test_data, dy, dx)
...: %timeit raytrace_npndenumerate(test_data, dy, dx)
...: %timeit raytrace_zip(test_data, dy, dx)
   ...: %timeit raytrace_stack1(np.stack([test_data, dy, dx], axis=1)) #Note this would be the fastest if the arrays were pre-stacked
...: %timeit raytrace_stack2(np.stack([test_data, dy, dx], axis=2))
61 ms ± 785 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
53.9 ms ± 998 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.9 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
57.5 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
109 ms ± 478 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
146 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit: Using fastmath=True did not save much time, only ~3 ms.
Edit 2: Although it is not related to the OP's question, after playing a bit with the functions, it turns out that removing "superfluous"(*) conditional statements makes it notably faster, around 20% on my machine. Turns out the implementation works without them (at least the supplied test returns True):
(*) The operations seem to work regardless, as they are "caught" by the later operations. At least, the provided test vector did not report any issues.
#! Using this it is faster:
            # Always the same from here down
            # if dx_ij == 0 and dy_ij == 0:
            #     intensity_z[i, j] += i_ij
            #     continue
            # Calculating displacement bigger than a pixel
            x = np.floor(dx_ij)
            i_new = int(i + x)
            dx_ij = dx_ij - x
            y = np.floor(dy_ij)
            j_new = int(j + y)
            dy_ij = dy_ij - y
            # Calculating sub-pixel displacement
In [2]: %timeit raytrace_range(test_data, dy, dx)
...: %timeit raytrace_range2(test_data, dy, dx)
...: %timeit raytrace_enumerate(test_data, dy, dx)
64.8 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.9 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
56.1 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In general, the fastest way to iterate over an array is a basic low-level integer iterator. Such a pattern causes the minimum number of transformations in Numba, so the compiler should be able to optimize the code pretty well. Functions like zip and enumerate often add additional overhead due to indirect code transformations that are not perfectly optimized out.
Here is a basic example:
import numpy as np
import numba as nb

@nb.njit('(int_[::1],)')
def test(arr):
    s1 = s2 = 0
    for i in range(arr.shape[0]):
        s1 += i
        s2 += arr[i]
    return (s1, s2)

arr = np.arange(200_000)
test(arr)
However, things are more complex when you read/write to multiple arrays simultaneously (which is your case). Indeed, NumPy arrays can be indexed with negative indices, so Numba needs to perform bound checking every time. This check is expensive compared to the actual access, and it can even break some other optimizations (e.g. vectorization).
Consequently, Numba has been optimized to analyse the code, detect cases where bound checking is not needed, and avoid adding expensive checks at runtime. This is the case in the above code but not in your raytrace_range function. enumerate and enumerate+zip can help a lot to remove bound checking because Numba can easily prove that the index lies within the bounds of the array (theoretically, it could prove this for raytrace_range too, but the current implementation is unfortunately not smart enough).
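For illustration, here is a minimal sketch (my own, not from the question) of the enumerate+zip access pattern being described:

@njit
def add_arrays(a, b, out):
    # i and j come from iterating the arrays themselves, so Numba can
    # prove every access is in bounds and drop the runtime checks.
    for i, (row_a, row_b) in enumerate(zip(a, b)):
        for j, (x, y) in enumerate(zip(row_a, row_b)):
            out[i, j] = x + y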
You can mostly solve this problem using assertions. They are not only good for optimization but also make your code more robust!
Moreover, the indexing of multidimensional arrays is sometimes not perfectly optimized by the underlying JIT (LLVM-Lite). There is no reason for it not to be optimized, but compilers use heuristics that are far from perfect (though pretty good on average). You can help by computing views of rows. This generally results in a tiny improvement though.
Here is the improved code:
@njit
def raytrace_range_opt(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    assert intensity_0.shape == d_y.shape
    assert intensity_0.shape == d_x.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i in range(n_x):
        row_intensity_0 = intensity_0[i, :]
        row_d_x = d_x[i, :]
        row_d_y = d_y[i, :]
        for j in range(n_y):
            assert j >= 0  # Crazy optimization (see later)
            i_ij = row_intensity_0[j]
            dx_ij = row_d_x[j]
            dy_ij = row_d_y[j]
            # Always the same from here down
            if not dx_ij and not dy_ij:
                intensity_z[i, j] += i_ij
                continue
            # Remaining code left unmodified
Notes
Note that I think the indexing of the function raytrace_range is bogus: it should be for i in range(n_y): for j in range(n_x): instead, since the accesses are done with intensity_0[i, j] and you wrote n_y, n_x = intensity_0.shape. Note that swapping the axes also gives correct results based on your validation function (which is suspicious).
The assert j >= 0 instruction alone results in an 8% speed-up, which is crazy since the integer iterator j is guaranteed to be non-negative as long as n_y is positive, which is always the case since it is a shape! This is clearly a missed optimization of Numba that LLVM-Lite cannot recover (since LLVM-Lite does not know what a shape is, nor that shapes are always non-negative). This apparently missing assumption in the Numba code causes additional bound checking (on each of the three arrays), which is pretty expensive.
Benchmark
Here are results on my machine:
raytrace_range: 47.8 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_enumerate: 38.9 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_npndenumerate: 54.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_zip: 41 ms ± 657 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_stack1: 86.7 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_stack2: 84 ms ± 432 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_range_opt: 38.6 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As you can see raytrace_range_opt is the fastest implementation on my machine.
Related
I have a two dimensional array Y of size (N,M), say for instance:
N, M = 200, 100
Y = np.random.normal(0,1,(N,M))
For each of the N rows, I want to compute the dot product of the (M,1) vector with its transpose, which returns an (M,M) matrix. One way to do it, inefficiently, is:
Y = Y[:,:,np.newaxis]
[Y[i,:,:] @ Y[i,:,:].T for i in range(N)]
which is quite slow: timeit on the second line returns
11.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I thought a much better way to do it is to use the einsum numpy function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html):
np.einsum('ijk,imk->ijm', Y, Y, optimize=True)
(which means: for each row i, create a (j,k) matrix whose elements result from the dot product over the last dimension m)
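As a quick sanity check (a sketch, using the reshaped 3D Y from above):

res_loop = np.array([Y[i,:,:] @ Y[i,:,:].T for i in range(N)])
res_einsum = np.einsum('ijk,imk->ijm', Y, Y, optimize=True)
print(np.allclose(res_loop, res_einsum))  # True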
The two methods do return the exact same result, but the runtime of this new version is disappointing (only a bit more than twice as fast):
3.82 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One would expect much more improvement from the vectorized einsum function, since the first method is very inefficient... Do you have an explanation for this? Is there a better way to do this calculation?
In [60]: N, M = 200, 100
...: Y = np.random.normal(0,1,(N,M))
In [61]: Y1 = Y[:,:,None]
Your iteration, 200 steps to produce (100,100) arrays:
In [62]: timeit [Y1[i,:,:] @ Y1[i,:,:].T for i in range(N)]
18.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
einsum only modestly faster:
In [64]: timeit np.einsum('ijk,imk->ijm', Y1,Y1)
14.5 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but you could apply the @ in full 'batch' mode with:
In [65]: timeit Y[:,:,None] @ Y[:,None,:]
7.63 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But as Divakar notes, the sum axis is size 1, so you could use plain broadcasted multiply. This is an outer product, not a matrix one.
In [66]: timeit Y[:,:,None]*Y[:,None,:]
8.2 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
'Vectorizing' gives big gains when doing many iterations of a simple operation. For fewer iterations of a more complex operation, the gain isn't as great.
This is an old post, yet it covers the subject in detail: efficient outer product.
In particular, if you are willing to add a numba dependency, that may be your fastest option.
Updating part of numba code from the original post and adding the multi outer product:
import numpy as np
from numba import jit
from numba.typed import List

@jit(nopython=True)
def outer_numba(a, b):
    m = a.shape[0]
    n = b.shape[0]
    result = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i]*b[j]
    return result

@jit(nopython=True)
def multi_outer_numba(Y):
    all_result = List()
    for k in range(Y.shape[0]):
        y = Y[k]
        n = y.shape[0]
        tmp_res = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                tmp_res[i, j] = y[i]*y[j]
        all_result.append(tmp_res)
    return all_result
r = [outer_numba(Y[i],Y[i]) for i in range(N)]
r = multi_outer_numba(Y)
I'm doing some vectorized algebra using numpy and the wall-clock performance of my algorithm seems weird. The program does roughly as follows:
Create three matrices: Y (KxD), X (NxD), T (KxN)
For each row of Y:
subtract Y[i] from each row of X (by broadcasting),
square the differences along one axis, sum them, take a square root, then store in T.
However, depending on how I perform the broadcasting, computation speed is vastly different. Consider the code:
import numpy as np
from time import perf_counter

D = 128
N = 3000
K = 500

X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))

if True:  # negate to enable the second loop
    time = 0.0
    for i in range(100):
        start = perf_counter()
        for i in range(K):
            T[i] = np.sqrt(np.sum(
                np.square(
                    X - Y[i]  # this has dimensions NxD
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast in line: {:.3f} s".format(time / 100))
    exit()

if True:
    time = 0.0
    for i in range(100):
        start = perf_counter()
        for i in range(K):
            diff = X - Y[i]
            T[i] = np.sqrt(np.sum(
                np.square(
                    diff
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast out: {:.3f} s".format(time / 100))
    exit()
Times for each loop are measured individually and averaged over 100 executions. The results:
Broadcast in line: 1.504 s
Broadcast out: 0.438 s
The only difference is that the broadcasting and subtraction in the first loop are done in-line, while in the second approach I do them before any vectorized operations. Why does this make such a difference?
My system configuration:
Lenovo ThinkStation P920, 2x Xeon Silver 4110, 64 GB RAM
Xubuntu 18.04.2 LTS (bionic)
Python 3.7.3 (GCC 7.3.0)
Numpy 1.16.3 linked against OpenBLAS (that's as much as np.__config__.show() tells me)
PS: Yes I am aware this could be further optimized, but right now I would like to understand what happens under the hood here.
It's not a broadcasting problem.
I also added an optimized solution to see how long the actual calculation takes without the large overhead of memory allocation and deallocation.
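To make the allocation overhead concrete, here is a sketch of a third variant (func_prealloc is my own hypothetical addition, not part of the original benchmark) that reuses a single preallocated buffer via out=, so no fresh NxD temporary is created per iteration:

def func_prealloc(X, Y, T):
    diff = np.empty_like(X)  # allocated once, reused across iterations
    for i in range(Y.shape[0]):
        np.subtract(X, Y[i], out=diff)  # broadcasted subtract, no new array
        np.square(diff, out=diff)       # squared in place
        T[i] = np.sqrt(np.sum(diff, axis=1))
    return T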
Functions
import numpy as np
import numba as nb

def func_1(X, Y, T):
    for i in range(K):
        T[i] = np.sqrt(np.sum(np.square(X - Y[i]), axis=1))
    return T

def func_2(X, Y, T):
    for i in range(K):
        diff = X - Y[i]
        T[i] = np.sqrt(np.sum(np.square(diff), axis=1))
    return T

@nb.njit(fastmath=True, parallel=True)
def func_3(X, Y, T):
    for i in nb.prange(Y.shape[0]):
        for j in range(X.shape[0]):
            diff_sq_sum = 0.
            for k in range(X.shape[1]):
                diff_sq_sum += (X[j, k] - Y[i, k])**2
            T[i, j] = np.sqrt(diff_sq_sum)
    return T
Timings
I did all the timings in a Jupyter Notebook and observed really weird behavior. The following code is in one cell. I also tried calling %timeit multiple times, but on the first execution of the cell this doesn't change anything.
First execution of the cell
D = 128
N = 3000
K = 500
X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))
# You can run this more often; it would not change anything
%timeit func_1(X,Y,T)
%timeit func_1(X,Y,T)
# You can run this more often; it would not change anything
%timeit func_2(X,Y,T)
%timeit func_2(X,Y,T)
###Avoid measuring compilation overhead###
%timeit func_3(X,Y,T)
##########################################
%timeit func_3(X,Y,T)
774 ms ± 6.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
768 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.7 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.74 ms ± 39.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Second execution
345 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
337 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
322 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.93 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.9 ms ± 87.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I'm trying to use different weights for my model, and I need those weights to add up to 1, like this:
def func(length):
    return ['a list of numbers add up to 1 with given length']

func(4) returns [0.1, 0.2, 0.3, 0.4]
The numbers should be linearly spaced and they should not start from 0. Is there any way to achieve this with numpy or scipy?
This can be done quite simply using numpy arrays:
def func(length):
    linArr = np.arange(1, length+1)
    return linArr/linArr.sum()
First we create an array of length length, ranging from 1 to length. Then we normalize it by its sum.
Thanks to Paul Panzer for pointing out that the efficiency of this function can be improved by using Gauss's formula for the sum of the first n integers:
def func(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum
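For example, with the question's test case:

print(func(4))  # [0.1 0.2 0.3 0.4], since 1+2+3+4 = 10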
For large inputs, you might find that using np.linspace is faster than the accepted answer
def f1(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum

def f2(l):
    delta = 2/(l*(l+1))
    return np.linspace(delta, l*delta, l)
Ensure that the two things produce the same result:
In [39]: np.allclose(f1(1000000), f2(1000000))
Out[39]: True
Check timing of both:
In [68]: %timeit f1(10000000)
515 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [69]: %timeit f2(10000000)
247 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's tempting to just use np.arange(delta, l*delta, delta) which should be even faster, but this does present the risk of rounding errors causing the array to have lengths different from l (as will happen e.g. for l = 10000000).
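A quick sketch of that hazard:

l = 10_000_000
delta = 2/(l*(l+1))
# The length of the result depends on floating-point rounding of the step,
# so it is not guaranteed to equal l.
print(len(np.arange(delta, l*delta, delta)))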
If speed is more important than code style, it might also be possible to squeeze out a bit more by using Numba:
from numba import jit

@jit
def f3(l):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for n in range(l):
        a[n] = (n+1)*delta
    return a
In [96]: %timeit f3(10000000)
216 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While we're at it, let's note that it's possible to parallelize this loop. Doing so naively with Numba doesn't appear to give much, but helping it out a bit by pre-splitting the array into num_parallel parts does give a further improvement on a quad-core system:
from numba import njit, prange

@njit(parallel=True)
def f4(l, num_parallel=4):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for j in prange(num_parallel):
        # The last iteration gets whatever's left from rounding
        offset = 0 if j != num_parallel - 1 else l % num_parallel
        for n in range(l//num_parallel + offset):
            i = j*(l//num_parallel) + n
            a[i] = (i+1)*delta
    return a
In [171]: %timeit f4(10000000, 4)
163 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [172]: %timeit f4(10000000, 8)
158 ms ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [173]: %timeit f4(10000000, 12)
157 ms ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I am performing data analysis using a Python script, and profiling shows that more than 95% of the computation time is taken by the line performing np.sum(C[np.isin(A, b)]), where A and C are 2D NumPy arrays of equal dimension m x n, and b is a 1D array of variable length. Short of a dedicated NumPy function, is there a way to accelerate this computation?
Typical sizes of A (int64), C (float64): 10M x 100
Typical size of b (int64): 1000
As your labels are from a small integer range, you should get a sizeable speedup from using np.bincount (pp below). Alternatively, you can speed up the lookup by creating a mask (p2). This, like your original code, allows for replacing np.sum with math.fsum, which guarantees a result exact to within machine precision (p3). Alternatively, we can pythranize it for another 40% speedup (p4).
On my rig the Numba solution (mx) is about as fast as pp, but maybe I'm not doing it right.
import numpy as np
import math
from subsum import pflat

MAXIND = 120_000

def OP():
    return sum(C[np.isin(A, b)])

def pp():
    return np.bincount(A.reshape(-1), C.reshape(-1), MAXIND)[np.unique(b)].sum()

def p2():
    grid = np.zeros(MAXIND, bool)
    grid[b] = True
    return C[grid[A]].sum()

def p3():
    grid = np.zeros(MAXIND, bool)
    grid[b] = True
    return math.fsum(C[grid[A]])

def p4():
    return pflat(A.ravel(), C.ravel(), b, MAXIND)

import numba as nb

@nb.njit(parallel=True, fastmath=True)
def nb_ss(A, C, b):
    s = set(b)
    sum = 0.
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            if A[i, j] in s:
                sum += C[i, j]
    return sum

def mx():
    return nb_ss(A, C, b)

sh = 100_000, 100
A = np.random.randint(0, MAXIND, sh)
C = np.random.random(sh)
b = np.random.randint(0, MAXIND, 1000)

print(OP(), pp(), p2(), p3(), p4(), mx())

from timeit import timeit
print("OP", timeit(OP, number=4)*250)
print("pp", timeit(pp, number=10)*100)
print("p2", timeit(p2, number=10)*100)
print("p3", timeit(p3, number=10)*100)
print("p4", timeit(p4, number=10)*100)
print("mx", timeit(mx, number=10)*100)
The code for the pythran module:
[subsum.py]
import numpy as np

#pythran export pflat(int[:], float[:], int[:], int)
def pflat(A, C, b, MAXIND):
    grid = np.zeros(MAXIND, bool)
    grid[b] = True
    return C[grid[A]].sum()
Compilation is as simple as pythran subsum.py
Sample run:
41330.15849965791 41330.15849965748 41330.15849965747 41330.158499657475 41330.15849965791 41330.158499657446
OP 1963.3807722493657
pp 53.23419079941232
p2 21.8758742994396
p3 26.829131800332107
p4 12.988955597393215
mx 52.37018179905135
I assume you have changed int64 to int8 wherever required.
You can use Numba's JIT with parallel=True for faster NumPy computations that make use of multiple cores:

import numba
import numpy as np

@numba.jit(nopython=True, parallel=True)
def isin_sum(A, C, b):
    return np.sum(C[np.isin(A, b)])
Documentation for Numba Parallel
I don't know why np.isin is that slow, but you can implement your function quite a lot faster.
The following Numba solution uses a set for fast lookup of values and is parallelized. The memory footprint is also smaller than in the Numpy implementation.
Code
import numpy as np
import numba as nb

@nb.njit(parallel=True, fastmath=True)
def nb_pp(A, C, b):
    s = set(b)
    sum = 0.
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            if A[i, j] in s:
                sum += C[i, j]
    return sum
Timings
The pp implementation and the first data sample are from Paul Panzer's answer above.
MAXIND = 120_000
sh = 100_000, 100
A = np.random.randint(0, MAXIND, sh)
C = np.random.random(sh)
b = np.random.randint(0, MAXIND, 1000)
MAXIND = 120_000
%timeit res_1=np.sum(C[np.isin(A, b)])
1.5 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pp(A,C,b)
62.5 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=nb_pp(A,C,b)
17.1 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
MAXIND = 10_000_000
%timeit res_1=np.sum(C[np.isin(A, b)])
2.06 s ± 27.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pp(A,C,b)
206 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=nb_pp(A,C,b)
17.6 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
MAXIND = 100
%timeit res_1=np.sum(C[np.isin(A, b)])
1.01 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pp(A,C,b)
46.8 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=nb_pp(A,C,b)
3.88 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I need to generate a tall-and-thin random column-orthonormal matrix in SciPy; that is, the number of rows n is far greater than the number of columns p, by many orders of magnitude (say n = 1e5 and p = 100). I know that scipy.stats.ortho_group generates a square orthogonal matrix. However, in my case it's simply infeasible to generate an n-by-n random orthogonal matrix and then keep the first p columns... Is there a more time- and space-efficient approach?
You can first generate a tall and thin random matrix, and then perform a qr decomposition.
a = np.random.random(size=(100000, 100))
q, _ = np.linalg.qr(a)
Here q is the matrix you want.
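A quick check (a sketch): the columns of q are orthonormal, so q.T @ q should be the identity up to floating-point error.

print(np.allclose(q.T @ q, np.eye(q.shape[1])))  # True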
For me, scipy.linalg.orth was a little bit faster than numpy.linalg.qr:
a = np.random.random(size=(100000, 100))
q = scipy.linalg.orth(a)
Here's a benchmarked answer. Note that I do some transposing so that this works no matter whether the matrix is tall and thin (giving column-orthonormal) or short and wide (giving row-orthonormal).
def qr_method(n, m):
    X = np.random.normal(0, 1, (n, m))
    if n < m:
        X = X.T
    Q, _ = np.linalg.qr(X)
    if n < m:
        Q = Q.T
    return Q

def orth_method(n, m):
    X = np.random.normal(0, 1, (n, m))
    if n < m:
        X = X.T
    Q = scipy.linalg.orth(X)
    if n < m:
        Q = Q.T
    return Q

def ortho_group_method(n, m):
    Q = scipy.stats.ortho_group.rvs(max(n, m))[:min(n, m), :]
    if m < n:
        Q = Q.T
    return Q
The ortho_group method (i.e., make a square matrix and then take a subset) was so slow that I didn't benchmark it alongside the others:
%timeit ortho_group_method(500, 20)
2.73 s ± 57.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Of the other two, the difference is negligible, but QR is slightly faster.
%timeit qr_method(10000, 200)
168 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit orth_method(10000, 200)
193 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Does it make a difference how tall the matrix is? For a very tall matrix, they are close to equivalent.
%timeit qr_method(100000, 20)
122 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit orth_method(100000, 20)
130 ms ± 6.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For a square matrix, QR is much faster.
%timeit qr_method(500, 500)
47.5 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit orth_method(500, 500)
137 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)