I'm doing some vectorized algebra using numpy, and the wall-clock performance of my algorithm seems weird. The program does roughly the following:
Create three matrices: Y (KxD), X (NxD), T (KxN)
For each row of Y:
subtract Y[i] from each row of X (by broadcasting),
square the differences, sum them along one axis, take a square root, then store the result in row i of T.
However, depending on how I perform the broadcasting, computation speed is vastly different. Consider the code:
import numpy as np
from time import perf_counter
D = 128
N = 3000
K = 500
X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))
if True:  # change to False to run the second loop instead
    time = 0.0
    for _ in range(100):
        start = perf_counter()
        for i in range(K):
            T[i] = np.sqrt(np.sum(
                np.square(
                    X - Y[i]  # this has dimensions NxD
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast in line: {:.3f} s".format(time / 100))
    exit()

if True:
    time = 0.0
    for _ in range(100):
        start = perf_counter()
        for i in range(K):
            diff = X - Y[i]
            T[i] = np.sqrt(np.sum(
                np.square(
                    diff
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast out: {:.3f} s".format(time / 100))
    exit()
Times for each loop are measured individually and averaged over 100 executions. The results:
Broadcast in line: 1.504 s
Broadcast out: 0.438 s
The only difference is that in the first loop the broadcasting and subtraction happen in-line inside the expression, while in the second approach I store the difference in a separate array before the remaining vectorized operations. Why does this make such a difference?
My system configuration:
Lenovo ThinkStation P920, 2x Xeon Silver 4110, 64 GB RAM
Xubuntu 18.04.2 LTS (bionic)
Python 3.7.3 (GCC 7.3.0)
Numpy 1.16.3 linked against OpenBLAS (that's as much as np.__config__.show() tells me)
PS: Yes I am aware this could be further optimized, but right now I would like to understand what happens under the hood here.
It's not a broadcasting problem
I also added an optimized solution to see how long the actual calculation takes without the large overhead of memory allocation and deallocation.
Functions
import numpy as np
import numba as nb

def func_1(X, Y, T):
    for i in range(K):
        T[i] = np.sqrt(np.sum(np.square(X - Y[i]), axis=1))
    return T

def func_2(X, Y, T):
    for i in range(K):
        diff = X - Y[i]
        T[i] = np.sqrt(np.sum(np.square(diff), axis=1))
    return T

@nb.njit(fastmath=True, parallel=True)
def func_3(X, Y, T):
    for i in nb.prange(Y.shape[0]):
        for j in range(X.shape[0]):
            diff_sq_sum = 0.
            for k in range(X.shape[1]):
                diff_sq_sum += (X[j, k] - Y[i, k])**2
            T[i, j] = np.sqrt(diff_sq_sum)
    return T
Timings
I did all the timings in a Jupyter Notebook and observed a really weird behavior. The following code is all in one cell. I also tried calling timeit multiple times, but on the first execution of the cell this doesn't change anything.
First execution of the cell
D = 128
N = 3000
K = 500
X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))
# You can run these more often; it would not change anything
%timeit func_1(X,Y,T)
%timeit func_1(X,Y,T)
# You can run these more often; it would not change anything
%timeit func_2(X,Y,T)
%timeit func_2(X,Y,T)
### warm-up call: includes Numba compilation overhead ###
%timeit func_3(X,Y,T)
##########################################################
%timeit func_3(X,Y,T)
774 ms ± 6.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
768 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.7 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.74 ms ± 39.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Second execution
345 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
337 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
322 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.93 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.9 ms ± 87.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
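Part of what separates func_1/func_2 from the Numba version is that every iteration allocates a fresh (N, D) temporary for X - Y[i] (plus further temporaries for the square and the sum). To see how much of that is pure allocation overhead while staying in plain NumPy, a rough sketch of mine (not part of the timings above) reuses preallocated buffers through the ufuncs' out= arguments:
def func_2_prealloc(X, Y, T):
    # allocate the scratch buffers once and reuse them on every iteration
    diff = np.empty_like(X)            # (N, D) scratch
    sq_sum = np.empty(X.shape[0])      # (N,) scratch
    for i in range(Y.shape[0]):
        np.subtract(X, Y[i], out=diff)   # broadcasted subtraction, no new array
        np.square(diff, out=diff)        # square in place
        diff.sum(axis=1, out=sq_sum)
        np.sqrt(sq_sum, out=T[i])        # write directly into row i of T
    return T
This keeps all the arithmetic in NumPy but removes the per-iteration allocations, so whatever gap remains is due to the computation itself rather than memory management.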
Related
I have two arrays, x and t, each with n elements (t's elements are strictly ascending, so there is no division by 0), and a formula on which the creation of my new array v is based:
v[i] = (x[i+1] - x[i]) / (t[i+1] - t[i])
How can I write this in NumPy? I tried using numpy.fromfunction but didn't manage to make it work.
I did manage to do it using a for loop - but I feel like there's a better way of doing this:
n = 100000
x = np.random.rand(n)
t = np.random.randint(1, 10, n)
t = t.cumsum()

def gen_v(x, t):
    v = np.zeros(n - 1)
    for i in range(0, n - 1):
        v[i] = (x[i+1] - x[i]) / (t[i+1] - t[i])
    return v

v = gen_v(x, t)
%timeit gen_v(x, t)
Outputs
156 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use np.diff():
def gen_v(x, t):
    return np.diff(x) / np.diff(t)
The benchmark gives us:
# Your function:
8.45 ms +- 557 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
# My function:
63.4 us +- 1.62 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
You could use array slicing:
def gen_v(x, t):
    return (x[1:] - x[:-1]) / (t[1:] - t[:-1])
Benchmarking yields
# Your Function
62.4 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Slicing
277 µs ± 3.34 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
On my meager hardware. ;)
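For a quick sanity check (my addition; assumes x, t and n from the question are still defined), the loop, np.diff and slicing versions should all produce the same values:
# original loop version
v_loop = np.empty(n - 1)
for i in range(n - 1):
    v_loop[i] = (x[i+1] - x[i]) / (t[i+1] - t[i])

v_diff = np.diff(x) / np.diff(t)                 # np.diff version
v_slice = (x[1:] - x[:-1]) / (t[1:] - t[:-1])    # slicing version

print(np.allclose(v_loop, v_diff), np.allclose(v_loop, v_slice))  # True True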
Consider this Python code, where I try to compute the Euclidean distance from a vector to every row of a matrix. It's very slow compared to the best Julia version I can find using Tullio.jl.
The Python version takes 30 s, but the Julia version takes only 75 ms.
I am sure I am not doing this the best way in Python. Are there faster solutions? Numba and NumPy solutions are welcome.
import numpy as np
# generate
a = np.random.rand(4000000, 128)
b = np.random.rand(128)
print(a.shape)
print(b.shape)
def lin_norm_ever(a, b):
    return np.apply_along_axis(lambda x: np.linalg.norm(x - b), 1, a)
import time
t = time.time()
res = lin_norm_ever(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)
The Julia version:
using Tullio, BenchmarkTools

function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j, i])^2
    dist
end

@time comp_tullio(a, c)
@benchmark comp_tullio(a, c) # 75 ms on my computer
I would use Numba in this example for best performance. I also added two approaches from Divakar's linked answer for comparison.
Code
import numpy as np
import numba as nb
from scipy.spatial.distance import cdist

@nb.njit(fastmath=True, parallel=True, cache=True)
def dist_1(mat, vec):
    res = np.empty(mat.shape[0], dtype=mat.dtype)
    for i in nb.prange(mat.shape[0]):
        acc = 0.
        for j in range(mat.shape[1]):
            acc += (mat[i, j] - vec[j])**2
        res[i] = np.sqrt(acc)
    return res

# from https://stackoverflow.com/a/52364284/4045774
def dist_2(mat, vec):
    return cdist(mat, np.atleast_2d(vec)).ravel()

# from https://stackoverflow.com/a/52364284/4045774
def dist_3(mat, vec):
    M = mat.dot(vec)
    d = np.einsum('ij,ij->i', mat, mat) + np.inner(vec, vec) - 2*M
    return np.sqrt(d)
Timings
#Float64
a = np.random.rand(4000000, 128)
b = np.random.rand(128)
%timeit dist_1(a,b)
#122 ms ± 3.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist_2(a,b)
#484 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist_3(a,b)
#432 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Float32
a = np.random.rand(4000000, 128).astype(np.float32)
b = np.random.rand(128).astype(np.float32)
%timeit dist_1(a,b)
#68.6 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist_2(a,b)
#2.2 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#looks like there is a costly type-casting to float64
%timeit dist_3(a,b)
#228 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
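For a quick consistency check (my addition, on a smaller input so it runs fast), all three functions should agree with a direct NumPy reference:
a_small = np.random.rand(1000, 128)
b_small = np.random.rand(128)
ref = np.linalg.norm(a_small - b_small, axis=1)    # straightforward reference

print(np.allclose(dist_1(a_small, b_small), ref))  # True
print(np.allclose(dist_2(a_small, b_small), ref))  # True
print(np.allclose(dist_3(a_small, b_small), ref))  # True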
As part of a statistical programming package, I need to add log-transformed values together with the LogSumExp function. This is significantly less efficient than adding unlogged values together.
Furthermore, I need to add values together using the numpy.ufunc.reduceat functionality.
There are various options I've considered, with code below:
(for comparison in non-log-space) use numpy.add.reduceat
Numpy's ufunc for adding logged values together: np.logaddexp.reduceat
Handwritten reduceat function with the following logsumexp functions:
scipy's implementation of logsumexp
logsumexp function in Python (with numba)
Streaming logsumexp function in Python (with numba)
import numpy as np
import numba
import scipy.special  # for scipy.special.logsumexp used in the timings below

def logsumexp_reduceat(arr, indices, logsum_exp_func):
    res = list()
    i_start = indices[0]
    for i in indices[1:]:
        res.append(logsum_exp_func(arr[i_start:i]))
        i_start = i
    res.append(logsum_exp_func(arr[i:]))
    return res

@numba.jit(nopython=True)
def logsumexp(X):
    r = 0.0
    for x in X:
        r += np.exp(x)
    return np.log(r)

@numba.jit(nopython=True)
def logsumexp_stream(X):
    alpha = -np.Inf
    r = 0.0
    for x in X:
        if x != -np.Inf:
            if x <= alpha:
                r += np.exp(x - alpha)
            else:
                r *= np.exp(alpha - x)
                r += 1.0
                alpha = x
    return np.log(r) + alpha
arr = np.random.uniform(0,0.1, 10000)
log_arr = np.log(arr)
indices = sorted(np.random.randint(0, 10000, 100))
# approach 1
%timeit np.add.reduceat(arr, indices)
12.7 µs ± 503 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# approach 2
%timeit np.logaddexp.reduceat(log_arr, indices)
462 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# approach 3, scipy function
%timeit logsumexp_reduceat(arr, indices, scipy.special.logsumexp)
3.69 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# approach 3 handwritten logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp)
139 µs ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# approach 3 streaming logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp_stream)
164 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The timeit results show that handwritten logsumexp functions with numba are the fastest options, but are still 10x slower than numpy.add.reduceat.
A few questions:
Are there any other approaches (or tweaks to the options I've presented) which are faster? For instance, is there a way to use a lookup table to compute the logsumexp function?
Why is Sebastian Nowozin's "streaming logsumexp" function not faster than the naive approach?
There is some room for improvement
But never expect logsumexp to be as fast as a standard summation, because exp is quite an expensive operation.
Example
import numpy as np
#from version 0.43 until 0.47 this has to be set before importing numba
#Bug: https://github.com/numba/numba/issues/4689
from llvmlite import binding
binding.set_option('SVML', '-vector-library=SVML')
import numba as nb
#nb.njit(fastmath=True,parallel=False)
def logsum_exp_reduceat(arr, indices):
res = np.empty(indices.shape[0],dtype=arr.dtype)
for i in nb.prange(indices.shape[0]-1):
r = 0.
for j in range(indices[i],indices[i+1]):
r += np.exp(arr[j])
res[i]=np.log(r)
r = 0.
for j in range(indices[-1],arr.shape[0]):
r += np.exp(arr[j])
res[-1]=np.log(r)
return res
Timings
#small example where parallelization doesn't make sense
arr = np.random.uniform(0,0.1, 10_000)
log_arr = np.log(arr)
#use arrays if possible
indices = np.sort(np.random.randint(0, 10_000, 100))
%timeit logsum_exp_reduceat(arr, indices)
#without parallelization 22 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#with parallelization 84.7 µs ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.add.reduceat(arr, indices)
#4.46 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#large example where parallelization makes sense
arr = np.random.uniform(0,0.1, 1000_000)
log_arr = np.log(arr)
indices = np.sort(np.random.randint(0, 1000_000, 100))
%timeit logsum_exp_reduceat(arr, indices)
#without parallelization 1.57 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#with parallelization 409 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.add.reduceat(arr, indices)
#340 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
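One caveat (my addition): logsum_exp_reduceat above exponentiates the raw values, which is fine for the data in this benchmark but can overflow for large log-domain inputs. A max-shifted per-segment variant, sketched here in plain (unoptimized) NumPy, trades some speed for that safety:
def logsum_exp_reduceat_stable(arr, indices):
    # shift each segment by its maximum before exponentiating (overflow-safe)
    bounds = np.append(indices, arr.shape[0])
    res = np.empty(indices.shape[0], dtype=arr.dtype)
    for i in range(indices.shape[0]):
        seg = arr[bounds[i]:bounds[i + 1]]
        if seg.size == 0:                  # duplicate index -> empty segment
            res[i] = -np.inf               # matches the log of an empty sum
            continue
        m = seg.max()
        res[i] = np.log(np.sum(np.exp(seg - m))) + m
    return res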
I have a python function given below:
def myfun(x):
    if x > 0:
        return 0
    else:
        return np.exp(x)
where np is the numpy library. I want to make the function vectorized in numpy, so I use:
vec_myfun = np.vectorize(myfun)
I did a test to evaluate the efficiency. First I generate a vector of 100 random numbers:
x = np.random.randn(100)
Then I run the following code to obtain the runtime:
%timeit np.exp(x)
%timeit vec_myfun(x)
The runtime for np.exp(x) is 1.07 µs ± 24.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each).
The runtime for vec_myfun(x) is 71.2 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
My question is: compared to np.exp, vec_myfun has only one extra step to check the value of x, but it runs much more slowly than np.exp. Is there an efficient way to vectorize myfun to make it as efficient as np.exp?
Use np.where:
>>> x = np.random.rand(100,)
>>> %timeit np.exp(x)
1.22 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %timeit np.where(x > 0, 0, np.exp(x))
4.09 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For comparison, your vectorized function runs in about 30 microseconds on my machine.
As to why it runs slower: it's doing much more than np.exp. np.vectorize does type deduction and broadcasting, and it calls your Python function once per element. Much of this happens in Python itself, while nearly everything in the call to np.exp (and the np.where version here) is in C.
ufuncs like np.exp have a where parameter, which can be used as:
In [288]: x = np.random.randn(10)
In [289]: out=np.zeros_like(x)
In [290]: np.exp(x, out=out, where=(x<=0))
Out[290]:
array([0. , 0. , 0. , 0. , 0.09407685,
0.92458328, 0. , 0. , 0.46618914, 0. ])
In [291]: x
Out[291]:
array([ 0.37513573, 1.75273458, 0.30561659, 0.46554985, -2.3636433 ,
-0.07841215, 2.00878429, 0.58441085, -0.76316384, 0.12431333])
This actually skips the calculation wherever the where condition is False.
In contrast:
np.where(arr > 0, 0, np.exp(arr))
calculates np.exp(arr) first for all of arr (that's normal Python evaluation order), and then performs the where selection. With exp that isn't a big deal, but with log it could cause problems (warnings and invalid values for non-positive inputs).
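A small illustration of that point (my own example): the where= form never evaluates the ufunc at the masked-out positions, so no warning is raised for invalid inputs, whereas the np.where form computes np.log for every element first.
x = np.array([-1.0, 0.5, 2.0])
safe = np.log(x, out=np.zeros_like(x), where=(x > 0))   # no warning; masked slot keeps its 0
naive = np.where(x > 0, np.log(x), 0.0)                 # RuntimeWarning: invalid value encountered in log
print(safe)   # [0, log(0.5), log(2)]
print(naive)  # same values, but the warning was raised along the way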
Just thinking outside of the box, what about implementing a function piecewise_exp() that basically multiplies np.exp() with the boolean mask arr <= 0?
import numpy as np

def piecewise_exp(arr):
    # mask with <= 0 so that x == 0 matches myfun (which returns exp(0) = 1 there)
    return np.exp(arr) * (arr <= 0)
Writing the code proposed so far as functions:
@np.vectorize
def myfun(x):
    if x > 0:
        return 0.0
    else:
        return np.exp(x)

def bnaeker_exp(arr):
    return np.where(arr > 0, 0, np.exp(arr))
And testing that everything is consistent:
np.random.seed(0)
# : test that the functions have the same behavior
num = 10
x = np.random.rand(num) - 0.5
print(x)
print(myfun(x))
print(piecewise_exp(x))
print(bnaeker_exp(x))
Doing some micro-benchmarks for small inputs:
# : micro-benchmarks for small inputs
num = 100
x = np.random.rand(num) - 0.5
%timeit np.exp(x)
# 1.63 µs ± 45.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit myfun(x)
# 54 µs ± 967 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bnaeker_exp(x)
# 4 µs ± 87.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit piecewise_exp(x)
# 3.38 µs ± 59.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
... and for larger inputs:
# : micro-benchmarks for larger inputs
num = 100000
x = np.random.rand(num) - 0.5
%timeit np.exp(x)
# 32.7 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit myfun(x)
# 44.9 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bnaeker_exp(x)
# 481 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit piecewise_exp(x)
# 149 µs ± 2.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This shows that piecewise_exp() is faster than anything else proposed so far and gets reasonably close to np.exp() speed, especially for larger inputs, where np.where() becomes comparatively less efficient because it uses integer indexing instead of boolean masks.
EDIT
Also, the performance of the np.where() version (bnaeker_exp()) does depend on how many elements of the array actually satisfy the condition. If none of them does (like when you test on x = np.random.rand(100)), it is slightly faster than the boolean-mask multiplication version (piecewise_exp()) (128 µs ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) on my machine for n = 100000).
I need to generate a tall-and-thin random column-orthonormal matrix in SciPy; that is, the number of rows n is far greater than the number of columns p, by many orders of magnitude (say n = 1e5 and p = 100). I know that scipy.stats.ortho_group generates a square orthogonal matrix. However, in my case it's simply infeasible to generate an n-by-n random orthogonal matrix and then keep only the first p columns... Is there a more time- and space-efficient approach?
You can first generate a tall and thin random matrix and then perform a QR decomposition:
a = np.random.random(size=(100000, 100))
q, _ = np.linalg.qr(a)
Here q is the matrix you want.
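A quick sanity check (my addition) that q really has orthonormal columns:
print(q.shape)                              # (100000, 100)
print(np.allclose(q.T @ q, np.eye(100)))    # True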
For me, scipy.linalg.orth was a little bit faster than numpy.linalg.qr:
import scipy.linalg
a = np.random.random(size=(100000, 100))
q = scipy.linalg.orth(a)
Here's a benchmarked answer. Note that I do some transposing so that this will work no matter whether that matrix is tall and thin (gives column-orthonormal) or short and wide (gives row-orthonormal).
import numpy as np
import scipy.linalg
import scipy.stats

def qr_method(n, m):
    X = np.random.normal(0, 1, (n, m))
    if n < m:
        X = X.T
    Q, _ = np.linalg.qr(X)
    if n < m:
        Q = Q.T
    return Q

def orth_method(n, m):
    X = np.random.normal(0, 1, (n, m))
    if n < m:
        X = X.T
    Q = scipy.linalg.orth(X)
    if n < m:
        Q = Q.T
    return Q

def ortho_group_method(n, m):
    Q = scipy.stats.ortho_group.rvs(max(n, m))[:min(n, m), :]
    if m < n:
        Q = Q.T
    return Q
The ortho_group method (aka make a square matrix and then take a subset) was so slow that I didn't benchmark it along with the others:
%timeit ortho_group_method(500, 20)
2.73 s ± 57.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Of the other two, the difference is negligible, but QR is slightly faster.
%timeit qr_method(10000, 200)
168 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit orth_method(10000, 200)
193 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Does it make a difference how tall the matrix is? For a very tall matrix, they are close to equivalent.
%timeit qr_method(100000, 20)
122 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit orth_method(100000, 20)
130 ms ± 6.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For a square matrix, QR is much faster.
%timeit qr_method(500, 500)
47.5 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit orth_method(500, 500)
137 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)