Why isn't broadcasting with numpy faster than a nested loop

Why isn't broadcasting with numpy faster than a nested loop - python

I have a calculation in my code that get carried out thousands of times and I wanted to see if I could make it faster as it is currently using two nested loops. I assumed that if I used broadcasting I could make it several times faster.
I've shown the two options below, which thankfully give the same results.
import numpy as np
n = 1000
x = np.random.random([n, 3])
y = np.random.random([n, 3])
func_weight = np.random.random(n)
result = np.zeros([n, 9])
result_2 = np.zeros([n, 9])
# existing
for a in range(3):
for b in range(3):
result[:, 3*a + b] = x[:, a] * y[:, b] * func_weight
# broadcasting - assumed this would be faster
for a in range(3):
result_2[:, 3*a:3*(a+1)] = np.expand_dims(x[:, a], axis=-1) * y * np.expand_dims(func_weight, axis=-1)
Timings
n=100
nested loops: 24.7 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
broadcasting: 70.3 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n=1000
nested loops: 50.5 µs ± 913 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
broadcasting: 148 µs ± 372 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n=10000
nested loops: 327 µs ± 7.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
broadcasting: 864 µs ± 5.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In my testing, broadcasting is always slower, so I'm a little confused as to what is happening. I'm guessing that because I had to use expand_dims to get the shapes aligned in the second solution, that is what the big impact on performance is. Is that correct? As the array size grows, there's not much change in performance with the nested loop always about 3 times quicker.
Is there a more optimal third solution that I haven't considered?

In [126]: %%timeit
...: result = np.zeros([n,9])
...: for a in range(3):
...: for b in range(3):
...: result[:, 3*a + b] = x[:, a] * y[:, b] * func_weight
141 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [128]: %%timeit
...: result_2 = np.zeros([n,9])
...: for a in range(3):
...: result_2[:, 3*a:3*(a+1)] = np.expand_dims(x[:, a], axis=-1) * y * n
...: p.expand_dims(func_weight, axis=-1)
202 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A fully broadcasted version:
In [130]: %%timeit
...: result_3 = (x[:,:,None]*y[:,None,:]*func_weight[:,None,None]).reshape(
...: n,9)
88.8 µs ± 73.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Replacing the expand_dims with np.newaxis/None expansion:
In [131]: %%timeit
...: result_2 = np.zeros([n,9])
...: for a in range(3):
...: result_2[:, 3*a:3*(a+1)] = x[:, a,None] * y * func_weight[:,None]
132 µs ± 315 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So yes, expand_dims is a bit slow, I think because it tries to be general purpose. And an extra layer of function calls.
expand_dims is just a.reshape(shape), but it takes a bit of time to translate your axis parameter into the shape tuple. As an experienced user I find that the None syntax is clearer (and faster) - visually it stands out as a dimension-adding action.

Related

How to change a column in a matrix made of tuples?

I am unsure as to the cost of transforming a matrix of tuples into a list form which is easier to manipulate. The main priority is being able to change a column of the matrix as fast as possible
I have a matrix in the form of
[(a,b,c),(d,e,f),(g,h,i)]
which can appear in any size of n x m but for this example we'll take 3x3 matrix.
my main goal is to be able to change the values of any column in the matrix (only one at a time) (eg (b,e,h)).
my initial attempt was to transform the matrix into a list ie
[[a,b,c],[d,e,f],[g,h,i]]
which would be easier
but I feel that it would be costly in terms of transforming every tuple into a list and back into a tuple.
My main question could be how to optimize this to its fullest?

In [37]: def change_column_list_comp(old_m, col, value):
...: return [
...: tuple(list(row[:col]) + [value] + list(row[col + 1:]))
...: for row in old_m
...: ]
...:
In [38]: def change_column_list_convert(old_m, col, value):
...: list_m = list(map(list, old_m))
...: for row in list_m:
...: row[col] = value
...:
...: return list(map(tuple, list_m))
...:
In [39]: m = [tuple('abc'), tuple('def'), tuple('ghi')]
In [40]: %timeit change_column_list_comp(m, 1, 2)
2.05 µs ± 89.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [41]: %timeit change_column_list_convert(m, 1, 2)
1.28 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Looks like converting to a list, modifying the values, and converting back to tuple is faster. Note that this may not be the most efficient way of writing these functions.
However, these functions seem to start to converge as we scale up our matrix.
In [6]: m_100k = [tuple(string.printable)] * 100_000
In [7]: %timeit change_column_list_comp(m_100k, 1, 2)
163 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit change_column_list_convert(m_100k, 1, 2)
117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: m_1m = [tuple(string.printable)] * 1_000_000
In [43]: %timeit change_column_list_comp(m_1m, 1, 2)
1.72 s ± 74.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %timeit change_column_list_convert(m_1m, 1, 2)
1.24 s ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
At the end of the day you should be using the right tools for the job. While it's not really in the OP, it's just worth mentioning that numpy is simply the better way to go.
In [13]: m_np = np.array([list('abc'), list('def'), list('ghi')])
In [17]: %timeit m_np[:, 1] = 2; m_np
610 ns ± 48.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [20]: m_np_100k = np.array([[string.printable] * 100_000])
In [21]: %timeit m_np_100k[:, 1] = 2; m_np_100k
545 ns ± 63.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: m_np_1m = np.array([[string.printable] * 1_000_000])
# This might be using cached data
In [23]: %timeit m_np_1m[:, 1] = 2; m_np_1m
515 ns ± 31.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# Avoiding cache
In [24]: %timeit m_np_1m[:, 4] = 9; m_np_1m
557 ns ± 37.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This might not be the fairest comparison as we're manually returning the matrix, but you can see there is significant improvement.

Why is performance of in-place modification to a numpy array related to the order of dimension being modified?

import numpy as np
a = np.random.random((500, 500, 500))
b = np.random.random((500, 500))
%timeit a[250, :, :] = b
%timeit a[:, 250, :] = b
%timeit a[:, :, 250] = b
107 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52 µs ± 88.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.59 ms ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Observations:
Performance across the three runs above is different: modifying the 1st and 3rd dimension (slicing the 2nd) is the fastest among the three, while modifying the 1st and 2nd dimension (slicing the 3rd) is the slowest.
There seems no monotonicity of speed w.r.t. the dimension being sliced.
Questions are:
What mechanism behind numpy makes my observations?
With answer to 1st question in mind, how to speed up my code by arranging dimensions properly, as some dimensions are modified in bulk and the rests are just being sliced?

As several comments have indicated, it's all about locality of reference. Think about what numpy has to do at the low-level, and how far away from each other in memory the consecutive lvalues are in the 3rd case.
Note also how the results of the timings change when the array are not C-contiguous, but F-contiguous instead:
a = np.asfortranarray(a)
b = np.asfortranarray(b)
%timeit a[250, :, :] = b
%timeit a[:, 250, :] = b
%timeit a[:, :, 250] = b
892 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
169 µs ± 66.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(very small side-note: for the same reason, it is sometimes advantageous to sort a DataFrame before doing a groupby and a bunch of repetitive operations on the groups, somewhat counter-intuitively since the sort itself takes O(nlogn)).

Fastest Python log-sum-exp in a 'reduceat'

As part of a statistical programming package, I need to add log-transformed values together with the LogSumExp Function. This is significantly less efficient than adding unlogged values together.
Furthermore, I need to add values together using the numpy.ufunc.reduecat functionality.
There are various options I've considered, with code below:
(for comparison in non-log-space) use numpy.add.reduceat
Numpy's ufunc for adding logged values together: np.logaddexp.reduceat
Handwritten reduceat function with the following logsumexp functions:
scipy's implemention of logsumexp
logsumexp function in Python (with numba)
Streaming logsumexp function in Python (with numba)
def logsumexp_reduceat(arr, indices, logsum_exp_func):
res = list()
i_start = indices[0]
for cur_index, i in enumerate(indices[1:]):
res.append(logsum_exp_func(arr[i_start:i]))
i_start = i
res.append(logsum_exp_func(arr[i:]))
return res
#numba.jit(nopython=True)
def logsumexp(X):
r = 0.0
for x in X:
r += np.exp(x)
return np.log(r)
#numba.jit(nopython=True)
def logsumexp_stream(X):
alpha = -np.Inf
r = 0.0
for x in X:
if x != -np.Inf:
if x <= alpha:
r += np.exp(x - alpha)
else:
r *= np.exp(alpha - x)
r += 1.0
alpha = x
return np.log(r) + alpha
arr = np.random.uniform(0,0.1, 10000)
log_arr = np.log(arr)
indices = sorted(np.random.randint(0, 10000, 100))
# approach 1
%timeit np.add.reduceat(arr, indices)
12.7 µs ± 503 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# approach 2
%timeit np.logaddexp.reduceat(log_arr, indices)
462 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# approach 3, scipy function
%timeit logsum_exp_reduceat(arr, indices, scipy.special.logsumexp)
3.69 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# approach 3 handwritten logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp)
139 µs ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# approach 3 streaming logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp_stream)
164 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The timeit results show that handwritten logsumexp functions with numba are the fastest options, but are still 10x slower than numpy.add.reduceat.
A few questions:
Are there any other approaches (or tweaks to the options I've presented) which are faster? For instance, is there a way to use a lookup table to compute the logsumexp function?
Why is Sebastian Nowozin's "streaming logsumexp" function not faster than the naive approach?

There is some room for improvement
But never expect logsumexp to be as fast as a standard summation, because exp is quite a expensive operation.
Example
import numpy as np
#from version 0.43 until 0.47 this has to be set before importing numba
#Bug: https://github.com/numba/numba/issues/4689
from llvmlite import binding
binding.set_option('SVML', '-vector-library=SVML')
import numba as nb
#nb.njit(fastmath=True,parallel=False)
def logsum_exp_reduceat(arr, indices):
res = np.empty(indices.shape[0],dtype=arr.dtype)
for i in nb.prange(indices.shape[0]-1):
r = 0.
for j in range(indices[i],indices[i+1]):
r += np.exp(arr[j])
res[i]=np.log(r)
r = 0.
for j in range(indices[-1],arr.shape[0]):
r += np.exp(arr[j])
res[-1]=np.log(r)
return res
Timings
#small example where parallelization doesn't make sense
arr = np.random.uniform(0,0.1, 10_000)
log_arr = np.log(arr)
#use arrays if possible
indices = np.sort(np.random.randint(0, 10_000, 100))
%timeit logsum_exp_reduceat(arr, indices)
#without parallelzation 22 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#with parallelization 84.7 µs ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.add.reduceat(arr, indices)
#4.46 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#large example where parallelization makes sense
arr = np.random.uniform(0,0.1, 1000_000)
log_arr = np.log(arr)
indices = np.sort(np.random.randint(0, 1000_000, 100))
%timeit logsum_exp_reduceat(arr, indices)
#without parallelzation 1.57 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#with parallelization 409 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.add.reduceat(arr, indices)
#340 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Computing a slightly different matrix multiplication

I'm trying to find the best way to compute the minimum element wise products between two sets of vectors. The usual matrix multiplication C=A#B computes Cij as the sum of the pairwise products of the elements of the vectors Ai and B^Tj. I would like to perform instead the minimum of the pairwise products. I can't find an efficient way to do this between two matrices with numpy.
One way to achieve this would be to generate the 3D matrix of the pairwise products between A and B (before the sum) and then take the minimum over the third dimension. But this would lead to a huge memory footprint (and I actually dn't know how to do this).
Do you have any idea how I could achieve this operation ?
Example:
A = [[1,1],[1,1]]
B = [[0,2],[2,1]]
matrix matmul:
C = [[1*0+1*2,1*2+1*1][1*0+1*2,1*2+1*1]] = [[2,3],[2,3]]
minimum matmul:
C = [[min(1*0,1*2),min(1*2,1*1)][min(1*0,1*2),min(1*2,1*1)]] = [[0,1],[0,1]]

Use broadcasting after extending A to 3D -
A = np.asarray(A)
B = np.asarray(B)
C_out = np.min(A[:,None]*B,axis=2)
If you care about memory footprint, use numexpr module to be efficient about it -
import numexpr as ne
C_out = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
Timings on large arrays -
In [12]: A = np.random.rand(200,200)
In [13]: B = np.random.rand(200,200)
In [14]: %timeit np.min(A[:,None]*B,axis=2)
34.4 ms ± 614 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
29.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: A = np.random.rand(300,300)
In [17]: B = np.random.rand(300,300)
In [18]: %timeit np.min(A[:,None]*B,axis=2)
113 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
102 ms ± 691 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, there's some improvement with numexpr, but maybe not as much I was expecting it to be.

Numba can be also an option
I was a bit surprised of the not particularly good Numexpr Timings, so I tried a Numba Version. For large Arrays this can be optimized further. (Quite the same principles like for a dgemm can be applied)
import numpy as np
import numba as nb
import numexpr as ne
#nb.njit(fastmath=True,parallel=True)
def min_pairwise_prod(A,B):
assert A.shape[1]==B.shape[1]
res=np.empty((A.shape[0],B.shape[0]))
for i in nb.prange(A.shape[0]):
for j in range(B.shape[0]):
min_prod=A[i,0]*B[j,0]
for k in range(B.shape[1]):
prod=A[i,k]*B[j,k]
if prod<min_prod:
min_prod=prod
res[i,j]=min_prod
return res
Timings
A=np.random.rand(300,300)
B=np.random.rand(300,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
5.56 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
26 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
87.7 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
110 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A=np.random.rand(1000,300)
B=np.random.rand(1000,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
50.6 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
296 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
992 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
1.27 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Efficiency problem of customizing numpy's vectorized operation

I have a python function given below:
def myfun(x):
if x > 0:
return 0
else:
return np.exp(x)
where np is the numpy library. I want to make the function vectorized in numpy, so I use:
vec_myfun = np.vectorize(myfun)
I did a test to evaluate the efficiency. First I generate a vector of 100 random numbers:
x = np.random.randn(100)
Then I run the following code to obtain the runtime:
%timeit np.exp(x)
%timeit vec_myfun(x)
The runtime for np.exp(x) is 1.07 µs ± 24.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each).
The runtime for vec_myfun(x) is 71.2 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
My question is: compared to np.exp, vec_myfun has only one extra step to check the value of $x$, but it runs much slowly than np.exp. Is there an efficient way to vectorize myfun to make it as efficient as np.exp?

Use np.where:
>>> x = np.random.rand(100,)
>>> %timeit np.exp(x)
1.22 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %timeit np.where(x > 0, 0, np.exp(x))
4.09 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For comparison, your vectorized function runs in about 30 microseconds on my machine.
As to why it runs slower, it's just much more complicated than np.exp. It's doing lots of type deduction, broadcasting, and possibly making many calls to the actual method. Much of this happens in Python itself, while nearly everything in the call to np.exp (and the np.where version here) is in C.

ufunc like np.exp have a where parameter, which can be used as:
In [288]: x = np.random.randn(10)
In [289]: out=np.zeros_like(x)
In [290]: np.exp(x, out=out, where=(x<=0))
Out[290]:
array([0. , 0. , 0. , 0. , 0.09407685,
0.92458328, 0. , 0. , 0.46618914, 0. ])
In [291]: x
Out[291]:
array([ 0.37513573, 1.75273458, 0.30561659, 0.46554985, -2.3636433 ,
-0.07841215, 2.00878429, 0.58441085, -0.76316384, 0.12431333])
This actually skips the calculation where the where is false.
In contrast:
np.where(arr > 0, 0, np.exp(arr))
calculates np.exp(arr) first for all arr (that's normal Python evaluation order), and then performs the where selection. With this exp that isn't a big deal, but with log it could be problems.

Just thinking outside of the box, what about implementing a function piecewise_exp() that basically multiplies np.exp() with arr < 0?
import numpy as np
def piecewise_exp(arr):
return np.exp(arr) * (arr < 0)
Writing the code proposed so far as functions:
#np.vectorize
def myfun(x):
if x > 0:
return 0.0
else:
return np.exp(x)
def bnaeker_exp(arr):
return np.where(arr > 0, 0, np.exp(arr))
And testing that everything is consistent:
np.random.seed(0)
# : test that the functions have the same behavior
num = 10
x = np.random.rand(num) - 0.5
print(x)
print(myfun(x))
print(piecewise_exp(x))
print(bnaeker_exp(x))
Doing some micro-benchmarks for small inputs:
# : micro-benchmarks for small inputs
num = 100
x = np.random.rand(num) - 0.5
%timeit np.exp(x)
# 1.63 µs ± 45.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit myfun(x)
# 54 µs ± 967 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bnaeker_exp(x)
# 4 µs ± 87.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit piecewise_exp(x)
# 3.38 µs ± 59.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
... and for larger inputs:
# : micro-benchmarks for larger inputs
num = 100000
x = np.random.rand(num) - 0.5
%timeit np.exp(x)
# 32.7 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit myfun(x)
# 44.9 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bnaeker_exp(x)
# 481 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit piecewise_exp(x)
# 149 µs ± 2.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This shows that piecewise_exp() is faster than anything else proposed so far, especially for larger inputs for which np.where() gets more inefficient since it uses integer indexing instead of boolean masks, and reasonably approaches np.exp() speed.
EDIT
Also, the performances of the np.where() version (bnaeker_exp()) do depend on the number of elements of the array actually satisfying the condition. If none of them does (like when you test on x = np.random.rand(100)), this is slightly faster than the boolean array multiplication version (piecewise_exp()) (128 µs ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) on my machine for n = 100000).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why isn't broadcasting with numpy faster than a nested loop - python

Related

How to change a column in a matrix made of tuples?

Why is performance of in-place modification to a numpy array related to the order of dimension being modified?

Fastest Python log-sum-exp in a 'reduceat'

Computing a slightly different matrix multiplication

Efficiency problem of customizing numpy's vectorized operation

Categories

Resources