Is there a way to speed up the following two lines of code?
choice = np.argmax(cust_profit, axis=0)
taken = np.array([np.sum(choice == i) for i in range(n_pr)])
%timeit np.argmax(cust_profit, axis=0)
37.6 µs ± 222 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([np.sum(choice == i) for i in range(n_pr)])
40.2 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n_pr == 2
cust_profit.shape == (n_pr+1, 2000)
Solutions:
%timeit np.unique(choice, return_counts=True)
53.7 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.histogram(choice, bins=np.arange(n_pr + 2))
70.5 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.bincount(choice)
7.4 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
These microseconds worry me, cause this code locates under two layers of scipy.optimize.minimize(method='Nelder-Mead'), that locates in double nested loop, so 40µs equals 4 hours. And I think to wrap it all in genetic search.
The first line seems pretty straightforward. Unless you can sort the data or something like that, you are stuck with the linear lookup in np.argmax. The second line can be sped up simply by using numpy instead of vanilla python to implement it:
v, counts = np.unique(choice, return_counts=True)
Alternatively:
counts = np.histogram(choice, bins=np.arange(n_pr + 2))
A version of histogram optimized for integers also exists:
count = np.bincount(choice)
The latter two options are better if you want to guarantee that the bins include all possible values of choice, regardless of whether they are actually present in the array or not.
That being said, you probably shouldn't worry about something that takes microseconds.
As part of a statistical programming package, I need to add log-transformed values together with the LogSumExp Function. This is significantly less efficient than adding unlogged values together.
Furthermore, I need to add values together using the numpy.ufunc.reduecat functionality.
There are various options I've considered, with code below:
(for comparison in non-log-space) use numpy.add.reduceat
Numpy's ufunc for adding logged values together: np.logaddexp.reduceat
Handwritten reduceat function with the following logsumexp functions:
scipy's implemention of logsumexp
logsumexp function in Python (with numba)
Streaming logsumexp function in Python (with numba)
def logsumexp_reduceat(arr, indices, logsum_exp_func):
res = list()
i_start = indices[0]
for cur_index, i in enumerate(indices[1:]):
res.append(logsum_exp_func(arr[i_start:i]))
i_start = i
res.append(logsum_exp_func(arr[i:]))
return res
#numba.jit(nopython=True)
def logsumexp(X):
r = 0.0
for x in X:
r += np.exp(x)
return np.log(r)
#numba.jit(nopython=True)
def logsumexp_stream(X):
alpha = -np.Inf
r = 0.0
for x in X:
if x != -np.Inf:
if x <= alpha:
r += np.exp(x - alpha)
else:
r *= np.exp(alpha - x)
r += 1.0
alpha = x
return np.log(r) + alpha
arr = np.random.uniform(0,0.1, 10000)
log_arr = np.log(arr)
indices = sorted(np.random.randint(0, 10000, 100))
# approach 1
%timeit np.add.reduceat(arr, indices)
12.7 µs ± 503 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# approach 2
%timeit np.logaddexp.reduceat(log_arr, indices)
462 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# approach 3, scipy function
%timeit logsum_exp_reduceat(arr, indices, scipy.special.logsumexp)
3.69 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# approach 3 handwritten logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp)
139 µs ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# approach 3 streaming logsumexp
%timeit logsumexp_reduceat(log_arr, indices, logsumexp_stream)
164 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The timeit results show that handwritten logsumexp functions with numba are the fastest options, but are still 10x slower than numpy.add.reduceat.
A few questions:
Are there any other approaches (or tweaks to the options I've presented) which are faster? For instance, is there a way to use a lookup table to compute the logsumexp function?
Why is Sebastian Nowozin's "streaming logsumexp" function not faster than the naive approach?
There is some room for improvement
But never expect logsumexp to be as fast as a standard summation, because exp is quite a expensive operation.
Example
import numpy as np
#from version 0.43 until 0.47 this has to be set before importing numba
#Bug: https://github.com/numba/numba/issues/4689
from llvmlite import binding
binding.set_option('SVML', '-vector-library=SVML')
import numba as nb
#nb.njit(fastmath=True,parallel=False)
def logsum_exp_reduceat(arr, indices):
res = np.empty(indices.shape[0],dtype=arr.dtype)
for i in nb.prange(indices.shape[0]-1):
r = 0.
for j in range(indices[i],indices[i+1]):
r += np.exp(arr[j])
res[i]=np.log(r)
r = 0.
for j in range(indices[-1],arr.shape[0]):
r += np.exp(arr[j])
res[-1]=np.log(r)
return res
Timings
#small example where parallelization doesn't make sense
arr = np.random.uniform(0,0.1, 10_000)
log_arr = np.log(arr)
#use arrays if possible
indices = np.sort(np.random.randint(0, 10_000, 100))
%timeit logsum_exp_reduceat(arr, indices)
#without parallelzation 22 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#with parallelization 84.7 µs ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.add.reduceat(arr, indices)
#4.46 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#large example where parallelization makes sense
arr = np.random.uniform(0,0.1, 1000_000)
log_arr = np.log(arr)
indices = np.sort(np.random.randint(0, 1000_000, 100))
%timeit logsum_exp_reduceat(arr, indices)
#without parallelzation 1.57 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#with parallelization 409 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.add.reduceat(arr, indices)
#340 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Is there any standard library/numpy equivalent of the following function:
def augmented_assignment_sum(iterable, start=0):
for n in iterable:
start += n
return start
?
While sum(ITERABLE) is very elegant, it uses + operator instead of +=, which in case of np.ndarray objects may affect performance.
I have tested that my function may be as fast as sum() (while its equivalent using + is much slower). As it is a pure Python function, I guess its performance is still handicapped, thus I am looking for some alternative:
In [49]: ARRAYS = [np.random.random((1000000)) for _ in range(100)]
In [50]: def not_augmented_assignment_sum(iterable, start=0):
...: for n in iterable:
...: start = start + n
...: return start
...:
In [51]: %timeit not_augmented_assignment_sum(ARRAYS)
63.6 ms ± 8.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [52]: %timeit sum(ARRAYS)
31.2 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [53]: %timeit augmented_assignment_sum(ARRAYS)
31.2 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [54]: %timeit not_augmented_assignment_sum(ARRAYS)
62.5 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [55]: %timeit sum(ARRAYS)
37 ms ± 9.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [56]: %timeit augmented_assignment_sum(ARRAYS)
27.7 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have tried to use functools.reduce combined with operator.iadd, but its performace is similar:
In [79]: %timeit reduce(iadd, ARRAYS, 0)
33.4 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [80]: %timeit reduce(iadd, ARRAYS, 0)
29.4 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I am also interested in memory efficiency, thus prefer augmented assignments as they not require creation of intermediate objects.
The answer to the headline question --- I hope #Martijn Pieters will forgive my choice of metaphor --- straight from the horse's mouth is: No, there is no such builtin.
If we allow for a few lines of code to implement such an equivalent we get a rather complicated picture with what is fastest very much depending on operand size:
This graph shows timings of different methods relative to sum over operand size, number of terms is always 100. augmented_assignment_sum starts paying off towards relatively large operand sizes. Using scipy.linalg.blas.*axpy looks pretty competitive over most of the range tested, its main drawback being being way less easy to use than sum.
Code:
from simple_benchmark import BenchmarkBuilder, MultiArgument
import numpy as np
from scipy.linalg import blas
B = BenchmarkBuilder()
#B.add_function()
def augmented_assignment_sum(iterable, start=0):
for n in iterable:
start += n
return start
#B.add_function()
def not_augmented_assignment_sum(iterable, start=0):
for n in iterable:
start = start + n
return start
#B.add_function()
def plain_sum(iterable, start=0):
return sum(iterable,start)
#B.add_function()
def blas_sum(iterable, start=None):
iterable = iter(iterable)
if start is None:
try:
start = next(iterable).copy()
except StopIteration:
return 0
try:
f = {np.dtype('float32'):blas.saxpy,
np.dtype('float64'):blas.daxpy,
np.dtype('complex64'):blas.caxpy,
np.dtype('complex128'):blas.zaxpy}[start.dtype]
except KeyError:
f = blas.daxpy
start = start.astype(float)
for n in iterable:
f(n,start)
return start
#B.add_arguments('size of terms')
def argument_provider():
for exp in range(1,21):
sz = int(2**exp)
yield sz,[np.random.randn(sz) for _ in range(100)]
r = B.run()
r.plot(relative_to=plain_sum)
import pylab
pylab.savefig('inplacesum.png')
What is the best way to count the rows in a 2d numpy array that include all values of another 1d numpy array? The 2nd array can have more columns than the length of the 1d array.
elements = np.arange(4).reshape((2, 2))
test_elements = [2, 3]
somefunction(elements, test_elements)
I would expect the function to return 1.
elements = np.arange(15).reshape((5, 3))
# array([[ 0, 1, 2],
# [ 3, 4, 5],
# [ 6, 7, 8],
# [ 9, 10, 11],
# [12, 13, 14]])
test_elements = [4, 3]
somefunction(elements, test_elements)
Should also return 1.
All elements of the 1d array must be included. If only a few elements are found in a row, it doesn't count. Hence:
elements = np.arange(15).reshape((5, 3))
# array([[ 0, 1, 2],
# [ 3, 4, 5],
# [ 6, 7, 8],
# [ 9, 10, 11],
# [12, 13, 14]])
test_elements = [3, 4, 10]
somefunction(elements, test_elements)
Should also return 0.
Create a boolean array of elements found then use any row-wise this will avoid multiple values in the same row and at last count the rows by using sum,
np.any(np.isin(elements, test), axis=1).sum()
Output
>>> elements
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
>>> test = [1, 6, 7, 4]
>>> np.any(np.isin(elements, test), axis=1).sum()
3
There's probably a more efficient solution, but if you want the rows where "all" elements of test_elements are present, you can reverse np.isin and apply it along each row, with the following:
np.apply_along_axis(lambda x: np.isin(test_elements, x), 1, elements).all(1).sum()
A slightly more efficient (but less readable) variant of #norok2's solution is the following.
sum(map(set(test_elements).issubset, elements))
(EDIT: OK, now I actually had a bit more time to figure out what is going on.)
The are two issues here:
the computational complexity depends on the sizes of both inputs and it is not captured well by a 1D benchmark plot
the actual timing are dominated by variation in the inputs
The problem can be separated in two parts:
looping through the rows
performing the subset check, which is basically a nested-loop quadratic operation (in the worst-case scenario)
We know that, for sufficiently large inputs, looping through the rows is faster in NumPy and slower in pure Python.
For reference, let's consider these two approaches:
# pure Python approach
def all_in_by_row_flt(arr, elems=ELEMS):
return sum(1 for row in arr if all(e in row for e in elems))
# NumPy apprach (based on #Mstaino answer)
def all_in_by_row_np(arr, elems=ELEMS):
def _aaa_helper(row, e=elems):
return np.isin(e, row)
return np.sum(np.all(np.apply_along_axis(_aaa_helper, 1, arr), 1))
Then, considering the subset check operation, if the input is such that the check is performed within fewer loops, pure Python looping gets faster than NumPy. Conversely, if a sufficiently large number of loops is required, then NumPy can actually be faster.
On top of this, there is the looping through the rows, but because the subset check operation is quadratic AND the have different constant coefficients, there are situations for which, despite the rows-looping being faster in NumPy (because the number of rows would be sufficiently large), the overall operation is faster in pure Python.
This was the situation I was running into in the earlier benchmarks, and corresponds to the situation where the subset check is always (or almost) False and it does fail within few loops.
As soon as the subset check starts requiring more loops, the Python only approach begins to lag behind and for the situation where the subset check is actually True for most (if not all) the rows, the NumPy approach is actually faster.
Another key difference between the NumPy and the pure Python approach is that pure Python uses lazy evaluation and NumPy does not, and actually require the creation of potentially large intermediate objects that slow down the computation.
On top of this, NumPy iterates over the rows twice (one in sum() and one in np.apply_along_axis()), while the pure Python approaches only once.
Other approaches using set().issubset() like e.g. from #GZ0 answer:
def all_in_by_row_set(arr, elems=ELEMS):
elems = set(elems)
return sum(map(elems.issubset, row))
have different timings than the explicitly writing of the nested-loop when it comes to subset checking, but they still suffer from slower outer looping.
So, what's next?
The answer is to use Cython or Numba.
The idea is to get NumPy-like (read: C) speed all the times (and not only for sufficiently large inputs), lazy evaluation and minimal number of looping through the rows.
An example of a Cython approach (as implemented in IPython, using the %load_ext Cython magic) is:
%%cython --cplus -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True
cdef long all_in_by_row_c(long[:, :] arr, long[:] elems) nogil:
cdef long result = 0
I = arr.shape[0]
J = arr.shape[1]
K = elems.shape[0]
for i in range(I):
is_subset = True
for k in range(K):
is_contained = False
for j in range(J):
if elems[k] == arr[i, j]:
is_contained = True
break
if not is_contained:
is_subset = False
break
result += 1 if is_subset else 0
return result
def all_in_by_row_cy(long[:, :] arr, long[:] elems):
return all_in_by_row_c(arr, elems)
While a similar Numba code reads:
import numba as nb
#nb.jit(nopython=True, nogil=True)
def all_in_by_row_jit(arr, elems=ELEMS):
result = 0
n_rows, n_cols = arr.shape
for i in range(n_rows):
is_subset = True
for e in elems:
is_contained = False
for r in arr[i, :]:
if e == r:
is_contained = True
break
if not is_contained:
is_subset = False
break
result += 1 if is_subset else 0
return result
Now, time-wise we get to the following (for relatively small number of rows):
arr.shape=(100, 1000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy 120 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_jit 129 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_flt 2.44 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_set 9.98 ms ± 52.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_np 13.7 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
arr.shape=(100, 2000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy 1.45 ms ± 24.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Func: all_in_by_row_jit 1.52 ms ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Func: all_in_by_row_flt 30.1 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_set 19.8 ms ± 56.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_np 18 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
arr.shape=(100, 3000) elems.shape=(1000,) result=37
Func: all_in_by_row_cy 10.4 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_jit 10.9 ms ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_flt 226 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 30.5 ms ± 92.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_np 21.9 ms ± 87.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
arr.shape=(100, 4000) elems.shape=(1000,) result=86
Func: all_in_by_row_cy 16.8 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_jit 17.7 ms ± 42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_flt 385 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 39.5 ms ± 588 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_np 25.7 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now that the slow down of the last block cannot be explained by the increased input size in the second dimension.
Actually, if the short-circuit rate is increased (e.g. by changing the values range of the random arrays), for the last block (same input sizes) one gets:
arr.shape=(100, 4000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy 152 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_jit 173 µs ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_flt 556 µs ± 8.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Func: all_in_by_row_set 39.7 ms ± 287 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_np 31.5 ms ± 315 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that set()-based method is kind of independent on the short-circuit rate (because of the hash-based implementation which has ~O(1) check for presence complexity, but this comes at the expenses of hashing pre-computation and these results indicate this might not be faster than the direct nested-looping approach).
Finally, for larger rows counts :
arr.shape=(100000, 1000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy 141 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_jit 150 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_flt 2.6 s ± 28.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 10.1 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np 13.7 s ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
arr.shape=(100000, 2000) elems.shape=(1000,) result=34
Func: all_in_by_row_cy 1.2 s ± 753 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_jit 1.27 s ± 7.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_flt 24.1 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 19.5 s ± 270 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np 18 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
arr.shape=(100000, 3000) elems.shape=(1000,) result=33859
Func: all_in_by_row_cy 9.79 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_jit 10.3 s ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_flt 3min 30s ± 1.13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 30 s ± 57.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np 21.9 s ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
arr.shape=(100000, 4000) elems.shape=(1000,) result=86376
Func: all_in_by_row_cy 17 s ± 30.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_jit 17.9 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_flt 6min 29s ± 293 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 38.9 s ± 33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np 25.7 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Finally, note that the Cython/Numba code may be algorithmically optimized.
Is there any downside to validating args this way:
if x in [1,2,3]:
...
Or is it better to do the more traditional way:
if x == 1 or x == 2 or x == 3:
...
The only downside I could see is performance.
On my machine, windows 8.1, python 3.7.3, via ipython, I get:
x = 4
%timeit x in [1,2,3]
183 ns ± 19 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit x == 1 or x == 2 or x == 3
295 ns ± 19.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So it seems both cleaner and faster.
FYI, using a set, I get:
%timeit x in {1,2,3}
116 ns ± 7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)