I ran a few experiments and found a number of cases where Python's standard random and math libraries are faster than their numpy counterparts.
There seems to be a tendency for Python's standard library to be about 10x faster for small-scale operations, while numpy is much faster for large-scale (vector) operations. My guess is that numpy has some overhead which becomes dominant for small cases.
My question is: Is my intuition correct? And is it generally advisable to use the standard library rather than numpy for small (typically scalar) operations?
Examples are below.
import math
import random
import numpy as np
Log and exponential
%timeit math.log(10)
# 158 ns ± 6.16 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit np.log(10)
# 1.64 µs ± 93.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit math.exp(3)
# 146 ns ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit np.exp(3)
# 1.72 µs ± 78.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Generate normal distribution
%timeit random.gauss(0, 1)
# 809 ns ± 12.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.random.normal()
# 2.57 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Choosing a random element
%timeit random.choices([1,2,3], k=1)
# 1.56 µs ± 55.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.random.choice([1,2,3], size=1)
# 23.1 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Same with numpy array
arr = np.array([1,2,3])
%timeit random.choices(arr, k=1)
# 1.72 µs ± 33.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.random.choice(arr, size=1)
# 18.4 µs ± 502 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With big array
arr = np.arange(10000)
%timeit random.choices(arr, k=1000)
# 401 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.random.choice(arr, size=1000)
# 41.7 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
numpy is only really a performance improvement for large blocks of data. The overhead of making sure the memory blocks line up correctly before pouring an ndarray into a c-compiled numpy function will generally overwhelm any time benefit if the array isn't relatively large. This is why so many numpy questions are basically "How do I take this loopy code and make it fast," and why it is considered a valid question in this tag where nearly any other tag will toss you to Code review before they get past the title.
So, yes, your observation is generalizable. Vectorizing is the whole point of numpy. numpy code that isn't vectorized is usually slower than bare python code, and is arguably just as "wrong" as cracking a single walnut with a jackhammer. Either find the right tool or get more nuts.
NumPy is used primarily for performance with arrays. This relies on the use of contiguous memory blocks and more efficient lower-level iteration. Applying a NumPy mathematical function on a scalar or calculating a random number are not vectorisable operations. This explains the behaviour you are seeing.
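A minimal sketch of how the scalar-versus-array trade-off could be measured outside IPython, using the standard `timeit` module. The helper `time_per_call` and the array size of 10,000 are illustrative choices, not taken from the question; the absolute numbers will depend on the machine and library versions.

```python
import math
import timeit

import numpy as np


def time_per_call(func, number=10_000):
    """Return the best average time per call, in seconds, over 3 repeats."""
    return min(timeit.repeat(func, number=number, repeat=3)) / number


values = list(range(1, 10_001))
arr = np.array(values, dtype=float)

timings = {
    # Scalar input: per-call overhead dominates.
    "math.log, scalar": time_per_call(lambda: math.log(10)),
    "np.log, scalar": time_per_call(lambda: np.log(10)),
    # 10,000 inputs: a Python-level loop versus one vectorized call.
    "math.log, 10k-element loop": time_per_call(
        lambda: [math.log(v) for v in values], number=20
    ),
    "np.log, 10k-element array": time_per_call(lambda: np.log(arr), number=20),
}

for label, seconds in timings.items():
    print(f"{label:28s} {seconds * 1e6:10.3f} µs per call")
```

Both approaches compute the same values; only the cost per call and the cost per element differ.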
See also What are the advantages of NumPy over regular Python lists?
And will it be in general advisable to use the standard library rather
than NumPy for small (typically scalar) operations?
It's rare that the bottleneck for a program is caused by operations on scalars. In practice, the differences are negligible. So either way is fine. If you are already using NumPy there's no harm in continuing to use NumPy operations on scalars.
It's worth making a special case of calculating random numbers. As you might expect, the random number selected via random vs NumPy may not be the same:
assert random.gauss(0, 1) == np.random.normal() # AssertionError
assert random.choices(arr, k=1)[0] == np.random.choice(arr, size=1)[0] # AssertionError
NumPy lets you make random numbers "predictable" by seeding the generator (the standard library offers random.seed for the same purpose). For example, running the below script repeatedly will only ever generate the same result:
np.random.seed(0)
np.random.normal()
The same applies to np.random.choice. So there are differences in how the random number is derived and the functionality available. For testing, or other, purposes you may wish to be able to produce consistent "random" numbers.
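As a sketch of the seeding behaviour described above, the following compares three ways of getting repeatable draws: the standard library's `random.seed`, NumPy's legacy global `np.random.seed`, and NumPy's `np.random.default_rng`. The seed value 0 and the sample size 3 are arbitrary.

```python
import random

import numpy as np

# Standard library: re-seeding the module-level generator repeats the sequence.
random.seed(0)
first_py = [random.gauss(0, 1) for _ in range(3)]
random.seed(0)
second_py = [random.gauss(0, 1) for _ in range(3)]

# NumPy legacy global state, as used in the snippet above.
np.random.seed(0)
first_np = np.random.normal(size=3)
np.random.seed(0)
second_np = np.random.normal(size=3)

# NumPy Generator API: independent generator objects, each with its own seed.
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)

print(first_py == second_py)
print(np.array_equal(first_np, second_np))
print(np.array_equal(rng_a.normal(size=3), rng_b.normal(size=3)))
```

The three generators use different algorithms, so equal seeds do not produce equal numbers across them; repeatability holds only within each one.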
If we time two implementations that build Pascal's triangle for a given n, one using a Python for loop and one using NumPy array addition, and plot the time required for inputs n = 2^i, the result is the plot shown at the source below.
source : https://algorithmdotcpp.blogspot.com/2022/01/prove-numpy-is-faster-than-normal-list.html
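The two implementations being compared might look roughly like the sketch below. This is not the code from the linked post; the function names `pascal_rows_loop` and `pascal_rows_numpy` are hypothetical, and both assume n >= 1. Note that the NumPy version uses fixed-width int64 entries, which overflow once the triangle grows past roughly 60 rows, whereas Python integers do not.

```python
import numpy as np


def pascal_rows_loop(n):
    """Build the first n rows with plain Python lists and an explicit loop."""
    rows = [[1]]
    for _ in range(n - 1):
        prev = rows[-1]
        middle = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + middle + [1])
    return rows


def pascal_rows_numpy(n):
    """Build the first n rows by adding two shifted copies of the previous row."""
    rows = [np.array([1], dtype=np.int64)]
    for _ in range(n - 1):
        prev = rows[-1]
        # [0, a, b, c] + [a, b, c, 0] gives the next row in one array addition.
        rows.append(np.concatenate(([0], prev)) + np.concatenate((prev, [0])))
    return rows


print(pascal_rows_loop(5)[-1])
print(pascal_rows_numpy(5)[-1].tolist())
```

For short rows the array version pays NumPy's per-call overhead on every row, so any advantage should only appear once the rows are long.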
I don't understand why such a basic optimization has not yet been done:
In [1]: one_million_ones = np.ones(10**6)
In [2]: %timeit one_million_ones.any()
100 loops, best of 3: 693µs per loop
In [3]: ten_millions_ones = np.ones(10**7)
In [4]: %timeit ten_millions_ones.any()
10 loops, best of 3: 7.03 ms per loop
The whole array is scanned, even if the conclusion is evident from the first item.
It's an unfixed performance regression. NumPy issue 3446. There actually is short-circuiting logic, but a change to the ufunc.reduce machinery introduced an unnecessary chunk-based outer loop around the short-circuiting logic, and that outer loop doesn't know how to short circuit. You can see some explanation of the chunking machinery here.
The short-circuiting effects wouldn't have shown up in your test even without the regression, though. First, you're timing the array creation, and second, I don't think they ever put in the short-circuit logic for any input dtype but boolean. From the discussion, it sounds like the details of the ufunc reduction machinery behind numpy.any would have made that difficult.
The discussion does bring up the surprising point that the argmin and argmax methods appear to short-circuit for boolean input. A quick test shows that as of NumPy 1.12 (not quite the most recent version, but the version currently on Ideone), x[x.argmax()] short-circuits, and it outcompetes x.any() and x.max() for 1-dimensional boolean input no matter whether the input is small or large and no matter whether the short-circuiting pays off. Weird!
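A small sketch of the `x[x.argmax()]` idiom mentioned above, wrapped in a hypothetical helper `any_via_argmax`. It only checks that the result matches `x.any()` for 1-dimensional boolean input; whether it is actually faster depends on the NumPy version, so no timing is claimed here.

```python
import numpy as np


def any_via_argmax(x):
    """Return True if any element of a 1-D boolean array is True."""
    if x.size == 0:
        # argmax raises ValueError on empty input, while any() returns False.
        return False
    # argmax returns the index of the first maximum, i.e. the first True if
    # one exists; for an all-False array it returns 0 and x[0] is False.
    return bool(x[x.argmax()])


all_false = np.zeros(100_000, dtype=bool)
early_true = all_false.copy()
early_true[0] = True

print(any_via_argmax(all_false), all_false.any())
print(any_via_argmax(early_true), early_true.any())
```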
There's a price you pay for short-circuiting. You need to introduce branches in your code.
The problem with branches (e.g. if statements) is that they can be slower than using alternative operations (without branches) and then you also have branch prediction which could include a significant overhead.
Also depending on the compiler and processor the branchless code could use processor vectorization. I'm not an expert in this but maybe some sort of SIMD or SSE?
I'll use numba here because the code is easy to read and it's fast enough so the performance will change based on these small differences:
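import numba as nb
import numpy as np

@nb.njit
def any_sc(arr):
    # Short-circuit: return as soon as a truthy item is found.
    for item in arr:
        if item:
            return True
    return False

@nb.njit
def any_not_sc(arr):
    # No branch in the loop body: always scans the whole array.
    res = False
    for item in arr:
        res |= item
    return res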
arr = np.zeros(100000, dtype=bool)
assert any_sc(arr) == any_not_sc(arr)
%timeit any_sc(arr)
# 126 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit any_not_sc(arr)
# 15.5 µs ± 962 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.1 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It's about 8 times faster in the worst case without branches. But in the best case the short-circuit function is much faster:
arr = np.zeros(100000, dtype=bool)
arr[0] = True
%timeit any_sc(arr)
# 1.97 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit any_not_sc(arr)
# 15.1 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.2 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So the question is which case should be optimized: The best case? The worst case? The average case (what's the average case with any)?
It could be that the NumPy developers wanted to optimize the worst case and not the best case. Or they just didn't care? Or maybe they just wanted "predictable" performance in any case.
Just a note on your code: You measure the time it takes to create an array as well as the time it takes to execute any. If any were short-circuit you wouldn't have noticed it with your code!
%timeit np.ones(10**6)
# 9.12 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.ones(10**7)
# 86.2 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For conclusive timings supporting your question you should have used this instead:
arr1 = np.ones(10**6)
arr2 = np.ones(10**7)
%timeit arr1.any()
# 4.04 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr2.any()
# 39.8 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
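Outside IPython, the same separation between setup and measurement can be sketched with the standard `timeit` module, whose `setup` argument is executed but not timed. The repeat counts below are arbitrary.

```python
import timeit

# The setup string runs once per repeat and is excluded from the measurement.
setup = "import numpy as np; arr = np.ones(10**6)"

# Time only the reduction; the array is created in `setup`.
any_only = min(timeit.repeat("arr.any()", setup=setup, number=20, repeat=3)) / 20

# Time creation and reduction together, for comparison.
create_and_any = min(
    timeit.repeat(
        "np.ones(10**6).any()", setup="import numpy as np", number=20, repeat=3
    )
) / 20

print(f"any() only:         {any_only * 1e3:.3f} ms")
print(f"create, then any(): {create_and_any * 1e3:.3f} ms")
```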
I have a program whose main performance bottleneck involves multiplying matrices which have one dimension of size 1 and another large dimension, e.g. 1000:
large_dimension = 1000
a = np.random.random((1,))
b = np.random.random((1, large_dimension))
c = np.matmul(a, b)
In other words, multiplying matrix b with the scalar a[0].
I am looking for the most efficient way to compute this, since this operation is repeated millions of times.
I tested for performance of the two trivial ways to do this, and they are practically equivalent:
%timeit np.matmul(a, b)
>> 1.55 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a[0] * b
>> 1.77 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Is there a more efficient way to compute this?
Note: I cannot move these computations to a GPU since the program is using multiprocessing and many such computations are done in parallel.
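One detail worth checking before comparing the two expressions: they return the same values but not the same shape. A short sketch, using the array names from the question:

```python
import numpy as np

large_dimension = 1000
a = np.random.random((1,))
b = np.random.random((1, large_dimension))

# matmul treats the 1-D `a` as a row vector and drops the added axis afterwards.
c_matmul = np.matmul(a, b)
# Scalar multiplication broadcasts and keeps the shape of `b`.
c_scalar = a[0] * b

print(c_matmul.shape)  # (1000,)
print(c_scalar.shape)  # (1, 1000)
print(np.allclose(c_matmul, c_scalar[0]))  # True
```

Code that consumes the result may therefore need a reshape when switching from one form to the other.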
large_dimension = 1000
a = np.random.random((1,))
B = np.random.random((1, large_dimension))
%timeit np.matmul(a, B)
5.43 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a[0] * B
5.11 µs ± 6.92 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Use just float
%timeit float(a[0]) * B
3.48 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To avoid memory allocation use "buffer"
buffer = np.empty_like(B)
%timeit np.multiply(float(a[0]), B, buffer)
2.96 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To avoid unnecessary getting attribute use "alias"
mul = np.multiply
%timeit mul(float(a[0]), B, buffer)
2.73 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And I don't recommend using numpy scalars at all,
because if you avoid it, computation will be faster
a_float = float(a[0])
%timeit mul(a_float, B, buffer)
1.94 µs ± 5.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
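The steps above (plain Python float, preallocated buffer, local alias) can be put together in a short sketch that also checks the result. It passes the buffer through the `out` keyword of `np.multiply`, which is equivalent to the positional form used in the timings.

```python
import numpy as np

large_dimension = 1000
a = np.random.random((1,))
B = np.random.random((1, large_dimension))

a_float = float(a[0])      # plain Python float instead of a NumPy scalar
buffer = np.empty_like(B)  # allocated once and reused
mul = np.multiply          # local alias avoids the repeated attribute lookup

# `out=` writes the product into `buffer` instead of allocating a new array.
result = mul(a_float, B, out=buffer)

print(result is buffer)
print(np.allclose(buffer, a[0] * B))
```

Because every call overwrites `buffer`, a result that has to outlive the next call needs to be copied first.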
Furthermore, if possible, initialize the buffer once, outside the loop (assuming you have a loop, of course):
rng = range(1000)
%%timeit
for i in rng:
    pass
24.4 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
for i in rng:
    mul(a_float, B, buffer)
1.91 ms ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So,
"best_iteration_time" = (1.91 - 0.02) / 1000 => 1.89 (µs)
"speedup" = 5.43 / 1.89 = 2.87
In this case, it is probably faster to work with an element-wise multiplication, but the time you see is mostly the overhead of Numpy (calling C functions from the CPython interpreter, wrapping/unwrapping types, making checks, doing the operation, array allocations, etc.).
since this operation is repeated millions of times
This is the problem. The CPython interpreter is poor at low-latency work. This is especially true with Numpy types, because calling C code and performing checks for a trivial operation is much slower than doing the same operation in pure Python, which in turn is much slower than compiled native C/C++ code. If you really need this, and you cannot vectorize your code using Numpy (because you have a loop iterating over timesteps), then you need to move away from CPython, or at least from pure Python code. Instead, you can use Numba or Cython to mitigate the cost of C calls, type wrapping, etc. If this is not enough, you will need to write native C/C++ code (or code in a similar language), unless you find a dedicated Python package that does exactly this for you. Note that Numba is fast only when it works on native types or Numpy arrays (containing native types). If you work with a lot of pure Python types and do not want to rewrite your code, you can try the PyPy JIT.
Here is a simple example in Numba avoiding the (costly) creation/allocation of a new array (as well as many Numpy internal checks and calls) that is specifically written to solve your specific case:
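import numba as nb
import numpy as np

@nb.njit('void(float64[::1],float64[:,::1],float64[:,::1])')
def fastMul(a, b, out):
    val = a[0]
    # Write each product directly into the preallocated output array.
    for i in range(b.shape[1]):
        out[0,i] = b[0,i] * val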
res = np.empty(b.shape, dtype=b.dtype)
%timeit fastMul(a, b, res)
# 397 ns ± 0.587 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
At the time of writing, this solution is faster than all the others. As most of the time is spent in calling Numba and performing some internal checks, using Numba directly for the function containing the iteration loop should result in an even faster code.
import numpy as np
import numba
def matmult_numpy(matrix, c):
    return np.matmul(c, matrix)

@numba.jit(nopython=True)
def matmult_numba(matrix, c):
    return c*matrix

if __name__ == "__main__":
    large_dimension = 1000
    a = np.random.random((1, large_dimension))
    c = np.random.random((1,))
About a factor of 3 speedup using Numba. Numba cognoscenti may be able to do better by explicitly casting the parameter "c" as a scalar
Check: the results of %timeit are
%timeit matmult_numpy(a, c)
2.32 µs ± 50 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit matmult_numba(a, c)
763 ns ± 6.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
To normalize the rows of a matrix X to unit length, I usually use:
X /= np.linalg.norm(X, axis=1, keepdims=True)
Trying to optimize this operation for an algorithm, I was quite surprised to see that writing out the normalization is about 40% faster on my machine:
X /= np.sqrt(X[:,0]**2+X[:,1]**2+X[:,2]**2)[:,np.newaxis]
X /= np.sqrt(sum(X[:,i]**2 for i in range(X.shape[1])))[:,np.newaxis]
How come? Where is the performance lost in np.linalg.norm()?
import numpy as np
X = np.random.randn(10000,3)
%timeit X/np.linalg.norm(X,axis=1, keepdims=True)
# 276 µs ± 4.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit X/np.sqrt(X[:,0]**2+X[:,1]**2+X[:,2]**2)[:,np.newaxis]
# 169 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit X/np.sqrt(sum(X[:,i]**2 for i in range(X.shape[1])))[:,np.newaxis]
# 185 µs ± 4.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I observe this for (1) python3.6 + numpy v1.17.2 and (2) python3.9 + numpy v1.19.3 on a MacbookPro 2015 with OpenBLAS support.
I don't think this is a duplicate of this post, which addresses matrix norms, while this one is about the L2-norm of vectors.
The source code for row-wise L2-norm boils down to the following lines of code:
def norm(x, keepdims=False):
    x = np.asarray(x)
    s = x**2
    return np.sqrt(s.sum(axis=(1,), keepdims=keepdims))
The simplified code assumes real-valued x and makes use of the fact that np.add.reduce(s, ...) is equivalent to s.sum(...).
The OP question therefore is the same as asking why np.sum(x,axis=1) is slower than sum(x[:,i] for i in range(x.shape[1])):
%timeit X.sum(axis=1, keepdims=False)
# 131 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit sum(X[:,i] for i in range(X.shape[1]))
# 36.7 µs ± 91.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This question has been answered already here. In short, the reduction (.sum(axis=1)) comes with overhead costs that generally pay off in terms of floating-point precision and speed (e.g. cache mechanics, parallelism), but don't in the special case of a reduction over just three columns. In this case, the overhead is relatively large compared to the actual computation.
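As a sanity check on the comparison above, the sketch below confirms that the axis reduction and the column-by-column sum give the same row norms up to floating-point rounding, so the difference between them is one of speed rather than of result. The seed and the use of `np.random.default_rng` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 3))

# Reduction along axis 1, as in the simplified norm above.
norms_reduce = np.sqrt((X ** 2).sum(axis=1))

# Column-by-column accumulation with Python's built-in sum.
norms_columns = np.sqrt(sum(X[:, i] ** 2 for i in range(X.shape[1])))

# Normalize the rows to unit length using the column-wise result.
X_unit = X / norms_columns[:, np.newaxis]

print(np.allclose(norms_reduce, norms_columns))
print(np.allclose(norms_reduce, np.linalg.norm(X, axis=1)))
print(np.allclose(np.linalg.norm(X_unit, axis=1), 1.0))
```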
The situation changes if X has more columns. The numpy-boosted normalization now is substantially faster than the reduction using a python for-loop:
X = np.random.randn(10000,100)
%timeit X/np.linalg.norm(X,axis=1, keepdims=True)
# 3.36 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit X/np.sqrt(sum(X[:,i]**2 for i in range(X.shape[1])))[:,np.newaxis]
# 5.92 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another related SO thread is found here: numpy ufuncs vs. for loop.
The question remains why common special cases for reduction (such as the summation over the columns or rows of a matrix with low axis dimension) are not treated by numpy explicitly. Maybe it's because the effect of such optimizations often depends strongly on the target machine and increases code complexity considerably.
This question already has answers here:
Why is numpy list access slower than vanilla python?
I have been programming in pure Python for 2 years.
Now I am learning Numpy and I am confused.
Tutorials give examples showing that Numpy is far more efficient than pure Python, but when I try, for example, a simple iteration:
import numpy as np
import time
start = time.time()
list = range(1000000)
array = np.arange(1000000)
for element in list:
    pass
print('\n'+str((time.time() - start)*1000)+'\n')
start = time.time()
for element in np.nditer(array, order='F'):
    pass
print('\n'+str((time.time() - start)*1000)+'\n')
I got an output:
87.67843246459961
175.25482177734375
As can be seen above, iterating over a Numpy array is far less efficient than iterating in pure Python.
My question is: I do not understand why I should use Numpy and, more importantly, when I should use it.
Numpy is much faster with vector operations.
If you change your code to:
array+=1
instead of:
for element in np.nditer(array, order='F'):
    pass
you can see that numpy vastly outperforms the regular python code
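A minimal sketch of that comparison using the standard `timeit` module, with a smaller array than in the question so that it runs quickly. It first checks that the loop and the vectorized expression produce the same values, then times both; the exact ratio will vary by machine.

```python
import timeit

import numpy as np

values = list(range(100_000))
array = np.arange(100_000)

looped = [v + 1 for v in values]  # one interpreter step per element
vectorized = array + 1            # one call, the loop runs in compiled code

print(np.array_equal(looped, vectorized))

loop_time = min(
    timeit.repeat(lambda: [v + 1 for v in values], number=10, repeat=3)
) / 10
vec_time = min(timeit.repeat(lambda: array + 1, number=10, repeat=3)) / 10

print(f"list comprehension: {loop_time * 1e3:.3f} ms")
print(f"array + 1:          {vec_time * 1e3:.3f} ms")
```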
The strength of numpy is that you don't need to iterate.
There's no problem with iterating, and in fact it can be useful in some cases, but the vast majority of problems can be solved using functions in numpy.
Using your examples (with the %timeit command in ipython), if you do something simple like adding a number to every element of the list numpy is clearly much faster when it is used directly without iterating.
import numpy as np
import time
start = time.time()
dlist = range(1000000)
darray = np.arange(1000000)
# Pure python
%timeit [e + 2 for e in dlist]
59.8 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Iterating numpy
%timeit [e + 2 for e in darray]
193 ms ± 8.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Converting numpy to list before iterating
%timeit [e+2 for e in list(darray)]
198 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Numpy
%timeit darray + 2
847 µs ± 8.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Same thing with more complex operations, like finding the mean:
%timeit sum(dlist)/len(dlist)
16.5 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(darray)/len(darray)
66.6 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Converting to list then iterating
%timeit sum(list(darray))/len(darray)
83.1 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Using numpy's methods
%timeit darray.mean()
1.26 ms ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy is much faster once you understand how to use it. It requires a rather different way of looking at data, and gaining a familiarity with the functions it provides, but once you do it results in simpler and faster code.