I am testing Numba performance on a function that takes a NumPy array, comparing it against plain Python:
import numpy as np
from numba import jit, vectorize, float64
import time
from numba.core.errors import NumbaWarning
import warnings
warnings.simplefilter('ignore', category=NumbaWarning)
@jit(nopython=True, boundscheck=False) # Set "nopython" mode for best performance, equivalent to @njit
def go_fast(a): # Function is compiled to machine code when called the first time
    trace = 0.0
    for i in range(a.shape[0]): # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace # Numba likes NumPy broadcasting
class Main(object):
    def __init__(self) -> None:
        super().__init__()
        self.mat = np.arange(100000000, dtype=np.float64).reshape(10000, 10000)

    def my_run(self):
        st = time.time()
        trace = 0.0
        for i in range(self.mat.shape[0]):
            trace += np.tanh(self.mat[i, i])
        res = self.mat + trace
        print('Python Duration: ', time.time() - st)
        return res

    def jit_run(self):
        st = time.time()
        res = go_fast(self.mat)
        print('Jit Duration: ', time.time() - st)
        return res

obj = Main()
x1 = obj.my_run()
x2 = obj.jit_run()
The output is:
Python Duration: 0.2164750099182129
Jit Duration: 0.5367801189422607
How can I obtain an enhanced version of this example?
The slower execution time of the Numba implementation is due to compilation time: Numba compiles the function when it is first called (and recompiles only if the argument types change). It does that because it cannot know the types of the arguments before the function is called. Fortunately, you can specify the argument types so Numba can compile the function directly (when the decorator is executed). Here is the resulting code:
@njit('float64[:,:](float64[:,:])')
def go_fast(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace
Note that njit is a shortcut for jit with nopython=True, and that boundscheck is already set to False by default (see the documentation).
On my machine this results in the same execution time for both NumPy and Numba. Indeed, the execution time is not bounded by the computation of the tanh function; it is bounded by the expression a + trace (for both Numba and NumPy). The same execution time is expected since both implement this the same way: they create a temporary new array to perform the addition. Creating a new temporary array is expensive because of page faults and the use of RAM (a is fully read from RAM and the temporary array is fully stored to RAM). If you want a faster computation, you need to perform the operation in place (this prevents page faults and expensive cache-line write allocations on x86 platforms).
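For illustration, here is a minimal sketch of what an in-place variant could look like (the name go_fast_inplace and the explicit nested loops are mine, not from the original code; the function mutates its argument instead of allocating a result array):

import numpy as np
from numba import njit

@njit('void(float64[:,:])')
def go_fast_inplace(a):
    # Accumulate the trace first, then add it to every element in place,
    # avoiding the temporary array created by "a + trace".
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            a[i, j] += trace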
Related
I am trying to loop over two 2D arrays in Cython. The arrays have the following shapes:
ranges_1 is a 6000x3 array of int64, while ranges_2 is a 2000x2 array of int64. This iteration needs to be performed around 10000 times, which means the total number of calculations inside the nested for loop would be around 2000x6000x10000 = 120 billion.
This is the code I am using to generate the "dummy" data:
import numpy as np
ranges_1 = np.stack([np.random.randint(0, 10_000, 6_000), np.random.randint(0, 10_000, 6_000), np.arange(0, 6_000)], axis=1)
ranges_2 = np.stack([np.random.randint(0, 10_000, 2_000), np.random.randint(0, 10_000, 2_000)], axis=1)
Which gives 2 arrays like these:
array([[6131, 1478,    0],
       [9317, 7263,    1],
       [7938, 6249,    2],
       ...,
       [5153,  426, 5997],
       [9164, 9211, 5998],
       [1695, 1792, 5999]])
and:
array([[ 433,  558],
       [3420, 2494],
       [6367, 7916],
       ...,
       [8693, 1692],
       [1256, 9013],
       [4096, 1860]])
The first implementation I tried is a "naive" version and it's the following (the function inside is just a test function which uses all the data in the array):
import numpy as np
cimport numpy as np
cimport cython

ctypedef np.int_t DTYPE_t

def test_func(np.ndarray[DTYPE_t, ndim=2] ranges_1, np.ndarray[DTYPE_t, ndim=2] ranges_2, int n):
    k = 0
    for i in range(n):
        for j in range(len(ranges_1)):
            r1 = ranges_1[j]
            a = r1[0]
            b = r1[1]
            c = r1[2]
            for f in range(len(ranges_2)):
                r2 = ranges_2[f]
                d = r2[0]
                e = r2[1]
                k = (a + b + c + d + e)/(d+e)
    return k
This takes about 5 seconds for each of the 10_000 outer iterations.
So I then tried flattening out the arrays and, since I know the dimension on the other axis, accessing the items like this:
import numpy as np
cimport numpy as np
cimport cython

ctypedef np.int_t DTYPE_t

def test_func_flattened(np.ndarray[DTYPE_t, ndim=1] ranges_1_, np.ndarray[DTYPE_t, ndim=1] ranges_2_, int n):
    k = 0
    for i in range(n):
        for j in range(0, len(ranges_1_), 3):
            a = ranges_1_[j]
            b = ranges_1_[j+1]
            c = ranges_1_[j+2]
            for f in range(0, len(ranges_2_), 2):
                d = ranges_2_[f]
                e = ranges_2_[f+1]
                k = (a + b + c + d + e)/(d+e)
    return k
But that provided no speedup at all. The time to perform one single iteration of the 10_000 seems too high, considering that a single iteration is just 12_000_000 operations inside the loop. I also tried implementing a much simpler example, both in Cython and in Python, the latter compiled with Numba:
import numpy as np
cimport numpy as np
cimport cython

ctypedef np.int_t DTYPE_t

def test_1(int n):
    cdef k = 0
    cdef a = 0
    for i in range(n):
        a = i + 1
    return a
This took 15s to run with n = 1_000_000_000.
While with numba:
import numba

def test_1_python(n):
    k = 0
    a = 0
    for i in range(n):
        if i % 2 == 0:
            a = a + 1
        else:
            a = a - 1
    return a

test_1_numba = numba.jit(test_1_python)

%%time
test_1_numba(120_000_000_000)
The full run with n = 120 billion took about 6s (albeit the function inside is simpler); this means it would be 500 times faster than Cython. Could this be possible?
I am new to Cython so I am probably missing something obvious, but since the Numba version (without the array accesses) is that much faster, I think the difference in speed might come from the overhead associated with accessing the items in the array.
Is this a wrong assumption?
If not, what would be the best way of going about looping over a 2D array of integers in Cython?
What you measure in your benchmarks is mainly compilation artefacts and overheads.
First of all, Cython uses the default compiler preferred by the Python stack on your machine. On Linux, it should be GCC. On Windows, it is certainly MSVC if installed, otherwise MinGW (if any). Meanwhile, Numba is based on LLVM-Lite, which is based on the LLVM stack like Clang. Thus, in your case, it is very likely that different compilers are used, resulting in different binaries with different performance. If you want to make a fair benchmark, you need to use Clang to build your Cython program.
Additionally, the default optimization level for Cython is -O2 while it is -O3 for Numba. The former should not enable auto-vectorization while the latter does (this is dependent on the target compiler -- newer versions of GCC change this behaviour). Furthermore, Cython does not enable machine-specific non-portable optimizations by default (since binaries may be packaged for other machines, as with pip). This means Cython can only use the old SSE2 SIMD instruction set by default on x86-64 processors, while the LLVM JIT can make use of the much faster AVX2/AVX-512 SIMD instruction sets. You need to enable such optimizations manually with Cython for the benchmark to be fair (i.e. -march=native on GCC/Clang).
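To illustrate, such compiler flags can be passed through the Extension used to build the Cython module; here is a minimal setup.py sketch (the module name and file paths are placeholders):

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "test_func",                                  # hypothetical module name
    sources=["test_func.pyx"],
    include_dirs=[np.get_include()],
    extra_compile_args=["-O3", "-march=native"],  # match Numba's optimization level
)

setup(ext_modules=cythonize([ext]))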
In fact, on my mainstream x86-64 Intel machine, Numba does use the AVX2 instruction set on your last benchmark while Cython does not. Here is, for example, the main loop generated by the Numba JIT:
.LBB0_7:
        vptestnmq %ymm5, %ymm4, %k1
        vpblendmq %ymm5, %ymm6, %ymm18 {%k1}
        vpaddq %ymm0, %ymm18, %ymm0
        vpaddq %ymm1, %ymm18, %ymm1
        vpaddq %ymm2, %ymm18, %ymm2
        vpaddq %ymm3, %ymm18, %ymm3
        vpaddq %ymm16, %ymm4, %ymm18
        vptestnmq %ymm5, %ymm18, %k1
        vpblendmq %ymm5, %ymm6, %ymm18 {%k1}
        vpaddq %ymm0, %ymm18, %ymm0
        vpaddq %ymm1, %ymm18, %ymm1
        vpaddq %ymm2, %ymm18, %ymm2
        vpaddq %ymm3, %ymm18, %ymm3
        vpaddq %ymm17, %ymm4, %ymm4
        addq $-2, %rdx
        jne .LBB0_7
The benchmark doing a = i + 1 in a loop is flawed, as a good compiler can simply optimize the whole loop out (i.e. remove it) and replace it with a single assignment, since only the last iteration matters. In fact, the same thing applies to k = (a + b + c + d + e)/(d+e): only the last iteration matters. The i variable of for i in range(n) is not even used. Clang and GCC often perform this kind of optimization.
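One way to defeat such dead-code elimination (a sketch of the general idea; the function name, the int64 typing to match the question's data, and the typed locals are my additions) is to accumulate into the result so that every iteration contributes to the returned value:

import numpy as np
cimport numpy as np
cimport cython

ctypedef np.int64_t DTYPE_t

def test_func_sum(np.ndarray[DTYPE_t, ndim=2] ranges_1, np.ndarray[DTYPE_t, ndim=2] ranges_2, int n):
    cdef double k = 0.0
    cdef Py_ssize_t i, j, f
    cdef DTYPE_t a, b, c, d, e
    for i in range(n):
        for j in range(ranges_1.shape[0]):
            a = ranges_1[j, 0]
            b = ranges_1[j, 1]
            c = ranges_1[j, 2]
            for f in range(ranges_2.shape[0]):
                d = ranges_2[f, 0]
                e = ranges_2[f, 1]
                # Accumulating makes every iteration observable, so the
                # compiler can no longer elide the loop body.
                k += (a + b + c + d + e) / <double>(d + e)
    return k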
Finally, the speed of your initial code will be memory-bound if it is modified to compute something meaningful in a real-world use-case and multiple threads are used.
Note that divisions are very expensive; you can precompute the reciprocal so as to perform multiplications instead in your main loop.
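For instance, in the question's loop (d + e) depends only on the inner index f, so its reciprocal can be hoisted out of the two outer loops (a sketch of the idea, assuming floating-point results are acceptable):

import numpy as np

# Precompute the reciprocal of (d + e) once per row of ranges_2.
inv = 1.0 / (ranges_2[:, 0] + ranges_2[:, 1]).astype(np.float64)

# Inside the inner loop, a cheap multiplication then replaces the division:
#   k += (a + b + c + d + e) * inv[f]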
I have been playing with jax lately, and it is very impressive, but then the following set of experiments confused me greatly:
First, we set up the timer utility:
import time

def timefunc(foo, *args):
    tic = time.perf_counter()
    tmp = foo(*args)
    toc = time.perf_counter()
    print(toc - tic)
    return tmp
Now, let’s see what happens when we compute the eigenvalues of a random symmetric matrix (jnp is jax.numpy, so the eigh is done on the GPU):
def jfunc(n):
    tmp = np.random.randn(n, n)
    return jnp.linalg.eigh(tmp + tmp.T)

def nfunc(n):
    tmp = np.random.randn(n, n)
    return np.linalg.eigh(tmp + tmp.T)
Now for the timings (the machine is an NVIDIA DGX box, so the GPU is an A100, while the CPUs are some AMD EPYC2 parts):
>>> e1 = timefunc(nfunc, 10)
0.0002442029945086688
>>> e2 = timefunc(jfunc, 10)
0.013523647998226807
>>> e1 = timefunc(nfunc, 100)
0.11742364699603058
>>> e2 = timefunc(jfunc, 100)
0.11005625998950563
>>> e1 = timefunc(nfunc, 1000)
0.6572738009999739
>>> e2 = timefunc(jfunc, 1000)
0.5530761769914534
>>> e1 = timefunc(nfunc, 10000)
36.22587636699609
>>> e2 = timefunc(jfunc, 10000)
8.867857075005304
You will notice that the crossover is somewhere around 1000. Initially, I thought this was because of the overhead of moving stuff to/from the GPU, but if you define yet another function:
def jjfunc(n):
    key = jax.random.PRNGKey(0)
    tmp = jax.random.normal(key, [n, n])
    return jnp.linalg.eigh(tmp + tmp.T)
>>> e1=timefunc(jjfunc, 10)
0.01886096798989456
>>> e1=timefunc(jjfunc, 100)
0.2756766739912564
>>> e1=timefunc(jjfunc, 1000)
0.7205733209993923
>>> e1=timefunc(jjfunc, 10000)
6.8624101399909705
Note that the small examples are actually (much) slower than moving the numpy array to the GPU and back.
So, my question is: what is going on, and is there a silver bullet? Is this a jax implementation bug?
I don't think your timings are reflective of actual JAX vs. numpy performance, for a few reasons:
JAX's computation model uses Asynchronous Dispatch, which means that JAX operations return before the computation is finished. As mentioned at that link, you can use the block_until_ready() method to ensure you are timing the computation rather than the dispatch.
Because operations like eigh are JIT-compiled by default, the first time you run them for a given size will incur the one-time compilation cost. Subsequent runs will be faster as JAX caches previous compilations.
Your computations are indeed being foiled by device transfer costs. It's easiest to see if you measure it directly:
def transfer(n):
    tmp = np.random.randn(n, n)
    return jnp.array(tmp).block_until_ready()

timefunc(transfer, 10000);
# 4.600406924000026
Your jjfunc combines the eigh call with the jax.random.normal call. The latter is slower than numpy's random number generation, and I believe is dominating the difference for small n.
Unrelated to JAX, but in general using time.time for profiling Python code can give you misleading results. Modules like timeit are much better for this kind of thing, particularly when you're dealing with microbenchmarks that complete in fractions of a second.
If you're interested in accurate benchmarks of JAX vs. NumPy versions of algorithms, I'd suggest isolating exactly the operations you're interested in benchmarking (i.e. generate the data and do any device transfer outside the benchmarks). Read up on the advice in Asynchronous Dispatch in JAX as it relates to benchmarking, and check out Python's timeit docs for tips on getting accurate timings of small code snippets (though I find the %timeit magic more convenient when working in IPython or a Jupyter notebook).
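Putting these points together, a benchmark along the following lines isolates just the eigh computation (a sketch; the matrix size and names are arbitrary):

import numpy as np
import jax.numpy as jnp

# Generate the data and transfer it to the device outside the timed region.
tmp = np.random.randn(1000, 1000)
mat = jnp.array(tmp + tmp.T)

# Warm-up call so one-time compilation cost is not included in the timing.
jnp.linalg.eigh(mat)[0].block_until_ready()

# Then, in IPython/Jupyter:
# %timeit jnp.linalg.eigh(mat)[0].block_until_ready()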
I would like to generate random vectors of the form
[i for i in range(K) if random.uniform(0, 1) <= probs[i]]
for a K-length array of probabilities probs. Each resulting vector has somewhere between 0 and K elements. Conceptually, this is like flipping K specific coins (with particular probabilities of being heads) and recording which of the coins displayed heads.
The arbitrary return length makes it difficult to use any of the automatic parallelization options in numba. E.g.,
from numba import prange, njit, int64, float64
import numpy as np

@njit([int64[:](float64[:], int64)])
def rand_coin(freqs, r):
    return np.arange(r)[np.random.uniform(0, 1, size=r) <= freqs]

@njit(parallel=True)
def rand_coins(freqs, n):
    r = freqs.shape[0]
    return [rand_coin(freqs, r) for i in range(n)] # **

r = 10; n = 100
freqs = np.random.uniform(0, 1, r)
rand_coins(freqs, n)
works great serially but produces a double free or corruption error if the range in the starred line is replaced with prange.
Is it possible to parallelize functions returning arrays of random lengths in Numba?
prange is a numba function. The typing error is just a generic error from numba, saying that it ran into an issue while compiling the function. The real issue is that you are trying to make a call to a function that isn't declared. You need to use the prange function like so:
from numba import njit, int64, float64
import numba
import numpy as np

@njit([int64[:](float64[:], int64)])
def rand_coin(freqs, r):
    return np.arange(r)[np.random.uniform(0, 1, size=r) <= freqs]

@njit(parallel=True)
def rand_coins(freqs, n):
    r = freqs.shape[0]
    return [rand_coin(freqs, r) for i in numba.prange(n)] # **
How do you optimize this code (without vectorizing, as that requires using the semantics of the calculation, which is quite often far from trivial):
slow_lib.py:
import numpy as np

def foo():
    size = 200
    np.random.seed(1000031212)
    bar = np.random.rand(size, size)
    moo = np.zeros((size, size), dtype=np.float64)
    for i in range(0, size):
        for j in range(0, size):
            val = bar[j]
            moo += np.outer(val, val)
The point is that these kinds of loops correspond quite often to operations where you have double sums over some vector operation.
This is quite slow:
>>t = timeit.timeit('foo()', 'from slow_lib import foo', number = 10)
>>print ("took: "+str(t))
took: 41.165681839
Ok, so then let's cythonize it and add type annotations like there is no tomorrow:
c_slow_lib.pyx:
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def foo():
    cdef int size = 200
    cdef int i, j
    np.random.seed(1000031212)
    cdef np.ndarray[np.double_t, ndim=2] bar = np.random.rand(size, size)
    cdef np.ndarray[np.double_t, ndim=2] moo = np.zeros((size, size), dtype=np.float64)
    cdef np.ndarray[np.double_t, ndim=1] val
    for i in range(0, size):
        for j in range(0, size):
            val = bar[j]
            moo += np.outer(val, val)
>>t = timeit.timeit('foo()', 'from c_slow_lib import foo', number = 10)
>>print ("took: "+str(t))
took: 42.3104710579
... ehr... what? Numba to the rescue!
numba_slow_lib.py:
import numpy as np
from numba import jit

size = 200
np.random.seed(1000031212)
bar = np.random.rand(size, size)

@jit
def foo():
    bar = np.random.rand(size, size)
    moo = np.zeros((size, size), dtype=np.float64)
    for i in range(0, size):
        for j in range(0, size):
            val = bar[j]
            moo += np.outer(val, val)
>>t = timeit.timeit('foo()', 'from numba_slow_lib import foo', number = 10)
>>print("took: "+str(t))
took: 40.7327859402
So is there really no way to speed this up? The point is:
if I convert the inner loop into a vectorized version (building a larger matrix representing the inner loop and then calling np.outer on the larger matrix) I get much faster code.
if I implement something similar in Matlab (R2016a) this performs quite well due to JIT.
Here's the code for outer:
def outer(a, b, out=None):
    a = asarray(a)
    b = asarray(b)
    return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
So each call to outer involves a number of python calls. Those eventually call compiled code to perform the multiplication. But each incurs an overhead that has nothing to do with the size of your arrays.
So 200 (200**2?) calls to outer will have all that overhead, whereas one call to outer with all 200 rows has one overhead set, followed by one fast compiled operation.
cython and numba don't compile or otherwise bypass the Python code in outer. All they can do is streamline the iteration code that you wrote - and that isn't consuming much time.
Without getting into details, the MATLAB jit must be able to replace the 'outer' with faster code - it rewrites the iteration. But my experience with MATLAB dates from a time before its jit.
For real speed improvements with cython and numba you need to use primitive numpy/python code all the way down. Or better yet focus your effort on slow inner pieces.
Replacing your outer with a streamlined version cuts run time about in half:
def foo1(N):
    size = N
    np.random.seed(1000031212)
    bar = np.random.rand(size, size)
    moo = np.zeros((size, size), dtype=np.float64)
    for i in range(0, size):
        for j in range(0, size):
            val = bar[j]
            moo += val[:, None] * val
    return moo
With the full N=200 your function took 17s per loop. If I replace the inner two lines with pass (no calculation), time drops to 3ms per loop. In other words, the outer loop mechanism is not a big time consumer, at least not compared to many calls to outer().
Memory permitting, you can use np.einsum to perform those heavy calculations in a vectorized manner, like so -
moo = size*np.einsum('ij,ik->jk',bar,bar)
One can also use np.tensordot -
moo = size*np.tensordot(bar,bar,axes=(0,0))
Or simply np.dot -
moo = size*bar.T.dot(bar)
Many tutorials and demonstrations of Cython, Numba, etc. make it seem as if these tools can speed up your code automagically, but in practice, this is often not the case: you'll need to modify your code a little to extract the best performance. If you have already implemented some degree of vectorization, extracting the best performance usually means writing out ALL the loops. Reasons NumPy array operations can be non-optimal include:
Lots of temporary arrays are created and looped over;
Significant per-call overhead if the arrays are small;
Short-circuiting logic can't be implemented, because arrays are processed as a whole;
Sometimes the optimal algorithm can't be expressed using array expressions and you settle for an algorithm with a worse time complexity.
Using Numba or Cython won't optimize these problems away! Instead, these tools allow you to write loopy code that is much faster than plain Python.
Also, for Numba specifically, you should be aware of the difference between "object mode" and "nopython mode". The tight loops from your example have to run in nopython mode to provide any significant speedup. However, numpy.outer is not yet supported by Numba, resulting in the function being compiled in object mode. Decorate with @jit(nopython=True) to let such cases throw an exception.
Example to demonstrate a speedup is indeed possible:
import numpy as np
from numba import jit

@jit
def foo_nb(bar):
    size = bar.shape[0]
    moo = np.zeros((size, size))
    for i in range(0, size):
        for j in range(0, size):
            val = bar[j]
            moo += np.outer(val, val)
    return moo

@jit
def foo_nb2(bar):
    size = bar.shape[0]
    moo = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            for k in range(0, size):
                for l in range(0, size):
                    moo[k, l] += bar[j, k] * bar[j, l]
    return moo
size = 100
bar = np.random.rand(size, size)
np.allclose(foo_nb(bar), foo_nb2(bar))
# True
%timeit foo_nb(bar)
# 1 loop, best of 3: 816 ms per loop
%timeit foo_nb2(bar)
# 10 loops, best of 3: 176 ms per loop
The example you show us is a kind of inefficient algorithm, since you calculate the same outer product multiple times. The resulting time complexity is O(n^4); it can be reduced to O(n^3):
for i in range(0, size):
    val = bar[i]
    moo += size * np.outer(val, val)
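Wrapped into a complete function for clarity (my wrapper, keeping the question's setup):

import numpy as np

def foo_n3(bar):
    size = bar.shape[0]
    moo = np.zeros((size, size))
    # The outer product of bar[j] does not depend on the outer index i,
    # so each distinct product is computed once and scaled by size.
    for i in range(size):
        val = bar[i]
        moo += size * np.outer(val, val)
    return moo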
Given a data matrix with discrete entries represented as a 2D numpy array, I'm trying to compute the observed frequencies of some features (the columns) only looking at some instances (the rows of the matrix).
I can do that quite easily with numpy using bincount applied to each slice after having done some fancy slicing. Doing that in pure Python, using an external data structure as a count accumulator, is a double loop in C-style.
import numpy
import numba

try:
    from time import perf_counter
except ImportError:
    from time import time
    perf_counter = time
def estimate_counts_numpy(data,
                          instance_ids,
                          feature_ids):
    """
    WRITEME
    """
    #
    # slicing the data array (probably memory consuming)
    curr_data_slice = data[instance_ids, :][:, feature_ids]
    estimated_counts = []
    for feature_slice in curr_data_slice.T:
        counts = numpy.bincount(feature_slice)
        #
        # checking just for the all 0 case:
        # this is not stable for non-binary datasets TODO: fix it
        if counts.shape[0] < 2:
            counts = numpy.append(counts, [0], 0)
        estimated_counts.append(counts)
    return estimated_counts
@numba.jit(numba.types.int32[:, :](numba.types.int8[:, :],
                                   numba.types.int32[:],
                                   numba.types.int32[:],
                                   numba.types.int32[:],
                                   numba.types.int32[:, :]))
def estimate_counts_numba(data,
                          instance_ids,
                          feature_ids,
                          feature_vals,
                          estimated_counts):
    """
    WRITEME
    """
    #
    # actual counting
    for i, feature_id in enumerate(feature_ids):
        for instance_id in instance_ids:
            estimated_counts[i][data[instance_id, feature_id]] += 1
    return estimated_counts
if __name__ == '__main__':
    #
    # creating a large synthetic matrix, testing for performance
    rand_gen = numpy.random.RandomState(1337)
    n_instances = 2000
    n_features = 2000
    large_matrix = rand_gen.binomial(1, 0.5, (n_instances, n_features))
    #
    # random indexes too
    n_sample = 1000
    rand_instance_ids = rand_gen.choice(n_instances, n_sample, replace=False)
    rand_feature_ids = rand_gen.choice(n_features, n_sample, replace=False)
    binary_feature_vals = [2 for i in range(n_features)]
    #
    # testing
    numpy_start_t = perf_counter()
    e_counts_numpy = estimate_counts_numpy(large_matrix,
                                           rand_instance_ids,
                                           rand_feature_ids)
    numpy_end_t = perf_counter()
    print('numpy done in {0} secs'.format(numpy_end_t - numpy_start_t))

    binary_feature_vals = numpy.array(binary_feature_vals)
    #
    curr_feature_vals = binary_feature_vals[rand_feature_ids]
    #
    # creating a data structure to hold the slices
    # (with numba I cannot use list comprehension?)
    # e_counts_numba = [[0 for val in range(feature_val)]
    #                   for feature_val in curr_feature_vals]
    e_counts_numba = numpy.zeros((n_sample, 2), dtype='int32')

    numba_start_t = perf_counter()
    estimate_counts_numba(large_matrix,
                          rand_instance_ids,
                          rand_feature_ids,
                          binary_feature_vals,
                          e_counts_numba)
    numba_end_t = perf_counter()
    print('numba done in {0} secs'.format(numba_end_t - numba_start_t))
These are the times I get while running the above code:
numpy done in 0.2863295429997379 secs
numba done in 11.55551904299864 secs
My point here is that my implementation is even slower when I try to apply a jit with numba, so I highly suspect I am messing things up.
The reason your function is slow is that Numba has fallen back to object mode to compile the loop.
There are two problems:
Numba doesn't yet support chained indexing of multidimensional arrays, so you need to rewrite this:
estimated_counts[i][data[instance_id, feature_id]]
into this:
estimated_counts[i, data[instance_id, feature_id]]
Your explicit type signature is incorrect. All of your input arrays are actually int64, rather than int8/int32. Rather than fix your signature, you can rely on Numba's automatic JIT to detect the argument types and compile the right version. All you have to do is change the decorator to just @numba.jit. Just make sure you call the function once before you benchmark if you don't want to include compilation time.
With these changes, I benchmark Numba to be about 15% faster than NumPy for this function.
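For reference, here is a sketch of the function with both fixes applied (chained indexing flattened into a single subscript, and the explicit signature dropped in favor of automatic type inference):

import numba

@numba.jit
def estimate_counts_numba(data,
                          instance_ids,
                          feature_ids,
                          feature_vals,
                          estimated_counts):
    for i, feature_id in enumerate(feature_ids):
        for instance_id in instance_ids:
            # Single multidimensional subscript instead of chained indexing.
            estimated_counts[i, data[instance_id, feature_id]] += 1
    return estimated_counts

# Call the function once as a warm-up (compilation happens here),
# then benchmark subsequent calls.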