Why does Numba skew the timings of a JIT-compiled function?

I'm trying to benchmark a Python function that does list operations with Numba against the plain CPython interpreter. To compare end-to-end time I used the Linux time utility:
time python3.10 list.py
As I understand it, the first invocation will be expensive due to JIT compilation, but that does not explain why the maximum recorded time is longer than the total time taken to run the entire script.
# list.py
import numpy as np
from time import time, perf_counter
from numba import njit

@njit
def listOperations():
    list = []
    for i in range(1000):
        list.append(i)
    list.sort(reverse=True)
    list.remove(420)
    list.reverse()

if __name__ == "__main__":
    repetitions = 1000
    timings = np.zeros(repetitions)
    for rep in range(repetitions):
        start = time()  # Similar results with perf_counter too.
        listOperations()
        timings[rep] = time() - start

    # Convert to milliseconds
    timings *= 10e3
    print("Mean {}ms, Median {}ms, Std. Dev {}ms, Min {}ms, Max {}ms".format(
        float('%.4f' % np.mean(timings)),
        float('%.4f' % np.median(timings)),
        float('%.4f' % np.std(timings)),
        float('%.4f' % np.min(timings)),
        float('%.4f' % np.max(timings)))
    )
For Numba it shows a maximum of ~66.3s, while the time utility reports ~8s. The complete results are below.
'''
Numba --->
Mean 66.8154ms, Median 0.391ms, Std. Dev 2097.7752ms, Min 0.3219ms, Max 66371.1143ms
real 0m7.982s
user 0m8.248s
sys 0m0.100s
CPython3.10 --->
Mean 1.6395ms, Median 1.6284ms, Std. Dev 0.0708ms, Min 1.5759ms, Max 2.3198ms
real 0m1.115s
user 0m1.468s
sys 0m0.080s
'''

The main issue is that the compilation time is included in the timings. Numba compiles functions lazily, on the first call. To prevent this, you must either specify the function signature up front or execute the first call outside of the timed section (which is generally good practice in benchmarks anyway).
You can use @njit('()') instead of @njit. With this fix, the Numba code is about twice as fast on my machine.
Note that your function neither returns anything nor reads any parameter, so the JIT is free to optimize it to a no-op. To avoid this bias, you should add a parameter, use it, and return the list. This apparently does not happen on my machine, but different versions of Numba may do it.
Note also that lists are generally not where Numba shines. Lists are generally slow, both with and without Numba. It is better to use arrays when the size is known.
By the way, list is a Python built-in. Shadowing it can cause sneaky bugs in code that uses it (which is frequent), so this is not a good idea. I advise you to use another name.
Furthermore, note that the standard deviation in your results is very large, the median time is good, and the maximum time is huge, which indicates that the timings are unstable and that this instability comes from a single slow call. Such results generally indicate that the benchmark is flawed or that the function itself behaves unpredictably (typically due to a bug or to an initialization done only once).
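Putting these suggestions together, here is a minimal sketch of what the revised benchmark could look like. It uses the warm-up-call approach rather than an explicit signature; the n parameter, the renamed identifiers and the warm-up call are illustrative additions, and it assumes your Numba version supports returning the list (newer versions may prefer numba.typed.List):
import numpy as np
from numba import njit
from time import perf_counter

@njit
def list_operations(n):
    values = []              # avoid shadowing the built-in list
    for i in range(n):
        values.append(i)
    values.sort(reverse=True)
    values.remove(420)
    values.reverse()
    return values            # returning the result keeps the JIT from removing the work

if __name__ == "__main__":
    list_operations(1000)    # warm-up call: compilation happens here, outside the timings

    repetitions = 1000
    timings = np.zeros(repetitions)
    for rep in range(repetitions):
        start = perf_counter()
        list_operations(1000)
        timings[rep] = perf_counter() - start

    timings *= 1e3           # seconds -> milliseconds
    print("Median {:.4f} ms, Max {:.4f} ms".format(np.median(timings), np.max(timings)))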

Related

Pandas: Why is Series indexing using .loc taking 100x longer on the first run when timing it?

I'm slicing a fairly big pandas Series (~5M elements) using .loc and I stumbled upon some weird behavior while checking times in an attempt to optimize my code.
It's weird that the first slicing attempt, like series_object.loc[some_indexes], takes 100x longer than the following ones.
When I try timeit it does not reflect this behaviour, but when checking the individual laps using time, we can see that the first lap takes much longer than the following ones.
Is .loc using some sort of caching? If so, why is garbage collection not influencing this?
Is timeit doing the caching even with the garbage collector disabled, and not behaving as it's supposed to?
Which timing should I trust for how my app will behave when running in a live production environment?
I tried this on Windows and Linux machines using different versions of Python (3.6, 3.7 and 2.7) and the behavior is always the same.
Thanks in advance for your help. This has been banging my head for a week already and I miss being able to trust %timeit :)
To reproduce:
Save the following code to a Python file, e.g. test_loc_times.py
import pandas as pd
import numpy as np
import timeit
import time, gc

def get_data():
    ids = np.arange(size_bigseries)
    big_series = pd.Series(index=ids, data=np.random.rand(len(ids)), name='{} elements series'.format(len(ids)))
    small_slice = np.arange(size_slice)
    return big_series, small_slice

# Method to test: a simple pandas slicing with .loc
def basic_loc_indexing(pd_series, slice_ids):
    return pd_series.loc[slice_ids].dropna()

# method to time it
def timing_it(func, n, *args):
    gcold = gc.isenabled()
    gc.disable()
    times = []
    for i in range(n):
        s = time.time()
        func(*args)
        times.append((time.time()-s)*1000)
    if gcold:
        gc.enable()
    return times

if __name__ == '__main__':
    import sys
    n_tries = int(sys.argv[1]) if len(sys.argv)>1 and sys.argv[1] is not None else 1000
    size_bigseries = int(sys.argv[2]) if len(sys.argv)>2 and sys.argv[2] is not None else 5000000 #5M
    size_slice = int(sys.argv[3]) if len(sys.argv)>3 and sys.argv[3] is not None else 100 #100

    #1: timeit()
    big_series, small_slice = get_data()
    time_with_timeit = timeit.timeit('basic_loc_indexing(big_series, small_slice)',"gc.disable(); from __main__ import basic_loc_indexing, big_series, small_slice",number=n_tries)
    print("using timeit: {:.6f}ms".format(time_with_timeit/n_tries*1000))
    del big_series, small_slice

    #2: time()
    big_series, small_slice = get_data()
    time_with_time = timing_it(basic_loc_indexing, n_tries, big_series, small_slice)
    print("using time: {:.6f}ms".format(np.mean(time_with_time)))
    print('head detail: {}\n'.format(time_with_time[:5]))
Try it out:
Run
python test_loc_times.py 1000 5000000 100
This will run both timeit and time for 1000 laps, slicing 100 elements from a 5M-element pandas.Series.
You can try it yourself with other values; the first run is always the one that takes longer.
stdout:
>>> using timeit: 0.789754ms
>>> using time: 0.829869ms
>>> head detail: [145.02716064453125, 0.7691383361816406, 0.7028579711914062, 0.5738735198974609, 0.6380081176757812]
Weird right?
edit:
I found this answer which might be related. What do you think?
This code is likely not idempotent (has side effects that impact its execution).
timeit will run the code once first to measure the execution time and deduce the number of loops and runs it should use. If your code is not idempotent (has side effects, like caching), then that first run (which is not recorded) will be longer and the subsequent, faster runs will be the ones measured and reported.
You can take a look at the arguments you can pass to timeit (see the docs) to specify the number of loops and forgo that initial calibration run.
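For example, here is a sketch of how you could record every call individually with the standard timeit module (reusing basic_loc_indexing, big_series and small_slice from the script above), so that an expensive first call shows up instead of being averaged away:
import timeit

# repeat=1000, number=1 times each call separately instead of summing them,
# so a slow first call is visible in the returned list.
per_call = timeit.repeat(
    'basic_loc_indexing(big_series, small_slice)',
    setup='import gc; gc.disable(); from __main__ import basic_loc_indexing, big_series, small_slice',
    repeat=1000,
    number=1,
)
print(per_call[:5])  # the first entry should stand out if the first .loc call does extra work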
Also note that (taken from the doc linked above):
The times reported by %timeit will be slightly higher than those reported by the timeit.py script when variables are accessed. This is due to the fact that %timeit executes the statement in the namespace of the shell, compared with timeit.py, which uses a single setup statement to import function or create variables. Generally, the bias does not matter as long as results from timeit.py are not mixed with those from %timeit.
Edit: I missed the fact that you were passing the number of runs to timeit. In that case, only the latter part of my answer applies, but the numbers you are seeing seem to point to another issue...

For loop vs Numpy vectorization computation time

I was randomly comparing the computation times of an explicit for-loop with a vectorized implementation in numpy. I ran exactly 1 million iterations and found some astounding differences. The for-loop took about 646 ms while the np.exp() function computed the same result in less than 20 ms.
import time
import math
import numpy as np

iter = 1000000
x = np.zeros((iter,1))
v = np.random.randn(iter,1)

before = time.time()
for i in range(iter):
    x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after-before)*1000) + "ms")

before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after-before)*1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
What exactly is happening in the second case? The first one is using an explicit for-loop and thus the computation time is justified.
What is happening "behind the scenes" in the second case?
How can one implement such computations (second case) without using numpy (in plain Python)?
What is happening is that NumPy calls high-quality numerical libraries (BLAS, for instance) which are very good at vector arithmetic.
I imagine you could call the exact libraries used by NumPy directly; however, NumPy likely knows best which one to use.
NumPy is a Python wrapper over libraries and code written in C. This is a large part of NumPy's efficiency. C code compiles directly to instructions that are executed by your processor or GPU, whereas Python code must be interpreted as it executes. Despite the ever-increasing speed we can get from interpreted languages with advances like just-in-time compilers, for some tasks they will never be able to approach the speed of compiled languages.
It comes down to the fact that Python does not have direct access to the hardware level.
Python can't use the SIMD (single instruction, multiple data) instructions that most modern CPUs and GPUs have. These SIMD instructions allow a single operation to be executed on a whole vector of data at once (within a single clock cycle) at the hardware level.
NumPy, on the other hand, has its functions implemented in C, and C is a language capable of using SIMD instructions. Therefore NumPy can take advantage of the vectorization hardware in your processor.
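As for implementing this without NumPy (the third question): a plain-Python version is essentially a per-element loop, for example a list comprehension over math.exp. It avoids NumPy but still performs one interpreted call per element, so it cannot use the SIMD path described above (the tiny list here is just a stand-in for the question's random data):
import math

v = [0.5, -1.2, 3.0]                    # stand-in data; the question uses np.random.randn
x = [math.exp(value) for value in v]    # one interpreted math.exp call per element: no SIMD
print(x)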

Weird behavior of Intel's MKL and Python (Ubuntu 16.04, Conda 4.3.30)

I am trying to improve the performance of NumPy in Python 3.6 using Intel's MKL. With a fresh Anaconda installation I created an MKL environment using:
conda create -n idp intelpython3_core python=3
As written in this article, it seems that MKL has internal thresholds to decide whether to use threading or not. One of these thresholds seems to be the vector size used in the calculations (which is kind of obvious). On my machine this threshold is a vector size of 8192. When vectors exceed this size, I can observe my Python scripts using 4 threads (I have 2 cores with hyper-threading) for calculations like:
import numpy as np
x = np.random.rand(8193)
y = np.sin(x)
So far everything is working as intended.
Beside the threading part, MKL "Features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family" (read here). Since the problems I usually work on do not exceed the vector-size threshold, I am not interested in the performance increase obtained by threading, but rather in the optimized math functions of MKL. Unfortunately it seems like those are only used when the vector size is above the threshold.
I've written some sample code to measure the performance of the sine operation on vectors of different sizes:
from timeit import default_timer as timer
import mkl
import numpy as np

mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
np.random.seed(0)
Nop = int(1e4)

def func(x):
    return np.sin(x)

def measure(x):
    t1 = timer()
    for i in range(0, Nop):
        func(x)
    t2 = timer()
    diff = (t2 - t1)*1000.0
    print("vec size: %i:" % len(x), end="")
    print("\t time needed: %f ms" % diff)

x0 = np.random.rand(20000)
measure(np.array(x0[:8192]))
measure(np.array(x0[:8193]))
measure(np.array(x0[:8192]))
These lines:
import mkl
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
are just there to make sure that the increase in performance is not due to threading (I also checked the CPU usage; it is indeed only using one thread).
I get these results:
vec size: 8192: time needed: 8185.900477 ms
vec size: 8193: time needed: 436.843237 ms
vec size: 8192: time needed: 1777.306942 ms
As you can see, the call on the 8193-element vector runs roughly 20x faster than the one on the 8192-element vector. What is even more confusing is that the second run on the 8192-element vector is about 4x faster than the first one, after doing the calculation on the bigger vector.
Now my questions:
Am I doing anything obviously wrong, which I am not aware of, that leads to these results?
Can anyone reproduce these results, or is it just my installation/my machine behaving like this?
Is the increase in performance really due to the optimized implementation of sine?
Is it possible to enforce always using the optimized version of sine, independent of the vector size?
PS:
I actually tried the following in the simulation I'm running for my master's thesis, which involves a lot of sine and cosine function calls.
I just added this line before anything else is calculated:
np.sin(np.zeros(8193))
And now everything runs 50% faster.
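For context, a minimal sketch of where such a warm-up call could sit relative to the measurement code above (the exact placement before measure() is an assumption; the question only says it goes before anything else is calculated):
import numpy as np

# Warm-up: a single call above the apparent 8192-element threshold.
# Per the observation above, this seems to switch MKL onto its faster
# sine path for the smaller vectors that follow.
np.sin(np.zeros(8193))

# ... then run the actual measurements or simulation, e.g.:
# measure(np.array(x0[:8192]))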

My python codes in general are very slow, is this normal?

I recently began teaching myself Python, and have been using this language for an online course on algorithms. For some reason, much of the code I wrote for this course is very slow (relative to the C/C++/Matlab code I have written in the past), and I'm starting to worry that I am not using Python properly.
Here is a simple pair of Python and Matlab snippets to compare their speed.
MATLAB
for i = 1:100000000
    a = 1 + 1
end
Python
for i in list(range(0, 100000000)):
    a = 1 + 1
The Matlab code takes about 0.3 seconds, and the Python code takes about 7 seconds. Is this normal? My Python code for more complex problems is very slow. For example, as a homework assignment, I'm running depth-first search on a graph with about 900000 nodes, and it is taking forever. Thank you.
Performance is not an explicit design goal of Python:
Don't fret too much about performance--plan to optimize later when needed.
That's one of the reasons why Python integrates with a lot of high-performance computing back ends, such as numpy, OpenBLAS and even CUDA, just to name a few.
The best way forward if you want to increase performance is to let high-performance libraries do the heavy lifting for you. Optimizing loops within Python (by using xrange instead of range in Python 2.7) won't get you very dramatic results.
Here is a bit of code that compares different approaches:
Your original list(range())
The suggested use of xrange()
Leaving the i out
Using numpy to do the addition using numpy arrays (vector addition)
Using CUDA to do vector addition on the GPU
Code:
import timeit
import matplotlib.pyplot as mplplt

iter = 100
testcode = [
    "for i in list(range(1000000)): a = 1+1",
    "for i in xrange(1000000): a = 1+1",
    "for _ in xrange(1000000): a = 1+1",
    "import numpy; one = numpy.ones(1000000); a = one+one",
    "import pycuda.gpuarray as gpuarray; import pycuda.driver as cuda; import pycuda.autoinit; import numpy;" \
    "one_gpu = gpuarray.GPUArray((1000000),numpy.int16); one_gpu.fill(1); a = (one_gpu+one_gpu).get()"
]
labels = ["list(range())", "i in xrange()", "_ in xrange()", "numpy", "numpy and CUDA"]

timings = [timeit.timeit(t, number=iter) for t in testcode]
print labels, timings

label_idx = range(len(labels))
mplplt.bar(label_idx, timings)
mplplt.xticks(label_idx, labels)
mplplt.ylabel('Execution time (sec)')
mplplt.title('Timing of integer addition in python 2.7\n(smaller value is better performance)')
mplplt.show()
Results (graph), run on Python 2.7.13 on OSX:
The reason that Numpy performs faster than the CUDA solution is that the overhead of using CUDA does not beat the efficiency of Python+Numpy. For larger, floating-point calculations, CUDA does even better than Numpy.
Note that the Numpy solution performs more than 80 times faster than your original solution. If your timings are correct, this would even be faster than Matlab...
A final note on DFS (depth-first search): here is an interesting article on DFS in Python.
Try using xrange instead of range.
The difference between them is that xrange generates the values as you use them, whereas range tries to build a static list at runtime.
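A quick way to see the difference under Python 2.7 (illustrative only):
import sys

print type(range(10))                   # <type 'list'>: the whole list is built up front
print type(xrange(10))                  # <type 'xrange'>: values are produced lazily
print sys.getsizeof(range(1000000))     # large: a list with one million entries is materialized
print sys.getsizeof(xrange(1000000))    # small: a constant-size lazy object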
Unfortunately, Python's amazing flexibility and ease come at the cost of being slow. Also, for such large iteration counts, I suggest using the itertools module, as it has faster caching.
xrange is a good solution; however, if you want to iterate over dictionaries and such, it's better to use itertools, since with it you can iterate over any type of sequence object.

Why is numpy much slower than matlab on a digitize example?

I am comparing the performance of numpy vs Matlab, and in several cases I have observed that numpy is significantly slower (indexing, simple operations on arrays such as absolute value, multiplication, sum, etc.). Let's look at the following example, which is somewhat striking, involving the function digitize (which I plan to use for synchronizing timestamps):
import numpy as np
import time
scale=np.arange(1,1e+6+1)
y=np.arange(1,1e+6+1,10)
t1=time.time()
ind=np.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
The result is:
Time passed is 55.91 seconds
Let's now try the same example in Matlab using the equivalent function histc
scale=[1:1e+6];
y=[1:10:1e+6];
tic
[N,bin]=histc(scale,y);
t=toc;
display(['Time passed is ',num2str(t), ' seconds'])
The result is:
Time passed is 0.10237 seconds
That's 560 times faster!
As I'm learning to extend Python with C++, I implemented my own version of digitize (using the Boost libraries for the extension):
import analysis # my C++ module implementing digitize
t1=time.time()
ind2=analysis.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind2) #ok
The result is:
Time passed is 0.02 seconds
There is a bit of cheating, as my version of digitize assumes the inputs are all monotonic; this might explain why it is even faster than Matlab. However, sorting an array of size 1e+6 takes 0.16 seconds (with numpy.sort), which makes the performance of my function worse (by a factor of approx. 1.6) than the Matlab function histc.
So the questions are:
Why is numpy.digitize so slow? Is this function not supposed to be written in compiled and optimized code?
Why is my own version of digitize much faster than numpy.digitize, but still slower than Matlab (I am quite confident I use the fastest algorithm possible, given that I assume inputs are already sorted)?
I am using Fedora 16 and I recently installed the ATLAS and LAPACK libraries (but there has been no change in performance). Should I perhaps rebuild numpy? I am not sure if my installation of numpy uses the appropriate libraries to gain maximum speed; perhaps Matlab is using better libraries.
Update
Based on the answers so far, I would like to stress that the Matlab function histc is not equivalent to numpy.histogram if someone (like me in this case) does not care about the histogram. I need the second output of histc, which is a mapping from input values to the index of the provided input bins. Such an output is provided by the numpy functions digitize and searchsorted. As one of the answers says, searchsorted is much faster than digitize. However, searchsorted is still slower than Matlab by a factor of 2:
t1=time.time()
ind3=np.searchsorted(y,scale,"right")
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind3) #ok
The result is
Time passed is 0.21 seconds
So the questions now are:
What is the point of having numpy.digitize if there is an equivalent function, numpy.searchsorted, which is 280 times faster?
Why is the Matlab function histc (which also provides the output of numpy.searchsorted) 2 times faster than numpy.searchsorted?
First, let's look at why numpy.digitize is slow. If your bins are found to be monotonic, then one of these functions is called depending on whether the bins are nondecreasing or nonincreasing (the code for this is found in numpy/lib/src/_compiled_base.c in the numpy git repo):
static npy_intp
incr_slot_(double x, double *bins, npy_intp lbins)
{
    npy_intp i;

    for ( i = 0; i < lbins; i ++ ) {
        if ( x < bins [i] ) {
            return i;
        }
    }
    return lbins;
}

static npy_intp
decr_slot_(double x, double * bins, npy_intp lbins)
{
    npy_intp i;

    for ( i = lbins - 1; i >= 0; i -- ) {
        if (x < bins [i]) {
            return i + 1;
        }
    }
    return 0;
}
As you can see, it is doing a linear search. Linear search is much, much slower than binary search so there is your answer as to why it is slow. I will open a ticket for this on the numpy tracker.
Second, I think that Matlab is actually slower than your C++ code because Matlab also assumes that the bins are monotonically nondecreasing.
I can't answer why numpy.digitize() is so slow -- I could confirm your timings on my machine.
The function numpy.searchsorted() does basically the same thing as numpy.digitize(), but efficiently.
ind = np.searchsorted(y, scale, "right")
takes about 0.15 seconds on my machine and gives exactly the same result as your code.
Note that your Matlab code does something different from both of those functions -- it is the equivalent of numpy.histogram().
Before the question can be answered, several subquestions need to be addressed:
In order to get more reliable results, you should run several iterations of the tests and average their results (see the sketch after this list). This would somewhat eliminate startup effects, which do not have anything to do with the algorithm. Also, try to use larger data for the same purpose.
Use the same algorithms across the frameworks. This has already been addressed in other answers here.
Make sure the algorithms are really similar enough. How do they utilize system resources? How is memory iterated over? If (just an example) a Matlab algorithm uses repmat and the numpy one would not, the comparison is not fair.
How does the corresponding framework parallelize? This is possibly connected to your individual machine / processor configuration. Matlab does parallelize some (but by far not all) builtin functions. I don't know about numpy/CPython.
Use a memory profiler in order to find out how both implementations behave from that performance point of view.
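As a sketch of the first point, one could repeat each measurement a few times with the standard timeit module and look at the spread rather than a single run (the statement and sizes mirror the question; this is not the original benchmark):
import timeit

setup = "import numpy as np; scale = np.arange(1, 1e6 + 1); y = np.arange(1, 1e6 + 1, 10)"

# Several repetitions of a single call each; the minimum and the spread
# make one-off startup effects easy to spot.
runs = timeit.repeat("np.searchsorted(y, scale, 'right')", setup=setup, repeat=5, number=1)
print("best of 5: %.3f s, all runs: %s" % (min(runs), runs))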
Afterwards (this is only a guess) we will probably find out that numpy often behaves slower than Matlab. Many questions here on SO come to the same conclusion. One explanation could be that Matlab has an easier job optimizing array access, because it does not need to take into account a whole collection of general-purpose objects (as CPython does). The requirements on mathematical arrays are much lower than those on general arrays. numpy, on the other hand, runs on CPython, which must serve the full Python language - not only numpy. However, according to this comparison test (among many others), Matlab is still pretty slow ...
I don't think you are comparing the same functions in numpy and Matlab. The equivalent of histc is np.histogram, as far as I can tell from looking at the documentation. I don't have Matlab to do a comparison, but when I do the following on my machine:
In [7]: import numpy as np
In [8]: scale=np.arange(1,1e+6+1)
In [9]: y=np.arange(1,1e+6+1,10)
In [10]: %timeit np.histogram(scale,y)
10 loops, best of 3: 135 ms per loop
I get a number that is approximately equivalent to what you get for histc.
