Highly variable execution times in Cython functions

Highly variable execution times in Cython functions - python

I have a performance measurement issue while executing a migration to Cython from C-compiled functions (through scipy.weave) called from a Python engine.
The new cython functions profiled end-to-end with cProfile (if not necessary I won't deep down in cython profiling) record cumulative measurement times highly variable.
Eg. the cumulate time of a cython function executed 9 times per 5 repetitions (after a warm-up of 5 executions - not took in consideration by the profiling function) is taking:
in a first round 215,627339 seconds
in a second round 235,336131 seconds
Each execution calls the functions many times with different, but fixed parameters.
Maybe this variability could depends on CPU loads of the test machine (a cloud-hosted dedicated one), but I wonder if such a variability (almost 10%) could depend someway by cython or lack of optimization (I already use hints on division, bounds check, wrap-around, ...).
Any idea on how to take reliable metrics?

First of all, you need to ensure that your measurement device is capable of measuring what you need: specifically, only the system resources you consume. UNIX's utime is one such command, although even that one still includes swap time. Check the documentation of your profiler: it should have capabilities to measure only the CPU time consumed by the function. If so, then your figures are due to something else.
Once you've controlled the external variations, you need to examine the internal. You've said nothing about the complexion of your function. Some (many?) functions have available short-cuts for data-driven trivialities, such as multiplication by 0 or 1. Some are dependent on an overt or covert iteration that varies with the data. You need to analyze the input data with respect to the algorithm.
One tool you can use is a line-oriented profiler to detail where the variations originate; seeing which lines take the extra time should help determine where the "noise" comes from.

I'm not a performance expert but from my understanding the thing you should be measuring would be the average time it take per execution not the cumulative time? Other than that is your function doing any like reading from disk and/or making network requests?

Related

How to use ForLoop in python efficiently [duplicate]

How do you write (and run) a correct micro-benchmark in Java?
I'm looking for some code samples and comments illustrating various things to think about.
Example: Should the benchmark measure time/iteration or iterations/time, and why?
Related: Is stopwatch benchmarking acceptable?

Tips about writing micro benchmarks from the creators of Java HotSpot:
Rule 0: Read a reputable paper on JVMs and micro-benchmarking. A good one is Brian Goetz, 2005. Do not expect too much from micro-benchmarks; they measure only a limited range of JVM performance characteristics.
Rule 1: Always include a warmup phase which runs your test kernel all the way through, enough to trigger all initializations and compilations before timing phase(s). (Fewer iterations is OK on the warmup phase. The rule of thumb is several tens of thousands of inner loop iterations.)
Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so you can verify that the compiler and other parts of the JVM are not doing unexpected work during your timing phase.
Rule 2.1: Print messages at the beginning and end of timing and warmup phases, so you can verify that there is no output from Rule 2 during the timing phase.
Rule 3: Be aware of the difference between -client and -server, and OSR and regular compilations. The -XX:+PrintCompilation flag reports OSR compilations with an at-sign to denote the non-initial entry point, for example: Trouble$1::run # 2 (41 bytes). Prefer server to client, and regular to OSR, if you are after best performance.
Rule 4: Be aware of initialization effects. Do not print for the first time during your timing phase, since printing loads and initializes classes. Do not load new classes outside of the warmup phase (or final reporting phase), unless you are testing class loading specifically (and in that case load only the test classes). Rule 2 is your first line of defense against such effects.
Rule 5: Be aware of deoptimization and recompilation effects. Do not take any code path for the first time in the timing phase, because the compiler may junk and recompile the code, based on an earlier optimistic assumption that the path was not going to be used at all. Rule 2 is your first line of defense against such effects.
Rule 6: Use appropriate tools to read the compiler's mind, and expect to be surprised by the code it produces. Inspect the code yourself before forming theories about what makes something faster or slower.
Rule 7: Reduce noise in your measurements. Run your benchmark on a quiet machine, and run it several times, discarding outliers. Use -Xbatch to serialize the compiler with the application, and consider setting -XX:CICompilerCount=1 to prevent the compiler from running in parallel with itself. Try your best to reduce GC overhead, set Xmx(large enough) equals Xms and use UseEpsilonGC if it is available.
Rule 8: Use a library for your benchmark as it is probably more efficient and was already debugged for this sole purpose. Such as JMH, Caliper or Bill and Paul's Excellent UCSD Benchmarks for Java.

I know this question has been marked as answered but I wanted to mention two libraries that help us to write micro benchmarks
Caliper from Google
Getting started tutorials
http://codingjunkie.net/micro-benchmarking-with-caliper/
http://vertexlabs.co.uk/blog/caliper
JMH from OpenJDK
Getting started tutorials
Avoiding Benchmarking Pitfalls on the JVM
Using JMH for Java Microbenchmarking
Introduction to JMH

Important things for Java benchmarks are:
Warm up the JIT first by running the code several times before timing it
Make sure you run it for long enough to be able to measure the results in seconds or (better) tens of seconds
While you can't call System.gc() between iterations, it's a good idea to run it between tests, so that each test will hopefully get a "clean" memory space to work with. (Yes, gc() is more of a hint than a guarantee, but it's very likely that it really will garbage collect in my experience.)
I like to display iterations and time, and a score of time/iteration which can be scaled such that the "best" algorithm gets a score of 1.0 and others are scored in a relative fashion. This means you can run all algorithms for a longish time, varying both number of iterations and time, but still getting comparable results.
I'm just in the process of blogging about the design of a benchmarking framework in .NET. I've got a couple of earlier posts which may be able to give you some ideas - not everything will be appropriate, of course, but some of it may be.

jmh is a recent addition to OpenJDK and has been written by some performance engineers from Oracle. Certainly worth a look.
The jmh is a Java harness for building, running, and analysing nano/micro/macro benchmarks written in Java and other languages targetting the JVM.
Very interesting pieces of information buried in the sample tests comments.
See also:
Avoiding Benchmarking Pitfalls on the JVM
Discussion on the main strengths of jmh.

Should the benchmark measure time/iteration or iterations/time, and why?
It depends on what you are trying to test.
If you are interested in latency, use time/iteration and if you are interested in throughput, use iterations/time.

Make sure you somehow use results which are computed in benchmarked code. Otherwise your code can be optimized away.

If you are trying to compare two algorithms, do at least two benchmarks for each, alternating the order. i.e.:
for(i=1..n)
alg1();
for(i=1..n)
alg2();
for(i=1..n)
alg2();
for(i=1..n)
alg1();
I have found some noticeable differences (5-10% sometimes) in the runtime of the same algorithm in different passes..
Also, make sure that n is very large, so that the runtime of each loop is at the very least 10 seconds or so. The more iterations, the more significant figures in your benchmark time and the more reliable that data is.

There are many possible pitfalls for writing micro-benchmarks in Java.
First: You have to calculate with all sorts of events that take time more or less random: Garbage collection, caching effects (of OS for files and of CPU for memory), IO etc.
Second: You cannot trust the accuracy of the measured times for very short intervals.
Third: The JVM optimizes your code while executing. So different runs in the same JVM-instance will become faster and faster.
My recommendations: Make your benchmark run some seconds, that is more reliable than a runtime over milliseconds. Warm up the JVM (means running the benchmark at least once without measuring, that the JVM can run optimizations). And run your benchmark multiple times (maybe 5 times) and take the median-value. Run every micro-benchmark in a new JVM-instance (call for every benchmark new Java) otherwise optimization effects of the JVM can influence later running tests. Don't execute things, that aren't executed in the warmup-phase (as this could trigger class-load and recompilation).

It should also be noted that it might also be important to analyze the results of the micro benchmark when comparing different implementations. Therefore a significance test should be made.
This is because implementation A might be faster during most of the runs of the benchmark than implementation B. But A might also have a higher spread, so the measured performance benefit of A won't be of any significance when compared with B.
So it is also important to write and run a micro benchmark correctly, but also to analyze it correctly.

To add to the other excellent advice, I'd also be mindful of the following:
For some CPUs (e.g. Intel Core i5 range with TurboBoost), the temperature (and number of cores currently being used, as well as thier utilisation percent) affects the clock speed. Since CPUs are dynamically clocked, this can affect your results. For example, if you have a single-threaded application, the maximum clock speed (with TurboBoost) is higher than for an application using all cores. This can therefore interfere with comparisons of single and multi-threaded performance on some systems. Bear in mind that the temperature and volatages also affect how long Turbo frequency is maintained.
Perhaps a more fundamentally important aspect that you have direct control over: make sure you're measuring the right thing! For example, if you're using System.nanoTime() to benchmark a particular bit of code, put the calls to the assignment in places that make sense to avoid measuring things which you aren't interested in. For example, don't do:
long startTime = System.nanoTime();
//code here...
System.out.println("Code took "+(System.nanoTime()-startTime)+"nano seconds");
Problem is you're not immediately getting the end time when the code has finished. Instead, try the following:
final long endTime, startTime = System.nanoTime();
//code here...
endTime = System.nanoTime();
System.out.println("Code took "+(endTime-startTime)+"nano seconds");

http://opt.sourceforge.net/ Java Micro Benchmark - control tasks required to determine the comparative performance characteristics of the computer system on different platforms. Can be used to guide optimization decisions and to compare different Java implementations.

Why the executing time of functions not constant?

I read my university class theoretically the order of growth of functions and tried implementing it practically at home. Although the order of growth turned out to be exact the same as in textbooks but their executing time changes with every single time I execute the program. Why is that?
Source Code
import time
import math
from tabulate import tabulate
n=eval(input("Enter the value of n: "));
t1=time.time()
a=12
t2=time.time()
A=t2-t1
t3=time.time()
b=n
t4=time.time()
B=t4-t3
t5=time.time()
c=math.log10(n);
t6=time.time()
C=t6-t5
t7=time.time()
d=n*math.log10(n);
t8=time.time()
D=t8-t7
t9=time.time()
e=n**2
t10=time.time()
E=t10-t9
t11=time.time()
f=2**n
t12=time.time()
F=t12-t11
print(tabulate([['constant',a,A], ['n',b,B], ['logn',c,C], ['nlogn',d,D], ['n**2',e,E], ['2**n',f,F]], headers=['Function', 'Value', 'Time']))
templist= [A,B,C,D,E,F]
print("The time order in acsending order is: ", sorted(templist,key=int))
First Execution
naufil#naufil-Inspiron-7559:~/Desktop/python$ python3 time_order.py
Enter the value of n: 100
Function Value Time
---------- --------------- -----------
constant 12 2.14577e-06
n 100 1.43051e-06
logn 2 4.1008e-05
nlogn 200 3.57628e-06
n**2 10000 3.33786e-06
2**n 1.26765e+30 3.8147e-06
The time order in acsending order is: [2.1457672119140625e-06, 1.430511474609375e-06, 4.100799560546875e-05, 3.5762786865234375e-06, 3.337860107421875e-06, 3.814697265625e-06]
Second Execution
naufil#naufil-Inspiron-7559:~/Desktop/python$ python3 time_order.py
Enter the value of n: 100
Function Value Time
---------- --------------- -----------
constant 12 2.14577e-06
n 100 1.19209e-06
logn 2 4.64916e-05
nlogn 200 4.05312e-06
n**2 10000 3.33786e-06
2**n 1.26765e+30 3.57628e-06
The time order in acsending order is: [2.1457672119140625e-06, 1.1920928955078125e-06, 4.649162292480469e-05, 4.0531158447265625e-06, 3.337860107421875e-06, 3.5762786865234375e-06]

As other comments and answers have rightly pointed out, the reason for the difference in execution times that you observe come from the way operating systems work. But doing rigorous measures is a complicated matter, so let me elaborate a bit more though and give you pointers to where you should maybe direct your experimentation.
What your OS does behind your back
You can see the OS as a conductor and programs as instrument players, and imagine there are only so many instruments that can play at the same time. The conductor must therefore choose at each time who should play, also making sure nobody is frustrated in the end! Same-wise, the OS is therefore constantly in charge of choosing what programs to execute, meaning what program to dedicate CPU time. The number of programs (or rather processes) that can be executed at the same time is usually limited by the number of cores in your processor.
In practice, the way that OS chooses what to execute is a very complex and fascinating subject, which relies on experimentation-backed heuristics. (Read more here). What you have to understand, is that there is hardly any way for you to alter this behavior, and none to guarantee the same execution time between two calls.
Using linux's time command
Calling python's time like you do measures the physical time elapsed between two calls, so because of what we have said, you don't only measure time spent on your program's execution. If you want to have a better a sense of what time the OS actually dedicated to your program, you can use the linux command time. The user time, will give you the actual CPU time dedicated to the execution of your program. Check out this thread for more info. But understand that this time as well is subject to oscillations!
What wisdom are you trying to draw from your measurements?
Finally, you should ask yourself if the exact time is really what you want. Do you care about the value? or do you want to exhibit behaviors?
Usually what is done to measure performances, is averaging the execution times of repeated calls. This way, the effects that pertain to the OS's business should be averaged out. (You can see that as building an unbiased estimator for a random process). From what I understand, you are trying to show difference in execution times for algorithms with different complexity. So the actual execution time is not so relevant, what is, is the relative order. That is why averaging multiple calls will reduce the variance of the observation and you will be able to make stronger statements as to the relative execution times.

You should address this question to your operating system. What else runs on your computer? List the various processes and see how many there are; all it takes is a process or even a context swap to alter your execution time. Among other things, calling time.time can invoke such a switch, as this is a call to a system process.
It also depends on what system support routines are already loaded when you call them -- many of those calls being implicit or secondary. If you need to allocate more memory for a particular instruction because another process took the last of your RAM and then swapped out ... well, you get the idea, I hope.

Python and C++ performance comparison

In a lecture I've encountered the following problem:
Given a simple program which computes the sum of a column in a large data set, performance of a python and a c++ implementation are being compared. The main bottleneck should be reading the data. The computation itself is rather simple. On first execution, the python version is about 2 times slower than c++ which makes sense.
Then on the second execution, the c++ program speeds up from 4 seconds to 1 second because apparently the "first execution is I/O bound, second is CPU bound". This still makes sense since probably the file contents were cached omitting the slow reading from disk.
However, the python implementation did not speed up at all on the second run, despite the warm cache. I know python is slow, but is it that slow? Does this mean that executing this simple computation in python is slower than reading about .7 GB from disk?
If this is always the case, I'm wondering why the biggest deep learning frameworks I know (PyTorch, tensorflow) have python apis. For real time object detection for example, it must be slower to parse the input (read frames from a video, maybe preprocess) to the network and to interpret the output, than performing the forward propagation itself on a gpu.
Have I misunderstood something? Thank you.

That's not so easy to answer without implementation details, but in general, python is known for it's much less cache friendliness, because you mostly haven't the option to low-level optimize cache behaviour in python. However, this isn't always correct. You propably can optimize the cache friendliness in python directly, or you use parts of c++ code for critical sections. But always consider, that you can just optimize your code better in C++. So if you have really critical code parts, where you want to achieve every percent of speed and effiency, you should use C++. That's the reason, that many programs use both, C++ for raw performance things and python for a nice interface and program structure.

Running multiple instances of a python program efficiently & economically?

I wrote a program that calls a function with the following prototype:
def Process(n):
# the function uses data that is stored as binary files on the hard drive and
# -- based on the value of 'n' -- scans it using functions from numpy & cython.
# the function creates new binary files and saves the results of the scan in them.
#
# I optimized the running time of the function as much as I could using numpy &
# cython, and at present it takes about 4hrs to complete one function run on
# a typical winXP desktop (three years old machine, 2GB memory etc).
My goal is to run this function exactly 10,000 times (for 10,000 different values of 'n') in the fastest & most economical way. following these runs, I will have 10,000 different binary files with the results of all the individual scans. note that every function 'run' is independent (meaning, there is no dependency whatsoever between the individual runs).
So the question is this. having only one PC at home, it is obvious that it will take me around 4.5 years (10,000 runs x 4hrs per run = 40,000 hrs ~= 4.5 years) to complete all runs at home. yet, I would like to have all the runs completed within a week or two.
I know the solution would involve accessing many computing resources at once. what is the best (fastest / most affordable, as my budget is limited) way to do so? must I buy a strong server (how much would it cost?) or can I have this run online? in such a case, is my propritary code gets exposed, by doing so?
in case it helps, every instance of 'Process()' only needs about 500MB of memory. thanks.

Check out PiCloud: http://www.picloud.com/
import cloud
cloud.call(function)
Maybe it's an easy solution.

Does Process access the data on the binary files directly or do you cache it in memory? Reducing the usage of I/O operations should help.
Also, isn't it possible to break Process into separate functions running in parallel? How is the data dependency inside the function?
Finally, you could give some cloud computing service like Amazon EC2 a try (don't forget to read this for tools), but it won't be cheap (EC2 starts at $0.085 per hour) - an alternative would be going to an university with a computer cluster (they are pretty common nowadays, but it will be easier if you know someone there).

Well, from your description, it sounds like things are IO bound... In which case parallelism (at least on one IO device) isn't going to help much.
Edit: I just realized that you were referring more to full cloud computing, rather than running multiple processes on one machine... My advice below still holds, though.... PyTables is quite nice for out-of-core calculations!
You mentioned that you're using numpy's mmap to access the data. Therefore, your execution time is likely to depend heavily on how your data is structured on the disc.
Memmapping can actually be quite slow in any situation where the physical hardware has to spend most of its time seeking (e.g. reading a slice along a plane of constant Z in a C-ordered 3D array). One way of mitigating this is to change the way your data is ordered to reduce the number of seeks required to access the parts you are most likely to need.
Another option that may help is compressing the data. If your process is extremely IO bound, you can actually get significant speedups by compressing the data on disk (and sometimes even in memory) and decompressing it on-the-fly before doing your calculation.
The good news is that there's a very flexible, numpy-oriented library that's already been put together to help you with both of these. Have a look at pytables.
I would be very surprised if tables.Expr doesn't significantly (~ 1 order of magnitude) outperform your out-of-core calculation using a memmapped array. See here for a nice, (though canned) example. From that example:

Python cProfile: how to filter out specific calls from the profiling data?

I've started profiling a script which has many sleep(n) statements. All in all, I get over 99% of the run time spent sleeping. Nevertheless, it occasionally runs into performance problems during the time that it does real work but the relevant, interesting profiling data becomes very difficult to identify when e.g. using kcachegrind.
Is there a way I can blacklist certain calls/functions from being profiled?
Alternatively, how can I filter out such call with post-processing of the profiling data file?
I'm using the profilestats decorator ( http://pypi.python.org/pypi/profilestats ).
Thanks

You need more than just excluding samples during sleep(). You need the remaining samples to tell you something useful. That would be stack sampling, on wall-clock time, summarizing percent at the line-of-code level. Zoom is a good tool for this kind of sampling, and I would hope it's not too hard to ignore samples that contain a particular function.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.