How to use ForLoop in python efficiently [duplicate]

How do you write (and run) a correct micro-benchmark in Java?
I'm looking for some code samples and comments illustrating various things to think about.
Example: Should the benchmark measure time/iteration or iterations/time, and why?
Related: Is stopwatch benchmarking acceptable?

Tips about writing micro benchmarks from the creators of Java HotSpot:
Rule 0: Read a reputable paper on JVMs and micro-benchmarking. A good one is Brian Goetz, 2005. Do not expect too much from micro-benchmarks; they measure only a limited range of JVM performance characteristics.
Rule 1: Always include a warmup phase which runs your test kernel all the way through, enough to trigger all initializations and compilations before the timing phase(s). (Fewer iterations are OK in the warmup phase. The rule of thumb is several tens of thousands of inner loop iterations.)
Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so you can verify that the compiler and other parts of the JVM are not doing unexpected work during your timing phase.
Rule 2.1: Print messages at the beginning and end of timing and warmup phases, so you can verify that there is no output from Rule 2 during the timing phase.
Rule 3: Be aware of the difference between -client and -server, and OSR and regular compilations. The -XX:+PrintCompilation flag reports OSR compilations with an at-sign to denote the non-initial entry point, for example: Trouble$1::run @ 2 (41 bytes). Prefer server to client, and regular to OSR, if you are after best performance.
Rule 4: Be aware of initialization effects. Do not print for the first time during your timing phase, since printing loads and initializes classes. Do not load new classes outside of the warmup phase (or final reporting phase), unless you are testing class loading specifically (and in that case load only the test classes). Rule 2 is your first line of defense against such effects.
Rule 5: Be aware of deoptimization and recompilation effects. Do not take any code path for the first time in the timing phase, because the compiler may junk and recompile the code, based on an earlier optimistic assumption that the path was not going to be used at all. Rule 2 is your first line of defense against such effects.
Rule 6: Use appropriate tools to read the compiler's mind, and expect to be surprised by the code it produces. Inspect the code yourself before forming theories about what makes something faster or slower.
Rule 7: Reduce noise in your measurements. Run your benchmark on a quiet machine, and run it several times, discarding outliers. Use -Xbatch to serialize the compiler with the application, and consider setting -XX:CICompilerCount=1 to prevent the compiler from running in parallel with itself. Try your best to reduce GC overhead: set -Xmx (large enough) equal to -Xms, and use -XX:+UseEpsilonGC if it is available.
Rule 8: Use a library for your benchmark, such as JMH, Caliper, or Bill and Paul's Excellent UCSD Benchmarks for Java, as it is probably more efficient and was already debugged for this sole purpose.

I know this question has been marked as answered, but I wanted to mention two libraries that help us write micro-benchmarks:
Caliper from Google
Getting started tutorials
http://codingjunkie.net/micro-benchmarking-with-caliper/
http://vertexlabs.co.uk/blog/caliper
JMH from OpenJDK
Getting started tutorials
Avoiding Benchmarking Pitfalls on the JVM
Using JMH for Java Microbenchmarking
Introduction to JMH

Important things for Java benchmarks are:
Warm up the JIT first by running the code several times before timing it
Make sure you run it for long enough to be able to measure the results in seconds or (better) tens of seconds
While you can't call System.gc() between iterations, it's a good idea to run it between tests, so that each test will hopefully get a "clean" memory space to work with. (Yes, gc() is more of a hint than a guarantee, but it's very likely that it really will garbage collect in my experience.)
I like to display iterations and time, and a score of time/iteration which can be scaled such that the "best" algorithm gets a score of 1.0 and others are scored in a relative fashion. This means you can run all algorithms for a longish time, varying both number of iterations and time, but still getting comparable results.
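The scoring scheme in the last point can be sketched as follows. This is only a minimal illustration, written in Python rather than Java; the function name relative_scores and the sample numbers are made up for the example and do not come from the answer.

```python
def relative_scores(results):
    """Scale time/iteration so that the fastest algorithm scores 1.0.

    `results` maps an algorithm name to (iterations, elapsed_seconds).
    Iteration counts and run times may differ between algorithms.
    """
    per_iteration = {
        name: seconds / iterations
        for name, (iterations, seconds) in results.items()
    }
    best = min(per_iteration.values())
    return {name: t / best for name, t in per_iteration.items()}


# Hypothetical measurements: different iteration counts and run times.
scores = relative_scores({
    "alg_a": (1_000_000, 2.0),
    "alg_b": (500_000, 3.0),
})
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.2f}")
```

Because each score is a ratio of per-iteration times, the algorithms stay comparable even when each one is run for a different number of iterations.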
I'm just in the process of blogging about the design of a benchmarking framework in .NET. I've got a couple of earlier posts which may be able to give you some ideas - not everything will be appropriate, of course, but some of it may be.

jmh is a recent addition to OpenJDK and has been written by some performance engineers from Oracle. Certainly worth a look.
JMH is a Java harness for building, running, and analysing nano/micro/macro benchmarks written in Java and other languages targeting the JVM.
There are very interesting pieces of information buried in the comments of the sample tests.
See also:
Avoiding Benchmarking Pitfalls on the JVM
Discussion on the main strengths of jmh.

Should the benchmark measure time/iteration or iterations/time, and why?
It depends on what you are trying to test.
If you are interested in latency, use time/iteration and if you are interested in throughput, use iterations/time.
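Both figures come from the same raw measurement, as the sketch below suggests. It is written in Python for brevity; the helper name measure and the workload are assumptions made for the illustration.

```python
import time


def measure(func, iterations):
    """Return (seconds per iteration, iterations per second) for func."""
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    elapsed = time.perf_counter() - start
    latency = elapsed / iterations      # time/iteration
    throughput = iterations / elapsed   # iterations/time
    return latency, throughput


# Hypothetical workload, used only to have something to time.
latency, throughput = measure(lambda: sum(range(100)), 10_000)
print(f"latency: {latency:.3e} s/iteration")
print(f"throughput: {throughput:.1f} iterations/s")
```

For a single-threaded loop like this one the two numbers are simply reciprocals; they start to carry different information once requests overlap or run concurrently.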

Make sure you somehow use results which are computed in benchmarked code. Otherwise your code can be optimized away.

If you are trying to compare two algorithms, do at least two benchmarks for each, alternating the order. i.e.:
for(i=1..n)
    alg1();
for(i=1..n)
    alg2();
for(i=1..n)
    alg2();
for(i=1..n)
    alg1();
I have found some noticeable differences (sometimes 5-10%) in the runtime of the same algorithm in different passes.
Also, make sure that n is very large, so that the runtime of each loop is at the very least 10 seconds or so. The more iterations, the more significant figures in your benchmark time and the more reliable that data is.
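A minimal Python sketch of the alternating-order idea, assuming two stand-in functions alg1 and alg2 (their bodies are invented here) and using the standard timeit module for the timing:

```python
import timeit


def alg1():
    # Stand-in workload; replace with the first algorithm under test.
    return sorted(range(1000, 0, -1))


def alg2():
    # Stand-in workload; replace with the second algorithm under test.
    return list(reversed(range(1000, 0, -1)))


def time_alternating(a, b, number):
    """Time a, b, b, a in that order and collect the passes per function."""
    passes = {a.__name__: [], b.__name__: []}
    for func in (a, b, b, a):
        passes[func.__name__].append(timeit.timeit(func, number=number))
    return passes


passes = time_alternating(alg1, alg2, number=2000)
for name, times in passes.items():
    print(name, [f"{t:.4f}s" for t in times])
```

Comparing the two passes of the same function gives a rough feel for how much of any measured difference is just run-to-run noise. The number used here is kept small so the sketch finishes quickly; the answer recommends far longer runs.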

There are many possible pitfalls for writing micro-benchmarks in Java.
First: You have to account for all sorts of events that take a more or less random amount of time: garbage collection, caching effects (of the OS for files and of the CPU for memory), I/O, etc.
Second: You cannot trust the accuracy of the measured times for very short intervals.
Third: The JVM optimizes your code while executing. So different runs in the same JVM-instance will become faster and faster.
My recommendations: Make your benchmark run for some seconds; that is more reliable than a runtime of a few milliseconds. Warm up the JVM (that is, run the benchmark at least once without measuring, so that the JVM can perform its optimizations). Run your benchmark multiple times (maybe 5 times) and take the median value. Run every micro-benchmark in a new JVM instance (launch a new java process for every benchmark); otherwise the optimization effects of the JVM can influence tests that run later. Don't execute things that aren't executed in the warmup phase (as this could trigger class loading and recompilation).

It should also be noted that it might also be important to analyze the results of the micro benchmark when comparing different implementations. Therefore a significance test should be made.
This is because implementation A might be faster during most of the runs of the benchmark than implementation B. But A might also have a higher spread, so the measured performance benefit of A won't be of any significance when compared with B.
So it is important not only to write and run a micro-benchmark correctly, but also to analyze it correctly.
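As a rough illustration of the point about spread, here is a Python sketch that computes Welch's t statistic for two sets of timings. The answer does not name a specific test, so the choice of Welch's test is an assumption, and the timing values are invented.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Invented timings in seconds: A is faster on average but much noisier.
times_a = [1.0, 1.4, 0.6, 1.2, 0.8]
times_b = [1.1, 1.1, 1.2, 1.0, 1.1]

t = welch_t(times_a, times_b)
print(f"mean A = {statistics.mean(times_a):.2f}, "
      f"mean B = {statistics.mean(times_b):.2f}, t = {t:.2f}")
```

Here the magnitude of t is well below 2, so with this few runs the apparent advantage of A is not convincing. Turning t into a p-value needs the t distribution; if SciPy is available, scipy.stats.ttest_ind with equal_var=False performs the whole test.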

To add to the other excellent advice, I'd also be mindful of the following:
For some CPUs (e.g. the Intel Core i5 range with TurboBoost), the temperature (and the number of cores currently being used, as well as their utilisation percentage) affects the clock speed. Since CPUs are dynamically clocked, this can affect your results. For example, if you have a single-threaded application, the maximum clock speed (with TurboBoost) is higher than for an application using all cores. This can therefore interfere with comparisons of single- and multi-threaded performance on some systems. Bear in mind that the temperature and voltages also affect how long the Turbo frequency is maintained.
Perhaps a more fundamentally important aspect that you have direct control over: make sure you're measuring the right thing! For example, if you're using System.nanoTime() to benchmark a particular bit of code, put the calls to the assignment in places that make sense to avoid measuring things which you aren't interested in. For example, don't do:
long startTime = System.nanoTime();
//code here...
System.out.println("Code took "+(System.nanoTime()-startTime)+"nano seconds");
Problem is you're not immediately getting the end time when the code has finished. Instead, try the following:
final long endTime, startTime = System.nanoTime();
//code here...
endTime = System.nanoTime();
System.out.println("Code took "+(endTime-startTime)+"nano seconds");

http://opt.sourceforge.net/ Java Micro Benchmark - control tasks required to determine the comparative performance characteristics of the computer system on different platforms. Can be used to guide optimization decisions and to compare different Java implementations.

Related

Understand order of magnitude performance gap between python and C++ for CPU heavy application

**Summary:** I observe a ~1000x performance gap between Python code and C++ code doing the same job, despite the use of parallelization, vectorization, just-in-time compilation and machine code conversion using Numba, in the context of scientific calculation. The CPU is not fully used, and I don't understand why.
Hello everybody,
I just started in a laboratory doing simulations of various materials, including simulation of the growth of biological-like tissues. To do that we create a 3D version of said tissue (a collection of vertices stored in a numpy array) and we apply different functions to it to mimic physics/biology.
We have C++ code doing just that, which takes approximately 10 seconds to run. Someone converted said code to Python, but this version takes about 2.5 hours to run. We tried every trick in the book we knew to accelerate the code. We used numba to accelerate numpy where appropriate, parallelized the code as much as we could, and tried to vectorize what could be vectorized, but the gap remains. In fact, the earlier version of the code took days to run.
When the code executes, multiple cores are properly used, as monitored using the built-in system monitor. However, they are not fully used, and in fact deactivating cores manually does not seem to hurt performance much. At first I thought it could be due to the GIL, but releasing it had no effect on performance either. Somehow it makes me think of a bottleneck in memory transfer between the CPU and the RAM, but I cannot understand why the C++ version would not have the same problem. I also have the feeling that there is a performance cost for calling functions. One of my earlier tasks was to refactor the code, decomposing complicated functions into smaller elements. Since then I have seen a small performance degradation compared to the earlier version.
I must say I am really wondering where my bottleneck is and how it could be tested/improved. Any idea would be very welcome.
I am aware my question is kind of a complicated one, so let me know if you would need additional information, I would be happy to provide.

Highly variable execution times in Cython functions

I have a performance measurement issue while executing a migration to Cython from C-compiled functions (through scipy.weave) called from a Python engine.
The new Cython functions, profiled end-to-end with cProfile (I would rather not dig into Cython-level profiling unless necessary), record highly variable cumulative times.
E.g. the cumulative time of a Cython function executed 9 times in each of 5 repetitions (after a warm-up of 5 executions, not taken into account by the profiling function) is:
in a first round 215.627339 seconds
in a second round 235.336131 seconds
Each execution calls the functions many times with different, but fixed parameters.
Maybe this variability depends on the CPU load of the test machine (a cloud-hosted dedicated one), but I wonder if such variability (almost 10%) could somehow depend on Cython or on a lack of optimization (I already use hints on division, bounds check, wrap-around, ...).
Any idea on how to take reliable metrics?

First of all, you need to ensure that your measurement device is capable of measuring what you need: specifically, only the system resources you consume. UNIX's utime is one such command, although even that one still includes swap time. Check the documentation of your profiler: it should have capabilities to measure only the CPU time consumed by the function. If so, then your figures are due to something else.
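To illustrate the difference between CPU time and elapsed time, here is a small Python sketch using the standard library clocks time.process_time() (CPU time of the current process) and time.perf_counter() (wall-clock time). The helper cpu_and_wall and the workload are invented for the example.

```python
import time


def cpu_and_wall(func):
    """Return (cpu_seconds, wall_seconds) spent while running func once."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall


def work():
    time.sleep(0.2)                      # waiting: wall time, almost no CPU time
    sum(i * i for i in range(200_000))   # computing: counts towards both


cpu, wall = cpu_and_wall(work)
print(f"CPU time:  {cpu:.3f} s")
print(f"Wall time: {wall:.3f} s")
```

If the wall-clock figure varies between rounds while the CPU figure stays stable, the variation is more likely to come from the machine (other load, waiting) than from the function itself.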
Once you've controlled the external variations, you need to examine the internal. You've said nothing about the complexion of your function. Some (many?) functions have available short-cuts for data-driven trivialities, such as multiplication by 0 or 1. Some are dependent on an overt or covert iteration that varies with the data. You need to analyze the input data with respect to the algorithm.
One tool you can use is a line-oriented profiler to detail where the variations originate; seeing which lines take the extra time should help determine where the "noise" comes from.

I'm not a performance expert, but from my understanding the thing you should be measuring is the average time per execution, not the cumulative time. Other than that, is your function doing anything like reading from disk or making network requests?

Python and C++ performance comparison

In a lecture I've encountered the following problem:
Given a simple program which computes the sum of a column in a large data set, performance of a python and a c++ implementation are being compared. The main bottleneck should be reading the data. The computation itself is rather simple. On first execution, the python version is about 2 times slower than c++ which makes sense.
Then on the second execution, the c++ program speeds up from 4 seconds to 1 second because apparently the "first execution is I/O bound, second is CPU bound". This still makes sense since probably the file contents were cached omitting the slow reading from disk.
However, the python implementation did not speed up at all on the second run, despite the warm cache. I know python is slow, but is it that slow? Does this mean that executing this simple computation in python is slower than reading about .7 GB from disk?
If this is always the case, I'm wondering why the biggest deep learning frameworks I know (PyTorch, tensorflow) have python apis. For real time object detection for example, it must be slower to parse the input (read frames from a video, maybe preprocess) to the network and to interpret the output, than performing the forward propagation itself on a gpu.
Have I misunderstood something? Thank you.

That's not so easy to answer without implementation details, but in general, Python is known to be much less cache-friendly, because you mostly don't have the option to optimize cache behaviour at a low level in Python. However, this isn't always the case. You can probably optimize cache friendliness in Python directly, or you can use C++ code for the critical sections. But always keep in mind that you can simply optimize your code better in C++. So if you have really critical parts of the code where you want to squeeze out every percent of speed and efficiency, you should use C++. That's the reason why many programs use both: C++ for the raw-performance parts and Python for a nice interface and program structure.

Writing a CPU bound script to gauge rough CPU performance

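I have written a script and am running it on different machines. The script looks like this:
import time
BIGNUM = 10**7  # placeholder; the original post does not give a value
def simple_math(n):
    return n * n  # placeholder; the original post does not show this function
def f(n):
    x = None
    while n:
        x = simple_math(n)
        n -= 1
    return x
start = time.perf_counter()
f(BIGNUM)
print(time.perf_counter() - start)
At the end, the script prints how long it took to finish. Is this good enough to compare different machines for practical CPU speed for simple Python scripts?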
By simple I mean it does not use multiprocessing module or any other technique to take advantage of multi-core machines.
This question is not about
making python programs run faster
multiprocessing module
GIL, I/O efficiency etc.
non cPython programs
I just want to make sure that my approach to understanding CPU performance across machines is fairly correct.

What's wrong with all of the countless existing benchmarks? The more sophisticated ones are probably a bit more robust. The major problems with your naive approach that I can spot (and I'm not an expert on this topic, mind you) are:
Modern CPUs are highly complex and employ very clever optimizations. The speed of a purely CPU-bound program can vary widely depending on how often the cache can help, how often the program causes pipeline stalls, how often branch prediction is correct, and probably many more factors (these were just off the top of my head). Although many of these shouldn't make a difference when you use the same build of the same executable running the same script doing the same pure calculations, they can matter, to a degree none of us can predict, once you change any of these parameters (e.g. using a different build because of a different OS or architecture).
Multi-tasking OSs will never let a program occupy the CPU exclusively. There will always be some other program running at the same time stealing time, and you can't really know how much of the x seconds were spent running your program and how many were spent on other programs. At the very least, you should run a program many times and take the minimum time as the time it takes with relatively little interference from other programs. And even then, you need to have about the same system load in both benchmarks to make the numbers somewhat meaningful.
At least CPython won't multi-thread, so you only get the speed of one core.
But since your requirements seem to be "very rough estimate of CPU speed only, in full awareness that these numbers can't be used for anything except putting CPU speed into orders of magnitude, must be taken with a grain of salt even then and don't tell anything about the actual performance of any real applications", it might be okay - just don't consider it anywhere close to accurate. Still, why not use a hardened benchmark suite that already put some effort into mitigating (not removing - nobody can do that) these problems?
Also note that the timeit stdlib module is both easier to use than manually wielding the stopwatch and tries (not too hard, but it's a start) to fix the second point by the method I mentioned.
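A minimal sketch of the timeit approach mentioned above, taking the minimum of several runs. The body of simple_math and the loop size are placeholders, since the question does not show them.

```python
import timeit


def simple_math(n):
    # Placeholder arithmetic; the question does not show the real function.
    return n * n


def f(n):
    x = None
    while n:
        x = simple_math(n)
        n -= 1
    return x


# Run the whole loop once per measurement, repeat the measurement 5 times,
# and keep the fastest run as the one least disturbed by other programs.
timings = timeit.repeat(lambda: f(200_000), repeat=5, number=1)
best = min(timings)
print(f"best of {len(timings)}: {best:.4f} s")
```

The minimum is used rather than the mean because other processes can only make a run slower, never faster.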

You can get a rough idea by using this type of method, but it will not be an exact measurement. The execution time of the script will depend on many things other than CPU speed, such as the OS and interpreter version used, current system load, memory speed, etc. My suggestion is not to depend on this.
EDIT: Just a note. When it comes to performance, many people think only about CPU speed, but actually performance can be hampered by almost everything on the system. For example, if you have a fast CPU but poor RAM (in both size and speed), then you will get no performance boost from the CPU.

In essence: No.
Benchmarking is a very difficult problem which usually is not worth solving yourself. It all depends on why you care. Your method will surely give a very rough estimate on if System A is better than System B, but really only when the outcome is vastly different.
What you're trying to do is determine how Real World Application X will perform on different computers. Very rarely is a real world application approximated by a loop of simple math. Even when it is (scientific computing mostly) you're better off measuring times on the actual program.
Real world applications are usually non-linear, and difficult to measure and simulate. It's really one of those problems which has been solved by someone else much better than you could reasonably solve it yourself.
If you want a very rough estimate of performance, sure, do it your way. Just don't put too much faith in the results, because they will be far from what you might call "scientific".

If I understand your intention correctly, why not check how it is done in timeit? (You could clarify it a bit, though: what exactly are you trying to measure or estimate, processor speed, code speed, or something else, and for what purpose?)

Performance differences between Python and C

Working on different projects I have the choice of selecting different programming languages, as long as the task is done.
I was wondering what the real difference is, in terms of performance, between writing a program in Python, versus doing it in C.
The tasks to be done are pretty varied, e.g. sorting textfiles, disk access, network access, textfile parsing.
Is there really a noticeable difference between sorting a textfile using the same algorithm in C versus Python, for example?
And in your experience, given the power of current CPUs (i7), is it really a noticeable difference? (Consider that it's a program that doesn't bring the system to its knees.)

Use python until you have a performance problem. If you ever have one figure out what the problem is (often it isn't what you would have guessed up front). Then solve that specific performance problem which will likely be an algorithm or data structure change. In the rare case that your problem really needs C then you can write just that portion in C and use it from your python code.

C will absolutely crush Python in almost any performance category, but C is far more difficult to write and maintain and high performance isn't always worth the trade off of increased time and difficulty in development.
You say you're doing things like text file processing, but what you omit is how much text file processing you're doing. If you're processing 10 million files an hour, you might benefit from writing it in C. But if you're processing 100 files an hour, why not use python? Do you really need to be able to process a text file in 10ms vs 50ms? If you're planning for the future, ask yourself, "Is this something I can just throw more hardware at later?"
Writing solid code in C is hard. Be sure you can justify that investment in effort.

In general, IO-bound work will depend more on the algorithm than on the language. In this case I would go with Python because it has first-class strings and lots of easy-to-use libraries for manipulating files, etc.

Is there really a noticeable difference between sorting a textfile using the same algorithm in C versus Python, for example?
Yes.
The noticeable differences are these:
There's much less Python code.
The Python code is much easier to read.
Python supports really nice unit testing, so the Python code tends to be higher quality.
You can write the Python code more quickly, since there are fewer quirky language features. Having no preprocessor, for example, really saves a lot of hacking around. Super-experienced C programmers hardly notice it. But all that #include sandwich stuff and making the .h files correct is remarkably time-consuming.
Python can be easier to package and deploy, since you don't need a big fancy make script to do a build.

The first rule of computer performance questions: Your mileage will vary. If small performance differences are important to you, the only way you will get valid information is to test with your configuration, your data, and your benchmark. "Small" here is, say, a factor of two or so.
The second rule of computer performance questions: For most applications, performance doesn't matter -- the easiest way to write the app gives adequate performance, even when the problem scales. If that is the case (and it is usually the case) don't worry about performance.
That said:
C compiles down to a machine executable and thus has the potential to execute at least as fast as any other language
Python is generally interpreted and thus may take more CPU than a compiled language
Very few applications are "CPU bound." I/O (to disk, display, or memory) is not greatly affected by compiled vs interpreted considerations and frequently is a major part of computer time spent on an application
Python works at a higher level of abstraction than C, so your development and debugging time may be shorter
My advice: Develop in the language you find the easiest with which to work. Get your program working, then check for adequate performance. If, as usual, performance is adequate, you're done. If not, profile your specific app to find out what is taking longer than expected or tolerable. See if and how you can fix that part of the app, and repeat as necessary.
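The "profile your specific app" step might look like this in Python, using the standard cProfile and pstats modules. The functions slow_part, fast_part and app are invented stand-ins for a real application.

```python
import cProfile
import io
import pstats


def slow_part():
    # Stand-in for the code that turns out to dominate the run time.
    return sum(i * i for i in range(200_000))


def fast_part():
    return sum(range(1000))


def app():
    slow_part()
    fast_part()


profiler = cProfile.Profile()
profiler.enable()
app()
profiler.disable()

# Sort by cumulative time and show the five most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists where the time actually went, which is often not where one would have guessed up front.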
Yes, sometimes you might need to abandon work and start over to get the performance you need. But having a working (albeit slow) version of the app will be a big help in making progress. When you do reach and conquer that performance goal you'll be answering performance questions in SO rather than asking them.

If your text files that you are sorting and parsing are large, use C. If they aren't, it doesn't matter. You can write poor code in any language though. I have seen simple code in C for calculating areas of triangles run 10x slower than other C code, because of poor memory management, use of structures, pointers, etc.
Your I/O algorithm should be independent of your compute algorithm. If this is the case, then using C for the compute algorithm can be much faster.

(Assumption - The question implies that the author is familiar with C but not Python, therefore I will base my answer with that in mind.)
I was wondering what the real difference is, in terms of performance, between writing a program in Python, versus doing it in C.
C will almost certainly be faster unless it is implemented poorly, but the real questions are:
What are the development implications (development time, maintenance, etc.) for either implementation?
Is the performance benefit significant?
Learning Python can take some time, but there are Python modules that can greatly speed development time. For example, the csv module in Python makes reading and writing csv easy. Also, Python strings, arrays, maps, and other objects make it more flexible than plain C and more elegant, in my opinion, than the equivalent C++. Some things like network access may be much quicker to develop in Python as well.
However, it may take time to learn how to program Python well enough to accomplish your task. Since you are concerned with performance, I suggest trying a simple task, such as sorting a text file, in both C and Python. That will give you a better baseline on both languages in terms of performance, development time, and possibly maintenance.

It really depends a lot on what you're doing and whether the algorithm in question is available in Python via a natively compiled library. If it is, then I believe you'll be looking at performance numbers close enough that Python is most likely your answer, assuming it's your preferred language. If you must implement the algorithm yourself, then depending on the amount of logic required and the size of your data set, C/C++ may be the better option. It's hard to provide a less nebulous answer without more information.

To get an idea of the raw difference in speed, check out the Computer Languages Benchmark Game.
Then you have to decide whether that difference matters to you.
Personally, I ended up deciding that it did, but most of the time instead of using C, I ended up using other higher-level languages. Personally I mostly use Scala, but Haskell and C# and Java each have their advantages also.

Across all programs, it isn't really possible to say whether things will be quicker or slower on average in Python or C.
For the programs that I've implemented in both languages, using similar algorithms, I've seen no improvement (and sometimes a performance degradation) for string- and IO-heavy code, when reimplementing python code in C. The execution time is dominated by allocation and manipulation of strings (which functionality python implements very efficiently) and waiting for IO operations (which incurs the same overhead in either language), so the extra overhead of python makes very little difference.
But for programs that do even simple operations on image files, say (images being large enough for processing time to be noticeable compared to IO), C is enormously quicker. For this sort of task the bulk of the time running the python code is spent doing Python Stuff, and this dwarfs the time spent on the underlying operations (multiply, add, compare, etc.). When reimplemented as C, the bureaucracy goes away, the computer spends its time doing real honest work, and for that reason the thing runs much quicker.
It's not uncommon for the python code to run in (say) 5 seconds where the C code runs in (say) 0.05. So that's a 100x speedup, but in absolute terms, this is not so big a deal. It takes so much less time to write python code than it does to write C code that your program would have to be run some huge number of times to turn a time profit. I often reimplement in C, for various reasons, but if you don't have this requirement then it's probably not worth bothering. You won't get that part of your life back, and next year computers will be quicker.

Actually you can solve most of your tasks efficiently with python.
You just need to know which tools to use. For text processing there is a brilliant package from the eGenix guys: http://www.egenix.com/products/python/mxBase/mxTextTools/ I was able to create very efficient parsers with it in python, since all the heavy lifting is done by native code.
The same approach goes for any other problem: if you have performance problems, get a C/C++ library with a Python interface which efficiently implements whatever your bottleneck is.

C is definitely faster than Python, because Python itself is written in C.
C is a middle-level language and hence faster, but there is not much of a difference between C and Python in execution time.
However, it is much easier to write code in Python than in C, and it takes much less time to write code in, and to learn, Python than C.
Because it's easy to write, it's also easy to test.

You will find C is much slower. Your developers will have to keep track of memory allocation, and use libraries (such as glib) to handle simple things such as dictionaries, or lists, which python has built-in.
Moreover, when an error occurs, your C program will typically just crash, which means you'll need to get the error to happen in a debugger. Python would give you a stack trace (typically).
Your code will be bigger, which means it will contain more bugs. So not only will it take longer to write, it will take longer to debug, and will ship with more bugs. This means that customers will notice the bugs more often.
So your developers will spend longer fixing old bugs and thus new features will get done more slowly.
In the meantime, your competitors will be using a sensible programming language and their products will be rapidly gaining features and usability; yours will look bad. Your customers will leave and you'll go out of business.

The excess time to write the code in C compared to Python will be exponentially greater than the difference between C and Python execution speed.
