Has Python got branch prediction? - python

I implemented a physics simulation in Python (most of the heavy lifting is done in numerical libraries anyways, thus performance is good enough).
Now that the project has grown a bit, I added extra functionality via parameters that do not change during the simulation. With that comes the necessity to have the program do one thing or another based on their values, i.e., quite a few if-else scattered around the code.
My question is simple: does Python implement some form of branch prediction? Am I going to hurt performance significantly, or is the interpreter smart enough to see that some parameters never change? With a constant if-else inside a function that is called a million times, is the conditional evaluated every time, or does some magic happen? When there is no easy way to remove the conditional altogether, is there a way to give the interpreter some hints and favour/emulate branch prediction?

You could in theory benefit here from some JIT functionality that may observe the control flow over time and could effectively suppress never-taken branches by rearranging the code. Some of the Python interpreters contain JIT compilers (I think PyPy does in newer versions, maybe Jython as well), and may be able to do this optimization, but that of course depends on the actual code.
However, the main form of branch prediction is done in hardware, and is unrelated to the software or language constructs used (in Python's case, quite a few levels of abstraction above). This mechanism eventually sees these conditional code paths as branches, and may be able to learn them if they are indeed statically determined. However, like any prediction mechanism, it has limited capacity, and since your code is supposed to be big, it may not be able to accommodate predictions for all these branches. It's still considered quite good, so chances are that the critical ones will be predicted well.
Finally, if you really want to optimize your code, you can convert some of these conditions to constants (assigning an argument a constant value instead of parsing the command line), or hide the condition completely with something like __debug__. This way you won't have to worry about predicting them, but can restore the capability with minimal work if you need them in the future.
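As a minimal sketch of a related option (not something the answer above spells out): because the parameter never changes during the run, the branch can be resolved once before the hot loop by binding the chosen function to a name. All names here (use_drag, step_plain, step_with_drag, make_step) and the toy physics are hypothetical, and whether this is actually faster than the plain if-else in CPython should be measured, since a function call has its own cost.

```python
def step_plain(v, dt):
    """Advance velocity under gravity only (hypothetical physics)."""
    return v - 9.81 * dt


def step_with_drag(v, dt):
    """Same step plus a linear drag term (hypothetical physics)."""
    return v - 9.81 * dt - 0.1 * v * dt


def make_step(use_drag):
    """Resolve the constant parameter once and return the matching function."""
    return step_with_drag if use_drag else step_plain


def simulate(v0, dt, n_steps, use_drag):
    step = make_step(use_drag)  # branch evaluated once, not n_steps times
    v = v0
    for _ in range(n_steps):
        v = step(v, dt)
    return v


def simulate_branching(v0, dt, n_steps, use_drag):
    """Reference version that re-tests the constant flag on every iteration."""
    v = v0
    for _ in range(n_steps):
        if use_drag:
            v = v - 9.81 * dt - 0.1 * v * dt
        else:
            v = v - 9.81 * dt
    return v
```

Both versions compute the same result; only the place where the flag is tested differs.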

Related

Where is @cupy.fuse cupy python decorator documented?

I've seen some demos of @cupy.fuse, which is nothing short of a miracle for GPU programming using Numpy syntax. The major problem with cupy is that each operation, like an add, is a full kernel launch, then kernel free. So a series of adds and multiplies, for example, pays a lot of kernel pain. (This is why one might be better off using numba @jit.)
@cupy.fuse() appears to fix this by merging all the operations inside the function into a single kernel, dramatically lowering the launch and free costs.
But I cannot find any documentation of this other than the demos and the source code for cupy.fusion class.
Questions I have include:
Will cupy.fuse aggressively inline any python functions called inside the function the decorator is applied to, thereby rolling them into the same kernel?
This enhancement log hints at this but doesn't say whether composed functions end up in the same kernel or are simply allowed when the called functions are also decorated.
https://github.com/cupy/cupy/pull/1350
If so, do I need to decorate those functions with @fuse? I'm thinking that might impair the inlining rather than aid it, since it might render those functions into a non-fusable (maybe non-python) form.
If not, could I get automatic inlining by first decorating the function with @numba.jit and then decorating with @fuse? Or would the @jit again render the resulting python in a non-fusable form?
What breaks @fuse? What are the pitfalls? Is @fuse experimental and not likely to be maintained?
references:
https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
https://www.slideshare.net/pfi/automatically-fusing-functions-on-cupy
https://github.com/cupy/cupy/blob/master/cupy/core/fusion.py
https://docs-cupy.chainer.org/en/stable/overview.html
https://github.com/cupy/cupy/blob/master/cupy/manipulation/tiling.py
(SOME) ANSWERS: I have found answers to some of these questions, which I'm posting here.
questions:
Fusing kernels is such a huge advance that I don't understand when I would ever not want to use @fuse. Isn't it always better? When is it a bad idea?
Answer: Fuse does not support many useful operations yet. For example, z = cupy.empty_like(x) does not work, nor does referring to globals. Hence it simply cannot be applied universally.
I'm wondering about its composability.
Will @fuse inline the functions it finds within the decorated function?
Answer: Looking at timings, and nvvm markings it looks like it does pull in subroutines and fuse them into the kernel. So dividing things into subroutines rather than monolithic code will work with fuse.
I see that a bug fix in the release notes says that it can now handle calling other functions decorated with @fuse. But this does not say if their kernels are fused or remain separate.
Answer: Looking at NVVM output it appears they are joined. It's hard to say if there is some residual overhead, but the timings don't show the significant overhead that two separate kernels would cause. The key thing is that it now works. As of cupy 4.1 you could not call a fused function from a fused function, as the return types were wrong. But since 5.1 you can. However, you do not need to decorate those functions. It just works whether you do or do not.
Why isn't it documented?
Answer: It appears to have some bugs and some incomplete functionality. The code also advises the API for it is subject to change.
However this is basically a miracle function when it can be used, easily improving speed by an order of magnitude on small to medium size arrays. So it would be nice if even this alpha version were documented.
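For orientation, here is a minimal sketch of the pattern discussed above: a fused function calling a plain, undecorated helper. It assumes cupy.fuse is applied as a decorator with parentheses, and it falls back to NumPy with a no-op stand-in for fuse when no usable CuPy/GPU is found, so the fallback path fuses nothing. The helper and function names are made up for illustration.

```python
import numpy as np

try:
    import cupy as xp
    xp.zeros(1)  # probe: fails if no usable GPU is present
    fuse = xp.fuse
except Exception:
    xp = np  # fallback so the sketch still runs on the CPU

    def fuse(*args, **kwargs):
        """No-op stand-in for cupy.fuse (does not fuse anything)."""
        return lambda func: func


def diff(x, y):
    # Plain, undecorated helper called from inside the fused function.
    return x - y


@fuse()
def squared_diff(x, y):
    d = diff(x, y)
    return d * d


x = xp.arange(5, dtype=xp.float32)
y = xp.ones(5, dtype=xp.float32)
result = squared_diff(x, y)
```

Whether the helper really ends up in the same kernel is what the answer above infers from timings and NVVM output; this sketch only shows the calling pattern, not proof of fusion.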

When building Python with profile guided optimization do I have to leave the computer alone?

This may seem or even be a stupid question: When I build something self-tuning like Python with PGO (or ATLAS or I believe FFTW also does it), does the computer have to be otherwise idle (to not interfere with the measurements) or can I pass the time playing Doom?
The linked README from the python source distribution seems to deem this too trivial a matter to mention, but I'm genuinely unsure about this.
What you do on your computer while it is performing the PGO measurements should have no impact whatsoever on the result of the optimization. PGO uses measurements to find the hot paths in the code for a given data set, and uses this information to make the program as fast as possible for that data set. Which path is hot and which is not is independent of other programs running on the computer.
To explain things a bit: when optimizing code there are trade-offs. The improvement will be higher in some parts of the code and lower in others, depending on which code transforms are used and where they are applied. To get a better final result you want high improvements in code that is executed a lot (hot code in compiler lingo), while you can live with smaller improvements in code that is executed less frequently (cold code). Normally a set of heuristics is used to identify the hot parts of the program and apply optimizations in a way that makes these parts as fast as possible. The problem with this approach is that the heuristics do not know anything about how the program will be used in practice and may misidentify hot code as cold.
Profile guided optimization (PGO) is a method to help the compiler locate the hot parts of the code using data from real executions. As a first step you tell the compiler to build an instrumented version of the program to measure how the code is executed in practice, typically by adding counters that count the number of iterations in loops and which branch is chosen in if-statements. The second step is to run the instrumented program on real data. At the end of execution the program outputs the values of all the added counters, and by matching counters with the code it is possible to see which parts of the program are hot (high numbers) and which are cold (low numbers). Finally the program is compiled again, this time augmented with the program profile. This means the compiler no longer needs to guess which parts should be faster and which could be slower; it can look it up in the profile.
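The counter idea can be imitated in a few lines of Python. This is only a toy illustration with hand-written counters and hypothetical names, not how a real compiler instruments code, but it shows why the profile depends on the input data rather than on how busy the machine is: the counters record how often each branch is taken, not how long anything took.

```python
from collections import Counter

counters = Counter()


def classify(n):
    """Toy function with hand-inserted counters, mimicking instrumentation."""
    if n % 2 == 0:
        counters["even_branch"] += 1
        return "even"
    counters["odd_branch"] += 1
    return "odd"


def run_training_workload(data):
    """Run the 'instrumented' code on sample data and return the profile."""
    counters.clear()
    for n in data:
        classify(n)
    return dict(counters)


profile = run_training_workload([2, 4, 6, 8, 9])
hot_branch = max(profile, key=profile.get)
```

Running the same workload again yields the same counts, whatever else the computer is doing at the time.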

How much slower python classes are compared to their equivalent functions?

When I started learning Python, I created a few applications just using functions and procedural code. However, now I know classes and have realized that the code can be much more readable (and subjectively easier to understand) if I rewrite it with classes.
How much slower might the equivalent classes be compared to the functions in general? Will the initializer and methods in the classes make any considerable difference in speed?
To answer the question: yes, it is likely to be a little slower, all else being equal. Some things that used to be variables (including functions) are now going to be object attributes, and self.foo is always going to be slightly slower than foo regardless of whether foo was a global or local originally. (Local variables are accessed by index, and globals by name, but an attribute lookup on an object is either a local or a global lookup, plus an additional lookup by name for the attribute, possibly in multiple places.) Calling a method is also slightly slower than calling a function -- not only is it slower to get the attribute, it is also slower to make the call, because a method is a wrapper object that calls the function you wrote, adding an extra function call overhead.
Will this be noticeable? Usually not. In rare cases it might be, say if you are accessing an object attribute a lot (thousands or millions of times) in a particular method. But in that case you can just assign self.foo to a local variable foo at the top of the method, and reference it by the local name throughout, to regain 99.44% of the local variable's performance advantage.
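The local-alias trick described above can be sketched as follows. The class and attribute names are hypothetical, and the printed timings will vary by machine and Python version, so no particular speedup is claimed here.

```python
import timeit


class Accumulator:
    def __init__(self, scale):
        self.scale = scale

    def total_attr(self, values):
        """Look up self.scale on every iteration."""
        total = 0
        for v in values:
            total += v * self.scale
        return total

    def total_local(self, values):
        """Bind self.scale to a local name once, then reuse it."""
        scale = self.scale
        total = 0
        for v in values:
            total += v * scale
        return total


acc = Accumulator(3)
values = list(range(10_000))
t_attr = timeit.timeit(lambda: acc.total_attr(values), number=50)
t_local = timeit.timeit(lambda: acc.total_local(values), number=50)
print(f"attribute lookup: {t_attr:.4f}s, local name: {t_local:.4f}s")
```

Both methods return the same value; the second simply avoids repeating the attribute lookup inside the loop.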
Beyond that there will be some overhead for allocating memory for instances that you probably didn't have before, but unless you are constantly creating and destroying instances, this is likely a one-time cost.
In short: there will be a likely-minor performance hit, and where the performance hit is more than minor, it is easy to mitigate. On the other hand, you could save hours in writing and maintaining the code, assuming your problem lends itself to an object-oriented solution. And saving time is likely why you're using a language like Python to begin with.
No.
In general you will not notice any difference in performance based on using classes or not. The different code structures implied may mean that one is faster than the other, but it's impossible to say which.
Always write code to be read, then if, and only if, it's not fast enough make it faster. Remember: Premature optimization is the root of all evil.
Donald Knuth, one of the grand old minds of computing, is credited with the observation that "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." Deciding to use procedural techniques rather than object-oriented ones on the basis of speed gains that may well not be realized anyway is not a sensible strategy.
If your code works and doesn't need to be modified then feel free to leave it alone. If it needs to be modified then you should consider a judicious refactoring to include classes, since program readability is far more important than speed during development. You will also see benefits in improved maintainability. An old saw from Kernighan and Plauger's "Elements of Programming Style" still applies:
First, make it work. Then (if it doesn't work fast enough) make it work faster.
But, first and foremost, go for readability. Seriously.
You probably don't care as much as you think you do.
Really.
Sure, code with classes might be a little slower through indirection. Maybe. That is what JIT compilation is for, right? I can never remember which versions of python do this and which don't, because:
Performance doesn't matter.
At least constant performance differences like this. Unless you are doing a hell of a lot of computations (you aren't!), you will spend more time developing/debugging/maintaining your code. Optimize for that.
Really. Because you will never ever be able to measure the difference, unless you are in a tight loop. And you don't want to be doing that in python anyway, unless you don't really care about time. It's not like you're trying to balance your segway in python, right? You just want to compute some numbers, right? Your computer is really good at this. Trust it.
That said, this doesn't mean classes are the way to go. Just that speed isn't the question you should be asking. Instead, try to figure out what representation will be the best for your code. It seems, now you know classes, you will write clean code in OO fashion. Go ahead. Learn. Iterate.

Performance differences between Python and C

Working on different projects I have the choice of selecting different programming languages, as long as the task is done.
I was wondering what the real difference is, in terms of performance, between writing a program in Python, versus doing it in C.
The tasks to be done are pretty varied, e.g. sorting textfiles, disk access, network access, textfile parsing.
Is there really a noticeable difference between sorting a textfile using the same algorithm in C versus Python, for example?
And in your experience, given the power of current CPUs (i7), is it really a noticeable difference? (Consider that it's a program that doesn't bring the system to its knees.)
Use python until you have a performance problem. If you ever have one figure out what the problem is (often it isn't what you would have guessed up front). Then solve that specific performance problem which will likely be an algorithm or data structure change. In the rare case that your problem really needs C then you can write just that portion in C and use it from your python code.
C will absolutely crush Python in almost any performance category, but C is far more difficult to write and maintain and high performance isn't always worth the trade off of increased time and difficulty in development.
You say you're doing things like text file processing, but what you omit is how much text file processing you're doing. If you're processing 10 million files an hour, you might benefit from writing it in C. But if you're processing 100 files an hour, why not use python? Do you really need to be able to process a text file in 10ms vs 50ms? If you're planning for the future, ask yourself, "Is this something I can just throw more hardware at later?"
Writing solid code in C is hard. Be sure you can justify that investment in effort.
In general, IO-bound work will depend more on the algorithm than on the language. In this case I would go with Python because it has first-class strings and lots of easy-to-use libraries for manipulating files, etc.
Is there really a noticeable difference between sorting a textfile using the same algorithm in C versus Python, for example?
Yes.
The noticeable differences are these
There's much less Python code.
The Python code is much easier to read.
Python supports really nice unit testing, so the Python code tends to be higher quality.
You can write the Python code more quickly, since there are fewer quirky language features. No preprocessor, for example, really saves a lot of hacking around. Super-experienced C programmers hardly notice it. But all that #include sandwich stuff and making the .h files correct is remarkably time-consuming.
Python can be easier to package and deploy, since you don't need a big fancy make script to do a build.
The first rule of computer performance questions: Your mileage will vary. If small performance differences are important to you, the only way you will get valid information is to test with your configuration, your data, and your benchmark. "Small" here is, say, a factor of two or so.
The second rule of computer performance questions: For most applications, performance doesn't matter -- the easiest way to write the app gives adequate performance, even when the problem scales. If that is the case (and it is usually the case) don't worry about performance.
That said:
C compiles down to a machine executable and thus has the potential to execute at least as fast as any other language
Python is generally interpreted and thus may take more CPU than a compiled language
Very few applications are "CPU bound." I/O (to disk, display, or memory) is not greatly affected by compiled vs interpreted considerations and frequently is a major part of computer time spent on an application
Python works at a higher level of abstraction than C, so your development and debugging time may be shorter
My advice: Develop in the language you find the easiest with which to work. Get your program working, then check for adequate performance. If, as usual, performance is adequate, you're done. If not, profile your specific app to find out what is taking longer than expected or tolerable. See if and how you can fix that part of the app, and repeat as necessary.
Yes, sometimes you might need to abandon work and start over to get the performance you need. But having a working (albeit slow) version of the app will be a big help in making progress. When you do reach and conquer that performance goal you'll be answering performance questions in SO rather than asking them.
If your text files that you are sorting and parsing are large, use C. If they aren't, it doesn't matter. You can write poor code in any language though. I have seen simple code in C for calculating areas of triangles run 10x slower than other C code, because of poor memory management, use of structures, pointers, etc.
Your I/O algorithm should be independent of your compute algorithm. If this is the case, then using C for the compute algorithm can be much faster.
(Assumption - The question implies that the author is familiar with C but not Python, therefore I will base my answer with that in mind.)
I was wondering what the real difference is, in terms of performance, between writing a program in Python, versus doing it in C.
C will almost certainly be faster unless it is implemented poorly, but the real questions are:
What are the development implications (development time, maintenance, etc.) for either implementation?
Is the performance benefit significant?
Learning Python can take some time, but there are Python modules that can greatly speed development time. For example, the csv module in Python makes reading and writing csv easy. Also, Python strings, arrays, maps, and other objects make it more flexible than plain C and more elegant, in my opinion, than the equivalent C++. Some things like network access may be much quicker to develop in Python as well.
However, it may take time to learn how to program Python well enough to accomplish your task. Since you are concerned with performance, I suggest trying a simple task, such as sorting a text file, in both C and Python. That will give you a better baseline on both languages in terms of performance, development time, and possibly maintenance.
It really depends a lot on what you're doing and whether the algorithm in question is available in Python via a natively compiled library. If it is, then I believe you'll be looking at performance numbers close enough that Python is most likely your answer, assuming it's your preferred language. If you must implement the algorithm yourself, depending on the amount of logic required and the size of your data set, C/C++ may be the better option. It's hard to provide a less nebulous answer without more information.
To get an idea of the raw difference in speed, check out the Computer Language Benchmarks Game.
Then you have to decide whether that difference matters to you.
Personally, I ended up deciding that it did, but most of the time instead of using C, I ended up using other higher-level languages. Personally I mostly use Scala, but Haskell and C# and Java each have their advantages also.
Across all programs, it isn't really possible to say whether things will be quicker or slower on average in Python or C.
For the programs that I've implemented in both languages, using similar algorithms, I've seen no improvement (and sometimes a performance degradation) for string- and IO-heavy code, when reimplementing python code in C. The execution time is dominated by allocation and manipulation of strings (which functionality python implements very efficiently) and waiting for IO operations (which incurs the same overhead in either language), so the extra overhead of python makes very little difference.
But for programs that do even simple operations on image files, say (images being large enough for processing time to be noticeable compared to IO), C is enormously quicker. For this sort of task the bulk of the time running the python code is spent doing Python Stuff, and this dwarfs the time spent on the underlying operations (multiply, add, compare, etc.). When reimplemented as C, the bureaucracy goes away, the computer spends its time doing real honest work, and for that reason the thing runs much quicker.
It's not uncommon for the python code to run in (say) 5 seconds where the C code runs in (say) 0.05. So that's a 100x speedup, but in absolute terms this is not so big a deal. It takes so much less time to write python code than it does to write C code that your program would have to be run some huge number of times to turn a time profit. I often reimplement in C, for various reasons, but if you don't have this requirement then it's probably not worth bothering. You won't get that part of your life back, and next year computers will be quicker.
Actually you can solve most of your tasks efficiently with python.
You just should know which tools to use. For text processing there is brilliant package from Egenix guys - http://www.egenix.com/products/python/mxBase/mxTextTools/. I was able to create very efficient parsers with it in python, since all the heavy lifting is done by native code.
Same approach goes for any other problem - if you have performance problems, get a C/C++ library with Python interface which implements whatever bottleneck you got efficiently.
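The "heavy lifting in native code" idea can be seen even without third-party packages. The sketch below, which assumes the standard CPython interpreter where the built-in sum() is implemented in C, compares a loop executed step by step by the interpreter with the same work delegated to the built-in. The function names are illustrative, and timings vary by machine, so none are quoted.

```python
import timeit

data = list(range(100_000))


def python_loop_sum(values):
    """Every add and loop step is dispatched by the interpreter."""
    total = 0
    for v in values:
        total += v
    return total


def builtin_sum(values):
    """The loop runs inside the interpreter's built-in sum()."""
    return sum(values)


t_loop = timeit.timeit(lambda: python_loop_sum(data), number=20)
t_builtin = timeit.timeit(lambda: builtin_sum(data), number=20)
print(f"Python loop: {t_loop:.4f}s, built-in sum: {t_builtin:.4f}s")
```

The same reasoning applies to larger bottlenecks: the more of the inner loop that runs inside compiled code, the less interpreter overhead is paid per element.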
C is definitely faster than Python because Python is written in C.
C is a middle-level language and hence faster, but there is not a great difference between C and Python in execution time.
But it is much easier to write code in Python than in C, and it takes much less time to write code in, and to learn, Python than C.
Because it's easy to write, it's easy to test also.
You will find C is much slower. Your developers will have to keep track of memory allocation, and use libraries (such as glib) to handle simple things such as dictionaries, or lists, which python has built-in.
Moreover, when an error occurs, your C program will typically just crash, which means you'll need to get the error to happen in a debugger. Python would give you a stack trace (typically).
Your code will be bigger, which means it will contain more bugs. So not only will it take longer to write, it will take longer to debug, and will ship with more bugs. This means that customers will notice the bugs more often.
So your developers will spend longer fixing old bugs and thus new features will get done more slowly.
In the meantime, your competitors will be using a sensible programming language and their products will be increasing in features and usability rapidly; yours will look bad. Your customers will leave and you'll go out of business.
The excess time to write the code in C compared to Python will be exponentially greater than the difference between C and Python execution speed.

Should I optimise my python code like C++? Does it matter?

I had an argument with a colleague about writing python efficiently. He claimed that though you are programming python you still have to optimise the little bits of your software as much as possible, as if you are writing an efficient algorithm in C++.
Things like:
In an if statement with an or always put the condition most likely to fail first, so the second will not be checked.
Use the most efficient functions for manipulating strings in common use. Not code that grinds strings, but simple things like doing joins and splits, and finding substrings.
Call as few functions as possible, even if it comes at the expense of readability, because of the overhead this creates.
I say, that in most cases it doesn't matter. I should also say that context of the code is not a super-efficient NOC or missile-guidance systems. We're mostly writing tests in python.
What's your view of the matter?
My answer to that would be:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
(Quoting Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268)
If your application is doing anything like a query to the database, that one query will take more time than anything you can gain with that kind of small optimization, anyway...
And if you are chasing performance like that, why not code in assembly language, after all? Because Python is easier/faster to write and maintain? Well, if so, you are right :-)
The most important thing is that your code is easy to maintain, not a couple of microseconds of CPU time!
Well, maybe except if you have thousands of servers, but is that your case?
The answer is really simple:
Follow Python best practices, not C++ best practices.
Readability in Python is more important than speed.
If performance becomes an issue, measure, then start optimizing.
This sort of premature micro-optimisation is usually a waste of time in my experience, even in C and C++. Write readable code first. If it's running too slowly, run it through a profiler, and if necessary, fix the hot-spots.
Fundamentally, you need to think about return on investment. Is it worth the extra effort in reading and maintaining "optimised" code for the couple of microseconds it saves you? In most cases it isn't.
(Also, compilers and runtimes are getting cleverer. Some micro-optimisations may become micro-pessimisations over time.)
I agree with others: readable code first ("Performance is not a problem until performance is a problem.").
I only want to add that when you absolutely need to write some unreadable and/or non-intuitive code, you can generally isolate it in few specific methods, for which you can write detailed comments, and keep the rest of your code highly readable. If you do so, you'll end up having easy to maintain code, and you'll only have to go through the unreadable parts when you really need to.
I should also say that context of the code is not a super-efficient NOC or missile-guidance systems. We're mostly writing tests in python.
Given this, I'd say that you should take your colleague's advice about writing efficient Python but ignore anything he says that goes against prioritizing readability and maintainability of the code, which will probably be more important than the speed at which it'll execute.
In an if statement with an or always put the condition most likely to fail first, so the second will not be checked.
This is generally good advice, but it also depends on the logic of your program. If it makes sense that the second statement is not evaluated if the first returns false, then do so. Doing the opposite could be a bug otherwise.
Use the most efficient functions for manipulating strings in common use. Not code that grinds strings, but simple things like doing joins and splits, and finding substrings.
I don't really get this point. Of course you should use the library provided functions, because they are probably implemented in C, and a pure python implementation is most likely to be slower. In any case, no need to reinvent the wheel.
Call as few functions as possible, even if it comes at the expense of readability, because of the overhead this creates.
$ cat withcall.py
def square(a):
    return a*a
for i in xrange(1,100000):
    i_square = square(i)
$ cat withoutcall.py
for i in xrange(1,100000):
    i_square = i*i
$ time python2.3 withcall.py
real 0m5.769s
user 0m4.304s
sys 0m0.215s
$ time python2.3 withcall.py
real 0m5.884s
user 0m4.315s
sys 0m0.206s
$ time python2.3 withoutcall.py
real 0m5.806s
user 0m4.172s
sys 0m0.209s
$ time python2.3 withoutcall.py
real 0m5.613s
user 0m4.171s
sys 0m0.216s
I mean... come on... please.
I think there are several related 'urban legends' here.
False: Putting the more often-checked condition first in a conditional, and similar optimizations, saves enough time in a typical program to be worth a typical programmer's effort.
True: Some, but not many, people use such styles in Python in the incorrect belief outlined above.
True: Many people use such a style in Python when they think that it improves the readability of a Python program.
About readability: I think it's indeed useful when you give the most useful conditional first, since this is what people notice first anyway. You should also use ''.join() if you mean concatenation of strings since it's the most direct way to do it (the s += x operation could mean something different).
"Call as few functions as possible" decreases readability and goes against the Pythonic principle of code reuse. And so it's not a style people use in Python.
Before introducing performance optimizations at the expense of readability, look into modules like psyco that will do some JIT-ish compiling of distinct functions, often with striking results, with no impairment of readability.
Then if you really want to embark on the optimization path, you must first learn to measure and profile. Optimization MUST BE QUANTITATIVE - do not go with your gut. The hotspot profiler will show you the functions where your program is burning up the most time.
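A minimal sketch of "measure first" using the standard library's cProfile and pstats modules. The workload functions are hypothetical and deliberately lopsided so that one of them dominates the report.

```python
import cProfile
import io
import pstats


def slow_part(n):
    """Deliberately heavier function (hypothetical workload)."""
    return sum(i * i for i in range(n))


def fast_part(n):
    return n + 1


def main():
    total = 0
    for _ in range(50):
        total += slow_part(10_000)
        total += fast_part(1)
    return total


profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Collect the report into a string, sorted by cumulative time, top 5 entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists call counts and times per function, which is the quantitative evidence to look at before changing any code.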
If profiling shows that a function like this is being called frequently:
def get_order_qty(ordernumber):
    # look up order in database and return quantity
and if there is any repetition of ordernumbers, then memoization would be a good optimization technique to learn, and it is easily packaged in an @memoize decorator so that there is little impact on program readability. The effect of memoizing is that values returned for a given set of input arguments are cached, so that the expensive function is called only once, with subsequent calls resolved against the cache.
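A sketch of the memoization idea, using the standard library's functools.lru_cache in place of a hand-written memoize decorator (a substitution on my part, not the answer's own code). The "database" lookup is a hypothetical stand-in.

```python
from functools import lru_cache


def slow_db_lookup(ordernumber):
    """Hypothetical stand-in for an expensive database query."""
    return ordernumber * 10


@lru_cache(maxsize=None)
def get_order_qty(ordernumber):
    # Look up the order in the "database" and return the quantity.
    # Results are cached per distinct ordernumber.
    return slow_db_lookup(ordernumber)


quantities = [get_order_qty(n) for n in (1, 2, 1, 1, 2)]
info = get_order_qty.cache_info()
```

Here only the two distinct order numbers reach the lookup; the repeated calls are answered from the cache. This is only safe when the underlying data does not change between calls.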
Lastly, consider lifting invariants out of loops. For large multi-dimensional structures, this can save a lot of time - in fact in this case, I would argue that this optimization improves readability, as it often serves to make clear that some expression can be computed at a high-level dimension in the nested logic.
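Lifting an invariant out of a loop can be sketched like this; the function names and the use of math.cos are illustrative assumptions, not taken from the answer.

```python
import math


def scale_rows_naive(grid, angle):
    """Recomputes math.cos(angle) for every cell."""
    return [[cell * math.cos(angle) for cell in row] for row in grid]


def scale_rows_hoisted(grid, angle):
    """Computes the loop-invariant factor once, outside both loops."""
    factor = math.cos(angle)
    return [[cell * factor for cell in row] for row in grid]


grid = [[1.0, 2.0], [3.0, 4.0]]
```

The hoisted version also makes it explicit that the factor depends only on angle, not on the row or cell.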
BTW, is this really what you meant?
•In an if statement with an or always put the condition most likely to fail first, so the second will not be checked.
I should think this might be the case for "and", but an "or" will short-circuit if the first value is True, saving the evaluation of the second term of the conditional. So I would change this optimization "rule" to:
If testing "A and B", put A first if it is more likely to evaluate to False.
If testing "A or B", put A first if it is more likely to evaluate to True.
But often, the sequence of conditions is driven by the tests themselves:
if obj is not None and hasattr(obj,"name") and obj.name.startswith("X"):
You can't reorder these for optimization; they have to be in this order (or just let the exceptions fly and catch them later):
if obj.name.startswith("X"):
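The short-circuit behaviour behind these rules can be checked directly. In this small sketch, check is a hypothetical helper that records which operands were actually evaluated.

```python
calls = []


def check(name, value):
    """Record that this operand was evaluated, then return its value."""
    calls.append(name)
    return value


calls.clear()
and_result = check("A", False) and check("B", True)
and_calls = list(calls)  # B is skipped because A is False

calls.clear()
or_result = check("A", True) or check("B", False)
or_calls = list(calls)  # B is skipped because A is True
```

In both cases only the first operand runs, which is why the ordering rule differs between "and" and "or".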
Sure follow Python best-practices (and in fact I agree with the first two recommendations), but maintainability and efficiency are not opposites, they are mostly togethers (if that's a word).
Statements like "always write your IF statements a certain way for performance" are a-priori, i.e. not based on knowledge of what your program spends time on, and are therefore guesses. The first (or second, or third, whatever) rule of performance tuning is don't guess.
If after you measure, profile, or in my case do this, you actually know that you can save much time by re-ordering tests, by all means, do. My money says that's at the 1% level or less.
My visceral reaction is this:
I've worked with guys like your colleague and in general I wouldn't take advice from them.
Ask him if he's ever even used a profiler.
