I found two branchless functions that find the maximum of two numbers in Python, and compared them to an if-statement function and the built-in max function. I expected the branchless or built-in versions to be the fastest, but the fastest was the if-statement function, by a large margin. Does anybody know why this is? Here are the functions:
If-statement (2.16 seconds for 25000 operations):
def max1(a, b):
    if a > b:
        return a
    return b
Built-in (4.69 seconds for 25000 operations):
def max2(a, b):
    return max(a, b)
Branchless 1 (4.12 seconds for 25000 operations):
def max3(a, b):
    return (a > b) * a + (a <= b) * b
Branchless 2 (5.34 seconds for 25000 operations):
def max4(a, b):
    diff = a - b
    return a - (diff & diff >> 31)
Your expectations about branching vs. branchless code apply to low-level languages like assembly and C. Branchless code can be faster in low-level languages because it prevents slowdowns caused by branch prediction misses. (Note: this means branchless code can be faster, but it will not necessarily be.)
Python is a high-level language. Assuming you are using the CPython interpreter: for every bytecode instruction it executes, the interpreter has to branch on the kind of opcode, and typically on many other things as well. For example, even the simple < operator requires a branch to check for the comparison opcode, another branch to check whether the object's class implements a __lt__ method, more branches to check whether the right-hand-side value is of a valid type for the comparison, and probably several others. Even your so-called "branchless" code will in practice result in a lot of branching for these reasons.
Because Python is so high-level, each bytecode instruction does quite a lot of work compared to a single machine-code instruction. So the performance of simple code like this mainly depends on how many bytecode instructions have to be interpreted (the dis sketch after this list shows how to count them):
Your max1 function has to do three loads of local variables, a comparison, a conditional jump and a return. That's six.
Your max2 function does two loads of local variables, one load of a global variable (referencing the built-in max), and also makes a function call; that requires passing arguments, and is relatively expensive compared to other bytecode instructions. On top of that, the built-in function itself has to do the same work as your own max1 function, so no wonder max2 is slower.
Your max3 function does six loads of local variables, two comparisons, two multiplications, one addition, and one return. That's twelve instructions, so we should expect it to take about twice as long as max1.
Likewise max4 does five loads of local variables, one store to a local variable, one load of a constant, two subtractions, one bitshift, one bitwise "and", and one return. That's twelve instructions again.
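These counts are easy to verify with the dis module (the exact opcodes vary a bit across CPython versions, but the proportions hold). A quick sketch, using the max1 and max3 definitions from the question:

import dis

dis.dis(max1)  # ~6 instructions: loads, one compare, one jump, a return
dis.dis(max3)  # ~12 instructions: six loads, two compares, two multiplies, an add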
That said, if we compare your max1 with the built-in max directly, instead of with your max2 (which adds an extra function call), max1 is still a bit faster than the built-in max. This is probably because the built-in max accepts a variable number of arguments, which may involve building an argument tuple, and it also has a branch to check whether it was called with a single iterable argument (e.g. max([3, 1, 4, 2])) and handle that case differently; your max1 function doesn't do any of that.
Python code is not compiled to optimized machine code, so it is highly unlikely that you get any "branchless" optimization in interpreted code.
Branchless code is sometimes faster when it effectively does less work, or when the hardware can do better branch prediction because of it.
A function call has a cost, so if the code inside the function is trivial, the relative cost of the function call is high.
There is a missing control case: calling the built-in max in a loop directly (as in max2, but without the wrapper function's call overhead). The built-in max is implemented in C and already optimized for your hardware.
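A minimal sketch of that control case with timeit (absolute numbers are machine-dependent):

import timeit

setup = '''
def max2(a, b):
    return max(a, b)
a, b = 3, 5
'''
print(timeit.timeit('max(a, b)', setup=setup))   # control: built-in called directly
print(timeit.timeit('max2(a, b)', setup=setup))  # wrapper adds a Python-level call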
Related
I am iterating the search space of valid Python3 ASTs. With max recursion depth = 3, my laptop runs out of memory. My implementation makes heavy use of generators, specifically 'yield' and itertools.product().
Ideally, I'd replace product() and the max recursion depth with some sort of iterative deepening, but first things first:
Are there any libraries or useful SO posts for out-of-core/external-memory combinatorics?
If not... I am considering the feasibility of using either dask or joblib's delayed()... or perhaps wendelin-core's ZBigArray, though I don't like the looks of its interface:
root = dbopen('test.fs')
root['A'] = A = ZBigArray((10,), np.int)
transaction.commit()
Based on this example, I think that my solution would involve an annotation/wrapper function that eagerly converts the generators to ZBigArrays, replacing root['A'] with something like root[get_key(function_name, *function_args)]. It's not pretty, since my generators are not entirely pure: the output is shuffled. In my current version this shouldn't be a big deal, but the previous and next versions involve using various NNs and RL rather than mere shuffling.
First things first: the reason you're getting the out-of-memory error is that itertools.product() materializes its inputs up front. It has no idea whether the function that gave you your generator is idempotent, and even if it did, it wouldn't be able to infer how to call it again given just the generator. This means itertools.product must buffer the values of each iterable it's passed.
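You can see this buffering directly: in the sketch below, both generators are exhausted the moment product() is constructed, before a single result has been requested.

import itertools

def noisy():
    print('exhausting a generator')
    yield from range(3)

p = itertools.product(noisy(), noisy())  # prints the message twice immediately:
                                         # each input is buffered as a tuple up front
print(next(p))                           # (0, 0)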
The solution here is to bite the small performance bullet and either write explicit for loops, or write your own cartesian product function, which takes functions that would produce each generator. For instance:
def product(*funcs, repeat=None):
    if not funcs:
        yield ()
        return
    if repeat is not None:
        funcs *= repeat
    func, *rest = funcs
    for val in func():
        for res in product(*rest):
            yield (val,) + res
from functools import partial
values = product(partial(gen1, arg1, arg2), partial(gen2, arg1))
The bonus of rolling your own here is that you can also change how it traverses the A-by-B-by-C-... dimensional search space, so that you could do, say, a breadth-first search instead of an iterative-deepening DFS. Or you could pick some space-filling curve, such as the Hilbert curve, which would iterate all indices/depths of each dimension of your product() in a locality-preserving fashion.
Apart from that, I have one more thing to point out: you can also implement BFS lazily (using generators) to avoid building a queue that could bloat memory usage as well. See this SO answer, copied below for convenience:
def breadth_first(self):
    yield self
    for c in self.breadth_first():
        if not c.children:
            return  # stop the recursion as soon as we hit a leaf
        yield from c.children
Overall, you will take a performance hit from using semi-coroutines with zero caching, all in Python-land (compared to the baked-in and heavily optimized C of CPython). However, it should still be doable: algorithmic optimizations (avoiding generating semantically nonsensical ASTs, prioritizing ASTs that suit your goal, etc.) will have a larger impact than the constant-factor performance hit.
Does Python recalculate every repeating expression in code?
For example does
a = [1,23,45,45,456,34]
b = len(a) + 213
c = len(a) + 3432
differ in performance from
a = [1,23,45,45,456,34]
l = len(a)
b = l + 213
c = l + 3432
I would guess second one uses more memory (to store l) but less cpu. Am I correct?
Does Python recalculate every repeating expression in code?
This is unspecified in the language specification and, in fact, highly dependent on the Python implementation. The mainstream implementation, CPython, does recompute the expression. PyPy (an alternative implementation focused on performance) usually does not recompute the expression in hot portions of the code, thanks to just-in-time compilation. There are many other implementations of Python (e.g. Pyston, Jython, IronPython), and each one can behave differently.
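You can confirm that CPython recomputes the expression by disassembling a function containing it:

import dis

def f(a):
    b = len(a) + 213
    c = len(a) + 3432
    return b, c

dis.dis(f)  # the output shows two separate load-and-call sequences for len:
            # CPython does no common-subexpression elimination here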
I would guess second one uses more memory (to store l) but less cpu.
Yes, but the difference is marginal and still dependent on the Python implementation used (e.g. PyPy may not require more memory in this case). Note that calling len on a list is very fast: it runs in constant time.
While the second version should be slightly faster, such micro-optimization will likely have no significant impact on a large codebase. Keep in mind that readable code is generally easier to maintain, improve and optimize.
Suppose that you have two NumPy arrays, a and b, and you want to test whether any value of a is greater than the corresponding value of b.
Now you could calculate a boolean array and call its any method:
(a > b).any()
This will do all the looping internally, which is good, but it suffers from the need to perform the comparison on all the pairs even if, say, the very first result evaluates as True.
Alternatively, you could do an explicit loop over scalar comparisons. An example implementation in the case where a and b are the same shape (so broadcasting is not required) might look like:
any(ai > bi for ai, bi in zip(a.flatten(), b.flatten()))
This will benefit from the ability to stop processing after the first True result is encountered, but with all the costs associated with an explicit loop in Python (albeit inside a comprehension).
Is there any way, either in NumPy itself or in an external library, that you could pass in a description of the operation that you wish to perform, rather than the result of that operation, and then have it perform the operation internally (in optimised low-level code) inside an "any" loop that can be broken out from?
One could imagine hypothetically some kind of interface like:
from array_operations import GreaterThan, Any
expression1 = GreaterThan('x', 'y')
expression2 = Any(expression1)
print(expression2.evaluate(x=a, y=b))
If such a thing exists, clearly it could have other uses beyond efficient evaluation of all and any, in terms of being able to create functions dynamically.
Is there anything like this?
One way to solve this is with delayed/deferred/lazy evaluation. The C++ community uses something called "expression templates" to achieve this; you can find an accessible overview here: http://courses.csail.mit.edu/18.337/2015/projects/TylerOlsen/18337_tjolsen_ExpressionTemplates.pdf
In Python the easiest way to do this is with Numba. You basically just write the function you need in Python using for loops, then decorate it with @numba.njit, and you're done. Like this:
import numba

@numba.njit
def any_greater(a, b):
    for ai, bi in zip(a.flatten(), b.flatten()):
        if ai > bi:
            return True
    return False
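For example (hypothetical inputs; the first call pays a one-time JIT-compilation cost, after which the loop runs at native speed and still short-circuits on the first hit):

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 0, 6, 7])
print(any_greater(a, b))  # True: 2 > 0, found without scanning the rest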
There is/was a NumPy enhancement proposal that could help your use case, but I don't think it has been implemented: https://docs.scipy.org/doc/numpy-1.13.0/neps/deferred-ufunc-evaluation.html
I wrote a disaster of a question on Code Review asking why Python programmers normally test if a string is a palindrome by comparing the string to itself reversed, instead of a more algorithmic way with lower complexity, assuming that the normal way would be faster.
Here is the pythonic way:
def is_palindrome_pythonic(word):
    # The slice requires N operations, plus memory,
    # and the equality requires N operations in the worst case
    return word == word[::-1]
Here is my attempt at a more efficient way to accomplish this:
def is_palindrome_normal(word):
    # This requires N/2 operations in the worst case
    low = 0
    high = len(word) - 1
    while low < high:
        if word[low] != word[high]:
            return False
        low += 1
        high -= 1
    return True
I would expect the normal way to be faster than the pythonic way; see for example this great article.
Timing it with timeit, however, brought exactly the opposite result:
setup = '''
def is_palindrome_pythonic(word):
    # ...

def is_palindrome_normal(word):
    # ...

# N here is 2000
first_half = ''.join(map(str, (i for i in range(1000))))
word = first_half + first_half[::-1]
'''

timeit.timeit('is_palindrome_pythonic(word)', setup=setup, number=1000)
# 0.0052
timeit.timeit('is_palindrome_normal(word)', setup=setup, number=1000)
# 0.4268
I then figured that my n was too small, so I changed the length of word from 2000 to 2,000,000. The pythonic way took about 16 seconds on average, whereas the normal way ran several minutes before I canceled it.
Incidentally, in the best case scenario, where the very first letter does not match the very last letter, the normal algorithm was much faster.
What explains the extreme difference between the speeds of the two algorithms?
Because the "Pythonic" way with slicing is implemented in C. The interpreter / VM doesn't need to execute more than approximately once. The bulk of the algorithm is spent in a tight loop of native code.
As much as I love Python, I have to say that if you want maximum speed you probably shouldn't be using Python. ;)
The rule of thumb in Python time optimization is to use operators or module functions that do the bulk of the work at C speed rather than equivalent code running at Python speed. Even if the two equivalent approaches are using algorithms with the same big-O complexity, the time scaling factor of (mostly) running directly on the CPU vs running on the Python virtual machine has a big impact.
This is even true of an algorithm that's mostly just integer arithmetic, since Python integers are immutable objects, so when you do arithmetic there's the overhead of allocating and initialising a new integer object and disposing of the old one. CPython tries to be frugal, and is pretty smart at managing memory (so every new object doesn't require a system call to allocate memory), and of course the CPython interpreter maintains a cache of integers from -5 to 256 (inclusive) so that arithmetic with small numbers isn't so bad. But it's certainly slower than doing arithmetic at C speed with machine integers.
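Both effects are observable from Python itself (these are CPython implementation details, not language guarantees):

x = 1000
y = x + x      # computed at run time: a brand-new int object is allocated
z = 2000
print(y == z)  # True: the values are equal
print(y is z)  # False in CPython: two distinct objects

s = 100
t = s + s      # the result 200 falls inside the cached range [-5, 256]
u = 200
print(t is u)  # True in CPython: both names refer to the cached singleton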
You can see the difference even with a simple counting loop. On my admittedly ancient 32 bit machine running Python 3.6, using the Bash time command to do the timings,
m = 5000000
for i in range(m):
    i
is roughly twice as fast as
m = 5000000
i = 0
while i < m:
    i += 1
because range can do the arithmetic at C speed, even though it still has to create a new integer object on each iteration. If you replace the i line in the range version with pass the time is roughly halved.
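A portable way to reproduce the comparison without Bash's time (absolute numbers depend on the machine and Python version):

import timeit

print(timeit.timeit('for i in range(m): i', setup='m = 5000000', number=1))
print(timeit.timeit('i = 0\nwhile i < m: i += 1', setup='m = 5000000', number=1))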
With more complicated algorithms the time differences can be much more significant, e.g. string or list copying that happens at the C level can often use efficient CPU operations that are much faster than chugging along on the Python virtual machine with Python code.
I agree that this can take a while to get used to if you come from a language that gets compiled to native machine code. And I admit that even after over 10 years of using Python, it still feels a little weird to me that when (for example) you need to do some bit manipulation, it can often be faster in Python to do it using string operations on a string composed of '0's and '1's than to do it using the traditional bitwise and arithmetic integer operators.
OTOH, I think it's useful to know the traditional algorithms as well as the Pythonic ones. It's rare that a programmer will work only in Python, so it's good to know how to do things in languages that don't work the way that Python does.
There seems to be a lot of heated discussion on the net about the changes to the reduce() function in Python 3.0 and how it should be removed. I am having a little difficulty understanding why this is the case; I find it quite reasonable to use it in a variety of cases. If the contempt were simply subjective, I cannot imagine that such a large number of people would care about it.
What am I missing? What is the problem with reduce()?
As Guido says in his The fate of reduce() in Python 3000 post:
So now reduce(). This is actually the one I've always hated most, because, apart from a few examples involving + or *, almost every time I see a reduce() call with a non-trivial function argument, I need to grab pen and paper to diagram what's actually being fed into that function before I understand what the reduce() is supposed to do. So in my mind, the applicability of reduce() is pretty much limited to associative operators, and in all other cases it's better to write out the accumulation loop explicitly.
There is an excellent example of a confusing reduce in the Functional Programming HOWTO article:
Quick, what's the following code doing?
total = reduce(lambda a, b: (0, a[1] + b[1]), items)[1]
You can figure it out, but it takes time to disentangle the expression and work out what's going on. Using a short nested def statement makes things a little bit better:
def combine(a, b):
    return 0, a[1] + b[1]

total = reduce(combine, items)[1]
But it would be best of all if I had simply used a for loop:
total = 0
for a, b in items:
    total += b
Or the sum() built-in and a generator expression:
total = sum(b for a,b in items)
Many uses of reduce() are clearer when written as for loops.
reduce() is not being removed -- it's simply being moved into the functools module. Guido's reasoning is that except for trivial cases like summation, code written using reduce() is usually clearer when written as an accumulation loop.
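So in Python 3 it is one import away:

from functools import reduce

print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4]))  # 10, same as sum([1, 2, 3, 4])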
People worry it encourages an obfuscated style of programming, doing something that can be achieved with clearer methods.
I'm not against reduce myself, I also find it a useful tool sometimes.
The primary reason for reduce's existence is to avoid writing explicit for loops with accumulators. Even though Python has some facilities to support the functional style, it is not encouraged. If you like the 'real' rather than the 'pythonic' functional style, use a modern Lisp (Clojure?) or Haskell instead.
Using reduce to compute the value of a polynomial with Horner's method is both compact and expressive.
from functools import reduce

def polynomialValue(a, x):
    """Compute polynomial value at x.

    a is an array of coefficients for the polynomial, highest degree first.
    """
    return reduce(lambda value, coef: value * x + coef, a)
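For example, evaluating 2x^2 + 3x + 1 at x = 5, with coefficients given from the highest degree down:

print(polynomialValue([2, 3, 1], 5))  # ((2*5 + 3)*5 + 1) = 66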