Here are two simple examples. In the first, the append method lookup produces a LOAD_ATTR instruction inside the loop; in the second, the lookup is done only once and the result is saved in a variable (i.e. cached). Reminder: I know there is an extend method for this task which is much faster than this.
setup = \
"""LIST = []
ANOTHER_LIST = [i for i in range(10**7)]
def appender(list, another_list):
    for elem in another_list:
        list.append(elem)
def appender_optimized(list, another_list):
    append_method = list.append
    for elem in another_list:
        append_method(elem)"""
import timeit
print(timeit.timeit("appender(LIST, ANOTHER_LIST)", setup=setup, number=10))
print(timeit.timeit("appender_optimized(LIST, ANOTHER_LIST)", setup=setup, number=10))
Results:
11.92684596051036
7.384205785584728
A 4.6 second difference (even for such a big list) is no joke - in my opinion such a difference cannot be counted as a "micro optimization". Why doesn't Python do this for me? Because bytecode must be an exact reflection of the source code? Does the compiler optimize anything at all? For example,
def te():
    a = 2
    a += 1
    a += 1
    a += 1
    a += 1
produces
LOAD_FAST 0 (a)
LOAD_CONST 2 (1)
INPLACE_ADD
STORE_FAST 0 (a)
four times, instead of optimizing it into a += 4. Or does it optimize some well-known things, like producing a bit shift instead of multiplying by 2? Am I misunderstanding something about basic language concepts?
Python is a dynamic language. This means that you have a lot of freedom in how you write code. Due to the crazy amounts of introspection that python exposes (which are incredibly useful BTW), many optimizations simply cannot be performed. For example, in your first example, python has no way of knowing what datatype list is going to be when you call it. I could create a really weird class:
class CrazyList(object):
    def append(self, value):
        def new_append(value):
            print("Hello world")
        self.append = new_append
Obviously this isn't useful, but I can write it, and it is valid Python. If I were to pass an instance of this type to your function above, it would behave differently from the version where you "cache" the append method.
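To make that concrete, here is a small sketch of my own (not from the original post) that feeds such an object to both variants of the question's loop, with the parameter renamed to lst so the built-in list isn't shadowed; the per-iteration lookup picks up the rebound attribute, while the cached bound method never does:
# uses the CrazyList class defined just above
def appender(lst, another_list):
    for elem in another_list:
        lst.append(elem)            # attribute looked up on every iteration

def appender_optimized(lst, another_list):
    append_method = lst.append      # bound method captured once, up front
    for elem in another_list:
        append_method(elem)

appender(CrazyList(), [1, 2, 3])            # prints "Hello world" twice
appender_optimized(CrazyList(), [1, 2, 3])  # prints nothing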
We could write a similar example for += (it could have side-effects that wouldn't get executed if the "compiler" optimized it away).
In order to optimize efficiently, Python would have to know your types ... and for the vast majority of your code, it has no (fool-proof) way to get that type data, so it doesn't even attempt most optimizations.
Please note that this is a micro-optimization (and a well-documented one). It is useful in some cases, but in most cases it is unnecessary if you write idiomatic Python. E.g. your list example is best written using the .extend method, as you've noted in your post. Most of the time, if you have a loop that is tight enough for the lookup time of a method to matter to your overall program runtime, then you should either find a way to rewrite just that loop to be more efficient or push the computation into a faster language (e.g. C). Some libraries are really good at this (numpy).
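For comparison, here is a sketch in the same style as the question's benchmark (illustrative only; absolute timings depend on your machine and Python version) that pushes the whole loop into a single list.extend call:
setup = \
"""LIST = []
ANOTHER_LIST = [i for i in range(10**7)]
def extender(list, another_list):
    list.extend(another_list)"""
import timeit
print(timeit.timeit("extender(LIST, ANOTHER_LIST)", setup=setup, number=10))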
With that said, there are some optimizations that can be done safely by the "compiler" in a stage known as the "peephole optimizer". For example, it will do some simple constant folding for you:
>>> import dis
>>> def foo():
... a = 5 * 6
...
>>> dis.dis(foo)
2 0 LOAD_CONST 3 (30)
3 STORE_FAST 0 (a)
6 LOAD_CONST 0 (None)
9 RETURN_VALUE
In some cases, it'll cache values for later use, or turn one type of object into another:
>>> def translate_tuple(a):
... return a in [1, 3]
...
>>> import dis
>>> dis.dis(translate_tuple)
2 0 LOAD_FAST 0 (a)
3 LOAD_CONST 3 ((1, 3))
6 COMPARE_OP 6 (in)
9 RETURN_VALUE
(Note the list got turned into a tuple and cached -- in Python 3.2+, set literals can also get turned into frozenset and cached.)
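For illustration, here is a minimal snippet of my own showing the same thing for a set literal (output taken from an older CPython 3.x; the exact constant indexes and opcodes differ between versions):
>>> import dis
>>> def in_set(a):
...     return a in {1, 3}
...
>>> dis.dis(in_set)
2 0 LOAD_FAST 0 (a)
3 LOAD_CONST 3 (frozenset({1, 3}))
6 COMPARE_OP 6 (in)
9 RETURN_VALUE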
In general Python optimises virtually nothing. It won't even optimise trivial things like x = x. Python is so dynamic that doing so correctly would be extremely hard. For example the list.append method can't be automatically cached in your first example because it could be changed in another thread, something which can't be done in a more static language like Java.
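As an illustration of the x = x point, here is a small snippet of my own (output from an older CPython; exact offsets and opcodes vary by version) showing that the redundant load/store pair is emitted as-is:
>>> import dis
>>> def noop():
...     x = 1
...     x = x
...
>>> dis.dis(noop)
2 0 LOAD_CONST 1 (1)
3 STORE_FAST 0 (x)
3 6 LOAD_FAST 0 (x)
9 STORE_FAST 0 (x)
12 LOAD_CONST 0 (None)
15 RETURN_VALUE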
I'm aware that, technically, you cannot know the length of a Python iterator without actually iterating through it.
The __length_hint__ method (i.e. it.__length_hint__()) returns an estimate of len(list(it)). There's even a wrapper around this method in the operator module (operator.length_hint), whose documentation says that the method "may over- or under-estimate by an arbitrary amount."
For finite iterators, what are the cases where __length_hint__ will be inaccurate? If this can't be known, why not?
I don't see any reference to this in PEP 424.
>>> obja = iter(range(98345984))
>>> obja.__length_hint__()
98345984
>>> import numpy as np
>>> objb = iter(np.arange(817483))
>>> objb.__length_hint__()
817483
I know it's not a great idea to rely on an implementation detail, but this is a detail that is already explicitly used in a top-level function of the operator module. Are there, for instance, specific data structures for which the hint cannot be inaccurate?
Basically, anything that is iterating over something that is generated dynamically, rather than iterating over a completed sequence.
Consider a simple iterator that flips a coin, with a head worth 1 point and a tail worth 2 points. It continues to flip the coin until you reach 4 points.
import random

def coinflip():
    s = 0
    while s < 4:
        x = random.choice([1, 2])
        s += x
        yield ("H" if x == 1 else "T")
How long will the sequence be? It could be as short as 2: TT. It could be as long as 4: either HHHH or HHHT. However, in the majority of cases it will be 3: HHT, HTH, HTT, THT or THH. In this case, 3 would be the "safest" guess, but that could be higher or lower.
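To see the same thing from the operator module's side, here is a small illustrative snippet of my own (assuming Python 3.4+, where operator.length_hint exists): the coinflip generator above exposes no hint at all, so length_hint just falls back to its default, whereas a plain range iterator knows its exact length.
import operator

print(operator.length_hint(coinflip()))        # 0  -- no __len__ or __length_hint__, the default is returned
print(operator.length_hint(coinflip(), -1))    # -1 -- an explicit default
print(operator.length_hint(iter(range(10))))   # 10 -- range iterators do provide a hint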
I've a tuple as below:
t=(1,2,3,4,5,6)
I want to convert it to a list. Although there is a straightforward way of
l=list(t)
I wanted to understand whether the following is less efficient, and if so, in what way:
l=[*t]
This is more about understanding whether unpacking and then packing back into a list has any overhead vs list(tuple).
I'll try to benchmark the two and post the results here, but if anybody can offer some insight, that would be great.
This is pretty easy to check yourself with the timeit and dis modules. I slapped together this script:
import timeit
import dis
def func(t):
    return list(t)
def unpack(t):
    return [*t]
def func_wrapper():
    t = (1,2,3,4,5,6)
    func(t)
def unpack_wrapper():
    t = (1,2,3,4,5,6)
    unpack(t)
print("Disassembly with function:")
print(dis.dis(func))
print("Disassembly with unpack:")
print(dis.dis(unpack))
print("Func time:")
print(timeit.timeit(func_wrapper, number=10000))
print("Unpack time:")
print(timeit.timeit(unpack_wrapper, number=10000))
And running it shows this output:
Disassembly with function:
5 0 LOAD_GLOBAL 0 (list)
2 LOAD_FAST 0 (t)
4 CALL_FUNCTION 1
6 RETURN_VALUE
None
Disassembly with unpack:
8 0 LOAD_FAST 0 (t)
2 BUILD_LIST_UNPACK 1
4 RETURN_VALUE
None
Func time:
0.002832347317420137
Unpack time:
0.0016913349487029865
The disassembly shows that the function-based method requires one additional function call over the unpacking method. The timing results show that, as expected, the overhead of that extra function call versus using the built-in unpacking syntax causes a significant increase in execution time.
By execution time alone, unpacking is more "efficient." But remember that execution time is only one part of the equation - this has to be balanced with readability and in some cases, memory consumption (which is harder to benchmark). In most cases, I would recommend you just stick with the function because it's easier to read. I would only switch to the unpacking method if this code is executed frequently (like in a long-running loop) and is on the critical path of your script.
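If you want to poke at the memory side too, sys.getsizeof gives a quick (if rough) comparison; this is just an illustrative check of my own, and the numbers depend on your CPython version because the two code paths may size the new list differently:
import sys

t = (1, 2, 3, 4, 5, 6)
print(sys.getsizeof(list(t)))   # size in bytes of the list built by list()
print(sys.getsizeof([*t]))      # size in bytes of the list built by unpacking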
I'm curious why in Python x[0] retrieves the first element of x while x[-1] retrieves the last element (i.e. the first element when reading in reverse order). The syntax seems inconsistent to me, since in the one case we're counting distance from the first element, whereas we don't count distance from the last element when reading backwards. Wouldn't something like x[-0] make more sense? One thought I have is that intervals in Python are generally thought of as inclusive with respect to the lower bound but exclusive for the upper bound, and so the index could maybe be interpreted as distance from a lower or upper bound element. Any ideas on why this notation was chosen? (I'm also just curious why zero indexing is preferred at all.)
The case for zero-based indexing in general is succinctly described by Dijkstra here. On the other hand, you have to think about how Python evaluates array indexes. Since the index expression is evaluated first:
x = arr[index]
will first resolve and calculate index, and since -0 obviously evaluates to 0, it would be quite impossible to have arr[-0] indicate the last element.
y = -0 (??)
x = arr[y]
would hardly make sense.
EDIT:
Let's have a look at the following function:
def test():
    y = x[-1]
Assume x has been declared above in a global scope. Now let's have a look at the bytecode:
0 LOAD_GLOBAL 0 (x)
3 LOAD_CONST 1 (-1)
6 BINARY_SUBSCR
7 STORE_FAST 0 (y)
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
Basically the global variable x (more precisely, a reference to the object it names) is pushed onto the stack. Then the array index is evaluated and pushed onto the stack. Then comes the instruction BINARY_SUBSCR, which implements TOS = TOS1[TOS] (where TOS means Top Of Stack). Then the top of the stack is popped into the variable y.
Since BINARY_SUBSCR handles negative array indices, and -0 is evaluated to 0 before being pushed onto the stack, it would take major (and unnecessary) changes to the interpreter to have arr[-0] indicate the last element of the array.
It's mostly for a couple of reasons:
Computers work with 0-based numbers
Older programming languages used 0-based indexing since they were low-level and closer to machine code
Newer, higher-level languages use it for consistency and for the same reasons
For more information: https://en.wikipedia.org/wiki/Zero-based_numbering#Usage_in_programming_languages
In many other languages that use 0-based indexing but don't implement negative indexing the way Python does, accessing the last element of a list (array) requires finding the length of the list and subtracting 1, like so:
items[len(items) - 1]
In Python, the len(items) part can simply be omitted thanks to negative indexing; consider:
>>> items = list(range(10))
>>> items[len(items) - 1]
9
>>> items[-1]
9
In python: 0 == -0, so x[0] == x[-0].
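A quick demonstration (an illustrative snippet, not from the original answer):
>>> items = [10, 20, 30]
>>> items[-0]   # -0 is just 0, so this is the first element, not the last
10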
Why is sequence indexing zero-based instead of one-based? It is a choice the language designer has to make. Most languages I know of use 0-based indexing. XPath uses 1-based indexing for selection.
Using negative indexing is also a language convention. I'm not sure why it was chosen, but it allows wrapping around the sequence by simple addition (or subtraction) on the index, as sketched below.
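For example (my own snippet): stepping backwards past index 0 wraps to the end without any special-casing, while a full circular walk still needs a modulo on the index.
>>> items = ["a", "b", "c", "d"]
>>> i = 0
>>> items[i - 1]                                     # one step back from the first element lands on the last
'd'
>>> [items[(i + k) % len(items)] for k in range(6)]  # a full circular walk
['a', 'b', 'c', 'd', 'a', 'b']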
In a programming exercise, it was first asked to program the factorial function and then calculate the sum 1! + 2! + 3! + ... + n! in O(n) multiplications (so we can't use the factorial function directly). I am not looking for the solution to this specific (trivial) problem; I'm trying to explore Haskell's abilities, and this problem is a toy I would like to play with.
I thought Python's generators could be a nice solution to this problem. For example:
from itertools import islice
def ifact():
    i, f = 1, 1
    yield 1
    while True:
        f *= i
        i += 1
        yield f
def sum_fact(n):
    return sum(islice(ifact(), n))
Then I tried to figure out whether there was something in Haskell with behavior similar to this generator, and I thought that laziness does all the stuff without any additional concept.
For example, we could replace my Python ifact with
fact = scanl1 (*) [1..]
And then solve the exercise with the following :
sum n = foldl1 (+) (take n fact)
I wonder if this solution is really "equivalent" to Python's regarding time complexity and memory usage. I would say that Haskell's solution never stores the whole list fact since its elements are used only once.
Am I right or totally wrong?
EDIT:
I should have checked more precisely:
Prelude> foldl1 (+) (take 4 fact)
33
Prelude> :sprint fact
fact = 1 : 2 : 6 : 24 : _
So (my implementation of) Haskell stores the result, even if it's no longer used.
Indeed, lazy lists can be used this way. There are some subtle differences though:
Lists are data structures. So you can keep them after evaluating them, which can be both good and bad (you can avoid recomputation of values and do recursive tricks as #ChrisDrost described, at the cost of keeping memory unreleased).
Lists are pure. In generators you can have computations with side effects; you can't do that with lists (which is often desirable).
Since Haskell is a lazy language, laziness is everywhere and if you just convert a program from an imperative language to Haskell, the memory requirements can change considerably (as #RomanL describes in his answer).
But Haskell offers more advanced tools to accomplish the generator/consumer pattern. Currently there are three libraries that focus on this problem: pipes, conduit and iteratees. My favorite is conduit, it's easy to use and the complexity of its types is kept low.
They have several advantages, in particular that you can create complex pipelines and you can base them on a chosen monad, which allows you to say what side effects are allowed in a pipeline.
Using conduit, your example could be expressed as follows:
import Data.Functor.Identity
import Data.Conduit
import qualified Data.Conduit.List as C
ifactC :: (Num a, Monad m) => Producer m a
ifactC = loop 1 1
  where
    loop r n = let r' = r * n
               in yield r' >> loop r' (n + 1)
sumC :: (Num a, Monad m) => Consumer a m a
sumC = C.fold (+) 0
main :: IO ()
main = (print . runIdentity) (ifactC $= C.isolate 5 $$ sumC)
-- alternatively running the pipeline in IO monad directly:
-- main = (ifactC $= C.isolate 5 $$ sumC) >>= print
Here we create a Producer (a conduit that consumes no input) that yields factorials indefinitely. Then we compose it with isolate, which ensures that no more than a given number of values are propagated through it, and then we compose it with a Consumer that just sums values and returns the result.
Your examples are not equivalent in memory usage. It is easy to see if you replace * with a + (so that the numbers don't get big too quickly) and then run both examples on a big n such as 10^7. Your Haskell version will consume a lot of memory and python will keep it low.
Python generator will not generate a list of values then sum it up. Instead, the sum function will get values one-by-one from the generator and accumulate them. Thus, the memory usage will remain constant.
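As a concrete illustration of that point, here is a sketch of my own that follows the suggestion above of using + instead of * so the numbers stay small: sum pulls values from the generator one at a time, whereas materialising a list first would hold all n values in memory at once.
from itertools import islice

def iadd():
    # like ifact above, but accumulating with + so the values stay small
    i, f = 1, 1
    yield 1
    while True:
        f += i
        i += 1
        yield f

n = 10**7
print(sum(islice(iadd(), n)))           # streams values one by one; memory stays flat
# print(sum(list(islice(iadd(), n))))   # would build a 10**7-element list first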
Haskell will evaluate functions lazily, but in order to calculate say foldl1 (+) (take n fact) it will have to evaluate the complete expression. For large n this will unfold into a huge expression the same way as (foldl (+) 0 [0..n]) does. For more details on evaluation and reduction have a look here: https://www.haskell.org/haskellwiki/Foldr_Foldl_Foldl%27.
You can fix your sum n by using foldl1' instead of foldl1 as described on the link above. As #user2407038 explained in his comment, you'd also need to keep fact local. The following works in GHC with a constant memory use:
let notfact = scanl1 (+) [1..]
let n = 20000000
let res = foldl' (+) 0 (take n notfact)
Note that in the case of the actual factorial in place of notfact, memory considerations are less of a concern. The numbers will get big quickly, and arbitrary-precision arithmetic will slow things down, so you won't be able to get to big values of n and actually see the difference.
Basically, yes: Haskell's lazy lists are a lot like Python's generators, if those generators were effortlessly cloneable, cacheable, and composable. Instead of raising StopIteration you return [] from your recursive function, which can thread state into the generator.
They do some cooler stuff due to self-recursion. For example, your factorial generator is more idiomatically written as:
facts = 1 : zipWith (*) facts [1..]
or the Fibonaccis as:
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
In general, any iterative loop can be converted to a recursive algorithm by promoting the loop state to arguments of a function and then calling it recursively to get the next loop cycle. Generators are just like that, but we prepend some elements on each iteration of the recursive function: go ____ = (stuff) : go ____.
The perfect equivalent is therefore:
ifact :: [Integer]
ifact = go 1 1
  where go f i = f : go (f * i) (i + 1)
sum_fact n = sum (take n ifact)
In terms of what's fastest, the absolute fastest in Haskell will probably be the "for loop":
sum_fact n = go 1 1 1
  where go acc fact i
          | i <= n = go (acc + fact) (fact * i) (i + 1)
          | otherwise = acc
The fact that this is "tail-recursive" (a call of go does not pipe any sub-calls to go to another function like (+) or (*)) means that the compiler can package it into a really tight loop, and that's why I'm comparing it with "for loops" even though that's not really a native idea to Haskell.
The above sum_fact n = sum (take n ifact) is a little slower than this but faster than sum (take n facts) where facts is defined with zipWith. The speed differences are not very large and I think mostly just come down to memory allocations that don't get used again.
We have learnt, ever since we started with C, that on a computer, within a single thread, all operations occur one by one.
I have a question about Python 3. I have seen code for swapping variable values using the expression:
a,b = b,a
Or for Fibonacci series using:
a,b = b,a+b
How can these work? But they do work :O
Does the Python system internally create some temporary variable for these? What's the order of assignment so that both effectively give the correct result?
At a high level, the right-hand side is evaluated into a tuple, which is then unpacked into the names on the left, so each variable ends up bound to the other's old value. Python is a higher-level language, so there are more abstractions like this when compared to a language like C.
At a low level, you can see quite clearly what is happening by using the dis module, which can show you the python bytecode for a function:
>>> import dis
>>> def test(x, y):
... x, y = y, x
...
>>> dis.dis(test)
2 0 LOAD_FAST 1 (y)
3 LOAD_FAST 0 (x)
6 ROT_TWO
7 STORE_FAST 0 (x)
10 STORE_FAST 1 (y)
13 LOAD_CONST 0 (None)
16 RETURN_VALUE
What happens is it uses ROT_TWO to swap the order of the items on the stack, which is a very efficient way of doing this.
When you write a, b, you create a tuple.
>>> 1, 2
(1, 2)
So, nothing special in evaluation order.
Take the Fibonacci example with a=1 and b=1. First, the right-hand side is evaluated: b, a+b, resulting in the tuple (1, 2). Next, that tuple is assigned to the left-hand side, namely a and b. So yes, the evaluation of the right side is stored in memory, and then a and b are changed to point to these new values.
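To tie it together, here is a tiny illustrative loop of my own that relies on exactly this right-then-left evaluation:
a, b = 1, 1
for _ in range(6):
    print(a, end=" ")   # prints: 1 1 2 3 5 8
    a, b = b, a + b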