Here is my simple code example:
import time
t0 = time.time()
s = 0
for i in range(1000000):
    s += i
t1 = time.time()
print(s, t1 - t0)
t0 = time.time()
s = sum(i for i in range(1000000))
t1 = time.time()
print(s, t1 - t0)
On my computer (with Python 3.8) it prints:
499999500000 0.22901296615600586
499999500000 1.6930372714996338
So, doing += a million times is 7 times faster than calling sum? That is really unexpected. What is it doing?
Edit: I foolishly allowed a debugger to attach to the process and interfere with my measurements, which turned out to be the cause of the slowness. With the debugger detached, the measurements are no longer so unpredictable. As some of the answers clearly show, what I observed should not happen.
Let's use timeit for proper benchmarking, and, to make it easy to also compare different Python versions, let's run this in Docker containers:
so62514160.py
N = 1000000

def m1():
    s = 0
    for i in range(N):
        s += i

def m2():
    s = sum(i for i in range(N))

def m3():
    s = sum(range(N))
so62514160bench.sh
for image in python:2.7 python:3.6 python:3.7 python:3.8; do
    for fun in m1 m2 m3; do
        echo -n "$image" "$fun "
        docker run --rm -it -v $(pwd):/app -w /app -e PYTHONDONTWRITEBYTECODE=1 "$image" python -m timeit -s 'import so62514160 as s' "s.$fun()"
    done
done
results on my machine:
python:2.7 m1 10 loops, best of 3: 43.5 msec per loop
python:2.7 m2 10 loops, best of 3: 39.6 msec per loop
python:2.7 m3 100 loops, best of 3: 17.1 msec per loop
python:3.6 m1 10 loops, best of 3: 41.9 msec per loop
python:3.6 m2 10 loops, best of 3: 46 msec per loop
python:3.6 m3 100 loops, best of 3: 17.7 msec per loop
python:3.7 m1 5 loops, best of 5: 45 msec per loop
python:3.7 m2 5 loops, best of 5: 40.7 msec per loop
python:3.7 m3 20 loops, best of 5: 17.3 msec per loop
python:3.8 m1 5 loops, best of 5: 48.2 msec per loop
python:3.8 m2 5 loops, best of 5: 44.6 msec per loop
python:3.8 m3 10 loops, best of 5: 19.2 msec per loop
[plot of the benchmark results above]
First of all, your observation probably does not generalize well to other systems: your way of measuring is quite unreliable, because it is susceptible to performance fluctuations that are largely dominated by how your OS reacts to the system load at the time of measurement.
You should use timeit or something similar.
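For instance, a minimal sketch of how timeit could be used here (the function names loop_add and gen_sum are just illustrative):
import timeit

def loop_add(n):
    s = 0
    for i in range(n):
        s += i
    return s

def gen_sum(n):
    return sum(i for i in range(n))

n = 1000000
print(timeit.timeit(lambda: loop_add(n), number=10))  # time 10 runs of the explicit loop
print(timeit.timeit(lambda: gen_sum(n), number=10))   # time 10 runs of sum() over a generator expression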
Concretely, these are the timings I get on Python 3.6 in a virtual environment (Google Colab), which seem to be quite reproducible across the other answers:
import numba as nb

def sum_loop(n):
    result = 0
    for x in range(n):
        result += x
    return result

sum_loop_nb = nb.jit(sum_loop)
sum_loop_nb.__name__ = 'sum_loop_nb'

def sum_analytical(n):
    return n * (n - 1) // 2

def sum_list(n):
    return sum([x for x in range(n)])

def sum_gen(n):
    return sum(x for x in range(n))

def sum_range(n):
    return sum(range(n))

sum_loop_nb(10)  # to trigger compilation

funcs = sum_analytical, sum_loop, sum_loop_nb, sum_gen, sum_list, sum_range
n = 1000000
for func in funcs:
    print(func.__name__, func(n))
    %timeit func(n)
# sum_analytical 499999500000
# 10000000 loops, best of 3: 222 ns per loop
# sum_loop 499999500000
# 10 loops, best of 3: 55.6 ms per loop
# sum_loop_nb 499999500000
# 10000000 loops, best of 3: 196 ns per loop
# sum_gen 499999500000
# 10 loops, best of 3: 51.7 ms per loop
# sum_list 499999500000
# 10 loops, best of 3: 68.4 ms per loop
# sum_range 499999500000
# 100 loops, best of 3: 17.8 ms per loop
It is unlikely that you will observe very different timings across different Python versions.
The sum_analytical() and sum_loop_nb() versions have been included just for fun and are not analyzed further.
sum_list() also behaves quite differently from the rest, as it creates a large and largely unnecessary list object for the computation, so it is not analyzed further either.
The reason for these different timings lies, of course, in the bytecode produced for each version of the function. In particular, from sum_loop() through sum_range() one gets progressively simpler code:
import dis

funcs = sum_loop, sum_gen, sum_range
for func in funcs:
    print(func.__name__)
    print(dis.dis(func))
    print()
# sum_loop
# 2 0 LOAD_CONST 1 (0)
# 2 STORE_FAST 1 (result)
# 3 4 SETUP_LOOP 24 (to 30)
# 6 LOAD_GLOBAL 0 (range)
# 8 LOAD_FAST 0 (n)
# 10 CALL_FUNCTION 1
# 12 GET_ITER
# >> 14 FOR_ITER 12 (to 28)
# 16 STORE_FAST 2 (x)
# 4 18 LOAD_FAST 1 (result)
# 20 LOAD_FAST 2 (x)
# 22 INPLACE_ADD
# 24 STORE_FAST 1 (result)
# 26 JUMP_ABSOLUTE 14
# >> 28 POP_BLOCK
# 5 >> 30 LOAD_FAST 1 (result)
# 32 RETURN_VALUE
# None
# sum_gen
# 9 0 LOAD_GLOBAL 0 (sum)
# 2 LOAD_CONST 1 (<code object <genexpr> at 0x7f86d67c49c0, file "<ipython-input-4-9519b0039c88>", line 9>)
# 4 LOAD_CONST 2 ('sum_gen.<locals>.<genexpr>')
# 6 MAKE_FUNCTION 0
# 8 LOAD_GLOBAL 1 (range)
# 10 LOAD_FAST 0 (n)
# 12 CALL_FUNCTION 1
# 14 GET_ITER
# 16 CALL_FUNCTION 1
# 18 CALL_FUNCTION 1
# 20 RETURN_VALUE
# None
# sum_range
# 13 0 LOAD_GLOBAL 0 (sum)
# 2 LOAD_GLOBAL 1 (range)
# 4 LOAD_FAST 0 (n)
# 6 CALL_FUNCTION 1
# 8 CALL_FUNCTION 1
# 10 RETURN_VALUE
# None
Ah, I found an answer myself, but it brings up another question.
So, this is much faster:
t0 = time.time()
s = sum(range(1000000))
t1 = time.time()
print(s, t1 - t0)
The result is:
499999500000 0.05099987983703613
So, sum is faster than +=, as expected, but the generator expression (i for i in range(n)) is much slower than anything else.
I must say that this is also quite surprising.
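For reference, a small timeit sketch comparing the two sum variants directly (assuming Python 3.5+ for the globals argument; numbers are machine- and version-dependent, and the generator expression has to resume a Python-level generator frame for every item, while sum(range(n)) pulls items from the range iterator entirely in C):
import timeit

n = 1000000
print(timeit.timeit('sum(range(n))', globals=globals(), number=10))             # C-level iteration
print(timeit.timeit('sum(i for i in range(n))', globals=globals(), number=10))  # one frame resume per item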
I get different numbers
python3 main.py
499999500000 0.0832064151763916
499999500000 0.03934478759765625
And from this code
import time

to0 = []
to1 = []
for i in range(1000):
    t0 = time.time()
    s = 0
    for i in range(1000000):
        s += i
    t1 = time.time()
    to0.append(t1 - t0)

    t0 = time.time()
    s = sum(i for i in range(1000000))
    t1 = time.time()
    to1.append(t1 - t0)
print(sum(to0)/len(to0))
print(sum(to1)/len(to1))
I get
0.07862246823310852
0.0318267240524292
Try updating your Python; all of this was run on Python 3.7.3.
I'm wondering how the Python interpreter actually handles lambda functions in memory.
If I have the following:
def squares():
    return [(lambda x: x**2)(num) for num in range(1000)]
Will this create 1000 instances of the lambda function in memory, or is Python smart enough to know that each of these 1000 lambdas is the same, and therefore store them as one function in memory?
Your code creates a single lambda function and calls it 1000 times, without keeping a new function object around at each iteration, because at each iteration you store the result of calling the lambda, not the function itself. It is equivalent to:
def square(x): return x * x   # or: square = lambda x: x * x
[square(x) for x in range(1000)]
In contrast, the following creates a new lambda function object at each iteration; see this dummy example:
[lambda x: x*x for _ in range(3)]
gives:
[<function <listcomp>.<lambda> at 0x294358950>,
<function <listcomp>.<lambda> at 0x294358c80>,
<function <listcomp>.<lambda> at 0x294358378>]
The memory addresses of the lambdas are all different, so a different object is created for each one.
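Even so, all of those function objects share a single underlying code object; only the small function wrappers differ. A quick sketch to check this:
fns = [lambda x: x * x for _ in range(3)]
print(len({id(f) for f in fns}))           # 3 -> three distinct function objects
print(len({id(f.__code__) for f in fns}))  # 1 -> they all share one code object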
TL;DR: The memory cost of the lambda objects in your example is the size of one lambda, but only while the squares() function is running, even if you hold a reference to its return value, because the returned list contains no lambda objects.
But even in cases where you do keep more than one function instance created from the same lambda expression (or def statement for that matter), they share the same code object, so the memory cost for each additional instance is less than the cost of the first instance.
In your example,
[(lambda x: x**2)(num) for num in range(1000)]
you're only storing the result of the lambda invocation in the list, not the lambda itself, so the lambda object's memory will be freed.
When exactly the lambda objects get garbage collected depends on your Python implementation. CPython should be able to do it immediately because the reference count drops to 0 each loop:
>>> class PrintsOnDel:
...     def __del__(self):
...         print('del')  # We can see when this gets collected.
...
>>> [[PrintsOnDel(), print(x)][-1] for x in [1, 2, 3]] # Freed each loop.
1
del
2
del
3
del
[None, None, None]
PyPy is another story.
>>>> from __future__ import print_function
>>>> class PrintsOnDel:
....     def __del__(self):
....         print('del')
....
>>>> [[PrintsOnDel(), print(x)][-1] for x in [1, 2, 3]]
1
2
3
[None, None, None]
>>>> import gc
>>>> gc.collect() # Not freed until the gc actually runs!
del
del
del
0
It will create 1000 different lambda instances over time, but they won't all be in memory at once (in CPython) and they'll all point to the same code object, so having multiple instances of a function is not as bad as it sounds:
>>> a, b = [lambda x: x**2 for x in [1, 2]]
>>> a is b # Different lambda objects...
False
>>> a.__code__ is b.__code__ # ...point to the same code object.
True
Disassembling the bytecode can help you to understand exactly what the interpreter is doing:
>>> from dis import dis
>>> dis("[(lambda x: x**2)(num) for num in range(1000)]")
1 0 LOAD_CONST 0 (<code object <listcomp> at 0x000001D11D066870, file "<dis>", line 1>)
2 LOAD_CONST 1 ('<listcomp>')
4 MAKE_FUNCTION 0
6 LOAD_NAME 0 (range)
8 LOAD_CONST 2 (1000)
10 CALL_FUNCTION 1
12 GET_ITER
14 CALL_FUNCTION 1
16 RETURN_VALUE
Disassembly of <code object <listcomp> at 0x000001D11D066870, file "<dis>", line 1>:
1 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (num)
8 LOAD_CONST 0 (<code object <lambda> at 0x000001D11D0667C0, file "<dis>", line 1>)
10 LOAD_CONST 1 ('<listcomp>.<lambda>')
12 MAKE_FUNCTION 0
14 LOAD_FAST 1 (num)
16 CALL_FUNCTION 1
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
Disassembly of <code object <lambda> at 0x000001D11D0667C0, file "<dis>", line 1>:
1 0 LOAD_FAST 0 (x)
2 LOAD_CONST 1 (2)
4 BINARY_POWER
6 RETURN_VALUE
Note the 12 MAKE_FUNCTION instruction each loop. It does make a new lambda instance every time. CPython's VM is a stack machine. Arguments get pushed to the stack by other instructions, and then consumed by later instructions that need them. Notice that above that MAKE_FUNCTION instruction is another one pushing an argument for it.
LOAD_CONST 0 (<code object <lambda>...
So it re-uses the code object.
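To get a rough feel for what an extra instance costs, you can compare the (shallow) size of a function object with the size of its shared code object; exact numbers are implementation- and version-dependent:
import sys

f = lambda x: x**2
print(sys.getsizeof(f))           # size of the function object wrapper itself
print(sys.getsizeof(f.__code__))  # size of the code object, which is not duplicated per instance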
I always thought comparisons were the fastest operation a computer could execute. I remember hearing it in a presentation by D. Knuth, where he'd write loops in descending order "because comparison against 0 is fast". I also read that multiplications should be slower than additions here.
I'm surprised to see that, in both Python 2 and 3, testing under both Linux and Mac, comparisons seem to be much slower than arithmetic operations.
Could anyone explain why?
%timeit 2 > 0
10000000 loops, best of 3: 41.5 ns per loop
%timeit 2 * 2
10000000 loops, best of 3: 27 ns per loop
%timeit 2 * 0
10000000 loops, best of 3: 27.7 ns per loop
%timeit True != False
10000000 loops, best of 3: 75 ns per loop
%timeit True and False
10000000 loops, best of 3: 58.8 ns per loop
And under python 3:
$ ipython3
Python 3.5.2 | packaged by conda-forge | (default, Sep 8 2016, 14:36:38)
Type "copyright", "credits" or "license" for more information.
IPython 5.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: %timeit 2 + 2
10000000 loops, best of 3: 22.9 ns per loop
In [2]: %timeit 2 * 2
10000000 loops, best of 3: 23.7 ns per loop
In [3]: %timeit 2 > 2
10000000 loops, best of 3: 45.5 ns per loop
In [4]: %timeit True and False
10000000 loops, best of 3: 62.8 ns per loop
In [5]: %timeit True != False
10000000 loops, best of 3: 92.9 ns per loop
This is happening due to constant folding in the peephole optimizer within the Python compiler.
Using the dis module to break each of the statements down and see how they are translated at the bytecode level, you will observe that for operators like inequality and equality the operands are first loaded and the comparison is then evaluated at run time. However, constant expressions like multiplication and addition are pre-computed at compile time and loaded as a single constant.
Overall, this leads to fewer execution steps, making the arithmetic cases faster:
>>> import dis
>>> def m1(): True != False
>>> dis.dis(m1)
1 0 LOAD_GLOBAL 0 (True)
3 LOAD_GLOBAL 1 (False)
6 COMPARE_OP 3 (!=)
9 POP_TOP
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
>>> def m2(): 2 *2
>>> dis.dis(m2)
1 0 LOAD_CONST 2 (4)
3 POP_TOP
4 LOAD_CONST 0 (None)
7 RETURN_VALUE
>>> def m3(): 2*5
>>> dis.dis(m3)
1 0 LOAD_CONST 3 (10)
3 POP_TOP
4 LOAD_CONST 0 (None)
7 RETURN_VALUE
>>> def m4(): 2 > 0
>>> dis.dis(m4)
1 0 LOAD_CONST 1 (2)
3 LOAD_CONST 2 (0)
6 COMPARE_OP 4 (>)
9 POP_TOP
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
>>> def m5(): True and False
>>> dis.dis(m5)
1 0 LOAD_GLOBAL 0 (True)
3 JUMP_IF_FALSE_OR_POP 9
6 LOAD_GLOBAL 1 (False)
>> 9 POP_TOP
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
As others have explained, this is because Python's peephole optimiser optimises arithmetic operations but not comparisons.
Having written my own peephole optimiser for a Basic compiler, I can assure you that optimising constant comparisons is just as easy as optimising constant arithmetic operations. So there is no technical reason why Python should do the latter but not the former.
However, each such optimisation has to be separately programmed, and comes with two costs: the time to program it, and the extra optimising code taking up space in the Python executable. So you find yourself having to do some triage: which of these optimisations is common enough to make it worth the costs?
It seems that the Python implementers, reasonably enough, decided to optimise the arithmetic operations first. Perhaps they will get round to comparisons in a future release.
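If you want to check what your own interpreter folds, a small sketch with dis (the exact output depends on the Python version):
import dis

dis.dis(compile("2 * 3", "<test>", "eval"))  # constant arithmetic: typically folded to a single LOAD_CONST
dis.dis(compile("2 < 3", "<test>", "eval"))  # constant comparison: folded or not, depending on the version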
A quick disassembling reveals that the comparison involves more operations. According to this answer, there is some precalculation done by the "peephole optimiser" (wiki) for multiplication, addition, etc., but not for the comparison operators:
>>> import dis
>>> def a():
...     return 2*3
...
>>> dis.dis(a)
2 0 LOAD_CONST 3 (6)
3 RETURN_VALUE
>>> def b():
...     return 2 < 3
...
>>> dis.dis(b)
2 0 LOAD_CONST 1 (2)
3 LOAD_CONST 2 (3)
6 COMPARE_OP 0 (<)
9 RETURN_VALUE
Like others have commented, it is due to the peephole optimizer, which pre-computes the result of 2*3 (6), as the dis output shows:
0 LOAD_CONST 3 (6)
But try this: with variables instead of constants, the optimizer cannot pre-compute the result:
>>> def a(a, b):
...     return a*b
...
>>> dis.dis(a)
2 0 LOAD_FAST 0 (a)
3 LOAD_FAST 1 (b)
6 BINARY_MULTIPLY
7 RETURN_VALUE
>>> def c(a,b):
...     return a<b
...
>>> dis.dis(c)
2 0 LOAD_FAST 0 (a)
3 LOAD_FAST 1 (b)
6 COMPARE_OP 0 (<)
9 RETURN_VALUE
>>>
If you time these functions the compare will be faster.
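For instance, a quick sketch with timeit (absolute numbers will vary by machine and Python version):
import timeit

setup = """
def a(a, b):
    return a * b

def c(a, b):
    return a < b
"""

print(timeit.timeit("a(2, 3)", setup=setup, number=1000000))  # multiplication of two variables
print(timeit.timeit("c(2, 3)", setup=setup, number=1000000))  # comparison of two variables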
For the Python case the answers above are correct. For machine code, things are a bit more complicated. I assume we are talking about integer operations here; with floats and complex objects none of the below applies. Also, we assume that the values you are comparing are already loaded into registers. If they are not, fetching them from wherever they live can take hundreds of times longer than the actual operations.
Modern CPUs have several ways to compare two numbers. Very popular ones are XOR a,b if you just want to see whether two values are equal, and CMP a,b if you want to know the relationship between the values (less, greater, equal, etc.). The CMP operation is just a subtraction with the result thrown away, because we are only interested in the post-operation flags.
Both of these operations are of depth 1, so they can be executed in a single CPU cycle. That is as fast as you can go. Multiplication is a form of repeated addition, so the depth of the operation is usually equal to the size of your register. There are some optimizations that can reduce the depth, but generally multiplication is one of the slower operations a CPU can perform.
However, multiplying by 0, 1, or any power of 2 can be reduced to a shift operation, which is also a depth-one operation, so it takes the same time as comparing two numbers. Think of the decimal system: you can multiply any number by 10, 100, or 1000 by appending zeros at the end. Any optimizing compiler will recognize this type of multiplication and use the most efficient operation for it. Modern CPUs are also pretty advanced, so they can perform the same optimization in hardware by counting how many bits are set in either of the operands; if it is just one bit, the operation is reduced to a shift.
So in your case, multiplying by 2 is as fast as comparing two numbers. As people above pointed out, any optimizing compiler will also see that you are multiplying two constants and will simply replace the function with one that returns a constant.
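A tiny sketch of the power-of-two point, using plain Python only to illustrate the arithmetic identity (in CPython the interpreter overhead dominates, so do not expect a speedup from writing shifts by hand):
x = 12345
assert x * 8 == x << 3                 # multiplying by 2**3 is a left shift by 3
assert x * 10 == (x << 3) + (x << 1)   # other constants decompose into shifts and adds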
Wow, the answer by #mu 無 blew my mind! However, it is important not to generalize when drawing your conclusions... You are checking the times for CONSTANTS, not variables. For variables, multiplication seems to be slower than comparison.
Here is the more interesting case, in which the numbers to be compared are stored in actual variables...
import timeit

def go():
    number = 1000000000
    print
    print 'a>b, internal:', timeit.timeit(setup="a=1;b=1", stmt="a>b", number=number)
    print 'a*b, internal:', timeit.timeit(setup="a=1;b=1", stmt="a*b", number=number)
    print 'a>b, shell :',
    %%timeit -n 1000000000 "a=1;b=1" "a>b"
    print 'a*b, shell :',
    %%timeit -n 1000000000 "a=1;b=1" "a*b"

go()
The result gives:
a>b, internal: 51.9467676445
a*b, internal: 63.870462403
a>b, shell :1000000000 loops, best of 3: 19.8 ns per loop
a*b, shell :1000000000 loops, best of 3: 19.9 ns per loop
And order is restored in the universe ;)
For completeness, let's see some more cases... What if we have one variable and one constant?
import timeit

def go():
    print 'a>2, shell :',
    %%timeit -n 10000000 "a=42" "a>2"
    print 'a*2, shell :',
    %%timeit -n 10000000 "a=42" "a*2"

go()
a>2, shell :10000000 loops, best of 3: 18.3 ns per loop
a*2, shell :10000000 loops, best of 3: 19.3 ns per loop
What happens with bools?
import timeit

def go():
    print
    number = 1000000000
    print 'a==b : ', timeit.timeit(setup="a=True;b=False", stmt="a==b", number=number)
    print 'a and b : ', timeit.timeit(setup="a=True;b=False", stmt="a and b", number=number)
    print 'boolean ==, shell :',
    %%timeit -n 1000000000 "a=True;b=False" "a == b"
    print 'boolean and, shell :',
    %%timeit -n 1000000000 "a=False;b=False" "a and b"

go()
a==b : 70.8013108982
a and b : 38.0614485665
boolean ==, shell :1000000000 loops, best of 3: 17.7 ns per loop
boolean and, shell :1000000000 loops, best of 3: 16.4 ns per loop
:D Now this is interesting: it seems boolean and is faster than ==. However, all of this would be fine, and Donald Knuth would not lose any sleep over it, since the best way to compare booleans would be to use and...
In practice, we should also check numpy, which may be even more significant...
import timeit

def go():
    number = 1000000  # change if you are in a hurry / want to be more certain....

    print '==== int ===='
    print 'a>b : ', timeit.timeit(setup="a=1;b=2", stmt="a>b", number=number*100)
    print 'a*b : ', timeit.timeit(setup="a=1;b=2", stmt="a*b", number=number*100)
    setup = "import numpy as np;a=np.arange(0,100);b=np.arange(100,0,-1);"
    print 'np: a>b : ', timeit.timeit(setup=setup, stmt="a>b", number=number)
    print 'np: a*b : ', timeit.timeit(setup=setup, stmt="a*b", number=number)

    print '==== float ===='
    print 'float a>b : ', timeit.timeit(setup="a=1.1;b=2.3", stmt="a>b", number=number*100)
    print 'float a*b : ', timeit.timeit(setup="a=1.1;b=2.3", stmt="a*b", number=number*100)
    setup = "import numpy as np;a=np.arange(0,100,dtype=float);b=np.arange(100,0,-1,dtype=float);"
    print 'np float a>b : ', timeit.timeit(setup=setup, stmt="a>b", number=number)
    print 'np float a*b : ', timeit.timeit(setup=setup, stmt="a*b", number=number)

    print '==== bool ===='
    print 'a==b : ', timeit.timeit(setup="a=True;b=False", stmt="a==b", number=number*1000)
    print 'a and b : ', timeit.timeit(setup="a=True;b=False", stmt="a and b", number=number*1000)
    setup = "import numpy as np;a=np.arange(0,100)>50;b=np.arange(100,0,-1)>50;"
    print 'np a == b : ', timeit.timeit(setup=setup, stmt="a == b", number=number)
    print 'np a and b : ', timeit.timeit(setup=setup, stmt="np.logical_and(a,b)", number=number)
    print 'np a == True : ', timeit.timeit(setup=setup, stmt="a == True", number=number)
    print 'np a and True : ', timeit.timeit(setup=setup, stmt="np.logical_and(a,True)", number=number)

go()
==== int ====
a>b : 4.5121130192
a*b : 5.62955748632
np: a>b : 0.763992986986
np: a*b : 0.723006032235
==== float ====
float a>b : 6.39567713272
float a*b : 5.62149055215
np float a>b : 0.697037433398
np float a*b : 0.847941712765
==== bool ====
a==b : 6.91458288689
a and b : 3.6289697892
np a == b : 0.789666454087
np a and b : 0.724517620007
np a == True : 1.55066706189
np a and True : 1.44293071804
Again, same behavior...
So I guess one can benefit from using and instead of == in general,
at least in Python 2 (Python 2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]), where I tried all of this...
While optimising my code I realised the following:
>>> from timeit import Timer as T
>>> T(lambda : 1234567890 / 4.0).repeat()
[0.22256922721862793, 0.20560789108276367, 0.20530295372009277]
>>> from __future__ import division
>>> T(lambda : 1234567890 / 4).repeat()
[0.14969301223754883, 0.14155197143554688, 0.14141488075256348]
>>> T(lambda : 1234567890 * 0.25).repeat()
[0.13619112968444824, 0.1281130313873291, 0.12830305099487305]
and also:
>>> from math import sqrt
>>> T(lambda : sqrt(1234567890)).repeat()
[0.2597470283508301, 0.2498021125793457, 0.24994492530822754]
>>> T(lambda : 1234567890 ** 0.5).repeat()
[0.15409398078918457, 0.14059877395629883, 0.14049601554870605]
I assume it has to do with the way Python is implemented in C, but I wonder if anybody would care to explain why this is so?
The (somewhat unexpected) reason for your results is that Python seems to fold constant expressions involving floating-point multiplication and exponentiation, but not division. math.sqrt() is a different beast altogether since there's no bytecode for it and it involves a function call.
On Python 2.6.5, the following code:
x1 = 1234567890.0 / 4.0
x2 = 1234567890.0 * 0.25
x3 = 1234567890.0 ** 0.5
x4 = math.sqrt(1234567890.0)
compiles to the following bytecodes:
# x1 = 1234567890.0 / 4.0
4 0 LOAD_CONST 1 (1234567890.0)
3 LOAD_CONST 2 (4.0)
6 BINARY_DIVIDE
7 STORE_FAST 0 (x1)
# x2 = 1234567890.0 * 0.25
5 10 LOAD_CONST 5 (308641972.5)
13 STORE_FAST 1 (x2)
# x3 = 1234567890.0 ** 0.5
6 16 LOAD_CONST 6 (35136.418286444619)
19 STORE_FAST 2 (x3)
# x4 = math.sqrt(1234567890.0)
7 22 LOAD_GLOBAL 0 (math)
25 LOAD_ATTR 1 (sqrt)
28 LOAD_CONST 1 (1234567890.0)
31 CALL_FUNCTION 1
34 STORE_FAST 3 (x4)
As you can see, multiplication and exponentiation take no time at all since they're done when the code is compiled. Division takes longer since it happens at runtime. Square root is not only the most computationally expensive operation of the four, it also incurs various overheads that the others do not (attribute lookup, function call etc).
If you eliminate the effect of constant folding, there's little to separate multiplication and division:
In [16]: x = 1234567890.0
In [17]: %timeit x / 4.0
10000000 loops, best of 3: 87.8 ns per loop
In [18]: %timeit x * 0.25
10000000 loops, best of 3: 91.6 ns per loop
math.sqrt(x) is actually a little bit faster than x ** 0.5, presumably because it's a special case of the latter and can therefore be done more efficiently, in spite of the overheads:
In [19]: %timeit x ** 0.5
1000000 loops, best of 3: 211 ns per loop
In [20]: %timeit math.sqrt(x)
10000000 loops, best of 3: 181 ns per loop
edit 2011-11-16: Constant expression folding is done by Python's peephole optimizer. The source code (peephole.c) contains the following comment that explains why constant division isn't folded:
case BINARY_DIVIDE:
    /* Cannot fold this operation statically since
       the result can depend on the run-time presence
       of the -Qnew flag */
    return 0;
The -Qnew flag enables "true division" defined in PEP 238.
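If you want to see what your own interpreter folds, a quick sketch with dis (the output depends on the Python version; on the Python 2.6 shown above the multiplication is folded to a single LOAD_CONST while the division is not, and newer versions may fold both):
import dis

dis.dis(compile("1234567890.0 * 0.25", "<test>", "eval"))
dis.dis(compile("1234567890.0 / 4.0", "<test>", "eval"))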
In Python, is there any difference between creating a generator object through a generator expression versus using the yield statement?
Using yield:
def Generator(x, y):
    for i in xrange(x):
        for j in xrange(y):
            yield(i, j)
Using generator expression:
def Generator(x, y):
    return ((i, j) for i in xrange(x) for j in xrange(y))
Both functions return generator objects, which produce tuples, e.g. (0,0), (0,1) etc.
Any advantages of one or the other? Thoughts?
There are only slight differences in the two. You can use the dis module to examine this sort of thing for yourself.
Edit: My first version decompiled the generator expression created at module-scope in the interactive prompt. That's slightly different from the OP's version with it used inside a function. I've modified this to match the actual case in the question.
As you can see below, the "yield" generator (first case) has three extra instructions in the setup, but from the first FOR_ITER they differ in only one respect: the "yield" approach uses a LOAD_FAST in place of a LOAD_DEREF inside the loop. The LOAD_DEREF is "rather slower" than LOAD_FAST, so it makes the "yield" version slightly faster than the generator expression for large enough values of x (the outer loop) because the value of y is loaded slightly faster on each pass. For smaller values of x it would be slightly slower because of the extra overhead of the setup code.
It might also be worth pointing out that the generator expression would usually be used inline in the code, rather than wrapping it with the function like that. That would remove a bit of the setup overhead and keep the generator expression slightly faster for smaller loop values even if LOAD_FAST gave the "yield" version an advantage otherwise.
In neither case would the performance difference be enough to justify deciding between one or the other. Readability counts far more, so use whichever feels most readable for the situation at hand.
>>> def Generator(x, y):
...     for i in xrange(x):
...         for j in xrange(y):
...             yield(i, j)
...
>>> dis.dis(Generator)
2 0 SETUP_LOOP 54 (to 57)
3 LOAD_GLOBAL 0 (xrange)
6 LOAD_FAST 0 (x)
9 CALL_FUNCTION 1
12 GET_ITER
>> 13 FOR_ITER 40 (to 56)
16 STORE_FAST 2 (i)
3 19 SETUP_LOOP 31 (to 53)
22 LOAD_GLOBAL 0 (xrange)
25 LOAD_FAST 1 (y)
28 CALL_FUNCTION 1
31 GET_ITER
>> 32 FOR_ITER 17 (to 52)
35 STORE_FAST 3 (j)
4 38 LOAD_FAST 2 (i)
41 LOAD_FAST 3 (j)
44 BUILD_TUPLE 2
47 YIELD_VALUE
48 POP_TOP
49 JUMP_ABSOLUTE 32
>> 52 POP_BLOCK
>> 53 JUMP_ABSOLUTE 13
>> 56 POP_BLOCK
>> 57 LOAD_CONST 0 (None)
60 RETURN_VALUE
>>> def Generator_expr(x, y):
...     return ((i, j) for i in xrange(x) for j in xrange(y))
...
>>> dis.dis(Generator_expr.func_code.co_consts[1])
2 0 SETUP_LOOP 47 (to 50)
3 LOAD_FAST 0 (.0)
>> 6 FOR_ITER 40 (to 49)
9 STORE_FAST 1 (i)
12 SETUP_LOOP 31 (to 46)
15 LOAD_GLOBAL 0 (xrange)
18 LOAD_DEREF 0 (y)
21 CALL_FUNCTION 1
24 GET_ITER
>> 25 FOR_ITER 17 (to 45)
28 STORE_FAST 2 (j)
31 LOAD_FAST 1 (i)
34 LOAD_FAST 2 (j)
37 BUILD_TUPLE 2
40 YIELD_VALUE
41 POP_TOP
42 JUMP_ABSOLUTE 25
>> 45 POP_BLOCK
>> 46 JUMP_ABSOLUTE 6
>> 49 POP_BLOCK
>> 50 LOAD_CONST 0 (None)
53 RETURN_VALUE
In this example, not really. But yield can be used for more complex constructs - for example it can accept values from the caller as well and modify the flow as a result. Read PEP 342 for more details (it's an interesting technique worth knowing).
Anyway, the best advice is use whatever is clearer for your needs.
P.S. Here's a simple coroutine example from Dave Beazley:
def grep(pattern):
    print "Looking for %s" % pattern
    while True:
        line = (yield)
        if pattern in line:
            print line,

# Example use
if __name__ == '__main__':
    g = grep("python")
    g.next()
    g.send("Yeah, but no, but yeah, but no")
    g.send("A series of tubes")
    g.send("python generators rock!")
There is no difference for the kind of simple loops that you can fit into a generator expression. However yield can be used to create generators that do much more complex processing. Here is a simple example for generating the fibonacci sequence:
>>> import itertools
>>> def fibgen():
...     a = b = 1
...     while True:
...         yield a
...         a, b = b, a+b
...
>>> list(itertools.takewhile((lambda x: x<100), fibgen()))
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
In usage, note a distinction between a generator object vs a generator function.
A generator object is use-once-only, in contrast to a generator function, which can be reused each time you call it again, because it returns a fresh generator object.
Generator expressions are in practice usually used "raw", without wrapping them in a function, and they return a generator object.
E.g.:
def range_10_gen_func():
    x = 0
    while x < 10:
        yield x
        x = x + 1
print(list(range_10_gen_func()))
print(list(range_10_gen_func()))
print(list(range_10_gen_func()))
which outputs:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Compare with a slightly different usage:
range_10_gen = range_10_gen_func()
print(list(range_10_gen))
print(list(range_10_gen))
print(list(range_10_gen))
which outputs:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]
[]
And compare with a generator expression:
range_10_gen_expr = (x for x in range(10))
print(list(range_10_gen_expr))
print(list(range_10_gen_expr))
print(list(range_10_gen_expr))
which also outputs:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]
[]
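If you want the reusable behaviour with a generator expression, one option is to wrap it in a small function so that each call builds a fresh generator object (a minimal sketch; the name make_gen is just for illustration):
def make_gen():
    return (x for x in range(10))

print(list(make_gen()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list(make_gen()))  # the same list again: each call returns a brand-new generator object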
Using yield is nice if the expression is more complicated than just nested loops. Among other things you can return a special first or special last value. Consider:
def Generator(x):
    for i in xrange(x):
        yield(i)
    yield(None)
Yes there is a difference.
For the generator expression (x for var in expr), iter(expr) is called when the expression is created.
When using def and yield to create a generator, as in:
def my_generator():
    for var in expr:
        yield x

g = my_generator()
iter(expr) is not yet called. It will be called only when iterating on g (and might not be called at all).
Taking this iterator as an example:
from __future__ import print_function

class CountDown(object):
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        print("ITER")
        return self

    def __next__(self):
        if self.n == 0:
            raise StopIteration()
        self.n -= 1
        return self.n

    next = __next__  # for python2
This code:
g1 = (i ** 2 for i in CountDown(3))  # immediately prints "ITER"
print("Go!")
for x in g1:
    print(x)
while:
def my_generator():
    for i in CountDown(3):
        yield i ** 2

g2 = my_generator()
print("Go!")
for x in g2:  # "ITER" is only printed here
    print(x)
Since most iterators do not do a lot of work in __iter__, it is easy to miss this behavior. A real-world example is Django's QuerySet, which fetches data in __iter__: data = (f(x) for x in qs) might take a lot of time, while def g(): for x in qs: yield f(x) followed by data = g() returns immediately.
For more info and the formal definition refer to PEP 289 -- Generator Expressions.
When thinking about iterators, the itertools module:
... standardizes a core set of fast, memory efficient tools that are useful by themselves or in combination. Together, they form an “iterator algebra” making it possible to construct specialized tools succinctly and efficiently in pure Python.
For performance, consider itertools.product(*iterables[, repeat])
Cartesian product of input iterables.
Equivalent to nested for-loops in a generator expression. For example, product(A, B) returns the same as ((x,y) for x in A for y in B).
>>> import itertools
>>> def gen(x, y):
...     return itertools.product(xrange(x), xrange(y))
...
>>> [t for t in gen(3,2)]
[(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
>>>
There is a difference that could be important in some contexts and that hasn't been pointed out yet. Using yield prevents you from using return for anything other than implicitly raising StopIteration (and coroutine-related stuff).
This means this code is ill-formed (and feeding it to an interpreter will give you an AttributeError):
class Tea:
    """With a cloud of milk, please"""
    def __init__(self, temperature):
        self.temperature = temperature

def mary_poppins_purse(tea_time=False):
    """I would like to make one thing clear: I never explain anything."""
    if tea_time:
        return Tea(355)
    else:
        for item in ['lamp', 'mirror', 'coat rack', 'tape measure', 'ficus']:
            yield item
print(mary_poppins_purse(True).temperature)
On the other hand, this code works like a charm:
class Tea:
    """With a cloud of milk, please"""
    def __init__(self, temperature):
        self.temperature = temperature

def mary_poppins_purse(tea_time=False):
    """I would like to make one thing clear: I never explain anything."""
    if tea_time:
        return Tea(355)
    else:
        return (item for item in ['lamp', 'mirror', 'coat rack',
                                  'tape measure', 'ficus'])
print(mary_poppins_purse(True).temperature)