python - performance difference between the two implementations - python

Why do the following two implementations have different performance in Python?
from cStringIO import StringIO
from itertools import imap
from sys import stdin
input = imap(int, StringIO(stdin.read()))
print '\n'.join(imap(str, sorted(input)))
AND
import sys
l = []  # the list must exist before the loop appends to it
for line in sys.stdin:
    l.append(int(line.strip('\n')))
l.sort()
for x in l:
    print x
The first implementation is faster than the second for inputs of the order of 10^6 lines. Why so?

>>> dis.dis(first)
2 0 LOAD_GLOBAL 0 (imap)
3 LOAD_GLOBAL 1 (int)
6 LOAD_GLOBAL 2 (StringIO)
9 LOAD_GLOBAL 3 (stdin)
12 LOAD_ATTR 4 (read)
15 CALL_FUNCTION 0
18 CALL_FUNCTION 1
21 CALL_FUNCTION 2
24 STORE_FAST 0 (input)
27 LOAD_CONST 0 (None)
30 RETURN_VALUE
>>> dis.dis(second)
2 0 SETUP_LOOP 48 (to 51)
3 LOAD_GLOBAL 0 (sys)
6 LOAD_ATTR 1 (stdin)
9 CALL_FUNCTION 0
12 GET_ITER
>> 13 FOR_ITER 34 (to 50)
16 STORE_FAST 0 (line)
3 19 LOAD_GLOBAL 2 (l)
22 LOAD_ATTR 3 (append)
25 LOAD_GLOBAL 4 (int)
28 LOAD_FAST 0 (line)
31 LOAD_ATTR 5 (strip)
34 LOAD_CONST 1 ('\n')
37 CALL_FUNCTION 1
40 CALL_FUNCTION 1
43 CALL_FUNCTION 1
46 POP_TOP
47 JUMP_ABSOLUTE 13
>> 50 POP_BLOCK
4 >> 51 LOAD_GLOBAL 2 (l)
54 LOAD_ATTR 6 (sort)
57 CALL_FUNCTION 0
60 POP_TOP
61 LOAD_CONST 0 (None)
64 RETURN_VALUE
first is your first function.
second is your second function.
dis tells one of the reasons why the first one is faster.

Two primary reasons:
The 2nd code explicitly constructs a list and sorts it afterwards, while the 1st version lets sorted create only an internal list while sorting at the same time.
The 2nd code explicitly loops over a list with for (on the Python VM), while the 1st version implicitly loops with imap (over the underlying structure in C).
Anyways, why is StringIO in there? The most straightforward and probably fastest way is:
from sys import stdin, stdout
stdout.writelines(sorted(stdin, key=int))

Do a step-by-step conversion from the second to the first one and see how the performance changes with each step.
1. Remove line.strip. This will cause some speed up, whether it would be significant is another matter. The stripping is superfluous as has been mentioned by you and THC4k.
2. Then replace the for loop using l.append with map(int, sys.stdin). My guess is that this would give a significant speed-up.
3. Replace map and l.sort with imap and sorted. My guess is that it won't affect the performance, there could be a slight slowdown, but it would be far from significant. Between the two, I'd usually go with the former, but with Python 3 on the horizon the latter is probably preferable.
4. Replace the for loop using print with print '\n'.join(...). My guess is that this would be another speed-up, but it would cost you some memory.
5. Add cStringIO (which is completely unnecessary by the way) to see how it affects performance. My guess is that it would be slightly slower, but not enough to counter 4 and 2.
Then, if you try THC4k's answer, it would probably be faster than all of the above, while being simpler and easier to read, and using less memory than 4 and 5. It has slightly different behaviour (it doesn't strip leading zeros from the numbers).
Of course, try this yourself instead of trusting anyone's guesses. Also run cProfile on your code and see which parts take the most time.
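To make the "try it yourself" advice concrete, here is a minimal Python 3 sketch of such a measurement. It is not the original poster's benchmark: the input is generated in memory and wrapped in io.StringIO as a stand-in for stdin, and the names DATA, loop_append, map_sorted and sorted_key are made up for this illustration. Timings will differ between machines and interpreter versions.

```python
import io
import random
import timeit

# Made-up test input: 20,000 integers, one per line (stand-in for stdin).
random.seed(0)
DATA = "".join("%d\n" % random.randrange(10**6) for _ in range(20_000))

def loop_append(text):
    # Starting point: explicit loop with strip and list.append, then list.sort.
    result = []
    for line in io.StringIO(text):
        result.append(int(line.strip("\n")))
    result.sort()
    return "\n".join(str(x) for x in result)

def map_sorted(text):
    # After the conversion steps: no strip, map instead of append, sorted, one join.
    return "\n".join(map(str, sorted(map(int, io.StringIO(text)))))

def sorted_key(text):
    # The key=int variant: sort the raw lines by their integer value.
    # The lines are written back unchanged, so leading zeros survive.
    return "".join(sorted(io.StringIO(text), key=int))

for func in (loop_append, map_sorted, sorted_key):
    seconds = timeit.timeit(lambda: func(DATA), number=3)
    print("%-12s %.3f s" % (func.__name__, seconds))
```

Swapping in one change at a time between loop_append and map_sorted shows which step actually matters on your own machine.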

Related

Is it better to use nested loop for bigger repetitions or just put the entire range into one loop? Which is faster / less complex?

Which one is better?
for x in range(0,100):
    print("Lorem Ipsum")
for x in range(0,10):
    for y in range(0,10):
        print("Lorem Ipsum")
The second one is harder to read and you construct an unnecessary range iterable (a list in Python 2, a less memory consuming and faster to create range object in Python 3).
From the unnecessary iterable the inner for loop constructs an unnecessary iterator (a list_iterator in Python 2, a range_iterator in Python 3).
The first one is more readable and easier to understand. Use that.
Regarding performance, I doubt it makes any difference and if it does, the 0-100 is faster, because it has smaller code (if the double loop is not optimized away) and thus a smaller code path.
When in doubt about such things, use the one that is easier to understand when you read the code. Premature optimization is a sin.
You can use dis from the dis module to disassemble and analyse the bytecode and see which of your loops is better (that is, which one needs less memory, fewer iterators, and so on).
Here is an example:
from dis import dis
def loop1():
    for x in range(100):
        pass
def loop2():
    for x in range(10):
        for j in range(10):
            pass
Now look under the hood of each loop:
dis(loop1)
2 0 SETUP_LOOP 20 (to 23)
3 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (100)
9 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
12 GET_ITER
>> 13 FOR_ITER 6 (to 22)
16 STORE_FAST 0 (x)
3 19 JUMP_ABSOLUTE 13
>> 22 POP_BLOCK
>> 23 LOAD_CONST 0 (None)
26 RETURN_VALUE
And look at the amount of data and operations needed in your second loop:
dis(loop2)
2 0 SETUP_LOOP 43 (to 46)
3 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (10)
9 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
12 GET_ITER
>> 13 FOR_ITER 29 (to 45)
16 STORE_FAST 0 (x)
3 19 SETUP_LOOP 20 (to 42)
22 LOAD_GLOBAL 0 (range)
25 LOAD_CONST 1 (10)
28 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
31 GET_ITER
>> 32 FOR_ITER 6 (to 41)
35 STORE_FAST 1 (j)
4 38 JUMP_ABSOLUTE 32
>> 41 POP_BLOCK
>> 42 JUMP_ABSOLUTE 13
>> 45 POP_BLOCK
>> 46 LOAD_CONST 0 (None)
49 RETURN_VALUE
Because both loops do the same thing, the first one is far better.
Just imagine how would you modify the nested loop for 101 iterations instead of 100 and the disadvantage is clear.
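If you would rather measure than read bytecode, a small timeit sketch like the following should do. The function names flat and nested and the counter are made up for this illustration, and the absolute numbers depend on your machine and Python version.

```python
import timeit

def flat():
    # One loop over 100 values.
    count = 0
    for x in range(100):
        count += 1
    return count

def nested():
    # Two nested loops of 10 values each; the inner range and its
    # iterator are created again on every pass of the outer loop.
    count = 0
    for x in range(10):
        for y in range(10):
            count += 1
    return count

print("flat:  ", timeit.timeit(flat, number=20_000))
print("nested:", timeit.timeit(nested, number=20_000))
```

Both functions perform the same 100 iterations, so any difference you see comes from the loop machinery alone.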

How does List Comprehension exactly work in Python? [duplicate]

I am going through the docs of Python 3.x, and I have doubts about the execution speed of list comprehensions and how exactly they work.
Let's take the following example:
Listing 1
...
L = range(0,10)
L = [x ** 2 for x in L]
...
As far as I know, this returns a new list, and it is equivalent to writing:
Listing 2
...
res = []
for x in L:
    res.append(x ** 2)
...
The main difference is the speed of execution if I am correct. Listing 1 is supposed to be performed at C language speed inside the interpreter, meanwhile Listing 2 is not.
But Listing 2 is what the list comprehension does internally (I'm not sure), so why is Listing 1 executed at C speed inside the interpreter while Listing 2 is not? Both are converted to bytecode before being processed, or am I missing something?
Look at the actual bytecode that is produced. I've put the two fragments of code into functions called f1 and f2.
The comprehension does this:
3 15 LOAD_CONST 3 (<code object <listcomp> at 0x7fbf6c1b59c0, file "<stdin>", line 3>)
18 LOAD_CONST 4 ('f1.<locals>.<listcomp>')
21 MAKE_FUNCTION 0
24 LOAD_FAST 0 (L)
27 GET_ITER
28 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
31 STORE_FAST 0 (L)
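Notice there is no loop in this part of the bytecode. The loop runs inside the separate <listcomp> code object, which appends each result with the dedicated LIST_APPEND opcode instead of looking up and calling an append method.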
Now the for loop does this:
4 21 SETUP_LOOP 31 (to 55)
24 LOAD_FAST 0 (L)
27 GET_ITER
>> 28 FOR_ITER 23 (to 54)
31 STORE_FAST 2 (x)
34 LOAD_FAST 1 (res)
37 LOAD_ATTR 1 (append)
40 LOAD_FAST 2 (x)
43 LOAD_CONST 3 (2)
46 BINARY_POWER
47 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
50 POP_TOP
51 JUMP_ABSOLUTE 28
>> 54 POP_BLOCK
In contrast to the comprehension, the loop is spelled out here in the function's own bytecode, and every iteration looks up res.append (LOAD_ATTR) and calls it (CALL_FUNCTION).
The bytecodes are different, and the first should be faster.
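To check this on your own interpreter, here is a small sketch that collects the opcode names of both functions, including any nested code objects. The helper opnames is made up for this illustration. On the CPython 3 versions I know of, the comprehension uses LIST_APPEND and the explicit loop does not, but exact opcode names vary between versions, so treat the output as informative rather than guaranteed.

```python
import dis
import types

def opnames(code):
    """Collect opcode names from a code object and from any code objects nested in its constants."""
    names = [ins.opname for ins in dis.get_instructions(code)]
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names.extend(opnames(const))
    return names

def f1(L):
    return [x ** 2 for x in L]

def f2(L):
    res = []
    for x in L:
        res.append(x ** 2)
    return res

comp_ops = opnames(f1.__code__)
loop_ops = opnames(f2.__code__)
print("LIST_APPEND in comprehension:", "LIST_APPEND" in comp_ops)
print("LIST_APPEND in explicit loop:", "LIST_APPEND" in loop_ops)
```

The nested lookup matters because older CPython versions compile the comprehension into its own code object, while newer ones may inline it into the enclosing function.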
The answer is actually in your question.
When you run a built-in Python function, you are running something that has been written in C and compiled into machine code.
When you write your own version of it, that code is compiled to bytecode, which the interpreter must execute one instruction at a time.
As a consequence, the built-in approach or function is usually quicker (or takes less space) in Python than writing your own function.

Python low-level vs high-level performance (running time analysis of palindrome functions)

I am trying to find the most efficient way to check whether the given string is palindrome or not.
Firstly, I tried brute force which has running time of the order O(N). Then I optimized the code a little bit by making only n/2 comparisons instead of n.
Here is the code:
def palindrome(a):
    length=len(a)
    iterator=0
    while iterator <= length/2:
        if a[iterator]==a[length-iterator-1]:
            iterator+=1
        else:
            return False
    return True
It takes half time when compared to brute force but it is still order O(N).
Meanwhile, I also thought of a solution which uses slice operator.
Here is the code:
def palindrome_py(a):
    return a==a[::-1]
Then I did running time analysis of both. Here is the result:
Running time
Running time for 100000 iterations
Base string length is 50; the length multiplier gives the length of the tested string (50 * multiplier).
palindrome       palindrome_py    Length multiplier
0.6559998989     0.5309998989     1
1.2970001698     0.5939998627     2
3.5149998665     0.7820000648     3
13.4249999523    1.5310001373     4
65.5319998264    5.2660000324     5
The code I used can be accessed here: Running Time Table Generator
Now I want to know why there is such a difference between the running time of the slice operator (palindrome_py) and the palindrome function. Why am I getting these running times?
Why is the slice operator so efficient compared to the palindrome function? What is happening behind the scenes?
My observations:
The running time grows with the multiplier: the running time for multiplier n can be obtained by multiplying the running time for multiplier n-1 by n. For example, the time for multiplier 2 is roughly the time for multiplier 1 times 2.
Generalizing, we get Running Time(n) = Running Time(n-1) * Multiplier
Your slicing-based solution is still O(n), the constant got smaller (that's your multiplier). It's faster, because less stuff is done in Python and more stuff is done in C. The bytecode shows it all.
In [1]: import dis
In [2]: %paste
def palindrome(a):
    length=len(a)
    iterator=0
    while iterator <= length/2:
        if a[iterator]==a[length-iterator-1]:
            iterator+=1
        else:
            return False
    return True
## -- End pasted text --
In [3]: dis.dis(palindrome)
2 0 LOAD_GLOBAL 0 (len)
3 LOAD_FAST 0 (a)
6 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
9 STORE_FAST 1 (length)
3 12 LOAD_CONST 1 (0)
15 STORE_FAST 2 (iterator)
4 18 SETUP_LOOP 65 (to 86)
>> 21 LOAD_FAST 2 (iterator)
24 LOAD_FAST 1 (length)
27 LOAD_CONST 2 (2)
30 BINARY_TRUE_DIVIDE
31 COMPARE_OP 1 (<=)
34 POP_JUMP_IF_FALSE 85
5 37 LOAD_FAST 0 (a)
40 LOAD_FAST 2 (iterator)
43 BINARY_SUBSCR
44 LOAD_FAST 0 (a)
47 LOAD_FAST 1 (length)
50 LOAD_FAST 2 (iterator)
53 BINARY_SUBTRACT
54 LOAD_CONST 3 (1)
57 BINARY_SUBTRACT
58 BINARY_SUBSCR
59 COMPARE_OP 2 (==)
62 POP_JUMP_IF_FALSE 78
6 65 LOAD_FAST 2 (iterator)
68 LOAD_CONST 3 (1)
71 INPLACE_ADD
72 STORE_FAST 2 (iterator)
75 JUMP_ABSOLUTE 21
8 >> 78 LOAD_CONST 4 (False)
81 RETURN_VALUE
82 JUMP_ABSOLUTE 21
>> 85 POP_BLOCK
10 >> 86 LOAD_CONST 5 (True)
89 RETURN_VALUE
That is a lot of Python virtual-machine instructions, many of which are basically function calls, and they are executed again on every iteration. Function calls are very expensive in Python.
Now, what's with the second function.
In [4]: %paste
def palindrome_py(a):
    return a==a[::-1]
## -- End pasted text --
In [5]: dis.dis(palindrome_py)
2 0 LOAD_FAST 0 (a)
3 LOAD_FAST 0 (a)
6 LOAD_CONST 0 (None)
9 LOAD_CONST 0 (None)
12 LOAD_CONST 2 (-1)
15 BUILD_SLICE 3
18 BINARY_SUBSCR
19 COMPARE_OP 2 (==)
22 RETURN_VALUE
No Python-level iteration (no jumps) is involved here, and you only get 3 calls (these instructions call methods): BUILD_SLICE, BINARY_SUBSCR, COMPARE_OP, all done in C, because str is a built-in type with all of its methods written in C. To be fair, we've seen the same instructions in the first function (along with a lot of other instructions), but there they are repeated for each character, multiplying the method-call overhead by n. Here you pay Python's function-call overhead only once; the rest is done in C.
The bottom line: you shouldn't do low-level stuff in Python manually, because it will run slower than a high-level counterpart (unless you have an asymptotically faster alternative that literally requires low-level magic). Python, unlike many other languages, most of the time encourages you to use abstractions and rewards you with higher performance.
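A quick way to see the constant-factor difference for yourself is a timeit sketch like the one below. Note that palindrome_loop is a slightly adjusted version of the question's loop (it uses < and integer division, so it also handles the empty string and runs on Python 3), and the test string is made up for this illustration.

```python
import timeit

def palindrome_loop(a):
    # Index-based check done in Python bytecode, one character pair per iteration.
    # Adjusted from the question: < and // instead of <= and /.
    length = len(a)
    i = 0
    while i < length // 2:
        if a[i] != a[length - i - 1]:
            return False
        i += 1
    return True

def palindrome_py(a):
    # Reversal and comparison both happen inside the C implementation of str.
    return a == a[::-1]

text = "ab" * 500 + "ba" * 500  # a 2000-character palindrome (made-up input)
for func in (palindrome_loop, palindrome_py):
    seconds = timeit.timeit(lambda: func(text), number=2_000)
    print("%-16s %.4f s" % (func.__name__, seconds))
```

Both functions are O(n); the gap you measure is the per-character interpreter overhead discussed above.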

python `for i in iter` vs `while True; i = next(iter)`

To my understanding, both of these approaches work for operating on every item in a generator:
let i be our operator target
let my_iter be our generator
let callable do_something_with return None
While loop + StopIteration
try:
    while True:
        i = next(my_iter)
        do_something_with(i)
except StopIteration:
    pass
For loop / list comprehension
for i in my_iter:
    do_something_with(i)
[do_something_with(i) for i in my_iter]
Minor edit: print(i) replaced with do_something_with(i), as suggested by @kojiro, to separate the use case from the interpreter mechanics.
As far as I am aware, these are both valid ways to iterate over a generator. Is there any reason to prefer one over the other?
Right now the for loop looks superior to me, because it has fewer lines, less clutter, better readability in general, and a single indent.
I really only see the while approach being advantageous if you want to handily break the loop on particular exceptions.
The third option is definitely NOT the same as the first two. The third example creates a list holding the return value of each print(i) call, which happens to be None, so it is not a very interesting list.
The first two are semantically similar. There is a minor technical difference: the while loop, as presented, does not work if my_iter is not in fact an iterator (i.e., an object with a __next__() method); for instance, if it is a list. The for loop works for all iterables (objects with an __iter__() method), not just iterators.
The correct version is thus:
my_iter = iter(my_iterable)
try:
    while True:
        i = next(my_iter)
        print(i)
except StopIteration:
    pass
Now, aside from readability reasons, there is in fact a technical reason to prefer the for loop: there is a penalty you pay (in CPython, anyhow) for the number of bytecodes executed in tight inner loops. Let's compare:
In [1]: def forloop(my_iter):
   ...:     for i in my_iter:
   ...:         print(i)
   ...:
In [57]: dis.dis(forloop)
2 0 SETUP_LOOP 24 (to 27)
3 LOAD_FAST 0 (my_iter)
6 GET_ITER
>> 7 FOR_ITER 16 (to 26)
10 STORE_FAST 1 (i)
3 13 LOAD_GLOBAL 0 (print)
16 LOAD_FAST 1 (i)
19 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
22 POP_TOP
23 JUMP_ABSOLUTE 7
>> 26 POP_BLOCK
>> 27 LOAD_CONST 0 (None)
30 RETURN_VALUE
That is 7 bytecodes executed per iteration of the inner loop, versus:
In [55]: def whileloop(my_iterable):
   ....:     my_iter = iter(my_iterable)
   ....:     try:
   ....:         while True:
   ....:             i = next(my_iter)
   ....:             print(i)
   ....:     except StopIteration:
   ....:         pass
   ....:
In [56]: dis.dis(whileloop)
2 0 LOAD_GLOBAL 0 (iter)
3 LOAD_FAST 0 (my_iterable)
6 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
9 STORE_FAST 1 (my_iter)
3 12 SETUP_EXCEPT 32 (to 47)
4 15 SETUP_LOOP 25 (to 43)
5 >> 18 LOAD_GLOBAL 1 (next)
21 LOAD_FAST 1 (my_iter)
24 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
27 STORE_FAST 2 (i)
6 30 LOAD_GLOBAL 2 (print)
33 LOAD_FAST 2 (i)
36 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
39 POP_TOP
40 JUMP_ABSOLUTE 18
>> 43 POP_BLOCK
44 JUMP_FORWARD 18 (to 65)
7 >> 47 DUP_TOP
48 LOAD_GLOBAL 3 (StopIteration)
51 COMPARE_OP 10 (exception match)
54 POP_JUMP_IF_FALSE 64
57 POP_TOP
58 POP_TOP
59 POP_TOP
8 60 POP_EXCEPT
61 JUMP_FORWARD 1 (to 65)
>> 64 END_FINALLY
>> 65 LOAD_CONST 0 (None)
68 RETURN_VALUE
9 Bytecodes in the inner loop.
We can actually do even better, though.
In [58]: from collections import deque
In [59]: def deqloop(my_iter):
   ....:     deque(map(print, my_iter), 0)
   ....:
In [61]: dis.dis(deqloop)
2 0 LOAD_GLOBAL 0 (deque)
3 LOAD_GLOBAL 1 (map)
6 LOAD_GLOBAL 2 (print)
9 LOAD_FAST 0 (my_iter)
12 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
15 LOAD_CONST 1 (0)
18 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
21 POP_TOP
22 LOAD_CONST 0 (None)
25 RETURN_VALUE
Everything happens in C: collections.deque, map and print are all implemented in C (in CPython), so in this case no bytecodes are executed for looping. This is only a useful optimization when the iteration step is a C function (as is the case for print). Otherwise, the overhead of a Python function call is larger than the JUMP_ABSOLUTE overhead.
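The three approaches can be put side by side in a short sketch. The wrapper names with_for, with_while and with_deque are made up for this illustration; each applies func to every item of any iterable, and the while version calls iter() first so that it also accepts lists.

```python
from collections import deque

def with_for(iterable, func):
    # Plain for loop: works on any iterable.
    for item in iterable:
        func(item)

def with_while(iterable, func):
    # Manual protocol: get an iterator, call next() until StopIteration.
    iterator = iter(iterable)
    try:
        while True:
            func(next(iterator))
    except StopIteration:
        pass

def with_deque(iterable, func):
    # A deque with maxlen=0 consumes the map object and stores nothing.
    deque(map(func, iterable), maxlen=0)

for consume in (with_for, with_while, with_deque):
    seen = []
    consume([1, 2, 3], seen.append)
    print(consume.__name__, seen)
```

One caveat of the while version as written: a StopIteration raised inside func would also be swallowed by the except clause, which is another reason to prefer the for loop.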
The for loop is the most pythonic. Note that you can break out of for loops as well as while loops.
Don't use the list comprehension unless you need the resulting list, otherwise you are needlessly storing all the elements. Your example list comprehension will only work with the print function in Python 3, it won't work with the print statement in Python 2.
I would agree with you that the for loop is superior. As you mentioned, it has less clutter and is a lot easier to read. Programmers like to keep things as simple as possible, and the for loop does that. It is also better for novice Python programmers who might not have learned try/except. Also, as Alasdair mentioned, you can break out of for loops. Finally, the while loop raises an error if you pass it a list, unless you call iter() on it first.

efficiency vs. readability: obfuscation when using nested boolean index arrays

I have some pretty ugly indexing going on. For example, things like
valid[ data[ index[valid[:,0],0] ] == 0, 1] = False
where valid and index are {Nx2} arrays of bools and ints respectively, and data is {N} long.
If I concentrate really hard, I can convince myself that this is doing what I want... but it's incredibly obfuscated. How can I unobfuscate something like this efficiently?
I could break it up, for example:
valid_index = index[valid[:,0],0]
invalid_index = (data[ valid_index ] == 0)
valid[ invalid_index, 1 ] = False
But my arrays will have up to 100's of millions of entries so I don't want to duplicate the memory; and I need to remain as speed efficient as possible.
These two code sequences are nearly identical, and should have very similar performance. That's my "gut feeling"--but then I did static analysis and ran a partial benchmark to confirm.
The clearer option requires four more bytecodes to implement, so will probably be slightly slower. But the extra work is restricted to LOAD_FAST and STORE_FAST, which are just moves from the top of stack (TOS) to/from variables. As the extra work is modest, so should be the performance impact.
You could benchmark the two approaches on your target equipment for more quantitative precision, but on my 3-year-old laptop, 100 million extra LOAD_FAST / STORE_FAST pairs take just over 3 seconds on standard CPython 2.7.5. So I estimate this clarity will cost you about 6 seconds per 100M entries. While the PyPy just-in-time Python compiler doesn't use the same bytecodes, I timed its overhead for the clear version at about half that, or 3 seconds per 100M. Compared to other work you're doing to process the items, the clearer version probably is not a significant slowdown.
The TL;DR Backstory
My first impression is that the code sequences, while different in readability and clarity, are technically very similar, and should have similar performance characteristics. But let's analyze a bit further using the Python disassembler. I dropped each code snippet into a function:
def one(valid, data):
    valid[ data[ index[valid[:,0],0] ] == 0, 1] = False
def two(valid, data):
    valid_index = index[valid[:,0],0]
    invalid_index = (data[ valid_index ] == 0)
    valid[ invalid_index, 1 ] = False
Then, using Python's bytecode disassembler:
import dis
dis.dis(one)
print "---"
dis.dis(two)
Gives:
15 0 LOAD_GLOBAL 0 (False)
3 LOAD_FAST 0 (valid)
6 LOAD_FAST 1 (data)
9 LOAD_GLOBAL 1 (index)
12 LOAD_FAST 0 (valid)
15 LOAD_CONST 0 (None)
18 LOAD_CONST 0 (None)
21 BUILD_SLICE 2
24 LOAD_CONST 1 (0)
27 BUILD_TUPLE 2
30 BINARY_SUBSCR
31 LOAD_CONST 1 (0)
34 BUILD_TUPLE 2
37 BINARY_SUBSCR
38 BINARY_SUBSCR
39 LOAD_CONST 1 (0)
42 COMPARE_OP 2 (==)
45 LOAD_CONST 2 (1)
48 BUILD_TUPLE 2
51 STORE_SUBSCR
52 LOAD_CONST 0 (None)
55 RETURN_VALUE
18 0 LOAD_GLOBAL 0 (index)
3 LOAD_FAST 0 (valid)
6 LOAD_CONST 0 (None)
9 LOAD_CONST 0 (None)
12 BUILD_SLICE 2
15 LOAD_CONST 1 (0)
18 BUILD_TUPLE 2
21 BINARY_SUBSCR
22 LOAD_CONST 1 (0)
25 BUILD_TUPLE 2
28 BINARY_SUBSCR
29 STORE_FAST 2 (valid_index)
19 32 LOAD_FAST 1 (data)
35 LOAD_FAST 2 (valid_index)
38 BINARY_SUBSCR
39 LOAD_CONST 1 (0)
42 COMPARE_OP 2 (==)
45 STORE_FAST 3 (invalid_index)
20 48 LOAD_GLOBAL 1 (False)
51 LOAD_FAST 0 (valid)
54 LOAD_FAST 3 (invalid_index)
57 LOAD_CONST 2 (1)
60 BUILD_TUPLE 2
63 STORE_SUBSCR
64 LOAD_CONST 0 (None)
67 RETURN_VALUE
Similar but not identical, and not in the same order. A quick diff of the two shows the same, plus the possibility the clearer function requires more byte codes.
I parsed the bytecode opcodes out of each function's disassembler listing, dropped them into a collections.Counter, and compared the counts:
Bytecode Count(s)
======== ========
BINARY_SUBSCR 3
BUILD_SLICE 1
BUILD_TUPLE 3
COMPARE_OP 1
LOAD_CONST 7
LOAD_FAST 3, 5 *** differs ***
LOAD_GLOBAL 2
RETURN_VALUE 1
STORE_FAST 0, 2 *** differs ***
STORE_SUBSCR 1
Here is where it becomes evident that the second, clearer approach uses only four more bytecodes, and of the simple, fast LOAD_FAST / STORE_FAST variety. Static analysis thus shows no particular reason to fear additional memory allocation or other performance-killing side effects.
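The counting step can be sketched in Python 3 with dis.get_instructions and collections.Counter. This is not the exact script used for the table above: the helper opcode_counts is made up for this illustration, index is passed as a parameter here rather than read as a global, and the opcode names and counts you get depend on your CPython version. The functions are only disassembled, never called, so NumPy is not needed to run it.

```python
import dis
from collections import Counter

def one(valid, data, index):
    valid[data[index[valid[:, 0], 0]] == 0, 1] = False

def two(valid, data, index):
    valid_index = index[valid[:, 0], 0]
    invalid_index = (data[valid_index] == 0)
    valid[invalid_index, 1] = False

def opcode_counts(func):
    # Count how often each opcode name appears in the function's bytecode.
    return Counter(ins.opname for ins in dis.get_instructions(func))

counts_one = opcode_counts(one)
counts_two = opcode_counts(two)
for name in sorted(set(counts_one) | set(counts_two)):
    marker = "" if counts_one[name] == counts_two[name] else "  *** differs ***"
    print("%-24s %d, %d%s" % (name, counts_one[name], counts_two[name], marker))
```

On any version, the rows marked as differing should be limited to the instructions that load and store the two extra local variables.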
I then constructed two functions, very similar to one another, that the disassembler shows differ only in that the second one has an extra LOAD_FAST / STORE_FAST pair. I ran them 100,000,000 times, and compared their runtimes. They differed by just over 3 seconds in CPython 2.7.5, and about 1.5 seconds under PyPy 2.2.1 (based on Python 2.7.3). Even when you double those times (because you have two pairs), it's pretty clear those extra load/store pairs are not going to slow you down much.
