A few years ago, someone posted three Python/NumPy functions on ActiveState Recipes for comparison purposes; each accepted the same arguments and returned the same result, a distance matrix.
Two of these were taken from published sources; both are--or appear to me to be--idiomatic NumPy code. The repetitive calculations required to create a distance matrix are driven by NumPy's elegant index syntax. Here's one of them:
from numpy import sqrt, sum
from numpy.matlib import repmat, repeat

def calcDistanceMatrixFastEuclidean(points):
    numPoints = len(points)
    distMat = sqrt(sum((repmat(points, numPoints, 1) -
                        repeat(points, numPoints, axis=0))**2, axis=1))
    return distMat.reshape((numPoints, numPoints))
The third created the distance matrix using a single loop (which, obviously, is a lot of looping, given that a distance matrix of just 1,000 2D points has one million entries). At first glance this function looked to me like the code I used to write when I was learning NumPy: I would write the Python code first and then translate it, line by line.
Several months after the ActiveState post, the results of performance tests comparing the three were posted and discussed in a thread on the NumPy mailing list.
The function with the loop in fact significantly outperformed the other two:
from numpy import array, zeros, newaxis, sqrt

def calcDistanceMatrixFastEuclidean2(nDimPoints):
    nDimPoints = array(nDimPoints)
    n, m = nDimPoints.shape
    delta = zeros((n, n), 'd')
    for d in xrange(m):
        data = nDimPoints[:, d]
        delta += (data - data[:, newaxis])**2
    return sqrt(delta)
One participant in the thread (Keir Mierle) offered a reason why this might be true:
"The reason that I suspect this will be faster is that it has better locality, completely finishing a computation on a relatively small working set before moving onto the next one. The one-liners have to pull the potentially large MxN array into the processor repeatedly."
By this poster's own account, his remark is only a suspicion, and it doesn't appear that it was discussed any further.
Any other thoughts about how to account for these results?
In particular, is there a useful rule--regarding when to loop and when to index--that can be extracted from this example as guidance in writing numpy code?
For those not familiar with NumPy, or who haven't looked at the code, this comparison is not based on an edge case--it certainly wouldn't be that interesting to me if it were. Instead, it involves a function that performs a common task in matrix computation (creating a result array from two antecedents); moreover, each function is in turn composed of some of the most common NumPy built-ins.
TL;DR: The second code above loops only over the number of dimensions of the points (three passes through the for loop for 3D points), so the looping there is minimal. The real speed-up in the second code is that it better harnesses NumPy's power to avoid creating some extra matrices when finding the differences between points. This reduces both memory use and computational effort.
Longer Explanation
I think that the calcDistanceMatrixFastEuclidean2 function is perhaps deceiving you with its loop. It only loops over the number of dimensions of the points. For 1D points the loop executes once, for 2D twice, and for 3D three times. This is really not much looping at all.
Let's analyze the code a little to see why one is faster than the other. I will call calcDistanceMatrixFastEuclidean fast1 and calcDistanceMatrixFastEuclidean2 fast2.
fast1 is based on the Matlab way of doing things, as evidenced by the repmat function. The repmat function creates an array that, in this case, is just the original data repeated over and over again. However, if you look at the code for the function, it is very inefficient: it uses many NumPy operations (3 reshapes and 2 repeats) to do this. The repeat function is also used to create an array that contains the original data with each data item repeated many times. If our input data is [1,2,3], then we are subtracting [1,2,3,1,2,3,1,2,3] from [1,1,1,2,2,2,3,3,3]. NumPy has had to create a lot of extra matrices in between running its C code, which could have been avoided.
fast2 uses more of NumPy's heavy lifting without creating as many matrices between NumPy calls. fast2 loops through each dimension of the points, does the subtraction, and keeps a running total of the squared differences along each dimension. Only at the end is the square root taken. So far, this may not sound quite as efficient as fast1, but fast2 avoids doing the repmat stuff by using NumPy's indexing. Let's look at the 1D case for simplicity. fast2 makes a 1D array of the data and subtracts it from a 2D (N x 1) array of the data. This creates the difference matrix between each point and all of the other points without having to use repmat and repeat, and thereby bypasses creating a lot of extra arrays. This is where the real speed difference lies, in my opinion: fast1 creates a lot of extra in-between matrices (and they are expensive to create) to find the differences between points, while fast2 better harnesses the power of NumPy to avoid these.
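To make the indexing trick concrete, here is a minimal sketch of the broadcast subtraction on a toy 1D input (the array values are made up for illustration):

from numpy import array, newaxis

data = array([1.0, 2.0, 4.0])   # three 1D points
# data[:, newaxis] has shape (3, 1); data has shape (3,).
# Broadcasting the subtraction yields the full (3, 3) matrix of
# pairwise differences in one C-level operation, with no repmat/repeat.
diff = data - data[:, newaxis]
print diff
# [[ 0.  1.  3.]
#  [-1.  0.  2.]
#  [-3. -2.  0.]]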
By the way, here is a slightly faster version of fast2:
def calcDistanceMatrixFastEuclidean3(nDimPoints):
    # uses the same imports (array, newaxis, sqrt) as fast2 above
    nDimPoints = array(nDimPoints)
    n, m = nDimPoints.shape
    data = nDimPoints[:, 0]
    delta = (data - data[:, newaxis])**2
    for d in xrange(1, m):
        data = nDimPoints[:, d]
        delta += (data - data[:, newaxis])**2
    return sqrt(delta)
The difference is that we are no longer creating delta as a zeros matrix.
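If you want to check these timings on your own machine, a small timeit harness along the following lines should do it (a sketch; the point count and the random test data are arbitrary choices, not from the original benchmark):

import timeit
from numpy.random import rand

points = rand(1000, 2)   # 1,000 random 2D points

for fn in ('calcDistanceMatrixFastEuclidean',
           'calcDistanceMatrixFastEuclidean2',
           'calcDistanceMatrixFastEuclidean3'):
    print fn, timeit.timeit('%s(points)' % fn,
                            setup='from __main__ import %s, points' % fn,
                            number=10)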
Disassembly with dis, just for fun:
dis.dis(calcDistanceMatrixFastEuclidean)
2 0 LOAD_GLOBAL 0 (len)
3 LOAD_FAST 0 (points)
6 CALL_FUNCTION 1
9 STORE_FAST 1 (numPoints)
3 12 LOAD_GLOBAL 1 (sqrt)
15 LOAD_GLOBAL 2 (sum)
18 LOAD_GLOBAL 3 (repmat)
21 LOAD_FAST 0 (points)
24 LOAD_FAST 1 (numPoints)
27 LOAD_CONST 1 (1)
30 CALL_FUNCTION 3
4 33 LOAD_GLOBAL 4 (repeat)
36 LOAD_FAST 0 (points)
39 LOAD_FAST 1 (numPoints)
42 LOAD_CONST 2 ('axis')
45 LOAD_CONST 3 (0)
48 CALL_FUNCTION 258
51 BINARY_SUBTRACT
52 LOAD_CONST 4 (2)
55 BINARY_POWER
56 LOAD_CONST 2 ('axis')
59 LOAD_CONST 1 (1)
62 CALL_FUNCTION 257
65 CALL_FUNCTION 1
68 STORE_FAST 2 (distMat)
5 71 LOAD_FAST 2 (distMat)
74 LOAD_ATTR 5 (reshape)
77 LOAD_FAST 1 (numPoints)
80 LOAD_FAST 1 (numPoints)
83 BUILD_TUPLE 2
86 CALL_FUNCTION 1
89 RETURN_VALUE
dis.dis(calcDistanceMatrixFastEuclidean2)
2 0 LOAD_GLOBAL 0 (array)
3 LOAD_FAST 0 (nDimPoints)
6 CALL_FUNCTION 1
9 STORE_FAST 0 (nDimPoints)
3 12 LOAD_FAST 0 (nDimPoints)
15 LOAD_ATTR 1 (shape)
18 UNPACK_SEQUENCE 2
21 STORE_FAST 1 (n)
24 STORE_FAST 2 (m)
4 27 LOAD_GLOBAL 2 (zeros)
30 LOAD_FAST 1 (n)
33 LOAD_FAST 1 (n)
36 BUILD_TUPLE 2
39 LOAD_CONST 1 ('d')
42 CALL_FUNCTION 2
45 STORE_FAST 3 (delta)
5 48 SETUP_LOOP 76 (to 127)
51 LOAD_GLOBAL 3 (xrange)
54 LOAD_FAST 2 (m)
57 CALL_FUNCTION 1
60 GET_ITER
>> 61 FOR_ITER 62 (to 126)
64 STORE_FAST 4 (d)
6 67 LOAD_FAST 0 (nDimPoints)
70 LOAD_CONST 0 (None)
73 LOAD_CONST 0 (None)
76 BUILD_SLICE 2
79 LOAD_FAST 4 (d)
82 BUILD_TUPLE 2
85 BINARY_SUBSCR
86 STORE_FAST 5 (data)
7 89 LOAD_FAST 3 (delta)
92 LOAD_FAST 5 (data)
95 LOAD_FAST 5 (data)
98 LOAD_CONST 0 (None)
101 LOAD_CONST 0 (None)
104 BUILD_SLICE 2
107 LOAD_GLOBAL 4 (newaxis)
110 BUILD_TUPLE 2
113 BINARY_SUBSCR
114 BINARY_SUBTRACT
115 LOAD_CONST 2 (2)
118 BINARY_POWER
119 INPLACE_ADD
120 STORE_FAST 3 (delta)
123 JUMP_ABSOLUTE 61
>> 126 POP_BLOCK
8 >> 127 LOAD_GLOBAL 5 (sqrt)
130 LOAD_FAST 3 (delta)
133 CALL_FUNCTION 1
136 RETURN_VALUE
I'm not an expert on dis, but it seems like you would have to look more closely at the functions that the first one is calling to know why they take a while. Python also ships with a performance profiler, cProfile.
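For instance, a minimal cProfile run over each function might look like this (a sketch, assuming the functions and a points array are already defined in the session):

import cProfile

# Profile one call of each; the per-function table shows where the time
# goes (inside NumPy's compiled routines vs. Python-level bytecode).
cProfile.run('calcDistanceMatrixFastEuclidean(points)')
cProfile.run('calcDistanceMatrixFastEuclidean2(points)')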
Related
Which one is better?
for x in range(0, 100):
    print("Lorem Ipsum")

for x in range(0, 10):
    for y in range(0, 10):
        print("Lorem Ipsum")
The second one is harder to read, and its inner loop constructs a fresh, unnecessary range iterable on every pass of the outer loop (a list in Python 2; in Python 3, a range object that is cheaper to create and uses less memory).
From each of those iterables, the inner for loop then constructs an unnecessary iterator (a listiterator in Python 2, a range_iterator in Python 3).
The first one is more readable and easier to understand. Use that.
Regarding performance, I doubt it makes any difference, and if it does, the single 0-100 loop is faster, because it compiles to less bytecode (assuming the double loop is not optimized away) and thus a shorter code path.
When in doubt about such things, use the one that is easier to understand when you read the code. Premature optimization is a sin.
You can use dis from the dis module to disassemble and analyse the bytecode of each loop, to see which one does less work (which needs fewer objects, fewer iterators, and so on).
Here is the disassembly:
from dis import dis

def loop1():
    for x in range(100):
        pass

def loop2():
    for x in range(10):
        for j in range(10):
            pass
Now look under the hood of each loop:
dis(loop1)
2 0 SETUP_LOOP 20 (to 23)
3 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (100)
9 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
12 GET_ITER
>> 13 FOR_ITER 6 (to 22)
16 STORE_FAST 0 (x)
3 19 JUMP_ABSOLUTE 13
>> 22 POP_BLOCK
>> 23 LOAD_CONST 0 (None)
26 RETURN_VALUE
And look at the amount of data and operations needed in your second loop:
dis(loop2)
2 0 SETUP_LOOP 43 (to 46)
3 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (10)
9 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
12 GET_ITER
>> 13 FOR_ITER 29 (to 45)
16 STORE_FAST 0 (x)
3 19 SETUP_LOOP 20 (to 42)
22 LOAD_GLOBAL 0 (range)
25 LOAD_CONST 1 (10)
28 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
31 GET_ITER
>> 32 FOR_ITER 6 (to 41)
35 STORE_FAST 1 (j)
4 38 JUMP_ABSOLUTE 32
>> 41 POP_BLOCK
>> 42 JUMP_ABSOLUTE 13
>> 45 POP_BLOCK
>> 46 LOAD_CONST 0 (None)
49 RETURN_VALUE
Since both loops do the same thing, the first one is far better.
Just imagine how you would modify the nested loop for 101 iterations instead of 100, and the disadvantage becomes clear; see the sketch below.
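For example, the flat loop handles 101 iterations with a one-character change, while the nested version needs extra bookkeeping because 101 has no clean 10-by-10-style factoring (a sketch):

# Flat version: trivial to change.
for x in range(101):
    print("Lorem Ipsum")

# Nested version: you end up bolting the 101st iteration on by hand.
for x in range(10):
    for y in range(10):
        print("Lorem Ipsum")
print("Lorem Ipsum")   # the extra iteration, handled separately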
I am trying to find the most efficient way to check whether a given string is a palindrome or not.
Firstly, I tried brute force, which has a running time of order O(N). Then I optimized the code a little by making only n/2 comparisons instead of n.
Here is the code:
def palindrome(a):
    length = len(a)
    iterator = 0
    while iterator <= length/2:
        if a[iterator] == a[length-iterator-1]:
            iterator += 1
        else:
            return False
    return True
It takes half the time compared to brute force, but it is still O(N).
Meanwhile, I also thought of a solution which uses slice operator.
Here is the code:
def palindrome_py(a):
    return a == a[::-1]
Then I did running time analysis of both. Here is the result:
Running time
The base string length is 50; the length multiplier gives the length of the new string (50 * multiplier). Times are for 100,000 iterations.

    palindrome      palindrome_py   Length Multiplier
     0.6559998989    0.5309998989   1
     1.2970001698    0.5939998627   2
     3.5149998665    0.7820000648   3
    13.4249999523    1.5310001373   4
    65.5319998264    5.2660000324   5
The code I used can be accessed here: Running Time Table Generator
Now, I want to know why there is a difference between the running time of the slice operator (palindrome_py) and the palindrome function. Why am I getting this type of running time?
Why is the slice operator so efficient compared to the palindrome function? What is happening behind the scenes?
My observations:
The running time is proportional to the multiplier, i.e. the running time for multiplier n can be obtained by multiplying the running time for multiplier (n-1) by n.
Generalizing, we get: Running Time(n) = Running Time(n-1) * Multiplier
Your slicing-based solution is still O(n); the constant just got smaller (that's your multiplier). It's faster because less of the work is done in Python and more of it is done in C. The bytecode shows it all.
In [1]: import dis
In [2]: %paste
def palindrome(a):
    length = len(a)
    iterator = 0
    while iterator <= length/2:
        if a[iterator] == a[length-iterator-1]:
            iterator += 1
        else:
            return False
    return True
## -- End pasted text --
In [3]: dis.dis(palindrome)
2 0 LOAD_GLOBAL 0 (len)
3 LOAD_FAST 0 (a)
6 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
9 STORE_FAST 1 (length)
3 12 LOAD_CONST 1 (0)
15 STORE_FAST 2 (iterator)
4 18 SETUP_LOOP 65 (to 86)
>> 21 LOAD_FAST 2 (iterator)
24 LOAD_FAST 1 (length)
27 LOAD_CONST 2 (2)
30 BINARY_TRUE_DIVIDE
31 COMPARE_OP 1 (<=)
34 POP_JUMP_IF_FALSE 85
5 37 LOAD_FAST 0 (a)
40 LOAD_FAST 2 (iterator)
43 BINARY_SUBSCR
44 LOAD_FAST 0 (a)
47 LOAD_FAST 1 (length)
50 LOAD_FAST 2 (iterator)
53 BINARY_SUBTRACT
54 LOAD_CONST 3 (1)
57 BINARY_SUBTRACT
58 BINARY_SUBSCR
59 COMPARE_OP 2 (==)
62 POP_JUMP_IF_FALSE 78
6 65 LOAD_FAST 2 (iterator)
68 LOAD_CONST 3 (1)
71 INPLACE_ADD
72 STORE_FAST 2 (iterator)
75 JUMP_ABSOLUTE 21
8 >> 78 LOAD_CONST 4 (False)
81 RETURN_VALUE
82 JUMP_ABSOLUTE 21
>> 85 POP_BLOCK
10 >> 86 LOAD_CONST 5 (True)
89 RETURN_VALUE
There are a hell of a lot of Python virtual-machine-level instructions here, several of which are basically function calls, and function calls are very expensive in Python.
Now, what about the second function?
In [4]: %paste
def palindrome_py(a):
    return a == a[::-1]
## -- End pasted text --
In [5]: dis.dis(palindrome_py)
2 0 LOAD_FAST 0 (a)
3 LOAD_FAST 0 (a)
6 LOAD_CONST 0 (None)
9 LOAD_CONST 0 (None)
12 LOAD_CONST 2 (-1)
15 BUILD_SLICE 3
18 BINARY_SUBSCR
19 COMPARE_OP 2 (==)
22 RETURN_VALUE
No Python-level iteration (jumps) is involved here, and there are only 3 instructions that call into methods: BUILD_SLICE, BINARY_SUBSCR, and COMPARE_OP, all done in C, because str is a built-in type with all of its methods written in C. To be fair, we saw the same instructions in the first function (along with a lot of other instructions), but there they are repeated for each character, multiplying the method-call overhead by n. Here you pay Python's function-call overhead only once; the rest is done in C.
The bottom line: you shouldn't do low-level stuff in Python manually, because it will run slower than a high-level counterpart (unless you have an asymptotically faster alternative that literally requires low-level magic). Python, unlike many other languages, usually encourages you to use abstractions and rewards you with higher performance.
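If you want to reproduce the gap yourself, a timeit comparison along these lines should show it (a sketch; the test string is an arbitrary choice, not the question's benchmark):

import timeit

setup = '''
from __main__ import palindrome, palindrome_py
s = "abc" * 100 + "cba" * 100   # an arbitrary palindrome, length 600
'''
print timeit.timeit('palindrome(s)', setup=setup, number=100000)
print timeit.timeit('palindrome_py(s)', setup=setup, number=100000)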
I have some pretty ugly indexing going on. For example, things like
valid[data[index[valid[:, 0], 0]] == 0, 1] = False
where valid and index are Nx2 arrays of bools and ints respectively, and data is N long.
If I concentrate really hard, I can convince myself that this is doing what I want... but its incredibly obfuscated. How can I unobfuscate something like this efficiently?
I could break it up, for example:
valid_index = index[valid[:, 0], 0]
invalid_index = (data[valid_index] == 0)
valid[invalid_index, 1] = False
But my arrays will have up to hundreds of millions of entries, so I don't want to duplicate the memory, and I need to stay as speed-efficient as possible.
These two code sequences are nearly identical and should have very similar performance. That's my "gut feeling"--but then I did static analysis and ran a partial benchmark to confirm.
The clearer option requires four more bytecodes to implement, so it will probably be slightly slower. But the extra work is restricted to LOAD_FAST and STORE_FAST operations, which are just moves between the top of stack (TOS) and local variables. As the extra work is modest, so should be the performance impact.
You could benchmark the two approaches on your target equipment for more quantitative precision, but on my 3-year-old laptop, 100 million extra LOAD_FAST / STORE_FAST pairs take just over 3 seconds on standard CPython 2.7.5. So I estimate this clarity will cost you about 6 seconds per 100M entries. While the PyPy just-in-time Python compiler doesn't use the same bytecodes, I timed its overhead for the clear version at about half that, or 3 seconds per 100M. Compared to the other work you're doing to process the items, the clearer version is probably not a significant slowdown.
The TL;DR Backstory
My first impression is that the code sequences, while different in readability and clarity, are technically very similar and should have very similar performance characteristics. But let's analyze a bit further using the Python disassembler. I dropped each code snippet into a function:
def one(valid, data):
    valid[data[index[valid[:, 0], 0]] == 0, 1] = False

def two(valid, data):
    valid_index = index[valid[:, 0], 0]
    invalid_index = (data[valid_index] == 0)
    valid[invalid_index, 1] = False
Then, using Python's bytecode disassembler:
import dis
dis.dis(one)
print "---"
dis.dis(two)
Gives:
15 0 LOAD_GLOBAL 0 (False)
3 LOAD_FAST 0 (valid)
6 LOAD_FAST 1 (data)
9 LOAD_GLOBAL 1 (index)
12 LOAD_FAST 0 (valid)
15 LOAD_CONST 0 (None)
18 LOAD_CONST 0 (None)
21 BUILD_SLICE 2
24 LOAD_CONST 1 (0)
27 BUILD_TUPLE 2
30 BINARY_SUBSCR
31 LOAD_CONST 1 (0)
34 BUILD_TUPLE 2
37 BINARY_SUBSCR
38 BINARY_SUBSCR
39 LOAD_CONST 1 (0)
42 COMPARE_OP 2 (==)
45 LOAD_CONST 2 (1)
48 BUILD_TUPLE 2
51 STORE_SUBSCR
52 LOAD_CONST 0 (None)
55 RETURN_VALUE
18 0 LOAD_GLOBAL 0 (index)
3 LOAD_FAST 0 (valid)
6 LOAD_CONST 0 (None)
9 LOAD_CONST 0 (None)
12 BUILD_SLICE 2
15 LOAD_CONST 1 (0)
18 BUILD_TUPLE 2
21 BINARY_SUBSCR
22 LOAD_CONST 1 (0)
25 BUILD_TUPLE 2
28 BINARY_SUBSCR
29 STORE_FAST 2 (valid_index)
19 32 LOAD_FAST 1 (data)
35 LOAD_FAST 2 (valid_index)
38 BINARY_SUBSCR
39 LOAD_CONST 1 (0)
42 COMPARE_OP 2 (==)
45 STORE_FAST 3 (invalid_index)
20 48 LOAD_GLOBAL 1 (False)
51 LOAD_FAST 0 (valid)
54 LOAD_FAST 3 (invalid_index)
57 LOAD_CONST 2 (1)
60 BUILD_TUPLE 2
63 STORE_SUBSCR
64 LOAD_CONST 0 (None)
67 RETURN_VALUE
Similar but not identical, and not in the same order. A quick diff of the two shows the same, plus the fact that the clearer function requires a few more bytecodes.
I parsed the bytecode opcodes out of each function's disassembler listing, dropped them into a collections.Counter, and compared the counts:
Bytecode Count(s)
======== ========
BINARY_SUBSCR 3
BUILD_SLICE 1
BUILD_TUPLE 3
COMPARE_OP 1
LOAD_CONST 7
LOAD_FAST 3, 5 *** differs ***
LOAD_GLOBAL 2
RETURN_VALUE 1
STORE_FAST 0, 2 *** differs ***
STORE_SUBSCR 1
Here is where it becomes evident that the second, clearer approach uses only four more bytecodes, and of the simple, fast LOAD_FAST / STORE_FAST variety. Static analysis thus shows no particular reason to fear additional memory allocation or other performance-killing side effects.
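A sketch of how the opcode counts above could be reproduced (assuming Python 2.7, where dis.dis prints to stdout rather than returning anything, so the listing has to be captured by hand; the helper name count_opcodes is mine):

import dis
import sys
from collections import Counter
from cStringIO import StringIO

def count_opcodes(func):
    # dis.dis prints its listing to stdout, so temporarily capture it.
    old_stdout = sys.stdout
    sys.stdout = buf = StringIO()
    try:
        dis.dis(func)
    finally:
        sys.stdout = old_stdout
    # Opcode names are the all-uppercase tokens on each line.
    ops = [tok for line in buf.getvalue().splitlines()
           for tok in line.split() if tok.isupper()]
    return Counter(ops)

print count_opcodes(one)
print count_opcodes(two)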
I then constructed two functions, very similar to one another, which the disassembler shows differ only in that the second one has an extra LOAD_FAST / STORE_FAST pair. I ran them 100,000,000 times and compared their runtimes. They differed by just over 3 seconds in CPython 2.7.5, and by about 1.5 seconds under PyPy 2.2.1 (based on Python 2.7.3). Even when you double those times (because you have two pairs), it's pretty clear those extra load/store pairs are not going to slow you down much.
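The microbenchmark functions weren't shown in the answer; a pair like the following would reproduce the same single-pair bytecode difference (the names and the timeit harness are mine):

import timeit

def direct():
    a = 1
    return a

def extra_pair():
    a = 1
    b = a   # compiles to exactly one extra LOAD_FAST / STORE_FAST pair
    return b

# 100 million calls isolate the cost of that extra pair.
print timeit.timeit(direct, number=100000000)
print timeit.timeit(extra_pair, number=100000000)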
This is a contrived example to demonstrate referencing the same dictionary item multiple times in a for-loop and a list-comprehension. First, the for-loop:
dict_index_mylists = {0: ['a', 'b', 'c'], 1: ['b', 'c', 'a'], 2: ['c', 'a', 'b']}

# for-loop
myseq = []
for i in [0, 1, 2]:
    interim = dict_index_mylists[i]
    if interim[0] == 'b' or interim[1] == 'c' or interim[2] == 'a':
        myseq.append(interim)
In the for-loop, the interim list is fetched from the dictionary once and then referenced multiple times in the if-conditional, which may make sense, particularly if the dictionary is very large and/or lives on storage. Then again, the interim reference may be unnecessary, because the Python dictionary is optimized for performance.
Here is the for-loop rewritten as a list comprehension:
# list-comprehension
myseq = [dict_index_mylists[i] for i in [0, 1, 2]
         if dict_index_mylists[i][0] == 'b'
         or dict_index_mylists[i][1] == 'c'
         or dict_index_mylists[i][2] == 'a']
The questions are:
a. Does the list comprehension make multiple references to the dictionary item, or does it fetch once and keep a local 'interim' list to work on?
b. What is the optimal list-comprehension expression containing multiple conditionals on the same dictionary item, when the dictionary is very large?
You seem to be asking only about optimization of common subexpressions. In your list comprehension, Python will index into the dictionary multiple times. Python is dynamic; it is difficult to know what side effects an operation like dict_index_mylists[i] might have, so CPython simply executes the operation as many times as you tell it to.
Other implementations like PyPy use a JIT and may optimize away subexpressions, but it is difficult to know for sure what it will do ahead of time.
If you are very concerned with performance, you need to time various options to see which is best.
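For example, a timeit comparison of the two versions might look like this (a sketch; the setup string repeats the dictionary defined in the question):

import timeit

setup = ("dict_index_mylists = {0: ['a', 'b', 'c'], "
         "1: ['b', 'c', 'a'], 2: ['c', 'a', 'b']}")

loop_version = '''
myseq = []
for i in [0, 1, 2]:
    interim = dict_index_mylists[i]
    if interim[0] == 'b' or interim[1] == 'c' or interim[2] == 'a':
        myseq.append(interim)
'''

comp_version = '''
myseq = [dict_index_mylists[i] for i in [0, 1, 2]
         if dict_index_mylists[i][0] == 'b'
         or dict_index_mylists[i][1] == 'c'
         or dict_index_mylists[i][2] == 'a']
'''

print timeit.timeit(loop_version, setup=setup, number=1000000)
print timeit.timeit(comp_version, setup=setup, number=1000000)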
I'm no expert at looking at python bytecode, but here's my attempt to learn something new this morning:
def dostuff():
    myseq = [dict_index_mylists[i] for i in [0, 1, 2]
             if dict_index_mylists[i][0] == 'b'
             or dict_index_mylists[i][1] == 'c'
             or dict_index_mylists[i][2] == 'a']

import dis
dis.dis(dostuff)
If you look at the output (below), there are 4 calls to LOAD_GLOBAL, so it doesn't look like Python is storing an interim list. As for your second question, what you have is probably about as good as you can do. It's not as bad as you might think, though: dict objects access items via a hash function, so their lookup complexity is O(1) regardless of dictionary size. Of course, you could always use timeit to compare the two implementations (loop and list comprehension) and then choose the faster one. Profiling (as always) is your friend.
APPENDIX (output of dis.dis(dostuff))
5 0 BUILD_LIST 0
3 DUP_TOP
4 STORE_FAST 0 (_[1])
7 LOAD_CONST 1 (0)
10 LOAD_CONST 2 (1)
13 LOAD_CONST 3 (2)
16 BUILD_LIST 3
19 GET_ITER
>> 20 FOR_ITER 84 (to 107)
23 STORE_FAST 1 (i)
26 LOAD_GLOBAL 0 (dict_index_mylists)
29 LOAD_FAST 1 (i)
32 BINARY_SUBSCR
33 LOAD_CONST 1 (0)
36 BINARY_SUBSCR
37 LOAD_CONST 4 ('b')
40 COMPARE_OP 2 (==)
43 JUMP_IF_TRUE 42 (to 88)
46 POP_TOP
47 LOAD_GLOBAL 0 (dict_index_mylists)
50 LOAD_FAST 1 (i)
53 BINARY_SUBSCR
54 LOAD_CONST 2 (1)
57 BINARY_SUBSCR
58 LOAD_CONST 5 ('c')
61 COMPARE_OP 2 (==)
64 JUMP_IF_TRUE 21 (to 88)
67 POP_TOP
68 LOAD_GLOBAL 0 (dict_index_mylists)
71 LOAD_FAST 1 (i)
74 BINARY_SUBSCR
75 LOAD_CONST 3 (2)
78 BINARY_SUBSCR
79 LOAD_CONST 6 ('a')
82 COMPARE_OP 2 (==)
85 JUMP_IF_FALSE 15 (to 103)
>> 88 POP_TOP
89 LOAD_FAST 0 (_[1])
92 LOAD_GLOBAL 0 (dict_index_mylists)
95 LOAD_FAST 1 (i)
98 BINARY_SUBSCR
99 LIST_APPEND
100 JUMP_ABSOLUTE 20
>> 103 POP_TOP
104 JUMP_ABSOLUTE 20
>> 107 DELETE_FAST 0 (_[1])
110 STORE_FAST 2 (myseq)
113 LOAD_CONST 0 (None)
116 RETURN_VALUE
First point: nothing (except myseq) is "created" here, in either the for-loop or the list-comprehension version of your code--both just hold references to the existing dict items.
Now to answer your questions: the list-comprehension version will make a lookup (a call to dict.__getitem__) for each of the dict_index_mylists[i] expressions. Each of these lookups will return a reference to the same list. You can avoid the extra lookups by retaining a local reference to the dict's items, i.e.:
myseq = [
    item for item in (dict_index_mylists[i] for i in (0, 1, 2))
    if item[0] == 'b' or item[1] == 'c' or item[2] == 'a'
]
but there's no point in writing a list comprehension just for the sake of writing one.
Note that if you don't care about the original ordering and want to apply this to your whole dict, using dict.itervalues() would be simpler, as sketched below.
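A minimal sketch of that variant (assuming Python 2, where dict.itervalues() exists; in Python 3 it would be dict.values()):

myseq = [item for item in dict_index_mylists.itervalues()
         if item[0] == 'b' or item[1] == 'c' or item[2] == 'a']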
As for the second question, "optimal" is not an absolute. What do you want to optimize for? Space? Time? Readability?
Why do the following two implementations have different performance in Python?
from cStringIO import StringIO
from itertools import imap
from sys import stdin
input = imap(int, StringIO(stdin.read()))
print '\n'.join(imap(str, sorted(input)))
AND
import sys

l = []  # accumulator list (missing from the original snippet)
for line in sys.stdin:
    l.append(int(line.strip('\n')))
l.sort()
for x in l:
    print x
The first implementation is faster than the second for inputs of the order of 10^6 lines. Why so?
>>> dis.dis(first)
2 0 LOAD_GLOBAL 0 (imap)
3 LOAD_GLOBAL 1 (int)
6 LOAD_GLOBAL 2 (StringIO)
9 LOAD_GLOBAL 3 (stdin)
12 LOAD_ATTR 4 (read)
15 CALL_FUNCTION 0
18 CALL_FUNCTION 1
21 CALL_FUNCTION 2
24 STORE_FAST 0 (input)
27 LOAD_CONST 0 (None)
30 RETURN_VALUE
>>> dis.dis(second)
2 0 SETUP_LOOP 48 (to 51)
3 LOAD_GLOBAL 0 (sys)
6 LOAD_ATTR 1 (stdin)
9 CALL_FUNCTION 0
12 GET_ITER
>> 13 FOR_ITER 34 (to 50)
16 STORE_FAST 0 (line)
3 19 LOAD_GLOBAL 2 (l)
22 LOAD_ATTR 3 (append)
25 LOAD_GLOBAL 4 (int)
28 LOAD_FAST 0 (line)
31 LOAD_ATTR 5 (strip)
34 LOAD_CONST 1 ('\n')
37 CALL_FUNCTION 1
40 CALL_FUNCTION 1
43 CALL_FUNCTION 1
46 POP_TOP
47 JUMP_ABSOLUTE 13
>> 50 POP_BLOCK
4 >> 51 LOAD_GLOBAL 2 (l)
54 LOAD_ATTR 6 (sort)
57 CALL_FUNCTION 0
60 POP_TOP
61 LOAD_CONST 0 (None)
64 RETURN_VALUE
first is your first function and second is your second.
dis reveals one of the reasons why the first one is faster.
Two primary reasons:
1. The second version explicitly constructs a list and sorts it afterwards, while the first lets sorted build just one internal list while sorting at the same time.
2. The second version explicitly loops over a list with for (on the Python VM), while the first loops implicitly with imap (over the underlying structure, in C).
Anyway, why is StringIO in there? The most straightforward, and probably fastest, way is:
from sys import stdin, stdout
stdout.writelines(sorted(stdin, key=int))
Do a step-by-step conversion from the second version to the first and see how the performance changes with each step.
1. Remove line.strip. This will cause some speed-up; whether it is significant is another matter. The stripping is superfluous, as has been mentioned by you and THC4k.
2. Replace the for loop using l.append with map(int, sys.stdin). My guess is that this would give a significant speed-up.
3. Replace map and l.sort with imap and sorted. My guess is that it won't affect the performance; there could be a slight slowdown, but it would be far from significant. Between the two, I'd usually go with the former, but with Python 3 on the horizon the latter is probably preferable.
4. Replace the for loop using print with print '\n'.join(...). My guess is that this would be another speed-up, but it would cost you some memory.
5. Add cStringIO (which is completely unnecessary, by the way) to see how it affects performance. My guess is that it would be slightly slower, but not enough to counter steps 2 and 4.
Then, if you try THC4k's answer, it would probably be faster than all of the above, while being simpler and easier to read, and using less memory than steps 4 and 5. It has slightly different behaviour (it doesn't strip leading zeros from the numbers).
Of course, try this yourself instead of trusting anyone's guesses. Also run cProfile on your code and see which parts are losing the most time.
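For reference, here is one rung of that ladder written out (a sketch of steps 1 and 2 above, assuming Python 2; this is not code from the original answers):

import sys

# Steps 1-2: drop the superfluous strip() and let map drive the loop in C.
l = map(int, sys.stdin)   # int() tolerates the trailing newline
l.sort()
for x in l:
    print x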