Is there any faster way to swap two list elements in Python than
L[a], L[b] = L[b], L[a]
or would I have to resort to Cython or Weave or the like?
Looks like the Python compiler optimizes out the temporary tuple with this construct:
code:
import dis
def swap1():
    a = 5
    b = 4
    a, b = b, a

def swap2():
    a = 5
    b = 4
    c = a
    a = b
    b = c
print 'swap1():'
dis.dis(swap1)
print 'swap2():'
dis.dis(swap2)
output:
swap1():
6 0 LOAD_CONST 1 (5)
3 STORE_FAST 0 (a)
7 6 LOAD_CONST 2 (4)
9 STORE_FAST 1 (b)
8 12 LOAD_FAST 1 (b)
15 LOAD_FAST 0 (a)
18 ROT_TWO
19 STORE_FAST 0 (a)
22 STORE_FAST 1 (b)
25 LOAD_CONST 0 (None)
28 RETURN_VALUE
swap2():
11 0 LOAD_CONST 1 (5)
3 STORE_FAST 0 (a)
12 6 LOAD_CONST 2 (4)
9 STORE_FAST 1 (b)
13 12 LOAD_FAST 0 (a)
15 STORE_FAST 2 (c)
14 18 LOAD_FAST 1 (b)
21 STORE_FAST 0 (a)
15 24 LOAD_FAST 2 (c)
27 STORE_FAST 1 (b)
30 LOAD_CONST 0 (None)
33 RETURN_VALUE
Two loads, a ROT_TWO, and two saves, versus three loads and three saves. You are unlikely to find a faster mechanism.
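For what it's worth, here is a quick timeit sketch (my own addition, not from the post above; numbers will vary by machine) comparing the two idioms on actual list elements:
import timeit

setup = "L = list(range(100)); a, b = 10, 20"

print(timeit.timeit("L[a], L[b] = L[b], L[a]", setup=setup))
print(timeit.timeit("t = L[a]; L[a] = L[b]; L[b] = t", setup=setup))
Given the bytecode above, the tuple version should come out no slower, and usually a touch faster.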
If you could post a representative code sample, we could do a better job of benchmarking your options. FWIW, for the following dumb benchmark, I get about a 3x speedup with Shed Skin and a 10x speedup with PyPy.
from time import time
def swap(L):
    for i in xrange(1000000):
        for b, a in enumerate(L):
            L[a], L[b] = L[b], L[a]

def main():
    start = time()
    L = list(reversed(range(100)))
    swap(L[:])
    print time() - start
    return L

if __name__ == "__main__":
    print len(main())
# for shedskin:
# shedskin -b -r -e listswap.py && make
# python -c "import listswap; print len(listswap.main())"
I tried this method as an easy way to swap the first two numbers in a list (keeping in mind that indexes shift after each pop):
lst = [23, 65, 19, 90]
pos1 = lst.pop(0)      # 23; lst is now [65, 19, 90]
pos2 = lst.pop(0)      # 65; lst is now [19, 90]
lst.insert(0, pos1)    # [23, 19, 90]
lst.insert(0, pos2)    # [65, 23, 19, 90]
print(lst)             # the first two elements are swapped
I found this method handy for swapping the first and last numbers in a list (though plain tuple assignment is still faster):
mylist = [11, 23, 5, 8, 13, 17]
first_el = mylist.pop(0)     # 11; mylist is now [23, 5, 8, 13, 17]
last_el = mylist.pop(-1)     # 17; mylist is now [23, 5, 8, 13]
mylist.insert(0, last_el)    # [17, 23, 5, 8, 13]
mylist.append(first_el)      # [17, 23, 5, 8, 13, 11]
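As a sanity check, here is a timeit sketch (my own; the names pop_version and tuple_version are mine, and the numbers are machine-dependent) pitting the pop/insert shuffle against plain tuple assignment for the same first/last swap:
import timeit

pop_version = """\
first_el = mylist.pop(0)
last_el = mylist.pop(-1)
mylist.insert(0, last_el)
mylist.append(first_el)
"""

tuple_version = "mylist[0], mylist[-1] = mylist[-1], mylist[0]"

setup = "mylist = [11, 23, 5, 8, 13, 17]"

print(timeit.timeit(pop_version, setup=setup))    # four method calls per swap
print(timeit.timeit(tuple_version, setup=setup))  # index loads and stores only
Expect the tuple assignment to win: it avoids four method calls per swap.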
Related
In this trivial example, I want to factor out the i < 5 condition of a list comprehension into its own function. I also want to have my cake and eat it too, avoiding the overhead of the CALL_FUNCTION bytecode and of creating a new frame in the Python virtual machine.
Is there any way to factor out the conditions inside of a list comprehension into a new function but somehow get a disassembled result that avoids the large overhead of CALL_FUNCTION?
import dis
import sys
import timeit
def my_filter(n):
    return n < 5

def a():
    # list comprehension with function call
    return [i for i in range(10) if my_filter(i)]

def b():
    # list comprehension without function call
    return [i for i in range(10) if i < 5]
assert a() == b()
>>> sys.version_info[:]
(3, 6, 5, 'final', 0)
>>> timeit.timeit(a)
1.2616060493517098
>>> timeit.timeit(b)
0.685117881097812
>>> dis.dis(a)
3 0 LOAD_CONST 1 (<code object <listcomp> at 0x0000020F4890B660, file "<stdin>", line 3>)
# ...
>>> dis.dis(b)
3 0 LOAD_CONST 1 (<code object <listcomp> at 0x0000020F48A42270, file "<stdin>", line 3>)
# ...
# list comprehension with function call
# big overhead with that CALL_FUNCTION at address 12
>>> dis.dis(a.__code__.co_consts[1])
3 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_GLOBAL 0 (my_filter)
10 LOAD_FAST 1 (i)
12 CALL_FUNCTION 1
14 POP_JUMP_IF_FALSE 4
16 LOAD_FAST 1 (i)
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
# list comprehension without function call
>>> dis.dis(b.__code__.co_consts[1])
3 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_FAST 1 (i)
10 LOAD_CONST 0 (5)
12 COMPARE_OP 0 (<)
14 POP_JUMP_IF_FALSE 4
16 LOAD_FAST 1 (i)
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
I'm willing to take a hacky solution that I would never use in production, like somehow replacing the bytecode at run time.
In other words, is it possible to replace a's addresses 8, 10, and 12 with b's 8, 10, and 12 at runtime?
Consolidating all of the excellent answers in the comments into one.
As georg says, this sounds like you are looking for a way to inline a function or an expression, and there is no such thing in CPython, although attempts have been made: https://bugs.python.org/issue10399
Therefore, along the lines of "metaprogramming", you can build the lambdas inline and eval them:
from typing import Callable
import dis
def b():
    # list comprehension without function call
    return [i for i in range(10) if i < 5]

def gen_list_comprehension(expr: str) -> Callable:
    return eval(f"lambda: [i for i in range(10) if {expr}]")

a = gen_list_comprehension("i < 5")
dis.dis(a.__code__.co_consts[1])
print("=" * 10)
dis.dis(b.__code__.co_consts[1])
which when run under 3.7.6 gives:
6 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_FAST 1 (i)
10 LOAD_CONST 0 (5)
12 COMPARE_OP 0 (<)
14 POP_JUMP_IF_FALSE 4
16 LOAD_FAST 1 (i)
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
==========
1 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_FAST 1 (i)
10 LOAD_CONST 0 (5)
12 COMPARE_OP 0 (<)
14 POP_JUMP_IF_FALSE 4
16 LOAD_FAST 1 (i)
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
From a security standpoint, eval is dangerous, although here it is less so because what you can do inside a lambda is limited. What can be done in an IfExp expression is even more limited, but it is still dangerous, e.g. calling a function that does evil things.
However, if you want the same effect more securely, instead of working with strings you can modify ASTs. I find that a lot more cumbersome, though.
A hybrid approach would be to call ast.parse() and check the result. For example:
import ast
def is_cond_str(s: str) -> bool:
    try:
        mod_ast = ast.parse(s)
        expr_ast = mod_ast.body[0]
        if not isinstance(expr_ast, ast.Expr):
            return False
        compare_ast = expr_ast.value
        if not isinstance(compare_ast, ast.Compare):
            return False
        return True
    except SyntaxError:
        return False
This is a little more secure, but there still may be evil functions in the condition so you could keep going. Again, I find this a little tedious.
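Putting the pieces together, a hypothetical usage (assuming the gen_list_comprehension and is_cond_str definitions above) might look like:
expr = "i < 5"
if is_cond_str(expr):
    f = gen_list_comprehension(expr)
    print(f())  # [0, 1, 2, 3, 4]
else:
    raise ValueError(f"not a simple comparison: {expr!r}")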
Coming from the other direction of starting off with bytecode, there is my cross-version assembler; see https://pypi.org/project/xasm/
I have a large dict src (up to 1M items) and I would like to take N (typical values would be N=10K-20K) items, store them in a new dict dst and leave only the remaining items in src. It doesn't matter which N items are taken. I'm looking for the fastest way to do it on Python 3.6 or 3.7.
Fastest approach I've found so far:
src = {i: i ** 3 for i in range(1000000)}
# Taking items 1 by 1 (~0.0059s)
dst = {}
while len(dst) < 20000:
    item = src.popitem()
    dst[item[0]] = item[1]
Is there anything better? Even a marginal gain would be good.
A simple generator expression passed to dict will do:
dict(src.popitem() for _ in range(20000))
Here are the timing tests:
setup = """
src = {i: i ** 3 for i in range(1000000)}

def method_1(d):
    dst = {}
    while len(dst) < 20000:
        item = d.popitem()
        dst[item[0]] = item[1]
    return dst

def method_2(d):
    return dict(d.popitem() for _ in range(20000))
"""
import timeit
print("Method 1: ", timeit.timeit('method_1(src)', setup=setup, number=1))
print("Method 2: ", timeit.timeit('method_2(src)', setup=setup, number=1))
Results:
Method 1: 0.007701821999944514
Method 2: 0.004668198998842854
This is a bit faster still:
from itertools import islice

def method_4(d):
    result = dict(islice(d.items(), 20000))
    for k in result:
        del d[k]
    return result
Compared to other versions, using Netwave's testcase:
Method 1: 0.004459443036466837 # original
Method 2: 0.0034434819826856256 # Netwave
Method 3: 0.002602717955596745 # chepner
Method 4: 0.001974945073015988 # this answer
The extra speedup seems to come from avoiding transitions between C and Python functions. From the disassembly we can see that the dict instantiation happens on the C side, with only 3 function calls from Python. The loop uses the DELETE_SUBSCR opcode instead of needing a function call:
>>> dis.dis(method_4)
2 0 LOAD_GLOBAL 0 (dict)
2 LOAD_GLOBAL 1 (islice)
4 LOAD_FAST 0 (d)
6 LOAD_ATTR 2 (items)
8 CALL_FUNCTION 0
10 LOAD_CONST 1 (20000)
12 CALL_FUNCTION 2
14 CALL_FUNCTION 1
16 STORE_FAST 1 (result)
3 18 SETUP_LOOP 18 (to 38)
20 LOAD_FAST 1 (result)
22 GET_ITER
>> 24 FOR_ITER 10 (to 36)
26 STORE_FAST 2 (k)
28 LOAD_FAST 0 (d)
30 LOAD_FAST 2 (k)
32 DELETE_SUBSCR
34 JUMP_ABSOLUTE 24
>> 36 POP_BLOCK
4 >> 38 LOAD_FAST 1 (result)
40 RETURN_VALUE
Compared with the iterator in method_2:
>>> dis.dis(d.popitem() for _ in range(20000))
1 0 LOAD_FAST 0 (.0)
>> 2 FOR_ITER 14 (to 18)
4 STORE_FAST 1 (_)
6 LOAD_GLOBAL 0 (d)
8 LOAD_ATTR 1 (popitem)
10 CALL_FUNCTION 0
12 YIELD_VALUE
14 POP_TOP
16 JUMP_ABSOLUTE 2
>> 18 LOAD_CONST 0 (None)
20 RETURN_VALUE
which needs a Python to C function call for each item.
I found this approach slightly faster (about 10% less time), using a dictionary comprehension that consumes a generator expression over range, unpacking each key and value as it goes:
dst = {key:value for key,value in (src.popitem() for _ in range(20000))}
on my machine:
your code: 0.00899505615234375
my code: 0.007996797561645508
so about 12% faster on that run; not bad, but not as good as skipping the unpacking entirely, as in Netwave's simpler answer.
This approach can be useful if you want to transform the keys or values in the process.
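For example, a sketch of that idea (my own, not from the answer above) that doubles each value while moving it, using the same popitem pattern:
src = {i: i ** 3 for i in range(1000000)}

# move 20000 items, doubling each value on the way
dst = {key: value * 2 for key, value in (src.popitem() for _ in range(20000))}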
In Python, while assigning a value to a variable, we can either do:
variable = variable + 20
or
variable += 20.
While I do understand that the two operations are semantically the same, i.e., they both increase the previous value of variable by 20, I was wondering if there are subtle run-time performance differences between them, or any other slight differences that might make one preferable to the other.
Is there any such difference, or are they exactly the same?
If there is any difference, is it the same for other languages such as C++?
Thanks.
Perhaps this can help you understand better:
import dis
def a():
    x = 0
    x += 20
    return x

def b():
    x = 0
    x = x + 20
    return x

print 'In place add'
dis.dis(a)
print 'Binary add'
dis.dis(b)
We get the following outputs:
In place add
4 0 LOAD_CONST 1 (0)
3 STORE_FAST 0 (x)
5 6 LOAD_FAST 0 (x)
9 LOAD_CONST 2 (20)
12 INPLACE_ADD
13 STORE_FAST 0 (x)
6 16 LOAD_FAST 0 (x)
19 RETURN_VALUE
Binary add
9 0 LOAD_CONST 1 (0)
3 STORE_FAST 0 (x)
10 6 LOAD_FAST 0 (x)
9 LOAD_CONST 2 (20)
12 BINARY_ADD
13 STORE_FAST 0 (x)
11 16 LOAD_FAST 0 (x)
19 RETURN_VALUE
You could run each version in a loop a thousand or so times under a timer to compare performance, but the main difference is the one above: INPLACE_ADD versus BINARY_ADD. For immutable types like int, which define no in-place addition, INPLACE_ADD falls back to ordinary addition, so any timing difference should be negligible.
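One caveat worth adding (a semantic difference, not a performance one): for mutable types such as lists, += calls __iadd__ and mutates the object in place, while x = x + y builds a new object and rebinds the name. A small sketch:
a = [1, 2]
b = a
b += [3]        # INPLACE_ADD mutates the list both names share
print(a)        # [1, 2, 3]

c = [1, 2]
d = c
d = d + [3]     # BINARY_ADD builds a new list and rebinds d only
print(c)        # [1, 2]
For immutable types like the integers in the example above, the two spellings behave identically.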
As user2357112 pointed out, the original performance test script didn't work as I expected: "after the first execution of s1, your list has no 1s in it, so no further executions of s1 and no executions of s2 actually take the x==1 branch."
The modified version:
import timeit
import random
random.seed(0)
a = [ random.randrange(10) for _ in range(10000)]
change_from = 1
change_to = 6
setup = "from __main__ import a, change_from, change_to"
# s1 is replaced with a simple for loop, which is faster than the original
s1 = """\
for i, x in enumerate(a):
    if x == change_from:
        a[i] = change_to
change_from, change_to = change_to, change_from
"""
s2 = """\
a = [change_to if x==change_from else x for x in a]
change_from, change_to = change_to, change_from
"""
print(timeit.timeit(stmt=s1,number=10000, setup=setup))
print(timeit.timeit(stmt=s2, number=10000, setup=setup))
This script replaces every occurrence of 1 with 6 on one run, then every occurrence of 6 with 1 on the next run, and so on. The result is:
7.841739330212443
5.5166219217914065
Why is the list comprehension faster?
And how should one figure out this kind of question?
boardrider's comment looks interesting, thanks.
The following python version is used:
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Since I didn't get a detailed answer, I tried to figure it out myself.
If I represent the simple for loop with this function:
def func1(a):
    for i, x in enumerate(a):
        if x == 1:
            a[i] = 6
    return a
and disassemble it, I got the following:
func1:
7 0 SETUP_LOOP 48 (to 51)
3 LOAD_GLOBAL 0 (enumerate)
6 LOAD_FAST 0 (a)
9 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
12 GET_ITER
>> 13 FOR_ITER 34 (to 50)
16 UNPACK_SEQUENCE 2
19 STORE_FAST 1 (i)
22 STORE_FAST 2 (x)
8 25 LOAD_FAST 2 (x)
28 LOAD_CONST 1 (1)
31 COMPARE_OP 2 (==)
34 POP_JUMP_IF_FALSE 13
9 37 LOAD_CONST 2 (6)
40 LOAD_FAST 0 (a)
43 LOAD_FAST 1 (i)
46 STORE_SUBSCR
47 JUMP_ABSOLUTE 13
>> 50 POP_BLOCK
10 >> 51 LOAD_FAST 0 (a)
54 RETURN_VALUE
This is simple. It iterates through a, and when it finds the value 1, it replaces it with 6 via STORE_SUBSCR.
If I represent the comprehension variant with this function:
def func2(a):
    a = [6 if x == 1 else x for x in a]
    return a
and disassemble it, I got the following:
func2:
7 0 LOAD_CONST 1 (<code object <listcomp> at 0x00000000035731E0, file "<file_path>", line 7>)
3 LOAD_CONST 2 ('func2.<locals>.<listcomp>')
6 MAKE_FUNCTION 0
9 LOAD_FAST 0 (a)
12 GET_ITER
13 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
16 STORE_FAST 0 (a)
8 19 LOAD_FAST 0 (a)
22 RETURN_VALUE
This is shorter than the previous one. However, it starts by loading a code object. func2 has the following code constants:
>>> func2.__code__.co_consts
(None, <code object <listcomp> at 0x00000000035731E0, file "<file_path>", line 7>, 'func2.<locals>.<listcomp>')
and the listcomp code object looks like this:
>>> dis.dis(func2.__code__.co_consts[1].co_code)
0 BUILD_LIST 0
3 LOAD_FAST 0 (0)
>> 6 FOR_ITER 30 (to 39)
9 STORE_FAST 1 (1)
12 LOAD_FAST 1 (1)
15 LOAD_CONST 0 (0)
18 COMPARE_OP 2 (==)
21 POP_JUMP_IF_FALSE 30
24 LOAD_CONST 1 (1)
27 JUMP_FORWARD 3 (to 33)
>> 30 LOAD_FAST 1 (1)
>> 33 LIST_APPEND 2
36 JUMP_ABSOLUTE 6
>> 39 RETURN_VALUE
So essentially the two implementations perform similar steps. The main difference is that in the comprehension version the loop moves into a separate code object: the outer function's FOR_ITER is replaced by a single CALL_FUNCTION, and the iteration happens inside the <listcomp> object, which appends with LIST_APPEND instead of STORE_SUBSCR.
From this I should see why the list comprehension is faster, but I don't.
So my original question still stands:
Why is the list comprehension faster?
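For what it's worth, here is a self-contained timing sketch (my own; numbers are machine-dependent) that reproduces the gap using the two functions above:
import timeit

def func1(a):
    for i, x in enumerate(a):
        if x == 1:
            a[i] = 6
    return a

def func2(a):
    return [6 if x == 1 else x for x in a]

data = [1, 6] * 5000
print("loop:         ", timeit.timeit(lambda: func1(list(data)), number=1000))
print("comprehension:", timeit.timeit(lambda: func2(list(data)), number=1000))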
The following two code snippets (A and B) both return the intersection of two dictionaries, i.e. the keys common to both.
Both should run in O(n) and produce the same results; however, snippet B, the Pythonic one, runs noticeably faster. These snippets come from the Python Cookbook.
Code Snippet A:
def simpleway():
    result = []
    for k in to500.keys():
        if evens.has_key(k):
            result.append(k)
    return result
Code Snippet B:
def pythonicsimpleway():
    return [k for k in to500 if k in evens]
Some setup logic and the function used to time both functions:
import time

to500 = {}
for i in range(500):
    to500[i] = 1

evens = {}
for i in range(0, 1000, 2):
    evens[i] = 1

def timeo(fun, n=1000):
    def void():
        pass
    start = time.clock()
    for i in range(n):
        void()
    stend = time.clock()
    overhead = stend - start
    start = time.clock()
    for i in range(n):
        fun()
    stend = time.clock()
    thetime = stend - start
    return fun.__name__, thetime - overhead
With Python 2.7.5 using a 2.3 Ghz Ivy Bridge Quad Core Processor (OS X 10.8.4)
I get
>>> timeo(simpleway)
('simpleway', 0.08928500000000028)
>>> timeo(pythonicsimpleway)
('pythonicsimpleway', 0.04579400000000078)
They don't quite do the same thing; the first one does a lot more work:
It looks up the .has_key() and .append() methods each time in the loop, and then calls them. This requires a stack push and pop for each call.
It appends each new element to a list one by one. The Python list has to be grown dynamically to make room for these elements as you do so.
The list comprehension appends via the dedicated LIST_APPEND opcode, which calls the C-level list append directly, with no attribute lookup or Python-level method call.
The two functions do produce the same result, one is just needlessly slower.
If you want to go into the nitty gritty details, take a look at the bytecode disassembly using the dis module:
>>> dis.dis(simpleway)
2 0 BUILD_LIST 0
3 STORE_FAST 0 (result)
3 6 SETUP_LOOP 51 (to 60)
9 LOAD_GLOBAL 0 (to500)
12 LOAD_ATTR 1 (keys)
15 CALL_FUNCTION 0
18 GET_ITER
>> 19 FOR_ITER 37 (to 59)
22 STORE_FAST 1 (k)
4 25 LOAD_GLOBAL 2 (evens)
28 LOAD_ATTR 3 (has_key)
31 LOAD_FAST 1 (k)
34 CALL_FUNCTION 1
37 POP_JUMP_IF_FALSE 19
5 40 LOAD_FAST 0 (result)
43 LOAD_ATTR 4 (append)
46 LOAD_FAST 1 (k)
49 CALL_FUNCTION 1
52 POP_TOP
53 JUMP_ABSOLUTE 19
56 JUMP_ABSOLUTE 19
>> 59 POP_BLOCK
6 >> 60 LOAD_FAST 0 (result)
63 RETURN_VALUE
>>> dis.dis(pythonicsimpleway)
2 0 BUILD_LIST 0
3 LOAD_GLOBAL 0 (to500)
6 GET_ITER
>> 7 FOR_ITER 24 (to 34)
10 STORE_FAST 0 (k)
13 LOAD_FAST 0 (k)
16 LOAD_GLOBAL 1 (evens)
19 COMPARE_OP 6 (in)
22 POP_JUMP_IF_FALSE 7
25 LOAD_FAST 0 (k)
28 LIST_APPEND 2
31 JUMP_ABSOLUTE 7
>> 34 RETURN_VALUE
The number of bytecode instructions per iteration is much larger for the explicit for loop. The simpleway loop has to execute 11 instructions per iteration (if .has_key() is True), vs. 7 for the list comprehension, where the extra instructions mostly cover LOAD_ATTR and CALL_FUNCTION.
If you want to make the first version faster, replace .has_key() with an in test, loop directly over the dictionary and cache the .append() attribute in a local variable:
def simpleway_optimized():
    result = []
    append = result.append
    for k in to500:
        if k in evens:
            append(k)
    return result
Then use the timeit module to test timings properly (repeated runs, most accurate timer for your platform):
>>> timeit('f()', 'from __main__ import evens, to500, simpleway as f', number=10000)
1.1673870086669922
>>> timeit('f()', 'from __main__ import evens, to500, pythonicsimpleway as f', number=10000)
0.5441269874572754
>>> timeit('f()', 'from __main__ import evens, to500, simpleway_optimized as f', number=10000)
0.6551430225372314
Here simpleway_optimized approaches the list comprehension in speed, but the latter still wins because LIST_APPEND is a single dedicated opcode rather than a cached method call.
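If the result's order doesn't matter, one more sketch may be worth trying (my addition, not from the Cookbook): in Python 2.7, dictionary views support set operations, which pushes the entire intersection into C:
def setway():
    # keys views behave like sets; & computes the intersection in C
    return list(to500.viewkeys() & evens.viewkeys())
This gives up the iteration order of to500, so it is only a fair comparison when the result is treated as a set.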