I have a large dict src (up to 1M items) and I would like to take N (typical values would be N=10K-20K) items, store them in a new dict dst and leave only the remaining items in src. It doesn't matter which N items are taken. I'm looking for the fastest way to do it on Python 3.6 or 3.7.
Fastest approach I've found so far:
src = {i: i ** 3 for i in range(1000000)}
# Taking items 1 by 1 (~0.0059s)
dst = {}
while len(dst) < 20000:
    item = src.popitem()
    dst[item[0]] = item[1]
Is there anything better? Even a marginal gain would be good.
A simple generator expression inside the dict constructor will do:
dict(src.popitem() for _ in range(20000))
Here are the timing tests:
setup = """
src = {i: i ** 3 for i in range(1000000)}

def method_1(d):
    dst = {}
    while len(dst) < 20000:
        item = d.popitem()
        dst[item[0]] = item[1]
    return dst

def method_2(d):
    return dict(d.popitem() for _ in range(20000))
"""
import timeit
print("Method 1: ", timeit.timeit('method_1(src)', setup=setup, number=1))
print("Method 2: ", timeit.timeit('method_2(src)', setup=setup, number=1))
Results:
Method 1: 0.007701821999944514
Method 2: 0.004668198998842854
This is a bit faster still:
from itertools import islice

def method_4(d):
    result = dict(islice(d.items(), 20000))
    for k in result:
        del d[k]
    return result
Compared to other versions, using Netwave's testcase:
Method 1: 0.004459443036466837 # original
Method 2: 0.0034434819826856256 # Netwave
Method 3: 0.002602717955596745 # chepner
Method 4: 0.001974945073015988 # this answer
The extra speedup seems to come from avoiding transitions between C and Python functions. From the disassembly we can see that the dict instantiation happens on the C side, with only 3 function calls from Python. The deletion loop uses the DELETE_SUBSCR opcode instead of needing a function call:
>>> dis.dis(method_4)
2 0 LOAD_GLOBAL 0 (dict)
2 LOAD_GLOBAL 1 (islice)
4 LOAD_FAST 0 (d)
6 LOAD_ATTR 2 (items)
8 CALL_FUNCTION 0
10 LOAD_CONST 1 (20000)
12 CALL_FUNCTION 2
14 CALL_FUNCTION 1
16 STORE_FAST 1 (result)
3 18 SETUP_LOOP 18 (to 38)
20 LOAD_FAST 1 (result)
22 GET_ITER
>> 24 FOR_ITER 10 (to 36)
26 STORE_FAST 2 (k)
28 LOAD_FAST 0 (d)
30 LOAD_FAST 2 (k)
32 DELETE_SUBSCR
34 JUMP_ABSOLUTE 24
>> 36 POP_BLOCK
4 >> 38 LOAD_FAST 1 (result)
40 RETURN_VALUE
Compared with the iterator in method_2:
>>> dis.dis(d.popitem() for _ in range(20000))
1 0 LOAD_FAST 0 (.0)
>> 2 FOR_ITER 14 (to 18)
4 STORE_FAST 1 (_)
6 LOAD_GLOBAL 0 (d)
8 LOAD_ATTR 1 (popitem)
10 CALL_FUNCTION 0
12 YIELD_VALUE
14 POP_TOP
16 JUMP_ABSOLUTE 2
>> 18 LOAD_CONST 0 (None)
20 RETURN_VALUE
which needs a Python to C function call for each item.
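A quick sanity check that this approach really moves the items (a self-contained sketch; the sizes and the `n` parameter are chosen here just for illustration):

```python
from itertools import islice

def method_4(d, n=200):
    # same approach as above, with the count made a parameter for the check
    result = dict(islice(d.items(), n))
    for k in result:
        del d[k]
    return result

src = {i: i ** 3 for i in range(1000)}
dst = method_4(src)
assert len(dst) == 200 and len(src) == 800
assert all(k not in src for k in dst)
```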
I found this approach slightly faster (about 10%): a dict comprehension that consumes an inner generator expression over range, unpacking each popped key/value pair
dst = {key:value for key,value in (src.popitem() for _ in range(20000))}
on my machine:
your code: 0.00899505615234375
my code: 0.007996797561645508
so about 12% faster; not bad, but not as good as Netwave's simpler answer, which avoids the unpacking.
This approach can be useful if you want to transform the keys or values in the process.
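For instance, the unpacking variant makes it easy to apply a transformation as items move over; a small sketch (sizes illustrative), cubing each value on the way:

```python
# move 3 items out of src, transforming values in the process
src = {i: i for i in range(10)}
dst = {k: v ** 3 for k, v in (src.popitem() for _ in range(3))}
assert len(dst) == 3 and len(src) == 7
assert all(v == k ** 3 for k, v in dst.items())
```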
Related
In this trivial example, I want to factor out the i < 5 condition of a list comprehension into its own function. I also want to eat my cake and have it too, and avoid the overhead of the CALL_FUNCTION bytecode/creating a new frame in the Python virtual machine.
Is there any way to factor out the conditions inside of a list comprehension into a new function but somehow get a disassembled result that avoids the large overhead of CALL_FUNCTION?
import dis
import sys
import timeit
def my_filter(n):
    return n < 5

def a():
    # list comprehension with function call
    return [i for i in range(10) if my_filter(i)]

def b():
    # list comprehension without function call
    return [i for i in range(10) if i < 5]

assert a() == b()
>>> sys.version_info[:]
(3, 6, 5, 'final', 0)
>>> timeit.timeit(a)
1.2616060493517098
>>> timeit.timeit(b)
0.685117881097812
>>> dis.dis(a)
3 0 LOAD_CONST 1 (<code object <listcomp> at 0x0000020F4890B660, file "<stdin>", line 3>)
# ...
>>> dis.dis(b)
3 0 LOAD_CONST 1 (<code object <listcomp> at 0x0000020F48A42270, file "<stdin>", line 3>)
# ...
# list comprehension with function call
# big overhead with that CALL_FUNCTION at address 12
>>> dis.dis(a.__code__.co_consts[1])
3 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_GLOBAL 0 (my_filter)
10 LOAD_FAST 1 (i)
12 CALL_FUNCTION 1
14 POP_JUMP_IF_FALSE 4
16 LOAD_FAST 1 (i)
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
# list comprehension without function call
>>> dis.dis(b.__code__.co_consts[1])
3 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_FAST 1 (i)
10 LOAD_CONST 0 (5)
12 COMPARE_OP 0 (<)
14 POP_JUMP_IF_FALSE 4
16 LOAD_FAST 1 (i)
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
I'm willing to take a hacky solution that I would never use in production, like somehow replacing the bytecode at run time.
In other words, is it possible to replace a's addresses 8, 10, and 12 with b's 8, 10, and 12 at runtime?
Consolidating all of the excellent answers in the comments into one.
As georg says, this sounds like you are looking for a way to inline a function or an expression, and there is no such thing in CPython, although attempts have been made: https://bugs.python.org/issue10399
Therefore, along the lines of "metaprogramming", you can build the lambdas inline and eval them:
from typing import Callable
import dis
def b():
    # list comprehension without function call
    return [i for i in range(10) if i < 5]

def gen_list_comprehension(expr: str) -> Callable:
    return eval(f"lambda: [i for i in range(10) if {expr}]")

a = gen_list_comprehension("i < 5")
dis.dis(a.__code__.co_consts[1])
print("=" * 10)
dis.dis(b.__code__.co_consts[1])
which when run under 3.7.6 gives:
6 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_FAST 1 (i)
10 LOAD_CONST 0 (5)
12 COMPARE_OP 0 (<)
14 POP_JUMP_IF_FALSE 4
16 LOAD_FAST 1 (i)
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
==========
1 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_FAST 1 (i)
10 LOAD_CONST 0 (5)
12 COMPARE_OP 0 (<)
14 POP_JUMP_IF_FALSE 4
16 LOAD_FAST 1 (i)
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
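Beyond the identical bytecode, a quick self-contained check that the generated lambda behaves like the hand-written comprehension (the condition strings here are my own examples):

```python
from typing import Callable

def gen_list_comprehension(expr: str) -> Callable:
    # assumes expr is a trusted condition over the loop variable i
    return eval(f"lambda: [i for i in range(10) if {expr}]")

f = gen_list_comprehension("i < 5")
assert f() == [0, 1, 2, 3, 4]
g = gen_list_comprehension("i % 2 == 0")
assert g() == [0, 2, 4, 6, 8]
```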
From a security standpoint, "eval" is dangerous, although here it is less so because what you can do inside a lambda is limited. What can be done in an IfExp expression is even more limited, but it is still dangerous, like calling a function that does evil things.
However, if you want the same effect more securely, you can modify ASTs instead of working with strings. I find that a lot more cumbersome, though.
A hybrid approach would be to call ast.parse() and check the result. For example:
import ast

def is_cond_str(s: str) -> bool:
    try:
        mod_ast = ast.parse(s)
        expr_ast = mod_ast.body[0]
        if not isinstance(expr_ast, ast.Expr):
            return False
        compare_ast = expr_ast.value
        if not isinstance(compare_ast, ast.Compare):
            return False
        return True
    except (SyntaxError, IndexError):
        return False
This is a little more secure, but there still may be evil functions in the condition so you could keep going. Again, I find this a little tedious.
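For illustration, here is a compact, self-contained variant of the same validator with a few probe strings (the probes are my own examples):

```python
import ast

def is_cond_str(s: str) -> bool:
    # accept only a bare comparison expression, reject everything else
    try:
        node = ast.parse(s).body[0]
    except (SyntaxError, IndexError):
        return False
    return isinstance(node, ast.Expr) and isinstance(node.value, ast.Compare)

assert is_cond_str("i < 5")
assert is_cond_str("i % 2 == 0")
assert not is_cond_str("__import__('os').system('ls')")  # a Call, not a Compare
assert not is_cond_str("x = 1")                          # a statement, not an Expr
```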
Coming from the other direction of starting off with bytecode, there is my cross-version assembler; see https://pypi.org/project/xasm/
As user2357112 pointed out, the original performance test script didn't work as I expected: "after the first execution of s1, your list has no 1s in it, so no further executions of s1 and no executions of s2 actually take the x==1 branch."
The modified version:
import timeit
import random
random.seed(0)
a = [ random.randrange(10) for _ in range(10000)]
change_from = 1
change_to = 6
setup = "from __main__ import a, change_from, change_to"
# s1 is replaced by a simple for loop, which is faster than the original
s1 = """\
for i, x in enumerate(a):
    if x == change_from:
        a[i] = change_to
change_from, change_to = change_to, change_from
"""
s2 = """\
a = [change_to if x==change_from else x for x in a]
change_from, change_to = change_to, change_from
"""
print(timeit.timeit(stmt=s1,number=10000, setup=setup))
print(timeit.timeit(stmt=s2, number=10000, setup=setup))
This script replaces every occurrence of 1 with 6, and on the next run every occurrence of 6 with 1, and so on. The result is:
7.841739330212443
5.5166219217914065
Why is the list comprehension faster?
And how should one figure out this kind of question?
boardrider's comment looks interesting, thanks.
The following python version is used:
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Since I didn't get a detailed answer, I tried to figure it out myself.
If I represent the simple for loop with this function:
def func1(a):
    for i, x in enumerate(a):
        if x == 1:
            a[i] = 6
    return a
and disassemble it, I got the following:
func1:
7 0 SETUP_LOOP 48 (to 51)
3 LOAD_GLOBAL 0 (enumerate)
6 LOAD_FAST 0 (a)
9 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
12 GET_ITER
>> 13 FOR_ITER 34 (to 50)
16 UNPACK_SEQUENCE 2
19 STORE_FAST 1 (i)
22 STORE_FAST 2 (x)
8 25 LOAD_FAST 2 (x)
28 LOAD_CONST 1 (1)
31 COMPARE_OP 2 (==)
34 POP_JUMP_IF_FALSE 13
9 37 LOAD_CONST 2 (6)
40 LOAD_FAST 0 (a)
43 LOAD_FAST 1 (i)
46 STORE_SUBSCR
47 JUMP_ABSOLUTE 13
>> 50 POP_BLOCK
10 >> 51 LOAD_FAST 0 (a)
54 RETURN_VALUE
This is simple. It iterates through a, and if it finds the value 1 it replaces it with 6 via STORE_SUBSCR.
If I represent the comprehension variant with this function:
def func2(a):
    a = [6 if x == 1 else x for x in a]
    return a
and disassemble it, I got the following:
func2:
7 0 LOAD_CONST 1 (<code object <listcomp> at 0x00000000035731E0, file "<file_path>", line 7>)
3 LOAD_CONST 2 ('func2.<locals>.<listcomp>')
6 MAKE_FUNCTION 0
9 LOAD_FAST 0 (a)
12 GET_ITER
13 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
16 STORE_FAST 0 (a)
8 19 LOAD_FAST 0 (a)
22 RETURN_VALUE
This is shorter than the previous one. However, it starts by loading a code object. func2 has the following code constants:
>>> func2.__code__.co_consts
(None, <code object <listcomp> at 0x00000000035731E0, file "<file_path>", line 7>, 'func2.<locals>.<listcomp>')
and the listcomp code object looks like this:
>>> dis.dis(func2.__code__.co_consts[1].co_code)
0 BUILD_LIST 0
3 LOAD_FAST 0 (0)
>> 6 FOR_ITER 30 (to 39)
9 STORE_FAST 1 (1)
12 LOAD_FAST 1 (1)
15 LOAD_CONST 0 (0)
18 COMPARE_OP 2 (==)
21 POP_JUMP_IF_FALSE 30
24 LOAD_CONST 1 (1)
27 JUMP_FORWARD 3 (to 33)
>> 30 LOAD_FAST 1 (1)
>> 33 LIST_APPEND 2
36 JUMP_ABSOLUTE 6
>> 39 RETURN_VALUE
So essentially the two implementations perform similar steps. The main difference is that the comprehension version moves the loop into a separate code object, which the outer frame invokes with a single CALL_FUNCTION.
From this I should see why the list comprehension is faster, but I don't.
So my original question is still on:
Why is the list comprehension faster?
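For what it's worth, one commonly cited factor (not a full answer) is that the comprehension's LIST_APPEND opcode skips the attribute lookup and method call that r.append(x) needs in a plain loop. A quick probe, with snippet names of my own choosing (exact numbers vary by machine):

```python
import timeit

loop = """\
r = []
for x in range(1000):
    r.append(x)
"""
comp = "[x for x in range(1000)]"

t_loop = timeit.timeit(loop, number=2000)
t_comp = timeit.timeit(comp, number=2000)
print(t_loop, t_comp)  # the comprehension usually comes out ahead
```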
To my understanding, both of these approaches work for operating on every item in a generator:
let i be our operator target
let my_iter be our generator
let callable do_something_with return None
While loop + StopIteration
try:
    while True:
        i = next(my_iter)
        do_something_with(i)
except StopIteration:
    pass
For loop / list comprehension
for i in my_iter:
    do_something_with(i)

[do_something_with(i) for i in my_iter]
Minor edit: print(i) replaced with do_something_with(i), as suggested by @kojiro, to disambiguate the use case from interpreter mechanics.
As far as I am aware, these are both applicable ways to iterate over a generator. Is there any reason to prefer one over the other?
Right now the for loop is looking superior to me. Due to: less lines/clutter and readability in general, plus single indent.
I really only see the while approach being advantageous if you want to handily break the loop on particular exceptions.
The third option is definitely NOT the same as the first two. The third example creates a list, with one entry for each return value of print(i), which happens to be None, so not a very interesting list.
The first two are semantically similar. There is a minor, technical difference: the while loop, as presented, does not work if my_iter is not, in fact, an iterator (i.e., something with a __next__() method); for instance, if it's a list. The for loop works for all iterables (anything with an __iter__() method) in addition to iterators.
The correct version is thus:
my_iter = iter(my_iterable)
try:
    while True:
        i = next(my_iter)
        print(i)
except StopIteration:
    pass
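To make the iterator-vs-iterable distinction concrete, a small self-contained sketch (the helper name is my own):

```python
def consume_with_next(it):
    # assumes `it` is already an iterator, as in the while-loop version above
    try:
        while True:
            next(it)
    except StopIteration:
        pass

consume_with_next(iter([1, 2, 3]))   # fine: iter() returns a list_iterator

raised = False
try:
    consume_with_next([1, 2, 3])     # a list is iterable, but not an iterator
except TypeError:
    raised = True
assert raised
```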
Now, aside from readability reasons, there is in fact a technical reason you should prefer the for loop: there is a penalty you pay (in CPython, anyhow) for the number of bytecodes executed in tight inner loops. Let's compare:
In [1]: def forloop(my_iter):
   ...:     for i in my_iter:
   ...:         print(i)
   ...:
In [57]: dis.dis(forloop)
2 0 SETUP_LOOP 24 (to 27)
3 LOAD_FAST 0 (my_iter)
6 GET_ITER
>> 7 FOR_ITER 16 (to 26)
10 STORE_FAST 1 (i)
3 13 LOAD_GLOBAL 0 (print)
16 LOAD_FAST 1 (i)
19 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
22 POP_TOP
23 JUMP_ABSOLUTE 7
>> 26 POP_BLOCK
>> 27 LOAD_CONST 0 (None)
30 RETURN_VALUE
7 bytecodes called in inner loop vs:
In [55]: def whileloop(my_iterable):
   ....:     my_iter = iter(my_iterable)
   ....:     try:
   ....:         while True:
   ....:             i = next(my_iter)
   ....:             print(i)
   ....:     except StopIteration:
   ....:         pass
   ....:
In [56]: dis.dis(whileloop)
2 0 LOAD_GLOBAL 0 (iter)
3 LOAD_FAST 0 (my_iterable)
6 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
9 STORE_FAST 1 (my_iter)
3 12 SETUP_EXCEPT 32 (to 47)
4 15 SETUP_LOOP 25 (to 43)
5 >> 18 LOAD_GLOBAL 1 (next)
21 LOAD_FAST 1 (my_iter)
24 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
27 STORE_FAST 2 (i)
6 30 LOAD_GLOBAL 2 (print)
33 LOAD_FAST 2 (i)
36 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
39 POP_TOP
40 JUMP_ABSOLUTE 18
>> 43 POP_BLOCK
44 JUMP_FORWARD 18 (to 65)
7 >> 47 DUP_TOP
48 LOAD_GLOBAL 3 (StopIteration)
51 COMPARE_OP 10 (exception match)
54 POP_JUMP_IF_FALSE 64
57 POP_TOP
58 POP_TOP
59 POP_TOP
8 60 POP_EXCEPT
61 JUMP_FORWARD 1 (to 65)
>> 64 END_FINALLY
>> 65 LOAD_CONST 0 (None)
68 RETURN_VALUE
9 bytecodes in the inner loop.
We can actually do even better, though.
In [58]: from collections import deque
In [59]: def deqloop(my_iter):
   ....:     deque(map(print, my_iter), 0)
   ....:
In [61]: dis.dis(deqloop)
2 0 LOAD_GLOBAL 0 (deque)
3 LOAD_GLOBAL 1 (map)
6 LOAD_GLOBAL 2 (print)
9 LOAD_FAST 0 (my_iter)
12 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
15 LOAD_CONST 1 (0)
18 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
21 POP_TOP
22 LOAD_CONST 0 (None)
25 RETURN_VALUE
Everything happens in C: collections.deque, map, and print are all builtins (in CPython), so in this case there are no bytecodes executed for looping. This is only a useful optimization when the iteration step is a C function (as is the case for print); otherwise, the overhead of a Python function call is larger than the JUMP_ABSOLUTE overhead.
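The same trick can be packaged as a helper; this mirrors the consume recipe from the itertools documentation (deque with maxlen=0 discards everything it is fed):

```python
from collections import deque

def consume(iterator):
    # exhaust an iterator purely for its side effects, storing nothing
    deque(iterator, maxlen=0)

seen = []
consume(map(seen.append, range(5)))
assert seen == [0, 1, 2, 3, 4]
```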
The for loop is the most pythonic. Note that you can break out of for loops as well as while loops.
Don't use the list comprehension unless you need the resulting list; otherwise you are needlessly storing all the elements. Your example list comprehension will only work with the print function in Python 3; it won't work with the print statement in Python 2.
I would agree with you that the for loop is superior. As you mentioned, it is less cluttered and a lot easier to read. Programmers like to keep things as simple as possible, and the for loop does that. It is also better for novice Python programmers who might not have learned try/except yet. Also, as Alasdair mentioned, you can break out of for loops. And the while loop raises an error if you are using a list, unless you call iter() on my_iter first.
The following 2 code snippets (A & B) both return the intersection of 2 dictionaries.
Both of the following code snippets should run in O(n) and produce the same results. However, code snippet B, which is more Pythonic, seems to run faster. These code snippets come from the Python Cookbook.
Code Snippet A:
def simpleway():
    result = []
    for k in to500.keys():
        if evens.has_key(k):
            result.append(k)
    return result
Code Snippet B:
def pythonicsimpleway():
    return [k for k in to500 if k in evens]
Some setup logic and the function used to time both functions:
import time

to500 = {}
for i in range(500): to500[i] = 1

evens = {}
for i in range(0, 1000, 2): evens[i] = 1

def timeo(fun, n=1000):
    def void(): pass
    start = time.clock()
    for i in range(n): void()
    stend = time.clock()
    overhead = stend - start
    start = time.clock()
    for i in range(n): fun()
    stend = time.clock()
    thetime = stend - start
    return fun.__name__, thetime - overhead
With Python 2.7.5 using a 2.3 Ghz Ivy Bridge Quad Core Processor (OS X 10.8.4)
I get
>>> timeo(simpleway)
('simpleway', 0.08928500000000028)
>>> timeo(pythonicsimpleway)
('pythonicsimpleway', 0.04579400000000078)
They don't quite do the same thing; the first one does a lot more work:
It looks up the .has_key() and .append() methods each time in the loop, and then calls them. This requires a stack push and pop for each call.
It appends each new element to a list one by one. The Python list has to be grown dynamically to make room for these elements as you do so.
The list comprehension collects all generated elements in a C array before creating the python list object in one operation.
The two functions do produce the same result, one is just needlessly slower.
If you want to go into the nitty gritty details, take a look at the bytecode disassembly using the dis module:
>>> dis.dis(simpleway)
2 0 BUILD_LIST 0
3 STORE_FAST 0 (result)
3 6 SETUP_LOOP 51 (to 60)
9 LOAD_GLOBAL 0 (to500)
12 LOAD_ATTR 1 (keys)
15 CALL_FUNCTION 0
18 GET_ITER
>> 19 FOR_ITER 37 (to 59)
22 STORE_FAST 1 (k)
4 25 LOAD_GLOBAL 2 (evens)
28 LOAD_ATTR 3 (has_key)
31 LOAD_FAST 1 (k)
34 CALL_FUNCTION 1
37 POP_JUMP_IF_FALSE 19
5 40 LOAD_FAST 0 (result)
43 LOAD_ATTR 4 (append)
46 LOAD_FAST 1 (k)
49 CALL_FUNCTION 1
52 POP_TOP
53 JUMP_ABSOLUTE 19
56 JUMP_ABSOLUTE 19
>> 59 POP_BLOCK
6 >> 60 LOAD_FAST 0 (result)
63 RETURN_VALUE
>>> dis.dis(pythonicsimpleway)
2 0 BUILD_LIST 0
3 LOAD_GLOBAL 0 (to500)
6 GET_ITER
>> 7 FOR_ITER 24 (to 34)
10 STORE_FAST 0 (k)
13 LOAD_FAST 0 (k)
16 LOAD_GLOBAL 1 (evens)
19 COMPARE_OP 6 (in)
22 POP_JUMP_IF_FALSE 7
25 LOAD_FAST 0 (k)
28 LIST_APPEND 2
31 JUMP_ABSOLUTE 7
>> 34 RETURN_VALUE
The number of bytecode instructions per iteration is much larger for the explicit for loop. The simpleway loop has to execute 11 instructions per iteration (if .has_key() is True), vs. 7 for the list comprehension, where the extra instructions mostly cover LOAD_ATTR and CALL_FUNCTION.
If you want to make the first version faster, replace .has_key() with an in test, loop directly over the dictionary and cache the .append() attribute in a local variable:
def simpleway_optimized():
    result = []
    append = result.append
    for k in to500:
        if k in evens:
            append(k)
    return result
Then use the timeit module to test timings properly (repeated runs, most accurate timer for your platform):
>>> timeit('f()', 'from __main__ import evens, to500, simpleway as f', number=10000)
1.1673870086669922
>>> timeit('f()', 'from __main__ import evens, to500, pythonicsimpleway as f', number=10000)
0.5441269874572754
>>> timeit('f()', 'from __main__ import evens, to500, simpleway_optimized as f', number=10000)
0.6551430225372314
Here simpleway_optimized approaches the list comprehension in speed, but the latter can still win by building the Python list object in one step.
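For reference, a rough Python 3 port of the comparison (has_key is gone in Python 3, so the explicit loop uses an `in` test as well; the setup values match the snippets above):

```python
to500 = dict.fromkeys(range(500), 1)
evens = dict.fromkeys(range(0, 1000, 2), 1)

def simpleway():
    result = []
    for k in to500.keys():
        if k in evens:          # Python 3 replacement for has_key()
            result.append(k)
    return result

def pythonicsimpleway():
    return [k for k in to500 if k in evens]

assert sorted(simpleway()) == sorted(pythonicsimpleway())
assert len(pythonicsimpleway()) == 250  # even keys below 500
```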
Is there any faster way to swap two list elements in Python than
L[a], L[b] = L[b], L[a]
or would I have to resort to Cython or Weave or the like?
Looks like the Python compiler optimizes out the temporary tuple with this construct:
code:
import dis

def swap1():
    a = 5
    b = 4
    a, b = b, a

def swap2():
    a = 5
    b = 4
    c = a
    a = b
    b = c

print 'swap1():'
dis.dis(swap1)
print 'swap2():'
dis.dis(swap2)
output:
swap1():
6 0 LOAD_CONST 1 (5)
3 STORE_FAST 0 (a)
7 6 LOAD_CONST 2 (4)
9 STORE_FAST 1 (b)
8 12 LOAD_FAST 1 (b)
15 LOAD_FAST 0 (a)
18 ROT_TWO
19 STORE_FAST 0 (a)
22 STORE_FAST 1 (b)
25 LOAD_CONST 0 (None)
28 RETURN_VALUE
swap2():
11 0 LOAD_CONST 1 (5)
3 STORE_FAST 0 (a)
12 6 LOAD_CONST 2 (4)
9 STORE_FAST 1 (b)
13 12 LOAD_FAST 0 (a)
15 STORE_FAST 2 (c)
14 18 LOAD_FAST 1 (b)
21 STORE_FAST 0 (a)
15 24 LOAD_FAST 2 (c)
27 STORE_FAST 1 (b)
30 LOAD_CONST 0 (None)
33 RETURN_VALUE
Two loads, a ROT_TWO, and two saves, versus three loads and three saves. You are unlikely to find a faster mechanism.
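A quick timing probe of the two forms (a sketch; both are only a handful of opcodes, so the gap is tiny and machine-dependent):

```python
import timeit

t_tuple = timeit.timeit('a, b = b, a', setup='a = 1; b = 2', number=1000000)
t_temp = timeit.timeit('c = a; a = b; b = c', setup='a = 1; b = 2', number=1000000)
print(t_tuple, t_temp)  # exact numbers vary by machine and Python build
```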
If you could post a representative code sample, we could do a better job of benchmarking your options. FWIW, for the following dumb benchmark, I get about a 3x speedup with Shed Skin and a 10x speedup with PyPy.
from time import time

def swap(L):
    for i in xrange(1000000):
        for b, a in enumerate(L):
            L[a], L[b] = L[b], L[a]

def main():
    start = time()
    L = list(reversed(range(100)))
    swap(L[:])
    print time() - start
    return L

if __name__ == "__main__":
    print len(main())

# for shedskin:
# shedskin -b -r -e listswap.py && make
# python -c "import listswap; print len(listswap.main())"
I tried this method as the easiest way to swap two numbers in a list:
lst = [23, 65, 19, 90]
pos1 = lst.pop(0)
pos2 = lst.pop(1)
lst.append(pos1)
lst.append(pos2)
print(lst)
I found this method to be the fastest way to swap two numbers:
mylist = [11, 23, 5, 8, 13, 17]
first_el = mylist.pop(0)
last_el = mylist.pop(-1)
mylist.insert(0, last_el)
mylist.append(first_el)