Why is this slicing code faster than more procedural code?

Why is this slicing code faster than more procedural code? - python

I have a Python function that takes a list and returns a generator yielding 2-tuples of each adjacent pair, e.g.
>>> list(pairs([1, 2, 3, 4]))
[(1, 2), (2, 3), (3, 4)]
I've considered an implementation using 2 slices:
def pairs(xs):
for p in zip(xs[:-1], xs[1:]):
yield p
and one written in a more procedural style:
def pairs(xs):
last = object()
dummy = last
for x in xs:
if last is not dummy:
yield last,x
last = x
Testing using range(2 ** 15) as input yields the following times (you can find my testing code and output here):
2 slices: 100 loops, best of 3: 4.23 msec per loop
0 slices: 100 loops, best of 3: 5.68 msec per loop
Part of the performance hit for the sliceless implementation is the comparison in the loop (if last is not dummy). Removing that (making the output incorrect) improves its performance, but it's still slower than the zip-a-pair-of-slices implementation:
2 slices: 100 loops, best of 3: 4.48 msec per loop
0 slices: 100 loops, best of 3: 5.2 msec per loop
So, I'm stumped. Why is zipping together 2 slices, effectively iterating over the list twice in parallel, faster than iterating once, updating last and x as you go?
EDIT
Dan Lenski proposed a third implementation:
def pairs(xs):
for ii in range(1,len(xs)):
yield xs[ii-1], xs[ii]
Here's its comparison to the other implementations:
2 slices: 100 loops, best of 3: 4.37 msec per loop
0 slices: 100 loops, best of 3: 5.61 msec per loop
Lenski's: 100 loops, best of 3: 6.43 msec per loop
It's even slower! Which is baffling to me.
EDIT 2:
ssm suggested using itertools.izip instead of zip, and it's even faster than zip:
2 slices, izip: 100 loops, best of 3: 3.68 msec per loop
So, izip is the winner so far! But still for difficult-to inspect reasons.

Lots of interesting discussion elsewhere in this thread. Basically, we started out comparing two versions of this function, which I'm going to describe with the following dumb names:
The "zip-py" version:
def pairs(xs):
for p in zip(xs[:-1], xs[1:]):
yield p
The "loopy" version:
def pairs(xs):
last = object()
dummy = last
for x in xs:
if last is not dummy:
yield last,x
last = x
So why does the loopy version turn out to be slower? Basically, I think it comes down to a couple things:
The loopy version explicitly does extra work: it compares two objects' identities (if last is not dummy: ...) on every pair-generating iteration of the inner loop.
#mambocab's edit shows that not doing this comparison does make the loopy version
slightly faster, but doesn't fully close the gap.
The zippy version does more stuff in compiled C code that the loopy version does in Python code:
Combining two objects into a tuple. The loopy version does yield last,x, while in the zippy version the tuple p comes straight from zip, so it just does yield p.
Binding variable names to object: the loopy version does this twice in every loop, assigning x in the for loop and last=x. The zippy version does this just once, in the for loop.
Interestingly, there is one way in which the zippy version is actually doing more work: it uses two listiterators, iter(xs[:-1]) and iter(xs[1:]), which get passed to zip. The loopy version only uses one listiterator (for x in xs).
Creating a listiterator object (the output of iter([])) is likely a very highly optimized operation since Python programmers use it so frequently.
Iterating over list slices, xs[:-1] and xs[1:], is a very lightweight operation which adds almost no overhead compared to iterating over the whole list. Essentially, it just means moving the starting or ending point of the iterator, but not changing what happens on each iteration.

This i the result for the iZip which is actually closer to your implementation. Looks like what you would expect. The zip version is creating the entire list in memory within the function so it is the fastest. The loop version just los through the list, so it is a little slower. The izipis the closest in resemblance to the code, but I am guessing there is some memory-management backend processes which increase the time of execution.
In [11]: %timeit pairsLoop([1,2,3,4,5])
1000000 loops, best of 3: 651 ns per loop
In [12]: %timeit pairsZip([1,2,3,4,5])
1000000 loops, best of 3: 637 ns per loop
In [13]: %timeit pairsIzip([1,2,3,4,5])
1000000 loops, best of 3: 655 ns per loop
The version of code is shown below as requested:
from itertools import izip
def pairsIzip(xs):
for p in izip(xs[:-1], xs[1:]):
yield p
def pairsZip(xs):
for p in zip(xs[:-1], xs[1:]):
yield p
def pairsLoop(xs):
last = object()
dummy = last
for x in xs:
if last is not dummy:
yield last,x
last = x

I suspect the main reason that the second version is slower is because it does a comparison operation for every single pair that it yields:
# pair-generating loop
for x in xs:
if last is not dummy:
yield last,x
last = x
The first version does not do anything but spit out values. With the loop variables renamed, it's equivalent to this:
# pair-generating loop
for last,x in zip(xs[:-1], xs[1:]):
yield last,x
It's not especially pretty or Pythonic, but you could write a procedural version without a comparison in the inner loop. How fast does this one run?
def pairs(xs):
for ii in range(1,len(xs)):
yield xs[ii-1], xs[ii]

Related

How to deoptimze memory access in python?

This may not useful. It's just a challenge I have set up for myself.
Let's say you have a big array. What can you do so that the program does not benefit from caching, cache line prefetching or the fact that the next memory access can only be determined after the first access finishes.
So we have our array:
array = [0] * 10000000
What would be the best way to deoptimize the memory access if you had to access all elements in a loop? The idea is to increase the access time of each memory location as much as possible
I'm not looking for a solution which proposes to do "something else" (which takes time) before doing the next access. The idea is really to increase the access time as much as possible. I guess we have to traverse the array in a certain way (perhaps randomly? I'm still looking into it)

I did not expect any difference, but in fact accessing the digits in random order is significantly slower than accessing them in order or in reverse order (which is both about the same).
>>> N = 10**5
>>> arr = [random.randint(0, 1000) for _ in range(N)]
>>> srt = list(range(N))
>>> rvd = srt[::-1]
>>> rnd = random.sample(srt, N)
>>> %timeit sum(arr[i] for i in srt)
10 loops, best of 5: 24.9 ms per loop
>>> %timeit sum(arr[i] for i in rvd)
10 loops, best of 5: 25.7 ms per loop
>>> %timeit sum(arr[i] for i in rnd)
10 loops, best of 5: 59.2 ms per loop
And it really seems to be the randomness. Just accessing indices out of order, but with a pattern, e.g. as [0, N-1, 2, N-3, ...] or [0, N/2, 1, N/2+1, ...], is just as fast as accessing them in order:
>>> alt1 = [i if i % 2 == 0 else N - i for i in range(N)]
>>> alt2 = [i for p in zip(srt[:N//2], srt[N//2:]) for i in p]
>>> %timeit sum(arr[i] for i in alt1)
10 loops, best of 5: 24.5 ms per loop
>>> %timeit sum(arr[i] for i in alt2)
10 loops, best of 5: 24.1 ms per loop
Interestingly, just iterating the shuffled indices (and calculating their sum as with the array above) is also slower than doing the same with the sorted indices, but not as much. Of the ~35ms difference between srt and rnd, ~10ms seem to come from iterating the randomized indices, and ~25ms for actually accessing the indices in random order.
>>> %timeit sum(i for i in srt)
100 loops, best of 5: 19.7 ms per loop
>>> %timeit sum(i for i in rnd)
10 loops, best of 5: 30.5 ms per loop
>>> %timeit sum(arr[i] for i in srt)
10 loops, best of 5: 24.5 ms per loop
>>> %timeit sum(arr[i] for i in rnd)
10 loops, best of 5: 56 ms per loop
(IPython 5.8.0 / Python 3.7.3 on a rather old laptop running Linux)

Python interns small integers. Use integers > 255. * just adds references to the number already in the list when expanded, use unique values instead. Caches hate randomness, so go random.
import random
array = list(range(256, 10000256))
while array:
array.pop(random.randint(0, len(array)-1))
A note on interning small integers. When you create an integer in your program, say 12345, python creates an object on the heap of 55 or greater bytes. This is expensive. So, numbers between (I think) -4 and 255 are built into python to optimize common small number operations. By avoiding these numbers you force python to allocate integers on the heap, spreading out the amount of memory you will touch and reducing cache efficiency.
If you use a single number in the array [1234] * 100000, that single number is referenced many times. If you use unique numbers, they are all individually allocated on the heap, increasing memory footprint. And when they are removed from the list, python has to touch the object to reduce its reference count which pulls its memory location into cache, invalidating something else.

Slow recursion in python

I know this subject is well discussed but I've come around a case I don't really understand how the recursive method is "slower" than a method using "reduce,lambda,xrange".
def factorial2(x, rest=1):
if x <= 1:
return rest
else:
return factorial2(x-1, rest*x)
def factorial3(x):
if x <= 1:
return 1
return reduce(lambda a, b: a*b, xrange(1, x+1))
I know python doesn't optimize tail recursion so the question isn't about it. To my understanding, a generator should still generate n amount of numbers using the +1 operator. So technically, fact(n) should add a number n times just like the recursive one. The lambda in the reduce will be called n times just as the recursive method... So since we don't have tail call optimization in both case, stacks will be created/destroyed and returned n times. And a if in the generator should check when to raise a StopIteration exception.
This makes me wonder why does the recursive method still slowlier than the other one since the recursive one use simple arithmetic and doesn't use generators.
In a test, I replaced rest*x by x in the recursive method and the time spent got reduced on par with the method using reduce.
Here are my timings for fact(400), 1000 times
factorial3 : 1.22370505333
factorial2 : 1.79896998405
Edit:
Making the method start from 1 to n doesn't help either instead of n to 1. So not an overhead with the -1.
Also, can we make the recursive method faster. I tried multiple things like global variables that I can change... Using a mutable context by placing variables in cells that I can modify like an array and keep the recursive method without parameters. Sending the method used for recursion as a parameter so we don't have to "dereference" it in our scope...?! But nothings makes it faster.
I'll point out that I have a version of the fact that use a forloop that is much faster than both of those 2 methods so there is clearly space for improvement but I wouldn't expect anything faster than the forloop.

The slowness of the recursive version comes from the need to resolve on each call the named (argument) variables. I have provided a different recursive implementation that has only one argument and it works slightly faster.
$ cat fact.py
def factorial_recursive1(x):
if x <= 1:
return 1
else:
return factorial_recursive1(x-1)*x
def factorial_recursive2(x, rest=1):
if x <= 1:
return rest
else:
return factorial_recursive2(x-1, rest*x)
def factorial_reduce(x):
if x <= 1:
return 1
return reduce(lambda a, b: a*b, xrange(1, x+1))
# Ignore the rest of the code for now, we'll get back to it later in the answer
def range_prod(a, b):
if a + 1 < b:
c = (a+b)//2
return range_prod(a, c) * range_prod(c, b)
else:
return a
def factorial_divide_and_conquer(n):
return 1 if n <= 1 else range_prod(1, n+1)
$ ipython -i fact.py
In [1]: %timeit factorial_recursive1(400)
10000 loops, best of 3: 79.3 µs per loop
In [2]: %timeit factorial_recursive2(400)
10000 loops, best of 3: 90.9 µs per loop
In [3]: %timeit factorial_reduce(400)
10000 loops, best of 3: 61 µs per loop
Since in your example very large numbers are involved, initially I suspected that the performance difference might be due to the order of multiplication. Multiplying on every iteration a large partial product by the next number is proportional to the number of digits/bits in the product, so the time complexity of such a method is O(n2), where n is the number of bits in the final product. Instead it is better to use a divide and conquer technique, where the final result is obtained as a product of two approximately equally long values each of which is computed recursively in the same manner. So I implemented that version too (see factorial_divide_and_conquer(n) in the above code) . As you can see below it still loses to the reduce()-based version for small arguments (due to the same problem with named parameters) but outperforms it for large arguments.
In [4]: %timeit factorial_divide_and_conquer(400)
10000 loops, best of 3: 90.5 µs per loop
In [5]: %timeit factorial_divide_and_conquer(4000)
1000 loops, best of 3: 1.46 ms per loop
In [6]: %timeit factorial_reduce(4000)
100 loops, best of 3: 3.09 ms per loop
UPDATE
Trying to run the factorial_recursive?() versions with x=4000 hits the default recursion limit, so the limit must be increased:
In [7]: sys.setrecursionlimit(4100)
In [8]: %timeit factorial_recursive1(4000)
100 loops, best of 3: 3.36 ms per loop
In [9]: %timeit factorial_recursive2(4000)
100 loops, best of 3: 7.02 ms per loop

Normalize strings that represent (combinatorical) necklaces [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 7 years ago.
Improve this question
I'm trying to match "necklaces" of symbols in Python by looking up their linear representations, for which I use normal strings. For example, the strings "AABC", "ABCA", "BCAA", "CAAB" all represent the same necklace (pictured).
In order to get an overview, I store only one of the equivalent strings of a given necklace as a "representative". As to check if I've stored a candidate necklace, I need a function to normalize any given string representation. As a kind of pseudo code, I wrote a function in Python:
import collections
def normalized(s):
q = collections.deque(s)
l = list()
l.append(''.join(q))
for i in range(len(s)-1):
q.rotate(1)
l.append(''.join(q))
l.sort()
return l[0]
For all the string representations in the above example necklace, this function returns "AABC", which comes first alphabetically.
Since I'm relatively new to Python, I wonder - if I'd start implementing an application in Python - would this function be already "good enough" for production code? In other words: Would an experienced Python programmer use this function, or are there obvious flaws?

If I understand you correctly you first need to construct all circular permutations of the input sequence and then determine the (lexicographically) smallest element. That is the root of your symbol loop.
Try this:
def normalized(s):
L = [s[i:] + s[:i] for i in range(len(s))]
return sorted(L)[0]
This code works only with strings, no conversions between queue and string as in your code. A quick timing test shows this code to run in 30-50% of the time.
It would be interesting to know the length of s in your application. As all permutations have to be stored temporarily len(s)^2 bytes are needed for the temp list L. Hopefully this is not a constraint in your case.
Edit:
Today I stumbled upon the observation that if you concatenate the original string to itself it will contain all rotations as substrings. So the code will be:
def normalized4(s):
ss = s + s # contains all rotations of s as substrings
n = len(s)
return min((ss[i:i+n] for i in range(n)))
This will indeed run faster as there is only one concatination left plus n slicings. Using stringlengths of 10 to 10**5 the runtime is between 55% and 66% on my machine, compared to the min() version with a generator.
Please note that you trade off speed for memory consumption (2x) which doesn't matter here but might do in a different setting.

You could use min rather then sorting:
def normalized2(s):
return min(( s[i:] + s[:i] for i in range(len(s)) ))
But still it needs to copy string len(s) times. Faster way is to filter starting indexes of smallest char, until you get only one. Effectively search for smallest loop:
def normalized3(s):
ssize=len(s)
minchar= min(s)
minindexes= [ i for i in range(ssize) if minchar == s[i] ]
for offset in range(1,ssize):
if len( minindexes ) == 1 :
break
minchar= min( s[(i+offset)%ssize] for i in minindexes )
minindexes= [i for i in minindexes if minchar == s[(i+offset)%ssize]]
return s[minindexes[0]:] + s[:minindexes[0]]
For long string this is much faster:
In [143]: loop = [ random.choice("abcd") for i in range(100) ]
In [144]: timeit normalized(loop)
1000 loops, best of 3: 237 µs per loop
In [145]: timeit normalized2(loop)
10000 loops, best of 3: 91.3 µs per loop
In [146]: timeit normalized3(loop)
100000 loops, best of 3: 16.9 µs per loop
But if we have much of repetition, this method is not eficient:
In [147]: loop = "abcd" * 25
In [148]: timeit normalized(loop)
1000 loops, best of 3: 245 µs per loop
In [149]: timeit normalized2(loop)
100000 loops, best of 3: 18.8 µs per loop
In [150]: timeit normalized3(loop)
1000 loops, best of 3: 612 µs per loop
We can also scan forward the string, but I doubt it could be any faster, without some fancy algorithm.

How about something like this:
patterns = ['abc', 'bca', 'cab']
normalized = lambda p: ''.join(sorted(p))
normalized_patterns = set(normalized(p) for p in patterns)
Example output:
In [1]: normalized = lambda p: ''.join(sorted(p))
In [2]: normalized('abba')
Out[2]: 'aabb'
In [3]: normalized('CBAE')
Out[3]: 'ABCE'

Why is the "map" version of ThreeSum so slow?

I expected this Python implementation of ThreeSum to be slow:
def count(a):
"""ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
N = len(a)
cnt = 0
for i in range(N):
for j in range(i+1, N):
for k in range(j+1, N):
if sum([a[i], a[j], a[k]]) == 0:
cnt += 1
return cnt
But I was shocked that this version looks pretty slow too:
def count_python(a):
"""ThreeSum using itertools"""
return sum(map(lambda X: sum(X)==0, itertools.combinations(a, r=3)))
Can anyone recommend a faster Python implementation? Both implementations just seem so slow...
Thanks
...
ANSWER SUMMARY:
Here is how the runs of all the various versions provided in this thread of the O(N^3) (for educational purposes, not used in real life) version worked out on my machine:
56 sec RUNNING count_slow...
28 sec RUNNING count_itertools, written by Ashwini Chaudhary...
14 sec RUNNING count_fixed, written by roippi...
11 sec RUNNING count_itertools (faster), written by Veedrak...
08 sec RUNNING count_enumerate, written by roippi...
*Note: Needed to modify Veedrak's solution to this to get the correct count output:
sum(1 for x, y, z in itertools.combinations(a, r=3) if x+y==-z)

Supplying a second answer. From various comments, it looks like you're primarily concerned about why this particular O(n**3) algorithm is slow when being ported over from java. Let's dive in.
def count(a):
"""ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
N = len(a)
cnt = 0
for i in range(N):
for j in range(i+1, N):
for k in range(j+1, N):
if sum([a[i], a[j], a[k]]) == 0:
cnt += 1
return cnt
One major problem that immediately pops out is that you're doing something your java code almost certainly isn't doing: materializing a 3-element list just to add three numbers together!
if sum([a[i], a[j], a[k]]) == 0:
Yuck! Just write that as
if a[i] + a[j] + a[k] == 0:
Some benchmarking shows that you're adding 50%+ overhead just by doing that. Yikes.
The other issue here is that you're using indexing where you should be using iteration. In python try to avoid writing code like this:
for i in range(len(some_list)):
do_something(some_list[i])
And instead just write:
for x in some_list:
do_something(x)
And if you explicitly need the index that you're on (as you actually do in your code), use enumerate:
for i,x in enumerate(some_list):
#etc
This is, in general, a style thing (though it goes deeper than that, with duck typing and the iterator protocol) - but it is also a performance thing. In order to look up the value of a[i], that call is converted to a.__getitem__(i), then python has to dynamically resolve a __getitem__ method lookup, call it, and return the value. Every time. It's not a crazy amount of overhead - at least on builtin types - but it adds up if you're doing it a lot in a loop. Treating a as an iterable, on the other hand, sidesteps a lot of that overhead.
So taking that change in mind, you can rewrite your function once again:
def count_enumerate(a):
cnt = 0
for i, x in enumerate(a):
for j, y in enumerate(a[i+1:], i+1):
for z in a[j+1:]:
if x + y + z == 0:
cnt += 1
return cnt
Let's look at some timings:
%timeit count(range(-100,100))
1 loops, best of 3: 394 ms per loop
%timeit count_fixed(range(-100,100)) #just fixing your sum() line
10 loops, best of 3: 158 ms per loop
%timeit count_enumerate(range(-100,100))
10 loops, best of 3: 88.9 ms per loop
And that's about as fast as it's going to go. You can shave off a percent or so by wrapping everything in a comprehension instead of doing cnt += 1 but that's pretty minor.
I've toyed around with a few itertools implementations but I actually can't get them to go faster than this explicit loop version. This makes sense if you think about it - for every iteration, the itertools.combinations version has to rebind what all three variables refer to, whereas the explicit loops get to "cheat" and rebind the variables in the outer loops far less often.
Reality check time, though: after everything is said and done, you can still expect cPython to run this algorithm an order of magnitude slower than a modern JVM would. There is simply too much abstraction built in to python that gets in the way of looping quickly. If you care about speed (and you can't fix your algorithm - see my other answer), either use something like numpy to spend all of your time looping in C, or use a different implementation of python.
postscript: pypy
For fun, I ran count_fixed on a 1000-element list, on both cPython and pypy.
cPython:
In [81]: timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
Out[81]: 19.230753898620605
pypy:
>>>> timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
0.6961538791656494
Speedy!
I might add some java testing in later to compare :-)

Algorithmically, both versions of your function are O(n**3) - so asymptotically neither is superior. You will find that the itertools version is in practice somewhat faster since it spends more time looping in C rather than in python bytecode. You can get it down a few more percentage points by removing map entirely (especially if you're running py2) but it's still going to be "slow" compared to whatever times you got from running it in a JVM.
Note that there are plenty of python implementations other than cPython out there - for loopy code, pypy tends to be much faster than cPython. So I wouldn't write python-as-a-language off as being slow, necessarily, but I would certainly say that the reference implementation of python is not known for its blazing loop speed. Give other python flavors a shot if that's something you care about.
Specific to your algorithm, an optimization will let you drop it down to O(n**2). Build up a set of your integers, s, and build up all pairs (a,b). You know that you can "zero out" (a+b) if and only if -(a+b) in (s - {a,b}).
Thanks to #Veedrak: unfortunately constructing s - {a,b} is a slow O(len(s)) operation itself - so simply check if -(a+b) is equal to either a or b. If it is, you know there's no third c that can fulfill a+b+c == 0 since all numbers in your input are distinct.
def count_python_faster(a):
s = frozenset(a)
return sum(1 for x,y in itertools.combinations(a,2)
if -(x+y) not in (x,y) and -(x+y) in s) // 3
Note the divide-by-three at the end; this is because each successful combination is triple-counted. It's possible to avoid that but it doesn't actually speed things up and (imo) just complicates the code.
Some timings for the curious:
%timeit count(range(-100,100))
1 loops, best of 3: 407 ms per loop
%timeit count_python(range(-100,100)) #this is about 100ms faster on py3
1 loops, best of 3: 382 ms per loop
%timeit count_python_faster(range(-100,100))
100 loops, best of 3: 5.37 ms per loop

You haven't stated which version of Python you're using.
In Python 3.x, a generator expression is around 10% faster than either of the two implementations you listed. Using a random array of 100 numbers in the range [-100,100] for a:
count(a) -> 8.94 ms # as per your implementation
count_python(a) -> 8.75 ms # as per your implementation
def count_generator(a):
return sum((sum(x) == 0 for x in itertools.combinations(a,r=3)))
count_generator(a) -> 7.63 ms
But other than that, it's the shear amount of combinations that's dominating execution time - O(N^3).
I should add the times shown above are for loops of 10 calls each, averaged over 10 loops. And yeah, my laptop is slow too :)

Pre-allocating a list of None

Suppose you want to write a function which yields a list of objects, and you know in advance the length n of such list.
In python the list supports indexed access in O(1), so it is arguably a good idea to pre-allocate the list and access it with indexes instead of allocating an empty list and using the append() method. This is because we avoid the burden of expanding the whole list if the space is not enough.
If I'm using python, probably performances are not that relevant in any case, but what is the better way of pre-allocating a list?
I know these possible candidates:
[None] * n → allocating two lists
[None for x in range(n)] — or xrange in python2 → building another object
Is one significantly better than the other?
What if we are in the case n = len(input)? Since input exists already, would [None for x in input] have better performances w.r.t. [None] * len(input)?

In between those two options the first one is clearly better as no Python for loop is involved.
>>> %timeit [None] * 100
1000000 loops, best of 3: 469 ns per loop
>>> %timeit [None for x in range(100)]
100000 loops, best of 3: 4.8 us per loop
Update:
And list.append has an O(1) complexity too, it might be a better choice than pre-creating list if you assign the list.append method to a variable.
>>> n = 10**3
>>> %%timeit
lis = [None]*n
for _ in range(n):
lis[_] = _
...
10000 loops, best of 3: 73.2 us per loop
>>> %%timeit
lis = []
for _ in range(n):
lis.append(_)
...
10000 loops, best of 3: 92.2 us per loop
>>> %%timeit
lis = [];app = lis.append
for _ in range(n):
app(_)
...
10000 loops, best of 3: 59.4 us per loop
>>> n = 10**6
>>> %%timeit
lis = [None]*n
for _ in range(n):
lis[_] = _
...
10 loops, best of 3: 106 ms per loop
>>> %%timeit
lis = []
for _ in range(n):
lis.append(_)
...
10 loops, best of 3: 122 ms per loop
>>> %%timeit
lis = [];app = lis.append
for _ in range(n):
app(_)
...
10 loops, best of 3: 91.8 ms per loop

When you append an item to a list, Python 'over-allocates', see the source-code of the list object. This means that for example when adding 1 item to a list of 8 items, it actually makes room for 8 new items, and uses only the first one of those. The next 7 appends are then 'for free'.
In many languages (e.g. old versions of Matlab, the newer JIT might be better) you are always told that you need to pre-allocate your vectors, since appending during a loop is very expensive. In the worst case, appending of a single item to a list of length n can cost O(n) time, since you might have to create a bigger list and copy all the existing items over. If you need to do this on every iteration, the overall cost of adding n items is O(n^2), ouch. Python's pre-allocation scheme spreads the cost of growing the array over many single appends (see amortized costs), effectively making the cost of a single append O(1) and the overall cost of adding n items O(n).
Additionally, the overhead of the rest of your Python code is usually so large, that the tiny speedup that can be obtained by pre-allocating is insignificant. So in most cases, simply forget about pre-allocating, unless your profiler tells you that appending to a list is a bottleneck.
The other answers show some profiling of the list preallocation itself, but this is useless. The only thing that matters is profiling your complete code, with all your calculations inside your loop, with and without pre-allocation. If my prediction is right, the difference is so small that the computation time you win is dwarfed by the time spent thinking about, writing and maintaining the extra lines to pre-allocate your list.

Obviously, the first version. Let me explain why.
When you do [None] * n, Python internally creates a list object of size n and it copies the the same object (here None) (this is the reason, you should use this method only when you are dealing with immutable objects) to all the memory locations. So memory allocation is done only once. After that a single iteration through the list to copy the object to all the elements. list_repeat is the function which corresponds to this type of list creation.
# Creates the list of specified size
np = (PyListObject *) PyList_New(size);
....
...
items = np->ob_item;
if (Py_SIZE(a) == 1) {
elem = a->ob_item[0];
for (i = 0; i < n; i++) {
items[i] = elem; // Copies the same item
Py_INCREF(elem);
}
return (PyObject *) np;
}
When you use a list comprehension to build a list, Python cannot know the actual size of the list being created, so it initially allocates a chunk of memory and a fresh copy of the object is stored in the list. When the list grows beyond the allocated length, it has to allocate the memory again and continue with the new object creation and storing that in the list.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.