Pre-allocating a list of None - python

Suppose you want to write a function which returns a list of objects, and you know in advance the length n of that list.
In Python the list supports indexed access in O(1), so it is arguably a good idea to pre-allocate the list and access it by index instead of starting from an empty list and using the append() method. This way we avoid the cost of repeatedly growing the list when the allocated space runs out.
If I'm using Python, performance is probably not that relevant anyway, but what is the better way of pre-allocating a list?
I know these possible candidates:
[None] * n → allocating two lists
[None for x in range(n)] — or xrange in python2 → building another object
Is one significantly better than the other?
What about the case n = len(input)? Since input already exists, would [None for x in input] perform better than [None] * len(input)?

Between those two options the first one is clearly better, as no Python for loop is involved.
>>> %timeit [None] * 100
1000000 loops, best of 3: 469 ns per loop
>>> %timeit [None for x in range(100)]
100000 loops, best of 3: 4.8 us per loop
Update:
list.append has amortized O(1) complexity too; it can even beat pre-creating the list if you bind the list.append method to a local variable first.
>>> n = 10**3
>>> %%timeit
lis = [None]*n
for _ in range(n):
    lis[_] = _
...
10000 loops, best of 3: 73.2 us per loop
>>> %%timeit
lis = []
for _ in range(n):
    lis.append(_)
...
10000 loops, best of 3: 92.2 us per loop
>>> %%timeit
lis = []; app = lis.append
for _ in range(n):
    app(_)
...
10000 loops, best of 3: 59.4 us per loop
>>> n = 10**6
>>> %%timeit
lis = [None]*n
for _ in range(n):
    lis[_] = _
...
10 loops, best of 3: 106 ms per loop
>>> %%timeit
lis = []
for _ in range(n):
    lis.append(_)
...
10 loops, best of 3: 122 ms per loop
>>> %%timeit
lis = []; app = lis.append
for _ in range(n):
    app(_)
...
10 loops, best of 3: 91.8 ms per loop

When you append an item to a list, Python 'over-allocates', see the source-code of the list object. This means that for example when adding 1 item to a list of 8 items, it actually makes room for 8 new items, and uses only the first one of those. The next 7 appends are then 'for free'.
In many languages (e.g. old versions of Matlab, the newer JIT might be better) you are always told that you need to pre-allocate your vectors, since appending during a loop is very expensive. In the worst case, appending of a single item to a list of length n can cost O(n) time, since you might have to create a bigger list and copy all the existing items over. If you need to do this on every iteration, the overall cost of adding n items is O(n^2), ouch. Python's pre-allocation scheme spreads the cost of growing the array over many single appends (see amortized costs), effectively making the cost of a single append O(1) and the overall cost of adding n items O(n).
Additionally, the overhead of the rest of your Python code is usually so large, that the tiny speedup that can be obtained by pre-allocating is insignificant. So in most cases, simply forget about pre-allocating, unless your profiler tells you that appending to a list is a bottleneck.
The other answers show some profiling of the list preallocation itself, but this is useless. The only thing that matters is profiling your complete code, with all your calculations inside your loop, with and without pre-allocation. If my prediction is right, the difference is so small that the computation time you win is dwarfed by the time spent thinking about, writing and maintaining the extra lines to pre-allocate your list.
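If you want to see this over-allocation for yourself, here is a minimal sketch (my own, CPython-specific; the exact growth steps and byte counts are implementation details that vary between versions) which prints the list's size only when its allocated capacity jumps:
import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(32):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != last_size:
        # the size only jumps occasionally; the appends in between are "free"
        print("len=%2d  size=%d bytes" % (len(lst), size))
        last_size = size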

Obviously, the first version. Let me explain why.
When you do [None] * n, Python internally creates a list object of size n and stores a reference to the same object (here None) in every slot (this is why you should only use this method with immutable objects). So the memory allocation is done only once, followed by a single pass over the list that fills every element with that reference. list_repeat is the C function which implements this type of list creation.
/* Creates the list of the specified size */
np = (PyListObject *) PyList_New(size);
....
...
items = np->ob_item;
if (Py_SIZE(a) == 1) {
    elem = a->ob_item[0];
    for (i = 0; i < n; i++) {
        items[i] = elem;   // Copies a reference to the same item
        Py_INCREF(elem);
    }
    return (PyObject *) np;
}
When you use a list comprehension to build the list, Python cannot know the final size of the list being created, so it starts with a small chunk of memory and appends a reference to the object on each iteration. Whenever the list grows beyond the allocated capacity, it has to reallocate the memory and keep going, which is exactly the repeated work that [None] * n avoids.
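A quick illustration of the caveat about immutable objects: [x] * n stores n references to one and the same object, which bites as soon as the filler is mutable:
>>> row = [[]] * 3                    # three references to the *same* list
>>> row[0].append(1)
>>> row
[[1], [1], [1]]
>>> safe = [[] for _ in range(3)]     # three distinct lists
>>> safe[0].append(1)
>>> safe
[[1], [], []]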

Related

Why is islice(permutations) 100 times faster if I keep a reference to the underlying iterator?

Iterating through islice(permutations(a), n) is somehow 100 times faster if I just keep an extra reference to the permutations iterator. Alternating between with and without the extra reference:
2.1 ms with
202.2 ms without
2.1 ms with
195.8 ms without
2.1 ms with
192.4 ms without
What's going on?
Full code (Try it online!):
from timeit import timeit
from itertools import permutations, islice
from collections import deque
a = range(10 ** 7)
n = 10 ** 5
for label in ['with', 'without'] * 3:
    if label == 'with':
        perms = islice((foo := permutations(a)), n)
    else:
        perms = islice(permutations(a), n)
    next(perms)
    t = timeit(lambda: deque(perms, 0), number=1)
    print('%5.1f ms ' % (t * 1e3), label)
I just realized why. If I don't keep a reference, then the iterator and everything it entails gets garbage collected at the end of the timed run, and that cleanup is included in the time.
Note that the list I build permutations of is very large. So each permutation is very large. So the permutations iterator has a large result tuple and internal state data structure, and I also have millions of integer objects from the range. All that must be cleaned up.
When I halve the size of a to a = range(10 ** 7 // 2), the times for "without" extra reference drop to about half (100 ms).
When I double the size of a to a = range(10 ** 7 * 2), the times for "without" extra reference roughly double (over 400 ms).
Neither change affects the "with" times (always around 2 ms).
In case anyone is wondering why I build permutations of such a large list: I was looking into the time it takes permutations to provide all n! permutations of n elements. One might think it needs O(n × n!), since that's the overall result size. But it reuses and modifies its result tuple if it can, so it doesn't build each permutation from scratch but just needs to update it a little. So I tested that with large n in order to see a large speed difference between when it can and can't reuse its result tuple. It is indeed much faster if it can, and seems to take only O(n!) time to provide all permutations. It appears to change, on average, just 2.63 elements from one permutation to the next.
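As a rough sanity check of that last observation, here is a small hypothetical helper (mine, not from the question) that counts how many positions differ between consecutive permutations for a small n:
from itertools import islice, permutations

def avg_changed_positions(n, limit=50000):
    """Average number of positions that differ between consecutive
    permutations of range(n), over at most `limit` consecutive pairs."""
    prev = None
    total = pairs = 0
    for perm in islice(permutations(range(n)), limit):
        if prev is not None:
            total += sum(a != b for a, b in zip(prev, perm))
            pairs += 1
        prev = perm
    return total / pairs

print(avg_changed_positions(8))   # a small average, in line with the ~2.6 figure above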

How to deoptimize memory access in Python?

This may not be useful. It's just a challenge I have set up for myself.
Let's say you have a big array. What can you do so that the program does not benefit from caching or cache-line prefetching, or so that the next memory access can only be determined after the previous access has finished?
So we have our array:
array = [0] * 10000000
What would be the best way to deoptimize the memory access if you had to access all elements in a loop? The idea is to increase the access time of each memory location as much as possible.
I'm not looking for a solution which proposes to do "something else" (which takes time) before doing the next access. The idea is really to increase the access time as much as possible. I guess we have to traverse the array in a certain way (perhaps randomly? I'm still looking into it).
I did not expect any difference, but in fact accessing the elements in random order is significantly slower than accessing them in order or in reverse order (both of which take about the same time).
>>> N = 10**5
>>> arr = [random.randint(0, 1000) for _ in range(N)]
>>> srt = list(range(N))
>>> rvd = srt[::-1]
>>> rnd = random.sample(srt, N)
>>> %timeit sum(arr[i] for i in srt)
10 loops, best of 5: 24.9 ms per loop
>>> %timeit sum(arr[i] for i in rvd)
10 loops, best of 5: 25.7 ms per loop
>>> %timeit sum(arr[i] for i in rnd)
10 loops, best of 5: 59.2 ms per loop
And it really seems to be the randomness. Just accessing indices out of order, but with a pattern, e.g. as [0, N-1, 2, N-3, ...] or [0, N/2, 1, N/2+1, ...], is just as fast as accessing them in order:
>>> alt1 = [i if i % 2 == 0 else N - i for i in range(N)]
>>> alt2 = [i for p in zip(srt[:N//2], srt[N//2:]) for i in p]
>>> %timeit sum(arr[i] for i in alt1)
10 loops, best of 5: 24.5 ms per loop
>>> %timeit sum(arr[i] for i in alt2)
10 loops, best of 5: 24.1 ms per loop
Interestingly, just iterating over the shuffled indices (and calculating their sum as with the array above) is also slower than doing the same with the sorted indices, but not by as much. Of the ~35 ms difference between srt and rnd, ~10 ms seem to come from iterating over the randomized indices, and ~25 ms from actually accessing the list elements in random order.
>>> %timeit sum(i for i in srt)
100 loops, best of 5: 19.7 ms per loop
>>> %timeit sum(i for i in rnd)
10 loops, best of 5: 30.5 ms per loop
>>> %timeit sum(arr[i] for i in srt)
10 loops, best of 5: 24.5 ms per loop
>>> %timeit sum(arr[i] for i in rnd)
10 loops, best of 5: 56 ms per loop
(IPython 5.8.0 / Python 3.7.3 on a rather old laptop running Linux)
Python interns small integers, so use integers > 255. The * operator just adds references to the one number already in the list when expanding it, so use unique values instead. Caches hate randomness, so access the elements in random order.
import random
array = list(range(256, 10000256))
while array:
    array.pop(random.randint(0, len(array)-1))
A note on interning small integers. When you create an integer in your program, say 12345, Python creates a full object on the heap, a few dozen bytes in size. This is expensive. So numbers between -5 and 256 are pre-built and cached by Python to optimize common small-number operations. By avoiding these numbers you force Python to allocate each integer on the heap, spreading out the amount of memory you touch and reducing cache efficiency.
If you use a single number in the array, as in [1234] * 100000, that single number is referenced many times. If you use unique numbers, they are all individually allocated on the heap, increasing the memory footprint. And when they are removed from the list, Python has to touch each object to decrement its reference count, which pulls its memory location into the cache and evicts something else.
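A quick way to see the small-integer cache at work (a CPython implementation detail; the int(...) calls are only there to defeat constant folding):
>>> a = int("200"); b = int("200")
>>> a is b          # values in the cached range come back as the same object
True
>>> a = int("20000"); b = int("20000")
>>> a is b          # larger values are separate heap objects
False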

Performance of len(List) vs reading a variable

A similar question has already been asked: Cost of len() function. However, that question looks at the cost of len itself.
Suppose I have code that calls len(List) many times; each call is O(1), reading a variable is also O(1), and assigning it is O(1) as well.
As a side note, I find n_files = len(Files) somewhat more readable than repeated len(Files) calls in my code. So that is already an incentive for me to do this.
You could also argue that somewhere in the code Files could be modified, so that n_files would no longer be correct, but that is not the case here.
My question is: is there a number of calls to len(Files) after which accessing n_files will be faster?
A few results (time, in seconds, for one million calls), with a ten-element list, using Python 2.7.10 on Windows 7; store is whether we store the length or keep calling len, and alias is whether or not we create a local alias for len:
Store  Alias   n=1      n=10     n=100
Yes    Yes     0.862    1.379     6.669
Yes    No      0.792    1.337     6.543
No     Yes     0.914    1.924    11.616
No     No      0.879    1.987    12.617
and a thousand-element list:
Store  Alias   n=1      n=10     n=100
Yes    Yes     0.877    1.369     6.661
Yes    No      0.785    1.299     6.808
No     Yes     0.926    1.886    11.720
No     No      0.891    1.948    12.843
Conclusions:
Storing the result is more efficient than calling len repeatedly, even for n == 1;
Creating a local alias for len can make a small improvement for larger n where we aren't storing the result, but not as much as just storing the result would; and
The influence of the length of the list is negligible, suggesting that whether or not the integers are interned isn't making any difference.
Test script:
def test(n, l, store, alias):
    if alias:
        len_ = len
        len_l = len_(l)
    else:
        len_l = len(l)
    for _ in range(n):
        if store:
            _ = len_l
        elif alias:
            _ = len_(l)
        else:
            _ = len(l)

if __name__ == '__main__':
    from itertools import product
    from timeit import timeit

    setup = 'from __main__ import test, l'
    for n, l, store, alias in product(
        (1, 10, 100),
        ([None]*10,),
        (True, False),
        (True, False),
    ):
        test_case = 'test({!r}, l, {!r}, {!r})'.format(n, store, alias)
        print test_case, len(l),
        print timeit(test_case, setup=setup)
Function calls in Python are costly, so if you are 100% sure that the length of Files will not change while you are using it, you can store it in a variable, especially since that is also more readable for you.
An example performance test comparing len(list) with reading the variable gives the following result:
In [36]: l = list(range(100000))
In [37]: n_l = len(l)
In [40]: %timeit newn = len(l)
10000000 loops, best of 3: 92.8 ns per loop
In [41]: %timeit new_n = n_l
10000000 loops, best of 3: 33.1 ns per loop
Accessing the variable is always faster than calling len().
Using l = len(li) is faster:
python -m timeit -s "li = [1, 2, 3]" "len(li)"
1000000 loops, best of 3: 0.239 usec per loop
python -m timeit -s "li = [1, 2, 3]; l = len(li)" "l"
10000000 loops, best of 3: 0.0949 usec per loop
Using len(Files) instead of n_files is likely to be slower. Yes you have to lookup n_files, but in the former case you'll have to lookup both len and Files and then on top of that call a function that "calculates" the length of Files.
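One way to see those extra lookups, assuming CPython: disassemble both forms (the function and parameter names below are hypothetical, and the exact opcode names vary between Python versions):
import dis

def via_len(Files):
    # looks up the global name len, loads Files, then calls len
    return len(Files)

def via_var(n_files):
    # a single local-variable load, no call
    return n_files

dis.dis(via_len)
dis.dis(via_var)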

Why is the "map" version of ThreeSum so slow?

I expected this Python implementation of ThreeSum to be slow:
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
But I was shocked that this version looks pretty slow too:
def count_python(a):
    """ThreeSum using itertools"""
    return sum(map(lambda X: sum(X) == 0, itertools.combinations(a, r=3)))
Can anyone recommend a faster Python implementation? Both implementations just seem so slow...
Thanks
...
ANSWER SUMMARY:
Here is how all the various O(N^3) versions provided in this thread (kept for educational purposes, not for real-life use) worked out on my machine:
56 sec RUNNING count_slow...
28 sec RUNNING count_itertools, written by Ashwini Chaudhary...
14 sec RUNNING count_fixed, written by roippi...
11 sec RUNNING count_itertools (faster), written by Veedrak...
08 sec RUNNING count_enumerate, written by roippi...
*Note: I needed to modify Veedrak's solution as follows to get the correct count output:
sum(1 for x, y, z in itertools.combinations(a, r=3) if x+y==-z)
Supplying a second answer. From various comments, it looks like you're primarily concerned about why this particular O(n**3) algorithm is slow when being ported over from java. Let's dive in.
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
One major problem that immediately pops out is that you're doing something your java code almost certainly isn't doing: materializing a 3-element list just to add three numbers together!
if sum([a[i], a[j], a[k]]) == 0:
Yuck! Just write that as
if a[i] + a[j] + a[k] == 0:
Some benchmarking shows that you're adding 50%+ overhead just by doing that. Yikes.
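If you want to measure that overhead yourself, a quick sketch (exact numbers will differ per machine and Python version):
from timeit import timeit

setup = "a, b, c = 11, -4, 17"
print(timeit("sum([a, b, c])", setup=setup))   # builds a throwaway list on every call
print(timeit("a + b + c", setup=setup))        # plain addition, no list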
The other issue here is that you're using indexing where you should be using iteration. In python try to avoid writing code like this:
for i in range(len(some_list)):
    do_something(some_list[i])
And instead just write:
for x in some_list:
    do_something(x)
And if you explicitly need the index that you're on (as you actually do in your code), use enumerate:
for i, x in enumerate(some_list):
    # etc
This is, in general, a style thing (though it goes deeper than that, with duck typing and the iterator protocol) - but it is also a performance thing. In order to look up the value of a[i], that call is converted to a.__getitem__(i), then python has to dynamically resolve a __getitem__ method lookup, call it, and return the value. Every time. It's not a crazy amount of overhead - at least on builtin types - but it adds up if you're doing it a lot in a loop. Treating a as an iterable, on the other hand, sidesteps a lot of that overhead.
So taking that change in mind, you can rewrite your function once again:
def count_enumerate(a):
    cnt = 0
    for i, x in enumerate(a):
        for j, y in enumerate(a[i+1:], i+1):
            for z in a[j+1:]:
                if x + y + z == 0:
                    cnt += 1
    return cnt
Let's look at some timings:
%timeit count(range(-100,100))
1 loops, best of 3: 394 ms per loop
%timeit count_fixed(range(-100,100)) #just fixing your sum() line
10 loops, best of 3: 158 ms per loop
%timeit count_enumerate(range(-100,100))
10 loops, best of 3: 88.9 ms per loop
And that's about as fast as it's going to go. You can shave off a percent or so by wrapping everything in a comprehension instead of doing cnt += 1 but that's pretty minor.
I've toyed around with a few itertools implementations but I actually can't get them to go faster than this explicit loop version. This makes sense if you think about it - for every iteration, the itertools.combinations version has to rebind what all three variables refer to, whereas the explicit loops get to "cheat" and rebind the variables in the outer loops far less often.
Reality check time, though: after everything is said and done, you can still expect cPython to run this algorithm an order of magnitude slower than a modern JVM would. There is simply too much abstraction built in to python that gets in the way of looping quickly. If you care about speed (and you can't fix your algorithm - see my other answer), either use something like numpy to spend all of your time looping in C, or use a different implementation of python.
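For illustration, here is a hedged numpy sketch of that suggestion (my own, not from the thread): it sorts the input once and, for each i, vectorizes the search for the third element over all j > i with searchsorted, so most of the looping happens in C. It assumes distinct integers, as the problem statement does.
import numpy as np

def count_numpy(a):
    arr = np.sort(np.asarray(list(a)))
    n = len(arr)
    cnt = 0
    for i in range(n - 2):
        j_idx = np.arange(i + 1, n - 1)        # candidate middle indices j
        targets = -(arr[i] + arr[j_idx])       # value the third element k must have
        pos = np.searchsorted(arr, targets)    # where that value would sit
        pos_clipped = np.minimum(pos, n - 1)
        hits = (arr[pos_clipped] == targets) & (pos > j_idx)   # value exists and k > j
        cnt += int(hits.sum())
    return cnt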
postscript: pypy
For fun, I ran count_fixed on a 1000-element list, on both cPython and pypy.
cPython:
In [81]: timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
Out[81]: 19.230753898620605
pypy:
>>>> timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
0.6961538791656494
Speedy!
I might add some java testing in later to compare :-)
Algorithmically, both versions of your function are O(n**3) - so asymptotically neither is superior. You will find that the itertools version is in practice somewhat faster since it spends more time looping in C rather than in python bytecode. You can get it down a few more percentage points by removing map entirely (especially if you're running py2) but it's still going to be "slow" compared to whatever times you got from running it in a JVM.
Note that there are plenty of python implementations other than cPython out there - for loopy code, pypy tends to be much faster than cPython. So I wouldn't write python-as-a-language off as being slow, necessarily, but I would certainly say that the reference implementation of python is not known for its blazing loop speed. Give other python flavors a shot if that's something you care about.
Specific to your algorithm, an optimization will let you drop it down to O(n**2). Build up a set of your integers, s, and build up all pairs (a,b). You know that you can "zero out" (a+b) if and only if -(a+b) in (s - {a,b}).
Thanks to @Veedrak: unfortunately constructing s - {a,b} is a slow O(len(s)) operation itself - so simply check whether -(a+b) is equal to either a or b. If it is, you know there's no third c that can fulfill a+b+c == 0, since all numbers in your input are distinct.
def count_python_faster(a):
    s = frozenset(a)
    return sum(1 for x, y in itertools.combinations(a, 2)
               if -(x+y) not in (x, y) and -(x+y) in s) // 3
Note the divide-by-three at the end; this is because each successful combination is triple-counted. It's possible to avoid that but it doesn't actually speed things up and (imo) just complicates the code.
Some timings for the curious:
%timeit count(range(-100,100))
1 loops, best of 3: 407 ms per loop
%timeit count_python(range(-100,100)) #this is about 100ms faster on py3
1 loops, best of 3: 382 ms per loop
%timeit count_python_faster(range(-100,100))
100 loops, best of 3: 5.37 ms per loop
You haven't stated which version of Python you're using.
In Python 3.x, a generator expression is around 10% faster than either of the two implementations you listed. Using a random array of 100 numbers in the range [-100,100] for a:
count(a) -> 8.94 ms # as per your implementation
count_python(a) -> 8.75 ms # as per your implementation
def count_generator(a):
    return sum(sum(x) == 0 for x in itertools.combinations(a, r=3))
count_generator(a) -> 7.63 ms
But other than that, it's the sheer number of combinations that dominates execution time - O(N^3).
I should add the times shown above are for loops of 10 calls each, averaged over 10 loops. And yeah, my laptop is slow too :)

Why is this slicing code faster than more procedural code?

I have a Python function that takes a list and returns a generator yielding 2-tuples of each adjacent pair, e.g.
>>> list(pairs([1, 2, 3, 4]))
[(1, 2), (2, 3), (3, 4)]
I've considered an implementation using 2 slices:
def pairs(xs):
    for p in zip(xs[:-1], xs[1:]):
        yield p
and one written in a more procedural style:
def pairs(xs):
    last = object()
    dummy = last
    for x in xs:
        if last is not dummy:
            yield last, x
        last = x
Testing using range(2 ** 15) as input yields the following times (you can find my testing code and output here):
2 slices: 100 loops, best of 3: 4.23 msec per loop
0 slices: 100 loops, best of 3: 5.68 msec per loop
Part of the performance hit for the sliceless implementation is the comparison in the loop (if last is not dummy). Removing that (making the output incorrect) improves its performance, but it's still slower than the zip-a-pair-of-slices implementation:
2 slices: 100 loops, best of 3: 4.48 msec per loop
0 slices: 100 loops, best of 3: 5.2 msec per loop
So, I'm stumped. Why is zipping together 2 slices, effectively iterating over the list twice in parallel, faster than iterating once, updating last and x as you go?
EDIT
Dan Lenski proposed a third implementation:
def pairs(xs):
    for ii in range(1, len(xs)):
        yield xs[ii-1], xs[ii]
Here's its comparison to the other implementations:
2 slices: 100 loops, best of 3: 4.37 msec per loop
0 slices: 100 loops, best of 3: 5.61 msec per loop
Lenski's: 100 loops, best of 3: 6.43 msec per loop
It's even slower! Which is baffling to me.
EDIT 2:
ssm suggested using itertools.izip instead of zip, and it's even faster than zip:
2 slices, izip: 100 loops, best of 3: 3.68 msec per loop
So, izip is the winner so far! But still for difficult-to-inspect reasons.
Lots of interesting discussion elsewhere in this thread. Basically, we started out comparing two versions of this function, which I'm going to describe with the following dumb names:
The "zip-py" version:
def pairs(xs):
    for p in zip(xs[:-1], xs[1:]):
        yield p
The "loopy" version:
def pairs(xs):
    last = object()
    dummy = last
    for x in xs:
        if last is not dummy:
            yield last, x
        last = x
So why does the loopy version turn out to be slower? Basically, I think it comes down to a couple things:
The loopy version explicitly does extra work: it compares two objects' identities (if last is not dummy: ...) on every pair-generating iteration of the inner loop.
@mambocab's edit shows that not doing this comparison does make the loopy version slightly faster, but doesn't fully close the gap.
The zippy version does more stuff in compiled C code that the loopy version does in Python code:
Combining two objects into a tuple. The loopy version does yield last,x, while in the zippy version the tuple p comes straight from zip, so it just does yield p.
Binding variable names to object: the loopy version does this twice in every loop, assigning x in the for loop and last=x. The zippy version does this just once, in the for loop.
Interestingly, there is one way in which the zippy version is actually doing more work: it uses two listiterators, iter(xs[:-1]) and iter(xs[1:]), which get passed to zip. The loopy version only uses one listiterator (for x in xs).
Creating a listiterator object (the output of iter([])) is likely a very highly optimized operation since Python programmers use it so frequently.
Iterating over list slices, xs[:-1] and xs[1:], is a very lightweight operation which adds almost no overhead compared to iterating over the whole list. Essentially, it just means moving the starting or ending point of the iterator, but not changing what happens on each iteration.
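As an aside, you can avoid building the two sliced copies of the list entirely with the classic pairwise recipe from the itertools documentation (use itertools.izip instead of zip on Python 2); whether it beats the slice version here would need measuring:
from itertools import tee

def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)        # advance the second iterator by one element
    return zip(a, b)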
This is the result for izip, which is actually closer to your implementation. It looks like what you would expect. The zip version creates the entire list in memory within the function, so it is the fastest. The loop version just loops through the list, so it is a little slower. The izip version resembles your code most closely, but I am guessing there are some memory-management backend processes which increase the execution time.
In [11]: %timeit pairsLoop([1,2,3,4,5])
1000000 loops, best of 3: 651 ns per loop
In [12]: %timeit pairsZip([1,2,3,4,5])
1000000 loops, best of 3: 637 ns per loop
In [13]: %timeit pairsIzip([1,2,3,4,5])
1000000 loops, best of 3: 655 ns per loop
The version of code is shown below as requested:
from itertools import izip

def pairsIzip(xs):
    for p in izip(xs[:-1], xs[1:]):
        yield p

def pairsZip(xs):
    for p in zip(xs[:-1], xs[1:]):
        yield p

def pairsLoop(xs):
    last = object()
    dummy = last
    for x in xs:
        if last is not dummy:
            yield last, x
        last = x
I suspect the main reason that the second version is slower is because it does a comparison operation for every single pair that it yields:
# pair-generating loop
for x in xs:
    if last is not dummy:
        yield last, x
    last = x
The first version does not do anything but spit out values. With the loop variables renamed, it's equivalent to this:
# pair-generating loop
for last, x in zip(xs[:-1], xs[1:]):
    yield last, x
It's not especially pretty or Pythonic, but you could write a procedural version without a comparison in the inner loop. How fast does this one run?
def pairs(xs):
    for ii in range(1, len(xs)):
        yield xs[ii-1], xs[ii]
