I wanted to compare the memory usage of a generator and a list comprehension. The book gives a small benchmark snippet, and when I ran it on my PC (Python 3.6, Windows) I found something unexpected.
The book says that because a list comprehension has to create a real list and allocate memory for it, iterating over a list comprehension must be slower than iterating over a generator.
Of course, a list comprehension also uses more memory than a generator.
Below is my code, which does not agree with that claim (when the comprehension is passed to the sum function).
import tracemalloc
from time import time

def timeIt(func):
    # Runs func once at decoration time and prints the elapsed wall-clock time
    start = time()
    func()
    print('%s use time' % func.__name__, time() - start)
    return func

tracemalloc.start()
numbers = range(1, 1000000)

@timeIt
def lStyle():
    return sum([i for i in numbers if i % 3 == 0])

@timeIt
def gStyle():
    return sum((i for i in numbers if i % 3 == 0))

lStyle()
gStyle()
shouldSize = [i for i in numbers if i % 3 == 0]
snapshotL = tracemalloc.take_snapshot()
top_stats = snapshotL.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)
The output:
lStyle use time 0.4460000991821289
gStyle use time 0.6190001964569092
[ Top 10 ]
F:/py3proj/play.py:31: size=11.5 MiB, count=333250, average=36 B
F:/py3proj/play.py:33: size=448 B, count=1, average=448 B
F:/py3proj/play.py:22: size=136 B, count=1, average=136 B
F:/py3proj/play.py:17: size=136 B, count=1, average=136 B
F:/py3proj/play.py:14: size=76 B, count=2, average=38 B
F:/py3proj/play.py:8: size=34 B, count=1, average=34 B
Two points:
The generator takes more time and uses the same amount of memory.
The list comprehension inside the sum function does not seem to create the whole list.
I think the sum function may be doing something I don't know about. Can anyone explain this?
The book is High Performance Python, chapter 5. I changed the book's code to check whether the claim holds in another context. The author's code is here: book_code; he did not put the list comprehension inside the sum function.
When it comes to timing tests, I rely on the timeit module because it automatically executes the code many times.
On my system, timeit gives the following results (I greatly reduced the sizes because of the number of runs):
>>> timeit.timeit("sum([i for i in numbers if i % 3 == 0])", "numbers = range(1, 1000)")
59.54427594248068
>>> timeit.timeit("sum((i for i in numbers if i % 3 == 0))", "numbers = range(1, 1000)")
64.36398425334801
So the generator is slower by about 8% (*). This is not really a surprise, because the generator has to execute some code on the fly to get the next value, while a precomputed list only increments its current pointer.
Memory evaluation is IMHO more complex, so I used the ActiveState recipe Compute Memory footprint of an object and its contents (Python recipe):
>>> numbers = range(1, 100)
>>> numbers = range(1, 100000)
>>> l = [i for i in numbers if i % 3 == 0]
>>> g = (i for i in numbers if i % 3 == 0)
>>> total_size(l)
1218708
>>> total_size(g)
88
>>> total_size(numbers)
48
My interpretation is that a list uses memory for all of its items (which is not a surprise), while a generator only needs a few values and some code, so its memory footprint is much smaller.
I strongly think that you have used tracemalloc for something it is not intended for. It is aimed at finding possible memory leaks (large blocks of memory never deallocated), not at measuring the memory used by individual objects.
BEWARE: I could only test small sizes. For very large sizes, the list could exhaust the available memory and the machine would start using virtual memory from swap. In that case, the list version will become much slower. More details there
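A possible way to see the difference with tracemalloc itself is to look at the peak rather than at a snapshot taken afterwards, since the temporary list built inside sum([...]) has already been freed by the time a later snapshot runs. This is only a minimal sketch: the helper name peak_bytes and the reduced range are assumptions of mine, not part of the question.

```python
import tracemalloc

numbers = range(1, 100000)


def peak_bytes(func):
    """Return (result, peak traced bytes) for a single call of func."""
    tracemalloc.start()
    try:
        result = func()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def l_style():
    # Builds the full list first, then sums it
    return sum([i for i in numbers if i % 3 == 0])


def g_style():
    # Feeds sum() one value at a time
    return sum(i for i in numbers if i % 3 == 0)


list_total, list_peak = peak_bytes(l_style)
gen_total, gen_peak = peak_bytes(g_style)
print("list peak:", list_peak, "generator peak:", gen_peak)
```

The absolute numbers depend on the interpreter and platform; only the ordering of the two peaks is the point here.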
I'm not sure if this is a duplicate of the other PyPy memory questions, but here I'll provide a concrete example.
from __future__ import division

def mul_inv(a, m):
    """Modular multiplicative inverse, a^-1 mod m. Credit: rosettacode.org"""
    m0 = m
    x0, x1 = 0, 1
    if m == 1: return 1
    while a > 1:
        assert m != 0, "a and m must be coprime"
        q = a // m
        a, m = m, a%m
        x0, x1 = x1 - q * x0, x0
    if x1 < 0: x1 += m0
    return x1

M = 1000000009
L = 10**8
bin2 = [0] * L
bin2[0] = 1
for n in range(L-1):
    bin2[n+1] = (bin2[n] * (4*n + 2) * mul_inv(n+1, M)) % M
    if n % 10**5 == 0: print(n, bin2[n])
print(bin2[:20])
With python 3.6, the program uses 3-4 GB at most and runs to completion (Armin Rigo's list change doesn't change this significantly). With python 2.7.13 running PyPy 5.10.0, the program reaches 8 GB (how much RAM I have) quickly and freezes. Even with gc.collect() calls the program runs out of memory when n is about 3.5 * 10^7.
Where is this memory usage coming from? The only large memory usage should be initializing bin2 as a 10^8 int list. Nothing else should be increasing the memory usage, under the assumption that all the local variables in mul_inv are garbage collected.
Oops, it's a bad case of the optimization for lists of integers. The problem is that this starts as a list of ints:
bin2 = [0] * L
This is internally stored as an array of ints. It's usually much more compact, even though in this case it doesn't change anything---because on CPython it's a list containing L copies of the same object 0.
But the problem is that pretty soon, we store a long in the list. At this point, we need to turn the whole list into the generic kind that can store anything. But! The problem is that we see 100 million zeroes, and so we create 100 million 0 objects. This creates instantly 3 GB of memory pressure for nothing, in addition to 800MB for the list itself.
We can check that the problem doesn't occur if we initialize the list like this, so that it really contains 100 million times the same object:
bin2 = [0L] * L # Python 2.x
bin2[0] = 1
That said, in your example you don't need the list to contain 100 million elements in the first place. You can initialize it as:
bin2 = [1]
and use bin2.append(). This lets the program start much more quickly and without any large memory usage near the beginning.
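A rough sketch of the append() variant, using a small limit so it finishes instantly. The function name build_bin2 and the limit parameter are illustrative assumptions; mul_inv is a condensed copy of the question's helper.

```python
M = 1000000009


def mul_inv(a, m):
    """Modular inverse via the extended Euclidean algorithm (condensed from the question)."""
    m0, x0, x1 = m, 0, 1
    if m == 1:
        return 1
    while a > 1:
        q = a // m
        a, m = m, a % m
        x0, x1 = x1 - q * x0, x0
    return x1 + m0 if x1 < 0 else x1


def build_bin2(limit):
    """Grow the list with append() instead of preallocating [0] * limit."""
    bin2 = [1]
    for n in range(limit - 1):
        bin2.append((bin2[n] * (4 * n + 2) * mul_inv(n + 1, M)) % M)
    return bin2


print(build_bin2(10))
```

Because the list never holds placeholder zeroes, there is nothing to convert when the first non-machine-int value arrives.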
Note that PyPy3 still uses more memory than CPython3.
AFAICT the issue here is that you're assigning longs to the array, and despite your modulo, PyPy doesn't seem to notice that the number still fits into a machine word.
I can think of two ways to fix this:
Pass the value assigned to bin2[n+1] through int().
Use array.array().
The former only affects PyPy2, and results in what appears to be a stable memory footprint of ~800MB on my Mac, whereas the latter appears to stabilise at ~1.4GB regardless of whether I run it in PyPy2 or PyPy3.
I haven't run the program fully to completion, though, so YMMV…
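For reference, the array.array() suggestion might look roughly like this. It is a sketch under assumptions: the size is cut down to 1000, and the inverse is computed with pow(n + 1, M - 2, M), which relies on M being prime (my substitution for the question's mul_inv, to keep the example short).

```python
from array import array

M = 1000000009
size = 1000  # small illustrative size; the question uses 10**8

# 'q' is the signed long long typecode, so each slot holds a raw machine integer
bin2 = array('q', [0]) * size
bin2[0] = 1
for n in range(size - 1):
    # int() mirrors the first suggestion: store a plain int, not a long
    inverse = pow(n + 1, M - 2, M)  # assumes M is prime (Fermat's little theorem)
    bin2[n + 1] = int((bin2[n] * (4 * n + 2) * inverse) % M)

print(list(bin2[:5]), bin2.itemsize * len(bin2), "bytes of payload")
```

All stored values are below M, so they always fit in the 'q' slot.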
Why is the first method so slow?
It can be up to 1000 times slower, any ideas on how to make it faster?
In this case, performance is the number one priority. In my first attempt I tried to use multiprocessing, but it was quite slow as well.
Python - Set the first element of a generator - Applied to itertools
import time
import operator as op
from math import factorial
from itertools import combinations

def nCr(n, r):
    # https://stackoverflow.com/a/4941932/1167783
    r = min(r, n-r)
    if r == 0:
        return 1
    numer = reduce(op.mul, xrange(n, n-r, -1))
    denom = reduce(op.mul, xrange(1, r+1))
    return numer // denom

def kthCombination(k, l, r):
    # https://stackoverflow.com/a/1776884/1167783
    if r == 0:
        return []
    elif len(l) == r:
        return l
    else:
        i = nCr(len(l)-1, r-1)
        if k < i:
            return l[0:1] + kthCombination(k, l[1:], r-1)
        else:
            return kthCombination(k-i, l[1:], r)

def iter_manual(n, p):
    numbers_list = [i for i in range(n)]
    for comb in xrange(factorial(n)/(factorial(p)*factorial(n-p))):
        x = kthCombination(comb, numbers_list, p)
        # Do something, for example, store those combinations
        # For timing i'm going to do something simple

def iter(n, p):
    for i in combinations([i for i in range(n)], p):
        # Do something, for example, store those combinations
        # For timing i'm going to do something simple
        x = i

#############################
if __name__ == "__main__":
    n = 40
    p = 5
    print '%s combinations' % (factorial(n)/(factorial(p)*factorial(n-p)))
    t0_man = time.time()
    iter_manual(n, p)
    t1_man = time.time()
    total_man = t1_man - t0_man
    t0_iter = time.time()
    iter(n, p)
    t1_iter = time.time()
    total_iter = t1_iter - t0_iter
    print 'Manual: %s' %total_man
    print 'Itertools: %s' %total_iter
    print 'ratio: %s' %(total_man/total_iter)
There are several factors at play here.
The most important is garbage collection. Any method that generates a lot of unnecessary allocations is going to be slow because of GC pauses. In this vein, list comprehensions are fast (for Python) because they are highly optimized under the hood in their allocation and execution. Wherever speed is important, prefer list comprehensions.
Next up you've got function calls. Function calls are relatively expensive, as @roganjosh points out in the comments. This is (again) particularly true if the function generates a lot of garbage or holds on to long-lived closures.
Now we come to loops. Garbage is again the biggest concern: hoist your variables outside the loop and reuse them on each iteration.
Last but certainly not least is that Python is, in a sense, a hosted language: generally on the CPython runtime. Anything implemented in the runtime itself (particularly if the thing in question is implemented in C rather than Python itself) is going to be faster than your (logically equivalent) code.
NOTE
All of this advice is detrimental to code quality. Use with caution. Profile first. Also note that compilers are generally smart enough to do all of this for you, for instance PyPy will generally run faster for the same code than the standard Python runtime because it does optimizations like this for you when it runs your code.
NOTE 2
One of the implementations uses reduce. In theory, reduce could be fast. But it isn't for lots of reasons, the chief of which could possibly be summed up as "Guido didn't/doesn't care". So don't use reduce when speed is important.
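To illustrate the point about work done inside the runtime, here is a rough sketch that times a hand-written loop against the C-implemented built-in sum. The function names and sizes are hypothetical, and no particular timings are claimed; the ratio will vary by interpreter.

```python
import timeit


def python_loop(values):
    """Add the values up with an explicit Python-level loop."""
    total = 0
    for v in values:
        total += v
    return total


def builtin_sum(values):
    """Let the built-in do the iteration inside the runtime."""
    return sum(values)


data = list(range(10000))

for func in (python_loop, builtin_sum):
    # The lambda captures func for this iteration only; it is called immediately
    elapsed = timeit.timeit(lambda: func(data), number=200)
    print(func.__name__, elapsed)
```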
Say you have to carry out a computation using 2 or even 3 loops. Intuitively, one may think that it's more efficient to do this with a single loop. I tried a simple Python example:
import itertools
import timeit

def case1(n):
    c = 0
    for i in range(n):
        c += 1
    return c

def case2(n):
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += 1
    return c

print(case1(1000))
print(case2(10))

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("case1(1000)", setup="from __main__ import case1", number=10000))
    print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))
Running this code gives:
$ python3 code.py
1000
1000
0.8281264099932741
1.04944919400441
So effectively 1 loop seems to be a bit more efficient. Yet I have a slightly different scenario in my problem, as I need to use the values in an array (in the following example I use the function range for simplification). That is, if I collapse everything to a single loop I would have to create an extended array from the values of another array whose size is between 2 and 10 elements.
import itertools
import timeit

def case1(n):
    b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c

def case2(n):
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += i*j*k
    return c

print(case1(10))
print(case2(10))

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("case1(10)", setup="from __main__ import case1", number=10000))
    print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))
On my computer this code runs in:
$ python3 code.py
91125
91125
2.435348572995281
1.6435037050105166
So it seems the 3 nested loops are more efficient, because I spend some time creating the array b in case1. I'm not sure I'm creating this array in the most efficient way, but leaving that aside, does it really pay off to collapse loops into a single one? I'm using Python here, but what about compiled languages like C++? Does the compiler do something to optimize the single loop in this case? Or, on the other hand, does the compiler do some optimization when you have multiple nested loops?
This line is why the single-loop function takes longer than it supposedly should:
b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
Just changing the whole function to
def case1(n, b):
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c
makes timeit return:
case1 : 0.965343249744
case2 : 2.28501694207
Your case is simple enough that various optimizations would probably do a lot: numpy for more efficient arrays, maybe PyPy for a better JIT optimizer, or various other things.
Looking at the bytecode via the dis module can help you understand what happens under the hood and make some micro optimizations, but in general it does not really matter if you do one loop or a nested loop, if your memory access pattern is somewhat predictable for the CPU. If not, it may differ wildly.
Python has some bytecodes that are cheap and others that are more expensive, e.g. function calls are much more expensive than a simple addition. Same with creating new objects and various other things. So the usual optimization is moving the loop to C, which is one of the benefits of itertools sometimes.
Once you are on the C-level it usually comes down to: Avoid syscalls/mallocs() in tight loops, have predictable memory access patterns and make sure your algorithm is cache friendly.
So, your algorithms above will probably vary wildly in performance if you go to large values of N, due to the amount of memory allocation and cache access.
But the fastest way for the specific problem above would be to find a closed form for the function, it seems wasteful to iterate for that, as there must be a much simpler formula to calculate the final value of 'c'. As usual, first get the best algorithm before doing micro optimizations.
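E.g. Wolfram Alpha tells you that you could replace two of the loops with the formula below. There is probably a closed form for all three, but Alpha didn't tell me...
def case3(n):
    c = 0
    for j in range(n):
        # Sum of i*k over i, k in range(n) is (n*(n-1)/2)**2.
        # Note: ** is exponentiation in Python; ^ would be bitwise XOR.
        c += j * n**2 * (n - 1)**2 // 4
    return c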
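Since the triple sum of i*j*k factorises into the product of three identical sums, all three loops can be collapsed. A small sketch checking this against the nested loops from the question (the name case_closed is mine):

```python
def case2(n):
    """Reference: the three nested loops from the question."""
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += i * j * k
    return c


def case_closed(n):
    """Closed form: (0 + 1 + ... + (n-1)) ** 3."""
    s = n * (n - 1) // 2
    return s ** 3


print(case2(10), case_closed(10))  # both give 91125, matching the question's output
```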
I am trying to benchmark a few itertools methods against generators and list comprehensions. The idea is that I want to build an iterator by filtering some entries from a base list.
Here is the code I came up with (edited after the accepted answer):
from itertools import ifilter
import collections
import random
import os
from timeit import Timer

os.system('cls')

# define large arrays
listArrays = [xrange(100), xrange(1000), xrange(10000), xrange(100000)]
#Number of element to be filtered out
nb_elem = 100
# Number of times we run the test
nb_rep = 1000

def discard(it):
    collections.deque(it, maxlen=0)

def testGenerator(arr, sample):
    discard(x for x in sample if x in arr)

def testIterator(arr, sample):
    discard(ifilter(sample.__contains__, arr))

def testList(arr, sample):
    discard([x for x in sample if x in arr])

if __name__ == '__main__':
    for arr in listArrays:
        print 'Size of array: %s ' % len(arr)
        print 'number of iterations %s' % nb_rep
        sample = random.sample(arr, nb_elem)
        t1 = Timer('testIterator(arr, sample)', 'from __main__ import testIterator, arr, sample')
        tt1 = t1.timeit(number=nb_rep)
        t2 = Timer('testList(arr, sample)', 'from __main__ import testList, arr, sample')
        tt2 = t2.timeit(number=nb_rep)
        t3 = Timer('testGenerator(arr, sample)', 'from __main__ import testGenerator, arr, sample')
        tt3 = t3.timeit(number=nb_rep)
        norm = min(tt1, tt2, tt3)
        print 'maximum runtime %.6f' % max(tt1, tt2, tt3)
        print 'normalized times:\n iterator: %.6f \n list: %.6f \n generator: %.6f' % \
            (tt1/norm, tt2/norm, tt3/norm)
        print '======================================================'
And here are the results I get. Please note that the edited version was not run on the same machine (hence the usefulness of normalized results) and was run with a 32-bit interpreter with Python 2.7.3:
Size of array: 100
number of iterations 1000
maximum runtime 0.125595
normalized times:
iterator: 1.000000
list: 1.260302
generator: 1.276030
======================================================
Size of array: 1000
number of iterations 1000
maximum runtime 1.740341
normalized times:
iterator: 1.466031
list: 1.010701
generator: 1.000000
======================================================
Size of array: 10000
number of iterations 1000
maximum runtime 17.033630
normalized times:
iterator: 1.441600
list: 1.000000
generator: 1.010979
======================================================
Size of array: 100000
number of iterations 1000
maximum runtime 169.677963
normalized times:
iterator: 1.455594
list: 1.000000
generator: 1.008846
======================================================
Could you give some suggestions on improvement and comment on whether or not this benchmark can give accurate results?
I know that the condition in my decorator might bias the results. I am hoping for some suggestions regarding that.
Thanks.
First, instead of trying to duplicate everything timeit does, just use it. The time function may not have enough accuracy to be useful, and writing dozens of lines of scaffolding code (especially if it has to do hacky things like switching on func.__name__) that you don't need is just inviting bugs for no reason.
Assuming there are no bugs, it probably won't affect the results significantly. You're doing a tiny bit of extra work and charging it to testIterator, but that's only once per outer loop. But still, there's no benefit to doing it, so let's not.
# New imports needed on top of those in the question's code
import timeit
from functools import partial

def testGenerator(arr,sample):
    for i in (x for x in sample if x in arr):
        k = random.random()

def testIterator(arr,sample):
    for i in ifilter(lambda x: x in sample, arr):
        k = random.random()

def testList(arr,sample):
    for i in [x for x in sample if x in arr]:
        k = random.random()

tests = testIterator, testGenerator, testList

for arr in listArrays:
    print 'Size of array: %s ' % len(arr)
    print 'number of iterations %s' % nb_rep
    sample = random.sample(arr, nb_elem)
    funcs = [partial(test, arr, sample) for test in tests]
    times = [timeit.timeit(func, number=nb_rep) for func in funcs]
    norm = min(*times)
    print 'maximum runtime %.6f' % max(*times)
    print 'normalized times:\n iterator: %.6f \n list: %.6f \n generator: %.6f' % (times[0]/norm,times[1]/norm,times[2]/norm)
    print '======================================================'
Next, why are you doing that k = random.random() in there? From a quick test, just executing that line N times without the complex loop is 0.19x as long as the whole thing. So, you're adding 20% to each of the numbers, which dilutes the difference between them for no reason.
Once you get rid of that, the for loop is serving no purpose except to consume the iterator, and that's adding extra overhead as well. As of 2.7.3 and 3.3.0, the fastest way to consume an iterator without custom C code is deque(it, maxlen=0), so, let's try this:
def discard(it):
    collections.deque(it, maxlen=0)

def testGenerator(arr,sample):
    discard(x for x in sample if x in arr)

def testIterator(arr,sample):
    discard(ifilter(sample.__contains__, arr))

def testList(arr,sample):
    discard([x for x in sample if x in arr])
Or, alternatively, just have the functions return a generator/ifilter/list and then make the scaffolding call discard on the result (it shouldn't matter either way).
Meanwhile, for the testIterator case, are you trying to test the cost of the lambda vs. an inline expression, or the cost of ifilter vs. a generator? If you want to test the former, this is correct; if the latter, you probably want to optimize that. For example, passing sample.__contains__ instead of lambda x: x in sample seems to be 20% faster in 64-bit Python 3.3.0 and 30% faster in 32-bit 2.7.2 (although for some reason not faster at all in 64-bit 2.7.2).
Finally, unless you're just testing for exactly one implementation/platform/version, make sure to run it on as many as you can. For example, with 64-bit CPython 2.7.2, list and generator are always neck-and-neck while iterator gradually climbs from 1.0x to 1.4x as the lists grow, but in PyPy 1.9.0, iterator is always fastest, with generator and list starting off 2.1x and 1.9x slower but closing to 1.2x as the lists grow.
So, if you decided against iterator because "it's slow", you might be trading a large slowdown on PyPy for a much smaller speedup on CPython.
Of course that might be acceptable, e.g., because even the slowest PyPy run is blazingly fast, or because none of your users use PyPy, or whatever. But it's definitely part of the answer to "is this benchmark relevant?"
The question may sound complicated, but it is actually pretty simple; I just can't find a nice solution in Python.
I have ranges like
("8X5000", "8X5099"). Here X can be any digit, so I want to match numbers that fall into one of the ranges (805000..805099 or 815000..815099 or ... 895000..895099).
How can I do this?
@TimPietzcker's answer is correct and Pythonic, but it raises some performance concerns (arguably making it even more Pythonic). It creates an iterator that searches for a value, and I don't expect Python to be able to optimize the search.
This should perform better:
def IsInRange(n, r=("8X5000", "8X5099")):
    (minr, maxr) = [[int(i) for i in l.split('X')] for l in r]
    p = len(r[0]) - r[0].find('X')
    nl = (n // 10**p, n % 10**(p-1))
    fInRange = all([minr[i] <= nl[i] <= maxr[i] for i in range(2)])
    return fInRange
The second line inside the function is a nested list comprehension so may be a little hard to read but it sets:
minr = [8, 5000]
maxr = [8, 5099]
When n = 595049:
nl = (5, 5049)
The code just splits the ranges into parts (while converting to int), splits the target number into parts, then range checks the parts. It wouldn't be hard to enhance this to handle multiple X's in the range specifiers.
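One possible way to handle several X's is sketched below. It does not extend the part-splitting trick; instead it substitutes digits, which is simpler but slower. The assumptions are mine: each X is an independent wildcard, both bounds carry X in the same positions, and the same digits are substituted into both bounds. The name in_wildcard_range is hypothetical.

```python
from itertools import product


def in_wildcard_range(number, bounds=("8X5000", "8X5099")):
    """Return True if number lies in any range obtained by filling in the X's."""
    low, high = bounds
    n_wild = low.count("X")
    for digits in product("0123456789", repeat=n_wild):
        # Separate iterators so both bounds receive the same digits in order
        it_low, it_high = iter(digits), iter(digits)
        lo = int("".join(next(it_low) if ch == "X" else ch for ch in low))
        hi = int("".join(next(it_high) if ch == "X" else ch for ch in high))
        if lo <= number <= hi:
            return True
    return False


print(in_wildcard_range(805001))
print(in_wildcard_range(1255, ("XX50", "XX59")))
```

With k wildcards this tries up to 10**k substitutions, so it only suits patterns with a handful of X's.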
Update
I just tested relative performance using timeit:
import timeit

def main():
    t1 = timeit.timeit('MultiRange.in_range(985000)', setup='import MultiRange', number=10000)
    t2 = timeit.timeit('MultiRange.IsInRange(985000)', setup='import MultiRange', number=10000)
    print t1, t2
    print float(t2)/float(t1), 1 - float(t2)/float(t1)
On my 32-bit Win 7 machine running Python 2.7.2 my solution is almost 10 times faster than @TimPietzcker's (to be specific, it runs in 12% of the time). As you increase the size of the range, the gap only widens. When:
ranges=("8X5000", "8X5999")
The performance boost is 50x. Even for the smallest range, my version runs 4 times faster.
With @PaulMcGuire's suggested performance patch to in_range, my version runs 3 times faster.
Update 2
Motivated by @PaulMcGuire's comment, I went ahead and refactored our functions into classes. Here's mine:
class IsInRange5(object):
    def __init__(self, r=("8X5000", "8X5099")):
        ((self.minr0, self.minr1), (self.maxr0, self.maxr1)) = [[int(i) for i in l.split('X')] for l in r]
        pos = len(r[0]) - r[0].find('X')
        self.basel = 10**(pos-1)
        self.baseh = self.basel*10
        self.ir = range(2)

    def __contains__(self, n):
        return self.minr0 <= n // self.baseh <= self.maxr0 and \
            self.minr1 <= n % self.basel <= self.maxr1
This did close the gap, but even after pre-computing range invariants (for both), @PaulMcGuire's took 50% longer.
range = (80555,80888)
x = 80666
print range[0] < x < range[1]
Maybe that is what you're looking for...
Example for Python 3 (in Python 2, use xrange instead of range):
def in_range(number, ranges=("8X5000", "8X5099")):
    actual_ranges = ((int(ranges[0].replace("X", digit)),
                      int(ranges[1].replace("X", digit)) + 1)
                     for digit in "0123456789")
    return any(number in range(*interval) for interval in actual_ranges)
Results:
>>> in_range(805001)
True
>>> in_range(895099)
True
>>> in_range(805100)
False
An improvement to this, suggested by Paul McGuire (thanks!):
def in_range(number, ranges=("8X5000", "8X5099")):
    actual_ranges = ((int(ranges[0].replace("X", digit)),
                      int(ranges[1].replace("X", digit)))
                     for digit in "0123456789")
    return any(minval <= number <= maxval for minval, maxval in actual_ranges)