I am trying to benchmark a few method of itertools against generators and list comprehensions. The idea is that I want to build an iterator by filtering some entries from a base list.
Here is the code I came up with(Edited after accepted answer):
from itertools import ifilter
import collections
import random
import os
from timeit import Timer
os.system('cls')
# define large arrays
listArrays = [xrange(100), xrange(1000), xrange(10000), xrange(100000)]
#Number of element to be filtered out
nb_elem = 100
# Number of times we run the test
nb_rep = 1000
def discard(it):
collections.deque(it, maxlen=0)
def testGenerator(arr, sample):
discard(x for x in sample if x in arr)
def testIterator(arr, sample):
discard(ifilter(sample.__contains__, arr))
def testList(arr, sample):
discard([x for x in sample if x in arr])
if __name__ == '__main__':
for arr in listArrays:
print 'Size of array: %s ' % len(arr)
print 'number of iterations %s' % nb_rep
sample = random.sample(arr, nb_elem)
t1 = Timer('testIterator(arr, sample)', 'from __main__ import testIterator, arr, sample')
tt1 = t1.timeit(number=nb_rep)
t2 = Timer('testList(arr, sample)', 'from __main__ import testList, arr, sample')
tt2 = t2.timeit(number=nb_rep)
t3 = Timer('testGenerator(arr, sample)', 'from __main__ import testGenerator, arr, sample')
tt3 = t3.timeit(number=nb_rep)
norm = min(tt1, tt2, tt3)
print 'maximum runtime %.6f' % max(tt1, tt2, tt3)
print 'normalized times:\n iterator: %.6f \n list: %.6f \n generator: %.6f' % \
(tt1/norm, tt2/norm, tt3/norm)
print '===========================================
==========='
And the results that I get Please note that the edited version was not run on the same machine ( thus useful to have normalized results) and was ran with a 32bits interpreter with python 2.7.3 :
Size of array: 100
number of iterations 1000
maximum runtime 0.125595
normalized times:
iterator: 1.000000
list: 1.260302
generator: 1.276030
======================================================
Size of array: 1000
number of iterations 1000
maximum runtime 1.740341
normalized times:
iterator: 1.466031
list: 1.010701
generator: 1.000000
======================================================
Size of array: 10000
number of iterations 1000
maximum runtime 17.033630
normalized times:
iterator: 1.441600
list: 1.000000
generator: 1.010979
======================================================
Size of array: 100000
number of iterations 1000
maximum runtime 169.677963
normalized times:
iterator: 1.455594
list: 1.000000
generator: 1.008846
======================================================
Could you give some suggestions on improvement and comment on whether or not this benchmark can give accurate results?
I know that the condition in my decorator might bias the results. I am hoping for some suggestions regarding that.
Thanks.
First, instead of trying to duplicate everything timeit does, just use it. The time function may not have enough accuracy to be useful, and writing dozens of lines of scaffolding code (especially if it has to hacky things like switching on func.__name__) that you don't need is just inviting bugs for no reason.
Assuming there are no bugs, it probably won't affect the results significantly. You're doing a tiny bit of extra work and charging it to testIterator, but that's only once per outer loop. But still, there's no benefit to doing it, so let's not.
def testGenerator(arr,sample):
for i in (x for x in sample if x in arr):
k = random.random()
def testIterator(arr,sample):
for i in ifilter(lambda x: x in sample, arr):
k = random.random()
def testList(arr,sample):
for i in [x for x in sample if x in arr]:
k = random.random()
tests = testIterator, testGenerator, testList
for arr in listArrays:
print 'Size of array: %s ' % len(arr)
print 'number of iterations %s' % nb_rep
sample = random.sample(arr, nb_elem)
funcs = [partial(test, arr, sample) for test in tests]
times = [timeit.timeit(func, number=nb_rep) for func in funcs]
norm = min(*times)
print 'maximum runtime %.6f' % max(*times)
print 'normalized times:\n iterator: %.6f \n list: %.6f \n generator: %.6f' % (times[0]/norm,times[1]/norm,times[2]/norm)
print '======================================================'
Next, why are you doing that k = random.random() in there? From a quick test, just executing that line N times without the complex loop is 0.19x as long as the whole thing. So, you're adding 20% to each of the numbers, which dilutes the difference between them for no reason.
Once you get rid of that, the for loop is serving no purpose except to consume the iterator, and that's adding extra overhead as well. As of 2.7.3 and 3.3.0, the fastest way to consume an iterator without custom C code is deque(it, maxlen=0), so, let's try this:
def discard(it):
collections.deque(it, maxlen=0)
def testGenerator(arr,sample):
discard(x for x in sample if x in arr)
def testIterator(arr,sample):
discard(ifilter(sample.__contains__, arr))
def testList(arr,sample):
discard([x for x in sample if x in arr])
Or, alternatively, just have the functions return a generator/ifilter/list and then make the scaffolding call discard on the result (it shouldn't matter either way).
Meanwhile, for the testIterator case, are you trying to test the cost of the lambda vs. an inline expression, or the cost of ifilter vs. a generator? If you want to test the former, this is correct; if the latter, you probably want to optimize that. For example, passing sample.__contains__ instead of lambda x: x in sample seems to be 20% faster in 64-bit Python 3.3.0 and 30% faster in 32-bit 2.7.2 (although for some reason not faster at all in 64-bit 2.7.2).
Finally, unless you're just testing for exactly one implementation/platform/version, make sure to run it on as many as you can. For example, with 64-bit CPython 2.7.2, list and generator are always neck-and-neck while iterator gradually climbs from 1.0x to 1.4x as the lists grow, but in PyPy 1.9.0, iterator is always fastest, with generator and list starting off 2.1x and 1.9x slower but closing to 1.2x as the lists grow.
So, if you decided against iterator because "it's slow", you might be trading a large slowdown on PyPy for a much smaller speedup on CPython.
Of course that might be acceptable, e.g., because even the slowest PyPy run is blazingly fast, or because none of your users use PyPy, or whatever. But it's definitely part of the answer to "is this benchmark relevant?"
Related
At first, I want to test the memory usage between generator and list-comprehension.The book give me a little bench code snippet and I run it on my PC(python3.6, Windows),find something unexpected.
On the book, it said, because list-comprehension has to create a real list and allocate memory for it, itering from a list-comprehension must be slower than itering from a generator.
Ofcourse, list-comprehension use more memory than generator.
FOllowing is my code,which is not satisfy previous opinion(within sum function).
import tracemalloc
from time import time
def timeIt(func):
start = time()
func()
print('%s use time' % func.__name__, time() - start)
return func
tracemalloc.start()
numbers = range(1, 1000000)
#timeIt
def lStyle():
return sum([i for i in numbers if i % 3 == 0])
#timeIt
def gStyle():
return sum((i for i in numbers if i % 3 == 0))
lStyle()
gStyle()
shouldSize = [i for i in numbers if i % 3 == 0]
snapshotL = tracemalloc.take_snapshot()
top_stats = snapshotL.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
print(stat)
The output:
lStyle use time 0.4460000991821289
gStyle use time 0.6190001964569092
[ Top 10 ]
F:/py3proj/play.py:31: size=11.5 MiB, count=333250, average=36 B
F:/py3proj/play.py:33: size=448 B, count=1, average=448 B
F:/py3proj/play.py:22: size=136 B, count=1, average=136 B
F:/py3proj/play.py:17: size=136 B, count=1, average=136 B
F:/py3proj/play.py:14: size=76 B, count=2, average=38 B
F:/py3proj/play.py:8: size=34 B, count=1, average=34 B
Two points:
Generator use more time and same memory space.
The list-comprehension in sum function seems not create the total list
I think maybe the sum function did something i don't know.Who can explain this?
The book is High Perfoamance Python.chapter 5.But i did sth myself different from the book to check the validity in other context. And his code is here book_code,he didn't put the list-comprehension in sum funciton.
When it comes to time performance test, I do rely on the timeit module because it automatically executes multiple runs of the code.
On my system, timeit gives following results (I strongly reduced sizes because of the numerous runs):
>>> timeit.timeit("sum([i for i in numbers if i % 3 == 0])", "numbers = range(1, 1000)")
59.54427594248068
>>> timeit.timeit("sum((i for i in numbers if i % 3 == 0))", "numbers = range(1, 1000)")
64.36398425334801
So the generator is slower by about 8% (*). This is not really a surprize because the generator has to execute some code on the fly to get next value, while a precomputed list only increment its current pointer.
Memory evalutation is IMHO more complex, so I have used Compute Memory footprint of an object and its contents (Python recipe) from activestate
>>> numbers = range(1, 100)
>>> numbers = range(1, 100000)
>>> l = [i for i in numbers if i % 3 == 0]
>>> g = (i for i in numbers if i % 3 == 0)
>>> total_size(l)
1218708
>>> total_size(g)
88
>>> total_size(numbers)
48
My interpretation is that a list uses memory for all of its items (which is not a surprize), while a generator only need few values and some code so the memory footprint is much lesser for the generator.
I strongly think that you have used tracemalloc for something it is not intended for. It is aimed at searching possible memory leaks (large blocs of memory never deallocated) and not at controlling the memory used by individual objects.
BEWARE: I could only test for small sizes. But for very large sizes, the list could exhaust the available memory and the machine will use virtual memory from swap. In that case, the list version will become much slower. More details there
Why is the first method so slow?
It can be up to 1000 times slower, any ideas on how to make it faster?
In this case, performance is number one priority. In my first attempt I tried to make it multipricessing, but it was quite slow as well.
Python - Set the first element of a generator - Applied to itertools
import time
import operator as op
from math import factorial
from itertools import combinations
def nCr(n, r):
# https://stackoverflow.com/a/4941932/1167783
r = min(r, n-r)
if r == 0:
return 1
numer = reduce(op.mul, xrange(n, n-r, -1))
denom = reduce(op.mul, xrange(1, r+1))
return numer // denom
def kthCombination(k, l, r):
# https://stackoverflow.com/a/1776884/1167783
if r == 0:
return []
elif len(l) == r:
return l
else:
i = nCr(len(l)-1, r-1)
if k < i:
return l[0:1] + kthCombination(k, l[1:], r-1)
else:
return kthCombination(k-i, l[1:], r)
def iter_manual(n, p):
numbers_list = [i for i in range(n)]
for comb in xrange(factorial(n)/(factorial(p)*factorial(n-p))):
x = kthCombination(comb, numbers_list, p)
# Do something, for example, store those combinations
# For timing i'm going to do something simple
def iter(n, p):
for i in combinations([i for i in range(n)], p):
# Do something, for example, store those combinations
# For timing i'm going to do something simple
x = i
#############################
if __name__ == "__main__":
n = 40
p = 5
print '%s combinations' % (factorial(n)/(factorial(p)*factorial(n-p)))
t0_man = time.time()
iter_manual(n, p)
t1_man = time.time()
total_man = t1_man - t0_man
t0_iter = time.time()
iter(n, p)
t1_iter = time.time()
total_iter = t1_iter - t0_iter
print 'Manual: %s' %total_man
print 'Itertools: %s' %total_iter
print 'ratio: %s' %(total_man/total_iter)
There are several factors at play here.
The most important is garbage collection. Any method that generates a lot of unnecessary allocations is going to be slow because of GC pauses. In this vein, list comprehensions are fast (for Python) because they are highly optimized under the hood in their allocation and execution. Wherever speed is important, prefer list comprehensions.
Next up you've got function calls. Function calls are relatively expensive as #roganjosh points out in the comments. This is (again) particularly true if the function generates a lot of garbage or holds on to long-lived closures.
Now we come to loops. Garbage is again the biggest concern, hoist your variables outside the loop and reuse them on each iteration.
Last but certainly not least is that Python is, in a sense, a hosted language: generally on the CPython runtime. Anything implemented in the runtime itself (particularly if the thing in question is implemented in C rather than Python itself) is going to be faster than your (logically equivalent) code.
NOTE
All of this advice is detrimental to code quality. Use with caution. Profile first. Also note that compilers are generally smart enough to do all of this for you, for instance PyPy will generally run faster for the same code than the standard Python runtime because it does optimizations like this for you when it runs your code.
NOTE 2
One of the implementations uses reduce. In theory, reduce could be fast. But it isn't for lots of reasons, the chief of which could possibly be summed up as "Guido didn't/doesn't care". So don't use reduce when speed is important.
I expected this Python implementation of ThreeSum to be slow:
def count(a):
"""ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
N = len(a)
cnt = 0
for i in range(N):
for j in range(i+1, N):
for k in range(j+1, N):
if sum([a[i], a[j], a[k]]) == 0:
cnt += 1
return cnt
But I was shocked that this version looks pretty slow too:
def count_python(a):
"""ThreeSum using itertools"""
return sum(map(lambda X: sum(X)==0, itertools.combinations(a, r=3)))
Can anyone recommend a faster Python implementation? Both implementations just seem so slow...
Thanks
...
ANSWER SUMMARY:
Here is how the runs of all the various versions provided in this thread of the O(N^3) (for educational purposes, not used in real life) version worked out on my machine:
56 sec RUNNING count_slow...
28 sec RUNNING count_itertools, written by Ashwini Chaudhary...
14 sec RUNNING count_fixed, written by roippi...
11 sec RUNNING count_itertools (faster), written by Veedrak...
08 sec RUNNING count_enumerate, written by roippi...
*Note: Needed to modify Veedrak's solution to this to get the correct count output:
sum(1 for x, y, z in itertools.combinations(a, r=3) if x+y==-z)
Supplying a second answer. From various comments, it looks like you're primarily concerned about why this particular O(n**3) algorithm is slow when being ported over from java. Let's dive in.
def count(a):
"""ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
N = len(a)
cnt = 0
for i in range(N):
for j in range(i+1, N):
for k in range(j+1, N):
if sum([a[i], a[j], a[k]]) == 0:
cnt += 1
return cnt
One major problem that immediately pops out is that you're doing something your java code almost certainly isn't doing: materializing a 3-element list just to add three numbers together!
if sum([a[i], a[j], a[k]]) == 0:
Yuck! Just write that as
if a[i] + a[j] + a[k] == 0:
Some benchmarking shows that you're adding 50%+ overhead just by doing that. Yikes.
The other issue here is that you're using indexing where you should be using iteration. In python try to avoid writing code like this:
for i in range(len(some_list)):
do_something(some_list[i])
And instead just write:
for x in some_list:
do_something(x)
And if you explicitly need the index that you're on (as you actually do in your code), use enumerate:
for i,x in enumerate(some_list):
#etc
This is, in general, a style thing (though it goes deeper than that, with duck typing and the iterator protocol) - but it is also a performance thing. In order to look up the value of a[i], that call is converted to a.__getitem__(i), then python has to dynamically resolve a __getitem__ method lookup, call it, and return the value. Every time. It's not a crazy amount of overhead - at least on builtin types - but it adds up if you're doing it a lot in a loop. Treating a as an iterable, on the other hand, sidesteps a lot of that overhead.
So taking that change in mind, you can rewrite your function once again:
def count_enumerate(a):
cnt = 0
for i, x in enumerate(a):
for j, y in enumerate(a[i+1:], i+1):
for z in a[j+1:]:
if x + y + z == 0:
cnt += 1
return cnt
Let's look at some timings:
%timeit count(range(-100,100))
1 loops, best of 3: 394 ms per loop
%timeit count_fixed(range(-100,100)) #just fixing your sum() line
10 loops, best of 3: 158 ms per loop
%timeit count_enumerate(range(-100,100))
10 loops, best of 3: 88.9 ms per loop
And that's about as fast as it's going to go. You can shave off a percent or so by wrapping everything in a comprehension instead of doing cnt += 1 but that's pretty minor.
I've toyed around with a few itertools implementations but I actually can't get them to go faster than this explicit loop version. This makes sense if you think about it - for every iteration, the itertools.combinations version has to rebind what all three variables refer to, whereas the explicit loops get to "cheat" and rebind the variables in the outer loops far less often.
Reality check time, though: after everything is said and done, you can still expect cPython to run this algorithm an order of magnitude slower than a modern JVM would. There is simply too much abstraction built in to python that gets in the way of looping quickly. If you care about speed (and you can't fix your algorithm - see my other answer), either use something like numpy to spend all of your time looping in C, or use a different implementation of python.
postscript: pypy
For fun, I ran count_fixed on a 1000-element list, on both cPython and pypy.
cPython:
In [81]: timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
Out[81]: 19.230753898620605
pypy:
>>>> timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
0.6961538791656494
Speedy!
I might add some java testing in later to compare :-)
Algorithmically, both versions of your function are O(n**3) - so asymptotically neither is superior. You will find that the itertools version is in practice somewhat faster since it spends more time looping in C rather than in python bytecode. You can get it down a few more percentage points by removing map entirely (especially if you're running py2) but it's still going to be "slow" compared to whatever times you got from running it in a JVM.
Note that there are plenty of python implementations other than cPython out there - for loopy code, pypy tends to be much faster than cPython. So I wouldn't write python-as-a-language off as being slow, necessarily, but I would certainly say that the reference implementation of python is not known for its blazing loop speed. Give other python flavors a shot if that's something you care about.
Specific to your algorithm, an optimization will let you drop it down to O(n**2). Build up a set of your integers, s, and build up all pairs (a,b). You know that you can "zero out" (a+b) if and only if -(a+b) in (s - {a,b}).
Thanks to #Veedrak: unfortunately constructing s - {a,b} is a slow O(len(s)) operation itself - so simply check if -(a+b) is equal to either a or b. If it is, you know there's no third c that can fulfill a+b+c == 0 since all numbers in your input are distinct.
def count_python_faster(a):
s = frozenset(a)
return sum(1 for x,y in itertools.combinations(a,2)
if -(x+y) not in (x,y) and -(x+y) in s) // 3
Note the divide-by-three at the end; this is because each successful combination is triple-counted. It's possible to avoid that but it doesn't actually speed things up and (imo) just complicates the code.
Some timings for the curious:
%timeit count(range(-100,100))
1 loops, best of 3: 407 ms per loop
%timeit count_python(range(-100,100)) #this is about 100ms faster on py3
1 loops, best of 3: 382 ms per loop
%timeit count_python_faster(range(-100,100))
100 loops, best of 3: 5.37 ms per loop
You haven't stated which version of Python you're using.
In Python 3.x, a generator expression is around 10% faster than either of the two implementations you listed. Using a random array of 100 numbers in the range [-100,100] for a:
count(a) -> 8.94 ms # as per your implementation
count_python(a) -> 8.75 ms # as per your implementation
def count_generator(a):
return sum((sum(x) == 0 for x in itertools.combinations(a,r=3)))
count_generator(a) -> 7.63 ms
But other than that, it's the shear amount of combinations that's dominating execution time - O(N^3).
I should add the times shown above are for loops of 10 calls each, averaged over 10 loops. And yeah, my laptop is slow too :)
I have a function
def getSamples():
p = lambda x : mlab.normpdf(x,3,2) + mlab.normpdf(x,-5,1)
q = lambda x : mlab.normpdf(x,5,14)
k=30
goodSamples = []
rightCount = 0
totalCount = 0
while(rightCount < 100000):
z0 = np.random.normal(5, 14)
u0 = np.random.uniform(0,k*q(z0))
if(p(z0) > u0):
goodSamples.append(z0)
rightCount += 1
totalCount += 1
return np.array(goodSamples)
My implementation to generate 100000 samples is taking much long. How can I make it fast with itertools or something similar?
I would say that the secret to making this code faster does not lie in changing the loop syntax. Here are a few points:
np.random.normal has an additional parameter size that lets you get many values at once. I would suggest using an array of say 1E09 elements and then checking your condition on that for how many are good. You can then estimate how likely that is.
To create your uniform samples, why not use sympy for symbolic evaluation of the pdf? (I don't know if this is faster but it could be since you already know the mean and variance.)
Again, for p could you use a symbolic function?
In general, performance problems are caused by doing things the "wrong way". Numpy can be very fast when used as it is designed to be used, that is by exploiting its vector processing where these vectorized operations are handed off to compiled code. Two bad practices that come from other programing languages/approaches are
Loops: Whenever you think you need a loop stop and think. Most of the time you do not and in fact do not even want one. It is much faster both to write and run code without loops.
Memory allocation: Whenever you know the size of an object, preallocate space for it. Growing memory, particularly in Python lists, is very slow compared to the alternatives.
In this case it is easy to get (approximately) two orders of magnitude speedup; the tradeoff is more memory usage.
Below is some representative code, it is not meant to be blindly used. I have not even verified it produces the correct results. It is more or less a direct translation of your routine. It appears you are drawing random numbers from a probability distribution using the rejection method. There may be more efficient algorithms to do this for your probability distribution.
def getSamples2() :
p = lambda x : mlab.normpdf(x,3,2) + mlab.normpdf(x,-5,1)
q = lambda x : mlab.normpdf(x,5,14)
k=30
N = 100000 # Total number of samples we want
Ngood = 0 # Current number of good samples
goodSamples = np.zeros(N) # Storage for the good samples
while Ngood < N : # Unfortunately a loop, ....
z0 = np.random.normal(5, 14, size=N)
u0 = np.random.uniform(size=N)*k*q(z0)
ind, = np.where(p(z0) > u0)
n = min(len(ind), N-Ngood)
goodSamples[Ngood:Ngood+n] = z0[ind[:n]]
Ngood += n
return goodSamples
This generates random numbers in chunks and saves the good ones. I have not tried to optimize the chunk size (here I just use N, the total number we want, in principle this could/should be different and could even be adjusted based on the number we have left to generate). This still uses a loop, unfortunately, but now this will be run "tens" of times instead of 100,000 times. This also uses the where function and array slicing; these are good general tools to be comfortable with.
In one test with %timeit on my machine I found
In [27]: %timeit getSamples() # Original routine
1 loops, best of 3: 49.3 s per loop
In [28]: %timeit getSamples2()
1 loops, best of 3: 505 ms per loop
Here is kinda itertools "magic", but I'm not sure it can help. Probably it's much better for perfomance to prepare an numpy array (using zeros) and fill it without creating python auto-growing list. Here is both itertools and zero-preparations. (Excuse me in advance for untested code)
from itertools import count, ifilter, imap, takewhile
import operator
def getSamples():
p = lambda x : mlab.normpdf(x, 3, 2) + mlab.normpdf(x, -5, 1)
q = lambda x : mlab.normpdf(x, 5, 14)
k = 30
n = 100000
samples_iter = imap(
operator.itemgetter(1),
takewhile(
lambda i, s: i < n,
enumerate(
ifilter(lambda z: p(z) > np.random.uniform(0,k*q(z)),
(np.random.normal(5, 14) for _ in count()))
)))
goodSamples = numpy.zeros(n)
# set values from iterator, probably there is a better way for that
for i, sample in enumerate(samples_iter):
goodSamples[i] = sample
return goodSamples
I recently posted a question using a lambda function and in a reply someone had mentioned lambda is going out of favor, to use list comprehensions instead. I am relatively new to Python. I ran a simple test:
import time
S=[x for x in range(1000000)]
T=[y**2 for y in range(300)]
#
#
time1 = time.time()
N=[x for x in S for y in T if x==y]
time2 = time.time()
print 'time diff [x for x in S for y in T if x==y]=', time2-time1
#print N
#
#
time1 = time.time()
N=filter(lambda x:x in S,T)
time2 = time.time()
print 'time diff filter(lambda x:x in S,T)=', time2-time1
#print N
#
#
#http://snipt.net/voyeg3r/python-intersect-lists/
time1 = time.time()
N = [val for val in S if val in T]
time2 = time.time()
print 'time diff [val for val in S if val in T]=', time2-time1
#print N
#
#
time1 = time.time()
N= list(set(S) & set(T))
time2 = time.time()
print 'time diff list(set(S) & set(T))=', time2-time1
#print N #the results will be unordered as compared to the other ways!!!
#
#
time1 = time.time()
N=[]
for x in S:
for y in T:
if x==y:
N.append(x)
time2 = time.time()
print 'time diff using traditional for loop', time2-time1
#print N
They all print the same N so I commented that print stmt out (except the last way it's unordered), but the resulting time differences were interesting over repeated tests as seen in this one example:
time diff [x for x in S for y in T if x==y]= 54.875
time diff filter(lambda x:x in S,T)= 0.391000032425
time diff [val for val in S if val in T]= 12.6089999676
time diff list(set(S) & set(T))= 0.125
time diff using traditional for loop 54.7970001698
So while I find list comprehensions on the whole easier to read, there seems to be some performance issues at least in this example.
So, two questions:
Why is lambda etc being pushed aside?
For the list comprehension ways, is there a more efficient implementation and how would you KNOW it's more efficient without testing? I mean, lambda/map/filter was supposed to be less efficient because of the extra function calls, but it seems to be MORE efficient.
Paul
Your tests are doing very different things. With S being 1M elements and T being 300:
[x for x in S for y in T if x==y]= 54.875
This option does 300M equality comparisons.
filter(lambda x:x in S,T)= 0.391000032425
This option does 300 linear searches through S.
[val for val in S if val in T]= 12.6089999676
This option does 1M linear searches through T.
list(set(S) & set(T))= 0.125
This option does two set constructions and one set intersection.
The differences in performance between these options is much more related to the algorithms each one is using, rather than any difference between list comprehensions and lambda.
When I fix your code so that the list comprehension and the call to filter are actually doing the same work things change a whole lot:
import time
S=[x for x in range(1000000)]
T=[y**2 for y in range(300)]
#
#
time1 = time.time()
N=[x for x in T if x in S]
time2 = time.time()
print 'time diff [x for x in T if x in S]=', time2-time1
#print N
#
#
time1 = time.time()
N=filter(lambda x:x in S,T)
time2 = time.time()
print 'time diff filter(lambda x:x in S,T)=', time2-time1
#print N
Then the output is more like:
time diff [x for x in T if x in S]= 0.414485931396
time diff filter(lambda x:x in S,T)= 0.466315984726
So the list comprehension has a time that's generally pretty close to and usually less than the lambda expression.
The reason lambda expressions are being phased out is that many people think they are a lot less readable than list comprehensions. I sort of reluctantly agree.
Q: Why is lambda etc being pushed aside?
A: List comprehensions and generator expressions are generally considered to be a nice mix of power and readability. The pure functional-programming style where you use map(), reduce(), and filter() with functions (often lambda functions) is considered not as clear. Also, Python has added built-in functions that nicely handle all the major uses for reduce().
Suppose you wanted to sum a list. Here are two ways of doing it.
lst = range(10)
print reduce(lambda x, y: x + y, lst)
print sum(lst)
Sign me up as a fan of sum() and not a fan of reduce() to solve this problem. Here's another, similar problem:
lst = range(10)
print reduce(lambda x, y: bool(x or y), lst)
print any(lst)
Not only is the any() solution easier to understand, but it's also much faster; it has short-circuit evaluation, such that it will stop evaluating as soon as it has found any true value. The reduce() has to crank through the entire list. This performance difference would be stark if the list was a million items long, and the first item evaluated true. By the way, any() was added in Python 2.5; if you don't have it, here is a version for older versions of Python:
def any(iterable):
for x in iterable:
if x:
return True
return False
Suppose you wanted to make a list of squares of even numbers from some list.
lst = range(10)
print map(lambda x: x**2, filter(lambda x: x % 2 == 0, lst))
print [x**2 for x in lst if x % 2 == 0]
Now suppose you wanted to sum that list of squares.
lst = range(10)
print sum(map(lambda x: x**2, filter(lambda x: x % 2 == 0, lst)))
# list comprehension version of the above
print sum([x**2 for x in lst if x % 2 == 0])
# generator expression version; note the lack of '[' and ']'
print sum(x**2 for x in lst if x % 2 == 0)
The generator expression actually just returns an iterable object. sum() takes the iterable and pulls values from it, one by one, summing as it goes, until all the values are consumed. This is the most efficient way you can solve this problem in Python. In contrast, the map() solution, and the equivalent solution with a list comprehension inside the call to sum(), must first build a list; this list is then passed to sum(), used once, and discarded. The time to build the list and then delete it again is just wasted. (EDIT: and note that the version with both map and filter must build two lists, one built by filter and one built by map; both lists are discarded.) (EDIT: But in Python 3.0 and newer, map() and filter() are now both "lazy" and produce an iterator instead of a list; so this point is less true than it used to be. Also, in Python 2.x you were able to use itertools.imap() and itertools.ifilter() for iterator-based map and filter. But I continue to prefer the generator expression solutions over any map/filter solutions.)
By composing map(), filter(), and reduce() in combination with lambda functions, you can do many powerful things. But Python has idiomatic ways to solve the same problems which are simultaneously better performing and easier to read and understand.
Many people have already pointed out that you're comparing apples with oranges, etc, etc. But I think nobody's shown how to a really simple comparison -- list comprehension vs map plus lambda with little else to get in the way -- and that might be:
$ python -mtimeit -s'L=range(1000)' 'map(lambda x: x+1, L)'
1000 loops, best of 3: 328 usec per loop
$ python -mtimeit -s'L=range(1000)' '[x+1 for x in L]'
10000 loops, best of 3: 129 usec per loop
Here, you can see very sharply the cost of lambda -- about 200 microseconds, which in the case of a sufficiently simple operation such as this one swamps the operation itself.
Numbers are very similar with filter of course, since the problem is not filter or map, but rather the lambda itself:
$ python -mtimeit -s'L=range(1000)' '[x for x in L if not x%7]'
10000 loops, best of 3: 162 usec per loop
$ python -mtimeit -s'L=range(1000)' 'filter(lambda x: not x%7, L)'
1000 loops, best of 3: 334 usec per loop
No doubt the fact that lambda can be less clear, or its weird connection with Sparta (Spartans had a Lambda, for "Lakedaimon", painted on their shields -- this suggests that lambda is pretty dictatorial and bloody;-) have at least as much to do with its slowly falling out of fashion, as its performance costs. But the latter are quite real.
First of all, test like this:
import timeit
S=[x for x in range(10000)]
T=[y**2 for y in range(30)]
print "v1", timeit.Timer('[x for x in S for y in T if x==y]',
'from __main__ import S,T').timeit(100)
print "v2", timeit.Timer('filter(lambda x:x in S,T)',
'from __main__ import S,T').timeit(100)
print "v3", timeit.Timer('[val for val in T if val in S]',
'from __main__ import S,T').timeit(100)
print "v4", timeit.Timer('list(set(S) & set(T))',
'from __main__ import S,T').timeit(100)
And basically you are doing different things each time you test. When you would rewrite the list-comprehension for example as
[val for val in T if val in S]
performance would be on par with the 'lambda/filter' construct.
Sets are the correct solution for this. However try swapping S and T and see how long it takes!
filter(lambda x:x in T,S)
$ python -m timeit -s'S=[x for x in range(1000000)];T=[y**2 for y in range(300)]' 'filter(lambda x:x in S,T)'
10 loops, best of 3: 485 msec per loop
$ python -m timeit -r1 -n1 -s'S=[x for x in range(1000000)];T=[y**2 for y in range(300)]' 'filter(lambda x:x in T,S)'
1 loops, best of 1: 19.6 sec per loop
So you see that the order of S and T are quite important
Changing the order of the list comprehension to match the filter gives
$ python -m timeit -s'S=[x for x in range(1000000)];T=[y**2 for y in range(300)]' '[x for x in T if x in S]'
10 loops, best of 3: 441 msec per loop
So if fact the list comprehension is slightly faster than the lambda on my computer
Your list comprehension and lambda are doing different things, the list comprehension matching the lambda would be [val for val in T if val in S].
Efficiency isn't the reason why list comprehension are preferred (while they actually are slightly faster in almost all cases). The reason why they are preferred is readability.
Try it with smaller loop body and larger loops, like make T a set, and iterate over S. In that case on my machine the list comprehension is nearly twice as fast.
Your profiling is done wrong. Take a look the timeit module and try again.
lambda defines anonymous functions. Their main problem is that many people don't know the whole python library and use them to re-implement functions that are already in the operator, functools etc module ( and much faster ).
List comprehensions have nothing to do with lambda. They are equivalent to the standard filter and map functions from functional languages. LCs are preferred because they can be used as generators too, not to mention readability.
This is pretty fast:
def binary_search(a, x, lo=0, hi=None):
if hi is None:
hi = len(a)
while lo < hi:
mid = (lo+hi)//2
midval = a[mid]
if midval < x:
lo = mid+1
elif midval > x:
hi = mid
else:
return mid
return -1
time1 = time.time()
N = [x for x in T if binary_search(S, x) >= 0]
time2 = time.time()
print 'time diff binary search=', time2-time1
Simply: less comparisions, less time.
List comprehensions can make a bigger difference if you have to process your filtered results. In your case you just build a list, but if you had to do something like this:
n = [f(i) for i in S if some_condition(i)]
you would gain from LC optimization over this:
n = map(f, filter(some_condition(i), S))
simply because the latter has to build an intermediate list (or tuple, or string, depending on the nature of S). As a consequence you will also notice a different impact on the memory used by each method, the LC will keep lower.
The lambda in itself does not matter.