I recently posted a question using a lambda function, and in a reply someone mentioned that lambda is going out of favor and that list comprehensions should be used instead. I am relatively new to Python. I ran a simple test:
import time
S=[x for x in range(1000000)]
T=[y**2 for y in range(300)]
#
#
time1 = time.time()
N=[x for x in S for y in T if x==y]
time2 = time.time()
print 'time diff [x for x in S for y in T if x==y]=', time2-time1
#print N
#
#
time1 = time.time()
N=filter(lambda x:x in S,T)
time2 = time.time()
print 'time diff filter(lambda x:x in S,T)=', time2-time1
#print N
#
#
#http://snipt.net/voyeg3r/python-intersect-lists/
time1 = time.time()
N = [val for val in S if val in T]
time2 = time.time()
print 'time diff [val for val in S if val in T]=', time2-time1
#print N
#
#
time1 = time.time()
N= list(set(S) & set(T))
time2 = time.time()
print 'time diff list(set(S) & set(T))=', time2-time1
#print N #the results will be unordered as compared to the other ways!!!
#
#
time1 = time.time()
N=[]
for x in S:
    for y in T:
        if x==y:
            N.append(x)
time2 = time.time()
print 'time diff using traditional for loop', time2-time1
#print N
They all produce the same N, so I commented those print statements out (except that the set-based way comes out unordered), but the resulting time differences were interesting over repeated tests, as seen in this one example:
time diff [x for x in S for y in T if x==y]= 54.875
time diff filter(lambda x:x in S,T)= 0.391000032425
time diff [val for val in S if val in T]= 12.6089999676
time diff list(set(S) & set(T))= 0.125
time diff using traditional for loop 54.7970001698
So while I find list comprehensions on the whole easier to read, there seem to be some performance issues, at least in this example.
So, two questions:
Why is lambda etc being pushed aside?
For the list comprehension ways, is there a more efficient implementation and how would you KNOW it's more efficient without testing? I mean, lambda/map/filter was supposed to be less efficient because of the extra function calls, but it seems to be MORE efficient.
Paul
Your tests are doing very different things. With S being 1M elements and T being 300:
[x for x in S for y in T if x==y]= 54.875
This option does 300M equality comparisons.
filter(lambda x:x in S,T)= 0.391000032425
This option does 300 linear searches through S.
[val for val in S if val in T]= 12.6089999676
This option does 1M linear searches through T.
list(set(S) & set(T))= 0.125
This option does two set constructions and one set intersection.
The differences in performance between these options are due much more to the algorithm each one uses than to any difference between list comprehensions and lambda.
When I fix your code so that the list comprehension and the call to filter are actually doing the same work things change a whole lot:
import time
S=[x for x in range(1000000)]
T=[y**2 for y in range(300)]
#
#
time1 = time.time()
N=[x for x in T if x in S]
time2 = time.time()
print 'time diff [x for x in T if x in S]=', time2-time1
#print N
#
#
time1 = time.time()
N=filter(lambda x:x in S,T)
time2 = time.time()
print 'time diff filter(lambda x:x in S,T)=', time2-time1
#print N
Then the output is more like:
time diff [x for x in T if x in S]= 0.414485931396
time diff filter(lambda x:x in S,T)= 0.466315984726
So the list comprehension's time is generally pretty close to, and usually less than, the lambda expression's.
The reason lambda expressions are being phased out is that many people think they are a lot less readable than list comprehensions. I sort of reluctantly agree.
Q: Why is lambda etc being pushed aside?
A: List comprehensions and generator expressions are generally considered to be a nice mix of power and readability. The pure functional-programming style where you use map(), reduce(), and filter() with functions (often lambda functions) is considered not as clear. Also, Python has added built-in functions that nicely handle all the major uses for reduce().
Suppose you wanted to sum a list. Here are two ways of doing it.
lst = range(10)
print reduce(lambda x, y: x + y, lst)
print sum(lst)
Sign me up as a fan of sum() and not a fan of reduce() to solve this problem. Here's another, similar problem:
lst = range(10)
print reduce(lambda x, y: bool(x or y), lst)
print any(lst)
Not only is the any() solution easier to understand, but it's also much faster; it has short-circuit evaluation, such that it will stop evaluating as soon as it has found any true value. The reduce() has to crank through the entire list. This performance difference would be stark if the list was a million items long, and the first item evaluated true. By the way, any() was added in Python 2.5; if you don't have it, here is a version for older versions of Python:
def any(iterable):
    for x in iterable:
        if x:
            return True
    return False
Suppose you wanted to make a list of squares of even numbers from some list.
lst = range(10)
print map(lambda x: x**2, filter(lambda x: x % 2 == 0, lst))
print [x**2 for x in lst if x % 2 == 0]
Now suppose you wanted to sum that list of squares.
lst = range(10)
print sum(map(lambda x: x**2, filter(lambda x: x % 2 == 0, lst)))
# list comprehension version of the above
print sum([x**2 for x in lst if x % 2 == 0])
# generator expression version; note the lack of '[' and ']'
print sum(x**2 for x in lst if x % 2 == 0)
The generator expression actually just returns an iterable object. sum() takes the iterable and pulls values from it, one by one, summing as it goes, until all the values are consumed. This is the most efficient way you can solve this problem in Python. In contrast, the map() solution, and the equivalent solution with a list comprehension inside the call to sum(), must first build a list; this list is then passed to sum(), used once, and discarded. The time to build the list and then delete it again is just wasted. (EDIT: and note that the version with both map and filter must build two lists, one built by filter and one built by map; both lists are discarded.) (EDIT: But in Python 3.0 and newer, map() and filter() are now both "lazy" and produce an iterator instead of a list; so this point is less true than it used to be. Also, in Python 2.x you were able to use itertools.imap() and itertools.ifilter() for iterator-based map and filter. But I continue to prefer the generator expression solutions over any map/filter solutions.)
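For concreteness, here is a minimal sketch (Python 2, matching the print statements above) of the iterator-based alternative that the EDIT mentions: itertools.imap and itertools.ifilter feed sum() one value at a time instead of building intermediate lists.
import itertools
lst = range(10)
# same result as the map/filter version above, but no intermediate lists are built
print sum(itertools.imap(lambda x: x**2,
                         itertools.ifilter(lambda x: x % 2 == 0, lst)))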
By composing map(), filter(), and reduce() in combination with lambda functions, you can do many powerful things. But Python has idiomatic ways to solve the same problems which are simultaneously better performing and easier to read and understand.
Many people have already pointed out that you're comparing apples with oranges, etc, etc. But I think nobody's shown a really simple comparison -- list comprehension vs map plus lambda with little else to get in the way -- and that might be:
$ python -mtimeit -s'L=range(1000)' 'map(lambda x: x+1, L)'
1000 loops, best of 3: 328 usec per loop
$ python -mtimeit -s'L=range(1000)' '[x+1 for x in L]'
10000 loops, best of 3: 129 usec per loop
Here, you can see very sharply the cost of lambda -- about 200 microseconds, which in the case of a sufficiently simple operation such as this one swamps the operation itself.
Numbers are very similar with filter of course, since the problem is not filter or map, but rather the lambda itself:
$ python -mtimeit -s'L=range(1000)' '[x for x in L if not x%7]'
10000 loops, best of 3: 162 usec per loop
$ python -mtimeit -s'L=range(1000)' 'filter(lambda x: not x%7, L)'
1000 loops, best of 3: 334 usec per loop
No doubt the fact that lambda can be less clear, or its weird connection with Sparta (Spartans had a Lambda, for "Lakedaimon", painted on their shields -- this suggests that lambda is pretty dictatorial and bloody;-) have at least as much to do with its slowly falling out of fashion, as its performance costs. But the latter are quite real.
First of all, test like this:
import timeit
S=[x for x in range(10000)]
T=[y**2 for y in range(30)]
print "v1", timeit.Timer('[x for x in S for y in T if x==y]',
'from __main__ import S,T').timeit(100)
print "v2", timeit.Timer('filter(lambda x:x in S,T)',
'from __main__ import S,T').timeit(100)
print "v3", timeit.Timer('[val for val in T if val in S]',
'from __main__ import S,T').timeit(100)
print "v4", timeit.Timer('list(set(S) & set(T))',
'from __main__ import S,T').timeit(100)
And basically you are doing different things each time you test. If you rewrite the list comprehension, for example, as
[val for val in T if val in S]
performance would be on par with the 'lambda/filter' construct.
Sets are the correct solution for this. However try swapping S and T and see how long it takes!
filter(lambda x:x in T,S)
$ python -m timeit -s'S=[x for x in range(1000000)];T=[y**2 for y in range(300)]' 'filter(lambda x:x in S,T)'
10 loops, best of 3: 485 msec per loop
$ python -m timeit -r1 -n1 -s'S=[x for x in range(1000000)];T=[y**2 for y in range(300)]' 'filter(lambda x:x in T,S)'
1 loops, best of 1: 19.6 sec per loop
So you see that the order of S and T is quite important.
Changing the order of the list comprehension to match the filter gives
$ python -m timeit -s'S=[x for x in range(1000000)];T=[y**2 for y in range(300)]' '[x for x in T if x in S]'
10 loops, best of 3: 441 msec per loop
So in fact the list comprehension is slightly faster than the lambda on my computer.
Your list comprehension and lambda are doing different things; the list comprehension matching the lambda would be [val for val in T if val in S].
Efficiency isn't the reason why list comprehensions are preferred (though they actually are slightly faster in almost all cases). The reason they are preferred is readability.
Try it with a smaller loop body and larger loops: for example, make T a set and iterate over S, as in the sketch below. In that case, on my machine, the list comprehension is nearly twice as fast.
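A rough sketch of that suggestion, using the S and T from the question (the name T_set is mine):
# make T a set so each membership test is O(1), then iterate over the big list S
T_set = set(T)
N = [x for x in S if x in T_set]            # list comprehension
N_filter = filter(lambda x: x in T_set, S)  # lambda/filter equivalent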
Your profiling is done wrong. Take a look at the timeit module and try again.
lambda defines anonymous functions. Their main problem is that many people don't know the whole Python library and use them to re-implement functions that already exist in the operator, functools, etc. modules (and are much faster).
List comprehensions have nothing to do with lambda. They are equivalent to the standard filter and map functions from functional languages. LCs are preferred because they can be used as generators too, not to mention readability.
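To illustrate the point about the operator module, a small sketch (Python 2 prints, matching the rest of the thread): both lines compute elementwise products, but operator.mul avoids writing, and calling, a Python-level lambda.
import operator
a, b = [1, 2, 3], [4, 5, 6]
print map(lambda x, y: x * y, a, b)  # [4, 10, 18], via a lambda
print map(operator.mul, a, b)        # [4, 10, 18], via the built-in operator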
This is pretty fast:
def binary_search(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        midval = a[mid]
        if midval < x:
            lo = mid+1
        elif midval > x:
            hi = mid
        else:
            return mid
    return -1
time1 = time.time()
N = [x for x in T if binary_search(S, x) >= 0]
time2 = time.time()
print 'time diff binary search=', time2-time1
Simply: fewer comparisons, less time.
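For what it's worth, the standard library's bisect module can stand in for the hand-rolled binary search above, under the same assumption that S is sorted; a minimal sketch (contains_sorted is my name):
import bisect

def contains_sorted(a, x):
    # binary search via bisect; assumes a is sorted
    i = bisect.bisect_left(a, x)
    return i < len(a) and a[i] == x

N = [x for x in T if contains_sorted(S, x)]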
List comprehensions can make a bigger difference if you have to process your filtered results. In your case you just build a list, but if you had to do something like this:
n = [f(i) for i in S if some_condition(i)]
you would gain from LC optimization over this:
n = map(f, filter(some_condition, S))
simply because the latter has to build an intermediate list (or tuple, or string, depending on the nature of S). As a consequence, you will also notice a different impact on the memory used by each method: the LC keeps it lower.
The lambda in itself does not matter.
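If the filtered, mapped result only needs to be iterated once, a generator expression avoids materializing even the final list; a minimal sketch reusing the hypothetical f and some_condition from above (process is a placeholder too):
n = (f(i) for i in S if some_condition(i))  # nothing is computed until n is consumed
for item in n:
    process(item)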
Related
I am currently learning Python 3 and thought some speed comparison could be neat. Thus, I created some in-place and some temporary list functions, which simply add 1 to a long list of integers.
However, the results were really surprising to me... it seems that Python 3 is slower in every case tested: several for loop variants, a while loop, the map function, and list comprehensions.
Of course (see the code below) it's just a comparison of in-place list mutations or loops that build up and return a temporary list... so e.g. the "standard" map of Python 3, which just creates an iterator, is still fast, but at some point one usually wants to print, write or somehow visualize the data. Hence, something like list(map()) should be relatively common in Python 3?!
What am I missing here?
Is Python 3 really slower than Python 2 for loops, comprehensions and map? If so, why?
Python 3.4 timings:
> python3.4 func_speed_tests.py
list_comprehension.........1.2785s
for_loop...................1.9988s
for_loop_append............1.8471s
for_loop_range.............1.7585s
while_loop.................2.2539s
calc_map...................2.0763s
Python 2.7 timings:
> python2.7 func_speed_tests.py
list_comprehension.........0.9472s
for_loop...................1.2821s
for_loop_append............1.5663s
for_loop_range.............1.3101s
while_loop.................2.1914s
calc_map...................1.9101s
Here is the code:
import timeit
from random import shuffle
# generate a long list of integers
mlst = list(range(0, 1000000))
# randomly sort the list
shuffle(mlst)
# number of benchmark passes
n = 10
#####################
# functions to time #
#####################
def list_comprehension(lst):
    return [x + 1 for x in lst]

def for_loop(lst):
    for k, v in enumerate(lst):
        lst[k] = v + 1

def for_loop_append(lst):
    tmp = []
    for item in lst:
        tmp.append(item + 1)

def for_loop_range(lst):
    for k in range(len(lst)):
        lst[k] += 1
    return lst

def while_loop(lst):
    it = iter(lst)
    tmp = []
    while it:
        try:
            tmp.append(next(it)+1)
        except StopIteration:
            break
    return tmp

def calc_map(lst):
    lst = map(lambda x: x + 1, lst)
    return list(lst)

############################################
# helper functions for timing and printing #
############################################

def print_time(func_name):
    print("{:.<25}{:.>8.4f}s".format(func_name, timer(func_name)))

def timer(func):
    return timeit.timeit(stmt="{}(mlst)".format(func),
                         setup="from __main__ import mlst, {}".format(func),
                         number=n)
#################
# run the tests #
#################
print_time("list_comprehension")
print_time("for_loop")
print_time("for_loop_append")
print_time("for_loop_range")
print_time("while_loop")
print_time("calc_map")
Python 3 is slightly slower than Python 2. Check with the pystone benchmark. If speed is a concern, stick with the Python 2.7.0 release, which has better third-party package support.
I expected this Python implementation of ThreeSum to be slow:
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
But I was shocked that this version looks pretty slow too:
def count_python(a):
    """ThreeSum using itertools"""
    return sum(map(lambda X: sum(X)==0, itertools.combinations(a, r=3)))
Can anyone recommend a faster Python implementation? Both implementations just seem so slow...
Thanks
...
ANSWER SUMMARY:
Here is how all the various O(N^3) versions provided in this thread (for educational purposes, not used in real life) worked out on my machine:
56 sec RUNNING count_slow...
28 sec RUNNING count_itertools, written by Ashwini Chaudhary...
14 sec RUNNING count_fixed, written by roippi...
11 sec RUNNING count_itertools (faster), written by Veedrak...
08 sec RUNNING count_enumerate, written by roippi...
*Note: I needed to modify Veedrak's solution to the following to get the correct count output:
sum(1 for x, y, z in itertools.combinations(a, r=3) if x+y==-z)
Supplying a second answer. From various comments, it looks like you're primarily concerned about why this particular O(n**3) algorithm is slow when being ported over from java. Let's dive in.
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
One major problem that immediately pops out is that you're doing something your java code almost certainly isn't doing: materializing a 3-element list just to add three numbers together!
if sum([a[i], a[j], a[k]]) == 0:
Yuck! Just write that as
if a[i] + a[j] + a[k] == 0:
Some benchmarking shows that you're adding 50%+ overhead just by doing that. Yikes.
The other issue here is that you're using indexing where you should be using iteration. In python try to avoid writing code like this:
for i in range(len(some_list)):
    do_something(some_list[i])
And instead just write:
for x in some_list:
    do_something(x)
And if you explicitly need the index that you're on (as you actually do in your code), use enumerate:
for i,x in enumerate(some_list):
    #etc
This is, in general, a style thing (though it goes deeper than that, with duck typing and the iterator protocol) - but it is also a performance thing. In order to look up the value of a[i], that call is converted to a.__getitem__(i), then python has to dynamically resolve a __getitem__ method lookup, call it, and return the value. Every time. It's not a crazy amount of overhead - at least on builtin types - but it adds up if you're doing it a lot in a loop. Treating a as an iterable, on the other hand, sidesteps a lot of that overhead.
So taking that change in mind, you can rewrite your function once again:
def count_enumerate(a):
    cnt = 0
    for i, x in enumerate(a):
        for j, y in enumerate(a[i+1:], i+1):
            for z in a[j+1:]:
                if x + y + z == 0:
                    cnt += 1
    return cnt
Let's look at some timings:
%timeit count(range(-100,100))
1 loops, best of 3: 394 ms per loop
%timeit count_fixed(range(-100,100)) #just fixing your sum() line
10 loops, best of 3: 158 ms per loop
%timeit count_enumerate(range(-100,100))
10 loops, best of 3: 88.9 ms per loop
And that's about as fast as it's going to go. You can shave off a percent or so by wrapping everything in a comprehension instead of doing cnt += 1 but that's pretty minor.
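For reference, a sketch of that comprehension-wrapped variant (the name count_enumerate_sum is mine): same algorithm, with sum() doing the counting instead of cnt += 1.
def count_enumerate_sum(a):
    return sum(1
               for i, x in enumerate(a)
               for j, y in enumerate(a[i+1:], i+1)
               for z in a[j+1:]
               if x + y + z == 0)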
I've toyed around with a few itertools implementations but I actually can't get them to go faster than this explicit loop version. This makes sense if you think about it - for every iteration, the itertools.combinations version has to rebind what all three variables refer to, whereas the explicit loops get to "cheat" and rebind the variables in the outer loops far less often.
Reality check time, though: after everything is said and done, you can still expect cPython to run this algorithm an order of magnitude slower than a modern JVM would. There is simply too much abstraction built in to python that gets in the way of looping quickly. If you care about speed (and you can't fix your algorithm - see my other answer), either use something like numpy to spend all of your time looping in C, or use a different implementation of python.
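As a rough illustration of the numpy route (not from the thread, just a sketch): keep the two outer Python loops but vectorize the innermost one, so most of the element-wise work happens in C.
import numpy as np

def count_numpy(a):
    arr = np.asarray(a)
    n = len(arr)
    cnt = 0
    for i in range(n):
        for j in range(i + 1, n):
            # count the remaining elements that cancel out arr[i] + arr[j]
            cnt += int(np.count_nonzero(arr[j+1:] == -(arr[i] + arr[j])))
    return cnt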
postscript: pypy
For fun, I ran count_fixed on a 1000-element list, on both cPython and pypy.
cPython:
In [81]: timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
Out[81]: 19.230753898620605
pypy:
>>>> timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
0.6961538791656494
Speedy!
I might add some java testing in later to compare :-)
Algorithmically, both versions of your function are O(n**3) - so asymptotically neither is superior. You will find that the itertools version is in practice somewhat faster since it spends more time looping in C rather than in python bytecode. You can get it down a few more percentage points by removing map entirely (especially if you're running py2) but it's still going to be "slow" compared to whatever times you got from running it in a JVM.
Note that there are plenty of python implementations other than cPython out there - for loopy code, pypy tends to be much faster than cPython. So I wouldn't write python-as-a-language off as being slow, necessarily, but I would certainly say that the reference implementation of python is not known for its blazing loop speed. Give other python flavors a shot if that's something you care about.
Specific to your algorithm, an optimization will let you drop it down to O(n**2). Build up a set of your integers, s, and build up all pairs (a,b). You know that you can "zero out" (a+b) if and only if -(a+b) in (s - {a,b}).
Thanks to #Veedrak: unfortunately constructing s - {a,b} is a slow O(len(s)) operation itself - so simply check if -(a+b) is equal to either a or b. If it is, you know there's no third c that can fulfill a+b+c == 0 since all numbers in your input are distinct.
def count_python_faster(a):
    s = frozenset(a)
    return sum(1 for x,y in itertools.combinations(a,2)
               if -(x+y) not in (x,y) and -(x+y) in s) // 3
Note the divide-by-three at the end; this is because each successful combination is triple-counted. It's possible to avoid that but it doesn't actually speed things up and (imo) just complicates the code.
Some timings for the curious:
%timeit count(range(-100,100))
1 loops, best of 3: 407 ms per loop
%timeit count_python(range(-100,100)) #this is about 100ms faster on py3
1 loops, best of 3: 382 ms per loop
%timeit count_python_faster(range(-100,100))
100 loops, best of 3: 5.37 ms per loop
You haven't stated which version of Python you're using.
In Python 3.x, a generator expression is around 10% faster than either of the two implementations you listed. Using a random array of 100 numbers in the range [-100,100] for a:
count(a) -> 8.94 ms # as per your implementation
count_python(a) -> 8.75 ms # as per your implementation
def count_generator(a):
    return sum((sum(x) == 0 for x in itertools.combinations(a,r=3)))
count_generator(a) -> 7.63 ms
But other than that, it's the sheer number of combinations that dominates execution time - O(N^3).
I should add the times shown above are for loops of 10 calls each, averaged over 10 loops. And yeah, my laptop is slow too :)
Can anyone help me? I'm trying to come up with a way to compute
>>> sum_widths = sum(col.width for col in cols if not col.hide)
and also count the number of items in this sum, without having to make two passes over cols.
It seems unbelievable, but after scanning the std-lib (built-in functions, itertools, functools, etc), I couldn't even find a function which would count the number of members in an iterable. I found the function itertools.count, which sounds like what I want, but it's really just a deceptively named range function.
After a little thought I came up with the following (which is so simple that the lack of a library function may be excusable, except for its obtuseness):
>>> visable_col_count = sum(col is col for col in cols if not col.hide)
However, using these two functions requires two passes of the iterable, which just rubs me the wrong way.
As an alternative, the following function does what I want:
>>> def count_and_sum(iter):
...     count = sum = 0
...     for item in iter:
...         count += 1
...         sum += item
...     return count, sum
The problem with this is that it takes 100 times as long (according to timeit) as the sum of a generator expression form.
If anybody can come up with a simple one-liner which does what I want, please let me know (using Python 3.3).
Edit 1
Lots of great ideas here, guys. Thanks to all who replied. It will take me a while to digest all these answers, but I will and I will try to pick one to check.
Edit 2
I repeated the timings on my two humble suggestions (count_and_sum function and 2 separate sum functions) and discovered that my original timing was way off, probably due to an auto-scheduled backup process running in the background.
I also timed most of the excellent suggestions given as answers here, all with the same model. Analysing these answers has been quite an education for me: new uses for deque, enumerate and reduce and first time for count and accumulate. Thanks to all!
Here are the results (from my slow netbook) using the software I'm developing for display:
┌───────────────────────────────────────────────────────┐
│ Count and Sum Timing │
├──────────────────────────┬───────────┬────────────────┤
│ Method │Time (usec)│Time (% of base)│
├──────────────────────────┼───────────┼────────────────┤
│count_and_sum (base) │ 7.2│ 100%│
│Two sums │ 7.5│ 104%│
│deque enumerate accumulate│ 7.3│ 101%│
│max enumerate accumulate │ 7.3│ 101%│
│reduce │ 7.4│ 103%│
│count sum │ 7.3│ 101%│
└──────────────────────────┴───────────┴────────────────┘
(I didn't time the complex and fold methods as being just too obscure, but thanks anyway.)
Since there's very little difference in timing among all these methods I decided to use the count_and_sum function (with an explicit for loop) as being the most readable, explicit and simple (Python Zen) and it also happens to be the fastest!
I wish I could accept one of these amazing answers as correct but they are all equally good though more or less obscure, so I'm just up-voting everybody and accepting my own answer as correct (count_and_sum function) since that's what I'm using.
What was that about "There should be one-- and preferably only one --obvious way to do it."?
Using complex numbers
z = [1, 2, 4, 5, 6]
y = sum(x + 1j for x in z)
sum_z, count_z = y.real, int(y.imag)
print sum_z, count_z
18.0 5
I don't know about speed, but this is kind of pretty:
>>> from itertools import accumulate
>>> it = range(10)
>>> max(enumerate(accumulate(it), 1))
(10, 45)
Adaptation of DSM's answer, using deque(..., maxlen=1) to reduce memory use.
import itertools
from collections import deque
deque(enumerate(itertools.accumulate(x), 1), maxlen=1)
timing code in ipython:
import itertools , random
from collections import deque
def count_and_sum(iter):
    count = sum = 0
    for item in iter:
        count += 1
        sum += item
    return count, sum
X = [random.randint(0, 10) for _ in range(10**7)]
%timeit count_and_sum(X)
%timeit deque(enumerate(itertools.accumulate(X), 1), maxlen=1)
%timeit (max(enumerate(itertools.accumulate(X), 1)))
results: now faster than OP's method
1 loops, best of 3: 1.08 s per loop
1 loops, best of 3: 659 ms per loop
1 loops, best of 3: 1.19 s per loop
Here's some timing data that might be of interest:
import timeit
setup = '''
import random, functools, itertools, collections
x = [random.randint(0, 10) for _ in range(10**5)]
def count_and_sum(it):
    c, s = 0, 0
    for i in it:
        c += 1
        s += i
    return c, s

def two_pass(it):
    return sum(i for i in it), sum(True for i in it)

def functional(it):
    return functools.reduce(lambda pair, x: (pair[0]+1, pair[1]+x), it, [0, 0])

def accumulator(it):
    return max(enumerate(itertools.accumulate(it), 1))

def complex(it):
    cpx = sum(x + 1j for x in it)
    return cpx.real, int(cpx.imag)

def dequed(it):
    return collections.deque(enumerate(itertools.accumulate(it), 1), maxlen=1)
'''
number = 100
for stmt in ['count_and_sum(x)',
             'two_pass(x)',
             'functional(x)',
             'accumulator(x)',
             'complex(x)',
             'dequed(x)']:
    print('{:.4}'.format(timeit.timeit(stmt=stmt, setup=setup, number=number)))
Result:
3.404 # OP's one-pass method
3.833 # OP's two-pass method
8.405 # Timothy Shields's fold method
3.892 # DSM's accumulate-based method
4.946 # 1_CR's complex-number method
2.002 # M4rtini's deque-based modification of DSM's method
Given these results, I'm not really sure how the OP is seeing a 100x slowdown with the one-pass method. Even if the data looks radically different from a list of random integers, that just shouldn't happen.
Also, M4rtini's solution looks like the clear winner.
To clarify, these results are in CPython 3.2.3. For a comparison to PyPy3, see James_pic's answer, which shows some serious gains from JIT compilation for some methods (also mentioned in a comment by M4rtini).
As a follow-up to senshin's answer, it's worth noting that the performance differences are largely due to quirks in CPython's implementation, which make some methods slower than others (for example, for loops are relatively slow in CPython). I thought it would be interesting to try the exact same test in PyPy (using PyPy3 2.1 beta), which has different performance characteristics. In PyPy the results are:
0.6227 # OP's one-pass method
0.8714 # OP's two-pass method
1.033 # Timothy Shields's fold method
6.354 # DSM's accumulate-based method
1.287 # 1_CR's complex-number method
3.857 # M4rtini's deque-based modification of DSM's method
In this case, the OP's one-pass method is fastest. This makes sense, as it's arguably the simplest (at least from a compiler's point of view) and PyPy can eliminate many of the overheads by inlining method calls, which CPython can't.
For comparison, CPython 3.3.2 on my machine gives the following:
1.651 # OP's one-pass method
1.825 # OP's two-pass method
3.258 # Timothy Shields's fold method
1.684 # DSM's accumulate-based method
3.072 # 1_CR's complex-number method
1.191 # M4rtini's deque-based modification of DSM's method
You can keep a count inside the sum with tricks similar to this:
>>> from itertools import count
>>> cnt = count()
>>> sum((next(cnt), x)[1] for x in range(10) if x%2)
25
>>> next(cnt)
5
But it's probably going to be more readable to just use a for loop
You can use this:
from itertools import count
lst = range(10)
c = count(1)
tot = sum(next(c) and x for x in lst if x % 2)
n = next(c)-1
print(n, tot)
# 5 25
It's kind of a hack, but it works perfectly well.
I don't know what the Python syntax is off hand, but you could potentially use a fold. Something like this:
(count, total) = fold((0, 0), lambda pair, x: (pair[0] + 1, pair[1] + x))
The idea is to use a seed of (0,0) and then at each step add 1 to the first component and the current number to the second component.
For comparison, you could implement sum as a fold as follows:
total = fold(0, lambda t, x: t + x)
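Python doesn't have a built-in called fold, but functools.reduce plays the same role; a minimal sketch of the idea above (the example list is mine):
from functools import reduce

lst = [1, 2, 4, 5, 6]
# seed (0, 0); each step adds 1 to the count and the current number to the total
count, total = reduce(lambda pair, x: (pair[0] + 1, pair[1] + x), lst, (0, 0))
print(count, total)  # 5 18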
1_CR's complex number solution is cute but overly hacky. The reason it works is that a complex number is effectively a 2-tuple that sums elementwise. The same is true of numpy arrays, and I think it's slightly cleaner to use those:
import numpy as np
z = [1, 2, 4, 5, 6]
y = sum(np.array([x, 1]) for x in z)
sum_z, count_z = y[0], y[1]
print sum_z, count_z
18 5
Something else to consider: If it is possible to determine a minimum possible count, we can let the efficient built-in sum do part of the work:
from itertools import islice
def count_and_sum(iterable):
    ...  # insert favorite implementation here

def count_and_sum_with_min_count(iterable, min_count):
    iterator = iter(iterable)
    slice_sum = sum(islice(iterator, None, min_count))
    rest_count, rest_sum = count_and_sum(iterator)
    return min_count + rest_count, slice_sum + rest_sum
For example, using the deque method with a sequence of 10000000 items and min_count of 5000000, timing results are:
count_and_sum: 1.03
count_and_sum_with_min_count: 0.63
Thanks for all the great answers, but I decided to use my original count_and_sum function, called as follows:
>>> cc, cs = count_and_sum(c.width for c in cols if not c.hide)
As explained in the edits to my original question this turned out to be the fastest and most readable solution.
How about this?
It seems to work.
from functools import reduce
class Column:
    def __init__(self, width, hide):
        self.width = width
        self.hide = hide
lst = [Column(10, False), Column(100, False), Column(1000, True), Column(10000, False)]
print(reduce(lambda acc, col: Column(col.width + acc.width, False) if not col.hide else acc, lst, Column(0, False)).width)
You might only need the sum & count today, but who knows what you'll need tomorrow!
Here's an easily extensible solution:
def fold_parallel(itr, **fs):
    res = {
        k: zero for k, (zero, f) in fs.items()
    }
    for x in itr:
        for k, (_, f) in fs.items():
            res[k] = f(res[k], x)
    return res

from operator import add

print(fold_parallel([1, 2, 3],
                    count = (0, lambda a, b: a + 1),
                    sum = (0, add),
                    ))
# {'count': 3, 'sum': 6}
I am trying to benchmark a few method of itertools against generators and list comprehensions. The idea is that I want to build an iterator by filtering some entries from a base list.
Here is the code I came up with (edited after the accepted answer):
from itertools import ifilter
import collections
import random
import os
from timeit import Timer
os.system('cls')
# define large arrays
listArrays = [xrange(100), xrange(1000), xrange(10000), xrange(100000)]
# Number of elements to be filtered out
nb_elem = 100
# Number of times we run the test
nb_rep = 1000
def discard(it):
    collections.deque(it, maxlen=0)

def testGenerator(arr, sample):
    discard(x for x in sample if x in arr)

def testIterator(arr, sample):
    discard(ifilter(sample.__contains__, arr))

def testList(arr, sample):
    discard([x for x in sample if x in arr])

if __name__ == '__main__':
    for arr in listArrays:
        print 'Size of array: %s ' % len(arr)
        print 'number of iterations %s' % nb_rep
        sample = random.sample(arr, nb_elem)
        t1 = Timer('testIterator(arr, sample)', 'from __main__ import testIterator, arr, sample')
        tt1 = t1.timeit(number=nb_rep)
        t2 = Timer('testList(arr, sample)', 'from __main__ import testList, arr, sample')
        tt2 = t2.timeit(number=nb_rep)
        t3 = Timer('testGenerator(arr, sample)', 'from __main__ import testGenerator, arr, sample')
        tt3 = t3.timeit(number=nb_rep)
        norm = min(tt1, tt2, tt3)
        print 'maximum runtime %.6f' % max(tt1, tt2, tt3)
        print 'normalized times:\n iterator: %.6f \n list: %.6f \n generator: %.6f' % \
            (tt1/norm, tt2/norm, tt3/norm)
        print '======================================================'
And here are the results I get. Please note that the edited version was not run on the same machine (hence the normalized results are useful) and was run with a 32-bit interpreter on Python 2.7.3:
Size of array: 100
number of iterations 1000
maximum runtime 0.125595
normalized times:
iterator: 1.000000
list: 1.260302
generator: 1.276030
======================================================
Size of array: 1000
number of iterations 1000
maximum runtime 1.740341
normalized times:
iterator: 1.466031
list: 1.010701
generator: 1.000000
======================================================
Size of array: 10000
number of iterations 1000
maximum runtime 17.033630
normalized times:
iterator: 1.441600
list: 1.000000
generator: 1.010979
======================================================
Size of array: 100000
number of iterations 1000
maximum runtime 169.677963
normalized times:
iterator: 1.455594
list: 1.000000
generator: 1.008846
======================================================
Could you give some suggestions on improvement and comment on whether or not this benchmark can give accurate results?
I know that the condition in my decorator might bias the results. I am hoping for some suggestions regarding that.
Thanks.
First, instead of trying to duplicate everything timeit does, just use it. The time function may not have enough accuracy to be useful, and writing dozens of lines of scaffolding code (especially if it has to do hacky things like switching on func.__name__) that you don't need is just inviting bugs for no reason.
Assuming there are no bugs, it probably won't affect the results significantly. You're doing a tiny bit of extra work and charging it to testIterator, but that's only once per outer loop. But still, there's no benefit to doing it, so let's not.
def testGenerator(arr, sample):
    for i in (x for x in sample if x in arr):
        k = random.random()

def testIterator(arr, sample):
    for i in ifilter(lambda x: x in sample, arr):
        k = random.random()

def testList(arr, sample):
    for i in [x for x in sample if x in arr]:
        k = random.random()
tests = testIterator, testGenerator, testList

for arr in listArrays:
    print 'Size of array: %s ' % len(arr)
    print 'number of iterations %s' % nb_rep
    sample = random.sample(arr, nb_elem)
    funcs = [partial(test, arr, sample) for test in tests]
    times = [timeit.timeit(func, number=nb_rep) for func in funcs]
    norm = min(*times)
    print 'maximum runtime %.6f' % max(*times)
    print 'normalized times:\n iterator: %.6f \n list: %.6f \n generator: %.6f' % (times[0]/norm, times[1]/norm, times[2]/norm)
    print '======================================================'
Next, why are you doing that k = random.random() in there? From a quick test, just executing that line N times without the complex loop is 0.19x as long as the whole thing. So, you're adding 20% to each of the numbers, which dilutes the difference between them for no reason.
Once you get rid of that, the for loop is serving no purpose except to consume the iterator, and that's adding extra overhead as well. As of 2.7.3 and 3.3.0, the fastest way to consume an iterator without custom C code is deque(it, maxlen=0), so, let's try this:
def discard(it):
    collections.deque(it, maxlen=0)

def testGenerator(arr, sample):
    discard(x for x in sample if x in arr)

def testIterator(arr, sample):
    discard(ifilter(sample.__contains__, arr))

def testList(arr, sample):
    discard([x for x in sample if x in arr])
Or, alternatively, just have the functions return a generator/ifilter/list and then make the scaffolding call discard on the result (it shouldn't matter either way).
Meanwhile, for the testIterator case, are you trying to test the cost of the lambda vs. an inline expression, or the cost of ifilter vs. a generator? If you want to test the former, this is correct; if the latter, you probably want to optimize that. For example, passing sample.__contains__ instead of lambda x: x in sample seems to be 20% faster in 64-bit Python 3.3.0 and 30% faster in 32-bit 2.7.2 (although for some reason not faster at all in 64-bit 2.7.2).
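To make the difference concrete (no timing claims beyond those above), here are the two predicate styles side by side; the bound method skips one Python-level call per element. sample, arr, and ifilter come from the benchmark code above:
slow = list(ifilter(lambda x: x in sample, arr))  # lambda wrapper around the membership test
fast = list(ifilter(sample.__contains__, arr))    # bound method, no extra Python frame
assert slow == fast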
Finally, unless you're just testing for exactly one implementation/platform/version, make sure to run it on as many as you can. For example, with 64-bit CPython 2.7.2, list and generator are always neck-and-neck while iterator gradually climbs from 1.0x to 1.4x as the lists grow, but in PyPy 1.9.0, iterator is always fastest, with generator and list starting off 2.1x and 1.9x slower but closing to 1.2x as the lists grow.
So, if you decided against iterator because "it's slow", you might be trading a large slowdown on PyPy for a much smaller speedup on CPython.
Of course that might be acceptable, e.g., because even the slowest PyPy run is blazingly fast, or because none of your users use PyPy, or whatever. But it's definitely part of the answer to "is this benchmark relevant?"
I have two fairly simple code snippets and I'm running both of them a very large amount of times; I'm trying to determine if there's any optimisation I can do to speed up the execution time. If there's anything that stands out as something that could be done a lot quicker...
In the first one, we've got a list, fields. We've also got a list of lists, weights. We're trying to find which weight list multiplied by fields will produce the maximum sum. Fields is about 30k entries long.
def find_best(weights, fields):
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = 0
        for i in range(num_fields):
            score += float(fields[i]) * weights[c][i]
        if score > best:
            best = score
            winner = c
    return winner
In the second one, we're trying to update two of our weight lists; one gets increased and one decreased. The amount to increase/decrease each element in the list is equal to the corresponding element in fields (e.g. if fields[4] = 10.5, then we want to increase weights[toincrease][4] by 10.5 and decrease weights[todecrease][4] by 10.5)
def update_weights(weights, fields, toincrease, todecrease):
    for i in range(num_fields):
        update = float(fields[i])
        weights[toincrease][i] += update
        weights[todecrease][i] -= update
    return weights
I hope this isn't an overly specific question.
When you are trying to optimise, the thing you have to do is profile and measure! Python provides the timeit module which makes measuring things easy!
This will assume that you've converted fields to a list of floats beforehand (outside any of these functions), since the string → float conversion is very slow. You can do this via fields = [float(f) for f in string_fields].
Also, for doing numerical processing, pure python isn't very good, since it ends up doing a lot of type-checking (and some other stuff) for each operation. Using a C library like numpy will give massive improvements.
find_best
I have incorporated the answers of others (and a few more) into a profiling suite (say, test_find_best.py):
import random, operator, numpy as np, itertools, timeit
fields = [random.random() for _ in range(3000)]
fields_string = [str(field) for field in fields]
weights = [[random.random() for _ in range(3000)] for c in range(100)]
npw = np.array(weights)
npf = np.array(fields)
num_fields = len(fields)
num_category = len(weights)
def f_original():
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = 0
        for i in range(num_fields):
            score += float(fields_string[i]) * weights[c][i]
        if score > best:
            best = score
            winner = c

def f_original_no_string():
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = 0
        for i in range(num_fields):
            score += fields[i] * weights[c][i]
        if score > best:
            best = score
            winner = c

def f_original_xrange():
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = 0
        for i in xrange(num_fields):
            score += fields[i] * weights[c][i]
        if score > best:
            best = score
            winner = c

# Zenon http://stackoverflow.com/a/10134298/1256624
def f_index_comprehension():
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = sum(fields[i] * weights[c][i] for i in xrange(num_fields))
        if score > best:
            best = score
            winner = c

# steveha http://stackoverflow.com/a/10134247/1256624
def f_comprehension():
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = sum(f * w for f, w in itertools.izip(fields, weights[c]))
        if score > best:
            best = score
            winner = c

def f_schwartz_original(): # https://en.wikipedia.org/wiki/Schwartzian_transform
    tup = max(((i, sum(t[0] * t[1] for t in itertools.izip(fields, wlist))) for i, wlist in enumerate(weights)),
              key=lambda t: t[1]
              )

def f_schwartz_opt(): # https://en.wikipedia.org/wiki/Schwartzian_transform
    tup = max(((i, sum(f * w for f, w in itertools.izip(fields, wlist))) for i, wlist in enumerate(weights)),
              key=operator.itemgetter(1)
              )

def fweight(field_float_list, wlist):
    f = iter(field_float_list)
    return sum(f.next() * w for w in wlist)

def f_schwartz_iterate():
    tup = max(
        ((i, fweight(fields, wlist)) for i, wlist in enumerate(weights)),
        key=lambda t: t[1]
    )

# Nolen Royalty http://stackoverflow.com/a/10134147/1256624
def f_numpy_mult_sum():
    np.argmax(np.sum(npf * npw, axis=1))

# me
def f_imap():
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = sum(itertools.imap(operator.mul, fields, weights[c]))
        if score > best:
            best = score
            winner = c

def f_numpy():
    np.argmax(npw.dot(npf))

for f in [f_original,
          f_index_comprehension,
          f_schwartz_iterate,
          f_original_no_string,
          f_schwartz_original,
          f_original_xrange,
          f_schwartz_opt,
          f_comprehension,
          f_imap]:
    print "%s: %.2f ms" % (f.__name__, timeit.timeit(f, number=10)/10 * 1000)

for f in [f_numpy_mult_sum, f_numpy]:
    print "%s: %.2f ms" % (f.__name__, timeit.timeit(f, number=100)/100 * 1000)
Running python test_find_best.py gives me:
f_original: 310.34 ms
f_index_comprehension: 102.58 ms
f_schwartz_iterate: 103.39 ms
f_original_no_string: 96.36 ms
f_schwartz_original: 90.52 ms
f_original_xrange: 89.31 ms
f_schwartz_opt: 69.48 ms
f_comprehension: 68.87 ms
f_imap: 53.33 ms
f_numpy_mult_sum: 3.57 ms
f_numpy: 0.62 ms
So the numpy version using .dot (sorry, I can't find the documentation for it atm) is the fastest. If you are doing a lot of numerical operations (which it seems you are), it might be worth converting fields and weights as numpy arrays as soon as you create them.
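A rough sketch of what that up-front conversion could look like (the names follow the profiling script above; find_best_numpy is my label, not from the thread):
import numpy as np

# convert once, as soon as the data is created, then reuse the arrays everywhere
npf = np.array([float(f) for f in fields_string])  # shape: (num_fields,)
npw = np.array(weights)                            # shape: (num_category, num_fields)

def find_best_numpy(weight_matrix, field_vector):
    # dot product of each weight row with the fields, then index of the best score
    return int(np.argmax(weight_matrix.dot(field_vector)))

winner = find_best_numpy(npw, npf)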
update_weights
Numpy is likely to offer a similar speed-up for update_weights, doing something like:
def update_weights(weights, fields, to_increase, to_decrease):
    weights[to_increase,:] += fields
    weights[to_decrease,:] -= fields
    return weights
(I haven't tested or profiled that btw, you need to do that.)
I think you could get a pretty big speed boost using numpy. Stupidly simple example:
>>> fields = numpy.array([1, 4, 1, 3, 2, 5, 1])
>>> weights = numpy.array([[.2, .3, .4, .2, .1, .5, .9], [.3, .1, .1, .9, .2, .4, .5]])
>>> fields * weights
array([[ 0.2, 1.2, 0.4, 0.6, 0.2, 2.5, 0.9],
[ 0.3, 0.4, 0.1, 2.7, 0.4, 2. , 0.5]])
>>> result = _
>>> numpy.argmax(numpy.sum(result, axis=1))
1
>>> result[1]
array([ 0.3, 0.4, 0.1, 2.7, 0.4, 2. , 0.5])
If you are running Python 2.x I would use xrange() rather than range(); it uses less memory, as it doesn't generate a list.
This is assuming you want to keep the current code structure.
First, if you are using Python 2.x, you can gain some speed by using xrange() instead of range(). In Python 3.x there is no xrange(), but the built-in range() is basically the same as xrange().
Next, if we are going for speed, we need to write less code, and rely more on Python's built-in features (that are written in C for speed).
You could speed things up by using a generator expression inside of sum() like so:
from itertools import izip
def find_best(weights, fields):
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = sum(float(t[0]) * t[1] for t in izip(fields, weights[c]))
        if score > best:
            best = score
            winner = c
    return winner
Applying the same idea again, let's try to use max() to find the best result. I think this code is ugly to look at, but if you benchmark it and it's enough faster, it might be worth it:
from itertools import izip
def find_best(weights, fields):
    tup = max(
        ((i, sum(float(t[0]) * t[1] for t in izip(fields, wlist))) for i, wlist in enumerate(weights)),
        key=lambda t: t[1]
    )
    return tup[0]
Ugh! But if I didn't make any mistakes, this does the same thing, and it should rely a lot on the C machinery in Python. Measure it and see if it is faster.
So, we are calling max(). We are giving it a generator expression, and it will find the max value returned from the generator expression. But you want the index of the best value, so the generator expression returns a tuple: index and weight value. So we need to pass the generator expression as the first argument, and the second argument must be a key function that looks at the weight value from the tuple and ignores the index. Since the generator expression is not the only argument to max() it needs to be in parens. Then it builds a tuple of i and the calculated weight, calculated by the same sum() we used above. Finally once we get back a tuple from max() we index it to get the index value, and return that.
We can make this much less ugly if we break out a function. This adds the overhead of a function call, but if you measure it I'll bet it isn't too much slower. Also, now that I think about it, it makes sense to build a list of fields values already pre-coerced to float; then we can use that multiple times. Also, instead of using izip() to iterate over two lists in parallel, let's just make an iterator and explicitly ask it for values. In Python 2.x we use the .next() method function to ask for a value; in Python 3.x you would use the next() built-in function.
def fweight(field_float_list, wlist):
    f = iter(field_float_list)
    return sum(f.next() * w for w in wlist)

def find_best(weights, fields):
    flst = [float(x) for x in fields]
    tup = max(
        ((i, fweight(flst, wlist)) for i, wlist in enumerate(weights)),
        key=lambda t: t[1]
    )
    return tup[0]
If there are 30K fields values, then pre-computing the float() values is likely to be a big speed win.
EDIT: I missed one trick. Instead of the lambda function, I should have used operator.itemgetter() like some of the code in the accepted answer. Also, the accepted answer timed things, and it does look like the overhead of the function call was significant. But the Numpy answers were so much faster that it's not worth playing with this answer anymore.
As for the second part, I don't think it can be sped up very much. I'll try:
def update_weights(weights, fields, toincrease, todecrease):
    w_inc = weights[toincrease]
    w_dec = weights[todecrease]
    for i, f in enumerate(fields):
        f = float(f)  # see note below
        w_inc[i] += f
        w_dec[i] -= f
So, instead of iterating over an xrange(), here we just iterate over the fields values directly. We have a line that coerces to float.
Note that if the weights values are already float, we don't really need to coerce to float here, and we can save time by just deleting that line.
Your code was indexing the weights list four times: twice to do the increment, twice to do the decrement. This code does the first index (using the toincrease or todecrease) argument just once. It still has to index by i in order for += to work. (My first version tried to avoid this with an iterator and didn't work. I should have tested before posting. But it's fixed now.)
One last version to try: instead of incrementing and decrementing values as we go, just use list comprehensions to build a new list with the values we want:
def update_weights(weights, field_float_list, toincrease, todecrease):
    f = iter(field_float_list)
    weights[toincrease] = [x + f.next() for x in weights[toincrease]]
    f = iter(field_float_list)
    weights[todecrease] = [x - f.next() for x in weights[todecrease]]
This assumes you have already coerced all the fields values to float, as shown above.
Is it faster, or slower, to replace the whole list this way? I'm going to guess faster, but I'm not sure. Measure and see!
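One sketch of how that measurement could be set up with timeit (the list-comprehension variant would need its own name; update_weights_listcomp is my label, not the author's, and weights and field_float_list are assumed to exist at module level):
import timeit

setup = "from __main__ import update_weights, update_weights_listcomp, weights, field_float_list"
print timeit.timeit('update_weights(weights, field_float_list, 0, 1)',
                    setup=setup, number=1000)
print timeit.timeit('update_weights_listcomp(weights, field_float_list, 0, 1)',
                    setup=setup, number=1000)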
Oh, I should add: note that my version of update_weights() shown above does not return weights. This is because in Python it is considered a good practice to not return a value from a function that mutates a data structure, just to make sure that nobody ever gets confused about which functions do queries and which functions change things.
http://en.wikipedia.org/wiki/Command-query_separation
Measure measure measure. See how much faster my suggestions are, or are not.
An easy optimisation is to use xrange instead of range. xrange yields results one by one as you iterate over it, whereas range first creates the entire (30,000-item) list as a temporary object, using more memory and CPU cycles.
As @Levon says, xrange() in Python 2.x is a must. Also, if you are on Python 2.4+ you can use a generator expression (thanks @steveha), which works much like a list comprehension, for your inner loop as simply as follows:
for i in range(num_fields):
    score += float(fields[i]) * weights[c][i]
equivalent to
score = sum(float(fields[i]) * weights[c][i] for i in xrange(num_fields))
Also, in general, there is this great page on the Python wiki about simple but effective optimization tricks!