Does the number of loops matter for efficiency? (interpreted vs compiled languages) - Python

Say you have to carry out a computation using 2 or even 3 nested loops. Intuitively, one may think that it's more efficient to do this with a single loop. I tried a simple Python example:
import itertools
import timeit

def case1(n):
    c = 0
    for i in range(n):
        c += 1
    return c

def case2(n):
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += 1
    return c

print(case1(1000))
print(case2(10))

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("case1(1000)", setup="from __main__ import case1", number=10000))
    print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))
Running this code gives:
$ python3 code.py
1000
1000
0.8281264099932741
1.04944919400441
So effectively a single loop seems to be a bit more efficient. Yet my actual problem is slightly different, as I need to use the values of an array (in the following example I use range for simplicity). That is, if I collapse everything into a single loop, I have to build an extended array from the values of another array whose size is between 2 and 10 elements.
import itertools
import timeit

def case1(n):
    b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c

def case2(n):
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += i * j * k
    return c

print(case1(10))
print(case2(10))

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("case1(10)", setup="from __main__ import case1", number=10000))
    print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))
On my computer this code runs in:
$ python3 code.py
91125
91125
2.435348572995281
1.6435037050105166
So it seems the 3 nested loops are more efficient, because in case1 I spend some time creating the array b. I'm not sure I'm creating that array in the most efficient way, but leaving that aside: does it really pay off to collapse loops into a single one? I'm using Python here, but what about compiled languages like C++? Does the compiler do something to optimize the single loop, or, on the other hand, does it do some optimization when you have multiple nested loops?

This line is why the single-loop function takes longer than it should:
b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
Just by changing the whole function to
def case1(n, b):
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c
makes timeit return:
case1 : 0.965343249744
case2 : 2.28501694207

Your case is simple enough that various optimizations would probably help a lot, be it numpy for more efficient arrays, PyPy for a better JIT optimizer, or various other things.
Looking at the bytecode via the dis module can help you understand what happens under the hood and make some micro optimizations, but in general it does not really matter if you do one loop or a nested loop, if your memory access pattern is somewhat predictable for the CPU. If not, it may differ wildly.
Python has some bytecodes that are cheap and others that are more expensive, e.g. function calls are much more expensive than a simple addition. Same with creating new objects and various other things. So the usual optimization is moving the loop to C, which is one of the benefits of itertools sometimes.
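For example, the triple loop from the question can be pushed mostly into C (a sketch of mine, assuming Python 3.8+ for math.prod; whether it actually wins depends on n):
import itertools
import math

def case2_itertools(n):
    # product() and map() iterate in C; only sum() accumulates at the Python level
    return sum(map(math.prod, itertools.product(range(n), repeat=3)))

print(case2_itertools(10))  # 91125, same result as case2(10)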
Once you are on the C-level it usually comes down to: Avoid syscalls/mallocs() in tight loops, have predictable memory access patterns and make sure your algorithm is cache friendly.
So, your algorithms above will probably vary wildly in performance if you go to large values of N, due to the amount of memory allocation and cache access.
But the fastest way for the specific problem above would be to find a closed form for the function, it seems wasteful to iterate for that, as there must be a much simpler formula to calculate the final value of 'c'. As usual, first get the best algorithm before doing micro optimizations.
e.g. Wolfram Alpha tells you that you could replace the two inner loops with a closed form; there is probably a closed form for all three, but Alpha didn't tell me...
def case3(n):
    c = 0
    for j in range(n):
        # closed form for the inner double sum over i and k: (0 + 1 + ... + (n-1)) ** 2
        c += j * (n * (n - 1) // 2) ** 2
    return c
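For completeness, a closed form for all three loops does exist, since the triple sum factors as (0 + 1 + ... + (n-1)) cubed; a sketch of mine, not from Wolfram Alpha:
def case4(n):
    # sum over i, j, k in range(n) of i*j*k == (n*(n-1)//2) ** 3
    s = n * (n - 1) // 2
    return s ** 3

print(case4(10))  # 91125, matches case2(10)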

Related

Speed up the List appending process

I would like to speed up the code below:
x = []
y = []
z = []
for i in range(0, 1000000):
    if 0 < u[i] < 1920 and 0 < v[i] < 1080:
        x.append(u[i])
        y.append(v[i])
        z.append([x_ind[i], y_ind[i]])
Any ideas would be really appreciated.
Thanks
Typically, you can optimize cases like this by replacing loops over indices (and the indexing that goes with them) with loops over the raw values. So the replacement code here would be:
x = []
y = []
z = []
for a, b, c, d in zip(u, v, x_ind, y_ind):
    if 0 < a < 1920 and 0 < b < 1080:
        x.append(a)
        y.append(b)
        z.append([c, d])
If u, v, x_ind and y_ind might be longer than 1000000 and you must stop after 1000000 items checked, you'd just add an import to the top of the file, from itertools import islice, and change the for loop itself to:
for a, b, c, d in islice(zip(u, v, x_ind, y_ind), 1000000):
Either way, you remove all indexing from the code (indexing has one of the worst ratios of overhead to useful work accomplished in the CPython reference interpreter, though other interpreters and tools like Cython will behave differently) and, if you use nicer names than a, b, c and d, you get more self-documenting code.
There are minor benefits (decreasing in the most recent versions of Python) to pre-binding copies of append instead of dynamic binding, so if you're really hurting for speed, especially on older versions of Python that didn't optimize away the creation of bound methods, you can try:
x = []
y = []
z = []
xapp, yapp, zapp = x.append, y.append, z.append
for a, b, c, d in zip(u, v, x_ind, y_ind):
    if 0 < a < 1920 and 0 < b < 1080:
        xapp(a)
        yapp(b)
        zapp([c, d])
(adding islice if needed) to reduce method call overhead a bit at the expense of uglier code. Definitely don't do this unless profiling has shown this is the hot code path and you really need it faster.
Lastly, a note: if this code is being run at top level (outside of any function) it will run significantly slower (variable lookup for locally scoped names in a function is a C array lookup; looking up globally scoped names, which all lookups outside a function involve, requires at least one dict key lookup, which is substantially more expensive). Put it in a function (along with the definitions of x, y and z; u, v, x_ind and y_ind don't matter so much if you're zipping rather than indexing them) and call that function instead of running at global scope, and it should run a lot faster.
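A minimal sketch of that function wrapper, reusing the zip-based loop from above (the function name and the limit parameter are mine):
from itertools import islice

def filter_points(u, v, x_ind, y_ind, limit=1000000):
    # Every name bound inside the function is a local, so each lookup in the
    # loop is a cheap array access rather than a dict lookup.
    x, y, z = [], [], []
    for a, b, c, d in islice(zip(u, v, x_ind, y_ind), limit):
        if 0 < a < 1920 and 0 < b < 1080:
            x.append(a)
            y.append(b)
            z.append([c, d])
    return x, y, z

x, y, z = filter_points(u, v, x_ind, y_ind)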
Improvements beyond this might be possible using numpy arrays instead of lists, but you'd need to be much more specific about your problem to hazard a guess on the utility of such a change.
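For instance, if u, v, x_ind and y_ind are numeric arrays of equal length, a boolean mask does the whole filter in C (a sketch of my own; the question doesn't say what the data actually looks like):
import numpy as np

u, v = np.asarray(u), np.asarray(v)
x_ind, y_ind = np.asarray(x_ind), np.asarray(y_ind)

mask = (u > 0) & (u < 1920) & (v > 0) & (v < 1080)
x = u[mask]
y = v[mask]
z = np.column_stack((x_ind[mask], y_ind[mask]))  # rows of [x_ind[i], y_ind[i]]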

What does sum do?

At first, I wanted to test the memory usage of a generator versus a list comprehension. The book gives a little benchmark snippet; I ran it on my PC (Python 3.6, Windows) and found something unexpected.
The book says that, because a list comprehension has to create a real list and allocate memory for it, iterating over a list comprehension must be slower than iterating over a generator.
Of course, the list comprehension uses more memory than the generator.
Following is my code, which does not bear out that claim (when the comprehension is used inside sum).
import tracemalloc
from time import time

def timeIt(func):
    start = time()
    func()
    print('%s use time' % func.__name__, time() - start)
    return func

tracemalloc.start()

numbers = range(1, 1000000)

@timeIt
def lStyle():
    return sum([i for i in numbers if i % 3 == 0])

@timeIt
def gStyle():
    return sum((i for i in numbers if i % 3 == 0))

lStyle()
gStyle()

shouldSize = [i for i in numbers if i % 3 == 0]

snapshotL = tracemalloc.take_snapshot()
top_stats = snapshotL.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)
The output:
lStyle use time 0.4460000991821289
gStyle use time 0.6190001964569092
[ Top 10 ]
F:/py3proj/play.py:31: size=11.5 MiB, count=333250, average=36 B
F:/py3proj/play.py:33: size=448 B, count=1, average=448 B
F:/py3proj/play.py:22: size=136 B, count=1, average=136 B
F:/py3proj/play.py:17: size=136 B, count=1, average=136 B
F:/py3proj/play.py:14: size=76 B, count=2, average=38 B
F:/py3proj/play.py:8: size=34 B, count=1, average=34 B
Two points:
The generator uses more time and the same memory.
The list comprehension inside the sum call does not seem to create the whole list.
I think maybe the sum function does something I don't know about. Can anyone explain this?
The book is High Performance Python, chapter 5. But I did something slightly different from the book to check its validity in another context. His code is here: book_code; he didn't put the list comprehension inside the sum function.
When it comes to time performance tests, I rely on the timeit module because it automatically executes multiple runs of the code.
On my system, timeit gives following results (I strongly reduced sizes because of the numerous runs):
>>> timeit.timeit("sum([i for i in numbers if i % 3 == 0])", "numbers = range(1, 1000)")
59.54427594248068
>>> timeit.timeit("sum((i for i in numbers if i % 3 == 0))", "numbers = range(1, 1000)")
64.36398425334801
So the generator is slower by about 8%. This is not really a surprise, because the generator has to execute some code on the fly to get the next value, while a precomputed list only increments its current pointer.
Memory evaluation is IMHO more complex, so I have used Compute Memory footprint of an object and its contents (Python recipe) from ActiveState:
>>> numbers = range(1, 100000)
>>> l = [i for i in numbers if i % 3 == 0]
>>> g = (i for i in numbers if i % 3 == 0)
>>> total_size(l)
1218708
>>> total_size(g)
88
>>> total_size(numbers)
48
My interpretation is that a list uses memory for all of its items (which is no surprise), while a generator only needs a few values and some code, so its memory footprint is much smaller.
I strongly think that you have used tracemalloc for something it is not intended for. It is aimed at finding possible memory leaks (large blocks of memory that are never deallocated), not at measuring the memory used by individual objects.
BEWARE: I could only test small sizes. For very large sizes, the list could exhaust the available memory and the machine would fall back on virtual memory (swap). In that case, the list version becomes much slower. More details here.
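If you do want to see the difference with tracemalloc, its peak counter shows it, whereas a snapshot taken after sum has returned misses the temporary list because it has already been freed; a small sketch of my own:
import tracemalloc

numbers = range(1, 1000000)

tracemalloc.start()
sum([i for i in numbers if i % 3 == 0])   # temporary list is freed when sum() returns
list_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

tracemalloc.start()
sum(i for i in numbers if i % 3 == 0)     # generator never materializes the values
gen_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print("list peak:", list_peak, "generator peak:", gen_peak)
The list version's peak should be on the order of the 11.5 MiB the question's snapshot reports for shouldSize, while the generator's stays tiny.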

Python - Optimizing combination generation

Why is the first method so slow?
It can be up to 1000 times slower; any ideas on how to make it faster?
In this case, performance is the number one priority. In my first attempt I tried multiprocessing, but it was quite slow as well.
Python - Set the first element of a generator - Applied to itertools
import time
import operator as op
from math import factorial
from itertools import combinations

def nCr(n, r):
    # https://stackoverflow.com/a/4941932/1167783
    r = min(r, n-r)
    if r == 0:
        return 1
    numer = reduce(op.mul, xrange(n, n-r, -1))
    denom = reduce(op.mul, xrange(1, r+1))
    return numer // denom

def kthCombination(k, l, r):
    # https://stackoverflow.com/a/1776884/1167783
    if r == 0:
        return []
    elif len(l) == r:
        return l
    else:
        i = nCr(len(l)-1, r-1)
        if k < i:
            return l[0:1] + kthCombination(k, l[1:], r-1)
        else:
            return kthCombination(k-i, l[1:], r)

def iter_manual(n, p):
    numbers_list = [i for i in range(n)]
    for comb in xrange(factorial(n)/(factorial(p)*factorial(n-p))):
        x = kthCombination(comb, numbers_list, p)
        # Do something, for example, store those combinations
        # For timing I'm going to do something simple

def iter(n, p):
    for i in combinations([i for i in range(n)], p):
        # Do something, for example, store those combinations
        # For timing I'm going to do something simple
        x = i

#############################

if __name__ == "__main__":
    n = 40
    p = 5

    print '%s combinations' % (factorial(n)/(factorial(p)*factorial(n-p)))

    t0_man = time.time()
    iter_manual(n, p)
    t1_man = time.time()
    total_man = t1_man - t0_man

    t0_iter = time.time()
    iter(n, p)
    t1_iter = time.time()
    total_iter = t1_iter - t0_iter

    print 'Manual: %s' % total_man
    print 'Itertools: %s' % total_iter
    print 'ratio: %s' % (total_man/total_iter)
There are several factors at play here.
The most important is garbage collection. Any method that generates a lot of unnecessary allocations is going to be slow because of GC pauses. In this vein, list comprehensions are fast (for Python) because they are highly optimized under the hood in their allocation and execution. Wherever speed is important, prefer list comprehensions.
Next up you've got function calls. Function calls are relatively expensive, as @roganjosh points out in the comments. This is (again) particularly true if the function generates a lot of garbage or holds on to long-lived closures.
Now we come to loops. Garbage is again the biggest concern, hoist your variables outside the loop and reuse them on each iteration.
Last but certainly not least is that Python is, in a sense, a hosted language: generally on the CPython runtime. Anything implemented in the runtime itself (particularly if the thing in question is implemented in C rather than Python itself) is going to be faster than your (logically equivalent) code.
NOTE
All of this advice is detrimental to code quality. Use it with caution. Profile first. Also note that compilers are generally smart enough to do all of this for you; for instance, PyPy will generally run the same code faster than the standard Python runtime because it performs optimizations like this when it runs your code.
NOTE 2
One of the implementations uses reduce. In theory, reduce could be fast. But it isn't, for lots of reasons, the chief of which could possibly be summed up as "Guido didn't/doesn't care". So don't use reduce when speed is important.
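On Python 3.8+ you can sidestep the reduce-based nCr entirely, since the standard library has a C implementation of the binomial coefficient (my suggestion, not part of the original answer):
from math import comb

def nCr(n, r):
    # math.comb computes the binomial coefficient in C
    return comb(n, r)

print(nCr(40, 5))  # 658008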

Why is the "map" version of ThreeSum so slow?

I expected this Python implementation of ThreeSum to be slow:
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
But I was shocked that this version looks pretty slow too:
def count_python(a):
    """ThreeSum using itertools"""
    return sum(map(lambda X: sum(X) == 0, itertools.combinations(a, r=3)))
Can anyone recommend a faster Python implementation? Both implementations just seem so slow...
Thanks
...
ANSWER SUMMARY:
Here is how the runs of all the various O(N^3) versions provided in this thread (kept for educational purposes, not for real-life use) worked out on my machine:
56 sec RUNNING count_slow...
28 sec RUNNING count_itertools, written by Ashwini Chaudhary...
14 sec RUNNING count_fixed, written by roippi...
11 sec RUNNING count_itertools (faster), written by Veedrak...
08 sec RUNNING count_enumerate, written by roippi...
*Note: Needed to modify Veedrak's solution to this to get the correct count output:
sum(1 for x, y, z in itertools.combinations(a, r=3) if x+y==-z)
Supplying a second answer. From various comments, it looks like you're primarily concerned about why this particular O(n**3) algorithm is slow when being ported over from Java. Let's dive in.
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
One major problem that immediately pops out is that you're doing something your java code almost certainly isn't doing: materializing a 3-element list just to add three numbers together!
if sum([a[i], a[j], a[k]]) == 0:
Yuck! Just write that as
if a[i] + a[j] + a[k] == 0:
Some benchmarking shows that you're adding 50%+ overhead just by doing that. Yikes.
The other issue here is that you're using indexing where you should be using iteration. In Python, try to avoid writing code like this:
for i in range(len(some_list)):
    do_something(some_list[i])
And instead just write:
for x in some_list:
    do_something(x)
And if you explicitly need the index that you're on (as you actually do in your code), use enumerate:
for i, x in enumerate(some_list):
    # etc
This is, in general, a style thing (though it goes deeper than that, with duck typing and the iterator protocol) - but it is also a performance thing. In order to look up the value of a[i], that call is converted to a.__getitem__(i); Python then has to dynamically resolve the __getitem__ method lookup, call it, and return the value. Every time. It's not a crazy amount of overhead - at least on builtin types - but it adds up if you're doing it a lot in a loop. Treating a as an iterable, on the other hand, sidesteps a lot of that overhead.
So taking that change in mind, you can rewrite your function once again:
def count_enumerate(a):
    cnt = 0
    for i, x in enumerate(a):
        for j, y in enumerate(a[i+1:], i+1):
            for z in a[j+1:]:
                if x + y + z == 0:
                    cnt += 1
    return cnt
Let's look at some timings:
%timeit count(range(-100,100))
1 loops, best of 3: 394 ms per loop
%timeit count_fixed(range(-100,100)) #just fixing your sum() line
10 loops, best of 3: 158 ms per loop
%timeit count_enumerate(range(-100,100))
10 loops, best of 3: 88.9 ms per loop
And that's about as fast as it's going to go. You can shave off a percent or so by wrapping everything in a comprehension instead of doing cnt += 1 but that's pretty minor.
I've toyed around with a few itertools implementations but I actually can't get them to go faster than this explicit loop version. This makes sense if you think about it - for every iteration, the itertools.combinations version has to rebind what all three variables refer to, whereas the explicit loops get to "cheat" and rebind the variables in the outer loops far less often.
Reality check time, though: after everything is said and done, you can still expect cPython to run this algorithm an order of magnitude slower than a modern JVM would. There is simply too much abstraction built in to python that gets in the way of looping quickly. If you care about speed (and you can't fix your algorithm - see my other answer), either use something like numpy to spend all of your time looping in C, or use a different implementation of python.
postscript: pypy
For fun, I ran count_fixed on a 1000-element list, on both cPython and pypy.
cPython:
In [81]: timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
Out[81]: 19.230753898620605
pypy:
>>>> timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
0.6961538791656494
Speedy!
I might add some java testing in later to compare :-)
Algorithmically, both versions of your function are O(n**3) - so asymptotically neither is superior. You will find that the itertools version is in practice somewhat faster since it spends more time looping in C rather than in python bytecode. You can get it down a few more percentage points by removing map entirely (especially if you're running py2) but it's still going to be "slow" compared to whatever times you got from running it in a JVM.
Note that there are plenty of python implementations other than cPython out there - for loopy code, pypy tends to be much faster than cPython. So I wouldn't write python-as-a-language off as being slow, necessarily, but I would certainly say that the reference implementation of python is not known for its blazing loop speed. Give other python flavors a shot if that's something you care about.
Specific to your algorithm, an optimization will let you drop it down to O(n**2). Build up a set of your integers, s, and build up all pairs (a,b). You know that you can "zero out" (a+b) if and only if -(a+b) in (s - {a,b}).
Thanks to @Veedrak: unfortunately constructing s - {a,b} is a slow O(len(s)) operation itself - so simply check whether -(a+b) is equal to either a or b. If it is, you know there's no third c that can fulfill a+b+c == 0, since all numbers in your input are distinct.
def count_python_faster(a):
    s = frozenset(a)
    return sum(1 for x, y in itertools.combinations(a, 2)
               if -(x+y) not in (x, y) and -(x+y) in s) // 3
Note the divide-by-three at the end; this is because each successful combination is triple-counted. It's possible to avoid that but it doesn't actually speed things up and (imo) just complicates the code.
Some timings for the curious:
%timeit count(range(-100,100))
1 loops, best of 3: 407 ms per loop
%timeit count_python(range(-100,100)) #this is about 100ms faster on py3
1 loops, best of 3: 382 ms per loop
%timeit count_python_faster(range(-100,100))
100 loops, best of 3: 5.37 ms per loop
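If numpy is available, the same pair-sum idea can be vectorized; a sketch of my own, assuming the input integers are distinct and that the O(n**2) temporary arrays fit in memory:
import numpy as np

def count_numpy(a):
    arr = np.asarray(a)
    i, j = np.triu_indices(len(arr), k=1)            # index pairs with i < j
    need = -(arr[i] + arr[j])                        # value a third element would have to be
    in_input = np.isin(need, arr)                    # is that value anywhere in the input?
    not_self = (need != arr[i]) & (need != arr[j])   # and it isn't one of the pair itself
    return int(np.count_nonzero(in_input & not_self)) // 3  # each triple counted once per pair
It trades memory for speed: all len(a)*(len(a)-1)//2 pair sums are materialized at once.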
You haven't stated which version of Python you're using.
In Python 3.x, a generator expression is around 10% faster than either of the two implementations you listed. Using a random array of 100 numbers in the range [-100,100] for a:
count(a) -> 8.94 ms # as per your implementation
count_python(a) -> 8.75 ms # as per your implementation
def count_generator(a):
    return sum(sum(x) == 0 for x in itertools.combinations(a, r=3))
count_generator(a) -> 7.63 ms
But other than that, it's the sheer number of combinations that's dominating execution time - O(N^3).
I should add the times shown above are for loops of 10 calls each, averaged over 10 loops. And yeah, my laptop is slow too :)

Need a fast way to count and sum an iterable in a single pass

Can anyone help me? I'm trying to come up with a way to compute
>>> sum_widths = sum(col.width for col in cols if not col.hide)
and also count the number of items in this sum, without having to make two passes over cols.
It seems unbelievable, but after scanning the std-lib (built-in functions, itertools, functools, etc.), I couldn't even find a function which would count the number of members of an iterable. I found the function itertools.count, which sounds like what I want, but it's really just a deceptively named range function.
After a little thought I came up with the following (which is so simple that the lack of a library function may be excusable, except for its obtuseness):
>>> visable_col_count = sum(col is col for col in cols if not col.hide)
However, using these two functions requires two passes over the iterable, which just rubs me the wrong way.
As an alternative, the following function does what I want:
def count_and_sum(iter):
    count = sum = 0
    for item in iter:
        count += 1
        sum += item
    return count, sum
The problem with this is that it takes 100 times as long (according to timeit) as the generator-expression form of sum.
If anybody can come up with a simple one-liner which does what I want, please let me know (using Python 3.3).
Edit 1
Lots of great ideas here, guys. Thanks to all who replied. It will take me a while to digest all these answers, but I will and I will try to pick one to check.
Edit 2
I repeated the timings on my two humble suggestions (count_and_sum function and 2 separate sum functions) and discovered that my original timing was way off, probably due to an auto-scheduled backup process running in the background.
I also timed most of the excellent suggestions given as answers here, all with the same model. Analysing these answers has been quite an education for me: new uses for deque, enumerate and reduce and first time for count and accumulate. Thanks to all!
Here are the results (from my slow netbook) using the software I'm developing for display:
┌───────────────────────────────────────────────────────┐
│                  Count and Sum Timing                  │
├──────────────────────────┬───────────┬────────────────┤
│          Method          │Time (usec)│Time (% of base)│
├──────────────────────────┼───────────┼────────────────┤
│count_and_sum (base)      │        7.2│            100%│
│Two sums                  │        7.5│            104%│
│deque enumerate accumulate│        7.3│            101%│
│max enumerate accumulate  │        7.3│            101%│
│reduce                    │        7.4│            103%│
│count sum                 │        7.3│            101%│
└──────────────────────────┴───────────┴────────────────┘
(I didn't time the complex and fold methods as being just too obscure, but thanks anyway.)
Since there's very little difference in timing among all these methods I decided to use the count_and_sum function (with an explicit for loop) as being the most readable, explicit and simple (Python Zen) and it also happens to be the fastest!
I wish I could accept one of these amazing answers as correct but they are all equally good though more or less obscure, so I'm just up-voting everybody and accepting my own answer as correct (count_and_sum function) since that's what I'm using.
What was that about "There should be one-- and preferably only one --obvious way to do it."?
Using complex numbers
z = [1, 2, 4, 5, 6]
y = sum(x + 1j for x in z)
sum_z, count_z = y.real, int(y.imag)
print(sum_z, count_z)
18.0 5
I don't know about speed, but this is kind of pretty:
>>> from itertools import accumulate
>>> it = range(10)
>>> max(enumerate(accumulate(it), 1))
(10, 45)
Adaptation of DSM's answer, using deque(..., maxlen=1) to save memory.
import itertools
from collections import deque
deque(enumerate(itertools.accumulate(x), 1), maxlen=1)
Timing code in IPython:
import itertools, random
from collections import deque

def count_and_sum(iter):
    count = sum = 0
    for item in iter:
        count += 1
        sum += item
    return count, sum

X = [random.randint(0, 10) for _ in range(10**7)]

%timeit count_and_sum(X)
%timeit deque(enumerate(itertools.accumulate(X), 1), maxlen=1)
%timeit (max(enumerate(itertools.accumulate(X), 1)))
Results (the deque version is now faster than the OP's method):
1 loops, best of 3: 1.08 s per loop
1 loops, best of 3: 659 ms per loop
1 loops, best of 3: 1.19 s per loop
Here's some timing data that might be of interest:
import timeit

setup = '''
import random, functools, itertools, collections
x = [random.randint(0, 10) for _ in range(10**5)]

def count_and_sum(it):
    c, s = 0, 0
    for i in it:
        c += 1
        s += i
    return c, s

def two_pass(it):
    return sum(i for i in it), sum(True for i in it)

def functional(it):
    return functools.reduce(lambda pair, x: (pair[0]+1, pair[1]+x), it, [0, 0])

def accumulator(it):
    return max(enumerate(itertools.accumulate(it), 1))

def complex(it):
    cpx = sum(x + 1j for x in it)
    return cpx.real, int(cpx.imag)

def dequed(it):
    return collections.deque(enumerate(itertools.accumulate(it), 1), maxlen=1)
'''

number = 100
for stmt in ['count_and_sum(x)',
             'two_pass(x)',
             'functional(x)',
             'accumulator(x)',
             'complex(x)',
             'dequed(x)']:
    print('{:.4}'.format(timeit.timeit(stmt=stmt, setup=setup, number=number)))
Result:
3.404 # OP's one-pass method
3.833 # OP's two-pass method
8.405 # Timothy Shields's fold method
3.892 # DSM's accumulate-based method
4.946 # 1_CR's complex-number method
2.002 # M4rtini's deque-based modification of DSM's method
Given these results, I'm not really sure how the OP is seeing a 100x slowdown with the one-pass method. Even if the data looks radically different from a list of random integers, that just shouldn't happen.
Also, M4rtini's solution looks like the clear winner.
To clarify, these results are in CPython 3.2.3. For a comparison to PyPy3, see James_pic's answer, which shows some serious gains from JIT compilation for some methods (also mentioned in a comment by M4rtini).
As a follow-up to senshin's answer, it's worth noting that the performance differences are largely due to quirks in CPython's implementation that make some methods slower than others (for example, for loops are relatively slow in CPython). I thought it would be interesting to try the exact same test in PyPy (using PyPy3 2.1 beta), which has different performance characteristics. In PyPy the results are:
0.6227 # OP's one-pass method
0.8714 # OP's two-pass method
1.033 # Timothy Shields's fold method
6.354 # DSM's accumulate-based method
1.287 # 1_CR's complex-number method
3.857 # M4rtini's deque-based modification of DSM's method
In this case, the OP's one-pass method is fastest. This makes sense, as it's arguably the simplest (at least from a compiler's point of view) and PyPy can eliminate many of the overheads by inlining method calls, which CPython can't.
For comparison, CPython 3.3.2 on my machine gives the following:
1.651 # OP's one-pass method
1.825 # OP's two-pass method
3.258 # Timothy Shields's fold method
1.684 # DSM's accumulate-based method
3.072 # 1_CR's complex-number method
1.191 # M4rtini's deque-based modification of DSM's method
You can keep count inside the sum with tricks similar to this
>>> from itertools import count
>>> cnt = count()
>>> sum((next(cnt), x)[1] for x in range(10) if x%2)
25
>>> next(cnt)
5
But it's probably going to be more readable to just use a for loop
You can use this:
from itertools import count
lst = range(10)
c = count(1)
tot = sum(next(c) and x for x in lst if x % 2)
n = next(c)-1
print(n, tot)
# 5 25
It's kind of a hack, but it works perfectly well.
I don't know what the Python syntax is off hand, but you could potentially use a fold. Something like this:
(count, total) = fold((0, 0), lambda pair, x: (pair[0] + 1, pair[1] + x))
The idea is to use a seed of (0,0) and then at each step add 1 to the first component and the current number to the second component.
For comparison, you could implement sum as a fold as follows:
total = fold(0, lambda t, x: t + x)
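In Python the fold Timothy describes is functools.reduce; a direct translation (my sketch, with a concrete input added):
from functools import reduce

data = [1, 2, 4, 5, 6]
# seed with (0, 0); each step adds 1 to the count and the value to the total
count, total = reduce(lambda pair, x: (pair[0] + 1, pair[1] + x), data, (0, 0))
print(count, total)  # 5 18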
1_CR's complex number solution is cute but overly hacky. The reason it works is that a complex number behaves like a 2-tuple that sums elementwise. The same is true of numpy arrays, and I think it's slightly cleaner to use those:
import numpy as np
z = [1, 2, 4, 5, 6]
y = sum(np.array([x, 1]) for x in z)
sum_z, count_z = y[0], y[1]
print(sum_z, count_z)
18 5
Something else to consider: If it is possible to determine a minimum possible count, we can let the efficient built-in sum do part of the work:
from itertools import islice
def count_and_sum(iterable):
    ...  # insert favorite implementation here

def count_and_sum_with_min_count(iterable, min_count):
    iterator = iter(iterable)
    slice_sum = sum(islice(iterator, None, min_count))
    rest_count, rest_sum = count_and_sum(iterator)
    return min_count + rest_count, slice_sum + rest_sum
For example, using the deque method with a sequence of 10000000 items and min_count of 5000000, timing results are:
count_and_sum: 1.03
count_and_sum_with_min_count: 0.63
Thanks for all the great answers, but I decided to use my original count_and_sum function, called as follows:
>>> cc, cs = count_and_sum(c.width for c in cols if not c.hide)
As explained in the edits to my original question this turned out to be the fastest and most readable solution.
How about this?
It seems to work.
from functools import reduce

class Column:
    def __init__(self, width, hide):
        self.width = width
        self.hide = hide

lst = [Column(10, False), Column(100, False), Column(1000, True), Column(10000, False)]

print(reduce(lambda acc, col: Column(col.width + acc.width, False) if not col.hide else acc,
             lst, Column(0, False)).width)
You might only need the sum & count today, but who knows what you'll need tomorrow!
Here's an easily extensible solution:
def fold_parallel(itr, **fs):
    res = {
        k: zero for k, (zero, f) in fs.items()
    }
    for x in itr:
        for k, (_, f) in fs.items():
            res[k] = f(res[k], x)
    return res

from operator import add

print(fold_parallel([1, 2, 3],
                    count = (0, lambda a, b: a + 1),
                    sum = (0, add),
                    ))
# {'count': 3, 'sum': 6}
