Python - Optimizing combination generation

Why is the first method so slow?
It can be up to 1000 times slower. Any ideas on how to make it faster?
In this case, performance is the number one priority. In my first attempt I tried to use multiprocessing, but it was quite slow as well.
import time
import operator as op
from math import factorial
from itertools import combinations

def nCr(n, r):
    # https://stackoverflow.com/a/4941932/1167783
    r = min(r, n-r)
    if r == 0:
        return 1
    numer = reduce(op.mul, xrange(n, n-r, -1))
    denom = reduce(op.mul, xrange(1, r+1))
    return numer // denom

def kthCombination(k, l, r):
    # https://stackoverflow.com/a/1776884/1167783
    if r == 0:
        return []
    elif len(l) == r:
        return l
    else:
        i = nCr(len(l)-1, r-1)
        if k < i:
            return l[0:1] + kthCombination(k, l[1:], r-1)
        else:
            return kthCombination(k-i, l[1:], r)

def iter_manual(n, p):
    numbers_list = [i for i in range(n)]
    for comb in xrange(factorial(n)/(factorial(p)*factorial(n-p))):
        x = kthCombination(comb, numbers_list, p)
        # Do something, for example, store those combinations
        # For timing i'm going to do something simple

def iter(n, p):
    for i in combinations([i for i in range(n)], p):
        # Do something, for example, store those combinations
        # For timing i'm going to do something simple
        x = i

#############################

if __name__ == "__main__":
    n = 40
    p = 5

    print '%s combinations' % (factorial(n)/(factorial(p)*factorial(n-p)))

    t0_man = time.time()
    iter_manual(n, p)
    t1_man = time.time()
    total_man = t1_man - t0_man

    t0_iter = time.time()
    iter(n, p)
    t1_iter = time.time()
    total_iter = t1_iter - t0_iter

    print 'Manual: %s' % total_man
    print 'Itertools: %s' % total_iter
    print 'ratio: %s' % (total_man / total_iter)

There are several factors at play here.
The most important is garbage collection. Any method that generates a lot of unnecessary allocations is going to be slow because of GC pauses. In this vein, list comprehensions are fast (for Python) because they are highly optimized under the hood in their allocation and execution. Wherever speed is important, prefer list comprehensions.
Next up you've got function calls. Function calls are relatively expensive, as @roganjosh points out in the comments. This is (again) particularly true if the function generates a lot of garbage or holds on to long-lived closures.
Now we come to loops. Garbage is again the biggest concern: hoist your variables outside the loop and reuse them on each iteration.
Last but certainly not least is that Python is, in a sense, a hosted language: generally on the CPython runtime. Anything implemented in the runtime itself (particularly if the thing in question is implemented in C rather than Python itself) is going to be faster than your (logically equivalent) code.
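As a rough illustration of the loop and C-runtime points, here is a minimal timeit sketch (not from the original question; absolute numbers will vary by machine) comparing an explicit Python-level loop, a list comprehension, and the C-implemented built-in sum():

import timeit

loop_version = """
total = 0
for i in range(100000):
    total += i
"""
print(timeit.timeit(loop_version, number=100))                  # explicit Python-level loop
print(timeit.timeit("[i for i in range(100000)]", number=100))  # list comprehension
print(timeit.timeit("sum(range(100000))", number=100))          # the loop runs in C inside sum()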
NOTE
All of this advice is detrimental to code quality. Use with caution. Profile first. Also note that compilers are generally smart enough to do all of this for you; for instance, PyPy will generally run the same code faster than the standard Python runtime because it does optimizations like this for you when it runs your code.
NOTE 2
One of the implementations uses reduce. In theory, reduce could be fast. But it isn't, for lots of reasons, the chief of which could possibly be summed up as "Guido didn't/doesn't care". So don't use reduce when speed is important.
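A quick way to check the reduce point yourself (a minimal sketch, using the Python 3 spelling with functools.reduce; numbers will vary):

import timeit

setup = "from functools import reduce; from operator import add; data = list(range(10000))"
print(timeit.timeit("reduce(add, data)", setup=setup, number=1000))  # Python-level reduce
print(timeit.timeit("sum(data)", setup=setup, number=1000))          # C-implemented built-in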

What does sum do?

First, I wanted to test the memory usage of a generator versus a list comprehension. The book gives a little benchmark code snippet; I ran it on my PC (Python 3.6, Windows) and found something unexpected.
The book says that because a list comprehension has to create a real list and allocate memory for it, iterating over a list comprehension must be slower than iterating over a generator.
Of course, the list comprehension uses more memory than the generator.
Following is my code, which does not bear out the previous claim (when the comprehension is used inside the sum function).
import tracemalloc
from time import time

def timeIt(func):
    start = time()
    func()
    print('%s use time' % func.__name__, time() - start)
    return func

tracemalloc.start()

numbers = range(1, 1000000)

@timeIt
def lStyle():
    return sum([i for i in numbers if i % 3 == 0])

@timeIt
def gStyle():
    return sum((i for i in numbers if i % 3 == 0))

lStyle()
gStyle()

shouldSize = [i for i in numbers if i % 3 == 0]

snapshotL = tracemalloc.take_snapshot()
top_stats = snapshotL.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)
The output:
lStyle use time 0.4460000991821289
gStyle use time 0.6190001964569092
[ Top 10 ]
F:/py3proj/play.py:31: size=11.5 MiB, count=333250, average=36 B
F:/py3proj/play.py:33: size=448 B, count=1, average=448 B
F:/py3proj/play.py:22: size=136 B, count=1, average=136 B
F:/py3proj/play.py:17: size=136 B, count=1, average=136 B
F:/py3proj/play.py:14: size=76 B, count=2, average=38 B
F:/py3proj/play.py:8: size=34 B, count=1, average=34 B
Two points:
The generator uses more time and the same memory space.
The list comprehension inside the sum function does not seem to create the whole list.
I think maybe the sum function does something I don't know about. Can someone explain this?
The book is High Performance Python, chapter 5. But I did something different from the book to check its validity in another context. His code is here: book_code; he didn't put the list comprehension inside the sum function.
When it comes to timing tests, I rely on the timeit module, because it automatically executes multiple runs of the code.
On my system, timeit gives following results (I strongly reduced sizes because of the numerous runs):
>>> timeit.timeit("sum([i for i in numbers if i % 3 == 0])", "numbers = range(1, 1000)")
59.54427594248068
>>> timeit.timeit("sum((i for i in numbers if i % 3 == 0))", "numbers = range(1, 1000)")
64.36398425334801
So the generator is slower by about 8% (*). This is not really a surprise, because the generator has to execute some code on the fly to get the next value, while iterating over a precomputed list only increments a pointer.
Memory evaluation is IMHO more complex, so I used the Compute Memory footprint of an object and its contents (Python recipe) from ActiveState:
>>> numbers = range(1, 100)
>>> numbers = range(1, 100000)
>>> l = [i for i in numbers if i % 3 == 0]
>>> g = (i for i in numbers if i % 3 == 0)
>>> total_size(l)
1218708
>>> total_size(g)
88
>>> total_size(numbers)
48
My interpretation is that a list uses memory for all of its items (which is not a surprise), while a generator only needs its current state and some code, so its memory footprint is much smaller.
I strongly think that you have used tracemalloc for something it is not intended for. It is aimed at finding possible memory leaks (large blocks of memory that are never deallocated), not at measuring the memory used by individual objects.
BEWARE: I could only test for small sizes. But for very large sizes, the list could exhaust the available memory and the machine will use virtual memory from swap. In that case, the list version will become much slower. More details there
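The total_size helper above comes from that ActiveState recipe and is not reproduced in the answer; a simplified sketch of the same idea (recursive sys.getsizeof over containers, my own simplification rather than the exact recipe) looks roughly like this:

import sys
from itertools import chain
from collections import deque

def total_size(obj):
    # Recursively sum sys.getsizeof over common containers, avoiding double counting.
    seen = set()
    def sizeof(o):
        if id(o) in seen:
            return 0
        seen.add(id(o))
        size = sys.getsizeof(o)
        if isinstance(o, (list, tuple, set, frozenset, deque)):
            size += sum(sizeof(item) for item in o)
        elif isinstance(o, dict):
            size += sum(sizeof(item) for item in chain.from_iterable(o.items()))
        return size
    return sizeof(obj)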

number of loops matters efficiency (interpreted vs compiled languages?)

Say you have to carry out a computation by using 2 or even 3 loops. Intuitively, one may think that it's more efficient to do this with a single loop. I tried a simple Python example:
import itertools
import timeit

def case1(n):
    c = 0
    for i in range(n):
        c += 1
    return c

def case2(n):
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += 1
    return c

print(case1(1000))
print(case2(10))

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("case1(1000)", setup="from __main__ import case1", number=10000))
    print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))
This code runs:
$ python3 code.py
1000
1000
0.8281264099932741
1.04944919400441
So effectively 1 loop seems to be a bit more efficient. Yet I have a slightly different scenario in my problem, as I need to use the values in an array (in the following example I use the function range for simplification). That is, if I collapse everything to a single loop I would have to create an extended array from the values of another array whose size is between 2 and 10 elements.
import itertools
import timeit

def case1(n):
    b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c

def case2(n):
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += i*j*k
    return c

print(case1(10))
print(case2(10))

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("case1(10)", setup="from __main__ import case1", number=10000))
    print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))
On my computer this code runs in:
$ python3 code.py
91125
91125
2.435348572995281
1.6435037050105166
So it seems the 3 nested loops are more efficient, because I spend some time creating the array b in case1. I'm not sure I'm creating this array in the most efficient way, but leaving that aside, does it really pay off to collapse loops into a single one? I'm using Python here, but what about compiled languages like C++? Does the compiler in this case do something to optimize the single loop? Or, on the other hand, does the compiler do some optimization when you have multiple nested loops?
This line is why the single-loop function supposedly takes longer than it should:
b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
Just changing the whole function to
def case1(n, b):
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c
makes timeit return:
case1 : 0.965343249744
case2 : 2.28501694207
Your case is simple enough that various optimizations would probably do a lot. Be it numpy for more efficient arrays, maybe PyPy for a better JIT optimizer, or various other things.
Looking at the bytecode via the dis module can help you understand what happens under the hood and make some micro optimizations, but in general it does not really matter if you do one loop or a nested loop, if your memory access pattern is somewhat predictable for the CPU. If not, it may differ wildly.
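For example, a minimal dis sketch (the exact opcodes printed depend on the Python version; the function here is just an illustrative stand-in):

import dis

def tight_loop(n):
    c = 0
    for i in range(n):
        c += i
    return c

dis.dis(tight_loop)   # shows FOR_ITER plus the add/store opcodes of the loop body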
Python has some bytecodes that are cheap and others that are more expensive, e.g. function calls are much more expensive than a simple addition. Same with creating new objects and various other things. So the usual optimization is moving the loop to C, which is one of the benefits of itertools sometimes.
Once you are on the C-level it usually comes down to: Avoid syscalls/mallocs() in tight loops, have predictable memory access patterns and make sure your algorithm is cache friendly.
So, your algorithms above will probably vary wildly in performance if you go to large values of N, due to the amount of memory allocation and cache access.
But the fastest way for the specific problem above would be to find a closed form for the function, it seems wasteful to iterate for that, as there must be a much simpler formula to calculate the final value of 'c'. As usual, first get the best algorithm before doing micro optimizations.
e.g. Wolfram Alpha tells you that you can replace the two inner loops with a closed form (there is probably a closed form for all three, but Alpha didn't tell me...):
def case3(n):
    c = 0
    for j in range(n):
        c += j * n**2 * (n - 1)**2 // 4   # closed form of the two inner loops over range(n)
    return c
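For completeness (my addition, not from the original answer), the remaining loop collapses as well, since the triple sum of i*j*k over three independent ranges factors into (sum(range(n)))**3:

def case4(n):
    s = n * (n - 1) // 2   # sum(range(n))
    return s ** 3          # sum of i*j*k over the three independent ranges

print(case4(10))           # 91125, same as case1(10) and case2(10)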

How to check if a number is within a group of non-consecutive ranges?

The question may sound complicated, but it is actually pretty simple; I just can't find a nice solution in Python.
I have ranges like
("8X5000", "8X5099"). Here X can be any digit, so I want to match numbers that fall into one of the ranges (805000..805099 or 815000..815099 or ... 895000..895099).
How can I do this?
@TimPietzcker's answer is correct and Pythonic, but it raises some performance concerns (arguably making it even more Pythonic). It creates an iterator and then searches it for a value; I don't expect Python to be able to optimize that search.
This should perform better:
def IsInRange(n, r=("8X5000", "8X5099")):
    (minr, maxr) = [[int(i) for i in l.split('X')] for l in r]
    p = len(r[0]) - r[0].find('X')
    nl = (n // 10**p, n % 10**(p-1))
    fInRange = all([minr[i] <= nl[i] <= maxr[i] for i in range(2)])
    return fInRange
The first line inside the function is a nested list comprehension, so it may be a little hard to read, but it sets:
minr = [8, 5000]
maxr = [8, 5099]
When n = 595049:
nl = (5, 5049)
The code just splits the ranges into parts (while converting to int), splits the target number into parts, then range checks the parts. It wouldn't be hard to enhance this to handle multiple X's in the range specifiers.
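A few spot checks with the default range (illustrative values, not from the original answer):
>>> IsInRange(805050)
True
>>> IsInRange(895099)
True
>>> IsInRange(805100)
False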
Update
I just tested relative performance using timeit:
def main():
    t1 = timeit.timeit('MultiRange.in_range(985000)', setup='import MultiRange', number=10000)
    t2 = timeit.timeit('MultiRange.IsInRange(985000)', setup='import MultiRange', number=10000)
    print t1, t2
    print float(t2)/float(t1), 1 - float(t2)/float(t1)
On my 32-bit Win 7 machine running Python 2.7.2, my solution is almost 10 times faster than @TimPietzcker's (to be specific, it runs in 12% of the time). As you increase the size of the range, it only gets worse. When:
ranges=("8X5000", "8X5999")
The performance boost is 50x. Even for the smallest range, my version runs 4 times faster.
With @PaulMcGuire's suggested performance patch to in_range, my version runs 3 times faster.
Update 2
Motivated by @PaulMcGuire's comment, I went ahead and refactored our functions into classes. Here's mine:
class IsInRange5(object):
    def __init__(self, r=("8X5000", "8X5099")):
        ((self.minr0, self.minr1), (self.maxr0, self.maxr1)) = [[int(i) for i in l.split('X')] for l in r]
        pos = len(r[0]) - r[0].find('X')
        self.basel = 10**(pos-1)
        self.baseh = self.basel*10
        self.ir = range(2)
    def __contains__(self, n):
        return self.minr0 <= n // self.baseh <= self.maxr0 and \
               self.minr1 <= n % self.basel <= self.maxr1
This did close the gap, but even after pre-computing range invariants (for both), @PaulMcGuire's took 50% longer.
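Usage then reads naturally with the in operator (illustrative values, not from the original answer):
>>> checker = IsInRange5()
>>> 805050 in checker
True
>>> 805100 in checker
False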
range = (80555,80888)
x = 80666
print range[0] < x < range[1]
Maybe this is what you're looking for...
Example for Python 3 (in Python 2, use xrange instead of range):
def in_range(number, ranges=("8X5000", "8X5099")):
    actual_ranges = ((int(ranges[0].replace("X", digit)),
                      int(ranges[1].replace("X", digit)) + 1)
                     for digit in "0123456789")
    return any(number in range(*interval) for interval in actual_ranges)
Results:
>>> in_range(805001)
True
>>> in_range(895099)
True
>>> in_range(805100)
False
An improvement to this, suggested by Paul McGuire (thanks!):
def in_range(number, ranges=("8X5000", "8X5099")):
    actual_ranges = ((int(ranges[0].replace("X", digit)),
                      int(ranges[1].replace("X", digit)))
                     for digit in "0123456789")
    return any(minval <= number <= maxval for minval, maxval in actual_ranges)

Python and performance of list comprehensions

Suppose you have a list comprehension in Python, like
Values = [ f(x) for x in range( 0, 1000 ) ]
with f being just a function without side effects. So all the entries can be computed independently.
Is Python able to increase the performance of this list comprehension compared with the "obvious" implementation, e.g. by shared-memory parallelization on multicore CPUs?
In Python 3.2 they added concurrent.futures, a nice library for solving problems concurrently. Consider this example:
import math, time
from concurrent import futures

PRIMES = [112272535095293, 112582705942171, 112272535095293, 115280095190773, 115797848077099, 1099726899285419, 112272535095293, 112582705942171, 112272535095293, 115280095190773, 115797848077099, 1099726899285419]

def is_prime(n):
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def bench(f):
    start = time.time()
    f()
    elapsed = time.time() - start
    print("Completed in {} seconds".format(elapsed))

def concurrent():
    with futures.ProcessPoolExecutor() as executor:
        values = list(executor.map(is_prime, PRIMES))

def listcomp():
    values = [is_prime(x) for x in PRIMES]
Results on my quad core:
>>> bench(listcomp)
Completed in 14.463825941085815 seconds
>>> bench(concurrent)
Completed in 3.818351984024048 seconds
No, Python will not magically parallelize this for you. In fact, it can't, since it cannot prove the independence of the entries; that would require a great deal of program inspection/verification, which is impossible to get right in the general case.
If you want quick coarse-grained multicore parallelism, I recommend joblib instead:
from joblib import delayed, Parallel
values = Parallel(n_jobs=NUM_CPUS)(delayed(f)(x) for x in range(1000))
Not only have I witnessed near-linear speedups using this library, it also has the great feature of forwarding signals such as Ctrl-C to its worker processes, which cannot be said of all multiprocessing libraries.
Note that joblib doesn't really support shared-memory parallelism: it spawns worker processes, not threads, so it incurs some communication overhead from sending data to workers and results back to the master process.
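If you prefer to stay in the standard library, a roughly equivalent process-based sketch uses multiprocessing.Pool; the f below is a hypothetical stand-in for the side-effect-free function from the question:

from multiprocessing import Pool

def f(x):
    return x * x          # stand-in for the side-effect-free f from the question

if __name__ == "__main__":
    with Pool() as pool:  # defaults to os.cpu_count() worker processes
        values = pool.map(f, range(1000))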
Try whether the following is faster:
Values = map(f,range(0,1000))
That's a functional style of coding.
Another idea is to replace all occurrences of Values in the code with the generator expression
imap(f,range(0,1000)) # Python < 3
map(f,range(0,1000)) # Python 3

Python Syntax Problem

I'm just getting back into Project Euler and have lost my account and solutions, so I'm back on problem 7. However, my code doesn't work. It seems fairly elementary to me; can someone help me debug my (short) script?
It should find the 10001st prime.
#!/usr/bin/env python
# encoding: utf-8
"""
P7.py

Created by Andrew Levenson on 2010-06-29.
Copyright (c) 2010 __ME__. All rights reserved.
"""

import sys
import os
from math import sqrt

def isPrime(num):
    flag = True
    for x in range(2,int(sqrt(num))):
        if( num % x == 0 ):
            flag = False
    if flag == True:
        return True
    else:
        return False

def main():
    i, n = 1, 3
    p = False
    end = 6
    while end - i >= 0:
        p = isPrime(n)
        if p == True:
            i = i + 1
            print n
        n = n + 1

if __name__ == '__main__':
    main()
Edit*: Sorry, the issue is it says every number is prime. :/
The syntax is fine (in Python 2). The semantics has some avoidable complications, and this off-by-one bug:
for x in range(2,int(sqrt(num))):
    if( num % x == 0 ):
        flag = False
range(2, Y) goes from 2 included to Y excluded -- so you're often not checking the last possible divisor and thereby deeming "primes" many numbers that aren't. As the simplest fix, try a 1 + int(... in that range. After which, removing those avoidable complications is advisable: for example,
if somebool: return True
else: return False
is never warranted, as the simpler return somebool does the same job.
A simplified version of your entire code (with just indispensable optimizations, but otherwise exactly the same algorithm) might be, for example:
from math import sqrt

def isPrime(num):
    for x in range(3, int(1 + sqrt(num)), 2):
        if num % x == 0: return False
    return True

def main():
    i, n = 0, 3
    end = 6
    while i < end:
        if isPrime(n):
            i += 1
            print n
        n += 2

if __name__ == '__main__':
    main()
"Return as soon as you know the answer" was already explained, I've added one more crucial optimization (+= 2, instead of 1, for n, as we "know" even numbers > 3 are not primes, and a tweak of the range for the same reason).
It's possible to get cuter, e.g.:
def isPrime(num):
    return all(num % x for x in range(3, int(1 + sqrt(num)), 2))
though this may not look "simpler" if you're unfamiliar with the all built-in, it really is, because it saves you having to do (and readers of the code having to follow) low level logic, in favor of an appropriate level of abstraction to express the function's key idea, that is, "num is prime iff all possible odd divisors have a non-0 remainder when the division is tried" (i.e., express the concept directly in precise, executable form). The algorithm within is actually still identical.
Going further...:
import itertools as it

def odd():
    for n in it.count(1):
        yield n + n + 1

def main():
    end = 5
    for i, n in enumerate(it.ifilter(isPrime, odd())):
        print n
        if i >= end: break
Again, this is just the same algorithm as before, just expressed at a more appropriate level of abstraction: the generation of the sequence of odd numbers (from 3 included upwards) placed into its own odd generator, and some use of the enumerate built-in and itertools functionality to avoid inappropriate (and unneeded) low-level expression / reasoning.
I repeat: no fundamental optimization applied yet -- just suitable abstraction. Optimization of unbounded successive primes generation in Python (e.g. via an open-ended Eratosthenes Sieve approach) has been discussed in depth elsewhere, e.g. here (be sure to check the comments too!). Here I was focusing on showing how (with built-ins such as enumerate, all, and any, the crucial itertools, plus generators and generator expressions) many "looping" problems can be expressed in modern Python at more appropriate levels of abstraction than the "C-inspired" ones that may appear most natural to most programmers reared on C programming and the like. (Perhaps surprisingly to scholars used to C++'s "abstraction penalty" first identified by Stepanov, Python usually tends to have an "abstraction premium" instead, especially if itertools, well known for its blazing speed, is used extensively and appropriately... but, that's really a different subject;-).
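For reference, a minimal bounded Sieve of Eratosthenes sketch in Python 3 (my own illustration, not the open-ended generator discussed in that link) that solves the actual Project Euler problem 7:

def nth_prime(n, limit=200000):
    # Bounded Sieve of Eratosthenes; 200000 comfortably covers the 10001st prime.
    sieve = bytearray([1]) * limit
    sieve[0] = sieve[1] = 0
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, limit, i)))
    primes = [i for i, flag in enumerate(sieve) if flag]
    return primes[n - 1]

print(nth_prime(10001))   # 104743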
Isn't this better?
def isPrime(num):
    for x in range(2,int(sqrt(num))):
        if( num % x == 0 ):
            return False
    return True
And this:
def main():
    i, n = 1, 3
    while i <= 6:
        if isPrime(n):
            i = i + 1
            print n
        n = n + 1
Also, I'm not seeing a 10001 anywhere in there...
