Suppose you have a list comprehension in Python, like
Values = [f(x) for x in range(0, 1000)]
with f being a function without side effects, so all the entries can be computed independently.
Is Python able to increase the performance of this list comprehension compared with the "obvious" implementation, e.g. by shared-memory parallelization on multicore CPUs?
In Python 3.2 they added concurrent.futures, a nice library for solving problems concurrently. Consider this example:
import math, time
from concurrent import futures

PRIMES = [112272535095293, 112582705942171, 112272535095293, 115280095190773, 115797848077099, 1099726899285419, 112272535095293, 112582705942171, 112272535095293, 115280095190773, 115797848077099, 1099726899285419]

def is_prime(n):
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def bench(f):
    start = time.time()
    f()
    elapsed = time.time() - start
    print("Completed in {} seconds".format(elapsed))

def concurrent():
    with futures.ProcessPoolExecutor() as executor:
        values = list(executor.map(is_prime, PRIMES))

def listcomp():
    values = [is_prime(x) for x in PRIMES]
Results on my quad core:
>>> bench(listcomp)
Completed in 14.463825941085815 seconds
>>> bench(concurrent)
Completed in 3.818351984024048 seconds
No, Python will not magically parallelize this for you. In fact, it can't, since it cannot prove the independence of the entries; that would require a great deal of program inspection/verification, which is impossible to get right in the general case.
If you want quick coarse-grained multicore parallelism, I recommend joblib instead:
from joblib import delayed, Parallel
values = Parallel(n_jobs=NUM_CPUS)(delayed(f)(x) for x in range(1000))
Not only have I witnessed near-linear speedups using this library, it also has the great feature of forwarding signals such as Ctrl-C to its worker processes, which cannot be said of all multiprocessing libraries.
Note that joblib doesn't really support shared-memory parallelism: it spawns worker processes, not threads, so it incurs some communication overhead from sending data to workers and results back to the master process.
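For completeness, here is a self-contained version of the snippet above. NUM_CPUS is not defined in the answer, so I fill it in with os.cpu_count(), and f is a stand-in function; both are my additions, purely for illustration:

import os
from joblib import Parallel, delayed

def f(x):
    # stand-in for the side-effect-free function from the question
    return x * x

NUM_CPUS = os.cpu_count()  # or pass n_jobs=-1 to let joblib use all cores
values = Parallel(n_jobs=NUM_CPUS)(delayed(f)(x) for x in range(1000))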
Try whether the following is faster:
Values = map(f, range(0, 1000))
That's a more functional way to write it.
Another idea is to replace all occurrences of Values in the code with the lazy iterator
itertools.imap(f, range(0, 1000))  # Python < 3
map(f, range(0, 1000))             # Python 3, where map is already lazy
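Note that this only defers the computation until the values are actually consumed; it does not parallelize anything. A minimal sketch (my own, Python 3) of what the laziness changes:

import time

def f(x):
    time.sleep(0.01)
    return x * x

t0 = time.time()
lazy = map(f, range(100))           # returns immediately: nothing computed yet
print("map:      %.3fs" % (time.time() - t0))

t0 = time.time()
eager = [f(x) for x in range(100)]  # computes every entry up front
print("listcomp: %.3fs" % (time.time() - t0))

t0 = time.time()
list(lazy)                          # the cost is paid when the iterator is consumed
print("consume:  %.3fs" % (time.time() - t0))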
Related
I'm using Python's multiprocessing module for some analyses. Specifically, I use a pool and the apply_async method to compute some results from a large number of variables. The time required to compute the results varies drastically with the input, which got me wondering whether the pool will either A) immediately assign a process to each input when I call apply_async, or B) assign processes as they become ready. My worry is that A) would imply that one of the pool's processes might be assigned several of the 'heavy' inputs and so take longer than necessary to complete.
I tried running the code below and got similar runtimes in each run (<2 seconds), indicating that B) is the case, at least here. Is that what should always be expected?
import multiprocessing as mp
import numpy as np
import time

def foo(x):
    if x % 4 == 0:
        time.sleep(1)   # Simulating a large workload
    else:
        time.sleep(.1)  # Simulating a small workload
    return 2*x

if __name__ == '__main__':
    for _ in range(20):
        with mp.Pool(processes=4) as pool:
            jobs = []
            now = time.time()
            vals = list(range(1, 9))
            np.random.shuffle(vals)
            for x in vals:
                jobs.append(pool.apply_async(foo, args=(x,)))
            results = [j.get() for j in jobs]
            elapsed = time.time() - now
            print(elapsed)
I'm trying to use multiprocessing in Python for the first time. I wrote a basic prime-searching program and I want to run it simultaneously on each core. The problem is: when the program does the multiprocessing, it runs not only the 'primesearch' function but also the beginning of the code. My expected output would be a list of prime numbers between 0 and a limit, but instead it prints "Enter a limit: " 16 times (I have 16 cores and 16 processes).
Here is my code:
import time
import os
from multiprocessing import Process

# Defining lists
primes = []
processes = []
l = [0]

limit = int(input('Enter a limit: '))

def primesearch(lower, upper):
    global primes
    for num in range(lower, upper):
        if num > 1:
            for i in range(2, num):
                if (num % i) == 0:
                    break
            else:
                primes.append(num)

# Start the clock
starter = time.perf_counter()

# Dividing data
step = limit // os.cpu_count()
for x in range(os.cpu_count()):
    l.append(step * (x+1))
l[-1] = limit

# Multiprocessing
for init in range(os.cpu_count()):
    processes.append(Process(target=primesearch, args=[l[init], l[init + 1]]))

for process in processes:
    process.start()

for process in processes:
    process.join()

# End clock
finish = time.perf_counter()

print(primes)
print(f'Finished in {round(finish-starter, 2)} second')
What could be the problem?
You are using Windows. If you read the Python documentation for multiprocessing, it will reveal that you should protect your main code with an if __name__ == "__main__": guard. This is because on Windows each worker process re-imports, and therefore re-executes, the module-level code of your main .py file.
This guard is used in pretty much every example in the documentation, and it is explained in the 'Programming guidelines' section at the end.
See https://docs.python.org/3/library/multiprocessing.html
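As a sketch of the structural change (not the asker's exact program; I also print primes from the workers instead of appending to a global list, since that list is not shared between processes, as the next answer explains):

import os
import time
from multiprocessing import Process

def primesearch(lower, upper):
    # Same trial division as in the question.
    for num in range(max(lower, 2), upper):
        for i in range(2, num):
            if num % i == 0:
                break
        else:
            print(num)

if __name__ == '__main__':
    # Everything with side effects lives under the guard, so worker
    # processes that re-import this module do not prompt for input again.
    limit = int(input('Enter a limit: '))
    step = limit // os.cpu_count()
    bounds = [0] + [step * (x + 1) for x in range(os.cpu_count())]
    bounds[-1] = limit

    start = time.perf_counter()
    processes = [Process(target=primesearch, args=(bounds[i], bounds[i + 1]))
                 for i in range(os.cpu_count())]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f'Finished in {round(time.perf_counter() - start, 2)} seconds')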
Apart from the __main__ issue, your way of using primes as a global list doesn't work: each process gets its own copy of the list, so the parent's primes stays empty.
I imported Queue from multiprocessing and used primes = Queue() and
size = primes.qsize()
print([primes.get() for _ in range(size)])
primes.close()
in the main function, and primes.put(num) in your function. I don't know if it's the best way; for me this works, but if N > 12000 the console freezes. Also, in this case, using multiprocessing is actually slightly slower than a single process.
If you aim for speed, you can test divisors only up to the square root of num, which drastically reduces the number of divisions. There are many optimizations you can do. If you are testing huge numbers, you can use the Rabin-Miller algorithm.
http://inventwithpython.com/cracking/chapter22.html
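A minimal sketch of the Queue approach described above, with the square-root bound applied. The sentinel value and draining the queue before join() are my additions: a worker blocks on exit until its buffered queue items are consumed, which is the likely reason the console freezes for large N.

from multiprocessing import Process, Queue

def primesearch(lower, upper, out):
    for num in range(max(lower, 2), upper):
        for i in range(2, int(num ** 0.5) + 1):  # test only up to sqrt(num)
            if num % i == 0:
                break
        else:
            out.put(num)      # send each prime back through the queue
    out.put(None)             # sentinel: this worker is done

if __name__ == '__main__':
    limit = 100000
    out = Queue()
    mid = limit // 2
    procs = [Process(target=primesearch, args=(lo, hi, out))
             for lo, hi in [(0, mid), (mid, limit)]]
    for p in procs:
        p.start()

    # Drain the queue before joining the workers.
    primes, done = [], 0
    while done < len(procs):
        item = out.get()
        if item is None:
            done += 1
        else:
            primes.append(item)

    for p in procs:
        p.join()
    print(sorted(primes))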
Why is the first method so slow?
It can be up to 1000 times slower; any ideas on how to make it faster?
In this case, performance is the number one priority. In my first attempt I tried to use multiprocessing, but it was quite slow as well.
Python - Set the first element of a generator - Applied to itertools
import time
import operator as op
from math import factorial
from itertools import combinations

def nCr(n, r):
    # https://stackoverflow.com/a/4941932/1167783
    r = min(r, n-r)
    if r == 0:
        return 1
    numer = reduce(op.mul, xrange(n, n-r, -1))
    denom = reduce(op.mul, xrange(1, r+1))
    return numer // denom

def kthCombination(k, l, r):
    # https://stackoverflow.com/a/1776884/1167783
    if r == 0:
        return []
    elif len(l) == r:
        return l
    else:
        i = nCr(len(l)-1, r-1)
        if k < i:
            return l[0:1] + kthCombination(k, l[1:], r-1)
        else:
            return kthCombination(k-i, l[1:], r)

def iter_manual(n, p):
    numbers_list = [i for i in range(n)]
    for comb in xrange(factorial(n)/(factorial(p)*factorial(n-p))):
        x = kthCombination(comb, numbers_list, p)
        # Do something, for example, store those combinations
        # For timing i'm going to do something simple

def iter(n, p):
    for i in combinations([i for i in range(n)], p):
        # Do something, for example, store those combinations
        # For timing i'm going to do something simple
        x = i

#############################

if __name__ == "__main__":
    n = 40
    p = 5

    print '%s combinations' % (factorial(n)/(factorial(p)*factorial(n-p)))

    t0_man = time.time()
    iter_manual(n, p)
    t1_man = time.time()
    total_man = t1_man - t0_man

    t0_iter = time.time()
    iter(n, p)
    t1_iter = time.time()
    total_iter = t1_iter - t0_iter

    print 'Manual: %s' % total_man
    print 'Itertools: %s' % total_iter
    print 'ratio: %s' % (total_man/total_iter)
There are several factors at play here.
The most important is garbage collection. Any method that generates a lot of unnecessary allocations is going to be slow because of GC pauses. In this vein, list comprehensions are fast (for Python) because they are highly optimized under the hood in their allocation and execution. Wherever speed is important, prefer list comprehensions.
Next up you've got function calls. Function calls are relatively expensive as #roganjosh points out in the comments. This is (again) particularly true if the function generates a lot of garbage or holds on to long-lived closures.
Now we come to loops. Garbage is again the biggest concern: hoist your variables outside the loop and reuse them on each iteration.
Last but certainly not least is that Python is, in a sense, a hosted language: generally on the CPython runtime. Anything implemented in the runtime itself (particularly if the thing in question is implemented in C rather than Python itself) is going to be faster than your (logically equivalent) code.
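As an illustrative sketch of that last point (my own micro-benchmark, not from the question; absolute timings vary by machine), compare a Python-level nested loop with the C-implemented itertools.combinations producing the same tuples:

import timeit

setup = "from itertools import combinations"

py_loops = """
out = []
for i in range(30):
    for j in range(i + 1, 30):
        out.append((i, j))
"""

c_builtin = "out = list(combinations(range(30), 2))"

print("nested Python loops:", timeit.timeit(py_loops, setup, number=5000))
print("itertools (C code): ", timeit.timeit(c_builtin, setup, number=5000))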
NOTE
All of this advice is detrimental to code quality. Use it with caution. Profile first. Also note that compilers are generally smart enough to do all of this for you; for instance, PyPy will generally run the same code faster than the standard Python runtime because it performs optimizations like these when it runs your code.
NOTE 2
One of the implementations uses reduce. In theory, reduce could be fast. But it isn't for lots of reasons, the chief of which could possibly be summed up as "Guido didn't/doesn't care". So don't use reduce when speed is important.
I have an iterator function that yields an infinite stream of integers:
def all_ints(start=0):
    yield start
    yield from all_ints(start + 1)
I want to have a pool of threads or processes do calculations on these up to $POOLSIZE at a time. Each process will possibly save a result to some shared data structure so I do not need the return value from the process/thread function. It seems to me this use of the python3 Pool would achieve this:
from multiprocessing import Pool

# dummy example functions
def check_prime(n):
    return n % 2 == 0

def store_prime(p):
    '''synchronize, write to some shared structure'''
    pass

p = Pool()
for n in all_ints():
    p.apply_async(check_prime, (n,), callback=store_prime)
But when I run this I get a python process that just continually uses more memory (and not from the iterator, that can run for days). I would expect this behavior if I was storing the results of all the apply_async calls, but I am not.
What am I doing wrong here? Or is there another API from the thread pool I should be using?
I think you are looking for Pool.imap_unordered, which uses the pooled processes to apply a function to the elements yielded by an iterator. Its parameter chunksize allows you to specify how many items from the iterator are passed to the pool in each step.
Also I would avoid using any shared memory structures for IPC. Just let the "expensive" function sent to the pool return the information you need, and process it in the main process.
Here's an example (where I stop after 200,000 primes; if you remove that part, you'll see the processes happily work in a fixed amount of RAM "forever"):
from multiprocessing import Pool
from math import sqrt
import itertools
import time

def check_prime(n):
    if n == 2: return (n, True)
    if n % 2 == 0 or n < 2: return (n, False)
    for i in range(3, int(sqrt(n))+1, 2):
        if n % i == 0: return (n, False)
    return (n, True)

def main():
    L = 200000  # limit for performance timing
    p = Pool()
    n_primes = 0
    before = time.time()
    for (n, is_prime) in p.imap_unordered(check_prime, itertools.count(1), 1000):
        if is_prime:
            n_primes += 1
            if n_primes >= L:
                break
    print("Computed %d primes in %.1fms" % (n_primes, (time.time()-before)*1000.0))

if __name__ == "__main__":
    main()
Output on my Intel Core i5 (2 Core, 4 Threads):
Computed 200000 primes in 15167.9ms
Output if I change it to Pool(1), so using just 1 subprocess:
Computed 200000 primes in 37909.2ms
HTH!
Questions
Why is the CPU usage of my threaded python merge sort only 50% for each core?
Why does this result in "cannot create new thread" errors for relatively small inputs (100000)
How can I make this more pythonic? (It's very ugly.)
Linux/Ubuntu 12.4 64-bit i5 mobile (quad)
from random import shuffle
from threading import *
import time
import Queue

q = Queue.LifoQueue()

def merge_sort(num, q):
    end = len(num)
    if end > 1:
        mid = end / 2
        thread = Thread(target=merge_sort, args=(num[0:mid], q,))
        thread1 = Thread(target=merge_sort, args=(num[mid:end], q,))
        thread.setDaemon(True)
        thread1.setDaemon(True)
        thread.start()
        thread1.start()
        thread.join()
        thread1.join()
        return merge(q.get(num), q.get(num))
    else:
        if end != 0:
            q.put(num)
        else:
            print "?????"
        return

def merge(num, num1):
    a = []
    while len(num) is not 0 and len(num1) is not 0:
        if num[0] < num1[0]:
            a.append(num.pop(0))
        else:
            a.append(num1.pop(0))
    if len(num) is not 0:
        for i in range(0, len(num)):
            a.append(num.pop(0))
    if len(num1) is not 0:
        for i in range(0, len(num1)):
            a.append(num1.pop(0))
    q.put(a)
    return a

def main():
    val = long(raw_input("Please enter the maximum value of the range:")) + 1
    start_time = time.time()
    numbers = xrange(0, val)
    shuffle(numbers)
    numbers = merge_sort(numbers[0:val], q)
    # print "Sorted list is: \n"
    # for number in numbers:
    #     print number
    print str(time.time() - start_time) + " seconds to run.\n"

if __name__ == "__main__":
    main()
For the 100000 input your code tries to create ~200000 threads. Python threads are real OS threads so the 50% CPU load that you are seeing is probably the system busy handling the threads. On my system the error happens around ~32000 threads.
Your code as written can't possibly work:
from random import shuffle
#XXX won't work
numbers = xrange(0, val)
shuffle(numbers)
xrange() is not a mutable sequence.
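A minimal fix (my suggestion, not part of the original answer) is to materialize the range into a list before shuffling:

from random import shuffle

val = 100000
numbers = list(range(0, val))  # a real, mutable list
shuffle(numbers)               # shuffle works in place on mutable sequences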
Note: the sorting takes much less time than the random shuffling of the array:
import numpy as np
numbers = np.random.permutation(10000000) # most of the time is spent here
numbers.sort()
If you want to sort parts of the array using different threads, you can do it:
from multiprocessing.dummy import Pool # use threads
Pool(2).map(lambda a: a.sort(), [numbers[:N//2], numbers[N//2:]])  # N = len(numbers)
a.sort() releases the GIL, so the code uses 2 CPUs.
If you include the time it takes to merge the sorted parts, it may be faster just to sort the whole array at once (numbers.sort()) in a single thread.
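A rough timing sketch of that comparison (my own illustration; the merge step after the two-thread version is deliberately left out, which is exactly the caveat above):

import time
import numpy as np
from multiprocessing.dummy import Pool  # thread pool

numbers = np.random.permutation(10_000_000)
N = len(numbers)

single = numbers.copy()
t0 = time.time()
single.sort()
print("one thread, full array : %.3fs" % (time.time() - t0))

halves = numbers.copy()
t0 = time.time()
# Slices of a numpy array are views, so sorting them sorts parts of `halves` in place.
Pool(2).map(lambda a: a.sort(), [halves[:N // 2], halves[N // 2:]])
print("two threads, two halves: %.3fs" % (time.time() - t0))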
You may want to look into Parallel Python, since by default CPython threads are restricted to one core by the Global Interpreter Lock (GIL). This is why CPython cannot perform truly concurrent CPU-bound operations with threads. CPython is still great at carrying out IO-bound tasks, though.
There is a good article that describes the threading limitations of CPython here.
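A small sketch (my own, not from the answer) that makes the difference visible: the same CPU-bound function barely speeds up in a thread pool, while a process pool scales with the number of cores:

import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # same API, but threads

def burn(n):
    # Pure-Python CPU-bound work: the GIL lets only one thread execute it at a time.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [2_000_000] * 8

    t0 = time.time()
    ThreadPool(4).map(burn, work)
    print("threads  : %.2fs" % (time.time() - t0))

    t0 = time.time()
    Pool(4).map(burn, work)
    print("processes: %.2fs" % (time.time() - t0))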