I'm using Python's multiprocessing module for some analyses. Specifically, I use a pool and the apply_async method to compute results for a large number of inputs. The time required to compute a result varies drastically with the input, which got me wondering whether the pool will A) immediately assign an input to each process when I call apply_async, or B) hand inputs to processes as they become available. My worry is that A) would imply that one of the pool's processes might be assigned several of the 'heavy' inputs and so take longer than necessary to complete.
I tried running the code below and got similar runtimes in each run (<2 seconds), which suggests that B) is the case, at least here. Is that what should always be expected to happen?
import multiprocessing as mp
import numpy as np
import time

def foo(x):
    if x % 4 == 0:
        time.sleep(1)   # Simulating a large workload
    else:
        time.sleep(.1)  # Simulating a small workload
    return 2*x

if __name__ == '__main__':
    for _ in range(20):
        with mp.Pool(processes=4) as pool:
            jobs = []
            now = time.time()
            vals = list(range(1, 9))
            np.random.shuffle(vals)
            for x in vals:
                jobs.append(pool.apply_async(foo, args=(x,)))
            results = [j.get() for j in jobs]
            elapsed = time.time() - now
            print(elapsed)
For a data analysis Python project, I need to use both classes and multiprocessing, and I haven't found a good example of this on Google.
My basic idea, which is probably wrong, is to create a class with a large variable (a pandas DataFrame in my case) and then define a method that computes an operation on it (a sum in this case).
import multiprocessing
import time

class C:

    def __init__(self):
        self.__data = list(range(0, 10**7))

    def func(self, nums):
        return sum(nums)

    def start_multi(self):
        for n_procs in range(1, 4):
            print()
            time_start = time.clock()
            chunks = [self.__data[(i-1)*len(self.__data)//n_procs : i*len(self.__data)//n_procs] for i in range(1, n_procs+1)]
            pool = multiprocessing.Pool(processes=n_procs)
            results = pool.map_async(self.func, chunks)
            results.wait()
            pool.close()
            results = results.get()
            print(sum(results))
            print("n_procs", n_procs, "total time: ", time.clock() - time_start)

print('sum(list(range(0, 10**7)))', sum(list(range(0, 10**7))))
c = C()
c.start_multi()
The code doesn't work properly: I get the following print output
sum(list(range(0, 10**7))) 49999995000000
49999995000000
n_procs 1 total time: 0.45133500000000026
49999995000000
n_procs 2 total time: 0.8055279999999954
49999995000000
n_procs 3 total time: 1.1330870000000033
That is, the computation time increases instead of decreasing. So what is the error in this code?
I'm also worried about RAM usage, since when the chunks variable is created the memory used by self.__data is effectively doubled. When dealing with multiprocessing code, and more specifically in this code, is it possible to avoid this memory waste? (I promise I'll put everything on Spark in the future :) )
It looks like there are a few things at play here:
The chunking operation is pretty slow. On my computer, generating the chunks took about 16% of the total time in the runs with multiple processes. The single-process, non-pool version doesn't have that overhead.
You are sending a lot of data into your processes. The chunks array contains all the raw data for the ranges, which has to be pickled and sent over to the new processes. Instead of sending all the raw data, it would be much cheaper to send just the start and end indices.
In general, if you put timers in your func you'll see that most of the time is not being spent there. That's why you aren't seeing a speedup. Most of the time is spent on the chunking, pickling, forking, and other overhead.
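As a rough sketch of what "putting timers in your func" could look like (my addition; I use time.perf_counter here, whereas the rest of the post uses time.clock), you can time the body of the worker method itself and compare it against the total per-run time:
import time

class C:

    def func(self, nums):
        t0 = time.perf_counter()
        result = sum(nums)
        # the work itself turns out to be a small fraction of the per-run wall-clock time
        print("func spent {:.4f}s summing {} numbers".format(time.perf_counter() - t0, len(nums)))
        return result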
As an alternative, try switching the chunking technique to compute just the start and end numbers, so you avoid sending over so much data.
Next, I would recommend doing something a little more computationally demanding than computing a sum, for example counting primes. Here is an example that uses a simple primality test (credited in the code) and the modified chunking technique; otherwise, I tried to keep the code the same.
import multiprocessing
import time
from math import sqrt
from itertools import count, islice

# credit to https://stackoverflow.com/a/27946768
def isPrime(n):
    return n > 1 and all(n % i for i in islice(count(2), int(sqrt(n)-1)))

limit = 6

class C:

    def __init__(self):
        pass

    def func(self, start_end_tuple):
        start, end = start_end_tuple
        primes = []
        for x in range(start, end):
            if isPrime(x):
                primes.append(x)
        return len(primes)

    def get_chunks(self, total_size, n_procs):
        # start and end value tuples
        chunks = []
        # Example: (10, 5) -> (2, 0) so 2 numbers per process
        #          (10, 3) -> (3, 1) or here the first process does 4 and the others do 3
        quotient, remainder = divmod(total_size, n_procs)
        current_start = 0
        for i in range(0, n_procs):
            my_amount = quotient
            if i == 0:
                # somebody needs to do extra
                my_amount += remainder
            chunks.append((current_start, current_start + my_amount))
            current_start += my_amount
        return chunks

    def start_multi(self):
        for n_procs in range(1, 4):
            time_start = time.clock()
            # chunk the start and end indices instead
            chunks = self.get_chunks(10**limit, n_procs)
            pool = multiprocessing.Pool(processes=n_procs)
            results = pool.map_async(self.func, chunks)
            results.wait()
            results = results.get()
            print(sum(results))
            time_delta = time.clock() - time_start
            print("n_procs {} time {}".format(n_procs, time_delta))

c = C()
time_start = time.clock()
print("serial func(...) = {}".format(c.func((1, 10**limit))))
print("total time {}".format(time.clock() - time_start))
c.start_multi()
This should result in a speedup for the multiple processes, assuming you have the cores for it.
I have an iterator function that yields an infinite stream of integers:
def all_ints(start=0):
    while True:
        yield start
        start += 1
I want to have a pool of threads or processes do calculations on these, up to $POOLSIZE at a time. Each process will possibly save a result to some shared data structure, so I do not need the return value from the process/thread function. It seems to me that this use of the Python 3 Pool would achieve this:
from multiprocessing import Pool

# dummy example functions
def check_prime(n):
    return n % 2 == 0

def store_prime(p):
    ''' synchronize, write to some shared structure'''
    pass

p = Pool()
for n in all_ints():
    p.apply_async(check_prime, (n,), callback=store_prime)
But when I run this, I get a Python process that just keeps using more and more memory (and it's not the iterator; that part can run for days). I would expect this behavior if I were storing the results of all the apply_async calls, but I am not.
What am I doing wrong here? Or is there another API from the thread pool I should be using?
I think you are looking for Pool.imap_unordered, which uses the pooled processes to apply a function to the elements yielded by an iterator. Its parameter chunksize allows you to specify how many items from the iterator are passed to the pool in each step.
Also, I would avoid using any shared-memory structures for IPC. Just let the "expensive" function sent to the pool return the information you need, and process it in the main process.
Here's an example (where I abort after 200,000 results; if you remove that part, you'll see the worker processes happily work in a fixed amount of RAM "forever"):
from multiprocessing import Pool
from math import sqrt
import itertools
import time

def check_prime(n):
    if n == 2: return (n, True)
    if n % 2 == 0 or n < 2: return (n, False)
    for i in range(3, int(sqrt(n))+1, 2):
        if n % i == 0: return (n, False)
    return (n, True)

def main():
    L = 200000  # limit for performance timing
    p = Pool()
    n_primes = 0
    before = time.time()
    for (n, is_prime) in p.imap_unordered(check_prime, itertools.count(1), 1000):
        if is_prime:
            n_primes += 1
            if n_primes >= L:
                break
    print("Computed %d primes in %.1fms" % (n_primes, (time.time()-before)*1000.0))

if __name__ == "__main__":
    main()
Output on my Intel Core i5 (2 Core, 4 Threads):
Computed 200000 primes in 15167.9ms
Output if I change it to Pool(1), so using just 1 subprocess:
Computed 200000 primes in 37909.2ms
HTH!
I recently started learning multiprocessing in Python, and I have some questions about it. The following code shows my example:
import multiprocessing
from time import *

def func(n):
    for i in range(100):
        print(i, "/ 100")
        for j in range(100000):
            a = j*i
            b = j*i/2

if __name__ == '__main__':
    # Test with multiprocessing
    pool = multiprocessing.Pool(processes=4)
    t1 = clock()
    pool.map(func, range(10))
    pool.close()
    t2 = clock()
    print(t2-t1)

    # Test without multiprocessing
    func(range(10))
    t3 = clock()
    print(t3-t2)
Does this code use the four cores of the CPU, or did I make a mistake?
Why is the runtime without multiprocessing so much faster? Is there a mistake?
Why does the print command not work while using multiprocessing?
It does run four processes at a time in your process pool. However, your multiprocessing example runs func ten times, whereas the plain call runs it only once. In addition, starting processes has some runtime overhead. These probably account for the difference in run times.
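As a rough way to see the process-startup overhead on its own (my sketch, not part of the original answer), you can time just creating and tearing down a pool without giving it any work:
import multiprocessing
import time

if __name__ == '__main__':
    t1 = time.time()
    pool = multiprocessing.Pool(processes=4)
    pool.close()
    pool.join()
    t2 = time.time()
    # the cost of starting and stopping four worker processes, with no tasks at all
    print("pool startup/teardown:", t2 - t1, "seconds")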
I think a simpler example is instructive: func now sleeps for five seconds, then prints out its input n along with the current time.
import multiprocessing
import time

def func(n):
    time.sleep(5)
    print([n, time.time()])

if __name__ == '__main__':
    # Test with multiprocessing
    print("With multiprocessing")
    pool = multiprocessing.Pool(processes=4)
    pool.map(func, range(5))
    pool.close()

    # Test without multiprocessing
    print("Without multiprocessing")
    func(1)
pool.map(func, range(5)) runs func(0), func(1), ..., func(4).
This outputs
With multiprocessing
[2, 1480778015.3355303]
[3, 1480778015.3355303]
[1, 1480778015.3355303]
[0, 1480778015.3355303]
[4, 1480778020.3495753]
Without multiprocessing
[1, 1480778025.3653867]
Note that the first four are output at the same time, and not strictly in order. The fifth (n == 4) gets output five seconds later, which makes sense, since we have a pool of four processes and it could only get started once the first four were done.
Questions
Why is the CPU usage of my threaded python merge sort only 50% for each core?
Why does this result in "cannot create new thread" errors for relatively small inputs (100000)?
How can I make this more pythonic? (It's very ugly.)
Linux/Ubuntu 12.04, 64-bit, i5 mobile (quad core)
from random import shuffle
from threading import *
import time
import Queue

q = Queue.LifoQueue()

def merge_sort(num, q):
    end = len(num)
    if end > 1:
        mid = end / 2
        thread = Thread(target=merge_sort, args=(num[0:mid], q,))
        thread1 = Thread(target=merge_sort, args=(num[mid:end], q,))
        thread.setDaemon(True)
        thread1.setDaemon(True)
        thread.start()
        thread1.start()
        thread.join()
        thread1.join()
        return merge(q.get(num), q.get(num))
    else:
        if end != 0:
            q.put(num)
        else:
            print "?????"
        return

def merge(num, num1):
    a = []
    while len(num) is not 0 and len(num1) is not 0:
        if num[0] < num1[0]:
            a.append(num.pop(0))
        else:
            a.append(num1.pop(0))
    if len(num) is not 0:
        for i in range(0, len(num)):
            a.append(num.pop(0))
    if len(num1) is not 0:
        for i in range(0, len(num1)):
            a.append(num1.pop(0))
    q.put(a)
    return a

def main():
    val = long(raw_input("Please enter the maximum value of the range:")) + 1
    start_time = time.time()
    numbers = xrange(0, val)
    shuffle(numbers)
    numbers = merge_sort(numbers[0:val], q)
    # print "Sorted list is: \n"
    # for number in numbers:
    #     print number
    print str(time.time() - start_time) + " seconds to run.\n"

if __name__ == "__main__":
    main()
For the 100000 input, your code tries to create roughly 200000 threads. Python threads are real OS threads, so the 50% CPU load you are seeing is probably the system being busy handling the threads. On my system the error happens at around 32000 threads.
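As a rough sanity check of that figure (my sketch, not part of the original answer): every merge_sort call on a list longer than one element starts two new threads, one per half, so the total for n elements works out to about 2*(n-1):
def thread_count(n):
    # threads started by merge_sort for a list of length n:
    # two per split, plus whatever the two halves start themselves
    if n <= 1:
        return 0
    mid = n // 2
    return 2 + thread_count(mid) + thread_count(n - mid)

print(thread_count(100000))  # 199998, i.e. roughly 200000 threads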
Your code as written can't possibly work:
from random import shuffle
#XXX won't work
numbers = xrange(0, val)
shuffle(numbers)
xrange() is not a mutable sequence.
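One possible fix for that part (my sketch, not from the original post) is to materialize the range into a list before shuffling, since random.shuffle mutates its argument in place:
from random import shuffle

val = 100000
numbers = list(range(0, val))  # a real, mutable list instead of an xrange()/range() object
shuffle(numbers)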
Note: the sorting takes much less time than the random shuffling of the array:
import numpy as np
numbers = np.random.permutation(10000000)  # most of the time is spent here
numbers.sort()
If you want to sort parts of the array using different threads, you can do it:
from multiprocessing.dummy import Pool  # use threads
N = len(numbers)
Pool(2).map(lambda a: a.sort(), [numbers[:N//2], numbers[N//2:]])
a.sort() releases the GIL, so the code uses 2 CPUs.
If you include the time it takes to merge the sorted parts, it may be faster just to sort the whole array at once (numbers.sort()) in a single thread.
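For completeness, here is what that merge step could look like (my sketch, not from the answer), using heapq.merge over the two sorted halves:
import heapq
import numpy as np

numbers = np.random.permutation(10**6)
N = len(numbers)

# sort each half in place (this is what the thread pool above does in parallel)
numbers[:N//2].sort()
numbers[N//2:].sort()

# merge the two sorted halves back into one fully sorted array;
# this single-threaded merge is the cost that can cancel out the parallel sort
merged = np.fromiter(heapq.merge(numbers[:N//2], numbers[N//2:]),
                     dtype=numbers.dtype, count=N)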
You may want to look into using Parallel Python, since by default CPython threads are restricted to one core because of the Global Interpreter Lock (GIL). This is why CPython cannot perform truly concurrent CPU-bound operations with threads, although it is still great at carrying out IO-bound tasks.
There is a good article that describes the threading limitations of CPython here.
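A minimal sketch of that GIL point (my addition, using only the standard library rather than Parallel Python): the same CPU-bound work barely speeds up with a thread pool, but does with a process pool.
import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # same API, but backed by threads

def burn(n):
    # pure-Python CPU-bound work, so it holds the GIL while running
    s = 0
    for i in range(n):
        s += i * i
    return s

if __name__ == '__main__':
    work = [5000000] * 4
    for name, make_pool in (("threads", ThreadPool), ("processes", Pool)):
        start = time.time()
        pool = make_pool(4)
        pool.map(burn, work)
        pool.close()
        pool.join()
        print(name, time.time() - start)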
Suppose you have a list comprehension in Python, like
Values = [ f(x) for x in range( 0, 1000 ) ]
where f is just a function without side effects, so all the entries can be computed independently.
Is Python able to increase the performance of this list comprehension compared with the "obvious" sequential implementation, e.g. by shared-memory parallelization on multicore CPUs?
In Python 3.2 they added concurrent.futures, a nice library for solving problems concurrently. Consider this example:
import math, time
from concurrent import futures

PRIMES = [112272535095293, 112582705942171, 112272535095293, 115280095190773, 115797848077099, 1099726899285419, 112272535095293, 112582705942171, 112272535095293, 115280095190773, 115797848077099, 1099726899285419]

def is_prime(n):
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def bench(f):
    start = time.time()
    f()
    elapsed = time.time() - start
    print("Completed in {} seconds".format(elapsed))

def concurrent():
    with futures.ProcessPoolExecutor() as executor:
        values = list(executor.map(is_prime, PRIMES))

def listcomp():
    values = [is_prime(x) for x in PRIMES]
Results on my quad core:
>>> bench(listcomp)
Completed in 14.463825941085815 seconds
>>> bench(concurrent)
Completed in 3.818351984024048 seconds
No, Python will not magically parallelize this for you. In fact, it can't, since it cannot prove the independence of the entries; that would require a great deal of program inspection/verification, which is impossible to get right in the general case.
If you want quick coarse-grained multicore parallelism, I recommend joblib instead:
from joblib import delayed, Parallel
values = Parallel(n_jobs=NUM_CPUS)(delayed(f)(x) for x in range(1000))
Not only have I witnessed near-linear speedups using this library, it also has the great feature of propagating signals such as the one from Ctrl-C onto its worker processes, which cannot be said of all multiprocessing libraries.
Note that joblib doesn't really support shared-memory parallelism: it spawns worker processes, not threads, so it incurs some communication overhead from sending data to workers and results back to the master process.
Try whether the following is faster:
Values = map(f,range(0,1000))
That's a functional way to write it.
Another idea is to replace all occurrences of Values in the code with the lazy equivalent:
imap(f,range(0,1000)) # Python < 3
map(f,range(0,1000)) # Python 3
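For comparison, the literal generator-expression form (my addition) is also lazy and behaves the same way in Python 2 and 3:
Values = (f(x) for x in range(0, 1000))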