Why is Pool slower than the same number of Processes - python

I recently tried refactoring some parallel processes into a pool and was surprised that the pool took almost twice as long as pure processes. Please assume they are running on the same machine with the same number of cores. I hope that someone can explain why my implementation using Pool is taking longer and perhaps offer some advice:
Shared dependency:
https://github.com/taynaud/python-louvain
from community import best_partition
Here is the faster implementation using Process. [UPDATE] Refactored to cap the number of active processes, the same as the Pool implementation; still faster:
from multiprocessing import Process, Pipe

processes = []
pipes = []

def _get_partition(send_end):
    send_end.send(best_partition(a_graph, resolution=res, randomize=rand))

for idx in range(iterations):
    recv_end, send_end = Pipe(False)
    p = Process(target=_get_partition, args=(send_end,))
    processes.append(p)
    pipes.append(recv_end)

running_procs = []
finished_procs = []
while len(finished_procs) < iterations:
    while len(running_procs) < n_cores and len(processes):
        proc = processes.pop()
        proc.start()
        running_procs.append(proc)
    # Move finished workers out without mutating the list while iterating it
    still_running = []
    for proc in running_procs:
        if proc.is_alive():
            still_running.append(proc)
        else:
            finished_procs.append(proc)
    running_procs = still_running

for p in finished_procs:
    p.join()

partitions = [pipe.recv() for pipe in pipes]
And here is the slower, Pool implementation. This is still slower no matter how many processes the pool is given:
pool = Pool(processes=n_cores)
results = [
    pool.apply_async(
        best_partition,
        (a_graph,),
        dict(resolution=res, randomize=rand)
    ) for i in range(iterations)
]
partitions = [r.get() for r in results]
pool.close()
pool.join()

Usually, when there is a performance difference between a pool and a bunch of processes (it can go either way), it is your data set and the task performed that determine the outcome.
Without knowing what your a_graph is, I make a wild guess that it is something big. In your Process model, you rely on the in-memory copy of it in your subprocesses. In your Pool model, you transmit a copy of a_graph as an argument to each worker every time one is called. This is in practice implemented as a queue. In your Process model, your subprocess gets a copy of it at the C level when the Python interpreter calls fork(). This is much faster than transmitting a large Python object, dictionary, array or whatever it is, via a queue.
The reverse would be true if the tasks took only a minuscule time to complete. In that case, Pool is the better performing solution, as Pool passes tasks to already running processes; processes do not need to be recreated for each task. There, the overhead of creating a lot of new processes that each run for only a fraction of a second slows the Process implementation down.
As I said, this is pure speculation, but in your examples there is a significant difference in what you actually transmit as a parameter to your workers, and that might be the explanation.
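If that guess is right, one way to test it is to stop sending a_graph through the task queue altogether. Here is a minimal sketch (mine, not the original poster's code), assuming a_graph, res, rand, n_cores and iterations exist at module level as in the question, and that the default fork start method is in use (e.g. on Linux), so workers inherit the graph at fork() time:

from multiprocessing import Pool

from community import best_partition

def _partition_once(_):
    # a_graph is read from the module-level global, so with a fork-based
    # start method each worker inherits it when the pool is created instead
    # of receiving a pickled copy through the task queue for every call.
    return best_partition(a_graph, resolution=res, randomize=rand)

if __name__ == "__main__":
    with Pool(processes=n_cores) as pool:
        partitions = pool.map(_partition_once, range(iterations))

If this version is about as fast as the Process version, the per-task pickling of a_graph was indeed the culprit.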

Related

Multiprocessing in Python for a short task that needs to run several times

EDIT: it seems that I misused multiprocessing by starting a lot of short processes instead of dividing the work into n_cpu long-running processes.
I'm optimizing with multiprocessing for the first time. I have a function that takes a short time to run, but it needs to run many times. I tried this code:
import multiprocessing as mp

processes = []
# length is a variable that is in the order of 100k to 1M
for j in range(0, length):
    p = mp.Process(target=mp_function)
    processes.append(p)
    p.start()
for process in processes:
    process.join()
However, this takes a very long time to run and just freezes my computer. I've also tried the following, but it also takes a very long time (I decided not to let it finish, as it clearly takes far longer than is acceptable):
processes = []
for j in range(0, length):
    p = mp.Process(target=mp_function)
    processes.append(p)
    p.start()
    p.join()
I'm not sure what I'm doing wrong, since it's the same approach as shown in Python's documentation:
p = Process(target=f, args=('bob',))
p.start()
p.join()
From what I've read online, it seems that Pool is to be preferred when I have a few tasks that each take a long time to execute, and Process is to be used when I have a lot of short tasks that each take a short time to execute.
I'm not sure what I'm doing wrong, so any help would be very appreciated
A pool may work well. It will take a value and return a result, which involves some processing time, but if you keep it to simple types, it's fairly fast. By default the pool will "chunk" the input data, meaning that it will send equal amounts of work to each worker in chunks to reduce interprocess communication.
import multiprocessing as mp

if __name__ == "__main__":
    # Uses the default of one process per CPU; reduce this
    # if it makes the machine run too slow.
    with mp.Pool() as pool:
        result = pool.map(my_function, [None] * length)
That list could be large in your case, so an intermediate function that knows how to break things out may be reasonable.
def my_function_runner(num_calls):
    for _ in range(num_calls):
        my_function()

cpus = mp.cpu_count()
with mp.Pool(processes=cpus) as pool:
    # Give each of the first (cpus - 1) workers an equal share and put the
    # remainder in the last chunk, so that all `length` calls are made.
    num_calls_list = [length // cpus] * (cpus - 1)
    num_calls_list.append(length - (length // cpus) * (cpus - 1))
    result = pool.map(my_function_runner, num_calls_list)
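Alternatively, here is a minimal sketch (not from the original answer) of getting a similar effect with pool.map's chunksize argument, which batches many items into each inter-process message. Note that the [None] * length argument list still has to be built in the parent; my_function and length are the placeholders from the question:

import multiprocessing as mp

if __name__ == "__main__":
    with mp.Pool() as pool:
        # chunksize groups the iterable into batches of 1000 items per
        # message to a worker, reducing inter-process communication overhead.
        pool.map(my_function, [None] * length, chunksize=1000)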

I am trying to understand how to share read-only objects with multiprocessing

I am trying to understand how to share read-only objects with multiprocessing. Sharing bigset when it is a global variable works fine:
from multiprocessing import Pool

bigset = set(xrange(pow(10, 7)))

def worker(x):
    return x in bigset

def main():
    pool = Pool(5)
    print all(pool.imap(worker, xrange(pow(10, 6))))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
htop shows that the parent process uses 100% CPU and 0.8% memory, while the workload is distributed evenly among the five child processes: each uses 10% CPU and 0.8% memory. It's all good.
But the numbers start going crazy if I move bigset inside main:
from multiprocessing import Pool
from functools import partial

def worker(x, l):
    return x in l

def main():
    bigset = set(xrange(pow(10, 7)))
    _worker = partial(worker, l=bigset)
    pool = Pool(5)
    print all(pool.imap(_worker, xrange(pow(10, 6))))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
Now htop shows 2 or 3 processes jumping up and down between 50% and 80% CPU, while the remaining processes use less than 10% CPU. And while the parent process still uses 0.8% memory, all child processes now use 1.9% memory.
What's happening?
When you pass bigset as an argument, it is pickled by the parent process and unpickled by the child processes.[1][2]
Pickling and unpickling a large set requires a lot of time. This explains why you see only a few processes doing their job: the parent process has to pickle a lot of big objects, and the children have to wait for it. The parent process is the bottleneck.
Pickling parameters implies that parameters have to be sent to processes. Sending data from a process to another requires system calls, which is why you are not seeing 100% CPU usage by the user space code. Part of the CPU time is spent on kernel space.[3]
Pickling objects and sending them to subprocesses also implies that: 1. you need memory for the pickle buffer; 2. each subprocess gets a copy of bigset. This is why you are seeing an increase in memory usage.
Instead, when bigset is a global variable, it is not sent anywhere (unless you are using a start method other than fork). It is just inherited as-is by subprocesses, using the usual copy-on-write rules of fork().
Footnotes:
In case you don't know what "pickling" means: pickle is one of the standard Python protocols to transform arbitrary Python objects to and from a byte sequence.
imap() & co. use queues behind the scenes, and queues work by pickling/unpickling objects.
I tried running your code (with all(pool.imap(_worker, xrange(100))) in order to make the process faster) and I got: 2 minutes user time, 13 seconds system time. That's almost 10% system time.
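For completeness, here is a middle-ground sketch (mine, not part of the original answer, written for Python 3): Pool's initializer hook sends bigset to each worker once, at pool start-up, instead of once per task. Each worker still holds its own copy of the set, but the per-task pickling disappears:

from multiprocessing import Pool

def init_worker(shared_set):
    # Runs once per worker process; stores the set in a module-level global
    # so later tasks can use it without it travelling through the task queue.
    global bigset
    bigset = shared_set

def worker(x):
    return x in bigset

def main():
    bigset = set(range(10 ** 7))
    # The set is transferred to each worker once, at pool start-up,
    # rather than being pickled again for every single task.
    pool = Pool(5, initializer=init_worker, initargs=(bigset,))
    print(all(pool.imap(worker, range(10 ** 6))))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()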

Memory usage keep growing with Python's multiprocessing.pool

Here's the program:
#!/usr/bin/python
import multiprocessing

def dummy_func(r):
    pass

def worker():
    pass

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    for index in range(0, 100000):
        pool.apply_async(worker, callback=dummy_func)

    # clean up
    pool.close()
    pool.join()
I found that memory usage (both VIRT and RES) kept growing until close()/join(); is there any way to get rid of this? I tried maxtasksperchild with Python 2.7, but it didn't help either.
I have a more complicated program that calls apply_async() ~6M times, and at around the 1.5M mark I already have 6 GB+ of RES; to rule out all other factors, I simplified the program to the version above.
EDIT:
It turned out this version works better; thanks for everyone's input:
#!/usr/bin/python
import multiprocessing

ready_list = []

def dummy_func(index):
    global ready_list
    ready_list.append(index)

def worker(index):
    return index

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    result = {}
    for index in range(0, 1000000):
        result[index] = pool.apply_async(worker, (index,), callback=dummy_func)
        # Drop results that have already been handled, so the dict of
        # pending ApplyResult objects stays small.
        for ready in ready_list:
            result[ready].wait()
            del result[ready]
        ready_list = []

    # clean up
    pool.close()
    pool.join()
I didn't put any lock there, as I believe the main process is single-threaded (the callback is more or less an event-driven thing, per the docs I read).
I changed v1's index range to 1,000,000 (the same as v2) and ran some tests. Oddly, v2 is even ~10% faster than v1 (33 s vs 37 s); maybe v1 was doing too much internal list maintenance. v2 is definitely the winner on memory usage: it never went over 300 MB (VIRT) and 50 MB (RES), while v1 used to reach 370 MB/120 MB, with 330 MB/85 MB at best. These numbers come from only 3-4 test runs, for reference only.
I had memory issues recently, since I was calling my multiprocessing function multiple times, so it kept spawning processes and leaving them in memory.
Here's the solution I'm using now:
def myParallelProcess(ahugearray):
    from multiprocessing import Pool
    from contextlib import closing
    with closing(Pool(15)) as p:
        res = p.imap_unordered(simple_matching, ahugearray, 100)
    return res
Simply create the pool within your loop and close it at the end of the loop with pool.close().
Use map_async instead of apply_async to avoid excessive memory usage.
For your first example, change the following two lines:
for index in range(0, 100000):
    pool.apply_async(worker, callback=dummy_func)
to
pool.map_async(worker, range(100000), callback=dummy_func)
It will finish in a blink, before you can see its memory usage in top. Change the list to a bigger one to see the difference. But note that map_async will first convert the iterable you pass to it into a list to calculate its length if it doesn't have a __len__ method. If you have an iterator with a huge number of elements, you can use itertools.islice to process them in smaller chunks, as sketched below.
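A minimal sketch of that islice idea (my own, not part of the original answer; worker and the sizes are just placeholders):

import itertools
import multiprocessing

def worker(index):
    return index

if __name__ == '__main__':
    # Hypothetical stand-in for a very large (possibly unbounded) iterator.
    huge_iterator = iter(range(10 ** 7))
    chunk_size = 100000

    pool = multiprocessing.Pool(processes=16)
    while True:
        chunk = list(itertools.islice(huge_iterator, chunk_size))
        if not chunk:
            break
        # Only one chunk's worth of tasks and results is held at a time.
        pool.map(worker, chunk)
    pool.close()
    pool.join()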
I had a memory problem in a real-life program with much more data and finally found the culprit was apply_async.
P.S. With respect to memory usage, your two examples show no obvious difference.
I have a very large 3d point cloud data set I'm processing. I tried using the multiprocessing module to speed up the processing, but I started getting out of memory errors. After some research and testing I determined that I was filling the queue of tasks to be processed much quicker than the subprocesses could empty it. I'm sure by chunking, or using map_async or something I could have adjusted the load, but I didn't want to make major changes to the surrounding logic.
The dumb solution I hit on is to check the pool._cache length intermittently, and if the cache is too large then wait for the queue to empty.
In my mainloop I already had a counter and a status ticker:
# Update status
count += 1
if count % 10000 == 0:
    sys.stdout.write('.')
    if len(pool._cache) > 1e6:
        print "waiting for cache to clear..."
        last.wait()  # Where last is assigned the latest ApplyResult
So every 10k insertions into the pool, I check whether there are more than 1 million operations queued (about 1 GB of memory used in the main process). When the queue is full, I just wait for the last inserted job to finish.
Now my program can run for hours without running out of memory. The main process just pauses occasionally while the workers continue processing the data.
BTW, the _cache member is documented in the multiprocessing module's pool example:
#
# Check there are no outstanding tasks
#
assert not pool._cache, 'cache = %r' % pool._cache
You can limit the number of tasks per child process:
multiprocessing.Pool(maxtasksperchild=1)
maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool (from the multiprocessing documentation).
I think this is similar to the question I posted, but I'm not sure you have the same delay. My problem was that I was producing results from the multiprocessing pool faster than I was consuming them, so they built up in memory. To avoid that, I used a semaphore to throttle the inputs into the pool so they didn't get too far ahead of the outputs I was consuming.
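That throttling approach can be sketched roughly like this (my own sketch, not the answerer's code; worker and the limits are placeholders):

import multiprocessing
import threading

def worker(index):
    return index

if __name__ == '__main__':
    # At most 1000 tasks are allowed to be "in flight" at any moment;
    # the number is arbitrary and would be tuned to the available memory.
    in_flight = threading.BoundedSemaphore(1000)

    def on_done(result):
        # Callbacks run in the pool's result-handler thread, so releasing
        # the semaphore here lets the main loop submit the next task.
        in_flight.release()

    pool = multiprocessing.Pool(processes=16)
    for index in range(1000000):
        in_flight.acquire()  # blocks once too many tasks are pending
        pool.apply_async(worker, (index,),
                         callback=on_done, error_callback=on_done)
    pool.close()
    pool.join()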

No increase in speed when multithreading python hdf5 parsing function

I have a function that:
1) reads in an hdf5 dataset as integer ASCII codes
2) converts the ASCII integers to characters with the chr() function
3) joins the characters into a single string
Upon profiling, I found that the vast majority of the time is spent on step #2, the conversion of the ASCII integers to characters. I have somewhat optimized this call by using:
''.join([chr(x) for x in file[dataSetName].value])
As my parsing function seems to be CPU bound (the conversion of integers to characters) and not I/O bound, I expected a more or less linear speedup with the number of cores devoted to parsing. Parsing one file serially takes ~15 seconds; parsing 10 files (on my 12-core machine) takes ~150 seconds using 10 threads. That is, there seems to be no speedup at all.
I have used the following code to launch my threads:
threads = []
timer = []
threadNumber = 10
for i, d in enumerate(sortedDirSet):
    timer.append(time.time())
    # self.loadFile(d, i)
    threads.append(Thread(target=self.loadFile, args=(d, i)))
    threads[-1].start()
    if i % threadNumber == 0:
        for i2, t in enumerate(threads):
            t.join()
            print(time.time() - timer[i2])
        timer = []
        threads = []
for t in threads:
    t.join()
Any help would be greatly appreciated.
Python threads cannot use multiple cores (due to the GIL) unless you spawn subprocesses (with multiprocessing, for example). Thus you won't get any performance boost from spawning threads for CPU-bound tasks.
Here's an example of a script using multiprocessing and queue:
from Queue import Empty  # <-- only needed to catch the exception
from multiprocessing import Process, Queue, cpu_count

def loadFile(d, i, queue):
    # some other stuff
    queue.put(result)

if __name__ == "__main__":
    queue = Queue()
    no = cpu_count()
    processes = []

    for i, d in enumerate(sortedDirSet):
        p = Process(target=loadFile, args=(d, i, queue))
        p.start()
        processes.append(p)
        if i % no == 0:
            for p in processes:
                p.join()
            processes = []

    for p in processes:
        p.join()

    results = []
    while True:
        try:
            # False means "don't wait when Empty, throw an exception instead"
            data = queue.get(False)
            results.append(data)
        except Empty:
            break
    # You have all the data, do something with it
The other (more complicated) way would be to use pipe instead of queue.
It would also be more efficient to spawn the processes once, create a job queue, and send jobs to the subprocesses (via a pipe or queue), so you don't have to create a new process for each task. But that would be even more complicated, so let's leave it at that; a rough sketch of the idea follows.
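For reference, here is a rough sketch of that persistent-worker idea (mine, not the answerer's). It assumes a module-level loadFile(d, i) that returns its result instead of putting it on a queue, and it reuses sortedDirSet from the question; a Queue is used in place of a raw pipe since it is built on one:

from multiprocessing import Process, Queue, cpu_count

def worker(task_queue, result_queue):
    # Long-lived worker: keeps pulling jobs until it sees the sentinel.
    for d, i in iter(task_queue.get, None):
        result_queue.put(loadFile(d, i))

if __name__ == "__main__":
    task_queue, result_queue = Queue(), Queue()
    procs = [Process(target=worker, args=(task_queue, result_queue))
             for _ in range(cpu_count())]
    for p in procs:
        p.start()

    for i, d in enumerate(sortedDirSet):
        task_queue.put((d, i))
    for _ in procs:
        task_queue.put(None)  # one sentinel per worker

    # Drain the results before joining so no worker blocks on a full queue.
    results = [result_queue.get() for _ in range(len(sortedDirSet))]
    for p in procs:
        p.join()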
Freakish is correct in his answer; it is the GIL thwarting your efforts.
If you were to use Python 3, you could do this very nicely using concurrent.futures (a rough sketch is below). I believe PyPy has also backported this feature.
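A minimal sketch of that concurrent.futures approach (my own; parse_file and the file list are hypothetical stand-ins for the question's loadFile and sortedDirSet), using processes rather than threads so the GIL does not get in the way:

from concurrent.futures import ProcessPoolExecutor

def parse_file(path):
    # Placeholder for the per-file parsing work (the chr()/join conversion);
    # a real version would read and convert the hdf5 dataset here.
    return path

if __name__ == "__main__":
    files = ["file1.h5", "file2.h5", "file3.h5"]  # hypothetical file list
    with ProcessPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(parse_file, files))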
Also, you could eke a little more speed out of your code by replacing your list comprehension:
''.join([chr(x) for x in file[dataSetName].value])
With a map:
''.join(map(chr, file[dataSetName].value))
My tests (on a massive random list) using the above code showed 15.73 s with the list comprehension and 12.44 s with map.

Python multiprocessing Queue failure

I create 100 child processes
proc_list = [
Process(target = simulator, args=(result_queue,))
for i in xrange(100)]
and start them
for proc in proc_list: proc.start()
Each process puts 10000 tuples into result_queue (an instance of multiprocessing.Queue) after doing some processing.
def simulate(alg_instance, image_ids, gamma, results,
             simulations, sim_semaphore):
    (rs, qs, t_us) = alg_instance.simulate_multiple(image_ids, gamma,
                                                    simulations)
    all_tuples = zip(rs, qs, t_us)
    for result in all_tuples:
        results.put(result)
    sim_semaphore.release()
I should be (?) getting 1000000 tuples in the queue, but after various runs I get these (sample) sizes:
14912
19563
12952
13524
7487
18350
15986
11928
14281
14282
7317
Any suggestions?
My solution to multiprocessing issues is almost always to use the Manager objects. While the exposed interface is the same, the underlying implementation is much simpler and has fewer bugs.
from multiprocessing import Manager
manager = Manager()
result_queue = manager.Queue()
Try it out and see if it doesn't fix your issues.
multiprocessing.Queue is said to be thread-safe in its documentation. But when you are doing inter-process communication with a Queue, it should be used via multiprocessing.Manager().Queue().
There's no evidence from the OP's post that multiprocessing.Queue does not work. The code posted by the OP is not at all sufficient to understand what's going on: do they join all the processes? Do they correctly pass the queue to the child processes (it has to be passed as a parameter on Windows)? Do their child processes verify that they actually got 10000 tuples? etc.
There's a chance that the OP is really encountering a hard-to-reproduce bug in mp.Queue, but given the amount of testing CPython has gone through, and the fact that I just ran 100 processes x 10000 results without any trouble, I suspect the OP actually had some problem in their own code.
Yes, Manager().Queue() mentioned in other answers is a perfectly fine way to share data, but there's no reason to avoid multiprocessing.Queue() based on unconfirmed reports that "something is wrong with it".
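For what it's worth, a common pitfall with a plain multiprocessing.Queue is joining the producers before draining the queue, or polling it with non-blocking gets before they have finished. Here is a minimal sketch (mine, with simulator standing in for the OP's work) of the usual pattern that reliably collects 100 x 10000 results:

from multiprocessing import Process, Queue

N_PROCS, N_RESULTS = 100, 10000

def simulator(result_queue):
    # Placeholder for the question's per-process work.
    for i in range(N_RESULTS):
        result_queue.put((i, i, i))

if __name__ == "__main__":
    result_queue = Queue()
    procs = [Process(target=simulator, args=(result_queue,))
             for _ in range(N_PROCS)]
    for p in procs:
        p.start()

    # Drain the queue with blocking get() calls *before* joining: a child
    # cannot fully exit while it still has buffered items nobody is reading,
    # and a non-blocking get() taken too early sees a partially filled queue.
    results = [result_queue.get() for _ in range(N_PROCS * N_RESULTS)]

    for p in procs:
        p.join()

    print(len(results))  # 1000000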
