I create 100 child processes
proc_list = [Process(target=simulator, args=(result_queue,))
             for i in xrange(100)]
and start them
for proc in proc_list: proc.start()
Each process puts 10000 tuples into result_queue (an instance of multiprocessing.Queue) after doing some processing.
def simulate(alg_instance, image_ids, gamma, results,
             simulations, sim_semaphore):
    (rs, qs, t_us) = alg_instance.simulate_multiple(image_ids, gamma,
                                                    simulations)
    all_tuples = zip(rs, qs, t_us)
    for result in all_tuples:
        results.put(result)
    sim_semaphore.release()
I should (?) be getting 1,000,000 tuples in the queue, but across various runs I get these (sample) sizes:
14912
19563
12952
13524
7487
18350
15986
11928
14281
14282
7317
Any suggestions?
My solution to multiprocessing issues is almost always to use the Manager objects. While the exposed interface is the same, the underlying implementation is much simpler and has fewer bugs.
from multiprocessing import Manager
manager = Manager()
result_queue = manager.Queue()
Try it out and see if it doesn't fix your issues.
multiprocessing.Queue is said to be thread-safe in its documentation. But when you are doing inter-process communication with a Queue, it should be used via multiprocessing.Manager().Queue().
There's no evidence in the OP's post that multiprocessing.Queue does not work. The code posted by the OP is not at all sufficient to understand what's going on: do they join all the processes? Do they correctly pass the queue to the child processes (it has to be passed as a parameter on Windows)? Do their child processes verify that they actually got 10000 tuples? etc.
There's a chance that the OP is really encountering a hard-to-reproduce bug in mp.Queue, but given the amount of testing CPython has gone through, and the fact that I just ran 100 processes x 10000 results without any trouble, I suspect the OP actually had some problem in their own code.
Yes, Manager().Queue() mentioned in other answers is a perfectly fine way to share data, but there's no reason to avoid multiprocessing.Queue() based on unconfirmed reports that "something is wrong with it".
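For reference, a minimal self-contained harness along those lines (the worker below is a made-up stand-in, not the OP's code) drains the queue in the parent before joining the children; draining first also avoids the documented deadlock where a child blocks flushing its queue buffer while the parent sits in join():

from multiprocessing import Process, Queue

def simulator(result_queue, n_results=10000):
    # Toy stand-in for the real worker: push n_results tuples.
    for i in range(n_results):
        result_queue.put((i, i * 2, i * 3))

if __name__ == '__main__':
    result_queue = Queue()
    procs = [Process(target=simulator, args=(result_queue,)) for _ in range(100)]
    for p in procs:
        p.start()

    # Drain while the children run; a child won't exit until its queue buffer is
    # flushed, so joining before draining can deadlock.
    results = [result_queue.get() for _ in range(100 * 10000)]

    for p in procs:
        p.join()
    print(len(results))  # expected: 1000000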
Suppose I have the following in Python
# A loop
for i in range(10000):
    Do Task A

# B loop
for i in range(10000):
    Do Task B
How do I run these loops simultaneously in Python?
If you want concurrency, here's a very simple example:
from multiprocessing import Process

def loop_a():
    while 1:
        print("a")

def loop_b():
    while 1:
        print("b")

if __name__ == '__main__':
    Process(target=loop_a).start()
    Process(target=loop_b).start()
This is just the most basic example I could think of. Be sure to read http://docs.python.org/library/multiprocessing.html to understand what's happening.
If you want to send data back to the program, I'd recommend using a Queue (which in my experience is easiest to use).
You can use a thread instead if you don't mind the global interpreter lock. Processes are more expensive to instantiate but they offer true concurrency.
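Building on that, here is a minimal sketch (my own; the task functions are placeholders) where each child puts its results on a shared multiprocessing.Queue and the parent collects them:

from multiprocessing import Process, Queue

def task_a(q):
    for i in range(10000):
        q.put(('a', i))  # result of Task A

def task_b(q):
    for i in range(10000):
        q.put(('b', i))  # result of Task B

if __name__ == '__main__':
    q = Queue()
    procs = [Process(target=task_a, args=(q,)), Process(target=task_b, args=(q,))]
    for p in procs:
        p.start()
    results = [q.get() for _ in range(20000)]  # drain before joining
    for p in procs:
        p.join()
    print(len(results))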
There are several possible options for what you want:
use a loop
As many people have pointed out, this is the simplest way.
for i in xrange(10000):
    # use xrange instead of range
    taskA()
    taskB()
Merits: easy to understand and use, no extra library needed.
Drawbacks: taskB can only run after taskA (or vice versa); they can't run simultaneously.
multiprocessing
Another option is to run the two tasks in separate processes at the same time. Python provides the multiprocessing library; the following is a simple example:

from multiprocessing import Process

p1 = Process(target=taskA, args=args_a, kwargs=kwargs_a)
p2 = Process(target=taskB, args=args_b, kwargs=kwargs_b)
p1.start()
p2.start()

Merits: tasks can run simultaneously in the background; you can control them (end them, stop them, etc.); tasks can exchange data and can be synchronized if they compete for the same resources.
Drawbacks: too heavy! The OS will frequently switch between them, and each process has its own data space even if the data is redundant. If you have a lot of tasks (say 100 or more), it's not what you want.
threading
Threading is like multiprocessing, just lighter weight. Check out this post. Their usage is quite similar:

import threading

p1 = threading.Thread(target=taskA, args=args_a, kwargs=kwargs_a)
p2 = threading.Thread(target=taskB, args=args_b, kwargs=kwargs_b)
p1.start()
p2.start()
coroutines
Libraries like greenlet and gevent provide coroutines, which are supposed to be faster than threads. A minimal gevent sketch is shown below; see their documentation if you're interested.
Merits: more flexible and lightweight.
Drawbacks: an extra library is needed, and there is a learning curve.
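As a rough illustration (my own sketch, not from the original answer), a gevent version of the two loops could look like this; gevent.sleep(0) yields control so the two greenlets interleave cooperatively:

import gevent

def loop_a():
    for i in range(10000):
        # Do Task A
        gevent.sleep(0)  # yield so the other greenlet can run

def loop_b():
    for i in range(10000):
        # Do Task B
        gevent.sleep(0)

gevent.joinall([gevent.spawn(loop_a), gevent.spawn(loop_b)])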
Why do you want to run the two processes at the same time? Is it because you think they will go faster? (There is a good chance that they won't.) Why not run the tasks in the same loop, e.g.

for i in range(10000):
    doTaskA()
    doTaskB()
The obvious answer to your question is to use threads - see the python threading module. However threading is a big subject and has many pitfalls, so read up on it before you go down that route.
Alternatively you could run the tasks in separate processes, using the Python multiprocessing module. If both tasks are CPU intensive this will make better use of multiple cores on your computer.
There are other options such as coroutines, stackless tasklets, greenlets, CSP, etc., but without knowing more about Task A and Task B and why they need to be run at the same time it is impossible to give a more specific answer.
from threading import Thread

def loopA():
    for i in range(10000):
        pass  # Do task A

def loopB():
    for i in range(10000):
        pass  # Do task B

threadA = Thread(target=loopA)
threadB = Thread(target=loopB)
threadA.start()  # start(), not run(): run() would execute in the current thread
threadB.start()

# Do work independent of loopA and loopB

threadA.join()
threadB.join()
You could use threading or multiprocessing.
How about a single loop: for i in range(10000): do Task A, then Task B? Without more information I don't have a better answer.
I find that using the Pool class in multiprocessing works amazingly well for executing multiple processes at once within a Python script.
See Section: Using a pool of workers
Look carefully at "# launching multiple evaluations asynchronously may use more processes" in the example. Once you understand what those lines are doing, the following example I constructed will make a lot of sense.
import numpy as np
from multiprocessing import Pool

def desired_function(option, processes, data, etc...):
    # your code will go here. option allows you to make choices within your script
    # to execute desired sections of code for each pool or subprocess.
    return result_array  # "for example"

result_array = np.zeros("some shape")  # This is normally populated by 1 loop, let's try 4.
processes = 4
pool = Pool(processes=processes)
args = (processes, data, etc...)  # Arguments to be passed into the desired function.
multiple_results = []
for i in range(processes):  # Submits one task per process, with option 1-4 in this case.
    multiple_results.append(pool.apply_async(desired_function, (i+1,) + args))
results = [res.get() for res in multiple_results]  # Retrieves results after every task is finished!
for i in range(processes):
    result_array = result_array + results[i]  # Combines all datasets!
The code will basically run the desired function for a set number of processes. You will have to carefully make sure your function can distinguish between each process (hence why I added the variable "option".) Additionally, it doesn't have to be an array that is being populated in the end, but for my example, that's how I used it. Hope this simplifies or helps you better understand the power of multiprocessing in Python!
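If it helps, here is a stripped-down, runnable variant of the same apply_async pattern (the function and array shapes are toy stand-ins of my own, not part of the original answer):

import numpy as np
from multiprocessing import Pool

def partial_sum(option, size):
    # Each worker fills a different slice of the final array, selected by `option`.
    out = np.zeros(size)
    out[option - 1::4] = option
    return out

if __name__ == '__main__':
    processes = 4
    size = 16
    with Pool(processes=processes) as pool:
        async_results = [pool.apply_async(partial_sum, (i + 1, size)) for i in range(processes)]
        results = [r.get() for r in async_results]
    result_array = sum(results)  # combine all partial arrays
    print(result_array)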
I recently tried refactoring some parallel processes into a pool and was surprised that the pool took almost twice as long as pure processes. Please assume they are running on the same machine with the same number of cores. I hope that someone can explain why my implementation using Pool is taking longer and perhaps offer some advice:
Shared dependency:
https://github.com/taynaud/python-louvain
from community import best_partition
Here is the faster implementation using Process. [UPDATE] Refactored to control the number of active processes the same way as the Pool implementation; still faster:
from multiprocessing import Process, Pipe

processes = []
pipes = []

def _get_partition(send_end):
    send_end.send(best_partition(a_graph, resolution=res, randomize=rand))

for idx in range(iterations):
    recv_end, send_end = Pipe(False)
    p = Process(target=_get_partition, args=(send_end,))
    processes.append(p)
    pipes.append(recv_end)

running_procs = []
finished_procs = []
while len(finished_procs) < iterations:
    while len(running_procs) < n_cores and len(processes):
        proc = processes.pop()
        proc.start()
        running_procs.append(proc)
    for idx, proc in enumerate(running_procs):
        if not proc.is_alive():
            finished_procs.append(running_procs.pop(idx))

for p in finished_procs:
    p.join()

partitions = [pipe.recv() for pipe in pipes]
And here is the slower, Pool implementation. This is still slower no matter how many processes the pool is given:
pool = Pool(processes=n_cores)
results = [
    pool.apply_async(
        best_partition,
        (a_graph,),
        dict(resolution=res, randomize=rand)
    ) for i in range(iterations)
]
partitions = [res.get() for res in results]
pool.close()
pool.join()
Usually when there is a difference between a pool and a bunch of processes (it can be to the benefit of either), it is your data set and the task performed that determine the outcome.
Without knowing what your a_graph is, I make a wild guess that it is something big. In your process model, you rely on the in-memory copy of it in your subprocesses. In your Pool model, you transmit a copy of a_graph as an argument to each worker every time one is called. This is in practice implemented as a queue. In your process model, your subprocess gets a copy of this at the C level when the Python interpreter calls fork(). This is much faster than transmitting a large Python object, dictionary, array or whatever it is via a queue.
The reverse would be true if the tasks took only a minuscule time to complete. In that case, Pool is the better-performing solution, as Pool passes tasks to already running processes; the processes do not need to be recreated after each task. In that case, the overhead of creating a lot of new processes that each run for only a fraction of a second slows the process implementation down.
As I said, this is pure speculation, but in your examples there is a significant difference in what you actually transmit as a parameter to your workers, and that might be the explanation.
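One way to test that hypothesis (my own sketch, not part of the question or answer) is to hand the large graph to the pool once via an initializer, so each worker receives it a single time instead of it being pickled with every task; on fork-based platforms it is simply inherited at fork time:

from multiprocessing import Pool
from community import best_partition  # same dependency as in the question

_graph = None

def _init_worker(graph):
    # Runs once per worker process; keeps the big object in module-level state
    # so it is transferred once per worker rather than once per task.
    global _graph
    _graph = graph

def _partition_task(params):
    res, rand = params
    return best_partition(_graph, resolution=res, randomize=rand)

def run_partitions(a_graph, iterations, n_cores, res, rand):
    with Pool(processes=n_cores, initializer=_init_worker, initargs=(a_graph,)) as pool:
        return pool.map(_partition_task, [(res, rand)] * iterations)

If the Pool version then matches the Process version, the per-task transfer of a_graph was indeed the cost.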
I would like to use multiple processes (not threads) to do some preprocessing and enqueue the results to a tf.RandomShuffleQueue which can be used by my main graph for training.
Is there a way to do that ?
My actual problem
I have converted my dataset into TFRecords split across 256 shards. I want to start 20 processes using multiprocessing and let each one process a range of shards. Each process should read images, augment them, and push them into a tf.RandomShuffleQueue from which the input can be fed to a graph for training.
Some people advised me to go through the inception example in TensorFlow. However, that is a very different situation, because there only the reading of the data shards is done by multiple threads (not processes), while the preprocessing (e.g. augmentation) takes place in the main thread.
(This aims to solve your actual problem)
In another topic, someone told you that Python has the global interpreter lock (GIL) and therefore there would be no speed benefits from multi-core, unless you used multiple processes.
This was probably what prompted your desire to use multiprocessing.
However, with TF, Python is normally used only to construct the "graph". The actual execution happens in native code (or GPU), where GIL plays no role whatsoever.
In light of this, I recommend simply letting TF use multithreading. This can be controlled using the intra_op_parallelism_threads argument, such as:
with tf.Session(graph=graph,
                config=tf.ConfigProto(allow_soft_placement=True,
                                      intra_op_parallelism_threads=20)) as sess:
    # ...
(Side note: if you have, say, a 2-CPU, 32-core system, the best argument may very well be intra_op_parallelism_threads=16, depending on a lot of factors)
Comment: The pickling of TFRecords is not that important. I can pass a list of lists containing the names of ranges of sharded TFRecord files.
Then I have to restart the decision process!
Comment: I can pass it to a Pool.map() as an argument.
Verify whether a multiprocessing.Queue() can handle this. The results of Tensor functions are Tensor objects.
Try the following:
tensor_object = func(TFRecord)
q = multiprocessing.Manager().Queue()
q.put(tensor_object)
data = q.get()
print(data)
Comment: How do I make sure that all the processes enqueue to the same queue?
This is simply done by enqueuing the results from Pool.map(...) after all processes have finished.
Alternatively, we can enqueue in parallel, queuing data from all processes.
But doing so depends on the data being picklable, as described above.
For instance:
import multiprocessing as mp

def func(filename):
    TFRecord = read(filename)
    tensor_obj = tf.func(TFRecord)
    return tensor_obj

def main_Tensor(tensor_objs):
    tf = # ... instantiate the Tensor Session
    rsq = tf.RandomShuffleQueue(...)
    for t in tensor_objs:
        rsq.enqueue(t)

if __name__ == '__main__':
    sharded_TFRecords = ['file1', 'file2']
    with mp.Pool(20) as pool:  # pool.map() blocks; the with-block handles cleanup
        tensor_objs = pool.map(func, sharded_TFRecords)
    main_Tensor(tensor_objs)
It seems the recommended way to run TF with multiprocessing is to create a separate tf.Session for each child, as sharing one across processes is not feasible.
You can take a look at this example, I hope it helps.
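As a rough, self-contained sketch of that idea (my own, assuming the TF 1.x API), each worker builds its own graph and tf.Session inside the child process:

import multiprocessing as mp

def worker(shard_id):
    # TensorFlow is imported only inside the child, so each process gets its own runtime.
    import tensorflow as tf
    with tf.Graph().as_default():
        x = tf.constant(shard_id, dtype=tf.int32)
        y = x * 2  # stand-in for real preprocessing
        with tf.Session() as sess:
            return sess.run(y)

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        print(pool.map(worker, range(8)))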
[EDIT: Old answer]
You can use a multiprocessing.Pool and rely on its callback mechanism to put results in the tf.RandomShuffleQueue as soon as they are ready.
Here's a very simple example on how to do it.
from multiprocessing import Pool

class Processor(object):
    def __init__(self, random_shuffle_queue):
        self.queue = random_shuffle_queue
        self.pool = Pool()

    def schedule_task(self, task):
        # processing_function is assumed to be a picklable, module-level function.
        self.pool.apply_async(processing_function, args=[task], callback=self.task_done)

    def task_done(self, results):
        self.queue.enqueue(results)
This assumes Python 2; for Python 3 I'd recommend using concurrent.futures.ProcessPoolExecutor.
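Hypothetical usage might look like this (the toy processing function and the stand-in queue are mine, just to show how the callback feeds results into the queue; it assumes the Processor class above lives in the same module):

def processing_function(task):
    # Toy stand-in for the real preprocessing; must be a module-level function so it pickles.
    return task * task

class FakeShuffleQueue(object):
    # Minimal stand-in for tf.RandomShuffleQueue to keep the sketch self-contained.
    def __init__(self):
        self.items = []
    def enqueue(self, item):
        self.items.append(item)

if __name__ == '__main__':
    q = FakeShuffleQueue()
    processor = Processor(q)
    for task in range(10):
        processor.schedule_task(task)
    processor.pool.close()
    processor.pool.join()
    print(q.items)  # callbacks have all run once join() returns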
Here's the program:
#!/usr/bin/python
import multiprocessing

def dummy_func(r):
    pass

def worker():
    pass

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    for index in range(0, 100000):
        pool.apply_async(worker, callback=dummy_func)

    # clean up
    pool.close()
    pool.join()
I found that memory usage (both VIRT and RES) kept growing until close()/join(); is there any way to get rid of this? I tried maxtasksperchild with 2.7, but it didn't help either.
I have a more complicated program that calls apply_async() ~6M times, and by the ~1.5M point I already have 6G+ RES; to rule out all other factors, I simplified the program to the version above.
EDIT:
Turned out this version works better, thanks for everyone's input:
#!/usr/bin/python
import multiprocessing
ready_list = []
def dummy_func(index):
    global ready_list
    ready_list.append(index)

def worker(index):
    return index

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    result = {}
    for index in range(0, 1000000):
        result[index] = pool.apply_async(worker, (index,), callback=dummy_func)
        for ready in ready_list:
            result[ready].wait()
            del result[ready]
        ready_list = []

    # clean up
    pool.close()
    pool.join()
I didn't put any lock there, as I believe the main process is single-threaded (the callback is more or less an event-driven thing, per the docs I read).
I changed v1's index range to 1,000,000 (the same as v2) and ran some tests. Oddly, v2 is even ~10% faster than v1 (33s vs 37s); maybe v1 was doing too much internal list maintenance. v2 is definitely the winner on memory usage: it never went over 300M (VIRT) and 50M (RES), while v1 used to hit 370M/120M, with 330M/85M at best. All numbers are from just 3-4 test runs, for reference only.
I had memory issues recently, since I was calling the multiprocessing function multiple times, so it kept spawning processes and leaving them in memory.
Here's the solution I'm using now:
def myParallelProcess(ahugearray):
    from multiprocessing import Pool
    from contextlib import closing
    with closing(Pool(15)) as p:
        res = p.imap_unordered(simple_matching, ahugearray, 100)
        return res
Simply create the pool within your loop and close it at the end of the loop with pool.close().
Use map_async instead of apply_async to avoid excessive memory usage.
For your first example, change the following two lines:
for index in range(0,100000):
pool.apply_async(worker, callback=dummy_func)
to
pool.map_async(worker, range(100000), callback=dummy_func)
It will finish in a blink, before you can see its memory usage in top. (Note that worker will now receive each index as an argument, so give it a parameter as in your second example.) Change the range to a bigger one to see the difference. But note that map_async will first convert the iterable you pass it into a list in order to calculate its length if it doesn't have a __len__ method. If you have an iterator over a huge number of elements, you can use itertools.islice to process them in smaller chunks, as sketched below.
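For instance, a rough chunking sketch (my own; the worker is a toy) that keeps only one bounded chunk in flight at a time:

import itertools
import multiprocessing

def worker(index):
    return index * index  # toy task

def chunks(iterable, size):
    # Yield successive lists of at most `size` items from any iterator.
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    huge_iterator = iter(range(1000000))  # stand-in for an iterator too big to materialize
    for chunk in chunks(huge_iterator, 10000):
        pool.map_async(worker, chunk).wait()  # one bounded chunk at a time
    pool.close()
    pool.join()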
I had a memory problem in a real-life program with much more data and finally found that the culprit was apply_async.
P.S.: with respect to memory usage, your two examples have no obvious difference.
I have a very large 3d point cloud data set I'm processing. I tried using the multiprocessing module to speed up the processing, but I started getting out of memory errors. After some research and testing I determined that I was filling the queue of tasks to be processed much quicker than the subprocesses could empty it. I'm sure by chunking, or using map_async or something I could have adjusted the load, but I didn't want to make major changes to the surrounding logic.
The dumb solution I hit on is to check the pool._cache length intermittently, and if the cache is too large then wait for the queue to empty.
In my mainloop I already had a counter and a status ticker:
# Update status
count += 1
if count % 10000 == 0:
    sys.stdout.write('.')
    if len(pool._cache) > 1e6:
        print "waiting for cache to clear..."
        last.wait()  # Where last is assigned the latest ApplyResult
So every 10k insertions into the pool, I check whether there are more than 1 million operations queued (about 1 GB of memory used in the main process). When the queue is that full, I just wait for the most recently inserted job to finish.
Now my program can run for hours without running out of memory. The main process just pauses occasionally while the workers continue processing the data.
BTW, the _cache member is documented in the multiprocessing module's pool example:
#
# Check there are no outstanding tasks
#
assert not pool._cache, 'cache = %r' % pool._cache
You can limit the number of tasks per child process:
multiprocessing.Pool(maxtasksperchild=1)
maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool. (From the documentation.)
I think this is similar to the question I posted, but I'm not sure you have the same delay. My problem was that I was producing results from the multiprocessing pool faster than I was consuming them, so they built up in memory. To avoid that, I used a semaphore to throttle the inputs into the pool so they didn't get too far ahead of the outputs I was consuming (roughly as sketched below).
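The pattern I mean looks roughly like this (a sketch with made-up numbers; handle_result stands in for whatever consumes your output): acquire a semaphore slot before submitting each task and release it from the callback, so only a bounded number of results can be outstanding at once:

import multiprocessing
import threading

MAX_OUTSTANDING = 1000
slots = threading.Semaphore(MAX_OUTSTANDING)

def worker(index):
    return index * index  # toy task

def handle_result(result):
    # Runs in the pool's result-handler thread inside the main process.
    # ... consume the result here ...
    slots.release()

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    for index in range(1000000):
        slots.acquire()  # blocks once MAX_OUTSTANDING results are in flight
        pool.apply_async(worker, (index,), callback=handle_result)
    pool.close()
    pool.join()

If a worker can fail, you may also want an error_callback (Python 3) that releases the slot so it isn't lost.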
I have a job where I get a lot of separate tasks through. For each task I need to download some data, process it and then upload it again.
I'm using a multiprocessing pool for the processing.
I have a couple of issues I'm unsure of though.
Firstly, the data can be up to roughly 20 MB. I ideally want to get it to the child worker process without physically copying it in memory, and to get the resulting data back to the parent process without copying it either. As I'm not sure how some tools work under the hood, I don't know whether I can just pass the data as an argument to the pool's apply_async (from my understanding it serialises the objects and they're recreated once they reach the subprocess?), or whether I should use a multiprocessing Queue, or mmap maybe, or something else?
I looked at shared ctypes objects, but from what I understand only objects that exist when the pool is created (when the process forks) can be shared? That's no good for me, as I'll continuously have new data coming in that I need to share.
One thing I shouldn't need to worry about is concurrent access to the data, so I shouldn't need any kind of locking. This is because the processing will only start after the data has been downloaded, and the upload will only start after the output data has been generated.
Another issue is that the incoming tasks sometimes spike, so I download data for them faster than the child processes can handle it. Since I download data faster than I can finish the tasks and dispose of the data, Python ends up dying from running out of memory. What would be a good way to hold the tasks back at the downloading stage when memory is almost full / too much data is in the job pipeline?
I was thinking of some kind of "ref" count based on the number of data bytes, so I can limit the amount of data between download and upload and only download when the count is below some threshold. However, I'd be worried that a child might sometimes fail and I'd never get to subtract its data from the count. Is there a good way to achieve this kind of thing?
(This is an outcome of the discussion of my previous answer)
Have you tried POSH?
This example shows that one can append elements to a mutable list, which is probably what you want (copied from the documentation):
import posh

l = posh.share(range(3))
if posh.fork():
    # parent process
    l.append(3)
    posh.waitall()
else:
    # child process
    l.append(4)
    posh.exit(0)
print l
-- Output --
[0, 1, 2, 3, 4]
-- OR --
[0, 1, 2, 4, 3]
Here is the canonical example from the multiprocessing documentation:
from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))

    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()

    print num.value
    print arr[:]
Note that num and arr are shared objects. Is it what you are looking for?
I cobbled this together since I needed to figure this out for myself anyway. I'm by no means very accomplished when it comes to multiprocessing or threading, but at least it works. Maybe it can be done in a smarter way; I couldn't figure out how to use the lock that comes with the non-raw Array type. Maybe someone will suggest improvements in the comments.
from multiprocessing import Process, Event
from multiprocessing.sharedctypes import RawArray

def modify(s, task_event, result_event):
    for i in range(4):
        print "Worker: waiting for task"
        task_event.wait()
        task_event.clear()
        print "Worker: got task"
        s.value = s.value.upper()
        result_event.set()

if __name__ == '__main__':
    data_list = ("Data", "More data", "oh look, data!", "Captain Pickard")
    task_event = Event()
    result_event = Event()
    s = RawArray('c', "X" * max(map(len, data_list)))
    p = Process(target=modify, args=(s, task_event, result_event))
    p.start()
    for data in data_list:
        s.value = data
        task_event.set()
        print "Sent new task. Waiting for results."
        result_event.wait()
        result_event.clear()
        print "Got result: {0}".format(s.value)
    p.join()
In this example, data_list is defined beforehand, but it need not be. The only information I needed from that list was the length of the longest string. As long as you have some practical upper bound for the length, it's no problem.
Here's the output of the program:
Sent new task. Waiting for results.
Worker: waiting for task
Worker: got task
Worker: waiting for task
Got result: DATA
Sent new task. Waiting for results.
Worker: got task
Worker: waiting for task
Got result: MORE DATA
Sent new task. Waiting for results.
Worker: got task
Worker: waiting for task
Got result: OH LOOK, DATA!
Sent new task. Waiting for results.
Worker: got task
Got result: CAPTAIN PICKARD
As you can see, btel did in fact provide the solution, but the problem lay in keeping the two processes in lockstep with each other, so that the worker only starts working on a new task when the task is ready, and so that the main process doesn't read the result before it's complete.