Deadlock with big object in multiprocessing.Queue

Deadlock with big object in multiprocessing.Queue - python

When you supply a large-enough object into multiprocessing.Queue, the program seems to hang at weird places. Consider this minimal example:
import multiprocessing
def dump_dict(queue, size):
queue.put({x: x for x in range(size)})
print("Dump finished")
if __name__ == '__main__':
SIZE = int(1e5)
queue = multiprocessing.Queue()
process = multiprocessing.Process(target=dump_dict, args=(queue, SIZE))
print("Starting...")
process.start()
print("Joining...")
process.join()
print("Done")
print(len(queue.get()))
If the SIZE parameter is small-enough (<= 1e4 at least in my case), the whole program runs smoothly without a problem, but once the SIZE is big-enough, the program hangs at weird places. Now, when searching for explanation, i.e. python multiprocessing - process hangs on join for large queue, I have always seen general answers of "you need to consume from the queue". But what seems weird is that the program actually prints Dump finished i.e. reaching the code line after putting the object into the queue. Furthermore using Queue.put_nowait instead of Queue.put did not make a difference.
Finally if you use Process.join(1) instead of Process.join() the whole process finishes with complete dictionary in the queue (i.e. the print(len(..)) line will print 10000).
Can somebody give me a little bit more insight into this?

You need to queue.get() in the parent before you process.join() to prevent a deadlock. The queue has spawned a feeder-thread with its first queue.put() and the MainThread in your worker-process is joining this feeder-thread before exiting. So the worker-process won't exit before the result is flushed to (OS-pipe-)buffer completely, but your result is too big to fit into the buffer and your parent doesn't read from the queue until the worker has exited, resulting in a deadlock.
You see the output of print("Dump finished") because the actual sending happens from the feeder-thread, queue.put() itself just appends to a collections.deque within the worker-process as an intermediate step.

Facing a similar problem, I solved it using #Darkonaut's answer and the following implementation:
import time
done = 0
while done < n: # n is the number of objects you expect to get
if not queue.empty():
done += 1
results = queue.get()
# Do something with the results
else:
time.sleep(.5)
Doesn't feel very pythonic, but it worked!

Related

How to subprocess a big list of files using all CPUs?

I need to convert 86,000 TEX files to XML using the LaTeXML library in the command line. I tried to write a Python script to automate this with the subprocess module, utilizing all 4 cores.
def get_outpath(tex_path):
path_parts = pathlib.Path(tex_path).parts
arxiv_id = path_parts[2]
outpath = 'xml/' + arxiv_id + '.xml'
return outpath
def convert_to_xml(inpath):
outpath = get_outpath(inpath)
if os.path.isfile(outpath):
message = '{}: Already converted.'.format(inpath)
print(message)
return
try:
process = subprocess.Popen(['latexml', '--dest=' + outpath, inpath],
stderr=subprocess.PIPE,
stdout=subprocess.PIPE)
except Exception as error:
process.kill()
message = "error: %s run(*%r, **%r)" % (e, args, kwargs)
print(message)
message = '{}: Converted!'.format(inpath)
print(message)
def start():
start_time = time.time()
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count(),
maxtasksperchild=1)
print('Initialized {} threads'.format(multiprocessing.cpu_count()))
print('Beginning conversion...')
for _ in pool.imap_unordered(convert_to_xml, preprints, chunksize=5):
pass
pool.close()
pool.join()
print("TIME: {}".format(total_time))
start()
The script results in Too many open files and slows down my computer. From looking at Activity Monitor, it looks like this script is trying to create 86,000 conversion subprocesses at once, and each process is trying to open a file. Maybe this is the result of pool.imap_unordered(convert_to_xml, preprints) -- maybe I need to not use map in conjunction with subprocess.Popen, since I just have too many commands to call? What would be an alternative?
I've spent all day trying to figure out the right way to approach bulk subprocessing. I'm new to this part of Python, so any tips for heading in the right direction would be much appreciated. Thanks!

In convert_to_xml, the process = subprocess.Popen(...) statements spawns a latexml subprocess.
Without a blocking call such as process.communicate(), the convert_to_xml ends even while latexml continues to run in the background.
Since convert_to_xml ends, the Pool sends the associated worker process another task to run and so convert_to_xml is called again.
Once again another latexml process is spawned in the background.
Pretty soon, you are up to your eyeballs in latexml processes and the resource limit on the number of open files is reached.
The fix is easy: add process.communicate() to tell convert_to_xml to wait until the latexml process has finished.
try:
process = subprocess.Popen(['latexml', '--dest=' + outpath, inpath],
stderr=subprocess.PIPE,
stdout=subprocess.PIPE)
process.communicate()
except Exception as error:
process.kill()
message = "error: %s run(*%r, **%r)" % (e, args, kwargs)
print(message)
else: # use else so that this won't run if there is an Exception
message = '{}: Converted!'.format(inpath)
print(message)
Regarding if __name__ == '__main__':
As martineau pointed out, there is a warning in the multiprocessing docs that
code that spawns new processes should not be called at the top level of a module.
Instead, the code should be contained inside a if __name__ == '__main__' statement.
In Linux, nothing terrible happens if you disregard this warning.
But in Windows, the code "fork-bombs". Or more accurately, the code
causes an unmitigated chain of subprocesses to be spawned, because on Windows fork is simulated by spawning a new Python process which then imports the calling script. Every import spawns a new Python process. Every Python process tries to import the calling script. The cycle is not broken until all resources are consumed.
So to be nice to our Windows-fork-bereft brethren, use
if __name__ == '__main__:
start()
Sometimes processes require a lot of memory. The only reliable way to free memory is to terminate the process. maxtasksperchild=1 tells the pool to terminate each worker process after it completes 1 task. It then spawns a new worker process to handle another task (if there are any). This frees the (memory) resources the original worker may have allocated which could not otherwise have been freed.
In your situation it does not look like the worker process is going to require much memory, so you probably don't need maxtasksperchild=1.
In convert_to_xml, the process = subprocess.Popen(...) statements spawns a latexml subprocess.
Without a blocking call such as process.communicate(), the convert_to_xml ends even while latexml continues to run in the background.
Since convert_to_xml ends, the Pool sends the associated worker process another task to run and so convert_to_xml is called again.
Once again another latexml process is spawned in the background.
Pretty soon, you are up to your eyeballs in latexml processes and the resource limit on the number of open files is reached.
The fix is easy: add process.communicate() to tell convert_to_xml to wait until the latexml process has finished.
try:
process = subprocess.Popen(['latexml', '--dest=' + outpath, inpath],
stderr=subprocess.PIPE,
stdout=subprocess.PIPE)
process.communicate()
except Exception as error:
process.kill()
message = "error: %s run(*%r, **%r)" % (e, args, kwargs)
print(message)
else: # use else so that this won't run if there is an Exception
message = '{}: Converted!'.format(inpath)
print(message)
The chunksize affects how many tasks a worker performs before sending the result back to the main process.
Sometimes this can affect performance, especially if interprocess communication is a signficant portion of overall runtime.
In your situation, convert_to_xml takes a relatively long time (assuming we wait until latexml finishes) and it simply returns None. So interprocess communication probably isn't a significant portion of overall runtime. Therefore, I don't expect you would find a significant change in performance in this case (though it never hurts to experiment!).
In plain Python, map should not be used just to call a function multiple times.
For a similar stylistic reason, I would reserve using the pool.*map* methods for situations where I cared about the return values.
So instead of
for _ in pool.imap_unordered(convert_to_xml, preprints, chunksize=5):
pass
you might consider using
for preprint in preprints:
pool.apply_async(convert_to_xml, args=(preprint, ))
instead.
The iterable passed to any of the pool.*map* functions is consumed
immediately. It doesn't matter if the iterable is an iterator. There is no
special memory benefit to using an iterator here. imap_unordered returns an
iterator, but it does not handle its input in any especially iterator-friendly
way.
No matter what type of iterable you pass, upon calling the pool.*map* function the iterable is
consumed and turned into tasks which are put into a task queue.
Here is code which corroborates this claim:
version1.py:
import multiprocessing as mp
import time
def foo(x):
time.sleep(0.1)
return x * x
def gen():
for x in range(1000):
if x % 100 == 0:
print('Got here')
yield x
def start():
pool = mp.Pool()
for item in pool.imap_unordered(foo, gen()):
pass
pool.close()
pool.join()
if __name__ == '__main__':
start()
version2.py:
import multiprocessing as mp
import time
def foo(x):
time.sleep(0.1)
return x * x
def gen():
for x in range(1000):
if x % 100 == 0:
print('Got here')
yield x
def start():
pool = mp.Pool()
for item in gen():
result = pool.apply_async(foo, args=(item, ))
pool.close()
pool.join()
if __name__ == '__main__':
start()
Running version1.py and version2.py both produce the same result.
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Crucially, you will notice that Got here is printed 10 times very quickly at
the beginning of the run, and then there is a long pause (while the calculation
is done) before the program ends.
If the generator gen() were somehow consumed slowly by pool.imap_unordered,
we should expect Got here to be printed slowly as well. Since Got here is
printed 10 times and quickly, we can see that the iterable gen() is being
completely consumed well before the tasks are completed.
Running these programs should hopefully give you confidence that
pool.imap_unordered and pool.apply_async are putting tasks in the queue
essentially in the same way: immediate after the call is made.

Can a python Multiprocessing queue be passed to the child process?

I have a big dataset in a data acquisition system I wrote in python that takes infinitely long to pass over a queue from the child process to the parent. I want to save the data acquired at the end of the acquisition and tried this using the queue function in Multiprocessing. Instead of doing it this way I would prefer it if I could instead pass a message over the queue from the parent to the child to save my data before I kill the child process. Is this possible? An example of what I thought it might look like is:
def acquireData(self, var1, queue):
import h5py
# Put my acquisition code here
queue.get()
if queue == True:
f = h5py.File("FileName","w")
f.create_dataset('Data',data=data)
f.close()
if __name__ == '__main__':
from multiprocessing import Process, Queue
queue = Queue()
inter_thread = Process(target=acquireData, args=(var1,queue))
queue.put(False)
inter_thread.start()
while True:
if not args.automate:
# Let c++ threads run for given amount of time
# Wait for stop from OP GUI
else:
queue.put(True)
break
print("Acquisition finished, cleaning up...")
sleep(2)
inter_thread.terminate()
Is this allowed? If this type of interfacing between processes is allowed then do I have the right notation? For some reference I have on the order of 9e7 data points in the array I'm trying to save and I have 7 arrays which is simply not being passed to my parent process in a timely manner by putting these arrays into the queue. Thank you.

First, yes, passing a queue to a child is not only legal, but the main use case for queues. See the first example in the docs, which does exactly that.
However, you've got some problems with your code:
queue.get()
if queue == True:
First, your queue is never going to be the boolean value True, it's going to be a Queue. You almost never want to check if x == True: in Python; you want to check if x:. For example, if [1, 2]: will pass, while if [1, 2] == True: will not.
Second, your queue isn't even the thing you want to check in the first place. It isn't truthy or falsey (or it isn't relevant whether it is); it's the value the main process put on the queue and you pulled off that's either truthy or falsey. Which you discarded as soon as you retrieved it.
So, do this:
flag = queue.get()
if flag:
Or, more simply:
if queue.get():
I'm not sure whether this is exactly what you want or not. That queue.get() will block forever until the main process puts something there. Is that what you wanted? If so, great; you're done with this part of your code. If not, you need to think about what you wanted instead.
As designed, the parent will always wait 2 seconds, even if the child finished long before that. A better solution is to join the child with a timeout of 2 seconds. Then you can terminate it if times out.
Plus, are you sure the termination behavior you've designed is what you want? You're doing a "soft kill request" with the queue, then waiting 2 seconds, then doing a "medium-hard kill request" with terminate, and never doing a "hard kill" with kill. That could be a perfectly reasonable design—but if it's not your design, you've implemented the wrong thing.

Python Gevent Shared Queue (Listener Process)

I am trying to get some code working where I can implement logging into a multi-threaded program using gevent. What I'd like to do is set up custom logging handlers to put log events into a Queue, while a listener process is continuously watching for new log events to handle appropriately. I have done this in the past with Multiprocessing, but never with Gevent.
I'm having an issue where the program is getting caught up in the infinite loop (listener process), and not allowing the other threads to "do work"...
Ideally, after the worker processes have finished, I can pass an arbitrary value to the listener process to tell it to break the loop, and then join all the processes together. Here's what I have so far:
import gevent
from gevent.pool import Pool
import Queue
import random
import time
def listener(q):
while True:
if not q.empty():
num = q.get()
print "The number is: %s" % num
if num <= 100:
print q.get()
# got passed 101, break out
else:
break
else:
continue
def worker(pid,q):
if pid == 0:
listener(q)
else:
gevent.sleep(random.randint(0,2)*0.001)
num = random.randint(1,100)
q.put(num)
def main():
q = Queue.Queue()
all_threads = []
all_threads = [gevent.spawn(worker, pid,q) for pid in xrange(10)]
gevent.wait(all_threads[1:])
q.put(101)
gevent.joinall(all_threads)
if __name__ == '__main__':
main()
As I said, the program seems to be getting hung up on that first process and does not allow the other workers to do their thing. I have also tried spawning the listener process completely separately itself (which is actually how I would rather do it), but that didn't seem to work either so I tried this way.
Any help would be appreciated, feel like I am probably just missing something obvious about gevent's back end.
Thanks

The first problem is that your listener is never yielding if the queue is initially empty. The first task you spawn is your listener. When it starts, there's a while True:, the q will be empty, so you go to the else branch, which just continues, looping back to the start of the while loop, and then the q is still empty. So you just sit in the first thread constantly checking the q is empty.
The key thing here is that gevent does not use "native" threads or processes. Unlike "real" threads, which can be switched to at any time by something behind the scenes (like your OS scheduler), gevent uses 'greenlets', which require that you do something to "yield control" to another task. That something is whatever gevent thinks would block, such as read from the network, disk, or use one of the blocking gevent operations.
One crude fix would be to start your listener when pid == 9 rather than 0. By making it spawn last, there will be items in the q, and it will go into the main if branch. The downside is that this doesn't fix the logic problem, so the first time the queue is empty, you'll get stuck in your infinite loop again.
A more correct fix would be to put gevent.sleep() instead of continue. sleep is a blocking operation, so your other tasks will get a chance to run. Without arguments, it waits for no time, but still gives gevent the chance to decide to switch to another task if it is ready to run. This still isn't very efficient, though, as if the Queue is empty, it's going to spend a lot of pointless time checking that over and over and asking to run again as soon as it can. sleep'ing for longer than the default of 0 will be more efficient, but would delay processing your log messages.
However, you can instead take advantage of the fact that many of gevent's types, such as Queue, can be used in more Pythonic ways and make your code a lot simpler and easier to understand, as well as more efficient.
import gevent
from gevent.queue import Queue
def listener(q):
for msg in q:
print "the number is %d" % msg
def worker(pid,q):
gevent.sleep(random.randint(0,2)*0.001)
num = random.randint(1,100)
q.put(num)
def main():
q = Queue()
listener_task = gevent.spawn(listener, q)
worker_tasks = [gevent.spawn(worker, pid, q) for pid in xrange(1, 10)]
gevent.wait(worker_tasks)
q.put(StopIteration)
gevent.join(listener_task)
Here, Queue can operate as an iterator in a for loop. As long as there are messages, it will get an item, run the loop, and then wait for another item. If there are no items, it will just block and hang around until the next one arrives. Since it blocks, though, gevent will switch to one of your other tasks to run, avoiding the infinite loop problem your example code has.
Because this version is using the Queue as a for loop iterator, there's also automatically a nice sentinel value we can put in the queue to make the listener task quit. If a for loop gets StopIteration from its iterator, it will exit cleanly. So when our for loop that's reading from q gets StopIteration from the q, it exits, and then the function exits, and the spawned task is finished.

Output Queue of a Python multiprocessing is providing more results than expected

From the following code I would expect that the length of the resulting list were the same as the one of the range of items with which the multiprocess is feed:
import multiprocessing as mp
def worker(working_queue, output_queue):
while True:
if working_queue.empty() is True:
break #this is supposed to end the process.
else:
picked = working_queue.get()
if picked % 2 == 0:
output_queue.put(picked)
else:
working_queue.put(picked+1)
return
if __name__ == '__main__':
static_input = xrange(100)
working_q = mp.Queue()
output_q = mp.Queue()
for i in static_input:
working_q.put(i)
processes = [mp.Process(target=worker,args=(working_q, output_q)) for i in range(mp.cpu_count())]
for proc in processes:
proc.start()
for proc in processes:
proc.join()
results_bank = []
while True:
if output_q.empty() is True:
break
else:
results_bank.append(output_q.get())
print len(results_bank) # length of this list should be equal to static_input, which is the range used to populate the input queue. In other words, this tells whether all the items placed for processing were actually processed.
results_bank.sort()
print results_bank
Has anyone any idea about how to make this code to run properly?

This code will never stop:
Each worker gets an item from the queue as long as it is not empty:
picked = working_queue.get()
and puts a new one for each that it got:
working_queue.put(picked+1)
As a result the queue will never be empty except when the timing between the process happens to be such that the queue is empty at the moment one of the processes calls empty(). Because the queue length is initially 100 and you have as many processes as cpu_count() I would be surprised if this ever stops on any realistic system.
Well executing the code with slight modification proves me wrong, it does stop at some point, which actually surprises me. Executing the code with one process there seems to be a bug, because after some time the process freezes but does not return. With multiple processes the result is varying.
Adding a short sleep period in the loop iteration makes the code behave as I expected and explained above. There seems to be some timing issue between Queue.put, Queue.get and Queue.empty, although they are supposed to be thread-safe. Removing the empty test also gives the expected result (without ever getting stuck at an empty queue).
Found the reason for the varying behaviour. The objects put on the queue are not flushed immediately. Therefore empty might return False although there are items in the queue waiting to be flushed.
From the documentation:
Note: When an object is put on a queue, the object is pickled and a
background thread later flushes the pickled data to an underlying
pipe. This has some consequences which are a little surprising, but
should not cause any practical difficulties – if they really bother
you then you can instead use a queue created with a manager.
After putting an object on an empty queue there may be an infinitesimal delay before the queue’s empty() method returns False and get_nowait() can return without raising Queue.Empty.
If multiple processes are enqueuing objects, it is possible for the objects to be received at the other end out-of-order. However, objects enqueued by the same process will always be in the expected order with respect to each other.

Return whichever expression returns first

I have two different functions f, and g that compute the same result with different algorithms. Sometimes one or the other takes a long time while the other terminates quickly. I want to create a new function that runs each simultaneously and then returns the result from the first that finishes.
I want to create that function with a higher order function
h = firstresult(f, g)
What is the best way to accomplish this in Python?
I suspect that the solution involves threading. I'd like to avoid discussion of the GIL.

I would simply use a Queue for this. Start the threads and the first one which has a result ready writes to the queue.
Code
from threading import Thread
from time import sleep
from Queue import Queue
def firstresult(*functions):
queue = Queue()
threads = []
for f in functions:
def thread_main():
queue.put(f())
thread = Thread(target=thread_main)
threads.append(thread)
thread.start()
result = queue.get()
return result
def slow():
sleep(1)
return 42
def fast():
return 0
if __name__ == '__main__':
print firstresult(slow, fast)
Live demo
http://ideone.com/jzzZX2
Notes
Stopping the threads is an entirely different topic. For this you need to add some state variable to the threads which needs to be checked in regular intervals. As I want to keep this example short I simply assumed that part and assumed that all workers get the time to finish their work even though the result is never read.
Skipping the discussion about the Gil as requested by the questioner. ;-)

Now - unlike my suggestion on the other answer, this piece of code does exactly what you are requesting:
from multiprocessing import Process, Queue
import random
import time
def firstresult(func1, func2):
queue = Queue()
proc1 = Process(target=func1,args=(queue,))
proc2 = Process(target=func2, args=(queue,))
proc1.start();proc2.start()
result = queue.get()
proc1.terminate(); proc2.terminate()
return result
def algo1(queue):
time.sleep(random.uniform(0,1))
queue.put("algo 1")
def algo2(queue):
time.sleep(random.uniform(0,1))
queue.put("algo 2")
print firstresult(algo1, algo2)

Run each function in a new worker thread, the 2 worker threads send the result back to the main thread in a 1 item queue or something similar. When the main thread receives the result from the winner, it kills (do python threads support kill yet? lol.) both worker threads to avoid wasting time (one function may take hours while the other only takes a second).
Replace the word thread with process if you want.

You will need to run each function in another process (with multiprocessing) or in a different thread.
If both are CPU bound, multithread won help much - exactly due to the GIL -
so multiprocessing is the way.
If the return value is a pickleable (serializable) object, I have this decorator I created that simply runs the function in background, in another process:
https://bitbucket.org/jsbueno/lelo/src
It is not exactly what you want - as both are non-blocking and start executing right away. The tirck with this decorator is that it blocks (and waits for the function to complete) as when you try to use the return value.
But on the other hand - it is just a decorator that does all the work.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.