How to implement a multiprocessing priority queue in Python? - python

Anybody familiar with how I can implement a multiprocessing priority queue in python?

Alas, it's nowhere near as simple as changing the queueing discipline of a good old Queue.Queue: the latter is in fact designed to be subclassed according to a template-method pattern, and overriding just the hook methods _put and/or _get easily allows changing the queueing discipline (in 2.6 explicit LIFO and priority implementations are offered, but they were easy to make even in earlier versions of Python).
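For reference, here is a minimal sketch of that threads-only subclassing - essentially what 2.6's Queue.PriorityQueue does; it helps only within a single process and does nothing for multiprocessing:
import heapq
import Queue          # 'queue' on Python 3

class ThreadPriorityQueue(Queue.Queue):
    """Threads-only priority queue; the hook methods run under the queue's lock."""

    def _init(self, maxsize):
        self.queue = []

    def _qsize(self):
        return len(self.queue)

    def _put(self, item):            # items are (priority, payload) tuples
        heapq.heappush(self.queue, item)

    def _get(self):
        return heapq.heappop(self.queue)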
For multiprocessing, in the general case (multiple readers, multiple writers), I see no solution for how to implement priority queues except to give up on the distributed nature of the queue; designate one special auxiliary process that does nothing but handle queues, send (essentially) RPCs to it to create a queue with a specified discipline, do puts and gets to it, get info about it, &c. So one would get the usual problems about ensuring every process knows about the aux proc's location (host and port, say), etc (easier if the process is always spawned at startup by the main proc). A pretty large problem, especially if one wants to do it with good performance, safeguards against aux proc's crashes (requiring replication of data to slave processes, distributed "master election" among slaves if master crashes, &c), and so forth. Doing it from scratch sounds like a PhD's worth of work. One might start from Johnson's work, or piggyback on some very general approach like ActiveMQ.
Some special cases (e.g. single reader, single writer) might be easier, and turn out to be faster for their limited area of application; but then a very specifically restricted spec should be drawn up for that limited area -- and the results would not constitute a (general purpose) "multiprocessing queue", but one applicable only to the given constrained set of requirements.
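One stdlib-based way to approximate the "auxiliary queue process" idea is a remote manager, sketched here by adapting the remote-manager example from the multiprocessing documentation. The name QueueManager, the port, and the authkey are illustrative, and none of the robustness concerns above (crashes, master election, replication) are addressed:
from multiprocessing.managers import BaseManager
from Queue import PriorityQueue    # 'queue' on Python 3

priority_queue = PriorityQueue()

class QueueManager(BaseManager):
    pass

# In the dedicated queue-server process:
def serve():
    QueueManager.register('get_pq', callable=lambda: priority_queue)
    manager = QueueManager(address=('', 50000), authkey='abracadabra')
    server = manager.get_server()
    server.serve_forever()           # this process owns the queue

# In each client process (possibly on another host):
def client_queue():
    QueueManager.register('get_pq')
    manager = QueueManager(address=('localhost', 50000), authkey='abracadabra')
    manager.connect()
    return manager.get_pq()          # a proxy; put()/get() become RPCs to the server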

There is a bug that prevents true FIFO.
Read here.
An alternate way to build a priority queue multiprocessing setup would certainly be great!

While this isn't an answer, maybe it can help you develop a multiprocessing queue.
Here's a simple priority queue class I wrote using a plain Python list:
class PriorityQueue():
    """A basic priority queue that dequeues items with the smallest priority number."""

    def __init__(self):
        """Initializes the queue with no items in it."""
        self.array = []
        self.count = 0

    def enqueue(self, item, priority):
        """Adds an item to the queue."""
        self.array.append([item, priority])
        self.count += 1

    def dequeue(self):
        """Removes the highest priority item (smallest priority number) from the queue."""
        max = -1
        dq = 0
        if self.count > 0:
            self.count -= 1

            for i in range(len(self.array)):
                if self.array[i][1] is not None and self.array[i][1] > max:
                    max = self.array[i][1]

            if max == -1:
                return self.array.pop(0)
            else:
                for i in range(len(self.array)):
                    if self.array[i][1] is not None and self.array[i][1] <= max:
                        max = self.array[i][1]
                        dq = i
                return self.array.pop(dq)

    def requeue(self, item, newPrio):
        """Changes specified item's priority."""
        for i in range(len(self.array)):
            if self.array[i][0] == item:
                self.array[i][1] = newPrio
                break

    def returnArray(self):
        """Returns array representation of the queue."""
        return self.array

    def __len__(self):
        """Returns the length of the queue."""
        return self.count

I had the same use case, but with a finite number of priorities.
What I ended up doing was creating one Queue per priority; my worker Processes try to get items from those queues, starting with the most important queue and moving on to a less important one only when the current queue is empty.

Depending on your requirements you could use the operating system and the file system in a number of ways. How large will the queue grow and how fast does it have to be? If the queue may be big but you are willing to open a couple files for every queue access you could use a BTree implementation to store the queue and file locking to enforce exclusive access. Slowish but robust.
If the queue will remain relatively small and you need it to be fast you might be able to use shared memory on some operating systems...
If the queue will be small (1000s of entries) and you don't need it to be really fast, you could use something as simple as a directory with files containing the data, plus file locking (a rough sketch follows below). This would be my preference if small and slow is okay.
If the queue can be large and you want it to be fast on average, then you probably should use a dedicated server process like Alex suggests. This is a pain in the neck however.
What are your performance and size requirements?
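To make the directory-of-files option above concrete, here is a rough POSIX-only sketch of the idea; the class name, filename scheme, and locking details are all made up and untested:
import fcntl
import os
import time
import uuid

class DirQueue(object):
    """Toy priority queue: one file per item, names sort by (priority, time)."""

    def __init__(self, path):
        self.path = path
        if not os.path.isdir(path):
            os.makedirs(path)
        self.lockfile = os.path.join(path, '.lock')

    def _lock(self):
        f = open(self.lockfile, 'w')
        fcntl.flock(f, fcntl.LOCK_EX)      # released when f is closed
        return f

    def put(self, priority, data):
        name = '%08d-%017.6f-%s' % (priority, time.time(), uuid.uuid4().hex)
        lock = self._lock()
        try:
            tmp = os.path.join(self.path, '.' + name)
            with open(tmp, 'wb') as f:
                f.write(data)
            os.rename(tmp, os.path.join(self.path, name))   # atomic publish
        finally:
            lock.close()

    def get(self):
        lock = self._lock()
        try:
            names = sorted(n for n in os.listdir(self.path) if not n.startswith('.'))
            if not names:
                raise IndexError('queue is empty')
            target = os.path.join(self.path, names[0])       # lowest priority number first
            with open(target, 'rb') as f:
                data = f.read()
            os.unlink(target)
            return data
        finally:
            lock.close()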

Inspired by @user211505's suggestion, I put together something quick and dirty.
Note that this is not a complete solution to the difficult problem of priority queues in multiprocessing production environments. However, if you're just messing around or need something for a short project, this will likely fit the bill:
from time import sleep
from datetime import datetime
from Queue import Empty
from multiprocessing import Queue as ProcessQueue
class SimplePriorityQueue(object):
    '''
    Simple priority queue that works with multiprocessing. Only a finite number
    of priorities are allowed. Adding many priorities slows things down.

    Also: no guarantee that this will pull the highest priority item
    out of the queue if many items are being added and removed. Race conditions
    exist where you may not get the highest priority queue item. However, if
    you tend to keep your queues not empty, this will be relatively rare.
    '''
    def __init__(self, num_priorities=1, default_sleep=.2):
        self.queues = []
        self.default_sleep = default_sleep
        for i in range(0, num_priorities):
            self.queues.append(ProcessQueue())

    def __repr__(self):
        return "<Queue with %d priorities, sizes: %s>" % (
            len(self.queues),
            ", ".join("%d:%d" % (i, q.qsize()) for i, q in enumerate(self.queues)))

    def qsize(self):
        return sum(q.qsize() for q in self.queues)

    def get(self, block=True, timeout=None):
        start = datetime.utcnow()
        while True:
            # Scan the queues from highest priority (index 0) downwards.
            for q in self.queues:
                try:
                    return q.get(block=False)
                except Empty:
                    pass
            if not block:
                raise Empty
            elapsed = (datetime.utcnow() - start).total_seconds()
            if timeout is not None and elapsed > timeout:
                raise Empty
            if timeout is not None:
                time_left = timeout - elapsed
                sleep(time_left / 4)
            else:
                sleep(self.default_sleep)

    def get_nowait(self):
        return self.get(block=False)

    def put(self, priority, obj, block=False, timeout=None):
        if priority < 0 or priority >= len(self.queues):
            raise Exception("Priority %d out of range." % priority)
        # Block and timeout don't mean much here because we never set maxsize
        return self.queues[priority].put(obj, block=block, timeout=timeout)
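A quick, hypothetical usage sketch of SimplePriorityQueue shared with a child process:
from multiprocessing import Process

def consumer(pq):
    # Usually prints the priority-0 item first; as the docstring above notes,
    # this is not strictly guaranteed.
    print(pq.get())

if __name__ == '__main__':
    pq = SimplePriorityQueue(num_priorities=3)
    pq.put(0, 'high importance')
    pq.put(2, 'low importance')
    p = Process(target=consumer, args=(pq,))
    p.start()
    p.join()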

Related

Data structure to control the next step

I am learning coroutines.
from Queue import Queue   # 'from queue import Queue' on Python 3

class Scheduler:
    def __init__(self):
        self.ready = Queue()   # a queue of tasks that are ready to run.
        self.taskmap = {}      # dictionary that keeps track of all active tasks (each task has a unique integer task ID)

    def new(self, target):     # introduce a new task to the scheduler
        newtask = Task(target) # Task is assumed to be defined elsewhere
        self.taskmap[newtask.tid] = newtask
        self.schedule(newtask)
        return newtask.tid

    def schedule(self, task):
        self.ready.put(task)

    def mainloop(self):
        while self.taskmap:    # does not remove element from taskmap
            task = self.ready.get()
            result = task.run()
            self.schedule(task)
When reading the task = self.ready.get() line in mainloop, I suddenly realized that the nature of a data structure is about control, controlling the next step, while the nature of an algorithm is also about control, controlling all the steps.
Does this understanding make sense?
The Queue object defines control of what step is next, yes. It's FIFO, as described here.
Here, it looks like you're just trying to keep track of tasks, whether there are any remaining, which are executing, and so on. This is "controlling all the steps." Yes.
What's unclear is the purpose. The data structure and algorithm should be suited to your purpose. asyncio can help you implement parallelism and event-driven designs, for example. Sometimes the goal is to quickly and efficiently render data from a source into a data structure. What you're getting at is more meaningful (to me, at least) in the context of an end goal.

python3 priorityqueue size by priority?

is there a way to take the .size() of the priority queue ... for each priority level?
say i have an object, which i will put in a priority queue, with a priority based on the state of some_variable ...
q = PriorityQueue()

if some_variable:
    q.put((1, my_object))
else:
    q.put((2, my_object))
can i then find out, by asking for some kind of .qsize() method, such as a variant maybe of...
q.qsize()
but instead of the total size of the queue, i'd like to know the size for each of the two separate 'groups'/priorities (qsize where priority==1, qsize where priority==2), in this case.
as opposed to just the total q.qsize()? hopefully that makes sense.
The simplest way I can see to do this would be
from collections import Counter
sizeByPriority = Counter(priority for priority, _elem in q.queue)
but of course this requires iterating over the entire queue, which may be prohibitive in your situation, and probably isn't thread-safe.
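If you control the queue object, one workaround is to keep per-priority counts yourself. Here is a rough sketch relying on CPython's _put/_get hook methods (they are called with the queue's internal mutex held); the method name qsize_by_priority is made up:
from collections import Counter
from queue import PriorityQueue

class CountingPriorityQueue(PriorityQueue):
    """PriorityQueue that tracks how many queued items have each priority."""

    def _init(self, maxsize):
        super()._init(maxsize)
        self.counts = Counter()

    def _put(self, item):                 # item is a (priority, payload) tuple
        super()._put(item)
        self.counts[item[0]] += 1

    def _get(self):
        item = super()._get()
        self.counts[item[0]] -= 1
        return item

    def qsize_by_priority(self):
        with self.mutex:                  # same lock the queue itself uses
            return dict(self.counts)
q.qsize_by_priority() would then return something like {1: 3, 2: 5} without iterating the whole heap.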

Avoiding race conditions in Python 3's multiprocessing Queues

I'm trying to find the maximum weight of about 6.1 billion (custom) items and I would like to do this with parallel processing. For my particular application there are better algorithms that don't require my iterating over 6.1 billion items, but the textbook that explains them is over my head and my boss wants this done in 4 days. I figured I have a better shot with my company's fancy server and parallel processing. However, everything I know about parallel processing comes from reading the Python documentation. Which is to say I'm pretty lost...
My current theory is to set up a feeder process, an input queue, a whole bunch (say, 30) of worker processes, and an output queue (finding the maximum element in the output queue will be trivial). What I don't understand is how the feeder process can tell the worker processes when to stop waiting for items to come through the input queue.
I had thought about using multiprocessing.Pool.map_async on my iterable of 6.1E9 items, but it takes nearly 10 minutes just to iterate through the items without doing anything to them. Unless I'm misunderstanding something..., having map_async iterate through them to assign them to processes could be done while the processes begin their work. (Pool also provides imap but the documentation says it's similar to map, which doesn't appear to work asynchronously. I want asynchronous, right?)
Related questions: Do I want to use concurrent.futures instead of multiprocessing? I couldn't be the first person to implement a two-queue system (that's exactly how the lines at every deli in America work...) so is there a more Pythonic/built-in way to do this?
Here's a skeleton of what I'm trying to do. See the comment block in the middle.
import multiprocessing as mp
import queue

def faucet(items, bathtub):
    """Fill bathtub, a process-safe queue, with 6.1e9 items"""
    for item in items:
        bathtub.put(item)
    bathtub.close()

def drain_filter(bathtub, drain):
    """Put maximal item from bathtub into drain.

    Bathtub and drain are process-safe queues.
    """
    max_weight = 0
    max_item = None
    while True:
        try:
            current_item = bathtub.get()
            # The following three lines are the ones that I can't
            # quite figure out how to trigger without a race condition.
            # What I would love is to trigger them AFTER faucet calls
            # bathtub.close and the bathtub queue is empty.
        except queue.Empty:
            drain.put((max_weight, max_item))
            return
        else:
            bathtub.task_done()
            if not current_item.is_relevant():
                continue
            current_weight = current_item.weight
            if current_weight > max_weight:
                max_weight = current_weight
                max_item = current_item

def parallel_max(items, nprocs=30):
    """The elements of items should have a method `is_relevant`
    and an attribute `weight`. `items` itself is an immutable
    iterator object.
    """
    bathtub_q = mp.JoinableQueue()
    drain_q = mp.Queue()
    faucet_proc = mp.Process(target=faucet, args=(items, bathtub_q))
    worker_procs = mp.Pool(processes=nprocs)
    faucet_proc.start()
    worker_procs.apply_async(drain_filter, (bathtub_q, drain_q))
    finalists = []
    for i in range(nprocs):
        finalists.append(drain_q.get())
    return max(finalists)
HERE'S THE ANSWER
I found a very thorough answer to my question, and a gentle introduction to multitasking, from Python Software Foundation communications director Doug Hellmann. What I wanted was the "poison pill" pattern. Check it out here: http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html
Props to @MRAB for posting the kernel of that concept.
You could put a special terminating item, such as None, into the queue. When a worker sees it, it can put it back for the other workers to see, and then terminate. Alternatively, you could put one special terminating item per worker into the queue.
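A minimal sketch of the one-sentinel-per-worker variant, reusing the is_relevant()/weight interface from the question; everything else (names, process count) is illustrative:
import multiprocessing as mp

SENTINEL = None

def worker(inq, outq):
    max_weight, max_item = 0, None
    for item in iter(inq.get, SENTINEL):        # stop when the poison pill arrives
        if item.is_relevant() and item.weight > max_weight:
            max_weight, max_item = item.weight, item
    outq.put((max_weight, max_item))

def parallel_max(items, nprocs=30):
    inq, outq = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(inq, outq)) for _ in range(nprocs)]
    for p in procs:
        p.start()
    for item in items:
        inq.put(item)
    for _ in range(nprocs):                     # one pill per worker
        inq.put(SENTINEL)
    finalists = [outq.get() for _ in range(nprocs)]
    for p in procs:
        p.join()
    return max(finalists, key=lambda pair: pair[0])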

Extending python Queue.PriorityQueue (worker priority, work package types)

I would like to extend the Queue.PriorityQueue described here: http://docs.python.org/library/queue.html#Queue.PriorityQueue
The queue will hold work packages with a priority. Workers will get work packages and process them. I want to make the following additions:
Workers have a priority too. When multiple workers are idle the one with the highest priority should process an incoming work package.
Not every worker can process every work package, so a mechanism is needed that checks if work package type and worker capabilities have a match.
I am looking for hints on how this is best implemented (starting from scratch, extending PriorityQueue or Queue, ...).
edit
Here is my first (untested) try. The basic idea is that all waiting threads will be notified. Then they all try to get a work item through _choose_worker(self, worker). (Made it community wiki)
edit
Works for some simple tests now...
edit
Added a custom BaseManager and a local copy of the worker list in the _choose_worker function.
edit
bug fix
import Queue
from Queue import Empty, Full
from time import time as _time
import heapq
class AdvancedQueue(Queue.PriorityQueue):

    # Initialize the queue representation
    def _init(self, _maxsize):
        self.queue = []
        self.worker = []

    def put(self, item, block=True, timeout=None):
        '''
        Put an item into the queue.

        If optional args 'block' is true and 'timeout' is None (the default),
        block if necessary until a free slot is available. If 'timeout' is
        a positive number, it blocks at most 'timeout' seconds and raises
        the Full exception if no free slot was available within that time.
        Otherwise ('block' is false), put an item on the queue if a free slot
        is immediately available, else raise the Full exception ('timeout'
        is ignored in that case).
        '''
        self.not_full.acquire()
        try:
            if self.maxsize > 0:
                if not block:
                    if self._qsize() == self.maxsize:
                        raise Full
                elif timeout is None:
                    while self._qsize() == self.maxsize:
                        self.not_full.wait()
                elif timeout < 0:
                    raise ValueError("'timeout' must be a positive number")
                else:
                    endtime = _time() + timeout
                    while self._qsize() == self.maxsize:
                        remaining = endtime - _time()
                        if remaining <= 0.0:
                            raise Full
                        self.not_full.wait(remaining)
            self._put(item)
            self.unfinished_tasks += 1
            self.not_empty.notifyAll()  # only change
        finally:
            self.not_full.release()

    def get(self, worker, block=True, timeout=None):
        self.not_empty.acquire()
        try:
            self._put_worker(worker)
            if not block:
                if not self._qsize():
                    raise Empty
                else:
                    return self._choose_worker(worker)
            elif timeout is None:
                while True:
                    while not self._qsize():
                        self.not_empty.wait()
                    try:
                        return self._choose_worker(worker)
                    except Empty:
                        self.not_empty.wait()
            elif timeout < 0:
                raise ValueError("'timeout' must be a positive number")
            else:
                endtime = _time() + timeout

                def wait(endtime):
                    remaining = endtime - _time()
                    if remaining <= 0.0:
                        raise Empty
                    self.not_empty.wait(remaining)

                while True:
                    while not self._qsize():
                        wait(endtime)
                    try:
                        return self._choose_worker(worker)
                    except Empty:
                        wait(endtime)
        finally:
            self._remove_worker(worker)
            self.not_empty.release()

    # Put a new worker in the worker queue
    def _put_worker(self, worker, heappush=heapq.heappush):
        heappush(self.worker, worker)

    # Remove a worker from the worker queue
    def _remove_worker(self, worker):
        self.worker.remove(worker)

    # Choose a matching worker with highest priority
    def _choose_worker(self, worker):
        worker_copy = self.worker[:]  # we need a copy so we can remove assigned worker
        for item in self.queue:
            for enqueued_worker in worker_copy:
                if item[1].type in enqueued_worker[1].capabilities:
                    if enqueued_worker == worker:
                        self.queue.remove(item)
                        self.not_full.notify()
                        return item
                    else:
                        worker_copy.remove(enqueued_worker)
                        # item will be taken by enqueued_worker (which has higher priority),
                        # so enqueued_worker is busy and can be removed
                        continue
        raise Empty
I think you are describing a situation where you have two "priority queues" - one for the jobs and one for the workers. The naive approach is to take the top priority job and the top priority worker and try to pair them. But of course this fails when the worker is unable to execute the job.
To fix this I'd suggest first taking the top priority job and then iterating over all the workers in order of descending priority until you find one that can process that job. If none of the workers can process the job then take the second highest priority job, and so on. So effectively you have nested loops, something like this:
def getNextWorkerAndJobPair():
    for job in sorted(jobs, key=priority, reverse=True):
        for worker in sorted(workers, key=priority, reverse=True):
            if worker.can_process(job):
                return (worker, job)
The above example sorts the data unnecessarily many times though. To avoid this it would be best to store the data already in sorted order. As for what data structures to use, I'm not really sure what the best is. Ideally you would want O(log n) inserts and removals and to be able to iterate over the collection in sorted order in O(n) time. I think PriorityQueue meets the first of those requirements but not the second. I imagine that sortedlist from the blist package would work, but I haven't tried it myself and the webpage isn't specific about the performance guarantees that this class offers.
The way I have suggested to iterate over the jobs first and then over the workers in the inner loop is not the only approach you could take. You could also reverse the order of the loops so that you choose the highest priority worker first and then try to find a job for it. Or you could find the valid (job, worker) pair that has the maximum value of f(priority_job, priority_worker) for some function f (for example just add the priorities).
The only answer was useful but not detailed enough, so I will accept my own answer for now. See the code in the question.

Dumping a multiprocessing.Queue into a list

I wish to dump a multiprocessing.Queue into a list. For that task I've written the following function:
import Queue

def dump_queue(queue):
    """
    Empties all pending items in a queue and returns them in a list.
    """
    result = []

    # START DEBUG CODE
    initial_size = queue.qsize()
    print("Queue has %s items initially." % initial_size)
    # END DEBUG CODE

    while True:
        try:
            thing = queue.get(block=False)
            result.append(thing)
        except Queue.Empty:

            # START DEBUG CODE
            current_size = queue.qsize()
            total_size = current_size + len(result)
            print("Dumping complete:")
            if current_size == initial_size:
                print("No items were added to the queue.")
            else:
                print("%s items were added to the queue." % \
                      (total_size - initial_size))
            print("Extracted %s items from the queue, queue has %s items \
left" % (len(result), current_size))
            # END DEBUG CODE

            return result
But for some reason it doesn't work.
Observe the following shell session:
>>> import multiprocessing
>>> q = multiprocessing.Queue()
>>> for i in range(100):
... q.put([range(200) for j in range(100)])
...
>>> q.qsize()
100
>>> l=dump_queue(q)
Queue has 100 items initially.
Dumping complete:
0 items were added to the queue.
Extracted 1 items from the queue, queue has 99 items left
>>> l=dump_queue(q)
Queue has 99 items initially.
Dumping complete:
0 items were added to the queue.
Extracted 3 items from the queue, queue has 96 items left
>>> l=dump_queue(q)
Queue has 96 items initially.
Dumping complete:
0 items were added to the queue.
Extracted 1 items from the queue, queue has 95 items left
>>>
What's happening here? Why aren't all the items being dumped?
Try this:
import Queue
import time

def dump_queue(queue):
    """
    Empties all pending items in a queue and returns them in a list.
    """
    result = []

    for i in iter(queue.get, 'STOP'):
        result.append(i)
        time.sleep(.1)
    return result

import multiprocessing
q = multiprocessing.Queue()
for i in range(100):
    q.put([range(200) for j in range(100)])
q.put('STOP')
l = dump_queue(q)
print len(l)
Multiprocessing queues have an internal buffer and a feeder thread that pulls work off the buffer and flushes it to the pipe. If not all of the objects have been flushed, I could see a case where Empty is raised prematurely. Using a sentinel to indicate the end of the queue is safe (and reliable). Also, using the iter(get, sentinel) idiom is just better than relying on Empty.
I don't like that it could raise Empty due to flushing timing (I added the time.sleep(.1) to allow a context switch to the feeder thread; you may not need it, it works without it - it's a habit to release the GIL).
# in theory:
def dump_queue(q):
    q.put(None)
    return list(iter(q.get, None))

# in practice this might be more resilient:
def dump_queue(q):
    q.put(None)
    return list(iter(lambda: q.get(timeout=0.00001), None))

# but neither case handles all the ways things can break
# for that you need 'managers' and 'futures' ... see Commentary
I prefer None for sentinels, but I would tend to agree with jnoller that mp.Queue could use a safe and simple sentinel. His comments on the risk of Empty being raised early are also valid; see below.
Commentary:
This is old and Python has changed, but this does come up as a hit if you're having issues with lists <-> queues in MP Python. So, let's look a little deeper:
First off, this is not a bug, it's a feature: https://bugs.python.org/issue20147. To save you some time from reading that discussion and more details in the documentation, here are some highlights (kind of philosophical but I think it might help some who are starting with MP/MT in Python):
MP Queues are structures capable of being communicated with from different threads, different processes on the same system, and in fact can be different (networked) computers
In general with parallel/distributed systems, strict synchronization is expensive, so every time you use part of the API for any MP/MT datastructures, you need to look at the documentation to see what it promises to do, or not. Hint: if a function doesn't include the word "lock" or "semaphore" or "barrier" etc, then it will be some mixture of "asynchronous" and "best effort" (approximate), or what you might call "flaky."
Specific to this situation: Python is an interpreted language, with a famous single interpreter thread and its famous "Global Interpreter Lock" (GIL). If your entire program is single-process and single-threaded, then everything is hunky dory. If not (and with MP it's egregiously not), you need to give the interpreter some breathing room. time.sleep() is your friend. In this case, timeouts.
In your solution you are only using flaky functions - get() and qsize(). And the code is in fact worse than you might think - dial up the size of the queue and the size of the objects and you're likely to break things.
Now, you can work with flaky routines, but you need to give them room to maneuver. In your example you're just hammering that queue. All you need to do is change the line thing = queue.get(block=False) to instead be thing = queue.get(block=True,timeout=0.00001) and you should be fine.
The time 0.00001 is chosen carefully (10^-5), it's about the smallest that you can safely make it (this is where art meets science).
Some comments on why you need the timeout: this relates to the internals of how MP queues work. When you 'put' something into an MP queue, it's not actually put into the queue, it's queued up to eventually be there. That's why qsize() happens to give you a correct result - that part of the code knows there's a pile of things "in" the queue. You just need to realize that an object "in" the queue is not the same thing as "I can now read it." Think of MP queues as sending a letter with USPS or FedEx - you might have a receipt and a tracking number showing that "it's in the mail," but the recipient can't open it yet. Now, to be even more specific, in your case you get '0' items accessible right away. That's because the single interpreter thread you're running hasn't had any chance to process stuff that's "queued up", so your first loop just queues up a bunch of stuff for the queue, but you're immediately forcing your single thread to try to do a get() before it's even had a chance to line up even a single object for you.
One might argue that it slows code down to have these timeouts. Not really - MP queues are heavy-weight constructs; you should only be using them to pass pretty heavy-weight "things" around, either big chunks of data or at least complex computation. What the act of adding 10^-5 seconds actually does is give the interpreter a chance to do thread scheduling - at which point it will see your backed-up put() operations.
Caveat
The above is not completely correct, and this is (arguably) an issue with the design of the get() function. The semantics of setting timeout to non-zero is that the get() function will not block for longer than that before returning Empty. But it might not actually be Empty (yet). So if you know your queue has a bunch of stuff to get, then the second solution above works better, or even with a longer timeout. Personally I think they should have kept the timeout=0 behavior, but had some actual built-in tolerance of 1e-5, because a lot of people will get confused about what can happen around gets and puts to MP constructs.
In your example code, you're not actually spinning up parallel processes. If we were to do that, then you'd start getting some random results - sometimes only some of the queue objects will be removed, sometimes it will hang, sometimes it will crash, sometimes more than one thing will happen. For example, one process might crash while another hangs.
The underlying problem is that when you insert the sentinel, you need to know that the queue is finished. That should be done as part of the logic around the queue - if, for example, you have a classical master-worker design, then the master would need to push a sentinel (end) when the last task has been added. Otherwise you end up with race conditions.
The "correct" (resilient) approach is to involve managers and futures:
import multiprocessing
import concurrent.futures

def fill_queue(q):
    for i in range(5000):
        q.put([range(200) for j in range(100)])

def dump_queue(q):
    q.put(None)
    return list(iter(q.get, None))

with multiprocessing.Manager() as manager:
    q = manager.Queue()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.submit(fill_queue, q)  # add stuff
        executor.submit(fill_queue, q)  # add more stuff
        executor.submit(fill_queue, q)  # ... and more

    # 'step out' of the executor
    l = dump_queue(q)

# 'step out' of the manager
print(f"Saw {len(l)} items")
Let the manager handle your MP constructs (queues, dictionaries, etc), and within that let the futures handle your processes (and within that, if you want, let another future handle threads). This assures that things are cleaned up as you 'unravel' the work.
