How to slice a queue? - python

I have a queue which I process in a loop:
while True:
    # a processing loop
    batch = []
    while True:
        e = q.get()
        if e:
            batch.append(e)
        else:
            # the queue is empty
            break
    do_something_with(batch)
    # wait a moment before emptying the queue again
    time.sleep(2)
The idea is to empty the queue, process its content and wait a moment before checking the content again.
I sometimes hit a race condition where the queue is fed while I get() an element, and I end up with a growing batch which is never processed further.
One solution would be to check the batch size and process the batch once it reaches a given size. This does not work when few events enter the queue and the batch never reaches that size: I need the events to be processed whatever their number, without waiting for enough of them to accumulate.
A second solution is to build a check based on both the batch size and the time the batch has been idle, but this overly complicates the code.
One good solution would be to "get up to n elements from the queue at once". I could not find anything like that in the documentation. Is there a way to pop several elements at once from the queue (à la slicing for a list)?

Queue.get blocks by default; that is the source of your infinite loop. From the documentation:
Queue.get(block=True, timeout=None)
Remove and return an item from the queue. If optional args block is
true and timeout is None (the default), block if necessary until an
item is available. If timeout is a positive number, it blocks at most
timeout seconds and raises the Empty exception if no item was
available within that time. Otherwise (block is false), return an item
if one is immediately available, else raise the Empty exception
(timeout is ignored in that case).
You should use Queue.get_nowait() or Queue.get(block=False) to avoid blocking, or use Queue.get(timeout=<seconds>) to wait at most <seconds> when the queue is empty.
The solution mentioned in your question sounds good:
from queue import Empty  # Python 2: from Queue import Empty
BATCH_SIZE = 10
while True:
    batch = []
    # Leave the loop once enough items are collected or the queue is empty
    while len(batch) < BATCH_SIZE:
        try:
            e = q.get_nowait()  # OR q.get(timeout=0.1)
        except Empty:
            break
        batch.append(e)
    # To skip empty batches, guard the call instead:
    # if batch:
    #     do_something_with(batch)
    do_something_with(batch)
    # wait a moment before emptying the queue again
    time.sleep(2)
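The "get up to n elements at once" idea from the question can be wrapped in a small helper. This is only a sketch in Python 3 spelling: get_up_to is a hypothetical name, not something the queue module provides, and it returns whatever is immediately available without waiting.

```python
import queue


def get_up_to(q, max_items):
    """Return at most max_items elements that are immediately available."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break  # queue drained before the cap was reached
    return batch


q = queue.Queue()
for i in range(5):
    q.put(i)

first = get_up_to(q, 3)   # capped by max_items
rest = get_up_to(q, 10)   # capped by what is left in the queue
```

Because the helper stops at the first Empty, a producer that keeps feeding the queue can never make the batch grow past max_items.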

The workaround I am using now, until a better idea/solution comes along:
while True:
    # get up to 1000 elements from the queue
    batch = []
    for _ in range(1000):
        try:
            e = q.get(block=False)
        except queue.Empty:
            continue
        else:
            batch.append(e)
    do_something_with(batch)
    time.sleep(2)
I may make 1000 useless attempts to get an element (queue empty), get all 1000 of them (even when the queue keeps growing), or anything in between.

Related

Detect if main process has been quit from background process

I have a single background process running alongside the main one, and it uses a Queue to communicate (using multiprocessing, not multithreading). The main process runs constantly, and the background process runs once per queue item so that if it gets backlogged, it can still catch up. Instead of closing with the main script (I've enabled daemon for that), I would prefer it to run until the queue is empty, then save and quit.
It's started like this:
q_send = Queue()
q_recv = Queue()
p1 = Process(target=background_process, args=(q_send, q_recv))
p1.daemon = True
p1.start()
Here's how the background process currently runs:
while True:
    received_data = q_recv.get()
    #do stuff
One way I've considered is to switch the loop to run all the time, but check the size of the queue before trying to read it, and wait a few seconds if it's empty before trying again. There are a couple of problems though. The whole point is that it runs once per item, so if there are 1000 queued commands, it seems a little inefficient to check the queue size before each one. Also, there's no real limit on how long the main process can go without sending an update, so I'd have to set the timeout quite high, as opposed to exiting instantly when the connection is broken and the queue is emptied. With the background process using up to 2 GB of RAM, it could probably do with exiting as soon as possible.
It'd also make it look a lot more messy:
afk_time = 0
while True:
    if afk_time > 300:
        return
    if not q_recv.qsize():
        time.sleep(2)
        afk_time += 2
    else:
        received_data = q_recv.get()
        #do stuff
I came across is_alive(), and thought perhaps getting the main process from current_process() might work, but it gave a pickling error when I tried to send it to the queue.
Queue.get has a keyword argument timeout which determines how long to wait for an item if the queue is empty. If no item is available when the timeout elapses, an Empty exception is raised. From the documentation:
Remove and return an item from the queue. If optional args block is true and timeout is None (the default), block if necessary until an item is available. If timeout is a positive number, it blocks at most timeout seconds and raises the Empty exception if no item was available within that time. Otherwise (block is false), return an item if one is immediately available, else raise the Empty exception (timeout is ignored in that case).
So you can catch that exception and leave the loop:
try:
    received_data = q_recv.get(timeout=300)
except queue.Empty:
    return
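Put together, the background loop could look like the sketch below. The names consume_until_idle, handle and idle_timeout are illustrative and not part of the question's code; queue.Queue is used here so the example runs in a single process, and a multiprocessing.Queue raises the same queue.Empty exception on timeout.

```python
import queue


def consume_until_idle(q, handle, idle_timeout):
    """Handle items one by one; return once q stays empty for idle_timeout seconds."""
    handled = 0
    while True:
        try:
            item = q.get(timeout=idle_timeout)
        except queue.Empty:
            return handled  # nothing arrived in time: save state and quit here
        handle(item)
        handled += 1


q = queue.Queue()
for command in ("a", "b", "c"):
    q.put(command)

seen = []
count = consume_until_idle(q, seen.append, idle_timeout=0.1)
```

The blocking get returns as soon as an item is available, so a backlog of 1000 commands is processed without any sleeping or size checks; the timeout only matters once the queue is empty.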

efficient python raw_input and serial port polling

I am working on a python project that is polling for data on a COM port and also polling for user input. As of now, the program is working flawlessly but seems to be inefficient. I have the serial port polling occurring in a while loop running in a separate thread and sticking data into a Queue. The user input polling is also occurring in a while loop running in a separate thread sticking input into a Queue. Unfortunately I have too much code and posting it would take away from the point of the question.
So is there a more efficient way to poll a serial or raw_input() without sticking them in an infinite loop and running them in their own thread?
I have been doing a lot of research on this topic and keep coming across the "separate thread and Queue" paradigm. However, when I run this program I am using nearly 30% of my CPU resources on a quad-core i7. There has to be a better way.
I have worked with ISRs in C and was hoping there is something similar to interrupts that I could be using. My recent research has uncovered a lot of "Event" libraries with callbacks, but I can't seem to wrap my head around how they would fit in my situation. I am developing on a Windows 7 (64-bit) machine but will be moving the finished product to a RPi when I am finished. I'm not looking for code, I just need to be pointed in the right direction. Thank you for any info.
You're seeing the high CPU usage because your main thread is using the non-blocking get_nowait call to poll two different queues in an infinite loop, which means the loop spins almost all the time without finding anything to do. Constantly running through the loop uses CPU cycles, just as any tight infinite loop does. To avoid using lots of CPU, have your infinite loops use blocking I/O, so that they wait until there's actually data to process before continuing. This way, you're not constantly running through the loop, and therefore not burning CPU.
So, user input thread:
while True:
    data = raw_input() # This blocks, and won't use CPU while doing so
    queue.put({'type': 'input', 'data': data})
COM thread:
while True:
    data = com.get_com_data() # This blocks, and won't use CPU while doing so
    queue.put({'type': 'COM', 'data': data})
main thread:
while True:
    data = queue.get() # This call will block, and won't use CPU while doing so
    # process data
The blocking get call will just wait until it's woken up by a put in another thread, using a threading.Condition object. It's not repeatedly polling. From Queue.py:
# Notify not_empty whenever an item is added to the queue; a
# thread waiting to get is notified then.
self.not_empty = _threading.Condition(self.mutex)
...
def get(self, block=True, timeout=None):
    self.not_empty.acquire()
    try:
        if not block:
            if not self._qsize():
                raise Empty
        elif timeout is None:
            while not self._qsize():
                self.not_empty.wait() # This is where the code blocks
        elif timeout < 0:
            raise ValueError("'timeout' must be a non-negative number")
        else:
            endtime = _time() + timeout
            while not self._qsize():
                remaining = endtime - _time()
                if remaining <= 0.0:
                    raise Empty
                self.not_empty.wait(remaining)
        item = self._get()
        self.not_full.notify()
        return item
    finally:
        self.not_empty.release()
def put(self, item, block=True, timeout=None):
    self.not_full.acquire()
    try:
        if self.maxsize > 0:
            if not block:
                if self._qsize() == self.maxsize:
                    raise Full
            elif timeout is None:
                while self._qsize() == self.maxsize:
                    self.not_full.wait()
            elif timeout < 0:
                raise ValueError("'timeout' must be a non-negative number")
            else:
                endtime = _time() + timeout
                while self._qsize() == self.maxsize:
                    remaining = endtime - _time()
                    if remaining <= 0.0:
                        raise Full
                    self.not_full.wait(remaining)
        self._put(item)
        self.unfinished_tasks += 1
        self.not_empty.notify() # This is what wakes up `get`
    finally:
        self.not_full.release()
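A runnable Python 3 sketch of the same layout follows. The two producers stand in for the keyboard and serial readers (their data is made up for the example), and DONE is a hypothetical sentinel added only so the demonstration terminates; a real program would keep reading forever.

```python
import queue
import threading

events = queue.Queue()
DONE = object()  # hypothetical sentinel marking the end of one producer


def producer(kind, values):
    """Stand-in for a blocking reader (serial port or keyboard input)."""
    for value in values:
        events.put({"type": kind, "data": value})
    events.put(DONE)


threads = [
    threading.Thread(target=producer, args=("input", ["help", "quit"])),
    threading.Thread(target=producer, args=("COM", [b"\x01", b"\x02"])),
]
for t in threads:
    t.start()

received = []
finished = 0
while finished < len(threads):
    event = events.get()  # blocks on a condition variable, no busy polling
    if event is DONE:
        finished += 1
    else:
        received.append(event)

for t in threads:
    t.join()
```

The main loop only wakes up when a producer calls put, so it consumes no CPU while both sources are idle.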

Python multiprocessing queue.get with timeout vs sleep

I can't find any documentation on python's queue get with timeout: get([block[, timeout]]) whereas there is good documentation on python's time.sleep() at http://www.pythoncentral.io/pythons-time-sleep-pause-wait-sleep-stop-your-code/.
I've used the Linux time command to time loops of 5, 500 and 5000 iterations over both, with a period of 100 ms, and the two seem similar.
Snippet 1: with queue timeout
while True:
    try:
        if self._queue.get(True, period) == '!STOP!': break
    except Queue.Empty:
        # nothing arrived within period, keep going
        pass
    # do stuff here
Snippet 2: with time.sleep
while True:
    try:
        if self._queue.get_nowait() == '!STOP!': break
    except Queue.Empty:
        # queue is empty, keep going
        pass
    # do stuff here
    time.sleep(period)
Snippet 1 is preferred because instead of sleeping and then checking the queue for the poison pill, it 'sleeps' while checking the queue. Of course it is a pretty moot point, since the period will normally only be between 0.100 and 0.500 seconds, but I want to make sure there isn't something in queue.get that I'm missing.
As you said, the first option is the better choice: instead of unconditionally sleeping for period, checking whether anything is in the queue, and then sleeping again, you're actively waiting for something to be put into the queue for the entire period, and only briefly doing something other than waiting for the '!STOP!' to arrive. There are no hidden gotchas; get is internally using time.time() + timeout to decide how long to wait 1) to acquire the internal lock on the queue, and 2) for something to actually be in the queue to get. Here's the relevant code from multiprocessing/queues.py:
if block:
    deadline = time.time() + timeout
if not self._rlock.acquire(block, timeout): # Waits for up to `timeout` to get the lock
    raise Empty # raise Empty if it didn't get it
try:
    if block:
        timeout = deadline - time.time()
        if timeout < 0 or not self._poll(timeout): # Once it has the lock, waits for however much time is left before `deadline` for something to arrive
            raise Empty
    elif not self._poll():
        raise Empty
    res = self._recv()
    self._sem.release()
    return res
finally:
    self._rlock.release()
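As a self-contained illustration of the first snippet, here is a sketch in Python 3 spelling. run_until_stop and do_work are hypothetical names, and queue.Queue stands in for the multiprocessing queue so the example runs in one process; both raise queue.Empty when the timeout expires.

```python
import queue

STOP = "!STOP!"


def run_until_stop(q, period, do_work):
    """Wait up to `period` for a message, then do one unit of work; stop on STOP."""
    while True:
        try:
            if q.get(True, period) == STOP:
                return
        except queue.Empty:
            pass  # nothing arrived within `period`
        do_work()


q = queue.Queue()
q.put("anything else")
q.put(STOP)

ticks = []
run_until_stop(q, 0.05, lambda: ticks.append(1))
```

With this shape the loop reacts to the poison pill as soon as it arrives, rather than up to one full period later as in the sleep-based variant.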

Output Queue of a Python multiprocessing is providing more results than expected

From the following code I would expect the length of the resulting list to be the same as that of the range of items fed to the processes:
import multiprocessing as mp
def worker(working_queue, output_queue):
    while True:
        if working_queue.empty() is True:
            break #this is supposed to end the process.
        else:
            picked = working_queue.get()
            if picked % 2 == 0:
                output_queue.put(picked)
            else:
                working_queue.put(picked+1)
    return
if __name__ == '__main__':
    static_input = xrange(100)
    working_q = mp.Queue()
    output_q = mp.Queue()
    for i in static_input:
        working_q.put(i)
    processes = [mp.Process(target=worker,args=(working_q, output_q)) for i in range(mp.cpu_count())]
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
    results_bank = []
    while True:
        if output_q.empty() is True:
            break
        else:
            results_bank.append(output_q.get())
    print len(results_bank) # length of this list should be equal to static_input, which is the range used to populate the input queue. In other words, this tells whether all the items placed for processing were actually processed.
    results_bank.sort()
    print results_bank
Does anyone have an idea how to make this code run properly?
This code will never stop:
Each worker gets an item from the queue as long as it is not empty:
picked = working_queue.get()
and puts a new one for each that it got:
working_queue.put(picked+1)
As a result the queue will never be empty except when the timing between the process happens to be such that the queue is empty at the moment one of the processes calls empty(). Because the queue length is initially 100 and you have as many processes as cpu_count() I would be surprised if this ever stops on any realistic system.
Well, executing the code with a slight modification proves me wrong: it does stop at some point, which actually surprises me. With a single process there seems to be a bug, because after some time the process freezes but does not return. With multiple processes the result varies.
Adding a short sleep period in the loop iteration makes the code behave as I expected and explained above. There seems to be some timing issue between Queue.put, Queue.get and Queue.empty, although they are supposed to be thread-safe. Removing the empty test also gives the expected result (without ever getting stuck at an empty queue).
Found the reason for the varying behaviour. Objects put on the queue are not flushed immediately, so empty() might return True although there are items in the queue waiting to be flushed.
From the documentation:
Note: When an object is put on a queue, the object is pickled and a
background thread later flushes the pickled data to an underlying
pipe. This has some consequences which are a little surprising, but
should not cause any practical difficulties – if they really bother
you then you can instead use a queue created with a manager.
After putting an object on an empty queue there may be an infinitesimal delay before the queue’s empty() method returns False and get_nowait() can return without raising Queue.Empty.
If multiple processes are enqueuing objects, it is possible for the objects to be received at the other end out-of-order. However, objects enqueued by the same process will always be in the expected order with respect to each other.
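One way to avoid relying on empty() is to let the worker block on get with a timeout and treat a timeout as "no more work". The sketch below is written for Python 3 and only illustrates that idea: idle_timeout is an assumed parameter, and the check at the bottom uses queue.Queue in a single process, since multiprocessing.Queue offers the same get and put calls and raises the same queue.Empty exception.

```python
import queue


def worker(working_queue, output_queue, idle_timeout=0.5):
    """Process items until the working queue stays empty for idle_timeout seconds."""
    while True:
        try:
            picked = working_queue.get(timeout=idle_timeout)
        except queue.Empty:
            return  # no item arrived in time: assume the work is done
        if picked % 2 == 0:
            output_queue.put(picked)
        else:
            working_queue.put(picked + 1)


# In-process check with queue.Queue objects.
working_q = queue.Queue()
output_q = queue.Queue()
for i in range(10):
    working_q.put(i)

worker(working_q, output_q, idle_timeout=0.05)

results = []
while True:
    try:
        results.append(output_q.get_nowait())
    except queue.Empty:
        break
```

The timeout gives the feeder thread time to flush pending items, so a momentarily empty pipe no longer ends a worker early. It remains a heuristic: the value has to be longer than the slowest step that can still put something back on the working queue.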

Extending python Queue.PriorityQueue (worker priority, work package types)

I would like to extend the Queue.PriorityQueue described here: http://docs.python.org/library/queue.html#Queue.PriorityQueue
The queue will hold work packages with a priority. Workers will get work packages and process them. I want to make the following additions:
Workers have a priority too. When multiple workers are idle the one with the highest priority should process an incoming work package.
Not every worker can process every work package, so a mechanism is needed that checks if work package type and worker capabilities have a match.
I am looking for hints on how this is best implemented (starting from scratch, extending PriorityQueue or Queue, ...).
edit
Here is my first (untested) try. The basic idea is that all waiting threads will be notified. Then they all try to get a work item through _choose_worker(self, worker). (Made it community wiki)
edit
Works for some simple tests now...
edit
Added a custom BaseManager and a local copy of the worker list in the _choose_worker function.
edit
bug fix
import Queue
from Queue import Empty, Full
from time import time as _time
import heapq
class AdvancedQueue(Queue.PriorityQueue):
    # Initialize the queue representation
    def _init(self, _maxsize):
        self.queue = []
        self.worker = []
    def put(self, item, block=True, timeout=None):
        '''
        Put an item into the queue.
        If optional args 'block' is true and 'timeout' is None (the default),
        block if necessary until a free slot is available. If 'timeout' is
        a positive number, it blocks at most 'timeout' seconds and raises
        the Full exception if no free slot was available within that time.
        Otherwise ('block' is false), put an item on the queue if a free slot
        is immediately available, else raise the Full exception ('timeout'
        is ignored in that case).
        '''
        self.not_full.acquire()
        try:
            if self.maxsize > 0:
                if not block:
                    if self._qsize() == self.maxsize:
                        raise Full
                elif timeout is None:
                    while self._qsize() == self.maxsize:
                        self.not_full.wait()
                elif timeout < 0:
                    raise ValueError("'timeout' must be a positive number")
                else:
                    endtime = _time() + timeout
                    while self._qsize() == self.maxsize:
                        remaining = endtime - _time()
                        if remaining <= 0.0:
                            raise Full
                        self.not_full.wait(remaining)
            self._put(item)
            self.unfinished_tasks += 1
            self.not_empty.notifyAll() # only change
        finally:
            self.not_full.release()
    def get(self, worker, block=True, timeout=None):
        self.not_empty.acquire()
        try:
            self._put_worker(worker)
            if not block:
                if not self._qsize():
                    raise Empty
                else:
                    return self._choose_worker(worker)
            elif timeout is None:
                while True:
                    while not self._qsize():
                        self.not_empty.wait()
                    try:
                        return self._choose_worker(worker)
                    except Empty:
                        self.not_empty.wait()
            elif timeout < 0:
                raise ValueError("'timeout' must be a positive number")
            else:
                endtime = _time() + timeout
                def wait(endtime):
                    remaining = endtime - _time()
                    if remaining <= 0.0:
                        raise Empty
                    self.not_empty.wait(remaining)
                while True:
                    while not self._qsize():
                        wait(endtime)
                    try:
                        return self._choose_worker(worker)
                    except Empty:
                        wait(endtime)
        finally:
            self._remove_worker(worker)
            self.not_empty.release()
    # Put a new worker in the worker queue
    def _put_worker(self, worker, heappush=heapq.heappush):
        heappush(self.worker, worker)
    # Remove a worker from the worker queue
    def _remove_worker(self, worker):
        self.worker.remove(worker)
    # Choose a matching worker with highest priority
    def _choose_worker(self, worker):
        worker_copy = self.worker[:] # we need a copy so we can remove assigned worker
        for item in self.queue:
            for enqueued_worker in worker_copy:
                if item[1].type in enqueued_worker[1].capabilities:
                    if enqueued_worker == worker:
                        self.queue.remove(item)
                        self.not_full.notify()
                        return item
                    else:
                        worker_copy.remove(enqueued_worker)
                        # item will be taken by enqueued_worker (which has higher priority),
                        # so enqueued_worker is busy and can be removed
                        continue
        raise Empty
I think you are describing a situation where you have two "priority queues" - one for the jobs and one for the workers. The naive approach is to take the top priority job and the top priority worker and try to pair them. But of course this fails when the worker is unable to execute the job.
To fix this I'd suggest first taking the top priority job and then iterating over all the workers in order of descending priority until you find one that can process that job. If none of the workers can process the job then take the second highest priority job, and so on. So effectively you have nested loops, something like this:
def getNextWorkerAndJobPair():
    for job in sorted(jobs, key=priority, reverse=True):
        for worker in sorted(workers, key=priority, reverse=True):
            if worker.can_process(job):
                return (worker, job)
The above example sorts the data unnecessarily many times though. To avoid this it would be best to store the data already in sorted order. As for what data structures to use, I'm not really sure what the best is. Ideally you would want O(log n) inserts and removals and to be able to iterate over the collection in sorted order in O(n) time. I think PriorityQueue meets the first of those requirements but not the second. I imagine that sortedlist from the blist package would work, but I haven't tried it myself and the webpage isn't specific about the performance guarantees that this class offers.
The way I have suggested to iterate over the jobs first and then over the workers in the inner loop is not the only approach you could take. You could also reverse the order of the loops so that you choose the highest priority worker first and then try to find a job for it. Or you could find the valid (job, worker) pair that has the maximum value of f(priority_job, priority_worker) for some function f (for example just add the priorities).
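The last variant, picking the valid pair that maximises f(priority_job, priority_worker), might be sketched as follows. Everything here is an assumption made for illustration: the Job and Worker classes, the next_pair name, the choice of f as a plain sum, and the convention that a larger number means a higher priority (Queue.PriorityQueue itself returns the lowest value first).

```python
from dataclasses import dataclass


@dataclass
class Job:
    priority: int
    type: str


@dataclass
class Worker:
    priority: int
    capabilities: frozenset


def next_pair(jobs, workers):
    """Return the (worker, job) pair with the highest combined priority, or None."""
    best = None
    best_score = None
    for job in jobs:
        for worker in workers:
            if job.type not in worker.capabilities:
                continue  # this worker cannot process this job
            score = job.priority + worker.priority  # f = sum of priorities
            if best_score is None or score > best_score:
                best, best_score = (worker, job), score
    return best


jobs = [Job(5, "render"), Job(9, "encode")]
workers = [
    Worker(3, frozenset({"render"})),
    Worker(1, frozenset({"render", "encode"})),
]
pair = next_pair(jobs, workers)
```

This scans every pair, so it costs O(len(jobs) * len(workers)) per call and needs no sorting; whether that is acceptable depends on how many jobs and workers are idle at once.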
The only answer was useful but not detailed enough, so I will accept my own answer for now. See the code in the question.
