Python pseudocode for a multi-producer, multi-consumer threading setup:

def threadProducer():
    while upstreams_not_done:
        data = do_some_work()
        queue_of_data.put(data)

def threadConsumer():
    while True:
        data = queue_of_data.get()
        do_other_work()
        queue_of_data.task_done()

queue_of_data = queue.Queue()
list_of_producers = create_and_start_producers()
list_of_consumers = create_and_start_consumers()
queue_of_data.join()
# Is all work done now?
Here queue_of_data.task_done() is called for each item taken from the queue.
When producers work slower than consumers, is it possible that queue_of_data.join() unblocks at a moment when no producer has generated new data yet, but all consumers have finished their current items with task_done()?
And if Queue.join() is not reliable in this way, how can I check whether all work is done?
The usual way is to put a sentinel value (like None) on the queue when the producers are done, one sentinel for each consumer thread. Consumers are then written to exit the thread when they pull None from the queue.
So, e.g., in the main program:
for t in list_of_producers:
    t.join()
# Now we know all producers are done.
for t in list_of_consumers:
    queue_of_data.put(None)  # tell a consumer we're done
for t in list_of_consumers:
    t.join()
and consumers look like:
def threadConsumer():
    while True:
        data = queue_of_data.get()
        if data is None:
            break
        do_other_work()
Note: if producers can overwhelm consumers, create the queue with a maximum size. Then queue.put() will block when the queue reaches that size, until a consumer removes something from the queue.
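Putting the pieces together, here is a minimal runnable sketch of the whole pattern; the work functions and item ranges are placeholders of my own, not from the original question:

import threading, queue

queue_of_data = queue.Queue(maxsize=100)  # bounded, so producers can't run away

def threadProducer(items):
    for item in items:            # stand-in for do_some_work()
        queue_of_data.put(item)   # blocks while the queue is full

def threadConsumer():
    while True:
        data = queue_of_data.get()
        if data is None:          # sentinel: no more work is coming
            break
        print('processed', data)  # stand-in for do_other_work()

producers = [threading.Thread(target=threadProducer, args=(range(i * 10, (i + 1) * 10),))
             for i in range(3)]
consumers = [threading.Thread(target=threadConsumer) for _ in range(2)]
for t in producers + consumers:
    t.start()
for t in producers:
    t.join()                      # all producers are now done
for _ in consumers:
    queue_of_data.put(None)       # one sentinel per consumer
for t in consumers:
    t.join()                      # all work is done here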
Related: the queue — A synchronized queue class documentation simply states that there are fewer functions available with SimpleQueue.
I need very basic queue functionality for a multithreading application. Would it help in any way to use SimpleQueue?
queue.SimpleQueue handles more than thread-safe concurrency. It handles reentrancy: it is safe to call queue.SimpleQueue.put in precarious situations where it might be interrupting other work in the same thread. For example, you can safely call it from __del__ methods, weakref callbacks, or signal module signal handlers.
If you need that, use queue.SimpleQueue.
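As an illustration, here is a sketch of the __del__ case; this example is mine, not from the docs:

import queue

log_q = queue.SimpleQueue()  # put() is reentrant, unlike queue.Queue

class Tracked:
    def __del__(self):
        # A finalizer can run at an arbitrary point, even while this same
        # thread is in the middle of another queue operation. With
        # queue.Queue that could deadlock on the queue's internal lock;
        # SimpleQueue.put() is safe here.
        log_q.put('collected a Tracked instance')

t = Tracked()
del t
while not log_q.empty():
    print(log_q.get())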
The Python documentation specifies that SimpleQueue does not have the tracking functionality (task_done, join). Those methods can be used to verify that every item put on the queue has been processed by another thread.
example code:
import threading, queue

q = queue.Queue()

def worker():
    while True:
        item = q.get()
        print(f'Working on {item}')
        print(f'Finished {item}')
        q.task_done()

# turn on the worker thread
threading.Thread(target=worker, daemon=True).start()

# send thirty task requests to the worker
for item in range(30):
    q.put(item)
print('All task requests sent\n', end='')

# block until all tasks are done
q.join()
print('All work completed')
In the above code the main thread uses join to wait for the worker thread to finish processing every item it sent. Meanwhile, the worker thread signals task_done every time it handles an item from the queue; a "task" is an item in the queue in this context.
Hope this helps. For more documentation, visit https://docs.python.org/3/library/queue.html
I have two questions regarding threads and queues in Python.
What does the maxsize argument do in queue.Queue()?
Is there any performance improvement depending on the number of threads (num_worker_threads)? I can't find any improvement from it.
-> If there's no improvement depending on the number of threads, why do we need them?
import time
import queue
import threading

num_worker_threads = 10
store = []

def worker():
    while True:
        item = q.get()
        if item is None:
            break
        store.append(item)
        q.task_done()

start = time.time()
q = queue.Queue()  # what does maxsize do in queue?
threads = []
for i in range(num_worker_threads):
    t = threading.Thread(target=worker)
    t.start()
    threads.append(t)

for item in range(1000000):
    q.put(item)

# block until all tasks are done
q.join()

# stop workers
for i in range(num_worker_threads):
    q.put(None)
for t in threads:
    t.join()

end = time.time()
print('running time: {}'.format(end - start))
You set the max size on a queue in order to provide a throttling mechanism for your workers. Assuming you have X producers and Y consumers using the same queue, if a producer's put() raises a queue.Full exception (which only happens with a non-blocking or timed-out put), you can use that as a signal to slow down production.
In your case, since worker doesn't actually do anything computationally expensive (it just appends to a list, in a thread-unsafe manner I might add), and you have only 1 producer and 10 consumers, the queue will most likely always be empty and the consumers will be idling instead (and would raise a queue.Empty exception if the get call had a timeout attached).
As a side note, in order to figure out whether the consumers or the producers are idling, you can use the timeout parameter of the queue's put and get methods.
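A small sketch of that diagnostic idea; the names and sizes here are my own:

import queue

q = queue.Queue(maxsize=100)  # bounded: put() blocks once 100 items are waiting
item = 'some work'

# Producer side: queue.Full means the consumers are falling behind.
try:
    q.put(item, timeout=1.0)
except queue.Full:
    print('consumers are not keeping up')

# Consumer side: queue.Empty means the producer is falling behind.
try:
    item = q.get(timeout=1.0)
except queue.Empty:
    print('producer is not keeping up')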
The application is a stream splitter: a producer process receives a data stream, and multiple consumer processes (client connections) pass the received data on to their connected clients.
I found this sample code for a condition variable (it uses multithreading, but it should work for multiprocessing too) and refactored it so it doesn't pop the item in the consumer process. That way I expected the other consumer processes to be able to reuse the item and resend the same data to their clients. Once all consumer processes finished sending, I would remove item[0] from the buffer list. But this is not working, since the processes do not execute in a predictable order.
1. Receive new data - Producer process
2. Send received data - Consumer process [1]
3. Send received data - Consumer process [2]
...
n. Send received data - Consumer process [n]
Loop everything.
What usually happens is that the producer process removes item[0] before all consumer processes have retrieved and sent it.
I guess one possible solution would be to use a separate Queue() for each consumer process, and to populate those separate queues from the producer process.
Is it possible to use Event() to notify a consumer process that new data has arrived, and then to pass that data along independently of the other consumer processes, without a queue?
If using a queue is the best solution, is it possible to use only one queue and keep new data until all consumer processes have finished sending it?
I'm open to any suggestions since I'm not sure what the best way to do this is.
import threading
import time

# A list of items that are being produced. Note: it is actually
# more efficient to use a collections.deque() object for this.
items = []

# A condition variable for items
items_cv = threading.Condition()

# A producer thread
def producer():
    print("I'm the producer")
    for i in range(30):
        with items_cv:          # always acquire the lock first
            items.append(i)     # add an item to the list
            items_cv.notify()   # send a notification signal
        time.sleep(1)
        items.pop(0)            # remove the item from the "buffer"

# A consumer thread
def consumer():
    print("I'm a consumer", threading.current_thread().name)
    while True:
        with items_cv:          # must always acquire the lock
            while not items:    # check if there are any items
                items_cv.wait() # if not, we have to sleep
            # x = items.pop(0)  # pop an item off
            x = items[0]
        print(threading.current_thread().name, "got", x)
        time.sleep(5)

# Launch a bunch of consumers
cons = [threading.Thread(target=consumer)
        for i in range(10)]
for c in cons:
    c.daemon = True
    c.start()

# Run the producer
producer()
The easiest way to solve this problem is to have a Queue per client. In my acceptor (listener) function I have this piece of code, which creates a buffer/queue for each incoming connection.
buffer = multiprocessing.Queue()
self.client_buffers.append(buffer)
process = multiprocessing.Process(name=process_name,
                                  target=self.stream_handler,
                                  args=(conn, address, buffer))
process.daemon = True
process.start()
In the main (producer) thread each queue is populated as soon as new data arrives.
while True:
    data = sock.recv(2048)
    if not data:
        break
    for buffer in self.client_buffers:
        buffer.put(data)
And each consumer process sends data independently:

if not buffer.empty():
    connection.sendall(buffer.get())
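One refinement worth noting (my addition, not part of the original answer): since each buffer has exactly one reader, a blocking get() avoids spinning on empty() checks entirely:

while True:
    data = buffer.get()          # blocks until the producer puts something
    connection.sendall(data)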
I would like to define a pool of n workers and have each execute tasks held in a RabbitMQ queue. When a task finishes (fails or succeeds), I want the worker to execute another task from the queue.
I can see in the docs how to spawn a pool of workers and have them all wait for their siblings to complete. I would like something different though: a buffer of n tasks, where when one worker finishes, another task is added to the buffer (so no more than n tasks are in the buffer). I'm having difficulty finding this in the docs.
For context, my non-multithreading code is this:
while True:
    message = get_frame_from_queue()  # get message from RabbitMQ
    do_task(message.body)             # body defines urls to download file
    acknowledge_complete(message)     # tell RabbitMQ the message is acknowledged
At this stage my "multithreading" implementation will look like this:
@receives('ask_for_a_job')
def get_a_task():
    # this function is executed when the `ask_for_a_job` signal is fired
    message = get_frame_from_queue()
    do_task(message)

def do_task(task_info):
    try:
        ...  # do stuff
    finally:
        # once the "worker" has finished, start another
        fire_signal('ask_for_a_job')

# start the "workers"
for i in range(5):
    fire_signal('ask_for_a_job')
I don't want to reinvent the wheel. Is there a more built-in way to achieve this?
Note: get_frame_from_queue is not thread safe.
You should be able to have each subprocess/thread consume directly from the queue, and then within each thread, simply process from the queue exactly as you would synchronously.
from threading import Thread

def do_task(msg):
    pass  # do stuff here

def consume():
    while True:
        message = get_frame_from_queue()
        do_task(message.body)
        acknowledge_complete(message)

if __name__ == "__main__":
    threads = []
    for i in range(5):
        t = Thread(target=consume)
        t.start()
        threads.append(t)
This way, you'll always have N messages from the queue being processed simultaneously, without any need for signaling to occur between threads.
The only "gotcha" here is the thread-safety of the rabbitmq library you're using. Depending on how it's implemented, you may need a separate connection per thread, or possibly one connection with a channel per thread, etc.
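And since the question notes that get_frame_from_queue is not thread safe, one simple hedge (my sketch, not part of the original answer) is to serialize just that call behind a lock, while the task work itself still runs concurrently:

import threading

fetch_lock = threading.Lock()

def get_frame_safely():
    # Only one thread at a time may enter the non-thread-safe function.
    with fetch_lock:
        return get_frame_from_queue()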
One solution is to leverage the multiprocessing.Pool object. Use an outer loop to get N items from RabbitMQ. Feed the items to the Pool, waiting until the entire batch is done. Then loop through the batch, acknowledging each message. Lastly continue the outer loop.
source
import multiprocessing

def worker(word):
    return bool(word == 'whiskey')

messages = ['syrup', 'whiskey', 'bitters']
BATCHSIZE = 2
pool = multiprocessing.Pool(BATCHSIZE)
while messages:
    # take the first few messages, one per worker
    batch, messages = messages[:BATCHSIZE], messages[BATCHSIZE:]
    print('BATCH:', end=' ')
    for res in pool.imap_unordered(worker, batch):
        print(res, end=' ')
    print()
    # TODO: acknowledge msgs in 'batch'
output
BATCH: False True
BATCH: False
I just wrote a task queue in Python whose job is to limit the number of tasks that run at one time. This is a little different from queue.Queue, because instead of limiting how many items can be in the queue, it limits how many can be taken out at one time. It still uses an unbounded queue.Queue to do its job, but it relies on a Semaphore to limit the number of threads:
from queue import Queue
from threading import BoundedSemaphore, Lock, Thread

class TaskQueue(object):
    """
    Queues tasks to be run in separate threads and limits the number of
    concurrently running tasks.
    """

    def __init__(self, limit):
        """Initializes a new instance of a TaskQueue."""
        self.__semaphore = BoundedSemaphore(limit)
        self.__queue = Queue()
        self.__cancelled = False
        self.__lock = Lock()

    def enqueue(self, callback):
        """Indicates that the given callback should be run."""
        self.__queue.put(callback)

    def start(self):
        """Tells the task queue to start running the queued tasks."""
        thread = Thread(target=self.__process_items)
        thread.start()

    def stop(self):
        self.__cancel()
        # prevent blocking on a semaphore.acquire
        self.__semaphore.release()
        # prevent blocking on a Queue.get
        self.__queue.put(lambda: None)

    def __cancel(self):
        print('canceling')
        with self.__lock:
            self.__cancelled = True

    def __process_items(self):
        while True:
            # see if the queue has been stopped before blocking on acquire
            if self.__is_canceled():
                break
            self.__semaphore.acquire()
            # see if the queue has been stopped before blocking on get
            if self.__is_canceled():
                break
            callback = self.__queue.get()
            # see if the queue has been stopped before running the task
            if self.__is_canceled():
                break

            def runTask():
                try:
                    callback()
                finally:
                    self.__semaphore.release()

            thread = Thread(target=runTask)
            thread.start()
            self.__queue.task_done()

    def __is_canceled(self):
        with self.__lock:
            return self.__cancelled
The Python interpreter runs forever unless I explicitly stop the task queue. This turned out to be a lot trickier than I expected. If you look at the stop method, you'll see that I set a cancelled flag, release the semaphore, and put a no-op callback on the queue. The last two parts are necessary because the code could be blocking on the Semaphore or on the Queue. I basically have to force these to go through so that the loop has a chance to break out.
This code works. This class is useful when running a service that is trying to run thousands of tasks in parallel. In order to keep the machine running smoothly and to prevent the OS from screaming about too many active threads, this code will limit the number of threads living at any one time.
I have written a similar chunk of code in C# before. What made that code particularly cut and dried was that .NET has something called a CancellationToken that just about every threading class uses. Any time there is a blocking operation, that operation takes an optional token. If the parent task is ever canceled, any child tasks blocking with that token are immediately canceled as well. This seems like a much cleaner way to exit than "faking it" by releasing semaphores or putting values in a queue.
I was wondering if there is an equivalent way of doing this in Python? I definitely want to use threads instead of something like asynchronous events. I am wondering if there is a way to achieve the same thing using two queue.Queues, where one has a max size and the other doesn't - but I'm still not sure how to handle cancellation.
I think your code can be simplified by using poisoning and Thread.join():
from queue import Queue
from threading import Thread

poison = object()

class TaskQueue(object):
    def __init__(self, limit):
        def process_items():
            while True:
                callback = self._queue.get()
                if callback is poison:
                    break
                try:
                    callback()
                except:
                    pass
                finally:
                    self._queue.task_done()
        self._workers = [Thread(target=process_items) for _ in range(limit)]
        self._queue = Queue()

    def enqueue(self, callback):
        self._queue.put(callback)

    def start(self):
        for worker in self._workers:
            worker.start()

    def stop(self):
        for worker in self._workers:
            self._queue.put(poison)
        while self._workers:
            self._workers.pop().join()
Untested.
I removed the comments for brevity.
Also, in this version process_items() is truly private.
BTW: The whole point of the Queue module is to free you from the dreaded locking and event stuff.
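A hypothetical usage sketch of the class above (the task bodies are mine):

tq = TaskQueue(limit=4)
for i in range(10):
    tq.enqueue(lambda i=i: print('task', i))  # default arg captures i by value
tq.start()
tq.stop()  # queues one poison pill per worker, then joins them all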
You seem to be creating a new thread for each task from the queue. This is wasteful in itself, and also leads you to the problem of how to limit the number of threads.
Instead, a common approach is to create a fixed number of worker threads and let them freely pull tasks from the queue. To cancel the queue, you can clear it and let the workers stay alive in anticipation of future work.
I took Janne Karila's advice and created a thread pool. This eliminated the need for a semaphore. The problem is that if you ever expect the queue to go away, you have to stop the worker threads from running (just a variation of what I did before). The new code is fairly similar:
from queue import Queue
from threading import Event, Lock, Thread

class TaskQueue(object):
    """
    Queues tasks to be run in separate threads and limits the number of
    concurrently running tasks.
    """

    def __init__(self, limit):
        """Initializes a new instance of a TaskQueue."""
        self.__workers = []
        for _ in range(limit):
            worker = Thread(target=self.__process_items)
            self.__workers.append(worker)
        self.__queue = Queue()
        self.__cancelled = False
        self.__lock = Lock()
        self.__event = Event()

    def enqueue(self, callback):
        """Indicates that the given callback should be run."""
        self.__queue.put(callback)

    def start(self):
        """Tells the task queue to start running the queued tasks."""
        for worker in self.__workers:
            worker.start()

    def stop(self):
        """
        Stops the queue from processing any more tasks. Any actively running
        tasks will run to completion.
        """
        self.__cancel()
        # prevent blocking on a Queue.get
        for _ in range(len(self.__workers)):
            self.__queue.put(lambda: None)
            self.__event.wait()

    def __cancel(self):
        with self.__lock:
            self.__queue.queue.clear()
            self.__cancelled = True

    def __process_items(self):
        while True:
            callback = self.__queue.get()
            # see if the queue has been stopped before running the task
            if self.__is_canceled():
                break
            try:
                callback()
            except:
                pass
            finally:
                self.__queue.task_done()
        self.__event.set()

    def __is_canceled(self):
        with self.__lock:
            return self.__cancelled
If you look carefully, I had to do some accounting to kill off the workers. I basically wait on an Event once for each worker. I clear the underlying queue to prevent the workers from being cancelled any other way. I also wait after pumping each bogus value into the queue, so only one worker can cancel out at a time.
I've run some tests on this and it appears to be working. It would still be nice to eliminate the need for the bogus values.
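For what it's worth, the standard library's concurrent.futures.ThreadPoolExecutor (Python 3.2+) implements this same fixed-pool pattern and handles the worker bookkeeping internally, so no bogus values are needed in user code. A minimal sketch:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(10):
        pool.submit(print, 'task', i)
# Leaving the with-block calls pool.shutdown(wait=True), which waits
# for the queued tasks to finish and then retires the worker threads.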