How can I implement a multi-producer, multi-consumer paradigm in Gevent? - python

I have some producer functions that rely on I/O-heavy blocking calls and some consumer functions that likewise rely on I/O-heavy blocking calls. To speed them up, I used the Gevent micro-threading library as glue.
Here's what my paradigm looks like:
import gevent
from gevent.queue import *
import time
import random

q = JoinableQueue()
workers = []
producers = []

def do_work(wid, value):
    gevent.sleep(random.randint(0, 2))
    print 'Task', value, 'done', wid

def worker(wid):
    while True:
        item = q.get()
        try:
            print "Got item %s" % item
            do_work(wid, item)
        finally:
            print "No more items"
            q.task_done()

def producer():
    while True:
        item = random.randint(1, 11)
        if item == 10:
            print "Signal Received"
            return
        else:
            print "Added item %s" % item
            q.put(item)

for i in range(4):
    workers.append(gevent.spawn(worker, random.randint(1, 100000)))

#This doesnt work.
for j in range(2):
    producers.append(gevent.spawn(producer))

#Uncommenting this makes this script work.
#producer()

q.join()
I have four consumers and would like to have two producers. The producers exit when they receive a signal, i.e. 10. The consumers keep feeding off this queue, and the whole task finishes when the producers and the consumers are done.
However, this doesn't work. If I comment out the for loop which spawns multiple producers and use only a single producer, the script runs fine.
I can't seem to figure out what I've done wrong.
Any ideas?
Thanks

You don't actually want to quit when the queue has no unfinished work, because conceptually that's not when the application should finish.
You want to quit when the producers have finished, and then when there is no unfinished work.
# Wait for all producers to finish producing
gevent.joinall(producers)
# *Now* we want to make sure there's no unfinished work
q.join()
# We don't care about workers. We weren't paying them anything, anyways
gevent.killall(workers)
# And, we're done.
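Putting it together, the tail of the original script would look something like this (a minimal sketch; everything above the spawn loops stays unchanged):

for i in range(4):
    workers.append(gevent.spawn(worker, random.randint(1, 100000)))
for j in range(2):
    producers.append(gevent.spawn(producer))

# Wait for both producers to hit their stop signal and return
gevent.joinall(producers)
# Wait until every queued item has been marked task_done()
q.join()
# The workers loop forever on q.get(), so kill them explicitly
gevent.killall(workers)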

I think it does q.join() before anything is put in the queue and exits immediately. Try joining all producers before joining the queue.

What you want to do is block the main program while the producers and workers communicate. Blocking on the queue will wait until the queue is empty and then yield, which could happen immediately. Put this at the end of your program instead of q.join():
gevent.joinall(producers)

I have run into the same issue. The main problem with your code is that the producer is spawned as a gevent greenlet, so the workers can't get tasks immediately.
I suggest running producer() directly in the main flow of the script rather than spawning it as a greenlet; that way the tasks are pushed onto the queue right away.
import gevent
from gevent.queue import *
import time
import random

q = JoinableQueue()
workers = []
producers = []

def do_work(wid, value):
    gevent.sleep(random.randint(0, 2))
    print 'Task', value, 'done', wid

def worker(wid):
    while True:
        item = q.get()
        try:
            print "Got item %s" % item
            do_work(wid, item)
        finally:
            print "No more items"
            q.task_done()

def producer():
    while True:
        item = random.randint(1, 11)
        if item == 10:
            print "Signal Received"
            return
        else:
            print "Added item %s" % item
            q.put(item)

# Run the producer in the main greenlet so the queue is filled first
producer()

for i in range(4):
    workers.append(gevent.spawn(worker, random.randint(1, 100000)))

# Wait until every queued item has been processed
q.join()
The code above works for me. :)

Related

How does Python Queue know it will be empty?

I would like to understand how a queue knows that it won't receive any new items. In the following example, the queue waits indefinitely when the tputter thread is not started (I assume because nothing has been put into it so far). If tputter is started, the getter waits between 'puts' until something new arrives, and as soon as everything is finished it stops. But how does tgetter know whether something new will end up in the queue or not?
import threading
import queue
import time

q = queue.Queue()

def getter():
    for i in range(5):
        print('worker:', q.get())
        time.sleep(2)

def putter():
    for i in range(5):
        print('putter: ', i)
        q.put(i)
        time.sleep(3)

tgetter = threading.Thread(target=getter)
tgetter.start()

tputter = threading.Thread(target=putter)
#tputter.start()
A common way to do this is to use the "poison pill" pattern. Basically, the producer and consumer agree on a special "poison pill" object that the producer can load into the queue, which will indicate that no more items are going to be sent, and the consumer can shut down.
So, in your example, it'd look like this:
import threading
import queue
import time

q = queue.Queue()
END = object()

def getter():
    while True:
        item = q.get()
        if item == END:
            break
        print('worker:', item)
        time.sleep(2)

def putter():
    for i in range(5):
        print('putter: ', i)
        q.put(i)
        time.sleep(3)
    q.put(END)

tgetter = threading.Thread(target=getter)
tgetter.start()

tputter = threading.Thread(target=putter)
#tputter.start()
This is a little contrived, since the producer is hard-coded to always send five items, so you have to imagine that the consumer doesn't know ahead of time how many items the producer will send.
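If several consumer threads shared the queue, a common variation of the same pattern (not part of the original question) is to put one sentinel per consumer, so that each one eventually sees a poison pill and shuts down. A rough sketch, reusing q, END and getter() from the example above (NUM_GETTERS is illustrative):

NUM_GETTERS = 3

getters = [threading.Thread(target=getter) for _ in range(NUM_GETTERS)]
for t in getters:
    t.start()

for i in range(5):            # the producer's real items
    q.put(i)

for _ in range(NUM_GETTERS):  # one poison pill per consumer
    q.put(END)

for t in getters:
    t.join()                  # every getter has seen END and exited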

Why the queue still joining after I called task_done()?

Python3.6
First I put some items in a queue, then start a thread and call join() on the queue in the main thread. In the thread loop I call get(), and when the size of the queue == 0 I call task_done(), break out of the loop, and exit the thread. But join() still blocks in the main thread. I can't figure out what's wrong.
Below is the code
Thanks
import queue
import threading

def worker(work_queue):
    while True:
        if work_queue.empty():
            print("Task 1 Over!")
            work_queue.task_done()
            break
        else:
            _ = work_queue.get()
            print(work_queue.qsize())
            # do actual work

def main():
    work_queue = queue.Queue()
    for i in range(10):
        work_queue.put("Item %d" % (i + 1))

    t = threading.Thread(target=worker, args=(work_queue, ))
    t.setDaemon(True)
    t.start()

    print("Main Thread 1")
    work_queue.join()
    print("Main Thread 2")

    t.join()
    print("Finish!")

if __name__ == "__main__":
    main()
task_done should be called for each work item that is dequeued and processed, not once the queue is entirely empty. (There'd be no reason for that; the queue already knows when it's empty.) join() will block until task_done has been called as many times as put was called.
So:
def worker(work_queue):
    while True:
        if work_queue.empty():
            print("Task 1 Over!")
            break
        else:
            _ = work_queue.get()
            print(work_queue.qsize())
            # do actual work
            work_queue.task_done()  # one task_done() per item taken off the queue
Note that it's weird for a worker to exit as soon as it sees an empty queue. Normally it would get() with blocking, and only exit when it got a "time to exit" work item out of the queue.
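As a rough sketch of that more conventional shape (STOP is an arbitrary sentinel, not part of the original code), the worker blocks on get() and the main thread pushes STOP when it wants the worker to finish:

STOP = object()

def worker(work_queue):
    while True:
        item = work_queue.get()        # blocks until an item is available
        try:
            if item is STOP:
                break                  # explicit "time to exit" work item
            print(work_queue.qsize())
            # do actual work on item
        finally:
            work_queue.task_done()     # one task_done() per get(), so join() can return

# in main(), after putting the real items:
# work_queue.put(STOP)
# work_queue.join()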

Should I bother locking the queue when I put to or get from it?

I've been going through the tutorials about multithreading and queue in Python 3. As the official documentation says, "The Queue class in this module implements all the required locking semantics". But in another tutorial, I've seen an example like the following:
import queue
import threading
import time

exitFlag = 0

class myThread (threading.Thread):
    def __init__(self, threadID, name, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.q = q
    def run(self):
        print ("Starting " + self.name)
        process_data(self.name, self.q)
        print ("Exiting " + self.name)

def process_data(threadName, q):
    while not exitFlag:
        queueLock.acquire()
        if not workQueue.empty():
            data = q.get()
            queueLock.release()
            print ("%s processing %s" % (threadName, data))
        else:
            queueLock.release()
        time.sleep(1)

threadList = ["Thread-1", "Thread-2", "Thread-3"]
nameList = ["One", "Two", "Three", "Four", "Five"]
queueLock = threading.Lock()
workQueue = queue.Queue(10)
threads = []
threadID = 1

# Create new threads
for tName in threadList:
    thread = myThread(threadID, tName, workQueue)
    thread.start()
    threads.append(thread)
    threadID += 1

# Fill the queue
queueLock.acquire()
for word in nameList:
    workQueue.put(word)
queueLock.release()

# Wait for queue to empty
while not workQueue.empty():
    pass

# Notify threads it's time to exit
exitFlag = 1

# Wait for all threads to complete
for t in threads:
    t.join()
print ("Exiting Main Thread")
I believe the tutorial you're following is a bad example of how to use Python's threadsafe queue. In particular, the tutorial is using the threadsafe queue in a way that unfortunately requires an extra lock. Indeed, this extra lock means that the threadsafe queue in the tutorial could be replaced with an old-fashioned non-threadsafe queue based on a simple list.
The reason that locking is needed is hinted at by the documentation for Queue.empty():
If empty() returns False it doesn't guarantee that a subsequent call to get() will not block.
The issue is that another thread could run in-between the call to empty() and the call to get(), stealing the item that empty() otherwise reported to exist. The tutorial probably uses the lock to ensure that the thread has exclusive access to the queue from the call to empty() until the call to get(). Without this lock, two threads could enter into the if-statement and both issue a call to get(), meaning that one of them could block, waiting for an item that will never be pushed.
Let me show you how to use the threadsafe queue properly. Instead of checking empty() first, just rely directly on the blocking behavior of get():
def process_data(threadName, q):
    while True:
        data = q.get()
        if exitFlag:
            break
        print("%s processing %s" % (threadName, data))
The queue's internal locking will ensure that two threads do not interfere for the duration of the call to get(), and no queueLock is needed. Note that the tutorial's original code would check exitFlag periodically every 1 second, whereas this modified version requires you to push a dummy object into the queue after setting exitFlag to True; otherwise, the flag will never be checked.
The last part of the controller code would need to be modified as follows:
# Notify threads it's time to exit
exitFlag = 1
for _ in range(len(threadList)):
    # Push a dummy element causing a single thread to wake up and stop.
    workQueue.put(None)

# Wait for all threads to exit
for t in threads:
    t.join()
There is another issue with the tutorial's use of the threadsafe queue, namely that a busy-loop is used in the main thread when waiting for the queue to empty:
# Wait for queue to empty
while not workQueue.empty():
    pass
To wait for the queue to empty it would be better to use Queue.task_done() in the threads and then call Queue.join() in the main thread. At the end of the loop body in process_data(), call q.task_done(). In the main controller code, instead of the while-loop above, call q.join().
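Roughly, combined with the sentinel items from above, that would look like this (a sketch, not a complete rewrite of the tutorial):

def process_data(threadName, q):
    while True:
        data = q.get()
        if exitFlag:
            break
        print("%s processing %s" % (threadName, data))
        q.task_done()              # one task_done() per processed item

# Controller: wait for every put() item to be marked done, no busy-loop
workQueue.join()

# Then signal the threads to exit, as shown earlier
exitFlag = 1
for _ in range(len(threadList)):
    workQueue.put(None)
for t in threads:
    t.join()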
See also the example in the bottom of Python's documentation page on the queue module.
Alternatively, you can keep the queueLock and replace the threadsafe queue with a plain old list as follows:
Replace workQueue = queue.Queue(10) with workQueue = []
Replace if not workQueue.empty() with if len(workQueue) > 0
Replace workQueue.get() with workQueue.pop(0)
Replace workQueue.put(word) with workQueue.append(word)
Note that this doesn't preserve the blocking behavior of put() present in the original version.
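A minimal sketch of that lock-plus-list variant, keeping the tutorial's structure (workQueue is now a plain list and queueLock stays):

workQueue = []                     # plain list, guarded by queueLock

def process_data(threadName, q):
    while not exitFlag:
        queueLock.acquire()
        if len(workQueue) > 0:
            data = workQueue.pop(0)    # oldest item first, like Queue.get()
            queueLock.release()
            print("%s processing %s" % (threadName, data))
        else:
            queueLock.release()
        time.sleep(1)

# Filling the "queue" also happens under the lock
queueLock.acquire()
for word in nameList:
    workQueue.append(word)
queueLock.release()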

Understanding Python Queues and Setting threads to run as a Daemon thread

Let's say I have the code below:
import Queue
import threading
import time

def basic_worker(queue, thread_name):
    while True:
        if queue.empty(): break
        print "Starting %s" % (threading.currentThread().getName()) + "\n"
        item = queue.get()
        ##do_work on item which might take 10-15 minutes to complete
        queue.task_done()
        print "Ending %s" % (threading.currentThread().getName()) + "\n"

def basic(queue):
    # http://docs.python.org/library/queue.html
    for i in range(10):
        t = threading.Thread(target=basic_worker, args=(queue, tName,))
        t.daemon = True
        t.start()
    queue.join()  # block until all tasks are done
    print 'got here' + '\n'

queue = Queue.Queue()
for item in range(4):
    queue.put(item)

basic(queue)
print "End of program"
My question is: if I set t.daemon = True, will the program exit, killing the threads that are still spending 10-15 minutes working on an item from the queue? From what I have read, the program will exit if there are any daemonic threads alive. My understanding is that the threads working on a long-running item will then also exit without completing. If I don't set t.daemon = True, my program hangs forever and doesn't exit when there are no items left in the queue.
The reason why the program hangs forever when t.daemon = False is that the following code block ...
if queue.empty(): break
... leads to a race-condition.
Imagine there is only one item left in the queue and two threads evaluate the condition above nearly simultaneously. The condition evaluates to False for both threads ... so they don't break.
The faster thread gets the last item, while the slower one hangs forever in the statement item = queue.get().
Because daemon mode is off, the program waits for all threads to finish. That never happens.
From my point of view, the code you provided (with t.daemon = True), works fine.
The following sentence may be what confuses you:
The entire Python program exits when no alive non-daemon threads are left.
... but consider: if you start all threads from the main thread with t.daemon = True, the only non-daemon thread is the main thread itself. So the program exits when the main thread is finished.
... and that does not happen until the queue is empty, because of the queue.join() statement. So your long-running computations inside the child threads will not be interrupted.
There is no need to check queue.empty() when using daemon threads together with queue.join().
This should be enough:
#!/bin/python
import Queue
import threading
import time

def basic_worker(queue, thread_name):
    print "Starting %s" % (threading.currentThread().getName()) + "\n"
    while True:
        item = queue.get()
        ##do_work on item which might take 10-15 minutes to complete
        time.sleep(5) # to simulate work
        queue.task_done()

def basic(queue):
    # http://docs.python.org/library/queue.html
    for i in range(10):
        print 'enqueuing', i
        t = threading.Thread(target=basic_worker, args=(queue, i))
        t.daemon = True
        t.start()
    queue.join()  # block until all tasks are done
    print 'got here' + '\n'

queue = Queue.Queue()
for item in range(4):
    queue.put(item)

basic(queue)
print "End of program"

Python Daemon Thread Clean Up Logic on Abrupt sys.exit()

Using Linux and Python 2.7.6, I have a script that uploads lots of files at one time. I am using multi-threading with the Queue and Threading modules.
I implemented a handler for SIGINT to stop the script if the user hits Ctrl-C. I prefer to use daemon threads so I don't have to clear the queue, which would require a lot of rewriting to give the SIGINT handler access to the Queue object, since handlers don't take extra parameters.
To make sure the daemon threads finish and clean up before sys.exit(), I am using threading.Event() and threading.clear() to make the threads wait. This code seems to work, since print threading.enumerate() showed only the main thread before the script terminated while I was debugging. Just to make sure, I was wondering whether there is anything about this clean-up implementation that I might be missing, even though it seems to be working for me:
def signal_handler(signal, frame):
    global kill_received
    kill_received = True
    msg = (
        "\n\nYou pressed Ctrl+C!"
        "\nYour logs and their locations are:"
        "\n{}\n{}\n{}\n\n".format(debug, error, info))
    logger.info(msg)
    threads = threading.Event()
    threads.clear()
    while True:
        time.sleep(3)
        threads_remaining = len(threading.enumerate())
        print threads_remaining
        if threads_remaining == 1:
            sys.exit()

def do_the_uploads(file_list, file_quantity,
                   retry_list, authenticate):
    """The uploading engine"""
    value = raw_input(
        "\nPlease enter how many concurent "
        "uploads you want at one time(example: 200)> ")
    value = int(value)
    logger.info('{} concurent uploads will be used.'.format(value))
    confirm = raw_input(
        "\nProceed to upload files? Enter [Y/y] for yes: ").upper()
    if confirm == "Y":
        kill_received = False
        sys.stdout.write("\x1b[2J\x1b[H")
        q = CustomQueue()

        def worker():
            global kill_received
            while not kill_received:
                item = q.get()
                upload_file(item, file_quantity, retry_list, authenticate, q)
                q.task_done()

        for i in range(value):
            t = Thread(target=worker)
            t.setDaemon(True)
            t.start()

        for item in file_list:
            q.put(item)

        q.join()

        print "Finished. Cleaning up processes...",
        #Allowing the threads to cleanup
        time.sleep(4)

def upload_file(file_obj, file_quantity, retry_list, authenticate, q):
    """Uploads a file. One file per it's own thread. No batch style. This way if one upload
    fails no others are effected."""
    absolute_path_filename, filename, dir_name, token, url = file_obj
    url = url + dir_name + '/' + filename
    try:
        with open(absolute_path_filename) as f:
            r = requests.put(url, data=f, headers=header_collection, timeout=20)
    except requests.exceptions.ConnectionError as e:
        pass
    if src_md5 == r.headers['etag']:
        file_quantity.deduct()
If you want to handle Ctrl+C, it is enough to handle the KeyboardInterrupt exception in the main thread. Don't use global X in a function unless you assign X = some_value in it. Using time.sleep(4) to allow the threads to clean up is a code smell; you don't need it.
I am using threading.Event() and threading.clear() to make threads wait.
This code has no effect on your threads:
# create local variable
threads = threading.Event()
# clear internal flag in it (that is returned by .is_set/.wait methods)
threads.clear()
Don't call logger.info() from a signal handler in a multithreaded program. It might deadlock your program. Only a limited set of functions can be called from a signal handler. The safe option is to set a global flag in it and exit:
def signal_handler(signal, frame):
    global kill_received
    kill_received = True
    # return (no more code)
The signal might be delayed until q.join() returns. Even if the signal were delivered immediately, q.get() blocks your child threads. They hang until the main thread exits. To fix both issues, you could use a sentinel to signal the child threads that there is no more work, and drop the signal handler completely in this case:
def worker(stopped, queue, *args):
    for item in iter(queue.get, None):  # iterate until queue.get() returns None
        if not stopped.is_set():        # a simple global flag would also work here
            upload_file(item, *args)
        else:
            break  # exit prematurely
    # do child specific clean up here

# start threads
q = Queue.Queue()
stopped = threading.Event()  # set when threads should exit prematurely
threads = set()
for _ in range(number_of_threads):
    t = Thread(target=worker, args=(stopped, q) + other_args)
    threads.add(t)
    t.daemon = True
    t.start()

# provide work
for item in file_list:
    q.put(item)
for _ in threads:
    q.put(None)  # put sentinel to signal the end

while threads:  # until there are alive child threads
    try:
        for t in threads:
            t.join(.3)  # use a timeout to get KeyboardInterrupt sooner
            if not t.is_alive():
                threads.remove(t)  # remove dead
                break
    except (KeyboardInterrupt, SystemExit):
        print("got Ctrl+C (SIGINT) or exit() is called")
        stopped.set()  # signal threads to exit gracefully
I've renamed value to number_of_threads and used an explicit threads set.
If an individual upload_file() blocks then the program won't exit on Ctrl-C.
Your case seems to be simple enough for multiprocessing.Pool interface:
from multiprocessing.pool import ThreadPool
from functools import partial

def do_uploads(number_of_threads, file_list, **kwargs_for_upload_file):
    process_file = partial(upload_file, **kwargs_for_upload_file)
    pool = ThreadPool(number_of_threads)  # number of concurrent uploads
    try:
        for _ in pool.imap_unordered(process_file, file_list):
            pass  # you could report progress here
    finally:
        pool.close()  # no more additional work
        pool.join()   # wait until current work is done
It should gracefully exit on Ctrl-C i.e., uploads that are in progress are allowed to finish but new uploads are not started.
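A call site might look roughly like this, assuming upload_file() is adapted to drop the q parameter (there is no queue in the pool-based version); the keyword names just mirror the original arguments and the thread count is illustrative:

try:
    do_uploads(8, file_list,
               file_quantity=file_quantity,
               retry_list=retry_list,
               authenticate=authenticate)
except KeyboardInterrupt:
    print("interrupted: in-progress uploads finish, no new ones start")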
