I can't seem to figure out why my queue-based producer/consumer process is blocking and executing indefinitely:
def producer(q, urls):
    for url in urls:
        thread = ThreadChild(Profile.collection, Profile.collection, url, True)
        thread.follow_links = follow
        thread.start()
        q.put(thread, True)
    log.info('Done making threads')
def consumer(q, total_urls):
    thread = q.get(True)
    thread.join(timeout=40.0)
    q.task_done()
q = Queue(2)
prod_thread = threading.Thread(target=producer, args=(q, urls))
cons_thread = threading.Thread(target=consumer, args=(q, len(urls)))
prod_thread.start()
cons_thread.start()
prod_thread.join(timeout=60.0)
cons_thread.join(timeout=60.0)
I've tried putting timeouts on both the producer and consumer threads, as well as on the child threads the producer spawns, and the process still runs on and on indefinitely.
ThreadChild is a thread that does some simple network jobs until it runs out of urls to process. The threads should not take long to execute. The var urls is just a list of urls for the thread to process. It's worth noting that the log message 'Done making threads' never prints (log is a standard Python logger bound to a StreamHandler).
Shouldn't the timeouts defined for the producer and consumer threads terminate everything after 60 seconds, regardless of what's left in the queue? Have I misunderstood the use of these methods and structured the way things are added to the queue wrong?
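For comparison, here is a stripped-down sketch of the same producer/consumer shape that does finish (a None sentinel stands in for my ThreadChild threads, and the urls list is just a placeholder). As far as I understand, join(timeout=...) only stops waiting; it never terminates the thread, so this version makes the consumer loop until it sees the sentinel:

import threading
import queue
import time

def producer(q, urls):
    for url in urls:
        q.put(url)           # blocks if the queue is full and nobody is consuming
    q.put(None)              # sentinel: tells the consumer there is no more work

def consumer(q):
    while True:
        item = q.get()
        if item is None:     # sentinel received, stop looping
            break
        time.sleep(0.1)      # stand-in for the real network work

q = queue.Queue(2)
urls = ['http://example.com/%d' % i for i in range(10)]
prod_thread = threading.Thread(target=producer, args=(q, urls))
cons_thread = threading.Thread(target=consumer, args=(q,))
prod_thread.start()
cons_thread.start()
prod_thread.join()           # no timeout needed once both sides can actually finish
cons_thread.join()
print('both threads finished')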
Related
The documentation for queue — A synchronized queue class simply states that there are fewer functions allowed with SimpleQueue.
I need very basic queue functionality for a multithreading application, would it help in any way to use SimpleQueue?
queue.SimpleQueue handles more than threadsafe concurrency. It handles reentrancy - it is safe to call queue.SimpleQueue.put in precarious situations where it might be interrupting other work in the same thread. For example, you can safely call it from __del__ methods, weakref callbacks, or signal module signal handlers.
If you need that, use queue.SimpleQueue.
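A minimal usage sketch (assuming nothing beyond put/get is needed; note that SimpleQueue has no task_done()/join(), so completion has to be signalled some other way, here with a None sentinel):

import threading
import queue

sq = queue.SimpleQueue()   # unbounded; put() never blocks

def worker():
    while True:
        item = sq.get()
        if item is None:   # sentinel: stop the worker
            break
        print('got', item)

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    sq.put(i)
sq.put(None)
t.join()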
The Python documentation specifies that SimpleQueue cannot use the task-tracking functionality (task_done, join). Those methods are used to track that every item put into the queue has been processed by another process or thread.
Example code:
import threading, queue

q = queue.Queue()

def worker():
    while True:
        item = q.get()
        print(f'Working on {item}')
        print(f'Finished {item}')
        q.task_done()

# turn on the worker thread
threading.Thread(target=worker, daemon=True).start()

# send thirty task requests to the worker
for item in range(30):
    q.put(item)
print('All task requests sent\n', end='')

# block until all tasks are done
q.join()
print('All work completed')
In the above code the main thread uses join to wait for the worker thread to finish processing every item it sent. Meanwhile, the worker thread signals "task done" every time it handles an item from the queue; a "task" is an item in the queue in this context.
Hope this helps,
for more documentation visit: https://docs.python.org/3/library/queue.html
The thing I cannot figure out is that although ThreadPoolExecutor uses daemon workers, they still run even after the main thread exits.
I can provide a minimal example in Python 3.6.4:
import concurrent.futures
import time

def fn():
    while True:
        time.sleep(5)
        print("Hello")

thread_pool = concurrent.futures.ThreadPoolExecutor()
thread_pool.submit(fn)
while True:
    time.sleep(1)
    print("Wow")
Both the main thread and the worker thread run infinite loops. So if I use KeyboardInterrupt to terminate the main thread, I expect the whole program to terminate too. But the worker thread keeps running even though it is a daemon thread.
The source code of ThreadPoolExecutor confirms that the worker threads are daemon threads:
t = threading.Thread(target=_worker,
                     args=(weakref.ref(self, weakref_cb),
                           self._work_queue))
t.daemon = True
t.start()
self._threads.add(t)
Further, if I manually create a daemon thread, it works like a charm:
from threading import Thread
import time
def fn():
    while True:
        time.sleep(5)
        print("Hello")
thread = Thread(target=fn)
thread.daemon = True
thread.start()
while True:
time.sleep(1)
print("Wow")
So I really cannot figure out this strange behavior.
Suddenly... I found why. Looking at more of the ThreadPoolExecutor source code:
# Workers are created as daemon threads. This is done to allow the interpreter
# to exit when there are still idle threads in a ThreadPoolExecutor's thread
# pool (i.e. shutdown() was not called). However, allowing workers to die with
# the interpreter has two undesirable properties:
# - The workers would still be running during interpreter shutdown,
# meaning that they would fail in unpredictable ways.
# - The workers could be killed while evaluating a work item, which could
# be bad if the callable being evaluated has external side-effects e.g.
# writing to a file.
#
# To work around this problem, an exit handler is installed which tells the
# workers to exit when their work queues are empty and then waits until the
# threads finish.
_threads_queues = weakref.WeakKeyDictionary()
_shutdown = False
def _python_exit():
    global _shutdown
    _shutdown = True
    items = list(_threads_queues.items())
    for t, q in items:
        q.put(None)
    for t, q in items:
        t.join()

atexit.register(_python_exit)
There is an exit handler which joins all unfinished workers...
Here's a way to avoid this problem. Bad design can be beaten by another bad design. Write daemon=True only if you really know that the worker won't damage any objects or files.
In my case, I created a ThreadPoolExecutor with a single worker, and after a single submit I just deleted the newly created thread from the module's _threads_queues mapping so the interpreter won't wait for this thread to stop on its own. Notice that worker threads are created lazily on submit, not when the ThreadPoolExecutor is initialized.
import concurrent.futures.thread
from concurrent.futures import ThreadPoolExecutor
...
executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(lambda: self._exec_file(args))
del concurrent.futures.thread._threads_queues[list(executor._threads)[0]]
It works in Python 3.8 but may not work in 3.9+ since this code is accessing private variables.
See the working piece of code on github
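An alternative sketch that avoids touching private attributes is to make the submitted callable cooperative, so the exit handler's join() returns quickly. Here stop_event and run_job are hypothetical names, and the exact interaction with the pool's exit hook varies between Python versions, so treat this as a sketch rather than the library's intended shutdown API:

import threading
import time
from concurrent.futures import ThreadPoolExecutor

stop_event = threading.Event()   # hypothetical shutdown flag

def run_job():
    # hypothetical worker: do bounded chunks of work, checking the flag in between
    while not stop_event.is_set():
        stop_event.wait(1)       # stand-in for one bounded piece of work
    print("worker exiting cleanly")

executor = ThreadPoolExecutor(max_workers=1)
executor.submit(run_job)
try:
    while True:
        time.sleep(1)
        print("Wow")
except KeyboardInterrupt:
    stop_event.set()             # ask the worker to finish
    executor.shutdown()          # now there is nothing left for the interpreter to wait on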
I'm doing some threading experiments, and noticed that my code works even without the q.task_done() statement.
import Queue, threading
import urllib2

queue = Queue.Queue()

def get_url(url):
    queue.put({url: len(urllib2.urlopen(url).read())})

def read_from_queue():
    m = queue.get()
    print m.items()
    queue.task_done()  # <-- this can be removed and still works

def use_threads():
    # urls is a list of URL strings defined elsewhere
    threads = []
    for u in urls:
        t = threading.Thread(target=get_url, args=(u,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

    threads = []
    for r in urls:
        t = threading.Thread(target=read_from_queue)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
This is a simple program that loops over a list of urls, reads their content, and sums it up to a length in bytes. It then puts into the queue a dict containing the url and its size.
I have tested both cases with timeit.timeit; the results are mixed, but that makes sense because most of the work happens on the network.
How does the queue know a task is done? How does t.join() return without task_done() being called on the queue?
queue.task_done only affects queue.join
queue.task_done doesn't affect thread.join
You are calling thread.join and never call queue.join, so queue.task_done doesn't matter
Zang MingJie got it right. I was joining the threads, not the queue itself.
When the threads complete, the join() returns.
That's the piece I was missing:
The whole point of task_done() is for when the threads are daemons, or never return until killed. Then you can't join() the threads, because that would deadlock.
So, when you have such a scenario, you join() the queue instead. That returns when every task in the queue has been marked done (indicating there is currently no more work).
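A minimal sketch of that pattern, with a daemon worker and queue.join() as the completion signal (nothing here beyond the standard library):

import threading
import queue

q = queue.Queue()

def worker():
    while True:                  # daemon worker never returns on its own
        item = q.get()
        print('processed', item)
        q.task_done()            # without this, q.join() below would block forever

threading.Thread(target=worker, daemon=True).start()

for i in range(5):
    q.put(i)
q.join()                         # returns once every item has been marked done
print('all work completed')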
I am trying to use the Queue in Python in a multithreaded program. I just wanted to know whether the approach I am using is correct, and whether I am doing something redundant or there is a better approach that I should use.
I am trying to get new requests from a table and schedule them using some logic to perform some operation like running a query.
So here from the main thread I spawn a separate thread for the queue.
if __name__ == '__main__':
    request_queue = SetQueue(maxsize=-1)
    worker = Thread(target=request_queue.process_queue)
    worker.setDaemon(True)
    worker.start()

    while True:
        try:
            # Connect to the database and get all the new requests to be verified
            db = Database(username_testschema, password_testschema, mother_host_testschema, mother_port_testschema, mother_sid_testschema, 0)
            # Get new requests for verification
            verify_these = db.query("SELECT JOB_ID FROM %s.table WHERE JOB_STATUS='%s' ORDER BY JOB_ID" %
                                    (username_testschema, 'INITIATED'))
            # If there are some requests to be verified, put them in the queue.
            if len(verify_these) > 0:
                for row in verify_these:
                    print "verifying : %s" % row[0]
                    verify_id = row[0]
                    request_queue.put(verify_id)
        except Exception as e:
            logger.exception(e)
        finally:
            time.sleep(10)
Now in the SetQueue class I have a process_queue function which processes, on every run, the top two requests that were added to the queue.
'''
Overriding the Queue class to use a set as all_items instead of a list, to ensure unique items are added and processed all the time.
'''
class SetQueue(Queue.Queue):
    def _init(self, maxsize):
        Queue.Queue._init(self, maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            Queue.Queue._put(self, item)
            self.all_items.add(item)

    '''
    The multithreaded queue for the verification process. Takes the top two items, verifies them in separate threads, and sleeps for 10 sec.
    This way at most two requests per run will be processed.
    '''
    def process_queue(self):
        while True:
            scheduler_obj = Scheduler()
            try:
                if self.qsize() > 0:
                    for i in range(2):
                        job_id = self.get()
                        t = Thread(target=scheduler_obj.verify_func, args=(job_id,))
                        t.start()
                    for i in range(2):
                        t.join(timeout=1)
                        self.task_done()
            except Exception as e:
                logger.exception(
                    "QUEUE EXCEPTION : Exception occurred while processing requests in the VERIFICATION QUEUE")
            finally:
                time.sleep(10)
I want to see if my understanding is correct and if there can be any issues with it.
So the main thread, running in the while True loop in the main func, connects to the database, gets new requests, and puts them in the queue. The worker thread (daemon) for the queue keeps getting new requests from the queue and forks non-daemon threads which do the processing; and since the timeout for the join is 1, the worker thread will keep taking new requests without getting blocked, and its child threads will keep processing in the background. Correct?
So if the main process exits, these won't be killed until they finish their work, but the worker daemon thread would exit.
Doubt: if the parent is a daemon and the child is non-daemon, and the parent exits, does the child exit?
I also read here: David Beazley on multiprocessing.
In the "Using a Pool as a Thread Coprocessor" section, David Beazley tries to solve a similar problem. So should I follow his steps:
1. Create a pool of processes.
2. Open a thread like I am doing for request_queue
3. In that thread
def process_verification_queue(self):
    while True:
        try:
            if self.qsize() > 0:
                job_id = self.get()
                pool.apply_async(Scheduler.verify_func, args=(job_id,))
        except Exception as e:
            logger.exception("QUEUE EXCEPTION : Exception occurred while processing requests in the VERIFICATION QUEUE")
Use a process from the pool and run the verify_func in parallel. Will this give me more performance?
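For reference, here is a self-contained sketch of what I understand that pool-as-coprocessor layout to be (verify_func here is a hypothetical stand-in for Scheduler.verify_func, and the loop at the bottom stands in for the database polling):

import multiprocessing
import threading
import queue

def verify_func(job_id):
    return job_id * 2                # hypothetical CPU-bound verification work

def queue_worker(q, pool):
    while True:
        job_id = q.get()
        if job_id is None:           # sentinel: stop the worker thread
            break
        pool.apply_async(verify_func, args=(job_id,))

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    q = queue.Queue()
    worker = threading.Thread(target=queue_worker, args=(q, pool))
    worker.start()

    for job_id in range(10):         # stand-in for rows fetched from the database
        q.put(job_id)
    q.put(None)

    worker.join()
    pool.close()
    pool.join()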
While it's possible to create a new independent thread for the queue and process that data separately the way you are doing it, I believe it is more common for each independent worker thread to post messages to a queue that they already "know" about. That queue is then processed from some other thread by pulling messages out of it.
Design Idea
The way I envision your application is three threads: the main thread and two worker threads. One worker thread would get requests from the database and put them in the queue. The other worker thread would process the data from the queue.
The main thread would just wait for the other threads to finish by using the thread function .join().
You would protect the queue that the threads have access to and make it thread-safe by using a mutex. I have seen this pattern in many other designs in other languages as well.
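A rough sketch of that three-thread layout (fetch_new_requests and verify below are hypothetical stand-ins for the database query and Scheduler.verify_func; note that queue.Queue already does its own locking, so no explicit mutex is shown):

import threading
import queue
import time

request_queue = queue.Queue()

def fetch_new_requests():
    return [1, 2, 3]                   # hypothetical stand-in for the database query

def verify(job_id):
    print('verified', job_id)          # hypothetical stand-in for the real verification

def db_poller():
    for _ in range(3):                 # in the real app this would be: while True
        for job_id in fetch_new_requests():
            request_queue.put(job_id)
        time.sleep(1)
    request_queue.put(None)            # sentinel: no more work

def queue_worker():
    while True:
        job_id = request_queue.get()
        if job_id is None:
            break
        verify(job_id)

poller = threading.Thread(target=db_poller)
worker = threading.Thread(target=queue_worker)
poller.start()
worker.start()
poller.join()                          # the main thread just waits for the workers
worker.join()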
Suggested Reading
"Effective Python" by Brett Slatkin has a great example of this very question.
Instead of inheriting from Queue, he just creates a wrapper around it in his class
called MyQueue and adds a get() and put(message) function.
He even provides the source code at his Github repo
https://github.com/bslatkin/effectivepython/blob/master/example_code/item_39.py
I'm not affiliated with the book or its author, but I highly recommend it as I learned quite a few things from it :)
I like this explanation of the advantages & differences between using threads and processes -
".....But there's a silver lining: processes can make progress on multiple threads of execution simultaneously. Since a parent process doesn't share the GIL with its child processes, all processes can execute simultaneously (subject to the constraints of the hardware and OS)...."
He has some great explanations for getting around the GIL and how to improve performance.
Read more here:
http://jeffknupp.com/blog/2013/06/30/pythons-hardest-problem-revisited/
I am a newbie in Python programming. I understand that a process can be a daemon, but a thread in daemon mode is something whose use case I couldn't understand. I would request the Python gurus to help me understand this.
Here is some basic code using threading:
import Queue
import threading

def basic_worker(queue):
    while True:
        item = queue.get()
        # do_work(item)
        print(item)
        queue.task_done()

def basic():
    # http://docs.python.org/library/queue.html
    queue = Queue.Queue()
    for i in range(3):
        t = threading.Thread(target=basic_worker, args=(queue,))
        t.daemon = True
        t.start()
    for item in range(4):
        queue.put(item)
    queue.join()  # block until all tasks are done
    print('got here')

basic()
When you run it, you get
% test.py
0
1
2
3
got here
Now comment out the line:
t.daemon = True
Run it again, and you'll see that the script prints the same result, but hangs.
The main thread ends (note that got here was printed), but the worker threads never finish.
In contrast, when t.daemon is set to True, the worker threads are terminated when the main thread ends.
Note that "daemon threads" has little to do with daemon processes.
It looks like people tend to use Queue to explain threading, but I think there is a much simpler way, using time.sleep(), to demo a daemon thread.
Create a daemon thread by setting the daemon parameter (default None):
from threading import Thread
import time
def worker():
    time.sleep(3)
    print('daemon done')
thread = Thread(target=worker, daemon=True)
thread.start()
print('main done')
Output:
main done
Process finished with exit code 0
Remove the daemon argument, like:
thread = Thread(target=worker)
Re-run and see the output:
main done
daemon done
Process finished with exit code 0
Here we can already see the difference a daemon thread makes:
The entire Python program can exit if only daemon threads are left.
isDaemon() and setDaemon() are the old getter/setter API. Using the constructor argument, as above, or the daemon property is recommended.
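For instance, both of the following set the same flag; the property form is the recommended spelling:

from threading import Thread

t = Thread(target=print, args=('hi',))
t.daemon = True         # recommended: the daemon property
# t.setDaemon(True)     # older setter API
t.start()
t.join()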
The Queue module has been renamed to queue starting with Python 3, to better reflect the fact that there are several queue classes (LIFO, FIFO, priority) in the module,
so please make that change when using this example.
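For example, if you want the Python 2 snippets above to run on Python 3 as well, a minimal compatibility shim for the import would be:

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2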
In simple words...
What is a Daemon thread?
Daemon threads can be shut down at any point in their flow, whereas non-daemon (i.e. user) threads execute completely.
Daemon threads run in the background as long as other non-daemon threads are running.
When all of the non-daemon threads are complete, daemon threads terminate automatically (no matter whether they got fully executed or not).
Daemon threads are service providers for user threads running in the same process.
Python does not wait for running daemon threads to complete, NOT EVEN their finally blocks, but it does give preference to the non-daemon threads that we create.
Daemon threads act like services in operating systems.
Python stops daemon threads once all user threads (in contrast to daemon threads) have terminated. Hence daemon threads can be used to implement, for example, monitoring functionality, since the thread is stopped by Python as soon as all user threads have stopped (see the sketch further below).
In a nutshell
If you do something like this
thread = Thread(target=worker_method, daemon=True)
there is NO guarantee that worker_method will get executed completely.
Where is this behaviour useful?
Consider two threads, t1 (parent thread) and t2 (child thread), and let t2 be a daemon. Now, suppose you want to analyze the working of t1 while it is running; you can write the code to do this in t2.
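A minimal sketch of that monitoring idea (the progress dict is hypothetical shared state; the monitor is a daemon, so it is stopped automatically when the main, user thread finishes):

import threading
import time

progress = {'done': 0}       # hypothetical shared state being monitored

def monitor():
    # daemon thread: keeps reporting status until the user threads finish
    while True:
        print('progress:', progress['done'])
        time.sleep(0.5)

threading.Thread(target=monitor, daemon=True).start()

for i in range(5):           # the "real" work, done in the main (user) thread
    time.sleep(0.3)
    progress['done'] = i + 1

print('main finished; the daemon monitor is stopped automatically')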
Reference:
StackOverflow - What is a daemon thread in Java?
GeeksForGeeks - Python daemon threads
TutorialsPoint - Concurrency in Python - Threads
Official Python Documentation
I've adapted @unutbu's answer for Python 3. Make sure that you run this script from the command line and not from an interactive environment like a Jupyter notebook.
import queue
import threading

def basic_worker(q):
    while True:
        item = q.get()
        # do_work(item)
        print(item)
        q.task_done()

def basic():
    q = queue.Queue()
    for item in range(4):
        q.put(item)
    for i in range(3):
        t = threading.Thread(target=basic_worker, args=(q,))
        t.daemon = True
        t.start()
    q.join()  # block until all tasks are done
    print('got here')

basic()
So when you comment out the daemon line, you'll notice that the program does not finish; you'll have to interrupt it manually.
Making the threads daemon threads ensures that they are killed once the main program has finished.
Note: you could achieve the same thing here without daemon threads if you replaced the infinite while loop with another condition:
def basic_worker(q):
    while not q.empty():
        item = q.get()
        # do_work(item)
        print(item)
        q.task_done()