Python threading with queue: how to avoid using join?

I have a scenario with two threads:
- a thread waiting for messages from a socket (embedded in a C library; the blocking call is Barra.ricevi), which then puts an element on a queue
- a thread waiting to get elements from the queue and do something with them
Sample code
import Barra
import Queue
import threading

posQu = Queue.Queue(maxsize=0)

def threadCAN():
    while True:
        canMsg = Barra.ricevi("can0")
        if canMsg[0] == 'ERR':
            print(canMsg)
        else:
            print("Enqueued message"), canMsg
            posQu.put(canMsg)
thCan = threading.Thread(target = threadCAN)
thCan.daemon = True
thCan.start()
while True:
    posMsg = posQu.get()
    print("Message from the queue"), posMsg
The result is that every time a new message comes from the socket, a new element is added to the queue, BUT the main thread that should get items from the queue is never woken up.
The output is as follow:
Enqueued message
Enqueued message
Enqueued message
Enqueued message
I expected to have:
Enqueued message
Message from the queue
Enqueued message
Message from the queue
The only way to solve this issue seems to be to add the line:
posQu.join()
at the end of the thread waiting for messages from the socket, and the line:
posQu.task_done()
at the end of the main thread.
In this case, after a new message has been received from the socket, the thread blocks, waiting for the main thread to process the enqueued item.
Unfortunately this isn't the desired behavior, since I would like a thread that is always ready to get messages from the socket, not one waiting for a job to be completed by another thread.
What am I doing wrong?
Thanks
Andrew
(Italy)

This is likely because your Barra module does not release the global interpreter lock (GIL) during the call to Barra.ricevi. You may want to check this though.
The GIL ensures that only one thread can run at any one time (limiting the usefulness of threads in a multi-processor system). The GIL switches threads every 100 "ticks" -- a tick loosely mapping to bytecode instructions. See here for more details.
In your producer thread, not much happens outside of the C-library call. This means the producer thread will get to call Barra.ricevi a great many times before the GIL switches to another thread.
Solutions to this, in order of increasing complexity, are:
Call time.sleep(0) after adding an item to the queue. This yields the thread so that another thread can run (see the sketch after this list).
Use sys.setcheckinterval() to lower the number of "ticks" executed before switching threads. This will come at the cost of making the program much more computationally expensive.
Use multiprocessing rather than threading. This includes using multiprocessing.Queue instead of Queue.Queue.
Modify Barra so that it does release the GIL when its functions are called.
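For the first option, a minimal sketch of the producer loop (reusing the Barra and posQu names from the question) might look like this:

import time

def threadCAN():
    while True:
        canMsg = Barra.ricevi("can0")
        if canMsg[0] == 'ERR':
            print(canMsg)
        else:
            posQu.put(canMsg)
            time.sleep(0)  # yield the GIL so the consumer thread gets a chance to run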
Here is an example using multiprocessing. Be aware that when using multiprocessing, your processes no longer have an implied shared state. You will need to look at the multiprocessing documentation to see how to pass information between processes.
import Barra
import multiprocessing

def threadCAN(posQu):
    while True:
        canMsg = Barra.ricevi("can0")
        if canMsg[0] == 'ERR':
            print(canMsg)
        else:
            print("Enqueued message", canMsg)
            posQu.put(canMsg)

if __name__ == "__main__":
    posQu = multiprocessing.Queue(maxsize=0)
    procCan = multiprocessing.Process(target=threadCAN, args=(posQu,))
    procCan.daemon = True
    procCan.start()
    while True:
        posMsg = posQu.get()
        print("Message from the queue", posMsg)

Related

SimpleQueue vs Queue in Python - what is the advantage of using SimpleQueue?

The documentation page "queue — A synchronized queue class" simply states that there are fewer functions allowed with SimpleQueue.
I need very basic queue functionality for a multithreading application; would it help in any way to use SimpleQueue?
queue.SimpleQueue handles more than thread-safe concurrency. It handles reentrancy: it is safe to call queue.SimpleQueue.put in precarious situations where it might be interrupting other work in the same thread. For example, you can safely call it from __del__ methods, weakref callbacks, or signal module signal handlers.
If you need that, use queue.SimpleQueue.
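As a rough sketch of the signal-handler case (the handler and queue names here are illustrative, not from the question):

import queue
import signal

events = queue.SimpleQueue()

def on_sigint(signum, frame):
    # SimpleQueue.put is reentrant, so it is safe to call from a signal
    # handler, where a queue.Queue.put could deadlock on its internal lock.
    events.put(("SIGINT", signum))

signal.signal(signal.SIGINT, on_sigint)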
The Python documentation specifies that SimpleQueue cannot use the tracking functionality (task_done, join). These can be used to verify that every item in the queue has been processed by another thread.
Example code:
import threading, queue

q = queue.Queue()

def worker():
    while True:
        item = q.get()
        print(f'Working on {item}')
        print(f'Finished {item}')
        q.task_done()

# turn-on the worker thread
threading.Thread(target=worker, daemon=True).start()

# send thirty task requests to the worker
for item in range(30):
    q.put(item)
print('All task requests sent\n', end='')

# block until all tasks are done
q.join()
print('All work completed')
In the above code the main thread uses join to wait for the other thread to finish processing every item it sent. Meanwhile, the worker thread signals "task done" every time it handles an item in the queue. A "task" is an item in the queue in this context.
Hope this helps. For more documentation visit: https://docs.python.org/3/library/queue.html

The workers in ThreadPoolExecutor are not really daemon threads

The thing I cannot figure out is that although ThreadPoolExecutor uses daemon workers, they will still run even if the main thread exits.
Here is a minimal example in Python 3.6.4:
import concurrent.futures
import time

def fn():
    while True:
        time.sleep(5)
        print("Hello")

thread_pool = concurrent.futures.ThreadPoolExecutor()
thread_pool.submit(fn)

while True:
    time.sleep(1)
    print("Wow")
Both the main thread and the worker thread are infinite loops. So if I use KeyboardInterrupt to terminate the main thread, I expect the whole program to terminate too. But actually the worker thread keeps running even though it is a daemon thread.
The source code of ThreadPoolExecutor confirms that worker threads are daemon threads:
t = threading.Thread(target=_worker,
                     args=(weakref.ref(self, weakref_cb),
                           self._work_queue))
t.daemon = True
t.start()
self._threads.add(t)
Further, if I manually create a daemon thread, it works like a charm:
from threading import Thread
import time

def fn():
    while True:
        time.sleep(5)
        print("Hello")

thread = Thread(target=fn)
thread.daemon = True
thread.start()

while True:
    time.sleep(1)
    print("Wow")
So I really cannot figure out this strange behavior.
Suddenly... I found why. Reading more of the ThreadPoolExecutor source code:
# Workers are created as daemon threads. This is done to allow the interpreter
# to exit when there are still idle threads in a ThreadPoolExecutor's thread
# pool (i.e. shutdown() was not called). However, allowing workers to die with
# the interpreter has two undesirable properties:
#   - The workers would still be running during interpreter shutdown,
#     meaning that they would fail in unpredictable ways.
#   - The workers could be killed while evaluating a work item, which could
#     be bad if the callable being evaluated has external side-effects e.g.
#     writing to a file.
#
# To work around this problem, an exit handler is installed which tells the
# workers to exit when their work queues are empty and then waits until the
# threads finish.

_threads_queues = weakref.WeakKeyDictionary()
_shutdown = False

def _python_exit():
    global _shutdown
    _shutdown = True
    items = list(_threads_queues.items())
    for t, q in items:
        q.put(None)
    for t, q in items:
        t.join()

atexit.register(_python_exit)
There is an exit handler which will join all unfinished workers...
Here's a way to avoid this problem. Bad design can be beaten by another bad design. People write daemon=True only if they really know that the worker won't damage any objects or files.
In my case, I created a ThreadPoolExecutor with a single worker, and after a single submit I deleted the newly created thread from the executor's queue dictionary, so the interpreter won't wait for this thread to stop on its own. Notice that worker threads are created on submit, not on initialization of the ThreadPoolExecutor.
import concurrent.futures.thread
from concurrent.futures import ThreadPoolExecutor
...
executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(lambda: self._exec_file(args))
del concurrent.futures.thread._threads_queues[list(executor._threads)[0]]
It works in Python 3.8 but may not work in 3.9+ since this code is accessing private variables.
See the working piece of code on github
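If all you want is fire-and-forget behavior, a less fragile alternative is to skip the pool and run the callable on a plain daemon thread, wrapping the result in a Future yourself. This is only a sketch; the submit_daemon helper below is hypothetical, not part of concurrent.futures:

import threading
from concurrent.futures import Future

def submit_daemon(fn, *args, **kwargs):
    # Hypothetical helper: behaves like executor.submit, but the
    # interpreter will not wait for this thread at shutdown.
    future = Future()
    def runner():
        try:
            future.set_result(fn(*args, **kwargs))
        except BaseException as exc:
            future.set_exception(exc)
    threading.Thread(target=runner, daemon=True).start()
    return future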

Python semaphore "hangs" in tight loops

I have been having an issue with Python Semaphores appearing to "lock" for an unbounded amount of time when there is a tight relationship between acquire/release. I do not have this issue with Lock/RLock.
Below is code distilled to the simplest case that exhibits the concerning behavior.
import threading
import time

sem = threading.Semaphore()
#sem = threading.RLock()
exit = False

def spinner():
    while not exit:
        sem.acquire()
        time.sleep(1)
        sem.release()

t = threading.Thread(target=spinner)
t.start()

print time.strftime("%H:%M:%S", time.gmtime())
for i in range(0, 10):
    sem.acquire()
    print "Accessed!"
    sem.release()
print time.strftime("%H:%M:%S", time.gmtime())

exit = True
t.join()
When I use the semaphore, this takes an unpredictable amount of time (sometimes 20 minutes!).
When I use a Lock or RLock, this completes quickly as I expect.
Am I missing something? It seems like a semaphore with the default value=1 should behave the same as a Lock.
According to the documentation I'm looking at, calling release in one thread should unblock some other, indeterminate blocked thread. However, what I think is happening is that the thread which calls release is free to keep running, and then re-acquires the semaphore if it is still within its time slice. When it hits acquire again it sees an unblocked semaphore and gets access again. Bad luck thus forces the waiting thread to keep waiting for a long time.
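If that hypothesis is right, one way to probe it (an illustrative tweak, not a proper fix) is to force a yield right after release:

def spinner():
    while not exit:
        sem.acquire()
        time.sleep(1)
        sem.release()
        time.sleep(0)  # yield, giving a blocked thread a chance to acquire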
Am I missing something? Why would Lock/RLock work any better?

Using a multithreaded queue in Python the correct way?

I am trying to use Queue in Python in a multithreaded setting. I just wanted to know whether the approach I am using is correct or not, and whether I am doing something redundant or there is a better approach that I should use.
I am trying to get new requests from a table and schedule them using some logic to perform some operation like running a query.
So here, from the main thread, I spawn a separate thread for the queue.
if __name__ == '__main__':
    request_queue = SetQueue(maxsize=-1)
    worker = Thread(target=request_queue.process_queue)
    worker.setDaemon(True)
    worker.start()

    while True:
        try:
            # Connect to the database, get all the new requests to be verified
            db = Database(username_testschema, password_testschema, mother_host_testschema,
                          mother_port_testschema, mother_sid_testschema, 0)
            # Get new requests for verification
            verify_these = db.query("SELECT JOB_ID FROM %s.table WHERE JOB_STATUS='%s' ORDER BY JOB_ID" %
                                    (username_testschema, 'INITIATED'))
            # If there are some requests to be verified, put them in the queue.
            if len(verify_these) > 0:
                for row in verify_these:
                    print "verifying : %s" % row[0]
                    verify_id = row[0]
                    request_queue.put(verify_id)
        except Exception as e:
            logger.exception(e)
        finally:
            time.sleep(10)
Now in the SetQueue class I have a process_queue function which is used for processing the top two requests added to the queue in every run.
'''
Overriding the Queue class to use a set as all_items instead of a list,
to ensure unique items are added and processed all the time.
'''
class SetQueue(Queue.Queue):
    def _init(self, maxsize):
        Queue.Queue._init(self, maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            Queue.Queue._put(self, item)
            self.all_items.add(item)

    '''
    The multithreaded queue for the verification process. Takes the top two items,
    verifies them in separate threads, and sleeps for 10 seconds.
    This way at most two requests per run will be processed.
    '''
    def process_queue(self):
        while True:
            scheduler_obj = Scheduler()
            try:
                if self.qsize() > 0:
                    for i in range(2):
                        job_id = self.get()
                        t = Thread(target=scheduler_obj.verify_func, args=(job_id,))
                        t.start()
                    for i in range(2):
                        t.join(timeout=1)
                        self.task_done()
            except Exception as e:
                logger.exception(
                    "QUEUE EXCEPTION : Exception occurred while processing requests in the VERIFICATION QUEUE")
            finally:
                time.sleep(10)
I want to see if my understanding is correct and if there can be any issues with it.
So the main thread, running in while True in the main function, connects to the database, gets new requests, and puts them in the queue. The worker thread (daemon) for the queue keeps getting new requests from the queue and forks non-daemon threads which do the processing; since the timeout for the join is 1, the worker thread will keep taking new requests without getting blocked, and its child threads will keep processing in the background. Correct?
So if the main process exits, these won't be killed until they finish their work, but the daemon worker thread would exit.
Doubt: if the parent is a daemon and the child is non-daemon, does the child exit when the parent exits?
I also read here: David Beazley multiprocessing.
In the "Using a Pool as a Thread Coprocessor" section, David Beazley is trying to solve a similar problem. So should I follow his steps:
1. Create a pool of processes.
2. Open a thread like I am doing for request_queue.
3. In that thread:
def process_verification_queue(self):
    while True:
        try:
            if self.qsize() > 0:
                job_id = self.get()
                pool.apply_async(Scheduler.verify_func, args=(job_id,))
        except Exception as e:
            logger.exception("QUEUE EXCEPTION : Exception occurred while processing requests in the VERIFICATION QUEUE")
Use a process from the pool and run the verify_func in parallel. Will this give me more performance?
While it's possible to create a new independent thread for the queue and process that data separately the way you are doing it, I believe it is more common for each independent worker thread to post messages to a queue that it already "knows" about, and for that queue to be processed from some other thread by pulling messages out of it.
Design Idea
The way I envision your application is three threads: the main thread and two worker threads. One worker thread would get requests from the database and put them in the queue. The other worker thread would process the data from the queue.
The main thread would just wait for the other threads to finish by using the thread function .join().
You would protect the queue that the threads have access to and make it thread-safe by using a mutex (Python's queue.Queue already does this locking internally). I have seen this pattern in many other designs in other languages as well.
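A minimal sketch of that three-thread layout, with hypothetical fetch_new_requests and verify helpers standing in for the database and scheduler code from the question:

import queue
import threading
import time

job_q = queue.Queue()

def producer():
    while True:
        for job_id in fetch_new_requests():  # hypothetical DB helper
            job_q.put(job_id)
        time.sleep(10)

def consumer():
    while True:
        job_id = job_q.get()
        verify(job_id)  # hypothetical verification step
        job_q.task_done()

threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # the main thread just waits for the workers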
Suggested Reading
"Effective Python" by Brett Slatkin has a great example of this very question.
Instead of inheriting from Queue, he just creates a wrapper around it in his class called MyQueue and adds a get() and put(message) function.
He even provides the source code at his Github repo
https://github.com/bslatkin/effectivepython/blob/master/example_code/item_39.py
I'm not affiliated with the book or its author, but I highly recommend it as I learned quite a few things from it :)
I like this explanation of the advantages & differences between using threads and processes -
".....But there's a silver lining: processes can make progress on multiple threads of execution simultaneously. Since a parent process doesn't share the GIL with its child processes, all processes can execute simultaneously (subject to the constraints of the hardware and OS)...."
He has some great explanations for getting around the GIL and how to improve performance.
Read more here:
http://jeffknupp.com/blog/2013/06/30/pythons-hardest-problem-revisited/

multithreading spawn new process when worker has finished

I would like to define a pool of n workers and have each execute tasks held in a RabbitMQ queue. When a task finishes (fails or succeeds), I want the worker to execute another task from the queue.
I can see in the docs how to spawn a pool of workers and have them all wait for their siblings to complete. I would like something different though: I would like to have a buffer of n tasks, where when one worker finishes it adds another task to the buffer (so no more than n tasks are in the buffer). I'm having difficulty searching for this in the docs.
For context, my non-multithreading code is this:
while True:
    message = get_frame_from_queue()  # get message from RabbitMQ
    do_task(message.body)             # body defines urls to download file
    acknowledge_complete(message)     # tell RabbitMQ the message is acknowledged
At this stage my "multithreading" implementation will look like this:
@receives('ask_for_a_job')
def get_a_task():
    # this function is executed when the `ask_for_a_job` signal is fired
    message = get_frame_from_queue()
    do_task(message)

def do_task(task_info):
    try:
        pass  # do stuff
    finally:
        # once the "worker" has finished, start another
        fire_signal('ask_for_a_job')

# start the "workers"
for i in range(5):
    fire_signal('ask_for_a_job')
I don't want to reinvent the wheel. Is there a more built-in way to achieve this?
Note that get_frame_from_queue is not thread-safe.
You should be able to have each subprocess/thread consume directly from the queue, and then within each thread, simply process from the queue exactly as you would synchronously.
from threading import Thread

def do_task(msg):
    # Do stuff here
    pass

def consume():
    while True:
        message = get_frame_from_queue()
        do_task(message.body)
        acknowledge_complete(message)

if __name__ == "__main__":
    threads = []
    for i in range(5):
        t = Thread(target=consume)
        t.start()
        threads.append(t)
This way, you'll always have N messages from the queue being processed simultaneously, without any need for signaling to occur between threads.
The only "gotcha" here is the thread-safety of the rabbitmq library you're using. Depending on how it's implemented, you may need a separate connection per thread, or possibly one connection with a channel per thread, etc.
One solution is to leverage the multiprocessing.Pool object. Use an outer loop to get N items from RabbitMQ. Feed the items to the Pool, waiting until the entire batch is done. Then loop through the batch, acknowledging each message. Lastly continue the outer loop.
source
import multiprocessing

def worker(word):
    return bool(word == 'whiskey')

messages = ['syrup', 'whiskey', 'bitters']
BATCHSIZE = 2

pool = multiprocessing.Pool(BATCHSIZE)
while messages:
    # take first few messages, one per worker
    batch, messages = messages[:BATCHSIZE], messages[BATCHSIZE:]
    print 'BATCH:',
    for res in pool.imap_unordered(worker, batch):
        print res,
    print
    # TODO: acknowledge msgs in 'batch'
output
BATCH: False True
BATCH: False
