I'd like to get some feedback on an approach for receiving data from multiple threads in a concurrent.futures.ThreadPoolExecutor and iterating over the results. The scenario: a ThreadPoolExecutor appends each future's result to a buffer container, while a secondary, decoupled operation reads and withdraws items from that same buffer.
Thread Manager Workflow
                     /|-> Thread 1 > results \
ThreadPoolExecutor --|-> Thread 2 > results --> Queue [1,2,3] (end)
                     \|-> Thread 3 > results /
Now the thread results sit in a First-In-First-Out queue container, which needs to be thread-safe. Once the above process is done, the results (str|int|bool|list|dict|any) are in the container awaiting the next step: communicating the gathered results.
Communication Workflow
                                          /|-> Terminal Print
Queue [1,2,3] < Listener > Communicate --|-> Speech Engine Say
                                          \|-> Write to Log / File
The Communicate class needs to be "listening" on the Queue for new entries and processing each as it comes in at its own speed (the rate of speech of a text-to-speech module, i.e. the Producer-Consumer Problem), plus potentially any number of other outputs, so this really can't be invoked from the top down. If the Thread Manager calls the Communicate class directly, or lets each thread call it directly to invoke the Speech Engine, we will hear stuttered speech because the speech engine overrides itself with each invocation. Thus we need to decouple the Thread Manager workflow from the Communicate workflow, but have them write and read through an In/Out style buffer or Queue, hence the need for a "listener" concept.
I've found references to a structure like the following running as a daemon thread, but the polling while loop makes me cringe and consumes too much CPU, so I still need an approach that doesn't busy-wait. Here self.pipeline is a queue.Queue object:
while True:
    try:
        if not self.pipeline.empty():
            task = self.pipeline.get(timeout=1)
            if task:
                self.serve(task)
    except queue.Empty:
        continue
Again, in need of something other than a while loop for this...
As you write in the comments, this is the standard producer-consumer problem. One solution in Python is to use multithreading together with the Queue class.
The queue is thread-safe. It uses a mutex and condition variables internally, so consumers do not have to busy-wait.
Queue.get will eventually call wait on its internal condition. This blocks the calling thread, but instead of busy-waiting, which burns CPU, the thread is put into a sleep state. The OS thread scheduler takes over from there and wakes the thread up when items become available (simplified).
So you can still have while True loops in multiple consumer threads that call Queue.get on a shared queue. If items are available, the threads process them directly; if not, they go to sleep and free the CPU. The same goes for producer threads: they simply call Queue.put.
However, there is one caveat in Python: the global interpreter lock (GIL). It exists because CPython relies heavily on C extensions and allows modules that bring in C extensions, and those are not always thread-safe. The GIL means that only one thread runs Python code on one CPU at a time.
So, once an item is in the queue, only one consumer at a time will wake up and process it. Likewise, normally only one producer runs at a time.
The exception is when threads start waiting on some I/O, like reading from a socket. Because I/O is serviced by the operating system and the hardware outside the interpreter, there is always some waiting time, and during that time the threads release the GIL so other threads can do work.
Summed up, it only makes sense to have multiple consumer and producer threads if they also do some I/O work (reads/writes on a network socket or disk). This is concurrency, not parallelism. If you want to use multiple CPU cores at the same time, you need to use multiprocessing in Python instead of threads.
And it only makes sense to have more processes than cores if there is also some I/O work.
Example
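As a concrete illustration, here is a minimal sketch of blocking consumers and producers sharing one queue.Queue; the worker functions, names and delays are made up for the example:

import queue
import threading
import time

pipeline = queue.Queue()
SENTINEL = object()  # unique value that tells the consumer to stop

def producer(name, delay):
    # stand-in for a worker thread producing results
    for i in range(3):
        time.sleep(delay)
        pipeline.put("{} result {}".format(name, i))

def consumer():
    # the "listener": sleeps inside get() until an item arrives, no busy-waiting
    while True:
        item = pipeline.get()
        if item is SENTINEL:
            return
        print("communicating:", item)

listener = threading.Thread(target=consumer)
listener.start()

producers = [threading.Thread(target=producer, args=("thread-%d" % n, 0.2 * n))
             for n in (1, 2, 3)]
for p in producers:
    p.start()
for p in producers:
    p.join()

pipeline.put(SENTINEL)  # all producers finished, stop the listener
listener.join()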
I would suggest that you use multiprocessing rather than threading to ensure maximum parallelism. I am not sure whether you really need a process pool for what you are trying to do rather than 4 dedicated processes; it's a question of how "threads" 1 through 3 are getting the data they feed to the queue for processing by the 4th process. Are they implemented by a single, identical worker function to which "jobs" are submitted? If so, then a process pool of 3 identical workers is what you want. But if these are 3 separate functions with their own processing logic, then you just want to create 3 Process instances. I am working on the second assumption.
Since we are now in the realm of multiprocessing, I would suggest using a "managed" Queue instance created with the following code:
with multiprocessing.Manager() as manager:
    q = manager.Queue()
Access to such a queue is synchronized across processes. The following code is a rough idea of creating the processes and accessing the queue:
import multiprocessing
import time

class Communicate:
    def listen(self, q):
        while True:
            obj = q.get()
            if obj is None:  # our signal to terminate
                return
            # do something with the object
            print(obj)

def process1(q):
    while True:
        time.sleep(1)
        q.put(1)

def process2(q):
    while True:
        time.sleep(.5)
        q.put(2)

def process3(q):
    while True:
        time.sleep(1.5)
        q.put(3)

if __name__ == '__main__':
    communicator = Communicate()
    with multiprocessing.Manager() as manager:
        q = manager.Queue()
        # start the communicator (listener) process:
        p = multiprocessing.Process(target=communicator.listen, args=(q,))
        p.start()
        # start the 3 producer processes:
        p1 = multiprocessing.Process(target=process1, args=(q,))
        p1.daemon = True
        p1.start()
        p2 = multiprocessing.Process(target=process2, args=(q,))
        p2.daemon = True
        p2.start()
        p3 = multiprocessing.Process(target=process3, args=(q,))
        p3.daemon = True
        p3.start()
        input('Hit Enter to terminate\n')
        q.put(None)  # signal for termination
        p.join()     # wait for the listener process to complete
Related
I have the following snippet which attempts to split processing across multiple sub-processes.
def search(self):
    print("Checking queue for jobs to process")
    if self._job_queue.has_jobs_to_process():
        print("Queue threshold met, processing jobs.")
        job_sub_lists = partition_jobs(self._job_queue.get_jobs_to_process(), self._process_pool_size)
        populated_sub_lists = [sub_list for sub_list in job_sub_lists if len(sub_list) > 0]
        self._process_pool.map(process, populated_sub_lists)
        print("Job processing pool mapped")
The search function is being called by the main process in a while loop, and if the queue reaches a threshold count, the process pool is mapped onto the process function with the jobs sourced from the queue. My question is: does multiprocessing.Pool.map block the main process during execution, or does it return immediately? I don't want to end up in a scenario where has_jobs_to_process() evaluates to true and, while those jobs are being processed, it evaluates to true again for another set of jobs, so that self._process_pool.map(process, populated_sub_lists) is called a second time, because I do not know the consequences of calling map again while processes are running.
multiprocessing.Pool.map blocks the calling thread (not necessarily the MainThread!), not the whole process.
Other threads of the parent process will not be blocked. You could call pool.map from multiple threads in the parent process without breaking things (it doesn't make much sense, though). That's because Pool uses a thread-safe queue.Queue internally for its _taskqueue.
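A small, self-contained sketch of that distinction (the square function and the timings are made up): pool.map is issued from a helper thread, and the main thread keeps running while the pool works.

import multiprocessing
import threading
import time

def square(x):
    time.sleep(0.5)  # stand-in for real work
    return x * x

def run_map(pool, results):
    # pool.map blocks this helper thread until every result is ready
    results.extend(pool.map(square, range(8)))

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        results = []
        helper = threading.Thread(target=run_map, args=(pool, results))
        helper.start()
        while helper.is_alive():
            print("main thread is not blocked")  # keeps running during the map
            time.sleep(0.2)
        helper.join()
        print(results)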
From the multiprocessing docs: Pool.map blocks until the result is ready, whereas Pool.map_async does not.
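A minimal, self-contained comparison of the two calls (the process function and the data below are placeholders, not your actual jobs):

import multiprocessing
import time

def process(chunk):
    return sum(chunk)  # placeholder for the real per-chunk job processing

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        # map blocks the caller until every chunk has been processed
        blocking_results = pool.map(process, [[1, 2], [3, 4]])

        # map_async returns an AsyncResult immediately
        async_result = pool.map_async(process, [[5, 6], [7, 8]])
        while not async_result.ready():  # the caller is free to do other work here
            time.sleep(0.1)
        print(blocking_results, async_result.get())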
I'm trying to communicate between multiple threading.Thread(s) doing I/O-bound tasks and multiple multiprocessing.Process(es) doing CPU-bound tasks. Whenever a thread finds work for a process, it puts it on a multiprocessing.Queue, together with the sending end of a multiprocessing.Pipe(duplex=False). The processes then do their part and send results back to the threads via the Pipe. This procedure seems to work in roughly 70% of the cases; in the other 30% I receive an AttributeError: Can't get attribute 'DupFd' on <module 'multiprocessing.resource_sharer' from '/usr/lib/python3.5/multiprocessing/resource_sharer.py'>
To reproduce:
import multiprocessing
import threading
import time

def thread_work(work_queue, pipe):
    while True:
        work_queue.put((threading.current_thread().name, pipe[1]))
        received = pipe[0].recv()
        print("{}: {}".format(threading.current_thread().name, threading.current_thread().name == received))
        time.sleep(0.3)

def process_work(work_queue):
    while True:
        thread, pipe = work_queue.get()
        pipe.send(thread)

work_queue = multiprocessing.Queue()

for i in range(0, 3):
    receive, send = multiprocessing.Pipe(duplex=False)
    t = threading.Thread(target=thread_work, args=[work_queue, (receive, send)])
    t.daemon = True
    t.start()

for i in range(0, 2):
    p = multiprocessing.Process(target=process_work, args=[work_queue])
    p.daemon = True
    p.start()

time.sleep(5)
I had a look at the multiprocessing source code, but couldn't understand why this error occurs.
I tried using a queue.Queue, or a Pipe with duplex=True (the default), but couldn't find a pattern in the error. Does anyone have a clue how to debug this?
You are forking an already multi-threaded main process here. That is known to be problematic in general.
It is in fact problem-prone (and not just in Python). The rule is "thread after you fork, not before". Otherwise, the locks used by the thread executor will get duplicated across processes. If one of those processes dies while it has the lock, all of the other processes using that lock will deadlock. (Raymond Hettinger)
The trigger for the error you get is apparently that the duplication of the file descriptor for the pipe fails in the child process.
To resolve this issue, either create your child processes while your main process is still single-threaded, or use another start_method for creating new processes, like 'spawn' (the default on Windows) or 'forkserver', if available.
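For the reproduction above, the first option would mean only changing the creation order, so that the processes are forked before any threads exist (an untested sketch; whether it removes the error in your environment needs checking):

# fork the worker processes first, while the main process is still single-threaded ...
for i in range(0, 2):
    p = multiprocessing.Process(target=process_work, args=[work_queue])
    p.daemon = True
    p.start()

# ... and only then start the threads
for i in range(0, 3):
    receive, send = multiprocessing.Pipe(duplex=False)
    t = threading.Thread(target=thread_work, args=[work_queue, (receive, send)])
    t.daemon = True
    t.start()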
forkserver
When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
Available on Unix platforms which support passing file descriptors over Unix pipes. docs
You can specify another start_method with:
multiprocessing.set_start_method(method)
Set the method which should be used to start child processes. method can be 'fork', 'spawn' or 'forkserver'.
Note that this should be called at most once, and it should be protected inside the if __name__ == '__main__' clause of the main module. docs
For a benchmark of the specific start_methods (on Ubuntu 18.04) look here.
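A minimal sketch of selecting a different start method (here 'forkserver'; this assumes a Unix platform where it is available, and it must run before any processes, queues or pipes are created):

import multiprocessing
import os

def worker(q):
    q.put(os.getpid())

if __name__ == '__main__':
    # must be called once, before creating any Process, Queue or Pipe objects
    multiprocessing.set_start_method('forkserver')  # or 'spawn'
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    print("child pid:", q.get())
    p.join()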
I am a beginner in Python and unable to get an idea about threading. Could someone please explain threading and multithreading in Python using a simple example?
-Thanks
Here is Alex Martelli's answer about multithreading, as linked above.
He uses a simple program that tries some URLs, then returns the contents of the first one to respond.
import Queue
import threading
import urllib2

# called by each thread
def get_url(q, url):
    q.put(urllib2.urlopen(url).read())

theurls = ["http://google.com", "http://yahoo.com"]
q = Queue.Queue()

for u in theurls:
    t = threading.Thread(target=get_url, args=(q, u))
    t.daemon = True
    t.start()

s = q.get()
print s
This is a case where threading is used as a simple optimization: each subthread is waiting for a URL to resolve and respond, in order to put its contents on the queue; each thread is a daemon (won't keep the process up if main thread ends -- that's more common than not); the main thread starts all subthreads, does a get on the queue to wait until one of them has done a put, then emits the results and terminates (which takes down any subthreads that might still be running, since they're daemon threads).
Proper use of threads in Python is invariably connected to I/O operations (since CPython doesn't use multiple cores to run CPU-bound tasks anyway, the only reason for threading is not blocking the process while there's a wait for some I/O). Queues are almost invariably the best way to farm out work to threads and/or collect the work's results, by the way, and they're intrinsically threadsafe so they save you from worrying about locks, conditions, events, semaphores, and other inter-thread coordination/communication concepts.
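For reference, a rough Python 3 equivalent of the same pattern (module names updated; the URLs are just examples):

import queue
import threading
import urllib.request

# called by each thread
def get_url(q, url):
    q.put(urllib.request.urlopen(url).read())

theurls = ["http://google.com", "http://yahoo.com"]
q = queue.Queue()

for u in theurls:
    t = threading.Thread(target=get_url, args=(q, u), daemon=True)
    t.start()

s = q.get()   # blocks until the first thread puts a result
print(s[:200])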
I want to write a script which consumes data over the internet and places the data, pulled every n seconds, into a queue/list. I will then have x threads, created at the start of the script, that pick up and process the data as it is added to the queue. My questions are:
How can I create such a global variable (list/queue) in my script that is then accessible to all my threads?
In my threads, I plan to check whether the queue has data in it; if so, retrieve the data, release the lock and start processing it. Once the thread has finished the task, it goes back to the start and keeps checking the queue. If there is no data in the queue, it sleeps for a specified amount of time and then checks the queue again.
You can define a Queue and pass it to multiple threads:
from threading import Thread
from queue import Queue  # Queue.Queue on Python 2

def task(queue):
    while True:
        item = queue.get()  # blocks until an item is available
        # process item

queue = Queue()
t = Thread(target=task, args=(queue,))
t.daemon = True
t.start()
If you want your app to be truly parallel, consider using a standalone message queue (like ActiveMQ or ZeroMQ) and consuming it from scripts running in different OS processes, because of the GIL. With a standalone queue it is also easy to spread the work over the network, which helps scalability.
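As a rough sketch of the standalone-queue idea with pyzmq (the port, address and payloads are arbitrary; run the consumer script in as many separate processes as you like):

# producer.py - pushes work items to any number of consumer processes
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.bind("tcp://*:5555")
for i in range(10):
    sock.send_string("job %d" % i)

# consumer.py - each running copy pulls and processes jobs
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)
sock.connect("tcp://localhost:5555")
while True:
    job = sock.recv_string()  # blocks until a job arrives
    print("processing", job)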
I have a set of long-running processes in a typical "pub/sub" setup with queues for communication.
I would like to do two things, and I can't figure out how to accomplish both simultaneously:
Addition/removal of workers. For example, I want to be able to add extra consumers if I see that my pending queue size has grown too large.
Watchdog for my processes - I want to be notified if any of my producers or consumers crashes.
I can do (2) in isolation:
try:
while True:
for process in workers + consumers:
if not process.is_alive():
logger.critical("%-8s%s died!", process.pid, process.name)
sleep(3)
except KeyboardInterrupt:
# Python propagates CTRL+C to all workers, no need to terminate them
logger.warn('Received CTR+C, shutting down')
The above blocks, which prevents me from doing (1).
So I decided to move the code into its own process.
This doesn't work, because process.is_alive() only works for a parent checking the status of its children. In this case, the processes I want to check would be siblings instead of children.
I'm a bit stumped on how to proceed. How can my main process support changes to subprocesses while also monitoring subprocesses?
multiprocessing.Pool actually has a watchdog built-in already. It runs a thread that checks every 0.1 seconds to see if a worker has died. If it has, it starts a new one to take its place:
def _handle_workers(pool):
    thread = threading.current_thread()

    # Keep maintaining workers until the cache gets drained, unless the pool
    # is terminated.
    while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
        pool._maintain_pool()
        time.sleep(0.1)
    # send sentinel to stop workers
    pool._taskqueue.put(None)
    debug('worker handler exiting')

def _maintain_pool(self):
    """Clean up any exited workers and start replacements for them.
    """
    if self._join_exited_workers():
        self._repopulate_pool()
This is primarily used to implement the maxtasksperchild keyword argument, and is actually problematic in some cases. If a process dies while a map or apply command is running, and that process is in the middle of handling a task associated with that call, it will never finish. See this question for more information about that behavior.
That said, if you just want to know that a process has died, you can just create a thread (not a process) that monitors the pids of all the processes in the pool, and if the pids in the list ever change, you know a process has crashed:
def monitor_pids(pool):
    pids = [p.pid for p in pool._pool]
    while True:
        new_pids = [p.pid for p in pool._pool]
        if new_pids != pids:
            print("A worker died")
            pids = new_pids
        time.sleep(3)
Edit:
If you're rolling your own Pool implementation, you can just take a cue from multiprocessing.Pool and run your monitoring code in a background thread in the parent process. The checks to see whether the processes are still running are quick, so the time lost to the background thread taking the GIL should be negligible. Consider that the multiprocessing.Pool watchdog runs every 0.1 seconds! Running yours every 3 seconds shouldn't cause any problems.
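As a usage sketch, the monitor_pids function above could simply be started as a daemon thread in the parent process that owns the pool:

import threading

# run the watchdog alongside the rest of the parent process
watchdog = threading.Thread(target=monitor_pids, args=(pool,), daemon=True)
watchdog.start()
# ... continue adding/removing workers and submitting tasks as usual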