I have a set of long-running processes in a typical "pub/sub" setup with queues for communication.
I would like to do two things, and I can't figure out how to accomplish both simultaneously:
1. Addition/removal of workers. For example, I want to be able to add extra consumers if I see that my pending queue size has grown too large.
2. Watchdog for my processes - I want to be notified if any of my producers or consumers crashes.
I can do (2) in isolation:
try:
    while True:
        for process in workers + consumers:
            if not process.is_alive():
                logger.critical("%-8s%s died!", process.pid, process.name)
        sleep(3)
except KeyboardInterrupt:
    # Python propagates CTRL+C to all workers, no need to terminate them
    logger.warning('Received CTRL+C, shutting down')
The above blocks, which prevents me from doing (1).
So I decided to move the code into its own process.
This doesn't work, because process.is_alive() only works for a parent checking the status of its children. In this case, the processes I want to check would be siblings instead of children.
I'm a bit stumped on how to proceed. How can my main process support changes to subprocesses while also monitoring subprocesses?
multiprocessing.Pool actually has a watchdog built-in already. It runs a thread that checks every 0.1 seconds to see if a worker has died. If it has, it starts a new one to take its place:
def _handle_workers(pool):
    thread = threading.current_thread()
    # Keep maintaining workers until the cache gets drained, unless the pool
    # is terminated.
    while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
        pool._maintain_pool()
        time.sleep(0.1)
    # send sentinel to stop workers
    pool._taskqueue.put(None)
    debug('worker handler exiting')
def _maintain_pool(self):
    """Clean up any exited workers and start replacements for them.
    """
    if self._join_exited_workers():
        self._repopulate_pool()
This is primarily used to implement the maxtasksperchild keyword argument, and is actually problematic in some cases. If a process dies while a map or apply command is running, and that process is in the middle of handling a task associated with that call, the call will never finish. See this question for more information about that behavior.
That said, if you just want to know that a process has died, you can just create a thread (not a process) that monitors the pids of all the processes in the pool, and if the pids in the list ever change, you know a process has crashed:
def monitor_pids(pool):
    pids = [p.pid for p in pool._pool]
    while True:
        new_pids = [p.pid for p in pool._pool]
        if new_pids != pids:
            print("A worker died")
            pids = new_pids
        time.sleep(3)
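For example, you could run that monitor next to the pool in a daemon thread (a small usage sketch; note that pool._pool is an internal attribute, so this is fragile across Python versions):

import multiprocessing
import threading

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    # Run the pid monitor in the background so it never blocks the parent;
    # as a daemon thread it exits automatically when the main thread does.
    watchdog = threading.Thread(target=monitor_pids, args=(pool,), daemon=True)
    watchdog.start()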
Edit:
If you're rolling your own Pool implementation, you can take a cue from multiprocessing.Pool and run your monitoring code in a background thread in the parent process. The checks to see if the processes are still running are quick, so the time lost to the background thread taking the GIL should be negligible. Consider that the multiprocessing.Pool watchdog runs every 0.1 seconds! Running yours every 3 seconds shouldn't cause any problems.
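As a rough illustration of that approach - a minimal sketch, not code from any of these questions - a parent-side watchdog thread could look like this (make_worker is a hypothetical factory that creates and starts a replacement Process):

import threading
import time

def watchdog(processes, make_worker, interval=3):
    """Runs in a background thread of the parent process and replaces
    any child Process that has died."""
    while True:
        for i, proc in enumerate(processes):
            if not proc.is_alive():
                print("%s (pid %s) died, restarting" % (proc.name, proc.pid))
                processes[i] = make_worker()  # hypothetical: returns a started Process
        time.sleep(interval)

# Usage sketch:
# workers = [make_worker() for _ in range(4)]
# threading.Thread(target=watchdog, args=(workers, make_worker), daemon=True).start()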
Related
I'd like to get some feedback on an approach for receiving data from multiple threads in a concurrent.futures.ThreadPoolExecutor and iterating over the results. In this scenario, the ThreadPoolExecutor appends thread results to a buffer container, and a secondary, decoupled operation reads and withdraws from that same buffer container.
Thread Manager Workflow
                     /|-> Thread 1 > results \
ThreadPoolExecutor --|-> Thread 2 > results --> Queue [1,2,3] (end)
                     \|-> Thread 3 > results /
We now have results from the threads in a First-In-First-Out queue container, which needs to be thread-safe. Once the above process is done, the results (str|int|bool|list|dict|any) sit in the container awaiting processing by the next step: communicating the gathered results.
Communication Workflow
                                          /|-> Terminal Print
Queue [1,2,3] < Listener > Communicate --|-> Speech Engine Say
                                          \|-> Write to Log / File
The Communicate class needs to be "listening" on the Queue for new entries and processing each as it comes in at its own speed (the rate of speech using a text-to-speech module - the Producer-Consumer Problem), plus potentially any number of other outputs, so this really can't be invoked from the top down. If the Thread Manager calls the Communicate class directly, or lets each thread call it directly to invoke the Speech Engine, we will hear stuttered speech because the speech engine will override itself with each invocation. Thus, we need to decouple the Thread Manager workflow from the Communicate workflow, have them write and read through an In/Out type buffer or Queue, and hence the need for a "listener" concept.
I've found references to a structure like the following running as a daemon thread, but the while loop makes me cringe and consumes too much CPU, so I still need a non-blocking approach (here self.pipeline is a queue.Queue object):
while True:
    try:
        if not self.pipeline.empty():
            task = self.pipeline.get(timeout=1)
            if task:
                self.serve(task)
    except queue.Empty:
        continue
Again, in need of something other than a while loop for this...
As you write in the comments, it's a standard producer-consumer problem. One solution in Python is multithreading and the Queue class.
The queue is thread-safe. It uses a mutex internally, which takes care of the waiting for you.
Queue.get will eventually wait on the queue's internal lock/condition. This blocks the calling thread, but instead of busy waiting, which burns CPU, the thread is put into a sleep state. The OS thread scheduler takes over from there and wakes the thread up when items are available (simplified).
So you can still have while True loops in multiple consumer threads that call queue.get on a shared queue. If items are available, the threads process them directly; if not, they go to sleep and free the CPU. The same goes for producer threads: they simply call Queue.put.
However, there is one caveat in Python: the Global Interpreter Lock (GIL). It exists because CPython relies heavily on C extensions and allows modules that bring in C extensions, and those are not always thread-safe. The GIL means that only one thread runs Python code on one CPU at a time.
So, once an item is in the queue, only one consumer at a time wakes up and processes the result. Likewise, normally only one producer runs at a time.
The exception is when threads wait for some I/O, like reading from a socket. Because I/O notification is handled elsewhere in the system, there is always some waiting time for I/O, and during that time the thread releases the GIL so other threads can do the work.
Summed up, it only makes sense to have multiple consumer and producer threads if they also do some I/O work - reads/writes on a network socket or disk. This is called concurrency. If you want to use multiple CPU cores at the same time, you need to use multiprocessing in Python instead of threads.
And it only makes sense to have more processes than cores if there is also some I/O work.
Example
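A minimal sketch of the threading-plus-Queue pattern described above (the producer/consumer functions and the SENTINEL value are illustrative, not taken from the question):

import queue
import threading

q = queue.Queue()
SENTINEL = None  # special value that tells consumers to stop

def producer(n):
    for i in range(n):
        q.put(i)  # blocks only if the queue was created with a maxsize and is full
    q.put(SENTINEL)

def consumer():
    while True:
        item = q.get()  # sleeps (no busy waiting) until an item is available
        if item is SENTINEL:
            q.put(SENTINEL)  # pass the sentinel on so other consumers stop too
            break
        print('processed', item)

consumers = [threading.Thread(target=consumer) for _ in range(3)]
for t in consumers:
    t.start()
producer(10)
for t in consumers:
    t.join()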
I would suggest that you use multiprocessing rather than threading to ensure maximum parallelism. I am not sure whether you really need a process pool for what you are trying to do rather than 4 dedicated processes; it's a question of how "threads" 1 through 3 are getting their data for feeding to the queue to be processed by the 4th process. Are these implemented by a single, identical worker function to whom "jobs" are submitted? If so then a process pool of 3 identical workers is what you want. But if these are 3 separate functions with their own processing logic, then you just want to create 3 Process instances. I am working on the second assumption.
Since we are now in the realm of multiprocessing, I would suggest using a "managed" Queue instance created with the following code:
with multiprocessing.Manager() as manager:
    q = manager.Queue()
Access to such a queue is synchronized across processes. The following code is a rough idea of creating the processes and accessing the queue:
import multiprocessing
import time

class Communicate:
    def listen(self, q):
        while True:
            obj = q.get()
            if obj is None:  # our signal to terminate
                return
            # do something with the object
            print(obj)

def process1(q):
    while True:
        time.sleep(1)
        q.put(1)

def process2(q):
    while True:
        time.sleep(.5)
        q.put(2)

def process3(q):
    while True:
        time.sleep(1.5)
        q.put(3)

if __name__ == '__main__':
    communicator = Communicate()
    with multiprocessing.Manager() as manager:
        q = manager.Queue()
        # start the communicator process:
        p = multiprocessing.Process(target=communicator.listen, args=(q,))
        p.start()
        # start the 3 producer processes:
        p1 = multiprocessing.Process(target=process1, args=(q,))
        p1.daemon = True
        p1.start()
        p2 = multiprocessing.Process(target=process2, args=(q,))
        p2.daemon = True
        p2.start()
        p3 = multiprocessing.Process(target=process3, args=(q,))
        p3.daemon = True
        p3.start()
        input('Hit enter to terminate\n')
        q.put(None)  # signal for termination
        p.join()     # wait for the communicator process to complete
I have a Python application which runs as the main process in a kubernetes pod, and this process kicks off some child processes to long poll a list of SQS queues (1 process per queue). Occasionally, one of the processes becomes a zombie and stops processing, and hangs up all other processes too, including the parent.
if __name__ == '__main__':
    PROCESSES = []
    for queue, module in qfmapper.items():
        PROCESSES.append(Process(target=poll_for_messages, args=(queue, module)))
    for process in PROCESSES:
        process.start()
    for process in PROCESSES:
        process.join()
I've tried handling the SIGCHLD signal in the parent before it kicks off the children, intending for the parent to die if one of the children is killed. I know this leaves behind the other child processes, but since Kubernetes kills the pod if PID 1 dies, it shouldn't matter. However, this doesn't seem to work, as the parent doesn't react to the signal. I'm assuming this is because process.join() blocks the parent.
So I've tried replacing individual Process calls with a Pool:
with contextlib.closing(mp.Pool(len(qfmapper))) as pool:
    for queue, module in qfmapper.items():
        pool.apply_async(poll_for_messages, args=(queue, module))
    pool.close()
    pool.join()
This again kicks off the polling processes as expected, but a worker that is killed doesn't get replaced with the same call again. The Pool spins up another worker to maintain its size, but it doesn't start it with the same arguments that the original apply_async call used.
I also tried using map, and that does restart the process if killed, but doesn't loop through all of the queues in my list; it just does the first one in the list multiple times. I've also tried starmap, and just used the for loop to build a list of iterables, but again that doesn't recover if one of the workers is killed.
So, ultimately, this comes back to the title of this question. How do you automatically restart a process that has died / been killed? I've searched high and low and I can't seem to find any answers for what seems to me like a "normal" thing to want to do. This is all running on Python 3.7.3, but I can upgrade to 3.8 if it has any features worth using to resolve this issue.
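For what it's worth, one pattern that gives this "restart on death" behavior without Pool is a small supervisor loop in the parent that re-creates any child that has died. This is only a hedged sketch, not an answer from the original thread; poll_for_messages and qfmapper are stand-ins for the question's own objects:

import time
from multiprocessing import Process

def poll_for_messages(queue_name, module):
    # stand-in for the question's real long-polling function
    while True:
        time.sleep(1)

qfmapper = {'queue-a': 'module_a', 'queue-b': 'module_b'}  # stand-in mapping

def start_poller(queue_name, module):
    p = Process(target=poll_for_messages, args=(queue_name, module))
    p.start()
    return p

if __name__ == '__main__':
    procs = {q: start_poller(q, m) for q, m in qfmapper.items()}
    while True:  # supervisor loop in the parent
        for q, p in list(procs.items()):
            if not p.is_alive():
                p.join()  # reap the dead child so it doesn't linger as a zombie
                procs[q] = start_poller(q, qfmapper[q])
        time.sleep(5)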
I have the following snippet which attempts to split processing across multiple sub-processes.
def search(self):
    print("Checking queue for jobs to process")
    if self._job_queue.has_jobs_to_process():
        print("Queue threshold met, processing jobs.")
        job_sub_lists = partition_jobs(self._job_queue.get_jobs_to_process(), self._process_pool_size)
        populated_sub_lists = [sub_list for sub_list in job_sub_lists if len(sub_list) > 0]
        self._process_pool.map(process, populated_sub_lists)
        print("Job processing pool mapped")
The search function is being called by the main process in a while loop, and if the queue reaches a threshold count, the processing pool is mapped to the process function with the jobs sourced from the queue. My question is: does the Python multiprocessing pool block the main process during execution, or does it immediately continue execution? I don't want to encounter the scenario where has_jobs_to_process() evaluates to true again while the jobs are still being processed, and self._process_pool.map(process, populated_sub_lists) is called a second time, as I do not know the consequences of calling map again while processes are running.
multiprocessing.Pool.map blocks the calling thread (not necessarily the MainThread!), not the whole process.
Other threads of the parent process will not be blocked. You could call pool.map from multiple threads in the parent process without breaking things (it doesn't make much sense, though). That's because Pool uses a thread-safe queue.Queue internally for its _taskqueue.
From the multiprocessing docs, Pool.map blocks the calling process until the result is ready, while Pool.map_async does not.
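A quick illustration of the difference (a small sketch; square is just a placeholder function):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        results = pool.map(square, range(10))  # blocks until all results are ready
        print(results)

        async_result = pool.map_async(square, range(10))  # returns immediately
        print('map_async returned, the caller keeps running...')
        print(async_result.get())  # blocks here only when the results are needed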
This is done in Python 2.7.12.
serialHelper is a class module around Python serial, and this code does work nicely:
#!/usr/bin/env python
import threading
from time import sleep
import serialHelper

sh = serialHelper.SerialHelper()

def serialGetter():
    h = 0
    while True:
        h = h + 1
        s_resp = sh.getResponse()
        print('response ' + s_resp)
        sleep(3)

if __name__ == '__main__':
    try:
        t = threading.Thread(target=sh.serialReader)
        t.setDaemon(True)
        t.start()
        serialGetter()
        #tSR = threading.Thread(target=serialGetter)
        #tSR.setDaemon(True)
        #tSR.start()
    except Exception as e:
        print(e)
However, the attempt to run serialGetter as a thread (as in the commented-out lines) just dies.
Any reason why that function cannot run as a thread?
Quoting from the Python documentation:
The entire Python program exits when no alive non-daemon threads are left.
So if you setDaemon(True) every new thread and then exit the main thread (by falling off the end of the script), the whole program will exit immediately. This kills all of the threads. Either don't use setDaemon(True), or don't exit the main thread without first calling join() on all of the threads you want to wait for.
Stepping back for a moment, it may help to think about the intended use case of a daemon thread. In Unix, a daemon is a process that runs in the background and (typically) serves requests or performs operations, either on behalf of remote clients over the network or local processes. The same basic idea applies to daemon threads:
1. You launch the daemon thread with some kind of work queue.
2. When you need some work done on the thread, you hand it a work object.
3. When you want the result of that work, you use an event or a future to wait for it to complete.
4. After requesting some work, you always eventually wait for it to complete, or perhaps cancel it (if your worker protocol supports cancellation).
5. You don't have to clean up the daemon thread at program termination. It just quietly goes away when there are no other threads left.
The problem is step (4). If you forget about some work object, and exit the app without waiting for it to complete, the work may get interrupted. Daemon threads don't gracefully shut down, so you could leave the outside world in an inconsistent state (e.g. an incomplete database transaction, a file that never got closed, etc.). It's often better to use a regular thread, and replace step (5) with an explicit "Finish up your work and shut down" work object that the main thread hands to the worker thread before exiting. The worker thread then recognizes this object, stops waiting on the work queue, and terminates itself once it's no longer doing anything else. This is slightly more up-front work, but is much safer in the event that a work object is inadvertently abandoned.
Because of all of the above, I recommend not using daemon threads unless you have a strong reason for them.
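To make the shutdown-object idea above concrete, here is a minimal sketch (the STOP sentinel and handle function are illustrative, not part of the original answer):

import queue
import threading

def handle(item):
    print('handled', item)

work_queue = queue.Queue()
STOP = object()  # explicit "finish up your work and shut down" work object

def worker():
    while True:
        item = work_queue.get()
        if item is STOP:
            break  # stop waiting on the queue and terminate cleanly
        handle(item)

t = threading.Thread(target=worker)  # a regular (non-daemon) thread
t.start()

for job in range(5):
    work_queue.put(job)

work_queue.put(STOP)  # the main thread hands over the shutdown object before exiting
t.join()              # wait for the worker to drain the queue and finish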
I am working on a project where I have a pool of workers. I am not using the built-in multiprocessing.Pool, but have created my own process pool.
The way it works is that I have created two instances of multiprocessing.Queue - one for sending work tasks to the workers and another to receive the results back.
Each worker just sits in a permanently running loop like this:
while True:
    try:
        request = self.request_queue.get(True, 5)
    except Queue.Empty:
        continue
    else:
        result = request.callable(*request.args, **request.kwargs)
        self.results_queue.put((request, result))
There is also some error-handling code, but I have left it out for brevity. Each worker process has daemon set to True.
I wish to properly shutdown the main process and all child worker processes. My experiences so far (doing Ctrl+C):
- With no special implementations, each child process stops/crashes with a KeyboardInterrupt traceback, but the main process does not exit and has to be killed (sudo kill -9).
- If I implement a signal handler for the child processes, set to ignore SIGINTs, the main thread shows the KeyboardInterrupt traceback, but nothing else happens.
- If I implement a signal handler for the child processes and the main process, I can see that the signal handler is called in the main process, but calling sys.exit() does not seem to have any effect.
I am looking for a "best practice" way of handling this. I also read somewhere that shutting down processes that were interacting with Queues and Pipes might cause them to deadlock with other processes (due to the Semaphores and other stuff used internally).
My current approach would be the following:
- Find a way to send an internal signal to each process (using a separate command queue or similar) that will terminate its main loop.
- Implement a signal handler for the main process that sends the shutdown command. The child processes will have a signal handler that makes them ignore the signal.
Is this the right approach?
The thing you need to watch out for is the possibility that there are still messages in the queues at the time you want to shut down, so you need a way for your processes to drain their input queues cleanly. Assuming that your main process is the one that will recognize that it is time to shut down, you could do this.
Send a sentinel to each worker process. This is a special message (frequently None) that can never look like a normal message. After the sentinel, flush and close the queue to each worker process.
In your worker processes use code similar to the following pseudocode:
while True:  # Your main processing loop
    msg = inqueue.dequeue()  # A blocking wait
    if msg is None:
        break
    do_something()
outqueue.flush()
outqueue.close()
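For concreteness, here is a runnable variant of that pseudocode under the assumption that inqueue and outqueue are multiprocessing.Queue objects (multiprocessing.Queue has no flush method, so close()/join_thread() stand in for the flush-and-close step):

from multiprocessing import Process, Queue

def do_something(msg, outqueue):
    outqueue.put(('done', msg))

def worker(inqueue, outqueue):
    while True:  # your main processing loop
        msg = inqueue.get()  # a blocking wait
        if msg is None:  # the sentinel: time to shut down
            break
        do_something(msg, outqueue)
    outqueue.close()  # no more puts from this process
    outqueue.join_thread()  # wait for buffered data to be flushed to the pipe

if __name__ == '__main__':
    inq, outq = Queue(), Queue()
    p = Process(target=worker, args=(inq, outq))
    p.start()
    for i in range(3):
        inq.put(i)
    inq.put(None)  # send the sentinel after the real messages
    for _ in range(3):
        print(outq.get())
    p.join()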
If it is possible that several processes could be sending messages on the inqueue, you will need a more sophisticated approach. This sample, taken from the source code for the _monitor method of logging.handlers.QueueListener in Python 3.2 or later, shows one possibility:
"""
Monitor the queue for records, and ask the handler
to deal with them.
This method runs on a separate, internal thread.
The thread will terminate if it sees a sentinel object in the queue.
"""
q = self.queue
has_task_done = hasattr(q, 'task_done')
# self._stop is a multiprocessing.Event object that has been set by the
# main process as part of the shutdown processing, before sending
# the sentinel
while not self._stop.isSet():
try:
record = self.dequeue(True)
if record is self._sentinel:
break
self.handle(record)
if has_task_done:
q.task_done()
except queue.Empty:
pass
# There might still be records in the queue.
while True:
try:
record = self.dequeue(False)
if record is self._sentinel:
break
self.handle(record)
if has_task_done:
q.task_done()
except queue.Empty:
break