How to terminate a process using Python's multiprocessing - python

I have some code that needs to run against several other systems that may hang or have problems outside of my control. I would like to use Python's multiprocessing to spawn child processes that run independently of the main program, and then terminate them when they hang or have problems, but I am not sure of the best way to go about this.
When terminate is called it does kill the child process, but then it becomes a defunct zombie that is not released until the process object is gone. The example code below, where the loop never ends, works to kill it and allow a respawn when called again, but does not seem like a good way of going about this (i.e. multiprocessing.Process() would be better in the __init__()).
Anyone have a suggestion?
import multiprocessing
import time

class Process(object):
    def __init__(self):
        self.thing = Thing()
        self.running_flag = multiprocessing.Value("i", 1)

    def run(self):
        self.process = multiprocessing.Process(target=self.thing.worker, args=(self.running_flag,))
        self.process.start()
        print self.process.pid

    def pause_resume(self):
        self.running_flag.value = not self.running_flag.value

    def terminate(self):
        self.process.terminate()

class Thing(object):
    def __init__(self):
        self.count = 1

    def worker(self, running_flag):
        while True:
            if running_flag.value:
                self.do_work()

    def do_work(self):
        print "working {0} ...".format(self.count)
        self.count += 1
        time.sleep(1)

You might run the child processes as daemons in the background.
process.daemon = True
Any errors and hangs (or an infinite loop) in a daemon process will not affect the main process, and it will only be terminated once the main process exits.
This works for simple problems, until you run into a lot of child daemon processes that keep eating memory without any explicit control.
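A minimal sketch of that approach (the worker function here is illustrative):
import multiprocessing
import time

def unreliable_worker():
    # stand-in for work that may hang or misbehave
    while True:
        time.sleep(1)

if __name__ == '__main__':
    process = multiprocessing.Process(target=unreliable_worker)
    process.daemon = True  # child is killed automatically when the main process exits
    process.start()
    time.sleep(3)
    # no join() needed; the daemonized child dies with the main process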
A better way is to set up a Queue so that all the child processes can communicate with the parent process, letting us join them and clean up nicely. Here is some simple code that checks whether a child process is hanging (simulated with time.sleep(1000)) and sends a message to the queue for the main process to take action on:
import multiprocessing as mp
import time
import queue

running_flag = mp.Value("i", 1)

def worker(running_flag, q):
    count = 1
    while True:
        if running_flag.value:
            print(f"working {count} ...")
            count += 1
            q.put(count)
            time.sleep(1)
            if count > 3:
                # Simulate hanging with sleep
                print("hanging...")
                time.sleep(1000)

def watchdog(q):
    """
    This checks the queue for updates and sends a signal to it
    when the child process hasn't sent anything for too long.
    """
    while True:
        try:
            msg = q.get(timeout=10.0)
        except queue.Empty as e:
            print("[WATCHDOG]: Maybe WORKER is slacking")
            q.put("KILL WORKER")

def main():
    """The main process"""
    q = mp.Queue()

    workr = mp.Process(target=worker, args=(running_flag, q))
    wdog = mp.Process(target=watchdog, args=(q,))

    # run the watchdog as daemon so it terminates with the main process
    wdog.daemon = True

    workr.start()
    print("[MAIN]: starting process P1")
    wdog.start()

    # Poll the queue
    while True:
        msg = q.get()
        if msg == "KILL WORKER":
            print("[MAIN]: Terminating slacking WORKER")
            workr.terminate()
            time.sleep(0.1)
            if not workr.is_alive():
                print("[MAIN]: WORKER is a goner")
                workr.join(timeout=1.0)
                print("[MAIN]: Joined WORKER successfully!")
                q.close()
                break  # watchdog process daemon gets terminated

if __name__ == '__main__':
    main()
Without terminating the worker, an attempt to join() it from the main process would block forever, since the worker never finishes.

The way Python multiprocessing handles processes is a bit confusing.
From the multiprocessing guidelines:
Joining zombie processes
On Unix when a process finishes but has not been joined it becomes a zombie. There should never be very many because each time a new process starts (or active_children() is called) all completed processes which have not yet been joined will be joined. Also calling a finished process’s Process.is_alive will join the process. Even so it is probably good practice to explicitly join all the processes that you start.
In order to prevent a process from becoming a zombie, you need to call its join() method once you kill it.
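A minimal sketch of that pattern (the sleeping worker stands in for a hanging call):
import multiprocessing
import time

def worker():
    time.sleep(1000)  # stand-in for a call that hangs

if __name__ == '__main__':
    p = multiprocessing.Process(target=worker)
    p.start()
    time.sleep(1)
    p.terminate()      # sends SIGTERM to the child on Unix
    p.join()           # reaps the child so it does not linger as a zombie
    print(p.exitcode)  # negative of the signal number, e.g. -15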
If you want a simpler way to deal with the hanging calls in your system you can take a look at pebble.


terminate thread and child-threads

I use several threads with child threads. Now I want to stop a parent thread, or wait until a parent thread has done its work, without checking and stopping all of its child threads.
My thought is to wrap the parent thread in a process and then just terminate the process, which seems to terminate the corresponding parent thread with all its children.
import multiprocessing
from multiprocessing import Pipe

def worker(conn):
    # this is the class including the parent thread
    xi = Test_Risk_Calc('99082')
    # start working
    xi.test()
    # finished
    print('EVERYTHING IS DONE, BUT CHILD THREADS ARE STILL ALIVE')
    conn.send('EXIT SEND')
    return

def main():
    parent_conn, child_conn = Pipe()
    p = multiprocessing.Process(target=worker, args=(child_conn,))
    p.start()
    # wait for the finished parent thread
    print(parent_conn.recv())
    p.terminate()
    p.join()
    print('JOINED, PROCESS AND ALL ITS THREADS ARE TERMINATED')
    return
I am not sure if this is a proper way to solve the problem
I think it's okay the way you've done it. However, you can avoid the terminate(), which is always a bit "hard".
The simple solution would be to start the child threads with the argument daemon=True. This terminates the child threads automatically when their parent process exits; see the sketch below.
Like I said before, this might seem a bit cleaner, but in the end, it would be the same, I think.
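A minimal sketch of that variant (the child body is illustrative):
import threading
import time

def child():
    while True:
        time.sleep(1)

t = threading.Thread(target=child, daemon=True)  # dies when its process exits
t.start()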
EDIT:
Maybe consider using asyncio (async/await) for concurrent programming. You could create tasks with
task = asyncio.create_task(my_task())
and cancel these tasks later
task.cancel()
The nice thing here is that cancel() throws an exception into the task, so within the task you can do something like this:
async def my_task():
    try:
        ...  # do the actual work here
    except asyncio.CancelledError:
        ...  # you can handle the cancellation, or ignore it
In Python, threads are not very useful. For concurrent execution you are better off with asyncio, and if you have CPU-bound tasks you can use multiprocessing (maybe in the form of process pools).
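A runnable sketch of the whole pattern:
import asyncio

async def my_task():
    try:
        while True:
            print("working ...")
            await asyncio.sleep(1)
    except asyncio.CancelledError:
        print("cancelled, cleaning up")
        raise  # re-raise so the awaiting caller sees the cancellation

async def main():
    task = asyncio.create_task(my_task())
    await asyncio.sleep(3)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        print("task is gone")

asyncio.run(main())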

The workers in ThreadPoolExecutor are not really daemons

The thing I cannot figure out is that although ThreadPoolExecutor uses daemon workers, they still run even after the main thread exits.
Here is a minimal example in Python 3.6.4:
import concurrent.futures
import time

def fn():
    while True:
        time.sleep(5)
        print("Hello")

thread_pool = concurrent.futures.ThreadPoolExecutor()
thread_pool.submit(fn)
while True:
    time.sleep(1)
    print("Wow")
Both the main thread and the worker thread are infinite loops. So if I use KeyboardInterrupt to terminate the main thread, I expect the whole program to terminate too. But actually the worker thread keeps running even though it is a daemon thread.
The source code of ThreadPoolExecutor confirms that worker threads are daemon threads:
t = threading.Thread(target=_worker,
                     args=(weakref.ref(self, weakref_cb),
                           self._work_queue))
t.daemon = True
t.start()
self._threads.add(t)
Further, if I manually create a daemon thread, it works like a charm:
from threading import Thread
import time

def fn():
    while True:
        time.sleep(5)
        print("Hello")

thread = Thread(target=fn)
thread.daemon = True
thread.start()
while True:
    time.sleep(1)
    print("Wow")
So I really cannot figure out this strange behavior.
Suddenly... I found out why. Looking at more of the source code of ThreadPoolExecutor:
# Workers are created as daemon threads. This is done to allow the interpreter
# to exit when there are still idle threads in a ThreadPoolExecutor's thread
# pool (i.e. shutdown() was not called). However, allowing workers to die with
# the interpreter has two undesirable properties:
#   - The workers would still be running during interpreter shutdown,
#     meaning that they would fail in unpredictable ways.
#   - The workers could be killed while evaluating a work item, which could
#     be bad if the callable being evaluated has external side-effects e.g.
#     writing to a file.
#
# To work around this problem, an exit handler is installed which tells the
# workers to exit when their work queues are empty and then waits until the
# threads finish.

_threads_queues = weakref.WeakKeyDictionary()
_shutdown = False

def _python_exit():
    global _shutdown
    _shutdown = True
    items = list(_threads_queues.items())
    for t, q in items:
        q.put(None)
    for t, q in items:
        t.join()

atexit.register(_python_exit)
There is an exit handler which will join all unfinished workers...
Here's a way to avoid this problem. Bad design can be beaten by another bad design. People should write daemon=True only if they really know that the worker won't damage any objects or files.
In my case, I created a ThreadPoolExecutor with a single worker, and after a single submit I just deleted the newly created thread from the internal _threads_queues registry so the interpreter won't wait till this thread stops on its own. Notice that worker threads are created after submit, not after the initialization of ThreadPoolExecutor.
import concurrent.futures.thread
from concurrent.futures import ThreadPoolExecutor
...
executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(lambda: self._exec_file(args))
del concurrent.futures.thread._threads_queues[list(executor._threads)[0]]
It works in Python 3.8 but may not work in 3.9+ since this code is accessing private variables.
See the working piece of code on github
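As a hedged alternative on Python 3.9+, where the private-attribute trick may break: shutdown() gained a cancel_futures flag that drops queued (not yet started) work without waiting. Note that it still cannot stop a task that is already running:
from concurrent.futures import ThreadPoolExecutor
import time

def fn(n):
    time.sleep(5)
    print("task %d finished" % n)

executor = ThreadPoolExecutor(max_workers=1)
executor.submit(fn, 1)  # starts running immediately
executor.submit(fn, 2)  # queued behind task 1
# Python 3.9+: return immediately and cancel the queued task 2.
# Task 1 is already running and will still be joined at interpreter exit.
executor.shutdown(wait=False, cancel_futures=True)
print("shutdown returned; only task 1 will complete")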

Python dynamic multiprocessing and signalling issues

I have a Python multiprocessing setup (i.e. worker processes) with custom signal handling, which prevents the workers from cleanly using multiprocessing themselves. (See the extended problem description below.)
The Setup
The master class that spawns all worker processes looks like the following (some parts stripped to only contain the important parts).
Here, it re-binds its own signals only to print Master teardown; actually the received signals are propagated down the process tree and must be handled by the workers themselves. This is achieved by re-binding the signals after workers have been spawned.
class Midlayer(object):
    def __init__(self, nprocs=2):
        self.nprocs = nprocs
        self.procs = []

    def handle_signal(self, signum, frame):
        log.info('Master teardown')
        for p in self.procs:
            p.join()
        sys.exit()

    def start(self):
        # Start desired number of workers
        for _ in range(self.nprocs):
            p = Worker()
            self.procs.append(p)
            p.start()

        # Bind signals for master AFTER workers have been spawned and started
        signal.signal(signal.SIGINT, self.handle_signal)
        signal.signal(signal.SIGTERM, self.handle_signal)

        # Serve forever, only exit on signals
        for p in self.procs:
            p.join()
The worker class subclasses multiprocessing.Process and implements its own run() method.
In this method, it connects to a distributed message queue and polls the queue for items forever. Forever here means: until the worker receives SIGINT or SIGTERM. The worker should not quit immediately; instead, it has to finish whatever calculation it is doing and will quit afterwards (once quit_req is set to True).
class Worker(Process):
    def __init__(self):
        self.quit_req = False
        Process.__init__(self)

    def handle_signal(self, signum, frame):
        print('Stopping worker (pid: {})'.format(self.pid))
        self.quit_req = True

    def run(self):
        # Set signals for worker process
        signal.signal(signal.SIGINT, self.handle_signal)
        signal.signal(signal.SIGTERM, self.handle_signal)

        q = connect_to_some_distributed_message_queue()

        # Start consuming
        print('Starting worker (pid: {})'.format(self.pid))
        while not self.quit_req:
            message = q.poll()
            if len(message):
                try:
                    print('{} handling message "{}"'.format(
                        self.pid, message)
                    )
                    # Facade pattern: Pick the correct target function for the
                    # requested message and execute it.
                    MessageRouter.route(message)
                except Exception as e:
                    print('{} failed handling "{}": {}'.format(
                        self.pid, message, e.message)
                    )
The Problem
So far, so good for the basic setup, where (almost) everything works fine:
The master process spawns the desired number of workers
Each worker connects to the message queue
Once a message is published, one of the workers receives it
The facade pattern (using a class named MessageRouter) routes the received message to the respective function and executes it
Now for the problem: Target functions (where the message gets directed to by the MessageRouter facade) may contain very complex business logic and thus may require multiprocessing.
If, for example, the target function contains something like this:
nproc = 4

# Spawn a pool, because we have expensive calculation here
p = Pool(processes=nproc)

# Collect result proxy objects for async apply calls to 'some_expensive_calculation'
rpx = [p.apply_async(some_expensive_calculation, ()) for _ in range(nproc)]

# Collect results from all processes
res = [r.get(timeout=.5) for r in rpx]

# Print all results
print(res)
Then the processes spawned by the Pool will also redirect their signal handling for SIGINT and SIGTERM to the worker's handle_signal function (because of signal propagation to the process subtree), essentially printing Stopping worker (pid: ...) and not stopping at all. I know that this happens because I re-bound the signals for the worker before its own child processes are spawned.
This is where I'm stuck: I just cannot set the workers' signals after spawning its child processes, because I do not know whether or not it spawns some (target functions are masked and may be written by others), and because the worker stays (as designed) in its poll-loop. At the same time, I cannot expect the implementation of a target function that uses multiprocessing to re-bind its own signal handlers to (whatever) default values.
Currently, I feel like restoring the signal handlers in each loop iteration in the worker (before the message is routed to its target function) and resetting them after the function has returned is the only option, but it simply feels wrong.
Do I miss something? Do you have any advice? I'd be really happy if someone could give me a hint on how to solve the flaws of my design here!
There is no clear approach for tackling the issue in the way you want to proceed. I often find myself in situations where I have to run unknown code (represented as Python entry point functions which might get down into some C weirdness) in multiprocessing environments.
This is how I approach the problem.
The main loop
Usually the main loop is pretty simple, it fetches a task from some source (HTTP, Pipe, Rabbit Queue..) and submits it to a Pool of workers. I make sure the KeyboardInterrupt exception is correctly handled to shutdown the service.
try:
    while 1:
        task = get_next_task()
        service.process(task)
except KeyboardInterrupt:
    service.wait_for_pending_tasks()
    logging.info("Sayonara!")
The workers
The workers are managed by a pool, either multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor. If I need more advanced features, such as timeout support, I use either billiard or pebble.
Each worker will ignore SIGINT as recommended here. SIGTERM is left as default.
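A common way to do that (a hedged sketch; init_worker is an illustrative name) is a Pool initializer that sets SIGINT to be ignored in every child:
import signal
from multiprocessing import Pool

def init_worker():
    # children inherit this handler: they ignore SIGINT, so only the
    # parent sees KeyboardInterrupt and can shut down in an orderly way
    signal.signal(signal.SIGINT, signal.SIG_IGN)

if __name__ == '__main__':
    pool = Pool(processes=4, initializer=init_worker)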
The service
The service is controlled either by systemd or supervisord. In either case, I make sure that the termination request is always delivered as a SIGINT (CTRL+C).
I want to keep SIGTERM as an emergency shutdown rather than relying only on SIGKILL for that. SIGKILL is not portable and some platforms do not implement it.
"I whish it was that simple"
If things are more complex, I'd consider the use of frameworks such as Luigi or Celery.
In general, reinventing the wheel on such things is quite detrimental and gives little gratification, especially if someone else will have to look at that code.
Of course, the latter does not apply if your aim is to learn how these things are done.
I was able to do this using Python 3 and set_start_method(method) with the 'forkserver' flavour. Another way Python 3 > Python 2!
Where by "this" I mean:
Have a main process with its own signal handler which just joins the children.
Have some worker processes with a signal handler which may spawn...
further subprocesses which do not have a signal handler.
The behaviour on Ctrl-C is then:
manager process waits for workers to exit.
workers run their signal handlers (and maybe set a stop flag and continue executing to finish their job, although I didn't bother in my example; I just joined the child I knew I had) and then exit.
all children of the workers die immediately.
Of course, note that if you intend for the workers' children not to crash, you will need to install an ignore handler or something for them in your worker process run() method, or somewhere.
To mercilessly lift from the docs:
When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
Available on Unix platforms which support passing file descriptors over Unix pipes.
The idea is therefore that the "server process" inherits the default signal handling behaviour before you install your new ones, so all its children also have default handling.
Code in all its glory:
from multiprocessing import Process, set_start_method
import sys
from signal import signal, SIGINT
from time import sleep

class NormalWorker(Process):

    def run(self):
        while True:
            print('%d %s work' % (self.pid, type(self).__name__))
            sleep(1)

class SpawningWorker(Process):

    def handle_signal(self, signum, frame):
        print('%d %s handling signal %r' % (
            self.pid, type(self).__name__, signum))

    def run(self):
        signal(SIGINT, self.handle_signal)
        sub = NormalWorker()
        sub.start()
        print('%d joining %d' % (self.pid, sub.pid))
        sub.join()
        print('%d %s joined sub worker' % (self.pid, type(self).__name__))

def main():
    set_start_method('forkserver')

    processes = [SpawningWorker() for ii in range(5)]

    for pp in processes:
        pp.start()

    def sig_handler(signum, frame):
        print('main handling signal %d' % signum)
        for pp in processes:
            pp.join()
        print('main out')
        sys.exit()

    signal(SIGINT, sig_handler)

    while True:
        sleep(1.0)

if __name__ == '__main__':
    main()
Since my previous answer was Python 3 only, I thought I'd also suggest a dirtier method for fun, which should work on both Python 2 and Python 3. Not on Windows though...
multiprocessing just uses os.fork() under the covers, so patch it to reset the signal handling in the child:
import os
from signal import signal, SIGINT, SIG_DFL

def patch_fork():
    print('Patching fork')

    os_fork = os.fork

    def my_fork():
        print('Fork fork fork')
        cpid = os_fork()
        if cpid == 0:
            # in the child: restore default SIGINT handling
            signal(SIGINT, SIG_DFL)
        return cpid

    os.fork = my_fork
You can call that at the start of the run() method of your Worker processes (so that you don't affect the Manager), and so be sure that any children will get the default handling for those signals.
This might seem crazy, but if you're not too concerned about portability it might actually not be a bad idea as it's simple and probably pretty resilient over different python versions.
You can store the pid of the main process (when registering the signal handler) and use it inside the signal handler to route the execution flow:
if os.getpid() != main_pid:
    sys.exit(128 + signum)
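Fleshed out into a minimal sketch (main_pid and handle_signal are illustrative names):
import os
import signal
import sys

main_pid = os.getpid()  # captured once, before any children are forked

def handle_signal(signum, frame):
    if os.getpid() != main_pid:
        # a forked child inherited this handler: just exit
        sys.exit(128 + signum)
    # main-process teardown would go here

signal.signal(signal.SIGTERM, handle_signal)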

Running a Python multi-threaded process & interrupting a child thread with a signal

I am trying to write a Python multi-threaded script that does the following two things in different threads:
Parent: Start Child Thread, Do some simple task, Stop Child Thread
Child: Do some long running task.
Below is a simple way to do it. And it works for me:
from multiprocessing import Process
import time

def child_func():
    while not stop_thread:
        time.sleep(1)

if __name__ == '__main__':
    child_thread = Process(target=child_func)
    stop_thread = False

    child_thread.start()
    time.sleep(3)
    stop_thread = True
    child_thread.join()
But a complication arises because in actuality, instead of the while-loop in child_func(), I need to run a single long-running process that doesn't stop unless it is killed by Ctrl-C. So I cannot periodically check the value of stop_thread in there. So how can I tell my child process to end when I want it to?
I believe the answer has to do with using signals, but I haven't seen a good example of how to use them in this exact situation. Can someone please help by modifying my code above to use signals to communicate between the child and the parent, making the child terminate when the user hits Ctrl-C?
There is no need to use the signal module here unless you want to do cleanup on your child process. It is possible to stop any child process using the terminate method (which has the same effect as SIGTERM):
from multiprocessing import Process
import time

def child_func():
    time.sleep(1000)

if __name__ == '__main__':
    child_thread = Process(target=child_func)
    child_thread.start()
    time.sleep(3)
    child_thread.terminate()
    child_thread.join()
The docs are here: https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Process.terminate
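If you do want cleanup in the child, a hedged sketch with a SIGTERM handler (on_term is an illustrative name):
from multiprocessing import Process
import signal
import sys
import time

def child_func():
    def on_term(signum, frame):
        print('child cleaning up ...')
        sys.exit(0)
    signal.signal(signal.SIGTERM, on_term)  # terminate() delivers SIGTERM on Unix
    time.sleep(1000)

if __name__ == '__main__':
    p = Process(target=child_func)
    p.start()
    time.sleep(3)
    p.terminate()
    p.join()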

Simulating Cancellation Tokens in Python Threading

I just wrote a task queue in Python whose job is to limit the number of tasks that are run at one time. This is a little different than Queue.Queue because instead of limiting how many items can be in the queue, it limits how many can be taken out at one time. It still uses an unbounded Queue.Queue to do its job, but it relies on a Semaphore to limit the number of threads:
from Queue import Queue
from threading import BoundedSemaphore, Lock, Thread

class TaskQueue(object):
    """
    Queues tasks to be run in separate threads and limits the number
    of concurrently running tasks.
    """

    def __init__(self, limit):
        """Initializes a new instance of a TaskQueue."""
        self.__semaphore = BoundedSemaphore(limit)
        self.__queue = Queue()
        self.__cancelled = False
        self.__lock = Lock()

    def enqueue(self, callback):
        """Indicates that the given callback should be run."""
        self.__queue.put(callback)

    def start(self):
        """Tells the task queue to start running the queued tasks."""
        thread = Thread(target=self.__process_items)
        thread.start()

    def stop(self):
        self.__cancel()
        # prevent blocking on a semaphore.acquire
        self.__semaphore.release()
        # prevent blocking on a Queue.get
        self.__queue.put(lambda: None)

    def __cancel(self):
        print 'canceling'
        with self.__lock:
            self.__cancelled = True

    def __process_items(self):
        while True:
            # see if the queue has been stopped before blocking on acquire
            if self.__is_canceled():
                break
            self.__semaphore.acquire()
            # see if the queue has been stopped before blocking on get
            if self.__is_canceled():
                break
            callback = self.__queue.get()
            # see if the queue has been stopped before running the task
            if self.__is_canceled():
                break
            def runTask():
                try:
                    callback()
                finally:
                    self.__semaphore.release()
            thread = Thread(target=runTask)
            thread.start()
            self.__queue.task_done()

    def __is_canceled(self):
        with self.__lock:
            return self.__cancelled
The Python interpreter runs forever unless I explicitly stop the task queue. This is a lot more tricky than I thought it would be. If you look at the stop method, you'll see that I set a canceled flag, release the semaphore and put a no-op callback on the queue. The last two parts are necessary because the code could be blocking on the Semaphore or on the Queue. I basically have to force these to go through so that the loop has a chance to break out.
This code works. This class is useful when running a service that is trying to run thousands of tasks in parallel. In order to keep the machine running smoothly and to prevent the OS from screaming about too many active threads, this code will limit the number of threads living at any one time.
I have written a similar chunk of code in C# before. What made that code particular cut 'n' dry was that .NET has something called a CancellationToken that just about every threading class uses. Any time there is a blocking operation, that operation takes an optional token. If the parent task is ever canceled, any child tasks blocking with that token will be immediately canceled, as well. This seems like a much cleaner way to exit than to "fake it" by releasing semaphores or putting values in a queue.
I was wondering if there was an equivalent way of doing this in Python? I definitely want to be using threads instead of something like asynchronous events. I am wondering if there is a way to achieve the same thing using two Queue.Queues where one has a max size and the other doesn't - but I'm still not sure how to handle cancellation.
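For reference, there is no built-in equivalent, but threading.Event is the closest primitive; here is a rough sketch of a token built on it (the class and method names are made up):
import threading

class CancellationToken(object):
    """A rough .NET-style cancellation token built on threading.Event."""

    def __init__(self):
        self.__event = threading.Event()

    def cancel(self):
        self.__event.set()

    def is_cancelled(self):
        return self.__event.is_set()

    def wait(self, timeout=None):
        # blocks until cancelled or the timeout elapses;
        # returns True if the token was cancelled
        return self.__event.wait(timeout)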
I think your code can be simplified by using poisoning and Thread.join():
from Queue import Queue
from threading import Thread

poison = object()

class TaskQueue(object):

    def __init__(self, limit):
        def process_items():
            while True:
                callback = self._queue.get()
                if callback is poison:
                    break
                try:
                    callback()
                except:
                    pass
                finally:
                    self._queue.task_done()
        self._workers = [Thread(target=process_items) for _ in range(limit)]
        self._queue = Queue()

    def enqueue(self, callback):
        self._queue.put(callback)

    def start(self):
        for worker in self._workers:
            worker.start()

    def stop(self):
        for worker in self._workers:
            self._queue.put(poison)
        while self._workers:
            self._workers.pop().join()
Untested.
I removed the comments, for brevity.
Also, in this version process_items() is truly private.
BTW: The whole point of the Queue module is to free you from the dreaded locking and event stuff.
You seem to be creating a new thread for each task from the queue. This is wasteful in itself, and also leads you to the problem of how to limit the number of threads.
Instead, a common approach is to create a fixed number of worker threads and let them freely pull tasks from the queue. To cancel the queue, you can clear it and let the workers stay alive in anticipation of future work.
I took Janne Karila's advice and created a thread pool. This eliminated the need for a semaphore. The problem is that if you ever expect the queue to go away, you have to stop the worker threads from running (just a variation of what I did before). The new code is fairly similar:
from Queue import Queue
from threading import Event, Lock, Thread

class TaskQueue(object):
    """
    Queues tasks to be run in separate threads and limits the number
    of concurrently running tasks.
    """

    def __init__(self, limit):
        """Initializes a new instance of a TaskQueue."""
        self.__workers = []
        for _ in range(limit):
            worker = Thread(target=self.__process_items)
            self.__workers.append(worker)
        self.__queue = Queue()
        self.__cancelled = False
        self.__lock = Lock()
        self.__event = Event()

    def enqueue(self, callback):
        """Indicates that the given callback should be run."""
        self.__queue.put(callback)

    def start(self):
        """Tells the task queue to start running the queued tasks."""
        for worker in self.__workers:
            worker.start()

    def stop(self):
        """
        Stops the queue from processing any more tasks. Any actively running
        tasks will run to completion.
        """
        self.__cancel()
        # prevent blocking on a Queue.get; wait for one worker at a time
        for _ in range(len(self.__workers)):
            self.__queue.put(lambda: None)
            self.__event.wait()
            self.__event.clear()

    def __cancel(self):
        with self.__lock:
            self.__queue.queue.clear()
            self.__cancelled = True

    def __process_items(self):
        while True:
            callback = self.__queue.get()
            # see if the queue has been stopped before running the task
            if self.__is_canceled():
                break
            try:
                callback()
            except:
                pass
            finally:
                self.__queue.task_done()
        self.__event.set()

    def __is_canceled(self):
        with self.__lock:
            return self.__cancelled
If you look carefully, I had to do some accounting to kill off the workers. I basically wait on an Event as many times as there are workers. I clear the underlying queue to prevent the workers from being cancelled in any other way. I also wait after pumping each bogus value into the queue, so only one worker can cancel out at a time.
I've run some tests on this and it appears to be working. It would still be nice to eliminate the need for the bogus values.
