MPI signal handling

MPI signal handling - python

When using mpirun, is it possible to catch signals (for example, the SIGINT generated by ^C) in the code being run?
For example, I'm running a parallelized python code. I can except KeyboardInterrupt to catch those errors when running python blah.py by itself, but I can't when doing mpirun -np 1 python blah.py.
Does anyone have a suggestion? Even finding how to catch signals in a C or C++ compiled program would be a helpful start.
If I send a signal to the spawned Python processes, they can handle the signals properly; however, signals sent to the parent orterun process (i.e. from exceeding wall time on a cluster, or pressing control-C in a terminal) will kill everything immediately.

I think it is really implementation dependent.
In SLURM, I used sbatch --signal=USR1@30 to have SIGUSR1 (whose signal number is 30, 10 or 16, depending on the platform) sent to the program launched by srun 30 seconds before the time limit. The process received SIGUSR1 = 10.
For IBM Platform MPI, according to https://www.ibm.com/support/knowledgecenter/en/SSF4ZA_9.1.4/pmpi_guide/signal_propagation.html
SIGINT, SIGUSR1 and SIGUSR2 are passed through to the processes.
In MPICH, SIGUSR1 is used by the process manager for internal notification of abnormal failures.
ref: http://lists.mpich.org/pipermail/discuss/2014-October/003242.html
Open MPI, on the other hand, will forward SIGUSR1 and SIGUSR2 from mpiexec to the other processes.
ref: http://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php#sect14
For Intel MPI, according to https://software.intel.com/en-us/mpi-developer-reference-linux-hydra-environment-variables
the environment variables I_MPI_JOB_SIGNAL_PROPAGATION and I_MPI_JOB_TIMEOUT_SIGNAL can be set to control which signals are sent to the processes.
Another thing worth noting: many Python scripts invoke other libraries or code through Cython, and if SIGUSR1 is caught by such a sub-process, something unwanted might happen.

If you use mpirun --nw, then mpirun itself should terminate as soon as it's started the subprocesses, instead of waiting for their termination; if that's acceptable then I believe your processes would be able to catch their own signals.

The signal module supports setting signal handlers using signal.signal:
Set the handler for signal signalnum to the function handler. handler can be a callable Python object taking two arguments (see below), or one of the special values signal.SIG_IGN or signal.SIG_DFL. The previous signal handler will be returned ...
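import signal
def ignore(sig, stack):
    # Called instead of the default action (raising KeyboardInterrupt)
    print("I'm ignoring signal %d" % (sig,))
signal.signal(signal.SIGINT, ignore)
# Busy loop so the process stays alive to receive signals
while True:
    pass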
If you send a SIGINT to a Python interpreter running this script (via kill -INT <pid>), it will print a message and simply continue to run.
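One way to reuse the question's existing except KeyboardInterrupt cleanup is to install a handler that raises KeyboardInterrupt for other signals as well. The following is a minimal sketch, assuming a POSIX system and assuming the launcher actually forwards SIGUSR1 or SIGTERM to the ranks (which depends on the MPI implementation). The names raise_keyboard_interrupt and run are made up for illustration.

```python
import signal


def raise_keyboard_interrupt(signum, frame):
    # Convert the signal into the exception the script already handles.
    raise KeyboardInterrupt("received signal %d" % signum)


# Assumption: these are the signals the launcher forwards to each rank.
signal.signal(signal.SIGUSR1, raise_keyboard_interrupt)
signal.signal(signal.SIGTERM, raise_keyboard_interrupt)


def run(work):
    """Run work() and report whether it finished or was interrupted."""
    try:
        work()
        return "finished"
    except KeyboardInterrupt:
        # Cleanup (closing files, writing a checkpoint) would go here.
        return "interrupted"
```

Python runs signal handlers in the main thread, so the exception surfaces wherever the main thread happens to be executing. Keep the try block around the whole unit of work.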

Related

Python, How to handle signal in Windows

This is specifically for Windows, I don't have this issue on linux based systems.
So, I have a program that creates subprocesses when running it.
These subprocesses will terminate correctly if the program exits normally, or even with exceptions or ctrl+c event, by using try and KeyboardInterrupt and finally in if __name__ == '__main__':
However, if I kill the program in the middle (I mean killing it in PyCharm with the STOP button), those subprocesses will not terminate. I'm not exactly sure what signal this STOP button sends on Windows.
I tried signal handling with signal.signal(signal.SIGTERM, handler), but it doesn't work. I have tried SIGTERM and SIGINT (SIGKILL, CTRL_C_EVENT and CTRL_BREAK_EVENT don't work in a signal handler). None of them works. I have also read this post: How to handle the signal in python on windows machine
How can I exit gracefully in this scenario, i.e. when the STOP button in PyCharm is pressed?

python subprocess avoid signal handling by the child

Well, I have a USR1 signal handler in a script. When I send SIGUSR1 to my script from outside, my handler does its work, but the signal is also propagated to the child that I create via Popen. How can I avoid this?

The rsync manual page says that exit code 20 means:
Received SIGUSR1 or SIGINT
So if you are killing it with kill (not kill -15 which you say you sometimes use) then it would die with this exit code too.
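If the child is hit because the signal is delivered to the whole process group (an assumption; the question does not say how the signal is sent), one option on POSIX is to start the child in its own session so that it no longer shares the parent's group. A minimal sketch follows; spawn_detached is a hypothetical helper name.

```python
import subprocess


def spawn_detached(args):
    """Start a child in a new session, and therefore a new process group.

    Signals sent to the parent's process group (for example by Ctrl-C or
    by kill with a negative pid) are then not delivered to this child.
    POSIX only.
    """
    return subprocess.Popen(args, start_new_session=True)
```

This does not help if something sends the signal to the child's pid directly. In that case the child has to ignore or handle the signal itself.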

proper way to stop a daemon process

I have a Jython script that I run as a daemon. It starts up, logs into a server and then goes into a loop that checks for things to process, processes them, then sleeps for 5 seconds.
I have a cron job that checks every 5 minutes to make sure that the process is running and starts it again if not.
I have another cron job that restarts the process once a day no matter what. We do this because the daemon's connection to the server sometimes gets screwed up and there is no way to tell when this happens.
The problem I have with this "solution" is the 2nd cron job that kills the process and starts another one. It's okay if it gets killed while it is sleeping, but bad things might happen if the daemon is in the middle of processing things when it is killed.
What is the proper way to stop a daemon process... instead of just killing it?
Is there a standard practice for this in general, in Python, or in Java?
In the future I may move to pure Python instead of Jython.
Thanks

When terminating the process, you can send a SIGTERM before resorting to SIGKILL, and handle that signal in the Jython script.
For example, send a SIGTERM, which can be received and processed by your script and if nothing happens within a specified time period, you can send SIGKILL and force kill the process.
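The escalation described above can be sketched from the controlling side with the subprocess module. This is an illustration under assumptions, not the only way to do it: it assumes the daemon was started as a Popen object on a POSIX system, and stop_gracefully and grace_seconds are hypothetical names.

```python
import subprocess


def stop_gracefully(proc, grace_seconds=5.0):
    """Ask a child process to stop, and force it only if it does not comply.

    Returns the child's return code; on POSIX a negative value -N means
    the child was terminated by signal N.
    """
    proc.terminate()  # sends SIGTERM on POSIX, which the child can handle
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL, which cannot be caught or ignored
        proc.wait()
    return proc.returncode
```

A cron job that only knows the daemon's pid can follow the same pattern with os.kill(pid, signal.SIGTERM), a polling delay, and then os.kill(pid, signal.SIGKILL).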
For more information on handling the events, please see the signal module documentation.
Also, example that may be handy (uses atexit hook):
#!/usr/bin/env python
from signal import signal, SIGTERM
from sys import exit
import atexit
def cleanup():
    print("Cleanup")
if __name__ == "__main__":
    from time import sleep
    atexit.register(cleanup)
    # Turn SIGTERM into a normal exit so that the atexit hook runs
    signal(SIGTERM, lambda signum, stack_frame: exit(1))
    sleep(10)
Taken from here.

The normal Linux type way to do this would be to send a signal to your long-running process that's hanging. You can handle this with Python's built in signal library.
http://docs.python.org/library/signal.html
So, you can send a SIGHUP to your 1st app from your 2nd app, and handle it in the first based on whether you're in a state where it's OK to reboot.
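To act on the signal only when it is safe to stop, a common pattern is to have the handler set a flag and check that flag between work items. The sketch below assumes CPython on a POSIX system (signal support in Jython may differ). GracefulStop and process_all are illustrative names, not part of any library.

```python
import signal


class GracefulStop:
    """Remembers that a stop was requested instead of exiting at once."""

    def __init__(self, signals=(signal.SIGTERM, signal.SIGHUP)):
        self.requested = False
        for signum in signals:
            signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Only record the request; the main loop decides when to act on it.
        self.requested = True


def process_all(items, handle_item, stop):
    """Process items one by one, stopping cleanly between two items."""
    done = []
    for item in items:
        if stop.requested:
            break
        handle_item(item)  # never abandoned halfway through
        done.append(item)
    return done
```

With this structure, the item in progress when the signal arrives is always completed, and the loop exits before starting the next one.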

Twisted program and TERM signal

I have a simple example:
from twisted.internet import utils, reactor
def test():
    utils.getProcessOutput(executable="/bin/sleep", args=["10000"])
reactor.callWhenRunning(test)
reactor.run()
When I send the TERM signal to the program, "sleep" keeps running; when I press Ctrl-C on the keyboard, "sleep" stops. (Is Ctrl-C not equivalent to the TERM signal?) Why? How can I kill "sleep" after sending TERM to this program?

Ctrl-C sends SIGINT to the entire foreground process group. That means it gets sent to your Twisted program and to the sleep child process.
If you want to kill the sleep process whenever the Python process is going to exit, then you may want a before shutdown trigger:
def killSleep():
    # Do it, somehow
    pass
reactor.addSystemEventTrigger('before', 'shutdown', killSleep)
As your example code is written, killSleep is difficult to implement. getProcessOutput doesn't give you something that easily allows the child to be killed (for example, you don't know its pid). If you use reactor.spawnProcess and a custom ProcessProtocol, this problem is solved though - the ProcessProtocol will be connected to a process transport which has a signalProcess method which you can use to send a SIGTERM (or whatever you like) to the child process.
You could also ignore SIGINT at this point and then manually deliver it to the whole process group:
import os, signal
def killGroup():
    # Ignore SIGINT ourselves, then send it to every process in our group
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    os.kill(-os.getpgid(os.getpid()), signal.SIGINT)
reactor.addSystemEventTrigger('before', 'shutdown', killGroup)
Ignore SIGINT because the Twisted process is already shutting down and another signal won't do any good (and will probably confuse it or at least lead to spurious errors being reported). Sending a signal to -os.getpgid(os.getpid()) is how to send it to your entire process group.

How to get alert for shutdown of python process/abrupt termination?

How we can hook up a code inside a python process so that it should send an alert in case of shutdown of process/abrupt termination ?

Use Supervisor Daemon

It's not clear what exactly you mean. Shutdown/abort of the process itself? Or of a child process?
Shutdown/abort of a process itself: Have a look at Python's atexit module; there you can register a callback for when your program exits cleanly. But there is absolutely no way for you to catch all circumstances: if your program fails because of a serious issue (e.g. a segfault), your atexit handlers will never get called. You need a supervising process to catch absolutely all aborts.
Shutdown/abort of a child process: If you use e.g. the subprocess module, you can simply call poll() or wait() on Popen objects to see whether the spawned process is dead, or to wait for it to die. For a more advanced implementation, use Python's signal module to set a handler for SIGCHLD; this signal is sent to your process whenever one of its child processes terminates.
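Both child-process approaches can be sketched together. This is a minimal illustration, assuming a POSIX system (SIGCHLD does not exist on Windows). The names child_exit_seen, on_sigchld, describe_exit and run_and_report are made up for the example, and the "alert" is just a returned string.

```python
import signal
import subprocess

child_exit_seen = []


def on_sigchld(signum, frame):
    # Called in the main thread whenever some child process terminates.
    child_exit_seen.append(signum)


signal.signal(signal.SIGCHLD, on_sigchld)


def describe_exit(proc):
    """Turn the result of Popen.poll() into a human-readable status."""
    code = proc.poll()
    if code is None:
        return "still running"
    if code < 0:
        # On POSIX a negative return code means death by that signal.
        return "killed by signal %d" % -code
    return "exited with status %d" % code


def run_and_report(args):
    """Run a child to completion and describe how it ended."""
    proc = subprocess.Popen(args)
    proc.wait()
    return describe_exit(proc)
```

A real alert would replace the returned string with a log entry, an e-mail or whatever notification channel is in use.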
