Detect hanging python shell in OS X

I've got a program that uses a buggy library, which occasionally hangs because it implements parallelisation improperly.
I don't have the time to fix the core issue, so I'm looking for a hack to figure out when the process is hanging and not doing its job.
Are there any OS X or Python-specific APIs to do this? Is it possible to use another thread, or even the main thread, to repeatedly parse stdout so that when the last few lines haven't changed for a certain duration, the other thread is notified and can kill the misbehaving thread (and then restart it)?

Basically you are looking for a monitor process. It will run a command (or set of commands) and watch their execution looking for specific things (in your case, silence on stdout). Referencing the 2 SO questions below (and a brief look at some docs), you can quickly build a super simple monitor.
https://stackoverflow.com/questions/2804543/read-subprocess-stdout-line-by-line
https://stackoverflow.com/questions/3471461/raw-input-and-timeout
# monitor.py
import os
import subprocess
import sys
from select import select
TIMEOUT = 10
while True:
    # start a new process to monitor
    # you could also run sys.argv[1:] for a more generic monitor
    child = subprocess.Popen(['python', 'other.py', 'arg'], stdout=subprocess.PIPE)
    while True:
        rlist, _, _ = select([child.stdout], [], [], TIMEOUT)
        if rlist:
            # os.read returns whatever is available instead of blocking until EOF
            data = os.read(child.stdout.fileno(), 4096)  # do you need to save the output?
            if not data:
                # EOF: the child closed stdout, so it did not hang; we are done
                child.wait()
                sys.exit()
        else:
            # timeout occurred, did the process finish?
            if child.poll() is not None:
                # child process completed (or was killed, but didn't hang), we are done
                sys.exit()
            else:
                # otherwise, kill the child, reap it, and start a new one
                child.kill()
                child.wait()
                break

Related

Python subprocess polling not giving return code when used with Java process

I'm having a problem with subprocess poll not returning the return code when the process has finished.
I found out how to set a timeout on subprocess.Popen and used that as the basis for my code. However, I have a call that uses Java that doesn't correctly report the return code so each call "times out" even though it is actually finished. I know the process has finished because when removing the poll timeout check, the call runs without issue returning a good exit code and within the time limit.
Here is the code I am testing with.
import subprocess
import time
def execute(command):
    print('start command: {}'.format(command))
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print('wait')
    wait = 10
    while process.poll() is None and wait > 0:
        time.sleep(1)
        wait -= 1
    print('done')
    if wait == 0:
        print('terminate')
        process.terminate()
    print('communicate')
    stdout, stderr = process.communicate()
    print('rc')
    exit_code = process.returncode
    if exit_code != 0:
        print('got bad rc')
if __name__ == '__main__':
    execute(['ping','-n','15','127.0.0.1']) # correctly times out
    execute(['ping','-n','5','127.0.0.1']) # correctly runs within the time limit
    # incorrectly times out
    execute(['C:\\dev\\jdk8\\bin\\java.exe', '-jar', 'JMXQuery-0.1.8.jar', '-url', 'service:jmx:rmi:///jndi/rmi://localhost:18080/jmxrmi', '-json', '-q', 'java.lang:type=Runtime;java.lang:type=OperatingSystem'])
You can see that two examples are designed to time out and two are not to time out and they all work correctly. However, the final one (using jmxquery to get tomcat metrics) doesn't return the exit code and therefore "times out" and has to be terminated, which then causes it to return an error code of 1.
Is there something I am missing in the way subprocess poll is interacting with this Java process that is causing it to not return an exit code? Is there a way to get a timeout option to work with this?

This has the same cause as a number of existing questions, but the desire to impose a timeout requires a different answer.
The OS deliberately gives only a small amount of buffer space to each pipe. When a process writes to one that is full (because the reader has not yet consumed the previous output), it blocks. (The reason is that a producer that is faster than its consumer would otherwise be able to quickly use a great deal of memory for no gain.) Therefore, if you want to do more than one of the following with a subprocess, you have to interleave them rather than doing each in turn:
Read from standard output
Read from standard error (unless it’s merged via subprocess.STDOUT)
Wait for the process to exit, or for a timeout to elapse
Of course, the subprocess might close its streams before it exits, write useful output after you notice the timeout and before you kill it, and/or start additional processes that keep the pipe open indefinitely, so you might want to have multiple timeouts. Probably what’s most informative is the EOF on the pipe, so repeatedly use something like select to wait for (however much is left of) the timeout, issue single reads on the streams that are ready, and wait (with another timeout if you’re concerned about hangs after an early stream closure) on EOF. If the timeout occurs instead, (try to) kill the subprocess, and consider issuing non-blocking reads (or another timeout loop) to get any last available output before closing the pipes.
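The interleaving described above might look roughly like the sketch below. It is only an illustration under some assumptions: run_with_timeout is a hypothetical helper name, a single overall deadline is used instead of several timeouts, and it relies on the selectors module working on pipes, which holds on POSIX systems but not on Windows (there you would need reader threads instead).

```python
import os
import selectors
import subprocess
import sys
import time


def run_with_timeout(command, timeout):
    """Run command, draining both pipes while enforcing one overall deadline.

    Hypothetical helper; returns (returncode, stdout, stderr, timed_out).
    """
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    chunks = {process.stdout: [], process.stderr: []}
    selector = selectors.DefaultSelector()
    for stream in chunks:
        selector.register(stream, selectors.EVENT_READ)
    deadline = time.monotonic() + timeout
    timed_out = False
    # Keep going until both pipes have reached EOF or the deadline passes.
    while selector.get_map():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            timed_out = True
            break
        for key, _ in selector.select(remaining):
            # One read per ready stream, so neither pipe can fill up and block the child.
            data = os.read(key.fileobj.fileno(), 4096)
            if data:
                chunks[key.fileobj].append(data)
            else:
                selector.unregister(key.fileobj)  # EOF on this pipe
    selector.close()
    if not timed_out:
        # Both pipes are closed; give the process the rest of the budget to exit.
        try:
            process.wait(timeout=max(0.0, deadline - time.monotonic()))
        except subprocess.TimeoutExpired:
            timed_out = True
    if timed_out:
        process.kill()
        process.wait()  # reap the child so it does not linger as a zombie
    stdout = b''.join(chunks[process.stdout])
    stderr = b''.join(chunks[process.stderr])
    process.stdout.close()
    process.stderr.close()
    return process.returncode, stdout, stderr, timed_out
```

For example, run_with_timeout([sys.executable, '-c', 'print("hi")'], 10) should return promptly with timed_out set to False, while a command that sleeps past the deadline is killed and reported with timed_out set to True. The sketch does not try to collect output written after the deadline.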

Using the other answer by @DavisHerring as the basis for more research, I came across a concept that worked for my original case. Here is the code that came out of that.
import subprocess
import threading
import time
def execute(command):
    print('start command: {}'.format(command))
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    timer = threading.Timer(10, terminate_process, [process])
    timer.start()
    print('communicate')
    stdout, stderr = process.communicate()
    print('rc')
    exit_code = process.returncode
    timer.cancel()
    if exit_code != 0:
        print('got bad rc')
def terminate_process(p):
    try:
        p.terminate()
    except OSError:
        pass # ignore error
It uses the threading.Timer to make sure that the process doesn't go over the time limit and terminates the process if it does. It otherwise waits for a response back and cancels the timer once it finishes.

gdb.execute blocks all the threads in python scripts

I am scripting GDB with Python 2.7.
I am simply stepping instructions with gdb.execute("stepi"). If the debugged program is idling and waiting for user interaction, gdb.execute("stepi") doesn't return. If there is such a situation, I want to stop the debugging session without terminating gdb.
To do so, I create a thread that will kill the debugged process if the current instruction ran for more than x seconds:
from ctypes import c_ulonglong, c_bool
from os import kill
from threading import Thread
from time import sleep
import signal
# We need mutable primitives in order to update them in the thread
it = c_ulonglong(0) # Instructions counter
program_exited = c_bool(False)
t = Thread(target=check_for_idle, args=(pid,it,program_exited))
t.start()
while not program_exited.value:
    gdb.execute("si") # Step instruction
    it.value += 1
# Threaded function that will kill the loaded program if it's idling
def check_for_idle(pid, it, program_exited):
    delta_max = 0.1 # Max delay between 2 instructions, seconds
    while not program_exited.value:
        it_prev = c_ulonglong(it.value) # Previous value of instructions counter
        sleep(delta_max)
        # If previous instruction lasted for more than 'delta_max', kill debugged process
        if (it_prev.value == it.value):
            # Process pid has been retrieved before
            kill(pid, signal.SIGTERM)
            program_exited.value = True
    print("idle_process_end")
However, gdb.execute is pausing my thread... Is there another way to kill the debugged process if it is idling?

"However, gdb.execute is pausing my thread"
What is happening here is that gdb.execute does not release Python's global lock when calling into gdb. So, while the gdb command executes, other Python threads are stuck.
This is just an oversight in gdb. I've filed a bug for it.
"Is there another way to kill the debugged process if it is idling?"
There is one other technique you can try -- I am not certain it will work. Unfortunately this part of gdb is not fully fleshed out (at the present moment); so also feel free to file bug reports.
The main idea is to run gdb commands on the main thread -- but not from Python. So, try writing your stepping loop using the gdb CLI, maybe like:
(gdb) while 1
> stepi
> end
Then your thread should be able to kill the inferior. Another approach might be for your thread to inject a gdb command into the main loop using gdb.post_event.

Python avoid orphan processes

I'm using python to benchmark something. This can take a large amount of time, and I want to set a (global) timeout. I use the following script (summarized):
class TimeoutException(Exception):
    pass
def timeout_handler(signum, frame):
    raise TimeoutException()
# Halt problem after half an hour
signal.alarm(1800)
try:
    while solution is None:
        guess = guess()
        try:
            with open(solutionfname, 'wb') as solutionf:
                solverprocess = subprocess.Popen(["solver", problemfname], stdout=solutionf)
                solverprocess.wait()
        finally:
            # `solverprocess.poll() == None` instead of try didn't work either
            try:
                solverprocess.kill()
            except:
                # Solver process was already dead
                pass
except TimeoutException:
    pass
# Cancel alarm if it's still active
signal.alarm(0)
However it keeps spawning orphan processes sometimes, but I can't reliably recreate the circumstances. Does anyone know what the correct way to prevent this is?

You simply have to wait after killing the process.

The documentation for the kill() method states:
Kills the child. On Posix OSs the function sends SIGKILL to the child.
On Windows kill() is an alias for terminate().
In other words, if you aren't on Windows, you are only sending a signal to the subprocess.
This will create a zombie process because the parent process didn't read the return value of the subprocess.
The kill() and terminate() methods are just shortcuts to send_signal(SIGKILL) and send_signal(SIGTERM).
Try adding a call to wait() after the kill(). This is even shown in the example under the documentation for communicate():
proc = subprocess.Popen(...)
try:
    outs, errs = proc.communicate(timeout=15)
except TimeoutExpired:
    proc.kill()
    outs, errs = proc.communicate()
Note the call to communicate() after the kill(). (It is equivalent to calling wait() and also reading the outputs of the subprocess.)
I want to clarify one thing: it seems like you don't understand exactly what a zombie process is. A zombie process is a terminated process. The kernel keeps the process in the process table until the parent process reads its exit status. I believe all memory used by the subprocess is actually reused; the kernel only has to keep track of the exit status of such a process.
So, the zombie processes you see aren't running. They are already completely dead, and that's why they are called zombie. They are "alive" in the process table, but aren't really running at all.
Calling wait() does exactly this: wait till the subprocess ends and read the exit status. This allows the kernel to remove the subprocess from the process table.
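As a small illustration of the point above, here is a sketch of killing a child and then reaping it. The helper name kill_and_reap is made up for this example, and the negative return code assumes a POSIX system, where kill() sends SIGKILL.

```python
import signal
import subprocess
import sys


def kill_and_reap(process):
    """Kill a child and wait for it so that no zombie is left behind."""
    process.kill()         # on POSIX this only sends SIGKILL
    return process.wait()  # reads the exit status so the kernel can drop the entry


# A stand-in for a long-running solver process.
child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])
status = kill_and_reap(child)
```

On POSIX, status ends up as the negated signal number (-signal.SIGKILL), which is how subprocess reports a child that was terminated by a signal.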

On Linux, you can use python-prctl.
Define a preexec function such as:
def pre_exec():
    import signal
    import prctl
    prctl.set_pdeathsig(signal.SIGTERM)
And have your Popen call pass it.
subprocess.Popen(..., preexec_fn=pre_exec)
It's as simple as that. Now the child process will die rather than become an orphan if the parent dies.
If you don't like the external dependency of python-prctl you can also use the older prctl. Instead of
prctl.set_pdeathsig(signal.SIGTERM)
you would have
prctl.prctl(prctl.PDEATHSIG, signal.SIGTERM)

Is there a way to make os.killpg not kill the script that calls it?

I have a subprocess which I open, which calls other processes.
I use os.killpg(os.getpgid(subOut.pid), signal.SIGTERM) to kill the entire group, but this kills the python script as well. Even when I call a python script with os.killpg from a second python script, this kills the second script as well. Is there a way to make os.killpg not stop the script?
Another solution would be to individually kill every child process. However, even using
p = psutil.Process(subOut.pid)
child_pid = p.children(recursive=True)
for pid in child_pid:
    os.kill(pid.pid, signal.SIGTERM)
does not correctly give me all the pids of the children.
And you know what they say... don't kill the script that calls you...

A bit late to answer, but since Google took me here while looking for a related problem: the reason your script gets killed is that its children will, by default, inherit its process group ID. But you can tell subprocess.Popen to create a new process group for your subprocess. It's a bit tricky, though: you have to pass os.setpgrp as the preexec_fn parameter. This calls setpgrp (without any arguments) in the newly created (forked) process, before it does the exec, which sets the process group ID of the new process to its own pid (thus creating a new group). The documentation mentions that preexec_fn can deadlock in multi-threaded code. As an alternative, you can use start_new_session=True, but that creates not only a new process group but also a new session. (That means that if you close your terminal session while your script is running, the children will not be terminated. It may or may not be a problem.)
As a side note, if you are on Windows, you can simply pass subprocess.CREATE_NEW_PROCESS_GROUP in the creationflags parameter.
Here is what it looks like in detail:
subOut = subprocess.Popen(['your', 'subprocess', ...], preexec_fn=os.setpgrp)
# when it's time to kill
os.killpg(os.getpgid(subOut.pid), signal.SIGTERM)
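For the start_new_session=True alternative mentioned above, a minimal sketch could look like this. It assumes a POSIX system with sh and sleep available; start_in_new_group and kill_group are hypothetical helper names, not part of subprocess.

```python
import os
import signal
import subprocess


def start_in_new_group(command):
    """Start command as the leader of its own session and process group."""
    return subprocess.Popen(command, start_new_session=True)


def kill_group(process, sig=signal.SIGTERM):
    """Signal every process in the child's group, then reap the child."""
    try:
        os.killpg(os.getpgid(process.pid), sig)
    except ProcessLookupError:
        pass  # the child was already reaped, nothing left to signal
    return process.wait()


# The shell starts a background grandchild; both live in the new group,
# so the calling script is not affected by the signal.
sub_out = start_in_new_group(['sh', '-c', 'sleep 60 & sleep 60'])
child_pgid = os.getpgid(sub_out.pid)
own_pgid = os.getpgid(0)
status = kill_group(sub_out)
```

Because the child's group differs from the script's own group, os.killpg only reaches the subprocess and whatever it spawned in that group.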

Create a process group containing all the immediate children of the calling process as follows:
p1 = subprocess.Popen(cmd1)
os.setpgid(p1.pid, 0)  # creates a process group whose ID is p1.pid
p2 = subprocess.Popen(cmd2)
os.setpgid(p2.pid, os.getpgid(p1.pid))
pn = subprocess.Popen(cmdn)
os.setpgid(pn.pid, os.getpgid(p1.pid))
# Kill all the children and their process trees using the following command
os.killpg(os.getpgid(p1.pid), signal.SIGKILL)
This kills the whole process tree except the calling process itself.

atleta's answer above worked for me but the preexec_fn argument in the call to Popen should be setpgrp, rather than setgrp:
subOut = subprocess.Popen(['your', 'subprocess', ...], preexec_fn=os.setpgrp)
I'm posting this as an answer instead of a comment on atleta's answer because I don't have comment privileges yet.

An easy way is to make the parent process ignore the signal before sending it.
# Tell this (parent) process to ignore the signal
old_handler = signal.signal(sig, signal.SIG_IGN)
# Send the signal to our process group and
# wait for them all to exit.
os.killpg(os.getpgid(0), sig)
try:
    while True:
        os.wait()
except OSError:
    # os.wait() raises once there are no children left to wait for
    pass
# Restore the handler
signal.signal(sig, old_handler)

Python: How to determine subprocess children have all finished running

I am trying to detect when an installation program finishes executing from within a Python script. Specifically, the application is the Oracle 10gR2 Database. Currently I am using the subprocess module with Popen. Ideally, I would simply use the wait() method to wait for the installation to finish executing; however, the documented command actually spawns child processes to handle the actual installation. Here is a sample of the failing code:
import subprocess
OUI_DATABASE_10GR2_SUBPROCESS = ['sudo',
                                 '-u',
                                 'oracle',
                                 os.path.join(DATABASE_10GR2_TMP_PATH,
                                              'database',
                                              'runInstaller'),
                                 '-ignoreSysPrereqs',
                                 '-silent',
                                 '-noconfig',
                                 '-responseFile '+ORACLE_DATABASE_10GR2_SILENT_RESPONSE]
oracle_subprocess = subprocess.Popen(OUI_DATABASE_10GR2_SUBPROCESS)
oracle_subprocess.wait()
There is a similar question here: Killing a subprocess including its children from python, but the selected answer does not address the children issue, instead it instructs the user to call directly the application to wait for. I am looking for a specific solution that will wait for all children of the subprocess. What if there are an unknown number of subprocesses? I will select the answer that addresses the issue of waiting for all children subprocesses to finish.
More clarity on failure: The child processes continue executing after the wait() command since that command only waits for the top level process (in this case it is 'sudo'). Here is a simple diagram of the known child processes in this problem:
Python subprocess module -> Sudo -> runInstaller -> java -> (unknown)

Ok, here is a trick that will work only under Unix. It is similar to one of the answers to this question: Ensuring subprocesses are dead on exiting Python program. The idea is to create a new process group. You can then wait for all processes in the group to terminate.
pid = os.fork()
if pid == 0:
    os.setpgrp()
    oracle_subprocess = subprocess.Popen(OUI_DATABASE_10GR2_SUBPROCESS)
    oracle_subprocess.wait()
    os._exit(0)
else:
    os.waitpid(-pid, 0)  # waitpid requires the options argument
I have not tested this. It creates an extra subprocess to be the leader of the process group, but avoiding that is (I think) quite a bit more complicated.
I found this web page to be helpful as well. http://code.activestate.com/recipes/278731-creating-a-daemon-the-python-way/

You can just use os.waitpid with the pid set to -1; this waits for any child process of the current process to finish:
import os
import sys
import subprocess
proc = subprocess.Popen([sys.executable,
                         '-c',
                         'import subprocess;'
                         'subprocess.Popen("sleep 5", shell=True).wait()'])
pid, status = os.waitpid(-1, 0)
print(pid, status)
This is the result of running pstree <pid> on the process tree that gets forked:
python───python───sh───sleep
Hope this can help :)
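A single os.waitpid(-1, 0) call returns as soon as one child has exited, so to wait for every direct child you would call it in a loop until it reports that none are left. The sketch below assumes Python 3 on a POSIX system; wait_for_all_children is a hypothetical helper name. Note that it only covers direct children of the calling process, not grandchildren such as the processes spawned by runInstaller.

```python
import os
import subprocess
import sys


def wait_for_all_children():
    """Block until every direct child of this process has exited.

    Returns a dict mapping each reaped pid to its raw wait status.
    """
    statuses = {}
    while True:
        try:
            pid, status = os.waitpid(-1, 0)
        except ChildProcessError:
            return statuses  # no children left to wait for
        statuses[pid] = status


# Three short-lived children stand in for real work.
children = [subprocess.Popen([sys.executable, '-c', 'pass']) for _ in range(3)]
finished = wait_for_all_children()
```

Since the children are reaped behind the back of their Popen objects, their returncode attributes should not be relied on afterwards; use the statuses returned by the helper instead.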

Check out the following link http://www.oracle-wiki.net/startdocsruninstaller which details a flag you can use for the runInstaller command.
This flag is definitely available for 11gR2, but I have not got a 10g database to try out this flag for the runInstaller packaged with that version.
Regards

Everywhere I look seems to say it's not possible to solve this in the general case. I've whipped up a library called 'pidmon' that combines some answers for Windows and Linux and might do what you need.
I'm planning to clean this up and put it on github, possibly called 'pidmon' or something like that. I'll post a link if/when I get it up.
EDIT: It's available at http://github.com/dbarnett/python-pidmon.
I made a special waitpid function that accepts a graft_func argument so that you can loosely define what sort of processes you want to wait for when they're not direct children:
import pidmon
pidmon.waitpid(oracle_subprocess.pid, recursive=True,
graft_func=(lambda p: p.name == '???' and p.parent.pid == ???))
or, as a shotgun approach, to just wait for any processes started since the call to waitpid to stop again, do:
import pidmon
pidmon.waitpid(oracle_subprocess.pid, graft_func=(lambda p: True))
Note that this is still barely tested on Windows and seems very slow on Windows (but did I mention it's on github where it's easy to fork?). This should at least get you started, and if it works at all for you, I have plenty of ideas on how to optimize it.
