I have a Python script, part of a test system that calls many third party tools/processes on multiple [Windows] machines, and hence has been designed to clean up comprehensively/carefully when aborted with CTRL-C; the clean-up can take many seconds, depending on what's going on. This clean-up process works fine from a [Windows] command prompt.
I run that Python script from [a scripted pipeline] Jenkinsfile, using return_value = bat("python my_script.py params", returnStatus: true), which also works fine.
However I need to be able to perform the abort/clean-up during a Jenkins [v2.263.4] run, i.e. when someone presses the little red X, and that bit I can't fathom. I understand that Jenkins sends SIGTERM when the abort button is pressed so I am trapping that in my_script.py:
from signal import signal, SIGTERM
SAVED_SIGTERM_HANDLER = signal(SIGTERM, sigterm_handler)
...and running the clean-up functions I would normally call on a KeyboardInterrupt from sigterm_handler() as well, but they aren't being called. I understand that the IO stream to the Jenkins console stops the moment the abort button is pressed, so I can only see that the clean-up functions aren't being called by watching the behaviour of my script(s) from the "other side": it appears as though my_script.py simply stops dead, all connections from it drop the moment the abort button is pressed, and there is no clean-up.
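For reference, the pattern described above looks roughly like this; clean_up() and run_tests() are stand-ins for the real clean-up and test code:

import sys
from signal import signal, SIGTERM

def clean_up():
    pass  # stop third-party tools/processes, drop connections, etc.

def run_tests():
    pass  # the actual test system lives here

def sigterm_handler(signum, frame):
    clean_up()
    sys.exit(1)

SAVED_SIGTERM_HANDLER = signal(SIGTERM, sigterm_handler)

try:
    run_tests()
except KeyboardInterrupt:
    clean_up()
    sys.exit(1)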
Can anyone suggest a way of making the abort button in Jenkins give my bat()ed Python script time to clean-up? Or am I just doing something wrong? Or is there some other approach to this within Jenkins that I'm missing?
You should be able to use a "post" action to execute any clean up needed: https://www.jenkins.io/doc/book/pipeline/syntax/#post
I know that doesn't take into account the cleanup logic you already have, but it's probably the safest thing to do. Maybe separate the cleanup logic out into its own script and make it idempotent; then you can call it unconditionally at the end of a pipeline, and if it has already run it will simply do nothing.
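A minimal sketch of what such an idempotent clean-up script could look like; the tool name and work directory below are assumptions, not details from the question. Each step checks whether there is anything left to clean up, so running the script a second time is harmless:

import os
import shutil
import subprocess

WORK_DIR = r"C:\temp\test_work"          # assumed location of test artefacts

def kill_leftover_tools():
    # taskkill exits non-zero if no matching process exists; ignore that
    subprocess.call(["taskkill", "/F", "/IM", "third_party_tool.exe"])

def remove_work_dir():
    if os.path.isdir(WORK_DIR):
        shutil.rmtree(WORK_DIR)

if __name__ == "__main__":
    kill_leftover_tools()
    remove_work_dir()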
After much figuring out, and kudos to our tools people who found the critical "cookie" implementation detail in Jenkins, the workaround to take control of the abort process [on Windows] is as follows:
1. have Jenkins call a wrapper, let's call it (a), and open a socket or a named pipe (a socket would work on both Linux and Windows),
2. (a) then launches (b) via "start" so that (b) runs as a separate process but, CRITICALLY, the environment that (a) passes to (b) MUST have JENKINS_SERVER_COOKIE="ignore" added to it; Jenkins uses this flag to find the processes it has launched in order to kill them, so you must set this "cookie" to "ignore" to stop Jenkins killing (b) (see the sketch after this list),
3. (b) connects back to (a) via the socket or pipe,
4. (a) remains running for as long as (b) is connected to the socket or pipe but also lets itself be killed by CTRL-C/SIGTERM,
5. (b) then launches the thing you actually want to run,
6. when (a) is terminated by a Jenkins abort, (b) notices (because the socket or pipe will close) and performs a controlled shut-down of the thing you wanted to run before (b) exits,
7. separately, make a thing, let's call it (c), which checks whether the socket/named pipe is present: if it is, then (b) hasn't terminated yet,
8. in the Jenkinsfile, wrap the call to (a) in a try()/catch()/finally() and call (c) from the finally(), hence ensuring that the Jenkins pipeline only finishes when (b) has terminated (you might want to add a guard timer for safety).
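To make steps 1, 2 and 4 concrete, here is a rough, untested sketch of what wrapper (a) might look like in Python; the file name wrapper_b.py, the use of a localhost TCP socket, and the command line passed through are all assumptions:

import os
import socket
import subprocess
import sys

def main():
    # Step 1: open a socket that (b) will hold open for its whole lifetime.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    # Step 2: launch (b) as a separate process, with the critical cookie set
    # so that Jenkins does not kill (b) when the build is aborted.
    env = dict(os.environ, JENKINS_SERVER_COOKIE="ignore")
    flags = getattr(subprocess, "CREATE_NEW_PROCESS_GROUP", 0)  # Windows only
    subprocess.Popen([sys.executable, "wrapper_b.py", str(port)] + sys.argv[1:],
                     env=env, creationflags=flags)

    # Step 4: stay alive while (b) holds the connection open.  A Jenkins abort
    # kills this process; (b) sees the socket drop and runs its clean-up.
    conn, _ = listener.accept()
    conn.recv(1)      # returns only when (b) closes its end of the connection

if __name__ == "__main__":
    main()

The (b) side mirrors this: it connects back on the given port, starts the real job, and reads from the socket; a read that returns end-of-file means (a) has been killed, so (b) shuts the job down cleanly before exiting.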
Quite a thing, and all for the lack of what would be a relatively simple API in Jenkins.
Related
I am running some Python code using a SLURM script on a remote server accessed through SSH. At some point, issues related to licenses on the SLURM platform may happen, generating errors in Python and ending the subprocess. I want to use try-except to let the Python subprocess wait until the issue is fixed, after that it can keep running from where it stopped.
What are some smart implementations for that?
My most obvious solution is to keep Python inside a loop when the error occurs and let it read a file every X seconds; when I finally fix the error and want it to keep running from where it stopped, I would write something to the file and break the loop. I wonder if there is a smarter way to provide input to the Python subprocess while it is running through the SLURM script.
One idea might be to add a signal handler for signal USR1 to your Python script like this.
In the signal handler function, you can set a global variable or send a message or set a threading.Event that the main process is waiting on.
Then you can signal the process with:
kill -USR1 <PID>
or with the Python os.kill() equivalent.
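A rough sketch of that idea on a POSIX host (as under SLURM); here a simple flag plus signal.pause() stands in for the threading.Event variant, and wait_for_licence_fix() is a hypothetical name:

import os
import signal

RESUMED = False

def handle_usr1(signum, frame):
    global RESUMED
    RESUMED = True

signal.signal(signal.SIGUSR1, handle_usr1)

def wait_for_licence_fix():
    global RESUMED
    RESUMED = False
    print("Blocked; fix the licence issue, then run: kill -USR1 %d" % os.getpid())
    while not RESUMED:
        signal.pause()       # sleeps until any signal is delivered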
Though I do have to agree there is something to be said for the simplicity of your process doing:
touch /tmp/blocked.$$
and your program waiting in a loop with a 1s sleep for that file to be removed. This way you can tell which process id is blocked.
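The file-based variant is only a few lines; /tmp/blocked.<pid> mirrors the touch command above:

import os
import time

def wait_until_unblocked():
    marker = "/tmp/blocked.%d" % os.getpid()
    open(marker, "w").close()        # equivalent of: touch /tmp/blocked.$$
    while os.path.exists(marker):    # remove the file once the issue is fixed
        time.sleep(1)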
I have a shell script which I am calling in Python using os.system("./name_of_script")
I would prefer to make this call based on user input (i.e. a user types "start" and the call is made, along with some other stuff in the Python program; when a user types "stop" the script is terminated), but I find that this call takes over the terminal (I don't really know the right word for it, but basically the whole program blocks on this call, since my shell script runs until a keyboard interrupt is received). Only when I do a keyboard interrupt does the shell script stop executing, and then the rest of the code afterwards runs. Is this possible in Python?
Simply constructing a Popen object, as in:
import subprocess
p = subprocess.Popen(['./name_of_script'])
...starts the named program without blocking on it to complete.
If you later want to see if it's done yet, you can check p.poll() for an update on its status.
This is also faster and safer than os.system(), in that it doesn't involve a shell (unless the script you're invoking runs one itself), so you aren't exposing yourself to shellshock, shell injection vulnerabilities, or other shell-related issues unnecessarily.
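A small sketch of the start/stop flow from the question built on this; "start" and "stop" are the commands from the question, "quit" is added just to end the loop:

import subprocess

proc = None
while True:
    cmd = input("> ").strip().lower()      # raw_input() on Python 2
    if cmd == "start" and proc is None:
        proc = subprocess.Popen(['./name_of_script'])   # returns immediately
    elif cmd == "stop" and proc is not None:
        proc.terminate()        # ask it to stop; use proc.kill() to force it
        proc.wait()
        proc = None
    elif cmd == "quit":
        break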
I am using requests to pull some files. I have noticed that the program seems to hang after some large number of iterations that varies from 5K to 20K. I can tell it is hanging because the folder where the results are stored has not changed in several hours. I have been trying to interrupt the process (I am using IDLE) by hitting CTRL + C to no avail. I would like to interrupt instead of killing the process because restart is easier. I have finally had to kill the process. I restart and it runs fine again until I have the same symptoms. I would like to figure out how to diagnose the problem but since I am having to kill everything I have no idea where to start.
Is there an alternate way to view what is going on or to more robustly interrupt the process?
I have been assuming that if I can interrupt without killing I can look at globals and or do some other mucking around to figure out where my code is hanging.
In case it's not too late: I've just faced the same problems and have some tips.
First thing: in Python, most waiting APIs are not interruptible (e.g. Thread.join(), Lock.acquire(), ...).
Have a look at these pages for more information:
http://snakesthatbite.blogspot.fr/2010/09/cpython-threading-interrupting.html
http://docs.python.org/2/library/thread.html
So if a thread is waiting on such a call, it cannot be stopped.
There is another thing to know: if a normal (non-daemon) thread is running (or hung), the main program will stay alive indefinitely until all threads are stopped or the process is killed.
To avoid that, you can make the thread a daemon thread: set Thread.daemon = True before calling Thread.start(), as in the sketch below.
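For example (worker() here is just a placeholder for the code that may block forever):

import threading

def worker():
    pass  # the call that may block forever lives here

t = threading.Thread(target=worker)
t.daemon = True      # must be set before start(); the thread dies with the process
t.start()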
Second thing: to find where your program is hung, you can launch it under a debugger, but I prefer logging because the logs are still there even when it's too late to debug.
Try logging before and after each waiting call to see how long your threads have been hung. For high-quality logs, use the Python logging module configured with a file handler, an HTML handler, or, even better, a syslog handler.
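A minimal sketch of that kind of logging, assuming a file handler and a hypothetical download_thread:

import logging
import threading

logging.basicConfig(filename="worker.log", level=logging.INFO,
                    format="%(asctime)s %(threadName)s %(message)s")
log = logging.getLogger(__name__)

download_thread = threading.Thread(target=lambda: None)  # placeholder worker
download_thread.start()

log.info("about to join download thread")
download_thread.join()          # if the log stops here, this is where it hangs
log.info("download thread joined")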
I have two scripts: "autorun.py" and "main.py". I added "autorun.py" as a service to the autorun in my Linux system. Works perfectly!
Now my question is: I want to launch "main.py" from my autorun script, but "main.py" runs forever, so "autorun.py" never terminates either! So when I do
sudo service autorun-test start
the command also never finishes!
How can I run "main.py" and then exit, and to finish it up, how can I then stop "main.py" when "autorun.py" is launched with the parameter "stop" ? (this is how all other services work I think)
EDIT:
Solution:
if sys.argv[1] == "start":
print "Starting..."
with daemon.DaemonContext(working_directory="/home/pi/python"):
execfile("main.py")
else:
pid = int(open("/home/pi/python/main.pid").read())
try:
os.kill(pid, 9)
print "Stopped!"
except:
print "No process with PID "+str(pid)
First, if you're trying to create a system daemon, you almost certainly want to follow PEP 3143, and you almost certainly want to use the daemon module to do that for you.
When I want to launch "main.py" from my autorun script, and "main.py" will run forever, "autorun.py" never terminates as well!
You didn't say how you're running it. If you're doing anything that launches main.py as a child and waits (or, worse, tries to import/execfile/etc. in the same process), you can't do that. Either autorun.py has to launch and detach main.py (or do so indirectly via some external tool), or main.py has to daemonize when launched.
how can I then stop "main.py" when "autorun.py" is launched with the parameter "stop" ?
You need some form of inter-process communication (IPC), and some way for autorun to find the right IPC channel to use.
If you're building a network server, the right answer might be to connect to it as a client. But otherwise, the simplest thing to do is kill the process with a signal.
If you're using the daemon module, it can easily map signals to callbacks. Or, if you don't need any cleanup, just use SIGTERM, which by default will abruptly terminate. If neither of those applies, you will have to set up a custom signal handler (and within that handler do something useful—e.g., set a flag that your main code checks periodically).
How do you know what process to send the signal to? The standard way to do this is to have main.py record its PID in a pidfile at startup. You read that pidfile, and signal whatever process is specified there. (If you get an error because there is no process with that PID, that just means the daemon already quit for some reason—possibly because of an unhandled exception, or even a segfault. You may want to log that, but treat the "stop" as successful otherwise.) Again, if you're using daemon, it does the pidfile stuff for you; if not, you have to do it yourself.
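A sketch of that "stop" path with a plain pidfile (the path is just an example, and the daemon module can manage the pidfile for you):

import errno
import os
import signal

def stop_daemon(pidfile="/home/pi/python/main.pid"):
    pid = int(open(pidfile).read().strip())
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError as e:
        if e.errno == errno.ESRCH:          # no such process: already gone
            print("daemon (PID %d) had already exited" % pid)
        else:
            raise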
You may want to take a look at the service scripts for daemons that came with your computer. They're probably all written in bash rather than Python, but it's not that hard to figure out what they're doing. Or… just use one of them as a skeleton, in which case you don't really need any bash knowledge; it's just search-and-replace on the name.
If your distro has LSB-style init functions, you can use something like this example. That one does a whole lot more than you need to, but it's a good example of all of the details. Or do it all from scratch with something like this example. This one is doing the pidfile management and the backgrounding from the service script (turning a non-daemon program into a daemon), which you don't need if you're using daemon properly, and it's using SIGHUP instead of SIGTERM. You can google yourself for other examples of init.d service scripts.
But again, if you're just trying to do this for your own system, the best thing to do is look inside the /etc/init.d on your distro. There will be dozens of examples there, and 90% of them will be exactly the same except for the name of the daemon.
I have a Python IRC bot which I start up as root by doing /etc/init.d/phenny start. Sometimes it dies, though, and it seems to happen overnight.
What can I do to inspect it and see the status of the process in a text file?
If you know it's still running, you can pstack it to see its backtrace. I'm not sure how useful that will be, because you will see the call stack of the interpreter. You could also try strace or ltrace as someone else mentioned.
I would also make sure that, in whatever environment the script runs in, you have set ulimit -c unlimited so that a core file is generated in case Python is outright crashing.
Another thing I might try is to have this job executed by a parent that does not wait for its child. This should cause the proc table entry to stick around as a zombie even when the underlying job has exited.
If you're interested in really low level process activity, you can run the python interpreter under strace with standard error redirected to a file.
If you're only interested in inspecting the Python code when your bot crashes, and you have the location in the source where the crash happens, you can wrap that location with try/except and break into the debugger in the except clause:
import pdb; pdb.set_trace()
You'll probably need to run your bot in non-daemon mode for that to work, though.
You might want to try the Python psutil package; it is something that I have used and it works.
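For example, a small (untested) check with psutil that looks for the bot by name and prints its status:

import psutil

for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
    cmdline = ' '.join(proc.info['cmdline'] or [])
    if 'phenny' in cmdline:
        print(proc.info['pid'], proc.status(), proc.memory_info().rss)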
A cheap way to get some extra clues about the problem would be to start phenny with
/etc/init.d/phenny start 2>/tmp/phenny.out 1>&2
When it crashes, check the tail of /tmp/phenny.out for the Python traceback.
If you only need to verify that the process is running, you could just run a script that checks the output of the command
ps ax | grep [p]henny
every few seconds. If it's empty, then obviously the process is dead.
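In Python, that watchdog could be as simple as the following (the 5-second interval is arbitrary):

import subprocess
import time

while True:
    # grep exits non-zero when nothing matches, so the return code tells us
    # whether a phenny process is still around
    rc = subprocess.call("ps ax | grep -q '[p]henny'", shell=True)
    if rc != 0:
        print("phenny is not running")   # restart it or send an alert here
    time.sleep(5)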