How to detect process freeze and restart it - python

My code sometimes freezes (the process does not exit, it just stops doing anything).
I want to detect such a frozen process (i.e. a stopped or zombie-like process) and restart it.
Please tell me possible solutions to this problem.
::My Situation
My code is written in Python. The target OS is Linux (though a solution that also works on macOS/Windows would be even better). The processes are managed by supervisord (http://supervisord.org/).
supervisord can restart a process that has exited, but it does not seem to detect and restart a frozen one. (I am not certain about this; if supervisord can in fact detect and restart a frozen process, that would help me a lot.)
(Also, the process drives an external process of a large software application. If my process is restarted, that application may need to be restarted as well.)
::My solution idea:
My process writes to a log file every 1-5 seconds. So my current idea is to write a new program that monitors that log, and if the target process has not written anything for too long, the new program tells supervisord to restart the process (a rough sketch follows below).
However, the project is already very delayed, so if you know a better solution, please tell me.
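Roughly, the watchdog I have in mind would look something like this (the program name, log path, and timings are placeholders; supervisorctl is assumed to be on the PATH):

import os
import subprocess
import time

LOG_PATH = "/var/log/worker.log"   # placeholder: the log my process writes every 1-5 s
PROGRAM = "worker"                 # placeholder: the supervisord program name
STALE_AFTER = 30                   # seconds with no log write before assuming a freeze

while True:
    try:
        age = time.time() - os.path.getmtime(LOG_PATH)
    except OSError:
        age = None  # log file missing: unknown state, do not restart blindly
    if age is not None and age > STALE_AFTER:
        # supervisorctl stops the frozen process and starts a fresh one
        subprocess.call(["supervisorctl", "restart", PROGRAM])
        time.sleep(STALE_AFTER)  # give the new process time to write its first log line
    time.sleep(5)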
::The cause of the freeze (added 5/22 14:57)
I found that the freeze sometimes happens when, at the end of each regular cycle, my code calls the quit() method of the driver for the external software process.
The chain is roughly: quit() -> into the logging module -> the logger submitting the log to the DB -> ... -> code inside logging calls del() on the process (process shutdown while still submitting to the DB?) -> silent freeze.
My logger submits logs to a DB. In other processes the logger never froze, so I could not find the cause above. (Perhaps it simply has not happened in the others yet?) However, in the process described above it causes a silent freeze.

Related

How to identify the cause in Python of code that is not interruptible with a CTRL +C

I am using requests to pull some files. I have noticed that the program seems to hang after some large number of iterations that varies from 5K to 20K. I can tell it is hanging because the folder where the results are stored has not changed in several hours. I have been trying to interrupt the process (I am using IDLE) by hitting CTRL + C to no avail. I would like to interrupt instead of killing the process because restart is easier. I have finally had to kill the process. I restart and it runs fine again until I have the same symptoms. I would like to figure out how to diagnose the problem but since I am having to kill everything I have no idea where to start.
Is there an alternate way to view what is going on or to more robustly interrupt the process?
I have been assuming that if I can interrupt without killing I can look at globals and or do some other mucking around to figure out where my code is hanging.
In case it's not too late: I've just faced the same problems and have some tips.
First thing: in Python most waiting APIs are not interruptible (e.g. Thread.join(), Lock.acquire(), ...).
Have a look at these pages for more information:
http://snakesthatbite.blogspot.fr/2010/09/cpython-threading-interrupting.html
http://docs.python.org/2/library/thread.html
So if a thread is waiting on such a call, it cannot be interrupted.
There is another thing to know: if a normal (non-daemon) thread is running (or hung), the main program will wait indefinitely until all threads have stopped or the process is killed.
To avoid that, you can make the thread a daemon thread: set Thread.daemon = True before calling Thread.start().
Second thing: to find where your program is hung, you can launch it under a debugger, but I prefer logging, because logs are always there even when it's too late to debug.
Try logging before and after each waiting call to see how long your threads have been hung. For high-quality logs, use the Python logging module configured with a file handler, an HTML handler, or even better a syslog handler.
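A minimal sketch combining both tips (the log file name and the 60-second sleep are just stand-ins for a real blocking call):

import logging
import threading
import time

logging.basicConfig(filename="worker.log", level=logging.INFO,
                    format="%(asctime)s %(threadName)s %(message)s")

def worker():
    logging.info("before blocking call")
    time.sleep(60)          # stand-in for a long, possibly uninterruptible wait
    logging.info("after blocking call")

t = threading.Thread(target=worker, name="worker-1")
t.daemon = True             # the program can exit even if this thread is stuck
t.start()

while t.is_alive():
    t.join(timeout=1.0)     # a timed join stays interruptible, unlike a bare join()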

How to correctly handle autorun start & stop on linux with python

I have two scripts: "autorun.py" and "main.py". I added "autorun.py" as a service to the autostart on my Linux system. It works perfectly!
Now my question is: when I launch "main.py" from my autorun script, and "main.py" runs forever, "autorun.py" never terminates either! So when I do
sudo service autorun-test start
the command also never finishes!
How can I run "main.py" and then exit? And, to finish it up, how can I then stop "main.py" when "autorun.py" is launched with the parameter "stop"? (This is how all other services work, I think.)
EDIT:
Solution:
import sys
import os
import daemon  # the python-daemon package

if sys.argv[1] == "start":
    print "Starting..."
    with daemon.DaemonContext(working_directory="/home/pi/python"):
        execfile("main.py")
else:
    pid = int(open("/home/pi/python/main.pid").read())
    try:
        os.kill(pid, 9)
        print "Stopped!"
    except OSError:
        print "No process with PID " + str(pid)
First, if you're trying to create a system daemon, you almost certainly want to follow PEP 3143, and you almost certainly want to use the daemon module to do that for you.
When I want to launch "main.py" from my autorun script, and "main.py" will run forever, "autorun.py" never terminates as well!
You didn't say how you're running it. If you're doing anything that launches main.py as a child and waits (or, worse, tries to import/execfile/etc. in the same process), you can't do that. Either autorun.py has to launch and detach main.py (or do so indirectly via some external tool), or main.py has to daemonize when launched.
how can I then stop "main.py" when "autorun.py" is launched with the parameter "stop" ?
You need some form of inter-process communication (IPC), and some way for autorun to find the right IPC channel to use.
If you're building a network server, the right answer might be to connect to it as a client. But otherwise, the simplest thing to do is kill the process with a signal.
If you're using the daemon module, it can easily map signals to callbacks. Or, if you don't need any cleanup, just use SIGTERM, which by default will abruptly terminate. If neither of those applies, you will have to set up a custom signal handler (and within that handler do something useful—e.g., set a flag that your main code checks periodically).
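For example, a rough sketch using the python-daemon package's signal_map (run_forever() is a placeholder for your own main loop):

import signal
import daemon   # the python-daemon package (PEP 3143 reference implementation)

def cleanup(signum, frame):
    # close DB connections, flush logs, etc., then exit
    raise SystemExit(0)

def run_forever():
    pass        # placeholder for the daemon's real main loop

with daemon.DaemonContext(signal_map={signal.SIGTERM: cleanup}):
    run_forever()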
How do you know what process to send the signal to? The standard way to do this is to have main.py record its PID in a pidfile at startup. You read that pidfile, and signal whatever process is specified there. (If you get an error because there is no process with that PID, that just means the daemon already quit for some reason—possibly because of an unhandled exception, or even a segfault. You may want to log that, but treat the "stop" as successful otherwise.) Again, if you're using daemon, it does the pidfile stuff for you; if not, you have to do it yourself.
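The "stop" side might look roughly like this (the pidfile path is just an example):

import errno
import os
import signal

def stop(pidfile="/var/run/main.pid"):          # placeholder path
    try:
        pid = int(open(pidfile).read().strip())
    except (IOError, ValueError):
        return                                  # no pidfile or garbage in it: nothing to stop
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError as e:
        if e.errno != errno.ESRCH:              # ESRCH: no such process, i.e. already gone
            raise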
You may want to take a look at the service scripts for daemons that came with your computer. They're probably all written in bash rather than Python, but it's not that hard to figure out what they're doing. Or… just use one of them as a skeleton, in which case you don't really need any bash knowledge; it's just search-and-replace on the name.
If your distro has LSB-style init functions, you can use something like this example. That one does a whole lot more than you need to, but it's a good example of all of the details. Or do it all from scratch with something like this example. This one is doing the pidfile management and the backgrounding from the service script (turning a non-daemon program into a daemon), which you don't need if you're using daemon properly, and it's using SIGHUP instead of SIGTERM. You can google yourself for other examples of init.d service scripts.
But again, if you're just trying to do this for your own system, the best thing to do is look inside the /etc/init.d on your distro. There will be dozens of examples there, and 90% of them will be exactly the same except for the name of the daemon.

Preventing management commands from running more than one at a time

I'm designing a long running process, triggered by a Django management command, that needs to run on a fairly frequent basis. This process is supposed to run every 5 min via a cron job, but I want to prevent it from running a second instance of the process in the rare case that the first takes longer than 5 min.
I've thought about creating a touch file that gets created when the management process starts and is removed when the process ends. A second management command process would then check to make sure the touch file didn't exist before running. But that seems like a problem if a process dies abruptly without properly removing the touch file. It seems like there's got to be a better way to do that check.
Does anyone know any good tools or patterns to help solve this type of issue?
For this reason I prefer to have a long-running process that gets its work off of a shared queue. By long-running I mean that its lifetime is longer than a single unit of work. The process is then controlled by some daemon service such as supervisord, which takes care of restarting it when it crashes. This delegates the work appropriately to something that knows how to manage process lifecycles and frees you from having to worry about the nitty-gritty of POSIX processes in the scope of your script.
If you have a queue, you also have the luxury of being able to spin up multiple processes that can each take jobs off of the queue and process them, but that sounds like it's out of scope of your problem.
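To make the shape of that worker concrete, here is a rough sketch (Redis is just one possible queue backend, not a recommendation, and the queue/handler names are made up):

import json
import redis    # assumes the redis-py package; any shared queue would do

r = redis.StrictRedis()

def handle(job):
    pass        # the actual unit of work goes here

while True:
    _, raw = r.blpop("jobs")    # blocks until a job is pushed onto the "jobs" list
    handle(json.loads(raw))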

Python: Why does my SMTP script freeze my computer?

So I wrote a little multithreaded SMTP program. The problem is every time I run it, it freezes the computer shortly after. The script appears to still work, as my network card is still lighting up and the emails are received, but in some cases it will lock up completely and stop sending the emails.
Here's a link to my two script files. The first is the one used to launch the program:
readFile.py
newEmail.py
First, you're using popen, which creates subprocesses, i.e. processes, not threads. I'll assume this is what you meant.
My guess would be that the program gets stuck in a loop where it spawns processes continuously, which the OS will probably dislike. (That kind of thing is known as a fork bomb, which is a good way to freeze Linux unless a process limit has been set with ulimit.) I couldn't find the bug, though. If I were you, I'd log a message each time I spawn or kill a subprocess, and if everything looks normal, watch the system closely (ps or top on Unix systems) to see whether the processes are really being killed.
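The spawn/exit logging could look something like this (the child command and log file name are made up):

import logging
import subprocess

logging.basicConfig(filename="smtp_debug.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def spawn(cmd):
    proc = subprocess.Popen(cmd)
    logging.info("spawned pid %s: %s", proc.pid, cmd)
    return proc

proc = spawn(["python", "send_worker.py"])   # made-up child command
proc.wait()
logging.info("pid %s exited with code %s", proc.pid, proc.returncode)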

How to auto-restart a python script on fail?

This post describes how to keep a child process alive in a BASH script:
How do I write a bash script to restart a process if it dies?
This worked great for calling another BASH script.
However, I tried executing something similar where the child process is a Python script, daemon.py, which forks a child process that runs in the background:
#!/bin/bash
PYTHON=/usr/bin/python2.6
function myprocess {
    $PYTHON daemon.py start
}
NOW=$(date +"%b-%d-%y")
until myprocess; do
    echo "$NOW Prog crashed. Restarting..." >> error.txt
    sleep 1
done
Now the behaviour is completely different. It seems the Python script is no longer a child of the bash script but seems to have 'taken over' the bash script's PID, so there is no longer a bash wrapper around the called script... why?
A daemon process double-forks, as the key point of daemonizing itself -- so the PID that the parent-process has is of no value (it's gone away very soon after the child process started).
Therefore, a daemon process should write its PID to a file in a "well-known location" where by convention the parent process knows where to read it from; with this (traditional) approach, the parent process, if it wants to act as a restarting watchdog, can simply read the daemon process's PID from the well-known location and periodically check if the daemon is still alive, and restart it when needed.
It takes some care in execution, of course (a "stale" PID will stay in the "well known location" file for a while and the parent must take that into account), and there are possible variants (the daemon could emit a "heartbeat" so that the parent can detect not just dead daemons, but also ones that are "stuck forever", e.g. due to a deadlock, since they stop giving their "heartbeat" [[via UDP broadcast or the like]] -- etc etc), but that's the general idea.
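A bare-bones version of that watchdog idea might look like this (paths and the start command are placeholders, and stale-PID handling is deliberately simplified):

import os
import subprocess
import time

PIDFILE = "/var/run/daemon.pid"                 # the "well-known location"
START_CMD = ["python", "daemon.py", "start"]    # placeholder start command

def alive(pid):
    try:
        os.kill(pid, 0)     # signal 0 only checks existence, nothing is delivered
        return True
    except OSError:
        return False

while True:
    try:
        pid = int(open(PIDFILE).read().strip())
    except (IOError, ValueError):
        pid = None
    if pid is None or not alive(pid):
        subprocess.call(START_CMD)
    time.sleep(10)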
You should look at Python Enhancement Proposal 3143 (PEP 3143). In it, Ben suggests including a daemon library in the Python standard lib. It goes over LOTS of very good information about daemons and is a pretty easy read. The reference implementation is the python-daemon package.
The behaviour is completely different because here your "daemon.py" is launched in the background as a daemon.
In the other link you pointed to, the watched process is not a daemon; it does not go into the background. The launcher simply waits forever for the child process to stop.
There are several ways to overcome this. The classical one is the approach @Alex explains, using a pid file in a conventional place.
Another way would be to build the watchdog into your running daemon and daemonize the watchdog... this would simulate a correct process that does not break at random (something that shouldn't happen anyway)...
Make use of https://github.com/ut0mt8/simple-ha.
simple-ha
Tired of keepalived, corosync, pacemaker, heartbeat or whatever? Here is a simple daemon which ensures a heartbeat between two hosts. One is active and the other is backup, launching a script when the state changes. Simple implementation, KISS. Production ready (at least it works for me :)
Life will be so much easier!
