I have a unittest that does a bunch of stuff in several different threads. When I stop everything in the tearDown method, somehow something is still running. And by running I mean sleeping. I ran the top command on the python process (Ubuntu 12.04), which told me that the process was sleeping.
Now I have tried using pdb to figure out what is going on, e.g. by putting set_trace() at the end of tearDown. But that tells me nothing. I suspect this is because some other thread has started sleeping earlier and is therefore not accessed anymore at this point.
Is there any tool or method I can use to track down the cause of my non-stopping process?
EDIT
Using ps -Tp <#Process> -o wchan I now know that 4 threads are still running, of which three waiting on futex_wait_queue_me and one on unix_stream_data_wait. Since I had a subprocess previously, which I killed with os.kill(pid, signal.SIGKILL), I suspect that the Pipe connection is somehow still waiting for that process. Perhaps the fast mutexes are waiting for that as well.
Is there anyway I could further reduce the search space?
If you are working under Linux then you should be able to use 'ps -eLf' to get a list of all active processes and threads. Assuming your have given your threads good names at creation it should be easy to see what is still running.
I believe under windows you can get a tool to do something similar - see http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
N.B. I have not used the windows tool this myself
Also from within Python you can use the psutil package (https://pypi.python.org/pypi/psutil/) to get similar infomration
Related
I am running some Python code using a SLURM script on a remote server accessed through SSH. At some point, issues related to licenses on the SLURM platform may happen, generating errors in Python and ending the subprocess. I want to use try-except to let the Python subprocess wait until the issue is fixed, after that it can keep running from where it stopped.
What are some smart implementations for that?
My most obvious solution is just keeping Python inside a loop if the error occurs and letting it read a file every X seconds, when I finally fix the error and want it to keep running from where it stopped, I would write something on the file and break the loop. I wonder if there is a smarter way to provide input to the Python subprocess while it is running through the SLURM script.
One idea might be to add a signal handler for signal USR1 to your Python script like this.
In the signal handler function, you can set a global variable or send a message or set a threading.Event that the main process is waiting on.
Then you can signal the process with:
kill -USR1 <PID>
or with the Python os.kill() equivalent.
Though I do have to agree there is something to be said for the simplicity of your process doing:
touch /tmp/blocked.$$
and your program waiting in a loop with a 1s sleep for that file to be removed. This way you can tell which process id is blocked.
In my quest to make a parallelised python program (written on Linux) truly platform independent, I was looking for python packages that would parallelise seamlessly in windows. So, I found joblib, which looked like a godsend, because it worked on Windows without having hundreds of annoying pickling errors.
Then I ran into a problem that the processes spawned by joblib would continue to exist even if there were no parallel jobs running. This is a problem because in my code, there are multiple uses of os.chdir(). If there are processes running after the jobs end, there is no way to do os.chdir() on them, so it is not possible to delete the temporary folder where the parallel processes were working. This issue has been noted in this previous post: Joblib Parallel doesn't terminate processes
After some digging, I figured out that the problem is coming from the backend loky, which reuses the pool of processes, so keeps them alive.
So, I used loky directly, by calling the loky.ProcessPoolExecutor() instead of using loky.get_reusable_executor() that joblib uses. The ProcessPoolExecutor from loky seems to be a reimplementation of python's concurrent.futures and uses similar semantics.
This works fine, however, there is one python process that seems to still stick around, even after shutting down the ProcessPoolExecutor.
Minimal example with interactive python on Windows (use interactive shell because otherwise all the processes will terminate on exit):
>>> import loky
>>> import time
>>> def f(): # just some dummy function
... time.sleep(10)
... return 0
...
>>> pool = loky.ProcessPoolExecutor(3)
At this point, there is only one python process running (please note the PID in task manager).
Then, submit a job to the process pool (which returns a Future object).
>>> task = pool.submit(f)
>>> task
<Future at 0x11ad6f37640 state=running>
There are 5 processes now. One is the main process (PID=16508). There are three worker processes in the pool. But there is another extra process. I am really not sure what it is doing there.
After getting results, shutting the pool down removes the three worker processes. But not the one extra process.
>>> task.result()
0
>>> pool.shutdown() # you can add kill_workers=True, but does not change the result
The new python process with PID=16904 is still running. I have tried looking through the source code of loky, but I cannot figure out where that additional process is being created (and why it is necessary). Is there any way to tell this loky process to shutdown? (I do not want to resort to os.kill or some other drastic way of terminating process e.g. with SIGTERM, I want to do it programmatically if I can)
I am using requests to pull some files. I have noticed that the program seems to hang after some large number of iterations that varies from 5K to 20K. I can tell it is hanging because the folder where the results are stored has not changed in several hours. I have been trying to interrupt the process (I am using IDLE) by hitting CTRL + C to no avail. I would like to interrupt instead of killing the process because restart is easier. I have finally had to kill the process. I restart and it runs fine again until I have the same symptoms. I would like to figure out how to diagnose the problem but since I am having to kill everything I have no idea where to start.
Is there an alternate way to view what is going on or to more robustly interrupt the process?
I have been assuming that if I can interrupt without killing I can look at globals and or do some other mucking around to figure out where my code is hanging.
In case it's not too late: I've just faced the same problems and have some tips
First thing: In python most waiting apis are not interruptible (ie Thread.join(), Lock.acquire()...).
Have a look at theese pages for more informations:
http://snakesthatbite.blogspot.fr/2010/09/cpython-threading-interrupting.html
http://docs.python.org/2/library/thread.html
Then if a thread is waiting on such a call, it cannot be stopped.
There is another thing to know: if a normal thread is running (or hanged) the main program will stay indefinitely untill all threads are stopped or the process is killed.
To avoid that, you can make the thread a daemon thread: Thread.daemon=True before calling Thread.start().
Second thing, to find where your program is hanged, you can launch it with a debugger but I prefer logging because logs are always there in case its to late to debug.
Try logging before and after each waiting call to see how much time your threads have been hanged. To have high quality logs, uses python logging configured with file handler, html handler or even better with a syslog handler.
I'm designing a long running process, triggered by a Django management command, that needs to run on a fairly frequent basis. This process is supposed to run every 5 min via a cron job, but I want to prevent it from running a second instance of the process in the rare case that the first takes longer than 5 min.
I've thought about creating a touch file that gets created when the management process starts and is removed when the process ends. A second management command process would then check to make sure the touch file didn't exist before running. But that seems like a problem if a process dies abruptly without properly removing the touch file. It seems like there's got to be a better way to do that check.
Does anyone know any good tools or patterns to help solve this type of issue?
For this reason I prefer to have a long-running process that gets its work off of a shared queue. By long-running I mean that its lifetime is longer than a single unit of work. The process is then controlled by some daemon service such as supervisord which can take over control of restarting the process when it crashes. This delegates the work appropriately to something that knows how to manage process lifecycles and frees you from having to worry about the nitty gritty of posix processes in the scope of your script.
If you have a queue, you also have the luxury of being able to spin up multiple processes that can each take jobs off of the queue and process them, but that sounds like it's out of scope of your problem.
So I wrote a little multithreaded SMTP program. The problem is every time I run it, it freezes the computer shortly after. The script appears to still work, as my network card is still lighting up and the emails are received, but in some cases it will lock up completely and stop sending the emails.
Here's a link to my two script files. The first is the one used to launch the program:
readFile.py
newEmail.py
First, you're using popen which creates subprocesses, ie. processes not threads. I'll assume this is what you meant.
My guess would be that the program gets stuck in a loop where it generates processes continuously, which the OS will probably dislike. (That kind of thing is known as a forkbomb which is a good way to freeze Linux unless a process limit has been set with ulimit.) I couldn't find the bug though, but if I were you, I'd log messages each time I spawn or kill a subprocess, and if everything is normal, watch the system closely (ps or top on Unix systems) to see if the processes are really being killed.