I am doing some gnarly stuff with Python threads including daemons.
I am getting an intermittent error on some tests:
Exception in thread myconsumerthread (most likely raised during interpreter shutdown):
Note that there are no stack trace/exception details provided.
Scrutinising my own code hasn't helped, but I am at a bit of a loss about the next step in debugging. What debugging techniques can I use to find out more about what exception might be bringing down the runtime during shutdown?
Fine print:
Windows, CPython 2.7.2 - not reproducible on Ubuntu.
The problem occurs about 3% of the time - so reproducible, just not reliably.
The code in myconsumerthread has a catch-all exception handler, which tries to write the name of the exception to sys.stderr. (Could sys already be shut down?)
I suspect the problem is related to shutting down daemon threads very quickly; before they have completely initialised. Something in this area, but I have little evidence - certainly insufficient to be pointing at a Python bug.
Ha, I have discovered a new symptom that marks a turning point in my descent into insanity!
If I import time in my test harness (not the live code), and never use it, the frequency drops to about 0.5%.
If I import turtle in my test harness (I swear on my life, there are no turtle graphics in my code; I chose this as the most irrelevant library I could quickly find) the exception starts to be caught in a different thread, and it occurs in about a third of the runs.
I have encountered the same error on a few occasions. I'm trying to locate / generate an example that displays the exact message.
Until then, if my memory serves me well, these were the areas that I focused on.
Looking for ports, files, queues, etc. that are removed or closed outside the daemon threads.
Scrutinizing blocking calls in the daemon threads, e.g. a Queue.get(block=True) or a pyserial.read() with timeout=None (see the sketch below).
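For the second point, a minimal sketch of a consumer loop that never blocks indefinitely - names like consumer_loop and handle_item are illustrative, not from the original code:

import Queue

def consumer_loop(rx_queue, stop_event):
    # Hypothetical consumer: wake up periodically instead of blocking
    # forever, so the thread can notice a shutdown request.
    while not stop_event.is_set():
        try:
            item = rx_queue.get(block=True, timeout=0.5)
        except Queue.Empty:
            continue  # nothing arrived yet; re-check the stop flag
        handle_item(item)  # stand-in for the real work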
After digging a little more, I see the same types of errors popping up in relation to Queue; see the comments here.
I find it odd that it doesn't display the traceback. You might try commenting out the catch-all except and letting Python send the exception to sys.stderr itself. Hopefully then you'll be able to see what's dying on you.
Update
I knew I had seen this issue before... Below you'll find an example that generates that error (many of them, actually). Note that there is no other traceback message either. For the sake of completeness: after you see the error messages, uncomment the queue.get lines and comment out the time.sleeps, then re-run; the errors should go away. This is in line with the sporadic failure rates you have been seeing, so you may need to run it a few times to see the errors.
I normally use time.sleep(x) to throttle threads when blocking I/O such as get() and read() does not provide a timeout method, or when there is no blocking call to be used (user-interface refreshes, for example).
That being said, I believe there is a problem with a thread being shut down while it is waiting on a time.sleep() call. I believe that this call is what has gotten me every time, but I do not know what actually causes it inside the sleep method. For all I know, there are other blocking calls that display this same behavior.
import time
import Queue
from threading import Thread

SLAVE_CNT = 50
OWNER_CNT = 10
MASTER_CNT = 2

class ThreadHungry(object):
    def __init__(self):
        self.rx_queue = Queue.Queue()

    def start(self):
        print "Adding Masters..."
        for x in range(MASTER_CNT):
            self.owners = []
            print "Starting slave owners..."
            for y in range(OWNER_CNT):
                owner = Thread(target=self.__owner_action)
                owner.daemon = True
                owner.start()
                self.owners.append(owner)

    def __owner_action(self):
        self.slaves = []
        print "\tStarting slaves..."
        for x in range(SLAVE_CNT):
            slave = Thread(target=self.__slave_action)
            slave.daemon = True
            slave.start()
            self.slaves.append(slave)
        while(1):
            time.sleep(1)
            #self.rx_queue.get(block=True)

    def __slave_action(self):
        while(1):
            time.sleep(1)
            #self.rx_queue.get(block=True)

if __name__ == "__main__":
    c = ThreadHungry()
    c.start()
    # Stop the threads abruptly after 5 seconds
    time.sleep(5)
Related
At first I thought I had some kind of memory leak causing an issue, but I'm getting an exception I'm not sure I fully understand. At least I've narrowed it down now.
I'm using a while True loop to keep a thread running and retrieving data. If it runs into a problem, it logs it and keeps running. It seems to work fine at first - at least the first time - and then it constantly logs a threading exception.
I narrowed it down to this section:
while True:
    # yada yada yada...
    # Works fine to this part
    pool = ThreadPool(processes=1)
    async_result = pool.apply_async(SpawnPhantomJS, (dcap, service_args))
    Driver = async_result.get(10)
    Driver.set_window_size(1024, 768)  # optional
    Driver.set_page_load_timeout(30)
I do this because there's an issue spawning a lot of selenium webdrivers: eventually one times out (no exception - it just hangs there). Using this gave the spawn a timeout, so if it couldn't spawn in 10 seconds, the exception would catch it and we'd go again. Seemed like a great fix. But I think it's causing problems in a loop.
It works fine to start with but then throws the same exception on every loop.
I don't understand the thread pooling well enough; maybe I shouldn't constantly be redefining it. It's a hard exception to catch happening, so testing is a bit of a pain, but I'm thinking something like this might fix it:
pool = ThreadPool(processes=1)
async_result = pool.apply_async(SpawnPhantomJS, (dcap, service_args))
while True:
    Driver = async_result.get(10)
That looks neater to me but I don't understand the problem well enough to say for sure it would fix it.
I'd really appreciate any suggestions.
Update:
I've tracked the problem to this section of code 100%: I put a variable bugcounter = 1 before it and bugcounter = 2 after it, and logged the value on an exception.
But when trying to reproduce it with just this code in a loop it runs fine and keeps spawning web drivers. So I've no idea.
Further update:
I can run this locally for hours. Sometimes it'll run on the (Windows) server for hours. But after a while it fails somewhere here and I can't figure out why.
An exception could be thrown because the timeout hits and the browser doesn't spawn in time. This happens rarely, but that's why we loop back to it.
My assumption here is that I'm creating too many threads and the OS isn't having it. I've just spotted that there's a .terminate() for thread pools - maybe if I terminate the pool after using it to spawn a browser?
The fix I arrived at in the final update above solved it.
I was using a thread pool to give the browser spawn a timeout, as a workaround for the bug in the library. But I wasn't terminating that thread pool, so after enough loops the OS wouldn't let it create another pool.
Adding a .terminate() once the browser had been spawned and the pool was no longer needed solved the problem.
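For reference, a rough sketch of that fix, assuming SpawnPhantomJS, dcap and service_args from the question; the retry structure here is illustrative:

from multiprocessing.pool import ThreadPool

while True:
    pool = ThreadPool(processes=1)  # fresh pool just for the timed spawn
    try:
        async_result = pool.apply_async(SpawnPhantomJS, (dcap, service_args))
        Driver = async_result.get(10)  # raises TimeoutError after 10 seconds
    except Exception:
        pool.terminate()  # release the worker thread before retrying
        continue
    pool.terminate()      # release the worker thread so pools don't pile up
    Driver.set_window_size(1024, 768)
    Driver.set_page_load_timeout(30)
    break  # browser spawned successfully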
I have a Python program which operates an external program and starts a timeout thread. The timeout thread should count down for 10 minutes, and if the script which operates the external program isn't finished in that time, it should kill the external program.
My thread seems to work fine at first glance; my main script and the thread run simultaneously with no issues. But if a pop-up window appears in the external program, it stops my scripts, so that even the countdown thread stops counting, totally failing at its job.
I assume the issue is that the script calls a blocking function in API for the external program, which is blocked by the pop up window. I understand why it blocks my main program, but don't understand why it blocks my countdown thread. So, one possible solution might be to run a separate script for the countdown, but I would like to keep it as clean as possible and it seems really messy to start a script for this.
I have searched everywhere for a clue, but I didn't find much. There was a reference to the gevent library here:
background function in Python
but it seems like such a basic task that I don't want to pull in an external library for it.
I also found a solution which uses a Windows multimedia timer here, but I've never worked with it before and am afraid the code won't be flexible with it. The script is Windows-only, but it should work on all Windows versions from XP on.
For Unix I found signal.alarm, which seems to do exactly what I want, but it's not available on Windows. Are there any alternatives?
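For reference, the Unix-only signal.alarm pattern looks roughly like this (run_external_program stands in for the blocking call):

import signal

def on_timeout(signum, frame):
    raise RuntimeError("external program timed out")

signal.signal(signal.SIGALRM, on_timeout)
signal.alarm(10 * 60)       # deliver SIGALRM in 10 minutes
try:
    run_external_program()  # stand-in for the blocking API call
finally:
    signal.alarm(0)         # cancel the alarm if we finished in time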
Any ideas on how to work with this in the most simplified manner?
This is the simplified thread I'm creating (run in IDLE to reproduce the issue):
import threading
import time

class timeToKill():
    def __init__(self, minutesBeforeTimeout):
        self.stop = threading.Event()
        self.countdownFrom = minutesBeforeTimeout * 60

    def startCountdown(self):
        self.countdownThread = threading.Thread(target=self.countdown, args=(self.countdownFrom,))
        self.countdownThread.start()

    def stopCountdown(self):
        self.stop.set()
        self.countdownThread.join()

    def countdown(self, seconds):
        for second in range(seconds):
            if self.stop.is_set():
                break
            else:
                print(second)
                time.sleep(1)

timeout = timeToKill(1)
timeout.startCountdown()
raw_input("Blocking call, waiting for input:\n")
One possible explanation for a function call blocking another Python thread is that CPython uses a global interpreter lock (GIL) and the blocking API call doesn't release it. (Note: CPython releases the GIL on blocking I/O calls, therefore your raw_input() example should work as is.)
If you can't make the buggy API call to release GIL then you could use a process instead of a thread e.g., multiprocessing.Process instead of threading.Thread (the API is the same). Different processes are not limited by GIL.
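A minimal sketch of that swap, assuming a hypothetical kill_external_program cleanup; note the __main__ guard, which multiprocessing requires on Windows:

import multiprocessing
import time

def countdown(seconds):
    # Runs in its own process, so a blocked (GIL-holding) API call in
    # the main process cannot stall it.
    time.sleep(seconds)
    print "Timeout reached, killing the external program"
    # kill_external_program()  # placeholder for the real cleanup

if __name__ == '__main__':   # required on Windows for multiprocessing
    watchdog = multiprocessing.Process(target=countdown, args=(10 * 60,))
    watchdog.start()
    # ... call the blocking external-program API here ...
    watchdog.terminate()     # finished in time: cancel the watchdog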
For quick and dirty threading, I usually resort to subprocess commands. It is quite robust and OS independent. It does not give as fine-grained control as the thread and queue modules, but for external calls to programs it generally does nicely. Note that shell=True must be used with caution.
import time
import subprocess

# this can be any command
p1 = subprocess.Popen(["python", "SUBSCRIPTS/TEST.py", "0"], shell=True)
# the process p1 runs in the background, asynchronously. If you want to
# kill it after some time, poll it and kill it as below.

# here do some other tasks/computations
time.sleep(10)

currentStatus = p1.poll()
if currentStatus is None:  # then it is still running
    try:
        p1.kill()  # maybe try os.kill(p1.pid, 2) if p1.kill does not work
    except:
        # do something else if process is done running - maybe do nothing?
        pass
I have a chunk of code like this:
import thread

def f(x):
    try:
        g(x)
    except Exception, e:
        print "Exception %s: %s" % (x, e)  # %s, not %d: e is an exception object

def h(x):
    thread.start_new_thread(f, (x,))
Once in a while, I get this:
Unhandled exception in thread started by
Error in sys.excepthook:
Original exception was:
Unlike the code sample, that's the complete text. I assume after the "by" there's supposed to be a thread ID and after the colon there are supposed to be stack traces, but nope, nothing. I don't know how to even start to debug this.
The error you're seeing means the interpreter was exiting (because the main thread exited) while another thread was still executing Python code. Python will clean up its environment, cleaning out and throwing away all of the loaded modules (to make sure as many finalizers as possible execute) but unfortunately that means the still-running thread will start raising exceptions when it tries to use something that was already destroyed. And then that exception propagates up to the start_new_thread function that started the thread, and it will try to report the exception -- only to find that what it tries to use to report the exception is also gone, which causes the confusing empty error messages.
In your specific example, this is all caused by your thread being started and your main thread exiting right away. Whether the newly started thread gets a chance to run before, during or after the interpreter exits (and thus whether you see it run as normal, run partially and report an error or never see it run) is entirely up to the OS thread scheduler.
If you're using threads (which are not a bad thing to avoid), you probably don't want threads still running while you're exiting the interpreter. The threading.Thread class is a better interface for starting new threads, and by default it makes the interpreter wait for all threads at exit. If you really don't want to wait for a thread to end, you can set its 'daemon' flag on the Thread object to get the old behaviour -- including the problem you see here.
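A minimal illustration, with f and x as in the question's snippet:

import threading

t = threading.Thread(target=f, args=(x,))
# t.daemon = True  # uncommenting restores the old fire-and-forget
#                  # behaviour -- and the empty shutdown errors
t.start()
t.join()  # explicit wait; even without it, a non-daemon thread
          # keeps the interpreter alive until the thread finishes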
I think this is going to be one of those simple-when-you-see-it problems, but it has got me baffled.
[STOP PRESS: I was right. Solution was found. See the answers.]
I am using Python's unittest framework to test a multi-threaded app. Nice and straightforward - I have 5 or so worker threads monitoring a common queue, and a single producer thread making work items for them. The producer thread is being triggered by a test case.
In this test, only one task is put on the queue. The processing it does in the test is just a stub for the real processing, so the worker thread does a 5-second sleep to simulate the elapsed time before the task would really be done and the thread would be ready to get another task.
The snippet of code is:
logging.info("Sleep starting")
time.sleep(5)
logging.info("Waking up")
Now the weird part. I see the "Sleep starting" log message, but not the "Waking up" message. The program locks up and doesn't respond to a keyboard interrupt (CTRL+C). CPU load is very low.
I see the same problem in Windows and Ubuntu (Python 2.6.2).
I pondered whether an exception was occurring and being hidden, so I added "print 1/0" between the first and second lines - I see the division-by-zero error being raised. I moved it to after the sleep, and I never see the message.
I figured "Okay, maybe the other thread is trying to log something very very large at the same time, and it is still buffering. What is it doing?"
Well, by this time, the test has returned to the unittest, where it is pausing waiting for the thread to get going before testing the system's state.
logging.info("Test sleep starting")
time.sleep(0.25)
logging.info("Test waking up")
Wow, that looks familiar. It is freezing in exactly the same way! The first log message is appearing, the second isn't.
I have recently done a significant rewrite of the unit so I can't claim "I didn't touch anything", but I can't see anything untoward in my changes.
Suspicious areas:
I am using threading.Lock (because I don't know how to reason about the GIL's safety, so I stick to what I know). I see nothing "deadlocky" about my code.
I am new to Python's unittest framework. Is there something it does with redirecting logging or similar that might simulate these symptoms?
No, I haven't substituted a non-standard time module!
What would prevent a thread from waking up? What else have I missed?
Sigh.
Worker Thread #1 is sleeping, and waking up afterwards. It is then going to log the wake message, and is blocked. Only one thread can be logging at a time.
UnitTest Thread is sleeping, and waking up afterwards. It is then going to log the wake message, and is blocked. Only one thread can be logging at a time.
Worker-Thread-Not-Previously-Mentioned-In-The-Question #2 was quietly finishing processing the PREVIOUS item in the queue while the first worker thread was sleeping. It got to a log statement. One of the parameters was an object, and str() was implicitly called on it. The str() function on that object had a bug: it deadlocked when it accessed some of its data members. The deadlock occurred while the object was being formatted by the logging function, thus keeping the logging thread-lock held and making it appear as if the other threads never woke up.
The division-by-zero test didn't make a difference, because the result of it was an attempt to log.
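A contrived sketch of that failure mode (not the actual code): an object whose str() blocks on a lock, formatted inside a logging call, stalls every other thread that tries to log.

import logging
import threading

logging.basicConfig(level=logging.INFO)
lock = threading.Lock()

class Gadget(object):
    def __str__(self):
        with lock:  # the buggy data-member access, reduced to a lock wait
            return "Gadget(...)"

lock.acquire()  # held here to stand in for another thread holding it
# logging formats "%s" while holding its handler lock, so str(Gadget())
# hangs here with that handler lock held -- and every logging call in
# every other thread then queues up behind it. (Running this hangs.)
logging.info("state: %s", Gadget())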
On Linux, try changing the I/O scheduler to Completely Fair Queuing (CFQ):
echo cfq > /sys/block/sda/queue/scheduler
I've created a web spider that accesses both a US and an EU server. The US and EU servers have the same data structure but hold different data, and I want to collate it all. In order to be nice to the servers, there's a wait time between each request. As the program is otherwise exactly the same, to speed up processing I've threaded it so it can access the EU and US servers simultaneously.
This crawling will take on the order of weeks, not days. There will be exceptions, and while I've tried to handle everything inside the program, it's likely something weird will crop up. To be truly defensive about this, I'd like to catch a thread that has failed, log the error, and restart it. Worst case I lose a handful of pages out of thousands, which is better than having a thread fail and losing 50% of my speed. However, from what I've read, Python threads die silently. Does anyone have any ideas?
import threading
import QueueManager  # the author's own module

class AccessServer(threading.Thread):
    def __init__(self, site):
        threading.Thread.__init__(self)
        self.site = site
        self.qm = QueueManager.QueueManager(site)

    def run(self):
        # Do stuff here
        pass

def main():
    us_thread = AccessServer(u"us")
    us_thread.start()
    eu_thread = AccessServer(u"eu")
    eu_thread.start()
Just use a try: ... except: ... block in the run method. If something weird happens that causes the thread to fail, it's highly likely that an error will be thrown somewhere in your code (as opposed to in the threading subsystem itself); this way you can catch it, log it, and restart the thread. It's your call whether you want to actually shut down the thread and start a new one, or just enclose the try/except block in a while loop so the same thread keeps running.
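Sketched against the question's class, the run method might look like this; process_next_item is a hypothetical stand-in for the real unit of work:

import logging

# inside the question's AccessServer class:
def run(self):
    while True:
        try:
            self.process_next_item()  # hypothetical unit of work
        except Exception:
            # Log the full traceback and keep the thread alive
            logging.exception("%s worker hit an error; continuing", self.site)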
Another solution, if you suspect that something really weird might happen which you can't detect through Python's error handling mechanism, would be to start a monitor thread that periodically checks to see that the other threads are running properly.
Can you have, e.g., the main thread function as a monitoring thread? E.g. require that the worker threads regularly update some thread-specific timestamp value, and if a thread hasn't updated its timestamp within a suitable time, have the monitoring thread kill it and restart it?
Or, see this answer
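A rough sketch of the heartbeat idea above; the names and the 5-minute timeout are illustrative:

import time
import threading

heartbeats = {}  # site -> time of the worker's last check-in

def worker(site):
    while True:
        heartbeats[site] = time.time()  # check in before each unit of work
        crawl_one_page(site)            # hypothetical crawl step

def monitor(timeout=300):
    while True:
        time.sleep(60)
        now = time.time()
        for site, last in heartbeats.items():
            if now - last > timeout:
                # Python has no safe way to kill a thread, so start a
                # replacement and abandon (or just log) the stuck one.
                threading.Thread(target=worker, args=(site,)).start()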