from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

def div_zero(x):
    print('In div_zero')
    return x / 0

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = executor.submit(div_zero, 1)
    done, _ = wait([futures], return_when=ALL_COMPLETED)
    # print(done.pop().result())
    print('Done')
The program above runs to completion without any error message.
You only get the exception if you explicitly call future.result() or future.exception(), as in the commented-out line.
I wonder why this module chose this behavior even though it hides problems. Because of it, I spent hours debugging
a programming error (referencing a non-existent attribute in a class) that would have been obvious if the program had simply crashed with an exception, as it would in Java, for instance.
I suspect the reason is so that the entire pool does not crash because one thread raised an exception. This way, the pool processes all the tasks, and you can inspect the futures whose tasks raised exceptions separately if you need to.
Each thread is (mostly) isolated from the other threads, including the main thread. The main thread does not communicate with the worker threads until you ask it to.
This includes errors: exceptions raised in worker threads do not interfere with the main thread. You only have to deal with them when you ask for the results.
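For illustration, here is a minimal sketch (reusing the div_zero function from above) of how you might surface those exceptions by checking each future explicitly, either with future.exception() or by calling future.result() inside a try block:

from concurrent.futures import ThreadPoolExecutor, as_completed

def div_zero(x):
    return x / 0

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(div_zero, i) for i in range(4)]
    for future in as_completed(futures):
        exc = future.exception()       # non-raising way to check for failure
        if exc is not None:
            print('Task failed:', exc)
        else:
            print('Task result:', future.result())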
Related
I was running a function on a pool of N single-threaded workers (on N machines) with client.map and one of the workers failed. I was wondering if there is a way to automatically handle exceptions raised by a worker, to redistribute its failed tasks to other workers, and to ignore or exclude it from the pool?
I've tried simulating the issue with the methods shown below. To cause one worker to fail, I raise an OSError in my_function, which is submitted to client.map like so: futures = client.map(my_function, range(100)). In my example, the worker on 'Computer123' is the one that fails. To handle exceptions thrown by my_function, I use sys.exit in exception_handler, so when a task fails on a worker, sys.exit is called.
The result is that the bad worker's distributed.nanny catches the failure and restarts the worker while the client redistributes its failed tasks. But once the bad worker is back up, it receives tasks again because it is still in the pool; it fails again and the process repeats. As it keeps failing, the other workers eventually complete all the tasks. It would be ideal if I could automatically handle exceptions from bad workers like 'Computer123' and remove them from the pool. Maybe removing the worker from the pool is all I need to do?
def exception_handler(orig_func):
    def wrapper(*args, **kwargs):
        try:
            return orig_func(*args, **kwargs)
        except Exception:
            import sys
            sys.exit(1)
    return wrapper

@exception_handler
def my_function(x):
    import socket
    import time
    time.sleep(5)
    if socket.gethostname() == 'Computer123':
        raise OSError
    else:
        return x ** 2
As a workaround, you could keep a collection of bad workers (e.g. a set badHosts), adding a hostname to it each time you determine that worker is bad (perhaps after it raises a certain number of exceptions).
Then, when a task runs, check whether the current host is in that offending set. Something like:
if socket.gethostname() in badHosts:
    pass  # skip the work on a known-bad host
else:
    do_something()
If you can give more details on how you manage the pool you connect to, I may be able to offer some more advice on how to remove them directly instead of having to check each time.
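In the meantime, here is a rough sketch of that idea (badHosts, failures, MAX_FAILURES, and tolerant are illustrative names, not part of the original code): the wrapper records exceptions per host and starts skipping a host once it has failed too often, instead of calling sys.exit:

import socket

badHosts = set()     # hostnames we have decided to stop using
failures = {}        # hostname -> number of exceptions seen so far
MAX_FAILURES = 3     # assumed threshold before a host is considered bad

def tolerant(orig_func):
    def wrapper(*args, **kwargs):
        host = socket.gethostname()
        if host in badHosts:
            return None                      # skip work on a known-bad host
        try:
            return orig_func(*args, **kwargs)
        except Exception:
            failures[host] = failures.get(host, 0) + 1
            if failures[host] >= MAX_FAILURES:
                badHosts.add(host)
            raise
    return wrapper

@tolerant
def my_function(x):
    ...  # same body as in the question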
Consider the following example:
from multiprocessing import Queue, Pool

def work(*args):
    print('work')
    return 0

if __name__ == '__main__':
    queue = Queue()
    pool = Pool(1)
    result = pool.apply_async(work, args=(queue,))
    print(result.get())
This raises the following RuntimeError:
Traceback (most recent call last):
File "/tmp/test.py", line 11, in <module>
print(result.get())
[...]
RuntimeError: Queue objects should only be shared between processes through inheritance
But interestingly, the exception is only raised when I try to get the result, not when the "sharing" happens. Commenting out the corresponding line silences the error, even though I actually did share the queue (and work is never executed!).
So here goes my question: why is this exception only raised when the result is requested, and not when apply_async is invoked, even though the error is apparently detected, since the target work function is never called?
It looks like the exception occurs in a different process and can only be made available to the main process when inter-process communication is performed in form of requesting the result. Then, however, I'd like to know why such checks are not performed before dispatching to the other process.
(If I used the queue in both work and the main process for communication then this would (silently) introduce a deadlock.)
Python version is 3.5.2.
I have read the following questions:
Sharing many queues among processes in Python
How do you pass a Queue reference to a function managed by pool.map_async()?
Sharing a result queue among several processes
Python multiprocessing: RuntimeError: “Queue objects should only be shared between processes through inheritance”
Python sharing a lock between processes
This behavior results from the design of multiprocessing.Pool.
Internally, when you call apply_async, you put your job on the Pool's call queue and get back an AsyncResult object, which allows you to retrieve the computation result using get.
A separate thread is then in charge of pickling your work. In that thread the RuntimeError occurs, but you have already returned from the call to apply_async. So the Pool sets the exception as the result on the AsyncResult, and it is re-raised when you call get.
This behavior, based on a kind of future result, is easier to understand with concurrent.futures, which has explicit future objects and, IMO, a better design for handling failures: you can query the future object for a failure without calling result().
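One way to observe the failure without calling get() is the error_callback parameter of apply_async. Here is a minimal sketch (on_error is a hypothetical name, and whether a pickling failure reaches the callback may depend on the Python version): the pool hands the exception to the callback as soon as the task fails, rather than waiting for you to request the result:

from multiprocessing import Queue, Pool

def work(*args):
    print('work')
    return 0

def on_error(exc):
    # called by the pool with the exception when the task fails
    print('Task failed:', exc)

if __name__ == '__main__':
    queue = Queue()
    pool = Pool(1)
    result = pool.apply_async(work, args=(queue,), error_callback=on_error)
    pool.close()
    pool.join()   # by now on_error should have been invoked with the RuntimeError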
Is it possible to gracefully kill a joblib process (threading backend) and still return the results computed so far?
from joblib import Parallel, delayed

parallel = Parallel(n_jobs=4, backend="threading")
result = parallel(delayed(dummy_f)(x) for x in range(100))
For the moment I have come up with two solutions:
parallel._aborted = True, which waits for the started jobs to finish (in my case that can take very long)
parallel._terminate_backend(), which hangs if jobs are still in the pipe (parallel._jobs not empty)
Is there a way to work around the library to do this?
As far as I know, joblib does not provide methods to kill spawned threads.
Since each child thread runs in its own context, it is genuinely difficult to perform a graceful kill or termination.
That being said, there is a workaround that could be adopted.
Mimic .join() (of threading) functionality, kind of:
Create a shared dictionary shared_dict with one key per thread id, whose values will hold either the thread's output or the exception it raised, e.g.:
shared_dict = {i: None for i in range(num_workers)}
Whenever an error is raised in any thread, catch it in the exception handler and, instead of re-raising it immediately, store it in the shared dictionary.
Have the main thread wait for all(shared_dict.values()).
Once every value has been filled with either a result or an error, exit the program by re-raising the error, logging it, or whatever else you need; the results computed so far remain available in shared_dict.
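A minimal sketch of that workaround (names like shared_dict, safe_call, dummy_f, and num_workers are illustrative, not joblib API, and the dictionary is keyed per task rather than per thread for simplicity); with the threading backend all workers share the same memory, so a plain dict suffices:

from joblib import Parallel, delayed

num_workers = 4
num_tasks = 10
shared_dict = {i: None for i in range(num_tasks)}   # task id -> result or exception

def dummy_f(x):
    if x == 7:
        raise ValueError('boom')                     # simulated failure
    return x ** 2

def safe_call(i, x):
    # Catch the exception instead of letting it abort the whole Parallel call,
    # and record either the result or the error in the shared dictionary.
    try:
        shared_dict[i] = dummy_f(x)
    except Exception as exc:
        shared_dict[i] = exc

Parallel(n_jobs=num_workers, backend="threading")(
    delayed(safe_call)(i, x) for i, x in enumerate(range(num_tasks))
)

# Every entry is now filled with either a result or an exception;
# the partial results are still available even though one task failed.
errors = {i: v for i, v in shared_dict.items() if isinstance(v, Exception)}
print('results:', shared_dict)
print('errors:', errors)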
How can I modify this code (that uses twisted) so that CTRL+C will cause it to exit? I expect the problem is that doWork does not yield control back to the reactor, so the reactor is not able to terminate its execution.
import time

from twisted.internet import reactor, threads

def loop_forever():
    i = 0
    while True:
        yield i
        i += 1
        time.sleep(5)

def doWork():
    for i in loop_forever():
        print i

def main():
    threads.deferToThread(doWork)
    reactor.run()
Note that this code:
def main():
    try:
        threads.deferToThread(doWork)
        reactor.run()
    except KeyboardInterrupt:
        print "user interrupted task"
does catch the exception on Windows, but not on Ubuntu.
Twisted uses Python's threading library to implement deferToThread. All of the rules that apply to Python threads apply to the threads you get with deferToThread. One rule is that signals and threads are a bad combination (Ctrl-C sends SIGINT on Linux).
The basic idea for solving this problem is to put some logic into doWork so that it will stop. Perhaps this means setting a global flag that it checks once per iteration. You can probably find lots of information elsewhere regarding strategies for getting a long-running thread to cooperate with shutdown.
You may also want to not use deferToThread for this. If you expect your job to run for most of the lifetime of the process then you may just want to use the stdlib threading module directly. Consider that a job like this is using up one of the thread pool slots. If you have enough of these then your thread pool will be full and other work will not be able to proceed.
You may also want to take doWork out of the thread. It doesn't look like it does a lot of blocking. Instead, run doWork in the reactor thread and only run iterations of loop_forever with deferToThread. Now you no longer have a long-running operation in a thread and several of your problems will probably go away.
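As a rough sketch of the "global flag" idea (stop_requested and the loop body are assumptions, not part of the original code): the worker checks a threading.Event on every iteration, and the reactor sets it before shutdown, so Ctrl+C lets the thread wind down instead of keeping the process alive:

import threading
import time

from twisted.internet import reactor, threads

stop_requested = threading.Event()   # assumed name: global flag checked by the worker

def doWork():
    i = 0
    while not stop_requested.is_set():   # cooperate with shutdown once per iteration
        print(i)
        i += 1
        time.sleep(5)

def main():
    # Ask the reactor to set the flag when it shuts down (e.g. on Ctrl+C),
    # so the background thread stops instead of blocking exit.
    reactor.addSystemEventTrigger('before', 'shutdown', stop_requested.set)
    threads.deferToThread(doWork)
    reactor.run()

main()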
I have a chunk of code like this
import thread

def f(x):
    try:
        g(x)
    except Exception, e:
        print "Exception %s: %s" % (x, e)

def h(x):
    thread.start_new_thread(f, (x,))
Once in a while, I get this:
Unhandled exception in thread started by
Error in sys.excepthook:
Original exception was:
Unlike the code sample, that's the complete text. I assume after the "by" there's supposed to be a thread ID and after the colon there are supposed to be stack traces, but nope, nothing. I don't know how to even start to debug this.
The error you're seeing means the interpreter was exiting (because the main thread exited) while another thread was still executing Python code. Python will clean up its environment, cleaning out and throwing away all of the loaded modules (to make sure as many finalizers as possible execute) but unfortunately that means the still-running thread will start raising exceptions when it tries to use something that was already destroyed. And then that exception propagates up to the start_new_thread function that started the thread, and it will try to report the exception -- only to find that what it tries to use to report the exception is also gone, which causes the confusing empty error messages.
In your specific example, this is all caused by your thread being started and your main thread exiting right away. Whether the newly started thread gets a chance to run before, during or after the interpreter exits (and thus whether you see it run as normal, run partially and report an error or never see it run) is entirely up to the OS thread scheduler.
If you're going to use threads (avoiding them is not a bad idea), you probably don't want threads still running while the interpreter is exiting. The threading.Thread class is a better interface for starting new threads, and by default it makes the interpreter wait for all such threads on exit. If you really don't want to wait for a thread to end, you can set its daemon flag on the Thread object to get the old behaviour, including the problem you see here.
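A minimal sketch of that suggestion (f and the work it does are placeholders): start the worker with threading.Thread so the interpreter waits for it on exit, or opt back into the old behaviour by setting the daemon flag:

import threading

def f(x):
    # placeholder for the real work; exceptions raised here get a full traceback
    print('working on %s' % x)

# Non-daemon thread (the default): the interpreter waits for it to finish on
# exit, so it never runs against a half-torn-down environment.
t = threading.Thread(target=f, args=(42,))
t.start()
t.join()          # optional: wait explicitly

# Setting t.daemon = True before start() restores the old behaviour: the
# interpreter will not wait, which is what produces the confusing empty
# error messages shown above.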