exceptions in worker process - python

I'm playing with multiprocessing in Python. I'm trying to determine what happens if a worker raises an exception, so I wrote the following code:
from multiprocessing import Pool

def a(num):
    if num == 2:
        raise Exception("num can't be 2")
    print(num)

p = Pool()
p.map(a, [2, 1, 3, 4, 5, 6, 7, 100, 100000000000000, 234, 234, 5634, 0000])
Output:
3
4
5
7
6
100
100000000000000
234
234
5634
0
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "<stdin>", line 3, in a
Exception: Error, num can't be 2
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
Exception: Error, num can't be 2
Looking at the numbers that were printed, "2" is missing as expected, but why is the number 1 not there either?
Note: I'm using Python 3.5.2 on Ubuntu

By default, Pool creates a number of workers equal to your number of cores. When one of those worker processes dies, it may leave work that has been assigned to it undone. It also may leave output in a buffer that never gets flushed.
The pattern with .map() is to handle exceptions in the workers and return some suitable error value, since the results of .map() are supposed to be one-to-one with the input.
from multiprocessing import Pool

def a(num):
    try:
        if num == 2:
            raise Exception("num can't be 2")
        print(num, flush=True)
        return num
    except Exception as e:
        print('failed', flush=True)
        return e

p = Pool()
n = 100
results = p.map(a, range(n))
print("missing numbers: ", tuple(i for i in range(n) if i not in results))
Here's another question with good information about how exceptions propagate in multiprocessing.map workers.

Related

Python pass variable to multiprocessing pool

I've been trying to figure this out for ages, but I'm wondering if anyone might know how I can pass the s variable to the pool without making it into an argument?
import ctypes
import multiprocessing as mp
import os

def worker1(n):
    k = n * 3
    print(k)
    print(s)
    # print(ctypes_int.value)

if __name__ == '__main__':
    # global s
    somelist = [1, 2, 3]
    # ctypes_int = mp.Value(ctypes.c_wchar_p, "hi")
    s = "TESTING"
    # worker1(1)
    p = mp.Pool(1)
    p.map(worker1, somelist)
This is the error I am getting:
3
6
9
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Program Files\Python\Python37\lib\multiprocessing\pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "C:\Program Files\Python\Python37\lib\multiprocessing\pool.py", line 44, in mapstar
return list(map(*args))
File "C:\Users\light\AppData\Local\Temp\tempCodeRunnerFile.python", line 10, in worker1
print(s)
NameError: name 's' is not defined
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\light\AppData\Local\Temp\tempCodeRunnerFile.python", line 21, in <module>
p.map(worker1,somelist)
File "C:\Program Files\Python\Python37\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Program Files\Python\Python37\lib\multiprocessing\pool.py", line 657, in get
raise self._value
NameError: name 's' is not defined
You can pass your variable along with each item in somelist:
import multiprocessing as mp

def worker1(p):
    n, s = p
    k = n * 3
    print(k)
    print(s)

if __name__ == '__main__':
    somelist = [1, 2, 3]
    s = "TESTING"
    p = mp.Pool(1)
    p.map(worker1, [(n, s) for n in somelist])
Each tuple (n, s) gets passed to worker1 as p, where it is unpacked back into n and s.
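If you'd rather not change the worker's signature at all, another common pattern (a sketch, not from the original answer) is to hand s to each worker process once via a Pool initializer, which stores it in a module-level global:

import multiprocessing as mp

def init_worker(value):
    # runs once in each worker process; makes s visible inside worker1
    global s
    s = value

def worker1(n):
    k = n * 3
    print(k)
    print(s)

if __name__ == '__main__':
    somelist = [1, 2, 3]
    p = mp.Pool(1, initializer=init_worker, initargs=("TESTING",))
    p.map(worker1, somelist)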

multiprocessing pool manager namespace EOF error

When I use a Manager.Namespace to share a pandas DataFrame, and each target function calls .sample(5000) on that DataFrame, an EOFError occurs.
def get_sample(i):
    print("start round {}".format(i))
    sample = sharedData.data.sample(5000, random_state=i)

if __name__ == '__main__':
    with mp.Pool(cpu_count(logical=False)) as pool0:
        results = pool0.map(load_data, paths)
        sharedData.data = pd.concat(results, axis=0, copy=False)
        genes = sharedData.data.columns
        pool0.close()
        pool0.join()
    del results

    """sampling"""
    with mp.Pool(cpu_count(logical=True)) as pool:
        print("start sampling, total round = {}".format(1000))
        r = pool.map_async(get_sample, [j for j in range(1000)], error_callback=my_error)
        results2 = r.get()
        pool.close()
        pool.join()
which has traceback:
start round 145
round35 returns output
round18 returns output
rount161 returns output
start round 704
start round 720
start round 736
start round 752
start round 768
start round 784
start round 800
start round 816
start round 832
start round 848
start round 864
start round 880
start round 896
start round 912
start round 928
start round 944
start round 960
start round 976
start round 992
from error_callback:
multiprocessing.pool.RemoteTraceback:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "sampling2temp.py", line 38, in get_sample_ys
sample = sharedData.data.sample(5000, random_state=i)
File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/managers.py", line 1060, in __getattr__
return callmethod('__getattribute__', (key,))
File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
kind, result = conn.recv()
File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "sampling2temp.py", line 105, in <module>
results2 = r.get()
File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/pool.py", line 608, in get
raise self._value
EOFError
It seems like tasks 704 to 992 don't return any output at all, and then the Manager process shuts down. So when one of the running tasks reads data from manager.Namespace.data, it receives EOF.
By the way, if I change sample(5000) to sample(2500) and reduce the size of Manager.Namespace.data from 2127096024 bytes to 1738281624 bytes, there's no EOF problem. Is that because each worker uses too much memory?
A multiprocessing.Connection receiver throws EOFError if all of the associated sender Connections have been closed.
It looks like multiprocessing.Manager is using multiprocessing.Connection under the hood, based on the stack trace. Since your code doesn't appear to terminate the manager process prematurely, the problem must be that the manager process is hitting an exception and terminating before you are done with it. Since reducing the sample size seems to fix the problem, it's possible the Manager process is being killed by the OOM killer for using too much memory; you can check whether that was the case with the command suggested in the linked article:
dmesg | egrep -i "killed process"
You'd expect to see something like this:
host kernel: Out of Memory: Killed process 1234 (python).
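To see the EOFError mechanism in isolation, here is a minimal sketch (not from the original answer): a Connection's receiving end raises EOFError once every sending end has been closed and the buffer is drained.

import multiprocessing as mp

receiver, sender = mp.Pipe(duplex=False)
sender.send("last message")
sender.close()              # all sending ends are now closed

print(receiver.recv())      # -> "last message"
try:
    receiver.recv()         # nothing buffered and no senders left
except EOFError:
    print("EOFError: the sending side is gone")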

Ways to handle exceptions in Dask distributed

I'm having a lot of success using Dask and Distributed to develop data analysis pipelines. One thing that I'm still looking forward to improving, however, is the way I handle exceptions.
Right now, if I write the following
def my_function(value):
    return 1 / value

results = (dask.bag
    .from_sequence(range(-10, 10))
    .map(my_function))

print(results.compute())
... then on running the program I get a long, long list of tracebacks (one per worker, I'm guessing). The most relevant segment is:
distributed.utils - ERROR - division by zero
Traceback (most recent call last):
File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/distributed/utils.py", line 193, in f
result[0] = yield gen.maybe_future(func(*args, **kwargs))
File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/distributed/client.py", line 1473, in _get
result = yield self._gather(packed)
File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/distributed/client.py", line 923, in _gather
st.traceback)
File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/mnt/lustrefs/work/aurelien.mazurie/test_dask/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/dask/bag/core.py", line 1411, in reify
File "test.py", line 9, in my_function
return 1 / value
ZeroDivisionError: division by zero
Here, of course, a visual inspection tells me that the error was a division by zero. What I'm wondering is whether there is a better way to track these errors. For example, I can't seem to catch the exception itself:
import dask.bag
import distributed

try:
    dask_scheduler = "127.0.0.1:8786"
    dask_client = distributed.Client(dask_scheduler)

    def my_function(value):
        return 1 / value

    results = (dask.bag
        .from_sequence(range(-10, 10))
        .map(my_function))

    #dask_client.persist(results)
    print(results.compute())

except Exception as e:
    print("error: %s" % e)
EDIT: Note that in my example I'm using distributed, not just dask. There is a dask-scheduler listening on port 8786 with four dask-worker processes registered to it.
This code will produce the exact same output as above, meaning that I'm not actually catching the exception with my try/except block.
Now, since we're talking about distributed tasks across a cluster, it is obviously non-trivial to propagate exceptions back to me. Is there any guideline for doing so? Right now my solution is to have functions return both a result and an optional error message, then process the results and error messages separately:
def my_function(value):
    try:
        return {"result": 1 / value, "error": None}
    except ZeroDivisionError:
        return {"result": None, "error": "boom!"}

results = (dask.bag
    .from_sequence(range(-10, 10))
    .map(my_function))

dask_client.persist(results)

errors = (results
    .pluck("error")
    .filter(lambda x: x is not None)
    .compute())
print(errors)

results = (results
    .pluck("result")
    .filter(lambda x: x is not None)
    .compute())
print(results)
This works, but I'm wondering if I'm sandblasting the soup cracker here. EDIT: Another option would be to use something like a Maybe monad, but once again I'd like to know if I'm overthinking it.
Dask automatically packages up exceptions that occurred remotely and reraises them locally. Here is what I get when I run your example:
In [1]: from dask.distributed import Client

In [2]: client = Client('localhost:8786')

In [3]: import dask.bag

In [4]: try:
   ...:     def my_function (value):
   ...:         return 1 / value
   ...:
   ...:     results = (dask.bag
   ...:         .from_sequence(range(-10, 10))
   ...:         .map(my_function))
   ...:
   ...:     print(results.compute())
   ...:
   ...: except Exception as e:
   ...:     import pdb; pdb.set_trace()
   ...:     print("error: %s" % e)
   ...:
distributed.utils - ERROR - division by zero
> <ipython-input-4-17aa5fbfb732>(13)<module>()
-> print("error: %s" % e)
(Pdb) pp e
ZeroDivisionError('division by zero',)
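If you want to find out which inputs failed rather than just catching the first exception, one option (a sketch, not from the original answer, assuming the same client connection as above) is to submit the work as futures with client.map and inspect each future's status:

from distributed import wait

futures = client.map(my_function, range(-10, 10))
wait(futures)   # block until every task has either finished or errored

failed = [f for f in futures if f.status == 'error']
for f in failed:
    print(f.key, "->", repr(f.exception()))

good = [f.result() for f in futures if f.status == 'finished']
print(good)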
You could wrap your function like so:
def exception_handler(orig_func):
    def wrapper(*args, **kwargs):
        try:
            return orig_func(*args, **kwargs)
        except:
            import sys
            sys.exit(1)
    return wrapper
You could use it as a decorator, or do:

wrapped = exception_handler(my_function)
dask_client.map(wrapped, range(100))
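The decorator form mentioned above would presumably look like this:

@exception_handler
def my_function(value):
    return 1 / value

dask_client.map(my_function, range(100))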
This seems to automatically rebalance tasks if a worker fails. But I don't know how to remove the failed worker from the pool.

imap_unordered() hangs up if iterable throws an error

Consider a file sample.py containing the following code:
from multiprocessing import Pool

def sample_worker(x):
    print "sample_worker processes item", x
    return x

def get_sample_sequence():
    for i in xrange(2,30):
        if i % 10 == 0:
            raise Exception('That sequence is corrupted!')
    yield i

if __name__ == "__main__":
    pool = Pool(24)
    try:
        for x in pool.imap_unordered(sample_worker, get_sample_sequence()):
            print "sample_worker returned value", x
    except:
        print "Outer exception caught!"
    pool.close()
    pool.join()
    print "done"
When I execute it, I get the following output:
Exception in thread Thread-2:
Traceback (most recent call last):
File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "C:\Python27\lib\threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "C:\Python27\lib\multiprocessing\pool.py", line 338, in _handle_tasks
for i, task in enumerate(taskseq):
File "C:\Python27\lib\multiprocessing\pool.py", line 278, in <genexpr>
self._taskqueue.put((((result._job, i, func, (x,), {})
File "C:\Users\renat-nasyrov\Desktop\sample.py", line 10, in get_sample_sequence
raise Exception('That sequence is corrupted!')
Exception: That sequence is corrupted!
After that, the application hangs. How can I handle this situation without it hanging?
As septi mentioned, your indentation is (still) wrong. Indent the yield statement so that i is within its scope. I am not entirely sure what actually happens in your code, but yielding a variable that is out of scope does not seem like a good idea:
from multiprocessing import Pool

def sample_worker(x):
    print "sample_worker processes item", x
    return x

def get_sample_sequence():
    for i in xrange(2,30):
        if i % 10 == 0:
            raise Exception('That sequence is corrupted!')
        yield i # fixed

if __name__ == "__main__":
    pool = Pool(24)
    try:
        for x in pool.imap_unordered(sample_worker, get_sample_sequence()):
            print "sample_worker returned value", x
    except:
        print "Outer exception caught!"
    pool.close()
    pool.join()
    print "done"
To handle exceptions in your generator, you may use a wrapper such as this one:
import logging

def robust_generator():
    try:
        for i in get_sample_sequence():
            logging.debug("yield " + str(i))
            yield i
    except Exception, e:
        logging.exception(e)
        raise StopIteration()
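You would then presumably pass the wrapper to the pool in place of the raw generator, for example:

for x in pool.imap_unordered(sample_worker, robust_generator()):
    print "sample_worker returned value", x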

Handling exceptions in while loop - Python

Here is my code (almost full version for #cdhowie :)):
def getResult(method, argument=None):
    result = None
    while True:
        print('### loop')
        try:
            print('### try hard...')
            if argument:
                result = method(argument)
            else:
                result = method()
            break
        except Exception as e:
            print('### GithubException')
            if 403 == e.status:
                print('Warning: ' + str(e.data))
                print('I will try again after 10 minutes...')
            else:
                raise e
    return result

def getUsernames(locations, gh):
    usernames = set()
    for location in locations:
        print location
        result = getResult(gh.legacy_search_users, location)
        for user in result:
            usernames.add(user.login)
            print user.login,
    return usernames

# "main.py"
gh = Github()
locations = ['Washington', 'Berlin']
# "main.py", line 12 is below
usernames = getUsernames(locations, gh)
The problem is that the exception is raised, but I can't handle it. Here is the output:
### loop
### try hard...
Traceback (most recent call last):
File "main.py", line 12, in <module>
usernames = getUsernames(locations, gh)
File "/home/ciembor/projekty/github-rank/functions.py", line 39, in getUsernames
for user in result:
File "/usr/lib/python2.7/site-packages/PyGithub-1.8.0-py2.7.egg/github/PaginatedList.py", line 33, in __iter__
newElements = self.__grow()
File "/usr/lib/python2.7/site-packages/PyGithub-1.8.0-py2.7.egg/github/PaginatedList.py", line 45, in __grow
newElements = self._fetchNextPage()
File "/usr/lib/python2.7/site-packages/PyGithub-1.8.0-py2.7.egg/github/Legacy.py", line 37, in _fetchNextPage
return self.get_page(page)
File "/usr/lib/python2.7/site-packages/PyGithub-1.8.0-py2.7.egg/github/Legacy.py", line 48, in get_page
None
File "/usr/lib/python2.7/site-packages/PyGithub-1.8.0-py2.7.egg/github/Requester.py", line 69, in requestAndCheck
raise GithubException.GithubException(status, output)
github.GithubException.GithubException: 403 {u'message': u'API Rate Limit Exceeded for 11.11.11.11'}
Why doesn't it print ### GithubException?
Take a close look at the stack trace in the exception:
Traceback (most recent call last):
File "main.py", line 12, in <module>
usernames = getUsernames(locations, gh)
File "/home/ciembor/projekty/github-rank/functions.py", line 39, in getUsernames
for user in result:
File "/usr/lib/python2.7/site-packages/PyGithub-1.8.0-py2.7.egg/github/PaginatedList.py", line 33, in __iter__
newElements = self.__grow()
...
The exception is being thrown from code called by the line for user in result:, after getResult has finished executing. This means the API you're using does lazy evaluation, so the actual API request doesn't happen when you expect it to.
In order to catch and handle this exception, you'll need to wrap the code inside the getUsernames function with a try/except handler.
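A minimal sketch of what that could look like (hypothetical, not from the original answer; it simply retries the whole iteration for a location after a 403, which may refetch pages already seen):

import time
from github.GithubException import GithubException

def getUsernames(locations, gh):
    usernames = set()
    for location in locations:
        while True:
            try:
                result = getResult(gh.legacy_search_users, location)
                for user in result:          # pages are fetched lazily here
                    usernames.add(user.login)
                break
            except GithubException as e:
                if e.status == 403:
                    print('Rate limited, waiting 10 minutes before retrying...')
                    time.sleep(600)
                else:
                    raise
    return usernames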
