I'm having a lot of success using Dask and Distributed to develop data analysis pipelines. One thing that I'm still looking forward to improving, however, is the way I handle exceptions.
Right now, if I write the following
def my_function (value):
    return 1 / value

results = (dask.bag
    .from_sequence(range(-10, 10))
    .map(my_function))

print(results.compute())
... then on running the program I get a long, long list of tracebacks (one per worker, I'm guessing). The most relevant segment being
distributed.utils - ERROR - division by zero
Traceback (most recent call last):
  File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/distributed/utils.py", line 193, in f
    result[0] = yield gen.maybe_future(func(*args, **kwargs))
  File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/gen.py", line 1021, in run
    yielded = self.gen.throw(*exc_info)
  File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/distributed/client.py", line 1473, in _get
    result = yield self._gather(packed)
  File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/tornado/gen.py", line 1021, in run
    yielded = self.gen.throw(*exc_info)
  File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/distributed/client.py", line 923, in _gather
    st.traceback)
  File "/Users/ajmazurie/test/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/mnt/lustrefs/work/aurelien.mazurie/test_dask/.env/pyenv-3.6.0-default/lib/python3.6/site-packages/dask/bag/core.py", line 1411, in reify
  File "test.py", line 9, in my_function
    return 1 / value
ZeroDivisionError: division by zero
Here, of course, a visual inspection tells me that the error was a division by zero. What I'm wondering is whether there is a better way to track these errors. For example, I can't seem to catch the exception itself:
import dask.bag
import distributed

try:
    dask_scheduler = "127.0.0.1:8786"
    dask_client = distributed.Client(dask_scheduler)

    def my_function (value):
        return 1 / value

    results = (dask.bag
        .from_sequence(range(-10, 10))
        .map(my_function))

    #dask_client.persist(results)
    print(results.compute())

except Exception as e:
    print("error: %s" % e)
EDIT: Note that in my example I'm using distributed, not just dask. There is a dask-scheduler listening on port 8786 with four dask-worker processes registered to it.
This code will produce the exact same output as above, meaning that I'm not actually catching the exception with my try/except block.
Now, since we're talking about distributed tasks across a cluster, it is obviously non-trivial to propagate exceptions back to me. Is there any guideline for doing so? Right now my solution is to have functions return both a result and an optional error message, then process the results and error messages separately:
def my_function (value):
    try:
        return {"result": 1 / value, "error": None}
    except ZeroDivisionError:
        return {"result": None, "error": "boom!"}

results = (dask.bag
    .from_sequence(range(-10, 10))
    .map(my_function))

dask_client.persist(results)

errors = (results
    .pluck("error")
    .filter(lambda x: x is not None)
    .compute())

print(errors)

results = (results
    .pluck("result")
    .filter(lambda x: x is not None)
    .compute())

print(results)
This works, but I'm wondering if I'm sandblasting the soup cracker here. EDIT: Another option would be to use something like a Maybe monad, but once again I'd like to know if I'm overthinking it.
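To make that last idea concrete, here is roughly the kind of Maybe-style wrapper I have in mind — a minimal sketch for illustration only (the Maybe class and the maybe() decorator are hypothetical helpers I wrote for this example, not anything provided by dask):

import dask.bag

def my_function(value):
    return 1 / value

class Maybe:
    """Holds either a computed value or the exception that prevented it."""
    def __init__(self, value=None, error=None):
        self.value = value
        self.error = error

def maybe(func):
    # Wrap a function so it returns a Maybe instead of raising.
    def wrapped(*args, **kwargs):
        try:
            return Maybe(value=func(*args, **kwargs))
        except Exception as e:
            return Maybe(error=e)
    return wrapped

results = (dask.bag
    .from_sequence(range(-10, 10))
    .map(maybe(my_function)))

errors = results.filter(lambda m: m.error is not None).map(lambda m: str(m.error)).compute()
values = results.filter(lambda m: m.error is None).map(lambda m: m.value).compute()

print(errors)   # ["division by zero"]
print(values)   # the 19 successful results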
Dask automatically packages up exceptions that occurred remotely and reraises them locally. Here is what I get when I run your example
In [1]: from dask.distributed import Client
In [2]: client = Client('localhost:8786')
In [3]: import dask.bag
In [4]: try:
   ...:     def my_function (value):
   ...:         return 1 / value
   ...:
   ...:     results = (dask.bag
   ...:         .from_sequence(range(-10, 10))
   ...:         .map(my_function))
   ...:
   ...:     print(results.compute())
   ...:
   ...: except Exception as e:
   ...:     import pdb; pdb.set_trace()
   ...:     print("error: %s" % e)
   ...:
distributed.utils - ERROR - division by zero
> <ipython-input-4-17aa5fbfb732>(13)<module>()
-> print("error: %s" % e)
(Pdb) pp e
ZeroDivisionError('division by zero',)
You could wrap your function like so:
def exception_handler(orig_func):
    def wrapper(*args, **kwargs):
        try:
            return orig_func(*args, **kwargs)
        except:
            import sys
            sys.exit(1)
    return wrapper
You could use a decorator or do:
wrapped = exception_handler(my_function)
dask_client.map(wrapped, range(100))
This seems to automatically rebalance tasks if a worker fails. But I don't know how to remove the failed worker from the pool.
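If you prefer the decorator form, it would look something like this (a minimal sketch; exception_handler is the wrapper defined above and my_function is just the example function from the question, with dask_client being an existing distributed.Client):

@exception_handler
def my_function(value):
    return 1 / value

# Any task that raises will now terminate its worker via sys.exit(1);
# client.map submits one task per element of the sequence.
futures = dask_client.map(my_function, range(100))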
Related
Hello fellow programmers!
I am trying to implement multiprocessing in a class to reduce the processing time of a program.
This is an abbreviation of the program:
import multiprocessing as mp
from functools import partial

class PlanningMachines():
    def __init__(self, machines, number_of_objectives, topology=False, episodes=None):
        ....

    def calculate_total_node_THD_func_real_data_with_topo(self):
        self.consider_topology = True
        func_part = partial(self.worker_function, consider_topology=self.consider_topology,
                            list_of_machines=self.list_of_machines, next_state=self.next_state, phase=phase, grid_topo=self.grid_topo,
                            total_THD_for_all_timesteps_with_topo=total_THD_for_all_timesteps_with_topo,
                            smallest_harmonic=smallest_harmonic, pol2cart=self.pol2cart, cart2pol=self.cart2pol,
                            total_THD_for_all_timesteps=total_THD_for_all_timesteps, harmonics_state_phase=harmonics_state_phase,
                            episode=self.episode, episodes=self.episodes, time_=self.time_, steplength=self.steplength,
                            longest_measurement=longest_measurement)
        with mp.Pool() as mpool:
            mpool.map(func_part, range(0, longest_measurement))

    def worker_function(measurement=None, consider_topology=None, list_of_machines=None, next_state=None, phase=None,
                        grid_topo=None, total_THD_for_all_timesteps_with_topo=None, smallest_harmonic=None, pol2cart=None,
                        cart2pol=None, total_THD_for_all_timesteps=None, harmonics_state_phase=None, episode=None,
                        episodes=None, time_=None, steplength=None, longest_measurement=None):
        .....
As you might know, one way of implementing parallel processing is using multiprocessing.Pool().map:
with mp.Pool() as mpool:
    mpool.map(func_part, range(0, longest_measurement))
This function requires a worker_function which can be "packed" with functools.partial:
func_part = partial(self.worker_function, consider_topology=self.consider_topology,
                    list_of_machines=self.list_of_machines, next_state=self.next_state, phase=phase, grid_topo=self.grid_topo,
                    total_THD_for_all_timesteps_with_topo=total_THD_for_all_timesteps_with_topo,
                    smallest_harmonic=smallest_harmonic, pol2cart=self.pol2cart, cart2pol=self.cart2pol,
                    total_THD_for_all_timesteps=total_THD_for_all_timesteps, harmonics_state_phase=harmonics_state_phase,
                    episode=self.episode, episodes=self.episodes, time_=self.time_, steplength=self.steplength,
                    longest_measurement=longest_measurement)
The error is thrown when I try to execute mpool.map(func_part, range(0, longest_measurement)):
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\Artur\Anaconda\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\Artur\Anaconda\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
TypeError: worker_function() got multiple values for argument 'consider_topology'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:/Users/Artur/Desktop/RL_framework/train.py", line 87, in <module>
    main()
  File "C:/Users/Artur/Desktop/RL_framework/train.py", line 77, in main
    duration = cf.training(episodes, env, agent, filename, topology=topology, multi_processing=multi_processing, CPUs_used=CPUs_used)
  File "C:\Users\Artur\Desktop\RL_framework\help_functions\custom_functions.py", line 166, in training
    save_interval = parallel_training(range(episodes), env, agent, log_data_qvalues, log_data, filename, CPUs_used)
  File "C:\Users\Artur\Desktop\RL_framework\help_functions\custom_functions.py", line 54, in parallel_training
    next_state, reward = env.step(action, state) # given the action, the environment gives back the next_state and the reward for the transaction for all objectives seperately
  File "C:\Users\Artur\Desktop\RL_framework\help_functions\environment_machines.py", line 127, in step
    self.calculate_total_node_THD_func_real_data_with_topo() # THD_plant calculation with considering grid topo
  File "C:\Users\Artur\Desktop\RL_framework\help_functions\environment_machines.py", line 430, in calculate_total_node_THD_func_real_data_with_topo
    mpool.map(func_part, range(longest_measurement))
  File "C:\Users\Artur\Anaconda\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\Artur\Anaconda\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
TypeError: worker_function() got multiple values for argument 'consider_topology'

Process finished with exit code 1
How can consider_topology have multiple values if it is passed right before the worker_function:
self.consider_topology = True
I hope I have described my issue well enough for you to understand. Thank you in advance.
The problem, I think, is that your worker_function should be a static method.
What happens now is that you provide all values except the measurement variable in the partial call, presumably because that is the one value you are changing.
However, since it is an instance method, Python automatically passes the instance itself as the first argument. You did not define self as the first parameter of worker_function, so the class instance ends up being bound to measurement. Each value from the range(0, longest_measurement) you pass to the map call is then inserted as the second positional argument. Since consider_topology is the second parameter, the function sees two values supplied for it: one from the partial call and one from the map call.
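A minimal, stripped-down sketch of the static-method fix (most of the arguments from your snippet are omitted here just to keep the example runnable; the names are taken from your code):

import multiprocessing as mp
from functools import partial

class PlanningMachines:
    def __init__(self):
        self.consider_topology = True

    # With @staticmethod no instance is injected as the first argument, so
    # `measurement` really receives the values coming from the map call.
    @staticmethod
    def worker_function(measurement=None, consider_topology=None):
        return measurement, consider_topology

    def calculate_total_node_THD_func_real_data_with_topo(self, longest_measurement):
        func_part = partial(self.worker_function, consider_topology=self.consider_topology)
        with mp.Pool() as mpool:
            return mpool.map(func_part, range(longest_measurement))

if __name__ == "__main__":
    print(PlanningMachines().calculate_total_node_THD_func_real_data_with_topo(4))
    # [(0, True), (1, True), (2, True), (3, True)]

Alternatively, keep worker_function as a regular method, add self as its first parameter, and leave the rest of your code as it is.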
I want to process MongoDB documents in batches of 1000 using multiprocessing. However, the code snippet below raises TypeError: zip argument #1 must support iteration.
Code:
def documents_processing(skip):
    conn = get_connection()
    db = conn["db_name"]
    print("Process::{} -- db.Transactions.find(no_cursor_timeout=True).skip({}).limit(10000)".format(os.getpid(), skip))
    documents = db.Transactions.find(no_cursor_timeout=True).skip(skip).limit(10000)
    # Do some processing in mongodb

max_workers = 20

def skip_list():
    for i in range(0, 100000, 10000):
        yield [j for j in range(i, i + 10000, 1000)]

def main_f():
    try:
        with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
            executor.map(documents_processing, skip_list)
    except Exception:
        print("exception:", traceback.format_exc())

main_f()
Error traceback:
(rpc_venv) [user@localhost ver2_mt]$ python main_mongo_v3.py
exception: Traceback (most recent call last):
  File "main_mongo_v3.py", line 113, in main_f
    executor.map(documents_processing, skip_list)
  File "/usr/lib64/python3.6/concurrent/futures/process.py", line 496, in map
    timeout=timeout)
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 575, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 575, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/usr/lib64/python3.6/concurrent/futures/process.py", line 137, in _get_chunks
    it = zip(*iterables)
TypeError: zip argument #1 must support iteration
How to fix this error? Thanks.
Invoke the skip_list function to return the generator.
Currently, you're passing a function as the second argument and not an iterable.
executor.map(documents_processing, skip_list())
Since you're retrieving 10k documents in each process starting at n, you can declare skip_list as:
def skip_list():
    for i in range(0, 100000, 10000):
        yield i
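Putting both changes together, the driver could look something like this (a sketch only; the documents_processing stub below stands in for the real MongoDB work from the question):

import concurrent.futures
import traceback

def documents_processing(skip):
    # Stand-in for the real MongoDB processing from the question.
    print("processing batch starting at", skip)

def skip_list():
    # One skip offset per batch of 10,000 documents.
    for i in range(0, 100000, 10000):
        yield i

def main_f():
    try:
        with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
            # skip_list() (called) returns a generator; skip_list alone is just the function object.
            executor.map(documents_processing, skip_list())
    except Exception:
        print("exception:", traceback.format_exc())

if __name__ == "__main__":
    main_f()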
For some reason,
result = pool.map(_delete_load_log,list(logs_to_delete))
is now giving me an 'int' object is not iterable error.
As per the output of the list shown below, logs_to_delete is clearly a list (I added list() to see if it changed anything; it didn't). This worked earlier, but I can't track down what changed to break it. Any ideas?
mapping function:
def _delete_load_log(load_log_id):
    logging.debug('** STARTING API CALL FOR: ' + get_function_name())
    input_args = "/15/" + str(load_log_id)
    logging.debug('url: ' + podium_url + '\nusername: ' + podium_username)
    podium = podiumapi.Podium('https://xx/podium','podium','podium')
    #podium = podiumapi.Podium(podium_url,podium_username,podium_password)
    data = None
    response_code = 0
    try:
        api_url = PodiumApiUrl(podium_url,input_args)
        (response_code,data) = podium._podium_curl_setopt_put(api_url)
        if not data[0]['hasErrors']:
            return data[0]['id']
        elif data[0]['hasErrors']:
            raise Exception("Errors detected on delete")
        else:
            raise Exception('Unmanaged exception while retrieving entity load status.')
    except Exception as err:
        raise Exception(str(err))
File "c:\Repos\Tools\podium-dataload\scripts\podiumlogdelete.py", line 69, in _delete_source_load_logs_gte_epoch
deleted_load_ids = _delete_logs_in_parallel(load_logs_to_delete)
File "c:\Repos\Tools\podium-dataload\scripts\podiumlogdelete.py", line 85, in _delete_logs_in_parallel
result = pool.map(_delete_load_log,logs_to_delete)
File "C:\Python27\lib\multiprocessing\pool.py", line 253, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Python27\lib\multiprocessing\pool.py", line 572, in get
raise self._value
Exception: 'int' object is not iterable
output of list:
[154840, 154841, 154842, 154843, 154844, 154845, 154846, 154847, 154848, 154849, 154850, 154851, 154852, 154853, 154854, 154855, 154856, 154857, 154858, 154859, 154860, 154861, 154862, 154863, 154864, 154865, 154866, 154867, 154868, 154869, 154870, 154871, 154872, 154873, 154874, 154875, 154876, 154877, 154878, 154879, 154880, 154881, 154882, 154883, 154884, 154885, 154886, 154887, 154888, 154889, 154890, 154891, 154892, 154893, 154894, 154895, 154896, 154897, 154898, 154899, 154900, 154901, 154902, 154903, 154904, 154905, 154906, 154907, 154908, 154909, 154910, 154911, 154912, 154913, 154914, 154915, 154916, 154917, 154918, 154919, 154920, 154921, 154922, 154923, 154924, 154925, 154926, 154927, 154928, 154929, 154930, 154931, 154932, 154933, 154934, 154935, 154936, 154937, 154938, 154939]
Your problem is clear from the traceback. It's not caused by the iterable in Pool.map(); otherwise the exception would be raised from this line of the Python source:
iterable = list(iterable)
Here the exception is raised from
File "C:\Python27\lib\multiprocessing\pool.py", line 253, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Python27\lib\multiprocessing\pool.py", line 572, in get
raise self._value
This is because your _delete_load_log() raised some exception, and Pool.map re-raises it. See https://github.com/python/cpython/blob/2.7/Lib/multiprocessing/pool.py
In other words, Exception: 'int' object is not iterable is not from the python library part, it's from your own function _delete_load_log().
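If you want to see where inside _delete_load_log the error actually originates, one option is to capture the worker-side traceback yourself before re-raising. A minimal sketch (the real API-call logic from the question would replace the stub body; the simulated failure is only for illustration):

import traceback
from multiprocessing import Pool

def _delete_load_log(load_log_id):
    try:
        # ... the real API-call logic from the question goes here ...
        if load_log_id == 154841:   # simulate a failure for illustration
            raise ValueError("boom")
        return load_log_id
    except Exception:
        # Re-raise with the full worker-side traceback attached, so the parent
        # process sees more than just the bare exception message.
        raise Exception("worker failed for %s:\n%s" % (load_log_id, traceback.format_exc()))

if __name__ == "__main__":
    pool = Pool()
    print(pool.map(_delete_load_log, [154840, 154841]))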
In all honesty, just take a close look at
#podium = podiumapi.Podium(podium_url,podium_username,podium_password)
data = None
response_code = 0
try:
response_code is initialized to 0 and data to None; if the call inside the try block fails before they are assigned, any later code that assumes they hold valid values fails in a confusing way, and you end up patching around it afterwards.
I'm playing with multiprocessing in Python. I'm trying to determine what happens if a worker raises an exception, so I wrote the following code:
from multiprocessing import Pool

def a(num):
    if num == 2:
        raise Exception("num can't be 2")
    print(num)

p = Pool()
p.map(a, [2, 1, 3, 4, 5, 6, 7, 100, 100000000000000, 234, 234, 5634, 0000])
output
3
4
5
7
6
100
100000000000000
234
234
5634
0
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "<stdin>", line 3, in a
Exception: Error, num can't be 2
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
Exception: Error, num can't be 2
As you can see from the numbers that were printed, 2 is not there, but why is the number 1 not there either?
Note: I'm using Python 3.5.2 on Ubuntu
By default, Pool creates a number of workers equal to your number of cores. When one of those worker processes dies, it may leave work that has been assigned to it undone. It also may leave output in a buffer that never gets flushed.
The pattern with .map() is to handle exceptions in the workers and return some suitable error value, since the results of .map() are supposed to be one-to-one with the input.
from multiprocessing import Pool

def a(num):
    try:
        if num == 2:
            raise Exception("num can't be 2")
        print(num, flush=True)
        return num
    except Exception as e:
        print('failed', flush=True)
        return e

p = Pool()
n = 100
results = p.map(a, range(n))
print("missing numbers: ", tuple(i for i in range(n) if i not in results))
Here's another question with good information about how exceptions propagate in multiprocessing.map workers.
Consider the file sample.py contains the following code:
from multiprocessing import Pool

def sample_worker(x):
    print "sample_worker processes item", x
    return x

def get_sample_sequence():
    for i in xrange(2,30):
        if i % 10 == 0:
            raise Exception('That sequence is corrupted!')
    yield i

if __name__ == "__main__":
    pool = Pool(24)
    try:
        for x in pool.imap_unordered(sample_worker, get_sample_sequence()):
            print "sample_worker returned value", x
    except:
        print "Outer exception caught!"
    pool.close()
    pool.join()
    print "done"
When I execute it, I get the following output:
Exception in thread Thread-2:
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "C:\Python27\lib\multiprocessing\pool.py", line 338, in _handle_tasks
    for i, task in enumerate(taskseq):
  File "C:\Python27\lib\multiprocessing\pool.py", line 278, in <genexpr>
    self._taskqueue.put((((result._job, i, func, (x,), {})
  File "C:\Users\renat-nasyrov\Desktop\sample.py", line 10, in get_sample_sequence
    raise Exception('That sequence is corrupted!')
Exception: That sequence is corrupted!
After that, the application hangs. How can I handle the situation without it hanging?
As septi mentioned, your indentation is (still) wrong. Indent the yield statement so that i is within its scope. I am not entirely sure what actually happens in your code, but yielding a variable that is out of scope does not seem like a good idea:
from multiprocessing import Pool

def sample_worker(x):
    print "sample_worker processes item", x
    return x

def get_sample_sequence():
    for i in xrange(2,30):
        if i % 10 == 0:
            raise Exception('That sequence is corrupted!')
        yield i  # fixed

if __name__ == "__main__":
    pool = Pool(24)
    try:
        for x in pool.imap_unordered(sample_worker, get_sample_sequence()):
            print "sample_worker returned value", x
    except:
        print "Outer exception caught!"
    pool.close()
    pool.join()
    print "done"
To handle exceptions in your generator, you may use a wrapper such as this one:
import logging

def robust_generator():
    try:
        for i in get_sample_sequence():
            logging.debug("yield " + str(i))
            yield i
    except Exception, e:
        logging.exception(e)
        raise StopIteration()
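Then pass the wrapped generator to the pool instead of the raw one. For example (a sketch only, reusing sample_worker and Pool from the code above and keeping the same Python 2 syntax):

if __name__ == "__main__":
    pool = Pool(24)
    # robust_generator() swallows the corruption error and simply stops,
    # so the task-handler thread no longer dies and the pool does not hang.
    for x in pool.imap_unordered(sample_worker, robust_generator()):
        print "sample_worker returned value", x
    pool.close()
    pool.join()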