Multiprocessing Robust to Occasional Failures - python

I have 100-1000 timeseries paths and a fairly expensive simulation that I'd like to parallelize. However, the library I'm using hangs on rare occasions, and I'd like to make the run robust to those issues. This is the current setup:
with Pool() as pool:
    res = pool.map_async(simulation_that_occasionally_hangs, (p for p in paths))
    all_costs = res.get()
I know get() has a timeout parameter, but if I understand correctly it applies to the whole batch of 1000 paths. What I'd like instead is to check whether any single simulation is taking longer than 5 minutes (a normal path takes 4 seconds), and if so just stop that path and continue to get() the rest.
EDIT:
Testing timeout in pebble
from concurrent.futures import TimeoutError
from pebble import ProcessPool

def fibonacci(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return fibonacci(n - 1) + fibonacci(n - 2)

def main():
    with ProcessPool() as pool:
        future = pool.map(fibonacci, range(40), timeout=10)
        iterator = future.result()
        all = []
        while True:
            try:
                all.append(next(iterator))
            except StopIteration:
                break
            except TimeoutError as e:
                print(f'function took longer than {e.args[1]} seconds')
        print(all)

if __name__ == '__main__':
    main()
Errors:
RuntimeError: I/O operations still in flight while destroying Overlapped object, the process may crash
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\anaconda3\lib\multiprocessing\spawn.py", line 99, in spawn_main
    new_handle = reduction.steal_handle(parent_pid, pipe_handle)
  File "C:\anaconda3\lib\multiprocessing\reduction.py", line 87, in steal_handle
    _winapi.DUPLICATE_SAME_ACCESS | _winapi.DUPLICATE_CLOSE_SOURCE)
PermissionError: [WinError 5] Access is denied

The pebble library was designed to address these kinds of issues: it transparently handles job timeouts and failures such as C library crashes.
You can check the documentation examples to see how to use it. Its interface is similar to concurrent.futures.
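For the setup in the question, a minimal sketch could look like the following (assuming simulation_that_occasionally_hangs and paths from the question, a 5-minute per-path timeout, and a None placeholder for timed-out paths, which is my own choice):

from concurrent.futures import TimeoutError
from pebble import ProcessPool

with ProcessPool() as pool:
    future = pool.map(simulation_that_occasionally_hangs, paths, timeout=300)
    iterator = future.result()
    all_costs = []
    while True:
        try:
            all_costs.append(next(iterator))
        except StopIteration:
            break
        except TimeoutError as error:
            print('a path took longer than %d seconds, skipping it' % error.args[1])
            all_costs.append(None)  # placeholder so the results stay aligned with the paths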

Probably the easiest way is to run each heavy simulation in a separate subprocess, with the parent process watching it. Specifically:
import multiprocessing
from multiprocessing import Pool

def risky_simulation(path):
    ...

def safe_simulation(path):
    p = multiprocessing.Process(target=risky_simulation, args=(path,))
    p.start()
    p.join(timeout)  # your timeout here
    p.kill()  # or p.terminate()
    # Here read and return the output of the simulation.
    # It can come from a file, or from some communication object
    # between processes from the `multiprocessing` module.

with Pool() as pool:
    res = pool.map_async(safe_simulation, paths)
    all_costs = res.get()
Notes:
If the simulation may hang, you want to run it in a separate process (i.e. the Process object should not be a thread), since depending on how the hang happens it may hold the GIL.
This solution only uses the pool for the immediate sub-processes, but the computations are off-loaded to new processes. We can also make sure the computations share a pool, but that would result in uglier code, so I skipped it.
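To make the "read and return the output" part concrete, one possible sketch uses a multiprocessing.Queue for the child-to-parent hand-off (the queue plumbing and the 300-second timeout are my own additions, not part of the answer above):

import multiprocessing
from queue import Empty

def risky_simulation(path, queue):
    # ... run the expensive simulation here ...
    result = 0.0  # hypothetical cost computed for this path
    queue.put(result)

def safe_simulation(path, timeout=300):
    queue = multiprocessing.Queue(1)
    p = multiprocessing.Process(target=risky_simulation, args=(path, queue))
    p.start()
    try:
        return queue.get(True, timeout)  # wait at most `timeout` seconds for a result
    except Empty:
        return None                      # the simulation hung or crashed
    finally:
        if p.is_alive():
            p.terminate()
        p.join()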

Related

Problems for running concurrent futures process pool in JupyterNotebook [duplicate]

In a nutshell
I get a BrokenProcessPool exception when parallelizing my code with concurrent.futures. No further error is displayed. I want to find the cause of the error and ask for ideas of how to do that.
Full problem
I am using concurrent.futures to parallelize some code.
with ProcessPoolExecutor() as pool:
    mapObj = pool.map(myMethod, args)
I end up with (and only with) the following exception:
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
Unfortunately, the program is complex and the error appears only after the program has run for 30 minutes. Therefore, I cannot provide a nice minimal example.
In order to find the cause of the issue, I wrapped the method that I run in parallel with a try-except-block:
def myMethod(*args):
    try:
        ...
    except Exception as e:
        print(e)
The problem remained the same and the except block was never entered. I conclude that the exception does not come from my code.
My next step was to write a custom ProcessPoolExecutor class that is a child of the original ProcessPoolExecutor and allows me to replace some methods with customized ones. I copied and pasted the original code of the method _process_worker and added some print statements.
def _process_worker(call_queue, result_queue):
    """Evaluates calls from call_queue and places the results in result_queue.
    ...
    """
    while True:
        call_item = call_queue.get(block=True)
        if call_item is None:
            # Wake up queue management thread
            result_queue.put(os.getpid())
            return
        try:
            r = call_item.fn(*call_item.args, **call_item.kwargs)
        except BaseException as e:
            print("??? Exception ???")  # newly added
            print(e)                    # newly added
            exc = _ExceptionWithTraceback(e, e.__traceback__)
            result_queue.put(_ResultItem(call_item.work_id, exception=exc))
        else:
            result_queue.put(_ResultItem(call_item.work_id,
                                         result=r))
Again, the except block is never entered. This was to be expected, because I already ensured that my code does not raise an exception (and if everything worked well, the exception should be passed to the main process).
Now I am lacking ideas how I could find the error. The exception is raised here:
def submit(self, fn, *args, **kwargs):
    with self._shutdown_lock:
        if self._broken:
            raise BrokenProcessPool('A child process terminated '
                                    'abruptly, the process pool is not usable anymore')
        if self._shutdown_thread:
            raise RuntimeError('cannot schedule new futures after shutdown')

        f = _base.Future()
        w = _WorkItem(f, fn, args, kwargs)

        self._pending_work_items[self._queue_count] = w
        self._work_ids.put(self._queue_count)
        self._queue_count += 1
        # Wake up queue management thread
        self._result_queue.put(None)

        self._start_queue_management_thread()
        return f
The process pool is set to be broken here:
def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
    ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader

    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)

        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:  # THIS BLOCK IS ENTERED WHEN THE ERROR OCCURS
            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...
It is (or seems to be) a fact that a process terminates, but I have no clue why. Are my thoughts correct so far? What are possible causes that make a process terminate without a message? (Is this even possible?) Where could I apply further diagnostics? Which questions should I ask myself in order to come closer to a solution?
I am using python 3.5 on 64bit Linux.
I think I was able to get as far as possible:
I changed the _queue_management_worker method in my customized ProcessPoolExecutor module so that the exit code of the failed process is printed:
def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
    ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader

    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)

        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:
            # BLOCK INSERTED FOR DIAGNOSIS ONLY ---------
            vals = list(processes.values())
            for s in ready:
                j = sentinels.index(s)
                print("is_alive()", vals[j].is_alive())
                print("exitcode", vals[j].exitcode)
            # -------------------------------------------

            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...
Afterwards I looked up the meaning of the exit code:
from multiprocessing.process import _exitcode_to_name
print(_exitcode_to_name[my_exit_code])
whereby my_exit_code is the exit code printed by the block I inserted into _queue_management_worker. In my case the code was -11, which means I ran into a segmentation fault. Finding the reason for that is a huge task, but it goes beyond the scope of this question.
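As an aside, a negative exitcode from multiprocessing means the worker was killed by the signal with that number, so the standard signal module can do the same lookup (a small illustration, using the -11 observed above):

import signal

exit_code = -11  # the value printed by the diagnostic block above
if exit_code is not None and exit_code < 0:
    print(signal.Signals(-exit_code).name)  # prints 'SIGSEGV' for -11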
If you are using macOS, there is a known issue with how some versions of macOS use forking in a way that is not considered fork-safe by Python in some scenarios. The workaround that worked for me is to use the no_proxy environment variable.
Edit ~/.bash_profile and include the following (it might be better to specify a list of domains or subnets here instead of *):
no_proxy='*'
Refresh the current context
source ~/.bash_profile
The local versions where I saw the issue and applied the workaround are Python 3.6.0 on macOS 10.14.1 and 10.13.x.
Sources:
Issue 30388
Issue 27126

Getting erratic runtime exceptions trying to access persistent data in multiprocessing.Pool worker processes

Inspired by this solution I am trying to set up a multiprocessing pool of worker processes in Python. The idea is to pass some data to the worker processes before they actually start their work and to reuse it later. It's intended to minimize the amount of data that needs to be packed/unpacked for every call into a worker process (i.e. reducing inter-process communication overhead). My MCVE looks like this:
import multiprocessing as mp
import numpy as np

def create_worker_context():
    global context  # create "global" context in worker process
    context = {}

def init_worker_context(worker_id, some_const_array, DIMS, DTYPE):
    context.update({
        'worker_id': worker_id,
        'some_const_array': some_const_array,
        'tmp': np.zeros((DIMS, DIMS), dtype = DTYPE),
        })  # store context information in global namespace of worker
    return True  # return True, verifying that the worker process received its data

class data_analysis:
    def __init__(self):
        self.DTYPE = 'float32'
        self.CPU_LEN = mp.cpu_count()
        self.DIMS = 100
        self.some_const_array = np.zeros((self.DIMS, self.DIMS), dtype = self.DTYPE)
        # Init multiprocessing pool
        self.cpu_pool = mp.Pool(processes = self.CPU_LEN, initializer = create_worker_context)  # create pool and context in workers
        pool_results = [
            self.cpu_pool.apply_async(
                init_worker_context,
                args = (core_id, self.some_const_array, self.DIMS, self.DTYPE)
                ) for core_id in range(self.CPU_LEN)
            ]  # pass information to workers' context
        result_batches = [result.get() for result in pool_results]  # check if they got the information
        if not all(result_batches):  # raise an error if things did not work
            raise SyntaxError('Workers could not be initialized ...')

    @staticmethod
    def process_batch(batch_data):
        context['tmp'][:,:] = context['some_const_array'] + batch_data  # some fancy computation in worker
        return context['tmp']  # return result

    def process_all(self):
        input_data = np.arange(0, self.DIMS ** 2, dtype = self.DTYPE).reshape(self.DIMS, self.DIMS)
        pool_results = [
            self.cpu_pool.apply_async(
                data_analysis.process_batch,
                args = (input_data,)
                ) for _ in range(self.CPU_LEN)
            ]  # let workers actually work
        result_batches = [result.get() for result in pool_results]
        for batch in result_batches[1:]:
            np.add(result_batches[0], batch, out = result_batches[0])  # reduce batches
        print(result_batches[0])  # show result

if __name__ == '__main__':
    data_analysis().process_all()
I am running the above with CPython 3.6.6.
The strange thing is ... sometimes it works, sometimes it does not. If it does not work, the process_batch method throws an exception, because it can not find some_const_array as a key in context. The full traceback looks like this:
(env) me#box:/path> python so.py
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/path/so.py", line 37, in process_batch
    context['tmp'][:,:] = context['some_const_array'] + batch_data # some fancy computation in worker
KeyError: 'some_const_array'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/so.py", line 54, in <module>
    data_analysis().process_all()
  File "/path/so.py", line 48, in process_all
    result_batches = [result.get() for result in pool_results]
  File "/path/so.py", line 48, in <listcomp>
    result_batches = [result.get() for result in pool_results]
  File "/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
KeyError: 'some_const_array'
I am puzzled. What is going on here?
If my context dictionaries contain an object of "higher type", e.g. a database driver or similar, I am not getting this kind of problem. I can only reproduce this if my context dictionaries contain basic Python data types, collections or numpy arrays.
(Is there a potentially better approach for achieving the same thing in a more reliable manner? I know my approach is considered a hack ...)
You need to relocate the content of init_worker_context into your initializer function create_worker_context.
Your assumption that every single worker process will run init_worker_context is responsible for your confusion.
The tasks you submit to a pool get fed into one internal task queue that all worker processes read from. What happens in your case is that some worker processes complete their task and compete again for new tasks, so one worker process may execute multiple tasks while another one does not get a single one. Keep in mind that the OS schedules runtime for the threads (of the worker processes). A worker that never ran init_worker_context therefore still has an empty context, which is why the KeyError appears only sometimes.
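A rough sketch of that relocation (simplified from the question's code; note that a per-worker worker_id cannot be assigned this way, because every worker runs the same initializer):

import multiprocessing as mp
import numpy as np

def create_worker_context(some_const_array, DIMS, DTYPE):
    global context  # runs once in *every* worker process
    context = {
        'some_const_array': some_const_array,
        'tmp': np.zeros((DIMS, DIMS), dtype=DTYPE),
    }

def process_batch(batch_data):
    context['tmp'][:, :] = context['some_const_array'] + batch_data
    return context['tmp'].copy()  # copy so the worker's buffer can be reused safely

if __name__ == '__main__':
    DIMS, DTYPE = 100, 'float32'
    some_const_array = np.zeros((DIMS, DIMS), dtype=DTYPE)
    cpu_pool = mp.Pool(processes=mp.cpu_count(),
                       initializer=create_worker_context,
                       initargs=(some_const_array, DIMS, DTYPE))
    batches = [np.ones((DIMS, DIMS), dtype=DTYPE)] * mp.cpu_count()
    results = cpu_pool.map(process_batch, batches)  # every worker now finds its context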

Python: How to get hung process to release control back in concurrent.futures pool?

I have parallelized a large CPU-intensive data processing task using the concurrent.futures ProcessPoolExecutor, as shown below.
with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as executor:
    futures_ocr = ([
        executor.submit(
            MyProcessor,
            folder
        ) for folder in sub_folders
    ])
    is_cancel = wait_for(futures_ocr)
    if is_cancel:
        print 'shutting down executor'
        executor.shutdown()
def wait_for(futures):
    """Handles the future tasks after completion"""
    cancelled = False
    try:
        for future in concurrent.futures.as_completed(futures, timeout=200):
            try:
                result = future.result()
                print 'successfully finished processing folder: ', result.source_folder_path
            except concurrent.futures.TimeoutError:
                print 'TimeoutError occured'
            except TypeError:
                print 'TypeError occured'
    except KeyboardInterrupt:
        print '****** cancelling... *******'
        cancelled = True
        for future in futures:
            future.cancel()
    return cancelled
There are certain folders where the process seems to be stuck for a long time, not because of some error in the code but due to the nature of the files being processed. So, I wanted to time out those types of processes, so that they return if a certain time limit is exceeded. The Pool can then use the process for the next available task.
Adding the timeout in the as_completed() function gives an error while completing.
Traceback (most recent call last):
  File "call_ocr.py", line 96, in <module>
    main()
  File "call_ocr.py", line 42, in main
    is_cancel = wait_for(futures_ocr)
  File "call_ocr.py", line 59, in wait_for
    for future in concurrent.futures.as_completed(futures, timeout=200):
  File "/Users/saurav/.pyenv/versions/ocr/lib/python2.7/site-packages/concurrent/futures/_base.py", line 216, in as_completed
    len(pending), len(fs)))
concurrent.futures._base.TimeoutError: 3 (of 3) futures unfinished
What am I doing wrong here, and what is the best way to cause timedout processes to stop and relinquish the process back to the Process pool?
The concurrent.futures implementation does not support such a use case.
The timeout that can be passed to its functions and methods sets how long to wait for the results, but it has no effect on the actual computation itself.
The pebble library supports such a use case.
from concurrent.futures import TimeoutError
from pebble import ProcessPool

def function(n):
    return n

with ProcessPool() as pool:
    future = pool.schedule(function, args=[1], timeout=10)

    try:
        results = future.result()
    except TimeoutError as error:
        print("function took longer than %d seconds" % error.args[1])

Python3 filling a dictionary concurrently

I want to fill a dictionary in a loop. The iterations of the loop are independent of each other. I want to perform this on a cluster with thousands of processors. Here is a simplified version of what I tried and need to do.
import multiprocessing

class Worker(multiprocessing.Process):
    def setName(self, name):
        self.name = name
    def run(self):
        print('In %s' % self.name)
        return

if __name__ == '__main__':
    jobs = []
    names = dict()
    for i in range(10000):
        p = Worker()
        p.setName(str(i))
        names[str(i)] = i
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()
I tried this in Python 3 on my own computer and received the following error:
..
In 249
Traceback (most recent call last):
  File "test.py", line 16, in <module>
    p.start()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 105, in start
In 250
    self._popen = self._Popen(self)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 66, in _launch
    parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files
Is there any better way to do this?
multiprocessing talks to its subprocesses via pipes. Each subprocess requires two open file descriptors, one for reading and one for writing. If you launch 10000 workers, you'll end up opening 20000 file descriptors, which exceeds the default limit on OS X (which your paths indicate you're using).
You can fix the issue by raising the limit. See https://superuser.com/questions/433746/is-there-a-fix-for-the-too-many-open-files-in-system-error-on-os-x-10-7-1 for details - basically, it amounts to setting two sysctl knobs and upping your shell's ulimit setting.
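On the Python side, the per-process limit can be inspected and (up to the hard limit) raised with the standard resource module; a small illustration, with an arbitrary target value:

import resource

# Inspect the current per-process file-descriptor limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit: %s, hard limit: %s' % (soft, hard))

# Raise the soft limit for this process only; 4096 is just an example value.
target = 4096
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))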
You are spawning 10000 processes at once. That really isn't a good idea.
The error you see is almost certainly raised because the multiprocessing module (seems to) use pipes for inter-process communication, and there is a limit on the number of open pipes/file descriptors.
I suggest using a Python interpreter without a Global Interpreter Lock, like Jython or IronPython, and simply replacing the multiprocessing module with the threading one.
If you still want to use the multiprocessing module, you could use a process pool like this to collect the return values:
from multiprocessing import Pool

def worker(params):
    name, someArg = params
    print('In %s' % name)
    # do something with someArg here
    return (name, someArg)

if __name__ == '__main__':
    jobs = []
    names = dict()

    # Spawn 100 worker processes
    pool = Pool(processes=100)
    # Fill with real data
    task_dict = dict(('name_{}'.format(i), i) for i in range(1000))
    # Process every task via our pool
    results = pool.map(worker, task_dict.items())
    # And convert the result to a dict
    results = dict(results)
    print(results)
This should work with minimal changes for the threading module, too.
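For the threading variant, a near drop-in sketch using multiprocessing.pool.ThreadPool (with the same hypothetical worker as above):

from multiprocessing.pool import ThreadPool

def worker(params):
    name, someArg = params
    # do something with someArg here
    return (name, someArg)

if __name__ == '__main__':
    task_dict = dict(('name_{}'.format(i), i) for i in range(1000))
    pool = ThreadPool(processes=100)  # threads instead of processes, so no per-worker pipes
    results = dict(pool.map(worker, task_dict.items()))
    pool.close()
    pool.join()
    print(results)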

"WindowsError: Access is denied" on calling Process.terminate

I enforce a timeout for a block of code using the multiprocessing module. It appears that with inputs of a certain size, the following error is raised:
WindowsError: [Error 5] Access is denied
I can replicate this error with the following code. Note that the code completes with '467,912,040' but not with '517,912,040'.
import multiprocessing, Queue

def wrapper(queue, lst):
    lst.append(1)
    queue.put(lst)
    queue.close()

def timeout(timeout, lst):
    q = multiprocessing.Queue(1)
    proc = multiprocessing.Process(target=wrapper, args=(q, lst))
    proc.start()
    try:
        result = q.get(True, timeout)
    except Queue.Empty:
        return None
    finally:
        proc.terminate()
    return result

if __name__ == "__main__":
    # lst = [0]*417912040 # this works fine
    # lst = [0]*467912040 # this works fine
    lst = [0] * 517912040 # this does not
    print "List length:", len(lst)
    timeout(60*30, lst)
The output (including error):
List length: 517912040
Traceback (most recent call last):
  File ".\multiprocessing_error.py", line 29, in <module>
    print "List length:",len(lst)
  File ".\multiprocessing_error.py", line 21, in timeout
    proc.terminate()
  File "C:\Python27\lib\multiprocessing\process.py", line 137, in terminate
    self._popen.terminate()
  File "C:\Python27\lib\multiprocessing\forking.py", line 306, in terminate
    _subprocess.TerminateProcess(int(self._handle), TERMINATE)
WindowsError: [Error 5] Access is denied
Am I not permitted to terminate a Process of a certain size?
I am using Python 2.7 on Windows 7 (64bit).
While I am still uncertain regarding the precise cause of the problem, I have some additional observations as well as a workaround.
Workaround.
Adding a try-except block in the finally clause.
finally:
    try:
        proc.terminate()
    except WindowsError:
        pass
This also seems to be the solution arrived at in a related (?) issue posted here on GitHub (you may have to scroll down a bit).
Observations.
This error is dependent on the size of the object passed to the Process/Queue, but it is not related to the execution of the Process itself. In the OP, the Process completes before the timeout expires.
proc.is_alive returns True before and after the execution of proc.terminate() (which then throws the WindowsError). A second or two later, proc.is_alive() returns False and a second call to proc.terminate() succeeds.
Forcing the main thread to sleep (time.sleep(1)) in the finally block also prevents the WindowsError from being thrown. Thanks to @tdelaney's comment on the OP.
My best guess is that proc is in the process of freeing memory (?, or something comparable) while being killed by the OS (having completed execution) when the call to proc.terminate() attempts to kill it again.
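Putting the workaround and the sleep observation together, the timeout() function from the question might end up looking like this (a sketch only; wrapper is the function from the question, and this is Python 2 on Windows):

import multiprocessing, Queue, time

def timeout(timeout_sec, lst):
    q = multiprocessing.Queue(1)
    proc = multiprocessing.Process(target=wrapper, args=(q, lst))
    proc.start()
    result = None
    try:
        result = q.get(True, timeout_sec)
    except Queue.Empty:
        pass
    finally:
        time.sleep(1)  # per the observation above, lets the child finish tearing down
        try:
            proc.terminate()
        except WindowsError:  # terminate() may still race with the exiting child
            pass
    return result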
