When running with a multiprocessing Pool, I find that the worker process keeps running past the point where an exception is thrown.
Consider the following code:
import multiprocessing

def worker(x):
    print("input: " + x)
    y = x + "_output"
    raise Exception("foobar")
    print("output: " + y)
    return y

def main():
    data = [str(x) for x in range(4)]
    pool = multiprocessing.Pool(1)
    chunksize = 1
    results = pool.map(worker, data, chunksize)
    pool.close()
    pool.join()
    print("Printing results:")
    print(results)

if __name__ == "__main__":
    main()
The output is:
$ python multiprocessing_fail.py
input: 0
input: 1
input: 2
Traceback (most recent call last):
input: 3
File "multiprocessing_fail.py", line 25, in <module>
main()
File "multiprocessing_fail.py", line 16, in main
results = pool.map(worker, data, 1)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
Exception: foobar
As you can see, the worker process never proceeds beyond raise Exception("foobar") to the second print statement. However, it resumes work at the beginning of function worker() again and again.
I looked for an explanation in the documentation, but couldn't find any. Here is a potentially related SO question:
Keyboard Interrupts with python's multiprocessing Pool
But that is different (about keyboard interrupts not being picked by the master process).
Another SO question:
How to catch exceptions in workers in Multiprocessing
This question is also different, since in it the master process doesn't catch any exception, whereas here the master did catch the exception (line 16). More importantly, in that question the worker did not run past an exception (there is only one executable line in the worker).
I am running Python 2.7.
Comment: Pool should start one worker since the code has pool = multiprocessing.Pool(1).
From the Documentation:
A process pool object which controls a pool of worker processes to which jobs can be submitted
Comment: That one worker is running the worker() function multiple times
From the Documentation:
map(func, iterable[, chunksize])
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks.
Your worker() is the separate task. Renaming your worker() to task() could help to clarify what is what.
Comment: What I expect is that the worker process crashes at the Exception
It does: that separate task, your worker(), dies, and Pool starts the next task.
What you want is Pool.terminate()
From the Documentation:
terminate()
Stops the worker processes immediately without completing outstanding work.
Question: ... I find that the worker process keeps running past a point where an exception is thrown.
You give iterable data to Pool, therefore Pool does what it has to do:
starting len(data) tasks (one call to worker() per item).
data = [str(x) for x in range(4)]
The main question is: what do you expect to happen with
raise Exception("foobar")
Related
I am trying to create a shared memory for my Python application, which should be used in the parent process and in another process that is spawned from that parent process. In most cases that works fine, however, sometimes I get the following stacktrace:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory: '/psm_47f7f5d7'
I want to emphasize that our code/application works fine 99% of the time. We are spawning these new processes, with new shared memory for each such process, on a regular basis in our application (which is a server process, so it's running 24/7). Nearly all the time this works fine; only occasionally is the error above thrown, which then kills the whole application.
Update: I noticed that this problem occurs mainly when the application was running for a while already. When I start it up the creation of shared memory and spawning new processes works fine without this error.
The shared memory is created like this:
# Spawn context for multiprocessing
_mp_spawn_ctxt = multiprocessing.get_context("spawn")
_mp_spawn_ctxt_pipe = _mp_spawn_ctxt.Pipe
# Create shared memory
mem_size = width * height * bpp
shared_mem = shared_memory.SharedMemory(create=True, size=mem_size)
image = np.ndarray((height, width, bpp), dtype=np.uint8, buffer=shared_mem.buf)
parent_pipe, child_pipe = _mp_spawn_ctxt_pipe()
time.sleep(0.1)
# Spawn new process
# _CameraProcess is a custom class derived from _mp_spawn_ctxt.Process
proc = _CameraProcess(shared_mem, child_pipe)
proc.start()
Any ideas what could be the issue here?
I had a similar issue in a case where multiple processes had access to the shared memory/object and one process updated it.
I solved it with these steps:
I synchronized all operations on the shared memory/object via mutexes (see the multiprocessing samples at superfastpython, or on protecting shared resources). The critical parts of the code are create, update, and delete, but also reading the content of the shared object/memory, because another process may be updating it at the same time.
I avoided libraries with only single-threaded execution support.
See sample code with synchronization:
import multiprocessing
import time

def increase(sharedObj, lock):
    for i in range(100):
        time.sleep(0.01)
        lock.acquire()
        sharedObj.value = sharedObj.value + 1
        lock.release()

def decrease(sharedObj, lock):
    for i in range(100):
        time.sleep(0.001)
        lock.acquire()
        sharedObj.value = sharedObj.value - 1
        lock.release()

if __name__ == '__main__':
    sharedObj = multiprocessing.Value('i', 1000)
    lock = multiprocessing.Lock()
    p1 = multiprocessing.Process(target=increase, args=(sharedObj, lock))
    p2 = multiprocessing.Process(target=decrease, args=(sharedObj, lock))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
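The sample above uses a multiprocessing.Value; here is a minimal sketch of the same locking idea applied to shared_memory.SharedMemory, as in the original question (the shape, names, and writer() function are assumptions, not taken from the question's application):

import numpy as np
from multiprocessing import Process, Lock, shared_memory

def writer(shm_name, shape, lock):
    # attach to the existing block by name and update it under the lock
    shm = shared_memory.SharedMemory(name=shm_name)
    image = np.ndarray(shape, dtype=np.uint8, buffer=shm.buf)
    with lock:
        image[:] = 255
    shm.close()

if __name__ == "__main__":
    shape = (480, 640, 3)  # height, width, bpp -- placeholder values
    lock = Lock()
    shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)))
    p = Process(target=writer, args=(shm.name, shape, lock))
    p.start()
    p.join()
    with lock:  # protect reads as well
        image = np.ndarray(shape, dtype=np.uint8, buffer=shm.buf)
        print(image[0, 0])
    shm.close()
    shm.unlink()  # only the creating process unlinks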
Suppose I have a program that looks like this:
jobs = [list_of_values_to_consume_and_act]
with multiprocessing.Pool(8) as pool:
    results = pool.map(func, jobs)
And whatever is done in func can raise an exception due to external circumstances, so I can't prevent an exception from happening.
How will the pool behave on exception?
Will it only terminate the process that raised an exception and let other processes run and consume the jobs?
If yes, will it start another process to pick up the slack?
What about the job being handled by the dead process, will it be 'resubmitted' to the pool?
In any case, how do I 'retrieve' the exception?
No processes will be terminated at all. All calls to the target
function from within the pool's processes are wrapped in a
try...except block. If an exception is caught, the process
informs the appropriate handler thread in the main process, which
passes the exception forward so it can be re-raised. Whether or not other jobs will execute depends on whether the pool is still open. If you do not catch this re-raised exception, the main process (or the process that started the pool) will exit, automatically cleaning up open resources like the pool (so no further tasks can be executed, since the pool is closed). But if you catch the exception and let the main process continue running, then the pool will not shut down and other jobs will execute as scheduled.
N/A
The outcome of a job is irrelevant; once it has been run by any process,
that job is marked completed and is not resubmitted to the pool.
Wrap your call to pool.map in a try...except block (a sketch follows below). Do note that
if one of your jobs does raise an error, then the results of the other,
successful jobs become inaccessible as well (because these are
stored only after the call to pool.map completes, and that call never
completed successfully). In such cases, where you need to catch
exceptions of individual jobs, it's better to use pool.imap
or pool.apply_async.
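For reference, a minimal sketch of the pool.map case just described (func and jobs here are hypothetical stand-ins): if any job raises, the exception surfaces at the pool.map call and every successful result is lost with it.

import multiprocessing

def func(value):
    # hypothetical job: one input fails
    if value == 3:
        raise ValueError(f"Error for value {value}")
    return value * 2

if __name__ == "__main__":
    jobs = range(1, 10)
    with multiprocessing.Pool(8) as pool:
        try:
            results = pool.map(func, jobs)
        except ValueError as e:
            print("a job failed:", e)
            results = None  # the successful results are lost too
    print(results)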
Example of catching exception for individual tasks using imap:
import multiprocessing
import time

def prt(value):
    if value == 3:
        raise ValueError(f"Error for value {value}")
    time.sleep(1)
    return value

if __name__ == "__main__":
    with multiprocessing.Pool(3) as pool:
        jobs = pool.imap(prt, range(1, 10))
        results = []
        for i in range(10):
            try:
                result = next(jobs)
            except ValueError as e:
                print(e)
                results.append("N/A")  # this individual task was unsuccessful
            except StopIteration:
                break
            else:
                results.append(result)
        print(results)
Example of catching exceptions for individual tasks using apply_async:
import multiprocessing
import time

def prt(value):
    if value == 3:
        raise ValueError(f"Error for value {value}")
    time.sleep(1)
    return value

if __name__ == "__main__":
    pool = multiprocessing.Pool(3)
    jobs = [pool.apply_async(prt, (i,)) for i in range(1, 10)]
    results = []
    for j in jobs:
        try:
            results.append(j.get())
        except ValueError as e:
            print(e)
            results.append("N/A")
    print(results)
This is not very important, just a silly experiment. I would like to create my own message passing.
I would like to have a dictionary of queues, where each key is the PID of a process, because I'd like the processes (created by Process()) to exchange messages by inserting them into the queue of the process they want to send to (knowing its PID).
This is a silly code:
from multiprocessing import Process, Manager, Queue
from os import getpid
from time import sleep

def begin(dic, manager, parentQ):
    parentQ.put(getpid())
    dic[getpid()] = manager.Queue()
    dic[getpid()].put("Something...")

if __name__== '__main__':
    manager = Manager()
    dic = manager.dict()
    parentQ = Queue()
    p = Process(target=begin, args=(dic, manager, parentQ))
    p.start()
    son = parentQ.get()
    print son
    sleep(2)
    print dic[son].get()
dic[getpid()] = manager.Queue() works fine. But when I perform
dic[son].put()/get(), I get this message:
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "mps.py", line 8, in begin
dic[getpid()].put("Something...")
File "<string>", line 2, in __getitem__
File "/usr/lib/python2.7/multiprocessing/managers.py", line 773, in _callmethod
raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Unserializable message: ('#RETURN', <Queue.Queue instance at 0x8a92d0c>)
---------------------------------------------------------------------------
do you know what's the right way to do it?
I believe your code is failing because Queues are not serializable, just like the traceback says. The multiprocessing.Manager() object can create a shared dict for you without a problem, just as you've done here, but values stored in the dict still need to be serializable (or picklable in Pythonese). If you're okay with the subprocesses not having access to each other's queues, then this should work for you:
from multiprocessing import Process, Manager, Queue
from os import getpid

number_of_subprocesses_i_want = 5

def begin(myQ):
    myQ.put("Something sentimental from your friend, PID {0}".format(getpid()))
    return

if __name__== '__main__':
    queue_dic = {}
    queue_manager = Manager()
    process_list = []
    for i in xrange(number_of_subprocesses_i_want):
        child_queue = queue_manager.Queue()
        p = Process(target=begin, args=(child_queue,))
        p.start()
        queue_dic[p.pid] = child_queue
        process_list.append(p)
    for p in process_list:
        print(queue_dic[p.pid].get())
        p.join()
This leaves you with a dictionary whose keys are the child processes' PIDs, and whose values are their respective queues, which can be used from the main process.
I don't think your original goal is achievable with queues because queues that you want a subprocess to use must be passed to the processes when they are created, so as you launch more processes, you have no way to give an existing process access to a new queue.
One possible way to have inter-process communication would be to have everyone share a single queue to pass messages back to your main process bundled with some kind of header, such as in a tuple:
(destination_pid, sender_pid, message)
...and have the main process read the destination_pid and direct (sender_pid, message) to that subprocess' queue. Of course, this implies that you need a method of notifying existing processes when a new process is available to communicate with.
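A rough sketch of that routing scheme (the bootstrapping step, where main tells each child which peer to talk to, and all names here are assumptions):

from multiprocessing import Process, Queue
from os import getpid

def child(outbox, inbox):
    peer_pid = inbox.get()                     # learn who to talk to
    outbox.put((peer_pid, getpid(), "hello"))  # (destination, sender, message)
    sender, message = inbox.get()              # wait for a routed message
    print("%d got %r from %d" % (getpid(), message, sender))

if __name__ == '__main__':
    outbox = Queue()                           # single queue back to main
    inboxes = {}
    procs = []
    for _ in range(2):
        inbox = Queue()
        p = Process(target=child, args=(outbox, inbox))
        p.start()
        inboxes[p.pid] = inbox
        procs.append(p)
    pids = list(inboxes)
    inboxes[pids[0]].put(pids[1])              # introduce the peers
    inboxes[pids[1]].put(pids[0])
    for _ in range(2):                         # main acts as the router
        dest, sender, message = outbox.get()
        inboxes[dest].put((sender, message))
    for p in procs:
        p.join()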
While double checking that threading.Condition is correctly monkey patched, I noticed that a monkeypatched threading.Thread(…).start() behaves differently from gevent.spawn(…).
Consider:
from gevent import monkey; monkey.patch_all()
from threading import Thread, Condition
import gevent

cv = Condition()

def wait_on_cv(x):
    cv.acquire()
    cv.wait()
    print "Here:", x
    cv.release()

# XXX: This code yields "This operation would block forever" when joining the first thread
threads = [ gevent.spawn(wait_on_cv, x) for x in range(10) ]

"""
# XXX: This code, which seems semantically similar, works correctly
threads = [ Thread(target=wait_on_cv, args=(x, )) for x in range(10) ]
for t in threads:
    t.start()
"""

cv.acquire()
cv.notify_all()
print "Notified!"
cv.release()

for x, thread in enumerate(threads):
    print "Joining", x
    thread.join()
Note, specifically, the two comments starting with XXX.
When using the first line (with gevent.spawn), the first thread.join() raises an exception:
Notified!
Joining 0
Traceback (most recent call last):
File "foo.py", line 30, in
thread.join()
File "…/gevent/greenlet.py", line 291, in join
result = self.parent.switch()
File "…/gevent/hub.py", line 381, in switch
return greenlet.switch(self)
gevent.hub.LoopExit: This operation would block forever
However, with Thread(…).start() (the second block), everything works as expected.
Why would this be? What's the difference between gevent.spawn() and Thread(…).start()?
What happens in your code is that the greenlets you created in your threads list haven't yet had the chance to be executed, because gevent will not trigger a context switch until you do so explicitly (using gevent.sleep() and the like) or implicitly (by calling a function that blocks, e.g. semaphore.wait(), by yielding, and so on). To see that, you can insert a print before cv.wait() and observe that it is only called after cv.notify_all() has been called:
def wait_on_cv(x):
    cv.acquire()
    print 'acquired ', x
    cv.wait()
    ....
So an easy fix to your code is to insert something that triggers a context switch after you create your list of greenlets, for example:
...
threads = [ gevent.spawn(wait_on_cv, x) for x in range(10) ]
gevent.sleep() # Trigger a context switch
...
Note: I am still new to gevent so I don't know if this is the right way to do it :)
This way all the greenlets will have a chance to be executed, and each of them will trigger a context switch when it calls cv.wait(); in the meantime they will register themselves with the condition's waiters, so that when cv.notify_all() is called it will notify all the greenlets.
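A tiny standalone sketch of that scheduling behaviour (the function name is hypothetical): nothing the greenlets print appears until an explicit context switch happens.

import gevent

def greenlet_body(n):
    print("greenlet %d is running" % n)

greenlets = [gevent.spawn(greenlet_body, n) for n in range(3)]
print("spawned, but nothing has run yet")
gevent.sleep(0)  # explicit context switch: the greenlets get to run now
gevent.joinall(greenlets)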
HTH,
I'm getting the following error when using the multiprocessing module within a python daemon process (using python-daemon):
Traceback (most recent call last):
File "/usr/local/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/usr/local/lib/python2.6/multiprocessing/util.py", line 262, in _exit_function
for p in active_children():
File "/usr/local/lib/python2.6/multiprocessing/process.py", line 43, in active_children
_cleanup()
File "/usr/local/lib/python2.6/multiprocessing/process.py", line 53, in _cleanup
if p._popen.poll() is not None:
File "/usr/local/lib/python2.6/multiprocessing/forking.py", line 106, in poll
pid, sts = os.waitpid(self.pid, flag)
OSError: [Errno 10] No child processes
The daemon process (parent) spawns a number of processes (children) and then periodically polls the processes to see if they have completed. If the parent detects that one of the processes has completed, it then attempts to restart that process. It is at this point that the above exception is raised. It seems that once one of the processes completes, any operation involving the multiprocessing module will generate this exception. If I run the identical code in a non-daemon python script, it executes with no errors whatsoever.
EDIT:
Sample script
from daemon import runner

class DaemonApp(object):
    def __init__(self, pidfile_path, run):
        self.pidfile_path = pidfile_path
        self.run = run
        self.stdin_path = '/dev/null'
        self.stdout_path = '/dev/tty'
        self.stderr_path = '/dev/tty'

def run():
    import multiprocessing as processing
    import time
    import os
    import sys
    import signal

    def func():
        print 'pid: ', os.getpid()
        for i in range(5):
            print i
            time.sleep(1)

    process = processing.Process(target=func)
    process.start()

    while True:
        print 'checking process'
        if not process.is_alive():
            print 'process dead'
            process = processing.Process(target=func)
            process.start()
        time.sleep(1)

# uncomment to run as daemon
app = DaemonApp('/root/bugtest.pid', run)
daemon_runner = runner.DaemonRunner(app)
daemon_runner.do_action()

#uncomment to run as regular script
#run()
Your problem is a conflict between the daemon and multiprocessing modules, in particular in its handling of the SIGCLD (child process terminated) signal. daemon sets SIGCLD to SIG_IGN when launching, which, at least on Linux, causes terminated children to immediately be reaped (rather than becoming a zombie until the parent invokes wait()). But multiprocessing's is_alive test invokes wait() to see if the process is alive, which fails if the process has already been reaped.
Simplest solution is just to set SIGCLD back to SIG_DFL (default behaviour -- ignore the signal and let the parent wait() for the terminated child process):
def run():
    # ...
    signal.signal(signal.SIGCLD, signal.SIG_DFL)
    process = processing.Process(target=func)
    process.start()
    while True:
        # ...
Ignoring SIGCLD also causes problems with the subprocess module, because of a bug in that module (issue 1731717, still open as of 2011-09-21).
This behaviour is addressed in version 1.4.8 of the python-daemon library; it now omits the default fiddling with SIGCLD, so no longer has this unpleasant interaction with other standard library modules.
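To see the underlying mechanism in isolation, here is a small Linux-only sketch (using the standard SIGCHLD name; the timing is arbitrary): with the signal ignored, the terminated child is reaped automatically, so a later waitpid() fails with ECHILD, which is exactly the failure multiprocessing hits.

import errno
import os
import signal
import time

signal.signal(signal.SIGCHLD, signal.SIG_IGN)  # what the daemon library did

pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits immediately

time.sleep(0.5)  # give the kernel time to auto-reap the child
try:
    os.waitpid(pid, 0)
except OSError as e:
    print("waitpid failed: %s (ECHILD: %s)" % (e, e.errno == errno.ECHILD))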
I think there was a fix put into trunk and 2.6-maint a little while ago which should help with this. Can you try running your script on python-trunk or the latest 2.6-maint svn? I'm failing to pull up the bug information.
Looks like your error is coming at the very end of your process -- your clue's at the very start of your traceback, and I quote...:
File "/usr/local/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
If atexit._run_exitfuncs is running, this clearly shows that your own process is terminating. So the error itself is a minor issue in a sense -- it just comes from some function that the multiprocessing module registered to run "at exit" in your process. The really interesting issue is: WHY is your main process exiting? I think this may be due to some uncaught exception: try setting the exception hook and showing rich diagnostic info before it gets lost by the OTHER exception caused by whatever multiprocessing has registered to run at exit...
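For example, a hypothetical hook along those lines (the log path is an assumption; a daemonized process may not have a useful stderr to print to):

import sys
import traceback

def verbose_excepthook(exc_type, exc_value, exc_tb):
    # dump the full traceback somewhere it won't be lost at shutdown
    with open('/tmp/daemon_crash.log', 'a') as f:
        traceback.print_exception(exc_type, exc_value, exc_tb, file=f)

sys.excepthook = verbose_excepthook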
I'm running into this also using the celery distributed task manager under RHEL 5.3 with Python 2.6. My traceback looks a little different but the error the same:
File "/usr/local/lib/python2.6/multiprocessing/pool.py", line 334, in terminate
self._terminate()
File "/usr/local/lib/python2.6/multiprocessing/util.py", line 174, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/local/lib/python2.6/multiprocessing/pool.py", line 373, in _terminate_pool
p.terminate()
File "/usr/local/lib/python2.6/multiprocessing/process.py", line 111, in terminate
self._popen.terminate()
File "/usr/local/lib/python2.6/multiprocessing/forking.py", line 136, in terminate
if self.wait(timeout=0.1) is None:
File "/usr/local/lib/python2.6/multiprocessing/forking.py", line 121, in wait
res = self.poll()
File "/usr/local/lib/python2.6/multiprocessing/forking.py", line 106, in poll
pid, sts = os.waitpid(self.pid, flag)
OSError: [Errno 10] No child processes
Quite frustrating.. I'm running the code through pdb now, but haven't spotted anything yet.
The original sample script has import signal but no use of signals. However, I had a script causing this error message, and it was due to my signal handling, so I'll explain here in case it's what is happening for others. Within a signal handler, I was doing things with processes (e.g. creating a new process). Apparently this doesn't work, so I stopped doing that within the handler and fixed the error. (Note: sleep() functions wake up after signal handling, so that can be an alternative approach to acting upon signals if you need to do things with processes.)
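A sketch of that safer pattern (the flag name and SIGUSR1 are assumptions): the handler only sets a flag, and the main loop does the actual process work outside the handler.

import multiprocessing
import signal
import time

restart_requested = False

def handler(signum, frame):
    global restart_requested
    restart_requested = True  # no Process() work in here

def func():
    time.sleep(2)

if __name__ == '__main__':
    signal.signal(signal.SIGUSR1, handler)
    process = multiprocessing.Process(target=func)
    process.start()
    while True:
        time.sleep(1)  # the handler runs when the signal arrives
        if restart_requested and not process.is_alive():
            restart_requested = False
            process = multiprocessing.Process(target=func)
            process.start()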