I'm getting the following error when using the multiprocessing module within a python daemon process (using python-daemon):
Traceback (most recent call last):
File "/usr/local/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/usr/local/lib/python2.6/multiprocessing/util.py", line 262, in _exit_function
for p in active_children():
File "/usr/local/lib/python2.6/multiprocessing/process.py", line 43, in active_children
_cleanup()
File "/usr/local/lib/python2.6/multiprocessing/process.py", line 53, in _cleanup
if p._popen.poll() is not None:
File "/usr/local/lib/python2.6/multiprocessing/forking.py", line 106, in poll
pid, sts = os.waitpid(self.pid, flag)
OSError: [Errno 10] No child processes
The daemon process (parent) spawns a number of processes (children) and then periodically polls the processes to see if they have completed. If the parent detects that one of the processes has completed, it then attempts to restart that process. It is at this point that the above exception is raised. It seems that once one of the processes completes, any operation involving the multiprocessing module will generate this exception. If I run the identical code in a non-daemon python script, it executes with no errors whatsoever.
EDIT:
Sample script
from daemon import runner

class DaemonApp(object):
    def __init__(self, pidfile_path, run):
        self.pidfile_path = pidfile_path
        self.run = run
        self.stdin_path = '/dev/null'
        self.stdout_path = '/dev/tty'
        self.stderr_path = '/dev/tty'

def run():
    import multiprocessing as processing
    import time
    import os
    import sys
    import signal

    def func():
        print 'pid: ', os.getpid()
        for i in range(5):
            print i
            time.sleep(1)

    process = processing.Process(target=func)
    process.start()

    while True:
        print 'checking process'
        if not process.is_alive():
            print 'process dead'
            process = processing.Process(target=func)
            process.start()
        time.sleep(1)

# uncomment to run as daemon
app = DaemonApp('/root/bugtest.pid', run)
daemon_runner = runner.DaemonRunner(app)
daemon_runner.do_action()

#uncomment to run as regular script
#run()
Your problem is a conflict between the daemon and multiprocessing modules, in particular in the handling of the SIGCLD (child process terminated) signal. daemon sets SIGCLD to SIG_IGN when launching, which, at least on Linux, causes terminated children to be reaped immediately (rather than becoming zombies until the parent invokes wait()). But multiprocessing's is_alive test invokes wait() to see if the process is alive, and that fails if the process has already been reaped.
The simplest solution is to set SIGCLD back to SIG_DFL (the default behaviour: the signal is effectively ignored and terminated children remain until the parent wait()s for them):
def run():
    # ...

    signal.signal(signal.SIGCLD, signal.SIG_DFL)

    process = processing.Process(target=func)
    process.start()

    while True:
        # ...
Ignoring SIGCLD also causes problems with the subprocess module, because of a bug in that module (issue 1731717, still open as of 2011-09-21).
This behaviour is addressed in version 1.4.8 of the python-daemon library; it now omits the default fiddling with SIGCLD, so no longer has this unpleasant interaction with other standard library modules.
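For completeness, with 1.4.8 or later you can also be explicit about which signals the daemon touches; a minimal sketch, assuming python-daemon's DaemonContext and its signal_map option (run() is the function from the sample script above):
import signal
import daemon

context = daemon.DaemonContext(
    # an explicit signal_map means the library's defaults never touch SIGCLD
    signal_map={signal.SIGTERM: 'terminate'},
)

with context:
    run()  # run() from the sample script above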
I think there was a fix put into trunk and 2.6-maint a little while ago which should help with this. Can you try running your script against python-trunk or the latest 2.6-maint svn? I'm failing to pull up the bug information.
Looks like your error is coming at the very end of your process -- your clue's at the very start of your traceback, and I quote...:
File "/usr/local/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
If atexit._run_exitfuncs is running, this clearly shows that your own process is terminating. So the error itself is a minor issue in a sense -- it comes from some function that the multiprocessing module registered to run "at-exit" from your process. The really interesting issue is: WHY is your main process exiting? I think this may be due to some uncaught exception: try setting the exception hook and showing rich diagnostic info before it gets lost by the OTHER exception caused by whatever multiprocessing registered to run at-exit...
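A minimal sketch of that suggestion; the hook name and the log path are made up for illustration:
import sys
import traceback

def _diagnostic_hook(exc_type, exc_value, exc_tb):
    # dump the real failure somewhere durable before the atexit handlers run
    with open('/tmp/daemon_crash.log', 'a') as log:
        traceback.print_exception(exc_type, exc_value, exc_tb, file=log)

sys.excepthook = _diagnostic_hook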
I'm running into this also, using the celery distributed task manager under RHEL 5.3 with Python 2.6. My traceback looks a little different but the error is the same:
File "/usr/local/lib/python2.6/multiprocessing/pool.py", line 334, in terminate
self._terminate()
File "/usr/local/lib/python2.6/multiprocessing/util.py", line 174, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/local/lib/python2.6/multiprocessing/pool.py", line 373, in _terminate_pool
p.terminate()
File "/usr/local/lib/python2.6/multiprocessing/process.py", line 111, in terminate
self._popen.terminate()
File "/usr/local/lib/python2.6/multiprocessing/forking.py", line 136, in terminate
if self.wait(timeout=0.1) is None:
File "/usr/local/lib/python2.6/multiprocessing/forking.py", line 121, in wait
res = self.poll()
File "/usr/local/lib/python2.6/multiprocessing/forking.py", line 106, in poll
pid, sts = os.waitpid(self.pid, flag)
OSError: [Errno 10] No child processes
Quite frustrating... I'm running the code through pdb now, but haven't spotted anything yet.
The original sample script has "import signal" but never uses signals. However, I had a script causing this error message, and it was due to my signal handling, so I'll explain here in case it's what is happening for others. Within a signal handler, I was doing stuff with processes (e.g. creating a new process). Apparently this doesn't work, so I stopped doing that within the handler and the error went away. (Note: sleep() functions wake up after signal handling, so that can be an alternative approach to acting upon signals if you need to do things with processes.)
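A minimal sketch of that workaround (SIGUSR1 and the names are illustrative): the handler only sets a flag, and the main loop does the actual process management:
import multiprocessing
import signal
import time

_restart_requested = False

def _on_sigusr1(signum, frame):
    # only record the fact; do NOT create or poll processes inside the handler
    global _restart_requested
    _restart_requested = True

def worker():
    time.sleep(5)

if __name__ == '__main__':
    signal.signal(signal.SIGUSR1, _on_sigusr1)
    process = multiprocessing.Process(target=worker)
    process.start()

    while True:
        time.sleep(1)  # sleep() returns early when a signal is handled
        if _restart_requested:
            _restart_requested = False
            process = multiprocessing.Process(target=worker)
            process.start()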
Related
I am trying to create shared memory for my Python application, which should be used in the parent process as well as in another process that is spawned from that parent process. In most cases that works fine; however, sometimes I get the following stacktrace:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory: '/psm_47f7f5d7'
I want to emphasize that our code/application works fine 99% of the time. We are spawning these new processes, with new shared memory for each such process, on a regular basis in our application (which is a server process, so it's running 24/7). Nearly all the time this works fine; only from time to time is the error above thrown, which then kills the whole application.
Update: I noticed that this problem occurs mainly when the application was running for a while already. When I start it up the creation of shared memory and spawning new processes works fine without this error.
The shared memory is created like this:
import multiprocessing
import time

import numpy as np
from multiprocessing import shared_memory

# Spawn context for multiprocessing
_mp_spawn_ctxt = multiprocessing.get_context("spawn")
_mp_spawn_ctxt_pipe = _mp_spawn_ctxt.Pipe

# Create shared memory
# (width, height and bpp are defined elsewhere in the application)
mem_size = width * height * bpp
shared_mem = shared_memory.SharedMemory(create=True, size=mem_size)
image = np.ndarray((height, width, bpp), dtype=np.uint8, buffer=shared_mem.buf)
parent_pipe, child_pipe = _mp_spawn_ctxt_pipe()
time.sleep(0.1)

# Spawn new process
# _CameraProcess is a custom class derived from _mp_spawn_ctxt.Process
proc = _CameraProcess(shared_mem, child_pipe)
proc.start()
Any ideas what could be the issue here?
I had a similar issue in a case where multiple processes had access to the shared memory/object and one process updated it.
I solved these issues based on these steps:
I synchronized all operations on the shared memory/object via mutexes (see the multiprocessing samples at superfastpython, or "protect shared resources"). The critical parts of the code are create, update and delete, but also reading the content of the shared object/memory, because at the same time a different process can be updating the shared object/memory, etc.
I avoided libraries that only support single-threaded execution.
See sample code with synchronization:
import multiprocessing
import time

def increase(sharedObj, lock):
    for i in range(100):
        time.sleep(0.01)
        lock.acquire()
        sharedObj.value = sharedObj.value + 1  # update via .value, not by rebinding the local name
        lock.release()

def decrease(sharedObj, lock):
    for i in range(100):
        time.sleep(0.001)
        lock.acquire()
        sharedObj.value = sharedObj.value - 1
        lock.release()

if __name__ == '__main__':
    sharedObj = multiprocessing.Value('i', 1000)
    lock = multiprocessing.Lock()
    p1 = multiprocessing.Process(target=increase, args=(sharedObj, lock))
    p2 = multiprocessing.Process(target=decrease, args=(sharedObj, lock))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
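As a side note, the same critical sections can be written with the lock as a context manager, which releases the mutex even if the body raises; a sketch of the increase() variant:
import time

def increase(sharedObj, lock):
    for i in range(100):
        time.sleep(0.01)
        with lock:                  # acquire, and always release, the mutex
            sharedObj.value += 1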
When running using multiprocessing pool, I find that the worker process keeps running past a point where an exception is thrown.
Consider the following code:
import multiprocessing

def worker(x):
    print("input: " + x)
    y = x + "_output"
    raise Exception("foobar")
    print("output: " + y)
    return(y)

def main():
    data = [str(x) for x in range(4)]
    pool = multiprocessing.Pool(1)
    chunksize = 1
    results = pool.map(worker, data, chunksize)
    pool.close()
    pool.join()
    print("Printing results:")
    print(results)

if __name__ == "__main__":
    main()
The output is:
$ python multiprocessing_fail.py
input: 0
input: 1
input: 2
Traceback (most recent call last):
input: 3
File "multiprocessing_fail.py", line 25, in <module>
main()
File "multiprocessing_fail.py", line 16, in main
results = pool.map(worker, data, 1)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
Exception: foobar
As you can see, the worker process never proceeds beyond raise Exception("foobar") to the second print statement. However, it resumes work at the beginning of function worker() again and again.
I looked for an explanation in the documentation, but couldn't find any. Here is a potentially related SO question:
Keyboard Interrupts with python's multiprocessing Pool
But that is different (about keyboard interrupts not being picked by the master process).
Another SO question:
How to catch exceptions in workers in Multiprocessing
This question is also different, since in it the master process doesn't catch any exception, whereas here the master did catch the exception (line 16). More importantly, in that question the worker did not run past an exception (there is only one executable line for the worker).
I am running Python 2.7.
Comment: Pool should start one worker since the code has pool = multiprocessing.Pool(1).
From the Documentation:
A process pool object which controls a pool of worker processes to which jobs can be submitted
Comment: That one worker is running the worker() function multiple times
From the Documentation:
map(func, iterable[, chunksize])
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks.
Your worker() is the separate task. Renaming your worker() to task() could help to clarify what is what.
Comment: What I expect is that the worker process crashes at the Exception
It does: the separate task, your worker(), dies and Pool starts the next task.
What you want is Pool.terminate()
From the Documentation:
terminate()
Stops the worker processes immediately without completing outstanding work.
Question: ... I find that the worker process keeps running past a point where an exception is thrown.
You give iterable data to Pool, therefore Pool does what it has to do:
start len(data) tasks.
data = [str(x) for x in range(4)]
The main question is: what do you expect to happen with
raise Exception("foobar")
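If you want the remaining tasks to keep running but still see which inputs failed, one hedged pattern is to convert per-task exceptions into return values; the safe_task wrapper below is illustrative, not part of the Pool API:
import multiprocessing

def task(x):
    # a failing task only kills this one task; the pool's worker process
    # survives and simply picks up the next piece of work
    raise Exception("foobar in " + x)

def safe_task(x):
    # hypothetical wrapper: turn per-task exceptions into return values
    # so the parent can inspect every result instead of losing them all
    try:
        return task(x)
    except Exception as exc:
        return exc

if __name__ == "__main__":
    pool = multiprocessing.Pool(1)
    results = pool.map(safe_task, [str(i) for i in range(4)])
    pool.close()
    pool.join()
    print(results)  # four Exception instances, one per input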
If a module is imported from a script without a main guard (if __name__ == '__main__':), doing any kind of parallelism in some function in the module will result in an infinite loop on Windows. Each new process loads all of the sources, now with __name__ not equal to '__main__', and then continues execution in parallel. If there's no main guard, we're going to make another call to the same function in each of our new processes, spawning even more processes, until we crash. It's only a problem on Windows, but the scripts are also executed on OS X and Linux.
I could check this by writing to a special file on disk and reading from it to see if we've already started, but that limits us to a single Python script running at once. The simple solution of modifying all the calling code to add main guards is not feasible, because the callers are spread out over many repositories which I do not have access to. Thus, I would like to parallelize when main guards are used, but fall back to single-threaded execution when they're not.
How do I figure out if I'm being called in an import loop due to a missing main guard, so that I can fallback to single threaded execution?
Here's some demo code:
lib with parallel code:
from multiprocessing import Pool

def _noop(x):
    return x

def foo():
    p = Pool(2)
    print(p.map(_noop, [1, 2, 3]))
Good importer (with guard):
from lib import foo
if __name__ == "__main__":
    foo()
Bad importer (without guard):
from lib import foo
foo()
where the bad importer fails with this RuntimeError, over and over again:
p = Pool(2)
File "C:\Users\filip.haglund\AppData\Local\Programs\Python\Python35\lib\multiprocessing\context.py", line 118, in Pool
context=self.get_context())
File "C:\Users\filip.haglund\AppData\Local\Programs\Python\Python35\lib\multiprocessing\pool.py", line 168, in __init__
self._repopulate_pool()
File "C:\Users\filip.haglund\AppData\Local\Programs\Python\Python35\lib\multiprocessing\pool.py", line 233, in _repopulate_pool
w.start()
File "C:\Users\filip.haglund\AppData\Local\Programs\Python\Python35\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users\filip.haglund\AppData\Local\Programs\Python\Python35\lib\multiprocessing\context.py", line 313, in _Popen
return Popen(process_obj)
File "C:\Users\filip.haglund\AppData\Local\Programs\Python\Python35\lib\multiprocessing\popen_spawn_win32.py", line 34, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\Users\filip.haglund\AppData\Local\Programs\Python\Python35\lib\multiprocessing\spawn.py", line 144, in get_preparation_data
_check_not_importing_main()
File "C:\Users\filip.haglund\AppData\Local\Programs\Python\Python35\lib\multiprocessing\spawn.py", line 137, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Since you're using multiprocessing, you can also use it to detect if you're the main process or a child process. However, these features are not documented and are therefore just implementation details that could change without warning between python versions.
Each process has a name, _identity and _parent_pid. You can check any of them to see if you're in the main process or not. In the main process, name will be 'MainProcess', _identity will be (), and _parent_pid will be None.
My solution allows you to continue using multiprocessing, but just modifies child processes so they can't keep creating child processes forever. It uses a decorator to change foo to a no-op in child processes, but returns foo unchanged in the main process. This means that when the spawned child process tries to execute foo, nothing will happen (as if it had been executed inside a __main__ guard).
from multiprocessing import Pool
from multiprocessing.process import current_process

def run_in_main_only(func):
    if current_process().name == "MainProcess":
        return func
    else:
        def noop(*args, **kwargs):
            pass
        return noop

def _noop(_ignored):
    p = current_process()
    return p.name, p._identity, p._parent_pid

@run_in_main_only
def foo():
    with Pool(2) as p:
        for result in p.map(_noop, [1, 2, 3]):
            print(result)  # prints something like ('SpawnPoolWorker-2', (2,), 10720)

if __name__ == "__main__":
    print(_noop(1))  # prints ('MainProcess', (), None)
I'm developing a process scheduler in Python. The idea is to create several threads from the main function and start an external process in each of these threads. The external process should continue to run until either it's finished or the main thread decides to stop it (by sending a kill signal) because the process' CPU time limit is exceeded.
The problem is that sometimes the Popen call blocks and fails to return. This code reproduces the problem with ~50% probability on my system (Ubuntu 14.04.3 LTS):
import os, time, threading, sys
from subprocess import Popen
class Process:
    def __init__(self, args):
        self.args = args

    def run(self):
        print("Run subprocess: " + " ".join(self.args))
        retcode = -1
        try:
            self.process = Popen(self.args)
            print("started a process")
            while self.process.poll() is None:
                # in the real code, check for the end condition here and send kill signal if required
                time.sleep(1.0)
            retcode = self.process.returncode
        except:
            print("unexpected error:", sys.exc_info()[0])
        print("process done, returned {}".format(retcode))
        return retcode

def main():
    processes = [Process(["/bin/cat"]) for _ in range(4)]

    # start all processes
    for p in processes:
        t = threading.Thread(target=Process.run, args=(p,))
        t.daemon = True
        t.start()

    print("all threads started")

    # wait for Ctrl+C
    while True:
        time.sleep(1.0)

main()
The output indicates that only 3 Popen() calls have returned:
Run subprocess: /bin/cat
Run subprocess: /bin/cat
Run subprocess: /bin/cat
Run subprocess: /bin/cat
started a process
started a process
started a process
all threads started
However, running ps shows that all four processes have in fact been started!
The problem does not show up when using Python 3.4, but I want to keep Python 2.7 compatibility.
Edit: the problem also goes away if I add some delay before starting each subsequent thread.
Edit 2: I did a bit of investigation and the blocking is caused by line 1308 of the subprocess.py module, which tries to do some reading from a pipe in the parent process:
data = _eintr_retry_call(os.read, errpipe_read, 1048576)
There are a handful of bugs in python 2.7's subprocess module that can result in deadlock when calling the Popen constructor from multiple threads. They are fixed in later versions of Python, 3.2+ IIRC.
You may find that using the subprocess32 backport of Python 3.2/3.3's subprocess module resolves your issue.
*I was unable to locate the link to the actual bug report, but encountered it recently when dealing with a similar issue.
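A sketch of that approach, assuming the subprocess32 package has been installed (pip install subprocess32); it falls back to the stdlib module where the backport is unavailable:
try:
    import subprocess32 as subprocess  # backport with the thread-safety fixes
except ImportError:
    import subprocess  # the stdlib module is fine on Python 3.2+

proc = subprocess.Popen(["/bin/cat"])
proc.terminate()
proc.wait()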
I have a Python script called monitiq_install.py which calls other scripts (or modules) using the subprocess module. However, if the user sends a keyboard interrupt (CTRL + C), it exits, but with an exception. I want it to exit nicely.
My Code:
import os
import sys
from os import listdir
from os.path import isfile, join
from subprocess import Popen, PIPE
import json
# Run a module and capture output and exit code
def runModule(module):
    try:
        # Run Module
        process = Popen(os.path.dirname(os.path.realpath(__file__)) + "/modules/" + module, shell=True, stdout=PIPE, bufsize=1)
        for line in iter(process.stdout.readline, b''):
            print line,
        process.communicate()
        exit_code = process.wait();
        return exit_code;
    except KeyboardInterrupt:
        print "Got keyboard interupt!";
        sys.exit(0);
The error I'm getting is below:
python monitiq_install.py -a
Invalid module filename: create_db_user_v0_0_0.pyc
Not Running Module: '3parssh_install' as it is already installed
######################################
Running Module: 'create_db_user' Version: '0.0.3'
Choose username for Monitiq DB User [MONITIQ]
^CTraceback (most recent call last):
File "/opt/monitiq-universal/install/modules/create_db_user-v0_0_3.py", line 132, in <module>
inputVal = raw_input("");
Traceback (most recent call last):
File "monitiq_install.py", line 40, in <module>
KeyboardInterrupt
module_install.runModules();
File "/opt/monitiq-universal/install/module_install.py", line 86, in runModules
exit_code = runModule(module);
File "/opt/monitiq-universal/install/module_install.py", line 19, in runModule
for line in iter(process.stdout.readline, b''):
KeyboardInterrupt
A solution or some pointers would be helpful :)
--EDIT
With try catch
Running Module: 'create_db_user' Version: '0.0.0'
Choose username for Monitiq DB User [MONITIQ]
^CGot keyboard interupt!
Traceback (most recent call last):
File "monitiq_install.py", line 36, in <module>
module_install.runModules();
File "/opt/monitiq-universal/install/module_install.py", line 90, in runModules
exit_code = runModule(module);
File "/opt/monitiq-universal/install/module_install.py", line 29, in runModule
sys.exit(0);
NameError: global name 'sys' is not defined
Traceback (most recent call last):
File "/opt/monitiq-universal/install/modules/create_db_user-v0_0_0.py", line 132, in <module>
inputVal = raw_input("");
KeyboardInterrupt
If you press Ctrl + C in a terminal then SIGINT is sent to all processes within the process group. See child process receives parent's SIGINT.
That is why you see the traceback from the child process despite try/except KeyboardInterrupt in the parent.
You could suppress the stderr output from the child process: stderr=DEVNULL. Or start it in a new process group: start_new_session=True:
import sys
from subprocess import call
try:
    call([sys.executable, 'child.py'], start_new_session=True)
except KeyboardInterrupt:
    print('Ctrl C')
else:
    print('no exception')
If you remove start_new_session=True in the above example then KeyboardInterrupt may be raised in the child too and you might get the traceback.
If subprocess.DEVNULL is not available; you could use DEVNULL = open(os.devnull, 'r+b', 0). If start_new_session parameter is not available; you could use preexec_fn=os.setsid on POSIX.
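Putting those fallbacks together, a minimal Python 2 sketch ('child.py' stands in for the real child script):
import os
import sys
from subprocess import call

DEVNULL = open(os.devnull, 'r+b', 0)

try:
    rc = call([sys.executable, 'child.py'],
              stderr=DEVNULL,
              preexec_fn=os.setsid)  # POSIX only: run the child in its own session
except KeyboardInterrupt:
    print('Ctrl C')
else:
    print('exit code: %d' % rc)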
You can do this using try and except as below:
import subprocess
try:
    proc = subprocess.Popen("dir /S", shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    while proc.poll() is None:
        print proc.stdout.readline()
except KeyboardInterrupt:
    print "Got Keyboard interrupt"
As a security best practice, you could avoid shell=True in your execution.
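For illustration, here is the same polling loop without shell=True, with the command passed as an argument list (ls -lR is used as a POSIX stand-in for dir /S):
import subprocess

try:
    proc = subprocess.Popen(["ls", "-lR"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    while proc.poll() is None:
        print proc.stdout.readline(),
except KeyboardInterrupt:
    print "Got Keyboard interrupt"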
The code below spawns a child process and hands signals like SIGINT over to it, just like shells (bash, zsh, ...) do.
This means KeyboardInterrupt is no longer seen by the Python process, but the child receives this and is killed correctly.
It works by running the process in a new foreground process group set by Python.
import os
import signal
import subprocess
import sys
import termios
def run_as_fg_process(*args, **kwargs):
    """
    the "correct" way of spawning a new subprocess:
    signals like C-c must only go
    to the child process, and not to this python.

    the args are the same as subprocess.Popen

    returns Popen().wait() value

    Some side-info about "how ctrl-c works":
    https://unix.stackexchange.com/a/149756/1321

    fun fact: this function took a whole night
              to be figured out.
    """

    old_pgrp = os.tcgetpgrp(sys.stdin.fileno())
    old_attr = termios.tcgetattr(sys.stdin.fileno())

    user_preexec_fn = kwargs.pop("preexec_fn", None)

    def new_pgid():
        if user_preexec_fn:
            user_preexec_fn()

        # set a new process group id
        os.setpgid(os.getpid(), os.getpid())

        # generally, the child process should stop itself
        # before exec so the parent can set its new pgid.
        # (setting pgid has to be done before the child execs).
        # however, Python 'guarantees' that `preexec_fn`
        # is run before `Popen` returns.
        # this is because `Popen` waits for the closure of
        # the error relay pipe '`errpipe_write`',
        # which happens at child's exec.
        # this is also the reason the child can't stop itself
        # in Python's `Popen`, since the `Popen` call would never
        # terminate then.
        # `os.kill(os.getpid(), signal.SIGSTOP)`

    try:
        # fork the child
        child = subprocess.Popen(*args, preexec_fn=new_pgid,
                                 **kwargs)

        # we can't set the process group id from the parent since the child
        # will already have exec'd. and we can't SIGSTOP it before exec,
        # see above.
        # `os.setpgid(child.pid, child.pid)`

        # set the child's process group as new foreground
        os.tcsetpgrp(sys.stdin.fileno(), child.pid)
        # revive the child,
        # because it may have been stopped due to SIGTTOU or
        # SIGTTIN when it tried using stdout/stdin
        # after setpgid was called, and before we made it
        # the foreground process by tcsetpgrp.
        os.kill(child.pid, signal.SIGCONT)

        # wait for the child to terminate
        ret = child.wait()

    finally:
        # we have to mask SIGTTOU because tcsetpgrp
        # raises SIGTTOU to all current background
        # process group members (i.e. us) when switching tty's pgrp;
        # if we didn't do that, we'd get SIGSTOP'd
        hdlr = signal.signal(signal.SIGTTOU, signal.SIG_IGN)
        # make us tty's foreground again
        os.tcsetpgrp(sys.stdin.fileno(), old_pgrp)
        # now restore the handler
        signal.signal(signal.SIGTTOU, hdlr)
        # restore terminal attributes
        termios.tcsetattr(sys.stdin.fileno(), termios.TCSADRAIN, old_attr)

    return ret


# example:
run_as_fg_process(['openage', 'edit', '-f', 'random_map.rms'])