Learning about Python Multiprocessing (from a PMOTW article) and would love some clarification on what exactly the join() method is doing.
In an old tutorial from 2008 it states that without the p.join() call in the code below, "the child process will sit idle and not terminate, becoming a zombie you must manually kill".
from multiprocessing import Process
def say_hello(name='world'):
    print "Hello, %s" % name
p = Process(target=say_hello)
p.start()
p.join()
I added a printout of the PID as well as a time.sleep to test and as far as I can tell, the process terminates on its own:
from multiprocessing import Process
import sys
import time
def say_hello(name='world'):
    print "Hello, %s" % name
    print 'Starting:', p.name, p.pid
    sys.stdout.flush()
    print 'Exiting :', p.name, p.pid
    sys.stdout.flush()
    time.sleep(20)
p = Process(target=say_hello)
p.start()
# no p.join()
within 20 seconds:
936 ttys000 0:00.05 /Library/Frameworks/Python.framework/Versions/2.7/Reso
938 ttys000 0:00.00 /Library/Frameworks/Python.framework/Versions/2.7/Reso
947 ttys001 0:00.13 -bash
after 20 seconds:
947 ttys001 0:00.13 -bash
Behavior is the same with p.join() added back at the end of the file. Python Module of the Week offers a very readable explanation of the module ("To wait until a process has completed its work and exited, use the join() method."), but it seems like at least OS X was doing that anyway.
I am also wondering about the name of the method. Is the .join() method concatenating anything here? Is it concatenating a process with its end? Or does it just share a name with Python's native str.join() method?
The join() method, when used with threading or multiprocessing, is not related to str.join() - it's not actually concatenating anything together. Rather, it just means "wait for this [thread/process] to complete". The name join is used because the multiprocessing module's API is meant to look as similar as possible to the threading module's API, and the threading module uses join for its Thread objects. Using the term join to mean "wait for a thread to complete" is common across many programming languages, so Python just adopted it as well.
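For example, the two APIs line up almost exactly; a minimal sketch (the work() function here is just illustrative):
import threading
import multiprocessing

def work():
    print('doing some work')

if __name__ == '__main__':
    t = threading.Thread(target=work)
    t.start()
    t.join()    # wait for the thread to finish

    p = multiprocessing.Process(target=work)
    p.start()
    p.join()    # same name, same meaning: wait for the process to finish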
Now, the reason you see the 20 second delay both with and without the call to join() is because by default, when the main process is ready to exit, it will implicitly call join() on all running multiprocessing.Process instances. This isn't as clearly stated in the multiprocessing docs as it should be, but it is mentioned in the Programming Guidelines section:
Remember also that non-daemonic processes will be joined automatically.
You can override this behavior by setting the daemon flag on the Process to True prior to starting the process:
p = Process(target=say_hello)
p.daemon = True
p.start()
# Both parent and child will exit here, since the main process has completed.
If you do that, the child process will be terminated as soon as the main process completes:
daemon
The process’s daemon flag, a Boolean value. This must be set before start() is called.
The initial value is inherited from the creating process.
When a process exits, it attempts to terminate all of its daemonic child processes.
Without the join(), the main process can complete before the child process does. I'm not sure under what circumstances that leads to zombieism.
The main purpose of join() is to ensure that a child process has completed before the main process does anything that depends on the work of the child process.
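For example, a minimal sketch of that pattern (compute() and the Queue usage are just illustrative): the parent only uses the child's result once the child has produced it and join() has confirmed the child exited.
from multiprocessing import Process, Queue

def compute(q):
    q.put(21 * 2)    # stand-in for some expensive work

if __name__ == '__main__':
    q = Queue()
    p = Process(target=compute, args=(q,))
    p.start()
    result = q.get()    # blocks until the child has produced its result
    p.join()            # then wait for the child itself to exit
    print(result)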
The etymology of join() is that it's the opposite of fork, which is the common term in Unix-family operating systems for creating child processes. A single process "forks" into several, then "joins" back into one.
I'm not going to explain in detail what join does, but here's the etymology and the intuition behind it, which should help you remember its meaning more easily.
The idea is that execution "forks" into multiple processes of which one is the main/primary process, the rest workers (or minor/secondary). When the workers are done, they "join" the main process so that serial execution may be resumed.
The join() causes the main process to wait for a worker to join it. The method might better have been called "wait", since that's the actual behavior it causes in the master (and that's what it's called in POSIX, although POSIX threads call it "join" as well). The joining only occurs as an effect of the workers cooperating properly; it's not something the main process does by itself.
The names "fork" and "join" have been used with this meaning in multiprocessing since 1963.
The join() call ensures that subsequent lines of your code are not called before all the multiprocessing processes are completed.
For example, without the join(), the following code would call restart_program() even before the processes finish, which is effectively asynchronous behavior and not what we want (you can try it):
import multiprocessing

processes = []
num_processes = 5
for i in range(num_processes):
    p = multiprocessing.Process(target=calculate_stuff, args=(i,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()  # ensure the subsequent line (e.g. restart_program)
              # is not reached until all processes finish
restart_program()
With a multiprocessing.Pool, join() is used to wait for the worker processes to exit. You must call close() or terminate() before using join().
As Russell mentioned, join is like the opposite of fork (which spawns sub-processes).
For join to run, you have to call close(), which prevents any more tasks from being submitted to the pool and lets it exit once all tasks complete. Alternatively, calling terminate() will just exit by stopping all worker processes immediately.
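A minimal sketch of that close()/join() ordering with a Pool (the square() worker is just illustrative):
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)
    results = [pool.apply_async(square, (i,)) for i in range(10)]
    pool.close()    # no more tasks may be submitted
    pool.join()     # wait for the worker processes to exit
    print([r.get() for r in results])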
"the child process will sit idle and not terminate, becoming a zombie you must manually kill" this is possible when the main (parent) process exits but the child process is still running and once completed it has no parent process to return its exit status to.
To wait until a process has completed its work and exited, use the join() method.
and
Note It is important to join() the process after terminating it in order to give the background machinery time to update the status of the object to reflect the termination.
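In code, that note from the docs translates roughly to the following sketch (the looping worker is just illustrative):
from multiprocessing import Process
import time

def loop():
    while True:
        time.sleep(1)

if __name__ == '__main__':
    p = Process(target=loop)
    p.start()
    time.sleep(1)
    p.terminate()       # send SIGTERM to the child (on Unix)
    p.join()            # give the background machinery time to update the process status
    print(p.exitcode)   # typically -15 (negative SIGTERM) on Unix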
This is a good example that helped me understand it: here
One thing I noticed personally was my main process paused until the child had finished its process using the join() method which defeated the point of me using multiprocessing.Process() in the first place.
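The usual way to keep the parallelism and still wait at the end is to start all the processes first and only then join them; a minimal sketch:
from multiprocessing import Process

def work(i):
    print('worker %d' % i)

if __name__ == '__main__':
    procs = [Process(target=work, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()   # all children run concurrently
    for p in procs:
        p.join()    # the parent only blocks after everything has been started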
I am using the multiprocessing module of Python. I am testing the following code :
from multiprocessing import *
from time import sleep
def f():
    print('in child#1 proc')
    sleep(2)
    print('ch#1 ends')

def f1():
    print('in child#2 proc')
    sleep(10)
    print('ch#2 ends')

if __name__ == '__main__':
    p = Process(target=f)
    p1 = Process(target=f1, daemon=True)
    p.start()
    p1.start()
    sleep(1)
    print('child procs started')
I have the following observations:
The first child process p runs for 2 secs
After 1 sec, the second child process p1 becomes a zombie
The parent (main) process stays active as long as child#1 (the non-daemon process) is running, that is, for 2 secs
Now I have the following queries:
Why should the parent (main) process be active after it finishes execution? Note that the parent does not perform a join on p.
Why should the daemon child p1 become a zombie after 1 sec? Note that the parent (main) process actually stays alive till the time p is running.
I have executed the above program on Ubuntu.
My observations are based on the output of the ps command on Ubuntu.
To sum up and persist the discussion in the comments of the other answer:
Why should the parent (main) process be active after it finishes
execution? Note that the parent does not perform a join on p.
multiprocessing tries to make sure that your programs using it behave well. That is, it attempts to clean up after itself. In order to do so, it utilizes the atexit module which lets you register exit handlers that are to be executed when the interpreter process prepares to terminate normally.
multiprocessing defines and registers the function _exit_function, which first calls terminate() on all still-running daemonic children and then calls join() on all remaining non-daemonic children. Since join() blocks, the parent waits until the non-daemonic children have terminated. terminate(), on the other hand, does not block; it simply sends a SIGTERM signal (on Unix) to the children and returns.
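The atexit mechanism itself is easy to see in isolation; a toy sketch, unrelated to multiprocessing internals, just to show when registered handlers run:
import atexit

def exit_handler():
    print('interpreter is shutting down')

atexit.register(exit_handler)
print('main code done')   # exit_handler runs after this, during interpreter shutdown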
That brings us to:
Why should the daemon child p1 become a zombie after 1 sec? Note that
the parent (main) process actually stays alive till the time p is
running.
That is because the parent has reached the end of its instructions and the interpreter prepares to terminate, i.e. it executes the registered exit handlers. The daemonic child p1 receives a SIGTERM signal. Since SIGTERM is allowed to be caught and handled inside processes, the child is not forced to shut down immediately, but instead is given the chance to do some cleanup of its own. That's what makes p1 show up as <defunct>: the kernel knows that the process has been instructed to terminate, but the process has not done so yet.
In the given case, p1 has not yet had the chance to honor the SIGTERM signal, presumably because it still executes sleep(). At least as of Python 3.5:
The function now sleeps at least secs even if the sleep is interrupted
by a signal, except if the signal handler raises an exception (see PEP
475 for the rationale).
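If you want a daemonic child to handle that SIGTERM promptly and do its own cleanup, you can install a handler in the child; a rough sketch (the handler and timings are illustrative, and the daemon keyword argument requires Python 3.3+):
import signal
import sys
import time
from multiprocessing import Process

def worker():
    def on_term(signum, frame):
        print('child received SIGTERM, cleaning up')
        sys.exit(0)
    signal.signal(signal.SIGTERM, on_term)
    time.sleep(30)   # stand-in for real work

if __name__ == '__main__':
    p = Process(target=worker, daemon=True)
    p.start()
    time.sleep(1)
    # At interpreter exit, the daemonic child is terminate()d;
    # the handler above lets it clean up before exiting.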
The parent stays alive because it is the root of the app; it stays in memory while the children are processing. Note that join waits for the child to exit and then gives control back to the parent. If you don't join, the parent will exit but remain in memory.
p1 becomes a zombie because the parent exits after the sleep(1). It stays alive with p because you don't daemonize p. If you don't daemonize a process and you call start() on it, control is passed to the child, and when the child is complete it passes control back to the parent. If you do daemonize it, control stays with the parent and the child runs in the background.
I am new to Python and to multiprocessing. I am starting one process and calling a shell script from that process. After terminating this process, the shell script keeps running in the background; how do I kill it? Please help.
Python script (test.py)
#!/usr/bin/python
import time
import os
import sys
import multiprocessing
# test process
def test_py_process():
    os.system("./test.sh")
    return
p=multiprocessing.Process(target=test_py_process)
p.start()
print 'STARTED:', p, p.is_alive()
time.sleep(10)
p.terminate()
print 'TERMINATED:', p, p.is_alive()
shell script (test.sh)
#!/bin/bash
for i in {1..100}
do
    sleep 1
    echo "Welcome $i times"
done
The reason is that the child process spawned by the os.system call spawns a child process of its own. As explained in the multiprocessing docs, descendant processes of the process will not be terminated; they will simply become orphaned. So p.terminate() kills the process you created, but the shell process (/bin/bash ./test.sh) simply gets re-parented (typically to init) and continues executing.
You could use subprocess.Popen instead:
import time
from subprocess import Popen
if __name__ == '__main__':
p = Popen("./test.sh")
print 'STARTED:', p, p.poll()
time.sleep(10)
p.kill()
print 'TERMINATED:', p, p.poll()
Edit: @Florian Brucker beat me to it. He deserves the credit for answering the question first. I am still keeping this answer for the alternate approach using subprocess, which is recommended over os.system() in the documentation for os.system() itself.
os.system runs the given command in a separate process. Therefore, you have three processes:
The main process in which your script runs
The process in which test_py_process runs
The process in which the bash script runs
Process 2 is a child process of process 1, and process 3 is a child of process 2.
When you call Process.terminate from within process 1, it sends the SIGTERM signal to process 2. That process will then terminate. However, the SIGTERM signal is not automatically propagated to the child processes of process 2! This means that process 3 is not notified when process 2 exits and hence keeps on running as a child of the init process.
The best way to terminate process 3 depends on your actual problem setting, see this SO thread for some suggestions.
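One common suggestion is to run the command in its own process group (or session) and signal the whole group; a rough sketch, assuming Python 3 on a POSIX system:
import os
import signal
import subprocess
import time

# start the shell script in its own session (and therefore its own process group)
proc = subprocess.Popen(["./test.sh"], start_new_session=True)
time.sleep(10)
# signal the whole group: the bash process and anything it spawned
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)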
I'm using python to benchmark something. This can take a large amount of time, and I want to set a (global) timeout. I use the following script (summarized):
import signal
import subprocess

class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException()

signal.signal(signal.SIGALRM, timeout_handler)

# Halt problem after half an hour
signal.alarm(1800)
try:
    while solution is None:
        guess = guess()
        try:
            with open(solutionfname, 'wb') as solutionf:
                solverprocess = subprocess.Popen(["solver", problemfname], stdout=solutionf)
                solverprocess.wait()
        finally:
            # `solverprocess.poll() == None` instead of try didn't work either
            try:
                solverprocess.kill()
            except:
                # Solver process was already dead
                pass
except TimeoutException:
    pass
# Cancel alarm if it's still active
signal.alarm(0)
However, it sometimes keeps spawning orphan processes, and I can't reliably recreate the circumstances. Does anyone know the correct way to prevent this?
You simply have to wait after killing the process.
The documentation for the kill() method states:
Kills the child. On POSIX OSs the function sends SIGKILL to the child. On Windows kill() is an alias for terminate().
In other words, if you aren't on Windows, you are only sending a signal to the subprocess.
This will create a zombie process because the parent process didn't read the return value of the subprocess.
The kill() and terminate() methods are just shortcuts to send_signal(SIGKILL) and send_signal(SIGTERM).
Try adding a call to wait() after the kill(). This is even shown in the example under the documentation for communicate():
proc = subprocess.Popen(...)
try:
    outs, errs = proc.communicate(timeout=15)
except TimeoutExpired:
    proc.kill()
    outs, errs = proc.communicate()
Note the call to communicate() after the kill(). (It is equivalent to calling wait() and also reading the outputs of the subprocess.)
I want to clarify one thing: it seems like you don't understand exactly what a zombie process is. A zombie process is a terminated process. The kernel keeps the process in the process table until the parent process reads its exit status. I believe all memory used by the subprocess is actually reused; the kernel only has to keep track of the exit status of such a process.
So, the zombie processes you see aren't running. They are already completely dead, and that's why they are called zombie. They are "alive" in the process table, but aren't really running at all.
Calling wait() does exactly this: wait till the subprocess ends and read the exit status. This allows the kernel to remove the subprocess from the process table.
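In the question's code that means calling solverprocess.wait() right after solverprocess.kill(). In isolation the pattern looks like this (a sketch; the sleep command is just a stand-in for the solver):
import subprocess

proc = subprocess.Popen(["sleep", "60"])
proc.kill()               # send SIGKILL on POSIX
proc.wait()               # read the exit status so the kernel can drop the process table entry
print(proc.returncode)    # -9 on POSIX (negative SIGKILL)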
On Linux, you can use python-prctl.
Define a preexec function such as:
def pre_exec():
    import signal
    import prctl  # provided by the python-prctl package
    prctl.set_pdeathsig(signal.SIGTERM)
And pass it to your Popen call:
subprocess.Popen(..., preexec_fn=pre_exec)
That's as simple as that. Now the child process will die rather than become orphan if the parent dies.
If you don't like the external dependency of python-prctl you can also use the older prctl. Instead of
prctl.set_pdeathsig(signal.SIGTERM)
you would have
prctl.prctl(prctl.PDEATHSIG, signal.SIGTERM)
I have a subprocess which I open, which calls other processes.
I use os.killpg(os.getpgid(subOut.pid), signal.SIGTERM) to kill the entire group, but this kills the python script as well. Even when I call a python script with os.killpg from a second python script, this kills the second script as well. Is there a way to make os.killpg not stop the script?
Another solution would be to individually kill every child process. However, even using
p = psutil.Process(subOut.pid)
child_pid = p.children(recursive=True)
for pid in child_pid:
    os.kill(pid.pid, signal.SIGTERM)
does not correctly give me all the pids of the children.
And you know what they say... don't kill the script that calls you...
A bit late to answer, but since Google took me here while looking for a related problem: the reason your script gets killed is that its children will, by default, inherit its group id. But you can tell subprocess.Popen to create a new process group for your subprocess. Though it's a bit tricky: you have to pass os.setpgrp for the preexec_fn parameter. This will call setpgrp (without any arguments) in the newly created (forked) process, before it does the exec, which will set the gid of the new process to the pid of the new process (thus creating a new group). The documentation mentions that preexec_fn can deadlock in multi-threaded code. As an alternative, you can use start_new_session=True, but that would create not only a new process group but a new session. (And that would mean that if you close your terminal session while your script is running, the children would not be terminated. It may or may not be a problem.)
As a side note, if you are on Windows, you can simply pass subprocess.CREATE_NEW_PROCESS_GROUP in the creationflags parameter.
Here is what it looks like in detail:
subOut = subprocess.Popen(['your', 'subprocess', ...], preexec_fn=os.setpgrp)
# when it's time to kill
os.killpg(os.getpgid(subOut.pid), signal.SIGTERM)
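The start_new_session alternative mentioned above looks almost identical; a sketch (replace the placeholder command with your own):
import os
import signal
import subprocess

subOut = subprocess.Popen(['your', 'subprocess'], start_new_session=True)
# when it's time to kill
os.killpg(os.getpgid(subOut.pid), signal.SIGTERM)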
Create a process group containing all the immediate children of the calling process as follows:
p1 = subprocess.Popen(cmd1)
os.setpgid(p1.pid, 0)  # creates a new process group with id equal to p1.pid
p2 = subprocess.Popen(cmd2)
os.setpgid(p2.pid, os.getpgid(p1.pid))
pn = subprocess.Popen(cmdn)
os.setpgid(pn.pid, os.getpgid(p1.pid))
# Kill all the children and their process trees using the following command
os.killpg(os.getpgid(p1.pid), signal.SIGKILL)
It will kill the whole process tree except the calling process itself.
atleta's answer above worked for me but the preexec_fn argument in the call to Popen should be setpgrp, rather than setgrp:
subOut = subprocess.Popen(['your', 'subprocess', ...], preexec_fn=os.setpgrp)
I'm posting this as an answer instead of a comment on atleta's answer because I don't have comment privileges yet.
An easy way is to set the parent process to ignore the signal before sending it.
# Tell this (parent) process to ignore the signal
old_handler = signal.signal(sig, signal.SIG_IGN)
# Send the signal to our process group and
# wait for all the children to exit.
os.killpg(os.getpgid(0), sig)
try:
    while True:
        os.wait()  # reap children until none are left
except OSError:
    # os.wait() raises once there are no more children to wait for
    pass
# Restore the handler
signal.signal(sig, old_handler)
It seems to me in Python, there is no need to reap zombie processes.
For example, in the following code
import multiprocessing
import time
def func(msg):
    time.sleep(2)
    print "done " + str(msg)

if __name__ == "__main__":
    for i in range(10):
        p = multiprocessing.Process(target=func, args=('3',))
        p.start()
        print "child" + str(i)
    print "parent"
    time.sleep(100)
When all the child processes exit, the parent process is still running,
and at this time, I checked the processes using ps -ef
and I noticed there is no defunct process.
Does this mean that in Python, there is no need to reap zombie process?
After having a look at the library (especially multiprocessing/process.py), I see that
in Process.start(), there is a _current_process._children.add(self) which adds the started process to a list/set/whatever,
a few lines above, there is a _cleanup() which polls and discards terminated processes, removing zombies.
But that doesn't explain why your code doesn't produce zombies, as the children wait a while before terminating, so the parent's start() calls don't notice that yet.
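You can observe that cleanup from the outside: per the docs, multiprocessing.active_children() has the side effect of joining any processes that have already finished, so calling it (or starting another Process) reaps children that have exited in the meantime. A small sketch:
import multiprocessing
import time

def quick():
    pass

if __name__ == '__main__':
    p = multiprocessing.Process(target=quick)
    p.start()
    time.sleep(1)                        # by now the child has exited
    multiprocessing.active_children()    # side effect: joins/reaps already-finished children
    time.sleep(5)                        # checking `ps -ef` here should show no defunct child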
Those processes are not actually zombies, since they should terminate successfully.
You could set the child processes to be daemonic so they'll terminate if the main process terminates.