Using subprocess module to work in parallel (Multi-process) - python

I'm new to multiprocessing in Python. Consider the following function:
def do_something_parallel(self):
    result_operation1 = doit.main(A,B)
    do_something_else(C)
I want doit.main to run in another process and to be non-blocking, so that the code in do_something_else runs immediately after doit.main has been launched in the other process.
How can I do this using the Python subprocess module?
Is there a difference between using a subprocess and creating a new process alongside another one? Why would we need a child process of another process?
Note: I do not want to use a multithreaded approach here.
EDIT: I wondered whether using the subprocess module and the multiprocessing module in the same function is prohibited?
The reason I want this is that I have two things to run: first an exe file, and second a function, and each needs its own process.

If you want to run Python code in a separate process, you could use the multiprocessing module:
import multiprocessing
if __name__ == "__main__":
    multiprocessing.Process(target=doit.main, args=[A, B]).start()
    do_something_else() # this runs immediately without waiting for main() to return
I wondered whether using the subprocess module and the multiprocessing module in the same function is prohibited?
No. You can use both subprocess and multiprocessing in the same function (moreover, multiprocessing may use subprocess to start its worker processes internally).
The reason I want this is that I have two things to run: first an exe file, and second a function, and each needs its own process.
You don't need multiprocessing to run an external command without blocking (obviously, in its own process); subprocess.Popen() is enough:
import subprocess
p = subprocess.Popen(['command', 'arg 1', 'arg 2'])
do_something_else() # this runs immediately without waiting for command to exit
p.wait() # this waits for the command to finish

subprocess.Popen is definitely what you want if the "worker" process is an executable. Threading is what you need when you need things to happen asynchronously, and multiprocessing is what you need if you want to take advantage of multiple cores for improved performance (although you will likely find yourself also using threads at the same time, as they handle the asynchronous output of multiple parallel processes).
The main limitation of multiprocessing is passing information. When a new process is spawned, an entirely separate instance of the Python interpreter is started with its own independent memory allocation. As a result, variables changed by one process are not changed for the other processes. For that functionality you need shared memory objects (also provided by the multiprocessing module). One implementation I have done was a parent process that started several worker processes and passed each of them an input queue and an output queue. The function given to the child processes was a loop that did some calculations on the inputs pulled from the input queue and then put the results on the output queue. I then designated a special input that the child would recognize as the signal to end the loop and terminate the process.
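A minimal sketch of the queue-based worker loop described above. The names worker, run_workers and STOP are made up for illustration, squaring stands in for the real calculation, and None is assumed to be safe as the special "stop" input because it is never a real work item.

```python
import multiprocessing

STOP = None  # hypothetical sentinel: tells a worker to leave its loop


def worker(inputs, outputs):
    """Square each number taken from `inputs` until STOP arrives."""
    while True:
        item = inputs.get()
        if item is STOP:
            break
        outputs.put(item * item)


def run_workers(values, n_workers=2):
    """Start the workers, feed them `values`, and collect the results."""
    inputs = multiprocessing.Queue()
    outputs = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(inputs, outputs))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for v in values:
        inputs.put(v)
    for _ in procs:
        inputs.put(STOP)  # one sentinel per worker
    # Read the results before joining so the children can flush their queues.
    results = [outputs.get() for _ in values]
    for p in procs:
        p.join()
    return sorted(results)  # workers may finish in any order


if __name__ == "__main__":
    print(run_workers([1, 2, 3, 4]))
```

Each worker consumes exactly one sentinel, which is why the parent puts one per process on the input queue.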
On your edit - Popen will start the other process in parallel, as will multiprocessing. If you need the child process to communicate with the executable, be sure to pass the file stream handles to the child process somehow.

Related

ProcessPoolExecutor with loky in Windows, one process is not terminated

In my quest to make a parallelised python program (written on Linux) truly platform independent, I was looking for python packages that would parallelise seamlessly in windows. So, I found joblib, which looked like a godsend, because it worked on Windows without having hundreds of annoying pickling errors.
Then I ran into a problem that the processes spawned by joblib would continue to exist even if there were no parallel jobs running. This is a problem because in my code, there are multiple uses of os.chdir(). If there are processes running after the jobs end, there is no way to do os.chdir() on them, so it is not possible to delete the temporary folder where the parallel processes were working. This issue has been noted in this previous post: Joblib Parallel doesn't terminate processes
After some digging, I figured out that the problem is coming from the backend loky, which reuses the pool of processes, so keeps them alive.
So, I used loky directly, by calling the loky.ProcessPoolExecutor() instead of using loky.get_reusable_executor() that joblib uses. The ProcessPoolExecutor from loky seems to be a reimplementation of python's concurrent.futures and uses similar semantics.
This works fine, however, there is one python process that seems to still stick around, even after shutting down the ProcessPoolExecutor.
Minimal example with interactive python on Windows (use interactive shell because otherwise all the processes will terminate on exit):
>>> import loky
>>> import time
>>> def f(): # just some dummy function
...     time.sleep(10)
...     return 0
...
>>> pool = loky.ProcessPoolExecutor(3)
At this point, there is only one python process running (please note the PID in task manager).
Then, submit a job to the process pool (which returns a Future object).
>>> task = pool.submit(f)
>>> task
<Future at 0x11ad6f37640 state=running>
There are 5 processes now. One is the main process (PID=16508). There are three worker processes in the pool. But there is another extra process. I am really not sure what it is doing there.
After getting results, shutting the pool down removes the three worker processes. But not the one extra process.
>>> task.result()
0
>>> pool.shutdown() # you can add kill_workers=True, but does not change the result
The new python process with PID=16904 is still running. I have tried looking through the source code of loky, but I cannot figure out where that additional process is being created (and why it is necessary). Is there any way to tell this loky process to shutdown? (I do not want to resort to os.kill or some other drastic way of terminating process e.g. with SIGTERM, I want to do it programmatically if I can)

Executing a long running external process in parallel

In python, I'm writing a script that runs an external process. This external process does the following steps:
1. Fetch a value from a config file, taking into account other running processes.
2. Run another process, using the value from step 1.
Step 1 can be bypassed by passing in a value to use. Trying to use the same value concurrently is an error, but using it sequentially is valid. (think of it as a pool of pids, with no more than 10 available) Other processes (e.g. a user logging in) can use one of these "pids".
The external process takes a few hours to run, and multiple independent copies must be run. Running them sequentially works, but takes too long.
I'm changing the script to run these processes concurrently using the multiprocessing module. A simplified version of my code is:
from multiprocessing import Pool
import subprocess
def longRunningTask(n):
    subprocess.call(["ls", "-l"]) # real code uses a process with no screen I/O
if __name__ == '__main__':
    myArray = [1,2,3,4,5]
    pool = Pool(processes=3)
    pool.map(longRunningTask, myArray)
Using this code fails, because it uses the same "pid" for every process started.
The solutions I've come up with are:
1. If the call fails, have a random delay and try again. This could end up busy waiting for hours if enough "pids" are in use.
2. Create a Queue of the available "pids", get() an item from it before starting the process, and put() it when it completes. This would still need to wait if the "pid" was in use, the same as number 1.
3. Use a Manager to hold an array of "pids" that are in use (starting empty). Before starting the process, get a "pid", check if it's in the array (start again if it is), add it to the array, remove it when done.
Are there problems with approach 3, or is there a different way to do it?
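A rough sketch of approach 2, under some loud assumptions: the slot values 101 to 103 are invented, the external command is a placeholder Python one-liner, and threads are used instead of multiprocessing.Pool because each task only waits on an external process. With a process pool, a queue from multiprocessing.Manager() could play the same role. It does not account for "pids" taken by unrelated processes such as a user logging in.

```python
import queue
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical slot values standing in for the "pids" from the question.
free_slots = queue.Queue()
for slot in (101, 102, 103):
    free_slots.put(slot)


def long_running_task(n):
    """Borrow a slot, run the external command with it, then return it."""
    slot = free_slots.get()  # blocks (no busy waiting) until a slot is free
    try:
        # Placeholder command; the real script would pass `slot` to its tool.
        code = subprocess.call(
            [sys.executable, "-c", "import sys; sys.exit(0)", str(slot)])
        return n, slot, code
    finally:
        free_slots.put(slot)  # hand the slot back even if the call raised


def run_all(jobs, workers=3):
    """Run every job, with at most `workers` external commands at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(long_running_task, jobs))


if __name__ == "__main__":
    for n, slot, code in run_all([1, 2, 3, 4, 5]):
        print(n, slot, code)
```

Because get() blocks until a slot is returned, a slot is never handed to two tasks at the same time, and waiting costs no CPU.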

Sharing imports between Python processes

I am using the multiprocessing package to spawn multiple processes that execute a function, say func (with different arguments). func imports the numpy package and I was wondering if every process would import the package. In fact, the main thread, or rather the main process, also imports numpy, and that could easily be shared between the different processes executing func.
There would be a major performance hit due to multiple imports of a library.
I was wondering if every process would import the package.
Assuming the import occurs after you've forked the process, then, yes. You could avoid this by doing the import before the fork, though.
There would be a major performance hit due to multiple imports of a library.
Well, there would be a performance hit if you do the import after the fork, but probably not a "major" one. The OS would most likely have all the necessary files in its cache, so it would only be reading from RAM, not disk.
Update
Just noticed this...
In fact, the main thread, or rather main process also imports numpy...
If you're already importing numpy before forking, then the imports in the subprocesses will only create a reference to the existing imported module. This should take less than a millisecond, so I wouldn't worry about it.
The answer to that question is in the documentation of the multiprocessing library: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
The summary is that it depends on which start method you choose. There are 3 methods available (spawn is the default on Windows and macOS; on other Unix systems the default was fork before Python 3.14 and has been forkserver since then):
spawn: The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method.
fork: The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process.
forkserver: When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process.
You must set the start method if you want to change it. Example:
import multiprocessing as mp
mp.set_start_method('spawn')
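As a hedged illustration, the sketch below asks a child process whether a module is already present in its sys.modules under each start method the platform offers. The helper names loaded_in_child and check are made up, and json merely stands in for numpy so that the example needs only the standard library. It uses multiprocessing.get_context(), which selects a start method for one context without changing the global default.

```python
import multiprocessing as mp
import sys


def loaded_in_child(module_name, conn):
    """Report whether `module_name` is already imported in this process."""
    conn.send(module_name in sys.modules)
    conn.close()


def check(method, module_name="json"):
    """Start one child with the given start method and ask it."""
    ctx = mp.get_context(method)  # leaves the global default untouched
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=loaded_in_child, args=(module_name, child_conn))
    proc.start()
    answer = parent_conn.recv()
    proc.join()
    return answer


if __name__ == "__main__":
    import json  # imported in the parent before any child starts
    for method in mp.get_all_start_methods():
        print(method, check(method))
```

The answers can differ per start method, since a forked child begins with a copy of the parent's memory while a spawned child begins with a fresh interpreter.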

multiprocessing.freeze_support()

Why does the multiprocessing module need to call a specific function to work when being "frozen" to produce a windows executable?
The reason is the lack of fork() on Windows (which is not entirely true). Because of this, on Windows fork is simulated by creating a new process that runs the code which, on Linux, would run in the forked child. Since that code has to run in a technically unrelated process, it must be delivered there before it can run: it is pickled and then sent through a pipe from the original process to the new one. The new process is also told that it has to run the code passed through the pipe, by means of the --multiprocessing-fork command-line argument. If you look at the implementation of freeze_support(), its task is to check whether the process it is running in is supposed to run code passed through the pipe or not.
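For reference, a small sketch of where the call usually goes in a script that is meant to be frozen. The function names square and main are placeholders, not part of any real program.

```python
import multiprocessing


def square(x):
    return x * x


def main():
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))


if __name__ == "__main__":
    # Must come first in the guarded block. In a frozen Windows executable,
    # a child started by multiprocessing runs the passed-in code from here
    # and never reaches main(); otherwise the call does nothing.
    multiprocessing.freeze_support()
    main()
```

When the script is run by a normal Python interpreter, or on an operating system other than Windows, freeze_support() has no effect, so the call is safe to leave in.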

How to find out if a program crashed with subprocess?

My application creates subprocesses. Usually, these processes run and terminate without any problems. However, sometimes they crash.
I am currently using the Python subprocess module to create these subprocesses. I check whether a subprocess has crashed by invoking the Popen.poll() method. Unfortunately, since my debugger is activated at the time of a crash, polling doesn't return the expected output.
I'd like to be able to see the debugging window (not terminate it) and still be able to detect in the Python code whether a process has crashed.
Is there a way to do this?
When your debugger opens, the process isn't finished yet - and subprocess only knows if a process is running or finished. So no, there is not a way to do this via subprocess.
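To make the "running or finished" distinction concrete, here is a small sketch of polling a child and interpreting its return code. The helper names describe_exit and watch are invented, and the failing child is simulated by a Python one-liner that exits with code 3.

```python
import subprocess
import sys
import time


def describe_exit(returncode):
    """Turn a Popen return code into a short label."""
    if returncode is None:
        return "still running"
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # POSIX only: the child was terminated by signal number -returncode.
        return "killed by signal %d" % -returncode
    return "failed with exit code %d" % returncode


def watch(command, interval=0.1):
    """Poll the child until it finishes, then describe how it ended."""
    proc = subprocess.Popen(command)
    while proc.poll() is None:  # None means the process has not finished
        time.sleep(interval)  # other work could be done here instead
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    # Stand-in for a failing child: a Python one-liner that exits with 3.
    print(watch([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

While a debugger or crash dialog keeps the child alive, poll() keeps returning None, which is the limitation described above.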
I found a workaround for this problem. I used the solution given in another question: Can the "Application Error" dialog box be disabled?
Items of consideration:
subprocess.check_output() for your child processes' return codes (it raises CalledProcessError on a non-zero exit)
psutil for process & child analysis (and much more)
threading library, to monitor these child states in your script as well once you've decided how you want to handle the crashing, if desired
import psutil
myprocess = psutil.Process(process_id) # you can find your process id in various ways of your choosing
for child in myprocess.children():
    print("Status of child process is: {0}".format(child.status()))
You can also use the threading library to load your subprocess into a separate thread, and then perform the above psutil analyses concurrently with your other process.
If you find more, let me know, it's no coincidence I've found this post.
