I am using the multiprocessing package to spawn multiple processes that execute a function, say func (with different arguments). func imports the numpy package, and I was wondering if every process would import the package. In fact, the main thread, or rather the main process, also imports numpy, and that could easily be shared between the different func-executing processes.
There would be a major performance hit due to multiple imports of a library.
I was wondering if every process would import the package.
Assuming the import occurs after you've forked the process, then, yes. You could avoid this by doing the import before the fork, though.
There would be a major performance hit due to multiple imports of a library.
Well, there would be a performance hit if you do the import after the fork, but probably not a "major" one. The OS would most likely have all the necessary files in its cache, so it would only be reading from RAM, not disk.
Update
Just noticed this...
In fact, the main thread, or rather main process also imports numpy...
If you're already importing numpy before forking, then the imports in the subprocesses will only create a reference to the existing imported module. This should take less than a millisecond, so I wouldn't worry about it.
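For illustration, here is a minimal sketch (assuming a Unix-like system using the fork start method, with numpy installed) showing that a child forked after the import just re-binds the already-loaded module:
import multiprocessing as mp
import sys

import numpy  # imported once in the parent, before any child process exists

def func():
    # In a forked child, numpy is already present in sys.modules, so this
    # re-import is just a dictionary lookup -- no files are read again.
    assert "numpy" in sys.modules
    import numpy as np
    print("child re-used the already-loaded numpy:", np.__version__)

if __name__ == "__main__":
    ctx = mp.get_context("fork")   # fork: the child inherits the parent's memory
    p = ctx.Process(target=func)
    p.start()
    p.join()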
The answer to that question is in the documentation of the multiprocessing library: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
The summary is that it depends on which start method you choose. There are 3 methods available (defaults to fork on Unix, to spawn on Windows/Mac):
spawn: The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method.
fork: The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process.
forkserver: When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process.
You must set the start method if you want to change it. Example:
import multiprocessing as mp

if __name__ == '__main__':
    mp.set_start_method('spawn')
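If you only want the non-default start method in one place, a small sketch (my own, not from the documentation) is to ask for a context object instead, which leaves the process-wide default untouched:
import multiprocessing as mp

def greet(name):
    print(f"hello from a {name} child process")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # choice stays local; no global default is changed
    p = ctx.Process(target=greet, args=("spawned",))
    p.start()
    p.join()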
Related
I am trying to use forkserver and I encountered NameError: name 'xxx' is not defined in worker processes.
I am using Python 3.6.4, but the documentation should be the same. From https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods it says that:
The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
Also, it says:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
So apparently a key object that my worker processes need did not get inherited by the server process and then passed on to the workers. Why did that happen? I wonder what exactly gets inherited by the forkserver process from the parent process.
Here is what my code looks like:
import multiprocessing
# import (a bunch of other modules)

def worker_func(nameList):
    global largeObject
    for item in nameList:
        # get some info from largeObject using item as index
        # do some calculation
        return [item, info]

if __name__ == '__main__':
    result = []
    largeObject = ...  # my large object; it's read-only and no modification will be made to it
    nameList = ...     # a list of items for which I need to get info from largeObject
    ctx_in_main = multiprocessing.get_context('forkserver')
    print('Start parallel, using forking/spawning/?:', ctx_in_main.get_start_method())
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=4) as pool:
        for x in pool.imap_unordered(worker_func, nameList):
            result.append(x)
Theory
Below is an excerpt from the Bojan Nikolic blog:
Modern Python versions (on Linux) provide three ways of starting the separate processes:
Fork()-ing the parent process and continuing with the same process image in both parent and child. This method is fast, but potentially unreliable when the parent state is complex.
Spawning the child processes, i.e., fork()-ing and then calling execv to replace the process image with a new Python process. This method is reliable but slow, as the process image is reloaded afresh.
The forkserver mechanism, which consists of a separate Python server that has a relatively simple state and which is fork()-ed when a new process is needed. This method combines the speed of fork()-ing with good reliability (because the parent being forked is in a simple state).
Forkserver
The third method is forkserver. Note that children retain a copy of the forkserver state. This state is intended to be relatively simple, but it is possible to adjust it through the multiprocessing API via the set_forkserver_preload() method.
Practice
Thus, if you want something to be inherited by the child processes from the parent, it must be part of the forkserver state, which you specify by means of set_forkserver_preload(module_names); this sets a list of module names to try to load in the forkserver process. I give an example below:
# inherited.py
large_obj = {"one": 1, "two": 2, "three": 3}

# main.py
import multiprocessing
import os
from time import sleep

from inherited import large_obj

def worker_func(key: str):
    print(f'PID={os.getpid()}, obj id={id(large_obj)}')
    sleep(1)
    return large_obj[key]

if __name__ == '__main__':
    result = []
    ctx_in_main = multiprocessing.get_context('forkserver')
    ctx_in_main.set_forkserver_preload(['inherited'])
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=cores) as pool:
        for x in pool.imap(worker_func, ["one", "two", "three"]):
            result.append(x)
    for res in result:
        print(res)
Output:
# The PIDs are different but the address is always the same
PID=18603, obj id=139913466185024
PID=18604, obj id=139913466185024
PID=18605, obj id=139913466185024
And if we don't use preloading:
...
    ctx_in_main = multiprocessing.get_context('forkserver')
    # ctx_in_main.set_forkserver_preload(['inherited'])
    cores = ctx_in_main.cpu_count()
...
# The PIDs are different, the addresses are different too
# (but sometimes they can coincide)
PID=19046, obj id=140011789067776
PID=19047, obj id=140011789030976
PID=19048, obj id=140011789030912
So after an inspiring discussion with Alex I think I have sufficient info to address my question: what exactly gets inherited by the forkserver process from the parent process?
Basically, when the server process starts, it imports your main module, and everything before if __name__ == '__main__' is executed. That's why my code doesn't work: large_object is nowhere to be found in the server process, nor in any of the worker processes that fork from the server process.
Alex's solution works because large_object now gets imported into both the main and the server process, so every worker forked from the server also gets large_object. Combined with set_forkserver_preload(module_names), all workers might even get the same large_object, from what I saw. The reason for using forkserver is explicitly explained in the Python documentation and in Bojan's blog:
When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
The forkserver mechanism, which consists of a separate Python server that has a relatively simple state and which is fork()-ed when a new process is needed. This method combines the speed of fork()-ing with good reliability (because the parent being forked is in a simple state).
So it is mostly a safety concern here.
On a side note, if you use fork as the start method, you don't need to import anything, since every child process gets a copy of the parent process's memory (or a reference, if the system uses copy-on-write; please correct me if I am wrong). In this case, using global large_object will give you access to large_object in worker_func directly.
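A minimal sketch of that fork-based shortcut (Unix only; the object and keys here are just placeholders):
import multiprocessing as mp

large_object = {"one": 1, "two": 2, "three": 3}   # built once, in the parent

def worker_func(key):
    # With the fork start method the child inherits the parent's memory
    # (copy-on-write), so the module-level object is visible here directly.
    return large_object[key]

if __name__ == "__main__":
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(worker_func, ["one", "two", "three"]))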
The forkserver approach might not be suitable for me, though, because the issue I am facing is memory overhead. All the operations that get me large_object in the first place are memory-consuming, so I don't want any unnecessary resources in my worker processes.
If I put all those calculations directly into inherited.py, as Alex suggested, they will be executed twice (once when I import the module in main and once when the server imports it; maybe even more when worker processes are born?). That is fine if I just want a single-threaded, safe process that the workers can fork from, but since I am trying to get the workers to inherit only large_object and no unnecessary resources, it won't work.
And putting those calculations under if __name__ == '__main__' in inherited.py won't work either, since now none of the processes will execute them, including main and server.
So, in conclusion, if the goal is to have the workers inherit minimal resources, I am better off breaking my code in two: run calculation.py first, pickle the large_object, exit the interpreter, and start a fresh one that loads the pickled large_object. Then I can go nuts with either fork or forkserver.
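A rough sketch of that two-step split (the file names, the stand-in object and the keys are placeholders, not my actual code):
# calculation.py -- step 1: do the memory-heavy work once, pickle the result, exit.
import pickle

large_object = {"one": 1, "two": 2, "three": 3}   # stand-in for the expensive, memory-heavy construction
with open("large_object.pkl", "wb") as f:
    pickle.dump(large_object, f)

# main.py -- step 2: a fresh interpreter loads the pickle, then forks the workers.
import multiprocessing
import pickle

with open("large_object.pkl", "rb") as f:
    large_object = pickle.load(f)     # loaded before any worker process exists

def worker_func(key):
    return large_object[key]          # inherited by the forked workers

if __name__ == '__main__':
    ctx = multiprocessing.get_context('fork')   # or 'forkserver'
    with ctx.Pool() as pool:
        result = pool.map(worker_func, ["one", "two", "three"])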
I'm new to multiprocessing in Python. Consider that you have the following function:
def do_something_parallel(self):
    result_operation1 = doit.main(A, B)
    do_something_else(C)
The point is that I want doit.main to run in another process and be non-blocking, so that the code in do_something_else runs immediately after the first call has been launched in the other process.
How can I do this using the Python subprocess module?
Is there a difference between using subprocess and creating a new process alongside another one? Why would we need a child process of another process?
Note: I do not want to use a multithreaded approach here.
EDIT: I wondered whether using the subprocess module and the multiprocessing module in the same function is prohibited?
The reason I want this is that I have two things to run: first an exe file, and second a function, and each needs its own process.
If you want to run Python code in a separate process, you could use the multiprocessing module:
import multiprocessing

if __name__ == "__main__":
    multiprocessing.Process(target=doit.main, args=[A, B]).start()
    do_something_else()  # this runs immediately without waiting for main() to return
I wondered whether using the subprocess module and the multiprocessing module in the same function is prohibited?
No. You can use both subprocess and multiprocessing in the same function (moreover, multiprocessing may use subprocess to start its worker processes internally).
The reason I want this is that I have two things to run: first an exe file, and second a function, and each needs its own process.
You don't need multiprocessing to run an external command without blocking (obviously, in its own process); subprocess.Popen() is enough:
import subprocess
p = subprocess.Popen(['command', 'arg 1', 'arg 2'])
do_something_else() # this runs immediately without waiting for command to exit
p.wait() # this waits for the command to finish
subprocess.Popen is definitely what you want if the "worker" process is an executable. Threading is what you need when you need things to happen asynchronously, and multiprocessing is what you need if you want to take advantage of multiple cores for improved performance (although you will likely find yourself also using threads at the same time, to handle the asynchronous output of the multiple parallel processes).
The main limitation of multiprocessing is passing information. When a new process is spawned, an entirely separate instance of the Python interpreter is started with its own independent memory allocation. As a result, variables changed by one process are not changed for the other processes. For that you need shared-memory objects (also provided by the multiprocessing module). One implementation I have done was a parent process that started several worker processes and passed each of them both an input queue and an output queue. The function given to the child processes was a loop that did some calculations on the inputs pulled from the input queue and then put the results on the output queue. I then designated a special input that the child would recognize as the signal to end the loop and terminate the process, as sketched below.
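A minimal sketch of that pattern (the calculation and the sentinel value are illustrative):
import multiprocessing as mp

SENTINEL = None                         # special input that tells a worker to stop

def worker(in_q, out_q):
    # Loop: pull inputs from the input queue, compute, push results, until the sentinel.
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        out_q.put(item * item)          # stand-in for the real calculation

if __name__ == "__main__":
    in_q, out_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(4)]
    for w in workers:
        w.start()
    for x in range(10):
        in_q.put(x)
    for _ in workers:
        in_q.put(SENTINEL)              # one stop signal per worker
    results = [out_q.get() for _ in range(10)]   # drain results before joining
    for w in workers:
        w.join()
    print(sorted(results))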
On your edit - Popen will start the other process in parallel, as will multiprocessing. If you need the child process to communicate with the executable, be sure to pass the file stream handles to the child process somehow.
While using multiprocessing in Python on Windows, you are expected to protect the entry point of the program.
The documentation says "Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process)". Can anyone explain what exactly this means?
Expanding a bit on the good answer you already got, it helps if you understand what Linux-y systems do. They spawn new processes using fork(), which has two good consequences:
All data structures existing in the main program are visible to the child processes. They actually work on copies of the data.
The child processes start executing at the instruction immediately following the fork() in the main program - so any module-level code already executed in the module will not be executed again.
fork() isn't possible in Windows, so on Windows each module is imported anew by each child process. So:
On Windows, no data structures existing in the main program are visible to the child processes; and,
All module-level code is executed in each child process.
So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program - so that should be protected by __name__ == '__main__'. For a subtler example, consider code that builds a gigantic list, which you intend to pass out to worker processes to crawl over. You probably want to protect that too, because there's no point in this case to make each worker process waste RAM and time building their own useless copies of the gigantic list.
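A short sketch of that guard (the list contents and the per-chunk work are placeholders):
import multiprocessing as mp

def crawl(chunk):
    return sum(chunk)                   # stand-in for the real per-chunk work

if __name__ == "__main__":
    # Built only in the main program; child processes never rebuild it on import.
    gigantic_list = [list(range(1000)) for _ in range(100)]
    with mp.Pool() as pool:
        print(sum(pool.map(crawl, gigantic_list)))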
Note that it's a Good Idea to use __name__ == "__main__" appropriately even on Linux-y systems, because it makes the intended division of work clearer. Parallel programs can be confusing - every little bit helps ;-)
The multiprocessing module works by creating new Python processes that will import your module. If you did not add the __name__ == '__main__' protection, you would enter a never-ending loop of new process creation. It goes like this:
Your module is imported and executes code during the import that causes multiprocessing to spawn 4 new processes.
Those 4 new processes in turn import the module and execute code during the import that causes multiprocessing to spawn 16 new processes.
Those 16 new processes in turn import the module and execute code during the import that causes multiprocessing to spawn 64 new processes.
Well, hopefully you get the picture.
So the idea is that you make sure that the process spawning only happens once. And that is achieved most easily with the idiom of the __name__== '__main__' protection.
I am using the multiprocessing module to fork child processes. Since, on forking, the child process gets the address space of the parent process, I am getting the same logger for parent and child. I want to clear the child process's address space of any values carried over from the parent. I learned that multiprocessing does fork() at a lower level but not exec(). I want to know whether it is good to use multiprocessing in my situation, whether I should go for an os.fork() and os.exec() combination, or whether there is some other solution.
Since multiprocessing is running a function from your program as if it were a thread function, it definitely needs a full copy of your process' state. That means doing fork().
Using the higher-level interface provided by multiprocessing is generally better. At the very least, you do not have to handle the fork() return value yourself.
os.fork() is a lower-level function providing less service out of the box, though you certainly can use it for anything multiprocessing is used for... at the cost of partially reimplementing multiprocessing's code. So, I think, multiprocessing should be OK for you.
However, if your process's memory footprint is too large to duplicate (or if you have other reasons to avoid forking -- open connections to databases, open log files, etc.), you may have to turn the function you want to run in a new process into a separate Python program. Then you can run it using subprocess, pass parameters to its stdin, capture its stdout and parse the output to get the results.
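A rough sketch of that subprocess-based variant (the script name and the stdin/stdout protocol are made up):
import subprocess
import sys

# Run the worker as a standalone Python program in its own process.
proc = subprocess.Popen(
    [sys.executable, "heavy_worker.py"],        # hypothetical separate script
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = proc.communicate("param1 param2\n")    # pass parameters via stdin
result = out.strip()                            # parse stdout to get the result
print(result)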
UPD: the os.exec...() family of functions is hard to use for most purposes, since it replaces your process with the spawned one (if you run the same program that is already running, it will restart from the very beginning, keeping no in-memory data). However, if you really do not need to continue executing the parent process, exec() may be of some use.
From my personal experience: os.fork() is used very often to create daemon processes on Unix; I often use subprocess (with communication through stdin/stdout); I have almost never used multiprocessing; and not once in my life have I needed os.exec...().
You can just rebind the logger in the child process to its own. I don't know about other OSes, but on Linux forking doesn't duplicate the entire memory footprint (as Ellioh mentioned); it uses copy-on-write. So until you change something in the child process, it stays in the memory of the parent process. For instance, you can fork 100 child processes (that only read from memory, never write) and check the overall memory usage: it will not be parent_memory_usage * 100, but much less.
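A minimal sketch of rebinding the logger in the child (the handler and file name are only illustrative):
import logging
import multiprocessing as mp
import os

def worker():
    # Create and configure a logger that belongs to this child only.
    logger = logging.getLogger(f"worker-{os.getpid()}")
    logger.addHandler(logging.FileHandler(f"worker-{os.getpid()}.log"))
    logger.setLevel(logging.INFO)
    logger.info("logging from the child process")

if __name__ == "__main__":
    p = mp.Process(target=worker)
    p.start()
    p.join()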
I am using a multiprocessing pool of workers as part of a much larger application. Since I use it for crunching a large volume of simple math, I have a shared-nothing architecture, where the only variables the workers ever need are passed on as arguments. Thus, I do not need the worker subprocesses to import any globals, my __main__ module or, consequently, any of the modules it imports. Is there any way to force such behavior and avoid the performance hit when spawning the pool?
I should note that my environment is Win32, which lacks os.fork(), and that the worker processes are spawned "using a subprocess call to sys.executable (i.e. start a new Python process) followed by serializing all of the globals, and sending those over the pipe", as per this SO post. That said, I want to do as little of the above as possible so that my pool opens faster.
Any ideas?
Looking at the multiprocessing.forking implementation, particularly get_preparation_data and prepare (win32-specific), globals aren't getting pickled. The re-import of the parent process's __main__ is a bit ugly, but it won't run any code except what is at the top level; not even the if __name__ == '__main__' clauses. So just keep the main module free of import-time side effects.
You can also prevent your main module from importing anything when the subprocess starts (only useful on win32 which, as you note, can't fork). Move main() and its imports to a separate module, so that the startup script contains only:
if __name__ == '__main__':
    from mainmodule import main
    main()
There is still an implicit "import site" in the child process startup. It does important initialisation, and I don't think mp.forking has an easy way to disable it, but I don't expect it to be expensive anyway.