Preventing pool processes from importing __main__ and globals

I am using a multiprocessing pool of workers as a part of a much larger application. Since I use it for crunching a large volume of simple math, I have a shared-nothing architecture, where the only variables the workers ever need are passed on as arguments. Thus, I do not need the worker subprocesses to import any globals, my __main__ module or, consequently, any of the modules it imports. Is there any way to force such a behavior and avoid the performance hit when spawning the pool?
I should note that my environment is Win32, which lacks os.fork() and the worker processes are spawned "using a subprocess call to sys.executable (i.e. start a new Python process) followed by serializing all of the globals, and sending those over the pipe." as per this SO post. This being said, I want to do as little of the above as possible so my pool opens faster.
Any ideas?

Looking at the multiprocessing.forking implementation, particularly get_preparation_data and prepare (win32-specific), globals aren't getting pickled. The re-import of the parent process's __main__ is a bit ugly, but it won't run any code except what is at the top level; in particular, if __name__ == '__main__' clauses are skipped, because __name__ is not '__main__' in the child. So just keep the main module free of import-time side effects.
You can prevent your main module from importing anything when the subprocess starts, too (only useful on win32 which, as you note, can't fork). Move the main() and its imports to a separate module, so that the startup script contains only:
if __name__ == '__main__':
    from mainmodule import main
    main()
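For completeness, a hedged sketch of what the separated module might look like; the worker function and pool size here are illustrative, not taken from the original post:
# mainmodule.py - holds main() and all of the heavy imports
from multiprocessing import Pool

def crunch(x):
    # stand-in for the simple math the workers actually do
    return x * x

def main():
    pool = Pool(processes=4)
    try:
        print(pool.map(crunch, range(100)))
    finally:
        pool.close()
        pool.join()
The startup script then stays an essentially empty __main__ for the children to re-import; mainmodule (and whatever it imports) is only loaded in a child once something defined in it, such as the worker function, has to be unpickled there.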
There is still an implicit import of the site module in the child process's startup. It does important initialisation and I don't think multiprocessing.forking has an easy way to disable it, but I don't expect it to be expensive anyway.

Related

multiprocessing in python - what gets inherited by forkserver process from parent process?

I am trying to use forkserver and I encountered NameError: name 'xxx' is not defined in worker processes.
I am using Python 3.6.4, but the documentation should be the same. From https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods it says that:
The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
Also, it says:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
So apparently a key object that my worker processes need to work on did not get inherited by the server process and then passed on to the workers. Why did that happen? I wonder what exactly gets inherited by the forkserver process from the parent process?
Here is what my code looks like:
import multiprocessing
import (a bunch of other modules)

def worker_func(nameList):
    global largeObject
    for item in nameList:
        # get some info from largeObject using item as index
        # do some calculation
        return [item, info]

if __name__ == '__main__':
    result = []
    largeObject  # This is my large object, it's read-only and no modification will be made to it.
    nameList  # Here is a list variable that I will need to get info for each item in it from the largeObject
    ctx_in_main = multiprocessing.get_context('forkserver')
    print('Start parallel, using forking/spawning/?:', ctx_in_main.get_context())
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=4) as pool:
        for x in pool.imap_unordered(worker_func, nameList):
            result.append(x)
Thank you!
Theory
Below is an excerpt from Bojan Nikolic's blog:
Modern Python versions (on Linux) provide three ways of starting the separate processes:
Fork()-ing the parent process and continuing with the same process image in both parent and child. This method is fast, but potentially unreliable when the parent state is complex.
Spawning the child processes, i.e., fork()-ing and then execv-ing to replace the process image with a new Python process. This method is reliable but slow, as the process image is reloaded afresh.
The forkserver mechanism, which consists of a separate Python server process that has a relatively simple state and which is fork()-ed when a new process is needed. This method combines the speed of fork()-ing with good reliability (because the parent being forked is in a simple state).
Forkserver
The third method, forkserver, works as follows: children retain a copy of the forkserver state. This state is intended to be relatively simple, but it is possible to adjust it through the multiprocessing API via the set_forkserver_preload() method.
Practice
Thus, if you want something to be inherited by the child processes from the parent, it must be specified in the forkserver state by means of set_forkserver_preload(modules_names), which sets a list of module names to try to load in the forkserver process. I give an example below:
# inherited.py
large_obj = {"one": 1, "two": 2, "three": 3}

# main.py
import multiprocessing
import os
from time import sleep
from inherited import large_obj

def worker_func(key: str):
    print('PID={}, obj id={}'.format(os.getpid(), id(large_obj)))
    sleep(1)
    return large_obj[key]

if __name__ == '__main__':
    result = []
    ctx_in_main = multiprocessing.get_context('forkserver')
    ctx_in_main.set_forkserver_preload(['inherited'])
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=cores) as pool:
        for x in pool.imap(worker_func, ["one", "two", "three"]):
            result.append(x)
    for res in result:
        print(res)
Output:
# The PIDs are different but the address is always the same
PID=18603, obj id=139913466185024
PID=18604, obj id=139913466185024
PID=18605, obj id=139913466185024
And if we don't use preloading
...
ctx_in_main = multiprocessing.get_context('forkserver')
# ctx_in_main.set_forkserver_preload(['inherited'])
cores = ctx_in_main.cpu_count()
...
# The PIDs are different, the addresses are different too
# (but sometimes they can coincide)
PID=19046, obj id=140011789067776
PID=19047, obj id=140011789030976
PID=19048, obj id=140011789030912
So after an inspiring discussion with Alex, I think I have sufficient info to address my question: what exactly gets inherited by the forkserver process from the parent process?
Basically when the server process starts, it will import your main module and everything before if __name__ == '__main__' will be executed. That's why my code doesn't work: large_object is nowhere to be found in the server process, nor in any of the worker processes that fork from the server process.
Alex's solution works because large_object now gets imported into both the main and the server process, so every worker forked from the server also gets large_object. If combined with set_forkserver_preload(modules_names), all workers might even get the same large_object, from what I saw. The reason for using forkserver is explained explicitly in the Python documentation and in Bojan's blog:
When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
The forkserver mechanism, which consists of a separate Python server process that has a relatively simple state and which is fork()-ed when a new process is needed. This method combines the speed of fork()-ing with good reliability (because the parent being forked is in a simple state).
So the concern here is more about safety than anything else.
On a side note, if you use fork as the start method you don't need to import anything, since every child process gets a copy of the parent process's memory (or a reference, if the system uses copy-on-write; please correct me if I am wrong). In this case using global large_object will give you access to large_object in worker_func directly.
The forkserver might not be a suitable approach for me, because the issue I am facing is memory overhead. All the operations that get me large_object in the first place are memory-consuming, so I don't want any unnecessary resources in my worker processes.
If I put all those calculations directly into inherited.py as Alex suggested, they will be executed twice (once when I import the module in main and once when the server imports it; maybe even more when the worker processes are born?). This is fine if I just want a single-threaded, safe process that workers can fork from, but since I am trying to get the workers to not inherit unnecessary resources and only get large_object, it won't work.
And putting those calculations in __main__ in inherited.py won't work either, since then none of the processes will execute them, neither main nor the server.
So, as a conclusion, if the goal here is to get the workers to inherit minimal resources, I am better off breaking my code in two: run calculation.py first, pickle large_object, exit the interpreter, and start a fresh one that loads the pickled large_object. Then I can just go nuts with either fork or forkserver.
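A minimal sketch of that two-step split, for illustration only; the file names, the build step and the toy data below are made up, not from the actual code:
# calculation.py - run this first; it does the memory-hungry work once
import pickle

def build_large_object():
    # stand-in for the expensive, memory-consuming computation
    return {i: i * i for i in range(1000)}

if __name__ == '__main__':
    with open('large_object.pkl', 'wb') as f:
        pickle.dump(build_large_object(), f)

# worker_main.py - run this afterwards in a fresh interpreter
import multiprocessing
import pickle

with open('large_object.pkl', 'rb') as f:
    large_object = pickle.load(f)  # loaded once at module level in the parent

def worker_func(key):
    return large_object[key]

if __name__ == '__main__':
    ctx = multiprocessing.get_context('fork')  # forkserver also works, as discussed above
    with ctx.Pool(processes=4) as pool:
        print(pool.map(worker_func, [0, 1, 2]))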

Using third-party Python module that blocks Flask application

My application that uses websockets also makes use of several third-party Python modules that appear to be written in a way that blocks the rest of the application when called. For example, I use xlrd to parse Excel files a user has uploaded.
I've monkey patched the builtins like this in the first lines of the application:
import os
import eventlet

if os.name == 'nt':
    eventlet.monkey_patch(os=False)
else:
    eventlet.monkey_patch()
Then I use the following to start the task that contains calls to xlrd.
socketio.start_background_task(my_background_task)
What is the appropriate way to now call these other modules so that my application runs smoothly? Is using the multiprocessing module to start another process from within the green thread the right way?
First, you should try the thread pool [1].
If that doesn't work as well as you want, please submit an issue [2] and go with multiprocessing as a workaround.
eventlet.tpool.execute(xlrd_read, file_path, other=arg)
Execute meth in a Python thread, blocking the current coroutine/greenthread until the method completes.
The primary use case for this is to wrap an object or module that is not amenable to monkeypatching or any of the other tricks that Eventlet uses to achieve cooperative yielding. With tpool, you can force such objects to cooperate with green threads by sticking them in native threads, at the cost of some overhead.
[1] http://eventlet.net/doc/threading.html
[2] https://github.com/eventlet/eventlet/issues
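For illustration, a hedged sketch of how the xlrd call might be pushed into a native thread with tpool; the helper names and the Excel handling are assumptions, not taken from the question, and the Windows-specific patching from the question is omitted:
import eventlet
eventlet.monkey_patch()

import xlrd  # blocking third-party library, left unpatched
from eventlet import tpool

def parse_workbook(path):
    # runs in a real OS thread, so the blocking I/O and parsing
    # do not stall the eventlet hub
    book = xlrd.open_workbook(path)
    return book.sheet_by_index(0).nrows

def my_background_task(path):
    # the greenthread yields here until the native thread finishes
    nrows = tpool.execute(parse_workbook, path)
    print('parsed', nrows, 'rows')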

Python bytecode generated when multiprocessing Pool is called

I have written some Python 2.7 code where I set sys.dont_write_bytecode = True to prevent .pyc files from being written. I have used this many times previously without issue.
I am now working on a new program using multiprocessing, and I noticed that bytecode is generated when Pool is called, regardless of that setting. Please help me understand why.
Here is working test code:
import sys
sys.dont_write_bytecode = True
from multiprocessing import Pool
# bytecode gets generated when Pool is included
pool = Pool(processes=2)
print 'done'
I'm guessing that this is because multiprocessing spawns new Python interpreter processes, and this flag is only meaningful inside the process where it is set. I'm also guessing that it will actually work on POSIX systems, where multiprocessing uses fork(), which might preserve interpreter state; that seems to be the case on my system. And my last guess is that it might not be so easy to avoid these bytecode files, as on Windows multiprocessing runs the worker code only after importing the module, which is the moment when the bytecode files are created. But all of that is just an educated guess based on my knowledge of how the interpreter and the multiprocessing module work, so please wait for some real answers.
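If that guess is right, one possible workaround (an untested assumption, not something from the question) is to set the PYTHONDONTWRITEBYTECODE environment variable before the Pool is created; unlike sys.dont_write_bytecode, the environment is inherited by the new interpreter processes:
import os
# New interpreter processes read this variable at startup, so spawned
# workers should also skip writing .pyc files.
os.environ['PYTHONDONTWRITEBYTECODE'] = '1'

import sys
sys.dont_write_bytecode = True  # still covers the parent process itself

from multiprocessing import Pool

def noop(x):
    return x

if __name__ == '__main__':
    pool = Pool(processes=2)
    print(pool.map(noop, range(4)))
    pool.close()
    pool.join()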

Compulsory usage of if __name__=="__main__" in windows while using multiprocessing [duplicate]

This question already has answers here:
python multiprocessing on windows, if __name__ == "__main__"
(2 answers)
Closed 4 years ago.
While using multiprocessing in Python on Windows, you are expected to protect the entry point of the program.
The documentation says "Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process)". Can anyone explain exactly what this means?
Expanding a bit on the good answer you already got, it helps if you understand what Linux-y systems do. They spawn new processes using fork(), which has two good consequences:
All data structures existing in the main program are visible to the child processes. They actually work on copies of the data.
The child processes start executing at the instruction immediately following the fork() in the main program - so any module-level code already executed in the module will not be executed again.
fork() isn't possible in Windows, so on Windows each module is imported anew by each child process. So:
On Windows, no data structures existing in the main program are visible to the child processes; and,
All module-level code is executed in each child process.
So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program - so that should be protected by __name__ == '__main__'. For a subtler example, consider code that builds a gigantic list, which you intend to pass out to worker processes to crawl over. You probably want to protect that too, because there's no point in this case to make each worker process waste RAM and time building their own useless copies of the gigantic list.
Note that it's a Good Idea to use __name__ == "__main__" appropriately even on Linux-y systems, because it makes the intended division of work clearer. Parallel programs can be confusing - every little bit helps ;-)
The multiprocessing module works by creating new Python processes that will import your module. If you did not add __name__ == '__main__' protection then you would enter a never-ending loop of new process creation. It goes like this:
Your module is imported and executes code during the import that causes multiprocessing to spawn 4 new processes.
Those 4 new processes in turn import the module and execute code during the import that causes multiprocessing to spawn 16 new processes.
Those 16 new processes in turn import the module and execute code during the import that causes multiprocessing to spawn 64 new processes.
Well, hopefully you get the picture.
So the idea is that you make sure the process spawning only happens once. And that is most easily achieved with the idiom of the __name__ == '__main__' protection.
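As a concrete illustration of both answers, a minimal sketch; the worker function and the list contents are made up:
from multiprocessing import Pool

def crunch(x):
    # trivial stand-in for real per-item work
    return x * x

if __name__ == '__main__':
    # Everything that spawns processes (and anything expensive, like building
    # a huge input list) stays under the guard, so a child re-importing this
    # module does not repeat it.
    big_list = list(range(1000))
    with Pool(processes=4) as pool:
        results = pool.map(crunch, big_list)
    print(results[:5])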

Sharing imports between Python processes

I am using the multiprocessing package to spawn multiple processes that execute a function, say func (with different arguments). func imports the numpy package, and I was wondering if every process would import the package. In fact, the main thread, or rather the main process, also imports numpy, and that can easily be shared between the different func-executing processes.
There would be a major performance hit due to multiple imports of a library.
I was wondering if every process would import the package.
Assuming the import occurs after you've forked the process, then, yes. You could avoid this by doing the import before the fork, though.
There would be a major performance hit due to multiple imports of a library.
Well, there would be a performance hit if you do the import after the fork, but probably not a "major" one. The OS would most likely have all the necessary files in its cache, so it would only be reading from RAM, not disk.
Update
Just noticed this...
In fact, the main thread, or rather main process also imports numpy...
If you're already importing numpy before forking, then the imports in the subprocesses will only create a reference to the existing imported module. This should take less than a millisecond, so I wouldn't worry about it.
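To make that concrete, here is a hedged sketch assuming a Unix-like system where the fork start method is available; func's body is made up:
import multiprocessing as mp
import numpy as np  # imported once in the parent, before any worker is forked

def func(n):
    # np is already in the child's (copy-on-write) memory; no re-import from disk
    return float(np.arange(n).sum())

if __name__ == '__main__':
    ctx = mp.get_context('fork')  # forked children inherit the parent's imported modules
    with ctx.Pool(processes=2) as pool:
        print(pool.map(func, [10, 100, 1000]))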
The answer to that question is in the documentation of the multiprocessing library: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
The summary is that it depends on which start method you choose. There are 3 methods available (defaults to fork on Unix, to spawn on Windows/Mac):
spawn: The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method.
fork: The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process.
forkserver: When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process.
You must set the start method explicitly if you want to change it from the default. Example:
import multiprocessing as mp
mp.set_start_method('spawn')
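A small usage note, as I read the documentation: set_start_method() should be called at most once per program, and it belongs under the __main__ guard so that a spawned child does not try to set it again:
import multiprocessing as mp

def work(x):
    return x + 1

if __name__ == '__main__':
    mp.set_start_method('spawn')  # pick the start method once, before creating any pools
    with mp.Pool(processes=2) as pool:
        print(pool.map(work, [1, 2, 3]))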
