Python bytecode generated when multiprocessing Pool is called - python

I have written some Python 2.7 code where I set sys.dont_write_bytecode = True to prevent .pyc files from being written. I have used this many times previously without issue.
I am now working on a new program using multiprocessing, and I noticed that bytecode is generated when Pool is called, regardless of the dont_write_bytecode setting. Please help me understand why.
Here is working test code:
import sys
sys.dont_write_bytecode = True
from multiprocessing import Pool
# bytecode gets generated when Pool is included
pool = Pool(processes=2)
print 'done'

My guess is that this happens because multiprocessing spawns new Python interpreter processes, and sys.dont_write_bytecode is only meaningful inside the process where it is set. It probably does work as expected on POSIX systems, where multiprocessing uses fork() and the child inherits the parent's interpreter state; that seems to be the case on my system. On Windows, however, each worker starts a fresh interpreter and imports your module, and that import is the moment the bytecode files get written, so avoiding them is not as simple. All of that is just an educated guess based on how I understand the interpreter and the multiprocessing module; please wait for some real answers.
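If that guess is right, one possible workaround (my own sketch, not taken from an answer) is to set the PYTHONDONTWRITEBYTECODE environment variable before multiprocessing starts any workers; child interpreters inherit the parent's environment, whereas sys.dont_write_bytecode only affects the current process:

import os
import sys

# Environment variables are inherited by child interpreters, so this
# should also cover workers started by multiprocessing.
os.environ['PYTHONDONTWRITEBYTECODE'] = '1'
sys.dont_write_bytecode = True  # still covers the current process

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(processes=2)
    pool.close()
    pool.join()
    print('done')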

Related

How to pass variables in parent to subprocess in python?

I am trying to have a parent python script send variables to a child script to help me speed up and automate video analysis.
I am currently using the subprocess.Popen() call to start up 6 instances of a child script, but cannot find a way to pass variables and modules already set up in the parent to the child. For example, the parent file would have:
import os
import sys
import subprocess

parent_dir = os.path.realpath(sys.argv[0])
subprocess.Popen([sys.executable, 'analysis.py'])
but then import sys, import subprocess, and parent_dir all have to be defined again in "analysis.py". Is there a way to pass them to the child?
In short, what I am trying to achieve is: I have a folder with a couple hundred video files. I want the parent python script to list the video files and start up to 6 parallel instances of an analysis script that each analyse one video file. If there are no more files to be analysed the parent file stops.
The simple answer here is: don't use subprocess.Popen, use multiprocessing.Process. Or, better yet, multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor.
With subprocess, your program's Python interpreter doesn't know anything about the subprocess at all; for all it knows, the child process is running Doom. So there's no way to directly share information with it.* But with multiprocessing, Python controls launching the subprocess and getting everything set up so that you can share data as conveniently as possible.
Unfortunately "as conveniently as possible" still isn't 100% as convenient as all being in one process. But what you can do is usually good enough. Read the section on Exchanging objects between processes and the following few sections; hopefully one of those mechanisms will be exactly what you need.
But, as I implied at the top, in most cases you can make it even simpler, by using a pool. Instead of thinking about "running 6 processes and sharing data with them", just think about it as "running a bunch of tasks on a pool of 6 processes". A task is basically just a function—it takes arguments, and returns a value. If the work you want to parallelize fits into that model—and it sounds like your work does—life is as simple as could be. For example:
import multiprocessing
import os
import sys

import analysis

parent_dir = os.path.realpath(sys.argv[0])
folderpath = sys.argv[1]  # assumption: the folder of video files is passed on the command line

paths = [os.path.join(folderpath, file)
         for file in os.listdir(folderpath)]

with multiprocessing.Pool(processes=6) as pool:
    results = pool.map(analysis.analyze, paths)
If you're using Python 3.2 or earlier (including 2.7), you can't use a Pool in a with statement. I believe you want this:**
pool = multiprocessing.Pool(processes=6)
try:
    results = pool.map(analysis.analyze, paths)
finally:
    pool.close()
    pool.join()
This will start up 6 processes,*** then tell the first one to do analysis.analyze(paths[0]), the second to do analysis.analyze(paths[1]), etc. As soon as any of the processes finishes, the pool will give it the next path to work on. When they're all finished, you get back a list of all the results.****
Of course this means that the top-level code that lived in analysis.py has to be moved into a function def analyze(path): so you can call it. Or, even better, you can move that function into the main script, instead of a separate file, if you really want to save that import line.
* You can still indirectly share information by, e.g., marshaling it into some interchange format like JSON and passing it via the stdin/stdout pipes, a file, a shared memory segment, a socket, etc., but multiprocessing effectively wraps that up for you to make it a whole lot easier.
** There are different ways to shut a pool down, and you can also choose whether or not to join it immediately, so you really should read up on the details at some point. But when all you're doing is calling pool.map, it really doesn't matter; the pool is guaranteed to shut down and be ready to join nearly instantly by the time the map call returns.
*** I'm not sure why you wanted 6; most machines have 4, 8, or 16 cores, not 6; why not use them all? The best thing to do is usually to just leave out the processes=6 entirely and let multiprocessing ask your OS how many cores to use, which means it'll still run at full speed on your new machine with twice as many cores that you'll buy next year.
**** This is slightly oversimplified; usually the pool will give the first process a batch of files, not one at a time, to save a bit of overhead, and you can manually control the batching if you need to optimize things or sequence them more carefully. But usually you don't care, and this oversimplification is fine.
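For instance (my own illustration, reusing the analysis.analyze and paths names from the example above), Pool.map takes an optional chunksize argument that controls how many tasks each worker receives at a time:

# chunksize=1 hands paths to workers one at a time; a larger value
# batches several paths per worker and reduces scheduling overhead.
results = pool.map(analysis.analyze, paths, chunksize=1)

# imap yields results as they become available while preserving order,
# which is handy if you want to sequence the post-processing yourself.
for result in pool.imap(analysis.analyze, paths, chunksize=4):
    print(result)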

Use several workers to execute python code

I'm executing python code on several files. Since the files are all very big and each call processes only one file, it takes a very long time until the last file is processed. Hence, here is my question: is it possible to use several workers that process the files in parallel?
Is this a possible invocation?:
import annotation as annot # this is a .py-file
import multiprocessing
pool = multiprocessing.Pool(processes=4)
pool.map(annot, "")
The .py-file uses for-loops (etc.) to get all files by itself.
The problem is: if I have a look at all the processes (with 'top'), I only see 1 process which is working with the .py-file. So... I suspect that I shouldn't use multiprocessing like this... do I?
Thanks for any help! :)
Yes. Use multiprocessing.Pool.
import multiprocessing
pool = multiprocessing.Pool(processes=<pool size>)
result = pool.map(<your function>, <file list>)
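As a concrete sketch of that (the process_file helper, the annot.process call, and the glob pattern are assumptions of mine, not from the question), the idea is to have the parent build the file list and map a function that handles exactly one file:

import glob
import multiprocessing

import annotation as annot  # the question's .py module

def process_file(path):
    # hypothetical: call whatever function in annot handles a single file
    return annot.process(path)

if __name__ == '__main__':
    files = glob.glob('data/*')  # the parent builds the list of files
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(process_file, files)
    pool.close()
    pool.join()

The key difference from the snippet in the question is that the mapped function processes one file per call instead of the module looping over every file itself; that is what lets the pool spread the work across several processes, which should also show up as multiple busy processes in top.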
My answer is not purely a python answer though I think it's the best approach given your problem.
This will only work on Unix systems (OS X/Linux/etc.).
I do stuff like this all the time, and I am in love with GNU Parallel. See this also for an introduction by the GNU Parallel developer. You will likely have to install it, but it's worth it.
Here's a simple example. Say you have a python script called processFiles.py:
#!/usr/bin/python
#
# Script to print out file name
#
import sys

fileName = sys.argv[1]  # first command line argument: the file passed in by parallel
print( fileName )       # adapt for python 2.7 if you need to
To make this file executable:
chmod +x processFiles.py
And say all your large files are in largeFileDir. Then to run all the files in parallel with four processors (-P4), run this at the command line:
$ parallel -P4 ./processFiles.py ::: largeFileDir/*
This will output
file1
file3
file7
file2
...
They may not be in order because each job runs independently and in parallel. To adapt this to your workflow, insert your file processing script instead of just stupidly printing the file to screen.
This is preferable to threading in your case because each file processing job will get its own instance of the Python interpreter. Since each file is processed independently (or so it sounds) threading is overkill. In my experience this is the most efficient way to parallelize a process like you describe.
There is something called the Global Interpreter Lock that I don't understand very well, but has caused me headaches when trying to use python built-ins to hyperthread. Which is why I say if you don't need to thread, don't. Instead do as I've recommended and start up independent python processes.
There are many options.
multiple threads
multiple processes
"green threads", I personally like Eventlet
Then there are more "enterprise" solutions, which can even run workers on multiple servers, e.g. Celery; for more, search for "distributed task queue python".
In all cases, your scenario will become more complex, and sometimes you will not gain much, e.g. if your processing is limited by I/O operations (reading the data) rather than by computation.
Yes, this is possible. You should investigate the threading module and the multiprocessing module. Both will allow you to execute Python code concurrently. One note with the threading module, though, is that because of the way Python is implemented (Google "python GIL" if you're interested in the details), only one thread will execute Python code at a time, even if you have multiple CPU cores. This is different from the threading implementation in many other languages, where threads can run at the same time, each using a different core. Because of this limitation, in cases where you want to do CPU-intensive operations concurrently, you'll get better performance with the multiprocessing module.
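To make that last point concrete (a sketch of my own, with a made-up CPU-bound function), the multiprocessing module sidesteps the GIL by running the work in separate interpreter processes:

import multiprocessing

def cpu_bound(n):
    # stand-in for a CPU-heavy computation
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    inputs = [10000000] * 8
    # Threads would run these one at a time because of the GIL;
    # a process pool spreads them across the available cores.
    pool = multiprocessing.Pool()
    print(pool.map(cpu_bound, inputs))
    pool.close()
    pool.join()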

Compulsory usage of if __name__=="__main__" in windows while using multiprocessing [duplicate]

While using multiprocessing in python on windows, you are expected to protect the entry point of the program.
The documentation says "Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process)". Can anyone explain what exactly this means?
Expanding a bit on the good answer you already got, it helps if you understand what Linux-y systems do. They spawn new processes using fork(), which has two good consequences:
All data structures existing in the main program are visible to the child processes. They actually work on copies of the data.
The child processes start executing at the instruction immediately following the fork() in the main program - so any module-level code already executed in the module will not be executed again.
fork() isn't possible in Windows, so on Windows each module is imported anew by each child process. So:
On Windows, no data structures existing in the main program are visible to the child processes; and,
All module-level code is executed in each child process.
So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program - so that should be protected by __name__ == '__main__'. For a subtler example, consider code that builds a gigantic list, which you intend to pass out to worker processes to crawl over. You probably want to protect that too, because there's no point in this case to make each worker process waste RAM and time building their own useless copies of the gigantic list.
Note that it's a Good Idea to use __name__ == "__main__" appropriately even on Linux-y systems, because it makes the intended division of work clearer. Parallel programs can be confusing - every little bit helps ;-)
The multiprocessing module works by creating new Python processes that will import your module. If you did not add __name__ == '__main__' protection, you would enter a never-ending loop of new process creation. It goes like this:
Your module is imported, and the code executed during that import causes multiprocessing to spawn 4 new processes.
Those 4 new processes in turn import the module, and the code executed during the import causes multiprocessing to spawn 16 new processes.
Those 16 new processes in turn import the module, and the code executed during the import causes multiprocessing to spawn 64 new processes.
Well, hopefully you get the picture.
So the idea is that you make sure that the process spawning only happens once. And that is achieved most easily with the __name__ == '__main__' idiom.
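Putting the two answers together, a minimal sketch of the recommended structure could look like this (the crawl function and the gigantic list are placeholders of my own):

import multiprocessing

def crawl(item):
    # work done in the child processes; keep the module import side-effect free
    return item * 2

if __name__ == '__main__':
    # Only the main program builds the big list and spawns the workers;
    # a child process that re-imports this module skips this whole block.
    gigantic_list = range(1000000)
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(crawl, gigantic_list)
    pool.close()
    pool.join()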

Sharing imports between Python processes

I am using the multiprocessing package to spawn multiple processes that execute a function, say func (with different arguments). func imports the numpy package, and I was wondering whether every process would import the package. In fact, the main thread, or rather the main process, also imports numpy, and that import could ideally be shared with the processes executing func.
There would be a major performance hit due to multiple imports of a library.
I was wondering if every process would import the package.
Assuming the import occurs after you've forked the process, then, yes. You could avoid this by doing the import before the fork, though.
There would be a major performance hit due to multiple imports of a library.
Well, there would be a performance hit if you do the import after the fork, but probably not a "major" one. The OS would most likely have all the necessary files in its cache, so it would only be reading from RAM, not disk.
Update
Just noticed this...
In fact, the main thread, or rather main process also imports numpy...
If you're already importing numpy before forking, then the imports in the subprocesses will only create a reference to the existing imported module. This should take less than a millisecond, so I wouldn't worry about it.
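A minimal sketch of that arrangement (assuming a fork start method on a Unix-like system; the worker body is a placeholder of mine):

import multiprocessing

import numpy as np  # imported once in the parent, before the pool is created

def func(seed):
    # the forked child reuses the already-imported numpy module object
    rng = np.random.RandomState(seed)
    return rng.rand(3).sum()

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    print(pool.map(func, range(4)))
    pool.close()
    pool.join()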
The answer to that question is in the documentation of the multiprocessing library: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
The summary is that it depends on which start method you choose. There are 3 methods available (defaults to fork on Unix, to spawn on Windows/Mac):
spawn: The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method.
fork: The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process.
forkserver: When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process.
You must set the start method if you want to change it. Example:
import multiprocessing as mp

if __name__ == '__main__':
    mp.set_start_method('spawn')  # call this at most once, from the main module
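Alternatively (my addition, not part of the original answer), multiprocessing.get_context() returns a context object bound to a particular start method, which lets you pick one without changing the process-wide default:

import multiprocessing as mp

def work(x):
    return x * x

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # local choice; does not touch the global default
    pool = ctx.Pool(processes=2)
    print(pool.map(work, [1, 2, 3]))
    pool.close()
    pool.join()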

Preventing pool processes from importing __main__ and globals

I am using a multiprocessing pool of workers as a part of a much larger application. Since I use it for crunching a large volume of simple math, I have a shared-nothing architecture, where the only variables the workers ever need are passed on as arguments. Thus, I do not need the worker subprocesses to import any globals, my __main__ module or, consequently, any of the modules it imports. Is there any way to force such a behavior and avoid the performance hit when spawning the pool?
I should note that my environment is Win32, which lacks os.fork() and the worker processes are spawned "using a subprocess call to sys.executable (i.e. start a new Python process) followed by serializing all of the globals, and sending those over the pipe." as per this SO post. This being said, I want to do as little of the above as possible so my pool opens faster.
Any ideas?
Looking at the multiprocessing.forking implementation, particularly get_preparation_data and prepare (win32-specific), globals aren't getting pickled. The reimport of the parent process's __main__ is a bit ugly, but it won't run anything other than the top-level code; in particular, if __name__ == '__main__' clauses are skipped. So just keep the main module free of import-time side effects.
You can prevent your main module from importing anything when the subprocess starts, too (only useful on win32 which, as you note, can't fork). Move the main() and its imports to a separate module, so that the startup script contains only:
if __name__ == '__main__':
    from mainmodule import main
    main()
There is still an implicit import site in the child process startup. It does important initialisation and I don't think mp.forking has an easy way to disable it, but I don't expect it to be expensive anyway.
