Replace current process with invocation of subprocess? - python

In python, is there a way to invoke a new process in, hand it the same context, such as standard IO streams, close the current process, and give control to the invoked process? This would effectively 'replace' the process.
I have a program whose behavior I want to repeat. However, it uses a third-party library, and it seems that the only way that I can truly kill threads invoked by that library is to exit() my python process.
Plus, it seems like it could help manage memory.

You may be interested in os.execv() and friends:
These functions all execute a new program, replacing the current
process; they do not return. On Unix, the new executable is loaded
into the current process, and will have the same process id as the
caller. Errors will be reported as OSError exceptions.

Related

Is it safe to ignore the return value of Popen?

I need that my python (2.7) program starts a sub-process and that it continues its processing while the child process is running. So I don’t want the main program to wait for the sub-process termination.
This is apparently achieved with Popen that returns a class object.
Is it safe to ignore the return value of Popen ?
Does the garbage collector delete the popen object ? Is the object garbage collected after the sub-process terminated ?
I can’t answer these questions with the python documentation.
Note that I’m using python 2.7 because of constrains I have no control on.
On Unix-like systems, failing to wait() on children creates zombies. This is a problem because you are polluting and eventually exhausting your process table. See https://en.wikipedia.org/wiki/Wait_(system_call)
If you want to kill your child process when your program finishes, do that. That's precisely why you need to save the result from Popen().
child = Popen(['sleep', '1234567890'])
# ... code happens ...
child.kill()
child.wait()
Just to reiterate, the Python object returned by Popen() is of course going to go away when Python goes away; the problem is that you also need to clean up the actual OS process which it represents.

Multiprocessing or os.fork, os.exec?

I am using multiprocessing module to fork child processes. Since on forking, child process gets the address space of parent process, I am getting the same logger for parent and child. I want to clear the address space of child process for any values carried over from parent. I got to know that multiprocessing does fork() at lower level but not exec(). I want to know whether it is good to use multiprocessing in my situation or should I go for os.fork() and os.exec() combination or is there any other solution?
Thanks.
Since multiprocessing is running a function from your program as if it were a thread function, it definitely needs a full copy of your process' state. That means doing fork().
Using a higher-level interface provided by multiprocessing is generally better. At least you should not care about the fork() return code yourself.
os.fork() is a lower level function providing less service out-of-the-box, though you certainly can use it for anything multiprocessing is used for... at the cost of partial reimplementation of multiprocessing code. So, I think, multiprocessing should be ok for you.
However, if you process' memory footprint is too large to duplicate it (or if you have other reasons to avoid forking -- open connections to databases, open log files etc.), you may have to make the function you want to run in a new process a separate python program. Then you can run it using subprocess, pass parameters to its stdin, capture its stdout and parse the output to get results.
UPD: os.exec... family of functions is hard to use for most of purposes since it replaces your process with a spawned one (if you run the same program as is running, it will restart from the very beginning, not keeping any in-memory data). However, if you really do not need to continue parent process execution, exec() may be of some use.
From my personal experience: os.fork() is used very often to create daemon processes on Unix; I often use subprocess (the communication is through stdin/stdout); almost never used multiprocessing; not a single time in my life I needed os.exec...().
You can just rebind the logger in the child process to its own. I don't know about other OS, but on Linux the forking doesn't duplicate the entire memory footprint (as Ellioh mentioned), but uses "copy-on-write" concept. So until you change something in the child process - it stays in the memory scope of the parent process. For instance, you can fork 100 child processes (that don't write into memory, only read) and check the overall memory usage. It'll not be parent_memory_usage * 100, but much less.

python-twisted: fork for background non-returning processing

How to correctly fork a child process in twisted that does not use anything from twisted (but uses data from the parent process) (e.g. to process a “snapshot” of some data from the parent process and write it to file, without blocking)?
It seems if I do anything like clean shutdown in the child process after os.fork(), it closes some of the sockets / descriptors in the parent process; the only way to avoid that that I see is to do os.kill(os.getpid(), signal.SIGKILL), which does seem like a bad idea (though not directly problematic).
(additionally, if a dict is changed in the parent process, can it be that it will change in the child process too? Quick test shows that it doesn't change, though. OS/kernels are debian stable / sid)
IReactorProcess.spawnProcess (usually available as from twisted.internet import reactor; reactor.spawnProcess) can spawn a process running any available executable on your system. The subprocess does not need to use Twisted, or, indeed, even be in Python.
Do not call os.fork yourself. As you've discovered, it has lots of very peculiar interactions with process state, that spawnProcess will manage for you.
Among the problems with os.fork are:
Forking copies your current process state, but doesn't copy the state of threads. This means that any thread in the middle of modifying some global state will leave things half-broken, possibly holding some locks which will never be released. Don't run any threads in your application? Have you audited every library you use, every one of its dependencies, to ensure that none of them have ever or will ever use a background thread for anything?
You might think you're only touching certain areas of your application memory, but thanks to Python's reference counting, any object which you even peripherally look at (or is present on the stack) may have reference counts being incremented or decremented. Incrementing or decrementing a refcount is a write operation, which means that whole page (not just that one object) gets copied back into your process. So forked processes in Python tend to accumulate a much larger copied set than, say, forked C programs.
Many libraries, famously all of the libraries that make up the systems on macOS and iOS, cannot handle fork() correctly and will simply crash your program if you attempt to use them after fork but before exec.
There's a flag for telling file descriptors to close on exec - but no such flag to have them close on fork. So any files (including log files, and again, any background temp files opened by libraries you might not even be aware of) can get silently corrupted or truncated if you don't manage access to them carefully.

Jython: subprocess.Popen runs out of file descriptors

I'm using the Jython 2.51 implementation of Python to write a script that repeatedly invokes another process via subprocess.Popen and uses PIPE to pipe stdout and stderr to the parent process and stdin to the child process. After several hundred loop iterations, I seem to run out of file descriptors.
The Python subprocess documentation mentions very little about freeing file descriptors, other than the close_fds option, which isn't described very clearly (Why should there be any file descriptors besides 0, 1 and 2 open in the first place?). I'm assuming that in CPython, reference counting takes care of the resource freeing issue. What's the proper way to make sure all descriptors get freed when one is done with a Popen object in Jython?
Edit: Just in case it makes a difference, this is a multithreaded program, so there are several Popen processes running simultaneously.
This only answers part of your question, but my understanding is that, when you spawn a new process, it normally inherits all the handles of the parent process. That includes such things as open files and sockets that you're listening on.
On UNIX, that's a side-effect of using 'fork', which duplicates the current process and all of its handles before loading the new executable. On Windows it's more explicit, but Python does it anyway, to try to match the behavior across platforms as much as possible.
The close_fds option, when True, closes all these inherited handles after spawning the subprocess, so the new executable starts with a clean slate. But if your subprocesses are run one at a time, and terminating when they're done, then this shouldn't be the problem.

Named pipe is not flushing in Python

I have a named pipe created via the os.mkfifo() command. I have two different Python processes accessing this named pipe, process A is reading, and process B is writing. Process A uses the select function to determine when there is data available in the fifo/pipe. Despite the fact that process B flushes after each write call, process A's select function does not always return (it keeps blocking as if there is no new data). After looking into this issue extensively, I finally just programmed process B to add 5KB of garbage writes before and after my real call, and likewise process A is programmed to ignore those 5KB. Now everything works fine, and select is always returning appropriately. I came to this hack-ish solution by noticing that process A's select would return if process B were to be killed (after it was writing and flushing, it would sleep on a read pipe). Is there a problem with flush in Python for named pipes?
What APIs are you using? os.read() and os.write() don't buffer anything.
To find out if Python's internal buffering is causing your problems, when running your scripts do "python -u" instead of "python". This will force python in to "unbuffered mode" which will cause all output to be printed instantaneously.
The flush operation is irrelevant for named pipes; the data for named pipes is held strictly in memory, and won't be released until it is read or the FIFO is closed.

Categories

Resources