I'm looking to use twisted to control communication across Linux pipes (os.pipe()) and fifos (os.mkfifo()) between a master process and a set of slave processes. While I'm positive that it's possible to use twisted for these types of file descriptors (after all, twisted is great for TCP sockets, which *nix abstracts away as file descriptors), I cannot find any examples of this type of usage. Does anyone have any links, sample code, or advice?
You can use reactor.spawnProcess to set up arbitrary file descriptor mappings between a parent process and a child process it spawns. For example, to run a program and give it two extra output descriptors (in addition to stdin, stdout, and stderr) with which it can send bytes back to the parent process, you would do something like this:
reactor.spawnProcess(protocol, executable, args,
childFDs={0: 'w', 1: 'r', 2: 'r', 3: 'r', 4: 'r'})
The reactor will take care of creating the pipes for you, and will call childDataReceived on the ProcessProtocol you pass in when data is read from them. See the spawnProcess API docs for details.
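Fleshed out a bit, that might look like the following minimal sketch; the child script name and the exact fd layout here are illustrative, not prescribed by the docs:

import sys
from twisted.internet import reactor, protocol

class ExtraPipesProtocol(protocol.ProcessProtocol):
    def childDataReceived(self, childFD, data):
        # Called whenever the child writes to any fd we opened for reading.
        print("fd %d sent: %r" % (childFD, data))

    def processEnded(self, reason):
        reactor.stop()

proto = ExtraPipesProtocol()
reactor.spawnProcess(
    proto, sys.executable, [sys.executable, "child_script.py"],  # hypothetical child
    # 0: parent writes to the child's stdin; 1/2: parent reads stdout/stderr;
    # 3/4: two extra pipes the child can write to.
    childFDs={0: 'w', 1: 'r', 2: 'r', 3: 'r', 4: 'r'})
reactor.run()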
If you're also using Twisted on the child end, then you mostly want to be looking at twisted.internet.stdio. stdiodemo.py and stdin.py in the core examples will show you how to use that module.
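On the child side, the pattern in those examples looks roughly like this (a sketch assuming a line-oriented protocol; the class name is made up):

from twisted.internet import reactor, stdio
from twisted.protocols import basic

class Echo(basic.LineReceiver):
    delimiter = b'\n'

    def lineReceived(self, line):
        # Echo each line back to the parent over stdout.
        self.sendLine(b"child saw: " + line)

stdio.StandardIO(Echo())
reactor.run()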
It does not have anything built-in for asynchronous I/O. Someone wrote a libaio wrapper for it, but it has not been touched for a long time, and I have no idea if it still works.
In the worst case you could use select to see if there's anything available to read, but that won't help you with writing.
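A minimal sketch of that select() check on a pipe (the payload and timeout are made up):

import os
import select

read_fd, write_fd = os.pipe()
os.write(write_fd, b"hello")

ready, _, _ = select.select([read_fd], [], [], 1.0)  # wait up to 1 second
if ready:
    print(os.read(read_fd, 4096))  # -> b'hello'

os.close(read_fd)
os.close(write_fd)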
Related
How can I initiate IPC with a child process, without letting it inherit all handles? To make it more interesting, this should work on Windows as well as Unix.
The background: I am writing a library that interfaces with a 3rd-party shared library (let's just call it IT) which in turn contains global data (that really should be objects!). I want to have multiple instances of this global data. As far as I understand, I have two options to solve this:
create a cython module that links against a static variant of IT, then copy and import the module whenever I want a new instance. Analogously, I could copy IT but that's even more work to create a ctypes interface.
spawn a subprocess that loads IT and establish an IPC connection to it.
There are a few reasons to use (2):
I am not sure if (1) is reliable in any way, and it feels like a bad idea (what happens to all the extra modules when the application exits in an uncontrolled way?).
boxing IT into a separate process might actually be a good idea anyway for security considerations: IT deals with potentially unsafe input and IT's code quality isn't overly good. So, I'd rather not have any secure resources open when running it.
there will probably be lots of need for this kind of IPC in future applications
So what are my options? I have already looked into:
multiprocessing.Process at first looked nice, until I realized that the new process gets a copy of all my handles. Needless to say, this is quite problematic: resources can no longer be reliably freed by closing them in the parent process, plus there are the security issues mentioned earlier.
Use os.closerange within a multiprocessing.Process to close all handles manually - except for the Pipe I'm interested in. Does os.closerange close only files, or does it take care of other types of resources as well? If so, how can I determine the range, given the Pipe object?
subprocess.Popen(.., close_fds=True, stdin=PIPE, stdout=PIPE) works fine on unix but isn't possible on win32.
Named pipes are very different on win32 and Unix. Are there any libraries that abstract away their usage?
Sockets. Promising, especially since there are handy RPC libraries that can work with sockets. On the other hand, I fear that this may cause a whole bunch of security issues. Are sockets that I have determined to be of local origin (sock.getpeername()[0] == '127.0.0.1') secure against tampering?
Are there any possibilities that I have overlooked?
To round up: the main question is how to establish secure IPC with a child process on both Windows and Unix. But please don't hesitate to answer if you only know answers to parts of the problem.
Thanks for taking the time to read it!
It seems that on Python >= 3.4, subprocess.Popen(..., stdin=PIPE, stdout=PIPE, close_fds=False) is a possible option. This is due to a patch that makes all opened file descriptors non-inheritable by default. To be more precise, they are automatically closed on execv (so you still can't use multiprocessing.Process); see PEP 446.
This is also a valid option for other python versions:
on windows, HANDLEs are created non-inheritable by default, so you will leak only handles that were made inheritable explicitly
on POSIX/python<=3.3 you can still use os.closerange to close open file descriptors after spawning the subprocess
for a corresponding example see:
https://github.com/coldfix/python-ipc-test
The most useful combinations are:
stdio:pickle (see the sketch after this list)
pro: completely cross-platform in my tests
pro: fastest option (together with inherit_unidir:pickle)
con: stdin/stdout can not be redirected independently
inherit_unidir:pickle
pro: you can redirect STDIO streams independently
pro: fastest option together with stdio:pickle
con: very low level platform specific code
socket:sockpipe
pro: cross-platform with little effort
con: there is a short period during which "attackers" may connect to the port; you could require a pass-phrase or similar to prevent that from happening
con: slightly slower than alternatives on windows (factor 1.6 in my measurements)
con: when not using AF_UNIX, there are unpredictable performance hits on Linux
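A bare-bones sketch of the stdio:pickle combination mentioned above (not taken from the linked repository; assumes Python 3, and the request/reply payloads are made up):

import pickle
import subprocess
import sys

CHILD_CODE = r"""
import pickle, sys
request = pickle.load(sys.stdin.buffer)             # read one pickled object
pickle.dump({"echo": request}, sys.stdout.buffer)   # reply the same way
sys.stdout.buffer.flush()
"""

child = subprocess.Popen([sys.executable, "-c", CHILD_CODE],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)

pickle.dump({"task": "do something"}, child.stdin)  # send a request
child.stdin.flush()
child.stdin.close()                                 # child sees EOF afterwards

reply = pickle.load(child.stdout)                   # read the reply
child.wait()
print(reply)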
I have a subprocess that I use. I must be able to asynchronously read from this process's stdout and write to its stdin.
How can I do this? I have looked at subprocess, but the communicate method waits for process termination (which is not what I want) and the subprocess.stdout.read method can block.
The subprocess is not a Python script but can be edited if absolutely necessary. In total I will have around 15 of these subprocesses.
Have a look at how communicate is implemented.
There are essentially 2 ways to do it:
either use select() and be notified whether you can read/write,
or delegate the reads and writes, which can both block, to separate threads.
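A rough sketch of the thread-delegation variant (the command and payload are placeholders):

import queue
import subprocess
import threading

proc = subprocess.Popen(["my_subprocess"],  # placeholder command
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out_queue = queue.Queue()

def reader(pipe, q):
    # Blocking reads happen here, off the main thread.
    for line in iter(pipe.readline, b""):
        q.put(line)
    pipe.close()

threading.Thread(target=reader, args=(proc.stdout, out_queue), daemon=True).start()

# The main thread can write whenever it likes...
proc.stdin.write(b"do something\n")
proc.stdin.flush()

# ...and poll for output without blocking forever.
try:
    line = out_queue.get(timeout=1.0)
except queue.Empty:
    line = None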
Have you considered using a queue or a NoSQL DB for inter-process communication?
I would suggest using Redis, and have your processes read from and write to different keys.
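For example, a sketch with the redis-py client (assumes a local Redis server; the key names are arbitrary):

import redis

r = redis.Redis(host="localhost", port=6379)

# One process pushes work onto a list, the other blocks waiting for it.
r.rpush("tasks", "do something")
item = r.blpop("tasks", timeout=5)   # -> (b'tasks', b'do something') or None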
Have a look at sarge: http://sarge.readthedocs.org/en/latest/index.html
From the sarge docs:
If you want to interact with external programs from your Python applications, Sarge is a library which is intended to make your life easier than using the subprocess module in Python’s standard library.
If we use PEP 3143 and its reference implementation http://pypi.python.org/pypi/python-daemon,
then it looks impossible to get Twisted working, since during daemonising ALL possible file descriptors are explicitly closed, which includes pipes.
When Twisted tries to call os.pipe() and then write to it, it gets a "bad file descriptor" error.
As I see it, daemonising as described by this PEP is not suited for networking applications?
And probably that's the reason why twistd exists.
Edit:
I have to point out that the question is more "Why does the PEP effectively make it impossible to create a network application?" rather than "How do I do it?".
Twisted has to break these rules in order to work.
It doesn't close all the open file descriptors: just the ones not in the files_preserve attribute. You could probably coerce this to work by figuring out the FD of the waker and all open sockets in the reactor and then passing that to files_preserve... but why bother? Just use twistd and have twisted daemonize itself.
Better yet, use twistd -n and let your process get monitored by some other system tool, and don't bother with daemonization at all.
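If you do decide to coerce files_preserve into working, a rough sketch of the idea (the socket and port here are made up; in a real Twisted app you would have to collect the reactor's descriptors instead):

import socket
import daemon  # python-daemon, the PEP 3143 reference implementation

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 8000))
listener.listen(5)

# Tell DaemonContext not to close this descriptor while daemonizing.
with daemon.DaemonContext(files_preserve=[listener.fileno()]):
    conn, addr = listener.accept()
    conn.sendall(b"hello from the daemon\n")
    conn.close()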
Feel free to use this daemon http://www.jejik.com/articles/2007/02/a_simple_unix_linux_daemon_in_python/
To see how to mix it with Twisted, look here:
http://michael-xiii.blogspot.com/2011/10/twisted.html (warning: Russian text ahead, but the Python code is demonstrative enough)
supervisord + upstart
The practice of closing all open file descriptors is an effect of the possibility that the daemonizing process inherits some open files from the parent process. For example, you can open dozens of files in one process (with, say, os.open()) and then invoke a sub-process that inherits them. As the subprocess, you probably don't have an easy way to know which file descriptors inherited from the parent are useful (unless you pass that along with command line arguments), and you certainly don't want stdin, stdout or stderr, so it's perfectly reasonable to close all open files before doing anything else.
A daemonizing process will then take some additional steps to become a daemon (as laid out in the PEP).
Once the process is fully detached from any kind of terminal, it can start opening files and connections as it needs. It'll open its log files, its configuration files, and its network connections.
Others have mentioned that Twisted, via the twistd tool, already does a pretty good job of all of this, and you don't need an extra module. If you don't want to use twistd (for some reason) but you do want to use Twisted, you could use something external, but you should daemonize first, then import Twisted and the rest of your application code, and open network connections last.
I'm using the Jython 2.5.1 implementation of Python to write a script that repeatedly invokes another process via subprocess.Popen and uses PIPE to pipe stdout and stderr to the parent process and stdin to the child process. After several hundred loop iterations, I seem to run out of file descriptors.
The Python subprocess documentation mentions very little about freeing file descriptors, other than the close_fds option, which isn't described very clearly (Why should there be any file descriptors besides 0, 1 and 2 open in the first place?). I'm assuming that in CPython, reference counting takes care of the resource freeing issue. What's the proper way to make sure all descriptors get freed when one is done with a Popen object in Jython?
Edit: Just in case it makes a difference, this is a multithreaded program, so there are several Popen processes running simultaneously.
This only answers part of your question, but my understanding is that, when you spawn a new process, it normally inherits all the handles of the parent process. That includes such things as open files and sockets that you're listening on.
On UNIX, that's a side-effect of using 'fork', which duplicates the current process and all of its handles before loading the new executable. On Windows it's more explicit, but Python does it anyway, to try to match the behavior across platforms as much as possible.
The close_fds option, when True, closes all these inherited handles after spawning the subprocess, so the new executable starts with a clean slate. But if your subprocesses are run one at a time, and terminating when they're done, then this shouldn't be the problem.
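A hedged sketch of that idea: close_fds keeps the child from inheriting the parent's other descriptors, and closing the Popen pipes explicitly avoids relying on garbage collection (the cat command stands in for the real program; note that on Windows and older Pythons close_fds cannot be combined with redirected std handles):

from subprocess import Popen, PIPE

proc = Popen(["cat"], stdin=PIPE, stdout=PIPE, stderr=PIPE, close_fds=True)
out, err = proc.communicate(b"hello\n")

# communicate() closes the pipes and waits, but being explicit costs nothing
# and guards against code paths that skip communicate().
for pipe in (proc.stdin, proc.stdout, proc.stderr):
    if pipe is not None:
        pipe.close()
proc.wait()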
I have some commands which I am running using the subprocess module. I then want to loop over the lines of the output. The documentation says not to use data_stream.stdout.read, which I am not doing directly, but I may be doing something which calls it. I am looping over the output like this:
for line in data_stream.stdout:
    # do stuff here
    ...
Can this cause deadlocks the way reading from data_stream.stdout directly can, or is Popen set up for this kind of looping, so that it effectively uses the communicate machinery but handles all the calls for you?
You have to worry about deadlocks if you're communicating with your subprocess, i.e. if you're writing to stdin as well as reading from stdout. Because these pipes are buffered, doing this kind of two-way communication is very much a no-no:
data_stream = Popen(mycmd, stdin=PIPE, stdout=PIPE)
data_stream.stdin.write("do something\n")
for line in data_stream.stdout:
... # BAD!
However, if you've not set up stdin (or stderr) when constructing data_stream, you should be fine.
data_stream = Popen(mycmd, stdout=PIPE)
for line in data_stream.stdout:
... # Fine
If you need two-way communication, use communicate.
The two answers have caught the gist of the issue pretty well: don't mix writing something to the subprocess, reading something from it, writing again, etc -- the pipe's buffering means you're at risk of a deadlock. If you can, write everything you need to write to the subprocess FIRST, close that pipe, and only THEN read everything the subprocess has to say; communicate is nice for the purpose, IF the amount of data is not too large to fit in memory (if it is, you can still achieve the same effect "manually").
If you need finer-grain interaction, look instead at pexpect or, if you're on Windows, wexpect.
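The write-first-then-read pattern, sketched (using sort as a stand-in for the real command):

from subprocess import Popen, PIPE

proc = Popen(["sort"], stdin=PIPE, stdout=PIPE)
proc.stdin.write(b"banana\napple\ncherry\n")
proc.stdin.close()           # signal EOF to the child FIRST
for line in proc.stdout:     # THEN drain its output
    print(line)
proc.wait()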
SilentGhost's/chrispy's answers are OK if you have a small to moderate amount of output from your subprocess. Sometimes, though, there may be a lot of output - too much to comfortably buffer in memory. In such a case, the thing to do is start the process and spawn a couple of threads - one to read child.stdout and one to read child.stderr, where child is the subprocess. You then need to wait() for the subprocess to terminate.
This is actually how communicate() works; the advantage of using your own threads is that you can process the output from the subprocess as it is generated. For example, in my project python-gnupg I use this technique to read status output from the GnuPG executable as it is generated, rather than waiting for all of it by calling communicate(). You are welcome to inspect the source of this project - the relevant stuff is in the module gnupg.py.
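A rough sketch of the reader-thread technique (not the actual python-gnupg code; the command is a placeholder):

import subprocess
import threading

def drain(stream, label):
    # Process the child's output as it is produced, instead of waiting for exit.
    for line in iter(stream.readline, b""):
        print(label, line.rstrip())
    stream.close()

child = subprocess.Popen(["some_long_running_command"],  # placeholder
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)

threads = [threading.Thread(target=drain, args=(child.stdout, "OUT:")),
           threading.Thread(target=drain, args=(child.stderr, "ERR:"))]
for t in threads:
    t.start()
child.wait()
for t in threads:
    t.join()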
data_stream.stdout is the standard output handle; you shouldn't be looping over it directly. communicate() returns a tuple of (stdoutdata, stderrdata); it's stdoutdata that you should be using to do your stuff.