Python subprocess: Print to stdin, read stdout until newline, repeat

I am looking to interface with an interactive command line application using Python 3.5. The idea is that I start the process at the beginning of the Python script and leave it open. In a loop, I print a file path, followed by a line return, to stdin, wait for a quarter second or so as it processes, and read from stdout until it reaches a newline.
This is quite similar to the communicate feature of subprocess, but I am waiting for a line return instead of waiting for the process to terminate. Anyone aware of a relatively simple way to do this?
Edit: it would be preferable to use the standard library to do this, rather than third-party libraries such as pexpect, if possible.

You can use subprocess.Popen for this.
Something like this:
proc = subprocess.Popen(['my-command'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
Now proc.stdin and proc.stdout are your ends of pipes that send data to the subprocess stdin and read from the subprocess stdout.
Since you're only interested in reading newline-terminated lines, you can probably get around most problems caused by buffering. Buffering is one of the big gotchas when using subprocess to communicate with interactive processes. Terminal I/O is usually line-buffered, meaning that if the subprocess doesn't terminate a line with a newline, you might never see any data on proc.stdout, and likewise the subprocess might not see what you write to proc.stdin unless you end it with a newline (and flush). Be aware that many programs switch to full buffering when their stdout is a pipe rather than a terminal, in which case even complete lines can be held back until the buffer fills or is flushed. You can turn buffering off, but that's not so simple, and not platform independent.
Another problem you might have to solve is that you can't determine whether the subprocess is waiting for input or has sent you output except by writing and reading from the pipes. So you might need to start a second thread so you can wait for output on proc.stdout and write to proc.stdin at the same time without running into a deadlock because both processes are blocking on pipe I/O (or, if you're on a Unix which supports select with file handles, use select to determine which pipes are ready to receive or ready to be read from).
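A minimal sketch of the write-a-line, read-a-line loop described above. The child here is a hypothetical stand-in (a tiny Python one-liner that upper-cases each line and flushes), not the asker's program, and the helper name ask is made up for illustration. It only works if the real program flushes after every line.

```python
import subprocess
import sys

# Hypothetical stand-in for the interactive program: it upper-cases each
# input line and flushes immediately, which is what makes this loop work.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)

def ask(proc, request):
    """Send one newline-terminated request and block until one line comes back."""
    proc.stdin.write(request + "\n")
    proc.stdin.flush()  # push the request through our own buffer
    return proc.stdout.readline().rstrip("\n")

proc = subprocess.Popen(
    [sys.executable, "-c", CHILD],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    universal_newlines=True,  # text mode, so we exchange str rather than bytes
    bufsize=1,                # line-buffer our side of the pipes
)

replies = [ask(proc, path) for path in ["a.txt", "b.txt"]]

proc.stdin.close()  # EOF lets the child leave its loop
proc.wait()
```

If the child buffers its output when writing to a pipe, readline() in this sketch blocks indefinitely, which is the failure mode the other answers discuss.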

This sounds like a job for an event loop. The subprocess module starts to show its strain under complex tasks.
I've done this with Twisted, by subclassing the following:
twisted.internet.endpoints.ProcessEndpoint
twisted.protocols.basic.LineOnlyReceiver
Most documentation for Twisted uses sockets as endpoints, but it's not hard to adjust the code for processes.

Related

Subprocess, repeatedly write to STDIN while reading from STDOUT (Windows)

I want to call an external process from python. The process I'm calling reads an input string and gives tokenized result, and waits for another input (binary is MeCab tokenizer if that helps).
I need to tokenize thousands of lines of string by calling this process.
Problem is Popen.communicate() works but waits for the process to die before giving out the STDOUT result. I don't want to keep closing and opening new subprocesses for thousands of times. (And I don't want to send the whole text, it may easily grow over tens of thousands of -long- lines in future.)
from subprocess import PIPE, Popen
with Popen("mecab -O wakati".split(), stdin=PIPE,
           stdout=PIPE, stderr=PIPE, close_fds=False,
           universal_newlines=True, bufsize=1) as proc:
    output, errors = proc.communicate("foobarbaz")
    print(output)
I've tried reading proc.stdout.read() instead of using communicate, but it is blocked by stdin and doesn't return any results before proc.stdin.close() is called. Which, again, means I need to create a new process every time.
I've tried to implement queues and threads from a similar question as below, but it either doesn't return anything, so it's stuck on while True, or, when I force the stdin buffer to fill by repeatedly sending strings, it outputs all the results at once.
from subprocess import PIPE, Popen
from threading import Thread
from queue import Queue, Empty
def enqueue_output(out, queue):
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()
p = Popen('mecab -O wakati'.split(), stdout=PIPE, stdin=PIPE,
          universal_newlines=True, bufsize=1, close_fds=False)
q = Queue()
t = Thread(target=enqueue_output, args=(p.stdout, q))
t.daemon = True
t.start()
p.stdin.write("foobarbaz")
while True:
    try:
        line = q.get_nowait()
    except Empty:
        pass
    else:
        print(line)
        break
I also looked at the Pexpect route, but its Windows port doesn't support some important modules (the pty-based ones), so I couldn't apply that either.
I know there are a lot of similar answers, and I've tried most of them. But nothing I've tried seems to work on Windows.
EDIT: some info on the binary I'm using, when I use it via command line. It runs and tokenizes sentences I give, until I'm done and forcibly close the program.
(...waits_for_input -> input_received -> output -> waits_for_input...)
Thanks.

If mecab uses C FILE streams with default buffering, then piped stdout has a 4 KiB buffer. The idea here is that a program can efficiently use small, arbitrary-sized reads and writes to the buffers, and the underlying standard I/O implementation handles automatically filling and flushing the much-larger buffers. This minimizes the number of required system calls and maximizes throughput. Obviously you don't want this behavior for interactive console or terminal I/O or writing to stderr. In these cases the C runtime uses line-buffering or no buffering.
A program can override this behavior, and some do have command-line options to set the buffer size. For example, Python has the "-u" (unbuffered) option and PYTHONUNBUFFERED environment variable. If mecab doesn't have a similar option, then there isn't a generic workaround on Windows. The C runtime situation is too complicated. A Windows process can link statically or dynamically to one or several CRTs. The situation on Linux is different since a Linux process generally loads a single system CRT (e.g. GNU libc.so.6) into the global symbol table, which allows an LD_PRELOAD library to configure the C FILE streams. Linux stdbuf uses this trick, e.g. stdbuf -o0 mecab -O wakati.
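As a small illustration of the unbuffered option mentioned above, here is a sketch in which the child is itself a Python script started with -u. The child script is a hypothetical stand-in, not mecab; the point is only that its first line reaches the parent while the child is still running, without an explicit flush.

```python
import subprocess
import sys

# Hypothetical child: prints one line with no explicit flush, then keeps
# running for a while. Without -u the line could sit in the child's buffer
# until exit, because its stdout is a pipe rather than a terminal.
CHILD = "import time\nprint('ready')\ntime.sleep(1)\n"

proc = subprocess.Popen(
    [sys.executable, "-u", "-c", CHILD],  # -u: unbuffered stdout/stderr in the child
    stdout=subprocess.PIPE,
    universal_newlines=True,
)

first_line = proc.stdout.readline().rstrip("\n")
still_running = proc.poll() is None  # usually True: the child is still sleeping

proc.wait()
```

Setting PYTHONUNBUFFERED in the child's environment (via the env argument of Popen) should have the same effect for Python children. Neither option helps with a non-Python binary such as mecab.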
One option to experiment with is to call CreateConsoleScreenBuffer and get a file descriptor for the handle from msvcrt.open_osfhandle. Then pass this as stdout instead of using a pipe. The child process will see this as a TTY and use line buffering instead of full buffering. However managing this is non-trivial. It would involve reading (i.e. ReadConsoleOutputCharacter) a sliding buffer (call GetConsoleScreenBufferInfo to track the cursor position) that's actively written to by another process. This kind of interaction isn't something that I've ever needed or even experimented with. But I have used a console screen buffer non-interactively, i.e. reading the buffer after the child has exited. This allows reading up to 9,999 lines of output from programs that write directly to the console instead of stdout, e.g. programs that call WriteConsole or open "CON" or "CONOUT$".

Here is a workaround for Windows. This should also be adaptable to other operating systems.
Download a console emulator like ConEmu (https://conemu.github.io/)
Start it instead of mecab as your subprocess.
p = Popen(['conemu'], stdout=PIPE, stdin=PIPE,
          universal_newlines=True, bufsize=1, close_fds=False)
Then send the following as the first input:
mecab -O wakati & exit
You are letting the emulator handle the file output issues for you; the way it normally does when you manually interact with it.
I am still looking into this, but it already looks promising.
The only problem is that ConEmu is a GUI application, so if there is no other way to hook into its input and output, one might have to tweak it and rebuild from source (it's open source). I haven't found any other way, but this should work.
I have asked a question about running it in some sort of console mode here, so you can also check that thread. The author, Maximus, is on SO.

The code
while True:
    try:
        line = q.get_nowait()
    except Empty:
        pass
    else:
        print(line)
        break
is essentially the same as
print(q.get())
except less efficient because it burns CPU time while waiting. The explicit loop won't make data from the subprocess arrive sooner; it arrives when it arrives.
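To make the comparison concrete, here is a minimal sketch of the blocking form. The producer thread is a hypothetical stand-in for the thread that reads the subprocess's stdout, and the timeout is an addition of this sketch so that a silent child cannot hang the caller forever.

```python
import queue
import threading
import time

q = queue.Queue()

def producer():
    # Stand-in for the reader thread: the "output" shows up after a delay.
    time.sleep(0.2)
    q.put("tokenized line")

threading.Thread(target=producer, daemon=True).start()

# Blocks without spinning the CPU until an item arrives or the timeout expires.
try:
    line = q.get(timeout=5)
except queue.Empty:
    line = None  # nothing arrived in time; the child is probably buffering
```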
For dealing with uncooperative binaries I have a few suggestions, from best to worst:
1. Find a Python library and use that instead. It appears that there's an official Python binding in the MeCab source tree and I see some prebuilt packages on PyPI. You can also look for a DLL build that you can call with ctypes or another Python FFI. If that doesn't work...
2. Find a binary that flushes after each line of output. The most recent Win32 build I found online, v0.98, does flush after each line. Failing that...
3. Build your own binary that flushes after each line. It should be easy enough to find the main loop and insert a flush call in it. But MeCab seems to explicitly flush already, and git blame says that the flush statement was last changed in 2011, so I'm surprised you ever had this problem and I suspect that there may have just been a bug in your Python code. Failing that...
4. Process the output asynchronously. If your concern is that you want to deal with the output in parallel with the tokenization for performance reasons, you can mostly do that, after the first 4K. Just do the processing in the second thread instead of stuffing the lines in a queue. If you can't do that...
5. This is a terrible hack but it may work in some cases: intersperse your inputs with dummy inputs that produce at least 4K of output. For example, you could output 2047 blank lines after every real input line (2047 CRLFs plus the CRLF from the real output = 4K), or a single line of b'A' * 4092 + b'\r\n', whichever is faster.
Not on this list at all is an approach suggested by the two previous answers: directing the output to a Win32 console and scraping the console. This is a terrible idea because scraping gets you cooked output as a rectangular array of characters. The scraper has no way to know whether two lines were originally one overlong line that wrapped. If it guesses wrong, your outputs will get out of sync with your inputs. It's impossible to work around output buffering in this way if you care at all about the integrity of the output.

I guess the answer, if not the solution, can be found here
https://github.com/ikriv/ConsoleProxy/blob/master/src/Tools/Exec/readme.md
I guess, because I had a similar problem, which I worked around, and could not try this route because this tool is not available for Windows 2003, which is the OS I had to use (in a VM for a legacy application).
I'd like to know if I guessed right.

Live reading / writing to a subprocess stdin/stdout

I want to make a Python wrapper for another command-line program.
I want to read Python's stdin as quickly as possible, filter and translate it, and then write it promptly to the child program's stdin.
At the same time, I want to be reading as quickly as possible from the child program's stdout and, after a bit of massaging, writing it promptly to Python's stdout.
The Python subprocess module is full of warnings to use communicate() to avoid deadlocks. However, communicate() doesn't give me access to the child program's stdout until the child has terminated.

I think you'll be fine if you (carefully) ignore the warnings and use Popen.stdin etc. yourself. Just be sure to process the streams line by line and service them on a fair schedule so as not to fill up any buffers. A relatively simple (and inefficient) way of doing this in Python is to use a separate thread for each of the three streams. That's how Popen.communicate does it internally on Windows (on POSIX it multiplexes the pipes with select/poll instead). Check out its source code to see how.
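A rough sketch of the thread-per-stream idea, reduced to two streams: the main thread writes to the child's stdin while a helper thread drains its stdout, so neither pipe can fill up and stall the other side. The child is a hypothetical upper-casing filter and pump is a made-up helper name; stderr is simply inherited here.

```python
import subprocess
import sys
import threading

# Hypothetical child: a line-oriented filter that upper-cases its input.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)

def pump(stream, sink):
    """Drain a child stream line by line so its pipe never fills up."""
    for line in stream:
        sink.append(line.rstrip("\n"))

proc = subprocess.Popen(
    [sys.executable, "-c", CHILD],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    universal_newlines=True,
)

received = []
reader = threading.Thread(target=pump, args=(proc.stdout, received))
reader.start()

# The main thread is free to keep writing while the reader thread reads.
for i in range(1000):
    proc.stdin.write("line %d\n" % i)
proc.stdin.close()  # flushes and signals EOF to the child

reader.join()
proc.wait()
```

In a real wrapper the pump function would massage each line and write it to sys.stdout instead of appending to a list.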

Disclaimer: This solution likely requires that you have access to the source code of the process you are trying to call, but may be worth trying anyways. It depends on the called process periodically flushing its stdout buffer which is not standard.
Say you have a process proc created by subprocess.Popen. proc has attributes stdin and stdout. These attributes are simply file-like objects. So, in order to send information through stdin you would call proc.stdin.write(). To retrieve information from proc.stdout you would call proc.stdout.readline() to read an individual line.
A couple of caveats:
When writing to proc.stdin via write() you will need to end the input with a newline character. Without a newline character, your subprocess will hang until a newline is passed.
In order to read information from proc.stdout you will need to make sure that the command called by subprocess appropriately flushes its stdout buffer after each print statement and that each line ends with a newline. If the stdout buffer does not flush at appropriate times, your call to proc.stdout.readline() will hang.

Repeatedly write to STDIN and read STDOUT of a Subprocess without closing it

I am trying to employ a Subprocess in Python for keeping an external script open in a Server-like fashion. The external script first loads a model. Once this is done, it accepts requests via STDIN and returns processed strings to STDOUT.
So far, I've tried
tokenizer = subprocess.Popen([tokenizer_path, '-l', lang_prefix], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
However, I cannot use
tokenizer.stdin.write(input_string+'\n')
out = tokenizer.stdout.readline()
in order to repeatedly process input_strings by means of the subprocess – out will just be empty, no matter if I use stdout.read() or stdout.readline(). However, it works when I close the stdin with tokenizer.stdin.close() before reading STDOUT, but this closes the subprocess, which is not what I want as I would have to reload the whole external script again before sending another request.
Is there any way to use a subprocess in a server-like fashion in python without closing and re-opening it?

Thanks to this Answer, I found out that a slave handle must be used in order to properly communicate with the subprocess:
import os
import pty
import subprocess
master, slave = pty.openpty()
tokenizer = subprocess.Popen(script, shell=True, stdin=subprocess.PIPE, stdout=slave)
stdin_handle = tokenizer.stdin
stdout_handle = os.fdopen(master)
Now, I can communicate to the subprocess without closing it via
stdin_handle.write(input)
stdout_handle.readline() #gets the processed input

Your external script probably buffers its output, so you can only read it in the parent once the child flushes its buffer (which the child must do itself). Closing the input probably makes it flush because the child then terminates cleanly and flushes its buffers on the way out.
If you have control over the external program (i.e. if you can patch it), insert a flush after the output is produced.
Otherwise, programs can sometimes be made to stop buffering their output by attaching them to a pseudo-TTY (many programs, including the stdlib, assume that output going to a TTY should not be fully buffered). But this is a bit tricky.

Blocking writing to stdout

I'm writing a Python script that will use subprocesses. The main idea is to have one parent script that runs specialised child scripts, which e.g. run other programs or do some stuff on their own. There are pipes between parent script and subprocesses. I use them to control whether subprocess is still responding by sending some characters on regular basis and checking the response. The problem is that when the subprocess prints anything on screen (i.e. writes to stdout or stderr), the pipes are broken and everything crashes. So my main question is whether it is possible to block writing to std* in the subprocess, so only legitimate response written to pipe would be possible? I have already tried Stop a function from writing to stdout but without any success.
Other ideas for communication between parent and subprocess are also welcome (except file-based pipes). However, subprocesses must be used.

I strongly believe that you do not just have to accept "that when the subprocess prints anything on screen (i.e. writes to stdout or stderr), the pipes are broken and everything crashes". You can solve this problem. Then you do not need to "block" the subprocesses from writing to standard streams.
Make proper use of all the power of the subprocess module. First of all, connect a subprocess.PIPE to each of the standard streams of a subprocess:
p = subprocess.Popen(
    [executable, arg1, arg2],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)
Run the subprocess and interact with it through those pipes:
stdout, stderr = p.communicate(input=b"command")
If communicate() is not flexible enough (if you need to monitor several subprocesses at the same time and/or if the stdin data to a certain subprocess depends on its output in response to a previous command) you can directly interact with the p.stdout, p.stderr, p.stdin attributes. In this case, you will likely have to build your own monitoring loop and make use of p.poll() and/or p.returncode. Controlling the subprocesses can also be realized via p.send_signal().
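A minimal sketch of such a monitoring loop, assuming a hypothetical child that works briefly and prints a short status line. It polls until the child exits and only then collects the output, which is safe here only because the output is far smaller than the pipe buffer; a chattier child would need its pipes drained while it runs.

```python
import subprocess
import sys
import time

# Hypothetical child: does a little "work", then reports and exits.
CHILD = "import time\ntime.sleep(0.3)\nprint('done')\n"

proc = subprocess.Popen(
    [sys.executable, "-c", CHILD],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
)

polls = 0
while proc.poll() is None:  # poll() returns None while the child is alive
    polls += 1
    time.sleep(0.05)        # other monitoring work would go here

# The child has exited, so this just collects whatever it left in the pipes.
out, err = proc.communicate()
```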

You can pass a function to subprocess.Popen that is executed prior to executing the requested program:
import os
import subprocess
def close_std():
    os.close(0)
    os.close(1)
    os.close(2)
p = subprocess.Popen(cmd, preexec_fn=close_std)
Note the use of low-level os.close; closing sys.std* will only have effect in the forked Python process. Also, be aware that if your underlying programs are Python scripts, they may die due to an exception when they try to write to closed file descriptors.
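A gentler variant of the same idea, sketched here as an alternative rather than as what the answer above proposes: on Python 3.3 and later, subprocess.DEVNULL redirects the child's standard streams to the null device instead of closing them. The child can still write without raising an exception, the output is simply discarded, and it also works on Windows, where preexec_fn is not supported. The child script is a hypothetical noisy program.

```python
import subprocess
import sys

# Hypothetical noisy child that writes to both stdout and stderr.
CHILD = (
    "import sys\n"
    "print('noise on stdout')\n"
    "print('noise on stderr', file=sys.stderr)\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", CHILD],
    stdout=subprocess.DEVNULL,  # discard instead of closing the descriptor
    stderr=subprocess.DEVNULL,
)
exit_code = proc.wait()
```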

python subprocess module: looping over stdout of child process

I have some commands which I am running using the subprocess module. I then want to loop over the lines of the output. The documentation says do not do data_stream.stdout.read which I am not but I may be doing something which calls that. I am looping over the output like this:
for line in data_stream.stdout:
    # do stuff here
    .
    .
    .
Can this cause deadlocks like reading from data_stream.stdout or are the Popen modules set up for this kind of looping such that it uses the communicate code but handles all the callings of it for you?

You have to worry about deadlocks if you're communicating with your subprocess, i.e. if you're writing to stdin as well as reading from stdout. Because these pipes are buffered, doing this kind of two-way communication is very much a no-no:
data_stream = Popen(mycmd, stdin=PIPE, stdout=PIPE)
data_stream.stdin.write("do something\n")
for line in data_stream.stdout:
    ...  # BAD!
However, if you've not set up stdin (or stderr) when constructing data_stream, you should be fine.
data_stream = Popen(mycmd, stdout=PIPE)
for line in data_stream.stdout:
    ...  # Fine
If you need two-way communication, use communicate.
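For completeness, a short sketch of the communicate route, using a hypothetical child that reads all of its stdin and writes it back upper-cased. communicate writes the input, closes the child's stdin, reads everything the child produces and waits for it to exit, so it suits one-shot exchanges rather than an ongoing conversation.

```python
import subprocess
import sys

# Hypothetical child: reads everything on stdin, writes it back upper-cased.
CHILD = "import sys\nsys.stdout.write(sys.stdin.read().upper())\n"

proc = subprocess.Popen(
    [sys.executable, "-c", CHILD],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    universal_newlines=True,
)

# Sends the input, closes stdin, collects all output, waits for exit.
out, _ = proc.communicate("do something\n")
```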

The two answers have caught the gist of the issue pretty well: don't mix writing something to the subprocess, reading something from it, writing again, etc -- the pipe's buffering means you're at risk of a deadlock. If you can, write everything you need to write to the subprocess FIRST, close that pipe, and only THEN read everything the subprocess has to say; communicate is nice for the purpose, IF the amount of data is not too large to fit in memory (if it is, you can still achieve the same effect "manually").
If you need finer-grain interaction, look instead at pexpect or, if you're on Windows, wexpect.

SilentGhost's/chrispy's answers are OK if you have a small to moderate amount of output from your subprocess. Sometimes, though, there may be a lot of output - too much to comfortably buffer in memory. In such a case, the thing to do is start the process with Popen and spawn a couple of threads - one to read child.stdout and one to read child.stderr, where child is the subprocess. You then need to wait() for the subprocess to terminate.
This is actually how communicate() works; the advantage of using your own threads is that you can process the output from the subprocess as it is generated. For example, in my project python-gnupg I use this technique to read status output from the GnuPG executable as it is generated, rather than waiting for all of it by calling communicate(). You are welcome to inspect the source of this project - the relevant stuff is in the module gnupg.py.

data_stream.stdout is a standard output handle. You shouldn't be looping over it. communicate returns a tuple of (stdoutdata, stderrdata). It is this stdoutdata that you should be using to do your stuff.
