As part of an evaluation, I want to measure and compare the user+system runtime of different diff-tools.
As a first approach, I thought about calling the particular tools with time - f (GNU time). Since the rest of the evaluation is done by a bunch of Python scripts I want to realise it in Python.
The time output is formatted as follows:
<some error message>
user 0.4
sys 0.2
The output of the diff tool is redirected to sed to get rid of unneeded output and the output of sed is then further processed. (use of sed deprecated for my example. See Edit 2)
A call from within a shell would look like this (removes lines starting with "Binary"):
$ time -f "user %U\nsys %S\n" diff -r -u0 dirA dirB | sed -e '/^Binary.*/d'
Here is my approach so far:
import subprocess
diffcommand=["time","-f","user %U\nsys %S\n","diff","-r","-u0","testrepo_1/A/rev","testrepo_1/B/rev"]
sedcommand = ["sed","-e","/^Binary.*/d"]
# Execute command as subprocess
diff = subprocess.Popen(diffcommand, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
# Calculate runtime
runtime = 0.0
for line in diff.stderr.readlines():
current = line.split()
if current:
if current[0] == "user" or current[0] == "sys":
runtime = runtime + float(current[1])
print "Runtime: "+str(runtime)
# Pipe to "sed"
sedresult = subprocess.check_output(sedcommand, stdin=diff.stdout)
# Wait for the subprocesses to terminate
diff.wait()
However it feels like that this is not clean (especially from an OS point of view). It also leads to the script being stuck in the readlines part under certain circumstances I couldn't figure out yet.
Is there a cleaner (or better) way to achieve what I want?
Edit 1
Changed head line and gave a more detailed explanation
Edit 2
Thanks to J.F. Sebastian, I had a look at os.wait4(...) (information taken from his answer. But since I AM interested in the output, I had to implement it a bit different.
My code now looks like this:
diffprocess = subprocess.Popen(diffcommand,stdout=subprocess.PIPE)
runtimes = os.wait4(diffprocess.pid,0)[2]
runtime = runtimes.ru_utime + runtimes.ru_stime
diffresult = diffprocess.communicate()[0]
Note that I do not pipe the result to sed any more (decided to trim within python)
The runtime measurement works fine for some test cases, but the execution gets stuck sometimes. Removing the runtime measurement then helps the program to terminate and so does sending stdout to DEVNULL (as demanded here). Could I have a deadlock? (valgrind --tool=helgrind did not find anything) Is there something fundamentally wrong in my approach?
but the execution gets stuck sometimes.
If you use stdout=PIPE then something should read the output while the process is still running otherwise the child process will hang if its stdout OS pipe buffer fills up (~65K on my machine).
from subprocess import Popen, PIPE
p = Popen(diffcommand, stdout=PIPE, bufsize=-1)
with p.stdout:
output = p.stdout.read()
ru = os.wait4(p.pid, 0)[2]
Related
I'm working with a piece of scientific software called Chimera. For some of the code downstream of this question, it requires that I use Python 2.7.
I want to call a process, give that process some input, read its output, give it more input based on that, etc.
I've used Popen to open the process, process.stdin.write to pass standard input, but then I've gotten stuck trying to get output while the process is still running. process.communicate() stops the process, process.stdout.readline() seems to keep me in an infinite loop.
Here's a simplified example of what I'd like to do:
Let's say I have a bash script called exampleInput.sh.
#!/bin/bash
# exampleInput.sh
# Read a number from the input
read -p 'Enter a number: ' num
# Multiply the number by 5
ans1=$( expr $num \* 5 )
# Give the user the multiplied number
echo $ans1
# Ask the user whether they want to keep going
read -p 'Based on the previous output, would you like to continue? ' doContinue
if [ $doContinue == "yes" ]
then
echo "Okay, moving on..."
# [...] more code here [...]
else
exit 0
fi
Interacting with this through the command line, I'd run the script, type in "5" and then, if it returned "25", I'd type "yes" and, if not, I would type "no".
I want to run a python script where I pass exampleInput.sh "5" and, if it gives me "25" back, then I pass "yes"
So far, this is as close as I can get:
#!/home/user/miniconda3/bin/python2
# talk_with_example_input.py
import subprocess
process = subprocess.Popen(["./exampleInput.sh"],
stdin = subprocess.PIPE,
stdout = subprocess.PIPE)
process.stdin.write("5")
answer = process.communicate()[0]
if answer == "25":
process.stdin.write("yes")
## I'd like to print the STDOUT here, but the process is already terminated
But that fails of course, because after `process.communicate()', my process isn't running anymore.
(Just in case/FYI): Actual problem
Chimera is usually a gui-based application to examine protein structure. If you run chimera --nogui, it'll open up a prompt and take input.
I often need to know what chimera outputs before I run my next command. For example, I will often try to generate a protein surface and, if Chimera can't generate a surface, it doesn't break--it just says so through STDOUT. So, in my python script, while I'm looping through many proteins to analyze, I need to check STDOUT to know whether to continue analysis on that protein.
In other use cases, I'll run lots of commands through Chimera to clean up a protein first, and then I'll want to run lots of separate commands to get different pieces of data, and use that data to decide whether to run other commands. I could get the data, close the subprocess, and then run another process, but that would require re-running all of those cleaning up commands each time.
Anyways, those are some of the real-world reasons why I want to be able to push STDIN to a subprocess, read the STDOUT, and still be able to push more STDIN.
Thanks for your time!
you don't need to use process.communicate in your example.
Simply read and write using process.stdin.write and process.stdout.read. Also make sure to send a newline, otherwise read won't return. And when you read from stdin, you also have to handle newlines coming from echo.
Note: process.stdout.read will block until EOF.
# talk_with_example_input.py
import subprocess
process = subprocess.Popen(["./exampleInput.sh"],
stdin = subprocess.PIPE,
stdout = subprocess.PIPE)
process.stdin.write("5\n")
stdout = process.stdout.readline()
print(stdout)
if stdout == "25\n":
process.stdin.write("yes\n")
print(process.stdout.readline())
$ python2 test.py
25
Okay, moving on...
Update
When communicating with an program in that way, you have to pay special attention to what the application is actually writing. Best is to analyze the output in a hex editor:
$ chimera --nogui 2>&1 | hexdump -C
Please note that readline [1] only reads to the next newline (\n). In your case you have to call readline at least four times to get that first block of output.
If you just want to read everything up until the subprocess stops printing, you have to read byte by byte and implement a timeout. Sadly, neither read nor readline does provide such a timeout mechanism. This is probably because the underlying read syscall [2] (Linux) does not provide one either.
On Linux we can write a single-threaded read_with_timeout() using poll / select. For an example see [3].
from select import epoll, EPOLLIN
def read_with_timeout(fd, timeout__s):
"""Reads from fd until there is no new data for at least timeout__s seconds.
This only works on linux > 2.5.44.
"""
buf = []
e = epoll()
e.register(fd, EPOLLIN)
while True:
ret = e.poll(timeout__s)
if not ret or ret[0][1] is not EPOLLIN:
break
buf.append(
fd.read(1)
)
return ''.join(buf)
In case you need a reliable way to read non blocking under Windows and Linux, this answer might be helpful.
[1] from the python 2 docs:
readline(limit=-1)
Read and return one line from the stream. If limit is specified, at most limit bytes will be read.
The line terminator is always b'\n' for binary files; for text files, the newline argument to open() can be used to select the line terminator(s) recognized.
[2] from man 2 read:
#include <unistd.h>
ssize_t read(int fd, void *buf, size_t count);
[3] example
$ tree
.
├── prog.py
└── prog.sh
prog.sh
#!/usr/bin/env bash
for i in $(seq 3); do
echo "${RANDOM}"
sleep 1
done
sleep 3
echo "${RANDOM}"
prog.py
# talk_with_example_input.py
import subprocess
from select import epoll, EPOLLIN
def read_with_timeout(fd, timeout__s):
"""Reads from f until there is no new data for at least timeout__s seconds.
This only works on linux > 2.5.44.
"""
buf = []
e = epoll()
e.register(fd, EPOLLIN)
while True:
ret = e.poll(timeout__s)
if not ret or ret[0][1] is not EPOLLIN:
break
buf.append(
fd.read(1)
)
return ''.join(buf)
process = subprocess.Popen(
["./prog.sh"],
stdin = subprocess.PIPE,
stdout = subprocess.PIPE
)
print(read_with_timeout(process.stdout, 1.5))
print('-----')
print(read_with_timeout(process.stdout, 3))
$ python2 prog.py
6194
14508
11293
-----
10506
I have this code. Basically I using subprocess to execute a program several times in a while loop. It works fine but after several times (5 times to be precise) the my python script just terminates and it still has a long way before finishing.
while x < 50:
# ///////////I am doing things here/////////////////////
cmdline = 'gmx mdrun -ntomp 1 -v -deffnm my_sim'
args = shlex.split(cmdline)
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output = proc.communicate()[0].decode()
# ///////////I am doing things here/////////////////////
x += 1
For each time I am calling program, it will take approximately one hour to finish. In the mean time subprocess should wait because depending on the output I must execute parts of my code (that is why I am using .communicate() ).
Why is this happening?
Thanks for the help in advanced!
A subprocess runs asynchronously in the background (since it's a different process) and you need to use subprocess.wait() to wait for it to finish. Since you have multiple subprocesses you'll likely want to wait on all of them, like this:
exit_codes = [p.wait() for p in (p1, p2)]
To solve this problem I suggest doing the following:
while x < 50:
# ///////////I am doing things here/////////////////////
cmdline = 'gmx mdrun -ntomp 1 -v -deffnm my_sim 2>&1 | tee output.txt'
proc = subprocess.check_output(args, shell=True)
with open('output.txt', 'r') as fin:
out_file = fin.read()
# ///////////Do what ever you need with the out_file/////////////
# ///////////I am doing things here/////////////////////
x += 1
I know it is not recommended to use shell=True, so if you do not want to use it then just pass the cmdline you with commas. Be aware that you might get an error when passing with commas. I do want to go into details but in that case you can use shell=True and your problem will be gone.
Using the piece of code I just provided your python script will not terminate abruptly when using subprocess many time and with programs that have a lot of stdout and stderr messages.
It tool some time to discover this and I hope I can help someone out there.
How do I execute the following shell command using the Python subprocess module?
echo "input data" | awk -f script.awk | sort > outfile.txt
The input data will come from a string, so I don't actually need echo. I've got this far, can anyone explain how I get it to pipe through sort too?
p_awk = subprocess.Popen(["awk","-f","script.awk"],
stdin=subprocess.PIPE,
stdout=file("outfile.txt", "w"))
p_awk.communicate( "input data" )
UPDATE: Note that while the accepted answer below doesn't actually answer the question as asked, I believe S.Lott is right and it's better to avoid having to solve that problem in the first place!
You'd be a little happier with the following.
import subprocess
awk_sort = subprocess.Popen( "awk -f script.awk | sort > outfile.txt",
stdin=subprocess.PIPE, shell=True )
awk_sort.communicate( b"input data\n" )
Delegate part of the work to the shell. Let it connect two processes with a pipeline.
You'd be a lot happier rewriting 'script.awk' into Python, eliminating awk and the pipeline.
Edit. Some of the reasons for suggesting that awk isn't helping.
[There are too many reasons to respond via comments.]
Awk is adding a step of no significant value. There's nothing unique about awk's processing that Python doesn't handle.
The pipelining from awk to sort, for large sets of data, may improve elapsed processing time. For short sets of data, it has no significant benefit. A quick measurement of awk >file ; sort file and awk | sort will reveal of concurrency helps. With sort, it rarely helps because sort is not a once-through filter.
The simplicity of "Python to sort" processing (instead of "Python to awk to sort") prevents the exact kind of questions being asked here.
Python -- while wordier than awk -- is also explicit where awk has certain implicit rules that are opaque to newbies, and confusing to non-specialists.
Awk (like the shell script itself) adds Yet Another Programming language. If all of this can be done in one language (Python), eliminating the shell and the awk programming eliminates two programming languages, allowing someone to focus on the value-producing parts of the task.
Bottom line: awk can't add significant value. In this case, awk is a net cost; it added enough complexity that it was necessary to ask this question. Removing awk will be a net gain.
Sidebar Why building a pipeline (a | b) is so hard.
When the shell is confronted with a | b it has to do the following.
Fork a child process of the original shell. This will eventually become b.
Build an os pipe. (not a Python subprocess.PIPE) but call os.pipe() which returns two new file descriptors that are connected via common buffer. At this point the process has stdin, stdout, stderr from its parent, plus a file that will be "a's stdout" and "b's stdin".
Fork a child. The child replaces its stdout with the new a's stdout. Exec the a process.
The b child closes replaces its stdin with the new b's stdin. Exec the b process.
The b child waits for a to complete.
The parent is waiting for b to complete.
I think that the above can be used recursively to spawn a | b | c, but you have to implicitly parenthesize long pipelines, treating them as if they're a | (b | c).
Since Python has os.pipe(), os.exec() and os.fork(), and you can replace sys.stdin and sys.stdout, there's a way to do the above in pure Python. Indeed, you may be able to work out some shortcuts using os.pipe() and subprocess.Popen.
However, it's easier to delegate that operation to the shell.
import subprocess
some_string = b'input_data'
sort_out = open('outfile.txt', 'wb', 0)
sort_in = subprocess.Popen('sort', stdin=subprocess.PIPE, stdout=sort_out).stdin
subprocess.Popen(['awk', '-f', 'script.awk'], stdout=sort_in,
stdin=subprocess.PIPE).communicate(some_string)
To emulate a shell pipeline:
from subprocess import check_call
check_call('echo "input data" | a | b > outfile.txt', shell=True)
without invoking the shell (see 17.1.4.2. Replacing shell pipeline):
#!/usr/bin/env python
from subprocess import Popen, PIPE
a = Popen(["a"], stdin=PIPE, stdout=PIPE)
with a.stdin:
with a.stdout, open("outfile.txt", "wb") as outfile:
b = Popen(["b"], stdin=a.stdout, stdout=outfile)
a.stdin.write(b"input data")
statuses = [a.wait(), b.wait()] # both a.stdin/stdout are closed already
plumbum provides some syntax sugar:
#!/usr/bin/env python
from plumbum.cmd import a, b # magic
(a << "input data" | b > "outfile.txt")()
The analog of:
#!/bin/sh
echo "input data" | awk -f script.awk | sort > outfile.txt
is:
#!/usr/bin/env python
from plumbum.cmd import awk, sort
(awk["-f", "script.awk"] << "input data" | sort > "outfile.txt")()
The accepted answer is sidestepping actual question.
here is a snippet that chains the output of multiple processes:
Note that it also prints the (somewhat) equivalent shell command so you can run it and make sure the output is correct.
#!/usr/bin/env python3
from subprocess import Popen, PIPE
# cmd1 : dd if=/dev/zero bs=1m count=100
# cmd2 : gzip
# cmd3 : wc -c
cmd1 = ['dd', 'if=/dev/zero', 'bs=1M', 'count=100']
cmd2 = ['tee']
cmd3 = ['wc', '-c']
print(f"Shell style : {' '.join(cmd1)} | {' '.join(cmd2)} | {' '.join(cmd3)}")
p1 = Popen(cmd1, stdout=PIPE, stderr=PIPE) # stderr=PIPE optional, dd is chatty
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE)
p3 = Popen(cmd3, stdin=p2.stdout, stdout=PIPE)
print("Output from last process : " + (p3.communicate()[0]).decode())
# thoretically p1 and p2 may still be running, this ensures we are collecting their return codes
p1.wait()
p2.wait()
print("p1 return: ", p1.returncode)
print("p2 return: ", p2.returncode)
print("p3 return: ", p3.returncode)
http://www.python.org/doc/2.5.2/lib/node535.html covered this pretty well. Is there some part of this you didn't understand?
Your program would be pretty similar, but the second Popen would have stdout= to a file, and you wouldn't need the output of its .communicate().
Inspired by #Cristian's answer. I met just the same issue, but with a different command. So I'm putting my tested example, which I believe could be helpful:
grep_proc = subprocess.Popen(["grep", "rabbitmq"],
stdin=subprocess.PIPE,
stdout=subprocess.PIPE)
subprocess.Popen(["ps", "aux"], stdout=grep_proc.stdin)
out, err = grep_proc.communicate()
This is tested.
What has been done
Declared lazy grep execution with stdin from pipe. This command will be executed at the ps command execution when the pipe will be filled with the stdout of ps.
Called the primary command ps with stdout directed to the pipe used by the grep command.
Grep communicated to get stdout from the pipe.
I like this way because it is natural pipe conception gently wrapped with subprocess interfaces.
The previous answers missed an important point. Replacing shell pipeline is basically correct, as pointed out by geocar. It is almost sufficient to run communicate on the last element of the pipe.
The remaining problem is passing the input data to the pipeline. With multiple subprocesses, a simple communicate(input_data) on the last element doesn't work - it hangs forever. You need to create a a pipeline and a child manually like this:
import os
import subprocess
input = """\
input data
more input
""" * 10
rd, wr = os.pipe()
if os.fork() != 0: # parent
os.close(wr)
else: # child
os.close(rd)
os.write(wr, input)
os.close(wr)
exit()
p_awk = subprocess.Popen(["awk", "{ print $2; }"],
stdin=rd,
stdout=subprocess.PIPE)
p_sort = subprocess.Popen(["sort"],
stdin=p_awk.stdout,
stdout=subprocess.PIPE)
p_awk.stdout.close()
out, err = p_sort.communicate()
print (out.rstrip())
Now the child provides the input through the pipe, and the parent calls communicate(), which works as expected. With this approach, you can create arbitrary long pipelines without resorting to "delegating part of the work to the shell". Unfortunately the subprocess documentation doesn't mention this.
There are ways to achieve the same effect without pipes:
from tempfile import TemporaryFile
tf = TemporaryFile()
tf.write(input)
tf.seek(0, 0)
Now use stdin=tf for p_awk. It's a matter of taste what you prefer.
The above is still not 100% equivalent to bash pipelines because the signal handling is different. You can see this if you add another pipe element that truncates the output of sort, e.g. head -n 10. With the code above, sort will print a "Broken pipe" error message to stderr. You won't see this message when you run the same pipeline in the shell. (That's the only difference though, the result in stdout is the same). The reason seems to be that python's Popen sets SIG_IGN for SIGPIPE, whereas the shell leaves it at SIG_DFL, and sort's signal handling is different in these two cases.
EDIT: pipes is available on Windows but, crucially, doesn't appear to actually work on Windows. See comments below.
The Python standard library now includes the pipes module for handling this:
https://docs.python.org/2/library/pipes.html, https://docs.python.org/3.4/library/pipes.html
I'm not sure how long this module has been around, but this approach appears to be vastly simpler than mucking about with subprocess.
For me, the below approach is the cleanest and easiest to read
from subprocess import Popen, PIPE
def string_to_2_procs_to_file(input_s, first_cmd, second_cmd, output_filename):
with open(output_filename, 'wb') as out_f:
p2 = Popen(second_cmd, stdin=PIPE, stdout=out_f)
p1 = Popen(first_cmd, stdout=p2.stdin, stdin=PIPE)
p1.communicate(input=bytes(input_s))
p1.wait()
p2.stdin.close()
p2.wait()
which can be called like so:
string_to_2_procs_to_file('input data', ['awk', '-f', 'script.awk'], ['sort'], 'output.txt')
I have an app that reads in stuff from stdin and returns, after a newline, results to stdout
A simple (stupid) example:
$ app
Expand[(x+1)^2]<CR>
x^2 + 2*x + 1
100 - 4<CR>
96
Opening and closing the app requires a lot of initialization and clean-up (its an interface to a Computer Algebra System), so I want to keep this to a minimum.
I want to open a pipe in Python to this process, write strings to its stdin and read out the results from stdout. Popen.communicate() doesn't work for this, as it closes the file handle, requiring to reopen the pipe.
I've tried something along the lines of this related question:
Communicate multiple times with a process without breaking the pipe? but I'm not sure how to wait for the output. It is also difficult to know a priori how long it will take the app to finish to process for the input at hand, so I don't want to make any assumptions. I guess most of my confusion comes from this question: Non-blocking read on a subprocess.PIPE in python where it is stated that mixing high and low level functions is not a good idea.
EDIT:
Sorry that I didn't give any code before, got interrupted. This is what I've tried so far and it seems to work, I'm just worried that something goes wrong unnoticed:
from subprocess import Popen, PIPE
pipe = Popen(["MathPipe"], stdin=PIPE, stdout=PIPE)
expressions = ["Expand[(x+1)^2]", "Integrate[Sin[x], {x,0,2*Pi}]"] # ...
for expr in expressions:
pipe.stdin.write(expr)
while True:
line = pipe.stdout.readline()
if line != '':
print line
# output of MathPipe is always terminated by ';'
if ";" in line:
break
Potential problems with this?
Using subprocess, you can't do this reliably. You might want to look at using the pexpect library. That won't work on Windows - if you're on Windows, try winpexpect.
Also, if you're trying to do mathematical stuff in Python, check out SAGE. They do a lot of work on interfacing with other open-source maths software, so there's a chance they've already done what you're trying to.
Perhaps you could pass stdin=subprocess.PIPE as an argument to subprocess.Popen. This will make the process' stdin available as a general file-like object:
import sys, subprocess
proc = subprocess.Popen(["mathematica <args>"], stdin=subprocess.PIPE,
stdout=sys.stdout, shell=True)
proc.stdin.write("Expand[ (x-1)^2 ]") # Write whatever to the process
proc.stdin.flush() # Ensure nothing is left in the buffer
proc.terminate() # Kill the process
This directs the subprocess' output directly to your python process' stdout. If you need to read the output and do some editing first, that is possible as well. Check out http://docs.python.org/library/subprocess.html#popen-objects.
I want to capture stdout from a long-ish running process started via subprocess.Popen(...) so I'm using stdout=PIPE as an arg.
However, because it's a long running process I also want to send the output to the console (as if I hadn't piped it) to give the user of the script an idea that it's still working.
Is this at all possible?
Cheers.
The buffering your long-running sub-process is probably performing will make your console output jerky and very bad UX. I suggest you consider instead using pexpect (or, on Windows, wexpect) to defeat such buffering and get smooth, regular output from the sub-process. For example (on just about any unix-y system, after installing pexpect):
>>> import pexpect
>>> child = pexpect.spawn('/bin/bash -c "echo ba; sleep 1; echo bu"', logfile=sys.stdout); x=child.expect(pexpect.EOF); child.close()
ba
bu
>>> child.before
'ba\r\nbu\r\n'
The ba and bu will come with the proper timing (about a second between them). Note the output is not subject to normal terminal processing, so the carriage returns are left in there -- you'll need to post-process the string yourself (just a simple .replace!-) if you need \n as end-of-line markers (the lack of processing is important just in case the sub-process is writing binary data to its stdout -- this ensures all the data's left intact!-).
S. Lott's comment points to Getting realtime output using subprocess and Real-time intercepting of stdout from another process in Python
I'm curious that Alex's answer here is different from his answer 1085071.
My simple little experiments with the answers in the two other referenced questions has given good results...
I went and looked at wexpect as per Alex's answer above, but I have to say reading the comments in the code I was not left a very good feeling about using it.
I guess the meta-question here is when will pexpect/wexpect be one of the Included Batteries?
Can you simply print it as you read it from the pipe?
Inspired by pty.openpty() suggestion somewhere above, tested on python2.6, linux. Publishing since it took a while to make this working properly, w/o buffering...
def call_and_peek_output(cmd, shell=False):
import pty, subprocess
master, slave = pty.openpty()
p = subprocess.Popen(cmd, shell=shell, stdin=None, stdout=slave, close_fds=True)
os.close(slave)
line = ""
while True:
try:
ch = os.read(master, 1)
except OSError:
# We get this exception when the spawn process closes all references to the
# pty descriptor which we passed him to use for stdout
# (typically when it and its childs exit)
break
line += ch
sys.stdout.write(ch)
if ch == '\n':
yield line
line = ""
if line:
yield line
ret = p.wait()
if ret:
raise subprocess.CalledProcessError(ret, cmd)
for l in call_and_peek_output("ls /", shell=True):
pass
Alternatively, you can pipe your process into tee and capture only one of the streams.
Something along the lines of sh -c 'process interesting stuff' | tee /dev/stderr.
Of course, this only works on Unix-like systems.