Subprocess pipe write to file malfunction - Python

Executing this in shell gets me tangible results:
wget -O c1 --no-cache "http://some.website" | sed "1,259d" c1 | sed "4,2002d"
Doing this in Python gets me nothing:
subprocess.call(shlex.split("wget -O c1 --no-cache \"http://some.website/tofile\""))
c1 = open("c1",'w')
first = subprocess.Popen(shlex.split("sed \"1,259d\" c1"), stdout=subprocess.PIPE)
subprocess.Popen(shlex.split("sed \"4,2002d\""), stdin=first.stdout, stdout=c1)
c1.close()
Doing this also gets me no results:
c1.write(subprocess.Popen(shlex.split("sed \"4,2002d\""), stdin=first.stdout, stdout=subprocess.PIPE).communicate()[0])
By 'gets me nothing' I mean blank output in the file. Does anyone see anything out of the ordinary here?

I always use plumbum for running external commands. It provides a very intuitive interface and, of course, takes care of escaping for me.
Would look something like:
from plumbum.cmd import wget, sed
cmd1 = wget['-O', 'c1']['--no-cache']["http://some.website"]
cmd2 = sed["1,259d"]['c1'] | sed["4,2002d"]
print cmd1
cmd1() # run it
print cmd2
cmd2() # run it

The statement c1 = open("c1",'w') opens file c1 for writing and truncates any existing data, so everything wget wrote to the file gets erased before you call sed.
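A tiny demonstration of that truncation (a sketch only; "demo.txt" is just a throwaway file name):
import subprocess  # not needed here, just mirroring the question's context

with open("demo.txt", "w") as f:           # pretend this is what wget wrote
    f.write("downloaded data\n")

c1 = open("demo.txt", "w")                 # same pattern as c1 = open("c1", 'w') above
c1.close()

print(open("demo.txt").read())             # prints nothing - the 'w' open emptied the file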
Anyway, I think shlex.split is generally awkward. I prefer to build the args list manually:
from subprocess import Popen, PIPE
p0 = Popen(['wget', '-O', '-', 'http://www.google.com'], stdout=PIPE)
p1 = Popen(['sed', '2,8d'], stdin=p0.stdout, stdout=PIPE)
with open('c1', 'w') as c1:
    p2 = Popen(['sed', '2,7d'], stdin=p1.stdout, stdout=c1)
    p2.wait()
However, there's no obvious reason a Python programmer should have to call out to sed. Python has string methods and regular expressions. Also, instead of wget you can use urllib2.urlopen.
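For example, a rough pure-Python sketch of the same job (assuming the same URL and the line ranges from the shell pipeline; on Python 3, urllib.request.urlopen replaces urllib2.urlopen):
import urllib2  # on Python 3: from urllib.request import urlopen

# Fetch the page, as `wget -O -` would.
lines = urllib2.urlopen("http://www.google.com").read().splitlines(True)

# sed "1,259d" keeps everything from line 260 on; the second sed then
# drops lines 4-2002 of what is left. Plain slicing does the same thing.
kept = lines[259:]
kept = kept[:3] + kept[2002:]

with open("c1", "w") as c1:
    c1.writelines(kept)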

Why not just do everything all in pipes and send the output to a file?
wget -O - "http://www.google.com" | sed "1,259d" | sed "4,2002d" > c1
Or if you don't want to send it to a file, and want it on stdout instead:
wget -O - "http://www.google.com" | sed "1,259d" | sed "4,2002d"
And if you want to do it in Python:
pipe = subprocess.Popen('wget -O - "http://www.google.com" | sed "1,259d" | sed "4,2002d"',
                        shell=True, stdout=subprocess.PIPE)
result = pipe.communicate()[0]
Note that the pipe characters mean the command has to go through a shell (shell=True); splitting it with shlex.split and running it directly would hand the | and the sed parts to wget as arguments.

To make life easier for people running into the same type of problem, here is the final revised code, which factors in the comments about c1 being truncated and data being overwritten. Of particular interest is the use of communicate(), which eliminated the zombie processes that had been so irritating. I also found subprocess.call useful for the steps where no piping was needed; no wait() was necessary in the end. Ultimately, staying away from sed and wget is a good idea, especially given Python's built-in tools and urllib2.
p0 = subprocess.call(shlex.split("wget -Oc1 --no-cache \"http://Some.website/tofile\""))
p1 = subprocess.Popen(shlex.split("sed \"1,261d\" c1"), stdout=subprocess.PIPE)
with open("cc1", 'w') as cc1:
    p2 = subprocess.Popen(shlex.split("sed \"3,2002d\""), stdin=p1.stdout, stdout=cc1)
    p2.communicate()
    p1.communicate()
p3 = subprocess.call(shlex.split("mv cc1 c1"))


How do I execute the following shell command using the Python subprocess module?
echo "input data" | awk -f script.awk | sort > outfile.txt
The input data will come from a string, so I don't actually need echo. I've got this far, can anyone explain how I get it to pipe through sort too?
p_awk = subprocess.Popen(["awk", "-f", "script.awk"],
                         stdin=subprocess.PIPE,
                         stdout=file("outfile.txt", "w"))
p_awk.communicate( "input data" )
UPDATE: Note that while the accepted answer below doesn't actually answer the question as asked, I believe S.Lott is right and it's better to avoid having to solve that problem in the first place!
You'd be a little happier with the following.
import subprocess
awk_sort = subprocess.Popen( "awk -f script.awk | sort > outfile.txt",
                             stdin=subprocess.PIPE, shell=True )
awk_sort.communicate( b"input data\n" )
Delegate part of the work to the shell. Let it connect two processes with a pipeline.
You'd be a lot happier rewriting 'script.awk' into Python, eliminating awk and the pipeline.
Edit. Some of the reasons for suggesting that awk isn't helping.
[There are too many reasons to respond via comments.]
Awk is adding a step of no significant value. There's nothing unique about awk's processing that Python doesn't handle.
The pipelining from awk to sort may, for large sets of data, improve elapsed processing time. For short sets of data, it has no significant benefit. A quick measurement of awk >file ; sort file versus awk | sort will reveal whether the concurrency helps. With sort, it rarely helps, because sort is not a once-through filter.
The simplicity of "Python to sort" processing (instead of "Python to awk to sort") prevents the exact kind of questions being asked here.
Python -- while wordier than awk -- is also explicit where awk has certain implicit rules that are opaque to newbies, and confusing to non-specialists.
Awk (like the shell script itself) adds Yet Another Programming language. If all of this can be done in one language (Python), eliminating the shell and the awk programming eliminates two programming languages, allowing someone to focus on the value-producing parts of the task.
Bottom line: awk can't add significant value. In this case, awk is a net cost; it added enough complexity that it was necessary to ask this question. Removing awk will be a net gain.
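For example, a typical awk step such as { print $2 } (just an illustrative stand-in for script.awk) plus the sort collapses into a few lines of Python:
import sys

# Roughly what `awk '{ print $2 }' | sort` does, entirely in Python.
second_fields = [line.split()[1] for line in sys.stdin if len(line.split()) >= 2]
for value in sorted(second_fields):
    print(value)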
Sidebar: Why building a pipeline (a | b) is so hard.
When the shell is confronted with a | b it has to do the following.
Fork a child process of the original shell. This will eventually become b.
Build an OS pipe (not a Python subprocess.PIPE): call os.pipe(), which returns two new file descriptors connected via a common buffer. At this point the process has stdin, stdout, stderr from its parent, plus a file that will be "a's stdout" and "b's stdin".
Fork a child. The child replaces its stdout with the new a's stdout. Exec the a process.
Fork another child. The b child replaces its stdin with the new b's stdin. Exec the b process.
The b child waits for a to complete.
The parent is waiting for b to complete.
I think that the above can be used recursively to spawn a | b | c, but you have to implicitly parenthesize long pipelines, treating them as if they're a | (b | c).
Since Python has os.pipe(), os.fork() and the os.exec*() family, and you can replace sys.stdin and sys.stdout, there's a way to do the above in pure Python. Indeed, you may be able to work out some shortcuts using os.pipe() and subprocess.Popen.
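As a rough illustration (a sketch only, with ls and wc -l standing in for a and b), the a | b case can be wired up with those primitives like this:
import os

r, w = os.pipe()                    # shared buffer between a and b

pid_a = os.fork()
if pid_a == 0:                      # child: becomes "a" (here: ls)
    os.dup2(w, 1)                   # a's stdout is the write end
    os.close(r)
    os.close(w)
    os.execvp("ls", ["ls"])

pid_b = os.fork()
if pid_b == 0:                      # child: becomes "b" (here: wc -l)
    os.dup2(r, 0)                   # b's stdin is the read end
    os.close(r)
    os.close(w)
    os.execvp("wc", ["wc", "-l"])

os.close(r)                         # parent must close both ends,
os.close(w)                         # otherwise b never sees EOF
os.waitpid(pid_a, 0)
os.waitpid(pid_b, 0)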
However, it's easier to delegate that operation to the shell.
import subprocess
some_string = b'input_data'
sort_out = open('outfile.txt', 'wb', 0)
sort_in = subprocess.Popen('sort', stdin=subprocess.PIPE, stdout=sort_out).stdin
subprocess.Popen(['awk', '-f', 'script.awk'], stdout=sort_in,
                 stdin=subprocess.PIPE).communicate(some_string)
To emulate a shell pipeline:
from subprocess import check_call
check_call('echo "input data" | a | b > outfile.txt', shell=True)
without invoking the shell (see 17.1.4.2. Replacing shell pipeline):
#!/usr/bin/env python
from subprocess import Popen, PIPE
a = Popen(["a"], stdin=PIPE, stdout=PIPE)
with a.stdin:
    with a.stdout, open("outfile.txt", "wb") as outfile:
        b = Popen(["b"], stdin=a.stdout, stdout=outfile)
    a.stdin.write(b"input data")
statuses = [a.wait(), b.wait()] # both a.stdin/stdout are closed already
plumbum provides some syntax sugar:
#!/usr/bin/env python
from plumbum.cmd import a, b # magic
(a << "input data" | b > "outfile.txt")()
The analog of:
#!/bin/sh
echo "input data" | awk -f script.awk | sort > outfile.txt
is:
#!/usr/bin/env python
from plumbum.cmd import awk, sort
(awk["-f", "script.awk"] << "input data" | sort > "outfile.txt")()
The accepted answer sidesteps the actual question.
Here is a snippet that chains the output of multiple processes.
Note that it also prints the (somewhat) equivalent shell command, so you can run it and check that the output is correct.
#!/usr/bin/env python3
from subprocess import Popen, PIPE
# cmd1 : dd if=/dev/zero bs=1M count=100
# cmd2 : tee
# cmd3 : wc -c
cmd1 = ['dd', 'if=/dev/zero', 'bs=1M', 'count=100']
cmd2 = ['tee']
cmd3 = ['wc', '-c']
print(f"Shell style : {' '.join(cmd1)} | {' '.join(cmd2)} | {' '.join(cmd3)}")
p1 = Popen(cmd1, stdout=PIPE, stderr=PIPE) # stderr=PIPE optional, dd is chatty
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE)
p3 = Popen(cmd3, stdin=p2.stdout, stdout=PIPE)
print("Output from last process : " + (p3.communicate()[0]).decode())
# theoretically p1 and p2 may still be running; this ensures we collect their return codes
p1.wait()
p2.wait()
print("p1 return: ", p1.returncode)
print("p2 return: ", p2.returncode)
print("p3 return: ", p3.returncode)
http://www.python.org/doc/2.5.2/lib/node535.html covered this pretty well. Is there some part of this you didn't understand?
Your program would be pretty similar, but the second Popen would have stdout= to a file, and you wouldn't need the output of its .communicate().
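Adapted to this question, a sketch (not the original poster's program) might look like:
from subprocess import Popen, PIPE

with open("outfile.txt", "wb") as outfile:
    p_awk = Popen(["awk", "-f", "script.awk"], stdin=PIPE, stdout=PIPE)
    p_sort = Popen(["sort"], stdin=p_awk.stdout, stdout=outfile)
    p_awk.stdout.close()                # sort keeps the only read end
    p_awk.stdin.write(b"input data\n")  # fine for small inputs like this
    p_awk.stdin.close()
    p_awk.wait()
    p_sort.wait()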
Inspired by @Cristian's answer. I ran into the same issue, but with a different command, so I'm posting my tested example, which I believe could be helpful:
grep_proc = subprocess.Popen(["grep", "rabbitmq"],
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE)
subprocess.Popen(["ps", "aux"], stdout=grep_proc.stdin)
out, err = grep_proc.communicate()
This is tested.
What has been done:
Declared the grep command lazily, with its stdin coming from a pipe. It does its work once ps runs and fills that pipe with its stdout.
Called the primary command ps with stdout directed to the pipe used by the grep command.
Called communicate() on the grep process to collect its output from the pipe.
I like this approach because it is the natural pipe concept, gently wrapped in subprocess interfaces.
The previous answers missed an important point. Replacing shell pipeline is basically correct, as pointed out by geocar. It is almost sufficient to run communicate on the last element of the pipe.
The remaining problem is passing the input data to the pipeline. With multiple subprocesses, a simple communicate(input_data) on the last element doesn't work - it hangs forever. You need to create a pipeline and a child manually, like this:
import os
import subprocess
input = """\
input data
more input
""" * 10
rd, wr = os.pipe()
if os.fork() != 0:  # parent
    os.close(wr)
else:               # child
    os.close(rd)
    os.write(wr, input)
    os.close(wr)
    exit()

p_awk = subprocess.Popen(["awk", "{ print $2; }"],
                         stdin=rd,
                         stdout=subprocess.PIPE)
p_sort = subprocess.Popen(["sort"],
                          stdin=p_awk.stdout,
                          stdout=subprocess.PIPE)
p_awk.stdout.close()
out, err = p_sort.communicate()
print (out.rstrip())
Now the child provides the input through the pipe, and the parent calls communicate(), which works as expected. With this approach, you can create arbitrary long pipelines without resorting to "delegating part of the work to the shell". Unfortunately the subprocess documentation doesn't mention this.
There are ways to achieve the same effect without pipes:
from tempfile import TemporaryFile
tf = TemporaryFile()
tf.write(input)
tf.seek(0, 0)
Now use stdin=tf for p_awk. It's a matter of taste what you prefer.
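Wired into the same awk/sort chain, that temporary-file variant might look like this (a sketch; the input is written as bytes so it also runs on Python 3):
import subprocess
from tempfile import TemporaryFile

tf = TemporaryFile()
tf.write(b"input data\nmore input\n" * 10)
tf.seek(0, 0)

# Same chain as above, but the input now comes from the temporary file.
p_awk = subprocess.Popen(["awk", "{ print $2; }"], stdin=tf, stdout=subprocess.PIPE)
p_sort = subprocess.Popen(["sort"], stdin=p_awk.stdout, stdout=subprocess.PIPE)
p_awk.stdout.close()
out, err = p_sort.communicate()
print(out.rstrip())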
The above is still not 100% equivalent to bash pipelines because the signal handling is different. You can see this if you add another pipe element that truncates the output of sort, e.g. head -n 10. With the code above, sort will print a "Broken pipe" error message to stderr. You won't see this message when you run the same pipeline in the shell. (That's the only difference though, the result in stdout is the same). The reason seems to be that python's Popen sets SIG_IGN for SIGPIPE, whereas the shell leaves it at SIG_DFL, and sort's signal handling is different in these two cases.
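If you want the shell-like behaviour, one option (a sketch, not from the answer above) is to reset SIGPIPE in the children; Python 3's subprocess already does this by default via restore_signals=True, and on Python 2 you can do it with preexec_fn:
import signal
import subprocess

def restore_sigpipe():
    # Children inherit SIG_IGN for SIGPIPE from the Python parent; reset it
    # so "sort" exits quietly when "head" stops reading, as under a shell.
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

p_sort = subprocess.Popen(["sort"], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, preexec_fn=restore_sigpipe)
p_head = subprocess.Popen(["head", "-n", "10"], stdin=p_sort.stdout,
                          stdout=subprocess.PIPE, preexec_fn=restore_sigpipe)
p_sort.stdout.close()                       # let sort notice when head exits
p_sort.stdin.write(b"some\ninput\nlines\n")
p_sort.stdin.close()
print(p_head.communicate()[0])
p_sort.wait()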
EDIT: pipes is available on Windows but, crucially, doesn't appear to actually work there.
The Python standard library now includes the pipes module for handling this:
https://docs.python.org/2/library/pipes.html, https://docs.python.org/3.4/library/pipes.html
I'm not sure how long this module has been around, but this approach appears to be vastly simpler than mucking about with subprocess.
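A minimal sketch with pipes.Template, assuming the awk/sort pipeline from the question (note that pipes was deprecated in Python 3.11 and removed in 3.13):
import pipes

t = pipes.Template()
t.append("awk -f script.awk", "--")   # "--": this step reads stdin and writes stdout
t.append("sort", "--")

f = t.open("outfile.txt", "w")        # file-like object that feeds the pipeline
f.write("input data\n")
f.close()                             # closing runs the pipeline to completion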
For me, the below approach is the cleanest and easiest to read
from subprocess import Popen, PIPE
def string_to_2_procs_to_file(input_s, first_cmd, second_cmd, output_filename):
    with open(output_filename, 'wb') as out_f:
        p2 = Popen(second_cmd, stdin=PIPE, stdout=out_f)
        p1 = Popen(first_cmd, stdout=p2.stdin, stdin=PIPE)
        p1.communicate(input=bytes(input_s))
        p1.wait()
        p2.stdin.close()
        p2.wait()
which can be called like so:
string_to_2_procs_to_file('input data', ['awk', '-f', 'script.awk'], ['sort'], 'output.txt')

Python run bash script with flags and pipes

I have a bash script that returns the admin email for a domain, like the following.
whois -h $(whois "stackoverflow.com" | grep 'Registrar WHOIS Server:' | cut -f2- -d:) "stackoverflow.com" | grep 'Admin Email:' | cut -f2- -d:
I want to run this in a python file. I believe I need to use a subprocess but can't seem to get it working with the pipes and flags. Any help?
Yes, you can use subprocess with a pipe.
I will illustrate with an example:
ps = subprocess.Popen(('whois', 'stackoverflow.com'), stdout=subprocess.PIPE)
output = subprocess.check_output(('grep', 'Registrar WHOIS'), stdin=ps.stdout)
ps.wait()
You can adjust it to your needs.
The easiest solution is to write the commands into a script file and execute that file.
If you don't want that, you can execute any command with
bash -c 'command'
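With subprocess, that might look like the following sketch, reusing the question's pipeline verbatim (quotes dropped where the shell does not need them):
import subprocess

cmd = ("whois -h $(whois stackoverflow.com | grep 'Registrar WHOIS Server:' | cut -f2- -d:) "
       "stackoverflow.com | grep 'Admin Email:' | cut -f2- -d:")
admin_email = subprocess.check_output(["bash", "-c", cmd])
print(admin_email.decode().strip())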
This is covered in the Replacing Older Functions with the subprocess Module section of the docs.
The example there is this bash pipeline:
output=`dmesg | grep hda`
rewritten for subprocess as:
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
output = p2.communicate()[0]
Note that in many cases, you don't need to handle all of the same edge cases that the shell handles in exactly the same way. But if you don't know what you need, it's better to be fully general like this.
Your $() does the same thing as the backticks in that example, your pipes are the same as the example's pipes, and your arguments aren't anything special.
So:
whois = Popen(['whois', 'stackoverflow.com'], stdout=PIPE)
grep = Popen(['grep', 'Registrar WHOIS Server:'], stdin=whois.stdout, stdout=PIPE)
whois.stdout.close()
cut = Popen(['cut', '-f2-', '-d:'], stdin=grep.stdout, stdout=PIPE)
grep.stdout.close()
inneroutput, _ = cut.communicate()
whois = Popen(['whois', '-h', inneroutput, 'stackoverflow.com'], stdout=PIPE)
grep = Popen(['grep', 'Admin Email:'], stdin=whois.stdout, stdout=PIPE)
whois.stdout.close()
cut = Popen(['cut', '-f2-', '-d:'], stdin=grep.stdout)
grep.stdout.close()
cut.communicate()
If this seems like a mess, consider that:
Your original shell command is a mess.
If you actually know exactly what you're expecting the pipeline to do, you can skip a lot of it.
All of the stuff you're doing here could just be done directly in Python without the need for this whole mess.
You may be happier using a third-party library like plumbum.
How could you write the whole thing in Python without all this piping? For example, instead of using grep, you could use Python's re module. Or, since you're not even using a regular expression at all, just a simple in check. And likewise for cut:
whois = subprocess.run(['whois', 'stackoverflow.com'],
                       check=True, stdout=PIPE, encoding='utf-8').stdout
for line in whois.splitlines():
    if 'Registrar WHOIS Server:' in line:
        registrar = line.split(':', 1)[1].strip()
        break
inner = subprocess.run(['whois', '-h', registrar, 'stackoverflow.com'],
                       check=True, stdout=PIPE, encoding='utf-8').stdout
for line in inner.splitlines():
    if 'Admin Email:' in line:
        admin = line.split(':', 1)[1].strip()
        break
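And if you go the plumbum route mentioned above, the same pipeline might look roughly like this (a sketch, untested against real whois output):
from plumbum.cmd import whois, grep, cut

registrar = (whois["stackoverflow.com"]
             | grep["Registrar WHOIS Server:"]
             | cut["-f2-", "-d:"])().strip()

admin = (whois["-h", registrar, "stackoverflow.com"]
         | grep["Admin Email:"]
         | cut["-f2-", "-d:"])().strip()
print(admin)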

python subprocess: "write error: Broken pipe"

I have a problem piping a simple subprocess.Popen.
Code:
import subprocess
cmd = 'cat file | sort -g -k3 | head -20 | cut -f2,3'
p = subprocess.Popen(cmd,shell=True,stdout=subprocess.PIPE)
for line in p.stdout:
print(line.decode().strip())
Output for file ~1000 lines in length:
...
sort: write failed: standard output: Broken pipe
sort: write error
Output for file >241 lines in length:
...
sort: fflush failed: standard output: Broken pipe
sort: write error
Output for file <241 lines in length is fine.
I have been reading the docs and googling like mad but there is something fundamental about the subprocess module that I'm missing ... maybe to do with buffers. I've tried p.stdout.flush() and playing with the buffer size and p.wait(). I've tried to reproduce this with commands like 'sleep 20; cat moderatefile' but this seems to run without error.
From the recipes on subprocess docs:
# To replace shell pipeline like output=`dmesg | grep hda`
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]
This is because you shouldn't use "shell pipes" in the command passed to subprocess.Popen, you should use the subprocess.PIPE like this:
from subprocess import Popen, PIPE
p1 = Popen(['cat', 'file'], stdout=PIPE)
p2 = Popen(['sort', '-g', '-k', '3'], stdin=p1.stdout, stdout=PIPE)
p3 = Popen(['head', '-20'], stdin=p2.stdout, stdout=PIPE)
p4 = Popen(['cut', '-f2,3'], stdin=p3.stdout, stdout=PIPE)
final_output = p4.stdout.read()
But I have to say that what you're trying to do could be done in pure Python instead of calling a bunch of shell commands.
I have been having the same error. I even put the pipeline in a bash script and executed that instead of building the pipe in Python. From Python it would get the broken pipe error; from bash it wouldn't.
It seems to me that the last command before the head is throwing an error because its (the sort's) stdout is closed. Python must be picking up on this, whereas with the shell the error is silent. I've changed my code to consume the entire input and the error went away.
That would also explain why smaller files work: the pipe probably buffers the entire output before head exits, which would account for the breakage on larger files.
e.g., instead of a 'head -1' (in my case, I was only wanting the first line), I did an awk 'NR == 1'
There are probably better ways of doing this depending on where the 'head -X' occurs in the pipe.
You don't need shell=True. Don't invoke the shell. This is how I would do it:
p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
stdout_value = p.communicate()[0]
stdout_value # the output
See if you still face the buffering problem after using this.
Try using communicate(), rather than reading directly from stdout.
The Python docs say this:
"Warning: Use communicate() rather than .stdin.write, .stdout.read or .stderr.read to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process."
http://docs.python.org/library/subprocess.html#subprocess.Popen.stdout
p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
output = p.communicate()[0]
for line in output.splitlines():
    pass  # do stuff with each line

How do I redirect stdin/stdout when I have a sequence of commands in Bash?

I've currently got a Bash command being executed (via Python's subprocess.Popen) which is reading from stdin, doing something and outputing to stdout. Something along the lines of:
pid = subprocess.Popen( ["-c", "cmd1 | cmd2"],
                        stdin = subprocess.PIPE,
                        stdout = subprocess.PIPE,
                        shell = True )
output_data = pid.communicate( "input data\n" )
Now, what I want to do is to change that to execute another command in that same subshell that will alter the state before the next commands execute, so my shell command line will now (conceptually) be:
cmd0; cmd1 | cmd2
Is there any way to have the input sent to cmd1 instead of cmd0 in this scenario? I'm assuming the output will include cmd0's output (which will be empty) followed by cmd2's output.
cmd0 shouldn't actually read anything from stdin, does that make a difference in this situation?
I know this is probably just a dumb way of doing this, I'm trying to patch in cmd0 without altering the other code too significantly. That said, I'm open to suggestions if there's a much cleaner way to approach this.
execute cmd0 and cmd1 in a subshell and redirect /dev/null as stdin for cmd0:
(cmd0 </dev/null; cmd1) | cmd2
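Dropped into the question's Popen call, that might look like the following sketch (cmd0, cmd1 and cmd2 are placeholders from the question):
import subprocess

pid = subprocess.Popen(["bash", "-c", "(cmd0 </dev/null; cmd1) | cmd2"],
                       stdin=subprocess.PIPE,
                       stdout=subprocess.PIPE)
output_data = pid.communicate(b"input data\n")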
I don't think you should have to do anything special. If cmd0 doesn't touch stdin, it'll be intact for cmd1. Try for yourself:
ls | ( echo "foo"; sed 's/^/input: /')
(Using ls as an arbitrary command to produce a few lines of input for the pipeline)
And the additional pipe to cmd2 doesn't affect the input either, of course.
Ok, I think I may be able to duplicate the stdin file descriptor to a temporary one, close it, run cmd0, then restore it before running cmd1:
exec 0>&3; exec 0<&-; cmd0 ; exec 3>&0 ; cmd1 | cmd2
Not sure if it's possible to redirect stdin in this way though, and can't test this at the moment.
http://tldp.org/LDP/abs/html/io-redirection.html
http://tldp.org/LDP/abs/html/x17601.html

Python bash pipe

I want to pipe a Python script's output to a bash command. What I have tried so far is os.popen() and subprocess, with a pipe, for example:
os.popen('echo "P 1 1 591336 4927369 1 321 " | v.in.ascii -zn out=abcx format=standard --overwrite')
but this didn't work. The values "591336" and "4927369" are variables that come from the output of the Python script. When I run the echo command and the pipe manually in bash (even changing the values), it works.
v.in.ascii -zn out=abcx format=standard --overwrite
The above part of the command is part of GRASS GIS.
Can anyone help me?
You can just use print to output to stdout and pipe the Python process to the next process, e.g.
python myprogram.py | ...
Where myprogram.py might look like:
for x in something:
    print dosomething(x)
This works for me:
>>> stdin, stdout = os.popen2("echo %s | grep 'test'" % 'some test param')
>>> print stdout.read()
some test param
>>>
As of Python 2.6, the subprocess module is recommended instead of the deprecated os.popen. Here's an example:
from subprocess import Popen, PIPE
p = Popen(["v.in.ascii", "-zn", "out=abcx", "format=standard", "--overwrite"], stdin=PIPE)
p.stdin.write("P 1 1 591336 4927369 1 321\n")
p.stdin.close()
p.wait() # unless background execution preferred
I really like John Paulett's answer.
I think your echo example would work if you used os.system instead of os.popen.
One way to use popen here is like this:
f = os.popen("v.in.ascii -zn out=abcx format=standard --overwrite", 'w')
f.write("P 1 1 591336 4927369 1 321\n")
f.close()
(You have to specify the pipe is for writing.)
