How do I use subprocess.Popen to connect multiple processes by pipes? - python

How do I execute the following shell command using the Python subprocess module?
echo "input data" | awk -f script.awk | sort > outfile.txt
The input data will come from a string, so I don't actually need echo. I've got this far; can anyone explain how I get it to pipe through sort too?
import subprocess

p_awk = subprocess.Popen(["awk", "-f", "script.awk"],
                         stdin=subprocess.PIPE,
                         stdout=open("outfile.txt", "w"))
p_awk.communicate(b"input data")
UPDATE: Note that while the accepted answer below doesn't actually answer the question as asked, I believe S.Lott is right and it's better to avoid having to solve that problem in the first place!

You'd be a little happier with the following.
import subprocess
awk_sort = subprocess.Popen("awk -f script.awk | sort > outfile.txt",
                            stdin=subprocess.PIPE, shell=True)
awk_sort.communicate(b"input data\n")
Delegate part of the work to the shell. Let it connect two processes with a pipeline.
You'd be a lot happier rewriting 'script.awk' into Python, eliminating awk and the pipeline.
Edit. Some of the reasons for suggesting that awk isn't helping.
[There are too many reasons to respond via comments.]
Awk is adding a step of no significant value. There's nothing unique about awk's processing that Python doesn't handle.
The pipelining from awk to sort may improve elapsed processing time for large data sets. For short data sets, it has no significant benefit. A quick measurement of awk >file ; sort file versus awk | sort will reveal whether concurrency helps. With sort, it rarely does, because sort is not a once-through filter.
The simplicity of "Python to sort" processing (instead of "Python to awk to sort") prevents the exact kind of questions being asked here.
Python -- while wordier than awk -- is also explicit where awk has certain implicit rules that are opaque to newbies, and confusing to non-specialists.
Awk (like the shell script itself) adds Yet Another Programming language. If all of this can be done in one language (Python), eliminating the shell and the awk programming eliminates two programming languages, allowing someone to focus on the value-producing parts of the task.
Bottom line: awk can't add significant value. In this case, awk is a net cost; it added enough complexity that it was necessary to ask this question. Removing awk will be a net gain.
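To make the "rewrite it in Python" suggestion concrete, here is a minimal sketch, assuming (hypothetically, since the real script.awk is not shown) that script.awk just prints the second field of each line:

# A sketch only: assumes script.awk prints the second whitespace-separated field
input_data = "input data"

fields = [line.split()[1]
          for line in input_data.splitlines()
          if len(line.split()) > 1]
with open("outfile.txt", "w") as outfile:
    outfile.write("\n".join(sorted(fields)) + "\n")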
Sidebar Why building a pipeline (a | b) is so hard.
When the shell is confronted with a | b it has to do the following.
Fork a child process of the original shell. This will eventually become b.
Build an OS pipe (not a Python subprocess.PIPE): call os.pipe(), which returns two new file descriptors connected via a common buffer. At this point the process has stdin, stdout, and stderr from its parent, plus a file descriptor that will become "a's stdout" and another that will become "b's stdin".
Fork a child. The child replaces its stdout with the new a's stdout, then execs the a process.
The b child closes what it doesn't need and replaces its stdin with the new b's stdin, then execs the b process.
The b child waits for a to complete.
The parent waits for b to complete.
I think that the above can be used recursively to spawn a | b | c, but you have to implicitly parenthesize long pipelines, treating them as if they're a | (b | c).
Since Python has os.pipe(), os.fork(), and the os.exec* family, and you can replace sys.stdin and sys.stdout, there's a way to do the above in pure Python. Indeed, you may be able to work out some shortcuts using os.pipe() and subprocess.Popen.
However, it's easier to delegate that operation to the shell.
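For reference, one such shortcut, sketched with os.pipe() and two Popen calls (a and b stand in for the real commands):

import os
import subprocess

read_end, write_end = os.pipe()
a = subprocess.Popen(["a"], stdout=write_end)
b = subprocess.Popen(["b"], stdin=read_end)
# the parent must drop its copies of both descriptors,
# otherwise b never sees EOF
os.close(write_end)
os.close(read_end)
a.wait()
b.wait()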

import subprocess

some_string = b'input_data'
sort_out = open('outfile.txt', 'wb', 0)
sort_proc = subprocess.Popen('sort', stdin=subprocess.PIPE, stdout=sort_out)
subprocess.Popen(['awk', '-f', 'script.awk'], stdout=sort_proc.stdin,
                 stdin=subprocess.PIPE).communicate(some_string)
sort_proc.stdin.close()  # drop our copy of the write end so sort sees EOF
sort_proc.wait()
sort_out.close()

To emulate a shell pipeline:
from subprocess import check_call
check_call('echo "input data" | a | b > outfile.txt', shell=True)
To do it without invoking the shell (see section 17.1.4.2, "Replacing shell pipeline", in the subprocess docs):
#!/usr/bin/env python
from subprocess import Popen, PIPE

a = Popen(["a"], stdin=PIPE, stdout=PIPE)
with a.stdin:
    with a.stdout, open("outfile.txt", "wb") as outfile:
        b = Popen(["b"], stdin=a.stdout, stdout=outfile)
    a.stdin.write(b"input data")
statuses = [a.wait(), b.wait()]  # both a.stdin/stdout are closed already
plumbum provides some syntax sugar:
#!/usr/bin/env python
from plumbum.cmd import a, b # magic
(a << "input data" | b > "outfile.txt")()
The analog of:
#!/bin/sh
echo "input data" | awk -f script.awk | sort > outfile.txt
is:
#!/usr/bin/env python
from plumbum.cmd import awk, sort
(awk["-f", "script.awk"] << "input data" | sort > "outfile.txt")()

The accepted answer sidesteps the actual question.
Here is a snippet that chains the output of multiple processes.
Note that it also prints the (somewhat) equivalent shell command so you can run it and make sure the output is correct.
#!/usr/bin/env python3
from subprocess import Popen, PIPE
# cmd1 : dd if=/dev/zero bs=1M count=100
# cmd2 : tee
# cmd3 : wc -c
cmd1 = ['dd', 'if=/dev/zero', 'bs=1M', 'count=100']
cmd2 = ['tee']
cmd3 = ['wc', '-c']
print(f"Shell style : {' '.join(cmd1)} | {' '.join(cmd2)} | {' '.join(cmd3)}")
p1 = Popen(cmd1, stdout=PIPE, stderr=PIPE)  # stderr=PIPE optional; dd is chatty
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()  # let p1 receive SIGPIPE if p2 exits early
p3 = Popen(cmd3, stdin=p2.stdout, stdout=PIPE)
p2.stdout.close()  # likewise for p2
print("Output from last process : " + (p3.communicate()[0]).decode())
# theoretically p1 and p2 may still be running; this ensures we collect their return codes
p1.wait()
p2.wait()
print("p1 return: ", p1.returncode)
print("p2 return: ", p2.returncode)
print("p3 return: ", p3.returncode)

http://www.python.org/doc/2.5.2/lib/node535.html covered this pretty well. Is there some part of this you didn't understand?
Your program would be pretty similar, but the second Popen would have stdout= to a file, and you wouldn't need the output of its .communicate().
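A sketch of that suggestion applied to the original question (same commands, second Popen writing straight to the file):

from subprocess import Popen, PIPE

with open("outfile.txt", "wb") as outfile:
    p_sort = Popen(["sort"], stdin=PIPE, stdout=outfile)
    p_awk = Popen(["awk", "-f", "script.awk"], stdin=PIPE, stdout=p_sort.stdin)
    p_awk.communicate(b"input data\n")
    p_sort.stdin.close()  # sort sees EOF once awk exits and we drop our copy
    p_sort.wait()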

Inspired by @Cristian's answer. I met just the same issue, but with a different command. So I'm putting my tested example, which I believe could be helpful:
grep_proc = subprocess.Popen(["grep", "rabbitmq"],
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE)
subprocess.Popen(["ps", "aux"], stdout=grep_proc.stdin)
out, err = grep_proc.communicate()
This is tested.
What has been done:
The grep process is started first, with its stdin coming from a pipe; it blocks until the pipe is filled with the stdout of ps.
The primary command ps is then started with its stdout directed into the pipe that feeds grep.
communicate() on grep collects grep's stdout (and closes our copy of grep's stdin, so grep sees EOF once ps exits).
I like this approach because it is the natural pipe concept gently wrapped in subprocess interfaces.

The previous answers missed an important point. Replacing shell pipeline is basically correct, as pointed out by geocar. It is almost sufficient to run communicate on the last element of the pipe.
The remaining problem is passing the input data to the pipeline. With multiple subprocesses, a simple communicate(input_data) on the last element doesn't work - it hangs forever. You need to create a pipeline and a child manually, like this:
import os
import subprocess

input = b"""\
input data
more input
""" * 10

rd, wr = os.pipe()
if os.fork() != 0:  # parent
    os.close(wr)
else:  # child
    os.close(rd)
    os.write(wr, input)
    os.close(wr)
    os._exit(0)  # avoid running the parent's cleanup handlers in the child

p_awk = subprocess.Popen(["awk", "{ print $2; }"],
                         stdin=rd,
                         stdout=subprocess.PIPE)
p_sort = subprocess.Popen(["sort"],
                          stdin=p_awk.stdout,
                          stdout=subprocess.PIPE)
p_awk.stdout.close()
out, err = p_sort.communicate()
print(out.decode().rstrip())
Now the child provides the input through the pipe, and the parent calls communicate(), which works as expected. With this approach, you can create arbitrary long pipelines without resorting to "delegating part of the work to the shell". Unfortunately the subprocess documentation doesn't mention this.
There are ways to achieve the same effect without pipes:
from tempfile import TemporaryFile
tf = TemporaryFile()
tf.write(input)
tf.seek(0, 0)
Now use stdin=tf for p_awk. It's a matter of taste what you prefer.
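For instance, a sketch reusing the names above:

p_awk = subprocess.Popen(["awk", "{ print $2; }"],
                         stdin=tf,
                         stdout=subprocess.PIPE)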
The above is still not 100% equivalent to bash pipelines because the signal handling is different. You can see this if you add another pipe element that truncates the output of sort, e.g. head -n 10. With the code above, sort will print a "Broken pipe" error message to stderr. You won't see this message when you run the same pipeline in the shell. (That's the only difference, though; the result on stdout is the same.) The reason is that Python sets SIGPIPE to SIG_IGN at interpreter startup, and in Python 2 the child processes inherit that, whereas the shell leaves it at SIG_DFL; sort's signal handling differs in these two cases. (Python 3's Popen restores the default handler by default via restore_signals.)
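If that difference matters on Python 2, a possible workaround is to restore the default handler in the child via preexec_fn; a minimal sketch:

import signal
import subprocess

def restore_sigpipe():
    # undo the inherited SIG_IGN so sort dies quietly on a broken pipe,
    # as it would under the shell
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

p_sort = subprocess.Popen(["sort"], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE,
                          preexec_fn=restore_sigpipe)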

EDIT: pipes is available on Windows but, crucially, doesn't appear to actually work on Windows. See comments below.
The Python standard library includes the pipes module for handling this:
https://docs.python.org/2/library/pipes.html, https://docs.python.org/3.4/library/pipes.html
I'm not sure how long this module has been around, but this approach appears to be vastly simpler than mucking about with subprocess. (Note that pipes was deprecated in Python 3.11 and removed in 3.13, so this only applies to older versions.)
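On a Python old enough to still have it, a minimal sketch with pipes.Template, using the original question's commands:

import pipes

t = pipes.Template()
# kind '--' means the command reads stdin and writes stdout
t.append('awk -f script.awk', '--')
t.append('sort', '--')
f = t.open('outfile.txt', 'w')  # write into the pipeline, ending in outfile.txt
f.write('input data\n')
f.close()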

For me, the below approach is the cleanest and easiest to read
from subprocess import Popen, PIPE
def string_to_2_procs_to_file(input_s, first_cmd, second_cmd, output_filename):
    with open(output_filename, 'wb') as out_f:
        p2 = Popen(second_cmd, stdin=PIPE, stdout=out_f)
        p1 = Popen(first_cmd, stdout=p2.stdin, stdin=PIPE)
        p1.communicate(input=input_s.encode())
        p1.wait()
        p2.stdin.close()
        p2.wait()
which can be called like so:
string_to_2_procs_to_file('input data', ['awk', '-f', 'script.awk'], ['sort'], 'output.txt')

Related

How can I execute a shell script from stdin and get the output in realtime using python?

I want to mimic the below using python subprocess:
cat /tmp/myscript.sh | sh
The /tmp/myscript.sh contains:
ls -l
sleep 5
pwd
Behaviour: stdout shows the result of "ls" and the results of "pwd" are shown after 5 seconds.
What I have done is:
import subprocess
f = open("/tmp/myscript.sh", "rb")
p = subprocess.Popen("sh", shell=True, stdin=f,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
f.close()
p.stdout.read()
This waits until ALL the processing is done and shows the results all at once. The desired effect is to fill in the stdout pipe in realtime.
Note: This expectation may seem nonsensical, but it is a sample from a bigger, more complex situation which I cannot describe here.
Another Note: I can't use p.communicate. This whole thing is inside a select.select statement so I need stdout to be in a pipe.
The problem is that when you don't give an argument to read(), it reads until EOF, which means it waits until the subprocess exits and the pipe is closed.
If you call it with a small argument, it will return as soon as it has read that many characters:
import subprocess

f = open("/tmp/myscript.sh", "rb")
p = subprocess.Popen("sh", shell=True, stdin=f,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     encoding='utf-8')
f.close()
while True:
    c = p.stdout.read(1)
    if not c:
        break
    print(c, end='')
print()
Note that many programs buffer their output when stdout is connected to a pipe, so this might not solve the problem for everything. The shell doesn't buffer its own output, but ls probably does. Since ls produces all its output at once, though, it won't be a problem in this case.
To solve the more general problem you may need to use a pty instead of a pipe. The pexpect library is useful for this.
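A hedged sketch of the pexpect route (pexpect is a third-party package, and line-by-line iteration is only one of several ways to read from it):

import pexpect

# spawn sh running the script on a pty, so the child doesn't block-buffer
child = pexpect.spawn("sh", ["/tmp/myscript.sh"], encoding="utf-8")
for line in child:           # lines arrive as the script produces them
    print(line, end="")
child.close()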

How to call a series of bash commands in python and store output

I am trying to run the following bash script in Python and store the readlist output. readlist, which I want stored as a Python list, is a list of all files in the current directory ending in *concat_001.fastq.
I know it may be easier to do this in python, i.e.
import os
readlist = [f for f in os.listdir(os.getcwd()) if f.endswith("concat_001.fastq")]
readlist = sorted(readlist)
However, this is problematic, as I need Python to sort the list in EXACTLY the same way as bash, and I was finding that bash and Python sort certain things in different orders (e.g. Python and bash deal with capitalised and uncapitalised names differently). Even when I tried
readlist = np.asarray(sorted(flist, key=str.lower))
I still found that two files starting with ML_ and M_ were sorted in a different order by bash and Python. Hence I am trying to run my exact bash script through Python, then use the bash-sorted list in my subsequent Python code.
input_suffix="concat_001.fastq"
ender=`echo $input_suffix | sed "s/concat_001.fastq/\*concat_001.fastq/g" `
readlist="$(echo $ender)"
I have tried
proc = subprocess.call(command1, shell=True, stdout=subprocess.PIPE)
proc = subprocess.call(command2, shell=True, stdout=subprocess.PIPE)
proc = subprocess.Popen(command3, shell=True, stdout=subprocess.PIPE)
But I just get: <subprocess.Popen object at 0x7f31cfcd9190>
Also - I don't understand the difference between subprocess.call and subprocess.Popen. I have tried both.
Thanks,
Ruth
So your question is a little confusing and does not exactly explain what you want. However, I'll try to give some suggestions to help you update it, or, failing that, answer it.
I will assume the following: your python script passes 'input_suffix' on the command line, and you want your python program to receive the contents of 'readlist' when the external script finishes.
To keep our lives simple while allowing things to get more complicated later, I would put your commands in the following bash script:
script.sh
#!/bin/bash
input_suffix=$1
ender=`echo $input_suffix | sed "s/concat_001.fastq/\*concat_001.fastq/g"`
readlist="$(echo $ender)"
echo $readlist
You would execute this as script.sh "concat_001.fastq", where $1 takes in the first argument passed on the command line.
To use python to execute external scripts, as you quite rightly found, you can use subprocess (or as noted by another response, os.system - although subprocess is recommended).
The docs tell you that subprocess.call:
"Wait for command to complete, then return the returncode attribute."
and that
"For more advanced use cases when these do not meet your needs, use the underlying Popen interface."
Given you want to pipe the output from the bash script to your python script, let's use Popen as suggested by the docs. As I posted the other stackoverflow answer, it could look like the following:
import subprocess

# Execute our script and pipe the output to stdout
process = subprocess.Popen(['script.sh', 'concat_001.fastq'],
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
# Obtain the standard out, and standard error
stdout, stderr = process.communicate()
and then:
>>> print(stdout.decode())
*concat_001.fastq
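Since the goal was a Python list, a shorter hedged sketch of the same idea (assuming script.sh is executable in the current directory):

from subprocess import check_output

# capture the script's stdout and split the whitespace-separated
# file names into a Python list
readlist = check_output(['./script.sh', 'concat_001.fastq']).decode().split()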

Measuring user+system runtime of external program

As part of an evaluation, I want to measure and compare the user+system runtime of different diff-tools.
As a first approach, I thought about calling the particular tools with time -f (GNU time). Since the rest of the evaluation is done by a bunch of Python scripts, I want to realise it in Python.
The time output is formatted as follows:
<some error message>
user 0.4
sys 0.2
The output of the diff tool is redirected to sed to get rid of unneeded output, and the output of sed is then further processed. (The use of sed is dropped later in my example; see Edit 2.)
A call from within a shell would look like this (removes lines starting with "Binary"):
$ time -f "user %U\nsys %S\n" diff -r -u0 dirA dirB | sed -e '/^Binary.*/d'
Here is my approach so far:
import subprocess
diffcommand=["time","-f","user %U\nsys %S\n","diff","-r","-u0","testrepo_1/A/rev","testrepo_1/B/rev"]
sedcommand = ["sed","-e","/^Binary.*/d"]
# Execute command as subprocess
diff = subprocess.Popen(diffcommand, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
# Calculate runtime
runtime = 0.0
for line in diff.stderr.readlines():
current = line.split()
if current:
if current[0] == "user" or current[0] == "sys":
runtime = runtime + float(current[1])
print "Runtime: "+str(runtime)
# Pipe to "sed"
sedresult = subprocess.check_output(sedcommand, stdin=diff.stdout)
# Wait for the subprocesses to terminate
diff.wait()
However it feels like that this is not clean (especially from an OS point of view). It also leads to the script being stuck in the readlines part under certain circumstances I couldn't figure out yet.
Is there a cleaner (or better) way to achieve what I want?
Edit 1
Changed head line and gave a more detailed explanation
Edit 2
Thanks to J.F. Sebastian, I had a look at os.wait4(...) (information taken from his answer). But since I AM interested in the output, I had to implement it a bit differently.
My code now looks like this:
diffprocess = subprocess.Popen(diffcommand,stdout=subprocess.PIPE)
runtimes = os.wait4(diffprocess.pid,0)[2]
runtime = runtimes.ru_utime + runtimes.ru_stime
diffresult = diffprocess.communicate()[0]
Note that I do not pipe the result to sed any more (decided to trim within python)
The runtime measurement works fine for some test cases, but the execution sometimes gets stuck. Removing the runtime measurement helps the program terminate, and so does sending stdout to DEVNULL (as demanded here). Could I have a deadlock? (valgrind --tool=helgrind did not find anything.) Is there something fundamentally wrong with my approach?
but the execution gets stuck sometimes.
If you use stdout=PIPE then something should read the output while the process is still running otherwise the child process will hang if its stdout OS pipe buffer fills up (~65K on my machine).
import os
from subprocess import Popen, PIPE

p = Popen(diffcommand, stdout=PIPE, bufsize=-1)
with p.stdout:
    output = p.stdout.read()
ru = os.wait4(p.pid, 0)[2]
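The user+system runtime then comes from the rusage result, just as in the asker's Edit 2:

runtime = ru.ru_utime + ru.ru_stime  # user + system CPU seconds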

python subprocess: "write error: Broken pipe"

I have a problem piping a simple subprocess.Popen.
Code:
import subprocess
cmd = 'cat file | sort -g -k3 | head -20 | cut -f2,3'
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
for line in p.stdout:
    print(line.decode().strip())
Output for file ~1000 lines in length:
...
sort: write failed: standard output: Broken pipe
sort: write error
Output for file >241 lines in length:
...
sort: fflush failed: standard output: Broken pipe
sort: write error
Output for file <241 lines in length is fine.
I have been reading the docs and googling like mad but there is something fundamental about the subprocess module that I'm missing ... maybe to do with buffers. I've tried p.stdout.flush() and playing with the buffer size and p.wait(). I've tried to reproduce this with commands like 'sleep 20; cat moderatefile' but this seems to run without error.
From the recipes on subprocess docs:
# To replace shell pipeline like output=`dmesg | grep hda`
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]
This is because you shouldn't use "shell pipes" in the command passed to subprocess.Popen; you should use subprocess.PIPE, like this:
from subprocess import Popen, PIPE

p1 = Popen(['cat', 'file'], stdout=PIPE)
p2 = Popen(['sort', '-g', '-k', '3'], stdin=p1.stdout, stdout=PIPE)
p3 = Popen(['head', '-20'], stdin=p2.stdout, stdout=PIPE)
p4 = Popen(['cut', '-f2,3'], stdin=p3.stdout, stdout=PIPE)
final_output = p4.stdout.read()
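One refinement, borrowed from the docs recipe quoted above: closing the parent's copies of the intermediate pipe ends lets each upstream process receive SIGPIPE if a downstream one exits early, which is exactly what's at issue in this question. A sketch:

p1.stdout.close()  # allow p1 to receive SIGPIPE if p2 exits
p2.stdout.close()  # allow p2 to receive SIGPIPE if p3 exits
p3.stdout.close()  # allow p3 to receive SIGPIPE if p4 exits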
But I have to say that what you're trying to do could be done in pure Python instead of calling a bunch of shell commands.
I have been having the same error. I even put the pipe in a bash script and executed that instead of the pipe in Python. From Python it would get the broken pipe error; from bash it wouldn't.
It seems to me that the command ahead of the head is throwing an error as its (the sort's) stdout is closed. Python must be picking up on this, whereas with the shell the error is silent. I've changed my code to consume the entire input and the error went away.
It would also make sense that smaller files work, as the pipe probably buffers the entire output before head exits. That would explain why larger files break.
e.g., instead of a 'head -1' (in my case, I only wanted the first line), I did an awk 'NR == 1'.
There are probably better ways of doing this depending on where the 'head -X' occurs in the pipe.
You don't need shell=True. Don't invoke the shell. This is how I would do it (note that cmd must then be an argument list, e.g. ['sort', '-g', '-k3'], rather than a shell pipeline string):
p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
stdout_value = p.communicate()[0]
stdout_value # the output
See if you still face the buffering problem after using this.
Try using communicate() rather than reading directly from stdout.
The Python docs say this:
"Warning: Use communicate() rather than .stdin.write, .stdout.read or .stderr.read to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process."
http://docs.python.org/library/subprocess.html#subprocess.Popen.stdout
p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
output = p.communicate()[0]
for line in output.splitlines():
    pass  # do stuff with each line

Python - capture Popen stdout AND display on console?

I want to capture stdout from a long-ish running process started via subprocess.Popen(...) so I'm using stdout=PIPE as an arg.
However, because it's a long running process I also want to send the output to the console (as if I hadn't piped it) to give the user of the script an idea that it's still working.
Is this at all possible?
Cheers.
The buffering your long-running sub-process is probably performing will make your console output jerky and very bad UX. I suggest you consider instead using pexpect (or, on Windows, wexpect) to defeat such buffering and get smooth, regular output from the sub-process. For example (on just about any unix-y system, after installing pexpect):
>>> import sys, pexpect
>>> child = pexpect.spawn('/bin/bash -c "echo ba; sleep 1; echo bu"', logfile=sys.stdout); x = child.expect(pexpect.EOF); child.close()
ba
bu
>>> child.before
'ba\r\nbu\r\n'
The ba and bu will come with the proper timing (about a second between them). Note the output is not subject to normal terminal processing, so the carriage returns are left in there -- you'll need to post-process the string yourself (just a simple .replace!-) if you need \n as end-of-line markers (the lack of processing is important just in case the sub-process is writing binary data to its stdout -- this ensures all the data's left intact!-).
S. Lott's comment points to Getting realtime output using subprocess and Real-time intercepting of stdout from another process in Python
I'm curious that Alex's answer here is different from his answer 1085071.
My simple little experiments with the answers in the two other referenced questions have given good results...
I went and looked at wexpect as per Alex's answer above, but I have to say that, reading the comments in the code, I was not left with a very good feeling about using it.
I guess the meta-question here is when will pexpect/wexpect be one of the Included Batteries?
Can you simply print it as you read it from the pipe?
Inspired by the pty.openpty() suggestion somewhere above; tested on Python 2.6 and Linux. Publishing since it took a while to make this work properly, without buffering...
import os
import sys

def call_and_peek_output(cmd, shell=False):
    import pty, subprocess
    master, slave = pty.openpty()
    p = subprocess.Popen(cmd, shell=shell, stdin=None, stdout=slave, close_fds=True)
    os.close(slave)
    line = ""
    while True:
        try:
            ch = os.read(master, 1)
        except OSError:
            # We get this exception when the spawned process closes all references
            # to the pty descriptor that we passed it to use for stdout
            # (typically when it and its children exit)
            break
        line += ch
        sys.stdout.write(ch)
        if ch == '\n':
            yield line
            line = ""
    if line:
        yield line
    ret = p.wait()
    if ret:
        raise subprocess.CalledProcessError(ret, cmd)

for l in call_and_peek_output("ls /", shell=True):
    pass
Alternatively, you can pipe your process into tee and capture only one of the streams.
Something along the lines of sh -c 'process interesting stuff' | tee /dev/stderr.
Of course, this only works on Unix-like systems.
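A hedged sketch of that tee idea from Python (the command string is the answer's own placeholder):

import subprocess

# run the pipeline through the shell; tee duplicates the stream to stderr
# (visible on the console) while Python captures stdout
captured = subprocess.check_output(
    "sh -c 'process interesting stuff' | tee /dev/stderr", shell=True)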
