subprocess.call with command having embedded spaces and quotes - python

I would like to retrieve output from a shell command that contains spaces and quotes. It looks like this:
import subprocess
cmd = "docker logs nc1 2>&1 |grep mortality| awk '{print $1}'|sort|uniq"
subprocess.check_output(cmd)
This fails with "No such file or directory". What is the best/easiest way to pass commands such as these to subprocess?

By far the best solution here is to refactor the code to replace the entire tail of the pipeline with native Python code.
import subprocess
from collections import Counter

s = subprocess.run(
    ["docker", "logs", "nc1"],
    text=True, capture_output=True, check=True)

counts = Counter()
for line in s.stdout.splitlines():
    if "mortality" in line:
        counts[line.split()[0]] += 1

for word, count in counts.most_common():
    print(count, word)
There are minor differences in how Counter objects resolve ties (if two words have the same count, the one which was seen first is returned first, rather than by sort order), but I'm guessing that's unimportant here.
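For example (in CPython 3.7 and later, where dicts remember insertion order and most_common is stable, ties come back in the order the words were first seen):

from collections import Counter

c = Counter(["beta", "alpha", "beta", "alpha"])
print(c.most_common())  # [('beta', 2), ('alpha', 2)] - insertion order, not sort order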
I am also ignoring the standard error from the subprocess (your 2>&1 merged it into standard output); if you genuinely want to include output from error messages, too, just include s.stderr in the loop driver as well.
However, my hunch is that you didn't realize your code was doing that, which drives home the point nicely: mixing shell script and Python raises the maintainability burden, because now you have to understand both shell script and Python to understand the code.
(And in terms of shell script style, I would definitely get rid of the useless grep by refactoring it into the Awk script, and probably also fold in the sort | uniq which has a trivial and more efficient replacement in Awk. But here, we are replacing all of that with Python code anyway.)
If you really wanted to stick to a pipeline, then you need to add shell=True to use shell features like redirection, pipes, and quoting. Without shell=True, Python looks for a command whose file name is the entire string you were passing in, which of course doesn't exist.
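For the record, a minimal sketch of that approach (Python 3.7+ for text=True), reusing the pipeline from the question verbatim:

import subprocess

# The whole string is handed to /bin/sh, which understands 2>&1, |, and quoting.
output = subprocess.check_output(
    "docker logs nc1 2>&1 | grep mortality | awk '{print $1}' | sort | uniq",
    shell=True, text=True)
print(output)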

Related

Python os.system or subprocess calls for command line automation

I would like to be able to call some executables that take in parameters and then dump the output to a file. I've attempted to use both os.system and subprocess calls to no avail. Here is a sample of what I'd like Python to execute for me...
c:\directory\executable_program.exe -f w:\directory\input_file.txt > Z\directory\output_file.txt
Notice the absolute paths, as I will be traversing hundreds of various directories to act on files etc.
Many thanks ahead of time!
Some examples that I've tried:
subprocess.run(['c:\directory\executable_program.exe -f w:\directory\input_file.txt > Z\directory\output_file.txt']
subprocess.call(r'"c:\directory\executable_program.exe -f w:\directory\input_file.txt > Z\directory\output_file.txt"']
subprocess.call(r'"c:\directory\executable_program.exe" -f "w:\directory\input_file.txt > Z\directory\output_file.txt"']
Your attempts contain various quoting errors.
subprocess.run(r'c:\directory\executable_program.exe -f w:\directory\input_file.txt > Z\directory\output_file.txt', shell=True)
should work, where the r prefix protects the backslashes from being interpreted and removed by Python before the subprocess runs, and the absence of [...] around the value passes it verbatim to the shell (hence, shell=True).
On Windows you could get away with wrapping the whole command string in square brackets (even though it then isn't a proper argument list), and with omitting shell=True, in some circumstances; but neither behavior is portable, so don't rely on it.
If you wanted to avoid the shell, try
with open(r'Z\directory\output_file.txt', 'wb') as dest:
    subprocess.run(
        [r'c:\directory\executable_program.exe', '-f', r'w:\directory\input_file.txt'],
        stdout=dest)
which also illustrates how to properly pass a list of strings in square brackets as the first argument to subprocess.run.

How to get data from web in python using curl?

In bash, I use this script:
myscript.sh
file="/tmp/vipin/kk.txt"
curl -L "myabcurlx=10&id-11.com" > $file
cat $file
Running ./myscript.sh gives me the output below
1,2,33abc
2,54fdd,fddg3
3,fffff,gfr54
When I tried to fetch it using Python, I tried the code below:
mypython.py
command = curl + ' -L ' + 'myabcurlx=10&id-11.com'
output = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE).stdout.read().decode('ascii')
print(output)
Running python mypython.py throws an error. Can you please point out what is wrong with my code?
Error:
/bin/sh: line 1: &id=11: command not found
Wrong Parameter
command = curl + ' -L ' + 'myabcurlx=10&id-11.com'
Print out what this string is, or just think about it. Assuming that curl is the string 'curl' or '/usr/bin/curl' or something, you get:
curl -L myabcurlx=10&id-11.com
That’s obviously not the same thing you typed at the shell. Most importantly, that last argument is not quoted, and it has a & in the middle of it, which means that what you’re actually asking it to do is to run curl in the background and then run some other program that doesn’t exist, as if you’d done this:
curl -L myabcurlx=10 &
id-11.com
Obviously you could manually include quotes in the string:
command = curl + ' -L ' + '"myabcurlx=10&id-11.com"'
… but that won’t work if the string is, say, a variable rather than a literal in your source—especially if that variable might have quote characters within it.
The shlex module has helpers for quoting things properly.
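For example, a minimal sketch with shlex.quote (Python 3.3+), assuming curl holds the string 'curl':

import shlex

url = 'myabcurlx=10&id-11.com'
# shlex.quote wraps the value in quotes so the & survives the shell intact.
command = 'curl -L ' + shlex.quote(url)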
But the easiest thing to do is just not try to build a command line in the first place. You aren’t using any shell features here, so why add the extra headaches, performance costs, problems with the shell getting in the way of your output and retcode, and possible security issues for no benefit?
Make the arguments a list rather than a string:
command = [curl, '-L', 'myabcurlx=10&id-11.com']
… and leave off the shell=True
And it just works. No need to get spaces and quotes and escapes right.
Well, it still won’t work, because Popen doesn’t return output, it’s a constructor for a Popen object. But that’s a whole separate problem—which should be easy to solve if you read the docs.
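For instance, a minimal sketch with subprocess.run (Python 3.7+ for these keyword arguments), which waits for the process and captures its output for you:

import subprocess

result = subprocess.run(
    ['curl', '-L', 'myabcurlx=10&id-11.com'],
    capture_output=True, text=True, check=True)
print(result.stdout)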
But for this case, an even better solution is to use the Python bindings to libcurl instead of calling the command-line tool. Or, even better, since you’re not using any of the complicated features of curl in the first place, just use requests to make the same request. Either way, you get a response object as a Python object with useful attributes like text and headers and request.headers that you can’t get from a command line tool except by parsing its output as a giant string.
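For example, a sketch with requests; the URL in the question is a placeholder, so the endpoint and parameters here are stand-ins:

import requests

# Hypothetical endpoint and parameters; substitute your real URL.
resp = requests.get('https://example.com/data', params={'x': 10, 'id': 11})
print(resp.status_code, resp.headers['content-type'])
print(resp.text)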
import subprocess

fileName = "/tmp/vipin/kk.txt"
with open(fileName, "w") as f:
    # subprocess.run (not subprocess.read, which doesn't exist) with stdout=f
    # redirects curl's output into the file.
    subprocess.run(["curl", "-L", "myabcurlx=10&id-11.com"], stdout=f)
print(fileName)
recommended approaches:
https://docs.python.org/3.7/library/urllib.request.html#examples
http://docs.python-requests.org/en/master/user/install/

Executing awk command from python

I am trying to execute the following awk command from a python script
awk 'BEGIN {FS="\t"}; {print $1"\t"$2}' file_a > file_b
For this, I tried to use subprocess as follows:
subprocess.check_output(["awk", 'BEGIN {FS="\t"}; {print $1"\t"$2}',
file_a, ">",
file_b])
where file_a and file_b are strings pointing to the path of the files.
From this, I am getting the error
awk: cannot open > (No such file or directory)
I'm sure I'm inputting the arguments to subprocess in the wrong way, but I can't figure out what's wrong.
While it may look like it in your shell of choice, >, <, and | are not actually passed as arguments to the program you run. Rather, they're a special part of the shell that the program never gets to see.
Since they're part of the shell, and not part of the OS or program, you have to emulate their effects yourself with the normal facilities the language gives you. In your case, since you're trying to pipe to a file, simply use Python's open() as you would normally. The subprocess API supports arguments to specify stdout, stdin, and stderr, and you can supply any file object for those.
Check it out:
with open(file_b, 'wb') as f:
    subprocess.call(["awk", 'BEGIN {FS="\t"}; {print $1"\t"$2}', file_a], stdout=f)
Since subprocess.check_output redirects output already, it doesn't take the stdout argument. Using subprocess.call avoids this. If you also need the output later in the script, you can instead assign the return value of check_output to a variable, and then save that to file_b.
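For example, a minimal sketch of that variant:

# Capture awk's output in the script, then write it to file_b yourself.
output = subprocess.check_output(
    ["awk", 'BEGIN {FS="\t"}; {print $1"\t"$2}', file_a])
with open(file_b, 'wb') as f:
    f.write(output)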
If you use a lot of shell commands, you might also want to check out Plumbum, which gives you a large set of fairly silly shell-like operator overloads.
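For instance, a sketch of what that looks like in Plumbum (pip install plumbum; assumes awk is on the PATH):

from plumbum.cmd import awk

# [...] binds the arguments, > redirects stdout to a file, and the
# trailing () actually runs the command.
(awk['BEGIN {FS="\t"}; {print $1"\t"$2}', file_a] > file_b)()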

pythonic equivalent of this sed command

I have this awk/sed command
awk '{full=full$0}END{print full;}' initial.xml | sed 's|</Product>|</Product>\
|g' > final.xml
to break up an XML doc containing a large number of tags, so that the new file will have all the contents of each Product node on a single line.
I am trying to run it using os.system and the subprocess module; however, this wraps all the contents of the file into one line.
Can anyone convert it into an equivalent Python script?
Thanks!
Something like this?
from __future__ import print_function
import fileinput

for line in fileinput.input('initial.xml'):
    print(line.rstrip('\n').replace('</Product>', '</Product>\n'), end='')
I'm using the print function because the default print in Python 2.x will add a space or newline after each set of output. There are various other ways to work around that, some of which involve buffering your output before printing it.
For the record, your problem could equally well be solved in just a simple Awk script.
awk '{ gsub(/<\/Product>/, "&\n"); printf "%s", $0 }' initial.xml
Printing output as it arrives, without a trailing newline, is going to be a lot more efficient than buffering the whole file and then printing it at the end, and of course, Awk has all the necessary facilities to do the substitution as well. (gsub is not available in all dialects of Awk, though.)

running BLAST (bl2seq) without creating sequence files

I have a script that performs BLAST queries (bl2seq)
The script works like this:
Get sequence a, sequence b
write sequence a to filea
write sequence b to fileb
run command 'bl2seq -i filea -j fileb -n blastn'
get output from STDOUT, parse
repeat 20 million times
The program bl2seq does not support piping.
Is there any way to do this and avoid writing/reading to the harddrive?
I'm using Python BTW.
Depending on what OS you're running on, you may be able to use something like bash's process substitution. I'm not sure how you'd set that up in Python, but you're basically using a named pipe (or named file descriptor). That won't work if bl2seq tries to seek within the files, but it should work if it just reads them sequentially.
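For what it's worth, a sketch of how that might look from Python with os.mkfifo (POSIX only; assumes bl2seq reads -i and -j sequentially, as noted above):

import os
import subprocess
import tempfile
import threading

def feed(path, data):
    # Opening a FIFO for writing blocks until the reader opens it,
    # so each write happens in a background thread.
    with open(path, 'w') as f:
        f.write(data)

def bl2seq(seq_a, seq_b):
    tmpdir = tempfile.mkdtemp()
    fifo_a = os.path.join(tmpdir, 'a.fa')
    fifo_b = os.path.join(tmpdir, 'b.fa')
    os.mkfifo(fifo_a)
    os.mkfifo(fifo_b)
    writers = [threading.Thread(target=feed, args=(fifo_a, seq_a)),
               threading.Thread(target=feed, args=(fifo_b, seq_b))]
    for t in writers:
        t.start()
    result = subprocess.run(
        ['bl2seq', '-i', fifo_a, '-j', fifo_b, '-n', 'blastn'],
        capture_output=True, text=True)
    for t in writers:
        t.join()
    os.unlink(fifo_a)
    os.unlink(fifo_b)
    os.rmdir(tmpdir)
    return result.stdout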
How do you know bl2seq does not support piping? By the way, pipes are an OS feature, not a feature of the program. If your bl2seq program outputs something, whether to STDOUT or to a file, you should be able to parse the output. Check the help file of bl2seq for options to output to a file as well, e.g. the -o option. Then you can parse the file.
Also, since you are using Python, an alternative you can use is the BioPython module.
Is this the bl2seq program from BioPerl? If so, it doesn't look like you can do piping to it. You can, however, code your own hack using Bio::Tools::Run::AnalysisFactory::Pise, which is the recommended way of going about it. You'd have to do it in Perl, though.
If this is a different bl2seq, then disregard the message. In any case, you should probably provide some more detail.
Wow. I have it figured out.
The answer is to use Python's subprocess module and pipes!
EDIT: forgot to mention that I'm using blast2 which does support piping.
(this is part of a class)
def _query(self):
    from subprocess import Popen, PIPE, STDOUT
    pipe = Popen([BLAST,
                  '-p', 'blastn',
                  '-d', self.database,
                  '-m', '8'],
                 stdin=PIPE,
                 stdout=PIPE)
    pipe.stdin.write('%s\n' % self.sequence)
    print pipe.communicate()[0]
where self.database is a string containing the database filename, e.g. 'nt.fa'
self.sequence is a string containing the query sequence
This prints the output to the screen but you can easily just parse it. No slow disk I/O. No slow XML parsing. I'm going to write a module for this and put it on github.
Also, I haven't gotten this far yet but I think you can do multiple queries so that the blast database does not need to be read and loaded into RAM for each query.
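A sketch of what batching could look like, reusing the names above (and assuming, which I have not verified, that blast2 accepts multi-record FASTA on stdin):

from subprocess import Popen, PIPE

def query_many(sequences):
    # One blast2 process for the whole batch, so the database is
    # loaded into RAM only once.
    pipe = Popen([BLAST, '-p', 'blastn', '-d', 'nt.fa', '-m', '8'],
                 stdin=PIPE, stdout=PIPE, universal_newlines=True)
    fasta = ''.join('>query%d\n%s\n' % (i, seq)
                    for i, seq in enumerate(sequences))
    out, _ = pipe.communicate(fasta)
    return out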
I call blast2 using an R script:
....
system("mkfifo seq1")
system("mkfifo seq2")
system("echo sequence1 > seq1"), wait = FALSE)
system("echo sequence2 > seq2"), wait = FALSE)
system("blast2 -p blastp -i seq1 -j seq2 -m 8", intern = TRUE)
....
This is 2 times slower(!) than writing to and reading from the hard drive!
