Getting output of system commands that use pipes (Python)

I'm trying to generate a random string using this command:
strings /dev/urandom | grep -o '[[:alnum:]]' | head -n 30 | tr -d '\n';
Works fine, but when I try to do subprocess.call(cmd,shell=True) it just gets stuck on the strings /dev/urandom command and spams my screen with grep: writing output: Broken pipe
What's causing this and how do I fix it?

No need for subprocess, observe:
>>> import base64
>>> r = open("/dev/urandom","r")
>>> base64.encodestring(r.read(22))[:30]
'3Ttlx6TT3siM8h+zKm+Q6lH1k+dTcg'
>>> r.close()
Also, running strings and then grepping alphanumeric characters from /dev/urandom is hugely inefficient and wastes a whole lot of randomness. On my desktop PC, the above Python takes less than 10 ms to execute from bash; your strings ... one-liner takes 300-400 ms.
For a pure Python solution that also works on systems without /dev/urandom, and gives only alphanumeric characters (if you really don't want + or /):
import string
import random
''.join([random.choice(string.printable[:62]) for i in range(30)])

First of all, for what you're doing, it would be better to generate the string directly in Python.
Anyway, when using subprocess, the correct way to pipe data from one process to another is to redirect stdout (and/or stderr) to a subprocess.PIPE and feed the next process's stdin from the previous process's stdout.
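For example, here is a minimal sketch of the original pipeline built from chained Popen objects; closing our copies of the intermediate pipes matters, because that is what lets SIGPIPE reach the upstream commands once head has read its 30 lines (which is exactly what the Broken pipe messages were complaining about):
import subprocess

# Each stage reads the previous stage's stdout.
p1 = subprocess.Popen(["strings", "/dev/urandom"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["grep", "-o", "[[:alnum:]]"],
                      stdin=p1.stdout, stdout=subprocess.PIPE)
p3 = subprocess.Popen(["head", "-n", "30"],
                      stdin=p2.stdout, stdout=subprocess.PIPE)
p4 = subprocess.Popen(["tr", "-d", "\n"],
                      stdin=p3.stdout, stdout=subprocess.PIPE)

# Close the parent's copies of the intermediate pipes so each writer
# receives SIGPIPE when its reader exits early.
p1.stdout.close()
p2.stdout.close()
p3.stdout.close()

print(p4.communicate()[0].decode())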

Related

subprocess.call with command having embedded spaces and quotes

I would like to retrieve output from a shell command that contains spaces and quotes. It looks like this:
import subprocess
cmd = "docker logs nc1 2>&1 |grep mortality| awk '{print $1}'|sort|uniq"
subprocess.check_output(cmd)
This fails with "No such file or directory". What is the best/easiest way to pass commands such as these to subprocess?
The absolutely best solution here is to refactor the code to replace the entire tail of the pipeline with native Python code.
import subprocess
from collections import Counter

s = subprocess.run(
    ["docker", "logs", "nc1"],
    text=True, capture_output=True, check=True)

count = Counter()
for line in s.stdout.splitlines():
    if "mortality" in line:
        count[line.split()[0]] += 1

for word, n in count.most_common():
    print(n, word)
There are minor differences in how Counter objects resolve ties (if two words have the same count, the one which was seen first is returned first, rather than by sort order), but I'm guessing that's unimportant here.
I am also ignoring the standard error from the subprocess; if you genuinely want to include output from error messages too (as the 2>&1 in the original pipeline did), just include s.stderr in the loop driver as well.
However, my hunch is that you didn't realize your code was doing that, which drives home the point nicely: mixing shell script and Python raises the maintainability burden, because now you have to understand both shell script and Python to understand the code.
(And in terms of shell script style, I would definitely get rid of the useless grep by refactoring it into the Awk script, and probably also fold in the sort | uniq which has a trivial and more efficient replacement in Awk. But here, we are replacing all of that with Python code anyway.)
If you really wanted to stick to a pipeline, then you need to add shell=True to use shell features like redirection, pipes, and quoting. Without shell=True, Python looks for a command whose file name is the entire string you were passing in, which of course doesn't exist.
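A minimal sketch of that, reusing the exact command string from the question (assuming Python 3.7+ for text=True):
import subprocess

cmd = "docker logs nc1 2>&1 |grep mortality| awk '{print $1}'|sort|uniq"
out = subprocess.check_output(cmd, shell=True, text=True)
print(out)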

python -c vs python -<< heredoc

I am trying to run a piece of Python code from a Bash script, so I wanted to understand the difference between:
#!/bin/bash
#your bash code
python -c "
#your py code
"
vs
python - <<DOC
#your py code
DOC
I checked the web but couldn't piece together a clear picture of the topic. Do you think one is better than the other?
If you wanted to return a value from Python code block to your Bash script then is a heredoc the only way?
The main flaw of using a here document is that the script's standard input will be the here document. So if you have a script which wants to process its standard input, python -c is pretty much your only option.
On the other hand, using python -c '...' ties up the single-quote for the shell's needs, so you can only use double-quoted strings in your Python script; using double-quotes instead to protect the script from the shell introduces additional problems (strings in double-quotes undergo various substitutions, whereas single-quoted strings are literal in the shell).
As an aside, notice that you probably want to single-quote the here-doc delimiter, too, otherwise the Python script is subject to similar substitutions.
python - <<'____HERE'
print("""Look, we can have double quotes!""")
print('And single quotes! And `back ticks`!')
print("$(and what looks to the shell like process substitutions and $variables!)")
____HERE
As an alternative, escaping the delimiter works identically, if you prefer that (python - <<\____HERE)
If you are using bash, you can avoid heredoc problems if you apply a little bit more of boilerplate:
python <(cat <<EoF
name = input()
print(f'hello, {name}!')
EoF
)
This will let you run your embedded Python script without giving up the standard input. The overhead is mostly the same as using cmda | cmdb. This technique is known as Process Substitution.
If you want to be able to somehow validate the script, I suggest that you dump it to a temporary file:
#!/bin/bash
temp_file=$(mktemp my_generated_python_script.XXXXXX.py)
cat > $temp_file <<EoF
# embedded python script
EoF
python3 $temp_file && rm $temp_file
This will keep the script if it fails to run.
If you prefer to use python -c '...' without having to escape with the double-quotes you can first load the code in a bash variable using here-documents:
read -r -d '' CMD << '--END'
print ("'quoted'")
--END
python -c "$CMD"
The python code is loaded verbatim into the CMD variable and there's no need to escape double quotes.
How to use here-docs with input
tripleee's answer has all the details, but there are Unix tricks to work around this limitation:
So if you have a script which wants to process its standard input, python -c is pretty much your only option.
This trick applies to all programs that want to read from a redirected stdin (e.g., ./script.py < myinputs) and also take user input:
python - <<'____HERE'
import os
os.dup2(1, 0)
print(input("--> "))
____HERE
Running this works:
$ bash heredocpy.sh
--> Hello World!
Hello World!
If you want to get the original stdin, run os.dup(0) first. Here is a real-world example.
This works because, as long as either stdout or stderr is a tty, one can read from it as well as write to it. (Alternatively, you could just open /dev/tty; this is what less does.)
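For instance, a small sketch of the /dev/tty route, which reads from the controlling terminal no matter where stdin currently points:
import sys

# Read interactive input from the controlling terminal even though
# stdin is busy delivering the here document.
with open("/dev/tty") as tty:
    sys.stdout.write("--> ")
    sys.stdout.flush()
    answer = tty.readline().rstrip("\n")
print(answer)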
In case you want to process inputs from a file instead, that's possible too -- you just have to use a new fd:
Example with a file
cat <<'____HERE' > file.txt
With software there are only two possibilities:
either the users control the programme
or the programme controls the users.
____HERE
python - <<'____HERE' 4< file.txt
import os
for line in os.fdopen(4):
    print(line.rstrip().upper())
____HERE
Example with a command
Unfortunately, pipelines don't work here -- but process substitution does:
python - <<'____HERE' 4< <(fortune)
import os
for line in os.fdopen(4):
    print(line.rstrip().upper())
____HERE

pythonic equivalent of this sed command

I have this awk/sed command
awk '{full=full$0}END{print full;}' initial.xml | sed 's|</Product>|</Product>\
|g' > final.xml
to break an XML doc containing a large number of tags
such that the new file will have all contents of the product node in a single line
I am trying to run it using os.system and subprocess module however this is wrapping all the contents of the file into one line.
Can anyone convert it into equivalent python script?
Thanks!
Something like this?
from __future__ import print_function
import fileinput

for line in fileinput.input('initial.xml'):
    print(line.rstrip('\n').replace('</Product>', '</Product>\n'), end='')
I'm using the print function because the default print in Python 2.x will add a space or newline after each set of output. There are various other ways to work around that, some of which involve buffering your output before printing it.
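One such workaround, sketched here, is to bypass print entirely and write through sys.stdout, which never adds a separator of its own:
import sys

with open('initial.xml') as f:
    for line in f:
        sys.stdout.write(line.rstrip('\n').replace('</Product>', '</Product>\n'))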
For the record, your problem could equally well be solved in just a simple Awk script.
awk '{ gsub(/<\/Product>/, "&\n"); printf "%s", $0 }' initial.xml
Printing output as it arrives without a trailing newline is going to be a lot more efficient than buffering the whole file and then printing it at the end, and of course, Awk has all the necessary facilities to do the substitution as well. (gsub is not available in all dialects of Awk, though.)

Piping very long string through pipe with python subprocess

I would like to use python's subprocess library to deal with a string, process this string in a different program, then collect it and save it. Unfortunately, this string is very long (as in millions of characters long). So I have the following code segment set up:
cmd = ['some command']
p1 = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=STDOUT)
result = p1.communicate(input='some string')
Where 'some string' is actually millions of characters long.
And I always get this error:
OSError: [Errno 32] Broken pipe
I've tried it on shorter strings and the code works, so I'm guessing I'm maxing out the pipe buffer.
Is there any reasonable solution to this without having to resort to creating temporary files?
There are several constraints that make using subprocess the most attractive and simplest solution for me right now, which is why I'd like a solution within python and within subprocess.
A broken pipe can also mean the child process died of other causes. Invalid input or out-of-memory could be culprits. Have you tried changing the command to something like cat?
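A quick sketch of that sanity check, with cat standing in for the real command:
from subprocess import Popen, PIPE

# cat simply echoes its stdin; if a multi-million-character input survives
# this round trip, the pipe itself is fine and the real command is at fault.
p = Popen(['cat'], stdin=PIPE, stdout=PIPE)
out, _ = p.communicate(input=b'x' * 10000000)
print(len(out))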
If you are sending millions of characters through input, then something is clearly wrong with the architecture of the program. Normally, under those circumstances, the program reads its input in chunks.
Having said that, it is possible to use a file as STDIN for subprocess. This may cause the same problem for large inputs too.
Also, without Python/subprocess, how would you pass such a long input to your program?
>>> import subprocess
>>> fo = open('filewithinput')
>>> proc = subprocess.Popen(['cat'],stdin=fo,stdout=subprocess.PIPE)
>>> out,err = proc.communicate()
>>> fo.close()
>>> print out

running BLAST (bl2seq) without creating sequence files

I have a script that performs BLAST queries (bl2seq)
The script works like this:
Get sequence a, sequence b
write sequence a to filea
write sequence b to fileb
run command 'bl2seq -i filea -j fileb -n blastn'
get output from STDOUT, parse
repeat 20 million times
The program bl2seq does not support piping.
Is there any way to do this and avoid writing/reading to the harddrive?
I'm using Python BTW.
Depending on what OS you're running on, you may be able to use something like bash's process substitution. I'm not sure how you'd set that up in Python, but you're basically using a named pipe (or named file descriptor). That won't work if bl2seq tries to seek within the files, but it should work if it just reads them sequentially.
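A hedged sketch of that idea in Python, using named pipes (FIFOs) so nothing touches the disk; the bl2seq invocation mirrors the command from the question, and the blast_pair helper name is made up for illustration:
import os
import subprocess
import tempfile
import threading

def blast_pair(seq_a, seq_b):
    # Create two FIFOs in a throwaway directory.
    tmpdir = tempfile.mkdtemp()
    fifo_a = os.path.join(tmpdir, 'seq_a.fa')
    fifo_b = os.path.join(tmpdir, 'seq_b.fa')
    os.mkfifo(fifo_a)
    os.mkfifo(fifo_b)

    def feed(path, data):
        # open() blocks until bl2seq opens the FIFO for reading.
        with open(path, 'w') as f:
            f.write(data)

    threading.Thread(target=feed, args=(fifo_a, seq_a)).start()
    threading.Thread(target=feed, args=(fifo_b, seq_b)).start()

    # Only works if bl2seq reads its inputs sequentially (no seeking).
    return subprocess.check_output(
        ['bl2seq', '-i', fifo_a, '-j', fifo_b, '-n', 'blastn'],
        universal_newlines=True)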
How do you know bl2seq does not support piping? By the way, pipes are an OS feature, not a feature of the program. If your bl2seq program outputs something, whether to STDOUT or to a file, you should be able to parse the output. Check the help file of bl2seq for options to output to a file as well, e.g. the -o option. Then you can parse the file.
Also, since you are using Python, an alternative you can use is BioPython module.
Is this the bl2seq program from BioPerl? If so, it doesn't look like you can do piping to it. You can, however, code your own hack using Bio::Tools::Run::AnalysisFactory::Pise, which is the recommended way of going about it. You'd have to do it in Perl, though.
If this is a different bl2seq, then disregard the message. In any case, you should probably provide some more detail.
Wow. I have it figured out.
The answer is to use python's subprocess module and pipes!
EDIT: forgot to mention that I'm using blast2 which does support piping.
(this is part of a class)
def _query(self):
    from subprocess import Popen, PIPE, STDOUT
    pipe = Popen([BLAST,
                  '-p', 'blastn',
                  '-d', self.database,
                  '-m', '8'],
                 stdin=PIPE,
                 stdout=PIPE)
    pipe.stdin.write('%s\n' % self.sequence)
    print pipe.communicate()[0]
where self.database is a string containing the database filename, i.e. 'nt.fa', and self.sequence is a string containing the query sequence.
This prints the output to the screen but you can easily just parse it. No slow disk I/O. No slow XML parsing. I'm going to write a module for this and put it on github.
Also, I haven't gotten this far yet but I think you can do multiple queries so that the blast database does not need to be read and loaded into RAM for each query.
I call blast2 using an R script:
....
system("mkfifo seq1")
system("mkfifo seq2")
system("echo sequence1 > seq1"), wait = FALSE)
system("echo sequence2 > seq2"), wait = FALSE)
system("blast2 -p blastp -i seq1 -j seq2 -m 8", intern = TRUE)
....
This is two times slower(!) than writing to and reading from the hard drive!
