Piping a very long string through a pipe with Python subprocess

I would like to use Python's subprocess module to send a string to a different program for processing, then collect the result and save it. Unfortunately, this string is very long (as in millions of characters long). So I have the following code segment set up:
from subprocess import Popen, PIPE, STDOUT

cmd = ['some command']
p1 = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=STDOUT)
result = p1.communicate(input='some string')
Where 'some string' is actually millions of characters long.
And I always get this error:
OSError: [Errno 32] Broken pipe
I've tried it on shorter strings and the code works, so I'm guessing I'm maxing out the pipe buffer.
Is there any reasonable solution to this without having to resort to creating temporary files?
There are several constraints that make using subprocess the most attractive and simplest solution for me right now, which is why I'd like a solution within python and within subprocess.

Broken pipe can also mean the child process died of other causes; invalid input or running out of memory could be the culprit. Have you tried changing the command to something like cat?
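For instance, here is a minimal sanity check along those lines, assuming cat is available; if a few million characters go through cat fine but your real command still breaks the pipe, the child is most likely dying on its own (bad input, out of memory) rather than hitting a pipe limit:
from subprocess import Popen, PIPE, STDOUT

# Push a few million characters through `cat` as a stand-in for the real command.
big_string = 'x' * 5000000
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, stderr=STDOUT)
out, _ = p.communicate(input=big_string)
print len(out)       # 5000000 if everything made it through
print p.returncode   # 0 if cat exited cleanly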

If you are sending millions of characters through stdin, something is probably wrong with the architecture of the program; normally a program reads input of that size in chunks.
Having said that, it is possible to use a file as STDIN for the subprocess, although this may run into the same problem for very large inputs.
Also, without Python/subprocess, how do you pass such a long input to your program?
>>> import subprocess
>>> fo = open('filewithinput')
>>> proc = subprocess.Popen(['cat'],stdin=fo,stdout=subprocess.PIPE)
>>> out,err = proc.communicate()
>>> fo.close()
>>> print out

Related

Python2: Writing to stdin of interactive process, using Popen.communicate(), without trailing newline

I am trying to write what I thought would be a simple utility script to call a different command, but Popen.communicate() seems to append a newline. I imagine this is to terminate input, and it works with a basic script that takes an input and prints it out, but it's causing problems when the other program is interactive (such as e.g. bc).
Minimal code to reproduce, using bc in lieu of the other program (since both are interactive, getting it to work with bc should solve the problem):
#!/usr/bin/env python
from subprocess import Popen, PIPE
command = "bc"
p = Popen(command, stdin=PIPE, stdout=PIPE, stderr=PIPE)
stdout_data = p.communicate(input="2+2")
print(stdout_data)
This prints ('', '(standard_in) 1: syntax error\n'), presumably caused by the appended newline character, as piping the same string to bc in a shell, echo "2+2" | bc, prints 4 just fine.
Is it possible to use Popen.communicate() without appending the newline, or would I need to use a different method?
I guess I'm an idiot, because the solution was the opposite of what I thought: adding a newline to the input, stdout_data = p.communicate(input="2+2\n"), makes the script print ('4\n', '') as it should, rather than giving an error.
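For reference, a minimal corrected version of the script above (same setup as before, just with the trailing newline added):
#!/usr/bin/env python
from subprocess import Popen, PIPE

command = "bc"
p = Popen(command, stdin=PIPE, stdout=PIPE, stderr=PIPE)
# bc reads line-oriented input, so the expression must end with a newline.
stdout_data = p.communicate(input="2+2\n")
print(stdout_data)   # ('4\n', '')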

Parsing the output of a subprocess while executing and clearing the memory (Python 2.7)

I need to parse the output produced by an external program (third party, I have no control over it) which produces large amounts of data. Since the size of the output greatly exceeds the available memory, I would like to parse the output while the process is running and remove the already-processed data from memory.
So far I do something like this:
import subprocess
p_pre = subprocess.Popen("preprocessor", stdout=subprocess.PIPE)
# preprocessor is an external bash script that produces the input for the third-party software
p_3party = subprocess.Popen("thirdparty", stdin=p_pre.stdout, stdout=subprocess.PIPE)
(data_to_parse, can_be_thrown) = p_3party.communicate()
parsed_data = myparser(data_to_parse)
When "thirdparty" output is small enough, this approach works. But as stated in the Python documentation:
The data read is buffered in memory, so do not use this method if the data size is large or unlimited.
I think a better approach (one that could actually save me some time) would be to start processing data_to_parse while it is being produced, and when the parsing has been done correctly, "clear" data_to_parse by removing the data that have already been parsed.
I have also tried to use a for loop like:
parsed_data = []
for i in p_3party.stdout:
    parsed_data.append(myparser(i))
but it gets stuck, and I can't understand why.
So I would like to know: what is the best approach to accomplish this? What are the issues to be aware of?
You can use subprocess.Popen() to create a stream from which you read lines.
import subprocess

stream = subprocess.Popen(cmd, stdout=subprocess.PIPE).stdout
for line in stream:
    # parse lines as you receive them
    print line
You could pass the lines to your myparser() function, or append them to a list until you are ready to use them -- whatever works for you.
In your case, using two sub-processes, it would work something like this:
import subprocess

def method(stream, retries=3):
    while retries > 0:
        line = stream.readline()
        if line:
            yield line
        else:
            retries -= 1

pre_stream = subprocess.Popen(cmd, stdout=subprocess.PIPE).stdout
stream = subprocess.Popen(cmd, stdin=pre_stream, stdout=subprocess.PIPE).stdout

parsed_data = []
for parsed in method(stream):
    # do what you want with the parsed data
    parsed_data.append(parsed)
Iterating over a file as in for i in p_3party.stdout: uses a read-ahead buffer. The readline() method may be more reliable with a pipe -- AFAIK it reads character by character.
while True:
    line = p_3party.stdout.readline()
    if not line:
        break
    parsed_data.append(myparser(line))

Need to get output as lists rather than strings in Popen or any other system commands

I am trying to run a command from a Python script (using Popen()) and get the output as a list instead of a string.
For example, when I use Popen(), it gives the output as a string. For commands like vgs, vgdisplay, pvs, and pvdisplay, I need the output as lists so I can parse it by row and column and take the necessary action (like deleting already-existing VGs, etc.). I was just wondering whether it is possible to get the output as lists, or at least convert it into lists.
I started learning Python a week ago, so I might have missed some simple tricks; please pardon me.
Just to elaborate on the existing comments
from subprocess import PIPE
import subprocess

pro = subprocess.Popen("ifconfig", stdout=PIPE, stderr=PIPE)
data = pro.communicate()[0].splitlines()
for line in data:
    print "THIS IS A LINE"
    print line
    print "**************"

Getting output of system commands that use pipes (Python)

I'm trying to generate a random string using this command:
strings /dev/urandom | grep -o '[[:alnum:]]' | head -n 30 | tr -d '\n';
Works fine, but when I try to do subprocess.call(cmd,shell=True) it just gets stuck on the strings /dev/urandom command and spams my screen with grep: writing output: Broken pipe
What's causing this and how do I fix it?
No need for subprocess, observe:
>>> import base64
>>> r = open("/dev/urandom","r")
>>> base64.encodestring(r.read(22))[:30]
'3Ttlx6TT3siM8h+zKm+Q6lH1k+dTcg'
>>> r.close()
Also, running strings and then grepping alphanumeric characters from /dev/urandom is hugely inefficient and wastes a whole lot of randomness. On my desktop PC, the Python above takes less than 10 ms to execute from bash; your strings ... one-liner takes 300-400 ms.
For a pure Python solution that also works on systems without /dev/urandom - and gives only alphanumeric characters (if you really don't want + or /):
import string
import random

# string.printable[:62] is digits + lowercase + uppercase, i.e. the 62 alphanumeric characters
''.join([random.choice(string.printable[:62]) for i in range(30)])
First of all, for what you're doing, it would be better to generate the string in Python directly.
Anyway, when using subprocess, the correct way to pipe data from one process to another is to redirect stdout (and/or stderr) to subprocess.PIPE and feed the next process's stdin with the previous process's stdout.
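A rough sketch of that approach for the pipeline above (flags copied verbatim from the shell command; on Python 2 the children inherit the interpreter's ignored SIGPIPE, so the default handler is restored via preexec_fn to avoid the "writing output: Broken pipe" noise):
import signal
from subprocess import Popen, PIPE

def restore_sigpipe():
    # Let each child die quietly on SIGPIPE instead of reporting write errors.
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

# strings /dev/urandom | grep -o '[[:alnum:]]' | head -n 30 | tr -d '\n'
p1 = Popen(["strings", "/dev/urandom"], stdout=PIPE, preexec_fn=restore_sigpipe)
p2 = Popen(["grep", "-o", "[[:alnum:]]"], stdin=p1.stdout, stdout=PIPE,
           preexec_fn=restore_sigpipe)
p3 = Popen(["head", "-n", "30"], stdin=p2.stdout, stdout=PIPE,
           preexec_fn=restore_sigpipe)
p4 = Popen(["tr", "-d", "\n"], stdin=p3.stdout, stdout=PIPE,
           preexec_fn=restore_sigpipe)

# Close our copies of the intermediate pipe ends so EOF/SIGPIPE propagates
# once head has read its 30 lines.
p1.stdout.close()
p2.stdout.close()
p3.stdout.close()

random_string, _ = p4.communicate()
print random_string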

running BLAST (bl2seq) without creating sequence files

I have a script that performs BLAST queries (bl2seq)
The script works like this:
Get sequence a, sequence b
write sequence a to filea
write sequence b to fileb
run command 'bl2seq -i filea -j fileb -n blastn'
get output from STDOUT, parse
repeat 20 million times
The program bl2seq does not support piping.
Is there any way to do this and avoid writing/reading to the harddrive?
I'm using Python BTW.
Depending on what OS you're running on, you may be able to use something like bash's process substitution. I'm not sure how you'd set that up in Python, but you're basically using a named pipe (or named file descriptor). That won't work if bl2seq tries to seek within the files, but it should work if it just reads them sequentially.
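A rough sketch of that named-pipe idea in Python (assumptions: a POSIX system, bl2seq reading both inputs sequentially, the command-line flags copied from the question, and seq_a/seq_b as placeholder sequence strings you already have in memory):
import os
import tempfile
from subprocess import Popen, PIPE

seq_a = 'ACGTACGTACGT'   # placeholder sequences; in practice these come from your data
seq_b = 'TTGACCAATTGA'

tmpdir = tempfile.mkdtemp()
fifo_a = os.path.join(tmpdir, 'filea')
fifo_b = os.path.join(tmpdir, 'fileb')
os.mkfifo(fifo_a)   # named pipes: they look like files but never touch the disk
os.mkfifo(fifo_b)

# Start bl2seq first; opening a FIFO for writing blocks until a reader opens it.
proc = Popen(['bl2seq', '-i', fifo_a, '-j', fifo_b, '-n', 'blastn'], stdout=PIPE)

with open(fifo_a, 'w') as fa:
    fa.write(seq_a)
with open(fifo_b, 'w') as fb:
    fb.write(seq_b)

out, _ = proc.communicate()
# parse `out` here, then clean up the FIFOs
os.unlink(fifo_a)
os.unlink(fifo_b)
os.rmdir(tmpdir)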
How do you know bl2seq does not support piping? By the way, pipes are an OS feature, not a feature of the program. If your bl2seq program outputs something, whether to STDOUT or to a file, you should be able to parse that output. Check the bl2seq help for options to write to a file as well, e.g. the -o option. Then you can parse the file.
Also, since you are using Python, an alternative is the BioPython module.
Is this the bl2seq program from BioPerl? If so, it doesn't look like you can do piping to it. You can, however, code your own hack using Bio::Tools::Run::AnalysisFactory::Pise, which is the recommended way of going about it. You'd have to do it in Perl, though.
If this is a different bl2seq, then disregard the message. In any case, you should probably provide some more detail.
Wow. I have it figured out.
The answer is to use Python's subprocess module and pipes!
EDIT: I forgot to mention that I'm using blast2, which does support piping.
(This is part of a class.)
def _query(self):
    from subprocess import Popen, PIPE, STDOUT
    pipe = Popen([BLAST,
                  '-p', 'blastn',
                  '-d', self.database,
                  '-m', '8'],
                 stdin=PIPE,
                 stdout=PIPE)
    pipe.stdin.write('%s\n' % self.sequence)
    print pipe.communicate()[0]
where self.database is a string containing the database filename (i.e. 'nt.fa') and self.sequence is a string containing the query sequence.
This prints the output to the screen, but you can easily parse it instead. No slow disk I/O. No slow XML parsing. I'm going to write a module for this and put it on GitHub.
Also, I haven't gotten this far yet but I think you can do multiple queries so that the blast database does not need to be read and loaded into RAM for each query.
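A sketch of what those multiple queries might look like, reusing the same Popen call as above (hypothetical: it assumes blast2 accepts several FASTA records on stdin and writes a block of -m 8 tabular lines per query; BLAST, the database name, and the sequences are placeholders, as in the class above):
from subprocess import Popen, PIPE

BLAST = 'blast2'   # path to the blast2 binary, as in the class above
pipe = Popen([BLAST, '-p', 'blastn', '-d', 'nt.fa', '-m', '8'],
             stdin=PIPE, stdout=PIPE)

# Hypothetical: write every query as a FASTA record, then close stdin
# so blast2 knows there is no more input.
queries = {'query1': 'ACGTACGT', 'query2': 'TTGACCAA'}   # placeholder sequences
for name, seq in queries.items():
    pipe.stdin.write('>%s\n%s\n' % (name, seq))
pipe.stdin.close()

for line in pipe.stdout:
    fields = line.rstrip('\n').split('\t')   # -m 8 is tab-separated tabular output
    print fields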
I call blast2 using an R script:
....
system("mkfifo seq1")
system("mkfifo seq2")
system("echo sequence1 > seq1"), wait = FALSE)
system("echo sequence2 > seq2"), wait = FALSE)
system("blast2 -p blastp -i seq1 -j seq2 -m 8", intern = TRUE)
....
This is 2 times slower(!) than writing to and reading from the hard drive!
