I have a script that performs BLAST queries (bl2seq)
The script works like this:
Get sequence a, sequence b
write sequence a to filea
write sequence b to fileb
run command 'bl2seq -i filea -j fileb -n blastn'
get output from STDOUT, parse
repeat 20 million times
The program bl2seq does not support piping.
Is there any way to do this while avoiding writing to and reading from the hard drive?
I'm using Python BTW.
Depending on what OS you're running on, you may be able to use something like bash's process substitution. I'm not sure how you'd set that up in Python, but you're basically using a named pipe (or named file descriptor). That won't work if bl2seq tries to seek within the files, but it should work if it just reads them sequentially.
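If you want to experiment with that from Python, here is a rough sketch using named pipes created with os.mkfifo (POSIX only). The exact bl2seq invocation, including the -p blastn flag, is an assumption you would adapt to your build, and it only works if bl2seq reads both inputs sequentially:

import os
import subprocess
import tempfile
import threading

def run_bl2seq(seq_a, seq_b):
    # Create two FIFOs; their names live on the filesystem, but the sequence
    # data itself goes through kernel buffers rather than the disk.
    tmpdir = tempfile.mkdtemp()
    fifo_a = os.path.join(tmpdir, "seq_a.fa")
    fifo_b = os.path.join(tmpdir, "seq_b.fa")
    os.mkfifo(fifo_a)
    os.mkfifo(fifo_b)

    def feed(path, data):
        # open() blocks until bl2seq opens the FIFO for reading
        with open(path, "w") as fh:
            fh.write(data)

    # daemon threads so a failed bl2seq start can't hang the interpreter
    threading.Thread(target=feed, args=(fifo_a, seq_a), daemon=True).start()
    threading.Thread(target=feed, args=(fifo_b, seq_b), daemon=True).start()

    # '-p blastn' is the legacy bl2seq way to select the program; adjust as needed.
    result = subprocess.run(
        ["bl2seq", "-i", fifo_a, "-j", fifo_b, "-p", "blastn"],
        capture_output=True, text=True)

    for path in (fifo_a, fifo_b):
        os.remove(path)
    os.rmdir(tmpdir)
    return result.stdout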
How do you know bl2seq does not support piping? By the way, pipes are an OS feature, not a feature of the program. If your bl2seq program outputs something, whether to STDOUT or to a file, you should be able to parse that output. Check the help for bl2seq for an option to write output to a file (e.g. -o), and then parse that file.
Also, since you are using Python, an alternative is the Biopython module.
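For example, a minimal Biopython sketch might look like the following. It assumes a BLAST+ blastn binary on the PATH and a Biopython release that still ships the Bio.Blast.Applications wrappers (newer versions deprecate them); the file names are placeholders:

from Bio.Blast.Applications import NcbiblastnCommandline

# Wraps the BLAST+ 'blastn' binary; -query/-subject compare two sequences
# much like bl2seq did, and outfmt=6 requests the tab-separated table.
cline = NcbiblastnCommandline(query="filea.fa", subject="fileb.fa", outfmt=6)
stdout, stderr = cline()   # runs the command and captures its output
print(stdout)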
Is this the bl2seq program from BioPerl? If so, it doesn't look like you can do piping to it. You can, however, code your own hack using Bio::Tools::Run::AnalysisFactory::Pise, which is the recommended way of going about it. You'd have to do it in Perl, though.
If this is a different bl2seq, then disregard the message. In any case, you should probably provide some more detail.
Wow. I have it figured out.
The answer is to use Python's subprocess module and pipes!
EDIT: forgot to mention that I'm using blast2 which does support piping.
(this is part of a class)
def _query(self):
    from subprocess import Popen, PIPE

    # BLAST holds the path to the blast2 executable
    pipe = Popen([BLAST,
                  '-p', 'blastn',
                  '-d', self.database,
                  '-m', '8'],
                 stdin=PIPE,
                 stdout=PIPE)

    # feed the query sequence on stdin, then read the tabular output back
    pipe.stdin.write('%s\n' % self.sequence)
    print pipe.communicate()[0]
where self.database is a string containing the database filename, e.g. 'nt.fa'
self.sequence is a string containing the query sequence
This prints the output to the screen, but you can just as easily parse it instead (see the sketch below). No slow disk I/O. No slow XML parsing. I'm going to write a module for this and put it on GitHub.
Also, I haven't gotten this far yet but I think you can do multiple queries so that the blast database does not need to be read and loaded into RAM for each query.
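For what it's worth, the -m 8 tabular output is easy to parse without any XML. A rough sketch, assuming the standard 12 tab-separated columns:

def parse_tabular(output):
    # Each -m 8 line has 12 tab-separated fields:
    # query, subject, %identity, aln_length, mismatches, gap_opens,
    # q_start, q_end, s_start, s_end, e_value, bit_score
    hits = []
    for line in output.splitlines():
        if not line.strip() or line.startswith('#'):
            continue
        fields = line.split('\t')
        hits.append({
            'query':    fields[0],
            'subject':  fields[1],
            'identity': float(fields[2]),
            'evalue':   float(fields[10]),
            'bitscore': float(fields[11]),
        })
    return hits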
I call blast2 from an R script:
....
system("mkfifo seq1")
system("mkfifo seq2")
system("echo sequence1 > seq1"), wait = FALSE)
system("echo sequence2 > seq2"), wait = FALSE)
system("blast2 -p blastp -i seq1 -j seq2 -m 8", intern = TRUE)
....
This is two times slower(!) than writing to and reading from the hard drive!
Related
I would like to retrieve output from a shell command that contains spaces and quotes. It looks like this:
import subprocess
cmd = "docker logs nc1 2>&1 |grep mortality| awk '{print $1}'|sort|uniq"
subprocess.check_output(cmd)
This fails with "No such file or directory". What is the best/easiest way to pass commands such as these to subprocess?
The absolute best solution here is to refactor the code to replace the entire tail of the pipeline with native Python code.
import subprocess
from collections import Counter

s = subprocess.run(
    ["docker", "logs", "nc1"],
    text=True, capture_output=True, check=True)

counts = Counter()
for line in s.stdout.splitlines():
    if "mortality" in line:
        counts[line.split()[0]] += 1

for word, count in counts.most_common():
    print(count, word)
There are minor differences in how Counter objects resolve ties (if two words have the same count, the one which was seen first is returned first, rather than by sort order), but I'm guessing that's unimportant here.
I am also ignoring the standard error from the subprocess; if you genuinely want to include output from error messages too, just include s.stderr in the loop driver as well.
However, my hunch is that you didn't realize your code was doing that, which drives home the point nicely: mixing shell script and Python raises the maintainability burden, because now you have to understand both shell script and Python to understand the code.
(And in terms of shell script style, I would definitely get rid of the useless grep by refactoring it into the Awk script, and probably also fold in the sort | uniq which has a trivial and more efficient replacement in Awk. But here, we are replacing all of that with Python code anyway.)
If you really want to stick with a pipeline, then you need to add shell=True to use shell features like redirection, pipes, and quoting. Without shell=True, Python looks for a command whose file name is the entire string you passed in, which of course doesn't exist.
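If you do keep the pipeline, a minimal sketch with the command string unchanged from the question:

import subprocess

cmd = "docker logs nc1 2>&1 | grep mortality | awk '{print $1}' | sort | uniq"
# shell=True hands the whole string to /bin/sh, which understands the pipes,
# the redirection, and the single quotes around the awk program.
output = subprocess.check_output(cmd, shell=True, text=True)
print(output)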
I'm trying to read the duration of video files using mediainfo. This shell command works
mediainfo --Inform="Video;%Duration/String3%" file
and produces an output like
00:00:33.600
But when I try to run it in python with this line
subprocess.check_output(['mediainfo', '--Inform="Video;%Duration/String3%"', file])
the whole --Inform thing is ignored and I get the full mediainfo output instead.
Is there a way to see the command constructed by subprocess to see what's wrong?
Or can anybody just tell what's wrong?
Try:
subprocess.check_output(['mediainfo', '--Inform=Video;%Duration/String3%', file])
The " in your python string are likely passed on to mediainfo, which can't parse them and will ignore the option.
These kinds of problems are often caused by shell commands requiring, or swallowing, various special characters. Quotes such as " are removed by bash before the program ever sees them. Python does no such processing, so it passes them along exactly as you wrote them. Why would you write them if you didn't need them? (Well, d'uh, because bash makes you believe you need them.)
For example, in bash I can do
$ dd of="foobar"
and it will write to a file named foobar, swallowing the quotes.
In python, if I do
subprocess.check_output(["dd", 'of="barfoo"', 'if=foobar'])
it will write to a file named "barfoo", keeping the quotes.
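As for seeing what subprocess actually runs: the argument list you pass is exactly the argv the child process receives, so a quick throwaway check is to run /bin/echo, which prints its arguments back verbatim:

import subprocess

# Any stray quotes in the argument become visible in the output.
print(subprocess.check_output(
    ["/bin/echo", '--Inform="Video;%Duration/String3%"'], text=True))
# prints: --Inform="Video;%Duration/String3%"   <- the quotes survived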
I'm a Windows 7 user.
I happened to read about redirections (like command1 < infile > outfile) in *nix systems, and then I discovered that something similar can be done in Windows (link). Python can also do something like this with pipes(?) or stdin/stdout(?).
I do not understand how this happens in Windows, so I have a question.
I use some kind of proprietary windows-program (.exe). This program is able to append data to a file.
For simplicity, let's assume that it is the equivalent of something like
from time import ctime, sleep

while True:
    f = open('textfile.txt', 'a')
    f.write(repr(ctime()) + '\n')
    f.close()
    sleep(100)
The question:
Can I use this file (textfile.txt) as stdin?
I mean that the script (while it runs) should continuously (not just once) handle all new data, i.e.
In the "never-ending cycle":
The program (.exe) writes something.
Python script captures the data and processes.
Could you please write how to do this in python, or maybe in win cmd/.bat or somehow else.
This is insanely cool thing. I want to learn how to do it! :D
If I am reading your question correctly then you want to pipe output from one command to another.
This is normally done as such:
cmd1 | cmd2
However, you say that your program only writes to files. I would double-check the documentation to see if there isn't a way to get the command to write to stdout instead of a file.
If this is not possible then you can create what is known as a named pipe. It appears as a file on your filesystem, but it is really just a buffer of data that can be written to and read from (the data is a stream and can only be read once), meaning the program reading it will not finish until the program writing to the pipe stops writing and closes the "file". I don't have experience with named pipes on Windows, so you'll need to ask a new question for that. One downside of pipes is that they have a limited buffer size: if no program is reading from the pipe, then once the buffer is full the writing program can't continue and just waits indefinitely until something starts reading from the pipe.
An alternative is that on Unix there is a program called tail which can be set up to continuously monitor a file for changes and output any data as it is appended to the file (with a short delay).
tail --follow=name --retry textfile.txt | mycmd
# wait for data to be appended to the file and output new data to mycmd
cmd1 >> textfile.txt # append output to file
One thing to note about this is that tail won't stop just because the first command has stopped writing to the file. tail will continue to listen to changes on that file forever or until mycmd stops listening to tail, or until tail is killed (or "sigint-ed").
This question has various answers on how to get a version of tail onto a windows machine.
import sys
sys.stdin = open('textfile.txt', 'r')
for line in sys.stdin:
    process(line)
If the program writes to textfile.txt, you can't change that to redirect to stdin of your Python script unless you recompile the program to do so.
If you were to edit the program, you'd need to make it write to stdout, rather than a file on the filesystem. That way you can use the redirection operators to feed it into your Python script (in your case the | operator).
Assuming you can't do that, you could write a program that polls for changes on the text file, and consumes only the newly written data, by keeping track of how much it read the last time it was updated.
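A rough sketch of that polling approach (the file name is taken from the question and the poll interval is arbitrary):

import time

def follow(path, poll_interval=1.0):
    # Remember where we stopped reading and yield only newly appended lines.
    with open(path, 'r') as f:
        f.seek(0, 2)                 # jump to the current end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(poll_interval)

for line in follow('textfile.txt'):
    print(line)                      # replace with your own processing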
When you use < to direct the contents of a file to a Python script, that script receives the data on its stdin stream.
Simply read from sys.stdin to get that data:
import sys
for line in sys.stdin:
    pass  # do something with each line here
I'm trying to generate a random string using this command:
strings /dev/urandom | grep -o '[[:alnum:]]' | head -n 30 | tr -d '\n';
Works fine, but when I try to do subprocess.call(cmd,shell=True) it just gets stuck on the strings /dev/urandom command and spams my screen with grep: writing output: Broken pipe
What's causing this and how do I fix it?
No need for subprocess, observe:
>>> import base64
>>> r = open("/dev/urandom","r")
>>> base64.encodestring(r.read(22))[:30]
'3Ttlx6TT3siM8h+zKm+Q6lH1k+dTcg'
>>> r.close()
Also, stringsing and then greping alphanumeric characters from /dev/urandom is hugely inefficient and wastes a whole lot of randomness. On my desktop PC, the Python above takes less than 10 ms to execute from bash; your strings ... one-liner takes 300-400 ms.
For a pure python solution that works also on systems without /dev/urandom - and gives only alphanumeric characters (if you really don't want + or /):
import string
import random

# string.printable starts with the 62 alphanumeric characters
# (digits, lowercase, uppercase), so the slice keeps exactly those
''.join([random.choice(string.printable[:62]) for i in range(30)])
First of all, for what you're doing, it would be better to generate the string directly in Python.
Anyway, when using subprocess, the correct way to pipe data from one process to another is to redirect stdout and/or stderr to subprocess.PIPE and feed the next process' stdin from the previous process' stdout.
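A rough sketch of chaining the original pipeline that way, using the same commands as in the question:

import subprocess

# strings /dev/urandom | grep -o '[[:alnum:]]' | head -n 30 | tr -d '\n'
p1 = subprocess.Popen(["strings", "/dev/urandom"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["grep", "-o", "[[:alnum:]]"],
                      stdin=p1.stdout, stdout=subprocess.PIPE)
p3 = subprocess.Popen(["head", "-n", "30"],
                      stdin=p2.stdout, stdout=subprocess.PIPE)
p4 = subprocess.Popen(["tr", "-d", "\n"],
                      stdin=p3.stdout, stdout=subprocess.PIPE)

# Close our copies of the intermediate pipes so each upstream command
# receives SIGPIPE (instead of hanging) once its reader exits.
p1.stdout.close()
p2.stdout.close()
p3.stdout.close()

print(p4.communicate()[0].decode())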
I'm moving a website to Hostmonster and asked where the server log is located so I can automatically scan it for CGI errors. I was told, "We're sorry, but we do not have cgi errors go to any files that you have access to."
For organizational reasons I'm stuck with Hostmonster and this awful policy, so as a workaround I thought maybe I'd modify the CGI scripts to redirect STDERR to a custom log file.
I have a lot of scripts (269) so I need an easy way in both Python and Perl to redirect STDERR to a custom log file.
Something that accounts for file locking either explicitly or implicitly would be great, since a shared CGI error log file could theoretically be written to by more than one script at once if more than one script fails at the same time.
(I want to use a shared error log so I can email its contents to myself nightly and then archive or delete it.)
I know I may have to modify each file (grrr), that's why I'm looking for something elegant that will only be a few lines of code. Thanks.
For Perl, just close and re-open STDERR to point to a file of your choice.
close STDERR;
open STDERR, '>>', '/path/to/your/log.txt'
or die "Couldn't redirect STDERR: $!";
warn "this will go to log.txt";
Alternatively, you could look into a filehandle multiplexer like File::Tee.
Python: cgitb. At the top of your script, before other imports:
import cgitb
cgitb.enable(False, '/home/me/www/myapp/logs/errors')
(‘errors’ being a directory the web server user has write-access to.)
In Perl try CGI::Carp
BEGIN {
    use CGI::Carp qw(carpout);
    use diagnostics;
    open(LOG, ">errors.txt");
    carpout(LOG);
    close(LOG);
}
use CGI::Carp qw(fatalsToBrowser);
The solution I finally went with was similar to the following, near the top of all my scripts:
Perl:
open(STDERR,">>","/path/to/my/cgi-error.log")
or die "Could not redirect STDERR: $OS_ERROR";
Python:
sys.stderr = open("/path/to/my/cgi-error.log", "a")
Apparently in Perl you don't need to close the STDERR handle before reopening it.
Normally I would close it anyway as a best practice, but as I said in the question, I have 269 scripts and I'm trying to minimize the changes. (Plus it seems more Perlish to just re-open the open filehandle, as awful as that sounds.)
In case anyone else has something similar in the future, here's what I'm going to do for updating all my scripts at once:
Perl:
find . -type f -name "*.pl" -exec perl -pi.bak -e 's%/usr/bin/perl%/usr/bin/perl\nopen(STDERR,">>","/path/to/my/cgi-error.log")\n or die "Could not redirect STDERR: \$OS_ERROR";%' {} \;
Python:
find . -type f -name "*.py" -exec perl -pi.bak -e 's%^(import os, sys.*)%$1\nsys.stderr = open("/path/to/my/cgi-error.log", "a")%' {} \;
The reason I'm posting these commands is that it took me quite a lot of syntactical massaging to get those commands to work (e.g., changing Couldn't to Could not, changing #!/usr/bin/perl to just /usr/bin/perl so the shell wouldn't interpret ! as a history character, using $OS_ERROR instead of $!, etc.)
Thanks to everyone who commented. Since no one answered for both Perl and Python I couldn't really "accept" any of the given answers, but I did give votes to the ones which led me in the right direction. Thanks again!
Python:
import sys
sys.stderr = open('file_path_with_write_permission/filename', 'a')
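Note that reassigning sys.stderr only redirects Python-level writes. If messages written straight to file descriptor 2 (for example by C extensions or child processes) should land in the same log, one option is a descriptor-level redirect; a rough sketch, with the path as a placeholder:

import os
import sys

# Duplicate the log file's descriptor onto fd 2 so that anything writing
# directly to stderr, not just through sys.stderr, ends up in the log too.
log = open('/path/to/my/cgi-error.log', 'a')
os.dup2(log.fileno(), sys.stderr.fileno())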
Python has the sys.stderr file object that you might want to look into.
>>>help(sys.__stderr__.read)
Help on built-in function read:
read(...)
read([size]) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
You can store the output of this in a string and write that string to a file.
Hope this helps
In my Perl CGI programs, I usually have
BEGIN {
    open(STDERR, '>>', 'stderr.log');
}
right after the shebang line and "use strict; use warnings;". If you want, you may append $0 to the file name, but this will not solve the multiple-programs problem, as several copies of one program may run simultaneously. I usually just have several output files, one for every program group.