Python equivalent of piping zcat result to filehandle in Perl

I have a huge pipeline written in Python that uses very large .gz files (~14 GB compressed), but I need a better way to send certain lines to an external program (formatdb from blast-legacy/2.2.26). I have a Perl script someone wrote for me a long time ago that does this extremely fast, but I need to do the same thing in Python, since the rest of the pipeline is written in Python and I have to keep it that way. The Perl script uses two file handles: one holds the zcat of the .gz file, and the other receives the lines the software needs (2 of every 4) and serves as its input. It involves bioinformatics, but no experience is needed: the file is in fastq format and the software needs it in fasta format. Every 4 lines is a fastq record; take the 1st and 3rd lines, add '>' to the beginning of the 1st line, and that is the fasta equivalent that formatdb will use for each record.
The Perl script is as follows:
#!/usr/bin/perl
my $SRG = $ARGV[0]; # reads.fastq.gz

open($fh, sprintf("zcat %s |", $SRG)) or die "Broken gunzip $!\n";
# -i: input -n: db name -p: program
open($fh2, "| formatdb -i stdin -n $SRG -p F") or die "no piping formatdb!, $!\n";

# Fastq => Fasta sub
my $localcounter = 0;
while (my $line = <$fh>){
    if ($. % 4 == 1){
        print $fh2 "\>" . substr($line, 1);
        $localcounter++;
    }
    elsif ($localcounter == 1){
        print $fh2 "$line";
        $localcounter = 0;
    }
    else{
    }
}
close $fh;
close $fh2;
exit;
It works really well. How could I do this same thing in Python? I like how Perl can use those file handles, but I'm not sure how to do that in Python without creating an actual file. All I can think of is to gzip.open the file and write the two lines of every record I need to a new file and use that with formatdb, but it is way too slow. Any ideas? I need to work it into the Python pipeline, so I can't just rely on the Perl script, and I would also like to know how to do this in general. I assume I need to use some form of the subprocess module.
Here is my Python code, but again it is way too slow, and speed is the issue here (HUGE FILES):
#!/usr/bin/env python
import gzip
from Bio import SeqIO  # can recognize fasta/fastq records
import subprocess as sp
import os, sys

filename = sys.argv[1]  # reads.fastq.gz
tempFile = filename + ".temp.fasta"

outFile = open(tempFile, "w")
handle = gzip.open(filename, "r")

# parses each fastq record
# r.id and r.seq are the 1st and 3rd lines of each record
for r in SeqIO.parse(handle, "fastq"):
    outFile.write(">" + str(r.id) + "\n")
    outFile.write(str(r.seq) + "\n")

handle.close()
outFile.close()

cmd = 'formatdb -i ' + str(tempFile) + ' -n ' + filename + ' -p F '
sp.call(cmd, shell=True)

cmd = 'rm ' + tempFile
sp.call(cmd, shell=True)

First, there's a much better solution in both Perl and Python: just use a gzip library. In Python, there's one in the stdlib; in Perl, you can find one on CPAN. For example:
with gzip.open(path, 'rt', encoding='utf-8') as f:   # 'rt': encoding is only valid in text mode
    for line in f:
        do_stuff(line)
Much simpler, more efficient, and more portable than shelling out to zcat.
But if you really do want to launch a subprocess and control its pipes in Python, you can do it with the subprocess module. And, unlike Perl, Python can do this without having to stick a shell in the middle. There's even a nice section in the docs on Replacing Older Functions with the subprocess Module that gives you recipes.
So:
zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
Now, zcat.stdout is a file-like object, with the usual read methods and so on, wrapping the pipe to the zcat subprocess.
So, for example, to read a binary file 8K at a time in Python 3.x:
import functools
import subprocess

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
for chunk in iter(functools.partial(zcat.stdout.read, 8192), b''):
    do_stuff(chunk)
zcat.wait()
(If you want to do this in Python 2.x, or read a text file one line at a time instead of a binary file 8K at a time, or whatever, the changes are the same as they'd be for any other file-handling coding.)
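Putting the pieces together for the original problem, here is a minimal sketch (assuming formatdb really does accept -i stdin, as in the question's Perl script): gzip.open replaces zcat, and formatdb's standard input is a pipe that receives the fasta records, so nothing is ever written to disk.
import gzip
import subprocess
import sys

path = sys.argv[1]  # reads.fastq.gz

# formatdb reads the fasta records from its stdin ("-i stdin", as in the Perl script)
formatdb = subprocess.Popen(['formatdb', '-i', 'stdin', '-n', path, '-p', 'F'],
                            stdin=subprocess.PIPE)

with gzip.open(path, 'rb') as f:          # bytes in, bytes out: works on 2.x and 3.x
    for i, line in enumerate(f):
        if i % 4 == 0:                    # fastq header line -> fasta header
            formatdb.stdin.write(b'>' + line[1:])
        elif i % 4 == 1:                  # sequence line
            formatdb.stdin.write(line)

formatdb.stdin.close()
formatdb.wait()
Closing formatdb.stdin is what signals end-of-input to formatdb, just like closing $fh2 does in the Perl version.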

You can parse the whole file and load it as a list of lines using this function:
import gzip

def convert_gz_to_list_of_lines(filepath):
    """Parse gz file and convert it into a list of lines."""
    file_as_list = list()
    with gzip.open(filepath, 'rt', encoding='utf-8') as f:
        try:
            for line in f:
                file_as_list.append(line)
        except EOFError:
            # truncated archive: keep whatever was read so far
            pass
    return file_as_list
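For the ~14 GB files in the question, building the whole list in memory may not be practical; a lazier sketch of the same idea (the function name here is just for illustration) yields lines one at a time instead:
import gzip

def iter_gz_lines(filepath):
    """Yield decoded lines from a gzip file one at a time."""
    with gzip.open(filepath, 'rt', encoding='utf-8') as f:
        for line in f:
            yield line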

Related

Dos2unix not working when trying to silence command

I was calling dos2unix from within Python this way:
call("dos2unix " + file1, shell=True, stdout=PIPE)
However to silence the Unix output, I did this:
f_null = open(os.devnull, 'w')
call("dos2unix " + file1, shell=True, stdout=f_null , stderr=subprocess.STDOUT)
This doesn't seem to work. The command doesn't appear to run anymore: when I diff file1 against file2 (diff -y file1 file2 | cat -t), I can see the line endings haven't changed.
file2 is the file I am comparing file1 against. It has Unix line endings as it is generated on the box. However, there is a chance that file1 doesn't.
Not sure why, but I would try to get rid of the "noise" around your command and check the return code:
check_call(["dos2unix", file1], stdout=f_null, stderr=subprocess.STDOUT)
pass a list of args, not a command line (this also supports filenames with spaces in them!)
remove shell=True, as dos2unix isn't a shell built-in
use check_call so it raises an exception instead of failing silently
At any rate, it is possible that dos2unix detects that its output isn't a tty anymore and decides to write the converted data there instead (dos2unix can read from standard input and write to standard output). I'd go with that explanation. You could check it by redirecting to a real file instead of os.devnull and seeing whether the result ends up there.
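A quick sketch of that check (the output filename is just for illustration): redirect stdout to a real file instead of os.devnull and look at what lands in it.
import subprocess

# if dos2unix is writing the converted text to stdout, it will show up in this file
with open("dos2unix_output.txt", "w") as log:
    subprocess.check_call(["dos2unix", file1], stdout=log, stderr=subprocess.STDOUT)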
But I would use a pure Python solution instead (with a backup for safety), which is portable and doesn't need the dos2unix command (so it works on Windows as well):
import os

with open(file1, "rb") as f:
    contents = f.read().replace(b"\r\n", b"\n")
with open(file1 + ".bak", "wb") as f:
    f.write(contents)
os.remove(file1)
os.rename(file1 + ".bak", file1)
Reading the file fully is fast, but could choke on really big files. A line-by-line solution is also possible (still using binary mode):
with open(file1, "rb") as fr, open(file1 + ".bak", "wb") as fw:
    for l in fr:
        fw.write(l.replace(b"\r\n", b"\n"))
os.remove(file1)
os.rename(file1 + ".bak", file1)

How can I run this shell script inside python?

I want to run a bash script from a python program. The script has a command like this:
find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \;
Normally I would run a subprocess call like:
subprocess.call('ls | wc -l', shell=True)
But that's not possible here because of the quoting signs. Any suggestions?
Thanks!
While the question is answered already, I'll still jump in, because I assume that you want to execute that bash script only because you do not have the functionally equivalent Python code (which is less than 40 lines, basically; see below).
Why do this instead of the bash script?
Your script is now able to run on any OS that has a Python interpreter
The functionality is a lot easier to read and understand
If you need anything special, it is always easier to adapt your own code
More Pythonic :-)
Please bear in mind that this (like your bash script) comes without any kind of error checking, and the output file is a global variable, but that can be changed easily.
import gzip
import os

# create the output file
outfile = open('/tmp/output.txt', mode='w', encoding='utf-8')

def process_line(line):
    """
    Get the third column (delimiter is the tab char) and write it to the output file.
    """
    columns = line.split('\t')
    if len(columns) >= 3:
        outfile.write(columns[2] + '\n')   # third column, like `cut -f 3`

def process_zipfile(filename):
    """
    Read gzip file content (we assume text) and split it into lines for processing.
    """
    print('Reading {0} ...'.format(filename))
    with gzip.open(filename, mode='rb') as f:
        lines = f.read().decode('utf-8').split('\n')
        for line in lines:
            process_line(line.strip())

def process_directory(dirtuple):
    """
    Loop through the list of files in that directory and process any .gz file.
    """
    print('Processing {0} ...'.format(dirtuple[0]))
    for filename in dirtuple[2]:
        if filename.endswith('.gz'):
            process_zipfile(os.path.join(dirtuple[0], filename))

# walk the directory tree from the current directory downward
for dirtuple in os.walk('.'):
    process_directory(dirtuple)

outfile.close()
Escape the ' marks with a \.
i.e. For every: ', replace with: \'
Triple quotes or triple double quotes ('''some string''' or """some other string""") are handy as well. See here (yeah, it's the Python 3 documentation, but it all works 100% in Python 2):
mystring = """how many 'cakes' can you "deliver"?"""
print(mystring)
how many 'cakes' can you "deliver"?
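Applied to the command from the question, a sketch: the triple-quoted string lets all of the inner single and double quotes through untouched, so the whole line can be handed to the shell as-is.
import subprocess

# the original find command, wrapped in a triple-quoted string so none of the
# inner quotes need escaping; shell=True hands the string to /bin/sh unchanged
cmd = """find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \\;"""
subprocess.call(cmd, shell=True)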

python - processing very large files (>90GB)

I need to process a couple of very large files (>90GB each). Only a small portion of the files is important to me. I want to scan through the files and write the necessary lines to another file, so I don't need to process such large files every time I run an experiment. Every line is about 1000 characters.
I use the following code:
def readFile(inputFile, outputFile):
    startDate = datetime.datetime.strptime('10/06/2010 00:00:00', '%m/%d/%Y %H:%M:%S')
    endDate = datetime.datetime.strptime('10/13/2010 23:59:59', '%m/%d/%Y %H:%M:%S')
    total_lines = 0
    with open(inputFile, 'r') as a_file:
        for a_line in a_file:
            total_lines += 1
            id, date, content = splitLine(a_line)
            datetime_object = datetime.datetime.strptime(date, '%m/%d/%Y %H:%M:%S')
            if (datetime_object > startDate and datetime_object < endDate):
                appendToFile(outputFile, a_line)
    return total_lines

def splitLine(long_string):
    values = long_string.split(",")
    return values[0], values[1], values[2]

def appendToFile(outputFile, outputString):
    try:
        file = open(outputFile, 'a+')
        file.write(outputString)
        file.close()
    except Exception as ex:
        print("Error writing to file: " + outputFile)
    return
The problem is, every time I run the script, the process gets stuck around the 10,000,000th line. When I use the htop command, I can see that Python only uses around 8 GB of RAM when it gets stuck, and the virtual memory used keeps increasing; then the OS kills the process after a while.
I used different files, and also both Python 2.7 and 3.5. I also tried using with open(inputFile, 'r', 16777216) to use buffering, but the result didn't change. I'm running the code on macOS Sierra 10.12.4 and the machine has 16 GB of RAM.
Any ideas?
Open the file in pieces, until you find what you want. Like this:
f = open('yourfile')
piece = f.read(4096)
while piece:
    # Implementation for each piece
    piece = f.read(4096)
f.close()
A more efficient way to do that would be to call the Unix awk command from Python. This will work on both Mac and Unix.
You can call Unix commands from Python like this:
import os
os.popen('ls -l > result.txt')
Running this sample code will create a file named result.txt that contains the output of the ls -l command.
Similarly, you can scan through your files with awk and pipe the result to another file.
From the man page of awk:
NAME
    awk - pattern-directed scanning and processing language
SYNOPSIS
    awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ... ]
DESCRIPTION
    Awk scans each input file for lines that match any of a set of patterns specified literally in prog or in one or more files specified as -f progfile. With each pattern there can be an associated action that will be performed when a line of a file matches the pattern. Each line is matched against the pattern portion of every pattern-action statement; the associated action is performed for each matched pattern. The file name - means the standard input. Any file of the form var=value is treated as an assignment, not a filename, and is executed at the time it would have been opened if it were a filename. The option -v followed by var=value is an assignment to be done before prog is executed; any number of -v options may be present. The -F fs option defines the input field separator to be the regular expression fs.
Read this answer https://unix.stackexchange.com/questions/76805/read-log-file-between-two-dates to see how to use awk to read log files between two dates.
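Putting the two together, a minimal sketch (the file names and the date prefix are only illustrative, and it assumes the timestamp is the second comma-separated field, as in the question's splitLine): call awk via subprocess and redirect the matching lines into another file. This only string-matches a single day; a real date-range test needs more awk logic, or the approach from the linked answer.
import subprocess

# keep lines whose second comma-separated field starts with the given date prefix;
# a pattern with no action makes awk print the matching lines, which go to the file
with open("filtered.log", "w") as out:
    subprocess.check_call(
        ["awk", "-F", ",", r"$2 ~ /^10\/06\/2010/", "huge.log"],
        stdout=out,
    )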

Issues calling awk from within Python using subprocess.call

Having some issues calling awk from within Python. Normally, I'd do the following to call the command in awk from the command line.
Open up command line, in admin mode or not.
Change my directory to awk.exe, namely cd R\GnuWin32\bin
Call awk -F "," "{ print > (\"split-\" $10 \".csv\") }" large.csv
My command is used to split up the large.csv file based on the 10th column into a number of files named split-[COL VAL HERE].csv. I have no issues running this command. I tried to run the same code in Python using subprocess.call() but I'm having some issues. I run the following code:
def split_ByInputColumn():
    subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', '\",\"',
                     '\"{ print > (\\"split-\\" $10 \\".csv\\") }\"', 'large.csv'],
                    cwd='C:/R/GnuWin32/bin/')
and clearly, something is running when I execute the function (CPU usage, etc) but when I go to check C:/R/GnuWin32/bin/ there are no split files in the directory. Any idea on what's going wrong?
As I stated in my previous answer that was downvoted, you are over-protecting the arguments, which makes awk's argument parsing fail.
Since there was no comment, I supposed there was a typo somewhere, but it worked... So I suppose that's because I should have strongly suggested a full-fledged Python solution, which is the best thing to do here (as stated in my previous answer).
Writing the equivalent in Python is not trivial, as we have to emulate the way awk opens files and appends to them afterwards. But it is more integrated, more Pythonic, and handles quoting properly if quoting occurs in the input file.
I took the time to code & test it:
import csv
import glob
import os

def split_ByInputColumn():
    # get rid of the old data from previous runs
    for f in glob.glob("split-*.csv"):
        os.remove(f)

    open_files = dict()
    with open('large.csv') as f:
        cr = csv.reader(f, delimiter=',')
        for r in cr:
            tenth_column = r[9]
            filename = "split-{}.csv".format(tenth_column)
            if filename not in open_files:
                handle = open(filename, "wb")
                open_files[filename] = (handle, csv.writer(handle, delimiter=','))
            open_files[filename][1].writerow(r)

    for f, _ in open_files.values():
        f.close()

split_ByInputColumn()
In detail:
read the big file as csv (advantage: quoting is handled properly)
compute the destination filename for each row
if the filename is not in the dictionary yet, open the file and create a csv.writer object for it
write the row with the corresponding csv.writer
at the end, close all the file handles
Aside: My old solution, using awk properly:
import subprocess

def split_ByInputColumn():
    subprocess.call(['awk.exe', '-F', ',',
                     '{ print > ("split-" $10 ".csv") }', 'large.csv'],
                    cwd='some_directory')
Someone else posted an answer (and then subsequently deleted it), but the issue was that I was over-protecting my arguments. The following code works:
def split_ByInputColumn():
    subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', ',',
                     '{ print > (\"split-\" $10 \".csv\") }', 'large.csv'],
                    cwd='C:/R/GnuWin32/bin/')

Python: Using Popen() versus File Objects to write to a file in linux

I noticed I have two alternatives to writing to a file in Linux within a Python script. I can either create a Popen object and write to a file using shell redirection (e.g. ">" or ">>"), or I can use file objects (e.g. open(), write(), close()).
I've played around with both for a short while and noticed that using Popen involves less code if I need to use other shell tools. For instance, below I try to get a checksum of a file and write it to a temporary file named with the PID as a unique identifier. (I know $$ will change if I call Popen again but pretend I don't need to):
Popen("md5sum " + filename + " >> /dir/test/$$.tempfile", shell=True, stdout=PIPE).communicate()[0]
Below is a (hastily written) rough equivalent using file objects. I use os.getpid instead of $$, but I still use md5sum and still have to call Popen.
PID = str(os.getpid())
manifest = open('/dir/test/' + PID + '.tempfile','w')
hash = Popen("md5sum " + filename, shell=True, stdout=PIPE).communicate()[0]
manifest.write(hash)
manifest.close()
Are there any pros/cons to either approach? I'm actually trying to port bash code over to Python and would like to use more Python, but I'm not sure which way I should go here.
Generally speaking, I would write something like:
manifest = open('/dir/test/' + PID + '.tempfile','w')
p = Popen(['md5sum',filename],stdout=manifest)
p.wait()
manifest.close()
This avoids any shell injection vulnerabilities. You also know the PID as you're not picking up the PID of the spawned subshell.
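A slightly fuller sketch of the same idea (paths as in the question; filename is assumed to be defined already): the PID comes from os.getpid() rather than the shell's $$, and no shell is involved at all.
import os
from subprocess import Popen

pid = str(os.getpid())
# md5sum writes its output straight into the manifest file; no shell, no $$ ambiguity
with open('/dir/test/' + pid + '.tempfile', 'w') as manifest:
    p = Popen(['md5sum', filename], stdout=manifest)
    p.wait()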
Edit: the md5 module is deprecated (but still around); instead, you should use the hashlib module.
hashlib version
to file:
import hashlib

with open('py_md5', mode='w') as out:
    with open('test.txt', mode='rb') as input:
        out.write(hashlib.md5(input.read()).hexdigest())
to console:
import hashlib

with open('test.txt', mode='rb') as input:
    print hashlib.md5(input.read()).hexdigest()
md5 version
Python's md5 module provides an identical tool:
import md5

# open file to write
with open('py_md5', mode='w') as out:
    with open('test.txt', mode='rb') as input:
        out.write(md5.new(input.read()).hexdigest())
If you just wanted to get the md5 hexadecimal digest string, you can print it instead of writing it out to a file:
import md5

# read the file and print its digest
with open('test.txt', mode='rb') as input:
    print md5.new(input.read()).hexdigest()
