python - processing very large files (>90GB)

I need to process a couple of very large files (>90GB each). Only a small portion of the files are important for me. I want to scan through the files and write the necessary lines to another file, so I don't need to process such large files every time I run an experiment. Every line is about 1000 characters.
I use the following code:
def readFile(inputFile, outputFile):
    startDate = datetime.datetime.strptime('10/06/2010 00:00:00', '%m/%d/%Y %H:%M:%S')
    endDate = datetime.datetime.strptime('10/13/2010 23:59:59', '%m/%d/%Y %H:%M:%S')
    total_lines = 0
    with open(inputFile, 'r') as a_file:
        for a_line in a_file:
            total_lines += 1
            id, date, content = splitLine(a_line)
            datetime_object = datetime.datetime.strptime(date, '%m/%d/%Y %H:%M:%S')
            if (datetime_object > startDate and datetime_object < endDate):
                appendToFile(outputFile, a_line)
    return total_lines
def splitLine(long_string):
    values = long_string.split(",")
    return values[0],values[1],values[2]
def appendToFile(outputFile, outputString):
    try:
        file = open(outputFile, 'a+')
        file.write(outputString)
        file.close()
    except Exception as ex:
        print("Error writing to file: " + outputFile)
    return
The problem is that every time I run the script, the process gets stuck around the 10,000,000th line. When I use the htop command, I can see that Python only uses around 8 GB of RAM when it gets stuck, while the virtual memory used keeps increasing until the OS kills the process.
I tried different files, and both Python 2.7 and 3.5. I also tried with open(inputFile, 'r', 16777216) to set the buffer size, but the result didn't change. I'm running the code on macOS Sierra 10.12.4, and the machine has 16 GB of RAM.
Any ideas?

Open the file in pieces, until you find what you want. Like this:
f = open('yourfile')
piece = f.read(4096)
while piece:
    # Implementation for each piece
    piece = f.read(4096)
f.close()

A more efficient way to do that would be to call the Unix awk command from Python. This will work on both macOS and other Unix systems.
You can call Unix commands from Python like this:
import os
os.popen('ls -l > result.txt')
Running this sample code will create a file named result.txt that contains the output of the ls -l command.
Similarly, you can scan through your files with awk and pipe the result to another file.
From the man page of awk:
awk
NAME
awk - pattern-directed scanning and processing language
SYNOPSIS
awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ... ]
DESCRIPTION:
Awk scans each input file for lines that match any of a set of patterns specified literally in prog or in one or more files specified as -f progfile. With each pattern there can be an associated action that will be performed when a line of a file matches the pattern. Each line is matched against the pattern portion of every pattern-action statement; the associated action is performed for each matched pattern. The file name - means the standard input. Any file of the form var=value is treated as an assignment, not a filename, and is executed at the time it would have been opened if it were a filename. The option -v followed by var=value is an assignment to be done before prog is executed; any number of -v options may be present. The -F fs option defines the input field separator to be the regular expression fs.
Read this answer https://unix.stackexchange.com/questions/76805/read-log-file-between-two-dates to see how to use awk to read log files between two dates.
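For comparison, the same date filter can also be written as a single streaming pass in Python. This is only a minimal sketch, assuming the line layout from the question (comma-separated, timestamp in the second field). The name filter_by_date and the rule of skipping malformed lines are illustrative choices, not part of the original code. Unlike the question's appendToFile, it opens the output file once (in write mode) instead of once per matching line.

```python
import datetime

# Timestamp layout used in the question's data.
DATE_FORMAT = "%m/%d/%Y %H:%M:%S"


def filter_by_date(input_path, output_path, start, end):
    """Copy lines whose second field lies strictly between start and end.

    Returns the total number of lines read. Hypothetical helper name.
    """
    total_lines = 0
    with open(input_path, "r") as source, open(output_path, "w") as target:
        for line in source:  # reads one line at a time, not the whole file
            total_lines += 1
            fields = line.split(",", 2)  # id, date, rest of the line
            if len(fields) < 3:
                continue  # assumption: silently skip malformed lines
            stamp = datetime.datetime.strptime(fields[1], DATE_FORMAT)
            if start < stamp < end:
                target.write(line)
    return total_lines
```

It could be called with the dates from the question, for example filter_by_date('in.csv', 'out.csv', datetime.datetime(2010, 10, 6), datetime.datetime(2010, 10, 13, 23, 59, 59)), where both file names are placeholders.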

Related

Running a .py file with an argument in a .bat file

The problem: I want to iterate over a folder in search of a certain file type, then run a program on each file with name.ext as the argument, and then run a Python script that changes the output name of the first program.
I know there is probably a better way to do the above, but the way I thought of was this:
[BAT]
for /R "C:\..\folder" %%a IN (*.extension) do (
    SET name=%%a
    "C:\...\first_program.exe" "%%a"
    "C:\...\script.py" "%name%"
)
[PY]
import io
import sys
def rename(i):
    name = i
    with open('my_file.txt', 'r') as file:
        data = file.readlines()
    data[40] ='"C:\\\\Users\\\\UserName\\\\Desktop\\\\folder\\\\folder\\\\' + name + '"\n'
    with open('my_file.txt', 'w') as file:
        file.writelines( data )
if __name__ == "__main__":
    rename(sys.argv[1])
Expected result: I wish the python file changed the name, but after putting it once into the console it seems to stay with the script. The BAT does not change it and it bothers me.
PS. If there is a better way, I'll be glad to get to know it.

This is the Linux bash version; I am sure you can change the loop etc. to make it work as a batch file. Instead of your *.exe, I use cat as a generic input/output example.
#! /bin/sh
for f in *.txt
do
    suffix=".txt"
    name=${f%$suffix}
    cat $f > tmp.dat
    awk -v myName=$f '{if(NR==5) print $0 myName; else print $0 }' tmp.dat > $name.dat
done
This produces "unique" output *.dat files named after the input *.txt files. The files are processed by cat (standing in for your *.exe) and the output is put into a temporary file. Finally, awk changes line 5 of that temporary file and writes the result to the unique output file mentioned above.
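Since the question already involves Python, the loop itself could also be driven from Python rather than from a batch or shell file. The following is a rough sketch under that assumption, not the asker's setup: run_on_each is a hypothetical helper, and the program to run is passed in as an argument list because the real executable path is elided in the question.

```python
import subprocess
from pathlib import Path


def run_on_each(folder, pattern, program):
    """Run `program <file>` for every file under `folder` matching `pattern`.

    `program` is a list such as ["first_program.exe"] (placeholder name).
    Returns the names of the processed files, in sorted path order.
    """
    processed = []
    for path in sorted(Path(folder).rglob(pattern)):
        # check=True raises CalledProcessError if the program fails.
        subprocess.run([*program, str(path)], check=True)
        processed.append(path.name)
    return processed
```

Because the file name is an ordinary Python variable inside the loop, any follow-up step (such as rewriting a line in my_file.txt) can use it directly, without relying on batch variable expansion.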

How can I run this shell script inside python?

I want to run a bash script from a python program. The script has a command like this:
find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \;
Normally I would run a subprocess call like:
subprocess.call('ls | wc -l', shell=True)
But that's not possible here because of the quoting signs. Any suggestions?
Thanks!

While the question is answered already, I'll still jump in because I assume that you want to execute that bash script because you do not have the functionally equivalent Python code (which is less than 40 lines basically, see below).
Why do this instead of the bash script?
Your script now is able to run on any OS that has a Python interpreter
The functionality is a lot easier to read and understand
If you need anything special, it is always easier to adapt your own code
More Pythonic :-)
Please bear in mind that this is (like your bash script) without any kind of error checking, and that the output file is a global variable, but that can be changed easily.
import gzip
import os
# create our output file
outfile = open('/tmp/output.txt', mode='w', encoding='utf-8')
def process_line(line):
    """
    get the third column (delimiter is tab char) and write to output file
    """
    columns = line.split('\t')
    # the third column has index 2, matching `cut -f 3` in the bash script
    if len(columns) > 2:
        outfile.write(columns[2] + '\n')
def process_zipfile(filename):
    """
    read zip file content (we assume text) and split into lines for processing
    """
    print('Reading {0} ...'.format(filename))
    with gzip.open(filename, mode='rb') as f:
        lines = f.read().decode('utf-8').split('\n')
        for line in lines:
            process_line(line.strip())
def process_directory(dirtuple):
    """
    loop thru the list of files in that directory and process any .gz file
    """
    print('Processing {0} ...'.format(dirtuple[0]))
    for filename in dirtuple[2]:
        if filename.endswith('.gz'):
            process_zipfile(os.path.join(dirtuple[0], filename))
# walk the directory tree from current directory downward
for dirtuple in os.walk('.'):
    process_directory(dirtuple)
outfile.close()

Escape the ' marks with a \.
i.e. For every: ', replace with: \'

Triple quotes or triple double quotes ('''some string''' or """some other string""") are handy as well. See here (yeah, it's Python 3 documentation, but it all works 100% in Python 2)
mystring = """how many 'cakes' can you "deliver"?"""
print(mystring)
how many 'cakes' can you "deliver"?
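Applied to the find command from the question, the triple-quote approach might look like the sketch below. The names FIND_CMD and run_shell are illustrative only. The string is also marked raw (r"""...""") so that the trailing \; reaches the shell unchanged, which is an addition on top of what the answers above suggest.

```python
import subprocess

# The command from the question, wrapped in raw triple quotes so that
# neither the single quotes, the double quotes, nor the backslash need
# any escaping on the Python side.
FIND_CMD = r"""find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \;"""


def run_shell(command):
    """Run a command line through the shell and return its exit status."""
    return subprocess.call(command, shell=True)


# To actually run the command from the question:
# run_shell(FIND_CMD)
```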

Issues calling awk from within Python using subprocess.call

Having some issues calling awk from within Python. Normally, I'd do the following to call the command in awk from the command line.
Open up command line, in admin mode or not.
Change my directory to awk.exe, namely cd R\GnuWin32\bin
Call awk -F "," "{ print > (\"split-\" $10 \".csv\") }" large.csv
My command is used to split up the large.csv file based on the 10th column into a number of files named split-[COL VAL HERE].csv. I have no issues running this command. I tried to run the same code in Python using subprocess.call() but I'm having some issues. I run the following code:
def split_ByInputColumn():
    subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', '\",\"',
                     '\"{ print > (\\"split-\\" $10 \\".csv\\") }\"', 'large.csv'],
                    cwd = 'C:/R/GnuWin32/bin/')
and clearly, something is running when I execute the function (CPU usage, etc) but when I go to check C:/R/GnuWin32/bin/ there are no split files in the directory. Any idea on what's going wrong?

As I stated in my previous answer that was downvoted, you overprotect the arguments, making awk argument parsing fail.
Since there was no comment, I supposed there was a typo but it worked... So I suppose that's because I should have strongly suggested a full-fledged python solution, which is the best thing to do here (as stated in my previous answer)
Writing the equivalent in python is not trivial as we have to emulate the way awk opens files and appends to them afterwards. But it is more integrated, pythonic and handles quoting properly if quoting occurs in the input file.
I took the time to code & test it:
import csv
import glob
import os
def split_ByInputColumn():
    # get rid of the old data from previous runs
    for f in glob.glob("split-*.csv"):
        os.remove(f)
    open_files = dict()
    with open('large.csv') as f:
        cr = csv.reader(f,delimiter=',')
        for r in cr:
            tenth_column = r[9]
            filename = "split-{}.csv".format(tenth_column)
            if filename not in open_files:
                # Python 3 form; on Python 2 use open(filename,"wb") instead
                handle = open(filename,"w",newline="")
                open_files[filename] = (handle,csv.writer(handle,delimiter=','))
            open_files[filename][1].writerow(r)
    for f,_ in open_files.values():
        f.close()
split_ByInputColumn()
in detail:
read the big file as csv (advantage: quoting is handled properly)
compute the destination filename
if filename not in dictionary, open it and create csv.writer object
write the row using the csv.writer stored in the dictionary for that filename
in the end, close file handles
Aside: My old solution, using awk properly:
import subprocess
def split_ByInputColumn():
    subprocess.call(['awk.exe', '-F', ',',
                     '{ print > ("split-" $10 ".csv") }', 'large.csv'],cwd = 'some_directory')

Someone else posted an answer (and then subsequently deleted it), but the issue was that I was over-protecting my arguments. The following code works:
def split_ByInputColumn():
    subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', ',',
                     '{ print > (\"split-\" $10 \".csv\") }', 'large.csv'],
                    cwd = 'C:/R/GnuWin32/bin/')

Python equivalent of piping zcat result to filehandle in Perl

I have a huge pipeline written in Python that uses very large .gz files (~14GB compressed), but I need a better way to send certain lines to an external program (formatdb from blast-legacy/2.2.26). I have a Perl script someone wrote for me a long time ago that does this extremely fast, but I need to do the same thing in Python, given that the rest of the pipeline is written in Python and I have to keep it that way. The Perl script uses two file handles: one to hold the zcat output of the .gz file, and the other to feed the lines the software needs (2 of every 4) to it as input. It involves bioinformatics, but no experience is needed. The file is in fastq format and the software needs it in fasta format. Every 4 lines is a fastq record; take the 1st and 2nd lines, put '>' at the beginning of the 1st line in place of its first character, and that is the fasta equivalent the formatdb software will use for each record.
The perl script is as follows:
#!/usr/bin/perl
my $SRG = $ARGV[0]; # reads.fastq.gz
open($fh, sprintf("zcat %s |", $SRG)) or die "Broken gunzip $!\n";
# -i: input -n: db name -p: program
open ($fh2, "| formatdb -i stdin -n $SRG -p F") or die "no piping formatdb!, $!\n";
#Fastq => Fasta sub
my $localcounter = 0;
while (my $line = <$fh>){
    if ($. % 4==1){
        print $fh2 "\>" . substr($line, 1);
        $localcounter++;
    }
    elsif ($localcounter == 1){
        print $fh2 "$line";
        $localcounter = 0;
    }
    else{
    }
}
close $fh;
close $fh2;
exit;
It works really well. How could I do this same thing in Python? I like how Perl can use those file handles, but I'm not sure how to do that in Python without creating an actual file. All I can think of is to gzip.open the file and write the two lines of every record I need to a new file and use that with "formatdb", but it is way too slow. Any ideas? I need to work it into the python pipeline, so I can't just rely on the perl script and I would also like to know how to do this in general. I assume I need to use some form of the subprocess module.
Here is my Python code, but again it is way too slow and speed is the issue here (HUGE FILES):
#!/usr/bin/env python
import gzip
from Bio import SeqIO # can recognize fasta/fastq records
import subprocess as sp
import os,sys
filename = sys.argv[1] # reads.fastq.gz
tempFile = filename + ".temp.fasta"
outFile = open(tempFile, "w")
handle = gzip.open(filename, "r")
# parses each fastq record
# r.id and r.seq come from the 1st and 2nd lines of each record
for r in SeqIO.parse(handle, "fastq"):
    outFile.write(">" + str(r.id) + "\n")
    outFile.write(str(r.seq) + "\n")
handle.close()
outFile.close()
cmd = 'formatdb -i ' + str(tempFile) + ' -n ' + filename + ' -p F '
sp.call(cmd, shell=True)
cmd = 'rm ' + tempFile
sp.call(cmd, shell=True)

First, there's a much better solution in both Perl and Python: just use a gzip library. In Python, there's one in the stdlib; in Perl, you can find one on CPAN. For example:
with gzip.open(path, 'rt', encoding='utf-8') as f:
    for line in f:
        do_stuff(line)
Much simpler, and more efficient, and more portable, than shelling out to zcat.
But if you really do want to launch a subprocess and control its pipes in Python, you can do it with the subprocess module. And, unlike perl, Python can do this without having to stick a shell in the middle. There's even a nice section in the docs on Replacing Older Functions with the subprocess Module that gives you recipes.
So:
zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
Now, zcat.stdout is a file-like object, with the usual read methods and so on, wrapping the pipe to the zcat subprocess.
So, for example, to read a binary file 8K at a time in Python 3.x:
zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
for chunk in iter(functools.partial(zcat.stdout.read, 8192), b''):
    do_stuff(chunk)
zcat.wait()
(If you want to do this in Python 2.x, or read a text file one line at a time instead of a binary file 8K at a time, or whatever, the changes are the same as they'd be for any other file-handling coding.)
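Putting the two ideas together (the gzip library for reading, a subprocess pipe for writing), a rough Python 3 counterpart of the Perl script might look like the sketch below. The function names fastq_to_fasta and pipe_fasta are made up for illustration, and the sketch assumes well-formed four-line FASTQ records. It has not been run against formatdb itself.

```python
import gzip
import subprocess


def fastq_to_fasta(lines):
    """Yield FASTA lines (header, sequence) from an iterable of FASTQ lines."""
    for index, line in enumerate(lines):
        position = index % 4
        if position == 0:
            # Replace the first character of the header line with '>'.
            yield ">" + line[1:]
        elif position == 1:
            # The sequence line directly follows the header.
            yield line
        # The '+' separator and quality lines are dropped.


def pipe_fasta(fastq_gz_path, command):
    """Stream converted records into the stdin of `command`.

    `command` is an argument list; returns the child's exit status.
    """
    child = subprocess.Popen(command, stdin=subprocess.PIPE, text=True)
    with gzip.open(fastq_gz_path, "rt") as reads:
        for fasta_line in fastq_to_fasta(reads):
            child.stdin.write(fasta_line)
    child.stdin.close()  # signals end of input to the child
    return child.wait()
```

With the flags taken from the Perl script in the question, the call would presumably be pipe_fasta(filename, ['formatdb', '-i', 'stdin', '-n', filename, '-p', 'F']). No temporary file is written, which is the part of the Perl approach the question wanted to keep.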

You can parse the whole file and load it as a list of lines using this function (practical only when the decompressed content fits in memory):
import gzip
def convert_gz_to_list_of_lines(filepath):
    """Parse gz file and convert it into a list of lines."""
    file_as_list = list()
    with gzip.open(filepath, 'rt', encoding='utf-8') as f:
        try:
            for line in f:
                file_as_list.append(line)
        except EOFError:
            # truncated archive: keep the lines read so far
            pass
    return file_as_list

Python - glob.glob doesn't find *.txt in specified filepath within Unix OS

I am converting some Python scripts I wrote in a Windows environment to run in Unix (Red Hat 5.4), and I'm having trouble converting the lines that deal with filepaths. In Windows, I usually read in all .txt files within a directory using something like:
pathtotxt = "C:\\Text Data\\EJC\\Philosophical Transactions 1665-1678\\*\\*.txt"
for file in glob.glob(pathtotxt):
It seems one can use the glob.glob() method in Unix as well, so I'm trying to implement this method to find all text files within a directory entitled "source" using the following code:
#!/usr/bin/env python
import commands
import sys
import glob
import os
testout = open('testoutput.txt', 'w')
numbers = [1,2,3]
for number in numbers:
    testout.write(str(number + 1) + "\r\n")
testout.close
sourceout = open('sourceoutput.txt', 'w')
pathtosource = "/afs/crc.nd.edu/user/d/dduhaime/data/hill/source/*.txt"
for file in glob.glob(pathtosource):
    with open(file, 'r') as openfile:
        readfile = openfile.read()
        souceout.write (str(readfile))
sourceout.close
When I run this code, the testoutput.txt file comes out as expected, but the sourceoutput.txt file is empty. I thought the problem might be solved if I change the line
pathtosource = "/afs/crc.nd.edu/user/d/dduhaime/data/hill/source/*.txt"
to
pathtosource = "/source/*.txt"
and then run the code from the /hill directory, but that didn't resolve my problem. Do others know how I might be able to read in the text files in the source directory? I would be grateful for any insights others can offer.
EDIT: In case it is relevant, the /afs/ tree of directories referenced above is located on a remote server that I'm ssh-ing into via Putty. I'm also using a test.job file to qsub the Python script above. (This is all to prepare myself to submit jobs on the SGE cluster system.) The test.job script looks like:
#!/bin/csh
#$ -M dduhaime#nd.edu
#$ -m abe
#$ -r y
#$ -o tmp.out
#$ -e tmp.err
module load python/2.7.3
echo "Start - `date`"
python tmp.py
echo "Finish - `date`"

Got it! I had misspelled the output command. I wrote
souceout.write (str(readfile))
instead of
sourceout.write (str(readfile))
What a dunce. I also added a newline bit to the line:
sourceout.write (str(readfile) + "\r\n")
and it works fine. I think it's time for a new IDE!

You haven't really closed the file. The function testout.close() isn't called, because you have forgotten the parentheses. The same goes for sourceout.close()
testout.close
...
sourceout.close
Has to be:
testout.close()
...
sourceout.close()
If the program finishes, all files are closed automatically, so this only matters if you reopen the file.
Even better (the pythonic version) would be to use the with statement. Instead of this:
testout = open('testoutput.txt', 'w')
numbers = [1,2,3]
for number in numbers:
    testout.write(str(number + 1) + "\r\n")
testout.close()
you would write this:
with open('testoutput.txt', 'w') as testout:
    numbers = [1,2,3]
    for number in numbers:
        testout.write(str(number + 1) + "\r\n")
In this case the file will be automatically closed even when an error occurs.
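The second half of the question's script (collecting every matching text file into one output file) can be written with the with statement as well. This is only a sketch: concatenate is a hypothetical helper, and it assumes the destination file does not itself match the glob pattern.

```python
import glob


def concatenate(pattern, destination):
    """Append the content of every file matching `pattern` to `destination`.

    Returns the number of files copied. Both files are closed by the
    with blocks, even if reading or writing raises an error.
    """
    count = 0
    with open(destination, "w") as out:
        for name in sorted(glob.glob(pattern)):
            with open(name, "r") as source:
                out.write(source.read())
                out.write("\n")  # separate the files with a newline
            count += 1
    return count
```

A return value of 0 would indicate that the glob pattern matched nothing, which helps tell a wrong path apart from a problem with writing the output.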
