Processing Large Files in Python [ 1000 GB or More]

Processing Large Files in Python [ 1000 GB or More] - python

Lets say i have a text file of 1000 GB. I need to find how much times a phrase occurs in the text.
Is there any faster way to do this that the one i am using bellow?
How much would it take to complete the task.
phrase = "how fast it is"
count = 0
with open('bigfile.txt') as f:
for line in f:
count += line.count(phrase)
If I am right if I do not have this file in the memory i would meed to wait till the PC loads the file each time I am doing the search and this should take at least 4000 sec for a 250 MB/sec hard drive and a file of 10000 GB.

I used file.read() to read the data in chunks, in current examples the chunks were of size 100 MB, 500MB, 1GB and 2GB respectively. The size of my text file is 2.1 GB.
Code:
from functools import partial
def read_in_chunks(size_in_bytes):
s = 'Lets say i have a text file of 1000 GB'
with open('data.txt', 'r+b') as f:
prev = ''
count = 0
f_read = partial(f.read, size_in_bytes)
for text in iter(f_read, ''):
if not text.endswith('\n'):
# if file contains a partial line at the end, then don't
# use it when counting the substring count.
text, rest = text.rsplit('\n', 1)
# pre-pend the previous partial line if any.
text = prev + text
prev = rest
else:
# if the text ends with a '\n' then simple pre-pend the
# previous partial line.
text = prev + text
prev = ''
count += text.count(s)
count += prev.count(s)
print count
Timings:
read_in_chunks(104857600)
$ time python so.py
10000000
real 0m1.649s
user 0m0.977s
sys 0m0.669s
read_in_chunks(524288000)
$ time python so.py
10000000
real 0m1.558s
user 0m0.893s
sys 0m0.646s
read_in_chunks(1073741824)
$ time python so.py
10000000
real 0m1.242s
user 0m0.689s
sys 0m0.549s
read_in_chunks(2147483648)
$ time python so.py
10000000
real 0m0.844s
user 0m0.415s
sys 0m0.408s
On the other hand the simple loop version takes around 6 seconds on my system:
def simple_loop():
s = 'Lets say i have a text file of 1000 GB'
with open('data.txt') as f:
print sum(line.count(s) for line in f)
$ time python so.py
10000000
real 0m5.993s
user 0m5.679s
sys 0m0.313s
Results of #SlaterTyranus's grep version on my file:
$ time grep -o 'Lets say i have a text file of 1000 GB' data.txt|wc -l
10000000
real 0m11.975s
user 0m11.779s
sys 0m0.568s
Results of #woot's solution:
$ time cat data.txt | parallel --block 10M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m5.955s
user 0m14.825s
sys 0m5.766s
Got best timing when I used 100 MB as block size:
$ time cat data.txt | parallel --block 100M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m4.632s
user 0m13.466s
sys 0m3.290s
Results of woot's second solution:
$ time python woot_thread.py # CHUNK_SIZE = 1073741824
10000000
real 0m1.006s
user 0m0.509s
sys 0m2.171s
$ time python woot_thread.py #CHUNK_SIZE = 2147483648
10000000
real 0m1.009s
user 0m0.495s
sys 0m2.144s
System Specs: Core i5-4670, 7200 RPM HDD

Here is a Python attempt... You might need to play with the THREADS and CHUNK_SIZE. Also it's a bunch of code in a short time so I might not have thought of everything. I do overlap my buffer though to catch the ones in between, and I extend the last chunk to include the remainder of the file.
import os
import threading
INPUTFILE ='bigfile.txt'
SEARCH_STRING='how fast it is'
THREADS = 8 # Set to 2 times number of cores, assuming hyperthreading
CHUNK_SIZE = 32768
FILESIZE = os.path.getsize(INPUTFILE)
SLICE_SIZE = FILESIZE / THREADS
class myThread (threading.Thread):
def __init__(self, filehandle, seekspot):
threading.Thread.__init__(self)
self.filehandle = filehandle
self.seekspot = seekspot
self.cnt = 0
def run(self):
self.filehandle.seek( self.seekspot )
p = self.seekspot
if FILESIZE - self.seekspot < 2 * SLICE_SIZE:
readend = FILESIZE
else:
readend = self.seekspot + SLICE_SIZE + len(SEARCH_STRING) - 1
overlap = ''
while p < readend:
if readend - p < CHUNK_SIZE:
buffer = overlap + self.filehandle.read(readend - p)
else:
buffer = overlap + self.filehandle.read(CHUNK_SIZE)
if buffer:
self.cnt += buffer.count(SEARCH_STRING)
overlap = buffer[len(buffer)-len(SEARCH_STRING)+1:]
p += CHUNK_SIZE
filehandles = []
threads = []
for fh_idx in range(0,THREADS):
filehandles.append(open(INPUTFILE,'rb'))
seekspot = fh_idx * SLICE_SIZE
threads.append(myThread(filehandles[fh_idx],seekspot ) )
threads[fh_idx].start()
totalcount = 0
for fh_idx in range(0,THREADS):
threads[fh_idx].join()
totalcount += threads[fh_idx].cnt
print totalcount

Have you looked at using parallel / grep?
cat bigfile.txt | parallel --block 10M --pipe grep -o 'how\ fast\ it\ is' | wc -l

Had you considered indexing your file? The way search engine works is by creating a mapping from words to the location they are in the file. Say if you have this file:
Foo bar baz dar. Dar bar haa.
You create an index that looks like this:
{
"foo": {0},
"bar": {4, 21},
"baz": {8},
"dar": {12, 17},
"haa": {25},
}
A hashtable index can be looked up in O(1); so it's freaking fast.
And someone searches for the query "bar baz" you first break the query into its constituent words: ["bar", "baz"] and you then found {4, 21}, {8}; then you use this to jump out right to the places where the queried text could possible exists.
There are out of the box solutions for indexed search engines as well; for example Solr or ElasticSearch.

Going to suggest doing this with grep instead of python. Will be faster, and generally if you're dealing with 1000GB of text on your local machine you've done something wrong, but all judgements aside, grep comes with a couple of options that will make your life easier.
grep -o '<your_phrase>' bigfile.txt|wc -l
Specifically this will count the number of lines in which your desired phrase appears. This should also count multiple occurrences on a single line.
If you don't need that you could instead do something like this:
grep -c '<your_phrase>' bigfile.txt

We're talking about a simple count of a specific substring within a rather large data stream. The task is nearly certainly I/O bound, but very easily parallelised. The first layer is the raw read speed; we can choose to reduce the read amount by using compression, or distribute the transfer rate by storing the data in multiple places. Then we have the search itself; substring searches are a well known problem, again I/O limited. If the data set comes from a single disk pretty much any optimisation is moot, as there's no way that disk beats a single core in speed.
Assuming we do have chunks, which might for instance be the separate blocks of a bzip2 file (if we use a threaded decompressor), stripes in a RAID, or distributed nodes, we have much to gain from processing them individually. Each chunk is searched for needle, then joints can be formed by taking len(needle)-1 from the end of one chunk and beginning of the next, and searching within those.
A quick benchmark demonstrates that the regular expression state machines operate faster than the usual in operator:
>>> timeit.timeit("x.search(s)", "s='a'*500000; import re; x=re.compile('foobar')", number=20000)
17.146117210388184
>>> timeit.timeit("'foobar' in s", "s='a'*500000", number=20000)
24.263535976409912
>>> timeit.timeit("n in s", "s='a'*500000; n='foobar'", number=20000)
21.562405109405518
Another step of optimization we can perform, given that we have the data in a file, is to mmap it instead of using the usual read operations. This permits the operating system to use the disk buffers directly. It also allows the kernel to satisfy multiple read requests in arbitrary order without making extra system calls, which lets us exploit things like an underlying RAID when operating in multiple threads.
Here's a quickly tossed together prototype. A few things could obviously be improved, such as distributing the chunk processes if we have a multinode cluster, doing the tail+head check by passing one to the neighboring worker (an order which is not known in this implementation) instead of sending both to a special worker, and implementing an interthread limited queue (pipe) class instead of matching semaphores. It would probably also make sense to move the worker threads outside of the main thread function, since the main thread keeps altering its locals.
from mmap import mmap, ALLOCATIONGRANULARITY, ACCESS_READ
from re import compile, escape
from threading import Semaphore, Thread
from collections import deque
def search(needle, filename):
# Might want chunksize=RAID block size, threads
chunksize=ALLOCATIONGRANULARITY*1024
threads=32
# Read chunk allowance
allocchunks=Semaphore(threads) # should maybe be larger
chunkqueue=deque() # Chunks mapped, read by workers
chunksready=Semaphore(0)
headtails=Semaphore(0) # edges between chunks into special worker
headtailq=deque()
sumq=deque() # worker final results
# Note: although we do push and pop at differing ends of the
# queues, we do not actually need to preserve ordering.
def headtailthread():
# Since head+tail is 2*len(needle)-2 long,
# it cannot contain more than one needle
htsum=0
matcher=compile(escape(needle))
heads={}
tails={}
while True:
headtails.acquire()
try:
pos,head,tail=headtailq.popleft()
except IndexError:
break # semaphore signaled without data, end of stream
try:
prevtail=tails.pop(pos-chunksize)
if matcher.search(prevtail+head):
htsum+=1
except KeyError:
heads[pos]=head
try:
nexthead=heads.pop(pos+chunksize)
if matcher.search(tail+nexthead):
htsum+=1
except KeyError:
tails[pos]=tail
# No need to check spill tail and head as they are shorter than needle
sumq.append(htsum)
def chunkthread():
threadsum=0
# escape special characters to achieve fixed string search
matcher=compile(escape(needle))
borderlen=len(needle)-1
while True:
chunksready.acquire()
try:
pos,chunk=chunkqueue.popleft()
except IndexError: # End of stream
break
# Let the re module do the heavy lifting
threadsum+=len(matcher.findall(chunk))
if borderlen>0:
# Extract the end pieces for checking borders
head=chunk[:borderlen]
tail=chunk[-borderlen:]
headtailq.append((pos,head,tail))
headtails.release()
chunk.close()
allocchunks.release() # let main thread allocate another chunk
sumq.append(threadsum)
with infile=open(filename,'rb'):
htt=Thread(target=headtailthread)
htt.start()
chunkthreads=[]
for i in range(threads):
t=Thread(target=chunkthread)
t.start()
chunkthreads.append(t)
pos=0
fileno=infile.fileno()
while True:
allocchunks.acquire()
chunk=mmap(fileno, chunksize, access=ACCESS_READ, offset=pos)
chunkqueue.append((pos,chunk))
chunksready.release()
pos+=chunksize
if pos>chunk.size(): # Last chunk of file?
break
# File ended, finish all chunks
for t in chunkthreads:
chunksready.release() # wake thread so it finishes
for t in chunkthreads:
t.join() # wait for thread to finish
headtails.release() # post event to finish border checker
htt.join()
# All threads finished, collect our sum
return sum(sumq)
if __name__=="__main__":
from sys import argv
print "Found string %d times"%search(*argv[1:])
Also, modifying the whole thing to use some mapreduce routine (map chunks to counts, heads and tails, reduce by summing counts and checking tail+head parts) is left as an exercise.
Edit: Since it seems this search will be repeated with varying needles, an index would be much faster, being able to skip searches of sections that are known not to match. One possibility is making a map of which blocks contain any occurence of various n-grams (accounting for the block borders by allowing the ngram to overlap into the next); those maps can then be combined to find more complex conditions, before the blocks of original data need to be loaded. There are certainly databases to do this; look for full text search engines.

Here is a third, longer method that uses a database. The database is sure to be larger than the text. I am not sure about if the indexes is optimal, and some space savings could come from playing with that a little. (like, maybe WORD, and POS, WORD are better, or perhaps WORD, POS is just fine, need to experiment a little).
This may not perform well on 200 OK's test though because it is a lot of repeating text, but might perform well on more unique data.
First create a database by scanning the words, etc:
import sqlite3
import re
INPUT_FILENAME = 'bigfile.txt'
DB_NAME = 'words.db'
FLUSH_X_WORDS=10000
conn = sqlite3.connect(DB_NAME)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS WORDS (
POS INTEGER
,WORD TEXT
,PRIMARY KEY( POS, WORD )
) WITHOUT ROWID
""")
cursor.execute("""
DROP INDEX IF EXISTS I_WORDS_WORD_POS
""")
cursor.execute("""
DROP INDEX IF EXISTS I_WORDS_POS_WORD
""")
cursor.execute("""
DELETE FROM WORDS
""")
conn.commit()
def flush_words(words):
for word in words.keys():
for pos in words[word]:
cursor.execute('INSERT INTO WORDS (POS, WORD) VALUES( ?, ? )', (pos, word.lower()) )
conn.commit()
words = dict()
pos = 0
recomp = re.compile('\w+')
with open(INPUT_FILENAME, 'r') as f:
for line in f:
for word in [x.lower() for x in recomp.findall(line) if x]:
pos += 1
if words.has_key(word):
words[word].append(pos)
else:
words[word] = [pos]
if pos % FLUSH_X_WORDS == 0:
flush_words(words)
words = dict()
if len(words) > 0:
flush_words(words)
words = dict()
cursor.execute("""
CREATE UNIQUE INDEX I_WORDS_WORD_POS ON WORDS ( WORD, POS )
""")
cursor.execute("""
CREATE UNIQUE INDEX I_WORDS_POS_WORD ON WORDS ( POS, WORD )
""")
cursor.execute("""
VACUUM
""")
cursor.execute("""
ANALYZE WORDS
""")
Then search the database by generating SQL:
import sqlite3
import re
SEARCH_PHRASE = 'how fast it is'
DB_NAME = 'words.db'
conn = sqlite3.connect(DB_NAME)
cursor = conn.cursor()
recomp = re.compile('\w+')
search_list = [x.lower() for x in recomp.findall(SEARCH_PHRASE) if x]
from_clause = 'FROM\n'
where_clause = 'WHERE\n'
num = 0
fsep = ' '
wsep = ' '
for word in search_list:
num += 1
from_clause += '{fsep}words w{num}\n'.format(fsep=fsep,num=num)
where_clause += "{wsep} w{num}.word = '{word}'\n".format(wsep=wsep, num=num, word=word)
if num > 1:
where_clause += " AND w{num}.pos = w{lastnum}.pos + 1\n".format(num=str(num),lastnum=str(num-1))
fsep = ' ,'
wsep = ' AND'
sql = """{select}{fromc}{where}""".format(select='SELECT COUNT(*)\n',fromc=from_clause, where=where_clause)
res = cursor.execute( sql )
print res.fetchone()[0]

I concede that grep will be be faster. I assume this file is a large string based file.
But you could do something like this if you really really wanted.
import os
import re
import mmap
fileName = 'bigfile.txt'
phrase = re.compile("how fast it is")
with open(fileName, 'r') as fHandle:
data = mmap.mmap(fHandle.fileno(), os.path.getsize(fileName), access=mmap.ACCESS_READ)
matches = re.match(phrase, data)
print('matches = {0}'.format(matches.group()))

Related

Is there a faster way to clean out control characters in a file?

Previously, I had been cleaning out data using the code snippet below
import unicodedata, re, io
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s): # see http://www.unicode.org/reports/tr44/#General_Category_Values
return cc_re.sub('', s)
cleanfile = []
with io.open('filename.txt', 'r', encoding='utf8') as fin:
for line in fin:
line =rm_control_chars(line)
cleanfile.append(line)
There are newline characters in the file that i want to keep.
The following records the time taken for cc_re.sub('', s) to substitute the first few lines (1st column is the time taken and 2nd column is len(s)):
0.275146961212 251
0.672796010971 614
0.178567171097 163
0.200030088425 180
0.236430883408 215
0.343492984772 313
0.317672967911 290
0.160616159439 142
0.0732028484344 65
0.533437013626 468
0.260229110718 236
0.231380939484 204
0.197766065598 181
0.283867120743 258
0.229172945023 208
As #ashwinichaudhary suggested, using s.translate(dict.fromkeys(control_chars)) and the same time taken log outputs:
0.464188098907 252
0.366552114487 615
0.407374858856 164
0.322507858276 181
0.35142993927 216
0.319973945618 314
0.324357032776 291
0.371646165848 143
0.354818105698 66
0.351796150208 469
0.388131856918 237
0.374715805054 205
0.363368988037 182
0.425950050354 259
0.382766962051 209
But the code is really slow for my 1GB of text. Is there any other way to clean out controlled characters?

found a solution working character by charater, I bench marked it using a 100K file:
import unicodedata, re, io
from time import time
# This is to generate randomly a file to test the script
from string import lowercase
from random import random
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = [c for c in all_chars if unicodedata.category(c)[0] == 'C']
chars = (list(u'%s' % lowercase) * 115117) + control_chars
fnam = 'filename.txt'
out=io.open(fnam, 'w')
for line in range(1000000):
out.write(u''.join(chars[int(random()*len(chars))] for _ in range(600)) + u'\n')
out.close()
# version proposed by alvas
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s):
return cc_re.sub('', s)
t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
for line in fin:
line =rm_control_chars(line)
cleanfile.append(line)
out=io.open(fnam + '_out1.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0
# using a set and checking character by character
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = set(c for c in all_chars if unicodedata.category(c)[0] == 'C')
def rm_control_chars_1(s):
return ''.join(c for c in s if not c in control_chars)
t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
for line in fin:
line = rm_control_chars_1(line)
cleanfile.append(line)
out=io.open(fnam + '_out2.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0
the output is:
114.625444174
0.0149750709534
I tried on a file of 1Gb (only for the second one) and it lasted 186s.
I also wrote this other version of the same script, slightly faster (176s), and more memory efficient (for very large files not fitting in RAM):
t0 = time()
out=io.open(fnam + '_out5.txt', 'w')
with io.open(fnam, 'r', encoding='utf8') as fin:
for line in fin:
out.write(rm_control_chars_1(line))
out.close()
print time() - t0

As in UTF-8, all control characters are coded in 1 byte (compatible with ASCII) and bellow 32, I suggest this fast piece of code:
#!/usr/bin/python
import sys
ctrl_chars = [x for x in range(0, 32) if x not in (ord("\r"), ord("\n"), ord("\t"))]
filename = sys.argv[1]
with open(filename, 'rb') as f1:
with open(filename + '.txt', 'wb') as f2:
b = f1.read(1)
while b != '':
if ord(b) not in ctrl_chars:
f2.write(b)
b = f1.read(1)
Is it ok enough?

Does this have to be in python? How about cleaning the file before you read it in python to start with. Use sed which will treat it line by line anyway.
See removing control characters using sed.
and if you pipe it out to another file you can open that. I don't know how fast it would be though. You can do it in a shell script and test it. according to this page - sed is 82M characters per second.
Hope it helps.

If you want it to move really fast? Break your input into multiple chunks, wrap up that data munging code as a method, and use Python's multiprocessing package to parallelize it, writing to some common text file. Going character-by-character is the easiest method to crunch stuff like this, but it always takes a while.
https://docs.python.org/3/library/multiprocessing.html

I'm surprised no one has mentioned mmap which might just be the right fit here.
Note: I'll put this in as an answer in case it's useful and apologize that I don't have the time to actually test and compare it right now.
You load the file into memory (kind of) and then you can actually run a re.sub() over the object. This helps eliminate the IO bottleneck and allows you to change the bytes in-place before writing it back at once.
After this, then, you can experiment with str.translate() vs re.sub() and also include any further optimisations like double buffering CPU and IO or using multiple CPU cores/threads.
But it'll look something like this;
import mmap
f = open('test.out', 'r')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
A nice excerpt from the mmap documentation is;
..You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a',..

A couple of things I would try.
First, do the substitution with a replace all regex.
Second, setup a regex char class with known control char ranges instead
of a class of individual control char's.
(This is incase the engine doesn't optimize it to ranges.
A range requires two conditionals on the assembly level,
as opposed to individual conditional on each char in the class)
Third, since you are removing the characters, add a greedy quantifier
after the class. This will negate the necessity to enter into substitution
subroutines after each single char match, instead grabbing all adjacent chars
as needed.
I don't know pythons syntax for regex constructs off the top of my head,
nor all the control codes in Unicode, but the result would look something
like this:
[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+
The largest amount of time would be in copying the results to another string.
The smallest amount of time would be in finding all the control codes, which
would be miniscule.
All things being equal, the regex (as described above) is the fastest way to go.

Copy pieces of data from a .txt into another file for a spreadsheet

I have a bunch of data in .txt file and I need it in a format that I can use in fusion tables/spreadsheet. I assume that that format would be a csv that I can write into another file that I can then import into a spreadsheet to work with.
The data is in this format with multiple entries separated by a blank line.
Start Time
8/18/14, 11:59 AM
Duration
15 min
Start Side
Left
Fed on Both Sides
No
Start Time
8/18/14, 8:59 AM
Duration
13 min
Start Side
Right
Fed on Both Sides
No
(etc.)
but I need it ultimately in this format (or whatever i can use to get it into a spreadsheet)
StartDate, StartTime, Duration, StartSide, FedOnBothSides
8/18/14, 11:59 AM, 15, Left, No
- , -, -, -, -
The problems I have come across are:
-I don't need all the info or every line but i'm not sure how to automatically separate them. I don't even know if the way I am going about sorting each line is smart
-I have been getting an error that says that "argument 1 must be string or read-only character buffer, not list" when I use .read() or .readlines() sometimes (although it did work at first). also both of my arguments are .txt files.
-the dates and times are not in set formats with regular lengths (it has 8/4/14, 5:14 AM instead of 08/04/14, 05:14 AM) which I'm not sure how to deal with
this is what I have tried so far
from sys import argv
from os.path import exists
def filework():
script, from_file, to_file = argv
print "copying from %s to %s" % (from_file, to_file)
in_file = open(from_file)
indata = in_file.readlines() #.read() .readline .readlines .read().splitline .xreadlines
print "the input file is %d bytes long" % len(indata)
print "does the output file exist? %r" % exists(to_file)
print "ready, hit RETURN to continue, CTRL-C to abort."
raw_input()
#do stuff section----------------BEGIN
for i in indata:
if i == "Start Time":
pass #do something
elif i== '{date format}':
pass #do something
else:
pass #do something
#do stuff section----------------END
out_file = open(to_file, 'w')
out_file.write(indata)
print "alright, all done."
out_file.close()
in_file.close()
filework()
So I'm relatively unversed in scripts like this that have multiple complex parts. Any help and suggestions would be greatly appreciated. Sorry if this is a jumble.
Thanks

This code should work, although its not exactly optimal, but I'm sure you'll figure out how to make it better!
What this code basically does is:
Get all the lines from the input data
Loop through all the lines, and try to recognize different keys (the start time etc)
If a keys is recognize, get the line beneath it, and apply a appropriate function to it
If a new line is found, add the current entry to a list, so that other entries can be read
Write the data to a file
Incase you haven't seen string formatting being done this way before:
"{0:} {1:}".format(arg0, arg1), the {0:} is just a way of defining a placeholder for a variable(here: arg0), and the 0 just defines which arguments to use.
Find out more here:
Python .format docs
Python OrderedDict docs
If you are using a version of python < 2.7, you might have to install a other version of ordereddicts by using pip install ordereddict. If that doesn't work, just change data = OrderedDict() to data = {}, and it should work. But then the output will look somewhat different each time it is generated, but it will still be correct.
from sys import argv
from os.path import exists
# since we want to have a somewhat standardized format
# and dicts are unordered by default
try:
from collections import OrderedDict
except ImportError:
# python 2.6 or earlier, use backport
from ordereddict import OrderedDict
def get_time_and_date(time):
date, time = time.split(",")
time, time_indic = time.split()
date = pad_time(date)
time = "{0:} {1:}".format(pad_time(time), time_indic)
return time, date
"""
Make all the time values look the same, ex turn 5:30 AM into 05:30 AM
"""
def pad_time(time):
# if its time
if ":" in time:
separator = ":"
# if its a date
else:
separator = "/"
time = time.split(separator)
for index, num in enumerate(time):
if len(num) < 2:
time[index] = "0" + time[index]
return separator.join(time)
def filework():
from_file, to_file = argv[1:]
data = OrderedDict()
print "copying from %s to %s" % (from_file, to_file)
# by using open(...) the file closes automatically
with open(from_file, "r") as inputfile:
indata = inputfile.readlines()
entries = []
print "the input file is %d bytes long" % len(indata)
print "does the output file exist? %r" % exists(to_file)
print "ready, hit RETURN to continue, CTRL-C to abort."
raw_input()
for line_num in xrange(len(indata)):
# make the entire string lowercase to be more flexible,
# and then remove whitespace
line_lowered = indata[line_num].lower().strip()
if "start time" == line_lowered:
time, date = get_time_and_date(indata[line_num+1].strip())
data["StartTime"] = time
data["StartDate"] = date
elif "duration" == line_lowered:
duration = indata[line_num+1].strip().split()
# only keep the amount of minutes
data["Duration"] = duration[0]
elif "start side" == line_lowered:
data["StartSide"] = indata[line_num+1].strip()
elif "fed on both sides" == line_lowered:
data["FedOnBothSides"] = indata[line_num+1].strip()
elif line_lowered == "":
# if a blank line is found, prepare for reading a new entry
entries.append(data)
data = OrderedDict()
entries.append(data)
# create the outfile if it does not exist
with open(to_file, "w+") as outfile:
headers = entries[0].keys()
outfile.write(", ".join(headers) + "\n")
for entry in entries:
outfile.write(", ".join(entry.values()) + "\n")
filework()

Biopython Large Sequence splitting

I'm a newbie in the field of python programming. As I was trying to do some analysis,(I've tried to find the answer on other posts, but nothing) I decided to post my first and probably very foolish question. Why does this create only one output file although in this example there were supposed to be at least 8 (sequence is more than 8000 characters).
Thank you for your answer upfront.
def batch_iterator(iterator, batch_size) :
entry = True
while entry :
batch = []
while len(batch) < batch_size :
try :
entry = iterator.next()
except StopIteration :
entry = None
if entry is None :
#End of file
break
batch.append(entry)
if batch :
yield batch
from Bio import SeqIO
record_iter = SeqIO.parse(open("some.fasta"),"fasta")
for i, batch in enumerate(batch_iterator(record_iter, 1000)) : #I think sth is wrong here?
filename = "group_%i.fasta" % (i+1)
handle = open(filename, "w")
count = SeqIO.write(batch, handle, "fasta")
handle.close()
print "Wrote %i records to %s" % (count, filename)

Sequence chunks
After a long discussion with the OP, here is my very restructured proposal, using the generator function defined in this other SO thread
# file: main.py
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
def chunks(l, n):
"""Yield successive n-sized chunks from l."""
for i in xrange(0, len(l), n):
yield l[i:i+n]
if __name__ == '__main__':
handle = open('long.fasta', 'r')
records = list(SeqIO.parse(handle, "fasta"))
record = records[0]
for pos, chunk in enumerate(chunks(record.seq.tostring(), 1000)):
chunk_record = SeqRecord(Seq(
chunk, record.seq.alphabet),
id=record.id, name=record.name,
description=record.description)
outfile = "group_%d.fasta" % pos
SeqIO.write(chunk_record, open(outfile, 'w'), "fasta")
Note that your original code does something very different: it takes new records from the generator provided by the SeqIO.parse function, and tries to store them in different files. If you want to split a single record in smaller sub-sequences, you have to access the record's internal data, which is done by record.seq.tostring(). The chunks generator function, as described in the other thread linked above, returns as many chunks as is possible to build from the passed in sequence. Each of them is stored as a new fasta record in a different file (if you want to keep just the sequence, write the chunk directly to the opened outfile).
Check that it works
Consider the following code:
# file: generate.py
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio import SeqIO
long_string = "A" * 8000
outfile = open('long.fasta', 'w')
record = SeqRecord(Seq(
long_string,
IUPAC.protein),
id="YP_025292.1", name="HokC",
description="toxic membrane protein, small")
SeqIO.write(record, outfile, "fasta")
It writes a single record to a file named "long.fasta". This single record has a Sequence inside that is 8000 characters long, as generated in long_string.
How to use it:
$ python generate.py
$ wc -c long.fasta
8177 long.fasta
The overhead over 8000 characters is the file header.
How to split that file in chunks of 1000 length each, with the code snippet above:
$ python main.py
$ ls
generate.py group_1.fasta group_3.fasta group_5.fasta group_7.fasta main.py
group_0.fasta group_2.fasta group_4.fasta group_6.fasta long.fasta
$ wc -c group_*
1060 group_0.fasta
1060 group_1.fasta
1060 group_2.fasta
1060 group_3.fasta
1060 group_4.fasta
1060 group_5.fasta
1060 group_6.fasta
1060 group_7.fasta
8480 total

Python text file processing speed issues

I'm having a problem with processing a largeish file in Python. All I'm doing is
f = gzip.open(pathToLog, 'r')
for line in f:
counter = counter + 1
if (counter % 1000000 == 0):
print counter
f.close
This takes around 10m25s just to open the file, read the lines and increment this counter.
In perl, dealing with the same file and doing quite a bit more (some regular expression stuff), the whole process takes around 1m17s.
Perl Code:
open(LOG, "/bin/zcat $logfile |") or die "Cannot read $logfile: $!\n";
while (<LOG>) {
if (m/.*\[svc-\w+\].*login result: Successful\.$/) {
$_ =~ s/some regex here/$1,$2,$3,$4/;
push #an_array, $_
}
}
close LOG;
Can anyone advise what I can do to make the Python solution run at a similar speed to the Perl solution?
EDIT
I've tried just uncompressing the file and dealing with it using open instead of gzip.open, but that only changes the total time to around 4m14.972s, which is still too slow.
I also removed the modulo and print statements and replaced them with pass, so all that is being done now is moving from file to file.

In Python (at least <= 2.6.x), gzip format parsing is implemented in Python (over zlib). More, it appears to be doing some strange things, namely, decompress to the end of file to memory and then discard everything beyond the requested read size (then do it again for next read). DISCLAIMER: I've just looked at gzip.read() for 3 minutes, so I can be wrong here. Regardless of whether my understanding of gzip.read() is correct or not, gzip module appears to be not optimized for large data volumes. Try doing the same thing as in Perl, i.e. launching an external process (e.g. see module subprocess).
EDIT
Actually, I missed the OP's remark about plain file I/O being just as slow as compressed (thanks to ire_and_curses for pointing it out). This striken me as unlikely, so I did some measurements...
from timeit import Timer
def w(n):
L = "*"*80+"\n"
with open("ttt", "w") as f:
for i in xrange(n) :
f.write(L)
def r():
with open("ttt", "r") as f:
for n,line in enumerate(f) :
if n % 1000000 == 0 :
print n
def g():
f = gzip.open("ttt.gz", "r")
for n,line in enumerate(f) :
if n % 1000000 == 0 :
print n
Now, running it...
>>> Timer("w(10000000)", "from __main__ import w").timeit(1)
14.153118133544922
>>> Timer("r()", "from __main__ import r").timeit(1)
1.6482770442962646
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
...and after having a tea break and discovering that it's still running, I've killed it, sorry. Then I tried 100'000 lines instead of 10'000'000:
>>> Timer("w(100000)", "from __main__ import w").timeit(1)
0.05810999870300293
>>> Timer("r()", "from __main__ import r").timeit(1)
0.09662318229675293
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
11.939290046691895
Module gzip's time is O(file_size**2), so with number of lines on the order of millions, gzip read time just cannot be the same as plain read time (as we see confirmed by an experiment). Anonymouslemming, please check again.

If you Google "why is python gzip slow" you'll find plenty of discussion of this, including patches for improvements in Python 2.7 and 3.2. In the meantime, use zcat as you did in Perl which is wicked fast. Your (first) function takes me about 4.19s with a 5MB compressed file, and the second function takes 0.78s. However, I don't know what's going on with your uncompressed files. If I uncompress the log files (apache logs) and run the two function on them with a simple Python open(file), and Popen('cat'), Python is faster (0.17s) than cat (0.48s).
#!/usr/bin/python
import gzip
from subprocess import PIPE, Popen
import sys
import timeit
#pathToLog = 'big.log.gz' # 50M compressed (*10 uncompressed)
pathToLog = 'small.log.gz' # 5M ""
def test_ori():
counter = 0
f = gzip.open(pathToLog, 'r')
for line in f:
counter = counter + 1
if (counter % 100000 == 0): # 1000000
print counter, line
f.close
def test_new():
counter = 0
content = Popen(["zcat", pathToLog], stdout=PIPE).communicate()[0].split('\n')
for line in content:
counter = counter + 1
if (counter % 100000 == 0): # 1000000
print counter, line
if '__main__' == __name__:
to = timeit.Timer('test_ori()', 'from __main__ import test_ori')
print "Original function time", to.timeit(1)
tn = timeit.Timer('test_new()', 'from __main__ import test_new')
print "New function time", tn.timeit(1)

I spent a while on this. Hopefully this code will do the trick. It uses zlib and no external calls.
The gunzipchunks method reads the compressed gzip file in chunks which can be iterated over (generator).
The gunziplines method reads these uncompressed chunks and provides you with one line at a time which can also be iterated over (another generator).
Finally, the gunziplinescounter method gives you what you're looking for.
Cheers!
import zlib
file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'
#for i in gunzipchunks(file_name): print i
def gunzipchunks(file_name,chunk_size=4096):
inflator = zlib.decompressobj(16+zlib.MAX_WBITS)
f = open(file_name,'rb')
while True:
packet = f.read(chunk_size)
if not packet: break
to_do = inflator.unconsumed_tail + packet
while to_do:
decompressed = inflator.decompress(to_do, chunk_size)
if not decompressed:
to_do = None
break
yield decompressed
to_do = inflator.unconsumed_tail
leftovers = inflator.flush()
if leftovers: yield leftovers
f.close()
#for i in gunziplines(file_name): print i
def gunziplines(file_name,leftovers="",line_ending='\n'):
for chunk in gunzipchunks(file_name):
chunk = "".join([leftovers,chunk])
while line_ending in chunk:
line, leftovers = chunk.split(line_ending,1)
yield line
chunk = leftovers
if leftovers: yield leftovers
def gunziplinescounter(file_name):
for counter,line in enumerate(gunziplines(file_name)):
if (counter % 1000000 != 0): continue
print "%12s: %10d" % ("checkpoint", counter)
print "%12s: %10d" % ("final result", counter)
print "DEBUG: last line: [%s]" % (line)
gunziplinescounter(file_name)
This should run a whole lot faster than using the builtin gzip module on extremely large files.

It took your computer 10 minutes? It must be your hardware. I wrote this function to write 5 million lines:
def write():
fout = open('log.txt', 'w')
for i in range(5000000):
fout.write(str(i/3.0) + "\n")
fout.close
Then I read it with a program much like yours:
def read():
fin = open('log.txt', 'r')
counter = 0
for line in fin:
counter += 1
if counter % 1000000 == 0:
print counter
fin.close
It took my computer about 3 seconds to read all 5 million lines.

Try using StringIO to buffer the output from the gzip module. The following code to read a gzipped pickle cut the execution time of my code by well over 90%.
Instead of...
import cPickle
# Use gzip to open/read the pickle.
lPklFile = gzip.open("test.pkl", 'rb')
lData = cPickle.load(lPklFile)
lPklFile.close()
Use...
import cStringIO, cPickle
# Use gzip to open the pickle.
lPklFile = gzip.open("test.pkl", 'rb')
# Copy the pickle into a cStringIO.
lInternalFile = cStringIO.StringIO()
lInternalFile.write(lPklFile.read())
lPklFile.close()
# Set the seek position to the start of the StringIO, and read the
# pickled data from it.
lInternalFile.seek(0, os.SEEK_SET)
lData = cPickle.load(lInternalFile)
lInternalFile.close()

Search and sort data from several files

I have a set of 1000 text files with names in_s1.txt, in_s2.txt and so. Each file contains millions of rows and each row has 7 columns like:
ccc245 1 4 5 5 3 -12.3
For me the most important is the values from the first and seventh columns; the pairs ccc245 , -12.3
What I need to do is to find between all the in_sXXXX.txt files, the 10 cases with the lowest values of the seventh column value, and I also need to get where each value is located, in which file. I need something like:
FILE 1st_col 7th_col
in_s540.txt ccc3456 -9000.5
in_s520.txt ccc488 -723.4
in_s12.txt ccc34 -123.5
in_s344.txt ccc56 -45.6
I was thinking about using python and bash for this purpose but at the moment I did not find a practical approach. All what I know to do is:
concatenate all in_ files in IN.TXT
search the lowest values there using: for i in IN.TXT ; do sort -k6n $i | head -n 10; done
given the 1st_col and 7th_col values of the top ten list, use them to filter the in_s files, using grep -n VALUE in_s*, so I get for each value the name of the file
It works but it is a bit tedious. I wonder about a faster approach only using bash or python or both. Or another better language for this.
Thanks

In python, use the nsmallest function in the heapq module -- it's designed for exactly this kind of task.
Example (tested) for Python 2.5 and 2.6:
import heapq, glob
def my_iterable():
for fname in glob.glob("in_s*.txt"):
f = open(fname, "r")
for line in f:
items = line.split()
yield fname, items[0], float(items[6])
f.close()
result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
print result
Update after above answer accepted
Looking at the source code for Python 2.6, it appears that there's a possibility that it does list(iterable) and works on that ... if so, that's not going to work with a thousand files each with millions of lines. If the first answer gives you MemoryError etc, here's an alternative which limits the size of the list to n (n == 10 in your case).
Note: 2.6 only; if you need it for 2.5 use a conditional heapreplace() as explained in the docs. Uses heappush() and heappushpop() which don't have the key arg :-( so we have to fake it.
import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp
def my_iterable():
for fname in glob.glob("in_s*.txt"):
f = open(fname, "r")
for line in f:
items = line.split()
yield -float(items[6]), fname, items[0]
f.close()
def homegrown_nlargest(n, iterable):
"""Ensures heap never has more than n entries"""
heap = []
for item in iterable:
if len(heap) < n:
heappush(heap, item)
else:
heappushpop(heap, item)
return heap
result = homegrown_nlargest(10, my_iterable())
result = sorted(result, reverse=True)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)

I would:
take first 10 items,
sort them and then
for every line read from files insert the element into those top10:
in case its value is lower than highest one from current top10,
(keeping the sorting for performance)
I wouldn't post the complete program here as it looks like homework.
Yes, if it wasn't ten, this would be not optimal

Try something like this in python:
min_values = []
def add_to_min(file_name, one, seven):
# checks to see if 7th column is a lower value than exiting values
if len(min_values) == 0 or seven < max(min_values)[0]:
# let's remove the biggest value
min_values.sort()
if len(min_values) != 0:
min_values.pop()
# and add the new value tuple
min_values.append((seven, file_name, one))
# loop through all the files
for file_name in os.listdir(<dir>):
f = open(file_name)
for line in file_name.readlines():
columns = line.split()
add_to_min(file_name, columns[0], float(columns[6]))
# print answers
for (seven, file_name, one) in min_values:
print file_name, one, seven
Haven't tested it, but it should get you started.
Version 2, just runs the sort a single time (after a prod by S. Lott):
values = []
# loop through all the files and make a long list of all the rows
for file_name in os.listdir(<dir>):
f = open(file_name)
for line in file_name.readlines():
columns = line.split()
values.append((file_name, columns[0], float(columns[6]))
# sort values, print the 10 smallest
values.sort()
for (seven, file_name, one) in values[:10]
print file_name, one, seven
Just re-read you question, with millions of rows, you might run out of RAM....

A small improvement of your shell solution:
$ cat in.txt
in_s1.txt
in_s2.txt
...
$ cat in.txt | while read i
do
cat $i | sed -e "s/^/$i /" # add filename as first column
done |
sort -n -k8 | head -10 | cut -d" " -f1,2,8

This might be close to what you're looking for:
for file in *; do sort -k6n "$file" | head -n 10 | cut -f1,7 -d " " | sed "s/^/$file /" > "${file}.out"; done
cat *.out | sort -k3n | head -n 10 > final_result.out

If your files are million lines, you might want to consider using "buffering". the below script goes through those million lines, each time comparing field 7 with those in the buffer. If a value is smaller than those in the buffer, one of them in buffer is replaced by the new lower value.
for file in in_*.txt
do
awk -vt=$t 'NR<=10{
c=c+1
val[c]=$7
tag[c]=$1
}
NR>10{
for(o=1;o<=c;o++){
if ( $7 <= val[o] ){
val[o]=$7
tag[o]=$1
break
}
}
}
END{
for(i=1;i<=c;i++){
print val[i], tag[i] | "sort"
}
}' $file
done

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Processing Large Files in Python [ 1000 GB or More] - python

Have you looked at using parallel / grep? cat bigfile.txt | parallel --block 10M --pipe grep -o 'how\ fast\ it\ is' | wc -l

Related

Is there a faster way to clean out control characters in a file?

Copy pieces of data from a .txt into another file for a spreadsheet

Biopython Large Sequence splitting

Python text file processing speed issues

Search and sort data from several files

Categories

Resources