Reading a large text file in Python kills my program

I wrote a Python script to read in a text file and put its information in a dictionary. The original text file is 2.6 GB and contains 56981936 lines; of these, I only need to link every first line to its second and fourth lines in a dictionary. I recently switched from Windows (where this program ran fine) to Linux, where it keeps getting killed. Does somebody have any idea why?
The text is a FASTQ file, which has repeating four-line records in the following format:
#xxxxxxxxxxx
CTTCTCAACTC
+
AAAEE#AEE#A
This is my original code:
def createReverseDict(backwardsFile):
    reverseDict = {}
    with open(backwardsFile) as f3:
        while True:
            label = f3.readline().rstrip()
            if not label:
                break
            sequence = f3.readline().rstrip()
            next(f3)
            score = f3.readline().rstrip()
            reverseDict[label] = {"sequence": sequence,
                                  "score": score}
    return reverseDict

I created a small file that's 100 MB (much smaller than yours but with a similar format) and used namedtuple to improve the memory performance. I also used tracemalloc to measure roughly how much memory was used.
import tracemalloc
tracemalloc.start()
createReverseDict("final.txt")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()
With regular dictionaries, i.e. reverseDict[label] = dict(sequence=sequence, score=score), my run reported:
Current memory usage is 0.020237MB; Peak was 1210.024309MB
Using collections.namedtuple reduces the peak quite a bit, almost by half. Add
import collections
item = collections.namedtuple('item', ['seq', 'sco'])
and
reverseDict[label] = item(sequence, score)
inside the function. The memory usage dropped to
Current memory usage is 0.003773MB; Peak was 760.651005MB
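For reference, here is roughly how the whole function looks with those two changes applied (a sketch assembled from the snippets above; untested against a full 2.6 GB file):

import collections

item = collections.namedtuple('item', ['seq', 'sco'])

def createReverseDict(backwardsFile):
    reverseDict = {}
    with open(backwardsFile) as f3:
        while True:
            label = f3.readline().rstrip()
            if not label:
                break
            sequence = f3.readline().rstrip()
            next(f3)                     # skip the "+" separator line
            score = f3.readline().rstrip()
            reverseDict[label] = item(sequence, score)   # namedtuple instead of a dict
    return reverseDict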
If you do this, perhaps the OOM killer will not kick in at all. If it does, you can use the details in this link to raise the OOM limits, or effectively disable it by turning off memory overcommit with
sysctl vm.overcommit_memory=2
I cannot vouch for how safe this is, but you should be able to try it and, once you're done, reboot the system to get it back to the defaults.

Related

Randomized-offset binary raw disk writes with no caching whatsoever

For my application, I am attempting to determine whether a data backup system missed any writes. I am doing this by writing an incrementing integer counter to a 1GB virtual disk, and to make sure no writes were missed I can look at the reverted snapshot and see if there were any gaps (i.e. if I see 1, 2, 3, 0, 0, 6, 7 I know that the backup didn't get writes 4 and 5 correctly). This is all on a CentOS 7 VM, with mostly Python 2.7 scripts for writes/reads (speed isn't a huge concern)
A big part of my issues has been caching: since I'm simulating random I/O, writes are often flushed from caches and written to disk out of order. This makes every test appear as a false positive, since it looks like some data is missing at the time of the snapshot. Again, I don't really care about efficiency at all, so I don't mind really slow writes. Reads can use caching, that's not a problem, but also doesn't matter much one way or the other
Here are the things I have done to try to disable caching:
disabled the disk write cache with sudo hdparm -W 0 /dev/sdb, where /dev/sdb is the raw virtual disk
wrote to a raw disk with no filesystem, so no filesystem caching
set the buffering argument of open() in the Python script to 0 (no Python write buffering)
Is it basically an impossible task to make sure that my writes get put on the disk in sequential order? All I need is write #(n) to happen before write #(n+1), and #(n+1) before #(n+2), etc.
This is the Python script I'm using to write to disk (SIZE and PRIME change based on the size of the disk and a random seed):
from struct import pack, unpack
import sys

SIZE, PRIME = [x], [x]

# random I/O traversal iterator
def rand_index_generator(a, b):
    ctr = 0
    while True:
        yield (ctr % b)
        ctr += a

with open('/dev/sdb', 'rb+', buffering=0) as f:
    index_gen = rand_index_generator(PRIME, SIZE)
    # random traversal using iterator above, write counter to file
    for counter in xrange(1, SIZE-16):
        f.seek(index_gen.next() * 4)
        f.write(pack('>I', counter))
Then to validate I traverse in the same order and watch for gaps of unwritten data. This is after reverting the VM back to the snapshot. I know all the traversal and writing things work since validation will work smoothly with no missed writes before reverting, but I think some "written" data dies in RAM and doesn't make it to disk
Will take any suggestions to guarantee the write order I need for this application
I found out the answer to this question. I misunderstood the effect of writing to a raw disk: it did not eliminate OS caching, since I was still going through the OS to write to my raw disk. Oops.
To bypass the OS caches you should use os.open and pass the os.O_DIRECT and os.O_SYNC flags to make sure writes happen in the correct sequence (more info on those flags) and are not stuck in volatile memory. I used mmap and os file descriptors, but you could also use the normal filehandles like this.
The page size is specific to your operating system. For Linux it is typically 4096.
The top section of the code stayed the same but here is the write loop:
import os    # os and mmap are additionally needed at the top
import mmap

PAGESIZE = 4096
filedesc = os.open('/dev/sdb', os.O_DIRECT | os.O_SYNC | os.O_RDWR)
for counter in xrange(1, SIZE-16):
    write_loc = index_gen.next() * 4
    page_dist = write_loc % PAGESIZE
    offset = write_loc - page_dist
    bytemap = mmap.mmap(filedesc, PAGESIZE, offset=offset)
    bytemap[page_dist:(page_dist + 4)] = pack('>I', counter)
    bytemap.flush()
    bytemap.close()
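For completeness, here is an alternative sketch that avoids mmap-ing the device itself and instead does the read-modify-write of each page with os.preadv/os.pwrite (this assumes Python 3.7+, where those calls exist; O_DIRECT requires the buffer, offset and length to all be block-aligned, which is why an anonymous mmap is still used to obtain page-aligned memory):

import os
import mmap
from struct import pack

PAGESIZE = mmap.PAGESIZE                       # usually 4096 on Linux

filedesc = os.open('/dev/sdb', os.O_DIRECT | os.O_SYNC | os.O_RDWR)
buf = mmap.mmap(-1, PAGESIZE)                  # anonymous mapping -> page-aligned buffer

def write_counter(write_loc, counter):
    page_dist = write_loc % PAGESIZE
    offset = write_loc - page_dist             # page-aligned offset into the device
    os.preadv(filedesc, [buf], offset)         # read the existing page into the aligned buffer
    buf[page_dist:page_dist + 4] = pack('>I', counter)
    os.pwrite(filedesc, buf, offset)           # write the whole page back in one aligned call

write_counter(1234 * 4, 1)                     # example: write counter value 1 at byte offset 4936
os.close(filedesc)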

Processing huge amount of text data in memory

I am trying to process ~20 GB of data on an Ubuntu system having 64 GB of RAM.
This step is part of some preprocessing steps to generate feature vectors for training an ML algorithm.
The original implementation (written by someone on my team) had lists in it. It does not scale well as we add more training data. It is something like this:
all_files = glob("./Data/*.*")
file_ls = []
for fi in tqdm(all_files):
    with open(file=fi, mode="r", encoding='utf-8', errors='ignore') as f:
        file_ls.append(f.read())
This runs into a memory error (the process gets killed).
So I thought I should try replacing the list-based approach with tries:
def insert(word):
    cur_node = trie_root
    for letter in word:
        if letter in cur_node:
            cur_node = cur_node[letter]
        else:
            cur_node[letter] = {}
            cur_node = cur_node[letter]
    cur_node[None] = None

trie_root = {}
for fi in tqdm(all_files):
    with open(file=fi, mode="r", encoding='utf-8', errors='ignore') as f:
        insert(f.read().split())
This too gets killed. The above is demo code that I have written to capture the memory footprint of the objects. The worst part is that the demo code for the list runs standalone, but the demo code for the trie gets killed, leading me to believe that this implementation is worse than the list implementation.
My goal is to write some efficient code in Python to resolve this issue.
Kindly help me solve this problem.
EDIT:
Responding to @Paul Hankin: the data processing involves first taking up each file and adding a generic placeholder for terms with a normalized term frequency greater than 0.01, after which each file is split into a list and a vocabulary is calculated taking all the processed files into consideration.
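As an illustrative sketch of that per-file step (assuming "adding a generic placeholder" means substituting the high-frequency terms with a placeholder token; the token name and helper below are only assumptions for illustration):

from collections import Counter

def add_placeholders(text, threshold=0.01, placeholder="<COMMON>"):
    # Replace terms whose normalized term frequency exceeds the threshold
    # with a generic placeholder token (names here are hypothetical).
    tokens = text.split()
    if not tokens:
        return []
    counts = Counter(tokens)
    total = float(len(tokens))
    return [placeholder if counts[t] / total > threshold else t
            for t in tokens]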
One of the simple solutions to this problem might be to NOT store the data in a list or any in-memory data structure. You can try writing the processed data out to a file while doing the reading, as sketched below.
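A minimal sketch of that idea (process_text and the processed.txt output name are hypothetical placeholders for whatever per-file preprocessing and output you actually need): each file is read, processed and appended to an output file, so nothing accumulates in memory.

from glob import glob
from tqdm import tqdm

def process_text(text):
    # placeholder for the real preprocessing
    return " ".join(text.split())

with open("processed.txt", "w", encoding="utf-8") as out:
    for fi in tqdm(glob("./Data/*.*")):
        with open(fi, mode="r", encoding="utf-8", errors="ignore") as f:
            out.write(process_text(f.read()))
            out.write("\n")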

Efficiently reading few lines from a very large binary file

Here is a simple example to illustrate my problem:
I have a large binary file with 10 million values.
I want to get 5K values from certain points in this file.
I have a list of indexes giving me the exact place in the file I have my value in.
To solve this I tried two methods:
Going through the values and simply using seek() (from the start of the file) to get each value, something like this:
binaryFile_new = open(binary_folder_path, "r+b")
for index in index_list:
    binaryFile_new.seek(size * index, 0)
    wanted_line = binaryFile_new.read(size)
    wanted_line_list.append(wanted_line)
binaryFile_new.close()
But as I understand it, this solution reads through from the beginning for each index, so the complexity is O(N**2) in terms of file size.
Sorting the indexes so I could go through the file "once" while seeking from the current position, with something like this:
binaryFile_new = open(binary_folder_path, "r+b")
sorted_index_list = sorted(index_list)
for i, index in enumerate(sorted_index_list):
    if i == 0:
        binaryFile_new.seek(size * index, 0)
    else:
        binaryFile_new.seek((index - sorted_index_list[i-1]) * size - size, 1)
    wanted_line = binaryFile_new.read(size)
    wanted_line_list.append(wanted_line)
binaryFile_new.close()
I expected the second solution to be much faster because in theory it would go through the whole file once O(N).
But for some reason both solutions run the same.
I also have a hard constraint on memory usage, as I run this operation in parallel and on many files, so I can't read files into memory.
Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.
I'd go with #1:
for index in index_list:
    binary_file.seek(size * index)
    # ...
(I cleaned up your code a bit to comply with Python naming conventions and to avoid using a magic 0 constant, as SEEK_SET is default anyway.)
as I understand this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.
No, a seek() does not "read through from the beginning", that would defeat the point of seeking. Seeking to the beginning of file and to the end of file have roughly the same cost.
Sorting the indexes so I could go through the file "once" while seeking from the current position
I can't quickly find a reference for this, but I believe there's absolutely no point in calculating the relative offset in order to use SEEK_CUR instead of SEEK_SET.
There might be a small improvement just from seeking to the positions you need in order instead of randomly, as there's an increased chance your random reads will be serviced from cache, in case many of the points you need to read happen to be close to each other (and so your read patterns trigger read-ahead in the file system).
Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.
mmap doesn't scan the file. It sets up a region in your program's virtual memory to correspond to the file, so that accessing any page from this region the first time leads to a page fault, during which the OS reads that page (several KB) from the file (assuming it's not in the page cache) before letting your program proceed.
The internet is full of discussions of relative merits of read vs mmap, but I recommend you don't bother with trying to optimize by using mmap and use this time to learn about the virtual memory and the page cache.
[edit] reading in chunks larger than the size of your values might save you a bit of CPU time in case many of the values you need to read are in the same chunk (which is not a given) - but unless your program is CPU bound in production, I wouldn't bother with that either.
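If you do want to experiment with mmap anyway, here is a minimal sketch of the random-access read (assuming fixed-size records of size bytes and the same index_list as above; Python 3):

import mmap

def read_records(path, index_list, size):
    # Map the file read-only and slice out each record; the OS pages in only
    # the parts of the file that are actually touched.
    wanted = []
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for index in index_list:
                wanted.append(mm[index * size:(index + 1) * size])
        finally:
            mm.close()
    return wanted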

MPI in Python: load data from a file by line concurrently

I'm new to Python as well as MPI.
I have a huge data file, 10 GB, and I want to load it into, e.g., a list or whatever is more efficient; please suggest.
Here is the way I load the file content into a list
def load(source, size):
    data = [[] for _ in range(size)]
    ln = 0
    with open(source, 'r') as input:
        for line in input:
            ln += 1
            data[ln % size].sanitize(line)
    return data
Note:
source: the file name
size: the number of concurrent processes; I divide the data into [size] sublists for parallel computing using MPI in Python.
Please advise how to load the data more efficiently and faster. I've been searching for days but couldn't find anything that matches my purpose; if something exists, please comment with a link here.
Regards
If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in contiguous blocks on the HDD, then I don't know a way to read it faster than reading the file from the first byte to the end.
But if the file is fragmented, create multiple threads, each reading a part of the file. This would normally slow down reading, but modern HDDs implement a technique named NCQ (Native Command Queueing), which gives higher priority to read operations on sectors with addresses near the current position of the HDD head, hence improving the overall read speed when multiple threads are used.
To recommend an efficient data structure in Python for your program, you need to say what operations you will perform on the data (delete, add, insert, search, append and so on) and how often.
By the way, if you use commodity hardware, 10 GB of RAM is expensive. Try reducing the need for this amount of RAM by loading only the data needed for the current computation, then replacing it with new data for the next operation. You can overlap the computation with the I/O operations to improve performance, as sketched below.
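A minimal sketch of that overlap, using a background reader thread and a bounded queue (the chunk size and process_chunk are hypothetical placeholders; reading from a file releases the GIL, so a plain thread is enough to overlap I/O with computation):

import threading
from queue import Queue

def read_chunks(source, q, chunk_lines=100000):
    # Read the file in the background and hand chunks of lines to the queue.
    with open(source, 'r') as f:
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_lines:
                q.put(chunk)
                chunk = []
        if chunk:
            q.put(chunk)
    q.put(None)                          # sentinel: no more data

def process_chunk(chunk):
    return len(chunk)                    # placeholder for the real computation

def run(source):
    q = Queue(maxsize=4)                 # bounded, so reading cannot outrun processing
    reader = threading.Thread(target=read_chunks, args=(source, q))
    reader.start()
    results = []
    while True:
        chunk = q.get()
        if chunk is None:
            break
        results.append(process_chunk(chunk))
    reader.join()
    return results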
(original) Solution using pickling
The strategy for your task can go this way:
split the large file into smaller ones, making sure they are divided on line boundaries
have Python code which can convert the smaller files into the resulting list of records and save them as pickled files
run that Python code for all the smaller files in parallel (using Python or other means)
run integrating code, taking the pickled files one by one, loading the list from each and appending it to the final result.
To gain anything, you have to be careful, as overhead can overcome all possible gains from parallel runs:
as Python uses the Global Interpreter Lock (GIL), do not use threads for parallel processing, use processes. As processes cannot simply pass data around, you have to pickle the results and let the other (final integrating) part read them back.
try to minimize the number of loops. For this reason it is better to:
not split the large file into too many smaller parts; to use the power of your cores, it is best to match the number of parts to the number of cores (or possibly twice as many, but going higher will spend too much time switching between processes).
pickle a whole list of items (records) rather than individual items; pickling one list of 1000 items will be faster than pickling 1000 small items one by one.
use existing system tools for some tasks (splitting the file, calling the conversion task in parallel), as they can often do it faster. If you have this option, use it.
In my small test, I created a file with 100 thousand lines with content like "98-BBBBBBBBBBBBBB", "99-BBBBBBBBBBB" etc. and tested converting it to a list of numbers [..., 98, 99, ...].
For splitting I used the Linux command split, asking it to create 4 parts while preserving line borders:
$ split -n l/4 long.txt
This created the smaller files xaa, xab, xac, xad.
To convert each smaller file I used the following script, which converts the content into a file with the extension .pickled containing the pickled list.
# chunk2pickle.py
import pickle
import sys

def process_line(line):
    return int(line.split("-", 1)[0])

def main(fname, pick_fname):
    with open(pick_fname, "wb") as fo:
        with open(fname) as f:
            pickle.dump([process_line(line) for line in f], fo)

if __name__ == "__main__":
    fname = sys.argv[1]
    pick_fname = fname + ".pickled"
    main(fname, pick_fname)
To convert one chunk of lines into a pickled list of records:
$ python chunk2pickle.py xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used the parallel tool (which has to be installed on the system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and I found new files with the extension .pickled on the disk.
-j 4 asks to run 4 processes in parallel; adjust it to your system, or leave it out and it will default to the number of cores you have.
parallel can also get the list of parameters (input file names in our case) by other means, like the ls command:
$ ls x?? | parallel -j 4 python chunk2pickle.py {}
To integrate the results, use the script integrate.py:
# integrate.py
import pickle

def main(file_names):
    res = []
    for fname in file_names:
        with open(fname, "rb") as f:
            res.extend(pickle.load(f))
    return res

if __name__ == "__main__":
    file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
    # here you have the list of records you asked for
    records = main(file_names)
    print(records)
In my answer I have used a couple of external tools (split and parallel). You may do a similar task with Python too. My answer focuses only on providing you an option to keep the Python code for converting lines to the required data structures. A complete pure-Python answer is not covered here (it would get much longer and probably slower).
Solution using process Pool (no explicit pickling needed)
The following solution uses multiprocessing from Python. In this case there is no need to pickle the results explicitly (I am not sure if it is done by the library automatically, or if it is not necessary and the data are passed using other means).
# direct_integrate.py
from multiprocessing import Pool

def process_line(line):
    return int(line.split("-", 1)[0])

def process_chunkfile(fname):
    with open(fname) as f:
        return [process_line(line) for line in f]

def main(file_names, cores=4):
    p = Pool(cores)
    return p.map(process_chunkfile, file_names)

if __name__ == "__main__":
    file_names = ["xaa", "xab", "xac", "xad"]
    # here you have the list of records you asked for
    # warning: records are in groups.
    record_groups = main(file_names)
    for rec_group in record_groups:
        print(rec_group)
This updated solution still assumes that the large file is available in the form of four smaller files.
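As the comment in that script notes, the records come back grouped per input file; a small follow-up sketch to flatten record_groups from the example above into a single list:

from itertools import chain

flat_records = list(chain.from_iterable(record_groups))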

Performance issue with loop on datasets with h5py

I want to apply a simple function to the datasets contained in an hdf5 file.
I am using code similar to this:
import h5py

data_sums = []
with h5py.File(input_file, "r") as f:
    for (name, data) in f["group"].iteritems():
        print name
        # data_sums.append(data.sum(1))
        data[()]  # My goal is similar to the line above but this line is enough
                  # to replicate the problem
It goes very fast at the beginning, and after a certain number of datasets (reproducible to some extent) it slows down dramatically.
If I comment out the last line, it finishes almost instantly. It does not matter whether the data are stored (here, appended to a list) or not: something like data[:100] has a similar effect.
The number of datasets that can be processed before the drop in performance depends on the size of the portion that is accessed at each iteration.
Iterating over smaller chunks does not solve the issue.
I suppose I am filling some memory space and that the process slows down when it is full but I do not understand why.
How to circumvent this performance issue?
I run python 2.6.5 on ubuntu 10.04.
Edit:
The following code does not slow down if the second line of the loop is uncommented. It does slow down without it:
f = h5py.File(path_to_file, "r")
list_name = f["data"].keys()
f.close()

import numpy as np

for name in list_name:
    f = h5py.File(d.storage_path, "r")
    # name = list_name[0]  # with this line the issue vanishes.
    data = f["data"][name]
    tag = get_tag(name)
    data[:, 1].sum()
    print "."
    f.close()
Edit: I found out that accessing the first dimension of multidimensional datasets seems to run without issues. The problem occurs when higher dimensions are involved.
platform?
on windows 64 bit, python 2.6.6, i have seen some weird issues when crossing a 2GB barrier (i think) if you have allocated it in small chunks.
you can see it with a script like this:
ix = []
for i in xrange(20000000):
    if i % 100000 == 0:
        print i
    ix.append('*' * 1000)
you can see that it will run pretty fast for a while, and then suddenly slow down.
but if you run it in larger blocks:
ix = []
for i in xrange(20000):
    if i % 100000 == 0:
        print i
    ix.append('*' * 1000000)
it doesn't seem to have the problem (though it will run out of memory, depending on how much you have - 8GB here).
weirder yet, if you eat the memory using large blocks, and then clear the memory (ix=[] again, so back to almost no memory in use), and then re-run the small block test, it isn't slow anymore.
i think there was some dependence on the pyreadline version - 2.0-dev1 helped a lot with these sorts of issues, but i don't remember too much. when i tried it now, i don't see this issue anymore really - both slow down significantly around 4.8GB, which, with everything else i have running, is about where it hits the limits of physical memory and starts swapping.
