Python: regex on big file. Easy way? - python

I need to run a regex match over a file, but I'm faced with an unexpected problem: the file is too big to read() or mmap() in one call, File objects don't support the buffer() interface, and the regex module takes only strings or buffers.
Is there an easy way to do this?

The Python mmap module provides a nice Python-friendly way of memory mapping a file. On a 32-bit operating system, the maximum size of the file is will be limited to no more than a GB or maybe two, but on a 64-bit OS you will be able to memory map a file of arbitrary size (until storage sizes exceed 264, of course).
I've done this with files of up to 30 GB (the Wikipedia XML dump file) in Python with excellent results.

Related

How to create a memory-mapped file in Python that is accessible from a called application?

Ok, I realize the title probably wasn't that clear. I'll clarify here and hope someone can help with a better title.
I'm opening a compressed file (tarball or similar) in python and reading some of the contents. One of the enclosed files is quite large (about 200GB, mostly zeros). Since the python tarfile module gives me file-handle like objects, I can generally use them as if I opened the file in the archive with out ever fully decompressing the enclosed file.
Unfortunately, I have to do some processing on this enclosed file using a 3rd party tool that I can't modify. This 3rd party tool only operates on files that are on disk. It won't take input from stdin.
What I do now is extract the entire 200 GB (mostly zeros) file to the disk for further processing. Obviously, this takes a while.
What I'd like to do is (using python if possible) make a "file" on disk that maps back to the "file handle" from the tarfile module in python. I could then pass this "file" to my 3rd party tool and go from there.
My target OS is linux (though a solution that also works on OSX would be nice). I don't care about working on Windows.
Edits
External tool takes a filename (or full path) as a parameter. It prints out data to stdout (which python reads)
I've gotten a setup using sparse files working. While not as fast as I had hoped, it is significantly faster than it was before.

Reading a file line by line -- impact on disk?

I'm currently writing a python script that processes very large (> 10GB) files. As loading the whole file into memory is not an option, I'm right now reading and processing it line by line:
for line in f:
....
Once the script is finished it will run fairly often, so I'm starting to think about what impact that sort of reading will have on my disks lifespan.
Will the script actually read line by line or is there some kind of OS-powered buffering happening? If not, should I implement some kind of intermediary buffer myself? Is hitting the disk that often actually harmful? I remember reading something about BitTorrent wearing out disks quickly exactly because of that kind of bitwise reading/writing rather than operating with larger chunks of data.
I'm using both a HDD and an SSD in my test environment, so answers would be interesting for both systems.
Both your OS and Python use buffers to read data in larger chunks, for performance reasons. Your disk will not be materially impacted by reading a file line by line from Python.
Specifically, Python cannot give you individual lines without scanning ahead to find the line separators, so it'll read chunks, parse out individual lines, and each iteration will take lines from the buffer until another chunk must be read to find the next set of lines. The OS uses a buffer cache to help speed up I/O in general.

Compressing strings and appending to file on the fly

I currently have the following csv writer class:
class csvwriter():
writer = None
writehandler = None
#classmethod
def open(cls,file):
cls.writehandler = open(file,'wb')
cls.writer = csv.writer(cls.writehandler, delimiter=',',quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
#classmethod
def write(cls,arr):
cls.writer.writerow(arr)
#classmethod
def close(cls):
cls.writehandler.close()
which can generate proper csv files without ever having to store the full array in memory at a single time.
However, the files created through use of this code can be quite large, so I'm looking to compress them, rather than writing them uncompressed. (In order to save on disk usage). I can't effectively store the file in memory either, as I'm expecting files of well over 20gb to be a regular occurence.
The recipients of the resulting files are generally not sysadmins of their PCs, nor do they all use linux, so I'm constrained in the types of algorithms I'm able to use for this task. Preferably, the solution would use a compression scheme that's natively readable (no executables required) in Windows, OSX and any linux distribution.
I've found gzip provides a very handy interface in Python, but reading gzipped files in windows seems like quite a hassle.. Ideally I'd put them in a zip archive, but zip archive don't allow you to append data to files already present in the archive, which then forces me to store the whole file in memory, or write the data away to several smaller files that I would be able to fit in memory.
My question: Is there a solution that would benefit from the best of both worlds? Widespread availability of tools to read the target format on the end-user's machine, and also the ability to append, rather than write the whole file in one go?
Thanks in advance for your consideration!
gzlog may provide the functionality you're looking for. It efficiently appends short strings to a gzip file, intended for applications where short messages are appended to a long log.

Read entire physical file including file slack with python?

Is there a simple way to read all the allocated clusters of a given file with python? The usual python read() seemingly only allows me to read up to the logical size of the file (which is reasonable, of course), but I want to read all the clusters including slack space.
For example, I have a file called "test.bin" that is 1234 byte in logical size, but because my file system uses clusters of size 4096 bytes, the file has a physical size of 4096 bytes on disk. I.e., there are 2862 bytes in file slack space.
I'm not sure where to even start with this problem... I know I can read the raw disk from /dev/sda, but I'm not sure how to locate the clusters of interest... of course this is the whole point of having a file-system (to match up names of files to sectors on disk), but I don't know enough about how python interacts with the file-system to figure this out... yet... any help or pointers to references would be greatly appreciated.
Assuming an ext2/3/4 filesytem, as you guess yourself, your best bet is probably to:
use a wrapper (like this one) around debugfs to get the list of blocks associated with a given file:
debugfs: blocks ./f.txt
2562
to read-back that/those block(s) from the block device / image file
>>> f = open('/tmp/test.img','rb')
>>> f.seek(2562*4*1024)
10493952
>>> bytes = f.read(4*1024)
>>> bytes
b'Some data\n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
...
Not very fancy but that will work. Please note you don't have to mount the FS to do any of these steps. This is especially important for forensic applications where you cannot trust in anyway the content of the disk and/or are not allowed per regulation to mount the disk image.
There is a C open source forensic tool that implements file access successfully.
Here is an overview of the tool Link.
You can download it here.
It basically uses the POSIX system call open() that returns a raw file discriptor (integer) that you can use with the POSIX system calls read() and write() without the restrinction of stopping at EOF which stops you from accessing the file slack.
There are lots of examples online how to do a system call with python e.g. this one

Map files into memory

I will explain what's my problem first, as It's important to understand what I want :-).
I'm working on a python-written pipeline that uses several external tools to perform several genomics data analysis. One of this tools works with very huge fastq files, which at the end are no more that plain text files.
Usually, this fastq files are gzipped, and as they're are plain text the compression ratio is very high. Most of data analysis tools can work with gzipped files, but we have a few ones that can't. So what we're doing is unzipp the files, work with them, and finaly re-compress.
As you may imagine, this process is:
Slower
High disk consuming
Bandwidth consuming (if working in a NFS filesystem)
So I'm trying to figure out a way of "tricking" these tools to work directly with gzipped files without having to touch the source code of the tools.
I thought on using FIFO files, and I tried that, but doesn't work if the tool reads the file more than once, or if the tool seeks around the file.
So basically I have to questions:
Is there any way to map a file into memory so that you can do something like:
./tool mapped_file (where mapped_file is not really a file, but a reference to a memory mapped file.
Do you have any other suggestions about how can I achieve my target?
Thank you very much to everybody!
From this answer you can load the whole uncompressed file into ram:
mkdir /mnt/ram
mount -t ramfs ram /mnt/ram
# uncompress your file to that directory
./tool /mnt/ram/yourdata
This, however, has the drawback of loading everything to ram: you'll need to have enough space to hold your uncompressed data!
Use umount /mnt/ram when you're finished.
If your script can read from standard input, then one possibility would be to decompress and stream using zcat, and then pipe it to your script.
Something like this:
zcat large_file.gz | ./tool
If you want to compress your results as well, then you can just pipe the output to gzip again:
zcat large_file.gz | ./tool | gzip - > output.gz
Otherwise, you can look at python's support for memory mapping:
http://docs.python.org/library/mmap.html
Finally, you can convert the ASCII fastq files to BAM format, which isn't compressed (per se) but uses a more compact format that will save you space. See the following:
http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam
Consider looking at winning entries in the Pistoia Alliance Sequence Squeeze contest, which rated FASTQ compression tools. You may find a tool which provides IO overhead reduction through random access and faster decompression performance.
You can write a fuse file system driver, if you are on linux: http://pypi.python.org/pypi/fuse-python
The fuse driver needs to compress and decompress the files. Maybe something like this already exists.

Categories

Resources