I will explain my problem first, as it's important to understand what I want :-).
I'm working on a Python pipeline that uses several external tools to perform various genomics data analyses. One of these tools works with very large FASTQ files, which in the end are nothing more than plain text files.
Usually these FASTQ files are gzipped, and since they're plain text the compression ratio is very high. Most data-analysis tools can work with gzipped files, but we have a few that can't. So what we're doing is unzipping the files, working with them, and finally re-compressing them.
As you may imagine, this process is:
Slower
Heavier on disk space
Heavier on bandwidth (when working on an NFS filesystem)
So I'm trying to figure out a way of "tricking" these tools into working directly with gzipped files, without having to touch their source code.
I thought of using FIFO files, and I tried that, but it doesn't work if the tool reads the file more than once, or if the tool seeks around in the file.
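What I tried was roughly along these lines (the tool name and paths are only placeholders); it works for a single sequential pass, which is exactly why it breaks on re-reads and seeks:

import os
import subprocess

fifo_path = "/tmp/sample.fastq"   # placeholder path; the tool just sees a filename
os.mkfifo(fifo_path)

# Feed the decompressed stream into the FIFO in the background...
feeder = subprocess.Popen("zcat sample.fastq.gz > " + fifo_path, shell=True)

# ...while the tool reads it as if it were an ordinary file.
tool = subprocess.Popen(["./tool", fifo_path])

tool.wait()
feeder.wait()
os.remove(fifo_path)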
So basically I have two questions:
Is there any way to map a file into memory so that you can do something like:
./tool mapped_file (where mapped_file is not really a file, but a reference to a memory-mapped file)?
Do you have any other suggestions about how I can achieve my goal?
Thank you very much to everybody!
From this answer you can load the whole uncompressed file into ram:
mkdir /mnt/ram
mount -t ramfs ram /mnt/ram
# uncompress your file to that directory
./tool /mnt/ram/yourdata
This, however, has the drawback of loading everything to ram: you'll need to have enough space to hold your uncompressed data!
Use umount /mnt/ram when you're finished.
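If you're driving this from the Python pipeline, the "uncompress your file" step can be a streamed copy so the whole file never sits in Python's memory at once (the source path below is just an example):

import gzip
import shutil

# Stream-decompress into the RAM-backed directory in chunks.
with gzip.open("/data/sample.fastq.gz", "rb") as src:
    with open("/mnt/ram/yourdata", "wb") as dst:
        shutil.copyfileobj(src, dst)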
If your script can read from standard input, then one possibility would be to decompress and stream using zcat, and then pipe it to your script.
Something like this:
zcat large_file.gz | ./tool
If you want to compress your results as well, then you can just pipe the output to gzip again:
zcat large_file.gz | ./tool | gzip - > output.gz
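Since the pipeline itself is written in Python, the same pipe can be wired up with subprocess. This is only a sketch, assuming the tool reads from stdin and writes its results to stdout:

import subprocess

# Equivalent of: zcat large_file.gz | ./tool | gzip -c > output.gz
with open("output.gz", "wb") as out:
    zcat = subprocess.Popen(["zcat", "large_file.gz"], stdout=subprocess.PIPE)
    tool = subprocess.Popen(["./tool"], stdin=zcat.stdout, stdout=subprocess.PIPE)
    gz = subprocess.Popen(["gzip", "-c"], stdin=tool.stdout, stdout=out)
    zcat.stdout.close()   # let zcat receive SIGPIPE if the tool exits early
    tool.stdout.close()
    gz.wait()
    tool.wait()
    zcat.wait()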
Otherwise, you can look at python's support for memory mapping:
http://docs.python.org/library/mmap.html
Finally, you can convert the ASCII FASTQ files to BAM format, a more compact binary representation that will save you space. See the following:
http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam
Consider looking at the winning entries in the Pistoia Alliance Sequence Squeeze contest, which rated FASTQ compression tools. You may find a tool that reduces I/O overhead by offering random access and faster decompression.
You can write a FUSE file system driver, if you are on Linux: http://pypi.python.org/pypi/fuse-python
The FUSE driver would need to compress and decompress the files on the fly. Maybe something like this already exists.
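A rough sketch of the idea, using the fusepy bindings rather than fuse-python (a different package with a similar purpose, assumed here for brevity). It exposes one gzipped file as if it were the uncompressed file; every read decompresses from the start of the gzip stream up to the requested offset, so seeks behave correctly but backward seeks are slow:

import errno
import gzip
import stat
import sys

from fuse import FUSE, Operations, FuseOSError   # pip install fusepy


class GzipView(Operations):
    """Present one .gz file as a single read-only, uncompressed file."""

    def __init__(self, gzpath, name="data.fastq"):
        self.gzpath = gzpath
        self.name = "/" + name
        # One full pass to learn the uncompressed size (the gzip ISIZE trailer
        # wraps around for files larger than 4 GB, so it can't be trusted).
        with gzip.open(gzpath, "rb") as f:
            self.size = sum(len(chunk) for chunk in iter(lambda: f.read(1 << 20), b""))

    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path == self.name:
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1, st_size=self.size)
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", self.name[1:]]

    def read(self, path, size, offset, fh):
        # Naive but correct: gzip has no cheap random access, so each read
        # re-decompresses from the beginning of the file up to `offset`.
        with gzip.open(self.gzpath, "rb") as f:
            f.seek(offset)
            return f.read(size)


if __name__ == "__main__":
    # Usage sketch: python gzipview.py big.fastq.gz /mnt/virtual
    FUSE(GzipView(sys.argv[1]), sys.argv[2], foreground=True, ro=True)

You would then point the tool at /mnt/virtual/data.fastq. A real driver would cache a block index so backward seeks don't restart decompression every time.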
Related
I am trying to read two files of one billion records each from a Unix server, in order to compare them.
I tried doing this in Python with the paramiko package, but connecting and reading the remote files is very slow. That's why I chose Java.
In Java, when I read the files I run into memory and performance issues.
My requirement: first read all records from file1 on the Unix server, then read the records of the second file from the Unix server, and finally compare the two files.
It sounds like you want to process huge files. The rule of thumb is that they will exceed your RAM, so never hope to read them in all at once.
Instead, try to read meaningful chunks, process them, then forget them.
Meaningful chunks could be characters, words, lines, expressions, or objects.
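For illustration, a minimal Python sketch of that pattern (the same idea applies in Java with a BufferedReader); here the "meaningful chunk" is a single line, and the file name is just an example:

def process(line):
    # placeholder for whatever per-record work you need
    pass

# Iterating over the file object yields one line at a time, so only the
# current line is held in memory.
with open("file1.txt") as f:
    for line in f:
        process(line)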
As you are working on UNIX, I would advise you to sort the files and use the diff command-line tool: the UNIX command-line tools are quite powerful. Please show us an excerpt of the files (you might need cut or awk scripts too).
Ok, I realize the title probably wasn't that clear. I'll clarify here and hope someone can help with a better title.
I'm opening a compressed file (tarball or similar) in python and reading some of the contents. One of the enclosed files is quite large (about 200GB, mostly zeros). Since the python tarfile module gives me file-handle-like objects, I can generally use them as if I had opened the file in the archive, without ever fully decompressing the enclosed file.
Unfortunately, I have to do some processing on this enclosed file using a 3rd party tool that I can't modify. This 3rd party tool only operates on files that are on disk. It won't take input from stdin.
What I do now is extract the entire 200 GB (mostly zeros) file to the disk for further processing. Obviously, this takes a while.
What I'd like to do is (using python if possible) make a "file" on disk that maps back to the "file handle" from the tarfile module in python. I could then pass this "file" to my 3rd party tool and go from there.
My target OS is linux (though a solution that also works on OSX would be nice). I don't care about working on Windows.
Edits
External tool takes a filename (or full path) as a parameter. It prints out data to stdout (which python reads)
I've gotten a setup using sparse files working. While not as fast as I had hoped, it is significantly faster than it was before.
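For anyone curious, a sketch of the sparse-file approach (archive and member names are made up): copy the tarfile member to disk, but seek over runs of zeros instead of writing them, so the zero-filled regions become holes that never consume real disk blocks.

import tarfile

BLOCK = 1024 * 1024            # 1 MiB chunks; the size is arbitrary
ZEROS = b"\0" * BLOCK

with tarfile.open("archive.tar.gz") as tar:
    src = tar.extractfile("huge_member.bin")    # example member name
    with open("huge_member.bin", "wb") as dst:
        while True:
            chunk = src.read(BLOCK)
            if not chunk:
                break
            if chunk == ZEROS:
                dst.seek(len(chunk), 1)   # skip forward, leaving a hole
            else:
                dst.write(chunk)
        dst.truncate()   # make sure the file has its full size even if it ends in a hole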
I currently have the following csv writer class:
import csv

class csvwriter():
    writer = None
    writehandler = None

    @classmethod
    def open(cls, file):
        cls.writehandler = open(file, 'wb')
        cls.writer = csv.writer(cls.writehandler, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

    @classmethod
    def write(cls, arr):
        cls.writer.writerow(arr)

    @classmethod
    def close(cls):
        cls.writehandler.close()
which can generate proper csv files without ever having to hold the full array in memory at any one time.
However, the files created with this code can be quite large, so I'm looking to compress them rather than write them uncompressed (in order to save on disk usage). I can't effectively hold the file in memory either, as I'm expecting files of well over 20 GB to be a regular occurrence.
The recipients of the resulting files are generally not sysadmins of their PCs, nor do they all use linux, so I'm constrained in the types of algorithms I'm able to use for this task. Preferably, the solution would use a compression scheme that's natively readable (no executables required) in Windows, OSX and any linux distribution.
I've found that gzip provides a very handy interface in Python, but reading gzipped files on Windows seems like quite a hassle. Ideally I'd put them in a zip archive, but zip archives don't allow you to append data to a file already present in the archive, which would force me to either hold the whole file in memory or write the data away to several smaller files that I could fit in memory.
My question: Is there a solution that would benefit from the best of both worlds? Widespread availability of tools to read the target format on the end-user's machine, and also the ability to append, rather than write the whole file in one go?
Thanks in advance for your consideration!
gzlog may provide the functionality you're looking for. It efficiently appends short strings to a gzip file, intended for applications where short messages are appended to a long log.
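gzlog itself is a C example that ships with zlib. A simpler (if less efficient) pure-Python alternative is the gzip module's append mode: each append writes another gzip member, and standard decompressors read the concatenated members back as a single stream. A sketch, with the output path made up:

import csv
import gzip

# 'ab' appends a new gzip member on every open; gzip, zcat, 7-Zip, etc.
# read the members back as one continuous file.
handle = gzip.open('results.csv.gz', 'ab')
writer = csv.writer(handle, delimiter=',', quotechar='"',
                    quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(['some', 'row', 1.0])
handle.close()

Note that this keeps the gzip format, so Windows users would still need something like 7-Zip to open the result.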
I need to run a regex match over a file, but I'm faced with an unexpected problem: the file is too big to read() or mmap() in one call, File objects don't support the buffer() interface, and the regex module takes only strings or buffers.
Is there an easy way to do this?
The Python mmap module provides a nice Python-friendly way of memory-mapping a file. On a 32-bit operating system the maximum file size will be limited to no more than a GB or maybe two, but on a 64-bit OS you will be able to memory-map a file of arbitrary size (until storage sizes exceed 2^64 bytes, of course).
I've done this with files of up to 30 GB (the Wikipedia XML dump file) in Python with excellent results.
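A minimal sketch of that approach (file name and pattern are examples): an mmap object exposes the buffer interface, so the re module can search it directly without the file ever being read into a Python string.

import mmap
import re

with open("huge_file.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The regex engine walks the mapped pages; the OS pages them in on demand.
    for match in re.finditer(br"some_pattern_\d+", mm):
        print(match.start(), match.group())
    mm.close()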
I have about 200,000 text files that are placed in a bz2 file. The issue I have is that when I scan the bz2 file to extract the data I need, it goes extremely slowly: it has to look through the entire bz2 file to find the single file I am looking for. Is there any way to speed this up?
Also, I thought about organizing the files in the tar.bz2 so that it knows where to look instead. Is there any way to organize files that are put into a bz2?
More Info/Edit:
I need to query the compressed archive for individual text files. Is there a better compression method that supports such a large number of files and still compresses the data well?
Do you have to use bzip2? Reading its documentation, it's quite clear it's not designed to support random access. Perhaps you should use a compression format that more closely matches your requirements. The good old Zip format supports random access, but might compress worse, of course.
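To illustrate the random access Zip gives you, Python's zipfile module can pull out a single member without touching the rest of the archive (archive and member names are examples):

import zipfile

# Only the requested member is decompressed; the archive's central
# directory makes the name lookup cheap.
with zipfile.ZipFile("corpus.zip") as zf:
    data = zf.read("documents/000123.txt")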
Bzip2 compresses in large blocks (900 KiB by default, I believe). One method that would speed up the scanning of the tar file dramatically, but would reduce compression performance, would be to compress each file individually and then tar the results together. This is essentially what Zip-format files are (though using zlib compression rather than bzip2). But you could then easily grab the tar index and only have to decompress the specific file(s) you are looking for.
I don't think most tar programs offer much ability to organize files in any meaningful way, though you could write a program to do this for your special case (I know Python has tar-writing libraries though I've only used them once or twice). However, you'd still have the problem of having to decompress most of the data before you found what you were looking for.
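A sketch of the "compress each file individually, then tar the results" idea (file names are examples): the outer tar stays uncompressed, so tarfile can skip from header to header to locate the member you want, and only that member's bz2 stream gets decompressed.

import bz2
import glob
import os
import tarfile

# Build an uncompressed tar whose members are individually bz2-compressed.
with tarfile.open("corpus.tar", "w") as tar:
    for path in glob.glob("texts/*.txt"):
        compressed = path + ".bz2"
        with open(path, "rb") as src, bz2.BZ2File(compressed, "wb") as dst:
            dst.write(src.read())
        tar.add(compressed, arcname=os.path.basename(compressed))

# Later: pull out and decompress just the one file you need.
with tarfile.open("corpus.tar", "r") as tar:
    member = tar.extractfile("000123.txt.bz2")
    text = bz2.decompress(member.read())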