I have about 200,000 text files that are packed into a bz2 archive. The issue is that when I scan the bz2 file to extract the data I need, it goes extremely slowly: it has to look through the entire bz2 file to find the single file I am looking for. Is there any way to speed this up?
Also, I thought about organizing the files in the tar.bz2 so that it knows where to look instead. Is there any way to organize files that are put into a bz2?
More Info/Edit:
I need to query the compressed file for each text file. Is there a better compression method that supports such a large number of files while still compressing as thoroughly?
Do you have to use bzip2? Reading its documentation, it's quite clear that it's not designed to support random access. Perhaps you should use a compression format that more closely matches your requirements. The good old Zip format supports random access, but might compress worse, of course.
Bzip2 compresses in large blocks (900 KiB by default, I believe). One method that would speed up the scanning of the tar file dramatically, but would reduce compression performance, would be to compress each file individually and then tar the results together. This is essentially what Zip-format files are (though using zlib compression rather than bzip2). But you could then easily grab the tar index and only have to decompress the specific file(s) you are looking for.
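A sketch of that approach (not from the original answer; the file and directory names are made up): bz2-compress each file on its own, then store the already-compressed members in an uncompressed tar. Finding one member only requires walking the tar headers, and only that member's bz2 data ever gets decompressed.

import bz2
import io
import os
import tarfile

SOURCE_DIR = 'texts'       # hypothetical directory holding the 200,000 text files
ARCHIVE = 'texts.tar'      # plain tar ('w', not 'w:bz2') of individually bz2'd members

with tarfile.open(ARCHIVE, 'w') as tar:
    for name in os.listdir(SOURCE_DIR):
        with open(os.path.join(SOURCE_DIR, name), 'rb') as src:
            data = bz2.compress(src.read())
        info = tarfile.TarInfo(name + '.bz2')
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Later: pull out a single file without decompressing anything else.
with tarfile.open(ARCHIVE, 'r') as tar:
    member = tar.extractfile('somefile.txt.bz2')   # hypothetical member name
    text = bz2.decompress(member.read())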
I don't think most tar programs offer much ability to organize files in any meaningful way, though you could write a program to do this for your special case (I know Python has tar-writing libraries though I've only used them once or twice). However, you'd still have the problem of having to decompress most of the data before you found what you were looking for.
Related
Ok, I realize the title probably wasn't that clear. I'll clarify here and hope someone can help with a better title.
I'm opening a compressed file (a tarball or similar) in Python and reading some of the contents. One of the enclosed files is quite large (about 200GB, mostly zeros). Since the Python tarfile module gives me file-handle-like objects, I can generally use them as if I had opened the file in the archive, without ever fully decompressing the enclosed file.
Unfortunately, I have to do some processing on this enclosed file using a 3rd party tool that I can't modify. This 3rd party tool only operates on files that are on disk. It won't take input from stdin.
What I do now is extract the entire 200 GB (mostly zeros) file to the disk for further processing. Obviously, this takes a while.
What I'd like to do is (using python if possible) make a "file" on disk that maps back to the "file handle" from the tarfile module in python. I could then pass this "file" to my 3rd party tool and go from there.
My target OS is linux (though a solution that also works on OSX would be nice). I don't care about working on Windows.
Edits
The external tool takes a filename (or full path) as a parameter. It prints data to stdout (which Python reads).
I've gotten a setup using sparse files working. While not as fast as I had hoped, it is significantly faster than it was before.
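For reference, a minimal sketch of the sparse-file idea (the archive and member names are made up): read the member from the tarfile handle in chunks and seek over all-zero chunks instead of writing them, so the extracted file stays sparse and occupies little real disk space.

import tarfile

CHUNK = 1024 * 1024
ZEROES = b'\0' * CHUNK

with tarfile.open('archive.tar.gz', 'r:*') as tar:        # hypothetical archive
    src = tar.extractfile('big_member.bin')               # hypothetical ~200 GB member
    with open('big_member.bin', 'wb') as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            if chunk == ZEROES[:len(chunk)]:
                dst.seek(len(chunk), 1)    # skip the zeros: leaves a hole in the file
            else:
                dst.write(chunk)
        dst.truncate()                     # fix the final size past any trailing hole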
I currently have the following csv writer class:
import csv

class csvwriter():
    writer = None
    writehandler = None

    @classmethod
    def open(cls, file):
        cls.writehandler = open(file, 'wb')
        cls.writer = csv.writer(cls.writehandler, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

    @classmethod
    def write(cls, arr):
        cls.writer.writerow(arr)

    @classmethod
    def close(cls):
        cls.writehandler.close()
which can generate proper csv files without ever having to store the full array in memory at a single time.
However, the files created through this code can be quite large, so I'm looking to compress them rather than write them uncompressed (in order to save disk space). I can't effectively store the file in memory either, as I'm expecting files of well over 20 GB to be a regular occurrence.
The recipients of the resulting files are generally not sysadmins of their PCs, nor do they all use linux, so I'm constrained in the types of algorithms I'm able to use for this task. Preferably, the solution would use a compression scheme that's natively readable (no executables required) in Windows, OSX and any linux distribution.
I've found that gzip provides a very handy interface in Python, but reading gzipped files on Windows seems like quite a hassle. Ideally I'd put them in a zip archive, but zip archives don't allow you to append data to files already present in the archive, which forces me either to store the whole file in memory or to write the data out to several smaller files that do fit in memory.
My question: Is there a solution that would benefit from the best of both worlds? Widespread availability of tools to read the target format on the end-user's machine, and also the ability to append, rather than write the whole file in one go?
Thanks in advance for your consideration!
gzlog may provide the functionality you're looking for. It efficiently appends short strings to a gzip file, and is intended for applications where short messages are appended to a long log.
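If a pure-Python route is acceptable, a related trick (a sketch, not gzlog itself) relies on the fact that the gzip format allows concatenated members: opening the file in append mode and writing a new member each time produces a file that any gzip reader, including Python's gzip module, reads back as one continuous stream. File names below are made up.

import csv
import gzip
import io

def append_rows(path, rows):
    # Append CSV rows to a .csv.gz file as a new gzip member.
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerows(rows)
    with gzip.open(path, 'ab') as gz:              # 'ab' starts a new gzip member
        gz.write(buf.getvalue().encode('utf-8'))

append_rows('output.csv.gz', [['id', 'value'], [1, 3.5]])
append_rows('output.csv.gz', [[2, 7.25]])

with gzip.open('output.csv.gz', 'rt') as gz:       # reads all members transparently
    print(gz.read())

The trade-off compared to gzlog is ratio: each appended member carries its own header and is compressed independently, so many tiny appends compress poorly; batching rows before each append helps.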
Through much trial and error, I've figured out how to make LZMA-compressed files through PyLZMA, but I'd like to replicate the seemingly simple task of compressing a folder and all of its files/directories recursively into a 7z file. I would just do it through 7z.exe, but I can't seem to catch the stdout of the process until it's finished, and I would like some per-7z-file progress, as I will be compressing folders that range from hundreds of gigs to over a terabyte in size. Unfortunately I can't provide any code I've tried, simply because the only examples I've seen of py7zlib are for extracting files from pre-existing archives. Has anybody had any luck with this combination, or could provide some help?
For what it's worth, this would be on Windows using Python 2.7. Bonus points if there's some magic multi-threading that could occur here, especially given how slow LZMA compression seems to be (time, however, is not an issue here). Thanks in advance!
A pure Python alternative is to create .tar.xz files with a combination of the standard library tarfile module and the liblzma wrapper module pyliblzma. This will create files comparable in size to .7z archives:
import tarfile
import lzma

TAR_XZ_FILENAME = 'archive.tar.xz'
DIRECTORY_NAME = '/usr/share/doc/'

xz_file = lzma.LZMAFile(TAR_XZ_FILENAME, mode='w')
with tarfile.open(mode='w', fileobj=xz_file) as tar_xz_file:
    tar_xz_file.add(DIRECTORY_NAME)
xz_file.close()
The tricky part is the progress report. The example above uses the recursive directory mode of the tarfile.TarFile add method, so the add call does not return until the whole directory has been added.
The following questions discuss possible strategies for monitoring the progress of the tar file creation:
Python tarfile progress output?
Python tarfile progress
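One simple strategy (a sketch, not taken from the linked questions; it assumes Python 3.3+, where tarfile supports xz directly) is to walk the directory yourself and add one file at a time instead of using the recursive add, printing progress after each file:

import os
import tarfile

TAR_XZ_FILENAME = 'archive.tar.xz'
DIRECTORY_NAME = '/usr/share/doc/'

paths = [os.path.join(root, name)
         for root, dirs, files in os.walk(DIRECTORY_NAME)
         for name in files]

with tarfile.open(TAR_XZ_FILENAME, 'w:xz') as tar:
    for i, path in enumerate(paths, 1):
        tar.add(path)                              # one file per call
        print('%d/%d %s' % (i, len(paths), path))  # per-file progress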
I will explain what my problem is first, as it's important to understand what I want :-).
I'm working on a Python-written pipeline that uses several external tools to perform several genomics data analyses. One of these tools works with very large fastq files, which in the end are just plain text files.
Usually, these fastq files are gzipped, and since they're plain text the compression ratio is very high. Most data analysis tools can work with gzipped files, but we have a few that can't. So what we're doing is unzip the files, work with them, and finally re-compress them.
As you may imagine, this process is:
Slower
Heavier on disk usage
Heavier on bandwidth (if working on an NFS filesystem)
So I'm trying to figure out a way of "tricking" these tools to work directly with gzipped files without having to touch the source code of the tools.
I thought about using FIFO files and tried that, but it doesn't work if the tool reads the file more than once, or if the tool seeks around the file.
So basically I have two questions:
Is there any way to map a file into memory so that you can do something like:
./tool mapped_file (where mapped_file is not really a file, but a reference to a memory-mapped file)?
Do you have any other suggestions about how I can achieve my goal?
Thank you very much to everybody!
From this answer you can load the whole uncompressed file into ram:
mkdir /mnt/ram
mount -t ramfs ram /mnt/ram
# uncompress your file to that directory
./tool /mnt/ram/yourdata
This, however, has the drawback of loading everything to ram: you'll need to have enough space to hold your uncompressed data!
Use umount /mnt/ram when you're finished.
If your script can read from standard input, then one possibility would be to decompress and stream using zcat, and then pipe it to your script.
Something like this:
zcat large_file.gz | ./tool
If you want to compress your results as well, then you can just pipe the output to gzip again:
zcat large_file.gz | ./tool | gzip - > output.gz
Otherwise, you can look at python's support for memory mapping:
http://docs.python.org/library/mmap.html
Finally, you can convert the ASCII fastq files to BAM format, which isn't compressed (per se) but uses a more compact format that will save you space. See the following:
http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam
Consider looking at winning entries in the Pistoia Alliance Sequence Squeeze contest, which rated FASTQ compression tools. You may find a tool which provides IO overhead reduction through random access and faster decompression performance.
You can write a FUSE filesystem driver if you are on linux: http://pypi.python.org/pypi/fuse-python
The FUSE driver would need to compress and decompress the files. Maybe something like this already exists.
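As a rough illustration only (using the fusepy package rather than the fuse-python module linked above, and naively caching whole decompressed files in memory, which a real driver for huge fastq files would replace with streaming or an index), a read-only filesystem that exposes a directory of .gz files as if they were already decompressed might look like this; the mount paths are made up.

import errno
import gzip
import os
import stat
from fuse import FUSE, Operations, FuseOSError    # pip install fusepy

class GunzipFS(Operations):
    def __init__(self, root):
        self.root = root
        self.cache = {}                            # path -> decompressed bytes

    def _gz_path(self, path):
        return os.path.join(self.root, path.lstrip('/') + '.gz')

    def _data(self, path):
        if path not in self.cache:
            with gzip.open(self._gz_path(path), 'rb') as f:
                self.cache[path] = f.read()
        return self.cache[path]

    def readdir(self, path, fh):
        names = [n[:-3] for n in os.listdir(self.root) if n.endswith('.gz')]
        return ['.', '..'] + names

    def getattr(self, path, fh=None):
        if path == '/':
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if not os.path.exists(self._gz_path(path)):
            raise FuseOSError(errno.ENOENT)
        return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                    st_size=len(self._data(path)))

    def read(self, path, size, offset, fh):
        return self._data(path)[offset:offset + size]

if __name__ == '__main__':
    # Gzipped inputs live in /data/fastq_gz; decompressed views appear under /mnt/fastq.
    FUSE(GunzipFS('/data/fastq_gz'), '/mnt/fastq', foreground=True, ro=True)

The external tool can then be pointed at /mnt/fastq/sample.fastq and read (and seek in) the decompressed data without a full extraction to disk.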
I have a game I am writing in pygame, and it has map files. These are automatically generated by my game, and I would like to compress/decompress them before reading and writing so that my game runs quicker. The files are about 3 KB, so they aren't that big. Please help, and thanks in advance.
For the how: http://docs.python.org/library/archiving.html
For the which, that depends. In fact, I'm not even sure that compression will help you out here. But consider that gzip or bz2 might be a good choice if you want to compress individual files. However, if you want to compress/decompress all of your map files together, then you'll want to consider a format that supports that, such as zipfile.
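A minimal sketch of both options (the file names and map contents here are made up):

import gzip
import zipfile

map_data = '####\n#..#\n####\n'                     # hypothetical generated map

# Option 1: compress each map file on its own with gzip.
with gzip.open('level1.map.gz', 'wt') as f:
    f.write(map_data)
with gzip.open('level1.map.gz', 'rt') as f:
    map_data = f.read()

# Option 2: keep all maps together in one zip archive.
with zipfile.ZipFile('maps.zip', 'a', zipfile.ZIP_DEFLATED) as z:
    z.writestr('level1.map', map_data)
with zipfile.ZipFile('maps.zip') as z:
    map_data = z.read('level1.map').decode('utf-8')

For 3 KB files the time saved on disk I/O is unlikely to be noticeable, so it's worth measuring before adding the extra step.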