I currently have the following csv writer class:
import csv

class csvwriter():
    writer = None
    writehandler = None

    @classmethod
    def open(cls, file):
        cls.writehandler = open(file, 'wb')
        cls.writer = csv.writer(cls.writehandler, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_NONNUMERIC)

    @classmethod
    def write(cls, arr):
        cls.writer.writerow(arr)

    @classmethod
    def close(cls):
        cls.writehandler.close()
which can generate proper csv files without ever having to store the full array in memory at any one time.
However, the files created through this code can be quite large, so I'm looking to compress them rather than writing them out uncompressed, in order to save on disk usage. I can't effectively store the file in memory either, as I'm expecting files of well over 20 GB to be a regular occurrence.
The recipients of the resulting files are generally not sysadmins of their PCs, nor do they all use Linux, so I'm constrained in the types of algorithms I'm able to use for this task. Preferably, the solution would use a compression scheme that's natively readable (no extra executables required) on Windows, OS X and any Linux distribution.
I've found gzip provides a very handy interface in Python, but reading gzipped files on Windows seems like quite a hassle. Ideally I'd put them in a zip archive, but zip archives don't allow you to append data to a file already present in the archive, which then forces me to either store the whole file in memory, or write the data away to several smaller files that I would be able to fit in memory.
My question: Is there a solution that would benefit from the best of both worlds? Widespread availability of tools to read the target format on the end-user's machine, and also the ability to append, rather than write the whole file in one go?
Thanks in advance for your consideration!
gzlog may provide the functionality you're looking for. It efficiently appends short strings to a gzip file, intended for applications where short messages are appended to a long log.
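Not gzlog itself, but building on the gzip interface mentioned in the question: a minimal sketch (Python 3; the class name is made up) of the same writer class pointed at a gzip stream, so each row is compressed as it is written and the full file never sits in memory.

import csv
import gzip

class gzcsvwriter():
    writer = None
    writehandler = None

    @classmethod
    def open(cls, file):
        # 'wt' layers a text-mode stream over the compressed file,
        # which is what csv.writer expects in Python 3
        cls.writehandler = gzip.open(file, 'wt', newline='')
        cls.writer = csv.writer(cls.writehandler, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_NONNUMERIC)

    @classmethod
    def write(cls, arr):
        cls.writer.writerow(arr)

    @classmethod
    def close(cls):
        cls.writehandler.close()

Whether .gz output is acceptable still depends on your Windows users having an archive tool such as 7-Zip installed, since Explorer only handles .zip natively.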
Related
Ok, I realize the title probably wasn't that clear. I'll clarify here and hope someone can help with a better title.
I'm opening a compressed file (tarball or similar) in Python and reading some of the contents. One of the enclosed files is quite large (about 200 GB, mostly zeros). Since the Python tarfile module gives me file-handle-like objects, I can generally use them as if I had opened the file in the archive, without ever fully decompressing the enclosed file.
Unfortunately, I have to do some processing on this enclosed file using a 3rd party tool that I can't modify. This 3rd party tool only operates on files that are on disk. It won't take input from stdin.
What I do now is extract the entire 200 GB (mostly zeros) file to the disk for further processing. Obviously, this takes a while.
What I'd like to do is (using python if possible) make a "file" on disk that maps back to the "file handle" from the tarfile module in python. I could then pass this "file" to my 3rd party tool and go from there.
My target OS is Linux (though a solution that also works on OS X would be nice). I don't care about working on Windows.
Edits
External tool takes a filename (or full path) as a parameter. It prints out data to stdout (which python reads)
I've gotten a setup using sparse files working. While not as fast as I had hoped, it is significantly faster than it was before.
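For anyone curious, a minimal sketch of that sparse-file approach, assuming a member name inside the tarball (the names here are hypothetical): zero-filled chunks are skipped with seek() instead of being written, so the filesystem leaves holes and the extraction touches far less disk.

import tarfile

CHUNK = 1024 * 1024  # read the member in 1 MiB pieces

def extract_sparse(tar_path, member_name, out_path):
    with tarfile.open(tar_path, "r:*") as tar:
        src = tar.extractfile(member_name)
        with open(out_path, "wb") as dst:
            while True:
                chunk = src.read(CHUNK)
                if not chunk:
                    break
                if chunk.count(0) == len(chunk):
                    # all zeros: seek forward instead of writing, leaving a hole
                    dst.seek(len(chunk), 1)  # 1 = os.SEEK_CUR
                else:
                    dst.write(chunk)
            dst.truncate()  # fix the final size in case the file ends in a hole

extract_sparse("archive.tar.gz", "huge_mostly_zeros.dat", "huge_mostly_zeros.dat")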
I am in the process of writing a program and need some guidance. Essentially, I am trying to determine if a file has some marker or flag attached to it, sort of like the attributes in an HTTP header.
If such a marker exists, that file will be manipulated in some way (moved to another directory).
My question is:
Where exactly should I be storing this flag/marker? Do files have a system similar to HTTP headers? I don't want to access or manipulate the contents of the file, just some kind of property of the file that can be edited without corrupting the actual file, and it must be fairly universal among file types, as my potential domain of file types is unbounded. I have some experience with web APIs, so I am familiar with HTTP headers and JSON. Does any similar system exist for local files in Windows? I am especially interested in hearing from anyone who has professional/industry knowledge of common techniques that programmers use when trying to store metadata about files in order to access it later, or who can point me in the right direction, as I am unsure what I should be researching.
For the record, I am going to write the program for Windows, probably using Golang or Python, and the files I am going to manipulate will potentially be all the common ones (.docx, .txt, .pdf, etc.).
Metadata you wish to add is best kept in a separate file or database for all files.
Or in another file with the same name and a different extension or prefix, which you can make hidden.
Relying on a file system is very tricky and your data will be bound by the restrictions and capabilities of the file system your file is stored on.
And, you cannot count on your data remaining intact as any application may wish to change these flags.
And some of those have very specific, clearly defined use, such as creation time, modification time, access time...
See, if you only need to flag the document, you might be tempted to use the creation time, which will stay unchanged throughout the life of the document (until it is copied), to store your flags. :D
Very dirty business, unprofessional, unreliable and all that.
But it's a solution. A poor one, but it exists.
I do not know of any extra bits in the FAT32 or NTFS file systems for flagging, beyond those already used by the OS.
Unix EXT-family file systems do support some extra bits, but even then you should be careful in case some other important application makes use of them for something.
Mac OS may support some metadata by itself, but I am not 100% sure.
On Windows, you have one more option to associate more data with a file, but I wouldn't use that as well.
Well, the NTFS file system (FAT doesn't support this) has a feature called alternate data streams.
In essence, the same file can have multiple data streams under it, i.e. more than one set of file contents under the same file node.
To be clearer: the same file can contain two different files.
When you open the file normally, only the main stream is visible to the application. Applications must explicitly check whether other streams are present and choose the one they want to read.
So, you may choose to store metadata under the second stream of the file.
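For what it's worth, a minimal sketch of writing and reading such a stream from Python, assuming the file sits on an NTFS volume on Windows; the ":usermeta" stream name is made up for this example.

# Assumes Windows + NTFS; ":usermeta" is a hypothetical alternate stream name.
with open(r"report.docx:usermeta", "w") as stream:
    stream.write("flag=move-to-archive")

# The main document is untouched; the flag is read back from the same stream.
with open(r"report.docx:usermeta", "r") as stream:
    print(stream.read())  # -> flag=move-to-archive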
But what if all the streams are already taken?
Even worse, anti-virus programs may block access to that metadata out of paranoia, or at least ask for permission.
I don't know why MS included that option, probably for file duplication or something, but malicious hackers made use of the fact that you can store data under an existing regular file without anybody being aware of it.
Imagine a virus writing its copy into another stream of one of the programs already there.
All that is needed for it to start instead of your old program the next time you run it is a batch script added to the Task Scheduler that swaps the two streams, making the virus data the main one.
A nasty trick! So when this feature started to be abused, anti-virus software began restricting files with multiple streams, and in practice it is as if the feature doesn't exist.
If you want to add some metadata using the OS's own technology, you could use the Windows registry,
but even that is unwise.
What else can I tell you?
Don't add metadata to the files themselves; keep a separate file or database, or index your data in companion files with the same name as the file you are referring to, in the same folder.
If you are dealing with binary files like docx and pdf, you're best off storing the metadata in separate files or in an SQLite file.
Metadata is usually stored separately from files, in data structures called inodes (at least on Unix systems; Windows probably has something similar). But you probably don't want to go that deep down the rabbit hole.
If your goal is to query the system based on metadata, then it would be easier and more efficient to use something like SQLite. Having the metadata in the file would mean that you would need to open the file, read it into memory from disk, and then check the metadata, i.e. slower queries.
If you don't need to query based on metadata, then storing metadata in the file might make sense. It would reduce the dependencies in your application, but in order to access the contents of the file through Word or Adobe Reader, you'd need to strip the metadata before handing it off to the application. Not worth the hassle, usually.
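For the query-by-metadata case, a minimal sketch of such a separate SQLite index (the database, table and column names are all made up):

import sqlite3

# A metadata index kept next to, not inside, the files it describes.
con = sqlite3.connect("file_metadata.db")
con.execute("""CREATE TABLE IF NOT EXISTS metadata (
                   path TEXT PRIMARY KEY,
                   flag TEXT)""")

# Tag a file...
con.execute("INSERT OR REPLACE INTO metadata (path, flag) VALUES (?, ?)",
            (r"C:\docs\report.docx", "move-to-archive"))
con.commit()

# ...and later query by flag without opening any of the files themselves.
for (path,) in con.execute("SELECT path FROM metadata WHERE flag = ?",
                           ("move-to-archive",)):
    print(path)

con.close()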
I've been trying to use the built-in Python zipfile module to manipulate some .zip files on Windows; I wish to use them to store a number of files related to the current project in a program. The problem comes when I load the files from the zip and then wish to re-save them into a new, different zip file:
import os
import zipfile
zp = zipfile.ZipFile(r"first.zip",mode='r')
myfile = zp.open(r"stored_file.txt",mode='r')
### Do something, then want to save again ###
zp2 = zipfile.ZipFile(r"second.zip",mode='w')
#Doesn't work, as myfile isn't a real file:
zp2.write(myfile)
#Doesn't work, as the path can't be resolved:
zp2.write(os.path.join(zp.filename,myfile.name))
#The following works... as long as you haven't called read()
#since .seek(0) doesn't work for ZipExtFile
zp2.writestr(myfile.name,myfile.read())
I could, of course, extract the files to somewhere and then re-add them to the new zip that way, but it would be clunky and require a lot of cleanup (and creating a lot of temporary files).
Equally I could keep track of the original zip file and use the writestr method by re-opening the file, but I was hoping to avoid it. I just wondered if there was a better way around this problem; it means I'll have to have code that determines whether the file originally came from a zip or not as well and handle it differently if it did.
Edit: If anyone else has the final problem with seek(0) not working on ZipExtFile, it is possible to wrap the result of myfile.read() in an io.BytesIO object (or io.StringIO for decoded text), which is then seekable. It means I have to keep the files loaded in memory, though, so I'm going to go with keeping track of the zipfile and transferring them only when I need them.
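Note: on Python 3.6+, ZipFile.open() also accepts mode='w', so a member can be streamed straight from one archive into another without a temporary file or a full read into memory. A minimal sketch, reusing the hypothetical names from the question:

import shutil
import zipfile

with zipfile.ZipFile("first.zip", mode="r") as zp, \
     zipfile.ZipFile("second.zip", mode="w",
                     compression=zipfile.ZIP_DEFLATED) as zp2:
    info = zp.getinfo("stored_file.txt")
    with zp.open(info, mode="r") as src, zp2.open(info.filename, mode="w") as dst:
        shutil.copyfileobj(src, dst)  # copied in chunks, no seek() needed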
I will explain my problem first, as it's important to understand what I want :-).
I'm working on a Python-written pipeline that uses several external tools to perform several genomics data analyses. One of these tools works with very large FASTQ files, which in the end are no more than plain text files.
Usually these FASTQ files are gzipped, and since they're plain text the compression ratio is very high. Most data analysis tools can work with gzipped files, but we have a few that can't. So what we're doing is unzipping the files, working with them, and finally re-compressing them.
As you may imagine, this process is:
Slower
Heavy on disk usage
Heavy on bandwidth (if working on an NFS filesystem)
So I'm trying to figure out a way of "tricking" these tools to work directly with gzipped files without having to touch the source code of the tools.
I thought about using FIFO files, and I tried that, but it doesn't work if the tool reads the file more than once, or if the tool seeks around the file.
So basically I have two questions:
Is there any way to map a file into memory so that you can do something like:
./tool mapped_file (where mapped_file is not really a file, but a reference to a memory-mapped file).
Do you have any other suggestions about how can I achieve my target?
Thank you very much to everybody!
From this answer you can load the whole uncompressed file into RAM:
mkdir /mnt/ram
mount -t ramfs ram /mnt/ram
# uncompress your file to that directory
./tool /mnt/ram/yourdata
This, however, has the drawback of loading everything into RAM: you'll need to have enough space to hold your uncompressed data!
Use umount /mnt/ram when you're finished.
If your script can read from standard input, then one possibility would be to decompress and stream using zcat, and then pipe it to your script.
Something like this:
zcat large_file.gz | ./tool
If you want to compress your results as well, then you can just pipe the output to gzip again:
zcat large_file.gz | ./tool | gzip - > output.gz
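If you'd rather drive this from the Python pipeline instead of the shell, here is a minimal sketch of the same idea with subprocess, assuming a hypothetical ./tool that reads the data on stdin and writes its result to stdout; a thread feeds the decompressed input while the main thread recompresses the output, so neither side fills up the pipe.

import gzip
import shutil
import subprocess
import threading

def run_tool_on_gzip(in_path, out_path):
    proc = subprocess.Popen(["./tool"],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed():
        # stream the decompressed data to the tool; nothing is held in memory
        with gzip.open(in_path, "rb") as src:
            shutil.copyfileobj(src, proc.stdin)
        proc.stdin.close()  # signal EOF to the tool

    threading.Thread(target=feed, daemon=True).start()

    # recompress whatever the tool writes to stdout
    with gzip.open(out_path, "wb") as dst:
        shutil.copyfileobj(proc.stdout, dst)
    proc.wait()

run_tool_on_gzip("large_file.gz", "output.gz")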
Otherwise, you can look at Python's support for memory mapping:
http://docs.python.org/library/mmap.html
Finally, you can convert the ASCII FASTQ files to BAM format, a compact binary representation that will save you space. See the following:
http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam
Consider looking at the winning entries in the Pistoia Alliance Sequence Squeeze contest, which rated FASTQ compression tools. You may find a tool that reduces I/O overhead through random access and faster decompression performance.
You can write a FUSE filesystem driver, if you are on Linux: http://pypi.python.org/pypi/fuse-python
The fuse driver needs to compress and decompress the files. Maybe something like this already exists.
I have about 200,000 text files that are placed in a bz2 file. The issue I have is that when I scan the bz2 file to extract the data I need, it goes extremely slowly. It has to look through the entire bz2 file to find the single file I am looking for. Is there any way to speed this up?
Also, I thought about possibly organizing the files in the tar.bz2 so it instead knows where to look. Is there any way to organize files that are put into a bz2?
More Info/Edit:
I need to query the compressed file for each text file. Is there a better compression method that supports such a large number of files and still compresses them as thoroughly?
Do you have to use bzip2? Reading its documentation, it's quite clear it's not designed to support random access. Perhaps you should use a compression format that more closely matches your requirements. The good old Zip format supports random access, but might compress worse, of course.
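To illustrate the random-access point, a minimal sketch; ZIP keeps a central directory, so a single member can be pulled out by name without decompressing the rest (the file names are hypothetical).

import zipfile

with zipfile.ZipFile("collection.zip", "r") as zf:
    # the central directory is read once; this lookup doesn't scan the archive
    with zf.open("doc_123456.txt") as member:
        data = member.read()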
Bzip2 compresses in large blocks (900 KiB by default, I believe). One method that would speed up the scanning of the tar file dramatically, but would reduce compression performance, would be to compress each file individually and then tar the results together. This is essentially what Zip-format files are (though using zlib compression rather than bzip2). But you could then easily grab the tar index and only have to decompress the specific file(s) you are looking for.
I don't think most tar programs offer much ability to organize files in any meaningful way, though you could write a program to do this for your special case (I know Python has tar-writing libraries though I've only used them once or twice). However, you'd still have the problem of having to decompress most of the data before you found what you were looking for.
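A minimal sketch of the compress-individually-then-tar idea described above (the file names are hypothetical); the outer tar stays uncompressed, so locating a member only requires reading the tar index, and only that one member gets decompressed.

import bz2
import tarfile

# Build: compress each text file on its own, then tar the .bz2 results together.
with tarfile.open("collection.tar", "w") as tar:
    for name in ["doc_00001.txt", "doc_00002.txt"]:
        compressed = name + ".bz2"
        with open(name, "rb") as src, bz2.open(compressed, "wb") as dst:
            dst.write(src.read())  # files are small, so reading whole is fine
        tar.add(compressed)

# Lookup: pull one member straight out of the tar and decompress only it.
with tarfile.open("collection.tar", "r") as tar:
    member = tar.extractfile("doc_00002.txt.bz2")
    text = bz2.decompress(member.read())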