I have the following requirements (from the client) for zipping a number of files.
If the zip file created is less than 2**31-1 ~2GB use compression to create it (use zipfile.ZIP_DEFLATED), otherwise do not compress it (use zipfile.ZIP_STORED).
The current solution is to compress the file without zip64 and catch the zipfile.LargeZipFile exception, then create the non-compressed version.
My question is whether or not it would be worthwhile to attempt to calculate (approximately) whether or not the zip file will exceed the zip64 size without actually processing all the files, and how best to go about it? The process for zipping such large amounts of data is slow, and minimizing the duplicate compression processing might speed it up a bit.
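For reference, a minimal sketch of the current try-then-fall-back approach described above. The helper names (`make_archive`, `_write_archive`) are hypothetical, and storing each file under its base name is an assumption made only for illustration:

```python
import os
import zipfile


def _write_archive(paths, archive_path, compression, allow_zip64):
    # Mode "w" truncates any partial archive left behind by a failed attempt.
    with zipfile.ZipFile(archive_path, "w", compression=compression,
                         allowZip64=allow_zip64) as zf:
        for path in paths:
            # Assumption: base names are unique within one archive.
            zf.write(path, arcname=os.path.basename(path))


def make_archive(paths, archive_path):
    """Try a compressed archive without zip64; fall back to an uncompressed one."""
    try:
        _write_archive(paths, archive_path, zipfile.ZIP_DEFLATED, False)
        return zipfile.ZIP_DEFLATED
    except zipfile.LargeZipFile:
        # The compressed attempt needed zip64, so redo it uncompressed.
        _write_archive(paths, archive_path, zipfile.ZIP_STORED, True)
        return zipfile.ZIP_STORED
```

The cost I would like to avoid is the wasted first pass whenever the except branch is taken.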
Edit: I would upvote both solutions, as I think I can generate a useful heuristic from a combination of max and min file sizes and compression ratios. Unfortunately at this time, StackOverflow prevents me from upvoting anything (until I have a reputation higher than noob). Thanks for the good suggestions.
The only way I know of to estimate the zip file size is to look at the compression ratios for previously compressed files of a similar nature.
I can only think of two ways, one simple but requires manual tuning, and the other may not provide enough benefit to justify the complexity.
Define a file size at which you just skip the zip attempt, and tune it to your satisfaction by hand.
Keep a record of the last N file sizes between the smallest failure to zip ever observed and the largest successful zip ever observed. Decide on an acceptable probability of an incorrect choice, that is, of a file that should have been zipped being left unzipped (say 5%). Set your "don't bother trying to zip" threshold so that it would have erroneously left that percentage of files unzipped.
If you absolutely can never miss an opportunity to zip a file that should have been zipped, then you've already got the solution.
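A rough sketch of the second option, assuming you have logged the input sizes of past jobs whose compressed archive fit under the limit. The function name `skip_threshold` and its parameters are hypothetical, and restricting the log to the last N entries is left out for brevity:

```python
import math


def skip_threshold(successful_sizes, max_miss_rate=0.05):
    """Return an input size above which the compressed attempt is skipped.

    At most `max_miss_rate` of the recorded successful jobs would have
    been skipped (left uncompressed) under the returned threshold.
    """
    sizes = sorted(successful_sizes)
    if not sizes:
        # No history yet: never skip the compressed attempt.
        return math.inf
    # Number of past successes we are willing to have missed.
    allowed_misses = min(int(max_miss_rate * len(sizes)), len(sizes) - 1)
    return sizes[len(sizes) - 1 - allowed_misses]
```

You would then skip straight to the uncompressed archive whenever the total input size is greater than the returned value.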
A heuristic approach will always involve some false positives and some false negatives.
The eventual size of the zipped file will depend on a number of factors, some of which are not knowable without running the compression process itself.
The ZIP format allows you to use many different compression methods, such as DEFLATE, bzip2, LZMA, etc.
Even the compression format may do the compression differently depending on the data to be compressed. For example, bzip2 can use Burrows-Wheeler, run length encoding and Huffman among others. The eventual size of the file will then depend on the statistical properties of the data being compressed.
Take Huffman, for instance; the size of the symbol table depends on how randomly-distributed the content of the file is.
One can go on and try to profile different types of data, serialized binary, text, images etc. and each will have a different normal distribution of final zipped size.
If you really need to save time by doing the process only once, apart from building a very large database and using a rule-based expert system or one based on Bayes' Theorem, there is no real 100% approach to this problem.
You could also try sampling blocks of the file at random intervals and compressing this sample, then linearly interpolating based on the size of the file.
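A minimal sketch of that sampling idea, using zlib as a stand-in for the DEFLATE compression the zip would use. The names `estimate_ratio` and `choose_compression`, the sample counts, and the 10% safety margin are all assumptions to tune, and per-entry zip overhead is ignored:

```python
import os
import random
import zipfile
import zlib

ZIP32_LIMIT = 2**31 - 1


def estimate_ratio(path, n_samples=16, block_size=64 * 1024, seed=None):
    """Estimate compressed/raw size ratio from randomly placed blocks."""
    size = os.path.getsize(path)
    if size == 0:
        return 1.0
    rng = random.Random(seed)
    raw = comp = 0
    with open(path, "rb") as f:
        for _ in range(n_samples):
            # Pick a random offset that still leaves room for a full block.
            f.seek(rng.randrange(max(1, size - block_size + 1)))
            block = f.read(block_size)
            raw += len(block)
            comp += len(zlib.compress(block))
    return comp / raw


def choose_compression(paths, safety_margin=1.1):
    """Pick ZIP_DEFLATED only if the extrapolated size fits under the limit."""
    estimate = sum(os.path.getsize(p) * estimate_ratio(p) for p in paths)
    if estimate * safety_margin < ZIP32_LIMIT:
        return zipfile.ZIP_DEFLATED
    return zipfile.ZIP_STORED
```

Because each block is compressed on its own, the estimate tends to be slightly pessimistic, which errs on the side of skipping compression.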
Related
I am working on a project where I am combining 300,000 small files together to form a dataset to be used for training a machine learning model. Because each of these files does not represent a single sample, but rather a variable number of samples, the dataset I require can only be formed by iterating through each of these files and concatenating/appending them to a single, unified array. With this being said, I unfortunately cannot avoid having to iterate through such files in order to form the dataset I require. As such, the process of data loading prior to model training is very slow.
Therefore my question is this: would it be better to merge these small files together into relatively larger files, e.g., reducing the 300,000 files to 300 (merged) files? I assume that iterating through fewer (but larger) files would be faster than iterating through many (but smaller) files. Can someone confirm if this is actually the case?
For context, my programs are written in Python and I am using PyTorch as the ML framework.
Thanks!
Usually working with one bigger file is faster than working with many small files.
It needs fewer calls to open, read, close, etc., each of which takes time to
check if the file exists,
check if you have the privilege to access this file,
get the file's information from disk (where the beginning of the file is on disk, what its size is, etc.),
seek to the beginning of the file on disk (when it has to read data),
create the system's buffer for data from disk (the system reads more data into the buffer, and later calls to read() can read from the buffer instead of reading from disk).
With many files it has to do this for every file, and the disk is much slower than a buffer in memory.
I have a dataset with many different samples (numpy arrays). It is rather impractical to store everything in only one file, so I store many different 'npz' files (numpy arrays compressed in zip).
Now I feel that if I could somehow exploit the fact that all the files are similar to one another I could achieve a much higher compression factor, meaning a much smaller footprint on my disk.
Is it possible to store separately a 'zip basis'? I mean something which is calculated for all the files together and embodies their statistical features and is needed for decompression, but is shared between all the files.
I would have said 'zip basis' file and a separate list of compressed files, which would be much smaller in size than each file zipped alone, and to decompress I would use the share 'zip basis' every time for each file.
Is it technically possible? Is there something that works like this?
tl;dr: It depends on the size of each individual file and the data therein. For example, characteristics / use-cases / access patterns likely vary wildly between 234567x100 byte files and 100x234567 byte files.
Now I feel that if I could somehow exploit the fact that all the files are similar to one another I could achieve a much higher compression factor, meaning a much smaller footprint on my disk.
Possibly. Shared compression benefits will decrease as the size of each file increases.
Regardless, even a mono-file implementation (say, a standard zip) may save significant effective disk space for very many very small files, as it avoids the overhead file systems require to manage individual files; if nothing else, many implementations must align files to full blocks (e.g. 512 to 4k bytes). Plus, you get free compression in a ubiquitously supported format.
Is it possible to store separately a 'zip basis'? I mean something which is calculated for all the files together and embodies their statistical features and is needed for decompression, but is shared between all the files.
This 'zip basis' is sometimes called a Pre-shared Dictionary.
I would have said 'zip basis' file and a separate list of compressed files, which would be much smaller in size than each file zipped alone, and to decompress I would use the share 'zip basis' every time for each file.
Is it technically possible? Is there something that works like this?
Yes, it's possible. SDCH (Shared Dictionary Compression for HTTP) was one such implementation designed for common web files (e.g. HTML/CSS/JavaScript). In certain cases it could achieve significantly higher compression than standard DEFLATE.
The approach can be emulated with many compression algorithms that work on streams where the compression dictionary is encoded as part of the stream-as-written. (U = Uncompressed, C = compressed.)
To compress:
[U:shared_dict] + [U:data] -> [C:shared_dict] + [C:data]
^-- "zip basis"               ^                 ^-- write only this to file
                              ^-- artifact of priming
To decompress:
[C:shared_dict] + [C:data] -> [U:shared_dict] + [U:data]
^-- add this back before decompressing!         ^-- use this
The overall space saved depends on many factors, including how useful the initial priming dictionary is and on the specific compressor details. LZ77-esque implementations (such as DEFLATE) are particularly well-suited to the approach above due to their use of a sliding window that acts as a lookup dictionary.
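In Python, one way to try this without manually prepending and stripping the dictionary is zlib's preset-dictionary support (the `zdict` argument). This is a minimal sketch rather than the SDCH scheme itself; the two wrapper function names are hypothetical, and the same dictionary bytes must be available at decompression time:

```python
import zlib


def compress_with_dict(data: bytes, shared_dict: bytes) -> bytes:
    # The preset dictionary primes the sliding window but is not written out.
    compressor = zlib.compressobj(level=9, zdict=shared_dict)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob: bytes, shared_dict: bytes) -> bytes:
    # The identical dictionary has to be supplied again to decode.
    decompressor = zlib.decompressobj(zdict=shared_dict)
    return decompressor.decompress(blob) + decompressor.flush()
```

Only the tail of a long dictionary can be referenced (DEFLATE's window is 32 KB), so the most commonly shared content is best placed at the end of the dictionary.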
Alternatively, it may be possible to use domain-specific knowledge and/or encoding to also achieve better compression with specialized compression schemes. An example of this is SQL Server's Page Compression which exploits data similarities between columns on different rows.
A 'zip-basis' is interesting but problematic.
You could preprocess the files instead. Take one file as a template and calculate the diff of each file compared to the template. Then compress the diffs.
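A very simplified sketch of the template idea, using a byte-wise XOR as the "diff". This assumes every file has exactly the same length as the template (true for equally shaped raw arrays, not for files in general); the function names are hypothetical, and a real diff tool would handle insertions and deletions:

```python
import zlib


def compress_against_template(data: bytes, template: bytes) -> bytes:
    if len(data) != len(template):
        raise ValueError("this sketch assumes equal-length files")
    # Bytes identical to the template become zero, which compresses well.
    delta = bytes(a ^ b for a, b in zip(data, template))
    return zlib.compress(delta, 9)


def restore_from_template(blob: bytes, template: bytes) -> bytes:
    delta = zlib.decompress(blob)
    # XOR is its own inverse, so applying the template again restores the data.
    return bytes(a ^ b for a, b in zip(delta, template))
```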
I have a bunch of flat files that basically store millions of paths and their corresponding info (name, atime, size, owner, etc)
I would like to compile a full list of all the paths stored collectively on the files. For duplicate paths only the largest path needs to be kept.
There are roughly 500 files and approximately a million paths in the text file. The files are also gzipped. So far I've been able to do this in python but the solution is not optimized as for each file it basically takes an hour to load and compare against the current list.
Should I go for a database solution? sqlite3? Is there a data structure or better algorithm to go about this in python? Thanks for any help!
So far I've been able to do this in python but the solution is not optimized as for each file it basically takes an hour to load and compare against the current list.
If "the current list" implies that you're keeping track of all of the paths seen so far in a list, and then doing if newpath in listopaths: for each line, then each one of those searches takes linear time. If you have 500M total paths, of which 100M are unique, you're doing O(500M*100M) comparisons.
Just changing that list to a set, and changing nothing else in your code (well, you need to replace .append with .add, and you can probably remove the in check entirely… but without seeing your code it's hard to be specific) makes each one of those checks take constant time. So you're doing O(500M) comparisons—100M times faster.
Another potential problem is that you may not have enough memory. On a 64-bit machine, you've got enough virtual memory to hold almost anything you want… but if there's not enough physical memory available to back that up, eventually you'll spend more time swapping data back and forth to disk than doing actual work, and your program will slow to a crawl.
There are actually two potential sub-problems here.
First, you might be reading each entire file in at once (or, worse, all of the files at once) when you don't need to (e.g., by decompressing the whole file instead of using gzip.open, or by using f = gzip.open(…) but then doing f.readlines() or f.read(), or whatever). If so… don't do that. Just iterate over the lines in each GzipFile, for line in f:.
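A minimal sketch combining the set with line-by-line reading. It assumes, purely for illustration, that each line holds just a path; the function name `collect_unique_paths` is hypothetical, and real records with extra fields would need to be split first:

```python
import gzip


def collect_unique_paths(gz_paths):
    """Stream each gzipped file line by line and keep the unique lines."""
    seen = set()
    for gz_path in gz_paths:
        # "rt" decodes to text lazily; the whole file is never held in memory.
        with gzip.open(gz_path, "rt", encoding="utf-8") as f:
            for line in f:
                seen.add(line.rstrip("\n"))  # set.add is O(1) on average
    return seen
```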
Second, maybe even a simple set of however many unique lines you have is too much to fit in memory on your computer. In that case, you probably want to look at a database. But you don't need anything as complicated as sqlite. A dbm acts like a dict (except that its keys and values have to be byte strings), but it's stored on disk, caching things in memory where appropriate, instead of stored in memory, paging to disk randomly, which means it will go a lot faster in this case. (And it'll be persistent, too.) Of course you want something that acts like a set, not a dict… but that's easy. You can model a set as a dict whose values are always ''. So instead of paths.add(newpath), it's just paths[newpath] = ''. Yeah, that wastes a few bytes of disk space over building your own custom on-disk key-only hash table, but it's unlikely to make any significant difference.
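A small sketch of the dbm-as-set idea using the standard library's `dbm` module. The function name `add_paths` is hypothetical, and which backend `dbm.open` picks (and therefore the on-disk file names and speed) depends on your platform:

```python
import dbm


def add_paths(db_path, paths):
    """Record paths in an on-disk 'set'; return how many were new."""
    new = 0
    # "c" opens the database for read/write, creating it if needed.
    with dbm.open(db_path, "c") as db:
        for path in paths:
            key = path.encode("utf-8")  # dbm keys are stored as bytes
            if key not in db:
                db[key] = b""  # the value is a placeholder; only keys matter
                new += 1
    return new
```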
I have a 3.3 GB file containing one long line. The values in the file are comma separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used HDF files, via h5py, of sizes well over 200 GB with very little processing time, on the order of a minute or two, for file accesses. In addition, the HDF libraries support MPI and concurrent access.
This means that, assuming you can format your original one-line file as an appropriately hierarchical HDF file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF to process your data on multiple cores, using MPI to pass whatever data you need between the cores.
You need to be careful with your code and understand how MPI works in conjunction with HDF, but it'll speed things up no end.
Of course all of this depends on putting the data into an HDF file in a way that allows you to take advantage of MPI... so maybe not the most practical suggestion.
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
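To illustrate the idea, here is a dependency-free sketch that parses the one-line text file once, in fixed-size chunks, and then stores the values in a raw binary file. It uses the standard library's `array` module as a stand-in for numpy.save / numpy.load; the function names and chunk size are assumptions:

```python
from array import array


def read_single_line_csv(path, chunk_size=1 << 20):
    """Parse a one-line, comma-separated file of numbers chunk by chunk."""
    values = array("d")
    leftover = ""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            # Everything after the last comma may be an incomplete number.
            head, sep, tail = chunk.rpartition(",")
            if not sep:
                leftover = chunk
                continue
            leftover = tail
            values.extend(float(tok) for tok in head.split(",") if tok.strip())
    if leftover.strip():
        values.append(float(leftover))
    return values


def save_binary(values, path):
    with open(path, "wb") as f:
        values.tofile(f)  # raw 8-byte doubles, no text parsing needed later


def load_binary(path):
    values = array("d")
    with open(path, "rb") as f:
        values.frombytes(f.read())
    return values
```

The slow text parse is paid once; later runs only call the binary loader (or, with numpy, numpy.load on a file written by numpy.save).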
If you can't change the file type (because it is not the output of one of your programs) then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8 GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.
I need to create ZIP archives on demand, using either Python zipfile module or unix command line utilities.
Resources to be zipped are often > 1GB and not necessarily compression-friendly.
How do I efficiently estimate its creation time / size?
Extract a bunch of small parts from the big file. Maybe 64 chunks of 64k each. Randomly selected.
Concatenate the data, compress it, measure the time and the compression ratio. Since you've randomly selected parts of the file chances are that you have compressed a representative subset of the data.
Now all you have to do is to estimate the time for the whole file based on the time of your test-data.
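A sketch of that procedure for a single large file, with zlib standing in for the zip's DEFLATE step. The function name `estimate_zip_cost` is hypothetical, the extrapolation is simply linear, and it assumes the file is much larger than the sample (otherwise overlapping chunks skew the result):

```python
import os
import random
import time
import zlib


def estimate_zip_cost(path, n_chunks=64, chunk_size=64 * 1024, seed=None):
    """Return (estimated_seconds, estimated_compressed_bytes) for one file."""
    size = os.path.getsize(path)
    rng = random.Random(seed)
    parts = []
    with open(path, "rb") as f:
        for _ in range(n_chunks):
            # Random offset that leaves room for a full chunk when possible.
            f.seek(rng.randrange(max(1, size - chunk_size + 1)))
            parts.append(f.read(chunk_size))
    sample = b"".join(parts)
    if not sample:
        return 0.0, 0
    start = time.perf_counter()
    compressed = zlib.compress(sample)
    elapsed = time.perf_counter() - start
    # Scale the sample's time and size up to the whole file.
    scale = size / len(sample)
    return elapsed * scale, int(len(compressed) * scale)
```

The timing covers compression only, so disk read speed for the full file has to be accounted for separately.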
I suggest you measure the average time it takes to produce a zip of a certain size. Then you calculate the estimate from that measure. However I think the estimate will be very rough in any case if you don't know how well the data compresses. If the data you want to compress had a very similar "profile" each time you could probably make better predictions.
If it's possible to get progress callbacks from the Python module, I would suggest finding out how many bytes are processed per second (by simply storing where in the file you were at the start of the second and where you are at the end). Once you have data on how fast the computer you're on is, you can of course save it and use it as a basis for your next zip file. (I normally collect about 5 samples before showing a time estimate.)
Using this method can give you "Microsoft minutes", so as you get more samples you would need to average them out. This is especially the case if you're making a zip file that contains a lot of files, as ZIP tends to slow down when compressing many small files compared to one large file.
If you're using the ZipFile.write() method to write your files into the archive, you could do the following:
Get a list of the files you want to zip and their relative sizes
Write one file to the archive and time how long it took
Calculate ETA based on the number of files written, their size, and how much is left.
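The steps above could be sketched like this. The function name `zip_with_eta` and the `report` callback are hypothetical, and the ETA assumes throughput per byte stays roughly constant across files:

```python
import os
import time
import zipfile


def zip_with_eta(paths, archive_path, report=print):
    """Write files one by one and report a byte-weighted ETA after each."""
    sizes = {p: os.path.getsize(p) for p in paths}
    total = sum(sizes.values())
    done = 0
    start = time.perf_counter()
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            # Assumption: base names are unique within the archive.
            zf.write(p, arcname=os.path.basename(p))
            done += sizes[p]
            elapsed = time.perf_counter() - start
            # Remaining time = average seconds per byte so far * bytes left.
            if done:
                remaining = elapsed * (total - done) / done
            else:
                remaining = float("nan")
            report(p, done, total, remaining)
```

The estimate only updates between files, so it gives no feedback while a single file is being written.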
This won't work if you're only zipping one really big file though. I've never used the zip module myself, so I'm not sure if it would work, but for small numbers of large files, maybe you could use the ZipFile.writestr() function and read in / zip up your files in chunks?