Estimating zip size/creation time - python

I need to create ZIP archives on demand, using either Python zipfile module or unix command line utilities.
Resources to be zipped are often > 1GB and not necessarily compression-friendly.
How do I efficiently estimate the archive's creation time and size?

Extract a bunch of small parts from the big file. Maybe 64 chunks of 64k each. Randomly selected.
Concatenate the data, compress it, measure the time and the compression ratio. Since you've randomly selected parts of the file chances are that you have compressed a representative subset of the data.
Now all you have to do is to estimate the time for the whole file based on the time of your test-data.
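A minimal sketch of that idea, using zlib level 6 (the same DEFLATE algorithm zipfile uses by default) on randomly sampled chunks; the function name and parameters are illustrative, and per-entry zip header overhead is not accounted for:

import os
import random
import time
import zlib

def estimate_zip(path, n_chunks=64, chunk_size=64 * 1024):
    # Rough estimate of compressed size and compression time for `path`
    # by compressing randomly sampled chunks and extrapolating.
    total_size = os.path.getsize(path)
    sample = bytearray()
    with open(path, 'rb') as f:
        for _ in range(n_chunks):
            f.seek(random.randrange(max(total_size - chunk_size, 1)))
            sample += f.read(chunk_size)

    start = time.perf_counter()
    compressed = zlib.compress(bytes(sample), 6)
    elapsed = time.perf_counter() - start

    ratio = len(compressed) / len(sample)
    est_size = total_size * ratio                     # extrapolate the compression ratio
    est_time = elapsed * (total_size / len(sample))   # extrapolate the throughput
    return est_size, est_time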

I suggest measuring the average time it takes to produce a zip of a given size and calculating the estimate from that. However, the estimate will be very rough in any case if you don't know how well the data compresses. If the data you want to compress had a very similar "profile" each time, you could probably make better predictions.

If it's possible to get progress callbacks from the Python module, I would suggest finding out how many bytes are processed per second (by simply recording where in the file you were at the start of the second and where you are at the end). Once you have data on how fast the machine you're on compresses, you can of course save it and use it as a basis for your next zip file. (I normally collect about 5 samples before showing a time prognosis.)
Using this method can give you "Microsoft minutes", so as you get more samples you should average them out. This is especially the case if you're making a zip file that contains a lot of files, as ZIP tends to slow down when compressing many small files compared to one large file.
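A sketch of that rolling throughput estimate, written as a hypothetical helper fed from whatever progress hook you have; the class name and parameters are illustrative:

import time
from collections import deque

class ThroughputEstimator:
    # Rolling average of bytes/second; reports an ETA only once a
    # minimum number of one-second samples has been collected.

    def __init__(self, total_bytes, min_samples=5, window=20):
        self.total_bytes = total_bytes
        self.min_samples = min_samples
        self.samples = deque(maxlen=window)    # recent bytes/sec readings
        self._last_time = time.monotonic()
        self._last_bytes = 0

    def update(self, bytes_done):
        # Call from your progress callback with the total bytes processed so far.
        now = time.monotonic()
        dt = now - self._last_time
        if dt >= 1.0:                          # take roughly one sample per second
            self.samples.append((bytes_done - self._last_bytes) / dt)
            self._last_time, self._last_bytes = now, bytes_done

    def eta_seconds(self, bytes_done):
        if len(self.samples) < self.min_samples:
            return None                        # not enough data yet
        avg_rate = sum(self.samples) / len(self.samples)
        return (self.total_bytes - bytes_done) / avg_rate if avg_rate else None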

If you're using the ZipFile.write() method to write your files into the archive, you could do the following:
Get a list of the files you want to zip and their relative sizes
Write one file to the archive and time how long it took
Calculate ETA based on the number of files written, their size, and how much is left.
This won't work if you're only zipping one really big file though. I've never used the zip module myself, so I'm not sure if it would work, but for small numbers of large files, maybe you could use the ZipFile.writestr() function and read in / zip up your files in chunks?
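A sketch of that per-file timing approach (the function name is illustrative, and it assumes many input files of roughly comparable compressibility):

import os
import time
import zipfile

def zip_with_eta(paths, archive_path):
    # Write files one at a time and report a running ETA based on how
    # many input bytes have been compressed so far.
    total_bytes = sum(os.path.getsize(p) for p in paths)
    done_bytes = 0
    start = time.perf_counter()

    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            zf.write(path)
            done_bytes += os.path.getsize(path)
            elapsed = time.perf_counter() - start
            eta = elapsed * (total_bytes - done_bytes) / done_bytes
            print(f"{done_bytes}/{total_bytes} bytes done, ~{eta:.0f}s remaining")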

Related

Is it more beneficial to read many small files or fewer large files of the exact same data?

I am working on a project where I am combining 300,000 small files together to form a dataset to be used for training a machine learning model. Because each of these files does not represent a single sample, but rather a variable number of samples, the dataset I require can only be formed by iterating through each of these files and concatenating/appending them to a single, unified array. With this being said, I unfortunately cannot avoid having to iterate through such files in order to form the dataset I require. As such, the process of data loading prior to model training is very slow.
Therefore my question is this: would it be better to merge these small files together into relatively larger files, e.g., reducing the 300,000 files to 300 (merged) files? I assume that iterating through fewer (but larger) files would be faster than iterating through many (but smaller) files. Can someone confirm if this is actually the case?
For context, my programs are written in Python and I am using PyTorch as the ML framework.
Thanks!
Usually working with one bigger file is faster than working with many small files.
It needs fewer open, read, close, etc. calls, each of which takes time to
check if the file exists,
check if you have the privilege to access this file,
get the file's information from disk (where the file begins on disk, what its size is, etc.),
seek to the beginning of the file on disk (when it has to read data),
create the system's buffer for data from disk (the system reads extra data into the buffer so that later read() calls can read partially from the buffer instead of partially from disk).
With many files it has to do all of this for every file, and the disk is much slower than a buffer in memory.
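A quick way to confirm this on your own data is to time both access patterns; the paths below are hypothetical (the small files in small_dir/ versus the same bytes concatenated into merged.bin):

import time
from pathlib import Path

small_files = sorted(Path("small_dir").glob("*.bin"))   # the many small files
merged = Path("merged.bin")                             # the same data as one file

t0 = time.perf_counter()
many = [p.read_bytes() for p in small_files]            # one open/read/close per file
t1 = time.perf_counter()
one = merged.read_bytes()                               # a single open/read/close
t2 = time.perf_counter()

print(f"many small files: {t1 - t0:.2f}s, one merged file: {t2 - t1:.2f}s")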

I must compress many similar files, can I exploit the fact they are similar?

I have a dataset with many different samples (numpy arrays). It is rather impractical to store everything in only one file, so I store many different 'npz' files (numpy arrays compressed in zip).
Now I feel that if I could somehow exploit the fact that all the files are similar to one another I could achieve a much higher compression factor, meaning a much smaller footprint on my disk.
Is it possible to store separately a 'zip basis'? I mean something which is calculated for all the files together and embodies their statistical features and is needed for decompression, but is shared between all the files.
I would have said a 'zip basis' file and a separate list of compressed files, which would be much smaller in size than each file zipped alone, and to decompress I would use the shared 'zip basis' every time for each file.
Is it technically possible? Is there something that works like this?
tl;dr: It depends on the size of each individual file and the data therein. For example, characteristics / use-cases / access patterns likely vary wildly between 234567x100 byte files and 100x234567 byte files.
Now I feel that if I could somehow exploit the fact that all the files are similar to one another I could achieve a much higher compression factor, meaning a much smaller footprint on my disk.
Possibly. Shared-compression benefits decrease as the size of each file increases.
Regardless, even a mono-file implementation (let's say a standard zip) may save significant effective disk space for very many very small files, as it avoids the overhead file systems require to manage individual files; if nothing else, many implementations must align each file to full blocks [eg. 512-4k bytes]. Plus, you get free compression using a ubiquitously supported format.
Is it possible to store separately a 'zip basis'? I mean something which is calculated for all the files together and embodies their statistical features and is needed for decompression, but is shared between all the files.
This 'zip basis' is sometimes called a Pre-shared Dictionary.
I would have said 'zip basis' file and a separate list of compressed files, which would be much smaller in size than each file zipped alone, and to decompress I would use the share 'zip basis' every time for each file.
Is it technically possible? Is there something that works like this?
Yes, it's possible. SDCH (Shared Dictionary Compression for HTTP) was one such implementation, designed for common web files (eg. HTML/CSS/JavaScript). In certain cases it could achieve significantly higher compression than standard DEFLATE.
The approach can be emulated with many compression algorithms that work on streams, where the compression dictionary is encoded as part of the stream as written. (U = uncompressed, C = compressed.)
To compress:
[U:shared_dict] + [U:data] -> [C:shared_dict] + [C:data]
Here [U:shared_dict] is the "zip basis", [C:shared_dict] is merely an artifact of priming, and only [C:data] is written to the file.
To decompress:
[C:shared_dict] + [C:data] -> [U:shared_dict] + [U:data]
Prepend [C:shared_dict] back before decompressing, then use only [U:data].
The overall space saved depends on many factors, including how useful the initial priming dictionary is and on the specific compressor details. LZ77-style implementations (such as DEFLATE) are particularly well-suited to the approach above because their sliding window acts as a lookup dictionary.
Alternatively, it may be possible to use domain-specific knowledge and/or encoding to also achieve better compression with specialized compression schemes. An example of this is SQL Server's Page Compression which exploits data similarities between columns on different rows.
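In Python this kind of pre-shared dictionary is directly supported by zlib through its zdict argument. A minimal sketch, with a deliberately naive dictionary built from a few hypothetical sample files:

import zlib
from pathlib import Path

# Build a crude shared dictionary from a few representative samples
# (file names are hypothetical). zlib gives the most weight to strings
# that appear towards the END of the dictionary.
samples = [Path(f"sample_{i}.npy").read_bytes() for i in range(3)]
shared_dict = b"".join(samples)[-32768:]     # zlib's window is at most 32 KB

def compress_with_dict(data: bytes) -> bytes:
    co = zlib.compressobj(level=9, zdict=shared_dict)
    return co.compress(data) + co.flush()

def decompress_with_dict(blob: bytes) -> bytes:
    do = zlib.decompressobj(zdict=shared_dict)
    return do.decompress(blob) + do.flush()

The same shared_dict has to be stored once, separately, and be present at decompression time; trained dictionaries (e.g. zstandard's) are a more capable modern take on the same idea.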
A 'zip-basis' is interesting but problematic.
You could preprocess the files instead. Take one file as a template and calculate the diff of each file compared to the template. Then compress the diffs.
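A sketch of that preprocessing idea for numpy arrays, assuming every array shares the template's shape and dtype (file names are illustrative, and the output path should end in .npz):

import numpy as np

template = np.load("template.npy")              # hypothetical reference sample

def save_as_diff(arr, path):
    # Differences against the template are often mostly zeros or small values,
    # which compress far better than the raw array.
    np.savez_compressed(path, diff=arr - template)

def load_from_diff(path):
    return np.load(path)["diff"] + template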

Pandas: efficiently write thousands of small files

Here is my problem.
I have a single big CSV file containing a bit more than 100M rows which I need to divide into much smaller files (if needed I can add more details). At the moment I'm reading the big CSV in chunks, doing some computations to determine how to subdivide the chunk, and finally writing (appending) to the files with
df.to_csv(outfile, float_format='%.8f', index=False, mode='a', header=header)
(the header variable is True if it is the first time that I write to 'outfile', otherwise it is False).
While running the code I noticed that the total on-disk space taken by the smaller files was on track to become more than three times the size of the single big CSV.
So here are my questions:
is this behavior normal? (probably it is, but I'm asking just in case)
is it possible to reduce the size of the files? (different file formats?) [SOLVED through compression, see update below and comments]
are there better file types for this situation with respect to CSV?
Please note that I don't have an extensive knowledge of programming, I'm just using Python for my thesis.
Thanks in advance to whoever will help.
UPDATE: thanks to @AshishAcharya and @PatrickArtner I learned how to use compression while writing and reading the CSV. Still, I'd like to know if there are any file types that may be better than CSV for this task.
NEW QUESTION: (maybe stupid question) does appending work on compressed files?
UPDATE 2: using the compression option I noticed something that I don't understand. To determine the size of folders I was taught to use the du -hs <folder> command, but using it on the folder containing compressed files or the one containing the uncompressed files results in the same value of '3.8G' (both are created using the same first 5M rows from the big CSV). From the file explorer (Nautilus) instead, I get about 590MB for the one containing uncompressed CSV and 230MB for the other. What am I missing?
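A sketch of the chunked-read / compressed-write workflow described above; the 'group' column and chunk size are hypothetical, and it is worth verifying on your pandas version that files built from appended gzip members read back cleanly:

import os
import pandas as pd

for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    # hypothetical splitting rule: one output file per value of 'group'
    for key, part in chunk.groupby("group"):
        outfile = f"out_{key}.csv.gz"
        header = not os.path.exists(outfile)    # write the header only once per file
        # mode='a' appends; each append adds another gzip member, which
        # gzip-aware readers decompress in sequence
        part.to_csv(outfile, float_format="%.8f", index=False,
                    mode="a", header=header, compression="gzip")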

modify and write large file in python

Say I have a data file of size 5GB on disk, and I want to append another set of data of size 100MB at the end of the file -- just simply append, I don't want to modify nor move the original data in the file. I know I can read the whole file into memory as a long list and append my small new data to it, but it's too slow. How can I do this more efficiently?
I mean, without reading the whole file into memory?
I have a script that generates a large stream of data, say 5GB, as a long list, and I need to save these data into a file. I tried to generate the whole list first and then output it all at once, but as the list grew the computer slowed down very severely. So I decided to output it in several batches: each time I have a list of about 100MB, I output it and clear the list (this is why I have the first problem).
I have no idea how to do this. Is there any lib or function that can do this?
Let's start from the second point: if the list you keep in memory is larger than the available RAM, the computer starts using the disk as RAM (swapping), and this slows everything down severely. The optimal way of outputting in your situation is to fill the RAM as much as possible (always keeping enough space for the rest of the software running on your PC) and then write to the file all at once.
The fastest way to store a list in a file is to use pickle, so that you store binary data that takes much less space than formatted data (and the read/write process is much faster too).
When you write to a file, you should keep the file open the whole time, using something like with open('namefile', 'w') as f. That way you save the time spent opening/closing the file, and the cursor stays at the end. If you decide to do that, call f.flush() once you have written to the file, to avoid losing data if something bad happens. Opening the file in append mode is a good alternative anyway.
If you provide some code it would be easier to help you...
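A minimal sketch of both points: appending to the existing 5GB file without reading it, and flushing the in-memory list to disk in batches (pickle is one option for the binary output; generate_items and the batch size are hypothetical):

import pickle

def generate_items():
    # Hypothetical stand-in for the script that produces the 5GB stream.
    for i in range(10_000_000):
        yield i

# 1) Append new data to the end of the big file without reading or moving
#    anything already in it.
new_bytes = b"..."                              # the ~100MB of new data, already in memory
with open("big_data.bin", "ab") as f:           # 'ab' = append, binary
    f.write(new_bytes)

# 2) Generate data and flush it to disk in batches instead of keeping the
#    whole 5GB list in memory.
buffer = []
with open("output.pkl", "ab") as f:
    for item in generate_items():
        buffer.append(item)
        if len(buffer) >= 1_000_000:            # tune the batch size to roughly 100MB
            pickle.dump(buffer, f)              # each dump is a separate pickle record
            f.flush()
            buffer.clear()
    if buffer:                                  # write whatever is left over
        pickle.dump(buffer, f)

To read it back, call pickle.load(f) repeatedly until EOFError is raised.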

Calculate (approximately) if zip64 extensions are required without relying on exceptions?

I have the following requirements (from the client) for zipping a number of files.
If the zip file created would be less than 2**31-1 bytes (~2GB), use compression to create it (use zipfile.ZIP_DEFLATED); otherwise do not compress it (use zipfile.ZIP_STORED).
The current solution is to compress the file without zip64 and catch the zipfile.LargeZipFile exception to then create the non-compressed version.
My question is whether or not it would be worthwhile to attempt to calculate (approximately) whether or not the zip file will exceed the zip64 size without actually processing all the files, and how best to go about it? The process for zipping such large amounts of data is slow, and minimizing the duplicate compression processing might speed it up a bit.
Edit: I would upvote both solutions, as I think I can generate a useful heuristic from a combination of max and min file sizes and compression ratios. Unfortunately at this time, StackOverflow prevents me from upvoting anything (until I have a reputation higher than noob). Thanks for the good suggestions.
The only way I know of to estimate the zip file size is to look at the compression ratios for previously compressed files of a similar nature.
I can only think of two ways, one simple but requiring manual tuning, and the other which may not provide enough benefit to justify the complexity.
Define a file size at which you just skip the zip attempt, and tune it to your satisfaction by hand.
Keep a record of the last N file sizes between the smallest failure to zip ever observed and the largest successful zip ever observed. Decide what probability of an incorrect choice (a file that should be zipped being left unzipped) is acceptable, say 5%, and set your "don't bother trying to zip" threshold such that it would have left that percentage of files erroneously unzipped.
If you can absolutely never miss an opportunity to zip a file that should have been zipped, then you've already got the solution.
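A sketch of that second heuristic, assuming you log the total input size and outcome (fit without zip64 or not) of each past run; the data layout and function name are hypothetical:

def choose_threshold(history, acceptable_miss_rate=0.05):
    # history: list of (total_input_bytes, fit_without_zip64) tuples from past runs.
    # Returns an input-size threshold above which the compression attempt is
    # skipped, chosen so that only ~acceptable_miss_rate of past successful
    # zips would have been (wrongly) skipped.
    successes = sorted(size for size, fit in history if fit)
    if not successes:
        return None                             # no data yet: always attempt compression
    cutoff = min(int(len(successes) * (1 - acceptable_miss_rate)),
                 len(successes) - 1)
    return successes[cutoff]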
A heuristic approach will always involve some false positives and some false negatives.
The eventual size of the zipped file will depend on a number of factors, some of which are not knowable without running the compression process itself.
The ZIP format (with or without zip64 extensions) allows many different compression methods, such as bzip2, LZMA, etc.
Even a given compression method may compress differently depending on the data to be compressed. For example, bzip2 can use Burrows-Wheeler, run-length encoding and Huffman coding, among others. The eventual size of the file will then depend on the statistical properties of the data being compressed.
Take Huffman, for instance; the size of the symbol table depends on how randomly-distributed the content of the file is.
One can go on and try to profile different types of data, serialized binary, text, images etc., and each will have a different typical distribution of final zipped size.
If you really need to save time by doing the process only once, then apart from building a very large database and using a rule-based expert system or one based on Bayes' theorem, there is no 100% reliable approach to this problem.
You could also try sampling blocks of the file at random intervals and compressing this sample, then linearly interpolating based on the size of the file.
