What's the best compression algorithm for data dumps - python

I'm creating data dumps from my site for others to download and analyze. Each dump will be a giant XML file.
I'm trying to figure out the best compression algorithm that:
Compresses efficiently (CPU-wise)
Makes the smallest possible file
Is fairly common
I know the basics of compression, but haven't a clue as to which algo fits the bill. I'll be using MySQL and Python to generate the dump, so I'll need something with a good python library.

GZIP with the standard compression level should be fine for most cases; higher compression levels just cost more CPU time. BZ2 compresses better but is also slower. There is always a trade-off between CPU consumption/running time and compression ratio, and the default compression levels are a sensible middle ground for all of them.
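For instance, a minimal sketch of both options using only the standard library (the dump file name is assumed):

import bz2
import gzip
import shutil

SOURCE = "dump.xml"  # assumed file name, adjust to your dump

# gzip: the module's default compresslevel is 9; lower it to trade size for CPU time.
with open(SOURCE, "rb") as src, gzip.open(SOURCE + ".gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# bz2: usually produces a smaller file than gzip, but takes noticeably more CPU time.
with open(SOURCE, "rb") as src, bz2.BZ2File(SOURCE + ".bz2", "wb") as dst:
    shutil.copyfileobj(src, dst)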

Related

What's the performance difference between Python accessing multiple small .npy files vs one large .npy file?

I am currently working on a repository that has millions of small .npy (numpy) or .png image files. The code reads and writes many of these small files.
It seems to be extremely slow. I was wondering: if I merged all the smaller .npy files into one larger file, would the code run faster? If yes, what would be the reason? Does it have something to do with disk I/O?
Opening each file requires the operating system to fetch some data from the storage device, and more specifically to issue multiple requests to the target file system. Reading a block also requires a request. This means many requests per file. Each file is typically stored at a different location (which in practice often looks random). Performing random fetches on storage devices is known to be slow: storage devices have a limited number of I/O operations per second (IOPS), and HDDs have a very low one (e.g. ~75 IOPS for mainstream 7200 RPM HDDs). SSDs are much faster, but neither the hardware nor the software stack is yet optimized for issuing many requests sequentially, so multiple threads are often needed to reach a good IOPS figure.
Thus, yes, merging many files into one will certainly improve performance, especially if you use an HDD, because you can then read big contiguous chunks.
For more information about the expected performance, please read this.
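As a rough illustration of the merging idea (the directory layout, file names, and the assumption that all arrays share a shape are mine, not the original poster's):

import glob
import numpy as np

# Assumed layout: many small arrays of identical shape under ./chunks/.
paths = sorted(glob.glob("chunks/*.npy"))

# One large contiguous file can be read back with a single sequential pass
# instead of one random seek per small file.
merged = np.stack([np.load(p) for p in paths])
np.save("merged.npy", merged)

# Later: one big sequential read (optionally memory-mapped to avoid loading everything).
data = np.load("merged.npy", mmap_mode="r")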

In which case would one use the uncompressed codec for Parquet files in Spark?

I'm new in Spark and trying to understand how different compression codecs work. I'm using Cloudera Quickstart VM 5.12x, Spark 1.6.0 and Python APIs.
If I compress and save as Parquet files using the logic below:
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
df.write.parquet("/user/cloudera/data/orders_parquet_snappy")
then I can read them as:
sqlContext.read.parquet("/user/cloudera/data/orders_parquet_snappy").show()
I believe the read above doesn't need to explicitly uncompress anything. I wonder why, and under which conditions I would use uncompressed:
sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
Not sure if my understanding is correct.
Compression is good because it saves at-rest storage and transfer bandwidth (both in terms of local disk I/O and network), but it does so at the cost of computing power. These are the metrics that you want to keep in mind when selecting a compression algorithm for your data: depending on the kind of expectation that you have, you can decide to select the appropriate one.
But in general, Snappy has been specifically designed to be relatively easy on CPU while providing a fair storage/bandwidth saving, which makes it more than appropriate for many use cases (which is why it's the default).
The sensible suggestion is of course to measure and decide based on observations from your particular setup, but I believe it's fair to say you should not expect a massive relative improvement in resource usage from switching codecs (though perhaps enough to matter economically if you are working with a truly large cluster).
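If you want to compare the two codecs on your own data, a quick sketch in the Spark 1.6 / Python style of the question (the sample DataFrame and output paths are illustrative; substitute your real df):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Spark 1.6-style setup, matching the question's environment.
sc = SparkContext(appName="parquet-codec-comparison")
sqlContext = SQLContext(sc)

# Tiny illustrative DataFrame; substitute your real df.
df = sqlContext.createDataFrame([(1, "closed"), (2, "pending")], ["order_id", "status"])

sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet("/user/cloudera/data/orders_parquet_snappy")

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
df.write.parquet("/user/cloudera/data/orders_parquet_uncompressed")

# Reading is transparent either way: Spark picks the codec up from the Parquet footers.
sqlContext.read.parquet("/user/cloudera/data/orders_parquet_snappy").count()
sqlContext.read.parquet("/user/cloudera/data/orders_parquet_uncompressed").count()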

How to read a big (3-4GB) file that doesn't have newlines into a numpy array?

I have a 3.3gb file containing one long line. The values in the file are comma separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used hdf files, via h5py, of sizes well over 200 gigs with very little processing time, on the order of a minute or two, for file accesses. In addition the hdf libraries support mpi and concurrent access.
This means that, assuming you can format your original one-line file as an appropriately hierarchical HDF file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF to process your data on multiple cores, using MPI to pass whatever data you need between them.
You need to be careful with your code and understand how mpi works in conjunction with hdf, but it'll speed things up no end.
Of course all of this depends on putting the data into an hdf file in a way that allows you to take advantage of mpi... so maybe not the most practical suggestion.
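For what it's worth, a minimal sketch of the conversion step (leaving the MPI part aside), assuming the values are comma-separated floats; the file and dataset names are illustrative:

import h5py
import numpy as np

CHUNK_BYTES = 64 * 1024 * 1024  # read ~64 MB of text per pass

with open("distance_matrix.tmp") as src, h5py.File("distance_matrix.h5", "w") as dst:
    # Resizable 1-D dataset; chunking is required for resizable datasets.
    dset = dst.create_dataset("values", shape=(0,), maxshape=(None,),
                              dtype="f8", chunks=True)
    leftover = ""
    while True:
        text = src.read(CHUNK_BYTES)
        if not text:
            break
        text = leftover + text
        head, _, leftover = text.rpartition(",")  # keep a possibly partial last value
        if head:
            vals = np.array(head.split(","), dtype="f8")
            dset.resize(dset.shape[0] + vals.size, axis=0)
            dset[-vals.size:] = vals
    if leftover.strip():  # final value, if the file doesn't end with a comma
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1] = float(leftover)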
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
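For instance (a one-off conversion; file names are assumed):

import numpy as np

# The slow text parse happens only once; keep a binary copy alongside the original.
data = np.fromfile("distance_matrix.tmp", sep=",")
np.save("distance_matrix.npy", data)

# Subsequent loads skip the text parsing entirely.
data = np.load("distance_matrix.npy")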
If you can't change the file format (i.e. it's not produced by one of your own programs), then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8 GB) so that it doesn't need to touch swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.

Calculate (approximately) if zip64 extensions are required without relying on exceptions?

I have the following requirements (from the client) for zipping a number of files.
If the zip file created is less than 2**31-1 bytes (~2 GB), use compression to create it (zipfile.ZIP_DEFLATED); otherwise do not compress it (zipfile.ZIP_STORED).
The current solution is to compress the file without zip64 and catch the zipfile.LargeZipFile exception, then create the non-compressed version.
My question is whether or not it would be worthwhile to attempt to calculate (approximately) whether or not the zip file will exceed the zip64 size without actually processing all the files, and how best to go about it? The process for zipping such large amounts of data is slow, and minimizing the duplicate compression processing might speed it up a bit.
Edit: I would upvote both solutions, as I think I can generate a useful heuristic from a combination of max and min file sizes and compression ratios. Unfortunately at this time, StackOverflow prevents me from upvoting anything (until I have a reputation higher than noob). Thanks for the good suggestions.
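For reference, a minimal sketch of the try/except approach described above (the file list is hypothetical, and allowing zip64 in the stored fallback is my assumption):

import zipfile

files_to_zip = ["dump1.xml", "dump2.xml"]  # hypothetical input list

def build_archive(path, files):
    try:
        # First attempt: compressed, with zip64 disabled so LargeZipFile
        # is raised once the archive needs to grow past ~2 GB.
        with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED, allowZip64=False) as zf:
            for name in files:
                zf.write(name)
    except zipfile.LargeZipFile:
        # Fallback from the question: rebuild the archive uncompressed.
        # allowZip64=True here is an assumption, since a stored archive
        # over the limit still needs the zip64 extensions.
        with zipfile.ZipFile(path, "w", zipfile.ZIP_STORED, allowZip64=True) as zf:
            for name in files:
                zf.write(name)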
The only way I know of to estimate the zip file size is to look at the compression ratios for previously compressed files of a similar nature.
I can only think of two ways, one simple but requires manual tuning, and the other may not provide enough benefit to justify the complexity.
Define a file size at which you just skip the zip attempt, and tune it to your satisfaction by hand.
Keep a record of the last N file sizes between the smallest failure-to-zip ever observed and the largest successful zip ever observed. Decide what probability of an incorrect choice (a file that should be zipped being left unzipped) is acceptable, say 5%. Set your "don't bother trying to zip" threshold so that it would have left only that percentage of files erroneously unzipped.
If you absolutely can never miss an opportunity to zip file that should have been zipped then you've already got the solution.
A heuristic approach will always involve some false positives and some false negatives.
The eventual size of the zipped file will depend on a number of factors, some of which are not knowable without running the compression process itself.
The zip format allows many different compression methods, such as bzip2, LZMA, etc.
Even a single compression method may compress the same data differently depending on its content. For example, bzip2 combines the Burrows-Wheeler transform, run-length encoding and Huffman coding, among others. The eventual size of the file will then depend on the statistical properties of the data being compressed.
Take Huffman, for instance; the size of the symbol table depends on how randomly-distributed the content of the file is.
One can go on and try to profile different types of data, serialized binary, text, images etc. and each will have a different normal distribution of final zipped size.
If you really need to save time by doing the process only once, apart from building a very large database and using a rule-based expert system or one based on Bayes' Theorem, there is no real 100% approach to this problem.
You could also try sampling blocks of the file at random intervals and compressing this sample, then linearly interpolating based on the size of the file.
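A rough sketch of that sampling idea (the block size, sample count, and the use of zlib to approximate ZIP_DEFLATED are arbitrary choices):

import os
import random
import zlib

def estimate_compressed_size(path, samples=20, block_size=1 << 20):
    # Compress a few random blocks and scale the observed ratio to the whole file.
    total = os.path.getsize(path)
    sampled = compressed = 0
    with open(path, "rb") as f:
        for _ in range(samples):
            f.seek(random.randrange(max(1, total - block_size)))
            block = f.read(block_size)
            sampled += len(block)
            compressed += len(zlib.compress(block))
    return total * compressed / sampled if sampled else 0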

Is it possible to speed-up python IO?

Consider this python program:
import sys

lc = 0
for line in open(sys.argv[1]):
    lc = lc + 1
print lc, sys.argv[1]
Running it on my 6GB text file, it completes in ~2 minutes.
Question: is it possible to go faster?
Note that the same time is required by:
wc -l myfile.txt
so I suspect the answer to my question is just a plain "no".
Note also that my real program is doing something more interesting than just counting the lines, so please give a generic answer, not line-counting-tricks (like keeping a line count metadata in the file)
PS: I tagged "linux" this question, because I'm interested only in linux-specific answers. Feel free to give OS-agnostic, or even other-OS answers, if you have them.
See also the follow-up question
Throw hardware at the problem.
As gs pointed out, your bottleneck is the hard disk transfer rate. So, no you can't use a better algorithm to improve your time, but you can buy a faster hard drive.
Edit: Another good point by gs; you could also use a RAID configuration to improve your speed. This can be done either with hardware or software (e.g. OS X, Linux, Windows Server, etc).
Governing Equation
(Amount to transfer) / (transfer rate) = (time to transfer)
(6000 MB) / (60 MB/s) = 100 seconds
(6000 MB) / (125 MB/s) = 48 seconds
Hardware Solutions
The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".
Or you could check out the WD Velociraptor hard drive (10,000 rpm).
Also, I hear the Seagate Cheetah is a good option (15,000 rpm with sustained 125MB/s transfer rate).
The trick is not to make electrons move faster (that's hard to do) but to get more work done per unit of time.
First, be sure your 6GB file read is I/O bound, not CPU bound.
If it's I/O bound, consider the "Fan-Out" design pattern.
A parent process spawns a bunch of children.
The parent reads the 6Gb file, and deals rows out to the children by writing to their STDIN pipes. The 6GB read time will remain constant. The row dealing should involve as little parent processing as possible. Very simple filters or counts should be used.
A pipe is an in-memory channel for communication. It's a shared buffer with a reader and a writer.
Each child reads a row from STDIN and does the appropriate work. Each child should probably write a simple disk file with its final (summarized, reduced) results. Later, the results in those files can be consolidated.
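A bare-bones sketch of that fan-out shape, using multiprocessing queues instead of raw STDIN pipes for brevity (the file name and the per-row work are placeholders):

import multiprocessing as mp

def worker(q, out_path):
    # Child: consume rows, do the real per-row work, write a small result file.
    count = 0
    for line in iter(q.get, None):  # None is the shutdown sentinel
        count += 1                  # placeholder for the real work
    with open(out_path, "w") as out:
        out.write(str(count))

if __name__ == "__main__":
    n_workers = 4
    queues = [mp.Queue(maxsize=1024) for _ in range(n_workers)]
    procs = [mp.Process(target=worker, args=(q, "part_%d.txt" % i))
             for i, q in enumerate(queues)]
    for p in procs:
        p.start()

    # Parent: one sequential pass over the file, dealing rows round-robin.
    with open("myfile.txt") as f:
        for i, line in enumerate(f):
            queues[i % n_workers].put(line)

    for q in queues:
        q.put(None)  # tell each child it can finish
    for p in procs:
        p.join()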
You can't get any faster than the maximum disk read speed.
In order to reach the maximum disk speed you can use the following two tips:
Read the file in with a big buffer. This can either be coded "manually" or simply by using io.BufferedReader (available in Python 2.6+); a rough sketch follows below.
Do the newline counting in another thread, in parallel.
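A rough sketch of that big-buffer read (the 16 MB buffer size is a guess; tune it for your disk):

BUF_SIZE = 16 * 1024 * 1024  # 16 MB reads instead of line-sized ones

count = 0
with open("myfile.txt", "rb") as f:
    while True:
        chunk = f.read(BUF_SIZE)
        if not chunk:
            break
        count += chunk.count(b"\n")
print(count)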
plain "no".
You've pretty much reached maximum disk speed.
I mean, you could mmap the file, or read it in binary chunks, and use .count('\n') or something. But that is unlikely to give major improvements.
If you assume that a disk can read 60MB/s you'd need 6000 / 60 = 100 seconds, which is 1 minute 40 seconds. I don't think that you can get any faster because the disk is the bottleneck.
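For completeness, the mmap variant might look like this (file name assumed); it avoids copying the 6 GB into Python objects, but it is still bounded by disk speed:

import mmap

with open("myfile.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        lines = 0
        pos = mm.find(b"\n")
        while pos != -1:
            lines += 1
            pos = mm.find(b"\n", pos + 1)
    finally:
        mm.close()
print(lines)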
as others have said - "no"
Almost all of your time is spent waiting for IO. If this is something that you need to do more than once, and you have a machine with tons of ram, you could keep the file in memory. If your machine has 16GB of ram, you'll have 8GB available at /dev/shm to play with.
Another option:
If you have multiple machines, this problem is trivial to parallelize. Split the file among multiple machines, have each of them count its newlines, and add the results.
2 minutes sounds about right to read an entire 6GB file. There's not really much you can do to the algorithm or the OS to speed things up. I think you have two options:
Throw money at the problem and get better hardware. Probably the best option if this project is for your job.
Don't read the entire file. I don't know what you are trying to do with the data, so maybe you don't have any option but to read the whole thing. On the other hand, if you are scanning the whole file for one particular thing, then maybe putting some metadata at the start would be helpful.
PyPy provides optimised input/output that can be up to 7 times faster.
This is a bit of an old question, but one idea I've recently tested out in my petabyte project was the speed benefit of compressing data, then using compute to decompress it into memory. I used a gigabyte as a standard, but using zlib you can get really impressive file size reductions.
Once you've reduced your file size, when you go to iterate through this file you just:
Load the smaller file into memory (or use stream object).
Decompress it (as a whole, or using the stream object to get chunks of decompressed data).
Work on the decompressed file data as you wish.
I've found this process is 3x faster in the best case than using native I/O bound tasks. It's a bit outside the scope of the question, but it's an old one and people may find it useful.
Example:
compress.py
import zlib

with open("big.csv", "rb") as f:
    compressed = zlib.compress(f.read())

with open("big_comp.csv", "wb") as f:
    f.write(compressed)
iterate.py
import zlib

with open("big_comp.csv", "rb") as f:
    big = zlib.decompress(f.read())

for line in big.split(b"\n"):  # decompress() returns bytes, so split on bytes
    line = line[::-1]          # placeholder for whatever per-line work you do
