I'm using an event-loop-based server in Twisted Python that stores files, and I'd like to be able to classify the files according to their compressibility.
If the probability that they'd benefit from compression is high, they would go to a directory with btrfs compression switched on, otherwise they'd go elsewhere.
I do not need to be sure - 80% accuracy would be plenty, and would save a lot of disk space. But since there is the CPU and filesystem performance issue too, I cannot just save everything compressed.
The files are in the low megabytes. I cannot test-compress them without using a huge chunk of CPU and unduly delaying the event loop, or refactoring a compression algorithm to fit into the event loop.
Is there any best practice to give a quick estimate for compressibility? What I came up with is taking a small chunk (few kB) of data from the beginning of the file, test-compress it (with a presumably tolerable delay) and base my decision on that.
Any suggestions? Hints? Flaws in my reasoning and/or problem?
Just 10K from the middle of the file will do the trick. You don't want the beginning or the end, since they may contain header or trailer information that is not representative of the rest of the file. 10K is enough to get some amount of compression with any typical algorithm. That will predict a relative amount of compression for the whole file, to the extent that that middle 10K is representative. The absolute ratio you get will not be the same as for the whole file, but the amount that it differs from no compression will allow you to set a threshold. Just experiment with many files to see where to set the threshold.
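A minimal sketch of that idea, assuming zlib as the test compressor; the 0.9 threshold is a placeholder to tune on your own files:

import os
import zlib

def looks_compressible(path, sample_size=10 * 1024, threshold=0.9):
    """Test-compress a chunk from the middle of the file.

    Returns True if the compressed chunk is noticeably smaller than the
    original chunk (ratio below `threshold`); tune the threshold empirically.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size // 2 - sample_size // 2))
        chunk = f.read(sample_size)
    if not chunk:
        return False
    ratio = len(zlib.compress(chunk, 6)) / len(chunk)
    return ratio < threshold

If even that small read-and-compress is too much for the reactor thread, it could be pushed off with Twisted's deferToThread.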
As noted, you can save time by doing nothing for files that are obviously already compressed, e.g. .png, .jpg, .mov, .pdf, .zip, etc.
Measuring entropy is not necessarily a good indicator, since it only gives the zeroth-order estimate of compressibility. If the entropy indicates that it is compressible enough, then it is right. If the entropy indicates that it is not compressible enough, then it may or may not be right. Your actual compressor is a much better estimator of compressibility. Running it on 10K won't take long.
I think what you are looking for is "How to calculate the entropy of a file?"
This question contains all kinds of methods for calculating the entropy of a file (and from that you can get the 'compressibility' of a file). Here's a quote from the abstract of the article "Relationship Between Entropy and Test Data Compression" by Kedarnath J. Balakrishnan and Nur A. Touba:
The entropy of a set of data is a measure of the amount of information contained in it. Entropy calculations for fully specified data have been used to get a theoretical bound on how much that data can be compressed. This paper extends the concept of entropy for incompletely specified test data (i.e., that has unspecified or don't care bits) and explores the use of entropy to show how bounds on the maximum amount of compression for a particular symbol partitioning can be calculated. The impact of different ways of partitioning the test data into symbols on entropy is studied. For a class of partitions that use fixed-length symbols, a greedy algorithm for specifying the don't cares to reduce entropy is described. It is shown to be equivalent to the minimum entropy set cover problem and thus is within an additive constant error with respect to the minimum entropy possible among all ways of specifying the don't cares. A polynomial time algorithm that can be used to approximate the calculation of entropy is described. Different test data compression techniques proposed in the literature are analyzed with respect to the entropy bounds. The limitations and advantages of certain types of test data encoding strategies are studied using entropy theory
And to be more constructive, check out this site for a Python implementation of entropy calculations on chunks of data.
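For reference, a byte-level (zeroth-order) Shannon entropy estimate of a chunk is only a few lines; this is a generic sketch, not the code from the linked site:

import math
from collections import Counter

def shannon_entropy(data):
    """Zeroth-order entropy in bits per byte (close to 8.0 looks incompressible)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())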
Compressed files usually don't compress well. This means that just about any media file is not going to compress very well, since most media formats already include compression. Clearly there are exceptions to this, such as BMP and TIFF images, but you can probably build a whitelist of well-compressed filetypes (PNGs, MPEGs, and venturing away from visual media - gzip, bzip2, etc) to skip and then assume the rest of the files you encounter will compress well.
If you feel like getting fancy, you could build feedback into the system (observe the results of any compression you do and associate the resulting ratio with the filetype). If you come across a filetype that has consistently poor compression, you could add it to the whitelist.
These ideas depend on being able to identify a file's type, but there are standard utilities which do a pretty good job of this (generally much better than 80%) - file(1), /etc/mime.types, etc.
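As one hedged illustration of the skip-list idea using only the standard library's mimetypes module (file(1) or python-magic would inspect the content instead of the extension); the set of "already compressed" types is just an example to adapt:

import mimetypes

# Example MIME types that are usually already compressed; extend from experience.
ALREADY_COMPRESSED = {
    "image/jpeg", "image/png", "image/gif",
    "video/mp4", "video/quicktime",
    "application/zip", "application/pdf",
}

def worth_compressing(path):
    mime, encoding = mimetypes.guess_type(path)
    if encoding in ("gzip", "bzip2", "xz"):   # .gz / .bz2 / .xz suffixes
        return False
    return mime not in ALREADY_COMPRESSED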
Related
I have tried various methods of data compression when saving numpy arrays to disk.
These 1D arrays contain data sampled at a certain rate (it can be sound recorded with a microphone, or any other measurement with any sensor): the data is essentially continuous (in a mathematical sense; of course, after sampling it is now discrete data).
I tried with HDF5 (h5py):
f.create_dataset("myarray1", data=myarray, compression="gzip", compression_opts=9)
but this is quite slow, and the compression ratio is not the best we can expect.
I also tried with
numpy.savez_compressed()
but once again it may not be the best compression algorithm for such data (described before).
What would you choose to get a better compression ratio on a numpy array with such data?
(I thought about things like lossless FLAC (initially designed for audio), but is there an easy way to apply such an algorithm to numpy data?)
What I do now:
import gzip
import numpy

# write the array through a gzip-compressed file object
with gzip.GzipFile("my_array.npy.gz", "w") as f:
    numpy.save(file=f, arr=my_array)
Noise is incompressible. Thus, any part of the data that you have which is noise will go into the compressed data 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) equal to 16 bits, the remaining 24 - 16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.
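To spell out the arithmetic behind that 3:1 figure (just the answer's own numbers restated):

$$\text{max lossless ratio} \le \frac{\text{bits per sample}}{\text{incompressible noise bits}} = \frac{24}{24 - 16} = 3$$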
Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate the signal from the noise and discard the noise). For example, if you know your data is bandwidth limited to 10 MHz and you're sampling at 200 MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.
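A rough numpy sketch of that frequency-domain idea, using the example numbers above (200 MHz sampling, 10 MHz bandwidth); the random signal is only a placeholder:

import numpy as np

fs = 200e6      # sampling rate from the example above
f_cut = 10e6    # known bandwidth limit
x = np.random.randn(1_000_000)              # placeholder for the real samples

X = np.fft.rfft(x)
n_keep = int(len(X) * f_cut / (fs / 2))     # bins covering 0..10 MHz out of 0..100 MHz
X_low = X[:n_keep + 1]                      # ~10x fewer coefficients to store

# Later reconstruction: zero-pad the stored coefficients and invert.
X_full = np.zeros(len(X), dtype=complex)
X_full[:n_keep + 1] = X_low
x_rec = np.fft.irfft(X_full, n=len(x))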
A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc). Denoise could be the same as bandwidth limit, or a nonlinear filter like a running median. Bandwidth limit can be implemented with FIR/IIR. Delta compress is just y[n] = x[n] - x[n-1].
EDIT An illustration:
from pylab import *
import numpy
import numpy.random
import os.path
import subprocess
# create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB=16)
N = 1000000
data = (sin( 2 * pi * linspace(0,N,N) / 100 ) * (1<<23) + \
numpy.random.randn(N) * (1<<7)).astype(int32)
numpy.save('data.npy', data)
print(os.path.getsize('data.npy'))
# 4000080 uncompressed size
subprocess.call('xz -9 data.npy', shell=True)
print(os.path.getsize('data.npy.xz'))
# 1484192 compressed size
# 11.87 bits per sample, ~8 bits of that is noise
data_quantized = data // (1 << 8)  # integer division: drop the 8 noise bits
numpy.save('data_quantized.npy', data_quantized)
subprocess.call('xz -9 data_quantized.npy', shell=True)
print(os.path.getsize('data_quantized.npy.xz'))
# 318380
# still have 16 bits of signal, but only takes 2.55 bits per sample to store it
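And a minimal sketch of the delta compress -> gzip step from the recipe above (the toy signal and file name are illustrative):

import gzip
import numpy as np

# Toy slowly-varying signal; replace with real samples.
x = (np.sin(2 * np.pi * np.arange(1_000_000) / 100) * (1 << 15)).astype(np.int32)

delta = np.diff(x, prepend=np.int32(0)).astype(np.int32)  # y[n] = x[n] - x[n-1]
with gzip.open('delta.bin.gz', 'wb') as f:
    f.write(delta.tobytes())

# Restoring is a cumulative sum, which exactly undoes the delta step.
with gzip.open('delta.bin.gz', 'rb') as f:
    restored = np.cumsum(np.frombuffer(f.read(), dtype=np.int32))
assert np.array_equal(restored, x)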
Saving HDF5 files with compression can be very quick and efficient: it all depends on the compression algorithm, on whether you want it to be quick while saving, while reading it back, or both, and, naturally, on the data itself, as explained above.
GZIP tends to be somewhere in between, but with a low compression ratio. BZIP2 is slow on both sides, although with a better ratio. BLOSC is one of the algorithms I have found to achieve quite good compression while being quick on both ends. The downside of BLOSC is that it is not available in all HDF5 installations, so your program may not be portable.
You always need to run at least some tests to select the best configuration for your needs.
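As one way to run such a test with h5py's built-in filters (gzip and lzf ship with h5py; BLOSC needs an extra plugin, so it is left out of this sketch):

import os
import time
import h5py
import numpy as np

data = np.random.randn(1_000_000).astype(np.float32)   # placeholder array

for name, opts in [("gzip9", dict(compression="gzip", compression_opts=9)),
                   ("gzip4", dict(compression="gzip", compression_opts=4)),
                   ("lzf", dict(compression="lzf"))]:
    fname = "test_%s.h5" % name
    t0 = time.perf_counter()
    with h5py.File(fname, "w") as f:
        f.create_dataset("myarray1", data=data, shuffle=True, **opts)
    print(name, os.path.getsize(fname), "%.2fs" % (time.perf_counter() - t0))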
What constitutes the best compression (if any) depends highly on the nature of the data. Many kinds of measurement data are virtually completely incompressible if lossless compression is indeed required.
The pytables docs contain a lot of useful guidelines on data compression. They also detail speed tradeoffs and so on; higher compression levels are usually a waste of time, as it turns out.
http://pytables.github.io/usersguide/optimization.html
Note that this is probably as good as it will get. For integer measurements, a combination of the shuffle filter with a simple zip-type compression usually works reasonably well. The shuffle filter very efficiently exploits the common situation where the most significant byte is usually 0 and is only included to guard against overflow.
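For illustration only, the shuffle-plus-zlib combination described above looks roughly like this in PyTables (the array is a stand-in for real integer measurements):

import numpy as np
import tables

data = (np.arange(1_000_000) % 4096).astype(np.int32)   # toy integer measurements

filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)
with tables.open_file('measurements.h5', mode='w') as h5:
    h5.create_carray(h5.root, 'data', obj=data, filters=filters)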
You might want to try blz. It can compress binary data very efficiently.
import blz
# this stores the array in memory
blz.barray(myarray)
# this stores the array on disk
blz.barray(myarray, rootdir='arrays')
It stores arrays compressed, either on disk or in memory. Compression is based on blosc.
See the scipy video for a bit of context.
First, for general data sets, the shuffle=True argument to create_dataset improves compression dramatically with roughly continuous datasets. It very cleverly rearranges the bits to be compressed so that (for continuous data) the bits change slowly, which means they can be compressed better. In my experience it slows the compression down very slightly, but can substantially improve the compression ratios. It is not lossy, so you really do get the same data out as you put in.
If you don't care about the accuracy so much, you can also use the scaleoffset argument to limit the number of bits stored. Be careful, though, because this is not what it might sound like. In particular, it is an absolute precision, rather than a relative precision. For example, if you pass scaleoffset=8 but your data points are smaller than 1e-8, you'll just get zeros. Of course, if you've scaled the data to max out around 1 and don't think you can hear differences smaller than a part in a million, you can pass scaleoffset=6 and get great compression without much work.
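A small sketch of both options in h5py; scaleoffset=6 assumes the data has been scaled to max out around 1, as in the example above, and the file and dataset names are made up:

import h5py
import numpy as np

signal = np.sin(np.linspace(0, 1000, 1_000_000))   # placeholder, already scaled to ~1

with h5py.File("audio.h5", "w") as f:
    # Lossless: shuffle + gzip.
    f.create_dataset("lossless", data=signal, shuffle=True, compression="gzip")
    # Lossy: keep 6 digits after the decimal point, then shuffle + gzip.
    f.create_dataset("scaled", data=signal, scaleoffset=6,
                     shuffle=True, compression="gzip")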
But for audio specifically, I expect that you are right in wanting to use FLAC, because its developers have put in huge amounts of thought, balancing compression with preservation of distinguishable details. You can convert to WAV with scipy, and thence to FLAC.
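One hedged way to do the WAV-then-FLAC route, assuming 16-bit samples and the flac command-line tool on your PATH (the tone is a placeholder):

import subprocess
import numpy as np
from scipy.io import wavfile

rate = 44_100
signal = np.sin(2 * np.pi * 440 * np.arange(rate * 5) / rate)   # placeholder tone

pcm = (np.clip(signal, -1, 1) * 32767).astype(np.int16)   # scale to 16-bit PCM
wavfile.write("signal.wav", rate, pcm)
subprocess.check_call(["flac", "--best", "-f", "signal.wav"])  # writes signal.flac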
Suppose that I have a data set S that contains the service times for different jobs, like S = {t1, t2, t3, ..., tn}, where ti is the service time for the ith job and n is the total number of jobs in my data set. This S is only a sample from a population; n here is 300k. I would like to study the impact of long service times, as some jobs take very long and some do not. My intuition is to study this impact based on data gathered from a real system. The system under study has thousands of millions of jobs, and this number increases by 100 new jobs every few seconds. Also, service time is measured by benchmarking the jobs on a local machine, so it is practically expensive to keep expanding the data set. Thus, I decided to randomly pick 300k.
I am conducting simulation experiments where I have to generate a large number of jobs with their service times (say millions) and then do some other calculations.
To use S as a population in my simulation, I came across the following options:
1- Use S itself. I could use bootstrapping, i.e. sampling with replacement, or sampling without replacement.
2- Fit a theoretical distribution model to S and then draw from it.
Am I on the right track? Which approach is better (pros and cons)? The first approach seems easy, as it is just picking a random service time from S each time, but is it reliable? Any suggestion is appreciated, as I am not good at stats.
Quoting from this tutorial in the 2007 Winter Simulation Conference:
At first glance, trace-driven simulation seems appealing. That is where historical data are used directly as inputs. It’s hard to argue about the validity of the distributions when real data from the real-world system is used in your model. In practice, though, this tends to be a poor solution for several reasons. Historical data may be expensive or impossible to extract. It certainly won’t be available in unlimited quantities, which significantly curtails the statistical analysis possible. Storage requirements are high. And last, but not least, it is impossible to assess “what-if?” strategies or try to simulate a prospective system, i.e., one which doesn’t yet exist.
One of the major uses of simulation is to study alternative configurations or policies, and trace data is not suitable for that—it can only show you how you're currently operating. Trace data cannot be used for studying systems which are under consideration but don't yet exist.
Bootstrapping resamples your existing data. This removes the data quantity limitations, but at a potential cost. Bootstrapping is premised on the assumption that your data are representative and independent. The former may not be an issue with 300k observations, but often comes up when your sample size is smaller due to cost or availability issues. The latter is a big deal if your data come from a time series where the observations are serially correlated or non-homogeneous. In that case, independent random sampling (rather than sequential playback) can lose significant information about the behaviors being studied.
If sequential playback is required you're back to being limited to 300k observations, and that may not be nearly as much data as you think for statistical measures. Variance estimation is essential to calculating margins of error for confidence intervals, and serial correlation has a huge impact on the variance of a sample mean. Getting valid confidence interval estimates can take several orders of magnitude more data than is required for independent data.
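To see why, recall that for a stationary series with autocorrelations $\rho_k$ the variance of the sample mean is

$$\operatorname{Var}(\bar{x}) = \frac{\sigma^2}{n}\left[1 + 2\sum_{k=1}^{n-1}\left(1 - \frac{k}{n}\right)\rho_k\right],$$

so positive serial correlation inflates the bracketed factor well beyond 1 and the effective sample size is correspondingly much smaller than n.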
In summary, distribution fitting takes more work up front but is usually more useful in the long run.
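As a hedged illustration of the two options (the lognormal family below is only an example; fit and test whatever family actually matches your service times):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
S = rng.lognormal(mean=1.0, sigma=1.2, size=300_000)    # stand-in for measured times

# Option 1: bootstrap -- sample with replacement directly from S.
sim_bootstrap = rng.choice(S, size=5_000_000, replace=True)

# Option 2: fit a parametric model to S, then draw from the fitted model.
shape, loc, scale = stats.lognorm.fit(S, floc=0)
sim_fitted = stats.lognorm.rvs(shape, loc=loc, scale=scale,
                               size=5_000_000, random_state=rng)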
I have bandwidth data which identifies protocol usage by tonnage and hour. Based on the protocols, you can tell when something is just connected vs. actually being used (1,000 bits compared to millions or billions of bits) in that hour for that specific protocol. The problem is that when looking at each protocol, they are all heavily right-skewed, where 80% of the records are the just-connected cases, or what I'm calling "noise".
The task I have is to separate out this noise and focus only on when the protocol is actually being used. My classmates are all just doing this manually and removing records below a low threshold. I was hoping there was a way to automate this and use statistics instead of just picking a threshold that "looks good." We have something like 30 different protocols, each with a different number of bits that would represent "noise", i.e. a download protocol might have 1,000 bits where a messaging app might have 75 bits when they are connected but not in full use. Similarly, they have different means and gaps between them, e.g. the download mean is 215,000,000 and messaging is 5,000,000. There isn't any set pattern between them.
Also, this "noise" has many connections but only accounts for 1-3% of the total bandwidth being used, which is why we are tasked with identifying actual usage vs. passive usage.
I don't want any actual code, as I'd like to practice with the implementation and solution building myself. But the logic, process, or name of a statistical method would be very helpful.
Do you have labeled examples, and do you have other data besides the bandwidth? One way to do this would be to train some kind of ML classifier if you have a decent amount of data where you know it's either in use or not in use. If you have enough data, you might also be able to do this unsupervised. For a start, a simple Naive Bayes classifier works well for binary problems. As you may be aware, NB was the original basis for spam detection (is it spam or not), so your case of "is it noise or not" should also work, but you will get more robust results if you have other data in addition to the bandwidth to train on. Also, I am wondering if there isn't a way to improve the title of your post so that it communicates your question more quickly.
I'm intrigued by the large number of scientific packages on CRAN (specifically for wavelets) and would like to learn how to analyze typically non-stationary time traces sampled at several MHz, with typically 2.5e6 data points.
I usually use Python, but IMHO high-level scientific packages aren't that common in Python (yet), in comparison to e.g. CRAN, which offers several different wavelet libraries, or at least they are very new and often of questionable quality. Even if I decide to only use specific R packages for certain analyses from Python (perhaps through rpy2), I still have to figure out which data class is appropriate.
The appropriate data class
I figured I could use the ts data class for uniformly sampled data, but I'm not sure how ts will cope with such high frequencies as it seems to be designed for data sampled every few months or so. I also noticed that it's common to simply use
time_trace <- cbind(t_samples, value_samples)
I could also keep the columns in a data.frame but I suspect the performance would be sub-optimal.
Is there a recommended approach for such large and densely sampled time traces?
Handling different time scales
R being very popular with statisticians, I suspect time series are seen and perhaps treated differently than in some branches of physics, where it usually comes down to filtering and analyzing different frequency components (which is usually called digital signal processing). I've noticed that there are some R packages for that, but they didn't seem very advanced.
Would I have to change the way I think about time traces if I wanted to analyze them only in R? E.g. treat them as data to be tested against statistical models with several modes corresponding to different time scales. I'm also not sure how to deal with non-stationary signals.
Note:
This is not a question about whether R is suitable for DSP; I've created a separate question about that here.
According to the title, the main question is about doing things in R, and while the body and comments suggest there is potential for a debate under a second entry titled "is R or Python better for digital signal processing" (I'd add MATLAB, though), I'll only try to answer the former.
You can keep thinking about the data in a tabular format (a data frame with one column for your measure/value of interest and one for the time), or keep the data in two vectors. As long as you have enough RAM, things will be as simple as that. A reasonable machine with 8 GB and a bit of care should let you handle over 100 times more than your 2.5e6 data points.
The R packages signal and fftw certainly contain good illustrations (in their documentation and examples) of how to go about it.
I have a scientific application that reads a potentially huge data file from disk and transforms it into various Python data structures such as a map of maps, a list of lists, etc. NumPy is called in for numerical analysis. The problem is that memory usage can grow rapidly. Once swap space is used, the system slows down significantly. The general strategies I have seen:
lazy initialization: this doesn't seem to help, in the sense that many operations require in-memory data anyway.
shelving: the Python standard library's shelve module seems to support writing data objects into a data file (backed by some db). My understanding is that it dumps the data to a file, but if you need it, you still have to load all of it into memory, so it doesn't exactly help. Please correct me if this is a misunderstanding.
The third option is to leverage a database and offload as much data processing to it as possible.
As an example: a scientific experiment runs for several days and generates a huge (terabytes of data) sequence of:
coordinate (x, y) observed event E at time t.
And we need to compute a histogram over t for each (x,y) and output a 3-dimensional array.
Any other suggestions? I guess my ideal case would be that the in-memory data structure can be paged to disk based on a soft memory limit, and that this process should be as transparent as possible. Can any of these caching frameworks help?
Edit:
I appreciate all the suggested points and directions. Among those, I found user488551's comments to be most relevant. As much as I like Map/Reduce, for many scientific apps the setup and effort of parallelizing the code is an even bigger problem to tackle than my original question, IMHO. It is difficult to pick an answer as my question itself is so open ... but Bill's answer is closer to what we can do in the real world, hence the choice. Thank you all.
Have you considered divide and conquer? Maybe your problem lends itself to that. One framework you could use for that is Map/Reduce.
Does your problem have multiple phases, such that Phase I requires some data as input and generates an output which can be fed to Phase II? In that case you can have one process do Phase I and generate data for Phase II. Maybe this will reduce the amount of data you simultaneously need in memory?
Can you divide your problem into many small problems and recombine the solutions? In this case you can spawn multiple processes that each handle a small sub-problem, and have one or more processes combine these results in the end.
If Map-Reduce works for you, look at the Hadoop framework.
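Before reaching for Hadoop, the same split-and-recombine idea can be prototyped with the standard library's multiprocessing module; this sketch assumes the per-chunk work is a pure function of the chunk, and the data and bin choices are placeholders:

import numpy as np
from multiprocessing import Pool

def partial_histogram(chunk):
    # "Map" step: reduce one chunk to a small partial result.
    return np.histogram(chunk, bins=100, range=(0.0, 1.0))[0]

if __name__ == "__main__":
    data = np.random.rand(10_000_000)           # placeholder for one file's worth of data
    chunks = np.array_split(data, 16)
    with Pool() as pool:
        partials = pool.map(partial_histogram, chunks)
    total = np.sum(partials, axis=0)            # "Reduce" step: combine partial results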
Well, if you need the whole dataset in RAM, there's not much to do but get more RAM. Sounds like you aren't sure if you really need to, but keeping all the data resident requires the smallest amount of thinking :)
If your data comes in a stream over a long period of time, and all you are doing is creating a histogram, you don't need to keep it all resident. Just create your histogram as you go along, write the raw data out to a file if you want to have it available later, and let Python garbage collect the data as soon as you have bumped your histogram counters. All you have to keep resident is the histogram itself, which should be relatively small.
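A minimal sketch of that streaming approach for the (x, y, t) histogram example above; the file name, bin counts, and reader are all assumptions:

import numpy as np

NX, NY, NT = 100, 100, 1000                     # example bin counts for x, y and t
hist = np.zeros((NX, NY, NT), dtype=np.int64)

def read_binned_events(path, n_chunks=100):
    # Hypothetical reader: the file is assumed to hold an (N, 3) array of
    # already-binned (x, y, t) indices; memory-mapping avoids loading it all.
    events = np.load(path, mmap_mode="r")
    for chunk in np.array_split(events, n_chunks):
        yield (chunk[:, 0].astype(np.intp),
               chunk[:, 1].astype(np.intp),
               chunk[:, 2].astype(np.intp))

for xb, yb, tb in read_binned_events("events.npy"):
    np.add.at(hist, (xb, yb, tb), 1)            # bump counters; only `hist` stays resident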