I have tried various methods of data compression when saving numpy arrays to disk.
These 1D arrays contain sampled data at a certain sampling rate (this can be sound recorded with a microphone, or any other measurement with any sensor): the data is essentially continuous (in a mathematical sense; of course, after sampling it is now discrete data).
I tried with HDF5 (h5py) :
f.create_dataset("myarray1", data=myarray, compression="gzip", compression_opts=9)
but this is quite slow, and the compression ratio is not the best we can expect.
I also tried with
numpy.savez_compressed()
but once again it may not be the best compression algorithm for such data (described before).
What would you choose for better compression ratio on a numpy array, with such data ?
(I thought about things like lossless FLAC (initially designed for audio), but is there an easy way to apply such an algorithm on numpy data ?)
What I do now:
import gzip
import numpy
f = gzip.GzipFile("my_array.npy.gz", "w")
numpy.save(file=f, arr=my_array)
f.close()
Noise is incompressible. Thus, any part of the data that you have which is noise will go into the compressed data 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) of 16, the remaining 24 - 16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.
Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate the signal from the noise and discard the noise). For example, if you know your data is bandwidth limited to 10 MHz and you're sampling at 200 MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.
A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc). Denoise could be the same as bandwidth limit, or a nonlinear filter like a running median. Bandwidth limit can be implemented with FIR/IIR. Delta compress is just y[n] = x[n] - x[n-1].
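For the delta-compression step specifically, a minimal sketch (assuming a 1-D integer array already loaded as data; the file names are placeholders):
import gzip
import numpy as np
data = np.load('data.npy')                   # hypothetical integer samples
delta = np.diff(data, prepend=0)             # delta[0] holds x[0] itself, then x[n] - x[n-1]
with gzip.open('data_delta.npy.gz', 'wb') as f:
    np.save(f, delta)
# np.cumsum(delta) reconstructs the original exactly (for integer data)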
EDIT An illustration:
from pylab import *
import numpy
import numpy.random
import os.path
import subprocess
# create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB=16)
N = 1000000
data = (sin(2 * pi * linspace(0, N, N) / 100) * (1 << 23) +
        numpy.random.randn(N) * (1 << 7)).astype(int32)
numpy.save('data.npy', data)
print(os.path.getsize('data.npy'))
# 4000080 uncompressed size
subprocess.call('xz -9 data.npy', shell=True)
print(os.path.getsize('data.npy.xz'))
# 1484192 compressed size
# 11.87 bits per sample, ~8 bits of that is noise
data_quantized = data // (1 << 8)  # integer (floor) division drops the bottom 8 bits
numpy.save('data_quantized.npy', data_quantized)
subprocess.call('xz -9 data_quantized.npy', shell=True)
print(os.path.getsize('data_quantized.npy.xz'))
# 318380
# still have 16 bits of signal, but only takes 2.55 bits per sample to store it
Saving HDF5 files with compression can be very quick and efficient: it all depends on the compression algorithm, and on whether you want it to be quick while saving, while reading it back, or both. And, naturally, on the data itself, as explained above.
GZIP tends to be somewhere in between, but with a low compression ratio. BZIP2 is slow on both sides, although with a better ratio. BLOSC is one of the algorithms I have found to give quite good compression, quickly on both ends. The downside of BLOSC is that it is not implemented in all distributions of HDF5, so your program may not be portable.
You always need to run at least some tests to select the best configuration for your needs.
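For h5py specifically, one way to get BLOSC is the third-party hdf5plugin package; a minimal sketch, assuming that package is installed (the data here is just a stand-in):
import h5py
import hdf5plugin
import numpy as np
myarray = np.random.randn(1_000_000).astype(np.float32)   # stand-in data
with h5py.File('blosc_test.h5', 'w') as f:
    f.create_dataset('myarray1', data=myarray,
                     **hdf5plugin.Blosc(cname='zstd', clevel=5,
                                        shuffle=hdf5plugin.Blosc.SHUFFLE))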
What constitutes the best compression (if any) highly depends on the nature of the data. Many kinds of measurement data are virtually completely incompressible, if loss-free compression is indeed required.
The pytables docs contain a lot of useful guidelines on data compression. They also detail speed tradeoffs and so on; higher compression levels are usually a waste of time, as it turns out.
http://pytables.github.io/usersguide/optimization.html
Note that this is probably as good as it will get. For integer measurements, a combination of a shuffle filter with a simple zip-type compression usually works reasonably well. This filter very efficiently exploits the common situation where the most significant byte is usually 0, and is only included to guard against overflow.
You might want to try blz. It can compress binary data very efficiently.
import blz
# this stores the array in memory
blz.barray(myarray)
# this stores the array on disk
blz.barray(myarray, rootdir='arrays')
It stores arrays either on file or compressed in memory. Compression is based on blosc.
See the scipy video for a bit of context.
First, for general data sets, the shuffle=True argument to create_dataset improves compression dramatically with roughly continuous datasets. It very cleverly rearranges the bytes to be compressed so that (for continuous data) they change slowly, which means they can be compressed better. In my experience it slows the compression down only slightly, but can substantially improve the compression ratios. It is not lossy, so you really do get the same data out as you put in.
If you don't care about the accuracy so much, you can also use the scaleoffset argument to limit the number of bits stored. Be careful, though, because this is not what it might sound like. In particular, it is an absolute precision, rather than a relative precision. For example, if you pass scaleoffset=8 but your data points are less than 1e-8, you'll just get zeros. Of course, if you've scaled the data to max out around 1, and don't think you can hear differences smaller than a part in a million, you can pass scaleoffset=6 and get great compression without much work.
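A minimal sketch of both options (file name and parameter values are placeholders):
import h5py
import numpy as np
data = np.sin(np.linspace(0, 100, 1_000_000))     # stand-in signal, float64 in [-1, 1]
with h5py.File('samples.h5', 'w') as f:
    # lossless: byte shuffle + gzip
    f.create_dataset('lossless', data=data,
                     compression='gzip', compression_opts=4, shuffle=True)
    # lossy: keep roughly 6 digits after the decimal point (absolute precision)
    f.create_dataset('truncated', data=data,
                     compression='gzip', shuffle=True, scaleoffset=6)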
But for audio specifically, I expect that you are right in wanting to use FLAC, because its developers have put in huge amounts of thought, balancing compression with preservation of distinguishable details. You can convert to WAV with scipy, and thence to FLAC.
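A rough sketch of that route, assuming the flac command-line encoder is installed; the sample rate and file names are placeholders, and note that quantizing to 16-bit PCM is itself lossy unless your data already fits in 16 bits:
import subprocess
import numpy as np
from scipy.io import wavfile
rate = 44100                                  # assumed sampling rate
samples = np.load('my_array.npy')             # hypothetical input array
pcm = (samples / np.abs(samples).max() * 32767).astype(np.int16)  # scale to 16-bit PCM
wavfile.write('my_array.wav', rate, pcm)
subprocess.run(['flac', '--best', 'my_array.wav'], check=True)    # writes my_array.flac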
Related
I have short audio files which I'm trying to analyze using Librosa, in particular the spectral centroid function. However, this function outputs an array of values representing the spectral centroid at different frames within the audio file. The documentation says that the frame size can be changed by specifying the parameter n_fft when calling the function. It would be more useful to me if this function analyzed the entire audio file at once rather than outputting the result at multiple points in time. Is there a way for me to specify that I want the function to be called with, say, a frame size of the entire audio file instead of the default of 2048 samples? Or is there a better way?
Cheers and thank you!
The length of the FFT window (n_fft) specifies not only how many samples you need, but also the frequency resolution of the result (longer n_fft, better resolution). To ensure comparable results for many files you probably want to use the same n_fft value for all of them.
With that out of the way, say your files all have no more than 16k samples. Then you may still achieve a reasonable runtime (FFT runs in O(N log N)). Obviously, this will get worse as your file size increases. So you could call spectral_centroid(y=y, n_fft=16384, hop_length=16384, center=False) and because hop_length is set to the same value as n_fft you would compute the FFT for non-overlapping windows. And because n_fft is greater than the max number of samples in all your files (in this example), you should get only one value. Note that I set center to False to avoid an adjustment that is not necessary for your scenario.
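A rough sketch of that single-window call; the file name is a placeholder, and the zero-padding is my own addition to guard against clips shorter than the window:
import numpy as np
import librosa
N_FFT = 16384                                  # assumed upper bound on file length
y, sr = librosa.load('my_clip.wav', sr=None)   # hypothetical file
if len(y) < N_FFT:
    y = np.pad(y, (0, N_FFT - len(y)))         # zero-pad so one full window fits
cent = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=N_FFT,
                                         hop_length=N_FFT, center=False)
print(cent.shape)  # (1, 1): a single centroid for the whole file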
As an alternative to choosing a long transform window, you could also compute many values for overlapping windows (or frames) using the STFT (which is what librosa does anyway) and simply average the resulting values like this:
import numpy as np
import librosa
y, sr = librosa.load(librosa.ex('trumpet'))
cent = librosa.feature.spectral_centroid(y=y, sr=sr, center=False)
avg_cent = np.mean(cent)
print(avg_cent)
2618.004809523263
The latter solution is in line with what is usually done in MIR and my recommendation. Note that this also allows you to use other statistics functions like the median, which may or may not be something you are interested in. In other words, you can determine the distribution of the centroids, which arguably carries more meaning.
I have an array with ~1,000,000 rows, each of which is a numpy array of 4,800 float32 numbers.
I need to save this as a csv file, but numpy.savetxt has been running for 30 minutes and I don't know how much longer it will take.
Is there a faster method of saving the large array as a csv?
Many thanks,
Josh
As pointed out in the comments, 1e6 rows * 4800 columns * 4 bytes per float32 is 18 GiB. Writing a float to text takes ~9 bytes of text (estimating 1 for integer, 1 for decimal point, 5 for mantissa and 2 for separator), which comes out to 40 GiB. This takes a long time to do, since just the conversion to text itself is non-trivial, and disk I/O will be a huge bottleneck.
One way to optimize this process may be to convert the entire array to text on your own terms, and write it in blocks using Python's binary I/O. I doubt that will give you too much benefit though.
A much better solution would be to write the binary data to a file instead of text. Aside from the obvious advantages of space and speed, binary has the advantage of being seekable and not requiring transformation after loading. You know where every individual element is in the file; if you are clever, you can access portions of the file without loading the entire thing. Finally, a binary file is more likely to be highly compressible than a relatively low-entropy text file.
Disadvantages of binary are that it is not human-readable, and not as portable as text. The latter is not a problem, since transforming into an acceptable format will be trivial. The former is likely a non-issue given the amount of data you are attempting to process anyway.
Keep in mind that human readability is a relative term. A human cannot read 40 GiB of numerical data with understanding. A human can process A) a graphical representation of the data, or B) scan through relatively small portions of the data. Both cases are suitable for binary representations. Case A) is straightforward: load, transform and plot the data. This will be much faster if the data is already in a binary format that you can pass directly to the analysis and plotting routines. Case B) can be handled with something like a memory-mapped file. You only ever need to load a small portion of the file, since you can't really show more than say a thousand elements on screen at one time anyway. Any reasonably modern platform should be able to keep up with the I/O and binary-to-text conversion associated with a user scrolling around a table widget or similar. In fact, binary makes it easier since you know exactly where each element belongs in the file.
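A minimal sketch of the binary route (array sizes reduced, file names placeholders):
import numpy as np
data = np.random.rand(1000, 4800).astype(np.float32)   # small stand-in for your array
# write raw binary: at full size this is ~18 GiB instead of ~40 GiB of text, with no float-to-text conversion
np.save('data.npy', data)
# later: read back lazily and slice only what you need
view = np.load('data.npy', mmap_mode='r')
chunk = view[100:200, :100]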
I need to multiply two big matrices and sort their columns.
import numpy
a = numpy.random.rand(1000000, 100)
b = numpy.random.rand(300000, 100)
c = numpy.dot(b, a.T)
sorted = [numpy.argsort(j)[:10] for j in c.T]
This process takes a lot of time and memory. Is there a way to speed it up? If not, how can I calculate the RAM needed to do this operation? I currently have an EC2 box with 4 GB RAM and no swap.
I was wondering if this operation can be serialized so that I don't have to store everything in memory.
One thing that you can do to speed things up is compile numpy against an optimized BLAS library such as ATLAS, GotoBLAS, or Intel's proprietary MKL.
To calculate the memory needed, you need to monitor Python's Resident Set Size ("RSS"). The following commands were run on a UNIX system (FreeBSD to be precise, on a 64-bit machine).
> ipython
In [1]: import numpy as np
In [2]: a = np.random.rand(1000, 1000)
In [3]: a.dtype
Out[3]: dtype('float64')
In [4]: del(a)
To get the RSS I ran:
ps -xao comm,rss | grep python
[Edit: See the ps manual page for a complete explanation of the options, but basically these ps options make it show only the command and resident set size of all processes. The equivalent format for Linux's ps would be ps -xao c,r, I believe.]
The results are:
After starting the interpreter: 24880 kiB
After importing numpy: 34364 kiB
After creating a: 42200 kiB
After deleting a: 34368 kiB
Calculating the size:
In [4]: (42200 - 34364) * 1024
Out[4]: 8024064
In [5]: 8024064/(1000*1000)
Out[5]: 8.024064
As you can see, the calculated size matches the 8 bytes for the default datatype float64 quite well. The difference is internal overhead.
The size of your original arrays in MiB will be approximately:
In [11]: 8*1000000*100/1024**2
Out[11]: 762.939453125
In [12]: 8*300000*100/1024**2
Out[12]: 228.8818359375
That's not too bad. However, the dot product will be way too large:
In [19]: 8*1000000*300000/1024**3
Out[19]: 2235.1741790771484
That's 2235 GiB!
What you can do is split up the problem and perform the dot operation in pieces:
load b as an ndarray
load every row from a as an ndarray in turn.
take the dot product of b with that row (giving one column of the result) and write it to a file.
del() the row and load the next row.
This will not make it faster, but it will make it use less memory!
Edit: In this case I would suggest writing the output file in binary format (e.g. using struct or ndarray.tofile). That would make it much easier to read a column from the file with e.g. a numpy.memmap.
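A rough sketch of the blockwise idea, here going a block of rows of b at a time so the writes to the output file stay contiguous; file names, dtype, and the block size are assumptions:
import numpy as np
a = np.load('a.npy', mmap_mode='r')      # shape (1000000, 100), hypothetical file
b = np.load('b.npy', mmap_mode='r')      # shape (300000, 100)
# the full result lives on disk (~1.2 TB as float32); only one block is in RAM at a time
out = np.memmap('c.dat', dtype=np.float32, mode='w+',
                shape=(b.shape[0], a.shape[0]))
block = 100                              # rows of b per step; tune to the available RAM
for start in range(0, b.shape[0], block):
    out[start:start + block, :] = b[start:start + block] @ a.T
out.flush()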
What DrV and Roland Smith said are good answers; they should be listened to. My answer does nothing more than present an option to make your data sparse, a complete game-changer.
Sparsity can be extremely powerful. It would transform your O(100 * 300000 * 1000000) operation into an O(k) operation with k non-zero elements (sparsity only means that the matrix is largely zero). I know sparsity has been mentioned by DrV and disregarded as not applicable but I would guess it is.
All that needs to be done is to find a sparse representation for computing this transform (and interpreting the results is another ball game). Easy (and fast) methods include the Fourier transform or wavelet transform (both rely on similarity between matrix elements) but this problem is generalizable through several different algorithms.
Having experience with problems like this, it smells like a relatively common problem that is typically solved through some clever trick. In a field like machine learning, where these kinds of problems are classified as "simple", that is often the case.
You have a problem in any case. As Roland Smith shows you in his answer, the amount of data and the number of calculations are enormous. You may not be very familiar with linear algebra, so a few words of explanation might help in understanding (and then hopefully solving) the problem.
Your arrays are both a collection of vectors with length 100. One of the arrays has 300 000 vectors, the other one 1 000 000 vectors. The dot product between these arrays means that you calculate the dot product of each possible pair of vectors. There are 300 000 000 000 such pairs, so the resulting matrix is either 1.2 TB or 2.4 TB depending on whether you use 32 or 64-bit floats.
On my computer dot multiplying a (300,100) array with a (100,1000) array takes approximately 1 ms. Extrapolating from that, you are looking at a 1000 s calculation time (depending on the number of cores).
The nice thing about taking a dot product is that you can do it piecewise. Keeping the output is then another problem.
If you were running it on your own computer, calculating the resulting matrix could be done in the following way:
create an output array as a np.memmap array onto the disk
calculate the results one row at a time (as explained by Roland Smith)
This would result in a linear file write with a largish (2.4 TB) file.
This does not require too many lines of code. However, make sure everything is transposed in a suitable way: transposing the input arrays is cheap, transposing the output is extremely expensive. Accessing the resulting huge array is cheap if you access elements close to each other, and expensive if you access elements far away from each other.
Sorting a huge memmapped array has to be done carefully. You should use in-place sort algorithms which operate on contiguous chunks of data. The data is stored in 4 KiB chunks (512 or 1024 floats), and the fewer chunks you need to read, the better.
Now that you are not running the code on your own machine but on a cloud platform, things change a lot. Usually the cloud SSD storage is very fast with random accesses, but IO is expensive (also in terms of money). Probably the least expensive option is to calculate suitable chunks of data and send them to S3 storage for further use. The "suitable chunk" part depends on how you intend to use the data. If you need to process individual columns, then you send one or a few columns at a time to the cloud object storage.
However, a lot depends on your sorting needs. Your code looks as if you are finally only looking at a few first items of each column. If this is the case, then you should only calculate the first few items and not the full output matrix. That way you can do everything in memory.
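If the ten smallest entries per column really are all you need, a sketch along these lines never forms the full matrix (file names and chunk size are assumptions; the chunk just has to fit in RAM):
import numpy as np
a = np.load('a.npy', mmap_mode='r')      # shape (1000000, 100), hypothetical file
b = np.load('b.npy', mmap_mode='r')      # shape (300000, 100)
chunk = 200                              # columns of the product per step (~0.5 GB per block)
top10 = np.empty((a.shape[0], 10), dtype=np.int64)
for start in range(0, a.shape[0], chunk):
    block = b @ a[start:start + chunk].T                            # shape (300000, chunk)
    top10[start:start + chunk] = np.argsort(block, axis=0)[:10].T   # like argsort(j)[:10]
np.argpartition would be faster than the full argsort here, but the idea is the same.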
Maybe if you tell a bit more about your sorting needs, there can be a viable way to do what you want.
Oh, one important thing: are your matrices dense or sparse? (Sparse means they mostly contain 0's.) If you expect your output matrix to be mostly zero, that may change the game completely.
I want to amplify audio input at specific frequencies, and I use numpy.fft.
So my question is: When changing the amplitudes of the signal, what happens with phase?
For example, if I multiply amplitudes in some frequency range, by some factor, let's say 2, do I need to change the phases, and if so, what should I do with them?
I've done the amplification without changing phases, and the result was not what I wanted. It's pretty much the same signal, with some unwanted noise.
You shouldn't need to change the phase for something like this. More likely the problem is that you need to be a bit more gentle about applying the boost. It sounds like you are taking some frequency window and multiplying by a constant while leaving everything else unchanged. This will cause ringing in the time domain with a very long tail. You need to smooth the transition from the gain=1 region to the gain=2 region, for instance by using a Gaussian-shaped gain window, with code that looks something like this:
import numpy as np

def boost_band(x, t, f0, df):
    # f0, df: center frequency and bandwidth of the gain region
    f = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), t[1] - t[0])   # non-negative frequencies matching rfft
    gain_window = 1 + np.exp(-(freqs - f0)**2 / df**2)
    return np.fft.irfft(f * gain_window, n=len(x))
If that works, you can experiment with more aggressive functions that have sharper turn-on and a flatter top.
The FFT may not actually be what you want. FFTs are not normally used for real-time / streaming applications. This is because in the naive approach you have to collect the whole sample buffer before you start processing. For simple filtering applications it is often easier to do filtering directly in the time domain. This is what FIR and IIR filters do.
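For instance, a time-domain sketch of the same kind of band boost (sample rate, band edges, filter length, and gain are all assumptions on my part):
import numpy as np
from scipy import signal
fs = 44100.0                                   # assumed sampling rate
x = np.random.randn(10000)                     # stand-in for the real signal
taps = signal.firwin(101, [300.0, 3000.0], pass_zero=False, fs=fs)  # linear-phase band-pass FIR
band = signal.lfilter(taps, 1.0, x)
delay = (len(taps) - 1) // 2                   # group delay of the linear-phase FIR
dry = np.pad(x, (delay, 0))[:len(x)]           # delay the original to stay time-aligned
boosted = dry + band                           # roughly 2x gain inside the band, 1x outside
Unlike the FFT approach this is causal, and lfilter can carry filter state (zi) between blocks, which suits streaming use.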
In order to filter with the Fourier transform in real time, what you have to do is break your data stream into overlapping blocks of a fixed length, FFT, filter, inverse FFT, and stitch them back together without introducing glitches. This is possible, but it is tricky to get right. For a full-blown multi-channel EQ it might still be the best option, but you should at least consider time-domain filtering.
If this is not a real-time application, then FFT is the way to go. For medium sized data sets (up to a few hundred megabytes) you can just FFT the whole data set. For much larger data sets you still have to break the data up into blocks, but they can be much larger blocks and you don't have to worry about the latency introduced.
Also, remember the FFT treats the signal as periodic, so if your signal doesn't go to zero at the beginning and end you will need to do some sort of windowing.
I'm using an event loop based server in twisted python that stores files, and I'd like to be able to classify the files according to their compressibility.
If the probability that they'd benefit from compression is high, they would go to a directory with btrfs compression switched on, otherwise they'd go elsewhere.
I do not need to be sure - 80% accuracy would be plenty, and would save a lot of diskspace. But since there is the CPU and fs performance issue too, I can not just save everything compressed.
The files are in the low megabytes. I can not test-compress them without using a huge chunk of CPU and unduly delaying the event loop or refactoring a compression algorithm to fit into the event loop.
Is there any best practice to give a quick estimate for compressibility? What I came up with is taking a small chunk (few kB) of data from the beginning of the file, test-compress it (with a presumably tolerable delay) and base my decision on that.
Any suggestions? Hints? Flaws in my reasoning and/or problem?
Just 10K from the middle of the file will do the trick. You don't want the beginning or the end, since they may contain header or trailer information that is not representative of the rest of the file. 10K is enough to get some amount of compression with any typical algorithm. That will predict a relative amount of compression for the whole file, to the extent that that middle 10K is representative. The absolute ratio you get will not be the same as for the whole file, but the amount that it differs from no compression will allow you to set a threshold. Just experiment with many files to see where to set the threshold.
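A minimal sketch of that estimate using zlib (the chunk size and the decision threshold are things to tune experimentally):
import os
import zlib
def sample_ratio(path, sample_size=10 * 1024):
    # test-compress ~10K from the middle of the file; smaller ratio = more compressible
    size = os.path.getsize(path)
    with open(path, 'rb') as fh:
        fh.seek(max(0, (size - sample_size) // 2))
        chunk = fh.read(sample_size)
    if not chunk:
        return 1.0
    return len(zlib.compress(chunk, 6)) / len(chunk)
# e.g. store compressed only if the sample shrank by at least 20%:
# if sample_ratio(path) < 0.8: ...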
As noted, you can save time by doing nothing for files that are obviously already compressed, e.g. .png, .jpg, .mov, .pdf, .zip, etc.
Measuring entropy is not necessarily a good indicator, since it only gives the zeroth-order estimate of compressibility. If the entropy indicates that it is compressible enough, then it is right. If the entropy indicates that it is not compressible enough, then it may or may not be right. Your actual compressor is a much better estimator of compressibility. Running it on 10K won't take long.
I think what you are looking for is How to calculate the entropy of a file?
That question covers all kinds of methods to calculate the entropy of a file (and from that you can estimate the 'compressibility' of the file). Here's a quote from the abstract of the article "Relationship Between Entropy and Test Data Compression" (Kedarnath J. Balakrishnan, Member, IEEE, and Nur A. Touba, Senior Member, IEEE):
The entropy of a set of data is a measure of the amount of information contained in it. Entropy calculations for fully specified data have been used to get a theoretical bound on how much that data can be compressed. This paper extends the concept of entropy for incompletely specified test data (i.e., that has unspecified or don't care bits) and explores the use of entropy to show how bounds on the maximum amount of compression for a particular symbol partitioning can be calculated. The impact of different ways of partitioning the test data into symbols on entropy is studied. For a class of partitions that use fixed-length symbols, a greedy algorithm for specifying the don't cares to reduce entropy is described. It is shown to be equivalent to the minimum entropy set cover problem and thus is within an additive constant error with respect to the minimum entropy possible among all ways of specifying the don't cares. A polynomial time algorithm that can be used to approximate the calculation of entropy is described. Different test data compression techniques proposed in the literature are analyzed with respect to the entropy bounds. The limitations and advantages of certain types of test data encoding strategies are studied using entropy theory
And to be more constructive, check out this site for a Python implementation of entropy calculations on chunks of data.
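For instance, a minimal sketch of a byte-level (zeroth-order) entropy estimate:
import collections
import math
def byte_entropy(data: bytes) -> float:
    # Shannon entropy in bits per byte: 0 for constant data, 8 for uniformly random bytes
    if not data:
        return 0.0
    n = len(data)
    counts = collections.Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
# e.g. byte_entropy(open(path, 'rb').read(10 * 1024)) close to 8 suggests poor compressibility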
Compressed files usually don't compress well. This means that just about any media file is not going to compress very well, since most media formats already include compression. Clearly there are exceptions to this, such as BMP and TIFF images, but you can probably build a whitelist of well-compressed filetypes (PNGs, MPEGs, and venturing away from visual media - gzip, bzip2, etc) to skip and then assume the rest of the files you encounter will compress well.
If you feel like getting fancy, you could build feedback into the system (observe the results of any compression you do and associate the resulting ratio with the filetype). If you come across a filetype that has consistently poor compression, you could add it to the whitelist.
These ideas depend on being able to identify a file's type, but there are standard utilities which do a pretty good job of this (generally much better than 80%) - file(1), /etc/mime.types, etc.
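A minimal sketch of such a skip-list using the standard-library mimetypes module (the set of types is illustrative, not exhaustive, and file(1) would classify more reliably):
import mimetypes
# types that are almost certainly already compressed; extend this from observed results
ALREADY_COMPRESSED = {'image/png', 'image/jpeg', 'video/mp4', 'video/quicktime',
                      'audio/mpeg', 'application/pdf', 'application/zip'}
def probably_compressible(path):
    mime, encoding = mimetypes.guess_type(path)
    if encoding is not None:           # .gz, .bz2, ... show up as encodings, not types
        return False
    return mime not in ALREADY_COMPRESSED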