Data compression in python/numpy

I'm looking at using the Amazon cloud for all my simulation needs. The resulting sim files are quite large, and I would like to move them over to my local drive for ease of analysis, etc. You pay for the data you transfer out, so I want to compress my sim solutions as small as possible. They are simply numpy arrays saved in the form of .mat files, using:
import scipy.io as sio
# savemat expects a dict of variable names to arrays; 'sim_data' here is illustrative
sio.savemat(filepath, {'sim_data': sim_array}, do_compression=True)
So my question is: what is the best way to compress numpy arrays (they are currently stored in .mat files, but I could store them using any Python method)? Should I compress on the Python side when saving, use Linux compression tools, or both?
I am in a Linux environment, and I am open to any kind of file compression.

Unless you know something special about the arrays (e.g. sparseness, or some pattern), you aren't going to do much better than the default compression, and maybe gzip on top of that. In fact, you may not even need to gzip the files if you're using HTTP for downloads and your server is configured to do compression. Good lossless compression algorithms rarely vary by more than 10%.
If savemat works as advertised, you should be able to get gzip compression all in Python with:
import scipy.io as sio
import gzip

# write the compressed .mat file through a gzip file object
# (the 'sim_data' variable name is illustrative, as above)
with gzip.open(filepath_dot_gz, 'wb') as f_out:
    sio.savemat(f_out, {'sim_data': sim_array}, do_compression=True)
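To read it back, open the gzip stream the same way; a minimal sketch, assuming the 'sim_data' variable name used above:
import scipy.io as sio
import gzip

# reopen the compressed .mat file and pull out the array
with gzip.open(filepath_dot_gz, 'rb') as f_in:
    sim_array = sio.loadmat(f_in)['sim_data']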

Also, LZMA (a.k.a. xz) gives very good compression on fairly sparse numpy arrays, although it is pretty slow when compressing (and may require more memory as well).
On Ubuntu it is installed with sudo apt-get install python-lzma.
It is used like any other file-object wrapper, something like this (to load pickled data):
from lzma import LZMAFile
import cPickle as pickle

# open either an xz-compressed file or a plain file, then unpickle from it
if fileName.endswith('.xz'):
    dataFile = LZMAFile(fileName, 'r')
else:
    dataFile = open(fileName, 'rb')
data = pickle.load(dataFile)

Though it won't necessarily give you the highest compression ratios, I've had good experiences saving compressed numpy arrays to disk with python-blosc. It is very fast and integrates well with numpy.
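A minimal sketch of how that can look, assuming the python-blosc package is installed (the array and filename are just for illustration):
import blosc
import numpy as np

arr = np.random.rand(1000, 1000)  # placeholder array

# pack_array serializes and compresses the array in one call
packed = blosc.pack_array(arr)
with open('arr.blosc', 'wb') as f:
    f.write(packed)

# read it back and decompress
with open('arr.blosc', 'rb') as f:
    restored = blosc.unpack_array(f.read())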

Related

Pandas to_csv() slow when writing to network drive

When writing to a Windows network drive with the Pandas to_csv() function, the write operation is considerably slower than when writing to a local disk. This is obviously partly a function of network latency, but I find that if I write the data to a StringIO object first and then write the StringIO object's contents to the network drive, it is considerably faster than calling to_csv directly with the network path, i.e.
from io import StringIO
# Slow
df.to_csv("/network/drive/test.csv")
# Fast
buf = StringIO()
df.to_csv(buf)
with open("/network/drive/test.csv", "w") as fh:
    fh.write(buf.getvalue())
I likewise find that when using the fwrite() function from the R data.table package, there is a much smaller difference in write time between the local and network drives.
Given that I need to frequently write to a network disk, I am considering using the "fast" StringIO method above, but I am curious whether there is some option in to_csv() that I am overlooking that would get the same result.

Save list of numpy arrays onto disk

I have a list of 42000 numpy arrays (each array is 240x240) that I want to save to a file for use in another python script.
I've tried using pickle and numpy.savez_compressed, and I run into memory errors (I have 16 GB of DDR3). I read that HDF5, which is commonly used for deep learning work, cannot save lists, so I'm kind of stuck.
Does anyone have any idea how I can save my data?
EDIT: I previously saved this data as a single numpy array on disk using np.save. It was around 2.3 GB, but my computer couldn't always handle it, so it would sometimes crash if I tried to process it. I read that lists might be better, so I have moved to using lists of numpy arrays.
Assume we have a list of numpy arrays, A, and wish to save these sequentially to an HDF5 file.
We can use the h5py library to create datasets, with each dataset corresponding to an array in A.
import h5py
import numpy as np

A = [arr1, arr2, arr3]  # each arrX is a 240x240 numpy array

with h5py.File('file.h5', 'w', libver='latest') as f:  # use 'latest' for performance
    for idx, arr in enumerate(A):
        dset = f.create_dataset(str(idx), shape=(240, 240), data=arr,
                                chunks=(240, 240),
                                compression='gzip', compression_opts=9)
I use gzip compression here for compatibility reasons, since it ships with every HDF5 installation. You may also wish to consider blosc & lzf filters. I also set chunks equal to shape, under the assumption you intend to read entire arrays rather than partial arrays.
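To read the arrays back later, something like the following should work (a sketch, assuming the file and dataset names created above):
import h5py

# reopen the file and rebuild the list in the original order
with h5py.File('file.h5', 'r') as f:
    A = [f[str(idx)][:] for idx in range(len(f))]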
The h5py documentation is an excellent resource to improve your understanding of the HDF5 format, as the h5py API follows the C API closely.

Saving numpy array such that it is readily available without loading

I have a 20 GB library of images stored as a high-dimensional numpy array. This library allows me to use these images without having to generate them anew each time. Now my problem is that np.load("mylibrary") takes as much time as it would take to generate a couple of those images. Therefore my question is: is there a way to store a numpy array such that it is readily accessible without having to load it?
Edit: I am using PyCharm
I would suggest h5py which is a Pythonic interface to the HDF5 binary data format.
It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
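A minimal sketch of that access pattern, with illustrative file and dataset names (image_array stands in for the existing library; only the slices you index are actually read from disk):
import h5py

# one-time conversion of the existing in-memory array to HDF5
with h5py.File('library.h5', 'w') as f:
    f.create_dataset('images', data=image_array, chunks=True)

# later: opening the file does not load the whole library into memory
with h5py.File('library.h5', 'r') as f:
    img = f['images'][42]          # reads only image 42 from disk
    batch = f['images'][100:132]   # reads only this slice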
You can also use PyTables. It is another HDF5 interface for Python and NumPy.
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, some examples of use and presentations here.
numpy.memmap is another option. It would, however, be slower than HDF5. Another caveat is that an array should be limited to 2.5 GB.
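A short sketch of the memmap approach, with an illustrative filename, dtype, and shape (image_array again stands in for the existing library; data is paged in from disk only as it is accessed):
import numpy as np

shape = (20000, 240, 240)  # illustrative; memmap needs the shape recorded somewhere

# write the library once as a raw memory-mapped file
mm = np.memmap('library.dat', dtype='float32', mode='w+', shape=shape)
mm[:] = image_array
mm.flush()

# later: this "opens" the data almost instantly; reads happen lazily
mm = np.memmap('library.dat', dtype='float32', mode='r', shape=shape)
img = np.array(mm[42])  # materialize a single image in memory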

How to read a big (3-4GB) file that doesn't have newlines into a numpy array?

I have a 3.3 GB file containing one long line. The values in the file are comma separated and are either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used hdf files, via h5py, of sizes well over 200 gigs with very little processing time, on the order of a minute or two, for file accesses. In addition the hdf libraries support mpi and concurrent access.
This means that, assuming you can format your original one-line file as an appropriately hierarchical hdf file (e.g. make a group for every 'large' segment of data), you can use the inbuilt capabilities of hdf to make use of multiple-core processing of your data, exploiting mpi to pass whatever data you need between the cores.
You need to be careful with your code and understand how mpi works in conjunction with hdf, but it'll speed things up no end.
Of course all of this depends on putting the data into an hdf file in a way that allows you to take advantage of mpi... so maybe not the most practical suggestion.
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
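A minimal sketch of what that looks like, with illustrative filenames (the analysis side then gets a fast binary read with no text parsing):
import numpy as np

# in the program that produces the data: dump in numpy's binary .npy format
np.save('distance_matrix.npy', distance_matrix)

# in the analysis script: loads quickly, no comma-separated parsing involved
distance_matrix = np.load('distance_matrix.npy')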
If you can't change the file type (it's not the output of one of your programs) then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8 GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.

What's the best compression algorithm for data dumps

I'm creating data dumps from my site for others to download and analyze. Each dump will be a giant XML file.
I'm trying to figure out the best compression algorithm that:
Compresses efficiently (CPU-wise)
Makes the smallest possible file
Is fairly common
I know the basics of compression, but haven't a clue as to which algo fits the bill. I'll be using MySQL and Python to generate the dump, so I'll need something with a good python library.
GZIP at its standard compression level should be fine for most cases; higher compression levels mean more CPU time. BZ2 packs better but is also slower. There is always a trade-off between CPU consumption/running time and compression efficiency; any of these at their default compression levels should be fine.
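In Python this can be done with the standard-library gzip module while streaming the dump out to disk; a minimal sketch with illustrative filenames:
import gzip
import shutil

# compress the generated XML dump; compresslevel=6 matches the gzip command-line default
with open('dump.xml', 'rb') as src, gzip.open('dump.xml.gz', 'wb', compresslevel=6) as dst:
    shutil.copyfileobj(src, dst)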
