PyTables and HDF5: Massive overhead for tree data

I have a tree data structure that I want to save to disk. Thus, HDF5 with its internal tree structure seemed to be the perfect candidate. However, so far the data overhead is massive, by a factor of 100!
A test tree contains roughly 100 nodes, where leaves usually contain no more than 2 or 3 data items (like doubles). If I take the entire tree and just pickle it, it is about 21 kB. Yet if I use PyTables and map the tree structure one-to-one onto the HDF5 file, the file takes 2.4 MB (!) of disk space. Is the overhead really that big?
The problem is that the overhead does not seem to be constant; it scales linearly with the size of my tree data (both with the number of nodes and with the amount of data per leaf, i.e. with the row count of the leaf tables).
Did I miss something regarding PyTables, like enabling compression (I thought PyTables does it by default)? What could possibly be the reason for this massive overhead?
Thanks a lot!

Ok, so I have found a way to massively reduce the file size. The point is that, despite my prior beliefs, PyTables does NOT apply compression by default.
You can achieve this by using Filters.
Here is an example of how that works:
import tables as pt  # note: the PyTables package is imported as `tables`

hdf5_file = pt.openFile(filename='myhdf5file.h5',
                        mode='a',
                        title='How to compress data')
# for PyTables >= 3 the method is called `open_file`;
# other methods are renamed analogously

myfilters = pt.Filters(complevel=9, complib='zlib')

mydescription = {'mycolumn': pt.IntCol()}  # simple 1-column table

mytable = hdf5_file.createTable(where='/', name='mytable',
                                description=mydescription,
                                title='My Table',
                                filters=myfilters)

# Now you can happily fill the table...
The important line here is Filters(complevel=9, complib='zlib'). It specifies the
compression level complevel and the compression algorithm complib. By default the level is set to 0, which means compression is disabled, whereas 9 is the highest compression level. For details on how compression works, see the PyTables documentation on compression.
Next time, I'd better stick to RTFM :-) (although I did read it, I missed the line "One of the beauties of PyTables is that it supports compression on tables and arrays, although it is not used by default").
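For what it's worth, filters can also be set file-wide when opening the file, so every node created afterwards inherits them. A minimal sketch (the filters argument to open_file is standard PyTables; the file name is made up):
import tables as pt

# Default filters for the whole file; new tables/arrays inherit them.
with pt.open_file('compressed.h5', mode='w',
                  filters=pt.Filters(complevel=9, complib='zlib')) as h5:
    table = h5.create_table('/', 'mytable',
                            description={'mycolumn': pt.IntCol()})
    print(table.filters)  # verify that the compression settings took effect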

Related

Where is the h5py performance bottleneck?

I have an HDF5 file with 100 "events". Each event contains a variable number of groups called "traces", roughly 180, and each trace holds 6 datasets, which are arrays of 32-bit floats, each ~1000 cells long (this varies slightly from event to event, but remains constant inside an event). The file was generated with default h5py settings (so no chunking or compression unless h5py does it on its own).
The readout is not fast: it is ~6 times slower than readout of the same data from CERN ROOT TTrees. I know that HDF5 is far from the fastest format on the market, but I would be grateful if you could tell me where the speed is lost.
To read the arrays in traces I do:
d0keys = data["Run_0"].keys()
for key_1 in d0keys:
    if "Event_" in key_1:
        d1 = data["Run_0"][key_1]
        d1keys = d1.keys()
        for key_2 in d1keys:
            if "Traces_" in key_2:
                d2 = d1[key_2]
                v1, v2, v3, v4, v5, v6 = (d2['SimSignal_X'][0], d2['SimSignal_Y'][0],
                                          d2['SimSignal_Z'][0], d2['SimEfield_X'][0],
                                          d2['SimEfield_Y'][0], d2['SimEfield_Z'][0])
Line profiler shows that ~97% of the time is spent in the last line. Now, there are two issues:
There seems to be no difference between reading cell [0] and all ~1000 cells with [:]. I understand that h5py should be able to read just a chunk of data from the disk, so why is there no difference?
Reading 100 events from an HDD (Linux, ext4) takes ~30 s with h5py, and ~5 s with ROOT. The size of 100 events is roughly 430 MB. This gives a readout speed with HDF5 of ~14 MB/s, while ROOT manages ~86 MB/s. Both are slow, but ROOT comes much closer to the raw readout speed I would expect from a ~4-year-old laptop HDD.
So where does h5py lose its speed? I guess the pure readout should just be the HDD speed. Thus, is the bottleneck:
Dereferencing HDF5 address to the dataset (ROOT does not need to do it)?
Allocating memory in python?
Something else?
I would be grateful for some clues.
There are a lot of HDF5 I/O issues to consider. I will try to cover each.
From my tests, time spent doing I/O is primarily a function of the number of reads/writes, not of how much data (in MB) you read/write. Read this SO post for more details:
pytables writes much faster than h5py. Why? Note: it shows I/O performance for a fixed amount of data with different I/O write sizes, for both h5py and PyTables. Based on this, it makes sense that most of the time is spent in the last line -- that's where you are reading the data from disk into memory as NumPy arrays (v1, v2, v3, v4, v5, v6).
Regarding your questions:
There is a reason there is no difference between reading d2['SimSignal_X'][0] and d2['SimSignal_X'][:]: both read the entire dataset into memory (all ~1000 dataset values). If you only want to read a slice of the data, you need to use slice notation. For example, d2['SimSignal_X'][0:100] only reads the first 100 values (assuming d2['SimSignal_X'] has a single axis -- shape=(1000,)). Note: reading a slice will reduce the required memory, but won't improve I/O read time. (In fact, reading slices will probably increase read time.)
I am not familiar with CERN ROOT, so can't comment about performance of h5py vs ROOT. If you want to use Python, here are several things to consider:
You are reading the HDF5 data into memory (as NumPy arrays). You don't have to do that. Instead of creating arrays, you can create h5py dataset objects. This reduces the initial I/O. You then use the objects "as if" they were np.arrays. The only change is your last line of code -- like this: v1, v2, v3, v4, v5, v6 = d2['SimSignal_X'], d2['SimSignal_Y'], d2['SimSignal_Z'], d2['SimEfield_X'], d2['SimEfield_Y'], d2['SimEfield_Z']. Note how the slice notation ([0] or [:]) is not used; a sketch of the reworked loop follows these points.
I have found PyTables to be faster than h5py. Your code can be easily converted to read with PyTables.
HDF5 can use "chunking" to improve I/O performance. You did not say if you are using this. It might help (hard to say, since your datasets aren't very large). You have to define this when you create the HDF5 file, so it might not be useful in your case.
Also, you did not say whether a compression filter was used when the data was written. Compression reduces the on-disk file size, but has the side effect of reducing I/O performance (increasing read times). If you don't know, check whether compression was enabled at file creation.
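Putting the first point into code, here is a sketch of the question's loop reworked to keep h5py dataset objects instead of materializing arrays (the file name is made up; the structure follows the question):
import h5py

with h5py.File('events.h5', 'r') as data:
    for key_1 in data['Run_0']:
        if 'Event_' in key_1:
            d1 = data['Run_0'][key_1]
            for key_2 in d1:
                if 'Traces_' in key_2:
                    d2 = d1[key_2]
                    # no [0] or [:]: these are lazy dataset objects, and bytes
                    # only move from disk when you index or compute on them
                    v1, v2, v3 = d2['SimSignal_X'], d2['SimSignal_Y'], d2['SimSignal_Z']
                    v4, v5, v6 = d2['SimEfield_X'], d2['SimEfield_Y'], d2['SimEfield_Z']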

Is there a Pandas feature for streaming to / from a large binary source fast, instead of CSV or JSON? Or is there another tool for it?

JSON isn't necessarily a high-efficiency way to store data, in terms of both byte overhead and parsing. It has to be parsed logically, based on syntax, rather than letting you seek to a specific segment. Say you have 20 years of timestep data, ~1 TB compressed, and want to store it efficiently and load / store it as fast as possible for maximum-speed simulation.
At first I tried relational databases, but those are actually not that fast: they're designed to be accessed over a network, not locally, and the network stack adds overhead.
I was able to speed this up by creating a custom binary data structure with defined block sizes and header indexes, sort of like a file system, but this was time-consuming and highly specialized for a single type of data, e.g. fixed-length data nodes. Editing the data wasn't a feature; it was a one-time export spanning days. I'm sure some library could do it better.
I learned about Pandas, but it seems to load to / from CSV and JSON most commonly, and both of those are plain text, so storing an int takes the space of several characters rather than giving you the power to choose, say, a 32-bit unsigned int.
What's the right tool? Can Pandas do this, or is there something better?
I need to be able to specify the data type of each property being stored, so if I only need a 16-bit int, that's the space that gets used.
I need to be able to stream reads / writes of big (1-10 TB) data as fast as the hardware fundamentally allows.
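For reference, pandas can round-trip typed binary data through its HDF5 backend, which uses PyTables under the hood. A minimal sketch (column names and sizes are made up) showing that narrow dtypes survive the round trip, unlike with CSV/JSON:
import numpy as np
import pandas as pd

# Typed columns: a uint16 really occupies 2 bytes per value on disk.
df = pd.DataFrame({
    'step':  np.arange(1_000_000, dtype=np.uint32),
    'value': np.zeros(1_000_000, dtype=np.uint16),
})
df.to_hdf('timesteps.h5', key='data', mode='w', format='table', complevel=9)
back = pd.read_hdf('timesteps.h5', 'data')
assert back['value'].dtype == np.uint16  # dtype preserved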

Is it possible to store multidimensional arrays of arbitrary shape in a PyTables cell?

PyTables supports the creation of tables from user-defined classes that inherit from the IsDescription class. This includes support for multidimensional cells, as in the following example from the documentation:
class Particle(IsDescription):
    name        = StringCol(itemsize=16)    # 16-character string
    lati        = Int32Col()                # integer
    longi       = Int32Col()                # integer
    pressure    = Float32Col(shape=(2, 3))  # array of floats (single-precision)
    temperature = Float64Col(shape=(2, 3))  # array of doubles (double-precision)
However, is it possible to store an arbitrarily-shaped multidimensional array in a single cell? Following the above example, something like pressure = Float32Col(shape=(x, y)) where x and y are determined upon the insertion of each row.
If not, what is the preferred approach? Storing each (arbitrarily-shaped) multidimensional array in a CArray with a unique name and then storing those names in a master index table? The application I'm imagining is storing images and associated metadata, which I'd like to be able to both query and use numexpr on.
Any pointers toward PyTables best practices are much appreciated!
The long answer is "yes, but you probably don't want to."
PyTables probably doesn't support it directly, but HDF5 does support the creation of nested variable-length datatypes, allowing ragged arrays in multiple dimensions. Should you wish to go down that path, you'll want to use h5py and browse through the HDF5 User's Guide, Datatypes chapter; see section 6.4.3.2.3, Variable-length Datatypes. (I'd link it, but they apparently chose not to put anchors that deep.)
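As a rough illustration from the h5py side (a sketch only: h5py's vlen support covers 1-D ragged rows, and truly arbitrary per-cell shapes would need nested vlen types at the C level):
import h5py
import numpy as np

vlen_f32 = h5py.vlen_dtype(np.dtype('float32'))  # variable-length float32 rows

with h5py.File('ragged.h5', 'w') as f:
    dset = f.create_dataset('pressure', shape=(3,), dtype=vlen_f32)
    dset[0] = np.zeros(2, dtype='float32')  # each row can have
    dset[1] = np.zeros(5, dtype='float32')  # a different length
    dset[2] = np.zeros(7, dtype='float32')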
Personally, the way that I would arrange the data you've got is into groups of datasets, not into a single table. That is, something like:
/particles/particlename1/pressure
/particles/particlename1/temperature
/particles/particlename2/pressure
/particles/particlename2/temperature
and so on. The lat and long values would be attributes on the /particles/particlename group rather than datasets, though having small datasets for them is perfectly fine too.
If you want to be able to search based on the lat and long, then having a dataset with lat/long/name columns would be good. And if you want to get really fancy, there's an HDF5 datatype for references, allowing you to store a pointer to a dataset, or even to a subset of a dataset.
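A sketch of that group layout in h5py (file name and values made up):
import h5py
import numpy as np

with h5py.File('particles.h5', 'w') as f:
    g = f.create_group('/particles/particlename1')
    g.attrs['lati'] = 42            # lat/long as attributes on the group
    g.attrs['longi'] = -71
    g.create_dataset('pressure', data=np.zeros((4, 7), dtype='float32'))
    g.create_dataset('temperature', data=np.zeros((2, 3)))  # any shape per particle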
The short answer is "no", and I think it's a "limitation" of HDF5 rather than of PyTables.
I think the reason is that each unit of storage (the compound dataset) must have a well-defined size; if one or more components can change size, it obviously won't. Note that it is entirely possible to resize and extend a dataset in HDF5 (PyTables makes heavy use of this), but not the units of data within that array.
I suspect the best thing to do is either:
a) make it a well-defined size and provide a flag for overflow. This works well if the largest reasonable size is still pretty small and you are okay with tail events being thrown out. Note that you might be able to get rid of the unused disk space with HDF5 compression.
b) do as you suggest and create a new CArray in the same file, reading it in when required. (To keep things tidy, you might want to put these all under their own group.)
HDF5 actually has an API which is designed (and optimized) for storing images in an HDF5 file. I don't think it's exposed in PyTables.

PyTables vs. CSV for files that are not very large

I recently came across PyTables and find it to be very cool. It is clear that it is superior to the CSV format for very large data sets. I am running some simulations using Python. The output is not so large, say 200 columns and 2000 rows.
If you have experience with both, can you suggest which format would be more convenient in the long run for such data sets that are not very large? PyTables has data manipulation capabilities and browsing of the data with ViTables, but the browser does not have as much functionality as, say, Excel, which can be used with CSV. Similarly, do you find one better than the other for importing and exporting data, if working mainly in Python? Is one more convenient in terms of file organization? Any comments on issues such as these would be helpful.
Thanks.
Have you considered NumPy arrays?
PyTables is wonderful when your data is too large to fit in memory, but a 200x2000 matrix of 8-byte floats only requires about 3 MB of memory. So I think PyTables may be overkill.
You can save NumPy arrays to files using np.savetxt (text) or np.save / np.savez (binary; np.savez_compressed adds compression), and read them back with np.loadtxt or np.load.
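A round trip for a matrix of the size in question might look like this (a minimal sketch; the file name is made up):
import numpy as np

results = np.random.random((2000, 200))              # ~3 MB of float64
np.savez_compressed('results.npz', results=results)  # compressed binary file
loaded = np.load('results.npz')['results']
assert np.array_equal(results, loaded)               # lossless round trip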
If you have many such arrays to store on disk, then I'd suggest using a database instead of numpy .npz files. By the way, to store a 200x2000 matrix in a database, you only need 3 table columns: row, col, value:
import sqlite3
import numpy as np

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('''CREATE TABLE foo
                  (row INTEGER,
                   col INTEGER,
                   value FLOAT,
                   PRIMARY KEY (row, col))''')

ROWS = 4
COLUMNS = 6
matrix = np.random.random((ROWS, COLUMNS))
print(matrix)
# [[ 0.87050721  0.22395398  0.19473001  0.14597821  0.02363803  0.20299432]
#  [ 0.11744885  0.61332597  0.19860043  0.91995295  0.84857095  0.53863863]
#  [ 0.80123759  0.52689885  0.05861043  0.71784406  0.20222138  0.63094807]
#  [ 0.01309897  0.45391578  0.04950273  0.93040381  0.41150517  0.66263562]]

# Store matrix in table foo
cursor.executemany('INSERT INTO foo(row, col, value) VALUES (?,?,?)',
                   ((r, c, value) for r, row in enumerate(matrix)
                    for c, value in enumerate(row)))

# Retrieve matrix from table foo
cursor.execute('SELECT value FROM foo ORDER BY row, col')
data = [value for (value,) in cursor.fetchall()]
matrix2 = np.fromiter(data, dtype=np.float64).reshape((ROWS, COLUMNS))
print(matrix2)
# [[ 0.87050721 0.22395398 0.19473001 0.14597821 0.02363803 0.20299432]
# [ 0.11744885 0.61332597 0.19860043 0.91995295 0.84857095 0.53863863]
# [ 0.80123759 0.52689885 0.05861043 0.71784406 0.20222138 0.63094807]
# [ 0.01309897 0.45391578 0.04950273 0.93040381 0.41150517 0.66263562]]
If you have many such 200x2000 matrices, you just need one more table column to specify which matrix.
As far as importing/exporting goes, PyTables uses a standardized file format called HDF5. Many scientific software packages (like MATLAB) have built-in support for HDF5, and the C API isn't terrible. So any data you need to export from or import to one of these languages can simply be kept in HDF5 files.
PyTables does add some attributes of its own, but these shouldn't hurt you. Of course, if you store Python objects in the file, you won't be able to read them elsewhere.
The one nice thing about CSV files is that they're human readable. However, if you need to store anything other than simple numbers in them and communicate with others, you'll have issues. I receive CSV files from people in other organizations, and I've noticed that humans aren't good at making sure things like string quoting are done correctly. It's good that Python's CSV parser is as flexible as it is. One other issue is that floating point numbers can't be stored exactly in text using decimal format. It's usually good enough, though.
One big plus for PyTables is the storage of metadata, like variables, etc.
If you run the simulations more often with different parameters, you can store each result as an array entry in the h5 file.
We use it to store measurement data together with the experiment scripts that produced it, so it is all self-contained.
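For instance, parameters can ride along as attributes on the result array; a sketch (file, node and attribute names are made up):
import numpy as np
import tables as tb

with tb.open_file('results.h5', 'a') as h5:
    run = h5.create_array('/', 'run_001', np.zeros((2000, 200)))
    run.attrs.temperature = 300.0  # simulation parameters stored as metadata
    run.attrs.n_steps = 2000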
BTW: if you need to look quickly into an HDF5 file you can use HDFView, a free Java app from the HDF Group. It's easy to install.
I think it's very hard to compare PyTables and CSV. PyTables is a data structure, while CSV is an exchange format for data.
This is actually quite related to another answer I've provided regarding reading / writing csv files w/ numpy:
Python: how to do basic data manipulation like in R?
You should definitely use numpy, no matter what else! The ease of indexing, etc. far outweighs the cost of the additional dependency (well, I think so). PyTables, of course, relies on numpy too.
Otherwise, it really depends on your application, your hardware and your audience. I suspect that reading in csv files of the size you're talking about won't matter in terms of speed compared to PyTables. But if that's a concern, write a benchmark! Read and write some random data 100 times. Or, if read times matter more, write once, read 100 times, etc.
I strongly suspect that PyTables will outperform SQL. SQL will rock on complex multi-table queries (especially if you do the same ones frequently), but even on single-table (so-called "denormalized") queries, PyTables is hard to beat in terms of speed. I can't find a reference for this off-hand, but you may be able to dig something up if you mine the links here:
http://www.pytables.org/moin/HowToUse#HintsforSQLusers
I'm guessing that execution performance for you at this stage will pale in comparison to coder performance. So, above all, pick something that makes the most sense to you!
Other points:
As with SQL, PyTables has an undo feature. CSV files won't have this, but you can keep them in version control, and your VCS doesn't need to be too smart (CSV files are text).
On a related note, CSV files will be much bigger than binary formats (you can certainly write your own tests for this too).
These are not "exclusive" choices.
You need both.
CSV is just a data exchange format. If you use PyTables, you will still need to import and export in CSV format.

Huge Graph Structure

I'm developing an application in which I need a structure to represent a huge graph (between 1,000,000 and 6,000,000 nodes and 100 to 600 edges per node) in memory. The edge representation will contain some attributes of the relation.
I have tried a memory-mapped representation, arrays, dictionaries and strings to represent that structure in memory, but these always crash because of the memory limit.
I would like to get an advice of how I can represent this, or something similar.
By the way, I'm using python.
If that is 100-600 edges/node, then you are talking about 3.6 billion edges.
Why does this have to be all in memory?
Can you show us the structures you are currently using?
How much memory are we allowed (what is the memory limit you are hitting)?
If the only reason you need this in memory is because you need to be able to read and write it fast, then use a database. Databases read and write extremely fast; often they can read without going to disk at all.
Depending on your hardware resources, keeping a graph of this size all in memory is probably out of the question. Two possible options, from a graph-specific DB point of view, are:
Neo4j - claims to easily handle billions of nodes, and it's been in development a long time.
FlockDB - newly released by Twitter, this is a distributed graph database.
Since you're using Python, have you looked at NetworkX? Out of interest, how far did you get loading a graph of this size, if you tried it?
I doubt you'll be able to use a memory structure unless you have a LOT of memory at your disposal:
Assume you are talking about 600 directed edges from each node, with a node being 4 bytes (an integer key) and a directed edge being JUST the destination node keys (4 bytes each).
Then the raw data for each node is 4 + 600 * 4 = 2404 bytes, times 6,000,000 nodes = over 14.4 GB.
That's without any other overheads or any additional data in the nodes (or edges).
You appear to have very few edges considering the number of nodes, suggesting that most of the nodes aren't strictly necessary. So, instead of actually storing all of the nodes, why not use a sparse structure and only insert them when they're in use? This should be pretty easy to do with a dictionary; just don't insert a node until you use it for an edge (see the sketch below).
The edges can be stored using an adjacency list on the nodes.
Of course, this only applies if you really mean 100-600 edges in total. If you mean per node, that's a completely different story.
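A minimal sketch of that lazy, dictionary-based adjacency structure (names made up):
graph = {}  # node id -> {neighbour id -> edge attributes}

def add_edge(src, dst, **attrs):
    # a node comes into existence only when an edge first touches it
    graph.setdefault(src, {})[dst] = attrs

add_edge(12, 99, weight=0.3)
add_edge(12, 7, weight=1.1)
print(graph[12])  # {99: {'weight': 0.3}, 7: {'weight': 1.1}}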
The scipy.sparse.csgraph package may be able to handle this -- 5 million nodes * 100 edges on average is 500 million pairs, which at 8 bytes per pair (two integer IDs) is just about 4 GB. I think csgraph uses compressed sparse storage, so it will use less memory than that; this could work on your laptop.
csgraph doesn't have as many features as networkx, but it uses far less memory.
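A sketch of what that looks like with SciPy's sparse matrices (sizes shrunk so it runs quickly; note that a CSR matrix holds a single float per edge, so richer edge attributes would need a parallel structure):
import numpy as np
from scipy.sparse import csr_matrix

n_nodes = 1_000_000
n_edges = 5_000_000
rng = np.random.default_rng(0)
rows = rng.integers(0, n_nodes, n_edges)  # edge sources
cols = rng.integers(0, n_nodes, n_edges)  # edge destinations
weights = rng.random(n_edges)             # one float attribute per edge
adj = csr_matrix((weights, (rows, cols)), shape=(n_nodes, n_nodes))
# scipy.sparse.csgraph routines (e.g. connected_components) accept `adj` directly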
Assuming you mean 600 edges per node, you could try something like this:
import os.path
import pickle  # cPickle on Python 2

class LazyGraph:
    def __init__(self, folder):
        self.folder = folder

    def get_node(self, id):
        # each node lives in its own pickle file and is loaded on demand
        with open(os.path.join(self.folder, str(id)), 'rb') as f:
            return pickle.load(f)

    def set_node(self, id, node):
        with open(os.path.join(self.folder, str(id)), 'wb') as f:
            pickle.dump(node, f, -1)  # use highest protocol
Use arrays (or numpy arrays) to hold the actual node ids, as they are faster.
Note, this will be very, very slow.
You could use threading to pre-fetch nodes (assuming you knew which order you were processing them in), but it won't be fun.
Sounds like you need a database and an iterator over the results. Then you wouldn't have to keep it all in memory at the same time, but you could always have access to it.
If you do decide to use some kind of database after all, I suggest looking at neo4j and its python bindings. It's a graph database capable of handling large graphs. Here's a presentation from this year's PyCon.
