Where is the h5py performance bottleneck? - python

I have an HDF5 file with 100 "events". Each event contains a variable number of groups called "traces", roughly 180, and each trace contains 6 datasets which are arrays of 32-bit floats, each ~1000 cells long (this varies slightly from event to event, but is constant within an event). The file was generated with default h5py settings (so no chunking or compression unless h5py applies it on its own).
The readout is not fast: it is ~6 times slower than reading the same data from CERN ROOT TTrees. I know that HDF5 is far from the fastest format on the market, but I would be grateful if you could tell me where the speed is lost.
To read the arrays in traces I do:
# data is an open h5py.File object
d0keys = data["Run_0"].keys()
for key_1 in d0keys:
    if "Event_" in key_1:
        d1 = data["Run_0"][key_1]
        d1keys = d1.keys()
        for key_2 in d1keys:
            if "Traces_" in key_2:
                d2 = d1[key_2]
                v1, v2, v3, v4, v5, v6 = d2['SimSignal_X'][0], d2['SimSignal_Y'][0], d2['SimSignal_Z'][0], d2['SimEfield_X'][0], d2['SimEfield_Y'][0], d2['SimEfield_Z'][0]
Line profiler shows that ~97% of the time is spent in the last line. Now, there are two issues:
There seems to be no difference between reading cell [0] and all ~1000 cells with [:]. I understand that h5py should be able to read just a chunk of data from the disk, so why is there no difference?
Reading 100 events from an HDD (Linux, ext4) takes ~30 s with h5py and ~5 s with ROOT. The size of 100 events is roughly 430 MB. This gives a readout speed of ~14 MB/s for HDF5, while ROOT gets ~86 MB/s. Both are slow, but ROOT comes much closer to the raw readout speed I would expect from a ~4-year-old laptop HDD.
So where does h5py lose its speed? I guess the pure readout should run at roughly HDD speed. Thus, is the bottleneck:
Dereferencing the HDF5 addresses of the datasets (something ROOT does not need to do)?
Allocating memory in Python?
Something else?
I would be grateful for some clues.

There are a lot of HDF5 I/O issues to consider. I will try to cover each.
From my tests, time spent doing I/O is primarily a function of the number of reads/writes and not of how much data (in MB) you read/write. Read this SO post for more details:
pytables writes much faster than h5py. Why? Note: it shows I/O performance for a fixed amount of data with different I/O write sizes for both h5py and PyTables. Based on this, it makes sense that most of the time is spent in the last line -- that's where you are reading the data from disk into memory as NumPy arrays (v1, v2, v3, v4, v5, v6).
Regarding your questions:
There is a reason there is no difference between reading d2['SimSignal_X'][0] and d2['SimSignal_X'][:]: both read the entire dataset into memory (all ~1000 dataset values). If you only want to read a slice of the data, you need to use slice notation. For example, d2['SimSignal_X'][0:100] only reads the first 100 values (assuming d2['SimSignal_X'] has a single axis -- shape=(1000,)). Note: reading a slice will reduce the required memory, but won't improve I/O read time. (In fact, reading slices will probably increase read time.)
I am not familiar with CERN ROOT, so can't comment about performance of h5py vs ROOT. If you want to use Python, here are several things to consider:
You are reading the HDF5 data into memory (as NumPy arrays). You don't have to do that. Instead of creating arrays, you can create h5py dataset objects. This reduces the I/O initially. Then you use the objects "as if" they were np.arrays. The only change is the last line of your code, like this: v1, v2, v3, v4, v5, v6 = d2['SimSignal_X'], d2['SimSignal_Y'], d2['SimSignal_Z'], d2['SimEfield_X'], d2['SimEfield_Y'], d2['SimEfield_Z']. Note how no slice notation is used ([0] or [:]).
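For illustration, a minimal sketch of the difference (d2 is a "Traces_*" group as in the question):
v1 = d2['SimSignal_X']      # h5py Dataset object -- no disk read happens here
v2 = d2['SimEfield_X']      # ditto
first_100 = v1[0:100]       # I/O happens here, only for this slice
full_array = v2[:]          # I/O happens here, for the whole dataset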
I have found PyTables to be faster than h5py. Your code can be easily converted to read with PyTables.
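If you want to try that, here is a rough sketch of the equivalent walk with PyTables (the file name is an assumption; the node layout follows the names in the question):
import tables as tb

with tb.open_file("myfile.h5", "r") as h5f:          # file name is an assumption
    for d1 in h5f.root.Run_0._f_iter_nodes():        # Event_* groups
        if "Event_" in d1._v_name:
            for d2 in d1._f_iter_nodes():            # Traces_* groups
                if "Traces_" in d2._v_name:
                    v1 = d2.SimSignal_X.read()       # reads the whole array
                    v4 = d2.SimEfield_X.read()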
HDF5 can use "chunking" to improve I/O performance. You did not say whether you are using it. It might help (hard to say, since your datasets aren't very large). Chunking has to be defined when the HDF5 file is created, so it might not be useful in your case.
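If you do rewrite the file, chunking is requested at dataset creation time; a minimal h5py sketch (the file name and chunk shape are made up):
import numpy as np
import h5py

with h5py.File("chunked_example.h5", "w") as f:       # hypothetical file
    sig = np.random.rand(1000).astype("float32")
    # one chunk of 250 values; chunks=True lets h5py pick a chunk shape instead
    f.create_dataset("SimSignal_X", data=sig, chunks=(250,))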
Also, you did not say whether a compression filter was used when the data was written. Compression reduces the on-disk file size, but has the side effect of reducing I/O performance (increasing read times). If you don't know, check whether compression was enabled at file creation.
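You can check both chunking and compression on the existing file from h5py, for example (the group names in the path are hypothetical):
import h5py

with h5py.File("myfile.h5", "r") as f:                # file name is an assumption
    ds = f["Run_0/Event_0/Traces_0/SimSignal_X"]      # hypothetical path
    print(ds.chunks)        # None means contiguous storage (no chunking)
    print(ds.compression)   # None means no compression filter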

Related

Working with multiple Large Files in Python

I have around 60 files, each containing around 900,000 lines, and each line is 17 tab-separated float numbers. For each line I need to do some calculation using the corresponding lines from all 60 files, but because of their huge sizes (each file is 400 MB) and limited computation resources, it takes a very long time. Is there any way to do this fast?
It depends on how you process them. If you have enough memory you can read all the files first and convert them to Python data structures. Then you can do the calculations.
If your files don't fit into memory, probably the easiest way is to use some distributed computing mechanism (Hadoop or other, lighter alternatives).
Another, smaller improvement would be to use the fadvise Linux system call to tell the operating system how you will be using the file (sequential reading or random access), so that it can optimize file access.
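On Linux this is available from Python (3.3+) as os.posix_fadvise; a small sketch for sequential reading (the file name is made up):
import os

f = open("data_00.txt", "rb")                         # hypothetical file
# tell the kernel the file will be read sequentially, so it can read ahead
os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
for line in f:
    pass    # process the line here
f.close()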
If the calculations fit into some common library like numpy or numexpr, which have a lot of optimizations, you can use them (this helps if your computations currently rely on non-optimized algorithms).
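For example, with numexpr an element-wise expression over large arrays is evaluated in one multi-threaded pass without large temporaries (the arrays here are made up):
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
c = np.random.rand(1_000_000)
result = ne.evaluate("a * b + c")   # single pass over the data, uses all cores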
If "corresponding lines" means "first lines of all files, then second lines of all files, etc.", you can use itertools.izip (on Python 3, the built-in zip is lazy and does the same; see the sketch after this transcript):
# cat f1.txt
1.1
1.2
1.3
# cat f2.txt
2.1
2.2
2.3
# python
>>> from itertools import izip
>>> files = map(open, ("f1.txt", "f2.txt"))
>>> lines_iterator = izip(*files)
>>> for lines in lines_iterator:
... print lines
...
('1.1\n', '2.1\n')
('1.2\n', '2.2\n')
('1.3\n', '2.3\n')
>>>
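On Python 3, itertools.izip no longer exists, but the built-in zip is already lazy and works the same way; a minimal equivalent:
files = [open(name) for name in ("f1.txt", "f2.txt")]
for lines in zip(*files):     # lazy on Python 3, like izip on Python 2
    print(lines)
for f in files:
    f.close()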
A few options:
1. Just use the memory
You have 17 x 900,000 = 15.3 M floats per file. Storing these as doubles (as numpy usually does) takes roughly 120 MB of memory per file. You can halve this by storing the floats as float32, so that each file takes roughly 60 MB. With 60 files at 60 MB each, you have 3.6 GB of data.
This amount is not unreasonable if you use 64-bit Python. If you have less than, say, 6 GB of RAM in your machine, it will cause a lot of virtual memory swapping. Whether or not that is a problem depends on the way you access the data.
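A sketch of this option with numpy, reading everything as float32 (the file names are made up):
import numpy as np

file_names = ["data_%02d.txt" % i for i in range(60)]   # hypothetical names
# each array has shape (900000, 17) and dtype float32 (~60 MB per file)
arrays = [np.loadtxt(name, dtype=np.float32) for name in file_names]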
2. Do it row-by-row
If you can do it row by row, just read each file one row at a time. Having 60 files open at once will not cause any problems. This is probably the most efficient method if you process the files sequentially: the memory usage is next to nothing, and the operating system takes care of reading the files.
The operating system and the underlying file system try very hard to be efficient in sequential disk reads and writes.
3. Preprocess your files and use mmap
You may also preprocess your files so that they are not CSV but a binary format. That way each row takes exactly 17x8 = 136 or 17x4 = 68 bytes in the file. Then you can use numpy.memmap to map the files into arrays of shape [N, 17]. You can handle the arrays as usual numpy arrays, and numpy plus the operating system will take care of optimal memory management.
The preprocessing is required because the record length (number of characters on a row) in a text file is not fixed.
This is probably the best solution if your data access is not sequential: mmap is then the fastest method, as it only reads the required blocks from the disk when they are needed. It also caches the data, so it uses the optimal amount of memory.
Behind the scenes this is a close relative of solution #1, with the exception that nothing is loaded into memory until it is required. The same limitation about 32-bit Python applies: it cannot do this, because it runs out of memory addresses.
The file conversion into binary is relatively fast and easy, almost a one-liner.
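A sketch of that conversion and of mapping the result with numpy.memmap (the file names and the float32 choice are assumptions):
import numpy as np

# one-off preprocessing: text -> raw binary, float32, 17 values per row
data = np.loadtxt("data_00.txt", dtype=np.float32)      # shape (N, 17)
data.tofile("data_00.bin")

# later: map the binary file without loading it all into memory
n_rows = data.shape[0]
mapped = np.memmap("data_00.bin", dtype=np.float32, mode="r",
                   shape=(n_rows, 17))
row = mapped[12345]    # only the needed blocks are read from disk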

How to avoid high memory usage in pytables?

I am reading a chunk of data from a pytables.Table (version 3.1.1) using the read_where method on a big HDF5 file. The resulting numpy array is about 420 MB, however the memory consumption of my Python process went up by 1.6 GB during the read_where call, and the memory is not released after the call finishes. Even deleting the array, closing the file and deleting the HDF5 file handle does not free the memory.
How can I free this memory again?
The huge memory consumption is due to the fact that Python builds a lot of machinery around the data to facilitate its manipulation.
You can find good explanations of why the memory use is maintained here and there (found on this question). A good workaround is to open and manipulate your table in a subprocess with the multiprocessing module.
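A minimal sketch of that workaround (the file name, table path and query condition are made up); the memory used by the read is given back to the OS when the worker process exits:
import multiprocessing as mp
import tables as tb

def read_chunk(queue):
    # all HDF5/numpy allocations happen inside this worker process
    with tb.open_file("big_file.h5", "r") as h5f:        # hypothetical file
        table = h5f.root.mytable                         # hypothetical table
        result = table.read_where("col_a > 0.5")         # hypothetical condition
    queue.put(result)

if __name__ == "__main__":
    queue = mp.Queue()
    worker = mp.Process(target=read_chunk, args=(queue,))
    worker.start()
    data = queue.get()    # the array is pickled back to the parent
    worker.join()         # the worker exits and its memory is released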
We would need more context on the details of your Table object, such as how large it is and its chunk size. How HDF5 handles chunking is probably one of the main culprits for hogging memory in this case.
My advice is to have a thorough read of this: http://pytables.github.io/usersguide/optimization.html#understanding-chunking and to experiment with different chunk sizes (typically making them larger).

PyTables and HDF5: Massive overhead for tree data

I have a tree data structure that I want to save to disk. Thus, HDF5 with its internal tree structure seemed to be the perfect candidate. However, so far the data overhead is massive, by a factor of 100!
A test tree contains roughly 100 nodes, where leaves usually contain no more than 2 or 3 data items (like doubles). If I take the entire tree and just pickle it, it is about 21 kB. Yet, if I use PyTables and map the tree structure one to one onto the HDF5 file, the file takes 2.4 MB (!) of disk space. Is the overhead really that big?
The problem is that the overhead does not seem to be constant, but scales linearly with the size of my tree data (both when adding nodes and when adding data per leaf, i.e. enlarging the rows of the leaf tables).
Did I miss something regarding PyTables, like enabling compression (I thought PyTables does it by default)? What could possibly be the reason for this massive overhead?
Thanks a lot!
OK, so I have found a way to massively reduce the file size. The point is that, despite my prior beliefs, PyTables does NOT apply compression by default.
You can enable it by using Filters.
Here is an example how that works:
import tables as pt

# open (or create) the file in append mode
hdf5_file = pt.open_file(filename='myhdf5file.h5',
                         mode='a',
                         title='How to compress data')
# (for PyTables < 3 the methods are called `openFile`,
#  `createTable`, etc.)
myfilters = pt.Filters(complevel=9, complib='zlib')
mydescription = {'mycolumn': pt.IntCol()}  # simple 1-column table
mytable = hdf5_file.create_table(where='/', name='mytable',
                                 description=mydescription,
                                 title='My Table',
                                 filters=myfilters)
# Now you can happily fill the table...
The important line here is pt.Filters(complevel=9, complib='zlib'). It specifies the compression level complevel and the compression algorithm complib. By default the level is set to 0, which means compression is disabled, whereas 9 is the highest compression level. For details on how compression works, see the compression section of the PyTables optimization guide (linked above).
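To confirm the filters were actually applied, you can inspect the table after reopening the file, for example:
import tables as pt

with pt.open_file('myhdf5file.h5', mode='r') as h5f:
    print(h5f.root.mytable.filters)   # shows the complevel and complib in use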
Next time I'd better stick to RTFM :-) (although I did read it, I just missed the line "One of the beauties of PyTables is that it supports compression on tables and arrays, although it is not used by default").

How to read a big (3-4GB) file that doesn't have newlines into a numpy array?

I have a 3.3 GB file containing one long line. The values in the file are comma-separated and are either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used HDF5 files, via h5py, of sizes well over 200 GB with very little processing time, on the order of a minute or two for file accesses. In addition, the HDF5 libraries support MPI and concurrent access.
This means that, assuming you can format your original one-line file as an appropriately hierarchical HDF5 file (e.g. make a group for every "large" segment of data), you can use the built-in capabilities of HDF5 to process your data on multiple cores, using MPI to pass whatever data you need between the cores.
You need to be careful with your code and understand how MPI works in conjunction with HDF5, but it will speed things up no end.
Of course all of this depends on putting the data into an HDF5 file in a way that allows you to take advantage of MPI... so maybe it's not the most practical suggestion.
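For what it's worth, a minimal sketch of concurrent access with h5py and mpi4py; this assumes h5py was built against a parallel (MPI-enabled) HDF5 library, and the file layout here is made up:
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
# each MPI rank opens the same file and reads a different (hypothetical) group
with h5py.File("parallel_data.h5", "r", driver="mpio", comm=comm) as f:
    chunk = f["segment_%d" % comm.rank][:]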
Consider dumping the data in some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
That way it will be much faster, because you don't need to parse the values.
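For instance, once the values have been parsed one time, they can be dumped and reloaded with no text parsing at all (the .npy file name is made up; distance_matrix.tmp is the file from the question):
import numpy as np

# one-off: parse the text once (slow), then dump to .npy (fast to reload)
distance_matrix = np.fromfile("distance_matrix.tmp", sep=",")
np.save("distance_matrix.npy", distance_matrix)

# later runs: loads at close to disk speed, no parsing
distance_matrix = np.load("distance_matrix.npy")
# or, to avoid pulling everything into RAM at once:
distance_matrix = np.load("distance_matrix.npy", mmap_mode="r")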
If you can't change the file format (i.e. it's not produced by one of your own programs), then there's not much you can do about it. Make sure your machine has plenty of RAM (at least 8 GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary that does the parsing and then dumps the result in a binary format. I don't have any links to examples of this.

Efficiently Reading Large Files with ATpy and numpy?

I've looked all over for an answer to this one, but nothing really seems to fit the bill. I've got very large files that I'm trying to read with ATpy, and the data comes in the form of numpy arrays. For smaller files the following code has been sufficient:
sat = atpy.Table('satellite_data.tbl')
From there I build up a number of variables that I have to manipulate later for plotting purposes. It's lots of these kinds of operations:
w1 = np.array([sat['w1_column']])
w2 = np.array([sat['w2_column']])
w3 = np.array([sat['w3_column']])
colorw1w2 = w1 - w2 #just subtracting w2 values from w1 values for each element
colorw1w3 = w1 - w3
etc.
But for very large files the computer can't handle it. I think all the data is getting stored in memory before parsing begins, and that's not feasible for 2GB files. So, what can I use instead to handle these large files?
I've seen lots of posts where people are breaking up the data into chunks and using for loops to iterate over each line, but I don't think that's going to work for me here given the nature of these files, and the kinds of operations I need to do on these arrays. I can't just do a single operation on every line of the file, because each line contains a number of parameters that are assigned to columns, and in some cases I need to do multiple operations with figures from a single column.
Honestly I don't really understand everything going on behind the scenes with ATpy and numpy. I'm new to Python, so I appreciate answers that spell it out clearly (i.e. not relying on lots of implicit coding knowledge). There has to be a clean way of parsing this, but I'm not finding it. Thanks.
For very large arrays (larger than your memory capacity) you can use PyTables, which stores arrays on disk in some clever ways (using the HDF5 format) so that manipulations can be done on them without loading the entire array into memory at once; see the sketch below the quote. Then you won't have to manually break up your datasets or manipulate them one line at a time.
I know nothing about ATpy, so you might be better off asking on an ATpy mailing list or at least some astronomy Python users mailing list, as it's possible that ATpy has another solution built in.
From the PyTables website:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package.
... fast, yet extremely easy to use tool for interactively browse, process and search very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space...
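A rough sketch of how the column arithmetic above could be done with PyTables so the full columns never have to sit in memory (file, array names and sizes are made up; for data that does not fit in memory you would build the columns with create_earray and append in chunks):
import numpy as np
import tables as tb

with tb.open_file("satellite_data.h5", "w") as h5f:      # hypothetical file
    w1 = h5f.create_array("/", "w1", np.random.rand(1_000_000))
    w2 = h5f.create_array("/", "w2", np.random.rand(1_000_000))
    out = h5f.create_carray("/", "colorw1w2",
                            atom=tb.Float64Atom(), shape=(1_000_000,))
    # the expression is evaluated blockwise on the disk-backed arrays
    expr = tb.Expr("w1 - w2")
    expr.set_output(out)
    expr.eval()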
Look into using pandas. It's built for this kind of work. But the data files need to be stored in a well structured binary format like hdf5 to get good performance with any solution.
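A sketch of that workflow with pandas (the HDF5 store key and the reader for the .tbl file are assumptions): convert once, then read back only the columns you need:
import pandas as pd

# one-off conversion to HDF5 (reader and column names are assumptions)
# df = pd.read_csv("satellite_data.tbl", sep=r"\s+")
# df.to_hdf("satellite_data.h5", key="sat", format="table")

# later: load only the needed columns from the HDF5 store
sat = pd.read_hdf("satellite_data.h5", key="sat",
                  columns=["w1_column", "w2_column", "w3_column"])
colorw1w2 = sat["w1_column"] - sat["w2_column"]
colorw1w3 = sat["w1_column"] - sat["w3_column"]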
