How is HDF5 different from a folder with files? - python

I'm working on an open source project dealing with adding metadata to folders. The provided (Python) API lets you browse and access metadata as if it were just another folder. Because it is just another folder.
\folder\.meta\folder\somedata.json
Then I came across HDF5 and its derivative, Alembic.
Reading up on HDF5 in the book Python and HDF5, I was looking for the benefits of using it compared to files in folders, but most of what I came across spoke about the benefits of a hierarchical file format in terms of the simplicity of adding data via its API:
>>> import h5py
>>> f = h5py.File("weather.hdf5")
>>> f["/15/temperature"] = 21
Or its ability to read only certain parts of a file on request (i.e. random access), and parallel I/O on a single HDF5 file (e.g. for multiprocessing).
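For example, partial reads are just slicing on a dataset handle; a rough sketch in the style of the book's examples (the dataset name here is made up):
>>> import h5py
>>> f = h5py.File("weather.hdf5", "r")
>>> dset = f["/15/samples"]   # just a handle, no data read yet
>>> chunk = dset[0:100]       # only these 100 values are read from disk
>>> f.close()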
You could even mount HDF5 files: https://github.com/zjttoefs/hdfuse5
It even boasts a strong yet simple foundational concept of Groups and Datasets, which the wiki describes as:
Datasets, which are multidimensional arrays of a homogeneous type
Groups, which are container structures which can hold datasets and other groups
Replace Dataset with File and Group with Folder and the whole feature-set sounds to me like what files in folders are already fully capable of doing.
For every benefit I came across, not one stood out as being exclusive to HDF5.
So my question is, if I were to give you one HDF5 file and one folder with files, both with identical content, in which scenario would HDF5 be better suited?
Edit:
I have gotten some responses about the portability of HDF5.
It sounds lovely and all, but I still haven't been given an example, a scenario, in which HDF5 would outdo a folder with files. Why would someone consider using HDF5 when a folder is readable on any computer, on any file system, over a network, supports "parallel I/O", and is readable by humans without an HDF5 interpreter?
I would go as far as to say that a folder with files is far more portable than any HDF5 file.
Edit 2:
Thucydides411 just gave an example of a scenario where portability matters.
https://stackoverflow.com/a/28512028/478949
I think what I'm taking away from the answers in this thread is that HDF5 is well suited for when you need the organisational structure of files and folders, as in the example scenario above, with lots of (millions of) small (~1 byte) data structures, like individual numbers or strings. It makes up for what file systems lack by providing a "sub file-system" that favours the small and many over the few and large.
In computer graphics, we use it to store geometric models and arbitrary data about individual vertices, which seems to align quite well with its use in the scientific community.

As someone who developed a scientific project that went from using folders of files to HDF5, I think I can shed some light on the advantages of HDF5.
When I began my project, I was operating on small test datasets, and producing small amounts of output, in the range of kilobytes. I began with the easiest data format: tables encoded as ASCII. For each object I processed, I produced one ASCII table.
I began applying my code to groups of objects, which meant writing multiple ASCII tables at the end of each run, along with an additional ASCII table containing output related to the entire group. For each group, I now had a folder that looked like:
+ group
| |-- object 1
| |-- object 2
| |-- ...
| |-- object N
| |-- summary
At this point, I began running into my first difficulties. ASCII files are very slow to read and write, and they don't pack numeric information very efficiently, because each digit takes a full byte to encode, rather than ~3.3 bits. So I switched over to writing each object as a custom binary file, which sped up I/O and decreased file size.
As I scaled up to processing large numbers (tens of thousands to millions) of groups, I suddenly found myself dealing with an extremely large number of files and folders. Having too many small files can be a problem for many filesystems (many filesystems are limited in the number of files they can store, regardless of how much disk space there is). I also began to find that when I would try to do post-processing on my entire dataset, the disk I/O to read many small files was starting to take up an appreciable amount of time. I tried to solve these problems by consolidating my files, so that I only produced two files for each group:
+ group 1
| |-- objects
| |-- summary
+ group 2
| |-- objects
| |-- summary
...
I also wanted to compress my data, so I began creating .tar.gz files for collections of groups.
At this point, my whole data scheme was getting very cumbersome, and there was a risk that if I wanted to hand my data to someone else, it would take a lot of effort to explain to them how to use it. The binary files that contained the objects, for example, had their own internal structure that existed only in a README file in a repository and on a pad of paper in my office. Whoever wanted to read one of my combined object binary files would have to know the byte offset, type and endianness of each metadata entry in the header, and the byte offset of every object in the file. If they didn't, the file would be gibberish to them.
The way I was grouping and compressing data also posed problems. Let's say I wanted to find one object. I would have to locate the .tar.gz file it was in, unzip the entire contents of the archive to a temporary folder, navigate to the group I was interested in, and retrieve the object with my own custom API to read my binary files. After I was done, I would delete the temporarily unzipped files. It was not an elegant solution.
At this point, I decided to switch to a standard format. HDF5 was attractive for a number of reasons. Firstly, I could keep the overall organization of my data into groups, object datasets and summary datasets. Secondly, I could ditch my custom binary file I/O API, and just use a multidimensional array dataset to store all the objects in a group. I could even create arrays of more complicated datatypes, like arrays of C structs, without having to meticulously document the byte offsets of every entry. Next, HDF5 has chunked compression which can be completely transparent to the end user of the data. Because the compression is chunked, if I think users are going to want to look at individual objects, I can have each object compressed in a separate chunk, so that only the part of the dataset the user is interested in needs to be decompressed. Chunked compression is an extremely powerful feature.
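To give a flavour of what that looks like, here is a minimal h5py sketch (not my actual code; the names and shapes are made up). Each object lives in its own chunk, so reading one object only decompresses that chunk:
import h5py
import numpy as np

n_objects, n_values = 1000, 4096
objects = np.random.random((n_objects, n_values))

with h5py.File("group_0001.hdf5", "w") as f:
    # One chunk per object: reading objects[i] decompresses only chunk i.
    dset = f.create_dataset("objects", data=objects,
                            chunks=(1, n_values), compression="gzip")
    dset.attrs["description"] = "one row per object"  # self-describing metadata

with h5py.File("group_0001.hdf5", "r") as f:
    single_object = f["objects"][42]  # only this chunk is read and decompressed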
Finally, I can just give a single file to someone now, without having to explain much about how it's internally organized. The end user can read the file in Python, C, Fortran, or h5ls on the commandline or the GUI HDFView, and see what's inside. That wasn't possible with my custom binary format, not to mention my .tar.gz collections.
Sure, it's possible to replicate everything you can do with HDF5 with folders, ASCII and custom binary files. That's what I originally did, but it became a major headache, and in the end, HDF5 did everything I was kluging together in an efficient and portable way.

Thanks for asking this interesting question. Is a folder with files portable because I can copy a directory onto a stick on a Mac and then see the same directory and files on a PC? I agree that the file directory structure is portable, thanks to the people who write operating systems, but this is unrelated to the data in the files being portable. Now, if the files in this directory are PDFs, they are portable because there are tools that read and make sense of PDFs on multiple operating systems (thanks to Adobe). But if those files are raw scientific data (ASCII or binary, it doesn't matter), they are not at all portable. The ASCII file would look like a bunch of characters, and the binary file would look like gibberish. If they were XML or JSON files, they would be readable, because JSON is ASCII, but the information they contain would likely not be portable, because the meaning of the XML/JSON tags may not be clear to someone who did not write the file. This is an important point: the characters in an ASCII file are portable, but the information they represent is not.
HDF5 data are portable, just like the pdf, because there are tools in many operating systems that can read the data in HDF5 files (just like pdf readers, see http://www.hdfgroup.org/products/hdf5_tools/index.html). There are also libraries in many languages that can be used to read the data and present it in a way that makes sense to users – which is what Adobe reader does. There are hundreds of groups in the HDF5 community that do the same thing for their users (see http://www.hdfgroup.org/HDF5/users5.html).
There has been some discussion here of compression as well. The important thing about compressing in HDF5 files is that objects are compressed independently and only the objects that you need get decompressed on output. This is clearly more efficient than compressing the entire file and having to decompress the entire file to read it.
The other critical piece is that HDF5 files are self-describing – so people who write the files can add information that helps users and tools know what is in the file: what the variables are, what their types are, what software wrote them, what instruments collected them, etc. It sounds like the tool you are working on can read metadata for files. Attributes in an HDF5 file can be attached to any object in the file – they are not just file-level information. This is huge. And, of course, those attributes can be read using tools written in many languages on many operating systems.
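For instance, with h5py, attaching attributes to the file, a group, or a dataset is one line each (a hedged sketch; the file, group, and attribute names are invented):
import h5py

with h5py.File("experiment.hdf5", "w") as f:
    grp = f.create_group("run_001")
    dset = grp.create_dataset("voltage", data=[1.2, 1.3, 1.1])
    # Attributes can hang off the file, the group, or the dataset itself.
    f.attrs["written_by"] = "acquisition_script v0.3"
    grp.attrs["instrument"] = "Keithley 2400"
    dset.attrs["units"] = "V"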

I'm currently evaluating HDF5 so had the same question.
This article – Moving Away from HDF5 – asks pretty much the same question. The article raises some good points about the fact that there is only a single implementation of the HDF5 library which is developed in relatively opaque circumstances by modern open-source standards.
As you can tell from the title, the authors decided to move away from HDF5 to a filesystem hierarchy of binary files containing arrays, with metadata in JSON files. This was in spite of having made a significant investment in HDF5, after having had their fingers burnt by data corruption and performance issues.

I think the main advantage is portability.
HDF5 stores information about your datasets like the size, type and endianness of integers and floating point numbers, which means you can move an hdf5 file around and read its content even if it was created on a machine with a different architecture.
You can also attach arbitrary metadata to groups and datasets. Arguably, you can do that with files and folders too, if your filesystem supports extended attributes.
An hdf5 file is a single file which can sometimes be more convenient than having to zip/tar folders and files. There is also a major drawback to this: if you delete a dataset, you can't reclaim the space without creating a new file.
Generally, HDF5 is well suited for storing large arrays of numbers, typically scientific datasets.

To me, we can compare a folder with files to HDF5 only in the relevant context of scientific data, where the most important data are arrays described by a set of metadata.
In the general context, Marcus is right when he claims that a folder with files is far more portable than any HDF5 file. I will add that, in a general context, a folder with files is far more accessible than an HDF5 file. The obvious challenge is that with a "normal" folder and files, there is no need for an extra API to access the data. That is simply impossible with HDF5, which keeps data and metadata in the same file.
Imagine for a moment: to read your PDF file, you need a new PDF reader that understands HDF5? To play your music, you need a music player that can decode HDF5? To run your Python script, the Python interpreter needs to first decode the HDF5? Or, to launch your Python interpreter, your operating system needs to decode the HDF5? And so on. I would simply not have been able to write this answer, because my OS would not have been able to launch my web browser, which would not have been able to read its internal files, because I had previously turned everything into HDF5 (maybe one large HDF5 file for everything on my hard drive).
Storing metadata in separate files has the huge advantage of working well with the huge number of data files and programs that already exist, without any extra headache.
I hope this helps.

A game where you need to load a lot of resources into memory would be a scenario in which an HDF5 file may be better than a folder with files. Loading data from files has costs such as seek time, the time required to open each file, and the time to read data from the file into memory. These operations can be even slower when reading data from a DVD or Blu-ray. Opening a single file can drastically reduce those costs.

Yes, the main advantage is that HDF5 is portable. HDF5 files can be accessed by a host of other programming/interpreting languages, such as Python (which your API is built on), MATLAB, Fortran and C. As Simon suggested, HDF5 is used extensively in the scientific community to store large datasets. In my experience, I find the ability to retrieve only certain datasets (and regions) useful. In addition, building the HDF5 library for parallel I/O is very advantageous for post-processing raw data at a later time.
Since the file is also self-describing, it is capable of storing not just raw data, but also a description of that data, such as the array size, array name, units and a host of additional metadata.
Hope this helps.

HDF5 is, ultimately, a format to store numbers, optimised for large datasets. The main strengths are the support for compression (which can make reading and writing data faster in many circumstances) and fast in-kernel queries (retrieval of data fulfilling certain conditions, for example, all the values of pressure when the temperature was over 30 °C).
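For concreteness, an in-kernel query in PyTables (which sits on top of HDF5) looks roughly like this; a sketch that assumes a table of readings already exists, with invented names:
import tables

with tables.open_file("weather.h5", "r") as f:
    readings = f.root.station_15.readings          # an existing Table node
    # The condition is evaluated inside the PyTables/HDF5 machinery,
    # not by pulling every row into Python first.
    hot_pressures = [row["pressure"]
                     for row in readings.where("temperature > 30")]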
The fact that you can combine several datasets in the same file is just a convenience. For example, you could have several groups corresponding to different weather stations, each group consisting of several tables of data. For each group you would have a set of attributes describing the details of the instruments, and for each table the individual settings. You could have one h5 file for each block of data, with an attribute in the corresponding place, and it would give you the same functionality. But what HDF5 additionally lets you do is repack the file for optimised querying, compress the whole thing slightly, and retrieve your information at blazing speed. If you have several files, each one would be compressed individually, and the OS would decide the layout on disk, which may not be optimal.
One last thing HDF5 allows you to do is load a file (or a piece of it) into memory while exposing the same API as on disk. So, for example, you could use one backend or the other depending on the size of the data and the available RAM. In your case, that would be equivalent to copying the relevant information to /dev/shm on Linux, and you would be responsible for committing any modification back to disk.
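In h5py, for example, the in-memory backend is just a driver option; a minimal sketch (the file name is a placeholder, and whether it is appropriate depends on your data size and RAM):
import h5py
import numpy as np

# driver="core" keeps the whole file image in memory; with
# backing_store=False nothing ever touches the disk.
with h5py.File("scratch.hdf5", "w", driver="core", backing_store=False) as f:
    f["data"] = np.arange(1000)
    print(f["data"][:10])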

One factor to consider is the performance of disk access. Using HDF5, everything is stored in a contiguous area of the disk, making reads faster with fewer disk seeks and rotations. On the other hand, using the file system to organise the data may involve reading from many small files, so more disk accesses are required.

Related

Is it more beneficial to read many small files or fewer large files of the exact same data?

I am working on a project where I am combining 300,000 small files together to form a dataset to be used for training a machine learning model. Because each of these files does not represent a single sample, but rather a variable number of samples, the dataset I require can only be formed by iterating through each of these files and concatenating/appending them to a single, unified array. With this being said, I unfortunately cannot avoid having to iterate through such files in order to form the dataset I require. As such, the process of data loading prior to model training is very slow.
Therefore my question is this: would it be better to merge these small files together into relatively larger files, e.g., reducing the 300,000 files to 300 (merged) files? I assume that iterating through fewer (but larger) files would be faster than iterating through many (but smaller) files. Can someone confirm if this is actually the case?
For context, my programs are written in Python and I am using PyTorch as the ML framework.
Thanks!
Usually working with one bigger file is faster than working with many small files.
It needs fewer open, read, close, etc. function calls, each of which takes time to:
check if the file exists,
check if you have permission to access the file,
get the file's information from disk (where the beginning of the file is on disk, what its size is, etc.),
seek to the beginning of the file on disk (when it has to read data),
create the system's buffer for data from disk (the system reads more data into the buffer so that later read() calls can read partially from the buffer instead of partially from disk).
Using many files, it has to do this for every file, and disk is much slower than a buffer in memory.
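If the samples are plain numeric arrays, one way to do the merge, tying back to the HDF5 discussion above, is to append everything into a single resizable, chunked dataset. A rough h5py sketch; the input layout, file names and feature count are assumptions, not your actual data:
import glob
import h5py
import numpy as np

files = sorted(glob.glob("samples/*.npy"))           # hypothetical input layout
n_features = 128                                     # assumed, fixed per sample

with h5py.File("training_data.hdf5", "w") as out:
    dset = out.create_dataset("samples", shape=(0, n_features),
                              maxshape=(None, n_features),
                              chunks=(1024, n_features), compression="gzip")
    for path in files:
        block = np.load(path)                        # variable number of rows
        start = dset.shape[0]
        dset.resize(start + block.shape[0], axis=0)
        dset[start:] = block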

I must compress many similar files, can I exploit the fact they are similar?

I have a dataset with many different samples (numpy arrays). It is rather impractical to store everything in only one file, so I store many different 'npz' files (numpy arrays compressed in zip).
Now I feel that if I could somehow exploit the fact that all the files are similar to one another I could achieve a much higher compression factor, meaning a much smaller footprint on my disk.
Is it possible to store separately a 'zip basis'? I mean something which is calculated for all the files together and embodies their statistical features and is needed for decompression, but is shared between all the files.
I would then have the 'zip basis' file and a separate list of compressed files, which would be much smaller in size than each file zipped alone, and to decompress I would use the shared 'zip basis' every time, for each file.
Is it technically possible? Is there something that works like this?
tl;dr: It depends on the size of each individual file and the data therein. For example, characteristics / use-cases / access patterns likely vary wildly between 234567 files of 100 bytes and 100 files of 234567 bytes.
Now I feel that if I could somehow exploit the fact that all the files are similar to one another I could achieve a much higher compression factor, meaning a much smaller footprint on my disk.
Possibly. Shared compression benefits will decrease as the size of each file increases.
Regardless, even a mono-file implementation (let's say a standard zip) may save significant effective disk space for very many very small files, as it avoids the overhead required by file systems to manage individual files; if nothing else, many implementations must align files to full blocks (eg. 512-4k bytes). Plus, you get free compression using a ubiquitously supported format.
Is it possible to store separately a 'zip basis'? I mean something which is calculated for all the files together and embodies their statistical features and is needed for decompression, but is shared between all the files.
This 'zip basis' is sometimes called a Pre-shared Dictionary.
I would have said 'zip basis' file and a separate list of compressed files, which would be much smaller in size than each file zipped alone, and to decompress I would use the share 'zip basis' every time for each file.
Is it technically possible? Is there something that works like this?
Yes, it's possible. SDCH (Shared Dictionary Compression for HTTP) was one such implementation designed for common web files (eg. HTML/CSS/JavaScript). In certain cases it could achieve significantly higher compression than standard DEFLATE.
The approach can be emulated with many compression algorithms that work on streams, where the compression dictionary is encoded as part of the stream as written. (U = uncompressed, C = compressed.)
To compress:
[U:shared_dict] + [U:data] -> [C:shared_dict] + [C:data]
^-- "zip basis" ^-- write only this to file
^-- artifact of priming
To decompress:
[C:shared_dict] + [C:data] -> [U:shared_dict] + [U:data]
^-- add this back before decompressing! ^-- use this
The overall space saved depends on many factors, including how useful the initial priming dictionary is and on the specific compressor details. LZ77-style implementations are particularly well-suited to the approach above due to their use of a sliding window that acts as a lookup dictionary.
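In Python, this priming trick is exposed through zlib's zdict parameter; a minimal sketch (the shared dictionary here is just whatever bytes are common to your files, and the template file name is hypothetical):
import zlib

shared_dict = open("template.bin", "rb").read()      # the "zip basis"

def compress_with_basis(data):
    # Prime the compressor with the shared dictionary, then compress only the data.
    c = zlib.compressobj(zdict=shared_dict)
    return c.compress(data) + c.flush()

def decompress_with_basis(blob):
    # The same dictionary must be supplied again to decompress.
    d = zlib.decompressobj(zdict=shared_dict)
    return d.decompress(blob) + d.flush()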
Alternatively, it may be possible to use domain-specific knowledge and/or encoding to also achieve better compression with specialized compression schemes. An example of this is SQL Server's Page Compression which exploits data similarities between columns on different rows.
A 'zip-basis' is interesting but problematic.
You could preprocess the files instead. Take one file as a template and calculate the diff of each file compared to the template. Then compress the diffs.
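A rough numpy sketch of that preprocessing idea, assuming every sample shares the template's shape and dtype (file names are invented):
import numpy as np

template = np.load("template.npy")                   # reference sample

def save_as_diff(path, sample):
    # Differences against a similar template are mostly small values
    # (often zeros), which compress much better than the raw sample.
    np.savez_compressed(path, diff=sample - template)

def load_from_diff(path):
    return template + np.load(path)["diff"]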

Pandas: efficiently write thousands of small files

Here is my problem.
I have a single big CSV file containing a bit more than 100M rows which I need to divide into much smaller files (if needed I can add more details). At the moment I'm reading the big CSV in chunks, doing some computations to determine how to subdivide each chunk, and finally writing (appending) to the output files with
df.to_csv(outfile, float_format='%.8f', index=False, mode='a', header=header)
(the header variable is True if it is the first time that I write to 'outfile', otherwise it is False).
While running the code I noticed that the amount of disk space taken by the smaller files as a whole was likely to become larger than three times the size of the single big CSV.
So here are my questions:
is this behavior normal? (probably it is, but I'm asking just in case)
is it possible to reduce the size of the files? (different file formats?) [SOLVED through compression, see update below and comments]
are there better file types for this situation with respect to CSV?
Please note that I don't have an extensive knowledge of programming, I'm just using Python for my thesis.
Thanks in advance to whoever will help.
UPDATE: thanks to @AshishAcharya and @PatrickArtner I learned how to use compression while writing and reading the CSV. Still, I'd like to know if there are any file types that may be better than CSV for this task.
NEW QUESTION: (maybe stupid question) does appending work on compressed files?
UPDATE 2: using the compression option I noticed something that I don't understand. To determine the size of folders I was taught to use the du -hs <folder> command, but using it on the folder containing compressed files or the one containing the uncompressed files results in the same value of '3.8G' (both are created using the same first 5M rows from the big CSV). From the file explorer (Nautilus) instead, I get about 590MB for the one containing uncompressed CSV and 230MB for the other. What am I missing?
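For reference, the compression mentioned in the update is essentially a one-liner in pandas; a sketch (file names are placeholders, and whether appending to an already-compressed file behaves the way you want is worth testing on your own data):
import pandas as pd

df = pd.read_csv("big.csv", nrows=5_000_000)          # hypothetical chunk of the big CSV
df.to_csv("out.csv.gz", index=False, compression="gzip")
back = pd.read_csv("out.csv.gz")                      # compression inferred from the extension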

How to read a big (3-4GB) file that doesn't have newlines into a numpy array?

I have a 3.3gb file containing one long line. The values in the file are comma separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used hdf files, via h5py, of sizes well over 200 gigs with very little processing time, on the order of a minute or two, for file accesses. In addition the hdf libraries support mpi and concurrent access.
This means that, assuming you can format your original one-line file as an appropriately hierarchical hdf file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of hdf to process your data on multiple cores, using mpi to pass whatever data you need between the cores.
You need to be careful with your code and understand how mpi works in conjunction with hdf, but it'll speed things up no end.
Of course all of this depends on putting the data into an hdf file in a way that allows you to take advantage of mpi... so maybe not the most practical suggestion.
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
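A minimal sketch of that round trip (file names are placeholders): convert once from text, then load the binary file on later runs:
import numpy as np

distance_matrix = np.fromfile("distance_matrix.tmp", sep=",")   # slow: text parsing
np.save("distance_matrix.npy", distance_matrix)                 # one-time conversion to binary

distance_matrix = np.load("distance_matrix.npy")                # later runs: fast, no parsing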
If you can't change the file format (i.e. it's not the result of one of your own programs), then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8 GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.

Pytables vs. CSV for files that are not very large

I recently came across Pytables and find it to be very cool. It is clear that they are superior to a csv format for very large data sets. I am running some simulations using python. The output is not so large, say 200 columns and 2000 rows.
If someone has experience with both, can you suggest which format would be more convenient in the long run for such data sets that are not very large? PyTables has data manipulation capabilities and browsing of the data with ViTables, but the browser does not have as much functionality as, say, Excel, which can be used for CSV. Similarly, do you find one better than the other for importing and exporting data, if working mainly in Python? Is one more convenient in terms of file organization? Any comments on issues such as these would be helpful.
Thanks.
Have you considered Numpy arrays?
PyTables is wonderful when your data is too large to fit in memory, but a 200x2000 matrix of 8 byte floats only requires about 3MB of memory. So I think PyTables may be overkill.
You can save numpy arrays to files using np.savetxt or np.savez (for compression), and can read them from files with np.loadtxt or np.load.
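For completeness, a tiny sketch of that numpy route (the array and file names are arbitrary):
import numpy as np

results = np.random.random((2000, 200))
np.savez_compressed("run_001.npz", results=results)   # compressed on disk
results = np.load("run_001.npz")["results"]           # read it back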
If you have many such arrays to store on disk, then I'd suggest using a database instead of numpy .npz files. By the way, to store a 200x2000 matrix in a database, you only need 3 table columns: row, col, value:
import sqlite3
import numpy as np

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('''CREATE TABLE foo
                  (row INTEGER,
                   col INTEGER,
                   value FLOAT,
                   PRIMARY KEY (row, col))''')

ROWS = 4
COLUMNS = 6
matrix = np.random.random((ROWS, COLUMNS))
print(matrix)
# [[ 0.87050721 0.22395398 0.19473001 0.14597821 0.02363803 0.20299432]
#  [ 0.11744885 0.61332597 0.19860043 0.91995295 0.84857095 0.53863863]
#  [ 0.80123759 0.52689885 0.05861043 0.71784406 0.20222138 0.63094807]
#  [ 0.01309897 0.45391578 0.04950273 0.93040381 0.41150517 0.66263562]]

# Store matrix in table foo
cursor.executemany('INSERT INTO foo(row, col, value) VALUES (?,?,?)',
                   ((r, c, value) for r, row in enumerate(matrix)
                    for c, value in enumerate(row)))

# Retrieve matrix from table foo
cursor.execute('SELECT value FROM foo ORDER BY row, col')
data = [value for (value,) in cursor.fetchall()]
matrix2 = np.fromiter(data, dtype=float).reshape((ROWS, COLUMNS))
print(matrix2)
# [[ 0.87050721 0.22395398 0.19473001 0.14597821 0.02363803 0.20299432]
#  [ 0.11744885 0.61332597 0.19860043 0.91995295 0.84857095 0.53863863]
#  [ 0.80123759 0.52689885 0.05861043 0.71784406 0.20222138 0.63094807]
#  [ 0.01309897 0.45391578 0.04950273 0.93040381 0.41150517 0.66263562]]
If you have many such 200x2000 matrices, you just need one more table column to specify which matrix.
As far as importing/exporting goes, PyTables uses a standardized file format called HDF5. Many scientific software packages (like MATLAB) have built-in support for HDF5, and the C API isn't terrible. So any data you need to export from or import to one of these languages can simply be kept in HDF5 files.
PyTables does add some attributes of its own, but these shouldn't hurt you. Of course, if you store Python objects in the file, you won't be able to read them elsewhere.
The one nice thing about CSV files is that they're human readable. However, if you need to store anything other than simple numbers in them and communicate with others, you'll have issues. I receive CSV files from people in other organizations, and I've noticed that humans aren't good at making sure things like string quoting are done correctly. It's good that Python's CSV parser is as flexible as it is. One other issue is that floating point numbers can't be stored exactly in text using decimal format. It's usually good enough, though.
One big plus for PyTables is the storage of metadata, like variables, etc.
If you run the simulations more often with different parameters, you can store each result as an array entry in the h5 file.
We use it to store measurement data plus the experiment scripts used to get the data, so it is all self-contained.
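A small PyTables sketch of that pattern (the run name, parameters and script string are invented):
import numpy as np
import tables

results = np.random.random(2000)

with tables.open_file("simulations.h5", "a") as f:
    arr = f.create_array(f.root, "run_042", results)
    # Parameters and provenance travel with the data itself.
    arr.attrs.temperature = 300.0
    arr.attrs.solver = "rk4"
    arr.attrs.script = "python run_simulation.py --temp 300 --solver rk4"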
BTW: if you need a quick look into an HDF5 file, you can use HDFView. It's a free Java app from the HDF Group, and it's easy to install.
I think it's very hard to compare PyTables and CSV. PyTables is a data structure, while CSV is an exchange format for data.
This is actually quite related to another answer I've provided regarding reading / writing csv files w/ numpy:
Python: how to do basic data manipulation like in R?
You should definitely use numpy, no matter what else! The ease of indexing, etc. far outweighs the cost of the additional dependency (well, I think so). PyTables, of course, relies on numpy too.
Otherwise, it really depends on your application, your hardware and your audience. I suspect that reading in csv files of the size you're talking about won't matter in terms of speed compared to PyTables. But if that's a concern, write a benchmark! Read and write some random data 100 times. Or, if read times matter more, write once, read 100 times, etc.
I strongly suspect that PyTables will outperform SQL. SQL will rock on complex multi-table queries (especially if you do the same ones frequently), but even on single-table (so-called "denormalized") queries, PyTables is hard to beat in terms of speed. I can't find a reference for this off-hand, but you may be able to dig something up if you mine the links here:
http://www.pytables.org/moin/HowToUse#HintsforSQLusers
I'm guessing that execution performance at this stage will pale in comparison to coder performance. So, above all, pick something that makes the most sense to you!
Other points:
As with SQL, PyTables has an undo feature. CSV files won't have this, but you can keep them in version control, and your VCS doesn't need to be too smart (CSV files are text).
On a related note, CSV files will be much bigger than binary formats (you can certainly write your own tests for this too).
These are not "exclusive" choices.
You need both.
CSV is just a data exchange format. If you use pytables, you still need to import and export in CSV format.
