HDF5 file (h5py) with version control - hash changes on every save - python

I am using h5py to store intermediate data from numerical work in an HDF5 file. I have the project under version control, but this doesn't work well with the HDF5 files: every time a script that generates an HDF5 file is re-run, the binary file changes even if the data within does not.
Here is a small example to illustrate this:
In [1]: import h5py, numpy as np
In [2]: A = np.arange(5)
In [3]: f = h5py.File('test.h5', 'w'); f['A'] = A; f.close()
In [4]: !md5sum test.h5
7d27c258d94ed5d06736f6d2ba7c9433 test.h5
In [5]: f = h5py.File('test.h5', 'w'); f['A'] = A; f.close()
In [6]: !md5sum test.h5
c1db5806f1393f2095c88dbb7efeb7d3 test.h5
In [7]: # the file has changed but still contains the same data!
I have looked in the HDF5 file format documents and at the h5py documentation but haven't found anything which helps me with this. My questions are:
Why does the file change even though I'm saving the same data?
How can I stop it changing, so version control only sees a new version of the file when the actual numerical content has changed?

The HDF5 file uses both an abstract data model as well as an abstract storage model. What this means is that how a file is stored on disk may be (and usually is) completely different to how it is represented in your program. It's possible to store exactly the same data in more than one way, and for this not to be apparent to your program.
The HDF5 file format storage specification allows for several timestamps in the data object headers. These are not stored as attributes, and so aren't usually accessible by the high level APIs.
It's possible to turn off writing these timestamps using the low level HDF5 APIs, but it's not clear if the relevant features are in h5py. This github issue appears to be exactly what you want, but unfortunately it is still open.
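For what it's worth, h5py's create_dataset() accepts a track_times keyword, which may be enough for the simple case in the question. A minimal sketch, assuming a reasonably recent h5py; the write_and_hash helper and the file names are made up for illustration, and I have not checked how groups or attributes behave:

```python
import hashlib
import os
import tempfile

import h5py
import numpy as np


def write_and_hash(path, data):
    """Write `data` to a fresh HDF5 file and return the file's MD5 hex digest."""
    with h5py.File(path, "w") as f:
        # track_times=False asks HDF5 not to store timestamps in the
        # dataset's object header.
        f.create_dataset("A", data=data, track_times=False)
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


A = np.arange(5)
tmpdir = tempfile.mkdtemp()
first = write_and_hash(os.path.join(tmpdir, "test1.h5"), A)
second = write_and_hash(os.path.join(tmpdir, "test2.h5"), A)
print(first == second)
```

If the two digests match on your setup, version control should only see a change when the stored data changes.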


Convert huge csv to hdf5 format

I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis, similar to Flying high with Vaex, with the vaex library.
I tried to convert the CSV to an hdf5 file, to make it readable for the vaex library:
import time
import vaex
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
I always get an error when running the code:
RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).
Second run, I get this error:
RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)
Is there an alternative way to convert the CSV to hdf5 without Python? For example, a downloadable software which can do this job?
I'm not familiar with vaex, so can't help with usage and functions. However, I can read error messages. :-)
It reports "bytes written" with a huge number (18_446_744_073_709_551_615), much larger than the 84GB CSV. Some possible explanations:
you ran out of disk
you ran out of memory, or
had some other error
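One more observation on that number, which is a general fact about integers rather than anything vaex-specific: 18446744073709551615 is exactly 2**64 - 1, which is what -1 looks like when printed as an unsigned 64-bit value. So it is probably not a real byte count; more likely the underlying write call returned -1 (failure), and errno = 22 ('Invalid argument') carries the actual reason. A quick check (the helper name is mine):

```python
import struct

reported = 18_446_744_073_709_551_615  # "bytes actually written" from the error


def as_signed_64(value):
    """Reinterpret an unsigned 64-bit integer as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


print(reported == 2**64 - 1)   # True
print(as_signed_64(reported))  # -1
```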
To diagnose, try testing with a small csv file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
Regarding your question (is there an alternative way to convert a CSV to HDF5?): why not use Python?
Are you more comfortable with other languages? If so, you can install HDF5 and write your code with their C or Fortran API.
OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.
Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV. (You can also use loadtxt() if you don't have missing values and don't need the field names.) However, I think you will run into memory problems reading an 84GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternatively, you can use csv.DictReader(). It reads one line at a time, so you avoid memory issues, but loading the HDF5 file that way could be very slow.
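A middle ground between "whole file at once" and "one line at a time" is to read a fixed number of rows per step. The sketch below uses only the standard library; iter_csv_chunks is a name I made up, and the in-memory sample stands in for the real CSV. Each chunk could then be converted and appended to the HDF5 file.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(fileobj, chunk_rows):
    """Yield lists of at most `chunk_rows` row-dicts from an open CSV file."""
    reader = csv.DictReader(fileobj)
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            break
        yield chunk


# Tiny in-memory stand-in for a large CSV file (made-up data).
sample = io.StringIO("Year,Origin,Dest\n2019,LAX,JFK\n2019,SFO,JFK\n2020,LAX,ORD\n")
chunk_sizes = [len(chunk) for chunk in iter_csv_chunks(sample, chunk_rows=2)]
print(chunk_sizes)  # [2, 1]
```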
Python packages to create the HDF5 file
I have used both h5py and pytables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.
Here is a very simple example that reads the lax_to_jfk.csv data and loads to a HDF5 file.
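import h5py
import numpy as np
csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes')
with h5py.File(csv_name+'.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)  # one dataset holding the whole record array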
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB and has 2M rows. I discovered the code above doesn't work. However, it's not because of the number of rows. The problem is the columns (field names). It turns out the data isn't that clean: there are 109 field names on row 1, and some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field. While investigating this, I also discovered that many rows only have values for the first 56 fields. In other words, fields 57-111 are not very useful. One solution is to add the usecols=() parameter. The code below reflects this modification and works with this test file. (I have not tried testing with your large file airline.csv. Given its size, you will likely need to read and load it incrementally.)
csv_name = 'airline_2m'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes',
                        usecols=range(56))  # keep only the first 56 columns
with h5py.File(csv_name+'.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)
I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs. The schema is not known.
Sometimes there are "mixed types" and pandas (used underneath vaex's read_csv or from_csv ) casts those columns as dtype object.
Vaex does not really support such mixed dtypes, and requires each column to be of a single uniform type (kind of like a database).
So how to get around this? Well, the best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has 100+ columns and that's annoying, but that is also kind of the price to pay when using a format such as CSV.
Another thing I noticed is the encoding. Using pure pandas.read_csv failed at some point because of the encoding, and requires one to add encoding="ISO-8859-1". This is also supported by vaex.open (since the args are just passed down to pandas).
In fact if you want to do manually what vaex.open does automatically for you (given that this CSV file might not be as clean as one would hope), do something like (this is pseudo code but I hope close to the real thing)
import pandas as pd
import vaex
# `file` is the CSV path and `dtype` is the column-to-type mapping discussed above
# Iterate over the file in chunks
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000, encoding="ISO-8859-1", dtype=dtype)):
    # Assert or check or do whatever needs doing to ensure column types are as they should be
    # Pass the data to vaex (this does not take extra RAM):
    df_vaex = vaex.from_pandas(df_tmp)
    # Export this chunk into HDF5
    df_vaex.export_hdf5(f'chunk_{i}.hdf5')
# When the above loop finishes, just concat and export the data to a single file if needed (gives some performance benefit).
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')
I've seen a potentially much better and faster way of doing this with vaex, but it is not released yet (I saw it in the code repo on GitHub), so I will not go into it. If you can install from source and want me to elaborate further, feel free to drop a comment.
Hope this at least gives some ideas on how to move forward.
In the last couple of versions of vaex core, vaex.open() opens all CSV files lazily, so you can just export to hdf5/arrow directly and it will do the conversion in one go. Check the docs for more details: https://vaex.io/docs/guides/io.html#Text-based-file-formats

h5py file subset taking more space than parent file?

I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to be able to do some basic min/max scaling of the numerical data within latlong, so I want to put it in its own h5py file for easier use and lower memory usage.
However, when i try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong_data)
The output file is incredibly large. It's still not done writing to disk and is ~85GB in space. Why is the data being written to the new file not compressed?
Could be h5f['data/lat_long'] is using compression filters (and you aren't). To check the original dataset's compression settings, use this line:
print(h5f['data/lat_long'].compression, h5f['data/lat_long'].compression_opts)
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the whole dataset into an array, which is unnecessary in most use cases. An h5py dataset object behaves much like a NumPy array: use ds = h5f['data/lat_long'] to create a dataset object (instead of an array) and index or slice it as you would an array; only the parts you access are read from disk. FYI, .value is a deprecated way to return the dataset as an array (it was removed in h5py 3.0); use arr = h5f['data/lat_long'][()] instead. Loading the dataset into an array also requires more memory than using an h5py object, which could be an issue with large datasets.
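The dataset-object approach can be sketched for the min/max scaling you mention. This is only an illustration under assumptions: I assume the dataset is 2-D (rows by columns), the small file built here is made-up stand-in data, and column_min_max is a helper name I invented.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small stand-in file (made-up data; the real file is ~18 GB).
path = os.path.join(tempfile.mkdtemp(), "input.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data/lat_long", data=np.arange(20.0).reshape(10, 2),
                     compression="gzip")


def column_min_max(ds, rows_per_block=4):
    """Compute per-column min and max, reading `rows_per_block` rows at a time."""
    lo = np.full(ds.shape[1], np.inf)
    hi = np.full(ds.shape[1], -np.inf)
    for start in range(0, ds.shape[0], rows_per_block):
        block = ds[start:start + rows_per_block]  # only this slice is read
        lo = np.minimum(lo, block.min(axis=0))
        hi = np.maximum(hi, block.max(axis=0))
    return lo, hi


with h5py.File(path, "r") as h5f:
    ds = h5f["data/lat_long"]                # dataset object, nothing loaded yet
    lo, hi = column_min_max(ds)
    scaled_head = (ds[:3] - lo) / (hi - lo)  # scale only the rows you need

print(lo, hi)
```

Memory use is bounded by the block size rather than by the size of the dataset.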
There are other ways to access the data. My suggestion to use dataset objects is one way. Your method (extracting data to a new file) is another. I am not fond of that approach because you now have 2 copies of the data, which is a bookkeeping nightmare. Another alternative is to create external links from the new file to the existing 18GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in this post: "How can I combine multiple .h5 file?", Method 1: Create External Links.
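A minimal sketch of the external-link idea, using h5py.ExternalLink. The files are created in a temporary directory with made-up data so the example is self-contained; with your data, the first step (creating the big file) is of course not needed.

```python
import os
import tempfile

import h5py
import numpy as np

tmpdir = tempfile.mkdtemp()
big_path = os.path.join(tmpdir, "input.h5")
small_path = os.path.join(tmpdir, "latlong_only.h5")

# Stand-in for the existing large file (made-up data).
with h5py.File(big_path, "w") as big:
    big.create_dataset("data/lat_long", data=np.arange(6.0).reshape(3, 2))

# The new file stores only a link, not a copy of the data.
with h5py.File(small_path, "w") as small:
    small["latlong"] = h5py.ExternalLink(big_path, "/data/lat_long")

# Reading through the link transparently opens the big file.
with h5py.File(small_path, "r") as small:
    linked = small["latlong"][()]

print(linked.shape)
```

The link stores the path to the big file, so the two files have to stay where the link expects them.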
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array and then writes that array to the new file uncompressed. Instead, copy the dataset with h5py's group .copy() method, which retains the compression settings and attributes.
See below:
with h5py.File('input.h5', 'r') as h5f1, \
     h5py.File('latlong_only.h5', 'w') as h5f2:
    h5f1.copy(h5f1['data/lat_long'], h5f2, 'latlong')

Close an open h5py data file

In our lab we store our data in hdf5 files through the python package h5py.
At the beginning of an experiment we create an hdf5 file and store array after array of data in the file (among other things). When an experiment fails or is interrupted, the file is not correctly closed.
Because our experiments run from iPython the reference to the data object remains (somewhere) in memory.
Is there a way to scan for all open h5py data objects and close them?
This is how it could be done (I could not figure out how to check for closed-ness of the file without exceptions, maybe you will find):
import gc
import h5py
for obj in gc.get_objects():   # Browse through ALL objects
    if isinstance(obj, h5py.File):   # Just HDF5 files
        try:
            obj.close()
        except Exception:
            pass  # Was already closed
Another idea:
Depending on how you use the files, what about using the context manager and the with keyword like this?
with h5py.File("some_path.h5", "a") as f:
    f["data1"] = some_data
When the program flow exits the with-block, the file is closed regardless of what happens, including exceptions etc.
pytables (a separate HDF5 package, not something h5py is built on) keeps track of all files opened through it and provides an easy method to force-close all of them.
import tables
tables.file._open_files.close_all()
That attribute _open_files also has helpful methods to give you information and handlers for the open files.
I've found that hFile.__bool__() returns True if open, and False otherwise. This might be the simplest way to check.
In other words, do this:
hFile = h5py.File(path_to_file)
if hFile.__bool__():
    hFile.close()
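Putting the pieces from this thread together: the gc scan finds the leaked file objects, and the truthiness check tells open files from closed ones without relying on exceptions. A sketch, where close_open_h5py_files is a name I made up and the temporary file only simulates an interrupted experiment:

```python
import gc
import os
import tempfile

import h5py


def close_open_h5py_files():
    """Close every h5py.File that is still open; return how many were closed."""
    closed = 0
    for obj in gc.get_objects():
        try:
            is_open_file = isinstance(obj, h5py.File) and bool(obj)
        except Exception:
            continue  # some exotic objects misbehave under isinstance()
        if is_open_file:
            obj.close()
            closed += 1
    return closed


# Simulate an interrupted experiment: the file is opened but never closed.
path = os.path.join(tempfile.mkdtemp(), "experiment.h5")
leaked = h5py.File(path, "w")
n_closed = close_open_h5py_files()
print(n_closed, bool(leaked))
```

Note that this closes every open h5py file in the session, including ones you may still want to use.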

Pickle dump huge file without memory error

I have a program where I basically adjust the probability of certain things happening based on what is already known. My file of data is already saved as a pickle Dictionary object at Dictionary.txt.
The problem is that every time I run the program, it pulls in Dictionary.txt, turns it into a dictionary object, makes its edits, and overwrites Dictionary.txt. This is pretty memory intensive, as Dictionary.txt is 123 MB. When I dump I get a MemoryError; everything seems fine when I pull it in.
Is there a better (more efficient) way of doing the edits? (Perhaps without having to overwrite the entire file every time.)
Is there a way that I can invoke garbage collection (through gc module)? (I already have it auto-enabled via gc.enable())
I know that besides readlines() you can read line-by-line. Is there a way to edit the dictionary incrementally line-by-line when I already have a fully completed Dictionary object File in the program.
Any other solutions?
Thank you for your time.
I had the same issue. Using joblib solved it for me; I'm posting this in case someone wants to know about other possibilities.
# save the model to disk
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
filename = 'finalized_model.sav'
joblib.dump(model, filename)
# some time later... load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
I am the author of a package called klepto (and also the author of dill).
klepto is built to store and retrieve objects in a very simple way, and provides a simple dictionary interface to databases, memory cache, and storage on disk. Below, I show storing large objects in a "directory archive", which is a filesystem directory with one file per entry. I choose to serialize the objects (it's slower, but uses dill, so you can store almost any object), and I choose a cache. Using a memory cache enables me to have fast access to the directory archive, without having to have the entire archive in memory. Interacting with a database or file can be slow, but interacting with memory is fast… and you can populate the memory cache as you like from the archive.
>>> import klepto
>>> d = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> d
dir_archive('stuff', {}, cached=True)
>>> import numpy
>>> # add three entries to the memory cache
>>> d['big1'] = numpy.arange(1000)
>>> d['big2'] = numpy.arange(1000)
>>> d['big3'] = numpy.arange(1000)
>>> # dump from memory cache to the on-disk archive
>>> d.dump()
>>> # clear the memory cache
>>> d.clear()
>>> d
dir_archive('stuff', {}, cached=True)
>>> # only load one entry to the cache from the archive
>>> d.load('big1')
>>> d['big1'][-3:]
array([997, 998, 999])
klepto provides fast and flexible access to large amounts of storage, and if the archive allows parallel access (e.g. some databases) then you can read results in parallel. It's also easy to share results in different parallel processes or on different machines. Here I create a second archive instance, pointed at the same directory archive. It's easy to pass keys between the two objects, and works no differently from a different process.
>>> f = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> f
dir_archive('stuff', {}, cached=True)
>>> # add some small objects to the first cache
>>> d['small1'] = lambda x:x**2
>>> d['small2'] = (1,2,3)
>>> # dump the objects to the archive
>>> d.dump()
>>> # load one of the small objects to the second cache
>>> f.load('small2')
>>> f
dir_archive('stuff', {'small2': (1, 2, 3)}, cached=True)
You can also pick from various levels of file compression, and whether
you want the files to be memory-mapped. There are a lot of different
options, both for file backends and database backends. The interface
is identical, however.
With regard to your other questions about garbage collection and editing of portions of the dictionary, both are possible with klepto, as you can individually load and remove objects from the memory cache, dump, load, and synchronize with the archive backend, or any of the other dictionary methods.
See a longer tutorial here: https://github.com/mmckerns/tlkklp
Get klepto here: https://github.com/uqfoundation
None of the above answers worked for me. I ended up using Hickle which is a drop-in replacement for pickle based on HDF5. Instead of saving it to a pickle it's saving the data to HDF5 file. The API is identical for most use cases and it has some really cool features such as compression.
pip install hickle
import hickle as hkl
import numpy as np
# Create a numpy array of data
array_obj = np.ones(32768, dtype='float32')
# Dump to file
hkl.dump(array_obj, 'test.hkl', mode='w')
# Load data
array_hkl = hkl.load('test.hkl')
I had memory error and resolved it by using protocol=2:
cPickle.dump(obj, file, protocol=2)
If your keys and values are strings, you can use one of the embedded persistent key-value storage engines available in the Python standard library. Example from the anydbm module docs (anydbm is the Python 2 name; in Python 3 the module is called dbm):
import anydbm
# Open database, creating it if necessary.
db = anydbm.open('cache', 'c')
# Record some values
db['www.python.org'] = 'Python Website'
db['www.cnn.com'] = 'Cable News Network'
# Loop through contents. Other dictionary methods
# such as .keys(), .values() also work.
for k, v in db.iteritems():
    print k, '\t', v
# Storing a non-string key or value will raise an exception (most
# likely a TypeError).
db['www.yahoo.com'] = 4
# Close when done.
db.close()
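For Python 3, a rough equivalent is the standard library's shelve module, which sits on top of dbm and accepts any picklable value, not just strings. A minimal sketch of editing one entry without loading or rewriting the whole dictionary; the keys, values, and file name are made up:

```python
import os
import shelve
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "probabilities")

# First run: populate the on-disk store (made-up keys and values).
with shelve.open(db_path) as db:
    db["rain"] = 0.30
    db["snow"] = 0.05

# Later run: adjust a single entry; only this value is pickled again.
with shelve.open(db_path) as db:
    db["rain"] = 0.35

with shelve.open(db_path) as db:
    snapshot = dict(db)

print(snapshot)
```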
Have you tried using streaming pickle: https://code.google.com/p/streaming-pickle/
I have just solved a similar memory error by switching to streaming pickle.
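I can't vouch for how the linked package is implemented, but the general idea of streaming (writing many small pickles to one file instead of one huge pickle) can be sketched with the standard library alone. dump_items and load_items are names I made up, and the dictionary is stand-in data:

```python
import os
import pickle
import tempfile


def dump_items(mapping, path):
    """Pickle each (key, value) pair as its own record instead of one big object."""
    with open(path, "wb") as fh:
        for item in mapping.items():
            pickle.dump(item, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_items(path):
    """Yield the (key, value) pairs back one at a time."""
    with open(path, "rb") as fh:
        while True:
            try:
                yield pickle.load(fh)
            except EOFError:
                break


original = {"a": 0.1, "b": 0.7, "c": 0.2}  # made-up data
path = os.path.join(tempfile.mkdtemp(), "Dictionary.pkl")
dump_items(original, path)
restored = dict(load_items(path))
print(restored == original)
```

Because each record is pickled separately, the pickler never has to hold a serialized copy of the whole dictionary at once.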
How about this?
import cPickle as pickle
p = pickle.Pickler(open("temp.p","wb"))
p.fast = True
p.dump(d) # d could be your dictionary or any file
I recently had this problem. After trying cPickle with both the ASCII protocol and binary protocol 2, I found that my SVM from scikit-learn, trained on 20+ GB of data, would not pickle because of a memory error. However, the dill package seemed to resolve the issue. Dill will not bring many improvements for a dictionary but may help with streaming. It is meant to stream pickled bytes across a network.
import dill
with open(path,'wb') as fp:
    dill.dump(obj, fp)  # obj stands for whatever object you want to save
If efficiency is an issue, try loading/saving to a database. In this instance, your storage solution may be the issue. At 123 MB, pandas should be fine. However, if the machine has limited memory, SQL offers fast, optimized bag operations over data, usually with multithreaded support.
My poly kernel svm saved.
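As an illustration of the database suggestion, here is a minimal sketch with sqlite3 from the standard library. The table layout, keys, and values are assumptions made up for the example; the point is that one entry can be updated without reading or rewriting the rest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.execute("CREATE TABLE probs (key TEXT PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO probs VALUES (?, ?)",
                 [("rain", 0.30), ("snow", 0.05)])  # made-up data

# Update a single entry in place.
conn.execute("UPDATE probs SET value = ? WHERE key = ?", (0.35, "rain"))
conn.commit()

rows = dict(conn.execute("SELECT key, value FROM probs ORDER BY key"))
conn.close()
print(rows)
```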
This may seem trivial, but try to use the 64bit Python if you are not.
I have tried the following solutions, but none of them resolved my problem:
Using hickle to replace pickle
Using joblib to replace pickle
Using sklearn.externals joblib to replace pickle
Change the pickle mode
So here is a different angle on this issue:
Finally, I found that the root cause was that the working directory path was too long.
So I changed the directory to a very short path.
Enjoy it.

Save .dta files in python

I'm wondering if anyone knows a Python package that allows you to save numpy arrays/recarrays in the .dta format of the statistical data analysis software Stata. This would really speed up a few steps in a system I have.
The scikits.statsmodels package includes a reader for Stata data files, which relies in part on PyDTA as pointed out by @Sven. In particular, genfromdta() will return an ndarray, e.g.
from Python 2.7/statsmodels 0.3.1:
>>> import scikits.statsmodels.api as sm
>>> arr = sm.iolib.genfromdta('/Applications/Stata12/auto.dta')
>>> type(arr)
<type 'numpy.ndarray'>
The savetxt() function can be used in turn to save an array as a text file, which can be imported in Stata. For example, we can export the above as
>>> sm.iolib.savetxt('auto.txt', arr, fmt='%2s', delimiter=",")
and read it in Stata without a dictionary file as follows:
. insheet using auto.txt, clear
I believe a *.dta writer should be added in the near future.
The only Python library for STATA interoperability I could find merely provides read-only access to .dta files. The R foreign library however provides a function write.dta, and RPy provides a Python interface to R. Maybe the combination of these tools can help you.
pandas DataFrame objects now have a "to_stata" method. So you can do for instance
import pandas as pd
df = pd.read_stata('my_data_in.dta')
df.to_stata('my_data_out.dta')
DISCLAIMER: the first step is quite slow (in my test, around 1 minute for reading a 51 MB dta - also see this question), and the second produces a file which can be way larger than the original one (in my test, the size goes from 51 MB to 111MB). This answer may look less elegant, but it is probably more efficient.
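Since the question starts from NumPy arrays/recarrays rather than an existing .dta file, here is a sketch of going from a structured array to .dta via pandas. The field names, values, and output path are made up; write_index=False keeps the DataFrame index from being written as an extra column.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# A small structured array standing in for the real data (made-up values).
rec = np.array([(1, 2.5), (2, 3.5), (3, 4.5)],
               dtype=[("id", "i4"), ("price", "f8")])

path = os.path.join(tempfile.mkdtemp(), "my_data_out.dta")
pd.DataFrame.from_records(rec).to_stata(path, write_index=False)

# Read the file back to check the round trip.
back = pd.read_stata(path)
print(back)
```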

