I have an existing HDF5 file with three arrays, and I want to extract one of them using h5py.
h5py already reads datasets in as numpy arrays, so you can just do:
with h5py.File('the_filename', 'r') as f:
    my_array = f['array_name'][()]
The [()] means read the entire array into memory; without it you don't get the whole data, but a dataset object that gives lazy access to sub-parts (very useful when the array is huge but you only need a small part of it).
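For example, a minimal sketch of that lazy access (the slice bounds are purely illustrative and assume the dataset has at least ten rows):

import h5py

with h5py.File('the_filename', 'r') as f:
    dset = f['array_name']   # just a handle, no data read yet
    part = dset[0:10]        # reads only the first ten rows from disk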
For this question it is way overkill, but if you have a lot of tasks like this, the SpacePy package makes some of them easier. Its datamodel.fromHDF5() function returns a dictionary of arrays, stored in a similar way to how h5py handles data.
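A minimal sketch of that route, assuming SpacePy is installed and the same file as above (fromHDF5 returns a dict-like object keyed by dataset name):

from spacepy import datamodel

data = datamodel.fromHDF5('the_filename')   # dict-like container of arrays
my_array = data['array_name']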
I have a list of 42000 numpy arrays (each array is 240x240) that I want to save to a file for use in another python script.
I've tried using pickle and numpy.savez_compressed and I run into memory errors (I have 16 GB of DDR3 RAM). I read that HDF5, which is commonly used for deep learning work, cannot save lists, so I'm kind of stuck.
Does anyone have any idea how I can save my data?
EDIT: I previously saved this data to disk as a single numpy array using np.save; it was around 2.3 GB, but my computer couldn't always handle loading it, so it would sometimes crash if I tried to process it. I read that lists might be better, so I have moved to using lists of numpy arrays.
Assume we have a list of numpy arrays, A, and wish to save these sequentially to an HDF5 file.
We can use the h5py library to create datasets, with each dataset corresponding to an array in A.
import h5py, numpy as np

A = [arr1, arr2, arr3]  # each arrX is a 240x240 numpy array

with h5py.File('file.h5', 'w', libver='latest') as f:  # use 'latest' for performance
    for idx, arr in enumerate(A):
        dset = f.create_dataset(str(idx), shape=(240, 240), data=arr,
                                chunks=(240, 240),
                                compression='gzip', compression_opts=9)
I use gzip compression here for compatibility reasons, since it ships with every HDF5 installation. You may also wish to consider the blosc and lzf filters. I also set chunks equal to the array shape, on the assumption that you intend to read entire arrays rather than partial arrays.
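If you later need the arrays back, a minimal sketch for reading them in index order (assuming the file written above):

with h5py.File('file.h5', 'r') as f:
    A_loaded = [f[str(idx)][()] for idx in range(len(f))]   # one numpy array per dataset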
The h5py documentation is an excellent resource to improve your understanding of the HDF5 format, as the h5py API follows the C API closely.
I have a 20 GB library of images stored as a high-dimensional numpy array. This library allows me to use these images without having to generate them anew each time. Now my problem is that np.load("mylibrary") takes as much time as it would take to generate a couple of those images. Therefore my question is: is there a way to store a numpy array such that it is readily accessible without having to load it?
Edit: I am using PyCharm
I would suggest h5py which is a Pythonic interface to the HDF5 binary data format.
It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
You can also use PyTables. It is another HDF5 interface for Python and NumPy.
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, some examples of use and presentations here.
numpy.memmap is another option. It would, however, be slower than HDF5. Another limitation is that an array should be limited to about 2.5 GB.
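A rough sketch of the numpy.memmap approach (the filename, dtype and shape are placeholders and must match how the data was originally written):

import numpy as np

# Map the file into memory without reading it all at once.
images = np.memmap('mylibrary.dat', dtype=np.float32, mode='r',
                   shape=(10000, 256, 256))
first_image = np.array(images[0])   # only this slice is actually read from disk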
Let's say I have a dictionary of about 100k pairs of strings, and a numpy matrix of shape (100k, 500). I would like to save them to disk in the same file.
What I'm doing right now is using cPickle to dump the dictionary, and scipy.io.savemat to dump the matrix. This way, the dump / load is very fast. But the problem is that since I use different methods I obtain 2 files, and I would like to have just one file containing my 2 objects. How can I do this?
I could cPickle them both in the same file, but cPickle is incredibly slow on big arrays.
You could use dill. dill.dump accesses and uses the dump method from numpy to store an array or matrix object, so it's stored the same way it would be if you did it directly from the method on the numpy object. You'd just dill.dump the dictionary.
dill also has the ability to store pickles in compressed format, but it's slower. As mentioned in the comments, there's also joblib, which can also do the same as dill… but basically, joblib leverages cloudpickle (which is another serializer) or can also use dill, to do the serialization.
If you have a huge dictionary, and don't need all of the contents at once… maybe a better option would be klepto, which can use advanced serialization methods (from dill) to store a dict to several files on disk (or a database), where you have a proxy dict in memory that enables you to only get the entries you need.
All of these packages give you a fast unified dump for standard python and also for numpy objects.
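A minimal sketch of the dill route, with placeholder data standing in for the real dictionary and matrix:

import dill
import numpy as np

obj = {'pairs': {'key_a': 'value_a', 'key_b': 'value_b'},   # placeholder string pairs
       'matrix': np.random.random((1000, 500))}             # placeholder matrix

with open('data.pkl', 'wb') as f:
    dill.dump(obj, f)    # one file containing both objects

with open('data.pkl', 'rb') as f:
    obj2 = dill.load(f)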
I am working with information from big models, which means I have a lot of big ASCII files with two float columns (let's say X and Y). However, whenever I have to read these files it takes a long time, so I thought maybe converting them to binary files would make the reading process much faster.
I converted my ASCII files into binary files using the uu.encode(ascii_file, binary_file) command, and it worked quite well (I actually tested the decode part and recovered the same files).
My question is: is there anyway to read the binary files directly into python and get the data into two variables (x and y)?
Thanks!
You didn't specify how your float columns are represented in Python. The cPickle module is a fast general solution, with the drawback that it creates files readable only from Python, and that it should never be allowed to read untrusted data (received from the network). It is likely to just work with all regular datatypes, including numpy arrays.
If you can use numpy and store your data in numpy arrays, look into numpy.save and numpy.savetxt and the corresponding loading functions, which should offer performance superior to manually extracting the data.
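A minimal sketch of that route, with placeholder arrays standing in for the real X and Y columns:

import numpy as np

x = np.linspace(0.0, 1.0, 5)      # placeholder X column
y = x ** 2                        # placeholder Y column

np.save('xy.npy', np.column_stack((x, y)))   # single binary .npy file

data = np.load('xy.npy')
x2, y2 = data[:, 0], data[:, 1]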
array.array also has methods for writing array data to file, with the drawback that the array data is written in the native format and cannot be read from a different architecture.
Check out python's struct module. It's probably what you'd want to be using for reading and writing your data.
Instead of the suggested struct module, if your model is just floats/doubles (coordinates), you should look at the array module, which should be much faster than per-value operations with struct. The downside is that the collection is homogeneous, so you need to put the first values at odd indexes and the second at even indexes, or store the two columns sequentially.
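A small sketch of the array-module route with interleaved x/y values (the numbers are placeholders):

from array import array

xy = array('d', [1.0, 10.0, 2.0, 20.0, 3.0, 30.0])   # x and y interleaved
with open('data.bin', 'wb') as f:
    xy.tofile(f)

xy2 = array('d')
with open('data.bin', 'rb') as f:
    xy2.fromfile(f, 6)            # number of doubles to read back
x, y = xy2[0::2], xy2[1::2]       # de-interleave the two columns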
I recently came across Pytables and find it to be very cool. It is clear that they are superior to a csv format for very large data sets. I am running some simulations using python. The output is not so large, say 200 columns and 2000 rows.
If someone has experience with both, can you suggest which format would be more convenient in the long run for data sets of this size that are not very large? PyTables has data manipulation capabilities and browsing of the data with ViTables, but the browser does not have as much functionality as, say, Excel, which can be used for CSV. Similarly, do you find one better than the other for importing and exporting data, if working mainly in Python? Is one more convenient in terms of file organization? Any comments on issues such as these would be helpful.
Thanks.
Have you considered Numpy arrays?
PyTables is wonderful when your data is too large to fit in memory, but a 200x2000 matrix of 8-byte floats only requires about 3 MB of memory. So I think PyTables may be overkill.
You can save numpy arrays to files using np.savetxt or np.savez (for compression), and can read them from files with np.loadtxt or np.load.
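For instance, a quick sketch with a matrix of the size mentioned (the file names are arbitrary):

import numpy as np

results = np.random.random((2000, 200))             # placeholder simulation output

np.savez('results.npz', results=results)            # binary .npz container
np.savetxt('results.csv', results, delimiter=',')   # human-readable text

loaded = np.load('results.npz')['results']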
If you have many such arrays to store on disk, then I'd suggest using a database instead of numpy .npz files. By the way, to store a 200x2000 matrix in a database, you only need 3 table columns: row, col, value:
import sqlite3
import numpy as np

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('''CREATE TABLE foo
                  (row INTEGER,
                   col INTEGER,
                   value FLOAT,
                   PRIMARY KEY (row, col))''')

ROWS = 4
COLUMNS = 6
matrix = np.random.random((ROWS, COLUMNS))
print(matrix)
# [[ 0.87050721  0.22395398  0.19473001  0.14597821  0.02363803  0.20299432]
#  [ 0.11744885  0.61332597  0.19860043  0.91995295  0.84857095  0.53863863]
#  [ 0.80123759  0.52689885  0.05861043  0.71784406  0.20222138  0.63094807]
#  [ 0.01309897  0.45391578  0.04950273  0.93040381  0.41150517  0.66263562]]

# Store matrix in table foo
cursor.executemany('INSERT INTO foo(row, col, value) VALUES (?,?,?)',
                   ((r, c, value) for r, row in enumerate(matrix)
                                  for c, value in enumerate(row)))

# Retrieve matrix from table foo
cursor.execute('SELECT value FROM foo ORDER BY row, col')
data = [item[0] for item in cursor.fetchall()]
matrix2 = np.fromiter(data, dtype=float).reshape((ROWS, COLUMNS))
print(matrix2)
# [[ 0.87050721 0.22395398 0.19473001 0.14597821 0.02363803 0.20299432]
# [ 0.11744885 0.61332597 0.19860043 0.91995295 0.84857095 0.53863863]
# [ 0.80123759 0.52689885 0.05861043 0.71784406 0.20222138 0.63094807]
# [ 0.01309897 0.45391578 0.04950273 0.93040381 0.41150517 0.66263562]]
If you have many such 200x2000 matrices, you just need one more table column to specify which matrix.
As far as importing/exporting goes, PyTables uses a standardized file format called HDF5. Many scientific software packages (like MATLAB) have built-in support for HDF5, and the C API isn't terrible. So any data you need to export from or import to one of these languages can simply be kept in HDF5 files.
PyTables does add some attributes of its own, but these shouldn't hurt you. Of course, if you store Python objects in the file, you won't be able to read them elsewhere.
The one nice thing about CSV files is that they're human readable. However, if you need to store anything other than simple numbers in them and communicate with others, you'll have issues. I receive CSV files from people in other organizations, and I've noticed that humans aren't good at making sure things like string quoting are done correctly. It's good that Python's CSV parser is as flexible as it is. One other issue is that floating point numbers can't be stored exactly in text using decimal format. It's usually good enough, though.
One big plus for PyTables is the storage of metadata, like variables etc.
If you run the simulations more often with different parameters, you can store each result as an array entry in the h5 file.
We use it to store measurement data plus the experiment scripts that produced the data, so it is all self-contained.
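A rough sketch of storing a result array together with its parameters as attributes (the names and values are made up):

import numpy as np
import tables

results = np.random.random((2000, 200))      # placeholder simulation output

with tables.open_file('runs.h5', 'w') as f:
    arr = f.create_array('/', 'run_001', results)
    arr.attrs.temperature = 300.0            # metadata stored right next to the data
    arr.attrs.description = 'baseline run'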
BTW: If you need a quick look into an HDF5 file you can use HDFView. It's a free Java app from the HDF Group and is easy to install.
I think it's very hard to compare PyTables and CSV. PyTables is a data structure, while CSV is an exchange format for data.
This is actually quite related to another answer I've provided regarding reading / writing csv files w/ numpy:
Python: how to do basic data manipulation like in R?
You should definitely use numpy, no matter what else! The ease of indexing, etc. far outweighs the cost of the additional dependency (well, I think so). PyTables, of course, relies on numpy too.
Otherwise, it really depends on your application, your hardware and your audience. I suspect that reading in csv files of the size you're talking about won't matter in terms of speed compared to PyTables. But if that's a concern, write a benchmark! Read and write some random data 100 times. Or, if read times matter more, write once, read 100 times, etc.
I strongly suspect that PyTables will outperform SQL. SQL will rock on complex multi-table queries (especially if you do the same ones frequently), but even on single-table (so-called "denormalized") queries, PyTables is hard to beat in terms of speed. I can't find a reference for this off-hand, but you may be able to dig something up if you mine the links here:
http://www.pytables.org/moin/HowToUse#HintsforSQLusers
I'm guessing execute performance for you at this stage will pale in comparison to coder performance. So, above all, pick something that makes the most sense to you!
Other points:
As with SQL, PyTables has an undo feature. CSV files won't have this, but you can keep them in version control, and your VCS doesn't need to be too smart (CSV files are text).
On a related note, CSV files will be much bigger than binary formats (you can certainly write your own tests for this too).
These are not "exclusive" choices.
You need both.
CSV is just a data exchange format. If you use PyTables, you will still need to import and export data in CSV format.