Adding big matrices stored in HDF5 datasets - python

I have two HDF5 files with an identical structure, each storing a matrix of the same shape. I need to create a third HDF5 file with a matrix representing the element-wise sum of the two matrices mentioned above. Given that the matrices are extremely large (in the GB-TB range), what would be the best way to do it, preferably in a parallel way? I am using the h5py interface to the HDF5 library. Are there any libraries capable of doing it?

Yes, this is possible. The key is to access slices of the data from file1 and file2, do your element-wise sum, then write that slice of new data to file3. You can do this with h5py or PyTables (aka tables). No other libraries are required. I only have passing knowledge of parallel computing. I know h5py supports an MPI interface through the mpi4py Python package. Details here: h5py docs: Parallel HDF5
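If you do want to try the MPI route, a minimal sketch (not tested here, and assuming your h5py was built against a parallel HDF5 library with mpi4py installed) might look like the following. The dataset names match the serial example further down; each rank sums and writes its own share of the slices.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

# all ranks open the files collectively with the 'mpio' driver
with h5py.File('file1.h5', 'r', driver='mpio', comm=comm) as f1, \
     h5py.File('file2.h5', 'r', driver='mpio', comm=comm) as f2, \
     h5py.File('file3.h5', 'w', driver='mpio', comm=comm) as f3:
    ds1, ds2 = f1['data_1'], f2['data_2']
    # dataset creation must be collective (every rank executes it)
    ds3 = f3.create_dataset('data_3', shape=ds1.shape, dtype=ds1.dtype)
    # round-robin the slice index across ranks
    for i in range(rank, ds1.shape[2], nranks):
        ds3[:, :, i] = ds1[:, :, i] + ds2[:, :, i]
You would launch it with something like mpiexec -n 4 python sum_h5.py (the script name is just a placeholder).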
Here is a simple example. It creates 2 files, each with a dataset of random floats of shape=(10,10,10). It then creates a new file with an empty dataset of the same shape. The loop reads a slice of data from file1 and file2, sums them, then writes to the same slice in file3. To test with large data, you can modify the shapes to match your files.
21-Jan-2021 Update:
I added code to get the dataset shapes from file1 and file2, and compare them (to be sure they are equal). If the shapes aren't equal, I exit. If they match, I create the new file, then create a dataset of matching shape. (If you really want to be robust, you could do the same with the dtype.) I also use the value of shape[2] as the slice iterator over the dataset.
import h5py
import numpy as np
import sys

# create file1 with a dataset of random floats
arr = np.random.random(10**3).reshape(10, 10, 10)
with h5py.File('file1.h5', 'w') as h5fw:
    h5fw.create_dataset('data_1', data=arr)

# create file2 with a dataset of random floats
arr = np.random.random(10**3).reshape(10, 10, 10)
with h5py.File('file2.h5', 'w') as h5fw:
    h5fw.create_dataset('data_2', data=arr)

# open both files, get the dataset shapes, and compare them
h5fr1 = h5py.File('file1.h5', 'r')
f1shape = h5fr1['data_1'].shape
h5fr2 = h5py.File('file2.h5', 'r')
f2shape = h5fr2['data_2'].shape

if f1shape != f2shape:
    print('Dataset shapes do not match')
    h5fr1.close()
    h5fr2.close()
    sys.exit('Exiting due to error.')
else:
    with h5py.File('file3.h5', 'w') as h5fw:
        ds3 = h5fw.create_dataset('data_3', shape=f1shape, dtype='f')
        # loop over slices: read from file1 and file2, sum, write to file3
        for i in range(f1shape[2]):
            arr1_slice = h5fr1['data_1'][:, :, i]
            arr2_slice = h5fr2['data_2'][:, :, i]
            arr3_slice = arr1_slice + arr2_slice
            ds3[:, :, i] = arr3_slice
            # alternately, you can slice and sum in 1 line:
            # ds3[:,:,i] = h5fr1['data_1'][:,:,i] + h5fr2['data_2'][:,:,i]
    print('Done.')
    h5fr1.close()
    h5fr2.close()

Related

How to compare multiple hdf5 files

I have multiple h5py files (pixel-level annotations) for one image. Image masks are stored in the HDF5 files as key-value pairs, with the key being the id of some class. The masks (HDF5 files) all match the dimensions of their corresponding image and represent labels for the pixels in the image. I need to compare all the h5 files with one another and find the final mask that represents the majority.
But I don't know how to compare multiple h5 files in python. Can someone kindly help?
What do you mean by "compare"?
If you just want to compare the files to see if they are the same, you can use the h5diff utility from The HDF5 Group. It comes with the HDF5 installer. You can get more info about h5diff here: h5diff utility. Links to all HDF5 utilities are at the top of the page: HDF5 Tools
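If you want to drive h5diff from Python rather than the shell, a minimal sketch (assuming h5diff is on your PATH; the file names are placeholders) would be:
import subprocess

result = subprocess.run(['h5diff', 'mask_a.h5', 'mask_b.h5'],
                        capture_output=True, text=True)
print(result.stdout)                       # h5diff prints any differences it found
print('identical' if result.returncode == 0 else 'files differ (or error)')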
It sounds like you need to do more than that. Please clarify what you mean by "find out the final mask that represents the majority". Do you want to find the average image size (either mean, median, or mode)? If so, it is relatively straightforward (if you know Python) to open each file and get the dimensions of the image data (the shape of each dataset -- what you call the values). For reference, the key, value terminology is how h5py refers to HDF5 dataset names and datasets.
Here is a basic outline of the process to open 1 HDF5 file and loop through the datasets (by key name) to get the dataset shape (image size). For multiple files, you can add a for loop using the iglob iterator to get the HDF5 file names (see the sketch after Method 2 below). For simplicity, I saved the shape values to 3 lists and manually calculated the mean (sum()/len()). If you want to calculate the mask differently, I suggest using NumPy arrays. NumPy has mean and median functions built in. For mode, you need the scipy.stats module (it works on NumPy arrays).
Method 1: iterates on .keys()
import h5py

s0_list = []
s1_list = []
s2_list = []
# filename is the path to one of your HDF5 mask files
with h5py.File(filename, 'r') as h5f:
    for name in h5f.keys():
        shape = h5f[name].shape
        s0_list.append(shape[0])
        s1_list.append(shape[1])
        s2_list.append(shape[2])

print('Ave len axis=0:', sum(s0_list)/len(s0_list))
print('Ave len axis=1:', sum(s1_list)/len(s1_list))
print('Ave len axis=2:', sum(s2_list)/len(s2_list))
Method 2: iterates on .items()
s0_list = []
s1_list = []
s2_list = []
with h5py.File(filename, 'r') as h5f:
    for name, ds in h5f.items():
        shape = ds.shape
        s0_list.append(shape[0])
        s1_list.append(shape[1])
        s2_list.append(shape[2])

print('Ave len axis=0:', sum(s0_list)/len(s0_list))
print('Ave len axis=1:', sum(s1_list)/len(s1_list))
print('Ave len axis=2:', sum(s2_list)/len(s2_list))
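And here is the sketch mentioned above for multiple files: wrapping Method 1 in a loop over glob.iglob (the '*.h5' pattern is a placeholder for however your mask files are named).
from glob import iglob
import h5py

s0_list = []
s1_list = []
s2_list = []
for filename in iglob('*.h5'):
    with h5py.File(filename, 'r') as h5f:
        for name in h5f.keys():
            shape = h5f[name].shape
            s0_list.append(shape[0])
            s1_list.append(shape[1])
            s2_list.append(shape[2])

print('Ave len axis=0:', sum(s0_list)/len(s0_list))
print('Ave len axis=1:', sum(s1_list)/len(s1_list))
print('Ave len axis=2:', sum(s2_list)/len(s2_list))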

Is it possible to translate this Python code to Cython?

I'm actually looking to speed up #2 of this code by as much as possible, so I thought that it might be useful to try Cython. However, I'm not sure how to implement a sparse matrix in Cython. Can somebody show how to wrap it in Cython, or perhaps Julia, if that's possible, to make it faster?
#1) This part builds the u_dict dictionary of unique strings and enumerates them.
import scipy.sparse as sp
import numpy as np
from scipy.sparse import csr_matrix

full_dict = set(train1.values.ravel().tolist() + test1.values.ravel().tolist() +
                train2.values.ravel().tolist() + test2.values.ravel().tolist())
print(len(full_dict))
u_dict = dict()
for i, q in enumerate(full_dict):
    u_dict[q] = i

shape = (len(full_dict), len(full_dict))
H = sp.lil_matrix(shape, dtype=np.int8)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

#2) I need to speed up this part
# train_full is a pandas dataframe with two columns w1 and w2 filled with strings
H = load_sparse_csr('matrix.npz')
correlation_train = []
for idx, row in train_full.iterrows():
    if idx % 1000 == 0:
        print(idx)
    id_1 = u_dict[row['w1']]
    id_2 = u_dict[row['w2']]
    a_vec = H[id_1].toarray()   # these vectors are of length < 3 million
    b_vec = H[id_2].toarray()
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0][1])
While I contributed to How to properly pass a scipy.sparse CSR matrix to a cython function? quite some time ago, I doubt if cython is the way to go. Especially if you don't already have experience with numpy and cython. cython gives the biggest speedup when you replace iterative calculations with code that it can translate to C without calling numpy or other python code. Throw pandas into the mix and you have an even bigger learning curve.
And important parts of sparse code are already written with cython.
Without touching the cython issue I see a couple of problems.
H is defined twice:
H = sp.lil_matrix(shape, dtype=np.int8)
H = load_sparse_csr('matrix.npz')
That's either an oversight, or a failure to understand how Python variables are created and assigned. The 2nd assignment replaces the first; thus the first does nothing. In addition the first just makes an empty lil matrix. Such a matrix could be filled iteratively; while not fast it is the intended use of the lil format.
The 2nd expression creates a new matrix from data saved in an npz file. That involves the numpy npz file loader as well as the basic csr matrix creation code. And since the attributes are already in csr format, there's nothing for cython to touch.
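For reference, the intended lil workflow looks roughly like this (toy sizes, just to illustrate the point):
import numpy as np
import scipy.sparse as sp

H = sp.lil_matrix((5, 5), dtype=np.int8)   # empty lil matrix
H[0, 1] = 1                                # cheap incremental assignment in lil format
H[3, 4] = 2
H_csr = H.tocsr()                          # convert once, then use csr for arithmetic/slicing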
You do have an iteration here - but over a Pandas dataframe:
for idx, row in train_full.iterrows():
    id_1 = u_dict[row['w1']]
    a_vec = H[id_1].toarray()
Looks like you are picking a particular row of H based on a dictionary/array look up. Sparse matrix indexing is slow compared to dense matrix indexing. That is, if Ha = H.toarray() fits your memory then,
a_vec = Ha[id_1,:]
will be a lot faster.
Faster selection of rows (or columns) from a sparse matrix has been asked before. If you could work directly with the sparse data of a row I could recommend something more direct. But you want a dense array that you can pass to np.corrcoef, so we'd have to implement the toarray step as well.
How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?
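For concreteness, here is a hedged sketch of that toarray suggestion, assuming the full matrix fits in memory (H, u_dict and train_full are the objects from the question):
Ha = H.toarray()                          # one-time conversion; row access is then cheap
correlation_train = []
for idx, row in train_full.iterrows():
    a_vec = Ha[u_dict[row['w1']], :]      # plain NumPy row views, no sparse indexing
    b_vec = Ha[u_dict[row['w2']], :]
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0, 1])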

Efficiently store list of matrices

I have a large list of images stored as numpy matrices. The images have different sizes e.g.
import numpy as np
from numpy.random import rand
data = [rand(100,200), rand(1024, 768)]
I am looking for a way to store this list of matrices such that it can be read fast (writing the data can be slow). I tried pickle/numpy.savez, but reading the data was slower than loading the raw images again.
I think hdf5 may be fast, however I cannot figure out how to store this list. Not mandatory, but useful, would be a data format that allows appending data, so that the whole list does not have to be in memory at once.
Edit:
Based on the answers so far, I tried to time some of the suggestions:
import numpy as np
from numpy.random import rand
import h5py

data = [rand(1024, 768) for i in np.arange(100)]

def timenp():
    np.savez("test.npz", *data)
    d = np.load('test.npz')
    loaded = [d[f] for f in d]

def timebinary():
    with open("tmp.bin", "wb") as f:
        np.save(f, len(data))
        for img in data:
            np.save(f, img)
    with open("tmp.bin", "rb") as f:
        n = np.load(f)
        loaded = []
        for i in np.arange(n):
            loaded.append(np.load(f))

def timeh5py():
    with h5py.File('foo.hdf5', 'w') as f:
        dt = h5py.special_dtype(vlen=np.dtype('float32'))
        dset = f.create_dataset('data', (len(data),), dtype=dt)
        shapes = f.create_dataset('shapes', (len(data), 2), dtype='int32')
        dset[...] = [img.flatten() for img in data]
        shapes[...] = [img.shape for img in data]
    with h5py.File('foo.hdf5', 'r') as f:
        loaded = []
        for (img, shape) in zip(f['data'], f['shapes']):
            loaded.append(np.reshape(img, shape))
python -m cProfile timenp.py
452906 function calls (451141 primitive calls) in 9.256 seconds
python -m cProfile timebinary.py
73085 function calls (71340 primitive calls) in 4.945 seconds
python -m cProfile timeh5py.py
33151 function calls (32568 primitive calls) in 4.384 seconds
Try using the numpy savez function, which comes in both compressed and uncompressed versions.
In [276]: alist=[np.arange(10), np.arange(3), np.arange(100)]
If I save this as np.savez('test',alist), it saves the list as one object. If instead I expand the list with *, then it puts each list element in a separate file in the archive.
In [277]: np.savez('test',*alist)
In [278]: d=np.load('test.npz')
In [279]: list(d.keys())
Out[279]: ['arr_2', 'arr_1', 'arr_0']
In [280]: d['arr_0']
Out[280]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
With np.save (and by extension savez), arrays are stored in their own compact format, which consists of a header block with shape information, and the data block, which is essentially a byte copy of its data buffer. So an np.save of an array should be as efficient as any other method.
If you give np.save a non-array object, it will use that object's pickle method. But note that the pickle method for an array is the save method I just described. So a pickle of an array should still be efficient.
Keep in mind that npz files are lazy-loaded.
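A quick way to see the lazy loading: np.load on an .npz returns an NpzFile object, and each array is only read from the archive when you index it by name.
import numpy as np

np.savez('test.npz', *[np.arange(10), np.arange(3)])
d = np.load('test.npz')        # cheap: just opens the zip archive
print(d.files)                 # ['arr_0', 'arr_1']
first = d['arr_0']             # this access actually reads the array from disk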
With h5py, arrays are saved to named datasets. In a sense it is like the savez above - the elements of the list have to have names, whether generated automatically or by your code.
I don't know how h5py speeds compare with save(z).
h5py can handle arrays that are ragged in one dimension. I've explored that in previous SO questions. Storing multidimensional variable length array with h5py
How to save list of numpy.arrays of different shape with h5py?
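Along the lines of those linked answers, a short sketch of a ragged (variable-length) dataset in h5py, with illustrative names and sizes, would be:
import numpy as np
import h5py

data = [np.random.rand(100*200), np.random.rand(1024*768)]   # flattened images of different sizes
with h5py.File('ragged.h5', 'w') as f:
    dt = h5py.special_dtype(vlen=np.dtype('float64'))
    dset = f.create_dataset('images', (len(data),), dtype=dt)
    for i, img in enumerate(data):
        dset[i] = img            # each row can have a different length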

Chain datasets from multiple HDF5 files/datasets

The benefits and simple mapping that h5py provides (through HDF5) for persisting datasets on disk are exceptional. I run some analysis on a set of files and store the result into a dataset, one for each file. At the end of this step, I have a set of h5py.Dataset objects which contain 2D arrays. The arrays all have the same number of columns, but a different number of rows, i.e., (A,N), (B,N), (C,N), etc.
I would now like to access these multiple 2D arrays as a single 2D array. That is, I would like to read them on-demand as an array of shape (A+B+C, N).
For this purpose, the h5py.Link classes do not help, as they work at the level of HDF5 nodes.
Here is some pseudocode:
import numpy as np
import h5py
a = h5py.Dataset('a',data=np.random.random((100, 50)))
b = h5py.Dataset('b',data=np.random.random((300, 50)))
c = h5py.Dataset('c',data=np.random.random((253, 50)))
# I want to view these arrays as a single array
combined = magic_array_linker([a,b,c], axis=1)
assert combined.shape == (100+300+253, 50)
For my purposes, suggestions of copying the arrays into a new file do not work. I'm also open to solving this on the numpy level, but I don't find any suitable options with numpy.view or numpy.concatenate that would work without copying out the data.
Does anybody know of a way to view multiple arrays as a stacked set of arrays, without copying and from h5py.Dataset?
First up, I don't think there is a way to do this without copying the data in order to return a single array. As far as I can tell, it's not possible to concatenate numpy views into one array - unless, of course, you create your own wrapper.
Here I demonstrate a proof of concept using Object/Region references. The basic premise is that we make a new dataset in the file which is an array of references to the constituent subarrays. By storing references like this, the subarrays can change size dynamically and indexing the wrapper will always index the correct subarrays.
As this is just a proof of concept, I haven't implemented proper slicing, just very simple indexing. There's also no attempt at error checking - this will almost definitely break in production.
class MagicArray(object):
    """Magically index an array of references
    """
    def __init__(self, file, references, axis=0):
        self.file = file
        self.references = references
        self.axis = axis

    def __getitem__(self, items):
        # We need to modify the indices, so make sure items is a list
        items = list(items)
        for item in items:
            if hasattr(item, 'start'):
                # items is a slice object
                raise ValueError('Slices not implemented')
        for ref in self.references:
            size = self.file[ref].shape[self.axis]
            # Check if the requested index is in this subarray
            # If not, subtract the subarray size and move on
            if items[self.axis] < size:
                item_ref = ref
                break
            else:
                items[self.axis] = items[self.axis] - size
        return self.file[item_ref][tuple(items)]
Here's how you use it:
with h5py.File("/tmp/so_hdf5/test.h5", 'w') as f:
a = f.create_dataset('a',data=np.random.random((100, 50)))
b = f.create_dataset('b',data=np.random.random((300, 50)))
c = f.create_dataset('c',data=np.random.random((253, 50)))
ref_dtype = h5py.special_dtype(ref=h5py.Reference)
ref_dataset = f.create_dataset("refs", (3,), dtype=ref_dtype)
for i, key in enumerate([a, b, c]):
ref_dataset[i] = key.ref
with h5py.File("/tmp/so_hdf5/test.h5", 'r') as f:
foo = MagicArray(f, f['refs'], axis=0)
print(foo[104, 4])
print(f['b'][4,4])
This should be fairly trivial to extend to fancier indexing (i.e. being able to handle slices), but I can't see how to do so without copying data.
You might be able to subclass from numpy.ndarray and get all the usual methods as well.

How to efficiently construct a numpy array from a large set of data?

If I have a huge list of lists in memory and I wish to convert it into an array, does the naive approach cause Python to make a copy of all the data, taking twice the space in memory? Should I instead convert the list of lists vector by vector, popping entries as I go?
# for instance
list_of_lists = [[...], ..., [...]]
arr = np.array(list_of_lists)
Edit:
Is it better to create an empty array of a known size and then populate it incrementally, thus avoiding the list_of_lists object entirely? Could this be accomplished by something as simple as some_array[i] = some_list_of_float_values?
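(For reference, a minimal sketch of that incremental-fill idea, with made-up sizes:)
import numpy as np

n_rows, n_cols = 1000, 50                             # made-up sizes
arr = np.empty((n_rows, n_cols), dtype=np.float64)    # allocate once
for i in range(n_rows):
    some_list_of_float_values = [float(i)] * n_cols   # stand-in for a real row
    arr[i] = some_list_of_float_values                # row assignment copies the floats in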
I'm just putting this here as it's a bit long for a comment.
Have you read the numpy documentation for array?
numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
"""
...
copy : bool, optional
    If true (default), then the object is copied. Otherwise, a copy will
    only be made if __array__ returns a copy, if obj is a nested sequence,
    or if a copy is needed to satisfy any of the other requirements (dtype,
    order, etc.).
...
"""
When you say you don't want to copy the data of the original array when creating the numpy array, what data structure are you hoping to end up with?
A lot of the speed-up you get from using numpy is because the C arrays it creates are contiguous in memory. A list in Python is just an array of pointers to objects, so you have to go and find the objects every time - which isn't the case in numpy, as it's not written in Python.
If you just want the numpy array to reference the Python lists in your 2D structure, then you'll lose the performance gains.
If you do np.array(my_2D_python_array, copy=False) I don't know what it will actually produce, but you could easily test it yourself (see the quick test below). Look at the shape of the array, and see what kind of objects it houses.
If you want the numpy array to be contiguous though, at some point you're going to have to allocate all of the memory it needs (which, if it's as large as you're suggesting, sounds like it might be difficult to find a contiguous section large enough).
Sorry that was pretty rambling, just a comment. How big are the actual arrays you're looking at?
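For what it's worth, the quick test mentioned above is easy to run. One thing worth knowing: in NumPy 2.0 and later, copy=False raises an error when a copy cannot be avoided, which effectively answers the question.
import numpy as np

nested = [[1.0, 2.0], [3.0, 4.0]]
try:
    arr = np.array(nested, copy=False)   # older NumPy: silently copies anyway
    print(arr.shape, arr.dtype)          # (2, 2) float64 -> a real numeric array
    print(arr.flags['OWNDATA'])          # True: the data was copied from the lists
except ValueError:
    # NumPy >= 2.0 raises instead, because a copy is unavoidable for a list of lists
    print('copy=False cannot avoid a copy for a nested Python list')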
Here's a plot of the cpu usage and memory usage of a small sample program:
#Make a large python 2D array
N, M = 10000, 18750
print("%i x %i = %i doubles = %f GB" % (N, M, N * M, N*M*8/10**9))

#grab pid to monitor memory and cpu usage
import os
pid = os.getpid()
os.system("python moniter.py -p " + str(pid) + " &")

print("building python matrix")
large_2d_array = [[n + m*M for n in range(N)] for m in range(M)]

import numpy
from datetime import datetime

print(datetime.now(), "creating numpy array with copy")
np1 = numpy.array(large_2d_array, copy=True)
print(datetime.now(), "deleting array")
del(np1)

print(datetime.now(), "creating numpy array without copy")
np1 = numpy.array(large_2d_array, copy=False)
print(datetime.now(), "deleting array")
del(np1)
1, 2, and 3 are the points where each of the matrices finishes being created. Note that the native Python list of lists takes up much more memory than the numpy arrays - Python objects each have their own overhead, and the lists are lists of objects. For the numpy array this is not the case, so it is considerably smaller.
Also note that passing copy=False for a Python list has no effect - new data is always created. You could get around this by creating a numpy array of Python objects (using dtype=object), but I wouldn't advise it.
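For completeness, the dtype=object route looks like this (not recommended, as noted above): the array only stores references to the original Python lists, so no numeric data is copied, but you also lose the contiguous-memory benefits.
import numpy as np

rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
obj_arr = np.empty(len(rows), dtype=object)   # 1-D array of Python object slots
for i, r in enumerate(rows):
    obj_arr[i] = r                            # stores a reference to the list itself
print(obj_arr[0] is rows[0])                  # True: the element data was not copied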
