How to compare multiple hdf5 files - python

I have multiple h5py files (pixel-level annotations) for one image. The image masks are stored in the HDF5 files as key-value pairs, with the key being the id of some class. The masks (HDF5 files) all match the dimensions of their corresponding image and represent labels for the pixels in that image. I need to compare all the h5 files with one another and find the final mask that represents the majority.
But I don't know how to compare multiple h5 files in Python. Can someone kindly help?

What do you mean by "compare"?
If you just want to compare the files to see if they are the same, you can use the h5diff utility from The HDF5 Group. It comes with the HDF5 installer. You can get more info about h5diff here: h5diff utility. Links to all HDF5 utilities are at the top of the page: HDF5 Tools
It sounds like you need to do more than that. Please clarify what you mean by "find out the final mask that represents the majority". Do you want to find the average image size (either mean, median, or mode)? If so, it is relatively straightforward (if you know Python) to open each file and get the dimensions of the image data (the shape of each dataset, what you call the values). For reference, the key/value terminology is how h5py refers to HDF5 dataset names and datasets.
Here is a basic outline of the process to open one HDF5 file and loop through the datasets (by key name) to get each dataset's shape (image size). For multiple files, you can add a for loop using the iglob iterator to get the HDF5 file names. For simplicity, I saved the shape values to 3 lists and manually calculated the mean (sum()/len()). If you want to calculate the mask differently, I suggest using NumPy arrays; they have mean and median functions built in. For mode, you need the scipy.stats module (it works on NumPy arrays).
Method 1: iterates on .keys()
import h5py

s0_list = []
s1_list = []
s2_list = []

# filename: path to one of the HDF5 mask files
with h5py.File(filename, 'r') as h5f:
    for name in h5f.keys():
        shape = h5f[name].shape
        s0_list.append(shape[0])
        s1_list.append(shape[1])
        s2_list.append(shape[2])

print('Ave len axis=0:', sum(s0_list)/len(s0_list))
print('Ave len axis=1:', sum(s1_list)/len(s1_list))
print('Ave len axis=2:', sum(s2_list)/len(s2_list))
Method 2: iterates on .items()
s0_list = []
s1_list = []
s2_list = []

with h5py.File(filename, 'r') as h5f:
    for name, ds in h5f.items():
        shape = ds.shape
        s0_list.append(shape[0])
        s1_list.append(shape[1])
        s2_list.append(shape[2])

print('Ave len axis=0:', sum(s0_list)/len(s0_list))
print('Ave len axis=1:', sum(s1_list)/len(s1_list))
print('Ave len axis=2:', sum(s2_list)/len(s2_list))
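Back to the original goal: if "find the final mask that represents the majority" means a pixel-wise majority vote across the mask files, here is a minimal sketch of that idea, not tested against your data. It assumes each file holds a single mask dataset (named 'mask' here purely for illustration) and that all masks have the same shape; adjust the glob pattern and dataset name to match your files.
import glob
import h5py
import numpy as np
from scipy import stats

# hypothetical pattern for all mask files belonging to one image
mask_files = sorted(glob.iglob('image_001_mask_*.h5'))

masks = []
for fname in mask_files:
    with h5py.File(fname, 'r') as h5f:
        masks.append(h5f['mask'][:])      # read the full mask into memory

stacked = np.stack(masks, axis=0)         # shape: (n_files, height, width)

# pixel-wise majority vote: the most frequent class id at each pixel
majority = stats.mode(stacked, axis=0).mode
majority = np.squeeze(majority)           # drop the reduced axis if it is kept

with h5py.File('majority_mask.h5', 'w') as h5out:
    h5out.create_dataset('mask', data=majority)
If the masks are too large to hold in memory all at once, you could instead vote slice by slice (read the same slice from every file, vote, write), which is the same slicing pattern used in the next answer.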

Related

Adding big matrices stored in HDF5 datasets

I have two HDF5 files with an identical structure, each storing a matrix of the same shape. I need to create a third HDF5 file with a matrix representing the element-wise sum of the two matrices mentioned above. Given that the matrices are extremely large (in the GB to TB range), what would be the best way to do this, preferably in parallel? I am using the h5py interface to the HDF5 library. Are there any libraries capable of doing it?
Yes, this is possible. The key is to access slices of the data from file1 and file2, do the element-wise sum, then write that slice of new data to file3. You can do this with h5py or PyTables (aka tables); no other libraries are required. I only have passing knowledge of parallel computing, but I know h5py supports an MPI interface through the mpi4py Python package. Details here: h5py docs: Parallel HDF5
Here is a simple example. It creates 2 files with a dataset of random floats, shape=(10,10,10). It then creates a new file with an empty dataset of the same shape. The loop reads a slice of data from file1 and file2, sums them, then writes to the same slice in file3. To test with large data, you can modify the shapes to match your file.
21-Jan-2021 Update:
I added code to get the dataset shapes from file1 and file2, and compare them (to be sure they are equal). If the shapes aren't equal, I exit. If they match, I create the new file, then create a dataset of matching shape. (If you really want to be robust, you could do the same with the dtype.) I also use the value of shape[2] as the slice iterator over the dataset.
import h5py
import numpy as np
import sys

arr = np.random.random(10**3).reshape(10, 10, 10)
with h5py.File('file1.h5', 'w') as h5fw:
    h5fw.create_dataset('data_1', data=arr)

arr = np.random.random(10**3).reshape(10, 10, 10)
with h5py.File('file2.h5', 'w') as h5fw:
    h5fw.create_dataset('data_2', data=arr)

h5fr1 = h5py.File('file1.h5', 'r')
f1shape = h5fr1['data_1'].shape
h5fr2 = h5py.File('file2.h5', 'r')
f2shape = h5fr2['data_2'].shape

if f1shape != f2shape:
    print('Dataset shapes do not match')
    h5fr1.close()
    h5fr2.close()
    sys.exit('Exiting due to error.')
else:
    with h5py.File('file3.h5', 'w') as h5fw:
        ds3 = h5fw.create_dataset('data_3', shape=f1shape, dtype='f')
        # loop over slices along the last axis: read, sum, write
        for i in range(f1shape[2]):
            arr1_slice = h5fr1['data_1'][:, :, i]
            arr2_slice = h5fr2['data_2'][:, :, i]
            arr3_slice = arr1_slice + arr2_slice
            ds3[:, :, i] = arr3_slice
            # alternately, you can slice and sum in 1 line:
            # ds3[:,:,i] = h5fr1['data_1'][:,:,i] + h5fr2['data_2'][:,:,i]
    print('Done.')
    h5fr1.close()
    h5fr2.close()
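On the parallel question: this is only a rough sketch of how the mpi4py route mentioned above could look, assuming h5py was built with MPI support (h5py.get_config().mpi is True) and reusing the same dataset names as the example; I have not benchmarked it.
# run with e.g.: mpiexec -n 4 python sum_h5_mpi.py   (hypothetical script name)
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD

with h5py.File('file1.h5', 'r', driver='mpio', comm=comm) as f1, \
     h5py.File('file2.h5', 'r', driver='mpio', comm=comm) as f2, \
     h5py.File('file3.h5', 'w', driver='mpio', comm=comm) as f3:
    shape = f1['data_1'].shape
    # dataset creation is collective: every rank must make this call
    ds3 = f3.create_dataset('data_3', shape=shape, dtype='f')
    # each rank sums every comm.size-th slice along the last axis
    for i in range(comm.rank, shape[2], comm.size):
        ds3[:, :, i] = f1['data_1'][:, :, i] + f2['data_2'][:, :, i]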

How can I loop over HDF5 groups in Python removing rows according to a mask?

I have an HDF5 file containing a number of different groups all of which have the same number of rows. I also have a Boolean mask for rows to keep or remove. I would like to iterate over all groups in the HDF5 file removing rows according to the mask.
The recommended method to recursively visit all groups is visit(callable), but I can't work out how to pass my mask to the callable.
Here is some code hopefully demonstrating what I would like to do but which doesn't work:
def apply_mask(name, *args):
    h5obj[name] = h5obj[name][mask]

with h5py.File(os.path.join(directory, filename), 'r+') as h5obj:
    h5obj.visit(apply_mask, mask)
Which results in the error
TypeError: visit() takes 2 positional arguments but 3 were given
How can I get my mask array into this function?
I eventually achieved this with a series of hacky workarounds. If there is a better solution I'd be interested to know about it!
with h5py.File(os.path.join(directory, filename), 'r+') as h5obj:
    # Use the visit callable to append to a list of key names
    h5_keys = []
    h5obj.visit(h5_keys.append)
    # Then loop over those keys and, if they're datasets rather than
    # groups, remove the invalid rows
    for h5_key in h5_keys:
        if isinstance(h5obj[h5_key], h5py.Dataset):
            tmp = np.array(h5obj[h5_key])[mask]
            # There is no way to simply change the dataset because its
            # shape is fixed, causing a broadcast error, so it is
            # necessary to delete and then recreate it.
            del h5obj[h5_key]
            h5obj.create_dataset(h5_key, data=tmp)
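One cleaner alternative, sketched below and not tested against your file, is to give visititems() a callable that already closes over the mask, so no extra argument has to be passed. It applies the same delete-and-recreate idea.
import h5py

def apply_mask_to_file(path, mask):
    with h5py.File(path, 'r+') as h5obj:
        masked = {}

        def collect(name, obj):
            # `mask` is captured from the enclosing scope, so visititems()
            # only ever sees the (name, object) pair it expects
            if isinstance(obj, h5py.Dataset):
                masked[name] = obj[...][mask]

        h5obj.visititems(collect)

        # dataset shapes are fixed, so delete and recreate each one
        for name, data in masked.items():
            del h5obj[name]
            h5obj.create_dataset(name, data=data)

# usage: apply_mask_to_file(os.path.join(directory, filename), mask)
functools.partial can achieve the same effect if you prefer not to nest functions.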

Import csv row as array in tensorflow

I have a csv file containing a large number N of columns: the first column contains the label, the other N-1 a numeric representation of my data (Chroma features from a music recording).
My idea is to represent the input data as an array. In practice, I want an equivalent of the standard representation of data in computer vision. Since my data is stored in a csv file, I need a csv parser inside the definition of the train input function. I do it this way:
def parse_csv(line):
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)  # take a line at a time
    features = {'songID': columns[0], 'x': columns[1:]}  # create a dictionary out of the features
    labels = features.pop('songID')  # define the label
    return features, labels

def train_input_fn(data_file=fp, batch_size=128):
    """Generate an input function for the Estimator."""
    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)
    dataset = dataset.map(parse_csv)
    dataset = dataset.shuffle(1_000_000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
However, this returns an error that is not very informative: AttributeError: 'list' object has no attribute 'get_shape'. I know that the culprit is the definition of x inside the features dictionary, but I don't know how to correct it because, fundamentally, I don't really grok TensorFlow's data structures yet.
As it turns out, features need to be tensors. However, each column is a tensor in itself, and taking columns[1:] results in a list of tensors. To create a higher-dimensional tensor that stores the information from the N-1 columns, one should use tf.stack:
features = {'songID': columns[0], 'x': tf.stack(columns[1:])}  # create a dictionary out of the features
tf.stack should solve it.
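For quick reference, the corrected parser could look like the sketch below; the DEFAULTS list here is a placeholder (in the question it is defined elsewhere), with one default per CSV column.
import tensorflow as tf

# placeholder record defaults: one string label column plus N-1 float feature columns
N = 13  # hypothetical column count
DEFAULTS = [['']] + [[0.0]] * (N - 1)

def parse_csv(line):
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)       # take a line at a time
    features = {'songID': columns[0], 'x': tf.stack(columns[1:])}  # N-1 column tensors -> one tensor
    labels = features.pop('songID')                                # define the label
    return features, labels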
There is a complete code example available in the following thread.
Tensorflow Python reading 2 files

num_buckets as a parameter in a tensorflow feature column

Currently, the TensorFlow documentation defines a categorical vocabulary column this way:
vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key="feature_name_from_input_fn",
    vocabulary_list=["kitchenware", "electronics", "sports"])
However, this supposes that we input the vocabulary list manually.
In the case of a large dataset with many columns and many unique values, I would like to automate the process this way:
for k in categorical_feature_names:
    vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
        key="feature_name_from_input_fn",
        vocabulary_list=list_of_unique_values_in_the_column)
To do so, I need to retrieve the parameter list_of_unique_values_in_the_column.
Is there any way to do that with TensorFlow?
I know there is tf.unique, which can return the unique values in a tensor, but I don't see how I could feed the column to it so that it returns the right vocabulary list.
If list_of_unique_values_in_the_column is known, you can save the values in a file and read them with tf.feature_column.categorical_column_with_vocabulary_file. If they are unknown, you can use tf.feature_column.categorical_column_with_hash_bucket with a large enough hash_bucket_size.
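As a rough sketch of both options (the pandas step and file names are assumptions about how the raw data might be loaded, not part of the original answer):
import pandas as pd
import tensorflow as tf

df = pd.read_csv('train.csv')  # hypothetical raw training data

feature_columns = []
for k in categorical_feature_names:
    # Option 1: write each column's unique values to a file and point TF at it
    vocab = df[k].dropna().astype(str).unique()
    vocab_file = '{}_vocab.txt'.format(k)
    with open(vocab_file, 'w') as fh:
        fh.write('\n'.join(vocab))
    feature_columns.append(
        tf.feature_column.categorical_column_with_vocabulary_file(
            key=k, vocabulary_file=vocab_file, vocabulary_size=len(vocab)))

# Option 2: skip the vocabulary entirely and hash into a large enough bucket count
# feature_columns = [
#     tf.feature_column.categorical_column_with_hash_bucket(
#         key=k, hash_bucket_size=10000)
#     for k in categorical_feature_names]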

numpy - incrementally rebuild an ndarray by filtering unneeded members

Consider that I have an ndarray:
>>> all_data.shape
(220000, 28, 28)
>>> type(all_data)
numpy.ndarray
I want to go over each member of this array and filter out those that I don't want. As a result, I want to get a new ndarray with the same structure (the same shape apart from the first axis).
Something like:
from hashlib import md5
import numpy as np

# save the first image and its label in separate arrays;
# we will store unique values
sanitized_data = all_data[0]
sanitized_labels = all_labels[0]

# let's eliminate dupes
# store of existing hashes
hashes = set()

# go over each image
for i in range(0, len(all_labels)):
    # check if its hash is in the set of hashes
    if not md5(all_data[i]).hexdigest() in hashes:
        # record its hash and copy it to the new dataset
        sanitized_data = np.stack((sanitized_data, all_data[i]))
        sanitized_labels = np.stack((sanitized_labels, all_labels[i]))
        hashes.add(md5(all_data[i]).hexdigest())
But I get:
ValueError: all input arrays must have the same shape
I am not sure how to do this properly. I want to incrementally add a new array along the first axis once I find an array I want to keep. How do I do this properly with numpy? I looked at dstack for that, but it seems to stack along the wrong axis.
Copied from comments:
It is better to accumulate the component arrays in a list and apply concatenate once to the whole list. Also, get in the habit of checking dimensions as you go along.
@hpaulj's last suggestion worked, thanks!
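For completeness, here is a sketch of that suggestion applied to the loop above (list names such as kept_data are mine; everything else follows the question's code):
from hashlib import md5
import numpy as np

kept_data = []
kept_labels = []
hashes = set()

for i in range(len(all_labels)):
    h = md5(all_data[i]).hexdigest()
    if h not in hashes:
        hashes.add(h)
        kept_data.append(all_data[i])      # plain Python lists grow cheaply
        kept_labels.append(all_labels[i])

# a single stack call at the end rebuilds the (n_unique, 28, 28) array
sanitized_data = np.stack(kept_data, axis=0)
sanitized_labels = np.array(kept_labels)
print(sanitized_data.shape, sanitized_labels.shape)   # check dimensions as you go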
