Extract text from a reference file in h5py - python

I have downloaded an .h5 file which has various objects (most of them of the format data.dat), including one named History.txt. Upon accessing it, it shows <HDF5 dataset "History.txt": shape (), type "|O">. I am not able to access the text inside this object. From what I have read, type "|O" indicates an object/reference type. Converting it to np.array shows an output in which all the lines and text are squashed together. Is there a way to extract/read the text in this object?
The code is as follows:
data_0180 = h5py.File('file.h5', 'r+')
data_0180['OutermostExtraction.dir'].keys()
Output of this has many keys, I've written the first few:
<KeysViewHDF5 ['History.txt', 'Y_l2_m-1.dat', 'Y_l2_m-2.dat', 'Y_l2_m0.dat']>
These .dat keys contain data, while this History.txt contains some kind of information about the file and that data. I want to read that information. When I try to print it:
print(data_0180['OutermostExtraction.dir/History.txt'])
it shows the following output:
<HDF5 dataset "History.txt": shape (), type "|O">
Converting it to np.array shows the following output (I have mentioned only the first couple of lines, the output is large)
array('WaveformModes_4872 = scri.SpEC.read_from_h5("SimulationAnnex/Catalog/NonSpinningSurrogate/
BBH_CFMS_d18_q1_sA_0_0_0_sB_0_0_0/Lev4/rhOverM_Asymptotic_GeometricUnits.h5/Extrapolated_N2.dir",
**{})\n# # WaveformBase.ensure_validity(WaveformModes_4872, alter=True, assertions=True)\n# WaveformModes.ensure_validity(WaveformModes_4872, alter=True,
assertions=True)\n# hostname = kingcrab\n#
cwd = /mnt/raid-project/nr/woodford\n# datetime = 2017-01-16T19:24:54.304846\n# scri.__version__ = 2016.10.10.devc5096f2\n# spherical_functions.__version__ = 2016.08.30.dev77191149\n',
dtype=object)
with the shape of the array as (). How do I extract/read the text in this object?

As mentioned in my comments, you have a variable length string. NumPy doesn't have a dtype for variable length strings, so h5py stores them with an object dtype (along with some metadata). Adding to the "fun", the array is a scalar (0-d) array, so you can't access elements with typical NumPy indexing. This is where .item() comes to the rescue. Add .decode() to finish the process. Your code should look something like this:
data_str = np.array(data_0180['OutermostExtraction.dir/History.txt']).item().decode('utf-8')
To demonstrate the behavior (start to finish), I wrote an example starting from the variable length string example in the h5py documentation. First it creates a file ('foo.hdf5') with a dataset of variable length string data -- I used the first 3 lines from your output.
After writing, the file is closed and reopened in read-only mode. The dataset is read into an array (arr), then .item() is used to access the element and .decode('utf-8') to decode it (assumes it was encoded as 'utf-8'). The last line shows how to pull all these methods together in a single Python statement that returns a decoded string (variable data_str). Once you have the string, you can parse it as needed.
Example Below:
import h5py
import numpy as np

vlstr = 'WaveformModes_4872 = scri.SpEC.read_from_h5("SimulationAnnex/Catalog/NonSpinningSurrogate/' + \
        'BBH_CFMS_d18_q1_sA_0_0_0_sB_0_0_0/Lev4/rhOverM_Asymptotic_GeometricUnits.h5/Extrapolated_N2.dir",' + \
        '**{})\n# # WaveformBase.ensure_validity(WaveformModes_4872, alter=True, assertions=True)\n# WaveformModes.ensure_validity(WaveformModes_4872, alter=True,'

dt = h5py.string_dtype(encoding='utf-8')
with h5py.File('foo.hdf5', 'w') as h5f:
    ds = h5f.create_dataset('VLDS', dtype=dt, data=vlstr)

with h5py.File('foo.hdf5', 'r') as h5f:
    ds = h5f['VLDS']
    print('dtype=', ds.dtype)
    print('string dtype=', h5py.check_string_dtype(ds.dtype))
    arr = np.array(ds)
    print('\n', arr.item().decode('utf-8'))
    data_str = np.array(h5f['VLDS']).item().decode('utf-8')
    print('\n', data_str)
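Side note: if you are on h5py 3.x, string datasets also expose an .asstr() accessor that handles the decoding for you; a minimal sketch against the same 'foo.hdf5' file created above:

import h5py

# h5py >= 3.0: .asstr() makes reads return str instead of bytes
with h5py.File('foo.hdf5', 'r') as h5f:
    data_str = h5f['VLDS'].asstr()[()]
    print(data_str)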

Related

How do you make a tfrecord such that you can access features using a string key

I am trying to use someone else's code where it appears they are able to access a feature in a tfrecord example by simply subscripting with a string. Here is a brief version of their code
def foo(example):
    text = example["text"]
    subtokens = some_other_function(text)
    features = {
        "my_subtokens": subtokens}
    return(features)

input_files = ['test.tfrecord']
d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.map(foo)
The key line in there is text = example["text"]. How are they able to access a feature simply by subscripting the example with a string? Every time I try to write a tfrecord and then use a string as a key, I get the error TypeError: Only integers, slices (':'), ellipsis ('...'), tf.newaxis ('None') and scalar tf.int32/tf.int64 tensors are valid indices
To make my tfrecord, I just copied the code exactly from this website https://www.tensorflow.org/tutorials/load_data/tfrecord
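For what it's worth, subscripting like example["text"] only works after the serialized records have been parsed into a dict of tensors, typically by mapping a parse function over the dataset first; a minimal sketch, where the "text" feature name and its type are assumptions about your records:

import tensorflow as tf

# Assumed feature spec: each serialized record stores one bytes feature named "text"
feature_spec = {"text": tf.io.FixedLenFeature([], tf.string)}

def parse(serialized):
    # parse_single_example returns a dict of tensors,
    # which is what makes string-key access possible downstream
    return tf.io.parse_single_example(serialized, feature_spec)

d = tf.data.TFRecordDataset(["test.tfrecord"])
d = d.map(parse)  # each element is now a dict keyed by strings
for example in d.take(1):
    print(example["text"])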

How to convert numpy array of long string to list with only one element

I read a file that lists filenames, one per line, with np.loadtxt.
For example, the txt file content is:
000914_0017_01_0017_P00_01.tifresize.jpg
000925_0017_01_0006_P00_01.tifresize.jpg
000919_0017_01_0012_P00_01.tifresize.jpg
This txt file is named split_file_name.
I use for loops to decode each file name and do some processing for each image as:
for file_name in list(np.loadtxt(split_file_name, dtype=bytes)):
    file_name.decode("utf-8")
    # other processing...
The output for list(np.loadtxt(split_file_name, dtype=bytes)) is:
[b'000914_0017_01_0017_P00_01.tifresize.jpg',
b'000925_0017_01_0006_P00_01.tifresize.jpg',
b'000919_0017_01_0012_P00_01.tifresize.jpg']
However, when there is only one line in the file split_file_name, such as:
000914_0017_01_0017_P00_01.tifresize.jpg
after using np.loadtxt() the output is:
array(b'000914_0017_01_0017_P00_01.tifresize.jpg',
dtype='|S40')
When I use list(np.loadtxt(split_file_name, dtype=bytes)) in this case, it fails with
TypeError: iteration over a 0-d array.
The reason is that np.loadtxt returns a 0-d numpy array holding one long string, and list() cannot convert that directly to a list containing the single element b'000914_0017_01_0017_P00_01.tifresize.jpg'.
What should I do to make this work for a one-line txt file as well?
You can use tolist:
In [1]: [np.array(b'000914_0017_01_0017_P00_01.tifresize.jpg', dtype='|S40').tolist()]
Out[1]: ['000914_0017_01_0017_P00_01.tifresize.jpg']
UPDATE:
You can also extend / improve this to the following with using flatten:
In [2]: np.array([np.array(b'000914_0017_01_0017_P00_01.tifresize.jpg', dtype='|S40').tolist()]).flatten()
Out[2]:
array(['000914_0017_01_0017_P00_01.tifresize.jpg'],
dtype='|S40')
This also works with loadtxt:
In [3]: np.array([np.loadtxt('split_file_name', dtype=bytes).tolist()]).flatten()
Out[3]:
array(['000914_0017_01_0017_P00_01.tifresize.jpg',
'000925_0017_01_0006_P00_01.tifresize.jpg',
'000919_0017_01_0012_P00_01.tifresize.jpg'],
dtype='|S40')
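Side note: np.loadtxt also accepts an ndmin argument, which sidesteps the 0-d case entirely; a minimal sketch with the same split_file_name file:

import numpy as np

# ndmin=1 guarantees at least a 1-d array, even when the file has a single line
names = list(np.loadtxt('split_file_name', dtype=bytes, ndmin=1))
print([name.decode('utf-8') for name in names])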

How can I create statically typed, n dimensional arrays in python 2.7?

I am working on a python module that allows people to write python scripts for automation of labview test systems. The idea is that the module will have methods to allow the user to get test input parameters from labview, do some math in python, send the parameters back to labview and run a test.
Labview uses statically typed data, especially in arrays. If data of a different type is written to a labview array, labview coerces it to the static type of the array (e.g. a float written to an array of int32 will be stored as an int32).
When python gets data from labview (using labview's "flatten to xml"), the data is transmitted along with a list of dimension sizes (one size for each dimension) and its data type. Data MUST be returned with the same data type. I would like to create data structures (dictionaries, lists, tuples) of those data types. However, with python 2.7, if an array were a list (or a list of lists in the case of a multi-dimensional array), dynamic typing would allow someone to, say, divide two integers and write the resulting float into an array that should be statically typed to integers.
Are there methods to: 1) raise an exception if a user's script tries to write the wrong data type? or 2) have the program silently cast the value to the type of the data? I think I would prefer to raise an exception.
I would like to know if there is some efficient and generalized method for creating statically typed multidimensional arrays which would either raise an exception or coerce the data to the desired type.
Another issue I am struggling with is how to dynamically create n-dimensional arrays. I was looking at list comprehensions, but most of the examples, documentation, and stackoverflow answers only deal with two or three dimensions of known sizes. In this case, I will have a list of dimension sizes with an unknown number of dimensions to start with.
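For what it's worth, numpy can address both points; a minimal sketch, where checked_set is a hypothetical helper (not numpy API) and the dimension sizes stand in for values parsed from the XML:

import numpy as np

# Build an n-dimensional array from a runtime list of dimension sizes:
dimSizes = [2, 3, 4]  # stand-in for values parsed from <Dimsize> elements
arr = np.zeros(tuple(dimSizes), dtype=np.dtype('i4'))

# NumPy coerces silently on assignment (arr[0, 0, 0] = 1.9 stores 1),
# so a helper can raise an exception instead:
def checked_set(arr, index, value):
    # hypothetical helper: reject values that cannot be cast to the
    # array's static dtype without changing numeric kind
    if not np.can_cast(np.asarray(value).dtype, arr.dtype, casting='same_kind'):
        raise TypeError('cannot write {0} into an array of {1}'.format(
            np.asarray(value).dtype, arr.dtype))
    arr[index] = value

checked_set(arr, (0, 0, 0), 7)  # fine: integer into int32
try:
    checked_set(arr, (0, 0, 1), 3.5)  # float into int32
except TypeError as e:
    print(e)  # raised instead of silently truncating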
As an example of what comes in from Labview, here is what a 2 dimensional array looks like in xml:
<Array>
<Name>My2DArray</Name>
<Dimsize>2</Dimsize>
<Dimsize>2</Dimsize>
<I32>
<Name>Numeric</Name>
<Val>-1</Val>
</I32>
<I32>
<Name>Numeric</Name>
<Val>1</Val>
</I32>
<I32>
<Name>Numeric</Name>
<Val>0</Val>
</I32>
<I32>
<Name>Numeric</Name>
<Val>3</Val>
</I32>
</Array>
Here is what I have tried; it seems to work OK, but may not be the best way of doing it. If "node" is an etree document of the above XML:
import numpy as np

def XMLType_to_dType(XMLType):
    switch = {
        'Boolean': np.dtype('b'),
        'String' : np.dtype('S'),
        'I8' : np.dtype('i1'),
        'I16': np.dtype('i2'),
        'I32': np.dtype('i4'),
        'I64': np.dtype('i8'),
        'U8' : np.dtype('u1'),
        'U16': np.dtype('u2'),
        'U32': np.dtype('u4'),
        'U64': np.dtype('u8'),
        'SGL': np.dtype('f4'),  # LabVIEW single precision is a 32-bit float
        'DBL': np.dtype('f8'),  # LabVIEW double precision is a 64-bit float
        'EXT': np.dtype(np.longdouble),   # LabVIEW extended precision
        'CSG': np.dtype('c8'),   # complex single: two 32-bit floats
        'CDB': np.dtype('c16'),  # complex double: two 64-bit floats
        'CXT': np.dtype(np.clongdouble),  # complex extended
    }
    dType = switch.get(XMLType)
    assert dType is not None, 'Error in XMLType_to_dType: unrecognized XMLType: {0}'.format(XMLType)
    return dType

def parseArray(node):
    dimSizes = []
    tempList = []
    created = False  # set to True once the first data element fixes the dtype
    for ele in node:
        tag = ele.tag
        if tag == 'Name':
            name = ele.text
        elif tag == 'Dimsize':
            dimSizes.append(int(ele.text))
        else:  # after the <Name> and <Dimsize> elements, only data elements remain
            if not created:  # the first data element determines the dtype
                XMLType = tag
                dType = XMLType_to_dType(XMLType)
                created = True
            assert tag == XMLType, 'XMLType: {0} does not equal Data Type: {1}'.format(XMLType, tag)  # all data elements should share one type
            for item in ele:
                if item.tag == 'Val':
                    tempList.append(item.text)
    tempArray = np.reshape(np.asarray(tempList, dtype=dType), tuple(dimSizes))  # build the numpy array here
    return [name, tempArray]
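For what it's worth, a quick way to exercise parseArray on the 2x2 sample above (assuming xml.etree.ElementTree):

import xml.etree.ElementTree as ET

xml_text = """<Array>
<Name>My2DArray</Name>
<Dimsize>2</Dimsize>
<Dimsize>2</Dimsize>
<I32><Name>Numeric</Name><Val>-1</Val></I32>
<I32><Name>Numeric</Name><Val>1</Val></I32>
<I32><Name>Numeric</Name><Val>0</Val></I32>
<I32><Name>Numeric</Name><Val>3</Val></I32>
</Array>"""

name, arr = parseArray(ET.fromstring(xml_text))
print(name)  # My2DArray
print(arr)   # [[-1  1]
             #  [ 0  3]]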

Reading a Matlab's cell array saved as a v7.3 .mat file with H5py

I saved a cell array as a .mat file in Matlab as follows:
test = {'hello'; 'world!'};
save('data.mat', 'test', '-v7.3')
How can I import it as the list of strings in Python with H5py?
I tried
f = h5py.File('data.mat', 'r')
print f.get('test')
print f.get('test')[0]
This prints out:
<HDF5 dataset "test": shape (1, 2), type "|O8">
[<HDF5 object reference> <HDF5 object reference>]
How can I dereference it to get the list of strings ['hello', 'world!'] in Python?
Writing in Matlab:
test = {'Hello', 'world!'; 'Good', 'morning'; 'See', 'you!'};
save('data.mat', 'test', '-v7.3') % v7.3 so that it is readable by h5py
Reading in Python (works for any number of rows or columns, but assumes that each cell is a string):
import h5py
import numpy as np
data = []
with h5py.File("data.mat") as f:
for column in f['test']:
row_data = []
for row_number in range(len(column)):
row_data.append(''.join(map(unichr, f[column[row_number]][:])))
data.append(row_data)
print data
print np.transpose(data)
Output:
[[u'Hello', u'Good', u'See'], [u'world!', u'morning', u'you!']]
[[u'Hello' u'world!']
[u'Good' u'morning']
[u'See' u'you!']]
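Side note: the snippet above is Python 2 (unichr, print statements); an equivalent Python 3 sketch, under the same assumption that every cell is a MATLAB char array stored as numeric codepoints:

import h5py

data = []
with h5py.File('data.mat', 'r') as f:
    for column in f['test']:
        row_data = []
        for ref in column:
            # each entry is an HDF5 object reference; dereference via the file object
            row_data.append(''.join(chr(int(c)) for c in f[ref][:].flatten()))
        data.append(row_data)
print(data)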
This answer should be seen as an addition to Franck Dernoncourt's answer, which totally suffices for all cell arrays that contain 'flat' data (for mat files of version 7.3 and probably above).
I encountered a case where I had nested data (e.g. 1 row of cell arrays inside a named cell array). I managed to get my hands on the data by doing the following:
# assumption:
# idx_of_interest specifies the index of the cell array we are interested in
# (at the second level)
with h5py.File(file_name) as f:
    data_of_interest_reference = f['cell_array_name'][idx_of_interest, 0]
    data_of_interest = f[data_of_interest_reference]
Reason this works for nested data:
If you look at the type of the dataset you want to retrieve at a deeper level, it says 'h5py.h5r.Reference'. In order to actually retrieve the data the reference points to, you need to provide that reference to the file object.
I know this is an old question. But I found a package to scratch that itch:
hdf5storage
It can be installed by pip and works nicely on python 3.6 for both pre and post 7.3 matlab files. For older files it calls scipy.io.loadmat according to the docs.
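A minimal sketch of that route (assuming the 'data.mat' file from this question):

import hdf5storage

# loadmat mirrors scipy.io.loadmat's interface but also reads v7.3 files
mat = hdf5storage.loadmat('data.mat')
print(mat['test'])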

How to read HDF5 files in Python

I am trying to read data from an hdf5 file in Python. I can open the hdf5 file using h5py, but I cannot figure out how to access the data within the file.
My code
import h5py
import numpy as np
f1 = h5py.File(file_name,'r+')
This works and the file is read. But how can I access data inside the file object f1?
Read HDF5
import h5py
filename = "file.hdf5"
with h5py.File(filename, "r") as f:
# Print all root level object names (aka keys)
# these can be group or dataset names
print("Keys: %s" % f.keys())
# get first object name/key; may or may NOT be a group
a_group_key = list(f.keys())[0]
# get the object type for a_group_key: usually group or dataset
print(type(f[a_group_key]))
# If a_group_key is a group name,
# this gets the object names in the group and returns as a list
data = list(f[a_group_key])
# If a_group_key is a dataset name,
# this gets the dataset values and returns as a list
data = list(f[a_group_key])
# preferred methods to get dataset values:
ds_obj = f[a_group_key] # returns as a h5py dataset object
ds_arr = f[a_group_key][()] # returns as a numpy array
Write HDF5
import h5py
# Create random data
import numpy as np
data_matrix = np.random.uniform(-1, 1, size=(10, 3))
# Write data to HDF5
with h5py.File("file.hdf5", "w") as data_file:
data_file.create_dataset("dataset_name", data=data_matrix)
See h5py docs for more information.
Alternatives
JSON: Nice for writing human-readable data; VERY commonly used (read & write)
CSV: Super simple format (read & write)
pickle: A Python serialization format (read & write)
MessagePack (Python package): More compact representation (read & write)
HDF5 (Python package): Nice for matrices (read & write)
XML: exists too *sigh* (read & write)
For your application, the following might be important:
Support by other programming languages
Reading / writing performance
Compactness (file size)
See also: Comparison of data serialization formats
In case you are rather looking for a way to make configuration files, you might want to read my short article Configuration files in Python
Reading the file
import h5py
f = h5py.File(file_name, mode)
Studying the structure of the file by printing what HDF5 groups are present
for key in f.keys():
    print(key)  # Names of the root level object names in HDF5 file - can be groups or datasets.
    print(type(f[key]))  # get the object type: usually group or dataset
Extracting the data
#Get the HDF5 group; key needs to be a group name from above
group = f[key]

#Checkout what keys are inside that group.
for key in group.keys():
    print(key)

# This assumes group[some_key_inside_the_group] is a dataset,
# and returns a np.array:
data = group[some_key_inside_the_group][()]
#Do whatever you want with data

#After you are done
f.close()
You can use Pandas. Note that pd.read_hdf expects files written in the pandas/PyTables layout (e.g. by DataFrame.to_hdf); it will not read arbitrary HDF5 files.
import pandas as pd
pd.read_hdf(filename,key)
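A minimal round-trip sketch (the file and key names here are illustrative; to_hdf needs the PyTables 'tables' package installed):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
df.to_hdf('pandas_file.h5', key='my_table')  # writes in the pandas/PyTables layout
print(pd.read_hdf('pandas_file.h5', key='my_table'))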
Here's a simple function I just wrote which reads a .hdf5 file generated by the save_weights function in keras and returns a dict with layer names and weights:
import h5py

def read_hdf5(path):
    weights = {}
    keys = []
    with h5py.File(path, 'r') as f:  # open file
        f.visit(keys.append)  # append all keys to list
        for key in keys:
            if ':' in key:  # contains data if ':' in key
                print(f[key].name)
                weights[f[key].name] = f[key].value
    return weights
https://gist.github.com/Attila94/fb917e03b04035f3737cc8860d9e9f9b
Haven't tested it thoroughly but does the job for me.
To read the content of a .hdf5 file as an array, you can do something as follows. Caution: np.fromfile reads raw bytes and does not parse the HDF5 structure, so in general it will not recover the dataset values; prefer h5py as shown in the other answers.
import numpy as np
myarray = np.fromfile('file.hdf5', dtype=float)
print(myarray)
Use the code below to read the data and convert it into a numpy array:
import h5py
import numpy as np

f1 = h5py.File('data_1.h5', 'r')
list(f1.keys())
X1 = f1['x']
y1 = f1['y']
df1 = np.array(X1.value)   # .value requires h5py < 3; see the preferred method below
dfy1 = np.array(y1.value)
print(df1.shape)
print(dfy1.shape)
Preferred method to read dataset values into a numpy array:
import h5py
# use Python file context manager:
with h5py.File('data_1.h5', 'r') as f1:
    print(list(f1.keys()))  # print list of root level objects
    # following assumes 'x' and 'y' are dataset objects
    ds_x1 = f1['x']  # returns h5py dataset object for 'x'
    ds_y1 = f1['y']  # returns h5py dataset object for 'y'
    arr_x1 = f1['x'][()]  # returns np.array for 'x'
    arr_y1 = f1['y'][()]  # returns np.array for 'y'
    arr_x1 = ds_x1[()]  # uses dataset object to get np.array for 'x'
    arr_y1 = ds_y1[()]  # uses dataset object to get np.array for 'y'
    print(arr_x1.shape)
    print(arr_y1.shape)
If the .h5 file is a saved Keras model, you can load it directly:
from keras.models import load_model
h = load_model('FILE_NAME.h5')
If you have named datasets in the hdf file then you can use the following code to read and convert these datasets in numpy arrays:
import h5py
import numpy as np

file = h5py.File('filename.h5', 'r')
xdata = file.get('xdata')
xdata = np.array(xdata)
If your file is in a different directory you can add the path in front of 'filename.h5'.
What you need to do is create a dataset. If you take a look at the quickstart guide, it shows you that you need to use the file object in order to create a dataset. So, f.create_dataset and then you can read the data. This is explained in the docs.
Using bits of answers from this question and the latest doc, I was able to extract my numerical arrays using
import h5py
with h5py.File(filename, 'r') as h5f:
    h5x = h5f[list(h5f.keys())[0]]['x'][()]
Where 'x' is simply the X coordinate in my case.
Use this; it works fine for me:
import h5py

def read_hdf5(path="path.h5"):
    weights = {}
    keys = []
    with h5py.File(path, 'r') as f:
        f.visit(keys.append)
        for key in keys:
            if ':' in key:
                print(f[key].name)
                weights[f[key].name] = f[key][()]
    return weights

print(read_hdf5())
If you are using h5py<='2.9.0', then you can use:
import h5py

def read_hdf5(path="path.h5"):
    weights = {}
    keys = []
    with h5py.File(path, 'r') as f:
        f.visit(keys.append)
        for key in keys:
            if ':' in key:
                print(f[key].name)
                weights[f[key].name] = f[key].value
    return weights

print(read_hdf5())
