Accessing data range with h5py - python

I have an h5 file that contains 62 different attributes. I would like to access the data range of each one of them. To explain more, here is what I'm doing:
import h5py
the_file = h5py.File("myfile.h5","r")
data = the_file["data"]
att = data.keys()
The previous code gives me a list of attribute names: "U", "T", "H", etc.
Let's say I want to know the minimum and maximum value of "U". How can I do that?
This is the output of running "h5dump -H":
HDF5 "myfile.h5" {
GROUP "/" {
GROUP "data" {
ATTRIBUTE "datafield_names" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_SPACEPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 62 ) / ( 62 ) }
}
ATTRIBUTE "dimensions" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
}
ATTRIBUTE "time_variables" {
DATATYPE H5T_IEEE_F64BE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
}
DATASET "Temperature" {
DATATYPE H5T_IEEE_F64BE
DATASPACE SIMPLE { ( 256, 512, 1024 ) / ( 256, 512, 1024 ) }
}

It might be a difference in terminology, but HDF5 attributes are accessed via the attrs attribute of a Dataset object. I would call what you have variables or datasets. Anyway...
I'm guessing from your description that the "attributes" are just arrays. You should be able to do the following to get the data for each one and then calculate the min and max like any numpy array:
attr_data = data["U"][:]  # gets a copy of the array as a numpy array
data_min = attr_data.min()
data_max = attr_data.max()
So if you want the min/max of each attribute you can just do a for loop over the attribute names, or you could use:
for attr_name, attr_value in data.items():
    data_min = attr_value[:].min()
    data_max = attr_value[:].max()
Edit to answer your first comment:
h5py's objects can be used like python dictionaries. So when you use 'keys()' you are not actually getting data, you are getting the name (or key) of that data. For example, if you run the_file.keys() you will get a list of every hdf5 dataset in the root path of that hdf5 file. If you continue along a path you will end up with the dataset that holds the actual binary data. So for example, you might start with (in an interpreter at first):
the_file = h5py.File("myfile.h5","r")
print(the_file.keys())
# this will result in a list of keys, maybe ["raw_data", "meta_data"] or something
print(the_file["raw_data"].keys())
# this will result in another list of keys, maybe ["temperature", "humidity"]
# eventually you'll get to the dataset that actually holds the data or attributes you are looking for
# think of this process as going through a directory structure or a path to get to a file (or a dataset/variable in this case)
the_data_var = the_file["raw_data"]["temperature"]
the_data_array = the_data_var[:]
print(the_data_var.attrs.keys())
# this will result in a list of attribute names/keys
an_attr_of_the_data = the_data_var.attrs["measurement_time"][:]
# So now you have "the_data_array", which is a numpy array, and "an_attr_of_the_data", which is whatever it happened to be
# you can get the min/max of the data as before
print(the_data_array.min())
print(the_data_array.max())
Edit 2 - Why do people format their hdf files this way? It defeats the purpose.
I think you may have to talk to the person who made this file if possible. If you made it, then you'll be able to answer my questions for yourself. First, are you sure that in your original example data.keys() returned "U", "T", etc.? Unless h5py is doing something magical, or you didn't provide all of the output of the h5dump, that could not have been your output. I'll explain what the h5dump is telling me, but please try to understand what I am doing and not just copy and paste into your terminal.
# Get a handle to the "data" Group
data = the_file["data"]
# As you can see from the dump this data group has 3 attributes and 1 dataset
# The name of the attributes are "datafield_names","dimensions","time_variables"
# This should result in a list of those names:
print(data.attrs.keys())
# The name of the dataset is "Temperature" and should be the only item in the list returned by:
print(data.keys())
As you can see from the h5dump, there are 62 datafield_names (strings), 4 dimensions (32-bit integers, I think), and 2 time_variables (64-bit floats). It also tells me that Temperature is a 3-dimensional array, 256 x 512 x 1024 (64-bit floats). Do you see where I'm getting this information? Now comes the hard part: you will need to determine how the datafield_names match up with the Temperature array. This was done by the person who made the file, so you'll have to figure out what each row/column in the Temperature array means. My first guess would be that each row in the Temperature array is one of the datafield_names, maybe 2 more for each time? But this doesn't work since there are too many rows in the array. Maybe the dimensions fit in there somehow? Lastly, here is how you get each of those pieces of information (continuing from before):
# Get the temperature array (the three slices grab everything; a plain [:] works too)
temp_array = data["Temperature"][:, :, :]
# Get all of the datafield_names (list of strings of length 62)
datafields = data.attrs["datafield_names"][:]
# Get all of the dimensions (list of integers of length 4)
dims = data.attrs["dimensions"][:]
# Get all of the time variables (list of floats of length 2)
time_variables = data.attrs["time_variables"][:]
# If you want the min/max of the entire temperature array this should work:
print(temp_array.min())
print(temp_array.max())
# If you knew that row 0 of the array had the temperatures you wanted to analyze
# then this would work, but it all depends on how the creator organized the data/file:
print(temp_array[0].min())
print(temp_array[0].max())
I'm sorry I can't be of more help, but without actually having the file and knowing what each field means this is about all I can do. Try to understand how I used h5py to read the information. Try to understand how I translated the header information (h5dump output) into information that I could actually use with h5py. If you know how the data is organized in the array you should be able to do what you want. Good luck, I'll help more if I can.

Since h5py arrays are closely related to numpy arrays, you can use the numpy.min and numpy.max functions to do this:
maxItem = numpy.max(data['U'][:]) # Find the max of item 'U'
minItem = numpy.min(data['H'][:]) # Find the min of item 'H'
Note the ':', it is needed to convert the data to a numpy array.

You can call min and max (along axis 0, i.e. across rows) on the DataFrame:
In [1]: df = pd.DataFrame([[1, 6], [5, 2], [4, 3]], columns=list('UT'))
In [2]: df
Out[2]:
   U  T
0  1  6
1  5  2
2  4  3
In [3]: df.min(0)
Out[3]:
U    1
T    2
In [4]: df.max(0)
Out[4]:
U    5
T    6

Did you mean data.attrs rather than data itself? If so,
import h5py
with h5py.File("myfile.h5", "w") as the_file:
    dset = the_file.create_dataset('MyDataset', (100, 100), 'i')
    dset.attrs['U'] = (0, 1, 2, 3)
    dset.attrs['T'] = (2, 3, 4, 5)
with h5py.File("myfile.h5", "r") as the_file:
    data = the_file["MyDataset"]
    print({key: (min(value), max(value)) for key, value in data.attrs.items()})
yields
{'U': (0, 3), 'T': (2, 5)}


How to slice and loop through a netCDF variable in Python?

I have a netCDF variable with 372 time-steps; I need to slice this variable to read in each individual time-step for subsequent processing.
I have used glob to read in my 12 netCDF files and then defined the variables:
NAME_files = glob.glob('RGL*nc')
NAME_files = NAME_files[0:12]
for n in NAME_files:
    RGL = Dataset(n, mode='r')
    footprint = RGL.variables['fp'][:]
    lons = RGL.variables['lon'][:]
    lats = RGL.variables['lat'][:]
I now need to repeat the code below in a loop for each of the 372 time-steps of the variable 'footprint'.
footprint_2 = RGL.variables['fp'][:,:,1:2]
I'm new to Python and have a poor grasp of looping. Any help would be appreciated, including better explanation/description of my issue.
You need to determine both the dimensions and shape of the fp variable in order to access it properly.
I'm making assumptions here about those values.
Your code implies 3 dimensions: time, lon, lat. Again, just assuming.
footprint_2 = RGL.variables['fp'][:,:,1:2]
But the code above gets all the times and all the lons for a single latitude; the slice 1:2 selects just 1 value.
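To see what the slicing does, here is a minimal sketch with a plain numpy array standing in for the netCDF variable (the shape is the one assumed above; no data file needed). Note that a slice like 1:2 keeps the axis, while an integer index drops it:

```python
import numpy as np

# hypothetical stand-in for RGL.variables['fp'][:] with dims (time, lon, lat)
fp = np.zeros((372, 30, 30))

print(fp[:, :, 1:2].shape)  # (372, 30, 1)  -> slice 1:2 keeps the lat axis
print(fp[:, :, 1].shape)    # (372, 30)     -> an integer index drops it
print(fp[0, :, :].shape)    # (30, 30)      -> one time step
```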
fp_dims = RGL.variables['fp'].dimensions
print(fp_dims)
# a tuple of dimension names
# (u'time', u'lon', u'lat')
fp_shape = RGL.variables['fp'].shape
print(fp_shape)
# a tuple of dimension sizes or lengths
# (372, 30, 30)
num_times = fp_shape[0]
for time_idx in range(num_times):
    # you don't say if you want a single lon,lat or all the lon,lats for a given time step
    test = RGL.variables['fp'][time_idx, :, :]
    # or if you really want this:
    test = RGL.variables['fp'][time_idx, :, 1:2]
    # or a single lon, lat:
    test = RGL.variables['fp'][time_idx, 8, 8]

appending an index to laspy file (.las)

I have two files, one an esri shapefile (.shp), the other a point cloud (.las).
Using laspy and shapefile modules I've managed to find which points of the .las file fall within specific polygons of the shapefile. What I now wish to do is to add an index number that enables identification between the two datasets. So e.g. all points that fall within polygon 231 should get number 231.
The problem is that as of yet I'm unable to append anything to the list of points when writing the .las file. The piece of code where I'm trying to do it is here:
outFile1 = laspy.file.File("laswrite2.las", mode = "w",header = inFile.header)
outFile1.points = truepoints
outFile1.points.append(indexfromshp)
outFile1.close()
The error I'm getting now is: AttributeError: 'numpy.ndarray' object has no attribute 'append'. I've tried multiple things already including np.append but I'm really at a loss here as to how to add anything to the las file.
Any help is much appreciated!
There are several ways to do this.
Las files have a classification field; you could store the indexes in this field:
las_file = laspy.file.File("las.las", mode="rw")
las_file.classification = indexfromshp
However, if the Las file has version <= 1.2, the classification field can only store values in the range [0, 31], but you can use the 'user_data' field, which can hold values in the range [0, 255].
Or, if you need to store values higher than 255 or you need a separate field, you can define a new dimension (see laspy's doc on how to add extra dimensions).
Your code should be close to something like this
outFile1 = laspy.file.File("laswrite2.las", mode="w", header=inFile.header)
# copy fields
for dimension in inFile.point_format:
    dat = inFile.reader.get_dimension(dimension.name)
    outFile1.writer.set_dimension(dimension.name, dat)
outFile1.define_new_dimension(
    name="index_from_shape",
    data_type=7,  # uint64_t
    description="Index of corresponding polygon from shape file"
)
outFile1.index_from_shape = indexfromshp
outFile1.close()

How to use function like matlab 'fread' in python?

This is a .dat file.
In Matlab, I can use this code to read.
lonlatfile='NOM_ITG_2288_2288(0E0N)_LE.dat';
f=fopen(lonlatfile,'r');
lat_fy=fread(f,[2288*2288,1],'float32');
lon_fy=fread(f,[2288*2288,1],'float32')+86.5;
lon=reshape(lon_fy,2288,2288);
lat=reshape(lat_fy,2288,2288);
Here are some results from Matlab (screenshot not shown).
How do I do this in Python to get the same result?
PS: My code is this:
import numpy as np

def fromfileskip(fid, shape, counts, skip, dtype):
    """
    fid : file object, should be an open binary file.
    shape : tuple of ints, the desired shape of each data block.
        For a 2d array with xdim, ydim = 3000, 2000 and xdim = fastest
        dimension, shape = (2000, 3000).
    counts : int, number of times to read a data block.
    skip : int, number of bytes to skip between reads.
    dtype : np.dtype object, type of each binary element.
    """
    data = np.zeros((counts,) + shape)
    for c in range(counts):
        block = np.fromfile(fid, dtype=dtype, count=np.prod(shape))
        data[c] = block.reshape(shape)
        fid.seek(fid.tell() + skip)
    return data

fid = open(r'NOM_ITG_2288_2288(0E0N)_LE.dat', 'rb')
data = fromfileskip(fid, (2288, 2288), 1, 0, np.float32)
loncenter = 86.5  # Footpoint of FY2E
latcenter = 0
lon2e = data + loncenter
lat2e = data + latcenter
Lon = lon2e.reshape(2288, 2288)
Lat = lat2e.reshape(2288, 2288)
But, the result is different from that of Matlab.
You should be able to translate the code directly into Python with little change:
import numpy as np

lonlatfile = 'NOM_ITG_2288_2288(0E0N)_LE.dat'
with open(lonlatfile, 'rb') as f:
    lat_fy = np.fromfile(f, count=2288*2288, dtype='float32')
    lon_fy = np.fromfile(f, count=2288*2288, dtype='float32') + 86.5
lon = lon_fy.reshape([2288, 2288], order='F')
lat = lat_fy.reshape([2288, 2288], order='F')
Normally the numpy reshape would be transposed compared to the MATLAB result, due to different index orders. The order='F' part makes sure the final output has the same layout as the MATLAB version. It is optional, if you remember the different index order you can leave that off.
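A minimal check of the order='F' behavior, using a few made-up values in place of the flat data that np.fromfile would return:

```python
import numpy as np

# six values standing in for the flat binary data read by np.fromfile
flat = np.arange(6, dtype=np.float32)

c_order = flat.reshape(2, 3)             # numpy default: row-major (C order)
f_order = flat.reshape(2, 3, order='F')  # column-major, matching MATLAB's reshape

print(c_order.tolist())  # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
print(f_order.tolist())  # [[0.0, 2.0, 4.0], [1.0, 3.0, 5.0]]
```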
The with open() as f: opens the file in a safe manner, making sure it is closed again when you are done even if the program has an error or is cancelled for whatever reason. Strictly speaking it is not needed, but you really should always use it when opening a file.

Writing and reading a row array (nx1) to a binary file in Python with struct pack

I'm having a lot of trouble writing to and reading from a binary file when working with an nx1 vector that has been written to a binary file using struct.pack. The file structure looks like this (given an argument data of type numpy.array):
test.file
--------
[format_code = 3] : 4 bytes (the code 3 means a vector) - fid.write(struct.pack('i', 3))
[rows]            : 4 bytes - fid.write(struct.pack('i', sz[0])), where sz = data.shape
[cols]            : 4 bytes - fid.write(struct.pack('i', sz[1]))
[data]            : type double = 8 bytes * (rows * cols)
Unfortunately, since these files are mostly written in MATLAB, where I have a working class that reads and writes these fields, I can't write only the number of rows (I need columns as well, even if the column count is just 1).
I've tried a few ways to pack data, none of which have worked when trying to unpack it (assume I've opened my file denoted by fid in 'rb'/'wb' and have done some error checking):
# write data
sz = data.shape
datalen=8*sz[0]*sz[1]
fid.write(struct.pack('i',3)) # format code
fid.write(struct.pack('i',sz[0])) # rows
fid.write(struct.pack('i',sz[1])) # columns
### write attempt ###
for i in xrange(sz[0]):
    for j in xrange(sz[1]):
        fid.write(struct.pack('d', float(data[i][j])))  # write in 'c' convention, so we transpose
### read attempt ###
format_code = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
rows = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
cols = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
out_datalen = 8 * rows * cols # size of structure
output_data=numpy.array(struct.unpack('%dd' % out_datalen,fid.read(datalen)),dtype=float)
So far, when reading, my output has just seemingly been multiplied by random things. I don't know what's happening.
I found another similar question, and so I wrote my data as such:
fid.write(struct.pack('%sd' % len(data), *data))
However, when reading it back using:
numpy.array(struct.unpack('%sd' % out_datalen,fid.read(datalen)),dtype=float)
I get nothing in my array.
Similarly, just doing:
fid.write(struct.pack('%dd' % datalen, *data))
and reading it back with:
numpy.array(struct.unpack('%dd' % out_datalen,fid.read(datalen)),dtype=float)
also gives me an empty array. How can I fix this?
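One likely culprit in the reads above: the count in a struct.unpack format string is the number of elements, not bytes, so '%dd' % out_datalen (with out_datalen = 8 * rows * cols) asks for eight times too many doubles. A minimal sketch of a consistent write/read round trip, using hypothetical 3x1 data and an in-memory buffer standing in for the file:

```python
import struct
import io

data = [[1.5], [2.5], [3.5]]  # hypothetical 3x1 vector
rows, cols = len(data), len(data[0])

fid = io.BytesIO()  # stands in for open('test.file', 'wb')
fid.write(struct.pack('i', 3))     # format code 3 = vector
fid.write(struct.pack('i', rows))
fid.write(struct.pack('i', cols))
for row in data:
    for val in row:
        fid.write(struct.pack('d', val))

fid.seek(0)
format_code, rows, cols = struct.unpack('3i', fid.read(struct.calcsize('3i')))
count = rows * cols                # number of doubles, NOT the byte count
values = struct.unpack('%dd' % count, fid.read(8 * count))
print(values)  # (1.5, 2.5, 3.5)
```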

Python format print with a list

Which is the most pythonic way to produce my output? Let me illustrate the behavior I'm trying to achieve.
For a project of mine I'm building a function that takes different parameters to print the output in columns.
Example of the list it receives:
[('Field', 'Integer', 'Hex'),
 ('Machine;', 332, '0x14c'),
 ('NumberOfSections;', 9, '0x9'),
 ('Time Date Stamp;', 4, '0x4'),
 ('PointerToSymbolTable;', 126976, '0x1f000')]
(The number of items per tuple can differ: only 3 items now, but it could be 4 or any number for another list.)
The output should be something like this:
Field                  Integer  Hex
-------------------------------------------------------------------------------
Machine;               332      0x14c
NumberOfSections;      9        0x9
Time Date Stamp;       4        0x4
PointerToSymbolTable;  126976   0x1f000
For working purposes I created a list which only contains the header fields. This isn't necessary, but it made it a little bit easier to try stuff out.
The header field list is ['Field', 'Integer', 'Hex'].
The first tuple in the list declares the so-called "header fields", as shown in the list example. For this case there are only 3 items, but this can differ from time to time. So I tried to calculate the number of items with:
length_container_header = len(container[0])
This variable can be used to correctly build up the output.
Building the header "print" manually, I would end up with something like this:
print("{:21} {:7} {:7}".format(header_field[0], header_field[1], header_field[2]))
Now this is a manual version of how it should be. As you noticed, the header field "Field" is shorter than PointerToSymbolTable in the list. I wrote this function to determine the longest item for each position in the list:
container_length_list = []
local_l = 0
for field in range(0, length_container_header):
    for item in container[1:]:
        if len(str(item[field])) > local_l:
            local_l = len(str(item[field]))
        else:
            continue
    container_length_list.append(local_l)
    local_l = 0
Produces a list along the lines of [21, 7, 7] in this case.
Creating the format string can be done pretty simply:
formatstring = ""
for line in lst:
    formatstring += "{:" + str(line) + "}"
Which produces the string:
{:21}{:7}{:7}
This is the part where I run into trouble: how can I produce the last part of the format string? I tried a nested for loop in the format() function but I ended up with all sorts of errors. I think it can be done with a for loop, I just can't figure out how. If someone could push me in the right direction for the header print I would be very grateful. Once I figure out how to print the header I can pretty much figure out the rest. I hope I explained it well enough.
With kind regards,
You can use * to unpack argument list:
container = [
    ('Field', 'Integer', 'Hex'),
    ('Machine;', 332, '0x14c'),
    ('NumberOfSections;', 9, '0x9'),
    ('Time Date Stamp;', 4, '0x4'),
    ('PointerToSymbolTable;', 126976, '0x1f000')
]
lengths = [
    max(len(str(row[i])) for row in container) for i in range(len(container[0]))
]  # => [21, 7, 7]
# OR lengths = [max(map(len, map(str, x))) for x in zip(*container)]
fmt = ' '.join('{:<%d}' % l for l in lengths)
# => '{:<21} {:<7} {:<7}'  # < for left-align
print(fmt.format(*container[0]))  # header
print('-' * (sum(lengths) + len(lengths) - 1))  # separator
for row in container[1:]:
    print(fmt.format(*row))  # <------- unpacking argument list
    # similar to print(fmt.format(row[0], row[1], row[2]))
output:
Field                 Integer Hex
-------------------------------------
Machine;              332     0x14c
NumberOfSections;     9       0x9
Time Date Stamp;      4       0x4
PointerToSymbolTable; 126976  0x1f000
Formatting data in tabular form requires four important steps:
1. Determine the field layout, i.e. representing data row-wise or column-wise. Based on the decision you might need to transpose the data using zip.
2. Determine the field sizes. Unless you want to hard-code the field size (not recommended), you should determine the maximum field size based on the data, allowing customized padding between fields. Generally this requires reading the data and determining the maximum length of the fields: [len(max(map(str, field), key=len)) + pad for field in zip(*data)]
3. Extract the header row. This is easy, as it only requires indexing the 0th row, i.e. data[0].
4. Format the data. This requires some understanding of the python format string.
Implementation
class FormatTable(object):
    def __init__(self, data, pad=2):
        self.data = data
        self.pad = pad
        self.header = data[0]
        self.field_size = [len(max(map(str, field), key=len)) + pad
                           for field in zip(*data)]
        self.format = ''.join('{{:<{}}}'.format(s) for s in self.field_size)

    def __iter__(self):
        yield self.format.format(*self.header)
        yield '-' * (sum(self.field_size) + self.pad * len(self.header))
        for row in self.data[1:]:
            yield self.format.format(*row)
Demo
for row in FormatTable(data):
    print(row)

Field                  Integer  Hex
-----------------------------------------------
Machine;               332      0x14c
NumberOfSections;      9        0x9
Time Date Stamp;       4        0x4
PointerToSymbolTable;  126976   0x1f000
I don't know if it is "Pythonic", but you can use pandas to format your output.
import pandas as pd
data = [('Field', 'Integer', 'Hex'),
        ('Machine;', 332, '0x14c'),
        ('NumberOfSections;', 9, '0x9'),
        ('Time Date Stamp;', 4, '0x4'),
        ('PointerToSymbolTable;', 126976, '0x1f000')]
s = pd.DataFrame(data[1:], columns=data[0])
print(s.to_string(index=False))
Result:
Result:
                Field  Integer      Hex
             Machine;      332    0x14c
    NumberOfSections;        9      0x9
     Time Date Stamp;        4      0x4
PointerToSymbolTable;   126976  0x1f000
