Error in reading HDF5 using h5py - python

I have saved my dataset in HDF5 format with the structure shown in the following image. So I have different groups, i.e. 4, 2, 40, etc., and for each group I have two datasets, Annotation and Features. I saved them successfully, but I am unable to load them back.
The strange thing is that the error occurs only when I try to read Annotation; reading Features works fine.
I am using the following code:
dataSet = np.array([])
annotation = np.array([])
hdf5Object = readHDF5File('abc.hdf5','r')
w = 2
myGroup = hdf5Object[str(w)]
dataSet = np.array(myGroup['Features'])
annotation = np.array(myGroup['Annotation'])
Please enlighten me here, as I have been struggling with this for a while now. Thanks.
EDIT 1
I get the following error when I read Annotation:
Traceback (most recent call last):
File "xyz.py", line 76, in getAllData
annotation = np.array(myGroup['Annotation'])
File "/usr/lib/python2.7/dist-packages/h5py/_hl/group.py", line 153, in __getitem__
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5o.pyx", line 173, in h5py.h5o.open (h5py/h5o.c:3403)
KeyError: "unable to open object (Symbol table: Can't open object)"
EDIT 2
The HDF5 file was created in two steps. In the first step, Features were computed and saved as follows:
features = <numpy array of thousand rows and 100 columns contains only floating numbers>
w = 2
f = h5py.File('abc.hdf5', 'a')
myGroup = f[str(w)]
myGroup.create_dataset('Features', data=features)
For different values of w, the file was appended to and features were computed at different times.
The same kind of procedure was used for Annotation, which also contains only floating-point values.
EDIT 3
The following image shows the contents of Annotation and Features for one w; the left window is Annotation and the right one is Features.

I just figured out that I was trying to access the dataset with a plain string, while the dataset name had somehow been saved as unicode (UTF-8). When I convert my dataset name to unicode, it works fine.
Here is how I figured out the key's data type:
myGroup = hdf5Object[str(w)]
childsIter = myGroup.iterkeys()
for child in childsIter:
    print type(child)
This gave me the clue that my dataset key's type is unicode, not a plain string. So I converted my string to unicode as follows:
key = unicode('Annotation', "utf-8")
dS = np.array(myGroup[key])
or
myGroup = hdf5Object[str(w)]
childsIter = myGroup.iterkeys()
for child in childsIter:
    dS = np.array(myGroup[child])
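For reference, here is a minimal sketch (assuming the layout described above, with one group per w holding the two datasets) that loads every child dataset of a group into a dict keyed by name, so the exact key type never matters; load_group is just an illustrative name:
import h5py
import numpy as np
def load_group(path, group_name):
    # Open the file read-only and copy every dataset in the group into memory.
    with h5py.File(path, 'r') as f:
        group = f[str(group_name)]
        # Iterating the group yields the stored keys directly (str or unicode),
        # so no manual conversion is needed.
        return dict((name, np.array(group[name])) for name in group)
# Hypothetical usage with the file from the question:
# data = load_group('abc.hdf5', 2)
# features, annotation = data['Features'], data['Annotation']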

Related

loading csv files - SyntaxError: invalid syntax (python 3.8)

I am working on a project that requires me to add a csv file in two places of the code. I have seen a somewhat similar problem here on Stack Overflow, but their problem was due to the old Python version 2.5, whereas my Python version is 3.8.
import csv
from tensorflow.keras.datasets import mnist
import numpy as np
def load_az_dataset("C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
    # initialize the list of data and labels
    data = []
    labels = []
    # loop over the rows of the A-Z handwritten digit dataset
    for row in open("C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
        # parse the label and image from the row
        row = row.split(",")
        label = int(row[0])
        image = np.array([int(x) for x in row[1:]], dtype="uint8")
        # images are represented as single channel (grayscale) images
        # that are 28x28=784 pixels -- we need to take this flattened
        # 784-d list of numbers and reshape them into a 28x28 matrix
        image = image.reshape((28, 28))
        # update the list of data and labels
        data.append(image)
        labels.append(label)
    # convert the data and labels to NumPy arrays
    data = np.array(data, dtype="float32")
    labels = np.array(labels, dtype="int")
    # return a 2-tuple of the A-Z data and labels
    return (data, labels)
It's showing this syntax error
The syntax error is caused by the fact that the file path is in the parameter list in the function definition. This is the culprit:
def load_az_dataset("C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
You have no parameters listed in the function definition. You just have a literal string.
Furthermore, you should also either be using raw strings: r"..." or escaping your backslashes, as others have mentioned.
Finally, you should be using the with open(file_path) as f: pattern to open your file.
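Putting these points together, a rough sketch of the corrected function might look like this (the default path is simply the one from the question, written as a raw string):
import numpy as np
def load_az_dataset(file_path=r"C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
    data = []
    labels = []
    # Each row holds a label followed by 784 pixel values.
    with open(file_path) as f:
        for row in f:
            row = row.split(",")
            labels.append(int(row[0]))
            image = np.array([int(x) for x in row[1:]], dtype="uint8")
            data.append(image.reshape((28, 28)))
    return np.array(data, dtype="float32"), np.array(labels, dtype="int")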
The syntax error is caused because you are passing a literal string in the method declaration of load_az_dataset.
You need to define the parameter to the function as:
def load_az_dataset(fileName):
Further, if you want to add that file as the default value for the parameter then use:
def load_az_dataset(fileName="C:\\A_Z_Handwritten_Data\\A_Z_Handwritten_Data.csv"):
Also, unrelated to the problem, you need to escape the \ with another \.
Try:
open("C:\\A_Z_Handwritten_Data\\A_Z_Handwritten_Data.csv")

Reading Dataset from files where some might be missing

I'm trying to load files into a TensorFlow Dataset where some files might be missing (in which case I want to replace them with zeroes).
The structure of directories that I'm trying to read data from is as follows:
|-data
|---sensor_A
|-----1.dat
|-----2.dat
|-----3.dat
|---sensor_B
|-----1.dat
|-----2.dat
|-----3.dat
The .dat files are CSV files with a space as the separator. The content of every file is a single, multi-row observation where the number of columns is constant (say 4) and the number of rows is unknown (time-series data).
I've successfully managed to read every sensor data to a separate TensorFlow Dataset with the following code:
import os
import tensorflow as tf
tf.enable_eager_execution()
data_root_dir = "data"
modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]
for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]
    dataset = tf.data.Dataset.from_tensor_slices((filenames,))

    def _parse_function_internal(filename):
        number_of_columns = 4
        single_observation = tf.read_file(filename)
        # Tokenise every value so we can cast these to floats later.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break
which successfully prints out the content of all three files for every sensor.
My problem is that some timestamps in the dataset might be missing. For instance, if file 1.dat in the sensor_A directory is missing, I get this error:
tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: mock_data\sensor_A\1.dat : The system cannot find the file specified.
; No such file or directory
[[{{node ReadFile}}]] [Op:IteratorGetNextSync]
which is thrown in this line:
for el in dataset:
What I tried was to surround the call to the tf.read_file() function with a try block, but obviously that doesn't work, because the error is not thrown when tf.read_file() is called but when the value is fetched from the dataset. Later I want to pass this dataset to a Keras model, so I can't just wrap it in a try block. Is there any workaround? Is this even supported?
Thanks!
I managed to solve the problem; I'm sharing the solution in case someone else struggles with it as well. I had to use an additional list of booleans specifying whether each file actually exists and pass it into the mapper. Then, using the tf.cond() function, we decide whether to read the file or mock the data with zeroes (or any other logic).
import os
import tensorflow as tf
tf.enable_eager_execution()
data_root_dir = "data"
modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]
for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]
    files_exist = [os.path.isfile(filename) for filename in filenames]
    dataset = tf.data.Dataset.from_tensor_slices((filenames, files_exist))

    def _parse_function_internal(filename, file_exist):
        number_of_columns = 4
        # Read the file only if it exists; otherwise fall back to a row of zeroes.
        single_observation = tf.cond(file_exist, lambda: tf.read_file(filename), lambda: ' '.join(['0.0'] * number_of_columns))
        # Tokenise every value so we can cast these to floats later.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break

appending an index to laspy file (.las)

I have two files: one an ESRI shapefile (.shp), the other a point cloud (.las).
Using laspy and shapefile modules I've managed to find which points of the .las file fall within specific polygons of the shapefile. What I now wish to do is to add an index number that enables identification between the two datasets. So e.g. all points that fall within polygon 231 should get number 231.
The problem is that so far I'm unable to append anything to the list of points when writing the .las file. Here is the piece of code where I'm trying to do it:
outFile1 = laspy.file.File("laswrite2.las", mode = "w",header = inFile.header)
outFile1.points = truepoints
outFile1.points.append(indexfromshp)
outFile1.close()
The error I'm getting now is: AttributeError: 'numpy.ndarray' object has no attribute 'append'. I've tried multiple things already, including np.append, but I'm really at a loss as to how to add anything to the .las file.
Any help is much appreciated!
There are several ways to do this.
LAS files have a classification field; you could store the indexes in this field:
las_file = laspy.file.File("las.las", mode="rw")
las_file.classification = indexfromshp
However, if the LAS file version is <= 1.2, the classification field can only store values in the range [0, 31], but you can use the 'user_data' field, which can hold values in the range [0, 255].
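For example, a minimal sketch using user_data instead (assuming the polygon indexes fit into the 0-255 range):
import laspy
las_file = laspy.file.File("las.las", mode="rw")
las_file.user_data = indexfromshp
las_file.close()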
Or, if you need to store values higher than 255, or you need a separate field, you can define a new dimension (see laspy's documentation on how to add extra dimensions).
Your code should then be close to something like this:
outFile1 = laspy.file.File("laswrite2.las", mode="w", header=inFile.header)
# copy fields
for dimension in inFile.point_format:
    dat = inFile.reader.get_dimension(dimension.name)
    outFile1.writer.set_dimension(dimension.name, dat)
outFile1.define_new_dimension(
    name="index_from_shape",
    data_type=7,  # uint64_t
    description="Index of corresponding polygon from shape file"
)
outFile1.index_from_shape = indexfromshp
outFile1.close()

Python input from a <pre> tag

I am writing some code in Python 2.7 to pull some information from a website, extract the relevant data from that set, and then format that data in a way that is more useful. Specifically, I want to take information from an HTML <pre> tag, put it into a file, turn the information in the file into an array (using numpy), and then do my analysis from that. I am stuck on the "put into a file" part. It seems that when I put it into a file, it becomes a 1x1 matrix or something, so it won't do what I hope it will. On an attempt previous to the code sample below, the error I got was: IndexError: index 5 is out of bounds for axis 0 with size 0. I had the index on the array just to test whether it would produce output from what I have so far.
Here is my code so far:
#Pulling data from GFS lamps
from lxml import html
import requests
import numpy as np
ICAO = raw_input("What station would you like GFS lamps data for? ")
page = requests.get('http://www.nws.noaa.gov/cgi-bin/lamp/getlav.pl?sta=' + ICAO)
tree = html.fromstring(page.content)
Lamp = tree.xpath('//pre/text()') #stores class of //pre html element in list Lamp
gfsLamps = open('ICAO', 'w') #stores text of Lamp into a new file
gfsLamps.write(Lamp[0])
array = np.genfromtxt('ICAO') #puts file into an array
array[5]
You can use KOGD as the ICAO to test this. As is, I get ValueError: Some errors were detected, and it lists lines 2-23 (Got 26 columns instead of 8). What is the first step that I am doing wrong here, or am I just going about this all wrong?
The problem isn't in the putting-data-into-the-file part; it's in getting it out using genfromtxt. genfromtxt is a very rigid function that mostly needs complete data unless you specify lots of options to skip columns and rows. Use this instead:
arrays = [np.array(map(str, line.split())) for line in open('ICAO')]
The arrays variable will contain an array for each line, holding each individual element of that line separated by spaces. For example, if your line has the following data:
a b cdef 124
the array for this line will be:
['a','b','cdef','124']
arrays will contain an array like this for each line, which you can then process further as you wish.
So the complete code is:
from lxml import html
import requests
import numpy as np
ICAO = raw_input("What station would you like GFS lamps data for? ")
page = requests.get('http://www.nws.noaa.gov/cgi-bin/lamp/getlav.pl?sta=' + ICAO)
tree = html.fromstring(page.content)
Lamp = tree.xpath('//pre/text()') #stores class of //pre html element in list Lamp
gfsLamps = open('ICAO', 'w') #stores text of Lamp into a new file
gfsLamps.write(Lamp[0])
gfsLamps.close()
array = [np.array(map(str, line.split())) for line in open('ICAO')]
print array

Can't access returned h5py object instance

I have a very weird issue here. I have 2 functions: one which reads an HDF5 file created using h5py and one which creates a new HDF5 file which concatenates the content returned by the former function.
def read_file(filename):
    with h5py.File(filename+".hdf5",'r') as hf:
        group1 = hf.get('group1')
        group2 = hf.get('group2')
        dataset1 = hf.get('dataset1')
        dataset2 = hf.get('dataset2')
        print group1.attrs['w'] # Works here
        return dataset1, dataset2, group1, group2
And the function that creates the new file:
def create_chunk(start_index, end_index):
    for i in range(start_index, end_index):
        if i == start_index:
            mergedhf = h5py.File("output.hdf5",'w')
            mergedhf.create_dataset("dataset1",dtype='float64')
            mergedhf.create_dataset("dataset2",dtype='float64')
            g1 = mergedhf.create_group('group1')
            g2 = mergedhf.create_group('group2')
        rd1,rd2,rg1,rg2 = read_file(filename)
        print rg1.attrs['w'] #gives me <Closed HDF5 group> message
        g1.attrs['w'] = "content"
        g1.attrs['x'] = "content"
        g2.attrs['y'] = "content"
        g2.attrs['z'] = "content"
        print g1.attrs['w'] # Works Here
    return mergedhf.get('dataset1'), mergedhf.get('dataset2'), g1, g2
def calling_function():
    wd1, wd2, wg1, wg2 = create_chunk(start_index, end_index)
    print wg1.attrs['w'] #Works here as well
Now the problem is: I can access the datasets and attributes of the newly created file, represented by wd1, wd2, wg1 and wg2, but I can't do the same for the objects I have read and returned from the existing file.
Can anyone help me fetch the values of the dataset and group when I have returned the reference to the calling function?
The problem is in read_file, this line:
with h5py.File(filename+".hdf5",'r') as hf:
This closes hf at the end of the with block, i.e. when read_file returns. When this happens, the datasets and groups also get closed and you can no longer access them.
There are (at least) two ways to fix this. Firstly, you can open the file like you do in create_chunk:
hf = h5py.File(filename+".hdf5", 'r')
and keep the reference to hf around as long as you need it, before closing it:
hf.close()
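For instance, a sketch of that first option might look like this (names follow the question; the caller closes the file once it is done with the returned objects):
import h5py
def read_file(filename):
    hf = h5py.File(filename + ".hdf5", 'r')
    # The file stays open, so the returned datasets and groups remain usable.
    return hf, hf.get('dataset1'), hf.get('dataset2'), hf.get('group1'), hf.get('group2')
# Caller side:
# hf, rd1, rd2, rg1, rg2 = read_file(filename)
# ... use rd1, rd2, rg1, rg2, then:
# hf.close()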
The other way is to copy the data from the datasets in read_file and return those instead:
dataset1 = hf.get('dataset1')[:]
dataset2 = hf.get('dataset2')[:]
Note that you can't do this with the groups. The file needs to be open for as long as you need to do things with the groups.
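If you only need the attributes rather than the live group objects, one workaround (a sketch, not something from the question) is to copy them into plain dicts before the file closes:
import h5py
def read_file(filename):
    with h5py.File(filename + ".hdf5", 'r') as hf:
        dataset1 = hf.get('dataset1')[:]
        dataset2 = hf.get('dataset2')[:]
        # dict() copies the attributes into ordinary Python objects,
        # so they stay usable after the file is closed.
        group1_attrs = dict(hf.get('group1').attrs)
        group2_attrs = dict(hf.get('group2').attrs)
    return dataset1, dataset2, group1_attrs, group2_attrs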
Adding to #Yossarian's answer
For those who come across this and are reading a scalar dataset, make sure to index it using [()]:
scalar_dataset1 = hf['scalar_dataset1'][()]
Preface
I had a similar issue to the OP, resulting in a return value of <Closed HDF5 dataset>. However, I would get a ValueError when attempting to slice my scalar dataset with [:]:
"ValueError: Illegal slicing argument for scalar dataspace"
Indexing with [()] along with #Yossarian's answer helped solve my problem.
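As a quick illustration (the dataset names here are made up), the difference looks like this:
import h5py
with h5py.File("example.hdf5", "r") as hf:
    arr = hf["array_dataset"][:]    # fine for 1-D or higher datasets
    val = hf["scalar_dataset"][()]  # needed for scalar (0-d) datasets; [:] raises ValueError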
