Can I delete an element from an HDF5 dataset? - python

I would like to delete an element from an HDF5 dataset in Python. Below I have my example code
DeleteHDF5Dataset.py
# This code works, and deletes an entire HDF5 dataset from an HDF5 file
import os
import h5py
file_name = os.path.join('myfilepath', 'myfilename.hdf5')
f = h5py.File(file_name, 'r+')
f.__delitem__('Log list')  # same as: del f['Log list']
However, this is not what I want to do. 'Log list' is an HDF5 dataset that has several elements, and I would like to delete one or more of those elements individually, for example
DeleteHDF5DatasetElement.py
# This code does not work, but I would like to achieve what it's trying to do
file_name = os.path.join('myfilepath', 'myfilename.hdf5')
f = h5py.File(file_name, 'r+')
print(f['Log list'][3]) # prints the correct dataset element
f.__delitem__('Log list')[3] # I want to delete element 3 of this HDF5 dataset
The best solution I can come up with is to create a temporary dataset, loop through the original dataset, and only add the entries I want to keep to the temp dataset, and then replace the old dataset with the new one. But this seems pretty clunky. Does anybody have a clean solution to do this? It seems like there should be a simple way to just delete an element.
Thanks, and sorry if any of my terminology is incorrect.

It looks like you have an array of strings. It's not the recommended way of storing strings in HDF5, but let's assume you have no choice on how data is stored.
HDF5 prefers you to keep your array size fixed. Operations such as deleting arbitrary elements are expensive. In addition, with HDF5, space is not automatically freed when you delete data.
After all this, if you still want to remove data in your specified format, you can try simply extracting an array, deleting an element, then reassigning to your dataset:
import numpy as np
arr = f['Log list'][:]   # extract the dataset to a numpy array
res = np.delete(arr, 1)  # delete the element with index 1, i.e. the second element
del f['Log list']        # delete the existing dataset (equivalent to f.__delitem__('Log list'))
f['Log list'] = res      # reassign the reduced array to the dataset
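Put together as a runnable sketch for the asker's case (removing element 3 of 'Log list'), assuming the file path from the question; the boolean mask is just an alternative to np.delete, and if the elements are strings you may need to pass an explicit dtype such as h5py.string_dtype() when writing back:
import os
import h5py
import numpy as np
file_name = os.path.join('myfilepath', 'myfilename.hdf5')  # path from the question
with h5py.File(file_name, 'r+') as f:
    arr = f['Log list'][:]                # pull the whole dataset into memory
    keep = np.ones(len(arr), dtype=bool)  # boolean mask: True = keep the element
    keep[3] = False                       # drop element 3
    del f['Log list']                     # remove the old dataset
    f.create_dataset('Log list', data=arr[keep])  # write back only the kept elements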

Related

How to automatically flatten and combine .csv files into one matrix in Python?

I have a bunch of .csv files in Python that are x by y dimensions. I know how to flatten and reshape matrices, but am having trouble automatically doing this for multiple files. Once I have flattened the matrices into one dimension, I also would like to stack them on top of each other, in one big matrix.
Is this even the proper way to do a for loop? I have not gotten to the part of stacking the linearized matrices on top of each other into one matrix yet. Would that involve the DataFrame.stack() function? When I run the code below it gives me an error.
import numpy as np
import pandas as pd
file_list = sorted(os.listdir('./E/')) #created the list of files in a specific directory
del file_list[0] #removed an item from the list that I did not want
for file in range(0,26):
    pd.read_csv('./E/' + print(file_list), header=None) #should read files
    A = set(Int.flatten()) #should collapse matrix to one dimension
    B = np.reshape(A, -1) #should make it linear going across
Since I don't know what your files look like, I'm not sure this will work. But still, the below code includes some concepts that should be useful:
import os
import numpy as np
import pandas as pd
file_list = sorted(os.listdir('.\\E'))
del file_list[0]
# Eventually, all_files_array will contain len(file_list) elements, one per file.
all_files_array = []
for i in range(len(file_list)):
    file = file_list[i]
    # Depending on how you saved your files, you may need to add index_col=0 as an argument to read_csv.
    this_file_arr = pd.read_csv('.\\E\\' + file, header=None)
    # Change the dtype to int if that's what you're working with.
    this_file_arr = this_file_arr.to_numpy(dtype=float, copy=False)
    this_file_arr = np.unique(this_file_arr.flatten())
    # In all my tests, the following line does absolutely nothing, but I guess it doesn't hurt.
    this_file_arr = np.reshape(this_file_arr, -1)
    all_files_array.append(this_file_arr)
all_files_array = np.array(all_files_array)
# all_files_array now holds one 1-D array per file (the unique, flattened values of that file).
The main takeaways are that:
I rewrote the paths using Windows-style backslashes ('\\'). Note, though, that Python accepts forward slashes in paths on Windows as well, and os.listdir() also copes with a trailing slash; the most portable option is to build paths with os.path.join('.', 'E', file).
Using range instead of hard-coding the number of files to read is good practice in case you add more files to file_list later down the line, unless of course you don't want to read all the files in file_list.
A print statement inside pd.read_csv is at best useless, and at worst will throw an error.
this_file_arr.flatten() is a NumPy method, so this_file_arr needs to be a NumPy array, hence the to_numpy() line.
Because np.reshape doesn't take sets, I used np.unique instead to avoid converting to a non-NumPy structure. If you want to use NumPy methods, keep your data in a NumPy array and don't convert it to a list or set or anything else.
Let me know if you have any questions!
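On the stacking part of the question: if every file flattens to the same length (i.e. you skip the set/np.unique step, which changes the length per file), a minimal sketch for combining them into one big matrix could be:
import numpy as np
# all_files_array is the list built in the loop above, assuming each entry
# is a 1-D array of identical length (flattened with .flatten() only).
stacked = np.vstack(all_files_array)  # shape: (number_of_files, x * y)
print(stacked.shape)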

h5py file subset taking more space than parent file?

I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to be able to do some basic min/max scaling of the numerical data within lat_long, so I want to put it in its own h5py file for easier use and lower memory usage.
However, when I try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong_data)
out.close()
The output file is incredibly large. It's still not done writing to disk and is ~85GB in space. Why is the data being written to the new file not compressed?
It could be that h5f['data/lat_long'] uses compression filters (and your new dataset doesn't). To check the original dataset's compression settings, use this line:
print(h5f['data/lat_long'].compression, h5f['data/lat_long'].compression_opts)
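If the original turns out to be compressed, you can request compression explicitly when creating the new dataset. A minimal sketch, where gzip level 4 is just an example setting and 'data/lat_long' is the key from the question:
import h5py
with h5py.File('input.h5', 'r') as h5f, h5py.File('latlong_only.h5', 'w') as out:
    # Request gzip compression for the copy; h5py enables chunking automatically.
    out.create_dataset('latlong', data=h5f['data/lat_long'][()],
                       compression='gzip', compression_opts=4)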
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the dataset into an array, which is not necessary in most use cases. An h5py dataset object behaves much like a NumPy array. Instead, use ds = h5f['data/lat_long'] to create a dataset object (instead of an array) and use it as if it were a NumPy array. FYI, .value is a deprecated way to return the dataset as an array; use arr = h5f['data/lat_long'][()] instead. Loading the dataset into an array also requires more memory than using an h5py dataset object (which can be an issue with large datasets).
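For instance, a minimal sketch of min/max scaling done through the dataset object, assuming 'data/lat_long' is a 2-D (N, 2) dataset as the name suggests:
import h5py
with h5py.File('input.h5', 'r') as h5f:
    ds = h5f['data/lat_long']  # dataset object; nothing is loaded yet
    col = ds[:, 0]             # read a single column into memory
    scaled = (col - col.min()) / (col.max() - col.min())  # min/max scaling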
There are other ways to access the data. Using dataset objects is one way. Your method (extracting the data to a new file) is another, but I am not fond of that approach because you end up with two copies of the data: a bookkeeping nightmare. Yet another alternative is to create external links from a new, small file to the existing 18GB file, so there is no duplicate data. I describe that method in this post: How can I combine multiple .h5 file? (Method 1: Create External Links).
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array then writes the array to the new file (uncompressed). Instead, copy the dataset using h5py's group .copy() method, it will retain compression settings and attributes.
See below:
import h5py
with h5py.File('input.h5', 'r') as h5f1, \
     h5py.File('latlong_only.h5', 'w') as h5f2:
    h5f1.copy(h5f1['data/lat_long'], h5f2, 'latlong')

New hdf5 from a group in a bigger hdf5

I have created a huge hdf5 dataset in the following form:
group1/raw
group1/preprocessed
group1/postprocessed
group2/raw
group2/preprocessed
group2/postprocessed
....
group10/raw
group10/preprocessed
group10/postprocessed
However, I realized that for portability I would like to have 10 different hdf5 files, one for each group. Is there a function in python to achieve this without looping through all the data and scanning the entire original hdf5 tree?
something like:
import h5py
file_path = 'path/to/data.hdf5'
hf = h5py.File(file_path, 'r')
print(hf.keys())
for group in hf.keys():
    # create a new file for the group
    hf_tmp = h5py.File(group + '.h5', 'w')
    # get the data from hf[group] and dump it into the new file
    # something like
    # hf_tmp = hf[group]
    # hf_tmp.dump()
    hf_tmp.close()
hf.close()
You have the right idea. There are several questions and answers on SO that show how to do this.
Start with this one. It shows how to loop over the keys and determine whether each item is a group or a dataset: h5py: how to use keys() loop over HDF5 Groups and Datasets
Then look at these. Each shows a slightly different approach to the problem.
This shows one way. Extracting datasets from 1 HDF5 file to multiple files
Also, here is an earlier post I wrote: How to copy a dataset object to a different hdf5 file using pytables or h5py?
This does the opposite (copies datasets from different files to 1 file). It's useful because it demonstrates how to use the .copy() method: How can I combine multiple .h5 file?
Finally, you should review the visititems() method to recursively search all Groups and Datasets. Take a look at this answer for details: is there a way to get datasets in all groups at once in h5py?
That should answer your questions.
Below is some pseudo-code that pulls all of these ideas together. It works for your schema, where all datasets are in root-level groups. It will not work for the more general case with datasets at multiple group levels; use visititems() for that (see the sketch after the pseudo-code below).
Pseudo-code below:
import h5py
file_path = 'path/to/data.hdf5'
with h5py.File(file_path, 'r') as hf:
    print(hf.keys())
    # loop over group names at the root level
    for group in hf.keys():
        hf_tmp = h5py.File(group + '.h5', 'w')
        # loop over dataset names in the group
        for dset in hf[group].keys():
            # copy the dataset to the new group file
            hf.copy(group + '/' + dset, hf_tmp)
        hf_tmp.close()
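For the more general case with datasets nested at arbitrary depth, a hedged sketch built on visititems() could look like the following; copy_item is a hypothetical helper name, and the relative paths below each group are preserved in the new files:
import h5py
file_path = 'path/to/data.hdf5'  # same file as in the pseudo-code above
with h5py.File(file_path, 'r') as hf:
    for group in hf.keys():
        src = hf[group]
        with h5py.File(group + '.h5', 'w') as hf_tmp:
            def copy_item(name, obj):
                # visititems() calls this once for every object below src;
                # `name` is the path of obj relative to src.
                if isinstance(obj, h5py.Dataset):
                    parent = name.rsplit('/', 1)[0] if '/' in name else '/'
                    hf_tmp.require_group(parent)    # make sure parent groups exist in the new file
                    src.copy(name, hf_tmp[parent])  # .copy() keeps compression settings and attributes
            src.visititems(copy_item)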

Reading file with pandas.read_csv: why is a particular column read as a string rather than float?

A little background:
I am running a binary stellar evolution code and storing the evolution histories as gzipped .dat files. I used to have a smaller dataset resulting in some ~2000 .dat files, which I read during post-processing by appending the list data from each file to create a 3d list. Each .dat file looks somewhat like this file.
But recently I started working with a larger dataset, and the number of evolutionary history files rose to ~100000. So I decided to compress the .dat files with gzip and save them in a zipped folder. The reason is that I am doing all this on a remote server and have a limited disk quota.
Main query:
During post-processing, I try to read data using pandas from all these files as 2d numpy arrays which are stacked to form a 3d list (each file has a different length so I could not use numpy.append and have to use lists instead). To achieve this, I use this:
import pandas as pd

def read_evo_history(EvoHist, zipped, z):
    ehists = []
    for i in range(len(EvoHist)):
        if zipped == True:
            try:
                ehists.append(pd.read_csv(z.open(EvoHist[i]), delimiter="\t",
                                          compression='gzip', header=None).to_numpy())
            except pd.errors.EmptyDataError:
                pass
    return ehists
outdir = "plots"
indir = "OutputFiles_allsys"
z = zipfile.ZipFile( indir+'.zip' )
EvoHist = []
for filename in z.namelist():
if not os.path.isdir(filename):
# read the file
if filename[0:len("OutputFiles_allsys/EvoHist")] == "OutputFiles_allsys/EvoHist":
EvoHist.append( filename )
zipped = True
ehists = read_evo_history(EvoHist, zipped, z)
del z # Cleanup (if there's no further use of it after this)
The problem I am now facing is that one particular column in the data is being read as strings rather than floats. Do I need to somehow convert the datatype while reading the file? Or is this caused by datatype inconsistencies in the files being read? Is there a way to get the data as a 3d list of numpy arrays of floats?
P.S.: If this is being caused by inconsistencies in the input files, then I am afraid I won't be able to run my binary stellar evolution code again as it takes days to produce all these files.
I will be more than happy to clarify more on this if needed. Thanks in advance for your time and energy.
Edit:
I noticed that only the 16th column of some files is being read in as a string. And I think this is because there are some NaN values in there, but I may be wrong.
This image shows the raw data with the NaN values pointed out. A demonstration showing that particular column being read as a string can be seen here; however, another column is read as float: image.
The workaround for overcoming a missing value was simple, pandas.read_csv has a parameter called na_values which allows users to pass specified values that they want to be read as NaNs. From the pandas docs:
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', ..., 'NA', ....
Pandas itself is smart enough to recognize those values automatically without us explicitly stating it. But in my case, the file had NaN values written as 'nan ' (yeah, with a space!), which is why I was facing this issue. A minute change in the code fixed this:
pd.read_csv(z.open(EvoHist[i]), delimiter = "\t",
compression='gzip', header=None, na_values = 'nan ').to_numpy()
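If rerunning the read for every file is not an option, a hedged alternative is to coerce the offending column after reading; the column index 16 below comes from the edit above (adjust it if the 16th column is actually index 15), and df is just a name for the freshly read DataFrame:
import pandas as pd
df = pd.read_csv(z.open(EvoHist[i]), delimiter="\t", compression='gzip', header=None)
# Coerce the string column to float; anything unparseable (such as 'nan ') becomes NaN.
df[16] = pd.to_numeric(df[16], errors='coerce')
ehists.append(df.to_numpy(dtype=float))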

Optimal data structure to store millions of pixels in Python?

I have several images, and after some basic processing and contour detection I want to store the detected pixel locations and their adjacent neighbours' values in a Python data structure. I settled on numpy.array.
The pixel locations from each Image are retrieved using:
locationsPx = cv2.findNonZero(SomeBWImage)
which will return an array of shape (NumberOfPixels, 1L, 2L); for example:
print(locationsPx[0]) : array([[1649, 4]])
My question is: is it possible to store this double array in a single column of another array? Or should I use a list and drop the array altogether?
Note: the dataset of images might grow, so the dimensions of my chosen data structure will not only be huge, but also variable.
EDIT: or maybe numpy.array is not a good idea and a Pandas DataFrame is better suited? I am open to suggestions from those who have more experience in this.
Numpy arrays are great for computation. They are not great for storing data if the size of the data keeps changing. As ali_m pointed out, all forms of array concatenation in numpy are inherently slow. Better to store the arrays in a plain-old python list:
coordlist = []
coordlist.append(locationsPx[0])
Alternatively, if your images have names, it might be attractive to use a dict with the image names as keys:
coorddict = {}
coorddict[image_name] = locationsPx[0]
Either way, you can readily iterate over the contents of the list:
for coords in coordlist:
or
for image_name, coords in coorddict.items():
And pickle is a convenient way to store your results in a file:
import pickle
with open("filename.pkl", "wb") as f:
pickle.dump(coordlist, f, pickle.HIGHEST_PROTOCOL)
(or same with coorddict instead of coordlist).
Reloading is trivially easy as well:
with open("filename.pkl", "rb") as f:
coordlist = pickle.load(f)
There are some security concerns with pickle, but if you only load files you have created yourself, those don't apply.
If you find yourself frequently adding to a previously pickled file, you might be better off with an alternative back end, such as sqlite.
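If you do go that route, a minimal sketch with the standard-library sqlite3 module might look like this; the table layout (one row per pixel, keyed by image name) is just one possible design, not something from the original post:
import sqlite3
conn = sqlite3.connect("coords.db")
conn.execute("CREATE TABLE IF NOT EXISTS coords (image_name TEXT, x INTEGER, y INTEGER)")
# locationsPx has shape (NumberOfPixels, 1, 2) as returned by cv2.findNonZero,
# so each point pt looks like [[x, y]].
rows = [(image_name, int(pt[0][0]), int(pt[0][1])) for pt in locationsPx]
conn.executemany("INSERT INTO coords VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()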
