New hdf5 from a group in a bigger hdf5 - python

I have created a huge hdf5 dataset in the following form:
group1/raw
group1/preprocessed
group1/postprocessed
group2/raw
group2/preprocessed
group2/postprocessed
....
group10/raw
group10/preprocessed
group10/postprocessed
However, I realized that for portability I would like to have 10 different hdf5 files, one for each group. Is there a function in python to achieve this without looping through all the data and scanning the entire original hdf5 tree?
Something like:
import h5py

file_path = 'path/to/data.hdf5'
hf = h5py.File(file_path, 'r')
print(hf.keys())
for group in hf.keys():
    # create a new file for the group
    hf_tmp = h5py.File(group + '.h5', 'w')
    # get data from hf[group] and dump it into the new file
    # something like
    # hf_tmp = hf[group]
    # hf_tmp.dump()
    hf_tmp.close()
hf.close()

You have the right idea. There are several questions and answers on SO that show how to do this.
Start with this one. It shows how to loop over the keys and determine whether each one is a group or a dataset: h5py: how to use keys() loop over HDF5 Groups and Datasets
Then look at these. Each shows a slightly different approach to the problem.
This shows one way. Extracting datasets from 1 HDF5 file to multiple files
Also, here is an earlier post I wrote: How to copy a dataset object to a different hdf5 file using pytables or h5py?
This does the opposite (copies datasets from different files to 1 file). It's useful because it demonstrates how to use the .copy() method: How can I combine multiple .h5 file?
Finally, you should review the visititems() method to recursively search all Groups and Datasets. Take a look at this answer for details: is there a way to get datasets in all groups at once in h5py?
That should answer your questions.
Below is some pseudo-code that pulls all of these ideas together. It works for your schema, where all datasets are in root-level groups. It will not work for the more general case with datasets at multiple group levels. Use visititems() for the more general case (a sketch follows the pseudo-code below).
Pseudo-code below:
import h5py

file_path = 'path/to/data.hdf5'
with h5py.File(file_path, 'r') as hf:
    print(hf.keys())
    # loop over group names at the root level
    for group in hf.keys():
        hf_tmp = h5py.File(group + '.h5', 'w')
        # loop over dataset names in the group
        for dset in hf[group].keys():
            # copy the dataset to the new group file
            hf.copy(group + '/' + dset, hf_tmp)
        hf_tmp.close()
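For the more general case with datasets nested below the root-level groups, a rough sketch using visititems() might look like the following. It is untested against your file and assumes every dataset sits under some top-level group; the output file names are just illustrative.
import h5py

def split_by_top_group(file_path):
    # Find every dataset (at any depth) with visititems(), then copy each one
    # into a file named after its top-level group, keeping the same path below it.
    with h5py.File(file_path, 'r') as hf:
        dset_paths = []
        def collect(name, obj):
            if isinstance(obj, h5py.Dataset):
                dset_paths.append(name)
        hf.visititems(collect)
        for path in dset_paths:
            top, _, rel = path.partition('/')
            with h5py.File(top + '.h5', 'a') as hf_tmp:
                if '/' in rel:
                    hf_tmp.require_group(rel.rsplit('/', 1)[0])
                hf.copy(path, hf_tmp, name=rel)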

Related

h5py file subset taking more space than parent file?

I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to do some basic min/max scaling of the numerical data within lat_long, so I want to put it in its own h5py file for easier use and lower memory usage.
However, when I try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong_data)
out.close()
The output file is incredibly large. It's still not done writing to disk and is already ~85GB in size. Why is the data being written to the new file not compressed?
It could be that h5f['data/lat_long'] uses compression filters (and your new file doesn't). To check the original dataset's compression settings, use this line:
print(h5f['data/lat_long'].compression, h5f['data/lat_long'].compression_opts)
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the dataset into an array, which is not necessary in most use cases. An h5py dataset object behaves much like a NumPy array. Instead, use this line: ds = h5f['data/lat_long'] to create a dataset object (instead of an array) and use it "like" it's a NumPy array. FYI, .value is a deprecated method to return the dataset as an array; use this syntax instead: arr = h5f['data/lat_long'][()]. Loading the dataset into an array also requires more memory than using an h5py dataset object (which could be an issue with large datasets).
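As a rough illustration of working with the dataset object directly (the dataset path is taken from your question, and the code assumes a 2-D dataset with values in columns), only the slices you actually index are read into memory:
import h5py

with h5py.File('input.h5', 'r') as h5f:
    ds = h5f['data/lat_long']    # dataset object; nothing is read yet
    print(ds.shape, ds.dtype)    # metadata only
    col = ds[:, 0]               # reads just this column into a NumPy array
    scaled = (col - col.min()) / (col.max() - col.min())  # min/max scaling of that column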
There are other ways to access the data. My suggestion to use dataset objects is one way. Your method (extracting the data to a new file) is another. I am not fond of that approach because you then have two copies of the data, which is a bookkeeping nightmare. Another alternative is to create external links from a new file to the existing 18GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in this post: How can I combine multiple .h5 file? (Method 1: Create External Links).
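A minimal sketch of the external-link idea (file and dataset names assumed from your question):
import h5py

# Small file that simply points at the dataset inside the big 18GB file.
with h5py.File('latlong_link.h5', 'w') as f:
    f['latlong'] = h5py.ExternalLink('input.h5', '/data/lat_long')

# Opening the small file later resolves the link transparently,
# as long as input.h5 is reachable from the same location.
with h5py.File('latlong_link.h5', 'r') as f:
    print(f['latlong'].shape)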
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array and then writes the array to the new file (uncompressed). Instead, copy the dataset using h5py's group .copy() method; it will retain the compression settings and attributes.
See below:
with h5py.File('input.h5', 'r') as h5f1, \
     h5py.File('latlong_only.h5', 'w') as h5f2:
    h5f1.copy(h5f1['data/lat_long'], h5f2, 'latlong')
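If you prefer to write the array yourself rather than use .copy(), you can also request compression explicitly when creating the new dataset; the gzip level below is just an illustrative choice:
import h5py

with h5py.File('input.h5', 'r') as h5f1, \
     h5py.File('latlong_only.h5', 'w') as h5f2:
    arr = h5f1['data/lat_long'][()]
    h5f2.create_dataset('latlong', data=arr,
                        compression='gzip', compression_opts=4)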

Load group from hdf5

I have an hdf5 file that contains datasets inside groups. Example:
group1/dataset1
group1/dataset2
group1/datasetX
group2/dataset1
group2/dataset2
group2/datasetX
I'm able to read each dataset independently. This is how I read a dataset from a .hdf5 file:
def hdf5_load_dataset(hdf5_filename, dsetname):
    with h5py.File(hdf5_filename, 'r') as f:
        dset = f[dsetname]
        return dset[...]
# pseudo-code of how I call the hdf5_load_dataset() function
groups = {'group1': ['dataset1', 'dataset2', ...], 'group2': [...], ...}
for group in groups:
    for dataset in groups[group]:
        dset_value = hdf5_load_dataset(path_hdf5_file, f'{group}/{dataset}')
        # do stuff
I would like to know whether it's possible to load into memory all the datasets of group1, then of group2, etc. as a dictionary (or similar) in a single file read. My script is taking quite some time (~4 min) to read ~200k datasets. There are 2k groups with 100 datasets each. If I could load one whole group into memory at a time, it would not overload the memory and I would gain speed.
This is pseudo-code of what I'm looking for:
for group in groups:
    dset_group_as_dict = hdf5_load_group(path_hdf5_file, f'{group}')
    for dataset in dset_group_as_dict:
        # do stuff
EDIT:
Inside each .csv file:
time, amplitude
1.000e-08, -1.432e-07
1.001e-08, 7.992e-07
1.003e-08, -1.838e-05
1.003e-08, 2.521e-05
For each .csv file in each folder I have a dataset for the time and one for the amplitude. The structure of the hdf5 file is like this:
XY_1/impact_X/time
XY_1/impact_Y/amplitude
where
time = np.array([1.000e-08, 1.001e-08, 1.003e-08, ...]) # 500 items
amplitude = np.array([-1.432e-07, 7.992e-07, -1.838e-05, ...]) # 500 items
XY_1 is a position in space.
impact_X means that X was impacted in position XY_1 so X amplitude has changed.
So, XY_1 must be in a different group from XY_2, and likewise impact_X, impact_Y, etc., since they represent data for a particular XY position.
I need to create plots from each (time, amplitude) pair, or from just one of them (configurable). I also need to compare the amplitudes with a "golden" array to see the differences and calculate other quantities. To perform the calculations I will read all the datasets, do the computation, and save the result.
I have more than 200k .csv files for each test case, more than 5M in total. Doing 5M reads from disk will take quite some time. For the 200k files, exporting all the .csv files into a single JSON file brings the run time to ~40s, while reading the .csv files directly takes ~4 min. I can no longer use a single JSON file due to memory issues when loading it. That's why I chose hdf5 as an alternative.
EDIT 2:
How I read the csv file:
def read_csv_return_list_of_rows(csv_file, _delimiter):
    csv_file_list = list()
    with open(csv_file, 'r') as f_read:
        csv_reader = csv.reader(f_read, delimiter=_delimiter)
        for row in csv_reader:
            csv_file_list.append(row)
    return csv_file_list
No, there is no single function that reads multiple groups or datasets at once. You have to make it yourself from lower level functions that read one group or dataset.
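As a rough sketch of how you could build that yourself from the lower level pieces (opening the file once for the whole loop, which is probably where most of the 4 minutes is going, and assuming only groups at the root level as in your layout), something like this yields one dictionary per root-level group:
import h5py

def iter_groups_as_dicts(hdf5_filename):
    # Open the file once, then yield (group_name, {relative_path: array})
    # for each root-level group, e.g. ('XY_1', {'impact_X/time': ..., ...}).
    with h5py.File(hdf5_filename, 'r') as f:
        for group_name, group in f.items():
            dsets = {}
            def collect(name, obj):
                if isinstance(obj, h5py.Dataset):
                    dsets[name] = obj[...]
            group.visititems(collect)
            yield group_name, dsets

# usage:
# for group, dset_group_as_dict in iter_groups_as_dicts(path_hdf5_file):
#     ...  # do stuff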
And can you give us some more context? What kind of data is it, and how do you want to process it? (Do you want to compute statistics? Make plots? Et cetera.) What are you ultimately trying to achieve? This may help us avoid the classic XY problem.
In your earlier question you said you converted a lot of small CSV files into one big HDF file. Can you tell us why? What is wrong with having many small CSV files?
In my experience, HDF files with a huge number of groups and datasets are fairly slow, as you are experiencing now. It is better to have relatively few, but larger, datasets. Is it possible for you to somehow merge multiple datasets into one? If not, HDF may not be the best solution for your problem.

How to export a list of dataframes from R to Python?

I'm currently working with functional MRI data in R but I need to import it to Python for some faster analysis. How can I do that in an efficient way?
I currently have in R a list of 198135 dataframes. All of them have 5 variables and 84 observations of connectivity between brain regions. I need the same 198135 dataframes in Python for running some specific analysis there (with the same structure as in R: one object that contains all the dataframes separately).
Initially I tried exporting an .RDS file from R and then importing it into Python using "pyreadr", but I got empty objects in every attempt with the "pyreadr.read_r()" function.
My other approach was to save every dataframe of the R list as a separate .csv file and then import them into Python. That way I could get what I wanted (I tried it with 100 dataframes only, to test the code). The problem with this method is that it is highly inefficient and slow.
I found several answers to similar problems, but most of them were about merging all the dataframes and loading them as a single .csv into Python, which is not the solution I need.
Is there some more efficient way to do this process, without altering the data structure that I mentioned?
Thanks for your help!
# This is the code in R for an example
a <- as.data.frame(cbind(c(1:3), c(1:3), c(4:6), c(7:9)))
b <- as.data.frame(cbind(c(11:13), c(21:23), c(64:66), c(77:79)))
c <- as.data.frame(cbind(c(31:33), c(61:63), c(34:36), c(57:59)))
d <- as.data.frame(cbind(c(12:14), c(13:15), c(54:56), c(67:69)))
e <- as.data.frame(cbind(c(31:33), c(51:53), c(54:56), c(37:39)))
somelist_of_df <- list(a,b,c,d,e)
saveRDS(somelist_of_df, "somefile.rds")
## This is the function I used from pyreadr in Python
import pyreadr
results = pyreadr.read_r('/somepath/somefile.rds')
Well, thanks for the help in the other answers, but it's not exactly what I was looking for (I wanted to export just one file with the list of dataframes in it, and then load that single file into Python, keeping the same structure). To use feather you have to decompose the list into the individual dataframes, pretty much like saving separate .csv files, and then load each one of them into Python (or R). Anyway, it must be said that this is much faster than the .csv method.
I leave the code that I used successfully here; maybe it will be useful to other people, since I used a simple loop for loading the dataframes into Python as a list:
## Exporting a list of dataframes from R to .feather files
library(feather) #required package
a <- as.data.frame(cbind(c(1:3), c(1:3), c(4:6), c(7:9))) #Example DFs
b <- as.data.frame(cbind(c(11:13), c(21:23), c(64:66), c(77:79)))
c <- as.data.frame(cbind(c(31:33), c(61:63), c(34:36), c(57:59)))
d <- as.data.frame(cbind(c(12:14), c(13:15), c(54:56), c(67:69)))
e <- as.data.frame(cbind(c(31:33), c(51:53), c(54:56), c(37:39)))
somelist_of_df <- list(a,b,c,d,e)
## With sapply you loop over the list for creating the .feather files
sapply(seq_along(1:length(somelist_of_df)),
       function(i) write_feather(somelist_of_df[[i]],
                                 paste0("/your/directory/", "DF", i, ".feather")))
(Using just a MacBook Air, the code above took less than 5 seconds to run for a list of 198135 DFs)
## Importing .feather files into a list of DFs in Python
import os
import feather
os.chdir('/your/directory')
directory = '/your/directory'
py_list_of_DFs = []
for filename in os.listdir(directory):
    DF = feather.read_dataframe(filename)
    py_list_of_DFs.append(DF)
(This code did the job for me, although it was a bit slow: it took 12 minutes to do the task for the 198135 DFs.)
I hope this could be useful for somebody.
This package may be of some interest to you.
Pandas also implements a direct way to read .feather files:
pd.read_feather()
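For example, reading a whole directory of .feather files into a list with pandas only (the directory path is assumed from the answer above) might look like:
import os
import pandas as pd

directory = '/your/directory'
py_list_of_DFs = [pd.read_feather(os.path.join(directory, f))
                  for f in sorted(os.listdir(directory))
                  if f.endswith('.feather')]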
Pyreadr cannot currently read R lists, therefore you need to save the dataframes individually; you also need to save to an RDA file so that you can host multiple dataframes in one file:
# first construct a list with the names of dataframes you want to save
# instead of the dataframes themselves
somelist_of_df <- list("a", "b", "c", "d", "e")
do.call("save", c(somelist_of_df, file="somefile.rda"))
or any other variant as described here.
Then you can read the file in python:
import pyreadr
results = pyreadr.read_r('/somepath/somefile.rda')
The advantage is that there will be only one file with all dataframes.
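pyreadr returns an ordered dictionary keyed by the R object names, so rebuilding a Python list of dataframes might look like this (names taken from the example above):
import pyreadr

results = pyreadr.read_r('/somepath/somefile.rda')
print(list(results.keys()))              # e.g. ['a', 'b', 'c', 'd', 'e']
py_list_of_DFs = list(results.values())  # each value is a pandas DataFrame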
I cannot comment on @crlagos0's answer because of reputation. I want to add a couple of things:
seq_along(list_of_things) is enough; there is no need to do seq_along(1:length(list_of_things)) in R. Also, I want to point out that the official package to read and write feather files in R is called arrow, and you can find its documentation here. In Python it is pyarrow.

Can I delete an element from an HDF5 dataset?

I would like to delete an element from an HDF5 dataset in Python. Below I have my example code
DeleteHDF5Dataset.py
# This code works, which deletes an HDF5 dataset from an HDF5 file
file_name = os.path.join('myfilepath', 'myfilename.hdf5')
f = h5py.File(file_name, 'r+')
f.__delitem__('Log list')
However, this is not what I want to do. 'mydataset' is an HDF5 dataset that has several elements, and I would like to delete one or more of those elements individually, for example
DeleteHDF5DatasetElement.py
# This code does not work, but I would like to achieve what it's trying to do
file_name = os.path.join('myfilepath', 'myfilename.hdf5')
f = h5py.File(file_name, 'r+')
print(f['Log list'][3]) # prints the correct dataset element
f.__delitem__('Log list')[3] # I want to delete element 3 of this HDF5 dataset
The best solution I can come up with is to create a temporary dataset, loop through the original dataset, and only add the entries I want to keep to the temp dataset, and then replace the old dataset with the new one. But this seems pretty clunky. Does anybody have a clean solution to do this? It seems like there should be a simple way to just delete an element.
Thanks, and sorry if any of my terminology is incorrect.
It looks like you have an array of strings. It's not the recommended way of storing strings in HDF5, but let's assume you have no choice on how data is stored.
HDF5 prefers you to keep your array size fixed. Operations such as deleting arbitrary elements are expensive. In addition, with HDF5, space is not automatically freed when you delete data.
After all this, if you still want to remove an element from your dataset, you can try simply extracting the data to an array, deleting the element, then reassigning the result to the dataset:
import numpy as np

arr = f['Log list'][:]   # extract to a numpy array
res = np.delete(arr, 1)  # delete the element with index 1, i.e. the second element
del f['Log list']        # delete the existing dataset
f['Log list'] = res      # reassign to the dataset

Join/merge multiple NetCDF files using xarray

I have a folder with NetCDF files from 2006-2100, in ten year blocks (2011-2020, 2021-2030 etc).
I want to create a new NetCDF file which contains all of these files joined together. So far I have read in the files:
ds = xarray.open_dataset('Path/to/file/20062010.nc')
ds1 = xarray.open_dataset('Path/to/file/20112020.nc')
etc.
Then merged these like this:
dsmerged = xarray.merge([ds,ds1])
This works, but is clunky and there must be a simpler way to automate this process, as I will be doing this for many different folders full of files. Is there a more efficient way to do this?
EDIT:
Trying to join these files using glob:
for filename in glob.glob('path/to/file/*.nc'):
    dsmerged = xarray.merge([filename])
Gives the error:
AttributeError: 'str' object has no attribute 'items'
This is reading only the text of the filename, not the actual file itself, so it can't merge it. How do I open the files, store them as variables, and then merge them without doing it bit by bit?
If you are looking for a clean way to get all your datasets merged together, you can use some form of list comprehension and the xarray.merge function to get it done. The following is an illustration:
ds = xarray.merge([xarray.open_dataset(f) for f in glob.glob('path/to/file/*.nc')])
In response to the out-of-memory issues you encountered: that is probably because you have more files than the Python process can handle. The best fix for that is to use the xarray.open_mfdataset function, which uses the dask library under the hood to break the data into smaller chunks for processing. This is usually more memory efficient and will often allow you to bring your data into Python. With this function, you do not need a for-loop; you can just pass it a glob string of the form "path/to/my/files/*.nc". The following is equivalent to the previously provided solution, but more memory efficient:
ds = xarray.open_mfdataset('path/to/file/*.nc')
I hope this proves useful.
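Since the end goal is a single new NetCDF file, a short sketch of that last step (paths assumed) could be:
import xarray

# Open all the decade files lazily, then write one combined file to disk.
ds = xarray.open_mfdataset('path/to/file/*.nc')
ds.to_netcdf('path/to/file/combined_2006_2100.nc')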
