How to link HDF5 files generated with Pandas?

Assume we have a folder with HDF5-files generated by pandas.to_hdf. I would like to create one master.h5 file that contains external links to all the DataFrames.
According to the documentation of h5py, the standard way to do this is
myfile = h5py.File('master.h5','w')
myfile['ext link'] = h5py.ExternalLink("some_sub_file.h5", "/path/to/resource")
But files generated by pandas.to_hdf contain not just datasets, but h5py.Groups. How exactly would you then set up the external link to work?

Links can point to any object in the HDF5 data structure (datasets or groups). The file is a special form of a group, called the root group, and is referenced with '/'. So, to link to a file, use: h5py.ExternalLink(filename,'/').
You didn't say if you want a link for each dataframe/dataset in each file, or links for each file. It's simpler to create links to the file root groups. If you create individual links to the datasets, be sure you assign unique names.
There are 2 answers that demonstrate each method. The questions were not specifically about h5py.ExternalLink(), but my answers to each question used external links. See these answers:
HDF5 Attributes of External Links: Creates links to root group in multiple files. (each file only has 1 dataset...but your process would be identical.)
I/O Issues in Loading Several Large H5PY Files : Creates links to multiple datasets in multiple files. (Requires unique dataset names to work "as-is". Can be modified if names are not unique.)
I modified the code from the second answer (70089964) to show how to create 3 external links from the master file to the root group in 3 files (where each file has 5 datasets).
Code to create 3 example files:
import h5py
import numpy as np
for fcnt in range(3):
    fname = f'file_{fcnt+1}.h5'
    with h5py.File(fname,'w') as h5fw:
        for dscnt in range(1,6,1):
            arr = np.random.random(10).reshape(5,2)
            h5fw.create_dataset(f'data_{fcnt*10+dscnt:02}',data=arr*dscnt)
Code to create links from the master file to the 3 files:
import h5py
fnames = ['file_1.h5','file_2.h5','file_3.h5']
with h5py.File(f'master_{len(fnames)}_links.h5','w') as h5fw:
    for fname in fnames:
        # Opening the sub-file only checks that it exists and is readable
        with h5py.File(fname,'r') as h5fr:
            h5fw[fname] = h5py.ExternalLink(fname,'/')
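For files written by pandas.to_hdf, each DataFrame lives in a group named after its key, so the link target can be that group instead of the root. Below is a minimal h5py sketch, assuming the top-level names in each sub-file are the pandas keys. The helper name link_top_level_objects and the "file stem + key" naming scheme are illustrative choices, not part of h5py or pandas.

```python
import os
import h5py

def link_top_level_objects(master_name, sub_files):
    """Create one external link in master_name per top-level object of each sub-file."""
    with h5py.File(master_name, "w") as master:
        for fname in sub_files:
            # Collect the top-level names (for pandas files: the DataFrame keys).
            with h5py.File(fname, "r") as sub:
                keys = list(sub.keys())
            stem = os.path.splitext(os.path.basename(fname))[0]
            for key in keys:
                # Prefix with the file stem so link names stay unique in the master file.
                master[f"{stem}__{key}"] = h5py.ExternalLink(fname, f"/{key}")
```

Calling link_top_level_objects('master.h5', ['file_1.h5', 'file_2.h5']) would then give links such as file_1__key1, each pointing at one group in one sub-file.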

Additional research on Pandas and HDF5 links revealed an interesting discovery: there are limitations with links (you can create them in Pandas, but Pandas can't access the linked data). In other words, the links are there, and work fine with HDFView, h5py and PyTables. Reference these GitHub issues:
Pandas hdf functions should support the hdf5 ExternalLink functionality when reading/writing - Issue #6019
Presence of softlink in HDF5 file breaks HDFStore.keys() - Issue #20523
Status for both appears to be Open. My tests confirm previously reported errors.
The code below shows how to create both link types. It also shows the error message you will get when you try to access the linked data: KeyError: 'you cannot get attributes from this `NoAttrs` instance'. This is due to an HDF5 limitation: attributes cannot be attached to links. An HDFStore node has some required attributes, so the result is the 'NoAttrs' message when Pandas tries to read them.
import pandas as pd
df1 = pd.DataFrame({ "a": [1,2,3,4], "b": [11,12,13,14] })
print(df1.to_string())
# Create file 1 with simple dataframe
f1 = "test_1.hdf"
with pd.HDFStore(f1, mode="w") as hdf1:
    hdf1.put("/key1", df1)
# Create file 2 with external link
f2 = "test_extlink.hdf"
with pd.HDFStore(f2, mode="w") as hdf2:
    hdf2._handle.create_external_link(hdf2._handle.root, "extlink_key1", f"{f1}:/key1")
print("Successful external link write")
with pd.HDFStore(f2, mode="r") as hdf2:
    print(hdf2.keys()) # Notice that [] (no keys) is printed
# following lines will trigger the 'NoAttrs' error message
# df2test = pd.read_hdf(f2,key="extlink_key1")
# print(df2test.to_string())
print("End external link read")
# Create file 3 with simple dataframe and symbolic (soft) link
f3 = "test_symlink.hdf"
with pd.HDFStore(f3, mode="w") as hdf3:
    hdf3.put("/key1", df1)
    hdf3._handle.create_soft_link(hdf3._handle.root, "symlink_key1", "/key1")
print("Successful symbolic link write")
with pd.HDFStore(f3, mode="r") as hdf3:
    print(hdf3.keys()) # Notice that only ['key1'] is printed
# following lines will trigger the 'NoAttrs' error message
# df3test = pd.read_hdf(f3,key="symlink_key1")
# print(df3test.to_string())
print("End symbolic link read")
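Until pandas can follow links, one possible workaround is to look up where each external link points and read the target file directly. The sketch below uses h5py's get(..., getlink=True), which returns the link object itself without opening the target. The helper name resolve_external_links is hypothetical.

```python
import h5py

def resolve_external_links(master_name):
    """Return {link_name: (target_file, target_path)} for external links in the root group."""
    targets = {}
    with h5py.File(master_name, "r") as master:
        for name in master:
            # getlink=True returns the link description instead of following it.
            link = master.get(name, getlink=True)
            if isinstance(link, h5py.ExternalLink):
                targets[name] = (link.filename, link.path)
    return targets
```

Each (target_file, target_path) pair could then be passed to pd.read_hdf(target_file, key=target_path), so that pandas opens the original file rather than the link.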

Related

How to read HDF5 files in R without the memory error?

Goal
Read the data component of a hdf5 file in R.
Problem
I am using rhdf5 to read hdf5 files in R. Out of 75 files, it successfully read 61. For the rest it throws a memory error, even though some of these files are smaller than files that were read successfully.
I have tried running individual files in a fresh R session, but get the same error.
Following is an example:
# Exploring the contents of the file:
library(rhdf5)
h5ls("music_0_math_0_simple_12_2022_08_08.hdf5")
  group                name       otype  dclass       dim
0     /                data   H5I_GROUP
1 /data           ACC_State H5I_DATASET INTEGER     1 x 1
2 /data    ACC_State_Frames H5I_DATASET INTEGER         1
3 /data         ACC_Voltage H5I_DATASET   FLOAT 24792 x 1
4 /data AUX_CACC_Adjust_Gap H5I_DATASET INTEGER 24792 x 1
... CONTINUES ----
# Reading the file:
rhdf5::h5read("music_0_math_0_simple_12_2022_08_08.hdf5", name = "data")
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, :
Not enough memory to read data! Try to read a subset of data by specifying the index or count parameter.
In addition: Warning message:
In h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
Error: Error in h5checktype(). H5Identifier not valid.
I can read the file via python:
import h5py
filename = "music_0_math_0_simple_12_2022_08_08.hdf5"
hf = h5py.File(filename, "r")
hf.keys()
data = hf.get('data')
data['SCC_Follow_Info']
#<HDF5 dataset "SCC_Follow_Info": shape (9, 24792), type "<f4">
How can I successfully read the file in R?

When you ask to read the data group, rhdf5 will read all the underlying datasets into R's memory. It's not clear from your example exactly how much data this is, but maybe for some of your files it really is more than the available memory on your computer. I don't know how Python works under the hood, but perhaps it doesn't do any reading of datasets until you run data['SCC_Follow_Info']?
One option is to be more selective: rather than reading the entire data group, read only the specific dataset you're interested in at that moment. In the Python example that seems to be /data/SCC_Follow_Info.
You can do that with something like:
follow_info <- h5read(file = "music_0_math_0_simple_12_2022_08_08.hdf5",
                      name = "/data/SCC_Follow_Info")
Once you've finished working with that dataset, remove it from your R session, e.g. rm(follow_info), and read the next dataset or file you need.
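On the Python side, h5py does behave lazily: indexing a file or group returns a handle, and data is only read from disk when the dataset itself is sliced. Here is a small illustration with a made-up file (lazy_demo.h5 and its contents are assumptions for the example, not the asker's data):

```python
import h5py
import numpy as np

# Build a small stand-in file with one dataset inside a "data" group.
with h5py.File("lazy_demo.h5", "w") as f:
    f.create_dataset("data/SCC_Follow_Info", data=np.zeros((9, 100), dtype="f4"))

with h5py.File("lazy_demo.h5", "r") as f:
    handle = f["data"]["SCC_Follow_Info"]  # h5py.Dataset handle; no data read yet
    handle_shape = handle.shape            # metadata, available without reading the data
    first_row = handle[0, :]               # reads only this slice into a NumPy array
```

This is the same idea as reading a single dataset by name with h5read: only the part that is asked for ends up in memory.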

Search keywords from multiple Excel columns/rows in multiple PDF files

I am new to Python and am struggling to build a solution. The goal is to check that some mandatory information (keywords) is present in a PDF. I have an Excel file where each row corresponds to a transaction, and I need to check that every transaction (and the mandatory information related to it) appears in a corresponding PDF sent during the day.
So, on one side, I have several Excel rows in a sheet with the mandatory information (corresponding to info on each transaction), and on the other side, I have a folder with several PDFs.
I am trying to extract the data of each PDF so that the workflow can check whether the information for each row in my Excel file is in a single PDF. I looked at some questions raised here and tried to apply some of the solutions to my problem, but I haven't managed to obtain a fully working solution.
I have been able to build partial code that extracts the PDF data and looks for the keywords:
import os
from glob import glob
import re
from PyPDF2 import PdfFileReader
def search_page(pattern, page):
    yield from pattern.findall(page.extractText())
def search_document(pattern, path):
    document = PdfFileReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)
searchWords = ['my list of keywords in each row of my Excel file']
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(searchWords))
for path in glob('path of my folder with all the pdf files'):
    matches = search_document(pattern, path)
#inspired by a solution on stackoverflow used to count the occurrences of keywords
Also, I think that using pandas to build the list of keywords should work, but I can't use it in my previous code: the search tool wants a string, not a list.
import pandas as pd
df=pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
print(df) #I wanted to check that the code was selecting the right columns only, as some other columns have unnecessary information
I don't know how to build a searchWords list for each row of my Excel file and feed it into the first part of the code. Also, I don't know how to require ALL the keywords of the list (one Excel row) to be found, as it is mandatory to have all the information of a transaction in the same PDF. When it finds all the info, it should return "ok row 1" or something like that, then do the check for the second row, etc. (and report an error if it doesn't find all the information).
P.S.: Originally, I only wanted to extract the data with Python code and add it to an Alteryx workflow, but the Python tool of Alteryx doesn't accept some packages at my company.
I would be very thankful for any help!
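One way to structure the per-row check described above: treat each Excel row as a list of mandatory keywords and accept a PDF only if every keyword occurs in its text. The sketch below works on plain strings so that it stays independent of the PDF and Excel libraries. The names contains_all, find_pdf_for_row, rows and pdf_texts are hypothetical, and the sample data is made up; the real texts would come from the extraction code above.

```python
import re

def contains_all(keywords, text):
    """True if every keyword occurs in text as a whole word (case-insensitive)."""
    return all(
        re.search(r"\b%s\b" % re.escape(str(kw)), text, flags=re.IGNORECASE)
        for kw in keywords
    )

def find_pdf_for_row(keywords, pdf_texts):
    """Return the name of the first PDF whose text contains all keywords, else None."""
    for name, text in pdf_texts.items():
        if contains_all(keywords, text):
            return name
    return None

# Stand-in data: one keyword list per Excel row, one extracted text per PDF.
rows = [["ACME", "1200", "EUR"], ["Globex", "99", "USD"]]
pdf_texts = {
    "a.pdf": "Transaction for ACME: amount 1200 EUR",
    "b.pdf": "Transaction for Globex: amount 98 USD",
}

report = {}
for i, keywords in enumerate(rows, start=1):
    match = find_pdf_for_row(keywords, pdf_texts)
    report[i] = f"ok row {i} ({match})" if match else f"error row {i}"
```

With pandas, the keyword lists could be taken from the DataFrame with something like df.astype(str).values.tolist(), assuming every selected column holds one mandatory value.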

How to avail "Forecasting: Methods and Application" dataset in Python?

I am using the book Forecasting: Methods and Applications by Makridakis, Wheelwright and Hyndman. I want to do the exercises along the way, but in Python, not R (as suggested in the book).
I do not know how to use R. I know that the datasets can be obtained from an R package, fma. This is the link to the package.
Is there a possible script, in R or Python, which will allow me to download the datasets as .csv files? That way, I will be able to access them using Python.

one possibility:
## install and load package:
install.packages('fma')
library('fma')
## list example data of package fma:
data(package = 'fma')
## export single data as csv:
write.csv(cement, file = 'cement.csv')
## bulk export:
## data names are in the 3rd column (`[,3]`) of list member "results"
## of `data(...)` output
for (data_name in data(package = 'fma')[['results']][,3]){
  write.csv(get(data_name), file = paste0(data_name, '.csv'))
}
Edit:
As Anirban noted, attaching the package {fma} exposes only a few datasets to the search path. The data can be obtained by cloning or downloading from Rob J. Hyndman's source (click green Code button and choose). Subfolder 'data' contains each dataset as an .rda file which can be load()ed and converted. (Observe the licence conditions - GPL3.0 - and acknowledge the authors' efforts anyway.)
That said, you could load and convert the data like this:
setwd('path/to/fma-master/data')
## only pick up the .rda files, so a second run skips the generated .csv files
for(data_name in dir(pattern = '\\.rda$')){
  cat(paste0('converting ', data_name, '... '))
  load(data_name)
  object_name <- gsub('\\.rda$', '', data_name)
  ## write.csv overwrites an existing file; its `append` argument cannot be changed
  write.csv(get(object_name),
            file = paste0(object_name, '.csv'),
            row.names = FALSE)
}

How to extract relation members from .osm xml files

All,
I've been trying to build a website (in Django) which is to be an index of all MTB routes in the world. I'm a Pythonian so wherever I can I try to use Python.
I've successfully extracted data from the OSM API (Display relation (trail) in leaflet) but found that doing this for all MTB trails (tag: route=mtb) is too much data (processing takes very long). So I tried to do everything locally by downloading a torrent of the entire OpenStreetMap dataset (from Latest Weekly Planet XML File) and filtering for tag: route=mtb using osmfilter (part of osmctools in Ubuntu 20.04), like this:
osmfilter $unzipped_osm_planet_file --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.osm
This produces a file of about 1.2 GB and on closer inspection seems to contain all the data I need. My goal was to transform the file into a pandas.DataFrame() so I could do some further filtering and transforming before pushing relevant aspects into my Django DB. I tried to load the file as a regular XML file using Python Pandas but this crashed the Jupyter notebook kernel. I guess the data is too big.
My second approach was this solution: How to extract and visualize data from OSM file in Python. It worked for me, at least, I can get some of the information, like the tags of the relations in the file (and the other specified details). What I'm missing is the relation members (the ways) and then the way members (the nodes) and their latitude/longitudes. I need these to achieve what I did here: Plotting OpenStreetMap relations does not generate continuous lines
I'm open to many solutions. For example, one could break the file up into many different files containing 1 relation and its members per file, using an osmium-based script. Perhaps then I can move on with pandas.read_xml(). This would be nice for batch processing and filling the database. Loading the whole OSM XML file into a pd.DataFrame would be nice but I guess this really is a lot of data. Perhaps this can also be done on a per-relation basis with pyosmium?
Any help is appreciated.

Ok, I figured out how to get what I want (all information per relation of the type "route=mtb" stored in an accessible way). It's a multi-step process, which I'll describe here.
First, I downloaded the world file as .pbf: I went to wiki.openstreetmap.org/wiki/Planet.osm and opened the xml of the pbf file. Everything was done on Linux, and this file is referred to as $osm_planet_file below.
I converted this file to o5m using osmconvert (available in Ubuntu 20.04 by doing apt install osmctools) on the Linux cli:
osmconvert --verbose --drop-version $osm_planet_file -o=$osm_planet_dir/planet.o5m
The next step is to filter all relations of interest out of this file (in my case I wanted all MTB routes: route=mtb) and store them in a new file, like this:
osmfilter $osm_planet_dir/planet.o5m --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.o5m
This creates a much smaller file that contains all information on the relations that are MTB routes.
From there on I switched to a Jupyter notebook and used Python3 to further divide the file into useful, manageable chunks. I first installed osmium using conda (in the env I created first but that can be skipped):
conda install -c conda-forge osmium
Then I made the recommended osm.SimpleHandler class. This class iterates through the large o5m file and can perform actions while doing so. This is the way to deal with these files because they are far too big for memory. I made the choice to iterate through the file and store everything I needed into separate json files. This does generate more than 12,000 json files but it can be done on my laptop with 8 GB of memory. This is the class:
import osmium as osm
import json
import os
data_dump_dir = '../data'
class OSMHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.osm_data = []
    def tag_inventory(self, elem, elem_type):
        # The json file is rewritten for every tag, so the last tag's key/value remain
        for tag in elem.tags:
            data = dict()
            # Note: the trailing commas below turn each value into a 1-tuple, which
            # json.dump writes as a one-element list. The later steps rely on this
            # when they read data['members'][0].
            data['version'] = elem.version,
            data['members'] = [int(member.ref) for member in elem.members if member.type == 'w'], # filter nodes from waylist => could be a mistake
            data['visible'] = elem.visible,
            data['timestamp'] = str(elem.timestamp),
            data['uid'] = elem.uid,
            data['user'] = elem.user,
            data['changeset'] = elem.changeset,
            data['num_tags'] = len(elem.tags),
            data['key'] = tag.k,
            data['value'] = tag.v,
            data['deleted'] = elem.deleted
            with open(os.path.join(data_dump_dir, str(elem.id)+'.json'), 'w') as f:
                json.dump(data, f)
    def relation(self, r):
        self.tag_inventory(r, "relation")
Run the class like this:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
Now we have json files with the relation number as their filename, containing all metadata and a list of the ways. But we want a list of the ways and also all the nodes per way, so we can plot the full relations (the MTB routes). To achieve this, we parse the o5m file again (using a class built on the osm.SimpleHandler class) and this time we extract all way members (the nodes), and create a dictionary:
class OSMHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.osm_data = dict()
    def tag_inventory(self, elem, elem_type):
        for tag in elem.tags:
            self.osm_data[int(elem.id)] = dict()
            # self.osm_data[int(elem.id)]['is_closed'] = str(elem.is_closed)
            self.osm_data[int(elem.id)]['nodes'] = [str(n) for n in elem.nodes]
    def way(self, w):
        self.tag_inventory(w, "way")
Execute the class:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
ways = osmhandler.osm_data
This gives a dict (called ways) with all way IDs as keys and the node IDs of each way as values (meaning we need some more steps!).
len(ways.keys())
>>> 337597
In the next (and almost last) step we add the node IDs for all ways to our relation jsons, so they become part of the files:
all_data = dict()
for relation_file in [
    os.path.join(data_dump_dir,file) for file in os.listdir(data_dump_dir) if file.endswith('.json')
]:
    with open(relation_file, 'r') as f:
        data = json.load(f)
    if 'members' in data: # Make sure these steps are never performed twice
        try:
            data['ways'] = dict()
            for way in data['members'][0]:
                data['ways'][way] = ways[way]['nodes']
            del data['members']
            with open(relation_file, 'w') as f:
                json.dump(data, f)
        except KeyError as err:
            print(err, relation_file) # Not sure why some relations give errors?
So now we have relation jsons with all ways, and all ways have all node IDs. The last thing to do is to replace the node IDs with their values (latitude and longitude). I also did this in 2 steps. First I built a nodeID:lat/lon dictionary, again using an osmium.SimpleHandler based class:
import osmium
class CounterHandler(osmium.SimpleHandler):
    def __init__(self):
        osmium.SimpleHandler.__init__(self)
        self.osm_data = dict()
    def node(self, n):
        self.osm_data[int(n.id)] = [n.location.lat, n.location.lon]
Execute the class:
h = CounterHandler()
h.apply_file("../data/world_mtb_routes.o5m")
nodes = h.osm_data
This gives us a dict with a latitude/longitude pair for every node ID. We can use this on our json files to fill the ways with coordinates (where there are now still only node IDs). I create these final json files in a new directory (data/with_coords in my case) so that, if there is an error, my original (input) json file is not affected and I can try again:
import json
import os
os.makedirs('../data/with_coords', exist_ok=True) # the output directory must exist
relation_files = [file for file in os.listdir('../data/') if file.endswith('.json')]
for relation in relation_files:
    relation_file = os.path.join('../data/',relation)
    relation_file_with_coords = os.path.join('../data/with_coords',relation)
    with open(relation_file, 'r') as f:
        data = json.load(f)
    try:
        for way in data['ways']:
            node_coords_per_way = []
            for node in data['ways'][way]:
                node_coords_per_way.append(nodes[int(node)])
            data['ways'][way] = node_coords_per_way
        with open(relation_file_with_coords, 'w') as f:
            json.dump(data, f)
    except KeyError:
        print(relation)
Now I have what I need and I can start adding the info to my Django database, but that is beyond the scope of this question.
Btw, there are some relations that give an error, I suspect that for some relations ways were labelled as nodes but I'm not sure. I'll update here if I find out. I also have to do this process regularly (when the world file updates, or every now and then) so I'll probably write something more concise later on, but for now this works and the steps are understandable, to me, after a lot of thinking at least.
All of the complexity comes from the fact that the data is too big for memory, otherwise I'd have created a pandas.DataFrame in step one and been done with it. I could also have loaded the data into a database in one go perhaps, but I'm not that good with databases, yet.
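The two join steps (relation to ways, ways to coordinates) can also be done in memory once the lookups exist. Here is a compact sketch with plain dicts standing in for the handler results. The function name relation_coords and the sample IDs are made up for illustration, and unresolved ways or nodes are collected in a list instead of raising KeyError:

```python
def relation_coords(way_ids, ways, nodes):
    """Map each way ID of one relation to its list of [lat, lon] pairs.

    ways:  {way_id: [node_id, ...]}
    nodes: {node_id: [lat, lon]}
    Returns (coords_per_way, missing), where missing lists the unresolved IDs.
    """
    coords_per_way, missing = {}, []
    for way_id in way_ids:
        if way_id not in ways:
            missing.append(("way", way_id))
            continue
        coords = []
        for node_id in ways[way_id]:
            if node_id in nodes:
                coords.append(nodes[node_id])
            else:
                missing.append(("node", node_id))
        coords_per_way[way_id] = coords
    return coords_per_way, missing

# Tiny made-up example: way 12 and node 3 are referenced but not present.
ways = {10: [1, 2], 11: [2, 3]}
nodes = {1: [52.0, 5.0], 2: [52.1, 5.1]}
coords, missing = relation_coords([10, 11, 12], ways, nodes)
```

The missing list makes it possible to see which relations reference objects that are absent from the filtered file, which may help when tracking down the errors mentioned above.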

How to copy a dataset object to a different hdf5 file using pytables or h5py?

I have selected specific hdf5 datasets and want to copy them to a new hdf5 file. I could find some tutorials on copying between two files, but what if you have just created a new file and you want to copy datasets to the file? I thought the way below would work, but it doesn't. Are there any simple ways to do this?
>>> dic_oldDataset['old_dataset']
<HDF5 dataset "old_dataset": shape (333217,), type "|V14">
>>> new_file = h5py.File('new_file.h5', 'a')
>>> new_file.create_group('new_group')
>>> new_file['new_group']['new_dataset'] = dic_oldDataset['old_dataset']
RuntimeError: Unable to create link (interfile hard links are not allowed)

Answer 3
Use the copy method of the group class from h5py.
TL;DR
This works on groups and datasets.
Is recursive (can do deep and shallow copies).
Has options for attributes, symbolic links and references.
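import h5py
with h5py.File('destFile.h5','w') as f_dest:
    with h5py.File('srcFile.h5','r') as f_src:
        # require_group creates the destination group if it does not exist yet
        dest_group = f_dest.require_group("/another/path")
        f_src.copy(f_src["/path/to/DataSet"], dest_group, "DataSet")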
(The file object is also the root group.)
Locations in HDF5
"An HDF5 file is organized as a rooted, directed graph" (source).
HDF5 groups (including the root group) and data sets are related to each other as "locations" (in the C API most functions take a loc_id which identifies a group or data set). These locations are the nodes on the graph; paths describe arcs through the graph to a node. copy takes a source and destination location, not specifically a group or dataset, so it can be applied to both. The source and destination do not need to be in the same file.
Attributes
Attributes are stored within the header of the group or data set they are associated with. Therefore the attributes are also associated with that "location". It follows that copying a group or dataset will include all attributes associated with that "location". However you can turn this off.
References
copy offers settings for references, also called object pointers. Object pointers are a data type in hdf5: H5T_STD_REF_OBJ, similar to an integer H5T_STD_I32BE (source), and can be stored in attributes or data sets. References can point to whole objects or regions within a data set. copy only seems to cover object references. Does it break with data set regions H5T_STD_REF_DSETREG?
Symbolic links
The "locations" taken by the C API are one level of abstraction, which explains why the copy function works on individual datasets. Look at the figure again: it is the edges which are labelled, not the nodes. Under the hood, HDF5 objects are the targets of links. Each link (edge) has a name; the objects (nodes) do not have names. There are two types of links: hard links and symbolic links. All HDF5 objects must have at least one hard link, and hard links can only target objects within their file. When hard links are created the reference count increases by one; symbolic links do not affect the reference count. Symbolic links may point to objects within the file (soft) or objects in other files (external). copy offers options to expand soft and external symbolic links.
This explains the error code (below) and offers an alternative to copying your dataset: an external link could allow access to a data set in another file.
RuntimeError: Unable to create link (interfile hard links are not allowed)
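Here is a minimal sketch of the linking alternative using an external link. It assumes the source file stays where it is, since an external link stores only the file name and the path inside it. File and object names are placeholders:

```python
import h5py
import numpy as np

# Stand-in source file with one dataset.
with h5py.File("old_file.h5", "w") as f_src:
    f_src.create_dataset("old_dataset", data=np.arange(5))

# Link to the dataset from a new file instead of copying it.
with h5py.File("new_file_linked.h5", "w") as f_new:
    grp = f_new.create_group("new_group")
    grp["new_dataset"] = h5py.ExternalLink("old_file.h5", "/old_dataset")

# Reading through the link opens old_file.h5 behind the scenes.
with h5py.File("new_file_linked.h5", "r") as f_new:
    linked_values = f_new["new_group/new_dataset"][:].tolist()
```

No data is duplicated this way, but the new file is only usable as long as old_file.h5 can be found.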

Answer 1 (using h5py):
This creates a simple structured array to populate the first dataset in the first file.
The data is then read from that dataset and copied to the second file using my_array.
import h5py, numpy as np
arr = np.array([(1,'a'), (2,'b')],
               dtype=[('foo', int), ('bar', 'S1')])
print (arr.dtype)
h5file1 = h5py.File('test1.h5', 'w')
h5file1.create_dataset('/ex_group1/ex_ds1', data=arr)
print (h5file1)
my_array=h5file1['/ex_group1/ex_ds1']
h5file2 = h5py.File('test2.h5', 'w')
h5file2.create_dataset('/exgroup2/ex_ds2', data=my_array)
print (h5file2)
h5file1.close()
h5file2.close()

Answer 2 (using pytables):
This follows the same process as above with pytables functions. It creates the same simple structured array to populate the first dataset in the first file. The data is then read from that dataset and copied to the second file using my_array.
import tables, numpy as np
arr = np.array([(1,'a'), (2,'b')],
               dtype=[('foo', int), ('bar', 'S1')])
print (arr.dtype)
h5file1 = tables.open_file('test1.h5', mode = 'w', title = 'Test file')
my_group = h5file1.create_group('/', 'ex_group1', 'Example Group')
my_table = h5file1.create_table(my_group, 'ex_ds1', None, 'Example dataset', obj=arr)
print (h5file1)
my_array=my_table.read()
h5file2 = tables.open_file('test2.h5', mode = 'w', title = 'Test file')
h5file2.create_table('/exgroup2', 'ex_ds2', createparents=True, obj=my_array)
print (h5file2)
h5file1.close()
h5file2.close()
