An XML file inside HDF5, h5py - python

I am using h5py to save data (float numbers) in groups. In addition to the data itself, I need to include an additional file (an .xml file containing necessary information) within the hdf5. How do I do this? Is my approach wrong?
f = h5py.File('filename.h5', 'w')
f.create_dataset('/data/1', data=numpy_array_1)
f.create_dataset('/data/2', data=numpy_array_2)
.
.
My h5 tree should look like this:
/
/data
/data/1 (numpy_array_1)
/data/2 (numpy_array_2)
.
.
/morphology.xml (?)

One option is to add it as a variable-length string dataset.
http://code.google.com/p/h5py/wiki/HowTo#Variable-length_strings
E.g.:
import h5py
xmldata = """<xml>
<something>
<else>Text</else>
</something>
</xml>
"""
# Write the xml file...
f = h5py.File('test.hdf5', 'w')
str_type = h5py.string_dtype()  # h5py.new_vlen(str) in versions before 2.10
ds = f.create_dataset('something.xml', shape=(1,), dtype=str_type)
ds[:] = xmldata
f.close()
# Read the xml file back...
f = h5py.File('test.hdf5', 'r')
print(f['something.xml'].asstr()[0])  # .asstr() decodes the stored bytes under h5py 3.x

If you just need to attach the XML file to the hdf5 file, you can add it as an attribute to the hdf5 file.
with open('morphology.xml', 'rb') as xmlfh:
    h5f.attrs['xml'] = xmlfh.read()
You can then access the xml data like this:
h5f.attrs['xml']
Note, also, that you can't store attributes larger than 64 KB, so you may want to compress the file before attaching it. Have a look at the compression libraries in Python's standard library (zlib, bz2, lzma).
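For example, here is a minimal sketch of compressing the XML with the standard-library zlib module before attaching it (the file and attribute names are just illustrative):
import zlib
import numpy as np

with open('morphology.xml', 'rb') as xmlfh:
    compressed = zlib.compress(xmlfh.read())
# np.void stores the raw compressed bytes as an opaque blob
h5f.attrs['xml'] = np.void(compressed)
# Reading it back:
xmldata = zlib.decompress(h5f.attrs['xml'].tobytes())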
However, this doesn't make the information in the XML file very accessible. If you want to associate the metadata of each dataset with metadata in the XML file, you can map it as needed using an XML library like lxml. You can also add each field of the XML data as a separate attribute so that you can query datasets by XML field; it all depends on what is in the XML file. Try to think about how you would like to retrieve the data later.
You may also want to create a group for each xml file together with its datasets and put it all in a single hdf5 file. I don't know how large the files you are managing are, so YMMV.
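As a sketch of the attribute-per-field idea, using the standard-library xml.etree.ElementTree instead of lxml (the element and group names here are made up):
import xml.etree.ElementTree as ET
import h5py

root = ET.parse('morphology.xml').getroot()
with h5py.File('filename.h5', 'a') as h5f:
    grp = h5f.require_group('data')
    # attach each top-level XML field as a queryable attribute
    for child in root:
        grp.attrs[child.tag] = child.text or ''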

Related

What is the best way to store an XML file in a database using sqlalchemy-flask?

I'm working on a Flask app that I'd like to have store xml files in a database. I'd like to use flask-sqlalchemy. I've seen that in regular old sqlalchemy it is possible to use the LONGTEXT type. I believe this would work for my use case.
I would like to know (1) if LONGTEXT would be the best way to store xml files and, if so, (2) how to use LONGTEXT within the flask-sqlalchemy syntax.
What should {insert-name-here} be in the code below? Will I need to install additional dependencies to use whatever is suggested?
xml_column = db.Column(db.{insert-name-here})
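One possibility, sketched below without claiming it is the best choice: LONGTEXT is a MySQL-specific type, so it lives in SQLAlchemy's MySQL dialect, and with_variant() lets the column fall back to a generic TEXT on other backends. The model name and connection URI are hypothetical.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy.dialects.mysql import LONGTEXT

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'mysql://user:password@localhost/mydb'  # placeholder URI
db = SQLAlchemy(app)

class Document(db.Model):  # hypothetical model
    id = db.Column(db.Integer, primary_key=True)
    # LONGTEXT on MySQL; plain TEXT on other backends
    xml_column = db.Column(db.Text().with_variant(LONGTEXT, 'mysql'))
No extra dependencies beyond a MySQL driver should be needed, since the dialect type ships with SQLAlchemy itself.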
I have used Python for some time. I use the xml.etree.ElementTree package to read and write XML data in Python. I use it like this:
'''
# import
import xml.etree.ElementTree as ET

# xml file
_xml = 'c:/.../test.xml'

# read
tree = ET.parse(_xml)
root = tree.getroot()
h_data = root.findall('h')

# write
root = ET.Element('test')
tree = ET.ElementTree(root)
tree.write(_xml, encoding='utf-8', xml_declaration=True)
'''
You can see more in the documentation. An XML file can be saved as plain text, but UTF-8 is the best encoding. I think XML data is not the best fit for Python; JSON is a better fit. Hope this helps.

Creating an xml file for testing from within a python module

I have an application that takes an xml file as input and converts this data into a specific data structure. I would like to write a test for this application, but instead of using an external xml file I would like to define the xml data inside the test file and then pass this data to the function, so originally my idea was to do something like this:
data = pd.DataFrame([#insert data here])
in_memory_xml = io.BytesIO()
xml_file = data.to_xml(in_memory_xml)
my_function(xml_file)
However, pandas DataFrame objects do not have a "to_xml" function, so the xml data needs to be defined differently. Is there a good way to solve this problem that doesn't involve the use of an external xml file?
It can be done by just converting the string to xml using lxml: https://kite.com/python/examples/5415/lxml-load-xml-from-a-string-into-an-%60elementtree%60
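A minimal sketch of that idea, in case the link goes stale (the XML payload here is made up):
from lxml import etree
import io

xml_string = '<data><row><temp>1</temp></row></data>'
# parse directly from the string...
root = etree.fromstring(xml_string)
# ...or wrap it in an in-memory buffer if the consumer expects a file-like object
tree = etree.parse(io.BytesIO(xml_string.encode('utf-8')))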
You can try the code below:
import pandas as pd

data = pd.read_csv('temps.csv', delimiter=r"\s+")

def myfunction(row):
    myxml = ['<item>']
    for field in row.index:
        myxml.append('  <field name="{0}">{1}</field>'.format(field, row[field]))
    myxml.append('</item>')
    return '\n'.join(myxml)

final_xml = '\n'.join(data.apply(myfunction, axis=1))
print(final_xml)
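To connect this back to the question, the generated string can then be wrapped in an in-memory buffer and handed to a function that expects a file-like object (my_function is the question's hypothetical consumer):
import io

xml_file = io.StringIO('<root>\n' + final_xml + '\n</root>')
my_function(xml_file)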

Parsing YAML out of a Markdown file

I am working with some legacy code that I have inherited (i.e., many of these design decisions were not mine).
The code takes a directory organized into subdirectories with markdown files, and compiles them into one large markdown file (using Markdown-PP: https://github.com/jreese/markdown-pp). Then it converts this file into HTML (using pandoc: https://pandoc.org/), and finally into a PDF (using wkhtmltopdf: https://wkhtmltopdf.org/).
The problem that I am running into is that many of the original markdown files have YAML metadata headers. When stitched together by Markdown-PP, the large markdown ends up with numerous YAML metadata blocks interspersed throughout. Most of this metadata is lost when converting into HTML because of the way pandoc processes YAML (many of the headers use the same key names, and pandoc combines the separate YAML headers and only preserves the first value of the corresponding key).
I originally had no YAML appearing in the HTML, but was able to change this by correctly modifying the HTML template for pandoc. But I only get the first value for each corresponding key. It was not clear if there was a way around this in pandoc, so I instead looked into trying to process the YAML into HTML before the pandoc step. I have tried parsing the YAML in the combined markdown using PyYAML (yaml.load_all()) but only get the first YAML block to appear.
An example of a YAML block:
---
author: foo
size_minimum: 100
time_req_minutes: 120
# and so on
---
The issue being that each one of 20+ modules in the final document have this associated metadata.
To try to parse the YAML, I was using code borrowed from this post: Is it possible to use PyYAML to read a text file written with a "YAML front matter" block inside?
with a few modifications.
import yaml

def get_yaml(f):
    pointer = f.tell()
    if f.readline() != '---\n':
        f.seek(pointer)
        return ''
    readline = iter(f.readline, '')
    readline = iter(readline.__next__, '---\n')  # .__next__ rather than .next on Python 3
    return ''.join(readline)

# (I removed the original sys.argv handling, not sure what it was doing)
with open(filepath, encoding='UTF-8') as f:
    # load_all to get all the YAML documents; the Loader argument is required
    # by recent PyYAML; list() because load_all returns a generator
    config = list(yaml.load_all(get_yaml(f), Loader=yaml.SafeLoader))
    text = f.read()
    print("TEXT from", f)
    #print(text)
    print("CONFIG from", f)
    print(config)
But even this only resulted in the first YAML block being read and output.
I would like to be able to parse the YAML from the large markdown file and replace it in the correct place with the corresponding HTML. I am just not sure whether these (or any) packages are capable of doing so. It may be that I just need to manually change the YAML to HTML in the original Markdown files (time-intensive, but I could probably already be done with it if I had started that way).
What about this library: https://github.com/eyeseast/python-frontmatter
It parses both the front-matter and the Markdown in the file, placing the Markdown part in the content attribute of the resulting object.
Works with both front-matter-containing and front-matterless (is there such a word?) files.
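A minimal usage sketch (the filename is a placeholder):
import frontmatter

post = frontmatter.load('module.md')
print(post.metadata)  # the YAML front matter as a dict
print(post.content)   # the Markdown body without the front matter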

Accessing all the data contained in an h5 file with python

After I load an h5 files and then check the keys, is there any other data that can be stored in the h5 that I might be missing? For example:
import h5py
a = '/path/to/file.h5'
a_h5 = h5py.File(a, 'r')
a_h5.keys()
From the h5py documentation, it looks like you can also do:
a_h5.values()
a_h5.items()
I don't know much about this format, but this looks like additional information you can extract.
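For example, attributes and nested groups don't show up in a top-level keys() call; here is a sketch of walking everything with h5py's visititems():
import h5py

def show(name, obj):
    # called once for every group and dataset in the file
    print(name, obj)
    for key, val in obj.attrs.items():  # attributes are not listed by keys()
        print('    attr:', key, val)

with h5py.File('/path/to/file.h5', 'r') as a_h5:
    print(dict(a_h5.attrs))  # root-level attributes
    a_h5.visititems(show)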

Python Storing Data

I have a list in my program and a function that appends to it. Unfortunately, when you close the program, whatever you added goes away and the list resets. Is there any way to store the data so that when the user re-opens the program the list is still complete?
You can try the pickle module to store in-memory data on disk. Here is an example:
store data:
import pickle
dataset = ['hello','test']
outputFile = 'test.data'
fw = open(outputFile, 'wb')
pickle.dump(dataset, fw)
fw.close()
load data:
import pickle
inputFile = 'test.data'
fd = open(inputFile, 'rb')
dataset = pickle.load(fd)
print(dataset)
Another option is to save the data to a database such as SQLite, or to a .txt file. For example:
with open("mylist.txt","w") as f: #in write mode
f.write("{}".format(mylist))
Your list goes into the format() function. It'll make a .txt file named mylist.txt and save your list data into it.
After that, when you want to access your data again, you can do:
with open("mylist.txt") as f: #in read mode, not in write mode, careful
rd=f.readlines()
print (rd)
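Note that readlines() gives you the file's text, not a Python list. If you need the actual list object back, one option is ast.literal_eval, assuming the file holds the repr written above:
import ast

with open("mylist.txt") as f:
    mylist = ast.literal_eval(f.read())  # turn the stored repr back into a list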
The built-in pickle module provides some basic functionality for serialization, which is a term for turning arbitrary objects into something suitable to be written to disk. Check out the docs for Python 2 or Python 3.
Pickle isn't very robust though, and for more complex data you'll likely want to look into a database module like the built-in sqlite3 or a full-fledged object-relational mapping (ORM) like SQLAlchemy.
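As a minimal sketch of the sqlite3 route (the filename and table are made up):
import sqlite3

conn = sqlite3.connect('items.db')
conn.execute('CREATE TABLE IF NOT EXISTS items (value TEXT)')
conn.executemany('INSERT INTO items (value) VALUES (?)', [('hello',), ('test',)])
conn.commit()
# read everything back
print([row[0] for row in conn.execute('SELECT value FROM items')])
conn.close()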
For storing big data, the HDF5 library is suitable; in Python it is provided by h5py.
