I am looking to convert a Python object into XML data. I've tried lxml, but eventually had to write custom code for saving my object as xml which isn't perfect.
I'm looking for something more like pyxser. Unfortunately pyxser xml code looks different from what I need.
For instance I have my own class Person
Class Person:
name = ""
age = 0
ids = []
and I want to covert it into xml code looking like
<Person>
<name>Mike</name>
<age> 25 </age>
<ids>
<id>1234</id>
<id>333333</id>
<id>999494</id>
</ids>
</Person>
I didn't find any method in lxml.objectify that takes object and returns xml code.
Best is rather subjective and I'm not sure it's possible to say what's best without knowing more about your requirements. However Gnosis has previously been recommended for serializing Python objects to XML so you might want to start with that.
From the Gnosis homepage:
Gnosis Utils contains several Python modules for XML processing, plus other generally useful tools:
xml.pickle (serializes objects to/from XML)
API compatible with the standard pickle module)
xml.objectify (turns arbitrary XML documents into Python objects)
xml.validity (enforces XML validity constraints via DTD or Schema)
xml.indexer (full text indexing/searching)
many more...
Another option is lxml.objectify.
Mike,
you can either implement object rendering into XML :
class Person:
...
def toXml( self):
print '<Person>'
print '\t<name>...</name>
...
print '</Person>'
or you can transform Gnosis or pyxser output using XSLT.
Related
I'd like to create a XML namespace mapping (e.g., to use in findall calls as in the Python documentation of ElementTree). Given the definitions seem to exist as attributes of the xbrl root element, I'd have thought I could just examine the attrib attribute of the root element within my ElementTree. However, the following code
from io import StringIO
import xml.etree.ElementTree as ET
TEST = '''<?xml version="1.0" encoding="utf-8"?>
<xbrl
xml:lang="en-US"
xmlns="http://www.xbrl.org/2003/instance"
xmlns:country="http://xbrl.sec.gov/country/2021"
xmlns:dei="http://xbrl.sec.gov/dei/2021q4"
xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
xmlns:link="http://www.xbrl.org/2003/linkbase"
xmlns:nvda="http://www.nvidia.com/20220130"
xmlns:srt="http://fasb.org/srt/2021-01-31"
xmlns:stpr="http://xbrl.sec.gov/stpr/2021"
xmlns:us-gaap="http://fasb.org/us-gaap/2021-01-31"
xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</xbrl>'''
xbrl = ET.parse(StringIO(TEST))
print(xbrl.getroot().attrib)
produces the following output:
{'{http://www.w3.org/XML/1998/namespace}lang': 'en-US'}
Why aren't any of the namespace attributes showing up in root.attrib? I'd at least expect xlmns to be in the dictionary given it has no prefix.
What have I tried?
The following code seems to work to generate the namespace mapping:
print({prefix: uri for key, (prefix, uri) in ET.iterparse(StringIO(TEST), events=['start-ns'])})
output:
{'': 'http://www.xbrl.org/2003/instance',
'country': 'http://xbrl.sec.gov/country/2021',
'dei': 'http://xbrl.sec.gov/dei/2021q4',
'iso4217': 'http://www.xbrl.org/2003/iso4217',
'link': 'http://www.xbrl.org/2003/linkbase',
'nvda': 'http://www.nvidia.com/20220130',
'srt': 'http://fasb.org/srt/2021-01-31',
'stpr': 'http://xbrl.sec.gov/stpr/2021',
'us-gaap': 'http://fasb.org/us-gaap/2021-01-31',
'xbrldi': 'http://xbrl.org/2006/xbrldi',
'xlink': 'http://www.w3.org/1999/xlink',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}
But yikes is it gross to have to parse the file twice.
As for the answer to your specific question, why the attrib list doesn't contain the namespace prefix decls, sorry for the unquenching answer: because they're not attributes.
http://www.w3.org/XML/1998/namespace is a special schema that doesn't act like the other schemas in your userspace. In that representation, xmlns:prefix="uri" is an attribute. In all other subordinate (by parsing sequence) schemas, xmlns:prefix="uri" is a special thing, a namespace prefix declaration, which is different than an attribute on a node or element. I don't have a reference for this but it holds true perfectly in at least a half dozen (correct) implementations of XML parsers that I've used, including those from IBM, Microsoft and Oracle.
As for the ugliness of reparsing the file, I feel your pain but it's necessary. As tdelaney so well pointed out, you may not assume that all of your namespace decls or prefixes must be on your root element.
Be prepared for the possibility of the same prefix being redefined with a different namespace on every node in your document. This may hold true and the library must correctly work with it, even if it is never the case your document (or worse, if it's never been the case so far).
Consider if perhaps you are shoehorning some text processing to parse or query XML when there may be a better solution, like XPath or XQuery. There are some good recent changes to and Python wrappers for Saxon, even though their pricing model has changed.
I'm developing an API using Python which makes server calls using XML. I am debating on whether to use a library (ex. http://wiki.python.org/moin/MiniDom) or if it would be "better" (meaning less overhead and faster) to use string concatenation in order to generate the XML used for each request. Also, the XML I will be generating is going to be quite dynamic so I'm not sure if something that allows me to manage elements dynamically will be a benefit.
Since you are just using authorize.net, why not use a library specifically designed for the Authorize.net API and forget about constructing your own XML calls?
If you want or need to go your own way with XML, don't use minidom, use something with an ElementTree interface such as cElementTree (which is in the standard library). It will be much less painful and probably much faster. You will certainly need an XML library to parse the XML you produce, so you might as well use the same API for both.
It is very unlikely that the overhead of using an XML library will be a problem, and the benefit in clean code and knowing you can't generate invalid XML is very great.
If you absolutely, positively need to be as fast as possible, use one of the extremely fast templating libraries available for Python. They will probably be much faster than any naive string concatenation you do and will also be safe (i.e do proper escaping).
Another option is to use Jinja, especially if the dynamic nature of your xml is fairly simple. This is a common idiom in flask for generating html responses.
Here is an example of a jinja template that generates the XML of an aws S3 list objects response. I usually store the template in a separate file to avoid polluting my elegant python with ugly xml.
from datetime import datetime
from jinja2 import Template
list_bucket_result = """<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>{{bucket_name}}</Name>
<Prefix/>
<KeyCount>{{object_count}}</KeyCount>
<MaxKeys>{{max_keys}}</MaxKeys>
<IsTruncated>{{is_truncated}}</IsTruncated>
{%- for object in object_list %}
<Contents>
<Key>{{object.key}}</Key>
<LastModified>{{object.last_modified_date.isoformat()}}</LastModified>
<ETag></ETag>
<Size>{{object.size}}</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>{% endfor %}
</ListBucketResult>
"""
class BucketObject:
def __init__(self, key, last_modified_date, size):
self.key = key
self.last_modified_date = last_modified_date
self.size = size
object_list = [
BucketObject('/foo/bar.txt', datetime.utcnow(), 10*1024 ),
BucketObject('/foo/baz.txt', datetime.utcnow(), 29*1024 ),
]
template = Template(list_bucket_result)
result = template.render(
bucket_name='test-bucket',
object_count=len(object_list),
max_keys=1000,
is_truncated=False,
object_list=object_list
)
print result
Output:
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>test-bucket</Name>
<Prefix/>
<KeyCount>2</KeyCount>
<MaxKeys>1000</MaxKeys>
<IsTruncated>False</IsTruncated>
<Contents>
<Key>/foo/bar.txt</Key>
<LastModified>2017-10-31T02:28:34.551000</LastModified>
<ETag></ETag>
<Size>10240</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
<Contents>
<Key>/foo/baz.txt</Key>
<LastModified>2017-10-31T02:28:34.551000</LastModified>
<ETag></ETag>
<Size>29696</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
</ListBucketResult>
I would definitely recommend that you make use of one of the Python libraries; such as MiniDom, ElementTree, lxml.etree or pyxser. There is no reason to not to, and the potential performance impact will be minimal.
Although, personally I prefer using simplejson (or simply json) instead.
my_list = ["Value1", "Value2"]
json = simplejson.dumps(my_list)
# send json
My real question is what are the biggest concerns for what you're trying to accomplish? If you're worried about speed/memory, then yes, minidom does take a hit. If you want something that's fairly reliable that you can deploy quickly, I'd say use it.
My suggestion for dealing with XML in any language(Java, Python, C#, Perl, etc) is to consider using something already existing. Everyone has written their own XML parser at least once, and then they never do so again because it's such a pain in the behind. And to be fair, these libraries will have already fixed 99.5% of any problems you would run into.
I recommend LXML. It's a Python library of bindings for the very fast C libraries libxml2 and libxslt.
LXML supports XPATH, and has an elementTree implementation. LXML also has an interface called objectify for writing XML as object hierarchies:
from lxml import etree, objectify
E = objectify.ElementMaker(annotate=False)
my_alpha = my_alpha = E.alpha(E.beta(E.gamma(firstattr='True')),
E.beta(E.delta('text here')))
etree.tostring(my_alpha)
# '<alpha><beta><gamma firstattr="True"/></beta><beta><delta>text here</delta></beta></alpha>'
etree.tostring(my_alpha.beta[0])
# '<beta><gamma firstattr="True"/></beta>'
my_alpha.beta[1].delta.text
# 'text here'
This is one of my first forays into Python. I'd normally stick with bash, however Minidom seems to perfectly suite my needs for XML parsing, so I'm giving it a shot.
First question which I can't seem to figure out is, what's the equivalent for 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based off of a certain string embedded within the tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
Etree is giving me a ton of problems, can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree
tree = ElementTree.parse(somefile)
for name in tree.findall('.//network_object//Name'):
if name.text is not None and 'cluster' in name.text:
continue # skip this one
So I'm parsing some XML files using Python 3.2.1's cElementTree, and during the parsing I noticed that some of the tags were missing attribute information. I was wondering if there is any easy way of getting the line numbers of those Elements in the xml file.
Looking at the docs, I see no way to do this with cElementTree.
However I've had luck with lxmls version of the XML implementation.
Its supposed to be almost a drop in replacement, using libxml2. And elements have a sourceline attribute. (As well as getting a lot of other XML features).
Only caveat is that I've only used it in python 2.x - not sure how/if it works under 3.x - but might be worth a look.
Addendum:
from their front page they say :
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2
and libxslt. It is unique in that it combines the speed and XML
feature completeness of these libraries with the simplicity of a
native Python API, mostly compatible but superior to the well-known
ElementTree API. The latest release works with all CPython versions
from 2.3 to 3.2. See the introduction for more information about
background and goals of the lxml project. Some common questions are
answered in the FAQ.
So it looks like python 3.x is OK.
Took a while for me to work out how to do this using Python 3.x (using 3.3.2 here) so thought I would summarize:
# Force python XML parser not faster C accelerators
# because we can't hook the C implementation
sys.modules['_elementtree'] = None
import xml.etree.ElementTree as ET
class LineNumberingParser(ET.XMLParser):
def _start_list(self, *args, **kwargs):
# Here we assume the default XML parser which is expat
# and copy its element position attributes into output Elements
element = super(self.__class__, self)._start_list(*args, **kwargs)
element._start_line_number = self.parser.CurrentLineNumber
element._start_column_number = self.parser.CurrentColumnNumber
element._start_byte_index = self.parser.CurrentByteIndex
return element
def _end(self, *args, **kwargs):
element = super(self.__class__, self)._end(*args, **kwargs)
element._end_line_number = self.parser.CurrentLineNumber
element._end_column_number = self.parser.CurrentColumnNumber
element._end_byte_index = self.parser.CurrentByteIndex
return element
tree = ET.parse(filename, parser=LineNumberingParser())
I've done this in elementtree by subclassing ElementTree.XMLTreeBuilder. Then where I have access to the self._parser (Expat) it has properties _parser.CurrentLineNumber and _parser.CurrentColumnNumber.
http://docs.python.org/py3k/library/pyexpat.html?highlight=xml.parser#xmlparser-objects has details about these attributes
During parsing you could print out info, or put these values into the output XML element attributes.
If your XML file includes additional XML files, you have to do some stuff that I don't remember and was not well documented to keep track of the current XML file.
One (hackish) way of doing this is by inserting a dummy-attribute holding the line number into each element, before parsing. Here's how I did this with minidom:
python reporting line/column of origin of XML node
This can be trivially adjusted to cElementTree (or in fact any other python XML parser).
I am quite new to XML as well as Python, so please overlook . I am trying to unpack a dictionary straight into XML format. My Code Fragment is as follows:
from xml.dom.minidom import Document
def make_xml(dictionary):
doc = Document()
result = doc.createElement('result')
doc.appendChild(result)
for key in dictionary:
attrib = doc.createElement(key)
result.appendChild(attrib)
value = doc.createTextNode(dictionary[key])
attrib.appendChild(value)
print doc
I expected an answer of the format
<xml>
<result>
<attrib#1>value#1</attrib#1>
...
However all I am getting is
<xml.dom.minidom.Document instance at 0x01BE6130>
Please help
You have not checked the
http://docs.python.org/library/xml.dom.minidom.html
docs.
Look at the toxml() or prettyprettyxml() methods.
You can always use a library like xmler which easily takes a python dictionary and converts it to xml. Full disclosure, I created the package, but I feel like it will probably do what you need.
Also feel free to take a look at the source.