ElementTree not preserving attribute order - python

I tried to write out an XML element with lxml's ElementTree API in this format:
from lxml import etree
box = etree.SubElement(image, 'box', top=t, left=l, width=w, height=h)
but what I got is
<box height="511" left="1" top="1" width="510">
I noticed that the attributes are in alphabetical order,
so what should I do to keep them in the order I wrote them?

In XML, the order of attributes is insignificant by specification, so a parser is not obliged to keep it.
Your only option would be to test other XML libraries, and maybe you'll find one that doesn't change the order (which depends on the internal implementation).
This SO answer elaborates on that.
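If the ordering only matters cosmetically, one possible workaround - a sketch, not guaranteed behaviour - is to hand lxml an explicit ordered mapping via the attrib argument instead of keyword arguments (keyword-argument order itself was not preserved before Python 3.7):
from collections import OrderedDict
from lxml import etree

image = etree.Element('image')

# Pass an explicit (ordered) mapping instead of keyword arguments.
# This relies on lxml keeping insertion order internally; the XML spec
# itself gives no guarantee about attribute order.
attrs = OrderedDict([('top', '1'), ('left', '1'),
                     ('width', '510'), ('height', '511')])
box = etree.SubElement(image, 'box', attrib=attrs)

print(etree.tostring(image, pretty_print=True).decode())
# <image>
#   <box top="1" left="1" width="510" height="511"/>
# </image>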

Related

Preserve comments outside the root element of xml file

I am updating an XML file using Python's xml module and I want to preserve all the comments. Using the insert_comments argument preserves the comments inside the root element but drops the ones outside it, at the end of the file.
Is there any way to save those comments along with the inner ones?
Below are the answers I found.
Python 3.8 added the insert_comments argument to TreeBuilder which:
class xml.etree.ElementTree.TreeBuilder(element_factory=None, *, comment_factory=None, pi_factory=None, insert_comments=False, insert_pis=False)
When insert_comments and/or insert_pis is true, comments/pis will be inserted into the tree if they appear within the root element (but not outside of it).
As your question is tagged "elementtree", this answer is specific to that module.
The solution you are expecting may not be possible because of the default design of ElementTree's XMLParser: the parser skips over comments. See this snippet from the official documentation:
Note that XMLParser skips over comments in the input instead of
creating comment objects for them. An ElementTree will only contain
comment nodes if they have been inserted into the tree using one of
the Element methods.
You can find that passage at the documentation link below;
https://docs.python.org/3/library/xml.etree.elementtree.html
search for xml.etree.ElementTree.Comment on that page.
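For the comments inside the root element, a minimal sketch of the Python 3.8+ approach described above (the file names are placeholders):
import xml.etree.ElementTree as ET

# Python 3.8+: a TreeBuilder that keeps comments appearing inside the
# root element. Comments before or after the root are still discarded,
# as the documentation notes.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
tree = ET.parse('input.xml', parser=parser)  # 'input.xml' is a placeholder
tree.write('output.xml')                     # the inner comments survive the round trip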

Python: Using namespaces in ElementTree with iter()

I am using xml.etree.ElementTree to parse some complex xml files. Some of the xml files have several repeating tags nested in them.
<product:object>
  <product:parent>
    <product:parent>
      <product:parent>
      </product:parent>
    </product:parent>
  </product:parent>
</product:object>
I am using .iter() to find the repeating tag at different levels. Normally a namespaces mapping can be passed as a second argument to .find() and .findall(); however, for some reason .iter() doesn't have this option.
Am I missing something, or is there another way of doing this properly?
I know how to build workarounds, and have done so, e.g.:
- A function that recurses, passing in the parent element.
- Manually mapping the namespaces.
I am hoping there is a better way!?
I found that the XPath syntax .findall('.//product:parent', ns) can be used as a substitute for .iter().
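For example, a minimal sketch (the namespace URI is an assumption, since the question does not show the declaration behind the product prefix):
import xml.etree.ElementTree as ET

# The URI is made up for the example; use the one declared in your file.
ns = {'product': 'urn:example:product'}

xml = ('<product:object xmlns:product="urn:example:product">'
       '<product:parent><product:parent><product:parent/>'
       '</product:parent></product:parent></product:object>')
root = ET.fromstring(xml)

# findall() with './/' descends all levels and accepts a namespace map:
print(len(root.findall('.//product:parent', ns)))           # 3

# .iter() takes no namespaces argument, but it does accept a fully
# qualified tag in {URI}localname (Clark) notation:
print(len(list(root.iter('{urn:example:product}parent'))))  # 3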

Parsing XML with undeclared prefixes in Python

I am trying to parse XML data that uses prefixes with Python, but not every file has the declaration of the prefix. Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>
I have been using xml.etree.ElementTree to parse these files, but whenever the prefix is not properly declared, ElementTree throws a parse error. (unbound prefix, right at the start of <abc:thing2>)
Searching for this error leads me to solutions that suggest I fix the namespace declaration. However, I do not control the XML that I need to work with, so modifying the input files is not a viable option.
Searching for namespace parsing in general leads me to many questions about searching in namespace-agnostic way, which is not what I need.
I am looking for some way to automatically parse these files, even if the namespace declaration is broken. I have thought about doing the following:
- tell ElementTree what namespaces to expect beforehand, because I do know which ones can occur. I found register_namespace, but that does not seem to work.
- have the full DTD read in before parsing, and see if that solves it. I could not find a way to do this with ElementTree.
- tell ElementTree not to bother about namespaces at all. It should not cause issues with my data, but I found no way to do this.
- use some other parsing library that can handle this issue - though I prefer not to need installation of extra libraries. I have difficulty seeing from the documentation whether any others would be able to solve my issue.
- some other route that I am currently not seeing?
UPDATE:
After Har07 put me on the path of lxml, I tried to see whether it would let me perform the different solutions I had thought of, and what the result would be:
- Telling the parser what namespaces to expect beforehand: I still could not find any 'official' way to do this, but in my earlier searches I had found the suggestion to simply add the requisite declarations to the data programmatically (for a different programming situation - unfortunately I can't find the link anymore). It seemed terribly hacky to me, but I tried it anyway. It involves loading the data as a string, changing the enclosing element to have the right xmlns declarations, and then handing it off to lxml.etree's fromstring method. Unfortunately, that also requires removing the encoding declaration from the string. It works, though; a rough sketch of this approach follows after this list.
- Reading in the DTD before parsing: this is possible with lxml (through attribute_defaults, dtd_validation, or load_dtd), but unfortunately it does not solve the namespace issue.
- Telling lxml not to bother about namespaces: possible through the recover option. Unfortunately, that also ignores other ways in which the XML may be broken (see Har07's answer for details).
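A rough sketch of that first, admittedly hacky approach (the prefix/URI mapping and the file name are made up for the example):
import re
from lxml import etree

# Prefixes you expect in the data, with a made-up URI for the example.
known_namespaces = {'abc': 'urn:example:abc'}

with open('broken.xml', encoding='utf-8') as f:  # placeholder file name
    data = f.read()

# lxml rejects str input that still carries an encoding declaration,
# so drop the XML declaration first.
data = re.sub(r'^\s*<\?xml[^>]*\?>\s*', '', data)

# Bolt the missing xmlns declarations onto the root element's start tag.
decls = ' '.join(f'xmlns:{p}="{u}"' for p, u in known_namespaces.items())
data = re.sub(r'<(\w+)', r'<\1 ' + decls, data, count=1)

root = etree.fromstring(data)
print(etree.tostring(root).decode())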
One possible way is to use lxml, an ElementTree-compatible library. For example:
from lxml import etree as ElementTree

# Note: lxml's fromstring() rejects str input that carries an encoding
# declaration, so the sample document is given as bytes here.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
thing = tree.xpath("//thing")[0]
print(ElementTree.tostring(thing))
All you need to do to parse non-well-formed XML with lxml is to pass recover=True to the XMLParser constructor. lxml also has full support for XPath 1.0, which is very useful when you need to select part of an XML document using more complex criteria.
UPDATE:
I don't know all the types of XML error that the recover=True option can tolerate, but here is another type that I know of besides an unbound namespace prefix: an unclosed tag. lxml will fix, rather than ignore, an unclosed tag by adding the corresponding closing tag automatically. For example, given the following broken XML:
xml = """<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
print(ElementTree.tostring(tree))
The final output after lxml has parsed it is as follows:
<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</bad></item>

Finding and converting XML processing instructions using Python

We are converting our ancient FrameMaker docs to XML. My job is to convert this:
<?FM MARKER [Index] foo, bar ?>
to this:
<indexterm>
<primary>foo, bar</primary>
</indexterm>
I'm not worried about that part (yet); what is stumping me is that the ProcessingInstructions are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom. Am I missing some secret method/iterator? This is what I've looked at thus far:
ElementTree has the excellent Element.iter() method, which does a depth-first traversal, but it doesn't yield ProcessingInstructions.
ProcessingInstructions don't have tag names, so I cannot search for them using minidom's getElementsByTagName.
xml.sax's ContentHandler.processingInstruction looks like it's only used to create ProcessingInstructions.
Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstructions in an XML file, or identify their parents?
Use the XPath API of the lxml module, like so:
from io import StringIO
from lxml import etree

foo = StringIO('<foo><bar></bar></foo>')
tree = etree.parse(foo)
result = tree.xpath('//processing-instruction()')
The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.
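A minimal sketch applying this to the FrameMaker-style marker from the question (the surrounding element names are invented for the example); lxml's getparent() covers the "identify their parents" part:
from io import StringIO
from lxml import etree

# 'chapter' and 'para' are made-up element names for the example.
xml = StringIO('<chapter><para>text<?FM MARKER [Index] foo, bar ?></para></chapter>')
tree = etree.parse(xml)

# processing-instruction("FM") restricts the match to PIs whose target is FM.
for pi in tree.xpath('//processing-instruction("FM")'):
    print(pi.target)           # FM
    print(pi.text)             # MARKER [Index] foo, bar
    print(pi.getparent().tag)  # para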
References
XPath and XSLT with lxml
XML Path Language 1.0: Node Tests

Parsing XML with namespaces using ElementTree in Python

I have an XML document; a small part of it looks like this:
<?xml version="1.0" ?>
<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
<data>
<image imageId="1"></image>
<content>Content</content>
</data>
</i:insert>
When I parse it using ElementTree and save it to a file, I see the following:
<ns0:insert xmlns:ns0="urn:com:xml:insert" xmlns:ns1="urn:com:xml:data">
<ns1:data>
<ns1:image imageId="1"></ns1:image>
<ns1:content>Content</ns1:content>
</ns1:data>
</ns0:insert>
Why does it change the prefixes and put them everywhere? With minidom I don't have this problem. Can this be configured? The documentation for ElementTree is very poor.
The problem is that I can't find any node after parsing like this - for example image. I can't find it with or without the namespace, whether I write it as {namespace}image or just image. Why is that? Any suggestions are strongly appreciated.
What I have already tried:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for a in root.findall('ns1:image'):
print a.attrib
This returns an error, and the following returns nothing:
for a in root.findall('{urn:com:xml:data}image'):
print a.attrib
I also tried defining a namespace mapping like this and using it:
namespaces = {'ns1': 'urn:com:xml:data'}
for a in root.findall('ns1:image', namespaces):
print a.attrib
It returns nothing. What am I doing wrong?
This snippet from your question,
for a in root.findall('{urn:com:xml:data}image'):
print a.attrib
does not output anything because it only looks for direct {urn:com:xml:data}image children of the root of the tree.
This slightly modified code,
for a in root.findall('.//{urn:com:xml:data}image'):
print a.attrib
will print {'imageId': '1'} because it uses .//, which selects matching subelements on all levels.
Reference: https://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax.
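Putting it together, a small self-contained sketch using the XML from the question:
import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0" ?>
<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
<data>
<image imageId="1"></image>
<content>Content</content>
</data>
</i:insert>'''

root = ET.fromstring(xml)

# './/' searches all levels below the root; {URI}localname spells out the
# default namespace that the unprefixed elements inherit.
for a in root.findall('.//{urn:com:xml:data}image'):
    print(a.attrib)  # {'imageId': '1'}

# Equivalent, using a prefix-to-URI mapping:
ns = {'d': 'urn:com:xml:data'}
for a in root.findall('.//d:image', ns):
    print(a.attrib)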
It is a bit annoying that ElementTree does not just retain the original namespace prefixes by default, but keep in mind that it is not the prefixes that matter anyway. The register_namespace() function can be used to set the wanted prefix when serializing the XML. The function does not have any effect on parsing or searching.
From what I gather, it has something to do with the namespace recognition in ET.
From here: http://effbot.org/zone/element-namespaces.htm
When you save an Element tree to XML, the standard Element serializer generates unique prefixes for all URI:s that appear in the tree. The prefixes usually have the form “ns” followed by a number. For example, the above elements might be serialized with the prefix ns0 for “http://www.w3.org/1999/xhtml” and ns1 for “http://effbot.org/namespace/letters”.
If you want to use specific prefixes, you can add prefix/uri mappings to a global table in the ElementTree module. In 1.3 and later, you do this by calling the register_namespace function. In earlier versions, you can access the internal table directly:
ElementTree 1.3
ET.register_namespace(prefix, uri)
ElementTree 1.2 (Python 2.5)
ET._namespace_map[uri] = prefix
Note the argument order; the function takes the prefix first, while the raw dictionary maps from URI:s to prefixes.
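For example, a minimal sketch (file names are placeholders) that registers the prefixes from the question before serializing:
import xml.etree.ElementTree as ET

# Register the wanted prefixes *before* writing the tree back out.
# This only affects serialization; findall() still needs the {URI} form
# or a prefix-to-URI mapping.
ET.register_namespace('i', 'urn:com:xml:insert')
ET.register_namespace('', 'urn:com:xml:data')  # '' sets the default namespace

tree = ET.parse('test.xml')  # file names are placeholders
tree.write('out.xml', xml_declaration=True, encoding='utf-8')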
