Python: Using namespaces in ElementTree with iter() - python

I am using xml.etree.ElementTree to parse some complex xml files. Some of the xml files have several repeating tags nested in them.
<product:object>
<product:parent>
<product:parent>
<product:parent>
</product:parent>
</product:parent>
</product:parent>
</product:object>
I am using .iter() to find the repeating tag in different layers. Normally a second argument can be passed to .find() and .findall(). However, for some reason .iter() doesn't have this option.
Am I missing something, or is there another way of properly doing this?
I know how to, and have build workarounds.
e.g.
- A definition that is reiterated and passes the parent element.
- Manually mapping the namespaces
I am hoping there is a better way!?

I found that using XPath syntax .findall(.//product:parent, ns) can be used as a substitute for .iter()

Related

Preserve comments outside the root element of xml file

I am updating the xml file using python xml module and I want to preserve all the comments, using insert_comment variable is preserving the comments inside root element but removing outside/end of the xml file.
Is there any way to save those comments along with inner ones?
Below is the answers I found.
Python 3.8 added the insert_comments argument to TreeBuilder which:
class xml.etree.ElementTree.TreeBuilder(element_factory=None, *, comment_factory=None, pi_factory=None, insert_comments=False, insert_pis=False)
When insert_comments and/or insert_pis is true, comments/pis will be inserted into the tree if they appear within root element the (but not outside of it).
As your question tag contains "elementtree", this answer is specific about that.
The solution you are expecting is may not be possible due to default design of ElementTree XMLParser. The parser skips over the comments. Look into below official documented snippet.
Note that XMLParser skips over comments in the input instead of
creating comment objects for them. An ElementTree will only contain
comment nodes if they have been inserted into to the tree using one of
the Element methods.
Please find the above content in below documentation link.
https://docs.python.org/3/library/xml.etree.elementtree.html
search for xml.etree.ElementTree.Comment in above link.

Parsing XML with undeclared prefixes in Python

I am trying to parse XML data with Python that uses prefixes, but not every file has the declaration of the prefix. Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>
I have been using xml.etree.ElementTree to parse these files, but whenever the prefix is not properly declared, ElementTree throws a parse error. (unbound prefix, right at the start of <abc:thing2>)
Searching for this error leads me to solutions that suggest I fix the namespace declaration. However, I do not control the XML that I need to work with, so modifying the input files is not a viable option.
Searching for namespace parsing in general leads me to many questions about searching in namespace-agnostic way, which is not what I need.
I am looking for some way to automatically parse these files, even if the namespace declaration is broken. I have thought about doing the following:
tell ElementTree what namespaces to expect beforehand, because I do know which ones can occur. I found register_namespace, but that does not seem to work.
have the full DTD read in before parsing, and see if that solves it. I could not find a way to do this with ElementTree.
tell ElementTree to not bother about namespaces at all. It should not cause issues with my data, but I found no way to do this
use some other parsing library that can handle this issue - though I prefer not to need installation of extra libraries. I have difficulty seeing from the documentation if any others would be able to solve my issue.
some other route that I am currently not seeing?
UPDATE:
After Har07 put me on the path of lxml, I tried to see if this would let me perform the different solutions I had thought of, and what the result would be:
telling the parser what namespaces to expect beforehand: I still could not find any 'official' way to do this, but in my searches before I had found the suggestion to simply add the requisite declaration to the data programmatically. (for a different programming situation - unfortunately I can't find the link anymore) It seemed terribly hacky to me, but I tried it anyway. It involves loading the data as a string, changing the enclosing element to have the right xmlns declarations, and then handing it off to lxml.etree's fromstring method. Unfortunately, that also requires removing all reference to encoding declaration from the string. It works, though.
Read in the DTD before parsing: it is possible with lxml (through attribute_defaults, dtd_validation, or load_dtd), but unfortunately does not solve the namespace issue.
Telling lxml not to bother about namespaces: possible through the recover option. Unfortunately, that also ignores other ways in which the XML may be broken (see Har07's answer for details)
One possible way is using ElementTree compatible library, lxml. For example :
from lxml import etree as ElementTree
xml = """<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
thing = tree.xpath("//thing")[0]
print(ElementTree.tostring(thing))
All you need to do for parsing a non well-formed XML using lxml is passing parameter recover=True to constructor of XMLParser. lxml also has full support for xpath 1.0 which is very useful when you need to get part of XML document using more complex criteria.
UPDATE :
I don't know all the types of XML error that recover=True option can tolerate. But here is another type of error that I know besides unbound namespace prefix: unclosed tag. lxml will fix -rather than ignore- unclosed tag by adding corresponding closing tag automatically. For example, given the following broken XML :
xml = """<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
print(ElementTree.tostring(tree))
The final output XML after parsed by lxml is as follow :
<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</bad></item>

Finding and converting XML processing instructions using Python

We are converting our ancient FrameMaker docs to XML. My job is to convert this:
<?FM MARKER [Index] foo, bar ?>`
to this:
<indexterm>
<primary>foo, bar</primary>
</indexterm>
I'm not worried about that part (yet); what is stumping me is that the ProcessingInstructions are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom. Am I missing some secret method/iterator? This is what I've looked at thus far:
Elementtree has the excellent Element.iter() method, which is a depth-first search, but it doesn't process ProcessingInstructions.
ProcessingInstructions don't have tag names, so I cannot search for them using minidom's getElementsByTagName.
xml.sax's ContentHandler.processingInstruction looks like it's only used to create ProcessingInstructions.
Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstructions in an XML file, or identify their parents?
Use the XPath API of the lxml module as such:
from lxml import etree
foo = StringIO('<foo><bar></bar></foo>')
tree = etree.parse(foo)
result = tree.xpath('//processing-instruction()')
The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.
References
XPath and XSLT with lxml
XML Path Language 1.0: Node Tests

Extract text between tags with XPath including markup

I have the following piece of XML:
...<span class="st">In Tim <em>Power</em>: Politieman...</span>...
I want to extract the part between the <span> tags.
For this I use XPath:
/span[#class="st"]
This however will extract everything including the <span>.
and.
/span[#class="st"]/text()
will return a list of two text elements. One containing "In Tim". The other ":Politieman". The <em>..</em> is not included and is handled like a separator.
Is there a pure XPath solution which returns:
In Tim <em>Power</em>: Politieman...
EDIT
Thanks to #helderdarocha and #TextGeek. Seems non trivial to extract plain text with XPath only including the <em>.
The /span[#class="st"]/node() solution creates a list containing the individual lines, from which it is trivial in Python to create a String.
To get any child node you can use:
/span[#class="st"]/node()
This will return:
Two child text nodes
The full <em> node (element and contents).
If you actually want all the text() nodes, including the ones inside em, then get all the text() descendants:
/span[#class="st"]//text()
or
/span[#class="st"]/descendant::text()
This will return three text nodes, the text inside <em>, but not the <em> elements.
Sounds like you want the equivalent of the Javascript DOM innerHTML() function, but for XML. I don't think there's a way to do that in pure XPath.
XPath doesn't really operate on markup strings like "<em>" and "</em>" at all -- it works with a tree of Node objects (there might possibly be an XPath implementation that tries to work directly off markup, but I doubt it). Most XPath implementations wouldn't even have the 4 characters "<em>" anywhere (except maybe kept around for printing error messages or something), and of course the DOM could have been built from scratch rather than from XML or other input in the first place. Likewise, XPath doesn't really figure on handing back marked-up strings, but lists of nodes.
In XSLT or XQuery you can do this easily, but not in XPath by itself, unless I'm missing something.
-s

Search for an XML node in a parent by string with python

I'm working with python xml.dom. I'm looking for a particular method that takes in a node and string and returns the xml node that is is named string. I can't find it in the documentation
I'm thinking it would work something like this
nodeObject =parent.FUNCTION('childtoFind')
where the nodeObject is under the parent
Or barring the existence of such a method, is there a way I can make the string a node object?
You are looking for the .getElementsByTagname() function:
nodeObjects = parent.getElementsByTagname('childtoFind')
It returns a list; if you need only one node, use indexing:
nodeObject = parent.getElementsByTagname('childtoFind')[0]
You really want to use the ElementTree API instead, it's easier to use. Even the minidom documentation makes this recommendation:
Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.
The ElementTree API has a .find() function that let's you find the first matching descendant:
element = parent.find('childtoFind')

Categories

Resources