We are converting our ancient FrameMaker docs to XML. My job is to convert this:
<?FM MARKER [Index] foo, bar ?>
to this:
<indexterm>
<primary>foo, bar</primary>
</indexterm>
I'm not worried about that part (yet); what is stumping me is that the ProcessingInstructions are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom. Am I missing some secret method/iterator? This is what I've looked at thus far:
ElementTree has the excellent Element.iter() method, which is a depth-first search, but it doesn't process ProcessingInstructions.
ProcessingInstructions don't have tag names, so I cannot search for them using minidom's getElementsByTagName.
xml.sax's ContentHandler.processingInstruction looks like it's only used to create ProcessingInstructions.
Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstructions in an XML file, or identify their parents?
Use the XPath API of the lxml module as such:
from io import StringIO
from lxml import etree
foo = StringIO('<foo><bar></bar></foo>')
tree = etree.parse(foo)
# Select every processing instruction anywhere in the document
result = tree.xpath('//processing-instruction()')
The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.
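If lxml is not an option, the same result can be sketched with the standard library alone. This is not the XPath approach described above but a small depth-first walk over a minidom tree that yields each processing instruction together with its parent. The function name iter_processing_instructions and the sample document are made up for illustration.

```python
from xml.dom import minidom, Node


def iter_processing_instructions(node):
    """Yield (parent, pi) pairs in document order via a depth-first walk."""
    for child in node.childNodes:
        if child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            yield node, child
        else:
            # Elements are searched recursively; text and comment nodes
            # have an empty childNodes list, so the recursion ends there.
            yield from iter_processing_instructions(child)


# Hypothetical sample input, modelled on the marker from the question.
SAMPLE = "<doc><para>text<?FM MARKER [Index] foo, bar ?></para><?other x?></doc>"

dom = minidom.parseString(SAMPLE)
found = [
    (parent.nodeName, pi.target, pi.data.strip())
    for parent, pi in iter_processing_instructions(dom)
]
print(found)
```

Each ProcessingInstruction node exposes the name as .target and the rest as .data, so filtering on pi.target == "FM" plays the same role as the Literal argument of processing-instruction() in XPath.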
References
XPath and XSLT with lxml
XML Path Language 1.0: Node Tests
I am using xml.etree.ElementTree to parse some complex xml files. Some of the xml files have several repeating tags nested in them.
<product:object>
<product:parent>
<product:parent>
<product:parent>
</product:parent>
</product:parent>
</product:parent>
</product:object>
I am using .iter() to find the repeating tag in different layers. Normally a second argument can be passed to .find() and .findall(). However, for some reason .iter() doesn't have this option.
Am I missing something, or is there another way of properly doing this?
I know how to work around this, and have built workarounds,
e.g.
- A recursive function that passes the parent element along.
- Manually mapping the namespaces.
I am hoping there is a better way!
I found that the XPath syntax .findall('.//product:parent', ns) can be used as a substitute for .iter().
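A minimal sketch of that substitution, assuming a made-up namespace URI (urn:example:product) since the original files are not shown. It also shows that .iter() can still be used if the tag is written in the {uri}tag form.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the namespace URI is an assumption for illustration.
XML = """<product:object xmlns:product="urn:example:product">
  <product:parent>
    <product:parent>
      <product:parent/>
    </product:parent>
  </product:parent>
</product:object>"""

ns = {"product": "urn:example:product"}
root = ET.fromstring(XML)

# findall() accepts a prefix map; './/' searches all levels below root.
via_findall = root.findall(".//product:parent", ns)

# iter() takes no prefix map, but accepts the fully qualified {uri}tag name.
via_iter = list(root.iter("{urn:example:product}parent"))

print(len(via_findall), len(via_iter))
```

One difference to keep in mind: .iter() also yields the starting element itself when it matches, while './/' only looks at descendants.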
As I understand it, XML files are tree structures, i.e. each branch is its own tree. Conceptually, I can't see the difference between an Element and an ElementTree. But I guess that's OK. What's worse is that there is stuff you can't do with an Element: for example, root.write("bla.xml") seems to be fine but element.write("bla.xml") doesn't work.
So I suppose I need to convert the Element to an ElementTree and set it as root before I do anything else. How do I do this...?
You are right, conceptually there is no difference. So just build your elements however you like, and then wrap their root in an ElementTree so you have access to its methods. You can just do
from xml.etree.ElementTree import ElementTree
tree = ElementTree(my_root_element)
tree.write(...)
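The same wrapping works for any sub-element, not only the document root. A small sketch, with an invented sample document and an in-memory buffer standing in for the output file:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document.
root = ET.fromstring("<library><book id='bk1'><title>T</title></book></library>")
book = root.find("book")

# Wrap the sub-element in its own ElementTree to get access to write().
buffer = io.BytesIO()
ET.ElementTree(book).write(buffer)
serialized = buffer.getvalue()
print(serialized)
```

Passing a filename such as "bla.xml" instead of the buffer writes the subtree to disk.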
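With lxml, you can get the ElementTree that an element belongs to by calling the element's getroottree() method:
import lxml.html
element = lxml.html.parse(s).getroot()  # parse() returns a tree; getroot() returns its root element
tree = element.getroottree()
For more information, see the lxml documentation.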
I have an xml, small part of it looks like this:
<?xml version="1.0" ?>
<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
<data>
<image imageId="1"></image>
<content>Content</content>
</data>
</i:insert>
When I parse it using ElementTree and save it to a file, I see the following:
<ns0:insert xmlns:ns0="urn:com:xml:insert" xmlns:ns1="urn:com:xml:data">
<ns1:data>
<ns1:image imageId="1"></ns1:image>
<ns1:content>Content</ns1:content>
</ns1:data>
</ns0:insert>
Why does it change the prefixes and put them everywhere? Using minidom I don't have this problem. Is this configurable? The documentation for ElementTree is very poor.
The problem is that I can't find any node after such parsing. For example, I can't find image with or without a namespace, whether I use {namespace}image or just image. Why is that? Any suggestions are strongly appreciated.
What I already tried:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for a in root.findall('ns1:image'):
    print(a.attrib)
This returns an error and the other one returns nothing:
for a in root.findall('{urn:com:xml:data}image'):
    print(a.attrib)
I also tried defining a namespace map like this and using it:
namespaces = {'ns1': 'urn:com:xml:data'}
for a in root.findall('ns1:image', namespaces):
    print(a.attrib)
It returns nothing. What am I doing wrong?
This snippet from your question,
for a in root.findall('{urn:com:xml:data}image'):
    print(a.attrib)
does not output anything because it only looks for direct {urn:com:xml:data}image children of the root of the tree.
This slightly modified code,
for a in root.findall('.//{urn:com:xml:data}image'):
    print(a.attrib)
will print {'imageId': '1'} because it uses .//, which selects matching subelements on all levels.
Reference: https://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax.
It is a bit annoying that ElementTree does not just retain the original namespace prefixes by default, but keep in mind that it is not the prefixes that matter anyway. The register_namespace() function can be used to set the wanted prefix when serializing the XML. The function does not have any effect on parsing or searching.
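Both points can be tried on the XML from the question. The sketch below searches with a prefix map (the prefix d is arbitrary, only the URI has to match) and then registers the original prefixes before serializing. Registering an empty prefix for the default namespace is an assumption about how you want the output to look, and register_namespace() changes a module-wide table, so it affects every later serialization in the process.

```python
import xml.etree.ElementTree as ET

XML = """<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
  <data>
    <image imageId="1"></image>
    <content>Content</content>
  </data>
</i:insert>"""

root = ET.fromstring(XML)

# Searching: './/' looks at all levels; the map translates 'd:' to the URI.
ns = {"d": "urn:com:xml:data"}
image_attribs = [img.attrib for img in root.findall(".//d:image", ns)]

# Serializing: choose the prefixes to emit (global setting, output only).
ET.register_namespace("i", "urn:com:xml:insert")
ET.register_namespace("", "urn:com:xml:data")
serialized = ET.tostring(root, encoding="unicode")

print(image_attribs)
print(serialized)
```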
From what I gather, it has something to do with how ET handles namespaces.
From http://effbot.org/zone/element-namespaces.htm:
When you save an Element tree to XML, the standard Element serializer generates unique prefixes for all URI:s that appear in the tree. The prefixes usually have the form “ns” followed by a number. For example, the above elements might be serialized with the prefix ns0 for “http://www.w3.org/1999/xhtml” and ns1 for “http://effbot.org/namespace/letters”.
If you want to use specific prefixes, you can add prefix/uri mappings to a global table in the ElementTree module. In 1.3 and later, you do this by calling the register_namespace function. In earlier versions, you can access the internal table directly:
ElementTree 1.3
ET.register_namespace(prefix, uri)
ElementTree 1.2 (Python 2.5)
ET._namespace_map[uri] = prefix
Note the argument order; the function takes the prefix first, while the raw dictionary maps from URI:s to prefixes.
I want to check whether a node in an XML file exists before another node in Python 3.2. I am using the lxml library for Python. I thought of using a counter to keep track of the order, but I couldn't come up with the logic. I need to do this without changing the XML file.
My XML looks like this example.
For example, I want to check whether book id="bk108" comes before book id="bk112",
except that the book ids are not in order in my XML, e.g. it doesn't go bk108, bk109...
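This can be done using XPath:
boolean(//book[@id="bk108"][following-sibling::book[@id="bk112"]])
This XPath returns true if there is a book node with the id bk108 that has a following sibling book node with the id bk112. The XPath 2.0 form not(empty(...)) around the same path is equivalent, but lxml only implements XPath 1.0, so boolean() is used here.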
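The counter idea from the question can also be sketched without XPath, using only the standard library. This is a different technique from the expression above: it compares positions in document order across all book elements, not only among siblings. The function name comes_before and the sample catalog are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog with ids deliberately out of sequence.
XML = """<catalog>
  <book id="bk110"/>
  <book id="bk108"/>
  <book id="bk101"/>
  <book id="bk112"/>
</catalog>"""


def comes_before(root, first_id, second_id, tag="book"):
    """Return True if tag[@id=first_id] precedes tag[@id=second_id]."""
    # enumerate() acts as the counter: iter() walks in document order.
    order = {el.get("id"): pos for pos, el in enumerate(root.iter(tag))}
    if first_id not in order or second_id not in order:
        return False
    return order[first_id] < order[second_id]


root = ET.fromstring(XML)
print(comes_before(root, "bk108", "bk112"))
print(comes_before(root, "bk112", "bk108"))
```

Nothing is written back to the file, so the XML stays unchanged.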
I'm working with Python's xml.dom. I'm looking for a particular method that takes a node and a string and returns the XML node whose name is that string. I can't find it in the documentation.
I'm thinking it would work something like this:
nodeObject = parent.FUNCTION('childtoFind')
where nodeObject is under parent.
Or, barring the existence of such a method, is there a way I can turn the string into a node object?
You are looking for the .getElementsByTagName() function:
nodeObjects = parent.getElementsByTagName('childtoFind')
It returns a list; if you need only one node, use indexing:
nodeObject = parent.getElementsByTagName('childtoFind')[0]
You really want to use the ElementTree API instead, it's easier to use. Even the minidom documentation makes this recommendation:
Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.
The ElementTree API has a .find() function that lets you find the first matching direct child (use a path starting with './/' to search all descendants):
element = parent.find('childtoFind')
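A short side-by-side sketch of the two APIs on an invented document, to show where they differ: getElementsByTagName() searches all descendants, while a plain tag name passed to .find() only matches direct children.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical document: one nested and one direct 'childtoFind'.
XML = (
    "<parent>"
    "<wrapper><childtoFind>deep</childtoFind></wrapper>"
    "<childtoFind>direct</childtoFind>"
    "</parent>"
)

# minidom: every matching descendant, in document order.
dom_parent = minidom.parseString(XML).documentElement
dom_nodes = dom_parent.getElementsByTagName("childtoFind")
dom_first_text = dom_nodes[0].firstChild.data

# ElementTree: plain tag name = direct children, './/' = any depth.
et_parent = ET.fromstring(XML)
et_child = et_parent.find("childtoFind")
et_descendant = et_parent.find(".//childtoFind")

print(dom_first_text, et_child.text, et_descendant.text)
```

.find() returns None when nothing matches, whereas indexing an empty minidom result with [0] raises an IndexError.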