As I understood it, XML files are tree structures ie each branch is its own tree. Conceptually, I can't see the difference between an Element and an ElementTree. But I guess that's ok - what's worse is that there is stuff you can't do with an Element - for example root.write("bla.xml") seems to be fine but element.write("bla.xml") doesn't work.
So I suppose I need to convert the Element to an ElementTree and set it as root before I do anything else. How do I do this...?
You are right, conceptually there is no difference. So, just build you elements however you like, and then just include their root in an ElementTree so you have access its methods. You can just do
tree = ElementTree(my_root_element)
tree.write(...)
To get the root tree from an xml Element, you can use the getroottree method:
doc = lxml.html.parse(s)
tree = doc.getroottree()
for more info please check the doc to know more about the module.
Related
I'm attempting to find an XML element called "md:EntityDescriptor" using the following Python code:
def parse(filepath):
xmlfile = str(filepath)
doc1 = ET.parse(xmlfile)
root = doc1.getroot()
test = root.find('md:EntityDescriptor', namespaces)
print(test)
This is the beginning of my XML document, which is a SAML assertion. I've omitted the rest for readability and security, but the element I'm searching for is literally at the very beginning:
<?xml version="1.0" encoding="UTF-8"?>
<md:EntityDescriptor ...
I have a namespace defining "md" and several others:
namespaces = {'md': 'urn:oasis:names:tc:SAML:2.0:metadata'}
yet the output of print(test) is None.
Running ET.dump(root) outputs the full contents of the file, so I know it isn't a problem with the input I'm passing. Running print(root.nsmap) returns:
{'md': 'urn:oasis:names:tc:SAML:2.0:metadata'}
If md:EntityDescriptor is the root element, trying to find a child md:EntityDescriptor element with find isn’t going to work. You've already selected that element as root.
However, the problem is that I need to run this same operation on multiple files, and md:EntityDescriptor is not always the root element. Is there a way to find an element regardless of whether or not it's the root?
Since you're using lxml, try using xpath() and the descendant-or-self:: axis instead of find:
test = root.xpath('descendant-or-self::md:EntityDescriptor', namespaces=namespaces)
Note that xpath() returns a list.
I just started learning Python and have to write a program, that parses xml files. I have to find a certain Tag called OrganisationReference in 2 different files and return it. In fact there are multiple Tags with this name, but only one, the one I am trying to return, that has the Tag OrganisationType with the value DEALER as a parent Tag (not quite sure whether the term is right). I tried to use ElementTree for this. Here is the code:
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
root1 = tree1.getroot()
tree2 = ET.parse('Master2.xml')
root2 = tree2.getroot()
for OrganisationReference in root1.findall("./Organisation/OrganisationId/[#OrganisationType='DEALER']/OrganisationReference"):
print(OrganisationReference.attrib)
for OrganisationReference in root2.findall("./Organisation/OrganisationId/[#OrganisationType='DEALER']/OrganisationReference"):
print(OrganisationReference.attrib)
But this returns nothing (also no error). Can somebody help me?
My file looks like this:
<MessageOrganisationCount>a</MessageOrganisationCount>
<MessageVehicleCount>x</MessageVehicleCount>
<MessageCreditLineCount>y</MessageCreditLineCount>
<MessagePlanCount>z</MessagePlanCount>
<OrganisationData>
<Organisation>
<OrganisationId>
<OrganisationType>DEALER</OrganisationType>
<OrganisationReference>WHATINEED</OrganisationReference>
</OrganisationId>
<OrganisationName>XYZ.</OrganisationName>
....
Due to the fact that OrganisationReference appears a few more times in this file with different text between start and endtag, I want to get exactly the one, that you see in line 9: it has OrganisationId as a parent tag, and DEALER is also a child tag of OrganisationId.
You were super close with your original attempt. You just need to make a couple of changes to your xpath and a tiny change to your python.
The first part of your xpath starts with ./Organization. Since you're doing the xpath from root, it expects Organization to be a child. It's not; it's a descendant.
Try changing ./Organization to .//Organization. (// is short for /descendant-or-self::node()/. See here for more info.)
The second issue is with OrganisationId/[#OrganisationType='DEALER']. That's invalid xpath. The / should be removed from between OrganisationId and the predicate.
Also, # is abbreviated syntax for the attribute:: axis and OrganisationType is an element, not an attribute.
Try changing OrganisationId/[#OrganisationType='DEALER'] to OrganisationId[OrganisationType='DEALER'].
The python issue is with print(OrganisationReference.attrib). The OrganisationReference doesn't have any attributes; just text.
Try changing print(OrganisationReference.attrib) to print(OrganisationReference.text).
Here's an example using just one XML file for demo purposes...
XML Input (Master1.xml; with doc element added to make it well-formed)
<doc>
<MessageOrganisationCount>a</MessageOrganisationCount>
<MessageVehicleCount>x</MessageVehicleCount>
<MessageCreditLineCount>y</MessageCreditLineCount>
<MessagePlanCount>z</MessagePlanCount>
<OrganisationData>
<Organisation>
<OrganisationId>
<OrganisationType>DEALER</OrganisationType>
<OrganisationReference>WHATINEED</OrganisationReference>
</OrganisationId>
<OrganisationName>XYZ.</OrganisationName>
</Organisation>
</OrganisationData>
</doc>
Python
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
root1 = tree1.getroot()
for OrganisationReference in root1.findall(".//Organisation/OrganisationId[OrganisationType='DEALER']/OrganisationReference"):
print(OrganisationReference.text)
Printed Output
WHATINEED
Also note that it doesn't appear that you need to use getroot() at all. You can use findall() directly on the tree...
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
for OrganisationReference in tree1.findall(".//Organisation/OrganisationId[OrganisationType='DEALER']/OrganisationReference"):
print(OrganisationReference.text)
You can use a nested for-loop to do it. First you check whether the text of OrganisationType is DEALER and then get the text of the OrganisationReference that you need.
If you want to learn more about parsing XML with Python I strongly recommend the documentation of the XMLtree library.
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
root1 = tree1.getroot()
tree2 = ET.parse('Master2.xml')
root2 = tree2.getroot()
#Find the parent Dealer
for element in root1.findall('./Organisation/OrganisationId'):
if element[0].text == "DEALER":
print(element[1].text)
This works if the first tag in your OrganisationId is OrganisationType :)
We are converting our ancient FrameMaker docs to XML. My job is to convert this:
<?FM MARKER [Index] foo, bar ?>`
to this:
<indexterm>
<primary>foo, bar</primary>
</indexterm>
I'm not worried about that part (yet); what is stumping me is that the ProcessingInstructions are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom. Am I missing some secret method/iterator? This is what I've looked at thus far:
Elementtree has the excellent Element.iter() method, which is a depth-first search, but it doesn't process ProcessingInstructions.
ProcessingInstructions don't have tag names, so I cannot search for them using minidom's getElementsByTagName.
xml.sax's ContentHandler.processingInstruction looks like it's only used to create ProcessingInstructions.
Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstructions in an XML file, or identify their parents?
Use the XPath API of the lxml module as such:
from lxml import etree
foo = StringIO('<foo><bar></bar></foo>')
tree = etree.parse(foo)
result = tree.xpath('//processing-instruction()')
The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.
References
XPath and XSLT with lxml
XML Path Language 1.0: Node Tests
I have an xml, small part of it looks like this:
<?xml version="1.0" ?>
<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
<data>
<image imageId="1"></image>
<content>Content</content>
</data>
</i:insert>
When i parse it using ElementTree and save it to a file i see following:
<ns0:insert xmlns:ns0="urn:com:xml:insert" xmlns:ns1="urn:com:xml:data">
<ns1:data>
<ns1:image imageId="1"></ns1:image>
<ns1:content>Content</ns1:content>
</ns1:data>
</ns0:insert>
Why does it change prefixes and put them everywhere? Using minidom i don't have such problem. Is it configured? Documentation for ElementTree is very poor.
The problem is, that i can't find any node after such parsing, for example image - can't find it with or without namespace if i use it like {namespace}image or just image. Why's that? Any suggestions are strongly appreciated.
What i already tried:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for a in root.findall('ns1:image'):
print a.attrib
This returns an error and the other one returns nothing:
for a in root.findall('{urn:com:xml:data}image'):
print a.attrib
I also tried to make namespace like this and use it:
namespaces = {'ns1': 'urn:com:xml:data'}
for a in root.findall('ns1:image', namespaces):
print a.attrib
It returns nothing. What am i doing wrong?
This snippet from your question,
for a in root.findall('{urn:com:xml:data}image'):
print a.attrib
does not output anything because it only looks for direct {urn:com:xml:data}image children of the root of the tree.
This slightly modified code,
for a in root.findall('.//{urn:com:xml:data}image'):
print a.attrib
will print {'imageId': '1'} because it uses .//, which selects matching subelements on all levels.
Reference: https://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax.
It is a bit annoying that ElementTree does not just retain the original namespace prefixes by default, but keep in mind that it is not the prefixes that matter anyway. The register_namespace() function can be used to set the wanted prefix when serializing the XML. The function does not have any effect on parsing or searching.
From what I gather, it has something to do with the namespace recognition in ET.
from here http://effbot.org/zone/element-namespaces.htm
When you save an Element tree to XML, the standard Element serializer generates unique prefixes for all URI:s that appear in the tree. The prefixes usually have the form “ns” followed by a number. For example, the above elements might be serialized with the prefix ns0 for “http://www.w3.org/1999/xhtml” and ns1 for “http://effbot.org/namespace/letters”.
If you want to use specific prefixes, you can add prefix/uri mappings to a global table in the ElementTree module. In 1.3 and later, you do this by calling the register_namespace function. In earlier versions, you can access the internal table directly:
ElementTree 1.3
ET.register_namespace(prefix, uri)
ElementTree 1.2 (Python 2.5)
ET._namespace_map[uri] = prefix
Note the argument order; the function takes the prefix first, while the raw dictionary maps from URI:s to prefixes.
I'm working with python xml.dom. I'm looking for a particular method that takes in a node and string and returns the xml node that is is named string. I can't find it in the documentation
I'm thinking it would work something like this
nodeObject =parent.FUNCTION('childtoFind')
where the nodeObject is under the parent
Or barring the existence of such a method, is there a way I can make the string a node object?
You are looking for the .getElementsByTagname() function:
nodeObjects = parent.getElementsByTagname('childtoFind')
It returns a list; if you need only one node, use indexing:
nodeObject = parent.getElementsByTagname('childtoFind')[0]
You really want to use the ElementTree API instead, it's easier to use. Even the minidom documentation makes this recommendation:
Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.
The ElementTree API has a .find() function that let's you find the first matching descendant:
element = parent.find('childtoFind')