Python LXML fails to find XML element - python

I'm attempting to find an XML element called "md:EntityDescriptor" using the following Python code:
from lxml import etree as ET  # assumed import: the question uses lxml under the alias ET

def parse(filepath):
    xmlfile = str(filepath)
    doc1 = ET.parse(xmlfile)
    root = doc1.getroot()
    test = root.find('md:EntityDescriptor', namespaces)
    print(test)
This is the beginning of my XML document, which is a SAML assertion. I've omitted the rest for readability and security, but the element I'm searching for is literally at the very beginning:
<?xml version="1.0" encoding="UTF-8"?>
<md:EntityDescriptor ...
I have a namespace defining "md" and several others:
namespaces = {'md': 'urn:oasis:names:tc:SAML:2.0:metadata'}
yet the output of print(test) is None.
Running ET.dump(root) outputs the full contents of the file, so I know it isn't a problem with the input I'm passing. Running print(root.nsmap) returns:
{'md': 'urn:oasis:names:tc:SAML:2.0:metadata'}

If md:EntityDescriptor is the root element, trying to find a child md:EntityDescriptor element with find isn’t going to work. You've already selected that element as root.
However, the problem is that I need to run this same operation on multiple files, and md:EntityDescriptor is not always the root element. Is there a way to find an element regardless of whether or not it's the root?
Since you're using lxml, try using xpath() and the descendant-or-self:: axis instead of find:
test = root.xpath('descendant-or-self::md:EntityDescriptor', namespaces=namespaces)
Note that xpath() returns a list.
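If lxml is not available, a similar effect can be sketched with the standard-library xml.etree.ElementTree, whose Element.iter() visits the element itself before its descendants. This is only a minimal sketch, not the approach from the answer above: the helper name find_entity_descriptor and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

MD_NS = 'urn:oasis:names:tc:SAML:2.0:metadata'

def find_entity_descriptor(root):
    """Return the first md:EntityDescriptor at or below root, or None."""
    # iter() yields root itself first, then its descendants in document order,
    # so this works whether or not the element is the root.
    return next(root.iter('{%s}EntityDescriptor' % MD_NS), None)

# Hypothetical sample documents, not taken from the question.
as_root = ET.fromstring(
    '<md:EntityDescriptor xmlns:md="%s" entityID="a"/>' % MD_NS)
nested = ET.fromstring(
    '<md:EntitiesDescriptor xmlns:md="%s">'
    '<md:EntityDescriptor entityID="b"/>'
    '</md:EntitiesDescriptor>' % MD_NS)

print(find_entity_descriptor(as_root).get('entityID'))
print(find_entity_descriptor(nested).get('entityID'))
```

Unlike xpath(), this returns a single element (or None) rather than a list.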

Related

Capture all XML element paths using xml.etree.ElementTree

Using lxml, I am able to print the path of every element recursively:
from lxml import etree
root = etree.parse(xml_file)
for e in root.iter():
    path = root.getelementpath(e)
    print(path)
Results:
TreatmentEpisodes
TreatmentEpisodes/TreatmentEpisode
TreatmentEpisodes/TreatmentEpisode/SourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode/FederalTaxIdentifier
TreatmentEpisodes/TreatmentEpisode/ClientSourceRecordIdentifier
etc.
Note: I am working with this XSD: https://www.myflfamilies.com/service-programs/samh/155-2/155-2-v14/schemas/TreatmentEpisodeDataset.xsd
I want to do the same thing using
import xml.etree.ElementTree as ET
...but ElementTree does not seem to have an equivalent function to lxml getelementpath().
I've read the docs.
I've googled for days.
I've experimented with XPath.
I've guessed using iter() and tried "getpath()", "Element.getpath()", etc. hoping to discover an undocumented feature. Fail.
Perhaps I am experiencing an extreme case of "user error" and please forgive me if this is a duplicate.
I thought I found the answer here: Get Xpath dynamically using ElementTree getpath() but the XPathEvaluator only seems to operate on a 'known' element - it doesn't have an option for "give me everything".
Here is what I tried:
from lxml import etree  # tree.xpath() exists only in lxml, not in xml.etree.ElementTree
tree = etree.parse(xml_file)
for entry in tree.xpath('//TreatmentEpisode'):
    print(entry)
Results:
<Element TreatmentEpisode at 0xffff8f8c8a00>
What I was hoping for:
TreatmentEpisodes/TreatmentEpisode
...however, even if I received what I hoped for, I am still not sure how to obtain the full path for every element. As I understand the XPath docs, they only operate on 'known' element names. i.e. tree.xpath() seems to require the element name to be known beforehand.
Start from:
import xml.etree.ElementTree as et
An interesting way to solve your problem is to use iterparse, an iterative parser contained in ElementTree.
It is able to report e.g. each start and end event, for each element parsed.
For details search the Web for documentation / examples of iterparse.
The idea is to:
Start with an empty list as the path.
At the start event, append the element name to path and print the full path gathered so far.
At the end event, drop the last element from path.
You can even wrap this code in a generator function:
def pathGen(fn):
    path = []
    it = et.iterparse(fn, events=('start', 'end'))
    for evt, el in it:
        if evt == 'start':
            path.append(el.tag)
            yield '/'.join(path)
        else:
            path.pop()
Now, when you run:
for pth in pathGen('Input.xml'):
    print(pth)
you will get a printout of full paths of all elements in your source file, something like:
TreatmentEpisodes
TreatmentEpisodes/TreatmentEpisode
TreatmentEpisodes/TreatmentEpisode/SourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode/FederalTaxIdentifier
TreatmentEpisodes/TreatmentEpisode/ClientSourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode
TreatmentEpisodes/TreatmentEpisode/SourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode/FederalTaxIdentifier
TreatmentEpisodes/TreatmentEpisode/ClientSourceRecordIdentifier
...
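If the document is already parsed into memory, the same paths can also be produced by a recursive walk instead of iterparse. This is a minimal sketch under that assumption: the function name iter_paths and the small sample document are invented for illustration, and the sample only mimics the element names from the question.

```python
import xml.etree.ElementTree as ET

def iter_paths(element, prefix=''):
    """Yield the slash-joined tag path of element and of each descendant."""
    path = prefix + '/' + element.tag if prefix else element.tag
    yield path
    for child in element:
        # Recurse with the current path as the prefix for the children.
        yield from iter_paths(child, path)

# Hypothetical sample using element names from the question.
sample = (
    '<TreatmentEpisodes>'
    '<TreatmentEpisode>'
    '<SourceRecordIdentifier/>'
    '<FederalTaxIdentifier/>'
    '</TreatmentEpisode>'
    '</TreatmentEpisodes>'
)

paths = list(iter_paths(ET.fromstring(sample)))
for p in paths:
    print(p)
```

The recursive version needs the whole tree in memory, whereas the iterparse generator reports paths while the file is still being read.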

How do I search for a tag in an XML file using ElementTree where I have a certain "parent" tag with a specific value? (python)

I just started learning Python and have to write a program that parses XML files. I have to find a certain tag called OrganisationReference in 2 different files and return it. In fact there are multiple tags with this name, but only one of them, the one I am trying to return, has the tag OrganisationType with the value DEALER as a parent tag (not quite sure whether the term is right). I tried to use ElementTree for this. Here is the code:
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
root1 = tree1.getroot()
tree2 = ET.parse('Master2.xml')
root2 = tree2.getroot()
for OrganisationReference in root1.findall("./Organisation/OrganisationId/[@OrganisationType='DEALER']/OrganisationReference"):
    print(OrganisationReference.attrib)
for OrganisationReference in root2.findall("./Organisation/OrganisationId/[@OrganisationType='DEALER']/OrganisationReference"):
    print(OrganisationReference.attrib)
But this returns nothing (also no error). Can somebody help me?
My file looks like this:
<MessageOrganisationCount>a</MessageOrganisationCount>
<MessageVehicleCount>x</MessageVehicleCount>
<MessageCreditLineCount>y</MessageCreditLineCount>
<MessagePlanCount>z</MessagePlanCount>
<OrganisationData>
    <Organisation>
        <OrganisationId>
            <OrganisationType>DEALER</OrganisationType>
            <OrganisationReference>WHATINEED</OrganisationReference>
        </OrganisationId>
        <OrganisationName>XYZ.</OrganisationName>
        ....
Because OrganisationReference appears a few more times in this file with different text between its start and end tags, I want to get exactly the one that you see in line 9: it has OrganisationId as a parent tag, and that OrganisationId also has a child tag OrganisationType whose text is DEALER.
You were super close with your original attempt. You just need to make a couple of changes to your xpath and a tiny change to your python.
The first part of your xpath starts with ./Organisation. Since you're doing the xpath from root, it expects Organisation to be a child. It's not; it's a descendant.
Try changing ./Organisation to .//Organisation. (// is short for /descendant-or-self::node()/. See here for more info.)
The second issue is with OrganisationId/[@OrganisationType='DEALER']. That's invalid xpath. The / should be removed from between OrganisationId and the predicate.
Also, @ is abbreviated syntax for the attribute:: axis and OrganisationType is an element, not an attribute.
Try changing OrganisationId/[@OrganisationType='DEALER'] to OrganisationId[OrganisationType='DEALER'].
The python issue is with print(OrganisationReference.attrib). The OrganisationReference doesn't have any attributes; just text.
Try changing print(OrganisationReference.attrib) to print(OrganisationReference.text).
Here's an example using just one XML file for demo purposes...
XML Input (Master1.xml; with doc element added to make it well-formed)
<doc>
    <MessageOrganisationCount>a</MessageOrganisationCount>
    <MessageVehicleCount>x</MessageVehicleCount>
    <MessageCreditLineCount>y</MessageCreditLineCount>
    <MessagePlanCount>z</MessagePlanCount>
    <OrganisationData>
        <Organisation>
            <OrganisationId>
                <OrganisationType>DEALER</OrganisationType>
                <OrganisationReference>WHATINEED</OrganisationReference>
            </OrganisationId>
            <OrganisationName>XYZ.</OrganisationName>
        </Organisation>
    </OrganisationData>
</doc>
Python
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
root1 = tree1.getroot()
for OrganisationReference in root1.findall(".//Organisation/OrganisationId[OrganisationType='DEALER']/OrganisationReference"):
    print(OrganisationReference.text)
Printed Output
WHATINEED
Also note that it doesn't appear that you need to use getroot() at all. You can use findall() directly on the tree...
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
for OrganisationReference in tree1.findall(".//Organisation/OrganisationId[OrganisationType='DEALER']/OrganisationReference"):
    print(OrganisationReference.text)
You can use a for-loop with an if check to do it. First you check whether the text of OrganisationType is DEALER and then get the text of the OrganisationReference that you need.
If you want to learn more about parsing XML with Python I strongly recommend the documentation of the xml.etree.ElementTree module.
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
root1 = tree1.getroot()
tree2 = ET.parse('Master2.xml')
root2 = tree2.getroot()
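# Find the OrganisationId whose first child (OrganisationType) is DEALER.
# .// is needed because Organisation is a descendant of the root, not a direct child.
for element in root1.findall('.//Organisation/OrganisationId'):
    if element[0].text == "DEALER":
        print(element[1].text)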
This works if the first tag in your OrganisationId is OrganisationType :)
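To avoid depending on the order of the child tags, the lookup can be done by name with findtext() instead of by position. The following is a minimal sketch, assuming a document shaped like the one in the question: the helper name dealer_references, the second OrganisationId entry and its BANK / OTHER values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the structure shown in the question.
# The child order in the first OrganisationId is swapped on purpose.
SAMPLE = """<doc>
  <OrganisationData>
    <Organisation>
      <OrganisationId>
        <OrganisationReference>WHATINEED</OrganisationReference>
        <OrganisationType>DEALER</OrganisationType>
      </OrganisationId>
      <OrganisationId>
        <OrganisationType>BANK</OrganisationType>
        <OrganisationReference>OTHER</OrganisationReference>
      </OrganisationId>
    </Organisation>
  </OrganisationData>
</doc>"""

def dealer_references(root):
    """Collect OrganisationReference texts whose sibling OrganisationType is DEALER."""
    refs = []
    for org_id in root.iter('OrganisationId'):
        # findtext() looks the child up by tag name, so its position does not matter.
        if org_id.findtext('OrganisationType') == 'DEALER':
            refs.append(org_id.findtext('OrganisationReference'))
    return refs

print(dealer_references(ET.fromstring(SAMPLE)))
```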

Check if tag exists by index in xml file

I wrote a python script that returns xml file tag values. It goes through the file by index. How can I check if an index exists?
This is the basic blueprint.
tree = ET.parse(file)
root = tree.getroot()
root[0][1][0].text
I am not sure what you are trying to achieve. As I understand it, you are parsing some XML and trying to get the text of one XML element based on indexes. I think you could use XPath searching instead. It works even better if you use the lxml module for parsing XML instead of the xml module. Here is a description of XPath usage in lxml.
Anyway, if you really prefer using indexes, you do not have to check whether an element exists at a specific index. Use a try/except block instead to catch the error raised if the index does not exist.
This answer provides some details on why you should use this approach.
And your code could look more or less like this:
tree = ET.parse(file)
root = tree.getroot()
try:
    text = root[0][1][0].text
except IndexError as e:
    # do something to handle the error
    pass
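The try/except pattern above can be wrapped in a small helper so that a missing index yields a default value instead of an exception. This is only a sketch: the function name text_at, its default parameter and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

def text_at(root, *indexes, default=None):
    """Follow child indexes from root and return the text of the element found.

    Returns default if any index along the way does not exist.
    """
    node = root
    try:
        for i in indexes:
            node = node[i]
    except IndexError:
        return default
    return node.text

# Hypothetical sample document.
root = ET.fromstring('<a><b><c/><d><e>hello</e></d></b></a>')

print(text_at(root, 0, 1, 0))                 # path exists
print(text_at(root, 0, 5, 0))                 # second index is out of range
print(text_at(root, 0, 5, 0, default='n/a'))
```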

How can I turn an xml Element into an ElementTree (python)?

As I understood it, XML files are tree structures ie each branch is its own tree. Conceptually, I can't see the difference between an Element and an ElementTree. But I guess that's ok - what's worse is that there is stuff you can't do with an Element - for example root.write("bla.xml") seems to be fine but element.write("bla.xml") doesn't work.
So I suppose I need to convert the Element to an ElementTree and set it as root before I do anything else. How do I do this...?
You are right, conceptually there is no difference. So just build your elements however you like, and then wrap their root in an ElementTree so you have access to its methods. You can just do
from xml.etree.ElementTree import ElementTree
tree = ElementTree(my_root_element)
tree.write(...)
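A self-contained sketch of the same idea with the standard library follows. The element names are invented for illustration, and an in-memory buffer stands in for a file such as "bla.xml".

```python
import io
import xml.etree.ElementTree as ET

# Build a stand-alone Element (hypothetical tag names).
root = ET.Element('root')
ET.SubElement(root, 'child').text = 'value'

# Wrap the Element in an ElementTree to gain access to write().
tree = ET.ElementTree(root)

buffer = io.BytesIO()   # stands in for a real file on disk
tree.write(buffer)
print(buffer.getvalue().decode())
```

If only the serialized text is needed, ET.tostring(root) works directly on the Element without wrapping it first.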
With lxml (the standard-library Element has no such method), you can get the tree that an element belongs to with the element's getroottree method:
doc = lxml.html.parse(s)
tree = doc.getroot().getroottree()
For more information, please check the module's documentation.

Parsing XML with namespaces using ElementTree in Python

I have an xml, small part of it looks like this:
<?xml version="1.0" ?>
<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
    <data>
        <image imageId="1"></image>
        <content>Content</content>
    </data>
</i:insert>
When I parse it using ElementTree and save it to a file I see the following:
<ns0:insert xmlns:ns0="urn:com:xml:insert" xmlns:ns1="urn:com:xml:data">
    <ns1:data>
        <ns1:image imageId="1"></ns1:image>
        <ns1:content>Content</ns1:content>
    </ns1:data>
</ns0:insert>
Why does it change prefixes and put them everywhere? Using minidom I don't have such a problem. Can this be configured? Documentation for ElementTree is very poor.
The problem is that I can't find any node after such parsing. For example, I can't find image with or without a namespace, whether I write {namespace}image or just image. Why is that? Any suggestions are greatly appreciated.
What I already tried:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for a in root.findall('ns1:image'):
    print(a.attrib)
This returns an error and the other one returns nothing:
for a in root.findall('{urn:com:xml:data}image'):
    print(a.attrib)
I also tried to make a namespace mapping like this and use it:
namespaces = {'ns1': 'urn:com:xml:data'}
for a in root.findall('ns1:image', namespaces):
    print(a.attrib)
It returns nothing. What am I doing wrong?
This snippet from your question,
for a in root.findall('{urn:com:xml:data}image'):
    print(a.attrib)
does not output anything because it only looks for direct {urn:com:xml:data}image children of the root of the tree.
This slightly modified code,
for a in root.findall('.//{urn:com:xml:data}image'):
    print(a.attrib)
will print {'imageId': '1'} because it uses .//, which selects matching subelements on all levels.
Reference: https://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax.
It is a bit annoying that ElementTree does not just retain the original namespace prefixes by default, but keep in mind that it is not the prefixes that matter anyway. The register_namespace() function can be used to set the wanted prefix when serializing the XML. The function does not have any effect on parsing or searching.
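The namespaces dictionary attempted in the question fails for the same reason, and works once .// is added. Here is a minimal sketch using the XML from the question; the prefix d in the mapping is an arbitrary choice made for this example and does not need to match any prefix in the file.

```python
import xml.etree.ElementTree as ET

xml_text = '''<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
    <data>
        <image imageId="1"></image>
        <content>Content</content>
    </data>
</i:insert>'''

root = ET.fromstring(xml_text)

# The prefix in this mapping is local to the search; only the URI has to match.
ns = {'d': 'urn:com:xml:data'}

direct = root.findall('d:image', ns)       # direct children of the root only
anywhere = root.findall('.//d:image', ns)  # matching elements at any depth

print(len(direct))
print([img.attrib for img in anywhere])
```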
From what I gather, it has something to do with the namespace recognition in ET.
from here http://effbot.org/zone/element-namespaces.htm
When you save an Element tree to XML, the standard Element serializer generates unique prefixes for all URI:s that appear in the tree. The prefixes usually have the form “ns” followed by a number. For example, the above elements might be serialized with the prefix ns0 for “http://www.w3.org/1999/xhtml” and ns1 for “http://effbot.org/namespace/letters”.
If you want to use specific prefixes, you can add prefix/uri mappings to a global table in the ElementTree module. In 1.3 and later, you do this by calling the register_namespace function. In earlier versions, you can access the internal table directly:
ElementTree 1.3
ET.register_namespace(prefix, uri)
ElementTree 1.2 (Python 2.5)
ET._namespace_map[uri] = prefix
Note the argument order; the function takes the prefix first, while the raw dictionary maps from URI:s to prefixes.
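As a rough illustration of register_namespace, the sketch below registers prefixes for the two URIs from the question before serializing. The prefix d for urn:com:xml:data is an arbitrary choice made for this example, and the registration is global to the ElementTree module.

```python
import xml.etree.ElementTree as ET

# Prefix first, URI second. This affects serialization only, not parsing or searching.
ET.register_namespace('i', 'urn:com:xml:insert')
ET.register_namespace('d', 'urn:com:xml:data')  # 'd' is an arbitrary prefix for this sketch

root = ET.fromstring(
    '<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">'
    '<data><image imageId="1"/></data>'
    '</i:insert>'
)

serialized = ET.tostring(root, encoding='unicode')
print(serialized)
```

The output uses the registered i: and d: prefixes in place of the generated ns0 and ns1.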
