Capture all XML element paths using xml.etree.ElementTree

Using lxml in Python, I am able to print the path of every element recursively:
from lxml import etree
root = etree.parse(xml_file)
for e in root.iter():
    path = root.getelementpath(e)
    print(path)
Results:
TreatmentEpisodes
TreatmentEpisodes/TreatmentEpisode
TreatmentEpisodes/TreatmentEpisode/SourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode/FederalTaxIdentifier
TreatmentEpisodes/TreatmentEpisode/ClientSourceRecordIdentifier
etc.
Note: I am working with this XSD: https://www.myflfamilies.com/service-programs/samh/155-2/155-2-v14/schemas/TreatmentEpisodeDataset.xsd
I want to do the same thing using
import xml.etree.ElementTree as ET
...but ElementTree does not seem to have an equivalent function to lxml getelementpath().
I've read the docs.
I've googled for days.
I've experimented with XPath.
I've guessed using iter() and tried "getpath()", "Element.getpath()", etc. hoping to discover an undocumented feature. Fail.
Perhaps I am experiencing an extreme case of "user error" and please forgive me if this is a duplicate.
I thought I found the answer here: Get Xpath dynamically using ElementTree getpath() but the XPathEvaluator only seems to operate on a 'known' element - it doesn't have an option for "give me everything".
Here is what I tried:
import xml.etree.ElementTree as ET
tree = etree.parse(xml_file)
for entry in tree.xpath('//TreatmentEpisode'):
    print(entry)
Results:
<Element TreatmentEpisode at 0xffff8f8c8a00>
What I was hoping for:
TreatmentEpisodes/TreatmentEpisode
...however, even if I received what I hoped for, I am still not sure how to obtain the full path for every element. As I understand the XPath docs, they only operate on 'known' element names. i.e. tree.xpath() seems to require the element name to be known beforehand.

Start from:
import xml.etree.ElementTree as et
An interesting way to solve your problem is to use iterparse, an iterative parser contained in ElementTree.
It can report events such as the start and the end of each element as it is parsed.
For details, see the iterparse entry in the ElementTree documentation.
The idea is to:
Start with an empty list as the path.
At the start event, append the element name to path and print the full path gathered so far.
At the end event, drop the last element from path.
You can even wrap this code in a generator function:
def pathGen(fn):
    path = []
    it = et.iterparse(fn, events=('start', 'end'))
    for evt, el in it:
        if evt == 'start':
            path.append(el.tag)
            yield '/'.join(path)
        else:
            path.pop()
Now, when you run:
for pth in pathGen('Input.xml'):
    print(pth)
you will get a printout of the full paths of all elements in your source file, something like:
TreatmentEpisodes
TreatmentEpisodes/TreatmentEpisode
TreatmentEpisodes/TreatmentEpisode/SourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode/FederalTaxIdentifier
TreatmentEpisodes/TreatmentEpisode/ClientSourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode
TreatmentEpisodes/TreatmentEpisode/SourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode/FederalTaxIdentifier
TreatmentEpisodes/TreatmentEpisode/ClientSourceRecordIdentifier
...
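For a quick check without a file on disk, the same idea can be run against an in-memory document. This is only a sketch: the sample XML and the names iter_paths, all_paths and unique_paths are made up for illustration. Because repeated elements produce repeated paths, it also shows one way to keep each distinct path once.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, shaped like the output shown above.
SAMPLE = b"""<TreatmentEpisodes>
  <TreatmentEpisode>
    <SourceRecordIdentifier>1</SourceRecordIdentifier>
    <FederalTaxIdentifier>2</FederalTaxIdentifier>
  </TreatmentEpisode>
  <TreatmentEpisode>
    <SourceRecordIdentifier>3</SourceRecordIdentifier>
  </TreatmentEpisode>
</TreatmentEpisodes>"""


def iter_paths(source):
    """Yield the slash-joined path of every element, in document order."""
    path = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            yield "/".join(path)
        else:
            path.pop()


all_paths = list(iter_paths(io.BytesIO(SAMPLE)))
# dict.fromkeys keeps the first occurrence of each path and preserves order.
unique_paths = list(dict.fromkeys(all_paths))
for p in unique_paths:
    print(p)
```

Note that for documents that declare namespaces, ElementTree reports tags in the form {namespace-uri}name, so the paths will contain those URIs unless you strip them.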

Related

Python LXML fails to find XML element

I'm attempting to find an XML element called "md:EntityDescriptor" using the following Python code:
def parse(filepath):
    xmlfile = str(filepath)
    doc1 = ET.parse(xmlfile)
    root = doc1.getroot()
    test = root.find('md:EntityDescriptor', namespaces)
    print(test)
This is the beginning of my XML document, which is a SAML assertion. I've omitted the rest for readability and security, but the element I'm searching for is literally at the very beginning:
<?xml version="1.0" encoding="UTF-8"?>
<md:EntityDescriptor ...
I have a namespace defining "md" and several others:
namespaces = {'md': 'urn:oasis:names:tc:SAML:2.0:metadata'}
yet the output of print(test) is None.
Running ET.dump(root) outputs the full contents of the file, so I know it isn't a problem with the input I'm passing. Running print(root.nsmap) returns:
{'md': 'urn:oasis:names:tc:SAML:2.0:metadata'}

If md:EntityDescriptor is the root element, trying to find a child md:EntityDescriptor element with find isn’t going to work. You've already selected that element as root.
However, the problem is that I need to run this same operation on multiple files, and md:EntityDescriptor is not always the root element. Is there a way to find an element regardless of whether or not it's the root?
Since you're using lxml, try using xpath() and the descendant-or-self:: axis instead of find:
test = root.xpath('descendant-or-self::md:EntityDescriptor', namespaces=namespaces)
Note that xpath() returns a list.
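If lxml is not available, a similar "self or any descendant" lookup can be sketched with the standard library. This swaps the xpath() call for Element.iter(), which visits the starting element first and then its descendants. The two trimmed documents and the helper name find_entity_descriptor are assumptions made for illustration, not part of the original question.

```python
import xml.etree.ElementTree as ET

MD_NS = "urn:oasis:names:tc:SAML:2.0:metadata"

# Hypothetical, heavily trimmed documents for illustration only.
AS_ROOT = f'<md:EntityDescriptor xmlns:md="{MD_NS}" entityID="a"/>'
NESTED = (
    f'<md:EntitiesDescriptor xmlns:md="{MD_NS}">'
    '<md:EntityDescriptor entityID="b"/>'
    "</md:EntitiesDescriptor>"
)


def find_entity_descriptor(root):
    """Return the first EntityDescriptor at or below root, or None."""
    # Element.iter() yields root itself first, then its descendants.
    # ElementTree spells namespaced tags as {namespace-uri}localname.
    return next(root.iter(f"{{{MD_NS}}}EntityDescriptor"), None)


for xml_text in (AS_ROOT, NESTED):
    found = find_entity_descriptor(ET.fromstring(xml_text))
    print(found.get("entityID"))
```

Unlike xpath(), this returns a single element or None instead of a list.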

How do I search for a Tag in xml file using ElementTree where i have a certain "Parent"tag with a specific value? (python)

I just started learning Python and have to write a program that parses XML files. I have to find a certain tag called OrganisationReference in two different files and return it. In fact there are multiple tags with this name, but only one, the one I am trying to return, has the tag OrganisationType with the value DEALER as a parent tag (not quite sure whether the term is right). I tried to use ElementTree for this. Here is the code:
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
root1 = tree1.getroot()
tree2 = ET.parse('Master2.xml')
root2 = tree2.getroot()
for OrganisationReference in root1.findall("./Organisation/OrganisationId/[@OrganisationType='DEALER']/OrganisationReference"):
    print(OrganisationReference.attrib)
for OrganisationReference in root2.findall("./Organisation/OrganisationId/[@OrganisationType='DEALER']/OrganisationReference"):
    print(OrganisationReference.attrib)
But this returns nothing (also no error). Can somebody help me?
My file looks like this:
<MessageOrganisationCount>a</MessageOrganisationCount>
<MessageVehicleCount>x</MessageVehicleCount>
<MessageCreditLineCount>y</MessageCreditLineCount>
<MessagePlanCount>z</MessagePlanCount>
<OrganisationData>
    <Organisation>
        <OrganisationId>
            <OrganisationType>DEALER</OrganisationType>
            <OrganisationReference>WHATINEED</OrganisationReference>
        </OrganisationId>
        <OrganisationName>XYZ.</OrganisationName>
....
Because OrganisationReference appears a few more times in this file with different text between its start and end tags, I want to get exactly the one that you see in line 9: it has OrganisationId as its parent tag, and OrganisationType with the value DEALER is also a child tag of that OrganisationId.

You were super close with your original attempt. You just need to make a couple of changes to your XPath and a tiny change to your Python.
The first part of your XPath starts with ./Organisation. Since you're evaluating the XPath from the root, it expects Organisation to be a child of the root. It's not; it's a descendant.
Try changing ./Organisation to .//Organisation. (// is short for /descendant-or-self::node()/.)
The second issue is with OrganisationId/[@OrganisationType='DEALER']. That's invalid XPath. The / should be removed from between OrganisationId and the predicate.
Also, @ is abbreviated syntax for the attribute:: axis, and OrganisationType is an element, not an attribute.
Try changing OrganisationId/[@OrganisationType='DEALER'] to OrganisationId[OrganisationType='DEALER'].
The Python issue is with print(OrganisationReference.attrib). The OrganisationReference element doesn't have any attributes; just text.
Try changing print(OrganisationReference.attrib) to print(OrganisationReference.text).
Here's an example using just one XML file for demo purposes...
XML Input (Master1.xml; with doc element added to make it well-formed)
<doc>
    <MessageOrganisationCount>a</MessageOrganisationCount>
    <MessageVehicleCount>x</MessageVehicleCount>
    <MessageCreditLineCount>y</MessageCreditLineCount>
    <MessagePlanCount>z</MessagePlanCount>
    <OrganisationData>
        <Organisation>
            <OrganisationId>
                <OrganisationType>DEALER</OrganisationType>
                <OrganisationReference>WHATINEED</OrganisationReference>
            </OrganisationId>
            <OrganisationName>XYZ.</OrganisationName>
        </Organisation>
    </OrganisationData>
</doc>
Python
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
root1 = tree1.getroot()
for OrganisationReference in root1.findall(".//Organisation/OrganisationId[OrganisationType='DEALER']/OrganisationReference"):
    print(OrganisationReference.text)
Printed Output
WHATINEED
Also note that it doesn't appear that you need to use getroot() at all. You can use findall() directly on the tree...
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
for OrganisationReference in tree1.findall(".//Organisation/OrganisationId[OrganisationType='DEALER']/OrganisationReference"):
    print(OrganisationReference.text)

You can use a for-loop with a nested check to do it. First you check whether the text of OrganisationType is DEALER, and then you get the text of the OrganisationReference that you need.
If you want to learn more about parsing XML with Python, I strongly recommend the documentation of the xml.etree.ElementTree module.
import xml.etree.ElementTree as ET
tree1 = ET.parse('Master1.xml')
root1 = tree1.getroot()
tree2 = ET.parse('Master2.xml')
root2 = tree2.getroot()
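# Find the OrganisationId whose first child has the text DEALER.
# .// is used because Organisation is a descendant of the root, not a direct child.
for element in root1.findall('.//Organisation/OrganisationId'):
    if element[0].text == "DEALER":
        print(element[1].text)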
This works if the first tag in your OrganisationId is OrganisationType :)
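If the child order is not guaranteed, the same check can be sketched by looking the children up by name with findtext() instead of by position. The document below is based on the question's sample; the doc root, the SUPPLIER entry and the variable name dealer_refs are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

# Based on the question's sample; <doc> and the SUPPLIER entry are hypothetical.
XML = """<doc>
  <OrganisationData>
    <Organisation>
      <OrganisationId>
        <OrganisationReference>OTHER</OrganisationReference>
        <OrganisationType>SUPPLIER</OrganisationType>
      </OrganisationId>
      <OrganisationId>
        <OrganisationType>DEALER</OrganisationType>
        <OrganisationReference>WHATINEED</OrganisationReference>
      </OrganisationId>
    </Organisation>
  </OrganisationData>
</doc>"""

root = ET.fromstring(XML)
dealer_refs = []
for org_id in root.iter("OrganisationId"):
    # findtext looks the child up by tag name, so child order does not matter.
    if org_id.findtext("OrganisationType") == "DEALER":
        dealer_refs.append(org_id.findtext("OrganisationReference"))

print(dealer_refs)
```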

Check if tag exists by index in xml file

I wrote a python script that returns xml file tag values. It goes through the file by index. How can I check if an index exists?
This is the basic blueprint.
tree = ET.parse(file)
root = tree.getroot()
root[0][1][0].text

I am not sure what you are trying to achieve. As I understand it, you are parsing some XML and trying to get the text of one XML element based on indexes. I think you could use XPath searching instead. It works even better if you parse with the lxml module instead of the standard library's xml package. See the lxml documentation for a description of its XPath support.
Anyway, if you really prefer using indexes, you do not have to check whether an element exists at a specific index. Use a try/except block instead to catch the error raised when an index does not exist.
This answer provides some details on why you should use this approach.
And your code could look more or less like this:
tree = ET.parse(file)
root = tree.getroot()
try:
    text = root[0][1][0].text
except IndexError:
    # do something to handle the error
    pass
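If the same lookup is needed in several places, the try/except can be wrapped in a small helper. This is a sketch only: the function name text_at, its default parameter and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET


def text_at(root, *indexes, default=None):
    """Follow child indexes from root; return the text, or default if one is missing."""
    node = root
    try:
        for i in indexes:
            node = node[i]
    except IndexError:
        # Some index along the way does not exist.
        return default
    return node.text


# Hypothetical document for illustration.
root = ET.fromstring("<a><b><c/><d><e>hello</e></d></b></a>")
print(text_at(root, 0, 1, 0))  # this path exists
print(text_at(root, 0, 5, 0))  # index 5 does not exist, so the default is returned
```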

How can I turn an xml Element into an ElementTree (python)?

As I understood it, XML files are tree structures ie each branch is its own tree. Conceptually, I can't see the difference between an Element and an ElementTree. But I guess that's ok - what's worse is that there is stuff you can't do with an Element - for example root.write("bla.xml") seems to be fine but element.write("bla.xml") doesn't work.
So I suppose I need to convert the Element to an ElementTree and set it as root before I do anything else. How do I do this...?

You are right, conceptually there is no difference. So just build your elements however you like, and then wrap their root in an ElementTree so you have access to its methods. You can just do
from xml.etree.ElementTree import ElementTree
tree = ElementTree(my_root_element)
tree.write(...)
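A minimal end-to-end sketch of that wrapping, writing into an in-memory buffer so it runs without touching the disk. The element names and the variable names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical elements built by hand.
root = ET.Element("config")
child = ET.SubElement(root, "item", name="example")
child.text = "value"

# Wrapping any Element, not only the document root, gives access to write().
buffer = io.BytesIO()
ET.ElementTree(child).write(buffer, encoding="utf-8", xml_declaration=True)
serialized = buffer.getvalue().decode("utf-8")
print(serialized)
```

If you only need the serialized text and not a file, ET.tostring(element) works on an Element directly.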

With lxml, you can get the ElementTree that an Element belongs to by calling its getroottree() method:
import lxml.html
doc = lxml.html.fromstring(s)  # s is a string of HTML; doc is the root Element
tree = doc.getroottree()
For more information, see the lxml documentation.

search entire tree with etree

I am using xml.etree.ElementTree as ET, this seems like the go-to library but if there is something else/better for the job I'm intrigued.
Let's say I have a tree like:
doc = """
<top>
    <second>
        <third>
            <subthird></subthird>
            <subthird2>
                <subsubthird>findme</subsubthird>
            </subthird2>
        </third>
    </second>
</top>"""
and for the sake of this problem, let's say this is already in an ElementTree named myTree.
I want to update findme to found. Is there a simple way to do it other than iterating like:
myTree.getroot().getchildren()[0].getchildren()[0].getchildren() \
    [1].getchildren()[0].text = 'found'
The issue is I have a large xml tree and I want to update these values and I can't find a clear and pythonic way to do this.

You can use XPath expressions to get a specific tagname like this:
for el in myTree.getroot().findall(".//subsubthird"):
    el.text = 'found'
If you need to find all tags with a specific text value, take a look at this answer: Find element by text with XPath in ElementTree.
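As a sketch of the text-based case with the standard library alone: ElementTree's limited XPath accepts a [.='text'] predicate (Python 3.7 or later). The extra subsubthird element and the "keep" text below are hypothetical additions to the question's document, there only to show that non-matching elements are left alone.

```python
import xml.etree.ElementTree as ET

doc = """
<top>
  <second>
    <third>
      <subthird>keep</subthird>
      <subthird2>
        <subsubthird>findme</subsubthird>
        <subsubthird>other</subsubthird>
      </subthird2>
    </third>
  </second>
</top>"""

myTree = ET.ElementTree(ET.fromstring(doc))

# [.='findme'] keeps only elements whose text content equals "findme".
for el in myTree.findall(".//subsubthird[.='findme']"):
    el.text = "found"

texts = [el.text for el in myTree.iter("subsubthird")]
print(texts)
```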

I use lxml with XPath expressions. ElementTree has an abbreviated XPath syntax, but since I don't use it, I don't know how extensive it is. The thing about XPath is that you can write as complex an element selector as you need. In this example, it's based on nesting:
import lxml.etree
doc = """
<top>
    <second>
        <third>
            <subthird></subthird>
            <subthird2>
                <subsubthird>findme</subsubthird>
            </subthird2>
        </third>
    </second>
</top>"""
root = lxml.etree.XML(doc)
for elem in root.xpath('second/third/subthird2/subsubthird'):
    elem.text = 'found'
print(lxml.etree.tostring(root, pretty_print=True, encoding='unicode'))
But suppose there was something else identifying it, such as a unique attribute,
<subthird2 class="foo"><subsubthird>findme</subsubthird></subthird2>
then your XPath would be //subthird2[@class="foo"]/subsubthird.
