How can I print XPaths of lxml tree elements? - python

I'm trying to print the XPath of every element in an XML tree, but I get strange output when using lxml. Instead of a path built from the name of each node, I get paths made of "*" wildcards.
Do you know what the issue might be? Here is the code, along with the XML I am trying to analyze.
from lxml import etree
xml = """
<filter xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
<bundles xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-bundlemgr-oper">
<bundles>
<bundle>
<data>
<bundle-status/>
<lacp-status/>
<minimum-active-links/>
<ipv4bfd-status/>
<active-member-count/>
<active-member-configured/>
</data>
<members>
<member>
<member-interface/>
<interface-name/>
<member-mux-data>
<member-state/>
</member-mux-data>
</member>
</members>
<bundle-interface>{{bundle_name}}</bundle-interface>
</bundle>
</bundles>
</bundles>
<bfd xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-ip-bfd-oper">
<session-briefs>
<session-brief>
<state/>
<interface-name>{{bundle_name}}</interface-name>
</session-brief>
</session-briefs>
</bfd>
</filter>
"""
root = etree.XML(xml)
tree = etree.ElementTree(root)
for element in root.iter():
    print(tree.getpath(element))
The output looks like this (there should be node names instead of "*"):
/*
/*/*[1]
/*/*[1]/*
/*/*[1]/*/*
/*/*[1]/*/*/*[1]
/*/*[1]/*/*/*[1]/*[1]
/*/*[1]/*/*/*[1]/*[2]
/*/*[1]/*/*/*[1]/*[3]
/*/*[1]/*/*/*[1]/*[4]
/*/*[1]/*/*/*[1]/*[5]
/*/*[1]/*/*/*[1]/*[6]
/*/*[1]/*/*/*[2]
/*/*[1]/*/*/*[2]/*
/*/*[1]/*/*/*[2]/*/*[1]
/*/*[1]/*/*/*[2]/*/*[2]
/*/*[1]/*/*/*[2]/*/*[3]
/*/*[1]/*/*/*[2]/*/*[3]/*
/*/*[1]/*/*/*[3]
/*/*[2]
/*/*[2]/*
/*/*[2]/*/*
/*/*[2]/*/*/*[1]
/*/*[2]/*/*/*[2]
Thanks a lot!
Dragan

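getpath returns a plain XPath expression, and XPath cannot name an element
in a default (unprefixed) namespace without a prefix mapping, hence the "*" steps.
Besides getpath, lxml's ElementTree also has a "sibling" method called
getelementpath, which gives a proper result for namespaced elements too.
So change your code to:
for element in root.iter():
    print(tree.getelementpath(element))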
For your source sample, with namespaces shortened for readability,
the initial part of the result is:
.
{http://cisco.com/ns}bundles
{http://cisco.com/ns}bundles/{http://cisco.com/ns}bundles
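If the full namespace URIs make the output hard to read, they can be stripped for display. The following is a minimal sketch, assuming a trimmed-down stand-in for the XML in the question; short_path is a hypothetical helper, not part of lxml, and the shortened paths are meant for printing only (they can no longer be used to look the elements up again).

```python
import re
from lxml import etree

# Trimmed-down stand-in for the document in the question (an assumption).
xml = b"""<filter xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <bundles xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-bundlemgr-oper">
    <data><bundle-status/></data>
  </bundles>
</filter>"""

root = etree.XML(xml)
tree = etree.ElementTree(root)


def short_path(tree, element):
    """Hypothetical helper: getelementpath() with the {namespace} parts removed."""
    return re.sub(r"\{[^}]*\}", "", tree.getelementpath(element))


paths = [short_path(tree, element) for element in root.iter()]
for path in paths:
    print(path)
```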

Related

xml parsing in python with XPath

I am trying to parse an XML file in Python with the built-in xml module and ElementTree, but whatever I try according to the documentation does not give me what I need.
I am trying to extract all the value tags into a list:
<?xml version="1.0" encoding="UTF-8"?>
<CustomField xmlns="http://soap.sforce.com/2006/04/metadata">
<fullName>testPicklist__c</fullName>
<externalId>false</externalId>
<label>testPicklist</label>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<type>Picklist</type>
<valueSet>
<restricted>true</restricted>
<valueSetDefinition>
<sorted>false</sorted>
<value>
<fullName>a 32</fullName>
<default>false</default>
<label>a 32</label>
</value>
<value>
<fullName>23 432;:</fullName>
<default>false</default>
<label>23 432;:</label>
</value>
And here is the example code that I can't get to work. It's very basic; the only thing I have trouble with is the XPath.
from xml.etree.ElementTree import ElementTree
field_filepath= "./testPicklist__c.field-meta.xml"
mydoc = ElementTree()
mydoc.parse(field_filepath)
root = mydoc.getroot()
print(root.findall(".//value"))
print(root.findall(".//*/value"))
print(root.findall("./*/value"))
Since the root element has attribute xmlns="http://soap.sforce.com/2006/04/metadata", every element in the document will belong to this namespace. So you're actually looking for {http://soap.sforce.com/2006/04/metadata}value elements.
To find all <value> elements in this document, you have to pass the namespaces argument to the findall() function:
from xml.etree.ElementTree import ElementTree
field_filepath= "./testPicklist__c.field-meta.xml"
mydoc = ElementTree()
mydoc.parse(field_filepath)
root = mydoc.getroot()
# get the namespace of root
ns = root.tag.split('}')[0][1:]
# create a dictionary with the namespace
ns_d = {'my_ns': ns}
# get all the values
values = root.findall('.//my_ns:value', namespaces=ns_d)
# print the values
for value in values:
    print(value)
Outputs:
<Element '{http://soap.sforce.com/2006/04/metadata}value' at 0x7fceea043ba0>
<Element '{http://soap.sforce.com/2006/04/metadata}value' at 0x7fceea043e20>
Alternatively, you can search for {http://soap.sforce.com/2006/04/metadata}value directly:
# get all the values
values = root.findall('.//{http://soap.sforce.com/2006/04/metadata}value')
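Both variants above return Element objects rather than text. As a minimal sketch of getting the actual values, assuming Python 3.8 or later (where ElementTree accepts the {*} namespace wildcard) and a shortened copy of the file embedded as a string:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the file from the question, embedded for illustration.
SAMPLE = """<CustomField xmlns="http://soap.sforce.com/2006/04/metadata">
  <valueSet>
    <valueSetDefinition>
      <value><fullName>a 32</fullName><label>a 32</label></value>
      <value><fullName>23 432;:</fullName><label>23 432;:</label></value>
    </valueSetDefinition>
  </valueSet>
</CustomField>"""

root = ET.fromstring(SAMPLE)

# "{*}" matches the tag in any namespace (supported since Python 3.8).
values = root.findall(".//{*}value")

# findtext() returns the text of the first matching child, or None.
labels = [value.findtext("{*}label") for value in values]
print(labels)
```

The wildcard is convenient for one-off scripts; the explicit namespace is the safer choice when the document mixes several namespaces that reuse the same tag names.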

Reading xml with lxml lib getting strange string from xmlns tag

I am writing a program that works on an XML file and changes it. But whenever I try to access any part of it, I get an extra part in the tag name.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that URL, and how do I stop it from showing up?
You have defined a default namespace (Wikipedia, lxml tutorial). Once defined, it becomes part of the tag of every child element.
If you want to print the tag without the namespace, it's easy:
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
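The slicing trick can be wrapped in a small function so that it also handles tags without a namespace. This is a minimal sketch using the standard-library ElementTree on a shortened copy of package.xml; local_name is a hypothetical helper name, and the tag format it relies on ({namespace}name) is the same in lxml.

```python
import xml.etree.ElementTree as ET

# Shortened copy of package.xml from the question, embedded for illustration.
SAMPLE = """<Package xmlns="http://soap.sforce.com/2006/04/metadata">
  <types>
    <members>sbaa__ApprovalChain__c.ExternalID__c</members>
    <name>CustomField</name>
  </types>
  <version>40.0</version>
</Package>"""


def local_name(tag):
    """Hypothetical helper: drop a leading '{namespace}' part, if present."""
    # rpartition returns the whole string as the last item when '}' is absent.
    return tag.rpartition("}")[2]


root = ET.fromstring(SAMPLE)
names = [local_name(element.tag) for element in root.iter()]
print(names)
```

With lxml specifically, etree.QName(element).localname should give the same result without a helper.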

Python -lxml xpath returns empty list

I am reading an xliff file and planning to retrieve specific element. I tried to print all the elements using
from lxml import etree
with open(r'path\to\file\.xliff', 'r', encoding='utf-8') as xml_file:
    tree = etree.parse(xml_file)
    root = tree.getroot()
    for element in root.iter():
        print("child", element)
The output was
child <Element {urn:oasis:names:tc:xliff:document:2.0}segment at 0x6b8f9c8>
child <Element {urn:oasis:names:tc:xliff:document:2.0}source at 0x6b8f908>
When I tried to get the specific element (with the help of many posts here) - source tag
segment = tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
it returns an empty list. Can someone tell me how to retrieve it properly?
Input :
<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id = 1>
<source>
Hello world
</source>
</segment>
<segment id = 2 >
<source>
2nd statement
</source>
</segment>
</xliff>
I want to get the values of segment and its corresponding source
This code,
tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
is not accepted by lxml ("lxml.etree.XPathEvalError: Invalid expression"), because the {namespace}tag form is not XPath syntax. You need to use findall(), which does understand it.
The following works (in the XML sample, the segment elements are children of xliff):
from lxml import etree
tree = etree.parse("test.xliff") # XML in the question; ill-formed attributes corrected
segment = tree.findall('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
However, the real XML is apparently more complex (segment is not a direct child of xliff). Then you need to add .// to search the whole tree:
segment = tree.findall('.//{urn:oasis:names:tc:xliff:document:2.0}segment')
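To get each segment together with its source text, as asked in the question, the same findall() pattern can be combined with find(). The sketch below uses the standard-library ElementTree, whose findall()/find() accept the same {namespace}tag paths as lxml; the sample is the XML from the question with the attributes quoted so that it is well-formed.

```python
import xml.etree.ElementTree as ET

NS = "urn:oasis:names:tc:xliff:document:2.0"

# XML from the question, with the id attributes quoted (well-formed).
SAMPLE = f"""<xliff xmlns="{NS}" version="2.0">
<segment id="1">
<source>
Hello world
</source>
</segment>
<segment id="2">
<source>
2nd statement
</source>
</segment>
</xliff>"""

root = ET.fromstring(SAMPLE)

pairs = {}
for segment in root.findall(f".//{{{NS}}}segment"):
    # Look up the <source> child of this particular segment.
    source = segment.find(f"{{{NS}}}source")
    if source is not None and source.text:
        pairs[segment.get("id")] = source.text.strip()

print(pairs)
```

If a real XPath expression is preferred in lxml, xpath() takes a prefix map instead, along the lines of tree.xpath('//x:segment', namespaces={'x': NS}), where the prefix x is an arbitrary choice.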

Why won't this check for an element work using python elementtree

I finally decided to learn how to parse xml in python. I'm using elementtree just to get a basic understanding. I'm on CentOS 6.5 using python 2.7.9. I've looked through the following pages:
http://www.diveintopython3.net/xml.html
https://pymotw.com/2/xml/etree/ElementTree/parse.html#traversing-the-parsed-tree
and performed several searches on this forum, but I'm having some trouble and I'm not sure if it's my code or the xml I'm trying to parse.
I need to be able to verify if certain elements are in the xml or not. For example, in the xml below, I need to check to see if the element Analyzer is present and if so, get the attribute. Then, if Analyzer is present, I need to check for the location element and get the text then the name element and get that text. I thought that the following code would check to see if the element existed:
if element.find('...') is not None
but that yields inconsistent results and it never seems to find the location or name element. For example:
if tree.find('Alert') is not None:
appears to work, but
if tree.find('location') is not None:
or
if tree.find('Analyzer') is not None:
definitely don't work. I'm guessing that the tree.find() function only works for the top level?
So how do I do this check?
Here is my xml:
<?xml version='1.0' encoding='UTF-8'?>
<Report>
<Alert>
<Analyzer analyzerid="CS">
<Node>
<location>USA</location>
<name>John Smith</name>
</Node>
</Analyzer>
<AnalyzerTime>2016-06-11T00:30:02+0000</AnalyzerTime>
<AdditionalData type="integer" meaning="number of alerts in this report">19</AdditionalData>
<AdditionalData type="string" meaning="report schedule">5 minutes</AdditionalData>
<AdditionalData type="string" meaning="report type">alerts</AdditionalData>
<AdditionalData type="date-time" meaning="report start time">2016-06-11T00:25:16+0000</AdditionalData>
</Alert>
<Alert>
<CreateTime>2016-06-11T00:25:16+0000</CreateTime>
<Source>
<Node>
<Address category="ipv4-addr">
<address>1.5.1.4</address>
</Address>
</Node>
</Source>
<Target>
<Service>
<port>22</port>
<protocol>TCP</protocol>
</Service>
</Target>
<Classification text="SSH scans, direction:ingress, confidence:80, severity:high">
<Reference meaning="Scanning" origin="user-specific">
<name>SSH Attack</name>
<url> </url>
</Reference>
</Classification>
<Assessment>
<Action category="block-installed"/>
</Assessment>
<AdditionalData type="string" meaning="top level domain owner">PH, Philippines</AdditionalData>
<AdditionalData type="integer" meaning="alert threshold">0</AdditionalData>
</Alert>
</Report>
And here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for child in root: print child
all_links = tree.findall('.//Analyzer')
try:
    print all_links[0].attrib.get('analyzerid')
    ID = all_links[0].attrib.get('analyzerid')
    all_links2 = tree.findall('.//location')
    print all_links2
    try:
        print all_links[0].text
    except: print "can't print text location"
    if tree.find('location') is None: print 'lost'
    for kid in tree.iter('location'):
        try:
            location = kid.text
            print kid.text
        except: print 'bad'
except IndexError: print 'There was no Analyzer element'
I think you're missing one important line from the Dive Into Python tutorial (just up from here):
There is a way to search for descendant elements, i.e. children, grandchildren, and any element at any nesting level.
That way is to precede the element names with //.
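tree.find("someElementName") will only find a direct child of the root element with the name someElementName. If you want to search for an element named someElementName anywhere within the tree, use tree.find(".//someElementName"). Note the leading dot: ElementTree paths are relative, so a bare "//someElementName" triggers a FutureWarning on an ElementTree object and raises a SyntaxError on an Element.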
The // notation originates from XPath. The ElementTree module provides support for a limited subset of XPath. The ElementTree documentation details the parts of XPath syntax it supports.
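Applied to the report in the question, the check could look like the following minimal sketch. It uses Python 3 syntax (the question uses Python 2.7, where only the print statements differ) and a shortened copy of the XML embedded as a string.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the report from the question, embedded for illustration.
SAMPLE = """<Report>
  <Alert>
    <Analyzer analyzerid="CS">
      <Node>
        <location>USA</location>
        <name>John Smith</name>
      </Node>
    </Analyzer>
  </Alert>
  <Alert>
    <CreateTime>2016-06-11T00:25:16+0000</CreateTime>
  </Alert>
</Report>"""

root = ET.fromstring(SAMPLE)

# A bare name only matches direct children of <Report>, so this is None.
direct = root.find("location")

# ".//" searches all descendants, at any depth.
analyzer = root.find(".//Analyzer")
analyzer_id = location = name = None
if analyzer is not None:
    analyzer_id = analyzer.get("analyzerid")
    # findtext() returns None when the element is missing.
    location = analyzer.findtext(".//location")
    name = analyzer.findtext(".//name")

print(direct, analyzer_id, location, name)
```

Searching from the Analyzer element rather than from the root keeps the name lookup from matching the unrelated <name> under Reference in the second Alert.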

Copy attribute information when different element have the same name in XML with python

So, here's my XML tree:
<?xml version="1.0"?>
<api>
<query>
<normalized>
<n from="Brain_cancer" to="Brain cancer" />
</normalized>
<redirects>
<r from="Brain cancer" to="Brain tumor"
/>
</redirects>
<pages>
<page pageid="37284" ns="0" title="Brain tumor">
<revisions>
<rev revid="412658600" parentid="412501243" user="Andycjp" userid="55014" timestamp="2011-02-08T03:35:27Z" size="59870" sha1="fe1ff25c27ebc86572aa4be8201cb813e1bf3d32" comment="/* Psychological and behavioral consequences */" contentformat="text/x-wiki" contentmodel="wikitext" xml:space="preserve">
</rev>
</revisions>
</page>
</pages>
</query>
<warnings>
<revisions xml:space="preserve">
</revisions>
<result xml:space="preserve">
</result>
</warnings>
<query-continue>
<revisions rvcontinue="456175380"
/>
</query-continue>
</api>
So, as you can see, the "revisions" element appears in two different places, at different levels. My objective is to reach the attribute "rvcontinue" (whose path is api/query-continue/revisions) and copy its value into a new variable. It's probably because I'm just not getting it right, but ElementTree and XPath haven't worked so far.
This is what I've done so far, but it's getting nowhere:
import xml.etree.ElementTree as ET
tree = ET.parse('Brain_tumor_5.xml')
for elem in tree.getiterator():
    if elem.tag=='{http://www.namespace.co.uk}query-continue':
        output = {}
        for elem1 in list(elem):
            if elem1.tag=='{http://www.namespace.co.uk}revisions':
                output['rvcontinue']=elem1.text
        print output
p = tree.find("./api/query-continue/revisions[@rvcontinue=]")
q = p.attrib
print q
I have also mostly used lxml, so I don't know what's up with etree, but it appears
that your find from the tree doesn't work, while find from the root does:
>>> tree.getroot().find( 'query-continue/revisions[@rvcontinue]' ).attrib['rvcontinue']
'456175380'
Also: I don't know if it's just a typo above, but:
p = tree.find("./api/query-continue/revisions[@rvcontinue=]")
will give a SyntaxError: invalid predicate
Added Note: It appears that tree.find( 'api' ) returns None,
but tree.find( '.' ) returns <Element 'api' at 0x1004e5f10>
so tree.find( './query-continue/revisions[@rvcontinue]' )
will also work.
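The behaviour described in the note follows from ElementTree evaluating paths relative to the root element, so the root's own name must not appear in the path. A minimal sketch, using a shortened copy of the XML from the question embedded as a string:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the API response from the question, for illustration.
SAMPLE = """<api>
  <query>
    <pages>
      <page pageid="37284">
        <revisions><rev revid="412658600"/></revisions>
      </page>
    </pages>
  </query>
  <warnings><revisions/></warnings>
  <query-continue><revisions rvcontinue="456175380"/></query-continue>
</api>"""

root = ET.fromstring(SAMPLE)

# Including "api" looks for an <api> child of <api>, which does not exist.
wrong = root.find("api/query-continue/revisions")

# The [@rvcontinue] predicate keeps only elements that carry the attribute.
node = root.find("query-continue/revisions[@rvcontinue]")
rvcontinue = node.get("rvcontinue") if node is not None else None

print(wrong, rvcontinue)
```

Because the path starts at query-continue, the other two revisions elements (under page and under warnings) are never considered.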
This does not directly answer your question. However, I would use lxml.etree (which supposedly provides the same ElementTree interface) and the following code:
>>> import lxml.etree
>>> doc = lxml.etree.parse('doc.xml')
>>> node = doc.xpath('/api/query-continue/revisions[@rvcontinue]')
>>> node[0].attrib['rvcontinue']
'456175380'
Tried with xml.etree.ElementTree but doesn't appear to work.
