xml parsing in python with XPath

xml parsing in python with XPath - python

I am trying to parse an XML file in Python with the built in xml module and Elemnt tree, but what ever I try to do according to the documentation, it does not give me what I need.
I am trying to extract all the value tags into a list
<?xml version="1.0" encoding="UTF-8"?>
<CustomField xmlns="http://soap.sforce.com/2006/04/metadata">
<fullName>testPicklist__c</fullName>
<externalId>false</externalId>
<label>testPicklist</label>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<type>Picklist</type>
<valueSet>
<restricted>true</restricted>
<valueSetDefinition>
<sorted>false</sorted>
<value>
<fullName>a 32</fullName>
<default>false</default>
<label>a 32</label>
</value>
<value>
<fullName>23 432;:</fullName>
<default>false</default>
<label>23 432;:</label>
</value>
and here is the example code that I cant get to work. It's very basic and all I have issues is the xpath.
from xml.etree.ElementTree import ElementTree
field_filepath= "./testPicklist__c.field-meta.xml"
mydoc = ElementTree()
mydoc.parse(field_filepath)
root = mydoc.getroot()
print(root.findall(".//value")
print(root.findall(".//*/value")
print(root.findall("./*/value")

Since the root element has attribute xmlns="http://soap.sforce.com/2006/04/metadata", every element in the document will belong to this namespace. So you're actually looking for {http://soap.sforce.com/2006/04/metadata}value elements.
To search all <value> elements in this document you have to specify the namespace argument in the findall() function
from xml.etree.ElementTree import ElementTree
field_filepath= "./testPicklist__c.field-meta.xml"
mydoc = ElementTree()
mydoc.parse(field_filepath)
root = mydoc.getroot()
# get the namespace of root
ns = root.tag.split('}')[0][1:]
# create a dictionary with the namespace
ns_d = {'my_ns': ns}
# get all the values
values = root.findall('.//my_ns:value', namespaces=ns_d)
# print the values
for value in values:
print(value)
Outputs:
<Element '{http://soap.sforce.com/2006/04/metadata}value' at 0x7fceea043ba0>
<Element '{http://soap.sforce.com/2006/04/metadata}value' at 0x7fceea043e20>
Alternatively you can just search for the {http://soap.sforce.com/2006/04/metadata}value
# get all the values
values = root.findall('.//{http://soap.sforce.com/2006/04/metadata}value')

Related

python, xml: how to access the 3rd child by element' name

Would you help me, pleace, to get an access to elemnt with name 'id' by the following construction in Python (i have lxml and xml.etree.ElementTree libraries).
Desirable result: '0000000'
Desirable method:
Search in xml-document a child, where it's name is fcsProtocolEF3.
Search in fcsProtocolEF3 an element with name 'id'.
It is crucial to search by element name. Not by ordinal position.
I tried to use something like this: tree.findall('{http://zakupki.gov.ru/oos/export/1}fcsProtocolEF3')[0].findall('{http://zakupki.gov.ru/oos/types/1}id')[0].text
it works, but it requires to input namespaces. XML-document have different namespaces and I don't know how to define them beforehand.
Thank you.
That would be great to use something like XQuery in SQL:
value('(/*:export/*:fcsProtocolEF3/*:id)[1]', 'nvarchar(21)')) AS [id],
XML-document:
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>

lxml solution:
xml = '''<?xml version="1.0"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>'''
from lxml import etree as et
root = et.fromstring(xml)
text = root.xpath('//*[local-name()="export"]/*[local-name()="fcsProtocolEF3"]/*[local-name()="id"]/text()')[0]
print(text)

Below is ET based solution. NS are in use.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>
'''
def get_id_text():
root = ET.fromstring(xml)
fcs = root.find('{http://zakupki.gov.ru/oos/export/1}fcsProtocolEF3')
# assuming there is one fcs element and one id under fcs
return fcs.find('{http://zakupki.gov.ru/oos/types/1}id').text
print(get_id_text())
output
0000000

Is there a way to return the value for a tag from a XML based on a specific path in python?

I have this XML
<Body>
<Batch_Number>2000</Batch_Number>
<Total_No_Of_Batches>12312</Total_No_Of_Batches>
<requestNo>1923</requestNo>
<Parent1>
<Parent2>
<Parent3>
<lastModifiedDateTime>2022-11-11T11:07:30.000</lastModifiedDateTime>
<purpose>NeverMore</purpose>
<endDate>9999-12-31T00:00:00.000</endDate>
<createdDateTime>2019-06-06T06:32:16.000</createdDateTime>
<createdOn>2019-06-06T08:32:16.000</createdOn>
<address2>Forever street 21</address2>
<externalCode>code123</externalCode>
<lastModifiedBy>user2.thisUser</lastModifiedBy>
<lastModifiedOn>2039-06-11T13:07:30.000</lastModifiedOn>
<lastModifiedBy>MG</lastModifiedBy>
<PS>1234431</PS>
</Parent3>
</Parent2>
</Parent1>
</Body>
Is there a way to return the value for lastModifiedBy for example where the path has this specific structure :
Body.Parent1.Parent2.Parent3.lastModifiedBy
Idealy, I would like to populate a dictionary with the child tag name and its value, for example :
dict[lastModifiedBy.tag] = lastModifiedBy.text

You can use xml from standart library for working with xml files.
from xml.etree import ElementTree as ET
tree = ET.parse("d.xml") # our xml file
root = tree.getroot()
And then you can access elements as indexes or you can use root like as a list:
for i in root:
print(i)
A XML element may have more than one child with same tag name (even you have two lastModifiedBy in the Parent3). This is why we use them like lists, they works like a list. So you shouldn't try to use them like dictionary.
I think you need to use XPath. Like so:
from xml.etree import ElementTree as ET
tree = ET.parse("d.xml") # our xml file
root = tree.getroot()
s = root.findall(".Parent1/Parent2/Parent3/lastModifiedBy")
for i in s:
print(i.text)
This gives you all lastModifiedBy elements in the Parent3 element. You can access to any index if you want too, like this:
from xml.etree import ElementTree as ET
tree = ET.parse("d.xml") # our xml file
root = tree.getroot()
s = root.find(".Parent1/Parent2/Parent3/lastModifiedBy[1]") # first element with "lastModifiedBy" tag
print(s.text)

How can I print XPaths of lxml tree elements?

I'm trying to print XPaths of all elements in XML tree, but I get strange output when using lxml. Instead of xpath which contains name of each node within path, I get strange "*"-kind of output.
Do you know what might be the issue here? Here the code, as well as XML I am trying to analyze.
from lxml import etree
xml = """
<filter xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
<bundles xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-bundlemgr-oper">
<bundles>
<bundle>
<data>
<bundle-status/>
<lacp-status/>
<minimum-active-links/>
<ipv4bfd-status/>
<active-member-count/>
<active-member-configured/>
</data>
<members>
<member>
<member-interface/>
<interface-name/>
<member-mux-data>
<member-state/>
</member-mux-data>
</member>
</members>
<bundle-interface>{{bundle_name}}</bundle-interface>
</bundle>
</bundles>
</bundles>
<bfd xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-ip-bfd-oper">
<session-briefs>
<session-brief>
<state/>
<interface-name>{{bundle_name}}</interface-name>
</session-brief>
</session-briefs>
</bfd>
</filter>
"""
root = etree.XML(xml)
tree = etree.ElementTree(root)
for element in root.iter():
print(tree.getpath(element))
The output looks like this (there should be node names instead of "*"):
/*
/*/*[1]
/*/*[1]/*
/*/*[1]/*/*
/*/*[1]/*/*/*[1]
/*/*[1]/*/*/*[1]/*[1]
/*/*[1]/*/*/*[1]/*[2]
/*/*[1]/*/*/*[1]/*[3]
/*/*[1]/*/*/*[1]/*[4]
/*/*[1]/*/*/*[1]/*[5]
/*/*[1]/*/*/*[1]/*[6]
/*/*[1]/*/*/*[2]
/*/*[1]/*/*/*[2]/*
/*/*[1]/*/*/*[2]/*/*[1]
/*/*[1]/*/*/*[2]/*/*[2]
/*/*[1]/*/*/*[2]/*/*[3]
/*/*[1]/*/*/*[2]/*/*[3]/*
/*/*[1]/*/*/*[3]
/*/*[2]
/*/*[2]/*
/*/*[2]/*/*
/*/*[2]/*/*/*[1]
/*/*[2]/*/*/*[2]
Thanks a lot!
Dragan

I found that besides getpath, etree contains also a "sibling"
method called getelementpath, giving proper result also for
namespaced elements.
So change your code to:
for element in root.iter():
print(tree.getelementpath(element))
For your source sample, with namespaces shortened for readability,
the initial part of the result is:
.
{http://cisco.com/ns}bundles
{http://cisco.com/ns}bundles/{http://cisco.com/ns}bundles

Python -lxml xpath returns empty list

I am reading an xliff file and planning to retrieve specific element. I tried to print all the elements using
from lxml import etree
with open('path\to\file\.xliff', 'r',encoding = 'utf-8') as xml_file:
tree = etree.parse(xml_file)
root = tree.getroot()
for element in root.iter():
print("child", element)
The output was
child <Element {urn:oasis:names:tc:xliff:document:2.0}segment at 0x6b8f9c8>
child <Element {urn:oasis:names:tc:xliff:document:2.0}source at 0x6b8f908>
When I tried to get the specific element (with the help of many posts here) - source tag
segment = tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
it returns an empty list. Can someone tell me how to retrieve it properly.
Input :
<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id = 1>
<source>
Hello world
</source>
</segment>
<segment id = 2 >
<source>
2nd statement
</source>
</segment>
</xliff>
I want to get the values of segment and its corresponding source

This code,
tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
is not accepted by lxml ("lxml.etree.XPathEvalError: Invalid expression"). You need to use findall().
The following works (in the XML sample, the segment elements are children of xliff):
from lxml import etree
tree = etree.parse("test.xliff") # XML in the question; ill-formed attributes corrected
segment = tree.findall('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
However, the real XML is apparently more complex (segment is not a direct child of xliff). Then you need to add .// to search the whole tree:
segment = tree.findall('.//{urn:oasis:names:tc:xliff:document:2.0}segment')

Python ElementTree find() not matching within kml file

I'm trying to find an element from a kml file using element trees as follows:
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse("history-03-02-2012.kml")
p = tree.find(".//name")
A sufficient subset of the file to demonstrate the problem follows:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name>Location history from 03/03/2012 to 03/10/2012</name>
</Document>
</kml>
A "name" element exists; why does the search come back empty?

The name element you're trying to match is actually within the KML namespace, but you aren't searching with that namespace in mind.
Try:
p = tree.find(".//{http://www.opengis.net/kml/2.2}name")
If you were using lxml's XPath instead of the standard-library ElementTree, you'd instead pass the namespace in as a dictionary:
>>> tree = lxml.etree.fromstring('''<kml xmlns="http://www.opengis.net/kml/2.2">
... <Document>
... <name>Location history from 03/03/2012 to 03/10/2012</name>
... </Document>
... </kml>''')
>>> tree.xpath('//kml:name', namespaces={'kml': "http://www.opengis.net/kml/2.2"})
[<Element {http://www.opengis.net/kml/2.2}name at 0x23afe60>]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

xml parsing in python with XPath - python

Related

python, xml: how to access the 3rd child by element' name

Is there a way to return the value for a tag from a XML based on a specific path in python?

How can I print XPaths of lxml tree elements?

Python -lxml xpath returns empty list

Python ElementTree find() not matching within kml file

Categories

Resources