How to parse tiered XML String - python

I have an xml string that I need to parse in python that looks like this:
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<PostLoadsResponse xmlns="http://webservices.truckstop.com/v11">
<PostLoadsResult xmlns:a="http://schemas.datacontract.org/2004/07/WebServices.Objects" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Errors xmlns="http://schemas.datacontract.org/2004/07/WebServices">
<Error>
<ErrorMessage>Invalid Location</ErrorMessage>
</Error>
</Errors>
</PostLoadsResult>
</PostLoadsResponse>
</s:Body>
</s:Envelope>
I'm having trouble using ElementTree to get to the error message in this tree without resorting to something like:
import xml.etree.ElementTree as ET
ET.fromstring(text).findall('{http://schemas.xmlsoap.org/soap/envelope/}Body')[0].getchildren()[0].getchildren()[0].getchildren()

You need to handle namespaces and you can do it with xml.etree.ElementTree:
tree = ET.fromstring(data)
namespaces = {
's': 'http://schemas.xmlsoap.org/soap/envelope/',
'd': "http://schemas.datacontract.org/2004/07/WebServices"
}
print(tree.find(".//d:ErrorMessage", namespaces=namespaces).text)
Prints Invalid Location.
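For reference, here is a self-contained sketch of the same idea, assuming the response is held in a string. The sample XML is trimmed from the question, and the helper name first_error_message is made up for illustration. The prefixes 's' and 'd' are local aliases; only the URIs have to match the document.

```python
import xml.etree.ElementTree as ET

# SOAP response from the question, trimmed to the relevant elements.
data = """<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
  <s:Body>
    <PostLoadsResponse xmlns="http://webservices.truckstop.com/v11">
      <PostLoadsResult>
        <Errors xmlns="http://schemas.datacontract.org/2004/07/WebServices">
          <Error>
            <ErrorMessage>Invalid Location</ErrorMessage>
          </Error>
        </Errors>
      </PostLoadsResult>
    </PostLoadsResponse>
  </s:Body>
</s:Envelope>"""

# Prefix -> namespace URI map; the prefix names are arbitrary.
NAMESPACES = {
    "s": "http://schemas.xmlsoap.org/soap/envelope/",
    "d": "http://schemas.datacontract.org/2004/07/WebServices",
}


def first_error_message(xml_text):
    """Return the text of the first ErrorMessage element, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find(".//d:ErrorMessage", NAMESPACES)
    # find() returns None when nothing matches, so guard before reading .text
    return node.text if node is not None else None


error_message = first_error_message(data)
print(error_message)
```

Checking for None before touching .text avoids an AttributeError when the response carries no error.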

Using the partial XPath support:
ET.fromstring(text).find('.//{http://schemas.datacontract.org/2004/07/WebServices}ErrorMessage')
That will instruct it to find the first element named ErrorMessage with namespace http://schemas.datacontract.org/2004/07/WebServices at any depth.
However, it may be faster to use something like
ET.fromstring(text).find('{http://schemas.xmlsoap.org/soap/envelope/}Body').find('{http://webservices.truckstop.com/v11}PostLoadsResponse').find('{http://webservices.truckstop.com/v11}PostLoadsResult').find('{http://schemas.datacontract.org/2004/07/WebServices}Errors').find('{http://schemas.datacontract.org/2004/07/WebServices}Error').find('{http://schemas.datacontract.org/2004/07/WebServices}ErrorMessage')
If you know your message will always contain those elements.

You can use the iter method on the tree to iterate through the items in it (older code calls this getiterator, which was removed in Python 3.9). You can check the tag on each item to see if it's the right one.
>>> err = [node.text for node in tree.iter() if node.tag.endswith('ErrorMessage')]
>>> err
['Invalid Location']
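A slightly stricter variant, as a sketch: endswith('ErrorMessage') would also match a tag such as OtherErrorMessage, so this version compares the local name exactly. The local_name helper is hypothetical, and the sample XML is shortened from the question.

```python
import xml.etree.ElementTree as ET

# Shortened version of the response in the question.
text = """<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
  <s:Body>
    <Errors xmlns="http://schemas.datacontract.org/2004/07/WebServices">
      <Error><ErrorMessage>Invalid Location</ErrorMessage></Error>
    </Errors>
  </s:Body>
</s:Envelope>"""


def local_name(tag):
    """Strip a leading '{namespace-uri}' from an ElementTree tag, if present."""
    return tag.rpartition("}")[2]


tree = ET.fromstring(text)
# iter() walks the element itself and every descendant in document order.
errors = [node.text for node in tree.iter() if local_name(node.tag) == "ErrorMessage"]
print(errors)
```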

Related

How can I print XPaths of lxml tree elements?

I'm trying to print the XPaths of all elements in an XML tree, but I get strange output when using lxml. Instead of an XPath that contains the name of each node in the path, I get strange "*"-style output.
Do you know what might be the issue here? Here is the code, as well as the XML I am trying to analyze.
from lxml import etree
xml = """
<filter xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
<bundles xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-bundlemgr-oper">
<bundles>
<bundle>
<data>
<bundle-status/>
<lacp-status/>
<minimum-active-links/>
<ipv4bfd-status/>
<active-member-count/>
<active-member-configured/>
</data>
<members>
<member>
<member-interface/>
<interface-name/>
<member-mux-data>
<member-state/>
</member-mux-data>
</member>
</members>
<bundle-interface>{{bundle_name}}</bundle-interface>
</bundle>
</bundles>
</bundles>
<bfd xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-ip-bfd-oper">
<session-briefs>
<session-brief>
<state/>
<interface-name>{{bundle_name}}</interface-name>
</session-brief>
</session-briefs>
</bfd>
</filter>
"""
root = etree.XML(xml)
tree = etree.ElementTree(root)
for element in root.iter():
    print(tree.getpath(element))
The output looks like this (there should be node names instead of "*"):
/*
/*/*[1]
/*/*[1]/*
/*/*[1]/*/*
/*/*[1]/*/*/*[1]
/*/*[1]/*/*/*[1]/*[1]
/*/*[1]/*/*/*[1]/*[2]
/*/*[1]/*/*/*[1]/*[3]
/*/*[1]/*/*/*[1]/*[4]
/*/*[1]/*/*/*[1]/*[5]
/*/*[1]/*/*/*[1]/*[6]
/*/*[1]/*/*/*[2]
/*/*[1]/*/*/*[2]/*
/*/*[1]/*/*/*[2]/*/*[1]
/*/*[1]/*/*/*[2]/*/*[2]
/*/*[1]/*/*/*[2]/*/*[3]
/*/*[1]/*/*/*[2]/*/*[3]/*
/*/*[1]/*/*/*[3]
/*/*[2]
/*/*[2]/*
/*/*[2]/*/*
/*/*[2]/*/*/*[1]
/*/*[2]/*/*/*[2]
Thanks a lot!
Dragan
I found that besides getpath, etree also contains a "sibling"
method called getelementpath, which gives a proper result for
namespaced elements as well.
So change your code to:
for element in root.iter():
    print(tree.getelementpath(element))
For your source sample, with namespaces shortened for readability,
the initial part of the result is:
.
{http://cisco.com/ns}bundles
{http://cisco.com/ns}bundles/{http://cisco.com/ns}bundles
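If the goal is only a readable listing with local names (no namespace URIs), one option is to build the paths by hand. This is a different technique from getpath/getelementpath, sketched here with the standard library; local_name and local_paths are made-up helpers, and the sample is cut down from the question. The results are display strings, not XPath expressions you can evaluate, since they carry no namespaces and no positional indices.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the filter document from the question.
xml = """<filter xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <bundles xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-bundlemgr-oper">
    <bundles>
      <bundle>
        <data><bundle-status/><lacp-status/></data>
      </bundle>
    </bundles>
  </bundles>
</filter>"""


def local_name(tag):
    """Drop the '{namespace-uri}' part of a tag, if any."""
    return tag.rpartition("}")[2]


def local_paths(element, prefix=""):
    """Yield a slash-separated path of local names for element and its descendants."""
    path = prefix + "/" + local_name(element.tag)
    yield path
    for child in element:
        yield from local_paths(child, path)


paths = list(local_paths(ET.fromstring(xml)))
for path in paths:
    print(path)
```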

"None" result at Parsing XML with ElementTree

I've tried everything to get a XML content but all I've got is a 'None' as return. Could anybody help me?
The code I'm trying is:
import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9
parsedXML = ET.parse("C:\\Users\\denis\\Documents\\Projetos\\NFe\\Arquivos\\33180601279711000100550020001554261733208443-nfeo.xml")
for node in parsedXML.getroot():
    email = node.find('cNF')
    phone = node.find('natOp')
    street = node.find('nNF')
    print(email)
Part of the XML (the full content is larger than this) is shown below:
<?xml version="1.0" encoding="ISO-8859-1"?>
<nfeProc xmlns="http://www.portalfiscal.inf.br/nfe" versao="3.10">
<NFe xmlns="http://www.portalfiscal.inf.br/nfe">
<infNFe versao="3.10" Id="NFe33180601279711000100550020001554261733208443">
<ide>
<cUF>33</cUF>
<cNF>73320844</cNF>
<natOp>VENDA DE PRODUCAO DO ESTABELECIMENTO</natOp>
<indPag>1</indPag>
<mod>55</mod>
<serie>2</serie>
<nNF>155426</nNF>
<dhEmi>2018-06-25T16:06:33-03:00</dhEmi>
<dhSaiEnt>2018-06-25T16:06:08-03:00</dhSaiEnt>
<tpNF>1</tpNF>
<idDest>2</idDest>
<cMunFG>3304557</cMunFG>
<tpImp>2</tpImp>
<tpEmis>1</tpEmis>
<cDV>3</cDV>
<tpAmb>1</tpAmb>
<finNFe>1</finNFe>
<indFinal>1</indFinal>
<indPres>9</indPres>
<procEmi>0</procEmi>
<verProc>NeoGrid NFe 1.63.4</verProc>
</ide>
<emit>
I appreciate your help!
You are using an XML document with namespaces, so you need to provide the namespace in your call, as shown in this answer.
Here, we get
namespaces = {'n': 'http://www.portalfiscal.inf.br/nfe'}
root = parsedXML.getroot()
root.find('n:NFe', namespaces)
to return the element, while root.find('NFe') returns None.
Also note that find and findall with a bare tag name only search the direct children, not nested descendants (cf. the documentation), which means that you will have to iterate over the children, or use a path such as './/n:cNF', to reach deeper elements (see e.g. here for an example).
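A small sketch of the difference between direct-child and descendant lookups, using a trimmed copy of the invoice XML from the question parsed from a string rather than a file. The prefix 'n' is an arbitrary alias.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the question's document.
xml_text = """<nfeProc xmlns="http://www.portalfiscal.inf.br/nfe" versao="3.10">
  <NFe xmlns="http://www.portalfiscal.inf.br/nfe">
    <infNFe versao="3.10">
      <ide>
        <cNF>73320844</cNF>
        <natOp>VENDA DE PRODUCAO DO ESTABELECIMENTO</natOp>
        <nNF>155426</nNF>
      </ide>
    </infNFe>
  </NFe>
</nfeProc>"""

ns = {"n": "http://www.portalfiscal.inf.br/nfe"}
root = ET.fromstring(xml_text)

# A bare (prefixed) tag name only looks at direct children of root,
# and cNF is several levels down, so this is None.
direct = root.find("n:cNF", ns)

# './/' searches all descendants.
cnf = root.find(".//n:cNF", ns)

# An explicit step-by-step path also works; every step needs the prefix.
nat_op = root.find("n:NFe/n:infNFe/n:ide/n:natOp", ns)

print(direct, cnf.text, nat_op.text)
```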

How to deal with xmlns values while parsing an XML file?

I have the following toy example of an XML file. I have thousands of these. I have difficulty parsing this file.
Look at the text in second line. All my original files contain this text. When I delete i:type="Record" xmlns="http://schemas.datacontract.org/Storage" from second line (retaining the remaining text), I am able to get accelx and accely values using the code given below.
How can I parse this file with the original text?
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas.datacontract.org/Storage">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
Code to parse the data:
import lxml.etree as etree
tree = etree.parse(r"C:\testdel.xml")
root = tree.getroot()
val_of_interest = root.findall('./Values/SensorValue')
for sensor_val in val_of_interest:
    print(sensor_val.find('accelx').text)
    print(sensor_val.find('accely').text)
I asked a related question here: How to extract data from xml file that is deep down the tag
Thanks
The confusion was caused by the following default namespace (a namespace declared without a prefix):
xmlns="http://schemas.datacontract.org/Storage"
Note that descendant elements without a prefix implicitly inherit the default namespace from their ancestor. To reference an element in a namespace, you need to map a prefix to the namespace URI and use that prefix in your XPath:
ns = {'d': 'http://schemas.datacontract.org/Storage' }
val_of_interest = root.findall('./d:Values/d:SensorValue', ns)
for sensor_val in val_of_interest:
    print(sensor_val.find('d:accelx', ns).text)
    print(sensor_val.find('d:accely', ns).text)
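Here is a runnable sketch of the same lookup on a shortened copy of the file, parsed from a string. It uses the standard library's xml.etree.ElementTree instead of lxml (find and findall accept the same prefix map there), and collecting the values as floats in a list named readings is my own choice for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the question's file.
xml_text = """<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas.datacontract.org/Storage">
  <Values>
    <SensorValue><accelx>-3.593643144</accelx><accely>7.316485176</accely></SensorValue>
    <SensorValue><accelx>0.31103436</accelx><accely>7.70408184</accely></SensorValue>
  </Values>
</ArrayOfRecord>"""

# 'd' is an arbitrary prefix standing in for the default namespace.
ns = {"d": "http://schemas.datacontract.org/Storage"}
root = ET.fromstring(xml_text)

# Without the prefix nothing matches, because every element is namespaced.
unqualified = root.findall("./Values/SensorValue")

readings = [
    (float(sv.find("d:accelx", ns).text), float(sv.find("d:accely", ns).text))
    for sv in root.findall("./d:Values/d:SensorValue", ns)
]
print(unqualified, readings)
```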

lxml: How do I search for fields without adding a xmlns (localhost) path to each search term?

I'm trying to locate fields in a SOAP xml file using lxml (3.6.0)
...
<soap:Body>
<Request xmlns="http://localhost/">
<Test>
<field1>hello</field1>
<field2>world</field2>
</Test>
</Request>
</soap:Body>
...
In this example I'm trying to find field1 and field2.
I need to add a path to the search term, to find the field:
print (myroot.find(".//{http://localhost/}field1").tag) # prints '{http://localhost/}field1'
without it, I don't find anything
print (myroot.find("field1").tag) # find() returns None, so .tag raises AttributeError
Is there any other way to search for the field tag (here field1) without giving path info?
Full example below:
from lxml import etree
example = """<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body><Request xmlns="http://localhost/">
<Test><field1>hello</field1><field2>world</field2></Test>
</Request></soap:Body></soap:Envelope>
"""
myroot = etree.fromstring(example)
# this works
print (myroot.find(".//{http://localhost/}field1").text)
print (myroot.find(".//{http://localhost/}field2").text)
# this fails
print (myroot.find(".//field1").text)
print (myroot.find("field1").text)
Comment: The input of the SOAP request is given; I can't change any of it in real life to make things easier.
There is a way to ignore namespaces when selecting an element using XPath, but that isn't good practice. Namespaces are there for a reason. Anyway, there is a cleaner way to reference an element in a namespace, i.e. by using a namespace prefix that is mapped to the namespace URI, instead of using the actual namespace URI every time:
.....
>>> ns = {'d': 'http://localhost/'}
>>> print (myroot.find(".//d:field1", ns).text)
hello
>>> print (myroot.find(".//d:field2", ns).text)
world
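For completeness, here is one way the "ignore the namespace" approach can look, as a sketch only. It relies on the standard library's xml.etree.ElementTree, which since Python 3.8 accepts a {*} wildcard meaning "this local name in any namespace or none". Whether a given lxml version accepts the same syntax is something to check against its documentation. The sample is shortened from the question.

```python
import xml.etree.ElementTree as ET

# Shortened version of the SOAP request from the question.
example = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body><Request xmlns="http://localhost/">
<Test><field1>hello</field1><field2>world</field2></Test>
</Request></soap:Body></soap:Envelope>"""

myroot = ET.fromstring(example)

# '{*}field1' matches an element named field1 regardless of its namespace
# (requires Python 3.8 or later).
field1 = myroot.find(".//{*}field1")
field2 = myroot.find(".//{*}field2")
print(field1.text, field2.text)
```

The trade-off is the one mentioned above: if two namespaces in the same document both define field1, this picks whichever comes first.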

XML parsing specific values - Python

I've been attempting to parse a list of xml files. I'd like to print specific values such as the userName value.
<?xml version="1.0" encoding="utf-8"?>
<Drives clsid="{8FDDCC1A-0C3C-43cd-A6B4-71A6DF20DA8C}"
disabled="1">
<Drive clsid="{935D1B74-9CB8-4e3c-9914-7DD559B7A417}"
name="S:"
status="S:"
image="2"
changed="2007-07-06 20:57:37"
uid="{4DA4A7E3-F1D8-4FB1-874F-D2F7D16F7065}">
<Properties action="U"
thisDrive="NOCHANGE"
allDrives="NOCHANGE"
userName=""
cpassword=""
path="\\scratch"
label="SCRATCH"
persistent="1"
useLetter="1"
letter="S"/>
</Drive>
</Drives>
My script is working fine collecting a list of xml files etc. However the below function is to print the relevant values. I'm trying to achieve this as suggested in this post. However I'm clearly doing something incorrectly as I'm getting errors suggesting that elm object has no attribute text. Any help would be appreciated.
Current Code
from lxml import etree as ET
def read_files(files):
    for fi in files:
        doc = ET.parse(fi)
        elm = doc.find('userName')
        print(elm.text)
doc.find looks for a tag with the given name. You are looking for an attribute with the given name.
elm.text is giving you an error because doc.find doesn't find any tags, so it returns None, which has no text attribute.
Read the lxml.etree docs some more, and then try something like this:
doc = ET.parse(fi)
root = doc.getroot()
prop = root.find(".//Properties") # finds the first <Properties> tag anywhere
elm = prop.attrib['userName']
userName is an attribute, not an element. Attributes don't have text nodes attached to them at all.
for el in doc.xpath('//*[@userName]'):
    print(el.attrib['userName'])
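The same attribute lookup can be sketched with the standard library, whose findall supports the [@attribute] predicate. The sample below is an abbreviated copy of the Drives file from the question; get() with a default is used so a missing attribute does not raise.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample based on the question's file.
xml_text = """<Drives clsid="{8FDDCC1A-0C3C-43cd-A6B4-71A6DF20DA8C}" disabled="1">
  <Drive clsid="{935D1B74-9CB8-4e3c-9914-7DD559B7A417}" name="S:">
    <Properties action="U" userName="" cpassword="" label="SCRATCH" letter="S"/>
  </Drive>
</Drives>"""

root = ET.fromstring(xml_text)

# '[@userName]' keeps only elements that carry a userName attribute,
# even when its value is the empty string.
user_names = [el.get("userName") for el in root.findall(".//*[@userName]")]

# get() returns the supplied default instead of raising when the attribute is missing.
missing = root.find(".//Properties").get("noSuchAttribute", "n/a")

print(user_names, missing)
```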
You can try to take the element using the tag name and then try to take its attribute (userName is an attribute for Properties). Note that getElementsByTagName belongs to the DOM API in xml.dom.minidom, not to lxml.etree:
from xml.dom import minidom
def read_files(files):
    for fi in files:
        doc = minidom.parse(fi)
        props = doc.getElementsByTagName('Properties')
        elm = props[0].attributes['userName']
        print(elm.value)
