XML data extraction in python

XML data extraction in python - python

I have an XML file like the following:
<AreaModel>
...
<RecipePhase>
<UniqueName>PHASE1</UniqueName>
...
<NumberOfParameterTags>7</NumberOfParameterTags>
...
<DefaultRecipeParameter>
<Name>PARAM1</Name>
----
</DefaultRecipeParameter>
<DefaultRecipeParameter>
<Name>PARAM2</Name>
----
</DefaultRecipeParameter>
<DefaultRecipeParameter>
<Name>PARAM3</Name>
----
</DefaultRecipeParameter>
</RecipePhase>
<RecipePhase>
....
</RecipePhase>
</AreaModel>
I would like to read this file in sequential order and generate different list. One for the texts of UniqueName TAGs and a list of lists containing for each list the set of texts for tag Name under each RecipePhase element.
For example, I might have 10 RecipePhase elements, each one with TAG UniqueName and each one containing a different set of children with tag DefaultRecipeParameter.
How can I take into account when I enter into RecipePhase and when I go out of the element during parsing?
I am trying ElementTree but I am not able to find a solution.
cheers,
m

You can use xml python module:
See my example:
from xml.dom import minidom as dom
import urllib2
def fetchPage(url):
a = urllib2.urlopen(url)
return ''.join(a.readlines())
def extract(page):
a = dom.parseString(page)
item = a.getElementsByTagName('Rate')
for i in item:
if i.hasChildNodes() == True:
print i.getAttribute('currency')+"-"+ i.firstChild.nodeValue
if __name__=='__main__':
page = fetchPage("http://www.bnro.ro/nbrfxrates.xml")
extract(page)

I solved partially my problem with the following code:
import xml.etree.ElementTree as ET
tree = ET.parse('control_strategies.axml')
root = tree.getroot()
phases=[]
for recipephase in root.findall('./RecipePhase/UniqueName'):
phases.append(recipephase.text)
n_elem = len(phases)
param=[[] for _ in range(n_elem)]
i = 0
for recipephase in root.findall('./RecipePhase'):
for defparam in recipephase.findall('./DefaultRecipeParameter'):
for paramname in defparam.findall('./Name'):
param[i].append(paramname.text)
i = i + 1

Related

How python lxml iteration handles tag text? [duplicate]

I'd like to write a code snippet that would grab all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()) but that would miss the text in between the tags. I didn't have very much luck searching the API for a relevant function. Could you help me out?
<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>
<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"
<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Just use the node.itertext() method, as in:
''.join(node.itertext())

Does text_content() do what you need?

Try:
def stringify_children(node):
from lxml.etree import tostring
from itertools import chain
parts = ([node.text] +
list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
[node.tail])
# filter removes possible Nones in texts and tails
return ''.join(filter(None, parts))
Example:
from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)
Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

A version of albertov 's stringify-content that solves the bugs reported by hoju:
def stringify_children(node):
from lxml.etree import tostring
from itertools import chain
return ''.join(
chunk for chunk in chain(
(node.text,),
chain(*((tostring(child, with_tail=False), child.tail) for child in node.getchildren())),
(node.tail,)) if chunk)

The following snippet which uses python generators works perfectly and is very efficient.
''.join(node.itertext()).strip()

Defining stringify_children this way may be less complicated:
from lxml import etree
def stringify_children(node):
s = node.text
if s is None:
s = ''
for child in node:
s += etree.tostring(child, encoding='unicode')
return s
or in one line
return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))
Rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail part of node in this case isn't interesting since it is "behind" the end tag. Note that the encoding argument may be changed according to one's needs.
Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:
def stringify_children(node):
s = etree.tostring(node, encoding='unicode', with_tail=False)
return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]
which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

One of the simplest code snippets, that actually worked for me and as per documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text is
etree.tostring(html, method="text")
where etree is a node/tag whose complete text, you are trying to read. Behold that it doesn't get rid of script and style tags though.

import urllib2
from lxml import etree
url = 'some_url'
getting url
test = urllib2.urlopen(url)
page = test.read()
getting all html code within including table tag
tree = etree.HTML(page)
xpath selector
table = tree.xpath("xpath_here")
res = etree.tostring(table)
res is the html code of table
this was doing job for me.
so you can extract the tags content with xpath_text() and tags including their content using tostring()
div = tree.xpath("//div")
div_res = etree.tostring(div)
text = tree.xpath_text("//content")
or text = tree.xpath("//content/text()")
div_3 = tree.xpath("//content")
div_3_res = etree.tostring(div_3).strip('<content>').rstrip('</')
this last line with strip method using is not nice, but it just works

Just a quick enhancement as the answer has been given. If you want to clean the inside text:
clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()

In response to #Richard's comment above, if you patch stringify_children to read:
parts = ([node.text] +
-- list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++ list(chain(*([tostring(c)] for c in node.getchildren()))) +
[node.tail])
it seems to avoid the duplication he refers to.

I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:
def stringify_children(node):
"""Given a LXML tag, return contents as a string
>>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
>>> node = lxml.html.fragment_fromstring(html)
>>> extract_html_content(node)
"<strong>Sample sentence</strong> with tags."
"""
if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
return ""
node.attrib.clear()
opening_tag = len(node.tag) + 2
closing_tag = -(len(node.tag) + 3)
return lxml.html.tostring(node)[opening_tag:closing_tag]
Unlike some of the other answers to this question this solution preserves all of tags contained within it and attacks the problem from a different angle than the other working solutions.

Here is a working solution. We can get content with a parent tag and then cut the parent tag from output.
import re
from lxml import etree
def _tostr_with_tags(parent_element, html_entities=False):
RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
content_with_parent = etree.tostring(parent_element)
def _replace_html_entities(s):
RE_ENTITY = r'&#(\d+);'
def repl(m):
return unichr(int(m.group(1)))
replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)
return replaced
if not html_entities:
content_with_parent = _replace_html_entities(content_with_parent)
content_with_parent = content_with_parent.strip() # remove 'white' characters on margins
start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]
if start_tag != end_tag:
raise Exception('Start tag does not match to end tag while getting content with tags.')
return content_without_parent
parent_element must have Element type.
Please note, that if you want text content (not html entities in text) please leave html_entities parameter as False.

lxml have a method for that:
node.text_content()

If this is an a tag, you can try:
node.values()

import re
from lxml import etree
node = etree.fromstring("""
<content>Text before inner tag
<div>Text
<em>inside</em>
tag
</div>
Text after inner tag
</content>""")
print re.search("\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node), re.DOTALL).group(1)

Python:XML List index out of range

I'm having troubles to get some values in a xml file. The error is IndexError: list index out of range
XML
<?xml version="1.0" encoding="UTF-8"?>
<nfeProc xmlns="http://www.portalfiscal.inf.br/nfe" versao="3.10">
<NFe xmlns="http://www.portalfiscal.inf.br/nfe">
<infNFe Id="NFe35151150306471000109550010004791831003689145" versao="3.10">
<ide>
<nNF>479183</nNF>
</ide>
<emit>
<CNPJ>3213213212323</CNPJ>
</emit>
<det nItem="1">
<prod>
<cProd>7030-314</cProd>
</prod>
<imposto>
<ICMS>
<ICMS10>
<orig>1</orig>
<CST>10</CST>
<vICMS>10.35</vICMS>
<vICMSST>88.79</vICMSST>
</ICMS10>
</ICMS>
</imposto>
</det>
<det nItem="2">
<prod>
<cProd>7050-6</cProd>
</prod>
<imposto>
<ICMS>
<ICMS00>
<orig>1</orig>
<CST>00</CST>
<vICMS>7.49</vICMS>
</ICMS00>
</ICMS>
</imposto>
</det>
</infNFe>
</NFe>
</nfeProc>
I'm getting the values from XML, it's ok in some xml's, those having vICMS and vICMSST tags:
vicms = doc.getElementsByTagName('vICMS')[i].firstChild.nodeValue
vicmsst = doc.getElementsByTagName('vICMSST')[1].firstChild.nodeValue
This returns:
First returns:
print vicms
>> 10.35
print vicmsst
>> 88.79
Second imposto CRASHES because don't find vICMSST tag...
**IndexError: list index out of range**
What the best form to test it? I'm using xml.etree.ElementTree:
My code:
import os
import sys
import subprocess
import base64,xml.dom.minidom
from xml.dom.minidom import Node
import glob
import xml.etree.ElementTree as ET
origem = 0
# only loops over XML documents in folder
for file in glob.glob("*.xml"):
f = open("%s" % file,'r')
data = f.read()
i = 0
doc = xml.dom.minidom.parseString(data)
for topic in doc.getElementsByTagName('emit'):
#Get Fiscal Number
nnf= doc.getElementsByTagName('nNF')[i].firstChild.nodeValue
print 'Fiscal Number %s' % nnf
print '\n'
for prod in doc.getElementsByTagName('det'):
vicms = 0
vicmsst = 0
#Get value of ICMS
vicms = doc.getElementsByTagName('vICMS')[i].firstChild.nodeValue
#Get value of VICMSST
vicmsst = doc.getElementsByTagName('vICMSST')[i].firstChild.nodeValue
#PRINT INFO
print 'ICMS %s' % vicms
print 'Valor do ICMSST: %s' % vicmsst
print '\n\n'
i +=1
print '\n\n'

There is only one vICMSST tag in your XML document. So, when i=1, the following line returns an IndexError.
vicmsst = doc.getElementsByTagName('vICMSST')[1].firstChild.nodeValue
You can restructure this to:
try:
vicmsst = doc.getElementsByTagName('vICMSST')[i].firstChild.nodeValue
except IndexError:
# set a default value or deal with this how you like
It's hard to say what you should do upon an exception without knowing more about what you're trying to do.

You are making several general mistakes in your code.
Don't use counters to index into lists you don't know the length of. Normally, iteration with for .. in is a lot better than using indexes anyway.
You have many imports you don't seem to use, get rid of them.
You can use minidom, but ElementTree is better for your task because it supports searching for nodes with XPath and it supports XML namespaces.
Don't read an XML file as a string and then use parseString. Let the XML parser handle the file directly. This way all file encoding related issues will be handled without errors.
The following is a lot better than your original approach.
import glob
import xml.etree.ElementTree as ET
def get_text(context_elem, xpath, xmlns=None):
""" helper function that gets the text value of a node """
node = context_elem.find(xpath, xmlns)
if (node != None):
return node.text
else:
return ""
# set up XML namespace URIs
xmlns = {
"nfe": "http://www.portalfiscal.inf.br/nfe"
}
for path in glob.glob("*.xml"):
doc = ET.parse(path)
for infNFe in doc.iterfind('.//nfe:infNFe', xmlns):
print 'Fiscal Number\t%s' % get_text(infNFe, ".//nfe:nNF", xmlns)
for det in infNFe.iterfind(".//nfe:det", xmlns):
print ' ICMS\t%s' % get_text(det, ".//nfe:vICMS", xmlns)
print ' Valor do ICMSST:\t%s' % get_text(det, ".//nfe:vICMSST", xmlns)
print '\n\n'

Python xml parsing etree find element X by postion

I'm trying to parse the following xml to pull out certain data then eventually edit the data as needed.
Here is the xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CHECKLIST>
<VULN>
<STIG_DATA>
<VULN_ATTRIBUTE>Vuln_Num</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>V-38438</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Rule_Title</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>More text.</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Vuln_Discuss</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>Some text here</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>IA_Controls</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA></ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Rule_Ver</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>Gen000000</ATTRIBUTE_DATA>
</STIG_DATA>
<STATUS>NotAFinding</STATUS>
<FINDING_DETAILS></FINDING_DETAILS>
<COMMENTS></COMMENTS>
<SEVERITY_OVERRIDE></SEVERITY_OVERRIDE>
<SEVERITY_JUSTIFICATION></SEVERITY_JUSTIFICATION>
</VULN>
The data that I'm looking to pull from this is the STATUS, COMMENTS and the ATTRIBUTE_DATA directly following VULN_ATTRIBUTE that matches == Rule_Ver. So in this example.
I should get the following:
Gen000000 NotAFinding None
What I have so far is that I can get the Status and Comments easy, but can't figure out the ATTRIBUTE_DATA portion. I can find the first one (Vuln_Num), then I tried to add a index but that gives a "list index out of range" error.
This is where I'm at now.
import xml.etree.ElementTree as ET
doc = ET.parse('test.ckl')
root=doc.getroot()
TagList = doc.findall("./VULN")
for curTag in TagList:
StatusTag = curTag.find("STATUS")
CommentTag = curTag.find("COMMENTS")
DataTag = curTag.find("./STIG_DATA/ATTRIBUTE_DATA")
print "GEN:[%s] Status:[%s] Comments: %s" %( DataTag.text, StatusTag.text, CommentTag.text)
This gives the following output:
GEN:[V-38438] Status:[NotAFinding] Comments: None
I want:
GEN:[Gen000000] Status:[NotAFinding] Comments: None
So the end goal is to be able to parse hundreds of these and edit the comments field as needed. I don't think the editing part will be that hard once I get the right element.
Logically I see two ways of doing this. Either go to the ATTRIBUTE_DATA[5] and grab the text or find VULN_ATTRIBUTE == Rule_Ver then grab the next ATTRIBUTE_DATA.
I have tried doing this:
DataTag = curTag.find(".//STIG_DATA//ATTRIBUTE_DATA")[5]
andDataTag[5].text`
and both give meIndexError: list index out of range
I saw lxml had get_element_by_id and xpath, but I can't add modules to this system so it is etree for me.
Thanks in advance.

One can find an element by position, but you've used the incorrect XPath syntax. Either of the following lines should work:
DataTag = curTag.find("./STIG_DATA[5]/ATTRIBUTE_DATA") # Note: 5, not 4
DataTag = curTag.findall("./STIG_DATA/ATTRIBUTE_DATA")[4] # Note: 4, not 5
However, I strongly recommend against using that. There is no guarantee that the Rule_Ver instance of STIG_DATA is always the fifth item.
If you could change to lxml, then this works:
DataTag = curTag.xpath(
'./STIG_DATA/VULN_ATTRIBUTE[text()="Rule_Ver"]/../ATTRIBUTE_DATA')[0]
Since you can't use lxml, you must iterate the STIG_DATA elements by hand, like so:
def GetData(curTag):
for stig in curTag.findall('STIG_DATA'):
if stig.find('VULN_ATTRIBUTE').text == 'Rule_Ver':
return stig.find('ATTRIBUTE_DATA')
Here is a complete program with error checking added to GetData():
import xml.etree.ElementTree as ET
doc = ET.parse('test.ckl')
root=doc.getroot()
TagList = doc.findall("./VULN")
def GetData(curTag):
for stig in curTag.findall('STIG_DATA'):
vuln = stig.find('VULN_ATTRIBUTE')
if vuln is not None and vuln.text == 'Rule_Ver':
data = stig.find('ATTRIBUTE_DATA')
return data
for curTag in TagList:
StatusTag = curTag.find("STATUS")
CommentTag = curTag.find("COMMENTS")
DataTag = GetData(curTag)
print "GEN:[%s] Status:[%s] Comments: %s" %( DataTag.text, StatusTag.text, CommentTag.text)
References:
https://stackoverflow.com/a/10836343/8747
http://lxml.de/xpathxslt.html#xpath

Reg adding data to an existing XML in Python

I have to parse an xml file & modify the data in a particular tag using Python. I'm using Element Tree to do this. I'm able to parse & reach the required tag. But I'm not able to modify the value. I'm not sure if Element Tree is okay or if I should use TreeBuilder for this.
As you can see below I just want to replace the Not Executed under Verdict with a string value.
-<Procedure>
<PreCondition>PRECONDITION: - ECU in extended diagnostic session (zz = 0x03) </PreCondition>
<PostCondition/>
<ProcedureID>428495</ProcedureID>
<SequenceNumber>2</SequenceNumber>
<CID>-1</CID>
<**Verdict** Writable="true">NotExecuted</Verdict>
</Procedure>
import xml.etree.ElementTree as etree
X_tree = etree.parse('DIAGNOSTIC SERVER.xml')
X_root = X_tree.getroot()
ATC_Name = X_root.iterfind('TestOrder//TestOrder//TestSuite//')
try:
while(1):
temp = ATC_Name.next()
if temp.tag == 'ProcedureID' and temp.text == str(TestCase_Id[j].text).split('-')[1]:
ATC_Name.next()
ATC_Name.next()
ATC_Name.next().text = 'Pass' <--This is what I want to do
ATC_Name.close()
break
except:
print sys.exc_info()
I believe my approach is wrong. Kindly guide me with right pointers.
Thanks.

You'd better switch to lxml so that you can use the "unlimited" power of xpath.
The idea is to use the following xpath expression:
//Procedure[ProcedureID/text()="%d"]/Verdict
where %d placeholder is substituted with the appropriate procedure id via string formatting operation.
The xpath expression finds the appropriate Verdict tag which you can set text on:
from lxml import etree
data = """<Procedure>
<PreCondition>PRECONDITION: - ECU in extended diagnostic session (zz = 0x03) </PreCondition>
<PostCondition/>
<ProcedureID>428495</ProcedureID>
<SequenceNumber>2</SequenceNumber>
<CID>-1</CID>
<Verdict Writable="true">NotExecuted</Verdict>
</Procedure>"""
ID = 428495
tree = etree.fromstring(data)
verdict = tree.xpath('//Procedure[ProcedureID/text()="%d"]/Verdict' % ID)[0]
verdict.text = 'test'
print etree.tostring(tree)
prints:
<Procedure>
<PreCondition>PRECONDITION: - ECU in extended diagnostic session (zz = 0x03) </PreCondition>
<PostCondition/>
<ProcedureID>428495</ProcedureID>
<SequenceNumber>2</SequenceNumber>
<CID>-1</CID>
<Verdict Writable="true">test</Verdict>
</Procedure>

Here is a solution using ElementTree. See Modifying an XML File
import xml.etree.ElementTree as et
tree = et.parse('prison.xml')
root = tree.getroot()
print root.find('Verdict').text #before update
root.find('Verdict').text = 'Executed'
tree.write('prison.xml')

try this
import xml.etree.ElementTree as et
root=et.parse(xmldata).getroot()
s=root.find('Verdict')
s.text='Your string'

LXML Xpath does not seem to return full path

OK I'll be the first to admit its is, just not the path I want and I don't know how to get it.
I'm using Python 3.3 in Eclipse with Pydev plugin in both Windows 7 at work and ubuntu 13.04 at home. I'm new to python and have limited programming experience.
I'm trying to write a script to take in an XML Lloyds market insurance message, find all the tags and dump them in a .csv where we can easily update them and then reimport them to create an updated xml.
I have managed to do all of that except when I get all the tags it only gives the tag name and not the tags above it.
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
That is a fragment of the XML. What I want is to find all the tags and their path. For example for I want to show it as ItemsInGroupTotal/Count but can only get it as Count.
Here is my code:
xml = etree.parse(fullpath)
print( xml.xpath('.//*'))
all_xpath = xml.xpath('.//*')
every_tag = []
for i in all_xpath:
single_tag = '%s,%s' % (i.tag, i.text)
every_tag.append(single_tag)
print(every_tag)
This gives:
'{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupReference,8-2012-08-10', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupItemsTotal,\n', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}Count,13',
As you can see Count is shown as {namespace}Count, 13 and not {namespace}ItemsInGroupTotal/Count, 13
Can anyone point me towards what I need?
Thanks (hope my first post is OK)
Adam
EDIT:
This is my code now:
with open(fullpath, 'rb') as xmlFilepath:
xmlfile = xmlFilepath.read()
fulltext = '%s' % xmlfile
text = fulltext[2:]
print(text)
xml = etree.fromstring(fulltext)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
print(every_tag)
But this returns an error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I remove the first two chars as thy are b' and it complained it didn't start with a tag
Update:
I have been playing around with this and if I remove the xis: xxx tags and the namespace stuff at the top it works as expected. I need to keep the xis tags and be able to identify them as xis tags so can't just delete them.
Any help on how I can achieve this?

ElementTree objects have a method getpath(element), which returns a
structural, absolute XPath expression to find that element
Calling getpath on each element in a iter() loop should work for you:
from pprint import pprint
from lxml import etree
text = """
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
</TechAccount>
"""
xml = etree.fromstring(text)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
prints:
['/TechAccount, \n',
'/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
'/TechAccount/BrokerReference, HOY123/456',
'/TechAccount/ServiceProviderReference, 2012080921401A1',
'/TechAccount/CreationDate, 2012-08-10',
'/TechAccount/AccountTransactionType, premium',
'/TechAccount/GroupReference, 2012080921401A1',
'/TechAccount/ItemsInGroupTotal, \n',
'/TechAccount/ItemsInGroupTotal/Count, 1',
'/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
'/TechAccount/ServiceProviderGroupItemsTotal, \n',
'/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
UPD:
If your xml data is in the file test.xml, the code would look like:
from pprint import pprint
from lxml import etree
xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
Hope that helps.

getpath() does indeed return an xpath that's not suited for human consumption. From this xpath, you can build up a more useful one though. Such as with this quick-and-dirty approach:
def human_xpath(element):
full_xpath = element.getroottree().getpath(element)
xpath = ''
human_xpath = ''
for i, node in enumerate(full_xpath.split('/')[1:]):
xpath += '/' + node
element = element.xpath(xpath)[0]
namespace, tag = element.tag[1:].split('}', 1)
if element.getparent() is not None:
nsmap = {'ns': namespace}
same_name = element.getparent().xpath('./ns:' + tag,
namespaces=nsmap)
if len(same_name) > 1:
tag += '[{}]'.format(same_name.index(element) + 1)
human_xpath += '/' + tag
return human_xpath

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

XML data extraction in python - python

Related

How python lxml iteration handles tag text? [duplicate]

Python:XML List index out of range

Python xml parsing etree find element X by postion

Reg adding data to an existing XML in Python

LXML Xpath does not seem to return full path

Categories

Resources