How to parse XML using python - python

I am trying to parse an xml using python for create a result summary file. Below is my code and a snippet of xml, Like the below i have couple of sections with <test> and </test>
<test name="tst_case1">
<prolog time="2013-01-18T14:41:09+05:30"/>
<verification name="VP5" file="D:/Squish/HMI_testing/tst_case1/test.py" type="properties" line="6">
<result time="2013-01-18T14:41:10+05:30" type="PASS">
<description>VP5: Object propertycomparisonof ':_QMenu_3.enabled'passed</description> <description type="DETAILED">'false' and 'false' are equal</description>
<description type="object">:_QMenu_3</description>
<description type="property">enabled</description>
<description type="failedValue">false</description>
</result>
</verification>
<epilog time="2013-01-18T14:41:11+05:30"/>
</test>
What I want to get is,
in one <test> section how many PASS / FAIL is there.
With the below code its printing the total pass/Fail in the xml file.But i am interested in each section how many PASS/FAIL. can any boy tell me the procedure to fetchout this ?
import sys
import xml.dom.minidom as XY
file = open("result.txt", "w")
tree = XY.parse('D:\\Squish\\squish results\\Results-On-2013-01-18_0241 PM.xml')
Test_name = tree.getElementsByTagName('test')
Test_status = tree.getElementsByTagName('result')
count_testname =0
passcount = 0
failcount = 0
Test_name_array = []
for my_Test_name in Test_name:
count_testname = count_testname+1
passcount = 0
failcount = 0
my_Test_name_final = my_Test_name.getAttribute('name')
Test_name_array = my_Test_name_final
if(count_testname > 1):
print(my_Test_name_final)
for my_Test_status in Test_status:
my_Test_status_final = my_Test_status.getAttribute('type')
if(my_Test_status_final == 'PASS'):
passcount = passcount+1
if(my_Test_status_final == 'FAIL'):
failcount = failcount+1
print(str(my_Test_status_final))

I'd not use minidom for this task; the DOM API is very cumbersome, verbose, and not suited for searching and matching.
The Python library also includes the xml.etree.ElementTree API, I'd use that instead:
from xml.etree import ElementTree as ET
tree = ET.parse(r'D:\Squish\squish results\Results-On-2013-01-18_0241 PM.xml')
tests = dict()
# Find all <test> elements with a <verification> child:
for test in tree.findall('.//test[verification]'):
passed = len(test.findall(".//result[#type='PASS']"))
failed = len(test.findall(".//result[#type='FAIL']"))
tests[test.attrib['name']] = {'pass': passed, 'fail': failed}
The above piece of code counts the number of passed and failed tests per <test> element and stores them in a dictionary, keyed to the name attribute of the <test> element.
I've tested the above code with Python 3.2 and the full XML document from another question you posted, which results in:
{'tst_Setup_menu_2': {'fail': 0, 'pass': 8}}

Thanks for the posting. i got it working using minidon.
still wish to see how can be solved using xml.etree.ElementTree
import sys
import xml.dom.minidom as XY
file = open("Result_Summary.txt", "w")
#tree = XY.parse('D:\\Squish\\squish results\\Results-On-2013-01-18_0241 PM.xml')
#print (str(sys.argv[1]))
tree = XY.parse(sys.argv[1])
Test_name = tree.getElementsByTagName('test')
count_testname =0
file.write('Test Name \t\t\t No:PASS\t\t\t No:FAIL\t \n\n')
for my_Test_name in Test_name:
count_testname = count_testname+1
my_Test_name_final = my_Test_name.getAttribute('name')
if(count_testname > 1):
#print(my_Test_name_final)
file.write(my_Test_name_final)
file.write('\t\t\t\t')
my_Test_status = my_Test_name.getElementsByTagName('result')
passcount = 0
failcount = 0
for my_Test_status_1 in my_Test_status:
my_Test_status_final = my_Test_status_1.getAttribute('type')
if(my_Test_status_final == 'PASS'):
passcount = passcount+1
if(my_Test_status_final == 'FAIL'):
failcount = failcount+1
#print(str(my_Test_status_final))
file.write(str(passcount))
#print(passcount)
file.write('\t\t\t\t')
file.write(str(failcount))
# print(failcount)
file.write('\n')
#print ('loop count: %d' %count_testname)
#print('PASS count: %s' %passcount)
#print('FAIL count: %s' %failcount)
file.close()

Although not a standard module but well worth the effort of installing is lxml especially if you want to do fast Xml parsing etc IMHO.
Without a full example of your results I guessed at what they would look like.
from lxml import etree
tree = etree.parse("results.xml")
count_result_type = etree.XPath("count(.//result[#type = $name])")
for test in tree.xpath("//test"):
print test.attrib['name']
print "\t# FAILS ", count_result_type(test, name="FAIL")
print "\t# PASSES", count_result_type(test, name="PASS")
I generated the following running against my guess of your xml, which should give you an idea of what is happening.
tst_case1
# FAILS 1.0
# PASSES 1.0
tst_case0
# FAILS 0.0
# PASSES 1.0
tst_case2
# FAILS 0.0
# PASSES 1.0
tst_case3
# FAILS 0.0
# PASSES 1.0
What I like about lxml is how expressive it can be, YMMV.

I see you are using Squish. You should check your squish folder under \examples\regressiontesting. There you can find a file called xml2result2html.py. Here you can find an example of converting squish test results into html.

Related

How to parse an XML file to a list?

I am trying to parse an XML file to a list with Python. I have looked at some solutions on this site and others and could not make them work for me. I have managed to do it but in a laborious way that seems stupid to me. It seems that there should be an easier way.
I have tried to adapt other peoples code to suit my needs but that is not working as I am not always sure of what I am reading.
This is the XML file:
<?xml version="1.0"?>
<configuration>
<location name ="location">
<latitude>54.637348</latitude>
<latHemi>N</latHemi>
<longitude>5.829723</longitude>
<longHemi>W</longHemi>
</location>
<microphone name="microphone">
<sensitivity>-26.00</sensitivity>
</microphone>
<weighting name="weighting">
<cWeight>68</cWeight>
<aWeight>2011</aWeight>
</weighting>
<optionalLevels name="optionalLevels">
<L95>95</L95>
<L90>90</L90>
<L50>50</L50>
<L10>10</L10>
<L05>05</L05>
<fmax>fmax</fmax>
</optionalLevels>
<averagingPeriod name="averagingPeriod">
<onemin>1</onemin>
<fivemin>5</fivemin>
<tenmin>10</tenmin>
<fifteenmin>15</fifteenmin>
<thirtymin>30</thirtymin>
</averagingPeriod>
<timeWeighting name="timeWeighting">
<fast>fast</fast>
<slow>slow</slow>
</timeWeighting>
<rebootTime name="rebootTime">
<midnight>midnight</midnight>
<sevenAm>7am</sevenAm>
<sevenPm>7pm</sevenPm>
<elevenPm>23pm</elevenPm>
</rebootTime>
<remoteUpload name="remoteUpload">
<nointernet>nointernet</nointernet>
<vodafone>vodafone</vodafone>
</remoteUpload>
</configuration>
And this is the Python program.
#!/usr/bin/python
import xml.etree.ElementTree as ET
import os
try:
import cElementTree as ET
except ImportError:
try:
import xml.etree.cElementTree as ET
except ImportError:
exit_err("Failed to import cElementTree from any known place")
file_name = ('/home/mark/Desktop/Practice/config_settings.xml')
full_file = os.path.abspath(os.path.join('data', file_name))
dom = ET.parse(full_file)
tree = ET.parse(full_file)
root = tree.getroot()
location_settings = dom.findall('location')
mic_settings = dom.findall('microphone')
weighting = dom.findall('weighting')
olevels = dom.findall('optionalLevels')
avg_period = dom.findall('averagingPeriod')
time_weight = dom.findall('timeWeighting')
reboot = dom.findall('rebootTime')
remote_upload = dom.findall('remoteUpload')
for i in location_settings:
latitude = i.find('latitude').text
latHemi = i.find('latHemi').text
longitude = i.find('longitude').text
longHemi = i.find('longHemi').text
for i in mic_settings:
sensitivity = i.find('sensitivity').text
for i in weighting:
cWeight = i.find('cWeight').text
aWeight = i.find('aWeight').text
for i in olevels:
L95 = i.find('L95').text
L90 = i.find('L90').text
L50 = i.find('L50').text
L10 = i.find('L10').text
L05 = i.find('L05').text
for i in avg_period:
onemin = i.find('onemin').text
fivemin = i.find('fivemin').text
tenmin = i.find('tenmin').text
fifteenmin = i.find('fifteenmin').text
thirtymin = i.find('thirtymin').text
for i in time_weight:
fast = i.find('fast').text
slow = i.find('slow').text
for i in reboot:
midnight = i.find('midnight').text
sevenAm = i.find('sevenAm').text
sevenPm = i.find('sevenPm').text
elevenPm= i.find('elevenPm').text
for i in remote_upload:
nointernet = i.find('nointernet').text
vodafone = i.find('vodafone').text
config_list = [latitude,latHemi,longitude,longHemi,sensitivity,aWeight,cWeight,
L95,L90,L50,L10,L05,onemin,fivemin,tenmin,fifteenmin,thirtymin,
fast,slow,midnight,sevenAm,sevenAm,elevenPm,nointernet,vodafone]
print(config_list)
The problem you're posing isn't very well defined. The XML structure doesn't conform very well to a list structure to begin with. If you're new to python, I think the best way to go about what you're trying to do is to use something like xmltodict which will parse the implicit schema in your xml to python data structures.
e.g.
import xmltodict
xml = """<?xml version="1.0"?>
<configuration>
<location name ="location">
<latitude>54.637348</latitude>
<latHemi>N</latHemi>
<longitude>5.829723</longitude>
<longHemi>W</longHemi>
</location>
<microphone name="microphone">
<sensitivity>-26.00</sensitivity>
</microphone>
<weighting name="weighting">
<cWeight>68</cWeight>
<aWeight>2011</aWeight>
</weighting>
<optionalLevels name="optionalLevels">
<L95>95</L95>
<L90>90</L90>
<L50>50</L50>
<L10>10</L10>
<L05>05</L05>
<fmax>fmax</fmax>
</optionalLevels>
<averagingPeriod name="averagingPeriod">
<onemin>1</onemin>
<fivemin>5</fivemin>
<tenmin>10</tenmin>
<fifteenmin>15</fifteenmin>
<thirtymin>30</thirtymin>
</averagingPeriod>
<timeWeighting name="timeWeighting">
<fast>fast</fast>
<slow>slow</slow>
</timeWeighting>
<rebootTime name="rebootTime">
<midnight>midnight</midnight>
<sevenAm>7am</sevenAm>
<sevenPm>7pm</sevenPm>
<elevenPm>23pm</elevenPm>
</rebootTime>
<remoteUpload name="remoteUpload">
<nointernet>nointernet</nointernet>
<vodafone>vodafone</vodafone>
</remoteUpload>
</configuration>"""
d = xmltodict.parse(xml)
Thanks for the comments. Sorry if the question was not well posed. I have found an answer myself. I was looking to parse the XML child elements into a list for later use in another program. I figured it out. Thank you for your patience.

Python:XML List index out of range

I'm having troubles to get some values in a xml file. The error is IndexError: list index out of range
XML
<?xml version="1.0" encoding="UTF-8"?>
<nfeProc xmlns="http://www.portalfiscal.inf.br/nfe" versao="3.10">
<NFe xmlns="http://www.portalfiscal.inf.br/nfe">
<infNFe Id="NFe35151150306471000109550010004791831003689145" versao="3.10">
<ide>
<nNF>479183</nNF>
</ide>
<emit>
<CNPJ>3213213212323</CNPJ>
</emit>
<det nItem="1">
<prod>
<cProd>7030-314</cProd>
</prod>
<imposto>
<ICMS>
<ICMS10>
<orig>1</orig>
<CST>10</CST>
<vICMS>10.35</vICMS>
<vICMSST>88.79</vICMSST>
</ICMS10>
</ICMS>
</imposto>
</det>
<det nItem="2">
<prod>
<cProd>7050-6</cProd>
</prod>
<imposto>
<ICMS>
<ICMS00>
<orig>1</orig>
<CST>00</CST>
<vICMS>7.49</vICMS>
</ICMS00>
</ICMS>
</imposto>
</det>
</infNFe>
</NFe>
</nfeProc>
I'm getting the values from XML, it's ok in some xml's, those having vICMS and vICMSST tags:
vicms = doc.getElementsByTagName('vICMS')[i].firstChild.nodeValue
vicmsst = doc.getElementsByTagName('vICMSST')[1].firstChild.nodeValue
This returns:
First returns:
print vicms
>> 10.35
print vicmsst
>> 88.79
Second imposto CRASHES because don't find vICMSST tag...
**IndexError: list index out of range**
What the best form to test it? I'm using xml.etree.ElementTree:
My code:
import os
import sys
import subprocess
import base64,xml.dom.minidom
from xml.dom.minidom import Node
import glob
import xml.etree.ElementTree as ET
origem = 0
# only loops over XML documents in folder
for file in glob.glob("*.xml"):
f = open("%s" % file,'r')
data = f.read()
i = 0
doc = xml.dom.minidom.parseString(data)
for topic in doc.getElementsByTagName('emit'):
#Get Fiscal Number
nnf= doc.getElementsByTagName('nNF')[i].firstChild.nodeValue
print 'Fiscal Number %s' % nnf
print '\n'
for prod in doc.getElementsByTagName('det'):
vicms = 0
vicmsst = 0
#Get value of ICMS
vicms = doc.getElementsByTagName('vICMS')[i].firstChild.nodeValue
#Get value of VICMSST
vicmsst = doc.getElementsByTagName('vICMSST')[i].firstChild.nodeValue
#PRINT INFO
print 'ICMS %s' % vicms
print 'Valor do ICMSST: %s' % vicmsst
print '\n\n'
i +=1
print '\n\n'
There is only one vICMSST tag in your XML document. So, when i=1, the following line returns an IndexError.
vicmsst = doc.getElementsByTagName('vICMSST')[1].firstChild.nodeValue
You can restructure this to:
try:
vicmsst = doc.getElementsByTagName('vICMSST')[i].firstChild.nodeValue
except IndexError:
# set a default value or deal with this how you like
It's hard to say what you should do upon an exception without knowing more about what you're trying to do.
You are making several general mistakes in your code.
Don't use counters to index into lists you don't know the length of. Normally, iteration with for .. in is a lot better than using indexes anyway.
You have many imports you don't seem to use, get rid of them.
You can use minidom, but ElementTree is better for your task because it supports searching for nodes with XPath and it supports XML namespaces.
Don't read an XML file as a string and then use parseString. Let the XML parser handle the file directly. This way all file encoding related issues will be handled without errors.
The following is a lot better than your original approach.
import glob
import xml.etree.ElementTree as ET
def get_text(context_elem, xpath, xmlns=None):
""" helper function that gets the text value of a node """
node = context_elem.find(xpath, xmlns)
if (node != None):
return node.text
else:
return ""
# set up XML namespace URIs
xmlns = {
"nfe": "http://www.portalfiscal.inf.br/nfe"
}
for path in glob.glob("*.xml"):
doc = ET.parse(path)
for infNFe in doc.iterfind('.//nfe:infNFe', xmlns):
print 'Fiscal Number\t%s' % get_text(infNFe, ".//nfe:nNF", xmlns)
for det in infNFe.iterfind(".//nfe:det", xmlns):
print ' ICMS\t%s' % get_text(det, ".//nfe:vICMS", xmlns)
print ' Valor do ICMSST:\t%s' % get_text(det, ".//nfe:vICMSST", xmlns)
print '\n\n'

Python xml parsing etree find element X by postion

I'm trying to parse the following xml to pull out certain data then eventually edit the data as needed.
Here is the xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CHECKLIST>
<VULN>
<STIG_DATA>
<VULN_ATTRIBUTE>Vuln_Num</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>V-38438</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Rule_Title</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>More text.</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Vuln_Discuss</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>Some text here</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>IA_Controls</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA></ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Rule_Ver</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>Gen000000</ATTRIBUTE_DATA>
</STIG_DATA>
<STATUS>NotAFinding</STATUS>
<FINDING_DETAILS></FINDING_DETAILS>
<COMMENTS></COMMENTS>
<SEVERITY_OVERRIDE></SEVERITY_OVERRIDE>
<SEVERITY_JUSTIFICATION></SEVERITY_JUSTIFICATION>
</VULN>
The data that I'm looking to pull from this is the STATUS, COMMENTS and the ATTRIBUTE_DATA directly following VULN_ATTRIBUTE that matches == Rule_Ver. So in this example.
I should get the following:
Gen000000 NotAFinding None
What I have so far is that I can get the Status and Comments easy, but can't figure out the ATTRIBUTE_DATA portion. I can find the first one (Vuln_Num), then I tried to add a index but that gives a "list index out of range" error.
This is where I'm at now.
import xml.etree.ElementTree as ET
doc = ET.parse('test.ckl')
root=doc.getroot()
TagList = doc.findall("./VULN")
for curTag in TagList:
StatusTag = curTag.find("STATUS")
CommentTag = curTag.find("COMMENTS")
DataTag = curTag.find("./STIG_DATA/ATTRIBUTE_DATA")
print "GEN:[%s] Status:[%s] Comments: %s" %( DataTag.text, StatusTag.text, CommentTag.text)
This gives the following output:
GEN:[V-38438] Status:[NotAFinding] Comments: None
I want:
GEN:[Gen000000] Status:[NotAFinding] Comments: None
So the end goal is to be able to parse hundreds of these and edit the comments field as needed. I don't think the editing part will be that hard once I get the right element.
Logically I see two ways of doing this. Either go to the ATTRIBUTE_DATA[5] and grab the text or find VULN_ATTRIBUTE == Rule_Ver then grab the next ATTRIBUTE_DATA.
I have tried doing this:
DataTag = curTag.find(".//STIG_DATA//ATTRIBUTE_DATA")[5]
andDataTag[5].text`
and both give meIndexError: list index out of range
I saw lxml had get_element_by_id and xpath, but I can't add modules to this system so it is etree for me.
Thanks in advance.
One can find an element by position, but you've used the incorrect XPath syntax. Either of the following lines should work:
DataTag = curTag.find("./STIG_DATA[5]/ATTRIBUTE_DATA") # Note: 5, not 4
DataTag = curTag.findall("./STIG_DATA/ATTRIBUTE_DATA")[4] # Note: 4, not 5
However, I strongly recommend against using that. There is no guarantee that the Rule_Ver instance of STIG_DATA is always the fifth item.
If you could change to lxml, then this works:
DataTag = curTag.xpath(
'./STIG_DATA/VULN_ATTRIBUTE[text()="Rule_Ver"]/../ATTRIBUTE_DATA')[0]
Since you can't use lxml, you must iterate the STIG_DATA elements by hand, like so:
def GetData(curTag):
for stig in curTag.findall('STIG_DATA'):
if stig.find('VULN_ATTRIBUTE').text == 'Rule_Ver':
return stig.find('ATTRIBUTE_DATA')
Here is a complete program with error checking added to GetData():
import xml.etree.ElementTree as ET
doc = ET.parse('test.ckl')
root=doc.getroot()
TagList = doc.findall("./VULN")
def GetData(curTag):
for stig in curTag.findall('STIG_DATA'):
vuln = stig.find('VULN_ATTRIBUTE')
if vuln is not None and vuln.text == 'Rule_Ver':
data = stig.find('ATTRIBUTE_DATA')
return data
for curTag in TagList:
StatusTag = curTag.find("STATUS")
CommentTag = curTag.find("COMMENTS")
DataTag = GetData(curTag)
print "GEN:[%s] Status:[%s] Comments: %s" %( DataTag.text, StatusTag.text, CommentTag.text)
References:
https://stackoverflow.com/a/10836343/8747
http://lxml.de/xpathxslt.html#xpath

LXML Xpath does not seem to return full path

OK I'll be the first to admit its is, just not the path I want and I don't know how to get it.
I'm using Python 3.3 in Eclipse with Pydev plugin in both Windows 7 at work and ubuntu 13.04 at home. I'm new to python and have limited programming experience.
I'm trying to write a script to take in an XML Lloyds market insurance message, find all the tags and dump them in a .csv where we can easily update them and then reimport them to create an updated xml.
I have managed to do all of that except when I get all the tags it only gives the tag name and not the tags above it.
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
That is a fragment of the XML. What I want is to find all the tags and their path. For example for I want to show it as ItemsInGroupTotal/Count but can only get it as Count.
Here is my code:
xml = etree.parse(fullpath)
print( xml.xpath('.//*'))
all_xpath = xml.xpath('.//*')
every_tag = []
for i in all_xpath:
single_tag = '%s,%s' % (i.tag, i.text)
every_tag.append(single_tag)
print(every_tag)
This gives:
'{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupReference,8-2012-08-10', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupItemsTotal,\n', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}Count,13',
As you can see Count is shown as {namespace}Count, 13 and not {namespace}ItemsInGroupTotal/Count, 13
Can anyone point me towards what I need?
Thanks (hope my first post is OK)
Adam
EDIT:
This is my code now:
with open(fullpath, 'rb') as xmlFilepath:
xmlfile = xmlFilepath.read()
fulltext = '%s' % xmlfile
text = fulltext[2:]
print(text)
xml = etree.fromstring(fulltext)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
print(every_tag)
But this returns an error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I remove the first two chars as thy are b' and it complained it didn't start with a tag
Update:
I have been playing around with this and if I remove the xis: xxx tags and the namespace stuff at the top it works as expected. I need to keep the xis tags and be able to identify them as xis tags so can't just delete them.
Any help on how I can achieve this?
ElementTree objects have a method getpath(element), which returns a
structural, absolute XPath expression to find that element
Calling getpath on each element in a iter() loop should work for you:
from pprint import pprint
from lxml import etree
text = """
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
</TechAccount>
"""
xml = etree.fromstring(text)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
prints:
['/TechAccount, \n',
'/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
'/TechAccount/BrokerReference, HOY123/456',
'/TechAccount/ServiceProviderReference, 2012080921401A1',
'/TechAccount/CreationDate, 2012-08-10',
'/TechAccount/AccountTransactionType, premium',
'/TechAccount/GroupReference, 2012080921401A1',
'/TechAccount/ItemsInGroupTotal, \n',
'/TechAccount/ItemsInGroupTotal/Count, 1',
'/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
'/TechAccount/ServiceProviderGroupItemsTotal, \n',
'/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
UPD:
If your xml data is in the file test.xml, the code would look like:
from pprint import pprint
from lxml import etree
xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
Hope that helps.
getpath() does indeed return an xpath that's not suited for human consumption. From this xpath, you can build up a more useful one though. Such as with this quick-and-dirty approach:
def human_xpath(element):
full_xpath = element.getroottree().getpath(element)
xpath = ''
human_xpath = ''
for i, node in enumerate(full_xpath.split('/')[1:]):
xpath += '/' + node
element = element.xpath(xpath)[0]
namespace, tag = element.tag[1:].split('}', 1)
if element.getparent() is not None:
nsmap = {'ns': namespace}
same_name = element.getparent().xpath('./ns:' + tag,
namespaces=nsmap)
if len(same_name) > 1:
tag += '[{}]'.format(same_name.index(element) + 1)
human_xpath += '/' + tag
return human_xpath

How to replace node values in XML with Python

I am new to Python. Now I have to replace a number of values in an XML file with Python. The example snippet of XML is:
<gmd:extent>
<gmd:EX_Extent>
<gmd:description gco:nilReason="missing">
<gco:CharacterString />
</gmd:description>
<gmd:geographicElement>
<gmd:EX_GeographicBoundingBox>
<gmd:westBoundLongitude>
<gco:Decimal>112.907</gco:Decimal>
</gmd:westBoundLongitude>
<gmd:eastBoundLongitude>
<gco:Decimal>158.96</gco:Decimal>
</gmd:eastBoundLongitude>
<gmd:southBoundLatitude>
<gco:Decimal>-54.7539</gco:Decimal>
</gmd:southBoundLatitude>
<gmd:northBoundLatitude>
<gco:Decimal>-10.1357</gco:Decimal>
</gmd:northBoundLatitude>
</gmd:EX_GeographicBoundingBox>
</gmd:geographicElement>
</gmd:EX_Extent>
</gmd:extent>
What I want to do is to replace those decimal values, i.e. 112.907, with a specified value.
<gmd:extent>
<gmd:EX_Extent>
<gmd:description gco:nilReason="missing">
<gco:CharacterString />
</gmd:description>
<gmd:geographicElement>
<gmd:EX_GeographicBoundingBox>
<gmd:westBoundLongitude>
<gco:Decimal>new value</gco:Decimal>
</gmd:westBoundLongitude>
<gmd:eastBoundLongitude>
<gco:Decimal>new value</gco:Decimal>
</gmd:eastBoundLongitude>
<gmd:southBoundLatitude>
<gco:Decimal>new value</gco:Decimal>
</gmd:southBoundLatitude>
<gmd:northBoundLatitude>
<gco:Decimal>new value</gco:Decimal>
</gmd:northBoundLatitude>
</gmd:EX_GeographicBoundingBox>
</gmd:geographicElement>
</gmd:EX_Extent>
</gmd:extent>
I tried with a few methods but none of them worked with my assumption that the difficulty is with the namespace prefix gmd and gco.
Please help me out. Thanks in advance!
Cheers, Alex
I couldn't get lxml to process your xml without adding fake namespace declarations at the top so here is how your input looked
<gmd:extent xmlns:gmd="urn:x:y:z:1" xmlns:gco="urn:x:y:z:1">
<gmd:EX_Extent>
<gmd:description gco:nilReason="missing">
<gco:CharacterString />
</gmd:description>
<gmd:geographicElement>
<gmd:EX_GeographicBoundingBox>
<gmd:westBoundLongitude>
<gco:Decimal>112.907</gco:Decimal>
</gmd:westBoundLongitude>
<gmd:eastBoundLongitude>
<gco:Decimal>158.96</gco:Decimal>
</gmd:eastBoundLongitude>
<gmd:southBoundLatitude>
<gco:Decimal>-54.7539</gco:Decimal>
</gmd:southBoundLatitude>
<gmd:northBoundLatitude>
<gco:Decimal>-10.1357</gco:Decimal>
</gmd:northBoundLatitude>
</gmd:EX_GeographicBoundingBox>
</gmd:geographicElement>
</gmd:EX_Extent>
</gmd:extent>
I assumed you have two lists one for the current values and one for the new ones like this
old = [112.907, 158.96, -54.7539, -10.1357]
new = [1,2,3,4]
d = dict(zip(old,new))
Here is the full code
#!/usr/bin/env python
import sys
from lxml import etree
def process(fname):
f = open(fname)
tree = etree.parse(f)
root = tree.getroot()
old = [112.907, 158.96, -54.7539, -10.1357]
new = [1,2,3,4]
d = dict(zip(old,new))
nodes = root.findall('.//gco:Decimal', root.nsmap)
for node in nodes:
node.text = str(d[float(node.text)])
f.close()
return etree.tostring(root, pretty_print=True)
def main():
fname = sys.argv[1]
text = process(fname)
outfile = open('out.xml', 'w+')
outfile.write(text)
outfile.close()
if __name__ == '__main__':
main()
and here is how the output looked like
<gmd:extent xmlns:gmd="urn:x:y:z:1" xmlns:gco="urn:x:y:z:1">
<gmd:EX_Extent>
<gmd:description gco:nilReason="missing">
<gco:CharacterString/>
</gmd:description>
<gmd:geographicElement>
<gmd:EX_GeographicBoundingBox>
<gmd:westBoundLongitude>
<gco:Decimal>1</gco:Decimal>
</gmd:westBoundLongitude>
<gmd:eastBoundLongitude>
<gco:Decimal>2</gco:Decimal>
</gmd:eastBoundLongitude>
<gmd:southBoundLatitude>
<gco:Decimal>3</gco:Decimal>
</gmd:southBoundLatitude>
<gmd:northBoundLatitude>
<gco:Decimal>4</gco:Decimal>
</gmd:northBoundLatitude>
</gmd:EX_GeographicBoundingBox>
</gmd:geographicElement>
</gmd:EX_Extent>
</gmd:extent>

Categories

Resources