I am using lxml etree to build the XML for a REST call. I have a problem with namespaces: if they are not formulated correctly, I get a syntax error from the server.
As the two examples below show, I should be getting prefixes such as ns1, ns2, ns4 and ns5, but the XML comes out with ns15, ns16 and so on, and at the end it has e.g. "" or " ". I know this explains it, but for my REST call I need the XML exactly as in the example.
How can I prevent that?
I have to get the following xml
<ns5:prenosPodatkovRazporedaZahtevaSporocilo xmlns="http://xxx.yyy/sheme/pdr/skupno/v1" xmlns:ns2="http://xxx.yyy/sheme/pdr/v1" xmlns:ns3="http://xxx.yyy/sheme/kis/skupno/v2" xmlns:ns4="http://xxx.yyy/sheme/kis/v2" xmlns:ns5="http://xxx.yyy/sheme/pdr/sporocila/v1">
<ns5:podatkiRazporeda>
<ns2:podatkiRazporeda>
<ns2:delitvenaEnota>
<sifra>80</sifra>
</ns2:delitvenaEnota>
<ns2:vrstaRazporeda>
<sifra>4</sifra>
</ns2:vrstaRazporeda>
<ns2:tipRazporeda>
<sifra>D</sifra>
</ns2:tipRazporeda>
<ns2:obdobje>
<ns2:mesec>12</ns2:mesec>
<ns2:leto>2017</ns2:leto>
</ns2:obdobje>
<ns2:skupina>0</ns2:skupina>
<ns2:izvor>P_738</ns2:izvor>
<ns2:oznakeDelaZaDneve>
<ns2:oznakaDelaZaDan>
<ns2:dan>1</ns2:dan>
<ns2:oznakaDela>D4</ns2:oznakaDela>
</ns2:oznakaDelaZaDan>
....
</ns2:oznakeDelaZaDneve>
<ns2:organizacijskaEnota>
<sifra>738</sifra>
</ns2:organizacijskaEnota>
<ns2:zaposlenec>
<ns4:osebnaStevilka>10357</ns4:osebnaStevilka>
</ns2:zaposlenec>
</ns2:podatkiRazporeda>
</ns5:podatkiRazporeda>
Whereas I am getting this XML.
Mind the namespace prefixes.
<ns0:prenosPodatkovRazporedaOdgovorSporocilo xmlns:ns="http://rccirc.si/sheme/pdr/skupno/v1" xmlns:ns2="http://rccirc.si/sheme/pdr/v1" xmlns:ns3="http://rccirc.si/sheme/kis/skupno/v2" xmlns:ns4="http://rccirc.si/sheme/kis/v2" xmlns:ns5="http://rccirc.si/sheme/pdr/sporocila/v1" xmlns:ns0="ns5">
<ns0:podatkiRazporeda>
<ns1:podatkiRazporeda xmlns:ns1="ns2">
<ns1:vrstaRazporeda>
<sifra>647</sifra>
</ns1:vrstaRazporeda>
<ns1:tipRazporeda>
<sifra>D</sifra>
</ns1:tipRazporeda>
<ns1:obdobje>
<ns1:mesec>1</ns1:mesec>
<ns1:leto>2018</ns1:leto>
</ns1:obdobje>
<ns1:skupina>0</ns1:skupina>
<ns1:izvor>0</ns1:izvor>
<ns1:organizacijskaEnota>
<sifra>250</sifra>
</ns1:organizacijskaEnota>
<ns6:delitvenaenota xmlns:ns6="ns3">
<sifra>80</sifra>
</ns6:delitvenaenota>
<ns1:oznakeDelaZaDneve>
<oznakeDelaZaDneve>
<ns1:dan>29</ns1:dan>
<ns1:oznakaDela>1930-0730</ns1:oznakaDela>
</oznakeDelaZaDneve>
</ns1:oznakeDelaZaDneve>
<ns1:zaposlenec>
<ns7:osebnaStevilka xmlns:ns7="ns4">Z1</ns7:osebnaStevilka>
</ns1:zaposlenec>
</ns1:podatkiRazporeda>
.......
<ns11:podatkiRazporeda xmlns:ns11="ns2">
<ns11:vrstaRazporeda>
<sifra>647</sifra>
</ns11:vrstaRazporeda>
<ns11:tipRazporeda>
<sifra>D</sifra>
</ns11:tipRazporeda>
<ns11:obdobje>
<ns11:mesec>1</ns11:mesec>
<ns11:leto>2018</ns11:leto>
</ns11:obdobje>
<ns11:skupina>0</ns11:skupina>
<ns11:izvor>0</ns11:izvor>
<ns11:organizacijskaEnota>
<sifra>250</sifra>
</ns11:organizacijskaEnota>
<ns12:delitvenaenota xmlns:ns12="ns3">
<sifra>80</sifra>
</ns12:delitvenaenota>
<ns11:oznakeDelaZaDneve>
<oznakeDelaZaDneve>
<ns11:dan>3</ns11:dan>
<ns11:oznakaDela>0730-1530</ns11:oznakaDela>
</oznakeDelaZaDneve>
.....
</ns11:oznakeDelaZaDneve>
<ns11:zaposlenec>
<ns13:osebnaStevilka xmlns:ns13="ns4">Z1</ns13:osebnaStevilka>
</ns11:zaposlenec>
</ns11:podatkiRazporeda>
</ns0:podatkiRazporeda>
</ns0:prenosPodatkovRazporedaOdgovorSporocilo>
Here is my code.
root = etree.Element('{ns5}prenosPodatkovRazporedaOdgovorSporocilo', nsmap = {'ns': "http://xxx.yyy/sheme/pdr/skupno/v1", 'ns2': "http://xxx.yyy/sheme/pdr/v1", 'ns3': "http://xxx.yyy/sheme/kis/skupno/v2", 'ns4': "http://xxx.yyy/sheme/kis/v2", 'ns5': "http://xxx.yyy/sheme/pdr/sporocila/v1"})
podatkiRazporedaMain = etree.SubElement(root, '{ns5}podatkiRazporeda')
#follwed by creating sub elements etc.
for rec in grouped_workers:
    podatkiRazporeda = etree.SubElement(podatkiRazporedaMain, '{ns2}podatkiRazporeda')
    vrstaRazporeda = etree.SubElement(podatkiRazporeda, '{ns2}vrstaRazporeda')
    vrstaRazporedaSifra = etree.SubElement(vrstaRazporeda, 'sifra')
    vrstaRazporedaSifra.text = "647"
    tipRazporeda = etree.SubElement(podatkiRazporeda, '{ns2}tipRazporeda')
    tipRazporedaSifra = etree.SubElement(tipRazporeda, 'sifra')
    tipRazporedaSifra.text = 'D'
    for rr in rec["data"]:
        oznakaDelaZaDan = etree.SubElement(oznakeDelaZaDneve, 'oznakeDelaZaDneve')
        dan = etree.SubElement(oznakaDelaZaDan, '{ns2}dan')
        dan.text = str(rr["rw_date"].day)
        oznakaDela = etree.SubElement(oznakaDelaZaDan, '{ns2}oznakaDela')
        oznakaDela.text = str(rr["rw_shift"])
#print etree.tostring(root, pretty_print=True, xml_declaration=False, encoding='UTF-8')
fle = os.path.join(request.folder, 'private', str(647) + '.xml')
with open(fle, 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True, xml_declaration=False, encoding='UTF-8'))#,inclusive_ns_prefixes=None))
#etree..write(fle, pretty_print=True, xml_declaration=False, encoding='UTF-8')
print "Done"
So why are the ns prefixes incremented?
I hope I was clear.
Thank you
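As it turns out, when creating tags you must not put the prefix inside the braces:
vrstaRazporeda = etree.SubElement(podatkiRazporeda, '{ns2}vrstaRazporeda')
etree.SubElement(vrstaRazporeda, 'sifra').text = "647"
Instead, put the full namespace URI there:
vrstaRazporeda = etree.SubElement(podatkiRazporeda, '{http://xxx.yyy/sheme/pdr/v1}vrstaRazporeda')
etree.SubElement(vrstaRazporeda, 'sifra').text = "647"
So the braces take the whole URL, not the prefix. lxml treats whatever is inside them as the namespace URI, so '{ns2}' means a namespace literally named "ns2"; since that is not in the nsmap, lxml invents a fresh prefix for it. Using the whole URL solved the issue.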
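A minimal sketch of that fix, assuming the unprefixed <sifra> elements belong to the default namespace declared on the root (as the target XML suggests). The helper qname() is a hypothetical name, and only three of the five namespaces are declared to keep the example short:

```python
from lxml import etree

# Namespace URIs copied from the question; the prefixes are only labels for them.
NS_SKUPNO = "http://xxx.yyy/sheme/pdr/skupno/v1"
NS_PDR = "http://xxx.yyy/sheme/pdr/v1"
NS_SPOROCILA = "http://xxx.yyy/sheme/pdr/sporocila/v1"

# None is the default (unprefixed) namespace; the other keys are the wanted prefixes.
NSMAP = {None: NS_SKUPNO, "ns2": NS_PDR, "ns5": NS_SPOROCILA}


def qname(uri, local):
    """Hypothetical helper: build a Clark-notation tag name, '{uri}local'."""
    return "{%s}%s" % (uri, local)


root = etree.Element(
    qname(NS_SPOROCILA, "prenosPodatkovRazporedaZahtevaSporocilo"), nsmap=NSMAP)
outer = etree.SubElement(root, qname(NS_SPOROCILA, "podatkiRazporeda"))
inner = etree.SubElement(outer, qname(NS_PDR, "podatkiRazporeda"))
vrsta = etree.SubElement(inner, qname(NS_PDR, "vrstaRazporeda"))
# Assumption: <sifra> lives in the default namespace, so it is written without a prefix.
sifra = etree.SubElement(vrsta, qname(NS_SKUPNO, "sifra"))
sifra.text = "4"

xml_text = etree.tostring(root, pretty_print=True).decode("utf-8")
print(xml_text)
```

Because every URI used in a tag is already present in the root's nsmap, lxml reuses the declared prefixes instead of generating new ones on each element.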
Related
I am trying to open an xml file, and get values from certain tags. I have done this a lot but this particular xml is giving me some issues. Here is a section of the xml file:
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
<provider>filmgroup</provider>
<language>en-GB</language>
<actor name="John Smith" display="Doe John"></actor>
</package>
And here is a sample of my python code:
metadata = '/Users/mylaptop/Desktop/Python/metadata.xml'
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
open(metadata)
tree = etree.parse(metadata, parser)
root = tree.getroot()
for element in root.iter(tag='provider'):
    providerValue = tree.find('//provider')
    providerValue = providerValue.text
    print providerValue
tree.write('/Users/mylaptop/Desktop/Python/metadataDone.xml', pretty_print = True, xml_declaration = True, encoding = 'UTF-8')
When I run this, it can't find the provider tag or its value. If I remove xmlns="http://apple.com/itunes/importer", everything works as expected.
My question is: how can I remove this namespace, since I'm not at all interested in it, so that I can get the tag values I need using lxml?
The provider tag is in the http://apple.com/itunes/importer namespace, so you either need to use the fully qualified name
{http://apple.com/itunes/importer}provider
or use one of the lxml methods that has the namespaces parameter, such as root.xpath. Then you can specify it with a namespace prefix (e.g. ns:provider):
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()
namespaces = {'ns':'http://apple.com/itunes/importer'}
items = iter(root.xpath('//ns:provider/text()|//ns:actor/@name',
                        namespaces=namespaces))
for provider, actor in zip(*[items]*2):
    print(provider, actor)
yields
('filmgroup', 'John Smith')
Note that the XPath used above assumes that <provider> and <actor> elements always appear in alternation. If that is not true, then there are of course ways to handle it, but the code becomes a bit more verbose:
for package in root.xpath('//ns:package', namespaces=namespaces):
    for provider in package.xpath('ns:provider', namespaces=namespaces):
        providerValue = provider.text
        print providerValue
    for actor in package.xpath('ns:actor', namespaces=namespaces):
        print actor.attrib['name']
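As a rough, self-contained illustration of both options (the fully qualified name and a prefix map), here is a sketch using the standard library's xml.etree.ElementTree, whose find() accepts the same two forms as lxml. The document string is the question's snippet with the actor tag closed properly:

```python
import xml.etree.ElementTree as ET

DOC = b"""<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
  <provider>filmgroup</provider>
  <language>en-GB</language>
  <actor name="John Smith" display="Doe John"></actor>
</package>"""

root = ET.fromstring(DOC)
NS = {"ns": "http://apple.com/itunes/importer"}

# Option 1: fully qualified (Clark notation) tag name.
by_clark = root.find("{http://apple.com/itunes/importer}provider")
# Option 2: a prefix of our own choosing, resolved through the NS mapping.
by_prefix = root.find("ns:provider", NS)
# The bare name matches nothing, because <provider> is in the default namespace.
no_ns = root.find("provider")

actor_name = root.find("ns:actor", NS).get("name")
print(by_clark.text, by_prefix.text, no_ns, actor_name)
```

The prefix "ns" is arbitrary; only the URI it maps to has to match the document.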
My suggestion is to not ignore the namespace but, instead, to take it into account. I wrote some related functions (copied with slight modification) for my work on the django-quickbooks library. With these functions, you should be able to do this:
providers = getels(root, 'provider', ns='http://apple.com/itunes/importer')
Here are those functions:
class TagNotFound(Exception):
    """Raised when a tag is missing (defined here so the snippet is self-contained)."""
def get_tag_with_ns(tag_name, ns):
    return '{%s}%s' % (ns, tag_name)
def getel(elt, tag_name, ns=None):
    """ Gets the first tag that matches the specified tag_name taking into
    account the QB namespace.
    :param ns: The namespace to use if not using the default one for
    django-quickbooks.
    :type ns: string
    """
    res = elt.find(get_tag_with_ns(tag_name, ns=ns))
    if res is None:
        raise TagNotFound('Could not find tag by name "%s"' % tag_name)
    return res
def getels(elt, *path, **kwargs):
    """ Gets the first set of elements found at the specified path.
    Example:
    >>> xml = (
    "<root>" +
    "<item>" +
    "<id>1</id>" +
    "</item>" +
    "<item>" +
    "<id>2</id>" +
    "</item>" +
    "</root>")
    >>> el = etree.fromstring(xml)
    >>> getels(el, 'root', 'item', ns='correct/namespace')
    [<Element item>, <Element item>]
    """
    ns = kwargs['ns']
    # Walk down every path component except the last one...
    i=-1
    for i in range(len(path)-1):
        elt = getel(elt, path[i], ns=ns)
    # ...then return all matches for the last component.
    tag_name = path[i+1]
    return elt.findall(get_tag_with_ns(tag_name, ns=ns))
OK, I'll be the first to admit it is, just not the path I want, and I don't know how to get it.
I'm using Python 3.3 in Eclipse with Pydev plugin in both Windows 7 at work and ubuntu 13.04 at home. I'm new to python and have limited programming experience.
I'm trying to write a script to take in an XML Lloyds market insurance message, find all the tags and dump them in a .csv where we can easily update them and then reimport them to create an updated xml.
I have managed to do all of that except when I get all the tags it only gives the tag name and not the tags above it.
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
That is a fragment of the XML. What I want is to find all the tags and their paths. For example, for the Count tag I want to show it as ItemsInGroupTotal/Count but can only get it as Count.
Here is my code:
xml = etree.parse(fullpath)
print( xml.xpath('.//*'))
all_xpath = xml.xpath('.//*')
every_tag = []
for i in all_xpath:
    single_tag = '%s,%s' % (i.tag, i.text)
    every_tag.append(single_tag)
print(every_tag)
This gives:
'{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupReference,8-2012-08-10', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupItemsTotal,\n', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}Count,13',
As you can see Count is shown as {namespace}Count, 13 and not {namespace}ItemsInGroupTotal/Count, 13
Can anyone point me towards what I need?
Thanks (hope my first post is OK)
Adam
EDIT:
This is my code now:
with open(fullpath, 'rb') as xmlFilepath:
    xmlfile = xmlFilepath.read()
fulltext = '%s' % xmlfile
text = fulltext[2:]
print(text)
xml = etree.fromstring(fulltext)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
print(every_tag)
But this returns an error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I removed the first two characters, as they are b', and it complained that the input didn't start with a tag.
Update:
I have been playing around with this and if I remove the xis: xxx tags and the namespace stuff at the top it works as expected. I need to keep the xis tags and be able to identify them as xis tags so can't just delete them.
Any help on how I can achieve this?
ElementTree objects have a method getpath(element), which returns a structural, absolute XPath expression to find that element.
Calling getpath on each element in an iter() loop should work for you:
from pprint import pprint
from lxml import etree
text = """
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
</TechAccount>
"""
xml = etree.fromstring(text)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
prints:
['/TechAccount, \n',
'/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
'/TechAccount/BrokerReference, HOY123/456',
'/TechAccount/ServiceProviderReference, 2012080921401A1',
'/TechAccount/CreationDate, 2012-08-10',
'/TechAccount/AccountTransactionType, premium',
'/TechAccount/GroupReference, 2012080921401A1',
'/TechAccount/ItemsInGroupTotal, \n',
'/TechAccount/ItemsInGroupTotal/Count, 1',
'/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
'/TechAccount/ServiceProviderGroupItemsTotal, \n',
'/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
UPD:
If your xml data is in the file test.xml, the code would look like:
from pprint import pprint
from lxml import etree
xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
Hope that helps.
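Regarding the ValueError in the question's edit: formatting the bytes with '%s' turns them into a str that still carries the encoding declaration, which lxml refuses. A small sketch of passing the bytes straight to etree.fromstring() instead, assuming a file with an XML declaration and a default namespace like the ACORD one quoted in the question (the sample document itself is made up for illustration):

```python
from lxml import etree

# Stand-in for open(fullpath, 'rb').read(): bytes with an encoding declaration.
raw = b"""<?xml version="1.0" encoding="UTF-8"?>
<TechAccount xmlns="http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1">
  <ItemsInGroupTotal><Count>1</Count></ItemsInGroupTotal>
</TechAccount>"""

# Bytes go in unchanged: no '%s' formatting and no slicing off "b'".
root = etree.fromstring(raw)
tree = etree.ElementTree(root)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in root.iter()]

# The same content as a str (declaration included) is rejected by lxml.
try:
    etree.fromstring(raw.decode("utf-8"))
    str_accepted = True
except ValueError:
    str_accepted = False

print(len(every_tag), str_accepted)
```

etree.parse(fullpath), as in the update above, avoids the problem in the same way because lxml reads the file's bytes itself.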
getpath() does indeed return an xpath that's not suited for human consumption. From this xpath, you can build up a more useful one though. Such as with this quick-and-dirty approach:
def human_xpath(element):
    full_xpath = element.getroottree().getpath(element)
    xpath = ''
    human_xpath = ''
    for i, node in enumerate(full_xpath.split('/')[1:]):
        xpath += '/' + node
        element = element.xpath(xpath)[0]
        namespace, tag = element.tag[1:].split('}', 1)
        if element.getparent() is not None:
            nsmap = {'ns': namespace}
            same_name = element.getparent().xpath('./ns:' + tag,
                                                  namespaces=nsmap)
            if len(same_name) > 1:
                tag += '[{}]'.format(same_name.index(element) + 1)
        human_xpath += '/' + tag
    return human_xpath
I am trying to parse an XML file with Python to create a result summary file. Below are my code and a snippet of the XML. Like the one below, I have a couple of sections delimited by <test> and </test>.
<test name="tst_case1">
<prolog time="2013-01-18T14:41:09+05:30"/>
<verification name="VP5" file="D:/Squish/HMI_testing/tst_case1/test.py" type="properties" line="6">
<result time="2013-01-18T14:41:10+05:30" type="PASS">
<description>VP5: Object propertycomparisonof ':_QMenu_3.enabled'passed</description> <description type="DETAILED">'false' and 'false' are equal</description>
<description type="object">:_QMenu_3</description>
<description type="property">enabled</description>
<description type="failedValue">false</description>
</result>
</verification>
<epilog time="2013-01-18T14:41:11+05:30"/>
</test>
What I want to get is how many PASS / FAIL results there are in each <test> section.
The code below prints the total PASS/FAIL count for the whole XML file, but I am interested in how many PASS/FAIL results each section has. Can anybody tell me how to fetch this?
import sys
import xml.dom.minidom as XY
file = open("result.txt", "w")
tree = XY.parse('D:\\Squish\\squish results\\Results-On-2013-01-18_0241 PM.xml')
Test_name = tree.getElementsByTagName('test')
Test_status = tree.getElementsByTagName('result')
count_testname =0
passcount = 0
failcount = 0
Test_name_array = []
for my_Test_name in Test_name:
    count_testname = count_testname+1
    passcount = 0
    failcount = 0
    my_Test_name_final = my_Test_name.getAttribute('name')
    Test_name_array = my_Test_name_final
    if(count_testname > 1):
        print(my_Test_name_final)
    for my_Test_status in Test_status:
        my_Test_status_final = my_Test_status.getAttribute('type')
        if(my_Test_status_final == 'PASS'):
            passcount = passcount+1
        if(my_Test_status_final == 'FAIL'):
            failcount = failcount+1
        print(str(my_Test_status_final))
I'd not use minidom for this task; the DOM API is very cumbersome, verbose, and not suited for searching and matching.
The Python library also includes the xml.etree.ElementTree API, I'd use that instead:
from xml.etree import ElementTree as ET
tree = ET.parse(r'D:\Squish\squish results\Results-On-2013-01-18_0241 PM.xml')
tests = dict()
# Find all <test> elements with a <verification> child:
for test in tree.findall('.//test[verification]'):
    passed = len(test.findall(".//result[@type='PASS']"))
    failed = len(test.findall(".//result[@type='FAIL']"))
    tests[test.attrib['name']] = {'pass': passed, 'fail': failed}
The above piece of code counts the number of passed and failed tests per <test> element and stores them in a dictionary, keyed to the name attribute of the <test> element.
I've tested the above code with Python 3.2 and the full XML document from another question you posted, which results in:
{'tst_Setup_menu_2': {'fail': 0, 'pass': 8}}
Thanks for posting. I got it working using minidom.
I would still like to see how it can be solved using xml.etree.ElementTree.
import sys
import xml.dom.minidom as XY
file = open("Result_Summary.txt", "w")
#tree = XY.parse('D:\\Squish\\squish results\\Results-On-2013-01-18_0241 PM.xml')
#print (str(sys.argv[1]))
tree = XY.parse(sys.argv[1])
Test_name = tree.getElementsByTagName('test')
count_testname =0
file.write('Test Name \t\t\t No:PASS\t\t\t No:FAIL\t \n\n')
for my_Test_name in Test_name:
    count_testname = count_testname+1
    my_Test_name_final = my_Test_name.getAttribute('name')
    if(count_testname > 1):
        #print(my_Test_name_final)
        file.write(my_Test_name_final)
        file.write('\t\t\t\t')
        my_Test_status = my_Test_name.getElementsByTagName('result')
        passcount = 0
        failcount = 0
        for my_Test_status_1 in my_Test_status:
            my_Test_status_final = my_Test_status_1.getAttribute('type')
            if(my_Test_status_final == 'PASS'):
                passcount = passcount+1
            if(my_Test_status_final == 'FAIL'):
                failcount = failcount+1
            #print(str(my_Test_status_final))
        file.write(str(passcount))
        #print(passcount)
        file.write('\t\t\t\t')
        file.write(str(failcount))
        # print(failcount)
        file.write('\n')
#print ('loop count: %d' %count_testname)
#print('PASS count: %s' %passcount)
#print('FAIL count: %s' %failcount)
file.close()
Although it is not a standard module, lxml is well worth the effort of installing, especially if you want fast XML parsing, IMHO.
Without a full example of your results, I guessed at what they would look like.
from lxml import etree
tree = etree.parse("results.xml")
count_result_type = etree.XPath("count(.//result[@type = $name])")
for test in tree.xpath("//test"):
    print(test.attrib['name'])
    print("\t# FAILS ", count_result_type(test, name="FAIL"))
    print("\t# PASSES", count_result_type(test, name="PASS"))
I generated the following running against my guess of your xml, which should give you an idea of what is happening.
tst_case1
# FAILS 1.0
# PASSES 1.0
tst_case0
# FAILS 0.0
# PASSES 1.0
tst_case2
# FAILS 0.0
# PASSES 1.0
tst_case3
# FAILS 0.0
# PASSES 1.0
What I like about lxml is how expressive it can be, YMMV.
I see you are using Squish. You should check your squish folder under \examples\regressiontesting. There you can find a file called xml2result2html.py. Here you can find an example of converting squish test results into html.
I am new to Python. Now I have to replace a number of values in an XML file with Python. The example snippet of XML is:
<gmd:extent>
<gmd:EX_Extent>
<gmd:description gco:nilReason="missing">
<gco:CharacterString />
</gmd:description>
<gmd:geographicElement>
<gmd:EX_GeographicBoundingBox>
<gmd:westBoundLongitude>
<gco:Decimal>112.907</gco:Decimal>
</gmd:westBoundLongitude>
<gmd:eastBoundLongitude>
<gco:Decimal>158.96</gco:Decimal>
</gmd:eastBoundLongitude>
<gmd:southBoundLatitude>
<gco:Decimal>-54.7539</gco:Decimal>
</gmd:southBoundLatitude>
<gmd:northBoundLatitude>
<gco:Decimal>-10.1357</gco:Decimal>
</gmd:northBoundLatitude>
</gmd:EX_GeographicBoundingBox>
</gmd:geographicElement>
</gmd:EX_Extent>
</gmd:extent>
What I want to do is to replace those decimal values, i.e. 112.907, with a specified value.
<gmd:extent>
<gmd:EX_Extent>
<gmd:description gco:nilReason="missing">
<gco:CharacterString />
</gmd:description>
<gmd:geographicElement>
<gmd:EX_GeographicBoundingBox>
<gmd:westBoundLongitude>
<gco:Decimal>new value</gco:Decimal>
</gmd:westBoundLongitude>
<gmd:eastBoundLongitude>
<gco:Decimal>new value</gco:Decimal>
</gmd:eastBoundLongitude>
<gmd:southBoundLatitude>
<gco:Decimal>new value</gco:Decimal>
</gmd:southBoundLatitude>
<gmd:northBoundLatitude>
<gco:Decimal>new value</gco:Decimal>
</gmd:northBoundLatitude>
</gmd:EX_GeographicBoundingBox>
</gmd:geographicElement>
</gmd:EX_Extent>
</gmd:extent>
I tried a few methods, but none of them worked; I assume the difficulty lies with the namespace prefixes gmd and gco.
Please help me out. Thanks in advance!
Cheers, Alex
I couldn't get lxml to process your XML without adding fake namespace declarations at the top, so here is how your input looked:
<gmd:extent xmlns:gmd="urn:x:y:z:1" xmlns:gco="urn:x:y:z:1">
<gmd:EX_Extent>
<gmd:description gco:nilReason="missing">
<gco:CharacterString />
</gmd:description>
<gmd:geographicElement>
<gmd:EX_GeographicBoundingBox>
<gmd:westBoundLongitude>
<gco:Decimal>112.907</gco:Decimal>
</gmd:westBoundLongitude>
<gmd:eastBoundLongitude>
<gco:Decimal>158.96</gco:Decimal>
</gmd:eastBoundLongitude>
<gmd:southBoundLatitude>
<gco:Decimal>-54.7539</gco:Decimal>
</gmd:southBoundLatitude>
<gmd:northBoundLatitude>
<gco:Decimal>-10.1357</gco:Decimal>
</gmd:northBoundLatitude>
</gmd:EX_GeographicBoundingBox>
</gmd:geographicElement>
</gmd:EX_Extent>
</gmd:extent>
I assumed you have two lists one for the current values and one for the new ones like this
old = [112.907, 158.96, -54.7539, -10.1357]
new = [1,2,3,4]
d = dict(zip(old,new))
Here is the full code
#!/usr/bin/env python
import sys
from lxml import etree
def process(fname):
    # Open in binary mode so lxml can honour the file's own encoding.
    with open(fname, 'rb') as f:
        tree = etree.parse(f)
    root = tree.getroot()
    old = [112.907, 158.96, -54.7539, -10.1357]
    new = [1,2,3,4]
    d = dict(zip(old,new))
    nodes = root.findall('.//gco:Decimal', root.nsmap)
    for node in nodes:
        node.text = str(d[float(node.text)])
    return etree.tostring(root, pretty_print=True)
def main():
    fname = sys.argv[1]
    text = process(fname)
    # etree.tostring() returns bytes, so write in binary mode.
    with open('out.xml', 'wb') as outfile:
        outfile.write(text)
if __name__ == '__main__':
    main()
and here is how the output looked like
<gmd:extent xmlns:gmd="urn:x:y:z:1" xmlns:gco="urn:x:y:z:1">
<gmd:EX_Extent>
<gmd:description gco:nilReason="missing">
<gco:CharacterString/>
</gmd:description>
<gmd:geographicElement>
<gmd:EX_GeographicBoundingBox>
<gmd:westBoundLongitude>
<gco:Decimal>1</gco:Decimal>
</gmd:westBoundLongitude>
<gmd:eastBoundLongitude>
<gco:Decimal>2</gco:Decimal>
</gmd:eastBoundLongitude>
<gmd:southBoundLatitude>
<gco:Decimal>3</gco:Decimal>
</gmd:southBoundLatitude>
<gmd:northBoundLatitude>
<gco:Decimal>4</gco:Decimal>
</gmd:northBoundLatitude>
</gmd:EX_GeographicBoundingBox>
</gmd:geographicElement>
</gmd:EX_Extent>
</gmd:extent>
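An alternative sketch that selects each value by the name of its enclosing element instead of by its old number, using only the standard library. It assumes the real document declares the usual ISO 19139 namespace URIs for gmd and gco (the question's snippet does not show its declarations), and the new values and the shortened sample are made up:

```python
import xml.etree.ElementTree as ET

# Assumed URIs: the standard ISO 19139 namespaces for the gmd and gco prefixes.
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
NS = {"gmd": GMD, "gco": GCO}

# Keep the original prefixes when serializing (otherwise ns0, ns1 are used).
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)

SAMPLE = """<gmd:EX_GeographicBoundingBox xmlns:gmd="%s" xmlns:gco="%s">
  <gmd:westBoundLongitude><gco:Decimal>112.907</gco:Decimal></gmd:westBoundLongitude>
  <gmd:eastBoundLongitude><gco:Decimal>158.96</gco:Decimal></gmd:eastBoundLongitude>
</gmd:EX_GeographicBoundingBox>""" % (GMD, GCO)

# Hypothetical replacement values, keyed by the enclosing element's local name.
new_values = {"westBoundLongitude": "110.0", "eastBoundLongitude": "160.0"}

root = ET.fromstring(SAMPLE)
for bound, value in new_values.items():
    node = root.find(".//gmd:%s/gco:Decimal" % bound, NS)
    if node is not None:
        node.text = value

result = ET.tostring(root, encoding="unicode")
print(result)
```

Keying on the element name avoids comparing floats and still works when two bounds happen to hold the same number.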
I'm currently using xml.dom.minidom to parse some XML in python. After parsing, I'm doing some reporting on the content, and would like to report the line (and column) where the tag started in the source XML document, but I don't see how that's possible.
I'd like to stick with xml.dom / xml.dom.minidom if possible, but if I need to use a SAX parser to get the origin info, I can do that -- ideal in that case would be using SAX to track node location, but still end up with a DOM for my post-processing.
Any suggestions on how to do this? Hopefully I'm just overlooking something in the docs and this is extremely easy.
By monkeypatching the minidom content handler I was able to record line and column number for each node (as the 'parse_position' attribute). It's a little dirty, but I couldn't see any "officially sanctioned" way of doing it :) Here's my test script:
from xml.dom import minidom
import xml.sax
doc = """\
<File>
<name>Name</name>
<pos>./</pos>
</File>
"""
def set_content_handler(dom_handler):
    def startElementNS(name, tagName, attrs):
        orig_start_cb(name, tagName, attrs)
        # The element just created is on top of the handler's stack.
        cur_elem = dom_handler.elementStack[-1]
        cur_elem.parse_position = (
            parser._parser.CurrentLineNumber,
            parser._parser.CurrentColumnNumber
        )
    orig_start_cb = dom_handler.startElementNS
    dom_handler.startElementNS = startElementNS
    orig_set_content_handler(dom_handler)
parser = xml.sax.make_parser()
orig_set_content_handler = parser.setContentHandler
parser.setContentHandler = set_content_handler
dom = minidom.parseString(doc, parser)
pos = dom.firstChild.parse_position
print("Parent: '{0}' at {1}:{2}".format(
    dom.firstChild.localName, pos[0], pos[1]))
for child in dom.firstChild.childNodes:
    if child.localName is None:
        continue
    pos = child.parse_position
    print("Child: '{0}' at {1}:{2}".format(child.localName, pos[0], pos[1]))
It outputs the following:
Parent: 'File' at 1:0
Child: 'name' at 2:2
Child: 'pos' at 3:2
A different way to hack around the problem is by patching line number information into the document before parsing it. Here's the idea:
import re
from xml.dom import minidom
LINE_DUMMY_ATTR = '_DUMMY_LINE' # Make sure this string is unique!
def parseXml(filename):
    content = []
    with open(filename, 'r') as f:
        # enumerate() supplies the 1-based line number for each line.
        for l, line in enumerate(f, 1):
            content.append(re.sub(r'<(\w+)', r'<\1 ' + LINE_DUMMY_ATTR + '="' + str(l) + '"', line))
    return minidom.parseString("".join(content))
Then you can retrieve the line number of an element with
int (element.getAttribute (LINE_DUMMY_ATTR))
Quite clearly, this approach has its own set of drawbacks, and if you really need column numbers too, patching those in will be somewhat more involved. Also, if you want to extract text nodes or comments, or use Node.toxml(), you'll have to make sure to strip out LINE_DUMMY_ATTR from any accidental matches there.
The one advantage of this solution over aknuds1's answer is that it does not require messing with minidom internals.
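For comparison, the plain SAX route mentioned in the question can be sketched with the documented Locator interface. This is only a sketch: it records positions but does not build a DOM, so it would have to be combined with one of the approaches above to end up with minidom nodes. The class name PositionHandler is made up:

```python
import xml.sax

DOC = b"""<File>
  <name>Name</name>
  <pos>./</pos>
</File>
"""


class PositionHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records where each start tag was seen."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.locator = None
        self.positions = []

    def setDocumentLocator(self, locator):
        # The parser hands over a locator before parsing starts.
        self.locator = locator

    def startElement(self, name, attrs):
        self.positions.append(
            (name, self.locator.getLineNumber(), self.locator.getColumnNumber()))


handler = PositionHandler()
xml.sax.parseString(DOC, handler)
for name, line, column in handler.positions:
    print("%s at %d:%d" % (name, line, column))
```

Unlike the monkeypatch, this uses only public xml.sax API, at the cost of not producing DOM nodes by itself.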