Minidom getElementById not working - python

Minidom's getElementById function is returning None for any entry I pass to it.
For example, this code:
l = minidom.parseString('<node id="node">Node</node>')
print(l.getElementById("node"))
Prints "None" on my computer.
I must be doing something wrong here, but I can't figure out what!
I'm running Python 3.3.2 if that helps.

I used another approach to get elements by ID (meaning the XML attribute "id"), since I wanted to use only xml.dom.minidom.
Here is an example from my work:
# import the minidom parser
from xml.dom.minidom import parse as p
# parse your XML document
cmmn_doc = p("document.xml")
# Get all child nodes of your root element or any element surrounding your "target" (in my example "cmmn:casePlanModel")
nodelist = cmmn_doc.getElementsByTagName("cmmn:casePlanModel")[0].childNodes
# Now find the element via its id attribute
def find_element(element_id):
    for node in nodelist:
        # childNodes also contains text nodes, which have no attributes, so skip them
        if node.nodeType == node.ELEMENT_NODE and node.getAttribute("id") == element_id:
            return node.nodeName  # (or whatever you want to do)
    return None
# Call find_element with the id you are looking for
find_element("PlanItem_1")
XML from the example:
<cmmn:casePlanModel id="CasePlanModel_1" name="A CasePlanModel">
<cmmn:planItem id="PlanItem_1" definitionRef="Task_1" />
<cmmn:planItem id="PlanItem_08uai3q" definitionRef="HumanTask_0pgsk2i" />
<cmmn:planItem id="PlanItem_0crahv8" definitionRef="HumanTask_0jvecsr">
<cmmn:itemControl id="PlanItemControl_0tdwp8g">
<cmmn:repetitionRule id="RepetitionRule_03ky93m" />
<cmmn:requiredRule id="RequiredRule_1klzaio" />
<cmmn:manualActivationRule id="ManualActivationRule_1rek2bf" />
</cmmn:itemControl>
</cmmn:planItem>
<cmmn:planItem id="PlanItem_08kswcr" definitionRef="HumanTask_14zxi11" />
<cmmn:planItem id="PlanItem_12b1nkx" definitionRef="ProcessTask_10xuu3g">
<cmmn:exitCriterion id="EntryCriterion_09gio4l" sentryRef="Sentry_0hst9b5" />
</cmmn:planItem>
<cmmn:planItem id="PlanItem_1v34h5m" definitionRef="CaseTask_0hwjce3">
<cmmn:entryCriterion id="EntryCriterion_1j8r6j1" sentryRef="Sentry_1ii8w5d" />
</cmmn:planItem>
<cmmn:planItem id="PlanItem_0wroqsx" definitionRef="EventListener_17yxe7z" />
<cmmn:sentry id="Sentry_0hst9b5" />
<cmmn:sentry id="Sentry_1ii8w5d">
<cmmn:planItemOnPart id="PlanItemOnPart_1gt5jrc" sourceRef="PlanItem_12b1nkx"> <cmmn:standardEvent>complete</cmmn:standardEvent>
</cmmn:planItemOnPart>
<cmmn:planItemOnPart id="PlanItemOnPart_01b6uw3" sourceRef="PlanItem_0wroqsx"> <cmmn:standardEvent>occur</cmmn:standardEvent>
</cmmn:planItemOnPart>
</cmmn:sentry>
<cmmn:task id="Task_1" name="Simple Task" />
<cmmn:humanTask id="HumanTask_0pgsk2i" name="Human Task" />
<cmmn:humanTask id="HumanTask_0jvecsr" name="Human_Blocking" isBlocking="false" />
<cmmn:humanTask id="HumanTask_14zxi11" name="Human_mit_Anhang">
<cmmn:planningTable id="PlanningTable_1yxv7gm">
<cmmn:discretionaryItem id="DiscretionaryItem_0ne79yh" definitionRef="DecisionTask_1ecc5v8" />
</cmmn:planningTable>
</cmmn:humanTask>
<cmmn:decisionTask id="DecisionTask_1ecc5v8" name="Descritionary to Human Task" />
<cmmn:processTask id="ProcessTask_10xuu3g" name="Prozess Task" />
<cmmn:caseTask id="CaseTask_0hwjce3" name="Case Task" />
<cmmn:eventListener id="EventListener_17yxe7z" name="EventListener" />
</cmmn:casePlanModel>
I found this way more convenient.

If you want to get elements with the tag name "node":
l.getElementsByTagName("node")
If you want to get elements that have an attribute "id" with the value "node", use XPath (the xpath module is a third-party package such as py-dom-xpath, not part of the standard library):
import xpath
xpath.find('//*[@id="node"]', l)  # search for all elements with an attribute id="node"

From the code you posted, I understand you are trying to get the element whose id value is node.
The solution is to loop over all your XML elements (well, you have only one in this situation, but it does not matter), and then check if that element has an attribute called id and the value of that attribute is node.
Let us translate this logic into a program:
>>> from xml.dom import minidom
>>> xml_string = '<node id="node">Node</node>'
>>> xml_doc = minidom.parseString(xml_string)
>>> elements = xml_doc.getElementsByTagName('node')
>>> for element in elements:
...     if element.hasAttribute('id') and element.getAttribute('id') == 'node':
...         print(element.toxml())
...
<node id="node">Node</node>
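As a side note on why the original call returns None: minidom only resolves getElementById for attributes that have been declared as IDs (for instance through a DTD), and a bare attribute that happens to be named "id" is not treated as one. A minimal sketch, assuming you are happy to mark every "id" attribute as an ID yourself with setIdAttribute:

```python
from xml.dom import minidom

doc = minidom.parseString('<node id="node">Node</node>')

# Nothing has been declared as an ID attribute yet, so the lookup fails.
first = doc.getElementById("node")
print(first)

# Mark the "id" attribute of every element that carries one as an ID attribute.
for element in doc.getElementsByTagName("*"):
    if element.hasAttribute("id"):
        element.setIdAttribute("id")

# The same lookup now finds the element.
found = doc.getElementById("node")
print(found.toxml())
```

This keeps everything inside xml.dom.minidom, at the cost of one pass over the document before the first lookup.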

Related

I cannot parse this xml file in python

I am trying to create an API connection, and the response looks like the one below. I need to parse this data and turn it into a pandas DataFrame and/or loop over it to find specific information belonging to particular tags.
Below is the code I tried to run, but it returns an empty list, and the result does not look iterable.
It also cannot be converted to a DataFrame for now. What steps should I take to handle this data?
import requests
import pandas as pd
import xml.etree.ElementTree as ET
response = """<?xml version = "1.0" encoding = "utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<SOAP-ENV:Body>
<Desperados_Clientes_V2.DESPERADOSResponse xmlns="TrainsWebb_V16">
<Sdtdesperadosclient xmlns="TrainsWebb_V16">
<SDTDesperadosClientItem xmlns="TrainsWebb_V16">
<AESA>10555555555 </AESA>
<DOCUMENTO>1666666666</DOCUMENTO>
<REMITENTE>888888888 </REMITENTE>
<NM_REMITENTE>ABDULREZZAK S.A.S. </NM_REMITENTE>
<FECHA_ELABORACION>14/8/2020</FECHA_ELABORACION>
<HORA_ELABORACION>11:27</HORA_ELABORACION>
<CODIGO_DEST>0000000000</CODIGO_DEST>
<NIT_DESTINATARIO>0000000000</NIT_DESTINATARIO>
<NOMBRE_DESTINATARIO>HOST ADMIRALE GORA</NOMBRE_DESTINATARIO>
<DIRECCION_DESTINATARIO>BBA 56 # 21 - 001</DIRECCION_DESTINATARIO>
<DANE_DESTINO>0200000</DANE_DESTINO>
<CIUDAD_DESTINO>GORA </CIUDAD_DESTINO>
<DEPARTAMENTO_DESTINO>ANTIOCHIA </DEPARTAMENTO_DESTINO>
<FECHA_ENTREGA>11/02/2020</FECHA_ENTREGA>
<HORA_ENTREGA>11:44</HORA_ENTREGA>
<FECHA_CITA />
<HORA_CITA />
<CODIGO_ESTADO>Z </CODIGO_ESTADO>
<NOMBRE_ESTADO>CUMPLEANNO </NOMBRE_ESTADO>
<FECHA_ESTADO>11/01/2020</FECHA_ESTADO>
<HORA_ESTADO>11:44</HORA_ESTADO>
<CODIGO_NOVEDAD />
<NOMBRE_NOVEDAD />
<FECHA_NOVEDAD />
<HORA_NOVEDAD />
<COMENTARIO_NOVEDAD />
<OBSERVACIONES />
<ENLACE_IMAGEN>https://ssssss.ssssssss.com/SSSSSS/xxxxxxxxxxxxxx.aspx?1111111,222222222,SIXA_XEOX,SIXAXEOX2016</ENLACE_IMAGEN>
<DOCUMENTO_2>169999999999</DOCUMENTO_2>
<DOCUMENTO_3 />
<DOCUMENTO_4 />
<FECHA_TRANSMISION>18/02/2020</FECHA_TRANSMISION>
<HORA_TRANSMISION>08:12:30</HORA_TRANSMISION>
<MENSAJE_TRANSMISION>KK</MENSAJE_TRANSMISION>
<PROMESA_SERVICIO>15/10/21</PROMESA_SERVICIO>
<CODIGO_DIVISION>011111</CODIGO_DIVISION>
<NOMBRE_DIVISION>ABDURREZZAK </NOMBRE_DIVISION>
</SDTDesperadosClientItem>
<SDTDesperadosClientItem xmlns="TrainsWebb_V16">
<AESA>10555555555 </AESA>
<DOCUMENTO>177777777</DOCUMENTO>
<REMITENTE>9999999999 </REMITENTE>
<NM_REMITENTE>ABDULREZZAK S.A.S. </NM_REMITENTE>
<FECHA_ELABORACION>12/8/2020</FECHA_ELABORACION>
<HORA_ELABORACION>16:27</HORA_ELABORACION>
<CODIGO_DEST>0000000000</CODIGO_DEST>
<NIT_DESTINATARIO>0000000000</NIT_DESTINATARIO>
<NOMBRE_DESTINATARIO>GORA FORA</NOMBRE_DESTINATARIO>
<DIRECCION_DESTINATARIO>BBG 16 # 91 - 021</DIRECCION_DESTINATARIO>
<DANE_DESTINO>0500000</DANE_DESTINO>
<CIUDAD_DESTINO>AROG </CIUDAD_DESTINO>
<DEPARTAMENTO_DESTINO>ANTIOCHIA </DEPARTAMENTO_DESTINO>
<FECHA_ENTREGA>10/02/2020</FECHA_ENTREGA>
<HORA_ENTREGA>10:44</HORA_ENTREGA>
<FECHA_CITA />
<HORA_CITA />
<CODIGO_ESTADO>D </CODIGO_ESTADO>
<NOMBRE_ESTADO>CUMPLEANNI </NOMBRE_ESTADO>
<FECHA_ESTADO>11/01/2020</FECHA_ESTADO>
<HORA_ESTADO>11:44</HORA_ESTADO>
<CODIGO_NOVEDAD />
<NOMBRE_NOVEDAD />
<FECHA_NOVEDAD />
<HORA_NOVEDAD />
<COMENTARIO_NOVEDAD />
<OBSERVACIONES />
<ENLACE_IMAGEN>https://ssssss.ssssssss.com/SSSSSS/xxxxxxxxxxxxxx.aspx?1111111,222222222,SIXA_XEOX,SIXAXEOX2016</ENLACE_IMAGEN>
<DOCUMENTO_2>1677777777</DOCUMENTO_2>
<DOCUMENTO_3 />
<DOCUMENTO_4 />
<FECHA_TRANSMISION>18/02/2020</FECHA_TRANSMISION>
<HORA_TRANSMISION>08:12:30</HORA_TRANSMISION>
<MENSAJE_TRANSMISION>HK</MENSAJE_TRANSMISION>
<PROMESA_SERVICIO>15/10/21</PROMESA_SERVICIO>
<CODIGO_DIVISION>011111</CODIGO_DIVISION>
<NOMBRE_DIVISION>ABDURREZZAK </NOMBRE_DIVISION>
</SDTDesperadosClientItem>
</Sdtdesperadosclient>
</Desperados_Clientes_V2.DESPERADOSResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""
myroot = ET.fromstring(response)
for child in myroot.iter('*'):
    print(child.tag)
sid = myroot.findall(".//{'TrainsWebb_V16'}AESA")
print(sid)
For parsing into a pandas DataFrame, you can use the pandas.read_xml function:
data_frame = pd.read_xml(response, xpath="//*[name()='SDTDesperadosClientItem']")
https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html#pandas-read-xml
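If you would rather stay with ElementTree, the empty list in the question most likely comes from the quotes inside the braces: the namespace URI goes between the braces as-is. Here is a minimal sketch, assuming a shortened stand-in for the SOAP response (only three fields per item are kept, and the variable names are my own), that collects one dict per SDTDesperadosClientItem:

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the SOAP response in the question.
response = """<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
<SOAP-ENV:Body>
<Sdtdesperadosclient xmlns="TrainsWebb_V16">
<SDTDesperadosClientItem><AESA>10555555555 </AESA><DOCUMENTO>1666666666</DOCUMENTO><FECHA_CITA /></SDTDesperadosClientItem>
<SDTDesperadosClientItem><AESA>10555555555 </AESA><DOCUMENTO>177777777</DOCUMENTO><FECHA_CITA /></SDTDesperadosClientItem>
</Sdtdesperadosclient>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

NS = "{TrainsWebb_V16}"  # namespace URI in braces, without quotes

root = ET.fromstring(response)
rows = []
for item in root.iter(NS + "SDTDesperadosClientItem"):
    # Map each child's local tag name to its stripped text ("" for empty elements).
    rows.append({child.tag.split("}", 1)[1]: (child.text or "").strip() for child in item})

for row in rows:
    print(row["AESA"], row["DOCUMENTO"])
```

The resulting list of dicts can then be handed to pd.DataFrame(rows) if a DataFrame is what you need.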

how to find and edit tags in XML files with namespaces using ElementTree

I would like to find specific tags in my XML document and edit their text or attributes. My XML file contains namespaces (and, if I understand correctly, nested namespaces). The tool I'd like to use for this purpose is ElementTree. I managed to read the XML file with iterparse, but I don't know how to save the edited XML, because iterparse has no write method. I need either a way to read the XML file with parse and strip its namespaces (including nested ones), or a way to save the file read with iterparse.
For this case, let's edit the text of the "Rating" tag.
it = ET.iterparse(adiPath)
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
    for at in list(el.attrib):  # strip namespaces of attributes too
        if '}' in at:
            newat = at.split('}', 1)[1]
            el.attrib[newat] = el.attrib[at]
            del el.attrib[at]
root = it.root
# Search for the Rating tag and edit its value
for rating in root.iter('Rating'):
    print(rating.text)  # Prints 18
    rating.text = "999"
    print(rating.text)  # Prints 999
However, in this case the XML file remains unchanged.
Here is XML file:
<?xml version="1.0" encoding="utf-8"?>
<ADI3 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:content="urn:cablelabs:md:xsd:content:3.0" xmlns:core="urn:cablelabs:md:xsd:core:3.0" xmlns:offer="urn:cablelabs:md:xsd:offer:3.0" xmlns:terms="urn:cablelabs:md:xsd:terms:3.0" xmlns:title="urn:cablelabs:md:xsd:title:3.0" xmlns:adb="urn:adb:md:xsd:adb:01" xmlns:schemaLocation="urn:adb:md:xsd:adb:01 ADB-EXT-C01.xsd urn:cablelabs:md:xsd:core:3.0 MD-SP-CORE-C01.xsd urn:cablelabs:md:xsd:content:3.0 MD-SP-CONTENT-C01.xsd urn:cablelabs:md:xsd:offer:3.0 MD-SP-OFFER-C01.xsd urn:cablelabs:md:xsd:terms:3.0 MD-SP-TERMS-C01.xsd urn:cablelabs:md:xsd:title:3.0 MD-SP-TITLE-C01.xsd" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns="urn:cablelabs:md:xsd:core:3.0">
<Asset xsi:type="title:TitleType" uriId="ab://cc.com" providerVersionNum="1" internalVersionNum="0" creationDateTime="2020-01-28T08:55:19Z" startDateTime="2019-05-20T00:00:00Z" endDateTime="2028-08-20T23:59:00Z">
<AlternateId identifierSystem="VOD1.1">ab://cc.com</AlternateId>
<Ext>
<adb:ExtensionType>
<adb:TitleExt>
<adb:SeriesInfo episodeNumber="6">
<adb:series seriesId="GOT" seasonCount="8"></adb:series>
<adb:season seasonId="GOTS08" number="8" episodeCount="6"></adb:season>
</adb:SeriesInfo>
</adb:TitleExt>
</adb:ExtensionType>
</Ext>
<title:LocalizableTitle xml:lang="pol">
<title:TitleLong>Game of Thrones VIII</title:TitleLong>
<title:SummaryLong>Long summary, long summary, long summary...</title:SummaryLong>
<title:Actor fullName="Peter Dinklage" firstName="Peter" lastName="Dinklage" />
<title:Actor fullName="Nikolaj Coster-Waldau" firstName="Nikolaj" lastName="Coster-Waldau" />
<title:Actor fullName="Emilia Clarke" firstName="Emilia" lastName="Clarke" />
<title:Actor fullName="Lena Headey" firstName="Lena" lastName="Headey" />
<title:Director fullName="David Nutter" firstName="David" lastname="Nutter" />
</title:LocalizableTitle>
<title:Rating ratingSystem="PL">18</title:Rating>
<title:Audience>General</title:Audience>
<title:DisplayRunTime>01:15</title:DisplayRunTime>
<title:Year>2019</title:Year>
<title:CountryOfOrigin>US</title:CountryOfOrigin>
<title:Genre>Film fantasy</title:Genre>
<title:ShowType>Movie</title:ShowType>
</Asset>
<Asset xsi:type="offer:CategoryType" uriId="cc.com/XX">
<AlternateId identifierSystem="VOD1.1">cc.com/XX</AlternateId>
<offer:CategoryPath>VOD/GOT/Season 8</offer:CategoryPath>
</Asset>
<Asset xsi:type="content:MovieType" uriId="GraoTronVIII_0_1080mp4">
<AlternateId identifierSystem="VOD1.1">GraoTronVIII_0_1080mp4</AlternateId>
<content:SourceUrl>GOTS08E06.mp4</content:SourceUrl>
<content:Resolution>1080p</content:Resolution>
<content:Duration>PT1H15M20S</content:Duration>
<content:Language>pol</content:Language>
<content:Language>eng</content:Language>
</Asset>
<Asset xsi:type="content:PreviewType" uriId="GraoTronVIII_1_1080mp4">
<AlternateId identifierSystem="VOD1.1">GraoTronVIII_1_1080mp4</AlternateId>
<content:SourceUrl>GOTS08E06_trailer.mp4</content:SourceUrl>
<content:Resolution>1080p</content:Resolution>
<content:Duration>PT0H01M48S</content:Duration>
<content:Language>pol</content:Language>
<content:Language>eng</content:Language>
</Asset>
<Asset xsi:type="content:PosterType" uriId="GraoTronVIIIPoster">
<AlternateId identifierSystem="VOD1.1">GraoTronVIIIPoster</AlternateId>
<content:SourceUrl>GOTS08E06.jpg</content:SourceUrl>
<content:X_Resolution>600</content:X_Resolution>
<content:Y_Resolution>900</content:Y_Resolution>
<content:Language>pol</content:Language>
</Asset>
<Asset xsi:type="offer:ContentGroupType" uriId="abc">
<AlternateId identifierSystem="VOD1.1">abc</AlternateId>
<offer:TitleRef uriId="abc" />
<offer:MovieRef uriId="GraoTronVIII_0_1080mp4" />
</Asset>
<Asset xsi:type="offer:ContentGroupType" uriId="abc">
<AlternateId identifierSystem="VOD1.1">abc</AlternateId>
<offer:TitleRef uriId="abc" />
<offer:MovieRef uriId="GraoTronVIII_1_1080mp4" />
</Asset>
<Asset xsi:type="offer:ContentGroupType" uriId="abc">
<AlternateId identifierSystem="VOD1.1">abc</AlternateId>
<offer:TitleRef uriId="abc" />
<offer:MovieRef uriId="GraoTronVIIIPoster" />
</Asset>
</ADI3>
Instead of stripping out the namespaces, I suggest using namespace wildcards. Support for this was added in Python 3.8.
from xml.etree import ElementTree as ET
tree = ET.parse(adiPath)
rating = tree.find(".//{*}Rating") # Find the Rating element in any namespace
rating.text = "999"
Note that you have to use find() (or findall()). Wildcards do not work with iter(). To save the change, write the tree back to a file with tree.write().
The following workaround can be used to preserve the original namespace prefixes when serializing the XML document (see also https://stackoverflow.com/a/42372404/407651 and https://stackoverflow.com/a/54491129/407651).
namespaces = dict([elem for _, elem in ET.iterparse("test1.xml", events=['start-ns'])])
for ns in namespaces:
    ET.register_namespace(ns, namespaces[ns])
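Putting the pieces together, here is a minimal end-to-end sketch (Python 3.8+ for the {*} wildcard). It assumes a shortened stand-in for the ADI3 file and writes to an in-memory buffer so that it runs on its own; with a real file you would pass the path to ET.parse and tree.write instead:

```python
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the ADI3 document from the question.
xml_bytes = (
    b'<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0" '
    b'xmlns:title="urn:cablelabs:md:xsd:title:3.0">'
    b'<Asset><title:Rating ratingSystem="PL">18</title:Rating></Asset>'
    b'</ADI3>'
)

# Register every declared prefix so the original prefixes are reused on output.
for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns"]):
    ET.register_namespace(prefix, uri)

tree = ET.parse(io.BytesIO(xml_bytes))
rating = tree.find(".//{*}Rating")  # Rating element in any namespace
rating.text = "999"

# Serialize the modified tree; this is the step that actually saves the edit.
out = io.BytesIO()
tree.write(out, encoding="utf-8", xml_declaration=True)
result = out.getvalue().decode("utf-8")
print(result)
```

Without the tree.write call the edit only exists in memory, which is why the file in the question never changed.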

Cannot find a namespaced attribute from current element

Hi I have the following XML and I am using the python ElementTree library to parse it.
<?xml version="1.0" encoding="UTF-8"?>
<bpmn:definitions xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL" xmlns:bpmndi="http://www.omg.org/spec/BPMN/20100524/DI" xmlns:di="http://www.omg.org/spec/DD/20100524/DI" xmlns:dc="http://www.omg.org/spec/DD/20100524/DC" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:camunda="http://camunda.org/schema/1.0/bpmn" id="Definitions_0qjsjhs" targetNamespace="http://bpmn.io/schema/bpmn" exporter="Camunda Modeler" exporterVersion="2.2.4">
<bpmn:process id="Process_1" isExecutable="true">
<bpmn:startEvent id="StartEvent_1">
<bpmn:outgoing>SequenceFlow_0bwuv8k</bpmn:outgoing>
</bpmn:startEvent>
<bpmn:sequenceFlow id="SequenceFlow_0bwuv8k" sourceRef="StartEvent_1" targetRef="Task_07owcwp" />
<bpmn:endEvent id="EndEvent_13n8600">
<bpmn:incoming>SequenceFlow_111n6oc</bpmn:incoming>
</bpmn:endEvent>
<bpmn:sequenceFlow id="SequenceFlow_111n6oc" sourceRef="Task_07owcwp" targetRef="EndEvent_13n8600">
<bpmn:conditionExpression xsi:type="bpmn:tFormalExpression">myexpression.</bpmn:conditionExpression>
</bpmn:sequenceFlow>
<bpmn:scriptTask id="Task_07owcwp" scriptFormat="testformat" camunda:resultVariable="output">
<bpmn:incoming>SequenceFlow_0bwuv8k</bpmn:incoming>
<bpmn:outgoing>SequenceFlow_111n6oc</bpmn:outgoing>
<bpmn:script>myscript</bpmn:script>
</bpmn:scriptTask>
</bpmn:process>
<bpmndi:BPMNDiagram id="BPMNDiagram_1">
<bpmndi:BPMNPlane id="BPMNPlane_1" bpmnElement="Process_1">
<bpmndi:BPMNShape id="_BPMNShape_StartEvent_2" bpmnElement="StartEvent_1">
<dc:Bounds x="173" y="102" width="36" height="36" />
</bpmndi:BPMNShape>
<bpmndi:BPMNEdge id="SequenceFlow_0bwuv8k_di" bpmnElement="SequenceFlow_0bwuv8k">
<di:waypoint x="209" y="120" />
<di:waypoint x="266" y="120" />
</bpmndi:BPMNEdge>
<bpmndi:BPMNShape id="EndEvent_13n8600_di" bpmnElement="EndEvent_13n8600">
<dc:Bounds x="423" y="102" width="36" height="36" />
</bpmndi:BPMNShape>
<bpmndi:BPMNEdge id="SequenceFlow_111n6oc_di" bpmnElement="SequenceFlow_111n6oc">
<di:waypoint x="366" y="120" />
<di:waypoint x="423" y="120" />
</bpmndi:BPMNEdge>
<bpmndi:BPMNShape id="ScriptTask_1p2sdkp_di" bpmnElement="Task_07owcwp">
<dc:Bounds x="266" y="80" width="100" height="80" />
</bpmndi:BPMNShape>
</bpmndi:BPMNPlane>
</bpmndi:BPMNDiagram>
</bpmn:definitions>
I am currently at the 'bpmn:scriptTask' element and I am having trouble working out how to extract the value of 'camunda:resultVariable':
<bpmn:scriptTask id="Task_07owcwp" scriptFormat="testformat" camunda:resultVariable="output">
<bpmn:incoming>SequenceFlow_0bwuv8k</bpmn:incoming>
<bpmn:outgoing>SequenceFlow_111n6oc</bpmn:outgoing>
<bpmn:script>myscript</bpmn:script>
</bpmn:scriptTask>
I have tried
node.find('camunda:resultVariable', {'camunda': 'http://camunda.org/schema/1.0/bpmn'})
where node is the ElementTree element for scriptTask. Do find or findall allow you to look at the current element's attributes? It doesn't seem to be finding it. How would I go about finding this value?
Thank you.
OK, I worked it out. I can use .get, but I need to expand the namespace like so:
node.get('{http://camunda.org/schema/1.0/bpmn}resultVariable')
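For anyone who wants to try this out, here is a minimal sketch assuming a trimmed-down copy of the BPMN document above (the NS dictionary and variable names are my own). find and findall select elements, while attributes are read with get, and a namespaced attribute needs its name in {uri}local form:

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical copy of the BPMN document from the question.
xml_text = """<bpmn:definitions
    xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL"
    xmlns:camunda="http://camunda.org/schema/1.0/bpmn">
  <bpmn:process id="Process_1">
    <bpmn:scriptTask id="Task_07owcwp" scriptFormat="testformat" camunda:resultVariable="output">
      <bpmn:script>myscript</bpmn:script>
    </bpmn:scriptTask>
  </bpmn:process>
</bpmn:definitions>"""

NS = {
    "bpmn": "http://www.omg.org/spec/BPMN/20100524/MODEL",
    "camunda": "http://camunda.org/schema/1.0/bpmn",
}

root = ET.fromstring(xml_text)
node = root.find(".//bpmn:scriptTask", NS)  # prefix mapping works for element lookups

# Namespaced attribute: expand the prefix to {uri}name by hand.
result_variable = node.get("{%s}resultVariable" % NS["camunda"])
# Attribute without a namespace: plain name.
script_format = node.get("scriptFormat")
print(result_variable, script_format)
```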

How to insert a sub-node for a particular node in XML using cElementTree in python

Below is the requirement to convert a BNF-form grammar into XML.
input:
define program
[repeat statement]
end define
define statement
[includeStatement]
| [keysStatement]
| [compoundsStatement]
| [commentsStatement]
| [tokensStatement]
| [defineStatement]
| [redefineStatement]
| [ruleStatement]
| [functionStatement]
| [externalStatement]
| [comment] [NL]
end define
expected output:
<Feature>
<program>
<statement>
<includeStatement />
<keysStatement />
<compoundsStatement />
<commentsStatement />
<tokensStatement />
<defineStatement />
<redefineStatement />
<ruleStatement />
<functionStatement />
<externalStatement />
<comment />
<NL />
</statement>
</program>
</Feature>
actual output:
<Feature>
<program>
<statement />
</program>
</Feature>
Below is a function from my code. ET.SubElement(parent, ) works in one section but not in another; the reason could be that ET.Element(nonTmnl) returns a new value rather than a reference to the existing node. I have commented the code with my findings. I would appreciate any suggestions on how to get access to an existing node in the XML so that I can insert a child node into it.
import xml.etree.cElementTree as ET
def getNonTerminal (strline):
    wordList=''
    global parent
    if re.match('define \w',strline):
        nonTmnl = strline.replace('define ','')
        nonTmnl = nonTmnl.replace('\n','')
        nonTmnl = nonTmnl.replace(' ','')
        if nonTmnl not in nonterminals:
            child = ET.SubElement(parent, nonTmnl) #This line is working Problem line 2 not working and has a dependency on problem line 1 parent = child
            nonterminals.append(nonTmnl)
        else:
            parent = ET.Element(nonTmnl) #Problem line1: Here I am searching for a node under which I want to insert a new sub-node
        return;
    if re.match('.*\[.*\].*',strline):
        strline = strline.replace('\'[','')
        while (re.match('.*\[.*\].*',strline)):
            wordList = extractWords(strline)
            strList = wordList.split(' ')
            for item in strList:
                if item not in TXLtoken and item not in TXLunparseElement and item not in TXLmodifier and item not in TXLother and item not in nonterminals :
                    if not item.startswith('\''):
                        item = item.replace(' ','')
                        while(item[-1] in TXLmodifier):
                            item = item[:-1]
                        nonterminals.append(item)
                        child = ET.SubElement(parent, item) #Problem line2: Here I am adding the subnode. While debugging I see it adds to the parent node(variable), but it never reflects in final XML.
            strline = strline.replace('['+wordList+']','',1)
    return;
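One likely cause, judging only from the code shown: ET.Element(nonTmnl) does not search the tree, it creates a brand-new element that is not attached to anything, so children added to it never show up in the final XML. A minimal sketch of one way around this, assuming a lookup table from nonterminal name to its element (the nodes dictionary and the get_or_create helper are hypothetical names, not part of the original code):

```python
import xml.etree.ElementTree as ET

root = ET.Element("Feature")
nodes = {}  # hypothetical lookup table: nonterminal name -> element already in the tree


def get_or_create(parent, name):
    """Return the element already created for name, or attach a new one under parent."""
    if name not in nodes:
        nodes[name] = ET.SubElement(parent, name)
    return nodes[name]


program = get_or_create(root, "program")
statement = get_or_create(program, "statement")  # first seen as a child of program

# Later, when "define statement" is read, look the node up instead of calling ET.Element.
parent = get_or_create(root, "statement")  # returns the element that is already in the tree
for child_name in ("includeStatement", "keysStatement", "comment", "NL"):
    get_or_create(parent, child_name)

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

Because parent now refers to the element that is already part of the tree, the added children appear under statement in the serialized output.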

Python parse XML has multiple root

I failed to parse an XML file (it is a GC history). A sample of the XML is shown below.
<?xml version="1.0" ?>
<verbosegc xmlns="http://www.ibm.com/j9/verbosegc" version="R28_jvm.28_20150612_0201_B252774_CMPRSS">
<initialized id="1" timestamp="2015-12-04T20:17:07.219">
<attribute name="gcPolicy" value="-Xgcpolicy:gencon" />
<attribute name="maxHeapSize" value="0x20000000" />
<attribute name="initialHeapSize" value="0x400000" />
</initialized>
<cycle-start id="4" type="scavenge" contextid="0" timestamp="2015-12-04T20:17:10.677" intervalms="3457.977" />
<gc-start id="5" type="scavenge" contextid="4" timestamp="2015-12-04T20:17:10.677">
<mem-info id="6" free="3037768" total="4194304" percent="72">
</mem-info>
</gc-start>
<gc-end id="8" type="scavenge" contextid="4" durationms="0.807" usertimems="0.000" systemtimems="0.000" timestamp="2015-12-04T20:17:10.678" activeThreads="2">
<mem-info id="9" free="3163968" total="4194304" percent="75">
</mem-info>
</gc-end>
<cycle-end id="10" type="scavenge" contextid="4" timestamp="2015-12-04T20:17:10.678" />
<cycle-start id="16" type="scavenge" contextid="0" timestamp="2015-12-04T20:17:10.742" intervalms="64.838" />
<gc-start id="17" type="scavenge" contextid="16" timestamp="2015-12-04T20:17:10.742">
<mem-info id="18" free="3037664" total="4194304" percent="72">
</mem-info>
</gc-start>
<gc-end id="20" type="scavenge" contextid="16" durationms="0.649" usertimems="0.000" systemtimems="0.000" timestamp="2015-12-04T20:17:10.743" activeThreads="2">
<mem-info id="21" free="3110592" total="4194304" percent="74">
</mem-info>
</gc-end>
<cycle-end id="22" type="scavenge" contextid="16" timestamp="2015-12-04T20:17:10.743" />
<allocation-satisfied id="23" threadId="0000000002E10500" bytesRequested="416" />
</verbosegc>
I want to extract the free attribute of mem-info in gc-start and gc-end, both of which sit between the cycle-start and cycle-end tags and share the same contextid. For example, the first two mem-info free values are 3037768 and 3163968, and the corresponding contextid is 4, which equals the cycle-start id. With this data I can plot the memory footprint.
The main problem is that I could not parse the XML successfully with the method described in "XML parse python". getroot works, but every find/findall call returns an empty result. Is there another solution for this? Thanks.
Here is what I tried:
>>> tree = ET.parse('gc.trace')
>>> tree
<xml.etree.ElementTree.ElementTree object at 0x7fdfaddc19d0>
>>> root=tree.getroot()
>>> root
<Element '{http://www.ibm.com/j9/verbosegc}verbosegc' at 0x7fdfaddc1a90>
>>> cycle_start = root.findall('cycle-start')
>>> cycle_start
[] ; Empty???
>>> cycle_start = root.findall('mem-info')
>>> print cycle_start
[] ;Empty???
>>>
>>> cycle_start = root.find('mem-info')
>>> cycle_start
>>> print cycle_start
None
from lxml import etree
tree = etree.parse("gc.log")
root = tree.getroot()
>>> root.findall('mem-info', root.nsmap)
>>> root.nsmap
{None: 'http://www.ibm.com/j9/verbosegc'}
That's because your XML has a default namespace here:
xmlns="http://www.ibm.com/j9/verbosegc"
Notice that descendant elements inherit their ancestor's default namespace implicitly. You can use a prefix-to-namespace mapping to select elements in that namespace, for example:
ns = {'d': 'http://www.ibm.com/j9/verbosegc'}
cycle_starts = root.findall('d:cycle-start', namespaces=ns)
print(cycle_starts)
mem_infos = root.findall('d:gc-start/d:mem-info', namespaces=ns)
print(mem_infos)
Output:
[<Element '{http://www.ibm.com/j9/verbosegc}cycle-start' at 0x29ae6a0>, <Element '{http://www.ibm.com/j9/verbosegc}cycle-start' at 0x29ae8d0>]
[<Element '{http://www.ibm.com/j9/verbosegc}mem-info' at 0x29ae780>, <Element '{http://www.ibm.com/j9/verbosegc}mem-info' at 0x29ae9b0>]
Update:
Responding to your comment, this is one possible way to avoid hard-coding the namespace (root.nsmap is available in lxml, not in the standard-library ElementTree):
#map default namespace uri to prefix d without hard-coding:
ns = {'d': root.nsmap[None]}
result = root.findall('.//d:mem-info', namespaces=ns)
As an aside, I'd suggest using lxml's xpath() method instead of findall(), since the former offers better support for standard XPath 1.0 expressions, which will be useful in more complex situations.
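To tie this back to the original goal, here is a minimal sketch using only the standard-library ElementTree. It assumes a trimmed copy of the trace from the question (attributes that are not needed are dropped) and groups the mem-info free values of gc-start and gc-end by contextid; the free_by_cycle name is my own:

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical copy of the verbosegc trace from the question.
xml_text = """<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
<cycle-start id="4" type="scavenge" contextid="0" />
<gc-start id="5" type="scavenge" contextid="4"><mem-info id="6" free="3037768" total="4194304" /></gc-start>
<gc-end id="8" type="scavenge" contextid="4"><mem-info id="9" free="3163968" total="4194304" /></gc-end>
<cycle-end id="10" type="scavenge" contextid="4" />
<cycle-start id="16" type="scavenge" contextid="0" />
<gc-start id="17" type="scavenge" contextid="16"><mem-info id="18" free="3037664" total="4194304" /></gc-start>
<gc-end id="20" type="scavenge" contextid="16"><mem-info id="21" free="3110592" total="4194304" /></gc-end>
<cycle-end id="22" type="scavenge" contextid="16" />
</verbosegc>"""

ns = {"d": "http://www.ibm.com/j9/verbosegc"}
root = ET.fromstring(xml_text)

# contextid -> {"gc-start": free bytes, "gc-end": free bytes}
free_by_cycle = {}
for phase in ("gc-start", "gc-end"):
    for gc in root.findall("d:" + phase, ns):
        mem = gc.find("d:mem-info", ns)
        free_by_cycle.setdefault(gc.get("contextid"), {})[phase] = int(mem.get("free"))

for context_id, values in free_by_cycle.items():
    print(context_id, values["gc-start"], values["gc-end"])
```

Each entry then holds the free memory before and after one scavenge cycle, which is the pair of values needed for plotting.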
