XML file parsing - Get data from each parent and their own children

XML file parsing - Get data from each parent and their own children - python

I would like to get data from each parent and their own children fro an XML file.
I'm trying to parse this XML file
<DB>
<Entry>
<Name>Assembly.iam</Name>
<DisplayName>Assembly.iam</DisplayName>
<Scalar>
<Name>d0</Name>
<DisplayName>d0 (value = 0 mm)</DisplayName>
<Value>0</Value>
</Scalar>
<Scalar>
<Name>d1</Name>
<DisplayName>d1 (value = 0 mm)</DisplayName>
<Value>0</Value>
</Scalar>
</Entry>
<Entry>
<Name>Ground.ipt</Name>
<DisplayName>Ground.ipt</DisplayName>
<Scalar>
<Name>Ground_length</Name>
<DisplayName>Ground_length (value = 160 mm)</DisplayName>
<Value>160</Value>
</Scalar>
<Scalar>
<Name>d2</Name>
<DisplayName>d2 (value = 80 mm)</DisplayName>
<Value>80</Value>
</Scalar>
</Entry>
</DB>
In fact, I would like to get the data which are into <DisplayName></DisplayName>.
Then, I would like to put that data into an array of tuples like this
[(Assembly.iam,[d0 (value = 0 mm), d1 (value = 0 mm)]),
(Ground.ipt,[Ground_length (value = 160 mm), d2 (value = 80 mm)])
I have tried to use the xml.etree.cElementTree library with this code
from xml.etree import cElementTree
import numpy as np
workingDir = "C:/Users/Vince/Test"
newStrWorkingDir = str.replace(workingDir, '/', '\\')
tree = cElementTree.parse(newStrWorkingDir + "\\test.xml")
root = tree.getroot()
tab = np.empty(shape=(0, 0))
tabEntry = np.empty(shape=(0, 0))
tabScalar = np.empty(shape=(0, 0))
for entry in root.findall('Entry'):
entryNames = entry.findall("./DisplayName")
entryNamesText = entry.find("./DisplayName").text
tabEntry = np.append(tabEntry,entryNamesText)
for scalar in entry.findall('Scalar'):
scalarNames = scalar.findall("./DisplayName")
scalarNamesText = scalar.find("./DisplayName").text
tabScalar = np.append(tabScalar,scalarNamesText)
tab = np.append(tab,(entryNamesText,scalarNamesText))
print(tab)
But it outputs me this
['Assembly.iam' 'd0 (value = 0 mm)'
'Assembly.iam' 'd1 (value = 0 mm)'
'Ground.ipt' 'Ground_length (value = 160 mm)'
'Ground.ipt' 'd2 (value = 80 mm)']

To get your wanted structure, you have to build lists of lists:
import os
from xml.etree import cElementTree
workingDir = "C:\\Users\\Vince\\Test"
tree = cElementTree.parse(os.path.join(newStrWorkingDir, "test.xml"))
root = tree.getroot()
tab = []
for entry in root.findall('Entry'):
entry_name = entry.findtext("./DisplayName")
scalar_names = [e.text for e in entry.findall('Scalar/DisplayName')]
tab.append((entry_name, scalar_names))
print(tab)

Related

What is the best way to parse large XML and genarate a dataframe with the data in the XML (with python or else)?

I try to make a table (or csv, I'm using pandas dataframe) from the information of an XML file.
The file is here (.zip is 14 MB, XML is ~370MB), https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip . It has package information of different languages - node.js, python, java etc. aka, CPE 2.3 list by the US government org NVD.
this is how it looks like in the first 30 rows:
<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
<generator>
<product_name>National Vulnerability Database (NVD)</product_name>
<product_version>4.9</product_version>
<schema_version>2.3</schema_version>
<timestamp>2022-03-17T03:51:01.909Z</timestamp>
</generator>
<cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
<title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
<references>
<reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
<reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
</references>
<cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
</cpe-item>
The tree structure of the XML file is quite simple, the root is 'cpe-list', the child element is 'cpe-item', and the grandchild elements are 'title', 'references' and 'cpe23-item'.
From 'title', I want the text in the element;
From 'cpe23-item', I want the attribute 'name';
From 'references', I want the attributes 'href' from its great-grandchildren, 'reference'.
The dataframe should look like this:
| cpe23_name | title_text | ref1 | ref2 | ref3 | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN | NaN
1 | 'cpe23name 2'| 'this is a java pkg' | 'url1'| 'url2'| NaN | NaN
...
my code is here，finished in ~100sec：
import xml.etree.ElementTree as et
xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()
import time
start_time = time.time()
df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", 'others']
title = '{http://cpe.mitre.org/dictionary/2.0}title'
ref = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'
p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
rows = []
i = 0
while i < len(xroot):
for elm in xroot[i]:
if elm.tag == title:
p_text = elm.text
#assign p_text
elif elm.tag == ref:
for nn in elm:
s = nn.text.lower()
#check the lower text in refs
if 'version' in s:
p_vers = nn.attrib.get('href')
#assign p_vers
elif 'advisor' in s:
p_advi = nn.attrib.get('href')
#assign p_advi
elif 'product' in s:
p_prod = nn.attrib.get('href')
#assign p_prod
elif 'vendor' in s:
p_vend = nn.attrib.get('href')
#assign p_vend
elif 'change' in s:
p_chan = nn.attrib.get('href')
#assign p_vend
else:
p_othe = nn.attrib.get('href')
elif elm.tag == cpe_item:
p_cpe = elm.attrib.get("name")
#assign p_cpe
else:
print(elm.tag)
row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
rows.append(row)
p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
print(len(rows)) #this shows how far I got during the running time
i+=1
out_df1 = pd.DataFrame(rows, columns = df_cols)# move this part outside the loop by removing the indent
print("---853k rows take %s seconds ---" % (time.time() - start_time))
updated: the faster way is to move the 2nd last row out side the loop. Since 'rows' already get info in each loop, there is no need to make a new dataframe every time.
the running time now is 136.0491042137146 seconds. yay!

Since your XML is fairly flat, consider the recently added IO module, pandas.read_xml introduced in v1.3. Given XML uses a default namespace, to reference elements in xpath use namespaces argument:
url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)
If you do not have the default parser, lxml, installed, use the etree parser:
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)

different return types for getpath() in lxml

I have folders full of XML files which I want to parse to a dataframe. The following functions iterate through an XML tree recursively and return a dataframe with three columns: path, attributes and text.
def XML2DF(filename,df1,MAX_DEPTH=20):
with open(filename) as f:
xml_str = f.read()
tree = etree.fromstring(xml_str)
df1 = recursive_parseXML2DF(tree, df1, MAX_DEPTH=MAX_DEPTH)
return
def recursive_parseXML2DF(element, df1, depth=0, MAX_DEPTH=20):
if depth > MAX_DEPTH:
return df1
df2 = pd.DataFrame([[element.getroottree().getpath(element), element.attrib, element.text]],
columns=["path", "attrib", "text"])
#print(df2)
df1 = pd.concat([df1, df2])
for child in element.getchildren():
df1 = recursive_parseXML2DF(child, df1, depth=depth + 1)
return df1
The code for the function was adapted from this post.
Most of the times the function works fine and returns the entire path but for some documents the returned path looks like this:
/*/*[1]/*[3]
/*/*[1]/*[3]/*[1]
The text tag entry remains valid and correct.
The only difference in the XML between working path and widlcard path documents I can make out is that the XML tags are written in all caps.
Working example:
<?xml version="1.0" encoding="utf-8"?>
<root>
<Header>
<ReceivingApplication>ReceivingApplication</ReceivingApplication>
<SendingApplication>SendingApplication</SendingApplication>
<MessageControlID>12345</MessageControlID>
<ReceivingApplication>ReceivingApplication</ReceivingApplication>
<FileCreationDate>2000-01-01T00:00:00</FileCreationDate>
</Header>
<Einsendung>
<Patient>
<PatientName>Name</PatientName>
<PatientVorname>FirstName</PatientVorname>
<PatientGebDat>2000-01-01T00:00:00</PatientGebDat>
<PatientSex>4</PatientSex>
<PatientPWID>123456</PatientPWID>
</Patient>
<Visit>
<VisitNumber>A2000.0001</VisitNumber>
<PatientPLZ>1234</PatientPLZ>
<PatientOrt>PatientOrt</PatientOrt>
<PatientAdr2>
</PatientAdr2>
<PatientStrasse>PatientStrasse 01</PatientStrasse>
<VisitEinsID>1234</VisitEinsID>
<VisitBefund>VisitBefund</VisitBefund>
<Befunddatum>2000-01-01T00:00:00</Befunddatum>
</Visit>
</Einsendung>
</root>
nonsensical Example:
<?xml version="1.0"?>
<KRSCHWEIZ xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="krSCHWEIZ">
<KEY_VS>abcdefg</KEY_VS>
<KEY_KLR>abcdefg</KEY_KLR>
<ABSENDER>
<ABSENDER_MELDER_ID>123456</ABSENDER_MELDER_ID>
<MELDER>
<MELDER_ID>123456</MELDER_ID>
<QUELLSYSTEM>ABCDEF</QUELLSYSTEM>
<PATIENT>
<REFERENZNR>987654</REFERENZNR>
<NACHNAME>my name</NACHNAME>
<VORNAMEN>my first name</VORNAMEN>
<GEBURTSNAME />
<GEBURTSDATUM>my dob</GEBURTSDATUM>
<GESCHLECHT>XX</GESCHLECHT>
<PLZ>9999</PLZ>
<WOHNORT>Mycity</WOHNORT>
<STRASSE>mystreet</STRASSE>
<HAUSNR>99</HAUSNR>
<VERSICHERTENNR>999999999</VERSICHERTENNR>
<DATEIEN>
<DATEI>
<DATEINAME>my_attached_document.html</DATEINAME>
<DATEIBASE64>mybase_64_encoded_document</DATEIBASE64>
</DATEI>
</DATEIEN>
</PATIENT>
</MELDER>
</ABSENDER>
</KRSCHWEIZ>
How do I get correct explicit path information also for this case?

The prescence of namespaces changes the output of .getpath() - you can use .getelementpath() instead which will include the namespace prefix instead of using wildcards.
If the prefix should be discarded completely - you can strip them out before using .getpath()
import lxml.etree
import pandas as pd
rows = []
tree = lxml.etree.parse("broken.xml")
for node in tree.iter():
try:
node.tag = lxml.etree.QName(node).localname
except ValueError:
# skip tags with no name
continue
rows.append([tree.getpath(node), node.attrib, node.text])
df = pd.DataFrame(rows, columns=["path", "attrib", "text"])
Resulting dataframe:
>>> df
path attrib text
0 /KRSCHWEIZ [] \n
1 /KRSCHWEIZ/KEY_VS [] abcdefg
2 /KRSCHWEIZ/KEY_KLR [] abcdefg
3 /KRSCHWEIZ/ABSENDER [] \n
4 /KRSCHWEIZ/ABSENDER/ABSENDER_MELDER_ID [] 123456
5 /KRSCHWEIZ/ABSENDER/MELDER [] \n
6 /KRSCHWEIZ/ABSENDER/MELDER/MELDER_ID [] 123456
7 /KRSCHWEIZ/ABSENDER/MELDER/QUELLSYSTEM [] ABCDEF
8 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT [] \n
9 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/REFERENZNR [] 987654
10 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/NACHNAME [] my name
11 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/VORNAMEN [] my first name
12 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GEBURTSNAME [] None
13 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GEBURTSDATUM [] my dob
14 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GESCHLECHT [] XX
15 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/PLZ [] 9999
16 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/WOHNORT [] Mycity
17 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/STRASSE [] mystreet
18 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/HAUSNR [] 99
19 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/VERSICHERTENNR [] 999999999
20 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN [] \n
21 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DATEI [] \n
22 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DAT... [] my_attached_document.html
23 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DAT... [] mybase_64_encoded_document

Parse xml file with python

I have this simple xml file:
<BSB>
<APPLSUMMARY>
<MAIN W="S1" X="{ND}"/>
<COUNTS Z="0" AB="0" BB="0" CB="0" DB="0" EB="0" FB="0" GB="{ND}"/>
<SCOTDEBT OQB="{ND}"/>
<NOTICES HB="0" IB="3"/>
<SUB_BLOCKS C="3" D="3" E="1" F="0"/>
<ALIAS_NO UPB="0" VPB="{ND}" WPB="0"/>
<ASSOC_NO DD="0" ED="0" AC="0"/>
<ALERTSUMM PB="0" QB="0" RB="{ND}" SB="{ND}" TB="{ND}" UB="{ND}"/>
<HHOSUMM BC="{ND}" RGB="{ND}"/>
<TPD INB="{ND}" JNB="{ND}" KNB="{ND}" LNB="{ND}"/>
<OCCUPANCY AD="1"/>
<DECEASED LQB="1" FCC="{ND}" GCC="{ND}" HCC="{ND}" ICC="{ND}"/>
<IMPAIRED MQB="0"/>
<ACTIVITY JCC="{ND}" KCC="{ND}" LCC="{ND}"/>
<ADVERSE MCC="{ND}" HHC="{ND}"/>
</APPLSUMMARY>
</BSB>
I want to create in python a csv file that contains only the DECEASED contents in columns like this:
So, I am trying to get the values of the DECEASED bit and align them in columns.
I have tried this:
import xml.etree.ElementTree as ET
import io
parsed = objectify.parse(open(path)) // path is where the xml file is saved
root = parsed.getroot()
data = []
for elt in root.BSB.DECEASED:
el_data = {}
for child in elt.getchildren():
el_data[child.tag] = child.text
data.append(el_data)
perf =pd.DataFrame(data).drop_duplicates(subset=None, keep='first', inplace=False)
print(perf)
perf.to_csv('DECESEAD.csv')
I get an empty dataset:
Empty DataFrame
Columns: []
Index: []
Can anyone help me get the values inside the DECEASED tag, please?

The code below collects the data you are looking for
import xml.etree.ElementTree as ET
from typing import Dict
xml = '''<BSB>
<APPLSUMMARY>
<MAIN W="S1" X="{ND}"/>
<COUNTS Z="0" AB="0" BB="0" CB="0" DB="0" EB="0" FB="0" GB="{ND}"/>
<SCOTDEBT OQB="{ND}"/>
<NOTICES HB="0" IB="3"/>
<SUB_BLOCKS C="3" D="3" E="1" F="0"/>
<ALIAS_NO UPB="0" VPB="{ND}" WPB="0"/>
<ASSOC_NO DD="0" ED="0" AC="0"/>
<ALERTSUMM PB="0" QB="0" RB="{ND}" SB="{ND}" TB="{ND}" UB="{ND}"/>
<HHOSUMM BC="{ND}" RGB="{ND}"/>
<TPD INB="{ND}" JNB="{ND}" KNB="{ND}" LNB="{ND}"/>
<OCCUPANCY AD="1"/>
<DECEASED LQB="1" FCC="{ND}" GCC="{ND}" HCC="{ND}" ICC="{ND}"/>
<IMPAIRED MQB="0"/>
<ACTIVITY JCC="{ND}" KCC="{ND}" LCC="{ND}"/>
<ADVERSE MCC="{ND}" HHC="{ND}"/>
</APPLSUMMARY>
</BSB>'''
def _clean_dict(attributes: Dict) -> Dict:
result = {}
for k, v in attributes.items():
if v[0] == '{':
val = v[1:-1]
else:
val = v
result[k] = val
return result
data = []
root = ET.fromstring(xml)
for d in root.findall('.//DECEASED'):
data.append(_clean_dict(d.attrib))
print(data)
output (list of dicts)
[{'LQB': '1', 'FCC': 'ND', 'GCC': 'ND', 'HCC': 'ND', 'ICC': 'ND'}]

Python and products file XML

I'm using python and I need to find sku, min-order-qty and step-quantity for each occurence of sku.
Input file is:
<product sku="1235997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
....
</product>
....
I try to use lxml but fail to get min-order-qty and step-quantity
from lxml import etree
tree = etree.parse('./ST2CleanCourt.xml')
elem = tree.getroot()
for child in elem:
print (child.attrib["sku"])
I tried to use the 2 solutions below. It works but I need to read the file so I write
from lxml import etree
import codecs
f=codecs.open('./ST2CleanCourt.xml','r','utf-8')
fichier = f.read()
tree = etree.fromstring(fichier)
for child in tree:
print ('sku:', child.attrib['sku'])
print ('min:', child.find('.//min-order-quantity').text)
and I always get this error
print ('min:', child.find('.//min-order-quantity').text)
AttributeError: 'NoneType' object has no attribute 'text'
what is wrong ?

You can use the xpath method to get the required values.
Example:
from lxml import etree
a = """<product sku="1235997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
"""
tree = etree.fromstring(a)
tags = tree.xpath('/product')
for b in tags:
print b.attrib["sku"]
min_order = b.xpath("//quantity/min-order-quantity")
print min_order[0].text
step_quality = b.xpath("//quantity/step-quantity")
print step_quality[0].text
Output:
1235997403
1
1

Using more then 1 product and root node of products you can find this:
x = """
<products>
<product sku="1235997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
<product sku="997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>5</min-order-quantity>
<step-quantity>7</step-quantity>
</quantity>
</product>
</products>
"""
from lxml import etree
tree = etree.fromstring(x)
for child in tree:
print ("sku:", child.attrib["sku"])
print ("min:", child.find(".//min-order-quantity").text) # looks for node below
print ("step:" ,child.find(".//step-quantity").text) # child with the given name
Essentially you look for any node below child that has the correct name and print its text.
Output:
sku:1235997403
min:1
step:1
sku:997403
min:5
step:7
Doku: http://lxml.de/tutorial.html#elementpath

XML parsing with XMLtree or MINIDOM

I have a xml file, and in the middle of it I have a block like this:
...
<node id = "1" >
<ngh id = "2" > 100 </ngh>
<ngh id = "3"> 300 </ngh>
</node>
<node id = "2">
<ngh id = "1" > 400 </ngh>
<ngh id = "3"> 500 </ngh>
</node>
...
and trying to get
1, 2, 100
1, 3, 300
2, 1, 400
2, 3, 500
...
I found a similar question and did the following
from xml.dom import minidom
xmldoc = minidom.parse('file.xml')
nodelist = xmldoc.getElementsByTagName('node')
for s in nodelist:
print s.attributes['id'].value)
is there a way to get i get the values between tags (i.e. 100, 300, 400) ?

You need an inner loop over ngh elements:
from xml.dom import minidom
xmldoc = minidom.parse('file.xml')
nodes = xmldoc.getElementsByTagName('node')
for node in nodes:
node_id = node.attributes['id'].value
for ngh in node.getElementsByTagName('ngh'):
ngh_id = ngh.attributes['id'].value
ngh_text = ngh.firstChild.nodeValue
print node_id, ngh_id, ngh_text
Prints:
1 2 100
1 3 300
2 1 400
2 3 500

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

XML file parsing - Get data from each parent and their own children - python

Related

What is the best way to parse large XML and genarate a dataframe with the data in the XML (with python or else)?

different return types for getpath() in lxml

Parse xml file with python

Python and products file XML

XML parsing with XMLtree or MINIDOM

Categories

Resources