XML parsing with XMLtree or MINIDOM - python

I have an XML file, and in the middle of it there is a block like this:
...
<node id = "1" >
<ngh id = "2" > 100 </ngh>
<ngh id = "3"> 300 </ngh>
</node>
<node id = "2">
<ngh id = "1" > 400 </ngh>
<ngh id = "3"> 500 </ngh>
</node>
...
and I am trying to get
1, 2, 100
1, 3, 300
2, 1, 400
2, 3, 500
...
I found a similar question and did the following:
from xml.dom import minidom
xmldoc = minidom.parse('file.xml')
nodelist = xmldoc.getElementsByTagName('node')
for s in nodelist:
    print(s.attributes['id'].value)
Is there a way to get the values between the tags (i.e. 100, 300, 400)?

You need an inner loop over the ngh elements:
from xml.dom import minidom
xmldoc = minidom.parse('file.xml')
nodes = xmldoc.getElementsByTagName('node')
for node in nodes:
    node_id = node.attributes['id'].value
    for ngh in node.getElementsByTagName('ngh'):
        ngh_id = ngh.attributes['id'].value
        # the text node keeps the spaces around the number, so strip them
        ngh_text = ngh.firstChild.nodeValue.strip()
        print(node_id, ngh_id, ngh_text)
Prints:
1 2 100
1 3 300
2 1 400
2 3 500
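The same traversal can also be written with the standard library's xml.etree.ElementTree, which the question title seems to refer to. This is only a minimal sketch: the <graph> wrapper element and the SAMPLE string are made up for illustration, because the real file's root element is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: the real file's root element is not shown in the question.
SAMPLE = """<graph>
<node id = "1" >
<ngh id = "2" > 100 </ngh>
<ngh id = "3"> 300 </ngh>
</node>
<node id = "2">
<ngh id = "1" > 400 </ngh>
<ngh id = "3"> 500 </ngh>
</node>
</graph>"""

root = ET.fromstring(SAMPLE)
rows = []
# iter() finds <node> elements at any depth, much like getElementsByTagName
for node in root.iter("node"):
    # findall() with a bare tag name only looks at direct children
    for ngh in node.findall("ngh"):
        rows.append((node.get("id"), ngh.get("id"), ngh.text.strip()))

for row in rows:
    print(", ".join(row))
```

For a real file, ET.parse('file.xml').getroot() would take the place of ET.fromstring(SAMPLE).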

Related

different return types for getpath() in lxml

I have folders full of XML files which I want to parse to a dataframe. The following functions iterate through an XML tree recursively and return a dataframe with three columns: path, attributes and text.
import pandas as pd
from lxml import etree
def XML2DF(filename, df1, MAX_DEPTH=20):
    # read bytes: lxml rejects str input that carries an encoding declaration
    with open(filename, "rb") as f:
        xml_bytes = f.read()
    tree = etree.fromstring(xml_bytes)
    # return the filled dataframe to the caller
    return recursive_parseXML2DF(tree, df1, MAX_DEPTH=MAX_DEPTH)
def recursive_parseXML2DF(element, df1, depth=0, MAX_DEPTH=20):
    if depth > MAX_DEPTH:
        return df1
    df2 = pd.DataFrame([[element.getroottree().getpath(element), element.attrib, element.text]],
                       columns=["path", "attrib", "text"])
    #print(df2)
    df1 = pd.concat([df1, df2])
    for child in element:
        # pass MAX_DEPTH along so a custom limit also applies to the children
        df1 = recursive_parseXML2DF(child, df1, depth=depth + 1, MAX_DEPTH=MAX_DEPTH)
    return df1
The code for the function was adapted from this post.
Most of the time the function works fine and returns the entire path, but for some documents the returned path looks like this:
/*/*[1]/*[3]
/*/*[1]/*[3]/*[1]
The text entry remains valid and correct.
The only difference I can make out between the XML of documents with working paths and documents with wildcard paths is that the XML tags are written in all caps.
Working example:
<?xml version="1.0" encoding="utf-8"?>
<root>
<Header>
<ReceivingApplication>ReceivingApplication</ReceivingApplication>
<SendingApplication>SendingApplication</SendingApplication>
<MessageControlID>12345</MessageControlID>
<ReceivingApplication>ReceivingApplication</ReceivingApplication>
<FileCreationDate>2000-01-01T00:00:00</FileCreationDate>
</Header>
<Einsendung>
<Patient>
<PatientName>Name</PatientName>
<PatientVorname>FirstName</PatientVorname>
<PatientGebDat>2000-01-01T00:00:00</PatientGebDat>
<PatientSex>4</PatientSex>
<PatientPWID>123456</PatientPWID>
</Patient>
<Visit>
<VisitNumber>A2000.0001</VisitNumber>
<PatientPLZ>1234</PatientPLZ>
<PatientOrt>PatientOrt</PatientOrt>
<PatientAdr2>
</PatientAdr2>
<PatientStrasse>PatientStrasse 01</PatientStrasse>
<VisitEinsID>1234</VisitEinsID>
<VisitBefund>VisitBefund</VisitBefund>
<Befunddatum>2000-01-01T00:00:00</Befunddatum>
</Visit>
</Einsendung>
</root>
Example that produces wildcard paths:
<?xml version="1.0"?>
<KRSCHWEIZ xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="krSCHWEIZ">
<KEY_VS>abcdefg</KEY_VS>
<KEY_KLR>abcdefg</KEY_KLR>
<ABSENDER>
<ABSENDER_MELDER_ID>123456</ABSENDER_MELDER_ID>
<MELDER>
<MELDER_ID>123456</MELDER_ID>
<QUELLSYSTEM>ABCDEF</QUELLSYSTEM>
<PATIENT>
<REFERENZNR>987654</REFERENZNR>
<NACHNAME>my name</NACHNAME>
<VORNAMEN>my first name</VORNAMEN>
<GEBURTSNAME />
<GEBURTSDATUM>my dob</GEBURTSDATUM>
<GESCHLECHT>XX</GESCHLECHT>
<PLZ>9999</PLZ>
<WOHNORT>Mycity</WOHNORT>
<STRASSE>mystreet</STRASSE>
<HAUSNR>99</HAUSNR>
<VERSICHERTENNR>999999999</VERSICHERTENNR>
<DATEIEN>
<DATEI>
<DATEINAME>my_attached_document.html</DATEINAME>
<DATEIBASE64>mybase_64_encoded_document</DATEIBASE64>
</DATEI>
</DATEIEN>
</PATIENT>
</MELDER>
</ABSENDER>
</KRSCHWEIZ>
How do I get correct, explicit path information in this case as well?

The presence of a default namespace (xmlns="krSCHWEIZ") changes the output of .getpath(): elements in a namespace without a prefix are written as wildcards. You can use .getelementpath() instead, which writes each step in {namespace}tag notation instead of using wildcards.
If the namespace should be discarded completely, you can strip it from the tags before using .getpath():
import lxml.etree
import pandas as pd
rows = []
tree = lxml.etree.parse("broken.xml")
for node in tree.iter():
    try:
        # replace the namespaced tag with its local name
        node.tag = lxml.etree.QName(node).localname
    except ValueError:
        # skip tags with no name
        continue
    rows.append([tree.getpath(node), node.attrib, node.text])
df = pd.DataFrame(rows, columns=["path", "attrib", "text"])
Resulting dataframe:
>>> df
path attrib text
0 /KRSCHWEIZ [] \n
1 /KRSCHWEIZ/KEY_VS [] abcdefg
2 /KRSCHWEIZ/KEY_KLR [] abcdefg
3 /KRSCHWEIZ/ABSENDER [] \n
4 /KRSCHWEIZ/ABSENDER/ABSENDER_MELDER_ID [] 123456
5 /KRSCHWEIZ/ABSENDER/MELDER [] \n
6 /KRSCHWEIZ/ABSENDER/MELDER/MELDER_ID [] 123456
7 /KRSCHWEIZ/ABSENDER/MELDER/QUELLSYSTEM [] ABCDEF
8 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT [] \n
9 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/REFERENZNR [] 987654
10 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/NACHNAME [] my name
11 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/VORNAMEN [] my first name
12 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GEBURTSNAME [] None
13 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GEBURTSDATUM [] my dob
14 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GESCHLECHT [] XX
15 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/PLZ [] 9999
16 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/WOHNORT [] Mycity
17 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/STRASSE [] mystreet
18 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/HAUSNR [] 99
19 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/VERSICHERTENNR [] 999999999
20 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN [] \n
21 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DATEI [] \n
22 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DAT... [] my_attached_document.html
23 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DAT... [] mybase_64_encoded_document
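If lxml is not available, similar paths can be built by hand with the standard library. The sketch below is not what .getpath() does internally; it is a simplified stand-in that strips the {namespace} part from each tag and adds a positional index only when siblings share a name. The helper names local_name and iter_paths and the shortened SAMPLE document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened, hypothetical version of the namespaced document from the question.
SAMPLE = """<KRSCHWEIZ xmlns="krSCHWEIZ">
  <KEY_VS>abcdefg</KEY_VS>
  <ABSENDER>
    <MELDER><MELDER_ID>123456</MELDER_ID></MELDER>
  </ABSENDER>
</KRSCHWEIZ>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def iter_paths(element, path=None):
    """Yield (path, attributes, text) for an element and all its descendants."""
    if path is None:
        path = "/" + local_name(element.tag)
    yield path, dict(element.attrib), element.text
    totals = Counter(local_name(child.tag) for child in element)
    seen = Counter()
    for child in element:
        name = local_name(child.tag)
        seen[name] += 1
        # index the step only when several siblings share the same name
        index = f"[{seen[name]}]" if totals[name] > 1 else ""
        yield from iter_paths(child, f"{path}/{name}{index}")


paths = [path for path, _, _ in iter_paths(ET.fromstring(SAMPLE))]
for path in paths:
    print(path)
```

The tuples yielded by iter_paths can be passed to pd.DataFrame with the same three column names as above.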

Txt to XML Python parser

I tried to parse a .txt file that looks like this:
-------------------------------------------------------------------------------
Compare Results
Compare Directory 1 : /data/Run_288/bitmaps
Compare Directory 2 : /data/Run_301/bitmaps
-------------------------------------------------------------------------------
idx, Filename , Exact, F3x3, F5x5, F7x7, Threshold, P/F
-------------------------------------------------------------------------------
1, ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00001.tif, 0, 0, 0, 0, 0, PASS
2, ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00002.tif, 0, 0, 0, 0, 0, PASS
-------------------------------------------------------------------------------
Bitmap Compare FAILURE !!! Threshold Exceeded : Threshold Values : Exact = 0 : Fuzzy 3x3 = 200 : Fuzzy 5x5 = 100 : Fuzzy 7x7 = 50 : Threshold 7x7 = 0
3, MIME_Test3_Job_setup__600X600_50_default_default_PPST56_003.mjm_20190521213826_00001.tif, 2083, 1180, 650, 262, 52, FAIL
-------------------------------------------------------------------------------
I need to obtain an xml with this format:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Suite date="2019-05-27T10:47:03" compareDir1="/data/Run_288/bitmaps" compareDir2="/data/Run_301/bitmaps">
<Test name="ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00001.tif" result="pass">
</Test>
<Test name="ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00002.tif" result="pass">
</Test>
<Test name="MIME_Test3_Job_setup__600X600_50_default_default_PPST56_003.mjm_20190521213826_00001.tif" result="crash">
</Test>
</Suite>
This is the code that should do the work. The problem is that it is not working, and with my limited Python knowledge I don't know why. Can somebody please help me with this?
Thank you!
import xml.etree.ElementTree as ET
root = ET.Element('Suite')
with open('file3.txt') as f:
    lines = f.read().splitlines()
print(lines)
#add first subelement
celldata = ET.SubElement(root, 'Test')
import itertools as it
#for every line in input file
#group consecutive dedup to one
for line in it.groupby(lines):
    line=line[0]
    #if its a break of subelements - that is an empty space
    if not line:
        #add the next subelement and get it as celldata
        celldata = ET.SubElement(root, 'test')
    else:
        #otherwise, split with : to get the tag name
        tag = line.split(",")
        #format tag name
        el=ET.SubElement(celldata,tag[1])
        print(tag[1])
        print(tag[7])
        tag=' '.join(tag[1]).strip()
        if 'PASS' in line:
            tag = line.split(",")[-1].strip()
        elif 'FAILURE' in line:
            splist = filter(None,line.split(" "))
            tag = splist[splist.index(',')+1]
        el.text = tag
#prettify xml
import xml.dom.minidom as minidom
formatedXML = minidom.parseString(
    ET.tostring(
        root)).toprettyxml(indent=" ",encoding='utf-8').strip()
# Display for debugging
print(formatedXML)
#write the formatedXML to file.
with open("results.xml","w+") as f:
    f.write(formatedXML)

For this I would use regular expressions. My take:
data = '''-------------------------------------------------------------------------------
Compare Results
Compare Directory 1 : /data/Run_288/bitmaps
Compare Directory 2 : /data/Run_301/bitmaps
-------------------------------------------------------------------------------
idx, Filename , Exact, F3x3, F5x5, F7x7, Threshold, P/F
-------------------------------------------------------------------------------
1, ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00001.tif, 0, 0, 0, 0, 0, PASS
2, ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00002.tif, 0, 0, 0, 0, 0, PASS
-------------------------------------------------------------------------------
Bitmap Compare FAILURE !!! Threshold Exceeded : Threshold Values : Exact = 0 : Fuzzy 3x3 = 200 : Fuzzy 5x5 = 100 : Fuzzy 7x7 = 50 : Threshold 7x7 = 0
3, MIME_Test3_Job_setup__600X600_50_default_default_PPST56_003.mjm_20190521213826_00001.tif, 2083, 1180, 650, 262, 52, FAIL
-------------------------------------------------------------------------------'''
import re
# raw strings keep the regex backslashes (\s, \d) from being read as string escapes
dirs = []
for d in re.findall(r'Compare Directory\s+(\d+)\s*:\s*(.*?)$', data, flags=re.DOTALL|re.MULTILINE):
    dirs += [d]
passes = []
fails = []
for line in data.split('\n'):
    for p in re.findall(r'(\d+,\s+(.*?),.*?PASS)$', line):
        passes += [p]
    for f in re.findall(r'(\d+,\s+(.*?),.*?FAIL)$', line):
        fails += [f]
s = f'''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Suite date="2019-05-27T10:47:03" compareDir1="{dirs[0][1]}" compareDir2="{dirs[1][1]}">
'''
for p in passes:
    s += f''' <Test name="{p[1]}" result="pass">
</Test>
'''
for fail in fails:
    s += f''' <Test name="{fail[1]}" result="crash">
</Test>
'''
s += '''</Suite>'''
print(s)
Prints:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Suite date="2019-05-27T10:47:03" compareDir1="/data/Run_288/bitmaps" compareDir2="/data/Run_301/bitmaps">
<Test name="ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00001.tif" result="pass">
</Test>
<Test name="ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00002.tif" result="pass">
</Test>
<Test name="MIME_Test3_Job_setup__600X600_50_default_default_PPST56_003.mjm_20190521213826_00001.tif" result="crash">
</Test>
</Suite>
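String concatenation works for this report, but it does not escape characters such as & or " in file names. A possible alternative is to let xml.etree.ElementTree build the elements, which handles the escaping. The sketch below makes several assumptions: the function name report_to_suite and the shortened REPORT text are invented for illustration, file names are assumed to contain no commas, and the date is passed in because the report itself does not contain one. FAIL is mapped to "crash" as in the desired output.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, hypothetical report in the same layout as the question's file.
REPORT = """\
Compare Directory 1 : /data/Run_288/bitmaps
Compare Directory 2 : /data/Run_301/bitmaps
idx, Filename , Exact, F3x3, F5x5, F7x7, Threshold, P/F
1, first.tif, 0, 0, 0, 0, 0, PASS
Bitmap Compare FAILURE !!! Threshold Exceeded
2, second.tif, 2083, 1180, 650, 262, 52, FAIL
"""


def report_to_suite(text, date):
    """Build a <Suite> element from the compare report text."""
    suite = ET.Element("Suite", date=date)
    for line in text.splitlines():
        header = re.match(r"Compare Directory\s+(\d+)\s*:\s*(.+)$", line)
        if header:
            suite.set(f"compareDir{header.group(1)}", header.group(2).strip())
            continue
        fields = [field.strip() for field in line.split(",")]
        # result rows have eight comma-separated fields and start with a number
        if len(fields) == 8 and fields[0].isdigit():
            result = "pass" if fields[-1] == "PASS" else "crash"
            ET.SubElement(suite, "Test", name=fields[1], result=result)
    return suite


suite = report_to_suite(REPORT, date="2019-05-27T10:47:03")
print(ET.tostring(suite, encoding="unicode"))
```

The resulting element can be written to disk with ET.ElementTree(suite).write("results.xml", encoding="UTF-8", xml_declaration=True).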

How to access nested children with same name in XML using lxml in python

I'm trying to parse an XML file using the "lxml" module in Python.
My XML is:
<?xml version="1.0"?>
<root xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<GEOMdata>
<numEL>2</numEL>
<EL>
<isEMPTY>true</isEMPTY>
<SECdata>
<SEC>
<Z>10.00</Z>
<A>20.00</A>
<P>30.00</P>
</SEC>
<SEC>
<Z>40.00</Z>
<A>50.00</A>
<P>60.00</P>
</SEC>
</SECdata>
</EL>
<EL>
<isEMPTY>false</isEMPTY>
<SECdata>
<SEC>
<Z>15.00</Z>
<A>25.00</A>
<P>35.00</P>
</SEC>
<SEC>
<Z>45.00</Z>
<A>55.00</A>
<P>65.00</P>
</SEC>
</SECdata>
</EL>
</GEOMdata>
</root>
I want to write a text file that reports, for each "EL", the isEMPTY value and a list of Z, A, P values. Leaving the I/O aside, I don't understand how to access these elements.
So far I have written this code:
from lxml import etree
parser = etree.XMLParser(encoding='UTF-8')
tree = etree.parse("TEST.xml", parser=parser)
for ELtest in tree.xpath('/root/GEOMdata/EL'):
    print(ELtest.findtext('isEMPTY'))
and the output is correct:
true
false
Now I don't know how to access the child elements Z, A, P "inside" ELtest.
Thanks for your kind help.
EDIT:
The desired output is a formatted file like this:
1
true
# Z A P #
10 20 30
40 50 60
2
false
# Z A P #
15 25 35
45 55 65

You can use something like:
from lxml import etree
parser = etree.XMLParser(encoding='UTF-8')
tree = etree.parse("_etree.xml", parser=parser)
with open("output.txt", "w", encoding="utf8") as f:
    e = tree.findall('GEOMdata/EL')
    for i in e:
        isEMPTY = i.find('isEMPTY')
        SECdata = i.findall('SECdata')
        f.write(isEMPTY.text+"\n")
        for y in SECdata:
            # find() returns only the first SEC inside each SECdata
            z = y.find("SEC/Z").text
            a = y.find("SEC/A").text
            p = y.find("SEC/P").text
            f.write("{} {} {}\n\n".format(z,a, p))
output.txt:
true
10.00 20.00 30.00
false
15.00 25.00 35.00

The final solution to my question (thanks to Pedro Lobito!) is:
from lxml import etree
parser = etree.XMLParser(encoding='UTF-8')
tree = etree.parse("_etree.xml", parser=parser)
with open("output.dat", "w", encoding="utf8") as f:
    e = tree.findall('GEOMdata/EL')
    for i in e:
        isEMPTY = i.find('isEMPTY')
        SECdata = i.findall('SECdata')
        f.write(isEMPTY.text+"\n")
        for y in SECdata:
            # loop over every SEC child instead of only the first one
            for k in list(y.iterchildren()):
                z = k.find("Z").text
                a = k.find("A").text
                p = k.find("P").text
                f.write("{} {} {}\n".format(z,a,p))
            f.write("\n")
Output file is:
true
10.00 20.00 30.00
40.00 50.00 60.00
false
15.00 25.00 35.00
45.00 55.00 65.00
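The EDIT in the question also asks for a running number, a "# Z A P #" header and values without decimals. One way to get closer to that layout is sketched below. It uses the standard library's xml.etree.ElementTree rather than lxml (the findall and findtext calls are the same in both), the function name format_elements and the shortened SAMPLE are assumptions, and the ":g" format is just one possible way to print 10.00 as 10.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the question's document.
SAMPLE = """<root><GEOMdata>
  <EL><isEMPTY>true</isEMPTY><SECdata>
    <SEC><Z>10.00</Z><A>20.00</A><P>30.00</P></SEC>
    <SEC><Z>40.00</Z><A>50.00</A><P>60.00</P></SEC>
  </SECdata></EL>
  <EL><isEMPTY>false</isEMPTY><SECdata>
    <SEC><Z>15.00</Z><A>25.00</A><P>35.00</P></SEC>
  </SECdata></EL>
</GEOMdata></root>"""


def format_elements(root):
    """Return the report text: number, isEMPTY, header, then one row per SEC."""
    lines = []
    for number, el in enumerate(root.findall("GEOMdata/EL"), start=1):
        lines.append(str(number))
        lines.append(el.findtext("isEMPTY"))
        lines.append("# Z A P #")
        # the path SECdata/SEC reaches every SEC, not just the first one
        for sec in el.findall("SECdata/SEC"):
            values = (float(sec.findtext(name)) for name in ("Z", "A", "P"))
            lines.append(" ".join(f"{value:g}" for value in values))
    return "\n".join(lines)


report = format_elements(ET.fromstring(SAMPLE))
print(report)
```

Writing the result to a file is then a single f.write(report) inside a with open(...) block.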

XML file parsing - Get data from each parent and their own children

I would like to get data from each parent and its own children from an XML file.
I'm trying to parse this XML file
<DB>
<Entry>
<Name>Assembly.iam</Name>
<DisplayName>Assembly.iam</DisplayName>
<Scalar>
<Name>d0</Name>
<DisplayName>d0 (value = 0 mm)</DisplayName>
<Value>0</Value>
</Scalar>
<Scalar>
<Name>d1</Name>
<DisplayName>d1 (value = 0 mm)</DisplayName>
<Value>0</Value>
</Scalar>
</Entry>
<Entry>
<Name>Ground.ipt</Name>
<DisplayName>Ground.ipt</DisplayName>
<Scalar>
<Name>Ground_length</Name>
<DisplayName>Ground_length (value = 160 mm)</DisplayName>
<Value>160</Value>
</Scalar>
<Scalar>
<Name>d2</Name>
<DisplayName>d2 (value = 80 mm)</DisplayName>
<Value>80</Value>
</Scalar>
</Entry>
</DB>
In fact, I would like to get the text inside the <DisplayName></DisplayName> elements.
Then I would like to put that data into a list of tuples like this:
[(Assembly.iam,[d0 (value = 0 mm), d1 (value = 0 mm)]),
(Ground.ipt,[Ground_length (value = 160 mm), d2 (value = 80 mm)])]
I have tried to use the xml.etree.cElementTree library with this code:
from xml.etree import cElementTree
import numpy as np
workingDir = "C:/Users/Vince/Test"
newStrWorkingDir = str.replace(workingDir, '/', '\\')
tree = cElementTree.parse(newStrWorkingDir + "\\test.xml")
root = tree.getroot()
tab = np.empty(shape=(0, 0))
tabEntry = np.empty(shape=(0, 0))
tabScalar = np.empty(shape=(0, 0))
for entry in root.findall('Entry'):
    entryNames = entry.findall("./DisplayName")
    entryNamesText = entry.find("./DisplayName").text
    tabEntry = np.append(tabEntry,entryNamesText)
    for scalar in entry.findall('Scalar'):
        scalarNames = scalar.findall("./DisplayName")
        scalarNamesText = scalar.find("./DisplayName").text
        tabScalar = np.append(tabScalar,scalarNamesText)
        tab = np.append(tab,(entryNamesText,scalarNamesText))
print(tab)
But it outputs this:
['Assembly.iam' 'd0 (value = 0 mm)'
'Assembly.iam' 'd1 (value = 0 mm)'
'Ground.ipt' 'Ground_length (value = 160 mm)'
'Ground.ipt' 'd2 (value = 80 mm)']

To get the structure you want, build a list of tuples, each pairing an entry name with the list of its scalar names:
import os
# xml.etree.cElementTree was removed in Python 3.9; ElementTree offers the same API
from xml.etree import ElementTree
workingDir = "C:\\Users\\Vince\\Test"
tree = ElementTree.parse(os.path.join(workingDir, "test.xml"))
root = tree.getroot()
tab = []
for entry in root.findall('Entry'):
    entry_name = entry.findtext("./DisplayName")
    # the path Scalar/DisplayName skips the entry's own DisplayName
    scalar_names = [e.text for e in entry.findall('Scalar/DisplayName')]
    tab.append((entry_name, scalar_names))
print(tab)

Get Text for XML-Node including childnodes (or something like this)

I have to get the pure text out of an XML node and its child nodes, or whatever these strange inner tags are:
Example nodes:
<BookTitle>
<Emphasis Type="Italic">Z</Emphasis>
= 63 - 100
</BookTitle>
or:
<BookTitle>
Mtn
<Emphasis Type="Italic">Z</Emphasis>
= 74 - 210
</BookTitle>
I have to get:
Z = 63 - 100
Mtn Z = 74 - 210
Remember, this is just an example! There could be any type of "Child-Nodes" inside the BookTitle-Node, and all I need is the pure Text inside BookTitle.
I tried:
tagtext = root.find('.//BookTitle').text
print(tagtext)
but .text can't deal with these strange XML nodes and gives me back a "NoneType".
Regards & Thanks!

That's not the text of the BookTitle node, it's the tail of the Emphasis node. So you should do something like:
import xml.etree.ElementTree as et
def parse(el):
    # .text and .tail are None when there is no text, so fall back to ''
    head = (el.text or '').strip()
    text = head + ' ' if head else ''
    for child in el:
        text += '{0} {1}\n'.format((child.text or '').strip(), (child.tail or '').strip())
    return text
Which gives you:
>>> root = et.fromstring('''
<BookTitle>
<Emphasis Type="Italic">Z</Emphasis>
= 63 - 100
</BookTitle>''')
>>> print(parse(root))
Z = 63 - 100
And for:
>>> root = et.fromstring('''
<BookTitle>
Mtn
<Emphasis Type="Italic">Z</Emphasis>
= 74 - 210
</BookTitle>''')
>>> print(parse(root))
Mtn Z = 74 - 210
Which should give you a basic idea what to do.
Update: Fixed the whitespace...

You can use the minidom parser. Here is an example:
from xml.dom import minidom
def strip_tags(node):
    text = ""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            # .data is the unescaped text; toxml() would re-escape &, < and >
            text += child.data
        else:
            text += strip_tags(child)
    return text
doc = minidom.parse("<your-xml-file>")
text = strip_tags(doc)
The recursive strip_tags function walks the XML tree and extracts the text in document order.
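With xml.etree.ElementTree there is also a shorter route: Element.itertext() yields the text and tail strings of an element and all its descendants in document order, whatever the child tags are called. A minimal sketch follows; the helper name flat_text is made up, and collapsing all whitespace runs into single spaces is an assumption about the wanted output.

```python
import xml.etree.ElementTree as ET


def flat_text(element):
    """Join all text inside the element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


title = ET.fromstring("""<BookTitle>
Mtn
<Emphasis Type="Italic">Z</Emphasis>
= 74 - 210
</BookTitle>""")

result = flat_text(title)
print(result)  # Mtn Z = 74 - 210
```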
