Txt to XML Python parser

I tried to parse a .txt file that looks like this:
-------------------------------------------------------------------------------
Compare Results
Compare Directory 1 : /data/Run_288/bitmaps
Compare Directory 2 : /data/Run_301/bitmaps
-------------------------------------------------------------------------------
idx, Filename , Exact, F3x3, F5x5, F7x7, Threshold, P/F
-------------------------------------------------------------------------------
1, ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00001.tif, 0, 0, 0, 0, 0, PASS
2, ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00002.tif, 0, 0, 0, 0, 0, PASS
-------------------------------------------------------------------------------
Bitmap Compare FAILURE !!! Threshold Exceeded : Threshold Values : Exact = 0 : Fuzzy 3x3 = 200 : Fuzzy 5x5 = 100 : Fuzzy 7x7 = 50 : Threshold 7x7 = 0
3, MIME_Test3_Job_setup__600X600_50_default_default_PPST56_003.mjm_20190521213826_00001.tif, 2083, 1180, 650, 262, 52, FAIL
-------------------------------------------------------------------------------
I need to obtain an xml with this format:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Suite date="2019-05-27T10:47:03" compareDir1="/data/Run_288/bitmaps" compareDir2="/data/Run_301/bitmaps">
<Test name="ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00001.tif" result="pass">
</Test>
<Test name="ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00002.tif" result="pass">
</Test>
<Test name="MIME_Test3_Job_setup__600X600_50_default_default_PPST56_003.mjm_20190521213826_00001.tif" result="crash">
</Test>
</Suite>
This is the code that should do the work. The problem is that it is not working, and with my limited Python knowledge I don't know why. Can somebody please help me with this?
Thank you!
import xml.etree.ElementTree as ET
root = ET.Element('Suite')
with open('file3.txt') as f:
    lines = f.read().splitlines()
print(lines)
#add first subelement
celldata = ET.SubElement(root, 'Test')
import itertools as it
#for every line in input file
#group consecutive dedup to one
for line in it.groupby(lines):
    line=line[0]
    #if its a break of subelements - that is an empty space
    if not line:
        #add the next subelement and get it as celldata
        celldata = ET.SubElement(root, 'test')
    else:
        #otherwise, split with : to get the tag name
        tag = line.split(",")
        #format tag name
        el=ET.SubElement(celldata,tag[1])
        print(tag[1])
        print(tag[7])
        tag=' '.join(tag[1]).strip()
        if 'PASS' in line:
            tag = line.split(",")[-1].strip()
        elif 'FAILURE' in line:
            splist = filter(None,line.split(" "))
            tag = splist[splist.index(',')+1]
        el.text = tag
#prettify xml
import xml.dom.minidom as minidom
formatedXML = minidom.parseString(
    ET.tostring(
        root)).toprettyxml(indent=" ",encoding='utf-8').strip()
# Display for debugging
print formatedXML
#write the formatedXML to file.
with open("results.xml","w+") as f:
    f.write(formatedXML)

For this I would use regular expressions. My take:
data = '''-------------------------------------------------------------------------------
Compare Results
Compare Directory 1 : /data/Run_288/bitmaps
Compare Directory 2 : /data/Run_301/bitmaps
-------------------------------------------------------------------------------
idx, Filename , Exact, F3x3, F5x5, F7x7, Threshold, P/F
-------------------------------------------------------------------------------
1, ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00001.tif, 0, 0, 0, 0, 0, PASS
2, ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00002.tif, 0, 0, 0, 0, 0, PASS
-------------------------------------------------------------------------------
Bitmap Compare FAILURE !!! Threshold Exceeded : Threshold Values : Exact = 0 : Fuzzy 3x3 = 200 : Fuzzy 5x5 = 100 : Fuzzy 7x7 = 50 : Threshold 7x7 = 0
3, MIME_Test3_Job_setup__600X600_50_default_default_PPST56_003.mjm_20190521213826_00001.tif, 2083, 1180, 650, 262, 52, FAIL
-------------------------------------------------------------------------------'''
import re
dirs = []
# Collect (number, path) pairs from the "Compare Directory N : path" header lines
for d in re.findall(r'Compare Directory\s+(\d+)\s*:\s*(.*?)$', data, flags=re.DOTALL|re.MULTILINE):
    dirs += [d]
passes = []
fails = []
for line in data.split('\n'):
    # Each match is a tuple: (whole row, filename)
    for p in re.findall(r'(\d+,\s+(.*?),.*?PASS)$', line):
        passes += [p]
    for f in re.findall(r'(\d+,\s+(.*?),.*?FAIL)$', line):
        fails += [f]
s = f'''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Suite date="2019-05-27T10:47:03" compareDir1="{dirs[0][1]}" compareDir2="{dirs[1][1]}">
'''
for p in passes:
    s += f''' <Test name="{p[1]}" result="pass">
</Test>
'''
for fail in fails:
    s += f''' <Test name="{fail[1]}" result="crash">
</Test>
'''
s += '''</Suite>'''
print(s)
Prints:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Suite date="2019-05-27T10:47:03" compareDir1="/data/Run_288/bitmaps" compareDir2="/data/Run_301/bitmaps">
<Test name="ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00001.tif" result="pass">
</Test>
<Test name="ASCII_APPE_600X450_150_colorManBasic2.blackGrayReproductionImage_0_2p_color_test_four_object.pdf_20190522005734_00002.tif" result="pass">
</Test>
<Test name="MIME_Test3_Job_setup__600X600_50_default_default_PPST56_003.mjm_20190521213826_00001.tif" result="crash">
</Test>
</Suite>
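Building the XML by string concatenation does not escape special characters such as & or " in file names. As a rough alternative, here is a minimal sketch that builds the same structure with xml.etree.ElementTree, which escapes attribute values when serializing. The helper name report_to_xml is hypothetical, the FAIL to "crash" mapping is taken from the desired output in the question, and using the current time for the date attribute is an assumption. It also assumes file names contain no commas.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime


def report_to_xml(text, date=None):
    """Convert the text of a compare report into a <Suite> XML string."""
    # Assumption: the Suite date is the conversion time unless one is passed in.
    date = date or datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    suite = ET.Element("Suite", date=date)

    for line in text.splitlines():
        # Header lines such as "Compare Directory 1 : /data/Run_288/bitmaps"
        m = re.match(r"Compare Directory\s+(\d+)\s*:\s*(.+?)\s*$", line)
        if m:
            suite.set("compareDir" + m.group(1), m.group(2))
            continue
        # Result rows: idx, filename, five numeric columns, PASS or FAIL
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 8 and fields[0].isdigit():
            # Assumption: FAIL maps to "crash", as in the desired output.
            result = "pass" if fields[-1] == "PASS" else "crash"
            ET.SubElement(suite, "Test", name=fields[1], result=result)

    return ET.tostring(suite, encoding="unicode")
```

Calling report_to_xml(open('file3.txt').read()) would return the XML as a string; the separator lines and the "Bitmap Compare FAILURE" line are skipped because they do not split into eight comma-separated fields.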

Related

different return types for getpath() in lxml

I have folders full of XML files which I want to parse to a dataframe. The following functions iterate through an XML tree recursively and return a dataframe with three columns: path, attributes and text.
def XML2DF(filename,df1,MAX_DEPTH=20):
    with open(filename) as f:
        xml_str = f.read()
    tree = etree.fromstring(xml_str)
    df1 = recursive_parseXML2DF(tree, df1, MAX_DEPTH=MAX_DEPTH)
    return
def recursive_parseXML2DF(element, df1, depth=0, MAX_DEPTH=20):
    if depth > MAX_DEPTH:
        return df1
    df2 = pd.DataFrame([[element.getroottree().getpath(element), element.attrib, element.text]],
                       columns=["path", "attrib", "text"])
    #print(df2)
    df1 = pd.concat([df1, df2])
    for child in element.getchildren():
        df1 = recursive_parseXML2DF(child, df1, depth=depth + 1)
    return df1
The code for the function was adapted from this post.
Most of the time the function works fine and returns the entire path, but for some documents the returned path looks like this:
/*/*[1]/*[3]
/*/*[1]/*[3]/*[1]
The text entry remains valid and correct.
The only difference I can make out between the XML of documents with working paths and documents with wildcard paths is that in the latter the XML tags are written in all caps.
Working example:
<?xml version="1.0" encoding="utf-8"?>
<root>
<Header>
<ReceivingApplication>ReceivingApplication</ReceivingApplication>
<SendingApplication>SendingApplication</SendingApplication>
<MessageControlID>12345</MessageControlID>
<ReceivingApplication>ReceivingApplication</ReceivingApplication>
<FileCreationDate>2000-01-01T00:00:00</FileCreationDate>
</Header>
<Einsendung>
<Patient>
<PatientName>Name</PatientName>
<PatientVorname>FirstName</PatientVorname>
<PatientGebDat>2000-01-01T00:00:00</PatientGebDat>
<PatientSex>4</PatientSex>
<PatientPWID>123456</PatientPWID>
</Patient>
<Visit>
<VisitNumber>A2000.0001</VisitNumber>
<PatientPLZ>1234</PatientPLZ>
<PatientOrt>PatientOrt</PatientOrt>
<PatientAdr2>
</PatientAdr2>
<PatientStrasse>PatientStrasse 01</PatientStrasse>
<VisitEinsID>1234</VisitEinsID>
<VisitBefund>VisitBefund</VisitBefund>
<Befunddatum>2000-01-01T00:00:00</Befunddatum>
</Visit>
</Einsendung>
</root>
nonsensical Example:
<?xml version="1.0"?>
<KRSCHWEIZ xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="krSCHWEIZ">
<KEY_VS>abcdefg</KEY_VS>
<KEY_KLR>abcdefg</KEY_KLR>
<ABSENDER>
<ABSENDER_MELDER_ID>123456</ABSENDER_MELDER_ID>
<MELDER>
<MELDER_ID>123456</MELDER_ID>
<QUELLSYSTEM>ABCDEF</QUELLSYSTEM>
<PATIENT>
<REFERENZNR>987654</REFERENZNR>
<NACHNAME>my name</NACHNAME>
<VORNAMEN>my first name</VORNAMEN>
<GEBURTSNAME />
<GEBURTSDATUM>my dob</GEBURTSDATUM>
<GESCHLECHT>XX</GESCHLECHT>
<PLZ>9999</PLZ>
<WOHNORT>Mycity</WOHNORT>
<STRASSE>mystreet</STRASSE>
<HAUSNR>99</HAUSNR>
<VERSICHERTENNR>999999999</VERSICHERTENNR>
<DATEIEN>
<DATEI>
<DATEINAME>my_attached_document.html</DATEINAME>
<DATEIBASE64>mybase_64_encoded_document</DATEIBASE64>
</DATEI>
</DATEIEN>
</PATIENT>
</MELDER>
</ABSENDER>
</KRSCHWEIZ>
How do I get correct explicit path information also for this case?

The presence of a namespace (here the default xmlns="krSCHWEIZ" on the root element) changes the output of .getpath(). You can use .getelementpath() instead, which includes the namespace in the path (in {namespace}tag form) instead of using wildcards.
If the namespace should be discarded completely, you can strip it from the tags before calling .getpath():
import lxml.etree
import pandas as pd
rows = []
tree = lxml.etree.parse("broken.xml")
for node in tree.iter():
    try:
        # replace the namespaced tag with its local name
        node.tag = lxml.etree.QName(node).localname
    except ValueError:
        # skip tags with no name
        continue
    rows.append([tree.getpath(node), node.attrib, node.text])
df = pd.DataFrame(rows, columns=["path", "attrib", "text"])
Resulting dataframe:
>>> df
path attrib text
0 /KRSCHWEIZ [] \n
1 /KRSCHWEIZ/KEY_VS [] abcdefg
2 /KRSCHWEIZ/KEY_KLR [] abcdefg
3 /KRSCHWEIZ/ABSENDER [] \n
4 /KRSCHWEIZ/ABSENDER/ABSENDER_MELDER_ID [] 123456
5 /KRSCHWEIZ/ABSENDER/MELDER [] \n
6 /KRSCHWEIZ/ABSENDER/MELDER/MELDER_ID [] 123456
7 /KRSCHWEIZ/ABSENDER/MELDER/QUELLSYSTEM [] ABCDEF
8 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT [] \n
9 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/REFERENZNR [] 987654
10 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/NACHNAME [] my name
11 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/VORNAMEN [] my first name
12 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GEBURTSNAME [] None
13 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GEBURTSDATUM [] my dob
14 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GESCHLECHT [] XX
15 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/PLZ [] 9999
16 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/WOHNORT [] Mycity
17 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/STRASSE [] mystreet
18 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/HAUSNR [] 99
19 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/VERSICHERTENNR [] 999999999
20 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN [] \n
21 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DATEI [] \n
22 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DAT... [] my_attached_document.html
23 /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DAT... [] mybase_64_encoded_document
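If lxml is not available, similar explicit paths can be approximated with the standard library. The following is only a sketch under stated assumptions: it swaps lxml for xml.etree.ElementTree, the helper names local_name and iter_paths are hypothetical, the namespace is simply discarded, and unlike lxml's .getpath() it does not add positional indexes such as [2] for repeated siblings. The sample document is a shortened version of the one in the question.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a leading '{namespace-uri}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_paths(element, prefix=""):
    """Yield (path, attrib, text) for element and all of its descendants."""
    path = prefix + "/" + local_name(element.tag)
    yield path, dict(element.attrib), element.text
    for child in element:
        yield from iter_paths(child, path)


xml_text = """<KRSCHWEIZ xmlns="krSCHWEIZ">
  <KEY_VS>abcdefg</KEY_VS>
  <ABSENDER>
    <ABSENDER_MELDER_ID>123456</ABSENDER_MELDER_ID>
  </ABSENDER>
</KRSCHWEIZ>"""

root = ET.fromstring(xml_text)
rows = list(iter_paths(root))
for path, attrib, text in rows:
    print(path, attrib, repr(text))
```

The rows list has the same three-column shape as in the answer above, so it could be passed to pd.DataFrame(rows, columns=["path", "attrib", "text"]).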

Parse xml with python getting x,y values

I need to get the x,y location from an XML file:
<TwoDimensionSpatialCoordinate>
<coordinateIndex value="0"/>
<x value="302.6215607602997"/>
<y value="166.6285651861381"/>
</TwoDimensionSpatialCoordinate>
from xml.dom import minidom
doc = minidom.parse("1.631791322.58809740.14.834982.40440.3641459051.955.6373933.1920.xml")
"""doc.getElementsByTagName returns NodeList
coordinate = doc.getElementsByTagName("coordinateIndex")[0]
print(coordinate.firstChild.data)
"""
coordinate = doc.getElementsByTagName("coordinateIndex")
for coordinateIndex in coordinate:
    value = coordinateIndex.getAttribute("value")
coordinatex = doc.getElementsByTagName("x")
for x in coordinatex:
    valuex = x.getAttribute("value")
coordinatey = doc.getElementsByTagName("y")
for y in coordinatey:
    valuey = y.getAttribute("value")
print("value:%s, x:%s,y:%s" % (value, x , y))
When I execute it, I get this result:
value:22, x:,y:
Can anyone help me, please?

Assuming your XML file looks like this:
<?xml version="1.0" ?>
<TwoDimensionSpatialCoordinate>
<coordinateIndex value="0"/>
<x value="302.6215607602997"/>
<y value="166.6285651861381"/>
<coordinateIndex value="1"/>
<x value="3.6215607602997"/>
<y value="1.6285651861381"/>
</TwoDimensionSpatialCoordinate>
import xml.dom.minidom
def main(file):
    doc = xml.dom.minidom.parse(file)
    values = doc.getElementsByTagName("coordinateIndex")
    coordX = doc.getElementsByTagName("x")
    coordY = doc.getElementsByTagName("y")
    d = {}
    # pair up the i-th coordinateIndex, x and y elements by position
    for atr_value, atr_x, atr_y in zip(values, coordX, coordY):
        value = atr_value.getAttribute('value')
        x = atr_x.getAttribute('value')
        y = atr_y.getAttribute('value')
        d[value] = [x, y]
    return d
result = main('/path/file.xml')
print(result)
# {'0': ['302.6215607602997', '166.6285651861381'], '1': ['3.6215607602997', '1.6285651861381']}
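Pairing three document-wide lists with zip works only when every coordinateIndex has exactly one x and one y in the same order. A possible variation, still using minidom, looks up x and y inside each TwoDimensionSpatialCoordinate element instead. This is a sketch that assumes each coordinate element wraps one index, one x and one y, as in the fragment from the question; the function name read_coordinates and the <Things> wrapper element are made up for the example.

```python
from xml.dom import minidom


def read_coordinates(xml_text):
    """Map each coordinateIndex value to an (x, y) tuple of floats."""
    doc = minidom.parseString(xml_text)
    coords = {}
    for coord in doc.getElementsByTagName("TwoDimensionSpatialCoordinate"):
        # Search only inside this coordinate element, not the whole document.
        index = coord.getElementsByTagName("coordinateIndex")[0].getAttribute("value")
        x = coord.getElementsByTagName("x")[0].getAttribute("value")
        y = coord.getElementsByTagName("y")[0].getAttribute("value")
        coords[index] = (float(x), float(y))
    return coords


sample = """<Things>
  <TwoDimensionSpatialCoordinate>
    <coordinateIndex value="0"/>
    <x value="302.6215607602997"/>
    <y value="166.6285651861381"/>
  </TwoDimensionSpatialCoordinate>
  <TwoDimensionSpatialCoordinate>
    <coordinateIndex value="1"/>
    <x value="3.6215607602997"/>
    <y value="1.6285651861381"/>
  </TwoDimensionSpatialCoordinate>
</Things>"""

print(read_coordinates(sample))
```

To read from a file instead, minidom.parse(filename) could replace minidom.parseString(xml_text).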

Using the ElementTree API (use ET.parse(filename).getroot() instead of ET.XML() to load from a file):
from xml.etree import ElementTree as ET
xml = ET.XML("""
<?xml version="1.0" ?>
<Things>
<TwoDimensionSpatialCoordinate>
<coordinateIndex value="0"/>
<x value="302.6215607602997"/>
<y value="166.6285651861381"/>
</TwoDimensionSpatialCoordinate>
<TwoDimensionSpatialCoordinate>
<coordinateIndex value="1"/>
<x value="3.6215607602997"/>
<y value="1.6285651861381"/>
</TwoDimensionSpatialCoordinate>
</Things>
""".strip())
coords_by_index = {}
for coord in xml.findall(".//TwoDimensionSpatialCoordinate"):
    coords_by_index[coord.find("coordinateIndex").get("value")] = (
        coord.find("x").get("value"),
        coord.find("y").get("value"),
    )
print(coords_by_index)
outputs
{
'0': ('302.6215607602997', '166.6285651861381'),
'1': ('3.6215607602997', '1.6285651861381'),
}

Merge output loop lxml

I want to remove some elements from an XML file, depending on values held in a variable.
Here is my.xml:
<?xml version='1.0' encoding='UTF-8'?>
<ArrayOfSalesOrderHeader xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<SalesOrderHeader>
<TenantCode>15152343</TenantCode>
<SalesOrderDetails>
<SalesOrderDetail>
<ItemCode>20072129</ItemCode>
</SalesOrderDetail>
<SalesOrderDetail>
<ItemCode>67332054</ItemCode>
</SalesOrderDetail>
<SalesOrderDetail>
<ItemCode>20206133</ItemCode>
</SalesOrderDetail>
<SalesOrderDetail>
<ItemCode>62071796</ItemCode>
</SalesOrderDetail>
</SalesOrderDetails>
</SalesOrderHeader>
</ArrayOfSalesOrderHeader>
This is my script, which works when arrDat is a single string:
doc = ET.parse("my.xml")
arrDat = '20206133'
fol = doc.xpath('.//SalesOrderDetail[descendant::ItemCode[not(contains(text(),"' + arrDat + '"))]]')
for SOD in fol :
    SOD.getparent().remove(SOD)
doc.write('output.xml', xml_declaration=True, encoding='utf-8', method="xml")
The problem appears when I define arrDat as a list:
doc = ET.parse("my.xml")
arrDat = ['20072129','67332054']
cnt = 0
while cnt < len(arrDat) :
    fol = doc.xpath('.//SalesOrderDetail[descendant::ItemCode[not(contains(text(),"' + arrDat[cnt] + '"))]]')
    for SOD in fol :
        SOD.getparent().remove(SOD)
    doc.write('output.xml', xml_declaration=True, encoding='utf-8', method="xml")
    cnt += 1
I need output.xml to look like this:
<?xml version='1.0' encoding='UTF-8'?>
<ArrayOfSalesOrderHeader xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<SalesOrderHeader>
<TenantCode>15152343</TenantCode>
<SalesOrderDetails>
<SalesOrderDetail>
<ItemCode>20072129</ItemCode>
</SalesOrderDetail>
<SalesOrderDetail>
<ItemCode>67332054</ItemCode>
</SalesOrderDetail>
</SalesOrderDetails>
</SalesOrderHeader>
</ArrayOfSalesOrderHeader>

I think you can simply check each ItemCode value and remove the orders whose code is not present in your list. Here is the implementation:
from lxml import etree as ET
doc = ET.parse("data1.xml")
arrDat = ['20072129', '67332054']
for order in doc.xpath("//SalesOrderDetail"):
    item = order.xpath('ItemCode')
    item_code = item[0].text
    # drop the order unless its code is one of the wanted values
    if item_code not in arrDat:
        order.getparent().remove(order)
doc.write('output.xml', xml_declaration=True, encoding='utf-8', method="xml")
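The loop in the question fails because each pass removes every SalesOrderDetail whose ItemCode does not contain the current value, so after the second pass nothing is left. Testing membership against the whole list in a single pass avoids that. The same filter can be sketched with the standard library, which has no getparent(); the helper name keep_items is hypothetical and the inline XML is a trimmed copy of my.xml without the namespace declarations.

```python
import xml.etree.ElementTree as ET


def keep_items(root, wanted):
    """Remove every SalesOrderDetail whose ItemCode is not in `wanted`."""
    wanted = set(wanted)
    # ElementTree has no getparent(), so walk the parents and filter their children.
    for details in root.iter("SalesOrderDetails"):
        # Iterate over a copy because children are removed during the loop.
        for detail in list(details):
            if detail.findtext("ItemCode") not in wanted:
                details.remove(detail)
    return root


xml_text = """<ArrayOfSalesOrderHeader>
  <SalesOrderHeader>
    <TenantCode>15152343</TenantCode>
    <SalesOrderDetails>
      <SalesOrderDetail><ItemCode>20072129</ItemCode></SalesOrderDetail>
      <SalesOrderDetail><ItemCode>67332054</ItemCode></SalesOrderDetail>
      <SalesOrderDetail><ItemCode>20206133</ItemCode></SalesOrderDetail>
      <SalesOrderDetail><ItemCode>62071796</ItemCode></SalesOrderDetail>
    </SalesOrderDetails>
  </SalesOrderHeader>
</ArrayOfSalesOrderHeader>"""

root = keep_items(ET.fromstring(xml_text), ["20072129", "67332054"])
remaining = [code.text for code in root.iter("ItemCode")]
print(remaining)
```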

XML file parsing - Get data from each parent and their own children

I would like to get data from each parent and its own children from an XML file.
I'm trying to parse this XML file:
<DB>
<Entry>
<Name>Assembly.iam</Name>
<DisplayName>Assembly.iam</DisplayName>
<Scalar>
<Name>d0</Name>
<DisplayName>d0 (value = 0 mm)</DisplayName>
<Value>0</Value>
</Scalar>
<Scalar>
<Name>d1</Name>
<DisplayName>d1 (value = 0 mm)</DisplayName>
<Value>0</Value>
</Scalar>
</Entry>
<Entry>
<Name>Ground.ipt</Name>
<DisplayName>Ground.ipt</DisplayName>
<Scalar>
<Name>Ground_length</Name>
<DisplayName>Ground_length (value = 160 mm)</DisplayName>
<Value>160</Value>
</Scalar>
<Scalar>
<Name>d2</Name>
<DisplayName>d2 (value = 80 mm)</DisplayName>
<Value>80</Value>
</Scalar>
</Entry>
</DB>
Specifically, I would like to get the text inside the <DisplayName></DisplayName> elements.
Then I would like to put that data into a list of tuples like this:
[(Assembly.iam,[d0 (value = 0 mm), d1 (value = 0 mm)]),
(Ground.ipt,[Ground_length (value = 160 mm), d2 (value = 80 mm)])]
I have tried to use the xml.etree.cElementTree library with this code:
from xml.etree import cElementTree
import numpy as np
workingDir = "C:/Users/Vince/Test"
newStrWorkingDir = str.replace(workingDir, '/', '\\')
tree = cElementTree.parse(newStrWorkingDir + "\\test.xml")
root = tree.getroot()
tab = np.empty(shape=(0, 0))
tabEntry = np.empty(shape=(0, 0))
tabScalar = np.empty(shape=(0, 0))
for entry in root.findall('Entry'):
    entryNames = entry.findall("./DisplayName")
    entryNamesText = entry.find("./DisplayName").text
    tabEntry = np.append(tabEntry,entryNamesText)
    for scalar in entry.findall('Scalar'):
        scalarNames = scalar.findall("./DisplayName")
        scalarNamesText = scalar.find("./DisplayName").text
        tabScalar = np.append(tabScalar,scalarNamesText)
        tab = np.append(tab,(entryNamesText,scalarNamesText))
print(tab)
But it gives me this output:
['Assembly.iam' 'd0 (value = 0 mm)'
'Assembly.iam' 'd1 (value = 0 mm)'
'Ground.ipt' 'Ground_length (value = 160 mm)'
'Ground.ipt' 'd2 (value = 80 mm)']

To get the structure you want, build a list of tuples, each holding the entry name and a list of its scalar names:
import os
from xml.etree import ElementTree  # cElementTree was removed in Python 3.9
workingDir = "C:\\Users\\Vince\\Test"
tree = ElementTree.parse(os.path.join(workingDir, "test.xml"))
root = tree.getroot()
tab = []
for entry in root.findall('Entry'):
    # DisplayName that is a direct child of the entry
    entry_name = entry.findtext("./DisplayName")
    # DisplayName of each Scalar inside this entry
    scalar_names = [e.text for e in entry.findall('Scalar/DisplayName')]
    tab.append((entry_name, scalar_names))
print(tab)
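The key point is the difference between the two paths: "DisplayName" matches only direct children of an Entry, while "Scalar/DisplayName" matches the names nested one level down. A self-contained sketch with the XML inlined (trimmed to the first Entry from the question) shows the resulting shape; loading from a string with ET.fromstring is a substitution made here so the example runs without a file.

```python
import xml.etree.ElementTree as ET

xml_text = """<DB>
  <Entry>
    <Name>Assembly.iam</Name>
    <DisplayName>Assembly.iam</DisplayName>
    <Scalar>
      <Name>d0</Name>
      <DisplayName>d0 (value = 0 mm)</DisplayName>
      <Value>0</Value>
    </Scalar>
    <Scalar>
      <Name>d1</Name>
      <DisplayName>d1 (value = 0 mm)</DisplayName>
      <Value>0</Value>
    </Scalar>
  </Entry>
</DB>"""

root = ET.fromstring(xml_text)
# One tuple per Entry: (entry display name, [scalar display names])
tab = [
    (entry.findtext("DisplayName"),
     [e.text for e in entry.findall("Scalar/DisplayName")])
    for entry in root.findall("Entry")
]
print(tab)
```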

XML parsing with XMLtree or MINIDOM

I have an XML file, and in the middle of it there is a block like this:
...
<node id = "1" >
<ngh id = "2" > 100 </ngh>
<ngh id = "3"> 300 </ngh>
</node>
<node id = "2">
<ngh id = "1" > 400 </ngh>
<ngh id = "3"> 500 </ngh>
</node>
...
and I am trying to get
1, 2, 100
1, 3, 300
2, 1, 400
2, 3, 500
...
I found a similar question and did the following:
from xml.dom import minidom
xmldoc = minidom.parse('file.xml')
nodelist = xmldoc.getElementsByTagName('node')
for s in nodelist:
    print(s.attributes['id'].value)
Is there a way to get the values between the tags (i.e. 100, 300, 400)?

You need an inner loop over ngh elements:
from xml.dom import minidom
xmldoc = minidom.parse('file.xml')
nodes = xmldoc.getElementsByTagName('node')
for node in nodes:
    node_id = node.attributes['id'].value
    for ngh in node.getElementsByTagName('ngh'):
        ngh_id = ngh.attributes['id'].value
        # the text node includes the surrounding spaces, so strip them
        ngh_text = ngh.firstChild.nodeValue.strip()
        print(node_id, ngh_id, ngh_text)
Prints:
1 2 100
1 3 300
2 1 400
2 3 500
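Since the question also mentions ElementTree, the same nested loop can be sketched with xml.etree.ElementTree. The <graph> root element is an assumption made so the fragment parses on its own; in the real file the node elements sit somewhere inside a larger document, which root.iter("node") would still find.

```python
import xml.etree.ElementTree as ET

# Assumption: the fragment sits inside a single root element, here called <graph>.
xml_text = """<graph>
  <node id = "1" >
    <ngh id = "2" > 100 </ngh>
    <ngh id = "3"> 300 </ngh>
  </node>
  <node id = "2">
    <ngh id = "1" > 400 </ngh>
    <ngh id = "3"> 500 </ngh>
  </node>
</graph>"""

root = ET.fromstring(xml_text)
rows = []
for node in root.iter("node"):
    for ngh in node.findall("ngh"):
        # The text is padded with spaces in the file, so strip it.
        rows.append((node.get("id"), ngh.get("id"), ngh.text.strip()))

for row in rows:
    print(", ".join(row))
```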
