Python Parsing Multiple XML Nodes with Dynamic Data

I have a log file from an application in an XML-like format that I'm trying to parse. As you can see from the file, one "group" starts with a [trace] line and contains 4 nodes: RequestMeta, Request, ReplyMeta, and Reply.
Once the file is parsed, I want to create an object for each "group" and use the objects for further processing. There could be anywhere from 1 to n groups, depending on the complexity of the log file.
I have been able to parse the XML, but I have some questions on how best to proceed based on its structure.
The first problem is how to structure/re-structure the file for parsing. Since I'm adding a single root node to more than one "group", there will be no easy way for me to know which children of the root node belong together in that group. In the original file, the group is denoted as everything between the [trace] line and the next [trace] line.
I think I could potentially solve this by taking each "group" string and creating a tree for each group instead of one tree for the entire file.
The second problem is how to store the data once it's parsed. Each and every request/reply will contain different data elements under the srvdata node. I'm not sure how to dynamically store a variable number of values that have a variable number of names.
After parsing all of the data, I want to output it in a simple webpage that looks something like https://imgur.com/a/2l6ZSJK
py script
import xml.etree.ElementTree as ET
with open('C:/code/mra/requestreply.txt') as f:
    txt = f.read()
pos = 0
# replace all [trace] lines
while pos >= 0:
    pos = txt.find('[trace-')
    pos2 = txt.find('\n', pos + 1) + 1
    if pos >= 0:
        txt = txt.replace(txt[pos:pos2], '')
# replace all xml instances because they are out of order
txt = txt.replace('<?xml version="1.0" encoding="utf-8"?>\n', '')
# add a master root node
xml = '<root>\n' + txt + '</root>'
tree = ET.fromstring(xml)
xml file - this is considered a single group (there could be hundreds)
[trace-592] TransactionID=6010 TransactionName=CPM.ExecuteDiscernScript User=MEPPS
<RequestMeta>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</RequestMeta>
<Request>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</Request>
<ReplyMeta>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</ReplyMeta>
<Reply>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</Reply>

I suggest modifying your XML structure like this (I named the file trace.xml):
<?xml version="1.0" encoding="utf-8"?>
<root>
<!--[trace-592] TransactionID=6010 TransactionName=CPM.ExecuteDiscernScript User=MEPPS-->
<RequestMeta>
<!-- <?xml version="1.0" encoding="utf-8"?> -->
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</RequestMeta>
<Request>
<!-- <?xml version="1.0" encoding="utf-8"?> -->
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</Request>
<ReplyMeta>
<!-- <?xml version="1.0" encoding="utf-8"?> -->
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</ReplyMeta>
<Reply>
<!-- <?xml version="1.0" encoding="utf-8"?> -->
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</Reply>
</root>
Then you can parse each segment separately, like this:
import xml.etree.ElementTree as ET

def parseRequestMeta(RequestMeta):
    """Parse the parts you are interested in here."""
    for root in RequestMeta:
        print(root.tag)
        for child in root.iter():
            print(child.tag, child.text)

def parseRequest(Request):
    pass

def parseReplyMeta(ReplyMeta):
    pass

def parseReply(Reply):
    pass

RequestMeta = []
Request = []
ReplyMeta = []
Reply = []
events = ["start", "end"]
# Collect each section once it has been read completely (its "end" event).
for event, node in ET.iterparse('trace.xml', events=events):
    if event == "end" and node.tag == "RequestMeta":
        RequestMeta.append(node)
        print(node.tag)
    if event == "end" and node.tag == "Request":
        Request.append(node)
        print(node.tag)
    if event == "end" and node.tag == "ReplyMeta":
        ReplyMeta.append(node)
        print(node.tag)
    if event == "end" and node.tag == "Reply":
        Reply.append(node)
        print(node.tag)
# parseRequestMeta is reused for every list because the other parsers are still stubs.
parseRequestMeta(RequestMeta)
parseRequestMeta(Request)
parseRequestMeta(ReplyMeta)
parseRequestMeta(Reply)
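For the two open questions in the post (keeping each group together, and storing a variable set of srvdata fields), one possible approach is to split the raw log on the [trace-N] lines, build one small tree per group, and keep the srvdata children in plain dicts. This is only a sketch: the names parse_groups, TRACE_RE, XML_DECL and SAMPLE are made up for illustration, the second trace line in the sample is invented, and the code assumes that the children of srvdata are flat elements with text values, which the elided log does not confirm.

```python
import re
import xml.etree.ElementTree as ET

# Matches a header such as "[trace-592] TransactionID=6010 User=MEPPS".
TRACE_RE = re.compile(r'^\[trace-(\d+)\][ \t]*(.*)$', re.MULTILINE)
XML_DECL = '<?xml version="1.0" encoding="utf-8"?>'

def parse_groups(text):
    """Return one dict per [trace] group found in the raw log text."""
    groups = []
    headers = list(TRACE_RE.finditer(text))
    for i, header in enumerate(headers):
        # A group runs from its [trace] line to the next one (or end of file).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():end].replace(XML_DECL, '')
        root = ET.fromstring('<group>' + body + '</group>')
        # key=value pairs taken from the [trace] line itself
        meta = dict(p.split('=', 1) for p in header.group(2).split() if '=' in p)
        nodes = {}
        for section in root:  # RequestMeta, Request, ReplyMeta, Reply
            srvdata = section.find('srvxml/srvdata')
            # Assumption: srvdata holds flat <name>value</name> children.
            nodes[section.tag] = (
                {el.tag: (el.text or '').strip() for el in srvdata}
                if srvdata is not None else {}
            )
        groups.append({'trace': int(header.group(1)), 'meta': meta, 'nodes': nodes})
    return groups

# Hypothetical two-group sample; real logs also carry the xs:schema blocks.
SAMPLE = """[trace-592] TransactionID=6010 TransactionName=CPM.ExecuteDiscernScript User=MEPPS
<Request>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<srvdata lang="C">
<script>demo</script>
</srvdata>
</srvxml>
</Request>
[trace-593] TransactionID=6011 TransactionName=CPM.Other User=MEPPS
<Reply>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<srvdata lang="C">
<status>ok</status>
<count>2</count>
</srvdata>
</srvxml>
</Reply>
"""

groups = parse_groups(SAMPLE)
```

Because every group ends up as an ordinary dict, the field names do not have to be known in advance, and a list of such dicts is easy to hand to a template when rendering the web page.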

Related

Export information from child nodes in xml using Python

I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export the list of person names, along with the city names, to a file.
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
    person_name = node.attrib.get("name")
    rows.append({"person_name": person_name})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df
Obviously this part of the code will only work for obtaining the name as it’s part of the root, but I can’t figure out how to loop through the child nodes too and obtain this info. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren but it doesn’t allow me to return only the child nodes:
children = root.getchildren()
for child in children:
    ElementTree.dump(child)
Is there a good way to get this information?
See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
    data.append({'person': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
output
   person         city
0    John     New York
1    Mary  Los Angeles
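The question asks for a file export, while the answer above only prints the DataFrame. With pandas, df.to_csv('persons.csv', index=False) would write it out; the sketch below does the same with only the standard library. The function name export_persons, the column names and the third person without a city are assumptions added for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

def export_persons(xml_text, out):
    """Write one person_name,city_name row per <person> to a file-like object."""
    root = ET.fromstring(xml_text)
    writer = csv.writer(out)
    writer.writerow(['person_name', 'city_name'])
    for person in root.findall('person'):
        city = person.find('city')
        # A person without a <city> child gets an empty city column.
        writer.writerow([person.get('name'),
                         city.get('name') if city is not None else ''])

# Sample based on persons.xml; the third entry is hypothetical.
XML = '''<persons>
<person id="1" name="John"><city id="21" name="New York"/></person>
<person id="2" name="Mary"><city id="22" name="Los Angeles"/></person>
<person id="3" name="Ann"/>
</persons>'''

buffer = io.StringIO()
export_persons(XML, buffer)
csv_text = buffer.getvalue()
```

To write a real file instead of an in-memory buffer, pass a handle opened with open('persons.csv', 'w', newline='') as the second argument.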

Remove namespaces and nodes from XML string in python

I get an xml string from a post request and I need to use this xml in a subsequent request. I need to edit the XML from the first request to reflect the correct format for the subsequent request.
I can successfully remove the namespaces, but I am struggling to extract the desired node while keeping the XML formatting.
current format
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<GetExResponse xmlns="http://www.someurl.com/">
<GetExResult>
<DataMap xmlns="" sourceType="0">
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
</DataMap>
</GetExResult>
</GetExResponse>
</soap:Body>
</soap:Envelope>
Desired Format
<?xml version="1.0" encoding="UTF-8"?>
<DataMap xmlns="" sourceType="0">
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
</DataMap>
# removes namespaces
dmXML = xmlstring
from lxml import etree
root = etree.fromstring(dmXML)
for elem in root.iter():
    elem.tag = etree.QName(elem).localname
etree.cleanup_namespaces(root)
test = etree.tostring(root).decode()
print(test)
# extracts the desired node, but into a DataFrame, which changes the formatting
xdf = pandas.read_xml(dmXML, xpath='.//DataMap/*', namespaces={"doc": "http://www.w3.org/2001/XMLSchema"})
xml = pandas.DataFrame.to_xml(xdf)
You can simply extract the relevant portion into a new document:
import xml.etree.ElementTree as ET
root = ET.fromstring(dmXML)
new_root = root.find('.//DataMap')
print(ET.tostring(new_root, xml_declaration=True, encoding='UTF-8').decode())
Output:
<?xml version='1.0' encoding='UTF-8'?>
<DataMap sourceType="0">
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1" />
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1" />
</DataMap>
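The bare './/DataMap' lookup works here because DataMap declares xmlns="", which resets the default namespace for that subtree; elements that still sit in a namespace need a prefix map instead. A small sketch of the difference, using a trimmed copy of the envelope (the prefix svc and the variable names are arbitrary choices, not part of the original answer):

```python
import xml.etree.ElementTree as ET

SOAP = '''<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<GetExResponse xmlns="http://www.someurl.com/">
<GetExResult>
<DataMap xmlns="" sourceType="0">
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
</DataMap>
</GetExResult>
</GetExResponse>
</soap:Body>
</soap:Envelope>'''

root = ET.fromstring(SOAP)

# Elements under the default namespace need a prefix map to be found...
ns = {'svc': 'http://www.someurl.com/'}
result = root.find('.//svc:GetExResult', ns)
unprefixed = root.find('.//GetExResult')   # None: the tag is namespaced

# ...but DataMap resets the default namespace, so a bare name matches.
data_map = root.find('.//DataMap')
data_map.tail = None                       # drop whitespace trailing </DataMap>
field_maps = [dict(fm.attrib) for fm in data_map.findall('FieldMap')]
extracted = ET.tostring(data_map, encoding='unicode')
```

Clearing tail is optional; it only keeps the whitespace that followed the closing DataMap tag in the envelope out of the serialized fragment.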

Python - replace root element of one xml file with another root element without its children

I have one xml file that looks like this, XML1:
<?xml version='1.0' encoding='utf-8'?>
<report>
</report>
And the other one that is like this,
XML2:
<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla" >
<child1>
<child2>
....
</child2>
</child1>
</report>
I need to replace the root element of XML1 with the root element of XML2, without its children, so that XML1 looks like this:
<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla">
</report>
Currently my code looks like this, but it doesn't remove the children; it puts the whole tree inside:
source_tree = ET.parse('XML2.xml')
source_root = source_tree.getroot()
report = source_root.findall('report')
for child in list(report):
    report.remove(child)
source_tree.write('XML1.xml', encoding='utf-8', xml_declaration=True)
Does anyone have an idea how I can achieve this?
Thanks!
Try the below (just copy attrib)
import xml.etree.ElementTree as ET
xml1 = '''<?xml version='1.0' encoding='utf-8'?>
<report>
</report>'''
xml2 = '''<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla" >
<child1>
<child2>
</child2>
</child1>
</report>'''
root1 = ET.fromstring(xml1)
root2 = ET.fromstring(xml2)
root1.attrib = root2.attrib
ET.dump(root1)
output
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla">
</report>
So here is the working code:
import xml.etree.ElementTree as ET
source_tree = ET.parse('XML2.xml')
source_root = source_tree.getroot()
dest_tree = ET.parse('XML1.xml')
dest_root = dest_tree.getroot()
dest_root.attrib = source_root.attrib
dest_tree.write('XML1.xml', encoding='utf-8', xml_declaration=True)
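One detail worth keeping in mind: assigning one element's attrib to another may leave both elements sharing a single dictionary, so a later change to the source could show up in the destination as well. If both trees stay in use, copying the pairs avoids that. A minimal sketch with inline strings standing in for the two files (a shortened attribute list is used for brevity):

```python
import xml.etree.ElementTree as ET

source_root = ET.fromstring(
    '<report attrib1="blabla" attrib2="blabla"><child1><child2/></child1></report>'
)
dest_root = ET.fromstring('<report>\n</report>')

# update() copies the key/value pairs instead of sharing one dict.
dest_root.attrib.update(source_root.attrib)

# A later change to the source does not leak into the destination.
source_root.set('attrib1', 'changed')
result = ET.tostring(dest_root, encoding='unicode')
```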

How to add space before and after CDATA in XML file

I want to create a function to modify XML content without changing the format. I managed to change the text but I can't do it without changing the format in XML.
So what I want to do now is add a space before and after the CDATA in an XML file.
Default XML file:
<?xml version="1.0" encoding="utf-8"?>
<Maps xmlns="http://www.semi.org">
<Map>
<Device>
<ReferenceDevice/>
<Bin>
<Bin Bin="001"/>
</Bin>
<Data>
<Row> <![CDATA[001 001 001]]> </Row>
</Data>
</Device>
</Map>
</Maps>
And I am getting this result:
<?xml version="1.0" encoding="utf-8"?>
<Maps xmlns="http://www.semi.org">
<Map>
<Device>
<ReferenceDevice/>
<Bin>
<Bin Bin="001"/>
</Bin>
<Data>
<Row><![CDATA[001 001 099]]></Row>
</Data>
</Device>
</Map>
</Maps>
However, I want the new xml to be like this:
<?xml version="1.0" encoding="utf-8"?>
<Maps xmlns="http://www.semi.org">
<Map>
<Device>
<ReferenceDevice/>
<Bin>
<Bin Bin="001"/>
</Bin>
<Data>
<Row> <![CDATA[001 001 099]]> </Row>
</Data>
</Device>
</Map>
</Maps>
Here is my code:
from lxml import etree as ET
def xml_new(f,fpath,newtext,xmlrow):
    xmlrow = 19
    parser = ET.XMLParser(strip_cdata=False)
    tree = ET.parse(f, parser)
    root = tree.getroot()
    for child in root:
        value = child[0][2][xmlrow].text
        text = ET.CDATA("001 001 099")
        child[0][2][xmlrow] = ET.Element('Row')
        child[0][2][xmlrow].text = text
        child[0][2][xmlrow].tail = "\n"
    ET.register_namespace('A', "http://www.semi.org")
    tree.write(fpath,encoding='utf-8',xml_declaration=True)
    return value
Can anyone help me with this? Thanks in advance!
I don't quite understand what you want to do. Here's an example for you. I don't know if it can meet your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
html ='''<?xml version="1.0" encoding="utf-8"?>
<Maps xmlns="http://www.semi.org">
<Map>
<Device>
<ReferenceDevice/>
<Bin>
<Bin Bin="001"/>
</Bin>
<Data>
<Row> <![CDATA[001 001 001]]> </Row>
</Data>
</Device>
</Map>
</Maps>'''
doc = SimplifiedDoc(html)
row = doc.Data.Row # Get the node you want to modify.
row.setContent(" "+row.html+" ") # Modify the node content.
print (doc.html)
Result:
<?xml version="1.0" encoding="utf-8"?>
<Maps xmlns="http://www.semi.org">
<Map>
<Device>
<ReferenceDevice />
<Bin>
<Bin Bin="001" />
</Bin>
<Data>
<Row> <![CDATA[001 001 001]]> </Row>
</Data>
</Device>
</Map>
</Maps>
Thanks for all your help. I have found another way to achieve the result I want.
This is the code:
# what you want to change
replaceby = '020]]> </Row>\n'
# row you want to change
row = 1
# col you want to change based on list
col = 3
file = open(file,'r')
line = file.readlines()
i = 0
editedXML=[]
for l in line:
    if 'cdata' in l.lower():
        i=i+1
        if i == row:
            oldVal = l.split(' ')
            newVal = []
            for index, old in enumerate(oldVal):
                if index == col:
                    newVal.append(replaceby)
                else:
                    newVal.append(old)
            editedXML.append(' '.join(newVal))
        else:
            editedXML.append(l)
    else:
        editedXML.append(l)
file2 = open(newfile,'w')
file2.write(''.join(editedXML))
file2.close()
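As an alternative to editing the file line by line, the padding can also be expressed at the node level. The sketch below swaps in the standard library's xml.dom.minidom (not lxml, which the question uses), because it models a CDATA section as its own node, so plain text nodes can be placed on either side of it. The sample string is a reduced, hypothetical version of the file, and the code assumes the spaces are not already present.

```python
from xml.dom import minidom

# Reduced sample: a Row whose CDATA has no surrounding spaces.
SRC = '<Data><Row><![CDATA[001 001 099]]></Row></Data>'

doc = minidom.parseString(SRC)
for row in doc.getElementsByTagName('Row'):
    for node in list(row.childNodes):
        if node.nodeType == node.CDATA_SECTION_NODE:
            # One space before the CDATA section...
            row.insertBefore(doc.createTextNode(' '), node)
            # ...and one after it (a None reference node means "append").
            row.insertBefore(doc.createTextNode(' '), node.nextSibling)

result = doc.documentElement.toxml()
```

Here result holds the Row with a space on each side of the CDATA section. Whether the rest of a larger file keeps its exact layout through a parse and serialize round trip would need to be checked against the real data.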

Append new node into XML using python

I have written the code below to create a moderately large XML file, in which I create nodes in a loop.
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
number = 0
def xml_write(number,doc):
    ET.SubElement(doc, "extra-TextID", used="true").text = ""+str(number) ##in each loop number will be changed from 0 to 9
while number != 10:
    doc = ET.Element("message")
    xml_write(number,doc)
    tree = ET.ElementTree(doc)
    tree.write('XML_file.xml')
    number = number + 1
But when I run the code above, I only get the last node, i.e. the one with "9" in the last line; the data in the file gets replaced each time. How can I append instead, so that I get all the nodes containing 0 to 9?
<?xml version="1.0"?>
<message>
<source>Rain</source>
<translations language="Dev">Cyclone</translations>
<extra-TextID used="true">9</extra-TextID>
</message>
I need to get xml file as:
<?xml version="1.0"?>
<message>
<source>Rain</source>
<translations language="Dev">Cyclone</translations>
<extra-TextID used="true">0</extra-TextID>
</message>
<?xml version="1.0"?>
<message>
<source>Rain</source>
<translations language="Dev">Cyclone</translations>
<extra-TextID used="true">1</extra-TextID>
</message>
<?xml version="1.0"?>
<message>
<source>Rain</source>
<translations language="Dev">Cyclone</translations>
<extra-TextID used="true">3</extra-TextID>
</message>
.
.
.
<?xml version="1.0"?>
<message>
<source>Rain</source>
<translations language="Dev">Cyclone</translations>
<extra-TextID used="true">9</extra-TextID>
</message>
ElementTree will not write an XML document with multiple root elements. If you want this kind of output in the XML file, append the generated elements manually:
# ET, number and xml_write come from the question's code
with open('XML_file.xml', 'wb') as f:
    while number != 10:
        doc = ET.Element("message")
        xml_write(number, doc)
        f.write(ET.tostring(doc, method="xml"))
        number += 1
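A file made of several top-level message elements is not a well-formed XML document, so most parsers will refuse to read it back. If that matters, one option is to collect the messages under a single wrapper element and write the tree once. This is a sketch only: the wrapper name messages and the helper build_messages are assumptions, not something the question asked for.

```python
import xml.etree.ElementTree as ET

def build_messages(count):
    """Build one root holding a <message> element for each number."""
    root = ET.Element("messages")  # hypothetical wrapper element
    for number in range(count):
        message = ET.SubElement(root, "message")
        ET.SubElement(message, "extra-TextID", used="true").text = str(number)
    return root

root = build_messages(10)
xml_bytes = ET.tostring(root, encoding="utf-8")

# The result parses back as a single document with ten messages.
reparsed = ET.fromstring(xml_bytes)
ids = [m.findtext("extra-TextID") for m in reparsed.findall("message")]
```

Writing it to disk then takes a single call such as ET.ElementTree(root).write('XML_file.xml', encoding='utf-8', xml_declaration=True).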
