How to extract xml from log file to parse in python - python

I have a log file containing xml envelopes (2 types of xml structures: request and response). What i need to do is to parse this file, extract xml-s and put them into 2 arrays as strings (1st array for requests and 2nd array for responses), so i can parse them later.
Any ideas how can i achieve this in python ?
Snippet of log file to be parsed (log contains ):
2014-10-31 12:27:33,600 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] Sending BILL request
2014-10-31 12:27:33,601 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] <?xml version="1.0" encoding="UTF-8"?>
<request xmlns="XXX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<transactionheader>
<username>XXX</username>
<password>XXX</password>
<time>31/10/2014 12:27:33</time>
<clientreferencenumber>123</clientreferencenumber>
<numberrequests>3</numberrequests>
<information>Description</information>
<postbackurl>http://localhost/status</postbackurl>
</transactionheader>
<transactiondetails>
<items>
<item id="1" client="XXX1" keyword="test"/>
<item id="2" client="XXX2" keyword="test"/>
<item id="3" client="XXX3" keyword="test"/>
</items>
</transactiondetails>
</request>
2014-10-31 12:27:34,487 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] Response code 200 for bill request
2014-10-31 12:27:34,489 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] <?xml version="1.0" encoding="UTF-8"?>
<response xmlns="XXX" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<serverreferencenumber>XXX123XXX</serverreferencenumber>
<clientreferencenumber>123</clientreferencenumber>
<information>Queued for Processing</information>
<status>OK</status>
</response>
Many thanks for reply!
Regards,
Robert

As both #Paco and #Lord_Gestalter suggested, you can use xml.etree and replace the non-XML elements from your file, something like this:
# I use re to substitute non-XML elements
import re
# then use xml module as a parser
import xml.etree.ElementTree as ET
# read your file and store in string 's'
with open('yourfilehere','r') as f:
s = f.read()
# then remove non-XML element with re
# I also remove <?xml ...?> part as your file consists of multiple xml logs
s = re.sub(r'<\?xml.*?>', '', ''.join(re.findall(r'<.*>', s)))
# wrap your s with a root element
s = '<root>'+s+'</root>'
# parse s with ElementTree
tree = ET.fromstring(s)
tree
<Element 'root' at 0x7f2ab877e190>
if you don't care about xml parser and just want 'request' & 'response' string, use re.search
with open('yourfilehere','r') as f:
s = f.read()
# put the string of both request and response into 'req' and 'res'
# or you need to construct a better re.search if you have multiple requests, responses
req = [re.search(r'<request.*\/request>', s).group()]
res = [re.search(r'<response.*\/response>', s).group()]
req
['<request xmlns="XXX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><transactionheader><username>XXX</username><password>XXX</password><time>31/10/2014 12:27:33</time><clientreferencenumber>123</clientreferencenumber><numberrequests>3</numberrequests><information>Description</information><postbackurl>http://localhost/status</postbackurl></transactionheader><transactiondetails><items><item id="1" client="XXX1" keyword="test"/><item id="2" client="XXX2" keyword="test"/><item id="3" client="XXX3" keyword="test"/></items></transactiondetails></request>']
res
['<response xmlns="XXX" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><serverreferencenumber>XXX123XXX</serverreferencenumber><clientreferencenumber>123</clientreferencenumber><information>Queued for Processing</information><status>OK</status></response>']

Related

How to insert a processing instruction in XML file?

I want to add a xml-stylesheet processing instruction before the root element in my XML file using ElementTree (Python 3.8).
You find as below my code that I used to create XML file
import xml.etree.cElementTree as ET
def Export_star_xml( self ):
star_element = ET.Element("STAR",**{ 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance' })
element_node = ET.SubElement(star_element ,"STAR_1")
element_node.text = "Mario adam"
tree.write( "star.xml" ,encoding="utf-8", xml_declaration=True )
Output:
<?xml version="1.0" encoding="windows-1252"?>
<STAR xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<STAR_1> Mario adam </STAR_1>
</STAR>
Output Expected:
<?xml version="1.0" encoding="windows-1252"?>
<?xml-stylesheet type="text/xsl" href="ResourceFiles/form_star.xsl"?>
<STAR xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<STAR_1> Mario adam </STAR_1>
</STAR>
I cannot figure out how to do this with ElementTree. Here is a solution that uses lxml, which provides an addprevious() method on elements.
from lxml import etree as ET
# Note the use of nsmap. The syntax used in the question is not accepted by lxml
star_element = ET.Element("STAR", nsmap={'xsi': 'http://www.w3.org/2001/XMLSchema-instance'})
element_node = ET.SubElement(star_element ,"STAR_1")
element_node.text = "Mario adam"
# Create PI and and insert it before the root element
pi = ET.ProcessingInstruction("xml-stylesheet", text='type="text/xsl" href="ResourceFiles/form_star.xsl"')
star_element.addprevious(pi)
ET.ElementTree(star_element).write("star.xml", encoding="utf-8",
xml_declaration=True, pretty_print=True)
Result:
<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="ResourceFiles/form_star.xsl"?>
<STAR xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<STAR_1>Mario adam</STAR_1>
</STAR>

How to write XML file with xml tags that include html tags in its text using Python?

I recently came to the realization that XML containing HTML tags in body text for some of the tags seem to make parsers like WP All Import choke.
So to mitigate this, I attempted to write a Python script to properly put out XML.
It starts with this XML file (this is just an excerpt):
<?xml version="1.0" encoding="UTF-8" standalone="yes">
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
</Row>
...
</Root>
The desired output is:
<?xml version="1.0" encoding="UTF-8" standalone="yes">
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
<Introduction_Body><![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like , <br>, <img src="http://blahimg.jpg">, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]> </Introduction_Body>
</Row>
...
</Root>
Unfortunately, I'm getting the following with weird escape characters like:
<?xml version="1.0" encoding="UTF-8" standalone="yes">
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
<Introduction_Body><![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like <a href="http://blah.com/blah.html"></a>, <br>, <img src="http://blahimg.jpg">, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]> </Introduction_Body>
</Row>
...
</Root>
So I'd like to fix the following:
1) Output new XML file that preserves the text including the HTML in the newly introduced "Introduction_Body" tag as well as any other tags like "Waterfall_Name"
2) Is it possible to cleanly pretty print this (for human-readability)? How?
My Python code currently looks like this:
try:
import xml.etree.cElementTree as ET
except ImportError:
import xml.etree.ElementTree as ET
import os
data_file = 'test3_of_2016-09-19.xml'
tree = ET.ElementTree(file=data_file)
root = tree.getroot()
for element in root:
if element.find('File_directory') is not None:
directory = element.find('File_directory').text
if element.find('Introduction') is not None:
introduction = element.find('Introduction').text
intro_tree = directory+introduction
with open(intro_tree, 'r') as f: #note this with statement eliminates need for f.close()
intro_text = f.read()
intro_body = ET.SubElement(element,'Introduction_Body')
intro_body.text = '<![CDATA[' + intro_text + ']]>'
#tree.write('new_' + data_file) #same result but leaves out the xml header
f = open('new_' + data_file, 'w')
f.write('<?xml version="1.0" encoding="UTF-8" standalone="yes">' + ET.tostring(root))
f.close()
Thanks,
Johnny
I would recommend you switch to lxml. It is well-documented and (almost) completely compatible with python's own xml. You might only have to minimally change your code. lxml supports CDATA very handily:
> from lxml import etree
> elmnt = etree.Element('root')
> elmnt.text = etree.CDATA('abcd')
> etree.dump(elmnt)
<root><![CDATA[abcd]]></root>
That aside, you should definitely use whatever library you use not only for parsing xml, but also for writing it! lxml will do the declaration for you:
> print(etree.tostring(elmnt, encoding="utf-8"))
<?xml version='1.0' encoding='utf-8'?>
<root><![CDATA[abcd]]></root>

Converting xml file into json dump python

Hi i want to parse xml file into json dump. Already tried this:
for i in cNodes[0].getElementsByTagName("person"):
self.userName = i.getElementsByTagName("login")[0].childNodes[0].toxml()
self.userPassword = i.getElementsByTagName("password")[0].childNodes[0].toxml()
self.userNick = i.getElementsByTagName("nick")[0].childNodes[0].toxml()
But i want to get titles and values in format title:value, using for loop.
<user>
<person>
<nick>Gamer</nick>
<login>1</login>
<password>tajne</password>
</person>
<properties>
<fullHp>100</fullHp>
<currentHp>25</currentHp>
<fullMana>200</fullMana>
<currentMana>124</currentMana>
<premiumAcc>1</premiumAcc>
</properties>
This is my xml format.
Don't reinvent the wheel (with "minidom" it would not be fun anyway), use xmltodict:
import xmltodict
data = """
<user>
<person>
<nick>Gamer</nick>
<login>1</login>
<password>tajne</password>
</person>
<properties>
<fullHp>100</fullHp>
<currentHp>25</currentHp>
<fullMana>200</fullMana>
<currentMana>124</currentMana>
<premiumAcc>1</premiumAcc>
</properties>
</user>"""
print xmltodict.parse(data)

How to keep the xml-stylesheet?

I want to keep the xml-stylesheet. But it doesn't work.
I use Python to modify the XML for deploy hadoop automatically.
XML:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://c11:9000</value>
  </property>
</configuration>
Code:
from xml.etree.ElementTree import ElementTree as ET
def modify_core_site(namenode_hostname):
tree = ET()
tree.parse("pkg/core-site.xml")
root = tree.getroot()
for p in root.iter("property"):
name = p.find("name").text
if name == "fs.default.name":
text = "hdfs://%s:9000" % namenode_hostname
p.find("value").text = text
tree.write("pkg/tmp.xml", encoding="utf-8", xml_declaration=True)
modify_core_site("c80")
Result:
<?xml version='1.0' encoding='utf-8'?>
<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://c80:9000</value>
  </property>
</configuration>
The xml-stylesheet disappear...
How can I keep this?
One solution is you can use lxml Once you parse xml go till you find the xsl node. Quick sample below:
>>> import lxml.etree
>>> doc = lxml.etree.parse('C:/downloads/xmltest.xml')
>>> root = doc.getroot()
>>> xslnode=root.getprevious().getprevious()
>>> xslnode
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
Make sure you put in some exception handling and check if the node indeed exists. You can check if the node is xslt processing instruction by
>>> isinstance(xslnode, lxml.etree._XSLTProcessingInstruction)
True

xml.dom.minidom getting elements by tagname

How can I retrieve the value of code with this (below) xml string and when using xml.dom.minidom?
<data>
<element1>
<name>myname</name>
</element1>
<element2>
<code>3</code>
<name>another name</name>
</element2>
</data>
Because multiple 'name' tags can appear I would like to do something like this:
from xml.dom.minidom import parseString
dom = parseString("<data>...</data>")
dom.getElementsByTagName("element1").getElementsByTagName("name")
But that doesn't work unfortunately.
The below code worked fine for me. I think you had multiple tags and you want to get the name from the second tag.
myxml = """\
<data>
<element>
<name>myname</name>
</element>
<element>
<code>3</code>
<name>another name</name>
</element>
</data>
"""
dom = xml.dom.minidom.parseString(myxml)
nodelist = dom.getElementsByTagName("element")[1].getElementsByTagName("name")
for node in nodelist:
print node.toxml()

Categories

Resources