Converting an XML file into a JSON dump in Python

Hi, I want to parse an XML file into a JSON dump. I have already tried this:
for i in cNodes[0].getElementsByTagName("person"):
    self.userName = i.getElementsByTagName("login")[0].childNodes[0].toxml()
    self.userPassword = i.getElementsByTagName("password")[0].childNodes[0].toxml()
    self.userNick = i.getElementsByTagName("nick")[0].childNodes[0].toxml()
But I want to get the titles and values in the format title:value, using a for loop.
<user>
<person>
<nick>Gamer</nick>
<login>1</login>
<password>tajne</password>
</person>
<properties>
<fullHp>100</fullHp>
<currentHp>25</currentHp>
<fullMana>200</fullMana>
<currentMana>124</currentMana>
<premiumAcc>1</premiumAcc>
</properties>
This is my xml format.

Don't reinvent the wheel (with "minidom" it would not be fun anyway), use xmltodict:
import xmltodict
data = """
<user>
<person>
<nick>Gamer</nick>
<login>1</login>
<password>tajne</password>
</person>
<properties>
<fullHp>100</fullHp>
<currentHp>25</currentHp>
<fullMana>200</fullMana>
<currentMana>124</currentMana>
<premiumAcc>1</premiumAcc>
</properties>
</user>"""
print(xmltodict.parse(data))
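If installing xmltodict is not an option, roughly the same result can be sketched with the standard library alone. This is only an illustration, not what xmltodict does internally: the helper name element_to_dict is made up here, attributes are ignored, and repeated sibling tags would overwrite each other, which is acceptable for the XML in the question but not in general.

```python
import json
import xml.etree.ElementTree as ET

XML_DATA = """<user>
  <person>
    <nick>Gamer</nick>
    <login>1</login>
    <password>tajne</password>
  </person>
  <properties>
    <fullHp>100</fullHp>
    <currentHp>25</currentHp>
    <fullMana>200</fullMana>
    <currentMana>124</currentMana>
    <premiumAcc>1</premiumAcc>
  </properties>
</user>"""


def element_to_dict(element):
    """Hypothetical helper: map each child tag to its text, recursing into nested elements."""
    children = list(element)
    if not children:
        # Leaf element: its text is the value.
        return element.text
    # Assumption: sibling tags are unique, as in the question's XML.
    return {child.tag: element_to_dict(child) for child in children}


root = ET.fromstring(XML_DATA)
user = {root.tag: element_to_dict(root)}

# json.dumps turns the nested dict into the title:value JSON dump.
json_dump = json.dumps(user, indent=2)
print(json_dump)
```

All values stay strings here; converting "100" to a number would be a separate step.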

Related

How to extract xml from log file to parse in python

I have a log file containing XML envelopes (two types of XML structure: request and response). I need to parse this file, extract the XML documents, and put them into two arrays as strings (one array for requests, one for responses) so I can parse them later.
Any ideas how I can achieve this in Python?
Snippet of log file to be parsed (log contains ):
2014-10-31 12:27:33,600 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] Sending BILL request
2014-10-31 12:27:33,601 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] <?xml version="1.0" encoding="UTF-8"?>
<request xmlns="XXX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<transactionheader>
<username>XXX</username>
<password>XXX</password>
<time>31/10/2014 12:27:33</time>
<clientreferencenumber>123</clientreferencenumber>
<numberrequests>3</numberrequests>
<information>Description</information>
<postbackurl>http://localhost/status</postbackurl>
</transactionheader>
<transactiondetails>
<items>
<item id="1" client="XXX1" keyword="test"/>
<item id="2" client="XXX2" keyword="test"/>
<item id="3" client="XXX3" keyword="test"/>
</items>
</transactiondetails>
</request>
2014-10-31 12:27:34,487 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] Response code 200 for bill request
2014-10-31 12:27:34,489 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] <?xml version="1.0" encoding="UTF-8"?>
<response xmlns="XXX" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<serverreferencenumber>XXX123XXX</serverreferencenumber>
<clientreferencenumber>123</clientreferencenumber>
<information>Queued for Processing</information>
<status>OK</status>
</response>
Many thanks for reply!
Regards,
Robert
As both @Paco and @Lord_Gestalter suggested, you can use xml.etree after stripping the non-XML parts of the file, something like this:
# use re to strip the non-XML parts
import re
# then use the xml module as a parser
import xml.etree.ElementTree as ET
# read your file and store it in the string 's'
with open('yourfilehere', 'r') as f:
    s = f.read()
# keep only the XML fragments found on each line and join them
# also remove the <?xml ...?> declarations, since the file contains multiple XML documents
s = re.sub(r'<\?xml.*?>', '', ''.join(re.findall(r'<.*>', s)))
# wrap s in a single root element
s = '<root>' + s + '</root>'
# parse s with ElementTree
tree = ET.fromstring(s)
tree
<Element 'root' at 0x7f2ab877e190>
If you don't care about an XML parser and just want the 'request' and 'response' strings, use re.search:
with open('yourfilehere', 'r') as f:
    # keep only the XML fragments from each line and join them into one string,
    # so that '.' in the patterns below does not have to cross line breaks
    s = ''.join(re.findall(r'<.*>', f.read()))
# put the strings of both request and response into 'req' and 'res'
# (you need to construct a better re.search if you have multiple requests or responses)
req = [re.search(r'<request.*\/request>', s).group()]
res = [re.search(r'<response.*\/response>', s).group()]
req
['<request xmlns="XXX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><transactionheader><username>XXX</username><password>XXX</password><time>31/10/2014 12:27:33</time><clientreferencenumber>123</clientreferencenumber><numberrequests>3</numberrequests><information>Description</information><postbackurl>http://localhost/status</postbackurl></transactionheader><transactiondetails><items><item id="1" client="XXX1" keyword="test"/><item id="2" client="XXX2" keyword="test"/><item id="3" client="XXX3" keyword="test"/></items></transactiondetails></request>']
res
['<response xmlns="XXX" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><serverreferencenumber>XXX123XXX</serverreferencenumber><clientreferencenumber>123</clientreferencenumber><information>Queued for Processing</information><status>OK</status></response>']
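For a log with many envelopes, one possible sketch is to use non-greedy patterns with re.findall, which yields the two lists the question asks for. The sample log below is a shortened, made-up version of the one in the question, and the function name extract_envelopes is hypothetical. It assumes that request and response elements are never nested inside each other and that the plain log lines never contain the text "<request" or "<response".

```python
import re

# Shortened, illustrative log in the same shape as the one in the question.
SAMPLE_LOG = """2014-10-31 12:27:33,600 INFO Sender [] Sending BILL request
2014-10-31 12:27:33,601 INFO Sender [] <?xml version="1.0" encoding="UTF-8"?>
<request xmlns="XXX">
<clientreferencenumber>123</clientreferencenumber>
</request>
2014-10-31 12:27:34,487 INFO Sender [] Response code 200 for bill request
2014-10-31 12:27:34,489 INFO Sender [] <?xml version="1.0" encoding="UTF-8"?>
<response xmlns="XXX">
<status>OK</status>
</response>
"""

# Non-greedy match from an opening tag to the first matching closing tag;
# re.DOTALL lets "." span the line breaks inside each envelope.
REQUEST_RE = re.compile(r"<request\b.*?</request>", re.DOTALL)
RESPONSE_RE = re.compile(r"<response\b.*?</response>", re.DOTALL)


def extract_envelopes(log_text):
    """Return (requests, responses) as two lists of XML strings."""
    return REQUEST_RE.findall(log_text), RESPONSE_RE.findall(log_text)


requests, responses = extract_envelopes(SAMPLE_LOG)
print(len(requests), len(responses))
```

Each string in the two lists can later be handed to ET.fromstring on its own, without the <root> wrapper.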

How to keep the xml-stylesheet?

I want to keep the xml-stylesheet processing instruction, but it gets lost.
I use Python to modify the XML in order to deploy Hadoop automatically.
XML:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://c11:9000</value>
  </property>
</configuration>
Code:
from xml.etree.ElementTree import ElementTree as ET
def modify_core_site(namenode_hostname):
    tree = ET()
    tree.parse("pkg/core-site.xml")
    root = tree.getroot()
    for p in root.iter("property"):
        name = p.find("name").text
        if name == "fs.default.name":
            text = "hdfs://%s:9000" % namenode_hostname
            p.find("value").text = text
    tree.write("pkg/tmp.xml", encoding="utf-8", xml_declaration=True)
modify_core_site("c80")
Result:
<?xml version='1.0' encoding='utf-8'?>
<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://c80:9000</value>
  </property>
</configuration>
The xml-stylesheet line disappears.
How can I keep it?
One solution is to use lxml, which keeps the nodes that precede the root element. Once you have parsed the XML, walk back from the root until you find the XSL node. A quick sample:
>>> import lxml.etree
>>> doc = lxml.etree.parse('C:/downloads/xmltest.xml')
>>> root = doc.getroot()
>>> xslnode=root.getprevious().getprevious()
>>> xslnode
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
Make sure you add some exception handling and check that the node actually exists. You can check whether the node is an XSLT processing instruction with:
>>> isinstance(xslnode, lxml.etree._XSLTProcessingInstruction)
True
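If lxml is not available, a different technique that may work is the standard library's xml.dom.minidom, which keeps top-level processing instructions and comments as children of the document node. The sketch below is an assumption-laden illustration rather than the method from the answer above: it works on a string instead of the pkg/core-site.xml file, and the function name modify_core_site_text is made up.

```python
from xml.dom import minidom

SAMPLE = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://c11:9000</value>
  </property>
</configuration>
"""


def modify_core_site_text(xml_text, namenode_hostname):
    """Hypothetical helper: rewrite fs.default.name and return the XML text."""
    doc = minidom.parseString(xml_text)
    for prop in doc.getElementsByTagName("property"):
        name = prop.getElementsByTagName("name")[0].firstChild.data
        if name == "fs.default.name":
            value = prop.getElementsByTagName("value")[0]
            # The text of <value> is a Text node; replace its data in place.
            value.firstChild.data = "hdfs://%s:9000" % namenode_hostname
    # toxml() serializes the whole document, including the stylesheet
    # processing instruction and the comment before the root element.
    return doc.toxml()


result = modify_core_site_text(SAMPLE, "c80")
print(result)
```

The serialized output is not byte-for-byte identical to the input (the XML declaration is regenerated), but the xml-stylesheet line survives.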

How to extract the string values "Hello" and "World" from the XML using Python 2.6

I need to extract the strings "Hello" and "World" using Python 2.6. Please advise.
<Translate_Array_Request>
<App_Id />
<From>language-code</From>
<Options>
<Category xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" >string-value</Category>
<ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
<ReservedFlags xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" />
<State xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" >int-value</State>
<Uri xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" >string-value</Uri>
<User xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" >string-value</User>
</Options>
<Texts>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">**Hello**</string>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">**World**</string>
</Texts>
<To>language-code</To>
</Translate_Array_Request>
There are multiple libraries in Python that let you parse and extract data from XML. One way is to use the ElementTree XML API. Assuming the input is saved as a string xml_data, this is what you do:
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring(xml_data)
>>> texts = root.find('Texts')
>>> for data in texts:
...     print(data.text)
...
**Hello**
**World**
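With the xml package, do something like:
import xml.etree.ElementTree as ET
# the <string> elements declare this namespace, so the tag must be qualified with it
ARRAYS_NS = "{http://schemas.microsoft.com/2003/10/Serialization/Arrays}"
def getTags(xml):
    root = ET.fromstring(xml)
    res = []
    # iter() was added in Python 2.7; on 2.6 use root.getiterator() instead
    for tag in root.iter(ARRAYS_NS + "string"):
        res.append(tag.text)
    return res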
An alternative solution using minidom:
import xml.dom.minidom as minidom
def getTags(xml):
    root = minidom.parseString(xml)
    return [i.firstChild.nodeValue for i in root.getElementsByTagName('string')]
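Because the string elements carry their own xmlns, ElementTree only matches them when the tag is namespace-qualified. A minimal sketch of one way to do that is a prefix mapping passed to findall. Note the assumptions: this needs Python 2.7 or 3 rather than 2.6, the prefix "arr" and the function name get_texts are arbitrary choices made here, and the sample is a trimmed copy of the XML from the question without the bold markers.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the request from the question.
XML_DATA = """<Translate_Array_Request>
  <From>language-code</From>
  <Texts>
    <string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">Hello</string>
    <string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">World</string>
  </Texts>
  <To>language-code</To>
</Translate_Array_Request>"""

# The prefix is local to this mapping; only the URI has to match the document.
NAMESPACES = {"arr": "http://schemas.microsoft.com/2003/10/Serialization/Arrays"}


def get_texts(xml_text):
    """Return the text of every namespaced <string> element under <Texts>."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("Texts/arr:string", NAMESPACES)]


texts = get_texts(XML_DATA)
print(texts)
```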

How to parse xml in python?

I have to extract friendlyName from the XML document.
Here's my current solution:
root = ElementTree.fromstring(urllib2.urlopen(XMLLocation).read())
for child in root.iter('{urn:schemas-upnp-org:device-1-0}friendlyName'):
    return child.text
Is there a better way to do this (maybe one that does not involve iteration)? Could I use XPath?
XML content:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="urn:schemas-upnp-org:device-1-0">
<specVersion>
<major>1</major>
<minor>0</minor>
</specVersion>
<device>
<dlna:X_DLNADOC xmlns:dlna="urn:schemas-dlna-org:device-1-0">DMR-1.50</dlna:X_DLNADOC>
<deviceType>urn:schemas-upnp-org:device:MediaRenderer:1</deviceType>
<friendlyName>My Product 912496</friendlyName>
<manufacturer>embedded</manufacturer>
<manufacturerURL>http://www.embedded.com</manufacturerURL>
<modelDescription>Product</modelDescription>
<modelName>Product</modelName>
<modelNumber />
<modelURL>http://www.embedded.com</modelURL>
<UDN>uuid:93b2abac-cb6a-4857-b891-002261912496</UDN>
<serviceList>
<service>
<serviceType>urn:schemas-upnp-org:service:ConnectionManager:1</serviceType>
<serviceId>urn:upnp-org:serviceId:ConnectionManager</serviceId>
<SCPDURL>/xml/ConnectionManager.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelSinkConnectionManager</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelSinkConnectionManager</controlURL>
</service>
<service>
<serviceType>urn:schemas-upnp-org:service:AVTransport:1</serviceType>
<serviceId>urn:upnp-org:serviceId:AVTransport</serviceId>
<SCPDURL>/xml/AVTransport2.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelAVTransport</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelAVTransport</controlURL>
</service>
<service>
<serviceType>urn:schemas-upnp-org:service:RenderingControl:3</serviceType>
<serviceId>urn:upnp-org:serviceId:RenderingControl</serviceId>
<SCPDURL>/xml/RenderingControl2.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelRenderingControl</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelRenderingControl</controlURL>
</service>
<service>
<serviceType>urn:schemas-embedded-com:service:RTSPGateway:1</serviceType>
<serviceId>urn:embedded-com:serviceId:RTSPGateway</serviceId>
<SCPDURL>/xml/RTSPGateway.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelRTSPGateway</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelRTSPGateway</controlURL>
</service>
<service>
<serviceType>urn:schemas-embedded-com:service:SpeakerManagement:1</serviceType>
<serviceId>urn:embedded-com:serviceId:SpeakerManagement</serviceId>
<SCPDURL>/xml/SpeakerManagement.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelSpeakerManagement</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelSpeakerManagement</controlURL>
</service>
<service>
<serviceType>urn:schemas-embedded-com:service:NetworkManagement:1</serviceType>
<serviceId>urn:embedded-com:serviceId:NetworkManagement</serviceId>
<SCPDURL>/xml/NetworkManagement.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelNetworkManagement</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelNetworkManagement</controlURL>
</service>
</serviceList>
<iconList>
<icon>
<mimetype>image/png</mimetype>
<width>120</width>
<height>120</height>
<depth>32</depth>
<url>/org.mpris.MediaPlayer2.mansion-120x120x32.png</url>
</icon>
<icon>
<mimetype>image/png</mimetype>
<width>48</width>
<height>48</height>
<depth>32</depth>
<url>/org.mpris.MediaPlayer2.mansion-48x48x32.png</url>
</icon>
<icon>
<mimetype>image/jpeg</mimetype>
<width>120</width>
<height>120</height>
<depth>24</depth>
<url>/org.mpris.MediaPlayer2.mansion-120x120x24.jpg</url>
</icon>
<icon>
<mimetype>image/jpeg</mimetype>
<width>48</width>
<height>48</height>
<depth>24</depth>
<url>/org.mpris.MediaPlayer2.mansion-48x48x24.jpg</url>
</icon>
</iconList>
<X_embeddedDevice xmlns:edd="schemas-embedded-com:extended-device-description">
<firmwareVersion>v1.0 (4.155.1.15.002)</firmwareVersion>
<features>
<feature>
<name>com.sony.Product</name>
<version>1.0.0</version>
</feature>
<feature>
<name>com.sony.Product.btmrc</name>
<version>1.0.0</version>
</feature>
<feature>
<name>com.sony.Product.btmrs</name>
<version>1.0.0</version>
</feature>
</features>
</X_embeddedDevice>
</device>
</root>
Using ElementTree, you can either read directly from a file or parse the XML from a string.
First, include the following import.
# ParseError exists on Python 2.7+ and 3; older versions raised xml.parsers.expat.ExpatError
from xml.etree.ElementTree import ElementTree, ParseError
If you are using a string:
from xml.etree.ElementTree import fromstring
try:
    # fromstring() returns the root element directly
    root = fromstring(xml_data)
except ParseError:
    print("Unable to parse XML data from string")
Otherwise, to load it directly from a file:
try:
    tree = ElementTree(file="filename")
    root = tree.getroot()
except ParseError:
    print("Unable to parse XML from file")
Once you have the root element, you can begin extracting the information. The document declares a default namespace, so each tag in the path must be qualified with it:
ns = '{urn:schemas-upnp-org:device-1-0}'
print(root.find('%sdevice/%sfriendlyName' % (ns, ns)).text)
Pedro, in the comments, is right.
.find(match, namespaces=None)
Finds the first subelement matching match. match may be a tag name or a path. Returns an element instance or None. namespaces is an optional mapping from namespace prefix to full name.
The ElementTree docs are really helpful in these cases.
https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.find
Edit:
The link I gave in the comments leads to the following code:
import xml.etree.ElementTree as ET
input = '''<stuff>
<users>
<user x="2">
<id>001</id>
<name>Chuck</name>
</user>
<user x="7">
<id>009</id>
<name>Brent</name>
</user>
</users>
</stuff>
'''
stuff = ET.fromstring(input)
lst = stuff.findall("users/user")
print(len(lst))
for item in lst:
    print(item.attrib["x"])
item = lst[0]
ET.dump(item)
item.get("x")  # get works on attributes
item.find("id").text
item.find("id").tag
# iter() replaces getiterator(), which was removed in Python 3.9
for user in stuff.iter('user'):
    print("User " + user.attrib["x"])
    ET.dump(user)
The code above uses:
item.find("id").text
If you adapt that and remove the code you don't need, the find should look something like this (with root being the parsed document, and each tag qualified with the document's default namespace):
root.find('{urn:schemas-upnp-org:device-1-0}device/{urn:schemas-upnp-org:device-1-0}friendlyName').text
You can parse the XML file directly, instead of using the input string, with the following (from the ElementTree docs):
import xml.etree.ElementTree as ET
tree = ET.parse('your_file_name.xml')
import urllib2
import xml.etree.ElementTree as ElementTree
namespace = '{urn:schemas-upnp-org:device-1-0}'
root = ElementTree.fromstring(urllib2.urlopen(XMLLocation).read())
# The leading `.//` searches all subelements at any depth below the root.
return root.find('.//{}friendlyName'.format(namespace)).text
The find() function stops at the first match. To get all of the elements that match the path, use the findall() function.
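The difference between find() and findall() on a namespaced document can be sketched as follows. This is an illustrative example only: it uses Python 3, a heavily trimmed copy of the device description instead of a download, and a prefix mapping (the prefix "upnp" is an arbitrary choice made here) rather than the '{uri}tag' form.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the device description from the question.
DEVICE_XML = """<root xmlns="urn:schemas-upnp-org:device-1-0">
  <device>
    <friendlyName>My Product 912496</friendlyName>
    <serviceList>
      <service>
        <serviceType>urn:schemas-upnp-org:service:ConnectionManager:1</serviceType>
      </service>
      <service>
        <serviceType>urn:schemas-upnp-org:service:AVTransport:1</serviceType>
      </service>
    </serviceList>
  </device>
</root>"""

# Maps a local prefix to the document's default namespace.
UPNP = {"upnp": "urn:schemas-upnp-org:device-1-0"}

root = ET.fromstring(DEVICE_XML)

# find() returns the first match or None, so guard before reading .text.
friendly = root.find("upnp:device/upnp:friendlyName", UPNP)
friendly_name = friendly.text if friendly is not None else None

# findall() returns every match; './/' searches at any depth.
service_types = [e.text for e in root.findall(".//upnp:serviceType", UPNP)]

print(friendly_name)
print(service_types)
```

An unqualified path such as 'device/friendlyName' returns None on this document, which is why the namespace has to appear in the path one way or another.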

xml.dom.minidom getting elements by tagname

How can I retrieve the value of code from the XML string below using xml.dom.minidom?
<data>
<element1>
<name>myname</name>
</element1>
<element2>
<code>3</code>
<name>another name</name>
</element2>
</data>
Because multiple 'name' tags can appear I would like to do something like this:
from xml.dom.minidom import parseString
dom = parseString("<data>...</data>")
dom.getElementsByTagName("element1").getElementsByTagName("name")
But that doesn't work unfortunately.
The code below worked fine for me. I think you have multiple tags and want to get the name from the second one.
myxml = """\
<data>
<element>
<name>myname</name>
</element>
<element>
<code>3</code>
<name>another name</name>
</element>
</data>
"""
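import xml.dom.minidom
dom = xml.dom.minidom.parseString(myxml)
# getElementsByTagName returns a NodeList, so index into it before searching further
nodelist = dom.getElementsByTagName("element")[1].getElementsByTagName("name")
for node in nodelist:
    print(node.toxml())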
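For the XML exactly as posted in the question (with element1 and element2), the value of code can be read in the same way, by indexing into each NodeList before descending. The helper first_child_text below is a hypothetical name used only for this sketch, which is written for Python 3 and returns None when a tag is missing instead of raising.

```python
from xml.dom.minidom import parseString

XML_DATA = """<data>
<element1>
<name>myname</name>
</element1>
<element2>
<code>3</code>
<name>another name</name>
</element2>
</data>"""


def first_child_text(dom, parent_tag, child_tag):
    """Return the text of the first <child_tag> inside the first <parent_tag>, or None."""
    parents = dom.getElementsByTagName(parent_tag)
    if not parents:
        return None
    children = parents[0].getElementsByTagName(child_tag)
    if not children or children[0].firstChild is None:
        return None
    # firstChild is the Text node holding the element's content.
    return children[0].firstChild.data


dom = parseString(XML_DATA)
code = first_child_text(dom, "element2", "code")
print(code)
```

The original attempt failed because getElementsByTagName was called on the NodeList itself rather than on one of its elements.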
