How to parse XML in Python?

I have to extract friendlyName from the XML document.
Here's my current solution:
root = ElementTree.fromstring(urllib2.urlopen(XMLLocation).read())
for child in root.iter('{urn:schemas-upnp-org:device-1-0}friendlyName'):
    return child.text
Is there a better way to do this (ideally one that does not involve iteration)? Could I use XPath?
XML content:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="urn:schemas-upnp-org:device-1-0">
<specVersion>
<major>1</major>
<minor>0</minor>
</specVersion>
<device>
<dlna:X_DLNADOC xmlns:dlna="urn:schemas-dlna-org:device-1-0">DMR-1.50</dlna:X_DLNADOC>
<deviceType>urn:schemas-upnp-org:device:MediaRenderer:1</deviceType>
<friendlyName>My Product 912496</friendlyName>
<manufacturer>embedded</manufacturer>
<manufacturerURL>http://www.embedded.com</manufacturerURL>
<modelDescription>Product</modelDescription>
<modelName>Product</modelName>
<modelNumber />
<modelURL>http://www.embedded.com</modelURL>
<UDN>uuid:93b2abac-cb6a-4857-b891-002261912496</UDN>
<serviceList>
<service>
<serviceType>urn:schemas-upnp-org:service:ConnectionManager:1</serviceType>
<serviceId>urn:upnp-org:serviceId:ConnectionManager</serviceId>
<SCPDURL>/xml/ConnectionManager.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelSinkConnectionManager</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelSinkConnectionManager</controlURL>
</service>
<service>
<serviceType>urn:schemas-upnp-org:service:AVTransport:1</serviceType>
<serviceId>urn:upnp-org:serviceId:AVTransport</serviceId>
<SCPDURL>/xml/AVTransport2.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelAVTransport</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelAVTransport</controlURL>
</service>
<service>
<serviceType>urn:schemas-upnp-org:service:RenderingControl:3</serviceType>
<serviceId>urn:upnp-org:serviceId:RenderingControl</serviceId>
<SCPDURL>/xml/RenderingControl2.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelRenderingControl</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelRenderingControl</controlURL>
</service>
<service>
<serviceType>urn:schemas-embedded-com:service:RTSPGateway:1</serviceType>
<serviceId>urn:embedded-com:serviceId:RTSPGateway</serviceId>
<SCPDURL>/xml/RTSPGateway.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelRTSPGateway</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelRTSPGateway</controlURL>
</service>
<service>
<serviceType>urn:schemas-embedded-com:service:SpeakerManagement:1</serviceType>
<serviceId>urn:embedded-com:serviceId:SpeakerManagement</serviceId>
<SCPDURL>/xml/SpeakerManagement.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelSpeakerManagement</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelSpeakerManagement</controlURL>
</service>
<service>
<serviceType>urn:schemas-embedded-com:service:NetworkManagement:1</serviceType>
<serviceId>urn:embedded-com:serviceId:NetworkManagement</serviceId>
<SCPDURL>/xml/NetworkManagement.xml</SCPDURL>
<eventSubURL>/Event/org.mpris.MediaPlayer2.mansion/RygelNetworkManagement</eventSubURL>
<controlURL>/Control/org.mpris.MediaPlayer2.mansion/RygelNetworkManagement</controlURL>
</service>
</serviceList>
<iconList>
<icon>
<mimetype>image/png</mimetype>
<width>120</width>
<height>120</height>
<depth>32</depth>
<url>/org.mpris.MediaPlayer2.mansion-120x120x32.png</url>
</icon>
<icon>
<mimetype>image/png</mimetype>
<width>48</width>
<height>48</height>
<depth>32</depth>
<url>/org.mpris.MediaPlayer2.mansion-48x48x32.png</url>
</icon>
<icon>
<mimetype>image/jpeg</mimetype>
<width>120</width>
<height>120</height>
<depth>24</depth>
<url>/org.mpris.MediaPlayer2.mansion-120x120x24.jpg</url>
</icon>
<icon>
<mimetype>image/jpeg</mimetype>
<width>48</width>
<height>48</height>
<depth>24</depth>
<url>/org.mpris.MediaPlayer2.mansion-48x48x24.jpg</url>
</icon>
</iconList>
<X_embeddedDevice xmlns:edd="schemas-embedded-com:extended-device-description">
<firmwareVersion>v1.0 (4.155.1.15.002)</firmwareVersion>
<features>
<feature>
<name>com.sony.Product</name>
<version>1.0.0</version>
</feature>
<feature>
<name>com.sony.Product.btmrc</name>
<version>1.0.0</version>
</feature>
<feature>
<name>com.sony.Product.btmrs</name>
<version>1.0.0</version>
</feature>
</features>
</X_embeddedDevice>
</device>
</root>

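Using ElementTree, you can either parse a string you have already loaded or read directly from a file.
First, include the following import. (On Python 2.7 and later, ElementTree reports malformed XML as ParseError; older versions raised xml.parsers.expat.ExpatError instead.)
from xml.etree.ElementTree import ElementTree, ParseError
If you are using a string:
from xml.etree.ElementTree import fromstring
try:
    # fromstring() returns the root Element directly, not a tree
    root = fromstring(xml_data)
except ParseError:
    print "Unable to parse XML data from string"
Otherwise, to load it directly:
try:
    tree = ElementTree(file="filename")
    root = tree.getroot()
except ParseError:
    print "Unable to parse XML from file"
Once you have the root element, you can begin extracting information. The document in the question declares a default namespace, so each tag in the path has to be qualified with it:
ns = '{urn:schemas-upnp-org:device-1-0}'
print root.find(ns + 'device/' + ns + 'friendlyName').text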

Pedro, in the comments, is right.
.find(match, namespaces=None)
Finds the first subelement matching match. match may be a tag name or a path. Returns an element instance or None. namespaces is an optional mapping from namespace prefix to full name.
The ElementTree docs are really helpful in these cases.
https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.find
Edit:
The link I gave in the comments leads to the following code:
import xml.etree.ElementTree as ET
input = '''<stuff>
<users>
<user x="2">
<id>001</id>
<name>Chuck</name>
</user>
<user x="7">
<id>009</id>
<name>Brent</name>
</user>
</users>
</stuff>
'''
stuff = ET.fromstring(input)
lst = stuff.findall("users/user")
print len(lst)
for item in lst:
    print item.attrib["x"]
item = lst[0]
ET.dump(item)
item.get("x")  # get works on attributes
item.find("id").text
item.find("id").tag
# iter() replaces the deprecated getiterator()
for user in stuff.iter('user'):
    print "User", user.attrib["x"]
    ET.dump(user)
The code above uses:
item.find("id").text
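If you modify that and remove the other code you don't need, the find should look something like this. Because the document in the question declares a default namespace, each tag in the path needs the namespace in front of it:
ns = '{urn:schemas-upnp-org:device-1-0}'
root.find(ns + 'device/' + ns + 'friendlyName').text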
To read the XML from a file instead of from the input string, use the following (from the ElementTree docs):
import xml.etree.ElementTree as ET
tree = ET.parse('your_file_name.xml')
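The namespaces argument quoted above can keep the path readable when the document declares a default namespace, as the one in the question does. The following is a minimal sketch written for Python 3, using a trimmed copy of the question's XML; the prefix "upnp" and the names XML_DATA and NS are assumptions chosen for illustration, not anything the device or the library requires.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the device description from the question.
XML_DATA = """<root xmlns="urn:schemas-upnp-org:device-1-0">
  <device>
    <friendlyName>My Product 912496</friendlyName>
    <manufacturer>embedded</manufacturer>
  </device>
</root>"""

# Map a local prefix of our choosing to the namespace URI used in the document.
NS = {"upnp": "urn:schemas-upnp-org:device-1-0"}

root = ET.fromstring(XML_DATA)

# find() returns the first match or None, so guard before reading .text.
name_element = root.find("upnp:device/upnp:friendlyName", NS)
friendly_name = name_element.text if name_element is not None else None
print(friendly_name)

# Without the namespace, the same path matches nothing.
unqualified = root.find("device/friendlyName")
```

The prefix only has to be consistent between the path and the mapping; it does not need to appear in the XML itself.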

import urllib2
import xml.etree.ElementTree as ElementTree
namespace = '{urn:schemas-upnp-org:device-1-0}'
root = ElementTree.fromstring(urllib2.urlopen(XMLLocation).read())
# The `.//` prefix searches all descendants of the root, at any depth.
return root.find('.//{}friendlyName'.format(namespace)).text
The find() function stops when it finds the first match. To get all of the elements that match the XPath, use the findall() function.
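To make the difference between find() and findall() concrete, here is a small sketch for Python 3 that collects every serviceType from a shortened version of the question's document. The variable names and the reduced XML are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened version of the device description, with two services only.
XML_DATA = """<root xmlns="urn:schemas-upnp-org:device-1-0">
  <device>
    <friendlyName>My Product 912496</friendlyName>
    <serviceList>
      <service>
        <serviceType>urn:schemas-upnp-org:service:ConnectionManager:1</serviceType>
      </service>
      <service>
        <serviceType>urn:schemas-upnp-org:service:AVTransport:1</serviceType>
      </service>
    </serviceList>
  </device>
</root>"""

NAMESPACE = "{urn:schemas-upnp-org:device-1-0}"
root = ET.fromstring(XML_DATA)

# find() stops at the first matching descendant.
first_service = root.find(".//" + NAMESPACE + "serviceType")

# findall() returns a list with every matching descendant, in document order.
all_services = [elem.text for elem in root.findall(".//" + NAMESPACE + "serviceType")]
print(all_services)
```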

Related

Same prefix, multiple namespace in XML - How to add element attrib without affecting other in Python

I have the below input XML:
<host xmlns="urn:jboss:domain:4.1" >
<profile>
<subsystem xmlns="urn:jboss:domain:jmx:1.3">
<expose-resolved-model/>
<expose-expression-model/>
<remoting-connector/>
</subsystem>
</profile>
</host>
As you can see, xmlns is used for two namespaces: "urn:jboss:domain:4.1" and "urn:jboss:domain:jmx:1.3".
I would like to add an attribute to the host element.
Below is my code in Python:
from xml.etree import ElementTree as ET
def parse_xml():
    ET.register_namespace('', 'urn:jboss:domain:4.1')
    tree = ET.parse('sample.xml')
    root = tree.getroot()
    for elements in tree.iter():
        if "host" in elements.tag:
            elements.attrib['name'] = "slaveOne"
            print elements.attrib
    tree.write('sample.xml')
The above code changes the XML as below:
<host xmlns="urn:jboss:domain:4.1" xmlns:ns1="urn:jboss:domain:jmx:1.3" name="slaveOne">
<profile>
<ns1:subsystem>
<ns1:expose-resolved-model />
<ns1:expose-expression-model />
<ns1:remoting-connector />
</ns1:subsystem>
</profile>
</host>
With ET.register_namespace('','urn:jboss:domain:4.1') in place, tree.write('sample.xml') also rewrites every element of the second namespace with an ns1 prefix.
How do I restrict the change to the host element only?
You can try using minidom:
from xml.dom import minidom
doc = minidom.parse("sample.xml")
# getElementsByTagName returns a NodeList; grab the first match
host = doc.getElementsByTagName("host")[0]
# set attribute -> value
# (look at setAttributeNS in the minidom docs for namespaces)
host.setAttribute('name', '123')
# write to file
with open('sample2.xml', 'w') as xmlfile:
    doc.writexml(xmlfile)
Take a look at xml.sax package and this XML Processing Modules page too.
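As a quick check of the minidom approach, the sketch below (Python 3) parses a shortened copy of the question's XML from a string, sets the attribute, and serializes the result in memory. The names XML_DATA and result are assumptions for illustration; the point is that minidom keeps the nested xmlns declaration as an ordinary attribute instead of inventing a prefix.

```python
from xml.dom import minidom

# Shortened copy of the input from the question.
XML_DATA = """<host xmlns="urn:jboss:domain:4.1">
  <profile>
    <subsystem xmlns="urn:jboss:domain:jmx:1.3">
      <expose-resolved-model/>
    </subsystem>
  </profile>
</host>"""

doc = minidom.parseString(XML_DATA)

# getElementsByTagName() returns a NodeList; the document has one host element.
host = doc.getElementsByTagName("host")[0]
host.setAttribute("name", "slaveOne")

# Serialize to a string instead of a file so the result can be inspected.
result = doc.toxml()
print(result)
```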

Why won't this check for an element work using python elementtree

I finally decided to learn how to parse xml in python. I'm using elementtree just to get a basic understanding. I'm on CentOS 6.5 using python 2.7.9. I've looked through the following pages:
http://www.diveintopython3.net/xml.html
https://pymotw.com/2/xml/etree/ElementTree/parse.html#traversing-the-parsed-tree
and performed several searches on this forum, but I'm having some trouble and I'm not sure if it's my code or the xml I'm trying to parse.
I need to be able to verify if certain elements are in the xml or not. For example, in the xml below, I need to check to see if the element Analyzer is present and if so, get the attribute. Then, if Analyzer is present, I need to check for the location element and get the text then the name element and get that text. I thought that the following code would check to see if the element existed:
if element.find('...') is not None
but that yields inconsistent results and it never seems to find the location or name element. For example:
if tree.find('Alert') is not None:
appears to work, but
if tree.find('location') is not None:
or
if tree.find('Analyzer') is not None:
definitely don't work. I'm guessing that the tree.find() function only works for the top level?
So how do I do this check?
Here is my xml:
<?xml version='1.0' encoding='UTF-8'?>
<Report>
<Alert>
<Analyzer analyzerid="CS">
<Node>
<location>USA</location>
<name>John Smith</name>
</Node>
</Analyzer>
<AnalyzerTime>2016-06-11T00:30:02+0000</AnalyzerTime>
<AdditionalData type="integer" meaning="number of alerts in this report">19</AdditionalData>
<AdditionalData type="string" meaning="report schedule">5 minutes</AdditionalData>
<AdditionalData type="string" meaning="report type">alerts</AdditionalData>
<AdditionalData type="date-time" meaning="report start time">2016-06-11T00:25:16+0000</AdditionalData>
</Alert>
<Alert>
<CreateTime>2016-06-11T00:25:16+0000</CreateTime>
<Source>
<Node>
<Address category="ipv4-addr">
<address>1.5.1.4</address>
</Address>
</Node>
</Source>
<Target>
<Service>
<port>22</port>
<protocol>TCP</protocol>
</Service>
</Target>
<Classification text="SSH scans, direction:ingress, confidence:80, severity:high">
<Reference meaning="Scanning" origin="user-specific">
<name>SSH Attack</name>
<url> </url>
</Reference>
</Classification>
<Assessment>
<Action category="block-installed"/>
</Assessment>
<AdditionalData type="string" meaning="top level domain owner">PH, Philippines</AdditionalData>
<AdditionalData type="integer" meaning="alert threshold">0</AdditionalData>
</Alert>
</Report>
And here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for child in root: print child
all_links = tree.findall('.//Analyzer')
try:
    print all_links[0].attrib.get('analyzerid')
    ID = all_links[0].attrib.get('analyzerid')
    all_links2 = tree.findall('.//location')
    print all_links2
    try:
        print all_links[0].text
    except: print "can't print text location"
    if tree.find('location') is None: print 'lost'
    for kid in tree.iter('location'):
        try:
            location = kid.text
            print kid.text
        except: print 'bad'
except IndexError: print 'There was no Analyzer element'
I think you're missing one important line from the Dive Into Python tutorial (just up from here):
There is a way to search for descendant elements, i.e. children, grandchildren, and any element at any nesting level.
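That way is to precede the element name with // (in ElementTree, write it as .// so that the path stays relative).
tree.find("someElementName") only finds a direct child of the root element with the name someElementName, which is why tree.find('Alert') works but tree.find('location') does not. If you want to search for an element named someElementName anywhere within tree, use tree.find(".//someElementName").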
The // notation originates from XPath. The ElementTree module provides support for a limited subset of XPath. The ElementTree documentation details the parts of XPath syntax it supports.
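Applied to the question, the check could be sketched as follows (Python 3, with a shortened copy of the report). The function name describe_analyzer and the tuple it returns are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the report from the question.
XML_DATA = """<Report>
  <Alert>
    <Analyzer analyzerid="CS">
      <Node>
        <location>USA</location>
        <name>John Smith</name>
      </Node>
    </Analyzer>
  </Alert>
  <Alert>
    <CreateTime>2016-06-11T00:25:16+0000</CreateTime>
  </Alert>
</Report>"""


def describe_analyzer(root):
    """Return (analyzerid, location, name), or None if there is no Analyzer."""
    analyzer = root.find(".//Analyzer")  # search at any depth
    if analyzer is None:
        return None
    # findtext() returns None when the element is missing.
    location = analyzer.findtext(".//location")
    name = analyzer.findtext(".//name")
    return (analyzer.get("analyzerid"), location, name)


root = ET.fromstring(XML_DATA)
print(describe_analyzer(root))

# A bare tag name only matches direct children of the element searched from.
direct_child = root.find("Alert")
not_a_direct_child = root.find("location")
```

Searching from the Analyzer element rather than from the root keeps the name lookup from picking up unrelated name elements elsewhere in the report.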

Copy attribute information when different element have the same name in XML with python

So, here's my XML tree:
<?xml version="1.0"?>
<api>
<query>
<normalized>
<n from="Brain_cancer" to="Brain cancer" />
</normalized>
<redirects>
<r from="Brain cancer" to="Brain tumor"
/>
</redirects>
<pages>
<page pageid="37284" ns="0" title="Brain tumor">
<revisions>
<rev revid="412658600" parentid="412501243" user="Andycjp" userid="55014" timestamp="2011-02-08T03:35:27Z" size="59870" sha1="fe1ff25c27ebc86572aa4be8201cb813e1bf3d32" comment="/* Psychological and behavioral consequences */" contentformat="text/x-wiki" contentmodel="wikitext" xml:space="preserve">
</rev>
</revisions>
</page>
</pages>
</query>
<warnings>
<revisions xml:space="preserve">
</revisions>
<result xml:space="preserve">
</result>
</warnings>
<query-continue>
<revisions rvcontinue="456175380"
/>
</query-continue>
</api>
So, as you can see, the "revisions" element appears in several places, at different levels. My objective is to reach the attribute "rvcontinue" (whose path is api/query-continue/revisions) and copy its value into a new variable. It's probably because I'm just not getting it right, but ElementTree and XPath haven't worked so far.
This is what I've done so far, but it's getting nowhere:
import xml.etree.ElementTree as ET
tree = ET.parse('Brain_tumor_5.xml')
for elem in tree.getiterator():
    if elem.tag=='{http://www.namespace.co.uk}query-continue':
        output = {}
        for elem1 in list(elem):
            if elem1.tag=='{http://www.namespace.co.uk}revisions':
                output['rvcontinue']=elem1.text
        print output
p = tree.find("./api/query-continue/revisions[@rvcontinue=]")
q = p.attrib
print q
I have mostly used lxml, so I don't know what's up with etree, but it appears that find from the tree doesn't work, while find from the root does:
>>> tree.getroot().find('query-continue/revisions[@rvcontinue]').attrib['rvcontinue']
'456175380'
Also: I don't know if it's just a typo above, but:
p = tree.find("./api/query-continue/revisions[@rvcontinue=]")
will give a SyntaxError: invalid predicate
Added note: tree.find('api') returns None, but tree.find('.') returns <Element 'api' at 0x1004e5f10>, because api is the root element itself and paths are resolved relative to it.
So tree.find('./query-continue/revisions[@rvcontinue]') will also work.
This does not directly answer your question. However, I would use lxml.etree (which supposedly provides the same ElementTree interface) and the following code:
>>> import lxml.etree
>>> doc = lxml.etree.parse('doc.xml')
>>> node = doc.xpath('/api/query-continue/revisions[@rvcontinue]')
>>> node[0].attrib['rvcontinue']
'456175380'
Tried with xml.etree.ElementTree but doesn't appear to work.
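For reference, the standard library can do this as long as the path is relative to the root element and does not repeat the root's own tag. A minimal sketch for Python 3, using a shortened copy of the question's document (the variable names are assumptions for illustration):

```python
import xml.etree.ElementTree as ET

# Shortened copy of the API response from the question.
XML_DATA = """<api>
  <query>
    <pages>
      <page pageid="37284" ns="0" title="Brain tumor">
        <revisions>
          <rev revid="412658600" />
        </revisions>
      </page>
    </pages>
  </query>
  <warnings>
    <revisions />
  </warnings>
  <query-continue>
    <revisions rvcontinue="456175380" />
  </query-continue>
</api>"""

root = ET.fromstring(XML_DATA)  # root is the <api> element itself

# The path starts below <api>; [@rvcontinue] keeps only elements that
# carry that attribute.
revisions = root.find("query-continue/revisions[@rvcontinue]")
rvcontinue = revisions.get("rvcontinue") if revisions is not None else None
print(rvcontinue)

# Including the root tag in the path looks for an <api> child of <api>.
wrong_path = root.find("api/query-continue/revisions")
```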

Accessing Elements with and without namespaces using lxml

Is there a way, using lxml, to search in a single pass for an element that occurs in a document both with and without a namespace? As an example, I would want to get all occurrences of the element identifier, irrespective of whether or not it is associated with a specific namespace. I am currently only able to access them separately, as below.
Code:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
root = xmlfile.getroot()
for l in root.iter('identifier'):
    print l.text
for l in root.iter('{http://www.openarchives.org/OAI/2.0/provenance}identifier'):
    print l.text
File: xmlfile.xml
<?xml version="1.0"?>
<record>
<header>
<identifier>identifier1</identifier>
<datestamp>datastamp1</datestamp>
<setSpec>setspec1</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>title1</dc:title>
<dc:title>title2</dc:title>
<dc:creator>creator1</dc:creator>
<dc:subject>subject1</dc:subject>
<dc:subject>subject2</dc:subject>
</oai_dc:dc>
</metadata>
<about>
<provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription altered="false" harvestDate="2011-08-11T03:47:51Z">
<baseURL>baseURL1</baseURL>
<identifier>identifier3</identifier>
<datestamp>datestamp2</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
<originDescription altered="false" harvestDate="2010-10-10T06:15:53Z">
<baseURL>xxxxx</baseURL>
<identifier>identifier4</identifier>
<datestamp>2010-04-27T01:10:31Z</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
</originDescription>
</originDescription>
</provenance>
</about>
</record>
You could use XPath to solve that kind of issue:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
identifier_nodes = xmlfile.xpath("//*[local-name() = 'identifier']")
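The same idea, matching on the local name and ignoring the namespace, can be sketched without lxml by stripping the namespace part of each tag by hand. This is an alternative written against the standard library's xml.etree.ElementTree for Python 3, not the lxml call above; the helper local_name and the shortened sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of xmlfile.xml from the question.
XML_DATA = """<record>
  <header>
    <identifier>identifier1</identifier>
  </header>
  <about>
    <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
      <originDescription>
        <identifier>identifier3</identifier>
        <originDescription>
          <identifier>identifier4</identifier>
        </originDescription>
      </originDescription>
    </provenance>
  </about>
</record>"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rpartition("}")[2]


root = ET.fromstring(XML_DATA)

# Walk every element and keep those whose local name matches,
# whether or not the element is in a namespace.
identifiers = [
    elem.text
    for elem in root.iter()
    if isinstance(elem.tag, str) and local_name(elem.tag) == "identifier"
]
print(identifiers)
```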

xml.dom.minidom getting elements by tagname

How can I retrieve the value of code with this (below) xml string and when using xml.dom.minidom?
<data>
<element1>
<name>myname</name>
</element1>
<element2>
<code>3</code>
<name>another name</name>
</element2>
</data>
Because multiple 'name' tags can appear I would like to do something like this:
from xml.dom.minidom import parseString
dom = parseString("<data>...</data>")
dom.getElementsByTagName("element1").getElementsByTagName("name")
But that doesn't work unfortunately.
The below code worked fine for me. I think you had multiple tags and you want to get the name from the second tag.
import xml.dom.minidom
myxml = """\
<data>
<element>
<name>myname</name>
</element>
<element>
<code>3</code>
<name>another name</name>
</element>
</data>
"""
dom = xml.dom.minidom.parseString(myxml)
nodelist = dom.getElementsByTagName("element")[1].getElementsByTagName("name")
for node in nodelist:
    print node.toxml()
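To get the value of code from the XML exactly as posted in the question, a sketch along these lines should work on Python 3. The key step is indexing into the NodeList before calling getElementsByTagName again; the variable names are assumptions for illustration.

```python
from xml.dom.minidom import parseString

# The XML string from the question.
XML_DATA = """<data>
  <element1>
    <name>myname</name>
  </element1>
  <element2>
    <code>3</code>
    <name>another name</name>
  </element2>
</data>"""

dom = parseString(XML_DATA)

# getElementsByTagName() returns a NodeList, which has no
# getElementsByTagName() of its own, so pick one element first.
element2 = dom.getElementsByTagName("element2")[0]
code_node = element2.getElementsByTagName("code")[0]

# The text lives in a child text node of the element.
code_value = code_node.firstChild.data
print(code_value)

# Scoping the second lookup to element1 avoids the other name tag.
element1 = dom.getElementsByTagName("element1")[0]
element1_name = element1.getElementsByTagName("name")[0].firstChild.data
```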
