Getting XML attributes from XML with namespaces and Python (lxml)

Getting XML attributes from XML with namespaces and Python (lxml) - python

I'm trying to grab the "id" and "href" attributes from the below XML. Thus far I can't seem to get my head around the namespacing aspects. I can get things easily enough with XML that doesn't have namespace references. But this has befuddled me. Any ideas would be appreciated!
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<ns3:searchResult total="1" xmlns:ns5="ers.ise.cisco.com" xmlns:ers-v2="ers- v2" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:ns3="v2.ers.ise.cisco.com">
<ns3:resources>
<ns5:resource id="d28b5080-587a-11e8-b043-d8b1906198a4"name="00:1B:4F:32:27:50">
<link rel="self" href="https://ho-lab-ise1:9060/ers/config/endpoint/d28b5080-587a-11e8-b043-d8b1906198a4"type="application/xml"/>
</ns5:resource>
</ns3:resources>

You can use xpath function to search all resources and iterate on them. The function has a namespaces keyword argument. The can use it to declare the mapping between namespace prefixes and namespace URL.
Here is the idea:
from lxml import etree
NS = {
"ns5": "ers.ise.cisco.com",
"ns3": "v2.ers.ise.cisco.com"
}
tree = etree.parse('your.xml')
resources = tree.xpath('//ns5:resource', namespaces=NS)
for resource in resources:
print(resource.attrib['id'])
links = resource.xpath('link')
for link in links:
print(link.attrib['href'])
sorry, this is not tested
Here is the documentation about xpath.

#laurent-laporte's answer is great for showing how to handle multiple namespaces (+1).
However if you truly only need to select a couple of attributes no matter what namespace they're in, you can test local-name() in a predicate...
from lxml import etree
tree = etree.parse('your.xml')
attrs = tree.xpath("//#*[local-name()='id' or local-name()='href']")
for attr in attrs:
print(attr)
This will print (the same as Laurent's)...
d28b5080-587a-11e8-b043-d8b1906198a4
https://ho-lab-ise1:9060/ers/config/endpoint/d28b5080-587a-11e8-b043-d8b1906198a4

Related

Python lxml cant navigate when using namespace [duplicate]

I am testing against the following test document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>hi there</title>
</head>
<body>
<img class="foo" src="bar.png"/>
</body>
</html>
If I parse the document using lxml.html, I can get the IMG with an xpath just fine:
>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]
However, if I parse the document as XML and try to get the IMG tag, I get an empty result:
>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]
I can navigate to the element directly:
>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>
But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree to get an xpath expression that will directly identify this element, which, technically I can do:
>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]
But that xpath is, again, obviously not useful for parsing arbitrary documents.
Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces but the only namespace defined is the default and I don't know what else I might need to consider in regards to namespaces.
So, what am I missing?

The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.
Try this:
>>> tree.getroot().xpath(
... "//xhtml:img",
... namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
... )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]

XPath considers all unprefixed names to be in "no namespace".
In particular the spec says:
"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). "
See those two detailed explanations of the problem and its solution: here and here. The solution is to associate a prefix (with the API that's being used) and to use it to prefix any unprefixed name in the XPath expression.
Hope this helped.
Cheers,
Dimitre Novatchev

If you are going to use tags from a single namespace only, as I see it the case above, you are much better off using lxml.objectify.
In your case it would be like
from lxml import objectify
root = objectify.parse(url) #also available: fromstring
You can access the nodes as
root.html
body = root.html.body
for img in body.img: #Assuming all images are within the body tag
While it might not be of great help in html, it can be highly useful in well structured xml.
For more info, check out http://lxml.de/objectify.html

python alexa result parsing with lxml.etree

I am using alexa api from aws but I find difficult in parse the result to get what I want
alexa api return an object tree <type 'lxml.etree._ElementTree'>
I use this code to print the tree
from lxml import etree
root = tree.getroot()
print etree.tostring(root)
I get xml below
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"><aws:OperationRequest><aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId></aws:OperationRequest><aws:UrlInfoResult><aws:Alexa>
<aws:ContentData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:SiteData>
<aws:Title>Google</aws:Title>
<aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
<aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
</aws:SiteData>
<aws:LinksInCount>3453627</aws:LinksInCount>
</aws:ContentData>
<aws:TrafficData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:Rank>1</aws:Rank>
</aws:TrafficData>
</aws:Alexa></aws:UrlInfoResult><aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:StatusCode>Success</aws:StatusCode></aws:ResponseStatus></aws:Response></aws:UrlInfoResponse>
I use root.find('LinksInCount').text to get value of element but it does not work.
I want to know how to get the text 3453627 of aws:LinksInCount

You run into two challenges:
XML using namespaces
two namespaces sharing the same namespace prefix
XML document with reused prefix for 2 different namespaces
You see "aws:" prefix, but it is used for two different namespaces:
xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"
xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"
Using the same namespace prefix in XML is completely legal. The rule is, the later one is valid.
xmlstr = """
<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
<aws:OperationRequest>
<aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId>
</aws:OperationRequest>
<aws:UrlInfoResult>
<aws:Alexa>
<aws:ContentData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:SiteData>
<aws:Title>Google</aws:Title>
<aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
<aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
</aws:SiteData>
<aws:LinksInCount>3453627</aws:LinksInCount>
</aws:ContentData>
<aws:TrafficData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:Rank>1</aws:Rank>
</aws:TrafficData>
</aws:Alexa>
</aws:UrlInfoResult>
<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:StatusCode>Success</aws:StatusCode>
</aws:ResponseStatus>
</aws:Response>
</aws:UrlInfoResponse>
"""
Next challenge is, how to search for namespaced elements.
I prefer using xpath, and for it, you can use whatever namespace you like in the xpath expression, but you have to tell the xpath call what you meant by those prefixes. This is done by namespaces dictionary:
from lxml import etree
doc = etree.fromstring(xmlstr.strip())
namespaces = {"aws": "http://awis.amazonaws.com/doc/2005-07-11"}
texts = doc.xpath("//aws:LinksInCount/text()", namespaces=namespaces)
print texts[0]

Parse xml from file using etree works when reading string, but not a file

I am a relative newby to Python and SO. I have an xml file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having troubles getting the right output. Here is my code:
from xml import etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I try this code:
from xml import etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
Now my result is "None None". I have a feeling I'm either not getting the file in right or something is wrong with the output. Here is the contents of test3.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>

Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
You could also give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.

Have you thought of trying beautifulsoup to parse your xml with python:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online group so support is quite good
A

XPath failing using lxml

I have used xpaths to great effect with both HTML and XML before, but can't seem to get any results this time.
The data is from http://www.ahrefs.com/api/, under "Example of an answer", saved to an .xml file
My code:
from lxml import etree
doc = etree.XML(open('example.xml').read())
print doc.xpath('//result')
which doesn't give any results.
Where am I going wrong?

You need to take the namespace of the document into account:
from lxml import etree
doc = etree.parse('example.xml')
print doc.xpath('//n:result',
namespaces={'n': "http://ahrefs.com/schemas/api/links/1"})
=>
[<Element {http://ahrefs.com/schemas/api/links/1}result at 0xc8d670>,
<Element {http://ahrefs.com/schemas/api/links/1}result at 0xc8d698>]

My experience is from using XPath in C#, but I believe the XML namespace is causing your query to fail. You'll need to either use some variation of the local() operator, or check your documentation for some way of defining the namespace beforehand.

Python get ID from XML data

I am a total python newb and am trying to parse an XML document that is being returned from google as a result of a post request.
The document returned looks like the one outlined in this doc
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#Archives
where it says 'The response contains information about the archive.'
The only part I am interested in is the Id attribute right near the beginning. There will only every be 1 entry, and 1 id attribute. How can I extract it to be use later? I've been fighting with this for a while and I feel like I've tried everything from minidom to elementtree. No matter what I do my search comes back blank, loops don't iterate, or methods are missing. Any assistance is much appreciated. Thank you.

I would highly recommend the Python package BeautifulSoup. It is awesome. Here is a simple example using their example data (assuming you've installed BeautifulSoup already):
from BeautifulSoup import BeautifulSoup
data = """<?xml version='1.0' encoding='utf-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'
xmlns:docs='http://schemas.google.com/docs/2007'
xmlns:gd='http://schemas.google.com/g/2005'>
<id>
https://docs.google.com/feeds/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA</id>
<published>2010-11-18T18:34:06.981Z</published>
<updated>2010-11-18T18:34:07.763Z</updated>
<app:edited xmlns:app='http://www.w3.org/2007/app'>
2010-11-18T18:34:07.763Z</app:edited>
<category scheme='http://schemas.google.com/g/2005#kind'
term='http://schemas.google.com/docs/2007#archive'
label='archive' />
<title>Document Archive - someuser#somedomain.com</title>
<link rel='self' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<link rel='edit' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<author>
<name>someuser</name>
<email>someuser#somedomain.com</email>
</author>
<docs:archiveNotify>someuser#somedomain.com</docs:archiveNotify>
<docs:archiveStatus>flattening</docs:archiveStatus>
<docs:archiveResourceId>
0Adj-hQNOVsTFSNDEkdk2221OTJfMWpxOGI5OWZu</docs:archiveResourceId>
<docs:archiveResourceId>
0Adj-hQNOVsTFZGZodGs2O72NFMllMQDN3a2Rq</docs:archiveResourceId>
<docs:archiveConversion source='application/vnd.google-apps.document'
target='text/plain' />
</entry>"""
soup = BeautifulSoup(data, fromEncoding='utf8')
print soup('id')[0].text
There is also expat, which is built into Python, but it is worth learning BeautifulSoup, because it will respond way better to real-world XML (and HTML).

Assuming the variable response contains a string representation of the returned HTML document, let me tell you the WRONG way to solve your problem
id = response.split("</id>")[0].split("<id>")[1]
The right way to do it is with xml.sax or xml.dom or expat, but personally, I wouldn't be bothered unless I wanted to have robust error handling of exception cases when response contains something unexpected.
EDIT: I forgot about BeautifulSoup, it is indeed as awesome as Travis describes.

If you'd like to use minidom, you can do the following (replace gd.xml with your xml input):
from xml.dom import minidom
dom = minidom.parse("gd.xml")
id = dom.getElementsByTagName("id")[0].childNodes[0].nodeValue
print id
Also, I assume you meant id element, and not id attribute.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Getting XML attributes from XML with namespaces and Python (lxml) - python

Related

Python lxml cant navigate when using namespace [duplicate]

python alexa result parsing with lxml.etree

Parse xml from file using etree works when reading string, but not a file

XPath failing using lxml

Python get ID from XML data

Categories

Resources