Python lxml can't navigate when using namespaces [duplicate]

I am testing against the following test document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>hi there</title>
</head>
<body>
<img class="foo" src="bar.png"/>
</body>
</html>
If I parse the document using lxml.html, I can get the IMG with an xpath just fine:
>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]
However, if I parse the document as XML and try to get the IMG tag, I get an empty result:
>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]
I can navigate to the element directly:
>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>
But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree for an xpath expression that directly identifies this element, which, technically, I can do:
>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]
But that xpath is, again, obviously not useful for parsing arbitrary documents.
Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces, but the only namespace defined is the default one, and I don't know what else I might need to consider with regard to namespaces.
So, what am I missing?

The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.
Try this:
>>> tree.getroot().xpath(
... "//xhtml:img",
... namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
... )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]

XPath considers all unprefixed names to be in "no namespace".
In particular the spec says:
"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). "
See those two detailed explanations of the problem and its solution: here and here. The solution is to associate a prefix (with the API that's being used) and to use it to prefix any unprefixed name in the XPath expression.
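For instance, here is a minimal sketch against the question's own document (the x prefix is an arbitrary choice on my part; only the URI has to match):
from io import StringIO
from lxml import etree

doc = '''<html xmlns="http://www.w3.org/1999/xhtml">
  <body><img class="foo" src="bar.png"/></body>
</html>'''

tree = etree.parse(StringIO(doc))
# prefix every otherwise-unprefixed name in the expression
ns = {'x': 'http://www.w3.org/1999/xhtml'}
print(tree.xpath('/x:html/x:body/x:img', namespaces=ns))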
Hope this helped.
Cheers,
Dimitre Novatchev

If you are going to use tags from a single namespace only, as appears to be the case above, you are much better off using lxml.objectify.
In your case it would be something like:
from lxml import objectify
tree = objectify.parse(url)  # also available: objectify.fromstring
root = tree.getroot()        # the <html> element itself
You can then access the nodes as attributes; objectify resolves child lookups in the parent element's namespace:
body = root.body
for img in body.img:  # assuming all images are within the body tag
    print(img.get('src'))
While this might not be of great help with HTML, it can be highly useful for well-structured XML.
For more info, check out http://lxml.de/objectify.html
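To make it concrete, here is a self-contained sketch against the question's XHTML document (as I understand it, objectify resolves attribute-style child access within the parent element's namespace, so the default xhtml namespace is handled implicitly):
from io import BytesIO
from lxml import objectify

doc = b'''<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>hi there</title></head>
  <body><img class="foo" src="bar.png"/></body>
</html>'''

root = objectify.parse(BytesIO(doc)).getroot()
for img in root.body.img:
    print(img.get('src'))  # bar.png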

Related

Getting XML attributes from XML with namespaces and Python (lxml)

I'm trying to grab the "id" and "href" attributes from the below XML. Thus far I can't seem to get my head around the namespacing aspects. I can get things easily enough with XML that doesn't have namespace references. But this has befuddled me. Any ideas would be appreciated!
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<ns3:searchResult total="1" xmlns:ns5="ers.ise.cisco.com" xmlns:ers-v2="ers-v2" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:ns3="v2.ers.ise.cisco.com">
  <ns3:resources>
    <ns5:resource id="d28b5080-587a-11e8-b043-d8b1906198a4" name="00:1B:4F:32:27:50">
      <link rel="self" href="https://ho-lab-ise1:9060/ers/config/endpoint/d28b5080-587a-11e8-b043-d8b1906198a4" type="application/xml"/>
    </ns5:resource>
  </ns3:resources>
</ns3:searchResult>
You can use the xpath function to search for all resources and iterate over them. The function has a namespaces keyword argument; you can use it to declare the mapping between namespace prefixes and namespace URIs.
Here is the idea:
from lxml import etree

NS = {
    "ns5": "ers.ise.cisco.com",
    "ns3": "v2.ers.ise.cisco.com",
}

tree = etree.parse('your.xml')
resources = tree.xpath('//ns5:resource', namespaces=NS)
for resource in resources:
    print(resource.attrib['id'])
    links = resource.xpath('link')
    for link in links:
        print(link.attrib['href'])
Sorry, this is not tested.
Here is the documentation about xpath.
@laurent-laporte's answer is great for showing how to handle multiple namespaces (+1).
However if you truly only need to select a couple of attributes no matter what namespace they're in, you can test local-name() in a predicate...
from lxml import etree

tree = etree.parse('your.xml')
attrs = tree.xpath("//@*[local-name()='id' or local-name()='href']")
for attr in attrs:
    print(attr)
This will print (the same as Laurent's)...
d28b5080-587a-11e8-b043-d8b1906198a4
https://ho-lab-ise1:9060/ers/config/endpoint/d28b5080-587a-11e8-b043-d8b1906198a4

Parse xml from file using etree works when reading string, but not a file

I am a relative newbie to Python and SO. I have an xml file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having trouble getting the right output. Here is my code:
from xml.etree import ElementTree as etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I try this code:
from xml.etree import ElementTree as etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
Now my result is "None None". I have a feeling I'm either not reading the file in right or something is wrong with the output. Here are the contents of test3.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>
</response>
Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
You could also give the .find(), .findall(), and .iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
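For example, a small sketch of that convenience (the eol prefix below is my own choice, since the prefix-based search syntax cannot use an empty prefix for the default namespace):
from lxml import etree

tree = etree.parse('test3.xml')
root = tree.getroot()
print root.nsmap  # includes None: 'http://www.eol.org/transfer/content/0.3'
# rebind the default namespace to a usable prefix before searching
ns = dict((k or 'eol', v) for k, v in root.nsmap.items())
print root.findtext('eol:identifier', namespaces=ns)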
Have you thought of trying BeautifulSoup to parse your XML with Python?
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online community, so support is quite good.
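For instance, a minimal sketch with the BS3-era API from that link (BeautifulStoneSoup was BS3's XML parser; it ignores namespaces entirely, which sidesteps the default-namespace issue here):
from BeautifulSoup import BeautifulStoneSoup

soup = BeautifulStoneSoup(open('test3.xml').read())
print soup.find('identifier').string
print soup.find('description').string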
A

html5lib with lxml treebuilder doesn't parse namespaces correctly

I'm trying to parse some HTML content with html5lib, using the lxml treebuilder. Note: I'm using the requests library to grab the content, and the content is HTML5 (I tried XHTML as well; same result).
When I simply output the HTML source, it looks alright:
response = requests.get(url)
return response.text
returns
<html xmlns:foo="http://www.example.com/ns/foo">
But when I'm actually parsing it with the html5lib, something odd happens:
tree = html5lib.parse(response.text, treebuilder='lxml', namespaceHTMLElements=True)
html = tree.getroot()
return lxml.etree.tostring(html, pretty_print=False)
returns
<html:html xmlns:html="http://www.w3.org/1999/xhtml" xmlnsU0003Afoo="http://www.example.com/ns/foo">
Note the xmlnsU0003Afoo thing.
Also, the html.nsmap dict does not contain the foo namespace, only html.
Does anyone have an idea about what's going on and how I could fix this?
Later edit:
It seems that this is expected behavior:
If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names [...] to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's Unicode code [...]
- Coercing an HTML DOM into an infoset
A few observations:
HTML5 doesn't seem to support xmlns attributes. Quoting section 1.6 of the latest HTML5 specification: "...namespaces cannot be represented using the HTML syntax, but they are supported in the DOM and in the XHTML syntax." I see you tried XHTML as well, but you're currently using HTML5, so there could be an issue there. U+003A is the Unicode code point for a colon, so the xmlns declaration is being noticed but mangled.
There is an open issue with custom namespace elements for at least the PHP version.
I don't understand the role of html5lib here. Why not just use lxml directly:
from lxml import etree
tree = etree.fromstring(resp_text)
print etree.tostring(tree, pretty_print=True)
That seems to do what you want, without html5lib and without the goofy xmlnsU0003Afoo error. With the test HTML I used, I got the right output (follows), and tree.nsmap contained an entry for 'foo'.
<html xmlns:foo="http://www.example.com/ns/foo">
<head>
<title>yo</title>
</head>
<body>
<p>test</p>
</body>
</html>
Alternatively, if you wish to use pure html5lib, you can just use the included simpletree:
tree = html5lib.parse(resp_text, namespaceHTMLElements=True)
print tree.toxml()
While this doesn't muck up the xmlns attribute, simpletree unfortunately lacks the more powerful ElementTree functions like xpath().
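If you do need XPath over html5lib output, one option (a sketch reusing the question's own parse call) is to keep the lxml treebuilder and query with an explicit XHTML prefix:
import html5lib

tree = html5lib.parse(resp_text, treebuilder='lxml', namespaceHTMLElements=True)
ns = {'h': 'http://www.w3.org/1999/xhtml'}
print tree.xpath('//h:p', namespaces=ns)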

XPath failing using lxml

I have used xpaths to great effect with both HTML and XML before, but can't seem to get any results this time.
The data is from http://www.ahrefs.com/api/, under "Example of an answer", saved to an .xml file.
My code:
from lxml import etree
doc = etree.XML(open('example.xml').read())
print doc.xpath('//result')
which doesn't give any results.
Where am I going wrong?
You need to take the namespace of the document into account:
from lxml import etree
doc = etree.parse('example.xml')
print doc.xpath('//n:result',
                namespaces={'n': "http://ahrefs.com/schemas/api/links/1"})
=>
[<Element {http://ahrefs.com/schemas/api/links/1}result at 0xc8d670>,
<Element {http://ahrefs.com/schemas/api/links/1}result at 0xc8d698>]
My experience is from using XPath in C#, but I believe the XML namespace is causing your query to fail. You'll need to either use some variation of the local-name() function, or check your documentation for a way of defining the namespace beforehand.
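For completeness, the namespace-agnostic variant hinted at above would look like this in lxml (a sketch; local-name() compares the tag name while ignoring its namespace):
from lxml import etree

doc = etree.parse('example.xml')
print doc.xpath("//*[local-name()='result']")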

How can I add consistent whitespace to existing HTML using Python?

I just started working on a website that is full of pages with all their HTML on a single line, which is a real pain to read and work with. I'm looking for a tool (preferably a Python library) that will take HTML input and return the same HTML unchanged, except for adding linebreaks and appropriate indentation. (All tags, markup, and content should be untouched.)
The library doesn't have to handle malformed HTML; I'm passing the HTML through html5lib first, so it will be getting well-formed HTML. However, as mentioned above, I would rather it didn't change any of the actual markup itself; I trust html5lib and would rather let it handle the correctness aspect.
First, does anyone know if this is possible with just html5lib? (Unfortunately, their documentation seems a bit sparse.) If not, what tool would you suggest? I've seen some people recommend HTML Tidy, but I'm not sure if it can be configured to only change whitespace. (Would it do anything except insert whitespace if it were passed well-formed HTML to start with?)
Algorithm
1. Parse the html into some representation
2. Serialize the representation back to html
Example html5lib parser with BeautifulSoup tree builder
#!/usr/bin/env python
from html5lib import HTMLParser, treebuilders
parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
c = """<HTML><HEAD><TITLE>Title</TITLE></HEAD><BODY>...... </BODY></HTML>"""
soup = parser.parse(c)
print soup.prettify()
Output:
<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  ......
 </body>
</html>
I chose J.F. Sebastian's answer because I think it's the simplest and thus the best, but I'm adding another solution for anyone who doesn't want to install Beautiful Soup. (Also, the Beautiful Soup tree builder is going to be deprecated in html5lib 1.0.) This solution was thanks to Amarghosh's tip; I just fleshed it out a bit. Looking at html5lib, I realized that it will output a minidom object natively, which means I can use his suggestion of toprettyxml(). Here's what I came up with:
from html5lib import HTMLParser, treebuilders
from cStringIO import StringIO

def tidy_html(text):
    """Returns a well-formatted version of input HTML."""
    p = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
    dom_tree = p.parseFragment(text)
    # using cStringIO for fast string concatenation
    pretty_HTML = StringIO()
    node = dom_tree.firstChild
    while node:
        node_contents = node.toprettyxml(indent=' ')
        pretty_HTML.write(node_contents)
        node = node.nextSibling
    output = pretty_HTML.getvalue()
    pretty_HTML.close()
    return output
And an example:
>>> text = """<b><i>bold, italic</b></i><div>a div</div>"""
>>> print tidy_html(text)
<b>
 <i>
  bold, italic
 </i>
</b>
<div>
 a div
</div>
Why am I iterating over the children of the tree, rather than just calling toprettyxml() on dom_tree directly? Some of the HTML I'm dealing with is actually HTML fragments, so it's missing the <head> and <body> tags. To handle this I used the parseFragment() method, which means I get a DocumentFragment in return (rather than a Document). Unfortunately, it doesn't have a writexml() method (which toprettyxml() calls), so I iterate over the child nodes, which do have the method.
If the html is indeed well-formed xml, you can use a DOM parser.
from xml.dom.minidom import parse, parseString

# if you have the html string in a variable
html = parseString(theHtmlString)
# or parse the html file
html = parse(htmlFileName)

print html.toprettyxml()
The toprettyxml() method lets you specify the indent, the newline character, and the encoding of the output. You might want to check out the writexml() method as well.
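For instance (minidom's actual keyword names are indent, newl, and encoding):
# two-space indentation, Unix newlines
print html.toprettyxml(indent='  ', newl='\n')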
