html5lib with lxml treebuilder doesn't parse namespaces correctly - python

I'm trying to parse some HTML content with html5lib using the lxml treebuilder. Note: I'm using the requests library to grab the content and the content is HTML5 (tried with XHTML - same result).
When I simply output the HTML source, it looks alright:
response = requests.get(url)
return response.text
returns
<html xmlns:foo="http://www.example.com/ns/foo">
But when I'm actually parsing it with html5lib, something odd happens:
tree = html5lib.parse(response.text, treebuilder = 'lxml', namespaceHTMLElements = True)
html = tree.getroot()
return lxml.etree.tostring(html, pretty_print = False)
returns
<html:html xmlns:html="http://www.w3.org/1999/xhtml" xmlnsU0003Afoo="http://www.example.com/ns/foo">
Note the xmlnsU0003Afoo thing.
Also, the html.nsmap dict does not contain the foo namespace, only html.
Does anyone have an idea about what's going on and how I could fix this?
Later edit:
It seems that this is expected behavior:
If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names [...] to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's Unicode code [...]
- Coercing an HTML DOM into an infoset
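If you need the original names back after parsing, the coercion is mechanical enough to reverse. Here's a minimal sketch based on the quoted rule; note that the output above shows five hex digits after the U even though the spec text says six, so adjust the quantifier to whatever your html5lib version actually emits:
import re

def uncoerce_name(name):
    # Reverse the infoset coercion: "U" followed by hex digits encodes
    # the code point of a character that XML names don't allow
    # (so U0003A maps back to ":").
    return re.sub(r'U([0-9A-F]{5})',
                  lambda m: chr(int(m.group(1), 16)),
                  name)

print(uncoerce_name('xmlnsU0003Afoo'))  # -> xmlns:foo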

A few observations:
HTML5 doesn't seem to support xmlns attributes. Quoting section 1.6 of the latest HTML5 specification: "...namespaces cannot be represented using the HTML syntax, but they are supported in the DOM and in the XHTML syntax." I see you tried with XHTML as well, but you're currently parsing as HTML5, so there could be an issue there.
U+003A is the Unicode code point for the colon, so somehow the xmlns attribute is being noticed but flubbed.
There is an open issue with custom namespace elements for at least the PHP version.
I don't understand the role of html5lib here. Why not just use lxml directly:
from lxml import etree
tree = etree.fromstring(resp_text)
print etree.tostring(tree, pretty_print=True)
That seems to do what you want, without html5lib and without the goofy xmlnsU0003Afoo error. With the test HTML I used, I got the right output (follows), and tree.nsmap contained an entry for 'foo'.
<html xmlns:foo="http://www.example.com/ns/foo">
  <head>
    <title>yo</title>
  </head>
  <body>
    <p>test</p>
  </body>
</html>
Alternatively, if you wish to use pure html5lib, you can just use the included simpletree:
tree = html5lib.parse(resp_text, namespaceHTMLElements=True)
print tree.toxml()
While this doesn't muck up the xmlns attribute, simpletree unfortunately lacks the more powerful ElementTree functions like xpath().
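If you do want XPath over the html5lib result, the lxml tree it builds is still queryable; you just have to address the XHTML namespace explicitly (a quick sketch; the h prefix is arbitrary):
import html5lib

tree = html5lib.parse(response.text, treebuilder='lxml',
                      namespaceHTMLElements=True)
# Elements live in the XHTML namespace, so bind a prefix for XPath:
ns = {'h': 'http://www.w3.org/1999/xhtml'}
print(tree.getroot().xpath('//h:title/text()', namespaces=ns))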

Related

How to remove links from HTML completely with Bleach?

Bleach strips non-whitelisted tags from HTML, but leaves child nodes, e.g.
>>> import bleach
>>> bleach.clean('<a href="http://example.com">stays</a>', strip=True, tags=[])
'stays'
>>>
How can the entire element along with its children be removed?
You should use lxml. Bleach is simply for cleaning data and ensuring the safety of the markup you store.
You can use lxml to parse structured data like HTML or XML.
Consider a simple HTML file:
<html>
<body>
<p>Hello, World!</p>
</body>
</html>
from lxml import html
root = html.parse("hello_world.html").getroot()
print(html.tostring(root))
# <html><body><p>Hello, World!</p></body></html>
p = root.find("body/p")
p.drop_tree()
print(html.tostring(root))
# <html><body></body></html>
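Applied to the links from the original question, the same drop_tree() call removes each anchor and its children (a sketch; the markup is made up, and note that drop_tree() preserves the element's tail text):
from lxml import html

fragment = html.fragment_fromstring(
    'keep <a href="http://example.com">remove me</a> this',
    create_parent='div')
for link in fragment.findall('.//a'):
    link.drop_tree()  # drops <a> and its children; the tail text survives
print(html.tostring(fragment))
# <div>keep  this</div>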
On a related note, if you want to look into some more advanced parsing with lxml, one of my oldest questions on here was around getting python to parse xml & write python code out of it. Writing a Python tool to convert XML to Python?

Parsing HTML with Python with no regard for correct tag hierarchy

I would like to parse a document that is syntactically an HTML document (using tags with attributes etc.), but structurally doesn't follow the rules (e.g. there could be a <html> tag inside a <div> tag inside a <body> tag). I also do not want the additional strictness of XML. Unfortunately, lxml only offers document_fromstring(), which requires an <html> root element, as well as fragment_fromstring(), which in turn does not allow there to be any html or body tags in unusual places.
How do I parse a document with no "fixing" of incorrect structure?
BeautifulSoup should do this fine.
It would be a case of:
from bs4 import BeautifulSoup
import requests
r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
then you'd search "soup" for whatever you're looking for.
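For example, to collect every div no matter where it landed in the broken hierarchy (a sketch; the tag name is illustrative):
for div in soup.find_all('div'):
    print(div.get_text())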

Python lxml can't navigate when using namespace [duplicate]

I am testing against the following test document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>hi there</title>
</head>
<body>
<img class="foo" src="bar.png"/>
</body>
</html>
If I parse the document using lxml.html, I can get the IMG with an xpath just fine:
>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]
However, if I parse the document as XML and try to get the IMG tag, I get an empty result:
>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]
I can navigate to the element directly:
>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>
But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree to get an xpath expression that will directly identify this element, which, technically I can do:
>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]
But that xpath is, again, obviously not useful for parsing arbitrary documents.
Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces, but the only namespace defined is the default, and I don't know what else I might need to consider with regard to namespaces.
So, what am I missing?
The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.
Try this:
>>> tree.getroot().xpath(
... "//xhtml:img",
... namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
... )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
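Alternatively, if you'd rather not bind a prefix at all, XPath's local-name() function matches elements in any namespace (a quick sketch):
>>> tree.getroot().xpath("//*[local-name()='img']")
[<Element {http://www.w3.org/1999/xhtml}img at ...>]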
XPath considers all unprefixed names to be in "no namespace".
In particular the spec says:
"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). "
See these two detailed explanations of the problem and its solution: here and here. The solution is to associate a prefix (with the API that's being used) and to use it to prefix any unprefixed name in the XPath expression.
Hope this helped.
Cheers,
Dimitre Novatchev
If you are going to use tags from a single namespace only, as seems to be the case above, you are much better off using lxml.objectify.
In your case it would look like:
from lxml import objectify
tree = objectify.parse(url)  # objectify.fromstring() is also available
root = tree.getroot()  # the <html> element
You can access the nodes as attributes:
body = root.body
for img in body.img:  # assuming all images are direct children of <body>
    print(img.get("src"))
While it might not be of great help with HTML, it can be highly useful with well-structured XML.
For more info, check out http://lxml.de/objectify.html

How to prevent BeautifulSoup4 from adding extra <html><body> tags to the soup? [duplicate]

This question already has answers here:
Don't put html, head and body tags automatically, beautifulsoup
In BeautifulSoup versions prior to 4 I could take any chunk of HTML and get a string representation in this way:
from BeautifulSoup import BeautifulSoup
soup3 = BeautifulSoup('<div><b>soup 3</b></div>')
print unicode(soup3)
'<div><b>soup 3</b></div>'
However with BeautifulSoup4 the same operation creates additional tags:
from bs4 import BeautifulSoup
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
print unicode(soup4)
'<html><body><div><b>soup 4</b></div></body></html>'
^^^^^^^^^^^^ ^^^^^^^^^^^^^^
I don't need the outer <html><body>..</body></html> tags that BS4 is adding. I have looked through the BS4 docs and also searched inside the class, but could not find any setting for suppressing the extra tags in the output. How do I do it? Downgrading to v3 is not an option, since the SGML parser used in BS3 is nowhere near as good as the lxml or html5lib parsers that are available with BS4.
If you want your code to work on everyone's machine, no matter which parser(s) they have installed, etc. (the same lxml version built on libxml2 2.9 vs. 2.8 acts very differently, the stdlib html.parser had some radical changes between 2.7.2 and 2.7.3, …), you pretty much need to handle all of the legitimate results.
If you know you have a fragment, something like this will give you exactly that fragment:
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
if soup4.body:
    return soup4.body.next
elif soup4.html:
    return soup4.html.next
else:
    return soup4
Of course if you know your fragment is a single div, it's even easier—but it's not as easy to think of a use case where you'd know that:
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
return soup4.div
If you want to know why this happens:
BeautifulSoup is intended for parsing HTML documents. An HTML fragment is not a valid document. It's pretty close to a document, but that's not good enough to guarantee that you'll get back exactly what you give it.
As Differences between parsers says:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will give different results.
So, while this exact difference isn't documented, it's just a special case of something that is.
As was noted in the old BeautifulStoneSoup documentation:
The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML doesn't have a fixed tag set, so those heuristics don't apply. So BeautifulSoup doesn't do XML very well.
Use the BeautifulStoneSoup class to parse XML documents. It's a general class with no special knowledge of any XML dialect and very simple rules about tag nesting...
And in the BeautifulSoup4 docs:
There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.
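For example, a minimal sketch (parsing as XML this way requires lxml to be installed, and you get an XML declaration in place of the html/body wrapper):
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div><b>soup 4</b></div>', 'xml')
print unicode(soup)
# <?xml version="1.0" encoding="utf-8"?>
# <div><b>soup 4</b></div>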
Perhaps that will yield what you want.

How can I add consistent whitespace to existing HTML using Python?

I just started working on a website that is full of pages with all their HTML on a single line, which is a real pain to read and work with. I'm looking for a tool (preferably a Python library) that will take HTML input and return the same HTML unchanged, except for adding linebreaks and appropriate indentation. (All tags, markup, and content should be untouched.)
The library doesn't have to handle malformed HTML; I'm passing the HTML through html5lib first, so it will be getting well-formed HTML. However, as mentioned above, I would rather it didn't change any of the actual markup itself; I trust html5lib and would rather let it handle the correctness aspect.
First, does anyone know if this is possible with just html5lib? (Unfortunately, their documentation seems a bit sparse.) If not, what tool would you suggest? I've seen some people recommend HTML Tidy, but I'm not sure if it can be configured to only change whitespace. (Would it do anything except insert whitespace if it were passed well-formed HTML to start with?)
Algorithm:
1. Parse the HTML into some representation.
2. Serialize the representation back to HTML.

Example: html5lib parser with the BeautifulSoup tree builder:
#!/usr/bin/env python
from html5lib import HTMLParser, treebuilders
parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
c = """<HTML><HEAD><TITLE>Title</TITLE></HEAD><BODY>...... </BODY></HTML>"""
soup = parser.parse(c)
print soup.prettify()
Output:
<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  ......
 </body>
</html>
I chose J.F. Sebastian's answer because I think it's the simplest and thus the best, but I'm adding another solution for anyone who doesn't want to install Beautiful Soup. (Also, the Beautiful Soup tree builder is going to be deprecated in html5lib 1.0.) This solution was thanks to Amarghosh's tip; I just fleshed it out a bit. Looking at html5lib, I realized that it will output a minidom object natively, which means I can use his suggestion of toprettyxml(). Here's what I came up with:
from html5lib import HTMLParser, treebuilders
from cStringIO import StringIO

def tidy_html(text):
    """Returns a well-formatted version of input HTML."""
    p = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
    dom_tree = p.parseFragment(text)
    # using cStringIO for fast string concatenation
    pretty_HTML = StringIO()
    node = dom_tree.firstChild
    while node:
        node_contents = node.toprettyxml(indent=' ')
        pretty_HTML.write(node_contents)
        node = node.nextSibling
    output = pretty_HTML.getvalue()
    pretty_HTML.close()
    return output
And an example:
>>> text = """<b><i>bold, italic</b></i><div>a div</div>"""
>>> print tidy_html(text)
<b>
 <i>
  bold, italic
 </i>
</b>
<div>
 a div
</div>
Why am I iterating over the children of the tree, rather than just calling toprettyxml() on dom_tree directly? Some of the HTML I'm dealing with is actually HTML fragments, so it's missing the <head> and <body> tags. To handle this I used the parseFragment() method, which means I get a DocumentFragment in return (rather than a Document). Unfortunately, it doesn't have a writexml() method (which toprettyxml() calls), so I iterate over the child nodes, which do have the method.
If the HTML is indeed well-formed XML, you can use a DOM parser.
from xml.dom.minidom import parse, parseString
#if you have html string in a variable
html = parseString(theHtmlString)
#or parse the html file
html = parse(htmlFileName)
print html.toprettyxml()
The toprettyxml() method lets you specify the indent, newline character, and encoding of the output. You might also want to check out the writexml() method.
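For example (a quick sketch of those keyword arguments):
print html.toprettyxml(indent="  ", newl="\n", encoding="utf-8")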
