I just started working on a website that is full of pages with all their HTML on a single line, which is a real pain to read and work with. I'm looking for a tool (preferably a Python library) that will take HTML input and return the same HTML unchanged, except for adding linebreaks and appropriate indentation. (All tags, markup, and content should be untouched.)
The library doesn't have to handle malformed HTML; I'm passing the HTML through html5lib first, so it will be getting well-formed HTML. However, as mentioned above, I would rather it didn't change any of the actual markup itself; I trust html5lib and would rather let it handle the correctness aspect.
First, does anyone know if this is possible with just html5lib? (Unfortunately, their documentation seems a bit sparse.) If not, what tool would you suggest? I've seen some people recommend HTML Tidy, but I'm not sure if it can be configured to only change whitespace. (Would it do anything except insert whitespace if it were passed well-formed HTML to start with?)
Algorithm:
1. Parse the HTML into some representation
2. Serialize the representation back to HTML
Example html5lib parser with BeautifulSoup tree builder
#!/usr/bin/env python
from html5lib import HTMLParser, treebuilders

# html5lib parser that builds a BeautifulSoup tree
parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

c = """<HTML><HEAD><TITLE>Title</TITLE></HEAD><BODY>...... </BODY></HTML>"""
soup = parser.parse(c)
print soup.prettify()
Output:
<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  ......
 </body>
</html>
I chose J.F. Sebastian's answer because I think it's the simplest and thus the best, but I'm adding another solution for anyone who doesn't want to install Beautiful Soup. (Also, the Beautiful Soup tree builder is going to be deprecated in html5lib 1.0.) This solution was thanks to Amarghosh's tip; I just fleshed it out a bit. Looking at html5lib, I realized that it will output a minidom object natively, which means I can use his suggestion of toprettyxml(). Here's what I came up with:
from html5lib import HTMLParser, treebuilders
from cStringIO import StringIO

def tidy_html(text):
    """Returns a well-formatted version of input HTML."""
    p = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
    dom_tree = p.parseFragment(text)
    # using cStringIO for fast string concatenation
    pretty_HTML = StringIO()
    node = dom_tree.firstChild
    while node:
        node_contents = node.toprettyxml(indent=' ')
        pretty_HTML.write(node_contents)
        node = node.nextSibling
    output = pretty_HTML.getvalue()
    pretty_HTML.close()
    return output
And an example:
>>> text = """<b><i>bold, italic</b></i><div>a div</div>"""
>>> print tidy_html(text)
<b>
 <i>
  bold, italic
 </i>
</b>
<div>
 a div
</div>
Why am I iterating over the children of the tree, rather than just calling toprettyxml() on dom_tree directly? Some of the HTML I'm dealing with is actually HTML fragments, so it's missing the <head> and <body> tags. To handle this I used the parseFragment() method, which means I get a DocumentFragment in return (rather than a Document). Unfortunately, it doesn't have a writexml() method (which toprettyxml() calls), so I iterate over the child nodes, which do have the method.
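You can see the limitation directly. A minimal check, reusing the "dom" tree builder from the snippet above:
from html5lib import HTMLParser, treebuilders
p = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
fragment = p.parseFragment("<div>a div</div>")
print hasattr(fragment, 'writexml')             # False: toprettyxml() fails on the fragment
print hasattr(fragment.firstChild, 'writexml')  # True: the child nodes serialize fine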
If the HTML is indeed well-formed XML, you can use a DOM parser.
from xml.dom.minidom import parse, parseString
#if you have html string in a variable
html = parseString(theHtmlString)
#or parse the html file
html = parse(htmlFileName)
print html.toprettyxml()
The toprettyxml() method lets you specify the indent, the newline character, and the encoding of the output. You might also want to check out the writexml() method.
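For example, a minimal sketch (the parameter values here are just illustrative):
from xml.dom.minidom import parseString
doc = parseString("<html><body><p>Hi</p></body></html>")
# two-space indent, Unix newlines; passing an encoding makes it return an encoded string
print doc.toprettyxml(indent="  ", newl="\n", encoding="utf-8")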
Related
I have the following code and output (note the missing closing br tag):
from lxml import html
root = html.fromstring('<br/>')
html.tostring(root)
Output : '<br>'
However, I expected something similar to this (i.e., also adding a closing tag for the self-closing element):
from lxml import html
root = html.fromstring('<a/>')
html.tostring(root)
Output : '<a></a>'
Is there a way to produce the desired output, or at least '<br/>'?
Maybe somebody could also point to the relevant part in the source code.
HTML is not XML-compliant, and this is one of the places where that shows up. A break is <br>. Since a break element does not usually have internal content (attributes or text), it does not need to be written as an enclosing <br/> or <br></br> (although most browsers will accept both). You can change the output method to get XML compliance.
html.tostring(root, method='xml')
This could negatively impact other parts of the document, though.
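A quick check of both serializations (output shown in the comments):
from lxml import html
root = html.fromstring('<br/>')
print html.tostring(root)                 # '<br>'  (HTML serialization)
print html.tostring(root, method='xml')   # '<br/>' (XML serialization)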
Bleach strips non-whitelisted tags from HTML, but leaves child nodes, e.g.
>>> import bleach
>>> bleach.clean("<span>stays</span>", strip=True, tags=[])
'stays'
>>>
How can the entire element along with its children be removed?
You should use lxml. Bleach is simply for cleaning data and ensuring the security/safety of the markup you store.
You can use lxml to parse structured data like HTML or XML.
Consider a simple HTML file:
<html>
<body>
<p>Hello, World!</p>
</body>
</html>
from lxml import html
root = html.parse("hello_world.html").getroot()
print(html.tostring(root))
# <html><body><p>Hello, World!</p></body></html>
p = root.find("body/p")
p.drop_tree()
print(html.tostring(root))
# <html><body></body></html>
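To get the bleach-like behaviour asked about, i.e. removing every non-whitelisted element along with its children, a rough sketch along the same lines (the function name and whitelist here are just illustrative):
from lxml import html

def strip_disallowed(doc_text, allowed=('html', 'body', 'p')):
    """Drop every element whose tag is not whitelisted, children included."""
    root = html.fromstring(doc_text)
    # snapshot the descendants first, since drop_tree() mutates the tree
    for el in list(root.iterdescendants()):
        if el.tag not in allowed:
            el.drop_tree()
    return html.tostring(root)

print(strip_disallowed('<html><body><p>keep</p><script>x()</script></body></html>'))
# <html><body><p>keep</p></body></html>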
On a related note, if you want to look into some more advanced parsing with lxml, one of my oldest questions on here was around getting python to parse xml & write python code out of it. Writing a Python tool to convert XML to Python?
I would like to parse a document that is syntactically an HTML document (using tags with attributes, etc.), but structurally doesn't follow the rules (e.g. there could be a <html> tag inside a <div> tag inside a <body> tag). I also do not want the additional strictness of XML. Unfortunately, lxml only offers document_fromstring(), which requires an <html> root element, as well as fragment_fromstring(), which in turn does not allow any html or body tags in unusual places.
How do I parse a document with no "fixing" of incorrect structure?
BeautifulSoup should handle this fine. It would be a case of:
from bs4 import BeautifulSoup
import requests
r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
Then you'd search soup for whatever you're looking for.
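For instance (the tag name here is just a placeholder):
# find every <div>, even ones nested in structurally odd places;
# the lenient html.parser backend tends to leave the structure as-is
for div in soup.find_all('div'):
    print(div.get_text())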
I'm trying to parse some HTML content with html5lib using the lxml treebuilder. Note: I'm using the requests library to grab the content and the content is HTML5 (tried with XHTML - same result).
When I simply output the HTML source, it looks alright:
response = requests.get(url)
return response.text
returns
<html xmlns:foo="http://www.example.com/ns/foo">
But when I actually parse it with html5lib, something odd happens:
tree = html5lib.parse(response.text, treebuilder = 'lxml', namespaceHTMLElements = True)
html = tree.getroot()
return lxml.etree.tostring(html, pretty_print = False)
returns
<html:html xmlns:html="http://www.w3.org/1999/xhtml" xmlnsU0003Afoo="http://www.example.com/ns/foo">
Note the xmlnsU0003Afoo thing.
Also, the html.nsmap dict does not contain the foo namespace, only html.
Does anyone have an idea about what's going on and how I could fix this?
Later edit:
It seems that this is expected behavior:
If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names [...] to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's Unicode code [...]
- Coercing an HTML DOM into an infoset
A few observations:
HTML5 doesn't seem to support xmlns attributes. Quoting section 1.6 of the latest HTML5 specification: "...namespaces cannot be represented using the HTML syntax, but they are supported in the DOM and in the XHTML syntax." I see you tried XHTML as well, but you're currently using HTML5, so there could be an issue there. U+003A is the Unicode code point for the colon, so somehow the xmlns attribute is being noticed but flubbed.
There is an open issue with custom namespace elements for at least the PHP version.
I don't understand the role of html5lib here. Why not just use lxml directly:
from lxml import etree
tree = etree.fromstring(resp_text)
print etree.tostring(tree, pretty_print=True)
That seems to do what you want, without html5lib and without the goofy xmlnsU0003Afoo error. With the test HTML I used, I got the right output (follows), and tree.nsmap contained an entry for 'foo'.
<html xmlns:foo="http://www.example.com/ns/foo">
  <head>
    <title>yo</title>
  </head>
  <body>
    <p>test</p>
  </body>
</html>
Alternatively, if you wish to use pure html5lib, you can just use the included simpletree:
tree = html5lib.parse(resp_text, namespaceHTMLElements=True)
print tree.toxml()
While this doesn't muck up the xmlns attribute, simpletree unfortunately lacks the more powerful ElementTree functions like xpath().
I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following:
<html>
  <body>
    <div id="container">
      <div id="contents">
        <table>
          <tbody>
            <tr>
              <td class="author">####I want whatever is located here ###</td>
            </tr>
          </tbody>
        </table>
      </div>
    </div>
  </body>
</html>
I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do?
At the moment, my code looks like what is below:
import re
import urllib2, sys
import lxml
from lxml import etree
from lxml.html.soupparser import fromstring
from lxml.etree import tostring
from lxml.cssselect import CSSSelector
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.example.com/'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
html = soup.prettify()
html = html.replace('&nbsp;', ' ')
html = html.replace('&#237;', 'í')
root = fromstring(html)
I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in my source file.
EDIT: I suppose I didn't make this quite clear, but I have multiple such <td> tags on the page that I want to scrape.
It's not clear to me from your question why you need to worry about the div tags -- what about doing just:
soup = BeautifulSoup(html)
thetd = soup.find('td', attrs={'class': 'author'})
print thetd.string
On the HTML you give, running this emits exactly:
####I want whatever is located here ###
which appears to be what you want. Maybe you can specify better exactly what you need that this super-simple snippet doesn't do -- multiple td tags all of class author that you need to consider (all? just some? which ones?), a possibly missing such tag (what do you want to do in that case?), and the like. It's hard to infer exactly what your specs are from this simple example and overabundant code ;-).
Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author:
thetds = soup.findAll('td', attrs={'class': 'author'})
for thetd in thetds:
    print thetd.string
...i.e., not much harder at all!-)
Or you could use pyquery, since BeautifulSoup is not actively maintained anymore; see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
First, install pyquery with
easy_install pyquery
Then your script could be as simple as:
from pyquery import PyQuery

d = PyQuery('http://mywebpage/')
# iterating a PyQuery selection yields plain lxml elements,
# so read the .text attribute rather than calling a text() method
allauthors = [td.text for td in d('td.author')]
pyquery uses the CSS selector syntax familiar from jQuery, which I find more intuitive than BeautifulSoup's API. It uses lxml underneath, and is much faster than BeautifulSoup. But BeautifulSoup is pure Python, and thus also works on Google App Engine.
The lxml library is now the standard for parsing HTML in Python. The interface can seem awkward at first, but it is very serviceable for what it does.
You should let the library handle the XML specifics, such as those escaped &entities;
import lxml.html
html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
<td class="author">####I want whatever is located here, eh? í ###</td>
</tr></tbody></table></div></div></body></html>"""
root = lxml.html.fromstring(html)
tds = root.cssselect("div#contents td.author")
print tds # gives [<Element td at 84ee2cc>]
print tds[0].text # what you want, including the 'í'
BeautifulSoup is certainly the canonical HTML parser/processor. But if you have just this kind of snippet you need to match, instead of building a whole hierarchical object representing the HTML, pyparsing makes it easy to define leading and trailing HTML tags as part of creating a larger search expression:
from pyparsing import makeHTMLTags, withAttribute, SkipTo

author_td, end_td = makeHTMLTags("td")

# only interested in <td>'s where class="author"
author_td.setParseAction(withAttribute(("class", "author")))

search = author_td + SkipTo(end_td)("body") + end_td

for match in search.searchString(html):
    print match.body
Pyparsing's makeHTMLTags function does a lot more than just emit "<tag>" and "</tag>" expressions. It also handles:
caseless matching of tags
"<tag/>" syntax
zero or more attributes in the opening tag
attributes defined in arbitrary order
attribute names with namespaces
attribute values in single, double, or no quotes
intervening whitespace between tag symbols, or between attribute name, '=', and value
attributes are accessible after parsing as named results
These are the common pitfalls that trip people up when they try to use a regex for HTML scraping.
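As a quick illustration (the messy markup below is invented to exercise these points), the search expression defined above copes with a tag that a naive regex would likely miss:
# case-mismatched tag name, extra whitespace, unquoted attribute value
messy = '<TD class = author >Jane Doe</td>'
for match in search.searchString(messy):
    print match.body   # -> Jane Doe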