I'm trying to extract the HTML code of a table from a webpage using BeautifulSoup.
<table class="facts_label" id="facts_table">...</table>
I would like to know why the code below works with "html.parser" but prints None if I change "html.parser" to "lxml".
#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(webpage, "html.parser")
table = soup.find('table', {'class' : 'facts_label'})
print table
Short answer: if you have already installed lxml, just use it.
html.parser - BeautifulSoup(markup, "html.parser")
Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2)
Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)
lxml - BeautifulSoup(markup, "lxml")
Advantages: Very fast, Lenient
Disadvantages: External C dependency
html5lib - BeautifulSoup(markup, "html5lib")
Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5
Disadvantages: Very slow, External Python dependency
There is a special paragraph in the BeautifulSoup documentation called Differences between parsers; it states that:
Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers.
The differences become clear with non-well-formed HTML documents.
The moral is just that you should use the parser that works in your particular case.
Also note that you should always explicitly specify which parser you are using. This will help you avoid surprises when running the code on different machines or in different virtual environments.
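As a quick illustration of both points (this is made-up markup, not the page from the question), feeding the same slightly broken table to each parser, named explicitly, shows how differently the resulting trees can come out:

from bs4 import BeautifulSoup

broken = "<table class='facts_label'><tr><td>Fact</table>"

# Each parser repairs the markup differently; lxml and html5lib also wrap the
# fragment in <html>/<body>, and html5lib inserts <tbody> into the table.
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, parser)
        print("%s -> %s" % (parser, soup))
    except Exception as exc:
        # lxml and html5lib are optional extras and may not be installed
        print("%s is not available: %s" % (parser, exc))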
Related
I am trying to parse a page
http://gwyneddathletics.com/custompages/sport/mlacrosse/stats/2014/ml0402gm.htm
and when I try to findAll('b') I get no results, same with 'tr'. I cannot find anything beyond the initial title tag.
Also, when I do soup = BeautifulSoup(markup) and print the soup, I get the entire page with an extra at the end of the output
I am using python 2.6 with BeautifulSoup 3.2.0. Why is my soup not parsing the page correctly?
It's likely that the parser BeautifulSoup is using really doesn't like the markup on the page; I have had similar issues in the past. I did a quick test on your input and found that if you upgrade to the newest Beautiful Soup (the package is called bs4), it just works. bs4 also supports Python 2.6, and the backwards-incompatible changes between it and BeautifulSoup (the 3.x series) are tiny. See here if you need to check how to port.
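For reference, the port is usually just a change of import. Here is a hedged sketch; the markup is a stand-in for your stats page, and bs4 comes from pip install beautifulsoup4:

# BeautifulSoup 3.x:
#   from BeautifulSoup import BeautifulSoup
#   soup = BeautifulSoup(markup)
# bs4 (the new package):
from bs4 import BeautifulSoup

markup = "<table><tr><td><b>bold</b></td></tr></table>"  # stand-in markup
soup = BeautifulSoup(markup, "html.parser")  # bs4 lets you name the parser
print(soup.findAll('b'))  # findAll() from the 3.x API still works as an alias of find_all()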
Is there a difference between the capabilities of the lxml and html5lib parsers in the context of BeautifulSoup? I am trying to learn to use BS4 and am using the following code construct --
ret = requests.get('http://www.olivegarden.com')
soup = BeautifulSoup(ret.text, 'html5lib')
for item in soup.find_all('a'):
    print item['href']
I started out using lxml as the parser but noticed that for some websites the for loop is just never entered, even though there are valid links in the page. The same page works with the html5lib parser. Are there any specific types of pages that might not work with lxml?
I am on Ubuntu using python-lxml 2.3.2-1 with libxml2 2.7.8.dfsg-5.1ubunt and html5lib-1.0b3.
EDIT: I updated to lxml 3.1.2 and still see the same issue. On a Mac running lxml 3.0.x, though, the same page is parsed properly. The website in question is www.olivegarden.com
html5lib uses the HTML parsing algorithm as defined in the HTML spec and as implemented in all major browsers. lxml uses libxml2's HTML parser, which is ultimately based on its XML parser, and its error handling for invalid HTML does not match what is used anywhere else.
Most web developers only test with web browsers (standards be damned), so if you want to get what the page's author intended, you'll likely need to use something like html5lib that matches current browsers.
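A small, hedged sketch of that difference (made-up misnested markup; the exact output depends on your lxml and html5lib versions):

from bs4 import BeautifulSoup

# Misnested inline tags: html5lib repairs them with the spec's browser rules,
# while libxml2 applies its own recovery, so the two trees come out different.
misnested = "<p><b>one<i>two</b>three</i></p>"

for parser in ("lxml", "html5lib"):
    soup = BeautifulSoup(misnested, parser)
    print("%s -> %s" % (parser, soup.body))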
I'm just looking for some info regarding Python web scraping. I'm trying to get all the data from this timetable, and I want to have each class linked to the time it's on at. Looking at the HTML, there are multiple tables (tables within tables). I'm planning to use Google App Engine with Python (perhaps BeautifulSoup also). Any suggestions on the best way of going about this?
Thanks
UPDATE:
I've managed to extract the required data from the table using the following code:
import urllib
from lxml import etree
import StringIO
url = "http://ttcache.dcu.ie/Reporting/Individual;Locations;id;lg25?
template=location+Individual&weeks=20&days=1-5&periods=1-30&Width=0&Height=0"
result = urllib.urlopen(url)
html = result.read()
parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(html), parser)
xpath = "//table[2]/tr/td//text()"
filtered_html = tree.xpath(xpath)
print filtered_html
But I'm getting a lot of these u'\xa0', u'\xa0', '\r\n', '\r\n' characters scattered throughout the parsed text. Any suggestions on how I could combat these?
Thanks
The best library available for parsing HTML is lxml, which is based on libxml2. Although it's intended for XML parsing, it also has an HTML parser that deals with tag soup far better than BeautifulSoup does. Because the parser is written in C, it's also much, much faster.
You'll also get access to XPath to query the HTML DOM, with libxml2's support for regular expression matches in XPath, which is very useful for web scraping.
libxml2 and lxml are very well supported, and you'll find packages for them on all major distros. Google App Engine appears to support lxml as well if you're using Python 2.7: https://developers.google.com/appengine/docs/python/tools/libraries27
EDIT:
The characters you're getting are due to the many empty table cells on the page, so your XPath often matches whitespace-only text nodes (the u'\xa0' entries are non-breaking spaces). You can skip text nodes that contain no non-space characters with a regular expression, something like this:
xpath = "//table[2]/tr/td//text()[re:match(., '\\S')]"
filtered_html = tree.xpath(
    xpath,
    namespaces={"re": "http://exslt.org/regular-expressions"})
The namespaces bit just tells lxml that you want to use its regular expression extension.
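If you would rather avoid the EXSLT extension, a plain-Python alternative (working on the same tree as above) is to strip each text node and drop the whitespace-only ones:

texts = tree.xpath("//table[2]/tr/td//text()")
# .strip() also removes the u'\xa0' non-breaking spaces from unicode strings
filtered_html = [t.strip() for t in texts if t.strip()]
print(filtered_html)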
From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml and I've heard that lxml is faster.
So I'm wondering what are the advantages of one over the other? When would I want to use lxml and when would I be better off using BeautifulSoup? Are there any other libraries worth considering?
Pyquery provides the jQuery selector interface to Python (using lxml under the hood).
http://pypi.python.org/pypi/pyquery
It's really awesome; I don't use anything else anymore.
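A tiny sketch of what that looks like (the markup here is made up; pyquery installs via pip install pyquery):

from pyquery import PyQuery as pq

d = pq("<div><a href='http://example.com'>a link</a><p class='x'>some text</p></div>")
print(d('a').attr('href'))  # jQuery-style selector, returns the first match's href
print(d('p.x').text())      # text content of the matched element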
For starters, BeautifulSoup is no longer actively maintained, and the author even recommends alternatives such as lxml.
Quoting from the linked page:
Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does. The most common problems are handling tags incorrectly, "malformed start tag" errors, and "bad end tag" errors. This page explains what happened, how the problem will be addressed, and what you can do right now.

This page was originally written in March 2009. Since then, the 3.2 series has been released, replacing the 3.1 series, and development of the 4.x series has gotten underway. This page will remain up for historical purposes.
tl;dr
Use 3.2.0 instead.
In summary, lxml is positioned as a lightning-fast, production-quality HTML and XML parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time by quickly extracting data out of poorly-formed HTML or XML.
The lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides a soupparser so you can switch back and forth. Quoting:
BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superiour support for encoding detection. It very much depends on the input which parser works better.
In the end they are saying,
The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.
If I understand them correctly, it means that the soup parser is more robust --- it can deal with a "soup" of malformed tags by using regular expressions --- whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume it also applies to BeautifulSoup itself, not just to the soupparser for lxml.
They also show how to benefit from BeautifulSoup's encoding detection, while still parsing quickly with lxml:
>>> from BeautifulSoup import UnicodeDammit
>>> def decode_html(html_string):
...     converted = UnicodeDammit(html_string, isHTML=True)
...     if not converted.unicode:
...         raise UnicodeDecodeError(
...             "Failed to detect encoding, tried [%s]",
...             ', '.join(converted.triedEncodings))
...     # print converted.originalEncoding
...     return converted.unicode
>>> root = lxml.html.fromstring(decode_html(tag_soup))
(Same source: http://lxml.de/elementsoup.html).
In the words of BeautifulSoup's creator:
That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.
--Leonard
Quoted from the Beautiful Soup documentation.
I hope this is now clear. Beautiful Soup is a brilliant one-person project designed to save you time extracting data out of poorly-designed websites. The goal is to save you time right now, to get the job done, not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.
Also, from the lxml website,
lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.
And, from Why lxml?,
The C libraries libxml2 and libxslt have huge benefits:... Standards-compliant... Full-featured... fast. fast! FAST! ... lxml is a new Python binding for libxml2 and libxslt...
Don't use BeautifulSoup on its own; use lxml.soupparser. That way you're sitting on top of the power of lxml and can still use the good bit of BeautifulSoup, which is dealing with really broken and crappy HTML.
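A minimal sketch of that setup (the module lives at lxml.html.soupparser and needs BeautifulSoup installed as well; the tag soup here is made up):

from lxml.html import soupparser

broken = "<table><tr><td>cell<td>another"  # unclosed tag soup
root = soupparser.fromstring(broken)       # BeautifulSoup repairs it, lxml builds the tree
print([td.text for td in root.xpath('//td')])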
I've used lxml with great success for parsing HTML. It seems to do a good job of handling "soupy" HTML, too. I'd highly recommend it.
Here's a quick test I had lying around to try handling of some ugly HTML:
import unittest
from StringIO import StringIO
from lxml import etree
class TestLxmlStuff(unittest.TestCase):
    bad_html = """
        <html>
          <head><title>Test!</title></head>
          <body>
            <h1>Here's a heading
            <p>Here's some text
            <p>And some more text
            <b>Bold!</b></i>
            <table>
              <tr>row
              <tr><td>test1
                <td>test2
              </tr>
              <tr>
                <td colspan=2>spanning two
            </table>
          </body>
        </html>"""

    def test_soup(self):
        """Test lxml's parsing of really bad HTML"""
        parser = etree.HTMLParser()
        tree = etree.parse(StringIO(self.bad_html), parser)
        self.assertEqual(len(tree.xpath('//tr')), 3)
        self.assertEqual(len(tree.xpath('//td')), 3)
        self.assertEqual(len(tree.xpath('//i')), 0)
        # print(etree.tostring(tree.getroot(), pretty_print=False, method="html"))

if __name__ == '__main__':
    unittest.main()
For sure I would use EHP. It is faster than lxml, much more elegant and simpler to use.
Check it out: https://github.com/iogf/ehp
from ehp import *
data = '''<html> <body> <em> Hello world. </em> </body> </html>'''
html = Html()
dom = html.feed(data)
for ind in dom.find('em'):
    print ind.text()
Output:
Hello world.
A somewhat outdated speed comparison can be found here, which clearly recommends lxml, as the speed differences seem drastic.
I am trying to pull a list of resource/database names and IDs from a listing of resources that my school library has subscriptions to. There are pages listing the different resources, and I can use urllib2 to get the pages, but when I pass a page to BeautifulSoup, it truncates its tree just before the end of the entry for the first resource in the list. The problem seems to be in the image link used to add the resource to a search set. This is where things get cut off; here's the HTML:
<a href="http://www2.lib.myschool.edu:7017/V/ACDYFUAMVRFJRN4PV8CIL7RUPC9QXMQT8SFV2DVDSBA5GBJCTT-45899?func=find-db-add-res&resource=XYZ00618&z122_key=000000000&function-in=www_v_find_db_0" onclick='javascript:addToz122("XYZ00618","000000000","myImageXYZ00618","http://discover.lib.myschool.edu:8331/V/ACDYFUAMVRFJRN4PV8CIL7RUPC9QXMQT8SFV2DVDSBA5GBJCTT-45900");return false;'>
<img name="myImageXYZ00618" id="myImageXYZ00618" src="http://www2.lib.myschool.edu:7017/INS01/icon_eng/v-add_favorite.png" title="Add to My Sets" alt="Add to My Sets" border="0">
</a>
And here is my python code:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://discover.lib.myschool.edu:8331/V?func=find-db-1-title&mode=titles&scan_start=latp&scan_utf=D&azlist=Y&restricted=all")
print BeautifulSoup(page).prettify()
In BeautifulSoup's version, the opening <a href...> shows up, but the <img> doesn't, and the <a> is immediately closed, as are the rest of the open tags, all the way to </html>.
The only distinguishing trait I see for these "add to sets" images is that they are the only ones to have name and id attributes. I can't see why that would cause BeautifulSoup to stop parsing immediately, though.
Note: I am almost entirely new to Python, but seem to be understanding it all right.
Thank you for your help!
You can try Beautiful Soup with html5lib rather than the built-in parser.
BeautifulSoup(markup, "html5lib")
html5lib is more lenient and often parses pages that the built-in parser truncates. See the docs at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree
I was using Firefox's "view selection source", which apparently cleans up the HTML for me. When I viewed the original source, this is what I saw:
<img name="myImageXYZ00618" id="myImageXYZ00618" src='http://www2.lib.myschool.edu:7017/INS01/icon_eng/v-add_favorite.png' alt='Add to My Sets' title='Add to My Sets' border="0"title="Add to clipboard PAIS International (CSA)" alt="Add to clipboard PAIS International (CSA)">
By putting a space after the border="0" attribute, I can get BS to parse the page.
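For what it's worth, a lenient parser copes with the missing space without editing the source. Here is a hedged sketch using bs4 with html5lib (not the BeautifulSoup 3.2.0 from the question), on a trimmed-down version of that tag:

from bs4 import BeautifulSoup

# The run-together border="0"title="..." is the kind of markup that tripped up BS 3.x.
snippet = ('<a href="#"><img name="myImageXYZ00618" src="x.png" '
           'border="0"title="Add to clipboard"></a><p>rest of the page</p>')
soup = BeautifulSoup(snippet, "html5lib")
print(soup.find('img').get('title'))  # the title attribute is still recovered
print(soup.find('p'))                 # and parsing continues past the bad tag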
I strongly recommend using html5lib + lxml instead of Beautiful Soup. It uses a real HTML parser (very similar to the one in Firefox), and lxml provides a very flexible way to query the resulting tree (CSS selectors or XPath).
There are tons of bugs and strange behaviors in BeautifulSoup that make it a poor choice for a lot of HTML markup you can't trust.
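A rough sketch of that combination (the markup and URL are made up; html5lib's lxml tree builder puts elements in the XHTML namespace, hence the prefix in the XPath):

import html5lib

markup = "<p>broken <a href='http://example.com'>a link</p>"

# Parse with browser-like error handling, but get an lxml tree back to query.
doc = html5lib.parse(markup, treebuilder="lxml")

ns = {"h": "http://www.w3.org/1999/xhtml"}
for a in doc.xpath("//h:a", namespaces=ns):
    print(a.get("href"))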
If I remember correctly, BeautifulSoup uses "name" in its tree as the name of the tag. In this case, "a" would be the "name" of the anchor tag.
That doesn't seem like it should break it though. What version of Python and BS are you using?