I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents. I stumbled upon a very bizarre thing.
The HTML comes from this page: http://www.wvdnr.gov/
It contains several errors, like multiple <html></html> tags, a <title> outside the <head>, and so on.
However, html5lib usually works well even in these cases. In fact, when I do:
soup = BeautifulSoup(document, "html5lib")
and I pretty-print soup, I see the following output: http://pastebin.com/8BKapx88
which contains a lot of <a> tags.
However, when I do soup.find_all("a") I get an empty list. With lxml I get the same.
So: has anybody stumbled on this problem before? What is going on? How do I get the links that html5lib found but isn't returning with find_all?
Even if the correct answer is "use another parser" (thanks @alecxe), I have another workaround. For some reason, this works too:
soup = BeautifulSoup(document, "html5lib")
soup = BeautifulSoup(soup.prettify(), "html5lib")
print soup.find_all('a')
which returns the same link list as:
soup = BeautifulSoup(document, "html.parser")
When it comes to parsing not-well-formed, tricky HTML, the choice of parser is very important:
There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won’t matter.
One parser will be faster than another, but they’ll all give you a
data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will
give different results.
html.parser worked for me:
from bs4 import BeautifulSoup
import requests
document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print soup.find_all('a')
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
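Since the "right" parser differs from page to page, one pragmatic option is to try several and keep the first parse that actually yields links. A minimal sketch, assuming the same page as above; the helper name parse_with_links and the parser order are my own, and lxml/html5lib must be installed to be tried:
import requests
from bs4 import BeautifulSoup

def parse_with_links(document, parsers=("html.parser", "lxml", "html5lib")):
    # Return the first soup whose find_all('a') is non-empty.
    soup = None
    for parser in parsers:
        soup = BeautifulSoup(document, parser)
        if soup.find_all("a"):
            return soup
    return soup  # fall back to the last attempt

document = requests.get("http://www.wvdnr.gov/").content
soup = parse_with_links(document)
print(len(soup.find_all("a")))  # 147, via html.parser, per the demo above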
See also:
Differences between parsers.
I have some HTML that contains a pre tag:
<p>Hi!</p><pre><p>Hi!</p></pre>
I'd like to change it to:
<p>Hi!</p><pre><p>Bye!</p></pre>
The naïve thing to do seems to be:
from bs4 import BeautifulSoup
markup = """<p>Hi!</p><pre><p>Hi!</p></pre>"""
soup = BeautifulSoup(markup, "html.parser")
pre_tag = soup.pre
pre_tag.string = "<p>bye!</p>"
print(str(soup))
but that escapes the markup, giving <p>Hi!</p><pre>&lt;p&gt;bye!&lt;/p&gt;</pre>
In the BS4 docs there's a section on output formatters that gives an example of using CData:
from bs4.element import CData
soup = BeautifulSoup("<a></a>", 'html.parser')
soup.a.string = CData("one < three")
print(soup.a.prettify(formatter="html"))
# <a>
# <![CDATA[one < three]]>
# </a>
which looks like what's needed, except that it also wraps the unescaped characters in a CDATA section; not good inside a <pre>.
This question: Beautiful Soup replaces < with &lt; looks like it's going in this vague direction, but isn't about the insides of a <pre> tag.
This question: customize BeautifulSoup's prettify by tag seems like overkill, and is also from the BS3 era.
P.S. Before anyone asks: the example above is indicative of wanting to do all kinds of things to the contents of a <pre>, not just change Hi to Bye.
You can either use the API to construct the new contents:
from bs4 import BeautifulSoup
markup = """<p>Hi!</p><pre><p>Hi!</p></pre>"""
soup = BeautifulSoup(markup, "html.parser")
pre_tag = soup.pre
new_tag = soup.new_tag("p")
new_tag.append("bye!")
pre_tag.clear()
pre_tag.append(new_tag)
print(str(soup))
Or you can provide the HTML to another BeautifulSoup instance and use that:
from bs4 import BeautifulSoup
markup = """<p>Hi!</p><pre><p>Hi!</p></pre>"""
soup = BeautifulSoup(markup, "html.parser")
pre_tag = soup.pre
soup2 = BeautifulSoup("<p>bye!</p>", "html.parser")
pre_tag.clear()
pre_tag.append(soup2)
print(str(soup))
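If you do this in more than one place, the second approach is easy to wrap in a small helper. A minimal sketch; replace_contents is my own name, not a bs4 API:
from bs4 import BeautifulSoup

def replace_contents(tag, new_html, parser="html.parser"):
    # Swap a tag's children for a freshly parsed HTML fragment.
    fragment = BeautifulSoup(new_html, parser)
    tag.clear()           # drop the old children
    tag.append(fragment)  # graft the new parse tree in

markup = """<p>Hi!</p><pre><p>Hi!</p></pre>"""
soup = BeautifulSoup(markup, "html.parser")
replace_contents(soup.pre, "<p>bye!</p>")
print(str(soup))  # <p>Hi!</p><pre><p>bye!</p></pre>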
I have a script that I've used for several years. One particular page on the site loads and returns soup, but all my finds return no result. This is old code that has worked on this site in the past. Instead of searching for a specific <div> I simplified it to look for any table, tr or td, with find or findAll. I've tried various methods of opening the page, including lxml - all with no results.
My interests are in the player_basic and player_records divs.
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
import urllib2
url = "http://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=60456"
#html = urllib2.urlopen(url).read()
html = urllib2.urlopen(url,"lxml")
soup = BeautifulSoup(html)
#div = soup.find('div', {"class":"player_basic"})
#div = soup.find('div', {"class":"player_records"})
item = soup.findAll('td')
print item
You're not reading the response. Try this:
import urllib2
url = 'http://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=60456'
response = urllib2.urlopen(url)
html = response.read()
Then you can use it with BeautifulSoup. If it still does not work, there is good reason to believe the page contains malformed HTML (missing closing tags, etc.), since the parsers BeautifulSoup uses (especially html.parser) are not very tolerant of that.
UPDATE: try using the lxml parser:
soup = BeautifulSoup(html, 'lxml')
tds = soup.find_all('td')
print len(tds)
142
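Putting the pieces together (read the response before parsing, and use the lenient lxml parser), a minimal end-to-end sketch; I've used requests and bs4 here, and the div class names come from the question:
import requests
from bs4 import BeautifulSoup  # bs4, not the old BS3 BeautifulSoup module

url = "http://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=60456"
html = requests.get(url).content    # actually read the body
soup = BeautifulSoup(html, "lxml")  # lxml tolerates the malformed markup

basic = soup.find("div", {"class": "player_basic"})
records = soup.find("div", {"class": "player_records"})
print(len(soup.find_all("td")))     # 142 at the time of the answer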
I'm trying to scrape this site, and I want to check all of the anchor tags.
I have imported BeautifulSoup 4.3.2, and here is my code:
from urllib2 import urlopen  # assumed import; urllib.request.urlopen on Python 3
from bs4 import BeautifulSoup

url = """http://www.civicinfo.bc.ca/bids?pn=1"""
Html = urlopen(url).read()
Soup = BeautifulSoup(Html, 'html.parser')
Content = Soup.find_all('a')
My problem is that Content is always empty (i.e. Content = []). Does anyone have any ideas?
From the documentation, html.parser is not very lenient before certain versions of Python, so you're likely looking at some malformed HTML. What you want to do works if you use lxml instead of html.parser.
From the documentation:
That said, there are things you can do to speed up Beautiful Soup. If
you’re not using lxml as the underlying parser, my advice is to start.
Beautiful Soup parses documents significantly faster using lxml than
using html.parser or html5lib.
So the relevant code would be:
Soup = BeautifulSoup(Html, 'lxml')
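A complete version of the scrape with the parser swapped in; a minimal sketch, keeping the urllib2 import assumed above from the question's era:
from urllib2 import urlopen  # urllib.request.urlopen on Python 3
from bs4 import BeautifulSoup

url = "http://www.civicinfo.bc.ca/bids?pn=1"
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")  # lxml copes where html.parser gave up

links = [a.get("href") for a in soup.find_all("a")]
print(len(links))  # non-empty now
print(links[:5])   # first few hrefs as a sanity check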
I am using Python's BeautifulSoup to parse some HTML. The problem is I want to extract only the text of the document, except for <ul> and <li> tags. Sort of the opposite of unwrap(). Thus I want a function parse_everything_but_lists that will have the following behaviour:
>>> parse_everything_but_lists("Hello <a>this</a> is <ul><li>me</li><li><b>Dr</b> Pablov</li></ul>")
"Hello this is <ul><li>me</li><li>Dr Pablov</li></ul>"
You can still use unwrap, you just need to get a bit recursive.
from bs4 import BeautifulSoup, Tag

def unwrapper(tags, keep=('ul', 'li')):
    for el in tags:
        if isinstance(el, Tag):
            unwrapper(el, keep)  # recurse first, unwrap later
            if el.name not in keep:
                el.unwrap()
Demo:
s = '''"Hello <a>this</a> is <ul><li>me</li><li><b>Dr</b> Pablov</li></ul>"'''
soup = BeautifulSoup(s, 'html.parser') # force html.parser to avoid lxml's auto-inclusion of <html><body>
unwrapper(soup)
soup
Out[63]: "Hello this is <ul><li>me</li><li>Dr Pablov</li></ul>"
This approach should work on arbitrary nestings of tags, e.g.:
s = '''"<a><b><ul><c><li><d>Hello</d></li></c></ul></b></a>"'''
soup = BeautifulSoup(s, 'html.parser')
unwrapper(soup)
soup
Out[19]: "<ul><li>Hello</li></ul>"
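To get the exact parse_everything_but_lists function the question asks for, wrap the recursive helper from above. A minimal sketch:
def parse_everything_but_lists(markup, keep=('ul', 'li')):
    soup = BeautifulSoup(markup, 'html.parser')
    unwrapper(soup, keep)
    return str(soup)

print(parse_everything_but_lists(
    "Hello <a>this</a> is <ul><li>me</li><li><b>Dr</b> Pablov</li></ul>"))
# Hello this is <ul><li>me</li><li>Dr Pablov</li></ul>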
I have a fairly complex template script that BeautifulSoup4 isn't understanding for some reason. As you can see below, BS4 parses only partway into the tree before giving up. Why is this, and is there a way to fix it?
>>> from bs4 import BeautifulSoup
>>> html = """<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script> Other stuff I want to stay"""
>>> soup = BeautifulSoup(html)
>>> soup.findAll('script')
[<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</script>]
Edit: on further testing, for some reason it appears that BS3 is able to parse this correctly:
>>> from BeautifulSoup import BeautifulSoup as bs3
>>> soup = bs3(html)
>>> soup.script
<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script>
Beautiful Soup sometimes fails with its default parser. Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers.
In some cases I have to change the parser to another one, such as lxml or html5lib.
Here is an example of the explanation above:
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "lxml")
I recommend you read this: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
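Applied to the template from the question, switching parsers is typically enough. A sketch assuming html5lib is installed; html5lib treats <script> contents as raw text, so the nested tags should survive up to the real closing tag:
from bs4 import BeautifulSoup

html = ('<script id="scriptname" type="text/template">'
        '<section class="sectionname"><header><h1>Test</h1></header>'
        '</section></script> Other stuff I want to stay')

soup = BeautifulSoup(html, "html5lib")  # requires: pip install html5lib
print(soup.find("script"))  # the full template, ending at the real </script>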