I have a fairly complex template script that BeautifulSoup4 isn't understanding for some reason. As you can see below, BS4 is only parsing partially into the tree before giving up. Why is this and is there a way to fix it?
>>> from bs4 import BeautifulSoup
>>> html = """<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script> Other stuff I want to stay"""
>>> soup = BeautifulSoup(html)
>>> soup.findAll('script')
[<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</script>]
Edit: on further testing, for some reason it appears that BS3 is able to parse this correctly:
>>> from BeautifulSoup import BeautifulSoup as bs3
>>> soup = bs3(html)
>>> soup.script
<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script>
Beautiful Soup sometimes fails with its default parser. Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers.
In some cases you have to switch to another parser, such as lxml or html5lib.
Here is an example of the above:
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "lxml")
I recommend you read this: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
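For the script template in the question, here is a minimal sketch (assuming lxml is installed; html5lib behaves the same way) showing that a different parser keeps the markup inside the <script> tag intact:
from bs4 import BeautifulSoup

html = '<script id="scriptname" type="text/template"><section><h1>Test</h1></section></script> Other stuff I want to stay'
soup = BeautifulSoup(html, "lxml")
print(soup.find_all('script'))  # the whole template survives as the script's text
These parsers treat everything up to the real </script> as raw text, so the inner </h1> no longer terminates the element early.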
Related
I would like to use BeautifulSoup or lxml to parse some web pages. Since the raw data is not clean XML, it cannot be parsed directly by lxml.etree.fromstring. However, BeautifulSoup(page_source, 'lxml') works and I can get the soup data of the page. Because I need some lxml features, such as querying by XPath, are there any functions or variables I can call to convert the soup object of the whole raw web page into an etree object? (I guess BeautifulSoup must have converted the raw page into an etree object before generating the soup object via the lxml parser, but I cannot find where it stores that object.)
P.S. I have tried the answer from "Is it possible to use bs4 soup object with lxml?" to parse the web pages, but I still find some pages that cannot be parsed. Here is an example:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen('https://www.nature.com/articles/s41558-019-0619-1').read()
>>> soup = BeautifulSoup(html, 'lxml')  ## returns a soup object
>>> from lxml.etree import fromstring
>>> fromstring(soup.prettify())  ## raises errors
>>> from lxml.html.soupparser import fromstring
>>> fromstring(soup.prettify())  ## raises errors
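One possible workaround, sketched here under the assumption that lxml's lenient HTML parser is acceptable (this is not from the original post): serialize the soup back to a string and hand it to lxml.html.fromstring, which accepts markup that lxml.etree.fromstring rejects, then query the resulting tree with XPath:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml.html

html = urlopen('https://www.nature.com/articles/s41558-019-0619-1').read()
soup = BeautifulSoup(html, 'lxml')

# lxml.html.fromstring is far more forgiving than lxml.etree.fromstring,
# so the re-serialized soup usually parses without errors.
tree = lxml.html.fromstring(str(soup))
print(tree.xpath('//title/text()'))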
I'm trying to scrape this site, and I want to check all of the anchor tags.
I have imported beautifulsoup 4.3.2 and here is my code:
url = """http://www.civicinfo.bc.ca/bids?pn=1"""
Html = urlopen(url).read()
Soup = BeautifulSoup(Html, 'html.parser')
Content = Soup.find_all('a')
My problem is that Content is always empty (i.e. Content = []). Does anyone have any ideas?
According to the documentation, html.parser is not very lenient before certain versions of Python, so you're likely looking at some malformed HTML.
What you want to do works if you use lxml instead of html.parser.
From the documentation:
That said, there are things you can do to speed up Beautiful Soup. If
you’re not using lxml as the underlying parser, my advice is to start.
Beautiful Soup parses documents significantly faster using lxml than
using html.parser or html5lib.
So the relevant code would be:
Soup = BeautifulSoup(Html, 'lxml')
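Putting it together, here is a minimal sketch (assuming the page is still reachable and its markup unchanged) that parses the listing with lxml and prints every anchor's href:
from urllib.request import urlopen
from bs4 import BeautifulSoup

Html = urlopen("http://www.civicinfo.bc.ca/bids?pn=1").read()
# lxml tolerates the malformed markup that trips up html.parser
Soup = BeautifulSoup(Html, 'lxml')
for a in Soup.find_all('a'):
    print(a.get('href'))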
I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents. I stumbled upon a very bizarre thing.
The HTML comes from this page: http://www.wvdnr.gov/
It contains multiple errors, like multiple <html></html>, <title> outside the <head>, etc...
However, html5lib usually works well even in these cases. In fact, when I do:
soup = BeautifulSoup(document, "html5lib")
and I pretty-print soup, I see the following output: http://pastebin.com/8BKapx88
which contains a lot of <a> tags.
However, when I do soup.find_all("a") I get an empty list. With lxml I get the same.
So: has anybody stumbled on this problem before? What is going on? How do I get the links that html5lib found but isn't returning with find_all?
Even if the correct answer is "use another parser" (thanks @alecxe), I have another workaround. For some reason, this works too:
soup = BeautifulSoup(document, "html5lib")
soup = BeautifulSoup(soup.prettify(), "html5lib")
print(soup.find_all('a'))
which returns the same link list as:
soup = BeautifulSoup(document, "html.parser")
When it comes to parsing not-well-formed, tricky HTML, the choice of parser is very important:
There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won’t matter.
One parser will be faster than another, but they’ll all give you a
data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will
give different results.
html.parser worked for me:
from bs4 import BeautifulSoup
import requests
document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
See also:
Differences between parsers.
I'm trying to parse an XML page with BeautifulSoup and for some reason it's not able to find the XML parser. I don't think it's a path issue as I've used lxml to parse pages in the past, just not XML. Here's the code:
from bs4 import *
import urllib2
import lxml
from lxml import *
BASE_URL = "http://auctionresults.fcc.gov/Auction_66/Results/xml/round/66_115_database_round.xml"
proxy = urllib2.ProxyHandler({'http': 'http://myProxy.com'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
page = urllib2.urlopen(BASE_URL)
soup = BeautifulSoup(page,"xml")
print soup
I'm probably missing something simple, but all of the XML-parsing-with-BeautifulSoup questions I found on here were about bs3, and I'm using bs4, which uses a different method for parsing XML. Thanks.
If you have lxml installed, just call that as BeautifulSoup's parser instead, like below.
Code:
from bs4 import BeautifulSoup as bsoup
import requests as rq
url = "http://auctionresults.fcc.gov/Auction_66/Results/xml/round/66_115_database_round.xml"
r = rq.get(url)
soup = bsoup(r.content, "lxml")
print(soup)
Result:
<html><body><dataroot xmlns:od="urn:schemas-microsoft-com:officedata" xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:nonamespaceschemalocation="66_database.xsd"><all_bids>
<auction_id>66</auction_id>
<auction_description>Advanced Wireless Services</auction_description>
... really long list follows...
[Finished in 34.9s]
Let us know if this helps.
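If a true XML parse is needed (tag names keep their original case and no <html>/<body> wrapper is added), the "xml" feature from the question also relies on lxml under the hood, so it should start working once lxml is installed. A minimal sketch, assuming the auction URL is still reachable:
import requests
from bs4 import BeautifulSoup

url = "http://auctionresults.fcc.gov/Auction_66/Results/xml/round/66_115_database_round.xml"
r = requests.get(url)

# features="xml" uses lxml's XML parser: case is preserved and the document
# is not wrapped in <html><body> the way the HTML parsers do.
soup = BeautifulSoup(r.content, "xml")
print(soup.prettify()[:500])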
Is there any Python library that allows me to parse an HTML document similar to what jQuery does?
i.e. I'd like to be able to use CSS selectors syntax to grab an arbitrary set of nodes from the document, read their content/attributes, etc.
The only Python HTML parsing lib I've used before was BeautifulSoup, and even though it's fine I keep thinking it would be faster to do my parsing if I had jQuery syntax available. :D
If you are fluent with BeautifulSoup, you could just add soupselect to your libs.
Soupselect is a CSS selector extension for BeautifulSoup.
Usage:
from bs4 import BeautifulSoup as Soup
from soupselect import select
import urllib
soup = Soup(urllib.urlopen('http://slashdot.org/'))
select(soup, 'div.title h3')
[<h3><span><a href='//science.slashdot.org/'>Science</a>:</span></h3>,
<h3><a href='//slashdot.org/articles/07/02/28/0120220.shtml'>Star Trek</h3>,
..]
Consider PyQuery:
http://packages.python.org/pyquery/
>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> import urllib
>>> d = pq("<html></html>")
>>> d = pq(etree.fromstring("<html></html>"))
>>> d = pq(url='http://google.com/')
>>> d = pq(url='http://google.com/', opener=lambda url: urllib.urlopen(url).read())
>>> d = pq(filename=path_to_html_file)
>>> d("#hello")
[<p#hello.hello>]
>>> p = d("#hello")
>>> p.html()
'Hello world !'
>>> p.html("you know <a href='http://python.org/'>Python</a> rocks")
[<p#hello.hello>]
>>> p.html()
u'you know Python rocks'
>>> p.text()
'you know Python rocks'
The lxml library supports CSS selectors.
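A minimal sketch of that (lxml's cssselect() method needs the separate cssselect package installed, and the markup here is only illustrative):
import lxml.html

doc = lxml.html.fromstring("<div class='title'><h3>Hello</h3></div>")

# cssselect() translates the CSS selector to XPath internally
for h3 in doc.cssselect("div.title h3"):
    print(h3.text_content())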
BeautifulSoup now has support for CSS selectors:
import requests
from bs4 import BeautifulSoup as Soup

# use Python Requests to get the HTML page
html = requests.get('https://stackoverflow.com/questions/3051295').content
soup = Soup(html, 'html.parser')

# title of this question
soup.select('h1.grid--cell :first-child')[0].text

# number of question upvotes (first item)
soup.select_one('[itemprop="upvoteCount"]').text