On Python 3.2.3 running on Kubuntu Linux 12.10 with Requests 0.12.1 and BeautifulSoup 4.1.0, some web pages break during parsing:
import bs4
import requests
from pprint import pprint

# excerpted from the test_page_parse function referenced in the traceback below
try:
    response = requests.get('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
except Exception as error:
    return False
pprint(str(type(response)))
pprint(response)
pprint(str(type(response.content)))
soup = bs4.BeautifulSoup(response.content)
Note that hundreds of other web pages parse fine. What is in this particular page that is crashing Python, and how can I work around it? Here is the crash:
- bruno:scraper$ ./test-broken-site.py
"<class 'requests.models.Response'>"
<Response [200]>
"<class 'bytes'>"
Traceback (most recent call last):
  File "./test-broken-site.py", line 146, in <module>
    main(sys.argv)
  File "./test-broken-site.py", line 138, in main
    has_adsense('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
  File "./test-broken-site.py", line 67, in test_page_parse
    soup = bs4.BeautifulSoup(response.content)
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 172, in __init__
    self._feed()
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 175, in feed
    self.parser.close()
  File "parser.pxi", line 1171, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:79886)
  File "parsertarget.pxi", line 126, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:88932)
  File "lxml.etree.pyx", line 282, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:7469)
  File "saxparser.pxi", line 288, in lxml.etree._handleSaxDoctype (src/lxml/lxml.etree.c:85572)
  File "parsertarget.pxi", line 84, in lxml.etree._PythonSaxParserTarget._handleSaxDoctype (src/lxml/lxml.etree.c:88469)
  File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 150, in doctype
    doctype = Doctype.for_name_and_ids(name, pubid, system)
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
    return Doctype(value)
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 653, in __new__
    return str.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
TypeError: coercing to str: need bytes, bytearray or buffer-like object, NoneType found
Instead of bs4.BeautifulSoup(response.content) I tried bs4.BeautifulSoup(response.text). This had the same result (the same crash on this page). What can I do to work around pages that break like this, so that I can parse them?
The website in your output has this doctype:
<!DOCTYPE>
whereas a proper document has to have something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
When the BeautifulSoup parser tries to build the doctype here:
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
    return Doctype(value)
the value passed to Doctype is None (the doctype has no name), and when that None is coerced to a string, the parser fails.
One workaround is to fix the doctype with a regular expression before handing the page to BeautifulSoup.
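A minimal sketch of that workaround, assuming the nameless <!DOCTYPE> is the only thing tripping the parser (the HTML5 doctype substituted here is an arbitrary but valid choice):

import re
import bs4
import requests

response = requests.get('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
# Replace a nameless doctype such as "<!DOCTYPE>" with a named one,
# so that lxml's doctype handler receives a name instead of None.
content = re.sub(b'<!DOCTYPE[^>]*>', b'<!DOCTYPE html>', response.content, count=1, flags=re.IGNORECASE)
soup = bs4.BeautifulSoup(content)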
Related
I am trying to scrape data from Yelp and put it into a .csv file using urllib, BeautifulSoup, and pandas. I am getting several errors and I am not sure what to do.
I have reviewed all the similar questions I could find online, but none of them seem to be the same issue.
This is the error I get at runtime:
Traceback (most recent call last):
  File "/Users/sydney/PycharmProjects/yelpdata/venv/yelp.py", line 7, in <module>
    page_soup = bs.BeautifulSoup(source, 'html.parser')
  File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/__init__.py", line 310, in __init__
    markup, from_encoding, exclude_encodings=exclude_encodings)):
  File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/builder/_htmlparser.py", line 248, in prepare_markup
    exclude_encodings=exclude_encodings)
  File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py", line 395, in __init__
    for encoding in self.detector.encodings:
  File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py", line 278, in encodings
    self.markup, self.is_html)
  File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py", line 343, in find_declared_encoding
    declared_encoding_match = xml_re.search(markup, endpos=xml_endpos)
TypeError: expected string or buffer
This is my Python file (linked to GitHub).
I am trying to parse an HTML table into Python (2.7) with the solutions in this post.
When I try either of the first two with a string (as in the example), it works perfectly.
But when I use etree.XML on an HTML page read with urllib, I get an error. I checked each of the solutions, and the variable I pass is a str in each case.
For the following code:
from lxml import etree
import urllib

yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = urllib.urlopen(yearurl).read()
print type(s)
table = etree.XML(s)
I get this error:
File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line
9, in table = etree.XML(s)
File "lxml.etree.pyx", line 2723, in lxml.etree.XML
(src/lxml/lxml.etree.c:52448)
File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument
(src/lxml/lxml.etree.c:79932)
File "parser.pxi", line 1452, in lxml.etree._parseDoc
(src/lxml/lxml.etree.c:78774)
File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc
(src/lxml/lxml.etree.c:75389)
File "parser.pxi", line 564, in
lxml.etree._ParserContext._handleParseResultDoc
(src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult
(src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError
(src/lxml/lxml.etree.c:71955) lxml.etree.XMLSyntaxError: Opening and
ending tag mismatch: link line 8 and head, line 8, column 48
and for this code:
from xml.etree import ElementTree as ET
import urllib

yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = urllib.urlopen(yearurl).read()
print type(s)
table = ET.XML(s)
I get this error:
Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in <module>
    table = ET.XML(s)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111
While they may look like the same markup type, HTML is not as stringent as XML about being well-formed and following markup rules (opening/closing nodes, escaping entities, etc.). Hence, what passes for HTML may not be allowed as XML.
Therefore, consider using etree's HTML() function to parse the page. Additionally, you can use XPath to target the particular area you intend to extract. Below is an example that attempts to pull the main page's table; note that the page uses quite a few nested tables.
from lxml import etree
import urllib.request as rq

yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))

# PARSE PAGE
htmlpage = etree.HTML(s)

# XPATH TO SPECIFIC CONTENT
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")
for row in htmltable:
    print(row)
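Since the question is on Python 2.7 and the snippet above is written for Python 3, here is a sketch of the same approach with only the urllib import and print syntax adapted; the etree.HTML() call and the XPath are unchanged:

from lxml import etree
import urllib

yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = urllib.urlopen(yearurl).read()

# Parse with the forgiving HTML parser instead of the strict XML one
htmlpage = etree.HTML(s)
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")
for row in htmltable:
    print row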
I am working on a job where I need to parse a site with Beautiful Soup. The site is http://www.manta.com, but when I look for the site's encoding in the meta tags of the HTML, nothing appears. I am trying to parse the HTML locally, with the web page already downloaded, but I am having trouble with some decoding errors:
from bs4 import BeautifulSoup

# manta web page downloaded before
html = open('1.html', 'r')
soup = BeautifulSoup(html, 'lxml')
This produces the following stack trace:
Traceback (most recent call last):
  File "E:/Projects/Python/webkit/sample.py", line 10, in <module>
    soup = BeautifulSoup(html, 'lxml')
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__
    self._feed()
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed
    self.parser.close()
  File "parser.pxi", line 1209, in lxml.etree._FeedParser.close (src\lxml\lxml.etree.c:90717)
  File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:100104)
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99927)
  File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:9387)
  File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml\lxml.etree.c:96065)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 105-106: invalid data
I tried passing the encoding to the BeautifulSoup constructor:
soup = BeautifulSoup(html, 'lxml', from_encoding="some encoding")
and I still get the same error.
The interesting thing is that if I load the page in my browser, change the encoding to UTF-8 (in Firefox, for example), and save it, it parses fine. Any help is greatly appreciated. Thank you.
Encode the string in UTF-8 (note that html in your snippet is a file object, so read it first):
html = open('1.html', 'r').read()
soup = BeautifulSoup(html.encode('UTF-8'), 'lxml')
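If the page declares no encoding at all, another option is to let bs4's UnicodeDammit guess it; a minimal sketch, reusing the downloaded 1.html from the question:

from bs4 import BeautifulSoup, UnicodeDammit

raw = open('1.html', 'rb').read()
dammit = UnicodeDammit(raw)
print dammit.original_encoding  # bs4's best guess at the page's encoding
soup = BeautifulSoup(dammit.unicode_markup, 'lxml')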
>>> soup = BeautifulSoup( data )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 5518, column 822
>>> for each in l[5515:5520]:
...     print each
...
<script>
registerImage("original_image", "http://ecx.images-amazon.com/images/I/41h7uHc1jmL._SL500_AA240_.jpg","<a href="+'"'+"http://www.amazon.com/gp/product/images/1592406017/ref=dp_image_0?ie=UTF8&n=283155&s=books"+'"'+" target="+'"'+"AmazonHelp"+'"'+" onclick="+'"'+"return amz_js_PopWin(this.href,'AmazonHelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=0,status=1');"+'"'+" ><img onload="+'"'+"if (typeof uet == 'function') { uet('af'); }"+'"'+" src="+'"'+"http://ecx.images-amazon.com/images/I/41h7uHc1jmL._SL500_AA240_.jpg"+'"'+" id="+'"'+"prodImage"+'"'+" width="+'"'+"240"+'"'+" height="+'"'+"240"+'"'+" border="+'"'+"0"+'"'+" alt="+'"'+"Life, on the Line: A Chef's Story of Chasing Greatness, Facing Death, and Redefining the Way We Eat"+'"'+" onmouseover="+'"'+""+'"'+" /></a>", "<br /><a href="+'"'+"http://www.amazon.com/gp/product/images/1592406017/ref=dp_image_text_0?ie=UTF8&n=283155&s=books"+'"'+" target="+'"'+"AmazonHelp"+'"'+" onclick="+'"'+"return amz_js_PopWin(this.href,'AmazonHelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=0,status=1');"+'"'+" >See larger image</a>", "");
var ivStrings = new Object();
</script>
>>>
>>> l[5518-1][822]
'h'
>>>
Note: using Python 2.6.5 on Ubuntu 10.04.
Isn't BeautifulSoup supposed to ignore script tags? I can't figure out a way out of this. Any suggestions?
Pyparsing has some HTML tag support that makes for more robust scripts than straight regexes. And since it doesn't try to parse/process the entire HTML body but instead just looks for matching string expressions, it can handle badly formed HTML:
html = """<script>
registerImage("original_image",
"this is a closing </script> tag in quotes"
etc....
</script>
"""
# code to strip <script> tags from an HTML page
from pyparsing import makeHTMLTags, SkipTo, quotedString

script, scriptEnd = makeHTMLTags("script")
# Skip to the closing tag, but ignore anything inside quoted strings,
# so a "</script>" embedded in a JavaScript string does not end the match.
scriptBody = script + SkipTo(scriptEnd, ignore=quotedString) + scriptEnd
descriptedHtml = scriptBody.suppress().transformString(html)
print descriptedHtml
Depending on what kind of HTML scraping you are trying to do, you might be able to do the whole thing using pyparsing.
When I hit script tags in BeautifulSoup, I will often convert the soup object back to a string, remove the offending data, and then re-soup the data. This works when you don't care about what's in the scripts.
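A minimal sketch of that idea, adapted to operate on the raw markup (in this case the soup never gets constructed, so there is no object to convert back); the regex is best-effort, and a literal </script> inside a JavaScript string, which is exactly what broke the parser above, will still end the match early:

import re
from BeautifulSoup import BeautifulSoup

# Strip <script>...</script> blocks before parsing, then soup what is left.
stripped = re.sub(r'(?is)<script.*?</script>', '', data)  # data is the markup from the question
soup = BeautifulSoup(stripped)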
I am using OS X 10.6 and Python 2.7.1 with BeautifulSoup 3.0 and feedparser 5.01.
I am trying to parse the New York Times RSS Feed, which validates, and which Beautiful Soup on its own will parse happily.
The minimum code to produce the error is:
import feedparser
from BeautifulSoup import BeautifulSoup
feed = feedparser.parse("http://www.nytimes.com/services/xml/rss/nyt/GlobalHome.xml")
It fails whether I pass the URL directly or use urllib2.urlopen to fetch the contents myself.
I have also tried the character set detector.
The error block is:
/Users/user/Source/python/feed/BeautifulSoup.py:1553: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:3] == '\xef\xbb\xbf':
/Users/user/Source/python/feed/BeautifulSoup.py:1556: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == '\x00\x00\xfe\xff':
/Users/user/Source/python/feed/BeautifulSoup.py:1559: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == '\xff\xfe\x00\x00':
Traceback (most recent call last):
  File "parse.py", line 5, in <module>
    feed = feedparser.parse("http://www.nytimes.com/services/xml/rss/nyt/GlobalHome.xml")
  File "/Users/user/Source/python/feed/feedparser.py", line 3822, in parse
    feedparser.feed(data.decode('utf-8', 'replace'))
  File "/Users/user/Source/python/feed/feedparser.py", line 1851, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 360, in finish_endtag
    self.unknown_endtag(tag)
  File "/Users/user/Source/python/feed/feedparser.py", line 657, in unknown_endtag
    method()
  File "/Users/user/Source/python/feed/feedparser.py", line 1545, in _end_description
    value = self.popContent('description')
  File "/Users/user/Source/python/feed/feedparser.py", line 961, in popContent
    value = self.pop(tag)
  File "/Users/user/Source/python/feed/feedparser.py", line 868, in pop
    mfresults = _parseMicroformats(output, self.baseuri, self.encoding)
  File "/Users/user/Source/python/feed/feedparser.py", line 2420, in _parseMicroformats
    p = _MicroformatsParser(htmlSource, baseURI, encoding)
  File "/Users/user/Source/python/feed/feedparser.py", line 2024, in __init__
    self.document = BeautifulSoup.BeautifulSoup(data)
  File "/Users/user/Source/python/feed/BeautifulSoup.py", line 1228, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Users/user/Source/python/feed/BeautifulSoup.py", line 892, in __init__
    self._feed()
  File "/Users/user/Source/python/feed/BeautifulSoup.py", line 917, in _feed
    SGMLParser.feed(self, markup)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 103, in feed
    self.rawdata = self.rawdata + data
TypeError: cannot concatenate 'str' and 'NoneType' objects
I would appreciate any advice very much.
I tested using Python 2.7.1, feedparser 5.0.1, and BeautifulSoup 3.2.0, and the feed didn't cause a traceback. Try upgrading to BeautifulSoup 3.2.0.
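A quick way to confirm which version is actually being picked up; a minimal sketch, assuming BeautifulSoup 3.x (which exposes __version__):

import feedparser
from BeautifulSoup import __version__ as bs_version

print bs_version  # should print 3.2.0 (or later) after upgrading
feed = feedparser.parse("http://www.nytimes.com/services/xml/rss/nyt/GlobalHome.xml")
print feed.bozo, len(feed.entries)  # bozo == 0 means no well-formedness errors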