I'm working on a job task that requires parsing a site with Beautiful Soup. The site is http://www.manta.com, but no encoding appears anywhere in the meta tags of its HTML. I tried parsing the HTML locally, from a downloaded copy of the page, but I'm running into decoding errors:
# manta web page downloaded beforehand
from bs4 import BeautifulSoup

html = open('1.html', 'r')
soup = BeautifulSoup(html, 'lxml')
This produces the following stack trace:
Traceback (most recent call last):
File "E:/Projects/Python/webkit/sample.py", line 10, in <module>
soup = BeautifulSoup(html, 'lxml')
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__
self._feed()
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed
self.builder.feed(self.markup)
File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed
self.parser.close()
File "parser.pxi", line 1209, in
lxml.etree._FeedParser.close(src\lxm\lxml.etree.c:90717)
File "parsertarget.pxi", line 142, in
lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:100104)
File "parsertarget.pxi", line 130, in
lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99927)
File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored
(src\lxml\lxml.etree.c:9387)
File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml
\lxml.etree.c:96065)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 105-106: invalid data
I tried passing an encoding to the Beautiful Soup constructor:
soup = BeautifulSoup(html, 'lxml', from_encoding="some encoding")
but I keep getting the same error.
The interesting thing is that if I load the page in Firefox, switch its encoding to UTF-8, and save it, everything works fine. Any help is greatly appreciated. Thank you.
Encode the string as UTF-8 before parsing:
soup = BeautifulSoup(html.read().encode('UTF-8'), 'lxml')
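Note that if the file on disk isn't valid UTF-8 to begin with, re-encoding alone won't help (in Python 2, calling .encode on a byte string implicitly decodes it as ASCII first). You have to decode with the page's real encoding before handing the markup to the parser. A minimal sketch, assuming the download is windows-1252 (an assumption; substitute whatever the server actually uses):

from bs4 import BeautifulSoup

# read raw bytes and decode with the assumed source encoding
raw = open('1.html', 'rb').read()
text = raw.decode('windows-1252', errors='replace')  # assumption: page is windows-1252
soup = BeautifulSoup(text, 'lxml')  # BeautifulSoup accepts unicode markup directly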
Related
I am trying to scrape data from Yelp and put it into a .csv file using urllib, BeautifulSoup, and pandas. I am getting several errors and am not sure what to do.
I have reviewed all the similar questions I could find online, but none of them seem to be the same issue.
This is the error I get at runtime:
Traceback (most recent call last):
File "/Users/sydney/PycharmProjects/yelpdata/venv/yelp.py", line 7, in <module>
page_soup = bs.BeautifulSoup(source, 'html.parser')
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-
packages/bs4/__init__.py", line 310, in __init__
markup, from_encoding, exclude_encodings=exclude_encodings)):
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-
packages/bs4/builder/_htmlparser.py", line 248, in prepare_markup
exclude_encodings=exclude_encodings)
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py",
line 395, in __init__
for encoding in self.detector.encodings:
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py",
line 278, in encodings
self.markup, self.is_html)
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py",
line 343, in find_declared_encoding
declared_encoding_match = xml_re.search(markup, endpos=xml_endpos)
TypeError: expected string or buffer
This is my Python file (linked on GitHub).
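The TypeError: expected string or buffer at the bottom of the trace means BeautifulSoup was handed something that is neither a string nor bytes. Since the linked file isn't shown here this is a guess, but a common cause is passing the urllib response object itself instead of its body; reading the body first fixes that:

import urllib
import bs4 as bs

# hypothetical fetch for illustration; the original script isn't shown
page = urllib.urlopen('https://www.yelp.com/')
source = page.read()  # pass the HTML string, not the response object
page_soup = bs.BeautifulSoup(source, 'html.parser')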
Hi, I am trying to extract comments from a web page using lxml and XPath. Here is my code:
import requests
from lxml import html

pg = requests.get('https://www.makeupalley.com/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream', timeout=30)
tr_pg = html.fromstring(pg.content)
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
for cm in cm_pg:
    print cm
I get this error:
Traceback (most recent call last):
File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 22, in <module>
process_page('/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream')
File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 10, in process_page
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:57884)
File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:166905)
File "src/lxml/xpath.pxi", line 230, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:165893)
File "src/lxml/extensions.pxi", line 623, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:160088)
File "src/lxml/extensions.pxi", line 657, in lxml.etree._createNodeSetResult (src/lxml/lxml.etree.c:160529)
File "src/lxml/extensions.pxi", line 678, in lxml.etree._unpackNodeSetEntry (src/lxml/lxml.etree.c:160740)
File "src/lxml/extensions.pxi", line 804, in lxml.etree._buildElementStringResult (src/lxml/lxml.etree.c:162214)
File "src/lxml/apihelpers.pxi", line 1417, in lxml.etree.funicode (src/lxml/lxml.etree.c:29944)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 615: invalid continuation byte
I know that there is an invalid character in the comments. How do I solve this?
Can you ask Requests to attempt to decode it for you? Use response.text (a string) rather than response.content (bytes).
The encoding of the source is probably something other than UTF-8, which your XPath library may be assuming. response.encoding is Requests' best guess at what it is. Sometimes web servers/pages aren't configured to say explicitly what encoding they're using, and then all you can do is guess.
It doesn't help that the encoding can be specified in an HTTP header and/or in a <meta> tag. Or websites can lie. Or they might mix encodings. Note that your target website won't even pass a validator because its encoding is wrong, and even aside from that it's rife with errors.
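For example, letting Requests do the decoding and parsing the resulting unicode (pg and the XPath expression are from the question's snippet):

# pg.text is a decoded unicode string, using Requests' guessed encoding
tr_pg = html.fromstring(pg.text)
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')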
The page has badly encoded characters. For example:
Voil�! You will now have an airbrushed look.[...](� la Cover Girl!)
You can avoid them by manually decoding:
>>> pg.content.decode('utf8', errors='ignore')
u'Voil! You will now have an airbrushed look.[...]( la Cover Girl!)'
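Applied to the question's snippet, that looks like this. Note that errors='ignore' silently drops the undecodable bytes; errors='replace' would substitute U+FFFD markers instead:

# decode manually, dropping the bad bytes, then parse the unicode result
tr_pg = html.fromstring(pg.content.decode('utf8', errors='ignore'))
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')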
On Python 3.2.3 running on Kubuntu Linux 12.10 with Requests 0.12.1 and BeautifulSoup 4.1.0, I am having some web pages break on parsing:
try:
    response = requests.get('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
except Exception as error:
    return False

pprint(str(type(response)))
pprint(response)
pprint(str(type(response.content)))
soup = bs4.BeautifulSoup(response.content)
Note that hundreds of other web pages parse fine. What is in this particular page that is crashing Python, and how can I work around it? Here is the crash:
bruno:scraper$ ./test-broken-site.py
"<class 'requests.models.Response'>"
<Response [200]>
"<class 'bytes'>"
Traceback (most recent call last):
File "./test-broken-site.py", line 146, in <module>
main(sys.argv)
File "./test-broken-site.py", line 138, in main
has_adsense('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
File "./test-broken-site.py", line 67, in test_page_parse
soup = bs4.BeautifulSoup(response.content)
File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 172, in __init__
self._feed()
File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 185, in _feed
self.builder.feed(self.markup)
File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 175, in feed
self.parser.close()
File "parser.pxi", line 1171, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:79886)
File "parsertarget.pxi", line 126, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:88932)
File "lxml.etree.pyx", line 282, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:7469)
File "saxparser.pxi", line 288, in lxml.etree._handleSaxDoctype (src/lxml/lxml.etree.c:85572)
File "parsertarget.pxi", line 84, in lxml.etree._PythonSaxParserTarget._handleSaxDoctype (src/lxml/lxml.etree.c:88469)
File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 150, in doctype
doctype = Doctype.for_name_and_ids(name, pubid, system)
File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
return Doctype(value)
File "/usr/lib/python3/dist-packages/bs4/element.py", line 653, in __new__
return str.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
TypeError: coercing to str: need bytes, bytearray or buffer-like object, NoneType found
Instead of bs4.BeautifulSoup(response.content) I also tried bs4.BeautifulSoup(response.text), which crashed the same way on this page. What can I do to work around pages that break like this, so that I can parse them?
The website in your output has this doctype:
<!DOCTYPE>
whereas a proper site has to declare something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
When the beautifulsoup parser tries to get the doctype here:
File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
return Doctype(value)
The value passed to Doctype is None because the doctype has no name, and coercing None to a string fails with the TypeError above.
One solution is to fix the doctype with a regex before handing the page to Beautiful Soup, as sketched below.
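A minimal sketch of that approach (the substitution pattern is an assumption; adjust as needed):

import re
import requests
import bs4

response = requests.get('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
# replace the bare <!DOCTYPE> with a minimal valid declaration before parsing
fixed = re.sub(b'<!DOCTYPE[^>]*>', b'<!DOCTYPE html>', response.content, count=1, flags=re.IGNORECASE)
soup = bs4.BeautifulSoup(fixed)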
I'm attempting to pull some data off a popular browser based game, but am having trouble with some decoding errors:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.neopets.com/")
p = BeautifulSoup(r.text)
This produces the following stack trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 172, in __init__
File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 185, in _feed
File "build/bdist.linux-x86_64/egg/bs4/builder/_lxml.py", line 195, in feed
File "parser.pxi", line 1187, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:87912)
File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:97055)
File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8862)
File "saxparser.pxi", line 274, in lxml.etree._handleSaxCData (src/lxml/lxml.etree.c:93385)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 476: invalid start byte
Doing the following:
print repr(r.text[476 - 10: 476 + 10])
Produces:
u'ttp-equiv="X-UA-Comp'
I'm really not sure what the issue here is. Any help is greatly appreciated. Thank you.
.text on a response returns a decoded unicode value, but perhaps you should let BeautifulSoup do the decoding for you:
p = BeautifulSoup(r.content, from_encoding=r.encoding)
r.content returns the un-decoded raw bytestring, and r.encoding is the encoding detected from the headers.
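Putting that together with the question's snippet:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.neopets.com/")
# raw bytes plus the encoding Requests detected from the HTTP headers
p = BeautifulSoup(r.content, from_encoding=r.encoding)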
import requests
from bs4 import BeautifulSoup

start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url)
This code produces the following error:
Traceback (most recent call last):
File "test2_requests.py", line 10, in <module>
soup=BeautifulSoup(start_url)
File "/usr/local/lib/python2.7/dist-packages/bs4/__init__.py", line 169, in __init__
self.builder.prepare_markup(markup, from_encoding))
File "/usr/local/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 68, in prepare_markup
dammit = UnicodeDammit(markup, try_encodings, is_html=True)
File "/usr/local/lib/python2.7/dist-packages/bs4/dammit.py", line 203, in __init__
self._detectEncoding(markup, is_html)
File "/usr/local/lib/python2.7/dist-packages/bs4/dammit.py", line 373, in _detectEncoding
xml_encoding_match = xml_encoding_re.match(xml_data)
TypeError: expected string or buffer
Use the .content of the response:
start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url.content)
Alternatively, you can use the decoded unicode text:
start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url.text)
See the Response content section of the documentation.
You probably need to use
soup = BeautifulSoup(start_url.text)
or fetch the page with urllib2 instead, whose response object has a .read() method:
from BeautifulSoup import BeautifulSoup
import urllib2

data = urllib2.urlopen('http://www.delicious.com/golisoda').read()
soup = BeautifulSoup(data)
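Note that from BeautifulSoup import BeautifulSoup is the old BeautifulSoup 3 import; with the bs4 package installed, the equivalent is:

from bs4 import BeautifulSoup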