Python decoding errors with BeautifulSoup, requests, and lxml - python

I'm attempting to pull some data off a popular browser-based game, but I'm running into some decoding errors:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.neopets.com/")
p = BeautifulSoup(r.text)
This produces the following stack trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 172, in __init__
File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 185, in _feed
File "build/bdist.linux-x86_64/egg/bs4/builder/_lxml.py", line 195, in feed
File "parser.pxi", line 1187, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:87912)
File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:97055)
File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8862)
File "saxparser.pxi", line 274, in lxml.etree._handleSaxCData (src/lxml/lxml.etree.c:93385)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 476: invalid start byte
Doing the following:
print repr(r.text[476 - 10: 476 + 10])
Produces:
u'ttp-equiv="X-UA-Comp'
I'm really not sure what the issue here is. Any help is greatly appreciated. Thank you.

.text on a response returns a decoded unicode value, but perhaps you should let BeautifulSoup do the decoding for you:
p = BeautifulSoup(r.content, from_encoding=r.encoding)
r.content returns the un-decoded raw bytestring, and r.encoding is the encoding detected from the headers.
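Putting the two together, a sketch of the same request with the parse done from bytes ("lxml" just names the parser the traceback shows is already in use):
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.neopets.com/")
# Hand BeautifulSoup the raw bytes plus the header-declared encoding and
# let it (via its UnicodeDammit machinery) work out the decoding itself.
p = BeautifulSoup(r.content, "lxml", from_encoding=r.encoding)
print p.title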

Related

"TypeError: expected string or buffer" stemming from BeautifulSoup

I am trying to scrape data from yelp and put it into a .csv file using urllib, beautifulsoup, and pandas. I am getting several errors and I am not sure what to do.
I have reviewed all the similar questions I could find online but none of them seem to be the same issue.
This is the error I get at runtime:
Traceback (most recent call last):
File "/Users/sydney/PycharmProjects/yelpdata/venv/yelp.py", line 7, in <module>
page_soup = bs.BeautifulSoup(source, 'html.parser')
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-
packages/bs4/__init__.py", line 310, in __init__
markup, from_encoding, exclude_encodings=exclude_encodings)):
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-
packages/bs4/builder/_htmlparser.py", line 248, in prepare_markup
exclude_encodings=exclude_encodings)
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py",
line 395, in __init__
for encoding in self.detector.encodings:
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py",
line 278, in encodings
self.markup, self.is_html)
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py",
line 343, in find_declared_encoding
declared_encoding_match = xml_re.search(markup, endpos=xml_endpos)
TypeError: expected string or buffer
This is my python file (linked to GitHub)
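Without seeing the linked file it is hard to be sure, but this particular TypeError usually means BeautifulSoup was handed something other than a string or bytes, most often the unread urllib response object itself. A minimal sketch of the usual working pattern, with a placeholder URL rather than anything taken from the linked file:
import urllib
import bs4 as bs

# Placeholder URL; substitute the actual Yelp page being scraped.
response = urllib.urlopen('https://www.yelp.com/')
source = response.read()  # read() returns a string; passing `response` itself triggers the TypeError
page_soup = bs.BeautifulSoup(source, 'html.parser')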

Encoding problem with /usr/lib64/python3.4/http/client.py

I do not understand the error below. If I run:
python3.4 ./bug.py "salé.txt"
it works fine. But if I run:
python3.4 ./bug.py "Capture d’écran du 2019-03-21 15-17-10.png"
I get this error:
Traceback (most recent call last):
File "./bug.py", line 45, in <module>
status=testB_CreateSimpleDocumentWithFile(session)
File "./bug.py", line 32, in testB_CreateSimpleDocumentWithFile
status, result = session.create_document_with_properties(path,mydoc,simple_document,properties=props,files=kk)
File "/home/karim/testatrium/nuxeolib/session.py", line 345, in create_document_with_properties
_document_properties, _ = self.encode_properties(properties, files)
File "/home/karim/testatrium/nuxeolib/session.py", line 251, in encode_properties
_names, _sizes = self.upload_files(files, batch_id=_batch_id)
File "/home/karim/testatrium/nuxeolib/session.py", line 136, in upload_files
_status, _result = self.execute_api(param=_param, headers=_headers, file_name=_name)
File "/home/karim/testatrium/nuxeolib/session.py", line 1325, in execute_api
_connection.request(method, url, headers=h2, body=data)
File "/usr/lib64/python3.4/http/client.py", line 1139, in request
self._send_request(method, url, body, headers)
File "/usr/lib64/python3.4/http/client.py", line 1179, in _send_request
self.putheader(hdr, value)
File "/usr/lib64/python3.4/http/client.py", line 1110, in putheader
values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 9: ordinal not in range(256)
The problem comes from the right single quotation mark (U+2019), which I have not managed to work around.
Thanks for any advice.
Karim
Since the file name is not in your control, I would sanitize it. Any of the methods in this question would solve the problem.
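For example, a small helper along these lines (a sketch, not part of nuxeolib; the replacement table is just the usual typographic suspects) maps curly quotes and dashes to ASCII and drops whatever still falls outside latin-1, which is the encoding http.client uses for header values:
def sanitize_header_value(value):
    # Hypothetical helper: map typographic punctuation to ASCII first...
    replacements = {
        '\u2018': "'", '\u2019': "'",   # single quotes
        '\u201c': '"', '\u201d': '"',   # double quotes
        '\u2013': '-', '\u2014': '-',   # dashes
    }
    for bad, good in replacements.items():
        value = value.replace(bad, good)
    # ...then drop anything latin-1 still cannot represent.
    return value.encode('latin-1', errors='ignore').decode('latin-1')

print(sanitize_header_value("Capture d’écran du 2019-03-21 15-17-10.png"))
# -> Capture d'écran du 2019-03-21 15-17-10.png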

UnicodeDecodeError when extracting comments from a web page using lxml and xpath

Hi, I am trying to extract comments from a web page using lxml and XPath. Here is my code:
import requests
from lxml import html

pg = requests.get('https://www.makeupalley.com/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream', timeout=30)
tr_pg = html.fromstring(pg.content)
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
for cm in cm_pg:
    print cm
I got this error
Traceback (most recent call last):
File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 22, in <module>
process_page('/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream')
File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 10, in process_page
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:57884)
File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:166905)
File "src/lxml/xpath.pxi", line 230, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:165893)
File "src/lxml/extensions.pxi", line 623, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:160088)
File "src/lxml/extensions.pxi", line 657, in lxml.etree._createNodeSetResult (src/lxml/lxml.etree.c:160529)
File "src/lxml/extensions.pxi", line 678, in lxml.etree._unpackNodeSetEntry (src/lxml/lxml.etree.c:160740)
File "src/lxml/extensions.pxi", line 804, in lxml.etree._buildElementStringResult (src/lxml/lxml.etree.c:162214)
File "src/lxml/apihelpers.pxi", line 1417, in lxml.etree.funicode (src/lxml/lxml.etree.c:29944)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 615: invalid continuation byte
I know that there is an invalid character in the comments. How do I solve this?
Can you ask Requests to attempt to decode it for you? Use response.text (a string) rather than response.content (bytes).
The encoding of the source is probably something other than UTF-8, which your XPath library might be assuming. response.encoding is Requests' best guess at what it is. Sometimes web servers/pages aren't configured to say explicitly what encoding they're using, and then all you can do is guess.
It doesn't help that the encoding can be specified in an HTTP header and/or in a <meta> tag. Or websites can lie. Or they might mix encodings. Note that your target page doesn't even validate because its declared encoding is wrong, and beyond that it's rife with errors.
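A minimal sketch of that approach, reusing the code from the question with only the input changed from bytes to the decoded string:
import requests
from lxml import html

pg = requests.get('https://www.makeupalley.com/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream', timeout=30)
# pg.text is already decoded using Requests' detected encoding,
# so lxml never has to guess at the raw byte stream.
tr_pg = html.fromstring(pg.text)
for cm in tr_pg.xpath('//p[@class="break-word"]/text()'):
    print cm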
The page has badly encoded characters.
Ex:
Voil�! You will now have an airbrushed look.[...](� la Cover Girl!)
You can avoid them by manually decoding:
>>> pg.content.decode('utf8', errors='ignore')
u'Voil! You will now have an airbrushed look.[...]( la Cover Girl!)'
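If you go that route, the decoded (and now slightly lossy) string can be parsed directly; a sketch, reusing the variables from the question:
from lxml import html

# errors='ignore' silently drops the badly encoded bytes, so a few
# characters (the accent in "Voilà", for example) go missing.
cleaned = pg.content.decode('utf8', errors='ignore')
tr_pg = html.fromstring(cleaned)
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')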

Beautiful Soup decode error

I am working on a job where I need to parse a site with Beautiful Soup. The site is http://www.manta.com, but when I look for the encoding in the page's meta tags, nothing appears. I'm trying to parse the HTML locally, from a downloaded copy of the page, but I'm having trouble with some decoding errors:
# manta web page downloaded before
html = open('1.html', 'r')
soup = BeautifulSoup(html, 'lxml')
This produces the following stack trace:
Traceback (most recent call last):
File "E:/Projects/Python/webkit/sample.py", line 10, in <module>
soup = BeautifulSoup(html, 'lxml')
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__
self._feed()
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed
self.builder.feed(self.markup)
File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed
self.parser.close()
File "parser.pxi", line 1209, in
lxml.etree._FeedParser.close(src\lxm\lxml.etree.c:90717)
File "parsertarget.pxi", line 142, in
lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:100104)
File "parsertarget.pxi", line 130, in
lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99927)
File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored
(src\lxml\lxml.etree.c:9387)
File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml
\lxml.etree.c:96065)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 105-106: invalid data
I've tried passing an encoding to the Beautiful Soup constructor:
soup = BeautifulSoup(html, 'lxml', from_encoding="some encoding")
but I keep getting the same error.
The interesting thing is that if I load the page in my browser, change the encoding to UTF-8 (in Firefox, for example) and save it, everything works fine. Any help is greatly appreciated. Thank you.
Encode the string as UTF-8 (read the file contents first, so you pass a string rather than an open file object):
html = open('1.html', 'r').read()
soup = BeautifulSoup(html.encode('UTF-8'), 'lxml')
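If the saved file isn't valid UTF-8 in the first place (which the traceback suggests), an alternative sketch is to read the raw bytes and let BeautifulSoup detect the encoding, or name a likely one yourself; cp1252 below is only a guess, not something the page declares:
from bs4 import BeautifulSoup

# Read raw bytes so nothing gets decoded prematurely.
with open('1.html', 'rb') as f:
    raw = f.read()

# Let BeautifulSoup work out the encoding, or pass an explicit guess
# such as cp1252 (an assumption, not declared anywhere on the page).
soup = BeautifulSoup(raw, 'lxml', from_encoding='cp1252')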

Python rdflib Not Parsing Creative Commons License Information Correctly

I was using rdflib version 3.2.3 and everything was working fine. After upgrading to 4.0.1 I started getting the error:
RDFa parsing Error! 'ascii' codec can't decode byte 0xc3 in position 5454: ordinal not in range(128)
I tried various ways to make this work but so far have not succeeded. Below are my attempts.
In each case I:
from rdflib import Graph
First attempt:
>>> lg =Graph()
>>> len(lg.parse('http://creativecommons.org/licenses/by/3.0/'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/graph.py", line 1002, in parse
parser.parse(source, self, **args)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/structureddata.py", line 268, in parse
vocab_cache=vocab_cache)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/structureddata.py", line 148, in _process
_check_error(processor_graph)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/structureddata.py", line 57, in _check_error
raise Exception("RDFa parsing Error! %s" % msg)
Exception: RDFa parsing Error! 'ascii' codec can't decode byte 0xc3 in position 4801: ordinal not in range(128)
Second attempt:
>>> lg =Graph()
>>> len(lg.parse('http://creativecommons.org/licenses/by/3.0/rdf'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/graph.py", line 1002, in parse
parser.parse(source, self, **args)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 570, in parse
self._parser.parse(source)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 349, in end_element_ns
self._cont_handler.endElementNS(pair, None)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 160, in endElementNS
self.current.end(name, qname)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 461, in property_element_end
current.data, literalLang, current.datatype)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/term.py", line 541, in __new__
raise Exception("'%s' is not a valid language tag!"%lang)
Exception: 'i18n' is not a valid language tag!
Third attempt: gives no errors but also does not give any results
>>> lg =Graph()
>>> len(lg.parse('http://creativecommons.org/licenses/by/3.0/rdf', format='rdfa'))
0
So someone please tell me what I am doing wrong! :)
As Graham replied on the rdflib mailing list, there is an html5lib problem. We will pin it correctly for Python 2 in the next release, but for now just do:
pip install html5lib==0.95
The second problem is in the data from Creative Commons: "i18n" really isn't a valid language tag according to RFC 5646. I added the check, but in retrospect it seems too strict to raise an exception. I guess I'll change it to a warning.
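A quick way to confirm the pin works (a sketch; the triple count is simply whatever the license page serves at the time):
from rdflib import Graph

# With html5lib pinned to 0.95, the RDFa parser should get past the
# 'ascii' codec error and return a non-empty graph.
lg = Graph()
lg.parse('http://creativecommons.org/licenses/by/3.0/', format='rdfa')
print len(lg)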
