I am trying to parse arbitrary webpages with the requests and BeautifulSoup libraries with this code:
try:
    response = requests.get(url)
except Exception as error:
    return False

if response.encoding is None:
    soup = bs4.BeautifulSoup(response.text)  # This is line 155
else:
    soup = bs4.BeautifulSoup(response.text, from_encoding=response.encoding)
On most webpages this works fine. However, on some arbitrary pages (<1%) I get this crash:
Traceback (most recent call last):
File "/home/dotancohen/code/parser.py", line 155, in has_css
soup = bs4.BeautifulSoup(response.text)
File "/usr/lib/python3/dist-packages/requests/models.py", line 809, in text
content = str(self.content, encoding, errors='replace')
TypeError: str() argument 2 must be str, not None
For reference, this is the relevant method of the requests library:
@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None and chardet module is available, encoding
    will be guessed.
    """
    # Try charset from content-type
    content = None
    encoding = self.encoding

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        if chardet is not None:
            encoding = chardet.detect(self.content)['encoding']

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')  # This is line 809
    except LookupError:
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
As can be seen, I am not passing in an encoding when this error is thrown. How am I using the library incorrectly, and what can I do to prevent this error? This is on Python 3.2.3, but I get the same results with Python 2.
This means that the server did not send an encoding for the content in the headers, and the chardet library was also not able to determine an encoding for the contents. You in fact deliberately test for the lack of encoding; why try to get decoded text if no encoding is available?
You can try to leave the decoding up to the BeautifulSoup parser:
if response.encoding is None:
    soup = bs4.BeautifulSoup(response.content)
and there is no need to pass in the encoding to BeautifulSoup, since if .text does not fail, you are using Unicode and BeautifulSoup will ignore the encoding parameter anyway:
else:
    soup = bs4.BeautifulSoup(response.text)
Related
When I use urllib in Python3 to get the HTML code of a web page, I use this code:
def getHTML(url):
    request = Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    html = urlopen(request).read().decode('utf-8')
    print(html)
    return html
However, this fails every time with the error:
Traceback (most recent call last):
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('https://www.hltv.org/team/7900/spirit-academy')
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The page is in UTF-8 and I am decoding it properly according to the urllib docs. The page is not gzipped or in another charset from what I can tell.
url.info().get_charset() returns None for the page, however the meta tags specify UTF-8. I have no problems viewing the HTML in any program.
I do not want to use any external libraries.
Is there a solution? What is going on? This works fine with the following Python2 code:
def getHTML(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(url)
    html = response.read()
    return html
You don't need to decode('utf-8').
The following should return the fetched HTML:
def getHTML(url):
    request = Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    html = urlopen(request).read()
    return html
Found your error: the parsing was done just fine and everything was evaluated correctly. But read the traceback carefully:
Traceback (most recent call last):
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('https://www.hltv.org/team/7900/spirit-academy')
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The error was caused by the print statement; as you can see, print(html) is right there in the traceback.
This is a fairly common exception. It just tells you that, with your current system encoding, some of the text cannot be printed to the console. One simple solution is to use print(html.encode('ascii', 'ignore')) to drop all the unprintable characters. You can still do everything else with html; it is only printing it that fails.
See this if you want a better "fix": https://wiki.python.org/moin/PrintFails
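To make the workaround concrete, a tiny self-contained example (the sample string is made up); the trailing .decode('ascii') is my addition so the result prints as text rather than a bytes literal:

```python
# A string containing a character that ASCII cannot represent.
html = "team \u0161pirit academy"   # \u0161 is 'š'

# errors='ignore' drops every character the target charset cannot
# encode, so printing can no longer raise UnicodeEncodeError.
safe = html.encode('ascii', 'ignore').decode('ascii')
print(safe)  # -> team pirit academy
```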
By the way: the re module can search byte strings directly:
import re
print(re.findall(b'hello', b'hello world'))
I am working on a python web scraper to extract data from this webpage. It contains latin characters like ą, č, ę, ė, į, š, ų, ū, ž. I use BeautifulSoup to recognise the encoding:
def decode_html(html_string):
    converted = UnicodeDammit(html_string)
    print(converted.original_encoding)
    if not converted.unicode_markup:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.tried_encodings))
    return converted.unicode_markup
The encoding that it always seems to use is "windows-1252". However, this turns characters like ė into ë and ų into ø when printing to file or console. I use the lxml library to scrape the data. So I would think that it uses the wrong encoding, but what's odd is that if I use lxml.html.open_in_browser(decoded_html), all the characters are back to normal. How do I print the characters to a file without all the mojibake?
This is what I am using for output:
def write(filename, obj):
    with open(filename, "w", encoding="utf-8") as output:
        json.dump(obj, output, cls=CustomEncoder, ensure_ascii=False)
    return
From the HTTP headers set on the specific webpage you tried to load:
Content-Type: text/html; charset=windows-1257
so decoding as Windows-1252 will produce invalid results. BeautifulSoup made a guess (based on statistical models) and guessed wrong. As you noticed, decoding with cp1252 instead leads to incorrect codepoints:
>>> 'ė'.encode('cp1257').decode('cp1252')
'ë'
>>> 'ų'.encode('cp1257').decode('cp1252')
'ø'
CP1252 is the fallback for the base character-set detection implementation in BeautifulSoup. You can improve the success rate of BeautifulSoup's character-detection code by installing an external library; both chardet and cchardet are supported. These two libraries guess MacCyrillic and ISO-8859-13, respectively (both wrong, but cchardet got pretty close, perhaps close enough).
In this specific case, you can make use of the HTTP headers instead. In requests, I generally use:
import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
resp = requests.get(url)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)
The above only uses the encoding from the response if it was explicitly set by the server and no HTML <meta> header declares one. For text/* mime-types, HTTP specifies that the response should be considered to use Latin-1, which requests adheres to, but that default would be incorrect for most HTML data.
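As a quick check of the <meta>-declaration path used above (the HTML fragment is a made-up sample):

```python
from bs4.dammit import EncodingDetector

# Minimal HTML declaring its charset in a <meta> tag.
html = b'<html><head><meta charset="windows-1257"></head><body></body></html>'

# find_declared_encoding scans the start of the document for an XML
# or HTML encoding declaration, without decoding the body.
declared = EncodingDetector.find_declared_encoding(html, is_html=True)
print(declared)
```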
I send a GET request to the CareerBuilder API:
import requests
url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVELOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text
And get back an XML that looks like this. However, I have trouble parsing it.
Using either lxml
>>> from lxml import etree
>>> print etree.fromstring(xml)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
print etree.fromstring(xml)
File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.
or ElementTree:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print ET.fromstring(xml)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)
So, even though the XML file starts with
<?xml version="1.0" encoding="UTF-8"?>
I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?
You are using the decoded Unicode value. Use the raw response data r.raw instead:
r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)
which will read the data from the response directly; do note the stream=True option to .get().
Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content even if the response is gzip or deflate compressed.
You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:
r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)
XML parsers always expect bytes as input as the XML format itself dictates how the parser is to decode those bytes to Unicode text.
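The same point in miniature with the standard-library ElementTree (a made-up one-element document): given bytes, the parser applies the declared encoding itself.

```python
import xml.etree.ElementTree as ET

# A document declaring ISO-8859-1, containing a byte (0xe9) that is
# only correct under that encoding.
xml_bytes = b'<?xml version="1.0" encoding="iso-8859-1"?><root>caf\xe9</root>'

# Given bytes, the parser reads the declaration and decodes for us.
root = ET.fromstring(xml_bytes)
print(root.text)  # -> café
```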
Correction!
See below how I got it all wrong. Basically, when we use the .text attribute, the result is a decoded Unicode string. Using it raises the following exception in lxml:
ValueError: Unicode strings with encoding declaration are not
supported. Please use bytes input or XML fragments without
declaration.
Which basically means that @martijn-pieters was right: we must use the raw response as returned by .content.
Incorrect answer (but might be interesting to someone)
For whoever is interested. I believe the reason this error occurs is probably an invalid guess taken by requests as explained in Response.text documentation:
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on
HTTP headers, following RFC 2616 to the letter. If you can take
advantage of non-HTTP knowledge to make a better guess at the
encoding, you should set r.encoding appropriately before accessing
this property.
So, following this, one could also make sure requests' r.text decodes the response content correctly by explicitly setting the encoding with r.encoding = 'UTF-8'.
This approach adds another validation that the received response is indeed in the correct encoding prior to parsing it with lxml.
I understand the question has already got its answer, but I faced a similar issue on Python 3 while the same code worked fine on Python 2. My resolution was to encode the string to bytes first, xml = etree.fromstring(str_xml.encode()), and then do the parsing and extraction of tags and attributes.
I have the following code for urllib and BeautifulSoup:
getSite = urllib.urlopen(pageName)  # open current site
getSitesoup = BeautifulSoup(getSite.read())  # read the site content
print getSitesoup.originalEncoding
for value in getSitesoup.find_all('link'):  # extract all <link> tags
    defLinks.append(value.get('href'))
The result of it:
/usr/lib/python2.6/site-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
"Some characters could not be decoded, and were "
And when I try to read the site I get:
�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z
The page is in UTF-8, but the server is sending it to you in a compressed format:
>>> print getSite.headers['content-encoding']
gzip
You'll need to decompress the data before running it through Beautiful Soup. I got an error using zlib.decompress() on the data, but writing the data to a file and using gzip.open() to read from it worked fine--I'm not sure why.
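For what it's worth, the zlib failure has a likely explanation: gzip data carries an extra header that plain zlib.decompress() does not expect, but zlib can be told to expect one. A small in-memory sketch (the payload is made up):

```python
import gzip
import zlib

payload = "page body with caf\u00e9".encode("utf-8")
compressed = gzip.compress(payload)

# Plain zlib.decompress() chokes on the gzip header.
try:
    zlib.decompress(compressed)
except zlib.error:
    pass  # "incorrect header check"

# wbits=16+MAX_WBITS tells zlib to expect a gzip wrapper;
# gzip.decompress() does the same thing more directly.
assert zlib.decompress(compressed, 16 + zlib.MAX_WBITS) == payload
assert gzip.decompress(compressed) == payload
```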
BeautifulSoup works with Unicode internally; it'll try and decode non-unicode responses from UTF-8 by default.
It looks like the site you are trying to load is using a different encode; for example, it could be UTF-16 instead:
>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('utf-16-le')
뿯㞽뿯施뿯붿뿯붿⨰䤢럟뿯䞽뿯䢽뿯붿뿯붿붿뿯붿뿯붿뿯㦽붿뿯붿뿯붿뿯㮽뿯붿붿썙䊞붿뿯붿뿯붿뿯붿뿯붿铣㾶뿯㒽붿뿯붿붿뿯붿뿯붿坞뿯붿뿯붿뿯悽붿敋뿯붿붿뿯⪽붿✮兏붿뿯붿붿뿯䂽뿯붿뿯붿뿯嶽뿯붿뿯⢽붿뿯庽뿯붿붿붿㕓뿯붿뿯璽⩔뿯媽
It could be mac_cyrillic too:
>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('mac_cyrillic')
пњљ7пњљeпњљпњљпњљпњљ0*"IяЈпњљGпњљHпњљпњљпњљпњљFпњљпњљпњљпњљпњљпњљ9-пњљпњљпњљпњљпњљпњљ;пњљпњљEпњљY√ЮBsпњљпњљпњљпњљпњљпњљпњљпњљпњљгФґ?пњљ4iпњљпњљпњљ)пњљпњљпњљпњљпњљ^Wпњљпњљпњљпњљпњљ`wпњљKeпњљпњљ%пњљпњљ*9пњљ.'OQBпњљпњљпњљVпњљпњљ#пњљпњљпњљпњљпњљ]пњљпњљпњљ(Pпњљпњљ^пњљпњљqпњљ$пњљS5пњљпњљпњљtT*пњљZ
But I have way too little information about what kind of site you are trying to load nor can I read the output of either encoding. :-)
You'll need to decode the data read from getSite before passing it to BeautifulSoup:
getSite = urllib.urlopen(pageName).read().decode('utf-16')
Generally, the website will return what encoding was used in the headers, in the form of a Content-Type header (probably text/html; charset=utf-16 or similar).
I ran into the same problem, and as Leonard mentioned, it was due to a compressed format.
This link solved it for me: add ('Accept-Encoding', 'gzip,deflate') to the request headers. For example:
opener = urllib2.build_opener()
opener.addheaders = [('Referer', referer),
                     ('User-Agent', uagent),
                     ('Accept-Encoding', 'gzip,deflate')]
usock = opener.open(url)
url = usock.geturl()
data = decode(usock)
usock.close()
return data
Where the decode() function is defined by:
def decode(page):
    encoding = page.info().get("Content-Encoding")
    if encoding in ('gzip', 'x-gzip', 'deflate'):
        content = page.read()
        if encoding == 'deflate':
            data = StringIO.StringIO(zlib.decompress(content))
        else:
            data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
        page = data.read()
    return page
I'm currently going through the Python Challenge and I'm up to level 4, see here. I have only been learning Python for a few months, and I'm trying to learn Python 3 over 2.x. So far so good, except when I use this bit of code. Here's the Python 2.x version:
import urllib, re

prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.urlopen(prefix + nothing).read()
    print text
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print " going to", nothing
    else:
        break
So to convert this to 3, I would change to this:
import urllib.request, urllib.parse, urllib.error, re

prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.request.urlopen(prefix + nothing).read()
    print(text)
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print(" going to", nothing)
    else:
        break
If I run the 2.x version it works fine: it goes through the loop, scraping each URL until the end, and I get the following output:
and the next nothing is 72198
going to 72198
and the next nothing is 80992
going to 80992
and the next nothing is 8880
going to 8880 etc
If I run the 3.x version, I get the following output:
b'and the next nothing is 44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 26, in <module>
match = findnothing(b"text")
TypeError: can't use a string pattern on a bytes-like object
If I change the r to a b in this line:
findnothing = re.compile(b"nothing is (\d+)").search
I get:
b'and the next nothing is 44827'
going to b'44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 24, in <module>
text = urllib.request.urlopen(prefix + nothing).read()
TypeError: Can't convert 'bytes' object to str implicitly
Any ideas?
I'm pretty new to programming, so please don't bite my head off.
_bk201
You can't mix bytes and str objects implicitly.
The simplest thing would be to decode bytes returned by urlopen().read() and use str objects everywhere:
text = urllib.request.urlopen(prefix + nothing).read().decode()  # decode() defaults to utf-8
The page doesn't specify the preferable character encoding via a Content-Type header or <meta> element. I don't know what the default encoding should be for text/html, but RFC 2068 says:
When no explicit charset parameter is provided by the sender, media
subtypes of the "text" type are defined to have a default charset
value of "ISO-8859-1" when received via HTTP.
Regular expressions make sense only on text, not on binary data.
So, keep findnothing = re.compile(r"nothing is (\d+)").search, and convert text to string instead.
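Put together, the fixed loop body would look something like this (the sample bytes mimic the challenge's output format):

```python
import re

# Simulated bytes, as urlopen().read() would return them.
raw = b"and the next nothing is 44827"

# Decode first (ISO-8859-1 being the HTTP default for text/*
# responses without a declared charset), then use a str pattern.
text = raw.decode("iso-8859-1")
match = re.compile(r"nothing is (\d+)").search(text)
print(match.group(1))  # -> 44827
```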
Instead of urllib we're using requests, which has two relevant options (you may be able to find similar options in urllib):
Response object
import requests
>>> response = requests.get('https://api.github.com')
Using response.content gives you the raw bytes:
>>> response.content
b'{"current_user_url":"https://api.github.com/user","current_us...."}'
While using response.text gives you the decoded text (a str):
>>> response.text
'{"current_user_url":"https://api.github.com/user","current_us...."}'
Here the encoding is utf-8 (taken from the response headers), but you can set it yourself right after the request like so:
import requests
>>> response = requests.get('https://api.github.com')
>>> response.encoding = 'SOME_ENCODING'
And then response.text will hold the content decoded with the encoding you set.
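A quick illustration, constructing a Response by hand purely for demonstration (_content is an internal attribute, set here only to fake a network body):

```python
import requests

resp = requests.models.Response()
resp._content = "\u20ac10".encode("cp1252")  # the euro sign is byte 0x80 in cp1252

resp.encoding = "iso-8859-1"  # a wrong guess:
mojibake = resp.text          # 0x80 decodes to a control character

resp.encoding = "cp1252"      # set the right encoding before using .text
print(resp.text)  # -> €10
```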