I am using urlfetch to fetch a URL. When I pass the result to an html2text function (which strips off all HTML tags), I get the following message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position ... character maps to <undefined>
I've tried calling encode('UTF-8', 'ignore') on the string, but I keep getting this error.
Any ideas?
Thanks,
Joel
Some Code:
result = urlfetch.fetch(url="http://www.google.com")
html2text(result.content.encode('utf-8', 'ignore'))
And the error message:
File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>
You need to decode the data you fetched first! With which codec? That depends on the website you fetch.
When you already have unicode and encode it with some_unicode.encode('utf-8', 'ignore'), I can't imagine how it could throw an error. But result.content is a byte string, not unicode: calling .encode() on a byte string makes Python 2 implicitly decode it first with the default codec, which is usually where these mixed encode/decode errors come from.
OK, here is what you need to do:
result = fetch('http://google.com')
content_type = result.headers['Content-Type'] # figure out what you just fetched
ctype, charset = content_type.split(';')
encoding = charset[len(' charset='):] # get the encoding
print encoding # e.g. ISO-8859-1
utext = result.content.decode(encoding) # now you have unicode
text = utext.encode('utf8', 'ignore') # encode to utf8
This is not really robust but it should show you the way.
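For a sturdier version of the header parsing above, the standard library can do it for you. This is a sketch (Python 3 shown; the function name is my own) that uses email.message.Message, which knows how to parse Content-Type parameters:

```python
# Sketch: parse the charset out of a Content-Type header with the stdlib
# email parser instead of slicing the string by hand.
from email.message import Message

def charset_from_content_type(content_type, default='utf-8'):
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the charset parameter, lower-cased,
    # or None when the header carries no charset at all.
    return msg.get_content_charset() or default

print(charset_from_content_type('text/html; charset=ISO-8859-1'))  # iso-8859-1
print(charset_from_content_type('text/html'))                      # utf-8
```

Unlike the manual split above, this handles missing charsets, extra parameters, and odd whitespace.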
Related
I am trying to get and parse a webpage that contains non-ASCII characters (the URL is http://www.one.co.il). This is what I have:
url = "http://www.one.co.il"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
encoding = response.headers.getparam('charset') # windows-1255
html = response.read() # The length of this is valid - about 31000-32000,
# but printing the first characters shows garbage -
# '\x1f\x8b\x08\x00\x00\x00\x00\x00', instead of
# '<!DOCTYPE'
html_decoded = html.decode(encoding)
The last line gives me an exception:
File "C:/Users/....\WebGetter.py", line 16, in get_page
html_decoded = html.decode(encoding)
File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0xdb in position 14: character maps to <undefined>
I tried looking at other related questions such as urllib2 read to Unicode and How to handle response encoding from urllib.request.urlopen(), but didn't find anything helpful about this.
Can someone please shed some light and guide me in this subject? Thanks!
0x1f 0x8b 0x08 is the magic number for a gzipped file. You will need to decompress it before you can use the contents.
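To illustrate (a minimal Python 3 sketch, not the asker's code): gzip bodies start with exactly those bytes, and the stdlib gzip module undoes the compression before you decode with the declared charset:

```python
# Round-trip sketch: a gzip-compressed page starts with the 0x1f 0x8b 0x08
# magic number; decompress first, then decode with the declared charset.
import gzip

page = '<!DOCTYPE html>'.encode('windows-1255')
compressed = gzip.compress(page)

print(compressed[:3])  # b'\x1f\x8b\x08' -- the bytes from the question
print(gzip.decompress(compressed).decode('windows-1255'))  # <!DOCTYPE html>
```

Alternatively, if the server honors it, sending an 'Accept-Encoding: identity' request header asks for an uncompressed response in the first place.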
This question already has answers here:
UnicodeEncodeError: 'charmap' codec can't encode characters
(11 answers)
Closed 5 months ago.
I am trying to make a crawler in Python by following a Udacity course. I have this method get_page() which returns the content of the page.
def get_page(url):
'''
Open the given url and return the content of the page.
'''
data = urlopen(url)
html = data.read()
return html.decode('utf8')
The original method just returned data.read(), but that way I could not do operations like str.find(). After a quick search I found out I need to decode the data. But now I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position
1: invalid start byte
I have found similar questions in SO but none of them were specifically for this. Please help.
You are trying to decode an invalid string.
The first byte of a valid UTF-8 sequence must be either an ASCII byte (0x00 to 0x7F) or a multi-byte lead byte (0xC2 to 0xF4); a byte in the continuation range (0x80 to 0xBF) can never come first.
So 0x8B is definitely invalid.
From RFC3629 Section 3:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.
You should post the string you are trying to decode.
Maybe the page is encoded with a character encoding other than 'utf-8', so the start byte is invalid.
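The point about lead bytes can be demonstrated in two lines (Python 3 shown):

```python
# 0x8B lies in the continuation-byte range (0x80-0xBF), which may only
# appear after a multi-byte lead byte, never at the start of a sequence.
try:
    b'\x8b'.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid start byte
```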
You could do this (note that the body must be read once and kept: calling response.read() a second time returns an empty byte string, so the original try/except that read twice would silently return nothing):
import urllib.request

def get_page(self, url):
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        print("Http code:", response.getcode())
        return None
    body = response.read()  # read once; the stream is consumed afterwards
    try:
        return body.decode('utf-8')
    except UnicodeDecodeError:
        return body
Web servers often serve HTML pages with a Content-Type header that includes the encoding used to encode the page. The header might look like this:
Content-Type: text/html; charset=UTF-8
We can inspect the content of this header to find the encoding to use to decode the page:
from urllib.request import urlopen
def get_page(url):
""" Open the given url and return the content of the page."""
data = urlopen(url)
content_type = data.headers.get('content-type', '')
print(f'{content_type=}')
encoding = 'latin-1'
if 'charset' in content_type:
_, _, encoding = content_type.rpartition('=')
print(f'{encoding=}')
html = data.read()
return html.decode(encoding)
Using requests is similar:
response = requests.get(url)
content_type = response.headers.get('content-type', '')
Latin-1 (or ISO-8859-1) is a safe default: it will always decode any bytes (though the result may not be useful).
If the server doesn't serve a content-type header you can try looking for a <meta> tag that specifies the encoding in the HTML. Or pass the response bytes to Beautiful Soup and let it try to guess the encoding.
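For the <meta> fallback, here is a rough sketch (my own illustration, not part of the answer above) that scans the start of the document for a declared charset:

```python
# Sketch: look for a charset declaration (<meta charset="..."> or the
# http-equiv variant) near the top of the page; fall back to latin-1,
# which can decode any byte sequence.
import re

def charset_from_meta(html_bytes, default='latin-1'):
    # Decode permissively just to run the regex; latin-1 never fails.
    head = html_bytes[:2048].decode('latin-1')
    m = re.search(r'charset\s*=\s*["\']?([\w.-]+)', head, re.IGNORECASE)
    return m.group(1) if m else default

print(charset_from_meta(b'<meta charset="windows-1255">'))  # windows-1255
print(charset_from_meta(b'<p>no declaration</p>'))          # latin-1
```

Beautiful Soup's UnicodeDammit does this (and more) properly, so prefer it for production code.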
I'm using python 2.7 and I have some problems converting chars like "ä" to "ae".
I'm retrieving the content of a webpage using:
req = urllib2.Request(url + str(questionID))
response = urllib2.urlopen(req)
data = response.read()
After that I'm doing some extraction stuff and there is my problem.
extractedStr = pageContent[start:end]  # this string contains the "ä"!
extractedStr = extractedStr.decode("utf8")  # here I get the error; tried it with encode as well
extractedStr = extractedStr.replace(u"ä", "ae")
--> 'utf8' codec can't decode byte 0xe4 in position 13: invalid continuation byte
But: my simple trial is working fine...:
someStr = "geräusch"
someStr = someStr.decode("utf8")
someStr = someStr.replace(u"ä", "ae")
I've got the feeling it has something to do with WHEN I try to use the .decode() function... I tried it at several positions, with no success :(
Use .decode("latin-1") instead. That is the encoding your data is actually in, so that is what you should be decoding.
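A quick illustration of why latin-1 succeeds where utf8 fails on this data (Python 3 bytes shown, my own example):

```python
# 0xE4 is "ä" in Latin-1. In UTF-8 that byte would announce a three-byte
# sequence, so the ASCII "u" that follows triggers the decode error.
raw = b'ger\xe4usch'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid continuation byte

print(raw.decode('latin-1').replace('\xe4', 'ae'))  # geraeusch
```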
I am web-scraping with Python using BeautifulSoup
I am getting this error
'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>
when scraping a webpage
This is my Python
hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried: print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')
We usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first, as in
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xae'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
While:
html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®
Succeeds without error. Do note that "windows-1252" is something I used as an example. I got it from chardet, which had 0.5 confidence that it was right (well, given a 1-character string, what do you expect?). You should change that to the encoding of the byte string returned from .urlopen().read(), i.e. whatever applies to the content you retrieved.
I have been working on a program to retrieve questions from Stack Overflow. Till yesterday the program was working fine, but since today I'm getting the error
"Message File Name Line Position
Traceback
<module> C:\Users\DPT\Desktop\questions.py 13
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)"
Currently, the questions are being displayed, but I seem to be unable to copy the output to a new text file.
import sys
sys.path.append('.')
import re
import time
import stackexchange

so = stackexchange.Site(stackexchange.StackOverflow)
term = raw_input("Enter the keyword for Stack Exchange")
print 'Searching for %s...' % term,
sys.stdout.flush()

qs = so.search(intitle=term)
print '\r--- questions with "%s" in title ---' % (term)
for q in qs:
    print '%8d %s' % (q.id, q.title)
    with open('E:\questi.txt', 'a+') as question:
        question.write(q.title)
    time.sleep(10)

with open('E:\questi.txt') as intxt:
    data = intxt.read()
regular = re.findall('[aA-zZ]+', data)
print(regular)
tokens = set(regular)

with open('D:\Dictionary.txt', 'r') as keywords:
    keyset = set(keywords.read().split())
with open('D:\Questionmatches.txt', 'w') as matches:
    for word in keyset:
        if word in tokens:
            matches.write(word + '\n')
q.title is a Unicode string. When writing it to a file, you need to encode it first, preferably with a fully Unicode-capable encoding such as UTF-8. (If you don't, Python defaults to the ASCII codec, which doesn't support any code point above 127.)
question.write(q.title.encode("utf-8"))
should fix the problem.
By the way, the program tripped up on the character “ (U+201C, left double quotation mark).
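An alternative to encoding at every write (a sketch of my own, with a hypothetical filename): open the file with io.open and an explicit encoding, so unicode strings are encoded transparently. io.open works the same in Python 2.6+ and Python 3:

```python
# Sketch: let the file object do the UTF-8 encoding on every write,
# instead of calling .encode("utf-8") on each string yourself.
import io

with io.open('questi.txt', 'w', encoding='utf-8') as question:
    question.write(u'\u201cSome title with curly quotes\u201d')

with io.open('questi.txt', encoding='utf-8') as intxt:
    print(intxt.read())
```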
I ran into this as well, using the Transifex API:
response['source_string']
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
Fixed with response['source_string'].encode("utf-8"). Full example:
import requests
username = "api"
password = "PASSWORD"
AUTH = (username, password)
url = 'https://www.transifex.com/api/2/project/project-site/resource/name-of-resource/translation/en/strings/?details'
response = requests.get(url, auth=AUTH).json()
print response['key'], response['context']
print response['source_string'].encode("utf-8")