'charmap' codec can't encode character '\xae' While Scraping a Webpage - python

I am web-scraping with Python using BeautifulSoup.
I get this error when scraping a webpage:
'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>
This is my Python code:
hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried: print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')

This error usually appears when you call .encode() on a byte string that is already encoded. Try decoding it first, as in:
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xae'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
While:
html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®
This succeeds without error. Note that "windows-1252" is only an example: chardet suggested it with a confidence of 0.5 (not surprising for a one-character string). Replace it with the actual encoding of the byte string returned by .urlopen().read() for the content you retrieved.
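The snippets above are Python 2. As a minimal sketch of the same decode-then-encode step in Python 3, where byte strings and text are separate types, something like the following should work. The helper name reencode_to_utf8 is hypothetical, and windows-1252 is again only an assumed source encoding.

```python
def reencode_to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a known source encoding, then encode them as UTF-8."""
    text = raw_bytes.decode(source_encoding)  # bytes -> text
    return text.encode("utf-8")               # text -> UTF-8 bytes


# b"\xae" is the registered sign in windows-1252 (assumed encoding).
utf8_bytes = reencode_to_utf8(b"\xae", "windows-1252")
print(utf8_bytes)
```

Calling .encode() directly on a bytes object is not possible in Python 3, so the implicit ASCII decode that causes the failure above cannot happen silently there.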

Related

PDFQuery - 'ascii' codec can't encode character u'\u2013'

I am using PDFQuery to extract data from PDF. It was working fine for most of the PDFs.
Recently for few PDFs I am getting the following errors on a couple of pages:
'ascii' codec can't encode character u'\u2019' in position 91: ordinal not in range(128)
'ascii' codec can't encode character u'\u2013' in position 29: ordinal not in range(128)
My code looks like this:
pdf = pdfquery.PDFQuery(pdf_file)
pages_in_pdf = pdf.doc.catalog['Pages'].resolve()['Count']
for i in range(0, pages_in_pdf):
    try:
        pdf.load(i)
        # logic
    except ValueError as e:
        print('Error on page number {0}. Error message is {1}'.format(i, e))

How to get a webpage with unicode chars in python

I am trying to get and parse a webpage that contains non-ASCII characters (the URL is http://www.one.co.il). This is what I have:
url = "http://www.one.co.il"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
encoding = response.headers.getparam('charset') # windows-1255
html = response.read() # The length of this is valid - about 31000-32000,
# but printing the first characters shows garbage -
# '\x1f\x8b\x08\x00\x00\x00\x00\x00', instead of
# '<!DOCTYPE'
html_decoded = html.decode(encoding)
The last line gives me an exception:
File "C:/Users/....\WebGetter.py", line 16, in get_page
html_decoded = html.decode(encoding)
File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0xdb in position 14: character maps to <undefined>
I tried looking at other related questions such as urllib2 read to Unicode and How to handle response encoding from urllib.request.urlopen() , but didn't find anything helpful about this.
Can someone please shed some light and guide me in this subject? Thanks!
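The bytes 0x1f 0x8b are the magic number of a gzipped file (the following 0x08 indicates deflate compression), so the response body is gzip-compressed. You will need to decompress it before you can decode the contents.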
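A minimal sketch of that decompression step, assuming Python 3.2 or later (where gzip.compress and gzip.decompress are available). The helper name maybe_gunzip is hypothetical, and the compressed bytes below merely stand in for what response.read() returned in the question.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body):
    """Return the decompressed body if it looks gzip-compressed, else the body unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


# Stand-in for response.read(): a gzip-compressed HTML fragment.
compressed = gzip.compress("<!DOCTYPE html>".encode("windows-1255"))
html = maybe_gunzip(compressed)
print(html.decode("windows-1255"))
```

Only after this step does decoding with the charset from the response headers make sense, because the charset describes the uncompressed text.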

python string encoding unicode

I'm using python 2.7 and I have some problems converting chars like "ä" to "ae".
I'm retrieving the content of a webpage using:
req = urllib2.Request(url + str(questionID))
response = urllib2.urlopen(req)
data = response.read()
After that I'm doing some extraction, and that is where my problem is.
extractedStr = pageContent[start:end] # this string contains the "ä" !
extractedStr = extractedStr.decode("utf8") # here I get the error, tried it with encode as well
extractedStr = extractedStr.replace(u"ä", "ae")
--> 'utf8' codec can't decode byte 0xe4 in position 13: invalid continuation byte
But: my simple trial is working fine...:
someStr = "geräusch"
someStr = someStr.decode("utf8")
someStr = someStr.replace(u"ä", "ae")
I've got the feeling, it has something to do with WHEN I try to use the .decode() function... I tried it at several positions, no success :(
Use .decode("latin-1") instead. The byte 0xe4 is "ä" in Latin-1, so that is the encoding of the data you are trying to decode, not UTF-8.
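To illustrate the difference, here is a small sketch in Python 3 syntax. The byte string is a made-up stand-in for the extracted page content, assuming the page really is Latin-1 encoded.

```python
raw = b"ger\xe4usch"  # "geräusch" encoded as Latin-1; 0xe4 is "ä"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xe4 starts a multi-byte UTF-8 sequence, but the next byte does not continue it.
    print("utf-8 failed:", exc.reason)

text = raw.decode("latin-1")
replaced = text.replace("\u00e4", "ae")
print(replaced)
```

The simple trial in the question most likely worked because the literal in the source file was itself stored as UTF-8, unlike the bytes coming from the web page.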

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

I have been working on a program to retrieve questions from Stack Overflow. Till yesterday the program was working fine, but since today I'm getting the error
"Message File Name Line Position
Traceback
<module> C:\Users\DPT\Desktop\questions.py 13
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)"
Currently, the questions are being displayed, but I seem to be unable to copy the output to a new text file.
import re
import sys
import time
sys.path.append('.')
import stackexchange
so = stackexchange.Site(stackexchange.StackOverflow)
term = raw_input("Enter the keyword for Stack Exchange")
print 'Searching for %s...' % term,
sys.stdout.flush()
qs = so.search(intitle=term)
print '\r--- questions with "%s" in title ---' % (term)
for q in qs:
    print '%8d %s' % (q.id, q.title)
    with open('E:\questi.txt', 'a+') as question:
        question.write(q.title)
time.sleep(10)
with open('E:\questi.txt') as intxt:
    data = intxt.read()
regular = re.findall('[aA-zZ]+', data)
print(regular)
tokens = set(regular)
with open('D:\Dictionary.txt', 'r') as keywords:
    keyset = set(keywords.read().split())
with open('D:\Questionmatches.txt', 'w') as matches:
    for word in keyset:
        if word in tokens:
            matches.write(word + '\n')
q.title is a Unicode string. When writing it to a file, you need to encode it first, preferably with a fully Unicode-capable encoding such as UTF-8. If you don't, Python defaults to the ASCII codec, which doesn't support any code point above 127.
question.write(q.title.encode("utf-8"))
should fix the problem.
By the way, the program tripped up on character “ (U+201C).
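An alternative to encoding each string by hand is to open the file with an explicit encoding, so the file object does the encoding on every write. The sketch below is written for Python 3; the title and the temporary path are made up for illustration and do not come from the question.

```python
import io
import os
import tempfile

# Hypothetical title containing U+201C and U+201D (curly quotes).
title = "\u201cQuoted\u201d question title"

path = os.path.join(tempfile.mkdtemp(), "questi.txt")

# Passing encoding= makes the file object encode text as UTF-8 on write.
with io.open(path, "a", encoding="utf-8") as question:
    question.write(title + "\n")

# Reading with the same encoding returns the original text.
with io.open(path, encoding="utf-8") as intxt:
    data = intxt.read()
```

io.open also exists in Python 2.6 and later, where it expects unicode objects rather than byte strings.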
I ran into this as well when using the Transifex API:
response['source_string']
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
Fixed with response['source_string'].encode("utf-8")
import requests
username = "api"
password = "PASSWORD"
AUTH = (username, password)
url = 'https://www.transifex.com/api/2/project/project-site/resource/name-of-resource/translation/en/strings/?details'
response = requests.get(url, auth=AUTH).json()
print response['key'], response['context']
print response['source_string'].encode("utf-8")

UnicodeEncodeError when fetching URLs

I am using urlfetch to fetch a URL. When I try to send it to html2text function (strips off all HTML tags), I get the following message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position ... character maps to <undefined>
I've been trying to process encode('UTF-8','ignore') on the string but I keep getting this error.
Any ideas?
Thanks,
Joel
Some Code:
result = urlfetch.fetch(url="http://www.google.com")
html2text(result.content.encode('utf-8', 'ignore'))
And the error message:
File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>
You need to decode the data you fetched first! With which codec? That depends on the website you fetch.
Once you have a unicode object, I can't imagine how encoding it with some_unicode.encode('utf-8', 'ignore') could throw an error.
OK, here is what you need to do:
result = fetch('http://google.com')
content_type = result.headers['Content-Type'] # figure out what you just fetched
ctype, charset = content_type.split(';')
encoding = charset[len(' charset='):] # get the encoding
print encoding # ie ISO-8859-1
utext = result.content.decode(encoding) # now you have unicode
text = utext.encode('utf8', 'ignore') # encode to UTF-8
This is not really robust but it should show you the way.
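One way to make the header parsing less fragile is to let the standard library parse the Content-Type value instead of splitting on ';' by hand. This is only a sketch for Python 3: the helper name charset_from_content_type and the UTF-8 fallback are assumptions, not part of the answer above.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` (an assumed choice) when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


# Header value as in the example above; the bytes stand in for result.content.
encoding = charset_from_content_type("text/html; charset=ISO-8859-1")
utext = b"caf\xe9".decode(encoding)       # now you have text
text = utext.encode("utf-8", "ignore")    # encode to UTF-8
```

This handles headers without a charset parameter and headers with extra parameters, both of which would break the manual split.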
