PDFQuery - 'ascii' codec can't encode character u'\u2013' - python

I am using PDFQuery to extract data from PDFs. It works fine for most of them.
Recently, for a few PDFs, I have been getting the following errors on a couple of pages:
'ascii' codec can't encode character u'\u2019' in position 91: ordinal not in range(128)
'ascii' codec can't encode character u'\u2013' in position 29: ordinal not in range(128)
My code looks like this:
pdf = pdfquery.PDFQuery(pdf_file)
pages_in_pdf = pdf.doc.catalog['Pages'].resolve()['Count']
for i in range(0, pages_in_pdf):
    try:
        pdf.load(i)
        # logic
    except ValueError as e:
        print('Error on page number {0}. Error message is {1}'.format(i, e))

Related

UnicodeEncodeError: 'charmap' codec can't encode character '\u263a' in position 124: character maps to <undefined>

I understand the error is related to character encoding, but I am not sure how to fix it.
Error details:
UnicodeEncodeError: 'charmap' codec can't encode character '\u263a' in position 124: character maps to <undefined>
Here is where the error happens:
csv_writer.writerow(data_tmp_dict)
Try encoding it to UTF-8:
# -*- coding: utf-8 -*-
data_tmp_dict = {'key': 'value'.encode("utf-8")}
# or
data_tmp_dict = {'key': 'value'.encode("ascii")}
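If this is Python 3 (the error text suggests so, but that is an assumption), the 'charmap' codec usually comes from the output file rather than from the dictionary values, because open() falls back to the platform's default encoding. Here is a minimal sketch that opens the file as UTF-8 instead. The names write_rows and demo_path and the field name key are made up for illustration, and it assumes csv_writer is a csv.DictWriter:

```python
import csv
import os
import tempfile


def write_rows(path, fieldnames, rows):
    """Write dict rows to a CSV file that is explicitly opened as UTF-8."""
    # newline="" is what the csv module expects for files it writes to.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical demo data containing the character from the error message.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_rows(demo_path, ["key"], [{"key": "smile \u263a"}])

with open(demo_path, encoding="utf-8") as handle:
    written = handle.read()
```

With the encoding set on the file, the values can stay as ordinary strings and do not need to be encoded by hand.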

'charmap' codec can't encode character '\xae' While Scraping a Webpage

I am web scraping with Python using BeautifulSoup.
I am getting this error
'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>
when scraping a webpage
This is my Python code:
hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried: print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')
This problem usually appears when you call .encode() on a byte string that is already encoded. Try decoding it first, as in:
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xae'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
While:
html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®
Succeeds without error. Note that "windows-1252" is only an example: I got it from chardet, which reported a confidence of just 0.5 (unsurprising for a one-character string). Replace it with the actual encoding of the byte string returned by .urlopen().read() for the content you retrieved.
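For readers on Python 3, where byte strings and text are separate types, the same decode-then-encode round trip can be sketched as below. The helper name reencode_to_utf8 is hypothetical, and windows-1252 is again only an example encoding:

```python
def reencode_to_utf8(raw_bytes, source_encoding):
    """Decode bytes using their real encoding, then encode the text as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    return text, text.encode("utf-8")


# b"\xae" is the registered-trademark sign in windows-1252.
text, utf8_bytes = reencode_to_utf8(b"\xae", "windows-1252")
print(text)
print(utf8_bytes)
```

The decode step is the one that needs the correct source encoding; once the data is text, encoding it to UTF-8 cannot fail for ordinary characters.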

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

I have been working on a program to retrieve questions from Stack Overflow. Till yesterday the program was working fine, but since today I'm getting the error
"Message File Name Line Position
Traceback
<module> C:\Users\DPT\Desktop\questions.py 13
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)"
Currently, the questions are being displayed, but I seem to be unable to copy the output to a new text file.
import sys
sys.path.append('.')
import stackexchange
so = stackexchange.Site(stackexchange.StackOverflow)
term= raw_input("Enter the keyword for Stack Exchange")
print 'Searching for %s...' % term,
sys.stdout.flush()
qs = so.search(intitle=term)
print '\r--- questions with "%s" in title ---' % (term)
for q in qs:
    print '%8d %s' % (q.id, q.title)
    with open('E:\questi.txt', 'a+') as question:
        question.write(q.title)
time.sleep(10)
with open('E:\questi.txt') as intxt:
    data = intxt.read()
regular = re.findall('[aA-zZ]+', data)
print(regular)
tokens = set(regular)
with open('D:\Dictionary.txt', 'r') as keywords:
    keyset = set(keywords.read().split())
with open('D:\Questionmatches.txt', 'w') as matches:
    for word in keyset:
        if word in tokens:
            matches.write(word + '\n')
q.title is a Unicode string. When writing it to a file, you need to encode it first, preferably with a fully Unicode-capable encoding such as UTF-8 (if you don't, Python defaults to the ASCII codec, which doesn't support any code point above 127).
question.write(q.title.encode("utf-8"))
should fix the problem.
By the way, the program tripped up on character “ (U+201C).
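An alternative to encoding every string by hand is to let the file object do it. A minimal sketch using io.open, which accepts an encoding argument on both Python 2 and Python 3; append_title, demo_path and the sample title are made up for illustration:

```python
import io
import os
import tempfile


def append_title(path, title):
    """Append one Unicode title per line; the file object encodes to UTF-8."""
    with io.open(path, "a", encoding="utf-8") as handle:
        handle.write(title + u"\n")


# Hypothetical title containing the curly quotes that tripped the ASCII codec.
demo_path = os.path.join(tempfile.mkdtemp(), "questions.txt")
append_title(demo_path, u"\u201cquoted\u201d title")

with io.open(demo_path, encoding="utf-8") as handle:
    stored = handle.read()
```

The file must be read back with the same encoding, as the last two lines do.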
I ran into this as well using Transifex API
response['source_string']
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
Fixed with response['source_string'].encode("utf-8")
import requests
username = "api"
password = "PASSWORD"
AUTH = (username, password)
url = 'https://www.transifex.com/api/2/project/project-site/resource/name-of-resource/translation/en/strings/?details'
response = requests.get(url, auth=AUTH).json()
print response['key'], response['context']
print response['source_string'].encode("utf-8")

Python 2.7 , issue with decode('utf-8')

I have:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib2 import urlopen
page2 = urlopen('http://pogoda.yandex.ru/moscow/').read().decode('utf-8')
page = urlopen('http://yasko.by/').read().decode('utf-8')
The "page = ..." line raises "UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 32: invalid continuation byte", but the "page2 = ..." line does not. Why?
The Cyrillic characters on yasko.by start at position 32. How do I decode them correctly?
Thanks!
The content of http://yasko.by/ is encoded with windows-1251, while the content of http://pogoda.yandex.ru/moscow/ is encoded with utf-8.
The "page = ..." line should become:
page = urlopen('http://yasko.by/').read().decode('windows-1251')
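When the encoding of a page is not known in advance, one rough approach is to try a short list of candidates in order. The sketch below is written for Python 3, and the function name and the candidate list are assumptions. Note that windows-1251 accepts almost any byte sequence, so it only works as a last resort and can silently produce wrong text; reading the charset from the HTTP headers or the page's meta tag is more reliable.

```python
def decode_with_fallback(raw, encodings=("utf-8", "windows-1251")):
    """Return (text, encoding_used) for the first candidate that decodes."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# The same Cyrillic word, encoded two different ways.
sample = "\u041f\u0440\u0438\u0432\u0435\u0442"
as_cp1251 = sample.encode("windows-1251")
as_utf8 = sample.encode("utf-8")

print(decode_with_fallback(as_cp1251))
print(decode_with_fallback(as_utf8))
```

The windows-1251 bytes are rejected by the UTF-8 decoder with the same kind of error as in the question, so the loop falls through to the second candidate.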

UnicodeEncodeError when fetching URLs

I am using urlfetch to fetch a URL. When I try to send it to html2text function (strips off all HTML tags), I get the following message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position ... character maps to <undefined>
I've been trying to process encode('UTF-8','ignore') on the string but I keep getting this error.
Any ideas?
Thanks,
Joel
Some Code:
result = urlfetch.fetch(url="http://www.google.com")
html2text(result.content.encode('utf-8', 'ignore'))
And the error message:
File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>
You need to decode the data you fetched first. Which codec to use depends on the website you fetch.
Once you have a unicode string, I can't imagine how some_unicode.encode('utf-8', 'ignore') could throw an error.
Here is what you need to do:
result = fetch('http://google.com')
content_type = result.headers['Content-Type'] # figure out what you just fetched
ctype, charset = content_type.split(';')
encoding = charset[len(' charset='):] # get the encoding
print encoding # ie ISO-8859-1
utext = result.content.decode(encoding) # now you have unicode
text = utext.encode('utf8', 'ignore') # encode to utf8
This is not really robust but it should show you the way.
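A slightly more defensive version of the header parsing might look like the sketch below (Python 3). The function names and the UTF-8 default are assumptions, not part of urlfetch. It copes with a missing charset, extra parameters and quoted values, but it still trusts whatever the server claims.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Pull the charset parameter out of a Content-Type header value."""
    # Skip the media type itself and look at each ";"-separated parameter.
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip("\"' ")
    return default


def decode_body(body, content_type):
    """Decode fetched bytes using the charset named in the header."""
    encoding = charset_from_content_type(content_type)
    # errors="replace" keeps going if the header lied about the encoding.
    return body.decode(encoding, errors="replace")


page_text = decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1")
print(page_text)
```

In the snippet above, body would be result.content and content_type would be result.headers['Content-Type'].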
