UnicodeEncodeError: 'ascii' codec can't encode character in print function - python

My company uses a database, and I am writing a script that interacts with it. There is already a script for submitting a query to the database, which returns the query's results.
I am working in a Unix environment, and I use that script within my own script to get some data from the database, redirecting the query result to a file. When I try to read this file, I get an error saying:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 9741: ordinal not in range(128)
I know that Python cannot read the file because of the file's encoding: the encoding is not ASCII, which is why the error occurs. I tried checking the encoding of the file and then reading the file with its own encoding.
The code that I am using is:
import os
import csv
import codecs
import chardet
from collections import Counter

os.system("Query.pl \"select title from bug where (ste='KGF-A' AND ( status = 'Not_Approved')) \">patchlet.txt")
encoding_dict3 = chardet.detect(open("patchlet.txt", "rb").read())
print(encoding_dict3)
patchlets_in_latest = []
# Open the patchlet.txt file for storing the last part of titles for latest ACF in a list
with codecs.open("patchlet.txt", encoding='{}'.format(encoding_dict3['encoding'])) as csvFile:
    readCSV = csv.reader(csvFile, delimiter=":")
    for row in readCSV:
        if len(row) != 0:
            if len(row) > 1:
                j = len(row) - 1
                patchlets_in_latest.append(row[j])
            elif len(row) == 1:
                patchlets_in_latest.append(row[0])
patchlets_in_latest_list = []
# calling the strip_list_noempty function for removing newline and whitespace characters
patchlets_in_latest_list = strip_list_noempty(patchlets_in_latest)
# converting list of titles into a set to remove any duplicate entry if present
patchlets_in_latest_set = set(patchlets_in_latest_list)
# Finding duplicate entries in list
duplicates_in_latest = [k for k, v in Counter(patchlets_in_latest_list).items() if v > 1]
# Printing imp info for logs
print("list of titles of patchlets in latest list are : ")
for i in patchlets_in_latest_list:
    print(str(i))   # <-- this is the line that raises the error
print("No of patchlets in latest list are : {}".format(str(len(patchlets_in_latest_list))))
Here, Query.pl is the Perl script written to fetch the query results from the database. The encoding that I am getting for "patchlet.txt" (the file used for storing the result from HSD) is:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
Even when I provide this same encoding for reading the file, I still get the error.
Please help me resolve this error.
EDIT:
I am using python3.6
EDIT2:
The error occurs while outputting the result, and there is one line in the file that contains an unknown character. The line looks like:
Some failure because of which vtrace cannot be used along with some trace.
I am using gvim, and in gvim the "vtrace" looks like "~Vvtrace". I then checked the database manually for this character; the character is "–", which according to my keyboard is neither a hyphen nor an underscore. These kinds of characters are creating the problem.
Also, I am working in a Linux environment.
EDIT 3:
I have added more code that can help in tracing the error. I have also marked the print statement (print(str(i))) where I am getting the error.

Problem
Based on the information in the question, the program is processing non-ASCII input data, but is unable to output non-ASCII data.
Specifically, this code:
for i in patchlets_in_latest_list:
    print(str(i))
Results in this exception:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013'
This behaviour was common in Python2, where calling str on a unicode object would cause Python to try to encode the object as ASCII, resulting in a UnicodeEncodeError if the object contained non-ASCII characters.
In Python3, calling str on a str instance doesn't trigger any encoding. However, calling the print function on a str will encode the str to sys.stdout.encoding. sys.stdout.encoding defaults to the value returned by locale.getpreferredencoding, which is generally derived from your Linux user's LANG environment variable.
Solution
If we assume that your program is not overriding normal encoding behaviour, the problem should be fixed by ensuring that the code is being executed by a Python3 interpreter in a UTF-8 locale.
- Be 100% certain that the code is being executed by a Python3 interpreter: print sys.version_info from within the program (see the diagnostic sketch after this list).
- Try setting the PYTHONIOENCODING environment variable when running your script: PYTHONIOENCODING=UTF-8 python3 myscript.py
- Check your locale using the locale command in the terminal (or echo $LANG). If it doesn't end in UTF-8, consider changing it. Consult your system administrators if you are on a corporate machine.
- If your code runs in a cron job, bear in mind that cron jobs often run with the 'C' or 'POSIX' locale - which could be using ASCII encoding - unless a locale is explicitly set. Likewise, if the script is run under a different user, check their locale settings.
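A minimal diagnostic sketch covering the checks above (illustrative, not part of the original program):
import locale
import os
import sys

print('Python version :', sys.version_info)        # should be major version 3
print('stdout encoding:', sys.stdout.encoding)     # what print() encodes to
print('preferred enc. :', locale.getpreferredencoding())
print('LANG           :', os.environ.get('LANG'))  # usually drives the above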
Workaround
If changing the environment is not feasible, you can workaround the problem in Python by encoding to ASCII with an error handler, then decoding back to str.
There are four useful error handlers in your particular situation, their effects are demonstrated with this code:
>>> s = 'Hello \u2013 World'
>>> s
'Hello – World'
>>> handlers = ['ignore', 'replace', 'xmlcharrefreplace', 'namereplace']
>>> print(str(s))
Hello – World
>>> for h in handlers:
...     print(f'Handler: {h}:', s.encode('ascii', errors=h).decode('ascii'))
...
Handler: ignore: Hello World
Handler: replace: Hello ? World
Handler: xmlcharrefreplace: Hello – World
Handler: namereplace: Hello \N{EN DASH} World
The ignore and replace handlers lose information - you can't tell which character was replaced by a space or question mark.
The xmlcharrefreplace and namereplace handlers do not lose information, but the replacement sequences may make the text less readable to humans.
It's up to you to decide which tradeoff is acceptable for the consumers of your program's output.
If you decided to use the replace handler, you would change your code like this:
for i in patchlets_in_latest_list:
    replaced = i.encode('ascii', errors='replace').decode('ascii')
    print(replaced)
wherever you are printing data that might contain non-ASCII characters.
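If this pattern appears in several places, a small helper (an illustrative sketch, not part of the original answer) keeps the tradeoff in one spot:
def to_ascii(text, errors='replace'):
    # Round-trip through ASCII so the result is always safely printable.
    return text.encode('ascii', errors=errors).decode('ascii')

for i in patchlets_in_latest_list:
    print(to_ascii(i))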

Related

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 48: ordinal not in range(128) [duplicate]

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is shown below:
agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
Here is a stack trace produced on SOME strings when the snippet above is run:
Traceback (most recent call last):
File "foobar.py", line 792, in <module>
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption - so there are no issues relating to internationalisation or dealing with text written in anything other than English.
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
Read the Python Unicode HOWTO. This error is the very first example.
Do not use str() to convert from unicode to encoded text / bytes.
Instead, use .encode() to encode the string:
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
or work entirely in unicode.
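As a sketch of the "work entirely in unicode" option (Python 2 code to match the answer above; the values are illustrative): decode bytes once at the boundary, keep everything unicode internally, and encode only on output.
# -*- coding: utf-8 -*-
raw = 'Malm\xc3\xb6'            # a UTF-8 encoded byte string from the outside world
city = raw.decode('utf-8')      # decode once at the boundary: u'Malm\xf6'
label = u'City: ' + city        # all intermediate work stays unicode
print label.encode('utf-8')     # encode exactly once, on output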
This is a classic python unicode pain point! Consider the following:
a = u'bats\u00E0'
print a
=> batsà
All good so far, but if we call str(a), let's see what happens:
str(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
Oh dip, that's not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:
a.encode('utf-8')
=> 'bats\xc3\xa0'
print a.encode('utf-8')
=> batsà
Voilà!
The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.
For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
I found an elegant workaround to remove the symbols and keep the string as a string, as follows:
yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
It's important to notice that using the ignore option is dangerous because it silently drops any unicode (and internationalization) support from the code that uses it, as seen here:
>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
Well, I tried everything, but it did not help. After googling around, I figured out the following, and it helped.
Python 2.7 is in use.
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
A subtle problem causing even print to fail is having your environment variables set wrong, e.g. here LC_ALL is set to "C". In Debian they discourage setting it: Debian wiki on Locale
$ echo $LANG
en_US.utf8
$ echo $LC_ALL
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
The problem is that you're trying to print a unicode character, but your terminal doesn't support it.
You can try installing language-pack-en package to fix that:
sudo apt-get install language-pack-en
which provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print).
On some Linux distributions it's required in order to make sure that the default English locales are set up properly (so unicode characters can be handled by the shell/terminal). Sometimes it's easier to install it than to configure it manually.
Then when writing the code, make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
If you've still a problem, double check your system configuration, such as:
Your locale file (/etc/default/locale), which should have e.g.
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
or:
LC_ALL=C.UTF-8
LANG=C.UTF-8
Value of LANG/LC_CTYPE in shell.
Check which locale your shell supports by:
locale -a | grep "UTF-8"
Demonstrating the problem and solution in fresh VM.
Initialize and provision the VM (e.g. using vagrant):
vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
See: available Ubuntu boxes.
Printing unicode characters (such as trade mark sign like ™):
$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
Now installing language-pack-en:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.
Now problem should be solved:
$ python -c 'print(u"\u2122");'
™
Otherwise, try the following command:
$ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
™
In shell:
Find supported UTF-8 locale by the following command:
locale -a | grep "UTF-8"
Export it, before running the script, e.g.:
export LC_ALL=$(locale -a | grep UTF-8)
or manually like:
export LC_ALL=C.UTF-8
Test it by printing special character, e.g. ™:
python -c 'print(u"\u2122");'
Above tested in Ubuntu.
I've actually found that in most of my cases, just stripping out those characters is much simpler:
s = mystring.decode('ascii', 'ignore')
For me, what worked was:
BeautifulSoup(html_text,from_encoding="utf-8")
Hope this helps someone.
Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.
def safeStr(obj):
    try:
        return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    except:
        return ""
Testing it:
if __name__ == '__main__':
    print safeStr( 1 )
    print safeStr( "test" )
    print u'98\xb0'
    print safeStr( u'98\xb0' )
Results:
1
test
98°
98
UPDATE: My original answer was written for Python 2. For Python 3:
def safeStr(obj):
    try:
        return str(obj).encode('ascii', 'ignore').decode('ascii')
    except:
        return ""
Note: if you'd prefer to leave a ? indicator where the "unsafe" unicode characters are, specify replace instead of ignore in the call to encode for the error handler.
Suggestion: you might want to name this function toAscii instead? That's a matter of preference...
Finally, here's a more robust PY2/3 version using six, where I opted to use replace, and peppered in some character swaps to replace fancy unicode quotes and apostrophes which curl left or right with the simple vertical ones that are part of the ascii set. You might expand on such swaps yourself:
from six import PY2, iteritems

CHAR_SWAP = { u'\u201c': u'"'
            , u'\u201d': u'"'
            , u'\u2018': u"'"
            , u'\u2019': u"'"
            }

def toAscii( text ):
    try:
        for k, v in iteritems( CHAR_SWAP ):
            text = text.replace( k, v )
    except: pass
    try:
        # bytes(str, encoding, errors): unencodable characters become '?'
        return str( text ) if PY2 else bytes( text, 'ascii', 'replace' ).decode('ascii')
    except UnicodeEncodeError:
        return text.encode('ascii', 'replace').decode('ascii')
    except: return ""

if __name__ == '__main__':
    print( toAscii( u'testin\u2019' ) )
Add the line below at the beginning of your script (or as the second line):
# -*- coding: utf-8 -*-
That's the definition of the Python source code encoding. More info in PEP 263.
I always put the code below in the first two lines of the python files:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
It works for me:
export LC_CTYPE="en_US.UTF-8"
Alas this works in Python 3 at least...
Python 3
Sometimes the error is in the environment variables and encoding, so:
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
...
print(myText.encode('utf-8', errors='ignore'))
where errors are ignored in encoding.
Simple helper functions found here.
def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')
Just call encode('utf-8') on the variable:
agent_contact.encode('utf-8')
Open a terminal and run the command below:
export LC_ALL="en_US.UTF-8"
In case it's an issue with a print statement, a lot of times it's just an issue with the terminal's printing. This helped me:
export PYTHONIOENCODING=UTF-8
I just used the following:
import unicodedata
message = unicodedata.normalize("NFKD", message)
Check what documentation says about it:
unicodedata.normalize(form, unistr) Return the normal form form for
the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’,
‘NFD’, and ‘NFKD’.
The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various way. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.
In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).
The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.
Even if two unicode strings are normalized and look the same to a
human reader, if one has combining characters and the other doesn’t,
they may not compare equal.
Solves it for me. Simple and easy.
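For example (a small sketch; the inputs are illustrative), NFKD replaces compatibility characters with plainer equivalents where such a decomposition exists, but leaves other characters alone:
import unicodedata

print(unicodedata.normalize("NFKD", "\u2122"))  # TRADE MARK SIGN -> 'TM'
print(unicodedata.normalize("NFKD", "\uff21"))  # FULLWIDTH LATIN A -> 'A'
# Characters with no compatibility decomposition, such as U+2013 (EN DASH),
# pass through unchanged, so this may need combining with another approach.
print(unicodedata.normalize("NFKD", "\u2013"))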
Late answer, but this error is related to your terminal's encoding not supporting certain characters.
I fixed it on python3 using:
import sys
import io
sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
print("é, à, ...")
The solution below worked for me. I just added a u prefix, as in u"String" (representing the string as unicode), before my string.
result_html = result.to_html(col_space=1, index=False, justify={'right'})
text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report. Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)
In the general case of writing a string with this unsupported encoding (let's say data_that_causes_this_error) to some file (e.g. results.txt), this works:
f = open("results.txt", "wb")  # binary mode, since we write encoded bytes
f.write(data_that_causes_this_error.encode('utf-8'))
f.close()
I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:
# 'value' contains the problematic data
unic = u''
unic += value
value = unic
I had this idea after reading Ned's presentation.
I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.
We struck this error when running manage.py migrate in Django with localized fixtures.
Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.
The issue was simply that the Django container (we use docker) was missing the LANG env var.
Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.
Update for Python 3.0 and later. Try the following in the terminal:
locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8
This sets the system's default locale encoding to the UTF-8 format.
More can be read here at PEP 538 -- Coercing the legacy C locale to a UTF-8 based locale.
The recommended solution did not work for me, and I could live with dumping all non-ASCII characters, so:
s = s.encode('ascii',errors='ignore')
which left me with something stripped that doesn't throw errors.
Many answers here (@agf and @Andbdrew for example) have already addressed the most immediate aspects of the OP's question.
However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who, like me, ended up here while trying to make sense of encodings in Python: Python 2 vs Python 3 management of character representation is wildly different. I feel like a big chunk of the confusion out there has to do with people reading about encodings in Python without being version aware.
I suggest anyone interested in understanding the root cause of the OP's problem begin by reading Spolsky's introduction to character representations and Unicode and then move to Batchelder on Unicode in Python 2 and Python 3.
Try to avoid converting a variable with str(variable); sometimes it may cause the issue. A simple tip to avoid it:
try:
    data = str(data)
except:
    data = data  # don't convert to string
The above example will also avoid the encode error.
If you have something like packet_data = "This is data", then do this on the next lines, right after initializing packet_data:
unic = u''
unic += packet_data
packet_data = unic
You can set the character encoding to UTF-8 before running your script:
export LC_CTYPE="en_US.UTF-8"
This should generally resolve the issue.

Python 3: UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 277: ordinal not in range(128)

When I try to make a request with the python requests library like below, I am getting the exception below.
def get_request(url):
    return requests.get(url).json()
Exception
palo:dataextractor minisha$ python escoskill.py
Traceback (most recent call last):
File "escoskill.py", line 62, in <module>
print(response.json())
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 277: ordinal not in range(128)
However, the same piece of code works for some requests and not for others. For the below url, it doesn't work:
https://ec.europa.eu/esco/api/resource/concept?uri=http://data.europa.eu/esco/isco/C2&language=en
Url that works
https://ec.europa.eu/esco/api/resource/taxonomy?uri=http://data.europa.eu/esco/concept-scheme/isco&language=en
The exception you're getting, UnicodeEncodeError, means we have some character that we cannot encode into bytes. In this case, we're trying to encode \xe4, or ä, which ASCII¹ does not have; hence the error.
In this line of code:
print(response.json())
The only thing that's going to be doing encoding is the print(). print(), to emit text into something, needs to encode it to bytes. Now, what it does by default depends on what sys.stdout is. Typically, stdout is your screen unless you've redirected output to a file. On Unix-like OSs (Linux, OS X), the encoding Python will use will be whatever LANG is set to; typically, this should be something like en_US.utf8 (the first part, en_US, might differ if you're in a different country; the utf8 bit is what is important here). If LANG isn't set (this is unusual, but can happen in some contexts, such as Docker containers) then it defaults to C, for which Python will use ASCII as an encoding.
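As a quick sketch (not part of the original answer) of how to see what print() will use, inspect the stream and the environment directly:
import os
import sys

print(sys.stdout.encoding)       # the encoding print() writes with
print(os.environ.get('LANG'))    # None here reproduces the ASCII fallback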
(Edit) From the additional information in the comments (you're on OS X, you're using IntelliJ, and LANG is unset - print(repr(os.environ.get('LANG'))) printed None), this is a tough one to give advice on. LANG being unset means Python will assume it can only output ASCII and will error out, as you've seen, on anything else. In order of preference, I would:
Try to figure out why LANG is unset. This might be some configuration of the mini-terminal in the IDE, if that is what you have and are using. This may prove hard to find if you're unfamiliar with character encodings, and I might be off-base here, as I'm unfamiliar with IntelliJ.
Since you seem to be running your program from a command line, you can see if setting LANG helps. Where currently you are doing,
python escoskill.py
You can set LANG for a single run with:
LANG=en_US.utf8 python escoskill.py
If that works, you can make it last for however long that session is by doing,
export LANG=en_US.utf8
# LANG will have that value for all future commands run from this terminal.
python escoskill.py
You can override what Python autodetects the encoding to be, or you can override its behavior when it hits a character it can't encode. For example,
PYTHONIOENCODING=ascii:replace python -c 'print("\xe4")'
tells Python to use the output encoding of ASCII (which is what it was doing before anyways) but the :replace bit will make characters that it can't encode in ASCII, such as ä, be emitted as ?s instead of erroring out. This might make some things harder to read, of course.
¹ASCII is a character encoding. A character encoding tells one how to translate bytes into characters. There's not just one, because… humans.
²or perhaps your OS, but LANG being unset on OS X just sounds very implausible

Python 3.6.1 - Printing string as human readable text, special characters

I'm building a little django 1.1 app (though I believe this issue to be specific to Python) where I've come to use commands to control the flow of getting and categorizing data. I also wish to print a sort of summary using a third command. I am using macOS 10.12.3
My problem comes from getting text data in and printing it to the console, or to a document using > or >> in the console.
I'm running these scripts using an alias of Python 3.6.1
I'm using the Tweepy api, but that should hopefully not be relevant.
These snippets should illustrate the problem I'm hoping to solve:
print(type(data))
print(type(data.text))
try:
    print(data.text)
except UnicodeEncodeError:
    print("no printing today :(")
print(type(data.text.encode('UTF-8')))
print(data.text.encode('UTF-8'))
this outputs:
<class 'tweepy.models.Status'>
<class 'str'>
no printing today :(
<class 'bytes'>
b'kontroll p\xc3\xa5 ... v\xc3\xa5pen.'
The ugly things there should both be the character 'å'.
This is the error that would be thrown:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 223: ordinal not in range(128)
It says 'ascii' codec, but doing (in my Python 3.6.1 script):
print(sys.getdefaultencoding())
outputs:
utf-8
Running
print(sys.getdefaultencoding())
again in Python 2.7.10 outputs:
ascii
So the thrown error matches what 2.7.10 outputs. I am not discounting the possibility that I could be wrong about what a default encoder does.
I have also tried
export LOCALE="no_NB.UTF-8"
in an attempt to see if this could be caused by my system (unless I'm misunderstanding what this does). I did not write this to any file, thinking it would persist through the current session.
Is the wrong encoder being used somehow? Could it be my terminal encoding? How can I write my special characters to the terminal and file? Are strings really this hard to get right?
Any help is greatly appreciated!!
Setting
export LC_ALL=no_NO.UTF-8
export LANG=no_NO.UTF-8
in my .bash_profile now allows me to see the characters I want in my terminal and it is also successfully echoed to a file.
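To confirm the new locale is what Python actually sees (a quick check, not part of the original answer):
import locale
import sys

# Both should now report a UTF-8 encoding.
print(sys.stdout.encoding)
print(locale.getpreferredencoding(False))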


UnicodeDecodeError On Unicode File Read

I have a problem where, when I execute a script which involved reading in data from a file that contains unicode code points, everything works fine. But when it is executed via another application, it is raising the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
I am executing the exact same code using the exact same data file. A sample datafile that replicates the problem is like this:
¥ Α © §
I called this sample.txt
A very simple python script to simply read in and print the file contents:
with open("sample.txt") as f:
    for line in f:
        print(line)

print("Done")
This executes fine from the command line; executing via Apache/CGI fails with the above error.
A hint to the problem came from the documentation of the open function:
In text mode, if encoding is not specified the encoding used is
platform dependent: locale.getpreferredencoding(False) is called to
get the current locale encoding.
[Link]
"Platform dependent" suggested environment variables. So, I inspected which environment variables were set for my shell and found LANG set to en_US.UTF-8. Dumping the environment variables set by Apache showed that LANG was missing.
So, apparently when locale cannot be determined, Python uses ASCII as the default file encoding. As a result, the error was encountered when the ordinal was out of range for ASCII.
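A quick way to see which encoding open() will pick by default, and to sidestep the locale entirely by passing an explicit encoding (a sketch using the sample.txt from above):
import locale

# What open() falls back to when no encoding argument is given; under
# Apache/CGI with LANG unset this reflects the ASCII 'C' locale.
print(locale.getpreferredencoding(False))

# Passing the encoding explicitly makes the read independent of the locale.
# (Printing the lines may still require a UTF-8 locale or PYTHONIOENCODING.)
with open("sample.txt", encoding="utf-8") as f:
    for line in f:
        print(line)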
To fix this, I set this environment variable in my CGI script. If the environment variable is somehow missing from a user shell, it can be set via normal methods, or just by:
export LANG=en_US.UTF-8
Or whatever preferred encoding is desired.
Note, the issue is probably far more noticeable if the locale is missing from a user shell, as text editors like vi will not display the characters without it. It was significantly subtler here because it was only an issue when called from Apache (or some other application).
