Proper use of unicode characters in python3 - Force utf-8 encoding - python

I'm going crazy here. The internet and this SO question tell me that in python 3.x, the default encoding is UTF-8. In addition to that, my system's default encoding is UTF-8. In addition to that, I have # -*- coding: utf-8 -*- at the top of my python 3.5 file.
Still, python is using ascii:
# -*- coding: utf-8 -*-
mystring = "Ⓐ"
print(mystring)
Greets me with:
SyntaxError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)
I've also tried this: print(mystring.encode("utf-8")) and .decode("utf-8") - Same thing.
What am I missing here? How do I force python to stop using ascii encoding?
Edit: I know that it seems weird to complain about position 7 with a one character string, but this is my actual MCVE and the exact output I'm getting. The above is using python shell, the below is in a script. Both use python 3.5.2.
Edit: Since I figured it might be relevant: The string I'm getting comes from an external application and is not hardcoded, so I need a way to get that utf-8 string and save it into a file. The above is just a minimalized and generalized example. Here is my real-life code:
# the variables being a string that might contain unicode characters
mystring = "username: " + fromuser + " | printname: " + fromname
with open("myfile.txt", "a") as myfile:
myfile.write(mystring + "\n")

In Python3 all strings are unicode, so the problem you're having is likely due to your locale settings not being correct. The Python3 interpreter looks to use the locale environment variables and if it cannot find them it emulates basic ASCII
From locale.py:
except ImportError:
# Locale emulation
CHAR_MAX = 127
LC_ALL = 6
LC_COLLATE = 3
LC_CTYPE = 0
LC_MESSAGES = 5
LC_MONETARY = 4
LC_NUMERIC = 1
LC_TIME = 2
Error = ValueError
Double check the locale on your shell from which you are executing. Here are a few work arounds you can try to see if they get you working before you go through the task of getting your env setup correctly.
1) Validate UTF-8 locale or language files are installed (see link above)
2) Try adding this to the top of your script
#!/usr/bin/env LC_ALL=en_US.UTF-8 /usr/local/bin/python3
print('カタカナ')
or
#!/usr/bin/env LANG=en_US.UTF-8 /usr/local/bin/python3
print('カタカナ')
Or export shell variables before executing the Python interpreter
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
python3
>>> print('カタカナ')
Sorry I cannot be more specific, as these settings are platform and OS specific. You can forcefully attempt to set the locale in Python directly using the locale module, but I don't recommend that, and it won't help if they are not installed.
Hope that helps.

What's new in Python 3.0 says:
All text is Unicode; however encoded Unicode is represented as binary
data
If you want to try outputting utf-8, here's an example:
b'\x41'.decode("utf-8", "strict")
If you'd like to use unicode in a string literal, use the unicode escape and its coded representation. For your example:
print("\u24B6")

Related

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 48: ordinal not in range(128) [duplicate]

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is shown below:
agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
Here is a stack trace produced on SOME strings when the snippet above is run:
Traceback (most recent call last):
File "foobar.py", line 792, in <module>
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption - so there are no issues relating to internalization or dealing with text written in anything other than English.
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
Read the Python Unicode HOWTO. This error is the very first example.
Do not use str() to convert from unicode to encoded text / bytes.
Instead, use .encode() to encode the string:
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
or work entirely in unicode.
This is a classic python unicode pain point! Consider the following:
a = u'bats\u00E0'
print a
=> batsà
All good so far, but if we call str(a), let's see what happens:
str(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
Oh dip, that's not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:
a.encode('utf-8')
=> 'bats\xc3\xa0'
print a.encode('utf-8')
=> batsà
Voil\u00E0!
The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.
For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
I found elegant work around for me to remove symbols and continue to keep string as string in follows:
yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
It's important to notice that using the ignore option is dangerous because it silently drops any unicode(and internationalization) support from the code that uses it, as seen here (convert unicode):
>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
well i tried everything but it did not help, after googling around i figured the following and it helped.
python 2.7 is in use.
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
A subtle problem causing even print to fail is having your environment variables set wrong, eg. here LC_ALL set to "C". In Debian they discourage setting it: Debian wiki on Locale
$ echo $LANG
en_US.utf8
$ echo $LC_ALL
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
The problem is that you're trying to print a unicode character, but your terminal doesn't support it.
You can try installing language-pack-en package to fix that:
sudo apt-get install language-pack-en
which provides English translation data updates for all supported packages (including Python). Install different language package if necessary (depending which characters you're trying to print).
On some Linux distributions it's required in order to make sure that the default English locales are set-up properly (so unicode characters can be handled by shell/terminal). Sometimes it's easier to install it, than configuring it manually.
Then when writing the code, make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
If you've still a problem, double check your system configuration, such as:
Your locale file (/etc/default/locale), which should have e.g.
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
or:
LC_ALL=C.UTF-8
LANG=C.UTF-8
Value of LANG/LC_CTYPE in shell.
Check which locale your shell supports by:
locale -a | grep "UTF-8"
Demonstrating the problem and solution in fresh VM.
Initialize and provision the VM (e.g. using vagrant):
vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
See: available Ubuntu boxes..
Printing unicode characters (such as trade mark sign like ™):
$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
Now installing language-pack-en:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.
Now problem should be solved:
$ python -c 'print(u"\u2122");'
™
Otherwise, try the following command:
$ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
™
In shell:
Find supported UTF-8 locale by the following command:
locale -a | grep "UTF-8"
Export it, before running the script, e.g.:
export LC_ALL=$(locale -a | grep UTF-8)
or manually like:
export LC_ALL=C.UTF-8
Test it by printing special character, e.g. ™:
python -c 'print(u"\u2122");'
Above tested in Ubuntu.
I've actually found that in most of my cases, just stripping out those characters is much simpler:
s = mystring.decode('ascii', 'ignore')
For me, what worked was:
BeautifulSoup(html_text,from_encoding="utf-8")
Hope this helps someone.
Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.
def safeStr(obj):
try: return str(obj)
except UnicodeEncodeError:
return obj.encode('ascii', 'ignore').decode('ascii')
except: return ""
Testing it:
if __name__ == '__main__':
print safeStr( 1 )
print safeStr( "test" )
print u'98\xb0'
print safeStr( u'98\xb0' )
Results:
1
test
98°
98
UPDATE: My original answer was written for Python 2. For Python 3:
def safeStr(obj):
try: return str(obj).encode('ascii', 'ignore').decode('ascii')
except: return ""
Note: if you'd prefer to leave a ? indicator where the "unsafe" unicode characters are, specify replace instead of ignore in the call to encode for the error handler.
Suggestion: you might want to name this function toAscii instead? That's a matter of preference...
Finally, here's a more robust PY2/3 version using six, where I opted to use replace, and peppered in some character swaps to replace fancy unicode quotes and apostrophes which curl left or right with the simple vertical ones that are part of the ascii set. You might expand on such swaps yourself:
from six import PY2, iteritems
CHAR_SWAP = { u'\u201c': u'"'
, u'\u201D': u'"'
, u'\u2018': u"'"
, u'\u2019': u"'"
}
def toAscii( text ) :
try:
for k,v in iteritems( CHAR_SWAP ):
text = text.replace(k,v)
except: pass
try: return str( text ) if PY2 else bytes( text, 'replace' ).decode('ascii')
except UnicodeEncodeError:
return text.encode('ascii', 'replace').decode('ascii')
except: return ""
if __name__ == '__main__':
print( toAscii( u'testin\u2019' ) )
Add line below at the beginning of your script ( or as second line):
# -*- coding: utf-8 -*-
That's definition of python source code encoding. More info in PEP 263.
I always put the code below in the first two lines of the python files:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
It works for me:
export LC_CTYPE="en_US.UTF-8"
Alas this works in Python 3 at least...
Python 3
Sometimes the error is in the enviroment variables and enconding so
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
...
print(myText.encode('utf-8', errors='ignore'))
where errors are ignored in encoding.
Simple helper functions found here.
def safe_unicode(obj, *args):
""" return the unicode representation of obj """
try:
return unicode(obj, *args)
except UnicodeDecodeError:
# obj is byte string
ascii_text = str(obj).encode('string_escape')
return unicode(ascii_text)
def safe_str(obj):
""" return the byte string representation of obj """
try:
return str(obj)
except UnicodeEncodeError:
# obj is unicode
return unicode(obj).encode('unicode_escape')
Just add to a variable encode('utf-8')
agent_contact.encode('utf-8')
Please open terminal and fire the below command:
export LC_ALL="en_US.UTF-8"
In case its an issue with a print statement, a lot fo times its just an issue with the terminal printing. This helped me :
export PYTHONIOENCODING=UTF-8
I just used the following:
import unicodedata
message = unicodedata.normalize("NFKD", message)
Check what documentation says about it:
unicodedata.normalize(form, unistr) Return the normal form form for
the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’,
‘NFD’, and ‘NFKD’.
The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various way. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.
In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).
The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.
Even if two unicode strings are normalized and look the same to a
human reader, if one has combining characters and the other doesn’t,
they may not compare equal.
Solves it for me. Simple and easy.
Late answer, but this error is related to your terminal's encoding not supporting certain characters.
I fixed it on python3 using:
import sys
import io
sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
print("é, à, ...")
Below solution worked for me, Just added
u "String"
(representing the string as unicode) before my string.
result_html = result.to_html(col_space=1, index=False, justify={'right'})
text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report. Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)
In general case of writing this unsupported encoding string (let's say data_that_causes_this_error) to some file (for e.g. results.txt), this works
f = open("results.txt", "w")
f.write(data_that_causes_this_error.encode('utf-8'))
f.close()
I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:
# 'value' contains the problematic data
unic = u''
unic += value
value = unic
I had this idea after reading Ned's presentation.
I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.
We struck this error when running manage.py migrate in Django with localized fixtures.
Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.
The issue was simply that the Django container (we use docker) was missing the LANG env var.
Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.
Update for python 3.0 and later. Try the following in the python editor:
locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8
This sets the system`s default locale encoding to the UTF-8 format.
More can be read here at PEP 538 -- Coercing the legacy C locale to a UTF-8 based locale.
The recommended solution did not work for me, and I could live with dumping all non ascii characters, so
s = s.encode('ascii',errors='ignore')
which left me with something stripped that doesn't throw errors.
Many answers here (#agf and #Andbdrew for example) have already addressed the most immediate aspects of the OP question.
However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who like me ended up here while trying to make sense of encodings in Python: Python 2 vs Python 3 management of character representation is wildly different. I feel like a big chunk of confusion out there has to do with people reading about encodings in Python without being version aware.
I suggest anyone interested in understanding the root cause of OP problem to begin by reading Spolsky's introduction to character representations and Unicode and then move to Batchelder on Unicode in Python 2 and Python 3.
Try to avoid conversion of variable to str(variable). Sometimes, It may cause the issue.
Simple tip to avoid :
try:
data=str(data)
except:
data = data #Don't convert to String
The above example will solve Encode error also.
If you have something like packet_data = "This is data" then do this on the next line, right after initializing packet_data:
unic = u''
packet_data = unic
You can set the character encoding to UTF-8 before running your script:
export LC_CTYPE="en_US.UTF-8"
This should generally resolve the issue.

Error using word_tokenize UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128) [duplicate]

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is shown below:
agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
Here is a stack trace produced on SOME strings when the snippet above is run:
Traceback (most recent call last):
File "foobar.py", line 792, in <module>
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption - so there are no issues relating to internalization or dealing with text written in anything other than English.
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
Read the Python Unicode HOWTO. This error is the very first example.
Do not use str() to convert from unicode to encoded text / bytes.
Instead, use .encode() to encode the string:
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
or work entirely in unicode.
This is a classic python unicode pain point! Consider the following:
a = u'bats\u00E0'
print a
=> batsà
All good so far, but if we call str(a), let's see what happens:
str(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
Oh dip, that's not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:
a.encode('utf-8')
=> 'bats\xc3\xa0'
print a.encode('utf-8')
=> batsà
Voil\u00E0!
The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.
For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
I found elegant work around for me to remove symbols and continue to keep string as string in follows:
yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
It's important to notice that using the ignore option is dangerous because it silently drops any unicode(and internationalization) support from the code that uses it, as seen here (convert unicode):
>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
well i tried everything but it did not help, after googling around i figured the following and it helped.
python 2.7 is in use.
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
A subtle problem causing even print to fail is having your environment variables set wrong, eg. here LC_ALL set to "C". In Debian they discourage setting it: Debian wiki on Locale
$ echo $LANG
en_US.utf8
$ echo $LC_ALL
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
The problem is that you're trying to print a unicode character, but your terminal doesn't support it.
You can try installing language-pack-en package to fix that:
sudo apt-get install language-pack-en
which provides English translation data updates for all supported packages (including Python). Install different language package if necessary (depending which characters you're trying to print).
On some Linux distributions it's required in order to make sure that the default English locales are set-up properly (so unicode characters can be handled by shell/terminal). Sometimes it's easier to install it, than configuring it manually.
Then when writing the code, make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
If you've still a problem, double check your system configuration, such as:
Your locale file (/etc/default/locale), which should have e.g.
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
or:
LC_ALL=C.UTF-8
LANG=C.UTF-8
Value of LANG/LC_CTYPE in shell.
Check which locale your shell supports by:
locale -a | grep "UTF-8"
Demonstrating the problem and solution in fresh VM.
Initialize and provision the VM (e.g. using vagrant):
vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
See: available Ubuntu boxes..
Printing unicode characters (such as trade mark sign like ™):
$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
Now installing language-pack-en:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.
Now problem should be solved:
$ python -c 'print(u"\u2122");'
™
Otherwise, try the following command:
$ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
™
In shell:
Find supported UTF-8 locale by the following command:
locale -a | grep "UTF-8"
Export it, before running the script, e.g.:
export LC_ALL=$(locale -a | grep UTF-8)
or manually like:
export LC_ALL=C.UTF-8
Test it by printing special character, e.g. ™:
python -c 'print(u"\u2122");'
Above tested in Ubuntu.
I've actually found that in most of my cases, just stripping out those characters is much simpler:
s = mystring.decode('ascii', 'ignore')
For me, what worked was:
BeautifulSoup(html_text,from_encoding="utf-8")
Hope this helps someone.
Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.
def safeStr(obj):
try: return str(obj)
except UnicodeEncodeError:
return obj.encode('ascii', 'ignore').decode('ascii')
except: return ""
Testing it:
if __name__ == '__main__':
print safeStr( 1 )
print safeStr( "test" )
print u'98\xb0'
print safeStr( u'98\xb0' )
Results:
1
test
98°
98
UPDATE: My original answer was written for Python 2. For Python 3:
def safeStr(obj):
try: return str(obj).encode('ascii', 'ignore').decode('ascii')
except: return ""
Note: if you'd prefer to leave a ? indicator where the "unsafe" unicode characters are, specify replace instead of ignore in the call to encode for the error handler.
Suggestion: you might want to name this function toAscii instead? That's a matter of preference...
Finally, here's a more robust PY2/3 version using six, where I opted to use replace, and peppered in some character swaps to replace fancy unicode quotes and apostrophes which curl left or right with the simple vertical ones that are part of the ascii set. You might expand on such swaps yourself:
from six import PY2, iteritems
CHAR_SWAP = { u'\u201c': u'"'
, u'\u201D': u'"'
, u'\u2018': u"'"
, u'\u2019': u"'"
}
def toAscii( text ) :
try:
for k,v in iteritems( CHAR_SWAP ):
text = text.replace(k,v)
except: pass
try: return str( text ) if PY2 else bytes( text, 'replace' ).decode('ascii')
except UnicodeEncodeError:
return text.encode('ascii', 'replace').decode('ascii')
except: return ""
if __name__ == '__main__':
print( toAscii( u'testin\u2019' ) )
Add line below at the beginning of your script ( or as second line):
# -*- coding: utf-8 -*-
That's definition of python source code encoding. More info in PEP 263.
I always put the code below in the first two lines of the python files:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
It works for me:
export LC_CTYPE="en_US.UTF-8"
Alas this works in Python 3 at least...
Python 3
Sometimes the error is in the enviroment variables and enconding so
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
...
print(myText.encode('utf-8', errors='ignore'))
where errors are ignored in encoding.
Simple helper functions found here.
def safe_unicode(obj, *args):
""" return the unicode representation of obj """
try:
return unicode(obj, *args)
except UnicodeDecodeError:
# obj is byte string
ascii_text = str(obj).encode('string_escape')
return unicode(ascii_text)
def safe_str(obj):
""" return the byte string representation of obj """
try:
return str(obj)
except UnicodeEncodeError:
# obj is unicode
return unicode(obj).encode('unicode_escape')
Just add to a variable encode('utf-8')
agent_contact.encode('utf-8')
Please open terminal and fire the below command:
export LC_ALL="en_US.UTF-8"
In case its an issue with a print statement, a lot fo times its just an issue with the terminal printing. This helped me :
export PYTHONIOENCODING=UTF-8
I just used the following:
import unicodedata
message = unicodedata.normalize("NFKD", message)
Check what documentation says about it:
unicodedata.normalize(form, unistr) Return the normal form form for
the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’,
‘NFD’, and ‘NFKD’.
The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various way. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.
In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).
The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.
Even if two unicode strings are normalized and look the same to a
human reader, if one has combining characters and the other doesn’t,
they may not compare equal.
Solves it for me. Simple and easy.
Late answer, but this error is related to your terminal's encoding not supporting certain characters.
I fixed it on python3 using:
import sys
import io
sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
print("é, à, ...")
Below solution worked for me, Just added
u "String"
(representing the string as unicode) before my string.
result_html = result.to_html(col_space=1, index=False, justify={'right'})
text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report. Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)
In general case of writing this unsupported encoding string (let's say data_that_causes_this_error) to some file (for e.g. results.txt), this works
f = open("results.txt", "w")
f.write(data_that_causes_this_error.encode('utf-8'))
f.close()
I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:
# 'value' contains the problematic data
unic = u''
unic += value
value = unic
I had this idea after reading Ned's presentation.
I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.
We struck this error when running manage.py migrate in Django with localized fixtures.
Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.
The issue was simply that the Django container (we use docker) was missing the LANG env var.
Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.
Update for python 3.0 and later. Try the following in the python editor:
locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8
This sets the system`s default locale encoding to the UTF-8 format.
More can be read here at PEP 538 -- Coercing the legacy C locale to a UTF-8 based locale.
The recommended solution did not work for me, and I could live with dumping all non ascii characters, so
s = s.encode('ascii',errors='ignore')
which left me with something stripped that doesn't throw errors.
Many answers here (#agf and #Andbdrew for example) have already addressed the most immediate aspects of the OP question.
However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who like me ended up here while trying to make sense of encodings in Python: Python 2 vs Python 3 management of character representation is wildly different. I feel like a big chunk of confusion out there has to do with people reading about encodings in Python without being version aware.
I suggest anyone interested in understanding the root cause of OP problem to begin by reading Spolsky's introduction to character representations and Unicode and then move to Batchelder on Unicode in Python 2 and Python 3.
Try to avoid conversion of variable to str(variable). Sometimes, It may cause the issue.
Simple tip to avoid :
try:
data=str(data)
except:
data = data #Don't convert to String
The above example will solve Encode error also.
If you have something like packet_data = "This is data" then do this on the next line, right after initializing packet_data:
unic = u''
packet_data = unic
You can set the character encoding to UTF-8 before running your script:
export LC_CTYPE="en_US.UTF-8"
This should generally resolve the issue.

Issues while writing HTML file [duplicate]

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is shown below:
agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
Here is a stack trace produced on SOME strings when the snippet above is run:
Traceback (most recent call last):
File "foobar.py", line 792, in <module>
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption - so there are no issues relating to internalization or dealing with text written in anything other than English.
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
Read the Python Unicode HOWTO. This error is the very first example.
Do not use str() to convert from unicode to encoded text / bytes.
Instead, use .encode() to encode the string:
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
or work entirely in unicode.
This is a classic python unicode pain point! Consider the following:
a = u'bats\u00E0'
print a
=> batsà
All good so far, but if we call str(a), let's see what happens:
str(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
Oh dip, that's not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:
a.encode('utf-8')
=> 'bats\xc3\xa0'
print a.encode('utf-8')
=> batsà
Voil\u00E0!
The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.
For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
I found elegant work around for me to remove symbols and continue to keep string as string in follows:
yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
It's important to notice that using the ignore option is dangerous because it silently drops any unicode(and internationalization) support from the code that uses it, as seen here (convert unicode):
>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
well i tried everything but it did not help, after googling around i figured the following and it helped.
python 2.7 is in use.
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
A subtle problem causing even print to fail is having your environment variables set wrong, eg. here LC_ALL set to "C". In Debian they discourage setting it: Debian wiki on Locale
$ echo $LANG
en_US.utf8
$ echo $LC_ALL
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
The problem is that you're trying to print a unicode character, but your terminal doesn't support it.
You can try installing language-pack-en package to fix that:
sudo apt-get install language-pack-en
which provides English translation data updates for all supported packages (including Python). Install different language package if necessary (depending which characters you're trying to print).
On some Linux distributions it's required in order to make sure that the default English locales are set-up properly (so unicode characters can be handled by shell/terminal). Sometimes it's easier to install it, than configuring it manually.
Then when writing the code, make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
If you've still a problem, double check your system configuration, such as:
Your locale file (/etc/default/locale), which should have e.g.
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
or:
LC_ALL=C.UTF-8
LANG=C.UTF-8
Value of LANG/LC_CTYPE in shell.
Check which locale your shell supports by:
locale -a | grep "UTF-8"
Demonstrating the problem and solution in fresh VM.
Initialize and provision the VM (e.g. using vagrant):
vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
See: available Ubuntu boxes..
Printing unicode characters (such as trade mark sign like ™):
$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
Now installing language-pack-en:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.
Now problem should be solved:
$ python -c 'print(u"\u2122");'
™
Otherwise, try the following command:
$ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
™
In shell:
Find supported UTF-8 locale by the following command:
locale -a | grep "UTF-8"
Export it, before running the script, e.g.:
export LC_ALL=$(locale -a | grep UTF-8)
or manually like:
export LC_ALL=C.UTF-8
Test it by printing special character, e.g. ™:
python -c 'print(u"\u2122");'
Above tested in Ubuntu.
I've actually found that in most of my cases, just stripping out those characters is much simpler:
s = mystring.decode('ascii', 'ignore')
For me, what worked was:
BeautifulSoup(html_text,from_encoding="utf-8")
Hope this helps someone.
Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.
def safeStr(obj):
try: return str(obj)
except UnicodeEncodeError:
return obj.encode('ascii', 'ignore').decode('ascii')
except: return ""
Testing it:
if __name__ == '__main__':
print safeStr( 1 )
print safeStr( "test" )
print u'98\xb0'
print safeStr( u'98\xb0' )
Results:
1
test
98°
98
UPDATE: My original answer was written for Python 2. For Python 3:
def safeStr(obj):
try: return str(obj).encode('ascii', 'ignore').decode('ascii')
except: return ""
Note: if you'd prefer to leave a ? indicator where the "unsafe" unicode characters are, specify replace instead of ignore in the call to encode for the error handler.
Suggestion: you might want to name this function toAscii instead? That's a matter of preference...
Finally, here's a more robust PY2/3 version using six, where I opted to use replace, and peppered in some character swaps to replace fancy unicode quotes and apostrophes which curl left or right with the simple vertical ones that are part of the ascii set. You might expand on such swaps yourself:
from six import PY2, iteritems
CHAR_SWAP = { u'\u201c': u'"'
, u'\u201D': u'"'
, u'\u2018': u"'"
, u'\u2019': u"'"
}
def toAscii( text ) :
try:
for k,v in iteritems( CHAR_SWAP ):
text = text.replace(k,v)
except: pass
try: return str( text ) if PY2 else bytes( text, 'replace' ).decode('ascii')
except UnicodeEncodeError:
return text.encode('ascii', 'replace').decode('ascii')
except: return ""
if __name__ == '__main__':
print( toAscii( u'testin\u2019' ) )
Add line below at the beginning of your script ( or as second line):
# -*- coding: utf-8 -*-
That's definition of python source code encoding. More info in PEP 263.
I always put the code below in the first two lines of the python files:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
It works for me:
export LC_CTYPE="en_US.UTF-8"
Alas this works in Python 3 at least...
Python 3
Sometimes the error is in the enviroment variables and enconding so
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
...
print(myText.encode('utf-8', errors='ignore'))
where errors are ignored in encoding.
Simple helper functions found here.
def safe_unicode(obj, *args):
""" return the unicode representation of obj """
try:
return unicode(obj, *args)
except UnicodeDecodeError:
# obj is byte string
ascii_text = str(obj).encode('string_escape')
return unicode(ascii_text)
def safe_str(obj):
""" return the byte string representation of obj """
try:
return str(obj)
except UnicodeEncodeError:
# obj is unicode
return unicode(obj).encode('unicode_escape')
Just add to a variable encode('utf-8')
agent_contact.encode('utf-8')
Please open terminal and fire the below command:
export LC_ALL="en_US.UTF-8"
In case its an issue with a print statement, a lot fo times its just an issue with the terminal printing. This helped me :
export PYTHONIOENCODING=UTF-8
I just used the following:
import unicodedata
message = unicodedata.normalize("NFKD", message)
Check what documentation says about it:
unicodedata.normalize(form, unistr) Return the normal form form for
the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’,
‘NFD’, and ‘NFKD’.
The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various way. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.
In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).
The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.
Even if two unicode strings are normalized and look the same to a
human reader, if one has combining characters and the other doesn’t,
they may not compare equal.
Solves it for me. Simple and easy.
Late answer, but this error is related to your terminal's encoding not supporting certain characters.
I fixed it on python3 using:
import sys
import io
sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
print("é, à, ...")
Below solution worked for me, Just added
u "String"
(representing the string as unicode) before my string.
result_html = result.to_html(col_space=1, index=False, justify={'right'})
text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report. Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)
In general case of writing this unsupported encoding string (let's say data_that_causes_this_error) to some file (for e.g. results.txt), this works
f = open("results.txt", "w")
f.write(data_that_causes_this_error.encode('utf-8'))
f.close()
I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:
# 'value' contains the problematic data
unic = u''
unic += value
value = unic
I had this idea after reading Ned's presentation.
I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.
We struck this error when running manage.py migrate in Django with localized fixtures.
Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.
The issue was simply that the Django container (we use docker) was missing the LANG env var.
Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.
Update for python 3.0 and later. Try the following in the python editor:
locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8
This sets the system`s default locale encoding to the UTF-8 format.
More can be read here at PEP 538 -- Coercing the legacy C locale to a UTF-8 based locale.
The recommended solution did not work for me, and I could live with dumping all non ascii characters, so
s = s.encode('ascii',errors='ignore')
which left me with something stripped that doesn't throw errors.
Many answers here (#agf and #Andbdrew for example) have already addressed the most immediate aspects of the OP question.
However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who like me ended up here while trying to make sense of encodings in Python: Python 2 vs Python 3 management of character representation is wildly different. I feel like a big chunk of confusion out there has to do with people reading about encodings in Python without being version aware.
I suggest anyone interested in understanding the root cause of OP problem to begin by reading Spolsky's introduction to character representations and Unicode and then move to Batchelder on Unicode in Python 2 and Python 3.
Try to avoid conversion of variable to str(variable). Sometimes, It may cause the issue.
Simple tip to avoid :
try:
data=str(data)
except:
data = data #Don't convert to String
The above example will solve Encode error also.
If you have something like packet_data = "This is data" then do this on the next line, right after initializing packet_data:
unic = u''
packet_data = unic
You can set the character encoding to UTF-8 before running your script:
export LC_CTYPE="en_US.UTF-8"
This should generally resolve the issue.

python... encoding issue when using linux > [duplicate]

This question already has answers here:
Understanding Python Unicode and Linux terminal
(2 answers)
Closed 9 years ago.
simple test program of an encoding issue:
#!/bin/env python
# -*- coding: utf-8 -*-
print u"Råbjerg" # >>> unicodedata.name(u"å") = 'LATIN SMALL LETTER A WITH RING ABOVE'
here is what i get when i use it from a debian command box, i do not understand why using redirect here broke the thing, as i can see it correctly when using without.
can someone help to understand what i have missed? and what should the right way to print this characters so that they are ok everywhere?
$ python testu.py
Råbjerg
$ python testu.py > A
Traceback (most recent call last):
File "testu.py", line 3, in <module>
print u"Råbjerg"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 1: ordinal not in range(128)
using debian Debian GNU/Linux 6.0.7 (squeeze) configured with:
$ locale
LANG=fr_FR.UTF-8
LANGUAGE=
LC_CTYPE="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_PAPER="fr_FR.UTF-8"
LC_NAME="fr_FR.UTF-8"
LC_ADDRESS="fr_FR.UTF-8"
LC_TELEPHONE="fr_FR.UTF-8"
LC_MEASUREMENT="fr_FR.UTF-8"
LC_IDENTIFICATION="fr_FR.UTF-8"
LC_ALL=
EDIT: from other similar questions seen later from the pointing done below
#!/bin/env python1
# -*- coding: utf-8 -*-
import sys, locale
s = u"Råbjerg" # >>> unicodedata.name(u"å") = 'LATIN SMALL LETTER A WITH RING ABOVE'
if sys.stdout.encoding is None: # if it is a pipe, seems python2 return None
s = s.encode(locale.getpreferredencoding())
print s
When redirecting the output, sys.stdout is not connected to a terminal and Python cannot determine the output encoding. When not directing the output, Python can detect that sys.stdout is a TTY and will use the codec configured for that TTY when printing unicode.
Set the PYTHONIOENCODING environment variable to tell Python what encoding to use in such cases, or encode explicitly.
Use: print u"Råbjerg".encode('utf-8')
Similar question was asked today : Understanding Python Unicode and Linux terminal
I'll suggest you to output it already encoded:
print u"Råbjerg".encode('utf-8')
This will write the correct bytes of the string in utf-8 and you'll be able to see in almost every editor/console which support utf-8

Unicode problems in PyObjC

I am trying to figure out PyObjC on Mac OS X, and I have written a simple program to print out the names in my Address Book. However, I am having some trouble with the encoding of the output.
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *
ab = ABAddressBook.sharedAddressBook()
people = ab.people()
for person in people:
name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
name
when I run this program, the output looks something like this:
...snip...
u'Jacob \xc5berg'
u'Fernando Gonzales'
...snip...
Could someone please explain why the strings are in unicode, but the content looks like that?
I have also noticed that when I try to print the name I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 6: ordinal not in range(128)
# -*- coding: UTF-8 -*-
only affects the way Python decodes comments and string literals in your source, not the way standard output is configured, etc, etc. If you set your Mac's Terminal to UTF-8 (Terminal, Preferences, Settings, Advanced, International dropdown) and emit Unicode text to it after encoding it in UTF-8 (print name.encode("utf-8")), you should be fine.
If you run the code in your question in the interactive console the interpreter will print the repr of "name" because of the last statement of the loop.
If you change the last line of the loop from just "name" to "print name" the output should be fine. I've tested this with Terminal.app on a 10.5.7 system.
Just writing the variable name sends repr(name) to the standard output and repr() encodes all unicode values.
print tries to convert u'Jacob \xc5berg' to ASCII, which doesn't work. Try writing it to a file.
See Print Fails on the python wiki.
That means you're using legacy,
limited or misconfigured console. If
you're just trying to play with
unicode at interactive prompt move to
a modern unicode-aware console. Most
modern Python distributions come with
IDLE where you'll be able to print all
unicode characters.
Convert it to a unicode string through:
print unicode(name)

Categories

Resources