python unichr problem

I've got a problem with unichr() on my server. Please see below:
On my server (Ubuntu 9.04):
>>> print unichr(255)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128)
On my desktop (Ubuntu 9.10):
>>> print unichr(255)
ÿ
I'm fairly new to python so I don't know how to solve this. Anyone care to help? Thanks.

When you use the print statement, you are writing to the sys.stdout output stream. In Python 2, sys.stdout can usually only display a Unicode string directly if all of its characters can be converted to ASCII, as str(message) would do.
To print anything else, you'll need to encode to your OS's terminal encoding yourself.
The locale module can sometimes detect the encoding of the output console:
import locale
print unichr(0xff).encode(locale.getdefaultlocale()[1], 'replace')
but it's usually better to just specify the encoding yourself, as python often gets it wrong:
print unichr(0xff).encode('latin-1', 'replace')
UTF-8 or latin-1 are, I think, the encodings most often used by modern Linux distros.
If you know the encoding of your console, the lines below will encode Unicode strings automatically when you use "print":
import sys
import codecs
ENCODING = 'utf-8'  # substitute your console's actual encoding
sys.stdout = codecs.getwriter(ENCODING)(sys.stdout)
If the encoding is ascii or something similar, you may need to change the console encoding of your OS to be able to display that character.
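Before changing the OS configuration, a quick diagnostic sketch (Python 2, run interactively) shows which encodings are actually in play:
import sys
import locale

print sys.stdout.encoding            # what `print` targets; None when piped
print locale.getpreferredencoding()  # Python's guess from the locale settings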
See also: http://wiki.python.org/moin/PrintFails

The terminal settings on your server are different, probably set to 7-bit US ASCII.

It's not really unichr()-related. The problem is the locale setting in your server environment: it's probably set to something like plain en_US, which is not Unicode-aware.

Consider using an explicit encoding when printing unicode strings where OS settings are not uniform.
unicode.encode([encoding[, errors]])
Return an encoded version of the string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Codec Base Classes. For a list of possible encodings, see section Standard Encodings.
For example,
>>> print unichr(0xff).encode('iso8859-1')
ÿ
>>>

Related

json encoding isn't considering encoding argument in python

I'm trying to load a JSON file encoded with the UTF-8 BOM codec (utf-8-sig), with this code:
data = json.load(open("data.json", encoding="utf-8-sig"))
But it appears that the encoding argument is being ignored, throwing this error:
Traceback (most recent call last):
File "app1.py", line 11, in <module>
print(k,v)
UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 141: ordinal not in range(128)
Edit: the datatype of the opened file is <class '_io.TextIOWrapper'>, and here's the full script:
import json
data = json.load(open("data.json", encoding="utf-8-sig"))
for k, v in data.items():
    print(k, v)
Edit 2: binary sample of the file, using print(open("data.json", "rb").read(180)):
b'{"abandoned industrial site": ["Site that cannot be used for any
purpose, being contaminated b y pollutants."], "abandoned vehicle":
["A vehicle that has been discarded in the envir'
As @tdelaney pointed out in the comments, you are looking at the wrong problem:
The error is not in open, nor in json.load, nor in the data.items() iterator, so it is not a decoding problem between utf-8-sig and the Unicode string.
The problem is in print. The error is an encoding error, which means it happens between a Python Unicode string and a resulting binary encoding. Here the Unicode string is converted to ASCII, but ASCII cannot represent all of the original characters.
There are two solutions:
you can allow your terminal to accept UTF-8 characters. This can be done by setting e.g. LANG=C.UTF-8 (supported on only a few systems) or LANG=en_US.UTF-8 or another locale (check which locales on your system support UTF-8).
you can force print to emit only ASCII, as proposed by @tdelaney: print(k.encode('ascii', 'replace'), v.encode('ascii', 'replace')). You may want to change replace to backslashreplace or ignore (see https://docs.python.org/3/library/codecs.html#codec-base-classes).
There are also hybrid solutions (not really safe or portable across systems), like forcing Python to output in a particular encoding, and many hacks, but they will make your code complex, so I do not recommend them (and if you use one, wrap it in its own function). A sketch of one such hybrid follows.
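For completeness, a minimal sketch of the hybrid approach, assuming Python 3.7+ (io.TextIOWrapper.reconfigure) and the data.json from the question:
import json
import sys

# Keep the console's encoding, but let unencodable characters come out
# as backslash escapes instead of raising UnicodeEncodeError.
sys.stdout.reconfigure(errors='backslashreplace')

data = json.load(open("data.json", encoding="utf-8-sig"))
for k, v in data.items():
    print(k, v)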

Python how to handle unicode text

I am using Python 2.6.6
item = {u'snippet': {u'title': u'How to Pronounce Canap\xe9'}}
title = item['snippet']['title']
print title
Result:
How to Pronounce Canapé
Desired result:
How to Pronounce Canapé
This looks like a Unicode issue. I tried encoding and decoding to UTF-8, but the result is still the same. Any ideas?
Your terminal expects UTF-8:
$ locale charmap
UTF-8
Python prints using UTF-8:
>>> sys.stdout.encoding
UTF-8
Change SecureCRT setting to accept UTF-8.
This is quite possibly due to a mismatch between the default encoding Python is using and the console's encoding. It looks like Python is assuming the encoding is UTF-8, but the console is interpreting the output as latin-1.
Instead of \xe9, use \u00e9 if possible. Then pick an appropriate encoding when outputting the unicode string:
print title.encode('latin1')
What encoding is sensible depends on where you are outputting to. Generally, you have to infer it from the environment variables, or maybe let your users make a choice in a configuration file.
PS: If you deal with Unicode strings a lot, I'd recommend switching to Python 3 (e.g. 3.3), if at all possible. Unicode handling is a lot more clear/explicit/sane, there.
I am getting your expected output in my terminal (using Python 2.7.7).
The output you get depends on the encoding set in the terminal. For me, it is set to 'cp437':
>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> sys.stdout.encoding
'cp437'
You can verify that you are getting the correct output with:
print title.encode('cp437')
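More portably, rather than hard-coding cp437, you can ask Python what the attached console reports; a minimal sketch (Python 2), with a utf-8 fallback for when output is piped:
import sys

title = u'How to Pronounce Canap\xe9'
encoding = sys.stdout.encoding or 'utf-8'  # encoding is None when piped
print title.encode(encoding, 'replace')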
Set your default encoding to iso-8859-1 in a sitecustomize.py file in ${pythondir}/lib/site-packages/ as:
import sys
sys.setdefaultencoding('iso-8859-1')
for me it worked with \xe9.

python: unicode in Windows terminal, encoding used?

I am using the Python interpreter in Windows 7 terminal.
I am trying to wrap my head around unicode and encodings.
I type:
>>> s='ë'
>>> s
'\x89'
>>> u=u'ë'
>>> u
u'\xeb'
Question 1: Why is the encoding used in the string s different from the one used in the unicode string u?
I continue, and type:
>>> us=unicode(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal
not in range(128)
>>> us=unicode(s, 'latin-1')
>>> us
u'\x89'
Question 2: I tried the latin-1 encoding on the off chance it would turn the string into a unicode string (actually, I tried a bunch of other ones first, including utf-8). How can I find out which encoding the terminal has used to encode my string?
Question 3: How can I make the terminal print ë as ë instead of '\x89' or u'\xeb'? Hmm, stupid me: print(s) does the job.
I already looked at this related SO question, but no clues from there: Set Python terminal encoding on Windows
Unicode is not an encoding. You encode into byte strings and decode into Unicode:
>>> '\x89'.decode('cp437')
u'\xeb'
>>> u'\xeb'.encode('cp437')
'\x89'
>>> u'\xeb'.encode('utf8')
'\xc3\xab'
The windows terminal uses legacy code pages for DOS. For US Windows it is:
>>> import sys
>>> sys.stdout.encoding
'cp437'
Windows applications use Windows code pages. Python's IDLE will show the Windows encoding:
>>> import sys
>>> sys.stdout.encoding
'cp1252'
Your results may vary.
Avoid Windows Terminal
I'm not going out on a limb by saying that the 'terminal' (more appropriately, the 'DOS prompt') that ships with Windows 7 is absolute junk. It was bad in Windows 95, NT, XP, Vista, and 7. Maybe they fixed it with PowerShell; I don't know. However, it is indicative of the kind of problems that were plaguing OS development at Microsoft at the time.
Output to a file instead
Set the PYTHONIOENCODING environment variable and then redirect the output to a file.
set PYTHONIOENCODING=utf-8
python myscript.py > output.txt
Then, using Notepad++, you can see the UTF-8 version of your output.
Install win-unicode-console
win-unicode-console can fix these problems; you should try it out:
pip install win-unicode-console
If you are interested in a thorough discussion of the issue of Python and command-line output, check out Python issue 1602. Otherwise, just use the win-unicode-console package.
py -m run script.py
This runs it per script; or you can follow the project's directions and apply win_unicode_console.enable() to every invocation by adding it to usercustomize or sitecustomize, as sketched below.
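A minimal sketch of that usercustomize.py / sitecustomize.py hook, assuming the package is installed:
import win_unicode_console
# Swap sys.stdin/stdout/stderr for Unicode-aware console streams so that
# `print` works with non-ASCII text in the Windows console.
win_unicode_console.enable()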
In case others get this page when searching:
The easiest way is to set the code page in the terminal first
CHCP 65001
then run your program. It works well for me.
For PowerShell, start it with
powershell.exe -NoExit /c "chcp.com 65001"
It's from python: unicode in Windows terminal, encoding used?
Read through this Python HOWTO about Unicode after you read this section from the tutorial.
Creating Unicode strings in Python is just as simple as creating normal strings:
>>> u'Hello World !'
u'Hello World !'
To answer your first question: they are different because only when using u'' are you creating a unicode string.
2nd question:
sys.getdefaultencoding()
returns the default encoding
But to quote from the link:
Python users who are new to Unicode sometimes are attracted by the default encoding returned by sys.getdefaultencoding(). The first thing you should know about the default encoding is that you don't need to care about it. Its value should be 'ascii' and it is used when converting byte strings to unicode strings.
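To see that implicit conversion in action, a minimal Python 2 sketch:
import sys

print sys.getdefaultencoding()  # normally 'ascii'
print u'abc' + 'def'            # implicit decode of 'def' succeeds: pure ASCII
try:
    u'abc' + '\xeb'             # implicit decode of a non-ASCII byte string
except UnicodeDecodeError as e:
    print e                     # 'ascii' codec can't decode byte 0xeb ...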
You've answered question 1 as you asked it: the first string is an encoded byte string, but the second is not an encoding at all; it refers to a Unicode code point, which for "LATIN SMALL LETTER E WITH DIAERESIS" is hex eb.
Now, the question of what the first encoding is is an interesting one. I would normally expect it to be either utf-8, or, since you're on Windows, ISO-8859-1 or Win-1252 (which aren't exactly the same thing, but close enough). However, the normal representation of that letter in utf-8 is c3 ab and in Win-1252 it's actually the same as the unicode code-point - ie hex eb. So, it's a bit of a mystery.
It appears you are using code page CP850, which makes sense as this is the historical code page for DOS which has been carried forward to the terminal window.
>>> s
'\x89'
>>> us=unicode(s,'CP850')
>>> us
u'\xeb'
Actually, a unicode object has no 'encoding'. You should read up on Unicode in Python to avoid constant confusion. This presentation looks adequate: http://farmdev.com/talks/unicode/
You are on a Russian version of Windows, right? Your terminal uses cp866, the legacy DOS code page for Russian (GUI applications use cp1251); in cp866, ё is byte 0xf1.
As you've figured out:
>>> a = "ё"
>>> a
'\xf1'
>>> print a
ё
Do you open any file when you get such errors?
If so, try to open it with
import codecs
f = codecs.open('filename.txt','r','utf-8')

Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?

I have seen a few Python scripts which use this at the top of the script. In what cases should one use it?
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
As per the documentation: this allows you to switch from the default ASCII to other encodings such as UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode.
This function is only available at Python start-up time, when Python scans the environment. It has to be called in a system-wide module, sitecustomize.py; after this module has been evaluated, the setdefaultencoding() function is removed from the sys module.
The only way to actually use it is with a reload hack that brings the attribute back.
Also, the use of sys.setdefaultencoding() has always been discouraged, and the function is gone in Python 3: the encoding there is hard-wired to "utf-8", and changing it raises an error.
I suggest some pointers for reading:
http://blog.ianbicking.org/illusive-setdefaultencoding.html
http://nedbatchelder.com/blog/200401/printing_unicode_from_python.html
http://www.diveintopython3.net/strings.html#one-ring-to-rule-them-all
http://boodebr.org/main/python/all-about-python-and-unicode
http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python
tl;dr
The answer is NEVER! (unless you really know what you're doing)
9 times out of 10, the problem can be resolved with a proper understanding of encoding/decoding.
1/10 people have an incorrectly defined locale or environment and need to set:
PYTHONIOENCODING="UTF-8"
in their environment to fix console printing problems.
What does it do?
sys.setdefaultencoding("utf-8") changes the default encoding/decoding used whenever Python 2.x needs to convert a unicode() to a str() (and vice versa) and the encoding is not given, i.e.:
str(u"\u20AC")
unicode("€")
"{}".format(u"\u20AC")
In Python 2.x, the default encoding is set to ASCII and the above examples will fail with:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
(My console is configured as UTF-8, so "€" = '\xe2\x82\xac', hence exception on \xe2)
or
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
sys.setdefaultencoding("utf-8") will allow these to work for me, but won't necessarily work for people who don't use UTF-8. The default of ASCII ensures that assumptions about encoding are not baked into code.
Console
sys.setdefaultencoding("utf-8") also has a side effect of appearing to fix sys.stdout.encoding, used when printing characters to the console. Python uses the user's locale (Linux/OS X/Un*x) or codepage (Windows) to set this. Occasionally, a user's locale is broken and just requires PYTHONIOENCODING to fix the console encoding.
Example:
$ export LANG=en_GB.gibberish
$ python
>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> print u"\u20AC"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
>>> exit()
$ PYTHONIOENCODING=UTF-8 python
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print u"\u20AC"
€
What's so bad with sys.setdefaultencoding("utf-8")?
People have been developing against Python 2.x for 16 years on the understanding that the default encoding is ASCII. UnicodeError exception handling methods have been written to handle string to Unicode conversions on strings that are found to contain non-ASCII.
From https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))
Prior to setting defaultencoding, this code would be unable to decode the "Å" in the ASCII encoding and would then enter the exception handler to guess the encoding and properly turn it into unicode, printing: Angstrom (Å®) runs your business. Once you've set the defaultencoding to utf-8, the code will find that the byte_string can be interpreted as utf-8, so it will mangle the data and return this instead: Angstrom (Ů) runs your business.
Changing what should be a constant will have dramatic effects on modules you depend upon. It's better to just fix the data coming in and out of your code.
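A minimal sketch of that advice, decoding and encoding explicitly at the boundaries (Python 2; input.txt is a hypothetical file assumed to be UTF-8):
# Decode bytes to unicode on the way in, keep unicode inside,
# and encode back to bytes on the way out.
raw = open('input.txt', 'rb').read()
text = raw.decode('utf-8')
result = u"%s runs your business" % text
open('output.txt', 'wb').write(result.encode('utf-8'))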
Example problem
While the setting of defaultencoding to UTF-8 isn't the root cause in the following example, it shows how problems are masked and how, when the input encoding changes, the code breaks in an unobvious way:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u
chmod +x test.py
./test.py
moçambique
moçambique
./test.py > output.txt
Traceback (most recent call last):
File "./test.py", line 5, in <module>
print u
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 2: ordinal not in range(128)
On the shell it works; sending to stdout (redirecting to a file) it does not. So that is one workaround: write to stdout with an explicit encoding.
I took another approach, which is to refuse to run if sys.stdout.encoding is not defined; in other words, you need to export PYTHONIOENCODING=UTF-8 first in order to write to stdout.
import sys
if sys.stdout.encoding is None:
    print >> sys.stderr, "please set PYTHONIOENCODING=UTF-8 in the environment, e.g. export PYTHONIOENCODING=UTF-8, before writing to stdout"
    exit(1)
So, using the same example:
export PYTHONIOENCODING=UTF-8
./test.py > output.txt
will work
The first danger lies in reload(sys).
When you reload a module, you actually get two copies of the module in your runtime. The old module is a Python object like everything else, and it stays alive as long as there are references to it. So half of the objects will be pointing to the old module, and half to the new one. When you later make some change, you will never see it coming when some random object doesn't see it:
(This is IPython shell)
In [1]: import sys
In [2]: sys.stdout
Out[2]: <colorama.ansitowin32.StreamWrapper at 0x3a2aac8>
In [3]: reload(sys)
Out[3]: <module 'sys' (built-in)>
In [4]: sys.stdout
Out[4]: <open file '<stdout>', mode 'w' at 0x00000000022E20C0>
In [11]: import IPython.terminal
In [14]: IPython.terminal.interactiveshell.sys.stdout
Out[14]: <colorama.ansitowin32.StreamWrapper at 0x3a9aac8>
Now, sys.setdefaultencoding() proper
All that it affects is the implicit str<->unicode conversion. Now, utf-8 is the sanest encoding on the planet (backward-compatible with ASCII and all), so the conversion now "just works"; what could possibly go wrong?
Well, anything. And that is the danger.
There may be some code that relies on the UnicodeError being thrown for non-ASCII input, or does the transcoding with an error handler, which now produces an unexpected result. And since all code is tested with the default setting, you're strictly on "unsupported" territory here, and no-one gives you guarantees about how their code will behave.
The transcoding may produce unexpected or unusable results if not everything on the system uses UTF-8 because Python 2 actually has multiple independent "default string encodings". (Remember, a program must work for the customer, on the customer's equipment.)
Again, the worst thing is you will never know that because the conversion is implicit -- you don't really know when and where it happens. (Python Zen, koan 2 ahoy!) You will never know why (and if) your code works on one system and breaks on another. (Or better yet, works in IDE and breaks in console.)
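The point about multiple independent "default string encodings" is easy to check for yourself; a quick diagnostic sketch (Python 2):
import sys
import locale

print sys.getdefaultencoding()       # implicit str <-> unicode conversions
print sys.stdout.encoding            # what `print` targets; None when piped
print locale.getpreferredencoding()  # locale-derived default for text I/O
print sys.getfilesystemencoding()    # encoding used for file names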

Does python's print function handle unicode differently now than when Dive Into Python was written?

I'm trying to work my way through some frustrating encoding issues by going back to basics. In Dive Into Python example 9.14 we have this:
>>> s = u'La Pe\xf1a'
>>> print s
Traceback (innermost last):
File "<interactive input>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> print s.encode('latin-1')
La Peña
But on my machine, this happens:
>>> sys.getdefaultencoding()
'ascii'
>>> s = u'La Pe\xf1a'
>>> print s
La Peña
I don't understand why these are different. Anybody?
The default encoding for print doesn't depend on sys.getdefaultencoding(), but on sys.stdout.encoding. If you launch python with e.g. LANG=C or redirect a python script to a file, the encoding for stdout will be ANSI_X3.4-1968. On the other hand, if sys.stdout is a terminal, it will use the terminal's encoding.
To explain what sys.getdefaultencoding() does -- it's used when implicitly converting strings from/to unicode. In this example, str(u'La Pe\xf1a') with the default ASCII encoding would fail, but with modified default encoding it would encode the string to Latin-1. However setting the default encoding is a horrible idea, you should always use explicit encoding when you want to go from unicode to str.
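A small sketch of the distinction this answer draws, using the question's own string (Python 2, run in a UTF-8 terminal):
import sys

s = u'La Pe\xf1a'
print sys.stdout.encoding  # the terminal's encoding, e.g. 'UTF-8'
print s                    # print encodes using sys.stdout.encoding
try:
    str(s)                 # implicit conversion uses sys.getdefaultencoding()
except UnicodeEncodeError as e:
    print e                # 'ascii' codec can't encode character u'\xf1' ...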
