handling unicode strings in Windows - python

For the first time, I was trying out one of my Python scripts, which deals with unicode characters, on Windows (Vista) and found that it's not working. The script works perfectly okay on Linux and OS X but no joy on Windows. Here is the little script that I tried:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys, codecs
reload(sys)
sys.setdefaultencoding('utf-8')
print "\nDefault encoding\t: %s" % sys.getdefaultencoding()
print "sys.stdout.encoding\t: %s\n" % sys.stdout.encoding
## Unicode strings
ln1 = u"?0>9<8~7|65\"4:3}2{1+_)(*&^%$£#!/`\\][=-"
ln2 = u"mnbvc xzasdfghjkl;'poiuyàtrewq€é#¢."
refStr = u"%s%s" % (ln2,ln1)
print "refSTR: ", refStr
for x in refStr:
print "%s => %s" % (x, ord(u"%s" % x))
When I run the script from Windows CLI, I get this error:
C:\Users\san\Scripts>python uniCode.py
Default encoding : utf-8
sys.stdout.encoding : cp850
refSTR:  Traceback (most recent call last):
File "uniCode.py", line 18, in <module>
print "refSTR; ", refStr
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u20ac' in position
30: character maps to <undefined>
I came across this Python wiki page and tried a few things from there, but that didn't work. Does anyone know what I'm still missing? Any help greatly appreciated. Cheers!!

The Windows console has a Unicode (wide-character) API, but no usable utf-8 support. Python is trying to encode your Unicode characters to the console's 8-bit code page cp850, which obviously won't work for a character like '€'. There's supposedly a code page (chcp 65001) in the Windows console that supports utf-8, but it's severely broken. Read issue 1602 and look at sys_write_stdout.patch and unicode2.py, which use Unicode wide-character functions such as WriteConsoleOutputW and WriteConsoleW. Unfortunately it's a low-priority issue.
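For reference, a minimal sketch (Python 2, Windows only, and only when stdout is a real console rather than a pipe or file) of calling one of those wide-character functions through ctypes, so the text never goes through the cp850 codec:
import ctypes

def write_console_w(text):
    kernel32 = ctypes.windll.kernel32
    handle = kernel32.GetStdHandle(-11)   # -11 is STD_OUTPUT_HANDLE
    written = ctypes.c_ulong(0)
    # ctypes passes the unicode string as a wchar_t* buffer
    kernel32.WriteConsoleW(handle, text, len(text),
                           ctypes.byref(written), None)

write_console_w(u"\u20ac\n")  # the euro sign that cp850 can't encode
Whether the glyph actually renders still depends on the console font; the call itself sidesteps the codec error.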
FYI, you can also use IDLE, or another GUI console (based on pythonw.exe), to run a script that outputs Unicode characters. For example:
C:\pythonXX\Lib\idlelib\idle.pyw -r script.py
But it's not a general solution if you need to write CLI console tools.

setdefaultencoding and getdefaultencoding refer to the encoding the Python interpreter uses internally for implicit str/unicode conversions, while sys.stdout.encoding is the encoding used by your terminal. You can verify this by writing the string to a file instead of printing it in the terminal.
The way to 'fix' this program is to set the terminal encoding to the one you want (utf-8), or to write to a file and open the output in an editor that supports those particular characters.
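For instance, a minimal sketch of the file route (output.txt is a placeholder name):
import codecs

with codecs.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(u"refSTR: \u20ac\u00e9\u00a3\n")  # encodes fine as utf-8
Any editor that understands UTF-8 will then display the characters correctly.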

Related

Unicode error Python3 calendar module

I'm trying to print a simple calendar using the Python calendar module:
import calendar
c = calendar.LocaleTextCalendar(0, 'Russian')
s = c.formatmonth(2016, 5)
print(s)
On Linux it works well, but on Windows I get an error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4-10: character maps to <undefined>
All I can do to avoid the error is print(s.encode('ascii', 'replace').decode('ascii')) (which loses the localized text), so I'm interested in a clean, general solution.
Thanks in advance.
That happens because the Windows console encoding is not Unicode. Unfortunately the fix is not trivial, and there are several ways to work around it.
What is the encoding of your console? You can find out in Python this way
import sys
sys.stdin.encoding
You can try to switch just the current console to the utf-8 code page:
chcp 65001
python myScript.py
In your script make sure that your string is encoded into UTF-8.
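A minimal sketch of that combination, reusing the snippet from the question (it assumes the console was switched with chcp 65001 first, and that a 'Russian' locale is installed):
import calendar
import sys

c = calendar.LocaleTextCalendar(0, 'Russian')
s = c.formatmonth(2016, 5)
# Write utf-8 bytes directly to the underlying buffer, bypassing
# the console codec that raises the UnicodeEncodeError.
sys.stdout.buffer.write(s.encode('utf-8'))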
I solved the issue as follows:
import platform
if platform.system() == 'Windows':
    import locale
    locale.setlocale(locale.LC_ALL, "Russian")
...
print(s) # Works!
Another option is to encode/decode inside print:
print(s.encode('cp1252').decode('cp1251'))
Both approaches work for file output too.

Django shell encoding error (Debian only, Ubuntu fine)

Good day
Can somebody explain what is going on behind the Django manage.py shell console?
The problem is the following. I'm developing a Django app which uses urllib to parse some HTML pages and extract some info from them. That info is in Russian, so it should be unicode (it is an address string in this case). Next, my script feeds it to a third-party module, which fails because the value is not a valid unicode string (I'm trying to geocode a point from the address).
I tried to print the string (the parsed address in this case) to the console with print address, but it fails:
File "<console>", line 1, in <module>
... some useless stacktrace ...
print address
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Now comes the interesting part.
I have two computers: a workstation with Ubuntu and Python 2.7.2, and a Debian Lenny VPS with Python 2.7.2. I start the parser the same way on both machines: by executing python manage.py shell and calling my function from it.
First I got the same error on both installations, but then I noticed that my python encoding is set to 'ascii' (import sys; sys.getdefaultencoding()). And when I put
import sys; reload(sys).setdefaultencoding('utf-8')
into settings.py, the problem goes away on Ubuntu. Now I get proper output there, e.g. г. Челябинск, ул. Кирова, д. 27, КТК Набережный, but this is still not working on Debian.
If I delete the print address line, I get unreadable geolocation errors instead, but again only on Debian. Ubuntu works just fine:
Failed to geodecode address [г. ЧелÑбинÑк, Ñл. 1-ой ÐÑÑилеÑки, 17/1, ÑÑнок ÐÑÑак, 1-з]
No amount of unicode(address).encode('utf-8') magic can help this.
So I just can't get it. What is the difference between the machines that causes me so much trouble?
If you run the following Python script, you'll see what's happening:
# -*- coding: utf-8 -*-
a = r"Челябинск"
print "Encode from UTF-8 to UTF-8:",unicode(a,'utf-8').encode('utf-8')
print "Encode from ISO8859-1 to UTF-8:",unicode(a,'iso8859-1').encode('utf-8')
The output is:
Encode from UTF-8 to UTF-8: Челябинск
Encode from ISO8859-1 to UTF-8: ЧелÑбинÑк
In essence you're taking a string that is already encoded as UTF-8, decoding it as if it were ISO8859-1, and re-encoding the result into UTF-8 a second time.
It's worth checking what the default encoding of the machine is in each case.
If anyone can add to this answer then please do.
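One concrete addition: a minimal sketch of the usual fix, which is to decode the fetched bytes exactly once, with the encoding the page actually uses (utf-8 is assumed here, and the URL is a placeholder):
import urllib2

resp = urllib2.urlopen('http://example.com/page')
raw = resp.read()               # a byte string (str in Python 2)
address = raw.decode('utf-8')   # decode exactly once -> unicode
# hand unicode to the geocoder; encode only at an output boundary:
payload = address.encode('utf-8')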

Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?

I have seen a few Python scripts which use this at the top of the script. In what cases should one use it?
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
As per the documentation: This allows you to switch from the default ASCII to other encodings such as UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode.
This function is only available at Python start-up time, when Python scans the environment. It has to be called in a system-wide module, sitecustomize.py. After this module has been evaluated, the setdefaultencoding() function is removed from the sys module.
The only way to actually use it is with a reload hack that brings the attribute back.
Also, the use of sys.setdefaultencoding() has always been discouraged, and it has become a no-op in py3k. The encoding of py3k is hard-wired to "utf-8" and changing it raises an error.
I suggest some pointers for reading:
http://blog.ianbicking.org/illusive-setdefaultencoding.html
http://nedbatchelder.com/blog/200401/printing_unicode_from_python.html
http://www.diveintopython3.net/strings.html#one-ring-to-rule-them-all
http://boodebr.org/main/python/all-about-python-and-unicode
http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python
tl;dr
The answer is NEVER! (unless you really know what you're doing)
9/10 times the problem can be resolved with a proper understanding of encoding/decoding.
1/10 people have an incorrectly defined locale or environment and need to set:
PYTHONIOENCODING="UTF-8"
in their environment to fix console printing problems.
What does it do?
sys.setdefaultencoding("utf-8") (struck through to avoid re-use) changes the default encoding/decoding used whenever Python 2.x needs to convert a Unicode() to a str() (and vice-versa) and the encoding is not given. I.e:
str(u"\u20AC")
unicode("€")
"{}".format(u"\u20AC")
In Python 2.x, the default encoding is set to ASCII and the above examples will fail with:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
(My console is configured as UTF-8, so "€" = '\xe2\x82\xac', hence exception on \xe2)
or
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
sys.setdefaultencoding("utf-8") will allow these to work for me, but won't necessarily work for people who don't use UTF-8. The default of ASCII ensures that assumptions of encoding are not baked into code
Console
sys.setdefaultencoding("utf-8") also has a side effect of appearing to fix sys.stdout.encoding, used when printing characters to the console. Python uses the user's locale (Linux/OS X/Un*x) or codepage (Windows) to set this. Occasionally, a user's locale is broken and just requires PYTHONIOENCODING to fix the console encoding.
Example:
$ export LANG=en_GB.gibberish
$ python
>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> print u"\u20AC"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
>>> exit()
$ PYTHONIOENCODING=UTF-8 python
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print u"\u20AC"
€
What's so bad with sys.setdefaultencoding("utf-8")?
People have been developing against Python 2.x for 16 years on the understanding that the default encoding is ASCII. UnicodeError exception handling methods have been written to handle string to Unicode conversions on strings that are found to contain non-ASCII.
From https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        # detect_encoding() is the blog post's stand-in for whatever
        # encoding-guessing logic is in use
        return u"%s runs your business" % unicode(byte_string,
                encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))
Previous to setting defaultencoding this code would be unable to decode the “Å” in the ascii encoding and then would enter the exception handler to guess the encoding and properly turn it into unicode. Printing: Angstrom (Å®) runs your business. Once you’ve set the defaultencoding to utf-8 the code will find that the byte_string can be interpreted as utf-8 and so it will mangle the data and return this instead: Angstrom (Ů) runs your business.
Changing what should be a constant will have dramatic effects on modules you depend upon. It's better to just fix the data coming in and out of your code.
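A minimal sketch of that discipline, with the codec stated explicitly at each boundary instead of being inherited from a process-wide default:
text = u"\u20ac"              # unicode everywhere inside the program
data = text.encode('utf-8')   # bytes at the output boundary, codec named
back = data.decode('utf-8')   # bytes decoded exactly once on the way in
assert back == text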
Example problem
While the setting of defaultencoding to UTF-8 isn't the root cause in the following example, it shows how problems are masked and how, when the input encoding changes, the code breaks in an unobvious way:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u
$ chmod +x test.py
$ ./test.py
moçambique
moçambique
$ ./test.py > output.txt
Traceback (most recent call last):
File "./test.py", line 5, in <module>
print u
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 2: ordinal not in range(128)
It works in the shell but fails when stdout is redirected to a file, so writing to a redirected stdout is what needs a workaround.
My approach is to refuse to run if sys.stdout.encoding is not defined; in other words, to require export PYTHONIOENCODING=UTF-8 before anything is written to stdout:
import sys
if sys.stdout.encoding is None:
    print >> sys.stderr, "Please set PYTHONIOENCODING=UTF-8 (e.g. export PYTHONIOENCODING=UTF-8) before writing to stdout."
    sys.exit(1)
So, using the same example:
export PYTHONIOENCODING=UTF-8
./test.py > output.txt
will work
The first danger lies in reload(sys).
When you reload a module, you actually get two copies of it in your runtime. The old module is a Python object like everything else, and it stays alive as long as there are references to it. So half of your objects may be pointing to the old module and half to the new one. When you later make some change, you will never see it coming when some random object doesn't see the change:
(This is IPython shell)
In [1]: import sys
In [2]: sys.stdout
Out[2]: <colorama.ansitowin32.StreamWrapper at 0x3a2aac8>
In [3]: reload(sys)
Out[3]: <module 'sys' (built-in)>
In [4]: sys.stdout
Out[4]: <open file '<stdout>', mode 'w' at 0x00000000022E20C0>
In [11]: import IPython.terminal
In [14]: IPython.terminal.interactiveshell.sys.stdout
Out[14]: <colorama.ansitowin32.StreamWrapper at 0x3a9aac8>
Now, sys.setdefaultencoding() proper
All that it affects is the implicit conversion between str and unicode. Now, utf-8 is the sanest encoding on the planet (backward-compatible with ASCII and all), so the conversion now "just works"; what could possibly go wrong?
Well, anything. And that is the danger.
There may be some code that relies on the UnicodeError being thrown for non-ASCII input, or that does the transcoding with an error handler, which now produces an unexpected result. And since all code is tested with the default setting, you're strictly in "unsupported" territory here, and no one gives you any guarantees about how their code will behave.
The transcoding may produce unexpected or unusable results if not everything on the system uses UTF-8 because Python 2 actually has multiple independent "default string encodings". (Remember, a program must work for the customer, on the customer's equipment.)
Again, the worst thing is you will never know that because the conversion is implicit -- you don't really know when and where it happens. (Python Zen, koan 2 ahoy!) You will never know why (and if) your code works on one system and breaks on another. (Or better yet, works in IDE and breaks in console.)
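To make the implicitness concrete, a minimal sketch of where the hidden conversion fires in Python 2 (mixing str and unicode triggers a decode with the default codec):
euro_utf8 = '\xe2\x82\xac'             # utf-8 bytes for the euro sign
try:
    label = u'cost: ' + euro_utf8      # implicit euro_utf8.decode('ascii')
except UnicodeDecodeError:
    # the ASCII default fails loudly right here; a utf-8 default would
    # let this line "work" and move any failure somewhere less obvious
    label = u'cost: ' + euro_utf8.decode('utf-8')
print label.encode('utf-8')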

Converting from ascii to utf-8 with Python

I have an XMPP bot written in Python. One of its plugins is able to execute OS commands and send the output to the user. As far as I know, the output should be a unicode string to be sent over the XMPP protocol. So I tried to handle it this way:
output = os.popen(cmd).read()
if not isinstance(output, unicode):
    output = unicode(output, 'utf-8', 'ignore')
bot.send(xmpp.Message(mess.getFrom(),output))
But when Russian symbols appear in output they aren't converted well.
sys.getdefaultencoding()
says that default command prompt encoding is 'ascii', but when I try to do
output.decode('ascii')
in the Python console, I get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 1:
ordinal not in range(128)
OS: Win XP, Python 2.5.4
PS: Sorry for my English :(
sys.getdefaultencoding() returns Python's default encoding - which is ASCII unless you have changed it. ASCII doesn't support Russian characters.
You need to work out what encoding the actual text is, either manually, or using the locale module.
Typically something like:
import locale
encoding = locale.getpreferredencoding(do_setlocale=True)
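A minimal sketch of using that result to decode the popen output ('dir' stands in for the question's cmd); note the caveat in other answers below that the console may actually use a different (OEM) code page than the one the locale reports:
import locale
import os

encoding = locale.getpreferredencoding(do_setlocale=True)
output = os.popen('dir').read()            # byte string in some code page
text = output.decode(encoding, 'replace')  # unicode, best-effort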
ASCII has no defined character values above 127 (0x7F). Perhaps you mean the Cyrillic code page? It's 866.
See http://en.wikipedia.org/wiki/Code_page
Edit: since this answer was marked correct, presumably 866 worked, but as other answers have pointed out, 866 is not the only Russian-language code page. If you use a code page different from the one that was used when the Russian symbols were encoded, you will get the wrong result.
You say """sys.getdefaultencoding() says that default command prompt encoding is 'ascii'"""
sys.getdefaultencoding says NOTHING about the "command prompt" encoding.
On Windows, sys.stdout.encoding should do the job. On my machine, it contains cp850 when Python is run in a Command Prompt window, and cp1252 in IDLE. Yours should contain cp866 and cp1251 respectively.
Update: You say that you still need cp866 in IDLE. Note this:
IDLE 2.6.4
>>> import os
>>> os.popen('chcp').read()
'Active code page: 850\n'
>>>
So when your app starts up, check if you are on Windows and if so, parse the result of os.popen('chcp').read(). The text before the : is probably locale-dependent. codepage = result.split()[-1] may be good enough "parsing". On Unix, which doesn't have a Windows/MS-DOS split personality, sys.stdout.encoding should be OK.
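A minimal sketch of that startup check ('cp' + number is how Python names the Windows code page codecs, e.g. 'cp866'):
import os
import sys

def console_codepage():
    if os.name == 'nt':
        # e.g. 'Active code page: 866\n' -> 'cp866'; the label text is
        # locale-dependent, so take only the trailing number
        return 'cp' + os.popen('chcp').read().split()[-1].rstrip('.')
    return sys.stdout.encoding

print repr(console_codepage())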
In Python, 'cp855', 'cp866', 'cp1251', 'iso8859_5' and 'koi8_r' are distinct Russian encodings. You'll need to use the right one to decode the output of popen. In the Windows console, the chcp command shows the code page used by console commands. That won't necessarily be the same code page that Windows applications use. On US Windows, 'cp437' is used for the console and 'cp1252' for applications like Notepad.

Unicode problems in PyObjC

I am trying to figure out PyObjC on Mac OS X, and I have written a simple program to print out the names in my Address Book. However, I am having some trouble with the encoding of the output.
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *
ab = ABAddressBook.sharedAddressBook()
people = ab.people()
for person in people:
    name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
    name
When I run this program, the output looks something like this:
...snip...
u'Jacob \xc5berg'
u'Fernando Gonzales'
...snip...
Could someone please explain why the strings are in unicode, but the content looks like that?
I have also noticed that when I try to print the name I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 6: ordinal not in range(128)
# -*- coding: UTF-8 -*-
only affects the way Python decodes comments and string literals in your source, not the way standard output is configured, etc, etc. If you set your Mac's Terminal to UTF-8 (Terminal, Preferences, Settings, Advanced, International dropdown) and emit Unicode text to it after encoding it in UTF-8 (print name.encode("utf-8")), you should be fine.
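A minimal sketch of that fix applied to the loop from the question (assuming the Terminal profile really is set to UTF-8):
for person in people:
    name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
    print name.encode("utf-8")   # encode explicitly before it hits stdout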
If you run the code in your question in the interactive console the interpreter will print the repr of "name" because of the last statement of the loop.
If you change the last line of the loop from just "name" to "print name" the output should be fine. I've tested this with Terminal.app on a 10.5.7 system.
Just writing the variable name sends repr(name) to the standard output, and repr() escapes all non-ASCII characters in unicode values.
print tries to convert u'Jacob \xc5berg' to ASCII, which doesn't work. Try writing it to a file.
See Print Fails on the python wiki.
That means you're using a legacy, limited or misconfigured console. If you're just trying to play with unicode at an interactive prompt, move to a modern unicode-aware console. Most modern Python distributions come with IDLE, where you'll be able to print all unicode characters.
Convert it to a unicode string through:
print unicode(name)
