Enable Unicode "globally" in Python

Is it possible to avoid having to put this in every page?
# -*- coding: utf-8 -*-
I'd really like Python to default to this.

In Python 3, the default source encoding is UTF-8, so you won't need to set it explicitly anymore. There isn't a way to 'globally' set the default source encoding, though, and history has shown that such global options are generally a bad idea. (For instance, the -U and -Q options to Python, and sys.setdefaultencoding() back when we had it.) You don't (directly) control all the source that gets imported in your program, because it includes the standard library and any third-party modules you use directly or indirectly.
Also note that this isn't enabling Unicode, as your question title suggests. What it does is make the source encoding UTF-8, meaning that any non-ASCII characters in unicode literals (e.g. u'spæm') will be interpreted using that encoding. It won't make non-unicode literals ('spam' and "spam") suddenly unicode, nor will it do anything for non-literals anywhere in your code.
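To see what the declaration actually controls, the standard library's tokenize.detect_encoding reports which encoding Python will use to decode a given source file. A minimal sketch for Python 3; the helper name source_encoding is made up for illustration:

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python would use to decode this source (hypothetical helper)."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# No declaration: Python 3 falls back to its default source encoding.
print(source_encoding(b"s = 'spam'\n"))
# An explicit declaration changes the encoding for this one file only.
print(source_encoding(b"# -*- coding: latin-1 -*-\ns = 'spam'\n"))
```

The declaration is read per file, which is why there is nothing global to switch on: every module, including third-party ones, decides for itself.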

This is a feature of Python 3.0.
It was one of the things done in Python 3 because it would break backward compatibility, so you won't find such a global option in 2.x.

It's a very bad idea for Python 2 because you will be expecting behavior which is only present on your dev machine. This means that when your library goes out to someone else, or to a host server, or elsewhere, any use of it will flood the logs with UnicodeDecodeErrors.


Difference between print and click.echo in Python 3?

I am creating a CLI app for the Unix terminal using the click module. I see two ways to display data:
print(data) and click.echo(data)
What is the difference between them, and which should I use?
Please read at least the library's quickstart before using it. The answer is in the third part of the quickstart.
If you use click, click.echo() is preferred because:
Click attempts to support both Python 2 and Python 3 the same way and to be very robust even when the environment is misconfigured. Click wants to be functional at least on a basic level even if everything is completely broken.
What this means is that the echo() function applies some error correction in case the terminal is misconfigured instead of dying with an UnicodeError.
As an added benefit, starting with Click 2.0, the echo function also has good support for ANSI colors. It will automatically strip ANSI codes if the output stream is a file and if colorama is supported, ANSI colors will also work on Windows. See ANSI Colors for more information.
If you don’t need this, you can also use the print() construct / function.
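The "error correction" idea can be illustrated with the standard library alone. The sketch below is an assumption about the general technique, not click's actual implementation: a hypothetical safe_echo helper replaces characters the output stream cannot encode instead of raising.

```python
import io
import sys


def safe_echo(text, stream=None):
    """Write text plus a newline, replacing characters the stream cannot encode.

    Hypothetical helper for illustration only; this is not how click.echo() works internally.
    """
    stream = sys.stdout if stream is None else stream
    encoding = getattr(stream, "encoding", None) or "ascii"
    # Round-trip through the stream's encoding so unencodable characters become "?".
    printable = text.encode(encoding, errors="replace").decode(encoding)
    stream.write(printable + "\n")


# Simulate a misconfigured terminal that only accepts ASCII.
ascii_terminal = io.TextIOWrapper(io.BytesIO(), encoding="ascii", newline="\n")
safe_echo("sp\u00e6m", ascii_terminal)
ascii_terminal.flush()
```

A plain print() to the same stream would raise UnicodeEncodeError, which is the failure mode the quickstart describes.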

Python 'ascii' encode problems in print statement

System: python 3.4.2 on linux.
I'm working on a Django application (irrelevant), and I encountered a problem where it throws
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
when print is called (!). After quite a bit of digging, I discovered I should check
>>> sys.getdefaultencoding()
'utf-8'
but it was as expected, utf8. I noticed also that os.path.exists throws the same exception when used with a unicode string. So I checked
>>> sys.getfilesystemencoding()
'ascii'
When I used LANG=en_US.UTF-8 the issue disappeared. I understand now why os.path.exists had problems with that. But I have absolutely no clue why print statement is affected by the filesystem setting. Is there a third setting I'm missing? Or does it just assume LANG environment is to be trusted for everything?
Also... I don't get the reasoning here. LANG does not tell what encoding is supported by the filenames. It has nothing to do with that. It's set separately for the current environment, not for the filesystem. Why is python using this setting for filesystem filenames? It makes applications very fragile, as all the file operations just break when run in an environment where LANG is not set or set to C (not uncommon, especially when a web-app is run as root or a new user created specifically for the daemon).
Test code (no actual unicode input needed to avoid terminal encoding pitfalls):
x=b'\xc4\x8c\xc5\xbd'
y=x.decode('utf-8')
print(y)
Question:
is there a good and accepted way of making the application robust to the LANG setting?
is there any real-world reason to guess the filesystem capabilities from environment instead of the filesystem driver?
why is print affected?
LANG is used to determine your locale; if you don't set specific LC_ variables the LANG variable is used as the default.
The filesystem encoding is determined by the LC_CTYPE variable, but if you haven't set that variable specifically, the LANG environment variable is used instead.
Printing uses sys.stdout, a text file configured with the codec your terminal uses. Your terminal settings are also locale-specific; your LANG variable should really reflect what locale your terminal is set to. If that is UTF-8, you need to make sure your LANG variable reflects that. sys.stdout uses locale.getpreferredencoding(False) (like all text streams opened without an explicit encoding set) and on POSIX systems that'll use LC_CTYPE too.
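To check which of these settings is in play, and to sidestep the locale for output, something like the following may help. It is a sketch for Python 3: encoding_report and utf8_writer are hypothetical helper names, and wrapping a binary stream in an explicitly UTF-8 text stream is just one possible way to become independent of LANG.

```python
import io
import locale
import sys


def encoding_report():
    """Collect the encodings discussed above (hypothetical helper)."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "default": sys.getdefaultencoding(),
    }


def utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is always encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# io.BytesIO() stands in for sys.stdout.buffer so the sketch has no side effects.
buffer = io.BytesIO()
out = utf8_writer(buffer)
out.write(b"\xc4\x8c\xc5\xbd".decode("utf-8") + "\n")
out.flush()
```

In a real program you would pass sys.stdout.buffer instead of the BytesIO stand-in. Setting the PYTHONIOENCODING environment variable is another way to override the encoding of the standard streams without touching the code.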

Unicode SendKeys Alternative (Any programming language)

Before I get to the actual question I will say that although I'm currently working in Python I will accept a solution in ANY language. I'm mostly a Java programmer but since Java is pretty limited to its JVM I didn't think it would be possible to create this in Java.
Goal:
I'm trying to make a program that will intercept keyboard events (I've already done this part using pyHook, which is one of the main reasons I am programming this in Python). Based on these events and the context I need to write Unicode characters (Ancient Greek) into any focused window (currently only on Windows, but a uniform solution that works on all OSes would be ideal). Basically this is a program that allows me (a classical languages student) to type Ancient Greek.
Problems:
Everything is working great up until the point where I need to send unicode characters, like an alpha, delta or omega, using sendKeys. The hook works perfectly and SendKeys works perfectly with normal ASCII characters. I've tried the following libraries all to no avail: (Code example at the bottom)
SendKeysCtypes (contrary to what the blog says it does NOT support unicode)
win32com.client using the shell and SendKeys.
SendKeys (Another library doing basically the same thing)
Now that I've outlined my current situation I've got the following questions:
Questions
1. Is it at all possible to use unicode characters with SendKeys? (google searches thus far seem to indicate that it is impossible).
Since this is likely not the case I wonder:
2. Is there any other library capable of sending unicode characters to the focused window?
Another thing that has crossed my mind is that I might be using the wrong method altogether (the whole simulating keypress events thing). Any other solution that will help me reach, or at least get closer to, my goal is VERY welcome.
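# coding: utf-8
import time
import win32com.client  # part of pywin32

shell = win32com.client.Dispatch("WScript.Shell")
shell.Run('notepad')
time.sleep(0.1)
# 'kladblok' is the Notepad window title on a Dutch-language Windows
shell.AppActivate('kladblok')
# Works for ASCII; putting Unicode characters in this string causes errors
shell.SendKeys("When Unicode characters are pasted here, errors ensue", 0)
# Attempt: encode the text as UTF-16LE first, hoping it gets through
shell.SendKeys(u"When Unicode characters are pasted here, harmony shall hopefully ensue".encode("utf-16le"), 0)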
You have not followed up on questions in comments, so this is necessarily speculative.
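One avenue that might be worth exploring on Windows, offered as an assumption rather than a tested solution: the Win32 SendInput function accepts keyboard events flagged KEYEVENTF_UNICODE, where each event carries one UTF-16 code unit instead of a virtual-key code. The portable part of that approach is splitting the text into UTF-16 code units, sketched below for Python 3. The helper name utf16_code_units is hypothetical, and the actual SendInput call (for example via ctypes) is left out because it only runs on Windows.

```python
import struct

KEYEVENTF_UNICODE = 0x0004  # Win32 flag value, shown for reference only; not used below


def utf16_code_units(text):
    """Split text into the UTF-16 code units that would be sent one event at a time."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


greek = "\u03b1\u03b4\u03c9"  # alpha, delta, omega
units = utf16_code_units(greek)
print([hex(unit) for unit in units])
```

Characters outside the Basic Multilingual Plane come out as two code units (a surrogate pair), so they would need two events each.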

Python, UTF-8 filesystem, iso-8859-1 files

I have an application written in Python 2.7 that reads the user's files from the hard drive using os.walk.
The application requires a UTF-8 system locale (we check the env variables before it starts) because we handle files with Unicode characters (audio files with the artist name in it for example), and want to make sure we can save these files with the correct file name to the filesystem.
Some of our users have UTF-8 locales (therefore a UTF-8 fs), but still somehow manage to have ISO-8859-1 files stored on their drive. This causes problems when our code tries to os.walk() these directories as Python throws an exception when trying to decode this sequence of ISO-8859-1 bytes using UTF-8.
So my question is, how do I get python to ignore this file and move on to the next one instead of aborting the entire os.walk(). Should I just roll my own os.walk() function?
Edit: Until now we've been telling our users to use the convmv linux command to correct their filenames, however many users have various different types of encodings (8859-1, 8859-2, etc.), and using convmv requires the user to make an educated guess on what files have what encoding before they run convmv on each one individually.
Please read Unicode filenames, part of the Python Unicode how-to. Most importantly, filesystem encodings are not necessarily the same as the current LANG setting in the terminal.
Specifically, os.walk is built upon os.listdir, and will thus switch between unicode and 8-bit bytes depending on whether or not you give it a unicode path.
Pass it an 8-bit path instead, and your code will work properly, then decode from UTF-8 or ISO 8859-1 as needed.
Use character encoding detection; the chardet module for Python works well for determining the actual encoding with some confidence. "As appropriate": you either know the encoding or you have to guess at it. If you guess wrong with chardet, at least you tried.
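A minimal sketch of the bytes-path approach, written for Python 3 (on Python 2.7 you would pass a plain byte-string path to os.walk instead of calling os.fsencode). The names decode_name and walk_decoded are hypothetical, and the UTF-8-then-ISO-8859-1 fallback order is an assumption; a detector such as chardet could be swapped in for the guessing step.

```python
import os


def decode_name(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode a raw filename, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached when every candidate failed (ISO-8859-1 itself never fails).
    return raw.decode("utf-8", errors="replace")


def walk_decoded(root):
    """Yield (raw_path, display_name) pairs; a bytes root makes os.walk return bytes."""
    for dirpath, _dirnames, filenames in os.walk(os.fsencode(root)):
        for name in filenames:
            yield os.path.join(dirpath, name), decode_name(name)
```

Keeping the raw bytes path next to the decoded name means the file can still be opened even when the guess about its encoding was wrong.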

Why don't scripting languages output Unicode to the Windows console?

The Windows console has been Unicode aware for at least a decade and perhaps as far back as Windows NT. However for some reason the major cross-platform scripting languages including Perl and Python only ever output various 8-bit encodings, requiring much trouble to work around. Perl gives a "wide character in print" warning, Python gives a charmap error and quits. Why on earth after all these years do they not just simply call the Win32 -W APIs that output UTF-16 Unicode instead of forcing everything through the ANSI/codepage bottleneck?
Is it just that cross-platform performance is low priority? Is it that the languages use UTF-8 internally and find it too much bother to output UTF-16? Or are the -W APIs inherently broken to such a degree that they can't be used as-is?
UPDATE
It seems that the blame may need to be shared by all parties. I imagined that the scripting languages could just call wprintf on Windows and let the OS/runtime worry about things such as redirection. But it turns out that even wprintf on Windows converts wide characters to ANSI and back before printing to the console!
Please let me know if this has been fixed since the bug report link seems broken but my Visual C test code still fails for wprintf and succeeds for WriteConsoleW.
UPDATE 2
Actually you can print UTF-16 to the console from C using wprintf but only if you first do _setmode(_fileno(stdout), _O_U16TEXT).
From C you can print UTF-8 to a console whose codepage is set to codepage 65001, however Perl, Python, PHP and Ruby all have bugs which prevent this. Perl and PHP corrupt the output by adding additional blank lines following lines which contain at least one wide character. Ruby has slightly different corrupt output. Python crashes.
UPDATE 3
Node.js is the first scripting language that shipped without this problem straight out of the box.
The Python dev team slowly came to realize this was a real problem: it was first reported back at the end of 2007, and 2016 saw a huge flurry of activity to fully understand and fully fix the bug.
The main problem seems to be that it is not possible to use Unicode on Windows using only the standard C library and no platform-dependent or third-party extensions. The languages you mentioned originate from Unix platforms, whose method of implementing Unicode blends well with C (they use normal char* strings, the C locale functions, and UTF-8). If you want to do Unicode in C, you more or less have to write everything twice: once using nonstandard Microsoft extensions, and once using the standard C API functions for all other operating systems. While this can be done, it usually doesn't have high priority because it's cumbersome and most scripting language developers either hate or ignore Windows anyway.
At a more technical level, I think the basic assumption that most standard library designers make is that all I/O streams are inherently byte-based on the OS level, which is true for files on all operating systems, and for all streams on Unix-like systems, with the Windows console being the only exception. Thus the architecture of many class libraries and programming language standards has to be modified to a great extent if one wants to incorporate Windows console I/O.
Another more subjective point is that Microsoft just did not do enough to promote the use of Unicode. The first Windows OS with decent (for its time) Unicode support was Windows NT 3.1, released in 1993, long before Linux and OS X grew Unicode support. Still, the transition to Unicode in those OSes has been much more seamless and unproblematic. Microsoft once again listened to the sales people instead of the engineers, and kept the technically obsolete Windows 9x around until 2001; instead of forcing developers to use a clean Unicode interface, they still ship the broken and now-unnecessary 8-bit API interface, and invite programmers to use it (look at a few of the recent Windows API questions on Stack Overflow, most newbies still use the horrible legacy API!).
When Unicode came out, many people realized it was useful. Unicode started as a pure 16-bit encoding, so it was natural to use 16-bit code units. Microsoft then apparently said "OK, we have this 16-bit encoding, so we have to create a 16-bit API", not realizing that nobody would use it. The Unix luminaries, however, thought "how can we integrate this into the current system in an efficient and backward-compatible way so that people will actually use it?" and subsequently invented UTF-8, which is a brilliant piece of engineering. Just as when Unix was created, the Unix people thought a bit more, needed a bit longer, had less financial success, but eventually got it right.
I cannot comment on Perl (but I think that there are more Windows haters in the Perl community than in the Python community), but regarding Python I know that the BDFL (who doesn't like Windows either) has stated that adequate Unicode support on all platforms is a major goal.
A small contribution to the discussion: I am running a Czech-localized Windows XP, which almost everywhere uses the CP1250 code page. The funny thing with the console, though, is that it still uses the legacy DOS 852 code page.
I was able to make a very simple Perl script that prints UTF-8 encoded data to the console using:
binmode STDOUT, ":utf8:encoding(cp852)";
I tried various options (including utf16le), but only the above settings printed accented Czech characters correctly.
Edit: I played a little more with the problem and found Win32::Unicode. The module exports the function printW, which works properly both for console output and when redirected:
use utf8;
use Win32::Unicode;
binmode STDOUT, ":utf8";
printW "Příliš žluťoučký kůň úpěl ďábelské ódy";
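The mismatch between the CP1250 code page used by most of the system and the DOS 852 code page used by the console can be reproduced on any platform with Python's built-in codecs. This is only an illustration of why output encoded for one code page looks wrong in the other, not a fix for the console itself:

```python
text = "Příliš žluťoučký kůň úpěl ďábelské ódy"

gui_bytes = text.encode("cp1250")     # code page used almost everywhere on Czech Windows
console_bytes = text.encode("cp852")  # legacy DOS code page still used by the console

# The same text produces different byte sequences in the two code pages...
print(gui_bytes == console_bytes)

# ...so bytes meant for CP1250 turn into mojibake when shown as CP852.
garbled = gui_bytes.decode("cp852", errors="replace")
print(garbled == text)
```

This is why the Perl script had to target cp852 explicitly for the accented characters to display correctly.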
I have to unask many of your questions.
Did you know that
Windows uses UTF-16 for its APIs, but still defaults to the various "fun" legacy encodings (e.g. Windows-1252, Windows-1251) in userspace, including file names, differently for the many localisations of Windows?
you need to encode output, and picking the appropriate encoding for the system is achieved by the locale pragma, and that there is a POSIX standard called locale on which this is built, and Windows is incompatible with it?
Perl already supported the so-called "wide" APIs once?
Microsoft managed to adapt UTF-8 into their codepage system of character encoding, and you can switch your terminal by issuing the appropriate chcp 65001 command?
Michael Kaplan has a series of blog posts about the cmd console and Unicode that may be informative (while not really answering your question):
Conventional wisdom is retarded, aka What the ##%&* is _O_U16TEXT?
Anyone who says the console can't do Unicode isn't as smart as they think they are
A confluence of circumstances leaves a stone unturned...
PS: Thanks @Jeff for finding the archive.org links.
Are you sure your script would output Unicode on some other platform correctly? "wide character in print" warning makes me very suspicious.
I recommend looking over this overview
"Why on earth after all these years do they not just simply call the Win32 -W APIs that output UTF-16 Unicode instead of forcing everything through the ANSI/codepage bottleneck?"
Because Perl and Python aren't Windows programs. They're Unix programs that happen to have been mostly ported to Windows. As such, they don't like to call Win32 functions unless necessary. For byte-based I/O, it's not necessary; this can be done with the Standard C Library. UTF-16-based I/O is a special case.
"Or are the -W APIs inherently broken to such a degree that they can't be used as-is?"
I wouldn't say that the -W APIs are inherently broken as much as I'd say that Microsoft's approach to Unicode in C(++) is inherently broken.
No matter how much certain Windows developers insist that programs should use wchar_t instead of char, there are just too many barriers to switching:
Platform dependence:
The use of UTF-16 wchar_t on Windows and UTF-32 wchar_t elsewhere. (The new char16_t and char32_t types may help.)
The non-standardness of UTF-16 filename functions like _wfopen, _wstat, etc. limits the ability to use wchar_t in cross-platform code.
Education. Everybody learns C with printf("Hello, world!\n");, not wprintf(L"Hello, world!\n");. The C textbook I used in college never even mentioned wide characters until Appendix A.13.
The existing zillions of lines of code that use char* strings.
For Perl to fully support Windows in this way, every call to print, printf, say, warn, and die has to be modified to ask:
Is this Windows?
Which version of Windows?
Perl still mostly works on Windows 95
Is this going to the console, or somewhere else?
Once you have that determined, you then have to use a completely different set of API functions.
If you really want to see everything involved in doing this properly, have a look at the source of Win32::Unicode::Console.
On Linux, OpenBSD, FreeBSD and similar OS's you can usually just call binmode on the STDOUT and STDERR file handles.
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
This assumes that the terminal is using the UTF-8 encoding.
For Python, the relevant issue in the tracker is http://bugs.python.org/issue1602 (as said in comments). Note that it has been open for 7 years. I tried to publish a working solution (based on information in the issue) as a Python package: https://github.com/Drekin/win-unicode-console, https://pypi.python.org/pypi/win_unicode_console.
"Unicode issues in Perl" covers how the Win32 console works with Perl and the transcoding that happens behind the scenes from ANSI to Unicode; this is not just a Perl issue but affects other languages too.
