The Windows console has been Unicode-aware for at least a decade, and perhaps as far back as Windows NT. However, for some reason the major cross-platform scripting languages, including Perl and Python, only ever output various 8-bit encodings, which takes a lot of trouble to work around. Perl gives a "wide character in print" warning; Python gives a charmap error and quits. Why on earth, after all these years, do they not just call the Win32 -W APIs that output UTF-16 Unicode instead of forcing everything through the ANSI/codepage bottleneck?
Is it just that cross-platform performance is low priority? Is it that the languages use UTF-8 internally and find it too much bother to output UTF-16? Or are the -W APIs inherently broken to such a degree that they can't be used as-is?
UPDATE
It seems that the blame may need to be shared by all parties. I imagined that the scripting languages could just call wprintf on Windows and let the OS/runtime worry about things such as redirection. But it turns out that even wprintf on Windows converts wide characters to ANSI and back before printing to the console!
Please let me know if this has been fixed; the bug report link seems broken, but my Visual C test code still fails for wprintf and succeeds for WriteConsoleW.
UPDATE 2
Actually, you can print UTF-16 to the console from C using wprintf, but only if you first call _setmode(_fileno(stdout), _O_U16TEXT).
From C you can print UTF-8 to a console whose codepage is set to 65001; however, Perl, Python, PHP, and Ruby all have bugs which prevent this. Perl and PHP corrupt the output by adding additional blank lines after lines which contain at least one wide character. Ruby's output is corrupted slightly differently. Python crashes.
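By way of illustration, here is a minimal Python sketch of the same WriteConsoleW workaround, done via ctypes. It assumes Windows and a real console (it will not work on a redirected stream), and BMP-only text:

import ctypes

kernel32 = ctypes.windll.kernel32
kernel32.GetStdHandle.restype = ctypes.c_void_p  # HANDLE is pointer-sized
STD_OUTPUT_HANDLE = -11

def write_console_w(text):
    # Hand UTF-16 text straight to the console through the -W API,
    # skipping the ANSI codepage conversion entirely. Works only when
    # stdout really is a console, not a pipe or a file.
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    written = ctypes.c_ulong(0)
    # ctypes passes a Python str as wchar_t* (UTF-16 on Windows); len()
    # counts code points, so this sketch assumes BMP-only text.
    kernel32.WriteConsoleW(handle, text, len(text),
                           ctypes.byref(written), None)

write_console_w(u"Příliš žluťoučký kůň úpěl ďábelské ódy\n")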
UPDATE 3
Node.js is the first scripting language that shipped without this problem straight out of the box.
The Python dev team slowly came to realize this was a real problem after it was first reported back at the end of 2007, and 2016 saw a huge flurry of activity to fully understand and fully fix the bug.
The main problem seems to be that it is not possible to use Unicode on Windows using only the standard C library and no platform-dependent or third-party extensions. The languages you mentioned originate from Unix platforms, whose method of implementing Unicode blends well with C (they use normal char* strings, the C locale functions, and UTF-8). If you want to do Unicode in C, you more or less have to write everything twice: once using nonstandard Microsoft extensions, and once using the standard C API functions for all other operating systems. While this can be done, it usually doesn't have high priority because it's cumbersome and most scripting language developers either hate or ignore Windows anyway.
At a more technical level, I think the basic assumption that most standard library designers make is that all I/O streams are inherently byte-based at the OS level, which is true for files on all operating systems, and for all streams on Unix-like systems, with the Windows console being the only exception. Thus the architecture of many class libraries and programming language standards would have to be modified to a great extent if one wanted to incorporate Windows console I/O.
Another more subjective point is that Microsoft just did not do enough to promote the use of Unicode. The first Windows OS with decent (for its time) Unicode support was Windows NT 3.1, released in 1993, long before Linux and OS X grew Unicode support. Still, the transition to Unicode in those OSes has been much more seamless and unproblematic. Microsoft once again listened to the sales people instead of the engineers, and kept the technically obsolete Windows 9x around until 2001; instead of forcing developers to use a clean Unicode interface, they still ship the broken and now-unnecessary 8-bit API interface and invite programmers to use it (look at a few of the recent Windows API questions on Stack Overflow; most newbies still use the horrible legacy API!).
When Unicode came out, many people realized it was useful. Unicode started as a pure 16-bit encoding, so it was natural to use 16-bit code units. Microsoft then apparently said "OK, we have this 16-bit encoding, so we have to create a 16-bit API", not realizing that nobody would use it. The Unix luminaries, however, thought "how can we integrate this into the current system in an efficient and backward-compatible way so that people will actually use it?" and subsequently invented UTF-8, which is a brilliant piece of engineering. Just as when Unix was created, the Unix people thought a bit more, needed a bit longer, and had less financial success, but eventually got it right.
I cannot comment on Perl (but I think that there are more Windows haters in the Perl community than in the Python community); regarding Python, I know that the BDFL (who doesn't like Windows either) has stated that adequate Unicode support on all platforms is a major goal.
A small contribution to the discussion: I am running Czech-localized Windows XP, which almost everywhere uses the CP1250 code page. The funny thing about the console, though, is that it still uses the legacy DOS 852 code page.
I was able to make a very simple Perl script that prints UTF-8-encoded data to the console using:
binmode STDOUT, ":utf8:encoding(cp852)";
I tried various options (including utf16le), but only the above setting printed accented Czech characters correctly.
Edit: I played a little more with the problem and found Win32::Unicode. The module exports a printW function that works properly both when printing to the console and when redirected:
use utf8;
use Win32::Unicode;
binmode STDOUT, ":utf8";
printW "Příliš žluťoučký kůň úpěl ďábelské ódy";
I have to unask many of your questions.
Did you know that
Windows uses UTF-16 for its APIs, but still defaults to the various "fun" legacy encodings (e.g. Windows-1252, Windows-1251) in userspace, including file names, differently for the many localisations of Windows?
you need to encode output, and picking the appropriate encoding for the system is achieved by the locale pragma, and that there is a POSIX standard called locale on which this is built, and that Windows is incompatible with it?
Perl already supported the so-called "wide" APIs once?
Microsoft managed to adapt UTF-8 into their codepage system of character encoding, and you can switch your terminal by issuing the appropriate chcp 65001 command?
Michael Kaplan has a series of blog posts about the cmd console and Unicode that may be informative (while not really answering your question):
Conventional wisdom is retarded, aka What the ##%&* is _O_U16TEXT?
Anyone who says the console can't do Unicode isn't as smart as they think they are
A confluence of circumstances leaves a stone unturned...
PS: Thanks @Jeff for finding the archive.org links.
Are you sure your script would output Unicode correctly on some other platform? The "wide character in print" warning makes me very suspicious.
I recommend looking over this overview.
Why on earth after all these years do they not just simply call the Win32 -W APIs that output UTF-16 Unicode instead of forcing everything through the ANSI/codepage bottleneck?
Because Perl and Python aren't Windows programs. They're Unix programs that happen to have been mostly ported to Windows. As such, they don't like to call Win32 functions unless necessary. For byte-based I/O, it's not necessary; that can be done with the Standard C Library. UTF-16-based I/O is a special case.
Or are the -W APIs inherently broken to such a degree that they can't be used as-is?
I wouldn't say that the -W APIs are inherently broken as much as I'd say that Microsoft's approach to Unicode in C(++) is inherently broken.
No matter how much certain Windows developers insist that programs should use wchar_t instead of char, there are just too many barriers to switching:
Platform dependence:
The use of UTF-16 wchar_t on Windows and UTF-32 wchar_t elsewhere. (The new char16_t and char32_t types may help.)
The non-standardness of UTF-16 filename functions like _wfopen, _wstat, etc. limits the ability to use wchar_t in cross-platform code.
Education. Everybody learns C with printf("Hello, world!\n");, not wprintf(L"Hello, world!\n");. The C textbook I used in college never even mentioned wide characters until Appendix A.13.
The existing zillions of lines of code that use char* strings.
For Perl to fully support Windows in this way, every call to print, printf, say, warn, and die has to be modified.
Is this Windows?
Which version of Windows?
Perl still mostly works on Windows 95
Is this going to the console, or somewhere else?
Once you have that determined, you then have to use a completely different set of API functions.
If you really want to see everything involved in doing this properly, have a look at the source of Win32::Unicode::Console.
On Linux, OpenBSD, FreeBSD, and similar OSes you can usually just call binmode on the STDOUT and STDERR file handles:
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
This assumes that the terminal is using the UTF-8 encoding.
For Python, the relevant issue in the tracker is http://bugs.python.org/issue1602 (as mentioned in the comments). Note that it has been open for 7 years. I tried to publish a working solution (based on information in the issue) as a Python package: https://github.com/Drekin/win-unicode-console, https://pypi.python.org/pypi/win_unicode_console.
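Assuming the package's documented entry point (win_unicode_console.enable(), as described on its PyPI page), usage is a one-liner; a minimal sketch:

# Run this at interpreter startup (e.g. from sitecustomize.py or
# PYTHONSTARTUP) or at the top of a script; it swaps the standard
# streams for console-aware replacements on Windows.
import win_unicode_console
win_unicode_console.enable()

print("Příliš žluťoučký kůň úpěl ďábelské ódy")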
Unicode issues in Perl covers how the Win32 console works with Perl and the transcoding that happens behind the scenes from ANSI to Unicode; albeit not just a Perl issue, it affects other languages too.
Related
I recently tried to understand how the sockets module works in Python. I opened the source code, and by tracing the socket class I found out it uses something like _socket.socket. When I scrolled up and found import _socket, I traced it down and found that the module is located in another folder named DLLs. (I have no idea how, but I do know that you can import files from the Python install location no matter where your own file is located; how does that work? It would be cool if you could answer this doubt too.) Opening the file with Notepad (it had no default extension association) tells me that it has an awkward encoding. Here are the first few lines of _socket.pyd:
MZ ÿÿ ¸ # º ´ Í!¸LÍ!This program cannot be run in DOS mode.
[several more lines of unreadable binary data]
Does anyone have any idea how I can decode this to simple Python code? (I only know that .pyd files are DLL files in Python format.) I also found out from hours of googling that DLL and EXE files have the same encoding, so it would be cool if anyone could give me a link to a decoding tool, or at least a table of this encoding's characters so I can decode it on my own.
DLL and EXE files are both "binary" formats; on Windows, this is the PE format. It is compiled machine code and cannot be reverted back to (nor did it start from) Python code. Python supports calling extensions that are written in C but invoked via Python. The low-level _socket module is written in C, and Python knows how to call into it.
To look at the socket code, you'll need to go find the corresponding C source file in the CPython repository (Modules/socketmodule.c). Alternatively, you can use a disassembler like IDA Pro or Ghidra to give you an assembly representation, though if you don't yet understand binary formats, this may not be of much use. Ghidra (and Hex-Rays for IDA Pro) will also attempt to decompile the assembly, giving you an approximation of the original source, but without variable names, inferred types, and such.
But if you are looking for the python code that sits behind _socket, none exists.
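To see both halves of this at once (where the compiled extension lives, and why it is importable from anywhere), you can ask Python itself; a small sketch, with the printed paths being illustrative only:

import sys
import socket
import _socket

print(_socket.__file__)  # the compiled extension, e.g. ...\DLLs\_socket.pyd
print(socket.__file__)   # the pure-Python wrapper, e.g. ...\Lib\socket.py
print(sys.path)          # the DLLs directory is on this import search path,
                         # which is why "import _socket" works from any script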
Program Differences
Compiled Languages
These languages take source code (C, C++, etc.) and turn it into machine code, which is most directly represented by an assembly language. The program runs natively on the host machine, meaning it doesn't need any sort of interpreter. It's in a format the OS understands. The original source code is lost in the sense that there is no direct mapping back to it. Inferences can be made with advanced decompilers, but they are often imperfect and give only general guesses as to what the original source code looked like. There is no encoding that lets you parse the source code back out of the binary format.
Interpretive Languages
These languages run an interpreter (which is a native program in a format the OS understands, i.e. PE), which reads source code and dynamically turns it into machine code the processor understands. This is how Python works, and it is why the source code is distributed with the program you run. But you can only run the Python code through a Python interpreter.
Managed Languages
These are a bit of a hybrid. They have a compilation step that takes source code and converts it into byte code. This byte code is then run through an interpreter that converts it down into Machine code. So you still need an interpreter (or VM is the more common term) that can run the byte-code, but the source code itself does not have to be present. Many of these can also be decompiled and may give better output than compiled languages, but it's simply inferences made from the underlying code, and not the actual source code that was used to build the binary.
Python can also behave like a managed language in that its interpreter compiles the source into a byte code representation, then acts like a VM in that it executes that byte code. This is what the .pyc files are: the byte code representations of their corresponding .py files.
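You can watch that compile-to-byte-code step with the standard dis module; a small sketch:

import dis

def greet(name):
    # A trivial function to inspect.
    return "Hello, " + name

# Disassemble the byte code the interpreter actually executes; this is
# the same representation that gets cached in .pyc files.
dis.dis(greet)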
Python itself is a program, and it runs on your computer. Python code is read and interpreted, and ultimately executed by executing code in your computer's native instruction set. Well, Python is set up so that it can import and run code that is already in the form of "native" instructions (in this case, written in C and compiled to machine code).
To get a feel for how this works, take a look at the official python.org documentation: Extending Python with C or C++. Enjoy!
I sometimes use emojis in programs to highlight certain parts of the code (in open source libraries). I rarely use more than say 5-6 per script and I find they really stand out due to their colors in a text editor.
Typically, they are transient markers and will be removed when whatever issue they are associated with is closed.
My question is: are emojis liable to cause any issues in the general Python toolchain? This includes, but is not limited to: git, github, pypi, editors, linters, interpreter, CI/CD pipelines, command line usage...
I haven't seen any, but then again I rarely see emojis in code. This is a Python 3 only question, so Python 2 unicode aspects are out.
(This question is not about whether this looks professional or not. That's a valid, but entirely separate consideration.)
Some examples:
# ⚙️ this is where you configure foo
foo.max_cntr = 10
foo.tolerate_duplicates = False
# 🧟♂️🧟♂️🧟♂️ to indicate code to be removed
some dead code
# 👇 very important, don't forget to do this!
bar.deactivate_before_call()
In terms of risks, there aren't really any. If you use them in comments they'll be ignored at runtime anyway, so performance-wise there are no issues.
The main issue that you could run into is that some Linux distributions (distros) DON'T support emojis, so they fall back to some standard placeholder glyph (typically a white rectangle, possibly with a cross through the middle), which could make the comments hard to understand.
But for personal use: no, not really, there are no issues.
TLDR: Probably not, but maybe.
I am coding Python3 in Vim and would like to enable autocompletion.
I must use different computers without internet access. Every computer is running Linux with Vim preinstalled.
I don't want to have to install anything; I just want the simplest way to enable Python 3 completion (even if it is not the best completion), something easy to enable from scratch on a new Linux computer.
Many thanks
Unfortunately Vim by default doesn't do what you want; you are pretty much limited to the Ctrl-P style, which is better than it seems once you get used to it.
However, I also often find myself working on machines that are not allowed to access the internet or have other files placed on them. When I find myself in this situation and am using an unfamiliar language, I sometimes use Vim's dictionary completion: http://vim.wikia.com/wiki/VimTip91
To populate this dictionary I cat/trim/filter the man pages for the language to get a variety of keywords. I cram these into a filetype-specific dictionary: au FileType * execute 'setlocal dict+=~/.vim/words/'.&filetype.'.txt'
Obviously this isn't the greatest solution and it is a bit heavy-handed, but it does provide a certain degree of "what is that function called" type help.
If you have compiled vim with +python3, you can try omnifunc.
Add following to your ~/.vimrc:
au FileType python setl ofu=python3complete#Complete
Then in insert mode, type Ctrl-X Ctrl-O.
See :help omnifunc.
By default, Ctrl-P in insert mode does autocompletion using all the words already present in the file you're editing.
Before I get to the actual question, I will say that although I'm currently working in Python, I will accept a solution in ANY language. I'm mostly a Java programmer, but since Java is pretty limited to its JVM I didn't think it would be possible to create this in Java.
Goal:
I'm trying to make a program that will intercept keyboard events (I've already done this part using pyHook; this is one of the main reasons I am programming this in Python). Based on these events and the context, I need to write Unicode characters (Ancient Greek) into any focused window. (Currently only on Windows, but a uniform solution that works on all OSes seems ideal.) Basically this is a program that allows me (a Classical Languages student) to type Ancient Greek.
Problems:
Everything is working great up until the point where I need to send Unicode characters, like an alpha, delta, or omega, using SendKeys. The hook works perfectly and SendKeys works perfectly with normal ASCII characters. I've tried the following libraries, all to no avail (code example at the bottom):
SendKeysCtypes (contrary to what the blog says, it does NOT support Unicode)
win32com.client using the shell and SendKeys.
SendKeys (Another library doing basically the same thing)
Now that I've outlined my current situation I've got the following questions:
Questions
1. Is it at all possible to use Unicode characters with SendKeys? (Google searches thus far seem to indicate that it is impossible.)
Since this is likely not the case I wonder:
2. Is there any other library capable of sending Unicode characters to the focused window?
Another thing that has crossed my mind is that I might be using the wrong method altogether (the whole simulating-keypress-events approach). Any other solution that will help me reach, or at least get closer to, my goal is VERY welcome.
#coding: utf-8
import time
import win32com
import win32com.client
shell = win32com.client.Dispatch("WScript.Shell")
shell.Run('notepad')
time.sleep(0.1)
shell.AppActivate('kladblok')
shell.SendKeys("When Unicode characters are pasted here, errors ensue", 0)
shell.SendKeys(u"When Unicode characters are pasted here, harmony shall hopefully ensue".encode("utf-16le"), 0)
You have not followed up on questions in comments, so this is necessarily speculative.
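That said, one approach that often comes up for exactly this problem is to bypass SendKeys and synthesize keystrokes with the Win32 SendInput function, whose KEYEVENTF_UNICODE flag carries an arbitrary UTF-16 code unit. A minimal ctypes sketch, assuming BMP-only text (which covers Ancient Greek):

import ctypes
from ctypes import wintypes

user32 = ctypes.windll.user32

INPUT_KEYBOARD = 1
KEYEVENTF_UNICODE = 0x0004
KEYEVENTF_KEYUP = 0x0002
ULONG_PTR = wintypes.WPARAM  # pointer-sized unsigned integer

class KEYBDINPUT(ctypes.Structure):
    _fields_ = (("wVk", wintypes.WORD), ("wScan", wintypes.WORD),
                ("dwFlags", wintypes.DWORD), ("time", wintypes.DWORD),
                ("dwExtraInfo", ULONG_PTR))

class MOUSEINPUT(ctypes.Structure):
    # Included only so the INPUT union below has the size Windows expects.
    _fields_ = (("dx", wintypes.LONG), ("dy", wintypes.LONG),
                ("mouseData", wintypes.DWORD), ("dwFlags", wintypes.DWORD),
                ("time", wintypes.DWORD), ("dwExtraInfo", ULONG_PTR))

class INPUT(ctypes.Structure):
    class _U(ctypes.Union):
        _fields_ = (("ki", KEYBDINPUT), ("mi", MOUSEINPUT))
    _anonymous_ = ("u",)
    _fields_ = (("type", wintypes.DWORD), ("u", _U))

def send_char(ch):
    # KEYEVENTF_UNICODE events put a UTF-16 code unit in wScan; wVk stays 0.
    unit = ord(ch)
    down = INPUT(type=INPUT_KEYBOARD)
    down.ki = KEYBDINPUT(0, unit, KEYEVENTF_UNICODE, 0, 0)
    up = INPUT(type=INPUT_KEYBOARD)
    up.ki = KEYBDINPUT(0, unit, KEYEVENTF_UNICODE | KEYEVENTF_KEYUP, 0, 0)
    events = (INPUT * 2)(down, up)
    user32.SendInput(2, events, ctypes.sizeof(INPUT))

for ch in u"ἄλφα":  # typed into whatever window currently has focus
    send_char(ch)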
I have recently been faced with a rather odd task, one result being the need to use DTMF (aka "Touch-Tone") tones to control a non-X Linux computer's terminal. The computer has a modem which can be accessed through ALSA, and therefore through the sox "rec" program, which is what I am reading the input through. The computer in question is otherwise completely isolated, having no Ethernet or other network interfaces whatsoever. The Goertzel algorithm implementation I am using works very well, as does the eSpeak speech synthesis engine, which is the only source of output; this is supposed to work with any Touch-Tone phone. It reads back both the input (input being octal digits, one ASCII byte at a time) and whatever the dash shell feeds back (the prompt, the output from commands, etc.), using ASCII mnemonics for control characters.
The current method that I am using for interacting with dash and the programs launched through it is the pexpect module. However, I need it to be able to, on demand, read back the entire contents of the line on which the cursor is positioned, and I do not recall pexpect being able to do this (if it can, I cannot tell how). The only other solution that I can think of is to somehow use Python to either control, or act as, the keyboard and console drivers.
Is this, indeed, the only way to go about it (and if so, is it even possible with Python?), or is there another way of having direct access to the contents of the console?
Edit: Through dumb luck, I just recently found that the SVN version of PExpect has pexpect.screen. However, it does not have any way of actually running a program under it. I'll have to keep an eye on its development.
The simple solution is to use the Linux kernel uinput interface. It allows you to insert keypresses and mouse events into the kernel, exactly as if they came from a physical human interface device. This would basically turn your application into a keyboard/mouse.
Since you are working with Python, I recommend you take a look at the python-uinput module.
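For reference, a minimal python-uinput sketch (module API as I remember it, so verify against its documentation); the virtual device must declare up front every key it may emit:

import time
import uinput

# Create a virtual keyboard that is only allowed to emit these keys.
device = uinput.Device([uinput.KEY_L, uinput.KEY_S, uinput.KEY_ENTER])
time.sleep(1)  # give the kernel and udev a moment to register the device

# Type "ls" followed by Enter, exactly as if from a physical keyboard.
for key in (uinput.KEY_L, uinput.KEY_S, uinput.KEY_ENTER):
    device.emit_click(key)  # press and release in one call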
If you are comfortable with binary I/O in Python, then you can certainly do the same without any libraries; just check out the /usr/include/linux/uinput.h header file for the structures involved (the interface is completely stable), and perhaps some uinput tutorials in C, too.
Note that accessing the /dev/uinput or /dev/input/uinput device (depending on your distribution) normally requires root privileges. I would personally run the Python service as a user and group dedicated to the service, and modify/add a udev rule (check all files under rules.d) to allow read-write access to the uinput device to that group, something like
SUBSYSTEM=="input", ENV{ID_INPUT}=="", IMPORT{builtin}="input_id"
KERNEL=="uinput", MODE="0660", GROUP="the-dedicated-group"
However, if your Python application simply executes programs, you should make it a terminal emulator, for example using this. You can also do it without any extra libraries, using the Python pty module; the main work, however, is simulating a terminal with ANSI escape sequences so that applications don't get confused, and existing terminal emulators already have such code.
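A bare-bones sketch of that pty approach, without the escape-sequence handling a real terminal emulator would need:

import os
import pty

captured = bytearray()

def master_read(fd):
    # Called whenever the child produces output; everything the shell
    # writes (prompt, command output, echoes) passes through here.
    data = os.read(fd, 1024)
    captured.extend(data)  # keep a copy to reconstruct the current line
    return data            # still forward it to the real stdout

# Run a shell under a pseudo-terminal.
pty.spawn("/bin/sh", master_read)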
If you want to manipulate the contents of the console, you probably want to use curses. It's well documented here. Look at window.getch() and window.getyx().
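A tiny sketch of those two calls:

import curses

def main(stdscr):
    stdscr.addstr(0, 0, "Press any key...")
    key = stdscr.getch()        # read one keypress
    y, x = stdscr.getyx()       # current cursor position
    stdscr.addstr(1, 0, "key=%d cursor=(%d,%d)" % (key, y, x))
    stdscr.getch()              # wait before tearing the screen down

curses.wrapper(main)            # sets up and restores the terminal safely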