Understanding Python Unicode and Linux terminal

I have a Python script that writes some strings with UTF-8 encoding. In my script I am mainly using the str() function to cast to string. It looks like this:
mystring = "this is unicode string:" + japanesevalues[1]
# japanesevalues is a list of unicode values; I am sure it is unicode
print mystring
I don't use the Python interactive shell, just the standard Linux Red Hat x86_64 terminal, which I have set to output UTF-8 characters.
If I execute this:
#python myscript.py
this is unicode string: カラダーズ ソフィー
But if I do this:
#python myscript.py > output
I get the typical error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 253-254: ordinal not in range(128)
Why is that?

The terminal has a character set, and Python knows what that character set is, so it will automatically encode your Unicode strings to the byte encoding that the terminal uses, in your case UTF-8.
But when you redirect, you are no longer using the terminal. You are now just using a Unix pipe. That Unix pipe doesn't have a charset, and Python has no way of knowing which encoding you now want, so it will fall back to a default character set.
You have marked your question with "Python-3.x", but your print syntax is Python 2, so I suspect you are actually using Python 2. In that case your sys.getdefaultencoding() is generally 'ascii', and in your case it's definitely so. And of course, you cannot encode Japanese characters as ASCII, so you get an error.
Your best bet when using Python 2 is to encode the string with UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. That means it will not work if your terminal is something else, though, but you can get the terminal encoding from sys.stdout.encoding and use that (it will be None when redirecting under Python 2).
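A minimal sketch of that approach in Python 2 (my sketch, not from the original answer; it assumes mystring is the unicode string built above and falls back to UTF-8 when the encoding cannot be detected):
import sys
enc = sys.stdout.encoding or 'utf-8'  # encoding is None when output is redirected
print mystring.encode(enc)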
In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).

If it outputs to the terminal then Python can examine the value of $LANG to pick a charset. All bets are off if you redirect.
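For example, a quick check from the shell (the exact charset name reported for LANG=C varies by platform; ANSI_X3.4-1968 is what glibc typically calls ASCII):
% LANG=en_US.UTF-8 python -c "import sys; print sys.stdout.encoding"
UTF-8
% LANG=C python -c "import sys; print sys.stdout.encoding"
ANSI_X3.4-1968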

Related

Why does the web server complain about Cyrillic letters when the command line does not?

I have a web server to which I submit a form containing Cyrillic letters. As a result I get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
This message comes from the following line of the code:
ups = 'rrr {0}'.format(body.replace("'","''"))
(body contains Cyrillic letters). Strangely, I cannot reproduce this error message at the Python command line. The following works fine:
>>> body = 'ппп'
>>> ups = 'rrr {0}'.format(body.replace("'","''"))
It's working in the interactive prompt because your terminal is using your locale to determine what encoding to use. Directly from the Python docs:
Whereas the other file-like objects in python always convert to ASCII
unless you set them up differently, using print() to output to the
terminal will use the user’s locale to convert before sending the
output to the terminal.
On the other hand, while your server is running the scripts, there is no such assumption. Everything read as a byte str from a file-like object is treated as ASCII unless otherwise specified. Your Cyrillic characters, presumably encoded as UTF-8, can't be converted; they're far beyond U+007F, the last code point where UTF-8 and ASCII coincide. (Unicode numbers its code points in hex; U+007F is 127 in decimal. ASCII has only 128 code points, 0 through 127, because it uses just one byte, and of that byte only the least significant 7 bits. The most significant bit is always 0.)
Back to your problem. If you want to operate on the body of the file, you'll have to specify that it should be opened with a UTF-8 encoding. (Again, I'm assuming it's UTF-8 because it's information submitted from the web. If it's not -- well, it really should be.) The solution has already been given in other StackOverflow answers, so I'll just link to one of them rather than reiterate what's already been answered. The best answer may vary a little bit depending on your version of Python -- if you let me know in a comment I could give you a clearer recommendation.
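A minimal sketch of that idea in Python 2 (the file name is a placeholder of my choosing; io.open decodes the bytes to unicode for you):
import io
with io.open('body.txt', encoding='utf-8') as f:  # 'body.txt' is a hypothetical path
    body = f.read()  # body is now a unicode object
ups = u'rrr {0}'.format(body.replace(u"'", u"''"))
If body instead arrives as a UTF-8 byte string from the form, decoding it first with body.decode('utf-8') achieves the same thing.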

Unicode conversion issue using Python in Emacs

I'm trying to understand the difference in a bit of Python script behavior when run on the command line vs run as part of an Emacs elisp function.
The script looks like this (I'm using Python 2.7.1 BTW):
import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape")
that is, [in general] take a JSON segment containing unicode characters, dump it to its unicode-escaped version, then decode it back to its unicode representation. When run on the command line, the dumps part of this returns:
'{"Foo": "\\u30b6"}'
which when printed looks like:
'{"Foo": "\u30b6"}'
the decode part of this looks like:
u'{"Foo": "\u30b6"}'
which when printed looks like:
{"Foo": "ザ"}
i.e., the original string representation of the structure, at least in a terminal/console that supports unicode (in my testbed, an xterm). In a Windows console, the output is not correct with respect to the unicode character, but the script does not error out.
In Emacs, the dumps conversion is the same as on the command line (at least as far as confirming with a print), but the decode part blows out with the dreaded:
File "", line 1, in
UnicodeEncodeError: 'ascii' codec can't encode character u'\u30b6' in position 9: ordinal not in range(128)`
I've a feeling I'm missing something basic here with respect to either the script or Emacs (in my testbed 23.1.1). Is there some auto-magic part of print invoking the correct codec/locale that happens at the command line but not in Emacs? I've tried explicitly setting the locale for an Emacs invocation (here's a stub test without the json logic):
"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s'"
produces the same exception, while
"LC_ALL=\"en_US.UTF-8\" python -c 'import sys; enc=sys.stdout.encoding; print enc' "
indicates that the encoding is 'None'.
If I attempt to coerce the conversion using:
"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s.encode(\"utf8\",\"replace\")'"
the error goes away, but the result is the "garbled" version of the string seen in the non-unicode console:
Fooa?¶
Any ideas?
UPDATE: thanks to unutbu -- because the locale identification falls down, the command needs to be explicitly decorated with the UTF-8 encode (see the answer for working directly with a unicode string). In my case, I am getting what is needed from the dumps/decode sequence, so I add the additional required decoration to achieve the desired result:
import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape").encode("utf8","replace")
Note that this is the "raw" Python without the necessary escaping required by Emacs.
As you may have guessed from looking at the original part of this question, I'm using this as part of some JSON formatting logic in Emacs -- see my answer to this question.
The Python wiki page, "PrintFails" says
When Python does not detect the desired character set of the output,
it sets sys.stdout.encoding to None, and print will invoke the "ascii"
codec.
It appears that when Python is run from an elisp function, it cannot detect the desired character set, so it defaults to "ascii". Trying to print unicode then tacitly causes Python to encode the unicode as ASCII, which is the reason for the error.
Replacing u\"Fooザ\" with u\"Foo\\u30b6\" seems to work:
(defun mytest ()
  (interactive)
  (shell-command-on-region
   (point) (point)
   "LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Foo\\u30b6\"; print s.encode(\"utf8\",\"replace\")'"
   nil t))
C-x C-e M-x mytest
yields
Fooザ
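As an alternative sketch (my suggestion, not part of the original answer): since the subprocess cannot detect an encoding, PYTHONIOENCODING (discussed further down this page) can be set explicitly, so the encode() call is no longer needed:
PYTHONIOENCODING=utf-8 python -c 's = u"Foo\u30b6"; print s'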

How to print Unicode in Python 2 when LANG=C

Quite a silly question, I know. Of course normally LANG=C indicates an ASCII terminal
which cannot display Unicode characters. But I nevertheless want to print out the UTF-8 bytes. I use Python 2 (2.6.5, actually):
print '\xc3\xa4', u'\xe4'
This prints 'ä ä' on a Unicode terminal, but the second string causes an error when executed with LANG=C. I don't want Python to be smart but simply convert u'\xe4' to UTF-8 so it's just '\xc3\xa4' in memory.
I tried all combinations of decode(), encode() and unicode() that I can imagine but it seems I missed the right combination.
What I actually want is to read Unicode characters through vi's system() function, like
:echo system('python foo.py')
To encode a unicode string to UTF-8, call .encode('utf-8') on it:
>>> u'\xe4'.encode('utf-8')
'\xc3\xa4'
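Applied to the print from the question (a sketch; both items are now plain byte strings, so the output no longer depends on the locale):
print '\xc3\xa4', u'\xe4'.encode('utf-8')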

UnicodeDecodeError when redirecting to file

I run this snippet twice, in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py >out.txt:
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what's going on behind the curtain in both cases?
The whole key to such encoding problems is to understand that there are in principle two distinct concepts of "string": (1) string of characters, and (2) string/array of bytes. This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes); the relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:
Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. "Characters" for machines also include "drawing instructions" like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.
On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly "understood" (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression "Unicode encoding" as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).
In summary, computers need to internally represent characters with bytes, and they do so through two operations:
Encoding: characters → bytes
Decoding: bytes → characters
Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
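A tiny round trip (Python 2 syntax, matching the question) makes the two operations concrete:
u = u"\u00e9"            # a character string containing é, one code point
b = u.encode("utf-8")    # encoding: characters → bytes, gives '\xc3\xa9'
u2 = b.decode("utf-8")   # decoding: bytes → characters
assert u == u2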
Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python's universal newline file reading mode).
Some more information on Unicode, characters and code points, if you are interested:
Now, what I have called "character" above is what Unicode calls a "user-perceived character". A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called "code points"—these code points can be combined together to form a "grapheme cluster".
Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them "Unicode strings" (like in Python 2).
While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python's \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).
This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
In Python 2, Unicode strings are called… "Unicode strings" (unicode type, literal form u"…"), while byte arrays are "strings" (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called "strings" (str type, literal form "…"), while byte arrays are "bytes" (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
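A short illustration of that last point, in Python 3 (the snake is code point U+1F40D):
s = "\U0001F40D"         # str: a single code point
b = s.encode("utf-8")    # bytes: b'\xf0\x9f\x90\x8d', four bytes
print(s[0], len(s))      # 🐍 1
print(b[0], len(b))      # 240 4 (an int, the first UTF-8 byte 0xf0)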
With these few key points, you should be able to understand most encoding related questions!
Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:
% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8
If your input characters can be encoded with the terminal's encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).
If your input characters cannot be encoded with the terminal's encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (in Python with a UnicodeEncodeError since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user's terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):
% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8
The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:
% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8
If printing to a terminal does not produce what you expect, you can check that the UTF-8 data you put in manually is correct; for instance, your first character (\u001A) is not printable, if I'm not mistaken.
At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:
import codecs
import locale
import sys
# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
For Python 3, you can check one of the questions asked previously on StackOverflow.
Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe Python defaults to the 'ascii' encoding unless explicitly told otherwise. Python can be told what to do when piping output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so the correct encoding is known.
In your case you've printed 4 uncommon characters that your terminal didn't support in its font. Here's some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).
Example 1
Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could support characters in the source that my terminal could not. The encoding is printed to stderr so it can be seen even when stdout is redirected to a file.
#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni
Output (run directly from terminal)
cp437
αßΓπΣσµτΦΘΩδ∞φ
Python correctly determined the encoding of the terminal.
Output (redirected to file)
None
Traceback (most recent call last):
File "C:\ex.py", line 5, in <module>
print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Python could not determine the encoding (None), so it used the 'ascii' default. ASCII only supports converting the first 128 characters of Unicode.
Output (redirected to file, PYTHONIOENCODING=cp437)
cp437
and my output file was correct:
C:\>type out.txt
αßΓπΣσµτΦΘΩδ∞φ
Example 2
Now I'll throw in a character in the source that isn't supported by my terminal:
#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni
Output (run directly from terminal)
cp437
Traceback (most recent call last):
File "C:\ex.py", line 5, in <module>
print uni
File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>
My terminal didn't understand that last Chinese character.
Output (run directly, PYTHONIOENCODING=437:replace)
cp437
αßΓπΣσµτΦΘΩδ∞φ?
Error handlers can be specified with the encoding. In this case unknown characters were replaced with ?. ignore and xmlcharrefreplace are some other options. When using UTF8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
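A quick sketch of those handlers (Python 2, reusing characters from the examples above; the comments show how the resulting bytes render on a cp437 terminal):
#coding: utf8
u = u'αß马'
print u.encode('cp437', 'replace')             # αß?
print u.encode('cp437', 'xmlcharrefreplace')   # αß&#39532;
print u.encode('cp437', 'ignore')              # αß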
Encode it while printing
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")
This is because when you run the script in a terminal, Python encodes the output before sending it to the terminal; when you pipe or redirect, Python does not know which encoding to use, so you have to encode manually when doing I/O.

Return unicode string from python via ajax

I have a small webapp that runs Python on the server side and javascript (jQuery) on the client side.
Now upon a certain request my Python script returns a unicode string, and the client is supposed to put that string inside a div in the browser. However, I get a unicode encode error from Python.
If I run the script from the shell (bash on Debian Linux) the script runs fine and prints the unicode string.
Any ideas ?
Thanks!
EDIT
This is the print statement that causes the error:
print u'öäü°'
This is the error message I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-36: ordinal not in range(128)
However, I only get that message when calling the script via Ajax ( $('#somediv').load('myscript.py'); )
Thank you !
If the Python interpreter can't determine the encoding of sys.stdout, ASCII is used as a fallback. However, the characters in the string are not part of ASCII, so a UnicodeEncodeError exception is raised.
A solution would be to encode the string yourself using something like .encode(sys.stdout.encoding or "utf-8"). This way utf-8 is used as a fallback instead of ascii.
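A minimal sketch of that suggestion in the context of the question (Python 2; when the script is invoked via Ajax/CGI, sys.stdout.encoding is None, so the UTF-8 fallback kicks in):
# -*- coding: utf-8 -*-
import sys
s = u'öäü°'
print s.encode(sys.stdout.encoding or "utf-8")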
