UnicodeDecodeError when redirecting to file - python

I run this snippet twice, in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py >out.txt:
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what's going on behind the curtain in both cases?

The whole key to such encoding problems is to understand that there are in principle two distinct concepts of "string": (1) a string of characters, and (2) a string/array of bytes.

This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes). The relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system).

The correct, modern view properly separates these two string concepts, based on the following two points:
Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. "Characters" for machines also include "drawing instructions" like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.
On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 inclusive), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it is sent to a terminal (which expects characters encoded in a specific way) or saved in a file. In order to be displayed or properly "understood" (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression "Unicode encoding" as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).
In summary, computers need to internally represent characters with bytes, and they do so through two operations:
Encoding: characters → bytes
Decoding: bytes → characters
Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
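As a minimal Python 3 illustration of both operations (and of the non-unique representation just mentioned, here "é" as one precomposed code point versus a base character plus a combining accent):

>>> "é".encode("utf-8")          # encoding: characters -> bytes
b'\xc3\xa9'
>>> b'\xc3\xa9'.decode("utf-8")  # decoding: bytes -> characters
'é'
>>> "\u00e9" == "e\u0301"        # same user-perceived character, different code points
False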
Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python's universal newline file reading mode).
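As a small sketch of what universal newlines do (Python 3, where this translation is the default in text mode; "notes.txt" is just an illustrative file name):

# Write raw bytes with mixed line endings.
with open("notes.txt", "wb") as f:
    f.write(b"line1\r\nline2\nline3\r")
# Text mode reads with universal newlines: every ending becomes '\n'.
with open("notes.txt", encoding="ascii") as f:
    print(f.read().splitlines())  # ['line1', 'line2', 'line3']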
Some more information on Unicode, characters and code points, if you are interested:
Now, what I have called "character" above is what Unicode calls a "user-perceived character". A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called "code points"—these code points can be combined together to form a "grapheme cluster".
Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them "Unicode strings" (like in Python 2).
While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python's \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).
This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
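As a small Python 3 sketch of this, unicodedata.normalize() can compose the three jamo code points into the single precomposed syllable:

import unicodedata

s = "\u1100\u1161\u11a8"                    # one user-perceived character, 3 code points
composed = unicodedata.normalize("NFC", s)  # canonical composition
print(len(s), len(composed))                # 3 1
print(composed == "\uac01")                 # True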
In Python 2, Unicode strings are called… "Unicode strings" (unicode type, literal form u"…"), while byte arrays are "strings" (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called "strings" (str type, literal form "…"), while byte arrays are "bytes" (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
With these few key points, you should be able to understand most encoding related questions!
Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:
% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8
If your input characters can be encoded with the terminal's encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).
If your input characters cannot be encoded with the terminal's encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (with a UnicodeEncodeError, since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user's terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
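As a sketch of how to test this explicitly (Python 2; the helper name is made up for illustration):

import sys

def representable_on_stdout(s):
    # Fall back to ASCII when no encoding is detected (e.g. when redirected).
    encoding = sys.stdout.encoding or 'ascii'
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False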
However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):
% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8
The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:
% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8
If printing to a terminal does not produce what you expect, you should check that the Unicode data you put in manually is correct; for instance, your first character (\u001A) is a control character and is not printable, if I'm not mistaken.
At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:
import codecs
import locale
import sys
# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
For Python 3, you can check one of the questions asked previously on StackOverflow.
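For reference, a minimal Python 3 sketch of the same idea (sys.stdout.reconfigure() exists on Python 3.7+; on older 3.x versions one would rewrap sys.stdout with io.TextIOWrapper instead):

import sys

# Force UTF-8 on stdout even when the output is redirected (Python 3.7+).
sys.stdout.reconfigure(encoding="utf-8")
print("\u001A\u0BC3\u1451\U0001D10C")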

Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal, Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe, Python 2 defaults to the 'ascii' encoding unless explicitly told otherwise. Python can be told what to do when piping output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so the correct encoding is known.
In your case you've printed 4 uncommon characters that your terminal didn't support in its font. Here are some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).
Example 1
Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could support characters in the source that my terminal could not. The encoding is printed to stderr so that it can still be seen when stdout is redirected to a file.
#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni
Output (run directly from terminal)
cp437
αßΓπΣσµτΦΘΩδ∞φ
Python correctly determined the encoding of the terminal.
Output (redirected to file)
None
Traceback (most recent call last):
File "C:\ex.py", line 5, in <module>
print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Python could not determine the encoding (None), so it used the 'ascii' default. ASCII only supports converting the first 128 characters of Unicode.
Output (redirected to file, PYTHONIOENCODING=cp437)
cp437
and my output file was correct:
C:\>type out.txt
αßΓπΣσµτΦΘΩδ∞φ
Example 2
Now I'll throw in a character in the source that isn't supported by my terminal:
#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni
Output (run directly from terminal)
cp437
Traceback (most recent call last):
File "C:\ex.py", line 5, in <module>
print uni
File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>
My terminal didn't understand that last Chinese character.
Output (run directly, PYTHONIOENCODING=437:replace)
cp437
αßΓπΣσµτΦΘΩδ∞φ?
Error handlers can be specified along with the encoding. In this case, unknown characters were replaced with ?. ignore and xmlcharrefreplace are some other options. When using UTF-8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
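The same error handlers are also available directly on encode(); a small Python 2 sketch:

>>> u'\u9a6c'.encode('cp437', 'replace')
'?'
>>> u'\u9a6c'.encode('cp437', 'xmlcharrefreplace')
'&#39532;'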

Encode it while printing
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")
This is because when you run the script manually, Python encodes the output with the terminal's encoding before writing it; when you pipe it, Python cannot detect an encoding (and falls back to ASCII, which fails), so you have to encode manually when doing I/O.
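A slightly more flexible sketch (Python 2) that uses the terminal's encoding when it is known and assumes UTF-8 otherwise:

import sys

uni = u"\u001A\u0BC3\u1451\U0001D10C"
# sys.stdout.encoding is None when the output is redirected or piped.
print uni.encode(sys.stdout.encoding or "utf-8")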

Related

Processing unicode characters in command line arguments

I have a project on Raspberry Pi (Raspbian Debian-based Linux OS) where I have to pass command-line parameters to a Python 3 program. I need to be able to pass a unicode string.
I am not sure how exactly this should be set up. It is clear that there are several conversions the command line string is going through before the data is passed to Python.
Let's start with the fact that I can see the Unicode characters correctly, when I press the required keystrokes in the terminal session. Here is some test code:
$ echo "ā" > test.txt
$ cat test.txt
ā
$ hexdump test.txt
0000000 81c4 000a
0000003
That 0x81c4 word (hexdump displays 16-bit little-endian words), i.e. the two-byte sequence 0xc4 0x81, is "ā" encoded in UTF-8.
Now, if I try to pass the same character to Python, I get a two-character string with weird character codes:
import sys
param = sys.argv[1]
print([hex(ord(char)) for char in param])
$ python test.py ā
['0xdcc4', '0xdc81']
One can notice that the character codes are related to the 0xc4 0x81 byte sequence, but here 0xdc00 has been added to each byte.
If I go in the interactive console, the unicode character manipulation is just the same as with ordinary characters:
>>> txt = 'ā'
>>> len(txt)
1
>>> hex(ord(txt[0]))
'0x101'
0x101 is a correct code point for the character "ā".
So, my question is, how can I reliably convert from the two-character ['0xdcc4', '0xdc81'] string to the one-character string "ā", that would work on all platforms?
I am not sure exactly at which point this happens, but the command line parameters apparently are expected to contain only ASCII characters, and to decode the byte array to a string, bytes.decode(encoding, errors) is used:
param = b'\xc4\x81'.decode('ASCII', 'surrogateescape')
print(param == '\udcc4\udc81') # True
When the decoder stumbles upon a non-ASCII byte, it processes it according to the selected error handler. In this case, the surrogateescape error handler replaces each such byte with an individual surrogate code point in the range U+DC80 to U+DCFF.
So, the way to fix this is to encode the incorrectly decoded string back to a byte array, using the same surrogateescape error handler, and then decode it as UTF-8:
import sys
param = sys.argv[1]
param_unicode = param.encode('ASCII', 'surrogateescape').decode('utf-8')
print(param_unicode)
$ python test.py ā
ā
It should be verified, though, if the command line parameters really are always decoded using the ASCII encoding. Perhaps it is different on different platforms and is configurable.
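One way to avoid hard-coding 'ASCII' in that round-trip is os.fsencode(), which reverses the decoding CPython applies to sys.argv on POSIX (the filesystem encoding with the surrogateescape handler); a sketch, assuming the original argument bytes were UTF-8:

import os
import sys

raw = os.fsencode(sys.argv[1])  # recover the original argument bytes
param = raw.decode('utf-8')     # decode them with the intended encoding
print(param)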

Why does printing these values give different values in different OS and versions?

Why does printing these \x values give different values in different OS and versions?
Example:
print("A"*20+"\xef\xbe\xad\xde")
This gives different output in Python 2 and Python 3, and on different platforms.
In Microsoft's Windows:
Python2: AAAAAAAAAAAAAAAAAAAAï¾Þ
Python3: AAAAAAAAAAAAAAAAAAAAï¾Þ
In Kali:
Python2: AAAAAAAAAAAAAAAAAAAAᆳ
Python3: AAAAAAAAAAAAAAAAAAAAï¾­Þ
UPDATE: What I want is the exact Python 2 output, but with Python 3. I tried many things (encoding, decoding, byte conversion) but realised \xde can't be decoded. Is there any other way to achieve what I want?
It is a question of encoding.
In Latin1 or Windows 1252 encoding, you have:
0xef -> ï (LATIN SMALL LETTER I WITH DIAERESIS)
0xbe -> ¾ (VULGAR FRACTION THREE QUARTERS)
0xad -> soft hyphen (invisible, hence not printed in your examples)
0xde -> Þ (LATIN CAPITAL LETTER THORN)
In utf-8 encoding, you have:
'\xef\xbe\xad' -> u'\uffad' or 'ᆳ' (HALFWIDTH HANGUL LETTER RIEUL-SIOS)
'\xde' -> should raise a UnicodeDecodeError (it starts a multi-byte sequence that is never completed)...
On Windows, Python 2 and Python 3 both use the Windows-1252 code page (in your example). On Kali, Python 2 sees the string as a byte string and the terminal displays it as UTF-8, while Python 3 assumes it already contains Unicode character values and displays them directly.
Since in Latin-1 (and in Windows-1252 for all characters outside 0x80-0x9f) the byte value equals the Unicode code point, that is enough to explain your outputs.
What to learn: be explicit whether strings contains unicode or bytes and beware of encodings!
To get consistent behavior on both Python 2 and Python 3, you'll need to be explicit about your intended output. If you want AAAAAAAAAAAAAAAAAAAAᆳ, then the \xde is garbage; if you want AAAAAAAAAAAAAAAAAAAAï¾Þ, then the \xad is garbage. Either way, the "solution" to printing what you've got is to explicitly use bytes literals and decode them with the desired encoding, ignoring errors. So to get AAAAAAAAAAAAAAAAAAAAᆳ (interpreting as UTF-8), you'd do:
print((b"A"*20+b"\xef\xbe\xad\xde").decode('utf-8', errors='ignore'))
while to get AAAAAAAAAAAAAAAAAAAAï¾Þ you'd do:
# cp1252 can be used instead of latin-1, depending on intent; they overlap in this case
print((b"A"*20+b"\xef\xbe\xad\xde").decode('latin-1', errors='ignore'))
Importantly, note the leading b on the literals. On Python 2.7 it is recognized and ignored (unless from __future__ import unicode_literals is in effect, in which case it is needed just as in Python 3), while on Python 3 it makes the literals bytes literals (no special encoding assumed) rather than str literals, so you can decode in your desired encoding. Either way, you end up with raw bytes, which can then be decoded with the preferred encoding, with errors ignored.
Note that ignoring errors is usually going to be wrong; you're dropping data on the floor. 0xDEADBEEF isn't guaranteed to produce a useful byte string in any given encoding, and if that's not your real data, you're probably still risking errors by wanting to silently ignore undecodeable data.
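If silently dropping bytes is a concern, the 'replace' error handler at least leaves a visible marker instead (a Python 3 sketch):

print((b"A"*20 + b"\xef\xbe\xad\xde").decode('utf-8', errors='replace'))
# AAAAAAAAAAAAAAAAAAAAᆳ�  (U+FFFD marks the undecodable byte)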
If you want to write the raw bytes and let whatever is consuming stdout interpret them however it wants, you need to drop below the print level, since print on Python 3 is purely str based. To write the raw bytes on Python 3, you'd use sys.stdout.buffer (sys.stdout is text based, sys.stdout.buffer is the underlying buffered byte-oriented stream it wraps); you'd also need to manually add the newline (if desired):
sys.stdout.buffer.write(b"A"*20+b"\xef\xbe\xad\xde\n")
vs. on Python 2 where stdout isn't an encoding wrapper:
sys.stdout.write(b"A"*20+b"\xef\xbe\xad\xde\n")
For portable code, you can get a "raw stdout" ahead of time and use that:
# Put this at the top of your file so you don't have to constantly recheck/reacquire
# Gets sys.stdout.buffer if it exists, sys.stdout otherwise
bstdout = getattr(sys.stdout, 'buffer', sys.stdout)
# Works on both Py2 and Py3
bstdout.write(b"A"*20+b"\xef\xbe\xad\xde\n")

Why web-server complains about Cyrillic letters and command line not?

I have a web-server on which I try to submit a form containing Cyrillic letters. As a result I get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
This message comes from the following line of the code:
ups = 'rrr {0}'.format(body.replace("'","''"))
(body contains Cyrillic letters). Strangely I cannot reproduce this error message in the python command line. The following works fine:
>>> body = 'ппп'
>>> ups = 'rrr {0}'.format(body.replace("'","''"))
It's working in the interactive prompt because your terminal is using your locale to determine what encoding to use. Directly from the Python docs:
Whereas the other file-like objects in python always convert to ASCII unless you set them up differently, using print() to output to the terminal will use the user's locale to convert before sending the output to the terminal.
On the other hand, while your server is running the scripts, there is no such assumption. Everything read as a byte str from a file-like object is treated as ASCII in memory unless otherwise specified. Your Cyrillic characters, presumably encoded as UTF-8, can't be converted; they're far beyond the U+007F code point, the upper end of the range where UTF-8 and ASCII coincide. (Unicode uses hex to number its code points; U+007F is 127 in decimal. ASCII has only 128 code points, 0 through 127, because it uses only 1 byte, and of that one byte, only the least-significant 7 bits. The most significant bit is always 0.)
Back to your problem. If you want to operate on the body of the file, you'll have to specify that it should be opened with a UTF-8 encoding. (Again, I'm assuming it's UTF-8 because it's information submitted from the web. If it's not -- well, it really should be.) The solution has already been given in other StackOverflow answers, so I'll just link to one of them rather than reiterate what's already been answered. The best answer may vary a little bit depending on your version of Python -- if you let me know in a comment I could give you a clearer recommendation.
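A short sketch of that idea (Python 2, assuming the body arrived as UTF-8 bytes):

# -*- coding: utf-8 -*-
body = '\xd0\xbf\xd0\xbf\xd0\xbf'  # 'ппп' as UTF-8 bytes
body = body.decode('utf-8')        # make it an explicit unicode string
ups = u'rrr {0}'.format(body.replace(u"'", u"''"))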

Understanding Python Unicode and Linux terminal

I have a Python script that writes some strings with UTF-8 encoding. In my script I am using mainly the str() function to cast to string. It looks like that:
mystring="this is unicode string:"+japanesevalues[1]
#japanesevalues is a list of unicode values, I am sure it is unicode
print mystring
I don't use the Python terminal, just the standard Linux Red Hat x86_64 terminal. I set the terminal to output utf8 chars.
If I execute this:
#python myscript.py
this is unicode string: カラダーズ ソフィー
But if I do that:
#python myscript.py > output
I got the typical error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 253-254: ordinal not in range(128)
Why is that?
The terminal has a character set, and Python knows what that character set is, so it will automatically encode your Unicode strings to the byte encoding that the terminal uses, in your case UTF-8.
But when you redirect, you are no longer using the terminal. You are now just using a Unix pipe. That Unix pipe doesn't have a charset, and Python has no way of knowing which encoding you now want, so it will fall back to a default character set.
You have marked your question with "Python-3.x" but your print syntax is Python 2, so I suspect you are actually using Python 2. And then your sys.getdefaultencoding() is generally 'ascii', and in your case it's definitely so. And of course, you can not encode Japanese characters as ASCII, so you get an error.
Your best bet when using Python 2 is to encode the string with UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. That means it will not work if your terminal is something else, though, but you can get the terminal encoding from sys.stdout.encoding and use that (it will be None when redirecting under Python 2).
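A minimal sketch of that approach (Python 2, with a made-up Japanese literal standing in for japanesevalues[1]):

# -*- coding: utf-8 -*-
import sys

# sys.stdout.encoding is None when stdout is redirected; assume UTF-8 then.
encoding = sys.stdout.encoding or 'utf-8'
mystring = u"this is unicode string: " + u"\u30ab\u30e9\u30c0\u30fc\u30ba"
print mystring.encode(encoding)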
In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).
If it outputs to the terminal then Python can examine the value of $LANG to pick a charset. All bets are off if you redirect.

noob queries on unicode and str methods in Python

I wish to seek some clarifications on Unicode and str methods in Python. After reading some explanation on Unicode, there are still couple of doubts I hope folks can help me on:
Am I right to say that when declaring a unicode string, e.g. word = u'foo', Python uses the encoding of the terminal and decodes foo in e.g. UTF-8, assigning word the hex representation in unicode?
So, in general, is the process of printing out characters in a file, always decoding the byte stream according to the encoding to unicode representation, before displaying the mapped characters out?
In my terminal, why does 'é'.lower() or str('é') display as the hex '\xc3\xa9', whereas 'a'.lower() does not?
First we should be clear we are talking about Python 2 only. Python 3 is different.
You're right. But if you write u"abcd" in a .py file, the declared encoding of the source file will determine how the interpreter decodes your string.
You need to decode it first, and then encode it and print. In Python 2, DON'T print out unicode directly! Otherwise, if the system encodes it in an incompatible way (like "ascii"), an exception will be raised.
You have to do all these explicitly.
The short answer is that "a" doesn't have to be represented as "\x61"; "a" is simply more readable. A longer answer: typically in the interactive shell, if you type a value and press enter, Python will show the repr() of your string. I think repr will try to print everything in an ASCII representation. For "a", it's already ASCII, so it is output directly. The str "é" is a UTF-8 encoded binary stream, so Python escapes each byte and prints it as '\xc3\xa9'.
I don't think Python does any automatic encoding or decoding on console I/O. Consider the following:
>>> 'é'
'\xc3\xa9'
>>> 'é'.decode('UTF-8')
u'\xe9'
You'll notice that \xe9 is the Unicode code point for 'LATIN SMALL LETTER E WITH ACUTE', while \xc3\xa9 is the byte sequence corresponding to the same character in UTF-8.
Everything changes in Python 3, since all strings are Unicode. I'm not sure of the rules there.
See http://www.python.org/dev/peps/pep-0263/ about how to specify encoding of Python source file. For Python interpreter there's PYTHONIOENCODING environment variable.
What OS do you use?
The statement word = u'foo' assigns a unicode string object, not a "hex representation". Unicode objects represent sequences of text characters. Also, it is wrong to think of decoding in this context. Unicode is not an encoding, nor does it "have" an encoding.
Yes. Decode In: Encode Out.
For the repr of a non-unicode string literal, Python will use sys.stdin.encoding; for the repr of a unicode string literal, Python will use "unicode_escape".
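A short interactive sketch (Python 2 on a UTF-8 terminal) of the "Decode In: Encode Out" rule and the two repr behaviors:

>>> s = '\xc3\xa9'.decode('utf-8')  # Decode In: bytes -> unicode
>>> s
u'\xe9'
>>> print s.encode('utf-8')         # Encode Out: unicode -> bytes
é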
