Chinese and Japanese character support in python - python

How to read correctly japanese and chinese characters.
I'm using python 2.5. Output is displayed as "E:\Test\?????????"
path = r"E:\Test\は最高のプログラマ"
t = path.encode()
print t
u = path.decode()
print u
t = path.encode("utf-8")
print t
t = path.decode("utf-8")
print t

Please do read the Python Unicode HOWTO; it explains how to process and include non-ASCII text in your Python code.
If you want to include Japanese text literals in your code, you have several options:
Use unicode literals (create unicode objects instead of byte strings), but any non-ascii codepoint is represented by a unicode escape character. They take the form of \uabcd, so a backslash, a u and 4 hexadecimal digits:
ru = u'\u30EB'
would be one character, the katakana 'ru' codepoint ('ル').
Use unicode literals, but include the characters in some form of encoding. Your text editor will save files in a given encoding (say, UTF-16); you need to declare that encoding at the top of the source file:
# encoding: utf-16
ru = u'ル'
where 'ル' is included without using an escape. The default encoding for Python 2 files is ASCII, so by declaring an encoding you make it possible to use Japanese directly.
Use byte string literals, ready encoded. Encode the codepoints by some other means and include them in your byte string literals. If all you are going to do with them is use them in encoded form anyway, this should be fine:
ru = '\xeb\x30' # ru encoded to UTF16 little-endian
I encoded 'ル' to UTF-16 little-endian because that's the default Windows NTFS filename encoding.
Next problem will be your terminal, the Windows console is notorious for not supporting many character sets out of the box. You probably want to configure it to handle UTF-8 instead. See this question for some details, but you need to run the following command in the console:
chcp 65001
to switch to UTF-8, and you may need to switch to a console font that can handle your codepoints (Lucida perhaps?).

There are two independent issues:
You should specify Python source encoding if you use non-ascii characters and use Unicode literals for data that represents text e.g.:
# -*- coding: utf-8 -*-
path = ur"E:\Test\は最高のプログラマ"
Printing Unicode to Windows console is complicated but if you set correct font then just:
print path
might work.
Regardless of whether your console can display the path; it should be fine to pass the Unicode path to filesystem functions e.g.:
entries = os.listdir(path)
Don't call .encode(char_enc) on bytestrings, call it on Unicode strings instead.
Don't call .decode(char_enc) on Unicode strings, call it on bytestrings instead.

You should force the string to be a unicode object like
path = ur"E:\Test\は最高のプログラマ"
Docs on string literals relevant to 2.5 are located here
Edit: I'm not positive on if the object is a unicode in 2.5 but the docs do state that \uXXXX[XXXX] will be processed and the the string will be "a Unicode string".

Related

when python interpreter loads source file, will it convert file content to unicode in memory?

Say, I have a source file encoded in utf8, when python interpreter loads that source file, will it convert file content to unicode in memory and then try to evaluate source code in unicode?
If I have a string with non ASCII char in it, like
astring = '中文'
and the file is encoded in gbk.
Running that file with python 2, I found that string actually is still in raw gbk bytes.
So I dboubt, python 2 interpret does not convert source code to unicode. Beacause if so, the string content will be in unicode(I heard it is actually UTF16)
Is that right? And if so, how about python 3 interpreter? Does it convert source code to unicode format?
Acutally, I know how to define unicode and raw string in both Python2 and 3.
I'm just curious about one detail when the interpreter loads source code.
Will it convert the WHOLE raw source code (encoded bytes) to unicode at very beginning and then try to interpret unicode format source code piece by piece?
Or instead, it just loads raw source piece by piece, and only decodes what it think should. For example, when it hits the statement u'中文' , OK, decode to unicode. While it hits statment b'中文', OK, no need to decode.
Which way the interpreter will go?
If your source file is encoded with GBK, put this line at the top of the file (first or second line):
# coding: gbk
This is required for both Python 2 and 3.
If you omit this encoding declaration, the interpreter will assume ASCII in the case of Python 2, and UTF-8 for Python 3.
The encoding declaration controls how the interpreter reads the bytes of the source file. This is mainly relevant for string literals (like in your example), but theoretically also applies to comments and even identifiers (it's probably not a good idea to use non-ASCII in identifiers, though).
As for the question whether you get byte strings or unicode strings: this depends on the syntax, not on the choice and declaration of encoding.
As pointed out in Ignacio's answer, if you want to have unicode strings in Python 2, you need to use the u'...' notation.
In Python 3, the u prefix is optional.
So, with a correct encoding declaration in the file header, it is sufficient to write astring = '中文' to get a correct unicode string in Python 3.
Update
By comment, the OP asks about the interpretation of b'中文'.
In Python 3, this isn't allowed (byte strings can only contain ASCII characters), but you can test this yourself in Python 2.x:
# coding: gbk
x = b'中文'
y = u'中文'
print repr(x)
print repr(y)
This will produce:
'\xd6\xd0\xce\xc4'
u'\u4e2d\u6587'
The first line reflects the actual bytes contained in the source file (if you saved it with GBK, of course).
So there seems to be no decoding happening for b'中文'.
However, I don't know how the interpreter internally represents the source code with respect to encoding (that seems to be your question).
This is implementation-dependent anyway, so the answer might even be different for cPython, Jython, IronPython etc.
So I dboubt, python 2 interpret does not convert source code to unicode.
It never does. If you want to use Unicode rather than bytes then you need to use a unicode instead.
astring = u'中文'
Python source is only plain ASCII, meaning that the actual encoding does not matter except for litteral strings, be them unicode strings or byte strings. Identifiers can use non ascii characters (IMHO it would be a very bad practice), but their meaning is normally internal to the Python interpreter, so the way it reads them is not really important
Byte strings are always left unchanged. That means that normal strings in Python 2 and byte litteral strings in Python 3 are never converted.
Unicode strings are always converted:
if the special string coding: charset_name exists in a comment on first or second line, the original byte string is converted as it would be with decode(charset_name)
if not encoding is specified, Python 2 will assume ASCII and Python 3 will assume utf8

Python UTF-8 Latin-1 displays wrong character

I'm writing a very small script that can convert latin-1 characters into unicode (I'm a complete beginner in Python).
I tried a method like this:
def latin1_to_unicode(character):
uni = character.decode('latin-1').encode("utf-8")
retutn uni
It works fine for characters that are not specific to the latin-1 set, but if I try the following example:
print latin1_to_Unicode('å')
It returns Ã¥ instead of å. Same goes for other letters like æ and ø.
Can anyone please explain why this is happening?
Thanks
I have the # -*- coding: utf8 -*- declaration in my script, if it matters any to the problem
Your source code is encoded to UTF-8, but you are decoding the data as Latin-1. Don't do that, you are creating a Mojibake.
Decode from UTF-8 instead, and don't encode again. print will write to sys.stdout which will have been configured with your terminal or console codec (detected when Python starts).
My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:
>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥
You can see that the character uses two bytes; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.
Decoding those two bytes as Latin-1 produces two Unicode codepoints corresponding to the Latin-1 codec.
You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO

Specifying unicode literal's encoding on a per-literal basis

According to the documentation, it is possible to define the encoding of the literals used in the python source like this:
# -*- coding: latin-1 -*-
u = u'abcdé' # This is a unicode string encoded in latin-1
Is there any syntax support to specify the encoding on a literal basis? I am looking for something like:
latin1 = u('latin-1')'abcdé' # This is a unicode string encoded in latin-1
utf8 = u('utf-8')'xxxxx' # This is a unicode string encoded in utf-8
I know that syntax does not make sense, but I am looking for something similar. What can I do? Or is it maybe not possible to have a single source file with unicode strings in different encodings?
There is no way for you to mark a unicode literal as having using a different encoding from the rest of the source file, no.
Instead, you'd manually decode the literal from a bytestring instead:
latin1 = 'abcdé'.decode('latin1') # provided `é` is stored in the source as a E9 byte.
or using escape sequences:
latin1 = 'abcd\xe9'.decode('latin1')
The whole point of the source-code codec line is to support using an arbitrary codec in your editor. Source code should never use mixed encodings, really.

noob queries on unicode and str methods in Python

I wish to seek some clarifications on Unicode and str methods in Python. After reading some explanation on Unicode, there are still couple of doubts I hope folks can help me on:
Am I right to say that when declaring a unicode string e.g word=u'foo', python uses the encoding of the terminal and decodes foo in e.g UTF-8, and assigning word the hex representation in unicode?
So, in general, is the process of printing out characters in a file, always decoding the byte stream according to the encoding to unicode representation, before displaying the mapped characters out?
In my terminal, Why does 'é'.lower() or str('é') displays in hex '\xc3\xa9', whereas 'a'.lower() does not?
First we should be clear we are talking about Python 2 only. Python 3 is different.
You're right. But if you write u"abcd" in a py file, the declaration of the encoding of the source file will determine how the interpreter decode you string.
You need to decode it first, and then encode it and print. In Python 2, DON'T print out unicode directly! Otherwise, if the system is encoding it in an incompatitable way (like "ascii"), an exception will be raised.
You have to do all these explicitly.
The short answer is "a" doesn't have to be represented in "\x61", "a" is simply more readable. A longer answer: typically in the interactive shell, if you type a value and press enter, Python will show the repr() of your string. I think "repr" will try to print everything in ascii representation. For "a", it's already ascii, so it's outputed directly. For str "é", it's UTF-8 encoded binary stream, so Python escape each byte and print as 'xc3\xa9'
I don't think Python does any automatic encoding or decoding on console I/O. Consider the following:
>>> 'é'
'\xc3\xa9'
>>> 'é'.decode('UTF-8')
u'\xe9'
You'll notice that \xe9 is the Unicode code point for 'LATIN SMALL LETTER E WITH ACUTE', while \xc3\xa9 is the byte sequence corresponding to the same character in UTF-8.
Everything changes in Python 3, since all strings are Unicode. I'm not sure of the rules there.
See http://www.python.org/dev/peps/pep-0263/ about how to specify encoding of Python source file. For Python interpreter there's PYTHONIOENCODING environment variable.
What OS do you use?
The statement word = u'foo' assigns a unicode string object, not a "hex representation". Unicode objects represent sequences of text characters. Also, it is wrong to think of decoding in this context. Unicode is not an encoding, nor does it "have" an encoding.
Yes. Decode In: Encode Out.
For the repr of a non-unicode string literal, Python will use sys.stdin.encoding; for the repr of a unicode string literal, Python will use "unicode_escape".

UnicodeDecodeError when redirecting to file

I run this snippet twice, in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py >out.txt:
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what's going on behind the curtain in both cases?
The whole key to such encoding problems is to understand that there are in principle two distinct concepts of "string": (1) string of characters, and (2) string/array of bytes. This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes); the relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:
Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. "Characters" for machines also include "drawing instructions" like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.
On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly "understood" (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression "Unicode encoding" as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).
In summary, computers need to internally represent characters with bytes, and they do so through two operations:
Encoding: characters → bytes
Decoding: bytes → characters
Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python's universal newline file reading mode).
Some more information on Unicode, characters and code points, if you are interested:
Now, what I have called "character" above is what Unicode calls a "user-perceived character". A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called "code points"—these codes points can be combined together to form a "grapheme cluster".
Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them "Unicode strings" (like in Python 2).
While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python's \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).
This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
In Python 2, Unicode strings are called… "Unicode strings" (unicode type, literal form u"…"), while byte arrays are "strings" (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called "strings" (str type, literal form "…"), while byte arrays are "bytes" (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
With these few key points, you should be able to understand most encoding related questions!
Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:
% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8
If your input characters can be encoded with the terminal's encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).
If your input characters cannot be encoded with the terminal's encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (in Python with a UnicodeEncodeError since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user's terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):
% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8
The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:
% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8
If the printing to a terminal does not produce what you expect, you can check the UTF-8 encoding that you put manually in is correct; for instance, your first character (\u001A) is not printable, if I'm not mistaken.
At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:
import codecs
import locale
import sys
# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
For Python 3, you can check one of the questions asked previously on StackOverflow.
Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe Python defaults to the 'ascii' encoding unless explicitly told otherwise. Python can be told what to do when piping output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so the correct encoding is known.
In your case you've printed 4 uncommon characters that your terminal didn't support in its font. Here's some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).
Example 1
Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could support characters in source that my terminal could not. Encoding redirected to stderr so it can be seen when redirected to a file.
#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni
Output (run directly from terminal)
cp437
αßΓπΣσµτΦΘΩδ∞φ
Python correctly determined the encoding of the terminal.
Output (redirected to file)
None
Traceback (most recent call last):
File "C:\ex.py", line 5, in <module>
print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Python could not determine encoding (None) so used 'ascii' default. ASCII only supports converting the first 128 characters of Unicode.
Output (redirected to file, PYTHONIOENCODING=cp437)
cp437
and my output file was correct:
C:\>type out.txt
αßΓπΣσµτΦΘΩδ∞φ
Example 2
Now I'll throw in a character in the source that isn't supported by my terminal:
#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni
Output (run directly from terminal)
cp437
Traceback (most recent call last):
File "C:\ex.py", line 5, in <module>
print uni
File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>
My terminal didn't understand that last Chinese character.
Output (run directly, PYTHONIOENCODING=437:replace)
cp437
αßΓπΣσµτΦΘΩδ∞φ?
Error handlers can be specified with the encoding. In this case unknown characters were replaced with ?. ignore and xmlcharrefreplace are some other options. When using UTF8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
Encode it while printing
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")
This is because when you run the script manually python encodes it before outputting it to terminal, when you pipe it python does not encode it itself so you have to encode manually when doing I/O.

Categories

Resources