I deal with non-US languages and also sometimes still have to write in Python 2.x. Reading this article by Brett Cannon: http://www.snarky.ca/why-python-3-exists makes me wonder: if I use strings that are only characters and not bytes, should I prepend all my strings with u, to avoid a potential mix-up between byte strings and unicode strings? And does this also apply to Jython?
And one last question: -*- coding: utf-8 -*- is completely independent of the above, specifying only the encoding of the file itself, correct?
Yes, you want to keep text in unicode objects (the str type in Python 3), and maintain a Unicode sandwich (decode incoming data as soon as possible, postpone encoding until the data needs to exit your application). See Ned Batchelder's excellent Unicode presentation.
This also applies to Jython, which is just another implementation of the Python language.
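For instance, a minimal Python 2 sketch of the sandwich (assuming UTF-8 input and output) might look like this:
import sys

raw = sys.stdin.read()             # bytes come in at the boundary
text = raw.decode('utf-8')         # decode as soon as possible
result = text.upper()              # do all processing on unicode objects
sys.stdout.write(result.encode('utf-8'))  # encode only on the way out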
The PEP 263 source code encoding declaration tells the interpreter what codec to use when decoding the bytes of your source code. It helps when defining Unicode literals with non-ASCII bytes, but doesn't dictate how data other than the source code is encoded or decoded.
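A small sketch of that distinction on Python 2 (the filename input.txt is just an illustration):
# -*- coding: utf-8 -*-
s = u'héllo'                       # source bytes decoded per the declaration
data = open('input.txt').read()    # file contents stay raw bytes in Python 2;
                                   # the declaration says nothing about them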
Related
Say I have a source file encoded in UTF-8. When the Python interpreter loads that source file, will it convert the file content to Unicode in memory and then try to evaluate the source code in Unicode?
If I have a string with a non-ASCII character in it, like
astring = '中文'
and the file is encoded in GBK.
Running that file with Python 2, I found that the string is actually still raw GBK bytes.
So I suspect the Python 2 interpreter does not convert source code to Unicode, because if it did, the string content would be in Unicode (I heard it is actually UTF-16).
Is that right? And if so, what about the Python 3 interpreter? Does it convert source code to Unicode?
Actually, I know how to define unicode strings and raw strings in both Python 2 and 3.
I'm just curious about one detail when the interpreter loads source code.
Will it convert the WHOLE raw source code (encoded bytes) to Unicode at the very beginning, and then try to interpret the Unicode-format source code piece by piece?
Or will it instead load the raw source piece by piece, and only decode what it thinks it should? For example, when it hits the statement u'中文', OK, decode it to Unicode; when it hits the statement b'中文', OK, no need to decode.
Which way does the interpreter go?
If your source file is encoded with GBK, put this line at the top of the file (first or second line):
# coding: gbk
This is required for both Python 2 and 3.
If you omit this encoding declaration, the interpreter will assume ASCII in the case of Python 2, and UTF-8 for Python 3.
The encoding declaration controls how the interpreter reads the bytes of the source file. This is mainly relevant for string literals (like in your example), but theoretically also applies to comments and even identifiers (it's probably not a good idea to use non-ASCII in identifiers, though).
As for the question whether you get byte strings or unicode strings: this depends on the syntax, not on the choice and declaration of encoding.
As pointed out in Ignacio's answer, if you want to have unicode strings in Python 2, you need to use the u'...' notation.
In Python 3, the u prefix is optional.
So, with a correct encoding declaration in the file header, it is sufficient to write astring = '中文' to get a correct unicode string in Python 3.
Update
In a comment, the OP asks about the interpretation of b'中文'.
In Python 3, this isn't allowed (bytes literals may only contain ASCII characters), but you can test this yourself in Python 2.x:
# coding: gbk
x = b'中文'
y = u'中文'
print repr(x)
print repr(y)
This will produce:
'\xd6\xd0\xce\xc4'
u'\u4e2d\u6587'
The first line reflects the actual bytes contained in the source file (if you saved it with GBK, of course).
So there seems to be no decoding happening for b'中文'.
However, I don't know how the interpreter internally represents the source code with respect to encoding (that seems to be your question).
This is implementation-dependent anyway, so the answer might even be different for CPython, Jython, IronPython, etc.
So I suspect the Python 2 interpreter does not convert source code to Unicode.
It never does. If you want Unicode rather than bytes, then you need to use a unicode literal instead:
astring = u'中文'
Python source is essentially plain ASCII, meaning that the actual encoding does not matter except for literal strings, be they Unicode strings or byte strings. Identifiers can use non-ASCII characters (in Python 3; in my opinion it would be a very bad practice), but their meaning is normally internal to the Python interpreter, so the way it reads them is not really important.
Byte strings are always left unchanged. That means that normal strings in Python 2 and byte literal strings in Python 3 are never converted.
Unicode strings are always converted:
if the special string coding: charset_name appears in a comment on the first or second line, the original byte string is converted as it would be with decode(charset_name)
if no encoding is specified, Python 2 will assume ASCII and Python 3 will assume UTF-8 (you can check which encoding the interpreter will pick with the sketch below)
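On Python 3, you can observe this detection step directly with the standard tokenize module; a small sketch (some_module.py is a placeholder filename):
import tokenize

with open('some_module.py', 'rb') as f:
    encoding, first_lines = tokenize.detect_encoding(f.readline)
print(encoding)   # e.g. 'gbk' if declared in a coding comment, 'utf-8' if nothing is declared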
It's not a real problem in practice, since I can just write BOM = "\uFEFF"; but it bugs me that I have to hard-code a magic constant for such a basic thing. [Edit: And it's error-prone! I had accidentally written the BOM as \uFFFE in this question, and nobody noticed. It even led to an incorrect proposed solution.] Surely Python defines it in a handy form somewhere?
Searching turned up a series of constants in the codecs module: codecs.BOM, codecs.BOM_UTF8, and so on. But these are bytes objects, not strings. Where is the real BOM?
This is for Python 3, but I would be interested in the Python 2 situation for completeness.
There isn't one. The bytes constants in codecs are what you should be using.
This is because you should never see a BOM in decoded text (i.e., you shouldn't encounter a string that actually contains the code point U+FEFF). Rather, the BOM exists as a byte pattern at the start of a stream, and when you decode some bytes with a BOM, the U+FEFF isn't included in the output string. Similarly, the encoding process should handle adding any necessary BOM to the output bytes; it shouldn't be in the input string.
The only time a BOM matters is when either converting into or converting from bytes.
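A quick Python 3 sketch of both directions:
import codecs

raw = codecs.BOM_UTF8 + 'hello'.encode('utf-8')   # bytes with a leading BOM
print(raw.decode('utf-8-sig'))   # 'hello' -- the BOM is consumed by the codec
print(raw.decode('utf-8'))       # '\ufeffhello' -- plain utf-8 leaves it in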
I suppose you could use:
unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
but it's not as clean as what you already have
I spent a few angry hours looking for a problem with Unicode strings that boiled down to something Python (2.7) hides from me, and I still don't understand it. First, I tried to use u"..." strings consistently in my code, but that resulted in the infamous UnicodeEncodeError. I tried using .encode('utf8'), but that didn't help either. Finally, it turned out I shouldn't use either, and it all works out automagically.
However (and here I need to give credit to a friend who helped me), I did notice something weird while banging my head against the wall: sys.getdefaultencoding() returns ascii, while sys.stdout.encoding returns UTF-8. Case 1 in the code below works fine without any modifications to sys, while case 2 raises a UnicodeEncodeError. If I change the default system encoding with reload(sys).setdefaultencoding("utf8"), then case 2 works fine.
My question is: why are the two encoding variables different in the first place, and how do I manage to use the wrong encoding in this simple piece of code? Please don't send me to the Unicode HOWTO; I've obviously read it, like the tens of other questions about UnicodeEncodeError.
# -*- coding: utf-8 -*-
import sys
class Token:
    def __init__(self, string, final=False):
        self.value = string
        self.final = final

    def __str__(self):
        return self.value

    def __repr__(self):
        return self.value
print(sys.getdefaultencoding())
print(sys.stdout.encoding)
# 1.
myString = "I need 20 000€."
tok = Token(myString)
print(tok)
reload(sys).setdefaultencoding("utf8")
# 2.
myString = u"I need 20 000€."
tok = Token(myString)
print(tok)
My question is why the two encoding variables are different in the first place
They serve different purposes.
sys.stdout.encoding should be the encoding that your terminal uses to interpret text; otherwise you may get mojibake in the output. It may be utf-8 in one environment, cp437 in another, etc.
sys.getdefaultencoding() is used on Python 2 for implicit conversions, when the encoding is not set explicitly. That is, Python 2 may mix ASCII-only bytestrings and Unicode strings together; for example, xml.etree.ElementTree stores text in the ASCII range as bytestrings, and json.dumps() returns an ASCII-only bytestring instead of Unicode in Python 2, probably for performance: bytes were cheaper than Unicode for representing ASCII characters. Implicit conversions are forbidden in Python 3.
sys.getdefaultencoding() is always 'ascii' on all systems in Python 2 unless you override it, which you should not do: it may hide bugs, and your data may easily be corrupted by implicit conversions using an encoding that is wrong for that data.
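You can watch the implicit conversion happen (and fail) in a Python 2 session:
>>> u'abc' + b'def'           # b'def' is silently decoded with 'ascii'
u'abcdef'
>>> u'abc' + b'\xc3\xa9'      # non-ascii bytes make the implicit decode blow up
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)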
By the way, there is a third common encoding, sys.getfilesystemencoding(), which may be different from the other two. sys.getfilesystemencoding() should be the encoding that is used to encode OS data (filenames, command-line arguments, environment variables).
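A sketch that prints all three on Python 2 (the exact values depend on your environment):
import sys

print sys.stdout.encoding           # terminal encoding, e.g. UTF-8 or cp437
print sys.getdefaultencoding()      # always ascii on Python 2 (unless overridden)
print sys.getfilesystemencoding()   # OS data encoding, e.g. utf-8, or mbcs on Windows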
The source code encoding declared using # -*- coding: utf-8 -*- may be different from all of the already-mentioned encodings.
Naturally, if you read data from a file or the network, it may use character encodings different from all of the above; e.g., if a file created in Notepad is saved using a Windows ANSI encoding such as cp1252, then on another system all the standard encodings can be different from it.
The point being: there can be multiple encodings for reasons unrelated to Python. To avoid the headache, use Unicode to represent text: convert encoded text to Unicode as soon as possible on input, and encode it to bytes (possibly using a different encoding) as late as possible on output. This is the so-called Unicode sandwich.
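io.open gives you the sandwich conveniently on both Python 2 and 3 (the filenames here are just illustrative):
import io

with io.open('names.txt', encoding='utf-8') as f:   # decode on the way in
    names = [line.strip() for line in f]            # unicode inside the program

with io.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(names))                      # encode on the way out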
how do I manage to use the wrong encoding in this simple piece of code?
Your first code example is not fine: it uses non-ASCII literal characters in a byte string on Python 2, which you should not do. Use bytestring literals only for binary data (or for so-called native strings, if necessary). The code may produce mojibake such as I need 20 000Γé¼. (notice the character noise) if you run it under Python 2 in an environment that does not use a utf-8-compatible encoding, such as the Windows console.
The second code example is OK, assuming reload(sys) is not part of it. If you don't want to prefix all string literals with u'', you could use from __future__ import unicode_literals.
Your actual issue is the UnicodeEncodeError, and reload(sys) is not the right solution!
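A small sketch of the future import on Python 2:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

myString = 'I need 20 000€.'   # a unicode object on Python 2, no u prefix needed
print type(myString)           # <type 'unicode'>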
The correct solution is to configure your locale properly on POSIX (LANG, LC_CTYPE), to set the PYTHONIOENCODING envvar if the output is redirected to a pipe/file, or to install win-unicode-console to print Unicode to the Windows console.
I have noticed the same behaviour in some standard code (the mailman library).
Thanks for your analysis, it helped me save some time. :-)
The problem is exactly the same: my system uses sys.getdefaultencoding() and gets ascii, which is inappropriate for handling a list of 1000 UTF-8 encoded names.
There is a mismatch between stdin/stdout and even the filesystem encoding (utf-8) on one hand and the "defaultencoding" on the other (ascii). The thread How to print UTF-8 encoded text to the console in Python < 3? seems to indicate that this is well known, and Changing default encoding of Python? contains some indication that a more homogeneous approach (like "utf-8 everywhere") would break other things, such as the hash implementation.
For that reason it is also not straightforward to change the default encoding. (See http://blog.ianbicking.org/illusive-setdefaultencoding.html for various ways to do so.) The setdefaultencoding function is removed from the sys module by site.py at startup.
We are moving from latin1 to UTF-8 and have 100k lines of python code.
Plus, I'm new to Python (ha-ha-ha!).
I already know that str() fails when it receives a Unicode string containing non-ASCII characters, so we should use unicode() instead, with almost the same effect.
What are the other "dangerous" places in the code?
Are there any basic guidelines/algorithms for moving to UTF-8? Can an automatic 'code transformer' be written?
str and unicode are classes, not functions. When you call str(u'abcd') you are constructing a new str object from the argument u'abcd'. It just so happens that str() can be used to convert a string of any type to an ASCII str.
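A Python 2 session makes the behaviour (and its limits) visible:
>>> str(u'abcd')           # works: every character is in the ascii range
'abcd'
>>> str(u'20 000\u20ac')   # the implicit ascii encode fails on the euro sign
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 6: ordinal not in range(128)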
Other areas to look out for are when reading from a file/input, or basically anything you get back as a string from a function that was not written for unicode.
Enjoy :)
Can an automatic 'code transformer' be written? =)
No. str and unicode are two different types which have different purposes. You should not attempt to replace every occurrence of a byte string with a Unicode string, neither in Python 2 nor Python 3.
Continue to use byte strings for binary data. In particular anything you're writing to a file or network socket is bytes. And use Unicode strings for user-facing text.
In between there is a grey area of internal ASCII-character strings which could equally be bytes or Unicode. In Python 2 these are typically bytes, in Python 3 typically Unicode. If you are happy to limit your code to Python 2.6+, you can mark your definitely-bytes strings as b'' and bytes, your definitely-characters strings as u'' and unicode, and use '' and str for the "whatever the default type of string is" strings.
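A sketch of that convention on Python 2.6+ (the variable names are illustrative):
data = b'\x00\x01\x02'   # definitely bytes: the same type on 2.x and 3.x
text = u'caf\xe9'        # definitely characters: unicode on 2.x, str on 3.x
name = 'config'          # native string: bytes on 2.x, unicode on 3.x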
One way to quickly convert Python 2.x code to use a default encoding of UTF-8 is to set the default encoding. This approach has its downsides, primarily that it changes the encoding for all libraries as well as your application, so use it with caution. My company uses that technique in our production apps and it suits us well. It's also forward-compatible with Python 3, where the default encoding is UTF-8. You'll still have to change references of str() to unicode(), but you won't have to explicitly specify the encoding with .decode() and .encode().
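For reference, the trick being described is usually spelled like this; note that the other answers here strongly discourage it, and it works only on Python 2:
import sys
reload(sys)                        # restores setdefaultencoding, which
sys.setdefaultencoding('utf-8')    # site.py deletes at interpreter startup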
I know that Django uses unicode strings all over the framework instead of normal Python strings. What encoding do normal Python strings use? And why don't they use Unicode?
In Python 2: Normal strings (Python 2.x str) don't have an encoding: they are raw data.
In Python 3: These are called "bytes" which is an accurate description, as they are simply sequences of bytes, which can be text encoded in any encoding (several are common!) or non-textual data altogether.
For representing text, you want unicode strings, not byte strings. By "unicode strings", I mean unicode instances in Python 2 and str instances in Python 3. Unicode strings are sequences of unicode codepoints represented abstractly without an encoding; this is well-suited for representing text.
Bytestrings are important because to represent data for transmission over a network or writing to a file or whatever, you cannot have an abstract representation of unicode, you need a concrete representation of bytes. Though they are often used to store and represent text, this is at least a little naughty.
This whole situation is complicated by the fact that, while you should turn unicode into bytes by calling encode and turn bytes into unicode using decode, Python 2 will try to do this automagically for you, using a global encoding you can set that is ASCII by default (the safest choice). Never depend on this in your code, and never, ever change it to a more flexible encoding: explicitly decode when you get a bytestring, and encode if you need to send a string somewhere external.
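Explicit conversion at the boundaries looks like this (works on both Python 2 and 3):
text = u'20 000\u20ac'              # unicode: the abstract representation
raw = text.encode('utf-8')          # concrete bytes for a file or socket
assert raw.decode('utf-8') == text  # decoding restores the original text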
Hey! I'd like to add some stuff to other answers, unfortunately I don't have enough rep yet to do that properly :-(
FWIW, Mike Graham's post is pretty good and that's probably what you should be reading first.
Here's a few comments:
The need to prefix unicode literals with "u" in 2.x is pretty easily removed in recent (2.6+) 2.x Pythons: from __future__ import unicode_literals
Similarly, ASCII is only the default source encoding. Python understands a variety of coding hints including the emacs-style # -*- coding: utf-8 -*-. For more information see PEP 0263. Changing the source encoding affects how Unicode literals (regardless of their prefix or lack of prefix, as affected by point 1) are interpreted. In Py3k, the default file encoding is UTF-8.
Python of course does use an encoding internally for Unicode strings (str in py3k, unicode in 2.x) because at some point in time stuff's going to have to be written to memory. Ideally, this would never be evident to the end user. Unfortunately nothing's perfect, and you can occasionally run into problems with this: specifically if you use funky squiggles outside of the Unicode Basic Multilingual Plane.
Since Python 2.2, we've had what are called wide builds and narrow builds; these names refer to the type used internally to store Unicode code points. Wide builds use UCS-4, which uses 4 bytes to store a Unicode code point. (This means UCS-4's code unit size is 4 bytes, or 32 bits.) Narrow builds use 16-bit code units, which cannot represent every code point in a single unit; code points outside the BMP are stored as UTF-16-style surrogate pairs, so a single such character shows up with length 2.
To check, test the value of sys.maxunicode. If it's 1114111, you've got a wide build (which can correctly represent all of Unicode). If it's less (65535), well, don't fret too much: the BMP (code points 0x0000 to 0xFFFF) covers most people's needs. For more information, see PEP 0261.
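The check itself is a one-liner:
import sys
print(sys.maxunicode)   # 1114111 on a wide build, 65535 on a narrow build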
What encoding do normal Python strings use?
In Python 3.x
str is Unicode. This may be either UTF-16 or UTF-32 depending on whether your Python interpreter was built with "narrow" or "wide" Unicode characters.
The Windows version of CPython uses UTF-16. On Unix-like systems, UTF-32 tends to be preferred.
In Python 2.x
str is a byte string type like C char. The encoding isn't defined by the language, but is whatever your locale's default encoding is. Or whatever the MIME charset of the document you got off the Internet is. Or, if you get a string from a function like struct.pack, it's binary data, and doesn't meaningfully have a character encoding at all.
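For instance, a str holding packed binary data on Python 2:
import struct

packed = struct.pack('>I', 2015)   # 4 bytes: '\x00\x00\x07\xdf'
# 'packed' is an ordinary str in Python 2, but asking what "characters"
# it contains is meaningless; it has no character encoding at all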
unicode strings in 2.x are equivalent to str in 3.x.
and why don't they use unicode?
Because Python (slightly) predates Unicode. And because Guido wanted to save all the major backwards-incompatible changes for 3.0. Strings in 3.x do use Unicode by default.
From Python 3.0 on, all strings are unicode by default; there is also the bytes datatype (Python documentation).
So the Python developers think that using unicode is a good idea; that it is not used universally in Python 2 is mostly due to backwards compatibility. It also has performance implications.
Python 2.x strings are 8-bit, nothing more. The encoding may vary (though ASCII is assumed). I guess the reasons are historical: few languages, especially languages that date back to the last century, used Unicode right away.
In Python 3, all strings are unicode.
Before Python 3.0, the string encoding was ASCII by default, but it could be changed. Unicode string literals were u"...". This was silly.