Unicode in Python - just UTF-16?

Unicode in Python - just UTF-16? - python

I was happy in my Python world knowing that I was doing everything in Unicode and encoding as UTF-8 when I needed to output something to a user. Then, one of my colleagues sent me the "The UTF-8 Everywhere' manifesto" (2012) and it confused me.
The author of the article claims a number of times that UCS-2, the Unicode representation that Python uses is synonymous with UTF-16.
He even goes as far as directly saying Python uses UTF-16 for internal string representation.
The author also admits to being a Windows lover and developer and states that the way MS has handled character encodings over the years has led to that group being the most confused so maybe it is just his own confusion. I don't know...
Can somebody please explain what the state of UTF-16 vs Unicode is in Python? Are they synonymous and if not, in what way?

The internal representation of a Unicode string in Python (versions from 2.2 up to 3.2) depends on whether Python was compiled in wide or narrow modes. Most Python builds are narrow (you can check with sys.maxunicode -- it is 65535 on narrow builds and 1114111 on wide builds).
With a wide build, strings are internally sequences of 4-byte wide characters, i.e. they use the UTF-32 encoding. All code points are exactly one wide-character in length.
With a narrow build, strings are internally sequences of 2-byte wide characters, using UTF-16. Characters beyond the BMP (code points U+10000 and above) are stored using the usual UTF-16 surrogate pairs:
>>> q = u'\U00010000'
>>> len(q)
2
>>> q[0]
u'\ud800'
>>> q[1]
u'\udc00'
>>> q
u'\U00010000'
Note that UTF-16 and UCS-2 are not the same. UCS-2 is a fixed-width encoding: every code point is encoded as 2 bytes. Consequently, UCS-2 cannot encode code points beyond the BMP. UTF-16 is a variable-width encoding; code points outside the BMP are encoded using a pair of characters, called a surrogate pair.
Note that this all changes in 3.3, with the implementation of PEP 393. Now, Unicode strings are represented using characters wide enough to hold the largest code point -- 8 bits for ASCII strings, 16 bits for BMP strings, and 32 bits otherwise. This does away with the wide/narrow divide and also helps reduce the memory usage when many ASCII-only strings are used.

Related

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 32173: character maps to <undefined> [duplicate]

I am quite confused about the concept of character encoding.
What is Unicode, GBK, etc? How does a programming language use them?
Do I need to bother knowing about them? Is there a simpler or faster way of programming without having to trouble myself with them?

ASCII is fundamental
Originally 1 character was always stored as 1 byte. A byte (8 bits) has the potential to distinct 256 possible values. But in fact only the first 7 bits were used. So only 128 characters were defined. This set is known as the ASCII character set.
0x00 - 0x1F contain steering codes (e.g. CR, LF, STX, ETX, EOT, BEL, ...)
0x20 - 0x40 contain numbers and punctuation
0x41 - 0x7F contain mostly alphabetic characters
0x80 - 0xFF the 8th bit = undefined.
French, German and many other languages needed additional characters. (e.g. à, é, ç, ô, ...) which were not available in the ASCII character set. So they used the 8th bit to define their characters. This is what is known as "extended ASCII".
The problem is that the additional 1 bit has not enough capacity to cover all languages in the world. So each region has its own ASCII variant. There are many extended ASCII encodings (latin-1 being a very popular one).
Popular question: "Is ASCII a character set or is it an encoding" ? ASCII is a character set. However, in programming charset and encoding are wildly used as synonyms. If I want to refer to an encoding that only contains the ASCII characters and nothing more (the 8th bit is always 0): that's US-ASCII.
Unicode goes one step further
Unicode is a great example of a character set - not an encoding. It uses the same characters like the ASCII standard, but it extends the list with additional characters, which gives each character a codepoint in format u+xxxx. It has the ambition to contain all characters (and popular icons) used in the entire world.
UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table. But they each have a slightly different way on how to encode them. UTF-8 will only use 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding. But for other characters, it will use the first bit to indicate that a 2nd byte will follow.
GBK is an encoding, which just like UTF-8 uses multiple bytes. The principle is pretty much the same. The first byte follows the ASCII standard, so only 7 bits are used. But just like with UTF-8, The 8th bit can be used to indicate the presence of a 2nd byte, which it then uses to encode one of 22,000 Chinese characters. The main difference, is that this does not follow the Unicode character set, by contrast it uses some Chinese character set.
Decoding data
When you encode your data, you use an encoding, but when you decode data, you will need to know what encoding was used, and use that same encoding to decode it.
Unfortunately, encodings aren't always declared or specified. It would have been ideal if all files contained a prefix to indicate what encoding their data was stored in. But still in many cases applications just have to assume or guess what encoding they should use. (e.g. they use the standard encoding of the operating system).
There still is a lack of awareness about this, as still many developers don't even know what an encoding is.
Mime types
Mime types are sometimes confused with encodings. They are a useful way for the receiver to identify what kind of data is arriving. Here is an example, of how the HTTP protocol defines it's content type using a mime type declaration.
Content-Type: text/html; charset=utf-8
And that's another great source of confusion. A mime type describes what kind of data a message contains (e.g. text/xml, image/png, ...). And in some cases it will additionally also describe how the data is encoded (i.e. charset=utf-8). 2 points of confusion:
Not all mime types declare an encoding. In some cases it is only optional or sometimes completely pointless.
The syntax charset=utf-8 adds up to the semantic confusion, because as explained earlier, UTF-8 is an encoding and not a character set. But as explained earlier, some people just use the 2 words interchangeably.
For example, in the case of text/xml it would be pointless to declare an encoding (and a charset parameter would simply be ignored). Instead, XML parsers in general will read the first line of the file, looking for the <?xml encoding=... tag. If it's there, then they will reopen the file using that encoding.
The same problem exists when sending e-mails. An e-mail can contain a html message or just plain text. Also in that case mime types are used to define the type of the content.
But in summary, a mime type isn't always sufficient to solve the problem.
Data types in programming languages
In case of Java (and many other programming languages) in addition to the dangers of encodings, there's also the complexity of casting bytes and integers to characters because their content is stored in different ranges.
a byte is stored as a signed byte (range: -128 to 127).
the char type in java is stored in 2 unsigned bytes (range: 0 - 65535)
a stream returns an integer in range -1 to 255.
If you know that your data only contains ASCII values. Then with the proper skill you can parse your data from bytes to characters or wrap them immediately in Strings.
// the -1 indicates that there is no data
int input = stream.read();
if (input == -1) throw new EOFException();
// bytes must be made positive first.
byte myByte = (byte) input;
int unsignedInteger = myByte & 0xFF;
char ascii = (char)(unsignedInteger);
Shortcuts
The shortcut in java is to use readers and writers and to specify the encoding when you instantiate them.
// wrap your stream in a reader.
// specify the encoding
// The reader will decode the data for you
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
As explained earlier for XML files it doesn't matter that much, because any decent DOM or JAXB marshaller will check for an encoding attribute.

(Note that I'm using some of these terms loosely/colloquially for a simpler explanation that still hits the key points.)
A byte can only have 256 distinct values, being 8 bits.
Since there are character sets with more than 256 characters in the character set one cannot in general simply say that each character is a byte.
Therefore, there must be mappings that describe how to turn each character in a character set into a sequence of bytes. Some characters might be mapped to a single byte but others will have to be mapped to multiple bytes.
Those mappings are encodings, because they are telling you how to encode characters into sequences of bytes.
As for Unicode, at a very high level, Unicode is an attempt to assign a single, unique number to every character. Obviously that number has to be something wider than a byte since there are more than 256 characters :) Java uses a version of Unicode where every character is assigned a 16-bit value (and this is why Java characters are 16 bits wide and have integer values from 0 to 65535). When you get the byte representation of a Java character, you have to tell the JVM the encoding you want to use so it will know how to choose the byte sequence for the character.

Character encoding is what you use to solve the problem of writing software for somebody who uses a different language than you do.
You don't know how what the characters are and how they are ordered. Therefore, you don't know what the strings in this new language will look like in binary and frankly, you don't care.
What you do have is a way of translating strings from the language you speak to the language they speak (say a translator). You now need a system that is capable of representing both languages in binary without conflicts. The encoding is that system.
It is what allows you to write software that works regardless of the way languages are represented in binary.

Most computer programs must communicate with a person using some text in a natural language (a language used by humans). But computers have no fundamental means for representing text: the fundamental computer representation is a sequence of bits organized into bytes and words, with hardware support for interpreting sequences of bits as fixed width base-2 (binary) integers and floating-point real numbers. Computer programs must therefore have a scheme for representing text as sequences of bits. This is fundamentally what character encoding is. There is no inherently obvious or correct scheme for character encoding, and so there exist many possible character encodings.
However, practical character encodings have some shared characteristics.
Encoded texts are divided into a sequence of characters (graphemes).
Each of the known possible characters has an encoding. The encoding of a text consists of the sequence of the encoding of the characters of the text.
Each possible (allowed) character is assigned a unique unsigned (non negative) integer (this is sometimes called a code point). Texts are therefore encoded as a sequence of unsigned integers. Different character encodings differ in the characters they allow, and how they assign these unique integers. Most character encodings do not allow all the characters used by the many human writing systems (scripts) that do and have existed. Thus character encodings differ in which texts they can represent at all. Even character encodings that can represent the same text can represent it differently, because of their different assignment of code points.
The unsigned integer encoding a character is encoded as a sequence of bits. Character encodings differ in the number of bits they use for this encoding. When those bits are grouped into bytes (as is the case for popular encodings), character encodings can differ in endianess. Character encodings can differ in whether they are fixed width (the same number of bits for each encoded character) or variable width (using more bits for some characters).
Therefore, if a computer program receives a sequence of bytes that are meant to represent some text, the computer program must know the character encoding used for that text, if it is to do any kind of manipulation of that text (other than regarding it as an opaque value and forwarding it unchanged). The only possibilities are that the text is accompanied by additional data that indicates the encoding used or the program requires (assumes) that the text has a particular encoding.
Similarly, if a computer program must send (output) text to another program or a display device, it must either tell the destination the character encoding used or the program must use the encoding that the destination expects.
In practice, almost all problems with character encodings are caused when a destination expects text sent using one character encoding, and the text is actually sent with a different character encoding. That in turn is typically caused by the computer programmer not bearing in mind that there exist many possible character encodings, and that their program can not treat encoded text as opaque values, but must convert from an external representation on input and convert to an external representation on output.

When to raise UnicodeTranslateError?

The standard library documentation says:
exception UnicodeTranslateError
Raised when a Unicode-related error occurs during translating.
But translation is never defined. Doing a grep through the cpython source I can't see any examples of this class being raised as an error from anything. What is this exception used for and what's the difference between it and the Decode exception which seems to be used much more frequently?

Unicode has room for more than 1 million code points. (At the moment "only" about 150,000 of them are assigned to characters, but in theory more than 1M can be used.) The number of all available code points, written as a binary number, has 21 binary digits, which means you need 21 bit or at least 3 byte to encode all code points.
But the most used characters have unicode code points then need less than 10 bit, many even less than 8 bit. So, a 3-byte encoding would waste a lot of space when you use it to encode texts that contain mainly characters with low code points.
On the other hand a 3-byte encoding has disadvantages when processing in modern computers because the CPU prefers packages of 2, 4 or 8 byte.
And so there are different encodings for unicode strings:
UTF-32 uses 32-bit fields to encode unicode characters. This encoding performs very fast in computers, but also wastes a lot of space in memory.
UCS-4 is just another name for UTF-32. The number 4 means: exactly 4 bytes (which are 32 bit).
USC-2 uses 2 byte and therefore 16-bit fields. You need only half of the memory, but not all existing unicode code points can be encoded in UCS-2.
UTF-16 also uses 16-bit fields, but here also 2 of these fields can be used together to encode one character. So also UTF-16 can be used to encode all possible unicode codepoints.
UTF-8 uses 1-byte-fields. So in theory you need between 1 and 3 byte to encode every code point, but you also must add the information, if a byte is the start byte of a code point, and how many bytes the code point is long. And if you add these control bit to teh 21 data bit, you get more than 24 bit in total, which means: you need up to 4 byte to encode every possible unicode character.
There are even more encodings for unicode: UTF-1, UTF-7, CESU-8, GB 18030 and many more.
And the fact, that there are many different encodings make it necessary to translate from one encoding to another in some situations. And when you want to translate for example UTF-8 to UCS-2 you will get in trouble if the original text contains characters with code points out of the range that UCS-2 can encode. And in this case you should through a UnicodeTranslateError.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 125: invalid start byte [duplicate]

I am quite confused about the concept of character encoding.
What is Unicode, GBK, etc? How does a programming language use them?
Do I need to bother knowing about them? Is there a simpler or faster way of programming without having to trouble myself with them?

(Note that I'm using some of these terms loosely/colloquially for a simpler explanation that still hits the key points.)
A byte can only have 256 distinct values, being 8 bits.
Since there are character sets with more than 256 characters in the character set one cannot in general simply say that each character is a byte.
Therefore, there must be mappings that describe how to turn each character in a character set into a sequence of bytes. Some characters might be mapped to a single byte but others will have to be mapped to multiple bytes.
Those mappings are encodings, because they are telling you how to encode characters into sequences of bytes.
As for Unicode, at a very high level, Unicode is an attempt to assign a single, unique number to every character. Obviously that number has to be something wider than a byte since there are more than 256 characters :) Java uses a version of Unicode where every character is assigned a 16-bit value (and this is why Java characters are 16 bits wide and have integer values from 0 to 65535). When you get the byte representation of a Java character, you have to tell the JVM the encoding you want to use so it will know how to choose the byte sequence for the character.

Character encoding is what you use to solve the problem of writing software for somebody who uses a different language than you do.
You don't know how what the characters are and how they are ordered. Therefore, you don't know what the strings in this new language will look like in binary and frankly, you don't care.
What you do have is a way of translating strings from the language you speak to the language they speak (say a translator). You now need a system that is capable of representing both languages in binary without conflicts. The encoding is that system.
It is what allows you to write software that works regardless of the way languages are represented in binary.

Unicode encoding, python documentation - why a 32-bit encoding? [duplicate]

This question already has answers here:
Why UTF-32 exists whereas only 21 bits are necessary to encode every character?
(4 answers)
Closed 3 years ago.
I am reading UNICODE Howto in the Python documentation.
It is written that
a Unicode string is a sequence of code points, which are numbers from
0 through 0x10FFFF
which make it looks like the maximum number of bits needed to represent a code point is 24 (because there are 6 hexadecimal characters, and 6*4=24).
But then the documentation states:
The first encoding you might think of is using 32-bit integers as the
code unit
Why is that? The first encoding I could think of is with 24-bit integers, not 32-bit.

Actually you only need 21. Many CPUs use 32-bit registers natively, and most languages have a 32-bit integer type.
If you study the UTF-16 and UTF-8 encodings, you’ll find that their algorithms encode a maximum of a 21-bit code point using two 16-bit code units and four 8-bit code units, respectively.

Because it is the standard way. Python uses different "internal encoding", depending the content of the string: ASCII/ISO, UTF-16, UTF-32. UTF-32 is a common used representation (usually just intern to programs) to represent Unicode code point. So Python, instead of reinventing an other encoding (e.g. a UTF-22), it just uses UTF-32 representation. It is also easier for the different interfaces. Not so efficient on space, but much more on string operations.
Note: Python uses (in seldom cases) also surrogate range to encode "wrong" bytes. So you need more than 10FFFF code points.
Note: Also colour encoding had a similar encoding: 8bit * 3 channels = 24bit, but often represented with 32 integers (but this also for other reasons: just a write, instead of 2 read + 2 write on bus). 32 bits is much more easier and fast to handle.

What encoding do normal python strings use?

i know that django uses unicode strings all over the framework instead of normal python strings. what encoding are normal python strings use ? and why don't they use unicode?

In Python 2: Normal strings (Python 2.x str) don't have an encoding: they are raw data.
In Python 3: These are called "bytes" which is an accurate description, as they are simply sequences of bytes, which can be text encoded in any encoding (several are common!) or non-textual data altogether.
For representing text, you want unicode strings, not byte strings. By "unicode strings", I mean unicode instances in Python 2 and str instances in Python 3. Unicode strings are sequences of unicode codepoints represented abstractly without an encoding; this is well-suited for representing text.
Bytestrings are important because to represent data for transmission over a network or writing to a file or whatever, you cannot have an abstract representation of unicode, you need a concrete representation of bytes. Though they are often used to store and represent text, this is at least a little naughty.
This whole situation is complicated by the fact that while you should turn unicode into bytes by calling encode and turn bytes into unicode using decode, Python will try to do this automagically for you using a global encoding you can set that is by default ASCII, which is the safest choice. Never depend on this for your code and never ever change this to a more flexible encoding--explicitly decode when you get a bytestring and encode if you need to send a string somewhere external.

Hey! I'd like to add some stuff to other answers, unfortunately I don't have enough rep yet to do that properly :-(
FWIW, Mike Graham's post is pretty good and that's probably what you should be reading first.
Here's a few comments:
The need to prefix unicode literals with "u" in 2.x is pretty easily removed in recent (2.6+) 2.x Pythons. from __future__ import unicode_literals
Simialrly, ASCII is only the default source encoding. Python understands a variety of coding hints including the emacs-style # -*- coding: utf-8 -*-. For more information see PEP 0263. Changing the source encoding affects how Unicode literals (regardless of their prefix or lack of prefix, as affected by point 1) are interpreted. In Py3k, the default file encoding is UTF-8.
Python of course does use an encoding internally for Unicode strings (str in py3k, unicode in 2.x) because at some point in time stuff's going to have to be written to memory. Ideally, this would never be evident to the end-user. Unfortunately nothing's perfect and you can occasionally run into problems with this: specifically if you use funky squiggles outside of the Unicode Base Multilingual Plane. Since Python 2.2, we've had what's called wide builds and narrow builds; these names refer to the type used internally to store Unicode code points. Wide builds use UCS-4, which uses 4 bytes to store a Unicode code point. (This means UCS-4's code unit size is 4 bytes, or 32 bits.) Narrow builds use UCS-2. UCS-2 only has 16 bits, and therefore can not encode all Unicode code points accurately (it's like UTF-16, except without the surrogate pairs). To check, test the value of sys.maxunicode. If it's 1114111, you've got a wide build (which can correctly represent all of Unicode). If it's less, well, don't fret too much. The BMP (code points 0x0000 to 0xFFFF) covers most people's needs. For more information, see PEP 0261.

what encoding are normal python
strings use?
In Python 3.x
str is Unicode. This may be either UTF-16 or UTF-32 depending on whether your Python interpreter was built with "narrow" or "wide" Unicode characters.
The Windows version of CPython uses UTF-16. On Unix-like systems, UTF-32 tends to be preferred.
In Python 2.x
str is a byte string type like C char. The encoding isn't defined by the language, but is whatever your locale's default encoding is. Or whatever the MIME charset of the document you got off the Internet is. Or, if you get a string from a function like struct.pack, it's binary data, and doesn't meaningfully have a character encoding at all.
unicode strings in 2.x are equivalent to str in 3.x.
and why don't they use unicode?
Because Python (slightly) predates Unicode. And because Guido wanted to save all the major backwards-incompatible changes for 3.0. Strings in 3.x do use Unicode by default.

From Python 3.0 on all strings are unicode by default, there is also the bytes datatype (Python documentation).
So the python developers think that using unicode is a good idea, that it is not used universally in python 2 is mostly due to backwards compatibility. It also has performance implications.

Python 2.x strings are 8-bit, nothing more. The encoding may vary (though ASCII is assumed). I guess the reasons are historical. Few languages, especially languages that date back to the last century, use unicode right away.
In Python 3, all strings are unicode.

Before Python 3.0, string encoding was ascii by default, but could be changed. Unicode string literals were u"...". This was silly.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.