Encode a string to fixed-width unicode UCS-2 in Python - python

I need a fixed-width string encoding. From what I understood, UCS-2 and UCS-4 (also, ASCII) are such fixed-width encodings.
From what I understood, Python only supports a variable-width UTF-16 via s.encode('utf_16_le'). Is it true? Is there an easy way to encode into a unicode fixed-width encoding?
Context: I'm storing a string array in raw bytes and need a way to index into it to recover original strings. Index calculation is easier when all characters are fixed-width.
strings = ['asd', 'def']
# ascii
bytelens = list(map(len, strings))
bytes = ''.join(strings).encode('ascii')
# utf8
bytelens = []
bytes = bytearray()
for s in strings:
e = s.encode('utf-8')
bytelens.append(len(e))
bytes.extend(e)
# i need bytelens to later recover original strings from the array bytes
As you can see, ASCII variant is very simple, and UTF-8 is more convoluted and 20% slower (probably because of many allocations and function calls). A true fixed-width UCS-2 would be a solution!
A follow-up question: how can I know if my string has characters from UCS-1 / UCS-2 / UCS-4? For UCS-1 there is str.isascii. Any ideas about UCS-2?

You are mixing various concepts.
In Python, you can just index a string (or an array). It doesn't matter the length of every character. But also in this case, I should warn you that one character is not a single/simple entity: if you need single entities, you should put together more characters (combining characters, e.g. accents, etc.).
UTF16 is variable width, but it is the same as UCS2, but for characters outside UCS2. So for most things, it doesn't matter, and if you have such characters, you just work with sometime low and high surrogates (like on many other computer languages, which supports only UCS2). But this is often not a problem, because you should not split a string at random places, but always at end of an entity.
UCS4 and UTF-32 are practically the same encoding: Unicode code points into 32-bit numbers. (Differences are just virtual, and on some definition, not for Unicode characters [UCS is based on an ISO which allowed more (higher) code-points, never allocated)

Related

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 32173: character maps to <undefined> [duplicate]

I am quite confused about the concept of character encoding.
What is Unicode, GBK, etc? How does a programming language use them?
Do I need to bother knowing about them? Is there a simpler or faster way of programming without having to trouble myself with them?
ASCII is fundamental
Originally 1 character was always stored as 1 byte. A byte (8 bits) has the potential to distinct 256 possible values. But in fact only the first 7 bits were used. So only 128 characters were defined. This set is known as the ASCII character set.
0x00 - 0x1F contain steering codes (e.g. CR, LF, STX, ETX, EOT, BEL, ...)
0x20 - 0x40 contain numbers and punctuation
0x41 - 0x7F contain mostly alphabetic characters
0x80 - 0xFF the 8th bit = undefined.
French, German and many other languages needed additional characters. (e.g. à, é, ç, ô, ...) which were not available in the ASCII character set. So they used the 8th bit to define their characters. This is what is known as "extended ASCII".
The problem is that the additional 1 bit has not enough capacity to cover all languages in the world. So each region has its own ASCII variant. There are many extended ASCII encodings (latin-1 being a very popular one).
Popular question: "Is ASCII a character set or is it an encoding" ? ASCII is a character set. However, in programming charset and encoding are wildly used as synonyms. If I want to refer to an encoding that only contains the ASCII characters and nothing more (the 8th bit is always 0): that's US-ASCII.
Unicode goes one step further
Unicode is a great example of a character set - not an encoding. It uses the same characters like the ASCII standard, but it extends the list with additional characters, which gives each character a codepoint in format u+xxxx. It has the ambition to contain all characters (and popular icons) used in the entire world.
UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table. But they each have a slightly different way on how to encode them. UTF-8 will only use 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding. But for other characters, it will use the first bit to indicate that a 2nd byte will follow.
GBK is an encoding, which just like UTF-8 uses multiple bytes. The principle is pretty much the same. The first byte follows the ASCII standard, so only 7 bits are used. But just like with UTF-8, The 8th bit can be used to indicate the presence of a 2nd byte, which it then uses to encode one of 22,000 Chinese characters. The main difference, is that this does not follow the Unicode character set, by contrast it uses some Chinese character set.
Decoding data
When you encode your data, you use an encoding, but when you decode data, you will need to know what encoding was used, and use that same encoding to decode it.
Unfortunately, encodings aren't always declared or specified. It would have been ideal if all files contained a prefix to indicate what encoding their data was stored in. But still in many cases applications just have to assume or guess what encoding they should use. (e.g. they use the standard encoding of the operating system).
There still is a lack of awareness about this, as still many developers don't even know what an encoding is.
Mime types
Mime types are sometimes confused with encodings. They are a useful way for the receiver to identify what kind of data is arriving. Here is an example, of how the HTTP protocol defines it's content type using a mime type declaration.
Content-Type: text/html; charset=utf-8
And that's another great source of confusion. A mime type describes what kind of data a message contains (e.g. text/xml, image/png, ...). And in some cases it will additionally also describe how the data is encoded (i.e. charset=utf-8). 2 points of confusion:
Not all mime types declare an encoding. In some cases it is only optional or sometimes completely pointless.
The syntax charset=utf-8 adds up to the semantic confusion, because as explained earlier, UTF-8 is an encoding and not a character set. But as explained earlier, some people just use the 2 words interchangeably.
For example, in the case of text/xml it would be pointless to declare an encoding (and a charset parameter would simply be ignored). Instead, XML parsers in general will read the first line of the file, looking for the <?xml encoding=... tag. If it's there, then they will reopen the file using that encoding.
The same problem exists when sending e-mails. An e-mail can contain a html message or just plain text. Also in that case mime types are used to define the type of the content.
But in summary, a mime type isn't always sufficient to solve the problem.
Data types in programming languages
In case of Java (and many other programming languages) in addition to the dangers of encodings, there's also the complexity of casting bytes and integers to characters because their content is stored in different ranges.
a byte is stored as a signed byte (range: -128 to 127).
the char type in java is stored in 2 unsigned bytes (range: 0 - 65535)
a stream returns an integer in range -1 to 255.
If you know that your data only contains ASCII values. Then with the proper skill you can parse your data from bytes to characters or wrap them immediately in Strings.
// the -1 indicates that there is no data
int input = stream.read();
if (input == -1) throw new EOFException();
// bytes must be made positive first.
byte myByte = (byte) input;
int unsignedInteger = myByte & 0xFF;
char ascii = (char)(unsignedInteger);
Shortcuts
The shortcut in java is to use readers and writers and to specify the encoding when you instantiate them.
// wrap your stream in a reader.
// specify the encoding
// The reader will decode the data for you
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
As explained earlier for XML files it doesn't matter that much, because any decent DOM or JAXB marshaller will check for an encoding attribute.
(Note that I'm using some of these terms loosely/colloquially for a simpler explanation that still hits the key points.)
A byte can only have 256 distinct values, being 8 bits.
Since there are character sets with more than 256 characters in the character set one cannot in general simply say that each character is a byte.
Therefore, there must be mappings that describe how to turn each character in a character set into a sequence of bytes. Some characters might be mapped to a single byte but others will have to be mapped to multiple bytes.
Those mappings are encodings, because they are telling you how to encode characters into sequences of bytes.
As for Unicode, at a very high level, Unicode is an attempt to assign a single, unique number to every character. Obviously that number has to be something wider than a byte since there are more than 256 characters :) Java uses a version of Unicode where every character is assigned a 16-bit value (and this is why Java characters are 16 bits wide and have integer values from 0 to 65535). When you get the byte representation of a Java character, you have to tell the JVM the encoding you want to use so it will know how to choose the byte sequence for the character.
Character encoding is what you use to solve the problem of writing software for somebody who uses a different language than you do.
You don't know how what the characters are and how they are ordered. Therefore, you don't know what the strings in this new language will look like in binary and frankly, you don't care.
What you do have is a way of translating strings from the language you speak to the language they speak (say a translator). You now need a system that is capable of representing both languages in binary without conflicts. The encoding is that system.
It is what allows you to write software that works regardless of the way languages are represented in binary.
Most computer programs must communicate with a person using some text in a natural language (a language used by humans). But computers have no fundamental means for representing text: the fundamental computer representation is a sequence of bits organized into bytes and words, with hardware support for interpreting sequences of bits as fixed width base-2 (binary) integers and floating-point real numbers. Computer programs must therefore have a scheme for representing text as sequences of bits. This is fundamentally what character encoding is. There is no inherently obvious or correct scheme for character encoding, and so there exist many possible character encodings.
However, practical character encodings have some shared characteristics.
Encoded texts are divided into a sequence of characters (graphemes).
Each of the known possible characters has an encoding. The encoding of a text consists of the sequence of the encoding of the characters of the text.
Each possible (allowed) character is assigned a unique unsigned (non negative) integer (this is sometimes called a code point). Texts are therefore encoded as a sequence of unsigned integers. Different character encodings differ in the characters they allow, and how they assign these unique integers. Most character encodings do not allow all the characters used by the many human writing systems (scripts) that do and have existed. Thus character encodings differ in which texts they can represent at all. Even character encodings that can represent the same text can represent it differently, because of their different assignment of code points.
The unsigned integer encoding a character is encoded as a sequence of bits. Character encodings differ in the number of bits they use for this encoding. When those bits are grouped into bytes (as is the case for popular encodings), character encodings can differ in endianess. Character encodings can differ in whether they are fixed width (the same number of bits for each encoded character) or variable width (using more bits for some characters).
Therefore, if a computer program receives a sequence of bytes that are meant to represent some text, the computer program must know the character encoding used for that text, if it is to do any kind of manipulation of that text (other than regarding it as an opaque value and forwarding it unchanged). The only possibilities are that the text is accompanied by additional data that indicates the encoding used or the program requires (assumes) that the text has a particular encoding.
Similarly, if a computer program must send (output) text to another program or a display device, it must either tell the destination the character encoding used or the program must use the encoding that the destination expects.
In practice, almost all problems with character encodings are caused when a destination expects text sent using one character encoding, and the text is actually sent with a different character encoding. That in turn is typically caused by the computer programmer not bearing in mind that there exist many possible character encodings, and that their program can not treat encoded text as opaque values, but must convert from an external representation on input and convert to an external representation on output.

String change from Latin to ASCII

I have tried to change the format of strings from latin1 to ascii, and most of the strings were changed well except for some characters, æ ø. Æ, and Ø.
I have checked the characters were changed correctly when using R package (stringi::stri_trans_general(loc1, "latin-ascii) but Python's unicodedata package did not work well.
Is there any way to convert them correctly in Python? I guess it may need an additional dictionary.
For information, I have applied the following function to change the format:
unicodedata.normalize('NFKD', "Latin strings...").encode('latin1', 'ignore').decode('ascii')
It's important to understand a) what encodings and decodings are; b) how text works; and c) what unicode normalization does.
Strings do not have a "format" in the sense that you describe, so talking about converting from latin1 to ascii format does not make sense. The string has representations (what it looks like when you print it out; or what the code looks like when you create it directly in your code; etc.), and it can be encoded. latin1, ascii etc. are encodings - that means, rules that explain how to store your string as a raw sequence of bytes.
So if you have a string, it is not "in latin1 format" just because the source data was in latin1 encoding - it is not in any format, because that concept doesn't apply. It's just a string.
Similarly, we cannot ask for a string "in ascii format" that we convert to. We can ask for an ascii encoding of the string - which is a sequence of bytes, and not text. (That "not" is one of the most important "not"s in all of computer science, because many people, tools and programs will lie to you about this.)
Of course, the problem here is that ascii cannot represent all possible text. There are over a million "code points" that can theoretically be used as elements of a string (this includes a lot of really weird things like emoji). The latin-1 and ascii encodings both use a single byte per code point in the string. Obviously, this means they can't represent everything. Latin-1 represents only the first 256 possible code points, and ascii represents only the first 128. So if we have data that comes from a latin-1 source, we can get a string with those characters like Æ in it, which cause a problem in our encoding step.
The 'ignore' option for .encode makes the encoder skip things that can't be handled by the encoding. So if you have the string 'barentsøya', since the ø cannot be represented in ascii, it gets skipped and you get the bytes b'barentsya' (using the unfortunately misleading way that Python displays bytes objects back to you).
When you normalize a string, you convert the code points into some plain format that's easier to work with, and treats distinct ways of writing a character - or distinct ways of writing very similar characters - the same way. There are a few different normalization schemes. The NFKD chooses decomposed representations for accented characters - that is, instead of using a single symbol to represent a letter with an accent, it will use two symbols, one that represents the plain letter, and one representing the "combining" version of the accent. That might seem useful - for example, it would turn an accented A into a plain A and an accent character. You might think that you can then just encode this as ascii, let the accent characters be ignored, and get the result you want. However, it turns out that this is not enough, because of how the normalization works.
Unfortunately, I think the best you can do is to either use a third-party library (and please note that recommendations are off-topic for Stack Overflow) or build the look-up table yourself and just translate each character. (Have a look at the built-in string methods translate and maketrans for help with this.)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 125: invalid start byte [duplicate]

I am quite confused about the concept of character encoding.
What is Unicode, GBK, etc? How does a programming language use them?
Do I need to bother knowing about them? Is there a simpler or faster way of programming without having to trouble myself with them?
ASCII is fundamental
Originally 1 character was always stored as 1 byte. A byte (8 bits) has the potential to distinct 256 possible values. But in fact only the first 7 bits were used. So only 128 characters were defined. This set is known as the ASCII character set.
0x00 - 0x1F contain steering codes (e.g. CR, LF, STX, ETX, EOT, BEL, ...)
0x20 - 0x40 contain numbers and punctuation
0x41 - 0x7F contain mostly alphabetic characters
0x80 - 0xFF the 8th bit = undefined.
French, German and many other languages needed additional characters. (e.g. à, é, ç, ô, ...) which were not available in the ASCII character set. So they used the 8th bit to define their characters. This is what is known as "extended ASCII".
The problem is that the additional 1 bit has not enough capacity to cover all languages in the world. So each region has its own ASCII variant. There are many extended ASCII encodings (latin-1 being a very popular one).
Popular question: "Is ASCII a character set or is it an encoding" ? ASCII is a character set. However, in programming charset and encoding are wildly used as synonyms. If I want to refer to an encoding that only contains the ASCII characters and nothing more (the 8th bit is always 0): that's US-ASCII.
Unicode goes one step further
Unicode is a great example of a character set - not an encoding. It uses the same characters like the ASCII standard, but it extends the list with additional characters, which gives each character a codepoint in format u+xxxx. It has the ambition to contain all characters (and popular icons) used in the entire world.
UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table. But they each have a slightly different way on how to encode them. UTF-8 will only use 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding. But for other characters, it will use the first bit to indicate that a 2nd byte will follow.
GBK is an encoding, which just like UTF-8 uses multiple bytes. The principle is pretty much the same. The first byte follows the ASCII standard, so only 7 bits are used. But just like with UTF-8, The 8th bit can be used to indicate the presence of a 2nd byte, which it then uses to encode one of 22,000 Chinese characters. The main difference, is that this does not follow the Unicode character set, by contrast it uses some Chinese character set.
Decoding data
When you encode your data, you use an encoding, but when you decode data, you will need to know what encoding was used, and use that same encoding to decode it.
Unfortunately, encodings aren't always declared or specified. It would have been ideal if all files contained a prefix to indicate what encoding their data was stored in. But still in many cases applications just have to assume or guess what encoding they should use. (e.g. they use the standard encoding of the operating system).
There still is a lack of awareness about this, as still many developers don't even know what an encoding is.
Mime types
Mime types are sometimes confused with encodings. They are a useful way for the receiver to identify what kind of data is arriving. Here is an example, of how the HTTP protocol defines it's content type using a mime type declaration.
Content-Type: text/html; charset=utf-8
And that's another great source of confusion. A mime type describes what kind of data a message contains (e.g. text/xml, image/png, ...). And in some cases it will additionally also describe how the data is encoded (i.e. charset=utf-8). 2 points of confusion:
Not all mime types declare an encoding. In some cases it is only optional or sometimes completely pointless.
The syntax charset=utf-8 adds up to the semantic confusion, because as explained earlier, UTF-8 is an encoding and not a character set. But as explained earlier, some people just use the 2 words interchangeably.
For example, in the case of text/xml it would be pointless to declare an encoding (and a charset parameter would simply be ignored). Instead, XML parsers in general will read the first line of the file, looking for the <?xml encoding=... tag. If it's there, then they will reopen the file using that encoding.
The same problem exists when sending e-mails. An e-mail can contain a html message or just plain text. Also in that case mime types are used to define the type of the content.
But in summary, a mime type isn't always sufficient to solve the problem.
Data types in programming languages
In case of Java (and many other programming languages) in addition to the dangers of encodings, there's also the complexity of casting bytes and integers to characters because their content is stored in different ranges.
a byte is stored as a signed byte (range: -128 to 127).
the char type in java is stored in 2 unsigned bytes (range: 0 - 65535)
a stream returns an integer in range -1 to 255.
If you know that your data only contains ASCII values. Then with the proper skill you can parse your data from bytes to characters or wrap them immediately in Strings.
// the -1 indicates that there is no data
int input = stream.read();
if (input == -1) throw new EOFException();
// bytes must be made positive first.
byte myByte = (byte) input;
int unsignedInteger = myByte & 0xFF;
char ascii = (char)(unsignedInteger);
Shortcuts
The shortcut in java is to use readers and writers and to specify the encoding when you instantiate them.
// wrap your stream in a reader.
// specify the encoding
// The reader will decode the data for you
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
As explained earlier for XML files it doesn't matter that much, because any decent DOM or JAXB marshaller will check for an encoding attribute.
(Note that I'm using some of these terms loosely/colloquially for a simpler explanation that still hits the key points.)
A byte can only have 256 distinct values, being 8 bits.
Since there are character sets with more than 256 characters in the character set one cannot in general simply say that each character is a byte.
Therefore, there must be mappings that describe how to turn each character in a character set into a sequence of bytes. Some characters might be mapped to a single byte but others will have to be mapped to multiple bytes.
Those mappings are encodings, because they are telling you how to encode characters into sequences of bytes.
As for Unicode, at a very high level, Unicode is an attempt to assign a single, unique number to every character. Obviously that number has to be something wider than a byte since there are more than 256 characters :) Java uses a version of Unicode where every character is assigned a 16-bit value (and this is why Java characters are 16 bits wide and have integer values from 0 to 65535). When you get the byte representation of a Java character, you have to tell the JVM the encoding you want to use so it will know how to choose the byte sequence for the character.
Character encoding is what you use to solve the problem of writing software for somebody who uses a different language than you do.
You don't know how what the characters are and how they are ordered. Therefore, you don't know what the strings in this new language will look like in binary and frankly, you don't care.
What you do have is a way of translating strings from the language you speak to the language they speak (say a translator). You now need a system that is capable of representing both languages in binary without conflicts. The encoding is that system.
It is what allows you to write software that works regardless of the way languages are represented in binary.
Most computer programs must communicate with a person using some text in a natural language (a language used by humans). But computers have no fundamental means for representing text: the fundamental computer representation is a sequence of bits organized into bytes and words, with hardware support for interpreting sequences of bits as fixed width base-2 (binary) integers and floating-point real numbers. Computer programs must therefore have a scheme for representing text as sequences of bits. This is fundamentally what character encoding is. There is no inherently obvious or correct scheme for character encoding, and so there exist many possible character encodings.
However, practical character encodings have some shared characteristics.
Encoded texts are divided into a sequence of characters (graphemes).
Each of the known possible characters has an encoding. The encoding of a text consists of the sequence of the encoding of the characters of the text.
Each possible (allowed) character is assigned a unique unsigned (non negative) integer (this is sometimes called a code point). Texts are therefore encoded as a sequence of unsigned integers. Different character encodings differ in the characters they allow, and how they assign these unique integers. Most character encodings do not allow all the characters used by the many human writing systems (scripts) that do and have existed. Thus character encodings differ in which texts they can represent at all. Even character encodings that can represent the same text can represent it differently, because of their different assignment of code points.
The unsigned integer encoding a character is encoded as a sequence of bits. Character encodings differ in the number of bits they use for this encoding. When those bits are grouped into bytes (as is the case for popular encodings), character encodings can differ in endianess. Character encodings can differ in whether they are fixed width (the same number of bits for each encoded character) or variable width (using more bits for some characters).
Therefore, if a computer program receives a sequence of bytes that are meant to represent some text, the computer program must know the character encoding used for that text, if it is to do any kind of manipulation of that text (other than regarding it as an opaque value and forwarding it unchanged). The only possibilities are that the text is accompanied by additional data that indicates the encoding used or the program requires (assumes) that the text has a particular encoding.
Similarly, if a computer program must send (output) text to another program or a display device, it must either tell the destination the character encoding used or the program must use the encoding that the destination expects.
In practice, almost all problems with character encodings are caused when a destination expects text sent using one character encoding, and the text is actually sent with a different character encoding. That in turn is typically caused by the computer programmer not bearing in mind that there exist many possible character encodings, and that their program can not treat encoded text as opaque values, but must convert from an external representation on input and convert to an external representation on output.

Unicode (Cyrillic) character indexing, re-writing in python

I am working with Russian words written in the Cyrillic orthography. Everything is working fine except for how many (but not all) of the Cyrillic characters are encoded as two characters when in an str. For instance:
>>>print ["ё"]
['\xd1\x91']
This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":
>>>print [u"ё"]
[u'\u0451']
But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).
So... how do I get around this? If it helps, I am using python 2.7
There are two possible situations here.
Either your str represents valid UTF-8 encoded data, or it does not.
If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...
Perhaps the easiest way to work around the issue would be to use the ignore_errors flag to decode(). This will entirely discard invalid byte sequences and only give you the "correct" portions of the string.
These are actually different encodings:
>>>print ["ё"]
['\xd1\x91']
>>>print [u"ё"]
[u'\u0451']
What you're seeing is the __repr__'s for the elements in the lists. Not the __str__ versions of the unicode objects.
But the strs are being passed around as variables, and so can't be
prefixed with u
You mean the data are strings, and need to be converted into the unicode type:
>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'
You need to coerce the two-byte strings into double-byte width unicode:
>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'
And you'll see with this transform they're perfectly fine.
To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:
>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'
The encoding depends on the data source. It can be anything e.g., if the data comes from a web page; see A good way to get the charset/encoding of an HTTP response in Python
Don't use non-ascii characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals.
Note: a single user-perceived character may span several Unicode codepoints e.g.:
>>> print(u'\u0435\u0308')
ё

Python str vs unicode types

Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?:
Executing a module with:
# -*- coding: utf-8 -*-
a = 'á'
ua = u'á'
print a, ua
Results in: á, á
More testing using Python shell:
>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'
So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I'm even more confused now! :S
unicode is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes(e.g. utf-8, latin-1...).
Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
On the contrary str in Python 2 is a plain sequence of bytes. It does not represent text!
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
Some differences that you can see:
>>> len(u'à') # a single code point
1
>>> len('à') # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�
Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level. For example you can do:
>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù
What before was valid UTF-8, isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid unicode text.
You can remove a code point, replace a code point with a different code point etc. but you cannot mess with the internal representation.
Unicode and encodings are completely different, unrelated things.
Unicode
Assigns a numeric ID to each character:
0x41 → A
0xE1 → á
0x414 → Д
So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.
Even the little arrow → I used has its Unicode number, it's 0x2192. And even emojis have their Unicode numbers, 😂 is 0x1F602.
You can look up the Unicode numbers of all characters in this table. In particular, you can find the first three characters above here, the arrow here, and the emoji here.
These numbers assigned to all characters by Unicode are called code points.
The purpose of all this is to provide a means to unambiguously refer to a each character. For example, if I'm talking about 😂, instead of saying "you know, this laughing emoji with tears", I can just say, Unicode code point 0x1F602. Easier, right?
Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.
Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).
The important thing to remember is that all Unicode does is to assign a numerical ID, called code point, to each character for easy and unambiguous reference.
Encodings
Map characters to bit patterns.
These bit patterns are used to represent the characters in computer memory or on disk.
There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:
ASCII
Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.
Example:
a → 1100001 (0x61)
You can see all the mappings in this table.
ISO 8859-1 (aka Latin-1)
Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.
Example:
a → 01100001 (0x61)
á → 11100001 (0xE1)
You can see all the mappings in this table.
UTF-8
Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of either length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).
Example:
a → 01100001 (0x61)
á → 11000011 10100001 (0xC3 0xA1)
≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)
The way UTF-8 encodes characters to bit strings is very well described here.
Unicode and Encodings
Looking at the above examples, it becomes clear how Unicode is useful.
For example, if I'm Latin-1 and I want to explain my encoding of á, I don't need to say:
"I encode that a with an aigu (or however you call that rising bar) as 11100001"
But I can just say:
"I encode U+00E1 as 11100001"
And if I'm UTF-8, I can say:
"Me, in turn, I encode U+00E1 as 11000011 10100001"
And it's unambiguously clear to everybody which character we mean.
Now to the often arising confusion
It's true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.
For example:
ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.
Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.
Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.
Back to your question
The encoding used by your Python interpreter is UTF-8.
Here's what's going on in your examples:
Example 1
The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.
>>> a = 'á'
When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':
>>> a
'\xc3\xa1'
Example 2
The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don't know which data format Python uses internally to represent the code point U+00E1 in memory, and it's unimportant to us):
>>> ua = u'á'
When you look at the value of ua, Python tells you that it contains the code point U+00E1:
>>> ua
u'\xe1'
Example 3
The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:
>>> ua.encode('utf-8')
'\xc3\xa1'
Example 4
The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:
>>> ua.encode('latin1')
'\xe1'
There's no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.
Your terminal happens to be configured to UTF-8.
The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.
This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.
You can see the difference when you encode the unicode value to different output encodings:
>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'
Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.
Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with codepoints beyond U+00FF a different escape sequence, \u.... is used instead, with a four-digit hex value.
It looks like you don't yet fully understand what the difference is between Unicode and an encoding. Please do read the following articles before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
When you define a as unicode, the chars a and á are equal. Otherwise á counts as two chars. Try len(a) and len(au). In addition to that, you may need to have the encoding when you work with other environments. For example if you use md5, you get different values for a and ua

Categories

Resources