What are the names for these different kinds of ascii representations of unicode?
\xF0\x9F\x98\xA2
\U0001f622
And is there a term for the set that they belong to that's more specific than "representation"? And in the context of these, how would I describe the non-ascii representation (😢)?
Since I don't know what to call them it is very hard to search for how to work with them.
Thanks!
As Tom Blodget already warned you, this is a somewhat python specific answer.
The leading \ shows that it's an escape sequence.
\x means that the next two characters will be interpreted as a two-digit hex value.
\U means that the next eight characters will be interpreted as a 32-bit hex value.
You can read more about that here:
https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
To fully answer your question:
\xF0\x9F\x98\xA2 is simply four \x escapes, each giving one value by its two-digit hex code; here they are the four bytes of the UTF-8 encoding of the character
\U0001f622 is a UNICODE codepoint encoded with a 32-bit hex value
😢 is a glyph or simply a special character.
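A minimal Python 3 sketch (not part of the answer above) tying the three forms together: the \x values are the UTF-8 bytes, and the \U escape names the codepoint directly:
>>> b"\xF0\x9F\x98\xA2".decode("utf-8")   # the four bytes are the UTF-8 encoding of the character
'😢'
>>> "\U0001f622"                          # the \U escape denotes the codepoint U+1F622 directly
'😢'
>>> "\U0001f622" == "😢"
True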
For Python 3
First there seems to be a misunderstanding about the hex escapes:
print("\xF0\x9F\x98\xA2" == "\u00F0\u009F\u0098\u00A2")
print("\xF0\x9F\x98\xA2" == "\U000000F0\U0000009F\U00000098\U000000A2")
print("\xF0\x9F\x98\xA2" == "\U000000F0\U0000009F\U00000098\U000000A2")
print("\xF0\x9F\x98\xA2" == "\N{LATIN SMALL LETTER ETH}\N{APPLICATION PROGRAM COMMAND}\N{START OF STRING}\N{CENT SIGN}")
and for completeness (I recall using octal effectively in machine code where some instructions had 3-bit, aligned arguments but I don't see the point in real programming):
print("\xF0\x9F\x98\xA2" == "\360\237\230\242")
It appears they are all Unicode codepoint escapes in 2-digit hexadecimal, 4-digit hexadecimal, and 8-digit hexadecimal, with ranges from U+0000 to U+00FF, U+FFFF, and U+10FFFF, respectively.
We can confirm that, unlike other languages where the \u form is for a UTF-16 code unit, in Python 3 it is really a codepoint.
print("\ud83d\ude22" == "\U0000d83d\U0000de22")
and for completeness:
print("\U0001f622" == "😢")
print("\N{CRYING FACE}" == "😢")
In other languages (where they would be two UTF-16 code units), "\ud83d\ude22" would equal "😢".
Now, U+D83D and U+DE22 are Unicode codepoints designated as surrogates. In other words, not characters. They reserve codepoint space for the UTF-16 code units with the corresponding values. This is the way the UCS-2 encoding of Unicode was transparently extended to UTF-16 when Unicode was expanded from 2^16 codepoints to the current 1,114,112 codepoints. For more information see the Unicode FAQ.
As @Robᵩ points out, you can have a bytestring literal, too:
print("\U0001f622".encode("utf-8") == b"\xF0\x9F\x98\xA2")
Related
I need a fixed-width string encoding. From what I understood, UCS-2 and UCS-4 (also, ASCII) are such fixed-width encodings.
From what I understood, Python only supports a variable-width UTF-16 via s.encode('utf_16_le'). Is it true? Is there an easy way to encode into a unicode fixed-width encoding?
Context: I'm storing a string array in raw bytes and need a way to index into it to recover original strings. Index calculation is easier when all characters are fixed-width.
strings = ['asd', 'def']
# ascii
bytelens = list(map(len, strings))
bytes = ''.join(strings).encode('ascii')
# utf8
bytelens = []
bytes = bytearray()
for s in strings:
    e = s.encode('utf-8')
    bytelens.append(len(e))
    bytes.extend(e)
# i need bytelens to later recover original strings from the array bytes
As you can see, the ASCII variant is very simple, while the UTF-8 one is more convoluted and about 20% slower (probably because of the many allocations and function calls). A true fixed-width UCS-2 would be a solution!
A follow-up question: how can I know if my string has characters from UCS-1 / UCS-2 / UCS-4? For UCS-1 there is str.isascii. Any ideas about UCS-2?
You are mixing various concepts.
In Python, you can simply index a string (or an array); the width of each character doesn't matter. Even so, be warned that one character is not always a single, self-contained entity: to get complete user-perceived entities you may need to group several characters together (combining characters such as accents, etc.).
UTF-16 is variable width; it is identical to UCS-2 except for characters outside the UCS-2 range. For most purposes the difference doesn't matter, and if you do have such characters, you simply deal with the low and high surrogates (as in many other programming languages, which effectively support only UCS-2). This is often not a problem in practice, because you should never split a string at arbitrary places, only at the end of an entity.
UCS-4 and UTF-32 are practically the same encoding: Unicode code points stored as 32-bit numbers. (The differences are only formal: UCS is based on an ISO standard that originally allowed more, higher code points, which were never allocated.)
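If you really do want fixed-width storage, UTF-32 gives exactly four bytes per code point, and you can test whether a string fits the UCS-2 range by checking its largest code point. A rough sketch under those assumptions (the helper name fits_ucs2 is mine, not from the question):
strings = ['asd', 'déf', '😢!']
blob = ''.join(strings).encode('utf-32-le')   # fixed width: 4 bytes per code point, no BOM
charlens = [len(s) for s in strings]          # lengths in code points

# Recover the second string by code-point index arithmetic.
start = charlens[0] * 4
end = start + charlens[1] * 4
print(blob[start:end].decode('utf-32-le'))    # -> déf

# Does a string fit in UCS-2 (every code point <= U+FFFF)?
def fits_ucs2(s):
    return all(ord(c) <= 0xFFFF for c in s)

print(fits_ucs2('déf'), fits_ucs2('😢'))      # -> True False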
Could you explain in detail what the difference is between byte string and Unicode string in Python. I have read this:
Byte code is simply the converted source code into arrays of bytes
Does it mean that Python has its own coding/encoding format? Or does it use the operating system settings?
I don't understand. Could you please explain?
Thank you!
No, Python does not use its own encoding - it will use any encoding that it has access to and that you specify.
A character in a str represents one Unicode character. However, to represent more than 256 characters, individual Unicode encodings may use more than one byte per character.
bytes objects give you access to the underlying bytes. str objects have an encode method that takes a string naming an encoding and returns the bytes object that represents the string in that encoding. bytes objects have a decode method that takes a string naming an encoding and returns the str that results from interpreting the bytes as text in the given encoding.
For example:
>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'
We can see that UTF-8 is using four bytes, \xce, \xb1, \xce, and \xac, to represent two characters.
Related reading:
Python Unicode Howto (from the official documentation)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Here's an attempt at a simple explanation that only applies to Python 3. I hope that, coming from a lay person, it helps to clear up some confusion for the completely uninitiated. If there are any technical inaccuracies, please forgive me and feel free to point them out.
Suppose you create a string using Python 3 in the usual way:
stringobject = 'ant'
stringobject would be a unicode string.
A unicode string is made up of unicode characters. In stringobject above, the unicode characters are the individual letters, e.g. a, n, t
Each unicode character is assigned a code point, which can be expressed as a sequence of hex digits (a hex digit can take on 16 values, ranging from 0-9 and A-F). For instance, the letter 'a' is equivalent to '\u0061', and 'ant' is equivalent to '\u0061\u006E\u0074'.
So you will find that if you type in,
stringobject = '\u0061\u006E\u0074'
stringobject
You will also get the output 'ant'.
Now, unicode is converted to bytes, in a process known as encoding. The reverse process of converting bytes to unicode is known as decoding.
How is this done? Since each hex digit can take on 16 different values, it can be reflected in a 4-bit binary sequence (e.g. the hex digit 0 can be expressed in binary as 0000, the hex digit 1 can be expressed as 0001 and so forth). If a unicode character has a code point consisting of four hex digits, it would need a 16-bit binary sequence to encode it.
Different encoding systems specify different rules for converting unicode to bits. Most importantly, encodings differ in the number of bits they use to express each unicode character.
For instance, the ASCII encoding system uses only 7 bits (stored in 1 byte) per character. Thus it can only encode the unicode characters with code points up to U+007F (i.e. 128 different unicode characters). The UTF-8 encoding system uses 8 to 32 bits (1 to 4 bytes) per character, so it can encode unicode characters with code points up to six hex digits long (up to U+10FFFF), i.e. everything.
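For instance, you can watch the byte count grow with the code point (a small illustration, not from the original answer):
for ch in ('a', 'á', '€', '😂'):
    print(ch, len(ch.encode('utf-8')))   # 1, 2, 3 and 4 bytes respectively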
Running the following code:
byteobject = stringobject.encode('utf-8')
byteobject, type(byteobject)
converts a unicode string into a byte string using the utf-8 encoding system, and returns (b'ant', <class 'bytes'>).
Note that if you used 'ASCII' as the encoding system, you wouldn't run into any problems, since all code points in 'ant' can be expressed with 1 byte each. But if you had a unicode string containing characters with code points beyond the ASCII range (above U+007F), you would get a UnicodeEncodeError.
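For example (a quick sketch of that error case, not from the original answer):
'ant'.encode('ascii')    # works, returns b'ant'
'antá'.encode('ascii')   # raises UnicodeEncodeError: 'ascii' codec can't encode character '\xe1'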
Similarly,
stringobject = byteobject.decode('utf-8')
stringobject, type(stringobject)
gives you ('ant', <class 'str'>).
I have everything working as I want it in my code, but I'm still curious. I have a string: "stación." When I convert that string to unicode, I get:
unicode('stación', 'utf-8')
>>> u'staci\xf3n'
That "\xf3" in there looks like a byte character. This is different from:
unicode('Поиск', 'utf-8')
>>> u'\u041f\u043e\u0438\u0441\u043a'
In the latter example, as with everything I've converted to unicode before, I get unicode characters starting with "\u." Normally, when I see a byte starting with "\x," I think there's a problem. What gives here? Is this because "ó" is extended ASCII?
No, it's because "ó" is a non-ASCII character within the first 256 code points (U+0000 to U+00FF). Since it is representable with a single byte value, the shorter \xf3 escape saves two characters compared to \u00f3. The other two representations are valid, but not required.
>>> u'\u00f3'
u'\xf3'
>>> u'\U000000f3'
u'\xf3'
u'\xf3' is not a byte; it is a Unicode string with a single Unicode codepoint (U+00f3 LATIN SMALL LETTER O WITH ACUTE).
What you see (u'\xf3') is how Python 2 chooses to represent Unicode characters with Unicode ordinals (integers) in the range 0..255 that are not printable ascii characters (Python 3 would show 'ó' here; only non-printable characters use the '\xhh' form there by default). As @Ignacio Vazquez-Abrams said: the u'\u00f3' and u'\U000000f3' literals create exactly the same Unicode string.
For comparison, you can see how the Unicode character u'\xf3' looks as bytes in different character encodings:
>>> print(u'\xf3')
ó
>>> u'\xf3'.encode('utf-8')
b'\xc3\xb3'
>>> u'\xf3'.encode('utf-16be')
b'\x00\xf3'
>>> u'\xf3'.encode('utf-32le')
b'\xf3\x00\x00\x00'
>>> u'\xf3'.encode('cp1252')
b'\xf3'
Note: b'\xf3' and u'\xf3' are different things. The former is a bytestring that contains a single byte (an integer 243); the latter is a Unicode string that contains a single Unicode codepoint (Unicode ordinal 243). The number is the same (243) but the units are different -- 100 calories is not the same thing as 100 grams.
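A small sketch of that difference, in Python 2 syntax to match the answer:
>>> type(b'\xf3'), len(b'\xf3')            # one byte
(<type 'str'>, 1)
>>> type(u'\xf3'), len(u'\xf3')            # one Unicode codepoint
(<type 'unicode'>, 1)
>>> b'\xf3'.decode('latin-1') == u'\xf3'   # byte 243 read as Latin-1 yields codepoint 243
True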
I have the following code to write an ASCII "@" character to a file in a binary fashion:
fin=open('a.bin','wb')
fin.write('\x40')
fin.close()
It turns out that a "@" character has been written to "a.bin", which has a length of 1 byte.
However, when I tried to write a unicode character instead:
fin=open('a.bin','wb')
fin.write(u'\x40')
fin.close()
It turned out that "a.bin" is still 1-byte long. I thought it should be 2-byte long since a unicode character takes 2-bytes. There may be some trivial thing that I overlooked.
You are confusing Unicode with encodings. An encoding is a standard that represents text within the confines of individual values in the range 0-255 (bytes), while Unicode is a standard that assigns codepoints to textual glyphs. The two are related but not the same thing.
The Unicode standard includes several encodings. UTF-16 is one such encoding that uses 2 bytes per codepoint, but it is not the only encoding included in the standard. UTF-8 is another such encoding, and it uses a variable number of bytes per codepoint.
Your file, however, is written using ASCII, the default codec used by Python 2 when you do not specify an explicit encoding. If you expected to see 2 bytes per codepoint, encode to UTF-16 explicitly:
fin.write(u'\x40'.encode('utf-16-le'))
This writes UTF-16 in little endian byte order; there is also a utf-16-be codec. Normally, for multi-byte encodings like UTF-16 or UTF-32, you'd also include a BOM, or Byte Order Mark; it is included automatically when you encode to UTF-16 without picking an endianness:
fin.write(u'\x40'.encode('utf-16'))
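A quick way to see the size differences (a sketch; only the lengths matter here):
>>> len(u'\x40'.encode('ascii'))      # 1 byte
1
>>> len(u'\x40'.encode('utf-16-le'))  # 2 bytes, no BOM
2
>>> len(u'\x40'.encode('utf-16'))     # 4 bytes: 2-byte BOM plus the 2-byte character
4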
I strongly urge you to study up on Unicode, codecs and Python before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
Character numbers from U+0000 to U+007F (US-ASCII repertoire) correspond to octets 00 to 7F (7 bit US-ASCII values). A direct consequence is that a plain ASCII string is also a valid UTF-8 string.
UTF-8, a transformation format of ISO 10646
Martijn is right in his elaborate answer: learn more about Unicode first. But here is a shorter answer than reading those large educational documents:
When writing a Python unicode value (u'\x40' in your case) to a stream (an open file in your case), this abstract unicode value must be converted into a concrete stream of bytes. Encodings are used for this.
You can do this explicitly (by using u'\x40'.encode('foo')) or implicitly, in which case some default encoding is used. In your case that is either "ascii" or "utf8", both of which represent the Unicode '@' as a single byte with hex value 40.
What you seem to want is an encoding in which the Unicode '@' is represented as a two-byte value; that would be, for instance, utf-16.
Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?:
Executing a module with:
# -*- coding: utf-8 -*-
a = 'á'
ua = u'á'
print a, ua
Results in: á, á
More testing using Python shell:
>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'
So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I'm even more confused now! :S
unicode is meant to handle text. Text is a sequence of code points, which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes (e.g. utf-8, latin-1, ...).
Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
On the contrary str in Python 2 is a plain sequence of bytes. It does not represent text!
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
Some differences that you can see:
>>> len(u'à') # a single code point
1
>>> len('à') # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # the utf-8 terminal cannot understand the latin1 byte
�
Note that using str you have lower-level control over the individual bytes of a specific encoding representation, while using unicode you can only control at the code-point level. For example, you can do:
>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù
What was valid UTF-8 before isn't anymore. Using a unicode string, you cannot operate in such a way that the resulting string isn't valid unicode text.
You can remove a code point, replace a code point with a different code point etc. but you cannot mess with the internal representation.
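By contrast, the same kind of edit at the code-point level keeps the text valid (a brief sketch in Python 2 syntax, matching the examples above):
>>> print u'àèìòù'.replace(u'è', u'')    # dropping a whole code point keeps valid text
àìòù
>>> print u'àèìòù'.replace(u'è', u'e')   # so does replacing one code point with another
àeìòù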
Unicode and encodings are completely different, unrelated things.
Unicode
Assigns a numeric ID to each character:
0x41 → A
0xE1 → á
0x414 → Д
So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.
Even the little arrow → I used has its Unicode number, it's 0x2192. And even emojis have their Unicode numbers, 😂 is 0x1F602.
You can look up the Unicode numbers of all characters in the official Unicode code charts.
These numbers assigned to all characters by Unicode are called code points.
The purpose of all this is to provide a means to unambiguously refer to each character. For example, if I'm talking about 😂, instead of saying "you know, this laughing emoji with tears", I can just say, Unicode code point 0x1F602. Easier, right?
Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.
Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).
The important thing to remember is that all Unicode does is to assign a numerical ID, called code point, to each character for easy and unambiguous reference.
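The arithmetic behind those counts, as a quick check (not part of the original answer):
>>> 0x110000            # number of code points from U+0000 through U+10FFFF
1114112
>>> 0x110000 - 2048     # minus the 2048 surrogates U+D800 .. U+DFFF
1112064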
Encodings
Map characters to bit patterns.
These bit patterns are used to represent the characters in computer memory or on disk.
There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:
ASCII
Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.
Example:
a → 1100001 (0x61)
You can see all the mappings in any ASCII table.
ISO 8859-1 (aka Latin-1)
Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.
Example:
a → 01100001 (0x61)
á → 11100001 (0xE1)
You can see all the mappings in any ISO 8859-1 code table.
UTF-8
Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of either length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).
Example:
a → 01100001 (0x61)
á → 11000011 10100001 (0xC3 0xA1)
≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)
The way UTF-8 encodes characters to bit strings is very well described in the Wikipedia article on UTF-8.
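You can verify those byte sequences directly (Python 3 syntax here for brevity, not from the original answer; the outputs are the encoded bytes in hex):
>>> 'á'.encode('utf-8').hex()
'c3a1'
>>> '≠'.encode('utf-8').hex()
'e289a0'
>>> '😂'.encode('utf-8').hex()
'f09f9882'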
Unicode and Encodings
Looking at the above examples, it becomes clear how Unicode is useful.
For example, if I'm Latin-1 and I want to explain my encoding of á, I don't need to say:
"I encode that a with an aigu (or however you call that rising bar) as 11100001"
But I can just say:
"I encode U+00E1 as 11100001"
And if I'm UTF-8, I can say:
"Me, in turn, I encode U+00E1 as 11000011 10100001"
And it's unambiguously clear to everybody which character we mean.
Now to the often arising confusion
It's true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.
For example:
ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.
Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.
Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.
Back to your question
The encoding used by your Python interpreter is UTF-8.
Here's what's going on in your examples:
Example 1
The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.
>>> a = 'á'
When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':
>>> a
'\xc3\xa1'
Example 2
The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don't know which data format Python uses internally to represent the code point U+00E1 in memory, and it's unimportant to us):
>>> ua = u'á'
When you look at the value of ua, Python tells you that it contains the code point U+00E1:
>>> ua
u'\xe1'
Example 3
The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:
>>> ua.encode('utf-8')
'\xc3\xa1'
Example 4
The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:
>>> ua.encode('latin1')
'\xe1'
There's no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.
Your terminal happens to be configured to UTF-8.
The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.
This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.
You can see the difference when you encode the unicode value to different output encodings:
>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'
Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.
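You can check that correspondence over the whole range (a one-line sketch in Python 2, matching the examples above):
>>> all(unichr(i) == chr(i).decode('latin1') for i in range(256))
True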
Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with codepoints beyond U+00FF a different escape sequence, \u.... is used instead, with a four-digit hex value.
It looks like you don't yet fully understand what the difference is between Unicode and an encoding. Please do read the following articles before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
When you define a as unicode, the characters a and á each count as a single char. Otherwise á counts as two chars. Try len(a) and len(ua). In addition to that, you may need to take the encoding into account when you work with other environments. For example, if you use md5, you get different values for a and ua.
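For instance, in Python 2 the digest depends on which bytes you feed in (a sketch, not from the original answer; hash the encoded bytes rather than the unicode object):
import hashlib

a = 'á'      # byte string: the two UTF-8 bytes C3 A1 (with a utf-8 source encoding)
ua = u'á'    # unicode string: the single code point U+00E1

print hashlib.md5(a).hexdigest()                    # digest of the bytes C3 A1
print hashlib.md5(ua.encode('utf-8')).hexdigest()   # same bytes, so the same digest
print hashlib.md5(ua.encode('latin1')).hexdigest()  # digest of the single byte E1: different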