Unicode encoding, Python documentation - why a 32-bit encoding? [duplicate]

This question already has answers here:
Why UTF-32 exists whereas only 21 bits are necessary to encode every character?
I am reading the Unicode HOWTO in the Python documentation.
It is written that
a Unicode string is a sequence of code points, which are numbers from
0 through 0x10FFFF
which makes it look like the maximum number of bits needed to represent a code point is 24 (because there are 6 hexadecimal digits, and 6*4=24).
But then the documentation states:
The first encoding you might think of is using 32-bit integers as the
code unit
Why is that? The first encoding I could think of is with 24-bit integers, not 32-bit.

Actually you only need 21 bits. Many CPUs use 32-bit registers natively, and most languages have a 32-bit integer type.
If you study the UTF-16 and UTF-8 encodings, you’ll find that their algorithms encode a maximum of a 21-bit code point using two 16-bit code units and four 8-bit code units, respectively.
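You can verify this in Python with the highest code point:
ch = u'\U0010FFFF'                        # the largest code point
print(len(ch.encode('utf-16-le')) // 2)   # 2 16-bit code units
print(len(ch.encode('utf-8')))            # 4 8-bit code units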

Because it is the standard way. Python uses a different internal encoding depending on the content of the string: ASCII/Latin-1, UCS-2 or UCS-4 (see PEP 393). UTF-32 is a commonly used representation (usually internal to programs) for Unicode code points. So instead of inventing yet another encoding (e.g. a hypothetical UTF-22), Python just uses the UTF-32 representation. It is also easier for the different interfaces. Not very efficient in space, but much more efficient for string operations.
Note: in rare cases Python also uses the surrogate code point range to carry "wrong" bytes (the surrogateescape error handler), so the internal representation must cover those code points too.
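For example, the surrogateescape error handler smuggles undecodable bytes into lone surrogate code points and back out again:
raw = b'\xff\xfeok'                    # not valid UTF-8
text = raw.decode('utf-8', errors='surrogateescape')
print(ascii(text))                     # '\udcff\udcfeok'
# re-encoding restores the original bytes exactly
print(text.encode('utf-8', errors='surrogateescape') == raw)   # True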
Note: colour encoding is similar: 8 bits * 3 channels = 24 bits, but pixels are often stored in 32-bit integers (partly for other reasons: a single aligned write instead of extra read/write cycles on the bus). 32 bits is simply easier and faster to handle.

Related

When to raise UnicodeTranslateError?

The standard library documentation says:
exception UnicodeTranslateError
Raised when a Unicode-related error occurs during translating.
But translation is never defined. Doing a grep through the CPython source, I can't see any examples of this class being raised as an error from anything. What is this exception used for, and what's the difference between it and UnicodeDecodeError, which seems to be used much more frequently?
Unicode has room for more than 1 million code points. (At the moment "only" about 150,000 of them are assigned to characters, but in theory more than 1M can be used.) The highest code point, written as a binary number, has 21 binary digits, which means you need 21 bits, or at least 3 bytes, to encode all code points.
But the most used characters have Unicode code points that need fewer than 10 bits, many even fewer than 8 bits. So a 3-byte encoding would waste a lot of space when used to encode texts that consist mainly of characters with low code points.
On the other hand, a 3-byte encoding has disadvantages for processing on modern computers, because CPUs prefer units of 2, 4 or 8 bytes.
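As a quick check of the 21-bit claim in Python:
max_cp = 0x10FFFF             # the highest Unicode code point
print(max_cp.bit_length())    # 21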
And so there are different encodings for Unicode strings:
UTF-32 uses 32-bit fields to encode Unicode characters. This encoding is very fast to process, but wastes a lot of memory.
UCS-4 is just another name for UTF-32. The number 4 means: exactly 4 bytes (which are 32 bits).
UCS-2 uses 2 bytes and therefore 16-bit fields. You need only half the memory, but not all existing Unicode code points can be encoded in UCS-2.
UTF-16 also uses 16-bit fields, but two of these fields can be combined to encode one character. So UTF-16 too can encode all possible Unicode code points.
UTF-8 uses 1-byte fields. In theory you need between 1 and 3 bytes of payload to encode every code point, but you must also add the information whether a byte is the start byte of a code point and how many bytes the code point occupies. If you add these control bits to the 21 data bits, you get more than 24 bits in total, which means you need up to 4 bytes to encode every possible Unicode character.
There are even more encodings for unicode: UTF-1, UTF-7, CESU-8, GB 18030 and many more.
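As a quick comparison of how many bytes the main encodings need for a small mixed string ('a', 'é', U+2049, U+1F600; Python 3):
s = 'a\u00e9\u2049\U0001F600'
for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
    print(enc, len(s.encode(enc)))
# utf-8     10 bytes (1 + 2 + 3 + 4)
# utf-16-le 10 bytes (2 + 2 + 2 + 4: one surrogate pair)
# utf-32-le 16 bytes (4 * 4)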
And the fact that there are many different encodings makes it necessary to translate from one encoding to another in some situations. When you want to translate, for example, UTF-8 to UCS-2, you will get into trouble if the original text contains characters with code points outside the range that UCS-2 can encode. In that case you should raise a UnicodeTranslateError.
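The standard library itself almost never raises this exception, but you can raise it yourself. A minimal sketch (to_ucs2 is a hypothetical helper, not a stdlib function; Python 3):
def to_ucs2(text):
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            # signature: UnicodeTranslateError(object, start, end, reason)
            raise UnicodeTranslateError(text, i, i + 1,
                                        'code point outside UCS-2 range')
    # every code point fits in 2 bytes; little-endian, no BOM
    return b''.join(ord(ch).to_bytes(2, 'little') for ch in text)

print(to_ucs2('abc'))    # b'a\x00b\x00c\x00'
to_ucs2(u'\U0001F600')   # raises UnicodeTranslateError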

In Python, represent a SHA-256 hash using the fewest characters possible

What is the most dense way (fewest characters) that I can store a complete SHA-256 hash?
Calling .digest() on a hashlib.sha256 object will return a 32-byte bytes object -- the shortest possible way (with 8-bit bytes as the relevant unit) to store 256 bits of data, which is effectively random for compressibility purposes.
Since 8 * 32 == 256, this provably has no wastage -- every bit is used.
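For example:
import hashlib

digest = hashlib.sha256(b'some data').digest()
print(type(digest), len(digest))   # <class 'bytes'> 32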
Charles' answer is absolutely correct. However, I'm assuming that you don't want the shortest binary encoding of the SHA-256 hash - the 32-octet string - and instead want something printable and somewhat human-readable.
Note: however, this does not exactly apply to barcodes. QR codes, at least, encode binary data, so just use the digest() method of your hash - that is the most efficient encoding you can use there. Your QR code generation library will most likely support generating codes from "raw" binary strings - check your library's docs for the correct method/invocation.
SHA hashes (and other hashes) don't produce or operate on characters; they work with binary data. SHA-256 produces 256 bits of data, commonly represented as 32 bytes. In particular, in Python 3 you should notice that hashlib.sha256("...").digest() returns bytes, not str.
There is a convenience method hexdigest() that produces a hexadecimal (base-16) string representing those bytes. You can use base32, base58, base64, baseEmoji or any other encoding that fits your requirements.
Basically, your problem is actually "I have a number and want a short encoding of it". Decide how many distinct characters you can use (the encoding base) and use that. There are many libraries on PyPI that could help; python-baseconv may come in handy.
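As a sketch comparing character counts for stdlib encodings of the same 32-byte digest (base58 and friends need third-party libraries; Python 3):
import base64
import hashlib

digest = hashlib.sha256(b'hello').digest()   # 32 raw bytes
print(len(digest.hex()))                     # 64 hex characters
print(len(base64.b32encode(digest)))         # 56 characters (padded)
print(len(base64.b64encode(digest)))         # 44 characters (padded)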

Python length of unicode string confusion

There's been quite some help around this already, but I am still confused.
I have a unicode string like this:
title = u'😉test'
title_length = len(title) #5
But! I need len(title) to be 6. The clients expect it to be 6 because they seem to count in a different way than I do on the backend.
As a workaround I have written this little helper, but I am sure it can be improved (with enough knowledge about encodings) or perhaps it's even wrong.
title_length = len(title) + repr(title).count('\\U') #6
1. Is there a better way of getting the length to be 6? :-)
I assume that I (Python) am counting the number of Unicode characters, which is 5. Are the clients counting the number of bytes?
2. Would my logic break for other unicode characters that need 4 bytes for example?
Running Python 2.7, UCS-4 build.
You have 5 code points. One of those code points is outside the Basic Multilingual Plane, which means the UTF-16 encoding has to use two code units for that character.
In other words, the client is relying on an implementation detail and is doing something wrong. They should be counting code points, not code units. There are several platforms where this mistake happens quite regularly; Python 2 UCS-2 builds are one such, Java developers often forget about the difference, and so do users of the Windows APIs.
You can encode your text to UTF-16 and divide the number of bytes by two (each UTF-16 code unit is 2 bytes). Pick the utf-16-le or utf-16-be variant to not include a BOM in the length:
title = u'😉test'
len_in_codeunits = len(title.encode('utf-16-le')) // 2
If you are using Python 2 (and judging by the u prefix to the string you may well be), take into account that there are 2 different flavours of Python, depending on how you built it. Depending on a build-time configuration switch you'll either have a UCS-2 or UCS-4 build; the former uses surrogates internally too, and your title value length will be 6 there as well. See Python returns length of 2 for single Unicode character string.
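A quick check that the UTF-16 approach yields the client's count for any astral character, not just this emoji (Python 3 syntax):
for title in (u'\U0001F609test', u'\U00010000', u'abc'):
    code_points = len(title)
    code_units = len(title.encode('utf-16-le')) // 2
    print(code_points, code_units)
# prints 5 6, then 1 2, then 3 3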

How is unicode represented internally in Python?

How is a Unicode string literally represented in Python's memory?
For example, I can visualize 'abc' as its equivalent ASCII bytes in memory, and an integer can be thought of in its two's-complement representation. But u'\u2049', even though it is represented in UTF-8 as '\xe2\x81\x89' - 3 bytes long - how do I visualize the literal u'\u2049' code point in memory?
Is there a specific way it is stored in memory? Do Python 2 and Python 3 treat it differently?
A few related questions for anyone curious:
1) How are these strings represented internally in the Python interpreter?
2) What is the internal representation of a string in Python 3.x?
I'm assuming you want to know about CPython, the standard implementation. Python 2 and Python 3.0-3.2 use either UCS2* or UCS4 for Unicode characters, meaning it'll either use 2 bytes or 4 bytes for each character. Which one is picked is a compile-time option.
\u2049 is then represented as either \x49\x20 or \x20\x49 or \x49\x20\x00\x00 or \x00\x00\x20\x49 depending on the native byte order of your system and if UCS2 or UCS4 was picked. ASCII characters in a unicode string still use 2 or 4 bytes per character too.
Python 3.3 switched to a new internal representation, using the most compact form needed to represent all characters in a string. Either 1 byte, 2 bytes or 4 bytes per character is picked. ASCII and Latin-1 text uses just 1 byte per character, the rest of the BMP characters require 2 bytes, and everything beyond the BMP uses 4 bytes.
See PEP-393: Flexible String Representation for the full low-down on these representations.
* Technically speaking the UCS-2 build uses UTF-16, as non-BMP characters are encoded as 4 bytes (2 UTF-16 code units) using UTF-16 surrogates. However, the Python documentation still refers to this as UCS-2.
This does lead to unexpected behaviour, such as len() of a non-BMP unicode string being larger than the number of characters it contains.
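You can observe the PEP 393 behaviour with sys.getsizeof (the exact byte counts vary by CPython version; the per-character growth is the point):
import sys

# storage per character grows with the widest code point present
for s in ('abcd', 'ab\u2049d', 'ab\U0001F600d'):
    print(ascii(s), sys.getsizeof(s))
# the second string needs 2 bytes per character, the third needs 4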

Unicode in Python - just UTF-16?

I was happy in my Python world, knowing that I was doing everything in Unicode and encoding as UTF-8 when I needed to output something to a user. Then one of my colleagues sent me the "UTF-8 Everywhere" manifesto (2012) and it confused me.
The author of the article claims a number of times that UCS-2, the Unicode representation that Python uses, is synonymous with UTF-16.
He even goes as far as directly saying Python uses UTF-16 for internal string representation.
The author also admits to being a Windows lover and developer and states that the way MS has handled character encodings over the years has led to that group being the most confused so maybe it is just his own confusion. I don't know...
Can somebody please explain what the state of UTF-16 vs Unicode is in Python? Are they synonymous and if not, in what way?
The internal representation of a Unicode string in Python (versions from 2.2 up to 3.2) depends on whether Python was compiled in wide or narrow modes. Most Python builds are narrow (you can check with sys.maxunicode -- it is 65535 on narrow builds and 1114111 on wide builds).
With a wide build, strings are internally sequences of 4-byte wide characters, i.e. they use the UTF-32 encoding. All code points are exactly one wide-character in length.
With a narrow build, strings are internally sequences of 2-byte wide characters, using UTF-16. Characters beyond the BMP (code points U+10000 and above) are stored using the usual UTF-16 surrogate pairs:
>>> q = u'\U00010000'
>>> len(q)
2
>>> q[0]
u'\ud800'
>>> q[1]
u'\udc00'
>>> q
u'\U00010000'
Note that UTF-16 and UCS-2 are not the same. UCS-2 is a fixed-width encoding: every code point is encoded as 2 bytes. Consequently, UCS-2 cannot encode code points beyond the BMP. UTF-16 is a variable-width encoding; code points outside the BMP are encoded using a pair of characters, called a surrogate pair.
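The pair shown in the session above really does decode back to U+10000; the surrogate-pair arithmetic is:
high, low = 0xD800, 0xDC00    # q[0] and q[1] from the session above
cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(cp))                # 0x10000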
Note that this all changes in 3.3, with the implementation of PEP 393. Now, Unicode strings are represented using characters wide enough to hold the largest code point -- 8 bits for ASCII strings, 16 bits for BMP strings, and 32 bits otherwise. This does away with the wide/narrow divide and also helps reduce the memory usage when many ASCII-only strings are used.
