Python length of unicode string confusion

There's been quite a bit of help on this already, but I am still confused.
I have a unicode string like this:
title = u'😉test'
title_length = len(title) #5
But! I need len(title) to be 6. The clients expect it to be 6 because they seem to count in a different way than I do on the backend.
As a workaround I have written this little helper, but I am sure it can be improved (with enough knowledge about encodings) or perhaps it's even wrong.
title_length = len(title) + repr(title).count('\\U') #6
1. Is there a better way of getting the length to be 6? :-)
I assume Python is counting the number of Unicode code points, which is 5. Are the clients counting the number of bytes?
2. Would my logic break for other Unicode characters that need 4 bytes, for example?
Running Python 2.7, UCS-4 build.

You have 5 codepoints. One of those codepoints is outside of the Basic Multilingual Plane, which means the UTF-16 encoding has to use two code units for that character.
In other words, the client is relying on an implementation detail and is doing something wrong. They should be counting codepoints, not code units. There are several platforms where this mistake happens quite regularly: Python 2 UCS-2 builds are one such, Java developers often forget about the difference, and so does code written against the Windows APIs.
You can encode your text to UTF-16 and divide the number of bytes by two (each UTF-16 code unit is 2 bytes). Pick the utf-16-le or utf-16-be variant to not include a BOM in the length:
title = u'😉test'
len_in_codeunits = len(title.encode('utf-16-le')) // 2
If you are using Python 2 (and judging by the u prefix to the string you may well be), take into account that there are two different flavours of Python build. Depending on a build-time configuration switch you'll have either a UCS-2 or a UCS-4 build; the former uses surrogates internally too, and your title value's length will be 6 there as well. See Python returns length of 2 for single Unicode character string.
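If you'd rather not round-trip through an encoder, here is a minimal sketch that counts UTF-16 code units directly (the function name is mine; it assumes a Python where iterating a string yields whole code points, i.e. Python 3 or a UCS-4 Python 2 build):
def utf16_code_units(s):
    # BMP code points fit in one UTF-16 code unit; astral (non-BMP)
    # code points need a surrogate pair, i.e. two code units.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)

print(utf16_code_units(u'\U0001F609test'))  # 6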

When to raise UnicodeTranslateError?

The standard library documentation says:
exception UnicodeTranslateError
Raised when a Unicode-related error occurs during translating.
But translation is never defined. Grepping through the CPython source, I can't see any examples of this class actually being raised. What is this exception used for, and what's the difference between it and UnicodeDecodeError, which seems to be used much more frequently?
Unicode has room for more than 1 million code points. (At the moment "only" about 150,000 of them are assigned to characters, but in theory more than 1M can be used.) The largest code point, written as a binary number, has 21 digits, which means you need 21 bits, or at least 3 bytes, to encode every code point.
But the most-used characters have code points that need fewer than 10 bits, many even fewer than 8. So a fixed 3-byte encoding would waste a lot of space when you use it to encode texts that contain mainly characters with low code points.
On the other hand, a 3-byte encoding has disadvantages for processing on modern computers, because CPUs prefer units of 2, 4 or 8 bytes.
And so there are different encodings for Unicode strings:
UTF-32 uses 32-bit fields to encode Unicode characters. This encoding is very fast to process, but also wastes a lot of space in memory.
UCS-4 is just another name for UTF-32. The number 4 means: exactly 4 bytes (which is 32 bits).
UCS-2 uses 2-byte and therefore 16-bit fields. You need only half the memory, but not all existing Unicode code points can be encoded in UCS-2.
UTF-16 also uses 16-bit fields, but here two of these fields can be combined to encode one character. So UTF-16 can also encode all possible Unicode code points.
UTF-8 uses 1-byte fields. In theory you need between 1 and 3 bytes for the 21 data bits of any code point, but you must also add the information of whether a byte is the start byte of a code point and how many bytes the code point occupies. If you add these control bits to the 21 data bits, you get more than 24 bits in total, which means: you need up to 4 bytes to encode every possible Unicode character.
There are even more encodings for Unicode: UTF-1, UTF-7, CESU-8, GB 18030 and many more.
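You can see UTF-8's variable width directly from Python with a quick check of byte lengths per character:
for ch in ('A', '\u00e9', '\u20ac', '\U0001F609'):
    print(repr(ch), len(ch.encode('utf-8')), 'bytes')
# 'A' 1 bytes / 'é' 2 bytes / '€' 3 bytes / '😉' 4 bytes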
And the fact that there are many different encodings makes it necessary to translate from one encoding to another in some situations. When you translate, for example, UTF-8 to UCS-2, you will get into trouble if the original text contains characters with code points outside the range that UCS-2 can encode. In that case you should raise a UnicodeTranslateError.
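A minimal sketch of what that might look like (the helper name and the message are made up for illustration):
def to_ucs2(text):
    # Reject code points that UCS-2 (strictly 2 bytes per character)
    # cannot represent, before handing the text to a UCS-2 consumer.
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise UnicodeTranslateError(text, i, i + 1,
                                        'code point outside the UCS-2 range')
    return text.encode('utf-16-le')  # all BMP, so identical to UCS-2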

Unicode encoding, python documentation - why a 32-bit encoding? [duplicate]

This question already has answers here:
Why UTF-32 exists whereas only 21 bits are necessary to encode every character?
(4 answers)
Closed 3 years ago.
I am reading UNICODE Howto in the Python documentation.
It is written that
a Unicode string is a sequence of code points, which are numbers from
0 through 0x10FFFF
which makes it look like the maximum number of bits needed to represent a code point is 24 (because there are 6 hexadecimal digits, and 6*4=24).
But then the documentation states:
The first encoding you might think of is using 32-bit integers as the
code unit
Why is that? The first encoding I could think of is with 24-bit integers, not 32-bit.
Actually you only need 21. Many CPUs use 32-bit registers natively, and most languages have a 32-bit integer type.
If you study the UTF-16 and UTF-8 encodings, you’ll find that their algorithms encode a maximum of a 21-bit code point using two 16-bit code units and four 8-bit code units, respectively.
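To see how UTF-16 fits a 21-bit code point into two 16-bit code units, here is a quick sketch of the surrogate-pair arithmetic:
def surrogate_pair(cp):
    # Split a non-BMP code point (U+10000..U+10FFFF) into a UTF-16 high/low
    # surrogate pair: subtract 0x10000, leaving a 20-bit value, then store
    # 10 bits in each code unit.
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

print([hex(u) for u in surrogate_pair(0x1F609)])  # ['0xd83d', '0xde09']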
Because it is the standard way. Python uses a different internal representation depending on the content of the string: Latin-1, UCS-2 or UCS-4, where UCS-4 is the same as UTF-32. UTF-32 is a commonly used representation (usually internal to programs) for Unicode code points. So instead of reinventing another encoding (e.g. a "UTF-22"), Python just uses the UTF-32 representation. It is also easier for the different interfaces. Not so efficient on space, but much more so on string operations.
Note: Python also (in rare cases) uses the surrogate range to encode "wrong" bytes (the surrogateescape error handler), so the internal representation has to cover the full code point range, surrogates included.
Note: colour encoding went through something similar: 8 bits * 3 channels = 24 bits, but pixels are often stored in 32-bit integers (partly for other reasons: a single aligned write instead of 2 reads + 2 writes on the bus). 32 bits is much easier and faster to handle.

Caesar cipher with all Unicode printable characters

I want to create a Caesar cipher that can encode/decode Unicode printable characters (single- and multi-codepoint grapheme clusters, emojis etc.) from the whole of Unicode (except the private use area). Preferably, it will use a list of all printable characters.
NOTE: Even though I want to create a Caesar cipher, this is really not about encryption. The question is about investigating the properties of Unicode.
I found these questions:
What is the range of Unicode Printable Characters?
Cipher with all unicode characters
But I didn't get an answer to what I want.
Note:
If you give a coding answer, I am mostly interested in a solution that
uses either python3 or perl6, as they are my main languages.
Recently, I was given an assignment to write a Caesar cipher and then encode and decode an English text.
I solved it in Python by using the string module's built-in string.printable constant. Here is a printout of the constant:
(I used visual studio code)
[see python code and results below]
The documentation says:
'''
String of ASCII characters which are considered printable. This is a combination of digits, ascii_letters, punctuation, and whitespace.
'''
https://docs.python.org/3.6/library/string.html#string-constants
I am wondering how you could create a Caesar cipher that could encode/decode all the possible printable characters you can make from Unicode codepoints (just assume you have all necessary fonts to see those that should be visible on screen).
Here is my understanding of what it means for
something to be a printable character:
When I take the python string constant above,
and traverse it with the left or right arrow keys
on the keyboard, it takes me exactly 100 strokes to get
to the end (the same as the number of characters).
It looks like there is a one-to-one
correspondence between being a printable
character and being traversible with one stroke of an arrow key.
Now consider this string:
"👨‍👩‍👧‍👦ij
क्षि 🂡"
Based on pythons string.printable constant,
This string seems to me to be composed of the
following 7 printable characters:
(you can look up individual codepoints at: https://unicode-table.com/en/)
1 (family)
2 (Latin Small Ligature Ij)
3 (carriage return)
4 (Devanagari kshi)
5 (space)
6 (Zero Width No-Break Space)
7 (Ace of Spades)
👨‍👩‍👧‍👦
codepoints: 128104 8205 128105 8205 128103 8205 128102
(reference: https://emojipedia.org/family-man-woman-girl-boy/)
(Latin Small Ligature Ij)
ij
codepoint: 307
(Carriage Return)
codepoint: 13
(Devanagari kshi)
क्षि
codepoints: 2325 2381 2359 2367
(see this page: http://unicode.org/reports/tr29/)
(the code points on that page are given in hexadecimal rather than decimal)
(Space)
codepoint: 32
(Zero Width No-Break Space)
codepoint: 65279
(AKA U+FEFF BYTE ORDER MARK (BOM))
(https://en.wikipedia.org/wiki/Byte_order_mark)
(Playing Card Ace of Spades)
🂡
codepoint: 127137
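For reference, a quick way to reproduce these code point numbers in Python:
s = '\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466\u0133\r\u0915\u094D\u0937\u093F \uFEFF\U0001F0A1'
print([ord(c) for c in s])
# [128104, 8205, 128105, 8205, 128103, 8205, 128102, 307, 13,
#  2325, 2381, 2359, 2367, 32, 65279, 127137]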
When I paste this string into Notepad and try to traverse it with an arrow key, I end up using 10 key strokes rather than 7, because the family emoji needs 4 key strokes (probably because Notepad can't deal with the Zero Width Joiner, codepoint 8205, and of course Notepad can't display a family glyph). On the other hand, when I paste the string into Google search, I can traverse the whole string with 7 strokes.
Then I tried creating the string
in Perl6 to see what Perl6's grapheme
awareness would make of the string:
(I use the Atom editor)
[see perl6 code and results below]
perl6 thinks that the Devanagari kshi character क्षि (4 codepoints)
is actually 2 graphemes, each with 2 codepoints.
Even though it CAN be represented as two characters,
as seen in the above list,
I think this is a bug. Perl6 is supposed to be grapheme aware, and even my Windows Notepad (and Google search) treats it as a single grapheme/character.
Based on the two strings, the practical definition of a printable character seems to be this: 'It is any combination of Unicode codepoints that gets traversed by one push of a left or right arrow key on the keyboard under ideal circumstances.'
"Under ideal circumstances" means that you are using an environment that, so to speak, acts like Google search: that is, it recognizes for example an emoji (the 4-person family) or a grapheme cluster (the Devanagari character) as one printable character.
3 questions:
1:
Is the above a fair definition of what it means
to be a printable character in unicode?
2:
Regardless of whether you accept the definition,
do you know of any list of printable characters
that cover the currently used unicode planes and possible
grapheme clusters, rather than just the 100 ASCII characters
the Python string module has
(If I had such a list I imagine I could create a cipher
quite easily)?
3:
Given that such a list does not exist, and you
accept the definition,
how would you go about creating such a list with which
I could create a caesar cipher
that could cipher any/all printable
characters given the following 4 conditions?
NOTE: these 4 conditions are just
what I imagine is required for a proper
caesar cipher.
condition a
The string to be encrypted will
be a valid utf8 string consisting of standard
unicode code points (no unassigned, or private use area
codepoints)
condition b
The encrypted string must also be a valid
utf8 string consisting of standard
unicode code points.
condition c
You must be able to traverse the encrypted string
using the same number of strokes with
the left or right arrow keys on the keyboard as
the original string
(given ideal circumstances as described above).
This means that both the
man-woman-boy-girl family emoji
and the devanagari character,
when encoded, must each correspond to
exactly one other printable character and not a set
of "nonsense" codepoints that the
arrow keys will interpret as different characters.
It also means that a single codepoint character can
potentially be converted into a multi-codepoint character
and vice versa.
condition d
As in any encrypt/decrypt algorithm,
the string to be encrypted and
the string that has been decrypted
(the end result) must
contain the exact same codepoints
(the 2 strings must be equal).
# Python 3.6:
import string
# built-in library
print(string.printable)
print(type(string.printable))
print(len(string.printable))
# length of the string (number of ASCII characters)
#perl6
use v6;
my @ordinals = <128104 8205 128105 8205 128103 8205 128102>;
# array of the family codepoints
@ordinals.append(<307 13 2325 2381 2359 2367 32 65279 127137>);
# add the other codepoints
my $result_string = '';
for @ordinals {
    $result_string = $result_string ~ $_.chr;
}
# get a string of characters from the ordinal numbers
say @ordinals;                 # the list of codepoints
say $result_string;            # the string
say $result_string.chars;      # the number of characters
say $result_string.comb.perl;  # a list of characters in the string
python results:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
<class 'str'>
100
perl6 results:
[128104 8205 128105 8205 128103 8205 128102 307 13 2325 2381 2359 2367 32 65279 127137]
👨‍👩‍👧‍👦ij
क्षि 🂡
8
("👨‍👩‍👧‍👦", "ij", "\r", "क्", "षि", " ", "", "🂡").Seq
TL;DR I think your question is reasonable and deserves a better answer than the one I've written so far. Let's talk.
I don't think anyone can create a Caesar cipher with the requirements you list for multiple reasons.
But if your goal is really to "investigate properties of Unicode" rather than create a cipher then presumably that doesn't matter.
And of course I might just be experiencing a failure of imagination, or just a failure to understand Unicode despite spending years grappling with it.
If you critique technical aspects of my explanation below via comments I'll try to improve it and hopefully we'll both learn as we go. TIA.
"Caesar cipher with all Unicode printable characters"
This is the clean formulation you have in your title.
The only problematic parts are "Caesar", "cipher", "all", "Unicode", "printable", and "characters". Let's go thru them.
Caesar cipher
A Caesar cipher is an especially simple single-alphabet cipher. Unicode isn't one large single alphabet. But perhaps you could treat a subset of its codepoints as if they were one.
I'd say that's what the SO Cipher with all unicode characters was all about.
You've currently rejected that and introduced a bunch of extra aspects that are either impossible or so difficult that they might as well be.
Ignoring your priority of investigating Unicode properties, it would make sense if you instead settled for a regular ASCII cipher. Or perhaps go back to that Cipher with all unicode characters SO and pick up where they left off, perhaps noting that, according to a comment on that SO, they apparently stopped at just the BMP plane:
Note that you’re only using BMP code points (i.e. from U+0000 to U+FFFF). Unicode ranges from U+0000 to U+10FFFF, so you’re missing about a million code points :)
So perhaps you could do better. I don't think it would be worthwhile from the perspective of creating a cipher for its own sake but it might be for learning more about the properties of Unicode.
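For what it's worth, here is a sketch (mine, not from that SO) of a shift over all Unicode scalar values, skipping the surrogate block so the output stays valid. It works per code point, so it freely scrambles combining sequences; that is exactly the problem discussed in the rest of this answer.
SURROGATES = 0x800               # 2048 code points, U+D800..U+DFFF
TOTAL = 0x110000 - SURROGATES    # number of valid scalar values

def shift_codepoint(cp, key):
    idx = cp - SURROGATES if cp >= 0xE000 else cp       # close the gap
    idx = (idx + key) % TOTAL
    return idx + SURROGATES if idx >= 0xD800 else idx   # reopen the gap

def caesar(text, key):
    return ''.join(chr(shift_codepoint(ord(c), key)) for c in text)

assert caesar(caesar('abc\U0001F609', 12345), -12345) == 'abc\U0001F609'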
Cipher
@TomBlodget in their comment on your question notes that:
The complexity of text motivates modern ciphers to not deal with characters. They deal with bytes for both input and output. Where the input is text, the receiver has to be told the character encoding. Where further handling of the output must be as text, Base64 or similar is used. Expecting the output of a cipher to look like text is not generally a goal.
If you want a universal solution for a Unicode cipher, follow Tom's recipe.
All
In a comment on your question about the number of graphemes, @nwellnhof noted that:
there's an infinite number
But you then also quite reasonably replied that there's only going to be a finite number in any given text; that Unicode's intent is that Unicode compliant software may/will generate mojibake results if given degenerate input (where what counts as degenerate is somewhat open to refinement in Unicode updates); and that that's the basis on which you hope to proceed.
That's a reasonable response, but you still can't have "all" even when restricted to "all non-degenerate" and "only ones that could appear in real life", because there's still an effectively infinite number of well formed and potentially reasonable characters.
I really ought to insert some calculations here to put some bounds on the problem. Is "effectively infinite" a trillion? Why? That sort of thing. But before digging into that I'll await comments.
Let's pretend it's a trillion, and that that's not a problem, and move on.
Unicode
Unicode is enormously complex.
You've been given an assignment to produce a Caesar cipher, a very simple thing.
They really don't mix well unless you lean heavily on keeping things simple.
But you want to investigate properties of Unicode. So perhaps you want to wade into all the complexity. But then the question is, how many years do you want to spend exploring the consequences of opening this pandora's box? (I've been studying Unicode on and off for a decade. It's complicated.)
Printable
You linked to the SO question "What is the range of Unicode Printable Characters?". This includes an answer that notes:
The more you learn about Unicode, the more you realize how unexpectedly diverse and unfathomably weird human writing systems are. In particular whether a particular "character" is printable is not always obvious.
But you presumably read that and refused to be deterred. Which is both admirable and asking for trouble. For example, it seems to have driven you to define "printable" as something like "takes one or more keystrokes to traverse" which is so fraught it's hard to know where to start -- so I'll punt on that till later in this answer.
Characters
Given that your aim is to write a Caesar cipher, a cipher that was used thousands of years ago that acts on characters, it makes sense that you focused on "what a user thinks of as a character".
Per Unicode's definition, this is called a "grapheme".
One of your example characters makes it clear how problematic the distinction is between "what a user thinks of as a character" (a grapheme) and a codepoint (what Python thinks of as a character):
print('क्षि'[::-1])
िष्क
This shows mangling of a single "character" (a single grapheme) written in Devanagari, which is, according to Wikipedia, "one of the most used and adopted writing systems in the world".
(Or, if we want to ignore the half of the planet this mangling ever more routinely affects and just focus on the folk who thought they were safe:
print('🇬🇧'[::-1])
🇧🇬
That's a flag of one nation turning into another's. Fortunately flags rarely appear in text -- though that's changing now text is increasingly arbitrary Unicode text like this text I'm writing -- and flag characters are not that important and both Great Britain and Bulgaria are members of the EU so it's probably not nearly as bad as scrambling the text of a billion Indians.)
Graphemes
So you quite reasonably thought to yourself, "Maybe Perl 6 will help".
To quote UAX#29, the Unicode Annex document on "Unicode Text Segmentation":
This document defines a default specification for grapheme clusters.
Perl 6 has implemented a grapheme clustering mechanism. It could in principle cluster in a variety of ways but for now it's implemented the default specification. This is what allows Perl 6 to avoid the mistakes Python's making in the above.
But the Unicode document continues:
[the specification for grapheme clusters] may be customized for particular languages, operations, or other situations.
So you can't just eyeball some text (or give it to some software) and say what "characters" it contains if by "character" you mean "what a user thinks of as a character".
It gets worse...
Keystrokes
"👨‍👩‍👧‍👦ij क्षि 🂡" ... notepad ... 10 key strokes ... google search ... 7 strokes ... Perl6 ... Atom editor ... perl6 thinks क्षि ... is actually 2 graphemes ... I think this is a bug ... notepad (and google search) thinks it is a single grapheme/character
For me, google search needs 10 keystrokes -- because it's not to do with google search but instead aspects of my system, including which web browser I'm using (Firefox) and other details.
Some editors could be configurable so that cursoring over 'क्षि' (or 'fi') would be either 1 or 2 keystrokes depending on how you configure them and/or what language you specify the text is written in. For me, editing this SO answer using Firefox on Linux Mint, it takes 2 keystrokes to cursor over क्षि.
Perl 6 correctly reports the .chars result for 'क्षि' as 2 by default because that's what Unicode says it is per the default grapheme clustering specification. ("Extended Grapheme Clusters".) That happens to match what Firefox on Linux Mint does editing this SO answer because the stars line up and it's Sunday.
Notepad or other software reasonably takes just one keystroke to cursor over क्षि, while other editors reasonably take two, because both are reasonable per the Unicode specification:
arrow key movement ... could use knowledge specific to particular fonts to move in a more granular manner, in circumstances where it would be useful to edit individual components
My emphasis added. Unicode leaves it up to the software to decide how the cursor will move.
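If you want to experiment with the default clustering from Python, the third-party regex module (not the stdlib re) supports \X for extended grapheme clusters. The result depends on which Unicode version your regex build implements; Unicode 15.1 added a rule (GB9c) that joins conjuncts like क्षि into one cluster:
import regex  # third-party: pip install regex

print(regex.findall(r'\X', '\u0915\u094D\u0937\u093F'))
# older default rules: ['क्', 'षि'] (two clusters, matching Perl 6 above)
# Unicode 15.1+ rules: ['क्षि'] (one cluster)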
Your questions
1: Is the above a fair definition of what it means to be a printable character in unicode?
I don't think so. Hopefully the foregoing explains why, or at least points you in the directions you would need to research (for a year or three) to understand why.
2: ... do you know of any list of printable characters that cover the currently used unicode planes and possible grapheme clusters ...
There's such a vast number of "possible grapheme clusters" that can reasonably occur that even excluding degenerate codepoint combinations leaves you with an effectively infinite list.
And any small subset anyone may have created would not be canonical because the Unicode consortium would not bless it and folk would argue about what should be included.
3: ... how would you go about creating such a list with which I could create a caesar cipher that could cipher any/all printable characters given the following 4 conditions?
First, your conditions are far too strenuous. See the next section.
But even if you drop the ones that are too difficult, it's still far too difficult, and the outcome far too uninteresting, to make doing anything worthwhile.
4: If you think creating such a list is a terrible idea, how would you create the cipher?
If it were me and it had to be a Caesar cipher I'd make it just handle bytes, as per Tom's comment at the start of this answer.
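A minimal sketch of that byte-level approach (the function name is mine). Note the ciphertext is generally not valid UTF-8, which is Tom's point: cipher output shouldn't be expected to look like text.
def caesar_bytes(data, key):
    # Shift every byte mod 256; decrypt by shifting with -key.
    return bytes((b + key) % 256 for b in data)

msg = '\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466 \u0915\u094D\u0937\u093F'.encode('utf-8')
assert caesar_bytes(caesar_bytes(msg, 42), -42) == msg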
Your conditions
The string to be encrypted will be a valid utf8 string consisting of standard unicode code points (no unassigned, or private use area codepoints)
It'll need to be more restricted than that, but it's reasonable to say it'll need to be a valid Unicode string. If you want to insist it's utf8 that's fine too.
The encrypted string must also be a valid utf8 string consisting of standard unicode code points
Sure.
You must be able to traverse the encrypted string using the same number of strokes ... as the original string
You can have that for a small subset of Unicode characters. But...
This means [one keystroke for both the original and encoded version of] the devanagari character [क्षि]
... is not a reasonable requirement.
You could ensure the same grapheme clustering (character) interpretation of a given text if you wrote a custom implementation of the grapheme clustering for your cipher that was a faithful copy of the implementation of grapheme clustering used to control the cursor.
But then you'd have to maintain these two code bases to keep them in sync. And that would be for just one particular system configuration.
It would be a ridiculous amount of pain. And for zero, or at most minuscule, gain.
the string to be encrypted and the string that has been decrypted (the end result) must contain the exact same codepoints (the 2 strings must be equal).
So, no normalization.
That rules out all Perl 6's grapheme aware goodies.
Then again, you need to drop paying attention to graphemes anyway because there's effectively an infinite number of them.
Conclusion
My answer only touches lightly on the topics it covers and probably contains lots of errors. If you critique it I'll try improve it.

Why can't a unicode string be compared to a byte string in python?

From the Pattern notes in Python's re module source, I see 'Unicode string can't be compared with byte string', but why? You can read the line here: https://github.com/python/cpython/blob/3.5/Lib/re.py
Python 3 introduced a somewhat controversial change where all Python strings are Unicode strings, and all byte strings need to have an encoding specified before they can be converted to Unicode strings.
This goes with the Python principle of "explicit is better than implicit", and removes a large number of potential bugs where implicit conversion would quietly produce wrong or corrupt results when the programmer was careless or unaware of the implications.
The flip side of this is now that it's hard to write code which mixes Unicode and byte strings unless you properly understand the model. (Well, it was hard before, too; but programmers who were oblivious remained so, and thought their code worked until someone tested it properly. Now they get errors up front.)
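A quick Python 3 illustration; the re behaviour is what the linked source comment is about:
>>> 'abc' == b'abc'        # str and bytes never compare equal in Python 3
False
>>> import re
>>> re.match('a', b'abc')  # and mixing them in re raises up front
Traceback (most recent call last):
  ...
TypeError: cannot use a string pattern on a bytes-like object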
Briefly, quoting from the Stack Overflow character-encoding tag info page:
just like changing the font from Arial to Wingdings
changes what your text looks like,
changing encodings affects the interpretation
of a sequence of bytes.
For example, depending on the encoding,
the bytes 0xE2 0x89 0xA0 could represent
the text â‰  in Windows code page 1252,
or Б┴═ in KOI8-R,
or the character ≠ in UTF-8.
Python 2 would do some unobvious stuff under the hood to coerce this byte string into a native string, which depending on context might involve the local system's "default encoding", and thus produce different results on different systems, creating some pretty hard bugs. Python 3 requires you to explicitly say how the bytes should be interpreted if you want to convert them into a string.
bytestr = b'\xE2\x89\xA0'
fugly = bytestr.decode('cp1252') # u'â‰\xa0'
cyril = bytestr.decode('koi8-r') # u'Б┴═'
wtf_8 = bytestr.decode('utf-8') # u'≠'

Unicode in Python - just UTF-16?

I was happy in my Python world knowing that I was doing everything in Unicode and encoding as UTF-8 when I needed to output something to a user. Then one of my colleagues sent me "The UTF-8 Everywhere manifesto" (2012) and it confused me.
The author of the article claims a number of times that UCS-2, the Unicode representation that Python uses, is synonymous with UTF-16.
He even goes as far as directly saying Python uses UTF-16 for internal string representation.
The author also admits to being a Windows lover and developer and states that the way MS has handled character encodings over the years has led to that group being the most confused so maybe it is just his own confusion. I don't know...
Can somebody please explain what the state of UTF-16 vs Unicode is in Python? Are they synonymous and if not, in what way?
The internal representation of a Unicode string in Python (versions from 2.2 up to 3.2) depends on whether Python was compiled in wide or narrow modes. Most Python builds are narrow (you can check with sys.maxunicode -- it is 65535 on narrow builds and 1114111 on wide builds).
With a wide build, strings are internally sequences of 4-byte wide characters, i.e. they use the UTF-32 encoding. All code points are exactly one wide-character in length.
With a narrow build, strings are internally sequences of 2-byte wide characters, using UTF-16. Characters beyond the BMP (code points U+10000 and above) are stored using the usual UTF-16 surrogate pairs:
>>> q = u'\U00010000'
>>> len(q)
2
>>> q[0]
u'\ud800'
>>> q[1]
u'\udc00'
>>> q
u'\U00010000'
Note that UTF-16 and UCS-2 are not the same. UCS-2 is a fixed-width encoding: every code point is encoded as 2 bytes. Consequently, UCS-2 cannot encode code points beyond the BMP. UTF-16 is a variable-width encoding; code points outside the BMP are encoded using a pair of characters, called a surrogate pair.
Note that this all changes in 3.3, with the implementation of PEP 393. Now, Unicode strings are represented using characters wide enough to hold the largest code point -- 8 bits for ASCII strings, 16 bits for BMP strings, and 32 bits otherwise. This does away with the wide/narrow divide and also helps reduce the memory usage when many ASCII-only strings are used.
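You can observe the PEP 393 behaviour directly. Exact numbers vary by platform and version, but the per-character cost grows with the widest code point in the string:
import sys
# ~1 byte per char (Latin-1 range), ~2 (BMP), ~4 (astral), plus a fixed header.
for s in ('a' * 100, '\u20ac' * 100, '\U0001F609' * 100):
    print(len(s), sys.getsizeof(s))
print(len('\U0001F609'))  # 1 on Python 3.3+: no visible surrogate pairs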
