I'm currently learning Python and as a Slovenian I often use UTF-8 characters to test my programs. Normally everything works fine, but there is one catch that I can't overcome. Even though I've got the encoding declared at the top of the file, it fails when I try to reverse a string containing special characters:
#-*- coding: utf-8 -*-
a = "čšž"
print a #prints čšž
b = a[::-1]
print b #prints �šō� instead of žšč
Is there any way to fix that?
Python 2 strings are byte strings, and UTF-8 encoded text uses multiple bytes per character. Just because your terminal manages to interpret the UTF-8 bytes as characters, doesn't mean that Python knows about what bytes form one UTF-8 character.
Your bytestring consists of 6 bytes, every two bytes form one character:
>>> a = "čšž"
>>> a
'\xc4\x8d\xc5\xa1\xc5\xbe'
However, how many bytes UTF-8 uses depends on where in the Unicode standard the character is defined; ASCII characters (the first 128 characters in the Unicode standard) only need 1 byte each, and many emoji need 4 bytes!
In UTF-8 order is everything; reversing the above bytestring reverses the bytes, resulting in some gibberish as far as the UTF-8 standard is concerned, but the middle 4 bytes just happen to be valid UTF-8 sequences (for š and ō):
>>> a[::-1]
'\xbe\xc5\xa1\xc5\x8d\xc4'
-----~~~~~~~~^^^^^^^^####
  |     š       ō     |
  \                   \
   invalid UTF-8 byte  opening UTF-8 byte missing a second byte
You'd have to decode the byte string to a unicode object, which consists of single characters. Reversing that object gives you the right results:
b = a.decode('utf8')[::-1]
print b
You can always encode the object back to UTF-8 again:
b = a.decode('utf8')[::-1].encode('utf8')
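For comparison, here is a minimal sketch of the same fix written for Python 3, where bytes is the byte-string type and str is already Unicode text. The variable names are illustrative only, and the characters are written as escapes so the snippet does not depend on the source file encoding.

```python
# Python 3 sketch: reverse text by code point, not by byte.
text = "\u010d\u0161\u017e"        # the three characters č, š, ž
raw = text.encode("utf-8")         # 6 bytes, two per character

broken = raw[::-1]                 # reverses the bytes: no longer valid UTF-8
fixed = raw.decode("utf-8")[::-1]  # decode first, then reverse the code points

print(len(raw), len(text))                        # byte count vs. code point count
print(ascii(broken.decode("utf-8", "replace")))   # invalid bytes become U+FFFD
print(ascii(fixed))
print(fixed.encode("utf-8"))                      # back to UTF-8 bytes if needed
```

Decoding the reversed bytes with the "replace" error handler reproduces the garbled output from the question: a replacement character, š, ō, and another replacement character.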
Note that in Unicode, you can still run into issues when reversing text, when combining characters are used. Reversing text with combining characters places those combining characters in front rather than after the character they combine with, so they'll combine with the wrong character instead:
>>> print u'e\u0301a'
éa
>>> print u'e\u0301a'[::-1]
áe
You can mostly avoid this by converting the Unicode data to its composed normal form, NFC (which replaces base-plus-combining sequences with single precomposed code points where they exist), but there are plenty of other exotic Unicode characters that don't play well with string reversals.
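A rough sketch of both approaches, in Python 3 syntax: normalising to NFC before reversing, and grouping combining marks with their base character. The helper name reverse_text is hypothetical, and it only handles combining marks, not every kind of multi-code-point sequence (emoji joined with zero-width joiners, for instance).

```python
import unicodedata

def reverse_text(text):
    """Reverse text, keeping combining marks attached to their base character."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

decomposed = "e\u0301a"     # 'e' + COMBINING ACUTE ACCENT + 'a'
naive = decomposed[::-1]    # the accent now follows 'a' instead of 'e'

composed = unicodedata.normalize("NFC", decomposed)  # 'e' + accent becomes U+00E9
print(ascii(naive))
print(ascii(composed[::-1]))
print(ascii(reverse_text(decomposed)))
```

Normalisation only helps where a precomposed code point exists; the grouping approach keeps the accent on the right letter either way.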
It may be too late at night for me to be still doing programming (so apologies if this is a very silly thing to ask), but I have spotted a weird behaviour with string decoding in Python:
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> name = bs.decode("utf-8", "replace")
>>> print(name)
I n t e l ( R )
>>> list_of_dict = []
>>> list_of_dict.append({'name': name})
>>> list_of_dict
[{'name': 'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00'}]
How can the list contain unicode characters if it has already been decoded?
Decoding bytes by definition produces "Unicode" (text really, where Unicode is how you can store arbitrary text, so Python uses it internally for all text), so when you say "How can the list contain unicode characters if it has already been decoded?" it betrays a fundamental misunderstanding of what Unicode is. If you have a str in Python 3, it's text, and that text is composed of a series of Unicode code points (with unspecified internal encoding; in fact, modern Python stores in ASCII, latin-1, UCS-2 or UCS-4, depending on highest ordinal value, as well as sometimes caching a UTF-8 representation, or a native wchar representation for use with legacy extension modules).
You're seeing the repr of the nul character (Unicode ordinal 0) and thinking it didn't decode properly, and you're likely right (there's nothing illegal about nul characters, they're just not common in plain text); your input data is almost certainly encoded in UTF-16-LE, not UTF-8. Use the correct codec, and the text comes out correctly:
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> bs.decode('utf-16-le') # No need to replace things, this is legit UTF-16-LE
'Intel(R)'
>>> list_of_dict = [{'name': _}]
>>> list_of_dict
[{'name': 'Intel(R)'}]
Point is, while producing nul characters is legal, unless it's a binary file, odds are it won't have any, and if you're getting them, you probably picked the wrong codec.
The discrepancy between printing the str and displaying it as part of a list/dict arises because list/dict stringify with the repr of their contents (what you'd type to reproduce the object programmatically in many cases), so the string is rendered with the \x00 escapes. Printing the str directly doesn't involve the repr, so the nul characters are sent to the terminal as-is; there is no printable character for nul, and your terminal chose to render them as spaces.
So what I think is happening is that the null characters \x00 are not properly decoded and remain in the string after decoding. However, since these are null characters, they do not mess things up when you print the string, which renders them as nothing or as spaces (in my case I tested your code on Arch Linux with python2 and python3 and they were completely omitted).
Now the thing is that you got a \x00 character after each of your string characters when you decoded with utf-8, which means that your byte stream actually consists of 16-bit code units, not 8-bit ones. Therefore, if you decode using utf-16 instead, your code will work like a charm :) (With no byte order mark in the data, the utf-16 codec assumes the machine's native byte order, so this relies on a little-endian machine; utf-16-le states the order explicitly.)
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> t = bs.decode("utf-16", "replace")
>>> print(t)
Intel(R)
>>> t
'Intel(R)'
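A small Python 3 sketch of the byte-order point, under the assumption that the data really is little-endian UTF-16 without a byte order mark (BOM). The plain "utf-16" codec uses a BOM to pick the byte order when one is present, and otherwise falls back to the machine's native order, so naming the order explicitly is the more predictable choice.

```python
import codecs

data = b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00'

explicit = data.decode("utf-16-le")                       # byte order stated up front
with_bom = (codecs.BOM_UTF16_LE + data).decode("utf-16")  # the BOM decides, then is dropped
wrong = data.decode("utf-8")                              # succeeds, but keeps every NUL

print(explicit)
print(with_bom)
print(repr(wrong), len(wrong))
```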
I am working with Russian words written in the Cyrillic orthography. Everything is working fine except for how many (but not all) of the Cyrillic characters are encoded as two characters when in an str. For instance:
>>> print ["ё"]
['\xd1\x91']
This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":
>>> print [u"ё"]
[u'\u0451']
But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).
So... how do I get around this? If it helps, I am using python 2.7
There are two possible situations here.
Either your str represents valid UTF-8 encoded data, or it does not.
If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...
Perhaps the easiest way to work around the issue would be to pass 'ignore' as the errors argument to decode(), as in mystring.decode('utf-8', 'ignore'). This will entirely discard invalid byte sequences and only give you the "correct" portions of the string.
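To make the trade-off concrete, here is a short Python 3 sketch comparing the default strict behaviour with the "ignore" and "replace" error handlers. The sample bytes are made up for illustration.

```python
# 0xff is never valid in UTF-8; the trailing 0xc5 is a lead byte missing its second byte.
damaged = b"abc\xffd\xc5"

try:
    damaged.decode("utf-8")            # the default handler is "strict"
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

ignored = damaged.decode("utf-8", errors="ignore")    # silently drops the bad bytes
replaced = damaged.decode("utf-8", errors="replace")  # marks each problem with U+FFFD

print(strict_ok)
print(ascii(ignored))
print(ascii(replaced))
```

"replace" is often the safer of the two, since it leaves a visible marker where data was lost and keeps later character positions closer to the original.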
These are actually different encodings:
>>> print ["ё"]
['\xd1\x91']
>>> print [u"ё"]
[u'\u0451']
What you're seeing is the __repr__'s for the elements in the lists. Not the __str__ versions of the unicode objects.
"But the strs are being passed around as variables, and so can't be prefixed with u"
You mean the data are strings, and need to be converted into the unicode type:
>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'
You need to decode the UTF-8 byte strings into unicode objects:
>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'
And you'll see with this transform they're perfectly fine.
To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:
>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'
The encoding depends on the data source. It can be anything e.g., if the data comes from a web page; see A good way to get the charset/encoding of an HTTP response in Python
Don't use non-ascii characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals.
Note: a single user-perceived character may span several Unicode codepoints e.g.:
>>> print(u'\u0435\u0308')
ё
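As a sketch of what that means for indexing and replacing, the two-code-point form can be folded into the single precomposed code point with NFC normalisation. This is written for Python 3 and uses only the standard unicodedata module.

```python
import unicodedata

decomposed = "\u0435\u0308"   # CYRILLIC SMALL LETTER IE + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # two code points vs. one
print(ascii(composed))

# Once normalised, a plain replace finds the character regardless of how it was typed.
word = unicodedata.normalize("NFC", "\u0435\u0308\u043b\u043a\u0430")  # the word "ёлка"
plain = word.replace("\u0451", "\u0435")
```

Normalising all input to one form before indexing or comparing avoids treating the same visible letter as two different strings.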
I have a tree structure in which keywords may contain some latin characters. I have a function which loops through all leaves of the tree and adds each keyword to a list under certain conditions.
Here is the code I have for adding these keywords to the list:
print "Adding: " + self.keyword
leaf_list.append(self.keyword)
print leaf_list
If the keyword in this case is université, then my output is:
Adding: université
['universit\xc3\xa9']
It appears that the print function properly shows the latin character, but when I add it to the list, it gets decoded.
How can I change this? I need to be able to print the list with the standard latin characters, not the decoded version of them.
You don't have unicode objects, but byte strings with UTF-8 encoded text. Printing such byte strings to your terminal may work if your terminal is configured to handle UTF-8 text.
When converting a list to string, the list contents are shown as representations; the result of the repr() function. The representation of a string object uses escape codes for any bytes outside of the printable ASCII range; newlines are replaced by \n for example. Your UTF-8 bytes are represented by \xhh escape sequences.
If you were using Unicode objects, the representation would use \xhh escapes still, but for Unicode codepoints in the Latin-1 range (outside ASCII) only (the rest are shown with \uhhhh and \Uhhhhhhhh escapes depending on their codepoint); when printing Python automatically encodes such values to the correct encoding for your terminal:
>>> u'université'
u'universit\xe9'
>>> len(u'université')
10
>>> print u'université'
université
Compare this to byte strings:
>>> 'université'
'universit\xc3\xa9'
>>> len('université')
11
>>> 'université'.decode('utf8')
u'universit\xe9'
>>> print 'université'
université
Note that the length reflects that the é codepoint is encoded to two bytes as well. It was my terminal that presented Python with the \xc3\xa9 bytes when pasting the é character into the Python session, by the way, as it is configured to use UTF-8, and Python has detected this and decoded the bytes when I defined a u'..' Unicode object literal.
I strongly recommend you read the following articles to understand how Python handles Unicode, and what the difference is between Unicode text and encoded byte strings:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
When you print a list, you get the repr of the items it contains, which for strings is different from their contents:
>>> a = ['foo', 'bär']
>>> print(a[0])
foo
>>> print(repr(a[0]))
'foo'
>>> print(a[1])
bär
>>> print(repr(a[1]))
'b\xc3\xa4r'
The output of repr is supposed to be programmer-friendly, not user-friendly, hence the quotes and the hex codes. To print a list in a user-friendly way, write your own loop. E.g.
>>> print '[', ', '.join(a), ']'
[ foo, bär ]
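The same idea as a small Python 3 sketch, for a list that may still contain UTF-8 byte strings. The helper name format_items and the assumption that the bytes are UTF-8 are mine, not part of the question.

```python
def format_items(items, encoding="utf-8"):
    """Render a list for display, decoding any byte strings first."""
    parts = []
    for item in items:
        if isinstance(item, bytes):
            item = item.decode(encoding)   # assumes the bytes use `encoding`
        parts.append(item)
    return "[" + ", ".join(parts) + "]"

leaf_list = [b"universit\xc3\xa9", "foo"]

print(repr(leaf_list))                  # programmer-friendly: escapes and quotes
print(ascii(format_items(leaf_list)))   # the user-friendly string, shown escaped here
```

In a real program the cleaner fix is to decode once at the input boundary and keep text as Unicode throughout, so no decoding is needed at print time.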
Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?:
Executing a module with:
# -*- coding: utf-8 -*-
a = 'á'
ua = u'á'
print a, ua
Results in: á á
More testing using Python shell:
>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'
So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I'm even more confused now! :S
unicode is meant to handle text. Text is a sequence of code points, each of which may need more than a single byte to store. Text can be encoded in a specific encoding (e.g. utf-8, latin-1...) to represent it as raw bytes.
Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
In contrast, str in Python 2 is a plain sequence of bytes. It does not represent text!
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
Some differences that you can see:
>>> len(u'à') # a single code point
1
>>> len('à') # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�
Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level. For example you can do:
>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù
What before was valid UTF-8, isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid unicode text.
You can remove a code point, replace a code point with a different code point etc. but you cannot mess with the internal representation.
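A Python 3 sketch of the same contrast, where bytes plays the role of Python 2's str and str plays the role of unicode. Removing a single byte breaks the encoding, while removing a code point cannot.

```python
encoded = "\xe0\xe8\xec\xf2\xf9".encode("utf-8")   # 'àèìòù' as UTF-8 bytes
damaged = encoded.replace(b"\xa8", b"")            # drop the second byte of 'è'

try:
    damaged.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# Working on text instead: remove the whole code point U+00E8.
text = encoded.decode("utf-8").replace("\xe8", "")

print(encoded)
print(still_valid)
print(ascii(text), text.encode("utf-8"))
```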
Unicode and encodings are completely different, unrelated things.
Unicode
Assigns a numeric ID to each character:
0x41 → A
0xE1 → á
0x414 → Д
So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.
Even the little arrow → I used has its Unicode number, it's 0x2192. And even emojis have their Unicode numbers, 😂 is 0x1F602.
You can look up the Unicode number of any character in the official Unicode code charts, including the three characters above, the arrow, and the emoji.
These numbers assigned to all characters by Unicode are called code points.
The purpose of all this is to provide a means to unambiguously refer to each character. For example, if I'm talking about 😂, instead of saying "you know, this laughing emoji with tears", I can just say, Unicode code point 0x1F602. Easier, right?
Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.
Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).
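The arithmetic behind those figures, as a quick Python 3 check. The surrogate range U+D800 to U+DFFF is the standard one; the variable names are just for illustration.

```python
import sys

total = 0x10FFFF + 1                # U+0000 through U+10FFFF inclusive
surrogates = 0xDFFF - 0xD800 + 1    # U+D800 through U+DFFF, reserved for UTF-16
usable = total - surrogates

print(total, surrogates, usable)
print(hex(sys.maxunicode))          # the highest code point Python 3 will accept
```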
The important thing to remember is that all Unicode does is to assign a numerical ID, called code point, to each character for easy and unambiguous reference.
Encodings
Map characters to bit patterns.
These bit patterns are used to represent the characters in computer memory or on disk.
There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:
ASCII
Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.
Example:
a → 1100001 (0x61)
You can see all the mappings in this table.
ISO 8859-1 (aka Latin-1)
Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.
Example:
a → 01100001 (0x61)
á → 11100001 (0xE1)
You can see all the mappings in this table.
UTF-8
Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of either length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).
Example:
a → 01100001 (0x61)
á → 11000011 10100001 (0xC3 0xA1)
≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)
The way UTF-8 encodes characters to bit strings is specified in RFC 3629.
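To make the bit layout tangible, here is a hand-written sketch of a UTF-8 encoder for a single code point in Python 3, checked against the built-in codec. The function names encode_utf8 and bits are hypothetical; in practice you would simply call str.encode("utf-8").

```python
def encode_utf8(cp):
    """Encode one code point by hand, following the UTF-8 bit layout."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def bits(data):
    """Format bytes as space-separated 8-bit groups."""
    return " ".join(format(b, "08b") for b in data)

for cp in (0x61, 0xE1, 0x2260, 0x1F602):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print("U+%04X" % cp, bits(encode_utf8(cp)))
```

The leading bits of the first byte say how many bytes the sequence has, and every following byte starts with 10, which is why reordering or dropping bytes breaks the encoding.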
Unicode and Encodings
Looking at the above examples, it becomes clear how Unicode is useful.
For example, if I'm Latin-1 and I want to explain my encoding of á, I don't need to say:
"I encode that a with an aigu (or however you call that rising bar) as 11100001"
But I can just say:
"I encode U+00E1 as 11100001"
And if I'm UTF-8, I can say:
"Me, in turn, I encode U+00E1 as 11000011 10100001"
And it's unambiguously clear to everybody which character we mean.
Now to the often arising confusion
It's true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.
For example:
ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.
Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.
Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.
Back to your question
The encoding used by your Python interpreter is UTF-8.
Here's what's going on in your examples:
Example 1
The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.
>>> a = 'á'
When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':
>>> a
'\xc3\xa1'
Example 2
The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don't know which data format Python uses internally to represent the code point U+00E1 in memory, and it's unimportant to us):
>>> ua = u'á'
When you look at the value of ua, Python tells you that it contains the code point U+00E1:
>>> ua
u'\xe1'
Example 3
The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:
>>> ua.encode('utf-8')
'\xc3\xa1'
Example 4
The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:
>>> ua.encode('latin1')
'\xe1'
There's no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.
Your terminal happens to be configured to UTF-8.
The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.
This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.
You can see the difference when you encode the unicode value to different output encodings:
>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'
Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.
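This correspondence can be checked directly. The sketch below uses Python 3 and relies on the fact that Python's latin-1 codec maps all 256 byte values, control characters included.

```python
# Every code point below 256 encodes to the single Latin-1 byte with the same value.
matches = all(chr(i).encode("latin-1") == bytes([i]) for i in range(256))

# Beyond U+00FF there is no Latin-1 byte to use.
try:
    chr(0x100).encode("latin-1")
    beyond_ok = True
except UnicodeEncodeError:
    beyond_ok = False

# UTF-8 only agrees with the code point value in the ASCII range.
utf8_form = "\xe1".encode("utf-8")

print(matches, beyond_ok, utf8_form)
```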
Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with codepoints beyond U+00FF a different escape sequence, \u.... is used instead, with a four-digit hex value.
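The three escape styles can be seen side by side with a short sketch. Note one assumption about versions: in Python 3, repr() leaves printable non-ASCII characters unescaped, so the built-in ascii() is used here to get escapes in the style Python 2's repr produced.

```python
latin1_range = ascii("\xe1")          # code point below U+0100: \x with two hex digits
bmp_range = ascii("\u0451")           # up to U+FFFF: \u with four hex digits
astral_range = ascii("\U0001f602")    # above U+FFFF: \U with eight hex digits

print(latin1_range, bmp_range, astral_range)
print(repr("\xe1".encode("utf-8")))   # bytes always show non-ASCII values as \x escapes
```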
It looks like you don't yet fully understand what the difference is between Unicode and an encoding. Please do read the following articles before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
When you define the string as unicode, á counts as one character, just like a plain a. Otherwise á counts as two. Try len(a) and len(ua). In addition to that, you may need to know the encoding when you work with other environments. For example, if you use md5, you get different values for a and ua.
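A minimal Python 3 sketch of the hashing point: a hash function works on bytes, so the same text gives different digests depending on the encoding chosen. It uses hashlib.md5 from the standard library; the variable names are illustrative.

```python
import hashlib

text = "\xe1"   # the single character á

utf8_bytes = text.encode("utf-8")       # two bytes
latin1_bytes = text.encode("latin-1")   # one byte

digest_utf8 = hashlib.md5(utf8_bytes).hexdigest()
digest_latin1 = hashlib.md5(latin1_bytes).hexdigest()

print(len(text), len(utf8_bytes), len(latin1_bytes))
print(digest_utf8 == digest_latin1)     # the digests differ because the bytes differ
```

Both sides of an exchange therefore need to agree on one encoding before hashing or comparing bytes.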