Python Latin Characters and Unicode

I have a tree structure in which keywords may contain some latin characters. I have a function which loops through all leaves of the tree and adds each keyword to a list under certain conditions.
Here is the code I have for adding these keywords to the list:
print "Adding: " + self.keyword
leaf_list.append(self.keyword)
print leaf_list
If the keyword in this case is université, then my output is:
Adding: université
['universit\xc3\xa9']
It appears that the print function properly shows the latin character, but when I add it to the list, it gets decoded.
How can I change this? I need to be able to print the list with the standard latin characters, not the decoded version of them.

You don't have unicode objects, but byte strings with UTF-8 encoded text. Printing such byte strings to your terminal may work if your terminal is configured to handle UTF-8 text.
When converting a list to string, the list contents are shown as representations; the result of the repr() function. The representation of a string object uses escape codes for any bytes outside of the printable ASCII range; newlines are replaced by \n for example. Your UTF-8 bytes are represented by \xhh escape sequences.
If you were using Unicode objects, the representation would still use \xhh escapes, but only for Unicode codepoints in the Latin-1 range outside ASCII (higher codepoints are shown with \uhhhh or \Uhhhhhhhh escapes, depending on the codepoint); when printing, Python automatically encodes such values to the correct encoding for your terminal:
>>> u'université'
u'universit\xe9'
>>> len(u'université')
10
>>> print u'université'
université
Compare this to byte strings:
>>> 'université'
'universit\xc3\xa9'
>>> len('université')
11
>>> 'université'.decode('utf8')
u'universit\xe9'
>>> print 'université'
université
Note that the length reflects that the é codepoint is encoded to two bytes. By the way, it was my terminal that presented Python with the \xc3\xa9 bytes when I pasted the é character into the session, as it is configured to use UTF-8; Python detected this and decoded the bytes when I defined the u'..' Unicode object literal.
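As a minimal sketch of a fix (assuming, as the repr output suggests, that the keywords really are UTF-8 byte strings), decode each keyword to a Unicode object before appending it, and join the list for display instead of printing its repr:
keyword = self.keyword.decode('utf-8')  # byte string -> unicode object
print u"Adding: " + keyword
leaf_list.append(keyword)
print u", ".join(leaf_list)  # print the text itself, not the list's repr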
I strongly recommend you read the following articles to understand how Python handles Unicode, and what the difference is between Unicode text and encoded byte strings:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

When you print a list, you get the repr of the items it contains, which for strings is different from their contents:
>>> a = ['foo', 'bär']
>>> print(a[0])
foo
>>> print(repr(a[0]))
'foo'
>>> print(a[1])
bär
>>> print(repr(a[1]))
'b\xc3\xa4r'
The output of repr is supposed to be programmer-friendly, not user-friendly, hence the quotes and the hex codes. To print a list in a user-friendly way, write your own loop. E.g.
>>> print '[', ', '.join(a), ']'
[ foo, bär ]
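If the list holds UTF-8 byte strings (as in the question), a hedged variant of the same idea decodes each item first, so the joined text prints cleanly on a UTF-8 terminal:
>>> a = ['foo', 'b\xc3\xa4r']
>>> print u'[ ' + u', '.join(s.decode('utf-8') for s in a) + u' ]'
[ foo, bär ]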

Related

Python - decoded unicode string does not stay decoded

It may be too late at night for me to still be programming (so apologies if this is a very silly thing to ask), but I have spotted some weird behaviour with string decoding in Python:
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> name = bs.decode("utf-8", "replace")
>>> print(name)
I n t e l ( R )
>>> list_of_dict = []
>>> list_of_dict.append({'name': name})
>>> list_of_dict
[{'name': 'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00'}]
How can the list contain unicode characters if it has already been decoded?
Decoding bytes by definition produces Unicode (text, really; Unicode is how you can store arbitrary text, so Python uses it internally for all text). So when you ask "How can the list contain unicode characters if it has already been decoded?", the question betrays a fundamental misunderstanding of what Unicode is. If you have a str in Python 3, it's text, and that text is composed of a series of Unicode code points (with unspecified internal encoding; in fact, modern Python stores it as ASCII, Latin-1, UCS-2 or UCS-4, depending on the highest ordinal value, and sometimes also caches a UTF-8 representation, or a native wchar representation for use with legacy extension modules).
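To illustrate the point (Python 3): once decoded, a str is just a sequence of code points, and any byte length depends only on the encoding you later choose:
>>> s = 'Iñ'
>>> [ord(c) for c in s]             # the code points the str is made of
[73, 241]
>>> len(s), len(s.encode('utf-8'))  # 2 code points, 3 UTF-8 bytes
(2, 3)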
You're seeing the repr of the nul character (Unicode ordinal 0) and thinking it didn't decode properly, and you're likely right (there's nothing illegal about nul characters, they're just not common in plain text); your input data is almost certainly encoded in UTF-16-LE, not UTF-8. Use the correct codec, and the text comes out correctly:
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> bs.decode('utf-16-le') # No need to replace things, this is legit UTF-16-LE
'Intel(R)'
>>> list_of_dict = [{'name': _}]
>>> list_of_dict
[{'name': 'Intel(R)'}]
Point is, while producing nul characters is legal, unless it's a binary file, odds are it won't have any, and if you're getting them, you probably picked the wrong codec.
The discrepancy between printing the str and displaying it as part of a list/dict is because list/dict stringify with the repr of their contents (what you'd type to reproduce the object programmatically, in many cases), so the string is rendered with the \x00 escapes. Printing the str directly doesn't involve the repr, so the nul characters get rendered however your terminal chooses; here, as spaces, since nul has no printable glyph.
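The split is easy to see in isolation (Python 3 here; exactly how the terminal renders the nul is up to the terminal):
>>> s = 'a\x00b'
>>> repr(s)    # debugging form, escapes the nul
"'a\\x00b'"
>>> print(s)   # raw text; the terminal decides how nul appears
a b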
So what I think is happening is that the nul characters (\x00) are not removed by decoding and remain in the string afterwards. Since they are null characters, they do not disrupt printing; the terminal renders them as nothing or as spaces (I tested your code on Arch Linux under both Python 2 and Python 3, and they were omitted completely).
The real clue is that you get a \x00 character after each character of your string when you decode with UTF-8: your byte stream actually consists of 16-bit characters, not 8-bit ones. Therefore, if you decode using UTF-16 instead, your code will work like a charm :)
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> t = bs.decode("utf-16", "replace")
>>> print(t)
Intel(R)
>>> t
'Intel(R)'
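As an aside, the difference between the 'utf-16' and 'utf-16-le' codecs is mainly BOM handling: 'utf-16' consumes a leading byte order mark if one is present (and assumes native byte order otherwise), while 'utf-16-le' always treats the data as little-endian and leaves any BOM in the text:
>>> b'\xff\xfeI\x00'.decode('utf-16')     # BOM detected and consumed
'I'
>>> b'\xff\xfeI\x00'.decode('utf-16-le')  # BOM kept as U+FEFF
'\ufeffI'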

Python 2 string somehow saved as pure Unicode

I have the following Chinese strings, saved in the following form as the "str" type:
\u72ec\u5230
\u7528\u8272
I am on Python 2.7, when I print those strings they are printed as actual Chinese characters:
chinese_list = ["\u72ec\u5230", "\u7528\u8272", "\u72ec"]
print(chinese_list[0], chinese_list[1], chinese_list[2])
# >>> 独到 用色 独
I can't really figure out how they were saved in that form; to me it looks like Unicode. The goal would be to take other Chinese characters that I have and save them in the same kind of encoding. Say I have "国道"; I would need it to be saved in the same way as in the original chinese_list.
I've tried to encode it as utf-8 and also other encodings but I never get the same output as in the original:
new_string = u"国道"
print(new_string.encode("utf-8"))
# >>> b'\xe5\x9b\xbd\xe9\x81\x93'
print(new_string.encode("utf-16"))
# >>> b'\xff\xfe\xfdVS\x90'
Any help appreciated!
EDIT: it doesn't have to have 2 Chinese characters.
EDIT2: Apparently, the encoding was unicode-escape. Thanks @deceze.
print(u"国".encode('unicode-escape'))
>>> \u56fd
The \u.... is Unicode escape syntax. It works similarly to how \n is a newline, not the two characters \ and n.
The elements of your list never actually contain a byte string with the literal characters \, u, 7 and so on. They contain a Unicode string with the actual Unicode characters, i.e. 独 and so on.
Note that this only works with Unicode strings! In Python 2, you need to write u"\u....". Python 3 always uses Unicode strings. (See the session below.)
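A quick Python 2 session showing the difference (assuming a UTF-8 terminal):
>>> print("\u72ec")    # byte string: the escape is not interpreted
\u72ec
>>> print(u"\u72ec")   # unicode literal: the escape becomes the character
独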
The Unicode ordinal of a character can be obtained with the ord builtin. For example, ord(u"国") gives 22269 - the same value as 0x56fd.
To get the hexadecimal escape value, convert the result to hex:
>>> def escape_literal(character):
...     return r'\u' + format(ord(character), '04x')  # zero-pad to 4 hex digits
...
>>> print(escape_literal(u'国'))
\u56fd
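The built-in codec also covers the reverse direction, so the helper may be unnecessary; decoding with 'unicode-escape' turns the literal backslash sequence back into the character (Python 2 shown):
>>> '\\u56fd'.decode('unicode-escape')
u'\u56fd'
>>> print('\\u56fd'.decode('unicode-escape'))
国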

Encoding unicode with 'utf-8' shows byte-strings only for non-ascii

I'm running Python 2.7.10.
I'm trying to wrap my head around why the following behavior is seen. (I'm sure there is a reasonable explanation.)
So I define two unicode characters, with only the first one in the ascii-set, and the second one outside of it.
>>> a=u'\u0041'
>>> b=u'\u1234'
>>> print a
A
>>> print b
ሴ
Now I encode it to see what the corresponding bytes would be. But only the latter gives me the results I am expecting to see (bytes)
>>> a.encode('utf-8')
'A'
>>> b.encode('utf-8')
'\xe1\x88\xb4'
Perhaps the issue is in my expectation, and if so, one of you can explain where the flaw lies.
- My a,b are unicodes (hex values of the ordinals inside)
- When I print these, the interpreter prints the actual character corresponding to each unicode byte.
- When I encode, I assumed that it would be converted into a byte-string using the encoding scheme I provide (in this case utf-8). I expected to see a bytestring for a.encode, just like I did for b.encode.
What am I missing?
There is no flaw. You encoded to UTF-8, which uses the same bytes as the ASCII standard for the first 128 codepoints of the Unicode standard, and uses multiple bytes (between 2 and 4) for everything else.
You then echoed that value in the interactive interpreter, which uses the repr() function to build a debugging representation. That representation is a valid Python expression for the string, one that is ASCII safe: any byte that is not a printable ASCII character is shown as an escape sequence. Thus the UTF-8 bytes are shown as \xhh hex escapes.
Most importantly, because A is a printable ASCII character, it is shown as is; any code editor or terminal will accept ASCII, and for most English text showing the actual text is so much more useful.
Note that you used print for the unicode values stored in a and b, which means Python encodes those values to your terminal codec, coordinating with your terminal to produce the right output. You did not echo the values in the interpreter; had you done so, you'd have seen the debug output too:
>>> a = u'\u0041'
>>> b = u'\u1234'
>>> a
u'A'
>>> b
u'\u1234'
In Python 3, the functionality of the repr() function (or rather, the object.__repr__ hook) has been updated to produce a unicode string with most printable codepoints left un-escaped. Use the new ascii() function to get the above behaviour.
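A brief Python 3 comparison of the two:
>>> repr('\u1234')
"'ሴ'"
>>> ascii('\u1234')
"'\\u1234'"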

Why does this still look like bytes after I convert it to Unicode?

I have everything working as I want it in my code, but I'm still curious. I have a string: "stación." When I convert that string to unicode, I get:
unicode('stación', 'utf-8')
>>> u'staci\xf3n'
That "\xf3" in there looks like a byte character. This is different from:
unicode('Поиск', 'utf-8')
>>> u'\u041f\u043e\u0438\u0441\u043a'
In the latter example, as with everything I've converted to unicode before, I get unicode characters starting with "\u." Normally, when I see a byte starting with "\x," I think there's a problem. What gives here? Is this because "ó" is extended ASCII?
No, it's because "ó" is a non-ASCII character within the first 256 codepoints (the Latin-1 range). Since its ordinal fits in a single byte, the \xf3 escape saves two characters over the \u00f3 form. The other two representations are valid, but not required.
>>> u'\u00f3'
u'\xf3'
>>> u'\U000000f3'
u'\xf3'
u'\xf3' is not a byte; it is a Unicode string with a single Unicode codepoint (U+00f3 LATIN SMALL LETTER O WITH ACUTE).
What you see (u'\xf3') is how Python 2 chooses to represent Unicode characters with ordinals (integers) in the range 0..255 that are not printable ASCII characters (Python 3 would show 'ó' here; only non-printable characters use the '\xhh' form there by default). As @Ignacio Vazquez-Abrams said: the u'\u00f3' and u'\U000000f3' literals create exactly the same Unicode string.
For comparison, you can see what the Unicode character u'\xf3' looks like as bytes in different character encodings:
>>> print(u'\xf3')
ó
>>> u'\xf3'.encode('utf-8')
b'\xc3\xb3'
>>> u'\xf3'.encode('utf-16be')
b'\x00\xf3'
>>> u'\xf3'.encode('utf-32le')
b'\xf3\x00\x00\x00'
>>> u'\xf3'.encode('cp1252')
b'\xf3'
Note: b'\xf3' and u'\xf3' are different things. The former is a byte string that contains a single byte (the integer 243); the latter is a Unicode string that contains a single Unicode codepoint (Unicode ordinal 243). The number 243 is the same, but the units are different -- 100 calories is not the same thing as 100 grams.
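Latin-1 makes that relationship concrete, since it maps byte values 0-255 directly onto the first 256 codepoints:
>>> b'\xf3'.decode('latin-1') == u'\xf3'
True
>>> u'\xf3'.encode('latin-1')
b'\xf3'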

