I have the following strings in Chinese that are saved in the following form as "str" type:
\u72ec\u5230
\u7528\u8272
I am on Python 2.7; when I print those strings, they are printed as actual Chinese characters:
chinese_list = ["\u72ec\u5230", "\u7528\u8272", "\u72ec"]
print(chinese_list[0], chinese_list[1], chinese_list[2])
>>> 独到 用色 独
I can't really figure out how they were saved in that form; to me it looks like Unicode. The goal would be to take other Chinese characters that I have and save them in the same kind of encoding. Say I have "国道": I would need it to be saved in the same way as in the original chinese_list.
I've tried to encode it as utf-8 and also other encodings but I never get the same output as in the original:
new_string = u"国道"
print(new_string.encode("utf-8"))
# >>> b'\xe5\x9b\xbd\xe9\x81\x93'
print(new_string.encode("utf-16"))
# >>> b'\xff\xfe\xfdVS\x90'
Any help appreciated!
EDIT: it doesn't have to have 2 Chinese characters.
EDIT2: Apparently, the encoding was unicode-escape. Thanks @deceze.
print(u"国".encode('unicode-escape'))
>>> \u56fd
The \u.... is unicode escape syntax. It works similarly to how \n is a newline, not the two characters \ and n.
The elements of your list never actually contain a byte string with the literal characters \, u, 7 and so on. They contain a unicode string with the actual unicode characters, i.e. 独 and so on.
Note that this only works with unicode strings! In Python 2, you need to write u"\u....". Python 3 always uses unicode strings.
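A quick way to convince yourself of this (a sketch, assuming a Python 2 session on a UTF-8 terminal) is to check the length of such a string and compare it to the characters typed directly:
>>> s = u"\u72ec\u5230"
>>> len(s)   # two characters, not twelve
2
>>> s == u"独到"
True
>>> print(s)
独到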
The unicode escape value of a character can be gotten with the ord builtin. For example, ord(u"国") gives 22269 - the same value as 0x56fd.
To get the hexadecimal escape value, convert the result to hex.
>>> def escape_literal(character):
...     # pad to 4 hex digits so low codepoints also give valid \uxxxx escapes
...     return r'\u' + format(ord(character), '04x')
...
>>> print(escape_literal(u'国'))
\u56fd
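Building on the EDIT2 above, a minimal sketch for the original goal of saving a whole string such as 国道 in the same form is to encode the unicode string with the unicode-escape codec (Python 2 shown):
>>> u"国道".encode('unicode-escape')
'\\u56fd\\u9053'
>>> print(u"国道".encode('unicode-escape'))
\u56fd\u9053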
Related
I am working with Russian words written in the Cyrillic orthography. Everything is working fine except that many (but not all) of the Cyrillic characters are encoded as two characters when in a str. For instance:
>>>print ["ё"]
['\xd1\x91']
This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":
>>>print [u"ё"]
[u'\u0451']
But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).
So... how do I get around this? If it helps, I am using Python 2.7.
There are two possible situations here.
Either your str represents valid UTF-8 encoded data, or it does not.
If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...
Perhaps the easiest way to work around the issue would be to pass errors='ignore' to decode(). This will entirely discard invalid byte sequences and only give you the "correct" portions of the string.
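For example, a small sketch of the 'ignore' error handler (and the often more useful 'replace') in Python 2; the trailing \xff byte here is just a made-up invalid sequence:
>>> data = '\xd1\x91\xff'            # UTF-8 for ё followed by a stray byte
>>> data.decode('utf-8', 'ignore')   # drop the invalid byte
u'\u0451'
>>> data.decode('utf-8', 'replace')  # or substitute U+FFFD for it
u'\u0451\ufffd'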
These are actually different encodings:
>>>print ["ё"]
['\xd1\x91']
>>>print [u"ё"]
[u'\u0451']
What you're seeing are the __repr__s of the elements in the lists, not their __str__ versions.
But the strs are being passed around as variables, and so can't be
prefixed with u
You mean the data are strings, and need to be converted into the unicode type:
>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'
You need to decode the two-byte UTF-8 sequences into unicode:
>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'
And you'll see with this transform they're perfectly fine.
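Once the data is unicode, indexing and replacement work per character, which is what the question is after; a minimal sketch (the byte string here is just an illustrative example meaning "ёж"):
>>> raw = '\xd1\x91\xd0\xb6'              # UTF-8 bytes for ёж
>>> text = raw.decode('utf-8')
>>> text[0]                               # indexing is now per character
u'\u0451'
>>> text.replace(u'\u0451', u'\u0435')    # ё -> е, dropping the diaeresis
u'\u0435\u0436'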
To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:
>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'
The encoding depends on the data source. It can be anything, e.g., if the data comes from a web page; see A good way to get the charset/encoding of an HTTP response in Python.
Don't use non-ASCII characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals.
Note: a single user-perceived character may span several Unicode codepoints, e.g.:
>>> print(u'\u0435\u0308')
ё
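If such decomposed sequences matter for your indexing, unicodedata.normalize can fold them into the precomposed form first; a sketch (only needed when the input actually contains combining marks):
>>> import unicodedata
>>> s = u'\u0435\u0308'                  # е followed by a combining diaeresis
>>> unicodedata.normalize('NFC', s)      # compose into a single codepoint
u'\u0451'
>>> len(unicodedata.normalize('NFC', s))
1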
I have a tree structure in which keywords may contain some latin characters. I have a function which loops through all leaves of the tree and adds each keyword to a list under certain conditions.
Here is the code I have for adding these keywords to the list:
print "Adding: " + self.keyword
leaf_list.append(self.keyword)
print leaf_list
If the keyword in this case is université, then my output is:
Adding: université
['universit\xc3\xa9']
It appears that the print function properly shows the latin character, but when I add it to the list, it gets decoded.
How can I change this? I need to be able to print the list with the standard latin characters, not the decoded version of them.
You don't have unicode objects, but byte strings with UTF-8 encoded text. Printing such byte strings to your terminal may work if your terminal is configured to handle UTF-8 text.
When converting a list to string, the list contents are shown as representations; the result of the repr() function. The representation of a string object uses escape codes for any bytes outside of the printable ASCII range; newlines are replaced by \n for example. Your UTF-8 bytes are represented by \xhh escape sequences.
If you were using Unicode objects, the representation would still use \xhh escapes, but only for Unicode codepoints in the Latin-1 range outside ASCII (the rest are shown with \uhhhh and \Uhhhhhhhh escapes depending on their codepoint). When printing, Python automatically encodes such values to the correct encoding for your terminal:
>>> u'université'
u'universit\xe9'
>>> len(u'université')
10
>>> print u'université'
université
Compare this to byte strings:
>>> 'université'
'universit\xc3\xa9'
>>> len('université')
11
>>> 'université'.decode('utf8')
u'universit\xe9'
>>> print 'université'
université
Note that the length reflects that the é codepoint is encoded to two bytes as well. By the way, it was my terminal that presented Python with the \xc3\xa9 bytes when I pasted the é character into the session, as it is configured to use UTF-8; Python detected this and decoded the bytes when I defined the u'..' Unicode literal.
I strongly recommend you read the following articles to understand how Python handles Unicode, and what the difference is between Unicode text and encoded byte strings:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
When you print a list, you get the repr of the items it contains, which for strings is different from their contents:
>>> a = ['foo', 'bär']
>>> print(a[0])
foo
>>> print(repr(a[0]))
'foo'
>>> print(a[1])
bär
>>> print(repr(a[1]))
'b\xc3\xa4r'
The output of repr is supposed to be programmer-friendly, not user-friendly, hence the quotes and the hex codes. To print a list in a user-friendly way, format the items yourself, e.g.:
>>> print '[', ', '.join(a), ']'
[ foo, bär ]
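Applied to the leaf_list from the original question, the same idea looks roughly like this (a sketch, assuming the keywords are UTF-8 byte strings and the terminal can display them):
>>> leaf_list = ['universit\xc3\xa9']
>>> print ', '.join(k.decode('utf-8') for k in leaf_list)
université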
I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, about that original string (i.e., \xc3 and \xb3 are valid characters, Ã and ³, in their own right; they're just not what's supposed to be there).
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?
a) Try putting it through the conversion below; if the string is not really mis-decoded UTF-8, the decode step will raise an error.
b)
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
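For part (a), one pragmatic check is simply to attempt that round trip and see whether it succeeds; a sketch of that heuristic (fix_mojibake is a made-up helper name, and the check can misfire on text that happens to re-decode cleanly):
>>> def fix_mojibake(s):
...     # If the codepoints re-encode to latin-1 and then decode as UTF-8,
...     # the string was probably double-decoded; otherwise leave it alone.
...     try:
...         return s.encode('latin-1').decode('utf-8')
...     except (UnicodeEncodeError, UnicodeDecodeError):
...         return s
...
>>> fix_mojibake(u'Sopet\xc3\xb3n')
u'Sopet\xf3n'
>>> fix_mojibake(u'Sopet\xf3n')
u'Sopet\xf3n'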
You should use:
>>> title.encode('raw_unicode_escape')
Python2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
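Applied to the title from the question, this approach would look roughly like the following in Python 2 (since every codepoint here is below U+0100, raw_unicode_escape yields the same bytes as latin-1):
>>> title = u'Sopet\xc3\xb3n'
>>> title.encode('raw_unicode_escape').decode('utf-8')
u'Sopet\xf3n'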
When I tried to get the content of a tag using "unicode(head.contents[3])" I get output similar to this: "Christensen Sk\xf6ld". I want the escape sequence to be returned as a string. How do I do that in Python?
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling it is uncode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use the "encode" method to turn the unicode into e.g. a UTF-8 string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
I suspect that it's actually working correctly. By default, Python displays strings in ASCII encoding, since not all terminals support unicode. If you actually print the string, though, it should work. See the following example:
>>> u'\xcfa'
u'\xcfa'
>>> print u'\xcfa'
Ïa
Given a byte string with Unicode escapes such as b"\N{SNOWMAN}", b"\N{SNOWMAN}".decode('unicode-escape') will produce the expected Unicode string u'\u2603'.