Converting Unicode string to Chinese characters - python

How can I convert the following Unicode string to Chinese characters?
The string is:
'\\u5982\\u679c\\u6211\\u662f\\u4e00\\u4e2a\\u4ece\\u524d\\u7684\\u54f2\\u4eba\\uff0c\\u6765\\u5230\\u4eca\\u5929\\u7684\\u4e16\\u754c\\uff0c\\u6211\\u4f1a\\u6700\\u6000\\u5ff5\\u4ec0\\u4e48\\uff1f'
And I want it to be:
如果我是一个从前的哲人,来到今天的世界,我会最怀念什么?

Decode it using unicode-escape will give you what you want.
Python 2.7
>>> print '\\u5982\\u679c\\u6211\\u662f\\u4e00\\u4e2a\\u4ece\\u524d\\u7684\\u54f2\\u4eba\\uff0c\\u6765\\u5230\\u4eca\\u5929\\u7684\\u4e16\\u754c\\uff0c\\u6211\\u4f1a\\u6700\\u6000\\u5ff5\\u4ec0\\u4e48\\uff1f'.decode('unicode-escape')
如果我是一个从前的哲人,来到今天的世界,我会最怀念什么?
Python 3.x
>>> print('\\u5982\\u679c\\u6211\\u662f\\u4e00\\u4e2a\\u4ece\\u524d\\u7684\\u54f2\\u4eba\\uff0c\\u6765\\u5230\\u4eca\\u5929\\u7684\\u4e16\\u754c\\uff0c\\u6211\\u4f1a\\u6700\\u6000\\u5ff5\\u4ec0\\u4e48\\uff1f'.encode('ascii').decode('unicode-escape'))
如果我是一个从前的哲人,来到今天的世界,我会最怀念什么?
>>> print(b'\\u5982\\u679c\\u6211\\u662f\\u4e00\\u4e2a\\u4ece\\u524d\\u7684\\u54f2\\u4eba\\uff0c\\u6765\\u5230\\u4eca\\u5929\\u7684\\u4e16\\u754c\\uff0c\\u6211\\u4f1a\\u6700\\u6000\\u5ff5\\u4ec0\\u4e48\\uff1f'.decode('unicode-escape'))
如果我是一个从前的哲人,来到今天的世界,我会最怀念什么?

Related

How to decode a unicode-like string in Python 3?

I have unicode-like strings but with slash escaped. For example, '\\u000D'. I need to decode them as normal strings. The above example should be convert to '\r' which '\u000D' corresponds to.
Use the unicode-escape codec.
>>> import codecs
>>> codecs.decode('\\u000D', 'unicode-escape')
'\r'

Python decode from ascii hex

I'm looking for a function to convert "\x61\x62\x63\x64\x65\x66" to "abcdef" in python, without having to print it
and is the proper term for this type of encoding ascii hex?
>>> "\x61\x62\x63\x64\x65\x66".decode('ascii')
u'abcdef'
You can convert "\x61\x62\x63\x64\x65\x66" to a unicode string with the unicode method:
unicode("\x61\x62\x63\x64\x65\x66")
Output: u'abcdef'

How to decode unicode raw literals to readable string?

If I assign unicode raw literals to a variable, I can read its value:
>>> s = u'\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e'
>>> s
u'\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e'
>>> print s
Сообщение отправлено
But when I have already assigned value to a plain, not unicode string, I can not:
>>> s = '\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e'
>>> s
'\\u0421\\u043e\\u043e\\u0431\\u0449\\u0435\\u043d\\u0438\\u0435 \\u043e\\u0442\\u043f\\u0440\\u0430\\u0432\\u043b\\u0435\\u043d\\u043e'
>>> print s
\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e
How can I decode and read it?
Use the unicode_escape codec:
s.decode('unicode_escape')
If you are getting weird results when decoding try following
print repr(s).decode('unicode-escape').encode('latin-1') // or encode using some other encoding
It could be that python terminal is using default ASCII and there is symbol that goes out of range.

Python unicode woes

What is the correct way to convert '\xbb' into a unicode string? I have tried the following and only get UnicodeDecodeError:
unicode('\xbb', 'utf-8')
'\xbb'.decode('utf-8')
Since it comes from Word it's probably CP1252.
>>> print '\xbb'.decode('cp1252')
»
It looks to be Latin-1 encoded. You should use:
unicode('\xbb', 'Latin-1')
Not sure what you are trying to do. But in Python3 all strings are unicode per default. In Python2.X you have to use u'my unicode string \xbb' (or double, tripple quoted) to get unicode strings. When you want to print unicode strings you have to encode them in character set that is supported on the output device, eg. the terminal. u'my unicode string \xbb'.endoce('iso-8859-1') for instance.

How do convert unicode escape sequences to unicode characters in a python string

When I tried to get the content of a tag using "unicode(head.contents[3])" i get the output similar to this: "Christensen Sk\xf6ld". I want the escape sequence to be returned as string. How to do it in python?
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling it is uncode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use de "encode" method to turn the unicode into e.g. a UTF-8 string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
I suspect that it's acutally working correctly. By default, Python displays strings in ASCII encoding, since not all terminals support unicode. If you actually print the string, though, it should work. See the following example:
>>> u'\xcfa'
u'\xcfa'
>>> print u'\xcfa'
Ïa
Given a byte string with Unicode escapes b"\N{SNOWMAN}", b"\N{SNOWMAN}".decode('unicode-escape) will produce the expected Unicode string u'\u2603'.

Categories

Resources