I'm looking for a function that converts "\x61\x62\x63\x64\x65\x66" to "abcdef" in Python, without having to print it.
Also, is "ASCII hex" the proper term for this type of encoding?
>>> "\x61\x62\x63\x64\x65\x66".decode('ascii')
u'abcdef'
You can convert "\x61\x62\x63\x64\x65\x66" to a unicode string with the unicode built-in (Python 2):
unicode("\x61\x62\x63\x64\x65\x66")
Output: u'abcdef'
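For Python 3 readers, here is a minimal sketch (the variable names are mine, not from the question). In a normal string literal the \x61 escapes are resolved when the source is parsed, so the value is already 'abcdef' and nothing needs converting. Decoding only matters when the text holds real backslash characters, for example because it was read from a file; a raw literal stands in for that case here.

```python
# Python 3 sketch; all names are illustrative.
# In a normal literal the parser resolves \x61, so no conversion is needed.
literal = "\x61\x62\x63\x64\x65\x66"

# A raw literal keeps the backslashes, the way text read from a file would.
escaped_text = r"\x61\x62\x63\x64\x65\x66"

# Interpret the escape sequences with the unicode_escape codec.
decoded = escaped_text.encode("ascii").decode("unicode_escape")

print(literal == "abcdef")
print(decoded == "abcdef")
```

Both comparisons should print True; escaped_text itself is 24 characters long because each escape is stored as four separate characters.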
I am trying to convert a string that contains unicode-style formatting, like:
'#U0048#U0045#U004C#U004C#U004F'
What would be the most pythonic way to convert this to:
'HELLO'
thanks!
You have to replace #U with \u to get Unicode escape sequences:
\u0048\u0045\u004C\u004C\u004F
Then you can encode the result to bytes and decode it back using 'unicode_escape' or 'raw_unicode_escape':
print('#U0048#U0045#U004C#U004C#U004F'.replace('#U', '\\u').encode().decode('unicode_escape'))
Doc: codecs - Text Encoding
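An alternative sketch that skips the codec round trip, in case the input may contain other backslashes that unicode_escape would also interpret. The helper name decode_hash_u is made up for illustration, and it assumes every code is exactly four hex digits after #U, as in the question.

```python
import re

def decode_hash_u(text):
    """Replace each #UXXXX (four hex digits) with the character it names."""
    return re.sub(
        r"#U([0-9A-Fa-f]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

print(decode_hash_u("#U0048#U0045#U004C#U004C#U004F"))
```

Anything that does not match the #UXXXX pattern is left untouched, so a path such as C:\new in the same string survives unchanged.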
In Python 3, suppose I have
>>> thai_string = 'สีเ'
Using encode gives
>>> thai_string.encode('utf-8')
b'\xe0\xb8\xaa\xe0\xb8\xb5\xe0\xb9\x80'
My question: how can I get encode() to return a bytes sequence using \u instead of \x? And how can I decode them back to a Python 3 str type?
I tried using the ascii builtin, which gives
>>> ascii(thai_string)
"'\\u0e2a\\u0e35\\u0e40'"
But this doesn't seem quite right, as I can't decode it back to obtain thai_string.
Python documentation tells me that
\xhh escapes the character with the hex value hh while
\uxxxx escapes the character with the 16-bit hex value xxxx
The documentation says that \u is only used in string literals, but I'm not sure what that means. Is this a hint that my question has a flawed premise?
You can use unicode_escape:
>>> thai_string.encode('unicode_escape')
b'\\u0e2a\\u0e35\\u0e40'
Note that encode() always returns a byte string (bytes), and that the unicode_escape encoding is intended to:
Produce a string that is suitable as Unicode literal in Python source code
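To cover the second half of the question (getting back to a str), a small Python 3 sketch of the round trip. It assumes the string consists of the three code points shown in the output above; decoding the escaped bytes with the same codec restores the original.

```python
# The three code points from the answer's output, written as escapes.
thai_string = "\u0e2a\u0e35\u0e40"

# str -> bytes made of ASCII-only \uXXXX escapes
escaped = thai_string.encode("unicode_escape")

# bytes -> str, reversing the step above
restored = escaped.decode("unicode_escape")

print(escaped)
print(restored == thai_string)
```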
How can I convert the following Unicode string to Chinese characters?
The string is:
'\\u5982\\u679c\\u6211\\u662f\\u4e00\\u4e2a\\u4ece\\u524d\\u7684\\u54f2\\u4eba\\uff0c\\u6765\\u5230\\u4eca\\u5929\\u7684\\u4e16\\u754c\\uff0c\\u6211\\u4f1a\\u6700\\u6000\\u5ff5\\u4ec0\\u4e48\\uff1f'
And I want it to be:
如果我是一个从前的哲人，来到今天的世界，我会最怀念什么？
Decoding it with the unicode-escape codec will give you what you want.
Python 2.7
>>> print '\\u5982\\u679c\\u6211\\u662f\\u4e00\\u4e2a\\u4ece\\u524d\\u7684\\u54f2\\u4eba\\uff0c\\u6765\\u5230\\u4eca\\u5929\\u7684\\u4e16\\u754c\\uff0c\\u6211\\u4f1a\\u6700\\u6000\\u5ff5\\u4ec0\\u4e48\\uff1f'.decode('unicode-escape')
如果我是一个从前的哲人，来到今天的世界，我会最怀念什么？
Python 3.x
>>> print('\\u5982\\u679c\\u6211\\u662f\\u4e00\\u4e2a\\u4ece\\u524d\\u7684\\u54f2\\u4eba\\uff0c\\u6765\\u5230\\u4eca\\u5929\\u7684\\u4e16\\u754c\\uff0c\\u6211\\u4f1a\\u6700\\u6000\\u5ff5\\u4ec0\\u4e48\\uff1f'.encode('ascii').decode('unicode-escape'))
如果我是一个从前的哲人，来到今天的世界，我会最怀念什么？
>>> print(b'\\u5982\\u679c\\u6211\\u662f\\u4e00\\u4e2a\\u4ece\\u524d\\u7684\\u54f2\\u4eba\\uff0c\\u6765\\u5230\\u4eca\\u5929\\u7684\\u4e16\\u754c\\uff0c\\u6211\\u4f1a\\u6700\\u6000\\u5ff5\\u4ec0\\u4e48\\uff1f'.decode('unicode-escape'))
如果我是一个从前的哲人，来到今天的世界，我会最怀念什么？
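One caveat with the Python 3 version: .encode('ascii') raises UnicodeEncodeError if the string already contains real non-ASCII characters next to the escapes. A possible workaround, sketched below with a made-up helper name decode_mixed, is to encode as Latin-1 with the backslashreplace error handler. It still interprets every backslash sequence in the input, so treat it as a sketch for trusted text rather than a general parser.

```python
def decode_mixed(text):
    """Decode \\uXXXX escapes in text that may also hold non-ASCII characters."""
    # backslashreplace turns characters outside Latin-1 into \uXXXX escapes;
    # unicode_escape then reads the remaining bytes as Latin-1.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")

# A real accented letter, two escaped characters, and one real Chinese character.
sample = "caf\u00e9 \\u5982\\u679c \u6211"

print(decode_mixed(sample) == "caf\u00e9 \u5982\u679c \u6211")
```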
If I assign a unicode literal containing \u escapes to a variable, I can read its value:
>>> s = u'\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e'
>>> s
u'\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e'
>>> print s
Сообщение отправлено
But once the value has been assigned to a plain (non-unicode) string, I cannot:
>>> s = '\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e'
>>> s
'\\u0421\\u043e\\u043e\\u0431\\u0449\\u0435\\u043d\\u0438\\u0435 \\u043e\\u0442\\u043f\\u0440\\u0430\\u0432\\u043b\\u0435\\u043d\\u043e'
>>> print s
\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e
How can I decode and read it?
Use the unicode_escape codec:
s.decode('unicode_escape')
If you get weird results or a UnicodeEncodeError when printing the decoded value, try encoding it explicitly:
print s.decode('unicode-escape').encode('utf-8')  # or whichever encoding your terminal uses
It could be that the Python terminal defaults to ASCII and the text contains a character outside that range.
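In Python 3 terms, the terminal issue can be sidestepped roughly like this. The helper printable_for is hypothetical; it only replaces characters the target encoding cannot represent with backslash escapes, so printing never raises.

```python
import sys

def printable_for(text, encoding):
    """Return text with characters the encoding cannot represent shown as escapes."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

message = "\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435"

# The encoding attribute can be missing or None when output is redirected;
# fall back to ASCII in that case.
terminal_encoding = getattr(sys.stdout, "encoding", None) or "ascii"

print(printable_for(message, terminal_encoding))
```

On a UTF-8 terminal this prints the Cyrillic text as is; on an ASCII-only one it prints the \uXXXX form instead of failing.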
What is the correct way to convert '\xbb' into a unicode string? I have tried the following and only get UnicodeDecodeError:
unicode('\xbb', 'utf-8')
'\xbb'.decode('utf-8')
Since it comes from Word it's probably CP1252.
>>> print '\xbb'.decode('cp1252')
»
It looks to be Latin-1 encoded. You should use:
unicode('\xbb', 'Latin-1')
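For this particular byte both suggestions give the same result. A short Python 3 sketch of where the two codecs agree and where they differ follows; the 0x93 byte is my own example, not from the question.

```python
raw = b"\xbb"

# Both codecs map 0xbb to U+00BB, the right-pointing double angle quote.
print(raw.decode("cp1252") == "\u00bb")
print(raw.decode("latin-1") == "\u00bb")

# They differ in the 0x80-0x9f range, where Word's curly quotes live.
curly = b"\x93"
print(ascii(curly.decode("cp1252")))   # a left double quotation mark
print(ascii(curly.decode("latin-1")))  # a C1 control character

# A lone 0xbb is not valid UTF-8, hence the UnicodeDecodeError in the question.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

If the text really comes from Word, cp1252 is usually the safer guess, since Latin-1 would silently turn curly quotes into control characters.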
I'm not sure what you are trying to do, but in Python 3 all strings are Unicode by default. In Python 2.x you have to write u'my unicode string \xbb' (or the double- or triple-quoted form) to get a unicode string. When you want to print a unicode string, you have to encode it in a character set that the output device (e.g. the terminal) supports, for instance u'my unicode string \xbb'.encode('iso-8859-1').