This question already has answers here:
What's the correct way to convert bytes to a hex string in Python 3?
(9 answers)
Closed 4 years ago.
I have a variable b whose value is b'\xac\xed\x05sr\x00'.
How can I convert it to 'aced05737200'?
s and r are converted to 73 and 72 respectively because their ASCII codes are 0x73 and 0x72.
b.decode('utf-8') gives me this error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte
Simply use the .hex() method
>>> b = b'\xac\xed\x05sr\x00'
>>> b.hex()
'aced05737200'
to get the wanted result, because it's not a problem with decoding or encoding. Your bytes object is fine as it is; .hex() (available since Python 3.5) returns a str with two hexadecimal digits per byte.
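As a minimal sketch of the round trip (the variable names hex_str, restored and legacy are illustrative, not from the question), bytes.fromhex() reverses .hex(), and binascii.hexlify() is an older route to the same digits that returns bytes rather than str:

```python
import binascii

b = b'\xac\xed\x05sr\x00'

hex_str = b.hex()                             # str, two hex digits per byte
restored = bytes.fromhex(hex_str)             # back to the original bytes
legacy = binascii.hexlify(b).decode('ascii')  # same digits via binascii

print(hex_str)
print(restored == b, legacy == hex_str)
```

No codec is involved at any point, which is why the UnicodeDecodeError from b.decode('utf-8') never comes up.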
This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
I have a string like:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
I need to get the corresponding bytes object for that string (for pickle.loads):
s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
Here the solution of using s_new: bytes = bytes(s_str, encoding="raw_unicode_escape") was posted, but it does not work for me. I got an incorrect result: b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04' that has two backslashes (actually representing only one) for each one that it should have.
Also here and here a similar solution is proposed, but it does not work for me either; I end up getting the double backslashes again. Why does this occur? How do I get the bytes result I want?
You do not have interpreted escape codes as shown below (a string of length 9), or you wouldn't get the double-backslash result:
s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
You have literal escape codes (a string of length 36). Note the r prefix for a raw string, which prevents the escape codes from being interpreted:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
Note the difference. \\ is an escape code indicating a literal, single backslash:
>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36
The following gets the desired byte string in three steps. First, it converts each code point to a byte using the latin1 codec, which maps 1:1 between the first 256 code points (U+0000 to U+00FF) and the byte values 0x00 to 0xFF. Next, it decodes the literal escape codes with unicode_escape, which results in a Unicode string again. Finally, it encodes with latin1 once more to convert 1:1 back to bytes:
s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)
Output:
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
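The three steps can be wrapped in a small helper. This is only a sketch: the function name unescape_to_bytes and the pickle payload are made up for illustration, and the input is assumed to contain only code points up to U+00FF (anything higher makes the latin1 encode raise UnicodeEncodeError).

```python
import pickle


def unescape_to_bytes(escaped: str) -> bytes:
    # Turn literal backslash escapes into the bytes they spell.
    as_bytes = escaped.encode('latin1')          # 1:1 code point -> byte
    as_text = as_bytes.decode('unicode_escape')  # interpret the \xNN escapes
    return as_text.encode('latin1')              # 1:1 back to bytes


s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_bytes = unescape_to_bytes(s_str)
print(len(s_str), len(s_bytes))

# Round trip with a pickle payload (hypothetical data): escape it to
# text, then recover the exact bytes that pickle.loads expects.
payload = pickle.dumps({"answer": 42})
escaped = payload.decode('latin1').encode('unicode_escape').decode('ascii')
print(pickle.loads(unescape_to_bytes(escaped)))
```

The pickle round trip is there because the question mentions pickle.loads; any other consumer of raw bytes would work the same way.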
If you did have s_str with interpreted escape codes (no r prefix), a simple .encode('latin1') would convert it:
>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
I was about to post the question when I came across a valid solution almost by chance. The combination that works for me is:
s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")
I have no idea why this works, so feel free to explain it if you know.
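If the string contains interpreted escape codes, you can call .encode("utf-8"), but note that UTF-8 encodes code points above U+007F as multiple bytes, so \xc0 becomes \xc3\x80 instead of the single byte \xc0, i.e.: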
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)
Output:
b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'
This question already has answers here:
Decode Hex String in Python 3
(3 answers)
Closed 3 years ago.
I have lots of Unicode character codes stored as strings in Python 3, e.g.
unicode = '3077'
where U+3077 is ぷ. How do I print this as human-readable text? I.e. how do I convert the string unicode to unicode_as_text such that:
>>> print(unicode_as_text)
ぷ
Your string is the Unicode code point represented in hexadecimal, so the character can be rendered by parsing it as a base-16 integer and printing the result of calling chr on that integer.
>>> print(chr(int('3077', 16)))
ぷ
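Since the question mentions many such codes, here is a small sketch that applies the same conversion to a list. The helper name hex_to_char and the second sample code are assumptions added for illustration.

```python
def hex_to_char(code: str) -> str:
    # int(code, 16) parses the hexadecimal digits; chr() maps the
    # resulting integer code point to a one-character string.
    return chr(int(code, 16))


codes = ['3077', '3042']  # sample input; '3042' is not from the question
text = ''.join(hex_to_char(c) for c in codes)
print(text)
```

chr() raises ValueError for values above 0x10FFFF, and int() raises ValueError for strings that are not valid hexadecimal, so malformed input fails loudly rather than producing a wrong character.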
This question already has answers here:
Python string to unicode [duplicate]
(3 answers)
Closed 6 years ago.
I have a question about Python 2 encoding. I am trying to decode an ASCII string that contains the Unicode escape code of a letter to Unicode, and then encode it back to Latin-1, but with no success. Here is an illustration:
In[27]: d = u'\u010d'
In[28]: print d.encode('utf-8')
č
In[29]: d1 = '\u010d'
In[30]: d1.decode('ascii').encode('utf-8')
Out[30]: '\\u010d'
I would like to convert '\u010d' to 'č'. Are there any built-in solutions to avoid custom string replacement?
When you do
d1 = '\u010d'
you actually get this string:
In [3]: d1
Out[3]: '\\u010d'
This is because in Python 2, "normal" (non-Unicode) strings don't recognize the \unnnn escape sequence and therefore keep it as a literal backslash followed by unnnn.
In order to decode that, you need to use the unicode_escape codec:
In [4]: print d1.decode("unicode_escape").encode('utf-8')
č
But of course you shouldn't use Unicode escape sequences in non-Unicode strings in the first place.
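The thread is about Python 2. As a hedged sketch of the same idea in Python 3 (an assumption about how one might port it, not part of the original answer), the escaped text has to be encoded to bytes first, because str no longer has a .decode() method:

```python
# Python 3 sketch: a raw string keeps the backslash, as the Python 2
# byte string '\u010d' does in the question.
d1 = r'\u010d'

# Only bytes have .decode() in Python 3, so encode to ASCII first,
# then let unicode_escape interpret the \uXXXX sequence.
decoded = d1.encode('ascii').decode('unicode_escape')

print(len(d1), len(decoded))
print(decoded.encode('utf-8'))
```

The six-character escaped text collapses to a single character, and encoding that character as UTF-8 yields its two-byte form.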
This question already has answers here:
Split unicode string into 300 byte chunks without destroying characters
(5 answers)
Closed 8 years ago.
I want to truncate a Unicode string to at most 255 bytes and return the result as Unicode:
# s = arbitrary-length-unicode-string
s.encode('utf-8')[:255].decode('utf-8')
The problem with this snippet is that if the 255th byte is the first byte of a 2-byte UTF-8 character, I get an error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 254: unexpected end of data
Even if I handle the error, I get unwanted garbage at the end of the string.
How can I solve this more elegantly?
One very nice property of UTF-8 is that trailing bytes can easily be differentiated from starting bytes: trailing (continuation) bytes always have the bit pattern 10xxxxxx. Just work backwards until you've deleted a starting byte.
trunc_s = s.encode('utf-8')[:256]
if len(trunc_s) > 255:
    final = -1
    while ord(trunc_s[final]) & 0xc0 == 0x80:
        final -= 1
    trunc_s = trunc_s[:final]
trunc_s = trunc_s.decode('utf-8')
Edit: Check out the answers in the question identified as a duplicate, too.
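The answer's code is written for Python 2, where indexing a byte string yields a one-character string (hence ord()). A sketch of the same backward walk for Python 3, where indexing bytes yields an int, might look like this; the function name truncate_utf8 and the sample string are assumptions for illustration:

```python
def truncate_utf8(s: str, max_bytes: int = 255) -> str:
    data = s.encode('utf-8')
    if len(data) <= max_bytes:
        return s
    end = max_bytes
    # data[end] is the first byte that would be cut off. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a character,
    # so move back to that character's starting byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end].decode('utf-8')


# 254 ASCII bytes followed by one 2-byte character: 256 bytes in total.
sample = 'a' * 254 + '\u0434'
short = truncate_utf8(sample)
print(len(short.encode('utf-8')))

# For valid input, dropping the incomplete tail with errors='ignore'
# gives the same result.
same = sample.encode('utf-8')[:255].decode('utf-8', errors='ignore')
print(short == same)
```

The errors='ignore' variant is shorter, but it would also silently drop invalid bytes elsewhere if the data were not valid UTF-8 to begin with, which cannot happen here because the bytes come straight from str.encode().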
This question already has an answer here:
How can I convert strings like "\u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167" to Chinese characters
(1 answer)
Closed 9 years ago.
I have a Unicode string. I'm sure that it's UTF-8, but I can't decode it. The string is '\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'. How do I decode it?
You can use aString.decode('unicode_escape'); in Python 2 it converts a byte string of Unicode escape sequences to a unicode object:
>>> u'\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'
u'\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'
>>> '\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'.decode('unicode_escape')
u'\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'
>>>
In your case
>>> print '\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'.decode('unicode_escape')
Легковые
>>>