Convert unicode string into byte string [duplicate]

Convert unicode string into byte string [duplicate] - python

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
I have a string like:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
I need to be able to get the corresponding byte literal of that unicode (for pickle.loads):
s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
Here the solution of using s_new: bytes = bytes(s_str, encoding="raw_unicode_escape") was posted, but it does not work for me. I got an incorrect result: b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04' that has two backslashes (actually representing only one) for each one that it should have.
Also here and here a similar solution is proposed, but it does not work for me either, I end up getting the double backslashes again. Why does this occur? How do I get the bytes result I want?

You do not have byte escape codes as shown below (length 9) or you wouldn't get the s_not_bytes result:
s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
You have literal escape codes (length 36), and note the r for raw string that prevents interpreting the escape codes as bytes:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
Note the difference. \\ is an escape code indicating a literal, single backslash:
>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36
The following gets the desired byte string by converting each code point to a byte using the latin1 codec, which maps 1:1 between the first 256 code points (U+0000 to U+00FF) and the byte values 0x00 to 0xFF. Then it decodes the literal escape codes, resulting in a Unicode string again so once more encode using latin1 to convert 1:1 back to bytes:
s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)
Output:
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
If you did have s_str as posted, a simple .encode('latin1') would convert it:
>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

I was about to post the question when I encounter a valid solution almost by chance. The combination that works for me is:
s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")
As I said I have no idea why this works so feel free to explain it if you know why.

You might simply use .encode("utf-8") to get desired result i.e.:
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)
output
b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'

Related

"bytearray.fromhex()" not returning the correct hex

I'm trying to convert a string to bytes. I have this function:
def unpack_byte_data(self, byte_arr):
byte_str = byte_arr.decode('utf-8')
print("BYTE_STR", byte_str)
byte_data = bytearray.fromhex(byte_str)
print("BYTE_DATA", byte_data)
return byte_data
But, as we see below, I'm not getting the correct result. The BYTE_STR does not match the resulting BYTE_DATA.
BYTE_STR 080e36860001
BYTE_DATA bytearray(b'\x08\x0e6\x86\x00\x01')
Why am I getting the wrong number of bytes with bad data?

The answer is in the comments from 'user2357112 supports Monica'.
When you print a byte array, any byte that can be considered an ASCII character is printed as an ASCII character.
So, \x0e6 is really \x0e, \x36, as \x36 is the ASCII character 6.
This would have been more obvious if were a non-hex letter, but bad luck.
John Coleman explained how to print a bytearray as one would expect using a list comprehension:
[hex(b) for b in byte_data]
I think this is a tricky behavior and it's worth keeping this answer here.

Bytes operations in Python

I'm working on a project in which I have to perform some byte operations using python and I'd like to understand some basic principals before I go on with it.
t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
print("Adding b character before: ",t1)
print("Using bytes(str): ",bytes(t2,"utf-8"))
print("Using str.encode: ",t2.encode())
In particular, I cannot understand why the console prints this when I run the code above:
C:\Users\Marco\PycharmProjects\codeTest\venv\Scripts\python.exe C:/Users/Marco/PycharmProjects/codeTest/msgPack/temp.py
Adding b character before: b'\xacBLETCHINGLEY'
Using bytes(str): b'\xc2\xacBLETCHINGLEY'
Using str.encode: b'\xc2\xacBLETCHINGLEY'
What I would like to understand is why, if I use bytes() or decode, I get an extra "\xc2" in front of the value. What does it mean? Is this supposed to appear? And if so, how can I get rid of it without using the first method?

Because bytes objects and str objects are two different things. The former represents a sequence of bytes, the latter represents a sequence of unicode code points. There's a huge difference between the byte 172 and the unicode code point 172.
In particular, the byte 172 doesn't encode anything in particular in unicode. On the other hand, unicode code point 172 refers to the following character:
>>> c = chr(172)
>>> print(c)
¬
And of course, they actual raw bytes this would correspond to depend on the encoding. Using utf-8 it is a two-byte encoding:
>>> c.encode()
b'\xc2\xac'
In the latin-1 encoding, it is a 1 byte:
>>> c.encode('latin')
b'\xac'
If you want raw bytes, the most precise/easy way then is to use a bytes-literal.

In a string literal, \xhh (h being a hex digit) selects the corresponding unicode character U+0000 to U+00FF, with U+00AC being the ¬ "not sign". When encoding to utf-8, all code points above 0x7F take two or more bytes. \xc2\xac is the utf-8 encoding of U+00AC.
>>> "\u00AC" == "\xAC"
True
>>> "\u00AC" == "¬"
True
>>> "\xAC" == "¬"
True
>>> "\u00AC".encode('utf-8')
b'\xc2\xac'
>>> "¬".encode("utf-8")
b'\xc2\xac'

How convert abitrary string into bytes without UnicodeEncodeError issue?

I should not expect any error here. I just want to take the string literaly and translate it into its bytes. I don't want to encode or decode anything.
I am taking here a stupid example:
>>> astring
u'\xb0'
Stupid enough to give me headache...
>>> bytes(astring)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position...
One horrible trick is to do this:
>>> bytes(repr(astring)[2:-1])
'\xb0'
One other bad solution is:
>>> bytes(astring.encode("utf-8"))
'\xc2\xb0'
It is a bad solution because my string is not composed of two chars. This is wrong.
Another awful solution would be:
>>> bytes(''.join(map(bytes, [chr(ord(c)) for c in astring])))
'\xb0'
I am using Python 2.7
Background
I would like to compare two columns on a database where the encoding is unknown and sometime conflicting. I don't care about wrong chars on my dump. I just want to get it to have a look at it.

If your Unicode strings are guaranteed to only contain codepoints < 256 then you can convert them to bytes using the Latin1 encoding. Here's some Python 2 code that performs this conversion on all codepoints in range(256).
r = range(256)
s = u''.join([unichr(i) for i in r])
print repr(s)
b = s.encode('latin1')
print repr(b)
a = [ord(c) for c in b]
print a == r
output
u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
FWIW, here's the equivalent Python 3 code.
r = range(256)
s = u''.join([chr(i) for i in r])
print(repr(s))
b = s.encode('latin1')
print(repr(b))
print(list(b) == list(r))
output
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
Note that the Python 3 Unicode repr output is a little more human-friendly.

You cannot just 'take the string literally' because the actual, internal, bytes representation of your string is not fixed and is an implementation detail of the your python interpreter that your should not rely on (see PEP3993, on the same system different string can use different internal encoding).
That also means that to get a byte representation of you string, you really need to encode it, and thus specify the encoding.
By the way, astring.encode("utf-8") is not wrong (and already returns a bytes, you don't need the extra bytes(...) in your code), as in utf-8 a single character can be represented as several bytes.

You should be able to just add b before the quotes of the string.
>>> astring = b'\xb0'
>>> astring
b'\xb0'
>>> bytes(astring)
b'\xb0'
>>>
Putting b before the string makes it a bytes object. No more UnicodeEncodeError.

How to convert byte string with non-printable chars to hexadecimal in python? [duplicate]

This question already has answers here:
What's the correct way to convert bytes to a hex string in Python 3?
(9 answers)
Closed 7 years ago.
I have an ANSI string Ď–ór˙rXüď\ő‡íQl7 and I need to convert it to hexadecimal like this:
06cf96f30a7258fcef5cf587ed51156c37 (converted with XVI32).
The problem is that Python cannot encode all characters correctly (some of them are incorrectly displayed even here, on Stack Overflow) so I have to deal with them with a byte string.
So the above string is in bytes this: b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
And that's what I need to convert to hexadecimal.
So far I tried binascii with no success, I've tried this:
h = ""
for i in b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7':
h += hex(i)
print(h)
It prints:
0x60xcf0x960xf30xa0x720x830xff0x720x580xfc0xef0x5c0xf50x870xed0x510x150x6c0x37
Okay. It looks like I'm getting somewhere... but what's up with the 0x thing?
When I remove 0x from the string like this:
h.replace("0x", "")
I get 6cf96f3a7283ff7258fcef5cf587ed51156c37 which looks like it's correct.
But sometimes the byte string has a 0 next to a x and it gets removed from the string resulting in a incorrect hexadecimal string. (the string above is missing the 0 at the beginning).
Any ideas?

If you're running python 3.5+, bytes type has an new bytes.hex() method that returns string representation.
>>> h = b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
>>> h.hex()
'06cf96f30a7283ff7258fcef5cf587ed51156c37'
Otherwise you can use binascii.hexlify() to do the same thing
>>> import binascii
>>> binascii.hexlify(h).decode('utf8')
'06cf96f30a7283ff7258fcef5cf587ed51156c37'

As per the documentation, hex() converts “an integer number to a lowercase hexadecimal string prefixed with ‘0x’.” So when using hex() you always get a 0x prefix. You will always have to remove that if you want to concatenate multiple hex representations.
But sometimes the byte string has a 0 next to a x and it gets removed from the string resulting in a incorrect hexadecimal string. (the string above is missing the 0 at the beginning).
That does not make any sense. x is not a valid hexadecimal character, so in your solution it can only be generated by the hex() call. And that, as said above, will always create a 0x. So the sequence 0x can never appear in a different way in your resulting string, so replacing 0x by nothing should work just fine.
The actual problem in your solution is that hex() does not enforce a two-digit result, as simply shown by this example:
>>> hex(10)
'0xa'
>>> hex(2)
'0x2'
So in your case, since the string starts with b\x06 which represents the number 6, hex(6) only returns 0x6, so you only get a single digit here which is the real cause of your problem.
What you can do is use format strings to perform the conversion to hexadecimal. That way you can both leave out the prefix and enforce a length of two digits. You can then use str.join to combine it all into a single hexadecimal string:
>>> value = b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
>>> ''.join(['{:02x}'.format(x) for x in value])
'06cf96f30a7283ff7258fcef5cf587ed51156c37'
This solution does not only work with a bytes string but with really anything that can be formatted as a hexadecimal string (e.g. an integer list):
>>> value = [1, 2, 3, 4]
>>> ''.join(['{:02x}'.format(x) for x in value])
'01020304'

Python 3 struct.pack() printing weird characters

I am testing struct module because I would like to send simple commands with parameters in bytes (char) and unsigned int to another application.
However I found some weird things when converting to little endian unsigned int, these examples print the correct hexadecimal representation:
>>> import struct
>>> struct.pack('<I',7)
b'\x07\x00\x00\x00'
>>> struct.pack('<I',11)
b'\x0b\x00\x00\x00'
>>> struct.pack('<I',16)
b'\x10\x00\x00\x00'
>>> struct.pack('<I',15)
b'\x0f\x00\x00\x00'
but these examples apparently not:
>>> struct.pack('<I',10)
b'\n\x00\x00\x00'
>>> struct.pack('<I',32)
b' \x00\x00\x00'
>>> struct.pack('<I',64)
b'#\x00\x00\x00'
I would appreciate any explanation or hint. Thanks beforehand!

Python is being helpful.
The bytes representation will use ASCII characters for any bytes that are printable and escape codes for the rest.
Thus, 0x40 is printed as #, because that's a printable byte. But 0x0a is represented as \n instead, because that is the standard Python escape sequence for a newline character. 0x00 is represented as \x00, a hex escape sequence denoting the NULL byte value. Etc.
All this is just the Python representation when echoing the values, for your debugging benefit. The actual value itself still consists of actual byte values.
>>> b'\x40' == b'#'
True
>>> b'\x0a' == b'\n'
True
It's just that any byte in the printable ASCII range will be shown as that ASCII character rather than a \xhh hex escape or dedicated \c one-character escape sequence.
If you wanted to see only hexadecimal representations, use the binascii.hexlify() function:
>>> import binascii
>>> binascii.hexlify(b'#\x00\x00\x00')
b'40000000'
>>> binascii.hexlify(b'\n\x00\x00\x00')
b'0a000000'
which returns bytes as hex characters (with no prefixes), instead. The return value is of course no longer the same value, you now have a bytestring of twice the original length consisting of characters representing hexadecimal values, literal a through to f and 0 through to 9 characters.

"\xNN" is just the way to represent a non-prinatble character ... it will give you the prinable character if it can
print "\x0a" == "\n" == chr(10)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Convert unicode string into byte string [duplicate] - python

I was about to post the question when I encounter a valid solution almost by chance. The combination that works for me is: s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem") As I said I have no idea why this works so feel free to explain it if you know why.

You might simply use .encode("utf-8") to get desired result i.e.: s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04" s_2 = s_1.encode("utf-8") print(s_2) output b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'

Related

"bytearray.fromhex()" not returning the correct hex

Bytes operations in Python

How convert abitrary string into bytes without UnicodeEncodeError issue?

How to convert byte string with non-printable chars to hexadecimal in python? [duplicate]

Python 3 struct.pack() printing weird characters

Categories

Resources