Currently, I have Python 2.7 code that receives <str> objects over a socket connection. Throughout the code we use <str> objects, comparisons, etc. In converting to Python 3, I've found that socket connections now return <bytes> objects, which requires us to change all literals to the b'abc' form for literal comparisons and the like. This is a lot of work, and although it is apparent why this change was made in Python 3, I am curious whether there are any simpler workarounds.
Say I receive <bytes> b'\xf2a27' over a socket connection. Is there a simple way to convert these <bytes> into a <str> object with the same escapes in Python 3.6? I have looked into some solutions myself to no avail.
a = b'\xf2a27'.decode('utf-8', errors='backslashreplace')
Above yields '\\xf2a27' with len(a) = 7 instead of the original len(b'\xf2a27') = 4. Indexing is wrong too, so this just won't work, but it seems to be headed in the right direction.
a = b'\xf2a27'.decode('latin1')
Above yields 'òa27', which contains Unicode characters that I would like to avoid. In this case len(a) = 5, and comparisons like a[0] == '\xf2' work, but I'd like to keep the information escaped in its representation if possible.
Is there perhaps a more elegant solution that I am missing?
You really have to think about what the data you receive represents, and Python 3 makes a strong point in that direction. There's an important difference between a string of bytes that actually represents a collection of bytes and a string of (abstract, Unicode) characters.
You may have to think about each piece of data individually if they can have different representations.
Let's take your example of b'\xf2a27', which in the raw form you receive from the socket is just a string of 4 bytes: 0xf2, 0x61, 0x32, 0x37 in hex, or 242, 97, 50, 55 in decimal.
Let's say you actually want 4 bytes out of that. You could either keep it as a byte string or convert it into a list or tuple of bytes if that serves you better:
raw_bytes = b'\xf2a27'
list_of_bytes = list(raw_bytes)
tuple_of_bytes = tuple(raw_bytes)
if raw_bytes == b'\xf2a27':
    pass
if list_of_bytes == [0xf2, 0x61, 0x32, 0x37]:
    pass
if tuple_of_bytes == (0xf2, 0x61, 0x32, 0x37):
    pass
Let's say this actually represents a 32-bit integer, in which case you should convert it into a Python int. Decide whether it is encoded in little- or big-endian byte order, and make sure you pick correctly between signed and unsigned.
import struct

raw_bytes = b'\xf2a27'
signed_little_endian, = struct.unpack('<i', raw_bytes)
signed_little_endian = int.from_bytes(raw_bytes, byteorder='little', signed=True)
unsigned_little_endian, = struct.unpack('<I', raw_bytes)
unsigned_little_endian = int.from_bytes(raw_bytes, byteorder='little', signed=False)
signed_big_endian, = struct.unpack('>i', raw_bytes)
signed_big_endian = int.from_bytes(raw_bytes, byteorder='big', signed=True)
unsigned_big_endian, = struct.unpack('>I', raw_bytes)
unsigned_big_endian = int.from_bytes(raw_bytes, byteorder='big', signed=False)
if signed_little_endian == 926048754:
    pass
Let's say it's actually text. Think about what encoding it comes in. In your case it cannot be UTF-8: 0xf2 starts a multi-byte UTF-8 sequence, and the bytes that follow are not valid continuation bytes, so b'\xf2a27' cannot be decoded as UTF-8. If it's latin1, a.k.a. iso8859-1, and you're sure about it, that's fine.
raw_bytes = b'\xf2a27'
character_string = raw_bytes.decode('iso8859-1')
if character_string == '\xf2a27':
    pass
If your choice of encoding was correct, having a '\xf2' or 'ò' character inside the string will also be correct. It's still a single character. 'ò', '\xf2', '\u00f2' and '\U000000f2' are just 4 different ways to represent the same single character in a (unicode) string literal. Also, the len will be 4, not 5.
print(ord(character_string[0])) # will be 242
print(hex(ord(character_string[0]))) # will be 0xf2
print(len(character_string)) # will be 4
If you actually observed a length of 5, you may have measured it at the wrong point: perhaps after encoding the character string to UTF-8, or after it was implicitly encoded to UTF-8 by printing to a UTF-8 terminal.
Note the difference in the number of bytes written to the shell when changing the default I/O encoding:
PYTHONIOENCODING=UTF-8 python3 -c 'print(b"\xf2a27".decode("latin1"), end="")' | wc -c
# will output 5
PYTHONIOENCODING=latin1 python3 -c 'print(b"\xf2a27".decode("latin1"), end="")' | wc -c
# will output 4
Ideally, you should perform your comparisons after converting the raw bytes to the correct data type they represent. That makes your code more readable and easier to maintain.
As a general rule of thumb, you should always convert raw bytes to their actual (abstract) data type as soon as you receive them. Then keep it in that abstract data type for processing as long as possible. If necessary, convert it back to some raw data on output.
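For example, here is a minimal sketch of that rule; the message layout and field names are invented purely for illustration:

import struct

def parse_message(raw_bytes):
    # Hypothetical layout: a 4-byte little-endian unsigned id,
    # followed by latin1-encoded text.
    msg_id, = struct.unpack('<I', raw_bytes[:4])
    text = raw_bytes[4:].decode('latin1')
    return msg_id, text  # work with int and str from here on

msg_id, text = parse_message(b'\xf2a27hello')
print(msg_id, text)  # 926048754 hello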
My problem is as follows:
I'm reading a .csv generated by some software, and to read it I'm using Pandas. Pandas reads the .csv properly, but one of the columns stores byte sequences representing vectors, and Pandas stores them as strings.
So I have data (a string) and I want to use np.frombuffer() to get the proper vector. The problem is that data is a string, so when I use .encode() to turn it into bytes, the sequence is not the original one.
Example: the .csv contains \x00\x00, representing the vector [0,0] with dtype=np.uint8. Pandas stores it as a string, and when I try to process it something like this happens:
data = df.data[x] # With x any row.
type(data)
<class 'str'>
print(data)
\x00\x00
e_data = data.encode("latin1")
print(e_data)
b'\\x00\\x00'
v = np.frombuffer(e_data, np.uint8)
print(v)
[ 92 120  48  48  92 120  48  48]
I just want to get b'\x00\x00' from data instead of b'\\x00\\x00' which I understand is a little encoding mess I have not been able to fix yet.
Any way to do this?
Thanks!
Issue: you (apparently) have a string that contains literal backslash escape sequences, such as:
>>> x = r'\x00' # note the use of a raw string literal
>>> x # Python's representation of the string escapes the backslash
'\\x00'
>>> print(x) # but it looks right when printing
\x00
From this, you wish to create a corresponding bytes object, wherein the backslash-escape sequences are translated into the corresponding byte.
Handling these kinds of escape sequences is done using the unicode-escape string encoding. As you may be aware, string encodings convert between bytes and str objects, specifying the rules for which byte sequences correspond to what Unicode code points.
However, the unicode-escape codec assumes that the escape sequences are on the bytes side of the equation and that the str side will have the corresponding Unicode characters:
>>> rb'\x00'.decode('unicode-escape') # create a string with a NUL char
'\x00'
Applying .encode to the string will reverse that process; so if you start with the backslash-escape sequence, it will re-escape the backslash:
>>> r'\x00'.encode('unicode-escape') # the result contains two backslashes, represented as four
b'\\\\x00'
>>> list(r'\x00'.encode('unicode-escape')) # let's look at the numeric values of the bytes
[92, 92, 120, 48, 48]
As you can see, that is clearly not what we want.
We want to convert from bytes to str to do the backslash-escaping. But we have a str to start, so we need to change that to bytes; and we want bytes at the end, so we need to change the str that we get from the backslash-escaping. In both cases, we need each Unicode code point from 0-255 inclusive to correspond to a single byte with the same value.
The encoding we need for that task is called latin-1, also known as iso-8859-1.
For example:
>>> r'\x00'.encode('latin-1')
b'\\x00'
Thus, we can reason out the overall conversion:
>>> r'\x00'.encode('latin-1').decode('unicode-escape').encode('latin-1')
b'\x00'
As desired: our str with a literal backslash, lowercase x and two zeros, is converted to a bytes object containing a single zero byte.
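Applied to the np.frombuffer example from the question, the round trip looks like this:

import numpy as np

data = r'\x00\x00'  # the string as Pandas hands it back
raw = data.encode('latin-1').decode('unicode-escape').encode('latin-1')
print(np.frombuffer(raw, np.uint8))  # [0 0]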
Alternately: we can request that backslash-escapes are processed while decoding, by using escape_decode from the codecs standard library module. However, this isn't documented and isn't really meant to be used that way - it's internal stuff used to implement the unicode-escape codec and possibly some other things.
If you want to expose yourself to the risk of that breaking in the future, it looks like:
>>> import codecs
>>> codecs.escape_decode(r'\x00\x00')
(b'\x00\x00', 8)
We get a 2-tuple, with the desired bytes and what I assume is the number of Unicode code points that were decoded (i.e. the length of the string). From my testing, it appears that it can only use UTF-8 encoding for the non-backslash sequences (though this could be specific to how Python is configured), and you can't change this: there is no parameter to specify the encoding for a decode method. Like I said - not meant for general use.
Yes, all of that is as awkward as it seems. The reason you don't get easy support for this kind of thing is that it isn't really how you're intended to design your system. Fundamentally, all data is bytes; text is an abstraction that is encoded by that byte data. Using a single byte (with value 0) to represent four characters of text (the symbols \, x, 0 and 0) is not a normal encoding, and not a reversible one (how do I know whether to decode the byte as those four characters, or as a single NUL character?). Instead, you should strongly consider using some other friendly string representation of your data (perhaps a plain hex dump) and a non-text-encoding-related way to parse it. For example:
>>> data = '41 42' # a string in a simple hex dump format
>>> bytes.fromhex(data) # support is built-in, and works simply
b'AB'
>>> list(bytes.fromhex(data))
[65, 66]
Some things that were trivial in Python 2 get a bit more tedious in Python 3. I am sending a string followed by some hex value:
buffer = "ABCD"
buffer += "\xCA\xFE\xCA\xFE"
This gives an error when sending, and I have read in other posts that the solution is to use sendall and encode:
s.sendall(buffer.encode("UTF-8"))
However, what is sent over the network for the hex values is their UTF-8 encoding:
c3 8a c3 be c3 8a c3 be
instead of the exact bytes I defined. How should I do this without using external libraries and possibly without having to "convert" the data into another structure?
I know this question has been widely asked, but I can't find a satisfying solution
You may think Python 3 is making things more difficult, but the opposite is intended. You are experiencing a charset-enforcement issue. In Python 2 there were many ways to get confused between UTF-8 and Unicode; this is now fixed.
First of all, if you need to send binary data, you should choose the dedicated type, which is bytes. In Python 3, it is sufficient to prefix your string literal with b. This should fix your problem:
buffer = b"ABCD"
buffer += b"\xCA\xFE\xCA\xFE"
s.sendall(buffer)
Of course, a bytes object has no encode method, as it is already encoded as binary; but it has the converse method, decode.
When you create a str object using quotes with no prefix, Python 3 stores it as Unicode text by default (which the unicode type or the u prefix enforced in Python 2). That means you need to call encode to get binary data out of it.
Instead, use bytes directly to store binary data; no encoding operation will occur and it will stay exactly as you typed it.
The error can only concatenate str (not "bytes") to str speaks for itself: Python is complaining that it cannot concatenate str with bytes, because the former requires a further step, namely encoding, to make the + operation meaningful.
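For example (error message as shown by recent Python 3 versions):

>>> "ABCD" + b"\xca\xfe"
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "bytes") to str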
Based on the information in your question, you might be able to get away with encoding your data as latin-1, because this will not change any byte values:
buffer = "ABCD"
buffer += "\xCA\xFE\xCA\xFE"
payload = buffer.encode("latin-1")
print(payload)
b'ABCD\xca\xfe\xca\xfe'
On the other side, you could just decode from latin-1:
buffer = payload.decode('latin-1')
buffer
'ABCDÊþÊþ'
But you might prefer to keep the text and binary parts of your message as their respective types:
encoded_text = payload[:4]
encoded_text
b'ABCD'
text = encoded_text.decode('latin-1')
print(text)
ABCD
binary_data = payload[4:]
binary_data
b'\xca\xfe\xca\xfe'
If your text contains codepoints which cannot be encoded as latin-1 - '你好,世界' for example - you could follow the same approach, but you would need to encode the text as UTF-8 while encoding the binary data as 'latin-1'; the resulting bytes will need to be split into their text and binary sections and decoded separately.
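A minimal sketch of that mixed case, assuming (for illustration) a binary part with a known length of 4 bytes:

text = '你好,世界'
binary_data = b'\xca\xfe\xca\xfe'

# encode the text portion as UTF-8 and append the raw binary tail
payload = text.encode('utf-8') + binary_data

# receiver: split off the fixed-length binary tail, decode only the text
received_binary = payload[-4:]
received_text = payload[:-4].decode('utf-8')
assert received_text == text and received_binary == binary_data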
Finally: encoding string literals like '\xca\xfe\xca\xfe' is poor style in Python 3 - it is better to declare them as bytes literals like b'\xca\xfe\xca\xfe'.
For example, given an arbitrary string, which could be chars or just random bytes:
string = '\xf0\x9f\xa4\xb1'
I want to output:
b'\xf0\x9f\xa4\xb1'
This seems so simple, but I could not find an answer anywhere. Of course, just typing b followed by the string will do, but I want to do this at runtime, from a variable containing the string of bytes.
If the given string were AAAA or some other known characters, I could simply do string.encode('utf-8'), but I am expecting the string of bytes to be random. Doing that to '\xf0\x9f\xa4\xb1' (random bytes) produces the unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'.
There must be a simpler way to do this?
Edit:
I want to convert the string to bytes without using an encoding
The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
You say you don't want to use an encoding, but the alternatives which avoid it seem far inferior.
The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (up to four bytes) none of which are exactly the input code point as a byte value.
Omitting the argument to .encode() (as in a now-deleted answer) is no better: in Python 3, str.encode() simply defaults to UTF-8, which produces the same unwanted multi-byte sequences shown above.
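To see why UTF-8 cannot work here while Latin-1 does:

>>> '\xf0'.encode('utf-8')    # the code point U+00F0 becomes two bytes
b'\xc3\xb0'
>>> '\xf0'.encode('latin-1')  # latin-1 keeps it as a single byte with the same value
b'\xf0'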
I found a working solution:

import struct

def convert_string_to_bytes(string):
    # struct.pack("B", ...) packs one unsigned byte, so this only
    # works when every code point is at most 255
    result = b''
    for ch in string:
        result += struct.pack("B", ord(ch))
    return result

string = '\xf0\x9f\xa4\xb1'
print(convert_string_to_bytes(string))
output:
b'\xf0\x9f\xa4\xb1'
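For what it's worth, the same conversion can be written without struct; like the loop above, it only works when every code point is at most 255:

string = '\xf0\x9f\xa4\xb1'
print(bytes(ord(c) for c in string))  # b'\xf0\x9f\xa4\xb1'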
I have a device that returns a UTF-8 encoded string. I can only read from it byte-by-byte and the read is terminated by a byte of value 0x00.
I'm writing a Python 2.7 function for others to access my device and return the string.
In a previous design when the device just returned ASCII, I used this in a loop:
x = read_next_byte()
if x == 0:
    break
my_string += chr(x)
Where x is the latest byte value read from the device.
Now the device can return a UTF-8 encoded string, but I'm not sure how to convert the bytes that I get back into a UTF-8 encoded string/unicode.
chr(x) understandably causes an error when x > 127, so I thought that using unichr(x) might work, but that assumes the value passed is a full Unicode code point, whereas I only have one part of a character, 0-255.
So how can I convert the bytes that I get back from the device into a string that can be used in Python and still handle the full UTF-8 string?
Likewise, if I was given a UTF-8 string in Python, how would I break that down into individual bytes to send to my device and still maintain UTF-8?
The correct solution is to read until you hit the terminating byte, and only then decode the accumulated bytes from UTF-8, so that every multi-byte character is complete:
mybytes = bytearray()
while True:
    x = read_next_byte()
    if x == 0:
        break
    mybytes.append(x)
my_string = mybytes.decode('utf-8')
The above is the most direct translation of your original code. Interestingly, this is one of those cases where two-arg iter can dramatically simplify the code, by turning your C-style stateful byte-reader function into a Python iterator that lets you one-line the work:
# If this were Python 3 code, you'd use the bytes constructor instead of bytearray
my_string = bytearray(iter(read_next_byte, 0)).decode('utf-8')
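The reverse direction from your question (sending a string to the device) is just the mirror image; a sketch, assuming a hypothetical write_next_byte() that writes a single byte value:

def send_string(my_string, write_next_byte):
    # write_next_byte() is hypothetical, mirroring read_next_byte()
    for b in bytearray(my_string.encode('utf-8')):
        write_next_byte(b)
    write_next_byte(0)  # 0x00 terminator, matching the read protocol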
I have a UDP socket which receives datagrams of different lengths.
The first byte of the datagram specifies the type of the data that follows: say, 64 means bool false, 65 means bool true, 66 means sint, 67 means int, and so on. Most data types have a known length, but for string and wstring it works differently: the first byte says 85 (string), the next 2 bytes give the string length, and the actual string follows. For a wstring it is likewise 85, the next 2 bytes give the wstring length, and the actual wstring follows.
To parse this kind of wstring format, b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001', I used the following code:
data = str(rawdata[3:]).split("\\x00")
data = "".join(data[1:])
data = "".join(data[:-1])
Is this correct, or is there a simpler way?
Besides receiving datagrams, I also need to send them, but I do not know how to build the datagrams, as socket.sendto requires bytes. If I convert a string to UTF-16 format, will that give a wstring? If so, how would I add the rest of the information as bytes?
In the datagram above, U is 85, meaning wstring; \x00\x07 is the length (7) of the wstring data; and \x00C\x00o\x00u\x00p\x00o\x00n\x001 is the actual string, Coupon1.
A complete answer depends on exactly what you intend to do with the resulting data. Splitting the string on '\x00' (assuming that's what you meant to do; I'm not sure why there are two backslashes there) doesn't really make sense. The reason for using a wstring type in the first place is to be able to represent characters that aren't plain old 8-bit (really 7-bit) ASCII. If you have any characters that aren't standard Roman characters, they may well have something other than a zero byte separating the characters, in which case your split result will make no sense.
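For instance, a single non-Latin character removes the zero high bytes that the split relies on:

>>> 'Coupon你'.encode('utf-16-be')
b'\x00C\x00o\x00u\x00p\x00o\x00nO`'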
Caveat: since you mentioned sendto requiring bytes, I assume you're using Python 3. Details will be slightly different under Python 2.
Anyway, if I understand what you're trying to do, the 'utf-16-be' codec may be what you're looking for. (The 'utf-16' codec puts a byte-order mark at the beginning of the encoded string, which you probably don't want; 'utf-16-be' just puts the big-endian 16-bit characters into the byte string.) Decoding could be performed something like this:
rawdata = b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001'
dtype = rawdata[0]
if dtype == 85:  # wstring
    dlen = ord(rawdata[1:3].decode('utf-16-be'))
    data = rawdata[3: (dlen * 2) + 3]
    dstring = data.decode('utf-16-be')
This will leave dstring as a Python unicode string. In Python 3 all strings are unicode, so you're done.
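As an aside, the byte-order-mark difference mentioned above is easy to see (the BOM bytes shown are from a little-endian build of Python):

>>> 'A'.encode('utf-16')     # BOM first, then the character
b'\xff\xfeA\x00'
>>> 'A'.encode('utf-16-be')  # no BOM, big-endian 16-bit unit
b'\x00A'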
Encoding it could be done something like this:
tosend = 'Coupon1'
snd_data = bytearray([85]) # wstring indicator
snd_data += bytearray([(len(tosend) >> 8), (len(tosend) & 0xff)])
snd_data += tosend.encode('utf-16-be')
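As a quick sanity check, snd_data built this way reproduces the example datagram from your question:

# compare against the rawdata value used in the decoding example
assert bytes(snd_data) == b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001'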