string to wstring in python - python

I have a UDP socket which receives datagrams of different lengths.
The first byte of the datagram specifies what type of data follows: for example, 64 means bool false, 65 means bool true, 66 means sint, 67 means int, and so on. Most datatypes have a known length, but for string and wstring the first byte is the type code (85 means string), the next 2 bytes give the string length, and the actual string follows. For wstring the type byte is 85, the next 2 bytes give the wstring length, and the actual wstring follows.
To parse a wstring in the above format, such as b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001', I used the following code:
data = str(rawdata[3:]).split("\\x00")
data = "".join(data[1:])
data = "".join(data[:-1])
Is this correct, or is there a simpler way?
As I receive datagrams, I also need to send them. But I do not know how to create the datagrams, as socket.sendto requires bytes. If I convert a string to UTF-16, will that produce a wstring? If so, how would I add the rest of the information to the bytes?
In the above datagram, U (85) indicates a wstring, \x00\x07 is the length of the wstring data (7), and \x00C\x00o\x00u\x00p\x00o\x00n\x001 is the actual string, Coupon1.

A complete answer depends on exactly what you intend to do with the resulting data. Splitting the string with '\x00' (assuming that's what you meant to do? not sure I understand why there are two backslashes there) doesn't really make sense. The reason for using a wstring type in the first place is to be able to represent characters that aren't plain old 8-bit (really 7-bit) ascii. If you have any characters that aren't standard Roman characters, they may well have something other than a zero byte separating the characters in which case your split result will make no sense.
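A quick way to see the problem is to encode a string that contains a character outside ASCII. The sample text below is only an illustration, not data from the question:

```python
# UTF-16 (big-endian) uses two bytes per character in the Basic Multilingual Plane.
ascii_text = 'Coupon1'
greek_text = '\u03a9' + '1'  # GREEK CAPITAL LETTER OMEGA followed by an ASCII digit

ascii_encoded = ascii_text.encode('utf-16-be')
greek_encoded = greek_text.encode('utf-16-be')

print(ascii_encoded)  # every other byte is zero
print(greek_encoded)  # the first character has a non-zero high byte

# Splitting on zero bytes leaves the two-byte character as an undecoded chunk.
split_result = greek_encoded.split(b'\x00')
print(split_result)

# Decoding with the matching codec handles both cases.
decoded = greek_encoded.decode('utf-16-be')
print(decoded)
```

For pure ASCII text the split happens to give readable pieces, which is why the approach appears to work on the sample datagram.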
Caveat: Since you mentioned sendto requiring bytes, I assume you're using python3. Details will be slightly different under python2.
Anyway if I understand what it is you're meaning to do, the "utf-16-be" codec may be what you're looking for. (The "utf-16" codec puts a "byte order marker" at the beginning of the encoded string which you probably don't want; "utf-16-be" just puts the big-endian 16-bit chars into the byte string.) Decoding could be performed something like this:
rawdata = b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001'
dtype = rawdata[0]
if dtype == 85:  # wstring
    # the two length bytes are a big-endian unsigned 16-bit character count
    dlen = int.from_bytes(rawdata[1:3], 'big')
    data = rawdata[3:(dlen * 2) + 3]
    dstring = data.decode('utf-16-be')
This will leave dstring as a python unicode string. In python3, all strings are unicode. So you're done.
Encoding it could be done something like this:
tosend = 'Coupon1'
snd_data = bytearray([85]) # wstring indicator
snd_data += bytearray([(len(tosend) >> 8), (len(tosend) & 0xff)])
snd_data += tosend.encode('utf-16-be')
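The two directions can also be wrapped in a pair of helpers built on the struct module. This is only a sketch: the function names are made up for illustration, and it assumes the length field counts 16-bit units, as the sample datagram suggests.

```python
import struct

WSTRING = 85  # type code from the question (the byte b'U')


def pack_wstring(text):
    """Return type byte + 2-byte big-endian length + UTF-16-BE payload."""
    payload = text.encode('utf-16-be')
    # Assumption: the length field counts 16-bit units, not bytes.
    return struct.pack('>BH', WSTRING, len(payload) // 2) + payload


def unpack_wstring(datagram):
    """Inverse of pack_wstring; raises ValueError for any other type code."""
    dtype, length = struct.unpack_from('>BH', datagram, 0)
    if dtype != WSTRING:
        raise ValueError('not a wstring datagram: type %d' % dtype)
    return datagram[3:3 + length * 2].decode('utf-16-be')


datagram = pack_wstring('Coupon1')
print(datagram)
print(unpack_wstring(datagram))
```

The resulting bytes object can be passed directly to socket.sendto.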

Related

Right encoding to send data

Some things that were trivial in Python 2 get a bit more tedious in Python 3. I am sending a string followed by some hex value:
buffer = "ABCD"
buffer += "\xCA\xFE\xCA\xFE"
This gives an error when sending, and I have read in another post that the solution is to use sendall and encode:
s.sendall(buffer.encode("UTF-8"))
However, what is sent over the network for the hex values is their UTF-8 encoding:
c3 8a c3 be c3 8a c3 be
instead of the exact bytes I defined. How should I do this without using external libraries and possibly without having to "convert" the data into another structure?
I know this question has been widely asked, but I can't find a satisfying solution
You may think Python 3 is making things more difficult, but the opposite is intended. You are running into the stricter separation of text and bytes. In Python 2 there were multiple reasons to get confused between UTF-8 and Unicode. That is now fixed.
First of all, if you need to send binary data, you should choose the appropriate type, which is bytes. In Python 3, it is sufficient to prefix your string literal with a b. This should fix your problem:
buffer = b"ABCD"
buffer += b"\xCA\xFE\xCA\xFE"
s.sendall(buffer)
Of course, a bytes object has no encode method, as it is already binary. But it has the converse method, decode.
When you create a str object using quotes with no prefix, Python 3 creates a Unicode string (what the unicode type or the u prefix gave you in Python 2). That means you need to call the encode method to get binary data.
Instead, use bytes directly to store binary data: no encoding operation will occur and the data will stay as you typed it.
The error 'can only concatenate str (not "bytes") to str' speaks for itself. Python is complaining that it cannot concatenate str with bytes, because the str operand needs a further step, namely encoding, to make the + operation meaningful.
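The difference between the two literal types can be checked without a socket. A minimal sketch, with variable names chosen only for illustration:

```python
text = "\xCA\xFE"  # a str holding two code points, U+00CA and U+00FE
raw = b"\xCA\xFE"  # a bytes object holding two bytes

utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # each code point above 0x7f becomes two bytes
print(raw)         # unchanged, exactly the two bytes that were typed

# Mixing the two types is rejected instead of being silently converted.
try:
    "ABCD" + raw
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```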
Based on the information in your question, you might be able to get away with encoding your data as latin-1, because this will not change any byte values
buffer = "ABCD"
buffer += "\xCA\xFE\xCA\xFE"
payload = buffer.encode("latin-1")
print(payload)
b'ABCD\xca\xfe\xca\xfe'
On the other side, you could just decode from latin-1:
buffer = payload.decode('latin-1')
buffer
'ABCDÊþÊþ'
But you might prefer to keep the text and binary parts of your message as their respective types:
encoded_text = payload[:4]
encoded_text
b'ABCD'
text = encoded_text.decode('latin-1')
print(text)
ABCD
binary_data = payload[4:]
binary_data
b'\xca\xfe\xca\xfe'
If your text contains codepoints which cannot be encoded as latin-1 - '你好,世界' for example - you could follow the same approach, but you would need to encode the text as UTF-8 while encoding the binary data as 'latin-1'; the resulting bytes will need to be split into their text and binary sections and decoded separately.
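One way to sketch that split is shown below. The 2-byte length prefix is an assumption added here so that the receiver knows where the text ends; the original message format does not define one.

```python
text = '你好,世界'
binary_data = b'\xca\xfe\xca\xfe'

encoded_text = text.encode('utf-8')

# Hypothetical framing: 2-byte big-endian length of the text part, then the
# UTF-8 text, then the raw binary section.
payload = len(encoded_text).to_bytes(2, 'big') + encoded_text + binary_data

# Receiving side: read the length, then decode only the text section.
text_len = int.from_bytes(payload[:2], 'big')
received_text = payload[2:2 + text_len].decode('utf-8')
received_binary = payload[2 + text_len:]

print(received_text)
print(received_binary)
```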
Finally: encoding string literals like '\xca\xfe\xca\xfe' is poor style in Python 3. It is better to declare them as bytes literals like b'\xca\xfe\xca\xfe'.

Alignment/Packing in Python Struct.Unpack

I have a piece of hardware sending data in fixed-length fields: 2 bytes, 1 byte, 4 bytes, 4 bytes, 2 bytes, 4 bytes, for a total of 17 bytes. If I change my format to 18 bytes the code works but the values are incorrect.
format = '<2s1s4s4s2s4s'
print(struct.calcsize(format))
print(len(hardware_data))
splitdata = struct.unpack(format,hardware_data)
The output is 17, 18 and an error because of the mismatch. I think this is caused by alignment, but I'm unsure, and nothing I've tried has fixed this. Below are a couple of typical strings. When I print(hardware_data) I notice the 'R' and 'n' characters, but I'm unsure how to handle them.
b'\x18\x06\x00R\x1f\x01\x00\x00\x00\x00\x00\xd8\xff\x00\x00\x00\x00\x80'
b'\x18\x06\x00R\x1f\x01\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x80'
Odds are whatever is sending the data is padding it in some way you're not expecting.
For example, if the first four-byte field is supposed to represent an int, C struct padding rules would require a padding byte after the one-byte field (to align the next four-byte field on a four-byte boundary). So just add the padding byte explicitly, changing your format string to:
format = '<2s1sx4s4s2s4s'
The x in there says "I expect a byte here, but it's padding, don't unpack it to anything." It's possible the pad byte belongs elsewhere (I have no idea what your hardware is doing); I notice the third byte is the NUL (\0) byte in both examples, but the spot I assumed would be padding is 'R', so it's possible you want:
format = '<2sx1s4s4s2s4s'
instead. Or it could be somewhere else (without knowing which of the fields is a char array in the hardware struct, and which are larger types with alignment requirements, it's impossible to say). Point is, your hardware is sending 18 bytes; figure out which one is garbage, and put the x pad byte at the appropriate location.
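Both candidate layouts can be tried against the first sample from the question. This is only a sketch for comparing them; which layout is right still depends on the hardware's actual struct definition.

```python
import struct

# First sample from the question: 18 bytes on the wire.
hardware_data = b'\x18\x06\x00R\x1f\x01\x00\x00\x00\x00\x00\xd8\xff\x00\x00\x00\x00\x80'

pad_after_one_byte_field = '<2s1sx4s4s2s4s'
pad_before_one_byte_field = '<2sx1s4s4s2s4s'

for fmt in (pad_after_one_byte_field, pad_before_one_byte_field):
    # Both layouts describe 18 bytes, so the size mismatch is gone.
    print(fmt, struct.calcsize(fmt), struct.unpack(fmt, hardware_data))

fields_a = struct.unpack(pad_after_one_byte_field, hardware_data)
fields_b = struct.unpack(pad_before_one_byte_field, hardware_data)
```

The two results differ only in the second field: the first layout keeps the NUL byte and skips 'R', the second does the opposite.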
Side-note: The repr of bytes objects uses printable ASCII characters or short escapes when available. That's why you see an R and a \n in your output; b'R' and b'\x52' are equivalent literals, as are b'\n' and b'\x0a', and Python chooses the "more readable" version (when the bytes are actually just ASCII text, this is much more readable).
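If the mixed repr is distracting, the same data can be displayed in forms that never substitute ASCII characters. A small sketch, using made-up sample bytes:

```python
data = b'\x52\x0a\x80'

shown = repr(data)
print(shown)       # printable bytes and common escapes appear in short form

as_hex = data.hex()
print(as_hex)      # two hex digits per byte, nothing substituted

as_ints = list(data)
print(as_ints)     # the integer value of each byte
```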

<bytes> to escaped <str> Python 3

Currently, I have Python 2.7 code that receives <str> objects over a socket connection. All across the code we use <str> objects, comparisons, etc. In an effort to convert to Python 3, I've found that socket connections now return <bytes> objects which requires us to change all literals to be like b'abc' to make literal comparisons, etc. This is a lot of work, and although it is apparent why this change was made in Python 3 I am curious if there are any simpler workarounds.
Say I receive <bytes> b'\xf2a27' over a socket connection. Is there a simple way to convert these <bytes> into a <str> object with the same escapes in Python 3.6? I have looked into some solutions myself to no avail.
a = b'\xf2a27'.decode('utf-8', errors='backslashreplace')
Above yields '\\xf2a27' with len(a) = 7 instead of the original len(b'\xf2a27') = 3. Indexing is wrong too, this just won't work but it seems like it is headed down the right path.
a = b'\xf2a27'.decode('latin1')
Above yields 'òa27' which contains Unicode characters that I would like to avoid. Although in this case len(a) = 5 and comparisons like a[0] == '\xf2' work, but I'd like to keep the information escaped in representation if possible.
Is there perhaps a more elegant solution that I am missing?
You really have to think about what the data you receive represents and Python 3 makes a strong point in that direction. There's an important difference between a string of bytes that actually represent a collection of bytes and a string of (abstract, unicode) characters.
You may have to think about each piece of data individually if they can have different representations.
Let's take your example of b'\xf2a27' which in its raw form you receive from the socket is just a string of 4 bytes: 0xf2, 0x61, 0x32, 0x37 in hex or 242, 97, 50, 55 in decimal.
Let's say you actually want 4 bytes out of that. You could either keep it as a byte string or convert it into a list or tuple of bytes if that serves you better:
raw_bytes = b'\xf2a27'
list_of_bytes = list(raw_bytes)
tuple_of_bytes = tuple(raw_bytes)
if raw_bytes == b'\xf2a27':
    pass
if list_of_bytes == [0xf2, 0x61, 0x32, 0x37]:
    pass
if tuple_of_bytes == (0xf2, 0x61, 0x32, 0x37):
    pass
Let's say this actually represents a 32-bit integer in which case you should convert it into a Python int. Choose whether it is encoded in little or big endian byte order and make sure you pick the correct one of signed vs. unsigned.
import struct

raw_bytes = b'\xf2a27'
signed_little_endian, = struct.unpack('<i', raw_bytes)
signed_little_endian = int.from_bytes(raw_bytes, byteorder='little', signed=True)
unsigned_little_endian, = struct.unpack('<I', raw_bytes)
unsigned_little_endian = int.from_bytes(raw_bytes, byteorder='little', signed=False)
signed_big_endian, = struct.unpack('>i', raw_bytes)
signed_big_endian = int.from_bytes(raw_bytes, byteorder='big', signed=True)
unsigned_big_endian, = struct.unpack('>I', raw_bytes)
unsigned_big_endian = int.from_bytes(raw_bytes, byteorder='big', signed=False)
if signed_little_endian == 926048754:
    pass
Let's say it's actually text. Think about what encoding it comes in. In your case it cannot be UTF-8 as b'\xf2' would be a byte string that cannot be correctly decoded as UTF-8. If it's latin1 a.k.a. iso8859-1 and you're sure about it, that's fine.
raw_bytes = b'\xf2a27'
character_string = raw_bytes.decode('iso8859-1')
if character_string == '\xf2a27':
    pass
If your choice of encoding was correct, having a '\xf2' or 'ò' character inside the string will also be correct. It's still a single character. 'ò', '\xf2', '\u00f2' and '\U000000f2' are just 4 different ways to represent the same single character in a (unicode) string literal. Also, the len will be 4, not 5.
print(ord(character_string[0])) # will be 242
print(hex(ord(character_string[0]))) # will be 0xf2
print(len(character_string)) # will be 4
If you actually observed a length of 5, you may have observed it at the wrong point, perhaps after encoding the character string to UTF-8 or having it implicitly encoded to UTF-8 by printing to a UTF-8 terminal.
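The distinction between the number of characters and the number of encoded bytes can be checked directly in Python. A minimal sketch:

```python
character_string = b'\xf2a27'.decode('latin1')

char_count = len(character_string)                       # characters in the str
utf8_byte_count = len(character_string.encode('utf-8'))  # bytes after UTF-8 encoding
latin1_byte_count = len(character_string.encode('latin1'))

print(char_count, utf8_byte_count, latin1_byte_count)
```

The extra byte in the UTF-8 count comes from the first character, which needs two bytes in that encoding.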
Note the difference of the number of bytes output to the shell when changing the default I/O encoding:
PYTHONIOENCODING=UTF-8 python3 -c 'print(b"\xf2a27".decode("latin1"), end="")' | wc -c
# will output 5
PYTHONIOENCODING=latin1 python3 -c 'print(b"\xf2a27".decode("latin1"), end="")' | wc -c
# will output 4
Ideally, you should perform your comparisons after converting the raw bytes to the correct data type they represent. That makes your code more readable and easier to maintain.
As a general rule of thumb, you should always convert raw bytes to their actual (abstract) data type as soon as you receive them. Then keep it in that abstract data type for processing as long as possible. If necessary, convert it back to some raw data on output.

Python convert strings of bytes to byte array

For example, given an arbitrary string, which could be chars or just random bytes:
string = '\xf0\x9f\xa4\xb1'
I want to output:
b'\xf0\x9f\xa4\xb1'
This seems so simple, but I could not find an answer anywhere. Of course just typing the b followed by the string will do. But I want to do this at runtime, or from a variable containing the string of bytes.
If the given string was AAAA or some known characters I could simply do string.encode('utf-8'), but I am expecting the string of bytes to just be random. Doing that to '\xf0\x9f\xa4\xb1' (random bytes) produces the unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'.
There must be a simpler way to do this?
Edit:
I want to convert the string to bytes without using an encoding
The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
You say you don't want to use an encoding, but the alternatives which avoid it seem far inferior.
The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (up to four bytes) none of which are exactly the input code point as a byte value.
Omitting the argument to .encode() (as in a now-deleted answer) does not help either: in Python 3 the default encoding for str.encode() is UTF-8, so you get the same multi-byte result described above.
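The one-to-one property of Latin-1 can be verified over the whole byte range. A short sketch that also shows the limit of the approach; the Euro sign is just an arbitrary example of a code point above 0xff:

```python
all_bytes = bytes(range(256))

# Every byte value maps to the code point with the same number, and back.
as_text = all_bytes.decode('latin-1')
round_tripped = as_text.encode('latin-1')
print(round_tripped == all_bytes)

# Code points above 0xff have no Latin-1 byte, so encoding them fails loudly.
try:
    '\u20ac'.encode('latin-1')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```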
I found a working solution
import struct

def convert_string_to_bytes(string):
    # pack each code point (must be in the range 0-255) as one unsigned byte
    result = b''
    for ch in string:
        result += struct.pack("B", ord(ch))
    return result

string = '\xf0\x9f\xa4\xb1'
print(convert_string_to_bytes(string))
output:
b'\xf0\x9f\xa4\xb1'

Convert ASCII data to hex/binary/bytes in Python

The protocol for a device I'm working with sends a UDP packet to a server (My Python program, using twisted for networking) in a comma separated ASCII format. In order to acknowledge that my application received the packet correctly, I have to take a couple of values from the header of that packet, convert them to binary and send them back. I have everything setup, though I can't seem to figure out the binary part.
I've been going through a lot of stuff and I'm still confused. The documentation says "binary" though it looks more like hexadecimal because it says the ACK packet has to start with "0xFE 0x02".
The format of the acknowledgement requires me to send "0xFE 0x02" + an 8-byte unsigned integer (IMEI number, 15 digits) + a 2-byte unsigned integer (Sequence ID).
How would I go about converting the ASCII text values that I have into "binary"?
First:
The documentation says "binary" though it looks more like hexadecimal because it says the ACK packet has to start with "0xFE 0x02".
Well, it's impossible to print actual binary data in a human-readable form, so documentation will usually give a sequence of hexadecimal bytes. They could use a bytes literal like b'\xFE\x02' or something instead, but that's still effectively hexadecimal to the human reader, right?
So, if they say "binary", they probably mean "binary", and the hex is just how they're showing you what binary bytes you need.
So, you need to convert the ASCII representation of a number into an actual number, which you do with the int function. Then you need to pack that into 8 bytes, which you do with the struct module.
You didn't mention whether you needed big-endian or little-endian. Since this sounds like a network protocol, and it sounds like it wasn't designed by Microsoft, I would guess big-endian, but you should actually know, not guess.
So:
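import struct
imei_string = '1234567890123456789'
imei_number = int(imei_string) # 1234567890123456789
imei_bytes = struct.pack('>Q', imei_number) # b'\x11\x22\x10\xf4\x7d\xe9\x81\x15'
seq_number = 1 # placeholder value; use the Sequence ID from the packet header
seq_bytes = struct.pack('>H', seq_number)
buf = b'\xFE\x02' + imei_bytes + seq_bytes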
(You didn't say where you're supposed to get the sequence number from, but wherever it comes from, you'll pack it the same way, just using >H instead of >Q.)
If you actually did need a hex string rather than binary, you'd need to know exactly what format. The binascii.hexlify function gives you "bare hex", two characters per byte, no 0x or other header or footer. If you want something different, well, it depends on what exactly you want; no format is really that hard. But, again, I'm pretty sure you don't need a hex string here.
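For comparison, here is roughly what the hex-string variants look like. The spaced "0xFE 0x02" layout is only one possible format, chosen to match how the documentation writes the header:

```python
import binascii

header = b'\xFE\x02'

# "Bare hex": two lowercase characters per byte, returned as a bytes object.
bare_hex = binascii.hexlify(header)
print(bare_hex)

# One of many possible custom layouts, built by formatting each byte value.
spaced = ' '.join('0x%02X' % value for value in bytearray(header))
print(spaced)
```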
One last thing: Since you didn't specify your Python version, I wrote this in a way that's compatible with both 2.6+ and 3.0+; if you need to use 2.5 or earlier, just drop the b prefix on the literal in the buf = line.
