Can you compress bytes in python and send them? - python

I am writing a TCP python script, and I need the first 4 bytes to be the size of the file.
I got the size of the file by doing
SIZE_OF_FILE = os.path.getsize(infile.name)
The size is 392399 bytes.
When I do
s.send(str(SIZE_OF_FILE).encode("utf-8"))
it sends the file, and then on my server I have
fileSize = conn.recv(4).decode('utf-8')
This should read the first 4 bytes and extract the file size, but it returns 3923 instead of 392399.
What happened? "392399" should be able to fit into 4 bytes.
We are supposed to be using big endian.

This is because str(SIZE_OF_FILE) formats the number in decimal notation: you get the string "392399", which is 6 characters (and 6 bytes in UTF-8). The receiver reads only the first 4 of those bytes, which gives "3923".
What you probably want to do is use something like struct.pack to create a bytestring containing the binary representation of the number.
s.send(struct.pack(format_string, SIZE_OF_FILE))
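To see the difference concretely, here is a small sketch comparing the two encodings of the size from the question. It assumes an unsigned 4-byte big-endian field (format ">I"), which is one plausible choice for format_string; the question does not say whether the field is signed.

```python
import struct

SIZE_OF_FILE = 392399  # value from the question

# Decimal text: one byte per digit, so six bytes for this number.
as_text = str(SIZE_OF_FILE).encode("utf-8")

# Binary: always exactly four bytes with ">I" (assumed: unsigned, big-endian).
as_binary = struct.pack(">I", SIZE_OF_FILE)

print(as_text, len(as_text))
print(as_binary, len(as_binary))
print(as_text[:4])  # what a 4-byte read sees of the text form
```

The text form grows with the number of digits, while the packed form stays at four bytes for any value up to 2**32 - 1.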

You are sending the size as a string ("392399"), which is 6 ASCII characters and therefore 6 bytes. You want to send it as a raw integer; use struct.pack to do that:
s.send(struct.pack(">i", SIZE_OF_FILE))
To receive:
fileSize = struct.unpack(">i", conn.recv(4))[0]
The > makes it big-endian. To make it little-endian, use < instead. i is the type; in this case, a 4-byte signed integer (use I for unsigned). The struct module documentation has a list of types, in case you want to use another one.
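One caveat with the receiving side: on a TCP socket, recv(4) may return fewer than 4 bytes. The following is a minimal sketch of a length-prefixed exchange that loops until the requested number of bytes has arrived. The helper names (recv_exact, send_blob, recv_blob) are hypothetical, the header is assumed to be unsigned (">I"), and socket.socketpair() stands in for a real client/server connection.

```python
import socket
import struct

HEADER = struct.Struct(">I")  # assumed: 4-byte unsigned big-endian length


def recv_exact(conn, n):
    """Read exactly n bytes from conn, or raise if the peer closes early."""
    chunks = []
    remaining = n
    while remaining:
        chunk = conn.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before %d bytes arrived" % n)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_blob(sock, payload):
    """Send a 4-byte length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_blob(conn):
    """Receive one length-prefixed payload."""
    (size,) = HEADER.unpack(recv_exact(conn, HEADER.size))
    return recv_exact(conn, size)


# Demo: a connected socket pair plays the roles of client and server.
left, right = socket.socketpair()
send_blob(left, b"hello")
print(recv_blob(right))
left.close()
right.close()
```

Using sendall instead of send matters for the same reason: send may transmit only part of the buffer.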

Related

Alignment/Packing in Python Struct.Unpack

I have a piece of hardware sending data at a fixed length: 2 bytes, 1 byte, 4 bytes, 4 bytes, 2 bytes, 4 bytes, for a total of 17 bytes. If I change my format to 18 bytes, the code works but the values are incorrect.
format = '<2s1s4s4s2s4s'
print(struct.calcsize(format))
print(len(hardware_data))
splitdata = struct.unpack(format,hardware_data)
The output is 17, then 18, and then an error because of the mismatch. I think this is caused by alignment, but I'm unsure, and nothing I've tried has fixed it. Below are a couple of typical strings. When I print(hardware_data) I notice the 'R' and 'n' characters, but I'm unsure how to handle them.
b'\x18\x06\x00R\x1f\x01\x00\x00\x00\x00\x00\xd8\xff\x00\x00\x00\x00\x80'
b'\x18\x06\x00R\x1f\x01\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x80'
Odds are whatever is sending the data is padding it in some way you're not expecting.
For example, if the first four byte field is supposed to represent an int, C struct padding rules would require a padding byte, after the one byte field (to align the next four byte field to four byte alignment). So just add the padding byte explicitly, changing your format string to:
format = '<2s1sx4s4s2s4s'
The x in there says "I expect a byte here, but it's padding, don't unpack it to anything." It's possible the pad byte belongs elsewhere (I have no idea what your hardware is doing); I notice the third byte is the NUL (\0) byte in both examples, but the spot I assumed would be padding is 'R', so it's possible you want:
format = '<2sx1s4s4s2s4s'
instead. Or it could be somewhere else (without knowing which of the fields is a char array in the hardware struct, and which are larger types with alignment requirements, it's impossible to say). Point is, your hardware is sending 18 bytes; figure out which one is garbage, and put the x pad byte at the appropriate location.
Side-note: The repr of bytes objects will use ASCII or simpler ASCII escapes when available. That's why you see an R and a \n in your output; b'R' and b'\x52' are equivalent literals, as are b'\n' and b'\x0a' and Python chooses to use the "more readable" version (when the bytes is actually just ASCII, this is much more readable).
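As a quick check of the two candidate layouts above, the sketch below unpacks the first sample packet from the question with each format. It only shows how the fields fall out; which layout is actually right still depends on the hardware.

```python
import struct

# First sample packet from the question (18 bytes).
hardware_data = b'\x18\x06\x00R\x1f\x01\x00\x00\x00\x00\x00\xd8\xff\x00\x00\x00\x00\x80'

# The two candidate layouts discussed above.
pad_after_1s = '<2s1sx4s4s2s4s'
pad_before_1s = '<2sx1s4s4s2s4s'

for fmt in (pad_after_1s, pad_before_1s):
    # Both formats now describe 18 bytes, matching the packet length.
    print(fmt, struct.calcsize(fmt), struct.unpack(fmt, hardware_data))

# repr() shows printable bytes as ASCII, but the values are identical.
print(b'R' == b'\x52', b'\n' == b'\x0a')
```

The only difference between the two results is the second field: b'\x00' when the pad follows the one-byte field, b'R' when it precedes it.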

string to wstring in python

I have a UDP socket which receives datagrams of different lengths.
The first byte of the datagram specifies what type of data follows, for example 64 means bool false, 65 means bool true, 66 means sint, 67 means int, and so on. Most datatypes have a known length, but for string and wstring the first byte is the type (85 means string), the next 2 bytes give the string length, and the actual string follows. For wstring it is 85, then 2 bytes of wstring length, followed by the actual wstring.
To parse a wstring in the above format, such as b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001', I used the following code
data = str(rawdata[3:]).split("\\x00")
data = "".join(data[1:])
data = "".join(data[:-1])
Is this correct, or is there a simpler way?
As well as receiving datagrams, I need to send them, but I do not know how to create them, since socket.sendto requires bytes. If I convert a string to UTF-16, will that produce a wstring? If so, how would I add the rest of the information to the bytes?
In the datagram above, U (85) marks a wstring, \x00\x07 is the length of the wstring data (7), and \x00C\x00o\x00u\x00p\x00o\x00n\x001 is the actual string, Coupon1.
A complete answer depends on exactly what you intend to do with the resulting data. Splitting the string with '\x00' (assuming that's what you meant to do? not sure I understand why there are two backslashes there) doesn't really make sense. The reason for using a wstring type in the first place is to be able to represent characters that aren't plain old 8-bit (really 7-bit) ascii. If you have any characters that aren't standard Roman characters, they may well have something other than a zero byte separating the characters in which case your split result will make no sense.
Caveat: Since you mentioned sendto requiring bytes, I assume you're using python3. Details will be slightly different under python2.
Anyway if I understand what it is you're meaning to do, the "utf-16-be" codec may be what you're looking for. (The "utf-16" codec puts a "byte order marker" at the beginning of the encoded string which you probably don't want; "utf-16-be" just puts the big-endian 16-bit chars into the byte string.) Decoding could be performed something like this:
rawdata = b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001'
dtype = rawdata[0]
if dtype == 85: # wstring
    dlen = int.from_bytes(rawdata[1:3], 'big') # 2-byte big-endian length, in 16-bit characters
    data = rawdata[3: (dlen * 2) + 3]
    dstring = data.decode('utf-16-be')
This will leave dstring as a python unicode string. In python3, all strings are unicode. So you're done.
Encoding it could be done something like this:
tosend = 'Coupon1'
snd_data = bytearray([85]) # wstring indicator
snd_data += bytearray([(len(tosend) >> 8), (len(tosend) & 0xff)])
snd_data += tosend.encode('utf-16-be')
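The same encode and decode steps can also be written with the struct module, which handles the tag and the 2-byte length in one call. This is a sketch only: encode_wstring and decode_wstring are hypothetical helper names, and it assumes the length field counts 16-bit units, as the Coupon1 example suggests.

```python
import struct

WSTRING = 85  # type tag taken from the question


def encode_wstring(text):
    """Build tag + 2-byte big-endian length + UTF-16-BE payload."""
    payload = text.encode('utf-16-be')
    # Length is counted in 16-bit units here (an assumption about the protocol).
    return struct.pack('>BH', WSTRING, len(payload) // 2) + payload


def decode_wstring(datagram):
    """Parse a datagram produced by encode_wstring and return the text."""
    tag, length = struct.unpack_from('>BH', datagram, 0)
    if tag != WSTRING:
        raise ValueError('not a wstring datagram: tag %d' % tag)
    return datagram[3:3 + length * 2].decode('utf-16-be')


packet = encode_wstring('Coupon1')
print(packet)
print(decode_wstring(packet))
```

Counting the encoded payload rather than len(text) keeps the length correct for characters outside the Basic Multilingual Plane, which take two 16-bit units each.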

Convert ASCII data to hex/binary/bytes in Python

The protocol for a device I'm working with sends a UDP packet to a server (My Python program, using twisted for networking) in a comma separated ASCII format. In order to acknowledge that my application received the packet correctly, I have to take a couple of values from the header of that packet, convert them to binary and send them back. I have everything setup, though I can't seem to figure out the binary part.
I've been going through a lot of stuff and I'm still confused. The documentation says "binary" though it looks more like hexadecimal because it says the ACK packet has to start with "0xFE 0x02".
The format of the acknowledgement requires me to send "0xFE 0x02" + an 8-byte unsigned integer (IMEI number, 15 digits) + a 2-byte unsigned integer (Sequence ID)
How would I go about converting the ASCII text values that I have into "binary"?
First:
The documentation says "binary" though it looks more like hexadecimal because it says the ACK packet has to start with "0xFE 0x02".
Well, it's impossible to print actual binary data in a human-readable form, so documentation will usually give a sequence of hexadecimal bytes. They could use a bytes literal like b'\xFE\x02' or something instead, but that's still effectively hexadecimal to the human reader, right?
So, if they say "binary", they probably mean "binary", and the hex is just how they're showing you what binary bytes you need.
So, you need to convert the ASCII representation of a number into an actual number, which you do with the int function. Then you need to pack that into 8 bytes, which you do with the struct module.
You didn't mention whether you needed big-endian or little-endian. Since this sounds like a network protocol, and it sounds like it wasn't designed by Microsoft, I would guess big-endian, but you should actually know, not guess.
So:
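import struct
imei_string = '1234567890123456789'
imei_number = int(imei_string) # 1234567890123456789
imei_bytes = struct.pack('>Q', imei_number) # b'\x11\x22\x10\xf4\x7d\xe9\x81\x15'
seq_bytes = struct.pack('>H', seq_number) # seq_number is a placeholder: the Sequence ID from the header, as an int
buf = b'\xFE\x02' + imei_bytes + seq_bytes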
(You didn't say where you're supposed to get the sequence number from, but wherever it comes from, you'll pack it the same way, just using >H instead of >Q.)
If you actually did need a hex string rather than binary, you'd need to know exactly what format. The binascii.hexlify function gives you "bare hex", two characters per byte, no 0x or other header or footer. If you want something different, well, it depends on what exactly you want; no format is really that hard. But, again, I'm pretty sure you don't need a hex string here.
One last thing: Since you didn't specify your Python version, I wrote this in a way that's compatible with both 2.6+ and 3.0+; if you need to use 2.5 or earlier, just drop the b prefix on the literal in the buf = line.
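Putting the pieces together, a complete ACK builder might look like the sketch below. It assumes big-endian byte order (which, as noted above, should be confirmed against the protocol documentation), and build_ack plus the sample values are made up for illustration.

```python
import struct

ACK_HEADER = b'\xFE\x02'  # from the protocol description in the question


def build_ack(imei_text, seq_text):
    """Pack header + 8-byte IMEI + 2-byte sequence ID (big-endian assumed)."""
    return ACK_HEADER + struct.pack('>QH', int(imei_text), int(seq_text))


# Made-up example values, as they might arrive in the ASCII packet.
ack = build_ack('123456789012345', '42')
print(len(ack), ack)
```

With the '>' prefix, struct adds no alignment padding, so the packet is exactly 2 + 8 + 2 = 12 bytes.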

Python 3 file input change in binary mode

In Python 3, when I opened a text file with mode string 'rb', and then did f.read(), I was taken aback to find the file contents enclosed in single quotes after the character 'b'.
In Python 2 I just get the file contents.
I'm sure this is well known, but I can't find anything about it in the doco. Could someone point me to it?
You get "just the file contents" in Python 3 as well. Most likely you can just keep on doing whatever you were doing anyway. Read on for a longer explanation:
The b'' signifies that the result is a bytes string. A bytes string is similar to a normal string, but not identical, and is used to handle binary, non-textual data.
Some string methods that don't make sense for binary data are gone, but most are still there. A big difference is that when you get a specific byte from a bytes string you get an integer back, while for a normal str you get a one-length str.
>>> b'foo'[1]
111
>>> 'foo'[1]
'o'
If you open the file in text mode with the 't' flag you get a str back. The Python 3 str is what in Python 2 was called unicode. It's used to handle textual data.
You convert back and forth between bytes and str with the .encode() and .decode() methods.
First of all, the Python 2 str type has been renamed to bytes in Python 3, and byte literals use the b'' prefix. The Python 2 unicode type is the new Python 3 str type.
To get the Python 3 file behaviour in Python 2, you'd use io.open() or codecs.open(); Python 3 decodes text files to Unicode by default.
What you see is that for binary files, Python 3 gives you the exact same thing as in Python 2, namely byte strings. What changed is that the repr() of a byte string is prefixed with b, and the print() function shows that repr() form for bytes, because str() of a bytes object is the same as its repr(). Only str (Unicode) values are printed as plain text.
To print your binary data as Unicode text with the print() function, decode it to Unicode first. But then you could perhaps have opened the file as a text file instead anyway.
The bytes type has some other improvements to reflect that you are dealing with binary data, not text. Indexing individual bytes or iterating over a bytes value gives you int values (between 0 and 255) and not characters, for example.
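The points above can be tried out with a short, self-contained sketch. It writes a temporary sample file (the content and the UTF-8 encoding are assumptions for the example), then reads it back in binary and in text mode.

```python
import os
import tempfile

# Write a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write('h\u00e9llo\n'.encode('utf-8'))
    path = tmp.name

with open(path, 'rb') as f:
    raw = f.read()          # bytes
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()         # str

print(repr(raw))            # shown with the b'' prefix
print(raw.decode('utf-8') == text)
print(raw[0], text[0])      # int versus one-character str

os.remove(path)
```

The file contents are the same in both cases; only the type, and therefore the printed representation, differs.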
Sometimes we need (needed?) to know whether a text file had single-character newlines (0A) or double character newlines (0D0A).
We used to avoid confusion by opening the text file in binary mode, recognising 0D and 0A, and treating other bytes as regular text characters.
One could port such code by finding all binary-mode reads and replacing them with a new function oldread() that stripped off the added material, but it’s a bit painful.
I suppose the Python theologians thought of keeping ‘rb’ as it was, and adding a new ‘rx’ or something for the new behaviour. It seems a bit high-handed just to abolish something.
But, there it is, the question is certainly answered by a search for ‘rb’ in Lennert’s document.
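For what it's worth, the binary-mode newline check described above still works in Python 3, because the bytes themselves come back unchanged and only their repr differs. A minimal sketch, where newline_style is a hypothetical helper name:

```python
def newline_style(data):
    """Classify line endings in a bytes object read with mode 'rb'."""
    crlf = data.count(b'\r\n')
    lf = data.count(b'\n') - crlf   # bare LF only
    if crlf and not lf:
        return 'CRLF'
    if lf and not crlf:
        return 'LF'
    return 'mixed' if crlf and lf else 'none'


print(newline_style(b'one\r\ntwo\r\n'))
print(newline_style(b'one\ntwo\n'))
```

The b prefix on the literals is the only change such code needs; comparing against plain str literals like '\r\n' is what breaks under Python 3.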

How to unpack from a binary file a byte array using Python?

I'm giving myself a crash course in reading a binary file using Python. I'm new to both, so please bear with me.
The file format's documentation tells me that the first 16 bytes are a GUID and further reading tells me that this GUID is formatted thus:
typedef struct {
unsigned long Data1;
unsigned short Data2;
unsigned short Data3;
byte Data4[8];
} GUID,
UUID,
*PGUID;
I've got as far as being able to unpack the first three entries in the struct, but I'm getting stumped on #4. It's an array of 8 bytes I think but I'm not sure how to unpack it.
import struct
fp = open("./file.bin", mode='rb')
Data1 = struct.unpack('<L', fp.read(4)) # unsigned long, little-endian
Data2 = struct.unpack('<H', fp.read(2)) # unsigned short, little-endian
Data3 = struct.unpack('<H', fp.read(2)) # unsigned short, little-endian
Data4 = struct.unpack('<s', bytearray(fp.read(8))) # byte array with 8 entries?
struct.error: unpack requires a bytes object of length 1
What am I doing wrong for Data4? (I'm using Python 3.2 BTW)
Data1 thru 3 are OK. If I use hex() on them I am getting the correct data that I'd expect to see (woohoo) I'm just failing over on the syntax of this byte array.
Edit: Answer
I'm reading a GUID as defined in MS-DTYP and this nailed it:
data = uuid.UUID(bytes_le=fp.read(16))
If you want an 8-byte string, you need to put the number 8 in there:
struct.unpack('<8s', bytearray(fp.read(8)))
From the docs:
A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'.
…
For the 's' format character, the count is interpreted as the length of the bytes, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string, while '10c' means 10 characters. If a count is not given, it defaults to 1. For packing, the string is truncated or padded with null bytes as appropriate to make it fit. For unpacking, the resulting bytes object always has exactly the specified number of bytes. As a special case, '0s' means a single, empty string (while '0c' means 0 characters).
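The count rules quoted above can be checked with a few lines. This is just an illustration on ten made-up bytes:

```python
import struct

data = bytes(range(10))

# '4h' is shorthand for 'hhhh': four 2-byte shorts.
print(struct.calcsize('<4h'), struct.calcsize('<hhhh'))

# '10s' yields one 10-byte bytes object; '10c' yields ten 1-byte objects.
as_string = struct.unpack('<10s', data)
as_chars = struct.unpack('<10c', data)
print(len(as_string), len(as_chars))

# Without a count, 's' means '1s', which is the cause of the error in the question.
print(struct.calcsize('<s'))
```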
However, I'm not sure why you're doing this in the first place.
fp.read(8) gives you an 8-byte bytes object. You want an 8-byte bytes object. So, just do this:
Data4 = fp.read(8)
Converting the bytes to a bytearray has no effect except to make a mutable copy. Unpacking it just gives you back a copy of the same bytes you started with. So… why?
Well, actually, struct.unpack returns a tuple whose one value is a copy of the same bytes you started with, but you can do that with:
Data4 = (fp.read(8),)
Which raises the question of why you want four single-element tuples in the first place. You're going to be doing Data1[0], etc. all over the place for no good reason. Why not this?
Data1, Data2, Data3, Data4 = struct.unpack('<LHH8s', fp.read(16))
Of course if this is meant to read a UUID, it's always better to use the "batteries included" than to try to build your own batteries from nickel and cadmium ore. As icktoofay says, just use the uuid module:
data = uuid.UUID(bytes_le=fp.read(16))
But keep in mind that Python's uuid uses the 4-2-2-1-1-6 format, not the 4-2-2-8 format. If you really need exactly that format, you'll need to convert it, which means either struct or bit twiddling anyway. (Microsoft's GUID makes things even more fun by using a 4-2-2-2-6 format, which is not the same as either, and representing the first 3 in native-endian and the last two in big-endian, because they like to make things easier…)
UUIDs are supported by Python with the uuid module. Do something like this:
import uuid
my_uuid = uuid.UUID(bytes_le=fp.read(16))
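To compare the two approaches side by side, the sketch below unpacks the same 16 bytes with struct and with the uuid module. The bytes are made up, and io.BytesIO stands in for the opened file.bin.

```python
import io
import struct
import uuid

# Sixteen made-up bytes standing in for the start of file.bin.
raw = bytes.fromhex('33221100554477668899aabbccddeeff')
fp = io.BytesIO(raw)

# Manual unpack: three little-endian integers plus the 8 raw bytes of Data4.
data1, data2, data3, data4 = struct.unpack('<LHH8s', fp.read(16))

# The uuid module applies the same little-endian rule to the first three fields.
guid = uuid.UUID(bytes_le=raw)

print(hex(data1), hex(data2), hex(data3), data4)
print(guid)
```

Here data1, data2 and data3 match guid.time_low, guid.time_mid and guid.time_hi_version, and data4 matches the last 8 bytes of guid.bytes.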
