Python struct, size changed by alignment

Here's the byte string I am trying to unpack:
b'ABCDFGHa\x00a\x00a\x00a\x00a\x00\x00\x00\x00\x00\x00\x01' (it's not supposed to make any sense)
labels = unpack('BBBBBBBHHHHH5sB', msg)
struct.error: unpack requires a bytes argument of length 24
From what I counted, both are length 23: the size of the format in my unpack call and the length of the byte string. I don't understand.
Thanks in advance

Most processors access data faster when the data is on natural boundaries, meaning data of size 2 should be on even addresses, data of size 4 should be accessed on addresses divisible by four, etc.
struct by default maintains this alignment. Since your structure starts out with 7 'B', a padding byte is added to align the next 'H' on an even address. To prevent this in Python, precede your format string with '='.
Example:
>>> import struct
>>> struct.calcsize('BBB')
3
>>> struct.calcsize('BBBH')
6
>>> struct.calcsize('=BBBH')
5
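Applied to the exact format and message from the question (a sketch; the H values shown assume a little-endian machine), the '=' prefix drops the pad byte and the unpack succeeds:
>>> import struct
>>> msg = b'ABCDFGHa\x00a\x00a\x00a\x00a\x00\x00\x00\x00\x00\x00\x01'
>>> struct.calcsize('BBBBBBBHHHHH5sB')    # native alignment: pad byte after the seven 'B's
24
>>> struct.calcsize('=BBBBBBBHHHHH5sB')   # standard sizes, no padding
23
>>> struct.unpack('=BBBBBBBHHHHH5sB', msg)
(65, 66, 67, 68, 70, 71, 72, 97, 97, 97, 97, 97, b'\x00\x00\x00\x00\x00', 1)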

I think 'H' is enforcing 2-byte alignment after your seven 'B's.
Aha, the alignment info is at the top of http://docs.python.org/library/struct.html, not down by the definition of the format characters.

Related

Native array.frombytes() (not numpy!) mysterious behavior

[I cannot use numpy so please refrain from talking about it]
I (apparently naively) thought Python array.frombytes() would read from a series of bytes representing various machine format integers, depending on how you create the array object. On creation you are required to provide a letter type code telling it (or so I thought) the machine type of integer making up the byte stream.
import array
b = b"\x01\x00\x02\x00\x03\x00\x04\x00"
a = array.array('i') #signed int (2 bytes)
a.frombytes(b)
print(a)
array('i', [131073, 262147])
and in the debugger:
array('i', [131073, 262147])
itemsize: 4
typecode: 'i'
The bytes in b are a series of little endian int16s (type code = 'i'). Despite being told this, it interpreted the bytes as 4-byte integers. This is Python 3.7.8.
I really need to convert the varying ints into an array (or list) of Python ints to deal with image data coming in byte-streams but which is actually either 16-bit or 32-bit integer, or 64 bit double floating format. What did I miss or do wrong? Or what is the right way to accomplish this?
Note that the documentation doesn't specify the exact size of each type; it specifies the minimum size. That means the implementation may use a larger size, probably based on the sizes of the types in the C compiler that was used to build Python.
Here are all the sizes on my system:
from array import array

for c in 'bBuhHiIlLqQfd':
    print(c, array(c).itemsize)
b 1
B 1
u 2
h 2
H 2
i 4
I 4
l 4
L 4
q 8
Q 8
f 4
d 8
I would suggest using the 'h' or 'H' type.
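For instance, here is a minimal sketch of reading the byte string from the question as 16-bit integers with 'h' (assuming the stream is little-endian; byteswap() corrects the order on hosts whose native byte order differs):
>>> from array import array
>>> import sys
>>> b = b"\x01\x00\x02\x00\x03\x00\x04\x00"
>>> a = array('h')            # 2 bytes per item, per the table above
>>> a.frombytes(b)
>>> if sys.byteorder != 'little':
...     a.byteswap()          # stream is little-endian; swap on big-endian hosts
...
>>> a.tolist()
[1, 2, 3, 4]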

Why is the first packed data in struct little endian, but the rest is big endian?

import struct
port = 1331
fragments = [1,2,3,4]
flags = bytes([64])
name = "Hello World"
data = struct.pack('HcHH', port, flags, len(fragments), len(name))
print(int.from_bytes(data[3:5], byteorder='big'))
print(int.from_bytes(data[5:7], byteorder='big'))
print(int.from_bytes(data[0:2], byteorder='little'))
When I print them like this, they come out correctly. It seems port is in little endian, while len(fragments) and len(name) are in big endian. If I also do big endian on the port, it gets the wrong value.
So why does struct behave like this? Or am I missing something?
There is some funny alignment taking place because of the 'c' in the middle of the 'H's. You can see it with calcsize:
>>> struct.calcsize('HcHH')
8
>>> struct.calcsize('HHHc')
7
So your data is not aligned as you thought. The correct unpacking is:
print(int.from_bytes(data[4:6], byteorder='little'))
# 4
print(int.from_bytes(data[6:], byteorder='little'))
# 11
It turns out that, by chance, the pad byte inserted after the 'c' is '\x00', which made your byte slices look correct in big-endian:
>>> data
b'3\x05@\x00\x04\x00\x0b\x00'
        ^^^^
        this is the intruder
By default, your call to pack is equivalent to the following:
struct.pack('@HcHH', port, flags, len(fragments), len(name))
The result looks like this (printed with '.'.join(f'{x:02X}' for x in data)):
33.05.40.00.04.00.0B.00
 0  1  2  3  4  5  6  7
The number 4 is encoded in bytes 4 and 5, in little endian, and 11 is encoded in bytes 6 and 7. Byte 3 is a padding byte, inserted by pack to properly align the following shorts on an even boundary.
Per the docs:
Note By default, the result of packing a given C struct includes pad bytes in order to maintain proper alignment for the C types involved; similarly, alignment is taken into account when unpacking. This behavior is chosen so that the bytes of a packed struct correspond exactly to the layout in memory of the corresponding C struct. To handle platform-independent data formats or omit implicit pad bytes, use standard size and alignment instead of native size and alignment: see Byte Order, Size, and Alignment for details.
To remove the alignment byte and justify your assumptions about the positions of the bytes while keeping native byte order, use
struct.pack('=HcHH', port, flags, len(fragments), len(name))
You can also use a fixed byte order by using < or > as the prefix.
The "correct" solution is to use unpack to get your numbers back, so you don't have to worry about endianness, padding or anything else, really.

Base56 conversion etc

It seems base58 and base56 conversion treat input data as a single Big Endian number; an unsigned bigint number.
If I'm encoding some integers into shorter strings using base58 or base56, some implementations seem to take the integer's native (little endian in my case) byte representation and convert that to a string, while other implementations convert the number to its big endian representation first. The loose specifications of these encodings don't seem to clarify which approach is right. Is there an explicit specification of which to do, or a more widely popular option of the two I'm not aware of?
I was trying to compare some methods of making a short URL. The source is actually a 10 digit number that's less than 4 billion. In this case I was thinking to make it an unsigned 4 byte integer, possibly Little Endian, and then encode it with a few options (with alphabets):
base64 A…Za…z0…9+/
base64 url-safe A…Za…z0…9-_
Z85 0…9a…zA…Z.-:+=^!/*?&<>()[]{}@%$#
base58 1…9A…HJ…NP…Za…km…z (excluding 0IOl+/ from base64 & reordered)
base56 2…9A…HJ…NP…Za…kmnp…z (excluding 1o from base58)
So like, base16, base32 and base64 make pretty good sense in that they're taking 4, 5 or 6 bits of input data at a time and looking them up in an alphabet index. The latter uses 4 symbols per 3 bytes. Straightforward, and this works for any data.
The other three have me finding various implementations that disagree with each other as to the right output. The problem appears to be that no number of bytes maps to a fixed number of lookups in these. E.g. taking 2^1 through 2^100 and getting the remainders for 56, 58 and 85 results in no remainders of 0.
Z85 (Ascii85, base85, et al.) approaches this by grabbing 4 bytes at a time and encoding them to 5 symbols, accepting some waste. But there's byte alignment to some degree here (base64 has alignment per 16 symbols, Z85 gets there with 5). The alphabet, though, is … not great for URLs, the command line, or sgml/xml use.
base58 and base56 seem intent on treating the input bytes like a Big Endian ordered bigint and repeating: % base; lookup; -= % base; /= base on the input bigint. Which… I mean, I think that ends up modifying most of the input for every iteration.
For my input that's not a huge performance concern though.
Since we shouldn't treat the input as string data (or we get output longer than the 10 digit decimal input, and what's the point in that), does anyone know of any indication of which kind of processing results in something canonical for base56 or base58?
1. Take the little endian 4-byte word of the 10 digit number (< 4*10^9), treat that sequence of bytes as a big endian number (which represents a different value), and convert that by repeating the steps.
2. Represent the 10 digit number (< 4*10^9) in 4 bytes big endian before converting that by repeating the steps.
I'm leaning towards going the route of the 2nd way.
For example, given the number 3003295320:
The little endian representation is 58 a6 02 b3.
The big endian representation is b3 02 a6 58. Meaning:
base64 gives:
>>> base64.b64encode(int.to_bytes(3003295320,4,'little'))
b'WKYCsw=='
>>> base64.b64encode(int.to_bytes(3003295320,4,'big'))
b'swKmWA=='
>>> base64.b64encode('3003295320'.encode('ascii'))
b'MzAwMzI5NTMyMA==' # Definitely not using this
Z85 gives:
>>> encode(int.to_bytes(3003295320,4,'little'))
b'sF=ea'
>>> encode(int.to_bytes(3003295320,4,'big'))
b'VJv1a'
>>> encode('003003295320'.encode('ascii')) # padding to 4 byte boundary
b'fFCppfF+EAh8v0w' # Definitely not using this
base58 gives:
>>> base58.b58encode(int.to_bytes(3003295320,4,'little'))
b'3GRfwp'
>>> base58.b58encode(int.to_bytes(3003295320,4,'big'))
b'5aPg4o'
>>> base58.b58encode('3003295320')
b'3soMTaEYSLkS4w' # Still not using this
base56 gives:
>>> b56encode(int.to_bytes(3003295320,4,'little'))
b'4HSgyr'
>>> b56encode(int.to_bytes(3003295320,4,'big'))
b'6bQh5q'
>>> b56encode('3003295320')
b'4uqNUbFZTMmT5y' # Longer than 10 digits so...
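For reference, here is a minimal sketch of the repeated % base / /= base loop described above, treating the input as a big endian unsigned bigint. The alphabet is the common Bitcoin-style base58 alphabet, which is an assumption, though it evidently matches the base58 module used above since the outputs agree:
# Divmod-style base58 encoding of a big endian unsigned integer.
B58_ALPHABET = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

def b58encode_int(n):
    # Peel off base-58 digits, least significant first, then reverse.
    if n == 0:
        return B58_ALPHABET[0]
    out = []
    while n > 0:
        n, r = divmod(n, 58)
        out.append(B58_ALPHABET[r])
    return ''.join(reversed(out))

# Encoding the number directly corresponds to the big endian bytes above:
print(b58encode_int(3003295320))                               # 5aPg4o
# Reinterpreting the little endian bytes as a big endian number
# reproduces the other output:
print(b58encode_int(int.from_bytes((3003295320).to_bytes(4, 'little'), 'big')))  # 3GRfwp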

Presumed incoherence in Python's struct.pack method

I am very confused right now.
If I pack a 7-byte binary string, the result is the following:
>>> struct.pack('7s',b'\x1fBLOCK\n')
b'\x1fBLOCK\n'
Moreover, if I pack an unsigned long long, the result is:
>>> struct.pack('1Q',126208)
b'\x00\xed\x01\x00\x00\x00\x00\x00'
But, if I pack both together, the result adds an extra byte:
>>> struct.pack('7s1Q',b'\x1fBLOCK\n',126208)
b'\x1fBLOCK\n\x00\x00\xed\x01\x00\x00\x00\x00\x00'
Does anyone know why this extra byte appears?
b'\x1fBLOCK\n' + b'\x00' + b'\x00\xed\x01\x00\x00\x00\x00\x00'  (the lone \x00 in the middle is the extra byte)
This fact is ruining the binary reading of a custom file...
The layout of bytes produced by struct.pack will (by default) match that produced by your platform's C compiler, which may include pad bytes between fields. You can disable this behaviour by adding an = to the start of your format string:
>>> struct.pack('7s1Q', b'\x1fBLOCK\n', 126208)    # C-style layout with padding bytes
b'\x1fBLOCK\n\x00\x00\xed\x01\x00\x00\x00\x00\x00'
>>> struct.pack('=7s1Q', b'\x1fBLOCK\n', 126208)   # No padding
b'\x1fBLOCK\n\x00\xed\x01\x00\x00\x00\x00\x00'
It seems that I had been using the default @ flag, which means native byte order, size and alignment, so the final size is variable.
The solution lies in using a fixed-size flag, such as <, >, ! or =:
>>> struct.pack('<7s1Q',b'\x1fBLOCK\n',126208)
b'\x1fBLOCK\n\x00\xed\x01\x00\x00\x00\x00\x00'
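As a quick check, calcsize confirms the single pad byte (a sketch; the native result assumes a platform where Q is aligned on 8 bytes, as in the output above):
>>> struct.calcsize('7s1Q')    # native: 7 + 1 pad + 8
16
>>> struct.calcsize('<7s1Q')   # fixed layout: 7 + 8
15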
The additional \x00 is a pad byte, not part of the string. You concatenate a string with an unsigned long long, so
Note By default, the result of packing a given C struct includes pad bytes in order to maintain proper alignment for the C types involved
applies.
https://docs.python.org/3/library/struct.html#format-characters
b'\x1fBLOCK\n'  b'\x00'     b'\x00\xed\x01\x00\x00\x00\x00\x00'
  bytes 1-7     8th (pad)     bytes 1-8 of the Q

Python array.tostring - Explanation for the byte representation

I know that array.tostring() gives the machine values of the array as a byte string. But I am trying to figure out how they are represented.
e.g
>>> a = array('l', [2])
>>> a.tostring()
'\x02\x00\x00\x00'
Here, I know that 'l' means each item will be a minimum of 4 bytes, and that's why we have 4 bytes in the tostring representation. But why is the most significant byte populated with \x02? Shouldn't it be '\x00\x00\x00\x02'?
>>> a = array('l', [50,3])
>>> a.tostring()
'2\x00\x00\x00\x03\x00\x00\x00'
Here I am guessing the '2' at the beginning is because 50 is the ASCII value of '2'; then why don't we have the corresponding char for ASCII value 3, which is Ctrl-C?
But why is the most significant byte populated with \x02? Shouldn't it be '\x00\x00\x00\x02'?
The \x02 in '\x02\x00\x00\x00' is not the most significant byte. I guess you are confused by trying to read it as a hexadecimal number where the most significant digit is on the left. This is not how the string representation of an array returned by array.tostring() works. Bytes of the represented value are put together in a string left-to-right in the order from least significant to most significant. Just consider the array as a list of bytes, and the first (or, rather, 0th) byte is on the left, as is usual in regular python lists.
why don't we have the corresponding char for ASCII value of 3 which is Ctrl-C?
Do you have any example where Python represents the character with ASCII code 3 as Ctrl-C or similar? ASCII code 3 corresponds to an unprintable character with no dedicated escape sequence, so it is represented through its hex code.
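A small sketch of both points (in Python 3 tostring() is named tobytes(); the output shown assumes a little-endian machine where 'i' is 4 bytes):
>>> from array import array
>>> a = array('i', [50, 3])
>>> a.tobytes()    # least significant byte first; 0x32 prints as '2', 0x03 as its hex escape
b'2\x00\x00\x00\x03\x00\x00\x00'
>>> int.from_bytes(b'\x02\x00\x00\x00', 'little')
2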
