Presumed incoherence in Python's struct.pack method - python

I am very confused right now.
If I pack a 7-length binary string, the result is the following:
>>> struct.pack('7s',b'\x1fBLOCK\n')
b'\x1fBLOCK\n'
Moreover, if I pack an unsigned long long, the result is:
>>> struct.pack('1Q',126208)
b'\x00\xed\x01\x00\x00\x00\x00\x00'
But, if I pack both together, the reuslt adds an extra byte:
>>> struct.pack('7s1Q',b'\x1fBLOCK\n',126208)
b'\x1fBLOCK\n\x00\x00\xed\x01\x00\x00\x00\x00\x00'
Anyone knows why this extra byte appears?
b'\x1fBLOCK\n \x00\x00\xed\x01\x00\x00\x00\x00\x00'
This fact is ruining the binary reading of a custom file...

The layout of bytes produced by struct.pack will (by default) match that produced by your platforms C compiler, which may include pad bytes between fields. You can disable this behaviour by adding a = to the start of your format string:
> struct.pack('7s1Q',b'\x1fBLOCK\n',126208) # C-style layout with padding bytes
'\x1fBLOCK\n\x00\x00\xed\x01\x00\x00\x00\x00\x00'
> struct.pack('=7s1Q',b'\x1fBLOCK\n',126208) # No padding
'\x1fBLOCK\n\x00\xed\x01\x00\x00\x00\x00\x00'

It seems that I have used the # flag, which means that the byte order is the native one, and the final size is thus variable.
The solution lies on using a fixed size flag, such as <, >, ! or =:
>>> struct.pack('<7s1Q',b'\x1fBLOCK\n',126208)
b'\x1fBLOCK\n\x00\xed\x01\x00\x00\x00\x00\x00'

The additional \x00 is the string termination byte - in C a string is ended by \x00.
You concattenate a string to a unsigned long long, so
Note By default, the result of packing a given C struct includes pad bytes in order to maintain proper alignment for the C types involved
applies.
https://docs.python.org/3/library/struct.html#format-characters
b'\x1fBLOCK\n\x00 \x00\xed\x01\x00\x00\x00\x00\x00'
1 23456 7 8th 1 2 3 4 5 6 7 8

Related

Why is the first packed data in struct little endian, but the rest is big endian?

import struct
port = 1331
fragments = [1,2,3,4]
flags = bytes([64])
name = "Hello World"
data = struct.pack('HcHH', port, flags, len(fragments), len(name))
print(int.from_bytes(data[3:5], byteorder='big'))
print(int.from_bytes(data[5:7], byteorder='big'))
print(int.from_bytes(data[0:2], byteorder='little'))
When I print them like this, they come out correctly. It seems port is in little endian, while len(fragments) and len(name) are in big endian. If I also do big endian on the port, it gets the wrong value.
So why does struct behave like this? Or am I missing something?
There is some funny alignment taking place because of the 'c' in the middle of 'H'. You can see it with calcsize:
>>> struct.calcsize('HcHH')
8
>>> struct.calcsize('HHHc')
7
So your data is not aligned as you thought. The correct unpacking is:
print(int.from_bytes(data[4:6], byteorder='little'))
# 4
print(int.from_bytes(data[6:], byteorder='little'))
# 11
It turns out that by chance, the added byte of the 'c' is '\x00', and made your byte-chain correct in big-endian:
>>> data
b'3\x05#\x00\x04\x00\x0b\x00'
^^^^
this is the intruder
By default, your call to pack is equivalent to the following:
struct.pack('#HcHH', port, flags, len(fragments), len(name))
The result looks like this (printed with '.'.join(f'{x:02X} for x in data')):
33.05.40.00.04.00.0B.00
0 1 2 3 4 5 6 7
The number 4 is encoded in bytes 4 and 5, in little endian, and 11 is encoded in bytes 6 and 7. Byte 3 is a padding byte, inserted by pack to properly align the following shorts on an even boundary.
Per the docs:
Note By default, the result of packing a given C struct includes pad bytes in order to maintain proper alignment for the C types involved; similarly, alignment is taken into account when unpacking. This behavior is chosen so that the bytes of a packed struct correspond exactly to the layout in memory of the corresponding C struct. To handle platform-independent data formats or omit implicit pad bytes, use standard size and alignment instead of native size and alignment: see Byte Order, Size, and Alignment for details.
To remove the alignment byte and justify your assumptions about the positions of the bytes while keeping native byte order, use
struct.pack('=HcHH', port, flags, len(fragments), len(name))
You can also use a fixed byte order by using < or > as the prefix.
The "correct" solution is to use unpack to get your numbers back, so you don't have to worry about endianness, padding or anything else, really.

STL binary file reader with Python

I'm trying to write my "personal" python version of STL binary file reader, according to WIKIPEDIA : A binary STL file contains :
an 80-character (byte) headern which is generally ignored.
a 4-byte unsigned integer indicating the number of triangular facets in the file.
Each triangle is described by twelve 32-bit floating-point numbers: three for the normal and then three for the X/Y/Z coordinate of each vertex – just as with the ASCII version of STL. After these follows a 2-byte ("short") unsigned integer that is the "attribute byte count" – in the standard format, this should be zero because most software does not understand anything else. --Floating-point numbers are represented as IEEE floating-point numbers and are assumed to be little-endian--
Here is my code :
#! /usr/bin/env python3
with open("stlbinaryfile.stl","rb") as fichier :
head=fichier.read(80)
nbtriangles=fichier.read(4)
print(nbtriangles)
The output is :
b'\x90\x08\x00\x00'
It represents an unsigned integer, I need to convert it without using any package (struct,stl...). Are there any (basic) rules to do it ?, I don't know what does \x mean ? How does \x90 represent one byte ?
most of the answers in google mention "C structs", but I don't know nothing about C.
Thank you for your time.
Since you're using Python 3, you can use int.from_bytes. I'm guessing the value is stored little-endian, so you'd just do:
nbtriangles = int.from_bytes(fichier.read(4), 'little')
Change the second argument to 'big' if it's supposed to be big-endian.
Mind you, the normal way to parse a fixed width type is the struct module, but apparently you've ruled that out.
For the confusion over the repr, bytes objects will display ASCII printable characters (e.g. a) or standard ASCII escapes (e.g. \t) if the byte value corresponds to one of them. If it doesn't, it uses \x##, where ## is the hexadecimal representation of the byte value, so \x90 represents the byte with value 0x90, or 144. You need to combine the byte values at offsets to reconstruct the int, but int.from_bytes does this for you faster than any hand-rolled solution could.
Update: Since apparent int.from_bytes isn't "basic" enough, a couple more complex, but only using top-level built-ins (not alternate constructors) solutions. For little-endian, you can do this:
def int_from_bytes(inbytes):
res = 0
for i, b in enumerate(inbytes):
res |= b << (i * 8) # Adjust each byte individually by 8 times position
return res
You can use the same solution for big-endian by adding reversed to the loop, making it enumerate(reversed(inbytes)), or you can use this alternative solution that handles the offset adjustment a different way:
def int_from_bytes(inbytes):
res = 0
for b in inbytes:
res <<= 8 # Adjust bytes seen so far to make room for new byte
res |= b # Mask in new byte
return res
Again, this big-endian solution can trivially work for little-endian by looping over reversed(inbytes) instead of inbytes. In both cases inbytes[::-1] is an alternative to reversed(inbytes) (the former makes a new bytes in reversed order and iterates that, the latter iterates the existing bytes object in reverse, but unless it's a huge bytes object, enough to strain RAM if you copy it, the difference is pretty minimal).
The typical way to interpret an integer is to use struct.unpack, like so:
import struct
with open("stlbinaryfile.stl","rb") as fichier :
head=fichier.read(80)
nbtriangles=fichier.read(4)
print(nbtriangles)
nbtriangles=struct.unpack("<I", nbtriangles)
print(nbtriangles)
If you are allergic to import struct, then you can also compute it by hand:
def unsigned_int(s):
result = 0
for ch in s[::-1]:
result *= 256
result += ch
return result
...
nbtriangles = unsigned_int(nbtriangles)
As to what you are seeing when you print b'\x90\x08\x00\x00'. You are printing a bytes object, which is an array of integers in the range [0-255]. The first integer has the value 144 (decimal) or 90 (hexadecimal). When printing a bytes object, that value is represented by the string \x90. The 2nd has the value eight, represented by \x08. The 3rd and final integers are both zero. They are presented by \x00.
If you would like to see a more familiar representation of the integers, try:
print(list(nbtriangles))
[144, 8, 0, 0]
To compute the 32-bit integers represented by these four 8-bit integers, you can use this formula:
total = byte0 + (byte1*256) + (byte2*256*256) + (byte3*256*256*256)
Or, in hex:
total = byte0 + (byte1*0x100) + (byte2*0x10000) + (byte3*0x1000000)
Which results in:
0x00000890
Perhaps you can see the similarities to decimal, where the string "1234" represents the number:
4 + 3*10 + 2*100 + 1*1000

Python 2,3 Convert Integer to "bytes" Cleanly

The shortest ways I have found are:
n = 5
# Python 2.
s = str(n)
i = int(s)
# Python 3.
s = bytes(str(n), "ascii")
i = int(s)
I am particularly concerned with two factors: readability and portability. The second method, for Python 3, is ugly. However, I think it may be backwards compatible.
Is there a shorter, cleaner way that I have missed? I currently make a lambda expression to fix it with a new function, but maybe that's unnecessary.
Answer 1:
To convert a string to a sequence of bytes in either Python 2 or Python 3, you use the string's encode method. If you don't supply an encoding parameter 'ascii' is used, which will always be good enough for numeric digits.
s = str(n).encode()
Python 2: http://ideone.com/Y05zVY
Python 3: http://ideone.com/XqFyOj
In Python 2 str(n) already produces bytes; the encode will do a double conversion as this string is implicitly converted to Unicode and back again to bytes. It's unnecessary work, but it's harmless and is completely compatible with Python 3.
Answer 2:
Above is the answer to the question that was actually asked, which was to produce a string of ASCII bytes in human-readable form. But since people keep coming here trying to get the answer to a different question, I'll answer that question too. If you want to convert 10 to b'10' use the answer above, but if you want to convert 10 to b'\x0a\x00\x00\x00' then keep reading.
The struct module was specifically provided for converting between various types and their binary representation as a sequence of bytes. The conversion from a type to bytes is done with struct.pack. There's a format parameter fmt that determines which conversion it should perform. For a 4-byte integer, that would be i for signed numbers or I for unsigned numbers. For more possibilities see the format character table, and see the byte order, size, and alignment table for options when the output is more than a single byte.
import struct
s = struct.pack('<i', 5) # b'\x05\x00\x00\x00'
You can use the struct's pack:
In [11]: struct.pack(">I", 1)
Out[11]: '\x00\x00\x00\x01'
The ">" is the byte-order (big-endian) and the "I" is the format character. So you can be specific if you want to do something else:
In [12]: struct.pack("<H", 1)
Out[12]: '\x01\x00'
In [13]: struct.pack("B", 1)
Out[13]: '\x01'
This works the same on both python 2 and python 3.
Note: the inverse operation (bytes to int) can be done with unpack.
I have found the only reliable, portable method to be
bytes(bytearray([n]))
Just bytes([n]) does not work in python 2. Taking the scenic route through bytearray seems like the only reasonable solution.
Converting an int to a byte in Python 3:
n = 5
bytes( [n] )
>>> b'\x05'
;) guess that'll be better than messing around with strings
source: http://docs.python.org/3/library/stdtypes.html#binaryseq
In Python 3.x, you can convert an integer value (including large ones, which the other answers don't allow for) into a series of bytes like this:
import math
x = 0x1234
number_of_bytes = int(math.ceil(x.bit_length() / 8))
x_bytes = x.to_bytes(number_of_bytes, byteorder='big')
x_int = int.from_bytes(x_bytes, byteorder='big')
x == x_int
from int to byte:
bytes_string = int_v.to_bytes( lenth, endian )
where the lenth is 1/2/3/4...., and endian could be 'big' or 'little'
form bytes to int:
data_list = list( bytes );
When converting from old code from python 2 you often have "%s" % number this can be converted to b"%d" % number (b"%s" % number does not work) for python 3.
The format b"%d" % number is in addition another clean way to convert int to a binary string.
b"%d" % number

Convert decimal int to little endian string ('\x##\x##...')

I want to convert an integer value to a string of hex values, in little endian. For example, 5707435436569584000 would become '\x4a\xe2\x34\x4f\x4a\xe2\x34\x4f'.
All my googlefu is finding for me is hex(..) which gives me '0x4f34e24a4f34e180' which is not what I want.
I could probably manually split up that string and build the one I want but I'm hoping somone can point me to a better option.
You need to use the struct module:
>>> import struct
>>> struct.pack('<Q', 5707435436569584000)
'\x80\xe14OJ\xe24O'
>>> struct.pack('<Q', 5707435436569584202)
'J\xe24OJ\xe24O'
Here < indicates little-endian, and Q that we want to pack a unsigned long long (8 bytes).
Note that Python will use ASCII characters for any byte that falls within the printable ASCII range to represent the resulting bytestring, hence the 14OJ, 24O and J parts of the above result:
>>> struct.pack('<Q', 5707435436569584202).encode('hex')
'4ae2344f4ae2344f'
>>> '\x4a\xe2\x34\x4f\x4a\xe2\x34\x4f'
'J\xe24OJ\xe24O'
I know it is an old thread, but it is still useful. Here my two cents using python3:
hex_string = hex(5707435436569584202) # '0x4f34e24a4f34e180' as you said
bytearray.fromhex(hex_string[2:]).reverse()
So, the key is convert it to a bytearray and reverse it.
In one line:
bytearray.fromhex(hex(5707435436569584202)[2:])[::-1] # bytearray(b'J\xe24OJ\xe24O')
PS: You can treat "bytearray" data like "bytes" and even mix them with b'raw bytes'
Update:
As Will points in coments, you can also manage negative integers:
To make this work with negative integers you need to mask your input with your preferred int type output length. For example, -16 as a little endian uint32_t would be bytearray.fromhex(hex(-16 & (2**32-1))[2:])[::-1], which evaluates to bytearray(b'\xf0\xff\xff\xff')

Python Struct, size changed by alignment.

Here's the hex code I am trying to unpack.
b'ABCDFGHa\x00a\x00a\x00a\x00a\x00\x00\x00\x00\x00\x00\x01' (it's not supposed to make any sense)
labels = unpack('BBBBBBBHHHHH5sB', msg)
struct.error: unpack requires a bytes argument of length 24
From what I counted, both of those are length = 23, both the format in my unpack function and the length of the hex values. I don't understand.
Thanks in advance
Most processors access data faster when the data is on natural boundaries, meaning data of size 2 should be on even addresses, data of size 4 should be accessed on addresses divisible by four, etc.
struct by default maintains this alignment. Since your structure starts out with 7 'B', a padding byte is added to align the next 'H' on an even address. To prevent this in Python, precede your string with '='.
Example:
>>> import struct
>>> struct.calcsize('BBB')
3
>>> struct.calcsize('BBBH')
6
>>> struct.calcsize('=BBBH')
5
I think H is enforcing 2-byte alignment after your 7 B
Aha, the alignment info is at the top of http://docs.python.org/library/struct.html, not down by the definition of the format characters.

Categories

Resources