Parsing binary files with Python

As a side project I would like to try to parse binary files (Mach-O files specifically). I know tools exist for this already (otool) so consider this a learning exercise.
The problem I'm hitting is that I don't understand how to convert the binary elements I find into a Python representation. For example, the Mach-O file format starts with a header defined by a C struct, whose first item is a uint32_t 'magic number' field. When I do
magic = f.read(4)
I get
b'\xcf\xfa\xed\xfe'
This is starting to make sense to me: it's literally a byte array of 4 bytes. However, I want to treat it like a 4-byte int representing the original magic number. Another example is the numberOfSections field; I just want the number represented by the 4-byte field, not an array of literal bytes.
Perhaps I'm thinking about this all wrong. Has anybody worked on anything similar? Do I need to write functions that take these 4-byte arrays and shift and combine their values to produce the number I want? Is endianness going to screw me here? Any pointers would be most helpful.

Take a look at the struct module:
In [1]: import struct
In [2]: magic = b'\xcf\xfa\xed\xfe'
In [3]: decoded = struct.unpack('<I', magic)[0]
In [4]: hex(decoded)
Out[4]: '0xfeedfacf'
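If you want to pull in a whole header at once, struct can unpack several fields in one call. A minimal sketch, assuming the mach_header_64 layout from mach-o/loader.h (eight 32-bit fields; note that fat/universal binaries start with a different magic):
import struct

with open("/bin/ls", "rb") as f:   # any thin 64-bit Mach-O binary
    data = f.read(32)              # mach_header_64 is 8 x uint32 = 32 bytes

(magic, cputype, cpusubtype, filetype,
 ncmds, sizeofcmds, flags, reserved) = struct.unpack('<8I', data)
print(hex(magic), ncmds)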

There's the Kaitai Struct project that solves exactly this problem. First, you describe a certain file format using a .ksy spec, then you compile it into a Python library (or, actually, a library in any other major programming language), import it, and, voilà, parsing boils down to:
from mach_o import MachO
my_file = MachO.from_file("/path/to/your/file")
my_file.magic # => 0xfeedface
my_file.num_of_sections # => some other integer
my_file.sections # => list of objects that represent sections
They have a growing repository of file format specs. It doesn't have a Mach-O file format spec (yet?), but there are complex formats like Java .class or Microsoft's PE executable described there, so I guess it shouldn't be a major problem to write a spec for the Mach-O format as well.
It is arguably better than Construct or Hachoir because it's compiled (as opposed to interpreted), and thus faster, and it includes tons of other helpful tools, like a visualizer and a format diagram maker that can generate an explanation diagram for a format like the PE executable.

I would suggest the Construct module. It offers a very high level interface.
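For example, a minimal sketch of a two-field header in Construct's declarative style (current 2.x API; this tiny spec is my own, not a full Mach-O definition):
from construct import Struct, Int32ul

MachHeader = Struct(
    "magic" / Int32ul,     # parsed as a little-endian 32-bit unsigned int
    "cputype" / Int32ul,
)

parsed = MachHeader.parse(b'\xcf\xfa\xed\xfe\x07\x00\x00\x01')
print(hex(parsed.magic))   # 0xfeedfacf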

Related

Fixed length flat file without separators

I have a comprehension question not related to any particular language, but since I am writing in Python, I tagged it python. I am asked to provide some data in a "fixed-length, flat file without separators". It confuses me, since I understand it like this:
Input: Column A: date (len6)
Input: Column B: name (len20)
Output: "20170409MYVERYSHORTNAME[space][space][space][space][space]"
"MYVERYSHORTNAME" is only 15 char long, but since it's fixed 20-length output, I am supposed to fill 5 times it with something ? It's not specified.
Why does anyone even need a file without separators? They will have to break it down into separate fields anyway, so what's the point?
This kind of flat (binary) file is meant to be faster and easier for machines to read, and more memory-efficient, than the equivalent in a more human-friendly representation (e.g. JSON, CSV, etc.). For example, the machine can preallocate the appropriate amount of memory before reading the contents.
Nowadays, with virtually unlimited quantities of RAM and the dynamic nature of languages, hardly anybody uses flat files anymore (unless specifically needed).
In Python, to deal properly with this kind of binary file, you can for example use the struct module from the standard library:
https://docs.python.org/3.6/library/struct.html#module-struct
Example:
>>> import struct
>>> from datetime import datetime
>>> mydate = datetime.now()
>>> myshortname = "HelloWorld!"
>>> struct.pack("8s20s", mydate.strftime('%Y%m%d').encode(), myshortname.encode())
b'20170919HelloWorld!\x00\x00\x00\x00\x00\x00\x00\x00\x00'
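Note that struct pads with NUL bytes, whereas the spec in the question seems to expect space padding; if so, plain string methods may be closer to what's asked. A minimal sketch:
>>> "20170409" + "MYVERYSHORTNAME".ljust(20)
'20170409MYVERYSHORTNAME     '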
Typically, when you see fixed-length files, you're dealing with legacy systems. The AS/400, for instance, usually spits out fixed-length files with artificial separators (why, I don't know, but that's what I've seen).
Usually, strings are right-padded with spaces and numbers are left-padded with zeros, but this is not absolute.

Murmurhash 2 results on Python and Haskell

Haskell and Python don't seem to agree on MurmurHash2 results. Python, Java, and PHP return the same results, but Haskell doesn't. Am I doing something wrong with MurmurHash2 in Haskell?
Here is my code for Haskell MurmurHash2:
import Data.Digest.Murmur32
main = do
  print $ asWord32 $ hash32WithSeed 1 "woohoo"
And here is the code written in Python:
import murmur
if __name__ == "__main__":
    print murmur.string_hash("woohoo", 1)
Python returned 3650852671, while Haskell returned 3966683799.
From a quick inspection of the sources, it looks like the algorithm operates on 32 bits at a time. The Python version gets these by simply grabbing 4 bytes at a time from the input string, while the Haskell version converts each character to a single 32-bit Unicode code point.
It's therefore not surprising that they yield different results.
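You can see the divergence in the very first 32-bit word each implementation feeds into the mixing function. A quick illustration:
>>> import struct
>>> struct.unpack('<I', b'wooh')[0]  # first word in the Python version: 4 packed bytes
1752133495
>>> ord('w')                         # first word in the Haskell version: 1 code point
119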
The murmur-hash package (I am its author) does not promise to compute the same hashes as other languages. If you rely on hashes being compatible with other software, I suggest you create newtype wrappers that compute hashes the way you want them. For text in particular, you need to at least specify the encoding. In your case you could convert the text to an ASCII string using Data.ByteString.Char8.pack, but that still won't give you the same hash, since the ByteString instance is more of a placeholder.
BTW, I'm not actively improving that package because MurmurHash2 has been superseded by MurmurHash3, but I keep accepting patches.

Writing and reading headers with struct

I have a file header which I am reading and planning on writing. It contains information about the contents: version information and other string values.
Writing to the file is not too difficult, it seems pretty straightforward:
outfile.write(struct.pack('<s', "myapp-0.0.1"))
However, when I try reading back the header from the file in another method:
header_version = struct.unpack('<s', infile.read(struct.calcsize('s')))
I have the following error thrown:
struct.error: unpack requires a string argument of length 2
How do I fix this error and what exactly is failing?
Writing to the file is not too difficult, it seems pretty straightforward:
Not quite as straightforward as you think. Try looking at what's in the file, or just printing out what you're writing:
>>> struct.pack('<s', 'myapp-0.0.1')
'm'
As the docs explain:
For the 's' format character, the count is interpreted as the size of the string, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string, while '10c' means 10 characters. If a count is not given, it defaults to 1.
So, how do you deal with this?
Don't use struct if it's not what you want. The main reason to use struct is to interact with C code that dumps C struct objects directly to/from a buffer/file/socket/whatever, or a binary format spec written in a similar style (e.g. IP headers). It's not meant for general serialization of Python data. As Jon Clements points out in a comment, if all you want to store is a string, just write the string as-is. If you want to store something more complex, consider the json module; if you want something even more flexible and powerful, use pickle.
Use fixed-length strings. If part of your file format spec is that the name must always be 255 characters or less, just write '<255s'. Shorter strings will be padded, longer strings will be truncated (you might want to throw in a check for that to raise an exception instead of silently truncating).
Use some in-band or out-of-band means of passing along the length. The most common is a length prefix. (You may be able to use the 'p' (Pascal string) format to help, but it really depends on the C layout/binary format you're trying to match; often you have to do something ugly like struct.pack('<h{}s'.format(len(name)), len(name), name), as sketched below.)
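Here's a minimal sketch of that length-prefix approach (the helper names are mine; the 2-byte '<H' prefix caps strings at 65535 bytes):
import struct

def write_prefixed(f, s):
    data = s.encode('utf-8')
    f.write(struct.pack('<H', len(data)))   # 2-byte little-endian length prefix
    f.write(data)

def read_prefixed(f):
    (length,) = struct.unpack('<H', f.read(2))
    return f.read(length).decode('utf-8')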
As for why your code is failing, there are multiple reasons. First, read(11) isn't guaranteed to read 11 characters. If there's only 1 character in the file, that's all you'll get. Second, you're not actually calling read(11), you're calling read(1), because struct.calcsize('s') returns 1 (for reasons which should be obvious from the above). Third, either your code isn't exactly what you've shown above, or infile's file pointer isn't at the right place, because that code as written will successfully read in the string 'm' and unpack it as 'm'. (I'm assuming Python 2.x here; 3.x will have more problems, but you wouldn't have even gotten that far.)
For your specific use case ("file header… which contains information about the contents; version information, and other string values"), I'd just write the strings with newline terminators. (If the strings can have embedded newlines, you could backslash-escape them into \n, use C-style or RFC 822-style continuations, quote them, etc.)
This has a number of advantages. For one thing, it makes the format trivially human-readable (and human-editable/-debuggable). And, while that sometimes comes with a space tradeoff, a single-character terminator is at least as efficient as a length-prefix format, possibly more so. And, last but certainly not least, it means the code is dead-simple for both generating and parsing headers.
In a later comment you clarify that you also want to write ints, but that doesn't change anything. An 'i' value always takes 4 bytes, but most apps write a lot of small numbers, which only take 1-2 bytes (+1 for a terminator/separator) if you write them as strings. And if you're not writing small numbers, a Python int can easily be too large to fit in a C int, in which case struct will raise an error or, in older versions, silently write just the low 32 bits.
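A minimal sketch of that newline-terminated style (the file name and field choices are just for illustration):
# writing the header
with open('data.bin', 'wb') as f:
    f.write(b'myapp-0.0.1\n')   # version string
    f.write(b'42\n')            # an int, written as text
    f.write(b'\x00' * 16)       # ...followed by the binary payload

# reading it back
with open('data.bin', 'rb') as f:
    version = f.readline().rstrip(b'\n')
    count = int(f.readline())   # int() ignores the trailing newline
    payload = f.read()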

Endianness of integers in Python

I'm working on a program where I store some data in an integer and process it bitwise. For example, I might receive the number 48, which I will process bit-by-bit. In general the endianness of integers depends on the machine representation of integers, but does Python do anything to guarantee that the ints will always be little-endian? Or do I need to check endianness like I would in C and then write separate code for the two cases?
I ask because my code runs on a Sun machine and, although the one it's running on now uses Intel processors, I might have to switch to a machine with Sun processors in the future, which I know is big-endian.
Python's int has the same endianness as the processor it runs on. The struct module lets you convert byte blobs to ints (and vice versa, and some other data types too) in either native, little-endian, or big-endian ways, depending on the format string you choose: start the format with '@' or no endianness character to use native endianness (and native sizes -- everything else uses standard sizes), '=' for native endianness with standard sizes, '<' for little-endian, '>' or '!' for big-endian.
This is byte-by-byte, not bit-by-bit; I'm not sure exactly what you mean by bit-by-bit processing in this context, but I assume it can be accommodated similarly.
For fast "bulk" processing in simple cases, consider also the array module -- the fromstring and tostring methods can operate on large number of bytes speedily, and the byteswap method can get you the "other" endianness (native to non-native or vice versa), again rapidly and for a large number of items (the whole array).
If you need to process your data 'bitwise' then the bitstring module might be of help to you. It can also deal with endianness between platforms.
The struct module is the best standard method of dealing with endianness between platforms. For example, this packs and unpacks the integers 1, 2, 3 into two 'shorts' and one 'long' (2 and 4 bytes, using the standard sizes) in big-endian byte order:
>>> from struct import *
>>> pack('>hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('>hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
To check the endianness of the platform programmatically you can use
>>> import sys
>>> sys.byteorder
which will either return "big" or "little".
The following snippet will tell you if your system default is little endian (otherwise it is big-endian)
import struct
little_endian = (struct.unpack('<I', struct.pack('=I', 1))[0] == 1)
Note, however, this will not affect the behavior of bitwise operators: 1<<1 is equal to 2 regardless of the default endianness of your system.
Check when?
When doing bitwise operations, the result will have the same endianness as the ints you put in. You don't need to check for that. You only need to care about endianness when converting to/from sequences of bytes, in both languages, AFAIK.
In Python you use the struct module for this, most commonly struct.pack() and struct.unpack().

Reading 32bit Packed Binary Data On 64bit System

I'm attempting to write a Python C extension that reads packed binary data (it is stored as structs of structs) and then parses it out into Python objects. Everything works as expected on a 32-bit machine (the binary files are always written on a 32-bit architecture), but not on a 64-bit box. Is there a "preferred" way of doing this?
It would be a lot of code to post but as an example:
struct
{
    WORD version;
    BOOL upgrade;
    time_t time1;
    time_t time2;
} apparms;

FILE *fp;
fp = fopen(filePath, "r+b");
fread(&apparms, sizeof(apparms), 1, fp);

return Py_BuildValue("{s:i,s:l,s:l}",
    "sysVersion", apparms.version,
    "powerFailTime", apparms.time1,
    "normKitExpDate", apparms.time2);
Now on a 32-bit system this works great, but on a 64-bit one my time_t sizes are different (32-bit vs 64-bit longs).
Damn, you people are fast.
Patrick, I originally started using the struct package but found it just way too slow for my needs. Plus, I was looking for an excuse to write a Python extension.
I know this is a stupid question but what types do I need to watch out for?
Thanks.
Explicitly specify that your data types (e.g. integers) are 32-bit. Otherwise, if you have two integers next to each other, they may be read back as one 64-bit integer.
When you are dealing with cross-platform issues, the two main things to watch out for are:
Bitness. If your packed data is written with 32-bit ints, then all of your code must explicitly specify 32-bit ints when reading and writing.
Byte order. If you move your code from Intel chips to PPC or SPARC, your byte order will be wrong. You will have to import your data and then byte-flip it so that it matches up with the current architecture. Otherwise 12 (0x0000000C) will be read as 201326592 (0x0C000000).
Hopefully this helps.
The 'struct' module should be able to do this, although alignment of structs in the middle of the data is always an issue. It's not very hard to get right, however: find out (once) what boundary the structs-in-structs align to, then pad (manually, with the 'x' specifier) to that boundary. You can double-check your padding by comparing struct.calcsize() with your actual data. It's certainly easier than writing a C extension for it.
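For the struct in the question, a minimal sketch, assuming WORD is 16 bits, BOOL and time_t are both 32 bits on the writing platform, and the compiler inserted 2 padding bytes after version:
import struct

fmt = '<H2xiii'              # version, 2 pad bytes, upgrade, time1, time2
print(struct.calcsize(fmt))  # 16 -- compare this against the on-disk record size

blob = struct.pack(fmt, 1, 1, 1500000000, 1600000000)
version, upgrade, time1, time2 = struct.unpack(fmt, blob)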
In order to keep using Py_BuildValue() like that, you have two options. You can determine the size of time_t at compile time (in terms of fundamental types, so 'an int' or 'a long' or 'an ssize_t') and then use the right format character for Py_BuildValue -- 'i' for an int, 'l' for a long, 'n' for an ssize_t. Or you can use PyInt_FromSsize_t() manually, in which case the compiler does the upcasting for you, and then use the 'O' format character to pass the result to Py_BuildValue.
You need to make sure you're using architecture independent members for your struct. For instance an int may be 32 bits on one architecture and 64 bits on another. As others have suggested, use the int32_t style types instead. If your struct contains unaligned members, you may need to deal with padding added by the compiler too.
Another common problem with cross architecture data is endianness. Intel i386 architecture is little-endian, but if you're reading on a completely different machine (e.g. an Alpha or Sparc), you'll have to worry about this too.
The Python struct module deals with both these situations, using the prefix passed as part of the format string.
@ - Use native sizes, endianness and alignment. i = sizeof(int), l = sizeof(long)
= - Use native endianness, but standard sizes and no alignment (i = 32 bits, l = 32 bits)
< - Little-endian, standard sizes, no alignment
> or ! - Big-endian, standard sizes, no alignment
In general, if the data ever leaves your machine, you should nail down the endianness and the size/padding format to something specific, i.e. use "<" or ">" as your format prefix. If you want to handle this in your C extension instead, you may need to add some code to deal with it.
What's your code for reading the binary data? Make sure you're copying the data into properly-sized types like int32_t instead of just int.
Why aren't you using the struct package?
