What is the best way to do Bit Field manipulation in Python?

I'm reading some MPEG Transport Stream protocol over UDP and it has some funky bitfields in it (length 13, for example). I'm using the "struct" library to do the broad unpacking, but is there a simple way to say "Grab the next 13 bits" rather than having to hand-tweak the bit manipulation? I'd like something like the way C does bit fields (without having to resort to C).
Suggestions?

The bitstring module is designed to address just this problem. It will let you read, modify and construct data using bits as the basic building blocks. The latest versions are for Python 2.6 or later (including Python 3) but version 1.0 supported Python 2.4 and 2.5 as well.
A relevant example for you might be this, which strips out all the null packets from a transport stream (and quite possibly uses your 13 bit field?):
from bitstring import Bits, BitStream
# Opening from a file means that it won't be all read into memory
s = Bits(filename='test.ts')
outfile = open('test_nonull.ts', 'wb')
# Cut the stream into 188 byte packets
for packet in s.cut(188*8):
    # Take a 13 bit slice and interpret as an unsigned integer
    PID = packet[11:24].uint
    # Write out the packet if the PID doesn't indicate a 'null' packet
    if PID != 8191:
        # The 'bytes' property converts back to a byte string
        outfile.write(packet.bytes)
Here's another example including reading from bitstreams:
# You can create from hex, binary, integers, strings, floats, files...
# This has a hex code followed by two 12 bit integers
s = BitStream('0x000001b3, uint:12=352, uint:12=288')
# Append some other bits
s += '0b11001, 0xff, int:5=-3'
# Read back as 32 bits of hex, then two 12 bit unsigned integers
start_code, width, height = s.readlist('hex:32, 2*uint:12')
# Skip some bits then peek at the next bit value
s.pos += 4
if s.peek(1):
    flags = s.read(9)
You can use standard slice notation to slice, delete, reverse, overwrite, etc. at the bit level, and there are bit level find, replace, split etc. functions. Different endiannesses are also supported.
# Replace every '1' bit by 3 bits
s.replace('0b1', '0b001')
# Find all occurrences of a bit sequence
bitposlist = list(s.findall('0b01000'))
# Reverse bits in place
s.reverse()
The full documentation is here.

It's an often-asked question. There's an ASPN Cookbook entry on it that has served me in the past.
And there is an extensive page of requirements one person would like to see from a module doing this.
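If you only need a few fields, a hand-rolled reader along those lines takes just a few lines of the standard library. Here is a minimal stdlib-only sketch (Python 3; the class and variable names are illustrative, not from the Cookbook entry):
class BitReader:
    """Read arbitrary-width unsigned bit fields from a bytes object, MSB first."""
    def __init__(self, data):
        self.data = data
        self.bitpos = 0

    def read(self, n):
        """Grab the next n bits as an unsigned integer."""
        value = 0
        for _ in range(n):
            byte = self.data[self.bitpos // 8]
            value = (value << 1) | ((byte >> (7 - self.bitpos % 8)) & 1)
            self.bitpos += 1
        return value

# The start of a transport stream null packet: sync byte, 3 flag bits, 13 bit PID
r = BitReader(b'\x47\x1f\xff\x10')
sync = r.read(8)     # 0x47
flags = r.read(3)    # transport error / payload start / priority
pid = r.read(13)     # 8191 (0x1fff), the null-packet PID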

Related

Reading bit by bit for Huffman Compression

I'm writing a Python program that implements Huffman compression. However, it seems that I can only read/write a binary file byte by byte instead of bit by bit. Is there any workaround for this problem? Wouldn't processing byte by byte defeat the purpose of compression, since extraneous padding would be needed? Also, it'd be great if someone could enlighten me about the application of Huffman compression with regard to this byte-by-byte problem.
A potential way to only have to read bytes is by buffering directly in the decoding routine. This combines well with table-based decoding, and does not have the overhead of ever doing bit-by-bit IO (hiding that behind layers of abstraction doesn't make it go away, it just sweeps it under the carpet).
In the simplest case, table-based decoding needs a "window" of the bit stream that is as large as the largest possible code[1] (incidentally this sort of thing is a large part of the reason why many formats that use Huffman compression specify a maximum code length that isn't super long[2]), which can be created by shifting the buffer to the right until only the top maxCodeLen bits remain:
window = buffer >> (bitsInBuffer - maxCodeLen)
Since this gets rid of excess bits anyway, it is safe to append more bits than strictly necessary to the buffer when there are not enough:
while bitsInBuffer < maxCodeLen:
    buffer = (buffer << 8) | readByte()
    bitsInBuffer += 8
Thus byte IO is sufficient. Actually you could read slightly bigger blocks (e.g. two bytes at a time) if you wanted. By the way there is a slight complication here: if all bytes of a file have been read and the buffer does not have enough bits in it (which is a legitimate condition that can happen for valid bitstreams), you just have to fill with "padding" (basically shift left without ORing in new bits).
Decoding itself could look like this:
# this line does the actual decoding
(symbol, length) = table[window]
# remove that code from the buffer
bitsInBuffer -= length
buffer = buffer & ((1 << bitsInBuffer) - 1)
# use decoded symbol
This is all very easy, the hard part is constructing the table. One way to do it (not a great way, but a simple way) is to take every integer from 0 up to and including (1 << maxCodeLen) - 1 and decoding the first symbol in it using bit-by-bit tree-walking the way you're used to. A faster way is taking every symbol/code pair and using it to fill the right entries of the table:
# for each symbol/code pair do this:
bottomSize = maxCodeLen - codeLen
topBits = code << bottomSize
for bottom in range(1 << bottomSize):
    table[topBits | bottom] = (symbol, codeLen)
By the way none of this code has been tested, it's just to show roughly how it might be done. It also assumes a particular way of packing the bitstream into bytes, with the first bit in the top of the byte.
[1]: some multi-stage decoding strategies are able to use a smaller window, which may be required if there is no bound on the code length.
[2]: e.g. 15 bits max for Deflate
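Putting the pieces together, here is a tiny end-to-end sketch of the scheme (the three-symbol codebook and the packed byte are invented for illustration):
maxCodeLen = 2
codes = [('a', 0b0, 1), ('b', 0b10, 2), ('c', 0b11, 2)]   # symbol, code, length

# build the table as described above
table = [None] * (1 << maxCodeLen)
for symbol, code, codeLen in codes:
    bottomSize = maxCodeLen - codeLen
    topBits = code << bottomSize
    for bottom in range(1 << bottomSize):
        table[topBits | bottom] = (symbol, codeLen)

# decode 'abca' = bits 0 10 11 0, packed MSB-first into the byte 0x58
data, pos = [0x58], 0
buffer, bitsInBuffer, out = 0, 0, []
for _ in range(4):                      # the symbol count is known here
    while bitsInBuffer < maxCodeLen:    # refill with byte IO (zero-pad at EOF)
        buffer = (buffer << 8) | (data[pos] if pos < len(data) else 0)
        pos += 1
        bitsInBuffer += 8
    window = buffer >> (bitsInBuffer - maxCodeLen)
    symbol, length = table[window]
    out.append(symbol)
    bitsInBuffer -= length
    buffer &= (1 << bitsInBuffer) - 1
print(''.join(out))                     # abca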
Layer your code. Have a bottom I/O layer that does all file reads and writes, either the entire file at once or with buffering. Have a layer above that which processes the Huffman code bitstream bit by bit, as sketched below.
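A minimal sketch of that layering for the write side (Python 3; names are illustrative): the file layer only ever sees whole bytes, while this layer buffers the partial byte.
class BitWriter:
    def __init__(self, f):
        self.f = f          # bottom layer: any binary file object
        self.acc = 0        # bit accumulator
        self.nbits = 0      # number of bits currently buffered

    def write(self, code, length):
        """Append `length` bits of `code`, most significant bit first."""
        self.acc = (self.acc << length) | code
        self.nbits += length
        while self.nbits >= 8:          # flush whole bytes to the IO layer
            self.nbits -= 8
            self.f.write(bytes([(self.acc >> self.nbits) & 0xFF]))
        self.acc &= (1 << self.nbits) - 1

    def flush(self):
        """Pad the final partial byte with zero bits and write it out."""
        if self.nbits:
            self.f.write(bytes([(self.acc << (8 - self.nbits)) & 0xFF]))
            self.acc = self.nbits = 0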

unpacking int + long long in python

I have the following strange problem while trying to read and unpack an int32 + int64 in Python 2.7.9:
import struct
file = open('my_file.bin', 'rb')
s = file.read(4 + 8)
struct.unpack('IQ', s)
I get the following error:
unpack requires a string argument of length 16
Why is that? I=4, Q=8, so IQ should be 12.
btw the following works:
s = file.read(4)
struct.unpack('I',s)
s = file.read(8)
struct.unpack('Q',s)
Haven't used it myself, but according to the documentation, unpack() uses native padding of structs, as a C compiler on your machine would: apparently, you are running on a 64-bit machine. Prefix the format string with an equals sign (=IQ) if you know the struct is packed and follows native byte ordering.
Background: CPUs can fetch data aligned on word boundaries more efficiently than packed data, which requires two fetch cycles (and DRAM access is slow compared to CPU speeds). Now that 64 bits is common (with 8-byte words), this helps explain why we need much more memory these days…
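You can see the padding at work with struct.calcsize (the sizes shown are typical for a 64-bit machine):
import struct

# Native mode ('IQ') pads the Q to an 8-byte boundary: 4 + 4 (pad) + 8 = 16
print(struct.calcsize('IQ'))    # 16
# Standard mode ('=IQ') uses no padding: 4 + 8 = 12
print(struct.calcsize('=IQ'))   # 12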
It is an alignment-related issue. You can check in the docs.

How do I write a long integer as binary in Python?

In Python, long integers have unlimited precision. I would like to write a 16 byte (128 bit) integer to a file. struct from the standard library supports only up to 8 byte integers. array has the same limitation. Is there a way to do this without masking and shifting each integer?
Some clarification here: I'm writing to a file that's going to be read in from non-Python programs, so pickle is out. All 128 bits are used.
I think for unsigned integers (and ignoring endianness) something like
import binascii
def binify(x):
    h = hex(x)[2:].rstrip('L')
    return binascii.unhexlify('0'*(32-len(h))+h)
>>> for i in 0, 1, 2**128-1:
... print i, repr(binify(i))
...
0 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
1 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
340282366920938463463374607431768211455 '\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
might technically satisfy the requirements of having non-Python-specific output, not using an explicit mask, and (I assume) not using any non-standard modules. Not particularly elegant, though.
Two possible solutions:
Just pickle your long integer. This will write the integer in a special format which allows it to be read again, if this is all you want.
Use the second code snippet in this answer to convert the long int to a big endian string (which can be easily changed to little endian if you prefer), and write this string to your file.
The problem is that the internal representation of bigints does not directly include the binary data you ask for.
The PyPI bitarray module in combination with the builtin bin() function seems like a simple and flexible solution:
from bitarray import bitarray
data = bitarray(bin(my_long)[2:].zfill(128)).tobytes()  # left-pad to a whole 128 bits
The endianness can be controlled with a few more lines of code. You'll have to evaluate the efficiency.
Why not use struct with the unsigned long long type twice?
import struct
some_file.write(struct.pack("QQ", var // (2**64), var % (2**64)))
That's documented here (scroll down to get the table with Q): http://docs.python.org/library/struct.html
This may not avoid the "mask and shift each integer" requirement. I'm not sure what avoiding mask and shift means in the context of Python long values.
The bytes are these:
def bytes(long_int):
    bytes = []
    while long_int != 0:
        b = long_int % 256
        bytes.insert(0, b)
        long_int //= 256
    return bytes
You can then pack this list of bytes using struct.pack('16B', *bytes) (after padding the list out to 16 entries).
With Python 3.2 and later, you can use int.to_bytes and int.from_bytes: https://docs.python.org/3/library/stdtypes.html#int.to_bytes
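For example (Python 3):
n = 2**128 - 1
b = n.to_bytes(16, byteorder='big')            # 16 bytes, most significant first
assert int.from_bytes(b, byteorder='big') == n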
You could pickle the object to binary, use protocol buffers (I don't know if they allow you to serialize unlimited precision integers though) or BSON if you do not want to write code.
But writing a function that dumps 16-byte integers by shifting should not be hard to do if it's not time critical; a sketch follows.
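A minimal sketch of that shifting approach (Python 3; the function name is made up):
def dump_u128(n):
    # Emit sixteen bytes, most significant first
    return bytes((n >> (8 * i)) & 0xFF for i in reversed(range(16)))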
This may be a little late, but I don't see why you can't use struct:
import struct

bigint = 0xFEDCBA9876543210FEDCBA9876543210L
print bigint, hex(bigint).upper()
cbi = struct.pack("!QQ", bigint & 0xFFFFFFFFFFFFFFFF, (bigint >> 64) & 0xFFFFFFFFFFFFFFFF)
print len(cbi)
The bigint by itself is rejected, but if you mask it with &0xFFFFFFFFFFFFFFFF you can reduce it to an 8 byte int instead of 16. Then the upper part is shifted and masked as well. You may have to play with byte ordering a bit. I used the ! mark to tell it to produce network-endian byte order. Also, the most and least significant quadwords may need to be swapped; I will leave that as an exercise for the reader. I would say saving things as network endian is safer, so you always know the endianness of your data.
No, don't ask me if network endian is big or little endian...
Based on @DSM's answer above, and to support negative integers and varying byte sizes, I've created the following improved snippet:
import binascii

def to_bytes(num, size):
    x = num if num >= 0 else 256**size + num
    h = hex(x)[2:].rstrip("L")
    return binascii.unhexlify("0"*((2*size)-len(h))+h)
This properly handles negative integers (using their two's complement representation) and lets the user set the number of bytes.
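For example, -1 stored in 4 bytes comes out as all one-bits:
print(binascii.hexlify(to_bytes(-1, 4)))   # ffffffff (b'ffffffff' on Python 3)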

Convert binary information to regular data type without outside modules in python

I'm tasked with reading a poorly formatted binary file and taking in the variables. Although I need to do it in C++ (ROOT, specifically), I've decided to do it in Python because Python makes sense to me. My plan is to get it working in Python and then tackle rewriting it in C++, so leaning on easy-to-use Python modules won't get me very far later down the road.
Basically, I do this:
In [5]: some_value
Out[5]: '\x00I'
In [6]: ''.join([str(ord(i)) for i in some_value])
Out[6]: '073'
In [7]: int(''.join([str(ord(i)) for i in some_value]))
Out[7]: 73
And I know there has to be a better way. What do you think?
EDIT:
A bit of info on the binary format.
(Images of the format specification were attached: http://grab.by/3njm, http://grab.by/3njv, http://grab.by/3nkL)
This is the endian test I am using:
# Read a uint32 for endianness
endian_test = rq1_file.read(uint32)
if endian_test == '\x04\x03\x02\x01':
    print "Endian test: \\x04\\x03\\x02\\x01"
    swapbits = True
elif endian_test == '\x01\x02\x03\x04':
    print "Endian test: \\x01\\x02\\x03\\x04"
    swapbits = False
Your int(''.join([str(ord(i)) for i in some_value])) works ONLY when all bytes except the last byte are zero.
Examples:
'\x01I' should be 1 * 256 + 73 == 329; you get 173
'\x01\x02' should be 1 * 256 + 2 == 258; you get 12
'\x01\x00' should be 1 * 256 + 0 == 256; you get 10
It also relies on an assumption that integers are stored in bigendian fashion; have you verified this assumption? Are you sure that '\x00I' represents the integer 73, and not the integer 73 * 256 + 0 == 18688 (or something else)? Please let us help you verify this assumption by telling us what brand and model of computer and what operating system were used to create the data.
How are negative integers represented?
Do you need to deal with floating-point numbers?
Is the requirement to write it in C++ immutable? What does "(ROOT, specifically)" mean?
If the only dictate is common sense, the preferred order would be:
Write it in Python using the struct module.
Write it in C++ but use C++ library routines (especially if floating-point is involved). Don't re-invent the wheel.
Roll your own conversion routines in C++. You could snarf a copy of the C source for the Python struct module.
Update
Comments after the file format details were posted:
The endianness marker is evidently optional, except at the start of a file. This is dodgy; it relies on the fact that if it is not there, the 3rd and 4th bytes of the block are the first 2 bytes of the header string, and neither '\x03\x04' nor '\x02\x01' can validly start a header string. The smart thing to do would be to read SIX bytes: if the first 4 are the endian marker, the next two are the header length, and your next read is for the header string; otherwise seek backwards 4 bytes and then read the header string.
The above is in the nuisance category. The negative sizes are a real worry, in that they specify a MAXIMUM length, and there is no mention of how the ACTUAL length is determined. It says "The actual size of the entry is then given line by line". How? There is no documentation of what a "line of data" looks like. The description mentions "lines" many times; are these lines terminated by carriage return and/or line feed? If so, how does one tell the difference between say a line feed byte and the first byte of say a uint16 that belongs to the current "line" of data? If no linefeed or whatever, how does one know when the current line of data is finished? Is there a uintNN size in front of every variable or slice thereof?
Then it says that (2) above (negative size) also applies to the header string. The mind boggles. Do you have any examples (in documentation of the file layout, or in actual files) of "negative size" of (a) header string (b) data "line"?
Is this "decided format" publically available e.g. documentation on the web? Does the format have a searchable name? Are you sure you are the first person in the world to want to read that format?
Reading that file format, even with a full specification, is no trivial exercise, even for a binary-format-experienced person who's also experienced with Python (which BTW doesn't have a float128). How many person-hours have you been allocated for the task? What are the penalties for (a) delay (b) failure?
Your original question involved fixing your interesting way of trying to parse a uint16 -- doing much more is way outside the scope/intention of what SO questions are all about.
You're basically computing a "number-in-base-256", which is a polynomial, so, by Horner's method:
>>> v = 0
>>> for c in someval: v = v * 256 + ord(c)
More typical would be to use equivalent bit-operations rather than arithmetic -- the following's equivalent:
>>> v = 0
>>> for c in someval: v = v << 8 | ord(c)
import struct
result, = struct.unpack('>H', some_value)
The equivalent to the Python struct module is a C struct and/or union, so being afraid to use it is silly.
I'm not exactly sure what the format of the data you want to extract is, but maybe you are better off just writing a couple of generic utility functions to extract the different data types you need:
def int1b(data, i):
    return ord(data[i])

def int2b(data, i):
    return (int1b(data, i) << 8) + int1b(data, i+1)

def int4b(data, i):
    return (int2b(data, i) << 16) + int2b(data, i+2)
With such functions you can easily extract values from the data and they also can be translated rather easily to C.
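For instance, applied to the two-byte value from the question:
print(int2b('\x00I', 0))   # 73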

Reading 32bit Packed Binary Data On 64bit System

I'm attempting to write a Python C extension that reads packed binary data (it is stored as structs of structs) and then parses it out into Python objects. Everything works as expected on a 32 bit machine (the binary files are always written on 32bit architecture), but not on a 64 bit box. Is there a "preferred" way of doing this?
It would be a lot of code to post but as an example:
struct
{
    WORD version;
    BOOL upgrade;
    time_t time1;
    time_t time2;
} apparms;

FILE *fp;
fp = fopen(filePath, "r+b");
fread(&apparms, sizeof(apparms), 1, fp);
return Py_BuildValue("{s:i,s:l,s:l}",
    "sysVersion", apparms.version,
    "powerFailTime", apparms.time1,
    "normKitExpDate", apparms.time2);
Now on a 32-bit system this works great, but on a 64-bit system my time_t sizes are different (32-bit vs 64-bit longs).
Damn, you people are fast.
Patrick, I originally started using the struct package but found it just way too slow for my needs. Plus I was looking for an excuse to write a Python extension.
I know this is a stupid question but what types do I need to watch out for?
Thanks.
Explicitly specify that your data types (e.g. integers) are 32-bit. Otherwise, if you have two integers next to each other, they may be read together as one 64-bit integer.
When you are dealing with cross-platform issues, the two main things to watch out for are:
Bitness. If your packed data is written with 32-bit ints, then all of your code must explicitly specify 32-bit ints when reading and writing.
Byte order. If you move your code from Intel chips to PPC or SPARC, your byte order will be wrong. You will have to import your data and then byte-flip it so that it matches up with the current architecture. Otherwise 12 (0x0000000C) will be read as 201326592 (0x0C000000).
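A quick sketch of that byte-order pitfall, using the example values above:
import struct

raw = b'\x0c\x00\x00\x00'              # 12, as written by a little-endian system
print(struct.unpack('>I', raw)[0])     # read big-endian: 201326592 (0x0C000000)
print(struct.unpack('<I', raw)[0])     # read little-endian: 12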
Hopefully this helps.
The 'struct' module should be able to do this, although alignment of structs in the middle of the data is always an issue. It's not very hard to get it right, however: find out (once) what boundary the structs-in-structs align to, then pad (manually, with the 'x' specifier) to that boundary. You can doublecheck your padding by comparing struct.calcsize() with your actual data. It's certainly easier than writing a C extension for it.
In order to keep using Py_BuildValue() like that, you have two options. You can determine the size of time_t at compiletime (in terms of fundamental types, so 'an int' or 'a long' or 'an ssize_t') and then use the right format character to Py_BuildValue -- 'i' for an int, 'l' for a long, 'n' for an ssize_t. Or you can use PyInt_FromSsize_t() manually, in which case the compiler does the upcasting for you, and then use the 'O' format characters to pass the result to Py_BuildValue.
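A sketch of the pure-struct approach for the struct in the question (assuming the files come from 32-bit x86 with Windows-style 16-bit WORD, 32-bit BOOL, and 32-bit time_t; the padding bytes and file name are guesses to be checked against struct.calcsize and real data):
import struct

# WORD (2 bytes), 2 pad bytes, BOOL as a 32-bit int, two 32-bit time_t values
APPARMS = struct.Struct('<H2xiii')   # '<' = little-endian, standard sizes

with open('apparms.bin', 'rb') as f:
    version, upgrade, time1, time2 = APPARMS.unpack(f.read(APPARMS.size))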
You need to make sure you're using architecture independent members for your struct. For instance an int may be 32 bits on one architecture and 64 bits on another. As others have suggested, use the int32_t style types instead. If your struct contains unaligned members, you may need to deal with padding added by the compiler too.
Another common problem with cross architecture data is endianness. Intel i386 architecture is little-endian, but if you're reading on a completely different machine (e.g. an Alpha or Sparc), you'll have to worry about this too.
The Python struct module deals with both these situations, using the prefix passed as part of the format string.
@ - Use native size, endianness and alignment (i = sizeof(int), l = sizeof(long))
= - Use native endianness, but standard sizes and alignment (i = 32 bits, l = 32 bits)
< - Little-endian, standard sizes/alignment
> - Big-endian, standard sizes/alignment
In general, if the data passes off your machine, you should nail down the endianness and the size/padding format to something specific, i.e. use "<" or ">" as your prefix. If you want to handle this in your C extension, you may need to add some code to deal with it.
What's your code for reading the binary data? Make sure you're copying the data into properly-sized types like int32_t instead of just int.
Why aren't you using the struct package?
