Reading bit by bit for Huffman Compression - python

I'm writing a Python program that implements Huffman compression. However, it seems that I can only read from / write to a binary file byte by byte instead of bit by bit. Is there any workaround for this problem? Wouldn't processing byte by byte defeat the purpose of compression, since extraneous padding would be needed? Also, it'd be great if someone could enlighten me about how Huffman compression is applied in practice given this byte-by-byte limitation.

A potential way to only have to read bytes is by buffering directly in the decoding routine. This combines well with table-based decoding, and does not have the overhead of ever doing bit-by-bit IO (hiding that with layers of abstraction doesn't make it go away, it just sweeps it under the carpet).
In the simplest case, table-based decoding needs a "window" of the bit stream that is as large as [1] the largest possible code (incidentally, this is a large part of the reason why many formats that use Huffman compression specify a maximum code length that isn't super long [2]). The window can be created by shifting the buffer to the right until it has the correct size:
window = buffer >> (bitsInBuffer - maxCodeLen)
Since this gets rid of excess bits anyway, it is safe to append more bits than strictly necessary to the buffer when there are not enough:
while bitsInBuffer < maxCodeLen:
    buffer = (buffer << 8) | readByte()
    bitsInBuffer += 8
Thus byte-IO is sufficient. You could even read slightly bigger blocks (e.g. two bytes at a time) if you wanted. There is a slight complication here: if all bytes of the file have been read and the buffer does not have enough bits in it (a legitimate condition that can happen for valid bitstreams), you just have to fill with "padding" (basically shift left without ORing in new bits).
Decoding itself could look like this:
# this line does the actual decoding
(symbol, length) = table[window]
# remove that code from the buffer
bitsInBuffer -= length
buffer = buffer & ((1 << bitsInBuffer) - 1)
# use decoded symbol
This is all very easy; the hard part is constructing the table. One way to do it (not a great way, but a simple way) is to take every integer from 0 up to and including (1 << maxCodeLen) - 1 and decode the first symbol in it using bit-by-bit tree-walking the way you're used to. A faster way is to take every symbol/code pair and use it to fill the right entries of the table:
# for each symbol/code do this:
bottomSize = maxCodeLen - codeLen
topBits = code << bottomSize
for bottom in range(1 << bottomSize):
    table[topBits | bottom] = (symbol, codeLen)
By the way none of this code has been tested, it's just to show roughly how it might be done. It also assumes a particular way of packing the bitstream into bytes, with the first bit in the top of the byte.
[1] Some multi-stage decoding strategies are able to use a smaller window, which may be required if there is no bound on the code length.
[2] E.g. 15 bits max for Deflate.
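Putting the pieces above together, here is a minimal runnable sketch of table-based decoding. The names build_table and decode, the layout of the codes mapping, and the symbol_count argument are assumptions made for illustration; they are not part of the answer above. It also assumes the code is a complete prefix code, that bits are packed most significant bit first, and that the decoder knows how many symbols to expect.

```python
def build_table(codes, max_code_len):
    """codes maps symbol -> (code, code_len), with the code given as an int."""
    table = [None] * (1 << max_code_len)
    for symbol, (code, code_len) in codes.items():
        bottom_size = max_code_len - code_len
        top_bits = code << bottom_size
        # Every window that starts with this code decodes to this symbol
        for bottom in range(1 << bottom_size):
            table[top_bits | bottom] = (symbol, code_len)
    return table


def decode(data, table, max_code_len, symbol_count):
    out = []
    buffer = 0
    bits_in_buffer = 0
    pos = 0
    while len(out) < symbol_count:
        # Refill byte by byte; past the end of the data, shift in zero padding
        while bits_in_buffer < max_code_len:
            next_byte = data[pos] if pos < len(data) else 0
            pos += 1
            buffer = (buffer << 8) | next_byte
            bits_in_buffer += 8
        # Drop the excess low bits so the window is exactly max_code_len wide
        window = buffer >> (bits_in_buffer - max_code_len)
        symbol, length = table[window]
        # Remove the decoded code from the top of the buffer
        bits_in_buffer -= length
        buffer &= (1 << bits_in_buffer) - 1
        out.append(symbol)
    return out


# Hypothetical three-symbol code: a = 0, b = 10, c = 11
codes = {"a": (0b0, 1), "b": (0b10, 2), "c": (0b11, 2)}
table = build_table(codes, 2)
# "abca" encodes to the bits 0 10 11 0, padded with zeros to one byte
decoded = decode(bytes([0b01011000]), table, 2, 4)
```

Passing the symbol count explicitly is only one way to deal with the trailing padding bits; a dedicated end-of-stream symbol would work too.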

Layer your code. Have a bottom I/O layer that does all file reads and writes, either the entire file at once or with buffering. Have a layer above that which processes the Huffman code bitstream bit by bit.
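As a rough illustration of that layering, the sketch below wraps any byte stream in a small bit-level reader. The class name BitReader and its methods are made up for this example, and it assumes the first bit of the stream sits in the top bit of each byte.

```python
import io


class BitReader:
    """Bit-level layer on top of any object with a read(n) method returning bytes."""

    def __init__(self, stream):
        self._stream = stream
        self._current = 0      # the byte currently being consumed
        self._bits_left = 0    # how many bits of it are still unread

    def read_bit(self):
        if self._bits_left == 0:
            chunk = self._stream.read(1)
            if not chunk:
                raise EOFError("no more bits")
            self._current = chunk[0]
            self._bits_left = 8
        self._bits_left -= 1
        return (self._current >> self._bits_left) & 1

    def read_bits(self, count):
        value = 0
        for _ in range(count):
            value = (value << 1) | self.read_bit()
        return value


reader = BitReader(io.BytesIO(bytes([0b10110000])))
first = reader.read_bit()     # the top bit of the byte
rest = reader.read_bits(3)    # the next three bits as an integer
```

The bottom layer here is io.BytesIO, but a file opened in binary mode could be passed in the same way.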

Related

Reverse Engineering 'UTF-8 Like' Encoding Algorithm

I'm attempting to reverse engineer an encoding algorithm to ensure backwards compatibility with other software packages. For each type of quantity to be encoded in the output file, there is a separate encoding procedure.
The given documentation only shows the end-user how to parse values from the encoded file, not write anything back to it. However, I have been able to successfully create a corresponding write_int() for every documented read_int() for every file type except the read_string() below.
I am currently (and have been for a while) struggling to wrap my head around exactly what is going on in the read_string() function listed below.
I understand that this is a masking problem, and that the first operation, while partial_length & 0x80 > 0:, is a simple bitwise mask that ensures we only enter the loop for byte values of 128 or larger (those with the high bit set). I begin to lose my head when trying to extract meaning from the body of that while loop. I get the mathematical machinery behind the operations, but I can't see why they would be doing things in this way.
I have included the read_byte() function for context, as it is called in the read_string() function.
def read_byte(handle):
    return struct.unpack("<B", handle.read(1))[0]

def read_string(handle):
    total_length = 0
    partial_length = read_byte(handle)
    num_bytes = 0
    while partial_length & 0x80 > 0:
        total_length += (partial_length & 0x7F) << (7 * num_bytes)
        partial_length = ord(struct.unpack("c", handle.read(1))[0])
        num_bytes += 1
    total_length += partial_length << (7 * num_bytes)
    result = handle.read(total_length)
    result = result.decode("utf-8")
    if len(result) < total_length:
        raise Exception("Failed to read complete string")
    else:
        return result
Is this indicative of an impossible task due to information loss, or am I missing an obvious way to perform the opposite of this read_string function?
I would greatly appreciate any information, insights (however obvious you may think they may be), help, or pointers possible, even if it means just a link to a page that you think might prove useful.
Cheers!
It's just reading a length, which then tells it how many characters to read. (I don't get the check at the end but that's a different issue.)
In order to avoid a fixed length for the length, the length is divided into seven-bit units, which are sent low-order chunk first. Each seven-bit unit is sent in a single 8-bit byte with the high-order bit set, except the last unit which is sent as is. Thus, the reader knows when it gets to the end of the length, because it reads a byte whose high-order bit is 0 (in other words, a byte less than 0x80).
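Given that description, the inverse is mechanical. The sketch below is one way it might look; write_length and write_string are hypothetical names, and it assumes the length prefix counts UTF-8 encoded bytes, which is what handle.read(total_length) in the reader consumes.

```python
import io
import struct


def write_length(handle, length):
    # Emit seven bits at a time, low-order chunk first, with the high bit
    # set on every byte except the last one
    while length >= 0x80:
        handle.write(struct.pack("<B", (length & 0x7F) | 0x80))
        length >>= 7
    handle.write(struct.pack("<B", length))


def write_string(handle, text):
    data = text.encode("utf-8")
    write_length(handle, len(data))
    handle.write(data)


buf = io.BytesIO()
write_length(buf, 300)   # 300 needs two length bytes
prefix = buf.getvalue()

out = io.BytesIO()
write_string(out, "hi")
encoded = out.getvalue()
```

No information is lost in this scheme: each length byte carries seven payload bits plus a one-bit "more follows" flag, so the reader can always reconstruct the original length.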

Write bit string stream to file

So I have a stream of bits in python that looks like that:
bitStream = "001011000011011111000011100111000101001111100011001"
This stream is dynamic, meaning that it changes depending on the input received. I want to write it to a file in Python, and I'm currently doing this:
f = open("file.txt", "rb+")
s = f.read() # stream
bitStream = "001011000011011111000011100111000101001111100011001"
byteStream = int(bitStream,2).to_bytes(len(bitStream)//8, 'little')
f.close() #close handle
That works, but the bit stream can have a length that is not a multiple of 8, which results in either a file write of n-1 bytes or an error of the type "int too big to convert".
Now normally I would align the file bits to be divisible by 8 (which is normal behavior) but in this case I really cannot add bits because otherwise, when I would give again this file to my program it will misinterpret the alignment bits as something other than expected.
Would you guys have any idea?
Thanks in advance
An easy fix is to make sure the number is always rounded up:
(len(bitStream)+7)//8
This works because //8 always rounds down. Adding 7 first pushes any length that is not already a multiple of 8 past the next multiple, so rounding down effectively rounds up.
Alternatively:
math.ceil(len(bitStream)/8)
This makes sure there are always plenty of bytes.
"I really cannot add bits because otherwise, when I would give again this file to my program it will misinterpret the alignment bits as something other than expected"
So, you need to write an amount of bits whose size is not a multiple of 8, but computer memory is normally byte-addressable, meaning that you can't read or write anything that is smaller than 1 byte.
However, there is only a very small number of ways the length of your input can be not a multiple of 8: len(bitStream) % 8 may be 0, 1, 2, 3, 4, 5, 6 or 7. Thus, you can pad your data to a multiple of 8 bits (if needed) and use one additional byte to indicate the number of bits that are used for padding (possibly zero), like this:
01110011101                # initial data
1111101110011101           # align it with 1's (or 0's, or whatever)
^^^^^ alignment of 5 bits
000001011111101110011101
^^^^^^^^----------------- the number 5 (size of alignment)
        |||||
        ^^^^^------------ the alignment itself
When you read the file, you know that the first byte holds the size of the alignment (n), so you read it, then read the remaining data and disregard the n leading bits.
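A small sketch of that scheme, under a few assumptions: pack_bits and unpack_bits are hypothetical helper names, the padding uses 0 bits rather than the 1's in the diagram, and the padding count is stored in the first byte.

```python
def pack_bits(bit_string):
    # Number of padding bits needed to reach a multiple of 8 (0 to 7)
    pad = (-len(bit_string)) % 8
    padded = "0" * pad + bit_string
    # int("", 2) is an error, so an empty stream gets an empty body
    body = int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""
    # The first byte records how many leading bits to discard when reading
    return bytes([pad]) + body


def unpack_bits(data):
    pad = data[0]
    bits = "".join(format(byte, "08b") for byte in data[1:])
    return bits[pad:]


packed = pack_bits("01110011101")   # 11 bits, so 5 bits of padding
restored = unpack_bits(packed)
```

Because the padding count is stored explicitly, leading zeros in the original bit string survive the round trip, which a bare int(bitStream, 2) conversion would not preserve.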

Why does a grouped struct.pack write wrong data?

I just spent ~30 minutes debugging and double checking Python & C# code, to find out that my struct.pack was writing the wrong data. When I separated this into separate calls, it works fine.
This is what I had before
file.write(struct.pack("fffHf", kf_time / frame_divisor, kf_in_tangent, kf_out_tangent, kf_interpolation_type, kf_value))
This is what I have now
file.write(struct.pack("f", kf_time / frame_divisor))
file.write(struct.pack("f", kf_in_tangent))
file.write(struct.pack("f", kf_out_tangent))
file.write(struct.pack("H", kf_interpolation_type))
file.write(struct.pack("f", kf_value))
Why does the first variation not write the data that I expected? What is so different than writing these separately?
(File is opened in binary mode, platform is 64 bit Windows, Python 3.5)
Presumably because, as the struct documentation clearly states:
Note: By default, the result of packing a given C struct
includes pad bytes in order to maintain proper alignment
for the C types involved; similarly, alignment is taken
into account when unpacking. This behavior is chosen so
that the bytes of a packed struct correspond exactly to
the layout in memory of the corresponding C struct. To
handle platform-independent data formats or omit implicit
pad bytes, use standard size and alignment instead of
native size and alignment: see Byte Order, Size, and
Alignment for details.
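To see the effect described in that note, struct.calcsize can be compared for the native format and for one with an explicit "<" prefix (little-endian, standard sizes, no padding). The sample values below are made up for illustration, and the native size depends on the platform; on common platforms it comes out two bytes larger because of padding inserted after the H field.

```python
import struct

# Native size and alignment: pad bytes may follow the 2-byte "H" so that
# the next float starts on its natural boundary
native_size = struct.calcsize("fffHf")

# Standard size, no alignment: 4 + 4 + 4 + 2 + 4 bytes
standard_size = struct.calcsize("<fffHf")

# Packing all five fields in one call without implicit padding
record = struct.pack("<fffHf", 0.5, 0.0, 0.0, 2, 1.0)
```

With the prefix in place, the grouped call should produce the same bytes as the five separate writes, as long as the reader on the C# side expects little-endian data.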

How do I write a long integer as binary in Python?

In Python, long integers have unlimited precision. I would like to write a 16 byte (128 bit) integer to a file. struct from the standard library supports only up to 8 byte integers. array has the same limitation. Is there a way to do this without masking and shifting each integer?
Some clarification here: I'm writing to a file that's going to be read in from non-Python programs, so pickle is out. All 128 bits are used.
I think for unsigned integers (and ignoring endianness) something like
import binascii
def binify(x):
    h = hex(x)[2:].rstrip('L')
    return binascii.unhexlify('0'*(32-len(h))+h)
>>> for i in 0, 1, 2**128-1:
...     print i, repr(binify(i))
...
0 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
1 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
340282366920938463463374607431768211455 '\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
might technically satisfy the requirements of having non-Python-specific output, not using an explicit mask, and (I assume) not using any non-standard modules. Not particularly elegant, though.
Two possible solutions:
Just pickle your long integer. This will write the integer in a special format which allows it to be read again, if this is all you want.
Use the second code snippet in this answer to convert the long int to a big endian string (which can be easily changed to little endian if you prefer), and write this string to your file.
The problem is that the internal representation of bigints does not directly include the binary data you ask for.
The PyPI bitarray module in combination with the builtin bin() function seems like a good combination for a solution that is simple and flexible.
bytes = bitarray(bin(my_long)[2:]).tobytes()
The endianness can be controlled with a few more lines of code. You'll have to evaluate the efficiency.
Why not use struct with the unsigned long long type twice?
import struct
some_file.write(struct.pack("QQ", var // (2**64), var % (2**64)))
That's documented here (scroll down to get the table with Q): http://docs.python.org/library/struct.html
This may not avoid the "mask and shift each integer" requirement. I'm not sure what avoiding mask and shift means in the context of Python long values.
The bytes are these:
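def int_to_byte_list(long_int, size=16):
    # Peel off the least significant byte on each pass, so the list ends up big-endian
    byte_list = []
    while long_int != 0:
        b = long_int % 256
        byte_list.insert(0, b)
        long_int //= 256
    # Left-pad with zeros so the list always has at least `size` entries
    return [0] * (size - len(byte_list)) + byte_list
You can then pack this list of bytes using struct.pack('16B', *byte_list). Note the 'B' for unsigned bytes (the signed 'b' rejects values above 127) and the * that unpacks the list into separate arguments.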
With Python 3.2 and later, you can use int.to_bytes and int.from_bytes: https://docs.python.org/3/library/stdtypes.html#int.to_bytes
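For example, a 16-byte unsigned value can be converted in either byte order like this. The sample value is arbitrary, and an in-memory io.BytesIO stands in for the real output file.

```python
import io

value = 0xFEDCBA9876543210FEDCBA9876543210

# 16 bytes, most significant byte first
big = value.to_bytes(16, byteorder="big")
# 16 bytes, least significant byte first
little = value.to_bytes(16, byteorder="little")

# Negative values need signed=True and are stored in two's complement
minus_one = (-1).to_bytes(16, byteorder="big", signed=True)

out = io.BytesIO()   # stand-in for a file opened with mode "wb"
out.write(big)

# Reading it back
restored = int.from_bytes(out.getvalue(), byteorder="big")
```

to_bytes raises OverflowError if the value does not fit in the requested number of bytes, so a too-large integer cannot be truncated silently.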
You could pickle the object to binary, use protocol buffers (I don't know if they allow you to serialize unlimited precision integers though) or BSON if you do not want to write code.
But writing a function that dumps 16 byte integers by shifting it should not be so hard to do if it's not time critical.
This may be a little late, but I don't see why you can't use struct:
bigint = 0xFEDCBA9876543210FEDCBA9876543210L
print bigint,hex(bigint).upper()
cbi = struct.pack("!QQ",bigint&0xFFFFFFFFFFFFFFFF,(bigint>>64)&0xFFFFFFFFFFFFFFFF)
print len(cbi)
The bigint by itself is rejected, but if you mask it with &0xFFFFFFFFFFFFFFFF you can reduce it to an 8 byte int instead of 16. Then the upper part is shifted and masked as well. You may have to play with byte ordering a bit. I used the ! mark to tell it to produce a network endian byte order. Also, the msb and lsb (upper and lower bytes) may need to be reversed. I will leave that as an exercise for the user to determine. I would say saving things as network endian would be safer so you always know what the endianess of your data is.
No, don't ask me if network endian is big or little endian...
Based on @DSM's answer, and to support negative integers and varying byte sizes, I've created the following improved snippet:
import binascii

def to_bytes(num, size):
    # Map negative values to their two's-complement representation
    x = num if num >= 0 else 256**size + num
    h = hex(x)[2:].rstrip("L")
    return binascii.unhexlify("0"*((2*size)-len(h))+h)
This will properly handle negative integers and let the user set the number of bytes.

What is the best way to do Bit Field manipulation in Python?

I'm reading some MPEG Transport Stream protocol over UDP and it has some funky bitfields in it (length 13 for example). I'm using the "struct" library to do the broad unpacking, but is there a simple way to say "Grab the next 13 bits" rather than have to hand-tweak the bit manipulation? I'd like something like the way C does bit fields (without having to revert to C).
Suggestions?
The bitstring module is designed to address just this problem. It will let you read, modify and construct data using bits as the basic building blocks. The latest versions are for Python 2.6 or later (including Python 3) but version 1.0 supported Python 2.4 and 2.5 as well.
A relevant example for you might be this, which strips out all the null packets from a transport stream (and quite possibly uses your 13 bit field?):
from bitstring import Bits, BitStream
# Opening from a file means that it won't be all read into memory
s = Bits(filename='test.ts')
outfile = open('test_nonull.ts', 'wb')
# Cut the stream into 188 byte packets
for packet in s.cut(188*8):
    # Take a 13 bit slice and interpret as an unsigned integer
    PID = packet[11:24].uint
    # Write out the packet if the PID doesn't indicate a 'null' packet
    if PID != 8191:
        # The 'bytes' property converts back to a string.
        outfile.write(packet.bytes)
Here's another example including reading from bitstreams:
# You can create from hex, binary, integers, strings, floats, files...
# This has a hex code followed by two 12 bit integers
s = BitStream('0x000001b3, uint:12=352, uint:12=288')
# Append some other bits
s += '0b11001, 0xff, int:5=-3'
# read back as 32 bits of hex, then two 12 bit unsigned integers
start_code, width, height = s.readlist('hex:32, 2*uint:12')
# Skip some bits then peek at next bit value
s.pos += 4
if s.peek(1):
    flags = s.read(9)
You can use standard slice notation to slice, delete, reverse, overwrite, etc. at the bit level, and there are bit level find, replace, split etc. functions. Different endiannesses are also supported.
# Replace every '1' bit by 3 bits
s.replace('0b1', '0b001')
# Find all occurrences of a bit sequence
bitposlist = list(s.findall('0b01000'))
# Reverse bits in place
s.reverse()
The full documentation is here.
It's an often-asked question. There's an ASPN Cookbook entry on it that has served me in the past.
And there is an extensive page of requirements one person would like to see from a module doing this.
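If pulling in a third-party module is not an option, a field such as the 13-bit one in the question can also be extracted with plain integers from the standard library. This is only a sketch: read_bits is a hypothetical helper, bit offsets count from the most significant bit of the first byte (matching the packet[11:24] slice above), and the four header bytes are invented for the example.

```python
def read_bits(data, bit_offset, bit_count):
    """Return bit_count bits starting at bit_offset as an unsigned integer."""
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    # Shift the wanted field down to the bottom, then mask off everything above it
    shift = total_bits - bit_offset - bit_count
    return (value >> shift) & ((1 << bit_count) - 1)


# Invented 4-byte header: sync byte 0x47 followed by a null-packet PID
header = bytes([0x47, 0x1F, 0xFF, 0x10])
sync = read_bits(header, 0, 8)
pid = read_bits(header, 11, 13)
```

Converting the whole buffer to one integer is fine for a short header; for an entire packet it would be cheaper to slice out only the bytes that contain the field first.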
