Write bit string stream to file - python

So I have a stream of bits in Python that looks like this:
bitStream = "001011000011011111000011100111000101001111100011001"
This stream is dynamic, meaning that it changes depending on the input received. I want to write it to a file, and currently I'm doing this:
f = open("file.txt", "rb+")
s = f.read() # existing stream
bitStream = "001011000011011111000011100111000101001111100011001"
byteStream = int(bitStream,2).to_bytes(len(bitStream)//8, 'little')
f.write(byteStream)
f.close() #close handle
That works, but the bit stream can be a string whose length is not a multiple of 8, which results in a file write of n-1 bytes or an error of the type "int too big to convert".
Normally I would pad the stream so its length is divisible by 8 (which is the usual approach), but in this case I really cannot add bits: when I later feed this file back to my program, it would misinterpret the alignment bits as data.
Would you guys have any idea?
Thanks in advance

An easy fix is to make sure the byte count is always rounded up:
(len(bitStream)+7)//8
This works because //8 always rounds down; adding 7 first means any length just above a multiple of 8 reaches the next multiple, so the division effectively rounds up.
Alternatively:
math.ceil(len(bitStream)/8)
This makes sure there are always enough bytes.
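Applied to the snippet from the question, the rounded-up byte count lets to_bytes handle the 51-bit example without the "int too big to convert" error:

```python
bitStream = "001011000011011111000011100111000101001111100011001"  # 51 bits
nbytes = (len(bitStream) + 7) // 8   # rounds up: 51 bits -> 7 bytes, not 6
byteStream = int(bitStream, 2).to_bytes(nbytes, 'little')
print(nbytes, len(byteStream))       # 7 7
```

Keep in mind that on read-back you still need to know the original bit length, since the high padding bits are indistinguishable from data.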

I really cannot add bits because otherwise, when I would give again this file to my program it will misinterpret the alignment bits as something other than expected
So, you need to write a number of bits that is not a multiple of 8, but computer memory is normally byte-addressable, meaning that you can't read or write anything smaller than 1 byte.
However, there is only a very small number of ways the length of your input can fail to be a multiple of 8: len(bitStream) % 8 may be 0, 1, 2, 3, 4, 5, 6 or 7. Thus, you can align your data to a multiple of 8 bits (if needed) and use one additional byte to store the number of bits that are used for padding (possibly zero), like this:
01110011101              # initial data
1111101110011101         # align it with 1's (or 0's, or whatever)
^^^^^                    # alignment of 5 bits
000001011111101110011101
^^^^^^^^                 # the number 5 (size of the alignment), in one byte
        ^^^^^            # the alignment itself
When you read the file, you know that the first byte holds the size of the alignment (n), so you read it, then read the remaining data and disregard the n leading bits.
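A minimal sketch of this scheme; write_bits and read_bits are invented names for the demo, which pads with 1's as in the diagram, packs big-endian, and assumes a non-empty bit string:

```python
def write_bits(bitStream, path):
    """Write a bit string of any length; the first byte stores the padding size."""
    pad = -len(bitStream) % 8            # bits needed to reach a byte boundary (0-7)
    padded = '1' * pad + bitStream       # align with 1's, as in the diagram
    with open(path, 'wb') as f:
        f.write(bytes([pad]))            # one byte: number of alignment bits
        f.write(int(padded, 2).to_bytes(len(padded) // 8, 'big'))

def read_bits(path):
    with open(path, 'rb') as f:
        pad = f.read(1)[0]               # size of the alignment (n)
        data = f.read()
    # zfill restores any leading zeros the int conversion dropped
    bits = bin(int.from_bytes(data, 'big'))[2:].zfill(len(data) * 8)
    return bits[pad:]                    # disregard the n leading bits
```

The pad count -len(bitStream) % 8 is exactly the 0-7 value described above, so the one-byte header is always enough.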

Related

How do I compress a rather long binary string in Python so that I will be able to access it later?

I have a long array of items (4700) that will ultimately be 1 or 0 when compared to settings in another list. I want to be able to construct a single integer/string item that I can store in some of the metadata such that it can be accessed later in order to uniquely identify the combination of items that goes into it.
I am writing this all in Python. I am thinking of doing something like zlib compression plus a hex conversion, but I am getting confused about how to do the inverse transformation. So, assuming bin_string is the string of 1's and 0's, it should look something like this:
import zlib
#example bin_string, real one is much longer
bin_string="1001010010100101010010100101010010101010000010100101010"
compressed = zlib.compress(bin_string.encode())
this_hex = compressed.hex()
where I can then save this_hex to the metadata. The question is, how do I get the original bin_string back from my hex value? I have lots of Python experience with numerical methods and such but little with compression, so any basic insights would be very valuable.
Just do the inverse of each operation. This:
zlib.decompress(bytearray.fromhex(this_hex)).decode()
will return your original string.
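Putting both directions together with the example string from the question:

```python
import zlib

bin_string = "1001010010100101010010100101010010101010000010100101010"
this_hex = zlib.compress(bin_string.encode()).hex()          # forward: compress + hex
restored = zlib.decompress(bytearray.fromhex(this_hex)).decode()  # inverse
print(restored == bin_string)   # True
```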
It would be faster and might even result in better compression to simply encode your bits as bits in a byte string, along with a terminating one bit followed by zeros to pad out the last byte. That would be seven bytes instead of the 22 you're getting from zlib.compress(). zlib would do better only if there is a strong bias for 0's or 1's, and/or there are repeating patterns in the 0's and 1's.
As for encoding for the metadata, Base64 would be more compact than hexadecimal. Your example would be lKVKVKoKVQ==.
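A sketch of that bits-as-bits encoding, with a terminating one bit marking where the zero padding starts; bits_to_bytes and bytes_to_bits are invented names for the demo:

```python
import base64

def bits_to_bytes(bin_string):
    s = bin_string + '1'                  # terminating one bit
    s += '0' * (-len(s) % 8)              # zeros to pad out the last byte
    return int(s, 2).to_bytes(len(s) // 8, 'big')

def bytes_to_bits(data):
    bits = bin(int.from_bytes(data, 'big'))[2:].zfill(len(data) * 8)
    return bits[:bits.rindex('1')]        # strip the padding and the terminator

packed = bits_to_bytes("1001010010100101010010100101010010101010000010100101010")
print(len(packed))                        # 7 bytes, versus 22 from zlib.compress
print(base64.b64encode(packed))           # b'lKVKVKoKVQ=='
```

The last '1' in the unpacked bits is always the terminator, since everything after it is padding zeros, which is what makes the decoding unambiguous.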
You could also try the savez_compressed() function from numpy.
Convert your list into a numpy array and then save it:
numpy.savez_compressed("filename.npz", data=my_array)
Use
numpy.load("filename.npz")["data"]
to load the array back from the .npz file.

Reading bit by bit for Huffman Compression

I'm writing a Python program that implements Huffman compression. However, it seems that I can only read / write a binary file byte by byte instead of bit by bit. Is there any workaround for this problem? Wouldn't processing byte by byte defeat the purpose of compression, since extraneous padding would be needed? Also, it'd be great if someone could enlighten me about how Huffman compression is applied in practice given this byte-by-byte constraint.
A potential way to only have to read bytes is by buffering directly in the decoding routine. This combines well with table-based decoding, and does not have the overhead of ever doing bit-by-bit IO (hiding that with layers of abstraction doesn't make it go away, just wipes it under the carpet).
In the simplest case, table-based decoding needs a "window" of the bit stream that is as large as[1] the largest possible code (incidentally, this sort of thing is a large part of the reason why many formats that use Huffman compression specify a maximum code length that isn't super long[2]). The window can be created by shifting the buffer to the right until only the top maxCodeLen bits remain:
window = buffer >> (bitsInBuffer - maxCodeLen)
Since this gets rid of excess bits anyway, it is safe to append more bits than strictly necessary to the buffer when there are not enough:
while bitsInBuffer < maxCodeLen:
    buffer = (buffer << 8) | readByte()
    bitsInBuffer += 8
Thus byte-IO is sufficient. Actually, you could read slightly bigger blocks (e.g. two bytes at a time) if you wanted. By the way, there is a slight complication here: if all bytes of the file have been read but the buffer does not yet hold enough bits (a legitimate condition that can happen for valid bitstreams), you just have to fill with "padding" (basically, shift left without ORing in new bits).
Decoding itself could look like this:
# this line does the actual decoding
(symbol, length) = table[window]
# remove that code from the buffer
bitsInBuffer -= length
buffer = buffer & ((1 << bitsInBuffer) - 1)
# use decoded symbol
This is all very easy; the hard part is constructing the table. One way to do it (not a great way, but a simple way) is to take every integer from 0 up to and including (1 << maxCodeLen) - 1 and decode the first symbol in it by walking the tree bit by bit, the way you're used to. A faster way is to take every symbol/code pair and use it to fill the right entries of the table:
# for each symbol/code do this:
bottomSize = maxCodeLen - codeLen
topBits = code << bottomSize
for bottom in range(1 << bottomSize):
    table[topBits | bottom] = (symbol, codeLen)
By the way, none of this code has been tested; it's just to show roughly how it might be done. It also assumes a particular way of packing the bitstream into bytes, with the first bit in the top of the byte.
[1]: some multi-stage decoding strategies are able to use a smaller window, which may be required if there is no bound on the code length.
[2]: e.g. 15 bits max for Deflate
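Assembling the pieces into a runnable sketch, with an invented three-symbol code ('a' = 0, 'b' = 10, 'c' = 11, so maxCodeLen is 2) and the first-bit-at-the-top packing convention the answer assumes:

```python
codes = {'a': (0b0, 1), 'b': (0b10, 2), 'c': (0b11, 2)}
maxCodeLen = 2

# Fill the table: every window whose top bits match a code maps to that symbol.
table = [None] * (1 << maxCodeLen)
for symbol, (code, codeLen) in codes.items():
    bottomSize = maxCodeLen - codeLen
    topBits = code << bottomSize
    for bottom in range(1 << bottomSize):
        table[topBits | bottom] = (symbol, codeLen)

def decode(data, nsymbols):
    out = []
    buffer = 0
    bitsInBuffer = 0
    pos = 0
    for _ in range(nsymbols):
        # Refill with whole bytes; past end-of-input, shift in zero padding.
        while bitsInBuffer < maxCodeLen:
            byte = data[pos] if pos < len(data) else 0
            pos += 1
            buffer = (buffer << 8) | byte
            bitsInBuffer += 8
        window = buffer >> (bitsInBuffer - maxCodeLen)
        symbol, length = table[window]
        out.append(symbol)
        bitsInBuffer -= length               # remove the consumed code
        buffer &= (1 << bitsInBuffer) - 1
    return ''.join(out)

# 'abc' packs to the bits 0 10 11, i.e. 01011000 with zero padding = 0x58
print(decode(bytes([0x58]), 3))   # abc
```

Note that decoding only ever performs byte reads; the bit-level work happens entirely in the integer buffer.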
Layer your code. Have a bottom io layer that does all file reads and writes either entire file at once or with buffering. Have a layer above that which processes the Huffman code bitstream by bits.

Formatting number as fixed-length binary in python when number is larger than number of bits

I am trying to convert a number to its binary representation with a fixed length. I have tried using
>>> "{0:04b}".format(number)
which works when the number can be fully represented in four bits (0-15), however I need the string to always be four bits long, even if the number is out of range. For example,
>>> "{0:04b}".format(17)
should return '0001', not '10001'.
I am aware I can just index the last 4 bits of the resulting string, however I am wondering if there is a more elegant solution.
There is no general method to limit the width of an integer conversion. However, you may want to use a specific method for your specific case.
If you don't want to reduce the string after it is generated, you could reduce the input number. Since you are only interested in the low-order four bits, & 0xf should isolate those for you:
"{0:04b}".format(number & 0xF)
Alternatively, you could invoke format twice, once to convert the integer and once to limit the field width of the resulting string. Note that a precision on a string keeps the first characters, so this yields the leading four bits ('1000' for 17) rather than the trailing ones; it is also fairly unreadable, so I wouldn't recommend it:
"{0:4.4s}".format("{0:04b}".format(number))

Interpreting binary files as ASCII

I have a binary file (which I've created in C) and I would like to have a look inside it. Obviously, I won't be able to "see" anything useful as it's in binary. However, I do know that it contains a certain number of rows with numbers in double precision. I am looking for a script to just read some values and print them so I can verify whether they are in the right range. In other words, it would be like doing head or tail on a text file in Linux.
Is there a way of doing it?
Right now I've got something in Python, but it does not do what I want:
CHUNKSIZE = 8192
file = open('eigenvalues.bin', 'rb')
data = list(file.read())
print data
Use the array module to read homogeneous binary-representation numbers:
from array import array
data = array('d')
CHUNKSIZE = 8192
rowcount = CHUNKSIZE // data.itemsize # number of doubles we find in CHUNKSIZE bytes
with open('eigenvalues.bin', 'rb') as eg:
    data.fromfile(eg, rowcount)
The array.array type otherwise behaves just like a list, only the type of values it can hold is constrained (in this case to float).
Depending on the input data, you may need to add a data.byteswap() call after reading to switch between little and big-endian. Use sys.byteorder to see what byteorder was used to read the data. If your data was written on a platform using little-endianess, swap if your platform uses the other form, and vice-versa:
import sys
if sys.byteorder == 'big':
    # data was written in little-endian form, so swap the bytes to match
    data.byteswap()
You can use struct.unpack to convert binary data into a specific data type.
For example, to read the first double from the binary data (a double is 8 bytes, so the slice must cover indices 0 through 7):
struct.unpack("d", inputData[0:8])
http://docs.python.org/2/library/struct.html
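A round-trip sketch; the filename comes from the question, the values written are made up for the demo, and it assumes the file holds native-endian doubles as the C program might have written them:

```python
import struct

# Write a few doubles the way the C program might have (native endianness).
values = [1.5, -2.25, 3.0]
with open('eigenvalues.bin', 'wb') as f:
    f.write(struct.pack('%dd' % len(values), *values))

# Read back just the first double: a double is 8 bytes, hence read(8).
with open('eigenvalues.bin', 'rb') as f:
    first, = struct.unpack('d', f.read(8))
print(first)   # 1.5
```

If the file was written on a machine with a different byte order, use an explicit '<d' or '>d' format instead of plain 'd'.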
You can see each byte of your file represented in unsigned decimal with this shell command:
od -t u1 eigenvalues.bin | less
Should you want to see a particular area and decode floating point numbers, you can use dd to extract them and od -F option to decode them, eg:
dd status=noxfer if=eigenvalues.bin bs=1 skip=800 count=16 | od -F
will show two double precision numbers stored at offset 800 and 808 in the binary file.
Note that, according to the Linux tag set on your question, I assume you are using the GNU versions of dd and od.

What is the best way to do Bit Field manipulation in Python?

I'm reading some MPEG Transport Stream protocol over UDP and it has some funky bitfields in it (length 13 for example). I'm using the "struct" library to do the broad unpacking, but is there a simple way to say "Grab the next 13 bits" rather than have to hand-tweak the bit manipulation? I'd like something like the way C does bit fields (without having to revert to C).
Suggestions?
The bitstring module is designed to address just this problem. It will let you read, modify and construct data using bits as the basic building blocks. The latest versions are for Python 2.6 or later (including Python 3) but version 1.0 supported Python 2.4 and 2.5 as well.
A relevant example for you might be this, which strips out all the null packets from a transport stream (and quite possibly uses your 13 bit field?):
from bitstring import Bits, BitStream
# Opening from a file means that it won't be all read into memory
s = Bits(filename='test.ts')
outfile = open('test_nonull.ts', 'wb')
# Cut the stream into 188 byte packets
for packet in s.cut(188*8):
    # Take a 13 bit slice and interpret as an unsigned integer
    PID = packet[11:24].uint
    # Write out the packet if the PID doesn't indicate a 'null' packet
    if PID != 8191:
        # The 'bytes' property converts back to a string.
        outfile.write(packet.bytes)
Here's another example including reading from bitstreams:
# You can create from hex, binary, integers, strings, floats, files...
# This has a hex code followed by two 12 bit integers
s = BitStream('0x000001b3, uint:12=352, uint:12=288')
# Append some other bits
s += '0b11001, 0xff, int:5=-3'
# read back as 32 bits of hex, then two 12 bit unsigned integers
start_code, width, height = s.readlist('hex:32, 2*uint:12')
# Skip some bits then peek at next bit value
s.pos += 4
if s.peek(1):
    flags = s.read(9)
You can use standard slice notation to slice, delete, reverse, overwrite, etc. at the bit level, and there are bit level find, replace, split etc. functions. Different endiannesses are also supported.
# Replace every '1' bit by 3 bits
s.replace('0b1', '0b001')
# Find all occurrences of a bit sequence
bitposlist = list(s.findall('0b01000'))
# Reverse bits in place
s.reverse()
The full documentation is here.
It's an often-asked question. There's an ASPN Cookbook entry on it that has served me in the past.
And there is an extensive page of requirements one person would like to see from a module doing this.
