Writing bytes without padding - python

Given these two numbers (milliseconds since the Unix epoch), 1656773855233 and 1656773888716, I'm trying to write them to a binary object. The kicker is that I want to write them to an 11-byte object.
If I accept that the timestamps can only represent dates before September 7th, 2039, each number should fit within 41 bits. Two 41-bit numbers (82 bits) fit within 88 bits (11 bytes), with 6 bits to spare.
I've tried the following Python code as a (rather naive) way to achieve this:
receive_time = 1656773855233
event_time = 1656773888716
with open("test.bin", "wb") as f:
    f.write(receive_time.to_bytes(6, "big"))
    f.write(event_time.to_bytes(6, "big"))
print(f"{receive_time:08b}")
print(f"{event_time:08b}")
with open("test.bin", "rb") as f:
    a = f.read()
print(" ".join(f"{byte:08b}" for byte in a))
The output of the code very clearly shows that the test.bin file is 12 bytes, because each timestamp is padded with seven leading 0 bits when converted to 6 bytes.
How would I go about writing the timestamps to test.bin without padding them individually?
The end result should be something like 0000001100000011011111101101010110010000000000111000000110111111011010110100101011001100 or 3037ED5900381BF6B4ACC
EDIT: The question originally stated that a 41-bit number could only represent dates before May 15th, 2109, if that number represented milliseconds since the Unix epoch. That is wrong: as interjay correctly pointed out, a 41-bit number only covers 69.7 years, so the maximum date is September 7th, 2039. A 42-bit number, however, can represent dates before May 15th, 2109.
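The date limits mentioned in the edit can be checked with a few lines of Python. This is only a minimal sketch; the helper name max_date is made up for illustration and is not part of the question.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def max_date(bits):
    """Return the first instant that no longer fits in `bits` bits of milliseconds."""
    return EPOCH + timedelta(milliseconds=2 ** bits)

limit_41 = max_date(41)
limit_42 = max_date(42)
print(limit_41.date(), limit_42.date())

# Both sample timestamps need exactly 41 bits.
print((1656773855233).bit_length(), (1656773888716).bit_length())
```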

Use a bit shift to combine both times into the same int, before converting that to bytes:
x = (receive_time << 41) + event_time
b = x.to_bytes(11, 'big')
with open("test.bin", "wb") as f:
    f.write(b)
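Reading the values back is the same shift in reverse. The following is a sketch under the assumption that both timestamps fit in 41 bits; the names pack_pair, unpack_pair, TIMESTAMP_BITS and MASK are illustrative and do not come from the answer above.

```python
TIMESTAMP_BITS = 41
MASK = (1 << TIMESTAMP_BITS) - 1  # 41 one-bits

def pack_pair(receive_time, event_time):
    """Pack two 41-bit values into 11 bytes (6 unused bits at the top)."""
    if not (0 <= receive_time <= MASK and 0 <= event_time <= MASK):
        raise ValueError("timestamp does not fit in 41 bits")
    combined = (receive_time << TIMESTAMP_BITS) | event_time
    return combined.to_bytes(11, "big")

def unpack_pair(data):
    """Inverse of pack_pair: split the 88-bit integer back into two values."""
    combined = int.from_bytes(data, "big")
    return combined >> TIMESTAMP_BITS, combined & MASK

packed = pack_pair(1656773855233, 1656773888716)
print(len(packed), unpack_pair(packed))
```

The range check matters: a value wider than 41 bits would silently spill into the neighbouring field.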

Related

How can I densely store large numbers in a file?

I need to store and handle huge amounts of very long numbers, ranging from 0 to a value of 64 hex digits, all f (ffffffffff.....ffff).
If I store these numbers in a text file, I need 1 byte for each character (digit) + 2 bytes for the \n symbol = up to 66 bytes. However, to represent all possible numbers we need no more than 34 bytes (4 bits represent the digits from 0 to f, therefore 4 [bits] * 64 [number of hex digits] / 8 [bits in a byte] = 32 bytes, plus \n, of course).
Is there any way to store the numbers without consuming excess space?
So far I have created a converter from hex (16 possible values per symbol) to a base-76 number (hex digits + all letters and some other symbols), which reduces the size of a number to 41 + 2 bytes.
You are trying to store numbers that are 32 bytes long. Why not just store them as binary numbers? That way you need to store only 32 bytes per number instead of 41 or whatever. You can add on all sorts of quasi-compression schemes to take advantage of things like most of your numbers being shorter than 32 bytes.
If your number is a string, convert it to an int first. Python3 ints are basically infinite precision, so you will not lose any information:
>>> num = '113AB87C877AAE3790'
>>> num = int(num, 16)
>>> num
317825918024297625488
Now you can convert the result to a byte array and write it to a file opened for binary writing:
with open('output.bin', 'wb') as file:
    file.write(num.to_bytes(32, byteorder='big'))
The int method to_bytes converts your number to a string of bytes that can be placed in a file. You need to specify the string length and the order. 'big' makes it easier to read a hex dump of the file.
To read the file back, decode it using int.from_bytes in a similar manner:
with open('output.bin', 'rb') as file:
    data = file.read(32)  # named data to avoid shadowing the built-in bytes type
num = int.from_bytes(data, byteorder='big')
Remember to always include the b in the file mode, or you may run into unexpected problems if you try to read or write data with codes for \n in it.
Both the read and write operation can be looped as a matter of course.
If you anticipate storing an even distribution of numbers, then see Mad Physicist's answer. However, if you anticipate storing mostly small numbers but need to be able to store a few large numbers, then these schemes may also be useful.
If you only need to account for integers that are 255 or fewer bytes (2040 or fewer bits) in length, then simply convert the int to a bytes object and store the length in an additional byte, like this:
# This was only tested with non-negative integers!
def encode(num):
    assert isinstance(num, int)
    # Convert the number to a byte array and strip away leading null bytes.
    # You can also use byteorder="little" and rstrip.
    # If the integer does not fit into 255 bytes, an OverflowError will be raised.
    encoded = num.to_bytes(255, byteorder="big").lstrip(b'\0')
    # Return the length of the integer in the first byte, followed by the encoded integer.
    return bytes([len(encoded)]) + encoded
def encode_many(nums):
    return b''.join(encode(num) for num in nums)
def decode_many(byte_array):
    assert isinstance(byte_array, bytes)
    result = []
    start = 0
    while start < len(byte_array):
        # The first byte contains the length of the integer.
        int_length = byte_array[start]
        # Read int_length bytes and decode them as int.
        new_int = int.from_bytes(byte_array[(start+1):(start+int_length+1)], byteorder="big")
        # Add the new integer to the result list.
        result.append(new_int)
        start += int_length + 1
    return result
To store integers of (practically) infinite length, you can use this scheme, based on variable-length quantities in the MIDI file format. First, the rules:
A byte has eight bits (for those who don't know).
In each byte except the last, the left-most bit (the highest-order bit) will be 1.
The lower seven bits (i.e. all bits except the left-most bit) in each byte, when concatenated together, form an integer with a variable number of bits.
Here are a few examples:
0 in binary is 00000000. It can be represented in one byte without modification as 00000000.
127 in binary is 01111111. It can be represented in one byte without modification as 01111111.
128 in binary is 10000000. It must be converted to a two-byte representation: 10000001 00000000. Let's break that down:
The left-most bit in the first byte is 1, which means that it is not the last byte.
The left-most bit in the second byte is 0, which means that it is the last byte.
The lower seven bits in the first byte are 0000001, and the lower seven bits in the second byte are 0000000. Concatenate those together, and you get 00000010000000, which is 128.
173249806138790 in binary is 100111011001000111011101001001101111110110100110.
To store it:
First, split the binary number into groups of seven bits: 0100111 0110010 0011101 1101001 0011011 1111011 0100110 (a leading 0 was added)
Then, add a 1 in front of each byte except the last, which gets a 0: 10100111 10110010 10011101 11101001 10011011 11111011 00100110
To retrieve it:
First, drop the first bit of each byte: 0100111 0110010 0011101 1101001 0011011 1111011 0100110
You are left with an array of seven-bit segments. Join them together: 100111011001000111011101001001101111110110100110
When that is converted to decimal, you get 173,249,806,138,790.
Why, you ask, do we make the left-most bit in the last byte of each number a 0? Well, doing that allows you to concatenate multiple numbers together without using line breaks. When writing the numbers to a file, just write them one after another. When reading the numbers from a file, use a loop that builds an array of integers, ending each integer whenever it detects a byte where the left-most bit is 0.
Here are two functions, encode and decode, which convert between int and bytes in Python 3.
# Important! These methods only work with non-negative integers!
def encode(num):
    assert isinstance(num, int)
    # If the number is 0, then just return a single null byte.
    if num <= 0:
        return b'\0'
    # Otherwise...
    result_bytes_reversed = []
    while num > 0:
        # Find the right-most seven bits in the integer.
        current_seven_bit_segment = num & 0b1111111
        # Change the left-most bit to a 1.
        current_seven_bit_segment |= 0b10000000
        # Add that to the result array.
        result_bytes_reversed.append(current_seven_bit_segment)
        # Chop off the right-most seven bits.
        num = num >> 7
    # Change the left-most bit in the lowest-order byte (which is first in the list) back to a 0.
    result_bytes_reversed[0] &= 0b1111111
    # Un-reverse the order of the bytes and convert the list into a byte string.
    return bytes(reversed(result_bytes_reversed))
def decode(byte_array):
    assert isinstance(byte_array, bytes)
    result = 0
    for part in byte_array:
        # Shift the result over by seven bits.
        result = result << 7
        # Add in the right-most seven bits from this part.
        result |= (part & 0b1111111)
    return result
Here are two functions for working with lists of ints:
def encode_many(nums):
    # Join the encoded numbers into one byte string, which is what decode_many expects.
    return b''.join(encode(num) for num in nums)
def decode_many(byte_array):
    parts = []
    # Split the byte array after each byte where the left-most bit is 0.
    start = 0
    for i, b in enumerate(byte_array):
        # Check whether the left-most bit in this byte is 0.
        if not (b & 0b10000000):
            # Copy everything up to here into a new part.
            parts.append(byte_array[start:(i+1)])
            start = i + 1
    return [decode(part) for part in parts]
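When the numbers live in a file, they can also be decoded as a stream instead of loading the whole byte string first. The sketch below follows the same seven-bit scheme; the names encode_vlq and iter_vlq are made up here, and io.BytesIO merely stands in for a file opened in 'rb' mode.

```python
import io

def encode_vlq(num):
    """Encode a non-negative int, most significant seven-bit group first."""
    if num < 0:
        raise ValueError("only non-negative integers are supported")
    groups = [num & 0x7F]              # last byte: left-most bit stays 0
    num >>= 7
    while num:
        groups.append((num & 0x7F) | 0x80)  # earlier bytes: left-most bit is 1
        num >>= 7
    return bytes(reversed(groups))

def iter_vlq(stream):
    """Yield integers from a binary stream, reading one byte at a time."""
    value = 0
    pending = False                    # True while in the middle of a number
    while True:
        chunk = stream.read(1)
        if not chunk:
            if pending:
                raise ValueError("stream ended in the middle of a number")
            return
        byte = chunk[0]
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            pending = True             # more bytes follow for this number
        else:
            yield value                # left-most bit 0 marks the last byte
            value = 0
            pending = False

numbers = [0, 127, 128, 173249806138790]
stream = io.BytesIO(b"".join(encode_vlq(n) for n in numbers))
decoded = list(iter_vlq(stream))
print(decoded)
```

Single-byte reads are slow on unbuffered streams; files returned by open() are buffered by default, so this is usually acceptable.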
The densest possible way without knowing more about the numbers would be 256 bits per number (32 bytes).
You can store them right after one another.
A function to write to a file might look like this:
def write_numbers(numbers, file):
    for n in numbers:
        file.write(n.to_bytes(32, 'big'))
with open('file_name', 'wb') as f:
    write_numbers(get_numbers(), f)
And to read the numbers, you can make a function like this:
def read_numbers(file):
    while True:
        read = file.read(32)
        if not read:
            break
        yield int.from_bytes(read, 'big')
with open('file_name', 'rb') as f:
    for n in read_numbers(f):
        do_stuff(n)

Problems parsing binary data

From a simulation tool I get a binary file containing some measurement points. What I need to do is: parse the measurement values and store them in a list.
According to the documentation of the tool, the data structure of the file looks like this:
First 16 bytes are always the same:
Bytes 0 - 7     char[8]         Header
Byte 8          unsigned char   Version
Byte 9          unsigned char   Byte-order (0 for little endian)
Bytes 10 - 11   unsigned short  Record size
Bytes 12 - 15   char[4]         Reserved
The quantities are following: (for example one double and one float):
Bytes 16 - 23   double   Value of quantity one
Bytes 24 - 27   float    Value of quantity two
Bytes 28 - 35   double   Next value of quantity one
Bytes 36 - 39   float    Next value of quantity two
I also know that the encoding is little endian.
In my use case there are two quantities, but both of them are floats.
My code so far looks like this:
def parse(self, filePath):
    infoFilePath = filePath + '.info'
    quantityList = self.getQuantityList(infoFilePath)
    blockSize = 0
    for quantity in quantityList:
        blockSize += quantity.bytes
    with open(filePath, 'r') as ergFile:
        # read the first 16 bytes, as they are not needed now
        ergFile.read(16)
        # now read the rest of the file block wise
        block = ergFile.read(blockSize)
        while len(block) == blockSize:
            for q in quantityList:
                q.values.append(np.fromstring(block[:q.bytes], q.dataType)[0])
                block = block[q.bytes:]
            block = ergFile.read(blockSize)
    return quantityList
QuantityList comes from a previous function and contains the quantity structure. Each quantity has a name, dataType, lenOfBytes called bytes and a prepared list for the values called values.
So in my use case there are two quantities with:
dataType = "<f"
bytes = 4
values=[]
After the parse function has finished, I plot the first quantity with matplotlib. As the attached images show, something went wrong during parsing.
My parsed values: (image not included)
The reference: (image not included)
But I am not able to find my mistake.
I was able to solve my problem this morning.
The solution couldn't be any easier.
I changed
...
with open(ergFilePath, 'r') as ergFile:
...
to:
...
with open(ergFilePath, 'rb') as ergFile:
...
Notice the change of mode from 'r' to 'rb'.
The Python documentation made things clear for me:
Thus, when opening a binary file, you should append 'b' to the mode
value to open the file in binary mode, which will improve portability.
(Appending 'b' is useful even on systems that don’t treat binary and
text files differently, where it serves as documentation.)
So the final parsed values look like this: (image: final values)
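For a fixed layout like this one, the struct module can unpack the records directly once the data has been read in binary mode. The following is only a sketch: it assumes the 16-byte header described in the question and two little-endian floats per record, and the names HEADER, RECORD and parse_measurements are made up for illustration.

```python
import struct

HEADER = struct.Struct("<8sBBH4s")  # 16 bytes: header, version, byte order, record size, reserved
RECORD = struct.Struct("<ff")       # assumption: two little-endian floats per record

def parse_measurements(data):
    """Split the raw file content into two lists of float values."""
    _magic, _version, byte_order, _record_size, _reserved = HEADER.unpack_from(data, 0)
    if byte_order != 0:
        raise ValueError("only little-endian files are handled in this sketch")
    body = data[HEADER.size:]
    usable = len(body) - len(body) % RECORD.size  # ignore a trailing partial record
    first, second = [], []
    for a, b in RECORD.iter_unpack(body[:usable]):
        first.append(a)
        second.append(b)
    return first, second

# Synthetic file content for the demo; a real file would be read with open(path, 'rb').
sample = (HEADER.pack(b"HEADER00", 1, 0, RECORD.size, b"\0" * 4)
          + RECORD.pack(1.5, -2.25)
          + RECORD.pack(0.5, 10.0))
values_one, values_two = parse_measurements(sample)
print(values_one, values_two)
```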

Python writing binary

I use Python 3.
I tried to write binary data to a file opened in r+b mode:
for bit in binary:
    fileout.write(bit)
where binary is a list that contains numbers.
How do I write this to the file in binary?
The resulting file has to look like
b'\x07\x08\x07\
Thanks
When you open a file in binary mode, then you are essentially working with the bytes type. So when you write to the file, you need to pass a bytes object, and when you read from it, you get a bytes object. In contrast, when opening the file in text mode, you are working with str objects.
So, writing “binary” is really writing a bytes string:
with open(fileName, 'br+') as f:
    f.write(b'\x07\x08\x07')
If you have actual integers you want to write as binary, you can use the bytes function to convert a sequence of integers into a bytes object:
>>> lst = [7, 8, 7]
>>> bytes(lst)
b'\x07\x08\x07'
Combining this, you can write a sequence of integers as a bytes object into a file opened in binary mode.
As Hyperboreus pointed out in the comments, bytes will only accept a sequence of numbers that actually fit in a byte, i.e. numbers between 0 and 255. If you want to store arbitrary (positive) integers in the way they are, without having to bother about knowing their exact size (which is required for struct), then you can easily write a helper function which splits those numbers up into separate bytes:
def splitNumber (num):
    lst = []
    while num > 0:
        lst.append(num & 0xFF)
        num >>= 8
    return lst[::-1]
bytes(splitNumber(12345678901234567890))
# b'\xabT\xa9\x8c\xeb\x1f\n\xd2'
So if you have a list of numbers, you can easily iterate over them and write each into the file; if you want to extract the numbers individually later you probably want to add something that keeps track of which individual bytes belong to which numbers.
with open(fileName, 'br+') as f:
    for number in numbers:
        f.write(bytes(splitNumber(number)))
where binary is a list that contain numbers
A number can have one thousand and one different binary representations (endianness, width, ones' complement, two's complement, floats of different precision, etc.). So first you have to decide in which representation you want to store your numbers. Then you can use the struct module to do so.
For example the byte sequence 0x3480 can be interpreted as 32820 (little-endian unsigned short), or -32716 (little-endian signed short) or 13440 (big-endian short).
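These three readings of the same two bytes can be reproduced with struct.unpack, using explicit byte-order prefixes so the result does not depend on the machine. The variable names are illustrative only.

```python
import struct

raw = bytes([0x34, 0x80])  # the byte sequence 0x34 0x80

(le_unsigned,) = struct.unpack("<H", raw)  # little-endian unsigned short
(le_signed,) = struct.unpack("<h", raw)    # little-endian signed short
(be_signed,) = struct.unpack(">h", raw)    # big-endian signed short

print(le_unsigned, le_signed, be_signed)
```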
Small example:
#! /usr/bin/python3
import struct
binary = [1234, 5678, -9012, -3456]
with open('out.bin', 'wb') as f:
    for b in binary:
        f.write(struct.pack('h', b)) #or whatever format you need
with open('out.bin', 'rb') as f:
    content = f.read()
    for b in content:
        print(b)
    print(struct.unpack('hhhh', content)) #same format as above
prints
210
4
46
22
204
220
128
242
(1234, 5678, -9012, -3456)

In Python, read chunks of a file as decimal numbers

My input files could be arbitrary, and so I will use
f = open("in-file", 'rb')
The chunk size is about 4K Bytes, and so I will use
f.read(4096)
What I want to do is read the file chunk by chunk.
Moreover, as a chunk is actually a $2^{15}$-bit (4 KB) sequence, when reading a chunk I need to transform it into a decimal value for further computation.
For example, if the first chunk is of the form 0000...10, I want another variable to hold the corresponding decimal value, e.g., x=2.
From "Convert string to list of bits and viceversa" I know that the following code can help me read the file chunk by chunk.
def tobits(s):
    result = []
    for c in s:
        bits = bin(ord(c))[2:]
        bits = '00000000'[len(bits):] + bits
        result.extend([int(b) for b in bits])
    return result
However, I don't know how to transform the output list into decimal value. Could someone give me some sample code? Thank you.
By referencing http://code.activestate.com/recipes/510399-byte-to-hex-and-hex-to-byte-string-conversion/ I found that the following code will probably run faster, because there seems to be no arithmetic involved.
def ByteToHex( byteStr ):
    return ''.join( [ "%02X " % ord( x ) for x in byteStr ] ).strip()
Therefore, the task of, for example, reading 2-byte chunks as decimal numbers can be accomplished by the following code:
in_file=open("in-file", "rb")
piece = in_file.read(2)
a=ByteToHex(piece)
a=int(a,16)
If I understand the question right, you want something like the following:
def bytes_to_long(bytes):
    result = 0L  # Python 2 long literal
    for c in bytes:
        result *= 256
        result += ord(c)
    return result
That said, this is likely to be somewhat slow: a 4 kB chunk makes a fairly big long, and a lot of garbage intermediate longs are going to be created. You could probably improve this by using struct.unpack() and processing more than one byte per iteration, but then you have to deal with the right endianness and everything. On Python 3 you also probably don't need the ord(), since IO methods on binary files return the bytes type.
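On Python 3, the built-in int.from_bytes does this conversion in one call. Here is a minimal sketch, assuming the chunk should be read as a big-endian unsigned number; the function name chunks_as_ints is made up, and io.BytesIO stands in for a file opened with 'rb'.

```python
import io

def chunks_as_ints(stream, chunk_size=4096):
    """Yield each chunk of the stream as one (possibly very large) integer."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Big-endian: the first byte of the chunk is the most significant.
        yield int.from_bytes(chunk, "big")

demo = io.BytesIO(b"\x00\x02\x01\x00")
values = list(chunks_as_ints(demo, chunk_size=2))
print(values)
```

The last chunk of a real file may be shorter than chunk_size, in which case it is simply converted as a shorter number.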

How to read and extract data from a binary data file with multiple variable-length records?

Using Python (3.1 or 2.6), I'm trying to read data from binary data files produced by a GPS receiver. Data for each hour is stored in a separate file, each of which is about 18 MiB. The data files have multiple variable-length records, but for now I need to extract data from just one of the records.
I've got as far as being able to decode, somewhat, the header. I say somewhat because some of the numbers don't make sense, but most do. After spending a few days on this (I've started learning to program using Python), I'm not making progress, so it's time to ask for help.
The reference guide gives me the message header structure and the record structure. Headers can be variable length but are usually 28 bytes.
Header
Field #  Field Name    Field Type  Desc                 Bytes  Offset
1        Sync          char        Hex 0xAA             1      0
2        Sync          char        Hex 0x44             1      1
3        Sync          char        Hex 0x12             1      2
4        Header Lgth   uchar       Length of header     1      3
5        Message ID    ushort      Message ID of log    2      4
8        Message Lgth  ushort      Length of message    2      8
11       Time Status   enum        Quality of GPS time  1      13
12       Week          ushort      GPS week number      2      14
13       Milliseconds  GPSec       Time in ms           4      16
Record
Field #  Data                        Bytes  Format   Units    Offset
1        Header                                               0
2        Number of SV Observations   4      integer  n/a      H
         *For first SV Observation*
3        PRN                         4      integer  n/a      H+4
4        SV Azimuth angle            4      float    degrees  H+8
5        SV Elevation angle          4      float    degrees  H+12
6        C/N0                        8      double   db-Hz    H+16
7        Total S4                    8      double   n/a      H+24
...
27       L2 C/N0                     8      double   db-Hz    H+148
28       *For next SV Observation*
SV Observation is satellite - there could be anywhere from 8 to 13 in view.
Here's my code for trying to make sense of the header:
import struct
filename = "100301_110000.nvd"
f = open(filename, "rb")
s = f.read(28)
x, y, z, lgth, msg_id, mtype, port, mlgth, seq, idletime, timestatus, week, millis, recstatus, reserved, version = struct.unpack("<cccBHcBHHBcHLLHH", s)
print(x, y, z, lgth, msg_id, mtype, port, mlgth, seq, idletime, timestatus, week, millis, recstatus, reserved, version)
It outputs:
b'\xaa' b'D' b'\x12' 28 274 b'\x02' 32 1524 0 78 b'\xa0' 1573 126060000 10485760 3545 35358
The 3 sync fields should return 0xAA 0x44 0x12. (D is the ASCII equivalent of 0x44, I assume.)
The record ID for which I'm looking is 274 - that seems correct.
GPS week is returned as 1573 - that seems correct.
Milliseconds is returned as 126060000 - I was expecting 126015000.
How do I go about finding the records identified as 274 and extracting them? (I'm learning Python, and programming, so keep in mind the answer you give an experienced coder might be over my head.)
You have to read in pieces. Not because of memory constraints, but because of the parsing requirements. 18 MiB fits in memory easily. On a 4 GB machine it fits in memory 200 times over.
Here's the usual design pattern.
1. Read the first 4 bytes only. Use struct to unpack just those bytes.
2. Confirm the sync bytes and get the header length.
   - If you want the rest of the header, you know the length, so read the rest of the bytes.
   - If you don't want the header, use seek to skip past it.
3. Read the first four bytes of a record to get the number of SV Observations. Use struct to unpack it.
4. Do the math and read the indicated number of bytes to get all the SV Observations in the record.
5. Unpack them and do whatever it is you're doing.
I strongly suggest building namedtuple objects from the data before doing anything else with it.
If you want all the data, you have to actually read all the data.
"and without reading an 18 MiB file one byte at a time)?" I don't understand this constraint. You have to read all the bytes to get all the bytes.
You can use the length information to read the bytes in meaningful chunks. But you can't avoid reading all the bytes.
Also, lots of reads (and seeks) are often fast enough. Your OS buffers for you, so don't worry about trying to micro-optimize the number of reads.
Just follow the "read length -- read data" pattern.
18 MB should fit comfortably in memory, so I'd just gulp the whole thing into one big string of bytes with a single with open(thefile, 'rb') as f: data = f.read() and then perform all the "parsing" on slices to advance record by record. It's more convenient, and may well be faster than doing many small reads from here and there in the file (though it doesn't affect the logic below, because in either case the "current point of interest in the data" is always moving [[always forward, as it happens]] by amounts computed based on the struct-unpacking of a few bytes at a time, to find the lengths of headers and records).
Given the "start of a record" offset, you can determine its header's length by looking at just one byte ("field four", offset 3 from start of header that's the same as start of record) and look at message ID (next field, 2 bytes) to see if it's the record you care about (so a struct unpack of just those 3 bytes should suffice for that).
Whether it's the record you want or not, you next need to compute the record's length (either to skip it or to get it all). For that, you compute the start of the actual record data (start of record plus length of header) and multiply the next field of the record (the 4 bytes right after the header, the number of observations) by the length of an observation (32 bytes if I read you correctly).
This way you either isolate the substring to be given to struct.unpack (when you've finally reached the record you want), or just add the total length of header + record to the "start of record" offset, to get the offset for the start of the next record.
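The "walk from record to record" idea can be sketched as follows. Note that this variant takes the body length from the header's Message Lgth field (offset 8) instead of multiplying the observation count, and it assumes that this field counts only the bytes after the header. The trailer_size parameter is a hypothetical knob for formats that append a checksum after each message; the reference guide quoted in the question does not say whether one exists. Record, iter_records and make_message are illustrative names.

```python
import struct
from collections import namedtuple

SYNC = b"\xaa\x44\x12"
# Hypothetical container for illustration; not taken from the reference guide.
Record = namedtuple("Record", "message_id header_length body")

def iter_records(data, trailer_size=0):
    """Yield one Record per message found in data (a bytes object)."""
    offset = 0
    while offset + 10 <= len(data):
        if data[offset:offset + 3] != SYNC:
            raise ValueError(f"no sync bytes at offset {offset}")
        # Header length (1 byte at offset 3) and message ID (2 bytes at offset 4).
        header_length, message_id = struct.unpack_from("<BH", data, offset + 3)
        if header_length < 10:
            raise ValueError(f"implausible header length at offset {offset}")
        # Message length (2 bytes at offset 8); assumed to count the body only.
        (message_length,) = struct.unpack_from("<H", data, offset + 8)
        body_start = offset + header_length
        body_end = body_start + message_length
        yield Record(message_id, header_length, data[body_start:body_end])
        offset = body_end + trailer_size

def make_message(message_id, body, header_length=28):
    """Build a fake message for the demo below (test data only)."""
    fixed = SYNC + struct.pack("<BHxxH", header_length, message_id, len(body))
    return fixed + bytes(header_length - len(fixed)) + body

data = make_message(42, b"abc") + make_message(274, struct.pack("<l", 9))
wanted = [r for r in iter_records(data) if r.message_id == 274]
# The first 4 bytes of the body hold the number of SV observations.
(observation_count,) = struct.unpack_from("<l", wanted[0].body, 0)
print(len(wanted), observation_count)
```

With real data, the whole file would first be read with open(filename, 'rb') and passed in as data.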
Apart from writing a parser that correctly reads the file, you may try a somewhat brute-force approach...read the data to the memory and split it using the 'Sync' sentinel. Warning - you might get some false positives. But...
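with open('filename', 'rb') as f:  # binary mode, so the data is read as bytes
    data = f.read()
messages = data.split(b'\xaa\x44\x12')
# After the split, each message starts at the header-length byte (offset 3),
# so the 2-byte message ID sits at indices 1-2; 274 is b'\x12\x01' in little-endian.
mymessages = [msg for msg in messages if len(msg) > 5 and msg[1:3] == b'\x12\x01']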
But it is rather a very nasty hack...
