Reading a binary file with Python without knowing its structure

I have a binary file containing the position of 8000 particles.
I know that each particle value should look like "-24.6151...". I don't know with which precision my program writes the values; I guess it is double precision.
But when I try to read the file with this code:
with open('.//results0epsilon/energybinary/energy_00004.dat', 'br') as f:
    buffer = f.read()
    print ("Lenght of buffer is %d" % len(buffer))
    for i in buffer:
        print(int(i))
I get as output:
Lenght of buffer is 64000
10
168
179
43
...
I have omitted the rest of the list, but as you can see those values are far from what I expect. I think I have some kind of decoding error.
I would appreciate any kind of help :)

What you are printing are the individual bytes that make up your floating-point data, so they make no sense as numerical values.
Of course, there's no 100% sure answer since we didn't see your data, but I'll try to guess:
You have 8000 values to read and the file size is 64000. So you probably have double IEEE values (8 bytes each). If it's not IEEE, then you're toast.
In that case you could try the following:
import struct
with open('.//results0epsilon/energybinary/energy_00004.dat', 'br') as f:
    buffer = f.read()
    print ("Length of buffer is %d" % len(buffer))
    data = struct.unpack("=8000d",buffer)
If the printed data looks bogus, it's probably an endianness problem, so change the "=8000d" format to "<8000d" or ">8000d".
For reference, the packing/unpacking formats are documented at: https://docs.python.org/3/library/struct.html
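If the number of values is not known in advance, it can be derived from the file size instead of being hard-coded. The following is only a sketch: it assumes the file contains nothing but little-endian IEEE 754 doubles (no header), and the file name and sample values are made up so that the example is self-contained.

```python
import os
import struct
import tempfile

# Hypothetical sample data standing in for the particle file.
values = [-24.6151, 0.5, 3.25, 1e-3]
path = os.path.join(tempfile.mkdtemp(), "energy_demo.dat")
with open(path, "wb") as f:
    f.write(struct.pack("<%dd" % len(values), *values))

with open(path, "rb") as f:
    buffer = f.read()

# Assumption: the file holds nothing but 8-byte doubles.
count, leftover = divmod(len(buffer), struct.calcsize("<d"))
if leftover:
    raise ValueError("file size is not a multiple of 8 bytes")

data = struct.unpack("<%dd" % count, buffer)
print(data)
```

If NumPy is available, np.fromfile(path, dtype='<f8') should give the same values as an array.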

Related

Saving and loading bits/bytes in Python

I've been studying compression algorithms recently, and I'm trying to understand how I can store integers as bits in Python to save space.
So first I save '1' and '0' as strings in Python.
import os
import numpy as np
array= np.random.randint(0, 2, size = 200)
string = [str(i) for i in array]
with open('testing_int.txt', 'w') as f:
    for i in string:
        f.write(i)
print(os.path.getsize('testing_int.txt'))
I get back 200 bytes, which makes sense, since each char is represented by one byte in ASCII (and in UTF-8 as well, if the characters are Latin?).
Now if trying to save these ones and zeroes as bits, I should only take up around 25 bytes right?
200 bits/8 = 25 bytes.
However, when I try the following code below, I get 105 bytes.
Am I doing something wrong?
Using the same 'array variable' as above I tried this:
bytes_string = [bytes(i) for i in array]
with open('testing_bytes.txt', 'wb') as f:
    for i in bytes_string:
        f.write(i)
Then I tried this:
bin_string = [bin(i) for i in array]
with open('testing_bin.txt', 'wb') as f:
    for i in bytes_string:
        f.write(i)
This also takes up around 105 bytes.
So I tried looking at the text files, and I noticed that
both the 'bytes.txt' and 'bin.txt' are blank.
So I tried to read the 'bytes.txt' file via this code:
with open(r"C:\Users\Moondra\Desktop\testing_bytes\testing_bytes.txt", 'rb') as f:
    x =f.read()
Now I get back this:
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
So I tried these commands:
>>> int.from_bytes(x, byteorder='big')
0
>>> int.from_bytes(x, byteorder='little')
0
>>>
So apparently I'm doing multiple things incorrectly.
I can't figure out:
1) Why I am not getting a file that is 25 bytes
2) Why I can't read back the bytes file correctly.
Thank you.
bytes_string = [bytes(i) for i in array]
It looks like you expect bytes(x) to give you a one-byte bytes object with the value of x. Follow the documentation, and you'll see that bytes() is initialized like bytearray(), and bytearray() says this about its argument:
If it is an integer, the array will have that size and will be initialized with null bytes.
So bytes(0) gives you an empty bytes object, and bytes(1) gives you a single byte with the ordinal zero. That's why bytes_string is about half the size of array and is made up completely of zero bytes.
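The behaviour described above can be checked directly. This is a small illustrative sketch; the variable names and the five-element list are made up for the demonstration.

```python
# An integer argument is treated as a length, not as a value.
empty = bytes(0)        # zero-length bytes object
one_null = bytes(1)     # one null byte
# A list (or any iterable) of ints is treated as the byte values.
value_one = bytes([1])

print(empty, one_null, value_one)

# Writing 0/1 values with bytes(i) therefore emits one null byte per 1
# and nothing at all per 0.
bits = [0, 1, 1, 0, 1]
written = b"".join(bytes(i) for i in bits)
print(len(written))
```

With 200 random values, roughly half of them are 1, which is consistent with a file of about 100 null bytes.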
As for why the bin() example didn't work, it looks like a simple case of copy-pasting and forgetting to change bytes_string to bin_string in the for loop.
This all still doesn't accomplish your goal of treating 0 or 1 value integers as bits. Python doesn't really have that sort of functionality built in. There are third-party modules that allow you to work at the bit level, but I can't speak to any of them specifically. Personally I would probably just roll my own specific to the application.
It looks like you're trying to bit shift all the values into a single byte. For example, you expect the integer values [0,1,0,1,0,1,0,1] to be packed into a byte that looks like the following binary number: 0b01010101. To do this, you need to use the bitwise shift operator and the bitwise OR operator along with the struct module to pack the values into an unsigned char which represents the sequence of int values you have.
The code below takes the array of random integers in range [0,1] and shifts them together to make a binary number that can be packed into a single byte. I used 256 ints for convenience. The expected file size is then 32 bytes (256/8). You will see that when it is run this is indeed what you get.
import struct
import numpy as np
import os

a = np.random.randint(0, 2, size = 256)
bin_vals = []
# Pack each group of 8 values into one byte, first value in the lowest bit.
for i in range(0, len(a), 8):
    bin_val = (a[i] << 0) | (a[i+1] << 1) | \
              (a[i+2] << 2) | (a[i+3] << 3) | \
              (a[i+4] << 4) | (a[i+5] << 5) | \
              (a[i+6] << 6) | (a[i+7] << 7)
    # int() turns the NumPy integer into a plain Python int for struct
    bin_vals.append(struct.pack('B', int(bin_val)))
with open("output.txt", 'wb') as f:
    for val in bin_vals:
        f.write(val)
print(os.path.getsize('output.txt'))
Please note, however, that this will only work for values of integers in the range [0,1] since if they are bigger it will shift more non-zeros and wreck the structure of the generated byte. The binary number may also exceed 1 byte in size in this case.
It seems like you're just using Python in an attempt to generate an array of bits for demonstration purposes, and to that end I would say that Python probably isn't best suited for this. I would recommend using a lower-level language such as C/C++, which has more direct access to data types than Python does.
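As an alternative to the hand-written shift-and-OR loop, NumPy itself can pack 0/1 values into bytes with np.packbits. This is a hedged sketch rather than part of the answers above; the seeded generator and the size of 200 are chosen only to mirror the question.

```python
import numpy as np

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=200).astype(np.uint8)

# packbits pads the last byte with zeros if len(bits) is not a multiple of 8.
packed = np.packbits(bits)
print(packed.nbytes)

# unpackbits returns whole bytes, so trim back to the original length.
restored = np.unpackbits(packed)[:bits.size]
print(np.array_equal(restored, bits))
```

Note that packbits puts the first value into the most significant bit by default, which is the opposite of the loop above; recent NumPy versions accept bitorder='little' to change that. The packed array can be written with packed.tofile(...) to a file opened in 'wb' mode.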

Problems parsing binary data

From a simulation tool I get a binary file containing some measurement points. What I need to do is: parse the measurement values and store them in a list.
According to the documentation of the tool, the data structure of the file looks like this:
The first 16 bytes are always the same:
Bytes 0 - 7      char[8]          Header
Byte 8           unsigned char    Version
Byte 9           unsigned char    Byte-order (0 for little endian)
Bytes 10 - 11    unsigned short   Record size
Bytes 12 - 15    char[4]          Reserved
The quantities follow (for example one double and one float):
Bytes 16 - 23    double           Value of quantity one
Bytes 24 - 27    float            Value of quantity two
Bytes 28 - 35    double           Next value of quantity one
Bytes 36 - 39    float            Next value of quantity two
I also know that the encoding is little endian.
In my use case there are two quantities, but both of them are floats.
My code so far looks like this:
def parse(self, filePath):
    infoFilePath = filePath+ '.info'
    quantityList = self.getQuantityList(infoFilePath)
    blockSize = 0
    for quantity in quantityList:
        blockSize += quantity.bytes
    with open(filePath, 'r') as ergFile:
        # read the first 16 bytes, as they are not needed now
        ergFile.read(16)
        # now read the rest of the file block wise
        block = ergFile.read(blockSize)
        while len(block) == blockSize:
            for q in quantityList:
                q.values.append(np.fromstring(block[:q.bytes], q.dataType)[0])
                block = block[q.bytes:]
            block = ergFile.read(blockSize)
    return quantityList
quantityList comes from a previous function and contains the quantity structure. Each quantity has a name, a dataType, its length in bytes (called bytes) and a prepared list for the values (called values).
So in my use case there are two quantities with:
dataType = "<f"
bytes = 4
values=[]
After the parse function has finished I plot the first quantity with matplotlib. As you can see from the attached images, something went wrong during the parsing.
[Image: my parsed values]
[Image: the reference]
But I am not able to find my mistake.
I was able to solve my problem this morning.
The solution couldn't be any easier.
I changed
...
with open(ergFilePath, 'r') as ergFile:
...
to:
...
with open(ergFilePath, 'rb') as ergFile:
...
Notice the change of the mode from 'r' to 'rb'.
The Python documentation made things clear for me:
Thus, when opening a binary file, you should append 'b' to the mode
value to open the file in binary mode, which will improve portability.
(Appending 'b' is useful even on systems that don’t treat binary and
text files differently, where it serves as documentation.)
So the final parsed values look like this:
[Image: final values]
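For comparison, the 16-byte header and the fixed-size records described in the question can also be parsed with the struct module. The sketch below is not the poster's code: it assumes two little-endian float32 quantities per record, and the file name, header text and sample values are invented so that it runs on its own.

```python
import os
import struct
import tempfile

HEADER = struct.Struct("<8sBBH4s")   # 16 bytes, layout taken from the question
RECORD = struct.Struct("<ff")        # assumption: two float32 quantities

# Build a small hypothetical file so the sketch is self-contained.
samples = [(1.5, 2.25), (-0.5, 4.0), (8.0, 0.125)]
path = os.path.join(tempfile.mkdtemp(), "demo.erg")
with open(path, "wb") as f:
    f.write(HEADER.pack(b"DEMOFILE", 1, 0, RECORD.size, b"\x00" * 4))
    for pair in samples:
        f.write(RECORD.pack(*pair))

# 'rb' is essential: text mode may translate newline bytes or try to
# decode the data as text.
with open(path, "rb") as f:
    magic, version, byte_order, record_size, _reserved = HEADER.unpack(
        f.read(HEADER.size))
    body = f.read()

usable = len(body) - len(body) % RECORD.size   # ignore a trailing partial record
quantity_one, quantity_two = zip(*RECORD.iter_unpack(body[:usable]))
print(record_size, quantity_one, quantity_two)
```

Whether the "Record size" field of the real tool equals the size of one block of quantities is an assumption here; the tool's documentation would have to confirm it.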

Binary storage of floating point values (between 0 and 1) using less than 4 bytes?

I need to store a massive numpy vector to disk. Right now the vector that I am trying to store is ~2.4 billion elements long and the data is float64. This takes about 18GB of space when serialized out to disk.
If I use struct.pack() and use float32 (4 bytes) I can reduce it to ~9GB. I don't need anywhere near this amount of precision, and disk space is going to quickly become an issue, as I expect the number of values I need to store could grow by an order of magnitude or two.
I was thinking that if I could access the first 4 significant digits I could store those values in an int and only use 1 or 2 bytes of space. However, I have no idea how to do this efficiently. Does anyone have any idea or suggestions?
If your data is between 0 and 1, and 16 bits are enough, you can save the data as uint16:
data16 = (65535 * data).round().astype(np.uint16)
and expand the data with
data = data16 / 65535.0
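A quick round trip shows what this scaling costs in precision. This is only a sketch with made-up random data; because of the rounding step, the absolute error should stay at or below 0.5/65535, which is about 7.6e-6.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(1000)            # hypothetical values in [0, 1)

data16 = (65535 * data).round().astype(np.uint16)   # 2 bytes per value
restored = data16 / 65535.0

max_error = np.abs(restored - data).max()
print(data16.nbytes, max_error)
```

The same idea with uint8 and a factor of 255 would use one byte per value, at the price of an absolute error of up to about 2e-3.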
Generally speaking, I'd recommend against using float16, but for what it's worth, it's quite easy to do.
However, the struct module couldn't convert to/from 16-bit floats before Python 3.6 (which added the 'e' format code).
Therefore, the simplest route is to let numpy do it, with something similar to:
import numpy as np
x = np.linspace(0, 1, 1000)
x = x.astype(np.float16)
# open in binary mode: tofile writes raw bytes
with open('outfile.dat', 'wb') as outfile:
    x.tofile(outfile)
Note that "outfile.dat" is exactly 2000 bytes - two bytes per item. tofile just writes the raw, "packed" binary data to disk. There's no header, etc, and no difference in the output between using it and the struct module.
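Reading such a file back requires knowing the dtype (and, for multi-dimensional data, the shape), because no header is stored. The following sketch round-trips the same 1000 values through a temporary file and measures the error introduced by float16; the temporary path is just for the demonstration.

```python
import os
import tempfile
import numpy as np

x = np.linspace(0, 1, 1000)
path = os.path.join(tempfile.mkdtemp(), "outfile.dat")
x.astype(np.float16).tofile(path)   # raw bytes, no header

size = os.path.getsize(path)
restored = np.fromfile(path, dtype=np.float16).astype(np.float64)
max_error = np.abs(restored - x).max()
print(size, max_error)
```

Unlike the uint16 scaling, the float16 error is relative to the magnitude of each value, so it is largest for values close to 1.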
Use struct.pack() with the f type code to get them into 4-byte packets.

Python writing binary

I use Python 3.
I am trying to write binary data to a file that I opened in r+b mode:
for bit in binary:
    fileout.write(bit)
where binary is a list that contain numbers.
How do I write this to the file as binary?
The resulting file has to look like
b'\x07\x08\x07\
Thanks
When you open a file in binary mode, then you are essentially working with the bytes type. So when you write to the file, you need to pass a bytes object, and when you read from it, you get a bytes object. In contrast, when opening the file in text mode, you are working with str objects.
So, writing “binary” is really writing a bytes string:
with open(fileName, 'br+') as f:
    f.write(b'\x07\x08\x07')
If you have actual integers you want to write as binary, you can use the bytes function to convert a sequence of integers into a bytes object:
>>> lst = [7, 8, 7]
>>> bytes(lst)
b'\x07\x08\x07'
Combining this, you can write a sequence of integers as a bytes object into a file opened in binary mode.
As Hyperboreus pointed out in the comments, bytes will only accept a sequence of numbers that actually fit in a byte, i.e. numbers between 0 and 255. If you want to store arbitrary (positive) integers in the way they are, without having to bother about knowing their exact size (which is required for struct), then you can easily write a helper function which splits those numbers up into separate bytes:
def splitNumber (num):
    lst = []
    while num > 0:
        lst.append(num & 0xFF)
        num >>= 8
    # most significant byte first; 0 becomes a single zero byte
    return lst[::-1] or [0]

bytes(splitNumber(12345678901234567890))
# b'\xabT\xa9\x8c\xeb\x1f\n\xd2'
So if you have a list of numbers, you can easily iterate over them and write each into the file; if you want to extract the numbers individually later you probably want to add something that keeps track of which individual bytes belong to which numbers.
with open(fileName, 'br+') as f:
    for number in numbers:
        f.write(bytes(splitNumber(number)))
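One possible way to keep track of which bytes belong to which number is to prefix each number with its length. The sketch below is an illustration, not part of the answer above: encode_numbers and decode_numbers are hypothetical helper names, the one-byte length prefix is an arbitrary choice, and the built-in int.to_bytes and int.from_bytes replace the manual splitting.

```python
def encode_numbers(numbers):
    """Encode non-negative ints as [1-byte length][big-endian bytes]..."""
    out = bytearray()
    for n in numbers:
        length = max(1, (n.bit_length() + 7) // 8)   # 0 still takes one byte
        if length > 255:
            raise ValueError("number too large for a 1-byte length prefix")
        out.append(length)
        out += n.to_bytes(length, "big")
    return bytes(out)

def decode_numbers(blob):
    """Invert encode_numbers."""
    numbers, pos = [], 0
    while pos < len(blob):
        length = blob[pos]
        numbers.append(int.from_bytes(blob[pos + 1:pos + 1 + length], "big"))
        pos += 1 + length
    return numbers

numbers = [7, 0, 300, 12345678901234567890]
blob = encode_numbers(numbers)
print(blob)
print(decode_numbers(blob))
```

The resulting blob can be written with f.write(blob) to a file opened in binary mode and decoded again after f.read().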
where binary is a list that contain numbers
A number can have one thousand and one different binary representations (endianness, width, one's complement, two's complement, floats of different precision, etc.). So first you have to decide in which representation you want to store your numbers. Then you can use the struct module to do so.
For example the byte sequence 0x3480 can be interpreted as 32820 (little-endian unsigned short), or -32716 (little-endian signed short) or 13440 (big-endian short).
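The three interpretations can be reproduced with struct.unpack by choosing the byte order and signedness explicitly. A minimal check, with variable names chosen for the illustration:

```python
import struct

raw = bytes([0x34, 0x80])

le_unsigned, = struct.unpack("<H", raw)   # little-endian unsigned short
le_signed, = struct.unpack("<h", raw)     # little-endian signed short
be_signed, = struct.unpack(">h", raw)     # big-endian signed short

print(le_unsigned, le_signed, be_signed)
```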
Small example:
#! /usr/bin/python3
import struct

binary = [1234, 5678, -9012, -3456]
with open('out.bin', 'wb') as f:
    for b in binary:
        f.write(struct.pack('h', b)) #or whatever format you need
with open('out.bin', 'rb') as f:
    content = f.read()
    for b in content:
        print(b)
    print(struct.unpack('hhhh', content)) #same format as above
prints
210
4
46
22
204
220
128
242
(1234, 5678, -9012, -3456)

In Python, read chunks of a file as decimal numbers

My input files could be arbitrary, and so I will use
f = open("in-file", 'rb')
The chunk size is about 4K Bytes, and so I will use
f.read(4096)
What I want to do is to read the file chunk by chunk.
Moreover, as a chunk is actually a $2^{15}$-bit (4KB) sequence, when reading a chunk I need to transform it into a decimal (integer) value for further computation.
For example, if the first chunk is of the form 0000...10, what I want is to have another variable keeping the corresponding decimal value, e.g., x=2.
From the question "Convert string to list of bits and viceversa" I know that its code can help me read chunk by chunk.
def tobits(s):
    result = []
    for c in s:
        bits = bin(ord(c))[2:]
        bits = '00000000'[len(bits):] + bits
        result.extend([int(b) for b in bits])
    return result
However, I don't know how to transform the output list into decimal value. Could someone give me some sample code? Thank you.
By referencing http://code.activestate.com/recipes/510399-byte-to-hex-and-hex-to-byte-string-conversion/ I found that the following code will probably run faster, because there seems to be no arithmetic involved.
def ByteToHex( byteStr ):
    return ''.join( [ "%02X " % ord( x ) for x in byteStr ] ).strip()
Therefore, the task of, for example, reading 2-byte chunks as decimal numbers can be accomplished by the following code:
in_file=open("in-file", "rb")
piece = in_file.read(2)
a=ByteToHex(piece)
a=int(a,16)
If I understand the question right, you want something like the following:
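def bytes_to_long(data):
    result = 0  # Python integers grow as needed, so no long literal is required
    for c in data:
        result *= 256
        result += ord(c)  # Python 2 str; on Python 3, c is already an int, so use c directly
    return result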
That said, this is likely to be somewhat slow: a 4 kB chunk makes a fairly big long, and a lot of intermediate garbage objects are going to be created. You could probably improve this by using struct.unpack() and processing more than one byte per iteration, but then you have to deal with the right endianness and everything. On Python 3 you also probably don't need the ord(), since iterating over the bytes objects returned by IO methods already yields integers.
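On Python 3, the built-in int.from_bytes does this conversion in a single call. The sketch below is an illustration under stated assumptions: read_chunks_as_ints is a hypothetical helper name, big-endian order is assumed (the first byte is the most significant), and an in-memory stream stands in for the real file.

```python
import io

def read_chunks_as_ints(stream, chunk_size=4096, byteorder="big"):
    """Yield each chunk of the stream as one non-negative integer."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The final chunk may be shorter than chunk_size.
        yield int.from_bytes(chunk, byteorder)

# Hypothetical in-memory "file"; open("in-file", "rb") would work the same way.
stream = io.BytesIO(b"\x00\x02" + b"\x01\x00" + b"\xff")
values = list(read_chunks_as_ints(stream, chunk_size=2))
print(values)
```

With chunk_size=4096 each yielded integer can have up to 32768 bits, which Python handles natively.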
