I am trying to unpack a specific piece of data from an encoded file. The data is of type int32.
My approach was to read each line of the file and, if a section of that line matches the size of an int32 (4 bytes), unpack it and check whether it is the 'BTH0' identifier:
with open(r2sFile2, encoding="latin-1") as datafile:
    for line in datafile:
        for i in range(0, len(line)):
            text = line.encode('latin-1')
            fmt = text[0: i+1]
            print(sys.getsizeof(fmt))
            if (sys.getsizeof(fmt) == 4):
                PacketSize = struct.unpack('<I', fmt)
                if PacketSize == 'BTH0':
                    print("Identifier Found")
I am encountering one problem so far: sys.getsizeof(fmt) [which I expected to give the size of the line segment] always returns a size higher than the required size [4]. Maybe I have to convert every byte object into an int32 (if possible)?
You cannot open a file using an encoding and also read raw bytes from the same stream. Specifying an encoding converts those bytes into Unicode code points. You want to open the file in binary mode so that it returns bytes instead of Unicode characters. You should also be reading fixed-size data chunks from the file instead of reading line by line, which doesn't really make sense in binary mode:
with open('/path/to/file.dat', 'rb') as f:
    val = struct.unpack('<I', f.read(4))[0]  # unpack returns a tuple, so take the first element
    if val == 123:
        ...
Also, 'BTH0' is a string, not an unsigned integer, so it is never going to compare equal to the value unpacked from your binary file with the format string <I, which will always be an unsigned integer. (Note too that sys.getsizeof reports the memory footprint of the whole Python object, including interpreter overhead, not the number of data bytes it holds, which is why it always comes out larger than 4; use len(fmt) for the byte count.)
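If the goal is just to detect the BTH0 marker, one option is to compare raw bytes rather than unpacked integers; a minimal sketch, assuming the identifier occupies the first 4 bytes of the file (the path is hypothetical):

import struct

with open('/path/to/file.dat', 'rb') as f:
    chunk = f.read(4)
    if chunk == b'BTH0':
        print("Identifier Found")
    # or, equivalently, compare against the integer that b'BTH0' unpacks to
    if struct.unpack('<I', chunk)[0] == struct.unpack('<I', b'BTH0')[0]:
        print("Identifier Found")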
Related
I need to split a .bin file into chunks. However, I seem to face a problem when it comes to writing the output to the split/new binary file. The output is inconsistent: I can see the data, but there are shifts and gaps everywhere when comparing the split binary with the bigger original one.
def hash_file(filename: str, blocksize: int = 4096) -> str:
    blocksCount = 0
    with open(filename, "rb") as f:
        while True:
            # Read a new chunk from the binary file
            full_string = f.read(blocksize)
            if not full_string:
                break
            new_string = ' '.join('{:02x}'.format(b) for b in full_string)
            split_string = ''.join(chr(int(i, 16)) for i in new_string.split())
            # Append the split chunk to the new binary file
            newf = open("SplitBin.bin", "a", encoding="utf-8")
            newf.write(split_string)
            newf.close()
            # Check if the desired number of mem blocks has been reached
            blocksCount = blocksCount + 1
            if blocksCount == 1:
                break
For characters with ordinals between 0 and 0x7F, their UTF-8 representation is the same as their byte value. But for characters with ordinals between 0x80 and 0xFF, UTF-8 outputs two bytes, so the output no longer lines up byte-for-byte with the input. That's why you're seeing inconsistencies.
The easiest way to fix it would be to open the output file in binary mode as well. Then you can eliminate all the formatting and splitting, because you can directly write the data you just read:
with open("SplitBin.bin", "ab") as newf:
newf.write(full_string)
Note that reopening the file each time you write to it will be very slow. Better to leave it open until you're done.
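Putting both points together, a sketch of the whole loop might look like this (the filenames and block size are taken from the question; the function name is just illustrative):

def split_file(filename: str, blocksize: int = 4096, blocks: int = 1) -> None:
    # Open both files in binary mode, once, outside the loop
    with open(filename, "rb") as f, open("SplitBin.bin", "wb") as newf:
        for _ in range(blocks):
            chunk = f.read(blocksize)
            if not chunk:
                break
            newf.write(chunk)  # write the raw bytes, no re-encoding needed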
Related

I have a binary file that holds the saved values of a 2D array.
All values are saved in double format (8 bytes).
The data is written to the file row by row.
I want to read the file as fast as possible without knowing how many rows the file has.
I am doing it this way, but I was wondering if there is a faster method than this:
with open("myfile", "rb") as f:
byte = f.read(8)
while byte != "":
# Do stuff with byte.
byte = f.read(8)
with open("myfile", "rb") as f:
for i in f:
#i is now your line, this only gathers it once.
By the way, your code is faulty: the reason you're asking for it to be faster is that you've stuck yourself in an infinite loop. In binary mode, f.read(8) returns b"" at end of file, which never compares equal to the str "", so the while condition never becomes false.
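A sketch of a loop that terminates correctly and actually decodes the doubles might look like this (native byte order is assumed for the 'd' format; use '<d' or '>d' if the file came from a machine with different endianness):

import struct

with open("myfile", "rb") as f:
    while True:
        chunk = f.read(8)
        if not chunk:  # b"" at end of file
            break
        (value,) = struct.unpack("d", chunk)
        # Do stuff with value.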
Related

I have a very large big-endian binary file. I know how many numbers are in this file. I found a solution for reading a big-endian file using struct, and it works perfectly if the file is small:
data = []
file = open('some_file.dat', 'rb')
for i in range(0, numcount):
    data.append(struct.unpack('>f', file.read(4))[0])
But this code works very slowly if the file size is more than ~100 MB.
My current file is 1.5 GB and contains 399,513,600 float numbers. The above code takes about 8 minutes with this file.
I found another solution that works faster:
datafile = open('some_file.dat', 'rb').read()
f_len = ">" + "f" * numcount #numcount = 399513600
numbers = struct.unpack(f_len, datafile)
This code runs in about 1.5 minutes, but that is still too slow for me. Earlier I wrote the same functionality in Fortran and it ran in about 10 seconds.
In Fortran I open the file with a "big-endian" flag and can simply read the file into a REAL array without any conversion, but in Python I have to read the file as a string and convert every 4 bytes into a float using struct. Is it possible to make the program run faster?
You can use numpy.fromfile to read the file, specifying big-endian byte order with > in the dtype parameter:
numpy.fromfile(filename, dtype='>f')
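For example, a minimal usage sketch ('>f' is big-endian float32; the reshape only applies if the row length happens to be known, and row_length here is hypothetical):

import numpy as np

numbers = np.fromfile('some_file.dat', dtype='>f')  # 1-D array of float32 values
print(numbers.size)  # should equal numcount
# data_2d = numbers.reshape(-1, row_length)  # optional, if a row length is known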
There is an array.fromfile method too, but unfortunately I cannot see any way to control endianness with it, so depending on your use case it might avoid the dependency on a third-party library or be useless.
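If avoiding the numpy dependency matters, one possible workaround is to read with array.fromfile in native order and byte-swap afterwards on little-endian machines; a sketch, assuming the whole file is just packed 4-byte floats:

import array
import os
import sys

count = os.path.getsize('some_file.dat') // 4
numbers = array.array('f')
with open('some_file.dat', 'rb') as f:
    numbers.fromfile(f, count)  # reads in native byte order
if sys.byteorder == 'little':
    numbers.byteswap()  # the file is big-endian, so swap to native order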
The following approach gave a good speed up for me:
import struct
import time

block_size = 4096

start = time.time()
with open('some_file.dat', 'rb') as f_input:
    data = []
    while True:
        block = f_input.read(block_size * 4)
        data.extend(struct.unpack('>{}f'.format(len(block) // 4), block))
        if len(block) < block_size * 4:
            break

print("Time taken: {:.2f}".format(time.time() - start))
print("Length", len(data))
Rather than using >fffffff you can specify a count, e.g. >1000f. The file is read 4096 floats (16,384 bytes) at a time; when the final read returns fewer bytes than that, the format string is built from the actual length, so the last partial block is still unpacked before the loop exits.
From the struct - Format Characters documentation:
A format character may be preceded by an integral repeat count. For
example, the format string '4h' means exactly the same as 'hhhh'.
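A quick standalone illustration of the repeat count (unrelated to the file above):

import struct

assert struct.calcsize('4h') == struct.calcsize('hhhh')
assert struct.unpack('>4h', b'\x00\x01\x00\x02\x00\x03\x00\x04') == (1, 2, 3, 4)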
def read_big_endian(filename):
    all_text = ""
    with open(filename, "rb") as template:
        try:
            template.read(2)  # first 2 bytes are FF FE
            while True:
                dchar = template.read(2)
                all_text += chr(dchar[0])  # IndexError on the empty read at EOF ends the loop
        except IndexError:
            pass
    return all_text

def save_big_endian(filename, text):
    with open(filename, "wb") as fic:
        fic.write(b"\xff\xfe")  # first 2 bytes are FF FE
        for letter in text:
            fic.write(letter.encode("latin-1") + b"\x00")
Used to read .rdp files
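For what it's worth, the FF FE prefix is the UTF-16 byte-order mark, so if the files really are UTF-16 text the built-in codec can do the same job; a sketch, not specific to .rdp files:

def read_utf16(filename):
    with open(filename, "r", encoding="utf-16") as f:
        return f.read()

def save_utf16(filename, text):
    with open(filename, "w", encoding="utf-16") as f:
        f.write(text)  # the codec writes the FF FE BOM automatically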
Related

I'm trying to read an Android YUV image represented as a raw byte file:
f = open(self.fn)
self.yuvArray = bytearray(f.read())
I know that the file contains 720K bytes, but self.yuvArray has only 350K.
Moreover, after trying this with multiple files of the same format, all of which are 720K bytes long (verified both by file size and by C# code returning a 720K-sized array), I noticed they all come out at different sizes, around 350K.
I tried to see if it's some kind of compression or something, but couldn't find anything.
It is vital for me to receive the correct length, regardless of whether I can make sense of all of the data.
How can I read it into a 720K-sized array?
Open the file in binary mode (b).
f = open(self.fn, 'rb')
Otherwise, on Windows, carriage return / newline pairs are converted, and a specific byte (26 == 0x1A, the Ctrl-Z end-of-file marker) causes read to return early:
with open('testfile', 'wb') as f:
    f.write('\r\n')

with open('testfile', 'r') as f:
    assert f.read() == '\n'  # converted

with open('testfile', 'wb') as f:
    f.write(''.join(chr(i) for i in range(256)))

with open('testfile', 'r') as f:
    assert len(f.read()) < 256  # len(..) == 26
Related

I'm using Python 3.2.3 on Windows, and am trying to convert binary data within a C-style ASCII file into its binary equivalent for later parsing using the struct module. For example, my input file contains "0x000A 0x000B 0x000C 0x000D", and I'd like to convert it into "\x00\x0a\x00\x0b\x00\x0c\x00\x0d".
The problem I'm running into is that the string datatypes have changed in Python 3, and the built-in functions for converting from hexadecimal to binary, such as binascii.unhexlify(), no longer accept regular Unicode strings, only byte strings. This process of converting from Unicode strings to byte strings and back is confusing me, so I'm wondering if there's an easier way to achieve this. Below is what I have so far:
with open(path, "r") as f:
l = []
data = f.read()
values = data.split(" ")
for v in values:
if (v.startswith("0x")):
l.append(binascii.unhexlify(bytes(v[2:], "utf-8").decode("utf-8")
string = ''.join(l)
3>> ''.join(chr(int(x, 16)) for x in "0x000A 0x000B 0x000C 0x000D".split()).encode('utf-16be')
b'\x00\n\x00\x0b\x00\x0c\x00\r'
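Another Python 3 option, if you'd rather avoid the encode step, is to strip the 0x prefixes and let bytes.fromhex do the conversion (it ignores the spaces between the groups):

data = "0x000A 0x000B 0x000C 0x000D"
result = bytes.fromhex(data.replace("0x", ""))
# result == b'\x00\n\x00\x0b\x00\x0c\x00\r'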
As agf says, opening the file with mode 'r' will give you string data.
Since the only thing you are doing here is looking at binary data, you probably want to open with 'rb' mode and make your result of type bytes, not str.
Something like:
with open(path, "rb") as f:
l = []
data = f.read()
values = data.split(b" ")
for v in values:
if (v.startswith(b"0x")):
l.append(binascii.unhexlify(v[2:]))
result = b''.join(l)
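From there, the original goal of parsing with struct works directly on the bytes; a sketch, assuming the values are big-endian 16-bit integers as in the example:

import struct

values = struct.unpack('>{}H'.format(len(result) // 2), result)
# For the example input this gives (10, 11, 12, 13)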