Python 3.10 binary splitting script (inconsistent output)

I need to split a .bin file into chunks. However, I run into a problem when writing the output to the split/new binary file. The output is inconsistent: I can see the data, but there are shifts and gaps everywhere when I compare the split binary with the larger original one.
def hash_file(filename: str, blocksize: int = 4096) -> str:
    blocksCount = 0
    with open(filename, "rb") as f:
        while True:
            # Read a new chunk from the binary file
            full_string = f.read(blocksize)
            if not full_string:
                break
            new_string = ' '.join('{:02x}'.format(b) for b in full_string)
            split_string = ''.join(chr(int(i, 16)) for i in new_string.split())
            # Append the split chunk to the new binary file
            newf = open("SplitBin.bin", "a", encoding="utf-8")
            newf.write(split_string)
            newf.close()
            # Check if the desired number of mem blocks has been reached
            blocksCount = blocksCount + 1
            if blocksCount == 1:
                break

For characters with ordinals between 0 and 0x7f, their UTF-8 representation will be the same as their byte value. But for characters with ordinals between 0x80 and 0xff, UTF-8 will output two bytes neither of which will be the same as the input. That's why you're seeing inconsistencies.
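You can see this in the interpreter (the exact bytes follow from how UTF-8 encodes code points above 0x7f):

>>> 'A'.encode('utf-8')     # ordinal 0x41: one byte, unchanged
b'A'
>>> '\xff'.encode('utf-8')  # ordinal 0xff: two bytes, neither equal to 0xff
b'\xc3\xbf'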
The easiest way to fix it would be to open the output file in binary mode as well. Then you can eliminate all the formatting and splitting, because you can directly write the data you just read:
with open("SplitBin.bin", "ab") as newf:
newf.write(full_string)
Note that reopening the file each time you write to it will be very slow. Better to leave it open until you're done.
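Putting that together, here is a minimal sketch of the fixed loop. The name split_file and the blocksToCopy parameter are mine (the original stops after one block), so adjust to taste:

def split_file(filename: str, blocksize: int = 4096, blocksToCopy: int = 1) -> None:
    # blocksToCopy is a hypothetical parameter for the desired number of blocks
    blocksCount = 0
    # Open the output once in binary write mode and keep it open
    with open(filename, "rb") as f, open("SplitBin.bin", "wb") as newf:
        while blocksCount < blocksToCopy:
            chunk = f.read(blocksize)
            if not chunk:
                break
            newf.write(chunk)  # raw bytes in, raw bytes out: no re-encoding
            blocksCount += 1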

How to read a single UTF-8 character from a file in Python?

f.read(1) will return 1 byte, not one character. The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string. There is no newline character at the end of the string. How do I read such strings?
I have seen this question but none of the answers address the UTF-8 case.
Example code:
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x41')
    f.write(b'\xD0')
    f.write(b'\xB1')
    f.write(b'\xC0')
with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.buffer.read(1), '+', f.read(1))
This outputs:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte
When f.write(b'\xC0') is removed, it works as expected. It seems to read more than it is told: the code doesn't say to read the 0xC0 byte.
The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string.
You have the length of the string, which is likely the byte length as it makes the most sense in a binary file. Read the range of bytes in binary mode and decode it after-the-fact. Here's a contrived example of writing a binary file with a UTF-8 string with the length encoded first. It has a two-byte length followed by the encoded string data, surrounded with 10 bytes of random data on each side.
import os
import struct

string = "我不喜欢你女朋友。你需要一个新的。"

with open('sample.bin', 'wb') as f:
    f.write(os.urandom(10))                   # write 10 random bytes
    encoded = string.encode()
    f.write(len(encoded).to_bytes(2, 'big'))  # write a two-byte big-endian length
    f.write(encoded)                          # write string
    f.write(os.urandom(10))                   # 10 more random bytes

with open('sample.bin', 'rb') as f:
    print(f.read())  # show the raw data

# Option 1: seek to the known offset, read the length, then the string
with open('sample.bin', 'rb') as f:
    f.seek(10)
    length = int.from_bytes(f.read(2), 'big')
    result = f.read(length).decode()
    print(result)

# Option 2: read the fixed portion as a structure.
with open('sample.bin', 'rb') as f:
    # read 10 bytes and a big-endian 16-bit value; the last field is the length
    *other, length = struct.unpack('>10bH', f.read(12))
    result = f.read(length).decode()
    print(result)
Output:
b'\xa3\x1e\x07S8\xb9LA\xf0_\x003\xe6\x88\x91\xe4\xb8\x8d\xe5\x96\x9c\xe6\xac\xa2\xe4\xbd\xa0\xe5\xa5\xb3\xe6\x9c\x8b\xe5\x8f\x8b\xe3\x80\x82\xe4\xbd\xa0\xe9\x9c\x80\xe8\xa6\x81\xe4\xb8\x80\xe4\xb8\xaa\xe6\x96\xb0\xe7\x9a\x84\xe3\x80\x82ta\xacg\x9c\x82\x85\x95\xf9\x8c'
我不喜欢你女朋友。你需要一个新的。
我不喜欢你女朋友。你需要一个新的。
If you do need to read UTF-8 characters from a particular byte offset in a file, you can wrap the binary stream in a UTF-8 reader after seeking:
import codecs

with open('sample.bin', 'rb') as f:
    f.seek(12)
    c = codecs.getreader('utf8')(f)
    print(c.read(1))
Output:
我
Here's a character that takes up more than one byte. Whether or not you pass the utf-8 encoding explicitly when opening the file in text mode, read(1) returns the whole character, however many bytes it occupies; in binary mode you get a single byte.
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write('⾀'.encode('utf-8'))
    f.write(b'\x01')
with open(file, 'rb') as f:
    print(f.read(1))
with open(file, 'r') as f:
    print(f.read(1))
Output:
b'\xe2'
⾀
Even though some of the file is not UTF-8, you can still open the file in text (non-binary) mode, skip to the byte you want, and then read a whole character with read(1).
This works even if your character isn't at the beginning of the file:
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x01')
    f.write('⾀'.encode('utf-8'))
with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.read(1), '+', f.read(1))
If this does not work for you, please provide an example.

Reading specific positions of a file and writing it in a new file in python

Suppose I have a file containing a string of bytes that looks something like this:
00101000000000011000000000000011001.......
I want to read every 8th bit of the binary file and write it to a new file. How do I do this in Python?
endianness = 'big'
with open('from.txt', 'rb') as r:
    with open('to.txt', 'wb') as w:
        while True:
            chars = r.read(8)
            if len(chars) == 8:
                # keep the lowest bit of each byte (ASCII '0' is even, '1' is odd)
                string = ''.join([str(byte % 2) for byte in chars])
                if endianness == 'little':
                    string = string[::-1]
                byte = int(string, 2).to_bytes(1, endianness)
                w.write(byte)
            else:
                break
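For example, if from.txt contains the eight ASCII digits 01000001, the low bits assemble into 0b01000001 and to.txt ends up holding the single byte b'A'. A quick check, using the file names from the answer:

with open('from.txt', 'wb') as f:
    f.write(b'01000001')      # eight ASCII digits
# ... run the loop above ...
with open('to.txt', 'rb') as f:
    assert f.read() == b'A'   # 0b01000001 == 65 == ord('A')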

Read a large big-endian binary file

I have a very large big-endian binary file. I know how many numbers are in this file. I found a solution for reading a big-endian file using struct, and it works perfectly if the file is small:
import struct

data = []
file = open('some_file.dat', 'rb')
for i in range(0, numcount):
    data.append(struct.unpack('>f', file.read(4))[0])
But this code runs very slowly if the file size is more than ~100 MB.
My current file is 1.5 GB and contains 399,513,600 float numbers. The above code takes about 8 minutes on this file.
I found another solution that works faster:
datafile = open('some_file.dat', 'rb').read()
f_len = ">" + "f" * numcount #numcount = 399513600
numbers = struct.unpack(f_len, datafile)
This code runs in about 1.5 minutes, but that is still too slow for me. Earlier I wrote the same functionality in Fortran, and it ran in about 10 seconds.
In Fortran I open the file with a "big-endian" flag and can simply read the file into a REAL array without any conversion, but in Python I have to read the file as a string and convert every 4 bytes into a float using struct. Is it possible to make the program run faster?
You can use numpy.fromfile to read the file, and specify that the type is big-endian by using > in the dtype parameter:
numpy.fromfile(filename, dtype='>f')
There is an array.fromfile method too, but unfortunately I cannot see any way in which you can control endianness, so depending on your use case this might avoid the dependency on a third-party library or be useless.
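One workaround for array's lack of an endianness option is to read natively and then byte-swap on little-endian machines. A sketch, assuming numcount is known as in the question:

import array
import sys

arr = array.array('f')
with open('some_file.dat', 'rb') as f:
    arr.fromfile(f, numcount)  # read numcount raw 4-byte floats
if sys.byteorder == 'little':
    arr.byteswap()             # reorder big-endian data to native order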
The following approach gave a good speed-up for me:
import struct
import time

block_size = 4096
start = time.time()
with open('some_file.dat', 'rb') as f_input:
    data = []
    while True:
        block = f_input.read(block_size * 4)
        data.extend(struct.unpack('>{}f'.format(len(block) // 4), block))
        if len(block) < block_size * 4:
            break
print("Time taken: {:.2f}".format(time.time() - start))
print("Length", len(data))
Rather than using '>fffffff' you can specify a count, e.g. '>1000f'. The loop reads the file 4096 floats at a time; if the final read returns fewer bytes than that, it unpacks whatever was read and exits.
From the struct - Format Characters documentation:
A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'.
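A quick check of that equivalence:

import struct

data = struct.pack('>ffff', 1.0, 2.0, 3.0, 4.0)
assert struct.unpack('>4f', data) == struct.unpack('>ffff', data)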
def read_big_endian(filename):
    all_text = ""
    with open(filename, "rb") as template:
        try:
            template.read(2)  # skip the first 2 bytes, the FF FE byte-order mark
            while True:
                dchar = template.read(2)
                all_text += chr(dchar[0])  # keep the low byte of each 16-bit unit
        except IndexError:
            pass  # read(2) returned b'' at end of file
    return all_text

def save_big_endian(filename, text):
    with open(filename, "wb") as fic:
        fic.write(b'\xff\xfe')  # first 2 bytes are the FF FE byte-order mark
        for letter in text:
            fic.write(letter.encode('latin-1') + b'\x00')
Used to read .rdp files

How to unpack data using byte objects in python

I am trying to unpack a specific piece of data from an encoded file. The data is of type int32.
My approach was to read each line of the file and, if a section of that line's size matches the size of an int32, unpack it:
with open(r2sFile2, encoding="latin-1") as datafile:
for line in datafile:
for i in range(0, len(line)):
text = line.encode('latin-1')
fmt = text[0: i+1]
print(sys.getsizeof(fmt))
if (sys.getsizeof(fmt) == 4):
PacketSize = struct.unpack('<I', fmt)
if PacketSize == 'BTH0':
print("Identifier Found")
I am encountering one problem so far: sys.getsizeof(fmt) (which reads the size of the line segment) always returns a size higher than the required size of 4. Maybe I have to convert every byte object into an int32 (if possible)?
You cannot open a file using a text encoding and also read raw bytes from the same stream; specifying an encoding converts those bytes into Unicode code points as they are read. You want to open the file in binary mode so that it returns bytes instead of Unicode characters. You should also be reading fixed-size chunks of data from the file instead of reading line by line, which doesn't really make sense in binary mode. Note too that sys.getsizeof returns the in-memory size of the Python object, including interpreter overhead, not the number of bytes of data it holds; use len for that.
with open('/path/to/file.dat', 'rb') as f:
    val = struct.unpack('<I', f.read(4))[0]  # unpack returns a tuple; take its first element
    if val == 123:
        ...
Also, 'BTH0' is not an unsigned integer, so it's never going to compare equal to the value unpacked from your binary file with the format string <I, which will always be an unsigned integer.
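If the goal is to detect the b'BTH0' identifier, one option (a sketch; the file path and the assumption that the identifier sits at the start of the file are mine) is to compare the raw bytes directly, or to compare against the integer those four bytes represent:

import struct

with open('/path/to/file.dat', 'rb') as f:  # hypothetical path
    header = f.read(4)
if header == b'BTH0':                       # compare the raw bytes
    print("Identifier Found")
# equivalently, compare the little-endian unsigned integer:
(val,) = struct.unpack('<I', header)
if val == int.from_bytes(b'BTH0', 'little'):
    print("Identifier Found")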

Python bytearray isn't as long as it should be

I'm trying to read an Android YUV image represented as a raw byte file.
f = open(self.fn)
self.yuvArray = bytearray(f.read())
I know that the file contains 720K bytes, but self.yuvArray has only 350K.
Moreover, after trying this with multiple files of the same format, all of which are 720K bytes long (verified both by file size, and C# code returns a 720K-sized array), I noticed they all come out at different sizes, around 350K.
I checked whether it's some kind of compression or similar, but couldn't find anything.
It is vital that I get the correct length, even if I can't make sense of all the data.
How can I read it into a 720K-sized array?
Open the file in binary mode (b).
f = open(self.fn, 'rb')
Otherwise, on Windows, carriage return/newline pairs are converted to newlines, and a specific byte (26 == 0x1A, the Ctrl-Z end-of-file marker) causes the read to return early.
# Python 2 on Windows: text mode converts line endings and stops at 0x1A
with open('testfile', 'wb') as f:
    f.write('\r\n')
with open('testfile', 'r') as f:
    assert f.read() == '\n'  # converted

with open('testfile', 'wb') as f:
    f.write(''.join(chr(i) for i in range(256)))
with open('testfile', 'r') as f:
    assert len(f.read()) < 256  # len(..) == 26
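With binary mode there is no translation or early EOF, so the full buffer comes through. A minimal sketch (the file name is a hypothetical stand-in for self.fn; the 720K figure is the frame size from the question):

fn = 'frame.yuv'          # hypothetical path to the raw YUV file
with open(fn, 'rb') as f:
    yuvArray = bytearray(f.read())
print(len(yuvArray))      # should now report the full 720K bytes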
