Read a large big-endian binary file - python

I have a very large big-endian binary file, and I know how many numbers it contains. I found a way to read a big-endian file using struct, and it works perfectly when the file is small:
import struct
data = []
file = open('some_file.dat', 'rb')
for i in range(0, numcount):
    data.append(struct.unpack('>f', file.read(4))[0])
But this code is very slow once the file is larger than ~100 MB.
My current file is 1.5 GB and contains 399,513,600 floats. The code above takes about 8 minutes on this file.
I found another solution that is faster:
datafile = open('some_file.dat', 'rb').read()
f_len = ">" + "f" * numcount #numcount = 399513600
numbers = struct.unpack(f_len, datafile)
This code runs in about 1.5 minutes, but that is still too slow for me. Earlier I wrote code with the same functionality in Fortran, and it ran in about 10 seconds.
In Fortran I open the file with the "big-endian" flag and can simply read the file into a REAL array without any conversion, but in Python I have to read the file as a string and convert every 4 bytes into a float using struct. Is it possible to make the program run faster?

You can use numpy.fromfile to read the file, and mark the type as big-endian by putting > in the dtype parameter:
numpy.fromfile(filename, dtype='>f')
There is an array.fromfile method too, but unfortunately I cannot see any way in which you can control endianness, so depending on your use case this might avoid the dependency on a third party library or be useless.
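If the NumPy dependency is a problem, one possible workaround for the endianness limitation is to read with array.fromfile and then call array.byteswap() when the machine is little-endian. This is a minimal sketch, not a benchmarked solution; the helper name read_big_endian_floats and the demo data are made up for illustration.

```python
import array
import os
import struct
import sys
import tempfile

def read_big_endian_floats(path, count):
    """Read `count` big-endian 32-bit floats using only the standard library."""
    values = array.array('f')
    if values.itemsize != 4:
        raise RuntimeError("array type 'f' is not 4 bytes on this platform")
    with open(path, 'rb') as f:
        values.fromfile(f, count)   # bytes are interpreted in native byte order
    if sys.byteorder == 'little':
        values.byteswap()           # reinterpret the big-endian data correctly
    return values

# Tiny demo file with hypothetical data
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack('>3f', 1.0, 2.5, -4.0))
    demo_path = tmp.name
demo = read_big_endian_floats(demo_path, 3)
os.remove(demo_path)
print(list(demo))
```

The result is an array.array rather than a list, which also keeps memory use close to 4 bytes per value.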

The following approach gave a good speed up for me:
import struct
import time
block_size = 4096  # number of floats per block
start = time.time()
with open('some_file.dat', 'rb') as f_input:
    data = []
    while True:
        block = f_input.read(block_size * 4)
        # integer division keeps the repeat count an int on Python 3 as well
        data.extend(struct.unpack('>{}f'.format(len(block) // 4), block))
        if len(block) < block_size * 4:
            break
print("Time taken: {:.2f}".format(time.time() - start))
print("Length: {}".format(len(data)))
Rather than using >fffffff you can specify a count, e.g. >1000f. The code reads the file in blocks of 4096 floats (16 KiB) at a time. If a read returns less than a full block, the repeat count in the format string is adjusted to match and the loop exits.
From the struct - Format Characters documentation:
A format character may be preceded by an integral repeat count. For
example, the format string '4h' means exactly the same as 'hhhh'.
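As a quick illustration of the repeat count, the following sketch unpacks the same bytes three ways and gets the same tuple each time. The sample values are made up.

```python
import struct

payload = struct.pack('>4f', 1.0, 2.0, 3.0, 4.0)   # 16 bytes of sample data

# A repeat count is equivalent to spelling the format character out
spelled_out = struct.unpack('>ffff', payload)
with_count = struct.unpack('>4f', payload)

# The count can also be derived from the length of the buffer
count = len(payload) // struct.calcsize('>f')
derived = struct.unpack('>{}f'.format(count), payload)
print(with_count)
```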

def read_big_endian(filename):
    all_text = ""
    with open(filename, "rb") as template:
        try:
            template.read(2) # first 2 bytes are FF FE
            while True:
                dchar = template.read(2)
                all_text += dchar[0]
        except:
            pass
    return all_text
def save_big_endian(filename, text):
    with open(filename, "wb") as fic:
        fic.write(chr(255) + chr(254)) # first 2 bytes are FF FE
        for letter in text:
            fic.write(letter + chr(0))
Used to read .rdp files
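The FF FE prefix is the UTF-16 little-endian byte order mark, so on Python 3 the built-in utf-16 codec may be a simpler option for such files. This is a minimal sketch, assuming the file really is BOM-prefixed UTF-16 text; the helper names and the sample content are hypothetical.

```python
import os
import tempfile

def read_utf16_text(filename):
    # The 'utf-16' codec consumes the BOM and picks the byte order from it
    with open(filename, 'r', encoding='utf-16', newline='') as f:
        return f.read()

def save_utf16_le_text(filename, text):
    with open(filename, 'wb') as f:
        f.write(b'\xff\xfe')                 # BOM: FF FE
        f.write(text.encode('utf-16-le'))    # two bytes per basic character

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.rdp')
    save_utf16_le_text(demo_path, 'full address:s:example-host\r\n')
    round_trip = read_utf16_text(demo_path)
print(repr(round_trip))
```

Unlike the byte-dropping approach, this also preserves characters outside the Latin-1 range.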

Related

Python 3.10 Binary splitting script(inconsistent output)

I need to split a .bin file into chunks. However, I seem to face a problem when it comes to writing the output in the split/new binary file. The output is inconsistent, I can see the data, but there are shifts and gaps everywhere when comparing the split binary with the bigger original one.
def hash_file(filename: str, blocksize: int = 4096) -> str:
    blocksCount = 0
    with open(filename, "rb") as f:
        while True:
            #Read a new chunk from the binary file
            full_string = f.read(blocksize)
            if not full_string:
                break
            new_string = ' '.join('{:02x}'.format(b) for b in full_string)
            split_string = ''.join(chr(int(i, 16)) for i in new_string.split())
            #Append the split chunk to the new binary file
            newf = open("SplitBin.bin","a", encoding="utf-8")
            newf.write(split_string)
            newf.close()
            #Check if the desired number of mem blocks has been reached
            blocksCount = blocksCount + 1
            if blocksCount == 1:
                break
For characters with ordinals between 0 and 0x7f, their UTF-8 representation will be the same as their byte value. But for characters with ordinals between 0x80 and 0xff, UTF-8 will output two bytes neither of which will be the same as the input. That's why you're seeing inconsistencies.
The easiest way to fix it would be to open the output file in binary mode as well. Then you can eliminate all the formatting and splitting, because you can directly write the data you just read:
with open("SplitBin.bin", "ab") as newf:
    newf.write(full_string)
Note that reopening the file each time you write to it will be very slow. Better to leave it open until you're done.
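Putting both suggestions together, a sketch of the splitting loop could look like the following. The function name copy_blocks and its parameters are made up for illustration; the output is opened once, in binary mode, and the bytes are written exactly as read.

```python
import os
import tempfile

def copy_blocks(src_path, dst_path, blocksize=4096, max_blocks=1):
    """Copy up to max_blocks blocks from src to dst, byte for byte."""
    copied = 0
    # Both files are in binary mode; the output stays open for the whole loop
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while copied < max_blocks:
            chunk = src.read(blocksize)
            if not chunk:
                break
            dst.write(chunk)
            copied += 1
    return copied

with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, 'source.bin')
    dst_path = os.path.join(tmp_dir, 'SplitBin.bin')
    original = bytes(range(256)) * 40      # 10240 bytes, includes values >= 0x80
    with open(src_path, 'wb') as f:
        f.write(original)
    blocks = copy_blocks(src_path, dst_path, blocksize=4096, max_blocks=2)
    with open(dst_path, 'rb') as f:
        result = f.read()
print(blocks, len(result))
```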

Most efficient way to convert large .txt files (size >30GB) .txt into .csv after pre-processing using Python

I have data in a .txt file that looks like this (let's name it "myfile.txt"):
28807644'~'0'~'Maun FCU'~'US#####28855353'~'0'~'WNB Holdings LLC'~'US#####29212330'~'0'~'Idaho First Bank'~'US#####29278777'~'0'~'Republic Bank of Arizona'~'US#####29633181'~'0'~'Friendly Hills Bank'~'US#####29760145'~'0'~'The Freedom Bank of Virginia'~'US#####100504846'~'0'~'Community First Fund Federal Credit Union'~'US#####
I have tried a couple of ways to convert this .txt into a .csv. One of them used the csv library, but since I like pandas a lot, I used the following:
import pandas as pd
import time
#time at the start of program is noted
start = time.time()
# We set the path where our file is located and read it
path = r'myfile.txt'
f = open(path, 'r')
content = f.read()
# We replace undesired strings and introduce a breakline.
content_filtered = content.replace("#####", "\n").replace("'", "")
# We read everything in columns with the separator "~"
df = pd.DataFrame([x.split('~') for x in content_filtered.split('\n')], columns = ['a', 'b', 'c', 'd'])
# We print the dataframe into a csv
df.to_csv(path.replace('.txt', '.csv'), index = None)
end = time.time()
#total time taken to print the file
print("Execution time in seconds: ",(end - start))
This takes about 35 seconds for a 300 MB file, which is acceptable. But when I try the same on a much larger 35 GB file, it raises a MemoryError.
I also tried the csv library: I put everything into a list and then wrote it out to a CSV:
import csv
# We write to CSV
with open(path.replace('.txt', '.csv'), "w") as outfile:
    write = csv.writer(outfile)
    write.writerows(split_content)
Results were similar, not a huge improvement. Is there a way or methodology I can use to convert VERY large .txt files into .csv? Likely above 35GB?
I'd be happy to read any suggestions you may have, thanks in advance!
I took your sample string, and made a sample file by multiplying that string by 100 million (something like your_string*1e8...) to get a test file that is 31GB.
Following @Grismar's suggestion of chunking, I made the following, which processes that 31GB file in ~2 minutes, with a peak RAM usage depending on the chunk size.
The complicated part is keeping track of the field and record separators, which are multiple characters, and will certainly span across a chunk, and thus be truncated.
My solution is to inspect the end of each chunk and see if it has a partial separator. If it does, that partial is removed from the end of the current chunk, the current chunk is written-out, and the partial becomes the beginning of (and should be completed by) the next chunk:
CHUNK_SZ = 1024 * 1024
FS = "'~'"
RS = '#####'
# With chars repeated in the separators, check most specific (least ambiguous)
# to least specific (most ambiguous) to definitively catch a partial with the
# fewest number of checks
PARTIAL_RSES = ['####', '###', '##', '#']
PARTIAL_FSES = ["'~", "'"]
ALL_PARTIALS = PARTIAL_FSES + PARTIAL_RSES
f_out = open('out.csv', 'w')
f_out.write('a,b,c,d\n')
f_in = open('my_file.txt')
line = ''
while True:
    # Read chunks till no more, then break out
    chunk = f_in.read(CHUNK_SZ)
    if not chunk:
        break
    # Any previous partial separator, plus new chunk
    line += chunk
    # Check end-of-line for a partial FS or RS; only when separators are more than one char
    final_partial = ''
    if line.endswith(FS) or line.endswith(RS):
        pass  # Write-out will replace complete FS or RS
    else:
        for partial in ALL_PARTIALS:
            if line.endswith(partial):
                final_partial = partial
                line = line[:-len(partial)]
                break
    # Process/write chunk
    f_out.write(line
                .replace(FS, ',')
                .replace(RS, '\n'))
    # Add partial back, to be completed next chunk
    line = final_partial
# Write out a trailing partial that was never completed (file ended mid-separator)
f_out.write(line)
# Clean up
f_in.close()
f_out.close()
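A different way to handle separators that straddle chunk boundaries is to split each buffer on the record separator and carry the last, possibly incomplete, piece over to the next read. The sketch below is a swapped-in variant, not the code timed above; the function name convert_stream is made up, it assumes individual records are small compared with the chunk size, and like the original it does no CSV quoting.

```python
import io

def convert_stream(f_in, f_out, chunk_size=1024 * 1024, fs="'~'", rs="#####"):
    """Rewrite fs/rs-delimited text as comma/newline-delimited text."""
    carry = ""
    while True:
        chunk = f_in.read(chunk_size)
        if not chunk:
            break
        records = (carry + chunk).split(rs)
        carry = records.pop()          # last piece may be an incomplete record
        for record in records:
            f_out.write(record.replace(fs, ",") + "\n")
    if carry:                          # data after the final record separator
        f_out.write(carry.replace(fs, ",") + "\n")

sample = "28807644'~'0'~'Maun FCU'~'US#####28855353'~'0'~'WNB Holdings LLC'~'US#####"
out = io.StringIO()
convert_stream(io.StringIO(sample), out, chunk_size=7)   # tiny chunks force split separators
converted = out.getvalue()
print(converted)
```

Because field separators are only replaced inside complete records, a field separator cut in half by a chunk boundary needs no special handling.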
Since your code just does straight up replacement, you could just read through all the data sequentially and detect parts that need replacing as you go:
def process(fn_in, fn_out, columns):
    new_line = b'#####'
    with open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns)+'\n').encode())
        i = 0
        with open(fn_in, "rb") as f_in:
            while (b := f_in.read(1)):
                if ord(b) == new_line[i]:
                    # keep matching the newline block
                    i += 1
                    if i == len(new_line):
                        # if matched entirely, write just a newline
                        f_out.write(b'\n')
                        i = 0
                    # write nothing while matching
                    continue
                elif i > 0:
                    # if you reach this, it was a partial match, write it
                    f_out.write(new_line[:i])
                    i = 0
                if b == b"'":
                    pass
                elif b == b"~":
                    f_out.write(b',')
                else:
                    # write the byte if no match
                    f_out.write(b)
process('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])
That does it pretty quickly. You may be able to improve performance by reading in chunks, but this is pretty quick all the same.
This approach has the advantage over yours that it holds almost nothing in memory, but it does very little to optimise reading the file fast.
Edit: there was a big mistake in an edge case, which I realised after re-reading, fixed now.
Just to share an alternative way, based on convtools (table docs | github).
This solution is faster than the OP's, but ~7 times slower than Zach's (Zach works with str chunks, while this one works with row tuples, reading via csv.reader).
Still, this approach may be useful because it lets you tap into stream processing and work with columns: rearrange them, add new ones, etc.
from convtools import conversion as c
from convtools.contrib.fs import split_buffer
from convtools.contrib.tables import Table
def get_rows(filename):
    with open(filename, "r") as f:
        for row in split_buffer(f, "#####"):
            yield row.replace("'", "")
Table.from_csv(
    get_rows("tmp.csv"), dialect=Table.csv_dialect(delimiter="~")
).into_csv("tmp_out.csv", include_header=False)

Speed up reading in a compressed bz2 file ('rb' mode)

I have a BZ2 file of more than 10GB. I'd like to read it without decompressing it into a temporary file (it would be more than 50GB).
With this method:
import bz2, time
t0 = time.time()
time.sleep(0.001) # to avoid / by 0
with bz2.open(r"F:\test.bz2", 'rb') as f:  # raw string, so \t is not read as a tab
    for i, l in enumerate(f):
        if i % 100000 == 0:
            print('%i lines/sec' % (i/(time.time() - t0)))
I can only read ~ 250k lines per second. On a similar file, first decompressed, I get ~ 3M lines per second, i.e. a x10 factor:
with open(r"F:\test.txt", 'rb') as f:
I think it's not only due to the intrinsic decompression CPU time (because the total time of decompression into a temp file + the reading as uncompressed file is much smaller than the method described here), but maybe a lack of buffering, or other reasons. Are there other faster Python implementations of bz2.open?
How to speed up the reading of a BZ2 file, in binary mode, and loop over "lines"? (separated by \n)
Note: currently time to decompress test.bz2 into test.tmp + time to iterate over lines of test.tmp is far smaller than time to iterate over lines of bz2.open('test.bz2'), and this probably should not be the case.
Linked topic: https://discuss.python.org/t/non-optimal-bz2-reading-speed/6869
You can use BZ2Decompressor to deal with huge files. It decompresses blocks of data incrementally, just out of the box:
import bz2, time
t0 = time.time()
time.sleep(0.000001)
with open('temp.bz2', 'rb') as fi:
    decomp = bz2.BZ2Decompressor()
    residue = b''
    total_lines = 0
    for data in iter(lambda: fi.read(100 * 1024), b''):
        # prepend the residue of the previous block to the newly decompressed data
        raw = residue + decomp.decompress(data)
        # process_data(current_block) => do the processing of the current data block
        current_block = raw.split(b'\n')
        # the last item is an incomplete line, or b'' if raw ended with a newline
        residue = current_block.pop()
        total_lines += len(current_block)
        print('%i lines/sec' % (total_lines / (time.time() - t0)))
    # process_data(residue) => now finish processing the last line
    if residue:
        total_lines += 1
    print('Final: %i lines/sec' % (total_lines / (time.time() - t0)))
Here I read a chunk of the binary file, feed it into a decompressor, and receive a chunk of decompressed data. Be aware that the decompressed chunks have to be concatenated to restore the original data, which is why the last entry of each chunk needs special treatment.
In my experiments it runs a little faster than your solution with io.BytesIO(). bz2 is known to be slow, so if that bothers you, consider migrating to snappy or zstandard.
As for the time it takes to process bz2 in Python: it might be fastest to decompress the file into a temporary one using a Linux utility and then process a normal text file. Otherwise you depend on Python's implementation of bz2.
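The incremental approach can also be wrapped in a generator so that it can replace `for l in f` directly. The sketch below is untimed and assumes a single-stream bz2 file; the name iter_bz2_lines is made up, and unlike file iteration it yields lines without the trailing newline byte.

```python
import bz2
import os
import tempfile

def iter_bz2_lines(path, chunk_size=100 * 1024):
    # Yields each line without its trailing newline byte
    decomp = bz2.BZ2Decompressor()
    residue = b''
    with open(path, 'rb') as f:
        for data in iter(lambda: f.read(chunk_size), b''):
            lines = (residue + decomp.decompress(data)).split(b'\n')
            residue = lines.pop()      # incomplete last line, or b''
            yield from lines
            if decomp.eof:
                break                  # ignore anything after the first stream
    if residue:
        yield residue

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.bz2')
    with open(demo_path, 'wb') as f:
        f.write(bz2.compress(b'alpha\nbeta\ngamma'))
    demo_lines = list(iter_bz2_lines(demo_path, chunk_size=16))
print(demo_lines)
```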
This method already gives a x2 improvement over native bz2.open.
import bz2, time, io
def chunked_readlines(f):
    s = io.BytesIO()
    while True:
        buf = f.read(1024*1024)
        if not buf:
            # a generator's return value is not seen by a for loop, so yield the leftover line
            last = s.getvalue()
            if last:
                yield last
            return
        s.write(buf)
        s.seek(0)
        L = s.readlines()
        yield from L[:-1]
        s = io.BytesIO()
        s.write(L[-1]) # very important: the last line read in the 1 MB chunk might be
        # incomplete, so we keep it to be processed in the next iteration
        # TODO: check if this is ok if f.read() stopped in the middle of a \r\n?
t0 = time.time()
i = 0
with bz2.open(r"D:\test.bz2", 'rb') as f:  # raw string, so \t is not read as a tab
    for l in chunked_readlines(f): # 500k lines per second
    # for l in f: # 250k lines per second
        i += 1
        if i % 100000 == 0:
            print('%i lines/sec' % (i/(time.time() - t0)))
It is probably possible to do even better.
We could get a x4 improvement if we could use s as a simple bytes object instead of an io.BytesIO. Unfortunately, splitlines() does not behave as expected in this case: splitlines() and iterating over an opened file give different results.
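The difference comes from which bytes count as line boundaries. A small demonstration with made-up sample data:

```python
import io

data = b'one\rtwo\nthree\n'

# bytes.splitlines() treats \r, \n and \r\n as boundaries and drops them
split_result = data.splitlines()

# Reading lines from a binary file object only breaks on \n and keeps it
file_result = io.BytesIO(data).readlines()

# Splitting on b'\n' uses the same boundaries as the file, minus the terminators
manual_result = data.split(b'\n')

print(split_result, file_result, manual_result)
```

So a plain bytes buffer can still be used if it is split with split(b'\n') rather than splitlines().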

Easy way to switch endianess of string

I need to read a binary file and write its content out as a text file that will initialize a memory model. The problem is that I need to switch endianness in the process. Let's look at an example.
The binary file's content, when I read it with:
with open(source_name, mode='rb') as file:
    fileContent = file.read().hex()
is fileContent: "aa000000bb000000...".
I need to transform that into "000000aa000000bb...".
Of course, I can split this string into a list of 8-character substrings, then manually reorganize each one like newsubstr = substr[6:8]+substr[4:6]+substr[2:4]+substr[0:2], and then merge them into the result string, but that seems clumsy. I suppose there is a more natural way to do this in Python.
Thanks to k1m190r, I found out about the struct module, which looks like what I need, but I am still lost. I just came up with another clumsy solution:
import struct
with open(source_name, mode='rb') as file:
    fileContent = file.read()
while len(fileContent)%4 != 0:
    fileContent += b"\x00"
res = ""
for i in range(0,len(fileContent),4):
    substr = fileContent[i:i+4]
    substr_val = struct.unpack("<L", substr)[0]
    res += struct.pack(">L", substr_val).hex()
Is there a more elegant way? This solution is just slightly better than the original.
Actually in your specific case you don't even need struct. Below should be sufficient.
from binascii import b2a_hex
# open files in binary
with open("infile", "rb") as infile, open("outfile", "wb") as outfile:
    # read 4 bytes at a time till read() spits out empty byte string b""
    for x in iter(lambda: infile.read(4), b""):
        if len(x) != 4:
            # skip last bit if it is not 4 bytes long
            break
        outfile.write(b2a_hex(x[::-1]))
Is there a more elegant way? This solution is just slightly better than the original
Alternatively, you can craft a "smarter" struct format string: format specifiers take a number prefix, which is the number of repetitions, e.g. 10L is the same as LLLLLLLLLL. So you can inject the size of your data divided by 4 before the letter and convert the entire thing in one go (or in a few steps, I don't know how big the counter can be).
array.array might also work, as that's what its byteswap method is for, but you can't specify the input endianness (I think), so it's iffier.
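A sketch of the single-call struct variant, assuming the whole file fits in memory. The function name swap_words_hex is made up, and the zero padding mirrors what the question's loop does.

```python
import struct

def swap_words_hex(data):
    """Return hex text with every 4-byte word of `data` byte-reversed."""
    data += b'\x00' * (-len(data) % 4)            # zero-pad to a multiple of 4
    count = len(data) // 4
    words = struct.unpack('<%dL' % count, data)   # read all words as little-endian
    return struct.pack('>%dL' % count, *words).hex()  # write them back big-endian

converted = swap_words_hex(bytes.fromhex('aa000000bb000000'))
padded = swap_words_hex(b'\x01\x02')
print(converted, padded)
```

With an explicit < or > prefix, L is always 4 bytes, so the result does not depend on the platform.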
To answer the original question:
import re
# re.DOTALL lets '.' match every byte, including b'\n'
changed = re.sub(b'(....)', lambda x: x.group()[::-1], bindata, flags=re.DOTALL)
Note: original had r'(....)' when the r should have been b.

How to read a big binary file and split its content by some marker

In Python, reading a big text file line-by-line is simple:
for line in open('somefile', 'r'): ...
But how to read a binary file and 'split' (by generator) its content by some given marker, not the newline '\n'?
I want something like that:
content = open('somefile', 'r').read()
result = content.split('some_marker')
but, of course, memory-efficient (the file is around 70GB). Of course, we can't read the file by every byte (it'll be too slow because of the HDD nature).
The 'chunks' length (the data between those markers) might differ, theoretically from 1 byte to megabytes.
So, to give an example to sum up, the data looks like that (digits mean bytes here, the data is in a binary format):
12345223-MARKER-3492-MARKER-34834983428623762374632784-MARKER-888-MARKER-...
Is there any simple way to do that (not implementing reading in chunks, splitting the chunks, remembering tails etc.)?
There is no magic in Python that will do it for you, but it's not hard to write. For example:
def split_file(fp, marker):
    BLOCKSIZE = 4096
    result = []
    current = b''
    # fp must be opened in binary mode and marker must be a bytes object
    for block in iter(lambda: fp.read(BLOCKSIZE), b''):
        current += block
        while 1:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            result.append(current[:markerpos])
            current = current[markerpos + len(marker):]
    result.append(current)
    return result
Memory usage of this function can be further reduced by turning it into a generator, i.e. converting result.append(...) to yield .... This is left as an exercise to the reader.
A general idea is to use mmap; you can then run re.finditer over it:
import mmap
import re
with open('somefile', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    # bytes pattern for a binary mmap; re.DOTALL so '.' also matches newline bytes
    markers = re.finditer(rb'(.*?)MARKER', mf, re.DOTALL)
    for marker in markers:
        print(marker.group(1))
I haven't tested, but you may want a (.*?)(MARKER|$) or similar in there as well.
Then, it's down to the OS to provide the necessaries for access to the file.
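If a regular expression is not needed, the same idea can be sketched with mmap.find, which searches for a literal byte string from a given offset. This is a minimal, untimed sketch; the name iter_mmap_chunks and the demo file are made up, and note that mmap raises ValueError for an empty file.

```python
import mmap
import os
import tempfile

def iter_mmap_chunks(path, marker):
    """Yield the bytes between occurrences of `marker` in the file at `path`."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            while True:
                pos = mm.find(marker, start)
                if pos == -1:
                    yield mm[start:]       # trailing data after the last marker
                    return
                yield mm[start:pos]        # slicing copies only this chunk
                start = pos + len(marker)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'somefile')
    with open(demo_path, 'wb') as f:
        f.write(b'12345223-MARKER-3492-MARKER-888')
    chunks = list(iter_mmap_chunks(demo_path, b'-MARKER-'))
print(chunks)
```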
I don't think there's any built-in function for that, but you can "read-in-chunks" nicely with an iterator to prevent memory-inefficiency, similarly to @user4815162342's suggestion:
def split_by_marker(f, marker = b"-MARKER-", block_size = 4096):
    # f must be opened in binary mode, so the buffer and marker are bytes
    current = b''
    while True:
        block = f.read(block_size)
        if not block: # end-of-file
            yield current
            return
        current += block
        while True:
            markerpos = current.find(marker)
            if markerpos < 0:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
This way you won't save all the results in the memory at once, and you can still iterate it like:
for line in split_by_marker(open(filename, 'rb')): ...
Just make sure that each "line" does not take too much memory...
Readline itself reads in chunks, splits the chunks, remembers tails, etc. So, no.
