I need to read a binary file and write its content as a text file that will initialize a memory model. The problem is, I need to switch endianness in the process. Let's look at an example.
The binary file content, when I read it with:
with open(source_name, mode='rb') as file:
    fileContent = file.read().hex()
is fileContent: "aa000000bb000000...".
I need to transform that into "000000aa000000bb...".
Of course, I can split this string into a list of 8-character substrings, then manually reorganize each one like newsubstr = substr[6:8]+substr[4:6]+substr[2:4]+substr[0:2], and then merge them into the result string, but that seems clumsy. I suppose there is a more natural way to do this in Python.
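In full, that slicing approach would be something like:
hexstr = fileContent
swapped = "".join(
    hexstr[i+6:i+8] + hexstr[i+4:i+6] + hexstr[i+2:i+4] + hexstr[i:i+2]
    for i in range(0, len(hexstr), 8)
)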
Thanks to k1m190r, I found out about the struct module, which looks like what I need, but I'm still lost. I just designed another clumsy solution:
import struct

with open(source_name, mode='rb') as file:
    fileContent = file.read()
while len(fileContent) % 4 != 0:
    fileContent += b"\x00"
res = ""
for i in range(0, len(fileContent), 4):
    substr = fileContent[i:i+4]
    substr_val = struct.unpack("<L", substr)[0]
    res += struct.pack(">L", substr_val).hex()
Is there a more elegant way? This solution is just slightly better than the original.
Actually, in your specific case you don't even need struct. The following should be sufficient.
from binascii import b2a_hex

# open files in binary
with open("infile", "rb") as infile, open("outfile", "wb") as outfile:
    # read 4 bytes at a time till read() spits out empty byte string b""
    for x in iter(lambda: infile.read(4), b""):
        if len(x) != 4:
            # skip last bit if it is not 4 bytes long
            break
        outfile.write(b2a_hex(x[::-1]))
Alternatively, you can craft a "smarter" struct format string: format specifiers take a number prefix which is the number of repetitions, e.g. 10L is the same as LLLLLLLLLL, so you can inject the size of your data divided by 4 before the letter and convert the entire thing in one go (or a few steps; I don't know how big the count can be).
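A rough sketch of that idea (assuming the input length is already a multiple of 4; otherwise pad with b"\x00" as in the question):
import struct

with open("infile", "rb") as f:
    data = f.read()

n = len(data) // 4                                   # number of 32-bit words
words = struct.unpack("<%dL" % n, data)              # interpret as little-endian
swapped_hex = struct.pack(">%dL" % n, *words).hex()  # re-pack big-endian, then hex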
array.array might also work, as it has a byteswap method, but you can't specify the input endianness (I think), so it's iffier.
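For completeness, a sketch of the array route (it assumes the file length is a multiple of 4 and that the machine is little-endian, which is exactly the caveat above):
import array

with open("infile", "rb") as f:
    a = array.array("I")       # "I" is normally a 4-byte unsigned int
    a.frombytes(f.read())      # interpreted in the machine's native byte order
a.byteswap()                   # reverse the byte order of every element in place
swapped_hex = a.tobytes().hex()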
To answer the original question:
import re
changed = re.sub(b'(....)', lambda x: x.group()[::-1], bindata, flags=re.DOTALL)  # DOTALL so '.' also matches 0x0a bytes
Note: original had r'(....)' when the r should have been b.
Part of a Python script that I'm writing requires me to find a particular string in a large text or log file: if it exists then do something; otherwise, do something else.
The files which are being fed in are extremely large (10GB+). It feels extremely slow and inefficient to use:
with open('file.txt') as f:
    for line in f:
        if some_string in line:
            return True
return False
If the string doesn't exist in the file, then iterating through would take a long time.
Is there a time efficient way to achieve this?
You can try with mmap:
>>> import mmap
>>> import re
>>> f = open("data.log", "r")
>>> mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
>>> re.search(b"test", mm)
<re.Match object; span=(12, 16), match=b'test'>
If you're on Linux or BSD (macOS) I would just create a subprocess with grep or awk and let them do the search; they have had decades of optimisation for finding strings in big files. Make sure to include the command-line flag that tells it to stop searching after the first match, if you only care that the string exists and don't need all instances or a count.
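A rough sketch of that approach (grep -q stops at the first match and reports only via its exit status; -F treats the needle as a fixed string):
import subprocess

def contains(path, needle):
    # exit status 0 means grep found at least one match
    return subprocess.run(["grep", "-q", "-F", needle, path]).returncode == 0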
Try handling larger chunks instead of individual lines. For example:
def contains(filename, some_string):
    n = len(some_string)
    prev_chunk = ''
    with open(filename) as f:
        while chunk := f.read(2 ** 20):
            if some_string in prev_chunk[-(n-1):] + chunk:
                return True
            prev_chunk = chunk
    return False
I tried that with some made up 1 GB file and it took about 1 second to check a string that's not in there.
I have a very large big-endian binary file. I know how many numbers are in this file. I found a solution for reading a big-endian file using struct, and it works perfectly if the file is small:
import struct

data = []
file = open('some_file.dat', 'rb')
for i in range(0, numcount):
    data.append(struct.unpack('>f', file.read(4))[0])
But this code works very slowly if the file size is more than ~100 MB.
My current file is 1.5 GB and contains 399,513,600 float numbers. The above code takes about 8 minutes on this file.
I found another solution, that works faster:
datafile = open('some_file.dat', 'rb').read()
f_len = ">" + "f" * numcount #numcount = 399513600
numbers = struct.unpack(f_len, datafile)
This code runs in about 1.5 minutes, but that is still too slow for me. Earlier I wrote the same functionality in Fortran and it ran in about 10 seconds.
In Fortran I open the file with the "big-endian" flag and can simply read the file into a REAL array without any conversion, but in Python I have to read the file as a string and convert every 4 bytes into a float using struct. Is it possible to make the program run faster?
You can use numpy.fromfile to read the file, and specify that the type is big-endian by using > in the dtype parameter:
numpy.fromfile(filename, dtype='>f')
There is an array.fromfile method too, but unfortunately I cannot see any way in which you can control endianness, so depending on your use case this might avoid the dependency on a third party library or be useless.
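That said, array can byteswap after it has read the data, so a rough sketch without NumPy (assuming numcount 4-byte floats, as in the question) might be:
import array
import sys

a = array.array('f')                  # native-endian 32-bit floats
with open('some_file.dat', 'rb') as f:
    a.fromfile(f, numcount)           # raw read, no per-value conversion
if sys.byteorder == 'little':         # the file is big-endian
    a.byteswap()                      # fix the byte order in place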
The following approach gave a good speed-up for me:
import struct
import time

block_size = 4096
start = time.time()

with open('some_file.dat', 'rb') as f_input:
    data = []
    while True:
        block = f_input.read(block_size * 4)
        data.extend(struct.unpack('>{}f'.format(len(block) // 4), block))
        if len(block) < block_size * 4:
            break

print("Time taken: {:.2f}".format(time.time() - start))
print("Length", len(data))
Rather than using >fffffff you can specify a count, e.g. >1000f. The code reads the file 4096 floats (16 KB) at a time. If the amount read is less than a full block, it unpacks however many floats it did get and exits.
From the struct - Format Characters documentation:
A format character may be preceded by an integral repeat count. For
example, the format string '4h' means exactly the same as 'hhhh'.
def read_big_endian(filename):
    all_text = ""
    with open(filename, "rb") as template:
        template.read(2)               # first 2 bytes are the BOM (FF FE)
        while True:
            dchar = template.read(2)
            if len(dchar) < 2:         # end of file
                break
            all_text += chr(dchar[0])  # keep the low byte of each 2-byte char
    return all_text

def save_big_endian(filename, text):
    with open(filename, "wb") as fic:
        fic.write(b"\xff\xfe")         # first 2 bytes are FF FE
        for letter in text:
            fic.write(bytes([ord(letter), 0]))
Used to read .rdp files
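Incidentally, FF FE is the UTF-16 little-endian BOM, and .rdp files are UTF-16 with a BOM, so on Python 3 a simpler alternative is to let the codec handle the byte order (a sketch, assuming that encoding):
# reading
with open(filename, encoding="utf-16") as f:
    text = f.read()

# writing (the utf-16 codec emits the BOM for you)
with open(filename, "w", encoding="utf-16") as f:
    f.write(text)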
In Python, reading a big text file line-by-line is simple:
for line in open('somefile', 'r'): ...
But how do I read a binary file and 'split' (by generator) its content on some given marker, rather than the newline '\n'?
I want something like this:
content = open('somefile', 'r').read()
result = content.split('some_marker')
but, of course, memory-efficient (the file is around 70 GB). And of course we can't read the file one byte at a time (it would be too slow because of how the HDD works).
The length of the 'chunks' (the data between those markers) might differ, theoretically from 1 byte to megabytes.
So, to give an example to sum up, the data looks like this (the digits represent bytes here; the data is in a binary format):
12345223-MARKER-3492-MARKER-34834983428623762374632784-MARKER-888-MARKER-...
Is there any simple way to do that (not implementing reading in chunks, splitting the chunks, remembering tails etc.)?
There is no magic in Python that will do it for you, but it's not hard to write. For example:
def split_file(fp, marker):
    # fp should be opened in binary mode and marker given as bytes, e.g. b'-MARKER-'
    BLOCKSIZE = 4096
    result = []
    current = b''
    for block in iter(lambda: fp.read(BLOCKSIZE), b''):
        current += block
        while 1:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            result.append(current[:markerpos])
            current = current[markerpos + len(marker):]
    result.append(current)
    return result
Memory usage of this function can be further reduced by turning it into a generator, i.e. converting result.append(...) to yield .... This is left as an exercise to the reader.
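For reference, a sketch of that generator variant (same assumptions: fp opened in binary mode, marker passed as bytes; the name split_file_iter is just for illustration):
def split_file_iter(fp, marker):
    BLOCKSIZE = 4096
    current = b''
    for block in iter(lambda: fp.read(BLOCKSIZE), b''):
        current += block
        while True:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
    yield current  # whatever follows the last marker (possibly empty)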
A general idea: mmap the file, then you can re.finditer over it:
import mmap
import re

with open('somefile', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    markers = re.finditer(b'(.*?)MARKER', mf, flags=re.DOTALL)  # DOTALL so '.' also matches newline bytes
    for marker in markers:
        print(marker.group(1))
I haven't tested, but you may want a (.*?)(MARKER|$) or similar in there as well.
Then, it's down to the OS to provide the necessaries for access to the file.
I don't think there's any built-in function for that, but you can "read in chunks" nicely with a generator to avoid memory inefficiency, similarly to user4815162342's suggestion:
def split_by_marker(f, marker=b"-MARKER-", block_size=4096):
    current = b''
    while True:
        block = f.read(block_size)
        if not block:  # end-of-file
            yield current
            return
        current += block
        while True:
            markerpos = current.find(marker)
            if markerpos < 0:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
This way you won't save all the results in the memory at once, and you can still iterate it like:
for line in split_by_marker(open(filename, 'rb')): ...
Just make sure that each "line" does not take too much memory...
Readline itself reads in chunks, splits the chunks, remembers tails, etc. So, no.
So I am attempting to read in a large data file in python. If the data had one column and 1 million rows I would do:
fp = open(ifile, 'r')
for row in fp:
    process row
My problem arises when the data I am reading in has, say, 1 million columns and only 1 row. What I would like is functionality similar to the fscanf() function in C.
Namely,
while not EOF:
    part_row = read_next(%lf)
    work on part_row
I could use fp.read(%lf), if I knew that the format was long float or whatever.
Any thoughts?
A million floats in text format really isn't that big... So unless it's proving a bottleneck of some sort, I wouldn't worry about it and would just do:
with open('file') as fin:
    my_data = [process_line(word) for word in fin.read().split()]
A possible alternative (assuming space delimited "words") is something like:
import mmap, re

with open('whatever.txt', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for word in re.finditer(rb'(.*?)\s', mf):
        print(word.group(1))
And that'll scan the entire file and effectively give a massive word stream, regardless of rows / columns.
There are two basic ways to approach this:
First, you can write a read_column function with its own explicit buffer, either as a generator function:
def column_reader(fp):
    buf = ''
    while True:
        col_and_buf = buf.split(',', 1)
        while len(col_and_buf) == 1:
            buf += fp.read(4096)
            col_and_buf = buf.split(',', 1)
        col, buf = col_and_buf
        yield col
… or as a class:
class ColumnReader(object):
    def __init__(self, fp):
        self.fp, self.buf = fp, ''
    def next(self):
        col_and_buf = self.buf.split(',', 1)
        while len(col_and_buf) == 1:
            self.buf += self.fp.read(4096)
            col_and_buf = self.buf.split(',', 1)
        col, self.buf = col_and_buf
        return col
But, if you write a read_until function that handles the buffering internally, then you can just do this:
next_col = read_until(fp, ',')[:-1]
There are multiple read_until recipes on ActiveState.
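A tiny sketch of such a helper (not one of those recipes, just an illustration of the interface used above; real recipes buffer larger reads):
def read_until(fp, delim):
    # read one character at a time until the delimiter (or EOF) is hit,
    # returning everything read, delimiter included
    out = ''
    while True:
        ch = fp.read(1)
        if not ch:
            break
        out += ch
        if ch == delim:
            break
    return out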
Or, if you mmap the file, you effectively get this for free. You can just treat the file as a huge string and use find (or regular expressions) on it. (This assumes the entire file fits within your virtual address space—probably not a problem in 64-bit Python builds, but in 32-bit builds, it can be.)
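A rough sketch of that mmap route ('data.txt' is a placeholder; the trailing column after the last comma is left out here):
import mmap

with open('data.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    start = 0
    while True:
        pos = mm.find(b',', start)
        if pos == -1:
            break
        col = mm[start:pos]   # one column, as bytes
        start = pos + 1
        # ... work on col ...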
Obviously these are incomplete. They don't handle EOF, or newline (in real life you probably have six rows of a million columns, not one, right?), etc. But this should be enough to show the idea.
You can accomplish this using yield.
def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('your_file.txt')
for piece in read_in_chunks(f):
    process_data(piece)
Take a look at this question for more examples.
I usually read files like this in Python:
f = open('filename.txt', 'r')
for x in f:
    doStuff(x)
f.close()
However, this splits the file by newlines. I now have a file which has all of its info in one line (45,000 strings separated by commas). While a file of this size is trivial to read in using something like
f = open('filename.txt', 'r')
doStuff(f.read())
f.close()
I am curious whether, for a much larger file which is all on one line, it would be possible to achieve a similar iteration effect as in the first code snippet, but splitting by comma instead of newline, or by any other character?
The following function is a fairly straightforward way to do what you want:
def file_split(f, delim=',', bufsize=1024):
    prev = ''
    while True:
        s = f.read(bufsize)
        if not s:
            break
        split = s.split(delim)
        if len(split) > 1:
            yield prev + split[0]
            prev = split[-1]
            for x in split[1:-1]:
                yield x
        else:
            prev += s
    if prev:
        yield prev
You would use it like this:
for item in file_split(open('filename.txt')):
    doStuff(item)
This should be faster than the solution that EMS linked, and will save a lot of memory over reading the entire file at once for large files.
Open the file using open(), then use the file.read(x) method to read (approximately) the next x bytes from the file. You could keep requesting blocks of 4096 characters until you hit end-of-file.
You will have to implement the splitting yourself - you can take inspiration from the csv module, but I don't believe you can use it directly because it wasn't designed to deal with extremely long lines.
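A rough sketch of that manual approach (the name iter_items and the 4096-byte block size are just illustrative):
def iter_items(f, delim=',', bufsize=4096):
    leftover = ''
    while True:
        block = f.read(bufsize)
        if not block:
            break
        parts = (leftover + block).split(delim)
        leftover = parts.pop()  # the last piece may continue in the next block
        for part in parts:
            yield part
    if leftover:
        yield leftover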