How to read dynamic data structure from binary file until EOF? - python

I have binary data that is stored in a non-trivial format where the information 'chunks' are not a fixed size and are similar to packets. I am reading them dynamically using this function:
def unpack_bytes(stream: BytesIO, binary_format: str) -> tuple:
    size = struct.calcsize(binary_format)
    buf = stream.read(size)
    print(buf)
    return struct.unpack(binary_format, buf)
This function is called with the appropriate format as needed and the code that creates the stream and loops over it is as follows:
def parse_data_file(data_directory: str) -> Generator[CompressedFile]:
    with open(data_directory, 'rb') as packet_stream:
        while <EOF file logic here>:
            contents = parse_packet(packet_stream)
            contents = gzip.compress(data=contents, compresslevel=9)
            yield CompressedFile(filename=f"{uuid.uuid4()}.gz", datetime=datetime.now(),
                                 contents=contents)
CompressedFile is just a small dataclass to store the filename, datetime, and contents.
parse_packet extracts a single packet (as per the data spec) from the bin file and returns the contents. Since the packets don't have a fixed width I am wondering what the best way to stop the loop would be. The two options I know of are:
Add some extra logic to unpack_bytes() to bubble up an EOF.
Do some cursor-foo to save the EOF position and check against it as the loop runs. I'd like to not manipulate the cursor directly if possible.
Is there a more idiomatic way to check for EOF within parse_data_file?
The last call to parse_packet (and by extension the last call to unpack_bytes) will consume all the data and the cursor will be at the end when the next iteration of the loop begins. I'd like to take advantage of that state instead of adding EOF handling code all the way up from unpack_bytes or fiddling with the cursor directly.
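One way to take advantage of that state (a minimal sketch, assuming the same imports and helpers as above): open(data_directory, 'rb') returns a buffered reader, and its peek() method reports whether any bytes remain without moving the cursor, so the loop condition itself can serve as the EOF check.

def parse_data_file(data_directory: str) -> Generator[CompressedFile, None, None]:
    with open(data_directory, 'rb') as packet_stream:
        # peek() returns b"" once the stream is exhausted, without moving the cursor
        while packet_stream.peek(1):
            contents = parse_packet(packet_stream)
            contents = gzip.compress(data=contents, compresslevel=9)
            yield CompressedFile(filename=f"{uuid.uuid4()}.gz", datetime=datetime.now(),
                                 contents=contents)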


How to use write() method on a file without replacing characters?

I have a .txt file whose contents are:
This is an example file.
These are its contents.
This is line 3.
If I open the file, move to the beginning, and write some text like so...
f = open(r'C:\Users\piano\Documents\sample.txt', 'r+')
f.seek(0, 0)
f.write('Now I am adding text.\n')
What I am expecting is for the file to read:
Now I am adding text.
This is an example file.
These are its contents.
This is line 3.
...but instead it reads:
Now I am adding text.
.
These are its contents.
This is line 3.
So why is some of the text being replaced instead of the text I'm writing simply being added onto the beginning? How can I fix this?
write() will overwrite any existing content at the current position.
To overcome this, you can do:
with open(r'C:\Users\piano\Documents\sample.txt', 'r+') as file:
    string = file.read()
    file.truncate(0)  # delete all contents
    file.seek(0, 0)
    file.write('Now I am adding text.\n' + string)
It is also recommended that you use with, because the file is closed automatically by the context manager's __exit__() method. This is important because not all Python interpreters are CPython, so you cannot rely on the file being closed as soon as it goes out of scope.
Bonus: if you wish to insert lines in between, you can do:
with open(r'C:\Users\piano\Documents\sample.txt', 'r+') as file:
    contents = file.readlines()
    contents.insert(1, 'Now I am adding text.\n')  # inserting as the second line
    file.truncate(0)  # delete all contents
    file.seek(0, 0)
    file.writelines(contents)
Most file systems don't work like that. A file's contents are mapped to data blocks, and these data blocks are not guaranteed to be contiguous on the underlying system (i.e. not necessarily "side by side").
When you seek, you're seeking to a byte offset. So if you want to insert new data between 2 byte offsets of a particular block, you'll have to actually shift all subsequent data over by the length of what you're inserting. Since the block could easily be entirely "filled", shifting the bytes over might require allocating a new block. If the subsequent block was entirely "filled" as well, you'll have to shift the data of that block as well, and so on.. You can start to see why there's no "simple" operation for shifting data.
Generally, we solve this by just reading all the data into memory and then re-writing it back to a file. When you encounter the byte offset you're interested in inserting "new" content at, you write your buffer and then continue writing the "original" data. In Python, you won't have to worry about interleaving multiple buffers when writing, since Python will abstract the data to some data structure. So you'd just concatenate the higher-level data structures (e.g. if it's a text file, just concat the 3 strings).
If the file is too large for you to comfortably place it in memory, you can write to a "new" temporary file, and then just swap it with the original when your write operation is done.
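A rough sketch of that temporary-file approach (the path, offset and inserted bytes are placeholders you would supply yourself):

import os
import tempfile

def insert_at(path, offset, new_bytes, chunk_size=64 * 1024):
    # Create the temp file in the same directory so os.replace() stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with open(path, "rb") as src, os.fdopen(fd, "wb") as tmp:
        remaining = offset
        while remaining > 0:                     # copy everything before the offset
            chunk = src.read(min(chunk_size, remaining))
            if not chunk:
                break
            tmp.write(chunk)
            remaining -= len(chunk)
        tmp.write(new_bytes)                     # write the inserted content
        while True:                              # copy the rest of the original
            chunk = src.read(chunk_size)
            if not chunk:
                break
            tmp.write(chunk)
    os.replace(tmp_path, path)                   # swap the temp file into place

insert_at(r'C:\Users\piano\Documents\sample.txt', 0, b'Now I am adding text.\n')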
Now if you consider the "shifting" of data in data blocks I mentioned above, you might consider the simpler edge case where you happen to be inserting data of length N at an offset that's a multiple of N, where N is the fixed size of the data block in the file system. In this case, if you think of the data blocks as a linked list, you might consider it a rather simple operation to add a new data block between the offset you're inserting at and the next block in the list.
In fact, Linux systems do support allocating an additional block at this boundary. See fallocate.

How can I remove a specific number of bytes from the beginning and end of a file using python?

I have a folder full of files that need to be modified in order to extract the true file in its real format.
I need to remove a certain number of bytes from BOTH the beginning and end of the file in order to extract the data I am looking for.
How can I do this in python?
I need this to work recursively on an entire folder.
I also need it to output a new file (or modify the existing one) with the bytes removed.
I would greatly appreciate any help or guidance you can provide.
Recursive iteration over files: os.walk
Change position in file: f.seek
Get file size: os.stat
Remove data from current position to end of file: f.truncate
So, the base logic (a sketch of it follows below):
Iterate over the files.
Get the file size.
Open the file ('rb+', I suppose).
Seek to the position from which you want to start reading (i.e. past the dropped leading bytes).
Read everything except the bytes you want to drop ( f.read(file_size - top_dropped - bottom_dropped) ).
Seek(0).
Write the data you just read back to the file.
Truncate the file.
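A minimal sketch of that logic, assuming the trimmed file fits in memory and that top_dropped/bottom_dropped are placeholder byte counts:

import os

def strip_bytes(path, top_dropped, bottom_dropped):
    file_size = os.stat(path).st_size
    with open(path, 'rb+') as f:
        f.seek(top_dropped)                                       # skip the leading bytes
        data = f.read(file_size - top_dropped - bottom_dropped)   # keep the middle
        f.seek(0)
        f.write(data)                                             # overwrite from the start
        f.truncate()                                              # drop whatever is left over

Recursing over the folder is then a matter of calling this from an os.walk() loop.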
Your question is pretty badly constructed, but as this is somewhat advanced stuff I'll provide you with the code.
You can now use os.walk() to recursively traverse directory you want and apply my slicefile() function.
This code does the following:
After checking the validity of the start and end arguments, it creates a memory map on top of the opened file.
mmap() creates a memory-map object that maps, in this case, the portion of the file system over which the file is written. The object exposes both a string-like and a file-like interface with some additional methods like move(). So you can treat the memory map either as a string or as a file, or use size(), move(), resize() or whatever additional methods you need.
We calculate the distance between our start and end, i.e. how many bytes we will have in the end.
We move a stream of bytes end-start long from our start position to position 0, i.e. we move them backwards by the number of bytes indicated by the starting point.
We discard the rest of the file, i.e. we resize it to end-start bytes. What is left is our new content.
The bigger the file, the longer the operation takes. Unfortunately there is not much you can do about that. If a file is big, this is your best bet. The procedure is the same as when removing items from the start/middle of an in-memory array, except that it has to be buffered (done in chunks) so as not to fill up RAM.
If your file is smaller than a third of your free RAM, you can load it whole into a string with f.read(), then perform string slicing on the loaded content ( s = s[start:end] ) and then write it back into the file by opening it again and just doing f.write(s).
If you have enough disk space, you can open another file, seek to the starting point you want in the original file, then read it in chunks and write these into the new file, perhaps even using shutil.copyfileobj(). After that, you remove the original file and use os.rename() to put the new one in its place (a sketch of this follows below). These are your only 3 options:
the whole file into RAM; moving by buffering backwards and then resizing; and copying into another file, then renaming it. The second option is the most universal and won't fail you for small or big files, which is why I used it.
OK, not only 3 options. There is a fourth option: it might be possible to cut off N bytes from the beginning of the file by manipulating the file system itself using low-level operations, i.e. writing a kind of truncate() that truncates the beginning instead of the end. But this would be pretty suicidal: in the end memory fragmentation would occur and a whole mess would arise. You don't need such speed anyway. Be patient until your script finishes. :D
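A sketch of that copy-and-rename option (the path and the number of bytes to drop from the start are placeholders):

import os
import shutil

def drop_head(path, start):
    new_path = path + ".tmp"
    with open(path, "rb") as src, open(new_path, "wb") as dst:
        src.seek(start)                  # skip the bytes to be removed
        shutil.copyfileobj(src, dst)     # stream the remainder in chunks
    os.remove(path)
    os.rename(new_path, path)

drop_head("some_file.bin", 128)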
Why did I use mmap()?
Because it uses memory maps implemented in the OS rather than completely new code. This reduces the number of system calls needed to deal with the opened file. Half of the work is thrust upon the operating system, leaving Python to breathe easily.
Because it is mostly written in C, which makes it a touch faster than a pure Python implementation would be.
Because it implements move(), which we need. The buffering and everything is already written, so there is no need for the bulky while loop that would be the alternative (manual) solution.
And so on...
from mmap import mmap

def slicefile(path, start=0, end=None):
    f = open(path, "r+b")  # read and write, binary
    f.seek(0, 2)
    size = f.tell()
    start = 0 if start is None else start
    end = size if end is None else end
    start = size + start if start < 0 else start
    end = size + end if end < 0 else end
    end = size if end > size else end
    if (end == size and start == 0) or (end <= start):
        f.close()
        return
    # If start is 0, no need to move anything, just cut off the rest after end
    if start == 0:
        f.seek(end)
        f.truncate()
        f.close()
        return
    # Modify in place using mapped memory:
    newsize = end - start
    m = mmap(f.fileno(), 0)
    m.move(0, start, newsize)
    m.flush()
    m.resize(newsize)
    m.close()
    f.close()
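And, as mentioned above, tying it together with os.walk() might look something like this (the folder name and the 10/4 byte counts to drop from the start and end are placeholders):

import os

for root, dirs, files in os.walk("some_folder"):
    for name in files:
        path = os.path.join(root, name)
        size = os.stat(path).st_size
        slicefile(path, start=10, end=size - 4)   # drop 10 bytes from the start, 4 from the end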

Buffer/stream to read a subset of a file in Python?

I have written a class which parses the header segment of a data file (for storing scientific instrument data) and collects things like offsets to various data segments within the file. The actual data is obtained through various methods which read and parse the data segments.
The problem I'm running into is that there is a segment defined for vendor-specific, unstructured data. Since there's nothing to parse I just need my method to return raw binary data. However this segment could be very large so I don't just want to read it all at once and return a single bytes object.
What I'd like to do is have the method return an io.BufferedReader object or similar into the file that only reads between a beginning and end offset. I haven't been able to figure out a way to do this using the builtin IO classes. Is it possible?
You inherit all of the class methods from IOBase, so you can absolutely call reader.seek(byte_offset) to skip to that byte position in the stream. However, from there you will have to manually track the bytes you have read until you reach the maximum offset you are reading to. The start offset for seek() will of course have to be known beforehand, as will the ending byte offset. Some example code follows (assuming a start offset of byte 250):
import io

# opening in binary mode already returns an io.BufferedReader
buffered_reader = io.open("file.txt", "rb")
# set the stream to byte 250
buffered_reader.seek(250)
# read up to byte 750 (500 bytes from position 250)
data = buffered_reader.read(500)
Of course, if the header is dynamically sized, you will have to scan to determine the start positions, which means reading line by line.
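If you really want an object that can only ever read between the two offsets, a hypothetical sketch (not part of the answer above) is a thin raw-IO wrapper that caps reads at the end offset; the file name and offsets here are placeholders:

import io

class WindowReader(io.RawIOBase):
    """Expose only the bytes between start and end of an underlying binary file."""

    def __init__(self, raw, start, end):
        self._raw = raw
        self._end = end
        raw.seek(start)

    def readable(self):
        return True

    def readinto(self, b):
        remaining = self._end - self._raw.tell()
        if remaining <= 0:
            return 0                                  # behave like EOF at the end offset
        chunk = self._raw.read(min(len(b), remaining))
        b[:len(chunk)] = chunk
        return len(chunk)

with open("file.bin", "rb") as f:
    window = io.BufferedReader(WindowReader(f, 250, 750))
    data = window.read()   # at most 500 bytes, never past offset 750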

Creating a program which counts the number of words in each row of a text file (Python)

I am trying to create a program which takes an input file, counts the number of words in each row, and writes that number on its own line in another output file. I managed to develop this code:
in_file = "our_input.txt"
out_file = "output.txt"
f=open(in_file)
g=open(out_file,"w")
for line in f:
if line == "\n":
g.write("0\n")
else:
g.write(str(line.count(" ")+1)+"\n")
Now, this works well, but the problem is that it only works for a certain number of lines. If my input file has 8000 lines, only about the first 6800 are written; if it has 6000, the output is cut short in the same way (all numbers are approximate).
I tried creating another program which splits each line into a list and then counts the length of that list, but the problem remains exactly the same.
Any idea what could cause this?
You need to close each file after you're done with it. The safest way to do this is by using the with statement:
with open(in_file) as f, open(out_file, "w") as g:
    for line in f:
        if line == "\n":
            g.write("0\n")
        else:
            g.write(str(line.count(" ") + 1) + "\n")
When reaching the end of a with block, all files you opened in the with line will be closed.
The reason for the behavior you see is that for performance reasons, reading and writing to/from files is buffered. Because of the way hard drives are constructed, data is read/written in blocks rather than in individual bytes - so even if you attempt to read/write a single byte, you have to read/write an entire block. Therefore, most programming languages' built-in file IO functions actually read (at least) one block at a time into memory and feed you data from that in-memory block until it needs to read another block. Similarly, writing is performed by actually writing into a memory block first, and only writing the block to disk when it is full. If you don't close the file writer, whatever is in the last in-memory block won't be written.
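For completeness, a tiny sketch of the alternative: if you keep the original script's explicit open() calls, flushing and closing the writer by hand forces the last buffered block onto the disk.

g.flush()   # push the in-memory block out to the OS
g.close()   # close (and flush) the output file
f.close()   # close the input file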

How to read from two separate locations of a file in Python?

I have a file descriptor created from a simple open call. I want to pass it to separate functions while keeping its location within the function's scope.
Let's say I have these two functions:
import struct

OFFSET_TABLE_OFS = 64

def next_ofs(handle):
    handle.seek(OFFSET_TABLE_OFS)
    while True:
        ofs, = struct.unpack('I', handle.read(4))  # unpack() returns a tuple
        if ofs == 0xFFFF:
            return None
        yield ofs

def parse_at(handle, ofs):
    handle.seek(ofs)
    parsed_stuff = handle.read(256)
The problem here is that the offset table can be really long, and I don't want or need to load all of it at once. In a small application, it's easy to simply seek back to a known offset after parsing before continuing reads, but that can have so many complications. Are there any better suggestions for handling this problem?
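One common alternative (a sketch, assuming the data lives in a single file called "data.bin" here) is to open the file twice, so the offset-table reader and the record parser each keep their own independent cursor and never have to seek back:

with open("data.bin", "rb") as table_handle, open("data.bin", "rb") as data_handle:
    for ofs in next_ofs(table_handle):   # walks the offset table with its own cursor
        parse_at(data_handle, ofs)       # reads records without disturbing the table position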
