I have a file object created from a simple open() call. I want to pass it to separate functions while keeping its read position intact across their scopes.
Let's say I have these two functions:
import struct

OFFSET_TABLE_OFS = 64

def next_ofs(handle):
    handle.seek(OFFSET_TABLE_OFS)
    while True:
        # struct.unpack returns a tuple, so unpack its single field
        (ofs,) = struct.unpack('I', handle.read(4))
        if ofs == 0xFFFF:
            return None  # sentinel marks the end of the table
        yield ofs

def parse_at(handle, ofs):
    handle.seek(ofs)
    parsed_stuff = handle.read(256)
    return parsed_stuff
The problem is that the offset table can be really long and I don't want or need to load all of it at once. In a small application it's easy to simply seek back to a known offset after parsing and then continue reading, but that bookkeeping can get complicated. Are there any better suggestions for handling this problem?
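One common pattern, sketched here under the assumption that both functions share one handle (parse_entries is a made-up name): remember the table position with tell() before each parse and seek() back afterwards, so the interleaved reads don't disturb the offset-table scan.

def parse_entries(handle):
    for ofs in next_ofs(handle):
        table_pos = handle.tell()     # where the offset-table read stopped
        yield parse_at(handle, ofs)   # parse_at moves the file position
        handle.seek(table_pos)        # restore it before the next table read

Alternatively, open a second handle to the same file (or use os.pread on POSIX) so each reader keeps its own independent position.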
I have binary data that is stored in a non-trivial format where the information 'chunks' are not a fixed size and are similar to packets. I am reading them dynamically using this function:
import struct
from io import BytesIO

def unpack_bytes(stream: BytesIO, binary_format: str) -> tuple:
    size = struct.calcsize(binary_format)
    buf = stream.read(size)
    print(buf)  # debug output
    return struct.unpack(binary_format, buf)
This function is called with the appropriate format as needed and the code that creates the stream and loops over it is as follows:
def parse_data_file(data_directory: str) -> Generator[CompressedFile]:
    with open(data_directory, 'rb') as packet_stream:
        while <EOF file logic here>:
            contents = parse_packet(packet_stream)
            contents = gzip.compress(data=contents, compresslevel=9)
            yield CompressedFile(filename=f"{uuid.uuid4()}.gz",
                                 datetime=datetime.now(),
                                 contents=contents)
CompressedFile is just a small dataclass to store the filename, timestamp, and contents.
parse_packet extracts a single packet (as per the data spec) from the binary file and returns its contents. Since the packets don't have a fixed width, I am wondering what the best way to stop the loop would be. The two options I know of are:
Add some extra logic to unpack_bytes() to bubble up an EOF.
Do some cursor-foo to save the EOF position and check against it as the loop runs. I'd like to not manipulate the cursor directly if possible.
Is there a more idiomatic way to check for EOF within parse_data_file?
The last call to parse_packet (and by extension the last call to unpack_bytes) will consume all the data and the cursor will be at the end when the next iteration of the loop begins. I'd like to take advantage of that state instead of adding EOF handling code all the way up from unpack_bytes or fiddling with the cursor directly.
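One possibility, sketched under the assumption that packet_stream is the io.BufferedReader that open(..., 'rb') returns: its peek() method returns buffered bytes without advancing the cursor, and an empty result means the cursor is at EOF.

def parse_data_file(data_directory: str) -> Generator[CompressedFile]:
    with open(data_directory, 'rb') as packet_stream:
        # peek() never moves the cursor; it returns b"" only at EOF
        while packet_stream.peek(1):
            contents = parse_packet(packet_stream)
            contents = gzip.compress(data=contents, compresslevel=9)
            yield CompressedFile(filename=f"{uuid.uuid4()}.gz",
                                 datetime=datetime.now(),
                                 contents=contents)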
I have a folder full of files that need to be modified in order to extract the true file in its real format.
I need to remove a certain number of bytes from BOTH the beginning and end of each file in order to extract the data I am looking for.
How can I do this in Python?
I need this to work recursively on an entire folder.
I also need this to output (or modify the existing) file with the bytes removed.
I would greatly appreciate any help or guidance you can provide.
Recursive iteration over files: os.walk
Change position in file: f.seek
Get file size: os.stat
Remove data from current position to end of file: f.truncate
So, the base logic (a code sketch follows the list):
Iterate over the files.
Get the file size.
Open the file ('rb+', I suppose).
Seek to the position from which you want to start reading.
Read everything except the bytes you want to drop ( f.read(file_size - top_dropped - bottom_dropped) ).
Seek(0).
Write the read data back to the file.
Truncate the file.
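A minimal sketch of those steps; strip_bytes, the folder name, and the byte counts are placeholders:

import os

def strip_bytes(path, top_dropped, bottom_dropped):
    file_size = os.stat(path).st_size
    with open(path, 'rb+') as f:
        f.seek(top_dropped)                    # skip the leading bytes
        data = f.read(file_size - top_dropped - bottom_dropped)
        f.seek(0)
        f.write(data)                          # shift the kept bytes to the front
        f.truncate()                           # drop everything past the new end

for dirpath, dirnames, filenames in os.walk('folder'):
    for name in filenames:
        strip_bytes(os.path.join(dirpath, name), 10, 4)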
Your question is pretty badly constructed, but as this is somewhat advanced stuff I'll provide you with the code.
You can now use os.walk() to recursively traverse the directory you want and apply my slicefile() function.
This code does the following:
After checking the validity of the start and end arguments, it creates a memory map on top of the opened file.
mmap() creates a memory map object that maps, in this case, the portion of the file system over which the file is written. The object exposes both a string-like and a file-like interface, with some additional methods like move(). So you can treat the memory map either as a string or as a file, or use size(), move(), resize() or whatever additional methods you need.
We calculate the distance between our start and end, i.e. how many bytes we will have in the end.
We move the stream of bytes, end-start long, from our start position to position 0, i.e. we shift it backwards by the number of bytes indicated by the starting point.
We discard the rest of the file, i.e. we resize it to end-start bytes. What is left is our new content.
The operation takes longer the bigger the file is. Unfortunately there is not much you can do about that; if a file is big, this is your best bet. The procedure is the same as when removing items from the start/middle of an in-memory array, except that this has to be buffered (in chunks) so as not to fill up RAM.
If your file is smaller than a third of your free RAM, you can load it whole into a string with f.read(), perform string slicing on the loaded content ( s = s[start:end] ), and then write it back by opening the file again and just doing f.write(s).
If you have enough disk space, you can open another file, seek to the starting point you want in the original file, then read it in chunks and write these into the new file, perhaps even using shutil.copyfileobj(). After that, you remove the original file and use os.rename() to put the new one in its place. These are your only 3 options:
whole file into RAM; moving by buffering backwards and then resizing; and copying into another file, then renaming it. The second option is the most universal and won't fail you on small or big files alike, which is why I used it.
OK, not only 3 options. There is a fourth: it might be possible to cut N bytes off the beginning of a file by manipulating the file system itself with low-level operations, i.e. writing a kind of truncate() that truncates the beginning instead of the end. But that would be pretty suicidal: in the end fragmentation would occur and a whole mess would arise. You don't need such speed anyway. You will be patient until your script finishes. :D
Why did I use mmap()?
Because it uses memory maps implemented in the OS rather than completely new code. This reduces the number of system calls needed to deal with the opened file: half of the work is thrust upon the operating system, leaving Python to breathe easily.
Because it is mostly written in C, which makes it a touch faster than a pure Python implementation would be.
Because it implements move(), which we need. The buffering and everything is already written, so there is no need for the bulky while loop that would be the alternative (manual) solution.
And so on...
from mmap import mmap

def slicefile(path, start=0, end=None):
    f = open(path, "r+b")  # Read and write binary
    f.seek(0, 2)
    size = f.tell()
    start = 0 if start is None else start
    end = size if end is None else end
    start = size + start if start < 0 else start
    end = size + end if end < 0 else end
    end = size if end > size else end
    if (end == size and start == 0) or (end <= start):
        f.close()
        return
    # If start is 0, no need to move anything, just cut off the rest after end
    if start == 0:
        f.seek(end)
        f.truncate()
        f.close()
        return
    # Modify in place using mapped memory:
    newsize = end - start
    m = mmap(f.fileno(), 0)
    m.move(0, start, newsize)
    m.flush()
    m.resize(newsize)
    m.close()
    f.close()
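For example, applied over a whole tree (the folder name and byte counts here are placeholders):

import os

for dirpath, dirnames, filenames in os.walk("folder"):
    for name in filenames:
        # e.g. drop 10 bytes from the front and 4 from the back of each file
        slicefile(os.path.join(dirpath, name), start=10, end=-4)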
I have a text file test.txt, with the following contents:
Thing 1. string
And I'm creating a python file that will increment the number every time it gets run without affecting the rest of the string, like so.
Run once:
Thing 2. string
Run twice:
Thing 3. string
Run three times:
Thing 4. string
Run four times:
Thing 5. string
This is the code that I'm using to accomplish this.
file = open("test.txt","r+")
started = False
beginning = 0 #start of the digits
done = False
num = 0
#building the number from digits
while not done:
next = file.read(1)
if ord(next) in range(48, 58): #ascii values of 0-9
started = True
num *= 10
num += int(next)
elif started: #has reached the end of the number
done = True
else: #has not reached the beginning of the number
beginning += 1
num += 1
file.seek(beginning,0)
file.write(str(num))
This code works, so long as the number is not 10^n - 1 (9, 99, 999, etc.), because in those cases it writes more bytes than the number previously occupied, and so it overwrites the characters that follow.
So this brings me to the point. I need a way to write to the file that overwrites previously existing bytes, which I have, and a way to write to the file that does not overwrite previously existing bytes, which I don't have. Does such a mechanism exist in Python, and if so, what is it?
I have already tried opening the file using the line file = open("test.txt","a+") instead. When I do that, it always writes to the end, regardless of the seek point.
file = open("test.txt","w+") will not work because I need to keep the contents of the file while altering it, and files opened in any variant of w mode are wiped clean.
I have also thought of solving my problem using a function like this:
#file is assumed to be in r+ mode
def write(string, file, index=-1):
    if index == -1:
        index = file.tell()  # default to the current position
    file.seek(index, 0)
    remainder = file.read()  # everything after the insertion point
    file.seek(index)
    file.write(string + remainder)  # insert, then shift the old bytes back
But I also want to be able to expand the solution to larger files, and reading the rest of the file by itself turns what I'm trying to accomplish from O(1) into O(n). It also seems very non-Pythonic, since it accomplishes the task in a less-than-straightforward way.
It would also make my I/O operations inconsistent: I would have class methods (file.read() and file.write()) to read from the file and write to it replacing old characters, but an external function to insert without replacing.
If I make the code inline, rather than a function, it means I have to write several of the same lines of code every time I try to write without replacing, which is also non-Pythonic.
To reiterate my question, is there a more straightforward way to do this, or am I stuck with the function?
Unfortunately, what you want to do is not possible. This is a limitation at a lower level than Python, in the operating system. Neither the Unix nor the Windows file access API offers any way to insert new bytes in the middle of a file without overwriting the bytes that were already there.
Reading the rest of the file and rewriting it is the usual workaround. Actually, the usual workaround is to rewrite the entire file under a new name and then use rename to move it back to the old name. On Unix, this accomplishes an atomic file update - unless the computer crashes, concurrent readers will see either the new file or the old file, not some hybrid. (Windows, sadly, still does not allow you to rename over a name that already exists, so if you use this strategy you have to delete the old file first, opening an unavoidable race window where the file might appear not to exist at all.)
Yes, this is O(N), and yes, if you use the write-new-file-and-rename strategy it temporarily consumes scratch disk space equal to the size of the file (old or new, whichever is larger). That's just how it is.
I haven't thought about it enough to give you even a sketch of the code, but it should be possible to use context managers to wrap up the write-new-file-and-rename approach tidily.
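A rough sketch of such a context manager (the name atomic_rewrite is made up; note that os.replace(), available since Python 3.3, also renames over an existing file on Windows):

import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_rewrite(path):
    # Write to a temporary file in the same directory, then rename into place.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            yield tmp
        os.replace(tmp_path, path)  # atomic on POSIX
    except BaseException:
        os.remove(tmp_path)
        raise

Usage would be along the lines of: with atomic_rewrite("test.txt") as f: f.write(new_contents).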
No, the disk doesn't work like you think it does.
You have to remember that your file is stored on disk as one contiguous chunk of data*.
Your disk happens to be wound up in a great big spool, a bit like a record, but if you were to unwind your file, you'd get something that looks like this:
+------------------------------------------------------------+
| Thing 1. String                                             |
+------------------------------------------------------------+
^ ^                                                        ^ ^
| |                                                        | |
| Start of file                                  End of File |
Start of disk                                      End of disk
As you've discovered, there's no way to simply insert data in the middle. Generally speaking, that wouldn't be possible at all without physically altering your disk. And who wants to do that? Especially when just flipping the magnetic bits on your disk is so much easier and faster. In order to do what you want, you have to read the bytes that you want to overwrite, then start writing down your new ones. It might look something like this:
Open the file
Seek to the point of insert
Read the current byte
Seek backward one byte
Write down the first byte of the new string
Read the next byte
Seek backward one byte
Write down the next byte of the new string
Repeat until all the bytes have been written to disk
Close the file
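In Python, that leapfrog shuffle might look roughly like this (a sketch with made-up names, generalized from single bytes to chunk-sized steps):

def insert_bytes(path, pos, new_bytes):
    with open(path, "rb+") as f:
        f.seek(pos)
        carry = new_bytes
        while carry:
            displaced = f.read(len(carry))  # bytes the next write will clobber
            f.seek(pos)
            f.write(carry)                  # put the pending bytes down
            pos += len(carry)
            carry = displaced               # the clobbered bytes go next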
Of course, this might be a little bit on the slow side, due to all the seeking back and forth in the file. It might be faster to read each line, and then seek back to the previous location in the file. It should be relatively straightforward to implement something like this in Python, but as you've discovered, there are system limitations that Python can't really overcome.
*Unless the files are fragmented, but we're living in an ideal world where gravity adheres to 9.8 m/s² and the Earth is a perfect sphere.
I'm writing a program to search for a specific line in a very large (unordered) file (so it would be preferred not to load the entire file into memory).
I'm implementing multithreading to speed up the process. I'm trying to give a particular thread a particular part of the file, i.e. the first thread would run through the first quarter of the file, the second thread would scan (simultaneously) from the endpoint where the first thread stops, and so on.
So to do this I need to find the byte locations of different parts of the file. For simplicity, let's say I just want to find the middle of the file. But the problem is that each line has a different length, so if I just do
fo.seek(0, 2)
end = fo.tell()
mid = end/2
fo.seek(mid, 0)
It could land me in the middle of a line. So I need a way to seek to the next or previous newline. Also, note that I don't want the exact middle, just somewhere around it (since it's a very large file).
Here's what I was able to code; I'm not sure whether it loads the file into memory or not. I would also really like to avoid opening two instances of the same file (I did so in my program because I didn't want to worry about the offset changing as I read the file).
Any modification (or a new program) which is faster would be appreciated.
fo = open(filename, "r+")  # note: "rw+" is not a valid mode string; use "r+"
f2 = open(filename, "r+")
file_ = dict()

fo.seek(0, 2)                      # jump to the end of the file
file_['end'] = fo.tell()           # file size in bytes
file_['mid'] = file_['end'] / 2    # rough middle (integer division in Python 2)

fo.seek(file_['mid'], 0)
f2.seek(file_['mid'], 0)
line = f2.readline()               # discard the partial line at the midpoint
file_['mid'] = f2.tell()           # start of the next full line
fo.seek(file_['mid'], 0)           # align fo to the same boundary
print fo.readline()
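The same newline alignment can be done with a single handle and a small helper, sketched here (the function name is made up):

def seek_to_next_line(f, pos):
    f.seek(pos)        # jump to the approximate spot
    f.readline()       # discard the (possibly partial) current line
    return f.tell()    # now at the start of the next full line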
How large is very large? grep tears through even 1-10 GB files relatively quickly.
If the file is static and you plan to search through it repeatedly, you could split it:
split -l <line_count> <file>
Now you have multiple files, and can pass each to a separate thread/process/whatever.
Is the file sorted? That changes things again, since now you can just binary search with fo.seek() calls.
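As a sketch of that idea (binary mode, assuming the lines are sorted as raw bytes; contains_line is a made-up name):

import os

def contains_line(f, target):
    # f must be opened in "rb" mode; target is a bytes line without the newline.
    f.seek(0)
    if f.readline().rstrip(b"\n") == target:  # the first line is a special case
        return True
    lo, hi = 0, os.fstat(f.fileno()).st_size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()                   # skip the partial line we landed inside
        line = f.readline()            # first full line starting after mid
        if not line or line.rstrip(b"\n") >= target:
            hi = mid                   # that line is at or past the target
        else:
            lo = mid + 1               # the target can only start after mid
    f.seek(lo)
    f.readline()
    return f.readline().rstrip(b"\n") == target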
How fast is fast enough? Beyond a certain point, you're going to have to build a search index. Up to that point, simple tools like grep, split, etc. work wonders.
Without more information, it's impossible to say what the right tradeoffs are here.
I am a somewhat Python/programming newbie, and I am attempting to use a Python class for the first time.
In this code I am trying to create a script to back up some files. I have 6 files in total that I want to back up regularly with this script, so I thought that I would try and use a Python class to save me writing things out 6 times, and also to get practice using classes.
In my code below I have things set up for just creating 1 instance of a class for now, to test things. However, I have hit a snag. I can't seem to use the % operator to fill in the original filename and the back-up filename.
Is it not possible to use the % operator on a filename when opening a file? Or am I doing things wrong?
class Back_up(object):
    def __init__(self, file_name, back_up_file):
        self.file_name = file_name
        self.back_up_file = back_up_file
        print "I %s and me %s" % (self.file_name, self.back_up_file)
        with open('%s.txt', 'r') as f, open('{}.txt', 'w') as f2 % (self.file_name, self.back_up_file):
            f_read = read(f)
            f2.write(f_read)

first_back_up = Back_up("syn1_ready", "syn1_backup")
Also, line #7 is really long; any tips on how to shorten it are appreciated.
Thanks
Darren
If you just want your files backed up, may I suggest using shutil.copy()?
As for your program:
If you want to substitute in a string to build a filename, you can do it. But your code doesn't do it.
You have this:
with open('%s.txt', 'r') as f, open('{}.txt', 'w') as f2 % (self.file_name, self.back_up_file):
Try this instead:
src = "%s.txt" % self.file_name
dest = "{}.txt".format(self.back_up_file)
with open(src, "rb") as f, open(dest, "wb") as f2:
# copying code goes here
The % operator operates on a string, and .format() is a method call on a string. Either way, you need to apply the operation to the string itself; you can't have two with statements and then try to apply these operators at the end of the line containing them.
You don't have to use explicit temp variables like I show here, but it's a good way to make the code easy to read, while greatly shortening the with-statement line.
Your code to copy the files will read all the file data into memory at one time. That will be fine for a small file. For a large file, you should use a loop that calls .read(CHUNK_SIZE) where CHUNK_SIZE is a maximum amount to read in a single chunk. That way if you ever back up a really large file on a computer with limited memory, it will simply work rather than filling the computer's memory and making the computer start swapping to disk.
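That loop might look like this (CHUNK_SIZE is an arbitrary example value):

CHUNK_SIZE = 64 * 1024  # 64 KiB per read; tune to taste

with open(src, "rb") as f, open(dest, "wb") as f2:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:       # b"" means end of file
            break
        f2.write(chunk)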
Try simplicity :)
Your line 7 is not going to parse. Split it using intermediate variables:
source_fname = "%s.txt" % self.file_name
target_fname = "%s.txt" % self.back_up_file
with open(source_fname, "rb") as source, open(target_fname, "wb") as target:
    # do your thing
# do your thing
Also, try hard to avoid inconsistent and overly generic attribute names, like file_name, when you have two files to operate on.
Your copy routine is not going to be very efficient, either. It tries to read the entire file into memory and then write it out. If I were you, I'd call rsync or something similar via popen() and feed it a proper list of files to operate on. Most probably I'd use bash for that, though Python may be fine too.
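For instance, a sketch of shelling out (subprocess is the usual modern replacement for os.popen(); the filenames are placeholders):

import subprocess

files_to_back_up = ["syn1_ready.txt", "syn2_ready.txt"]  # example names
subprocess.check_call(["rsync", "-a"] + files_to_back_up + ["backups/"])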