I've read here that editing a file in place is not possible, since it's OS-related and entirely independent of the language. Why is that? Doesn't the C standard's fseek() allow in-place file writes?
You can seek in Python as well - the answer you linked is just pointing out that you can't insert in the middle of a file (ABC -> ABXC).
If you seek and then write, whatever data is written overwrites the contents at that position of the file (ABC -> AXC).
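As a quick illustration (the file name is just for the example), seeking and writing overwrites in place rather than inserting:
with open('demo.txt', 'w') as f:
    f.write('ABC')
with open('demo.txt', 'r+') as f:
    f.seek(1)        # position at 'B'
    f.write('X')     # overwrites 'B' in place
with open('demo.txt') as f:
    assert f.read() == 'AXC'   # not 'ABXC'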
The point is that a file is like an array (just on disk). It occupies a fixed place on the disk.
If you have to insert something in the middle of an array, you have to move the trailing elements out of the way, otherwise you will overwrite them. The same concept applies to files: you need to move out the trailing part of the file to insert something in the middle.
The one exception to this is when you need to overwrite a fixed size block of the file with a new block of the same size. Then it is perfectly fine to use fseek() and then just overwrite.
The problem is OS-related, because it depends on how the OS organizes data on the disk with its file system.
Usually, the disk is divided into blocks of fixed size to store the contents of a file. When a byte is overwritten, only the (whole) block containing that byte is rewritten. When the file is appended to, the last block may be rewritten, and new blocks are allocated and "linked" to the file to store the appended data. These two operations are efficient under this scheme.
However, when some byte is inserted or deleted, usually, the file needs to be rewritten from the first block that contains the insert/delete change.
It is technically possible to reduce the time spent rewriting the file. For the insert case, the OS can allocate a new block for the inserted data, "link" it to the file, and rewrite one or two neighboring blocks. For the delete case, the OS can reduce the number of valid bytes in the blocks, rewrite the block, and/or "relink" the blocks in the file. As a disclaimer, the "solutions" I just mentioned depend on how the specific file system is structured. However, if you reduce the cost of rewriting the file, you usually incur a hit on later read operations, or internal fragmentation will occur.
Maybe the problem will be solved if someone one day invents a data structure that allows efficient insertion/deletion of data on disk while maintaining read performance.
You can modify a file in place, if the data you are writing is the same size as the data you are replacing (if it's smaller, you just pad). What if it is larger? Well, then everything after the modification would need to be rewritten to disk anyway, right? You also can't assume you have enough room at the end in all cases, so the file may need to be moved to a completely new location.
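For the same-size (or pad-to-size) case, a minimal sketch; the file name and the fixed field width are assumptions for the example, not anything the OS enforces:
FIELD_WIDTH = 16
with open('records.dat', 'rb+') as f:
    f.seek(3 * FIELD_WIDTH)                          # jump to field number 3
    f.write(b'new value'.ljust(FIELD_WIDTH, b' '))   # pad up to the old size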
Obviously this makes the valid use cases rather rare. Normally it's not worth it unless you have a really good reason to do so. As for whether or not you have access to in-place modification APIs in Python... I have no idea.
Of course you can edit a file in place:
import struct
RECSIZE = struct.calcsize('<i')
fp = open('name.txt', 'r+b')
fp.seek(132 * RECSIZE) # go to record number 132
fp.write(struct.pack('<i', 42))
fp.seek(132 * RECSIZE)
assert struct.unpack('<i', fp.read(RECSIZE)) == (42,)
This also shows the most common reason for doing so, and the limitations.
I need to modify a gzipped tab-delimited file. I can read from input and write modified reads to an output file as:
import gzip
import tempfile

output = tempfile.NamedTemporaryFile(mode="wb", delete=False)
with gzip.open(input, "rt") as in_file, \
     gzip.open(output, "wt") as out_file:
    for l in in_file:
        split_line = l.split("\t")
        if split_line[0] == "hello":
            split_line[0] = "hi"
        out_file.write("\t".join(split_line))
The gzipped files I work with are on the scale of hundreds of GB, hence rewriting the entire file to a different file just to modify a subset is not ideal. Therefore, I am interested in a solution that modifies the file in place (i.e., modifying the original file as you traverse through it).
For normal gzip files, certainly not. Your only option would be to read the gzip file up to where you want the modification, make the modification, and recompress the rest. Some attention is required where you make the cut, to remove the deflate block that includes the cut, recompress from there, and append the remaining deflate blocks at the correct bit position.
You could, in theory, prepare a large gzip file so that such modifications could be done in place. You would need to break up the gzip file into independent blocks, where the history at the start of each block is discarded. (pigz does this with the --independent option.) You would also need to insert several empty blocks or other filler space at the end of each independent block to allow for variations in the length of the independent block, so that the modified result can fit back into the exact same number of bytes. There are five-byte and two-byte empty blocks you can insert, which in combination should be able to accommodate any small difference in byte count, if you have enough of them.
You would need a separate index of the locations of these independent blocks, otherwise you would be spending time searching for them, again making the time dependent on the length of the file.
In order to not significantly impact the overall compression ratio of the gzip file, you would want the independent blocks to be on the order of 128K bytes uncompressed or larger. Any modification would require recompression of an entire independent block.
You would also need to update the CRC and length at the end of the gzip file. I think that there's a way to update the CRC without recomputing it for the whole file, but I'd have to think about it. It is certainly possible if the length of the file doesn't change, but if you are inserting or deleting bytes, it gets trickier.
This would all be a large amount of work to try to put a square gzip peg into a round random modification hole. It suggests that you are simply using the wrong format for the application. Find a different format for what you want to do.
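That said, if you do control how the file is produced, a much simpler relative of the independent-block idea is to compress each chunk as its own gzip member; concatenated members are still a valid gzip stream that gzip.open() can read, and each member can later be recompressed on its own. This is only a sketch with assumed file names and chunk size, and it does not solve the size-variation problem described above:
import gzip

CHUNK = 128 * 1024   # roughly the block size suggested above

with open('input.tsv', 'rb') as src, open('output.tsv.gz', 'wb') as dst:
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        dst.write(gzip.compress(chunk))   # one independent gzip member per chunk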
I have a .txt file whose contents are:
This is an example file.
These are its contents.
This is line 3.
If I open the file, move to the beginning, and write some text like so...
f = open(r'C:\Users\piano\Documents\sample.txt', 'r+')
f.seek(0, 0)
f.write('Now I am adding text.\n')
What I am expecting is for the file to read:
Now I am adding text.
This is an example file.
These are its contents.
This is line 3.
...but instead it reads:
Now I am adding text.
.
These are its contents.
This is line 3.
So why is some of the text being replaced instead of the text I'm writing simply being added onto the beginning? How can I fix this?
write() will overwrite any existing content at the current file position.
To overcome this, you can do:
with open(r'C:\Users\piano\Documents\sample.txt', 'r+') as file:
    string = file.read()
    file.truncate(0)  # delete all contents
    file.seek(0, 0)
    file.write('Now I am adding text.\n' + string)
It is also recommended that you use with, because the file is then closed automatically by the context manager's __exit__() method. This is important, because not all Python interpreters are CPython, so you cannot rely on the file being closed as soon as it is no longer referenced.
Bonus: if you wish to insert lines in between, you can do:
with open(r'C:\Users\piano\Documents\sample.txt', 'r+') as file:
    contents = file.readlines()
    contents.insert(1, 'Now I am adding text.\n')  # inserting into second line
    file.truncate(0)  # delete all contents
    file.seek(0, 0)
    file.writelines(contents)
Most file systems don't work like that. A file's contents are mapped to data blocks, and these data blocks are not guaranteed to be contiguous on the underlying system (i.e. not necessarily "side-by-side").
When you seek, you're seeking to a byte offset. So if you want to insert new data between 2 byte offsets of a particular block, you'll have to actually shift all subsequent data over by the length of what you're inserting. Since the block could easily be entirely "filled", shifting the bytes over might require allocating a new block. If the subsequent block was entirely "filled" as well, you'd have to shift the data of that block too, and so on. You can start to see why there's no "simple" operation for shifting data.
Generally, we solve this by just reading all the data into memory and then re-writing it back to a file. When you encounter the byte offset you're interested in inserting "new" content at, you write your buffer and then continue writing the "original" data. In Python, you won't have to worry about interleaving multiple buffers when writing, since Python will abstract the data to some data structure. So you'd just concatenate the higher-level data structures (e.g. if it's a text file, just concat the 3 strings).
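A minimal sketch of that read-everything / concatenate / rewrite approach, with made-up names and offsets:
with open('data.txt') as f:
    original = f.read()

insert_pos = 10                    # character offset to insert at
new_text = 'INSERTED'
with open('data.txt', 'w') as f:
    f.write(original[:insert_pos] + new_text + original[insert_pos:])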
If the file is too large for you to comfortably place it in memory, you can write to a "new" temporary file, and then just swap it with the original when your write operation is done.
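A hedged sketch of that temp-file-and-swap pattern (the function name and chunk size are my own); it streams the tail in chunks, so the whole file never has to be in memory:
import os
import tempfile

def insert_bytes(path, offset, new_data):
    dir_name = os.path.dirname(os.path.abspath(path))
    with open(path, 'rb') as src, \
         tempfile.NamedTemporaryFile('wb', dir=dir_name, delete=False) as dst:
        dst.write(src.read(offset))    # copy everything before the offset
        dst.write(new_data)            # write the inserted bytes
        while True:                    # stream the rest in chunks
            chunk = src.read(1 << 20)
            if not chunk:
                break
            dst.write(chunk)
    os.replace(dst.name, path)         # swap the new file into place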
Now if you consider the "shifting" of data in data blocks I mentioned above, you might consider the simpler edge case where you happen to be inserting data of length N at an offset that's a multiple of N, where N is the fixed size of the data block in the file system. In this case, if you think of the data blocks as a linked list, you might consider it a rather simple operation to add a new data block between the offset you're inserting at and the next block in the list.
In fact, Linux systems do support allocating an additional block at this boundary. See fallocate.
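For completeness, a rough ctypes sketch of that Linux-only call; this is an assumption about a Linux system rather than portable Python, it only works on some file systems (e.g. ext4, XFS), and both offset and length must be multiples of the file system block size:
import ctypes
import ctypes.util
import os

FALLOC_FL_INSERT_RANGE = 0x20    # mode flag from linux/falloc.h

libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]

def insert_hole(path, offset, length):
    fd = os.open(path, os.O_RDWR)
    try:
        if libc.fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, length) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)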
I have a folder full of files that need to be modified in order to extract the true file in its real format.
I need to remove a certain number of bytes from BOTH the beginning and end of the file in order to extract the data I am looking for.
How can I do this in python?
I need this to work recursively on an entire folder only
I also need this to output (or modify the existing) file with the bytes removed.
I would greatly appreciate any help or guidance you can provide.
Recursive iteration over files: os.walk
Change position in file: f.seek
Get file size: os.stat
Remove data from current position to end of file: f.truncate
So, the base logic (a minimal sketch follows the list):
Iterate over the files.
Get the file size.
Open the file ('rb+', I suppose).
Seek to the position from which you want to start reading.
Read up to the bytes you want to drop at the end ( f.read(file_size - top_dropped - bottom_dropped) ).
Seek(0).
Write the data you read back to the file.
Truncate the file.
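A minimal sketch of those steps; the folder name and the byte counts to drop are illustrative assumptions:
import os

TOP_DROP = 16      # bytes to remove from the beginning
BOTTOM_DROP = 8    # bytes to remove from the end

def strip_file(path):
    size = os.stat(path).st_size
    keep = size - TOP_DROP - BOTTOM_DROP
    if keep <= 0:
        return                      # nothing left to keep
    with open(path, 'rb+') as f:
        f.seek(TOP_DROP)            # skip the leading bytes
        data = f.read(keep)         # read everything we want to keep
        f.seek(0)
        f.write(data)               # write it back at the start of the file
        f.truncate()                # drop the leftover tail

for root, dirs, files in os.walk('target_folder'):
    for name in files:
        strip_file(os.path.join(root, name))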
Your question is pretty badly constructed, but as this is somewhat advanced stuff I'll provide you with the code.
You can now use os.walk() to recursively traverse the directory you want and apply my slicefile() function.
This code does the following:
After checking the validity of the start and end arguments, it creates a memory map on top of the opened file.
mmap() creates a memory map object that maps, in this case, the contents of the file into memory. The object exposes both a string-like and a file-like interface with some additional methods like move(). So you can treat the memory map either as a string or as a file, or use size(), move(), resize() or whatever additional methods you need.
We calculate the distance between our start and end, i.e. how many bytes we will have in the end.
We move the stream of bytes that is end-start long from our start position to position 0, i.e. we move it backwards by the number of bytes indicated by the starting point.
We discard the rest of the file, i.e. we resize it to end-start bytes. So, what is left is our new string.
The operation takes longer the bigger the file is. Unfortunately there is not much you can do about that. If a file is big, this is your best bet. The procedure is the same as when removing items from the start/middle of an in-memory array, except here it has to be buffered (done in chunks) so as not to fill up RAM too much.
If your file is smaller than a third of your free RAM, you can load it whole into a string with f.read(), perform string slicing on the loaded content ( s = s[start:end] ), and then write it back into the file by opening it again and just doing f.write(s).
If you have enough disk space, you can open another file, seek to the starting point you want in the original file and then read it in chunks and write these into the new file. Perhaps even using shutil.copyfileobj(). After that, you remove the original file and use os.rename() to put the new one in its place. These are your only 3 options.
Whole file into RAM; moving data backwards with buffering and then resizing; and copying into another file, then renaming it. The second option is the most universal and won't fail you for small or big files. That is why I used it.
OK, not only 3 options. There is a fourth: it might be possible to cut off N bytes from the beginning of the file by manipulating the file system itself using low-level operations, i.e. writing a kind of truncate() function that truncates the beginning instead of the end. But this would be pretty suicidal: in the end, fragmentation would occur and a whole mess would arise. You don't need such speed anyway; you can be patient until your script finishes. :D
Why did I use mmap()?
Because it uses memory maps implemented in the OS rather than completely new code. This reduces the number of system calls needed to deal with the opened file. Half of the work is thrust upon the operating system, leaving Python to breathe easily.
Because it is mostly written in C, which makes it a touch faster than a pure Python implementation would be.
Because it implements move(), which we need. The buffering and everything is already written, so there is no need for the bulky while loop that would be the alternative (manual) solution.
And so on...
from mmap import mmap

def slicefile(path, start=0, end=None):
    f = open(path, "r+b")  # read and write, binary
    f.seek(0, 2)
    size = f.tell()
    start = 0 if start is None else start
    end = size if end is None else end
    start = size + start if start < 0 else start
    end = size + end if end < 0 else end
    end = size if end > size else end
    if (end == size and start == 0) or (end <= start):
        f.close()
        return
    # If start is 0, no need to move anything, just cut off the rest after end
    if start == 0:
        f.seek(end)
        f.truncate()
        f.close()
        return
    # Modify in place using mapped memory:
    newsize = end - start
    m = mmap(f.fileno(), 0)
    m.move(0, start, newsize)
    m.flush()
    m.resize(newsize)
    m.close()
    f.close()
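For example, to apply it recursively to a folder, dropping (say) 16 bytes from the front and 8 from the back of every file - the folder name and byte counts here are just placeholders:
import os

for root, dirs, files in os.walk('target_folder'):
    for name in files:
        slicefile(os.path.join(root, name), start=16, end=-8)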
I have a text file test.txt, with the following contents:
Thing 1. string
And I'm creating a python file that will increment the number every time it gets run without affecting the rest of the string, like so.
Run once:
Thing 2. string
Run twice:
Thing 3. string
Run three times:
Thing 4. string
Run four times:
Thing 5. string
This is the code that I'm using to accomplish this.
file = open("test.txt","r+")
started = False
beginning = 0 #start of the digits
done = False
num = 0
#building the number from digits
while not done:
next = file.read(1)
if ord(next) in range(48, 58): #ascii values of 0-9
started = True
num *= 10
num += int(next)
elif started: #has reached the end of the number
done = True
else: #has not reached the beginning of the number
beginning += 1
num += 1
file.seek(beginning,0)
file.write(str(num))
This code works, so long as the number is not 10^n-1 (9, 99, 999, etc.), because in those cases it writes more bytes than were previously in the number. As such, it will overwrite the characters that follow.
So this brings me to the point. I need a way to write to the file that overwrites previously existing bytes, which I have, and a way to write to the file that does not overwrite previously existing bytes, which I don't have. Does such a mechanism exist in Python, and if so, what is it?
I have already tried opening the file using the line file = open("test.txt","a+") instead. When I do that, it always writes to the end, regardless of the seek point.
file = open("test.txt","w+") will not work because I need to keep the contents of the file while altering it, and files opened in any variant of w mode are wiped clean.
I have also thought of solving my problem using a function like this:
# file is assumed to be in r+ mode
def write(string, file, index=-1):
    if index != -1:
        file.seek(index, 0)
    index = file.tell()        # remember where the insert starts
    remainder = file.read()    # everything from the insert point onwards
    file.seek(index)
    file.write(string + remainder)
But I also want to be able to expand the solution to larger files, and reading the rest of the file single-handedly changes what I'm trying to accomplish from being O(1) to O(n). It also seems very non-Pythonic, since it seeks to accomplish the task in a less-than-straightforward way.
It would also make my I/O operations inconsistent: I would have class methods (file.read() and file.write()) to read from the file and write to it replacing old characters, but an external function to insert without replacing.
If I make the code inline, rather than a function, it means I have to write several of the same lines of code every time I try to write without replacing, which is also non-Pythonic.
To reiterate my question, is there a more straightforward way to do this, or am I stuck with the function?
Unfortunately, what you want to do is not possible. This is a limitation at a lower level than Python, in the operating system. Neither the Unix nor the Windows file access API offers any way to insert new bytes in the middle of a file without overwriting the bytes that were already there.
Reading the rest of the file and rewriting it is the usual workaround. Actually, the usual workaround is to rewrite the entire file under a new name and then use rename to move it back to the old name. On Unix, this accomplishes an atomic file update - unless the computer crashes, concurrent readers will see either the new file or the old file, not some hybrid. (Windows, sadly, still does not allow you to rename over a name that already exists, so if you use this strategy you have to delete the old file first, opening an unavoidable race window where the file might appear not to exist at all.)
Yes, this is O(N), and yes, if you use the write-new-file-and-rename strategy it temporarily consumes scratch disk space equal to the size of the file (old or new, whichever is larger). That's just how it is.
I haven't thought about it enough to give you even a sketch of the code, but it should be possible to use context managers to wrap up the write-new-file-and-rename approach tidily.
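A rough sketch of what that could look like (the names here are mine, not an established recipe): write to a temporary file in the same directory and rename it over the original only if everything succeeded.
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_rewrite(path, mode='w'):
    dir_name = os.path.dirname(os.path.abspath(path))
    tmp = tempfile.NamedTemporaryFile(mode, dir=dir_name, delete=False)
    try:
        yield tmp
    except BaseException:
        tmp.close()
        os.unlink(tmp.name)           # failed: throw the partial file away
        raise
    else:
        tmp.close()
        os.replace(tmp.name, path)    # success: rename over the original

# Usage: rewrite test.txt with an extra line at the top.
with open('test.txt') as src:
    original = src.read()
with atomic_rewrite('test.txt') as dst:
    dst.write('inserted line\n' + original)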
No, the disk doesn't work like you think it does.
You have to remember that your file is stored on disk as one contiguous chunk of data.* Your disk happens to be wound up in a great big spool, a bit like a record, but if you were to unwind your file, you'd get something that looks like this:
+------------------------------------------------------------+
| Thing 1. String                                             |
+------------------------------------------------------------+
^ ^                                                        ^ ^
| |                                                        | |
| Start of file                                 End of File  |
Start of disk                                     End of disk
As you've discovered, there's no way to simply insert data in the middle. Generally speaking, that wouldn't be possible at all without physically altering your disk. And who wants to do that? Especially when just flipping the magnetic bits on your disk is so much easier and faster. In order to do what you want to do, you have to read the bytes that you want to overwrite, then start writing down your new ones. It might look something like this (a Python sketch of these steps follows the list):
Open the file
Seek to the point of insert
Read the current byte
Seek backward one byte
Write down the first byte of the new string
Read the next byte
Seek backward one byte
Write down the next byte of the new string
Repeat until all the bytes have been written to disk
close the file
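A literal (and deliberately slow) Python sketch of those steps, with an assumed file name, offset and replacement bytes:
new_bytes = b'Thing 12. string'
with open('test.txt', 'rb+') as f:
    f.seek(0)                  # seek to the point of insert
    for b in new_bytes:
        f.read(1)              # read the current byte
        f.seek(-1, 1)          # seek backward one byte
        f.write(bytes([b]))    # write down the next byte of the new string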
Of course, this might be a little bit on the slow side, due to all the seeking back & forth in the file. It might be faster to read each line, and then seek back to the previous location in the file. It should be relatively straightforward to implement something like this in Python, but as you've discovered, there are system limitations that Python can't really overcome.
*Unless the files are fragmented, but we're living in an ideal world where gravity adheres to 9.8 m/s² and the Earth is a perfect sphere.
I'm trying to find out the best way to read/process lines of a super large file.
Here I just try
for line in f:
Part of my script is as below:
o = gzip.open(file2, 'w')
LIST = []
f = gzip.open(file1, 'r')
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if (ave1 < 84):
            del LIST[-4:]
output1 = o.writelines(LIST)
My file1 is around 10GB; and when I run the script, the memory usage just keeps increasing to something like 15GB without any output. That means the computer is still trying to read the whole file into memory first, right? This really makes it no different from using readlines().
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry about large files... I'm really confused.
thx
edit:
Every 4 lines forms a kind of group in my data.
The purpose is to do some calculations on every 4th line; and based on that calculation, decide whether we need to append those 4 lines. So writing lines is my purpose.
The reason the memory keeps increasing even after you use enumerate is that you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously it's all sitting in memory. You need to find a way to not accumulate lines like this. Read, process & move on to the next.
One more way you could do it is to read your file in chunks (in fact, reading 1 line at a time can qualify under this criterion: 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, etc. I still maintain that this is the best way to read files in Python, large or small.
with open(...) as f:
    for line in f:
        <do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
Read the lines you need into memory (the first 3 lines).
On the 4th line, append the line & perform your calculation.
If your calculation is what you're looking for, flush the values in your collection to the file.
Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
o = gzip.open(file2, 'w')
f = gzip.open(file1, 'r')
LIST = []

for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)

        # If we've found what we want, save them to the file
        if (ave1 >= 84):
            o.writelines(LIST)

        # Release the values in the list by starting a clean list to work with
        LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.
Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST will become longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.
Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see, it seems you can easily write the output incrementally, i.e. you are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory, you want to write the lines from your temporary list as soon as you know they are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)
If you do not use the with statement, you must close the file handles:
o.close()
f.close()