I have a .txt file whose contents are:
This is an example file.
These are its contents.
This is line 3.
If I open the file, move to the beginning, and write some text like so...
f = open(r'C:\Users\piano\Documents\sample.txt', 'r+')
f.seek(0, 0)
f.write('Now I am adding text.\n')
What I am expecting is for the file to read:
Now I am adding text.
This is an example file.
These are its contents.
This is line 3.
...but instead it reads:
Now I am adding text.
.
These are its contents.
This is line 3.
So why is some of the text being replaced instead of the text I'm writing simply being added onto the beginning? How can I fix this?
write() overwrites whatever already exists at the current file position; it does not insert.
To overcome this, you can do:
with open(r'C:\Users\piano\Documents\sample.txt', 'r+') as file:
    string = file.read()
    file.truncate(0)  # delete all contents
    file.seek(0, 0)
    file.write('Now I am adding text.\n' + string)
It is also recommended that you use with, because its __exit__() magic method calls close() automatically. This matters because not every Python implementation is CPython, and only CPython's reference counting closes an unreferenced file promptly.
Bonus: If you wish to insert lines in between, you can do:
with open(r'C:\Users\piano\Documents\sample.txt', 'r+') as file:
    contents = file.readlines()
    # Insert as the second line
    contents.insert(1, 'Now I am adding text.\n')
    file.truncate(0)  # delete all contents
    file.seek(0, 0)
    file.writelines(contents)
Most file systems don't work like that. A file's contents is mapped to data blocks, and these data blocks are not guaranteed to be contiguous on the underlying system (i.e. not necessarily "side-by-side").
When you seek, you're seeking to a byte offset. So if you want to insert new data between 2 byte offsets of a particular block, you actually have to shift all subsequent data over by the length of what you're inserting. Since the block could easily be entirely "filled", shifting the bytes over might require allocating a new block. If the subsequent block is entirely "filled" as well, you have to shift the data of that block too, and so on. You can start to see why there's no "simple" operation for shifting data.
Generally, we solve this by just reading all the data into memory and then re-writing it back to a file. When you encounter the byte offset you're interested in inserting "new" content at, you write your buffer and then continue writing the "original" data. In Python, you won't have to worry about interleaving multiple buffers when writing, since Python will abstract the data to some data structure. So you'd just concatenate the higher-level data structures (e.g. if it's a text file, just concat the 3 strings).
If the file is too large to fit comfortably in memory, you can write to a "new" temporary file and then swap it with the original once the write is done.
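The temporary-file approach could be sketched roughly like this. It is only a sketch: insert_bytes and its parameters are hypothetical names, and it assumes the temporary file can be created in the same directory as the original so that os.replace() stays on one file system.

```python
import os
import shutil
import tempfile


def insert_bytes(path, offset, data, chunk_size=1024 * 1024):
    """Insert `data` at byte `offset` of `path` by rewriting through a temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the original, so the final rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            remaining = offset
            # Copy the part of the original that comes before the insertion point.
            while remaining > 0:
                chunk = src.read(min(chunk_size, remaining))
                if not chunk:
                    break  # offset lies past the end of the file
                dst.write(chunk)
                remaining -= len(chunk)
            dst.write(data)                           # the "new" content
            shutil.copyfileobj(src, dst, chunk_size)  # the rest of the "original" data
        os.replace(tmp_path, path)  # swap the new file in for the original
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Only one chunk is held in memory at a time, so the file size is bounded by free disk space rather than RAM.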
Now if you consider the "shifting" of data in data blocks I mentioned above, you might consider the simpler edge case where you happen to be inserting data of length N at an offset that's a multiple of N, where N is the fixed size of the data block in the file system. In this case, if you think of the data blocks as a linked list, you might consider it a rather simple operation to add a new data block between the offset you're inserting at and the next block in the list.
In fact, Linux does support inserting space at such a block-aligned boundary on some file systems. See fallocate(2) and its FALLOC_FL_INSERT_RANGE flag.
Related
I need to modify a gzipped tab-delimited file. I can read from input and write modified reads to an output file as:
import gzip
import tempfile

output = tempfile.NamedTemporaryFile(mode="wb", delete=False)
# Text mode ("rt"/"wt") so that split("\t") works on str in Python 3
with gzip.open(input, "rt") as in_file, \
        gzip.open(output, "wt") as out_file:
    for l in in_file:
        split_line = l.split("\t")
        if split_line[0] == "hello":
            split_line[0] = "hi"
        out_file.write("\t".join(split_line))
The gzipped files I work with are in 100s of GB scale, hence rewriting the entire file to a different file only for modifying a subset is not ideal. Therefore, I am interested in a solution that modifies the file in-place (i.e., modifying the original file as you traverse through it).
For normal gzip files, certainly not. Your only option would be to read the gzip file up to where you want the modification, make the modification, and recompress the rest. Some attention is required where you make the cut: you have to remove the deflate block that includes the cut and recompress from there, appending the remaining deflate blocks at the correct bit position.
You could, in theory, prepare a large gzip file so that such modifications could be done in place. You would need to break up the gzip file into independent blocks, where the history at the start of each block is discarded. (pigz does this with the --independent option.) You would also need to insert several empty blocks or other filler space at the end of each independent block to allow for variations in the length of the independent block so that the modified result can fit back into the exact same number of bytes. There are five-byte and two-byte empty blocks you can insert, that in combination should be able to accommodate any small number of byte count difference, if you have enough of them.
You would need a separate index of the locations of these independent blocks, otherwise you would be spending time searching for them, again making the time dependent on the length of the file.
In order to not significantly impact the overall compression ratio of the gzip file, you would want the independent blocks to be on the order of 128K bytes uncompressed or larger. Any modification would require recompression of an entire independent block.
You would also need to update the CRC and length at the end of the gzip file. I think that there's a way to update the CRC without recomputing it for the whole file, but I'd have to think about it. It is certainly possible if the length of the file doesn't change, but if you are inserting or deleting bytes, it gets trickier.
This would all be a large amount of work to try to put a square gzip peg into a round random modification hole. It suggests that you are simply using the wrong format for the application. Find a different format for what you want to do.
I have a folder full of files that need to be modified in order to extract the true file in its real format.
I need to remove a certain number of bytes from BOTH the beginning and end of the file in order to extract the data I am looking for.
How can I do this in Python?
I need this to work recursively on an entire folder.
I also need this to output a new file (or modify the existing one) with the bytes removed.
I would greatly appreciate any help or guidance you can provide.
Recursive iteration over files: os.walk
Change position in file: f.seek
Get file size: os.stat
Remove data from current position to end of file: f.truncate
So, the basic logic:
Iterate over the files.
Get the file size.
Open the file ('rb+', I suppose).
Seek to the position from which you want to start reading.
Read up to the bytes you want to drop ( f.read(file_size - top_dropped - bottom_dropped) ).
Seek(0).
Write the data you read back to the file.
Truncate the file.
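Put together, those steps might look something like the following. This is a minimal sketch, not tested against your data: trim_file and trim_tree are made-up names, head and tail are the byte counts to drop, and the payload is read into memory in one go, so it assumes files of modest size.

```python
import os


def trim_file(path, head, tail):
    """Drop `head` bytes from the start and `tail` bytes from the end of `path`, in place."""
    size = os.stat(path).st_size
    keep = size - head - tail
    if keep < 0:
        raise ValueError("file is smaller than the number of bytes to remove")
    with open(path, "rb+") as f:
        f.seek(head)
        payload = f.read(keep)  # the bytes we want to keep
        f.seek(0)
        f.write(payload)
        f.truncate()            # cut everything after the current position


def trim_tree(root, head, tail):
    """Apply trim_file to every file under `root`, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            trim_file(os.path.join(dirpath, name), head, tail)
```

Calling trim_tree(folder, 4, 4) would, for example, strip a 4-byte prefix and a 4-byte suffix from every file below folder.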
Your question is pretty badly constructed, but as this is somewhat advanced stuff I'll provide you with some code.
You can now use os.walk() to recursively traverse the directory you want and apply my slicefile() function.
This code does the following:
After checking the validity of the start and end arguments, it creates a memory map on top of an opened file.
mmap() creates a memory map object that maps, in this case, the portion of the file system over which the file is written. The object exposes both a string-like and a file-like interface, with some additional methods like move(). So you can treat a memory map either as a string or as a file, or use size(), move(), resize() or whatever additional methods you need.
We calculate the distance between our start and end, i.e. how many bytes we will have in the end.
We move a stream of bytes, end-start long, from our start position to position 0, i.e. we move them backwards by the number of bytes indicated by the starting point.
We discard the rest of the file, i.e. we resize it to end-start bytes. What is left is our new string.
The bigger the file, the longer the operation takes. Unfortunately there is not much you can do about it. If a file is big, this is your best bet. The procedure is the same as when removing items from the start or middle of an in-memory array, except that it has to be buffered (in chunks) so as not to fill RAM too much.
If your file is smaller than a third of your free RAM space, you can load it whole into a string with f.read(), then perform string slicing on the loaded content ( s = s[start:end] ) and then write it back into file by opening it again and just doing f.write(s).
If you have enough disk space, you can open another file, seek to the starting point you want in the original file and then read it in chunks and write these into the new file. Perhaps even using shutil.copyfileobj(). After that, you remove the original file and use os.rename() to put the new one in its place. These are your only 3 options.
Whole file into RAM; move by buffering backward and then resizing; and, copying into another file, then renaming it. The second option is most universal and won't fail you for small or big files. Therefore I used it.
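For comparison, the third option (copy into another file, then rename) might be sketched like this. slice_to_new_file is a made-up name, and the ".tmp" suffix is an assumption: it presumes no file of that name already exists next to the original. shutil.copyfileobj() always copies up to the end of the source, so a manual loop is used here to be able to stop at end.

```python
import os


def slice_to_new_file(path, start, end, chunk_size=1024 * 1024):
    """Keep bytes [start, end) of `path` by copying them to a sibling file and renaming it."""
    tmp_path = path + ".tmp"  # hypothetical temp name, assumed to be free
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        src.seek(start)
        remaining = end - start
        # Copy in chunks so only chunk_size bytes are in memory at once.
        while remaining > 0:
            chunk = src.read(min(chunk_size, remaining))
            if not chunk:
                break  # reached the end of the file early
            dst.write(chunk)
            remaining -= len(chunk)
    os.replace(tmp_path, path)  # put the new file in place of the original
```

It needs free disk space for the kept part of the file, which is the price for never moving data inside the original.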
OK, not only 3 options. There is a fourth. It might be possible to cut N bytes off the beginning of the file by manipulating the file system itself with low-level operations, i.e. to write a kind of truncate() function that truncates the beginning instead of the end. But this would be pretty suicidal. In the end memory fragmentation would occur and a whole mess would arise. You don't need such speed anyway. You will be patient until your script finishes. :D
Why did I use mmap()?
Because it uses memory maps implemented in the OS rather than completely new code. This reduces the number of system calls needed to deal with the opened file. Half of the work is thrust upon the operating system, leaving Python to breathe easily.
Because it is mostly written in C, which makes it a touch faster than a pure Python implementation would be.
Because it implements move(), which we need. The buffering and everything is already written, so there is no need for a bulky while loop, which would be the alternative (manual) solution.
And so on...
from mmap import mmap

def slicefile(path, start=0, end=None):
    f = open(path, "r+b")  # Read and write binary
    f.seek(0, 2)
    size = f.tell()
    start = 0 if start is None else start
    end = size if end is None else end
    start = size + start if start < 0 else start
    end = size + end if end < 0 else end
    end = size if end > size else end
    if (end == size and start == 0) or (end <= start):
        f.close()
        return
    # If start is 0, no need to move anything, just cut off the rest after end
    if start == 0:
        f.seek(end)
        f.truncate()
        f.close()
        return
    # Modify in place using mapped memory:
    newsize = end - start
    m = mmap(f.fileno(), 0)
    m.move(0, start, newsize)
    m.flush()
    m.resize(newsize)
    m.close()
    f.close()
I am trying to create a program which takes an input file, counts the number of words in each row and writes that number as a string to another output file. I managed to develop this code:
in_file = "our_input.txt"
out_file = "output.txt"
f=open(in_file)
g=open(out_file,"w")
for line in f:
    if line == "\n":
        g.write("0\n")
    else:
        g.write(str(line.count(" ")+1)+"\n")
Now, this works well, but the problem is that it works for only a certain number of lines. If my input file has 8000 lines, it will display only the first 6800. If it has 6000, again only part of them will be displayed (all numbers are rounded).
I tried creating another program, which splits each line to a list, and then counting the length of it, but the problem remains just the same.
Any idea what could cause this?
You need to close each file after you're done with it. The safest way to do this is by using the with statement:
with open(in_file) as f, open(out_file,"w") as g:
    for line in f:
        if line == "\n":
            g.write("0\n")
        else:
            g.write(str(line.count(" ")+1)+"\n")
When reaching the end of a with block, all files you opened in the with line will be closed.
The reason for the behavior you see is that for performance reasons, reading and writing to/from files is buffered. Because of the way hard drives are constructed, data is read/written in blocks rather than in individual bytes - so even if you attempt to read/write a single byte, you have to read/write an entire block. Therefore, most programming languages' built-in file IO functions actually read (at least) one block at a time into memory and feed you data from that in-memory block until it needs to read another block. Similarly, writing is performed by actually writing into a memory block first, and only writing the block to disk when it is full. If you don't close the file writer, whatever is in the last in-memory block won't be written.
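You can watch the buffering happen with a small experiment along these lines. It is only a sketch: sizes_before_and_after_flush is a made-up helper, and it assumes the payload is much smaller than Python's write buffer (typically several kilobytes).

```python
import os
import tempfile


def sizes_before_and_after_flush(payload=b"hello\n"):
    """Return the on-disk size of a file before and after flushing a small write."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        f = open(path, "wb")
        f.write(payload)                # lands in the in-memory buffer
        before = os.path.getsize(path)  # size on disk while the data is still buffered
        f.flush()                       # close() would flush as well
        after = os.path.getsize(path)
        f.close()
        return before, after
    finally:
        os.remove(path)
```

With a payload that small, the first size should still be zero and the second should equal the payload length, which is the same effect as the missing lines at the end of your output file.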
I've read here that editing a file in-place is not possible since it's OS related and entirely independent of language. Why is that? Doesn't the C standard fseek() define an in-place file write?
You can seek in Python as well - the answer you linked is just pointing out that you can't insert in the middle of a file (ABC -> ABXC).
If you seek and then write, whatever data is written overwrites the contents at that position of the file (ABC -> AXC).
The point is that a file is like an array (just on disk). It occupies a fixed place on the disk.
If you have to insert something in the middle of an array, you have to move out the trailing elements, otherwise you will overwrite these. The same concept applies to files, you need to move out the trailing part of the file to insert something in the middle.
The one exception to this is when you need to overwrite a fixed size block of the file with a new block of the same size. Then it is perfectly fine to use fseek() and then just overwrite.
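To make the difference concrete, here is a small sketch of both cases. It uses io.BytesIO as a stand-in for a file opened in 'r+b' mode, and overwrite_at and insert_at are hypothetical helper names.

```python
import io


def overwrite_at(buf, offset, data):
    """Seek and write: replaces bytes in place, the length stays the same (ABC -> AXC)."""
    buf.seek(offset)
    buf.write(data)


def insert_at(buf, offset, data):
    """Emulate insertion: save the tail, write the new data, then the tail again (ABC -> ABXC)."""
    buf.seek(offset)
    tail = buf.read()       # the trailing part that has to be moved out of the way
    buf.seek(offset)
    buf.write(data + tail)
```

overwrite_at touches only len(data) bytes, while insert_at has to rewrite everything from the offset to the end, which is exactly the cost described above.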
The problem is OS related, because it depends on how OS organizes data on the disk with its file system.
Usually, the disk is divided into blocks of fixed size to store the content of a file. When a byte is overwritten, only the (whole) block containing the byte is rewritten. When the file is appended, possibly the last block is rewritten, plus new blocks being allocated and "linked" to the file to store content of the appended data. These 2 operations are efficient under this scheme.
However, when some byte is inserted or deleted, usually, the file needs to be rewritten from the first block that contains the insert/delete change.
It is technically possible to reduce the time spent rewriting the file. For the insert case, the OS can allocate a new block for the inserted data, "link" it to the file, and rewrite one or two neighboring blocks. For the delete case, the OS can reduce the number of valid bytes in the blocks, rewrite the block, and/or "relink" the blocks in the file. As a disclaimer, the "solutions" I just mentioned depend on how the specific file system is structured. However, if you reduce the cost of rewriting the file, you will usually pay for it in later read operations, or internal fragmentation will occur.
Maybe the problem will be solved, if someone one day invented a data structure that allows efficient insertion/deletion of data on disk while maintaining the performance of read operation.
You can modify a file in place if the data you are writing is the same size as the data you are replacing (if it is smaller, you just pad). What if it is larger? Well, then everything after the modification would need to be rewritten to disk anyway, right? You also can't assume you have enough room at the end in all cases, so the file may need to be moved to a completely new location.
Obviously this makes the valid use cases rather rare. Normally it's not worth it unless you have a really good reason to do so. As for whether or not you have access to in-place modification APIs in Python... I have no idea.
Of course you can edit a file in place:
import struct

RECSIZE = struct.calcsize('<i')
fp = open('name.txt', 'r+b')
fp.seek(132 * RECSIZE)  # go to record number 132
fp.write(struct.pack('<i', 42))
fp.seek(132 * RECSIZE)
# struct.unpack always returns a tuple, even for a single value
assert struct.unpack('<i', fp.read(RECSIZE)) == (42,)
fp.close()
This also shows the most common reason for doing so, and the limitations.
I'm trying to find out the best way to read/process lines for super large file.
Here I just try
for line in f:
Part of my script is as below:
o=gzip.open(file2,'w')
LIST=[]
f=gzip.open(file1,'r')
for i,line in enumerate(f):
    if i%4!=3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1=[ord(x) for x in line]
        ave1=(sum(b1)-10)/float(len(line)-1)
        if (ave1 < 84):
            del LIST[-4:]
output1=o.writelines(LIST)
My file1 is around 10GB, and when I run the script, the memory usage just keeps increasing to something like 15GB without any output. That means the computer is still trying to read the whole file into memory first, right? This is really no different from using readlines().
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry about large files. I'm really confused.
thx
edit:
Every 4 lines form a kind of group in my data.
The purpose is to do some calculations on every 4th line and, based on that calculation, decide whether we need to append those 4 lines. So writing lines is my purpose.
The reason the memory keeps increasing even though you use enumerate is that you are using LIST.append(line). That accumulates all the lines of the file in a list, and all of it sits in memory. You need to find a way to not accumulate lines like this: read, process, and move on to the next.
Another approach is to read your file in chunks (reading one line at a time qualifies, with 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that this is the best way to read files in Python, large or small.
with open(...) as f:
    for line in f:
        <do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
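If your data has no line structure, the same idea works with fixed-size chunks. A minimal sketch, where iter_chunks is a made-up helper and f is assumed to be a file object opened in binary mode:

```python
from functools import partial


def iter_chunks(f, chunk_size=64 * 1024):
    """Return an iterator over successive chunks of a binary file object, until EOF."""
    # iter(callable, sentinel) keeps calling f.read(chunk_size) until it returns b"".
    return iter(partial(f.read, chunk_size), b"")
```

Used as for chunk in iter_chunks(f): ..., only one chunk is referenced at a time, as long as you do not append the chunks to a list yourself.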
It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
Read the lines you need into memory (the first 3 lines).
On the 4th line, append the line & perform your calculation.
If your calculation is what you're looking for, flush the values in your collection to the file.
Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
o=gzip.open(file2,'w')
f=gzip.open(file1,'r')
LIST=[]
for i,line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        # If we've found what we want, save them to the file
        if (ave1 >= 84):
            o.writelines(LIST)
        # Release the values in the list by starting a clean list to work with
        LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.
Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST will become longer and longer. All the lines that you store in LIST take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.
Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see it seems you can easily write the output incrementally. ie. You are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory you want to write the lines from your temporary list as soon as you know these are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)
If you do not use the with statement, you must close the file handles:
o.close()
f.close()
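Pulling the advice from these answers together, a streaming version could look roughly like this. It is a sketch under assumptions: filter_records is a made-up name, it targets Python 3 (where iterating over bytes yields integers, so no ord() is needed), it assumes lines end in "\n", and it silently drops an incomplete group at the end of the file.

```python
import gzip
from itertools import islice


def filter_records(in_path, out_path, threshold=84):
    """Copy only the 4-line groups whose 4th line has a mean byte value >= threshold."""
    with gzip.open(in_path, "rb") as src, gzip.open(out_path, "wb") as dst:
        while True:
            group = list(islice(src, 4))  # at most 4 lines in memory at a time
            if len(group) < 4:
                break                     # end of file (or an incomplete last group)
            last = group[3].rstrip(b"\n")  # same as subtracting the newline in the original
            if not last:
                continue                  # empty 4th line: nothing to average
            mean = sum(last) / len(last)
            if mean >= threshold:
                dst.writelines(group)     # write right away, then forget the group
```

Memory use stays flat because each group is either written or discarded before the next one is read, and the with statement closes both files.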