I am trying to update a file, but only certain lines. There are a lot of lines and I don't want to rewrite the whole file to update it.
For example, in a file of 4k lines I need to change 5 items in the n-th line.
Most answers to this question use two files or rewrite the file completely. I am wondering if there is a more efficient command for this, one that can modify one line at a time without writing out the whole file at the end of the process. If that is not possible, what would be the most efficient way to do it?
I am using Python 2.7.
The problem is that files on disk are just a sequence of bytes, not an intelligently stored sequence of lines.
That is, you can alter the n-th line of your file in place if and only if the number of bytes it contains stays the same after the operation.
If your CSV format allows arbitrary amounts of insignificant whitespace within those lines, you could pad every line to a fixed size that is large enough to hold any data you may ever write to it. Then you can update single lines without rewriting the entire file. You need to make sure that you overwrite all of the previous content (if any), though.
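A minimal sketch of the fixed-size-line idea, assuming every line is padded with spaces to RECORD_SIZE bytes. The names RECORD_SIZE, pad_record, write_record and read_record are made up for this illustration, and the size of 32 bytes is arbitrary:

```python
RECORD_SIZE = 32  # assumed fixed line length in bytes, including the newline


def pad_record(text):
    """Encode text and pad it with spaces to exactly RECORD_SIZE bytes."""
    data = text.encode("utf-8")
    if len(data) > RECORD_SIZE - 1:
        raise ValueError("record does not fit in %d bytes" % (RECORD_SIZE - 1))
    return data.ljust(RECORD_SIZE - 1) + b"\n"


def write_record(path, index, text):
    """Overwrite the index-th (0-based) line without touching the others."""
    with open(path, "r+b") as f:  # "r+b": read/write, no truncation
        f.seek(index * RECORD_SIZE)
        f.write(pad_record(text))


def read_record(path, index):
    """Read the index-th (0-based) line and strip the padding."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)
        return f.read(RECORD_SIZE).decode("utf-8").rstrip()
```

Because the padding always fills the record, stale bytes from a longer previous value are overwritten as well. The file has to be created with pad_record for every line in the first place, and a CSV reader must tolerate the trailing spaces.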
Related
I'm trying to iterate over the lines of a csv, for each line, I want to do a bunch of work, save that line in a destination csv and remove it from the original csv, saving both origin and destination csv files at every line (save state in case of a crash). Is there an elegant way of doing this that doesn't involve opening and closing the file at every point?
To write to a file immediately, open it without buffering (this works as written in Python 2; Python 3 allows buffering=0 only in binary mode such as "wb"):
with open("test.csv", "w", buffering=0) as my_file:
...
This makes sense for the output file; repeatedly deleting the first line of the input is another matter. The only way to do that is to write out the entire remainder of the file, over and over (search for "quadratic complexity"). That will definitely have a performance impact, and it increases rather than reduces the chance that something will go wrong.
I strongly recommend leaving the input file alone, and finding another way to keep track of how much has been processed. (E.g. write out somewhere else the number of lines that have been processed, and adapt your code to skip this many lines.)
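The suggestion above could be sketched roughly as follows. This assumes Python 3 (for os.replace); the function names and the idea of passing the per-line work as a callable are assumptions made for this example, not part of the question:

```python
import os


def load_checkpoint(path):
    """Return the number of input lines already processed (0 if none)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return int(f.read().strip() or 0)


def save_checkpoint(path, count):
    """Write the count to a temporary file, then rename it over the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(count))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def process_file(src, dst, checkpoint, work):
    """Process src line by line, appending results to dst and resuming
    after the last line recorded in the checkpoint file."""
    done = load_checkpoint(checkpoint)
    with open(src) as fin, open(dst, "a") as fout:
        for lineno, line in enumerate(fin):
            if lineno < done:
                continue  # already handled in an earlier run
            fout.write(work(line))
            fout.flush()
            save_checkpoint(checkpoint, lineno + 1)
```

The input file is never modified. Note one remaining gap: if the program crashes after writing the output line but before saving the checkpoint, that line is processed again on the next run.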
PS. If you wanted to get cute you could process the input file from the end (last row first), and use truncate to delete each processed line without rewriting what comes before. But that's tricky to get right, and really it's not a good fit for your goal of simply tracking how far you have gotten with processing.
I would like to track the changes of a file being appended by another program.
My planned approach is this. I read the file's contents first. At a later time, another application appends to the file. I want to read only the appended data rather than re-reading everything from the top. In order to do this, I'm going to check the file's modification time. I would then seek() to the previous size of the file and start reading from there.
Is this a proper approach, or is there a known idiom for this?
Well, you have to make quite a few assumptions about both the other program writing to the file and the file system, but in general it should work. Personally, I would rather write the current seek position or line number (if reading simple text files) to another file and check it from there. This will also allow you to go back in the file if some part is rewritten and the file size stays the same (or even gets smaller).
If you have some very important/unique data, then besides making backups you should maybe think about appending the new data to a new file and later rejoining the files (if needed) once you have checked that the data is fine in your other program. This way you could just read any new file as a whole after a certain time. (Also remember that, in the larger picture, system time and creation/modification times are not 100% trustworthy.)
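Reading only the appended part from a remembered position might look like the sketch below. The function name read_new_data is made up, and the sketch assumes the other program only ever appends or truncates; a rewrite that keeps the size unchanged is not detected here:

```python
import os


def read_new_data(path, offset):
    """Return (data, new_offset) for the bytes appended since offset."""
    size = os.path.getsize(path)
    if size < offset:
        offset = 0  # file got smaller: assume it was rewritten, start over
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data, offset + len(data)
```

The caller keeps the returned offset (for example in a small side file, as suggested above) and passes it in on the next call.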
I have achieved writing all the things I needed to the text file, but essentially the program needs to keep going back to the text file and saving only the changes. At the moment it overwrites the entire file, deleting all the previous information.
There is a common confusion about how text files are organized.
Text files are not organized by lines, but by bytes
When one looks at a text file, it looks like lines.
It is natural to expect that it is stored on disk the same way, but this is not true.
Text files are written to disk byte by byte, often with one character represented by one byte (but
in some cases by more bytes). A line of text is just a sequence of bytes terminated
by some sort of newline ("\n", "\r\n" or whatever is used as the line ending).
If we want to change the 2nd line out of 3, the change has to fit exactly into the bytes used for
the 2nd line, so as not to mess up line 3. If we wrote too many bytes for line 2, we would
overwrite bytes of line 3. If we wrote too few bytes, some (already obsolete) bytes from the
remainder of line 2 would still be present.
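The effect described above can be reproduced with a small experiment. The helper names overwrite_at and demo are made up for this illustration; the file is a throwaway temporary file:

```python
import os
import tempfile


def overwrite_at(path, offset, data):
    """Write data over the existing bytes starting at offset."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def demo(replacement):
    """Overwrite line 2 of a fresh three-line file and return the result."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(b"line 1\nline 2\nline 3\n")
        overwrite_at(path, 7, replacement)  # line 2 starts at byte 7
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


print(demo(b"LINE 2\n"))     # same length: only line 2 changes
print(demo(b"line two!\n"))  # too long: eats the start of line 3
print(demo(b"L2\n"))         # too short: leftovers of old line 2 remain
```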
Strategies to modify content of text file
Republisher - Read it, modify in memory, write all content completely back
This might at first sound like wasting a lot of effort, but it is by far the most commonly used approach
and in 99% of cases the most effective one.
The beauty is that it is simple.
The fact is that for most file sizes it is fast enough.
Journal - append changes to the end
A rather rare approach is to write the first version of the file to disk and later append
notes about what has changed to the end.
Reading such a file means replaying the whole history of changes from the journal to find out the final
content of the file.
Surgeon - change only affected lines
If you keep lines of fixed length (measured in bytes, not in characters!), you can seek to
the modified line and rewrite just that line.
This is quite difficult to get right and is mostly used with binary files. It is definitely not
a task for beginners.
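As an illustration of the "Journal" strategy, here is a minimal sketch that stores each change as one JSON line and rebuilds the state by replaying them. The record layout ("key" and "value" fields) and the function names are assumptions chosen for this example:

```python
import json


def append_change(path, key, value):
    """Append one change record to the end of the journal."""
    with open(path, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")


def replay(path):
    """Rebuild the current state by applying every change in order."""
    state = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                change = json.loads(line)
                state[change["key"]] = change["value"]
    return state
```

Writing is cheap because nothing already on disk is touched, but reading gets slower as the history grows.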
Conclusions
Go for "Republisher" pattern.
Use whatever format fits your needs (INI, CSV, JSON, XML, YAML...).
Personally I prefer saving data in JSON format: the json package is part of the Python stdlib and it
supports lists as well as dictionaries, which allows saving tabular as well as tree-like structures.
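A minimal sketch of the "Republisher" pattern with JSON, assuming Python 3 (for os.replace). The function names are made up; writing to a temporary file first is an extra precaution added here so that a crash mid-write does not destroy the old file:

```python
import json
import os


def load_data(path):
    """Read the whole file into memory."""
    with open(path) as f:
        return json.load(f)


def save_data(path, data):
    """Write everything back: to a temporary file first, then swap it in."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp, path)
```

Usage is then load_data, modify the returned list or dictionary in memory, and save_data.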
Are the changes you are making going to be over several different runs of a program? If not, I suggest making all of your changes to the data while it is still in memory and then writing it out just before program termination.
You can open the file in append mode, so that new writes are added at the end instead of replacing the existing contents:
FileOpen = open("test.txt","a")
I'm writing a program that extracts and adds files to the xbox 360's STFS files. The STFS structure is a mini file system, it has hashtables, a file table, etc.
Extracting the files is simple enough. I have the starting block of the file and the amount of blocks in the file, so I just need to find the block offsets, read the block lengths, and send that out as the file. What happens, though, when I need to replace or remove a file? I've read that on Windows and computers in general, files aren't actually deleted, they're just removed from the file table and are overwritten when something else needs the space. When I'm writing a file, then, how do I find an unused sequence of blocks large enough to hold it? The blocks are 0x1000 bytes in length and fill remaining space with empty bytes, so everything evens out nicely, but I can't think of an efficient way to find an unused range of blocks that will fit the file I want to add.
My current plan is to rewrite everything on removing or adding a file so that I don't have large amounts of unused space that I'm unable to figure out how to overwrite. Is there a good introduction to file systems like NTFS or FAT32 that I could read that won't take days to understand and will contain the necessary information to write a basic file manager?
reference to structure: http://free60.org/STFS
Edit: on second thought, I would create a list of ranges for each file in the table, that is, the start offset and end offset based on size. When looking for an open range to insert a file, I would start at 0 and check whether the start or end of each existing range falls inside the range needed by the file to be inserted. If either does, I would move on to that file's end offset and try again. This is better than my initial idea, but still seems inefficient: I would have to make multiple comparisons for every file in the file table.
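One possible way to do the search described in the edit is a first-fit scan over the used ranges sorted by start block, which needs a single pass after sorting. This is only a sketch under the assumption that each file occupies one contiguous run of blocks, as the question describes; it does not model the real STFS layout (hash tables and so on), and the function name is made up:

```python
def find_free_run(used, needed, total_blocks):
    """Return the first block index where `needed` free blocks start,
    or None if no gap is large enough.

    used: iterable of (start_block, block_count) pairs for existing files.
    """
    cursor = 0  # first block not known to be in use
    for start, count in sorted(used):
        if start - cursor >= needed:
            return cursor  # the gap before this file is big enough
        cursor = max(cursor, start + count)
    if total_blocks - cursor >= needed:
        return cursor  # room after the last file
    return None
```

Multiplying the returned block index by the 0x1000-byte block size gives the byte offset, before any format-specific adjustments.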
Question: How do you write data to the beginning of an already existing file without writing over what's already there and without reading the entire file into memory (i.e. prepend)?
Info:
I'm working on a project right now where the program frequently dumps data into a file. This file will very quickly balloon up to 3-4 GB. I'm running this simulation on a computer with only 768 MB of RAM. Pulling all that data into RAM over and over would be a great pain and a huge waste of time. The simulation already takes long enough to run as it is.
The file is structured such that the number of dumps it contains is listed at the beginning as a simple value, like 6. Each time the program makes a new dump I want that value to be incremented, so now it's 7. The problem lies with the 10th, 100th, 1000th, and so on, dump. The program will enter the 10 just fine, but remove the first character of the next line:
"9\n580,2995,2083,028\n..."
"10\n80,2995,2083,028\n..."
Obviously, the difference between 580 and 80 in this case is significant. I can't lose these values, so I need a way to add a little space in there so that I can add this new data without losing my data or having to pull the entire file up and then rewrite it.
Basically, what I'm looking for is a kind of prepend function: something to add data to the beginning of a file instead of the end.
Programmed in Python
~n
See the answers to this question:
How do I modify a text file in Python?
Summary: you can't do it without reading the file in (this is due to how the operating system works, rather than a Python limitation)
It's not addressing your original question, but here are some possible workarounds:
Use SQLite (it's bundled with your Python)
Use a fancier database, either RDBMS or NoSQL
Just track the number of dumps in a different text file
The first couple of options are a little more work up front, but provide more flexibility. The last option is the easiest solution to your current problem.
You could quite easily create a new file, write the data you wish to prepend to that file, then copy the content of the existing file and append it to the new one, and finally rename.
This avoids having to read the whole file into memory at once, if that is the primary issue.
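The new-file-and-rename approach could be sketched like this, assuming Python 3 (for os.replace). The function name prepend and the ".tmp" suffix are choices made for this example:

```python
import os
import shutil


def prepend(path, header):
    """Write header followed by the old contents to a temporary file,
    then rename it over the original."""
    tmp = path + ".tmp"
    with open(path, "rb") as src, open(tmp, "wb") as dst:
        dst.write(header)
        shutil.copyfileobj(src, dst)  # copies in chunks, not all at once
    os.replace(tmp, path)
```

Memory use stays small regardless of file size, but the whole file is still read and written once per call and needs as much free disk space again while the copy exists.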