I'm writing a program that extracts and adds files to the xbox 360's STFS files. The STFS structure is a mini file system, it has hashtables, a file table, etc.
Extracting the files is simple enough. I have the starting block of the file and the amount of blocks in the file, so I just need to find the block offsets, read the block lengths, and send that out as the file. What happens, though, when I need to replace or remove a file? I've read that on Windows and computers in general, files aren't actually deleted, they're just removed from the file table and are overwritten when something else needs the space. When I'm writing a file, then, how do I find an unused sequence of blocks large enough to hold it? The blocks are 0x1000 bytes in length and fill remaining space with empty bytes, so everything evens out nicely, but I can't think of an efficient way to find an unused range of blocks that will fit the file I want to add.
My current plan is to rewrite everything on removing or adding a file so that I don't have large amounts of unused space that I'm unable to figure out how to overwrite. Is there a good introduction to file systems like NTFS or FAT32 that I could read that won't take days to understand and will contain the necessary information to write a basic file manager?
reference to structure: http://free60.org/STFS
edit: on Second though, I would create a list of ranges for each file in the table. That is, the start offset and end offset based on size. When looking for an open range to insert a file, I would start at 0 and check if the end start or end of each range is inside the range needed by the file to be inserted. If either the start or end is inside the range, I would move on to the end of the other file's end offset. This is better than my initial idea, but still seems inefficient. I would have to make multiple comparisons for every file in the file table.
Related
I would like to track the changes of a file being appended by another program.
My plan of approach is this. I read the file's contents first. At a later time, the file contents would be appended from another application. I want to read the appended data only rather than re-reading everything from the top. In order to do this, I'm going to check the file's modification time. I would then seek() to the previous size of the file, and start reading from there.
Is this a proper approach? Or there is a known idiom for this?
Well, you have to make quite some assumptions about both the other program writing to file as well as the file system, but in generally it should work. Personally I would rather write the current seek position or line number (if reading simple text files) to another file and check it from there. This will also allow you to revert back in the file if some part is rewritten and the file size stays the same (or even gets smaller).
If you have some very important/unique data, besides making backups you should maybe think about appending the new data to new file and later rejoining the files (if needed) when you have checked that the data is fine in your other program. This way you could just read any new file as a whole after certain time. (Also remember that in a larger picture, system time and creation/modification times are not 100% trustworthy).
I have achieved writing all the things I needed to the text file, but essentially the program needs to keep going back to the text file and saving only the changes. At the moment it overwrites the entire file, deleting all the previous information.
There is typical confusion about how are text files organized.
Text files are not organized by lines, but by bytes
When one looks to a text file, it looks like lines.
It is natural to expect, that on disk it goes the same way, but this is not true.
Text file are written to disk byte by byte, often one character being represented by one byte (but
in some cases more bytes). A line of text happens to be just a sequence of bytes, being terminated
by some sort of new lines ("\n", "\n\r" or whatever is used for new line).
If we want to change 2nd line out of 3, we would have to fit the change just in the bytes, used for
2nd line, not to mess up with line 3. If we would write too many bytes for line 2, we would
overwrite bytes of line 3. If we would write too few bytes, there would be stil present some (alredy
obsolete) bytes from remainder of line 2.
Strategies to modify content of text file
Republisher - Read it, modify in memory, write all content completely back
This might first sound like vasting a lot of effort, but it is by far the most often used approach
and is in 99% most effective one.
The beauty is, it is simple.
The fact is, for most files sizes it is fast enouhg.
Journal - append changes to the end
Rather rare approach is to write first version of the file to the disk and later on append to the
end notes about what has changed.
Reading such a file means, it has to rerun all the history of changes from journal to find out final
content of the file.
Surgeon - change only affected lines
In case you keep lines of fixed length (measured in bytes!! not in characters), you might point to
modified line and rewrite just that line.
This is quite difficult to do easily and is used rather with binary files. This is definitely not
the task for beginers.
Conclusions
Go for "Republisher" pattern.
Use whatever format fits your needs (INI, CSV, JSON, XML, YAML...).
Personally I prefer saving data to JSON format - json package is part of Python stdlib and it
supports lists as well dictionaries, what allows saving tabular as well as tree like structures.
Are the changes you are making going to be over several different runs of a program? If not, I suggest making all of your changes to the data while it is still in memory and then writing it out just before program termination.
You can open it as follows:
FileOpen = open("test.txt","a")
Imagine you have a filesystem tree:
root/AA/aadata
root/AA/aafile
root/AA/aatext
root/AB/abinput
root/AB/aboutput
root/AC/acinput
...
Totally around 10 million files. Each file is around 10kb in size. They are mostly like a key-value storage, separated by folders just in order to improve speed (FS would die if I put 5 million files in a single folder).
Now we need to:
archive this tree into a single big file (it must be relatively fast but have a good compression ratio as well - thus, 7z is too slow)
seek the result big file very quickly - so, when I need to get the content of "root/AB/aboutput", I should be able to read it very quickly.
I won't use Redis because in the future the amount of files might increase and there would be no space for them in RAM. But on the other side, I can use SSD-powered servers to the data access will be relatively fast (compared to HDD).
Also it should not be any exotic file system, such as squashfs or similar file systems. It should work in an ordinary EXT3 or EXT4 or NTFS.
I also thought about storing the files as a simple zlib-compressed strings, remembering the file offset for each string, and then creating something like a map which will be in RAM. Each time I need a file, I will read the content offset form the map, and then - using the offsets - from the actual file. But maybe there is something easier or already done?
Assumptions (from information in contents). You might use the following strategy: use two files (one for the "index", the second for your actual content. For simplicity, make the second file a set of "blocks" (say each of 8196). To process your files, read them into an programmatic structure for file name (key) along with the block number of the second file where the content begins. Write the file content to the second file (compressing if storage space is at a premium). Save the index information.
To retrieve, read the index file into programmattic storage and store as a binary tree. If search times are a problem, you might hash the keys and store values into a table and handle collision with a simple add to next available slot. To retrieve content, get block number (and length) from your index find; read the content from the second file (expand if you compressed).
I am trying to update a file, but only certain lines. There are a lot of lines and I don't want to rewrite the whole file to update it.
Ex. out of 4k line i need to change 5 item in n-th item
Most answers on this question use two files or rewrite it completely. I am wondering if there is a more effective command for this, one that can attack one line at a time without writing the whole file at the end of the process. If no possible way, what would be most efficient way to do so.
I used Python 2.7
The problem is, files on the disk are just a sequence of bytes. And not an intelligently stored sequence of lines.
That is, you can alter the n'th line in your file, if and only if, the amount of bytes it contains remains the same after that operation.
If your csv format allows to contain arbitrary amounts of insignificant whitespace as part of those lines, you could make them large enough to hold any data you may ever write to it (say a fixed large-enough-line-size). Then you can update single lines without a rewrite of the entire file. You need to make sure that you overwrite the previous content (if any), though.
Question: How do you write data to an already existing file at the beginning of the file with out writing over what's already there and with out reading the entire file into memory? (e.g. prepend)
Info:
I'm working on a project right now where the program frequently dumps data into a file. this file will very quickly balloon up to 3-4gb. I'm running this simulation on a computer with only 768mb of ram. pulling all that data to the ram over and over will be a great pain and a huge waste of time. The simulation already takes long enough to run as it is.
The file is structured such that the number of dumps it makes is listed at the beginning with just a simple value, like 6. each time the program makes a new dump I want that to be incremented, so now it's 7. the problem lies with the 10th, 100th, 1000th, and so dump. the program will enter the 10 just fine, but remove the first letter of the next line:
"9\n580,2995,2083,028\n..."
"10\n80,2995,2083,028\n..."
obviously, the difference between 580 and 80 in this case is significant. I can't lose these values. so i need a way to add a little space in there so that I can add in this new data without losing my data or having to pull the entire file up and then rewrite it.
Basically what I'm looking for is a kind of prepend function. something to add data to the beginning of a file instead of the end.
Programmed in Python
~n
See the answers to this question:
How do I modify a text file in Python?
Summary: you can't do it without reading the file in (this is due to how the operating system works, rather than a Python limitation)
It's not addressing your original question, but here are some possible workarounds:
Use SQLite (it's bundled with your Python)
Use a fancier database, either RDBMS or NoSQL
Just track the number of dumps in a different text file
The first couple of options are a little more work up front, but provide more flexibility. The last option is the easiest solution to your current problem.
You could quite easily create an new file, output the data you wish to prepend to that file and then copy the content of the existing file and append it to the new one, then rename.
This would prevent having to read the whole file if that is the primary issue.