Deleting an arbitrary chunk of a file - python

What is the most efficient way to delete an arbitrary chunk of a file, given the start and end offsets? I'd prefer to use Python, but I can fall back to C if I have to.
Say the file is this
..............xxxxxxxx----------------
I want to remove a chunk of it:
..............[xxxxxxxx]----------------
After the operation it should become:
..............----------------
Reading the whole thing into memory and manipulating it in memory is not a feasible option.

The best performance will almost invariably be obtained by writing a new version of the file and then atomically renaming it over the old one, because filesystems are strongly optimized for such sequential access, and so is the underlying hardware (with the possible exception of some of the newest SSDs, and even there it's an iffy proposition). In addition, this avoids destroying data if the system crashes at any time -- you're left with either the old version of the file intact, or the new one in its place. Since every system can crash at any time (and by Murphy's Law, it will choose the most unfortunate moment;-), integrity of data is generally considered very important (often the data is more valuable than the system it's kept on -- hence the "mirroring" RAID solutions that protect precious data against disk crashes;-).
If you accept this sane approach, the general idea is: open the old file for reading and the new one for writing (creating it); copy N1 bytes over from the old file to the new one; then skip N2 bytes of the old file; then copy the rest over; close both files; atomically rename the new one to the old name. (Windows apparently has no "atomic rename" system call usable from Python -- to keep integrity in that case, instead of the atomic rename you'd do three steps: rename the old file to a backup name, rename the new file to the old name, delete the backup-named file. If the system crashes during the second of these three very fast operations, one rename is all it takes to restore data integrity.)
N1 and N2, of course, are the two parameters saying where the deleted piece starts and how long it is. For opening the files, nested with open('old.dat', 'rb') as oldf: and with open('NEWold.dat', 'wb') as newf: statements are clearly best (the rest of the code, up to the rename step, must of course be nested inside both of them).
For the "copy the rest over" step, shutil.copyfileobj is best (be sure to specify a buffer length that's comfortably going to fit in your available RAM, but a large one will tend to give better performance). The "skip" step is clearly just a seek on the oldf open-for-reading file object. For copying exactly N1 bytes from oldf to newf, there is no direct support in Python's standard library, so you have to write your own, e.g:
def copyN1(oldf, newf, N1, buflen=1024*1024):
    while N1 > 0:
        data = oldf.read(min(N1, buflen))
        if not data:           # unexpected EOF
            break
        newf.write(data)
        N1 -= len(data)        # decrement by what was actually read
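Putting those pieces together, a minimal sketch of the whole flow might look like the following (the helper name delete_chunk and the .new suffix are my own choices; note that on Python 3.3+ os.replace performs an atomic replace of an existing destination on both POSIX and Windows, which also sidesteps the three-step dance described above):

import os
import shutil

def delete_chunk(path, N1, N2, buflen=1024*1024):
    # Copy everything except the N2 bytes starting at offset N1 into a new
    # sibling file, then atomically swap it into place.
    tmp_path = path + '.new'
    with open(path, 'rb') as oldf:
        with open(tmp_path, 'wb') as newf:
            copyN1(oldf, newf, N1, buflen)            # the bytes before the chunk
            oldf.seek(N2, os.SEEK_CUR)                # skip the chunk
            shutil.copyfileobj(oldf, newf, buflen)    # copy the rest over
    os.replace(tmp_path, path)                        # atomic replace (Python 3.3+)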

I'd suggest memory mapping. Though it does manipulate the file in memory, it is more efficient than plainly reading the whole file into memory.
Well, you have to manipulate the file contents in memory one way or another, as there's no system call for such an operation in either *nix or Windows (at least none that I'm aware of).
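A minimal sketch of that idea (my own illustration, not from the answer above): map the file, shift the tail over the deleted region with mmap.move, then shrink the file. Unlike the rewrite-and-rename approach, this modifies the file in place and is not crash-safe.

import mmap
import os

def delete_chunk_mmap(path, start, end):
    # Shift everything after `end` down to `start`, then shrink the file.
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        with mmap.mmap(f.fileno(), size) as mm:
            mm.move(start, end, size - end)   # dest, src, count
            mm.flush()
        f.truncate(size - (end - start))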

Try mmapping the file. This won't necessarily read it all into memory at once.
If you really want to do it by hand, choose some chunk size and do back-and-forth reads and writes. But the seeks are going to kill you...
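For completeness, here is a rough sketch of that hand-rolled approach, assuming start and end are the offsets of the chunk to remove; every iteration pays for two seeks, which is what the warning above is about:

import os

def delete_chunk_chunked(path, start, end, bufsize=1024 * 1024):
    # In-place delete by shifting the tail of the file down in chunks.
    # Illustrative sketch only: not crash-safe, and the read/write positions
    # alternate, so expect a seek per chunk.
    with open(path, 'r+b') as f:
        read_pos, write_pos = end, start
        while True:
            f.seek(read_pos)
            chunk = f.read(bufsize)
            if not chunk:
                break
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        f.truncate(write_pos)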

Related

Does Python's "append" file write mode only write new bytes, or does it re-write the entire file as well?

Though I would imagine that append mode is "smart" enough to only insert the new bytes being appended, I want to make absolutely sure that Python doesn't handle it by re-writing the entire file along with the new bytes.
I am attempting to keep a running backup of a program log, and it could reach several thousand records in a CSV format.
Python file operations are convenience wrappers over operating system file operations. The operating system either implements these file system operations internally, forwards them to a loadable module (plugin), or passes them to an external server (NFS, SMB). Most operating systems since at least 1971 have been capable of appending data to an existing file -- certainly all the ones that claim to be even remotely POSIX compliant.
The POSIX append mode simply opens the file for writing and positions every write at the current end of the file. This means that all write operations just write past the existing end of the file.
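As a rough sketch of what open('log.csv', 'a') boils down to at the OS level (the file name is just an example), O_APPEND makes every write land at the current end of the file, so only the new bytes are ever written:

import os

fd = os.open('log.csv', os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b'2024-06-01,some,record\n')   # written past the existing data only
os.close(fd)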
There might be a few exceptions to this behaviour; for example, some routine might use low-level system calls to move the file pointer backwards, or the underlying file system might not be POSIX compliant and use some form of transactional object storage like AWS S3. But for any standard scenario I wouldn't worry about such cases.
However, since you mentioned backup as your use case, you need to be extra careful. Backups are not as easy as they seem on the surface. Things to worry about: various caches that might hold data in memory before it is written to disk (see the sketch below); what happens if the power goes out right after you appended new records; and what happens if somebody starts several copies of your program.
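For the cache concern, a minimal sketch of forcing appended records out of both Python's buffer and the OS cache (the file name is again just an example):

import os

with open('backup.csv', 'a') as f:
    f.write('new,record\n')
    f.flush()                 # flush Python's userspace buffer
    os.fsync(f.fileno())      # ask the OS to push it to the physical disk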
And one last thing: unless you are running on a 1980s 8-bit computer, a few thousand CSV lines are nothing to modern hardware. Even if the file were loaded and written back in full, you wouldn't notice any difference.

Rewrite a file in-place with Python

It might depend on the OS and even be hardware-dependent, but is there a way in Python to ask a write operation on a file to happen "in-place", i.e. at the same place as the original file, if possible on the same sectors on disk?
Example: let's say sensitivedata.raw, a 4 KB file, has to be encrypted:
with open('sensitivedata.raw', 'r+') as f:  # read/write mode
    s = f.read()
    cipher = encryption_function(s)  # same exact length as the input s
    f.seek(0)
    f.write(cipher)  # how to ask this write operation to overwrite the original bytes?
Example 2: replace a file with null-byte content of the same size, to prevent an undelete tool from recovering it (of course, to do it properly we would need several passes, with random data and not only null bytes, but here it's just to give the idea):
with open('sensitivedata.raw', 'r+') as f:
    s = f.read()
    f.seek(0)
    f.write(len(s) * '\x00')  # totally inefficient, but just to get the idea
os.remove('sensitivedata.raw')
PS: if it really depends a lot on OS, I'm primarily interested in the Windows case
Side-question: if it's not possible in the case of an SSD, does this mean that if you once in your life wrote sensitive data as plaintext to an SSD (example: a password in plaintext, a crypto private key, or anything else), then there is no way to be sure that this data is really erased? I.e. is the only solution to wipe 100% of the disk with many passes of random bytes? Is that correct?
That's an impossible requirement to impose. On most spinning disk drives this will happen automatically (there's no reason to write the new data elsewhere when it could just overwrite the existing data directly), but SSDs can't do this (when they claim to, they're lying to the OS).
SSDs can't rewrite blocks; they can only erase a block or write to an empty one. The implementation of a "rewrite" is to write to a new block (reading from the original block to fill it out if there isn't enough new data), then (eventually, because it's relatively expensive) erase the old block to make it available for a future write.
Update addressing side-question: The only truly secure solution is to run your drive through a woodchipper, then crush the remains with a millstone. :-) Really, in most cases, the window of vulnerability on an SSD should be relatively short; erasing sectors is expensive, so even SSDs that don't honor TRIM typically do it in the background to ensure future (cheap) write operations aren't held up by (expensive) erase operations. This isn't really so bad when you think about it; sure, the data is visible for a period of time after you logically erased it. But it was visible for a period of time before you erased it too, so all this is doing is extending the window of vulnerability by (seconds, minutes, hours, days, depending on the drive); the mistake was in storing sensitive data to permanent storage in the first place; even with extreme (woodchipper+millstone) solutions, someone else could have snuck in and copied the data before you thought to encrypt/destroy it.

How to perform a pickling so that it is robust against crashing?

I routinely use pickle.dump() to save large files in Python 2.7. In my code, I have one .pickle file that I continually update with each iteration of my code, overwriting the same file each time.
However, I occasionally encounter crashes (e.g. from server issues). This may happen in the middle of the pickle dump, rendering the pickle incomplete and the pickle file unreadable, and I lose all my data from the past iterations.
I guess one way I could do it is to save one .pickle file for each iteration, and combine all of them later. Are there any other recommended methods, or best practices for writing to disk that are robust to crashing?
You're effectively doing backups, as your goal is the same: disaster recovery, lose as little work as possible.
In backups, there are these standard practices, so choose whatever fits you best:
backing up
full backup (save everything each time)
incremental backup (save only what changed since the last backup)
differential backup (save only what changed since the last full backup)
dealing with old backups
circular buffer / rotating copies (delete or overwrite backups older than X days/iterations, optionally shifting the indices in the remaining names; see the sketch after this list)
consolidating old incremental/differential copies into the preceding full backup (as a failsafe, consolidate into a new file and only then delete the old ones)
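Tying that to the pickle use case, a minimal sketch (the names save_state and keep are my own, and the temporary-file-then-rename step is an extra safeguard, not something the question requires) that writes each iteration to a temporary file, atomically swaps it into place, and keeps a small rotating window of older copies:

import os
import pickle

def save_state(obj, basename='state.pickle', keep=3):
    # Rotate the older copies: state.pickle.2 -> state.pickle.3, and so on.
    for i in range(keep - 1, 0, -1):
        older = '%s.%d' % (basename, i)
        if os.path.exists(older):
            os.replace(older, '%s.%d' % (basename, i + 1))
    if os.path.exists(basename):
        os.replace(basename, basename + '.1')
    # Write the new state to a temporary file first, then swap it in,
    # so a crash mid-dump never leaves a truncated state.pickle behind.
    tmp = basename + '.tmp'
    with open(tmp, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, basename)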
Are there any other recommended methods, or best practices?
Let me mention Mike McKerns' cool dill package:
import dill as pickle
pickle.dump_session( aRotatingIndexFileNAME )    # saves the python session-state
                                                 # (a session is worth many computing [CPU-core * hrs])
If needed, just:
import dill as pickle; pickle.load_session( aLastSessionStateFileNAME )
Using rotating filenames is a common best practice for archiving up to some depth of roll-back capability, so it's not worth repeating that here.
dill has literally saved me on this, and it deliberately keeps the same call signatures as pickle, which makes substituting it into Python projects easy.

Python - Tailing a logfile - sleep() versus inotify?

I'm writing a Python script that needs to tail -f a logfile.
The operating system is RHEL, running Linux 2.6.18.
The normal approach I believe is to use an infinite loop with sleep, to continually poll the file.
However, since we're on Linux, I'm thinking I can also use something like pyinotify (https://github.com/seb-m/pyinotify) or Watchdog (https://github.com/gorakhargosh/watchdog) instead?
What are the pros/cons of this?
I've heard that using sleep(), you can miss events, if the file is growing quickly - is that possible? I thought GNU tail uses sleep as well anyhow?
Cheers,
Victor
The cleanest solution would be inotify in many ways - this is more or less exactly what it's intended for, after all. If the log file was changing extremely rapidly then you could potentially risk being woken up almost constantly, which wouldn't necessarily be particularly efficient - however, you could always mitigate this by adding a short delay of your own after the inotify filehandle returns an event. In practice I doubt this would be an issue on most systems, but I thought it worth mentioning in case your system is very tight on CPU resources.
I can't see how the sleep() approach would miss file updates except in cases where the file is truncated or rotated (i.e. renamed and another file of the same name created). These are tricky cases to handle however you do things, and you can use tricks like periodically re-opening the file by name to check for rotation. Read the tail man page because it handles many such cases, and they're going to be quite common for log files in particular (log rotation being widely considered to be good practice).
The downside of sleep() is of course that you'd end up batching up your reads with delays in between, and also that you have the overhead of constantly waking up and polling the file even when it's not changing. If you did this, say, once per second, however, the overhead probably isn't noticeable on most systems.
I'd say inotify is the best choice unless you need to stay portable to systems that lack it, in which case the simple fallback using sleep() is still quite reasonable.
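For reference, a bare-bones sketch of that sleep() fallback (no rotation or truncation handling yet; that is what the edit below deals with):

import time

def follow(path, interval=1.0):
    # Yield new lines appended to `path`, polling once per `interval` seconds.
    with open(path, 'r') as f:
        f.seek(0, 2)               # start at the end, like tail -f
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(interval)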
EDIT:
I just realised I forgot to mention -- an easy way to check for a file being renamed is to perform an os.fstat(fd.fileno()) on your open filehandle and an os.stat() on the filename you opened, and compare the results. If the os.stat() fails, the error will tell you whether the file has been deleted; if it succeeds, comparing the st_ino (inode number) fields will tell you whether the file has been deleted and then replaced with a new one of the same name.
Detecting truncation is harder - effectively your read pointer remains at the same offset in the file and reading will return nothing until the file content size gets back to where you were - then the file will read from that point as normal. If you call os.stat() frequently you could check for the file size going backwards - alternatively you could use fd.tell() to record your current position in the file and then perform an explicit seek to the end of the file and call fd.tell() again. If the value is lower, then the file's been truncated under you. This is a safe operation as long as you keep the original file position around because you can always seek back to it after the check.
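A small sketch of those two checks, assuming fd is the open file object and path is the name it was opened under:

import os

def rotated_or_truncated(fd, path):
    # True if the file we have open has been replaced or truncated under us.
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        return True                               # deleted (or mid-rotation)
    ours = os.fstat(fd.fileno())
    if on_disk.st_ino != ours.st_ino:             # replaced by a new file of the same name
        return True
    return on_disk.st_size < fd.tell()            # truncated below our read position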
Alternatively if you're using inotify anyway, you could just watch the parent directory for changes.
Note that files can be truncated to non-zero sizes, but I doubt that's likely to happen to a log file - the common cases will be being deleted and replaced, or truncated to zero. Also, I don't know how you'd detect the case that the file was truncated and then immediately filled back up to beyond your current position, except by remembering the most recent N characters and comparing them, but that's a pretty grotty thing to do. I think inotify will just tell you the file has been modified in that case.

Any write functions In python that have the same safety as ACID does in databases

The title could probably have been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. The reason is, I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a POSIX system you can write your data to a temporary file and rename() the temporary file over your target. On POSIX compliant systems this is guaranteed to be an atomic operation; it shouldn't even matter whether the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. The key concept here is that the temporary file protects your data: if the system reboots, the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of a dhcp.leases file this isn't a big performance hit, but larger files might prove more cumbersome.
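A minimal sketch of that write-temp-then-rename pattern (the fsync call is my own addition, to make sure the data has hit the disk before the rename makes it visible):

import os
import tempfile

def atomic_write(path, data):
    # Write to a temporary file in the same directory, then rename it over
    # the target; readers see either the old contents or the new, never a mix.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except Exception:
        os.unlink(tmp)
        raise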
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, and so on.
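A small sketch of the sqlite3 route, assuming a simple key/value style table; the commit either lands in full or not at all:

import sqlite3

conn = sqlite3.connect('state.db')
with conn:  # the transaction commits on success, rolls back on an exception
    conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
    conn.execute('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
                 ('last_run', '2024-06-01T12:00:00'))
conn.close()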
A couple of other things to consider -- if your filesystem is mounted async, writes will appear to be complete as soon as write() returns, but the data might not have been flushed to disk yet. The rename approach protects you in this case, and sqlite3 does as well.
If your filesystem is mounted async, it might also be possible for the rename to be recorded before the data itself is written. So if you're on a unix system it might be safest to mount sync. That's "people might die if this fails" levels of paranoia, though. But if it's an embedded system and it dies, "I might lose my job if this fails" is also a good rationalization for the extra protection.
The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, by defining "records" in the file you write and, when opening/reading, verifying which records have been fully written (which may mean you need to throw away some partially written data). ZODB, for example, implements this by ending each record with the size of the record itself; if you can read this size and it matches, you know the record was fully written.
And, of course, you always need to append records and not rewrite the entire file.
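A toy sketch of that record-based idea (my own layout, not ZODB's actual format): each record is length-prefixed and checksummed, always appended, and anything from the first record that fails verification onwards is discarded on load:

import struct
import zlib

def append_record(f, payload):
    # [4-byte length][4-byte CRC32][payload], appended at the end of the file.
    header = struct.pack('<II', len(payload), zlib.crc32(payload) & 0xffffffff)
    f.write(header + payload)
    f.flush()

def read_records(f):
    records = []
    while True:
        header = f.read(8)
        if len(header) < 8:
            break                                    # torn header: stop here
        length, crc = struct.unpack('<II', header)
        payload = f.read(length)
        if len(payload) < length or zlib.crc32(payload) & 0xffffffff != crc:
            break                                    # torn or corrupt record
        records.append(payload)
    return records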
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to consider when doing this:
Ensure that your data is written to disk when you close a file. Even after you close it, some of the data may sit in the OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed by os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".
