I am logging data via:
with open('filename.txt', 'a') as fid:
    fid.write(line_of_data)
Granted, the amount of time the file is open is short for each write, but I will be writing data every second, which makes this extremely repetitive. Since this is being used on a remote system there is always the chance that power will be interrupted, causing the computer to shut down. If power is cut in the middle of a fid.write(), will the whole file become corrupt, or, since it was opened for append, will only the last line be lost?
It actually depends on the filesystem and the operating system. When you "write" to a file, the data may not really be written to the actual hard drive right away - it may be buffered by the OS, for example, and never actually make it to the disk itself.
You should not assume anything beyond that: any outcome is possible.
If you need some form of durable writing, you probably need to use specialized libraries that add the required layers of safety.
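For a simple once-a-second logging loop like the one in the question, one common mitigation (just a sketch, not a guarantee against every failure mode) is to flush Python's buffer and ask the OS to sync after each line:

import os

def append_line(path, line_of_data):
    # Open in append mode, write one line, then push it through
    # Python's buffer and the OS cache down to the device.
    with open(path, 'a') as fid:
        fid.write(line_of_data)
        fid.flush()              # flush Python's userspace buffer
        os.fsync(fid.fileno())   # ask the OS to write it to disk

append_line('filename.txt', 'some,data,here\n')

Even with this, the very last line can still be lost or left half-written if power is cut mid-write; the point is only to shrink that window.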
Though I would imagine that append mode is "smart" enough to only insert the new bytes being appended, I want to make absolutely sure that Python doesn't handle it by re-writing the entire file along with the new bytes.
I am attempting to keep a running backup of a program log, and it could reach several thousand records in a CSV format.
Python file operations are convenience wrappers over operating system file operations. The operating system either implements these file system operations itself, forwards them to a loadable module (plugin), or hands them off to an external server (NFS, SMB). Most operating systems, going all the way back to 1971, have been able to append data to an existing file - at least all the ones that claim to be even remotely POSIX compliant.
The POSIX append mode simply opens the file for writing and moves the file pointer to the end of the file. This means that all the write operations will just write past the end of the file.
There might be a few exceptions to that; for example, some routine might use low-level system calls to move the file pointer backwards, or the underlying file system might not be POSIX compliant and might use some form of transactional object storage like AWS S3. But for any standard scenario I wouldn't worry about such cases.
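As a quick illustration (a small sketch with a hypothetical file name): appending only ever adds bytes past the current end of the file, it does not rewrite what is already there.

import os

path = 'log.csv'   # hypothetical file name
before = os.path.getsize(path) if os.path.exists(path) else 0

with open(path, 'a') as f:       # 'a' uses O_APPEND semantics underneath
    f.write('one,more,record\n')

after = os.path.getsize(path)
print(after - before)            # grows by exactly the bytes just appended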
However, since you mentioned backup as your use case, you need to be extra careful. Backups are not as easy as they seem on the surface. Things to worry about: various caches that might hold data in memory before it is written to disk, what will happen if the power goes out right after you append new records, and what will happen if somebody starts several copies of your program at the same time.
And one last thing: unless you are running on a 1980s 8-bit computer, a few thousand CSV lines are nothing to modern hardware. Even if the file were loaded and written back in full, you wouldn't notice any difference.
I'm writing to a file and I call flush() after I'm done. Is it safe to force-quit the program, say by pulling the power plug, without calling close() on the file? Is calling flush() sufficient to ensure the file won't become corrupt?
To ensure that content is flushed to the operating system (and is visible to other running applications), either flush() or close() will suffice (which is to say, you don't need both).
To ensure that content is flushed to disk, you also need to add os.fsync() or os.fdatasync(). Do note that in the case of a newly-created (or newly-renamed) file, you need to worry about whether the directory was flushed as well.
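A sketch of both steps on a POSIX system (the path and helper name are made up for illustration):

import os

def durable_write(path, data):
    # Write the data and make sure it actually reaches the disk.
    with open(path, 'w') as f:
        f.write(data)
        f.flush()                 # flush Python's buffer to the OS
        os.fsync(f.fileno())      # flush the OS cache to the device

    # For a newly created (or renamed) file, also sync the directory
    # so the directory entry itself survives a power loss.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)) or '.', os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)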
By the way -- if you care about being really sure things get to disk, the classic presentation Eat My Data is worth your time and attention.
I'm writing a Python script that needs to tail -f a logfile.
The operating system is RHEL, running Linux 2.6.18.
The normal approach I believe is to use an infinite loop with sleep, to continually poll the file.
However, since we're on Linux, I'm thinking I can also use something like pyinotify (https://github.com/seb-m/pyinotify) or Watchdog (https://github.com/gorakhargosh/watchdog) instead?
What are the pros/cons of this?
I've heard that using sleep() you can miss events if the file is growing quickly - is that possible? I thought GNU tail uses sleep as well anyway?
Cheers,
Victor
The cleanest solution would be inotify in many ways - this is more or less exactly what it's intended for, after all. If the log file was changing extremely rapidly then you could potentially risk being woken up almost constantly, which wouldn't necessarily be particularly efficient - however, you could always mitigate this by adding a short delay of your own after the inotify filehandle returns an event. In practice I doubt this would be an issue on most systems, but I thought it worth mentioning in case your system is very tight on CPU resources.
I can't see how the sleep() approach would miss file updates except in cases where the file is truncated or rotated (i.e. renamed and another file of the same name created). These are tricky cases to handle however you do things, and you can use tricks like periodically re-opening the file by name to check for rotation. Read the tail man page because it handles many such cases, and they're going to be quite common for log files in particular (log rotation being widely considered to be good practice).
The downside of sleep() is of course that you'd end up batching up your reads with delays in between, and also that you have the overhead of constantly waking up and polling the file even when it's not changing. If you did this, say, once per second, however, the overhead probably isn't noticeable on most systems.
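For reference, a minimal sleep-based follow loop (just a sketch that ignores rotation and truncation, which are discussed below) looks something like this:

import os
import time

def follow(path, interval=1.0):
    # Poll the file once per interval and yield any newly appended lines.
    with open(path, 'r') as f:
        f.seek(0, os.SEEK_END)        # start at the current end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(interval)

# for line in follow('/var/log/messages'):
#     print(line, end='')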
I'd say inotify is the best choice unless you need to remain portable to systems without it, in which case the simple fallback using sleep() is still quite reasonable.
EDIT:
I just realised I forgot to mention - an easy way to check for a file being renamed is to perform an os.fstat(fd.fileno()) on your open filehandle and an os.stat() on the filename you opened, and compare the results. If the os.stat() fails then the error will tell you if the file's been deleted, and if not then comparing the st_ino (inode number) fields will tell you if the file's been deleted and then replaced with a new one of the same name.
Detecting truncation is harder - effectively your read pointer remains at the same offset in the file and reading will return nothing until the file content size gets back to where you were - then the file will read from that point as normal. If you call os.stat() frequently you could check for the file size going backwards - alternatively you could use fd.tell() to record your current position in the file and then perform an explicit seek to the end of the file and call fd.tell() again. If the value is lower, then the file's been truncated under you. This is a safe operation as long as you keep the original file position around because you can always seek back to it after the check.
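Putting both checks together, a rough sketch (the helper names are mine, not a standard API) could look like:

import os

def file_was_rotated(f, path):
    # Compare the inode behind the open handle with the inode currently
    # behind the name; if they differ (or the name is gone), the file was
    # deleted or renamed and replaced.
    try:
        return os.fstat(f.fileno()).st_ino != os.stat(path).st_ino
    except OSError:
        return True

def file_was_truncated(f):
    # If the end of the file is now before our read position,
    # the file was truncated underneath us.
    pos = f.tell()
    f.seek(0, os.SEEK_END)
    end = f.tell()
    f.seek(pos)                       # restore the original position
    return end < pos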
Alternatively if you're using inotify anyway, you could just watch the parent directory for changes.
Note that files can be truncated to non-zero sizes, but I doubt that's likely to happen to a log file - the common cases will be being deleted and replaced, or truncated to zero. Also, I don't know how you'd detect the case that the file was truncated and then immediately filled back up to beyond your current position, except by remembering the most recent N characters and comparing them, but that's a pretty grotty thing to do. I think inotify will just tell you the file has been modified in that case.
The title could probably have been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. The reason is, I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform, there are a couple of options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a POSIX system you can write your data to a temporary file and rename() the temporary file over your target. On POSIX-compliant systems this rename is guaranteed to be atomic, and it shouldn't even matter whether the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. The key concept here is that the temporary file protects your data: if the system reboots, the worst case is that your target file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. For a dhcp.leases file this isn't a big performance hit, but larger files might prove to be more cumbersome.
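A minimal sketch of that write-to-temp-then-rename pattern on POSIX (the helper name and error handling are illustrative only):

import os
import tempfile

def atomic_replace(path, data):
    # Write the new contents to a temporary file in the same directory,
    # flush it to disk, then rename it over the target in a single step.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, 'w') as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.rename(tmp_path, path)     # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)
        raise

On Python 3, os.replace() does the same overwrite-rename portably on Windows as well.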
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
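A short sketch of that (the table and file names are made up): using the connection as a context manager commits the whole group of statements atomically, or rolls them back on error.

import sqlite3

conn = sqlite3.connect('state.db')    # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS log (ts TEXT, value TEXT)')

with conn:                            # commits both inserts, or neither
    conn.execute('INSERT INTO log VALUES (?, ?)', ('2013-01-01', 'foo'))
    conn.execute('INSERT INTO log VALUES (?, ?)', ('2013-01-02', 'bar'))

conn.close()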
A couple of other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but the data might not be flushed to disk yet. Rename protects you in this case, and so does sqlite3.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance - for example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written completely (which may mean you need to throw away some partially written data). ZODB, for example, implements this by ending each record with the size of the record itself; if you can read this size back and it matches, you know the record was fully written.
And, of course, you always need to append records and not rewrite the entire file.
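A rough sketch of that idea (the record format here is made up, not ZODB's actual layout): length-prefix every record you append, and on reading discard any incomplete record at the tail.

import struct

def append_record(path, payload):
    # Record format for this sketch: 4-byte little-endian length, then payload.
    with open(path, 'ab') as f:
        f.write(struct.pack('<I', len(payload)) + payload)

def read_records(path):
    # Return only the records that were written out completely; a partial
    # record at the end (e.g. after a power cut) is silently dropped.
    with open(path, 'rb') as f:
        data = f.read()
    records, pos = [], 0
    while pos + 4 <= len(data):
        (size,) = struct.unpack_from('<I', data, pos)
        if pos + 4 + size > len(data):
            break                     # truncated record
        records.append(data[pos + 4:pos + 4 + size])
        pos += 4 + size
    return records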
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to be considered when doing this:
Ensure that your data is written to disk when you close a file. Even after you close it, some of the data may sit in the OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed by os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use a file format that helps you verify the integrity of the data (e.g. use checksums); see the sketch after this list.
Another alternative is to use sqlite3.
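For the checksum point in particular, a sketch with a made-up line format: store a CRC32 with every line and skip any line that doesn't verify when reading back.

import zlib

def append_checked_line(path, text):
    # Each line carries a CRC32 of its own contents.
    with open(path, 'a') as f:
        f.write('%08x %s\n' % (zlib.crc32(text.encode()) & 0xffffffff, text))

def read_checked_lines(path):
    good = []
    with open(path) as f:
        for line in f:
            crc, _, text = line.rstrip('\n').partition(' ')
            try:
                ok = int(crc, 16) == (zlib.crc32(text.encode()) & 0xffffffff)
            except ValueError:
                ok = False
            if ok:
                good.append(text)     # corrupt or partial lines are skipped
    return good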
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".
With Python inotifyx, do I have to remove the watch and close the opened system file descriptor if I need them until the program exits? E.g. are there any possible problems if I create one (file descriptor + watch) on each run and don't close it?
It's always a good idea to release resources (e.g. free memory, close file descriptors, waitpid(2) on child processes, etc) whenever you're done using them. Being lazy and letting the operating system take care of it for you when you exit is a sure way to cause bugs in the future.
The kernel stores watches as full paths, so closing the watch is preferable; it also takes unnecessary work off the VFS. As for the file descriptor, that would depend on how many others you had opened.
It's kind of like a phone call: it's nice to tell the other party that you have stopped listening; hanging up the phone is optional, but conventional. If you need it for something, keep it.
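A sketch of the explicit cleanup, assuming inotifyx's init/add_watch/get_events/rm_watch API and a hypothetical log path:

import os
import inotifyx

fd = inotifyx.init()
wd = inotifyx.add_watch(fd, '/var/log/myapp.log', inotifyx.IN_MODIFY)
try:
    for event in inotifyx.get_events(fd, 10):   # wait up to 10 seconds
        print(event)
finally:
    # Hang up explicitly instead of leaving it to process exit.
    inotifyx.rm_watch(fd, wd)
    os.close(fd)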