I'm writing a Python script that needs to tail -f a logfile.
The operating system is RHEL, running Linux 2.6.18.
The normal approach I believe is to use an infinite loop with sleep, to continually poll the file.
However, since we're on Linux, I'm thinking I can also use something like pyinotify (https://github.com/seb-m/pyinotify) or Watchdog (https://github.com/gorakhargosh/watchdog) instead?
What are the pros/cons of this?
I've heard that with sleep() you can miss events if the file is growing quickly - is that possible? I thought GNU tail uses sleep as well anyhow?
Cheers,
Victor
The cleanest solution would be inotify in many ways - this is more or less exactly what it's intended for, after all. If the log file was changing extremely rapidly then you could potentially risk being woken up almost constantly, which wouldn't necessarily be particularly efficient - however, you could always mitigate this by adding a short delay of your own after the inotify filehandle returns an event. In practice I doubt this would be an issue on most systems, but I thought it worth mentioning in case your system is very tight on CPU resources.
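For reference, a rough sketch of how the pyinotify route might look (the log path and handler class name are placeholders of mine):

import pyinotify

LOGFILE = '/var/log/myapp.log'   # hypothetical path

class TailHandler(pyinotify.ProcessEvent):
    # pyinotify calls my_init() with the keyword arguments given to the constructor
    def my_init(self, fileobj=None):
        self.fileobj = fileobj

    def process_IN_MODIFY(self, event):
        # Read whatever has been appended since the last event
        for line in self.fileobj.readlines():
            print(line.rstrip())

f = open(LOGFILE)
f.seek(0, 2)                     # start at the end of the file, like tail -f

wm = pyinotify.WatchManager()
notifier = pyinotify.Notifier(wm, TailHandler(fileobj=f))
wm.add_watch(LOGFILE, pyinotify.IN_MODIFY)
notifier.loop()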
I can't see how the sleep() approach would miss file updates except in cases where the file is truncated or rotated (i.e. renamed and another file of the same name created). These are tricky cases to handle however you do things, and you can use tricks like periodically re-opening the file by name to check for rotation. Read the tail man page because it handles many such cases, and they're going to be quite common for log files in particular (log rotation being widely considered to be good practice).
The downside of sleep() is of course that you'd end up batching up your reads with delays in between, and also that you have the overhead of constantly waking up and polling the file even when it's not changing. If you did this, say, once per second, however, the overhead probably isn't noticeable on most systems.
I'd say inotify is the best choice unless you want to remain compatible, in which case the simple fallback using sleep() is still quite reasonable.
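For comparison, a minimal sketch of the sleep()-based fallback (the polling interval and path are placeholders):

import time

def follow(path, interval=1.0):
    # Yield lines appended to the file, polling roughly once per interval
    with open(path) as f:
        f.seek(0, 2)             # seek to the end, like tail -f
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(interval)

# for line in follow('/var/log/myapp.log'):
#     print(line.rstrip())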
EDIT:
I just realised I forgot to mention - an easy way to check for a file being renamed is to perform an os.fstat(fd.fileno()) on your open filehandle and an os.stat() on the filename you opened, and compare the results. If the os.stat() fails then the error will tell you if the file's been deleted, and if not then comparing the st_ino (the inode number) fields will tell you if the file's been deleted and then replaced with a new one of the same name.
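A small sketch of that check (the function name is mine; it also compares st_dev, which guards against the rarer cross-filesystem case):

import errno
import os

def file_was_replaced(f, path):
    # True if the name we originally opened no longer refers to our open file
    try:
        named = os.stat(path)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return True          # the file has been deleted
        raise
    opened = os.fstat(f.fileno())
    return (named.st_ino, named.st_dev) != (opened.st_ino, opened.st_dev)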
Detecting truncation is harder - effectively your read pointer remains at the same offset in the file and reading will return nothing until the file content size gets back to where you were - then the file will read from that point as normal. If you call os.stat() frequently you could check for the file size going backwards - alternatively you could use fd.tell() to record your current position in the file and then perform an explicit seek to the end of the file and call fd.tell() again. If the value is lower, then the file's been truncated under you. This is a safe operation as long as you keep the original file position around because you can always seek back to it after the check.
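And a corresponding sketch of the tell()/seek() truncation check described above:

def file_was_truncated(f):
    # Compare our current read offset against the file's current end
    pos = f.tell()
    f.seek(0, 2)                 # jump to the end of the file
    end = f.tell()
    f.seek(pos)                  # restore the original position
    return end < pos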
Alternatively if you're using inotify anyway, you could just watch the parent directory for changes.
Note that files can be truncated to non-zero sizes, but I doubt that's likely to happen to a log file - the common cases will be being deleted and replaced, or truncated to zero. Also, I don't know how you'd detect the case that the file was truncated and then immediately filled back up to beyond your current position, except by remembering the most recent N characters and comparing them, but that's a pretty grotty thing to do. I think inotify will just tell you the file has been modified in that case.
I am logging data via:
with open('filename.txt','a') as fid:
    fid.write(line_of_data)
Granted, the amount of time the file is open is short for each write, but I will be writing data every second, making it extremely repetitive. Since this is being used on a remote system, there is always the chance that power will be interrupted, causing the computer to shut down. If power is cut in the middle of a fid.write(), will the whole file become corrupt, or, since it was opened to "append", will only the last line be lost?
It actually depends on the filesystem and operating system. When you "write" to a file, it may not really mean a write to the actual hard drive - it may be buffered by the OS, for example, and never actually "make it" to the hard drive itself.
You should not assume anything in that case, other than that anything may happen.
If you need some form of persistent writing, you probably need a specialized library that adds the required layers of safety.
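That said, one cheap mitigation that needs no extra libraries is to flush both Python's buffer and the OS cache after each write - a sketch only, and it still cannot protect against every failure mode:

import os

line_of_data = 'sensor reading 42\n'   # placeholder for the data being logged

with open('filename.txt', 'a') as fid:
    fid.write(line_of_data)
    fid.flush()                  # push Python's internal buffer to the OS
    os.fsync(fid.fileno())       # ask the OS to push its own buffers to the disk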
I want to monitor a folder and see if any new files are added, or existing files are modified. The problem is, it's not guaranteed that my program will be running all the time (so, inotify based solutions may not be suitable here). I need to cache the status of the last scan and then with the next scan I need to compare it with the last scan before processing the files.
What are the alternatives for achieving this in Python 2.7?
Note1: Processing the files is expensive, so I'm trying to avoid processing files that haven't been modified in the meantime. So, if a file has only been renamed (as opposed to having its contents changed), I would also like to detect that and skip the processing.
Note2: I'm only interested in a Linux solution, but I wouldn't complain if answers for other platforms are added.
There are several ways to detect changes in files. Some are easier to fool than others. It doesn't sound like this is a security issue; more like good faith is assumed, and you just need to detect changes without having to outwit an adversary.
You can look at timestamps. If files are not renamed, this is a good way to detect changes. If they are renamed, timestamps alone wouldn't suffice to reliably tell one file from another. os.stat will tell you the time a file was last modified.
You can look at inodes, e.g., ls -li. A file's inode number may change if changes involve creating a new file and removing the old one; this is how emacs typically changes files, for example. Try changing a file with the standard tool your organization uses and compare inodes before and after; but bear in mind that even if it doesn't change this time, it might change under some circumstances. os.stat will tell you inode numbers.
You can look at the content of the files. cksum computes a small CRC checksum on a file; it's easy to beat if someone wants to. Programs such as sha256sum compute a secure hash; it's infeasible to change a file without changing such a hash. This can be slow if the files are large. The hashlib module will compute several kinds of secure hashes.
If a file is renamed and changed, and its inode number changes, it would be potentially very difficult to match it up with the file it used to be, unless the data in the file contains some kind of immutable and unique identifier.
Think about concurrency. Is it possible that someone will be changing a file while the program runs? Beware of race conditions.
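A rough sketch combining the stat and hashlib checks described above (the function name and chunk size are placeholders):

import hashlib
import os

def snapshot(path, chunk_size=1024 * 1024):
    # (mtime, inode, sha256) for one file; compare snapshots between scans
    st = os.stat(path)
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return st.st_mtime, st.st_ino, h.hexdigest()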
I would probably go with some kind of sqlite solution, such as writing the last polling time.
Then on each poll, sort the files by last modified time (mtime) and pick out the ones whose mtime is greater than your previous poll time (that value can be kept in the sqlite database, or in a plain file if you'd rather not require a database).
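A minimal sketch of that idea (the table layout, database file name and directory path are hypothetical):

import os
import sqlite3
import time

WATCHED_DIR = '/path/to/watched'           # placeholder

db = sqlite3.connect('scan_state.db')      # placeholder database file
db.execute('CREATE TABLE IF NOT EXISTS state (last_poll REAL)')

row = db.execute('SELECT last_poll FROM state').fetchone()
last_poll = row[0] if row else 0.0

# Files modified since the previous poll; these are the ones to (re)process
changed = [name for name in os.listdir(WATCHED_DIR)
           if os.stat(os.path.join(WATCHED_DIR, name)).st_mtime > last_poll]

with db:                                   # commit the new poll time as one transaction
    db.execute('DELETE FROM state')
    db.execute('INSERT INTO state VALUES (?)', (time.time(),))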
Monitoring for new files isn't hard -- just keep a list or database of inodes for all files in the directory. A new file will introduce a new inode. This will also help you avoid processing renamed files, since the inode doesn't change on rename.
The harder problem is monitoring for file changes. If you also store the file size per inode, then a changed size obviously indicates a changed file and you don't need to open and process the file to know that. But for a file that (a) has a previously recorded inode and (b) is the same size as before, you will need to process the file (e.g. compute a checksum) to know whether it has changed.
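A small sketch of that inode/size bookkeeping (the directory path is a placeholder; you would persist the old snapshot between runs):

import os

def scan(directory):
    # Map inode -> (name, size) for everything in one directory
    state = {}
    for name in os.listdir(directory):
        st = os.stat(os.path.join(directory, name))
        state[st.st_ino] = (name, st.st_size)
    return state

old = scan('/path/to/watched')              # from the previous run
new = scan('/path/to/watched')              # from the current run
added   = [i for i in new if i not in old]                          # brand-new inodes
renamed = [i for i in new if i in old and new[i][0] != old[i][0]]   # same inode, new name
resized = [i for i in new if i in old and new[i][1] != old[i][1]]   # definitely changed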
I suggest cheating and using the system find command. For example, the following finds all Python files that have been modified or created in the last 60 minutes. Using the ls-style output, you can determine whether further checking is needed.
$ echo beer > zoot.py
$ find . -name '*.py' -mmin -60 -type f -ls
1973329 4 -rw-r--r-- 1 johnm johnm 5 Aug 30 15:17 ./zoot.py
OK, this question is actually a follow-up question from my previous one: What would happen if I abruptly close my script while it's still doing file I/O operations?
So it's not possible to see an incomplete line written into a file whenever you force your script/program to quit, as the OS will do its job. But what if there's a power failure, and the OS is right in the middle of appending one line such as "This is a test" (or an even bigger string) to a file - do I get an incomplete line appended, or nothing appended, or even worse, previous content lost? I'm really curious to know, and this kind of situation would definitely happen on the server side. Can anybody help me out?
Rule 1. There's no magic. No guarantee. No assurance. Power failure means the circuitry passes through states that are outside their design tolerances. Anything could happen. No guarantees.
what if there's a power failure, and the OS is just in the middle of appending ... into a file, do I get an incomplete line appended
Possibly. There's no magic. The I/O could include two physical blocks. One written, one unwritten.
or nothing appended
Possibly. There's no magic. The I/O buffer may not have been synced to the device.
or even worse, previous content lost?
Possibly. There's no magic. A block write to the device could -- during a power failure -- fatally corrupt bits on the device.
I'm really curious to know, and this kind of situation would definitely happen on the server side.
"Definitely"? Nothing's definite during an uncontrollable event like a power failure. Anything could happen.
There's a really small possibility that the random scrambled bits could be the text of Lincoln's Gettysburg Address and that's what appears on the device.
It depends on the filesystem (and its options), the hardware (caches/buffers, media, etc.), application behavior and lots of other tidbits.
You can lose data, even data you had safely written before. You can corrupt whole partitions. You can get garbage on files. You can get a line half-written, half-laden with garbage or whatever. Given the right combination of factors, you can pretty much get any result you imagine, files with mixed contents, old bits of deleted files resurfacing, dogs and cats living together... mass hysteria!
With a proper (journaled? versioned?) FS and sane hardware, you do lower the amount of chaos possible.
The title could probably have been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. The reason is, I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform, there are a couple of options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: a dhcp leases file), and you're on a POSIX system, you can write your data to a temporary file and 'rename' the temporary file to your target. On POSIX-compliant systems this is guaranteed to be an atomic operation; it shouldn't even matter whether the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is: the temporary file protects your data, and if the system reboots the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of a dhcp.leases file this isn't a big performance hit; larger files might prove to be more cumbersome.
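A rough sketch of that temp-file-plus-rename pattern on a POSIX system (the helper name is mine):

import os
import tempfile

def atomic_write(path, data):
    # Write the whole blob to a temp file in the same directory, then rename it
    # over the target; readers only ever see the old file or the complete new one.
    dirname = os.path.dirname(path) or '.'
    fd, tmp = tempfile.mkstemp(dir=dirname)   # same filesystem, so rename is atomic
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())                  # push the data to disk before the rename
    os.rename(tmp, path)

# atomic_write('dhcp.leases', lease_blob)     # lease_blob being whatever you serialize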
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
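A minimal sketch of grouping writes into one sqlite3 transaction (the table and values are made up):

import sqlite3

conn = sqlite3.connect('state.db')            # placeholder database file
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

# Used as a context manager, the connection commits the whole group of
# statements on success and rolls back if any of them raises.
with conn:
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('lease', '10.0.0.5'))
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('renewed', '2012-08-30'))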
A couple of other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but the data might not be flushed to disk yet. The rename approach protects you in this case, and sqlite3 does as well.
If your filesystem is mounted async, it might also be possible for the rename to land before the data it refers to has actually been written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia, though. But if it's an embedded system and it dies, 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been fully written (which may mean you need to throw away some partially written data). ZODB, for example, implements this by ending a record with the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
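A toy illustration of that record idea, using a simple length-prefixed format of my own rather than ZODB's actual on-disk layout:

import struct

def append_record(f, payload):
    # Length prefix first, then the payload; readers can spot a truncated tail
    f.write(struct.pack('>I', len(payload)) + payload)
    f.flush()

def read_records(f):
    while True:
        header = f.read(4)
        if len(header) < 4:
            return                             # incomplete header: discard the tail
        (length,) = struct.unpack('>I', header)
        payload = f.read(length)
        if len(payload) < length:
            return                             # incomplete payload: discard it
        yield payload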
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to consider when doing this:
Ensure that your data is written to disk when you close a file. Even after you close it, some of the data may sit in the OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed by os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".
What is the most efficient way to delete an arbitrary chunk of a file, given the start and end offsets? I'd prefer to use Python, but I can fall back to C if I have to.
Say the file is this
..............xxxxxxxx----------------
I want to remove a chunk of it:
..............[xxxxxxxx]----------------
After the operation it should become:
..............----------------
Reading the whole thing into memory and manipulating it in memory is not a feasible option.
The best performance will almost invariably be obtained by writing a new version of the file and then atomically renaming it over the old version, because filesystems are strongly optimized for such sequential access, and so is the underlying hardware (with the possible exception of some of the newest SSDs, but, even then, it's an iffy proposition). In addition, this avoids destroying data in the case of a system crash at any time -- you're left with either the old version of the file intact, or the new one in its place. Since every system could always crash at any time (and by Murphy's Law, it will choose the most unfortunate moment;-), integrity of data is generally considered very important (often data is more valuable than the system on which it's kept -- hence, "mirroring" RAID solutions to ensure against disk crashes losing precious data;-).
If you accept this sane approach, the general idea is: open the old file for reading and the new one for writing (creation); copy N1 bytes over from the old file to the new one; then skip N2 bytes of the old file; then copy the rest over; close both files; atomically rename new to old. (Windows apparently has no "atomic rename" system call usable from Python -- to keep integrity in that case, instead of the atomic rename you'd do three steps: rename the old file to a backup name, rename the new file to the old name, delete the backup-named file -- in case of a system crash during the second of these three very fast operations, one rename is all it will take to restore data integrity.)
N1 and N2, of course, are the two parameters saying where the deleted piece starts, and how long it is. For the part about opening the files, with open('old.dat', 'rb') as oldf: and with open('NEWold.dat', 'wb') as newf: statements, nested into each other, are clearly best (the rest of the code until the rename step must be nested in both of them of course).
For the "copy the rest over" step, shutil.copyfileobj is best (be sure to specify a buffer length that's comfortably going to fit in your available RAM, but a large one will tend to give better performance). The "skip" step is clearly just a seek on the oldf open-for-reading file object. For copying exactly N1 bytes from oldf to newf, there is no direct support in Python's standard library, so you have to write your own, e.g.:
def copyN1(oldf, newf, N1, buflen=1024*1024):
    # Copy exactly N1 bytes from oldf to newf, in chunks of at most buflen
    while N1 > 0:
        data = oldf.read(min(N1, buflen))
        if not data:
            break                # unexpected EOF in the source file
        newf.write(data)
        N1 -= len(data)
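Putting the whole recipe together, a sketch under the assumptions above (the file names are placeholders, and it reuses the copyN1 helper just defined):

import os
import shutil

def delete_range(path, N1, N2, buflen=1024*1024):
    # Remove N2 bytes starting at offset N1 by rewriting the file and renaming
    tmp = path + '.new'                          # temporary name on the same filesystem
    with open(path, 'rb') as oldf:
        with open(tmp, 'wb') as newf:
            copyN1(oldf, newf, N1, buflen)       # keep the first N1 bytes
            oldf.seek(N2, 1)                     # skip the chunk being deleted
            shutil.copyfileobj(oldf, newf, buflen)   # copy the rest
    os.rename(tmp, path)                         # atomic replace on POSIX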
I'd suggest memory mapping. Although it still means manipulating the file in memory, it is more efficient than plainly reading the whole file into memory.
Well, you have to manipulate the file contents in memory one way or another, as there's no system call for such an operation in either *nix or Windows (at least none that I'm aware of).
Try mmapping the file. This won't necessarily read it all into memory at once.
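A sketch of what that might look like, treating the offsets from the question as explicit start and end positions (the function name is mine):

import mmap
import os

def delete_range_mmap(path, start, end):
    # Shift the tail of the file left over [start, end), then cut off the leftovers
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        m = mmap.mmap(f.fileno(), 0)
        m.move(start, end, size - end)    # copy the tail down over the deleted chunk
        m.flush()
        m.close()
        f.truncate(size - (end - start))  # drop the now-duplicated bytes at the end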
If you really want to do it by hand, choose some chunk size and do back-and-forth reads and writes. But the seeks are going to kill you...