Tracking a file over time - python

My idea is to track a specific file on a file-system between two points in time, T1 and T2. The emphasis here lies on looking at a file as a unique entity on a file-system: one that can change in data and attributes but still maintain its unique identity.
The ultimate goal is to determine whether or not the data of a file has (unwillingly) changed between T1 and T2, by capturing and recording the data-hash and creation/modification attributes of the file at T1 and comparing them with the equivalents at T2. If all attributes are unchanged but the hash doesn't validate, we can say that there is a problem. In all other cases we might be willing to say that a changed hash is the result of a modification, and an unchanged hash together with an unchanged modification-attribute the result of no change to the file data at all.
Now, there are several ways to refer to a file and corresponding drawbacks:
The path to the file: however, if the file is moved to a different location, this method fails.
A data-hash of the file-data: would allow the file, or rather a pointer to the file-data on disk, to be found even if that pointer has been moved to a different directory; but the data cannot change, or this method fails as well.
My idea is to retrieve a fileId for that specific file at T1 in order to track the file at T2, even if it has changed its location, so it doesn't need to be treated as a new file.
I am aware of two methods pywin offers: win32file.GetFileInformationByHandle() and win32file.GetFileInformationByHandleEx(), but they are obviously restricted to specific file-systems, break cross-platform compatibility and stray from a universal approach to tracking the file.
My question is simple: are there any other ideas/theories for tracking a file, ideally across platforms/FSs?
Any brainstormed food for thought is welcome!
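For illustration, here is a rough sketch of the capture-and-compare idea described above, using only the standard library (snapshot and compare are hypothetical helper names, not existing APIs):

import hashlib
import os

def snapshot(path):
    # Record the data-hash plus the attributes to be compared at T1 and T2.
    st = os.stat(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return {"sha256": h.hexdigest(), "size": st.st_size, "mtime": st.st_mtime}

def compare(t1, t2):
    if t1["sha256"] == t2["sha256"]:
        return "data unchanged"
    if t1["mtime"] == t2["mtime"] and t1["size"] == t2["size"]:
        return "problem: hash changed but attributes did not"
    return "modified"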

It's not really feasible in general, because the idea of file identity is an illusion (similar to the illusion of physical identity, but this isn't a philosophy forum).
You cannot track identity using file contents, because contents change.
You cannot track by any other properties attached to the file, because many file editors will save changes by deleting the old file and creating a new one.
Version control systems handle this in three ways:
(CVS) Don't track move operations.
(Subversion) Track move operations manually.
(Git) Use a heuristic to label operations as "move" operations based on changes to the contents of a file (e.g., if a new file differs from an existing file by less than 50%, then it's labeled as a copy).
Things like inode numbers are not stable and not to be trusted. Here, you can see that editing a file with Vim will change the inode number, which we can examine with stat -f %i:
$ touch file.txt
$ stat -f %i file.txt
4828200
$ vim file.txt
...make changes to file.txt...
$ stat -f %i file.txt
4828218
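The same check can be done from Python with os.stat; a minimal sketch mirroring the shell transcript above (the edit step is only indicated by a comment):
>>> import os
>>> os.stat("file.txt").st_ino   # st_ino is the inode number on POSIX systems
4828200
>>> # ...make changes to file.txt with Vim, as above...
>>> os.stat("file.txt").st_ino   # a new inode: the editor wrote a new file
4828218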

Related

Tracking changes of the file being appended in Python

I would like to track the changes of a file being appended by another program.
My plan of approach is this. I read the file's contents first. At a later time, the file contents would be appended from another application. I want to read the appended data only rather than re-reading everything from the top. In order to do this, I'm going to check the file's modification time. I would then seek() to the previous size of the file, and start reading from there.
Is this a proper approach? Or is there a known idiom for this?
Well, you have to make quite a few assumptions about both the other program writing to the file and the file system, but in general it should work. Personally I would rather write the current seek position or line number (if reading simple text files) to another file and check it from there. This will also allow you to go back in the file if some part is rewritten and the file size stays the same (or even gets smaller).
If you have some very important/unique data, besides making backups you should maybe think about appending the new data to a new file and later rejoining the files (if needed) once you have checked that the data is fine in your other program. This way you could just read any new file as a whole after a certain time. (Also remember that, in the larger picture, system time and creation/modification times are not 100% trustworthy.)
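A rough sketch of that idea, assuming the appended file is named example.log and using a hypothetical sidecar file example.log.offset to persist the last read position:

import os

LOG = "example.log"            # file being appended to by the other program (assumed name)
OFFSET_FILE = LOG + ".offset"  # hypothetical sidecar file holding the last read position

def read_new_data():
    try:
        with open(OFFSET_FILE) as f:
            offset = int(f.read().strip() or 0)
    except FileNotFoundError:
        offset = 0
    if os.path.getsize(LOG) < offset:
        offset = 0                         # file shrank: truncated or rewritten, start over
    with open(LOG, "rb") as f:
        f.seek(offset)
        new_data = f.read()                # only the data appended since the last read
    with open(OFFSET_FILE, "w") as f:
        f.write(str(offset + len(new_data)))
    return new_data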

If power is lost while a file is being read in read-only mode, can that file's data be lost?

Example in Python:
>>> f = open("example.txt", "r")
>>> first_line = f.readline()
>>> second_line = f.readline()
>>> # Here the machine executing the above code unexpectedly powers off.
If power is lost while a file is being read in read-only mode, can that file's data be lost?
One would think that since you have the file open in read-only mode, the answer would be a solid "no". There are two scenarios that come to mind:
Hardware failure
In the case of a hard disk, the head must be above the platter to read the file. If the power dies, that could be the last straw that causes the disk to fail outright.
Access times
File metadata. Even when opening a file read-only, the "last access date" might still need to be updated, and thus cause a write. Whether this happens depends on several things; consider:
Does the filesystem that the file exists on support a last access time?
Is the filesystem configured to use it? (Linux, for example, has a noatime mount option that means access times are not updated.)
Is the filesystem read-only? (Again, Linux is a good example here; you can mount an FS as read-only.)
If there is an access time that could be written, the next big question is whether the FS at hand journals metadata. A "journal" is a data structure many FSs use to prevent corruption. If the answer is "no", then I'd say "yes, it is possible."
Corrupting the file metadata could, conceivably, render the data in the file itself corrupt. (More likely, the metadata that stores where on disk the file is located sits near the access time, and that data might itself get corrupted. The file contents are probably fine, but the thing that says where they are is what got corrupted.)
At the end of the day, if you need to protect against such things,
Use a filesystem that journals metadata. (ext3, for example, can do this.) Do note that some FSs with journals do not journal metadata. (They journal only the main file data.) (Also note that some are configurable either way.)
Always have a backup. The disk can always outright fail.
Your file's data should be safe, but you should be aware that some file systems will modify the access time in the file's metadata, even if you're just reading the file, and even if the partition is mounted read-only. However, if you are worried about that, it's possible to disable that feature: on Unix-like systems you can mount the partition with the noatime option. Please see the man pages for mount and fstab for details.
But there's really no need to be too concerned about this if you are using a modern journaling file system.

How to traverse directory in order of modification date without visiting all files in directory at least once in Python

I have a directory that is full of potentially millions of files. These files "mark" themselves when used, and then my Python program wants to find the "marked" ones, record that they were marked, and unmark them. They are individual HTML files, so they can't easily communicate with the Python program themselves during this marking process (the user will just open whichever ones they choose).
Because they are marked when used, if I access them by modification date, one at a time, I can stop once I reach one that isn't marked (or at least once I get to one that was modified a decent amount of time in the future). However, all the ways I've seen of doing this so far require accessing every file's metadata at least once and then sorting that data, which isn't ideal given the number of files I have. Note that this check happens during an update step that runs every 5 seconds or so alongside other work, so ideally its time needs to be independent of the number of files in the directory.
So is there a way to traverse a directory in order of modification date, without visiting every file's metadata at least once, in Python?
No, I don't think there is a way to fetch file names in chunks sorted by modification dates.
You should use file system notifications to know about modified files.
For example use https://github.com/gorakhargosh/watchdog or https://github.com/seb-m/pyinotify/wiki
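For instance, a minimal watchdog-based sketch (watchdog is a third-party package; the handler class name and the directory path are assumptions):

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class MarkedFileHandler(FileSystemEventHandler):
    # React to files being modified ("marked") in the watched directory.
    def on_modified(self, event):
        if not event.is_directory:
            print("modified:", event.src_path)   # record/unmark the file here

observer = Observer()
observer.schedule(MarkedFileHandler(), path="/path/to/html/dir", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)        # the rest of the update loop can run here
except KeyboardInterrupt:
    observer.stop()
observer.join()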

How to find modified files in Python

I want to monitor a folder and see if any new files are added, or existing files are modified. The problem is, it's not guaranteed that my program will be running all the time (so, inotify based solutions may not be suitable here). I need to cache the status of the last scan and then with the next scan I need to compare it with the last scan before processing the files.
What are the alternatives for achieving this in Python 2.7?
Note1: Processing of the files is expensive, so I'm trying to avoid re-processing files that were not modified in the meantime. So, if a file has only been renamed (as opposed to a change in its contents), I would also like to detect that and skip the processing.
Note2: I'm only interested in a Linux solution, but I wouldn't complain if answers for other platforms are added.
There are several ways to detect changes in files. Some are easier to fool than others. It doesn't sound like this is a security issue; more like good faith is assumed, and you just need to detect changes without having to outwit an adversary.
You can look at timestamps. If files are not renamed, this is a good way to detect changes. If they are renamed, timestamps alone wouldn't suffice to reliably tell one file from another. os.stat will tell you the time a file was last modified.
You can look at inodes, e.g., ls -li. A file's inode number may change if changes involve creating a new file and removing the old one; this is how emacs typically changes files, for example. Try changing a file with the standard tool your organization uses, and compare inodes before and after; but bear in mind that even if it doesn't change this time, it might change under some circumstances. os.stat will tell you inode numbers.
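A small sketch of reading both values with os.stat (the file name is just an example):

import os

st = os.stat("report.csv")            # file name is just an example
print("mtime:", st.st_mtime)          # last modification time, seconds since the epoch
print("inode:", st.st_ino)            # may change if the file is replaced via rename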
You can look at the content of the files. cksum computes a small CRC checksum on a file; it's easy to beat if someone wants to. Programs such as sha256sum compute a secure hash; it's infeasible to change a file without changing such a hash. This can be slow if the files are large. The hashlib module will compute several kinds of secure hashes.
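For example, a minimal hashlib sketch (the file name is an example; hashlib.file_digest requires Python 3.11+, on older versions read the file in chunks and feed them to hashlib.sha256() instead):

import hashlib

with open("report.csv", "rb") as f:   # file name is just an example
    digest = hashlib.file_digest(f, "sha256").hexdigest()
print(digest)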
If a file is renamed and changed, and its inode number changes, it would be potentially very difficult to match it up with the file it used to be, unless the data in the file contains some kind of immutable and unique identifier.
Think about concurrency. Is it possible that someone will be changing a file while the program runs? Beware of race conditions.
I would probably go with some kind of sqlite solution, such as writing the last polling time.
Then on each poll, sort the files by last modified time (mtime) and get all the ones whose mtime is greater than your previous poll time (that value comes from the sqlite db, or from some kind of file if you'd rather not require such a db).
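A rough sketch of that polling idea, assuming a hypothetical scan_state.db database and a directory named watched_dir:

import os
import sqlite3
import time

conn = sqlite3.connect("scan_state.db")   # db file name is an assumption
conn.execute("CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value REAL)")

row = conn.execute("SELECT value FROM state WHERE key = 'last_poll'").fetchone()
last_poll = row[0] if row else 0.0
now = time.time()

directory = "watched_dir"                 # directory name is an assumption
changed = [name for name in os.listdir(directory)
           if os.stat(os.path.join(directory, name)).st_mtime > last_poll]

conn.execute("INSERT OR REPLACE INTO state (key, value) VALUES ('last_poll', ?)", (now,))
conn.commit()
print(changed)                            # files modified since the previous poll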
Monitoring for new files isn't hard -- just keep a list or database of inodes for all files in the directory. A new file will introduce a new inode. This will also help you avoid processing renamed files, since inode doesn't change on rename.
The harder problem is monitoring for file changes. If you also store file size per inode, then obviously a changed size indicates a changed file and you don't need to open and process the file to know that. But for a file that has (a) a previously recorded inode, and (b) is the same size as before, you will need to process the file (e.g. compute a checksum) to know if it has changed.
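A rough sketch of such an inode/size index, assuming a directory named watched_dir:

import os

def scan(directory):
    # Map inode -> (size, mtime, name) for every regular file in the directory.
    index = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            st = os.stat(path)
            index[st.st_ino] = (st.st_size, st.st_mtime, name)
    return index

previous = scan("watched_dir")            # directory name is an assumption
# ...some time later...
current = scan("watched_dir")
new_files = [current[ino][2] for ino in current if ino not in previous]
maybe_changed = [current[ino][2] for ino in current
                 if ino in previous and current[ino][:2] != previous[ino][:2]]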
I suggest cheating and using the system find command. For example, the following finds all Python files that have been modified or created in the last 60 minutes. Using the -ls output, you can determine whether further checking is needed.
$ echo beer > zoot.py
$ find . -name '*.py' -mmin -60 -type f -ls
1973329 4 -rw-r--r-- 1 johnm johnm 5 Aug 30 15:17 ./zoot.py

Multiple Editing of the same file

What would I have to do to allow multiple programs/users to read/write to the same file?
Use Case
I have a CSV file and I want to enable multiple users to edit it more or less in real time. I want to be able to write and read the small changes in the file, but I also want to be able to refresh the data loaded in my program in the event that the entire file is replaced by some careless soul.
Background
I have seen that certain programs will refresh a file if the time stamp is changed or the file is overwritten by another program/user. (I've used this myself when editing a file in two different editors leveraging their different features).
Home Work
I would imagine this requires my application to duplicate the original file when it is initially opened. In this way any updates to the original can be diff'd against the copy to get the modifications to the current data. Then, when the temporary file is updated, the primary file can be rewritten. Each user/program could then reload the updated files themselves. Is this a sensible way/best practice, or are there better means to an end here?
Alternatively, from what I understand, one could cache the file.
Is it better to block/lock the file? Must I be wary of race conditions?
Environment
I plan to do this in Python. I would also like this to be platform independent, e.g. Linux, Windows and Mac (expensive Linux).
Related
It seems these are related here, here and here.
If the intensity of the edits is low, you can pull it off with a csv file, but by locking the entire file to avoid users overwriting each other's edits. If the file cannot be locked until the edit is applied, you will be better off using a DB, where specific records will be locked instead of the entire file.
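A minimal sketch of whole-file locking around an edit, assuming a file named data.csv; fcntl is POSIX-only, so on Windows a third-party package such as portalocker would be one option:

import fcntl   # POSIX-only advisory locking

with open("data.csv", "r+") as f:            # file name is an assumption
    fcntl.flock(f.fileno(), fcntl.LOCK_EX)   # block until we hold an exclusive lock
    try:
        rows = f.readlines()
        # ...apply the edit to rows...
        f.seek(0)
        f.writelines(rows)
        f.truncate()
    finally:
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)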
When a user opens the file, you actually serve a copy of it, file_userid-1.csv, and let him edit that one to avoid users overwriting each other's work. When the user saves, you overwrite the original one. In between, you keep a hook to see if the original one was modified while the current user also modified his copy. If the original file was modified, you do a diff or something along those lines.
I think what you need is a tiny replica of how svn or git works.
