Efficient way to track inter-runtime changes to a directory? - python

I know that I can use a library such as Watchdog in order to track changes to a directory during the runtime of a Python program. However, what if I want to track changes to the same directory between invocations of the same program? For example, if I have the following dir when I run the program the first time:
example/
file1
file2
I then quit, delete one file, and add another one:
example/
file2
file3
Now, when I start the program the second time, I would like to efficiently get a summary of the changes done ("deleted file1, added file3") to the directory since I last ran the program.
I know that I could brute force a solution by (for example) saving a list of all files when the program quits, create a new list when it starts, and then compare the two. However, is there a more efficient way to do this - preferably one that makes use of the underlying OS/filesystem AND is deployable cross-platform?
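For concreteness, the brute-force baseline I have in mind would look something like this (just a sketch; snapshot.json is an arbitrary name for the saved listing):

import json
import os

SNAPSHOT = "snapshot.json"  # illustrative path for the saved listing

def take_snapshot(directory):
    # Record the current set of file names in the directory.
    return sorted(os.listdir(directory))

def diff_since_last_run(directory):
    # Compare the current listing with the one saved on the previous run.
    current = set(take_snapshot(directory))
    try:
        with open(SNAPSHOT) as f:
            previous = set(json.load(f))
    except FileNotFoundError:
        previous = set()
    added, deleted = current - previous, previous - current
    with open(SNAPSHOT, "w") as f:
        json.dump(sorted(current), f)
    return added, deleted

added, deleted = diff_since_last_run("example")
print("added:", sorted(added), "deleted:", sorted(deleted))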

Related

How to check if files have been modified?

Okay, so I'm looking for an easy way to check if the contents of files in a folder have changed, and if one has changed, to update the version of that file.
I'm guessing this is what is called logging? I am completely new to this, so it's a bit hard to explain what I'm looking for. I'll give an example:
Let's say I have a reference folder that contains my original data.
Then, every time I run my code, it inspects the contents of the files in that reference folder.
If the contents are the same, the code continues to run normally.
But if the contents of the files have changed, it updates the version of that file (for example: from '1.0.0' to '1.0.1') and keeps a copy of the changes.
Is there a way to do this in python or a module that helps me accomplish this? Or where can I start looking into this?
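One place to start is hashing each file's contents and comparing against a manifest saved on the previous run; a minimal sketch (the manifest name and layout here are made up, and the actual version bump and copy-keeping are left to you):

import hashlib
import json
import os

MANIFEST = "manifest.json"  # illustrative: maps filename -> content hash

def file_hash(path):
    # SHA-256 digest of the file's contents, read in chunks.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_files(reference_dir):
    # Compare current hashes against the stored manifest, then update it.
    try:
        with open(MANIFEST) as f:
            old = json.load(f)
    except FileNotFoundError:
        old = {}
    new, changed = {}, []
    for name in sorted(os.listdir(reference_dir)):
        path = os.path.join(reference_dir, name)
        if os.path.isfile(path):
            new[name] = file_hash(path)
            if old.get(name) != new[name]:
                changed.append(name)
    with open(MANIFEST, "w") as f:
        json.dump(new, f)
    return changed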

Getting the time where a file is copied to a folder (Python)

I'm trying to write a Python script that runs on Windows. Files are copied to a folder every few seconds, and I'm polling that folder every 30 seconds for names of the new files that were copied to the folder after the last poll.
What I have tried is to use one of the os.path.getXtime(folder_path) functions and compare that with the timestamp of my previous poll. If the getXtime value is larger than the timestamp, then I work on those files.
I have tried to use the function os.path.getctime(folder_path), but that didn't work because the files were created before I wrote the script. I tried os.path.getmtime(folder_path) too but the modified times are usually smaller than the poll timestamp.
Finally, I tried os.path.getatime(folder_path), which works for the first time the files were copied over. The problem is I also read the files once they were in the folder, so the access time keeps getting updated and I end up reading the same files over and over again.
I'm not sure what would be a better way or function to do this.
You've got a bit of an XY problem here. You want to know when files in a folder change, you tried a home-rolled solution, it didn't work, and now you want to fix that home-rolled solution.
Can I suggest that instead of terrible hackery, you use an existing package designed for monitoring for file changes? One that is not a polling loop, but actually gets notified of changes as they happen? While inotify is Linux-only, there are other options for Windows.
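For example, with the cross-platform watchdog package, a handler that reacts to files appearing in the folder might look roughly like this (the folder path is a placeholder):

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Called on the observer thread whenever something appears.
        if not event.is_directory:
            print("new file:", event.src_path)  # process the file here

observer = Observer()
observer.schedule(NewFileHandler(), r"C:\incoming", recursive=False)  # placeholder path
observer.start()
try:
    while True:
        time.sleep(1)  # main thread idles; events arrive on the observer thread
except KeyboardInterrupt:
    observer.stop()
observer.join()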

Force directory to be created in Python

I am running Python with MPI on a supercomputing cluster. I am getting strange nondeterministic behavior that I think is a result of I/O complications that are not present on the single machines I'm used to working with.
One of the things my code does is to create directories using os.makedirs somewhat frequently. I know also that I generally should not write small amounts of data to the filesystem-- this can end up with the data getting stuck in some buffer and not written for a long time. I suspect this may be happening with my directory creation calls, and then later code tries to write to files inside the directory before it exists. Two questions:
Is creating a new directory effectively the same thing as writing a small amount of data?
When forcing data to be written, I use flush and os.fsync. These require a file object. Is there an equivalent way to make sure a directory has been created?
Creating a new directory is effectively the same as writing a small amount of data: it adds an inode.
The only way os.mkdir (or os.makedirs) should fail is if the directory already exists - otherwise the directory will always be created. In terms of the data being buffered - it's unlikely that this would happen - even journaled filesystems will sync out pretty regularly.
If you're having non-deterministic behavior, you could wrap your directory creation and the subsequent write into that directory inside a try / except / finally that retries a few times. But really - the need for such code hints at something much more sinister and is likely a symptom of a bigger issue.
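If the worry is specifically that the new directory entry hasn't been flushed: there is no directory-level flush in Python, but on POSIX systems the conventional trick is to fsync the parent directory. A sketch, assuming a POSIX filesystem (this forces local durability only; it says nothing about when other MPI nodes see the directory on a networked filesystem):

import os

def makedirs_synced(path):
    # Create the directory (and parents), then fsync the parent so the
    # new entry itself is pushed to disk. POSIX-only: directories can be
    # opened read-only and passed to os.fsync there.
    os.makedirs(path, exist_ok=True)
    parent = os.path.dirname(os.path.abspath(path))
    fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)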

update the list of directories with only the new ones

I am creating a list (a text file) of all the directories recursively using the code below. Since there are thousands of sub-directories, I don't want to create the list again and again, but would like to update/insert only the directories newly created since the last time I listed them.
Is there a good way to do this ?
import os

rootdir = "/store/user/"
myusers = ['u1', 'u2', 'u3', 'u4', 'u5', 'u6', 'u7']
for myuser in myusers:
    rootuserdir = os.path.join(rootdir, myuser)
    for myRoot, mySubFolders, myFiles in os.walk(rootuserdir):
        for mySubFolder in mySubFolders:
            dirpath = os.path.join(myRoot, mySubFolder)
            print(dirpath)
You don't save anything by trying to incrementally update your list of folders. There is no efficient way to delete a line from the middle of a file, nor to insert a line. Simply writing the whole list again is the most efficient approach, and also the easiest one.
Trying to locate a specific entry in a file would be more resource intensive than just repopulating the list each time.
For performance optimization, always try to determine where the true bottlenecks lie before focusing on one specific area. More often than not, your focus will be in the wrong place when you don't employ this approach.
Determining the bottlenecks or hotspots should always be one of the first focus areas when refactoring your code. By doing so, you'll ensure that you are focusing on the areas with the highest return for the least effort. A rule of thumb is that you should only attempt to refactor code if you can make the entire program, or at least a significant part of it, at least twice as fast.
You could run a one-off process to cache the information in some sort of database (possibly a document-oriented one for simplicity), then use pyinotify in a daemon process to keep the database in sync.
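A minimal sketch of the pyinotify half (Linux-only; the two database helpers are placeholders for whatever store you pick):

import pyinotify

def add_to_database(path):
    print('add', path)      # placeholder: insert into your database

def remove_from_database(path):
    print('remove', path)   # placeholder: delete from your database

class DirHandler(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        if event.dir:
            add_to_database(event.pathname)

    def process_IN_DELETE(self, event):
        if event.dir:
            remove_from_database(event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/store/user/', pyinotify.IN_CREATE | pyinotify.IN_DELETE,
             rec=True, auto_add=True)  # auto_add watches new subdirs too
pyinotify.Notifier(wm, DirHandler()).loop()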

Grabbing output FILE from Python Popen process?

I have written a python program to interface with a compiled program (call it ProgramX) that has some idiosyncrasies that are proving difficult to deal with. I need to feed many thousands of input files to ProgramX via my python program. What I would like to do is to grab the output file that ProgramX creates with each run, and rename it something sensible, like inputfilename.output.
The problem comes in with the output file written by ProgramX -- it is named via an unpredictable method, and ProgramX will write, and "mercilessly overwrite", that output file if it already exists (which is the case the majority of the time). The saving grace probably comes with the fact that there is a standard prefix to the output files: think ProgramX.notQuiteRandomNumber.
The only thing I can think to do is something like this in my bash shell:
PROGRAMXOUTPUT=$(ls -ltr ProgramX* | tail -n -1 | awk '{print $8}')
mv $PROGRAMXOUTPUT input.output
Which does 90% of what I need, but before I translate all that bash into a series of Popen statements: is there a better way to do this? This feels like the kind of problem for which people might have a much better solution than what I'm considering.
Sidenote: I can grab the program's standard output without problems, however it's the output file that I need to grab.
Bonus: I was planning on running a bunch of instantiations of the program in the same directory, so my naive approach above may start to have unforeseen problems. So perhaps something fancy that watches the PID of ProgramX and follows its output.
To do what your shell script above does, assuming you've only got one ProgramX* in the current directory:
import glob, os
programxoutput = glob.glob('ProgramX*')[0]
os.rename(programxoutput, 'input.output')
If you need to sort by time, etc., there are ways to do that too (look at os.stat), but using the most recent modification date is a recipe for nasty race conditions if you'll be running multiple copies of ProgramX concurrently.
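For the record, picking the newest match by modification time would look something like this:

import glob, os

# Newest match last; convenient, but racy with concurrent ProgramX runs.
candidates = sorted(glob.glob('ProgramX*'), key=os.path.getmtime)
newest = candidates[-1]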
I'd suggest instead that you create and change to a new, perhaps temporary directory for each run of ProgramX, so the runs have no possibility of treading on each other. The tempfile module can help with this.
Two options that I see:
You could use lsof to find the files that ProgramX currently has open for writing.
A different approach would be to run ProgramX in a temporary directory (see the tempfile module for an easy way of setting up such directories). Between runs of ProgramX you can clean that directory, or keep requesting new temp directories if you are planning on running multiple copies of ProgramX at the same time.
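A rough sketch of that approach (the ProgramX command line and the output naming are placeholders):

import glob
import os
import shutil
import subprocess
import tempfile

def run_programx(input_path, output_path):
    # Give each run its own private directory so outputs cannot collide.
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(['ProgramX', os.path.abspath(input_path)],
                       cwd=workdir, check=True)
        # The only ProgramX* file in here belongs to this run.
        produced = glob.glob(os.path.join(workdir, 'ProgramX*'))[0]
        shutil.move(produced, output_path)

run_programx('input', 'input.output')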
If there is only one ProgramX* file, then what about just:
mv ProgramX* input.output
