What I want to do:
I have two directories. Each one contains about 90,000 XML and BAK files.
I need the XML files in both folders to stay in sync when a file changes (the newer one should be copied, of course).
The problem is:
Because of the huge number of files, and because one of the directories is a network share, I can't just loop through the directories and compare os.path.getmtime(file) values.
Even watchdog and PyQt don't work (I tried the solutions from here and here).
The question:
Is there any other way to get a file-changed event (on Windows systems) that works for this configuration without looping through all those files?
So I finally found the solution:
I changed some of my network share settings and used the FileSystemWatcher.
To prevent files from being synced again in reaction to the sync itself, I use an MD5 file hash.
The code I use can be found at pastebin (it's quick and dirty code and covers just the parts mentioned in the question here).
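For illustration, here is a minimal sketch of the hash check described above (not the pastebin code; the function names are my own). The idea is to compare MD5 digests and only copy when the content actually differs, so the watcher on the other side doesn't see the copy as a fresh change and sync it back:

import hashlib
import os
import shutil

def file_md5(path, chunk_size=1 << 20):
    # Hash the file in chunks so large files don't have to fit in memory.
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def sync_if_changed(src, dst):
    # Copy src over dst only if the contents really differ.
    if not os.path.exists(dst) or file_md5(src) != file_md5(dst):
        shutil.copy2(src, dst)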
The code in...
https://stackoverflow.com/a/12345282/976427
Seems to work for me when passed a network share.
I'm risking giving an answer which is way off here (you didn't specify requirements regarding speed, etc.), but... Dropbox would do exactly that for you for free, and would require writing no code at all.
Of course it might not suit your needs if you require real-time syncing, or if you want to avoid "sharing" your files with a third party (although you can encrypt them first).
Can you use the second option on this page?
http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
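One of the approaches on that page uses win32file.ReadDirectoryChangesW from pywin32; a condensed sketch of that approach follows (path_to_watch is a placeholder; adjust the notify filters to what you need):

import os
import win32con
import win32file

path_to_watch = r"\\server\share"   # placeholder; use your network share or local directory
FILE_LIST_DIRECTORY = 0x0001
ACTIONS = {1: "Created", 2: "Deleted", 3: "Updated", 4: "Renamed from", 5: "Renamed to"}

handle = win32file.CreateFile(
    path_to_watch,
    FILE_LIST_DIRECTORY,
    win32con.FILE_SHARE_READ | win32con.FILE_SHARE_WRITE | win32con.FILE_SHARE_DELETE,
    None,
    win32con.OPEN_EXISTING,
    win32con.FILE_FLAG_BACKUP_SEMANTICS,
    None)

while True:
    # Blocks until something changes below path_to_watch (recursive because of the True flag).
    results = win32file.ReadDirectoryChangesW(
        handle, 8192, True,
        win32con.FILE_NOTIFY_CHANGE_FILE_NAME |
        win32con.FILE_NOTIFY_CHANGE_LAST_WRITE |
        win32con.FILE_NOTIFY_CHANGE_SIZE,
        None, None)
    for action, filename in results:
        print(os.path.join(path_to_watch, filename), ACTIONS.get(action, "Unknown"))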
Since you mention watchdog, I assume you are running under Linux. For the local machine, inotify can help, but for the network share you are out of luck.
Mercurial's inotify extension http://hgbook.red-bean.com/read/adding-functionality-with-extensions.html has the same limitation.
In a similar situation (10K+ files) I have used cloned Mercurial repositories with inotify on both the server and the local machine. They automatically committed and notified each other of changes. There was a slight delay (no problem in my case), but as a benefit I had a full history of changes and could easily resync after one of the systems had been down.
Background
We have a lot of data files stored on a network drive which we process in Python. For performance reasons I typically copy the files to my local SSD for processing. My wish is to make this happen automatically: whenever I try to open a file, it should grab the remote version if it isn't stored locally, and ideally also keep some sort of timer to delete the local copy after a while. The files will practically never change, so I do not require actual syncing capabilities.
Functionality
To sum up what I am looking for is functionality for:
Keeping a local cache of files/directories from a network drive, automatically retrieving the remote version when unavailable locally
Support for directory structure - that is, the files are stored in a directory structure on the remote server which should be duplicated locally for the requested files
Ideally keep some sort of timer to expire cached files
It wouldn't be too difficult for me to write a piece of code which does this myself, but when possible I prefer to rely on existing projects, as this typically gives a more versatile end result and also makes any of my own improvements easily available to other users.
Question
I have searched around for terms like python local file cache, file synchronization and the like, but what I have found mostly handles caching of function return values. I was a bit surprised, because I would imagine this is quite a general problem. My question is therefore: is there something I have overlooked, and more importantly, are there any technical terms describing this functionality which could help my research?
Thank you in advance,
Gregers Poulsen
-- Update --
Due to other proprietary software packages I am forced to use Windows, so the solution naturally must support this.
Have a look at fsspec remote caching, with a tutorial on the anaconda blog and the official documentation. Quoting the former:
In this article, we will present [fsspec]'s new ability to cache remote content, keeping a local copy for faster lookup after the initial read.
They give an example of how to use it:
import fsspec

of = fsspec.open("filecache://anaconda-public-datasets/iris/iris.csv", mode='rt',
                 cache_storage='/tmp/cache1',
                 target_protocol='s3', target_options={'anon': True})
with of as f:
    print(f.readline())
On the first call, the file will be downloaded, stored in the cache, and provided. On the second call, it will be read from the local filesystem.
I haven't used it yet, but I need it and it looks promising.
I need to modify a text file at runtime but restore its original state later (even if the computer crashes).
My program runs in regular sessions. Once a session has ended, the original state of that file can be changed, but the original state won't change at runtime.
There are several instances of this text file with the same name in several directories. My program runs in each directory (but not in parallel), and depending on the directory's contents it does different things. The order in which working directories are chosen is completely arbitrary.
Since the file's name is the same in each directory, it seems a good idea to store the backed-up file in slightly different places (i.e. the parent directory name could be appended to the backup target path).
What I do now is back up and restore the file with a self-written class, and also check at startup whether the previous backup for the current directory was properly restored.
But my implementation needs serious refactoring, and now I'm interested in whether there are existing libraries for this kind of task.
edit
Version control seems like a good idea, but it's actually a bit of overkill, since it requires a network connection and often a server. Other VCSs need clients to be installed. I would be happier with a pure-Python solution, but at the very least it should be cross-platform, portable and small enough (<10 MB, for example).
Why not just do what editors on Unix, Mac and Windows have done for years -- use a lockfile/working-file concept?
When a file is selected for edit:
Check to see if there is an active lock or a crashed backup.
If the file is locked or crashed, give a "recover" option
Otherwise, begin editing the file...
The editing tends to do one or more of a few things:
Copy the original file into a ".%(filename)s.backup"
Create a ".%(filename)s.lock" to prevent others from working on it
When editing is finished, the lock goes away and the .backup is removed
Sometimes things are slightly reversed, and the original stays in place while a .backup is the active edit; on success the .backup replaces the original
If you crash vi or some other text program on a Linux box, you'll see these files created. Note that they usually have a dot (.) prefix, so they're normally hidden on the command line. Word/PowerPoint/etc. all do similar things.
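A minimal sketch of that pattern in Python (the helper names and layout are my own, not from any library):

import os
import shutil

def begin_edit(path):
    directory, name = os.path.split(path)
    backup = os.path.join(directory, ".%s.backup" % name)
    lock = os.path.join(directory, ".%s.lock" % name)
    if os.path.exists(lock) or os.path.exists(backup):
        raise RuntimeError("Previous session crashed or file is locked: offer recovery")
    shutil.copy2(path, backup)          # preserve the original state
    open(lock, 'w').close()             # mark the file as being edited
    return backup, lock

def end_edit(backup, lock):
    # Editing finished successfully: drop the lock and the backup copy.
    os.remove(lock)
    os.remove(backup)

def recover(path, backup, lock):
    # Crash recovery: restore the original content from the backup.
    shutil.copy2(backup, path)
    os.remove(backup)
    if os.path.exists(lock):
        os.remove(lock)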
Implement version control ... like SVN (see pysvn). It should be fast as long as the repo is on the same server... and it allows rollbacks to any version of the file... maybe overkill, but that will make everything reversible.
http://pysvn.tigris.org/docs/pysvn_prog_guide.html
You don't need a server ... you can have local version control and it should be fine...
Git, Subversion or Mercurial is your friend.
I'd like to save my script's data to disk to load the next time the script runs. For simplicity, is it a good idea to use os.path.expanduser('~') and save a directory named ".myscript_data" there? It would only need to be read by the script, and to avoid clutter for the user, I'd like it to be hidden. Is it acceptable practice to place hidden ".files" on the user's computer?
On Windows, use a subfolder of os.environ['APPDATA']; on Linux, a dot-folder under $HOME is typical.
You can also consider putting your files in a subdirectory of ~/.config*, which is a sort of emerging standard for configuration file placement. See also the XDG Base Directory spec.
Not entirely related, but interesting: origin of dotfiles
*(edit) More accurately, os.environ.get('XDG_CONFIG_HOME', os.path.expanduser('~/.config')) as per the XDG spec.
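Putting the two platform conventions together, a small sketch (the application name "myscript" is just an example, not something from the question):

import os
import sys

def data_dir(app_name="myscript"):
    if sys.platform == "win32":
        # Roaming application data folder on Windows.
        base = os.environ["APPDATA"]
    else:
        # XDG convention on Linux, falling back to ~/.config.
        base = os.environ.get("XDG_CONFIG_HOME",
                              os.path.expanduser("~/.config"))
    path = os.path.join(base, app_name)
    os.makedirs(path, exist_ok=True)
    return path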
Yes, it is. (I'm assuming you're on linux, right?)
See also this.
Yes, this is standard practice on most Unix systems. For transparency, it's often a good idea to print an informative message like Creating directory .dir to store script state the first time you create the storage location.
If you are expecting to store significant amounts of data, it's a good idea to confirm the location with the user.
This is also the standard place for any additional configuration files for your application.
On Linux, I suggest a file or directory (no dotfile) in os.environ['XDG_CONFIG_HOME'], which in most cases is the directory $HOME/.config. A dotfile in $HOME, however, is also commonly used.
I have data stored in folders across several computers. Many of the folders contain 40-100 GB of files ranging in size from 500 KB to 125 MB. There are some 4 TB of files which I need to archive, and I need to build a unified metadata system based on the metadata stored on each computer.
All systems run Linux, and we want to use Python. What is the best way to copy and archive the files?
We already have programs to analyze the files and fill the metadata tables, and they are all written in Python. What we need to figure out is a way to copy the files without data loss and ensure that they have been copied successfully.
We have considered using rsync and unison via subprocess.Popen, but they are essentially sync utilities. This is essentially a one-time copy, but it has to be done properly. Once the files are copied, the users will move to the new storage system.
My worries are: 1) when the files are copied there should not be any corruption; 2) the file copying must be efficient, though there are no specific speed expectations. The LAN is 10/100 with Gigabit ports.
Are there any scripts which can be incorporated, or any suggestions? All computers will have SSH keys set up so we can make passwordless connections.
The directory structure will be maintained on the new server, and it is very similar to that of the old computers.
I would look at the Python fabric library. This library is for streamlining the use of SSH, and if you are concerned about data integrity I would use SHA1 or some other hash algorithm to create a fingerprint for each file before transfer, then compare the fingerprint values generated at the initial and final destinations. All of this could be done using fabric.
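A rough sketch of that idea, assuming Fabric 2.x and that sha1sum is available on the remote host (the function names are my own, not Fabric API):

import hashlib
from fabric import Connection

def local_sha1(path, chunk_size=1 << 20):
    # Fingerprint the local file without loading it all into memory.
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def copy_and_verify(host, local_path, remote_path):
    with Connection(host) as c:
        c.put(local_path, remote_path)                        # transfer the file over SSH
        result = c.run("sha1sum %s" % remote_path, hide=True) # fingerprint the remote copy
        remote_hash = result.stdout.split()[0]
    return remote_hash == local_sha1(local_path)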
If more seamless Python integration is the goal, you can look at:
Duplicity
pyrsync
I think rsync is the solution. If you are concerned about data integrity, look at the explanation of the "--checksum" parameter in the man page.
Other arguments that might come in handy are "--delete" and "--archive". Make sure the exit code of the command is checked properly.
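For example, a minimal way to drive rsync from Python and check the exit code (the paths and host are placeholders):

import subprocess

# --archive preserves permissions and timestamps, --checksum verifies content,
# --delete removes files at the destination that no longer exist at the source.
cmd = ["rsync", "--archive", "--checksum", "--delete",
       "/data/source/", "user@newserver:/data/destination/"]
result = subprocess.run(cmd)
if result.returncode != 0:
    raise RuntimeError("rsync failed with exit code %d" % result.returncode)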
This is specifically geared towards managing MP3 files, but it should easily work for any directory structure with a lot of files.
I want to find or write a daemon (preferably in Python) that will watch a folder with many subfolders that should all contain X number of MP3 files. Any time a file is added, updated or deleted, the daemon should reflect that in a database (preferably PostgreSQL). If a file is simply moved, I am willing to accept that the respective rows are deleted and recreated anew, but updating the existing rows would make me the happiest.
The Stack Overflow question Managing a large collection of music has a little of what I want.
I basically just want a database that I can then do whatever I like with. My most up-to-date database right now is my iTunes.xml file, but I don't want to rely on that too much, as I don't always want to rely on iTunes for my music management. I see plenty of projects out there that do a little of what I want, but in a format that I either can't access or that is more complex than I want. If there is some media player out there that can watch a folder and update an easily accessible database, then I am all for it.
The reason I'm leaning towards writing my own is because it would be nice to choose my database and schema myself.
Another answer already suggested pyinotify for Linux; let me add watch_directory for Windows (a good discussion of the possibilities on Windows is here; the module is an example) and fsevents on the Mac (unfortunately I don't think there's a single cross-platform module offering a uniform interface to these various system-specific ways to get directory-change notification events).
Once you manage to get such events, updating an appropriate SQL database is simple!-)
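As a sketch of the database side, using sqlite3 here for brevity (the question mentions PostgreSQL, where psycopg2 and an equivalent UPSERT would play the same role; the schema is just an example):

import sqlite3

conn = sqlite3.connect("music.db")
conn.execute("""CREATE TABLE IF NOT EXISTS tracks (
                    path TEXT PRIMARY KEY,
                    mtime REAL)""")

def on_created_or_updated(path, mtime):
    # Insert a new row or overwrite the existing one for this path.
    conn.execute("INSERT OR REPLACE INTO tracks (path, mtime) VALUES (?, ?)",
                 (path, mtime))
    conn.commit()

def on_deleted(path):
    conn.execute("DELETE FROM tracks WHERE path = ?", (path,))
    conn.commit()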
If you use Linux, you can use PyInotify.
inotify can notify you about filesystem events when your program is running.
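A minimal pyinotify sketch along those lines (the music path is a placeholder; the handler just prints, which is where you would update your database):

import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        print("Created:", event.pathname)
    def process_IN_DELETE(self, event):
        print("Deleted:", event.pathname)
    def process_IN_CLOSE_WRITE(self, event):
        print("Updated:", event.pathname)

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CREATE | pyinotify.IN_DELETE | pyinotify.IN_CLOSE_WRITE
notifier = pyinotify.Notifier(wm, Handler())
# rec=True watches subfolders recursively; auto_add watches newly created ones too.
wm.add_watch('/path/to/music', mask, rec=True, auto_add=True)
notifier.loop()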
IMO, the best media player that has these features is Winamp. It rescans the music folders every X minutes, which is enough for music (but of course a little less efficient than letting the operating system watch for changes).
But since you were asking for suggestions on writing your own, you could make use of pyinotify (Linux only). If you're running Windows, you can use the ReadDirectoryChangesW API call.