Caching remote files with python

Caching remote files with python - python

Background
We have a lot of data files stored on a network drive which we process in python. For performance reasons I typically copy the files to my local SSD when processing. My wish is to make this happen automatically, so whenever I try to open a file it will grab the remote version if it isn't stored locally, and ideally also keep some sort of timer to delete the files after some time. The files will practically never be changed so I do not require actual syncing capabilities.
Functionality
To sum up what I am looking for is functionality for:
Keeping a local cache of files/directories from a network drive, automatically retrieving the remote version when unavailable locally
Support for directory structure - that is, the files are stored in a directory structure on the remote server which should be duplicated locally for the requested files
Ideally keep some sort of timer to expire cached files
It wouldn't be to difficult for me to write a piece of code which does this my self, but when possible, I prefer to rely on existing projects as this typically give a more versatile end result and also make any of my own improvements easily available to other users.
Question
I have searched a bit around for terms like python local file cache, file synchronization and the like, but what I have found mostly handles caching of function return values. I was a bit surprised because I would imagine this is a quite general problem, my question is therefore: is there something I have overlooked, and more importantly, are there any technical terms describing this functionality which could help my research.
Thank you in advance,
Gregers Poulsen
-- Update --
Due to other proprietary software packages I am forced to use Windows so the solution naturally must support this.

Have a look at fsspec remote caching, with a tutorial on the anaconda blog and the official documentation. Quoting the former:
In this article, we will present [fsspec]s new ability to cache remote content, keeping a local copy for faster lookup after the initial read.
They give an example for how to use it:
import fsspec
of = fsspec.open("filecache://anaconda-public-datasets/iris/iris.csv", mode='rt',
cache_storage='/tmp/cache1',
target_protocol='s3', target_options={'anon': True})
with of as f:
print(f.readline())
On first call, the file will be downloaded, stored into cache, and provided. On the second call, it will be downloaded from the local filesystem.
I haven't used it yet, but I need it and it looks promising.

Related

How to make Python ssl module use data in memory rather than pass file paths?

The full explanation of what I want to do and why would take a while to explain. Basically, I want to use a private SSL connection in a publicly distributed application, and not handout my private ssl keys, because that negates the purpose! I.e. I want secure remote database operations which no one can see into - inclusive of the client.
My core question is : How could I make the Python ssl module use data held in memory containing the ssl pem file contents instead of hard file system paths to them?
The constructor for class SSLSocket calls load_verify_locations(ca_certs) and load_cert_chain(certfile, keyfile) which I can't trace into because they are .pyd files. In those black boxes, I presume those files are read into memory. How might I short circuit the process and pass the data directly? (perhaps swapping out the .pyd?...)
Other thoughts I had were: I could use io.StringIO to create a virtual file, and then pass the file descriptor around. I've used that concept with classes that will take a descriptor rather than a path. Unfortunately, these classes aren't designed that way.
Or, maybe use a virtual file system / ram drive? That could be trouble though because I need this to be cross platform. Plus, that would probably negate what I'm trying to do if someone could access those paths from any external program...
I suppose I could keep them as real files, but "hide" them somewhere in the file system.
I can't be the first person to have this issue.
UPDATE
I found the source for the "black boxes"...
https://github.com/python/cpython/blob/master/Modules/_ssl.c
They work as expected. They just read the file contents from the paths, but you have to dig down into the C layer to get to this.
I can write in C, but I've never tried to recompile an underlying Python source. It looks like maybe I should follow the directions here https://devguide.python.org/ to pull the Python repo, and make changes to. I guess I can then submit my update to the Python community to see if they want to make a new standardized feature like I'm describing... Lots of work ahead it seems...

It took some effort, but I did, in fact, solve this in the manner I suggested. I revised the underlying code in the _ssl.c Python module / extension and rebuilt Python as a whole. After figuring out the process for building Python from source, I had to learn the details for how to pass variables between Python and C, and I needed to dig into guts of OpenSSL (over which the Python module is a wrapper).
Fortunately, OpenSSL already has functions for this exact purpose, so it was just a matter of swapping out the how Python is trying to pass file paths into the C, and instead bypassing the file reading process and jumping straight to the implementation of using the ca/cert/key data directly instead.
For the moment, I only did this for Windows. Since I'm ultimately creating a cross platform program, I'll have to repeat the build process for the other platforms I'll support - so that's a hassle. Consider how badly you want this, if you are going to pursue it yourself...
Note that when I rebuilt Python, I didn't use that as my actual Python installation. I just kept it off to the side.
One thing that was really nice about this process was that after that rebuild, all I needed to do was drop the single new _ssl.pyd into my working directory. With that file in place, I could pass my direct cert data. If I removed it, I could pass the normal file paths instead. It will use either the normal Python source, or implicitly use the override if the .pyd file is simply put in the program's directory.

Attribute system similar to HTTP Headers for local files

I am in the process of writing a program and need some guidance. Essentially, I am trying to determine if a file has some marker or flag attached to it. Sort of like the attributes for a HTTP Header.
If such a marker exists, that file will be manipulated in some way (moved to another directory).
My question is:
Where exactly should I be storing this flag/marker? Do files have a system similar to HTTP Headers? I don't want to access or manipulate the contents of the file, just some kind of property of the file that can be edited without corrupting the actual file--and it must be rather universal among file types as my potential domain of file types is unbound. I have some experience with Web APIs so I am familiar with HTTP Headers and json. Does any similar system exist for local files in windows? I am especially interested in anyone who has professional/industry knowledge of common techniques that programmers use when trying to store 'meta data' in files in order to access them later. Or if anyone knows of where to point me, as I am unsure to what I should be researching.
For the record, I am going to write a program for Windows probably using Golang or Python. And the files I am going to manipulate will be potentially all common ones (.docx, .txt, .pdf, etc.)

Metadata you wish to add is best kept in a separate file or database for all files.
Or in another file with same name and different extension or prefix, that you can make hidden.
Relying on a file system is very tricky and your data will be bound by the restrictions and capabilities of the file system your file is stored on.
And, you cannot count on your data remaining intact as any application may wish to change these flags.
And some of those have very specific, clearly defined use, such as creation time, modification time, access time...
See, if you need only flagging the document, you may wish to use creation time, which will stay unchanged through out the live of this document (until is copied) to store your flags. :D
Very dirty business, unprofessional, unreliable and all that.
But it's a solution. Poor one, but exists.
I do not know that FAT32 or NTFS file systems support any extra bits for flagging except those already used by the OS.
Unixes EXT family FS's do support some extra bits. And even than you should be careful in case some other important application makes use of them for something.
Mac OS may support some metadata by itself, but I am not 100% sure.
On Windows, you have one more option to associate more data with a file, but I wouldn't use that as well.
Well, NTFS file system (FAT doesn't support that) has a feature called streams.
In essential, same file can have multiple data streams under itself. I.e. You have more than one file contents under same file node.
To be more clear. Same file contains two different files.
When you open the file normally only main stream is visible to the application. Applications must check whether the other streams are present and choose the one they want to follow.
So, you may choose to store metadata under the second stream of the file.
But, what if all streams are taken?
Even more, anti-virus programs may prevent you access to the metadata out of paranoya, or at least ask for a permission.
I don't know why MS included that option, probably for file duplication or something, but bad hackers made use of the fact that you can store some data, under existing regular file, that nobody is aware of.
Imagine a virus writing it's copy into another stream of one of programs already there.
All that is needed for it to start, instead of your old program next time you run it is a batch script added to task scheduler that flips two streams making the virus data the main one.
Nasty trick! So when this feature started to be abused, anti-virus software started restricting files with multiple streams, so it's like this feature doesn't exist.
If you want to add some metadata using OS's technology, use Windows registry,
but even that is unwise.
What to tell you?
Don't add metadata to files, organize a separate file, or index your data in special files with same name as the file you are refering to and in same folder.

If you are dealing with binary files like docx and pdf, you're best off storing the metadata in seperate files or in a sqlite file.
Metadata is usually stored seperate from files, in data structures called inodes (at least in Unix systems, Windows probably has something similar). But you probably don't want to get that deep into the rabbit hole.
If your goal is to query the system based on metadata, then it would be easier and more efficient to use something SQLite. Having the meta data in the file would mean that you would need to open the file, read it into memory from disk, and then check the meta data - i.e slower queries.
If you don't need to query based on metadata, then storing metadata in the file might make sense. It would reduce the dependencies in your application, but in order to access the contents of the file through Word or Adobe Reader, you'd need to strip the metadata before handing it off to the application. Not worth the hassle, usually

File changed event network share

What I want to do:
I got two directories. Each one contains about 90.000 xml and bak files.
I need the xml files to sync at both folders when a file changes (of course the newer one should be copied).
The problem is:
Because of the huge amount of files and the fact that one of the directories is a network share I can't just loop though the directory and compare os.path.getmtime(file) values.
Even watchdog and PyQt don't work (tried the solutions from here and here).
The question:
Is there any other way to get a file changed event (on windows systems) which works for those configuration without looping though all those files?

So I finally found the solution:
I changed some of my network share settings and used the FileSystemWatcher
To prevent files getting synced on syncing i use a md5 filehash.
The code i use can be found at pastebin (It's a quick and dirty code and just the parts of it mentioned in the question here).

The code in...
https://stackoverflow.com/a/12345282/976427
Seems to work for me when passed a network share.

I'm risking giving an answer which is way off here (you didn't specify requirement regarding speed, etc) but... Dropbox would do exactly that for you for free, and would required writing no code at all.
Of course it might not suit your needs if you required real-time syncing, or if you want to avoid "sharing" your files with a third party (although you can encrypt them first).

Can you use the second option on this page?
http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html

From mentioning watchdog, I assume you are running under Linux. For the local machine inotify can help, but for the network share you are out of luck.
Mercurial's inotify extension http://hgbook.red-bean.com/read/adding-functionality-with-extensions.html has the same limitation.
In a similar situation (10K+ files) I have used a cloned mercurial repository with inotify on both the server and the local machine. They automatically commit and notified each other of changes. It had a slight delay (no problem in my case) but as a benifit had a full history
of changes and easily resynced after one of the systems had been down.

Limiting file explorer mini-reads

I'm implementing a FUSE driver for Google Drive. The aim is to allow a user to mount her Google Drive/Docs account as a virtual filesystem. Full source at https://github.com/jforberg/drivefs. I use the fusepy bindings to integrate FUSE with Python, and Google's Document List API to access Drive.
My driver is complete to the degree that readdir(2), stat(2) and read(2) work as expected. In the filesystem, each file read translates to a HTTPS request which has a large overhead. I've managed to limit the overhead by forcing a larger buffer size for reads.
Now to my problem. File explorers like Thunar and Nautilus build thumbs and determine file types by reading the first part of each file (the first 4k bytes or so). But in my filessystem, reading from many files at once is a painful procedure, and getting a file listing in thunar takes a very long time compared with a simple ls (which only stat(2)s each file).
I need some way to tell file explorers that my filessystem does not play well with "mini-reads", or some way to identify these mini-reads and feed them made-up data to make them happy. Any help would be appreciated!
EDIT: The problem was not with HTTPS overhead, but with my handling of Google's native "doc" format. I added a line to make read(2) return an empty string when someone tries to read a native doc, and the file listing is now almost instantaneous.
This seems a mild limitation, as not even Google's official client program is able to edit native docs.

Here is pycloudfuse which is a similar attempt but for cloud files / openstack object storage which you might find useful bits in.
When writing this I can't say I noticed any problems with Thunar and Nautilus with the directory listings.
I don't think you can feed the file managers made up data - that is bound to lead to problems.
I like the option is to signal to the file explorer not to do thumbnails etc, but I don't think that is possible either.
I think the best option is to remind your users that drivefs is not a real filesystem, and to give a list of its limitations, and if it is anything like pycloudfuse there will be lots!

easiest way to program a virtual file system in windows with Python

I want to program a virtual file system in Windows with Python.
That is, a program in Python whose interface is actually an "explorer windows". You can create & manipulate file-like objects but instead of being created in the hard disk as regular files they are managed by my program and, say, stored remotely, or encrypted or compressed or versioned, or whatever I can do with Python.
What is the easiest way to do that?

While perhaps not quite ripe yet (unfortunately I have no first-hand experience with it), pywinfuse looks exactly like what you're looking for.

Does it need to be Windows-native? There is at least one protocol which can be both browsed by Windows Explorer, and served by free Python libraries: FTP. Stick your program behind pyftpdlib and you're done.

Have a look at Dokan a User mode filesystem for Windows. There are Ruby, .NET (and Java by 3rd party) bindings available, and I don't think it'll be difficult to write python bindings either.

If you are trying to write a virtual file system (I may misunderstand you) - I would look at a container file format. VHD is well documented along with HDI and (embedded) OSQ. There are basically two things you need to do. One is you need to decide on a file/container format. After that it is as simple as writing the API to manipulate that container. If you would like it to be manipulated over the internet, pick a transport protocol then just write a service (would would emulate a file system driver) that listens on a certain port and manipulates this container using your API

You might be interested in PyFilesystem;
A filesystem abstraction layer for Python
PyFilesystem is an abstraction layer for filesystems. In the same way that Python's file-like objects provide a common way of accessing files, PyFilesystem provides a common way of accessing entire filesystems. You can write platform-independent code to work with local files, that also works with any of the supported filesystems (zip, ftp, S3 etc.).
What the description on the homepage does not advertise is that you can then expose this abstraction again as a filesystem, among others SFTP, FTP (though currently disfunct, probably fixable) and dokan (dito) as well as fuse.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.