Limiting file explorer mini-reads

Limiting file explorer mini-reads - python

I'm implementing a FUSE driver for Google Drive. The aim is to allow a user to mount her Google Drive/Docs account as a virtual filesystem. Full source at https://github.com/jforberg/drivefs. I use the fusepy bindings to integrate FUSE with Python, and Google's Document List API to access Drive.
My driver is complete to the degree that readdir(2), stat(2) and read(2) work as expected. In the filesystem, each file read translates to a HTTPS request which has a large overhead. I've managed to limit the overhead by forcing a larger buffer size for reads.
Now to my problem. File explorers like Thunar and Nautilus build thumbs and determine file types by reading the first part of each file (the first 4k bytes or so). But in my filessystem, reading from many files at once is a painful procedure, and getting a file listing in thunar takes a very long time compared with a simple ls (which only stat(2)s each file).
I need some way to tell file explorers that my filessystem does not play well with "mini-reads", or some way to identify these mini-reads and feed them made-up data to make them happy. Any help would be appreciated!
EDIT: The problem was not with HTTPS overhead, but with my handling of Google's native "doc" format. I added a line to make read(2) return an empty string when someone tries to read a native doc, and the file listing is now almost instantaneous.
This seems a mild limitation, as not even Google's official client program is able to edit native docs.

Here is pycloudfuse which is a similar attempt but for cloud files / openstack object storage which you might find useful bits in.
When writing this I can't say I noticed any problems with Thunar and Nautilus with the directory listings.
I don't think you can feed the file managers made up data - that is bound to lead to problems.
I like the option is to signal to the file explorer not to do thumbnails etc, but I don't think that is possible either.
I think the best option is to remind your users that drivefs is not a real filesystem, and to give a list of its limitations, and if it is anything like pycloudfuse there will be lots!

Related

Attribute system similar to HTTP Headers for local files

I am in the process of writing a program and need some guidance. Essentially, I am trying to determine if a file has some marker or flag attached to it. Sort of like the attributes for a HTTP Header.
If such a marker exists, that file will be manipulated in some way (moved to another directory).
My question is:
Where exactly should I be storing this flag/marker? Do files have a system similar to HTTP Headers? I don't want to access or manipulate the contents of the file, just some kind of property of the file that can be edited without corrupting the actual file--and it must be rather universal among file types as my potential domain of file types is unbound. I have some experience with Web APIs so I am familiar with HTTP Headers and json. Does any similar system exist for local files in windows? I am especially interested in anyone who has professional/industry knowledge of common techniques that programmers use when trying to store 'meta data' in files in order to access them later. Or if anyone knows of where to point me, as I am unsure to what I should be researching.
For the record, I am going to write a program for Windows probably using Golang or Python. And the files I am going to manipulate will be potentially all common ones (.docx, .txt, .pdf, etc.)

Metadata you wish to add is best kept in a separate file or database for all files.
Or in another file with same name and different extension or prefix, that you can make hidden.
Relying on a file system is very tricky and your data will be bound by the restrictions and capabilities of the file system your file is stored on.
And, you cannot count on your data remaining intact as any application may wish to change these flags.
And some of those have very specific, clearly defined use, such as creation time, modification time, access time...
See, if you need only flagging the document, you may wish to use creation time, which will stay unchanged through out the live of this document (until is copied) to store your flags. :D
Very dirty business, unprofessional, unreliable and all that.
But it's a solution. Poor one, but exists.
I do not know that FAT32 or NTFS file systems support any extra bits for flagging except those already used by the OS.
Unixes EXT family FS's do support some extra bits. And even than you should be careful in case some other important application makes use of them for something.
Mac OS may support some metadata by itself, but I am not 100% sure.
On Windows, you have one more option to associate more data with a file, but I wouldn't use that as well.
Well, NTFS file system (FAT doesn't support that) has a feature called streams.
In essential, same file can have multiple data streams under itself. I.e. You have more than one file contents under same file node.
To be more clear. Same file contains two different files.
When you open the file normally only main stream is visible to the application. Applications must check whether the other streams are present and choose the one they want to follow.
So, you may choose to store metadata under the second stream of the file.
But, what if all streams are taken?
Even more, anti-virus programs may prevent you access to the metadata out of paranoya, or at least ask for a permission.
I don't know why MS included that option, probably for file duplication or something, but bad hackers made use of the fact that you can store some data, under existing regular file, that nobody is aware of.
Imagine a virus writing it's copy into another stream of one of programs already there.
All that is needed for it to start, instead of your old program next time you run it is a batch script added to task scheduler that flips two streams making the virus data the main one.
Nasty trick! So when this feature started to be abused, anti-virus software started restricting files with multiple streams, so it's like this feature doesn't exist.
If you want to add some metadata using OS's technology, use Windows registry,
but even that is unwise.
What to tell you?
Don't add metadata to files, organize a separate file, or index your data in special files with same name as the file you are refering to and in same folder.

If you are dealing with binary files like docx and pdf, you're best off storing the metadata in seperate files or in a sqlite file.
Metadata is usually stored seperate from files, in data structures called inodes (at least in Unix systems, Windows probably has something similar). But you probably don't want to get that deep into the rabbit hole.
If your goal is to query the system based on metadata, then it would be easier and more efficient to use something SQLite. Having the meta data in the file would mean that you would need to open the file, read it into memory from disk, and then check the meta data - i.e slower queries.
If you don't need to query based on metadata, then storing metadata in the file might make sense. It would reduce the dependencies in your application, but in order to access the contents of the file through Word or Adobe Reader, you'd need to strip the metadata before handing it off to the application. Not worth the hassle, usually

Caching remote files with python

Background
We have a lot of data files stored on a network drive which we process in python. For performance reasons I typically copy the files to my local SSD when processing. My wish is to make this happen automatically, so whenever I try to open a file it will grab the remote version if it isn't stored locally, and ideally also keep some sort of timer to delete the files after some time. The files will practically never be changed so I do not require actual syncing capabilities.
Functionality
To sum up what I am looking for is functionality for:
Keeping a local cache of files/directories from a network drive, automatically retrieving the remote version when unavailable locally
Support for directory structure - that is, the files are stored in a directory structure on the remote server which should be duplicated locally for the requested files
Ideally keep some sort of timer to expire cached files
It wouldn't be to difficult for me to write a piece of code which does this my self, but when possible, I prefer to rely on existing projects as this typically give a more versatile end result and also make any of my own improvements easily available to other users.
Question
I have searched a bit around for terms like python local file cache, file synchronization and the like, but what I have found mostly handles caching of function return values. I was a bit surprised because I would imagine this is a quite general problem, my question is therefore: is there something I have overlooked, and more importantly, are there any technical terms describing this functionality which could help my research.
Thank you in advance,
Gregers Poulsen
-- Update --
Due to other proprietary software packages I am forced to use Windows so the solution naturally must support this.

Have a look at fsspec remote caching, with a tutorial on the anaconda blog and the official documentation. Quoting the former:
In this article, we will present [fsspec]s new ability to cache remote content, keeping a local copy for faster lookup after the initial read.
They give an example for how to use it:
import fsspec
of = fsspec.open("filecache://anaconda-public-datasets/iris/iris.csv", mode='rt',
cache_storage='/tmp/cache1',
target_protocol='s3', target_options={'anon': True})
with of as f:
print(f.readline())
On first call, the file will be downloaded, stored into cache, and provided. On the second call, it will be downloaded from the local filesystem.
I haven't used it yet, but I need it and it looks promising.

How to modify a large file remotely

I have a large XML file, ~30 MB.
Every now and then I need to update some of the values. I am using element tree module to modify the XML. I am currently fetching the entire file, updating it and then placing it again. SO there is ~60 MB of data transfer every time. Is there a way I update the file remotely?
I am using the following code to update the file.
import xml.etree.ElementTree as ET
tree = ET.parse("feed.xml")
root = tree.getroot()
skus = ["RUSSE20924","PSJAI22443"]
qtys = [2,3]
for child in root:
sku = child.find("Product_Code").text.encode("utf-8")
if sku in skus:
print "found"
i = skus.index(sku)
child.find("Quantity").text = str(qtys[i])
child.set('updated', 'yes')
tree.write("feed.xml")

Modifying a file directly via FTP without uploading the entire thing is not possible except when appending to a file.
The reason is that there are only three commands in FTP that actually modify a file (Source):
APPE: Appends to a file
STOR: Uploads a file
STOU: Creates a new file on the server with a unique name
What you could do
Track changes
Cache the remote file locally and track changes to the file using the MDTM command.
Pros:
Will half the required data transfer in many cases.
Hardly requires any change to existing code.
Almost zero overhead.
Cons:
Other clients will have to download the entire thing every time something changes(no change from current situation)
Split up into several files
Split up your XML into several files. (One per product code?)
This way you only have to download the data that you actually need.
Pros:
Less data to transfer
Allows all scripts that access the data to only download what they need
Combinable with suggestion #1
Cons:
All existing code has to be adapted
Additional overhead when downloading or updating all the data
Switch to a delta-sync protocol
If the storage server supports it switching to a delta synchronization protocol like rsync would help a lot because these only transmit the changes (with little overhead).
Pros:
Less data transfer
Requires little change to existing code
Cons:
Might not be available
Do it remotely
You already pointed out that you can't but it still would be the best solution.
What won't help
Switch to a network filesystem
As somebody in the comments already pointed out switching to a network file system (like NFS or CIFS/SMB) would not really help because you cannot actually change parts of the file unless the new data has the exact same length.
What to do
Unless you can do delta synchronization I'd suggest to implement some caching on the client side first and if it doesn't help enough to then split up your files.

represent a file as an object with python

Questions: How do you represent a file in python as an object, entirely in memory, making no calls to the hard drive?
Info:
I'm working on a project that has files distributed across many computers at once, these files are all stored in a SQLite3 database with identifiers so that the files can be keep in sync and I only have to have 1 file to deal with on the computers rather than many.
My problem is that the "open" command requires a path on the hard drive. well that path doesn't exist. I still need to be able to interact with these file objects though. What I'm looking for is a way to interact with these files as if they were on the hard drive, but they are only in memory, probably as a byte string. so kind like if i were to go:
file = open(<location in memory>,'r')
I've tried searching for this, but all search results just point to streaming files from the hard drive. so just to make this clear, I am no streaming from the hard drive, and that is not an option. If I have to do that, I'll rework my system for that. but right now that is an extra pointless step.

Take a look at the StringIO (and cStringIO) module: http://docs.python.org/library/stringio.html

easiest way to program a virtual file system in windows with Python

I want to program a virtual file system in Windows with Python.
That is, a program in Python whose interface is actually an "explorer windows". You can create & manipulate file-like objects but instead of being created in the hard disk as regular files they are managed by my program and, say, stored remotely, or encrypted or compressed or versioned, or whatever I can do with Python.
What is the easiest way to do that?

While perhaps not quite ripe yet (unfortunately I have no first-hand experience with it), pywinfuse looks exactly like what you're looking for.

Does it need to be Windows-native? There is at least one protocol which can be both browsed by Windows Explorer, and served by free Python libraries: FTP. Stick your program behind pyftpdlib and you're done.

Have a look at Dokan a User mode filesystem for Windows. There are Ruby, .NET (and Java by 3rd party) bindings available, and I don't think it'll be difficult to write python bindings either.

If you are trying to write a virtual file system (I may misunderstand you) - I would look at a container file format. VHD is well documented along with HDI and (embedded) OSQ. There are basically two things you need to do. One is you need to decide on a file/container format. After that it is as simple as writing the API to manipulate that container. If you would like it to be manipulated over the internet, pick a transport protocol then just write a service (would would emulate a file system driver) that listens on a certain port and manipulates this container using your API

You might be interested in PyFilesystem;
A filesystem abstraction layer for Python
PyFilesystem is an abstraction layer for filesystems. In the same way that Python's file-like objects provide a common way of accessing files, PyFilesystem provides a common way of accessing entire filesystems. You can write platform-independent code to work with local files, that also works with any of the supported filesystems (zip, ftp, S3 etc.).
What the description on the homepage does not advertise is that you can then expose this abstraction again as a filesystem, among others SFTP, FTP (though currently disfunct, probably fixable) and dokan (dito) as well as fuse.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.