The question is about API design. The scenario: I have some bytes of known length in memory and I want to design a Pythonic API that flushes them to AWS S3 in as few RPCs as possible. By Pythonic I mean: if there is already an accepted API for doing this, I want to copy it.
One way is to do something like io.create(filename, content), which can be translated into a single HTTP request against the S3 XML API.
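For reference, boto3's S3 client already exposes that one-shot shape; a minimal sketch (the bucket name is made up):

import boto3

s3 = boto3.client("s3")
# One RPC: the whole body goes out in a single PUT Object request.
s3.put_object(Bucket="my-bucket", Key=filename, Body=content)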
The traditional open, write, and close pattern is a streaming API: open doesn't accept a length argument, so the resulting stream doesn't know it should buffer all writes into a single RPC.
Of course the API can look like:
with open(filename, buffering="UNLIMITED") as f:
    f.write(content)
But buffering doesn't actually accept an "UNLIMITED" constant.
So what can I do? File a PEP? Thanks.
NFS, at least in versions 2 and 3, isn't terribly good at locking or atomicity.
The best bet, again at least in NFS 2 and 3, is to create your file with a temporary name, and then use os.rename().
Perhaps some progress has been made on this in NFS 4; I don't know.
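A minimal sketch of that create-then-rename pattern (target_path and data are placeholders):

import os
import tempfile

# Write into a temporary file in the same directory, then rename into place.
fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(target_path) or ".")
with os.fdopen(fd, "wb") as f:
    f.write(data)
os.rename(tmp_path, target_path)  # atomic when both paths are on the same filesystem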
I am in the process of writing a program and need some guidance. Essentially, I am trying to determine whether a file has some marker or flag attached to it, sort of like the attributes in an HTTP header.
If such a marker exists, that file will be manipulated in some way (moved to another directory).
My question is:
Where exactly should I be storing this flag/marker? Do files have a system similar to HTTP headers? I don't want to access or manipulate the contents of the file, just some kind of property of the file that can be edited without corrupting the actual file, and it must be fairly universal among file types, as my potential domain of file types is unbounded. I have some experience with Web APIs, so I am familiar with HTTP headers and JSON. Does any similar system exist for local files in Windows? I am especially interested in hearing from anyone with professional/industry knowledge of common techniques programmers use to store metadata with files in order to access it later. Or point me in the right direction, as I am unsure what I should be researching.
For the record, I am going to write a program for Windows probably using Golang or Python. And the files I am going to manipulate will be potentially all common ones (.docx, .txt, .pdf, etc.)
Metadata you wish to add is best kept in a separate file, or in a database covering all your files.
Or in a companion file with the same name and a different extension or prefix, which you can make hidden.
Relying on the file system is very tricky: your data will be bound by the restrictions and capabilities of whatever file system the file happens to be stored on.
And you cannot count on your data remaining intact, as any application may change these flags.
Some of them also have a very specific, clearly defined use, such as creation time, modification time, and access time.
See, if you only need to flag the document, you could abuse the creation time, which stays unchanged throughout the life of the document (until it is copied), to store your flags. :D
Very dirty business, unprofessional, unreliable and all that.
But it's a solution. A poor one, but it exists.
I don't know of any extra bits in the FAT32 or NTFS file systems that could be used for flagging, beyond those already used by the OS.
Unix's EXT family of file systems does support some extra bits. And even then you should be careful, in case some other important application makes use of them for something.
Mac OS may support some metadata by itself, but I am not 100% sure.
On Windows, you have one more option for associating extra data with a file, but I wouldn't use it either.
The NTFS file system (FAT doesn't support this) has a feature called alternate data streams.
In essence, the same file can have multiple data streams under it, i.e. more than one set of file contents under the same file node.
To put it more plainly: the same file contains two different files.
When you open the file normally, only the main stream is visible to the application. Applications must check whether other streams are present and choose the one they want to read.
So you could choose to store your metadata in a second stream of the file.
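A rough sketch, assuming Windows with an NTFS volume; the path and the stream name "usermeta" are made up, and the regular open() call accepts the file:stream syntax:

path = r"C:\docs\report.docx"

# Write the flag into an alternate data stream of the existing file.
with open(path + ":usermeta", "w") as s:
    s.write('{"flagged": true}')

# Read it back later; the main file contents are untouched.
with open(path + ":usermeta") as s:
    print(s.read())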
But what if the extra streams are already taken by some other application?
What's more, anti-virus programs may block your access to the metadata out of paranoia, or at least ask for permission.
I don't know why MS included that option, probably for file duplication or something, but malicious hackers made use of the fact that you can store data, under an existing regular file, that nobody is aware of.
Imagine a virus writing its copy into another stream of one of the programs already there.
All that is needed for it to start instead of your old program the next time you run it is a batch script, added to the Task Scheduler, that flips the two streams, making the virus data the main one.
Nasty trick! So when this feature started to be abused, anti-virus software started restricting files with multiple streams, so in practice it's as if the feature didn't exist.
If you want to add some metadata using the OS's own technology, use the Windows registry, but even that is unwise.
What to tell you?
Don't add metadata to the files themselves; keep it in a separate file, or index your data in companion files with the same name as the file you are referring to, in the same folder.
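A small sketch of that companion-file idea, using JSON since you already know it (the .meta.json suffix and helper names are made up):

import json
import os

def meta_path(path):
    # e.g. C:\docs\report.docx -> C:\docs\report.docx.meta.json
    return path + ".meta.json"

def write_flags(path, **flags):
    with open(meta_path(path), "w") as f:
        json.dump(flags, f)

def read_flags(path):
    if not os.path.exists(meta_path(path)):
        return {}
    with open(meta_path(path)) as f:
        return json.load(f)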
If you are dealing with binary files like docx and pdf, you're best off storing the metadata in separate files or in an SQLite file.
Metadata is usually stored separately from file contents, in data structures called inodes (at least on Unix systems; Windows has something similar). But you probably don't want to go that deep into the rabbit hole.
If your goal is to query the system based on metadata, then it would be easier and more efficient to use something like SQLite. Having the metadata in the file would mean that you would need to open the file, read it into memory from disk, and then check the metadata, i.e. slower queries.
If you don't need to query based on metadata, then storing metadata in the file might make sense. It would reduce the dependencies in your application, but in order to access the contents of the file through Word or Adobe Reader, you'd need to strip the metadata before handing the file off to the application. Not worth the hassle, usually.
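A minimal sketch of the SQLite route (the table layout and paths are just examples):

import sqlite3

conn = sqlite3.connect("filemeta.db")
conn.execute("CREATE TABLE IF NOT EXISTS filemeta (path TEXT PRIMARY KEY, flag TEXT)")

# Tag a file.
conn.execute("INSERT OR REPLACE INTO filemeta VALUES (?, ?)",
             (r"C:\docs\report.docx", "move"))
conn.commit()

# Query: which files are flagged to be moved?
for (path,) in conn.execute("SELECT path FROM filemeta WHERE flag = ?", ("move",)):
    print(path)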
I'm implementing a FUSE driver for Google Drive. The aim is to allow a user to mount her Google Drive/Docs account as a virtual filesystem. Full source at https://github.com/jforberg/drivefs. I use the fusepy bindings to integrate FUSE with Python, and Google's Document List API to access Drive.
My driver is complete to the degree that readdir(2), stat(2) and read(2) work as expected. In the filesystem, each file read translates to an HTTPS request, which has a large overhead. I've managed to limit the overhead by forcing a larger buffer size for reads.
Now to my problem. File explorers like Thunar and Nautilus build thumbnails and determine file types by reading the first part of each file (the first 4k bytes or so). But in my filesystem, reading from many files at once is a painful procedure, and getting a file listing in Thunar takes a very long time compared with a simple ls (which only stat(2)s each file).
I need some way to tell file explorers that my filesystem does not play well with "mini-reads", or some way to identify these mini-reads and feed them made-up data to keep them happy. Any help would be appreciated!
EDIT: The problem was not with HTTPS overhead, but with my handling of Google's native "doc" format. I added a line to make read(2) return an empty string when someone tries to read a native doc, and the file listing is now almost instantaneous.
This seems a mild limitation, as not even Google's official client program is able to edit native docs.
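For anyone hitting the same thing with fusepy, the workaround is roughly this (is_native_doc and download_range are hypothetical helpers standing in for the real driver code):

from fuse import Operations

class DriveFS(Operations):
    def read(self, path, size, offset, fh):
        # Native Google docs have no raw bytes to serve, so pretend they
        # are empty instead of issuing an HTTPS request per mini-read.
        if self.is_native_doc(path):
            return ''
        return self.download_range(path, offset, size)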
Here is pycloudfuse, which is a similar attempt but for Cloud Files / OpenStack object storage; you might find useful bits in it.
When writing it, I can't say I noticed any problems with directory listings in Thunar or Nautilus.
I don't think you can feed the file managers made-up data - that is bound to lead to problems.
The option I'd like is to signal to the file explorer not to do thumbnails etc., but I don't think that is possible either.
I think the best option is to remind your users that drivefs is not a real filesystem, and to give a list of its limitations, and if it is anything like pycloudfuse there will be lots!
I'm working on a Google App Engine app that will have to deal with some largish (<100 MB) XML files uploaded from a form, which will exceed GAE's limits - either taking longer than 30 seconds to upload, or exceeding the 10 MB request size.
The current solution I'm envisioning is to upload the file to the blobstore, and then bring it into the application (1 MB at a time) for parsing. This could also very well exceed the 30-second limit for a request, so I'm wondering if there's a nice way to handle large XML documents in chunks, as I may end up having to do it via task queues in 30-second bursts.
I'm currently using BeautifulSoup for other parts of the project, having switched from minidom. Is there a way to handle data in chunks that would play nice with GAE?
The 30 second limit applies to the execution time of your code, and your code doesn't start executing until the entire user request has been received - so the amount of time the user takes to upload the file is irrelevant.
That said, using blobstore does sound like the best idea. You can use BlobReader, which emulates a file with blobstore access, to treat a blob like any other file, and read it using standard libraries (such as BeautifulSoup). If the XML file is sufficiently large, you risk running out of memory, however, so you might want to consider a SAX-based approach, instead, which doesn't require holding the whole file in memory.
As far as execution time limits go for processing the file, you almost certainly want to do this on the task queue, where the limits are 10 minutes, and you won't be forcing users to wait.
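A rough sketch of that flow; the /tasks/parse URL and parameter name are made up:

from google.appengine.api import taskqueue
from google.appengine.ext import blobstore

# In the upload handler: hand the blob key off to a background task.
taskqueue.add(url='/tasks/parse', params={'blob_key': str(blob_info.key())})

# In the handler mapped to /tasks/parse: open the blob like a file and parse
# it there, under the task queue's 10-minute deadline.
blob_reader = blobstore.BlobReader(blob_key)  # blob_key read from the task's params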
PullDom allows you to load only part of an XML document. Unfortunately, the official Python documentation is rather sparse. More information can be found here and here.
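For the record, usage looks roughly like this; 'record' is a placeholder element name:

from xml.dom import pulldom

events = pulldom.parse(xml_file)  # accepts a filename or a file-like object
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == 'record':
        events.expandNode(node)   # materialise just this element's subtree
        # ... process the fully-built 'record' node here ...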
It really sounds like App Engine is not the right platform for this project.
This was pretty easy using pulldom, thanks to the magic of Python making everything look the same. Just parse the blob reader returned by App Engine, like so:
from google.appengine.ext import blobstore
from xml.dom import pulldom
blob_reader = blobstore.BlobReader(blob_info.key())
events = pulldom.parse(blob_reader)
That's what I like best about Python: you try something and it usually works.
I could find a couple of examples of reading from an HTTP stream. But how do you write to an HTTP input stream using Python?
You could use the standard library module httplib: in the HTTPConnection.request method, the body argument (since Python 2.6) can be an open file object (better a "pretty real" file, since, as the docs say, "this file object should support fileno() and read() methods"; but it could be a named or unnamed pipe to which a separate process is writing). The advantage is dubious, however, since (again per the docs) "The header Content-Length is automatically set to the correct value" - which, since headers come before the body and the file's content length can't be known until the file is read, implies the whole file is going to be read into memory anyway.
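A small sketch of the file-object variant (host, path and filename are made up):

import httplib

conn = httplib.HTTPConnection('example.com')
with open('payload.xml', 'rb') as body:
    # httplib fills in Content-Length before sending the request.
    conn.request('POST', '/upload', body, {'Content-Type': 'application/xml'})
response = conn.getresponse()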
If you're desperate to "stream" dynamically generated content into an HTTP POST (rather than preparing it all beforehand and then posting), you need a server supporting HTTP's "chunked transfer encoding": this SO question's accepted answer mentions that the popular asynchronous networking Python package twisted does, and gives some useful pointers.
I want to program a virtual file system in Windows with Python.
That is, a program in Python whose interface is actually an "Explorer window". You can create and manipulate file-like objects, but instead of being created on the hard disk as regular files they are managed by my program and, say, stored remotely, or encrypted, or compressed, or versioned, or whatever else I can do with Python.
What is the easiest way to do that?
While perhaps not quite ripe yet (unfortunately I have no first-hand experience with it), pywinfuse looks exactly like what you're looking for.
Does it need to be Windows-native? There is at least one protocol which can be both browsed by Windows Explorer, and served by free Python libraries: FTP. Stick your program behind pyftpdlib and you're done.
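A minimal pyftpdlib server, assuming its current (1.x) module layout; the credentials and served directory are examples:

from pyftpdlib.authorizers import DummyAuthorizer
from pyftpdlib.handlers import FTPHandler
from pyftpdlib.servers import FTPServer

authorizer = DummyAuthorizer()
authorizer.add_user('user', 'password', r'C:\virtual_root', perm='elradfmw')

handler = FTPHandler
handler.authorizer = authorizer

# Point Windows Explorer at ftp://localhost:2121/ to browse it.
FTPServer(('127.0.0.1', 2121), handler).serve_forever()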
Have a look at Dokan, a user-mode filesystem for Windows. There are Ruby, .NET (and Java, by third parties) bindings available, and I don't think it would be difficult to write Python bindings either.
If you are trying to write a virtual file system (I may be misunderstanding you), I would look at a container file format. VHD is well documented, along with HDI and (embedded) OSQ. There are basically two things you need to do. First, decide on a file/container format. After that, it is as simple as writing the API to manipulate that container. If you would like it to be manipulated over the internet, pick a transport protocol, then just write a service (which would emulate a file system driver) that listens on a certain port and manipulates this container using your API.
You might be interested in PyFilesystem;
A filesystem abstraction layer for Python
PyFilesystem is an abstraction layer for filesystems. In the same way that Python's file-like objects provide a common way of accessing files, PyFilesystem provides a common way of accessing entire filesystems. You can write platform-independent code to work with local files, that also works with any of the supported filesystems (zip, ftp, S3 etc.).
What the description on the homepage does not advertise is that you can then expose this abstraction again as a filesystem, among others over SFTP, FTP (though currently defunct, probably fixable) and Dokan (ditto), as well as FUSE.
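A taste of the abstraction, assuming PyFilesystem's in-memory backend (file names are made up):

from fs.memoryfs import MemoryFS

mem = MemoryFS()                        # a "filesystem" that lives only in RAM
with mem.open('hello.txt', 'w') as f:   # same file API whatever the backend
    f.write('managed by my program, never written to disk')
print(mem.listdir('/'))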