With Python (I am using v3.4), I am creating a program that will run concurrently on up to 20 machines connected to the same network.
From time to time, those machines will need to access the network and perform some operations (create/delete/modify files)…
What I currently have planned is:
the network share will have a folder "db"
I will use the shelve module to keep my shared data (a simple dictionary of fewer than 3,000 entries containing some path names and some logging information) in this folder. Since there is not much data, shelve is a convenient module and should be fast enough
to avoid simultaneous operations on the network by multiple machines (and the resulting conflicts), each machine will have to lock the shelve file on the network before doing anything (it will need to update this file anyway) and will unlock it only at the end of its operation (updating the contents if necessary)
The problem is that shelve does not provide any convenient mechanism for concurrent access or locking. I understand that my two possibilities are:
either I don't use shelve and manage my simple database with another module (e.g. sqlite3, except that it is a bit more complex than the simple shelve module; a minimal sketch follows below)
or I create a locking mechanism that will work over the network for multiple machines (so far I haven't found a module that seemed totally reliable)
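To make option 1 concrete, here is roughly what the sqlite3 variant could look like (the path, table layout and update_entry helper are only placeholders; note also that SQLite's own documentation warns that its file locking can be unreliable on network shares, so this is a sketch of the API rather than an endorsement):

import sqlite3

# Placeholder location on the shared folder.
DB_PATH = r"\\Machine\Folder\db\shared.sqlite"

def update_entry(path_name, log_info):
    # sqlite3 serializes concurrent writers itself, but see the caveat
    # above about SQLite locking on network file systems.
    con = sqlite3.connect(DB_PATH, timeout=30)
    try:
        con.execute("CREATE TABLE IF NOT EXISTS entries (path TEXT PRIMARY KEY, log TEXT)")
        con.execute("INSERT OR REPLACE INTO entries (path, log) VALUES (?, ?)",
                    (path_name, log_info))
        con.commit()
    finally:
        con.close()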
Additional requirements (if possible) are:
It will mainly be used with Windows but I would like the solution to be cross-platform so that I can re-use it with Linux
the network file system will be anything accessible through a standard file explorer (on Linux/Windows) via an address like "\\Machine\Folder". It could be either a file server or just a shared folder on one of the machines.
Any recommendation?
Thanks
Multiplatform, locking: your best option is portalocker (https://pypi.python.org/pypi/portalocker/0.3). We've used it extensively -- and it's a winner!
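As a minimal sketch of how it could be combined with shelve, assuming a small sentinel lock file next to the shelve file on the share (the paths and the update_entry helper below are just placeholders):

import shelve
import portalocker

# Hypothetical locations on the shared folder -- adjust to your setup.
LOCK_PATH = r"\\Machine\Folder\db\db.lock"
SHELF_PATH = r"\\Machine\Folder\db\shared_db"

def update_entry(key, value):
    # Hold an exclusive lock on the sentinel file for the whole shelve
    # operation, so the other machines block until we are done.
    with open(LOCK_PATH, "a") as lock_file:
        portalocker.lock(lock_file, portalocker.LOCK_EX)
        try:
            with shelve.open(SHELF_PATH) as db:
                db[key] = value
        finally:
            portalocker.unlock(lock_file)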
The full explanation of what I want to do and why would take a while. Basically, I want to use a private SSL connection in a publicly distributed application without handing out my private SSL keys, because that would negate the purpose! In other words, I want secure remote database operations that no one can see into, including the client.
My core question is: how can I make the Python ssl module use data held in memory containing the SSL PEM file contents, instead of hard file system paths to them?
The constructor for class SSLSocket calls load_verify_locations(ca_certs) and load_cert_chain(certfile, keyfile) which I can't trace into because they are .pyd files. In those black boxes, I presume those files are read into memory. How might I short circuit the process and pass the data directly? (perhaps swapping out the .pyd?...)
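For reference, here is roughly what those two calls look like when made on an ssl.SSLContext directly; as far as I can tell, load_verify_locations does accept in-memory PEM data through its cadata parameter (Python 3.4+), but load_cert_chain only takes file paths for the certificate and key:

import ssl

def make_context(ca_pem):
    # ca_pem: PEM-encoded CA certificate(s) held in memory as a string.
    ctx = ssl.create_default_context()
    # The CA side can be fed from memory via cadata (Python 3.4+)...
    ctx.load_verify_locations(cadata=ca_pem)
    # ...but the client certificate and private key can only be given as
    # file paths, which is exactly the limitation described above:
    # ctx.load_cert_chain(certfile="client.pem", keyfile="client.key")
    return ctx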
Other thoughts I had were: I could use io.StringIO to create a virtual file and then pass the file object around. I've used that approach with classes that accept a file object rather than a path. Unfortunately, these classes aren't designed that way.
Or, maybe use a virtual file system / ram drive? That could be trouble though because I need this to be cross platform. Plus, that would probably negate what I'm trying to do if someone could access those paths from any external program...
I suppose I could keep them as real files, but "hide" them somewhere in the file system.
I can't be the first person to have this issue.
UPDATE
I found the source for the "black boxes"...
https://github.com/python/cpython/blob/master/Modules/_ssl.c
They work as expected. They just read the file contents from the paths, but you have to dig down into the C layer to get to this.
I can write in C, but I've never tried to recompile the underlying Python source. It looks like I should follow the directions at https://devguide.python.org/ to pull the Python repo and make my changes there. I could then submit my update to the Python community to see if they want to turn it into a standardized feature like the one I'm describing... Lots of work ahead, it seems...
It took some effort, but I did, in fact, solve this in the manner I suggested. I revised the underlying code in the _ssl.c Python module / extension and rebuilt Python as a whole. After figuring out the process for building Python from source, I had to learn the details of how to pass variables between Python and C, and I needed to dig into the guts of OpenSSL (over which the Python module is a wrapper).
Fortunately, OpenSSL already has functions for this exact purpose, so it was just a matter of swapping out how Python passes file paths into the C layer, bypassing the file-reading step and jumping straight to using the CA/cert/key data directly.
For the moment, I only did this for Windows. Since I'm ultimately creating a cross platform program, I'll have to repeat the build process for the other platforms I'll support - so that's a hassle. Consider how badly you want this, if you are going to pursue it yourself...
Note that when I rebuilt Python, I didn't use that as my actual Python installation. I just kept it off to the side.
One thing that was really nice about this process was that after the rebuild, all I needed to do was drop the single new _ssl.pyd into my working directory. With that file in place, I could pass my cert data directly; if I removed it, I could pass the normal file paths instead. In other words, the normal Python source is used unless the override .pyd is present in the program's directory, in which case it is picked up implicitly.
Background
We have a lot of data files stored on a network drive which we process in python. For performance reasons I typically copy the files to my local SSD when processing. My wish is to make this happen automatically, so whenever I try to open a file it will grab the remote version if it isn't stored locally, and ideally also keep some sort of timer to delete the files after some time. The files will practically never be changed so I do not require actual syncing capabilities.
Functionality
To sum up, what I am looking for is functionality for:
Keeping a local cache of files/directories from a network drive, automatically retrieving the remote version when unavailable locally
Support for directory structure - that is, the files are stored in a directory structure on the remote server which should be duplicated locally for the requested files
Ideally keep some sort of timer to expire cached files
It wouldn't be too difficult for me to write a piece of code which does this myself (a minimal sketch of what I mean follows), but when possible I prefer to rely on existing projects, as this typically gives a more versatile end result and also makes any of my own improvements easily available to other users.
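Roughly, assuming a fixed remote root, a local cache directory and an expiry constant (all placeholders here), the helper could look like this:

import os
import shutil
import time

# Placeholder locations and expiry -- adjust to the actual setup.
REMOTE_ROOT = r"\\server\share\data"
CACHE_ROOT = r"C:\cache\data"
MAX_AGE = 7 * 24 * 3600  # expire cached copies after a week

def cached_open(relative_path, mode="rb"):
    # Open a file from the local cache, copying it from the network
    # drive first if it is missing or older than MAX_AGE.
    local_path = os.path.join(CACHE_ROOT, relative_path)
    expired = (os.path.exists(local_path)
               and time.time() - os.path.getmtime(local_path) > MAX_AGE)
    if not os.path.exists(local_path) or expired:
        remote_path = os.path.join(REMOTE_ROOT, relative_path)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        shutil.copy2(remote_path, local_path)  # mirrors the directory structure locally
    return open(local_path, mode)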
Question
I have searched around a bit for terms like python local file cache, file synchronization and the like, but what I have found mostly handles caching of function return values. I was a bit surprised, because I would imagine this is quite a general problem. My question is therefore: is there something I have overlooked, and more importantly, are there any technical terms describing this functionality which could help my research?
Thank you in advance,
Gregers Poulsen
-- Update --
Due to other proprietary software packages I am forced to use Windows so the solution naturally must support this.
Have a look at fsspec remote caching, with a tutorial on the anaconda blog and the official documentation. Quoting the former:
In this article, we will present [fsspec]'s new ability to cache remote content, keeping a local copy for faster lookup after the initial read.
They give an example of how to use it:
import fsspec

of = fsspec.open("filecache://anaconda-public-datasets/iris/iris.csv", mode='rt',
                 cache_storage='/tmp/cache1',
                 target_protocol='s3', target_options={'anon': True})
with of as f:
    print(f.readline())
On the first call, the file will be downloaded, stored in the cache, and provided. On subsequent calls, it will be read from the local filesystem.
I haven't used it yet, but I need it and it looks promising.
I am developing a machine learning analysis program which has to process 27 GB of text files on Linux. My production system won't be rebooted very often, but I need to test the program on my home computer or in my development environment.
At home I have power failures very often, so I can hardly run it continuously for 3 weeks.
My program reads the files, applies some parsing, saves the filtered data in new files in a directory, then applies the algorithm to those files and saves the results in a MySQL DB.
I am not able to figure out how to save the algorithm state.
If everything regarding the algorithm state is saved in a class, you can serialize the class and save it to disk: http://docs.python.org/2/library/pickle.html
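A minimal sketch of that idea (the AnalysisState class and file names are hypothetical; the temp-file-then-rename step keeps the last good checkpoint intact if the power fails mid-write, Python 3.3+):

import os
import pickle

class AnalysisState(object):
    # Hypothetical container for everything needed to resume the run.
    def __init__(self):
        self.processed_files = []
        self.partial_results = {}

def save_checkpoint(state, path="checkpoint.pkl"):
    # Dump to a temporary file first and rename it afterwards, so a power
    # failure in the middle of the write cannot corrupt the last checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp, path)  # atomic on the same filesystem (Python 3.3+)

def load_checkpoint(path="checkpoint.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)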
Since the entire algorithm state can be saved in a class, you might want to use pickle (as mentioned above), but pickle comes with its own overheads and risks.
For better ways to do the same, you might want to check out this article, which explains why you should use the camel library instead of pickle.
We have a significant (~50 kloc) tree of packages/modules (approx. 2,200 files) that we ship around to our cluster with each job. The jobs run for ~12 hours, so the overhead of untarring/bootstrapping (i.e. resolving PYTHONPATH for each module) usually isn't a big deal. However, as the number of cores in our worker nodes has increased, we've increasingly hit the case where the scheduler lands 12 jobs simultaneously, which grinds the poor scratch drive to a halt servicing all the requests (worse, for reasons beyond our control, each job requires a separate loopback filesystem, so there are 2 layers of indirection on the drive).
Is there a way to hint the proper location of each file to the interpreter (without decorating the code with paths strewn throughout; maybe by overriding import?), or to bundle up all the associated .pyc files into some sort of binary blob that can be read just once?
Thanks!
We had problems like this on our cluster. (The Lustre filesystem was slow for metadata operations.) Our solution was to use the "zip import" facilities in Python.
In our case we made a single zip of the stdlib (placed at the name already present in sys.path, like "/usr/lib/python26.zip") and another zip of our project, with the latter added to the PYTHONPATH.
This is much faster because it's a single filesystem metadata read, followed by a quick read of the zip file's table of contents to figure out what's inside, which is then cached for later lookups.
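As a rough sketch, building such a zip and importing from it could look like this (build_zip, project.zip and mypackage are placeholders; in practice the zip is usually put on PYTHONPATH rather than inserted into sys.path in code):

import os
import sys
import zipfile

def build_zip(package_root, zip_path="project.zip"):
    # Bundle every .py/.pyc under package_root into a single zip,
    # preserving the package layout so zipimport can resolve modules.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _, filenames in os.walk(package_root):
            for name in filenames:
                if name.endswith((".py", ".pyc")):
                    full = os.path.join(dirpath, name)
                    zf.write(full, os.path.relpath(full, package_root))

# On the worker side, either set PYTHONPATH=/path/to/project.zip or:
sys.path.insert(0, "project.zip")
# import mypackage  # hypothetical package, now resolved from the zip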
I want to program a virtual file system in Windows with Python.
That is, a program in Python whose interface is actually an "Explorer window". You can create and manipulate file-like objects, but instead of being created on the hard disk as regular files, they are managed by my program and, say, stored remotely, encrypted, compressed, versioned, or whatever else I can do with Python.
What is the easiest way to do that?
While perhaps not quite ripe yet (unfortunately I have no first-hand experience with it), pywinfuse looks exactly like what you're looking for.
Does it need to be Windows-native? There is at least one protocol which can be both browsed by Windows Explorer, and served by free Python libraries: FTP. Stick your program behind pyftpdlib and you're done.
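A minimal sketch using the current pyftpdlib API (the user name, password and exposed directory are placeholders); Explorer can then browse ftp://localhost:2121:

from pyftpdlib.authorizers import DummyAuthorizer
from pyftpdlib.handlers import FTPHandler
from pyftpdlib.servers import FTPServer

# Expose a local directory over FTP with a single read/write user.
authorizer = DummyAuthorizer()
authorizer.add_user("user", "secret", r"C:\exposed", perm="elradfmw")

handler = FTPHandler
handler.authorizer = authorizer

FTPServer(("127.0.0.1", 2121), handler).serve_forever()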
Have a look at Dokan, a user-mode filesystem for Windows. There are Ruby and .NET bindings available (and Java bindings from a third party), and I don't think it'll be difficult to write Python bindings either.
If you are trying to write a virtual file system (I may be misunderstanding you), I would look at a container file format. VHD is well documented, along with HDI and (embedded) OSQ. There are basically two things you need to do. First, decide on a file/container format. After that, it is as simple as writing the API to manipulate that container. If you would like it to be manipulated over the internet, pick a transport protocol and write a service (which would emulate a file system driver) that listens on a certain port and manipulates the container using your API.
You might be interested in PyFilesystem;
A filesystem abstraction layer for Python
PyFilesystem is an abstraction layer for filesystems. In the same way that Python's file-like objects provide a common way of accessing files, PyFilesystem provides a common way of accessing entire filesystems. You can write platform-independent code to work with local files, that also works with any of the supported filesystems (zip, ftp, S3 etc.).
What the description on the homepage does not advertise is that you can then expose this abstraction again as a filesystem, among others via SFTP, FTP (though currently defunct, probably fixable), Dokan (ditto), and FUSE.
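For a taste of the abstraction, a small example against the current PyFilesystem2 API (the in-memory and local backends are just stand-ins for whatever you would actually expose):

import fs  # PyFilesystem2

# Two very different backends behind one interface: an in-memory
# filesystem and the local OS filesystem.
with fs.open_fs("mem://") as mem_fs, fs.open_fs("osfs://.") as local_fs:
    mem_fs.writetext("hello.txt", "stored in memory, not on disk")
    local_fs.writetext("hello.txt", mem_fs.readtext("hello.txt"))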