data cache for python package


I have a python module which generates large data files which I want to cache on disk for future use. The cache is likely to end up some hundreds of MB for a normal user, but save a lot of computation time.
The files aren't distributed with the module, but are generated the first time the code is run with a given set of parameters.
So far I've just been using a single file module myself and putting them in a hardcoded path relative to the module (data/). But I now need to distribute this module in a Python package with distutils and I was wondering if there is a standard way to do that.
I was thinking of something like the compiled cache of scipy.weave, but wondering if there is a more modern, supported way of doing it. On *nix platforms I would expect it to go in ~/.something, but I'm not sure what the Windows equivalent would be. This should also be configurable, so that users can point it somewhere else if that's more convenient, or share the cache dir between users. How should such a config file work? Where should it go?
Or should I just have it as an install option, either through a config file next to setup.py or set by manually editing setup.py, then hard code the directory in the module before installation?
Any pointers gratefully received...

You can use the standard library module ConfigParser (named configparser in Python 3) to parse an ini file (or .rc file, depending on your culture). To find the file, os.path.expanduser is a useful function that does the right thing on all platforms for paths like "~/.mytoolrc". To let the user override the location of things, you can use environment variables via os.environ.
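A minimal sketch of that combination, assuming Python 3. The environment variable MYTOOL_CONFIG, the [cache] section, its dir key and the default paths are all hypothetical names chosen for illustration, not part of any existing tool.

```python
import configparser
import os

# Hypothetical defaults, used only for this illustration.
DEFAULT_CONFIG_PATH = "~/.mytoolrc"
DEFAULT_CACHE_DIR = "~/.mytool/cache"


def load_cache_dir(environ=os.environ):
    """Return the cache directory, honouring an env override and an ini file."""
    # An environment variable (hypothetical name) may point at another config file.
    config_path = os.path.expanduser(
        environ.get("MYTOOL_CONFIG", DEFAULT_CONFIG_PATH)
    )

    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist, so a missing config is fine.
    parser.read(config_path)

    # Fall back to the default when the section or key is absent.
    cache_dir = parser.get("cache", "dir", fallback=DEFAULT_CACHE_DIR)
    return os.path.expanduser(cache_dir)
```

A user who wants to share a cache would then put a [cache] section with a dir entry into the ini file, or point MYTOOL_CONFIG at a shared file.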

There is an emerging standard in the free OS world: http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html
This module can help you on Windows and Mac OS X, but it seems to be broken with respect to the XDG Base Dir Spec: http://pypi.python.org/pypi/appdirs
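If you would rather not depend on a third-party module, a rough standard-library sketch of a per-user cache location could look like the following. The function name user_cache_dir is made up here; the Linux branch follows the XDG rule ($XDG_CACHE_HOME, defaulting to ~/.cache), while the Windows and Mac OS X branches use the conventional per-user locations and are assumptions about what you want rather than a formal standard.

```python
import os
import sys


def user_cache_dir(appname, environ=os.environ, platform=sys.platform):
    """Return a per-user cache directory for appname (the directory is not created)."""
    home = os.path.expanduser("~")
    if platform.startswith("win"):
        # Windows: per-user, non-roaming application data.
        base = environ.get("LOCALAPPDATA") or os.path.join(home, "AppData", "Local")
    elif platform == "darwin":
        # Mac OS X: conventional per-user cache folder.
        base = os.path.join(home, "Library", "Caches")
    else:
        # XDG Base Directory spec: $XDG_CACHE_HOME, or ~/.cache if unset or empty.
        base = environ.get("XDG_CACHE_HOME") or os.path.join(home, ".cache")
    return os.path.join(base, appname)
```

The environ and platform parameters exist only so the function can be exercised without touching the real environment.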

Related

Where to store large resource files with python/pypi packages?

I have the following problem: I'm developing a package in python that will need a static data file for its operation that is somewhat large (currently around 70 MB, may get larger over time).
This isn't excessively large, but it's likely beyond what pypi will accept, so just having the file as a resource file as part of the package is not really an option. It also doesn't compress very well.
So I'm expecting to do something of the following: I'll store the file somewhere where it can be downloaded via https and add a command to the tool that will download that extra data needed. (I.e. expect something like a commandline tool with a --fetch-operational-data parameter that one might call once after installation and may call for updates every now and then, though updates of that file are expected to be rare.)
However this leads to the question where to store that data, and I feel there's no really good option.
"Usually" package resource files can be managed with importlib_resources which can access files that are stored within module directories. But the functions like open_binary are all read only and while one could probably get the path and write there, this probably goes against the intention of how it is supposed to be used (e.g. a major selling point for the importlib functionality is that it can be used in zip'ed packages, and that would obviously break).
Alternatively one could have a dot directory (~/.mytool/). However this means there's no good way to install this globally.
On the other hand there could be a system-wide directory (/var/lib/mytool ?), but then a user couldn't use the package. One could try to autodetect if the data is in /var/lib and fallback to ~/.mytool and write to whatever is writable on the update command.
Furthermore currently the tool is executable from its git repo, which adds another complexity (would it download the file into an extra dir in the gitrepo if it's executed from there? or also use /var/lib/mytool / ~/.mytool ?)
Whatever I would do, it feels like an ugly hack. Any good ideas?
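For illustration, the "look in /var/lib first, fall back to ~/.mytool, write to whatever is writable" idea from the question could be sketched roughly like this. Both helper names are hypothetical, and the directory list is passed in by the caller rather than hardcoded.

```python
import os


def find_data_file(filename, search_dirs):
    """Return the first existing copy of filename in search_dirs, or None."""
    for directory in search_dirs:
        candidate = os.path.join(os.path.expanduser(directory), filename)
        if os.path.isfile(candidate):
            return candidate
    return None


def writable_data_dir(search_dirs):
    """Return the first directory that exists (or can be created) and is writable."""
    for directory in search_dirs:
        directory = os.path.expanduser(directory)
        try:
            os.makedirs(directory, exist_ok=True)
        except OSError:
            continue  # e.g. no permission to create a system-wide directory
        if os.access(directory, os.W_OK):
            return directory
    return None
```

The update command would call writable_data_dir(["/var/lib/mytool", "~/.mytool"]) to decide where to download, and normal runs would call find_data_file with the same list.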

Python library with config file

I would like to gather many libraries I have made while working on my projects in some kind of container, so that I can easily use any of them in future projects of mine. It is pretty clear to me how to do this, except one part.
I am assuming that every service will have its own config file (for instance, the Cache service, will have a config file with cache host and port, and so on). Now the problem is: when I want to use this container in an arbitrary project I will have to make assumptions about the project directory structure to know where to find these config files.
For instance, one might assume that on the same path of my library there is a config folder where I will find the config files of my services. However, this might conflict with the project's directory structure (i.e. the project might already have its own config directory for instance).
So, all in all, my question is: is there a safe, standard way to ship a library which might assume to find some config files someplace, or for which example config files are shipped along with the library itself?

Well, you should not keep config files, or anything else that you want to modify, alongside your code in Python (or in any language, actually). Each OS has folders for that purpose.
If it's system-wide, on Unix it's /etc; if it's for a single user, it's ~/.config. You have the Library folders on OS X, and I'm sure there's something similar for Windows beyond \Windows\SYSTEM32 😉.
What that means is that the path to your configuration files shall not be considered relative to your code at any point. Never. Ever.
You can include some assets in a python package, using the MANIFEST.in but, as it'll be within your python package, you shall assume you won't have rights to write where it'll be (installed by admin, ran by user).
You can also have some of those assets installed at specific places using setup.py's data_files directive; they will be installed relative to sys.prefix.
Common practice is to provide example configuration files via a link from the documentation or, better, to generate those files when the application starts.
Another trend on desktops is to use the XDG directory specification to decide where to look for, or where to place, your configuration files.
To sum it up:
make a list of default paths your code expects to find the configuration,
make it possible to specify the path to the configuration manually on the command line: python foo.py --config bar.ini
write a feature for your tool to generate the configuration (with a series of questions)
deploy your default configurations to standard places (XDG paths, $prefix/etc…)
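The first two points of that summary could be sketched as follows. The application name foo, the list of default paths and the function name are placeholders for illustration only.

```python
import argparse
import os

# Hypothetical search order, most specific location first.
DEFAULT_CONFIG_PATHS = [
    "~/.config/foo/foo.ini",
    "/etc/foo/foo.ini",
]


def resolve_config_path(argv=None, candidates=DEFAULT_CONFIG_PATHS):
    """Return the path given by --config, else the first existing default, else None."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", help="explicit path to the configuration file")
    args = parser.parse_args(argv)

    # An explicit command-line choice always wins.
    if args.config:
        return args.config

    for path in candidates:
        path = os.path.expanduser(path)
        if os.path.isfile(path):
            return path
    return None
```

Returning None when nothing is found leaves room for the third point, generating a configuration interactively.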

Cross-platform method to obtain the user's configuration home directory in Python?

My program needs to store some configuration files. The major operating systems seem to have a designated location for those; for instance, on Freedesktop.org compliant systems, it is the path stored in the $XDG_CONFIG_HOME environment variable.
Is there a method (or a library) to obtain this configuration home directory across the major operating systems: Windows, OS X, Linux?

You can use the package appdirs. This is developed by ActiveState who must have quite a lot of experience on cross-platform python.
import appdirs
appdirs.user_config_dir(appname='MyApp')
The package is just a single file (/module), so if you need it for a small script it is simple to just copy what you need. Otherwise the package is available with both pip and conda.

import os
print(os.path.expanduser("~/.my_config/data"))
will always give you a location you can write to ... config files are best stored where you want them
Oftentimes on Windows this is
os.path.expandvars("%appdata%/.my_config")
Most other systems tend to put them in the user profile directory (~),
unless you are looking for a specific configuration file location (perhaps to interact with other software, in which case we would need to know which software).

Some way to create a cross-platform, self-contained, cloud-synchronized python library of modules for personal use? [duplicate]

I need to ship a collection of Python programs that use multiple packages stored in a local Library directory: the goal is to avoid having users install packages before using my programs (the packages are shipped in the Library directory). What is the best way of importing the packages contained in Library?
I tried three methods, but none of them appears perfect: is there a simpler and robust method? or is one of these methods the best one can do?
In the first method, the Library folder is simply added to the library path:
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'Library'))
import package_from_Library
The Library folder is put at the beginning so that the packages shipped with my programs have priority over the same modules installed by the user (this way I am sure that they have the correct version to work with my programs). This method also works when the Library folder is not in the current directory, which is good. However, this approach has drawbacks. Each and every one of my programs adds a copy of the same path to sys.path, which is a waste. In addition, all programs must contain the same three path-modifying lines, which goes against the Don't Repeat Yourself principle.
An improvement over the above problems consists in trying to add the Library path only once, by doing it in an imported module:
# In module add_Library_path:
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'Library'))
and then to use, in each of my programs:
import add_Library_path
import package_from_Library
This way, thanks to the caching mechanism of CPython, the module add_Library_path is only run once, and the Library path is added only once to sys.path. However, a drawback of this approach is that import add_Library_path has an invisible side effect, and that the order of the imports matters: this makes the code less legible and more fragile. Also, this forces my distribution of programs to include an add_Library_path.py program that users will not use.
Python modules from Library can also be imported by making it a package (empty __init__.py file stored inside), which allows one to do:
from Library import module_from_Library
However, this breaks for packages in Library, as they might do something like from xlutils.filter import …, which breaks because xlutils is not found in sys.path. So, this method works, but only when including modules in Library, not packages.
All these methods have some drawback.
Is there a better way of shipping programs with a collection of packages (that they use) stored in a local Library directory? or is one of the methods above (method 1?) the best one can do?
PS: In my case, all the packages from Library are pure Python packages, but a more general solution that works for any operating system is best.
PPS: The goal is that the user be able to use my programs without having to install anything (beyond copying the directory I ship them regularly), like in the examples above.
PPPS: More precisely, the goal is to have the flexibility of easily updating both my collection of programs and their associated third-party packages from Library by having my users do a simple copy of a directory containing my programs and the Library folder of "hidden" third-party packages. (I do frequent updates, so I prefer not forcing the users to update their Python distribution too.)

Messing around with sys.path leads to pain... The modern package template and Distribute contain a vast array of information and were in part set up to solve your problem.
What I would do is set up setup.py to install all your packages to a specific site-packages location or, if you can, to the system's site-packages. In the former case, the local site-packages would then be added to the PYTHONPATH of the system/user. In the latter case, nothing needs to change.
You could use a batch file to set the Python path as well. Or change the python executable to point to a shell script that sets a modified PYTHONPATH and then executes the Python interpreter. The latter, of course, means that you have to have access to the user's machine, which you do not. However, if your users only run scripts and do not import your own libraries, you could use your own wrapper for scripts:
#!/path/to/my/python
And the /path/to/my/python script would be something like:
#!/bin/sh
PYTHONPATH=/whatever/lib/path:$PYTHONPATH /usr/bin/python "$@"

I think you should have a look at path import hooks, which allow you to modify the behaviour of Python when it searches for modules.
For example, you could try to do something like what KDE's scriptengine does for Python plugins[1].
It adds a special token to sys.path (like "<plasmaXXXXXX>", with XXXXXX being a random number just to avoid name collisions), and then when Python tries to import modules and can't find them in the other paths, it will call your importer, which can deal with it.
A simpler alternative is to have a main script used as a launcher, which simply adds the path to sys.path and executes the target file (so that you can safely avoid putting the sys.path.append(...) line in every file).
Yet another alternative, which works on Python 2.6+, would be to install the library under the per-user site-packages directory.
[1] You can find the source code under /usr/share/kde4/apps/plasma_scriptengine_python in a linux installation with kde.
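The launcher alternative mentioned in this answer might look roughly like this, assuming Python 3 and using the standard runpy module. The function name launch is made up for the example.

```python
import os
import runpy
import sys


def launch(script_path, library_dir):
    """Run script_path as __main__ with library_dir at the front of sys.path."""
    library_dir = os.path.abspath(library_dir)
    # Add the path only once, however many scripts are launched.
    if library_dir not in sys.path:
        sys.path.insert(0, library_dir)
    # run_path executes the file and returns the resulting module globals.
    return runpy.run_path(script_path, run_name="__main__")
```

A launcher file shipped next to the Library folder would compute library_dir from its own location, for example os.path.join(os.path.dirname(os.path.abspath(__file__)), "Library"), and pass the target script taken from sys.argv.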

How to modularize a Python application

I've got a number of scripts that use common definitions. How do I split them into multiple files? Furthermore, the application cannot be installed in any way in my scenario; it must be possible to have an arbitrary number of versions running concurrently, and it must work without superuser rights. Solutions I've come up with are:
Duplicate code in every script. Messy, and probably the worst scheme.
Put all scripts and common code in a single directory, and use from . import to load them. The downside of this approach is that I'd like to put my libraries in a different directory from the applications.
Put common code in its own directory, write an __init__.py that imports all submodules, and finally use from . import to load them. Keeps code organized, but it's a little bit of overhead to maintain __init__.py and qualify names.
Add the library directory to sys.path and import. I tend towards this, but I'm not sure whether fiddling with sys.path is nice code.
Load using execfile (exec in Python 3). Combines the advantages of the previous two approaches: only one line per module is needed, and I can use a dedicated directory. On the other hand, this evades the Python module concept and pollutes the global namespace.
Write and install a module using distutils. This installs the library for all Python scripts, needs superuser rights, and impacts other applications, and is hence not applicable in my case.
What is the best method?

Adding to sys.path (usually using site.addsitedir) is quite common and not particularly frowned upon. Certainly you will want your common working shared stuff to be in modules somewhere convenient.
If you are using Python 2.6+ there's already a user-level modules folder you can use without having to add to sys.path or PYTHONPATH. It's ~/.local/lib/python2.6/site-packages on Unix-likes - see PEP 370 for more information.
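A small sketch of both options, assuming Python 3; the two helper names are invented for the example, while site.addsitedir and site.getusersitepackages are the standard library calls.

```python
import site
import sys


def add_library_dir(path):
    """Add path to sys.path via site.addsitedir.

    Unlike a plain sys.path.append, this also processes any .pth files
    found in the directory. The directory is appended, so it does not
    take priority over entries already on sys.path.
    """
    site.addsitedir(path)


def user_site_dir():
    """Return the per-user site-packages directory described in PEP 370."""
    return site.getusersitepackages()
```

Anything placed in the directory returned by user_site_dir() is importable without touching sys.path or PYTHONPATH, provided the user site directory has not been disabled.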

You can set the PYTHONPATH environment variable to the directory where your library files are located. This adds that path to the library search path and you can use a normal import to import them.

If you have multiple environments which have various combinations of dependencies, a good solution is to use virtualenv to create sandboxed Python environments, each with their own set of installed packages. Each environment will function in the same way as a system-wide Python site-packages setup, but no superuser rights are required to create local environments.
Google has plenty of info, but this looks like a pretty good starting point.

Another alternative to manually adding the path to sys.path is to use the environment variable PYTHONPATH.
Also, distutils allows you to specify a custom installation directory using
python setup.py install --home=/my/dir
However, neither of these may be practical if you need to have multiple versions running simultaneously with the same module names. In that case you're probably best off modifying sys.path.

I've used the third approach (add the directories to sys.path) for more than one project, and I think it's a valid approach.
