Where to store large resource files with python/pypi packages?

I have the following problem: I'm developing a Python package that needs a somewhat large static data file for its operation (currently around 70 MB, and it may grow over time).
This isn't excessively large, but it's likely beyond what PyPI will accept, so simply shipping the file as a resource file inside the package is not really an option. It also doesn't compress very well.
So I'm expecting to do something like the following: I'll store the file somewhere it can be downloaded via HTTPS and add a command to the tool that downloads that extra data. (I.e. expect something like a command-line tool with a --fetch-operational-data parameter that one might call once after installation and perhaps again for updates every now and then, though updates of that file are expected to be rare.)
However, this leads to the question of where to store that data, and I feel there's no really good option.
"Usually" package resource files can be managed with importlib_resources, which can access files stored within module directories. But functions like open_binary are read-only, and while one could probably get the path and write there, that goes against the intention of how it is supposed to be used (e.g. a major selling point of the importlib functionality is that it works with zipped packages, and writing into the package would obviously break that).
Alternatively, one could use a dot directory (~/.mytool/). However, this means there's no good way to install the data globally.
On the other hand, there could be a system-wide directory (/var/lib/mytool?), but then an unprivileged user couldn't use the package. One could try to autodetect whether the data is in /var/lib, fall back to ~/.mytool, and write to whichever is writable when the update command runs.
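A rough sketch of that autodetection idea; the directory names and helper function are only illustrative, not a settled convention:

import os

SYSTEM_DIR = "/var/lib/mytool"              # system-wide location
USER_DIR = os.path.expanduser("~/.mytool")  # per-user fallback

def data_dir(for_writing=False):
    # Prefer the system-wide directory when it exists (reading) or is
    # writable (updating); otherwise fall back to the per-user directory.
    if for_writing:
        return SYSTEM_DIR if os.access(SYSTEM_DIR, os.W_OK) else USER_DIR
    return SYSTEM_DIR if os.path.isdir(SYSTEM_DIR) else USER_DIR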
Furthermore, the tool is currently executable straight from its git repo, which adds another complication (would it download the file into an extra directory in the git repo if it's executed from there, or also use /var/lib/mytool / ~/.mytool?).
Whatever I would do, it feels like an ugly hack. Any good ideas?

Related

Check if package is imported from within the source tree

Users should install our Python package via pip, or it can be cloned from a GitHub repo and installed from source. Users should not be running import Foo from within the source tree directory for a number of reasons, e.g. C extensions are missing (numpy has the same issue: read here). So we want to check whether the user is running import Foo from within the source tree, but how can this be done cleanly, efficiently, and robustly, with support for Python 3 and 2?
Edit: Note that the source tree here is defined as where the code is downloaded to (e.g. via git or from the source archive), in contrast with the installation directory where the code is installed to.
We considered the following:
Check for setup.py, or another file like PKG-INFO, which should only be present in the source tree (a rough sketch of this check follows after this list). It's not that elegant, and checking for the presence of a file is not very cheap, given that this check will happen every time someone imports Foo. Also, there is nothing to stop someone from putting a setup.py outside the source tree, in their lib/python3.X/site-packages/ directory or similar.
Parsing the contents of setup.py for the package name, but that also adds overhead and is not very clean to parse.
Create a dummy flag file that is only present in the source tree.
Some clever, but likely overcomplicated and error-prone, ideas like modifying Foo/__init__.py during installation to note that we are now outside of the source tree.
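A rough sketch of the first idea, assuming the check lives at the top of Foo/__init__.py and that setup.py sits one directory above the package in the source tree (both assumptions, adjust for your layout):

import os

_pkg_dir = os.path.dirname(os.path.abspath(__file__))
if os.path.exists(os.path.join(_pkg_dir, os.pardir, "setup.py")):
    raise ImportError(
        "Foo appears to be imported from within its source tree; "
        "install the package and run Python from another directory."
    )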
Since you mention numpy in your comments and want to do it the way they do, but don't fully understand how, I figured I would break that down and see if you could implement a similar process.
__init__.py
The error you are after starts here, which is what you linked to in your comments and answers, so you already know that. It just attempts an import of __config__.py and fails if the file isn't there or can't be imported.
try:
    from numpy.__config__ import show as show_config
except ImportError:
    msg = """Error importing numpy: you should not try to import numpy from
    its source directory; please exit the numpy source tree, and relaunch
    your python interpreter from there."""
    raise ImportError(msg)
So where does the __config__.py file come from then and how does that help? Let's follow below...
setup.py
When the package is installed, setup runs and in turn performs some configuration actions. This is essentially what ensures that the package is properly installed rather than being run from the download directory (which I think is what you want to ensure).
The key here is this line:
config.make_config_py() # installs __config__.py
misc_util.py
That is imported from distutils/misc_util.py which we can follow all the way down to here.
def make_config_py(self, name='__config__'):
    """Generate package __config__.py file containing system_info
    information used during building the package.

    This file is installed to the
    package installation directory.
    """
    self.py_modules.append((self.name, name, generate_config_py))
That in turn runs here, which writes out that __config__.py file with some system information and your show() function.
Summary
The import of __config__.py is attempted and fails, which raises the error you want whenever setup.py wasn't run, since running setup.py is what creates that file in the first place. This ensures not only that a file check is being done, but also that the file only exists in the installation directory. There is still the overhead of importing an additional file on every import, but whatever you do, you're adding some amount of overhead by making this check at all.
Suggestions
I think that you could implement a much lighter weight version of what numpy is doing while accomplishing the same thing.
Drop the distutils machinery and create the checked file within your setup.py as part of the standard install. It would then only exist in the installed directory after installation and never elsewhere, unless a user faked it (in which case they could probably get around just about anything you try).
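A minimal sketch of that idea, assuming the package is named foo and the generated marker module is called _installed.py (both names are placeholders, not numpy's actual mechanism):

# setup.py
import os
from setuptools import setup
from setuptools.command.build_py import build_py

class build_py_with_marker(build_py):
    def run(self):
        super().run()
        # Write the marker module into the build output only, so it never
        # appears in the source tree itself.
        target = os.path.join(self.build_lib, "foo", "_installed.py")
        with open(target, "w") as f:
            f.write("# Generated at build time; absent in the source tree.\n")

setup(
    name="foo",
    packages=["foo"],
    cmdclass={"build_py": build_py_with_marker},
)

foo/__init__.py would then use the same try/except ImportError pattern as numpy's, importing foo._installed instead of numpy.__config__.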
As an alternative (without knowing your application and what your setup file is doing), perhaps you have a function that is normally imported anyway, one that isn't key to running the application but is good to have available (in numpy's case these are functions with information about the installation, like version()). Instead of keeping those functions where you have them now, you make them part of the generated file. Then you are at least loading something that you would otherwise load anyway from somewhere else.
With this method you are either importing something, which has some overhead, or raising the error. As ways to raise an error when the package isn't being run from the installed directory go, I think it's a pretty clean and straightforward approach. Whatever method you use, there will be some overhead, so I would focus on keeping it low, simple, and unlikely to cause errors.
I wouldn't do anything complicated like parsing the setup file or modifying necessary files like __init__.py somewhere. I think you are right that those methods would be more error-prone.
Checking whether setup.py exists could work, but I would consider it less clean than attempting an import, which is already optimized as a standard Python operation. They accomplish similar things, but I think implementing it numpy-style is going to be more straightforward.

Using setuptools, how can I download external data upon installation?

I'd like to create some ridiculously-easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.)
What I'd like to achieve is this:
User runs pip install dataset
pip downloads the dataset, say via wget http://mydata.com/data.tar.gz. Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data here isn't a big deal.)
Later, when the user imports my module, the module automatically loads the data from the specific location.
This question is about bullets 2 and 3. Is there a way to do this with setuptools?
As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn't be handled with setuptools.
Instead, to avoid burdening the user, consider downloading the data in a lazy way, upon load. Example:
import os

# DATA_DIR and read_data_from_disk() are assumed to be defined elsewhere.

def download_data(url='http://...'):
    # Download; extract data to disk.
    # Raise an exception if the link is bad, or we can't connect, etc.
    ...

def load_data():
    if not os.path.exists(DATA_DIR):
        download_data()
    data = read_data_from_disk(DATA_DIR)
    return data
We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to the behavior of the imageio module, which downloads the necessary decoders at runtime rather than making the user manage the external downloads themselves.
Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
Please do not do this.
The whole point of Python packaging is to provide a completely deterministic, repeatable, and reusable means of installing exactly the same thing every time. Your proposal has the following problems at a minimum:
The end user might download your package on computer A, stick it on a thumb drive, and then install it on computer B which does not have internet.
The data on the web might change, meaning that two people who install the same exact package get different results.
The website that provides the data might cease to exist or unwisely change the URL, meaning people who still have the package won't be able to use it.
The user could be behind an internet filter, and you might get a useless "this page is blocked" HTML file instead of the dataset you were expecting.
Instead, you should either include your data with the package (using the package_data or data_files arguments to setup()), or provide a separate top-level function in your Python code to download the data manually when the user is ready to do so.
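For the first option, a minimal sketch of what the setup() call might look like (the package and file names are placeholders):

from setuptools import setup, find_packages

setup(
    name="dataset",
    version="0.1",
    packages=find_packages(),
    # Ship the data files inside the package itself.
    package_data={"dataset": ["data/*.tar.gz"]},
)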
Python packaging guidance states that installing a package should never require executing Python code. This means that you may not be able to download things during the installation process.
If you want to download some additional data, do it after you install the package; for example, when your package is imported you could download this data and cache it somewhere so it isn't downloaded again on every new import.
This question is rather old, but I want to add that downloading external data at installation time is of course much better than forcing a download of external content at runtime.
The original problem is that one cannot package arbitrary content into a Python package if it exceeds the maximum size limit of the package registry. This size limit effectively breaks up the relationship between the packaged Python code and the data it operates on. Suddenly, things that belong together have to be separated, and the package creator needs to take care of versioning and availability of the external data. If the size limit were not an issue, everything would be installed at installation time and the discussion would be over here. I want to stress that data and algorithms belong together and are normally installed at the same time, not at some later date; that's the whole point of package integrity. If you cannot install a package because the external content cannot be downloaded, you want to know that at installation time.
In the light of Docker and friends, downloading data at runtime makes a container non-reproducible and forces the external content to be downloaded at each start of the container, unless you additionally add the path where the data is downloaded to a Docker volume. But then you need to know exactly where this content is downloaded, and the user or Dockerfile author has to know more unnecessary details. There are further issues with using volumes in that regard.
Moreover, content fetched at runtime cannot be cached automatically by Docker, i.e. you need to fetch it again after every docker build.
Then again, one could argue that one should provide a function or executable script that downloads this external content, and that the user should run this script directly after installation. Again, the user of the package needs to know more than necessary, because someone or some committee proclaims that executing Python code or downloading external content at installation time is not "recommended".
But forcing the user to run an extra script directly after installing a package is effectively the same as downloading the content in a post-installation step, just more user-unfriendly. Given how popular machine learning is today, the growing size of models, and the likely popularity of ML in the future, this line of argument means there will soon be a lot of scripts to execute just to download models for a handful of Python package dependencies.
The only time I see a benefit in an extra script is when you can choose between several different versions of the external content, because then one intentionally involves the user in that decision.
But coming back to the runtime on-demand lazy model download, where the user doesn't need to be involved in running an extra script: let's assume the user packages the container, all tests pass on CI, and the container is distributed to Docker Hub or any other registry and goes into production. Nobody wants random failures in that situation because a successfully installed package intermittently downloads content, e.g. after a maintenance task like cleaning up Docker volumes, or when containers are scheduled onto new k8s nodes and the first request to a web app times out because external content is always fetched at startup. Or the content is not fetched at all, because the external URL is in maintenance mode. That's a nightmare!
If reasonably sized Python packages were allowed, the whole problem would be much less of an issue. By contrast, the biggest Ruby gems (i.e. packages in the Ruby ecosystem) are over 700 MB, and of course downloading external content at installation time is allowed there.

Lazily download/install python submodules

Would it be possible to create a python module that lazily downloads and installs submodules as needed? I've worked with "subclassed" modules that mimic real modules, but I've never tried to do so with downloads involved. Is there a guaranteed directory that I can download source code and data to, that the module would then be able to use on subsequent runs?
To make this more concrete, here is the ideal behavior:
User runs pip install magic_module and the lightweight magic_module is installed to their system.
User runs the code import magic_module.alpha
The code goes to a predetermined URL, is told that there is an "alpha" subpackage, and is then given the URLs of the alpha.py and alpha.csv files.
The system downloads these files to somewhere that it knows about, and then loads the alpha module.
On subsequent runs, the user is able to take advantage of the downloaded files to skip the server trip.
At some point down the road, the user could run import magic_module.alpha; alpha._upgrade() from the command line to clear the cache and get the latest version.
Is this possible? Is this reasonable? What kinds of problems will I run into with permissions?
Doable, certainly. The core feature will probably be import hooks. The relevant module would be importlib in python 3.
Extending the import mechanism is needed when you want to load modules that are stored in a non-standard way. Examples include [...] modules that are loaded from a database over a network.
Convenient, probably not. The import machinery is one of the parts of python that has seen several changes over releases. It's undergoing a full refactoring right now, with most of the existing things being deprecated.
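For illustration, a minimal sketch of such a hook as a meta path finder; the package name, URL, and cache directory are made-up placeholders, not an existing library:

import importlib.abc
import importlib.util
import os
import sys
import urllib.request

CACHE_DIR = os.path.expanduser("~/.magic_module_cache")

class RemoteSubmoduleFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        if not fullname.startswith("magic_module."):
            return None  # let the normal import machinery handle it
        name = fullname.split(".", 1)[1]
        local_path = os.path.join(CACHE_DIR, name + ".py")
        if not os.path.exists(local_path):
            os.makedirs(CACHE_DIR, exist_ok=True)
            url = "https://example.com/magic_module/%s.py" % name  # hypothetical server
            urllib.request.urlretrieve(url, local_path)
        return importlib.util.spec_from_file_location(fullname, local_path)

# Installed once, e.g. in magic_module/__init__.py:
sys.meta_path.append(RemoteSubmoduleFinder())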
Reasonable, well it's up to you. Here are some caveats I can think of:
Tricky to get right, especially if you have to support several python versions.
What about error handling? Should applications be prepared for imports to fail under normal circumstances? Should they degrade gracefully, or just crash and spew a traceback?
Security? Basically you're downloading code from someplace; how do you ensure the connection is not being hijacked?
How about versioning? If you update some of the remote modules, how can you make the application download the correct version?
Dependencies? Pushing of security updates? Permissions management?
Summing it up, you'll have to solve most of the problems of a package manager, along with securing the downloads and handling permissions. All of those issues are tricky to begin with and easy to get wrong, with dire consequences.
So with all that in mind, it really comes down to how many resources you deem worth investing in this, and what value it adds over regular use of readily available tools such as pip.
(the permission question cannot really be answered until you come up with a design for your package manager)

Does git always produce stable results when doing a pull?

I'm building a little Python script that is supposed to update itself every time it starts. Currently I'm thinking about putting MD5 hashes on a "website" and having the script download the files into a temp folder. Then, if the MD5 hashes match, the temp files are moved over the old ones.
But now I'm wondering if git will just do something like this anyway.
What if the internet connection breaks away or power goes down when doing a git pull? Will I still have the "old" version or some intermediate mess?
Since my approach works with an atomic rename from the OS, I can at least be sure that every file is either old or new, but not messed up. Is that true for git as well?
A pull is a complex command that does a few different things depending on the configuration. It is not something you should use in a script, as it will try to merge (or rebase, if so configured), which means that files with conflict markers may be left on the filesystem, and anything that tries to compile or interpret those files will fail.
If you want to switch to a particular version of files, you should use something like checkout -f <remote>/<branch> after fetching from <remote>. Keep in mind that git cannot know what particular needs you have, so if you're writing a script, it should be able to perform some sanity checks (e.g. make sure there are no extra files lying around)
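Since the updater is a Python script, the fetch-then-checkout approach could be sketched roughly as follows; "origin" and "main" are assumptions about the remote and branch names:

import subprocess

def update_to_latest(repo_dir):
    # Fetch without touching the working tree, then force the working tree
    # to match the remote branch in a single step.
    subprocess.run(["git", "fetch", "origin"], cwd=repo_dir, check=True)
    subprocess.run(["git", "checkout", "-f", "origin/main"], cwd=repo_dir, check=True)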

data cache for python package

I have a python module which generates large data files which I want to cache on disk for future use. The cache is likely to end up some hundreds of MB for a normal user, but save a lot of computation time.
The files aren't distributed with the module, but are generated the first time the code is run with a given set of parameters.
So far I've just been using a single-file module myself and putting the files in a hard-coded path relative to the module (data/). But I now need to distribute this module in a Python package with distutils, and I was wondering if there is a standard way to do that.
I was thinking of something like the compiled cache of scipy.weave, but wondering if there is a more modern, supported way of doing it. On *nix platforms I would expect it to go in ~/.something, but I'm not sure what the Windows equivalent would be. Also, this should be configurable so that users can point it somewhere else if that's more convenient, or so that the cache directory can be shared between users. How should such a config file work? Where should it go?
Or should I just have it as an install option, either through a config file next to setup.py or set by manually editing setup.py, and then hard-code the directory in the module before installation?
Any pointers gratefully received...
You can use the standard library module ConfigParser to parse an INI file (or .rc file, depending on your culture). To find the file, os.path.expanduser is a useful function that does the right thing on all platforms for paths like "~/.mytoolrc". To let the user override the location of things, you can use environment variables via os.environ.
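A rough sketch of combining those pieces; the environment variable MYTOOL_CACHE_DIR, the ~/.mytoolrc file, and the default locations are all illustrative choices, not a standard:

import configparser
import os

def get_cache_dir():
    # 1. Explicit override via an environment variable.
    env = os.environ.get("MYTOOL_CACHE_DIR")
    if env:
        return env
    # 2. Optional per-user config file (~/.mytoolrc, INI format).
    config = configparser.ConfigParser()
    config.read(os.path.expanduser("~/.mytoolrc"))
    configured = config.get("cache", "dir", fallback=None)
    if configured:
        return os.path.expanduser(configured)
    # 3. Platform default: local app data on Windows, ~/.cache elsewhere.
    if os.name == "nt":
        base = os.environ.get("LOCALAPPDATA", os.path.expanduser("~"))
        return os.path.join(base, "mytool", "cache")
    return os.path.expanduser("~/.cache/mytool")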
There is an emerging standard in the free OS world: http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html
This module can help you on Windows and Mac OS X, but it seems to be broken with respect to the XDG Base Dir Spec: http://pypi.python.org/pypi/appdirs
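For reference, the appdirs API mentioned above is used roughly like this (the application name "mytool" is just an example):

import appdirs

# Returns a platform-appropriate per-user cache directory, e.g.
# ~/.cache/mytool on Linux, or a per-user AppData cache path on Windows.
cache_dir = appdirs.user_cache_dir("mytool")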
