Would it be possible to create a python module that lazily downloads and installs submodules as needed? I've worked with "subclassed" modules that mimic real modules, but I've never tried to do so with downloads involved. Is there a guaranteed directory that I can download source code and data to, that the module would then be able to use on subsequent runs?
To make this more concrete, here is the ideal behavior:
User runs pip install magic_module and the lightweight magic_module is installed to their system.
User runs the code import magic_module.alpha
The code goes to a predetermined URL, is told that there is an "alpha" subpackage, and is then given the URLs of the alpha.py and alpha.csv files.
The system downloads these files to somewhere that it knows about, and then loads the alpha module.
On subsequent runs, the user is able to take advantage of the downloaded files to skip the server trip.
At some point down the road, the user could run import magic_module.alpha; alpha._upgrade() from the command line to clear the cache and get the latest version.
Is this possible? Is this reasonable? What kinds of problems will I run into with permissions?
Doable, certainly. The core feature will probably be import hooks. The relevant module would be importlib in Python 3.
Extending the import mechanism is needed when you want to load modules that are stored in a non-standard way. Examples include [...] modules that are loaded from a database over a network.
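As a rough sketch of what such a hook could look like (the URL scheme, caching, and error handling here are all assumptions, not a finished design), a finder placed on sys.meta_path can intercept imports of magic_module.* submodules and fetch their source on demand:

import importlib.abc
import importlib.machinery
import sys
import urllib.request

class RemoteFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Loads magic_module.* submodules from a remote server on first import."""

    def find_spec(self, fullname, path, target=None):
        if not fullname.startswith("magic_module."):
            return None  # let the normal import machinery handle everything else
        return importlib.machinery.ModuleSpec(fullname, self)

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        # Hypothetical URL layout; real code would cache the source to disk
        # and verify the download before executing it.
        name = module.__name__.split(".")[-1]
        source = urllib.request.urlopen(
            "https://example.com/packages/%s.py" % name).read()
        exec(compile(source, "<remote %s>" % name, "exec"), module.__dict__)

sys.meta_path.insert(0, RemoteFinder())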
Convenient, probably not. The import machinery is one of the parts of Python that has seen several changes across releases. It is undergoing a full refactoring right now, with most of the existing pieces being deprecated.
Reasonable, well it's up to you. Here are some caveats I can think of:
Tricky to get right, especially if you have to support several python versions.
What about error handling? Should applications be prepared for an import to fail under normal circumstances? Should they degrade gracefully, or just crash and spew a traceback?
Security? Basically you're downloading code from someplace, how do you ensure the connection is not being hijacked?
How about versioning? If you update some of the remote modules, how can you make the application download the correct version?
Dependencies? Pushing of security updates? Permissions management?
Summing it up, you'll have to solve most of the issues of a package manager, along with securing downloads and managing permissions. All those issues are tricky to begin with and easy to get wrong, with dire consequences.
So with all that in mind, it really comes down to how much resources you deem worth investing into that, and what value that adds over a regular use of readily available tools such as pip.
(the permission question cannot really be answered until you come up with a design for your package manager)
I want to know if I can ship a Python script together with a folder in the same directory containing all the assets of a Python module, so that when someone wants to use it they would not have to pip install the module, because it would be imported from that directory.
Yes, you can, but it doesn't mean that you should.
First, ask yourself who is supposed to use that code.
If you plan to give it to consumers, it would be a good idea to use a tool like py2exe and create an executable file that includes all modules and doesn't allow the code to be changed.
If you plan to share it with another developer, you might want to look into virtual environments and a requirements.txt file.
There are multiple reasons why sharing modules this way is a bad idea:
It is harder to update modules later, at least without upgrading the whole project.
It uses more space in version control, which can create issues on huge projects with hundreds of modules and branches.
It might be illegal as some licenses specifically forbid including their code in your source code.
The pip install of some module might do different things depending on operating system version or installed packages. The modules on your machine might be suboptimal on someone else's machine, and in some instances might not even work.
And probably more that I can't think of right now.
The only situation where I saw this being unavoidable was when a module didn't support the Python implementation the application was running on. The module was modified, and its source was put under a lib folder with the rest of the libraries.
I think you can add the directory with the Python modules to PYTHONPATH. People who want to use those modules then just need to have this environment variable set.
https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPATH
I have an environment with some extreme constraints that require me to reduce the size of a planned Python 3.8.1 installation. The OS is not connected to the internet, and a user will never open an interactive shell or attach a debugger.
There are of course lots of ways to do this, and one of the ways I am exploring is to remove some core modules, for example python3-email. I am concerned that there are 3rd-party packages that future developers may include in their apps that have unused but required dependencies on core Python features. For example, if python3-email is missing, what 3rd-party packages might not work that one would expect to? If a developer decides to use a logging package that contains an unreferenced EmailLogger class in a referenced module, it will break, simply because import email appears at the top.
Do package design requirements or guidelines exist that address this?
It's an interesting question, but it is too broad to be cleanly answered here. In short, the Python standard library is expected to always be there, even though it is sometimes broken up into multiple parts (on Debian, for example). But you say it yourself: you don't know what your requirements are, since you don't yet know what future packages will run on this interpreter... This is impossible to answer. One thing you could do is to run something like modulefinder on the future code before letting it run on that constrained Python interpreter.
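For example, a quick modulefinder audit of a hypothetical script (the script name below is a placeholder) might look like this:

from modulefinder import ModuleFinder

finder = ModuleFinder()
finder.run_script("future_app.py")  # the application you plan to deploy

# Every module the script would pull in, standard library included.
for name in sorted(finder.modules):
    print(name)

# Modules that could not be resolved; these are candidates for failures
# on a trimmed-down interpreter.
print("Missing:", sorted(finder.badmodules))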
I was able to get to a solution. The issue was best described to me as cascading imports. It is possible to stop a module from being loaded by adding an entry to sys.modules. For example, when importing the asyncio module, the ssl and _ssl modules will be loaded, even though they are not needed unless SSL is actually used. This can be stopped with the following code. It can be verified both by seeing that the Python process is 3 MB smaller and by using module load hooks to watch each module as it loads:
import importhook
import sys

# Prevent ssl (and therefore _ssl) from being loaded at all.
sys.modules['ssl'] = None

@importhook.on_import(importhook.ANY_MODULE)
def on_any_import(module):
    print(module.__spec__.name)
    assert module.__spec__.name not in ['ssl', '_ssl']

import asyncio
For my original question about 3rd-party design guidelines, some recommend placing the import statements within the class rather than at the module level; however, this is not routinely done.
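A small illustration of that guideline, using a hypothetical EmailLogger like the one mentioned above:

class EmailLogger:
    def send(self, text):
        # Deferred import: 'email' is only loaded if this class is actually
        # used, so an interpreter without it can still import the module.
        from email.message import EmailMessage
        msg = EmailMessage()
        msg.set_content(text)
        return msg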
I'd like to create some ridiculously-easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.)
What I'd like to achieve is this:
User runs pip install dataset
pip downloads the dataset, say via wget http://mydata.com/data.tar.gz. Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data here isn't a big deal.)
Later, when the user imports my module, the module automatically loads the data from the specific location.
This question is about bullets 2 and 3. Is there a way to do this with setuptools?
As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn't be handled with setuptools.
Instead, to avoid burdening the user, consider downloading the data in a lazy way, upon load. Example:
import os

# DATA_DIR and read_data_from_disk are placeholders for your own
# storage location and loading logic.

def download_data(url='http://...'):
    # Download; extract data to disk.
    # Raise an exception if the link is bad, or we can't connect, etc.
    ...

def load_data():
    if not os.path.exists(DATA_DIR):
        download_data()
    data = read_data_from_disk(DATA_DIR)
    return data
We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to the behavior in the imageio module with respect to downloading necessary decoders at runtime, rather than making the user manage the external downloads themselves.
Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
Please do not do this.
The whole point of Python packaging is to provide a completely deterministic, repeatable, and reusable means of installing exactly the same thing every time. Your proposal has the following problems at a minimum:
The end user might download your package on computer A, stick it on a thumb drive, and then install it on computer B which does not have internet.
The data on the web might change, meaning that two people who install the same exact package get different results.
The website that provides the data might cease to exist or unwisely change the URL, meaning people who still have the package won't be able to use it.
The user could be behind an internet filter, and you might get a useless "this page is blocked" HTML file instead of the dataset you were expecting.
Instead, you should either include your data with the package (using the package_data or data_files arguments to setup()), or provide a separate top-level function in your Python code to download the data manually when the user is ready to do so.
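A minimal sketch of the first option, assuming the data file already sits inside the package directory (all names below are placeholders):

from setuptools import setup, find_packages

setup(
    name='dataset',                        # placeholder package name
    version='0.1.0',
    packages=find_packages(),
    # Ship the data inside the package itself rather than fetching it later.
    package_data={'dataset': ['data/*.csv']},
    include_package_data=True,
)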
Python packaging guidelines state that installing a package should never require executing arbitrary Python code. This means that you may not be able to download stuff during the installation process.
If you want to download some additional data, do it after you install the package. For example, when your package is first imported, you could download this data and cache it somewhere so that it isn't downloaded again on every import.
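A sketch of that idea, assuming a cache directory under the user's home and reusing the archive URL from the question:

import pathlib
import urllib.request

# Hypothetical cache location; a real package might respect XDG_CACHE_HOME.
CACHE = pathlib.Path.home() / '.cache' / 'dataset' / 'data.tar.gz'

def ensure_data(url='http://mydata.com/data.tar.gz'):
    if not CACHE.exists():
        CACHE.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, CACHE)
    return CACHE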
This question is rather old, but I want to add that downloading external data at installation time is of course much better than forcing a download of external content at runtime.
The original problem is that one cannot package arbitrary content into a Python package if it exceeds the maximum size limit of the package registry. This size limit effectively breaks up the relationship between the packaged Python code and the data it operates on. Suddenly, things that belong together have to be separated, and the package creator needs to take care of versioning and availability of the external data. If the size limits are met, everything is installed at installation time and the discussion would be over here. I want to stress that data and algorithms belong together and are normally installed at the same time, not at some later date. That's the whole point of package integrity. If you cannot install a package because the external content cannot be downloaded, you want to know that at installation time.
In light of Docker and friends, downloading data at runtime makes a container non-reproducible and forces the download of the external content at each start of the container, unless you additionally add the path where the data is downloaded to a Docker volume. But then you need to know where exactly this content is downloaded, and the user/Dockerfile creator has to know more unnecessary details. There are more issues with using volumes in that regard.
Moreover, content fetched at runtime cannot be cached automatically by Docker, i.e. you need to fetch it again after every docker build.
Then again, one could argue that one should provide a function or executable script that downloads this external content, and that the user should execute this script directly after installation. Again, the user of the package needs to know more than necessary, because someone or some committee proclaims that executing Python code or downloading external content at installation time is not "recommended".
But forcing the user to run an extra script directly after installation of a package is effectively the same as downloading the content in a post-installation step, just more user-unfriendly. Given how popular machine learning is today and how quickly model sizes are growing, this argument implies that in the near future a lot of scripts will have to be executed just to download models for a handful of Python package dependencies.
The only time I see a benefit in an extra script is when you can choose between several different versions of the external content to download, but then one intentionally involves the user in that decision.
But coming back to the runtime on-demand lazy model download, where the user doesn't need to be involved in executing an extra script: let's assume the user packages the container, all tests pass successfully on CI, and he/she distributes it to Dockerhub or any other container registry and starts production. Nobody wants random failures because a successfully installed package intermittently downloads content, e.g. after some maintenance task such as cleaning up Docker volumes, or because the first request to a web app times out when containers are distributed to new k8s nodes and external content is always fetched at startup. Or the content is not fetched at all, because the external URL is in maintenance mode. That's a nightmare!
If reasonably sized Python packages were allowed, the whole problem would be much less of an issue. By contrast, the biggest Ruby gems (i.e. packages in the Ruby ecosystem) are over 700 MB, and of course downloading external content at installation time is allowed there.
I made a Python module (https://github.com/Yannbane/Tick.py) and a Python program (https://github.com/Yannbane/Flatland.py). The program imports the module, and without it, it cannot work. I intended for people to download both of these files before they can run the program, but I am a bit concerned about this.
In the program, I've added these lines:
sys.path.append("/home/bane/Tick.py")
import tick
"/home/bane/Tick.py" is the path to my local repo of the module that needs to be included, but this will obviously be different to other people! How can I solve this situation better?
What's suggested by @Lattyware is a viable option. However, it's not uncommon to have core dependencies bundled with the main program (Django and PyDev do this, for example). This works fine, especially if the main code is tweaked against a specific version of the library.
In order to avoid the troubles mentioned by Lattyware when it comes to code maintenance, you should look into git submodules, which allow precisely this kind of layout, keeping code versioning sane.
From the structure of your directory it seems that both files live in the same directory. This might be a tell-tale sign that they are two modules of the same package. In that case you should simply add an empty file called __init__.py to the directory, and then your import could work as:
import bane.tick
or
from bane import tick
Oh, and yes... you should use lower case for module names (it's worth taking an in-depth look at PEP 8 if you are going to code in Python! :)
HTH!
You might want to try submitting your module to the Python Package Index; that way people can easily install it (pip install tick) into their path, and you can just import it without having to add it to the Python path.
Otherwise, I would suggest simply telling people to download the module as well, and place it in a subdirectory of the program. If you really feel that is too much effort, you could place a copy of the module into the repository for the program (of course, that means ensuring you keep both versions up-to-date, which is a bit of a pain, although I imagine it may be possible just to use a symlink).
It's also worth noting your repo name is a bit misleading, capitalisation is often important, so you might want to call the repo tick.py to match the module, and python naming conventions.
The Story
After cleaning up my Dreamhost shared server's home folder from all the cruft accumulated over time, I decided to start afresh and compile/reinstall Python.
All tutorials and snippets I found seemed overly simplistic, assuming (or ignoring) a bunch of dependencies needed by Python to compile all modules correctly. So, starting from http://andrew.io/weblog/2010/02/installing-python-2-6-virtualenv-and-virtualenvwrapper-on-dreamhost/ (so far the best guide I found), I decided to write a set-and-forget Bash script to automate this painful process, including along the way a bunch of other things I am planning to use.
The Script
I am hosting the script on http://bitbucket.org/tmslnz/python-dreamhost-batch/src/
The TODOs
So far it runs fine, and does all it needs to do in about 900 seconds, giving me at the end of the process a fully functional Python / Mercurial / etc... setup without even needing to log out and back in.
I thought this might be of use for others too, but there are a few things that I think it's missing, and I am not quite sure how to go about them, what the best way to do them is, or whether this just doesn't make any sense at all.
Check for errors and break
Check for minor version bumps of the packages and give warnings
Check for known dependencies
Use arguments to install only some of the packages instead of commenting out lines
Organise the code in a manner that's easy to update
Optionally make the installers and compiling silent, with error logging to file
Failproof .bashrc modification, to prevent breaking SSH logins and having to log back in via FTP to fix it
EDIT: The implied question is: can anyone, more bashful than me, offer general advice on the worthiness of the above points or highlight any problems they see with this approach? (see my answer to Ry4an's comment below)
The Gist
I am no UNIX or Bash or compiler expert, and this has been built iteratively, by trial and error. It is somewhat going in the direction of apt-get (well, 1% of it...), but since Dreamhost and others obviously cannot give root access on shared servers, this looks to me like a potentially very useful workaround, particularly with some community work involved.
One way to streamline this would be to make it work with one of: capistrano/fabric, puppet/chef, jhbuild, or buildout+minitage (and a lot of cmmi tasks). There are some opportunities for factoring out common code, especially with something more high-level than bash. You will run into bootstrapping issues, however, so maybe leave good enough alone.
If you want to look into userland package managers, there are autopackage (bootstraps well), nix (quickstart), and stow (simple, but helps with isolation).
Honestly, I would just build packages with a name prefix for all of the pieces and have them install under /opt so that they're out of the way. That way it only takes the download time and a bit of install time to do.