I'd like to create some ridiculously-easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.)
What I'd like to achieve is this:
User runs pip install dataset
pip downloads the dataset, say via wget http://mydata.com/data.tar.gz. Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data here isn't a big deal.)
Later, when the user imports my module, the module automatically loads the data from the specific location.
This question is about steps 2 and 3. Is there a way to do this with setuptools?
As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn't be handled with setuptools.
Instead, to avoid burdening the user, consider downloading the data in a lazy way, upon load. Example:
import os, tarfile, urllib.request

DATA_DIR = os.path.expanduser('~/.mypackage/data')  # example cache location

def download_data(url='http://...'):
    # Download the archive and extract the data to disk; urllib raises
    # an exception if the link is bad, or we can't connect, etc.
    archive, _ = urllib.request.urlretrieve(url)
    os.makedirs(DATA_DIR, exist_ok=True)
    with tarfile.open(archive) as tar:
        tar.extractall(DATA_DIR)

def load_data():
    if not os.path.exists(DATA_DIR):
        download_data()
    return read_data_from_disk(DATA_DIR)  # your format-specific reader
We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to the behavior of the imageio module, which downloads the necessary decoders at runtime rather than making the user manage the external downloads themselves.
Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
Please do not do this.
The whole point of Python packaging is to provide a completely deterministic, repeatable, and reusable means of installing exactly the same thing every time. Your proposal has the following problems at a minimum:
The end user might download your package on computer A, stick it on a thumb drive, and then install it on computer B which does not have internet.
The data on the web might change, meaning that two people who install the same exact package get different results.
The website that provides the data might cease to exist or unwisely change the URL, meaning people who still have the package won't be able to use it.
The user could be behind an internet filter, and you might get a useless "this page is blocked" HTML file instead of the dataset you were expecting.
Instead, you should either include your data with the package (using the package_data or data_files arguments to setup()), or provide a separate top-level function in your Python code to download the data manually when the user is ready to do so.
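For illustration, a minimal setup() sketch of the package_data approach; the project name and file patterns here are made up:

from setuptools import setup, find_packages

setup(
    name='mydataset',      # illustrative project name
    version='1.0',
    packages=find_packages(),
    # Ship the data files inside the package so installs are self-contained:
    package_data={'mydataset': ['data/*.csv', 'data/*.tar.gz']},
)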
Python packaging guidance is that installing a package should never execute arbitrary Python code. This means that you may not be able to download anything during the installation process.
If you want to download some additional data, do it after you install the package; for example, when your package is imported you could download the data and cache it somewhere, so that it isn't downloaded again on every import.
This question is rather old, but I want to add that downloading external data at installation time is of course much better than forcing a download of external content at runtime.
The original problem is that one cannot package arbitrary content into a Python package if it exceeds the maximum size limit of the package registry. This size limit effectively breaks up the relationship between the packaged Python code and the data it operates on. Suddenly, things that belong together have to be separated, and the package creator needs to take care of versioning and availability of external data. If the size limits are met, everything is installed at installation time and the discussion would be over here. I want to stress that data and algorithms belong together and are normally installed at the same time, not at some later date. That's the whole point of package integrity. If you cannot install a package because the external content cannot be downloaded, you want to know at installation time.
In the light of Docker & friends, downloading data at runtime makes a container non-reproducible and forces a download of the external content at each start of the container, unless you additionally add the path where the data is downloaded to a Docker volume. But then you need to know exactly where this content is downloaded, and the user/Dockerfile creator has to know more unnecessary details. There are more issues with using volumes in that regard.
Moreover, content fetched at runtime cannot be cached automatically by Docker, i.e. you need to fetch it again after every docker build.
Then again, one could argue that one should provide a function/executable script that downloads this external content, and that the user should execute this script directly after installation. Again, the user of the package needs to know more than necessary, because someone or some committee proclaims that executing Python code or downloading external content at installation time is not "recommended".
But forcing the user to run an extra script directly after installing a package is factually the same as downloading the content in a post-installation step, just more user-unfriendly. Considering how popular machine learning is today, the growing size of models, and the future popularity of ML, this argumentation implies that in the near future there will be a lot of scripts to execute for just a handful of Python package dependencies, purely for model downloads.
The only time I see a benefit in an extra script is when you can choose between several different versions of the external content to download, but then one intentionally involves the user in that decision.
But coming back to the runtime on-demand lazy model download, where the user doesn't need to be involved in executing an extra script: let's assume the user packages the container, all tests pass on the CI, and he/she distributes it to Docker Hub or another container registry and starts production. Nobody wants random failures because a successfully installed package intermittently downloads content, e.g. after a maintenance task such as cleaning up Docker volumes, or when containers are distributed to new k8s nodes and the first request to a web app times out because external content is always fetched at startup. Or the content is not fetched at all, because the external URL is in maintenance mode. That's a nightmare!
If reasonably sized Python packages were allowed, the whole problem would be much less of an issue. In contrast, the biggest Ruby gems (i.e. packages in the Ruby ecosystem) are over 700 MB, and of course it's allowed to download external content at installation time there.
Related
I am responsible for maintaining a Python environment that is frequently deployed into production. To ensure that the library's dependencies work well together, I use pip to install all the required Python packages; following best practices, I pip freeze all of my dependencies into a requirements.txt file.
The benefit of this approach is that I have a very stable environment that is unlikely to break due to a package issue. However, the drawback is that my environment is static, while these packages are constantly releasing new versions that could improve performance and, most importantly, fix vulnerabilities.
I am wondering what the industry standard is for keeping packages up to date in an easy way, perhaps even in an automated way that detects new updates and tests for any potential issues. For instance, apt has the simple apt-get update and apt-get upgrade commands to keep your packages constantly updated and make you aware of it.
The obvious answer is to just update all packages to the latest official version. My concern is that this can potentially break some dependencies between packages and cause the environment to break.
Can anyone suggest a similar solution for keeping Python packages up to date while ensuring stability?
Well, I can't say what the industry standard is, but one way to update the packages periodically would be this:
Create a requirements.txt file that contains the packages to update.
Create a Python script that runs shell commands to update the packages in the requirements.txt (a sketch follows the links below).
You can use pip freeze to update the requirements.txt.
Create a cron job or use the Python schedule library to trigger periodic updates.
You can refer, as examples:
for cron jobs on Mac, to https://www.jcchouinard.com/python-automation-with-cron-on-mac/
for the schedule library in Python, to https://www.geeksforgeeks.org/python-schedule-library/
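As referenced above, here is a minimal sketch of such an update script. To reconcile steps 2 and 3, it assumes the unpinned "what to update" list lives in a separate requirements.in file (an assumed convention, not part of the question) so that the frozen pins in requirements.txt don't block future upgrades:

import subprocess
import sys

def upgrade_all(source='requirements.in', frozen='requirements.txt'):
    # Upgrade everything in the unpinned list to the latest versions...
    subprocess.check_call([sys.executable, '-m', 'pip', 'install',
                           '--upgrade', '-r', source])
    # ...then record the exact versions that are now installed.
    with open(frozen, 'w') as f:
        f.write(subprocess.check_output(
            [sys.executable, '-m', 'pip', 'freeze'], text=True))

if __name__ == '__main__':
    upgrade_all()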
The above can solve the update issue but may not solve the compatibility issue.
The important thing is not so much being current as having no vulnerabilities that may harm customers and your company in any way.
The industry standard is to use Snyk, Safety, OWASP Dependency-Track, Sonatype Lifecycle, etc. to monitor package/module vulnerabilities, together with a continuous integration (CI) system that supports this, such as Docker containers in Kubernetes.
Deploy the exact version you are running in prod and run vulnerability scans.
(All CI systems make this easy)
(it is important to do this on the running code because requirements.txt most likely never shows everything that will actually be installed)
The vulnerability scanners (all except OWASP Dependency-Track) cost money if you like to be current, but in their simplest form they just give you a more refined answer to:
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=python
You can make your own system based on the Safety source (https://github.com/pyupio/safety) and the CVE list above, and make it fail if any of the listed versions are found in your code.
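As a rough sketch of that idea, one could compare the installed distributions against a set of known-bad (name, version) pairs; the entries below are purely illustrative, not real advisories:

from importlib.metadata import distributions  # Python 3.8+

BAD_VERSIONS = {('requests', '2.5.0'), ('pyyaml', '5.3')}  # illustrative only

def scan():
    # Collect every installed distribution and fail on any known-bad match.
    installed = {(d.metadata['Name'].lower(), d.version) for d in distributions()}
    hits = installed & BAD_VERSIONS
    if hits:
        raise SystemExit(f'Vulnerable packages found: {sorted(hits)}')

if __name__ == '__main__':
    scan()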
When it comes to requirements.txt listings, we always use >= in favor of ==, deploy every night to test systems, and run tests on them; if they fail, we manually have to go through the errors, check them, and pin the releases.
Generally you build Docker images or similar. (There are other VM solutions, but Docker seems to be the most common/popular.) Have apt-get update, apt-get install, and pip install -r requirements.txt as part of the Dockerfile, and integrate it into your CI/CD pipeline using GitHub or similar.
That builds the image from scratch using a well-defined process. You then run the VM, including all automated unit and integration tests and flag any problems for the Devs. It's then their job to figure out whether they broke something, whether they need a particular version of some dependency, etc.
Once the automated tests pass, the image binary (including its "frozen" dependencies) gets pushed to a place where humans can interact with it and look for anything weird. Typically there are several such environments variously called things like "test", "integration", and "production". The names, numbers, and details vary from place to place - but that triplet is fairly common. They're typically defined as a sequence and gating between them is done manually. This might be a typical spec:
Development - This is the first environment where code gets pushed after the automated testing passes. Failures and weird idiosyncrasies are common. If local developers need to replicate a problem or experiment with a patch that can't be handled on their local machines, they put the code here. Images graduate from Development when the basic manual and automated "smoke tests" all pass and the Dev team agrees the code is stable.
Integration - This environment belongs to QA and dedicated software testers. They may have their own set of scripts to run, which may be automated, manual or ad hoc. This is also a good environment for load testing, or for internal red team attacks on the security, or for testing / exploration by internal trusted users who are willing to risk the occasional crash in order to exercise the newest up-and-coming features. QA flags any problems or oddities and sends issues back to the devs. Images graduate from Integration when QA agrees the code is production-quality.
Production - This environment is running somewhat older code since everything has to go through Dev & QA before reaching it, but it's likely to be quite robust. The binary images here are the only ones that are accessible to the outside world, and the only ones carefully monitored by the SOC.
Since VM images are only built once, just prior to sending the code to Dev, all three environments should have the same issues. The problem then becomes getting code through the environments in sequence and doing all the necessary checks in each before the security patches become too outdated... or before the customers get tired of waiting for the new features / bug-fixes.
There are a lot of details in the process, and there are many questions & answers on Stack Overflow's sister site for system and network administration.
I wrote 2-3 plugins for pyLoad.
Sometimes they change, and I let users know on the forum that there's a new version.
To avoid that, I'd like to give my scripts an auto self-update function.
https://github.com/Gutz-Pilz/pyLoad-stuff/blob/master/FileBot.py
Is something like that easy to set up?
Or can someone point me in the right direction?
Thanks in advance!
It is possible, with some caveats. But it can easily become very complicated. Before you know it, your auto-update "feature" will be bigger than the original code!
First you need to have a URL that always contains the latest version. Since you are using GitHub, raw.githubusercontent.com might do very well.
Have your code download the latest version from that URL (e.g. using requests) and compare its version with that of the current code. For this purpose I would recommend a simple integer version number, so you don't need any complicated parsing logic.
However, you might want to consider only running that check once per day, or once per week. If you do it every time your file is run, the server might get hammered! So now you have to save a file with the date when the check was last done, and read that to see if it is time to run the check again. This file will need to be saved in a location that you can access on every platform your code is liable to run on. That in itself can be a challenge.
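Putting the last two paragraphs together, a minimal sketch might look like this; the raw URL, the integer VERSION constant, and the stamp-file path are all assumptions for illustration:

import os
import re
import time
import requests

RAW_URL = 'https://raw.githubusercontent.com/Gutz-Pilz/pyLoad-stuff/master/FileBot.py'  # assumed
VERSION = 3  # integer version of the running copy (hypothetical)
STAMP = os.path.expanduser('~/.filebot_last_check')  # hypothetical stamp file

def update_available(check_interval=86400):
    # Throttle: skip the check if we already checked within the interval.
    if os.path.exists(STAMP) and time.time() - os.path.getmtime(STAMP) < check_interval:
        return False
    resp = requests.get(RAW_URL, timeout=10)
    resp.raise_for_status()
    open(STAMP, 'w').close()  # remember when we last checked
    # Look for a line like "VERSION = 4" in the remote copy.
    m = re.search(r'^VERSION\s*=\s*(\d+)', resp.text, re.M)
    return bool(m) and int(m.group(1)) > VERSION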
If it is just a single python file, which is installed as the user that is running it, updating is relatively easy. But if the original was installed as root in the global Python directory and your script is running as a nonprivileged user it will be difficult. Especially if it is running as a plugin and cannot ask the user for (temporary) root credentials to install the file.
And what are you going to do if a newer version has more dependencies outside the standard library?
Last but not least, as a sysadmin I don't really like auto-updating software. Especially for critical system infrastructure, I like to be able to estimate the consequences before an update.
We have a relatively large number of machines in a cluster that run a certain Python-based software for computation in an academic research environment. In order to keep the code base up to date, we use a build server which makes the current code base available in a directory each time we update a dedicated deployment tag on our Mercurial server. Each machine in the cluster runs a daily rsync script that just synchronises with the deployment directory on the build server (if there's anything to sync) and restarts the process if the code base was updated.
Now, I find this approach a bit dated and a slight overkill, and I would like to optimise it in the following way:
Get rid of the build server as all it actually does is clone the latest code base that has a certain tag attached - it doesn't actually compile or do any additional checks (such as testing) on the code base at all. This would also reduce some pain for us as it'd be one less server to maintain and worry about.
Instead of having the build server, I would like to pull straight from our Mercurial server, which hosts the code already. This would reduce the need to duplicate the code base each time we update the deployment tag.
Now I had a bit of a read before on how to install / deploy Python-based software with pip (e.g., How to point pip at a Mercurial branch?). It seems to be the right choice as it supports installing packages straight from a code repository. However, I ran into a few problems that I would need help with. The requirements I have are as follows:
Use Mercurial as a source.
Automated background process to update and install into a custom directory on the file system.
Only pull and update from the repository if there is a new version available.
The following command seems to almost do what I need:
pip install -e hg+https://authkey:anypw@mymercurialserver.hostname.com/Code/package@deployment#egg=package --upgrade --src ~/proj
It pulls the package from the Mercurial server, picks the code base with the tag "deployment" and installs it into proj inside the user's home directory.
The problem, however, is that regardless of whether there is an update available or not, pip always uninstalls the package and reinstalls it. This makes it difficult to decide whether the process needs to be restarted or not if nothing actually changed. In addition, pip always gets stuck with the message that hg clone in ./proj/yarely exists with URL... and asks me: What to do? (s)witch, (i)gnore, (w)ipe, (b)ackup. Now this is not ideal, as (1) it would be an automated process without a user prompt, and (2) it should only pull the repository if there was an update in the first place, to reduce traffic in the network and not overload our Mercurial server. I believe that a pull instead of a clone, if there was a local copy of the repository already, would be more appropriate and could potentially solve the problem.
I wasn't able to find an elegant and nice solution to this problem. Does anyone have a pointer or suggestion how this could be achieved?
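For illustration, the pull-instead-of-clone behaviour I have in mind would look roughly like this (a sketch, assuming hg is on the PATH and the repository is already checked out at an illustrative path):

import os
import subprocess

REPO = os.path.expanduser('~/proj/package')  # illustrative checkout path

def update_if_needed():
    # 'hg incoming' exits with 0 when there are new remote changesets
    # and with 1 when the local clone is already up to date.
    if subprocess.call(['hg', 'incoming', '-q', '-R', REPO]) == 0:
        subprocess.check_call(['hg', 'pull', '-u', '-R', REPO])
        return True   # caller should restart the process
    return False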
Would it be possible to create a python module that lazily downloads and installs submodules as needed? I've worked with "subclassed" modules that mimic real modules, but I've never tried to do so with downloads involved. Is there a guaranteed directory that I can download source code and data to, that the module would then be able to use on subsequent runs?
To make this more concrete, here is the ideal behavior:
User runs pip install magic_module and the lightweight magic_module is installed to their system.
User runs the code import magic_module.alpha
The code goes to a predetermined URL, is told that there is an "alpha" subpackage, and is then given the URLs of the alpha.py and alpha.csv files.
The system downloads these files to somewhere that it knows about, and then loads the alpha module.
On subsequent runs, the user is able to take advantage of the downloaded files to skip the server trip.
At some point down the road, the user could run import magic_module.alpha; alpha._upgrade() from the command line to clear the cache and get the latest version.
Is this possible? Is this reasonable? What kinds of problems will I run into with permissions?
Doable, certainly. The core feature will probably be import hooks. The relevant module would be importlib in Python 3.
Extending the import mechanism is needed when you want to load modules that are stored in a non-standard way. Examples include [...] modules that are loaded from a database over a network.
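To make that concrete, a minimal PEP 451-style sketch might look like this. The base URL is a made-up placeholder, it assumes the lightweight magic_module package is already installed, and a real version would cache and verify the downloaded source rather than re-fetching it on every run:

import sys
import urllib.request
import importlib.abc
import importlib.util

BASE_URL = 'http://example.com/modules'  # hypothetical server

class RemoteFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    # Handles submodules of the (already installed) magic_module package.
    prefix = 'magic_module.'

    def find_spec(self, name, path=None, target=None):
        if name.startswith(self.prefix):
            return importlib.util.spec_from_loader(name, self)
        return None  # let the normal machinery handle everything else

    def create_module(self, spec):
        return None  # default module creation is fine

    def exec_module(self, module):
        # A real implementation would cache the source to disk and verify
        # it (e.g. against a signature) instead of re-fetching each time.
        sub = module.__name__[len(self.prefix):]
        source = urllib.request.urlopen(f'{BASE_URL}/{sub}.py').read()
        exec(compile(source, module.__name__, 'exec'), module.__dict__)

sys.meta_path.append(RemoteFinder())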
Convenient, probably not. The import machinery is one of the parts of python that has seen several changes over releases. It's undergoing a full refactoring right now, with most of the existing things being deprecated.
Reasonable, well it's up to you. Here are some caveats I can think of:
Tricky to get right, especially if you have to support several python versions.
What about error handling? Should applications be prepared for imports to fail in normal circumstances? Should they degrade gracefully? Or just crash and spew a traceback?
Security? Basically you're downloading code from someplace, how do you ensure the connection is not being hijacked?
How about versioning? If you update some of the remote modules, how can you make the application download the correct version?
Dependencies? Pushing of security updates? Permissions management?
Summing it up, you'll have to solve most of the issues of a package manager, along with securing downloads and permissions issues of course. All those issues are tricky to begin with, easy to get wrong with dire consequences.
So with all that in mind, it really comes down to how much resources you deem worth investing into that, and what value that adds over a regular use of readily available tools such as pip.
(the permission question cannot really be answered until you come up with a design for your package manager)
I'm writing a small web app in which I'd like to include the ability to download itself. The ideal solution would be for users to be able to "pip install" the full app, while users of the app could download a version of it to use themselves (perhaps with reduced functionality or without some of the less essential dependencies).
I'm currently using Bottle as I'd like to keep everything as close to the standard library as possible. Users could be on different platforms or Python versions, which are other reasons for minimising the use of extra modules. (Though I'll assume 2.7 or 3.3 will be in use regardless of platform).
My current thinking is to have the app use __file__ or similar and zip itself up. It could also use setuptools/distribute and call sdist on itself. Users could then execute the zip file, or install the app using the source distribution. (ideally I'd like to provide both of these options).
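For illustration, the self-zipping idea might look roughly like this (a sketch, assuming the app is a package with a top-level __main__.py so the resulting zip is runnable):

import os
import zipfile

def make_self_zip(dest='myApp.zip'):  # hypothetical helper
    # Zip the directory containing this module so users can run
    # `python myApp.zip` (requires a top-level __main__.py in the package).
    pkg_dir = os.path.dirname(os.path.abspath(__file__))
    with zipfile.ZipFile(dest, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(pkg_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, pkg_dir))
    return dest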
The app would include aggressive import checking to fallback to available modules, with Bottle being the only requirement (and would be included in the downloaded file).
Can anyone think of a robust approach to providing this functionality?
Update: users of the app cannot be guaranteed to have internet access at all times, hence the requirement for being able to download a version of the app from someone who has previously installed it. Python experience cannot be assumed either, hence the idea of letting users run python myApp.zip to execute their own version.
Update II: as the level of Python experience also cannot be guaranteed, I'd want the simplest way for a user to get a mostly working version of the app. Experienced users would then be free to 'upgrade' the app by installing their own choice of additional modules. The vast majority of these would be different servers to host the app from (CherryPy, Twisted, etc.) and so would not strictly count as dependencies but as "nice to haves".
Update III: based on the answer below I will look into a PyPI/buildout based solution but would still be interested in whether there is a specific solution to the above approach.
Just package your app and put it on PyPI. Trying to automatically package the code running on the server seems over-engineered. Then you can let people use pip to install your app. In your app, provide a link to the PyPI page.
Then you can also add dependencies in the setup.py, and pip will install them for you. It seems like you are trying to build your own packaging infrastructure, but you don't have to. Use what's out there.
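For instance, a bare-bones setup.py along these lines (names and versions are illustrative) is all pip needs to resolve the dependencies:

from setuptools import setup, find_packages

setup(
    name='myapp',          # illustrative names and versions
    version='0.1',
    packages=find_packages(),
    install_requires=['bottle>=0.12'],  # pip installs this automatically
)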