PyPI can be unreliable. I've had an unfortunate number of Travis-CI builds fail because pip fails to install one of my requirements (lxml is the most notorious offender).
Various online resources recommend the --use-mirrors flag, which has solved the issue for me thus far. However, --use-mirrors is deprecated for a number of good reasons.
Unfortunately, as mentioned in the link, one of the primary reasons for the removal of the flag is that the new CDN-backed PyPI shouldn't have the same issues. It does. I still have issues with my builds, and I still can't reliably install packages with pip unless I use --use-mirrors.
The release notes for release 1.5 on 2014-01-01 recommend using one of the flags -i/--index-url or --extra-index-url. Which is great, except... we run into some of the same issues that --use-mirrors had, namely that these mirrors can't necessarily be trusted.
The PyPI mirrors list has actually been removed, leaving us with some unofficial mirrors. Thus I'm left with a choice: keep using --use-mirrors and hope that one of the issues above is fixed before it is removed, or pick a mirror and hope it works and is trustworthy.
Is there a widely accepted and trusted mirror? Or a widely accepted and trusted alternative? Basically, how should I handle this problem?
Frankly, I have never faced the issue that you are describing, so I do not know what to do to resolve issues with the public PyPI index.
However, as a general practice, I can recommend the following, which is what we use when deploying (the systems we deploy to do not have access to the Internet):
Create a local PyPI mirror and publish your packages there. You can do this in many ways: the simplest approach is basket, or you can do what we did and create your own PyPI mirror (see: How to roll my own pypi? for some suggestions).
Use wheel. This is what we are migrating to, because the install process is super simple and does not rely on other servers.
I know that having a global PyPI index is a great convenience, but as part of a deployment build chain I would only use it as a backup: for one, it is on a network I do not control (hence it may be unreachable or unreliable), and more importantly, my systems may not need access to the Internet during the build process.
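For illustration, here is roughly how both approaches look from pip's side; the internal index URL and the wheelhouse path below are placeholders, not real endpoints:
# pip.conf -- point pip at the internal index only (URL is a placeholder)
[global]
index-url = https://pypi.internal.example.com/simple
# build wheels once for everything in requirements.txt, then install offline
pip wheel --wheel-dir=/opt/wheelhouse -r requirements.txt
pip install --no-index --find-links=/opt/wheelhouse -r requirements.txt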
Related
Rather by accident, I found myself in a situation in a previous role where the previous admin had apparently installed "Python bindings" for InfluxDB and Docker-Compose, and magically both applications were available on the systems, while I was sure that they were written in Go.
I had a few issues with that:
It's incomprehensible what is happening here: there should be some Go binary belonging to the application, but I can't find it by name. I doubt that docker-compose and influxdb have been rewritten in Python just to have one more installation option available, especially since static docker-compose binaries are available on GitHub for direct download. It doesn't make a lot of sense to me.
Undermining security guidelines set by the organization and best practices for systems administration.
Dependency Confusion
Links to the packages on PyPI:
https://pypi.org/project/influxdb/
https://pypi.org/project/docker-compose/
I haven't looked into Python wheels and packaging before, beyond Debian packaging; I just got curious and wanted to get to the bottom of this strange usage pattern.
Docker-Compose refers to https://github.com/docker/compose, a project consisting of 95.5% Go code according to GitHub, which isn't really helpful, since the source package and wheel package on PyPI look completely different and at first sight I'm overwhelmed by the number of Python files. InfluxDB seems to be a better example, but I would really appreciate help from a Python developer or package maintainer explaining to me what is happening there. Thanks.
Edit 2022-09-10:
From the show notes of Security Now 887: https://www.grc.com/sn/sn-887-notes.pdf
a researcher at Checkmarx noted in a technical report they published last week that "A worrying feature in pip/PyPI allows code to automatically run when developers are merely downloading a package." He added that the feature is alarming because "a great deal of the malicious packages we are finding in the wild use this feature of code execution upon installation to achieve higher infection rates."
With my preexisting misconception about some PyPI packages like docker-compose, that sounded alarming to me.
The following article mentions that compiled libraries from C, Rust, Go and others can be bundled in packages, but not entire applications "hidden" as artifacts, as I had assumed: https://realpython.com/python-wheels/
I searched a bit but could not find a clear answer.
The goal is to have two pip indexes: one is a private index, which has first priority, and the other is the standard PyPI. The priority is there to prevent the security risk of code injection.
Say I have a library named lib, and I configure index_url = http://my_private_pypi_repo and extra_index_url = https://pypi.org/simple.
If I pip install lib and lib exists in both indexes, which index gets priority? Where is it going to be installed from?
Also, if I pip install lib==0.0.2 but lib exists in my private index only at version 0.0.1, is it going to look at PyPI as well?
And what is a good way to ensure that certain libraries are only fetched from the private index if they exist there, and are never looked up on PyPI?
The short answer is: there is no prioritization and you probably should avoid using --extra-index-url entirely.
This is asked and answered here: https://github.com/pypa/pip/issues/5045#issuecomment-369521345
Question:
I have this in my pip.conf:
[global]
index-url = https://myregistry-xyz.com
extra-index-url = https://pypi.python.org/pypi
Let's assume packageX exists in both registries and I run pip install packageX.
I expect pip to install packageX from https://myregistry-xyz.com, but pip will use https://pypi.python.org/pypi instead.
If I switch the values for index-url and extra-index-url I get the same result. pypi is always prioritized.
Answer:
Packages are expected to be unique up to name and version, so two wheels with the same package name and version are treated as indistinguishable by pip. This is a deliberate feature of the package metadata, and not likely to change.
I would also recommend reading this discussion: https://discuss.python.org/t/dependency-notation-including-the-index-url/5659
There are quite a lot of things addressed in this discussion, some of which are clearly out of scope for this question, but everything is very informative anyway.
The key takeaway for you is this:
Pip does not really prioritize one index over the other in theory. In practice, because of a coincidence in the way things are implemented in code, it might be that one is always checked first, but it is not a behavior you should rely on.
And what is a good way to ensure that certain libraries are only fetched from the private index if they exist there, and are never looked up on PyPI?
You should set up and curate your own package index (devpi, pydist, jfrog artifactory, sonatype nexus, etc.) and use it exclusively, meaning: never use --extra-index-url. This is the only way you can have exact control over what gets downloaded. This custom repository might function mostly as a proxy for the public PyPI, except for a couple of dependencies.
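As a minimal sketch, the pip configuration then only ever mentions your own index (the URL below is a placeholder for whatever devpi/Artifactory/Nexus instance you run):
# pip.conf -- one curated index, no extra-index-url at all
[global]
index-url = https://pypi.internal.example.com/simple
With that in place, pip install lib==0.0.2 either finds the package in your curated index (possibly proxied from PyPI) or fails; it never silently falls back to another index.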
Related:
pip: selecting index url based on package name?
The title of this question feels a bit like an instance of the XY problem [1]. If you elaborate more on what you want to achieve and what your constraints are, we may be able to give you a better answer.
That said, sinoroc's suggestion to curate your own package index and use only that is a good one. A few other ideas also come to mind:
Update: it turns out pip may run distributions other than those in the constraints file, so this method should probably be considered insecure. Additionally, hashes are kind of broken on recent releases of pip.
Using a constraints file with hashes. This file can be generated using pip-tools, e.g. pip-compile --generate-hashes, assuming you have documented your dependencies in a file named requirements.in. You can then install packages like pip install -c requirements.txt some_package (a short sketch follows this list).
Pro: What may be installed is documented alongside your code in your VCS.
Con: Controlling what is downloaded the first time is either tricky or laborious.
Con: Hash checking can be slow.
Con: You run into issues more frequently than when not using hashes. Some can be worked around, others cannot; it is, for instance, not possible to combine constraints like -e file:// with hashes.
Use an alternative packaging tool like pipenv. It works similarly to the previous suggestion.
Pro: Easy to use
Con: Harder to integrate into your workflow if it does not fit naturally.
Curate packages locally. Packages and dependencies can be downloaded like pip download --dest some_dir some_package and installed like pip install --no-index --find-links some_dir.
Pro: What may be installed can be documented alongside your code, if you track the artifacts in VCS e.g. git lfs.
Con: Either all packages are downloaded or none are.
Use a hermetic build system. I know bazel advertises this as a feature; not sure about others like pants and buck.
Pro: May be the ultimate solution if you want control over your builds.
Con: Does not integrate well with the open source Python ecosystem, AFAIK.
Con: A lot of overhead.
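A rough sketch of the constraints-plus-hashes workflow from the first suggestion (the package names are only examples, and the caveats from the update above still apply):
# requirements.in documents the direct dependencies
requests>=2.0,<3.0
# pin everything, with hashes, using pip-tools
pip-compile --generate-hashes requirements.in
# install a package against those pins
pip install -c requirements.txt some_package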
1: https://en.wikipedia.org/wiki/XY_problem
Question
How can I install the lowest possible version of a dependency using pip (or any other tool for that matter), given some set of version specifiers? So if I were to specify requests>=2.0,<3.0, I would actually want to install requests==2.0.
Motivation
When creating a Python package, you probably want to enable users to use it in a variety of environments. This means that you'll often claim to support many versions of a given dependency. For example, one might claim that their package supports requests>=2.0. The problem here is that, to responsibly make this claim, we need some way of testing that our package works not only with requests==2.0 but also with the latest version and everything in between.
Unfortunately I don't think there's a good solution for testing one's support for all possible combinations of dependencies, but you can at least try to test the maximum and minimum versions of your dependencies.
To solve this problem I usually create a requirements.min.txt file that contains the lowest version of all my dependencies, so that I can install those when I want to test that my changes haven't broken my support for older versions. Thus, if I claimed to support requests>=2.0, my requirements.min.txt file would have requests==2.0.
This solution to the problem of testing against one's minimum dependency set feels messy, though. Has anyone found a better solution?
not a full answer, but something that might help you get started...
pip install pkg_name_here==
This will output all versions available on PyPI, which you can then redirect to a string/list and split out the versions, then install any that fall between your min/max supported versions using some kind of conditional statement (something like the sketch below).
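Here is a rough sketch of that idea in Python, using PyPI's JSON API and the packaging library instead of scraping pip's output; the package name and specifier are just examples:
# pick the lowest release of a package that satisfies a version specifier
import json
import urllib.request

from packaging.specifiers import SpecifierSet
from packaging.version import InvalidVersion, Version

def lowest_matching_version(name, spec):
    # fetch the list of released versions from PyPI's JSON API
    with urllib.request.urlopen(f"https://pypi.org/pypi/{name}/json") as resp:
        releases = json.load(resp)["releases"]
    versions = []
    for v in releases:
        try:
            versions.append(Version(v))
        except InvalidVersion:
            continue  # skip legacy, non-PEP-440 version strings
    matching = sorted(v for v in versions if v in SpecifierSet(spec))
    return matching[0] if matching else None

print(lowest_matching_version("requests", ">=2.0,<3.0"))  # e.g. 2.0 -> pip install requests==2.0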
Said feature does not yet exist in pip (at the time of writing).
However, there is an ongoing discussion https://github.com/pypa/pip/issues/8085 and PR https://github.com/pypa/pip/pull/11336 to implement it as --version-selection=min.
I hope it will be merged eventually.
I put a package on PyPI for the first time ~2 months ago and have made some version updates since then. This week I noticed the download counts and was surprised to see the package had been downloaded hundreds of times. Over the next few days, I was more surprised to see the download count increasing by sometimes hundreds per day, even though this is a niche statistical test toolbox. In particular, older versions of the package are continuing to be downloaded, sometimes at higher rates than the newest version.
What is going on here?
Is there a bug in PyPI's download counting, or is there an abundance of crawlers grabbing open source code (as mine is)?
This is kind of an old question at this point, but I noticed the same thing about a package I have on PyPI and investigated further. It turns out PyPI keeps reasonably detailed download statistics, including (apparently slightly anonymised) user agents. From that, it was apparent that most people downloading my package were things like "z3c.pypimirror/1.0.15.1" and "pep381client/1.5". (PEP 381 describes a mirroring infrastructure for PyPI.)
I wrote a quick script to tally everything up, first including all of them and then leaving out the most obvious bots, and it turns out that literally 99% of the download activity for my package was caused by mirrorbots: 14,335 downloads total, compared to only 146 downloads with the bots filtered. And that's just leaving out the very obvious ones, so it's probably still an overestimate.
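The tallying itself needs very little code; a sketch along these lines is enough (the input format is hypothetical: one user-agent string per line on stdin):
# count download user agents and filter out the obvious mirror bots
import sys
from collections import Counter

MIRROR_MARKERS = ("z3c.pypimirror", "pep381client")

counts = Counter(line.strip() for line in sys.stdin if line.strip())
total = sum(counts.values())
filtered = sum(n for agent, n in counts.items()
               if not any(marker in agent for marker in MIRROR_MARKERS))
print(f"total downloads: {total}, with mirror bots filtered out: {filtered}")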
It looks like the main reason PyPI needs mirrors is because it has them.
Starting with Cairnarvon's summarizing statement:
"It looks like the main reason PyPI needs mirrors is because it has them."
I would slightly modify this:
It might be more that the way PyPI actually works, and thus has to be mirrored, contributes an additional bit (or two :-) to the real traffic.
At the moment I think you MUST interact with the main index to know what to update in your repository. State is not simply accessible through timestamps on some publicly accessible folder hierarchy. So, the bad thing is, rsync is out of the equation. The good thing is, you MAY talk to the index through JSON, OAuth, XML-RPC or HTTP interfaces.
For XML-RPC:
$> python
>>> import xmlrpclib
>>> import pprint
>>> client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
>>> client.package_releases('PartitionSets')
['0.1.1']
For JSON, e.g.:
$> curl https://pypi.python.org/pypi/PartitionSets/0.1.1/json
If there are approx. 30,000 packages hosted [1], with some being downloaded 50,000 to 300,000 times a week [2] (like distribute, pip, requests, paramiko, lxml, boto, redis and others), you really need mirrors, at least from an accessibility perspective. Just imagine what a user does when pip install NeedThisPackage fails: wait? Also, company-wide PyPI mirrors are quite common, acting as proxies for otherwise unroutable networks. Finally, not to forget the wonderful multi-version checking enabled through virtualenv and friends. These are all, IMO, legitimate and potentially wonderful uses of packages ...
In the end, you never know what an agent really does with a downloaded package: do N users really use it, or is it just overwritten next time ... and after all, IMHO package authors should care more about the number and nature of uses than the pure number of potential users ;-)
Refs: The guesstimated numbers are from https://pypi.python.org/pypi (29303 packages) and http://pypi-ranking.info/week (for the weekly numbers, requested 2013-03-23).
You also have to take into account that virtualenv is getting more popular. If your package is something like a core library that people use in many of their projects, they will usually download it multiple times.
Consider a single user with 5 projects that use your package, each living in its own virtualenv. Using pip to meet the requirements, your package is already downloaded 5 times this way. Then these projects might be set up on different machines (work, home and laptop computers), and in addition there might be a staging and a live server in the case of a web application. Summing this up, you end up with many downloads by a single person.
Just a thought... perhaps your package is simply good. ;)
Hypothesis: CI tools like Travis CI and Appveyor also contribute quite a bit. It might mean that each commit/push leads to a build of a package and the installation of everything in requirements.txt
One issue that comes up during Pinax development is dealing with development versions of external apps. I am trying to come up with a solution that doesn't involve bringing in the version control systems. The reason being, I'd rather not have to install all the possible version control systems on my system (or force that upon contributors) and deal with the problems that might arise during environment creation.
Take this situation (knowing how Pinax works will be beneficial to understanding):
We are beginning development on a new version of Pinax. The previous version has a pip requirements file with explicit versions set. A bug comes in for an external app that we'd like to get resolved. To get that bug fix into Pinax, the current process is to simply make a minor release of the app, assuming we have control of the app. For apps we don't control, we just deal with the release cycle of the app author or force them to make releases ;-). I am not too fond of constantly making minor releases for bug fixes, as in some cases I'd like to be working on new features for apps as well. Of course, branching the older version is what we do, and then we do backports as we need.
I'd love to hear some thoughts on this.
Could you handle this using the "==dev" version specifier? If the distribution's page on PyPI includes a link to a .tgz of the current dev version (such as both GitHub and Bitbucket provide automatically) and you append "#egg=project_name-dev" to the link, both easy_install and pip will use that .tgz if ==dev is requested.
This doesn't allow you to pin to anything more specific than "most recent tip/head", but in a lot of cases that might be good enough?
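For illustration, such a link would look something like this; the GitHub organisation and project name are made up, and resolving it this way relies on the old easy_install/pip behaviour described above:
# dev-version tarball link with the egg fragment appended
https://github.com/example-org/project_name/tarball/master#egg=project_name-dev
# then requesting the dev version picks up that tarball
pip install "project_name==dev"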
I meant to mention that the solution I had considered before asking was to put up a Pinax PyPI and make development releases on it. We could put up an instance of chishop. We are already using pip's --find-links to point at pypi.pinaxproject.com for packages we've had to release ourselves.
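Concretely, that is just a --find-links option pointing at the extra package page, either on the command line or at the top of a requirements file (the package name below is a placeholder, and the URL scheme is assumed):
pip install --find-links http://pypi.pinaxproject.com some-pinax-app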
Most open source distributors (the Debians, Ubuntus, MacPorts, et al.) use some sort of patch management mechanism. So something like: import the base source code for each package as released, as a tarball or as an SCM snapshot. Then manage any necessary modifications on top of it using a patch manager, like quilt or Mercurial's Queues. Then bundle up each external package with any applied patches in a consistent format. Or have URLs to the base packages and URLs to the individual patches and have them applied during installation. That's essentially what MacPorts does.
EDIT: To take it one step further, you could then version control the set of patches across all of the external packages and make that available as a unit. That's quite easy to do with Mercurial Queues. Then you've simplified the problem to just publishing one set of patches using one SCM system, with the patches applied locally as above or available for developers to pull and apply to their copies of the base release packages.
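As a sketch of what that looks like with quilt (the patch and file names are made up):
# start a new patch on top of the unpacked upstream release
quilt new fix-login-redirect.patch
# register the file before editing it, then edit it
quilt add external_app/views.py
# record the changes into the patch
quilt refresh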
EDIT: I am not sure I am reading your question correctly so the following may not answer your question directly.
Something I've considered, but haven't tested, is using pip's freeze bundle feature. Perhaps using that and distributing the bundle with Pinax would work? My only concern would be how different OS's are handled. For example, I've never used pip on Windows, so I wouldn't know how a bundle would interact there.
The full idea I hope to try is creating a paver script that controls management of the bundles, making it easy for users to upgrade to newer versions. This would require a bit of scaffolding though.
One other option may be for you to keep a mirror of the apps you don't control in a consistent VCS and then distribute your mirrored versions. This would take away the need for "everyone" to have many different programs installed.
Other than that, it seems the only real solution is what you guys are doing, there isn't a hassle-free way that I've been able to find.