Pip install undetected failure modes

Pip install undetected failure modes - python

We have a fleet of networked machines with packages installed via pip install, normally pip install -r requirements.txt. We regularly find that a package is incorrectly installed, often in the cache directories too. Typical problems are zero length files.
I would expect that the problem here is out-of-disk, but some of the machines in question have never been low on disk.
What I do know is that we have previously had programmers who need to be educated to check the return codes of Linux utilities - maybe they didn't?
My question is this, to the knowledgeable: can I expect pip always to report failures such as these (and others), and will it do so via its exit code?
Supplementary question: if pip cannot be relied upon as above, is there a reasonably foolproof way to check that library modules have installed properly?

If I understood correctly for each project installed by pip there should be a *.dist-info/RECORD file containing the names of the files that should have been installed along with their hash and size in bytes. This info can be used to double-check that the installation went well.
I doubt it would really help in your case, but at another level (download) maybe have a look at pip's "Hash-Checking Mode" as well as pip's documentation section on "Ensuring Repeatability".
Otherwise I would suggest simply running a test suite against the installed projects to check for sanity.
Update (2020-03-05)
PEP 458 might help in the future.

Related

What is the best way to make a clean reinstall of Python on Windows?

I tried to update Python 3.8.5. to 3.8.10 on a Windows 7 machine, but some part of Python's and/or pip's messy installer/path/package management system bricked everything. Nobody I asked knows a canonical solution to this + pretty much everybody is suggesting a complete reinstallation.
Which is why I've now completely removed Python and have to reinstall Python, pip, and all my packages one by one. I've already uninstalled/removed Python and pip and downloaded the official Python 3.8.10 64-bit Windows installer as well as get-pip.py.
But despite spending days and days of reading, I can't see through Python's complicated mess of "user-specific vs. local vs. system-wide" installation schemes, varying package installations paths, the seemingly arbitrary variations introduced by using python vs. python -m, pip install vs. pip install --user etc. during package installations, and the regular whining about some PATH environment variables not being set properly etc. pp. - if you've ever used Python professionally, you'll have an idea of what I'm describing here.
Anyway - what I want to do now is make one clean installation where I stick to one set of rules for everything. All packages installed to one single superdirectory (vs. getting scattered all over the system) and all PATH variables set accordingly to the most universal and complete configuration possible (I don't want to see any complaints from Python ever again in this regard). Note that I'm the administrator of the machine, but working from a normal user account with Windows UAC enabled and want an installation for all users - the most general solution possible, no limiting scenarios that may cause the very problems I'm trying to avoid.
Also, I do not want to use virtual environments for now, but this is different topic I'm already working on independently. So no suggestions regarding venv.
Question: How to procede with the installation?
Possible sub-issues that need to be addressed:
Correct privileges for the Windows installer, e.g. the confusing "for all users (requires elevation)" and subsequent (second!) "Install for all users" options. The latter changes the installation path from C:\Users\<Username>\AppData\Local\Programs\Python\Python38 to C:\Program Files\Python38 and Windows UAC may prevent access of Python and/or pip to C:\Program Files\ without proper exception handling (e.g. user prompt) in place:
Usage of get-pip.py vs. some other methods to install pip, which also concerns the usage of python vs. python -m vs. pip install vs. pip install --user etc. in this context and subsequent installation of user packages.
Prevention of scattering/fragmentation of the Python development framework over the system/different folders/different users, causing e.g. annoying PATH or dependency issues/conflicts.
Defining the correct set of Windows path variables under these requirements and addressing concerns/doubts about Python's and pip's ability of reliably handling this issue on their own.
Note: I'm the owner/only user on this machine and therefore have administrator rights. Managing installations/environments for multiple users is not the subject here and of no interest for me.

Anyway - what I want to do now is make one clean installation where I stick to one set of rules for everything. All packages installed to one single superdirectory (vs. getting scattered all over the system) and all PATH variables set accordingly to the most universal and complete configuration possible (I don't want to see any complaints from Python ever again in this regard).
This is contradictory. A single installation with a common set of rules, every possible installed package, comprehensive PATH etc. is inherently not "clean". The point of maintaining separate installations is so that one user's changes do not unexpectedly impact on the operation of another user's code. It is not possible to do this with a single installation. If you upgrade a library, for example, it affects every user who wrote code that uses that library, who must now check for incompatibilities.
Correct privileges for the Windows installer, e.g. the confusing "for all users (requires elevation)" and subsequent (second!) "Install for all users" options.
The first one means "Install the launcher for all users", which is why that checkbox is on the same line as "py launcher". (If you are not familiar with the Python launcher for Windows, please read the documentation.) it says "(requires elevation)" because the installer requires elevation in order to install this feature for all users. You do not need to do this if you have installed it from a previous version of Python.
and subsequent (second!) "Install for all users" options.
This is the option to install this version of Python for all users. The install directory changes according to the "for all users" setting, in the expected way.
Windows UAC may prevent access of Python and/or pip to C:\Program Files\ without proper exception handling (e.g. user prompt) in place
It will prevent write access without elevation, yes. This is by design, and it is why the option exists for per-user installations.
pip requires write access, because its purpose is to install libraries. I do not know whether it can request elevation from the command prompt; probably not. You can work around this by using the command in an elevated command window. With UAC enabled, you will not be able to install libraries into a Python for-all-users installation (i.e., in Program Files): it is the explicit purpose of UAC to prevent that sort of thing.
Python normally does not require write access to these directories. The standard library does not need write access once the library is installed. There might be a problem if you uncheck "Precompile standard libraries"; I have not tested this and have stopped using Windows. Third-party libraries normally will not require write access for their own installation directories, either. If you encounter a problem, consult the documentation for that library.
Usage of get-pip.py vs. some other methods to install pip, which also concerns the usage of python vs. python -m vs. pip install vs. pip install --user etc. in this context and subsequent installation of user packages.
get-pip etc. are deprecated. They are tools intended to account for the fact that Python didn't always come bundled with pip. It does now. pip works the same way regardless of whether it was bundled with Python or installed to an older Python version separately. There is not a clear question here; there are several specific things that need to be researched and understood about how to use pip.

python pip priority order with index-url and extra-index-url

I searched a bit but could not find a clear answer.
The goal is, to have two pip indexes, one is a private index, that will be a first priority. And one is the standard PyPI. The priority is there to prevent the security risk of code injection.
Say I have library named lib, and I configure index_url = http://my_private_pypi_repo and extra_index_url = https://pypi.org/simple
If I pip install lib, and lib exists in both indexes. What index will get the priority? From where it is going to be installed from?
Also, if I pip install lib=0.0.2 but lib exists in my private index at version 0.0.1. Is it going to look at PyPI as well?
And what is a good way to be in control, that certain libraries will only be fetched from the private index if they exists there, and will not be looked for at PyPI?

The short answer is: there is no prioritization and you probably should avoid using --extra-index-url entirely.
This is asked and answered here: https://github.com/pypa/pip/issues/5045#issuecomment-369521345
Question:
I have this in my pip.conf:
[global]
index-url = https://myregistry-xyz.com
extra-index-url = https://pypi.python.org/pypi
Let's assume packageX exists in both registries and I run pip install packageX.
I expect pip to install packageX from https://myregistry-xyz.com, but pip will use https://pypi.python.org/pypi instead.
If I switch the values for index-url and extra-index-url I get the same result. pypi is always prioritized.
Answer:
Packages are expected to be unique up to name and version, so two wheels with the same package name and version are treated as indistinguishable by pip. This is a deliberate feature of the package metadata, and not likely to change.
I would also recommend reading this discussion: https://discuss.python.org/t/dependency-notation-including-the-index-url/5659
There are quite a lot of things that are addressed in this discussion, some that is clearly out of scope for this question, but everything is very informative anyway.
In there, there should be the key takeaway for you:
Pip does not really prioritize one index over the other in theory. In practice, because of a coincidence in the way things are implemented in code, it might be that one is always checked first, but it is not a behavior you should rely on.
And what is a good way to be in control, that certain libraries will only be fetched from the private index if they exists there, and will not be looked for at PyPI?
You should setup and curate your own package index (devpi, pydist, jfrog artifactory, sonatype nexus, etc.) and use it exclusively, meaning: never use --extra-index-url. This is the only way you can have exact control over what gets downloaded. This custom repository might function mostly a proxy for the public PyPI, except for a couple of dependencies.
Related:
pip: selecting index url based on package name?

The title of this question feels a bit like an instance of XY problem. If you would elaborate more on what you want to achieve and what your constraints are we may be able to give you a better answer.
That said, sinoroc's suggestion to curate your own package index and use only that is a good one. A few other ideas also come to mind:
Update: Turns out pip may run distributions other than those in the constraints file so this method should probably be considered insecure. Additionally hashes are kind of broken on recent releases of pip.
Using a constraints file with hashes. This file can be generated using pip-tools like pip-compile --generate-hashes assuming you have documented your dependencies in a file named requirements.in. You can then install packages like pip install -c requirements.txt some_package.
Pro: What may be installed is documented alongside your code in your VCS.
Con: Controlling what is downloaded the first time is either tricky or laborious.
Con: Hash checking can be slow.
Con: You run into issues more frequently than when not using hashes. Some can be worked around others cannot; it is for instance not possible to combine constraints like -e file://` with hashes.
Use an alternative packaging tool like pipenv. It works similarly to the previous suggestion.
Pro: Easy to use
Con: Harder to integrate into your workflow if it does not fit naturally.
Curate packages locally. Packages and dependencies can be downloaded like pip download --dest some_dir some_package and installed like pip install --no-index --find-links some_dir.
Pro: What may be installed can be documented alongside your code, if you track the artifacts in VCS e.g. git lfs.
Con: Either all packages are downloaded or none are.
Use a hermetic build system. I know bazel advertise this as a feature, not sure about others like pants and buck.
Pro: May be the ultimate solution if you want control over your builds.
Con: Does not integrate well with open source python ecosystem afaik.
Con: A lot of overhead.
1: https://en.wikipedia.org/wiki/XY_proble

How to install a Python package with documentation?

I'm tryin' to find a way to install a python package with its docs.
I have to use this on machines that have no connection to the internet and so online help is not a solution to me. Similar questions already posted here are telling that this is not possible. Do you see any way to make this easier as I'm currently doing this:
downloading the source archive
extracting the docs folder
running sphinx
launching the index file from a browser (firefox et al.)
Any ideas?
P.S. I'm very new to Python, so may be I'm missing something... And I'm using Windows (virtual) machines...
Edit:
I'm talking about two possible ways to install a package:
installing the package via easy_install (or any other to me unknown way) on a machine while I'm online, and then copying the changes to my installation to the target machine
downloading the source package (containing sphinx compatible docs) and installing the package on the target machine off-line
But in any case I do not know a way to install the package in a way that the supplied documentations are installed alltogether with module!
You might know that there exists a folder for the docs: <python-folder>/Doc which will contain only python278.chm after installation of Python 2.78 on Windows. So, I expect that this folder will also contain the docs for a newly installed package. This will avoid looking at docs for a different package version on the internet as well as my specific machine setup problems.
Most packages I'm currently using are supplied with documentation generated with sphinx, and their source package contains all the files necessary to generate the docs offline.
So what I'm looking for is some cli argument for a package installer like it's common for unix/linux based package managers. I did expect something like:
easy_install a_package --with-html-docs.

Here are some scenarios:
packages have documentation included within the zip/tar
packages have a -docs to download/install seperately
packages that have buildable documentation
packages that only have online documentation
packages with no documentation other than internal.
packages with no documentation anywhere.
The sneaky trick that you can use for options 1 & 3 is to download the package as a tar or zip and then use easy-install archive_name on the target machine this will install the package from the zip or tar file including (I believe) any documentation. You will find that there are dependencies that are unmet in some packages - those should give an error on the easy install mentioning what is missing - you will need to get those and use the same trick.
A couple of things that are very handy - virtual-env will let you have a library free version of python running so you can get the requirements and pip -d <dir> which will download without installing storing your packages in dir.
You should be able to use the same trick for option 2.
With packages that only have on-line documentation you could look to see if there is a downloadable version or could scrape the web pages and use a tool like pandoc to convert to something useful.
In the 5 scenario I would suggest raising a ticket on the package stating that lack of accessible documentation makes it virtually unusable and running sphinx on it.
In scenario 6 I suggest raising the ticket but missing out virtually and avoiding the use of that package on the basis that if it has no documentation it probably has a lot of other problems as well - if you are a package author feeling slandered reading this then you should be feeling ashamed instead.
Mirror/Cache PyPi
Another possibly is to have a linux box, or VM, initially outside of your firewall, running a cached or mirroring service e.g. pipyserver, install the required packages through it to populate the cache and then move it, (or its cache to another pip server), inside the firewall and you can then use pip with the documented settings to do all your installs inside the firewall. See also the answer here.

What are the risks of running 'sudo pip'?

Occasionally I run into comments or responses that state emphatically that running pip under sudo is "wrong" or "bad", but there are cases (including the way I have a bunch of tools set up) where it is either much simpler, or even necessary to run it that way.
What are the risks associated with running pip under sudo?
Note that this is not the same question as this one, which, despite the title, provides no information about risks. This also isn't a question about how to avoid using sudo, but about specifically why one would want to.

When you run pip with sudo, you run setup.py with sudo. In other words, you run arbitrary Python code from the Internet as root. If someone puts up a malicious project on PyPI and you install it, you give an attacker root access to your machine. Prior to some recent fixes to pip and PyPI, an attacker could also run a man in the middle attack to inject their code when you download a trustworthy project.

Besides obvious security risks (which I think are in fact low when you install software you know) mentioned in other answers, there is another reason. Python that comes with the system is part of this system and when you want to manage the system you use tools designated for system maintenance like package managers in the case of installing/upgrading/uninstalling software. When you start to modify system software with third party tools (pip in this instance) then you have no guarantee about the state of your system. Yet another reason is that sudo can bring you problems you wouldn't have a chance or have a very small chance to have otherwise. See for example Mismatch between sys.executable and sys.version in Python
Distros are aware of this problem and try to mitigate it. For example Fedora – Making sudo pip safe and Debian – dist-packages instead of site-packages.

Using pip that way means you trust it to the level you allow it to make anything to your system. Not only pip, but also any code it will download and execute from sources you may not trust and that can be malicious.
And pip doesn't need all that privileges, only the write access to specific files and directories. If you can't use your system's package manager and do not want to go the virtual environment way, you may create a specific user that has write privilege to the python installation directory and use it for pip. That way you better control what can pip do and not do. And you can use sudo -u for that!

The only thing "wrong" with sudo is that it, well, DOes as Super User ala root meaning you can potentially destroy an installation with the wrong command. As PIP is a package maintenance for a particular program you would need such access anyhow to make changes...

There are a few reasons that haven't been mentioned by other users but are still important.
Lack of code review amongst pip packages
The first reason is that PyPI packages (packages that you can install via pip) are not monitored or code-reviewed like you may be used to with other package managers. There have been many cases of malicious PyPI packages being published and then downloaded by thousands of users before being removed. If you happen to download one of these malicious packages as root then you are essentially giving the malware access to your entire system. Though this isn't an every day occurrence, it is still an attack vector to be aware of. You can learn more about this by reading about the concept of least privileges.
Running pip as root interferes with system-level packages
The second, and more important reason, is that running pip with sudo or as the root user will interfere with system-level packages and can disrupt the functionality of your system. Piotr Dobrogost's answer briefly mentions the effects that package managers can have on the state of your system, but I think a more in-depth explanation will help people better understand why this practice can be harmful.
Take for example a Linux distro that ships with Python 3.6 and the Python package cryptography to perform cryptographic operations. For illustrative purposes, imagine that the cryptography package version 1.0.0 is used by the system to hash passwords and allows users to log in. If version 1.0.1 of the same package introduces a regression that the system doesn't account for and you upgrade the global cryptography package by running sudo pip3 install -U cryptography, you accidentally just broke the ability for users to log in system-wide by introducing a regression on system dependencies.
This is a contrived example and would actually be easier to track down than most, but it is certainly a possible scenario. In the real world you would most likely break something less important, but the lesson is the same. In some scenarios this example would be easier to undo because you would know exactly what you broke when everything instantly stopped working, but you could end up breaking something that is much harder to track down and you might not find out until much later when you have no recollection of what you changed.
Why would you want to run pip with sudo?
I haven't seen anyone address the final question in your post, so I'll address it here. There are a few reasons why someone would want to run pip with sudo, but they are far more rare.
The first reason that people would want to do it this way is because people are lazy and it's a fast way to force the system to install the package you need. Say that someone needs to install the coloredlogs package because they absolutely have to have their logs be colored right now and they don't know anything about having a secure system. It's often much easier for inexperienced users to add sudo to the beginning of everything when it doesn't work because "it just works" rather than learning why it didn't work the first time.
The second reason, and only legitimate reason that I can think of, is if an admin needs to patch something system-wide. Say that a vulnerability is introduced in pip version 20.0.0 and there is a hotfix that fixes the issue in version 20.0.1. The system administrator probably doesn't want to wait for the distro to patch this for them and instead wants to patch it right now to mitigate the issue. In this scenario I think it would be safe for the system administrator to use python3 -m pip install --upgrade pip to update their version of pip, but they would need to be cautious to ensure there are no unintended consequences.

How to easily distribute Python software that has Python module dependencies? Frustrations in Python package installation on Unix

My goal is to distribute a Python package that has several other widely used Python packages as dependencies. My package depends on well written, Pypi-indexed packages like pandas, scipy and numpy, and specifies in the setup.py that certain versions or higher of these are needed, e.g. "numpy >= 1.5".
I found that it's immensely frustrating and nearly impossible for Unix savvy users who are not experts in Python packaging (even if they know how to write Python) to install a package like mine, even when using what are supposed to be easy to use package managers. I am wondering if there is an alternative to this painful process that someone can offer, or if my experience just reflects the very difficult current state of Python packaging and distribution.
Suppose users download your package onto their system. Most will try to install it "naively", using something like:
$ python setup.py install
Since if you google instructions on installing Python packages, this is usually what comes up. This will fail for the vast majority of users, since most do not have root access on their Unix/Linux servers. With more searching, they will discover the "--prefix" option and try:
$ python setup.py install --prefix=/some/local/dir
Since the users are not aware of the intricacies of Python packaging, they will pick an arbitrary directory as an argument to --prefix, e.g. "~/software/mypackage/". It will not be a cleanly curated directory where all other Python packages reside, because again, most users are not aware of these details. If they install another package "myotherpackage", they might pass it "~/software/myotherpackage", and you can imagine how down the road this will lead to frustrating hacking of PYTHONPATH and other complications.
Continuing with the installation process, the call to "setup.py install" with "--prefix" will also fail once users try to use the package, even though it appeared to have been installed correctly, since one of the dependencies might be missing (e.g. pandas, scipy or numpy) and a package manager is not used. They will try to install these packages individually. Even if successful, the packages will inevitably not be in the PYTHONPATH due to the non-standard directories given to "--prefix" and patient users will dabble with modifications of their PYTHONPATH to get the dependencies to be visible.
At this stage, users might be told by a Python savvy friend that they should use a package manager like "easy_install", the mainstream manager, to install the software and have dependencies taken care of. After installing "easy_install", which might be difficult, they will try:
$ easy_install setup.py
This too will fail, since users again do not typically have permission to install software globally on production Unix servers. With more reading, they will learn about the "--user" option, and try:
$ easy_install setup.py --user
They will get the error:
usage: easy_install [options] requirement_or_url ...
or: easy_install --help
error: option --user not recognized
They will be extremely puzzled why their easy_install does not have the --user option where there are clearly pages online describing the option. They might try to upgrade their easy_install to the latest version and find that it still fails.
If they continue and consult a Python packaging expert, they will discover that there are two versions of easy_install, both named "easy_install" so as to maximize confusion, but one part of "distribute" and the other part of "setuptools". It happens to be that only the "easy_install" of "distribute" supports "--user" and the vast majority of servers/sys admins install "setuptools"'s easy_install and so local installation will not be possible. Keep in mind that these distinctions between "distribute" and "setuptools" are meaningless and hard to understand for people who are not experts in Python package management.
At this point, I would have lost 90% of even the most determined, savvy and patient users who try to install my software package -- and rightfully so! They wanted to install a piece of software that happened to be written in Python, not to become experts in state of the art Python package distribution, and this is far too confusing and complex. They will give up and be frustrated at the time wasted.
The tiny minority of users who continue on and ask more Python experts will be told that they ought to use pip/virtualenv instead of easy_install. Installing pip and virtualenv and figuring out how these tools work and how they are different from the conventional "python setup.py" or "easy_install" calls is in itself time consuming and difficult, and again too much to ask from users who just wanted to install a simple piece of Python software and use it. Even those who pursue this path will be confused as to whether whatever dependencies they installed with easy_install or setup.py install --prefix are still usable with pip/virtualenv or if everything needs to be reinstalled from scratch.
This problem is exacerbated if one or more of the packages in question depends on installing a different version of Python than the one that is the default. The difficulty of ensuring that your Python package manger is using the Python version you want it to, and that the required dependencies are installed in the relevant Python 2.x directory and not Python 2.y, will be so endlessly frustrating to users that they will certainly give up at that stage.
Is there a simpler way to install Python software that doesn't require users to delve into all of these technical details of Python packages, paths and locations? For example, I am not a big Java user, but I do use some Java tools occasionally, and don't recall ever having to worry about X and Y dependencies of the Java software I was installing, and I have no clue how Java package managing works (and I'm happy that I don't -- I just wanted to use a tool that happened to be written in Java.) My recollection is that if you download a Jar, you just get it and it tends to work.
Is there an equivalent for Python? A way to distribute software in a way that doesn't depend on users having to chase down all these dependencies and versions? A way to perhaps compile all the relevant packages into something self-contained that can just be downloaded and used as a binary?
I would like to emphasize that this frustration happens even with the narrow goal of distributing a package to savvy Unix users, which makes the problem simpler by not worrying about cross platform issues, etc. I assume that the users are Unix savvy, and might even know Python, but just aren't aware (and don't want to be made aware) about the ins and outs of Python packaging and the myriad of internal complications/rivalries of different package managers. A disturbing feature of this issue is that it happens even when all of your Python package dependencies are well-known, well-written and well-maintained Pypi-available packages like Pandas, Scipy and Numpy. It's not like I was relying on some obscure dependencies that are not properly formed packages: rather, I was using the most mainstream packages that many might rely on.
Any help or advice on this will be greatly appreciated. I think Python is a great language with great libraries, but I find it virtually impossible to distribute the software I write in it (once it has dependencies) in a way that is easy for people to install locally and just run. I would like to clarify that the software I'm writing is not a Python library for programmatic use, but software that has executable scripts that users run as individual programs. Thanks.

We also develop software projects that depend on numpy, scipy and other PyPI packages. Hands down, the best tool currently available out there for managing remote installations is zc.buildout. It is very easy to use. You download a bootstrapping script from their website and distribute that with your package. You write a "local deployment" file, called normally buildout.cfg, that explains how to install the package locally. You ship both the bootstrap.py file and buildout.cfg with your package - we use the MANIFEST.in file in our python packages to force the embedding of these two files with the zip or tar balls distributed by PyPI. When the user unpackages it, it should execute two commands:
$ python bootstrap.py # this will download zc.buildout and setuptools
$ ./bin/buildout # this will build and **locally** install your package + deps
The package is compiled and all dependencies are installed locally, which means that the user installing your package doesn't even need root privileges, which is an added feature. The scripts are (normally) placed under ./bin, so the user can just execute them after that. zc.buildout uses setuptools for interaction with PyPI so everything you expect works out of the box.
You can extend zc.buildout quite easily if all that power is not enough - you create the so-called "recipes" that can help the user to create extra configuration files, download other stuff from the net or instantiate custom programs. zc.buildout website contains a video tutorial that explains in details how to use buildout and how to extend it. Our project Bob makes extensive use of buildout for distributing packages for scientific usage. If you would like, please visit the following page that contains detailed instructions for our developers on how they can setup their python packages so other people can build and install them locally using zc.buildout.

We're currently working to make it easier for users to get started installing Python software in a platform independent manner (in particular see https://python-packaging-user-guide.readthedocs.org/en/latest/future.html and http://www.python.org/dev/peps/pep-0453/)
For right now, the problem with two competing versions of easy_install has been resolved, with the competing fork "distribute" being merged backing into the setuptools main line of development.
The best currently available advice on cross-platform distribution and installation of Python software is captured here: https://packaging.python.org/

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.