Speed up debugging the conda-build tutorial


I'm working outward from this conda-build example to eventually build a conda package of my own. (If you try it out, note that the meta.yaml in the example is out of date and you need to use a different meta.yaml; details in this issue.)
The source code in this conda-build example is an existing project called click, which seems to have a very specific structure with elements like tox.ini and setup.py and setup.cfg. It's hard for me to find definitive guidance on Conda's requirements or expectations about the structure of the source code anywhere in the conda-build docs, so I've just been changing one thing at a time starting from the working example and checking if it still works.
Each conda build command takes several minutes. It makes debugging slow and I've gotten impatient. How can I speed up conda build so that I can easily experiment with different inputs? There are tips to speed up conda environment solving here, but I'm not solving an environment; I'm building a package.
My package is pure Python, so I don't need to bother with any compiler details.

I use boa, which is an add-on to conda-build that will use Mamba as the solver instead (much faster solves). Once installed, one uses:
conda mambabuild
instead of
conda build
Not just me, but the entire Conda Forge CI has used boa for several months now.
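A minimal sketch of how this could be wired into a build script, assuming boa has been installed from conda-forge into the base environment. The function name build_recipe and the ./recipe path are hypothetical, and the fallback to plain conda build is a choice made for this illustration, not something boa does on its own.

```shell
# One-time setup (assumes the conda-forge channel is reachable):
#   conda install -n base -c conda-forge boa

# Build a recipe directory with boa's mambabuild when it is installed,
# otherwise fall back to the stock conda build command.
build_recipe() {
    recipe_dir="$1"
    if conda mambabuild --help >/dev/null 2>&1; then
        conda mambabuild "$recipe_dir"
    else
        conda build "$recipe_dir"
    fi
}

# Usage (hypothetical path to the directory that contains meta.yaml):
#   build_recipe ./recipe
```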

Related

"Conda remove <package>" taking forever to remove package

I notice if I am trying to remove huge conda packages that occupy hundreds of megabytes in space, running conda remove <package> will take forever. Some examples of these huge packages are pystan, spacy-model-en_core_web_lg.
It gets stuck at the following point, with no error messages:
Collecting package metadata (repodata.json): done
Solving environment:
Any hints how to fix this problem?
I am using anaconda, python 3.8, windows 10.

Conda's remove operation still needs to satisfy all the other specifications for the environment, so Conda invokes its solver and this can be complicated. Essentially, it re-solves the entire environment sans the specified package, compares that against the existing state, then makes a plan based on the difference.
I very much doubt that the size of the package, which OP alludes to, has any direct impact. Instead, things that negatively impact solving are:
having a large environment (e.g., anaconda package is installed)
channel mixing - in particular, including the conda-forge channel at equal or higher priority than defaults in an environment with the anaconda package; that package and all its dependencies are intended to be sourced from the anaconda channel
having an underspecified environment (see conda env export --from-history to see your explicit specifications); e.g., an environment with a python=3.8 specification will be easier on the solver than just a python specification
In general, using smaller specialized (e.g., per-project) environments, rather than large monolithic ones helps avoid such problems. The anaconda package is particularly problematic.
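As a rough sketch of the points above: the first helper prints only the specifications that were explicitly requested for an environment, and the second creates a small per-project environment with a constrained Python version. The function names, the environment name myproject, and the package list are hypothetical.

```shell
# Show only the specs that were explicitly requested for an environment,
# which is a quick way to spot an underspecified one.
show_explicit_specs() {
    conda env export -n "$1" --from-history
}

# Create a small, per-project environment from explicit specs.
create_project_env() {
    env_name="$1"
    shift
    conda create --yes -n "$env_name" "$@"
}

# Usage (hypothetical environment name and packages):
#   show_explicit_specs base
#   create_project_env myproject python=3.8 pandas
```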
Try Mamba
Other than adopting better practices, one can also get significantly faster solves with Mamba, a drop-in compiled replacement for conda. Try it out:
## install Mamba in base env
conda install -n base conda-forge::mamba
## use it like you would the 'conda' command
mamba remove -n foo bar

How to freeze conda on a fixed version

I've been asked to look at some dev ops stuff regarding python and I'm a bit stuck. The network I'm working on is not internet connected, so I've been setting up Nexus repositories to bring in dependencies for docker, java and pypi that the other developers can access and pull down locally. However, they have started using conda more and more, and we are on a fixed version on our dev network to match a delivery network.
I'm trying to use Nexus' conda repos, but every time I try to install something it tries to update everything else, including the python and conda versions, which are:
conda version : 4.8.3
conda-build version : 3.18.11
python version : 3.8.3.final.0
I've edited my .condarc file to read:
channels:
- http://master:8041/repository/anaconda-proxy/main/
- http://master:8041/repository/conda-forge/
remote_read_timeout_secs: 1200.0
auto_update_conda: false
channel_priority: false
However, every time I try to install something to cache the dependencies I get a huge list of updates. For example:
conda install cudatoolkit
<snip>
The following packages will be downloaded:
package | build
---------------------------|-----------------
alabaster-0.7.12 | py_0 16 KB http://master:8041/repository/anaconda-proxy/main
anaconda-client-1.7.2 | py38_0 172 KB http://master:8041/repository/anaconda-proxy/main
anaconda-project-0.8.4 | py_0 210 KB http://master:8041/repository/anaconda-proxy/main
argh-0.26.2 | py38_0 36 KB http://master:8041/repository/anaconda-proxy/main
.....
Any advice would be great. I've added the auto_update_conda and channel_priority flags but to no avail. Thanks in advance.
Additional info:
I'm a Java developer and I only use a bit of python, so I'm not massively familiar with the anaconda setup; apologies if this is simpler than I'm making it.

How Conda Solves
Conda always first attempts to solve the install directive without changing existing packages (i.e., it runs first with a --freeze-installed flag) and will only proceed to a full solve (what you are seeing) if it can't find any version of your requested package that already has all its dependencies satisfied in the environment. That is, this result implies that what you are asking for is not possible. Or at least not via the CLI if you want a valid environment.[1]
At the core of the issue is that even if there is only a single dependency that needs updating, there is no intermediate mode to indicate that you want to minimize the total number of changes (which I think would actually be a nice enhancement). Conda only has two solving modes:
Change nothing else (--freeze-installed).
All dependencies are allowed to update (--update-deps).
The exceptions to this are the packages listed in aggressive_update_packages and conda itself (auto_update_conda), which Conda will always attempt to update whenever the environment is mutated. But it seems you've already realized those can be disabled through configuration settings.[2]
Manual Dependency Updating
This doesn't mean what you are hoping to accomplish is impossible, but that there isn't a clean way to automate it via the CLI. Instead, you might need to manually track down the dependencies that need updating (e.g., conda search cudatoolkit --info), update them first (conda install with specific versions), and then try installing your package again. I would strongly recommend first settling on the exact version of cudatoolkit you plan to install, otherwise conda search cudatoolkit --info will be too much info.
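The manual route described above might look something like the following sketch. The helper names are hypothetical, and the version 10.2 in the usage lines is only a placeholder for whichever cudatoolkit version you settle on.

```shell
# Print the full metadata, including the dependency list, for one
# specific version of a package.
show_package_deps() {
    conda search "$1=$2" --info
}

# Report what an install would change without actually changing anything.
preview_install() {
    conda install --dry-run "$@"
}

# Usage (10.2 is a placeholder version, not taken from the question):
#   show_package_deps cudatoolkit 10.2
#   preview_install cudatoolkit=10.2
```

Running the preview after each manual dependency update shows whether the remaining plan has shrunk to something acceptable.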
Package Pinning
For packages that you really do want absolutely fixed there is package pinning. You could do this for conda, python, and other core packages.
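Conda reads pins from a file named pinned in the environment's conda-meta directory, one spec per line. A minimal sketch of writing such a file, assuming you know the environment's prefix; the helper name pin_specs is hypothetical, and the versions in the usage line are the ones reported in the question.

```shell
# Append version pins to an environment's conda-meta/pinned file.
# Arguments: <env_prefix> <spec>...
pin_specs() {
    prefix="$1"
    shift
    mkdir -p "$prefix/conda-meta"
    for spec in "$@"; do
        printf '%s\n' "$spec" >> "$prefix/conda-meta/pinned"
    done
}

# Usage (CONDA_PREFIX is set by conda for the currently active environment):
#   pin_specs "$CONDA_PREFIX" "python ==3.8.3" "conda ==4.8.3"
```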
Base Environment
I find it a bit odd that the base environment (the one that has the conda package) is being mutated at all. Instead, I would expect software engineers to always use non-base environments for development and production. It is easy to create new environments, one can define them with version controlled YAML files, use them modularly by creating them on a per project or per task-type basis, and they can be mutated without worrying about affecting the Conda infrastructure. However, I'm not entirely clear on your setup, so this comment may not apply.
[1] If one doesn't care about validity (probably not a good idea for production) then there is always the --no-deps flag.
[2] The default aggressive_update_packages packages are ones that frequently become vulnerable to exploits (e.g., openssl), so carefully consider the implications of leaving them outdated.

Is there a point in creating a conda package from a PyPI package?

If one has a pure Python package already on pypi.org, is there any advantage of providing a conda package?
My assumption was that conda is really useful if you have dependencies outside of the Python ecosystem. It then helps to have reliable builds. But is there something else? Is it in the end just personal preference?
The tutorial didn't even address the question of when a conda package is (not) necessary.

Updating a specific module with Conda removes numerous packages

I have recently started using the Anaconda Python distribution as it offers a lot of Data Analysis libraries out of the box. And using conda to create environments and install packages is also a breeze. But I have run into a serious issue: when I want to update Python itself or any other module, I am informed beforehand that a LOT of my existing libraries will be removed.
For example, this is what I get when I use conda update [package_name]
$ conda update pandas
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
## Package Plan ##
environment location: C:\Users\User\Anaconda3
added / updated specs:
- matplotlib
The following packages will be REMOVED:
[Almost half of my existing packages]
The following packages will be UPDATED:
[Some packages including my desired library, in this case, pandas]
I have searched the web on how to update packages and Python using conda and almost everywhere I saw that conda update [package name] was suggested. But why doesn't it work for me? I mean it will work but at the expense of tons of important libraries that I need.
So I have tried using the Anaconda Navigator to update the desired libraries (like matplotlib and pandas) hoping that the removal of existing libraries might be a command line issue on my computer. But I had seriously messed up my base (root) environment by updating pandas using Navigator. I didn't get any warnings that a lot of my modules will be removed so I thought I was doing fine. But after the update was done and I wrote some matplotlib code, I wasn't able to run it. I got errors that resembled something that indicated matplotlib was a "non-conda module". So I had to do conda install --revision n to go back to a state where I had my modules.
Right now, the only way for me to update any package or Python is to do this:
conda install pandas=[package_version_that_is_higher_than_mine]
But there's got to be a reason why I am facing this issue. Any help is absolutely appreciated.
EDIT: It turns out that the issue is mainly when I am trying to update using the base environment. When I use my other conda environments, the conda update [package_name] or conda update --all works fine.

Anaconda (as distinct from Conda) is designed to be used as a fixed set of package builds that have been vetted for compatibility (see "What's in a Name? Clarifying the Anaconda Metapackage"). When you try to introduce new packages or package upgrades into that context, Conda can be rather unpredictable as to how it will solve that. I think it helps to keep in mind that commands like conda (install|upgrade|remove) mean requesting a distinct environment as a whole, and do not represent low-level commands to change a single package.
Conda does offer some options to get this more low-level behavior. One thing to try is the --freeze-installed flag, which would do what you're asking for. Recent versions of Conda do this by default in the first round of solves, and if it doesn't work then it attempts a full solve. There is also the more dangerous and brute force --no-deps flag, which won't do a solve at all and just installs the package. The documentation for this literally says,
"This WILL lead to broken environments and inconsistent behavior. Use at your own risk."
Typically, if you want to use newer packages, it is better to create a new env (conda create -n my_env [pkg1 pkg2 ...]) because the fact is that you no longer want the Anaconda distribution, but instead a custom one with newer versions. My personal view is that most non-beginners should be using Miniconda and relegate their base env to only having conda, while being very liberal about creating envs for projects that have different package requirements. If you ever need a true Anaconda distribution, there's always the anaconda package for that.
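To see ahead of time whether the frozen solve mentioned above can succeed, one option is to combine --freeze-installed with --dry-run, so that Conda only reports its plan. This is a sketch; the helper name preview_frozen_update is hypothetical.

```shell
# Ask Conda to update the given packages while leaving everything else
# as installed, and only print the resulting plan.
preview_frozen_update() {
    conda update --freeze-installed --dry-run "$@"
}

# Usage:
#   preview_frozen_update pandas
```

If the frozen solve fails, that is a reasonable signal to create a separate environment instead of updating in place.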

What does conda do when "solving environment"

Whenever I run conda install/remove/update <package>, it tells me it's "Solving environment" for some time before telling me the list of things it's going to download/install/update. Presumably it's looking for dependencies for <package>, but why does it sometimes remove packages after doing this operation? For example, as I was trying to install Mayavi, it decided it needed to remove Anaconda Navigator.
Furthermore it does not provide an option to perform only a subset of the suggested operations. Is there a way to specify that I don't want a package removed?

You can add the --debug option to the conda command and inspect the output in the console (or terminal). For example, type conda update --debug numpy.
From the output, we can see that the client requests repodata.json from each channel in the channel list and then does some computation locally during the "Solving environment" step.
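The debug output is long, so it can help to capture it in a file and search it afterwards. A small sketch, in which the helper name capture_solver_debug and the log file name are hypothetical; --dry-run is added here so that nothing is actually changed while you investigate.

```shell
# Run an update in debug mode without applying it, and save everything
# Conda prints (stdout and stderr) to a log file.
capture_solver_debug() {
    log_file="$1"
    shift
    conda update --debug --dry-run "$@" > "$log_file" 2>&1
}

# Usage (hypothetical log file name):
#   capture_solver_debug conda-debug.log numpy
#   grep -i repodata conda-debug.log
```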

As a side note on the "Solving Environment" step...
Lack of administrator privileges may affect whether or where you can install python packages.
I observed that my installs would hang on the "Solving Environment" step and never get through when attempting to install packages while logged in as a non-administrator.
Getting switched to admin was possible for me on the machine I was stuck on, so I just did that and it solved the problem.
Commenter explains workaround when this is not possible.

JUST WAIT! I wasted hours trying to fix this. It turns out, it just took around 45 minutes :/

The short answer is: use mamba as a drop-in replacement for conda. It is much faster at solving environments, so there is no more waiting for minutes. mamba has been officially endorsed by the conda team.
Mamba also allows you to configure more precisely which packages you require to be installed and allows you to pin versions, as conda does. For a more detailed comparison of conda and mamba see this Stack Overflow answer: https://stackoverflow.com/a/68043228/7483211
The long answer is: Solving conda environments with more than a few packages that each have dependencies of their own quickly becomes a quite complicated SAT problem (see Boolean satisfiability problem and dependency hell).
With good algorithms, even fairly big SAT problems can be solved fast. In contrast to mamba's solver which is written in C++ and designed to be fast, it seems that conda's solver is not very high performance. It worked well enough when people used small environments in the past, but with bigger and bigger environments, conda has started to struggle.
I made the switch about a year ago and I have not once looked back. The open source project I'm working for (Nextstrain) has also started to recommend mamba in place of conda for new users. I have not seen anyone advocating against using mamba in place of conda.

conda install --prune <package> helped me install from the right channel.
I suspect that the environment I was using had been set up for zipline and that the channel I was installing from was not compatible with the existing one. --prune takes a lot of time, but it helped me resolve the environment issues.
