TLDR
See question at the bottom
Background/Intro
I am still quite new to the world of Python and Conda and am trying to set up a CI pipeline for a bespoke requirement.
My understanding of the 'conda build' command is that it internally creates a temporary conda environment and uses this to evaluate the build. I am aware of this because, in a version we upgraded to last year, we had to change the meta.yaml file to add a new source folder entry for the unit tests, which it would then run in this special folder.
More context (sorry for waffle)
Given the above, what I am looking to do is to extract the environment file once the build has run its operations, e.g. dependency checks, unit tests, etc.
If I were trying to extract (export) the env file for an ordinary environment, I would of course do the usual:
conda env export > [some_path]/env.yml
The reason behind trying to get the environment during the 'conda build' process is twofold:
1.) Typically the developers tend to only list the top-level dependencies in the meta.yaml file, not all dependencies, and I need the full list for a bespoke process down the line.
2.) (less vital but still good)
I can guarantee that the version built by the 'conda build' process (and all of its dependencies) is valid.
Sometimes there is the issue of a lower-level dependency version changing between the time of the build and the release.
I do appreciate that the devs should be pinning all versions in the recipe, but there are still sometimes problems.
Question
Therefore, is it possible to retrieve/extract/export the environment file from the 'conda build' process? Perhaps there is a flag, or something I can script to run before the build finishes?
Possible solution I thought of
Let the process run, then script a step in my CI to create the environment and export the env file afterwards - I just don't want to add more time to the CI process.
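Something like the following is roughly what I had in mind for that extra step (the recipe path, the package name mypackage, and the environment name ci-export are just placeholders):
conda build recipe/                               # normal build, including the unit tests
conda create -y -n ci-export -c local mypackage   # recreate an env from the freshly built package
conda env export -n ci-export > env.yml           # capture the fully pinned dependency list
conda env remove -y -n ci-export                  # tidy up the CI agent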
Related
I would like to get snakemake running a Python script with a specific conda environment via a SGE cluster.
On the cluster I have miniconda installed in my home directory. My home directory is mounted via NFS so accessible to all cluster nodes.
Because miniconda is in my home directory, the conda command is not on the operating system path by default. I.e., to use conda I need to first explicitly add this to the path.
I have a conda environment specification as a yaml file, which could be used with the --use-conda option. Will this work with the --cluster "qsub" option also?
FWIW I also launch snakemake using a conda environment (in fact the same environment I want to run the script).
I have an existing Snakemake system, running conda, on an SGE cluster. It's delightful and very do-able. I'll try to offer perspective and guidance.
The location of your miniconda, local or shared, may not matter. If you are using a login to access your cluster, you should be able to update your default variables upon logging in. This will have a global effect. If possible, I highly suggest editing the default settings in your .bashrc to accomplish this. This will properly, and automatically, set up your conda path upon login.
One of the lines in my file, "/home/tboyarski/.bashrc":
export PATH=$HOME/share/usr/anaconda/4.3.0/bin:$PATH
EDIT 1 Good point made in comment
Personally, I consider it good practice to put everything under conda control; however, this may not be ideal for users who commonly require access to software not supported by conda. Typically, support issues have to do with old operating systems (e.g. CentOS 5 support was recently dropped). As suggested in the comment, manually exporting the PATH variable in a single terminal session may be more suitable for users who do not work on pipelines exclusively, as this will not have a global effect.
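For instance, that one-off approach could look something like this in the current shell only (the miniconda install path is an assumption):
export PATH="$HOME/miniconda3/bin:$PATH"   # affects only this terminal session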
With that said, like myself prior to Snakemake execution, I recommend initializing the conda environment used by the majority, or the entirety, of your pipeline. I find this the preferred way as it allows conda to create the environment, instead of getting Snakemake to ask conda to create the environment. I don't have the link for the web discussion, but I believe I read somewhere that individuals who relied solely on Snakemake to create the environments, rather than launching from a base environment, found that the environments were being stored in the .snakemake directory and that it was getting excessively large. Feel free to look for the post. The issue was addressed by the author, who reduced the load on the hidden folder, but still, I think it makes more sense to launch the jobs from an existing Snakemake environment, which interacts with your head node and then passes the corresponding environment variables to its child nodes. I like a bit of hierarchy.
With that said, you will likely need to pass the environments to your child nodes if you are running Snakemake from your head node's environment and letting Snakemake interact with the SGE job scheduler, via qsub. I actually use the built-in DRMAA feature, which I highly recommend. Both submission mediums require me to provide the following arguments:
-V Available for qsub, qsh, qrsh with command and qalter.
Specifies that all environment variables active within the qsub
utility be exported to the context of the job.
Also...
-S [[hostname]:]pathname,...
Available for qsub, qsh and qalter.
Specifies the interpreting shell for the job. pathname must be
an executable file which interprets command-line options -c and
-s as /bin/sh does.
To give you a better starting point, I also specify virtual memory and core counts; this might be specific to my SGE system, I do not know.
-V -S /bin/bash -l h_vmem=10G -pe ncpus 1
I highly expect you'll require both arguments when submitting to the SGE cluster, as I do personally. I recommend putting your cluster submission variables in JSON format, in a separate file. The code snippet above can be found in this example of what I've done personally. I've organized it slightly differently than in the tutorial, but that's because I needed a bit more granularity.
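If you go the --cluster "qsub" route instead of DRMAA, a rough sketch of the launch command, folding in those flags, might look like this (the job count is just an assumption; adjust the resource values to your site):
snakemake --jobs 20 --use-conda \
    --cluster "qsub -V -S /bin/bash -l h_vmem=10G -pe ncpus 1"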
Personally, I only use the --use-conda option when running a conda environment different from the one I used to launch and submit my Snakemake jobs. For example, my main conda environment runs python 3, but if I need to use a tool that, say, requires python 2, I will then and only then use Snakemake to launch a rule with that specific environment, such that the execution of that rule uses a path corresponding to a python2 installation. This was of huge importance to my employer, as the existing system I was replacing struggled to switch seamlessly between python2 and 3; with conda and Snakemake, this is very easy.
In principle I think it is good practice to launch a base conda environment and to run Snakemake from there. It encourages the use of a single environment for the entire run. Keep it simple, right? Complicate things only when necessary, like when needing to run both python2 and python3 in the same pipeline. :)
I've been searching for this with no success; I don't know if I am missing something. I already have a virtualenv, but how do I create a project to associate the virtualenv with? Thanks.
P.S. I am on Windows.
I could be wrong here, but I do not believe that a virtualenv is, by its very nature, something that you associate with a project. When you use a virtualenv, you're basically saying, "I'm taking this Python interpreter, installing what I want on it, and setting it aside from the Python interpreter that the entire computer uses by default." Virtualenv does not have a concept of a Python "project"; it is just a custom version of a Python interpreter that you run code through. There are tools in IDEs like PyCharm that enable you to associate a project with a virtualenv, but those are another layer on top of the base software.
In order to use a virtualenv with a project, you will need to "activate" it every time you wish to use it. The documentation for activating a virtualenv on Windows is found here.
EDIT:
Saw that you had virtualenvwrapper tagged in your post, so I did a bit of hunting on that. It would appear that there is the mkproject command, which creates a project folder and then associates it with a virtualenv interpreter. Documentation on it can be found here.
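A hedged sketch of how that workflow looks with virtualenvwrapper (the project name myproject is an assumption; on Windows this comes from the virtualenvwrapper-win package):
mkproject myproject     # create a project folder plus a matching virtualenv, then activate it
workon myproject        # later: re-activate the env and jump back into the project folder
deactivate              # leave the environment when done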
Requirements:
Virtual Env
Pycharm
Go to Virtual env and type which python
Add remote project interpreter (File > Default Settings > Project Interpreter (cog) add remote)
You'll need to set up your file system so that PyCharm can also open the project.
NOTE:
Do not turn off your virtual environment without saving your run configurations; that will cause PyCharm to see your run configurations as corrupt.
There's a button in the top right that reads "Share". Enable this and your run configs will be saved under the .idea directory, and you'll have a lot fewer issues.
If you already have your virtualenv installed you just need to start using it.
Create your project's virtual environment using virtualenv env_name on cmd. To associate a specific version of Python with your environment, use: virtualenv env_name -p pythonx.x;
Activate your environment by navigating into its Scripts folder and executing activate.
Your terminal is now using your virtual environment; that means every Python package you install and the Python version you run will be the ones you configured inside your env.
I like to create environments with names similar to my projects, and I always use one environment per project; that helps me keep track of which packages my specific projects need to run.
If you haven't read much about venvs yet, try googling requirements.txt along with the pip freeze command; those are pretty useful for keeping track of your project's packages.
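For example, the usual round trip looks roughly like this (requirements.txt is just the conventional file name):
pip freeze > requirements.txt      # snapshot the packages installed in the active env
pip install -r requirements.txt    # later, recreate the same set inside a fresh env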
I like Pipenv: Python Dev Workflow for Humans to manage environments:
Pipenv is a tool that aims to bring the best of all packaging worlds (bundler, composer, npm, cargo, yarn, etc.) to the Python world. Windows is a first-class citizen, in our world.
It automatically creates and manages a virtualenv for your projects, as well as adds/removes packages from your Pipfile as you install/uninstall packages. It also generates the ever-important Pipfile.lock, which is used to produce deterministic builds.
Pipenv is primarily meant to provide users and developers of applications with an easy method to setup a working environment.
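As a rough sketch of the day-to-day commands (the package and script names are placeholders):
pipenv install requests       # create the virtualenv if needed and record the dependency in Pipfile
pipenv run python main.py     # run a script inside the managed environment
pipenv shell                  # or drop into a shell with the environment activated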
I am currently writing a command line application in Python, which needs to be made available to end users in such a way that it is very easy to download and run. For those on Windows, who may not have Python (2.7) installed, I intend to use PyInstaller to generate a self-contained Windows executable. Users will then be able to simply download "myapp.exe" and run myapp.exe [ARGUMENTS].
I would also like to provide a (smaller) download for users (on various platforms) who already have Python installed. One option is to put all of my code into a single .py file, "myapp.py" (beginning with #! /usr/bin/env python), and make this available. This could be downloaded, then run using myapp.py [ARGUMENTS] or python myapp.py [ARGUMENTS]. However, restricting my application to a single .py file has several downsides, including limiting my ability to organize the code and making it difficult to use third-party dependencies.
Instead I would like to distribute the contents of several files of my own code, plus some (pure Python) dependencies. Are there any tools which can package all of this into a single file, which can easily be downloaded and run using an existing Python installation?
Edit: Note that I need these applications to be easy for end users to run. They are not likely to have pip installed, nor anything else which is outside the Python core. Using PyInstaller, I can generate a file which these users can download from the web and run with one command (or, if there are no arguments, simply by double-clicking). Is there a way to achieve this ease-of-use without using PyInstaller (i.e. without redundantly bundling the Python runtime)?
I don't like the single file idea because it becomes a maintenance burden. I would explore an approach like the one below.
I've become a big fan of Python's virtual environments because it allows you to silo your application dependencies from the OS's installation. Imagine a scenario where the application you are currently looking to distribute uses a Python package requests v1.0. Some time later you create another application you want to distribute that uses requests v2.3. You may end up with version conflicts on a system where you want to install both applications side-by-side. Virtual environments solve this problem as each application would have its own package location.
Creating a virtual environment is easy. Once you have virtualenv installed, it's simply a matter of running, for example, virtualenv /opt/application/env. Now you have an isolated python environment for your application. Additionally, virtual environments are very easy to clean up, simply remove the env directory and you're done.
You'll need a setup.py file to install your application into the environment. Say your application uses requests v2.3.0, your custom code is in a package called acme, and your script is called phone_home. Your directory structure looks like this:
acme/
    __init__.py
    models.py
    actions.py
scripts/
    phone_home
setup.py
The setup.py would look something like this:
from setuptools import setup  # setuptools (not distutils) is needed for install_requires

install_requires = [
    'requests==2.3.0',
]

setup(
    name='phone_home',
    version='0.0.1',
    description='Sample application to phone home',
    author='John Doe',
    author_email='john@doe.com',
    packages=['acme'],
    scripts=['scripts/phone_home'],
    url='http://acme.com/phone_home',
    install_requires=install_requires,
)
You can now make a tarball out of your project and host it however you wish (your own web server, S3, etc.):
tar cvzf phone_home-0.0.1.tar.gz .
Finally, you can use pip to install your package into the virtual environment you created:
/opt/application/env/bin/pip install http://acme.com/phone_home-0.0.1.tar.gz
You can then run phone_home with:
/opt/application/env/bin/phone_home
Or create a symlink in /usr/local/bin to simply call the script using phone_home:
ln -s /opt/application/env/bin/phone_home /usr/local/bin/phone_home
All of the steps above can be put in a shell script, which would make the process a single-command install.
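For instance, a sketch of such an install script, reusing the commands above (the paths, URL, and names are still the placeholder values from this example):
#!/bin/bash
set -e
# Create an isolated environment for the application
virtualenv /opt/application/env
# Install the packaged application and its pinned dependencies into it
/opt/application/env/bin/pip install http://acme.com/phone_home-0.0.1.tar.gz
# Expose the entry point on the default PATH
ln -s /opt/application/env/bin/phone_home /usr/local/bin/phone_home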
And with slight modification this approach works really well for development environments; i.e. using pip to install / reference your development directory: pip install -e . where . refers to the current directory and you should be in your project directory alongside setup.py.
Hope this helps!
You could use pip as suggested in the comments. You need to create a MANIFEST.in and setup.py in your project to make it installable. You can also add modules as prerequisites. More info can be found in this question (not specific to Django):
How do I package a python application to make it pip-installable?
This will make your module available in Python. You can then have users run a file that runs your module, by either python path/run.py, ./path/run.py (with +x permission) or python -c "some code here" (e.g. for an alias).
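As an illustration only, the end-user flow could then be as small as this (the package URL and the run.py path are placeholders):
pip install https://example.com/yourpackage-0.1.tar.gz
python path/run.py [ARGUMENTS]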
You can even have users install from a public git repository, like this:
pip install git+https://bitbucket.org/yourname/projectname.git
...in which case they also need git.
I have already created a virtualenv for running my python script.
Now when I integrate this Python script with Jenkins, I have found that at execution time Jenkins is using the wrong Python environment.
How can I ensure Jenkins is using the correct virtualenv?
As an example, in my case I want to use the virtualenv test. How can I use this pre-prepared virtualenv to run my Python script?
source test/bin/activate
You should install one of the Python plugins; I've used ShiningPanda. Then you'll be able to create separate virtual environment configurations in Manage Jenkins > Configure System > Python > Python installation.
In job configuration there will be Python Builder step, where you can select python environment.
Just make sure you're not starting Jenkins service from within existing python virtual environment.
First, you should avoid using ShiningPanda because it is broken. It will fail if you try to run jobs in parallel and it is also not compatible with Jenkins 2 pipelines.
When builds are run in parallel (concurrently), Jenkins appends @2, @3... to the workspace directory so that two executions do not share the same folder. Jenkins does clone the original workspace, so do not be surprised if it contains a virtualenv you created in a previous build.
You need to take care of the virtualenv creation yourself but you have to be very careful about how you use it:
the workspace folder may not be cleaned up, and its location could change from one build to another
virtualenvs are known to break when they are moved, and Jenkins moves them
creating files outside the workspace is a really bad CI practice, so avoid the temptation to use /tmp
So your only safe option is to create a unique virtual environment folder for each build inside the workspace. You can easily do this by using the $BUILD_NUMBER environment variable.
This will be different even if you have jobs running in parallel, and it will never repeat.
Downsides:
speed: virtualenvs are not reused between builds, so they are fully recreated each time. If you use --system-site-packages you may speed up creation considerably (if the heavy packages are already installed on the system)
space: if the workspace is not cleaned regularly, the number of virtualenvs will grow. Workaround: have a job that cleans workspaces every week or two. This is also a good practice for spotting other errors. Some people choose to clean the workspace on every execution.
Shell snippet
#!/bin/bash
set -euxo pipefail
# Build a unique venv folder name to use *inside* the workspace
VENV=".venv-$BUILD_NUMBER"
# Initialize the new venv
virtualenv "$VENV"
# Activate it (the PS1 guard avoids an 'unbound variable' error under set -u)
PS1="${PS1:-}" source "$VENV/bin/activate"
# <YOUR CODE HERE>
The set line at the top enables bash strict mode; more details at http://redsymbol.net/articles/unofficial-bash-strict-mode/
You can use Pyenv Pipeline Plugin. The usage is very easy, just add
stage('my_stage') {
    steps {
        script {
            withPythonEnv('path_to_venv/bin') {
                sh("python foo.py")
                // ...
            }
        }
    }
}
You can add pip install whatever in your steps to update any virtual environment you are using.
By default it will look for the virtual environment in the jenkins workspace and if it does not find it, it will create a new one.
I'm following the outline of the hudson/python/virtualenv CI solution described at heisel.org but one step of this is really chafing, and that's the part where the virtualenv, created just for the CI run, is configured:
pip install -q -E ./ve -r requirements.pip
pip install -q -E ./ve -r requirements-test.pip
This takes an inordinate amount of time to run, and every time a source file changes we'll end up re-downloading what amounts to a significant amount of data.
Is it possible to create template workspaces in Hudson, so that instead of checking out into a bare workspace it checks out into one that is pre-prepared?
Here are a couple of options:
1.) Have an archive in your source repository that blows up into the virtualenv/pip install. You'll need to make the virtualenv starting point relocatable.
2.) Use whatever SCM option is appropriate to not wipe out the workspace between builds (e.g. use svn update, or don't check Mercurial's Clean Build option). Then keep the install commands in your build script, but put them under an if statement so they are only run (for example) if a .pip_installed file is not present, or if a build parameter is set (see the sketch after this list).
3.) You might be able to get the Clone Workspace plugin to do what you want. But that's an alternative SCM, which I'm guessing you probably don't want since Hudson won't check out from multiple SCMs (see this previous question for some ideas about working around this).
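For the second option, the guard in the build script could be sketched roughly like this (the marker file name and the pip commands are simply reused from the question above):
if [ ! -f .pip_installed ]; then
    pip install -q -E ./ve -r requirements.pip
    pip install -q -E ./ve -r requirements-test.pip
    touch .pip_installed    # skip the install on subsequent builds
fi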
It's probably also a good idea to set up your pip configuration to pull from a local cache of packages:
pip install -f http://localhost/packages/ -r requirements.pip
An enhancement is to package the virtualenv in an archive named by the hash of the requirements file. If the requirements file has not changed since the last build, just extract the archive into an empty virtualenv directory. If the requirements file has changed, an archive won't yet exist, so you run pip install to build the environment and then store it in a new archive.
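A rough sketch of that enhancement in the build script (the cache location and hashing tool are assumptions, and the relocation caveat from the first option still applies):
# Name the archive after the hash of the requirements files
HASH=$(cat requirements.pip requirements-test.pip | md5sum | cut -d' ' -f1)
ARCHIVE="$HOME/ve-cache/ve-$HASH.tar.gz"
if [ -f "$ARCHIVE" ]; then
    # Requirements unchanged: reuse the cached environment
    mkdir -p ve && tar xzf "$ARCHIVE" -C ve
else
    # Requirements changed: build the environment once, then cache it
    virtualenv ve
    pip install -q -E ./ve -r requirements.pip
    pip install -q -E ./ve -r requirements-test.pip
    mkdir -p "$(dirname "$ARCHIVE")"
    tar czf "$ARCHIVE" -C ve .
fi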
If you create a new venv per workspace then you only really have to install all the deps once at the beginning, so subsequent builds are much faster. See my post for a script I wrote to help out:
"Pretty" Continuous Integration for Python