I would like to get snakemake running a Python script with a specific conda environment via a SGE cluster.
On the cluster I have miniconda installed in my home directory. My home directory is mounted via NFS so accessible to all cluster nodes.
Because miniconda is in my home directory, the conda command is not on the operating system PATH by default; to use conda I first need to add it to the PATH explicitly.
I have a conda environment specification as a yaml file, which could be used with the --use-conda option. Will this work with the --cluster "qsub" option also?
FWIW I also launch snakemake using a conda environment (in fact the same environment I want to run the script in).
I have an existing Snakemake system, running conda, on an SGE cluster. It's delightful and very do-able. I'll try to offer perspective and guidance.
The location of your miniconda, local or shared, may not matter. If you are using a login to access your cluster, you should be able to update your default variables upon logging in. This will have a global effect. If possible, I highly suggest editing the default settings in your .bashrc to accomplish this. This will properly, and automatically, set up your conda path upon login.
One of the lines in my file, "/home/tboyarski/.bashrc":
export PATH=$HOME/share/usr/anaconda/4.3.0/bin:$PATH
EDIT 1: Good point made in a comment
Personally, I consider it good practice to put everything under conda control; however, this may not be ideal for users who commonly require access to software not supported by conda. Typically, support issues have to do with using old operating systems (e.g. CentOS 5 support was recently dropped). As suggested in the comment, manually exporting the PATH variable in a single terminal session may be preferable for users who do not work on pipelines exclusively, as this will not have a global effect.
With that said, as I do prior to Snakemake execution, I recommend initializing the conda environment used by the majority, or the entirety, of your pipeline. I find this the preferred way as it allows conda to create the environment, instead of getting Snakemake to ask conda to create it. I don't have the link for the web discussion, but I believe I read somewhere that individuals who relied only on Snakemake to create the environments, without launching from a base environment, found that the environments were being stored in the .snakemake directory and that it was getting excessively large. Feel free to look for the post. The issue was addressed by the author, who reduced the load on the hidden folder, but I still think it makes more sense to launch the jobs from an existing Snakemake environment, which interacts with your head node and then passes the corresponding environment variables to its child nodes. I like a bit of hierarchy.
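As a rough illustration only (the miniconda path and the environment name below are assumptions, not details from the question), initializing that main environment on the head node before invoking Snakemake could look something like:
export PATH="$HOME/miniconda3/bin:$PATH"   # put your NFS-mounted miniconda on the PATH
source activate my_pipeline_env            # "my_pipeline_env" is a placeholder name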
With that said, you will likely need to pass the environment variables to your child nodes if you are running Snakemake from your head node's environment and letting Snakemake interact with the SGE job scheduler via qsub. I actually use the built-in DRMAA support, which I highly recommend. Both submission methods require me to provide the following arguments:
-V Available for qsub, qsh, qrsh with command and qalter.
Specifies that all environment variables active within the qsub
utility be exported to the context of the job.
Also...
-S [[hostname]:]pathname,...
Available for qsub, qsh and qalter.
Specifies the interpreting shell for the job. pathname must be
an executable file which interprets command-line options -c and
-s as /bin/sh does.
To give you a better starting point, I also specify virtual memory and core counts; this might be specific to my SGE system, I am not sure.
-V -S /bin/bash -l h_vmem=10G -pe ncpus 1
I fully expect you'll require both arguments when submitting to the SGE cluster, as I do personally. I recommend putting your cluster submission variables in JSON format, in a separate file. The code snippet above can be found in this example of what I've done personally. I've organized it slightly differently than in the tutorial, but that's because I needed a bit more granularity.
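To sketch what that can look like (the file name, resource values and job count below are assumptions, and newer Snakemake releases prefer profiles over --cluster-config), the JSON file plus the submission command might be roughly:
# cluster.json (values are examples only)
cat > cluster.json <<'EOF'
{
    "__default__": {
        "h_vmem": "10G",
        "ncpus": 1
    }
}
EOF

# Submit through qsub, forwarding the environment (-V) and forcing bash (-S);
# with --drmaa you would pass just the native arguments, without the leading "qsub"
snakemake --jobs 100 \
    --cluster-config cluster.json \
    --cluster "qsub -V -S /bin/bash -l h_vmem={cluster.h_vmem} -pe ncpus {cluster.ncpus}"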
Personally, I only use the --use-conda option when running a conda environment different than the one I used to launch and submit my Snakemake jobs. For example, my main conda environment runs Python 3, but if I need a tool that requires, say, Python 2, then and only then will I have Snakemake launch that rule with the specific environment, so that the rule executes with a path corresponding to a Python 2 installation. This was of huge importance to my employer, as the existing system I was replacing struggled to switch seamlessly between Python 2 and 3; with conda and Snakemake, this is very easy.
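For instance (the file and rule names below are made up for illustration), a Python 2 rule can point at its own environment file while the rest of the pipeline keeps using the launch environment:
# Hypothetical per-rule environment file, envs/py2.yaml
mkdir -p envs
cat > envs/py2.yaml <<'EOF'
channels:
  - defaults
dependencies:
  - python=2.7
EOF

# In the Snakefile, only the rule that needs Python 2 declares it:
#   rule legacy_step:
#       input:  "data/in.txt"
#       output: "data/out.txt"
#       conda:  "envs/py2.yaml"
#       script: "scripts/legacy.py"

# --use-conda builds and uses that environment just for that rule
snakemake --use-conda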
In principle I think it is good practice to launch a base conda environment and to run Snakemake from there. It encourages the use of a single environment for the entire run. Keep it simple, right? Complicate things only when necessary, like when needing to run both python2 and python3 in the same pipeline. :)
TLDR
See question at the bottom
Background/Intro
I am still quite new to the world of Python and Conda and am trying to setup a CI pipeline for a bespoke requirement.
My understanding of the 'conda build' command is that internally it creates a temporary conda environment and uses it to evaluate the build process. I am aware of this because when we upgraded from the version I used last year, we had to change the meta.yaml file to add a new source folder entry for the unit tests, which it would then run against in this special folder.
More context (sorry for waffle)
Given the above, what I am looking to do is to extract the environment file once it has run its operations, e.g. dependency checks, unit tests, etc.
If I were trying to extract (export) the env file for an environment, I would of course do the usual:
conda env export > [some_path]/env.yml
The reason behind trying to get the environment during the 'conda build' process is for two reasons:
1.) Typically the developers tend to list only the top-level dependencies in the meta.yaml file, not all dependencies, and I need the full list for a bespoke process down the line.
2.) (less vital but still good)
I can guarantee that the version built by the 'conda build' process (and all its dependencies) are valid.
Sometimes there is the issue of the build running and then a lower level dependency version changing between the time of the build and the release.
I do appreciate that the devs should be pinning all versions in the recipe, but there are still occasional problems.
Question
Therefore, is it possible to retrieve/extract/export the environment file from the 'conda build' process? Perhaps there is some flag? Or something I can script to run before the build finishes?
Possible solution I thought of
Let the process run, then script a step in my CI to create the environment and export the env file - I just don't want to add more time to the CI process, though.
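For what it's worth, a rough sketch of that extra CI step, assuming the built package lands in conda-build's local channel ('mypkg' and 'ci_check' are placeholder names, not anything conda build itself provides):
conda create -y -n ci_check -c local mypkg   # resolve the freshly built package from the local channel
conda env export -n ci_check > env.yml       # fully resolved dependency list, including transitive deps
conda remove -y -n ci_check --all            # throw the temporary environment away again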
I'm teaching a beginners python class, the environment is Anaconda, VS Code and git (plus a few extras from a requirements.txt).
For the windows students this runs perfectly, however the mac students have an existing python (2.7) to contend with.
For the Windows students (i.e. those with Windows computers), the environment they debug in matches their console environment. However, the Mac students seem to be locked to their 2.7 environment.
I've tried aliasing, as suggested here and here
alias python2='python'
alias python='python3'
alias pip2='pip'
alias pip='pip3'
I've modified the .bash_profile file
echo 'export PATH="/Users/$USER/anaconda3/bin:$PATH"' >>.bash_profile
Both of these seem to work perfectly to modify their Terminal environments, when launched externally to VS Code. Neither seems to do anything to the environment launched from [cmd]+[`].
I've also tried conda activate base in the terminal, which seems to have no effect on a python --version or a which python
They can run things using python 3, but that means that they need to remember that they are different to the other 2/3 of the students. It's othering for them, and more work for me!
The students are doing fine, launching things from their external terminal, but it would streamline things greatly if the environments could be as consistent as possible across the OSs.
Whilst they are complete beginners, they can run a shell script. They currently have one that installs pip requirements and vs code extensions.
Is there a configuration that will keep the terminal in line with the debug env?
In my opinion the best practice is to create Python virtual environments (personally I love using conda environments, especially on Mac, where you are stuck with an unremovable old Python version). Then VS Code will automatically (after installing the very powerful Python extension) find all your virtual environments. This way you will teach your students the good practice of handling the Python zoo, a.k.a. package incompatibilities. Terminal environment settings will be consistent with VS Code, without depending on aliases that are no longer needed. Obviously, virtual environments are OS independent, so you will be more consistent and remove unnecessary confusion between different students.
The additional bonus of the virtual environments is that you can create one exactly according to your requirements.txt and switch from one to another with a single click (in the terminal it takes two commands: deactivate -> activate).
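For example, switching in the terminal is just (the environment name is a placeholder):
conda deactivate
conda activate other_env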
You can read more about how to handle Python virtual environments on the VS Code site.
Given the aliases are run just once and are not persistent in .bash_profile, python targets the default interpreter rather than the expected conda python3 interpreter.
Try to symlink conda's python3 executable to capture the python namespace
ln -sf /Users/$USER/anaconda3/bin/python3 /Users/$USER/anaconda3/bin/python
This will create or update the symlink. Use the same approach for pip and pip3.
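For example, assuming the same Anaconda location as above:
ln -sf /Users/$USER/anaconda3/bin/pip3 /Users/$USER/anaconda3/bin/pip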
The Python extension in VS Code lets you select which interpreter will be used to run your scripts.
It is in the settings under "python.pythonPath"; just set it to point to the interpreter of your choice.
It can be set on a project basis as well (which is how you ensure that a project with a virtual environment executes using that interpreter and its packages); just select Workspace in the settings pane and add the desired Python interpreter there.
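As a sketch (the interpreter path is an assumption; point it at whatever environment the class uses), the workspace setting ends up as a small .vscode/settings.json, which you could even write from your setup script:
mkdir -p .vscode
cat > .vscode/settings.json <<'EOF'
{
    "python.pythonPath": "/Users/yourname/anaconda3/bin/python"
}
EOF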
I am looking for best practices for maintaining different python installations (or environments) simultaneously, specifically, with regards to achieving the following 3 objectives/requirements:
Note: I am using Anaconda installation for Python
Requirements
Requirement1 - Jupyter Notebook convenience
All python interpreters (either in different installations or environments) should be available in jupyter notebook in the list of kernels.
Requirement2 - Command line convenience
On the command line it should be easy to specify any python interpreter available on my machine to run a py script, something like: 'py2 test.py' or 'py3_env1 test.py'
Requirement3 - Automation convenience
Run the py scripts using the *.bat files, using any python interpreter available on the machine.
Choices
As implicit above, there are 2 broad choices here:
Choice1 - Single installation, multiple environments
...i.e. maintain a 'single' python installation folder but create multiple environments in it using 'conda'.
Choice2 - Multiple installations, single environment each
... i.e. maintain different python interpreters in separate installation folders with only the root environment.
If I make either of the two Choices above, I can satisfy some, but not all, of the 3 requirements.
For instance, if I make Choice1, I can satisfy Requirement1 easily, but not 2 and 3; and with Choice2, I can satisfy Requirement2 and Requirement3 only. Details as below.
Satisfying requirements
Satisfying Requirement1 - Jupyter Notebook convenience
If I make Choice1 (single installation, multiple environments), I can satisfy Requirement1 using the solutions posted here. It suggests registering ipykernels in all of the environments.
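For reference, the registration step looks roughly like this per environment (the environment name is just an example):
conda activate py3_env1
conda install -y ipykernel
python -m ipykernel install --user --name py3_env1 --display-name "Python (py3_env1)"
conda deactivate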
With Choice2, is it possible for jupyter notebook to see all kernels spread across many python installations? I couldn't make it work.
Satisfying Requirement2 - Command line convenience
If I make Choice2, then I can satisfy Requirement2 by using Python Launcher or creating symlinks as suggested here
With Choice1, it doesn't work since the Python Launcher doesn't recognise different environments (yet!). So, is there any other way to make it possible? Can I, let's say, use some kind of command alias: a short-cut command 'py' that refers to the 'root environment' python.exe, and another one 'py2' that refers to another environment's python.exe? And if I do that, will calling 'py2' be robust in the sense that it will use the correct site-packages folder? (Note, we haven't activated the environment here, so the Windows 'PATH' may not point to the correct folders.) I would assume that activating the environment is necessary, but I am not sure.
Satisfying Requirement3 - Automation convenience
If I make Choice2, I can probably satisfy Requirement3 using the Python Launcher as well (just like above). Not to mention, my script can also use a shebang line to specify which python it wants to invoke. However, I don't want to hardcode the python version in the script; I want to keep the flexibility in the command itself.
If I make Choice1, I thought that a *.bat file like below will work:
conda activate py3_env1
python test.py
deactivate
So, I intend to activate the environment first and then call the py file. However, this doesn't work. Calling this bat file executes just the first line, activates the environment and then nothing happens.
So, in short, I am looking to have a consistent setup that allows me to invoke different python versions from all mediums - notebook, command prompt and automated bat files - very conveniently.
Any guidance or recommendation around best practices are very appreciated.
I'm fairly new to python, but have built a few small projects. I have been taught, and have always used, the following commands to start a virtual environment: echo layout python3 > .envrc and then direnv allow.
What are the differences or advantages to using python -m venv <virtualenv name> versus echo layout?
Those two commands do entirely different things.
venv
The python -m venv <env_name> command creates a virtual environment as a subdirectory full of files in your filesystem. When it's done, a new virtual environment is sitting there ready for you to activate and use, but this command doesn't actually activate it yet.
Activating the virtual environment so you can use it is a separate step. The command to do this depends on which operating system and which shell you're using (see the "Command to activate virtual environment" table in the docs linked above).
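A minimal sketch of the two steps on Linux/macOS with bash (the directory name .venv is just a common convention):
python3 -m venv .venv        # create the virtual environment
source .venv/bin/activate    # activate it for this shell session

# On Windows (cmd.exe) the activation step would instead be:
#   .venv\Scripts\activate.bat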
The activation command alters only your current command-line shell session. This is why you have to re-activate the virtual environment in every shell session you start. This kind of annoyance is also what direnv exists to solve.
direnv and .envrc
First, about that echo command...
In both MS-DOS and Unix / Linux (and presumably recent versions of Macintosh), echo layout python3 just emits a string "layout python3".
The > redirects the echo command's output to a file, in this case .envrc. The redirection creates the file if it doesn't already exist, and then replaces its contents (if any) with that string. The end result is a file in your current working directory containing just:
layout python3
The .envrc file, and direnv allow
.envrc is a config file used by the direnv application. Whenever you cd into a directory containing a .envrc file, direnv reads it and executes the direnv instructions found inside.
direnv allow is a security feature. Since malicious .envrc files could be hidden almost anywhere (especially in world-writable directories like /var/tmp/), you could cd into a seemingly innocent directory and get a nasty surprise from someone else's .envrc land mine. The allow command specifically white-lists a directory's .envrc file, and apparently un-lists it if it discovers the .envrc file has changed since it was allowed.
Finally, back to direnv
I don't use direnv, but layout <language> is a direnv command to adjust your environment for developing in language, in this case activating a Python 3 virtual environment. The docs hint that it's more "helpful" than just that, but they don't go into any detail. (Also, you could have written your own direnv function called python3 that does something completely different.)
The goal of all that is to automatically enable your Python virtual environment as soon as you cd into its directory. This eliminates one kind of human error, namely forgetting to enable the virtual environment. For details, see Richard North's "Practical direnv", especially the "Automatic Python virtualenv switching" section.
(Dis-)Advantages and Opinions
If that's the kind of mistake you've made frequently, and you trust that the direnv command will never fall prey to a malicious .envrc file (or otherwise "helpfully" mess up something you're working on), then it might be worth it to you.
The biggest down-side I see to direnv (aside from the security implications) is that it trains you to forget about a vital step in using Python virtual environments... namely, actually using the virtual environment. This goes double for any other "help" it silently provides without telling you. (The fact that I keep putting "help" in quotes should suggest what I think of utilities like this.)
If you ever find yourself working somewhere direnv isn't installed, the odds are good that you'll forget to activate your virtual environments, or forget whatever else direnv has been doing for you. And the odds are even better that you'll have forgotten how to do it.
I have already created a virtualenv for running my python script.
Now when I integrate this python script with Jenkins, I have found that at execution time Jenkins is using the wrong python environment.
How can I ensure Jenkins is using the correct virtualenv?
As an example, in my case I want to use the virtualenv test. How can I use this pre-prepared virtualenv to run my python script?
source test/bin/activate
You should install one of the Python plugins. I've used ShiningPanda. Then you'll be able to create separate virtual environment configurations in Manage Jenkins > Configure System > Python > Python installation.
In job configuration there will be Python Builder step, where you can select python environment.
Just make sure you're not starting Jenkins service from within existing python virtual environment.
First, you should avoid using ShiningPanda because it is broken. It will fail if you try to run jobs in parallel, and it is also not compatible with Jenkins 2 pipelines.
When builds are run in parallel (concurrently), Jenkins will append @2, @3... to the workspace directory so two executions will not share the same folder. Jenkins does clone the original workspace, so do not be surprised if it contains a virtualenv you created in a previous build.
You need to take care of the virtualenv creation yourself but you have to be very careful about how you use it:
the workspace folder may not be cleaned up, and its location could change from one build to another
virtualenvs are known to break when they are moved, and Jenkins moves them
creating files outside workspace is a really bad CI practice, avoid the temptation to use /tmp
So your only safe option is to create a unique virtual environment folder for each build inside the workspace. You can easily do this by using the $BUILD_NUMBER environment variable.
This will be different even if you have jobs running in parallel, and build numbers are never reused.
Downsides:
speed: virtualenvs are not reused between builds, so they are fully recreated. If you use --system-site-packages you may speed up the creation considerably (if the heavy packages are already installed on the system)
space: if the workspace is not cleaned regularly, the number of virtualenvs will grow. Workaround: have a job that cleans workspaces every week or every two weeks (see the cleanup sketch below). This is also good practice for spotting other errors. Some people choose to clean the workspace for each execution.
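As a rough illustration (the two-week threshold and the .venv-* naming pattern are assumptions matching the snippet further down), such a cleanup could be as simple as running this from the job's workspace:
find . -maxdepth 1 -type d -name '.venv-*' -mtime +14 -exec rm -rf {} +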
Shell snippet
#!/bin/bash
set -euxo pipefail
# Use a unique venv folder *inside* the workspace
VENV=".venv-$BUILD_NUMBER"
# Initialize the new venv
virtualenv "$VENV"
# Activate it (PS1 must be defined because the activate script references it and `set -u` is on)
PS1="${PS1:-}" source "$VENV/bin/activate"
# Update pip inside the venv
pip install --upgrade pip
# <YOUR CODE HERE>
The set line enables bash strict mode; more details at http://redsymbol.net/articles/unofficial-bash-strict-mode/
You can use the Pyenv Pipeline Plugin. The usage is very easy; just add:
stage('my_stage') {
    steps {
        script {
            withPythonEnv('path_to_venv/bin') {
                sh("python foo.py")
                ...
            }
        }
    }
}
You can add pip install whatever in your steps to update any virtual environment you are using.
By default it will look for the virtual environment in the Jenkins workspace, and if it does not find it, it will create a new one.