I am writing a Snakefile for a snakemake workflow. As part of my workflow I need to check whether a set of records in a database has changed, and if they have re-download them.
My thought was to write a rule that checks the database timestamp and writes it to an output file, and to use that timestamp file as an input to my download rule. The problem is that once the timestamp file is written, the timestamp rule will never run again, and hence the timestamp will never be updated.
Is there a way to make this rule run every time? (I know I can force it from the shell, but I would like to specify it in the Snakefile.) Or is there a better way to handle this?
Any code you add to a Snakefile outside of a rule or function definition will be run at startup just like a regular Python script, so you don't need an external shell script. You can implement the logic you want in Python right in the Snakefile, making use of the shell() function if you need it.
One caveat would be that if you tried to run your workflow on a cluster, the code would be run each time for each cluster job submitted. A crude but effective way to avoid this is to guard it with a check like this:
import os
import sys

# check_database_for_updates() stands in for your own function that
# returns True when the remote records have changed
if '--nolock' not in sys.argv:
    if check_database_for_updates():
        open('touch.file', 'a').close()  # make sure the file exists
        os.utime('touch.file')           # bump its mtime to now
Then set touch.file as a proxy input to your rule that reads from the database. Does that make sense?
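For illustration, a minimal sketch of that wiring (the rule name, output file, and download command below are placeholders, not part of your actual workflow):

rule download_records:
    input:
        "touch.file"  # proxy input refreshed by the startup code above
    output:
        "records.json"
    shell:
        "fetch_records --out {output}"  # stand-in for your download step

Because the startup code bumps touch.file's mtime whenever the database has changed, Snakemake sees the input as newer than the output and re-runs the download.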
Since v3.6.0, Snakemake has an onstart handler that lets you always execute something before the workflow starts. Note that dry-runs do not trigger any of the handlers.
It's unfortunate that onstart doesn't get triggered during dry-runs.
On a similar note, the onsuccess and onerror handlers can be used to trigger something to be executed on the workflow's success or failure, respectively.
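A minimal sketch of how the three handlers look in a Snakefile (the bodies here are just placeholder prints):

onstart:
    print("Workflow starting")

onsuccess:
    print("Workflow finished without errors")

onerror:
    print("Workflow failed; see the log for details")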
Related
I wrote a nextflow workflow that performs five steps. I have done this to be used by me and my colleagues, but not everyone is skilled with nextflow. Therefore, I decided to write a small wrapper in python3 that could run it for them.
The wrapper is very simple, reads the options with argparse and creates a command to be run with subprocess.run(). The issue with the wrapper is that, once the first step of the pipeline is completed, subprocess.run() thinks that the process is over.
I tried using shell=True, and I tried using subprocess.Popen() with a while loop waiting for an output file, but neither solved it.
How can I tell either subprocess.run() to wait until the very end, or nextflow run not to emit an exit code until the last step? Is there a way, or am I better off giving my colleagues a NF tutorial instead?
EDIT:
The reason why I prefer the wrapper is that nextflow creates lots of temporary files which one has to know how to clean up. The wrapper does it for them, saving disk space and my time.
The first part of your question is a bit tricky to answer without the details, but we know subprocess.run() should wait for the command specified to complete. If your nextflow command is actually exiting before all of your tasks/steps have completed, then there could be a problem with the workflow or with the version of Nextflow itself. Since this occurs after the first process completes, I would suspect the former. My guess is that there might be some plumbing issue somewhere. For example, if your second task/step definition is conditional in any way then this could allow an early exit from your workflow.
I would avoid the wrapper here. Running Nextflow pipelines should be easy, and the documentation that accompanies your workflow should be sufficient to get it up and running quickly. If you need to set multiple params on the command line, you could include one or more configuration profiles to make it easy for your colleagues to get started running it. The section on pipeline sharing is also worth reading if you haven't seen it already.

If the workflow does create lots of temporary files, just ensure these are all written to the working directory, so that upon successful completion all you need to clean up is a simple rm -rf ./work. I tend to avoid automating destructive commands like this, to avoid accidental deletes. A line in your workflow's documentation saying that the working directory can be removed (following successful completion of the pipeline) should be sufficient in my opinion; just leave it up to the users to clean up after themselves.
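That said, if you do keep the wrapper, it can stay very small: subprocess.run() blocks until the nextflow process exits, so cleanup can be gated on its return code. A sketch, assuming the pipeline script is main.nf and the work directory is the default ./work:

import subprocess
import shutil
import sys

# subprocess.run() blocks until nextflow exits
result = subprocess.run(["nextflow", "run", "main.nf"])
if result.returncode == 0:
    # only clean up Nextflow's working directory after a successful run
    shutil.rmtree("work", ignore_errors=True)
else:
    sys.exit(result.returncode)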
EDIT: You may also be interested in this project: https://github.com/goodwright/nextflow.py
A little bit of an ugly question, but I didn't find existing SO posts which cover it.
Right now I need to use an existing python tool available on this github.
This is a rather big piece of code with a lot of dependencies which I don't want to mess with. In a nutshell, one can run its module by passing command line arguments, for example:
timesearch.py timesearch -r "subreddit1" -l "1466812800" -up "1498348800"
Now, I need to run this tool a bunch of times using a for loop, passing different argument values each time. The tool also prints some output to the command line when you run it, and I would like to intercept and print that from my python script as well. Finally, before I move on in my loop and run the tool another time, I need to ensure that the current execution of the timesearch tool has completed.
One side note here: I need to ensure that timesearch is executed using the same environment I use to run my main script with the for loop.
I am trying to understand what is the best way to do it.
If I just go for this it doesn't work:
import os
#for loop will go here
os.system('python timesearch.py timesearch -r "ethereum" -l "1466812800" -up "1498348800"')
It fails for several reasons: it doesn't use the environment in which I am running my script with the loop, and it doesn't capture the print output of timesearch.
Any advice on how to achieve it?
Just to highlight: I can't just go and pull the function I need out of timesearch, since it calls __init__ to set up some things based on the arguments you pass.
I wouldn't call a python script with os.system. There is basically one function you need to use: main(sys.argv[1:]), defined here:
https://github.com/voussoir/timesearch/blob/master/timesearch/__init__.py#L435.
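A rough sketch of that approach, assuming the package is importable from your script and that its main() accepts an argv-style list (as at the line linked above):

import io
import contextlib
import timesearch

# hypothetical loop values
for sub in ["ethereum", "bitcoin"]:
    argv = ["timesearch", "-r", sub, "-l", "1466812800", "-up", "1498348800"]
    # running in-process uses the same interpreter and environment as
    # this script, and main() only returns once the run has completed
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):  # intercept the tool's prints
        timesearch.main(argv)
    print(buf.getvalue())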
I am writing a watchman command with watchman-make and I'm at a loss when trying to access exactly what was changed in the directory. I want to run my upload.py script, and inside the script I would like to access the filenames of newly created files in /var/spool/cups-pdf/ANONYMOUS.
so far I have
$ watchman-make -p '/var/spool/cups-pdf/ANONYMOUS' --run 'python /home/pi/upload.py'
I'd like to add another argument to python upload.py so that I have the exact filepath of the newly created file, and can send the new file over to my database in upload.py.
I've been looking at the watchman docs and the closest thing I can think of to use is a trigger object. Please help!
Solution with watchman-wait:
Assuming project layout like this:
/posts/_SUBDIR_WITH_POST_NAME_/index.md
/Scripts/convert.sh
And the shell script like this:
#!/bin/bash
# File: convert.sh
SrcDirPath=$(cd "$(dirname "$0")/../"; pwd)
cd "$SrcDirPath"
echo "Converting: $SrcDirPath/$1"
Then we can launch watchman-wait like this:
watchman-wait . --max-events 0 -p 'posts/**/*.md' | while read line; do ./Scripts/convert.sh "$line"; done
When we change the file /posts/_SUBDIR_WITH_POST_NAME_/index.md, the output will look like this:
...
Converting: /Users/.../Angular/dartweb_quickstart/posts/swift-on-android-building-toolchain/index.md
Converting: /Users/.../Angular/dartweb_quickstart/posts/swift-on-android-building-toolchain/index.md
...
watchman-make is intended to be used together with tools that will perform a follow-up query of their own to discover what they want to do as a next step. For example, running the make tool will cause make to stat the various deps to bring things up to date.
That means that your upload.py script needs to know how to do this for itself if you want to use it with watchman.
You have a couple of options, depending on how sophisticated you want things to be:
Use pywatchman to issue an ad-hoc query
If you want to be able to run upload.py whenever you want and have it figure out the right thing (just like make would do) then you can have it ask watchman directly. You can have upload.py use pywatchman (the python watchman client) to do this. pywatchman will get installed if the watchman configure script thinks you have a working python installation. You can also pip install pywatchman. Once you have it available and in your PYTHONPATH:
import os
import pywatchman

client = pywatchman.client()
client.query('watch-project', os.getcwd())
result = client.query('query', os.getcwd(), {
    "since": "n:pi_upload",
    "fields": ["name"]})
print(result["files"])
This snippet uses the since generator with a named cursor to discover the list of files that changed since the last query was issued using that same named cursor. Watchman will remember the associated clock value for you, so you don't need to complicate your script with state tracking. We're using the name pi_upload for the cursor; the name needs to be unique among the watchman clients that might use named cursors, so naming it after your tool is a good idea to avoid potential conflict.
This is probably the most direct way to extract the information you need without requiring that you make more invasive changes to your upload script.
Use pywatchman to initiate a long running subscription
This approach will transform your upload.py script so that it knows how to directly subscribe to watchman, so instead of using watchman-make you'd just directly run upload.py and it would keep running and performing the uploads. This is a bit more invasive and is a bit too much code to try and paste in here. If you're interested in this approach then I'd suggest that you take the code behind watchman-wait as a starting point. You can find it here:
https://github.com/facebook/watchman/blob/master/python/bin/watchman-wait
The key piece of this that you might want to modify is this line:
https://github.com/facebook/watchman/blob/master/python/bin/watchman-wait#L169
which is where it receives the list of files.
Why not triggers?
You could use triggers for this, but we're steering folks away from triggers because they are hard to manage. A trigger will run in the background and have its output go to the watchman log file. It can be difficult to tell if it is running, or to stop it running.
The trigger interface is closer to the unix model and allows you to feed a list of files on stdin.
Speaking of unix, what about watchman-wait?
We also have a command that emits the list of changed files as they change. You could potentially stream the output from watchman-wait into your upload.py. This would make it have some similarities with the subscription approach, but without directly using the pywatchman client.
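For instance, a rough sketch of upload.py reading that stream on stdin (the upload helper is a placeholder):

# run as: watchman-wait /var/spool/cups-pdf/ANONYMOUS --max-events 0 | python upload.py
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    print("uploading", path)
    # send_to_database(path)  # hypothetical upload function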
I am looking at a way to add a new "mode" in the Pysys Baserunner.
In particular I would like to add a validate mode that just re-runs the validation portion. This is useful when you are writing your testcase and trying to tune the validation conditions to fit the current output, without having to re-run the complete testcase.
What is the best way to do this without having to change the original class?
This requires support from the framework, unfortunately. The issue is that the BaseRunner class will always automatically purge the output directory, and there is no hook in the framework to let you avoid this. You can, for instance, manually move the output subdirectory you want to re-run the validation over to, say, 'repeat' (at the same directory level), and then use:
import os
from pysys.constants import *
from pysys.basetest import BaseTest

class PySysTest(BaseTest):
    def execute(self):
        if self.mode == 'repeat': pass  # skip execution in repeat mode

    def validate(self):
        if self.mode == 'repeat':
            # validate against the preserved output from the earlier run
            self.output = os.path.join(self.descriptor.output, 'repeat')
where I have omitted the implementations of the execute and validate methods. You would need to add the mode into the descriptor for the test:
<classification>
  <groups>
    <group></group>
  </groups>
  <modes>
    <mode>repeat</mode>
  </modes>
</classification>
and run using "pysys.py run -mrepeat". This would help with debugging if your execute takes a long time, but it is probably not what you want out-of-the-box, i.e. a top-level option to the runner to just perform validation over a previously run test. I'll add a feature request for this.
Since the original discussion, a --validateOnly command line option was added to PySys (in v1.1.1) which does pretty much what you suggest - it skips the execute method and just runs validate.
This assumes you aren't running with --purge (which I imagine is a safe assumption for this use case), and that you don't have validation commands that try to read zero-byte files from the output dir (these always get deleted even if --purge is not specified). However, assuming those conditions are met, your (non-empty) output files will still be there after the first run of the test has completed, and you can re-run just the validation using the --validateOnly command.
To get this feature you can install the latest PySys version (1.4.0) - see https://pypi.org/project/PySys/
I would like users of my program to be able to define custom scripts in python without breaking the program. I am looking at something like this:
def call(script):
    code = "access modification code" + script
    exec(code)
Where "access modification code" defines a scope such that script can only access variables it instantiates itself. Is it possible to do this or something with similar functionality, such as creating a new python environment with its own scope and then receiving output from it?
Thank you for your time :)
Clarification Edit
"I want to prevent both active attacks and accidental interaction with the program variables outside the users script (hence hiding all globals). The user scripts are intended to be small and inputted as text. The return of the user script needs to be immediate, as though it were native to the program."
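For the "accidental interaction" half of this, the usual starting point is to pass exec an explicit globals mapping, so the script only sees names you put there. A minimal sketch (note this is not a security boundary against hostile code, as the answer below explains):

def call(script):
    # a fresh namespace: the script can't see this program's globals
    sandbox = {"__builtins__": {}}
    exec(script, sandbox)
    return sandbox.get("result")  # hypothetical output convention

print(call("result = 1 + 2"))  # prints 3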
There are two separate problems you want to prevent in this scenario:
1. An outside attacker running arbitrary code in the context of the OS user executing your program. This means preventing arbitrary code execution and privilege escalation.
2. A legitimate user of your program changing the program's behavior in unintended ways.
For the first problem, you need to make sure that the source of the python code you execute is the user. You shouldn't accept input from a socket, or from a file that other users can write to. You need to make sure, somehow, that it was the user who is running the program that provided the input.
Actual solutions will depend on your OS, but you may want to consider setting up restrictive file permissions, if you're storing the code in a file.
Don't ignore or downplay this problem, or your users will fall victim to virus/malware/hackers thanks to your program.
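As one concrete (POSIX-style, OS-dependent) illustration of such a check, with a placeholder path:

import os
import stat

USER_CODE_FILE = "/home/user/user_code_file.py"  # placeholder path

# refuse to run code from a file that other users could have modified
mode = os.stat(USER_CODE_FILE).st_mode
if mode & (stat.S_IWGRP | stat.S_IWOTH):
    raise RuntimeError("user code file is group/world-writable")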
The way you solve the second problem depends on what exactly constitutes intended behaviour in your program. If you're happy with outputting simple data structures, you can run your user-inputted code in a separate process, and pass the result over a pipe in a serialization format such as JSON or YAML.
Here's a very simple example:
import subprocess
import json

# remember to set restrictive file permissions on this file; this is OS-dependent
USER_CODE_FILE = "/home/user/user_code_file.py"
# absolute path to the python binary (executable)
PYTHON_PATH = "/usr/bin/python"

user_code = '''
import json
my_data = {"a": [1, 2, 3]}
print(json.dumps(my_data))
'''

with open(USER_CODE_FILE, "w") as f:
    f.write(user_code)

user_result_str = subprocess.check_output([PYTHON_PATH, USER_CODE_FILE])
user_result = json.loads(user_result_str)
print(user_result)
This is a fairly simple solution, and it has significant overhead. Don't use this if you need to run the user code many times in a short period of time.
Ultimately, this solution is only effective against unsophisticated attackers (users). Generally speaking, there's no way to protect any user process from the user itself - nor would it make much sense.
If you really want more assurance, and want to mitigate the first problem too, you should run the process as a separate ("guest") user. This, again, is OS-dependent.
Finally, a warning: avoid exec and eval to the best of your abilities. They don't protect either your program or the user against the inputted code. There's a lot of information about this on the web; just search for "python secure eval".