Is there someplace that fully describes use of config data in snakemake rules?
There is an example in the user guide of this in a yaml file:
samples:
A: data/samples/A.fastq
B: data/samples/B.fastq
Then, it is used in a rule like this:
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
It seems like the above would replace {sample} to "data/samples/A.fastq" rather than by "A" (and "B" etc.) as it apparently does.
What is the right way to make use of config data in output rules, e.g. to help form the output filename? This form doesn't work:
output: "{config.dataFolder}/{ID}/{ID}.yyy"
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
The yaml and JSON config files are severely limited in that they cannot use values defined earlier in the file to define new values, right? And that's something that would often be done when setting configuration parameters.
What is the advantage of using a configfile? Why not instead just use include: an include a python file to define parameters?
A useful thing would be a reference manual that describes the details of SnakeMake thoroughly. The current website is kind of scattered, takes a while to find things that you remember seeing previously somewhere in it.
How should config data be used in "output" rules? I found that the output string cannot contain {config.} values. However, they can be included using Python code, as follows:
output: config["OutputDir"] + "/myfile.txt"
But, this method does NOT work (in either output: or input:):
params: config=config
output: "{params.config[OutputDir]}/myfile.txt"
However, it DOES work in "shell:":
params: config=config
output: config["OutputDir"] + "/myfile.txt"
shell: echo "OutputDir is {params.config[OutputDir]}" > {output}
Notice that there are no quotes around OutputDir inside the [] in the shell cmd. The {} method of expanding values in a string does not use quotes around the keys.
Can config data be defined snakefile-wise OR python-wise? YES!
Parameters can be defined in a .yaml file included using 'configfile', or via a regular Python file included using 'include'. The latter is IMHO superior, since .yaml files don't allow definitions to reference previous ones, something that would be common in all but the simplest configuration files.
To define the "OutputDir" parameter above using yaml:
xxx.yaml:
OutputDir: DATA_DIR
snakefile:
configfile: 'xxx.yaml'
To define it using Python to be exactly compatible with above:
xxx.py:
config['OutputDir'] = "DATA_DIR"
snakefile:
include: 'xxx.py'
Or, to define a simple variable 'OutputDir' in a Python included configuration file and then use it in a rule:
xxx.py:
OutputDir = "DATA_DIR"
snakefile:
include: 'xxx.py'
rule:
output: OutputDir + "/myfile.txt"
Multi-nested dictionaries and lists can be easily defined and accessed, both via .yaml files and python files. Example:
MACBOOK> cat cfgtest.yaml
cfgtestYAML:
A: 10
B: [1, 2, 99]
C:
nst1: "hello"
nst2: ["big", "world"]
MACBOOK> cat cfgtest.py
cfgtestPY = {
'X': -2,
'Y': range(4,7),
'Z': {
'nest1': "bye",
'nest2': ["A", "list"]
}
}
MACBOOK> cat cfgtest
configfile: "cfgtest.yaml"
include: "cfgtest.py"
rule:
output: 'cfgtest.txt'
params: YAML=config["cfgtestYAML"], PY=cfgtestPY
shell:
"""
echo "params.YAML[A]: {params.YAML[A]}" >{output}
echo "params.YAML[B]: {params.YAML[B]}" >>{output}
echo "params.YAML[B][2]: {params.YAML[B][2]}" >>{output}
echo "params.YAML[C]: {params.YAML[C]}" >>{output}
echo "params.YAML[C][nst1]: {params.YAML[C][nst1]}" >>{output}
echo "params.YAML[C][nst2]: {params.YAML[C][nst2]}" >>{output}
echo "params.YAML[C][nst2][1]: {params.YAML[C][nst2][1]}" >>{output}
echo "" >>{output}
echo "params.PY[X]: {params.PY[X]}" >>{output}
echo "params.PY[Y]: {params.PY[Y]}" >>{output}
echo "params.PY[Y][2]: {params.PY[Y][2]}" >>{output}
echo "params.PY[Z]: {params.PY[Z]}" >>{output}
echo "params.PY[Z][nest1]: {params.PY[Z][nest1]}" >>{output}
echo "params.PY[Z][nest2]: {params.PY[Z][nest2]}" >>{output}
echo "params.PY[Z][nest2][1]: {params.PY[Z][nest2][1]}" >>{output}
"""
MACBOOK> snakemake -s cfgtest
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 1
1
rule 1:
output: cfgtest.txt
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
MACBOOK> cat cfgtest.txt
params.YAML[A]: 10
params.YAML[B]: 1 2 99
params.YAML[B][2]: 99
params.YAML[C]: {'nst1': 'hello', 'nst2': ['big', 'world']}
params.YAML[C][nst1]: hello
params.YAML[C][nst2]: big world
params.YAML[C][nst2][1]: world
params.PY[X]: -2
params.PY[Y]: range(4, 7)
params.PY[Y][2]: 6
params.PY[Z]: {'nest1': 'bye', 'nest2': ['A', 'list']}
params.PY[Z][nest1]: bye
params.PY[Z][nest2]: A list
params.PY[Z][nest2][1]: list
YAML Configuration
This has to do with the nesting of YAML files, see an example here.
The config["samples"] request will return both 'A' and 'B'. I'm my head I think of it returning a list, but I am not positive on the variable type.
By using the configfile as listed here:
https://snakemake.readthedocs.io/en/latest/tutorial/advanced.html
You can link in the following YAML configuration files, in YAML format.
settings/config.yaml:
samples:
A
B
OR
settings/config.yaml:
sampleID:
123
124
125
baseDIR:
data
Resulting call with YAML config access
Snakefile:
configfile: "settings/config.yaml"
rule all:
input:
expand("{baseDIR}/{ID}.bam", baseDIR=config["baseDIR"], ID=config["sampleID"]),
rule fastq2bam:
input:
expand("{{baseDIR}}/{{ID}}.{readDirection}.fastq", readDirection=['1','2'])
output:
"{baseDIR}/{ID}.bam"
#Note different number of {}, 1 for wildcards not in expand.
#Equivalent line with 'useless' expand call would be:
#expand("{{baseDIR}}/{{ID}}.bam")
shell:
"""
bwa mem {input[0]} {input[1]} > {output}
"""
Dummy examples, just trying to exemplify the use of different strings and config variables. I use wildcards in the fastq2bam rule. Typically I only use config variables to set things in my rule 'all', when possible this is best practice. I cannot say if the shell call actually works for bwa mem, but I think you get the idea of what I'm implying.
A larger version of a Snakefile can be seen here
Once the configfile is setup, to reference anything in it, use 'config'. It can be used to access deep into a YAML as needed. Here I'll go down 3 hypothetical levels, like so:
hypothetical_var = config["yamlVarLvl1"]["yamlVarLvl2"]["yamlVarLvl3"]
Equates to (I'm not POSITIVE about the typing, I think it converts to strings)
hypothetical_var = ['124', '125', '126', '127', '128', '129']
If the YAML is:
yamlVarLvl1:
yamlVarLvl2:
yamlVarLvl3:
'124'
'125'
'126'
'127'
'128'
'129'
Code Organization
Python and Snakemake code, for the most part can be interleaved in certian places. I would advise against this as it will make the code difficult to maintain. It's up to the user to decide how to implement this. E.g, using the run or the shell directive changes how to access the variables.
YAML and JSON files are preferred configuration variable files as I believe the provide some support for editting and Command-Line Interface over-ridding of variables. This would not be as clean if it was implemented using externally imported python variables. Also it helps my brain, knowing python files do things, and YAML files store things.
YAML is always an external file, but...
If you are using a single Snakefile, put the supporting python at the top?
If you are using a multi-file system, consider having the supporting python scripts externalized.
Tutorials
I think a perfect vignette is difficult to design. I'm trying to teach my group about Snakemake, and I have over 40 pages of personally written documentation, I've provided three 1hr+ presentations with PowerPoint slideshows, I've read nearly the entire ReadTheDocs.io manual for Snakemake, and I just recently finished going through the list of additional resources, yet, I'm still learning too!
Side note, I found this tutorial to be very nice as well.
Does that provide enough context?
Is there someplace that fully describes use of config data in snakemake rules?
There is no limit to what you can put in the config file, provided it can be parsed into python objects. Basically, "your imagination is the limit".
What is the right way to make use of config data in output rules, e.g. to help form the output filename?
I extract things from the config outside the rules, in plain python.
Instead of output: "{config.dataFolder}/{ID}/{ID}.yyy", I would do:
data_folder = config.dataFolder
rule name_of_the_rule:
output:unction
os.path.join(data_folder, "{ID}", "{ID}.yyy")
I guess that with what you tried, snakemake has problems formatting the string when there is a mix of things coming from the wildcards, and others. But maybe the following works in python 3.6, using formatted string litterals: output: f"{config.dataFolder}/{ID}/{ID}.yyy". I haven't checked.
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
In the snakefile, I typically read the config file to extract configuration information before the rules. This is essentially pure python except that a config object is directly made available by Snakemake for convenience. You could probably just use plain standard python using config = json.load("config.json") or config = yaml.load("config.yaml").
In the snakefile, outside the rules, you can do whatever computations you want in python. This can be before reading the config as well as after. You can define functions that can be used in rules (for instance to generate rule's inputs), compute lists of things that will be used as wildcards. I think the only thing is that an object needs to be defined before the rules that use it.
Snakemake syntax seems mainly a means of describing the rules. Within the run part of a rule, you can use whatever python you want, knowing that you have access to a wildcards object to help you. Input and output of rules are lists of file paths, and you can use python in order to build them.
The following youtube video shows that it is possible to jump to definition using vim for python.
However when I try the same shortcut (Ctrl-G) it doesnt work...
How is it possible to perform the same "jump to definition"?
I installed the plugin Ctrl-P but not rope.
This does not directly answer your question but provides a better alternative. I use JEDI with VIM, as a static code analyser, it offers far better options than ctags. I use the spacemacs key-binding in vim so with localleader set to ','
" jedi
let g:jedi#use_tabs_not_buffers = 0 " use buffers instead of tabs
let g:jedi#show_call_signatures = "1"
let g:jedi#goto_command = "<localleader>gt"
let g:jedi#goto_assignments_command = "<localleader>ga"
let g:jedi#goto_definitions_command = "<localleader>gg"
let g:jedi#documentation_command = "K"
let g:jedi#usages_command = "<localleader>u"
let g:jedi#completions_command = "<C-Space>"
let g:jedi#rename_command = "<leader>r"
Vim's code navigation is based on a universal database called tags file. It needs to be generated (and updated) manually. :help ctags lists some applications that can do that. Exuberant ctags is a common one that supports many programming languages, but there are also specialized ones, like ptags.py (found in your Python source directory at Tools/scripts/ptags.py).
Plugins like easytags.vim provide more convenience by e.g. automatically updating the tags file on each save.
The default command for jumping to the definition is CTRL-] (not CTRL-G; that prints the current filename; see :help CTRL-G), or the Ex command :tag {identifier}; see all at :help tag-commands.
Some suggestions for people reading other answers to this question in the future:
tags file has one limitation. If in your code multiple objects has the same name you will have problem using ctrl-] as it will jump to first one and not necessary correct one. In this situation you can use g ctrl-] (or :tjump command or :tselect command) to get selection list. Potentially you want to map ctrl-] to "g ctrl-]"
It is possible that you want to have possibility to jump to correct object. In that case you might want to use jedi vim and if you are used to c-] you might want to use this mapping for jedi goto let g:jedi#goto_command = ""
Lastly you want to use universal ctags instead of excuberant ctags because of better new files support (not necessary python).
If you're using YouCompleteMe there is a command for that
:YcmCompleter GoToDefinition
if you want to add a shortcut for doing that in a new tab
nnoremap <leader>d :tab split \| YcmCompleter GoToDefinition<CR>
I have a very specific problem I have been trying to work out. I'm using a PowerShell script to name newly imaged computers during the imaging proceess, and I need to grab a newly generated number from a sequence. I use SCCM 2012 R2 for this, btw
For example, I have the script naming our computers by our convention using wmi query:
if ($ComputerVersion -eq "ThinkPad T400")
{
$OSDComputerName = "T400xxxx-11"
$TSEnv = New-Object -COMObject Microsoft.SMS.TSEnvironment
$TSEnv.Value("OSDComputerName") = "$OSDComputerName"
}
I set the $ComputerVersion variable, using WMI query, like so:
$ComputerVersion = (Get-WmiObject -Class Win32_ComputerSystemProduct | Select-Object Version).Version
So, the crux of my question is I want to set another variable, probably something simple
like $num, for the next number available to label our computers. This number will be replacing the "xxxx". I'll be doing that by:
if ($ComputerVersion -eq "ThinkPad T400")
{
$OSDComputerName = "T400" + $num + "-11"
$TSEnv = New-Object -COMObject Microsoft.SMS.TSEnvironment
$TSEnv.Value("OSDComputerName") = "$OSDComputerName"
}
This number is being generated by a linux server we have, and its already running some python script to dish out the next available number in the sequence. I can post that python script if needed, but it's 133 lines.
What I need to know is how to call for that web request via PowerShell, and set that returned number (the next available) as a new variable.
I've never used web-services or web-requests before and any help would be greatly appreciated. Thanks in advance!
Depends what the web request returns and whether or not you need to process any return data, but if it simply returns the number you could do this:
$webClient = New-Object System.Net.WebClient
$num = $webClient.downloadstring("http://yourwebservice.com/buildnumber")
I'm looking for a way to make a function in python where you pass in a string and it returns whether it's spelled correctly. I don't want to check against a dictionary. Instead, I want it to check Google's spelling suggestions. That way, celebrity names and other various proper nouns will count as being spelled correctly.
Here's where I'm at so far. It works most of the time, but it messes up with some celebrity names. For example, things like "cee lo green" or "posner" get marked as incorrect.
import httplib
import xml.dom.minidom
data = """
<spellrequest textalreadyclipped="0" ignoredups="0" ignoredigits="1" ignoreallcaps="1">
<text> %s </text>
</spellrequest>
"""
def spellCheck(word_to_spell):
con = httplib.HTTPSConnection("www.google.com")
con.request("POST", "/tbproxy/spell?lang=en", data % word_to_spell)
response = con.getresponse()
dom = xml.dom.minidom.parseString(response.read())
dom_data = dom.getElementsByTagName('spellresult')[0]
if dom_data.childNodes:
for child_node in dom_data.childNodes:
result = child_node.firstChild.data.split()
for word in result:
if word_to_spell.upper() == word.upper():
return True;
return False;
else:
return True;
Peter Norvig tells you how implement spell checker in Python.
Rather than sticking to Mr. Google, try out other big fellows.
If you really want to stick with search engines which count page requests, Yahoo and Bing are providing some excellent features. Yahoo is directly providing spell checking services using YQL tables (Free: 5000 request/day and non-commercial).
You have good number of Python API's which are capable to do a lot similar magic including on nouns that you mentioned (sometimes may turn around - after all its somewhere based upon probability)
So, in the second case, you got a good list (totally free)
GNU - Aspell (Even got python bindings)
PyEnchant
Whoosh (It does a lot more than spell checking but I think it has some edge on it.)
I hope they should give you a clear idea of how things work.
Actually spell checking involves very complex mechanisms in the areas of Machine learning, AI, NLP.. etc a lot more. So, companies like Google/ Yahoo don't really offer their API entirely free.
I want to process a medium to large number of text snippets using a spelling/grammar checker to get a rough approximation and ranking of their "quality." Speed is not really of concern either, so I think the easiest way is to write a script that passes off the snippets to Microsoft Word (2007) and runs its spelling and grammar checker on them.
Is there a way to do this from a script (specifically, Python)? What is a good resource for learning about controlling Word programmatically?
If not, I suppose I can try something from Open Source Grammar Checker (SO).
Update
In response to Chris' answer, is there at least a way to a) open a file (containing the snippet(s)), b) run a VBA script from inside Word that calls the spelling and grammar checker, and c) return some indication of the "score" of the snippet(s)?
Update 2
I've added an answer which seems to work, but if anyone has other suggestions I'll keep this question open for some time.
It took some digging, but I think I found a useful solution. Following the advice at http://www.nabble.com/Edit-a-Word-document-programmatically-td19974320.html I'm using the win32com module (if the SourceForge link doesn't work, according to this Stack Overflow answer you can use pip to get the module), which allows access to Word's COM objects. The following code demonstrates this nicely:
import win32com.client, os
wdDoNotSaveChanges = 0
path = os.path.abspath('snippet.txt')
snippet = 'Jon Skeet lieks ponies. I can haz reputashunz? '
snippet += 'This is a correct sentence.'
file = open(path, 'w')
file.write(snippet)
file.close()
app = win32com.client.gencache.EnsureDispatch('Word.Application')
doc = app.Documents.Open(path)
print "Grammar: %d" % (doc.GrammaticalErrors.Count,)
print "Spelling: %d" % (doc.SpellingErrors.Count,)
app.Quit(wdDoNotSaveChanges)
which produces
Grammar: 2
Spelling: 3
which match the results when invoking the check manually from Word.