Here's an example of what I am trying to do:
mydictionary={
'apple': 'crunchy fruit',
'banana': 'mushy and yellow'
}
rule all:
input:
expand('{key}.txt', key=mydictionary.keys())
rule test:
output: temp('{f}.txt')
shell:
"""
echo {mydictionary[wildcards.f]} > {output}
cat {output}
"""
For some reason, I am not able to access the dictionary contents. I tried using double-curly brackets, but the content of the text files becomes literal {mydictionary[wildcards.f]} (while I want the content of the corresponding entry in the dictionary).
I'm pretty sure the bracket markup can only replace variables with string representations of their values, but does not support any code evaluation within the brackets. That is, {mydictionary[wildcards.f]} will try to look up a variable literally named "mydictionary[wildcards.f]". Likewise, {mydictionary}[{wildcards.f}] will just paste the string values together. So, I don't think you can do what you want within the shell section alone. Instead, you can accomplish what you want in the params section:
rule test:
output: temp('{f}.txt')
params:
value=lambda wcs: mydictionary[wcs.f]
shell:
"""
echo '{params.value}' > {output}
cat {output}
"""
for key in dictionary:
file = file.replace(str(key), dictionary[key])
With this simple snippet I am able to replace each occurence of dictionary key, with it's value, in a file. (Python)
Is there a similar way to go about in bash?
Exampple:
file="addMesh:"0x234544"
addMesh="0x12353514"
${!dictionary[i]}: 0x234544
${dictionary[i]}: 0x234544x0
${!dictionary[i]}: 0x12353514
${!dictionary[i]}: 0x12353514x0
Wanted output (new content of file):"addMesh:"0x234544x0"
addMesh="0x12353514x0"
:
for i in "${!dictionary[#]}"
do
echo "key : $i"
echo "value: ${dictionary[$i]}"
echo
done
While there certainly are more sophisticated methods to do this, I find the following much easier to understand, and maybe it's just fast enough for your use case:
#!/bin/bash
# Create copy of source file: can be omitted
cat addMesh.txt > newAddMesh.txt
file_to_modify=newAddMesh.txt
# Declare the dictionary
declare -A dictionary
dictionary["0x234544"]=0x234544x0
dictionary["0x12353514"]=0x12353514x0
# use sed to perform all substitutions
for i in "${!dictionary[#]}"
do
sed -i "s/$i/${dictionary[$i]}/g" "$file_to_modify"
done
# Display the result: can be omitted
echo "Content of $file_to_modify :"
cat "$file_to_modify"
Assuming that the input file addMesh.txt contains
"addMesh:"0x234544"
addMesh="0x12353514"
the resulting file will contain:
"addMesh:"0x234544x0"
addMesh="0x12353514x0"
This method is not very fast, because it invokes sed multiple times. But it does not require sed to generate other sed scripts or anything like that. Therefore, it is closer to the original Python script. If you need better performance, refer to the answers in the linked question.
There is no perfect equivalent in Bash. You could do it in a roundabout way, given that dict is the associative array:
# traverse the dictionary and build command file for sed
for key in "${!dict[#]}"; do
printf "s/%s/%s/g;\n" "$key" "${dict[$key]}"
done > sed.commands
# run sed
sed -f sed.commands file > file.modified
# clean up
rm -f sed.commands
Is there someplace that fully describes use of config data in snakemake rules?
There is an example in the user guide of this in a yaml file:
samples:
A: data/samples/A.fastq
B: data/samples/B.fastq
Then, it is used in a rule like this:
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
It seems like the above would replace {sample} to "data/samples/A.fastq" rather than by "A" (and "B" etc.) as it apparently does.
What is the right way to make use of config data in output rules, e.g. to help form the output filename? This form doesn't work:
output: "{config.dataFolder}/{ID}/{ID}.yyy"
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
The yaml and JSON config files are severely limited in that they cannot use values defined earlier in the file to define new values, right? And that's something that would often be done when setting configuration parameters.
What is the advantage of using a configfile? Why not instead just use include: an include a python file to define parameters?
A useful thing would be a reference manual that describes the details of SnakeMake thoroughly. The current website is kind of scattered, takes a while to find things that you remember seeing previously somewhere in it.
How should config data be used in "output" rules? I found that the output string cannot contain {config.} values. However, they can be included using Python code, as follows:
output: config["OutputDir"] + "/myfile.txt"
But, this method does NOT work (in either output: or input:):
params: config=config
output: "{params.config[OutputDir]}/myfile.txt"
However, it DOES work in "shell:":
params: config=config
output: config["OutputDir"] + "/myfile.txt"
shell: echo "OutputDir is {params.config[OutputDir]}" > {output}
Notice that there are no quotes around OutputDir inside the [] in the shell cmd. The {} method of expanding values in a string does not use quotes around the keys.
Can config data be defined snakefile-wise OR python-wise? YES!
Parameters can be defined in a .yaml file included using 'configfile', or via a regular Python file included using 'include'. The latter is IMHO superior, since .yaml files don't allow definitions to reference previous ones, something that would be common in all but the simplest configuration files.
To define the "OutputDir" parameter above using yaml:
xxx.yaml:
OutputDir: DATA_DIR
snakefile:
configfile: 'xxx.yaml'
To define it using Python to be exactly compatible with above:
xxx.py:
config['OutputDir'] = "DATA_DIR"
snakefile:
include: 'xxx.py'
Or, to define a simple variable 'OutputDir' in a Python included configuration file and then use it in a rule:
xxx.py:
OutputDir = "DATA_DIR"
snakefile:
include: 'xxx.py'
rule:
output: OutputDir + "/myfile.txt"
Multi-nested dictionaries and lists can be easily defined and accessed, both via .yaml files and python files. Example:
MACBOOK> cat cfgtest.yaml
cfgtestYAML:
A: 10
B: [1, 2, 99]
C:
nst1: "hello"
nst2: ["big", "world"]
MACBOOK> cat cfgtest.py
cfgtestPY = {
'X': -2,
'Y': range(4,7),
'Z': {
'nest1': "bye",
'nest2': ["A", "list"]
}
}
MACBOOK> cat cfgtest
configfile: "cfgtest.yaml"
include: "cfgtest.py"
rule:
output: 'cfgtest.txt'
params: YAML=config["cfgtestYAML"], PY=cfgtestPY
shell:
"""
echo "params.YAML[A]: {params.YAML[A]}" >{output}
echo "params.YAML[B]: {params.YAML[B]}" >>{output}
echo "params.YAML[B][2]: {params.YAML[B][2]}" >>{output}
echo "params.YAML[C]: {params.YAML[C]}" >>{output}
echo "params.YAML[C][nst1]: {params.YAML[C][nst1]}" >>{output}
echo "params.YAML[C][nst2]: {params.YAML[C][nst2]}" >>{output}
echo "params.YAML[C][nst2][1]: {params.YAML[C][nst2][1]}" >>{output}
echo "" >>{output}
echo "params.PY[X]: {params.PY[X]}" >>{output}
echo "params.PY[Y]: {params.PY[Y]}" >>{output}
echo "params.PY[Y][2]: {params.PY[Y][2]}" >>{output}
echo "params.PY[Z]: {params.PY[Z]}" >>{output}
echo "params.PY[Z][nest1]: {params.PY[Z][nest1]}" >>{output}
echo "params.PY[Z][nest2]: {params.PY[Z][nest2]}" >>{output}
echo "params.PY[Z][nest2][1]: {params.PY[Z][nest2][1]}" >>{output}
"""
MACBOOK> snakemake -s cfgtest
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 1
1
rule 1:
output: cfgtest.txt
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
MACBOOK> cat cfgtest.txt
params.YAML[A]: 10
params.YAML[B]: 1 2 99
params.YAML[B][2]: 99
params.YAML[C]: {'nst1': 'hello', 'nst2': ['big', 'world']}
params.YAML[C][nst1]: hello
params.YAML[C][nst2]: big world
params.YAML[C][nst2][1]: world
params.PY[X]: -2
params.PY[Y]: range(4, 7)
params.PY[Y][2]: 6
params.PY[Z]: {'nest1': 'bye', 'nest2': ['A', 'list']}
params.PY[Z][nest1]: bye
params.PY[Z][nest2]: A list
params.PY[Z][nest2][1]: list
YAML Configuration
This has to do with the nesting of YAML files, see an example here.
The config["samples"] request will return both 'A' and 'B'. I'm my head I think of it returning a list, but I am not positive on the variable type.
By using the configfile as listed here:
https://snakemake.readthedocs.io/en/latest/tutorial/advanced.html
You can link in the following YAML configuration files, in YAML format.
settings/config.yaml:
samples:
A
B
OR
settings/config.yaml:
sampleID:
123
124
125
baseDIR:
data
Resulting call with YAML config access
Snakefile:
configfile: "settings/config.yaml"
rule all:
input:
expand("{baseDIR}/{ID}.bam", baseDIR=config["baseDIR"], ID=config["sampleID"]),
rule fastq2bam:
input:
expand("{{baseDIR}}/{{ID}}.{readDirection}.fastq", readDirection=['1','2'])
output:
"{baseDIR}/{ID}.bam"
#Note different number of {}, 1 for wildcards not in expand.
#Equivalent line with 'useless' expand call would be:
#expand("{{baseDIR}}/{{ID}}.bam")
shell:
"""
bwa mem {input[0]} {input[1]} > {output}
"""
Dummy examples, just trying to exemplify the use of different strings and config variables. I use wildcards in the fastq2bam rule. Typically I only use config variables to set things in my rule 'all', when possible this is best practice. I cannot say if the shell call actually works for bwa mem, but I think you get the idea of what I'm implying.
A larger version of a Snakefile can be seen here
Once the configfile is setup, to reference anything in it, use 'config'. It can be used to access deep into a YAML as needed. Here I'll go down 3 hypothetical levels, like so:
hypothetical_var = config["yamlVarLvl1"]["yamlVarLvl2"]["yamlVarLvl3"]
Equates to (I'm not POSITIVE about the typing, I think it converts to strings)
hypothetical_var = ['124', '125', '126', '127', '128', '129']
If the YAML is:
yamlVarLvl1:
yamlVarLvl2:
yamlVarLvl3:
'124'
'125'
'126'
'127'
'128'
'129'
Code Organization
Python and Snakemake code, for the most part can be interleaved in certian places. I would advise against this as it will make the code difficult to maintain. It's up to the user to decide how to implement this. E.g, using the run or the shell directive changes how to access the variables.
YAML and JSON files are preferred configuration variable files as I believe the provide some support for editting and Command-Line Interface over-ridding of variables. This would not be as clean if it was implemented using externally imported python variables. Also it helps my brain, knowing python files do things, and YAML files store things.
YAML is always an external file, but...
If you are using a single Snakefile, put the supporting python at the top?
If you are using a multi-file system, consider having the supporting python scripts externalized.
Tutorials
I think a perfect vignette is difficult to design. I'm trying to teach my group about Snakemake, and I have over 40 pages of personally written documentation, I've provided three 1hr+ presentations with PowerPoint slideshows, I've read nearly the entire ReadTheDocs.io manual for Snakemake, and I just recently finished going through the list of additional resources, yet, I'm still learning too!
Side note, I found this tutorial to be very nice as well.
Does that provide enough context?
Is there someplace that fully describes use of config data in snakemake rules?
There is no limit to what you can put in the config file, provided it can be parsed into python objects. Basically, "your imagination is the limit".
What is the right way to make use of config data in output rules, e.g. to help form the output filename?
I extract things from the config outside the rules, in plain python.
Instead of output: "{config.dataFolder}/{ID}/{ID}.yyy", I would do:
data_folder = config.dataFolder
rule name_of_the_rule:
output:unction
os.path.join(data_folder, "{ID}", "{ID}.yyy")
I guess that with what you tried, snakemake has problems formatting the string when there is a mix of things coming from the wildcards, and others. But maybe the following works in python 3.6, using formatted string litterals: output: f"{config.dataFolder}/{ID}/{ID}.yyy". I haven't checked.
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
In the snakefile, I typically read the config file to extract configuration information before the rules. This is essentially pure python except that a config object is directly made available by Snakemake for convenience. You could probably just use plain standard python using config = json.load("config.json") or config = yaml.load("config.yaml").
In the snakefile, outside the rules, you can do whatever computations you want in python. This can be before reading the config as well as after. You can define functions that can be used in rules (for instance to generate rule's inputs), compute lists of things that will be used as wildcards. I think the only thing is that an object needs to be defined before the rules that use it.
Snakemake syntax seems mainly a means of describing the rules. Within the run part of a rule, you can use whatever python you want, knowing that you have access to a wildcards object to help you. Input and output of rules are lists of file paths, and you can use python in order to build them.
I have an xml file which looks like below
<name>abcdefg</name>
<value>123456</value>
I am trying to write a script using sed to search for the tag "abcdefg" and then replace the corresponding value "123456" but unfortunately I am not able to find a logic to achieve above.
Need help!
Sample data used:
cat key
<name>abcdaaefg</name>
<value>123456</value>
<name>abcdefg</name>
<value>123456</value>
<name>abcdaaefg</name>
<value>123456</value>
sed solution:
sed '/abcdefg/!b;n;c<value>OLA<value>' key
<name>abcdaaefg</name>
<value>123456</value>
<name>abcdefg</name>
<value>OLA<value>
<name>abcdaaefg</name>
<value>123456</value>
For doing changes in file.
sed -i.bak '/abcdefg/!b;n;c<value>OLA<value>' key
awk Solution:
awk '/abcdefg/ {print $0;getline;sub(/>.*</,">ola<")} {print $0}' key
<name>abcdaaefg</name>
<value>123456</value>
<name>abcdefg</name>
<value>ola</value>
<name>abcdaaefg</name>
<value>123456</value>
Search for a line containing abcdefg and then do following actions:
1. print that line,
2.move to next line and replace the value inside html tag to something else. Here , I have replaced 123456 with ola.
Whenever you have tag->value pairs in your data it's a good idea to create a tag->value array in your code:
$ awk -F'[<>]' '{tag=$2; v[tag]=$3} tag=="value" && v["name"]=="abcdefg" {sub(/>.*</,">blahblah<")} 1' file
<name>abcdefg</name>
<value>blahblah</value>
Use an XML-aware tool. This will make your approach far more robust: It means that tiny changes in the textual description (like added or removed newlines, or extra attributes added to a preexisting element) won't break your script.
Assuming that your input's structure looks like this (with being under a single parent item, here called item, defining the relationship between a name and a value):
<config>
<item><name>abcdef</name><value>123456</value></item>
<item><name>fedcba</name><value>654321</value></item>
</config>
...you can edit it like so:
# edit the value under an item having name "abcdef"
xmlstarlet ed -u '//item[name="abcdef"]/value' -v "new-value"
If instead it's like this (with ordering between name/value pairs describing their relationship):
<config>
<name>abcdef</name><value>123456</value>
<name>fedcba</name><value>654321</value>
</config>
...then you can edit it like so:
# update the value immediately following a name of "abcdef"
xmlstarlet ed -u '//name[. = "abcdef"]/following-sibling::value[1]' -v new-value
The yaml library in python is not able to detect duplicated keys. This is a bug that has been reported years ago and there is not a fix yet.
I would like to find a decent workaround to this problem. How plausible could be to create a regex that returns all the keys ? Then it would be quite easy to detect this problem.
Could any regex master suggest a regex that is able to extract all the keys to find duplicates ?
File example:
mykey1:
subkey1: value1
subkey2: value2
subkey3:
- value 3.1
- value 3.2
mykey2:
subkey1: this is not duplicated
subkey5: value5
subkey5: duplicated!
subkey6:
subkey6.1: value6.1
subkey6.2: valye6.2
The yamllint command-line tool does what you
want:
sudo pip install yamllint
Specifically, it has a rule key-duplicates that detects repetitions and keys
over-writing one another:
$ yamllint test.yaml
test.yaml
1:1 warning missing document start "---" (document-start)
10:5 error duplication of key "subkey5" in mapping (key-duplicates)
(It has many other rules that you can enable/disable or tweak.)
Over-riding on of the build in loaders is a more lightweight approach:
import yaml
# special loader with duplicate key checking
class UniqueKeyLoader(yaml.SafeLoader):
def construct_mapping(self, node, deep=False):
mapping = []
for key_node, value_node in node.value:
key = self.construct_object(key_node, deep=deep)
assert key not in mapping
mapping.append(key)
return super().construct_mapping(node, deep)
then:
yaml_text = open(filename), 'r').read()
data[f] = yaml.load(yaml_text, Loader=UniqueKeyLoader)