Replace dictionary key in string with dictionary value


for key in dictionary:
    file = file.replace(str(key), dictionary[key])
With this simple snippet I am able to replace each occurrence of a dictionary key with its value in a file. (Python)
Is there a similar way to go about in bash?
Example:
file="addMesh:"0x234544"
addMesh="0x12353514"
${!dictionary[i]}: 0x234544
${dictionary[i]}: 0x234544x0
${!dictionary[i]}: 0x12353514
${dictionary[i]}: 0x12353514x0
Wanted output (new content of file):
"addMesh:"0x234544x0"
addMesh="0x12353514x0"
:
for i in "${!dictionary[@]}"
do
    echo "key : $i"
    echo "value: ${dictionary[$i]}"
    echo
done

While there certainly are more sophisticated methods to do this, I find the following much easier to understand, and maybe it's just fast enough for your use case:
#!/bin/bash
# Create copy of source file: can be omitted
cat addMesh.txt > newAddMesh.txt
file_to_modify=newAddMesh.txt
# Declare the dictionary
declare -A dictionary
dictionary["0x234544"]=0x234544x0
dictionary["0x12353514"]=0x12353514x0
# use sed to perform all substitutions
for i in "${!dictionary[@]}"
do
    sed -i "s/$i/${dictionary[$i]}/g" "$file_to_modify"
done
# Display the result: can be omitted
echo "Content of $file_to_modify :"
cat "$file_to_modify"
Assuming that the input file addMesh.txt contains
"addMesh:"0x234544"
addMesh="0x12353514"
the resulting file will contain:
"addMesh:"0x234544x0"
addMesh="0x12353514x0"
This method is not very fast, because it invokes sed multiple times. But it does not require sed to generate other sed scripts or anything like that. Therefore, it is closer to the original Python script. If you need better performance, refer to the answers in the linked question.
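If the repeated sed calls become a bottleneck, one alternative is to perform every substitution in a single pass over the text. The sketch below does this on the Python side with one compiled regular expression. The helper name replace_keys is hypothetical (it is not part of the original snippet), and the keys are escaped so that they are matched literally rather than as patterns.

```python
import re

def replace_keys(text, mapping):
    """Replace every occurrence of each key in `mapping` with its value, in one pass."""
    if not mapping:
        return text
    # Try longer keys first so a key that is a prefix of another key cannot shadow it.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape makes each key match literally, even if it contains regex characters.
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

dictionary = {"0x234544": "0x234544x0", "0x12353514": "0x12353514x0"}
content = '"addMesh:"0x234544"\naddMesh="0x12353514"\n'
result = replace_keys(content, dictionary)
print(result, end="")
```

Because every position in the text is matched only once, a replacement value that itself contains a key (as 0x234544x0 contains 0x234544) is not substituted a second time.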

There is no perfect equivalent in Bash. You could do it in a roundabout way, given that dict is the associative array:
# traverse the dictionary and build command file for sed
for key in "${!dict[@]}"; do
    printf "s/%s/%s/g;\n" "$key" "${dict[$key]}"
done > sed.commands
# run sed
sed -f sed.commands file > file.modified
# clean up
rm -f sed.commands

Related

How to reliably parse the output of git config --list?

I'm trying to parse the output of git config --list to get a list of key/value pairs. For simple variables like user.name e.g.
user.name=John Doe
it seems like I can just split the line on '=' to get the key and value. However, after looking into the syntax, it appears that a key can contain any value in a subsection (including '='), and a value can contain a '=' or a '.'. So how do I reliably parse the output of git config --list to get key/value pairs?
Also, I wonder if running git -c key=value <somecommand> will always work with key/value containing '=' or '.'.

If you have Bash4+ you can parse key=value pairs from the git config into an associative array:
#!/usr/bin/env bash
declare -A git_config
# Parse git config into an associative array
while IFS='=' read -r k v; do
    git_config["$k"]="$v"
done < <(git config --list)
# Print-out of the parsed config
printf '%-40s %s\n' 'Git config keys' 'Values'
printf '=%.0s' {1..120}
echo
for k in "${!git_config[@]}"; do
    printf '%-40q %q\n' "$k" "${git_config["$k"]}"
done
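Splitting on the first '=' still misreads a key whose subsection contains '='. git config also accepts -z (--null), which ends each entry with a NUL byte and separates key from value with a newline, so neither '=' nor '.' needs special treatment. A minimal sketch in Python follows; both function names are hypothetical, read_git_config assumes git is on PATH, and the sample string is made-up data in the -z layout.

```python
import subprocess

def parse_git_config_z(raw):
    """Parse `git config --list -z` output into {key: [values...]}."""
    entries = {}
    for record in raw.split("\0"):
        if not record:
            continue  # skip the empty piece after the final NUL
        # The first newline separates key from value; values may contain '=' freely.
        key, sep, value = record.partition("\n")
        # A record without a newline is a key that was set without any value.
        entries.setdefault(key, []).append(value if sep else None)
    return entries

def read_git_config():
    """Run git and parse its configuration (assumes git is installed)."""
    raw = subprocess.run(
        ["git", "config", "--list", "-z"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_git_config_z(raw)

# Made-up sample in the -z layout, so the parser can be tried without git.
sample = (
    "user.name\nJohn Doe\0"
    "url.https://example.com/a=b.insteadof\ngit://example.com/\0"
    "alias.lg\nlog --grep=a=b\0"
    "core.flag\0"
)
parsed = parse_git_config_z(sample)
```

Values are collected in lists because the same key can legitimately appear more than once in a git configuration.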

snakemake: correct syntax for accessing dictionary values

Here's an example of what I am trying to do:
mydictionary={
    'apple': 'crunchy fruit',
    'banana': 'mushy and yellow'
}
rule all:
    input:
        expand('{key}.txt', key=mydictionary.keys())
rule test:
    output: temp('{f}.txt')
    shell:
        """
        echo {mydictionary[wildcards.f]} > {output}
        cat {output}
        """
For some reason, I am not able to access the dictionary contents. I tried using double-curly brackets, but the content of the text files becomes literal {mydictionary[wildcards.f]} (while I want the content of the corresponding entry in the dictionary).

I'm pretty sure the bracket markup can only replace variables with string representations of their values, but does not support any code evaluation within the brackets. That is, {mydictionary[wildcards.f]} will try to look up a variable literally named "mydictionary[wildcards.f]". Likewise, {mydictionary}[{wildcards.f}] will just paste the string values together. So, I don't think you can do what you want within the shell section alone. Instead, you can accomplish what you want in the params section:
rule test:
    output: temp('{f}.txt')
    params:
        value=lambda wcs: mydictionary[wcs.f]
    shell:
        """
        echo '{params.value}' > {output}
        cat {output}
        """

Syntax for using config data in rules

Is there someplace that fully describes use of config data in snakemake rules?
There is an example in the user guide of this in a yaml file:
samples:
A: data/samples/A.fastq
B: data/samples/B.fastq
Then, it is used in a rule like this:
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
It seems like the above would replace {sample} with "data/samples/A.fastq" rather than with "A" (and "B" etc.), as it apparently does.
What is the right way to make use of config data in output rules, e.g. to help form the output filename? This form doesn't work:
output: "{config.dataFolder}/{ID}/{ID}.yyy"
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
The yaml and JSON config files are severely limited in that they cannot use values defined earlier in the file to define new values, right? And that's something that would often be done when setting configuration parameters.
What is the advantage of using a configfile? Why not instead just use include: and include a python file to define parameters?
A useful thing would be a reference manual that describes the details of SnakeMake thoroughly. The current website is kind of scattered, and it takes a while to find things that you remember seeing previously somewhere in it.

How should config data be used in "output" rules? I found that the output string cannot contain {config.} values. However, they can be included using Python code, as follows:
output: config["OutputDir"] + "/myfile.txt"
But, this method does NOT work (in either output: or input:):
params: config=config
output: "{params.config[OutputDir]}/myfile.txt"
However, it DOES work in "shell:":
params: config=config
output: config["OutputDir"] + "/myfile.txt"
shell: echo "OutputDir is {params.config[OutputDir]}" > {output}
Notice that there are no quotes around OutputDir inside the [] in the shell cmd. The {} method of expanding values in a string does not use quotes around the keys.
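The no-quotes rule matches Python's own str.format field syntax. The sketch below illustrates it on the assumption (not verified against Snakemake's source here) that the {} expansion in shell: follows the same field-name rules. The params object is a hypothetical stand-in built with SimpleNamespace, not Snakemake's real params.

```python
from types import SimpleNamespace

config = {"OutputDir": "DATA_DIR"}
params = SimpleNamespace(config=config)  # hypothetical stand-in for a rule's params

# Unquoted key: the text between [ and ] is used directly as the string key.
message = "OutputDir is {params.config[OutputDir]}".format(params=params)

# Quoted key: the quotes become part of the key, so the lookup fails.
try:
    "{params.config['OutputDir']}".format(params=params)
    quoted_key_error = None
except KeyError as err:
    quoted_key_error = err.args[0]

print(message)
print(repr(quoted_key_error))
```

The same rule would explain why an expression such as wildcards.f placed inside the brackets is treated as a literal key instead of being evaluated.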
Can config data be defined snakefile-wise OR python-wise? YES!
Parameters can be defined in a .yaml file included using 'configfile', or via a regular Python file included using 'include'. The latter is IMHO superior, since .yaml files don't allow definitions to reference previous ones, something that would be common in all but the simplest configuration files.
To define the "OutputDir" parameter above using yaml:
xxx.yaml:
OutputDir: DATA_DIR
snakefile:
configfile: 'xxx.yaml'
To define it using Python to be exactly compatible with above:
xxx.py:
config['OutputDir'] = "DATA_DIR"
snakefile:
include: 'xxx.py'
Or, to define a simple variable 'OutputDir' in a Python included configuration file and then use it in a rule:
xxx.py:
OutputDir = "DATA_DIR"
snakefile:
include: 'xxx.py'
rule:
    output: OutputDir + "/myfile.txt"
Multi-nested dictionaries and lists can be easily defined and accessed, both via .yaml files and python files. Example:
MACBOOK> cat cfgtest.yaml
cfgtestYAML:
    A: 10
    B: [1, 2, 99]
    C:
        nst1: "hello"
        nst2: ["big", "world"]
MACBOOK> cat cfgtest.py
cfgtestPY = {
    'X': -2,
    'Y': range(4,7),
    'Z': {
        'nest1': "bye",
        'nest2': ["A", "list"]
    }
}
MACBOOK> cat cfgtest
configfile: "cfgtest.yaml"
include: "cfgtest.py"
rule:
    output: 'cfgtest.txt'
    params: YAML=config["cfgtestYAML"], PY=cfgtestPY
    shell:
        """
        echo "params.YAML[A]: {params.YAML[A]}" >{output}
        echo "params.YAML[B]: {params.YAML[B]}" >>{output}
        echo "params.YAML[B][2]: {params.YAML[B][2]}" >>{output}
        echo "params.YAML[C]: {params.YAML[C]}" >>{output}
        echo "params.YAML[C][nst1]: {params.YAML[C][nst1]}" >>{output}
        echo "params.YAML[C][nst2]: {params.YAML[C][nst2]}" >>{output}
        echo "params.YAML[C][nst2][1]: {params.YAML[C][nst2][1]}" >>{output}
        echo "" >>{output}
        echo "params.PY[X]: {params.PY[X]}" >>{output}
        echo "params.PY[Y]: {params.PY[Y]}" >>{output}
        echo "params.PY[Y][2]: {params.PY[Y][2]}" >>{output}
        echo "params.PY[Z]: {params.PY[Z]}" >>{output}
        echo "params.PY[Z][nest1]: {params.PY[Z][nest1]}" >>{output}
        echo "params.PY[Z][nest2]: {params.PY[Z][nest2]}" >>{output}
        echo "params.PY[Z][nest2][1]: {params.PY[Z][nest2][1]}" >>{output}
        """
MACBOOK> snakemake -s cfgtest
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 1
1
rule 1:
output: cfgtest.txt
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
MACBOOK> cat cfgtest.txt
params.YAML[A]: 10
params.YAML[B]: 1 2 99
params.YAML[B][2]: 99
params.YAML[C]: {'nst1': 'hello', 'nst2': ['big', 'world']}
params.YAML[C][nst1]: hello
params.YAML[C][nst2]: big world
params.YAML[C][nst2][1]: world
params.PY[X]: -2
params.PY[Y]: range(4, 7)
params.PY[Y][2]: 6
params.PY[Z]: {'nest1': 'bye', 'nest2': ['A', 'list']}
params.PY[Z][nest1]: bye
params.PY[Z][nest2]: A list
params.PY[Z][nest2][1]: list

YAML Configuration
This has to do with the nesting of YAML files, see an example here.
The config["samples"] request will return both 'A' and 'B'. In my head I think of it returning a list, but I am not positive on the variable type.
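On the type question: with the YAML from the user guide (sample names mapped to paths), config["samples"] parses to a dictionary, and iterating over a Python dictionary yields its keys. The sketch below shows why {sample} then becomes "A" and "B" rather than the paths. Note that expand_sketch is a hypothetical, simplified stand-in written for this illustration, not Snakemake's actual expand implementation.

```python
from itertools import product

def expand_sketch(pattern, **wildcards):
    """Simplified stand-in for Snakemake's expand(); not its real implementation."""
    names = list(wildcards)
    # A bare string counts as a single value; any other iterable is walked.
    # Walking a dict yields its keys, which is the behavior of interest here.
    value_lists = [[v] if isinstance(v, str) else list(v) for v in wildcards.values()]
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*value_lists)
    ]

config = {"samples": {"A": "data/samples/A.fastq", "B": "data/samples/B.fastq"}}
bams = expand_sketch("sorted_reads/{sample}.bam", sample=config["samples"])
print(bams)
```

To get at the paths instead, one would index the dictionary explicitly, for example config["samples"]["A"].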
By using the configfile as listed here:
https://snakemake.readthedocs.io/en/latest/tutorial/advanced.html
You can link in the following YAML configuration files, in YAML format.
settings/config.yaml:
samples:
A
B
OR
settings/config.yaml:
sampleID:
123
124
125
baseDIR:
data
Resulting call with YAML config access
Snakefile:
configfile: "settings/config.yaml"
rule all:
    input:
        expand("{baseDIR}/{ID}.bam", baseDIR=config["baseDIR"], ID=config["sampleID"]),
rule fastq2bam:
    input:
        expand("{{baseDIR}}/{{ID}}.{readDirection}.fastq", readDirection=['1','2'])
    output:
        "{baseDIR}/{ID}.bam"
        #Note different number of {}, 1 for wildcards not in expand.
        #Equivalent line with 'useless' expand call would be:
        #expand("{{baseDIR}}/{{ID}}.bam")
    shell:
        """
        bwa mem {input[0]} {input[1]} > {output}
        """
Dummy examples, just trying to exemplify the use of different strings and config variables. I use wildcards in the fastq2bam rule. Typically I only use config variables to set things in my rule 'all', when possible this is best practice. I cannot say if the shell call actually works for bwa mem, but I think you get the idea of what I'm implying.
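The doubled braces in the fastq2bam input can be illustrated with plain str.format, assuming expand fills its pattern the same way: {{ and }} collapse to literal single braces, so the wildcard names survive the expansion and are left for Snakemake to resolve later.

```python
# Doubled braces are escapes: they come out as literal { and } after formatting.
template = "{{baseDIR}}/{{ID}}.{readDirection}.fastq"
inputs = [template.format(readDirection=d) for d in ["1", "2"]]
print(inputs)
```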
A larger version of a Snakefile can be seen here
Once the configfile is setup, to reference anything in it, use 'config'. It can be used to access deep into a YAML as needed. Here I'll go down 3 hypothetical levels, like so:
hypothetical_var = config["yamlVarLvl1"]["yamlVarLvl2"]["yamlVarLvl3"]
Equates to (I'm not POSITIVE about the typing, I think it converts to strings)
hypothetical_var = ['124', '125', '126', '127', '128', '129']
If the YAML is:
yamlVarLvl1:
yamlVarLvl2:
yamlVarLvl3:
'124'
'125'
'126'
'127'
'128'
'129'
Code Organization
Python and Snakemake code, for the most part, can be interleaved in certain places. I would advise against this as it will make the code difficult to maintain. It's up to the user to decide how to implement this. E.g., using the run or the shell directive changes how to access the variables.
YAML and JSON files are the preferred configuration files, as I believe they provide some support for editing and for command-line overriding of variables. This would not be as clean if it was implemented using externally imported python variables. Also it helps my brain, knowing python files do things, and YAML files store things.
YAML is always an external file, but...
If you are using a single Snakefile, put the supporting python at the top?
If you are using a multi-file system, consider having the supporting python scripts externalized.
Tutorials
I think a perfect vignette is difficult to design. I'm trying to teach my group about Snakemake, and I have over 40 pages of personally written documentation, I've provided three 1hr+ presentations with PowerPoint slideshows, I've read nearly the entire ReadTheDocs.io manual for Snakemake, and I just recently finished going through the list of additional resources, yet, I'm still learning too!
Side note, I found this tutorial to be very nice as well.
Does that provide enough context?

Is there someplace that fully describes use of config data in snakemake rules?
There is no limit to what you can put in the config file, provided it can be parsed into python objects. Basically, "your imagination is the limit".
What is the right way to make use of config data in output rules, e.g. to help form the output filename?
I extract things from the config outside the rules, in plain python.
Instead of output: "{config.dataFolder}/{ID}/{ID}.yyy", I would do:
data_folder = config["dataFolder"]
rule name_of_the_rule:
    output:
        os.path.join(data_folder, "{ID}", "{ID}.yyy")
I guess that with what you tried, snakemake has problems formatting the string when there is a mix of things coming from the wildcards, and others. But maybe the following works in Python 3.6, using formatted string literals: output: f"{config.dataFolder}/{ID}/{ID}.yyy". I haven't checked.
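On the f-string idea: Python evaluates every single-braced field of an f-string immediately, so {ID} would be looked up as a Python variable (a NameError if none exists) before Snakemake ever sees the string. A small sketch of both variants follows, using a made-up config dictionary; the wildcard braces have to be doubled in the f-string to survive as literal text.

```python
import os

config = {"dataFolder": "DATA_DIR"}  # hypothetical config contents

# Plain Python builds the prefix; "{ID}" stays literal for Snakemake to fill in.
pattern = os.path.join(config["dataFolder"], "{ID}", "{ID}.yyy")

# In an f-string, single braces are evaluated right away, so the wildcard
# braces are doubled to keep them as the literal text "{ID}".
pattern_f = f"{config['dataFolder']}/{{ID}}/{{ID}}.yyy"

print(pattern)
print(pattern_f)
```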
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
In the snakefile, I typically read the config file to extract configuration information before the rules. This is essentially pure python except that a config object is directly made available by Snakemake for convenience. You could probably just use plain standard python, using config = json.load(open("config.json")) or config = yaml.safe_load(open("config.yaml")).
In the snakefile, outside the rules, you can do whatever computations you want in python. This can be before reading the config as well as after. You can define functions that can be used in rules (for instance to generate rule's inputs), compute lists of things that will be used as wildcards. I think the only thing is that an object needs to be defined before the rules that use it.
Snakemake syntax seems mainly a means of describing the rules. Within the run part of a rule, you can use whatever python you want, knowing that you have access to a wildcards object to help you. Input and output of rules are lists of file paths, and you can use python in order to build them.

Search Keys and Replace Values in XML

I have an xml file which looks like below
<name>abcdefg</name>
<value>123456</value>
I am trying to write a script using sed to search for the tag "abcdefg" and then replace the corresponding value "123456" but unfortunately I am not able to find a logic to achieve above.
Need help!

Sample data used:
cat key
<name>abcdaaefg</name>
<value>123456</value>
<name>abcdefg</name>
<value>123456</value>
<name>abcdaaefg</name>
<value>123456</value>
sed solution:
sed '/abcdefg/!b;n;c<value>OLA</value>' key
<name>abcdaaefg</name>
<value>123456</value>
<name>abcdefg</name>
<value>OLA</value>
<name>abcdaaefg</name>
<value>123456</value>
To make the change in the file itself (keeping a .bak backup):
sed -i.bak '/abcdefg/!b;n;c<value>OLA</value>' key
awk Solution:
awk '/abcdefg/ {print $0;getline;sub(/>.*</,">ola<")} {print $0}' key
<name>abcdaaefg</name>
<value>123456</value>
<name>abcdefg</name>
<value>ola</value>
<name>abcdaaefg</name>
<value>123456</value>
Search for a line containing abcdefg and then do the following:
1. print that line,
2. move to the next line and replace the value inside the tag with something else. Here, I have replaced 123456 with ola.

Whenever you have tag->value pairs in your data it's a good idea to create a tag->value array in your code:
$ awk -F'[<>]' '{tag=$2; v[tag]=$3} tag=="value" && v["name"]=="abcdefg" {sub(/>.*</,">blahblah<")} 1' file
<name>abcdefg</name>
<value>blahblah</value>

Use an XML-aware tool. This will make your approach far more robust: It means that tiny changes in the textual description (like added or removed newlines, or extra attributes added to a preexisting element) won't break your script.
Assuming that your input's structure looks like this (with name and value under a single parent element, here called item, which defines the relationship between a name and a value):
<config>
<item><name>abcdef</name><value>123456</value></item>
<item><name>fedcba</name><value>654321</value></item>
</config>
...you can edit it like so:
# edit the value under an item having name "abcdef"
xmlstarlet ed -u '//item[name="abcdef"]/value' -v "new-value"
If instead it's like this (with ordering between name/value pairs describing their relationship):
<config>
<name>abcdef</name><value>123456</value>
<name>fedcba</name><value>654321</value>
</config>
...then you can edit it like so:
# update the value immediately following a name of "abcdef"
xmlstarlet ed -u '//name[. = "abcdef"]/following-sibling::value[1]' -v new-value
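If xmlstarlet is not available, the same XML-aware edit can be sketched with Python's standard library. This example covers the first layout (name and value under a common item) and relies on the limited XPath subset that xml.etree.ElementTree supports; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<config>
<item><name>abcdef</name><value>123456</value></item>
<item><name>fedcba</name><value>654321</value></item>
</config>"""

root = ET.fromstring(xml_text)

# Select <value> under every <item> whose <name> child has the text "abcdef".
for value in root.findall("./item[name='abcdef']/value"):
    value.text = "new-value"

updated = ET.tostring(root, encoding="unicode")
print(updated)
```

ElementTree has no following-sibling axis, so the second layout would need a loop over the children of config that remembers the most recent name.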

How can I mine an XML document with awk, Perl, or Python?

I have a XML file with the following data format:
<net NetName="abc" attr1="123" attr2="234" attr3="345".../>
<net NetName="cde" attr1="456" attr2="567" attr3="678".../>
....
Can anyone tell me how could I data mine the XML file using an awk one-liner? For example, I would like to know attr3 of abc. It will return 345 to me.

In general, you don't. XML/HTML parsing is hard enough without trying to do it concisely, and while you may be able to hack together a solution that succeeds with a limited subset of XML, eventually it will break.
Besides, there are many great languages with great XML parsers already written, so why not use one of them and make your life easier?
I don't know whether or not there's an XML parser built for awk, but I'm afraid that if you want to parse XML with awk you're going to get a lot of "hammers are for nails, screwdrivers are for screws" answers. I'm sure it can be done, but it's probably going to be easier for you to write something quick in Perl that uses XML::Simple (my personal favorite) or some other XML parsing module.
Just for completeness, I'd like to note that if your snippet is an example of the entire file, it is not valid XML. Valid XML should have start and end tags, like so:
<netlist>
<net NetName="abc" attr1="123" attr2="234" attr3="345".../>
<net NetName="cde" attr1="456" attr2="567" attr3="678".../>
....
</netlist>
I'm sure invalid XML has its uses, but some XML parsers may whine about it, so unless you're dead set on using an awk one-liner to try to half-ass "parse" your "XML," you may want to consider making your XML valid.
In response to your edits, I still won't do it as a one-liner, but here's a Perl script that you can use:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;
sub usage {
    die "Usage: $0 [NetName] ([attr])\n";
}
my $file = XMLin("file.xml", KeyAttr => { net => 'NetName' });
usage() if @ARGV == 0;
exists $file->{net}{$ARGV[0]}
    or die "$ARGV[0] does not exist.\n";
if (@ARGV == 2) {
    exists $file->{net}{$ARGV[0]}{$ARGV[1]}
        or die "NetName $ARGV[0] does not have attribute $ARGV[1].\n";
    print "$file->{net}{$ARGV[0]}{$ARGV[1]}.\n";
} elsif (@ARGV == 1) {
    print "$ARGV[0]:\n";
    print "  $_ = $file->{net}{$ARGV[0]}{$_}\n"
        for keys %{ $file->{net}{$ARGV[0]} };
} else {
    usage();
}
Run this script from the command line with 1 or 2 arguments. The first argument is the 'NetName' you want to look up, and the second is the attribute you want to look up. If no attribute is given, it should just list all the attributes for that 'NetName'.
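For readers who prefer Python, a comparable lookup can be sketched with the standard library's xml.etree.ElementTree. It assumes the input is well-formed, i.e. wrapped in a root element such as netlist and without the elided "..." parts; the lookup helper is hypothetical.

```python
import xml.etree.ElementTree as ET

xml_text = """<netlist>
  <net NetName="abc" attr1="123" attr2="234" attr3="345"/>
  <net NetName="cde" attr1="456" attr2="567" attr3="678"/>
</netlist>"""

def lookup(root, net_name, attr):
    """Return `attr` of the first <net> whose NetName matches, or None."""
    for net in root.iter("net"):
        if net.get("NetName") == net_name:
            return net.get(attr)
    return None

root = ET.fromstring(xml_text)
print(lookup(root, "abc", "attr3"))
```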

I have written a tool called xml_grep2, based on XML::LibXML, the perl interface to libxml2.
You would find the value you're looking for by doing this:
xml_grep2 -t '//net[@NetName="abc"]/@attr3' to_grep.xml
The tool can be found at http://xmltwig.com/tool/

xmlgawk can use XML very easily.
$ xgawk -lxml 'XMLATTR["NetName"]=="abc"{print XMLATTR["attr3"]}' test.xml
This one liner can parse XML and print "345".

If you do not have xmlgawk and your XML format is fixed, normal awk can do.
$ nawk -F '[ ="]+' '/abc/{for(i=1;i<=NF;i++){if($i=="attr3"){print $(i+1)}}}' test.xml
This script can return "345".
But I think it is very dangerous, because plain awk cannot actually parse XML.

You might try this nifty little script: http://awk.info/?doc/tools/xmlparse.html
