Escape a period properly in snakemake expand function - python

I am writing a snakemake pipeline where I download various files whose filenames contain many periods, and I am having a devil of a time getting it to properly understand the file names. I have essentially two rules, a download rule and a target rule. Here they are simplified below.
rule download_files_from_s3:
    input:
        some input
    params:
        some params
    output:
        expand("destinationDir/{{sampleName}}\.{{suffix}}")
    shell:
        "aws s3 cp input destinationDir"

rule targets:
    input:
        expand("destinationDir/{sampleName}\.{suffix}", sampleName=sampleNames)
In this formulation snakemake compiles successfully and properly downloads the files from S3 to where I want them. However, it is then unable to find them and waits "at most 5 seconds for missing files". When I run snakemake in dry-run mode, I can see that it expects files of the form "destinationDir/sampleName\.suffix", when in reality they exist without the backslash: "destinationDir/sampleName.suffix". My first thought was to ditch the backslash, changing my rules to the form:
expand("destinationDir/{sampleName}.{suffix}", sampleName=sampleNames)
This, however, creates an overly greedy regular expression. My value for suffix should be ".params.txt". When I run the no-backslash version, snakemake evaluates the wildcard sampleName as "sampleName.params" and the wildcard suffix as "txt". How ought I to go about this, either by forcing the regular expression matching in expand to behave or by having snakemake properly interpret the '\' character? My efforts so far haven't been successful.

I don't know snakemake, but their docs mention that you can avoid ambiguity by providing a regex with your wildcards:
{wildcard,regex}
E.g.:
{sampleName,[^.]+}
To quote the docs:
[...]
Multiple wildcards in one filename can cause ambiguity. Consider the pattern {dataset}.{group}.txt and assume that a file 101.B.normal.txt is available. It is not clear whether dataset=101.B and group=normal or dataset=101 and group=B.normal in this case.
Hence wildcards can be constrained to given regular expressions. Here we could restrict the wildcard dataset to consist of digits only using \d+ as the corresponding regular expression. With Snakemake 3.8.0, there are three ways to constrain wildcards. First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma:
output: "{dataset,\d+}.{group}.txt"
[...]
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
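To see why the unconstrained pattern is so greedy, here is a plain-Python sketch of how snakemake-style wildcards translate into named regex groups (the default wildcard regex is .+; the file name below is made up):

import re

# default: each wildcard becomes a greedy (?P<name>.+) group
m = re.match(r"(?P<sampleName>.+)\.(?P<suffix>.+)$", "sample1.params.txt")
print(m.groupdict())  # {'sampleName': 'sample1.params', 'suffix': 'txt'}

# constrained: {sampleName,[^.]+} becomes (?P<sampleName>[^.]+),
# so the sample name can no longer swallow the first period
m = re.match(r"(?P<sampleName>[^.]+)\.(?P<suffix>.+)$", "sample1.params.txt")
print(m.groupdict())  # {'sampleName': 'sample1', 'suffix': 'params.txt'}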
Also note that {{wildcard}} will "mask" the wildcard, as mentioned in the expand section of the docs:
You can also mask a wildcard expression in expand such that it will be kept, e.g.
expand("{{dataset}}/a.{ext}", ext=FORMATS)
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function
I'm not sure how I should understand "mask", but reading the aside note at the top leads me to think it's a way to escape the special meaning of { and use the string literal {wildcard}.
Note that any placeholders in the shell command (like {input}) are always evaluated and replaced when the corresponding job is executed, even if they are occurring inside a comment. To avoid evaluation and replacement, you have to mask the braces by doubling them, i.e. {{input}}.
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
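A quick way to see what "mask" means in practice: expand can be imported outside a Snakefile, and doubled braces survive expansion as literal single braces (a minimal sketch, assuming a reasonably recent snakemake):

from snakemake.io import expand

# {{dataset}} is kept as the literal wildcard {dataset} in every result,
# while {ext} is expanded normally
print(expand("{{dataset}}/a.{ext}", ext=["csv", "txt"]))
# ['{dataset}/a.csv', '{dataset}/a.txt']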

The short answer is that you don't need to escape the period; the string in expand isn't a regex.
If your value for the suffix is always "params.txt", a simple solution would be to spell that out:
expand("destinationDir/{sampleName}.params.txt", sampleName=sampleNames)
If the suffix is variable but you want to constrain it somehow: according to the latest documentation, single brackets around the suffix plus allow_missing should preserve suffix as a wildcard, and you should be able to (as mentioned in this answer) use wildcard constraints to specify the potential suffixes. Assuming that your file names end in ".SOMETHING.txt":
rule targets:
    wildcard_constraints:
        suffix = r"[a-zA-Z]+\.txt"
    input:
        expand("destinationDir/{sampleName}.{suffix}", sampleName=sampleNames, allow_missing=True)
(Wrapping suffix in double brackets instead would also be a way of preserving it as the wildcard {suffix} rather than trying to expand it. Your output in download_files_from_s3, as it's written now, also simplifies to
output: "destinationDir/{sampleName}.{suffix}"
where you can once again add wildcard constraints if needed. In fact, given that the issue with the file names will probably continue through your workflow, you may want to set the constraint globally by adding
wildcard_constraints:
    suffix = r"[a-zA-Z]+\.txt"
to your Snakefile before defining any rules.)
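For what it's worth, you can check what expand produces by importing it into a plain Python session; a minimal sketch (sample names invented; allow_missing needs a reasonably recent snakemake, around 5.10+):

from snakemake.io import expand

# allow_missing=True leaves unsupplied wildcards intact, so {suffix}
# survives the expansion and remains a wildcard for snakemake to resolve
print(expand("destinationDir/{sampleName}.{suffix}",
             sampleName=["s1", "s2"], allow_missing=True))
# ['destinationDir/s1.{suffix}', 'destinationDir/s2.{suffix}']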

snakemake how to use glob_wildcards properly

I have many paired fastq files, and I have a problem after running the trim_galore package, as it names the fastq files with _1_val_1 and _2_val_2, for example:
AD50_CTGATCGTA_1_val_1.fq.gz and
AD50_CTGATCGTA_2_val_2.fq.gz.
I would like to continue the snakemake workflow and use:
import os
import snakemake.io
import glob

DIR="AD50"
(SAMPLES,READS,) = glob_wildcards(DIR+"{sample}_{read}.fq.gz")
READS=["1","2"]

rule all:
    input:
        expand(DIR+"{sample}_dedup_{read}.fq.gz",sample=SAMPLES,read=READS)

rule clumpify:
    input:
        r1=DIR+"{sample}_1_val_1.fq.gz",
        r2=DIR+"{sample}_2_val_2.fq.gz"
    output:
        r1out=DIR+"{sample}_dedup_1.fq.gz",
        r2out=DIR+"{sample}_dedup_2.fq.gz"
    shell:
        "clumpify.sh in={input.r1} in2={input.r2} out={output.r1out} out2={output.r2out} dedupe subs=0"
and the error is:
Building DAG of jobs...
MissingInputException in line 13 of /home/peterchung/Desktop/Rerun-Result/clumpify.smk:
Missing input files for rule clumpify:
AD50/AD50_CTGATCGTA_2_val_2_val_2.fq.gz
AD50/AD50_CTGATCGTA_2_val_1_val_1.fq.gz
I tried another way; the closest I got was when it detected missing input like
AD50_CTGATCGTA_1_val_2.fq.gz and AD50_CTGATCGTA_2_val_1.fq.gz, which do not exist.
I am not sure I am using the glob_wildcards function properly, since there are many underscores in the names. I tried:
glob_wildcards(DIR+"{sample}_{read}_val_{read}.fq.gz")
but that did not work either.
glob_wildcards is effectively a wrapper that applies a regular expression to the directory listing. By default, a wildcard matches .+ greedily. You need to specify your wildcards more precisely and make sure your rule inputs follow the same patterns.
Going through your example:
AD50_CTGATCGTA_2_val_2.fq.gz
^-----{sample}-----^ ^{read}
The {sample} wildcard consumes everything until the regex will no longer match, up to the val. {read} is then left with just 2.
In rule all, you then request {sample}_dedup_{read}.fq.gz, which is {AD50_CTGATCGTA_2_val}_dedup_{2}.fq.gz (leaving in curly braces to show where wildcards are). When that is matched to clumpify, you request as input:
{AD50_CTGATCGTA_2_val}_2_val_2.fq.gz, which is why you are missing that file.
To address this, you have a few options:
If sample should contain the 1_val part, then you need to update the input for clumpify to match your existing filenames (remove the extra _2_val, etc.).
If sample should only contain AD50_CTGATCGTA, build a more specific filename pattern. Consider adding wildcard constraints to limit your matches, e.g. [\d]+ for read. This seems to be what you were getting at in your last example.
Finally, the expand function by default applies the product of the supplied iterables. That's why you are getting AD50_CTGATCGTA_1_val_2.fq.gz, for example. You need to add zip as the second argument to override that default and match the order of wildcards returned from glob_wildcards. See here as well.
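Putting those two fixes together, one possible sketch (untested; the directory layout is assumed from the question, and the third wildcard exists only to consume the duplicated read number):

from snakemake.io import glob_wildcards, expand

DIR = "AD50/"
# {read,\d+} constrains read to digits, so {sample} can no longer swallow
# "_1_val"; {tail} soaks up the repeated read number before ".fq.gz"
SAMPLES, READS, TAILS = glob_wildcards(DIR + r"{sample}_{read,\d+}_val_{tail}.fq.gz")

rule all:
    input:
        # zip pairs sample[i] with read[i] instead of taking the product
        expand(DIR + "{sample}_dedup_{read}.fq.gz", zip, sample=SAMPLES, read=READS)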

REGEX: Negative lookbehind with multiple whitespaces [duplicate]

I am trying to use lookbehinds in a regular expression and it doesn't seem to work as I expected. So, this is not my real usage, but to simplify I will put an example. Imagine I want to match "example" on a string that says "this is an example". So, according to my understanding of lookbehinds this should work:
(?<=this\sis\san\s*?)example
What this should do is find "this is an", then any number of space characters, and finally match the word "example". Now, it doesn't work and I don't understand why. Is it impossible to use '+' or '*' inside lookbehinds?
I also tried these two, and they work correctly but don't fulfill my needs:
(?<=this\sis\san\s)example
this\sis\san\s*?example
I am using this site to test my regular expressions: http://gskinner.com/RegExr/
Many regular expression libraries only allow strict expressions to be used in lookbehind assertions, for example:
only match strings of the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
only match strings of fixed lengths: (?<=foobar|\r\n) (each branch with a fixed length)
only match strings with an upper bound on their length: (?<=\s{,4}) (up to four repetitions)
The reason for these limitations is mainly that these libraries can't process regular expressions backwards at all, or can only do so for a limited subset of expressions.
Another reason could be to keep authors from building overly complex regular expressions that are heavy to process because they exhibit so-called pathological behavior (see also ReDoS).
See also section about limitations of look-behind assertions on Regular-Expressions.info.
If you're not using Python, you can work around the lack of variable-length lookbehind by tricking the regex engine with \K: match the variable-length prefix, then reset the start of the match.
This site explains it well: http://www.phpfreaks.com/blog/pcre-regex-spotlight-k
In short, when you have an expression that matches, \K forces the engine to discard everything matched so far from the result, so only what follows it is returned...
Example:
string = '<a this is a tag> with some information <div this is another tag > LOOK FOR ME </div>'
matching /(\<a).+?(\<div).+?(\>)\K.+?(?=\<\/div)/ causes the regex to restart after the > that closes the opening div tag, so nothing up to that point is included in the result. The (?=\<\/div) lookahead then makes the engine take everything in front of the closing div tag.
What Amber said is true, but you can work around it with another approach: a non-capturing parentheses group.
(?<=this\sis\san)(?:\s*)example
That makes the lookbehind fixed-length, so it should work.
You can use sub-expressions.
(this\sis\san\s*?)(example)
So to retrieve group 2, "example", use $2 in replacement syntax, or \2 if you're using a format string (as with Python's re.sub).
Most regex engines don't support variable-length expressions for lookbehind assertions.
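Since the question mentions Python: here is a quick sketch of how this plays out with the standard re module (the third-party regex module, by contrast, accepts variable-length lookbehind):

import re

s = "this is an   example"

# re rejects variable-length lookbehind outright:
try:
    re.search(r"(?<=this\sis\san\s*?)example", s)
except re.error as e:
    print(e)  # look-behind requires fixed-width pattern

# a fixed-width lookbehind plus a separate whitespace eater does work:
m = re.search(r"(?<=this\sis\san)\s*(example)", s)
print(m.group(1))  # example

# with the third-party regex module, the original pattern works as-is:
#   import regex
#   regex.search(r"(?<=this\sis\san\s*?)example", s)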

Wildcard in python dictionary

I am trying to create a Python dictionary to reference 'WHM1', 2, 3, 'HISPM1', 2, 3, etc. and other iterations, to create a new column with a specific string, e.g. White or Hispanic. Using regex seems like the right path, but I am missing something here and refuse to hard-code the whole thing in the dictionary.
I have tried several iterations of regex and regexdict:
d = regexdict({'W*':'White', 'H*':'Hispanic'})
eeoc_nac2_All_unpivot_df['Race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(d)
A new column will be created with 'White' or 'Hispanic' for each row based on what is in an existing column called 'EEOC_Code'.
Your regular expressions are wrong - you appear to be using glob syntax instead of proper regular expressions.
In regex, x* means "zero or more of x" and so both your regexes will trivially match the empty string. You apparently mean
d = regexdict({'^W':'White', '^H':'Hispanic'})
instead, where the regex anchor ^ matches beginning of string.
There are several third-party packages 1, 2, 3 named regexdict so you should probably point out which one you use. I can't tell whether the ^ is necessary here, or whether the regexes need to match the input completely (I have assumed a substring match is sufficient, as is usually the case in regex) because this sort of detail may well differ between implementations.
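Whichever regexdict package is meant, the core lookup is easy to sketch by hand; regex_lookup below is a made-up helper, not a function from any of those libraries:

import re

def regex_lookup(patterns, key):
    """Return the value for the first pattern that matches key."""
    for pattern, value in patterns.items():
        if re.match(pattern, key):
            return value
    return None

d = {'^W': 'White', '^H': 'Hispanic'}
print(regex_lookup(d, 'WHM1'))    # White
print(regex_lookup(d, 'HISPM1'))  # Hispanic

# applied to the question's column (assuming a pandas DataFrame):
# eeoc_nac2_All_unpivot_df['Race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(
#     lambda k: regex_lookup(d, k))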
I'm not sure I have completely understood your problem. However, if all your labels have the structure WHM... and HISP..., then you can simply check the first character:
races = []
for code in eeoc_nac2_All_unpivot_df['EEOC_Code']:
    # collect one label per row, then assign the whole column at once
    races.append("White" if code.startswith('W') else "Hispanic")
eeoc_nac2_All_unpivot_df['Race'] = races
Note: it only works if what you have inside eeoc_nac2_All_unpivot_df['EEOC_Code'] is iterable.

How can I find all Markdown links using regular expressions?

In Markdown there are two ways to place a link: one is to just type the raw link in, like http://example.com; the other is to use the ()[] syntax: (Stack Overflow)[http://example.com].
I'm trying to write a regular expression that can match both of these, and, if it's the second match to also capture the display string.
So far I have this:
(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])
Debuggex Demo
But this doesn't seem to match either of my two test cases in Debuggex:
http://example.com
(Example)[http://example.com]
Really not sure why the first one isn't matched at the very least. Is it something to do with my use of the named group? If possible I'd like to keep using it, because this is a simplified expression to match the link, and in the real example it is too long for me to feel comfortable duplicating it in two different places in the same pattern.
What am I doing wrong? Or is this not doable at all?
EDIT: I'm doing this in Python so will be using their regex engine.
The reason your pattern doesn't work is here: (?<=\((.*)\)\[), since the re module of Python doesn't allow variable-length lookbehind.
You can obtain what you want in a more handy way using the newer regex module for Python (the re module has few features in comparison).
Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])
An online demo
pattern details:
(?|                  # open a branch reset group
                     # first case: there is only the url
    (?<txt>          # in this case, the text and the url
        (?<url>      # are the same
            (?:ht|f)tps?://\S+(?<=\P{P})
        )
    )
|                    # OR: the (text)[url] format
    \( ([^)]+) \)    # this group will be named "txt" too
    \[ (\g<url>) \]  # this one "url"
)
This pattern uses the branch reset feature (?|...|...|...), which preserves capturing group names (and numbers) across the members of an alternation. Since the ?<txt> group is opened first in the first member of the alternation, the first group in the second member automatically gets the same name, and likewise for the ?<url> group.
\g<url> is a call to the named subpattern ?<url> (like an alias; this way there is no need to rewrite it in the second member).
(?<=\P{P}) checks if the last character of the url is not a punctuation character (useful to avoid the closing square bracket for example). (I'm not sure of the syntax, it may be \P{Punct})
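If you would rather stay on the standard re module, a workable sketch is to give up the branch reset, use two differently named groups, and pick whichever one matched (this keeps the question's reversed (text)[url] syntax):

import re

pattern = re.compile(
    r"\((?P<txt>[^)]+)\)\[(?P<url>https?://[^\]\s]+)\]"  # (text)[url] form
    r"|(?P<bare>https?://\S+)"                           # bare link form
)

text = "(Example)[http://example.com] and http://example.org"
for m in pattern.finditer(text):
    print(m.group("txt"), m.group("url") or m.group("bare"))
# Example http://example.com
# None http://example.org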

What does the "s!" operator in Perl do?

I have this Perl snippet from a script that I am translating into Python. I have no idea what the "s!" operator is doing; it looks like some sort of regex substitution. Unfortunately, searching Google or Stack Overflow for operators like that doesn't yield many helpful results.
$var =~ s!<foo>.+?</foo>!!;
$var =~ s!;!/!g;
What is each line doing? I'd like to know in case I run into this operator again.
And, what would equivalent statements in Python be?
s!foo!bar! is the same as the more common s/foo/bar/, except that foo and bar can contain unescaped slashes without causing problems. What it does is replace the first occurrence of the regex foo with bar. The version with g replaces all occurrences.
It's doing exactly the same as $var =~ s///. i.e. performing a search and replace within the $var variable.
In Perl you can choose the delimiting character following the s. Why? So that, for example, if you're matching '/', you can specify another delimiter ('!' in this case) and not have to escape the character you're matching. Otherwise you'd end up with (say)
s/;/\//g;
which is a little more confusing.
Perlre has more info on this.
Perl lets you choose the delimiter for many of its constructs. This makes it easier to see what is going on in expressions like
$str =~ s{/foo/bar/baz/}{/quux/};
As you can see, though, not all delimiters have the same effect. Bracketing characters (<>, [], {}, and ()) use different characters for the beginning and ending. And ?, when used as a delimiter for a regex, causes the regex to match only once between calls to the reset() operator.
You may find it helpful to read perldoc perlop (in particular the sections on m/PATTERN/msixpogc, ?PATTERN?, and s/PATTERN/REPLACEMENT/msixpogce).
s! is syntactic sugar for the 'proper' s/// operator. Basically, you can substitute whatever delimiter you want instead of the '/'s.
As to what each line is doing: the first line matches the first occurrence of the regex <foo>.+?</foo> and replaces it with nothing. The second matches the regex ; and replaces every occurrence with /.
s/// is the substitute operator. It takes a regular expression and a substitution string.
s/regex/replace string/;
It supports most (all?) of the normal regular expression switches, which are used in the normal way (by appending them to the end of the operator).
s is the substitution operator. Usually it is in the form of s/foo/bar/, but you can replace the // separator characters with some other character, like !. Using another separator character can make working with things like paths a lot easier, since you don't need to escape the path separators.
See manual page for further info.
You can find similar functionality for Python in the re module.
s is the substitution operator. Normally this uses '/' for the delimiter:
s/foo/bar/
but this is not required: a number of other characters can be used as delimiters instead. In this case, '!' has been used as the delimiter, presumably to avoid the need to escape the '/' characters in the actual text to be substituted.
In your specific case, the first line removes the first stretch of text matching '<foo>.+?</foo>'; i.e. it removes a <foo> tag pair along with its content.
The second line replaces all ';' characters with '/' characters, globally (all occurrences).
The Python equivalent code uses the re module:
f = re.sub(searchregx, replacement_str, line)
And the python equivalent is to use the re module.
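Concretely, a close Python equivalent of the two Perl lines might look like this (note that re.sub replaces every occurrence by default, so the non-global first substitution needs count=1):

import re

var = "<foo>junk</foo>a;b;c"
var = re.sub(r"<foo>.+?</foo>", "", var, count=1)  # s!<foo>.+?</foo>!!;
var = var.replace(";", "/")                        # s!;!/!g; a literal replace needs no regex
print(var)  # a/b/c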
