I have many paired FASTQ files, and I ran into a problem after running Trim Galore, because it names the output files with _1_val_1 and _2_val_2 suffixes, for example:
AD50_CTGATCGTA_1_val_1.fq.gz and
AD50_CTGATCGTA_2_val_2.fq.gz.
I would like to continue the Snakemake workflow with:
import os
import snakemake.io
import glob

DIR="AD50"
(SAMPLES,READS,) = glob_wildcards(DIR+"{sample}_{read}.fq.gz")
READS=["1","2"]

rule all:
    input:
        expand(DIR+"{sample}_dedup_{read}.fq.gz", sample=SAMPLES, read=READS)

rule clumpify:
    input:
        r1=DIR+"{sample}_1_val_1.fq.gz",
        r2=DIR+"{sample}_2_val_2.fq.gz"
    output:
        r1out=DIR+"{sample}_dedup_1.fq.gz",
        r2out=DIR+"{sample}_dedup_2.fq.gz"
    shell:
        "clumpify.sh in={input.r1} in2={input.r2} out={output.r1out} out2={output.r2out} dedupe subs=0"
and the error is:
Building DAG of jobs...
MissingInputException in line 13 of /home/peterchung/Desktop/Rerun-Result/clumpify.smk:
Missing input files for rule clumpify:
AD50/AD50_CTGATCGTA_2_val_2_val_2.fq.gz
AD50/AD50_CTGATCGTA_2_val_1_val_1.fq.gz
I tried another way; the closest I got was it reporting missing inputs like
AD50_CTGATCGTA_1_val_2.fq.gz and AD50_CTGATCGTA_2_val_1.fq.gz, which do not exist.
I am not sure I am using the glob_wildcards function properly, since there are many underscores in the names. I tried:
glob_wildcards(DIR+"{sample}_{read}_val_{read}.fq.gz")
but that did not work either.
glob_wildcards is effectively a wrapper that applies a regular expression to the files on disk. By default, each wildcard matches .* greedily. You need to specify your wildcards more precisely and make sure your rule inputs follow the same pattern.
Going through your example:
AD50_CTGATCGTA_2_val_2.fq.gz
{sample} = AD50_CTGATCGTA_2_val
{read}   = 2
The {sample} wildcard greedily consumes everything until the regex would no longer match, up through the val; {read} is then left with just the final 2.
In rule all, you then request {sample}_dedup_{read}.fq.gz, which is {AD50_CTGATCGTA_2_val}_dedup_{2}.fq.gz (leaving in curly braces to show where wildcards are). When that is matched to clumpify, you request as input:
{AD50_CTGATCGTA_2_val}_2_val_2.fq.gz, which is why you are missing that file.
To address this, you have a few options:

- If {sample} should contain the 1_val part, then you need to update the input of clumpify to match your existing filenames (remove the extra _2_val, etc).
- If {sample} should only contain AD50_CTGATCGTA, build a more specific filename pattern. Consider adding wildcard constraints to limit your matches, e.g. [\d]+ for read. This seems to be what you were getting at in your last example.
- Finally, the expand function by default applies the product of the supplied iterables. That's why you are getting AD50_CTGATCGTA_1_val_2.fq.gz, for example. You need to add zip as the second argument to override that default and match the order of wildcards returned from glob_wildcards. See the expand section of the Snakemake documentation as well.
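Putting the last two options together, a minimal sketch of the top of a fixed Snakefile might look like this (untested; it assumes {sample} should be just AD50_CTGATCGTA and that DIR carries a trailing slash; the clumpify rule can stay as it is):

DIR = "AD50/"

# Constrain the read wildcards to digits so {sample} can no longer
# swallow the "_1_val" part of the name.
SAMPLES, READS, READS2 = glob_wildcards(DIR + r"{sample}_{read,\d+}_val_{read2,\d+}.fq.gz")

rule all:
    input:
        # zip keeps sample and read paired in the order glob_wildcards
        # returned them, instead of taking the full product
        expand(DIR + "{sample}_dedup_{read}.fq.gz", zip, sample=SAMPLES, read=READS)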
I'm stumped trying to figure out a regex expression. Given a file path, I need to match the last numerical component of the path ("frame" number in an image sequence), but also ignore any numerical component in the file extension.
For example, given path:
/path/to/file/abc123/GCAM5423.xmp
The following expression will correctly match 5423.
((?P<index>(?P<padding>0*)\d+)(?!.*(0*)\d+))
However, this expression fails if for example the file extension contains a number as follows:
/path/to/file/abc123/GCAM5423.cr2
In this case the expression will match the 2 in the file extension, when I still need it to match 5423. How can I modify the above expression to ignore file extensions that have a numerical component?
Using the Python flavor of regex. Thanks in advance!
Edit: Thanks all for your help! To clarify, I specifically need to modify the above expression to only capture the last group. I am passing this pattern to an external library so it needs to include the named groups and to only match the last number prior to the extension.
You can try this one:
\/[a-zA-Z]*(\d*)\.[a-zA-Z0-9]{3,4}$
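A quick check of this pattern with Python's re (note it does not keep the named groups requested in the question's edit):

import re

pattern = r"\/[a-zA-Z]*(\d*)\.[a-zA-Z0-9]{3,4}$"
for path in ("/path/to/file/abc123/GCAM5423.xmp", "/path/to/file/abc123/GCAM5423.cr2"):
    # Group 1 is the trailing digit run before the extension.
    print(re.search(pattern, path).group(1))  # 5423 both times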
Try this pattern:
\/[^/\d\s]+(\d+)\.[^/]+$
See Regex Demo
Code:
import re

# Python 3.8+ (uses the walrus operator inside the comprehension)
pattern = r"\/[^/\d\s]+(\d+)\.[^/]+$"
texts = ['/path/to/file/abc123/GCAM5423.xmp', '/path/to/file/abc123/GCAM5423.cr2']
print([match.group(1) for x in texts if (match := re.search(pattern, x))])
Output:
['5423', '5423']
Step 1: Find the substring before the last dot.
(.*)\.
Input: /path/to/file/abc123/GCAM5423.cr2
Output: /path/to/file/abc123/GCAM5423
Step 2: Find the last number using your regex.
Input: /path/to/file/abc123/GCAM5423
Output: 5423
I don't know how to join these two regexes into one, but perhaps the two-step approach is still useful for you. Hope it helps ^_^
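In code, the two steps might be combined like this (a sketch that reuses the question's original expression for step 2):

import re

path = "/path/to/file/abc123/GCAM5423.cr2"

# Step 1: keep everything before the last dot, dropping the extension.
stem = re.match(r"(.*)\.", path).group(1)  # /path/to/file/abc123/GCAM5423

# Step 2: with the extension gone, no digits remain after the frame
# number, so the original expression matches it.
m = re.search(r"((?P<index>(?P<padding>0*)\d+)(?!.*(0*)\d+))", stem)
print(m.group("index"))  # 5423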
I am writing a snakemake pipeline where I download various files whose filenames contain many periods, and I am having a devil of a time getting it to properly understand the file names. I have essentially two rules, a download rule and a target rule. Here they are simplified below.
rule download_files_from_s3:
    input:
        some input
    params:
        some params
    output:
        expand("destinationDir/{{sampleName}}\.{{suffix}}")
    shell:
        "aws s3 cp input destinationDir"

rule targets:
    input:
        expand("destinationDir/{sampleName}\.{suffix}", sampleName=sampleNames)
In this formulation Snakemake compiles successfully and properly downloads the files from S3 to where I want them. However, it is then unable to find them and says "Waiting at most 5 seconds for missing files." When I run Snakemake in dry-run mode, I can see it expects files of the form "destinationDir/sampleName\.suffix", when in reality they exist without a backslash: "destinationDir/sampleName.suffix". My first thought was to ditch the backslash, changing my rules to the form:
expand("destinationDir/{sampleName}.{suffix}", sampleName=sampleNames)
This, however, creates an overly greedy regular expression. My value for suffix should be ".params.txt". When I run the no-backslash version, Snakemake evaluates the wildcard sampleName as "sampleName.params" and the wildcard suffix as "txt". How ought I to go about this, either by forcing the regular expression matching in expand to behave or by having Snakemake properly interpret the '\' character? My efforts so far haven't been successful.
I don't know Snakemake, but their docs mention that you can avoid ambiguity by providing a regex in your wildcards:
{wildcard,regex}
Eg:
{sampleName,[^.]+}
To quote the docs:
[...]
Multiple wildcards in one filename can cause ambiguity. Consider the pattern {dataset}.{group}.txt and assume that a file 101.B.normal.txt is available. It is not clear whether dataset=101.B and group=normal or dataset=101 and group=B.normal in this case.
Hence wildcards can be constrained to given regular expressions. Here we could restrict the wildcard dataset to consist of digits only using \d+ as the corresponding regular expression. With Snakemake 3.8.0, there are three ways to constrain wildcards. First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma:
output: "{dataset,\d+}.{group}.txt"
[...]
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
Also note that {{wildcard}} will "mask" the wildcard, as mentioned in the expand section of the docs:
You can also mask a wildcard expression in expand such that it will be kept, e.g.
expand("{{dataset}}/a.{ext}", ext=FORMATS)
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function
I'm not sure how I should understand "mask", but reading the aside note at the top leads me to think it's a way to escape the special meaning of { and use the string literal {wildcard}.
Note that any placeholders in the shell command (like {input}) are always evaluated and replaced when the corresponding job is executed, even if they are occurring inside a comment. To avoid evaluation and replacement, you have to mask the braces by doubling them, i.e. {{input}}.
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
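A quick way to see what masking does (expand can be imported from snakemake.io outside a Snakefile):

from snakemake.io import expand

# Doubled braces survive expansion as a literal single-brace wildcard,
# while single-brace wildcards are filled in.
print(expand("{{dataset}}/a.{ext}", ext=["txt", "csv"]))
# ['{dataset}/a.txt', '{dataset}/a.csv']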
The short answer is that you don't need to escape the period; the string in expand isn't a regex.
If your value for the suffix always is "params.txt", a simple solution would be to specify that:
expand("destinationDir/{sampleName}.params.txt", sampleName=sampleNames)
If the suffix is variable but you want to constrain it somehow, then according to the latest documentation, single brackets around the suffix together with setting allow_missing should preserve suffix as a wildcard, and you should be able to use wildcard constraints (as mentioned in this answer) to specify the potential suffixes. Assuming that your suffix pattern is ".SOMETHING.txt":
rule targets:
    wildcard_constraints:
        suffix = r"\.[a-zA-Z]+\.txt"
    input:
        expand("destinationDir/{sampleName}.{suffix}", sampleName=sampleNames, allow_missing=True)
(Wrapping suffix in double brackets instead would also preserve it as the wildcard {suffix} rather than trying to expand it. Your output in download_files_from_s3, as written, also simplifies to

output: "destinationDir/{sampleName}.{suffix}"

where you can again add wildcard constraints if needed. In fact, given that the issue with the file names will probably continue through your workflow, you may want to set wildcard constraints globally by adding

wildcard_constraints:
    suffix = r"\.[a-zA-Z]+\.txt"

to your Snakefile before defining any rules.)
I am developing a python package that needs to, among other things, process a file containing a list of dataset names and I need to extract the components of these names.
Examples of dataset names would be:
diskLineLuminosity:halpha:rest:z1.0
diskLineLuminosity:halpha:rest:z1.0:dust
diskLineLuminosity:halpha:rest:z1.0:contam_NII
diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OII:contam_OIII
diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OIII:dust
diskLineLuminosity:halpha:rest:z1.0:contam_OII:contam_NII
diskLineLuminosity:halpha:rest:z1.0:contam_NII:recent
I'm looking for a way to parse the dataset names using regex to extract all the dataset information, including a list of all instances of "contam_*" (where zero instances are allowed). I realise that I could just split the string and use fnmatch.filter, or equivalent, but I also need to be able to flag erroneous dataset names that do not match the above syntax. Also, regex is already used extensively in similar situations throughout the package, so I prefer not to introduce a second parsing method.
As an MWE, with an example dataset name, I have pieced together:
import re
datasetName = "diskLineLuminosity:halpha:rest:z1.0:contam_NII:recent"
M = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(:recent)?(:contam_[^:]+)?(:dust[^:]+)?", datasetName)
This returns:
print(M.group(1,2,3,4,5,6,7))
('disk', 'halpha', 'rest', '1.0', None, ':contam_NII', None)
In the package, this regex search needs to go into a function similar to:
def getDatasetNameInformation(datasetName):
    INFO = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(:recent)?(:contam_[^:]+)?(:dust[^:]+)?", datasetName)
    if not INFO:
        raise ParseError("Cannot parse '"+datasetName+"'!")
    return INFO
I am still new to regex, so how can I modify the re.search pattern to successfully parse all of the above dataset names and extract the information in the substrings (including a list of all the instances of contamination)?
Thanks for any help you can provide!
If you are still learning regular expressions (to be honest, later as well), get in the habit of using the verbose mode as often as possible; it makes for better code and more readable expressions.
That said, you could use
^
(disk|spheroid)
LineLuminosity:
([^:]+):
([^:]+):
z([\d\.]+)
((?::contam_[^:]+)+)?
(:recent)?
(:dust[^:]*)?
Just changed the order a bit and used a non-capturing group inside the contam part; see a demo on regex101.com.
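In Python, that might look like the following sketch (re.VERBOSE ignores the insignificant whitespace, so the pattern can keep this layout):

import re

pattern = re.compile(r"""
    ^
    (disk|spheroid)
    LineLuminosity:
    ([^:]+):
    ([^:]+):
    z([\d\.]+)
    ((?::contam_[^:]+)+)?
    (:recent)?
    (:dust[^:]*)?
""", re.VERBOSE)

M = pattern.search("diskLineLuminosity:halpha:rest:z1.0:contam_NII:recent")
print(M.groups())
# ('disk', 'halpha', 'rest', '1.0', ':contam_NII', ':recent', None)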
You could capture all of those contam_ with ((?::contam_[^:]+)*): this will capture all of them in one group. Then run a second regular expression on just that match, and use the result as a nested list within the first results:
import re

datasetName = "diskLineLuminosity:halpha:rest:z1.0:recent:contam_NII:contam_NII:dust"
M = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(?::(recent))?((?::contam_[^:]+)*)(?::(dust))?", datasetName)
lst = list(M.groups())
# Group 6 (index 5) holds the whole ":contam_..." run; split it into a
# nested list with a second pass.
if lst[5]:
    lst[5] = re.findall(":contam_([^:]+)", lst[5])
print(lst)
Output:
['disk', 'halpha', 'rest', '1.0', 'recent', ['NII', 'NII'], 'dust']
In Markdown there are two ways to place a link: one is to just type the raw link in, like http://example.com; the other is to use the ()[] syntax: (Stack Overflow)[http://example.com].
I'm trying to write a regular expression that can match both of these, and, if it's the second match to also capture the display string.
So far I have this:
(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])
Debuggex Demo
But this doesn't seem to match either of my two test cases in Debuggex:
http://example.com
(Example)[http://example.com]
Really not sure why the first one isn't matched at the very least; is it something to do with my use of the named group? If possible, I'd like to keep using it, because this is a simplified expression to match the link, and in the real example it is too long for me to feel comfortable duplicating it in two different places in the same pattern.
What am I doing wrong? Or is this not doable at all?
EDIT: I'm doing this in Python so will be using their regex engine.
The reason your pattern doesn't work is here: (?<=\((.*)\)\[), since the re module of Python doesn't allow variable-length lookbehinds.
You can obtain what you want in a more handy way using the new regex module of Python (since the re module has few features in comparison).
Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])
An online demo
pattern details:
(?| # open a branch reset group
# first case there is only the url
(?<txt> # in this case, the text and the url
(?<url> # are the same
(?:ht|f)tps?://\S+(?<=\P{P})
)
)
| # OR
# the (text)[url] format
\( ([^)]+) \) # this group will be named "txt" too
\[ (\g<url>) \] # this one "url"
)
This pattern uses the branch reset feature (?|...|...|...) that allows preserving capturing group names (or numbers) across an alternation. In the pattern, since the ?<txt> group is opened first in the first member of the alternation, the first group in the second member automatically gets the same name. The same goes for the ?<url> group.
\g<url> is a reference to the named subpattern ?<url> (like an alias; this way, there is no need to rewrite it in the second member).
(?<=\P{P}) checks that the last character of the url is not a punctuation character (useful to avoid capturing the closing square bracket, for example). (I'm not sure of the syntax; it may be \P{Punct}.)
I am doing a directory listing and need to get all directory names that follow the patterns Feb14-2014 and 14022014-sometext. The directory names must not contain dots, so I don't want to match 14022014-sometext.more. As you can see, I want to match just the directories that follow the patterns %b%d-%Y and %d%m%Y-textofanylengthWithoutDots.
For the first case it should be something like [a-zA-Z]{3}\d{2}. I don't know how to write the rest because my regex skills are poor, sorry. So I hope someone can tell me what the correct patterns look like. Thanks.
I am assuming each directory listing is separated by a new line
([A-Z]\w{2}\d{1,2}\-\d{4}|\d{7,8}\-\w+)$
Will match both cases and will match the text only if it is uninterrupted (by dots or anything else for that matter) until it hits the end of the line.
Some notes:
If you want to match everything except dot you may replace the final \w+ with [^.]+.
You need the multiline modifier /m for this to work, otherwise the $ will match the end of the string only.
I've not added a ^ to the start of the regex, but you may do so if each line contains a single directory.
Of course you may expand this regex to include (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) instead of [A-Z]\w{2}. I've not done this to keep it readable. I would also suggest storing the month names in a Python list and inserting them dynamically into your regex for maintainability's sake, as sketched below.
See it in action: http://regex101.com/r/pS6iY9
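A hedged sketch of those suggestions combined (month names kept in a list and inserted dynamically, the final \w+ swapped for [^.]+, and the multiline flag set):

import re

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# [^.]+ rejects names containing a dot; $ with re.M anchors to line ends.
pattern = r"((?:%s)\d{1,2}-\d{4}|\d{7,8}-[^.]+)$" % "|".join(months)

listing = "Feb14-2014\n14022014-sometext\n14022014-sometext.more"
print(re.findall(pattern, listing, re.M))
# ['Feb14-2014', '14022014-sometext']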
That's quite easy.
The best one I can make is:
((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d)|(\d\d\d\d\d\d\d\d-\w+)
The first part, ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d), matches the first kind of date, and the second part, (\d\d\d\d\d\d\d\d-\w+), matches the second kind.