regex for capturing group that is only sometimes present - python

I have a set of filenames like:
PATJVI_RNA_Tumor_8_3_63BJTAAXX.310_BUSTARD-2012-02-19.fq.gz
PATMIF_RNA_Tumor_CGTGAT_2_1_BC0NKBACXX.334_BUSTARD-2012-05-07.fq.gz
I would like to have a single regex (in python, fyi) that can capture each of the groups between the "_" characters. However, note that in the second filename, there is a group that is present that is not present in the first filename. Of course, one can use a string split, etc., but I would like to do this with a single regex. The regex for the first filename is something like:
(\w+)_(\w+)_(\w+)_(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz
And the second will be:
(\w+)_(\w+)_(\w+)_(\w+)_(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz
I'd like that regex group to be empty when the optional part is absent and to contain it when it is present (so that I can refer to it later as \4 when constructing a new filename).

To make a group optional, you can add ? after the desired group. Like this:
(\w+)?
But your example has an underscore that should be optional as well. To deal with it, you can group it together with optional group.
((\w+)_)?
However, this will add a new group to your match results. To avoid that, use a non-capturing group:
(?:(\w+)_)?
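The final result will look like this. Note that \w also matches the underscore, so with \w+ the first group can absorb an extra field (PATMIF_RNA) and leave the optional group empty; using [A-Za-z0-9]+ for the underscore-delimited fields avoids that:
([A-Za-z0-9]+)_([A-Za-z0-9]+)_([A-Za-z0-9]+)_(?:([A-Za-z0-9]+)_)?(\d)_(\d)_([A-Za-z0-9]+)\.(\d+)_(\S+)\.fq\.gz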
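A minimal sketch of the optional-group idea applied to the two filenames from the question. It assumes the underscore-delimited fields contain only letters and digits, so it uses [A-Za-z0-9]+ rather than \w+ (\w also matches "_", which lets a greedy first group swallow a neighbouring field). The names FIELD and FILENAME_RE are illustrative only.

```python
import re

# Assumption: fields between underscores are purely alphanumeric.
FIELD = r"[A-Za-z0-9]+"
FILENAME_RE = re.compile(
    rf"({FIELD})_({FIELD})_({FIELD})_"
    rf"(?:({FIELD})_)?"  # optional field together with its trailing underscore
    rf"(\d)_(\d)_({FIELD})\.(\d+)_(\S+)\.fq\.gz"
)

filenames = [
    "PATJVI_RNA_Tumor_8_3_63BJTAAXX.310_BUSTARD-2012-02-19.fq.gz",
    "PATMIF_RNA_Tumor_CGTGAT_2_1_BC0NKBACXX.334_BUSTARD-2012-05-07.fq.gz",
]

for name in filenames:
    match = FILENAME_RE.fullmatch(name)
    # group(4) is None when the optional field is absent
    print(match.group(1), match.group(4), match.group(5), match.group(6))
```

When the pattern is used with re.sub on Python 3.5 or later, a \4 in the replacement expands to an empty string if group 4 did not take part in the match.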

Related

Single regular expression for extracting different values

I have some inputs like
ID= 5657A
ID=PID=FSGDVD
IDS=5645SD
I have created a regex, i.e. IDS=[A-Za-z0-9]+|ID=[A-Za-z0-9]+|PID=[A-Za-z0-9]+. But in the case of ID=PID=FSGDVD, I want PID=FSGDVD as output.
My outputs must look like
ID= 5657A
PID=FSGDVD
IDS=5645SD
How can I solve this problem?
Add end of line anchor and use grouping and quantifiers to simplify the regex:
(?:IDS?|PID)=[A-Za-z0-9]+$
IDS? will match both ID and IDS
(?:IDS?|PID) will match ID or IDS or PID
(?:pattern) is a non-capturing group. Some functions, such as re.split and re.findall, change their behavior based on capture groups, so a non-capturing group is ideal whenever backreferences aren't needed
$ is end of line anchor, thus you'll get the match towards end of line instead of start of line
Demo: https://regex101.com/r/e9uvmC/1
In case your input can be something like ID=PID=FSGDVD xyz then you could use lookarounds:
(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)
Here \b ensures that the match runs through all the word characters after the = sign, and (?!=) is a negative lookahead assertion that rejects the match if an = follows
Demo: https://regex101.com/r/e9uvmC/2
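As a quick check, here is a small sketch that runs both patterns with re.search. The helper first_match and the variable names are hypothetical, introduced only for this illustration.

```python
import re

anchored = re.compile(r"(?:IDS?|PID)=[A-Za-z0-9]+$")
lookahead = re.compile(r"(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)")


def first_match(pattern, text):
    """Return the matched text, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(first_match(anchored, "ID=PID=FSGDVD"))       # PID=FSGDVD
print(first_match(anchored, "IDS=5645SD"))          # IDS=5645SD
# The $ anchor fails once trailing text is present ...
print(first_match(anchored, "ID=PID=FSGDVD xyz"))   # None
# ... which is the case the lookahead variant covers.
print(first_match(lookahead, "ID=PID=FSGDVD xyz"))  # PID=FSGDVD
```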
Another one could be
[A-Z]+=\s*[^=]+$
See a demo on regex101.com.
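A short sketch of this looser pattern on the three sample inputs. Unlike the [A-Za-z0-9]+ patterns, it accepts the space after = in the first sample, because \s* and [^=]+ allow whitespace. The variable name loose is illustrative.

```python
import re

loose = re.compile(r"[A-Z]+=\s*[^=]+$")

samples = ["ID= 5657A", "ID=PID=FSGDVD", "IDS=5645SD"]
results = []
for line in samples:
    m = loose.search(line)
    results.append(m.group(0) if m else None)

print(results)  # ['ID= 5657A', 'PID=FSGDVD', 'IDS=5645SD']
```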

Regex (python) exclude some part of the replacement

For sure there are other ways to solve this, but I'm interested if this can be solved exclusively via regex. I have lines of text like this:
9,A
11,B
22,>
33,B
72,A
91,<
112,A
162,B
When I try to apply this replacement to basically "join" or erase the part between arrows and replace them with "+++":
re.sub(r'\>(\n\d.+)+<','+++',string_above)
I get this, which is fine:
9,A
11,B
22,+++
112,A
162,B
But what if I want to keep the last number before the "<" sign and put, say, "X" after it, to get something like this:
9,A
11,B
22,+++
91,X
112,A
162,B
How can I do that?
In this concrete case, you may replace with
r'+++\1X'
If X is a digit, replace with
r'+++\g<1>X'
The \1 and \g<1> are called replacement backreferences, these refer to the capturing group #1 value.
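A small sketch of both replacement forms, using the pattern and sample text from the question. It relies on the fact that a repeated capturing group keeps only its last iteration, so group 1 ends up holding the line break plus "91,". The variable names are illustrative.

```python
import re

text = "9,A\n11,B\n22,>\n33,B\n72,A\n91,<\n112,A\n162,B"
pattern = r">(\n\d.+)+<"

# Group 1 holds only the last repetition, i.e. "\n91,"
with_letter = re.sub(pattern, r"+++\1X", text)
print(with_letter)

# With a digit right after the reference, \g<1> keeps it unambiguous
# (a plain \17 would be read as a reference to group 17).
with_digit = re.sub(pattern, r"+++\g<1>7", text)
print(with_digit)
```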

Looping over group in a Python regex

EDIT: I've gotten it to work. I had forgotten to put in a space as a separator for multiple edges.
I've got this Python regex, which handles most of the strings I have to parse.
edge_value_pattern = re.compile(r'(?P<edge>e[0-9]+) +(?P<label1>[^ ]*)[^"]+"(?P<word>[^"]+)"[^:]+:: (?P<label2>[^\n]+)')
Here is an example string that my regex is meant to parse:
'e0 BIKE-EVENT 1 "biking" 2'
It correctly stores e0 into the edge group, BIKE-EVENT into the label1 group, and "biking" into the word group. The last group, label2, is for a slightly different variation of the string, as shown below. Note that the label2 regex group behaves as expected when given a string like the one below.
'e29 e30 "of" :: of, OF'
However, the regex pattern fills in label1 with the value e30. The truth is that this string does not have any label1 value--it should be None or at least the empty string. An ad-hoc solution would be to parse label1 with a regex to determine if it's an actual label or just another edge. I want to know if there is way to modify my original regex so that the group edge takes in all edges. E.g., the output for the above string would be:
edge = "e29 e30"
label1 = None
word = of
label2 = of, OF
I tried this solution below, which I thought would translate to simply looping over the first group, edge (this would be trivial if I had an actual FSA), but it doesn't change the behavior of the regex.
edge_value_pattern = re.compile(r'(?P<edge>(e[0-9]+)+) +(?P<label1>[^ ]*)[^"]+"(?P<word>[^"]+)"[^:]+:: (?P<label2>[^\n]+)')
If you want edge to match "e29 e30", the repetition has to go inside the group, not outside it.
You did that by putting a new group with a + repetition inside the edge group, which is fine, but you forgot to include the space inside the repeating group.
(You also left the external repeat in place and used a capturing group where you probably wanted a non-capturing one, but those are less serious.)
Look at just that fragment:
(?P<edge>(e[0-9]+)+)
Here, the expression catches e29 as one match, then e30 as a subsequent match. So, if you add anything else to the expression, it's either going to miss e29, or just fail. But add the space:
(?P<edge>(e[0-9]+ )+)
And now it's matching e29 e30 plus the trailing space as a single match, which means you can tack on any additional stuff and it will work (as long as you get that additional stuff right—you still need to remove the extra +, and I think you may need to make a couple of other repetitions non-greedy…).
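One possible way to finish the pattern, as a sketch rather than the asker's actual fix: the space moves inside a non-capturing repeated group, the extra " +" after the edge group is dropped, and [^"]+ becomes [^"]* so that label1 is allowed to be empty. The trailing space that edge picks up is stripped afterwards.

```python
import re

edge_value_pattern = re.compile(
    r'(?P<edge>(?:e[0-9]+ )+)'   # one or more edges, each followed by a space
    r'(?P<label1>[^ ]*)[^"]*'    # label1 may be empty; [^"]* instead of [^"]+
    r'"(?P<word>[^"]+)"[^:]+:: (?P<label2>[^\n]+)'
)

m = edge_value_pattern.match('e29 e30 "of" :: of, OF')
edge = m.group('edge').strip()        # drop the trailing space
label1 = m.group('label1') or None    # empty string becomes None
word = m.group('word')
label2 = m.group('label2')
print(edge, label1, word, label2)
```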

Python Regex - Is it possible to use the same group (named or unnamed) in multiple spots?

I have a bunch of strings, some of which I need to replace a part of. However, the parts before and after the parts that need to be replaced are not always the same. Also, the part of the string that needs to be replaced is not something I can match with a regex without it matching other parts that I don't want to replace. For example:
"prefixA_REPLACEME_postfixA",
"prefixB_SOMETHING_postfixB",
"prefixA_LLAMAS_postfixC",
"prefixB_DONTREPLACE_postfixA",
Turned into:
"prefixA_NEWSTR_postfixA",
"prefixB_NEWSTR_postfixB",
"prefixA_NEWSTR_postfixC",
"prefixB_DONTREPLACE_postfixA",
I would love to do this with a single regex, like this:
re.sub('(prefixA_).*(_postfixA)|(prefixB_).*(_postfixB)|(prefixA_).*(_postfixC)', '\\1NEWSTR\\2', stringToFix)
Unfortunately this doesn't work, because group 1 and group 2 are always (prefixA_) and (_postfixA), whether or not that is the alternative that ends up matching. I also can't use this
re.sub('(?P<one>prefixA_).*(?P<two>_postfixA)|(?P<one>prefixB_).*(?P<two>_postfixB)|(?P<one>prefixA_).*(?P<two>_postfixC)', '\\1NEWSTR\\2', stringToFix)
because it gives me the error
sre_constants.error: redefinition of group name 'one' as group 3; was group 1
Something else that won't work is this
re.sub('(prefixA_|prefixB).*(_postfixA|_postfixB|_postfixC)', '\\1NEWSTR\\2', stringToFix)
because this would capture the fourth string, which I don't want to be matched.
So is there a way to make it so that any uncaptured groups are not counted (which would make my first regex work correctly)? Or any other way to do this with a single regex?
You can't define a named capturing group more than once within the same regex in Python (unlike some other regex flavors, such as .NET). But since you're not doing anything with the pre- and postfixes, you can simply use lookaround assertions:
>>> s = """prefixA_REPLACEME_postfixA
... prefixB_SOMETHING_postfixB
... prefixA_LLAMAS_postfixC
... prefixB_DONTREPLACE_postfixA"""
>>> import re
>>> print(re.sub("(?<=prefixA).*(?=postfixA)|(?<=prefixB).*(?=postfixB)|(?<=prefixA).*(?=postfixC)", "_NEWSTR_", s))
prefixA_NEWSTR_postfixA
prefixB_NEWSTR_postfixB
prefixA_NEWSTR_postfixC
prefixB_DONTREPLACE_postfixA
It looks like what you want is to test the string first and substitute only if it matches:
if re.search("shouldReplaceRegex",matchstring): matchstring = re.sub("_.*?_","_yourReplacement_",matchstring)
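A runnable sketch of that two-step idea, with the placeholder check filled in from the prefix/postfix pairs in the question. SHOULD_REPLACE and fix are hypothetical names, and the check pattern is an assumption about which combinations should be rewritten.

```python
import re

# Assumption: only these three prefix/postfix pairs should be rewritten.
SHOULD_REPLACE = re.compile(
    r"^(?:prefixA_.*_postfixA|prefixB_.*_postfixB|prefixA_.*_postfixC)$"
)


def fix(s, replacement="NEWSTR"):
    """Replace the middle part only when the prefix/postfix pair is allowed."""
    if SHOULD_REPLACE.search(s):
        return re.sub(r"_.*?_", f"_{replacement}_", s)
    return s


strings = [
    "prefixA_REPLACEME_postfixA",
    "prefixB_SOMETHING_postfixB",
    "prefixA_LLAMAS_postfixC",
    "prefixB_DONTREPLACE_postfixA",
]
fixed = [fix(s) for s in strings]
print(fixed)
```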

Regex to match one or two groups or both

I want a regex that can match either one group, or two groups. Here is an example of how it looks. Either like this:
(key)
Or like this:
(key "value")
So far I've come up with an expression which matches the latter example. But I have no idea how to modify it so it matches either the first one, or the latter one. Here it is:
\((?P<property_key>[^() ]+) "(?P<property_value>[^"]*)"\)
I believe you are looking for regex pattern
\((?P<property_key>\w+)(?:\s+"(?P<property_value>\w+)")?\)
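A brief sketch of how this pattern behaves on the two forms from the question. Note that \w+ is narrower than the [^() ]+ and [^"]* classes in the original attempt, so an empty "" or a value containing spaces would not match. PROPERTY_RE is an illustrative name.

```python
import re

PROPERTY_RE = re.compile(
    r'\((?P<property_key>\w+)(?:\s+"(?P<property_value>\w+)")?\)'
)

parsed = []
for text in ['(key)', '(key "value")']:
    m = PROPERTY_RE.fullmatch(text)
    # property_value is None when the optional part is absent
    parsed.append((m.group('property_key'), m.group('property_value')))

print(parsed)  # [('key', None), ('key', 'value')]
```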
