Regex to match one or two groups or both - python

I want a regex that can match either one group, or two groups. Here is an example of how it looks. Either like this:
(key)
Or like this:
(key "value")
So far I've come up with an expression which matches the latter example. But I have no idea how to modify it so it matches either the first one, or the latter one. Here it is:
\((?P<property_key>[^() ]+) "(?P<property_value>[^"]*)"\)

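I believe you are looking for the pattern below. It wraps the space and the quoted value in a non-capturing group and makes that group optional with ?, so property_value is simply unset when only the key is present: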
\((?P<property_key>\w+)(?:\s+"(?P<property_value>\w+)")?\)
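To see how the optional group behaves, here is a minimal sketch. The helper name parse_property and the VARIANT_PATTERN are illustrative assumptions, not part of the original answer. The variant keeps the question's original character classes, since \w+ in the answer is narrower (for example, it rejects values containing spaces).

```python
import re

# Pattern from the answer: the value part sits in an optional non-capturing group.
ANSWER_PATTERN = re.compile(
    r'\((?P<property_key>\w+)(?:\s+"(?P<property_value>\w+)")?\)'
)

# Assumed variant: same structure, but with the question's own character classes.
VARIANT_PATTERN = re.compile(
    r'\((?P<property_key>[^() ]+)(?: "(?P<property_value>[^"]*)")?\)'
)


def parse_property(text, pattern=ANSWER_PATTERN):
    """Return (key, value), with value None when only the key is present.

    Returns None when the whole text does not match the pattern.
    """
    m = pattern.fullmatch(text)
    if m is None:
        return None
    return m.group("property_key"), m.group("property_value")


print(parse_property('(key)'))
print(parse_property('(key "value")'))
print(parse_property('(key "two words")'))                   # rejected by \w+
print(parse_property('(key "two words")', VARIANT_PATTERN))  # accepted by [^"]*
```

Checking m.group("property_value") against None is how the caller can tell the one-group form from the two-group form.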

Related

Regex (python) exclude some part of the replacement

For sure there are other ways to solve this, but I'm interested if this can be solved exclusively via regex. I have lines of text like this:
9,A
11,B
22,>
33,B
72,A
91,<
112,A
162,B
When I try to apply this replacement to basically "join" or erase the part between arrows and replace them with "+++":
re.sub(r'\>(\n\d.+)+<','+++',string_above)
I get this, which is fine:
9,A
11,B
22,+++
112,A
162,B
But what if I want to keep the last number before the "<" sign and put, say, "X" after it, to get something like this:
9,A
11,B
22,+++
91,X
112,A
162,B
How can I do that?
In this concrete case, you may replace with
r'+++\1X'
If X is a digit, replace with
r'+++\g<1>X'
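\1 and \g<1> are replacement backreferences: both refer to the value captured by group 1. Because that group is repeated, it holds only its last iteration, which here is the newline plus "91,". The \g<1> form is needed when a digit follows, since \1 followed by a digit would be read as a two-digit group number.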
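As a quick check of the replacement above, the following sketch runs both forms on the sample text from the question. The variable names and the digit 7 standing in for "X" are illustrative choices.

```python
import re

text = "9,A\n11,B\n22,>\n33,B\n72,A\n91,<\n112,A\n162,B"
pattern = r'\>(\n\d.+)+<'  # pattern from the question, unchanged

# The repeated group keeps only its last iteration, i.e. "\n91,".
with_letter = re.sub(pattern, r'+++\1X', text)

# \g<1> keeps the group number separate from the digit that follows it.
with_digit = re.sub(pattern, r'+++\g<1>7', text)

print(with_letter)
print(with_digit)
```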

Capturing groups with an or operator in Python

I have found odd behavior in Python 3.7.0 when capturing groups with an or operator when one branch initially matches but the regex has to eventually backtrack and use a different branch. In this scenario, the capture groups stick with the first branch even though the regex uses the second branch.
Example code:
import re

regexString = "^(a)|(ab)$"
captureString = "ab"
match = re.match(regexString, captureString)
print(match.groups())
Output:
('a', None)
The second group is the group that is used, but the first group is captured and the second group isn't.
Interestingly, I have found a workaround by adding non-capturing parentheses around both groups like so:
regexString = "^(?:(a)|(ab))$"
New Output:
(None, 'ab')
To me this behavior looks like a bug. If it is not, can someone point me to some documentation explaining why this is occurring? Thank you!
This is a common regex mistake. Here is your original pattern:
^(a)|(ab)$
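Alternation has the lowest precedence, so the anchors belong to the individual branches. The pattern actually says to match ^(a), i.e. a at the start of the input, or (ab)$, i.e. ab at the end of the input. The first branch is tried first and succeeds on the leading a, so the second branch is never used and nothing backtracks. If you instead want to match a or ab as the entire input, then as you figured out you need: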
^(?:(a)|(ab))$
To further convince yourself of this behavior, you may verify that the following pattern matches the same things as your original pattern:
(ab)$|^(a)
That is, each term in the alternation is separate, and the order of the terms does not even matter, at least with regard to which inputs would match or not match. By the way, you could have just used the following pattern:
^ab?$
This would match a or ab, and also you would not even need a capture group, as the entire match would correspond to what you want.
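The difference between the patterns can be checked directly. This is a small sketch using only the patterns from the question and answer. The extra input "xab" is an assumption added to show that the ungrouped pattern also accepts strings that merely end in ab.

```python
import re

text = "ab"

# Anchors bind to each branch: ^(a) or (ab)$. The first branch wins on the leading "a".
loose = re.match(r"^(a)|(ab)$", text)

# Grouping the alternation applies both anchors to whichever branch is used.
grouped = re.match(r"^(?:(a)|(ab))$", text)

# Swapping the branches changes which one is tried first, not which inputs match.
swapped = re.match(r"(ab)$|^(a)", text)

# An optional "b" needs no capture group at all.
simple = re.match(r"^ab?$", text)

print(loose.group(0), loose.groups())
print(grouped.group(0), grouped.groups())
print(swapped.group(0), swapped.groups())
print(simple.group(0))

# With re.search, the ungrouped pattern also accepts input that only ends in "ab".
print(re.search(r"^(a)|(ab)$", "xab").groups())
print(re.search(r"^(?:(a)|(ab))$", "xab"))
```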

Python Regex - Is it possible to use the same group (named or unnamed) in multiple spots?

I have a bunch of strings, some of which I need to replace a part of. However, the parts before and after the parts that need to be replaced are not always the same. Also, the part of the string that needs to be replaced is not something I can match with a regex without it matching other parts that I don't want to replace. For example:
"prefixA_REPLACEME_postfixA",
"prefixB_SOMETHING_postfixB",
"prefixA_LLAMAS_postfixC",
"prefixB_DONTREPLACE_postfixA",
Turned into:
"prefixA_NEWSTR_postfixA",
"prefixB_NEWSTR_postfixB",
"prefixA_NEWSTR_postfixC",
"prefixB_DONTREPLACE_postfixA",
I would love to do this with a single regex, like this:
re.sub('(prefixA_).*(_postfixA)|(prefixB_).*(_postfixB)|(prefixA_).*(_postfixC)', '\\1NEWSTR\\2', stringToFix)
Unfortunately this doesn't work, because group 1 and group 2 are (prefixA_) and (postfixA), whether or not that is the part of the regex that ends up being used. I also can't use this
re.sub('(?P<one>prefixA_).*(?P<two>_postfixA)|(?P<one>prefixB_).*(?P<two>_postfixB)|(?P<one>prefixA_).*(?P<two>_postfixC)', '\\1NEWSTR\\2', stringToFix)
because it gives me the error
sre_constants.error: redefinition of group name 'one' as group 3; was group 1
Something else that won't work is this
re.sub('(prefixA_|prefixB_).*(_postfixA|_postfixB|_postfixC)', '\\1NEWSTR\\2', stringToFix)
because this would capture the fourth string, which I don't want to be matched.
So is there a way to make it so that any uncaptured groups are not counted (which would make my first regex work correctly)? Or any other way to do this with a single regex?
You can't define a named capturing group more than once within the same regex (unlike other regex flavors like .NET). But since you're not doing anything with the pre- and postfixes, you can simply use lookaround assertions:
>>> s = """prefixA_REPLACEME_postfixA
... prefixB_SOMETHING_postfixB
... prefixA_LLAMAS_postfixC
... prefixB_DONTREPLACE_postfixA"""
>>> import re
>>> print(re.sub("(?<=prefixA).*(?=postfixA)|(?<=prefixB).*(?=postfixB)|(?<=prefixA).*(?=postfixC)", "_NEWSTR_", s))
prefixA_NEWSTR_postfixA
prefixB_NEWSTR_postfixB
prefixA_NEWSTR_postfixC
prefixB_DONTREPLACE_postfixA
It looks like what you want is to test each string first and substitute only when it qualifies:
if re.search("shouldReplaceRegex", matchstring):
    matchstring = re.sub("_.*?_", "_yourReplacement_", matchstring)
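The test-then-substitute idea can be made concrete as follows. This is only a sketch: the names SHOULD_REPLACE and fix are hypothetical, and the allowed prefix/postfix pairs are taken from the question's three alternatives.

```python
import re

# Hypothetical filter: the three prefix/postfix pairs the question wants rewritten.
SHOULD_REPLACE = re.compile(
    r"^(?:prefixA_.*_postfixA|prefixB_.*_postfixB|prefixA_.*_postfixC)$"
)


def fix(s, replacement="NEWSTR"):
    """Replace the middle part only when the prefix/postfix pair is allowed."""
    if SHOULD_REPLACE.search(s):
        # Lazy match from the first underscore to the next one; replace once.
        return re.sub(r"_.*?_", "_" + replacement + "_", s, count=1)
    return s


strings = [
    "prefixA_REPLACEME_postfixA",
    "prefixB_SOMETHING_postfixB",
    "prefixA_LLAMAS_postfixC",
    "prefixB_DONTREPLACE_postfixA",
]
fixed = [fix(s) for s in strings]
print(fixed)
```

This uses two regexes rather than one, so it trades the single-expression goal of the question for a filter that is easier to read and extend.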

Python Regex instantly replace groups

Is there any way to directly replace all groups using regex syntax?
The normal way:
re.match(r"(?:aaa)(_bbb)", string1).group(1)
But I want to achieve something like this:
re.match(r"(\d.*?)\s(\d.*?)", "(CALL_GROUP_1) (CALL_GROUP_2)")
I want to build the new string instantaneously from the groups the Regex just captured.
Have a look at re.sub:
result = re.sub(r"(\d.*?)\s(\d.*?)", r"\1 \2", string1)
This is Python's regex substitution (replace) function. The replacement string can be filled with so-called backreferences (backslash, group number) which are replaced with what was matched by the groups. Groups are counted the same as by the group(...) function, i.e. starting from 1, from left to right, by opening parentheses.
The accepted answer is perfect. I would add that group reference is probably better achieved by using this syntax:
r"\g<1> \g<2>"
for the replacement string. This way, you work around syntax limitations where a group may be followed by a digit. Again, this is all present in the doc, nothing new, just sometimes difficult to spot at first sight.
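A short sketch of both reference styles follows. The date string and variable names are illustrative assumptions, not from the question. It shows groups being reordered with \1, \2, \3, and what happens when a literal digit follows a group reference.

```python
import re

date = "2012-02-19"
pattern = r"(\d+)-(\d+)-(\d+)"

# \1, \2, \3 refer to the groups in the order of their opening parentheses.
swapped = re.sub(pattern, r"\3/\2/\1", date)

# \g<1> stays unambiguous when a literal digit follows the reference.
padded = re.sub(pattern, r"\g<1>0-\2-\3", date)

# r"\10" is read as a reference to group 10, which this pattern does not have.
try:
    re.sub(pattern, r"\10-\2-\3", date)
    ambiguous_ok = True
except re.error:
    ambiguous_ok = False

print(swapped)
print(padded)
print(ambiguous_ok)
```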

regex for capturing group that is only sometimes present

I have a set of filenames like:
PATJVI_RNA_Tumor_8_3_63BJTAAXX.310_BUSTARD-2012-02-19.fq.gz
PATMIF_RNA_Tumor_CGTGAT_2_1_BC0NKBACXX.334_BUSTARD-2012-05-07.fq.gz
I would like to have a single regex (in python, fyi) that can capture each of the groups between the "_" characters. However, note that in the second filename, there is a group that is present that is not present in the first filename. Of course, one can use a string split, etc., but I would like to do this with a single regex. The regex for the first filename is something like:
(\w+)_(\w+)_(\w+)_(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz
And the second will be:
(\w+)_(\w+)_(\w+)_(\w+)_(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz
I'd like the regex group to be empty when the optional group is absent and to contain it when it is present (so that I can use it later in constructing a new filename with \4).
To make a group optional, you can add ? after the desired group. Like this:
(\w+)?
But your example has an underscore that should be optional as well. To deal with it, you can group the underscore together with the optional group:
((\w+)_)?
However, this adds a new group to your match results. To avoid that, use a non-capturing group:
(?:(\w+)_)?
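The final result will look like this:
(\w+)_(\w+)_(\w+)_(?:(\w+)_)?(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz
One caveat: \w also matches the underscore, so on the second filename the greedy first (\w+) absorbs an extra field (it captures PATMIF_RNA) and the optional group stays empty. If the fields never contain underscores, a class that excludes them avoids this:
([A-Za-z0-9]+)_([A-Za-z0-9]+)_([A-Za-z0-9]+)_(?:([A-Za-z0-9]+)_)?(\d)_(\d)_([A-Za-z0-9]+)\.(\d+)_(\S+)\.fq\.gz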
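The optional group can be tried on both filenames from the question. This sketch makes one assumption beyond the answer: the fields are matched with [A-Za-z0-9]+ instead of \w+, because \w also matches "_" and lets the greedy first group swallow a neighbouring field. The names FIELD, PATTERN, and WORD_PATTERN are hypothetical.

```python
import re

FIELD = r"[A-Za-z0-9]+"  # assumption: fields never contain an underscore

# Same structure as the answer's pattern, with the optional fourth field
# wrapped in a non-capturing group together with its trailing underscore.
PATTERN = re.compile(
    rf"({FIELD})_({FIELD})_({FIELD})_(?:({FIELD})_)?(\d)_(\d)_({FIELD})\.(\d+)_(\S+)\.fq\.gz"
)

# The answer's pattern as written, for comparison.
WORD_PATTERN = re.compile(
    r"(\w+)_(\w+)_(\w+)_(?:(\w+)_)?(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz"
)

name1 = "PATJVI_RNA_Tumor_8_3_63BJTAAXX.310_BUSTARD-2012-02-19.fq.gz"
name2 = "PATMIF_RNA_Tumor_CGTGAT_2_1_BC0NKBACXX.334_BUSTARD-2012-05-07.fq.gz"

groups1 = PATTERN.match(name1).groups()
groups2 = PATTERN.match(name2).groups()
print(groups1)  # the fourth entry is None: the optional field is absent
print(groups2)  # the fourth entry is "CGTGAT"

# With \w+, the first group takes "PATMIF_RNA" and the optional group is skipped.
word_groups2 = WORD_PATTERN.match(name2).groups()
print(word_groups2[0], word_groups2[3])
```

When the optional field is absent, group 4 is None on the match object, so code that builds a new filename may want something like m.group(4) or "" as a fallback.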
