Python regex optional group with capturing group not working


I am struggling with a regex in Python. I've spent several hours trying to figure out what is wrong.
Here is my content:
Some Title - Description (Gold Edition)
Some Title - Description
I need to match Some Title and, when it is present, the word Gold inside the parentheses.
I've tried the following regex https://regex101.com/r/9MNYZl/1 :
(.*)\-.*(?:\((.*)[Ee]dition\))*?
But it doesn't capture the word before Edition.
Interestingly, I tried this in PHP and it worked fine.
I have no idea what is wrong; please help me solve the issue.
Many thanks.

The first .* in your pattern matches to the end of the string and then backtracks until it can match the -. The second .* then matches to the end of the string again.
Because the part (?:\((.*)[Ee]dition\))*? is optional, the pattern is already satisfied at the end of the string: the group matches zero times and nothing is captured.
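To see this behaviour in Python, here is a minimal sketch that runs the pattern from the question unchanged; the variable names are only for illustration:

```python
import re

# The pattern from the question, unchanged.
original = re.compile(r"(.*)\-.*(?:\((.*)[Ee]dition\))*?")

m = original.match("Some Title - Description (Gold Edition)")

title = m.group(1)    # text before the hyphen
edition = m.group(2)  # None: the optional group matched zero times
whole = m.group(0)    # the second .* already consumed "(Gold Edition)"

print(repr(title), repr(edition), repr(whole))
```

The overall match covers the whole line, but group 2 stays empty because the lazy optional group is never needed for the match to succeed.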
You could use a negated character class with an optional non capturing group.
To match the first word after the opening parenthesis, you could match 1+ word chars with \w+ or use a broader match with \S+:
^([^-]+)-[^()]+(?:\((\S+) [Ee]dition\))?
In parts
^ Start of string
( Capture group 1
[^-]+ Match 1+ times any char except -
)- Close group 1 and match -
[^()]+ Match 1+ times any char except ( or )
(?: Non capturing group
\( Match (
(\S+) Capture group 2, match 1+ times a non whitespace char
[Ee]dition (after a literal space) Match a space followed by Edition or edition
\) Match )
)? Close non capturing group and make it optional
Regex demo
To capture all until the edition in group 2 instead of a single word:
^([^-]+)-[^()]+(?:\(([^()]+) [Ee]dition\))?
Regex demo
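A small sketch of both patterns in Python, assuming re.match is applied to one line at a time. The helper name parse and the "Limited Gold Edition" input are hypothetical and not from the question:

```python
import re

# Group 2 captures a single non-whitespace word before "Edition".
single_word = re.compile(r"^([^-]+)-[^()]+(?:\((\S+) [Ee]dition\))?")
# Group 2 captures everything between "(" and " Edition".
any_words = re.compile(r"^([^-]+)-[^()]+(?:\(([^()]+) [Ee]dition\))?")


def parse(pattern, line):
    """Return (title, edition); edition is None when the optional part is absent."""
    m = pattern.match(line)
    if m is None:
        return None
    # Group 1 includes the space before the hyphen, so strip it.
    return m.group(1).strip(), m.group(2)


with_edition = parse(single_word, "Some Title - Description (Gold Edition)")
without_edition = parse(single_word, "Some Title - Description")
# Hypothetical input with two words before "Edition".
two_words = parse(any_words, "Some Title - Description (Limited Gold Edition)")

print(with_edition, without_edition, two_words)
```

Note that group 2 is None rather than an empty string when the parenthesised part is missing, so check for None before using it.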

Related

Second optional capturing group depending on optional delimiter in regex

I'm sorry for asking what may be a duplicate question. I checked the existing questions and answers about optional capturing groups. I tried some things, but I'm not able to translate the answers to my own example.
These are two input lines:
id:target][label
id:target
I would like to capture id: (group 1), target (group 2) and if ][ is present label (group 3).
The used regex (Python regex) only works on the first line (live example on regex101).
^(.+:)(.*)\]\[(.*)
In the other examples I don't understand which part of the regex makes a capturing group optional. The delimiter ][ that I use may also be adding to my confusion.
One thing I tried was this
^(.+:)(.*)(\]\[(.*))?
This doesn't work as expected.

You could write the pattern using an anchor at the end, a negated character class for group 1, a non greedy quantifier for group 2 and then optionally match a 3rd part:
^([^:]+:)(.*?)(?:]\[(.*))?$
Explanation
^ Start of string
([^:]+:) Group 1, match 1+ chars other than : and then match : using a negated character class
(.*?) Group 2, match any char, as few as possible
(?: Non capture group to match as a whole part
]\[ Match ][
(.*) Group 3, match any character
)? Close the non capture group and make it optional
$ End of string
See a regex101 demo
If you are only matching, for example, word characters, you might consider:
^([^:]+:)(\w+)(?:]\[(\w+))?
See another regex101 demo
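As a rough check in Python, the sketch below compares the second attempt from the question with the anchored pattern on both input lines. The variable names are illustrative only:

```python
import re

# Second attempt from the question: the greedy (.*) swallows "][label".
attempt = re.compile(r"^(.+:)(.*)(\]\[(.*))?")
# Anchored pattern: lazy group 2, optional non-capturing tail, end anchor.
fixed = re.compile(r"^([^:]+:)(.*?)(?:]\[(.*))?$")

lines = ["id:target][label", "id:target"]

attempt_groups = [attempt.match(s).groups() for s in lines]
fixed_groups = [fixed.match(s).groups() for s in lines]

print(attempt_groups)
print(fixed_groups)
```

The $ anchor is what forces the lazy (.*?) to keep expanding until either ][ or the end of the string is reached; without it, group 2 would stay empty.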

Regex that captures a group with a positive lookahead but doesn't match a pattern

Using regex (Python) I want to capture a group \d-.+? that is immediately followed by another pattern \sLEFT|\sRIGHT|\sUP.
Here is my test set (from http://nflsavant.com/about.php):
(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).
(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).
(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).
(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).
My best attempt is (\d*-.+?)(?=\sLEFT|\sRIGHT|\sUP), which works unless other characters appear between a matching capture group and my positive lookahead. In the second line of my test set this expression captures "69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK." instead of the desired "33-D.COOK".
My inputs are also saved on regex101, here: https://regex101.com/r/tEyuiJ/1
How can I modify (or completely rewrite) my regex to only capture the group immediately followed by my exact positive lookahead with no extra characters between?

To prevent the match from skipping over digits, use \D, which matches any non-digit (the uppercase form is the negation of \d).
\b(\d+-\D+?)\s(?:LEFT|RIGHT|UP)
See this demo at regex101
I also added a word boundary and changed the lookahead to a non-capturing group.

If you want a capture group without any lookarounds:
\b(\d+-\S*)\s(?:LEFT|RIGHT|UP)\b
Explanation
\b A word boundary to prevent a partial word match
(\d+-\S*) Capture group 1, match 1+ digits, a hyphen, and optional non-whitespace characters
\s Match a single whitespace character
(?:LEFT|RIGHT|UP) Match any of the alternatives
\b A word boundary
See the capture group value on regex101.
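For comparison, here is a short sketch that runs the attempt from the question and the two capture-group patterns above on the four test lines. It assumes re.search is applied per line; the variable names are made up for the example:

```python
import re

plays = [
    "(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).",
    "(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).",
    "(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).",
    "(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).",
]

# Attempt from the question: .+? may run across spaces and other players.
attempt = re.compile(r"(\d*-.+?)(?=\sLEFT|\sRIGHT|\sUP)")
# \D+? cannot step over the digits of the next jersey number.
non_digit = re.compile(r"\b(\d+-\D+?)\s(?:LEFT|RIGHT|UP)")
# \S* cannot step over whitespace at all.
non_space = re.compile(r"\b(\d+-\S*)\s(?:LEFT|RIGHT|UP)\b")

overshoot = attempt.search(plays[1]).group(1)
by_non_digit = [non_digit.search(p).group(1) for p in plays]
by_non_space = [non_space.search(p).group(1) for p in plays]

print(overshoot)
print(by_non_digit)
print(by_non_space)
```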

This is why you should be careful about using . to match anything and everything unless it's absolutely necessary. From the example you provided, it appears that what you actually want to capture contains no spaces, so we can use a negated character class [^\s] or, more precisely, [\w.], in either case with a * quantifier.
Your end result would look like "(\d*-[\w.]*)(?=\sLEFT|\sRIGHT|\sUP)"gm. And of course, when . is inside a character class it's treated as a literal character, so it does not need to be escaped.
See it live at regex101.com

Try this:
\b\d+-[^\r \n]+(?= +(?:LEFT|RIGHT|UP)\b)
\b\d+-[^\r \n]+
\b Word boundary, to ignore things like foo30-J.RICHARD.
\d+ Match one or more digits.
- Match a literal -.
[^\r \n]+ Match one or more characters except \r, \n and a literal space. Excluding \r and \n keeps the match from crossing newlines, which is why \s is not used (it matches \r and \n too).
(?= +(?:LEFT|RIGHT|UP)\b) A positive lookahead.
A space followed by +: ensure there are one or more literal spaces.
(?:LEFT|RIGHT|UP)\b Using a non-capturing group, ensure the preceding spaces are followed by one of the words LEFT, RIGHT or UP. The \b word boundary ignores things like RIGHTfoo or LEFTbar.
See regex demo
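Because the lookahead consumes nothing, this variant also works with re.findall on a multi-line string. A minimal sketch, assuming two of the test lines joined by a newline:

```python
import re

# Two of the test lines from the question, joined into one block of text.
text = (
    "(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD TO NO 4 "
    "FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).\n"
    "(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS "
    "(54-O.VERNON; 51-M.WILSON)."
)

pattern = re.compile(r"\b\d+-[^\r \n]+(?= +(?:LEFT|RIGHT|UP)\b)")

# No capture group is needed: the whole match is the player token.
players = pattern.findall(text)

print(players)
```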

regex non greedy quantifier catching nothing, greedy catching too much

I'm writing a Python regex that parses the content of a heading. However, the greedy quantifier is not working well, and the non-greedy quantifier is not working at all.
My string is
Step 1 Introduce The Assets:
Step2 Verifying the Assets
Step 3Making sure all the data is in the right place:
What I'm trying to do is extract the step number, and the heading, excluding the :.
I've tried multiple regex strings and came up with these two:
r1 = r"Step ?([0-9]+) ?(.*) ?:?"
r2 = r"Step ?([0-9]+) ?(.*?) ?:?"
r1 is capturing the step number, but is also capturing : at the end.
r2 is capturing the step number, and ''. I'm not sure how to handle the case where there is a .* followed by a string.
Necessary Edit:
The heading might contain : inside the string, I just want to ignore the trailing one. I know I can strip(':') but I want to understand what I'm doing wrong.

You can write the pattern with a negated character class, which removes the need for the non-greedy and optional parts:
\bStep ?(\d+) ?([^:\n]+)
\bStep ? Match the word Step and optional space
(\d+) ? Capture 1+ digits in group 1 followed by matching an optional space
([^:\n]+) Capture 1+ chars other than : or a newline in group 2
Regex demo
If the colon has to be at the end of the string:
\bStep ?(\d+) ?([^:\n]+):?$
Regex demo
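A quick sketch comparing r1 and r2 from the question with the negated character class, assuming the three headings are in one multi-line string. In r2 the lazy (.*?) can stop immediately because everything after it is optional, which is why it captures an empty string:

```python
import re

r1 = re.compile(r"Step ?([0-9]+) ?(.*) ?:?")   # greedy, from the question
r2 = re.compile(r"Step ?([0-9]+) ?(.*?) ?:?")  # lazy, from the question
fixed = re.compile(r"\bStep ?(\d+) ?([^:\n]+)")

text = """Step 1 Introduce The Assets:
Step2 Verifying the Assets
Step 3Making sure all the data is in the right place:"""

greedy = r1.search(text).groups()  # (.*) also takes the trailing colon
lazy = r2.search(text).groups()    # (.*?) is satisfied by matching nothing
headings = fixed.findall(text)     # one (number, heading) tuple per line

print(greedy)
print(lazy)
print(headings)
```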

How to not capture a group in regex if it is followed by an another group

If I have a string, e.g. 'hcto,231' or 'hcto.12', I want to be able to capture 'o,231' or 'o.12' and process it as a number ('hct' is random and any other string can replace it).
But I don't want to capture anything if the 'o' character is followed by a decimal number, e.g. 'wordo.23.12' or 'wordo,23,12'.
I've tried using the following regex:
([oO][.,][0-9]+)(?!([.,][0-9]+))
but it always matches.
In the string 'hcto.22.23' it matches o.2, but I don't want it to match anything. Is there a way to combine the groups so that nothing matches when the negative lookahead condition holds?

The match occurs in hcto.22.23 because the lookahead triggers backtracking, and since [0-9]+ can match a single 2 (it does not have to match 22), the match succeeds and returns a smaller, unexpected match.
It seems the simplest way to fix the current issue is to make the dot or comma pattern in the lookahead optional, and remove unnecessary groups:
[oO][.,]\d+(?![.,]?\d)
See the regex demo.
Details
[oO] - o or O
[.,] - a dot or comma
\d+ - one or more digits
(?![.,]?\d) - not followed by an optional . or , and then a digit.
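The backtracking effect and the fix can be checked with a few lines of Python. This is a sketch using the strings from the question; the helper name find is hypothetical:

```python
import re

# Pattern from the question: [0-9]+ can give back a digit to satisfy the lookahead.
original = re.compile(r"([oO][.,][0-9]+)(?!([.,][0-9]+))")
# Fixed pattern: the lookahead also rejects a directly following digit.
fixed = re.compile(r"[oO][.,]\d+(?![.,]?\d)")


def find(pattern, s):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(s)
    return m.group(0) if m else None


partial = find(original, "hcto.22.23")  # the shortened, unwanted match
wanted = [find(fixed, s) for s in ("hcto,231", "hcto.12")]
rejected = [find(fixed, s) for s in ("hcto.22.23", "wordo.23.12", "wordo,23,12")]

print(partial, wanted, rejected)
```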

regex named group if exist

Good morning,
I have a string that I need to parse, printing the content of two named groups, knowing that one of them might not exist.
The string looks like this (basically content of /proc/pid/cmdline):
"""
<some chars with letters / numbers / space / punctuation> /CLASS_NAME:myapp.server.starter.StarterHome /PARAM_XX:value_XX /PARAM_XX:value_XX /CONFIG_FILE:myapp.server.config.myconfig.txt /PARAM_XX:value_XX /PARAM_XX:value_XX /PARAM_XX:value_XX <some chars with letters / numbers / space / punctuation>
"""
my processes have almost the same pattern, that is:
/CLASS_NAME:myapp.server.starter.StarterHome is always present, but
/CONFIG_FILE:myapp.server.config.myconfig.txt is NOT always present.
I'm using python2 with re module to catch the values. So far my pattern looks like this and I'm able to catch the value I want corresponding to /CLASS_NAME
re.compile(r'CLASS_NAME:\w+\W\w+\W\w+\W(?P<class>\w+)')
Because /CONFIG_FILE may or may not be present, I added the following to my regexp:
re.compile(r"""CLASS_NAME:\w+\W\w+\W\w+\W(?P<class>\w+).*?
(CONFIG_FILE:\w+\W\w+\W\w+\W(?P<cnf>\w+.txt))?
""", re.X)
My understanding is that the second part of my regexp is optional because the whole part is between parentheses followed by ?.
Unfortunately my assumption is wrong, as it doesn't catch the value.
I also tried removing the first ?, but it didn't help.
I made several attempts in PYTHEX to try to understand my regexp but couldn't find a solution.
Does anyone have a suggestion to resolve my case?

You can wrap the whole optional part within an optional non-capturing group and make the capturing group for CONFIG_FILE obligatory:
re.compile(r"""CLASS_NAME:(?:\w+\W+){3}(?P<class>\w+)(?:.*?
(CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)))?
""", re.X)
In case there are newlines, use re.X | re.S modifier options. Note that \w+\W\w+\W\w+\W is better written as (?:\w+\W+){3}.
See the regex demo
The main difference is (?:.*?(CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)))? part:
(?: - start of an optional (as there is a greedy ? quantifier after it) non-capturing group matching
.*? - any 0+ chars, as few as possible
(CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)) - matches
CONFIG_FILE: - a literal substring
(?:\w+\W+){3} - three sequences of 1+ word chars followed with 1+ non-word chars
(?P<cnf>\w+\.txt) - Group cnf: 1+ word chars, a dot (note it should be escaped) and then txt
)? - end of the optional non-capturing group (that will be tried once)
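A minimal sketch of how the compiled pattern could be used. It is written to run on Python 3 (the question mentions python2), and the two command lines are shortened, hypothetical samples modelled on the one in the question, with a made-up "java" prefix:

```python
import re

pattern = re.compile(
    r"""CLASS_NAME:(?:\w+\W+){3}(?P<class>\w+)(?:.*?
    (CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)))?
    """,
    re.X | re.S,
)

# Hypothetical, shortened command lines; the "java" prefix is an assumption.
with_cfg = (
    "java /CLASS_NAME:myapp.server.starter.StarterHome /PARAM_XX:value_XX "
    "/CONFIG_FILE:myapp.server.config.myconfig.txt /PARAM_XX:value_XX"
)
without_cfg = "java /CLASS_NAME:myapp.server.starter.StarterHome /PARAM_XX:value_XX"

found = pattern.search(with_cfg)
missing = pattern.search(without_cfg)

# The optional group leaves cnf as None when CONFIG_FILE is absent.
print(found.group("class"), found.group("cnf"))
print(missing.group("class"), missing.group("cnf"))
```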
