Second optional capturing group depending on optional delimiter in regex - python

I'm sorry for asking this possibly duplicate question. I checked the existing questions and answers about optional capturing groups and tried some things, but I'm not able to translate the answers to my own example.
These are two input lines:
id:target][label
id:target
I would like to capture id: (group 1), target (group 2) and if ][ is present label (group 3).
The used regex (Python regex) only works on the first line (live example on regex101).
^(.+:)(.*)\]\[(.*)
In the other examples I don't understand which part of the regex makes a capturing group optional. The ][ delimiter I use may also be adding to my confusion.
One thing I tried was this
^(.+:)(.*)(\]\[(.*))?
This doesn't work as expected.

You could write the pattern using an anchor at the end, a negated character class for group 1, a non greedy quantifier for group 2 and then optionally match a 3rd part:
^([^:]+:)(.*?)(?:]\[(.*))?$
Explanation
^ Start of string
([^:]+:) Group 1, match 1+ chars other than : and then match : using a negated character class
(.*?) Group 2, match any char, as few as possible
(?: Non capture group to match as a whole part
]\[ Match ][
(.*) Group 3, match any character
)? Close the non capture group and make it optional
$ End of string
See a regex101 demo
If you are only matching, for example, word characters, you might consider:
^([^:]+:)(\w+)(?:]\[(\w+))?
See another regex101 demo
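A minimal Python sketch of the first pattern above, applied to the two input lines from the question. The helper name split_line is only illustrative and not part of the original answer.

```python
import re

# First pattern from the answer: group 3 stays None when "][" is absent.
pattern = re.compile(r"^([^:]+:)(.*?)(?:]\[(.*))?$")

def split_line(line):
    """Return (prefix, target, label), or None if the line does not match."""
    m = pattern.match(line)
    return m.groups() if m else None

for line in ["id:target][label", "id:target"]:
    print(line, "->", split_line(line))
```

For the second line the optional part does not participate in the match, so group 3 comes back as None rather than an empty string.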

Related

regex non greedy quantifier catching nothing, greedy catching too much

I'm writing a python regex formula that parses the content of a heading, however the greedy quantifier is not working well, and the non greedy quantifier is not working at all.
My string is
Step 1 Introduce The Assets:
Step2 Verifying the Assets
Step 3Making sure all the data is in the right place:
What I'm trying to do is extract the step number, and the heading, excluding the :.
Now I've tried multiple regex strings and came up with these 2:
r1 = r"Step ?([0-9]+) ?(.*) ?:?"
r2 = r"Step ?([0-9]+) ?(.*?) ?:?"
r1 is capturing the step number, but is also capturing : at the end.
r2 is capturing the step number, and ''. I'm not sure how to handle the case where there is a .* followed by a string.
Necessary Edit:
The heading might contain : inside the string, I just want to ignore the trailing one. I know I can strip(':') but I want to understand what I'm doing wrong.
You can write the pattern using a negated character class, without the non-greedy and optional parts:
\bStep ?(\d+) ?([^:\n]+)
\bStep ? Match the word Step and optional space
(\d+) ? Capture 1+ digits in group 1 followed by matching an optional space
([^:\n]+) Capture 1+ chars other than : or a newline in group 2
Regex demo
If the colon has to be at the end of the string:
\bStep ?(\d+) ?([^:\n]+):?$
Regex demo

How to not capture a group in regex if it is followed by an another group

If I have a string eg.: 'hcto,231' or 'hcto.12' I want to be able to capture 'o,231' or 'o.12' and process it as a number ('hct' is random and any other string can replace it).
But I don't want to capture it if the 'o' character is followed by a decimal number, e.g. 'wordo.23.12' or 'wordo,23,12'.
I've tried using the following regex:
([oO][.,][0-9]+)(?!([.,][0-9]+))
but it always matches.
In the string 'hcto.22.23' it matches 'o.2', but I don't want it to match anything. Is there a way to combine groups so it won't match if the negative lookahead is true?
The match occurs in hcto.22.23 because the lookahead triggers backtracking: since [0-9]+ can match a single 2 (it does not have to match 22), the match succeeds and returns a smaller, unexpected match, o.2.
It seems the simplest way to fix the current issue is to make the dot or comma pattern in the lookahead optional, and remove unnecessary groups:
[oO][.,]\d+(?![.,]?\d)
See the regex demo.
Details
[oO] - o or O
[.,] - a dot or comma
\d+ - one or more digits
(?![.,]?\d) - not followed by a . or , and a digit, or by just a digit.
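To see the backtracking effect, here is a small sketch that runs the pattern from the question and the fixed pattern side by side. The helper first_match is a hypothetical convenience function, not part of either post.

```python
import re

# Pattern from the question and the fixed pattern from the answer.
original = re.compile(r"([oO][.,][0-9]+)(?!([.,][0-9]+))")
fixed = re.compile(r"[oO][.,]\d+(?![.,]?\d)")

def first_match(regex, s):
    """Return the text of the first match, or None when nothing matches."""
    m = regex.search(s)
    return m.group(0) if m else None

for s in ["hcto,231", "hcto.12", "hcto.22.23", "wordo,23,12"]:
    print(s, "| original:", first_match(original, s), "| fixed:", first_match(fixed, s))
```

With the original pattern, the digits backtrack to a single digit so that the lookahead succeeds. With the fixed pattern, a following digit also blocks the match, so the shortened match is rejected as well.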

Python - regex cleanup for any brackets

My code needs to clean up the whitespace around brackets of any kind, so I assume using regex is my best course of action. My strings will (I think) always look like the following (although more robustness is always appreciated):
text = "the people ( that don't still like / love you } are going to ..."
to look like:
final = "the people (that don't still like / love you} are going to ..."
My current attempt seems to do nothing (I know it only considers round brackets for now):
final = re.sub( r'\s[\(]+\s(\w*)\s[\)]+\s' , '\s[\(]+\1[\)]+\s' , text )
Please & thank you.
In your example string, you want to remove the spaces after the opening bracket and before the closing bracket, even though the two brackets are not of the same type.
The pattern that you tried does not work because there are multiple words between ( and ), and you are not matching the }.
Note that inside a character class you don't have to escape the parentheses.
([{[(])\s*(.*?)\s*([]})])
Explanation
([{[(]) Capture group 1 match any of the listed brackets
\s* Match 0+ whitespace chars
(.*?) Capture group 2, match any char, as few as possible
\s* Match 0+ whitespace chars
([]})]) Capture group 3 match any of the listed brackets
See a regex demo
Replace with the 3 capture groups:
\1\2\3
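Put together with re.sub, the pattern and replacement might look like the following sketch. The function name tighten_brackets is an assumption made for illustration.

```python
import re

# Group 1: opening bracket, group 2: inner text (lazy), group 3: closing bracket.
bracket_re = re.compile(r"([{[(])\s*(.*?)\s*([]})])")

def tighten_brackets(text):
    """Drop whitespace just inside any opening/closing bracket pair."""
    return bracket_re.sub(r"\1\2\3", text)

text = "the people ( that don't still like / love you } are going to ..."
final = tighten_brackets(text)
print(final)
```

The pattern does not check that the opening and closing brackets are of the same kind, which is what the example string in the question needs.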

Python regex optional group with capturing group not working

I am struggling with a regex in python. I've spent several hours trying to figure out what is wrong.
Here is my content:
Some Title - Description (Gold Edition)
Some Title - Description
I need to match Some Title and optional Gold word in brackets.
I've tried the following regex https://regex101.com/r/9MNYZl/1 :
(.*)\-.*(?:\((.*)[Ee]dition\))*?
But it doesn't capture the word before Edition.
One interesting thing is that I tried this in PHP and it worked fine.
I have no idea what is wrong, please help me solve the issue.
Many thanks.
The first .* in your pattern will match until the end of the string, then it will backtrack to match the - and the second .* will match again till the end of the string.
As this part of the pattern (?:\((.*)[Ee]dition\))*? is optional, the pattern is already satisfied at the end of the string without matching it.
You could use a negated character class with an optional non capturing group.
To match the first word after the opening parenthesis you could match 1+ word chars \w+ or a broader match using \S+
^([^-]+)-[^()]+(?:\((\S+) [Ee]dition\))?
In parts
^ Start of string
( Capture group 1
[^-]+ Match 1+ times any char except -
)- Close group 1 and match -
[^()]+ Match 1+ times any char except ( or )
(?: Non capturing group
\( Match (
(\S+) Capture group 2, match 1+ times a non whitespace char
 [Ee]dition Match a space followed by [Ee]dition
\) Match )
)? Close non capturing group and make it optional
Regex demo
To capture all until the edition in group 2 instead of a single word:
^([^-]+)-[^()]+(?:\(([^()]+) [Ee]dition\))?
Regex demo

[FORKING]Python Regex - Re.Sub and Re.Findall Interesting Challenges

Not sure if this is something that should be a bounty. I just want to understand regex better.
I checked the responses in the Regex to match pattern.one skip newlines and characters until pattern.two and Regex to match if given text is not found and match as little as possible threads and read about Tempered Greedy Token Solutions and Explicit Greedy Alternation Solutions on RexEgg, but admittedly the explanations baffled me.
I spent the last day fiddling mainly with re.sub (and with findall) because re.sub's behaviour is odd to me.
Problem 1:
Given the strings below, with characters followed by /, how would I produce a SINGLE regex (using only either re.sub or re.findall) that uses alternating capture groups, which must use [\S]+/, to get the desired output?
>>> string_1 = 'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
>>> string_2 = 'variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/'
>>> string_3 = 'variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives'
Desired Output Given the Conditions(!!)
tax-march-donald-trump-protest-
CONDITIONS: Must use alternating capture groups which must capture ([\S]+) or ([\S]+?)/ to capture the other groups but ignore them if they don't contain -
I'M WELL AWARE that it would be better to use re.findall('([\-]*(?:[^/]+?\-)+)[\d]+', string) or something similar but I want to know if I can use [\S]+ or ([\S]+) or ([\S]+?)/ and tell regex that if those are captured, ignore the result if it contains / or doesn't contain - While also having used an alternating capture group
I KNOW I don't need to use [\S]+ or ([\S]+) but I want to see if there is an extra directive I can use to make the regex reject some characters those two would normally capture.
Posted per request:
(?:(?!/)[\S])*-(?:(?!/)[\S])*
https://regex101.com/r/azrwjO/1
Explained
(?: # Optional group
(?! / ) # Not a forward slash ahead
[\S] # Not whitespace class
)* # End group, do 0 to many times
- # A dash must exist
(?: # Optional group, same as above
(?! / )
[\S]
)*
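As a rough illustration, the pattern above can be run with re.findall on the third string from the question. It is written here with \S instead of [\S], which is equivalent, and the variable names are illustrative.

```python
import re

# Each \S is only consumed if it is not a slash; a dash must appear in the segment.
segment_re = re.compile(r"(?:(?!/)\S)*-(?:(?!/)\S)*")

string_3 = (
    "variety.com/2017/biz/the/life/of/news/"
    "tax-march-donald-trump-protest-1202031487/the/days/of/our/lives"
)

# Only path segments that contain at least one dash are returned.
dashed_segments = segment_re.findall(string_3)
print(dashed_segments)
```

This returns the whole dashed segment, including the trailing digits, so the number would still need to be removed to reach the desired output.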
You could use
/([-a-z]+)-\d+
and take the first capturing group, see a demo on regex101.com.
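A minimal sketch of this second suggestion, applied to the three strings from the question. The names slug_re and slugs are hypothetical.

```python
import re

# A slash, then lowercase letters and dashes (group 1), then a dash and digits.
slug_re = re.compile(r"/([-a-z]+)-\d+")

strings = [
    "variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/",
    "variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/",
    "variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives",
]

slugs = [slug_re.search(s).group(1) for s in strings]
print(slugs)
```

Note that group 1 ends before the last dash, so it yields tax-march-donald-trump-protest without the trailing dash shown in the question's desired output.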
