This question already has answers here:
What is a non-capturing group in regular expressions?
(18 answers)
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
I saw the regular expression (?= (?:\d{5}|[A-Z]{2})) in a Python re example and was very confused about the meaning of the ?: .
I also looked at the Python docs, which give this explanation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
Can someone give me an example and explain why it works? Thanks!
Ordinarily, parentheses create a "capturing" group inside your regex:
import re
regex = re.compile(r"(set|let) var = (\w+|\d+)")
print(regex.match("set var = 12").groups())
results
('set', '12')
Later you can retrieve those groups by calling the .groups() method on the result of a match. As you see, whatever is inside parentheses is captured in "groups." But you might not care about all those groups. Say you only want to find what's in the second group and not the first. You need the first set of parentheses in order to group "set" and "let", but you can turn off capturing by putting "?:" at the beginning:
regex = re.compile(r"(?:set|let) var = (\w+|\d+)")
print(regex.match("set var = 12").groups())
results:
('12',)
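The pattern from the question behaves the same way. Here is a minimal sketch, assuming a made-up address string (the sample text and the \w+ prefix are illustrative and not part of the original example). With re.findall, a capturing group changes what is returned, while (?:...) only groups the alternation:

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Springfield IL 62704"

# Non-capturing group inside the lookahead: findall returns the whole matches.
non_capturing = re.findall(r"\w+(?= (?:\d{5}|[A-Z]{2}))", text)

# Same pattern with a capturing group: findall returns the group contents instead.
capturing = re.findall(r"\w+(?= (\d{5}|[A-Z]{2}))", text)

print(non_capturing)  # ['Springfield', 'IL']
print(capturing)      # ['IL', '62704']
```

In both cases the lookahead only checks what follows each word; the ?: decides whether that checked text is also recorded as a group.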
If you do not need the group to capture its match, you can optimize
this regular expression into Set(?:Value)?. The question mark and the
colon after the opening parenthesis are the syntax that creates a
non-capturing group. The question mark after the opening bracket is
unrelated to the question mark at the end of the regex. The final
question mark is the quantifier that makes the previous token
optional. This quantifier cannot appear after an opening parenthesis,
because there is nothing to be made optional at the start of a group.
Therefore, there is no ambiguity between the question mark as an
operator to make a token optional and the question mark as part of the
syntax for non-capturing groups, even though this may be confusing at
first. There are other kinds of groups that use the (? syntax in
combination with other characters than the colon that are explained
later in this tutorial.
color=(?:red|green|blue) is another regex with a non-capturing group.
This regex has no quantifiers.
From : http://www.regular-expressions.info/brackets.html
Also read: What is a non-capturing group? What does a question mark followed by a colon (?:) mean?
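The two regexes quoted above can be tried directly in Python. This is a small sketch with hypothetical variable names; the capturing variant Set(Value)? is included only for contrast:

```python
import re

# Optional suffix grouped without capturing: nothing is stored in groups().
plain = re.compile(r"Set(?:Value)?")
print(plain.fullmatch("Set").groups())       # ()
print(plain.fullmatch("SetValue").groups())  # ()

# The capturing equivalent stores the optional part (or None when it is absent).
capturing = re.compile(r"Set(Value)?")
print(capturing.fullmatch("SetValue").groups())  # ('Value',)
print(capturing.fullmatch("Set").groups())       # (None,)

# Alternation limited to the colour names, with no group recorded.
color = re.compile(r"color=(?:red|green|blue)")
print(color.search("color=green").group(0))  # color=green
```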
Suppose I have the following regex that matches a string with a semicolon at the end:
\".+\";
It will match any string except an empty one, like the one below:
"";
I tried using this:
\".+?\";
But that didn't work.
My question is, how can I make the .+ part of the regex optional, so the user doesn't have to put any characters in the string?
To make the .+ optional, you could do:
\"(?:.+)?\";
(?:...) is called a non-capturing group. It only groups the pattern for matching and does not capture anything. Adding ? after the non-capturing group makes the whole group optional.
Alternatively, you could do:
\".*?\";
.* would match any character zero or more times greedily. Adding ? after the * forces the regex engine to do a shortest possible match.
As an alternative:
\".*\";
Try it here: https://regex101.com/r/hbA01X/1
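The variants above can be compared side by side. The sketch below assumes Python's re module (the question does not name a language) and uses made-up sample strings:

```python
import re

empty = '"";'          # the empty string literal from the question
two = '"a"; "b";'      # hypothetical input with two string literals

# .+ and .+? both need at least one character, so the empty literal fails.
print(re.fullmatch(r'\".+\";', empty))   # None
print(re.fullmatch(r'\".+?\";', empty))  # None

# An optional group, or .* / .*?, accepts the empty literal.
print(bool(re.fullmatch(r'\"(?:.+)?\";', empty)))  # True
print(bool(re.fullmatch(r'\".*?\";', empty)))      # True
print(bool(re.fullmatch(r'\".*\";', empty)))       # True

# Greedy and lazy differ when several literals share a line.
greedy = re.search(r'\".*\";', two).group(0)
lazy = re.search(r'\".*?\";', two).group(0)
print(greedy)  # "a"; "b";
print(lazy)    # "a";
```

The lazy form is usually the safer choice when more than one quoted string can appear on a line.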
I have updated the following re to not match when the string is B/C, B/O, S/C, or S/O.
old: (.*)/(.*)
new: (.*)(?<!^(B|S)(?=/(C|O)$))/(.*)
This regex is being used downstream with a list of other regex patterns and is expected to separate the data into two groups. Is there a way for my regex pattern (or a better one) to not count the zero-width assertions?
I've tried pushing the validation till the end with a single lookbehind assertion but that only has access to the group after the slash.
I've also tried enclosing the assertions in (?:...), but the inner parentheses are still counted as matching groups.
Thanks to #user2357112
(.*)(?<!^(?:B|S)(?=/(?:C|O)$))/(.*)
I was using (?:...) incorrectly on my first attempts
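A quick way to see the difference is to count the groups each pattern defines. This sketch assumes Python's re module, since the question does not name the engine; the test strings are illustrative:

```python
import re

# Capturing groups inside the assertions add to the group count.
old = re.compile(r"(.*)(?<!^(B|S)(?=/(C|O)$))/(.*)")
# With (?:...) inside the assertions, only the two outer groups remain.
new = re.compile(r"(.*)(?<!^(?:B|S)(?=/(?:C|O)$))/(.*)")

print(old.groups)  # 4
print(new.groups)  # 2

# Excluded inputs do not match; everything else splits into two groups.
for s in ["A/B", "B/X", "B/C", "S/O"]:
    m = new.match(s)
    print(s, m.groups() if m else None)
```

Downstream code that expects exactly two groups can then keep using m.groups() unchanged.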
I am currently going through pythonchallenge.com and am trying to write code that searches for a lowercase letter with exactly three uppercase letters on both sides of it. I got stuck trying to write a regular expression for it. This is what I have tried:
import re
#text is in https://pastebin.com/pAFrenWN since it is too long
p = re.compile("[^A-Z]+[A-Z]{3}[a-z][A-Z]{3}[^A-Z]+")
print("".join(p.findall(text)))
This is what I got with it:
dqIQNlQSLidbzeOEKiVEYjxwaZADnMCZqewaebZUTkLYNgouCNDeHSBjgsgnkOIXdKBFhdXJVlGZVme
gZAGiLQZxjvCJAsACFlgfe
qKWGtIDCjn
I later searched for the solution, which had this regular expression:
p = re.compile("[^A-Z]+[A-Z]{3}([a-z])[A-Z]{3}[^A-Z]+")
So there is a bracket around [a-z], and I couldn't figure out what difference it makes. I would like some explanation on this.
Use Parentheses for Grouping and Capturing By placing part of a
regular expression inside round brackets or parentheses, you can group
that part of the regular expression together. This allows you to apply
a quantifier to the entire group or to restrict alternation to part of
the regex.
https://www.regular-expressions.info/brackets.html
Basically, the regex engine finds every substring matching the whole search pattern, and when the pattern contains a group, findall returns only the parts inside the ().
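The effect of the parentheses on findall can be seen on a short input. This is a sketch using a made-up string in place of the long puzzle text:

```python
import re

# Hypothetical short input standing in for the pastebin text.
text = "aXYZqRSTbQcABCxDEFd"

without_group = re.compile("[^A-Z]+[A-Z]{3}[a-z][A-Z]{3}[^A-Z]+")
with_group = re.compile("[^A-Z]+[A-Z]{3}([a-z])[A-Z]{3}[^A-Z]+")

# No group: findall returns each full match, including the surrounding letters.
print(without_group.findall(text))  # ['aXYZqRSTb', 'cABCxDEFd']

# One group: findall returns only what the group captured.
print(with_group.findall(text))     # ['q', 'x']
print("".join(with_group.findall(text)))  # qx
```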
I'm trying to create a function to capture phone numbers written in a canonical form, (XXX)XXX-XXXX or XXX-XXX-XXXX, with additional conditions. This is my approach:
import re

def parse_phone2(s):
    phone_number = re.compile(r'''^\s*\(?   # Beginning of string, ignore leading spaces
        ([0-9]{3})   # Area code
        \)?\s*|-?    # Match 0 or 1 ')' followed by 0 or more spaces, or match a single hyphen
        ([0-9]{3})   # Three digits
        -?           # Hyphen
        ([0-9]{4})   # Four digits
        \s*$         # End of string, ignore trailing spaces''', re.VERBOSE)
    try:
        return (phone_number.match(s).groups())
    except AttributeError as e:
        raise ValueError
I was failing the test case ' (404) 555-1212 ', but another question on SO suggested replacing \)?\s*|-? with (?:\)?\s*|-?), and that works. The problem is that I don't understand the difference between the two, nor the purpose of (?:...) beyond creating non-capturing groups. The docs aren't clear enough for me either.
https://docs.python.org/3/library/re.html
Consider a simpler example:
re.compile(r'(?:a|b)*')
which simply matches a (possibly empty) string of as and bs. The only difference between this and
re.compile(r'(a|b)*')
is that the matching engine will capture the last character matched (a repeated group keeps only its final iteration) for retrieval with the group method. When the captured text isn't needed, a non-capturing group avoids storing it, which may speed up the match or at least save some memory.
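A short check of what each version stores, using an illustrative input string. Note that a repeated capturing group keeps only the text from its last iteration:

```python
import re

m_capturing = re.match(r"(a|b)*", "aab")
m_plain = re.match(r"(?:a|b)*", "aab")

# Both versions match the same text overall.
print(m_capturing.group(0))  # aab
print(m_plain.group(0))      # aab

# Only the capturing version records a group: the last character matched.
print(m_capturing.groups())  # ('b',)
print(m_plain.groups())      # ()
```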
You have an alternation token (|) in the part you replaced. Alternation matches either everything before the token or everything after it. Splitting a regex across lines, as you've done here, does not create a group, so the | applies not just to what's before and after it on the same line, but to the lines before and after as well.
Grouping should instead be done by surrounding the group in parentheses, BUT by default this will also "capture" the group, meaning it will return the match as one of the groups when you call groups(). To specify that it should not, you need to add ?:.
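To make the scope of the | concrete, here is a sketch with both patterns from the question flattened onto one line (same tokens, comments and whitespace removed; the variable names are hypothetical):

```python
import re

# Ungrouped |: the pattern is "everything before |" OR "everything after |".
broken = re.compile(r"^\s*\(?([0-9]{3})\)?\s*|-?([0-9]{3})-?([0-9]{4})\s*$")

# Grouped with (?:...): the | only chooses between ")?\s*" and "-?".
fixed = re.compile(r"^\s*\(?([0-9]{3})(?:\)?\s*|-?)([0-9]{3})-?([0-9]{4})\s*$")

s = " (404) 555-1212 "

# The left alternative alone succeeds on " (404) ", so two groups stay empty.
print(broken.match(s).groups())  # ('404', None, None)

# With the alternation confined, all three parts are captured.
print(fixed.match(s).groups())   # ('404', '555', '1212')
print(fixed.match("404-555-1212").groups())  # ('404', '555', '1212')
```

Because the group is non-capturing, groups() still returns exactly three items.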
What does this regex mean? I know the functionality of re.sub but unable to figure out the 2nd part:
s = re.sub(r'\.([a-zA-Z])', r'. \1', s)
                            ^^^^^^^
Can someone explain me the underlined part?
Next time you should mention which programming language you are using, because regular expression syntax varies considerably from one language to another. Also, when using regular expressions to replace something, the second argument usually isn't a regular expression but a string with its own special syntax, so knowing the programming language helps with that, too.
\1 is a back reference to what the first capturing group (expression in parentheses) matched.
So \.([a-zA-Z]) matches a period followed by a letter, and that letter is captured (stored/saved/remembered) because it is surrounded by parentheses, then reused in place of \1. The period and the letter are then replaced with a period, a space, and that letter.
Examples:
.H becomes . H.
This.is.a.Test becomes This. is. a. Test
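The examples above can be reproduced with a few lines of Python. The wrapper function name is hypothetical and only there to make the call reusable:

```python
import re

def space_after_period(s):
    # \1 in the replacement re-inserts whatever group 1, ([a-zA-Z]), matched.
    return re.sub(r"\.([a-zA-Z])", r". \1", s)

print(space_after_period(".H"))              # . H
print(space_after_period("This.is.a.Test"))  # This. is. a. Test

# A period followed by a digit is left alone, since the group only accepts letters.
print(space_after_period("3.14"))            # 3.14
```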