Purpose of (?: ... ) in python regex [duplicate] - python

I'm trying to create a function to capture phone numbers written in a canonical form, (XXX)XXX-XXXX or XXX-XXX-XXXX, with additional conditions. This is my approach:
import re

def parse_phone2(s):
    phone_number = re.compile(r'''^\s*\(?  # Beginning of string, ignore leading spaces
        ([0-9]{3})   # Area code
        \)?\s*|-?    # Match 0 or 1 ')' followed by 0 or more spaces, or match a single hyphen
        ([0-9]{3})   # Three digits
        -?           # Hyphen
        ([0-9]{4})   # Four digits
        \s*$         # End of string, ignore trailing spaces''', re.VERBOSE)
    try:
        return phone_number.match(s).groups()
    except AttributeError:
        raise ValueError
I was failing the test case ' (404) 555-1212 ', but another question on SO suggested replacing \)?\s*|-? with (?:\)?\s*|-?), and it works. The problem is that I don't understand the difference between the two, nor the purpose of (?:...) beyond creating non-capturing groups. The docs aren't clear enough for me either.
https://docs.python.org/3/library/re.html

Consider a simpler example:
re.compile(r'(?:a|b)*')
which simply matches a (possibly empty) string of as and bs. The only difference between this and
re.compile(r'(a|b)*')
is that the matching engine also captures the most recently matched character (the last repetition of the group) for retrieval with the group method. In a pattern like this one, where the parentheses are needed only for grouping, a non-capturing group is just an optimization that speeds up the match (or at least saves memory).
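A minimal sketch of that difference. The input 'aab' is made up for illustration (it is not from the question) and was chosen so that the first and last characters differ:

```python
import re

# 'aab' is an illustrative input, not taken from the question.
capturing = re.match(r'(a|b)*', 'aab')
non_capturing = re.match(r'(?:a|b)*', 'aab')

# Both patterns match the same text.
print(capturing.group(0))      # aab
print(non_capturing.group(0))  # aab

# Only the capturing version stores a group; a repeated group keeps
# the text of its last repetition.
print(capturing.groups())      # ('b',)
print(non_capturing.groups())  # ()
```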

The part you replaced contains an alternation token (|). Alternation matches either everything before the token or everything after it. Splitting a regex across lines, as you've done here, does not create groups, so the alternation applies not just to what is on the same line but to the lines before and after as well. Your pattern therefore means "leading spaces, optional '(', area code, optional ')' and spaces" OR "optional hyphen, three digits, optional hyphen, four digits, trailing spaces".
Grouping is instead done by surrounding the alternatives in parentheses, BUT by default this also "captures" the group, meaning its match is returned as one of the groups when you call groups(). To specify that it should not be captured, add ?: right after the opening parenthesis.
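The effect can be checked with the pattern from the question written on one line. This is only a sketch: the variable names are illustrative, and the middle pattern (a capturing group around the alternation) is added here for comparison.

```python
import re

# Original: the | splits the WHOLE pattern into two alternatives.
broken = re.compile(r'^\s*\(?([0-9]{3})\)?\s*|-?([0-9]{3})-?([0-9]{4})\s*$')
# Alternation limited by a capturing group: works, but adds a fourth group.
captured = re.compile(r'^\s*\(?([0-9]{3})(\)?\s*|-?)([0-9]{3})-?([0-9]{4})\s*$')
# Alternation limited by a non-capturing group: the suggested fix.
fixed = re.compile(r'^\s*\(?([0-9]{3})(?:\)?\s*|-?)([0-9]{3})-?([0-9]{4})\s*$')

s = ' (404) 555-1212 '
print(broken.match(s).groups())    # ('404', None, None)
print(captured.match(s).groups())  # ('404', ') ', '555', '1212')
print(fixed.match(s).groups())     # ('404', '555', '1212')
print(fixed.match('404-555-1212').groups())  # ('404', '555', '1212')
```

Note that the original pattern does match the test string; it just stops after the area code, which is why the other two groups come back as None.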

Related

Python Regex: how to capture alternative groups with OR operator [duplicate]

Suppose I have the following regex that matches a string with a semicolon at the end:
\".+\";
It will match any string except an empty one, like the one below:
"";
I tried using this:
\".+?\";
But that didn't work.
My question is: how can I make the .+ part of the pattern optional, so the user doesn't have to put any characters in the string?
To make the .+ optional, you could do:
\"(?:.+)?\";
(?:...) is called a non-capturing group. It only groups and matches; it doesn't capture anything. Adding ? after the non-capturing group makes the whole group optional.
Alternatively, you could do:
\".*?\";
.* would match any character zero or more times greedily. Adding ? after the * forces the regex engine to do a shortest possible match.
As an alternative:
\".*\";
Try it here: https://regex101.com/r/hbA01X/1
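A quick way to compare the patterns above, sketched with re.fullmatch. The sample input "abc"; is made up for illustration. The lazy .+? from the question fails on the empty string for the same reason .+ does: laziness changes how much is consumed, but + still requires at least one character.

```python
import re

patterns = [r'".+";', r'".+?";', r'"(?:.+)?";', r'".*?";', r'".*";']

results = {}
for pattern in patterns:
    # Each entry is (matches the empty string literal, matches a non-empty one).
    results[pattern] = (
        re.fullmatch(pattern, '"";') is not None,
        re.fullmatch(pattern, '"abc";') is not None,
    )
    print(pattern, results[pattern])
```

The first two patterns reject "";, while the last three accept both inputs.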

Matching a space between occurrences in Regex

I need assistance with matching spaces and subsequent matches in regex.
The example is as follows. I want to match all of the following scenarios:
60 ml ( 1)
60ML (2 )
60ml(2) (a)
the regex I have used is:
(60\s?(?:ml)\s?(?:\w|\(.{0,3}\)){0,5})
The regex matches the first 2 examples, but not the instances where there is a space between (2) and (a).
Any guidance would be appreciated.
Your regex doesn't allow for spaces between the parenthesised groups (2) and (a) in your last example. You can add <space>* after the closing parenthesis to allow them. Note that you cannot use \s* unless you are only matching a single value at a time; otherwise, because \s also matches a newline, the first match can go too far and run into the next line.
(60\s?ml\s?(?:\w|\(.{0,3}\) *){0,5})
Note that without anchors, counting repetitions doesn't really make sense. For example, this regex will match both 60ML (2 )(a)(a)(a)(a) and 60ML (2 )(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a), returning 60ML (2 )(a)(a)(a)(a) in both cases. If that is not what you want, you will need to add an anchor to the end of the regex ($ perhaps) to prevent it matching the longer string.
Demo on regex101
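A small sketch comparing the two patterns in Python. The re.IGNORECASE flag is an assumption: the question does not state its flags, but the sample 60ML can only match ml case-insensitively.

```python
import re

# re.IGNORECASE is assumed so that 'ML' matches 'ml'.
original = re.compile(r'(60\s?(?:ml)\s?(?:\w|\(.{0,3}\)){0,5})', re.IGNORECASE)
revised = re.compile(r'(60\s?ml\s?(?:\w|\(.{0,3}\) *){0,5})', re.IGNORECASE)

samples = ['60 ml ( 1)', '60ML (2 )', '60ml(2) (a)']
for sample in samples:
    # The original stops at the space before (a); the revised pattern continues.
    print(repr(original.search(sample).group(1)),
          repr(revised.search(sample).group(1)))

# The {0,5} cap limits the match, not the input: extra groups are left unmatched.
long_input = '60ML (2 )' + '(a)' * 12
capped = revised.search(long_input).group(1)
print(capped)  # 60ML (2 )(a)(a)(a)(a)
```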

Python regex to extract number of processors [duplicate]

I have a string which contains the number of processors:
SQLDB_GP_Gen5_2
The number is after _Gen and before _ (the number 5). How can I extract this using python and regular expressions?
I am trying to do it like this but don't get a match:
re.match('_Gen(.*?)_', 'SQLDB_GP_Gen5_2')
I was also trying this using pandas:
x['SLO'].extract(pat = '(?<=_Gen).*?(?:(?!_).)')
But this also wasn't working. (x is a Series)
Can someone please also point me to a book/tutorial site where I can learn regex and how to use it with pandas?
Thanks,
Mick
re.match only matches at the beginning of the string. Use re.search instead, and retrieve the first capturing group:
>>> re.search(r'_Gen(\d+)_', 'SQLDB_GP_Gen5_2').group(1)
'5'
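A short sketch of the difference, using the string from the question. The second input ('no generation here') is made up to show the no-match case:

```python
import re

name = 'SQLDB_GP_Gen5_2'

# re.match only tries position 0, and the string starts with 'SQLDB', not '_Gen'.
at_start = re.match(r'_Gen(\d+)_', name)
print(at_start)                # None

# re.search scans forward until the pattern matches somewhere in the string.
found = re.search(r'_Gen(\d+)_', name)
print(found.group(1))          # 5

# Checking for None avoids an AttributeError when nothing matches.
missing = re.search(r'_Gen(\d+)_', 'no generation here')
value = missing.group(1) if missing else None
print(value)                   # None
```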
You need to use Series.str.extract with a pattern containing a capturing group:
x['SLO'].str.extract(r'_Gen(.*?)_', expand=False)
        ^^^^          ^^^^^^^^^^^
To only match a number, use r'_Gen(\d+)_'.
NOTES:
With Series.str.extract, you need a capturing group: the method returns only what the group captures.
r'_Gen(.*?)_' will match _Gen, then will capture any 0+ chars other than line break chars as few as possible, and then match _. If you use \d+, it will only match 1+ digits.
Using re (text is the string to search):
re.findall(r'Gen(.*)_',text)[0]
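The greedy .* in that last pattern gives the right answer for the string in the question, but only because there is a single underscore after the number. A sketch comparing the three variants seen in this thread; the second input, with an extra underscore, is hypothetical:

```python
import re

original = 'SQLDB_GP_Gen5_2'
longer = 'SQLDB_GP_Gen5_2_x'   # hypothetical name with one more underscore

greedy = r'Gen(.*)_'     # as many characters as possible before an underscore
lazy = r'Gen(.*?)_'      # as few characters as possible before an underscore
digits = r'Gen(\d+)_'    # digits only

for text in (original, longer):
    print(text, [re.findall(p, text)[0] for p in (greedy, lazy, digits)])

greedy_on_longer = re.findall(greedy, longer)[0]
```

On the original string all three return '5'; on the longer one the greedy pattern returns '5_2' while the other two still return '5'.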

[FORKING]Python Regex - Re.Sub and Re.Findall Interesting Challenges

Not sure if this is something that should be a bounty. I just want to understand regex better.
I checked the responses in the threads "Regex to match pattern.one skip newlines and characters until pattern.two" and "Regex to match if given text is not found and match as little as possible", and read about Tempered Greedy Token Solutions and Explicit Greedy Alternation Solutions on RexEgg, but admittedly the explanations baffled me.
I spent the last day fiddling mainly with re.sub (and with findall) because re.sub's behaviour is odd to me.
Problem 1:
Given the strings below, made of characters followed by /, how would I write a SINGLE regex (using only re.sub or re.findall) that uses alternating capture groups, which must use [\S]+/, to get the desired output?
>>> string_1 = 'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
>>> string_2 = 'variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/'
>>> string_3 = 'variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives'
Desired Output Given the Conditions(!!)
tax-march-donald-trump-protest-
CONDITIONS: Must use alternating capture groups which must capture ([\S]+) or ([\S]+?)/ to capture the other groups but ignore them if they don't contain -
I'M WELL AWARE that it would be better to use re.findall('([\-]*(?:[^/]+?\-)+)[\d]+', string) or something similar but I want to know if I can use [\S]+ or ([\S]+) or ([\S]+?)/ and tell regex that if those are captured, ignore the result if it contains / or doesn't contain - While also having used an alternating capture group
I KNOW I don't need to use [\S]+ or ([\S]+) but I want to see if there is an extra directive I can use to make the regex reject some characters those two would normally capture.
Posted per request:
(?:(?!/)[\S])*-(?:(?!/)[\S])*
https://regex101.com/r/azrwjO/1
Explained
(?:                  # Optional group
     (?! / )              # Not a forward slash ahead
     [\S]                 # Not whitespace class
)*                   # End group, do 0 to many times
-                    # A dash must exist
(?:                  # Optional group, same as above
     (?! / )
     [\S]
)*
You could use
/([-a-z]+)-\d+
and take the first capturing group, see a demo on regex101.com.
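Both suggestions can be tried on the three strings from the question. This is only a sketch; the third pattern (with_hyphen) is a variation added here, not something either answer proposed, and it moves the hyphen inside the group to reproduce the trailing hyphen shown in the desired output.

```python
import re

string_1 = 'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
string_2 = 'variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/'
string_3 = 'variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives'

# Tempered pattern: any run of non-slash, non-space characters containing a dash.
tempered = re.compile(r'(?:(?!/)[\S])*-(?:(?!/)[\S])*')
# Capture-group pattern: letters and dashes after a slash, followed by -digits.
grouped = re.compile(r'/([-a-z]+)-\d+')
# Variation (an assumption, not from the answers): keep the trailing hyphen.
with_hyphen = re.compile(r'/([-a-z]+-)\d+')

for s in (string_1, string_2, string_3):
    print(tempered.findall(s))              # whole segment, digits included
    print(grouped.search(s).group(1))       # slug without the trailing hyphen
    print(with_hyphen.search(s).group(1))   # slug with the trailing hyphen
```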

python re ?: example [duplicate]

I saw the regular expression (?= (?:\d{5}|[A-Z]{2})) in a Python re example, and was very confused about the meaning of ?:.
I also read the Python docs, which explain:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
Can someone give me an example and explain why it works? Thanks!
Ordinarily, parentheses create a "capturing" group inside your regex:
import re
regex = re.compile("(set|let) var = (\\w+|\\d+)")
print(regex.match("set var = 12").groups())
results:
('set', '12')
Later you can retrieve those groups by calling the .groups() method on the result of a match. As you can see, whatever is inside parentheses is captured in "groups". But you might not care about all of those groups. Say you only want what's in the second group and not the first. You still need the first set of parentheses in order to group "set" and "let", but you can turn off capturing by putting "?:" at the beginning:
regex = re.compile("(?:set|let) var = (\\w+|\\d+)")
print(regex.match("set var = 12").groups())
results:
('12',)
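The expression from the question combines a lookahead with a non-capturing group. A sketch of why the ?: matters there, using re.findall, which returns group contents instead of whole matches as soon as the pattern contains a capturing group. The sample text and the \w+ prefix are made up for illustration:

```python
import re

text = 'Springfield IL 62704'  # hypothetical sample input

# Non-capturing group inside the lookahead: findall returns the whole matches.
whole = re.findall(r'\w+(?= (?:\d{5}|[A-Z]{2}))', text)
print(whole)     # ['Springfield', 'IL']

# Capturing group inside the lookahead: findall returns what the group
# captured (the text AFTER each word) rather than the words themselves.
captured = re.findall(r'\w+(?= (\d{5}|[A-Z]{2}))', text)
print(captured)  # ['IL', '62704']
```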
(The quoted passage refers to the regex Set(Value)?, which matches Set or SetValue.)
If you do not need the group to capture its match, you can optimize
this regular expression into Set(?:Value)?. The question mark and the
colon after the opening parenthesis are the syntax that creates a
non-capturing group. The question mark after the opening bracket is
unrelated to the question mark at the end of the regex. The final
question mark is the quantifier that makes the previous token
optional. This quantifier cannot appear after an opening parenthesis,
because there is nothing to be made optional at the start of a group.
Therefore, there is no ambiguity between the question mark as an
operator to make a token optional and the question mark as part of the
syntax for non-capturing groups, even though this may be confusing at
first. There are other kinds of groups that use the (? syntax in
combination with other characters than the colon that are explained
later in this tutorial.
color=(?:red|green|blue) is another regex with a non-capturing group.
This regex has no quantifiers.
From: http://www.regular-expressions.info/brackets.html
Also read: What is a non-capturing group? What does a question mark followed by a colon (?:) mean?
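The two regexes from the quoted tutorial can be tried in Python as follows. This is a sketch; the input strings are made up for illustration.

```python
import re

# Capturing version: the optional group appears in groups(), as None if unused.
with_value = re.match(r'Set(Value)?', 'SetValue').groups()
without_value = re.match(r'Set(Value)?', 'Set').groups()
print(with_value)      # ('Value',)
print(without_value)   # (None,)

# Non-capturing version: same text matched, nothing stored.
non_capturing = re.match(r'Set(?:Value)?', 'SetValue')
print(non_capturing.group(0), non_capturing.groups())  # SetValue ()

# Non-capturing group used purely to limit the alternation, with no quantifier.
color = re.search(r'color=(?:red|green|blue)', 'background color=green;')
print(color.group(0))  # color=green
```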
