python group(0) meaning - python

What is the exact definition of group(0) in
re.search?
Sometimes the search can get complex and I would like to know what is the supposed group(0) value by definition?
Just to give an example of where the confusion comes, consider this matching. The printed result is only def. So in this case group(0) didn't return the entire match.
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'

match_object.group(0) returns the whole match, that is, the entire substring that the pattern matched.
In addition, group(0) can be explained by comparing it with group(1), group(2), group(3), ..., group(n). group(0) is the whole match. Parentheses define further groups inside it: group(1) is the text matched by the first parenthesis pair, group(2) is the text matched by the second pair, and so on. Groups are numbered by the position of their opening parenthesis, counted from left to right, and each opening parenthesis pairs with its own matching closing parenthesis, so groups can be nested. This probably sounds confusing, which is why there is an example below.
But you need to distinguish these from the parentheses in '(?<=abc)'. Those have a different syntactic meaning: they delimit what '?<=' applies to and do not create a numbered group. So your main problem is that you don't know what '?<=' does. It is a so-called look-behind: it asserts that the current position is immediately preceded by the enclosed expression, without making that text part of the match.
In the following example 'abc' is the look-behind expression.
No parentheses are needed to form match group 0, since it is always the whole match.
The opening parenthesis in front of the letter 'd' pairs with the closing parenthesis in front of the letter 'f' to form group 1.
The parentheses around the letter 'e' define group 2.
import re
m = re.search('(?<=abc)(d(e))f', 'abcdef')
print(m.group(0))
print(m.group(1))
print(m.group(2))
This prints:
def
de
e

group(0) returns the full string matched by the regex. It's just that abc isn't part of the match. (?<=abc) doesn't match abc - it matches any position in the string immediately preceded by abc.
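A minimal sketch of that point, using only the standard re module: the span of the match starts after abc, and group(0) is exactly the slice of the input covered by that span. The variable names are illustrative and not from the question.

```python
import re

text = 'abcdef'
m = re.search('(?<=abc)def', text)

# The lookbehind checks what precedes the match but does not consume it,
# so the match itself starts at index 3, right after 'abc'.
print(m.span())      # (3, 6)
print(m.group(0))    # def

# group(0) is always the slice of the input covered by the match span.
whole = text[m.start():m.end()]
print(whole == m.group(0))   # True

# Without the lookbehind, 'abc' is consumed and becomes part of group(0).
m2 = re.search('abc(def)', text)
print(m2.group(0), m2.group(1))   # abcdef def
```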

supplementary:
run this:
import re
m = re.search('text', 'my text')
help(m.group)
print(m.group(0) == m.group())
# when in doubt, dir(m) helps too
output:
Help on built-in function group:
group(...) method of re.Match instance
group([group1, ...]) -> str or tuple.
Return subgroup(s) of the match by indices or names.
For 0 returns the entire match.
True

Related

python re regex matching in string with multiple () parenthesis

I have this string
cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"
What is a regex to first match exactly "IP(dev.maintserial):Y(dev.maintkeys)"
There might be a different path inside the parenthesis, like (name.dev.serial), so it is not like there will always be one dot there.
I thought of something like this:
re.search(r'(IP\(.*?\):Y\(.*?\))', cmd) but this will also match the single IP(k1) and Y(y1).
My usage will be:
If "IP(*):Y(*)" in cmd:
do substitution of IP(dev.maintserial):Y(dev.maintkeys) to Y(dev.maintkeys.IP(dev.maintserial))
How can I then do the above substitution? In the if condition I want to do this change in order: from IP(path_to_IP_key):Y(path_to_Y_key) to Y(path_to_Y_key.IP(path_to_IP_key)) , so IP is inside Y at the end after the dot.
This should work as it is more restrictive.
(IP\([^\)]+\):Y\(.*?\))
[^\)]+ means at least one character that isn't a closing parenthesis.
.*? in yours is too open-ended: it allows almost anything, including closing parentheses, until it reaches "):Y(".
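The difference can be checked directly on the string from the question. This is only a sketch; the names loose and strict are made up for the comparison.

```python
import re

cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"

# Lazy .*? may still run across ')' until '):Y(' finally appears,
# so the match starts at the first 'IP(' in the string.
loose = re.search(r"(IP\(.*?\):Y\(.*?\))", cmd)
print(loose.group(1))

# [^\)]+ cannot cross a closing parenthesis, so 'IP(k1)' is rejected
# and the search moves on to the adjacent IP(...):Y(...) pair.
strict = re.search(r"(IP\([^\)]+\):Y\(.*?\))", cmd)
print(strict.group(1))
```

The first print shows everything from IP(k1) to the end of the string, the second only IP(dev.maintserial):Y(dev.maintkeys).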
Something like this?
r"IP\(([^)]*\..+)\):Y\(([^)]*\..+)\)"
You can try it with your string. It matches the entire string IP(dev.maintserial):Y(dev.maintkeys) with groups dev.maintserial and dev.maintkeys.
The RE matches IP(, zero or more characters that are not a closing parenthesis ([^)]*), a period . (\.), one or more of any characters (.+), then ):Y(, ... (between the parentheses -- same as above), ).
Example Usage
import re
cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"
# compile regular expression
p = re.compile(r"IP\(([^)]*\..+)\):Y\(([^)]*\..+)\)")
s = p.search(cmd)
# if there is a match, s is not None
if s:
    print(f"{s[0]}\n{s[1]}\n{s[2]}")
    a = "Y(" + s[2] + ".IP(" + s[1] + "))"
    print(f"\n{a}")
Above p.search(cmd) "[s]can[s] through [cmd] looking for the first location where this regular expression [p] produces a match, and return[s] a corresponding match object" (docs). None is the return value if there is no match. If there is a match, s[0] gives the entire match, s[1] gives the first parenthesized subgroup, and s[2] gives the second parenthesized subgroup (docs).
Output
IP(dev.maintserial):Y(dev.maintkeys)
dev.maintserial
dev.maintkeys
Y(dev.maintkeys.IP(dev.maintserial))
You can use two negated character classes [^()]* to match any characters except parentheses, and omit the outer capture group if you only need the match.
To prevent a partial word match, you might start the pattern with a word boundary \b before IP:
\bIP\([^()]*\):Y\([^()]*\)
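For the substitution asked about in the question, one possible approach is re.sub with backreferences. The sketch below takes the pattern above and adds two capture groups, which are an assumption of this example and not part of the answer's pattern.

```python
import re

cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"

# Pattern from above, plus two capture groups (added for this sketch)
# so the replacement can refer to the two paths.
pair = re.compile(r"\bIP\(([^()]*)\):Y\(([^()]*)\)")

# \1 is the IP path, \2 is the Y path; the IP(...) part moves inside Y(...).
result = pair.sub(r"Y(\2.IP(\1))", cmd)
print(result)
# show run IP(k1) new Y(y1) add Y(dev.maintkeys.IP(dev.maintserial))
```

The standalone IP(k1) and Y(y1) are left untouched because they are not joined by ):Y(.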

Regular expression misses match at beginning of string

I have strings of as and bs. I want to extract all overlapping subsequences, where a subsequence is a single a surrounded by any number of bs. This is the regex I wrote:
import re
pattern = """(?= # inside lookahead for overlapping results
(?:a|^) # match at beginning of str or after a
(b* (?:a) b*) # one a between any number of bs
(?:a|$)) # at end of str or before next a
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
It seems to work as expected, except when the very first character in the string is an a, in which case this subsequence is missed:
a_between_bs.findall("bbabbba")
# ['bbabbb', 'bbba']
a_between_bs.findall("abbabb")
# ['bbabb']
I don't understand what is happening. If I change the order of how a potential match could start, the results also change:
pattern = """(?=
(?:^|a) # a and ^ swapped
(b* (?:a) b*)
(?:a|$))
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
a_between_bs.findall("abbabb")
# ['abb']
I would have expected this to be symmetric, so that strings ending in an a might also be missed, but this doesn't appear to be the case. What is going on?
Edit:
I assumed that solutions to the toy example above would translate to my full problem, but that doesn't seem to be the case, so I'm elaborating now (sorry about that). I am trying to extract "syllables" from transcribed words. A "syllable" is a vowel or a diphthong, preceded and followed by any number of consonants. This is my regular expression to extract them:
vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'
pattern = f"""(?=
(?:[{vowels}]|^|{diphtongues})
([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
(?:[{vowels}]|$|{diphtongues})
)
"""
syllables = re.compile(pattern, re.VERBOSE)
The tricky bit is that the diphthongs end in consonants (j or w), which I don't want to be included in the next syllable. So replacing the first non-capturing group by a double negative (?<![{consonants}]) doesn't work. I tried to instead replace that group by a positive lookbehind (?<=[{vowels}]|^|{diphtongues}), but re won't accept alternatives of different lengths (even removing the diphthongs doesn't work, apparently ^ is of a different length).
So this is the problematic case with the pattern above:
syllables.findall('æbə')
# ['bə']
# should be: ['æb', 'bə']
Edit 2:
I've switched to using the third-party regex module, which allows variable-width lookbehinds, and that solves the problem. To my surprise, it even appears to be faster than the re module in the standard library. I'd still like to know how to get this working with the re module, though. (:
I suggest fixing this with a double negation:
(?= # inside lookahead for overlapping results
(?<![^a]) # match at beginning of str or after a
(b*ab*) # one a between any number of bs
(?![^a]) # at end of str or before next a
)
See the regex demo
Note I replaced the grouping constructs with lookarounds: (?:a|^) with (?<![^a]) and (?:a|$) with (?![^a]). The latter is not really important, but the first is very important here.
The (?:a|^) at the beginning of the outer lookahead pattern matches a or start of the string, whatever comes first. If a is at the start, it is matched and when the input is abbabb, you get bbabb since it matches the capturing group pattern and there is an end of string position right after. The next iteration starts after the first a, and cannot find any match since the only a left in the string has no a after bs.
Note that order of alternative matters. If you change to (?:^|a), the match starts at the start of the string, b* matches empty string, ab* grabs the first abb in abbabb, and since there is a right after, you get abb as a match. There is no way to match anything after the first a.
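To see the effect on the two strings from the question, here is a small comparison of the original pattern and the lookaround version. It is a sketch using the standard re module only; the names original and fixed are illustrative.

```python
import re

# Pattern from the question: the leading (?:a|^) consumes the 'a'.
original = re.compile(r"(?=(?:a|^)(b*(?:a)b*)(?:a|$))")

# Lookaround version: (?<![^a]) and (?![^a]) only inspect the neighbours.
fixed = re.compile(r"""(?=
    (?<![^a])   # at start of string or right after an a
    (b*ab*)     # one a between any number of bs
    (?![^a])    # at end of string or right before an a
)""", re.VERBOSE)

for s in ("bbabbba", "abbabb"):
    print(s, original.findall(s), fixed.findall(s))
# bbabbba ['bbabbb', 'bbba'] ['bbabbb', 'bbba']
# abbabb ['bbabb'] ['abb', 'bbabb']
```

Because the lookbehind does not consume the leading a, that a stays available to the capturing group, which is why abb is no longer missed.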
Remember that alternation in a regex "short-circuits": the alternatives are tried from left to right, and once one of them leads to a match the engine does not go on to try the others. With (?:a|^) the leading "a" is tried first and consumed by the non-capturing group, so it is no longer available to the capturing group (b*(?:a)b*) that follows, and because (?:...) is non-capturing, that "a" does not appear in the result. When "^" is tried first, as in (?:^|a), it matches without consuming anything, so the first "a" can be matched by the capturing group.

re.findall gives different results than re.search with the same pattern

I have a str from which I want to get the substrings inside single quotes ('):
line = "This is a 'car' which has a 'person' in it!"
so I used:
name = re.findall("\'(.+?)\'", line)
print(name[0])
print(name[1])
car
person
But when I try this approach:
pattern = re.compile("\'(.+?)\'")
matches = re.search(pattern, line)
print(matches.group(0))
print(matches.group(1))
# print(matches.group(2)) # <- this produces an error of course
'car'
car
So, my question is why the pattern behaves differently in each case? I know that the former returns "all non-overlapping matches of pattern in string" and the latter match objects which might explain some difference but I would expect with the same pattern same results (even in different format).
So, to make it more concrete:
In the first case, with findall, the pattern returns all substrings, but in the latter case it only returns the first substring.
In the latter case matches.group(0) (which corresponds to the whole match according to the documentation) is different from matches.group(1) (which corresponds to the first parenthesized subgroup). Why is that?
re.finditer("\'(.+?)\'", line) returns match objects so it functions like re.search.
I know that there are similar questions on SO, like this one or this one, but they don't seem to answer why (or at least I did not get it).
You already read the docs and other answers, so I will give you a hands-on explanation
Let's first take this example from here
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
'Isaac'
>>> m.group(2) # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2) # Multiple arguments give us a tuple.
('Isaac', 'Newton')
If you go on this website you will find the correspondence with the previous detections
group(0) is taking the full match, group(1) and group(2) are respectively Group 1 and Group 2 in the picture. Because as said here "Match.group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned)"
Now let's go back to your example
As others have said, re.search(pattern, line) finds ONLY the first occurrence of the pattern ["Scan through string looking for the first location where the regular expression pattern produces a match", as said here]. Following the previous logic, you can now see why matches.group(0) outputs the full match and matches.group(1) outputs Group 1, and why matches.group(2) raises an error [as the screenshot shows, there is no group 2 for the first occurrence in this last example].
re.findall returns a list of matches (in this particular example, the first groups of the matches), while re.search returns only the first, leftmost match.
As stated in the Python documentation (re.findall):
Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result.
matches.group(0) gives you the whole fragment of the string that matches your pattern, which is why it has the quotes, while matches.group(1) gives you the first parenthesized part of the matching fragment, which does not include the quotes because they are outside the parentheses. Check the Match.group() docs for more information.
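The three functions can be put side by side on the string from the question. This is a small sketch with the standard re module; the names names, first and pairs are illustrative.

```python
import re

line = "This is a 'car' which has a 'person' in it!"
pattern = re.compile(r"'(.+?)'")

# findall: one capture group, so it returns only group 1 of every match.
names = pattern.findall(line)
print(names)          # ['car', 'person']

# search: a single match object for the first match only.
first = pattern.search(line)
print(first.group(0), first.group(1))   # 'car' car

# finditer: one match object per match, so both views are available.
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(line)]
print(pairs)          # [("'car'", 'car'), ("'person'", 'person')]
```

So the pattern behaves the same in every case; only the shape of what is returned differs.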

The Behavior of Alternative Match "|" with .* in a Regex

I have seldom used | together with .* before. But today, when I used both of them together, I found some results really confusing. The expression I used is as follows (in Python):
>>> s = "abcdefg"
>>> re.findall(r"((a.*?c)|(.*g))",s)
[('abc', 'abc', ''), ('defg', '', 'defg')]
The result of the first capture is all right, but the second capture is not what I expected: I expected the second capture to be "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine only reads the string once - when the whole string is read in the first alternative, the regex engine will stop and no longer check the second alternative. However, in the first case, after dealing with the first alternative, the regex engine only reads from "a" to "c", and there are still "d" to "g" left in the string, which matches ".*?g" in the second alternative. Have I got it right? What's more, as for an expression with alternatives, the regex engine will check the first alternative first, and if it matches the string, it will never check the second alternative. Is it correct?
Besides, if I want to get both "abc" and "abcdefg" or "abc" and "bcde" (the two results overlap) like in the first case, what expression should I use?
Thank you so much!
You cannot have two matches starting from the same location in the regex (the only regex flavor that does it is Perl6).
In re.findall(r"((a.*?c)|(.*g))",s), re.findall will grab all non-overlapping matches in the string, and since the first one starts at the beginning, ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
See the regex demo
Python demo:
import re
rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
    print(m.group(0)) # => abcdefg
    print(m.group(1)) # => abc
It might not be what you exactly want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.
Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.
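The non-overlapping, left-to-right behaviour becomes visible when the spans of the matches are printed. A short sketch with re.finditer, using the pattern from the question:

```python
import re

s = "abcdefg"
pattern = re.compile(r"((a.*?c)|(.*g))")

# Each match starts where the previous one ended, so the spans never overlap.
spans = [(m.span(), m.group(0)) for m in pattern.finditer(s)]
print(spans)    # [((0, 3), 'abc'), ((3, 7), 'defg')]
```

The second match can only begin at index 3, which is why it is defg and not abcdefg.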

negative lookbehind assertion in Python

I'm trying to find a match [abc], but not [[abc]] using Python regular expression.
I use negative lookbehind assertion (?<!) to filter out the [[abc]] as follows.
link = r"((?<!\[)\[([^<].+?) \s*([|] \s* (.+?) \s*)?])"
compLink = re.compile(link, re.X | re.U)
However, it doesn't work, because the first bracket in [[... satisfies the lookbehind; the pattern would also need to check that the character after the first bracket is not [.
>>> a = compLink.findall("[[abc|Hi]]")
>>> a
[('[[abc|Hi]', '[abc', '|Hi', 'Hi')]
How to solve this issue?
You can try this:
(?<!\[)\[([^][]+)]|\[([^][]+)](?!])
The content is in group 1 or 2
Note: re options are not needed here.
If you need only to extract the deepest level of square brackets, these patterns suffice:
\[([^][]+)] # for the whole substring (with a capturing group)
or
(?<=\[)[^][]+(?=]) # for the content only (i.e. the whole match)
Note that a closing square bracket in a character class doesn't need to be escaped if you put it at the first position.
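A quick check of both patterns, as a sketch on a made-up test string (the string and variable names are not from the question):

```python
import re

text = "[[abc]] and [def]"

# ']' placed first in the (negated) class is a literal, so [^][] means
# "any character except ']' or '['".
inner = re.findall(r"\[([^][]+)]", text)
print(inner)          # ['abc', 'def']

# The escaped spelling is equivalent.
escaped = re.findall(r"\[([^\]\[]+)\]", text)
print(escaped == inner)   # True

# Lookaround variant: the brackets are checked but not part of the match.
content = re.findall(r"(?<=\[)[^][]+(?=])", text)
print(content)        # ['abc', 'def']
```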
You can restrict the interior to “no brackets” and check for a matching double (this is easier expressed as the regular expression):
(?!\[\[[^\]]*\]\])(?:^|.)(\[[^\]]*\])(?:.|$)
(Take only the captured group)
I got the intended result by not allowing [ as the first character after the opening bracket, i.e. using [^[] in place of [^<]:
link = r"((?<!\[)\[([^[].+?) \s*([|] \s* (.+?) \s*)?])"
compLink = re.compile(link, re.X | re.U)
Just replace this part "((?<!\[)\[ with this "((?<!\[)\[(?!\[)
and leave the rest of the expression as is.
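Applied to the pattern from the question, that change could look like the sketch below. The single-bracket test string and the name comp_link are added here for illustration.

```python
import re

# Original pattern with the suggested (?!\[) added after the opening bracket.
link = r"((?<!\[)\[(?!\[)([^<].+?) \s*([|] \s* (.+?) \s*)?])"
comp_link = re.compile(link, re.X | re.U)

# Double brackets: index 0 fails the lookahead, index 1 fails the lookbehind.
double = comp_link.findall("[[abc|Hi]]")
print(double)    # []

# Single brackets still match, with the same group layout as before.
single = comp_link.findall("[abc|Hi]")
print(single)    # [('[abc|Hi]', 'abc', '|Hi', 'Hi')]
```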
