Python re module groups match mechanism - python

Question Formation
background
While reading the re tutorial in the Python 2.7 documentation, I came across this description of how groups behave:
The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.
question
I understand how this works for a single group, but I can't understand the following example:
>>> m = re.match("([abc])+","abc")
>>> m.groups()
('c',)
I mean, doesn't + simply mean one or more? If so, shouldn't the regex ([abc])+ be equivalent to ([abc])([abc])+ (not formal BNF)? Then the result should be:
('a','b','c')
Please shed some light on the mechanism behind this, thanks.
P.S.
I want to learn how a regex interpreter works. Where should I start: with books, or with a particular regex implementation? Thanks!

Well, I guess a picture is worth a 1000 words:
link to the demo
What's happening, as you can see in the visual representation of the automaton, is that your regexp applies the group to a single character, one or more times, until it reaches the end of the match. Only that last character ends up in the group.
If you want the output you describe, you need something like the following:
([abc])([abc])([abc])
which will match and group one character at each position.
As for documentation, I advise you to first read up on the theory of NFAs and regular expressions. The MIT lecture notes on the topic are pretty nice:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-045j-automata-computability-and-complexity-spring-2011/lecture-notes/

Basically, the groups that are referred to in regex terminology are the capture groups as defined in your regex.
So for example, in '([abc])+', there's only a single capture group, namely, ([abc]), whereas in something like '([abc])([xyz])+' there are 2 groups.
So in your example, calling .groups() will always return a tuple of length 1 because that is how many groups exist in your regex.
The reason it isn't returning the results you'd expect is that you're using the repeat operator + outside of the group. This causes the group to hold only its last match, so only the last match (c) is retained. If, on the other hand, you had used '([abc]+)' (notice the + is inside the capture group), the results would have been:
('abc',)

One pair of grouping parentheses forms one group, even if it's inside a quantifier. If a group matches multiple times due to a quantifier, only the last match for that group is saved. The group doesn't become as many groups as it had matches.
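The points made in the answers above can be checked with a minimal sketch that uses only the standard re module. The variable names are just for illustration.

```python
import re

# Quantifier outside the group: still one group, and it keeps
# only the text from its last repetition.
m_outside = re.match(r"([abc])+", "abc")
print(m_outside.group(0))   # abc  (the whole match)
print(m_outside.groups())   # ('c',)

# Quantifier inside the group: the single group spans the whole run.
m_inside = re.match(r"([abc]+)", "abc")
print(m_inside.groups())    # ('abc',)

# To collect each character separately, ask for every match
# of the one-character pattern instead of repeating a group.
each_char = re.findall(r"[abc]", "abc")
print(each_char)            # ['a', 'b', 'c']
```

Using findall here avoids writing one pair of parentheses per expected character, which only works when the length of the input is known in advance.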

Related

pandas.str.findall() returning multiple instance of the same value but with reduce characters

I am having trouble with the pandas Series method str.findall(). I am trying to go through a report and retrieve all the electrical components. Each report is either a line or a paragraph, and it mentions components in a standardized way. I am using this code:
failedCoFromReason = rlist['report'].str.findall(r'([CULJRQF]([\dV]{2,4}))', flags=re.IGNORECASE)
It returns the components, but it also returns the number a second time, like this: [('r919', '919'), ('r920', '920')]
I would like it to return just [('r919'), ('r920')], but I am struggling to get it to work. I am pretty new to pandas and regex and confused about how to search. I have tried greedy and non-greedy searches, but they didn't work.
See the Series.str.findall reference:
Equivalent to applying re.findall() to all the elements in the Series/Index.
The re.findall reference says that "if one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group."
So, all you need to do is actually remove all capturing parentheses in this case, as all you need is to get the whole match:
rlist['report'].str.findall(r'[CULJRQF][\dV]{2,4}', flags=re.I)
In other cases, when you need to preserve the group (to quantify it, or to use alternatives), you need to change the capturing groups to non-capturing ones:
rlist['report'].str.findall(r'(?:[CULJRQF](?:[\dV]{2,4}))', flags=re.I)
Though, in this case, it is quite redundant.
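Since Series.str.findall applies re.findall to each element, the difference can be seen with plain re on a single string. This is only a sketch: the report text below is made up for illustration, not taken from the question's data.

```python
import re

# Hypothetical report line, invented for this example.
report = "r919 and R920 failed"

# Two capturing groups: findall returns one tuple per match.
with_groups = re.findall(r"([CULJRQF]([\dV]{2,4}))", report, flags=re.IGNORECASE)
print(with_groups)      # [('r919', '919'), ('R920', '920')]

# No capturing groups: findall returns the whole matches.
without_groups = re.findall(r"[CULJRQF][\dV]{2,4}", report, flags=re.IGNORECASE)
print(without_groups)   # ['r919', 'R920']

# Non-capturing groups do not count as groups for findall either.
non_capturing = re.findall(r"(?:[CULJRQF](?:[\dV]{2,4}))", report, flags=re.IGNORECASE)
print(non_capturing)    # ['r919', 'R920']
```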

Capturing groups with an or operator in Python

I have found odd behavior in Python 3.7.0 when capturing groups with an or operator when one branch initially matches but the regex has to eventually backtrack and use a different branch. In this scenario, the capture groups stick with the first branch even though the regex uses the second branch.
Example code:
regexString = "^(a)|(ab)$"
captureString = "ab"
match = re.match(regexString, captureString)
print(match.groups())
Output:
('a', None)
The second group is the group that is used, but the first group is captured and the second group isn't.
Interestingly, I have found a workaround by adding non-capturing parentheses around both groups like so:
regexString = "^(?:(a)|(ab))$"
New Output:
(None, 'ab')
To me this behavior looks like a bug. If it is not, can someone point me to some documentation explaining why this is occurring? Thank you!
This is a common regex mistake. Here is your original pattern:
^(a)|(ab)$
This actually says to match ^a, i.e. a at the start of the input or ab$, i.e. ab at the end of the input. If you instead want to match a or ab as the entire input, then as you figured out you need:
^(?:(a)|(ab))$
To further convince yourself of this behavior, you may verify that the following pattern matches the same things as your original pattern:
(ab)$|^(a)
That is, each term in the alternation is separate, and the order does not even matter, at least with regard to which inputs would match or not match. By the way, you could have just used the following pattern:
^ab?$
This would match a or ab, and also you would not even need a capture group, as the entire match would correspond to what you want.
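A short sketch makes the precedence visible. It shows that the original pattern never uses its second branch for the input ab: the whole match is just a. The variable names are illustrative.

```python
import re

text = "ab"

# | has the lowest precedence, so this reads as (^(a)) OR ((ab)$).
# The first branch succeeds at position 0 and the match stops there.
loose = re.match(r"^(a)|(ab)$", text)
print(loose.group(0))    # a
print(loose.groups())    # ('a', None)

# Wrapping the alternation makes both anchors apply to both branches.
grouped = re.match(r"^(?:(a)|(ab))$", text)
print(grouped.group(0))  # ab
print(grouped.groups())  # (None, 'ab')

# The simpler pattern needs no capture group at all.
simple = re.match(r"^ab?$", text)
print(simple.group(0))   # ab
```

So the capture groups do not "stick with the first branch". With the original pattern the engine never needs to backtrack, because the first branch alone already produces a successful match.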

Findall vs search for overwriting groups in Python

I found the topic "Capturing group with findall?", but unfortunately it is more basic and covers only groups that do not overwrite themselves.
Let's take a look at the following example:
S = "abcabc" # string used for all the cases below
1. Findall - no groups
print(re.findall(r"abc", S))  # ['abc', 'abc']
General idea: No groups here so I expect findall to return a list of all matches - please confirm.
In this case: Findall is looking for abc, finds it, returns it, then goes on and finds the second one.
2. Findall - one explicit group
print(re.findall(r"(abc)", S))  # ['abc', 'abc']
General idea: Some groups here so I expect findall to return a list of all groups - please confirm.
In this case: Why two results while there is only one group? I understand it this way:
findall is looking for abc,
finds it,
places it in the group memory buffer,
returns it,
findall starts to look for abc again, and so on...
Is this reasoning correct?
3. Findall - overwriting groups
print(re.findall(r"(abc)+", S))  # ['abc']
This looks similar to the above yet returns only one abc. I understand it this way:
findall is looking for abc,
finds it,
places it in the group memory buffer,
does not return it because the RE itself demands to go on,
finds another abc,
places it in the group memory buffer (overwrites previous abc),
string ends so searching ends as well.
Is this reasoning correct? I am very specific here so if there is anything wrong (even tiny detail) then please let me know.
4. Search - overwriting groups
Search scans through a string looking for a single match, so re.search(r"abc", S) and re.search(r"(abc)", S) rather obviously return only one abc. So let me get right to:
m = re.search(r"(abc)+", S)
print(m.group())  # abcabc
print(m.groups())  # ('abc',)
a) Of course the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite name) for m.group()? And that is why nothing gets overwritten for this method?
In fact, this grouping feature of parentheses is completely unnecessary here - in such cases I just want to use parentheses to stress what needs to be taken together when repeating things without creating any regex groups.
b) Can anyone explain the mechanism behind returning abcabc (in terms of buffers and so on), similar to what I did in bullet 3?
First, let me state some facts:
A match value (match.group()) is the (sub)text that meets the whole pattern defined in a regular expression. Matches can contain zero or more capture groups.
A capture value (match.group(1..n)) is a part of the match (that can also be equal to the whole match if the whole pattern is enclosed into a capture group) that is matched with a parenthesized pattern part (a part of the pattern enclosed into a pair of unescaped parentheses).
Some languages provide access to the capture collection, i.e. all the values that were captured by a quantified capture group like (\w{3})+. In Python, this is possible with the PyPI regex module; in .NET, with a CaptureCollection; and so on.
1: No groups here so I expect findall to return a list of all matches - please confirm.
True. re.findall returns a list of captured submatches only if capturing groups are defined in the pattern. In the case of abc, re.findall returns a list of whole matches.
2: Why two results while there is only one group?
There are two matches: re.findall(r"(abc)", S) finds two matches in abcabc, and each match has one submatch, or captured substring, so the resulting list has 2 elements (abc and abc).
3: Is this reasoning correct?
The re.findall(r"(abc)+", S) call looks for a match of the form abc, abcabc, abcabcabc, and so on. It matches the run as a whole and keeps the last abc in the capture group 1 buffer. So I think your reasoning is correct. "RE itself demands to go on" can be stated more precisely: the matching is not yet complete, because there are still characters for the regex engine to test.
4: the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite name) for m.group()?
No, the last group value is kept in this case. m.group() with no argument returns group 0, the whole match, while m.group(1) holds the last capture. If you change your regex to (\w{3})+ and the string to abcedf, you will see the difference: group 1 will be edf. As for "And that is why nothing gets overwritten for this method?": that is not right. Each earlier capture group value is overwritten by the one that follows.
5: Can anyone explain a mechanism behind returning abcabc (in terms of buffers and so on) similarly like I did in bullet 3?
The re.search(r"(abc)+", S) call matches abcabc (a match, not a capture) because:
The engine scans abcabc for abc from left to right. It finds abc at the start, puts it into the capture group 1 buffer, and tries to find another abc right after the first c.
It finds the second abc and overwrites the capture group 1 buffer with it. It then tries to find another abc.
No more abc is found, so it returns the matched value found so far: abcabc.
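The steps above can be observed with a small sketch using only the standard re module. The span of group 1 shows which repetition survived in the group buffer. The variable names are illustrative.

```python
import re

S = "abcabc"
m = re.search(r"(abc)+", S)

whole = m.group(0)       # the whole match
last = m.group(1)        # text of the last repetition only
last_span = m.span(1)    # position of that last repetition
print(whole, last, last_span)   # abcabc abc (3, 6)

# With different text in each repetition the overwrite is easier to see.
m2 = re.search(r"(\w{3})+", "abcedf")
print(m2.group(0))       # abcedf
print(m2.group(1))       # edf

# The standard re module keeps only the last capture, so to get every
# three-character chunk, match the chunk pattern repeatedly instead.
pieces = re.findall(r"\w{3}", "abcedf")
print(pieces)            # ['abc', 'edf']
```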

Python Regex instantly replace groups

Is there any way to directly replace all groups using regex syntax?
The normal way:
re.match(r"(?:aaa)(_bbb)", string1).group(1)
But I want to achieve something like this:
re.match(r"(\d.*?)\s(\d.*?)", "(CALL_GROUP_1) (CALL_GROUP_2)")
I want to build the new string instantaneously from the groups the Regex just captured.
Have a look at re.sub:
result = re.sub(r"(\d.*?)\s(\d.*?)", r"\1 \2", string1)
This is Python's regex substitution (replace) function. The replacement string can be filled with so-called backreferences (backslash, group number) which are replaced with what was matched by the groups. Groups are counted the same as by the group(...) function, i.e. starting from 1, from left to right, by opening parentheses.
The accepted answer is perfect. I would add that group reference is probably better achieved by using this syntax:
r"\g<1> \g<2>"
for the replacement string. This way, you work around syntax limitations where a group may be followed by a digit. Again, this is all present in the doc, nothing new, just sometimes difficult to spot at first sight.
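Here is a minimal sketch of both backreference styles. The input string and the simplified pattern (\d+)\s(\d+) are assumptions chosen for illustration. They are not the pattern from the question.

```python
import re

# Hypothetical input, invented for this example.
string1 = "12 34"

# Numeric backreferences: swap the two captured numbers.
swapped = re.sub(r"(\d+)\s(\d+)", r"\2 \1", string1)
print(swapped)     # 34 12

# \g<1> is needed when a literal digit follows the reference:
# r"\10" would be read as a reference to group 10, not group 1 plus "0".
with_zero = re.sub(r"(\d+)\s(\d+)", r"\g<1>0 \g<2>", string1)
print(with_zero)   # 120 34
```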

Grouping in Python Regular Expressions

So I'm playing around with regular expressions in Python. Here's what I've gotten so far (debugged through RegExr):
##(VAR|MVAR):([a-zA-Z0-9]+)+(?::([a-zA-Z0-9]+))*##
So what I'm trying to match is stuff like this:
##VAR:param1##
##VAR:param2:param3##
##VAR:param4:param5:param6:0##
Essentially, you have either VAR or MVAR followed by a colon then some param name, then followed by the end chars (##) or another : and a param.
So, what I've gotten for the groups on the regex is the VAR, the first param, and then the last thing in the parameter list (for the last example, the 3rd group would be 0). I understand that groups are created by (...), but is there any way for the regex to match the multiple groups, so that param5, param6, and 0 are in their own group, rather than only having a maximum of three groups?
I'd like to avoid having to match this string then having to split on :, as I think this is capable of being done with regex. Perhaps I'm approaching this the wrong way.
Essentially, I'm attempting to see if I can find and split in the matching process rather than a postprocess.
If this format is fixed, you don't need regex, it just makes it harder. Just use split:
text.strip('#').split(':')
should do it.
The number of groups in a regular expression is fixed. You will need to postprocess somehow.
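One way to combine both answers is to validate the tag with a regex and then split the parameter list afterwards. This is only a sketch: the pattern is a simplified variant of the one in the question, and the names TAG and parse_tag are hypothetical.

```python
import re

# Assumed pattern: group 1 is the kind, group 2 is the whole
# colon-separated parameter list captured as one string.
TAG = re.compile(r"##(VAR|MVAR):([a-zA-Z0-9]+(?::[a-zA-Z0-9]+)*)##")

def parse_tag(text):
    """Return (kind, [params]) for a well-formed tag, else None."""
    m = TAG.fullmatch(text)
    if m is None:
        return None
    kind, params = m.groups()
    # The group count is fixed, so the split happens after matching.
    return kind, params.split(":")

print(parse_tag("##VAR:param1##"))
# ('VAR', ['param1'])
print(parse_tag("##VAR:param4:param5:param6:0##"))
# ('VAR', ['param4', 'param5', 'param6', '0'])
print(parse_tag("##VAR##"))
# None

# The split-only approach from the answer above, without validation:
print("##VAR:param2:param3##".strip("#").split(":"))
# ['VAR', 'param2', 'param3']
```

The regex step only checks the shape of the tag. The split is still a postprocess, which matches the point that the number of groups cannot grow with the input.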
