Findall vs search for overwriting groups in Python

I found topic Capturing group with findall? but unfortunately it is more basic and covers only groups that do not overwrite themselves.
Please let's take a look at the following example:
S = "abcabc" # string used for all the cases below
1. Findall - no groups
print re.findall(r"abc", S) # ['abc', 'abc']
General idea: No groups here so I expect findall to return a list of all matches - please confirm.
In this case: Findall is looking for abc, finds it, returns it, then goes on and finds the second one.
2. Findall - one explicit group
print re.findall(r"(abc)", S) # ['abc', 'abc']
General idea: Some groups here so I expect findall to return a list of all groups - please confirm.
In this case: Why two results while there is only one group? I understand it this way:
findall is looking for abc,
finds it,
places it in the group memory buffer,
returns it,
findall starts to look for abc again, and so on...
Is this reasoning correct?
3. Findall - overwriting groups
print re.findall(r"(abc)+", S) # ['abc']
This looks similar to the above yet returns only one abc. I understand it this way:
findall is looking for abc,
finds it,
places it in the group memory buffer,
does not return it because the RE itself demands to go on,
finds another abc,
places it in the group memory buffer (overwrites previous abc),
string ends so searching ends as well.
Is this reasoning correct? I am very specific here so if there is anything wrong (even tiny detail) then please let me know.
4. Search - overwriting groups
Search scans through a string looking for a single match, so re.search(r"abc", S) and re.search(r"(abc)", S) rather obviously return only one abc, so let me get right to:
m = re.search(r"(abc)+", S)
print m.group() # abcabc
print m.groups() # ('abc',)
a) Of course the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite name) for m.group()? And that is why nothing gets overwritten for this method?
In fact, this grouping feature of parentheses is completely unnecessary here - in such cases I just want to use parentheses to stress what needs to be taken together when repeating things without creating any regex groups.
b) Can anyone explain a mechanism behind returning abcabc (in terms of buffers and so on) similarly like I did in bullet 3?

At first, let me state some facts:
A match value (match.group()) is the (sub)text that meets the whole pattern defined in a regular expression. Matches can contain zero or more capture groups.
A capture value (match.group(1..n)) is a part of the match (that can also be equal to the whole match if the whole pattern is enclosed into a capture group) that is matched with a parenthesized pattern part (a part of the pattern enclosed into a pair of unescaped parentheses).
Some languages can provide access to the capture collection, i.e. all the values that were captured with a quantified capture group like (\w{3})+. In Python, it is possible with the PyPI regex module; in .NET, with a CaptureCollection, etc.
1: No groups here so I expect findall to return a list of all matches - please confirm.
True. Only if capturing groups are defined in the pattern does re.findall return a list of captured submatches. In the case of abc, re.findall returns a list of matches.
2: Why two results while there is only one group?
There are two matches: re.findall(r"(abc)", S) finds two matches in abcabc, and each match has one submatch, or captured substring, so the resulting list has 2 elements (abc and abc).
3: Is this reasoning correct?
The re.findall(r"(abc)+", S) is looking for a match in the form abc, abcabc, abcabcabc and so on. It will match it as a whole and will keep only the last abc in the capture group 1 buffer. So, I think your reasoning is correct. "RE itself demands to go on" can be stated more precisely: the matching is not yet complete, because the greedy + still has characters left to test for another abc.
4: the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite name) for m.group()?
No, the last group value is kept in this case. If you change your regex to (\w{3})+ and the string to abcedf, you will see the difference: the whole match is still abcedf, but group 1 will be edf. "And that is why nothing gets overwritten for this method?" That is wrong: each captured value is overwritten by the following one. m.group() without arguments simply returns the whole match (group 0), not a capture group.
5: Can anyone explain a mechanism behind returning abcabc (in terms of buffers and so on) similarly like I did in bullet 3?
The re.search(r"(abc)+", S) will match abcabc (match, not capture) because
abcabc is searched for abc from left to right. RE finds abc at the start and tries to find another abc right from the location after the first c. RE puts the abc into Capture group buffer 1.
RE finds the 2nd abc, rewrites the capture group #1 buffer with it. Tries to find another abc.
No more abc is found - return the matched value found : abcabc.
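The four cases from the question can be checked side by side. This is a minimal sketch in Python 3 syntax; the non-capturing variant (?:abc)+ is an addition not used in the question, included because it repeats the subpattern without creating a group at all.

```python
import re

S = "abcabc"

# findall returns whole matches when the pattern has no capture group,
# and the content of group 1 when it has exactly one.
no_group = re.findall(r"abc", S)             # two matches
one_group = re.findall(r"(abc)", S)          # group 1 of each of the two matches
repeated = re.findall(r"(abc)+", S)          # one match; group 1 holds the last repetition
non_capturing = re.findall(r"(?:abc)+", S)   # no group, so the whole match is returned

m = re.search(r"(abc)+", S)
whole, last = m.group(0), m.group(1)

m2 = re.search(r"(\w{3})+", "abcedf")

print(no_group)                   # ['abc', 'abc']
print(one_group)                  # ['abc', 'abc']
print(repeated)                   # ['abc']
print(non_capturing)              # ['abcabc']
print(whole, last)                # abcabc abc
print(m2.group(0), m2.group(1))   # abcedf edf
```

With (?:...) the parentheses only delimit what the + applies to, which is what point 4a of the question asks for.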

Related

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the : as mentioned above?
I apologize if this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your issue stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matched 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside: it's likely that you are mixing up re.match and re.search. match will only return a match that begins at the start of the string, while search will return the first match anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
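A small sketch of that difference, using Python's re module, where the "anywhere in the string" function is called re.search. The sample text is made up for illustration.

```python
import re

text = "prefix V-1 suffix"        # hypothetical input
pattern = re.compile(r"V-1")

at_start = pattern.match(text)    # only tries at position 0
anywhere = pattern.search(text)   # scans forward for the first match

print(at_start)                   # None
print(anywhere.group(), anywhere.span())   # V-1 (7, 10)
```

With search there is no need for a leading .* just to reach text deeper in the string.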
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
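The four sample strings above can be checked directly. A minimal sketch with Python's re module:

```python
import re

lookbehind = re.compile(r"(?<=test)foo")
candidates = ["foo", "test-foo", "test foo", "testfoo"]

# True where the pattern is found somewhere in the string.
results = {s: bool(lookbehind.search(s)) for s in candidates}

# The lookbehind is an assertion: "test" is required but not part of the match.
matched_text = lookbehind.search("testfoo").group()

print(results)
# {'foo': False, 'test-foo': False, 'test foo': False, 'testfoo': True}
print(matched_text)   # foo
```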
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of the input (or of each line, when re.MULTILINE is set). If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to search.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matching several lines of text as one match, hiding multiple records in a single .*.
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, follow this link.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
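One way the suggested pattern could feed the comparison is sketched below. It assumes each file has already been read into a list of lines; the helper name trailing_values and the two sample lists are hypothetical, and only the value after the colon is compared, so the digit after EMBO-20- is ignored.

```python
import re

# The pattern from above: fixed prefix, one digit, then the captured value.
PATTERN = re.compile(r"^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$")

def trailing_values(lines):
    """Collect the value after the colon for every line that fits the pattern."""
    values = set()
    for line in lines:
        m = PATTERN.match(line.strip())
        if m:
            values.add(m.group(1))
    return values

# Hypothetical contents of the two files.
file_a = [r"V-1\ZDS\R\EMBO-20-1:24", r"V-1\ZDS\R\EMBO-20-3:93"]
file_b = [r"V-1\ZDS\R\EMBO-20-6:24", r"V-1\ZDS\R\EMBO-20-3:28"]

# Symmetric difference: values that appear in exactly one of the two files.
only_in_one = trailing_values(file_a) ^ trailing_values(file_b)
print(sorted(only_in_one))   # ['28', '93']
```

The two :24 entries no longer show up as differences even though the digit after EMBO-20- differs.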

Capturing groups with an or operator in Python

I have found odd behavior in Python 3.7.0 when capturing groups with an or operator when one branch initially matches but the regex has to eventually backtrack and use a different branch. In this scenario, the capture groups stick with the first branch even though the regex uses the second branch.
Example code:
import re
regexString = "^(a)|(ab)$"
captureString = "ab"
match = re.match(regexString, captureString)
print(match.groups())
Output:
('a', None)
The second group is the group that is used, but the first group is captured and the second group isn't.
Interestingly, I have found a workaround by adding non-capturing parentheses around both groups like so:
regexString = "^(?:(a)|(ab))$"
New Output:
(None, 'ab')
To me this behavior looks like a bug. If it is not, can someone point me to some documentation explaining why this is occurring? Thank you!
This is a common regex mistake. Here is your original pattern:
^(a)|(ab)$
This actually says to match ^a, i.e. a at the start of the input or ab$, i.e. ab at the end of the input. If you instead want to match a or ab as the entire input, then as you figured out you need:
^(?:(a)|(ab))$
To further convince yourself of this behavior, you may verify that the following pattern matches the same things as your original pattern:
(ab)$|^(a)
That is, each term in the alternation is separate, and the position does not even matter, at least with regard to which inputs would match or not match. By the way, you could have just used the following pattern:
^ab?$
This would match a or ab, and also you would not even need a capture group, as the entire match would correspond to what you want.
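The precedence of | can be confirmed with a few calls. This is a minimal sketch; the input "xab" is an extra case added here to show the second branch of the original pattern matching on its own.

```python
import re

original = re.compile(r"^(a)|(ab)$")      # means: (^a) OR (ab$)
grouped = re.compile(r"^(?:(a)|(ab))$")   # means: ^ (a OR ab) $
simple = re.compile(r"^ab?$")

first = original.match("ab")
print(first.group(), first.groups())      # a ('a', None)

# The second branch is only tried where the first one fails.
print(original.search("xab").groups())    # (None, 'ab')

print(grouped.match("ab").groups())       # (None, 'ab')

print(simple.match("a").group(), simple.match("ab").group())   # a ab
print(simple.match("abb"))                # None
```

In the first call the overall match is just a, not ab: the first branch succeeds at the start of the string, so no backtracking into the second branch ever happens.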

Why is re.findall matching the string, but not returning the results correctly?

I want to find a substring of the pattern ([A-Z][0-9]+)+ in another string.
One way to do this would be:
import re
re.findall("([A-Z][0-9]+)+", "asdf A0B52X4 asdf")[0]
Curiously, this yields 'X4', not 'A0B52X4', which was the result I expected.
Digging a bit into this, I also tried to just match the simple groups the string is composed of:
re.findall("[A-Z][0-9]+", "asdf A0B52X4 asdf")
Which yields the expected result: ['A0', 'B52', 'X4']
And even more interesting:
re.findall("([A-Z][0-9]+){3,}", "asdf A0B52X4 asdf")
Which yields ['X4'], but still seems to match the whole string I'm interested in, which is confirmed by trying re.search and using the result to obtain the substring manually:
m = re.search("([A-Z][0-9]+)+", "asdf A0B52X4 asdf")
m.string[m.start():m.end()]
This yields 'A0B52X4'.
Now from what I know about regular expressions in python, parentheses not only just match the RE inside them, but also declare a "group" which lets you do all sorts of things with it. My theory would be that for some reason, re.findall only puts the last match of a group into the result string as opposed to the complete match.
Why does re.findall behave like this?
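It's because your capturing group only matches one instance of the pattern at a time. The + just means to match all of them that occur in a row. Each repetition overwrites the previous capture, so the group ends up holding only the last part of the match (X4), and re.findall returns the group's content rather than the whole match.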
Wrap your regex in an outer group, like this:
((?:[A-Z][0-9]+)+)
Demo
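A quick comparison of the variants, as a sketch. The fully non-capturing form and the re.finditer form are alternatives added here that avoid relying on a group at all.

```python
import re

text = "asdf A0B52X4 asdf"

# One capture group under a quantifier: findall reports its last repetition.
last_only = re.findall(r"([A-Z][0-9]+)+", text)

# Outer capture group around a non-capturing repeated group.
whole = re.findall(r"((?:[A-Z][0-9]+)+)", text)

# No capture groups at all: findall falls back to the whole match.
no_groups = re.findall(r"(?:[A-Z][0-9]+)+", text)

# finditer yields match objects, so group(0) is always available.
via_finditer = [m.group(0) for m in re.finditer(r"([A-Z][0-9]+)+", text)]

print(last_only)      # ['X4']
print(whole)          # ['A0B52X4']
print(no_groups)      # ['A0B52X4']
print(via_finditer)   # ['A0B52X4']
```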

Python re module groups match mechanism

Question Formation
background
As I am reading through the tutorial in the Python 2.7 re docs, it introduces the behavior of groups:
The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.
question
I clearly understand how this works for a single group, but I can't understand the following example:
>>> m = re.match("([abc])+","abc")
>>> m.groups()
('c',)
I mean, doesn't + simply mean one or more? If so, shouldn't the regex ([abc])+ be equivalent to ([abc])([abc])+ (not formal BNF)? Thus, the result should be:
('a','b','c')
Please shed some light about the mechanism behind, thanks.
P.S
I want to learn the regex language interpreter, how should I start with? books or regex version, thanks!
Well, I guess a picture is worth a 1000 words:
link to the demo
What's happening is that, as you can see in the visual representation of the automaton, your regexp applies the group to one character at a time, one or more times, until it reaches the end of the match. Only that last character stays in the group.
If you want to get the output you say, you need to do something like the following:
([abc])([abc])([abc])
which will match and group one character at each position.
As for documentation, I advise you to first read the theory of NFAs and regexps. The MIT course notes on the topic are pretty nice:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-045j-automata-computability-and-complexity-spring-2011/lecture-notes/
Basically, the groups that are referred to in regex terminology are the capture groups as defined in your regex.
So for example, in '([abc])+', there's only a single capture group, namely, ([abc]), whereas in something like '([abc])([xyz])+' there are 2 groups.
So in your example, calling .groups() will always return a tuple of length 1 because that is how many groups exist in your regex.
The reason why it isn't returning the results you'd expect is because you're using the repeat operator + outside of the group. This ends up causing the group to equal only the last match, and thus only the last match (c) is retained. If, on the other hand, you had used '([abc]+)' (notice the + is inside the capture group), the results would have been:
('abc',)
One pair of grouping parentheses forms one group, even if it's inside a quantifier. If a group matches multiple times due to a quantifier, only the last match for that group is saved. The group doesn't become as many groups as it had matches.
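The variants discussed in these answers can be compared directly. A minimal sketch; the re.findall call at the end is an addition for the case where one item per character is wanted without knowing the length in advance.

```python
import re

quantified_group = re.match(r"([abc])+", "abc").groups()
quantifier_inside = re.match(r"([abc]+)", "abc").groups()
three_groups = re.match(r"([abc])([abc])([abc])", "abc").groups()
each_char = re.findall(r"[abc]", "abc")

print(quantified_group)    # ('c',)
print(quantifier_inside)   # ('abc',)
print(three_groups)        # ('a', 'b', 'c')
print(each_char)           # ['a', 'b', 'c']
```

The length of the tuple from groups() always equals the number of capture groups written in the pattern, regardless of how many times a quantifier repeats them.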

Why is the minimal (non-greedy) match affected by the end of string character '$'?

EDIT: remove original example because it provoked ancillary answers. also fixed the title.
The question is why the presence of the "$" in the regular expression affects the greediness of the expression:
Here is a simpler example:
>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'
The "?" seems to be doing nothing. Note that when the "$" is removed, however, the "?" is respected:
>>> m = re.search(r"a+?", str)
>>> m.group()
'a'
EDIT:
In other words, "a+?$" is matching ALL of the a's instead of just the last one, this is not what I expected. Here is the description of the regex "+?" from the python docs:
"Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."
This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?
Matches are "ordered" by "left-most, then longest"; however "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa" because matching at the first A starts earlier in the string.
(Answer changed after OP clarification in comments. See history for previous text.)
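The non-greedy modifier only affects where the match stops, never where it starts. If you want to start the match as late as possible, you will have to put a greedy .* at the beginning of your pattern and capture the part you want, as in .*(a+?)$. A leading non-greedy .+? would not help, because the a+? would still be stretched from the first a to the end of the string.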
Without the $, your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.
EDIT:
More details... In this case:
re.search(r"a+?$", "baaaaaaaa")
the regex engine will ignore everything up until the first 'a', because that's how re.search works. It will match the first a, and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the $. So it just keeps eating the a's one at a time and checking for $. If it were greedy, it wouldn't check for the $ after each a, but only after it couldn't match any more a's.
But in this case:
re.search(r"a+?", "baaaaaaaa")
the regex engine will check if it has a complete match after eating the first match (because it's non-greedy) and succeed because there is no $ in this case.
The presence of the $ in the regular expression does not affect the greediness of the expression. It merely adds another condition which must be met for the overall match to succeed.
Both a+ and a+? are required to consume the first a they find. If that a is followed by more a's, a+ goes ahead and consumes them too, while a+? is content with just the one. If there were anything more to the regex, a+ would be willing to settle for fewer a's, and a+? would consume more, if that's what it took to achieve a match.
With a+$ and a+?$, you've added another condition: match at least one a followed by the end of the string. a+ still consumes all of the a's initially, then it hands off to the anchor ($). That succeeds on the first try, so a+ is not required to give back any of its a's.
On the other hand, a+? initially consumes just the one a before handing off to $. That fails, so control is returned to a+?, which consumes another a and hands off again. And so it goes, until a+? consumes the last a and $ finally succeeds. So yes, a+?$ does match the same number of a's as a+$, but it does so reluctantly, not greedily.
As for the leftmost-longest rule that was mentioned elsewhere, that never did apply to Perl-derived regex flavors like Python's. Even without reluctant quantifiers, they could always return a less-than-maximal match thanks to ordered alternation. I think Jan's got the right idea: Perl-derived (or regex-directed) flavors should be called eager, not greedy.
I believe the leftmost-longest rule only applies to POSIX NFA regexes, which use NFA engines under the hood, but are required to return the same results a DFA (text-directed) regex would.
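The spans of the matches make the behavior described above visible: the start position is always the leftmost a, and only the end position depends on greediness and on what follows the quantifier. A small sketch; the last two patterns are additions showing ways to reach only the final a.

```python
import re

s = "baaaaaaaa"   # one b followed by eight a's

greedy_anchored = re.search(r"a+$", s).span()
lazy_anchored = re.search(r"a+?$", s).span()
lazy_free = re.search(r"a+?", s).span()

# To get only the last a, the pattern itself has to say so.
last_a = re.search(r"a$", s).span()
# A greedy prefix pushes the start of the captured part as far right as possible.
last_a_group = re.search(r".*(a+?)$", s).span(1)

print(greedy_anchored)   # (1, 9)
print(lazy_anchored)     # (1, 9)
print(lazy_free)         # (1, 2)
print(last_a)            # (8, 9)
print(last_a_group)      # (8, 9)
```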
Answer to original question:
Why does the first search() span multiple "/"s rather than taking the shortest match?
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding. In your example, the last subpattern is $, so the previous ones need to stretch out to the end of the string.
Answer to revised question:
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding.
Another way of looking at it: A non-greedy subpattern will initially match the shortest possible match. However if this causes the whole pattern to fail, it will be retried with an extra character. This process continues until the subpattern fails (causing the whole pattern to fail) or the whole pattern matches.
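Since the original example was removed from the question, the sketch below uses a hypothetical path to illustrate the same effect: a lazy subpattern in front of $ is stretched until the whole pattern succeeds, while a character class that excludes "/" restricts the match to the last component.

```python
import re

path = "/we/shant/see/this/butshouldseethis"   # hypothetical example path

# Lazy, but the match starts at the leftmost "/" and must reach the end.
lazy = re.search(r"/(.+?)$", path).group(1)

# Excluding "/" from the captured part forces the match to start at the last "/".
last_component = re.search(r"/([^/]+)$", path).group(1)

print(lazy)             # we/shant/see/this/butshouldseethis
print(last_component)   # butshouldseethis
```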
There are two issues going on, here. You used group() without specifying a group, and I can tell you are getting confused between the behavior of regular expressions with an explicitly parenthesized group and without a parenthesized group. This behavior without parentheses that you are observing is just a shortcut that Python provides, and you need to read the documentation on group() to understand it fully.
>>> import re
>>> string = "baaa"
>>>
>>> # Here you're searching for one or more `a`s until the end of the line.
>>> pattern = re.search(r"a+$", string)
>>> pattern.group()
'aaa'
>>>
>>> # This means the same thing as above, since the presence of the `$`
>>> # cancels out any meaning that the `?` might have.
>>> pattern = re.search(r"a+?$", string)
>>> pattern.group()
'aaa'
>>>
>>> # Here you remove the `$`, so it matches the least amount of `a` it can.
>>> pattern = re.search(r"a+?", string)
>>> pattern.group()
'a'
Bottom line is that the string a+? matches one a, period. However, a+?$ matches a's until the end of the line. Note that without explicit grouping, you'll have a hard time getting the ? to mean anything at all, ever. In general, it's better to be explicit about what you're grouping with parentheses, anyway. Let me give you an example with explicit groups.
>>> # This is close to the example pattern with `a+?$` and therefore `a+$`.
>>> # It matches `a`s until the end of the line. Again the `?` can't do anything.
>>> pattern = re.search(r"(a+?)$", string)
>>> pattern.group(1)
'aaa'
>>>
>>> # In order to get the `?` to work, you need something else in your pattern
>>> # and outside your group that can be matched that will allow the selection
>>> # of `a`s to be lazy. In this case, the `.*` is greedy and will gobble up
>>> # everything that the lazy `a+?` doesn't want to.
>>> pattern = re.search(r"(a+?).*$", string)
>>> pattern.group(1)
'a'
Edit: Removed text related to old versions of the question.
Unless your question isn't including some important information, you don't need, and shouldn't use, regex for this task.
>>> import os
>>> p = "/we/shant/see/this/butshouldseethis"
>>> os.path.basename(p)
'butshouldseethis'
