re.findall gives different results than re.search with the same pattern - python

I have a str from which I want to get the substrings inside single quotes ('):
line = "This is a 'car' which has a 'person' in it!"
so I used:
name = re.findall("\'(.+?)\'", line)
print(name[0])
print(name[1])
car
person
But when I try this approach:
pattern = re.compile("\'(.+?)\'")
matches = re.search(pattern, line)
print(matches.group(0))
print(matches.group(1))
# print(matches.group(2)) # <- this produces an error of course
'car'
car
So my question is: why does the pattern behave differently in each case? I know that the former returns "all non-overlapping matches of pattern in string" and the latter returns a match object, which might explain some of the difference, but I would expect the same pattern to give the same results (even if in a different format).
So, to make it more concrete:
In the first case, with findall, the pattern returns all substrings, but in the latter case it only returns the first substring.
In the latter case, matches.group(0) (which corresponds to the whole match according to the documentation) is different from matches.group(1) (which corresponds to the first parenthesized subgroup). Why is that?
re.finditer("\'(.+?)\'", line) returns match objects so it functions like re.search.
I know that there are similar questions on SO, like this one or this one, but they don't seem to answer why (or at least I did not get it).

You already read the docs and other answers, so I will give you a hands-on explanation
Let's first take this example from here
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
'Isaac'
>>> m.group(2) # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2) # Multiple arguments give us a tuple.
('Isaac', 'Newton')
If you go to this website, you will see how the groups correspond to the results above.
group(0) is taking the full match, group(1) and group(2) are respectively Group 1 and Group 2 in the picture. Because as said here "Match.group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned)"
Now let's go back to your example
As others have said, re.search(pattern, line) finds ONLY the first occurrence of the pattern ("Scan through string looking for the first location where the regular expression pattern produces a match", as said here). Following the previous logic, you can now see why matches.group(0) outputs the full match and matches.group(1) outputs Group 1. It also explains why matches.group(2) gives you an error: the pattern has only one pair of capturing parentheses, so there is no Group 2 for the first occurrence in this last example.
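As a minimal sketch of this, reusing the line and pattern from the question (the variable names first, whole_match, group_one and group_two_error are only illustrative):

```python
import re

line = "This is a 'car' which has a 'person' in it!"
pattern = re.compile(r"'(.+?)'")

# re.search stops at the first match and returns a single match object.
first = pattern.search(line)
whole_match = first.group(0)  # the full matched text, quotes included
group_one = first.group(1)    # only what the parentheses captured

# The pattern has one capturing group, so asking for group 2 raises IndexError.
try:
    first.group(2)
    group_two_error = None
except IndexError as exc:
    group_two_error = exc

print(whole_match)  # 'car'
print(group_one)    # car
print(repr(group_two_error))
```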

re.findall returns a list of all matches (in this particular example, the first group of each match), while re.search returns only the first (leftmost) match.
As stated in python documentation (re.findall):
Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result.
matches.group(0) gives you the whole fragment of the string that matches your pattern, which is why it has quotes, while matches.group(1) gives you the first parenthesized substring of the matching fragment, which does not include the quotes because they are outside the parentheses. Check the Match.group() docs for more information.
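One way to see that both functions find the same matches and only report them differently is to walk the match objects with re.finditer. This is only a sketch; found and per_match are illustrative names:

```python
import re

line = "This is a 'car' which has a 'person' in it!"
pattern = re.compile(r"'(.+?)'")

# findall reports only group 1 of every match, because the pattern has one group.
found = pattern.findall(line)

# finditer yields one match object per match, so both views are available.
per_match = [(m.group(0), m.group(1)) for m in pattern.finditer(line)]

print(found)      # ['car', 'person']
print(per_match)  # [("'car'", 'car'), ("'person'", 'person')]
```

The first match object produced by finditer is the one re.search would have returned.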

Related

python regex - findall not returning output as expected

I am having trouble understanding findall, which says...
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Why doesn't this basic IP regex work with findall as expected? The matches are not overlapping, and regexpal confirms that pattern is highlighted in re_str.
Expected: ['1.2.2.3', '123.345.34.3']
Actual: ['2.', '34.']
re_str = r'(\d{1,3}\.){3}\d{1,3}'
line = 'blahblah -- 1.2.2.3 blah 123.345.34.3'
matches = re.findall(re_str, line)
print(matches) # ['2.', '34.']
When you use parentheses in your regex, re.findall() will return only the parenthesized groups, not the entire matched string. Put a ?: after the ( to tell it not to use the parentheses to extract a group, and then the results should be the entire matched string.
This is because capturing groups return only the last match if they're repeated.
Instead, you should make the repeating group non-capturing, and use a non-repeated capture at an outer layer:
re_str = r'((?:\d{1,3}\.){3}\d{1,3})'
Note that for findall, if there is no capturing group, the whole match is automatically selected (like \0), so you could drop the outer capture:
re_str = r'(?:\d{1,3}\.){3}\d{1,3}'
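The three variants can be compared side by side. A small sketch using the string from the question (the variable names are only illustrative):

```python
import re

line = 'blahblah -- 1.2.2.3 blah 123.345.34.3'

# Repeated capturing group: findall reports only the last repetition of the group.
repeated_capture = re.findall(r'(\d{1,3}\.){3}\d{1,3}', line)

# Non-capturing repetition wrapped in one outer capturing group.
outer_capture = re.findall(r'((?:\d{1,3}\.){3}\d{1,3})', line)

# No capturing group at all: findall falls back to the whole match.
no_capture = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', line)

print(repeated_capture)  # ['2.', '34.']
print(outer_capture)     # ['1.2.2.3', '123.345.34.3']
print(no_capture)        # ['1.2.2.3', '123.345.34.3']
```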

Understanding * (zero or more) operator using re.search() [duplicate]

I am new to python and was going through "Google for Education" python course
Now, the line below confuses me:
* -- 0 or more occurrences of the pattern to its left
(all the examples are in python3)
e.g. 1
In [1]: re.search(r"pi*", "piiig!!").group()
Out[1]: 'piii'
This is fine since "pi" has 1 occurrence, so it is returned
e.g. 2
In [2]: re.search(r"i*", "piiig!!").group()
Out[2]: ''
Why does it not return "i" in fact - from my understanding, it should be returning "iii". But the result is an empty string.
Also, What exactly does "0 or more" mean? I searched on google but everywhere it is mentioned * -- 0 or more. But if there is 0 occurrence of an expression, does that not become true even if it's not there? What is the point of searching then?
I am so confused with this. Can you please help me with explaining this or point me in the right direction.
I hope the right explanation will also resolve this issue of mine:
In [3]: re.search(r"i?", "piiig!!").group()
Out[3]: ''
I have tried the examples in Spyder 3.2.4
The explanation is a bit more complicated than the answers we have seen so far.
First, unlike re.match() the primitive operation re.search() checks for a match anywhere in the string (this is what Perl does by default) and finds the pattern once:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string. See: Ref.
If we follow every step of the regex engine while it tries to find a match, we can observe the following for the pattern i* and the test string piigg!!:
As you can see, the first character (at position 0) produces a match because p is zero times i and the result is an empty match (and not p - because we do not search for p or any other character).
At the second character (position 1) the second match (spanning to position 2) is found since ii is zero or more times i. At position 3 there is another empty match, and so on.
Because re.search only returns the first match it sticks with the first empty match at position 0. That's why you get the (confusing) result you have posted:
In [2]: re.search(r"i*", "piiig!!").group()
Out[2]: ''
In order to match every occurrence, you need re.findall():
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match. See: Ref.
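To see where each match of i* starts and ends, here is a small sketch using re.finditer on the question's string "piiig!!" (note that the walkthrough above uses a slightly different test string). The names spans and first are only illustrative:

```python
import re

text = "piiig!!"

# Record (start, end, matched text) for every match of i* in the string.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r"i*", text)]
for start, end, matched in spans:
    print(start, end, repr(matched))

# re.search only reports the first of these: the empty match at position 0.
first = re.search(r"i*", text)
print(first.span(), repr(first.group()))  # (0, 0) ''
```

The empty match at position 0 comes before the run of i characters, which is why re.search never gets as far as 'iii'.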
You need to use * (0 or more) and + (1 or more) properly to get your desired output.
E.g. 1 matches because you applied * only to "i": the pattern pi* matches a "p" followed by any number of "i" characters (including none).
E.g. 2: if you need to match only "i", you need to use "+" instead of "*".
If you use "*":
In: re.search(r"pi*g", "piiig!!").group()
This pattern matches when the input contains "pg", "pig", "piig", and so on.
If you use "+":
In: re.search(r"pi+g", "piiig!!").group()
This pattern matches when the input contains "pig", "piig", and so on, but not "pg".
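A quick check of that difference, as a sketch (candidates, star_hits and plus_hits are illustrative names):

```python
import re

candidates = ["pg", "pig", "piig"]

# * allows zero occurrences of "i", so "pg" is accepted as well.
star_hits = [s for s in candidates if re.search(r"pi*g", s)]

# + requires at least one "i", so "pg" is rejected.
plus_hits = [s for s in candidates if re.search(r"pi+g", s)]

print(star_hits)  # ['pg', 'pig', 'piig']
print(plus_hits)  # ['pig', 'piig']
```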
Because '' is the first matched result of r'i*' and 'iii' is the second matched result.
In [1]: import re
In [2]: re.findall(r'i*', 'piiig!!')
Out[2]: ['', 'iii', '', '', '', '']
This website will also explain the way how regular expression work.
https://regex101.com/r/XVPXMv/1
The special character * means 0 or more occurrences of the preceding character. For example, a* matches 0 or more occurrences of a, which could be '', 'a', 'aa', etc. The empty string matches because '' contains 0 occurrences of a.
To get iii you should use + instead of *, which gives you the first non-empty sequence of 'i', namely iii:
re.search("i+", "piiig!!").group()

The Behavior of Alternative Match "|" with .* in a Regex

I have seldom used | together with .* before. But today, when I used both of them together, I found some results really confusing. The expression I use is as follows (in python):
>>> s = "abcdefg"
>>> re.findall(r"(a.*?c)|(.*g)",s)
[('abc', ''), ('', 'defg')]
The result of the first capture is all right, but the second capture is not what I expected: I expected the second capture to be "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine only reads the string once - when the whole string is read in the first alternative, the regex engine will stop and no longer check the second alternative. However, in the first case, after dealing with the first alternative, the regex engine only reads from "a" to "c", and there are still "d" to "g" left in the string, which matches ".*?g" in the second alternative. Have I got it right? What's more, as for an expression with alternatives, the regex engine will check the first alternative first, and if it matches the string, it will never check the second alternative. Is it correct?
Besides, if I want to get both "abc" and "abcdefg" or "abc" and "bcde" (the two results overlap) like in the first case, what expression should I use?
Thank you so much!
You cannot have two matches starting from the same location in the regex (the only regex flavor that does it is Perl6).
In re.findall(r"(a.*?c)|(.*g)",s), re.findall grabs all non-overlapping matches in the string. Since the first match starts at the beginning and ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
See the regex demo
Python demo:
import re
rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
    print(m.group(0)) # => abcdefg
    print(m.group(1)) # => abc
It might not be what you exactly want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.
Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.
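The non-overlapping, left-to-right behaviour becomes visible if you print the span of each match. This is only a sketch: it uses re.finditer instead of re.findall so that positions are available, and first_order and second_order are illustrative names:

```python
import re

s = "abcdefg"

# First alternative wins at position 0 and consumes "abc";
# scanning resumes at position 3, where only the second alternative can match.
first_order = [(m.span(), m.group()) for m in re.finditer(r"(a.*?c)|(.*g)", s)]

# With the alternatives reversed, the first one consumes the whole string,
# so nothing is left for a second match.
second_order = [(m.span(), m.group()) for m in re.finditer(r"(.*?g)|(a.*?c)", s)]

print(first_order)   # [((0, 3), 'abc'), ((3, 7), 'defg')]
print(second_order)  # [((0, 7), 'abcdefg')]
```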

python group(0) meaning

What is the exact definition of group(0) in
re.search?
Sometimes the search can get complex and I would like to know what is the supposed group(0) value by definition?
Just to give an example of where the confusion comes, consider this matching. The printed result is only def. So in this case group(0) didn't return the entire match.
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'
match_object.group(0) returns the whole text matched by the pattern.
In addition, group(0) can be explained by comparing it with group(1), group(2), group(3), ..., group(n). group(0) is the whole match. Parentheses then mark further sub-matches: group(1) is the text matched inside the first pair of parentheses, group(2) the text matched inside the second pair, and so on. Groups are numbered by the position of their opening parenthesis, counted from left to right, and each opening parenthesis pairs with its matching closing parenthesis, so groups can be nested. This probably sounds confusing, which is why there is an example below.
But you need to treat the parentheses of '(?<=abc)' differently. These parentheses have a different syntactic meaning: they delimit what belongs to '?<=' and do not create a numbered group. So your main problem is that you don't know what '?<=' does. It is a so-called lookbehind: it asserts that the text immediately before the current position matches the enclosed expression, without making that text part of the match.
In the following example, 'abc' is the content of the lookbehind.
No parentheses are needed to form match group 0, since it is the whole match anyway.
The opening parenthesis in front of the letter 'd' pairs with the last closing parenthesis, the one in front of the letter 'f', to form match group 1.
The parentheses around the letter 'e' define match group 2.
import re
m = re.search('(?<=abc)(d(e))f', 'abcdef')
print(m.group(0))
print(m.group(1))
print(m.group(2))
This prints:
def
de
e
group(0) returns the full string matched by the regex. It's just that abc isn't part of the match. (?<=abc) doesn't match abc - it matches any position in the string immediately preceded by abc.
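Comparing the lookbehind with a version that actually consumes abc makes the difference concrete. A minimal sketch (lookbehind and consuming are illustrative names):

```python
import re

# Lookbehind: 'abc' must precede the match but is not part of it.
lookbehind = re.search(r"(?<=abc)def", "abcdef")

# Consuming version: 'abc' is matched too, and 'def' is captured as group 1.
consuming = re.search(r"abc(def)", "abcdef")

print(lookbehind.group(0), lookbehind.span())  # def (3, 6)
print(consuming.group(0), consuming.span())    # abcdef (0, 6)
print(consuming.group(1))                      # def
```

In both cases group(0) is exactly the text between span()[0] and span()[1]; the lookbehind match simply starts at position 3.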
supplementary:
run this:
import re
m = re.search('text', 'my text')
help(m.group)
print(m.group(0) == m.group())
# when in doubt, dir(m) helps too
output:
Help on built-in function group:
group(...) method of re.Match instance
group([group1, ...]) -> str or tuple.
Return subgroup(s) of the match by indices or names.
For 0 returns the entire match.
True

Python regex: greedy pattern returning multiple empty matches

This pattern is meant simply to grab everything in a string up until the first potential sentence boundary in the data:
[^\.?!\r\n]*
Output:
>>> pattern = re.compile(r"([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!") # Actual source snippet, not a personal comment about Australians. :-)
>>> print(matches)
['Australians go hard', '', '', '', '']
From the Python documentation:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Now, if the string is scanned left to right and the * operator is greedy, it makes perfect sense that the first match returned is the whole string up to the exclamation marks. However, after that portion has been consumed, I do not see how the pattern is producing an empty match exactly four times, presumably by scanning the string leftward after the "d". I do understand that the * operator means this pattern can match the empty string, I just don't see how it would be doing that more than once between the trailing "d" of the letters and the leading "!" of the punctuation.
Adding the ^ anchor has this effect:
>>> pattern = re.compile(r"^([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!")
>>> print(matches)
['Australians go hard']
Since this eliminates the empty string matches, it would seem to indicate that said empty matches were occurring before the leading "A" of the string. But that would seem to contradict the documentation with respect to the matches being returned in the order found (matches before the leading "A" should have been first) and, again, exactly four empty matches baffles me.
The * quantifier allows the pattern to capture a substring of length zero. In your original code version (without the ^ anchor in front), the additional matches are:
the zero-length string between the end of hard and the first !
the zero-length string between the first and second !
the zero-length string between the second and third !
the zero-length string between the third ! and the end of the text
You can slice/dice this further if you like here.
Adding that ^ anchor to the front now ensures that only a single substring can match the pattern, since the beginning of the input text occurs exactly once.
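The positions of those zero-length matches can be checked with re.finditer. This is only a sketch; spans and anchored are illustrative names, and the pattern is the one from the question:

```python
import re

text = "Australians go hard!!!"
pattern = re.compile(r"([^\.?!\r\n]*)")

# Each match is reported as (start, end); start == end means a zero-length match.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)  # [(0, 19), (19, 19), (20, 20), (21, 21), (22, 22)]

# With the ^ anchor, a match can only start at position 0.
anchored = re.findall(r"^([^\.?!\r\n]*)", text)
print(anchored)  # ['Australians go hard']
```

All four empty matches start at position 19 or later, so they come after "hard" and are returned in left-to-right order, as the documentation says.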
