Regular expression matches more than expected - python

Given the following Python script:
import re

text = '<?xml version="1.24" encoding="utf-8">'
mu = (".??[?]?[?]", "....")
for item in mu:
    # print each pattern next to the text it matched
    print("%s : %s" % (item, re.search(item, text).group()))
Can someone please explain why the first hit with the regex .??[?]?[?] returns <? instead of just ?.
My explanation:
.?? should match nothing, as .? may or may not match any character and the second ? makes it non-greedy.
[?]? can match ? or not, so matching nothing is fine, too.
[?] just matches ?
That should result in ? and not in <?

For the same reason o*?bar matches oobar in foobar. Even if the quantifier is non-greedy the regex will try to match from the first char in all possible ways, before moving on to the next.
First the .?? matches an empty string, but when the regex engine backtracks to it, it matches <, thus making the rest of the regex match, without moving the start position of the match to the next character.

Regex "greediness" only affects backtracking; it doesn't mean that the regex engine will skip earlier potential match points — a regex always takes the first possible match. In this case, that means <? because it starts farther to the left than ?.
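A minimal sketch of the leftmost-match rule described above, reusing the text and the two patterns from the question and the first answer. The variable names first and second are only for illustration.

```python
import re

text = '<?xml version="1.24" encoding="utf-8">'

# The engine tries every way to match at position 0 before moving to position 1,
# so the lazy .?? ends up consuming "<" once the empty attempt has failed.
first = re.search(r".??[?]?[?]", text)
print(first.group(), first.span())    # <? (0, 2)

# Same effect with a lazy star: the match starts at the first o, not at b.
second = re.search(r"o*?bar", "foobar")
print(second.group(), second.span())  # oobar (1, 6)
```

The spans show that laziness only decides how much is consumed from a given start position; it never moves the start position to the right.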

Related

Python regex or | is greedy

>>> import re
>>> p = re.compile('.*&l=(.*)(&|$)')
>>> p.search('foo&l=something here&bleh').group(1)
'something here&bleh' # want to remove strings after &
>>> p.search('foo&l=something here').group(1)
'something here' # this is OK
The Python documentation (2.7) says that the or operator '|' is never greedy, but my code has not been working as expected. I want the regex to stop searching when it reaches the next & instead of going through the entire string.
You need to change .* inside the first capturing group to [^&]*
p = re.compile('.*&l=([^&]*)')
Your regex p = re.compile('.*&l=(.*)(&|$)') also matches the extra characters because the .* inside the first capturing group is greedy and matches everything up to the end of the string. $ matches the boundary at the end of the string, so a match is found there.
Because .* followed by $ already finds a match, the engine never backtracks.
Your regex tries to match everything with (.*); when it reaches the end of the string, the $ alternative of (&|$) matches there, so the engine has no reason to backtrack to the earlier &. That's why you are getting that result.
Change your regex to
.*&l=(.*?)(&|$)
Adding the ? will make your regex lazy.
A simple example that demonstrates the issue:
Let's say you want to match everything until the first % character appears, and let's say you write the following regex:
.*%
Let's see how the engine works given the string "abc%def%g".
The engine first sees .* and tries to consume everything, so it matches the whole string. It then tries to match % and fails, so it backtracks by one character to g; still no match. It backtracks again and reaches %, which does match. So you'll get abc%def% as a result.
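The walkthrough above can be checked with a short sketch. It compares the greedy pattern with the lazy and negated-class alternatives suggested in the answers, on the same strings; the variable names are only illustrative.

```python
import re

s = "abc%def%g"
greedy = re.search(r".*%", s).group()      # consumes everything, then backtracks from the end
lazy = re.search(r".*?%", s).group()       # expands one character at a time
negated = re.search(r"[^%]*%", s).group()  # can never run past a %
print(greedy, lazy, negated)               # abc%def% abc% abc%

# The same two fixes applied to the query string from the question.
query = "foo&l=something here&bleh"
lazy_value = re.search(r".*&l=(.*?)(&|$)", query).group(1)
negated_value = re.search(r".*&l=([^&]*)", query).group(1)
print(lazy_value)                          # something here
print(negated_value)                       # something here
```

Both fixes give the same result here. The negated class is usually the more predictable choice, since it cannot cross the delimiter no matter what follows in the pattern.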

Two not greedy patterns in the same regex does not match shortest substring

I am trying to remove surrounding XML tags without using an XML library, just with regular expressions:
import re
s = "<tr></tr><tr><td>stuff</td></tr><tr></tr>"
print(re.sub(r'<tr>.*?stuff.*?</tr>', r'stuff_without_first_bounding_tr', s))
It prints :
stuff_without_first_bounding_tr<tr></tr>
I was expecting :
<tr></tr>stuff_without_first_bounding_tr<tr></tr>
I am using .*? twice; both should be non-greedy (the shortest solution should be taken).
Why is only the second one non-greedy?
What regex should I use?
You need to use a negative lookahead assertion.
>>> s="<tr></tr><tr><td>stuff</td></tr><tr></tr>"
>>> re.sub(r'<tr>(?:(?!</?tr>).)*stuff(?:(?!</?tr>).)*</tr>',r'stuff_without_first_bounding_tr',s)
'<tr></tr>stuff_without_first_bounding_tr<tr></tr>'
(?:(?!</?tr>).)* first checks that the text at the current position is not a < followed by an optional forward slash and tr>. If that check passes, the . matches the next character. Since * repeats the previous token zero or more times, the condition is checked before each character is matched. As soon as a position fails the condition, the repetition stops there, so the match can never run across a <tr> or </tr> tag.
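To see why the original pattern starts too early, here is a small sketch that runs both substitutions on the string from the question. The names lazy_result and tempered_result are only for illustration.

```python
import re

s = "<tr></tr><tr><td>stuff</td></tr><tr></tr>"
replacement = "stuff_without_first_bounding_tr"

# Lazy version: the match still starts at the first <tr>, because .*? is
# allowed to cross </tr><tr> in order to reach "stuff".
lazy_result = re.sub(r"<tr>.*?stuff.*?</tr>", replacement, s)

# Lookahead version: each . is only allowed where no <tr> or </tr> begins,
# so the match has to start at the <tr> that actually encloses "stuff".
tempered_result = re.sub(
    r"<tr>(?:(?!</?tr>).)*stuff(?:(?!</?tr>).)*</tr>", replacement, s
)

print(lazy_result)      # stuff_without_first_bounding_tr<tr></tr>
print(tempered_result)  # <tr></tr>stuff_without_first_bounding_tr<tr></tr>
```

This works for the flat rows in the example; for nested or irregular markup a real parser is the safer tool.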

Match a sentence

I wish to chop some text into sentences.
I wish to match all text up until a period followed by a space, a question mark followed by a space, or an exclamation mark followed by a space, in a non-greedy fashion.
Additionally, the punctuation might be found at the very end of the string or followed by a \r\n, for example.
This will almost do it:
([^\.\?\!]*)
But I'm missing the space in the expression. How do I fix this?
Example:
I' a.m not. So? Sure about this! Actually.
Should give:
I' a.m not
So
Sure about this
Actually
You can achieve such conditions by using positive lookahead assertions.
[^.?!]+(?=[.?!] )
See it here on Regexr.
When you look at the demo, the sentences at the end of a row with no following space are not matched. You can fix this by adding an alternation with the anchor $ and using the modifier m (which makes $ match at the end of each row):
[^.?!]+(?=[.?!](?: |$))
See it here on Regexr
Try this:
(.*?[!\.\?] )
.*? lazily matches any characters,
[] is any of these characters
then the () gives you a group to reference so you can get the match out.
Use a non-greedy match with a lookahead:
^.*?(?=[.!?]( |$))
Note how you don't have to escape those chars when they are in a character class [...].
This should do it:
^.*?(?=[!.?][\s])
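A rough sketch comparing two of the approaches above on the example sentence. The second pattern is my own combination of the lazy match with the "space or end" alternative; it consumes the terminator instead of looking ahead, and uses \s+ so that a trailing \r\n is also accepted. Treat it as one possible variant, not as the pattern from any single answer.

```python
import re

text = "I' a.m not. So? Sure about this! Actually."

# Lookahead version from above: [^.?!]+ cannot cross the dot in "a.m", and the
# space after each terminator ends up at the start of the next match.
lookahead = re.findall(r"[^.?!]+(?=[.?!](?: |$))", text, re.M)
print(lookahead)  # ['m not', ' So', ' Sure about this', ' Actually']

# Lazy body, then consume the terminator plus whitespace (or the end of the string).
sentences = re.findall(r"(.+?)[.?!](?:\s+|$)", text)
print(sentences)  # ["I' a.m not", 'So', 'Sure about this', 'Actually']
```

Consuming the terminator means the next search starts at the first character of the next sentence, which is why the leading spaces disappear.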

Python regex with *?

What does this Python regex match?
.*?[^\\]\n
I'm confused about why the . is followed by both * and ?.
* means "match the previous element as many times as possible (zero or more times)".
*? means "match the previous element as few times as possible (zero or more times)".
The other answers already address this, but what they don't bring up is how it changes the regex. If the re.DOTALL flag is provided it makes a huge difference, because . will then match line break characters. So .*[^\\]\n would match from the beginning of the string all the way to the last newline character that is not preceded by a backslash (so several lines would match).
If the re.DOTALL flag is not provided, the difference is more subtle: [^\\] will match everything other than a backslash, including line break characters. Consider the following example:
>>> import re
>>> s = "foo\n\nbar"
>>> re.findall(r'.*?[^\\]\n', s)
['foo\n']
>>> re.findall(r'.*[^\\]\n', s)
['foo\n\n']
So the purpose of this regex is to find non-empty lines that don't end with a backslash, but if you use .* instead of .*? you will match an extra \n if you have an empty line following a non-empty line.
This happens because .*? will only match fo, [^\\] will match the second o, and then the \n matches at the end of the first line. However, the .* will match foo, the [^\\] will match the \n that ends the first line, and the next \n will match because the second line is blank.
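The re.DOTALL case mentioned at the start of this answer can be sketched the same way. The three-line sample string is an assumption chosen for illustration (no blank lines, last line without a trailing newline).

```python
import re

s = "foo\nbar\nbaz"

# Without DOTALL, . stops at each newline, so lazy and greedy agree here.
lazy = re.findall(r".*?[^\\]\n", s)
greedy = re.findall(r".*[^\\]\n", s)

# With DOTALL, the greedy .* runs across lines up to the last qualifying newline.
greedy_dotall = re.findall(r".*[^\\]\n", s, re.DOTALL)

print(lazy)           # ['foo\n', 'bar\n']
print(greedy)         # ['foo\n', 'bar\n']
print(greedy_dotall)  # ['foo\nbar\n']
```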
. indicates a wild card. It can match anything except a \n, unless the appropriate flag is used.
* indicates that you can have 0 or more of the thing preceding it.
? indicates that the preceding quantifier is lazy. It will match as few repetitions as it can while still letting the rest of the pattern match.
Opening the Python re module documentation, and searching for *?, we find:
*?, +?, ??:
The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <H1>title</H1>, it will match the entire string, and not just <H1>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only <H1>.
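The example from the documentation can be reproduced directly; this is just the quoted case typed out, with illustrative variable names.

```python
import re

html = "<H1>title</H1>"

greedy = re.match(r"<.*>", html).group()  # runs to the last > in the string
lazy = re.match(r"<.*?>", html).group()   # stops at the first >

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```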

Could you explain why this regex is not working?

>>> import re
>>> d = "Batman,Superman"
>>> m = re.search(r"(?<!Bat)\w+", d)
>>> m.group(0)
'Batman'
Why isn't group(0) matching Superman? This lookaround tutorial says:
(?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind
Batman isn't directly preceded by Bat, so that matches first. In fact, neither is Superman; there's a comma in-between in your string which will do just fine to allow that RE to match, but that's not matched anyway because it's possible to match earlier in the string.
Maybe this will explain it better: if the string was Batman and you started trying to match from the m, the RE would not match until the next character (giving a match of an), because the position just before the m is the only place in the string that is preceded by Bat.
At a simple level, the regex engine starts from the left of the string and moves progressively towards the right, trying to match your pattern (think of it like a cursor moving through the string). In the case of a lookaround, at each stop of the cursor, the lookaround is asserted, and if true, the engine continues trying to make a match. As soon as the engine can match your pattern, it'll return a match.
At position 0 of your string (i.e. prior to the B in Batman), the assertion succeeds, as Bat is not present before the current position. Thus, \w+ can match the entire word Batman (remember, quantifiers are greedy by default, i.e. they match as much as possible).
See this page for more information on engine internals.
To achieve what you wanted, you could instead use something like:
\b(?!Bat)\w+
In this pattern, the engine will match a word boundary (\b) [1], followed by one or more word characters, with the assertion that the word characters do not start with Bat. A lookahead is used rather than a lookbehind because a lookbehind here would have the same problem as your original pattern: it would look before the position directly following the word boundary, and since it's already been determined that the position before the cursor is a word boundary, the negative lookbehind would always succeed.
[1] Note that word boundaries match between \w and \W (i.e. between [A-Za-z0-9_] and any other character); \b also matches at the start or end of the string when the adjacent character is a word character. If your boundaries need to be more complex, you'll need a different way of anchoring your pattern.
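As a quick check of the two patterns discussed in this answer, the following sketch lists every match each one finds in the string from the question. The variable names are only for illustration.

```python
import re

d = "Batman,Superman"

# Negative lookbehind: nothing precedes position 0, so Batman matches first.
lookbehind = re.findall(r"(?<!Bat)\w+", d)

# Word boundary plus negative lookahead: words that start with Bat are skipped.
lookahead = re.findall(r"\b(?!Bat)\w+", d)

print(lookbehind)  # ['Batman', 'Superman']
print(lookahead)   # ['Superman']
```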
From the manual:
Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
http://docs.python.org/library/re.html#regular-expression-syntax
You're looking for the first set of one or more alphanumeric characters (\w+) that is not preceded by 'Bat'. Batman is the first such match. (Note that negative lookbehind assertions can match the start of a string.)
To do what you want, you have to constrain the regex to match 'man' specifically; otherwise, as others have pointed out, \w greedily matches anything including 'Batman'. As in:
>>> re.search(r"\w+(?<!Bat)man", "Batman,Superman").group(0)
'Superman'
