Behaviour of Python non-greedy regular expression - python

I'm using python version 3.4.1 and I don't understand the result of the following regular
expression:
import re
print(re.match("\[{E=(.*?),Q=(.*?)}\]","[{E=KT,Q=P1.p01},{E=KT2,Q=P2.p02}]").groups())
('KT', 'P1.p01},{E=KT2,Q=P2.p02')
I would expect the result to be
('KT', 'P1.p01')
but apparently the second .*? 'eats' all characters until '}]' at the end.
I would expect it to stop at the first '}' character.
If I leave out the '[' and ']' characters the behavior is as I expect:
print(re.match("{E=(.*?),Q=(.*?)}","{E=KT,Q=P1.p01},{E=KT2,Q=P2.p02}").groups())
('KT', 'P1.p01')

The \] forces a square bracket to be present in the match, and there is only one, at the end of the string. The regex engine has no other way to match. If you remove it or make it optional (\]?), the match stops at the closest }.
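A minimal sketch of that difference, using the string from the question. The braces are escaped here only for readability, and the variable names are illustrative:

```python
import re

s = "[{E=KT,Q=P1.p01},{E=KT2,Q=P2.p02}]"

# Mandatory \]: the only ']' is at the very end, so the second lazy group
# has to stretch up to the final '}' before the match can succeed.
required = re.match(r"\[\{E=(.*?),Q=(.*?)\}\]", s).groups()
print(required)  # ('KT', 'P1.p01},{E=KT2,Q=P2.p02')

# Optional \]?: the match can succeed right after the first '}'.
optional = re.match(r"\[\{E=(.*?),Q=(.*?)\}\]?", s).groups()
print(optional)  # ('KT', 'P1.p01')
```

Lazy quantifiers only prefer the shortest expansion; they still grow as far as needed for the whole pattern to match.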

What you seem to want is everything between '{E=' and the next comma ',', then everything between 'Q=' and the next closing brace '}'. One expression to do this would be:
{E=([^,]*),Q=([^}]*)}
Here e.g. [^,]* means "as many non-comma characters as possible".
Example usage:
>>> import re
>>> re.findall("{E=([^,]*),Q=([^}]*)}",
"{E=KT,Q=P1.p01},{E=KT2,Q=P2.p02}")
[('KT', 'P1.p01'), ('KT2', 'P2.p02')]
You can see the full explanation in this regex101 demo.

Related

Regex to match and clean quotes in python

I have a bunch of quotes scraped from Goodreads stored in a bs4.element.ResultSet, with each element of type bs4.element.Tag. I'm trying to use regex with the re module in python 3.6.3 to clean the quotes and get just the text. When I iterate and print using [print(q.text) for q in quotes] some quotes look like this
“Don't cry because it's over, smile because it happened.”
―
while others look like this:
“If you want to know what a man's like, take a good look at how he
treats his inferiors, not his equals.”
―
,
Each also has some extra blank lines at the end. My thought was I could iterate through quotes and call re.match on each quote as follows:
cleaned_quotes = []
for q in quotes:
    match = re.match(r'“[A-Z].+$”', str(q))
    cleaned_quotes.append(match.group())
I'm guessing my regex pattern didn't match anything because I'm getting the following error:
AttributeError: 'NoneType' object has no attribute 'group'
Not surprisingly, printing the list gives me a list of None objects. Any ideas on what I might be doing wrong?
Since you asked for this for learning purposes, here's the regex answer:
(?<=“)[\s\S]+?(?=”)
Explanation:
We use a positive lookbehind and a positive lookahead to mark the beginning and end of the pattern, which also keeps the quotation marks out of the result.
Inside the quotes we lazily match any characters, including newlines, with [\s\S]+?
Online Demo
Sample Code:
import re
regex = r"(?<=“)[\s\S]+?(?=”)"
cleaned_quotes = []
for q in quotes:
    # re.search scans the whole string; re.match would only try position 0
    m = re.search(regex, str(q))
    if m:
        cleaned_quotes.append(m.group())
Arguably, we do not need any regex flags here. The g (global) flag shown in the online demo has no direct counterpart in Python; use re.findall() or re.finditer() for multiple matches. The m (multiline) flag, re.MULTILINE in Python, processes matches line by line (in such a scenario it may be necessary to use [\s\S] instead of the dot to get results that span lines).
The multiline flag also changes the behavior of the anchors ^ and $ so that they match at the start and end of each line instead of only at the start and end of the string. Therefore, adding these anchors in the middle of a pattern is just wrong.
One more thing: I use re.search() since re.match() matches only from the beginning of the string. A common gotcha. See the documentation.
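A small sketch of that gotcha. The sample text below is a hypothetical stand-in for str(q), with leading whitespace before the opening quotation mark:

```python
import re

# Hypothetical stand-in for one scraped quote
text = "\n“Don't cry because it's over, smile because it happened.”\n  ―\n"
pattern = r"(?<=“)[\s\S]+?(?=”)"

# re.match only tries position 0, where the lookbehind cannot succeed
at_start = re.match(pattern, text)
print(at_start)  # None

# re.search scans forward until the pattern matches somewhere
found = re.search(pattern, text)
print(found.group())  # Don't cry because it's over, smile because it happened.
```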
First of all, in your expression r'“[A-Z].+$”' the end-of-line anchor $ comes before the closing quote, which cannot match.
To use $ at line ends in multiline strings, you should also specify the re.MULTILINE flag.
Second, re.match only matches at the beginning of the string; it does not look for a part of the string that matches the regular expression.
That means re.search should do what you initially expected to accomplish.
So the resulting regex could be:
re.search(r'“[A-Z].+”$', str(q), re.MULTILINE)
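To illustrate the effect of the flag on $, here is a sketch with a hypothetical one-line quote followed by more lines. Note that .+ does not cross newlines, so this form only suits quotes that fit on a single line:

```python
import re

# Hypothetical sample: the quote is followed by further lines
text = "“Hello there.”\n―\n"
pattern = r"“[A-Z].+”$"

# Without the flag, $ only matches at the very end of the string
# (or just before a final newline), so nothing is found.
without_flag = re.search(pattern, text)
print(without_flag)  # None

# With re.MULTILINE, $ also matches at the end of each line.
with_flag = re.search(pattern, text, re.MULTILINE)
print(with_flag.group())  # “Hello there.”
```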

Regex to match parenthesis and its contents if it does not start with an underscore

I have this regex:
\([^\(]*?\)
It matches parentheses in a string together with the contents inside them. I would like it to match only if there is no _ before the opening parenthesis.
For example I would like it to match (text) in this example:
This is some random (text)
But I do not want it to match anything in this example:
This is another_(text)
How would I go about this?
You can use negative lookbehind for that:
(?<!_)\([^\(]*\)
# ^ negative lookbehind
As is demonstrated in this regex101
As @SebastianProske says, there is no reason to make [^\(]* non-greedy, since the class is meant to stop before the closing parenthesis. So I made it greedy.
Add negative lookbehind: (?<!_) checking just what you said (no "_" before).
One more remark: the content between the two parentheses should be any sequence of characters other than a closing parenthesis.
So the whole regex should be:
(?<!_)\([^\)]*\)
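A quick check of that pattern against the two sample sentences from the question (a sketch; the variable names are illustrative):

```python
import re

pattern = re.compile(r"(?<!_)\([^)]*\)")

# No underscore before '(' so the group is matched
plain = pattern.findall("This is some random (text)")
print(plain)  # ['(text)']

# '(' is preceded by '_' so the lookbehind rejects it
after_underscore = pattern.findall("This is another_(text)")
print(after_underscore)  # []
```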

regex in python, remove pattern '[.../...]' from the string in python

I have an input string for e.g:
input_str = 'this is a test for [blah] and [blah/blahhhh]'
and I want to retain [blah] but want to remove [blah/blahhhh] from the above string.
I tried the following codes:
>>> re.sub(r'\[.*?\]', '', input_str)
'this is a test for and '
and
>>> re.sub(r'\[.*?\/.*?\]', '', input_str)
'this is a test for '
what should be the right regex pattern to get the output as "this is a test for [blah] and"?
At first I didn't see why your 2nd regex wouldn't work, but I tested it and you are correct: it doesn't. You can use the same idea with a different approach.
Instead of using the wildcards you can use the \w like this:
\[\w+\/\w+\]
Working demo
By the way, if the parts separated by / can contain non-word characters, then you can use this regex:
\[[^\]]*\/[^\]]*]
Working demo
The reason the second regex in the original post matches more than the OP wants is that . matches any character, including ]. So \[.*?\/ (or just \[.*?/ since the \ before the / is superfluous) will match more than the OP seems to have wanted: [blah] and [blah/ in input_str.
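A short sketch that prints what the lazy pattern actually consumes, using input_str from the question:

```python
import re

input_str = 'this is a test for [blah] and [blah/blahhhh]'

# The match starts at the first '[' and the lazy .*? keeps growing,
# across the first ']', until a '/' is finally found.
lazy = re.search(r'\[.*?/.*?\]', input_str).group()
print(lazy)  # [blah] and [blah/blahhhh]
```

The leftmost starting position wins, and laziness only shortens the match from that fixed start.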
The ? adds confusion. It will limit repetition of the .* part of .*\] sub-expression, but you have to understand what repetition you're limiting [1]. It's better to explicitly match any non-closing bracket instead of the . wildcard to begin with. So-called "greedy" matching of .* is often a stumbling block since it will match zero or more occurrences of any character until that wildcard match fails (usually much longer than people expect). In your case it greedily matches as much of the input as possible until the last occurrence of the next explicitly specified part of the regex (] or / in your regexes). Instead of using ? to try to counteract or limit greedy matching with lazy matching, it is often better to be explicit about what to not match in the greedy part.
As an illustration, see the following example of .* grabbing everything until the last occurrence of the character after .*:
echo '////k////,/k' | sed -r 's|/.*/|XXX|'
XXXk
echo '////k////,/k' | sed -r 's|/(.*)?/|XXX|'
XXXk
And subtleties of greedy / lazy matching behavior can vary from one regex implementation to the next (pcre, python, grep/egrep). For portability and simplicity / clarity, be explicit when you can.
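For comparison, a sketch of the same input in Python's re module, which does support lazy quantifiers. count=1 mimics sed replacing only the first match:

```python
import re

s = '////k////,/k'

# Greedy: from the first '/' up to the last '/'
greedy = re.sub(r'/.*/', 'XXX', s, count=1)
print(greedy)  # XXXk

# Lazy: from the first '/' up to the nearest following '/'
lazy = re.sub(r'/.*?/', 'XXX', s, count=1)
print(lazy)  # XXX//k////,/k
```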
If you only want to look for strings with brackets that don't include a closing bracket character before the slash character, you could more explicitly look for "not-a-closing-bracket" instead of the wildcard match:
re.sub(r'\[[^]]*/[^]]*\]', '', input_str)
'this is a test for [blah] and '
This uses a character class expression - [^]] - instead of the wildcard . to match any character that is explicitly not a closing bracket.
If it's "legal" in your input stream to have one or more closing brackets within enclosing brackets (before the slash), then things get more complicated since you have to determine if it's just a stray bracket character or the start of a nested sub-expression. That's starting to sound more like the job of a token parser.
Depending on what you are trying to really achieve (I assume this is just a dummy example of something that is probably more complex) and what is allowed in the input, you may need something more than my simple modification above. But it works for your example anyway.
[1] http://www.regular-expressions.info/repeat.html
You can write a function that takes input_str as an argument and loops through the string; if it sees a '/' between '[' and ']', it jumps back to the position of the '[' and removes everything up to and including the ']'.
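One possible sketch of such a function, assuming brackets are never nested. The name remove_slash_groups is hypothetical:

```python
def remove_slash_groups(text):
    """Drop every [...] group whose contents include a '/' (no nesting assumed)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == '[':
            end = text.find(']', i)
            # Skip the whole group only when it is closed and contains '/'
            if end != -1 and '/' in text[i:end]:
                i = end + 1
                continue
        out.append(text[i])
        i += 1
    return ''.join(out)


result = remove_slash_groups('this is a test for [blah] and [blah/blahhhh]')
print(repr(result))  # 'this is a test for [blah] and '
```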

regular expression match issue in Python

Given an input string, I want to match text which starts with {(P) and ends with (P)}, and I only want the part in the middle. Can this be done with a single regular expression?
In the following example I want to retrieve the hello world part. I am using Python 2.7.
python {(P)hello world(P)} java
You can try {\(P\)(.*)\(P\)}, and use parentheses in the pattern to capture everything between {(P) and (P)}:
import re
re.findall(r'{\(P\)(.*)\(P\)}', "python {(P)hello world(P)} java")
# ['hello world']
.* also matches unicode characters, for example:
import re
str1 = "python {(P)£1,073,142.68(P)} java"
str2 = re.findall(r'{\(P\)(.*)\(P\)}', str1)[0]
str2
# '\xc2\xa31,073,142.68'
print str2
# £1,073,142.68
You can use positive look-arounds to ensure that it only matches if the text is preceded and followed by the start and end tags. For instance, you could use this pattern:
(?<={\(P\)).*?(?=\(P\)})
See the demo.
(?<={\(P\)) - Look-behind expression stating that a match must be preceded by {(P).
.*? - Matches all text between the start and end tags. The ? makes the star lazy (i.e. non-greedy). That means it will match as little as possible.
(?=\(P\)}) - Look-ahead expression stating that a match must be followed by (P)}.
For what it's worth, lazy patterns are technically less efficient, so if you know that there will be no ( characters in the match, it would be better to use a negated character class:
(?<={\(P\))[^(]*(?=\(P\)})
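To see why the lazy or negated-class form matters once the input holds more than one tagged section, here is a sketch with a hypothetical two-tag string (braces escaped for readability):

```python
import re

# Hypothetical input with two tagged sections
s = "a {(P)one(P)} b {(P)two(P)} c"

# Greedy .* runs to the last closing tag and swallows the text in between
greedy = re.findall(r"(?<=\{\(P\)).*(?=\(P\)\})", s)
print(greedy)  # ['one(P)} b {(P)two']

# Lazy .*? stops at the nearest closing tag
lazy = re.findall(r"(?<=\{\(P\)).*?(?=\(P\)\})", s)
print(lazy)  # ['one', 'two']

# The negated class cannot cross a '(' at all
negated = re.findall(r"(?<=\{\(P\))[^(]*(?=\(P\)\})", s)
print(negated)  # ['one', 'two']
```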
You can also do this without regular expressions:
s = 'python {(P)hello world(P)} java'
r = s.split('(P)')[1]
print(r)
# hello world

Python regex with *?

What does this Python regex match?
.*?[^\\]\n
I'm confused about why the . is followed by both * and ?.
* means "match the previous element as many times as possible (zero or more times)".
*? means "match the previous element as few times as possible (zero or more times)".
The other answers already address this, but what they don't bring up is how it changes the regex. If the re.DOTALL flag is provided it makes a huge difference, because . will then match line break characters. So .*[^\\]\n would match from the beginning of the string all the way to the last newline character that is not preceded by a backslash (so several lines would match).
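A sketch of the re.DOTALL case, using a hypothetical three-line string:

```python
import re

s = "a\nb\nc"

# Greedy with DOTALL: one match running to the last qualifying newline
greedy = re.findall(r'.*[^\\]\n', s, re.DOTALL)
print(greedy)  # ['a\nb\n']

# Lazy with DOTALL: each match ends at the first qualifying newline
lazy = re.findall(r'.*?[^\\]\n', s, re.DOTALL)
print(lazy)  # ['a\n', 'b\n']
```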
If the re.DOTALL flag is not provided, the difference is more subtle, [^\\] will match everything other than backslash, including line break characters. Consider the following example:
>>> import re
>>> s = "foo\n\nbar"
>>> re.findall(r'.*?[^\\]\n', s)
['foo\n']
>>> re.findall(r'.*[^\\]\n', s)
['foo\n\n']
So the purpose of this regex is to find non-empty lines that don't end with a backslash, but if you use .* instead of .*? you will match an extra \n if you have an empty line following a non-empty line.
This happens because .*? will only match fo, [^\\] will match the second o, and then the \n matches at the end of the first line. With .*, however, the .* will match foo, the [^\\] will match the \n that ends the first line, and the next \n will match because the second line is blank.
. indicates a wild card. It can match anything except a \n, unless the appropriate flag is used.
* indicates that you can have 0 or more of the thing preceding it.
? indicates that the preceding quantifier is lazy. It will match as few repetitions as possible, stopping at the first point where the rest of the pattern can match.
Opening the Python re module documentation, and searching for *?, we find:
*?, +?, ??:
The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <H1>title</H1>, it will match the entire string, and not just <H1>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only <H1>.
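The documentation's example can be reproduced directly (a minimal sketch; the variable names are illustrative):

```python
import re

s = "<H1>title</H1>"

# Greedy: .* runs to the last '>' in the string
greedy = re.match(r"<.*>", s).group()
print(greedy)  # <H1>title</H1>

# Lazy: .*? stops at the first '>'
lazy = re.match(r"<.*?>", s).group()
print(lazy)  # <H1>
```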
