Regular expression in Python not catching all information - python

I have the following string:
a = '''"The cat is running to the door, he does not look hungry anymore".
Said my mom, whispering.'''
Note the line breaks. In python the string will be:
'The cat is running to the door, he does not look hungry anymore".\n \n Said my mom, whispering.'
I have this regular expression:
pattern = u'^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*?)'
and I used it as follows in Python:
>>> import re
>>> a = '''"The cat is running to the door, he does not look hungry anymore".
Said my mom, whispering.'''
>>> pattern = u'^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*?)'
>>> re.search(pattern, a).groups()
('"', 'The cat is running to the door, he does not look hungry anymore', '"', '.', '')
Why is the last part (Said my mom, whispering.) not being caught by the regular expression?
I'm expecting something like this:
('"', 'The cat is running to the door, he does not look hungry anymore', '"', '.', 'Said my mom, whispering.')
Can you please clarify to me what I'm doing wrong?

Just removing the ? from the last group would be enough. It's also better to include the DOTALL modifier, because by default the dot in your regex won't match newline characters.
pattern = u'(?s)^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*)'
Note that .*? is reluctant (non-greedy), which means it matches as few characters as possible. Since nothing follows the last (.*?) in your pattern, it is satisfied by the empty string and stops there.
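A minimal sketch of the difference, using the string and patterns from this thread (the names lazy and greedy are only for illustration, and \n is written in place of \u000A):

```python
import re

a = '"The cat is running to the door, he does not look hungry anymore".\nSaid my mom, whispering.'

# Trailing lazy group: nothing follows it, so the empty string is enough.
lazy = re.search('^("|«)(.*?)("|»)(.*?)\n{1,}(.*?)', a).groups()

# Trailing greedy group with DOTALL: it consumes the rest of the input.
greedy = re.search('(?s)^("|«)(.*?)("|»)(.*?)\n{1,}(.*)', a).groups()

print(repr(lazy[-1]))    # ''
print(repr(greedy[-1]))  # 'Said my mom, whispering.'
```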

The problem with your expression is that the final (.*?) group is reluctant, meaning that it matches as little text as possible. Since you do not ask for the match to "anchor" at the end of the input, the last group is empty.
Adding $ at the end of the regex will fix this problem:
pattern = u'^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*?)$'
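As a quick check of the anchored pattern, here is a small sketch on the string from the question. The three_lines variant is a made-up input, added only to show a caveat: without DOTALL the dot stops at newlines, so the anchor only helps when the remaining text is a single line.

```python
import re

a = '"The cat is running to the door, he does not look hungry anymore".\nSaid my mom, whispering.'
pattern = '^("|«)(.*?)("|»)(.*?)\n{1,}(.*?)$'

groups = re.search(pattern, a).groups()
print(groups[-1])  # Said my mom, whispering.

# Hypothetical input with one more line: the lazy group cannot cross the
# extra newline, so $ is never reached and the search fails.
three_lines = a + '\nShe smiled.'
print(re.search(pattern, three_lines))  # None
```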

Your input does not start with the quote, but the regex requires it. Then, a linebreak pattern for the second line is missing. And third, the lazy .*? will not match anything, since it can match the empty string and will do so unless you add a $ anchor or use greedy matching.
Also, it is not efficient to use single letters in alternations, so I'd rather use a character class for such cases: ("|«) => ["«].
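A small sketch showing that the two forms capture the same thing; it only checks equivalence and does not measure speed. The sample strings are made up for illustration.

```python
import re

samples = ['"quoted"', '«quoted»', 'plain']

results = []
for s in samples:
    alt = re.match('("|«)', s)   # alternation of single characters
    cls = re.match('(["«])', s)  # equivalent character class
    # Both forms capture the same opening quote, or fail together.
    assert (alt and alt.group(1)) == (cls and cls.group(1))
    results.append(cls.group(1) if cls else None)

print(results)  # ['"', '«', None]
```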
With the \s shorthand class, you can match not only linebreaks but also spaces, thus "trimming" the results in the capture groups.
Here is my suggestion:
import re
p = re.compile(r'^(["«])?(.*?)(["»])?\.\s*(.*?)\s*(.*)')
test_str = "The cat is running to the door, he does not look hungry anymore\".\n\nSaid my mom, whispering."
print(re.search(p, test_str).groups())

Related

Matching regex pattern where there is \n\r between starting and ending pattern

The red underscore is the desired string I want to match
I would like to match everything (including \n) between the two strings provided in the example
However, in the first example, where there is a newline, I can't get anything to match
In the second example, the regex expression works. It matches the string highlighted in Green because it resides on a single line
Not sure if there is a notation I need to include for \n\r to be part of the pattern to match
Use this
output = re.search('This(.*?)\n\n(.*?)match', text)
>>> output.group(1)
'is a multiline expression'
>>> output.group(2)
'I would like to '
Try this one as well:
output = re.search(r"This ([\s\S]+) match", text).group(1).replace('\n', '')
That will find the entire thing as one group then remove the new lines.
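Another option is the re.DOTALL flag, which lets the dot match line breaks as well. The sketch below uses a made-up text value with \r\n line endings, since the original input was only shown as an image.

```python
import re

# Hypothetical input; the real text is not included in the question.
text = 'This is a multiline expression\r\n\r\nI would like to match'

# With DOTALL, "." also matches \r and \n, so one group spans both lines.
m = re.search(r'This (.*?) match', text, re.DOTALL)
captured = m.group(1)
print(repr(captured))

# Collapse every run of whitespace (including \r\n) into a single space.
flattened = ' '.join(captured.split())
print(flattened)  # is a multiline expression I would like to
```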

regular expression match issue in Python

For an input string, I want to match text which starts with {(P) and ends with (P)}, and I just want to capture the part in the middle. Can this be done with a single regular expression?
For example, from the following input string I want to retrieve the hello world part. I'm using Python 2.7.
python {(P)hello world(P)} java
You can try {\(P\)(.*)\(P\)}, and use parentheses in the pattern to capture everything between {(P) and (P)}:
import re
re.findall(r'{\(P\)(.*)\(P\)}', "python {(P)hello world(P)} java")
# ['hello world']
.* also matches unicode characters, for example:
import re
str1 = "python {(P)£1,073,142.68(P)} java"
str2 = re.findall(r'{\(P\)(.*)\(P\)}', str1)[0]
str2
# '\xc2\xa31,073,142.68'
print str2
# £1,073,142.68
You can use positive look-arounds to ensure that it only matches if the text is preceded and followed by the start and end tags. For instance, you could use this pattern:
(?<={\(P\)).*?(?=\(P\)})
(?<={\(P\)) - Look-behind expression stating that a match must be preceded by {(P).
.*? - Matches all text between the start and end tags. The ? makes the star lazy (i.e. non-greedy). That means it will match as little as possible.
(?=\(P\)}) - Look-ahead expression stating that a match must be followed by (P)}.
For what it's worth, lazy patterns are technically less efficient, so if you know that there will be no ( characters in the match, it would be better to use a negated character class:
(?<={\(P\))[^(]*(?=\(P\)})
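The difference between these patterns only shows up when the input contains more than one tagged section. A minimal sketch, assuming a made-up input with two sections (the braces are escaped here, which is optional in Python's re module):

```python
import re

s = 'python {(P)hello world(P)} java {(P)second(P)} c'

# Greedy .* runs to the last "(P)}" in the string.
greedy = re.findall(r'\{\(P\)(.*)\(P\)\}', s)
# Lazy .*? with look-arounds stops at the first closing tag.
lazy = re.findall(r'(?<=\{\(P\)).*?(?=\(P\)\})', s)
# The negated class cannot run past a "(" at all.
negated = re.findall(r'(?<=\{\(P\))[^(]*(?=\(P\)\})', s)

print(greedy)   # ['hello world(P)} java {(P)second']
print(lazy)     # ['hello world', 'second']
print(negated)  # ['hello world', 'second']
```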
You can also do this without regular expressions:
s = 'python {(P)hello world(P)} java'
r = s.split('(P)')[1]
print(r)
# hello world

Behaviour of Python non-greedy regular expression

I'm using Python version 3.4.1 and I don't understand the result of the following regular expression:
import re
print(re.match("\[{E=(.*?),Q=(.*?)}\]","[{E=KT,Q=P1.p01},{E=KT2,Q=P2.p02}]").groups())
('KT', 'P1.p01},{E=KT2,Q=P2.p02')
I would expect the result to be
('KT', 'P1.p01')
but apparently the second .*? 'eats' all characters until '}]' at the end.
I would expect it to stop at the first '}' character.
If I leave out the '[' and ']' characters the behavior is as I expect:
print(re.match("{E=(.*?),Q=(.*?)}","{E=KT,Q=P1.p01},{E=KT2,Q=P2.p02}").groups())
('KT', 'P1.p01')
The \] forces a square bracket to be present in the match, and there is only one, at the end of the string. The regex engine has no other option to match. If you remove it or make it optional (\]?), it stops at the closest }.
What you seem to want is everything between '{E=' and the next comma ',', then everything between 'Q=' and the next closing brace '}'. One expression to do this would be:
{E=([^,]*),Q=([^}]*)}
Here e.g. [^,]* means "as many non-comma characters as possible".
Example usage:
>>> import re
>>> re.findall("{E=([^,]*),Q=([^}]*)}",
...            "{E=KT,Q=P1.p01},{E=KT2,Q=P2.p02}")
[('KT', 'P1.p01'), ('KT2', 'P2.p02')]
You can see the full explanation in this regex101 demo.
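Both behaviours can be put side by side in a short sketch. The input is the one from the question; the patterns are written as raw strings with escaped braces, which is a stylistic choice and not required here.

```python
import re

s = '[{E=KT,Q=P1.p01},{E=KT2,Q=P2.p02}]'

# Lazy groups still have to reach the literal "}]", which only occurs at the end.
lazy = re.match(r'\[\{E=(.*?),Q=(.*?)\}\]', s).groups()
print(lazy)  # ('KT', 'P1.p01},{E=KT2,Q=P2.p02')

# Negated classes cannot run past a comma or a closing brace.
pairs = re.findall(r'\{E=([^,]*),Q=([^}]*)\}', s)
print(pairs)  # [('KT', 'P1.p01'), ('KT2', 'P2.p02')]
```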

Regular expressions: How do I find a sub-string that is between two regular expression matches?

Let's say I have a string like:
data = 'MESSAGE: Hello world!END OF MESSAGE'
And I want to get the string between 'MESSAGE: ' and the next capitalized word. There are never any fully capitalized words in the message.
I tried to get this by using this regular expression in re.search:
re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)
Here I would like it to output 'Hello world!', but it always returns the wrong result. It is very easy in regular expressions to find a sub-string that occurs between two other strings, but how do you find a substring between strings that are themselves matches for a regular expression? I have tried making it a raw string but that didn't seem to work.
I hope I am expressing myself well- I have extensive experience in Python but am new to regular expressions. If possible, I would like an explanation along with an example of how to make my specific example code work. Any helpful posts are greatly appreciated.
BTW, I am using Python 3.3.
Your code doesn't work but for the opposite reason:
re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)
would match
'Hello world!END OF MESSA'
because (.*) is "greedy", i.e. it matches the most that will allow the rest (two uppercase chars) to match. You need to use a non-greedy quantifier with
re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
that correctly matches
'Hello world!'
One little question mark:
re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
Out[91]: 'Hello world!'
If you make the first capturing group lazy, it won't consume anything after the exclamation point.
You need your .* to be non-greedy (see the first ?), which means that it stops matching at the point where the next item could match. Since you only want the first group, the second group can also be made non-capturing (see the ?:).
import re
data = 'MESSAGE: Hello world!END OF MESSAGE'
regex = r'MESSAGE: (.*?)(?:[A-Z]{2,})'
re.search(regex, data).group(1)
Returns:
'Hello world!'
Alternatively, you could use this:
regex = r'MESSAGE: (.*?)[A-Z]{2,}'
To break this down (I'll include the search line with the VERBOSE flag):
regex = r'''
MESSAGE:\s # first part, \s for the space (matches whitespace)
(.*?) # non-greedy, anything but a newline
(?:[A-Z]{2,}) # a secondary group, but non-capturing,
# good for alternatives separated by a pipe, |
'''
re.search(regex, data, re.VERBOSE).group(1)
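To see what each group ends up holding in the greedy and lazy versions discussed above, here is a short sketch that prints all groups for the sample string from the question:

```python
import re

data = 'MESSAGE: Hello world!END OF MESSAGE'

# Greedy: (.*) grabs everything, then gives back just enough for two capitals.
greedy = re.search(r'MESSAGE: (.*)([A-Z]{2,})', data)
# Lazy: (.*?) grows one character at a time until two capitals can follow.
lazy = re.search(r'MESSAGE: (.*?)([A-Z]{2,})', data)

print(greedy.groups())  # ('Hello world!END OF MESSA', 'GE')
print(lazy.groups())    # ('Hello world!', 'END')
```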

Roman numerals in Python using "re" module

I'm working along on this page and continuing the code to cover the 10's place. My "pattern" is:
>>> pattern = '^M?M?M?(CM?|CD?|D?C?C?C?)(XC?|XL?|L?X?X?X?)$'
If I remove the caret (^) from the front of the "pattern", then strings like 'hat' will find a match:
>>> pattern = 'M?M?M?(CM?|CD?|D?C?C?C?)(XC?|XL?|L?X?X?X?)$'
>>> print re.search(pattern,'hat')
<_sre.SRE_Match object at 0x1004ba360>
but when I leave the caret in the front, then it works fine and 'hat' doesn't find a match. What does the caret do and why does 'hat' find a match?
If you actually print what it's matching, i.e.:
print re.search(pattern,"hat").group()
You'll see nothing, because it's matching the empty string: "". In your regex, every element is followed by ?, indicating 0 or 1 of whatever came before it, so the whole pattern can match the empty string. Without the ^ at the front, your regex will therefore match any string, at the empty position just before the end. It essentially boils down to pattern = '$', which also matches everything.
The ^ means "starts with": it anchors the match at the beginning of the string. With both ^ and $ in place, the pattern has to account for the entire string, so "hat" doesn't match, because h, a and t are not among the allowed numerals; however, if you put "" in lieu of "hat", you will get a match.
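A short sketch that makes the empty match visible by printing the matched text and its position. The names anchored and unanchored are illustrative, and 'MCMXC' is just a sample numeral that this pattern covers.

```python
import re

anchored = '^M?M?M?(CM?|CD?|D?C?C?C?)(XC?|XL?|L?X?X?X?)$'
unanchored = 'M?M?M?(CM?|CD?|D?C?C?C?)(XC?|XL?|L?X?X?X?)$'

# Without ^ the search slides to the end of 'hat' and matches nothing there.
m = re.search(unanchored, 'hat')
print(repr(m.group()), m.span())  # '' (3, 3)

# With ^ the whole string has to fit the pattern.
print(re.search(anchored, 'hat'))            # None
print(re.search(anchored, 'MCMXC').group())  # MCMXC
```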
