Regular expression match issue in Python

For an input string, I want to match text that starts with {(P) and ends with (P)}, and I only want to capture the part in the middle. Can this be done with a single regular expression?
For example, for the input string below, I want to retrieve the hello world part. I'm using Python 2.7.
python {(P)hello world(P)} java

You can try {\(P\)(.*)\(P\)}, using parentheses in the pattern to capture everything between {(P) and (P)}:
import re
re.findall(r'{\(P\)(.*)\(P\)}', "python {(P)hello world(P)} java")
# ['hello world']
.* also matches non-ASCII characters (shown here as a UTF-8 byte string in Python 2), for example:
import re
str1 = "python {(P)£1,073,142.68(P)} java"
str2 = re.findall(r'{\(P\)(.*)\(P\)}', str1)[0]
str2
# '\xc2\xa31,073,142.68'
print str2
# £1,073,142.68
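One caveat with the greedy .* above: if the input contains more than one tagged span, the capture runs to the last (P)}. A minimal sketch, using a made-up input string with two spans, compares it with the lazy .*? form:

```python
import re

# Hypothetical input with two tagged spans (not from the original question).
text = "python {(P)hello(P)} java {(P)world(P)} c"

# Greedy: .* runs to the LAST "(P)}" and swallows everything in between.
greedy = re.findall(r'{\(P\)(.*)\(P\)}', text)

# Lazy: .*? stops at the first "(P)}" that follows each "{(P)".
lazy = re.findall(r'{\(P\)(.*?)\(P\)}', text)

print(greedy)  # ['hello(P)} java {(P)world']
print(lazy)    # ['hello', 'world']
```

With a single tagged span, as in the question, both forms return the same result.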

You can use positive look-arounds to ensure that it only matches if the text is preceded and followed by the start and end tags. For instance, you could use this pattern:
(?<={\(P\)).*?(?=\(P\)})
(?<={\(P\)) - Look-behind expression stating that a match must be preceded by {(P).
.*? - Matches all text between the start and end tags. The ? makes the star lazy (i.e. non-greedy). That means it will match as little as possible.
(?=\(P\)}) - Look-ahead expression stating that a match must be followed by (P)}.
For what it's worth, lazy patterns can be less efficient, so if you know that there will be no ( characters in the match, it may be better to use a negated character class:
(?<={\(P\))[^(]*(?=\(P\)})
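As a rough sketch of how the two look-around patterns behave in Python, the snippet below runs both against sample inputs of my own choosing. It also shows the trade-off of the negated class: it finds nothing when the enclosed text itself contains a ( character.

```python
import re

lazy_pattern = re.compile(r'(?<={\(P\)).*?(?=\(P\)})')
negated_pattern = re.compile(r'(?<={\(P\))[^(]*(?=\(P\)})')

# Hypothetical sample inputs.
text = "python {(P)hello world(P)} java {(P)second(P)}"
tricky = "{(P)f(x)(P)}"  # the enclosed text contains a "("

lazy_matches = lazy_pattern.findall(text)
negated_matches = negated_pattern.findall(text)
lazy_tricky = lazy_pattern.findall(tricky)
negated_tricky = negated_pattern.findall(tricky)

print(lazy_matches)     # ['hello world', 'second']
print(negated_matches)  # ['hello world', 'second']
print(lazy_tricky)      # ['f(x)']
print(negated_tricky)   # []
```

Because the look-arounds consume no characters, the whole match is the enclosed text and no capture group is needed.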

You can also do this without regular expressions:
s = 'python {(P)hello world(P)} java'
r = s.split('(P)')[1]
print(r)
# hello world
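Note that s.split('(P)')[1] raises IndexError when the marker is absent and ignores the braces entirely. A slightly more defensive sketch, still without regular expressions, could use str.find; the helper name between and its default markers are my own, not part of the original answer:

```python
def between(text, start="{(P)", end="(P)}"):
    """Return the text between the first start marker and the next end marker, or None."""
    i = text.find(start)
    if i == -1:
        return None          # no opening marker
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return None          # opening marker without a closing one
    return text[i:j]


print(between("python {(P)hello world(P)} java"))  # hello world
print(between("no markers here"))                  # None
```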

Related

Regex to match strings which only contain alphabetic characters python [duplicate]

I'm trying to check if a string is a number, so the regex "\d+" seemed good. However, that regex also matches "78.46.92.168:8000" for some reason, which I do not want. Here is a little bit of code:
import re
class Foo():
    _rex = re.compile(r"\d+")
    def bar(self, string):
        m = self._rex.match(string)
        if m != None:
            doStuff()
And doStuff() is called when the IP address is entered. I'm kind of confused: how does "." or ":" match "\d"?
\d+ matches one or more digits, and re.match only requires the match to begin at the start of the string, not to cover all of it, so it matches the leading 78 and succeeds.
Use ^\d+$.
Or, even better: "78.46.92.168:8000".isdigit()
There are a couple of options in Python to match an entire input with a regex.
Python 2 and 3
In Python 2 and 3, you may use (with s being the input string)
re.match(r'\d+$', s) # re.match anchors the match at the start of the string, so $ is what remains to add
or - to avoid matching before the final \n in the string:
re.match(r'\d+\Z', s) # \Z will only match at the very end of the string
Or the same as above with the re.search method, which requires the ^ / \A start-of-string anchor because it does not anchor the match at the start of the string:
re.search(r'^\d+$', s)
re.search(r'\A\d+\Z', s)
Note that \A is an unambiguous string start anchor; its behavior cannot be redefined with any modifiers (re.M / re.MULTILINE can only redefine the ^ and $ behavior).
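To make the differences between these anchors concrete, here is a small sketch with two sample strings of my own: one with a trailing newline and one spanning two lines.

```python
import re

with_newline = "1234\n"
multi_line = "abc\n123"

# $ also matches just before a trailing newline; \Z matches only at the very end.
dollar_ok = bool(re.match(r'\d+$', with_newline))
z_ok = bool(re.match(r'\d+\Z', with_newline))

# re.M lets ^ and $ match at line boundaries, but leaves \A and \Z alone.
plain = bool(re.search(r'^\d+$', multi_line))
multiline = bool(re.search(r'^\d+$', multi_line, re.M))
strict = bool(re.search(r'\A\d+\Z', multi_line, re.M))

print(dollar_ok, z_ok)           # True False
print(plain, multiline, strict)  # False True False
```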
Python 3
All the cases described in the section above apply, plus one more useful method, re.fullmatch (added in Python 3.4; also present in the PyPI regex module). From the documentation:
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
So, after you compile the regex, just use the appropriate method:
_rex = re.compile(r"\d+")
if _rex.fullmatch(s):
    doStuff()
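Applied to the class from the question, this could look like the sketch below. It assumes Python 3.4 or later, and bar returns a boolean instead of calling doStuff() (a change of mine, made so the result is easy to inspect):

```python
import re


class Foo(object):
    _rex = re.compile(r"\d+")

    def bar(self, string):
        # fullmatch succeeds only if the entire string consists of digits.
        return self._rex.fullmatch(string) is not None


foo = Foo()
print(foo.bar("8000"))               # True
print(foo.bar("78.46.92.168:8000"))  # False
print(foo.bar("1234\n"))             # False: the trailing newline is not a digit
```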
re.match() always matches from the start of the string (unlike re.search()) but allows the match to end before the end of the string.
Therefore, you need an anchor: compiling the pattern as re.compile(r"\d+$") and then calling _rex.match(string) would work.
To be more explicit, you could also compile r"^\d+$" (the ^ is redundant with match()) or drop match() altogether and call _rex.search(string) with that pattern.
\Z matches the end of the string while $ matches the end of the string or just before the newline at the end of the string, and behaves differently under re.MULTILINE. See the syntax documentation for detailed information.
>>> s="1234\n"
>>> re.search(r"^\d+\Z",s)
>>> s="1234"
>>> re.search(r"^\d+\Z",s)
<_sre.SRE_Match object at 0xb762ed40>
Change it from \d+ to ^\d+$

Python regex matching when it should not

I have a list of strings and I want to print out the ones that don't match the regex but I'm having some trouble. The regex seems to match strings that it should not, if there is a substring that starts at the beginning of the string that matches the regex. I'm not sure how to fix this.
Example
>>> import re
>>> pattern = re.compile(r'\d+')
>>> string = u"1+*"
>>> bool(pattern.match(string))
True
I get true because of the 1 at the start. How should I change my regex to account for this?
Note: I'm on Python 2.6.6.
Have your regex start with \A and end with \Z. This will make sure that the match begins at the start of the input string, and also make sure that the match ends at the end of the input string.
So for the example you gave, it would look like:
pattern = re.compile(r'\A\d+\Z')
You should append \Z to the end of the regex, so the regex pattern is '\d+\Z'.
Your code then becomes:
>>> import re
>>> pattern = re.compile(r'\d+\Z')
>>> string = u"1+*"
>>> bool(pattern.match(string))
False
This works because \Z forces matching at only the end of the string. You may also use $, which forces a match at a newline before the end of the string or at the end of the string. If you would like to force the string to only contain numeric values (irrelevant if using re.match, but maybe useful if using other regular expression libraries), you may add a ^ to the front of the pattern, forcing a match at the start of the string. The pattern would then be '^\d+\Z'.
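Tying this back to the original goal of printing the strings that do not match, a minimal sketch might look like this. The sample list is made up for illustration:

```python
import re

pattern = re.compile(r'\A\d+\Z')

# Hypothetical sample data.
strings = ["1+*", "123", "12a", "", "42\n"]

# Keep only the strings that are NOT made up entirely of digits.
non_matching = [s for s in strings if not pattern.match(s)]

print(non_matching)  # ['1+*', '12a', '', '42\n']
```

The string "42\n" is rejected because \Z does not tolerate a trailing newline; with $ in place of \Z it would be accepted.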

Regular expressions in python to match Twitter handles

I'm trying to use regular expressions to capture all Twitter handles within a tweet body. The challenge is that I'm trying to get handles that
Contain a specific string
Are of unknown length
May be followed by either
punctuation
whitespace
or the end of string.
For example, for each of these strings, I've marked what I'd like to return.
"#handle what is your problem?" [RETURN '#handle']
"what is your problem #handle?" [RETURN '#handle']
"#123handle what is your problem #handle123?" [RETURN '#123handle', '#handle123']
This is what I have so far:
>>> import re
>>> re.findall(r'(#.*handle.*?)\W','hi #123handle, hello #handle123')
['#123handle']
# This misses the handles that are followed by end-of-string
I tried modifying to include an or character allowing the end-of-string character. Instead, it just returns the whole string.
>>> re.findall(r'(#.*handle.*?)(?=\W|$)','hi #123handle, hello #handle123')
['#123handle, hello #handle123']
# This looks like it is too greedy and ends up returning too much
How can I write an expression that will satisfy both conditions?
I've looked at a couple other places, but am still stuck.
It seems you are trying to match strings starting with #, then having 0+ word chars, then handle, and then again 0+ word chars.
Use
r'#\w*handle\w*'
or - to avoid matching #+word chars in emails:
r'\B#\w*handle\w*'
The \B non-word boundary requires a non-word char or the start of the string to be right before the #.
Note that the .* is a greedy dot matching pattern that matches any characters other than newline, as many as possible. \w* only matches 0+ characters (also as many as possible) but from the [a-zA-Z0-9_] set if the re.UNICODE flag is not used (and it is not used in your code).
Python demo:
import re
p = re.compile(r'#\w*handle\w*')
test_str = "#handle what is your problem?\nwhat is your problem #handle?\n#123handle what is your problem #handle123?\n"
print(p.findall(test_str))
# => ['#handle', '#handle', '#123handle', '#handle123']
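To illustrate the effect of the leading \B, here is a short sketch with an invented input in which one # sits directly after a word character:

```python
import re

# Hypothetical input: "foo#handle1" has a word character right before the "#".
text = "mail foo#handle1 or ping #handle2"

plain = re.findall(r'#\w*handle\w*', text)
guarded = re.findall(r'\B#\w*handle\w*', text)

print(plain)    # ['#handle1', '#handle2']
print(guarded)  # ['#handle2']

# The input from the question, where the last handle ends the string.
question = re.findall(r'#\w*handle\w*', 'hi #123handle, hello #handle123')
print(question)  # ['#123handle', '#handle123']
```

Since \w* cannot cross punctuation or whitespace, no explicit check for a following non-word character or end of string is needed.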
This matches only handles made of characters in the range [a-zA-Z0-9_]:
import re
s = "#123handle what is your problem #handle123?"
print(re.findall(r'\B(#[\w\d_]+)', s))
# ['#123handle', '#handle123']
s = '#The quick brown fox#jumped over the LAAZY #_dog.'
print(re.findall(r'\B(#[\w\d_]+)', s))
# ['#The', '#_dog']

Regular expressions: How do I find a sub-string that is between two regular expression matches?

Let's say I have a string like:
data = 'MESSAGE: Hello world!END OF MESSAGE'
And I want to get the string between 'MESSAGE: ' and the next capitalized word. There are never any fully capitalized words in the message.
I tried to get this by using this regular expression in re.search:
re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)
Here I would like it to output 'Hello world!', but it always returns the wrong result. It is very easy in regular expressions to find a substring that occurs between two other fixed strings, but how do you find a substring between strings that are themselves matches for a regular expression? I have tried making it a raw string, but that didn't seem to work.
I hope I am expressing myself well. I have extensive experience in Python but am new to regular expressions. If possible, I would like an explanation along with an example of how to make my specific example code work. Any helpful posts are greatly appreciated.
BTW, I am using Python 3.3.
Your code doesn't work but for the opposite reason:
re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)
would match
'Hello world!END OF MESSA'
because (.*) is "greedy", i.e. it matches the most that will allow the rest (two uppercase chars) to match. You need to use a non-greedy quantifier with
re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
that correctly matches
'Hello world!'
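The difference is easy to see by printing both capture groups for each variant. This is a small sketch reusing the data string from the question:

```python
import re

data = 'MESSAGE: Hello world!END OF MESSAGE'

greedy = re.search(r'MESSAGE: (.*)([A-Z]{2,})', data)
lazy = re.search(r'MESSAGE: (.*?)([A-Z]{2,})', data)

# Greedy: .* grabs everything, then backs off just enough to leave two capitals.
print(greedy.groups())  # ('Hello world!END OF MESSA', 'GE')

# Lazy: .*? grows one character at a time until two capitals follow.
print(lazy.groups())    # ('Hello world!', 'END')
```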
One little question mark:
re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
Out[91]: 'Hello world!'
If you make the first capturing group lazy, it won't consume anything after the exclamation point.
You need your .* to be non-greedy (see the first ?) which means that it stops matching at the point where the next item could match, and you need the second group to be non-capturing (see the ?:).
import re
data = 'MESSAGE: Hello world!END OF MESSAGE'
regex = r'MESSAGE: (.*?)(?:[A-Z]{2,})'
re.search(regex, data).group(1)
Returns:
'Hello world!'
Alternatively, you could use this:
regex = r'MESSAGE: (.*?)[A-Z]{2,}'
To break this down (I'll include the search line with the VERBOSE flag):
regex = r'''
MESSAGE:\s # first part, \s for the space (matches whitespace)
(.*?) # non-greedy, anything but a newline
(?:[A-Z]{2,}) # a secondary group, but non-capturing,
# good for alternatives separated by a pipe, |
'''
re.search(regex, data, re.VERBOSE).group(1)
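One thing to keep in mind is that re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError on unexpected input. A hedged sketch of a wrapper follows; the name extract_message is hypothetical, and it uses a look-ahead in place of the non-capturing group, which yields the same capture:

```python
import re

# Look-ahead variant: the capitalized word is checked but not consumed.
MESSAGE_RE = re.compile(r'MESSAGE: (.*?)(?=[A-Z]{2,})')


def extract_message(text):
    """Return the text after 'MESSAGE: ' up to the next all-caps word, or None."""
    m = MESSAGE_RE.search(text)
    return m.group(1) if m else None


print(extract_message('MESSAGE: Hello world!END OF MESSAGE'))  # Hello world!
print(extract_message('no message here'))                      # None
```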
