I'm writing an application that uses a framework to match certain phrases. Currently it is supposed to match the following regex pattern:
Say \"(.*)\"
However, my users complain that their OS sometimes pastes in 'curly quotes'. As a result, they end up providing sentences like the following:
Say "Hello world!" <-- Matches
Say “Hello world!” <-- Doesn't match!
Is there any way I can tell Python's regular expressions to treat these curly quotes the same as regular quotes?
Edit:
It turns out you can very easily tell Python to treat your regular expression as a unicode string. I changed my code to the following and it worked:
u'Say (?:["“”])(.*)(?:["“”])'
# (?:["“”]) <-- Start a non-capturing group that matches one of the three possible quote types and does not return it
# (.*) <-- Start a capture group, match anything and return it
# (?:["“”]) <-- Another non-capturing group that matches the closing quote
You could just include the curly quotes in the regex:
Say [\"“”](.*)[\"“”]
As something you can replicate in the Python repl, it's like this:
>>> import re
>>> test_str = r'"Hello"'
>>> reg = r'["“”](.*)["“”]'
>>> m = re.search(reg, test_str)
>>> m.group(1)
'Hello'
>>> test_str = r'“Hello world!”'
>>> m = re.search(reg, test_str)
>>> m.group(1)
'\x80\x9cHello world!\xe2\x80'
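The stray bytes in the last result come from matching byte strings: in Python 2 each curly quote is three UTF-8 bytes, so the character class matches a single byte rather than a whole quote. Below is a minimal sketch of the same session with text (Unicode) strings. It assumes Python 3, where str is Unicode by default; on Python 2 both the pattern and the input would need the u'' prefix. The names QUOTED and quoted_text are made up for illustration.

```python
import re

# Assumes Python 3, where str literals are Unicode text.
# The character class lists the straight quote and both curly quotes.
QUOTED = re.compile(r'["“”](.*)["“”]')

def quoted_text(s):
    """Return the text between straight or curly double quotes, or None."""
    m = QUOTED.search(s)
    return m.group(1) if m else None

print(quoted_text('"Hello"'))
print(quoted_text('“Hello world!”'))
```

With text strings, each curly quote is a single character, so the capture group no longer picks up partial byte sequences.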
As an alternative to Kyle's answer, you can prepare the string for your current regex by replacing the curly quotes:
string.replace('“', '"').replace('”', '"')
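The same normalization can probably be done in one pass with str.translate. This is only a sketch, assuming Python 3 text strings; CURLY_TO_STRAIGHT and normalize_quotes are hypothetical names, and the pattern is the one from the question.

```python
import re

# Map both curly double quotes (U+201C, U+201D) to the straight quote.
CURLY_TO_STRAIGHT = {0x201C: '"', 0x201D: '"'}

def normalize_quotes(text):
    """Replace curly double quotes with straight ones before matching."""
    return text.translate(CURLY_TO_STRAIGHT)

# The original pattern from the question, unchanged.
pattern = re.compile(r'Say "(.*)"')

m = pattern.match(normalize_quotes('Say “Hello world!”'))
print(m.group(1))
```

Normalizing the input keeps the regex itself simple, at the cost of one extra pass over each string.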
Related
I have a list of strings and I want to print out the ones that don't match the regex but I'm having some trouble. The regex seems to match strings that it should not, if there is a substring that starts at the beginning of the string that matches the regex. I'm not sure how to fix this.
Example
>>> import re
>>> pattern = re.compile(r'\d+')
>>> string = u"1+*"
>>> bool(pattern.match(string))
True
I get true because of the 1 at the start. How should I change my regex to account for this?
Note I'm on python 2.6.6
Have your regex start with \A and end with \Z. This will make sure that the match begins at the start of the input string, and also make sure that the match ends at the end of the input string.
So for the example you gave, it would look like:
pattern = re.compile(r'\A\d+\Z')
You should append \Z to the end of the regex, so the regex pattern is '\d+\Z'.
Your code then becomes:
>>> import re
>>> pattern = re.compile(r'\d+\Z')
>>> string = u"1+*"
>>> bool(pattern.match(string))
False
This works because \Z forces matching at only the end of the string. You may also use $, which forces a match at a newline before the end of the string or at the end of the string. If you would like to force the string to only contain numeric values (irrelevant if using re.match, but maybe useful if using other regular expression libraries), you may add a ^ to the front of the pattern, forcing a match at the start of the string. The pattern would then be '^\d+\Z'.
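To make the difference between \Z and $ concrete, here is a small sketch. It assumes Python 3; re.fullmatch is shown as an alternative, but it only exists in Python 3.4 and later, so it would not be available on the asker's Python 2.6.6. The variable names are illustrative.

```python
import re

digits_z = re.compile(r'\d+\Z')      # must end exactly at the end of the string
digits_dollar = re.compile(r'\d+$')  # may also end just before a trailing newline

for candidate in ['123', '1+*', '123\n']:
    print(repr(candidate),
          bool(digits_z.match(candidate)),
          bool(digits_dollar.match(candidate)))

# Python 3.4+ only: fullmatch requires the whole string to match the pattern.
print(bool(re.fullmatch(r'\d+', '123')), bool(re.fullmatch(r'\d+', '1+*')))
```

The only input where the two anchors disagree is '123\n', which $ accepts and \Z rejects.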
For an input string, I want to match text that starts with {(P) and ends with (P)}, and I only want the part in the middle. Can a single regular expression do this?
For example, for the input string below, I want to retrieve the hello world part. I'm using Python 2.7.
python {(P)hello world(P)} java
You can try {\(P\)(.*)\(P\)}, and use parentheses in the pattern to capture everything between {(P) and (P)}:
import re
re.findall(r'{\(P\)(.*)\(P\)}', "python {(P)hello world(P)} java")
# ['hello world']
.* also matches unicode characters, for example:
import re
str1 = "python {(P)£1,073,142.68(P)} java"
str2 = re.findall(r'{\(P\)(.*)\(P\)}', str1)[0]
str2
# '\xc2\xa31,073,142.68'
print str2
# £1,073,142.68
You can use positive look-arounds to ensure that it only matches if the text is preceded and followed by the start and end tags. For instance, you could use this pattern:
(?<={\(P\)).*?(?=\(P\)})
(?<={\(P\)) - Look-behind expression stating that a match must be preceded by {(P).
.*? - Matches all text between the start and end tags. The ? makes the star lazy (i.e. non-greedy). That means it will match as little as possible.
(?=\(P\)}) - Look-ahead expression stating that a match must be followed by (P)}.
For what it's worth, lazy patterns are technically less efficient, so if you know that there will be no ( characters in the match, it would be better to use a negated character class:
(?<={\(P\))[^(]*(?=\(P\)})
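The difference between the greedy and lazy forms only shows up when the input contains more than one tagged section. A rough sketch, assuming Python 3 and a made-up input string with two sections (the braces are escaped here for clarity, which does not change what the patterns match):

```python
import re

# Hypothetical input with two tagged sections.
text = 'a {(P)first(P)} b {(P)second(P)} c'

# Greedy capture group: runs from the first start tag to the last end tag.
greedy = re.findall(r'\{\(P\)(.*)\(P\)\}', text)

# Lazy match between look-arounds: stops at the nearest end tag.
lazy = re.findall(r'(?<=\{\(P\)).*?(?=\(P\)\})', text)

# Negated character class: also stops early, assuming no '(' inside the match.
negated = re.findall(r'(?<=\{\(P\))[^(]*(?=\(P\)\})', text)

print(greedy)
print(lazy)
print(negated)
```

With a single tagged section all three patterns return the same text, so the choice matters mainly for inputs like the one above.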
You can also do this without regular expressions:
s = 'python {(P)hello world(P)} java'
r = s.split('(P)')[1]
print(r)
# hello world
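Note that the split version raises IndexError when the tag is missing. A slightly more defensive sketch using str.partition is shown here; between_tags and its default arguments are hypothetical, not part of the original answer.

```python
def between_tags(s, start='{(P)', end='(P)}'):
    """Return the text between the first start tag and the next end tag, or None."""
    _, found_start, rest = s.partition(start)
    if not found_start:
        return None  # no start tag in the input
    inner, found_end, _ = rest.partition(end)
    return inner if found_end else None  # None when the end tag is missing

print(between_tags('python {(P)hello world(P)} java'))
print(between_tags('no tags here'))
```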
I'm searching a block of text for a newline followed by a period.
pat = '\n\.'
block = 'Some stuff here. And perhaps another sentence here.\n.Some more text.'
For some reason when I use regex to search for my pattern it changes the value of pat (using Python 2.7).
import re
mysrch = re.search(pat, block)
Now the value of pat has been changed to:
'\n\\.'
Which is messing with the next search that I use pat for. Why is this happening, and how can I avoid it?
Thanks very much in advance.
The extra slash isn't actually part of the string - the string itself hasn't changed at all.
Here's an example:
>>> pat = '\n\.'
>>> pat
'\n\\.'
>>> print pat

\.
As you can see, when you print pat, it's only got one \ in it. When you dump the value of a string, Python uses the __repr__ method, which is designed to show you unambiguously what is in the string, so it shows the escaped version of each character. Just as \n is the escaped version of a newline, \\ is the escaped version of \.
Your regex is probably not matching how you expect because it has an actual newline character in it, not the literal string "\n" (as a repr: "\\n").
You should either make your regex a raw string (as suggested in the comments):
>>> pat = r"\n\."
>>> pat
'\\n\\.'
>>> print pat
\n\.
Or you could just escape the slashes and use
pat = "\\n\\."
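A short sketch can confirm that the raw string and the escaped string are the same value, and that re.search leaves the pattern untouched. It assumes Python 3 (print as a function); raw_pat, escaped_pat and before are illustrative names, and block is the text from the question.

```python
import re

block = 'Some stuff here. And perhaps another sentence here.\n.Some more text.'

raw_pat = r'\n\.'       # four characters: backslash, n, backslash, dot
escaped_pat = '\\n\\.'  # the same four characters, written with escaped backslashes

print(raw_pat == escaped_pat)  # both spellings produce an identical string
print(len(raw_pat))
print(repr(raw_pat))           # repr shows the escaped form
print(raw_pat)                 # print shows the actual characters

before = raw_pat
m = re.search(raw_pat, block)
print(m.start(), raw_pat == before)  # searching does not modify the pattern
```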
string = "RegisterParameter uri wub {"
RegisterName = re.findall("RegisterParameter uri ([^ ]*) {",string)
print 'RegisterName is :',RegisterName
See the above code. Here I want to find the register name in the string, i.e. wub, using a regular expression, and I have written the RE for that. If you run this code, the output is ['wub'], but I want only wub, without the brackets or quotes. What modifications are needed here?
Many thanks for your help.
RegisterName is a list with just one str element. If the issue is just printing you could try:
print 'RegisterName is :', RegisterName[0]
Output:
RegisterName is : wub
PS:
When you are not sure of the type of a variable try printing it:
print type(RegisterName)
I would recommend following Python naming conventions: identifiers like SomeName are usually used for class names. For variables, you could use some_name or register_name.
You can use re.search() (or re.match() - depends on your needs) and get the capturing group:
>>> import re
>>> s = "RegisterParameter uri wub {"
>>> match = re.search("RegisterParameter uri ([^ ]*) {", s)
>>> match.group(1) if match else "Nothing found"
'wub'
Also, instead of [^ ]*, you may want to use \w*. \w matches any word character.
See also:
What is the difference between Python's re.search and re.match?
In regex, what does \w* mean?
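To see how [^ ]* and \w* differ in practice, here is a small sketch, assuming Python 3. The helper register_name and the second input line (with a hyphen in the name) are made up for illustration.

```python
import re

def register_name(line, name_pattern=r'[^ ]*'):
    """Return the captured register name as a plain string, or None."""
    m = re.search(r'RegisterParameter uri (' + name_pattern + r') \{', line)
    return m.group(1) if m else None

s = "RegisterParameter uri wub {"
hyphenated = "RegisterParameter uri wub-1 {"  # hypothetical input

print(register_name(s))                   # [^ ]* accepts any run of non-spaces
print(register_name(s, r'\w*'))           # \w* accepts word characters only
print(register_name(hyphenated))          # the hyphen is fine for [^ ]*
print(register_name(hyphenated, r'\w*'))  # but \w* cannot cross it, so no match
```

Which class is better depends on what characters a register name may contain, which the question does not specify.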
In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile("^[a-zA-Z0-9_\.-]+#[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$");
for bit in split:
    result = pattern.match(bit)
    if(result != None):
        emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo#foo.com
would return:
foo#foo.com
but, take the following string:
I know my best friend mailto:foo#foo.com!
This would return null. So the question is: how can I make it so that a regex is the delimiter to split? I would want to get
foo#foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo#foo.com!')
['foo#foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo#foo.com, text text, baz#baz.com!')
['foo#foo.com', 'baz#baz.com']
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove or replace the anchors ^ and $ (for example with \b), e.g.:
r"\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
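The effect of the \b anchors can be seen on an address followed by a sentence-ending period. This is a sketch only, assuming Python 3; it keeps the # separator used in the examples here, moves the hyphen to the end of the last character class to avoid any ambiguity, and uses a made-up input string. The names loose and bounded are illustrative.

```python
import re

# '#' is the separator, following the examples in the text.
loose = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+')
bounded = re.compile(r'\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+\b')

text = 'Write to foo#foo.com. Or to baz#baz.com!'  # hypothetical input

print(loose.findall(text))    # the first hit keeps the trailing period
print(bounded.findall(text))  # \b forces each match to end on a word character
```

Neither pattern is a complete address validator; they are only meant to pull address-like tokens out of free text.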
The problem I see in your regex is the use of ^, which matches the start of the string, and $, which matches the end of the string. If you remove them and run it against your sample test cases, it will work:
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo#foo.com!")
['foo#foo.com']
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo#foo.com")
['foo#foo.com']