Python: RE issue with "re.findall()"

import re
string = "RegisterParameter uri wub {"
RegisterName = re.findall("RegisterParameter uri ([^ ]*) {",string)
print 'RegisterName is :',RegisterName
See the code above. I want to extract the register name from the string (here, wub) with a regular expression, and I have written the RE for that. If you run this code, it prints ['wub'], but I want just wub, without the brackets or quotes. What do I need to change?
Many thanks for your help.

RegisterName is a list with just one str element. If the issue is just printing you could try:
print 'RegisterName is :', RegisterName[0]
Output:
RegisterName is : wub
PS:
When you are not sure of the type of a variable try printing it:
print type(RegisterName)
I would also recommend following Python naming conventions: identifiers like SomeName are usually used for class names. For variables, you could use some_name or register_name.
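A minimal sketch of the same idea with a guard for the empty case, written in Python 3 syntax (an assumption; the question uses Python 2 print statements). The helper name first_match is hypothetical and not part of the re module:

```python
import re


def first_match(pattern, text):
    """Return the first re.findall() result, or None if nothing matched."""
    found = re.findall(pattern, text)  # always a list, possibly empty
    return found[0] if found else None


line = "RegisterParameter uri wub {"
register_name = first_match(r"RegisterParameter uri ([^ ]*) {", line)
print("RegisterName is :", register_name)
```

Indexing with [0] directly raises IndexError when the pattern does not match, which is why the sketch checks the list first.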

You can use re.search() (or re.match() - depends on your needs) and get the capturing group:
>>> import re
>>> s = "RegisterParameter uri wub {"
>>> match = re.search("RegisterParameter uri ([^ ]*) {", s)
>>> match.group(1) if match else "Nothing found"
'wub'
Also, instead of [^ ]*, you may want to use \w*. \w matches any word character.
See also:
What is the difference between Python's re.search and re.match?
In regex, what does \w* mean?
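To see how [^ ]* and \w* differ, here is a small sketch in Python 3 syntax. The input "wub-2" is a made-up register name, chosen only because it contains a non-word character:

```python
import re

s = "RegisterParameter uri wub-2 {"

# [^ ]* takes everything up to the next space, including punctuation.
loose = re.search(r"RegisterParameter uri ([^ ]*) {", s)
# \w* stops at the first character that is not a letter, digit or underscore,
# so the literal " {" that must follow no longer lines up and the search fails.
strict = re.search(r"RegisterParameter uri (\w*) {", s)

print(loose.group(1) if loose else "Nothing found")
print(strict.group(1) if strict else "Nothing found")
```

Which one is preferable depends on what characters a register name may contain, which the question does not say.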

Related

Python Regex to Remove Special Characters from Middle of String and Disregard Anything Else

Using Python's re.sub, is there a way I can extract the first alphanumeric characters and disregard the rest from a string that starts with a special character and might have special characters in the middle? For example:
re.sub('[^A-Za-z0-9]','', '#my,name')
How do I just get "my"?
re.sub('[^A-Za-z0-9]','', '#my')
Here I would also want it to just return 'my'.
re.sub(".*?([A-Za-z0-9]+).*", r"\1", str)
The \1 in the replacement is equivalent to matchobj.group(1). In other words it replaces the whole string with just what was matched by the part of the regexp inside the brackets. $ could be added at the end of the regexp for clarity, but it is not necessary because the final .* will be greedy (match as many characters as possible).
This solution does suffer from the problem that if the string doesn't match (which would happen if it contains no alphanumeric characters), then it will simply return the original string. It might be better to attempt a match, then test whether it actually matches, and handle separately the case that it doesn't. Such a solution might look like:
matchobj = re.match(".*?([A-Za-z0-9]+).*", str)
if matchobj:
    print(matchobj.group(1))
else:
    print("did not match")
But the question called for the use of re.sub.
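A rough side-by-side of the two approaches, in Python 3 syntax. The helper first_alnum and the sample "#,!" are hypothetical, added only to show the no-match case described above:

```python
import re


def first_alnum(text, default=None):
    """Return the first run of ASCII letters/digits in text, or default."""
    m = re.search(r"[A-Za-z0-9]+", text)
    return m.group(0) if m else default


for sample in ("#my,name", "#my", "#,!"):
    # re.sub returns the input unchanged when the pattern does not match
    via_sub = re.sub(r".*?([A-Za-z0-9]+).*", r"\1", sample)
    print(repr(sample), repr(via_sub), repr(first_alnum(sample)))
```

For the last sample, re.sub hands back "#,!" as is, while the search-based helper returns None, so the caller can tell the two situations apart.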
Instead of re.sub it is easier to do matching using re.search or re.findall.
Using re.search:
>>> s = '#my,name'
>>> res = re.search(r'[a-zA-Z\d]+', s)
>>> if res:
...     print(res.group())
...
my
This is not a complete answer. With re.findall, [A-Za-z]+ will give you ['my', 'name'].
Use this to further explore: https://regex101.com/

Python re.compile curly quotes issue

I'm currently writing an application that uses a framework to match certain phrases. At the moment it is supposed to match the following regex pattern:
Say \"(.*)\"
However, I've noticed that my users are complaining that their OS sometimes pastes in 'curly quotes'. As a result, users end up providing sentences like the following:
Say "Hello world!" <-- Matches
Say “Hello world!” <-- Doesn't match!
Is there any way I can tell Python's regular expressions to treat these curly quotes the same as regular quotes?
Edit:
It turns out you can very easily tell Python to read your regular expression as a unicode string. I changed my code to the following and it worked:
u'Say (?:["“”])(.*)(?:["“”])'
# (?:["“”]) <-- Non-capturing group: match one of the three possible quote types, but do not return it
# (.*) <-- Capture group: match anything and return it
# (?:["“”]) <-- Non-capturing group: match the closing quote (any of the three types)
You could just include the curly quotes in the regex:
Say [\"“”](.*)[\"“”]
As something you can replicate in the Python repl, it's like this:
>>> import re
>>> test_str = r'"Hello"'
>>> reg = r'["“”](.*)["“”]'
>>> m = re.search(reg, test_str)
>>> m.group(1)
'Hello'
>>> test_str = r'“Hello world!”'
>>> m = re.search(reg, test_str)
>>> m.group(1)
'\x80\x9cHello world!\xe2\x80'
As an alternative to Kyle's answer, you can prepare the string for your current regex by replacing the curly quotes:
string.replace('“', '"').replace('”', '"')
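The '\x80\x9c...' output in the REPL session above looks like what you get when a byte-string pattern runs against UTF-8 bytes, so each curly quote counts as three separate bytes. Here is a sketch of both suggestions in Python 3, where str patterns work on whole code points (Python 3 is an assumption here; the helper name normalize_quotes is made up):

```python
import re

# Character class listing the straight quote and both curly double quotes.
QUOTES = '["\u201c\u201d]'
say_re = re.compile("Say " + QUOTES + "(.*)" + QUOTES)


def normalize_quotes(text):
    """Replace curly double quotes with straight ones."""
    return text.replace("\u201c", '"').replace("\u201d", '"')


for sentence in ('Say "Hello world!"', "Say \u201cHello world!\u201d"):
    # Option 1: widen the pattern to accept all three quote characters.
    print(say_re.search(sentence).group(1))
    # Option 2: keep the original pattern and normalise the input first.
    print(re.search(r'Say "(.*)"', normalize_quotes(sentence)).group(1))
```

In Python 2 the equivalent would be to use unicode objects for both the pattern and the input, as the edit to the question does.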

python regex search pattern

I'm searching a block of text for a newline followed by a period.
pat = '\n\.'
block = 'Some stuff here. And perhaps another sentence here.\n.Some more text.'
For some reason when I use regex to search for my pattern it changes the value of pat (using Python 2.7).
import re
mysrch = re.search(pat, block)
Now the value of pat has been changed to:
'\n\\.'
Which is messing with the next search that I use pat for. Why is this happening, and how can I avoid it?
Thanks very much in advance.
The extra slash isn't actually part of the string - the string itself hasn't changed at all.
Here's an example:
>>> pat = '\n\.'
>>> pat
'\n\\.'
>>> print pat

\.
As you can see, when you print pat, it only has one \ in it. When you display the value of a string in the interpreter, Python uses the __repr__ function, which is designed to show unambiguously what is in the string, so it shows the escaped form of each character. Just as \n is the escaped form of a newline, \\ is the escaped form of \.
Your regex is probably not matching how you expect because it has an actual newline character in it, not the literal string "\n" (as a repr: "\\n").
You should either make your regex a raw string (as suggested in the comments):
>>> pat = r"\n\."
>>> pat
'\\n\\.'
>>> print pat
\n\.
Or you could just escape the backslashes and use:
pat = "\\n\\."
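A quick check in Python 3 syntax (an assumption; the question uses 2.7), comparing the two spellings. As far as I can tell, the regex engine accepts both a real newline character and the two-character sequence \n, so both patterns should find the same match here, and since strings are immutable, re.search cannot change pat either way:

```python
import re

block = "Some stuff here. And perhaps another sentence here.\n.Some more text."

plain = "\n\\."   # real newline character, then backslash + dot (3 characters)
raw = r"\n\."     # backslash + n, then backslash + dot (4 characters)

# repr() shows the escaped form, len() shows how many characters are really there
print(repr(plain), len(plain))
print(repr(raw), len(raw))

for pat in (plain, raw):
    m = re.search(pat, block)
    print(m.start() if m else None)
```

The raw-string form is still the safer habit, because it keeps Python's string escapes and the regex escapes from getting mixed up.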

Parsing multi line comments from js using python

I want to get the contents of the multiline comments in a js file using python.
I tried this code sample
import re
code_m = """
/* This is a comment. */
"""
code_s = "/* This is a comment*/"
reg = re.compile("/\*(?P<contents>.*)\*/", re.DOTALL + re.M)
matches_m = reg.match(code_m)
matches_s = reg.match(code_s)
print matches_s # Gives a match object
print matches_m # Gives None
I get matches_m as None. But matches_s works. What am I missing here?
match() only matches at the start of the string, use search() instead.
When using match(), it is like there is an implicit beginning of string anchor (\A) at the start of your regex.
As a side note, you don't need the re.M flag unless you are using ^ or $ in your regex and want them to match at the beginning and end of lines. You should also use a bitwise OR (re.S | re.M for example) instead of adding when combining multiple flags.
re.match only tests whether the regex matches at the beginning of the string. You're probably looking for re.search:
>>> reg.search(code_m)
<_sre.SRE_Match object at 0x7f293e94d648>
>>> reg.search(code_m).groups()
(' This is a comment. ',)
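If the file has more than one comment, the greedy .* with DOTALL would run from the first /* to the last */. A sketch of one way around that, in Python 3 syntax, swapping in a non-greedy .*? (my change, not part of the original pattern) and a made-up sample string:

```python
import re

code = """
/* first comment */
var x = 1;
/* second
   comment */
"""

# Non-greedy .*? stops at the nearest */ ; re.DOTALL lets . cross newlines.
comment_re = re.compile(r"/\*(?P<contents>.*?)\*/", re.DOTALL)

print(comment_re.match(code))   # None: the string starts with a newline
print(comment_re.search(code).group("contents"))
print([c.strip() for c in comment_re.findall(code)])
```

Note that a regex like this knows nothing about JavaScript syntax, so a /* inside a string literal would also be picked up.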

Python split by regular expression

In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile("^[a-zA-Z0-9_\.-]+#[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$")
for bit in split:
    result = pattern.match(bit)
    if result is not None:
        emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo#foo.com
would return:
foo#foo.com
but, take the following string:
I know my best friend mailto:foo#foo.com!
This would return null. So the question is: how can I make it so that a regex is the delimiter to split? I would want to get
foo#foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo#foo.com!')
['foo#foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo#foo.com, text text, baz#baz.com!')
['foo#foo.com', 'baz#baz.com']
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove or replace the anchors ^ and $ (for example with \b), e.g.:
r"\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
The problem I see in your regex is the use of ^, which matches the start of the string, and $, which matches the end of the string. If you remove them and run it with your sample test cases, it will work:
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo#foo.com!")
['foo#foo.com']
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo#foo.com")
['foo#foo.com']
>>>
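One thing to watch with these patterns: the last character class includes ".", so a sentence-ending period right after the address gets pulled into the match. Below is a sketch in Python 3 syntax; the "#" separator is kept exactly as written in the thread, the sample text is made up, and strict_re is a hypothetical variant, not something from the answers:

```python
import re

# Same character classes as the findall answer, with the hyphen moved to the
# end of the last class.
email_re = re.compile(r"[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")
# Hypothetical variant: the final character must be a letter or digit.
strict_re = re.compile(r"[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]*[a-zA-Z0-9]")

text = "Write to foo#foo.com. Or try mailto:baz#baz.co.uk!"

print(email_re.findall(text))   # first address keeps the trailing period
print(strict_re.findall(text))  # trailing period is left out
```

Neither pattern is a full address validator; they are only meant for pulling likely addresses out of free text.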
