Python Regex - non-greedy match does not work - python

I have a flat file with one C++ function name and part of its declaration like this:
virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const
void function_name2
void NameSpace2::NameSpace4::ClassName2::function_name3
function_name4
I am trying to extract the function names alone by using this line:
fn_name = re.match(":(.*?)\(?", lines)
I can understand why function_name2 and function_name4 do not match (because there is no leading :). But I am seeing that even for function_name1 and function_name3, it does not do a non-greedy match. The output of fn_name.group() is
:NameSpace2::ClassName1::function_name1
I have three questions:
I expected just the string "function_name1" to be extracted from line 1, but the non-greedy match does not seem to work. Why?
Why is line 3 not being extracted?
How do I get the function names from all the lines using a single regex?
Please help.

This works pretty well, with your example at least:
^(?:\w+ +)*(?:\w+::)*(\w+)
i.e., in Python code:
import re
function_name = re.compile(r'^(?:\w+ +)*(?:\w+::)*(\w+)', re.MULTILINE)
matches = function_name.findall(your_txt)
# -> ['function_name1', 'function_name2', 'function_name3', 'function_name4']
Takeaway: If you can do it with greedy matching, do it with greedy matching.
Note that \w is not strictly correct for a C identifier (it also accepts a leading digit, for example), but writing down the technically correct character class is beside the point here. Find and use the correct set of characters instead of \w.
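As a rough sketch of that last point, the same pattern can be rebuilt around an explicit identifier class. The name IDENT and the ASCII-only class [A-Za-z_][A-Za-z0-9_]* are assumptions made here for illustration (C++ also permits some non-ASCII characters in identifiers); the sample text is the one from the question.

```python
import re

# Assumption: ASCII-only identifiers (letter or underscore first, then
# letters, digits or underscores). IDENT is a name made up for this sketch.
IDENT = r'[A-Za-z_][A-Za-z0-9_]*'

function_name = re.compile(
    rf'^(?:{IDENT} +)*(?:{IDENT}::)*({IDENT})', re.MULTILINE)

text = """virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const
void function_name2
void NameSpace2::NameSpace4::ClassName2::function_name3
function_name4"""

# One captured name per line: leading keywords and namespace
# qualifiers are consumed greedily by the non-capturing groups.
names = function_name.findall(text)
print(names)
```

Unlike \w+, this class will not accept a token that starts with a digit.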

1) Always use r" " strings for regexes.
2)
I am trying to extract the function names alone by using this line:
fn_name = re.match(":(.*?)\(?", lines)
The output of fn_name.group() is
:NameSpace2::ClassName1::function_name1
I'm not seeing that:
import re
line = "virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const"
fn_name = re.search(r":(.*?)\(?", line)
print(fn_name.group())
--output:--
:
In any case, if you want to see how non-greedy works, look at this code:
import re
line = "N----1----2"
greedy_pattern = r"""
N
.*
\d
"""
match_obj = re.search(greedy_pattern, line, flags=re.X)
print(match_obj.group())
non_greedy_pattern = r"""
N
.*?
\d
"""
match_obj = re.search(non_greedy_pattern, line, flags=re.X)
print(match_obj.group())
--output:--
N----1----2
N----1
The non-greedy version asks for all the characters matching .* up until the first digit that is encountered, while the greedy version will try to find the longest match for .* that is followed by a digit.
3) Warning! No regex zone!
func_names = [
"virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const",
"void function_name2",
"void NameSpace2::NameSpace4::ClassName2::function_name3",
"function_name4",
]
for func_name in func_names:
    name = func_name.rsplit("::", 1)[-1]
    pieces = name.rsplit(" ", 1)
    if pieces[-1] == "const":
        name = pieces[-2]
    else:
        name = pieces[-1]
    name = name.split('(', 1)[0]
    print(name)
--output:--
function_name1
function_name2
function_name3
function_name4

I expected just the string "function_name1" to be extracted from line 1, but the non-greedy match does not seem to work. Why?
This is the result from your regex ":(.*?)\(?"
I think your regex is "too lazy". It will match only : because (.*?) means "match any characters, as few as possible", so the regex engine chooses to match zero characters. It will not continue up to \( as you expected, because the ? after \( just makes the parenthesis optional.
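A small sketch of that behaviour, using the first line from the question. The variable names are made up for this illustration. Making the parenthesis mandatory forces the lazy group to grow, although the match still starts at the first colon, so this alone does not isolate the function name.

```python
import re

line = "virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const"

# Lazy group followed by an optional "(": both are satisfied by matching
# nothing, so the overall match is just the first ":".
too_lazy = re.search(r":(.*?)\(?", line)
print(repr(too_lazy.group()))    # ':'
print(repr(too_lazy.group(1)))   # ''

# With a mandatory "(" the lazy group has to expand until it reaches one,
# but the match still begins at the leftmost ":".
required = re.search(r":(.*?)\(", line)
print(required.group(1))         # ':NameSpace2::ClassName1::function_name1'
```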
Why is line 3 not being extracted?
I tested your regex, and it doesn't work for any of the lines, not only the third.
How do I get the function names from all the lines using a single regex?
You can start from this minimal example
(?:\:\:|void\s+)(\w+)(?:\(|$)|(function_name4)
Here (?:\:\:|void\s+) stands for whatever leads into your function name, and (?:\(|$) stands for whatever follows it.
Note that function_name4 has to be listed explicitly, because it has no surrounding pattern to match on.

I've been stumped before by something similar when trying to capture the "N----1" from "N foo bar N----1----2". Adding a leading .* gave the desired result.
import re
line = "N foo bar N----1----2"
match_obj = re.search(r'(N.*?\d)', line)
print(match_obj.group(1))
match_obj = re.search(r'.*(N.*?\d)', line)
print(match_obj.group(1))
--output:--
N foo bar N----1
N----1
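A possible explanation, offered as a sketch rather than a definitive account: re.search returns the leftmost match, and a lazy quantifier only affects where that match ends, not where it starts. The greedy leading .* pushes the start of the group as far right as it can. An alternative that avoids the extra .* is to forbid another N (or a digit) between the N and the digit; the character class below is an assumption that happens to fit this sample line.

```python
import re

line = "N foo bar N----1----2"

# Leftmost match: starts at the first N, lazily ends at the first digit.
leftmost = re.search(r'N.*?\d', line).group()

# The greedy prefix consumes as much as possible, so the group starts at the last N.
last_start = re.search(r'.*(N.*?\d)', line).group(1)

# Assumption for this sample: no N or digit may occur between the N and the digit.
no_inner_n = re.search(r'N[^N\d]*\d', line).group()

print(leftmost)     # N foo bar N----1
print(last_start)   # N----1
print(no_inner_n)   # N----1
```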

Related

how to make a list in python from a string and using regular expression [duplicate]

I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?
How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(m.group(1))
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
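since \w is a special sequence that means the same thing as [a-zA-Z0-9_] in ASCII mode; the exact set depends on the re.ASCII, re.LOCALE and re.UNICODE settings (in Python 3, str patterns match Unicode word characters by default).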
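To illustrate how the flags change what \w accepts, here is a minimal sketch in Python 3. The second bracketed value, containing a non-ASCII letter, is invented for this example and does not come from the question.

```python
import re

# The second bracketed value is a made-up example with a non-ASCII letter.
s = "[cus_Y4o9qMEZAugtnW] [caf\u00e9]"

# Python 3, str pattern, no flags: \w matches Unicode word characters.
default = re.findall(r"\[(\w+)\]", s)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the second value no longer matches.
ascii_only = re.findall(r"\[(\w+)\]", s, re.ASCII)

print(default)
print(ascii_only)
```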
You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'
This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]
your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis
You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?
You can use
import re
s = re.search(r"\[.*?]", string)
if s:
    print(s.group(0))
How about this? Example illustrated using a file:
import re
with open('abc.log', 'r') as f:
    for line in f:
        m = re.search(r"\[(.*?)\]", line)
        if m:  # skip lines without a bracketed value
            print(m.group(1))
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : matches a literal ]. Outside a character class ] is not special, so the backslash is optional but harmless.
This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)

Python/Regex: Split string if a line does contain a certain special character

I'm trying to split a multiline string on a character, but only if the line does not contain :. Unfortunately I can't see an easy way to use re.split() with a negative lookbehind on the character :, since it's possible that : occurred in another line earlier in the string.
As an example, I'd like to split the below string on ).
String:
Hello1 (
First : (),
Second )
Hello2 (
First
)
Output:
['Hello1 (\nFirst : (),\nSecond', 'Hello2 (\nFirst \n']
It is possible with Python, albeit not "out of the box" with the native re module.
First alternative
The newer regex module supports a variable-length lookbehind, so you could use
(?<=^[^:]+)\)
# pos. lookbehind making sure there's no : in that line
In Python:
import regex as re
data = """
Hello1 (
First : (),
Second )
Hello2 (
First
)"""
pattern = re.compile(r'(?<=^[^:]+)\)', re.MULTILINE)
parts = pattern.split(data)
print(parts)
Which yields
['\nHello1 (\nFirst : (),\nSecond ', '\n\nHello2 (\nFirst \n', '']
Second alternative
Alternatively, you could match the lines in question and let them fail with (*SKIP)(*FAIL) afterwards:
^[^:\n]*:.*(*SKIP)(*FAIL)|\)
# match lines with at least one : in it
# let them fail
# or match )
Again in Python:
pattern2 = re.compile(r'^[^:\n]*:.*(*SKIP)(*FAIL)|\)', re.MULTILINE)
parts2 = pattern2.split(data)
print(parts2)
See a demo for the latter on regex101.com.
Third alternative
Ok, now the answer is getting longer than previously thought. You can even do it with the native re module with the help of a function. Here, you need to substitute the ) in question first and split by the substitute:
def replacer(match):
    if match.group(1) is not None:
        return "SUPERMAN"
    else:
        return match.group(0)

pattern3 = re.compile(r'^[^:\n]*:.*|(\))', re.MULTILINE)
data = pattern3.sub(replacer, data)
parts3 = data.split("SUPERMAN")
print(parts3)
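For reference, here is a self-contained sketch of the third alternative that runs with only the standard re module. The SENTINEL value is an assumption: any marker that cannot occur in the input would do, and the sample text is written without the leading newline used above.

```python
import re

data = "Hello1 (\nFirst : (),\nSecond )\nHello2 (\nFirst \n)"

# Assumption: a NUL character never appears in the input, so it is a safe marker.
SENTINEL = "\x00"

# Lines containing ":" are matched whole and put back unchanged;
# any other ")" is captured in group 1 and replaced by the marker.
pattern = re.compile(r'^[^:\n]*:.*|(\))', re.MULTILINE)

def replacer(match):
    return SENTINEL if match.group(1) is not None else match.group(0)

parts = pattern.sub(replacer, data).split(SENTINEL)
print(parts)
```

The trailing empty string appears because the text ends with a ) that is split on; filter it out if it is not wanted.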

repetition in regular expression in python

I've got a file with lines for example:
aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj
I need to take what is inside $$ so expected result is:
$bb$
$ddd$
$ggg$
$iii$
My result:
$bb$
$ggg$
My solution:
m = re.search(r'$(.*?)$', line)
if m is not None:
    print(m.group(0))
Any ideas how to improve my regexp? I was trying with the * and + signs, but I'm not sure how to finally create it.
I searched for a similar post, but couldn't find one :(
You can use re.findall with r'\$[^$]+\$' regex:
import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']
See Python demo
Note that you need to escape $s and remove the capturing group for the re.findall to return the $...$ substrings, not just what is inside $s.
Pattern details:
\$ - a dollar symbol (literal)
[^$]+ - 1 or more symbols other than $
\$ - a literal dollar symbol.
NOTE: The [^$] is a negated character class that matches any char but the one(s) defined in the class. Using a negated character class here speeds up matching since .*? lazy dot pattern expands at each position in the string between two $s, thus taking many more steps to complete and return a match.
And a variation of the pattern to get only the texts inside $...$s:
re.findall(r'\$([^$]+)\$', line)
               ^     ^
See another Python demo. Note the (...) capturing group added so that re.findall could only return what is captured, and not what is matched.
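A quick comparison of the lazy dot and the negated character class, as a sketch. On the sample line they return the same substrings; they differ on an empty $$ pair, which [^$]+ rejects because it needs at least one character. The input "a$$b" is made up for this illustration.

```python
import re

line = "aaa$bb$ccc$ddd$eee"

lazy = re.findall(r'\$.*?\$', line)
negated = re.findall(r'\$[^$]+\$', line)
print(lazy, negated)             # same result on this input

# Made-up edge case: an empty pair of dollar signs.
lazy_empty = re.findall(r'\$.*?\$', "a$$b")       # the lazy dot accepts zero characters
negated_empty = re.findall(r'\$[^$]+\$', "a$$b")  # + requires at least one character
print(lazy_empty, negated_empty)
```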
re.search finds only the first match. Perhaps you'd want re.findall, which returns list of strings, or re.finditer that returns iterator of match objects. Additionally, you must escape $ to \$, as unescaped $ means "end of line".
Example:
>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
One more improvement would be to use [^$]* instead of .*?; the former means "zero or more of any character besides $", and it can potentially avoid more pathological backtracking behaviour.
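If the positions of the matches are needed as well, re.finditer can be used in place of re.findall. A minimal sketch with the [^$]* variant; the tuple layout of the results list is just a choice made for this example.

```python
import re

line = "aaa$bb$ccc$ddd$eee"

# finditer yields match objects, so offsets and groups are both available.
results = [(m.start(), m.group(0), m.group(1))
           for m in re.finditer(r'\$([^$]*)\$', line)]

for start, whole, inner in results:
    print(start, whole, inner)
```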
Your regex is almost fine; the only problem in the pattern itself is that $ must be escaped as \$, since a bare $ means "end of line". Beyond that, re.search only finds the first match in a line. You are looking for re.findall, which finds all non-overlapping matches. That last bit is important for you since you have the same start and end delimiter.
for m in re.findall(r'\$.*?\$', line):
    print(m)

Regex negative lookahead ignoring comments

I'm having problems with this regex. I want to pull out just MATCH3, because the others, MATCH1 and MATCH2 are commented out.
# url(r'^MATCH1/$',),
#url(r'^MATCH2$',),
url(r'^MATCH3$',), # comment
The regex I have captures all of the MATCH's.
(?<=url\(r'\^)(.*?)(?=\$',)
How do I ignore lines beginning with a comment? With a negative lookahead? Note the # character is not necessarily at the start of the line.
EDIT: sorry, all answers are good! the example forgot a comma after the $' at the end of the match group.
You really don't need to use lookarounds here; you could allow for possible leading whitespace, then match "url" and the context that follows it, capturing the part you want to retain.
>>> import re
>>> s = """# url(r'^MATCH1/$',),
#url(r'^MATCH2$',),
url(r'^MATCH3$',), # comment"""
>>> re.findall(r"(?m)^\s*url\(r'\^([^$]+)", s)
['MATCH3']
^\s*#.*$|(?<=url\(r'\^)(.*?)(?=\$'\))
Try this. Grab the capture. See demo.
https://www.regex101.com/r/rK5lU1/37
import re
p = re.compile(r'^\s*#.*$|(?<=url\(r\'\^)(.*?)(?=\$\'\))', re.IGNORECASE | re.MULTILINE)
test_str = "# url(r'^MATCH1/$'),\n #url(r'^MATCH2$'),\n url(r'^MATCH3$') # comment"
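# Comment lines match the first alternative and yield empty captures, so drop those.
print([m for m in re.findall(p, test_str) if m])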
If this is the only place where you need to match, then match beginning of line followed by optional whitespace followed by url:
(?m)^\s*url\(r'(.*?)'\)
If you need to cover more complicated cases, I'd suggest using ast.parse instead, as it truly understands the Python source code parsing rules.
import ast
tree = ast.parse("""(
# url(r'^MATCH1/$'),
#url(r'^MATCH2$'),
url(r'^MATCH3$') # comment
)""")
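class UrlCallVisitor(ast.NodeVisitor):
    def visit_Call(self, node):
        # Only look at calls to a plain name `url`.
        if getattr(node.func, 'id', None) == 'url':
            first = node.args[0] if node.args else None
            # ast.Constant replaces ast.Str, which was removed in Python 3.12.
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                print(first.value.strip('$^'))
        self.generic_visit(node)

UrlCallVisitor().visit(tree)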
This prints the first string-literal argument of each call to a function named url; in this case, it prints MATCH3. Notice that the source for ast.parse needs to be well-formed Python source code (thus the parentheses; otherwise a SyntaxError is raised).
As an alternative, you can split each line on '#'; if the first element contains 'url' (that is, the call is not commented out), you can use re.search to match the substring that you want:
>>> [re.search(r"url\(r'\^(.*?)\$'" ,i[0]).group(1) for i in [line.split('#') for line in s.split('\n')] if 'url' in i[0]]
['MATCH3']
Also note that you don't need to use look-arounds for your pattern; you can just use grouping!
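The split-on-# idea can also be written as a small function, which handles a # that is not at the start of the line. This is only a sketch: active_url_patterns is a name invented here, and it assumes that # never occurs inside a string literal on the lines being scanned.

```python
import re

URL_RE = re.compile(r"url\(r'\^(.*?)\$'")

def active_url_patterns(source):
    """Return captured patterns from url(...) calls that are not commented out.

    Assumption: '#' never appears inside a string literal in `source`.
    """
    found = []
    for line in source.splitlines():
        code = line.split('#', 1)[0]   # keep only the part before any comment
        found.extend(URL_RE.findall(code))
    return found

s = """# url(r'^MATCH1/$',),
#url(r'^MATCH2$',),
url(r'^MATCH3$',), # comment"""

print(active_url_patterns(s))
```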

Python/Regex splitting a specifically formatted return string

I'm working with a search&replace programming assignment. I'm a student and I'm finding the regex documentation a bit overwhelming (e.g. https://docs.python.org/2/library/re.html), so I'm hoping someone here could explain to me how to accomplish what I'm looking for.
I've used regex to get a list of strings from my document. They all look like this:
%#import fileName (regexStatement)
An actual example:
%#import script_example.py ( *out =(.|\n)*?return out)
Now, I'm wondering how I can split these up so I get the fileName and regexStatement as separate strings. I'd assume using a regex or string split function, but I'm not sure how to make it work on all kinds of variations of %#import fileName (regexStatement). Splitting on parentheses could hit the middle of the regex statement, or go wrong if a parenthesis is part of the fileName, for instance. The assignment doesn't specify whether it should only be able to import from Python files, so I don't believe I can use ".py (" as a splitting point before the regex statement either.
I'm thinking something like a regex "%#import " to hit the gap after import, and "\..* " to hit the gap after fileName. But I'm not sure how to get rid of the parentheses that enclose the regex statement, or how to use all of it to actually split the string correctly so I have one variable storing fileName and one storing regexStatement for each entry in my list.
Thanks a lot for your attention!
If the filename can't contain spaces, just split your string on spaces with maxsplit 2:
>>> line.split(' ', 2)
['%#import', 'script_example.py', '( *out =(.|\n)*?return out)']
The maxsplit of 2 makes it split only on the first two spaces and leaves intact any spaces within the regex. Now you have the filename as the second element and the regex as the third. It's not clear from your statement whether the parentheses are part of the regex or not (i.e., as a capturing group). If not, you can easily remove them by trimming the first and last characters from that part.
If you assign the values like this:
filename, regex = line.split(' ', 2)[1:]
then you can strip the parentheses with:
regex = regex[1:-1]
That should do it nicely
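Put together, the split approach might look like the sketch below. The startswith/endswith guard is an addition made here for illustration, so that the parentheses are only trimmed when both are actually present.

```python
# Raw string: the \n inside the regex statement stays as backslash + n.
line = r"%#import script_example.py ( *out =(.|\n)*?return out)"

# Split on the first two spaces only; spaces inside the regex are left alone.
_, filename, regex = line.split(' ', 2)

# Added guard (not in the answer above): trim only a real enclosing pair.
if regex.startswith('(') and regex.endswith(')'):
    regex = regex[1:-1]

print(filename)
print(regex)
```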
^%#import (\S+) \((.*)\)
or, if the filename may have spaces:
^%#import ((?:(?! \().)+) \((.*)\)
Both expressions contain two groups, one for the file name and one for the contents of the parentheses. Run in multiline mode on the entire file or in normal mode if you work with single lines anyway.
This: ((?:(?! \().)+) breaks down as:
(            # group start
  (?:        # non-capturing group
    (?!      # negative look-ahead: a position NOT followed by
       \(    #   " ("
    )        # end look-ahead
    .        # match any char (this is part of the filename)
  )+         # end non-capturing group, repeat
)            # end group
The other bits of the expression should be self-explanatory.
import re
line = "%#import script_example.py ( *out =(.|\\n)*?return out)"
pattern = r'^%#import (\S+) \((.*)\)'
match = re.match(pattern, line)
if match:
    print("match.group(1) '" + match.group(1) + "'")
    print("match.group(2) '" + match.group(2) + "'")
else:
    print("No match.")
prints
match.group(1) 'script_example.py'
match.group(2) ' *out =(.|\n)*?return out'
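The second expression, the one that allows spaces in the file name, can be checked the same way. The file name "my script.py" is a made-up example for this sketch.

```python
import re

# Variant that tolerates spaces in the file name: consume characters
# until the position is followed by " (".
pattern = re.compile(r'^%#import ((?:(?! \().)+) \((.*)\)')

# "my script.py" is a hypothetical file name containing a space.
line = r"%#import my script.py ( *out =(.|\n)*?return out)"

match = pattern.match(line)
if match:
    print(repr(match.group(1)))
    print(repr(match.group(2)))
```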
For matching something like %#import script_example.py ( *out =(.|\n)*?return out) I suggest:
r'%#impor[\w\W ]+'
Note that:
\w matches any word character [a-zA-Z0-9_]
\W matches any non-word character [^a-zA-Z0-9_]
so you can use re.findall() to find all the matches:
import re
re.findall(r'%#impor[\w\W ]+', your_string)
