I'm supposed to capture everything inside a tag and the lines after it, but it's supposed to stop the next time it meets a bracket. What am I doing wrong?
import re #regex
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
(\b(?:.|\s)*(?!\[)) # should read: anyword that doesn't precede a bracket
""", re.MULTILINE | re.VERBOSE)
haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
#[this should be taken though as this is in the content]
[tab2]
help me
write a better RE
"""
m = regex.findall(haystack)
print(m)
What I'm trying to get is:
[('tab1', 'this is captured\nbut this is suppose to be captured too!\n#[this should be taken though as this is in the content]\n', '[tab2]','help me\nwrite a better RE\n')]
edit:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
This seems to work, but it's also trimming the brackets inside the content.
Python's re module doesn't support recursion, as far as I know.
EDIT: but in your case this would work:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
EDIT 2: Yes, that doesn't work properly. Try this instead:
import re
regex = re.compile(r"""
(?:^|\n)\[ # tag's opening bracket
([^\]\n]*) # 1. text between brackets
\]\n # tag's closing bracket
(.*?) # 2. text between the tags
(?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
""", re.DOTALL | re.VERBOSE)
haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag
[tag2]
help me
write a better RE[[[]
"""
print(regex.findall(haystack))
I do agree with viraptor though. Regexes are cool, but you can't check your file for errors with them. A hybrid, perhaps? :P
tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))
result = {}
# Content of each tag runs from the end of its line to the start of the next tag
for (mo1, mo2) in zip(tags[:-1], tags[1:]):
    result[mo1.group(1)] = haystack[mo1.end(1)+1:mo2.start(1)-1].strip()
# The last tag's content runs to the end of the string
result[mo2.group(1)] = haystack[mo2.end(1)+1:].strip()
print(result)
EDIT 3: That's because the ^ character means negation only inside [^square brackets]. Everywhere else it means the start of the string (or the start of a line with re.MULTILINE). There's no good way to negate a whole string in a regex, only a character.
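To illustrate the difference, here is a minimal sketch comparing the negated character class with a lookahead that stops only at a line consisting entirely of a [tag]. The sample text and the names naive and strict are made up for this illustration; the lookahead is one possible workaround, not the only one.

```python
import re

# Hypothetical sample: the first section contains a bracket in its content.
haystack = "[tab1]\nkeep [this] too\n[tab2]\nmore\n"

# Negated character class: the content stops at the first '[' of any kind.
naive = re.compile(r"^\[(.*?)\]\n([^\[]*)", re.MULTILINE)

# Lookahead: the content stops only before a line that is entirely a [tag],
# or at the end of the string.
strict = re.compile(
    r"^\[([^\]\n]*)\]\n(.*?)(?=^\[[^\]\n]*\]$|\Z)",
    re.MULTILINE | re.DOTALL,
)

naive_result = naive.findall(haystack)
strict_result = strict.findall(haystack)
print(naive_result)   # content of tab1 is cut off at '[this]'
print(strict_result)  # content of tab1 keeps '[this]'
```

The character class cannot tell a tag line from a stray bracket, which is why the content gets truncated; the lookahead checks for a whole tag line instead.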
First of all, why use a regex if you're trying to parse? As you can see, you cannot find the source of the problem yourself, because a regex gives no feedback. Also, you don't have any recursion in that RE.
Make your life simple:
def ini_parse(src):
    in_block = None
    contents = {}
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            # A new [section] header starts a new block
            in_block = line[1:len(line)-1]
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            raise Exception("content out of block")
    return contents
You get error handling with exceptions and the ability to debug execution as a bonus. Also you get a dictionary as a result and can handle duplicate sections while processing. My result:
{'tab2': 'help me\nwrite a better RE\n\n',
'tab1': 'this is captured\nbut this is suppose to be captured too!\n#[this should be taken though as this is in the content]\n\n'}
RE is much overused these days...
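As a sketch of the duplicate-section handling mentioned above, the same line-based approach can reject a repeated header and report the line number. The function name parse_sections and the error messages are hypothetical, chosen for this example only.

```python
def parse_sections(src):
    """Line-based parser that rejects duplicate [section] headers."""
    sections = {}
    current = None
    for lineno, line in enumerate(src.split("\n"), 1):
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            if current in sections:
                raise ValueError(
                    "duplicate section %r on line %d" % (current, lineno))
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
        elif line.strip():
            raise ValueError("content outside a section on line %d" % lineno)
    return sections


parsed = parse_sections("[a]\none\n[b]\ntwo")
print(parsed)
```

Whether a duplicate should raise, overwrite, or append is a design choice; raising is used here only because it makes the problem visible.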
Does this do what you want?
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^[]*)
""", re.MULTILINE | re.VERBOSE)
This gives a list of tuples (one 2-tuple per match). If you want a flattened tuple you can write:
m = sum(regex.findall(haystack), ())
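A small sketch of the flattening step, using made-up pairs in place of the real findall result. itertools.chain.from_iterable is shown as an alternative that gives the same tuple.

```python
from itertools import chain

# Hypothetical stand-in for regex.findall(haystack)
pairs = [("tab1", "first\n"), ("tab2", "second\n")]

# sum() starts from the empty tuple and concatenates each 2-tuple onto it
flat_sum = sum(pairs, ())

# Equivalent result without building intermediate tuples
flat_chain = tuple(chain.from_iterable(pairs))

print(flat_sum)
```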
I am parsing strings containing code like the following. A string can start with empty lines followed by multiple optional patterns. These patterns can be either Python-style inline comments (using a hash # character) or the command "!mycommand", and both must start at the beginning of a line. How can I write a regex that matches up to the start of the code?
mystring = """
# catch this comment
!mycommand
# catch this comment
#catch this comment too
!mycommand
# catch this comment
!mycommand
!mycommand
some code. match until the previous line
# do not catch this comment
!mycommand
# do not catch this comment
"""
import re
pattern = r'^\s*^#.*|!mycommand\s*'
m = re.search(pattern, mystring, re.MULTILINE)
mystring[m.start():m.end()]
mystring = 'code. do not match anything' + mystring
m = re.search(pattern, mystring, re.MULTILINE)
I want the regex to match the string up to "some code. match until the previous line". I tried different things, but I am probably stuck on repeating the two patterns.
Without needing re.MULTILINE, you could repeatedly match 0+ whitespace chars before and after each pattern:
^(?:\s*(?:#.*|!mycommand\s*))+\s*
For example
import re
m = re.search(r'^(?:\s*(?:#.*|!mycommand\s*))+\s*', mystring)
print(m.group())
Your pattern matches one instance of # ... or !mycommand. One way to solve this problem is to put all of them into one match, and use re.search to find the first match.
To do this, you need to repeat the part that matches # ... or !mycommand using *:
^\s*^(?:#.*\s*|!mycommand\s*)*
I have also changed #.* to #.*\s* so that it goes all the way to the next line where a non-whitespace is found.
Responding to your comment:
if the string begins with code, this regex should not match anything
You can try:
\A\s*^(?:#.*\s*|!mycommand\s*)+
I changed to \A so that it only matches the absolute start of the string, instead of start of line. I also changed the last * to + so at least one # ... or !mycommand has to be present.
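A quick check of that behaviour, as a sketch with two made-up inputs (with_header and code_first are illustrative names, not from the question):

```python
import re

# \A pins the match to the very start of the string; with re.MULTILINE the ^
# requires the first comment or command to begin at the start of a line.
header_re = re.compile(r"\A\s*^(?:#.*\s*|!mycommand\s*)+", re.MULTILINE)

with_header = "\n# a comment\n!mycommand\nsome code\n# later comment\n"
code_first = "some code\n# a comment\n!mycommand\n"

m = header_re.search(with_header)
header = m.group() if m else None
no_match = header_re.search(code_first)

print(repr(header))
print(no_match)
```

Because \A can only match at position 0, re.search cannot slide forward to the later comments when the string begins with code.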
Matching and returning the comments at the start of the string
No need for a regex: read the lines and append them to a list until a line that does not start with ! or # occurs, ignoring all blank lines:
mystring = "YOUR_STRING_HERE"
results = []
for line in mystring.splitlines():
    if not line.strip(): # Skip blank lines
        continue
    if not line.startswith('#') and not line.startswith('!'): # Reject if does not start with ! or #
        break
    else:
        results.append(line) # Append comment
print(results)
Results:
['# catch this comment', '!mycommand', '# catch this comment', '#catch this comment too', '!mycommand', '# catch this comment', '!mycommand', '!mycommand']
Removing the comments at the start of the string
results = []
flag = False  # becomes True at the first line of real code
for line in mystring.splitlines():
    if not flag and not line.strip():
        continue
    if not flag and not line.startswith('#') and not line.startswith('!'):
        flag = True
    if flag:
        results.append(line)
print("\n".join(results))
Output:
some code. match until the previous line
# do not catch this comment
!mycommand
# do not catch this comment
Regex approach
import re
print(re.sub(r'^(?:(?:[!#].*)?\n)+', '', mystring))
If there are optional indenting spaces at the start of a line add [^\S\n]*:
print(re.sub(r'^(?:[^\S\n]*(?:[!#].*)?\n)+', '', mystring, count=1))
count=1 makes sure we remove only the first match (there is no need to check all the other lines).
Regex details
^ - start of string
(?:[^\S\n]*(?:[!#].*)?\n)+ - 1 or more occurrences of
    [^\S\n]* - optional horizontal whitespace
    (?:[!#].*)? - an optional sequence of
        [!#] - ! or #
        .* - the rest of the line
    \n - a newline char.
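The breakdown above can be tried on a small input. This is only a sketch; the sample string and the name strip_header are made up for the illustration.

```python
import re

# Hypothetical input: a blank line, a comment, an indented command, then code.
sample = "\n# a comment\n  !mycommand\nsome code\n# keep this comment\n"

# Same pattern as above: leading lines that are blank, or that hold only
# optional indentation followed by a ! or # line.
strip_header = re.compile(r"^(?:[^\S\n]*(?:[!#].*)?\n)+")

remaining = strip_header.sub("", sample, count=1)
print(remaining)
```

The comment after "some code" survives because ^ without re.MULTILINE only matches at the start of the string.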
with open('templates/data.xml', 'r') as s:
    for line in s:
        line = line.rstrip() #removes trailing whitespace and '\n' chars
        if "\\$\\(" not in line:
            if ")" not in line:
                continue
        print(line)
        start = line.index("$(")
        end = line.index(")")
        print(line[start+2:end])
I need to match strings like $(hello), but right now this even matches (hello).
I'm really new to Python. What am I doing wrong here?
Use the following regex:
\$\(([^)]+)\)
It matches $, followed by (, then anything up to the next ), and captures the characters between the parentheses.
Here we escaped the $, ( and ) because when you use a function that accepts a regex (like findall), you don't want $ to be treated as the special character $, but as the literal "$" (the same holds for ( and )). However, note that the inner parentheses are not escaped, since you want to capture the text between the outer parentheses.
Note that you don't need to escape the special characters when you're not using a regex.
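A short sketch of the two cases, using a made-up line: a plain substring test needs no escaping, while the regex does. re.escape is shown only to make the escaped form visible.

```python
import re

# Hypothetical input line
line = "value is $(hello) and (world)"

# Plain string test: "$(" is taken literally, no escaping needed
has_marker = "$(" in line

# Regex: $, ( and ) must be escaped to be matched literally;
# the unescaped inner parentheses form the capture group
names = re.findall(r"\$\(([^)]+)\)", line)

# re.escape builds the escaped form of a literal string
escaped = re.escape("$(")

print(has_marker, names, escaped)
```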
You can do:
>>> import re
>>> escaper = re.compile(r'\$\((.*?)\)')
>>> escaper.findall("I like to say $(hello)")
['hello']
I believe something along the lines of:
import re
data = "$(hello)"
matchObj = re.match(r'\$\(([^)]+)\)', data, re.M|re.I)
print(matchObj.group(1))  # group(1) is the text between the parentheses
might do the trick.
If you don't want to do it with regexes (I wouldn't necessarily; they can be hard to read), a few points about your code:
Your for loop indentation is wrong.
"\\$\\(" means \$\( (you're escaping the backslashes, not the $ and the ().
You don't need to escape $ or (. Just do if "$(" not in line.
You need to check that $( is found before ). Currently your code will match "foo)bar$(baz".
Rather than checking whether $( and ) are in the string twice, it would be better to just call .index() anyway and catch the exception. Something like this:
with open('templates/data.xml', 'r') as s:
    for line in s:
        try:
            start = line.index("$(")
            end = line.index(")", start)  # search for ")" only after "$("
            print(line[start+2:end])
        except ValueError:
            pass
Edit: That will only match one $() per line; you'll want to add a loop.
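One way that loop might look, as a sketch: str.find is used instead of index so that a missing marker ends the loop rather than raising. The helper name extract_all is hypothetical.

```python
def extract_all(line):
    """Return the text of every $(...) occurrence in a line, left to right."""
    found = []
    pos = 0
    while True:
        start = line.find("$(", pos)
        if start == -1:
            break
        end = line.find(")", start)
        if end == -1:
            break  # unterminated "$(": stop rather than guess
        found.append(line[start + 2:end])
        pos = end + 1  # continue searching after this match
    return found


print(extract_all("a $(x) b (y) $(z)"))
```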
I'm having problems with this regex. I want to pull out just MATCH3, because the others, MATCH1 and MATCH2 are commented out.
# url(r'^MATCH1/$',),
#url(r'^MATCH2$',),
url(r'^MATCH3$',), # comment
The regex I have captures all of the MATCH's.
(?<=url\(r'\^)(.*?)(?=\$',)
How do I ignore lines beginning with a comment? With a negative lookahead? Note the # character is not necessarily at the start of the line.
EDIT: sorry, all answers are good! the example forgot a comma after the $' at the end of the match group.
You really don't need to use lookarounds here. You could look for possible leading whitespace and then match "url" and the context that follows it, capturing the part you want to retain.
>>> import re
>>> s = """# url(r'^MATCH1/$',),
#url(r'^MATCH2$',),
url(r'^MATCH3$',), # comment"""
>>> re.findall(r"(?m)^\s*url\(r'\^([^$]+)", s)
['MATCH3']
^\s*#.*$|(?<=url\(r'\^)(.*?)(?=\$'\))
Try this and grab the capture group. See the demo:
https://www.regex101.com/r/rK5lU1/37
import re
p = re.compile(r'^\s*#.*$|(?<=url\(r\'\^)(.*?)(?=\$\'\))', re.IGNORECASE | re.MULTILINE)
test_str = "# url(r'^MATCH1/$'),\n #url(r'^MATCH2$'),\n url(r'^MATCH3$') # comment"
re.findall(p, test_str)
If this is the only place where you need to match, then match beginning of line followed by optional whitespace followed by url:
(?m)^\s*url\(r'(.*?)'\)
If you need to cover more complicated cases, I'd suggest using ast.parse instead, as it truly understands the Python source code parsing rules.
import ast
tree = ast.parse("""(
# url(r'^MATCH1/$'),
#url(r'^MATCH2$'),
url(r'^MATCH3$') # comment
)""")
class UrlCallVisitor(ast.NodeVisitor):
    def visit_Call(self, node):
        if getattr(node.func, 'id', None) == 'url':
            # ast.Constant replaces ast.Str, which was removed in Python 3.12
            first = node.args[0] if node.args else None
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                print(first.value.strip('$^'))
        self.generic_visit(node)

UrlCallVisitor().visit(tree)
This prints the first string literal argument of each call to a function named url; in this case, it prints MATCH3. Note that the source passed to ast.parse needs to be well-formed Python source code (hence the parentheses; otherwise a SyntaxError is raised).
As an alternative, you can split each line on '#'. If the first element contains 'url' (so the call is not commented out), you can use re.search to match the substring that you want:
>>> [re.search(r"url\(r'\^(.*?)\$'" ,i[0]).group(1) for i in [line.split('#') for line in s.split('\n')] if 'url' in i[0]]
['MATCH3']
Also note that you don't need to use look-arounds for your pattern; you can just use grouping!
I'm trying to search through a bunch of large text files for specific information.
#!/usr/bin/env python
# python 3.4
import re
sometext = """
lots
of
text here
Sentinel starts
--------------------
item_one item_one_result
item_two item_two_result
--------------------
lots
more
text here
Sentinel starts
--------------------
item_three item_three_result
item_four item_four_result
item_five item_five_result
--------------------
even
more
text here
Sentinel starts
--------------------
item_six item_six_result
--------------------
"""
sometextpattern = re.compile( '''.*Sentinel\s+starts.*$ # sentinel
^.*-+.*$ # dividing line
^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+ # item details
^.*-+.*$ # dividing line
''', flags = re.MULTILINE | re.VERBOSE)
print( re.findall( sometextpattern, sometext ) )
Individually, the sentinels and dividing lines match on their own. How do I make this work together? i.e. I would like this to print:
[('item_one','item_one_result'),('item_two','item_two_result'),('item_three','item_three_result'),('item_four','item_four_result'),('item_five','item_five_result'),('item_six','item_six_result')]
Try these regexes:
for m in re.findall(r'(?:Sentinel starts\n[-\n]*)([^-]+)', sometext, flags=re.M ):
    print(list(re.findall(r'(\w+)\s+(\w+)', m)))
It gives you a list of key,value tuples:
# [('item_one', 'item_one_result'), ('item_two', 'item_two_result')]
# [('item_three', 'item_three_result'), ('item_four', 'item_four_result')]
Because the text has trailing spaces, change the regex in the for statement for this one:
r'(?:Sentinel starts\s+-*)([^-]*\b)'
The re.MULTILINE flag only makes ^ and $ match at the beginning and end of each line, respectively. If you want a pattern to span multiple lines, you need to add a whitespace metacharacter \s to match the newline between them.
.*Sentinel\s+starts.*$\s
^.*-+.*$\s
^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+
^.*-+.*$
Also, the pattern string you are using is not escaped properly. I would recommend using a raw string (r'...') instead, so that you do not have to escape your backslashes.
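A minimal sketch of the point about re.MULTILINE, on a shortened made-up sample: the anchors alone never consume the newline, so two line patterns placed back to back cannot match until something between them matches it.

```python
import re

# Hypothetical, shortened sample
text = "Sentinel starts\n-----\nitem_one item_one_result\n-----\n"

# $ and ^ are zero-width: nothing here consumes the newline between the lines
no_newline = re.search(r"^Sentinel starts$^-+$", text, re.MULTILINE)

# An explicit \n (or \s) between the two line patterns lets the match continue
with_newline = re.search(r"^Sentinel starts$\n^-+$", text, re.MULTILINE)

print(no_newline)
print(with_newline.group())
```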
With the third-party regex module (the standard re module has no \G), use two capturing groups to get the pairs you want in the list.
>>> import regex
>>> text = """ lots
of
text here
Sentinel starts
--------------------
item_one item_one_result
item_two item_two_result
--------------------
lots
more
text here
Sentinel starts
--------------------
item_three item_three_result
item_four item_four_result
item_five item_five_result
--------------------
even
more
text here
Sentinel starts
--------------------
item_six item_six_result
--------------------"""
>>> regex.findall(r'(?:(?:\bSentinel starts\s*\n\s*-+\n\s*|-+)|(?<!^)\G) *(\w+) *(\w+)\n*', text)
[('item_one', 'item_one_result'), ('item_two', 'item_two_result'), ('item_three', 'item_three_result'), ('item_four', 'item_four_result'), ('item_five', 'item_five_result'), ('item_six', 'item_six_result')]
\s* matches zero or more whitespace characters and \w+ matches one or more word characters. \G asserts the position at the end of the previous match, or at the start of the string for the first match.
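If installing a third-party module is not an option, a line-based state machine using only the standard library is one possible alternative. This is a sketch, not the method used above: the function name collect_items and the state names are made up, and it assumes each item line holds exactly two whitespace-separated fields.

```python
def collect_items(text):
    """Collect (name, value) pairs found between the divider lines that
    follow a 'Sentinel starts' line."""
    items = []
    state = "outside"  # outside -> after_sentinel -> inside -> outside
    for line in text.splitlines():
        stripped = line.strip()
        is_divider = bool(stripped) and set(stripped) == {"-"}
        if stripped == "Sentinel starts":
            state = "after_sentinel"
        elif is_divider:
            # The first divider opens the block, the second one closes it
            state = "inside" if state == "after_sentinel" else "outside"
        elif state == "inside":
            parts = stripped.split()
            if len(parts) == 2:
                items.append((parts[0], parts[1]))
    return items


sample = """intro
Sentinel starts
-----
item_one item_one_result
item_two item_two_result
-----
noise here
Sentinel starts
-----
item_three item_three_result
-----
"""
print(collect_items(sample))
```

Stripping each line first means trailing spaces in the input do not affect the result.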
I need to come up with a regular expression for a mini-project.
The string should not start with:
"/wiki"
and it should also not have the following pattern
"/.*:.*" (basically pattern starts with char '/' and there is any occurrence of ':' after that)
and it also cannot have a certain character '#'
So basically all these strings would fail:
"/wiki/index.php?title=ROM/TAP&action=edit&section=2"
"/User:romamns"
"/Special:Watchlist"
"/Space_Wiki:Privacy_policy"
"#column-one"
And all these string would pass:
"/ROM/TAP/mouse"
"http://www.boost.org/"
I will be using the regex in python (if that makes any difference).
Thanks for any help.
^(/(?!wiki)[^:#]*|[^#/][^#]*)$ should be ok, as tested here, of course I might be missing something, but this appears to follow your specification.
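A sketch that runs this pattern against the strings from the question (the list names should_fail and should_pass are made up for the example):

```python
import re

link_re = re.compile(r"^(/(?!wiki)[^:#]*|[^#/][^#]*)$")

should_fail = [
    "/wiki/index.php?title=ROM/TAP&action=edit&section=2",
    "/User:romamns",
    "/Special:Watchlist",
    "/Space_Wiki:Privacy_policy",
    "#column-one",
]
should_pass = [
    "/ROM/TAP/mouse",
    "http://www.boost.org/",
]

# Map each string to whether the pattern accepts it
results = {s: bool(link_re.match(s)) for s in should_fail + should_pass}
for s, ok in results.items():
    print("%-55s %s" % (s, "pass" if ok else "fail"))
```

The first alternative handles strings starting with /, and the second handles everything else that contains no #.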
This tested script implements a commented regex which precisely matches your stated requirements:
import re
def check_str(subject):
    """Return True if subject matches."""
    reobj = re.compile(
        """ # Match special string
        (?!/wiki) # Does not start with /wiki.
        (?![^/]*/[^:]*:) # Does not have : following /
        [^#]* # Match whole string having no #
        $ # Anchor to end of string.
        """,
        re.IGNORECASE | re.MULTILINE | re.VERBOSE)
    # match() anchors at the start of the string
    return reobj.match(subject) is not None
data_list = [
r"/wiki/index.php?title=ROM/TAP&action=edit&section=2",
r"/User:romamns",
r"/Special:Watchlist",
r"/Space_Wiki:Privacy_policy",
r"#column-one",
r"/ROM/TAP/mouse",
r"http://www.boost.org/",
]
cnt = 0
for data in data_list:
    cnt += 1
    print("Data[%d] = \"%s\"" %
        (cnt, check_str(data)))
If a string matches the following regular expression, then it should be rejected:
^(\/wiki|.*?[\:#])