I'm trying to search through a bunch of large text files for specific information.
#!/usr/bin/env python
# python 3.4
import re
sometext = """
lots
of
text here
Sentinel starts
--------------------
item_one item_one_result
item_two item_two_result
--------------------
lots
more
text here
Sentinel starts
--------------------
item_three item_three_result
item_four item_four_result
item_five item_five_result
--------------------
even
more
text here
Sentinel starts
--------------------
item_six item_six_result
--------------------
"""
sometextpattern = re.compile( '''.*Sentinel\s+starts.*$ # sentinel
^.*-+.*$ # dividing line
^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+ # item details
^.*-+.*$ # dividing line
''', flags = re.MULTILINE | re.VERBOSE)
print( re.findall( sometextpattern, sometext ) )
Individually, the sentinel and dividing-line patterns match on their own. How do I make them work together? That is, I would like this to print:
[('item_one','item_one_result'),('item_two','item_two_result'),('item_three','item_three_result'),('item_four','item_four_result'),('item_five','item_five_results'),('item_six','item_six_results')]
Try these regexes:
for m in re.findall(r'(?:Sentinel starts\n[-\n]*)([^-]+)', sometext, flags=re.M ):
    print(list(re.findall(r'(\w+)\s+(\w+)', m)))
It gives you a list of (key, value) tuples for each block:
# [('item_one', 'item_one_result'), ('item_two', 'item_two_result')]
# [('item_three', 'item_three_result'), ('item_four', 'item_four_result')]
Because the text has trailing spaces, change the regex in the for statement to this one:
r'(?:Sentinel starts\s+-*)([^-]*\b)'
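Putting the two pieces together, here is a minimal sketch (reusing sometext from the question and the adjusted outer regex) that flattens the per-block results into the single list asked for:
import re
results = []
for block in re.findall(r'(?:Sentinel starts\s+-*)([^-]*\b)', sometext):
    results.extend(re.findall(r'(\w+)\s+(\w+)', block))
print(results)  # all six (itemname, itemvalue) pairs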
The re.MULTILINE flag only makes ^ and $ match at the beginning and end of each line, respectively. If you want to match across multiple lines, you will need to add a whitespace metacharacter \s to consume the newline.
.*Sentinel\s+starts.*$\s
^.*-+.*$\s
^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+
^.*-+.*$
Also the string you are using does not have the required string escaping. I would recommend using the r'' type string instead. That way you do not have to escape your backslashes.
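As a quick illustration of the raw-string point (plain Python string escaping, nothing regex-specific):
print(len('\n'), len(r'\n'))  # 1 2 -> '\n' is a newline, r'\n' is a backslash followed by 'n'
print('\s' == '\\s')          # True only because \s is not a recognized escape; newer Pythons warn about it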
Use two capturing groups in order to print the text you want inside the list.
>>> import regex
>>> text = """ lots
of
text here
Sentinel starts
--------------------
item_one item_one_result
item_two item_two_result
--------------------
lots
more
text here
Sentinel starts
--------------------
item_three item_three_result
item_four item_four_result
item_five item_five_result
--------------------
even
more
text here
Sentinel starts
--------------------
item_six item_six_result
--------------------"""
>>> regex.findall(r'(?:(?:\bSentinel starts\s*\n\s*-+\n\s*|-+)|(?<!^)\G) *(\w+) *(\w+)\n*', text)
[('item_one', 'item_one_result'), ('item_two', 'item_two_result'), ('item_three', 'item_three_result'), ('item_four', 'item_four_result'), ('item_five', 'item_five_result'), ('item_six', 'item_six_result')]
\s* matches zero or more whitespace characters and \w+ matches one or more word characters. \G asserts position at the end of the previous match, or at the start of the string for the first match. Note that \G is supported by the third-party regex module used here, but not by the standard library re module.
I'm supposed to extract groups of text from a file with a top ten list: name, rank, etc. for each. You can see the file and the regex here https://regex101.com/r/fXK5YV/1. It works in there and you can see the capturing groups.
import re
pattern = '''
(?P<list><li\sclass="regular-search-result">(.|\n)*?(?<=\<span class=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span class=\"review-count rating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.match(yelp_html)
This prints None.
There is definitely text inside of yelp_html.
What am I doing wrong?
I see two issues:
You're not using a raw string (prefix the string with an r), which means that your backslashes are going to try to represent special things instead of being part of the string.
I believe your multiline string is going to try to match the newlines between the lines and the leading spaces on each line as part of your regex (which you don't want, given that this is not how the regex is formatted in your link).
import re
pattern = r'''
(?P<list><li\sclass=\"regular-search-result\">(.|\n)*?(?<=\<span\sclass=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span\sclass=\"review-count\srating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})
(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.finditer(yelp_html)
for item in matches:
    print(item.group('rank', 'name', 'stars', 'numrevs', 'price'))
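If you would rather collect the fields than print them, here is a small sketch of a variation (yelp_html is the same variable assumed above):
results = [m.groupdict() for m in pattern_matcher.finditer(yelp_html)]
# each dict holds every named group, including the large 'list' group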
I tried to get a dictionary from a multi-line string using regex, but I have a problem with properly separating the lines.
Here is what I have tried...
import re
text = '''\n\n\nName: Clash1\nDistance: -1.274m\nImage Location: navis_raport_txt_files\\cd000001.jpg\nHardStatus: New\nClash Point: 1585.236m, 193.413m'''
clash_data = re.compile('''
(?P<clash_number>Clash\d+)\n
(?P<clash_depth>\d.\d{3})\n
(?P<image_location>cd\d+.jpg)\n
(?P<clash_status>\w{2:})\n
(?P<clash_point>.*)\n
(?P<clash_grid>\w+-\d+)\n
(?P<clash_date>.*)''', re.I | re.VERBOSE)
print(clash_data.search(text).groupdict())
This similar example works well:
import re
MHP = ['''MHP-PW-K_SZ-117-R01-UZ-01 - drawing title 123''',
'MHP-PW-K_SZ-127-R01WIP - drawing title 2',
'MHP-PW-K_SZ-107-R03-UZ-1 - drawing title 3']
fields_from_name = re.compile('''
(?P<object>\w{3})[-_]
(?P<phase>\w{2})[-_]
(?P<field>\w)[-_]
(?P<type>\w{2})[-_]
(?P<dr_number>\d{3})[-_]
[-_]?
(?P<revision>\w\d{2})?
(?P<wip_status>WIP)?
[-_]?
(?P<suplement>UZ-\d+)?
[\s-]+
(?P<drawing_title>.*)
''', re.IGNORECASE | re.VERBOSE)
for name in MHP:
    print(fields_from_name.search(name).groupdict())
Why doesn't my attempt work like the example?
It is not working simply because Pattern.search() is not finding a match. Based on the working example you are mimicking, you need to also match the characters between the named capture groups that you want in your output dict (so that the entire pattern returns a match).
Following is an example using .*\n.* as a bit of a brute force way to bridge the gap between your capture groups by matching any non-newline characters after the last capture group, then matching the newline, and then matching any non-newline characters that precede the next capture group (you will probably want to be more precise than this, but it demonstrates the issue). I only included your first 3 groups because I wasn't following what you intended with the regex in your <clash_status> group.
import re
text = '\n\n\nName: Clash1\nDistance: -1.274m\nImage Location: navis_raport_txt_files\\cd000001.jpg\nHardStatus: New\nClash Point: 1585.236m, 193.413m'
clash_data = re.compile(r'(?P<clash_number>Clash\d+).*\n.*'
r'(?P<clash_depth>\d.\d{3}).*\n.*'
r'(?P<image_location>cd\d+.jpg)', re.I | re.VERBOSE)
result = clash_data.search(text).groupdict()
print(result)
# OUTPUT
# {'clash_number': 'Clash1', 'clash_depth': '1.274', 'image_location': 'cd000001.jpg'}
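If you also want the remaining fields, here is a hedged sketch that anchors each group on the labels visible in the sample text instead of bridging with .*\n.* (group names are carried over from the question; adjust the value patterns to your real data):
clash_data = re.compile(
    r'Name:\s*(?P<clash_number>Clash\d+)\s*\n'
    r'Distance:\s*(?P<clash_depth>-?\d+\.\d+)m\s*\n'
    r'Image Location:.*?(?P<image_location>cd\d+\.jpg)\s*\n'
    r'HardStatus:\s*(?P<clash_status>\w+)\s*\n'
    r'Clash Point:\s*(?P<clash_point>.*)', re.I)
print(clash_data.search(text).groupdict())
# {'clash_number': 'Clash1', 'clash_depth': '-1.274', 'image_location': 'cd000001.jpg',
#  'clash_status': 'New', 'clash_point': '1585.236m, 193.413m'}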
The Setup:
Let's say I have the following regex defined in my script. I want to keep the comments there for future me because I'm quite forgetful.
RE_TEST = re.compile(r"""[0-9] # 1 Number
[A-Z] # 1 Uppercase Letter
[a-y] # 1 lowercase, but not z
z # gotta have z...
""",
re.VERBOSE)
print(magic_function(RE_TEST)) # returns: "[0-9][A-Z][a-y]z"
The Question:
Does Python (3.4+) have a way to convert that to the simple string "[0-9][A-Z][a-y]z"?
Possible Solutions:
This question ("strip a verbose python regex") seems to be pretty close to what I'm asking for and it was answered. But that was a few years ago, so I'm wondering if a new (preferably built-in) solution has been found.
In addition to the above, there are work-arounds such as using implicit string concatenation and then using the .pattern attribute:
RE_TEST = re.compile(r"[0-9]" # 1 Number
r"[A-Z]" # 1 Uppercase Letter
r"[a-y]" # 1 lowercase, but not z
r"z", # gotta have z...
re.VERBOSE)
print(RE_TEST.pattern) # returns: "[0-9][A-Z][a-y]z"
or just commenting the pattern separately and not compiling it:
# matches pattern "nXxz"
RE_TEST = "[0-9][A-Z][a-y]z"
print(RE_TEST)
But I'd really like to keep the compiled regex the way it is (1st example). Perhaps I'm pulling the regex string from some file, and that file is already using the verbose form.
Background
I'm asking because I want to suggest an edit to the unittest module.
Right now, if you run assertRegex(string, pattern) using a compiled pattern with comments and that assertion fails, then the printed output is somewhat ugly (the below is a dummy regex):
Traceback (most recent call last):
File "verify_yaml.py", line 113, in test_verify_mask_names
self.assertRegex(mask, RE_MASK)
AssertionError: Regex didn't match: '(X[1-9]X[0-9]{2}) # comment\n |(XXX[0-9]{2}) # comment\n |(XXXX[0-9E]) # comment\n |(XXXX[O1-9]) # c
omment\n |(XXX[0-9][0-9]) # comment\n |(XXXX[
1-9]) # comment\n ' not found in 'string'
I'm going to propose that the assertRegex and assertNotRegex methods clean the regex before printing it, either by removing the comments and extra whitespace or by printing it differently.
The following tested script includes a function, pcre_detidy(retext), that does a pretty good job of converting an xmode regex string to non-xmode.
# Function pcre_detidy to convert xmode regex string to non-xmode.
# Rev: 20160225_1800
import re
def detidy_cb(m):
    # Return the parts to keep (group 2, or the escaped # / space from group 3, minus its escape);
    # discardable whitespace and comments (group 1) become the empty string.
    if m.group(2): return m.group(2)
    if m.group(3): return m.group(3)
    return ""
def pcre_detidy(retext):
    decomment = re.compile(r"""(?#!py/mx decomment Rev:20160225_1800)
# Discard whitespace, comments and the escapes of escaped spaces and hashes.
( (?: \s+ # Either g1of3 $1: Stuff to discard (3 types). Either ws,
| \#.* # or comments,
| \\(?=[\r\n]|$) # or lone escape at EOL/EOS.
)+ # End one or more from 3 discardables.
) # End $1: Stuff to discard.
| ( [^\[(\s#\\]+ # Or g2of3 $2: Stuff to keep. Either non-[(\s# \\.
| \\[^# Q\r\n] # Or escaped-anything-but: hash, space, Q or EOL.
| \( # Or an open parentheses, optionally
(?:\?\#[^)]*(?:\)|$))? # starting a (?# Comment group).
| \[\^?\]? [^\[\]\\]* # Or Character class. Allow unescaped ] if first char.
(?:\\[^Q][^\[\]\\]*)* # {normal*} Zero or more non-[], non-escaped-Q.
(?: # Begin unrolling loop {((special1|2) normal*)*}.
(?: \[(?::\^?\w+:\])? # Either special1: "[", optional [:POSIX:] char class.
| \\Q [^\\]* # Or special2: \Q..\E literal text. Begin with \Q.
(?:\\(?!E)[^\\]*)* # \Q..\E contents - everything up to \E.
(?:\\E|$) # \Q..\E literal text ends with \E or EOL.
) [^\[\]\\]* # End special: One of 2 alternatives {(special1|2)}.
(?:\\[^Q][^\[\]\\]*)* # More {normal*} Zero or more non-[], non-escaped-Q.
)* (?:\]|\\?$) # End character class with ']' or EOL (or \\EOL).
| \\Q [^\\]* # Or \Q..\E literal text start delimiter.
(?:\\(?!E)[^\\]*)* # \Q..\E contents - everything up to \E.
(?:\\E|$) # \Q..\E literal text ends with \E or EOL.
) # End $2: Stuff to keep.
| \\([# ]) # Or g3of3 $6: Escaped-[hash|space], discard the escape.
""", re.VERBOSE | re.MULTILINE)
    return re.sub(decomment, detidy_cb, retext)
test_text = r"""
[0-9] # 1 Number
[A-Z] # 1 Uppercase Letter
[a-y] # 1 lowercase, but not z
z # gotta have z...
"""
print(pcre_detidy(test_text))
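Tying this back to the question, a usage sketch (RE_TEST being the verbose pattern compiled in the first example):
print(pcre_detidy(RE_TEST.pattern))  # expected: [0-9][A-Z][a-y]z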
This function detidies regexes written in pcre-8/pcre2-10 xmode syntax.
It preserves whitespace inside [character classes], (?#comment groups) and \Q...\E literal text spans.
RegexTidy
The above decomment regex is a variant of one I am using in my upcoming, yet-to-be-released RegexTidy application, which will not only detidy a regex as shown above (which is pretty easy to do), but will also go the other way and tidy a regex, i.e. convert it from non-xmode to xmode syntax, adding whitespace indentation to nested groups as well as adding comments (which is harder).
P.S. Before downvoting this answer on general principle because it uses a regex longer than a couple of lines, please add a comment describing one example that is not handled correctly. Cheers!
Looking through the way sre_parse handles this, there really isn't any point where your verbose regex gets "converted" into a regular one and then parsed. Rather, your verbose regex is being fed directly to the parser, where the presence of the VERBOSE flag makes it ignore unescaped whitespace outside character classes, and from unescaped # to end-of-line if it is not inside a character class or a capture group (which is missing from the docs).
The outcome of parsing your verbose regex there is not "[0-9][A-Z][a-y]z". Rather it is:
[(IN, [(RANGE, (48, 57))]), (IN, [(RANGE, (65, 90))]), (IN, [(RANGE, (97, 121))]), (LITERAL, 122)]
In order to do a proper job of converting your verbose regex to "[0-9][A-Z][a-y]z", you could parse it yourself, for example with a library like pyparsing. The answer linked in your question uses a regex, which will generally not duplicate the behavior correctly (specifically, for spaces inside character classes and # inside capture groups/character classes), and even just dealing with escaping is not as convenient as with a good parser.
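If you just want to look at that parsed form yourself, here is a short sketch using the undocumented sre_parse module (an internal detail, renamed re._parser in newer CPython, so treat it as unstable):
import re
import sre_parse  # undocumented internal module; subject to change
parsed = sre_parse.parse(RE_TEST.pattern, re.VERBOSE)
parsed.dump()       # pretty-printed parse tree
print(parsed.data)  # the raw (opcode, argument) tuples, roughly as quoted above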
I have a piece of code which extracts a string that lies between two strings. However, this script performs the operation only on a single line. I want to perform it on a complete file and get a list of all the words lying between those two words.
Note: the two words are fixed. For example, if my code contains something like
'const int variablename=1'
then I want a list of all the words in the file lying between 'int' and '='.
Here is the current script:
s = 'const int variablename = 1'
k = s[s.find('int')+4:s.find('=')]
print(k)
If the file fits comfortably into memory, you can get this with a single regex call:
import re
regex = re.compile(
r"""(?x)
(?<= # Assert that the text before the current location is:
\b # word boundary
int # "int"
\s # whitespace
) # End of lookbehind
[^=]* # Match any number of characters except =
(?<!\s) # Assert that the previous character isn't whitespace.
(?= # Assert that the following text is:
\s* # optional whitespace
= # "="
) # end of lookahead""")
with open(filename) as fn:
    text = fn.read()
matches = regex.findall(text)
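As a quick check, the same pattern applied to the single line from the question:
print(regex.findall('const int variablename = 1'))  # ['variablename']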
If there can be only one word between int and =, then the regex is a bit more simple:
regex = re.compile(
r"""(?x)
(?<= # Assert that the text before the current location is:
\b # word boundary
int # "int"
\s # whitespace
) # End of lookbehind
[^=\s]* # Match any number of characters except = or space
(?= # Assert that the following text is:
\s* # optional whitespace
= # "="
) # end of lookahead""")
with open(filename) as fn:
    for row in fn:
        match = regex.search(row)
        if match:
            print(match.group())  # do something with the row's match
I would use regular expressions over the whole text (you can do it over one line too). This prints the strings between "int " and "=":
import re
text = open('example.txt').read()
print(re.findall(r'(?<=int\s).*?(?=\=)', text))
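As a quick check against the single line from the question (note the trailing space in the result, since the lazy match stops only at the '='):
print(re.findall(r'(?<=int\s).*?(?=\=)', 'const int variablename = 1'))
# ['variablename ']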
If you want a quick and dirty way and you're on a Unix-like system, I would just use grep on the file, then split each matching line to pick out the pattern and the data I want.
I'm supposed to capture everything inside a tag and the lines that follow it, but it's supposed to stop the next time it meets a bracket. What am I doing wrong?
import re #regex
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
(\b(?:.|\s)*(?!\[)) # should read: anyword that doesn't precede a bracket
""", re.MULTILINE | re.VERBOSE)
haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
#[this should be taken though as this is in the content]
[tab2]
help me
write a better RE
"""
m = regex.findall(haystack)
print(m)
What I'm trying to get is:
[('tab1', 'this is captured\nbut this is suppose to be captured too!\n#[this should be taken though as this is in the content]\n', '[tab2]','help me\nwrite a better RE\n')]
edit:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
This seems to work, but it also truncates the content at any bracket that appears inside it.
Python's re module doesn't support recursion, AFAIK.
EDIT: but in your case this would work:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
EDIT 2: yes, it doesn't work properly.
import re
regex = re.compile(r"""
(?:^|\n)\[ # tag's opening bracket
([^\]\n]*) # 1. text between brackets
\]\n # tag's closing bracket
(.*?) # 2. text between the tags
(?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
""", re.DOTALL | re.VERBOSE)
haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag
[tag2]
help me
write a better RE[[[]
"""
print(regex.findall(haystack))
I do agree with viraptor though. Regexes are cool, but you can't check your file for errors with them. A hybrid, perhaps? :P
tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))
result = {}
for (mo1, mo2) in zip(tags[:-1], tags[1:]):
    result[mo1.group(1)] = haystack[mo1.end(1)+1:mo2.start(1)-1].strip()
result[mo2.group(1)] = haystack[mo2.end(1)+1:].strip()
print(result)
EDIT 3: That's because the ^ character means negation only inside [^square brackets]. Everywhere else it means start of string (or start of line with re.MULTILINE). There's no good way to negatively match a whole string in regex, only a single character.
First of all, why a regex if you're trying to parse? As you can see, you cannot find the source of the problem yourself, because a regex gives you no feedback. Also, you don't have any recursion in that RE.
Make your life simple:
def ini_parse(src):
    in_block = None
    contents = {}
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            in_block = line[1:len(line)-1]
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            raise Exception("content out of block")
    return contents
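A quick usage line with the haystack from the question:
print(ini_parse(haystack))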
You get error handling with exceptions and the ability to debug execution as a bonus. Also you get a dictionary as a result and can handle duplicate sections while processing. My result:
{'tab2': 'help me\nwrite a better RE\n\n',
'tab1': 'this is captured\nbut this is suppose to be captured too!\n#[this should be taken though as this is in the content]\n\n'}
RE is much overused these days...
Does this do what you want?
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^[]*)
""", re.MULTILINE | re.VERBOSE)
This gives a list of tuples (one 2-tuple per match). If you want a flattened tuple you can write:
m = sum(regex.findall(haystack), ())
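If you would rather avoid the repeated tuple concatenation that sum() implies, itertools.chain does the same flattening:
from itertools import chain
m = tuple(chain.from_iterable(regex.findall(haystack)))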