Related to: Python parsing bracketed blocks
I have a file with the following format:
#
here
are
some
strings
#
and
some
others
    #
    with
    different
    levels
    #
    of
        #
        indentation
        #
    #
#
So a block is defined by a starting # and a trailing #. However, the trailing # of the (n-1)th block is also the starting # of the nth block.
I am trying to write a function that, given this format, would retrieve the content of each block, and that would also work recursively.
I first tried regexes but abandoned them quite fast (I think you can guess why), so I moved on to pyparsing, yet I can't simply write
print(nestedExpr('#','#').parseString(my_string).asList())
because it raises a ValueError exception (ValueError: opening and closing strings cannot be the same).
Knowing that I cannot change the input format, do I have any better option than pyparsing for this one?
I also tried using this answer: https://stackoverflow.com/a/1652856/740316, replacing the {/} with #/#, yet it fails to parse the string.
Unfortunately (for you), your grouping does not depend only on the separating '#' characters, but also on the indent levels (otherwise, ['with','different','levels'] would be at the same level as the previous group ['and','some','others']). Parsing indentation-sensitive grammars is not a strong suit for pyparsing - it can be done, but it is not pleasant. To do so, we will use the pyparsing helper method indentedBlock, which also requires that we define a list variable that indentedBlock can use for its indentation stack.
See the embedded comments in the code below to see how you might use one approach with pyparsing and indentedBlock:
from pyparsing import *
test = """\
#
here
are
some
strings
#
and
some
others
    #
    with
    different
    levels
    #
    of
        #
        indentation
        #
    #
#"""
# newlines are significant for line separators, so redefine
# the default whitespace characters for whitespace skipping
ParserElement.setDefaultWhitespaceChars(' ')
NL = LineEnd().suppress()
HASH = '#'
HASH_SEP = Suppress(HASH + Optional(NL))
# a normal line contains a single word
word_line = Word(alphas) + NL
indent_stack = [1]
# word_block is recursive, since word_blocks can contain word_blocks
word_block = Forward()
word_group = Group(OneOrMore(word_line | ungroup(indentedBlock(word_block, indent_stack))))
# now define a word_block, as a '#'-delimited list of word_groups, with
# leading and trailing '#' characters
word_block <<= (HASH_SEP +
                delimitedList(word_group, delim=HASH_SEP) +
                HASH_SEP)
# the overall expression is one large word_block
parser = word_block
# parse the test string
parser.parseString(test).pprint()
Prints:
[['here', 'are', 'some', 'strings'],
 ['and',
  'some',
  'others',
  [['with', 'different', 'levels'], ['of', [['indentation']]]]]]
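Since parseString returns nested ParseResults, retrieving each block's content recursively is then just a matter of walking nested lists. A minimal sketch (my addition, not part of the parser itself):
# recursively visit the nested result, printing each word at its block depth
def walk(block, depth=0):
    for item in block:
        if isinstance(item, list):
            walk(item, depth + 1)
        else:
            print('    ' * depth + item)

walk(parser.parseString(test).asList())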
I am attempting to write a program that validates documents written in a markup language similar to BBcode.
This markup language has both matching ([b]bold[/b] text) and non-matching (today is [date]) tags. Unfortunately, using a different markup language is not an option.
However, my regex is not acting the way I want it to. It seems to always stop at the first matching closing tag instead of identifying that nested tag with the recursive (?R).
I am using the regex module, which supports (?R), and not re.
My questions are:
How can I effectively use a recursive regex to match nested tags without terminating on the first tag?
If there's a better method than a regular expression, what is that method?
Here is the regex once I build it:
\[(b|i|u|h1|h2|h3|large|small|list|table|grid)\](?:((?!\[\/\1\]).)*?|(?R))*\[\/\1\]
Here is a test string that doesn't work as expected:
[large]test1 [large]test2[/large] test3[/large] (it should match this whole string but stops before test3)
Here is the regex on regex101.com: https://regex101.com/r/laJSLZ/1
This test doesn't need to finish in milliseconds or even seconds, but it does need to be able to validate about 100 files of 1,000 to 10,000 characters each in a time that is reasonable for a Travis-CI build.
Here is what the logic using this regex looks like, for context:
import io, regex  # https://pypi.org/project/regex/

# All the tags that must have opening and closing tags
matching_tags = 'b', 'i', 'u', 'h1', 'h2', 'h3', 'large', 'small', 'list', 'table', 'grid'

# our first part matches an opening tag:
#   \[(b|i|u|h1|h2|h3|large|small|list|table|grid)\]
# our middle part matches the text in the middle, including any properly formed tag sets in between:
#   (?:((?!\[\/\1\]).)*?|(?R))*
# our last part matches the closing tag for our first match:
#   \[\/\1\]
pattern = r'\[(' + '|'.join(matching_tags) + r')\](?:((?!\[\/\1\]).)*?|(?R))*\[\/\1\]'
myRegex = regex.compile(pattern)  # regex.compile, not re.compile: (?R) needs the regex module

data = ''
with open('input.txt', 'r') as file:
    data = '[br]'.join(file.readlines())

def validate(text):
    valid = True
    for node in all_nodes(text):
        valid = valid and is_valid(node)
    return valid

# (Only important thing here is that I call this on every node; this
# should work fine but the regex to get me those nodes does not.)
# markup should be valid iff opening and closing tag counts are equal
# in the whole file, in each matching top-level pair of tags, and in
# each child all the way down to the smallest unit (a string that has
# no tags at all)
def is_valid(text):
    valid = True
    for tag in matching_tags:
        valid = valid and text.count(f'[{tag}]') == text.count(f'[/{tag}]')
    return valid

# this returns each child of the text given to it
# this call:
#   all_nodes('[b]some [large]text to[/large] validate [i]with [u]regex[/u]![/i] love[/b] to use [b]regex to [i]do stuff[/i][/b]')
# should return a list containing these strings:
#   [b]some [large]text to[/large] validate [i]with [u]regex[/u]![/i] love[/b]
#   [large]text to[/large]
#   [i]with [u]regex[/u]![/i]
#   [u]regex[/u]
#   [b]regex to [i]do stuff[/i][/b]
#   [i]do stuff[/i]
def all_nodes(text):
    result = []  # was missing; accumulate the recursive results here
    matches = myRegex.findall(text)
    if len(matches) > 0:
        for m in matches:
            result += all_nodes(m)
    return result

exit(0 if validate(data) else 1)
Your main issue is within the ((?!\[\/\1\]).)*? tempered greedy token.
First, it is inefficient: you quantify the tempered token and then quantify the group it sits in, which makes the regex engine try many more ways to match the string and makes the pattern rather fragile.
Second, the tempered token only stops before the closing tag; you did not restrict the opening tag at all. The first step is to make the / before \1 optional, \/?, so that the tempered token also stops before [tag]-style opening tags with no attributes. To add attribute support, add an optional group after \1, (?:\s[^]]*)?, which matches an optional sequence of a whitespace character and then any zero or more chars other than ].
The fixed regex will look like this:
\[([biu]|h[123]|l(?:arge|ist)|small|table|grid)](?:(?!\[/?\1(?:\s[^]]*)?]).|(?R))*\[/\1]
Do not forget to compile it with regex.DOTALL to match across multiple newlines.
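A quick way to check the fix against the failing test string from the question (a sketch; fullmatch and the variable names are mine):
import regex  # the PyPI regex module, which supports (?R)

pattern = regex.compile(
    r'\[([biu]|h[123]|l(?:arge|ist)|small|table|grid)]'
    r'(?:(?!\[/?\1(?:\s[^]]*)?]).|(?R))*'
    r'\[/\1]',
    regex.DOTALL)

test = '[large]test1 [large]test2[/large] test3[/large]'
m = pattern.fullmatch(test)
print(m.group() if m else 'no match')  # matches the whole string, including test3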
I have a list of regexes in string form (created by parsing natural-language text that contained search queries). I now want to use them for searching text. Here is how I am doing it right now:
# given that regex_list=["r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'", "r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'"....]
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
    mo = re.search(string_regex % gs1, sent, re.I)
    if mo:
        print(mo.group())
What I need is to be able to use these string regexes, but also have Python's raw-literal treatment apply to them, as we all should for regex patterns. Now, about these expressions: I have natural-text search commands like
LINE_CONTAINS foo(+)
which I use pyparsing to convert, based on a grammar, into a regex like r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'. I send a list of these human-readable rules to the pyparsing code and it gives me back a list of ~100 such regexes, constructed as plain strings.
This is the MCVE version of the code that generates these strings that are supposed to act as regexes:
from pyparsing import *
import re

def parse_hrr(received_sentences):
    # BEFORE, AFTER and JOIN added to the Literal map so this MCVE runs standalone
    UPTO, AND, OR, WORDS, CHARACTERS, BEFORE, AFTER, JOIN = map(
        Literal, "UPTO AND OR WORDS CHARACTERS BEFORE AFTER JOIN".split())
    LBRACE, RBRACE = map(Suppress, "{}")
    integer = pyparsing_common.integer()
    LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Literal,
        """LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split())  # put option for LINE_ENDSWITH. Users may use, I don't presently
    keyword = UPTO | WORDS | AND | OR | BEFORE | AFTER | JOIN | LINE_CONTAINS | PARA_STARTSWITH

    class Node(object):
        def __init__(self, tokens):
            self.tokens = tokens

        def generate(self):
            pass

    class LiteralNode(Node):
        def generate(self):
            # merge the elements first, so that re.escape does not have to escape the entire list
            return "(%s)" % (re.escape(''.join(self.tokens[0])))

        def __repr__(self):
            return repr(self.tokens[0])

    class ConsecutivePhrases(Node):
        def generate(self):
            join_these = []
            tokens = self.tokens[0]
            for t in tokens:
                tg = t.generate()
                join_these.append(tg)
            seq = []
            for word in join_these[:-1]:
                # or if the first part of the regex is in word:
                if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word):
                    seq.append(word + "")
                else:
                    seq.append(word + r"\s+")
            seq.append(join_these[-1])
            result = "".join(seq)
            return result

    class AndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            join_these = []
            for t in tokens[::2]:
                tg = t.generate()
                tg_mod = tg[0] + r'?=.*\b' + tg[1:][:-1] + r'\b)'  # to place the regex commands at the right place
                join_these.append(tg_mod)
            joined = ''.join(ele for ele in join_these)
            full = '(' + joined + ')'
            return full

    class OrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            joined = '|'.join(t.generate() for t in tokens[::2])
            full = '(' + joined + ')'
            return full

    class LineTermNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            dir_phr_map = {
                'LINE_CONTAINS': lambda a: r"((?:(?<=[\W_])" + a + r"(?=[\W_]|$))456",  # %gs1, sent, re.I)"
                'PARA_STARTSWITH':
                    lambda a: ("r'(^" + a + r"(?=[\W_]|$))' 457") if 'gene' in repr(a)  # %gs1, s, re.I)"
                    else ("r'(^" + a + r"(?=[\W_]|$))' 458")}  # , s, re.I
            for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
                ret = dir_phr_map[line_dir](phr_term.generate())
            return ret

    ## THE GRAMMAR
    word = ~keyword + Word(alphas, alphanums + '-_+/()')
    some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
    phrase_item = some_words
    # structure of a single phrase with its operators
    phrase_expr = infixNotation(phrase_item,
                                [
                                    (None, 2, opAssoc.LEFT, ConsecutivePhrases),
                                    (AND, 2, opAssoc.LEFT, AndNode),
                                    (OR, 2, opAssoc.LEFT, OrNode),
                                ],
                                lpar=Suppress('{'), rpar=Suppress('}'))
    # basically giving structure to a single sub-rule having line-term and phrase
    line_term = Group((LINE_CONTAINS | PARA_STARTSWITH)("line_directive") +
                      (phrase_expr)("phrases"))
    line_contents_expr = line_term.setParseAction(LineTermNode)
    ###########################################################################################
    mrrlist = []
    for t in received_sentences:
        t = t.strip()
        try:
            parsed = line_contents_expr.parseString(t)
            temp_regex = parsed[0].generate()
            mrrlist.append(temp_regex)
        except ParseException:  # added so the bare try compiles; skip unparseable rules
            pass
    return mrrlist
So basically, the code strings the regex together. Then I add the necessary parameters like re.search, %gs1, etc. to form the complete search query. I want to be able to use these string regexes for searching; I had earlier thought eval() would convert each string to its corresponding Python expression, which is why I used it - I was wrong.
TL;DR - I have a list of strings that were built up in the source code, and I want to be able to use them as regexes, with Python's raw-literal treatment applied.
Your issue seems to stem from a misunderstanding of what raw string literals do and what they're for. There's no magic raw string type. A raw string literal is just another way of creating a normal string. A raw literal just gets parsed a little bit differently.
For instance, the raw string r"\(foo\)" can also be written "\\(foo\\)". The doubled backslashes tell Python's regular string-parsing algorithm that you want an actual backslash character in the string, rather than the backslash in the literal being part of an escape sequence that gets replaced by a special character. The raw-string algorithm doesn't need the extra backslashes since it never replaces escape sequences.
However, in this particular case the special treatment is not actually necessary, since the \( and \) are not meaningful escape sequences in a Python string. When Python sees an invalid escape sequence, it just includes it literally (backslash and all). So you could also use "\(foo\)" (without the r prefix) and it will work just fine too.
But it's generally not a good idea to rely on backslashes being ignored, since if you edit the string later you might inadvertently add an escape sequence that Python does understand (when you really wanted the raw, un-transformed version). Since regex syntax has a number of its own escape sequences that are also escape sequences in Python (but with different meanings, such as \b and \1), it's best practice to always write regex patterns with raw strings to avoid introducing issues when editing them.
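A quick demonstration of the difference (my example, not from your code):
print(r"\(foo\)" == "\\(foo\\)")  # True: identical strings, different literals
print(len("\b"), len(r"\b"))      # 1 2: backspace character vs. backslash + 'b'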
Now to bring this around to the example code you've shown: I have no idea why you're using eval at all. As far as I can tell, you've mistakenly wrapped extra quotes around your regex patterns for no good reason, and you're using eval to undo that wrapping. But because only the inner strings use raw-string syntax, by the time you eval them it's too late to avoid Python's string parsing mangling your literals if they contain any of the troublesome escape sequences (the outer string will already have parsed \b, for instance, and turned it into the ASCII backspace character \x08).
You should tear the eval code out and fix your literals to avoid the extra quotes. This should work:
import re

regex_list = [r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))',           # use raw literals, with no extra quotes!
              r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'] # unnecessary backslashes?
sent = 'in this file we have the case of a foo(+) in the town'
gs1 = 'foo'
for string_regex in regex_list:
    mo = re.search(string_regex % gs1, sent, re.I)  # no eval here!
    if mo:
        print(mo.group())
This example works for me (it prints foo(+)). Note that you've got some unnecessary backslashes in your second pattern (before the spaces). They are harmless, but they might be adding even more confusion to an already complicated subject (regexes are notoriously hard to read).
I'm working on strings where I'm taking input from the command line. For example, with this input:
format driveName "datahere"
when I go string.split(), it comes out as:
>>> input.split()
['format', 'driveName', '"datahere"']
which is what I want.
However, when I specify it to be string.split(" ", 2), I get:
>>> input.split(' ', 2)
['format\n', 'driveName\n', '"datahere"']
Does anyone know why and how I can resolve this? I thought it could be because I'm creating it on Windows and running on Unix, but the same problem occurs when I use nano in unix.
The third argument (data) could contain newlines, so I'm cautious not to use a sweeping newline remover.
The default separator for split() is any whitespace, which includes newlines (\n) as well as spaces.
Here is what the docs on split say:
str.split([sep[, maxsplit]])
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
When you pass an explicit sep, only that exact separator is used to split the string.
Use None to get the default whitespace-splitting behaviour together with a maxsplit limit:
input.split(None, 2)
This leaves the whitespace at the end of input untouched.
Or you could strip the values afterwards; this removes whitespace from the start and end, but not the middle, of each resulting string, just like input.split() would:
[v.strip() for v in input.split(' ', 2)]
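For illustration, an input with a space after each newline reproduces the symptom (the exact input isn't shown in the question, so this is my reconstruction):
s = 'format\n driveName\n "datahere"'
print(s.split(' ', 2))   # ['format\n', 'driveName\n', '"datahere"']
print(s.split(None, 2))  # ['format', 'driveName', '"datahere"']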
The default str.split splits on a range of whitespace characters, including tabs and others. If you write str.split(' '), you tell it to split only on ' ' (a single space). You can get the default behavior together with a limit by specifying None, as in str.split(None, 2).
There may be a better way of doing this, depending on what your actual use-case is (your example does not replicate the problem...). As your example output implies newlines as separators, you should consider splitting on them explicitly:
inp = """
format
driveName
datahere
datathere
"""
inp.strip().split('\n', 2)
# ['format', 'driveName', 'datahere\ndatathere']
This allows you to have spaces (and tabs etc) in the first and second item as well.
Input file:
    rep_origin      607..1720
                    /label=Rep
    Region          2643..5020
                    /label="region"
                    extra_info and stuff
I'm trying to split by the first column-esque entry. For example, I want to get a list that looks like this...
Desired Output:
['rep_origin 607..1720 /label=Rep', 'Region 2643..5020 /label="region" extra_info and stuff']
I tried splitting by ' ' but that gave me some crazy stuff. If I could add a "fuzzy" search term at the end that matches any alphabet character but NOT whitespace, that would solve the problem. I suppose you can do it with regex, with something like a findall on ' [A-Z]', but I wasn't sure if there was a less complicated way.
Is there a way to add a "fuzzy" search term at the very end of the string.split identifier (i.e., original_string.split(' [alphabet_character]'))?
I'm not sure exactly what you're looking for, but the parse function below takes the text from your question and returns a list of sections, where each section is a list of the lines in that section (with leading and trailing whitespace removed).
#!/usr/bin/env python
import re

# This is the input from your question
INPUT_TEXT = '''\
    rep_origin      607..1720
                    /label=Rep
    Region          2643..5020
                    /label="region"
                    extra_info and stuff'''

# A regular expression that matches the start of a section. A section
# start is a line that has 4 spaces before the first non-space
# character.
match_section_start = re.compile(r'^    [^ ]').match

def parse(text):
    sections = []
    section_lines = None

    def append_section_if_lines():
        if section_lines:
            sections.append(section_lines)

    for line in text.split('\n'):
        if match_section_start(line):
            # We've found the start of a new section. Unless this is
            # the first section, save the previous section.
            append_section_if_lines()
            section_lines = []
        section_lines.append(line.strip())

    # Save the last section.
    append_section_if_lines()
    return sections

sections = parse(INPUT_TEXT)
print(sections)
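If you then want the flat strings from your desired output, you can join each section's lines back together (a possible follow-up, not required by parse itself); the inner split()/join() also collapses the runs of internal spaces:
flat = [' '.join(' '.join(line.split()) for line in section)
        for section in sections]
print(flat)
# ['rep_origin 607..1720 /label=Rep',
#  'Region 2643..5020 /label="region" extra_info and stuff']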
I'm trying to clean up some code by removing either leading or trailing whitespace characters using pyparsing. Removing leading whitespace was quite easy, as I could make use of the FollowedBy class, which matches a string but does not include it.
Here a small example:
from pyparsing import *
insource = """
annotation (Documentation(info="
<html>
<b>FOO</b>
</html>
"));
"""
# Working replacement:
HTMLStartref = OneOrMore(White(' \t\n')) + (FollowedBy(CaselessLiteral('<html>')))
## Not working because of non-existing "LeadBy"
# HTMLEndref = LeadBy(CaselessLiteral('</html>')) + OneOrMore(White(' \t\n')) + FollowedBy('"')
out = Suppress(HTMLStartref).transformString(insource)
out2 = Suppress(HTMLEndref).transformString(out)
As output one gets:
>>> print out
annotation (Documentation(info="<html>
<b>FOO</b>
</html>
"));
and should get:
>>> print out2
annotation (Documentation(info="<html>
<b>FOO</b>
</html>"));
I looked at the documentation but could not find a "LeadBy" equivalent to FollowedBy, or any other way to achieve this.
What you are asking for is something like "lookbehind", that is, match only if something is preceded by a particular pattern. I don't really have an explicit class for that at the moment, but for what you want to do, you can still transform left-to-right, and just leave in the leading part - don't suppress it, only suppress the whitespace.
Here are a couple of ways to address your problem:
# define expressions to match leading and trailing
# html tags, and just suppress the leading or trailing whitespace
opener = White().suppress() + Literal("<html>")
closer = Literal("</html>") + White().suppress()
# define a single expression to match either opener
# or closer - have to add leaveWhitespace() call so that
# we catch the leading whitespace in opener
either = opener|closer
either.leaveWhitespace()
print either.transformString(insource)
# alternative, if you know what the tag will look like:
# match 'info=<some double quoted string>', and use a parse
# action to extract the contents within the quoted string,
# call strip() to remove leading and trailing whitespace,
# and then restore the original '"' characters (which are
# auto-stripped by the QuotedString class by default)
infovalue = QuotedString('"', multiline=True)
infovalue.setParseAction(lambda t: '"' + t[0].strip() + '"')
infoattr = "info=" + infovalue
print infoattr.transformString(insource)
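Either way, the transformed output should match what you were after (assuming the insource string from the question):
annotation (Documentation(info="<html>
<b>FOO</b>
</html>"));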