Dealing with ZeroOrMore in pyparsing

Dealing with ZeroOrMore in pyparsing - python

I'm trying to parse pactl list with pyparsing: So far all parse is working correctly but I cannot make ZeroOrMore to work correctly.
I can find foo: or foo: bar and try to deal with that with ZeroOrMore but it doesn't work, I have to add special case "Argument:" to find results without value, but there're Argument: foo results (with value) so it will not work, and I expect any other property to exist without value.
With this definition, and a fixed pactl list output:
#!/usr/bin/env python
#
# parsing pactl list
#
from pyparsing import *
import os
from subprocess import check_output
import sys
data = '''
Module #6
Argument:
Name: module-alsa-card
Usage counter: 0
Properties:
module.author = "Lennart Poettering"
module.description = "ALSA Card"
module.version = "14.0-rebootstrapped"
'''
indentStack = [1]
stmt = Forward()
identifier = Word(alphanums+"-_.")
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums)))
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1)))))
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop = (prop_name + prop_section)
stmt << ( section | prop | ("Argument:") | value | prop_val )
syntax = OneOrMore(stmt)
parseTree = syntax.parseString(data)
parseTree.pprint()
This gets:
$ ./pactl.py
Module #6
Argument:
Name: module-alsa-card
Usage counter: 0
Properties:
module.author = "Lennart Poettering"
module.description = "ALSA Card"
module.version = "14.0-rebootstrapped"
[[['Module'], ['6']],
[['Argument:'],
[[['Name'], ['module-alsa-card']]],
[[['Usage counter'], ['0']]],
['Properties:',
[[[['module.author'], ['"Lennart Poettering"']]],
[[['module.description'], ['"ALSA Card"']]],
[[['module.version'], ['"14.0-rebootstrapped"']]]]]]]
So far so good, but removing special case for Argument: it gets into error, as ZeroOrMore doesn't behave as expected:
#!/usr/bin/env python
#
# parsing pactl list
#
from pyparsing import *
import os
from subprocess import check_output
import sys
data = '''
Module #6
Argument:
Name: module-alsa-card
Usage counter: 0
Properties:
module.author = "Lennart Poettering"
module.description = "ALSA Card"
module.version = "14.0-rebootstrapped"
'''
indentStack = [1]
stmt = Forward()
identifier = Word(alphanums+"-_.")
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums)))
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1))))).setDebug()
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop = (prop_name + prop_section)
stmt << ( section | prop | value | prop_val )
syntax = OneOrMore(stmt)
parseTree = syntax.parseString(data)
parseTree.pprint()
This results in:
$ ./pactl.py
Module #6
Argument:
Name: module-alsa-card
Usage counter: 0
Properties:
module.author = "Lennart Poettering"
module.description = "ALSA Card"
module.version = "14.0-rebootstrapped"
Match Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) at loc 19(3,9)
Matched Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) -> [[['Argument'], ['Name']]]
Match Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) at loc 1(2,1)
Exception raised:Expected ":", found '#' (at char 8), (line:2, col:8)
Traceback (most recent call last):
File "/home/alberto/projects/node/pacmd_list_json/./pactl.py", line 55, in <module>
parseTree = syntax.parseString(partial)
File "/usr/local/lib/python3.9/site-packages/pyparsing.py", line 1955, in parseString
raise exc
File "/usr/local/lib/python3.9/site-packages/pyparsing.py", line 6336, in checkUnindent
raise ParseException(s, l, "not an unindent")
pyparsing.ParseException: Expected {{Group:({Group:(W:(ABCD...)) Suppress:("#") Group:(W:(0123...))}) indented block} | {"Properties:" indented block} | Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) | Group:({Group:(W:(ABCD...)) Suppress:("=") Group:(Combine:({{W:(ABCD...) | <SP><TAB>}}...))})}, found ':' (at char 41), (line:4, col:13)
See from setDebug value grammar ZeroOrMore is getting the tokens from next line [[['Argument'], ['Name']]]
I tried LineEnd() and other tricks but none works.
Any idea on how to deal with ZeroOrMore to stop on LineEnd() or without special cases?
NOTE: Real output can be retrieved using:
env = os.environ.copy()
env['LANG'] = 'C'
data = check_output(
['pactl', 'list'], universal_newlines=True, env=env)

indentedBlock is not the easiest pyparsing element to work with. But there are a few things that you are doing that are getting in your way.
To debug this, I broke down some of your more complex expressions, use setName() to give them names, and then added .setDebug(). Like this:
identifier = Word(alphas, alphanums+"-_.").setName("identifier").setDebug()
This will tell pyparsing to output a message whenever this expression is about to be matched, if it matched successfully, or if not, the exception that was raised.
Match identifier at loc 1(2,1)
Matched identifier -> ['Module']
Match identifier at loc 15(3,5)
Matched identifier -> ['Argument']
Match identifier at loc 15(3,5)
Matched identifier -> ['Argument']
Match identifier at loc 23(3,13)
Exception raised:Expected identifier, found ':' (at char 23), (line:3, col:13)
It looks like these expressions are messing up the indentedBlock matching, by processing whitespace that should be indentation space:
Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))
The " character in the Word and the whitespace lead me to believe you are trying to match quoted strings. I replaced this expression with:
Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString))
You also need to take care not to read past the end of the line, or you'll also mess up the indentedBlock indentation tracking. I added this expression for a newline at the top:
NL = LineEnd()
and then used it as the stopOn argument to OneOrMore and ZeroOrMore:
prop_val_value = Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString(), stopOn=NL)).setName("prop_val_value")#.setDebug()
prop_val = Group(identifier + Suppress("=") + Group(prop_val_value)).setName("prop_val")#.setDebug()
Here is the parser I ended up with:
indentStack = [1]
stmt = Forward()
NL = LineEnd()
identifier = Word(alphas, alphanums+"-_.").setName("identifier").setDebug()
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums))).setName("sect_def")#.setDebug()
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
#~ value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1))))).setDebug()
value_label = originalTextFor(OneOrMore(identifier)).setName("value_label")#.setDebug()
value = Group(value_label
+ Suppress(":")
+ Optional(~NL + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_.') | quotedString(), stopOn=NL))))).setName("value")#.setDebug()
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
#~ prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop_val_value = Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString(), stopOn=NL)).setName("prop_val_value")#.setDebug()
prop_val = Group(identifier + Suppress("=") + Group(prop_val_value)).setName("prop_val")#.setDebug()
prop = (prop_name + prop_section).setName("prop")#.setDebug()
stmt << ( section | prop | value | prop_val )
Which gives this:
[[['Module'], ['6']],
[[['Argument']],
[['Name', ['module-alsa-card']]],
[['Usage counter', ['0']]],
['Properties:',
[[['module.author', ['"Lennart Poettering"']]],
[['module.description', ['"ALSA Card"']]],
[['module.version', ['"14.0-rebootstrapped"']]]]]]]

Related

PyParsing: parse if not a keyword

I am trying to parse a file as follows:
testp.txt
title = Test Suite A;
timeout = 10000
exp_delay = 500;
log = TRUE;
sect
{
type = typeA;
name = "HelloWorld";
output_log = "c:\test\out.log";
};
sect
{
name = "GoodbyeAll";
type = typeB;
comm1_req = 0xDEADBEEF;
comm1_resp = (int, 1234366);
};
The file first contains a section with parameters and then some sects. I can parse a file containing just parameters and I can parse a file just containing sects but I can't parse both.
from pyparsing import *
from pathlib import Path
command_req = Word(alphanums)
command_resp = "(" + delimitedList(Word(alphanums)) + ")"
kW = Word(alphas+'_', alphanums+'_') | command_req | command_resp
keyName = ~Literal("sect") + Word(alphas+'_', alphanums+'_') + FollowedBy("=")
keyValue = dblQuotedString.setParseAction( removeQuotes ) | OneOrMore(kW,stopOn=LineEnd())
param = dictOf(keyName, Suppress("=")+keyValue+Optional(Suppress(";")))
node = Group(Literal("sect") + Literal("{") + OneOrMore(param) + Literal("};"))
final = OneOrMore(node) | OneOrMore(param)
param.setDebug()
p = Path(__file__).with_name("testp.txt")
with open(p) as f:
try:
x = final.parseFile(f, parseAll=True)
print(x)
print("...")
dx = x.asDict()
print(dx)
except ParseException as pe:
print(pe)
The issue I have is that param matches against sect so it expects a =. So I tried putting in ~Literal("sect") in keyName but that just leads to another error:
Exception raised:Found unwanted token, "sect", found '\n' (at char 188), (line:4, col:56)
Expected end of text, found 's' (at char 190), (line:6, col:1)
How do I get it use one parse method for sect and another (param) if not sect?
My final goal would be to have the whole lot in a Dict with the global params and sects included.
EDIT
Think I've figured it out:
This line...
final = OneOrMore(node) | OneOrMore(param)
...should be:
final = ZeroOrMore(param) + ZeroOrMore(node)
But I wonder if there is a more structured way (as I'd ultimately like a dict)?

Visualizing nested function calls in python

I need a way to visualize nested function calls in python, preferably in a tree-like structure. So, if I have a string that contains f(g(x,h(y))), I'd like to create a tree that makes the levels more readable. For example:
f()
|
g()
/ \
x h()
|
y
Or, of course, even better, a tree plot like the one that sklearn.tree.plot_tree creates.
This seems like a problem that someone has probably solved long ago, but it has so far resisted my attempts to find it. FYI, this is for the visualization of genetic programming output that tends to have very complex strings like this.
thanks!
update:
toytree and toyplot get pretty close, but just not quite there:
This is generated with:
import toytree, toyplot
mystyle = {"layout": 'down','node_labels':True}
s = '((x,(y)));'
toytree.tree(s).draw(**mystyle);
It's close, but the node labels aren't strings...
Update 2:
I found another potential solution that gets me closer in text form:
https://rosettacode.org/wiki/Visualize_a_tree#Python
tree2 = Node('f')([
Node('g')([
Node('x')([]),
Node('h')([
Node('y')([])
])
])
])
print('\n\n'.join([drawTree2(True)(False)(tree2)]))
This results in the following:
That's right, but I had to hand convert my string to the Node notation the drawTree2 function needs.

Here's a solution using pyparsing and asciitree. This can be adapted to parse just about anything and to generate whatever data structure is required for plotting. In this case, the code generates nested dictionaries suitable for input to asciitree.
#!/usr/bin/env python3
from collections import OrderedDict
from asciitree import LeftAligned
from pyparsing import Suppress, Word, alphas, Forward, delimitedList, ParseException, Optional
def grammar():
lpar = Suppress('(')
rpar = Suppress(')')
identifier = Word(alphas).setParseAction(lambda t: (t[0], {}))
function_name = Word(alphas)
expr = Forward()
function_arg = delimitedList(expr)
function = (function_name + lpar + Optional(function_arg) + rpar).setParseAction(lambda t: (t[0] + '()', OrderedDict(t[1:])))
expr << (function | identifier)
return function
def parse(expr):
g = grammar()
try:
parsed = g.parseString(expr, parseAll=True)
except ParseException as e:
print()
print(expr)
print(' ' * e.loc + '^')
print(e.msg)
raise
return dict([parsed[0]])
if __name__ == '__main__':
expr = 'f(g(x,h(y)))'
tree = parse(expr)
print(LeftAligned()(tree))
Output:
f()
+-- g()
+-- x
+-- h()
+-- y
Edit
With some tweaks, you can build an edge list suitable for plotting in your favorite graph library (igraph example below).
#!/usr/bin/env python3
import igraph
from pyparsing import Suppress, Word, alphas, Forward, delimitedList, ParseException, Optional
class GraphBuilder(object):
def __init__(self):
self.labels = {}
self.edges = []
def add_edges(self, source, targets):
for target in targets:
self.add_edge(source, target)
return source
def add_edge(self, source, target):
x = self.labels.setdefault(source, len(self.labels))
y = self.labels.setdefault(target, len(self.labels))
self.edges.append((x, y))
def build(self):
g = igraph.Graph()
g.add_vertices(len(self.labels))
g.vs['label'] = sorted(self.labels.keys(), key=lambda l: self.labels[l])
g.add_edges(self.edges)
return g
def grammar(gb):
lpar = Suppress('(')
rpar = Suppress(')')
identifier = Word(alphas)
function_name = Word(alphas).setParseAction(lambda t: t[0] + '()')
expr = Forward()
function_arg = delimitedList(expr)
function = (function_name + lpar + Optional(function_arg) + rpar).setParseAction(lambda t: gb.add_edges(t[0], t[1:]))
expr << (function | identifier)
return function
def parse(expr, gb):
g = grammar(gb)
g.parseString(expr, parseAll=True)
if __name__ == '__main__':
expr = 'f(g(x,h(y)))'
gb = GraphBuilder()
parse(expr, gb)
g = gb.build()
layout = g.layout('tree', root=len(gb.labels)-1)
igraph.plot(g, layout=layout, vertex_size=30, vertex_color='white')

pyparsing, ignore can't ignore string inline

I want to use the pyparsing's commaSeparatedList to seperate a string and ignore the staff inside '{' '}'.
example:
a = 'xyz,abc{def,123,456}'
after parse, I want to got
['xyz','abc{def,123,456}']
I wrote this:
nested_expr = '{' + SkipTo('}') + '}'
commaSeparatedList.ignore(nested_expr).parseString(a)
result: (['xyz', 'abc{def', '123', '456}'], {})
Actulally
It seems like when there is a separater before '{', this will work
a = 'xyz,abc,{def,123,456}'
commaSeparatedList.ignore(nested_expr).parseString(a)
result: (['xyz', 'abc', ''], {})
Could you take a look why this is happening?

Open up the pyparsing.py source file and see how commaSeparatedList is implemented - it isn't that much, and easily adapted to your case:
# original
_commasepitem = Combine(OneOrMore(Word(printables, excludeChars=',') +
Optional( Word(" \t") +
~Literal(",") + ~LineEnd() ) ) ).streamline().setName("commaItem")
commaSeparatedList = delimitedList( Optional( quotedString.copy() | _commasepitem, default="") ).setName("commaSeparatedList")
# modified
_commasepitem = Combine(OneOrMore(QuotedString('{',endQuoteChar='}',unquoteResults=False) | Word(printables, excludeChars=',{}') +
Optional( Word(" \t") +
~Literal(",") + ~LineEnd() ) ) ).streamline().setName("commaItem")
commaSeparatedList = delimitedList( Optional(_commasepitem, default="") ).setName("commaSeparatedList")
It's important that the Word in _commasepitem not allow for inclusion of '{}' characters.

pyparsing ParseException: Expected end of line -- general questions

I am a newbie to pyparsing and have been reading the examples, looking here and trying some things out.
I created a grammar and provided a buffer. I do however have a heavy background in lex/yacc from the old days.
I have a general question or two.
I'm currently seeing
ParseException: Expected end of line (at char 7024), (line 213, col:2)
and then it terminates
Because of the nature of my buffer, newlines have meaning, I did:
ParserElement.setDefaultWhitespaceChars('') # <-- zero len string
Does this error mean that somewhere in my productions, I have a rule that is looking for an LineEnd() and that rule happens to somehow be 'last'?
The location it is dying is the 'end of file'. I tried using parseFile but my file contains chars > ord(127) so instead I am loading it to memory, filtering all > ord(127) chars, then calling parseString.
I tried turning on verbose_stacktrace=True for some of the elements of my grammar where I thought the problem originated.
Is there a better way to track down the exact ParserElement it is trying to recognize when an error such as this occurs? Or can I get a 'stack or most recently recognized production trace?
I didn't realize I could edit up here...
My crash is this:
[centos#new-host /tmp/sample]$ ./zooparser.py
!(zooparser.py) TEST test1: valid message type START
Ready to roll
Parsing This message: ( ignore leading>>> and trailing <<< ) >>>
ZOO/STATUS/FOOD ALLOCATION//
TOPIC/BIRD FEED IS RUNNING LOW//
FREE/WE HAVE DISCOVERED MOTHS INFESTED THE BIRDSEED AND IT IS NO
LONGER USABLE.//
<<<
Match {Group:({Group:({Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!##$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!##$...)}}) "//"}) Group:({LineEnd "TOPIC" {Group:({[LineEnd]... Group:({"/" {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)}) | Group:({{{"ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ'"}... | Group:({{"0123456789"}... ":"})} {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)})}}) | "-"}})})}... [LineEnd]... "//"})}) [Group:({LineEnd "FREE" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!##$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!##$...)}}) "//"})]...}) [LineEnd]... StringEnd} at loc 0(1,1)
Match Group:({Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!##$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!##$...)}}) "//"}) Group:({LineEnd "TOPIC" {Group:({[LineEnd]... Group:({"/" {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)}) | Group:({{{"ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ'"}... | Group:({{"0123456789"}... ":"})} {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)})}}) | "-"}})})}... [LineEnd]... "//"})}) at loc 0(1,1)
Match Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!##$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!##$...)}}) "//"}) at loc 0(1,1)
Exception raised:None
Exception raised:None
Exception raised:None
Traceback (most recent call last):
File "./zooparser.py", line 319, in <module>
test1(pgm)
File "./zooparser.py", line 309, in test1
test(pgm, zooMsg, 'test1: valid message type' )
File "./zooparser.py", line 274, in test
tokens = zg.getTokensFromBuffer(fileName)
File "./zooparser.py", line 219, in getTokensFromBuffer
tokens = self.text.parseString(filteredBuffer,parseAll=True)
File "/usr/local/lib/python2.7/site-packages/pyparsing-1.5.7-py2.7.egg/pyparsing.py", line 1006, in parseString
raise exc
pyparsing.ParseException: Expected end of line (at char 148), (line:8, col:2)
[centos#new-host /tmp/sample]$
source: see http://prj1.y23.org/zoo.zip

pyparsing takes a different view toward parsing than lex/yacc does. You have to let the classes do some of the work. Here's an example in your code:
self.columnHeader = OneOrMore(self.aucc) \
| OneOrMore(nums) \
| OneOrMore(self.blankCharacter) \
| OneOrMore(self.specialCharacter)
You are equating OneOrMore with the '+' character of a regex. In pyparsing, this is true for ParseElements, but at the character level, pyparsing uses the Word class:
self.columnHeader = Word(self.aucc + nums + self.blankCharacter + self.specialCharacter)
OneOrMore works with ParseElements, not characters. Look at:
OneOrMore(nums)
nums is the string "0123456789", so OneOrMore(nums) will match "0123456789", "01234567890123456789", etc., but not "123". That is what Word is for. OneOrMore will accept a string argument, but will implicitly convert it to a Literal.
This is a fundamental difference between using pyparsing and lex/yacc, and I think is the source of much of the complexity in your code.
Some other suggestions:
Your code has some premature optimizations in it - you write:
aucc = ''.join(set([alphas.upper(),"'"]))
Assuming that this will be used for defining Words, just do:
aucc = alphas.upper() + "'"
There is no harm in having duplicate characters in aucc, Word will convert this to a set internally.
Write a BNF for what you want to parse. It does not have to be overly rigorous as you would with lex/yacc. From your samples, it looks something like:
# sample
ZOO/STATUS/FOOD ALLOCATION//
TOPIC/BIRD FEED IS RUNNING LOW//
FREE/WE HAVE DISCOVERED MOTHS INFESTED THE BIRDSEED AND IT IS NO
LONGER USABLE.//
parser :: header topicEntry+
header :: "ZOO" sep namedValue
namedValue :: uppercaseWord sep valueBody
valueBody :: (everything up to //)
topicEntry :: topicHeader topicBody
topicHeader :: "TOPIC" sep valuebody
topicBody :: freeText
freeText :: "FREE" sep valuebody
sep :: "/"
Converting to pyparsing, this looks something like:
SEP = Literal("/")
BODY_TERMINATOR = Literal("//")
FREE_,TOPIC_,ZOO_ = map(Keyword,"FREE TOPIC ZOO".split())
uppercaseWord = Word(alphas.upper())
valueBody = SkipTo(BODY_TERMINATOR) # adjust later, but okay for now...
freeText = FREE_ + SEP + valueBody
topicBody = freeText
topicHeader = TOPIC_ + SEP + valueBody
topicEntry = topicHeader + topicBody
namedValue = uppercaseWord + SEP + valueBody
zooHeader = ZOO_ + SEP + namedValue
parser = zooHeader + OneOrMore(topicEntry)
(valueBody will have to get more elaborate when you add support for '://' embedded within a value, but save that for Round 2.)
Don't make things super complicated until you get at least some simple stuff working.

Parsing significant whitespace with a PEG in Python (specificly Parsley)

I'm creating a syntax that supports significant whitespace (most like the "Z" lisp variant than Python or yaml, but same idea)
I came across this article on how to do significant whitespace parsing in a pegasus a PEG parser for C#
But I've been less than successful at converting that to parsley, looks like the #STATE# variable in Pegasus follows backtracking in some way.
This is the closest I've gotten to a simple parser, If I use the version of indent with look ahead it can't parse children, and if I use the version without, it can't parse siblings.
If this is a limitation of parsley and I need to use PyPEG or Parsimonious or something, I'm open to that, but it seems like if the internal indent variable could follow the PEGs internal backtracking this would all work.
import parsley
def indent(s):
s['i'] += 2
print('indent i=%d' % s['i'])
def deindent(s):
s['i'] -= 2
print('deindent i=%d' % s['i'])
grammar = parsley.makeGrammar(r'''
id = <letterOrDigit+>
eol = '\n' | end
nots = anything:x ?(x != ' ')
node = I:i id:name eol !(fn_print(_state['i'], name)) -> i, name
#I = !(' ' * _state['i'])
I = (' '*):spaces ?(len(spaces) == _state['i'])
#indent = ~~(!(' ' * (_state['i'] + 2)) nots) -> fn_indent(_state)
#deindent = ~~(!(' ' * (_state['i'] - 2)) nots) -> fn_deindent(_state)
indent = -> fn_indent(_state)
deindent = -> fn_deindent(_state)
child_list = indent (ntree+):children deindent -> children
ntree = node:parent (child_list?):children -> parent, children
nodes = ntree+
''', {
'_state': {'i': 0},
'fn_indent': indent,
'fn_deindent': deindent,
'fn_print': print,
})
test_string = '\n'.join((
'brother',
' brochild1',
#' gchild1',
#' brochild2',
#' grandchild',
'sister',
#' sischild',
#'brother2',
))
nodes = grammar(test_string).nodes()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Dealing with ZeroOrMore in pyparsing - python

Related

PyParsing: parse if not a keyword

Visualizing nested function calls in python

pyparsing, ignore can't ignore string inline

pyparsing ParseException: Expected end of line -- general questions

Parsing significant whitespace with a PEG in Python (specificly Parsley)

Categories

Resources