Parsimonious ParseError - python

Digging deeper into grammars, and PEGs in particular, I wanted to have a DSL with the following syntax:
a OR (b AND c)
I am using parsimonious here with the following grammar:
from parsimonious.grammar import Grammar

grammar = Grammar(
    r"""
    expr = (term operator term)+
    term = (lpar term rpar) / (variable operator variable)
    operator = and / or
    or = _? "OR" _?
    and = _? "AND" _?
    variable = ~r"[a-z]+"
    lpar = "("
    rpar = ")"
    _ = ~r"\s*"
    """
)
print(grammar.parse('a OR (b AND c)'))
However, this fails for the above text with
parsimonious.exceptions.ParseError: Rule 'variable' didn't match at '(b AND c)' (line 1, column 6).
Why? Haven't I specified term as ( term ) or term?
Why does it choose the rule for variable instead (which of course fails)?

The first thing in expr is a term, so that's what the parser looks for.
A term in your grammar is either
( term )
or
variable operator variable
And the input is
a OR (b AND c)
That doesn't start with a ( so the only way it can be a term is if it matches variable operator variable. a is a variable; OR is an operator. So the next thing to match is variable.
Perhaps what you want is:
expr = term (operator term)*
term = (lpar expr rpar) / variable
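A quick way to sanity-check the suggested grammar without parsimonious is a hand-rolled recursive-descent sketch of the same two rules (plain Python; the tokenizer and the tuple output format are my own choices, not part of the original question):

```python
import re

def tokenize(text):
    # Parens, the two operators, or a lowercase variable; whitespace is skipped.
    return re.findall(r"[()]|AND|OR|[a-z]+", text)

def parse_expr(toks, i):
    # expr = term (operator term)*
    node, i = parse_term(toks, i)
    while i < len(toks) and toks[i] in ("AND", "OR"):
        op = toks[i]
        right, i = parse_term(toks, i + 1)
        node = (op, node, right)
    return node, i

def parse_term(toks, i):
    # term = (lpar expr rpar) / variable
    if toks[i] == "(":
        node, i = parse_expr(toks, i + 1)
        assert toks[i] == ")", "expected closing parenthesis"
        return node, i + 1
    return toks[i], i + 1

tree, _ = parse_expr(tokenize("a OR (b AND c)"), 0)
print(tree)  # ('OR', 'a', ('AND', 'b', 'c'))
```

Because term now recurses back into expr only behind a leading `(`, the parser commits to the variable branch exactly when no paren is present, which is what the original grammar failed to do.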

Related

DSL for generating sequences

I'm trying to create a DSL to generate sequences. Here is what I have so far:
?start : expr
token : WORD
repeat_token : token ":" INT
tokens : (token | repeat_token)+
repeat : ":" INT
expr : "(" tokens | expr ")" repeat?
Here is what the DSL looks like:
(a b:2 (c d:3):2 ):3
[[a bb [[c ddd] [c ddd]] ] ... ]
I have a problem with an expr within an expr. This fails:
(a:2 (b))
How do you see fitting (a:2 (b)) into your grammar? It doesn't seem like you can. Here's my logic:
The outer level has to be an expr because of the parens. In that expr you have both a repeat_token and another expr. I don't see anywhere that lets you have a sequence of elements that includes both repeat_tokens and exprs. Because of that, your input can't be parsed with your grammar.
As it is, an expr can only appear inside another expr all by itself, which doesn't seem very useful in general; it could only produce extra sets of parentheses, I think. What I think you need to do is allow an expr to be included in a tokens.
So then maybe:
?start : expr
token : WORD
repeat_token : token ":" INT
tokens : (token | repeat_token | expr)+
repeat : ":" INT
expr : "(" tokens ")" repeat?
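With the corrected grammar, an expr can sit anywhere a token can. The intended expansion semantics (my reading of the examples: `word:n` repeats the word's text, `(...):n` repeats the whole group) can be sketched without lark as a small hand-rolled interpreter:

```python
import re

def expand(text):
    toks = re.findall(r"[A-Za-z]+|\d+|[():]", text)
    tree, _ = group(toks, 0)
    return tree

def group(toks, i):
    # "(" (word [":" INT] | group)* ")" [":" INT]
    assert toks[i] == "("
    i += 1
    items = []
    while toks[i] != ")":
        if toks[i] == "(":
            node, i = group(toks, i)  # a nested group consumes its own ":n"
        else:
            node, i = toks[i], i + 1
            if i < len(toks) and toks[i] == ":":
                node, i = node * int(toks[i + 1]), i + 2  # d:3 -> 'ddd'
        items.append(node)
    i += 1  # consume ")"
    if i < len(toks) and toks[i] == ":":
        items, i = [items] * int(toks[i + 1]), i + 2  # (...):n -> n copies
    return items, i

print(expand("(a b:2 (c d:3):2 ):3")[0])  # ['a', 'bb', [['c', 'ddd'], ['c', 'ddd']]]
```

The key structural point matches the grammar fix: the loop accepts a nested group in the same position as a plain or repeated token.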

Use pyparsing to parse expression starting with parenthesis

I'm trying to develop a grammar which can parse an expression starting and ending with a parenthesis. There can be any combination of characters inside the parentheses. I've written the following code, following the Hello World program from pyparsing.
from pyparsing import *
select = Literal("select")
predicate = "(" + Word(printables) + ")"
selection = select + predicate
print (selection.parseString("select (a)"))
But this throws an error. I think it may be because printables also contains ( and ), and that somehow conflicts with the literal ( and ) I specified.
What is the correct way of doing this?
You could use alphas instead of printables.
from pyparsing import *
select = Literal("select")
predicate = "(" + Word(alphas) + ")"
selection = select + predicate
print (selection.parseString("select (a)"))
If you use { and } as the nesting characters instead:
from pyparsing import *
expr = Combine(Suppress('select ') + nestedExpr('{', '}'))
value = "select {a(b(c\somethinsdfsdf##!#$###$##))}"
print( expr.parseString( value ) )
output: [['a(b(c\\somethinsdfsdf##!#$###$##))']]
The problem with ( ) is they are used as the default quoting characters.
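The grouping that nestedExpr performs can be sketched in plain Python. This hand-rolled version collects balanced parentheses into nested lists; note that the word pattern deliberately excludes `(` and `)`, which is exactly the conflict that printables causes:

```python
import re

def nest(toks, i=0):
    # Collect tokens into nested lists until an unmatched ")" or end of input.
    out = []
    while i < len(toks):
        if toks[i] == "(":
            inner, i = nest(toks, i + 1)
            out.append(inner)
        elif toks[i] == ")":
            return out, i + 1
        else:
            out.append(toks[i])
            i += 1
    return out, i

# The word pattern [^\s()]+ excludes the parens so they tokenize separately.
toks = re.findall(r"[()]|[^\s()]+", "select (a (b c))")
tree, _ = nest(toks)
print(tree)  # ['select', ['a', ['b', 'c']]]
```

If the word pattern were `\S+` (the moral equivalent of Word(printables)), the parens would be swallowed into the words and the nesting would never be seen.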

Rule precedence issue with grako

I'm redoing a minilanguage I originally built on Perl (see Chessa# on github), but I'm running into a number of issues when I go to apply semantics.
Here is the grammar:
(* integers *)
DEC = /([1-9][0-9]*|0+)/;
int = /(0b[01]+|0o[0-7]+|0x[0-9a-fA-F]+)/ | DEC;
(* floats *)
pointfloat = /([0-9]*\.[0-9]+|[0-9]+\.)/;
expfloat = /([0-9]+\.?|[0-9]*\.)[eE][+-]?[0-9]+/;
float = pointfloat | expfloat;
list = '[' #+:atom {',' #+:atom}* ']';
(* atoms *)
identifier = /[_a-zA-Z][_a-zA-Z0-9]*/;
symbol = int |
float |
identifier |
list;
(* functions *)
arglist = #+:atom {',' #+:atom}*;
function = identifier '(' [arglist] ')';
atom = function | symbol;
prec8 = '(' atom ')' | atom;
prec7 = [('+' | '-' | '~')] prec8;
prec6 = prec7 ['!'];
prec5 = [prec6 '**'] prec6;
prec4 = [prec5 ('*' | '/' | '%' | 'd')] prec5;
prec3 = [prec4 ('+' | '-')] prec4;
(* <| and >| are rotate-left and rotate-right, respectively. They assume the nearest C size. *)
prec2 = [prec3 ('<<' | '>>' | '<|' | '>|')] prec3;
prec1 = [prec2 ('&' | '|' | '^')] prec2;
expr = prec1 $;
The issue I'm running into is that the d operator is being pulled into the identifier rule when no whitespace exists between the operator and any following alphanumeric strings. While the grammar itself is LL(2), I don't understand where the issue is here.
For instance, 4d6 stops the parser because it's being interpreted as 4 d6, where d6 is an identifier. What should occur is that it's interpreted as 4 d 6, with the d being an operator. In an LL parser, this would indeed be the case.
A possible solution would be to disallow d from beginning an identifier, but this would disallow functions such as drop from being named as such.
In Perl, you can use Marpa, a general BNF parser, which supports generalized precedence with associativity (and many more) out of the box, e.g.
:start ::= Script
Script ::= Expression+ separator => comma
comma ~ [,]
Expression ::=
Number bless => primary
| '(' Expression ')' bless => paren assoc => group
|| Expression '**' Expression bless => exponentiate assoc => right
|| Expression '*' Expression bless => multiply
| Expression '/' Expression bless => divide
|| Expression '+' Expression bless => add
| Expression '-' Expression bless => subtract
Full working example is here. As for programming languages, there is a C parser based on Marpa.
Hope this helps.
The problem with your example is that Grako has the nameguard feature enabled by default, and that won't allow parsing just the d when d6 is ahead.
To disable the feature, instantiate your own Buffer and pass it to an instance of the generated parser:
from grako.buffering import Buffer
from myparser import MyParser
# get the text
parser = MyParser()
parser.parse(Buffer(text, nameguard=False), 'expre')
The tip version of Grako in the Bitbucket repository adds a --no-nameguard command-line option to generated parsers.
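The effect of nameguard can be illustrated with plain regular expressions (no Grako needed; the token patterns here are mine, chosen to mimic the behavior described above):

```python
import re

# With nameguard (the default), an alphanumeric token is matched greedily,
# so '4d6' lexes as '4' followed by the identifier-looking 'd6'.
GUARDED = re.compile(r"\d+|[a-zA-Z_][a-zA-Z0-9_]*")

# Without nameguard, 'd' may stand alone when a digit follows, so '4d6'
# lexes as '4', 'd', '6' -- while 'drop' is still a single identifier.
UNGUARDED = re.compile(r"\d+|d(?=\d)|[a-zA-Z_][a-zA-Z0-9_]*")

print(GUARDED.findall("4d6"))     # ['4', 'd6']
print(UNGUARDED.findall("4d6"))   # ['4', 'd', '6']
print(UNGUARDED.findall("drop"))  # ['drop']
```

This also shows why disallowing d at the start of identifiers is unnecessary: the lookahead only splits off the d when a digit follows it directly.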

Parsing arithmetic expressions with function calls

I am working with pyparsing and found it to be excellent for developing a simple DSL that allows me to extract data fields out of MongoDB and do simple arithmetic operations on them. I am now trying to extend my tools such that I can apply functions of the form Rank[Person:Height] to the fields and potentially include simple expressions as arguments to the function calls. I am struggling hard with getting the parsing syntax to work. Here is what I have so far:
# Define parser
expr = Forward()
integer = Word(nums).setParseAction(EvalConstant)
real = Combine(Word(nums) + "." + Word(nums)).setParseAction(EvalConstant)
# Handle database field references that are coming out of Mongo,
# accounting for the fact that some fields contain whitespace
dbRef = Combine(Word(alphas) + ":" + Word(printables) + \
Optional(" " + Word(alphas) + " " + Word(alphas)))
dbRef.setParseAction(EvalDBref)
# Handle function calls
functionCall = (Keyword("Rank") | Keyword("ZS") | Keyword("Ntile")) + "[" + expr + "]"
functionCall.setParseAction(EvalFunction)
operand = functionCall | dbRef | (real | integer)
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
# Use parse actions to attach Eval constructors to sub-expressions
expr << operatorPrecedence(operand,
[
(signop, 1, opAssoc.RIGHT, EvalSignOp),
(multop, 2, opAssoc.LEFT, EvalMultOp),
(plusop, 2, opAssoc.LEFT, EvalAddOp),
])
My issue is that when I test a simple expression like Rank[Person:Height] I am getting a parse exception:
ParseException: Expected "]" (at char 19), (line:1, col:20)
If I use a float or arithmetic expression as the argument, like Rank[3 + 1.1], the parsing works OK, and if I simplify the dbRef grammar to just Word(alphas) it also works. I can't for the life of me figure out what's wrong with my full grammar. I have tried rearranging the order of operands as well as simplifying the functionCall grammar, to no avail. Can anyone see what I am doing wrong?
Once I get this working I want to take a last step and introduce support for variable assignment in expressions.
EDIT: Upon further testing, if I remove the printables from dbRef grammar, things work ok:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) + \
Optional("_" + Word(alphas)))
HOWEVER, if I add the character "-" to dbRef (which I need for DB fields like "Class:S-N"), the parser fails again. I think the "-" is being consumed by the signop in my operatorPrecedence?
What appears to happen is that the ] character at the end of your test string (Rank[Person:Height]) gets consumed as part of the dbRef token, because the portion of that token past the initial : is declared as Word(printables) (and this character set unfortunately includes the square bracket characters).
Then the parser tries to produce a functionCall but is missing the closing ] hence the error message.
A tentative fix is to use a character set that doesn't include the square brackets, maybe something more explicit like:
dbRef = Combine(Word(alphas) + ":" + Word(alphas, alphas+"-_./") + \
Optional(" " + Word(alphas) + " " + Word(alphas)))
Edit:
Upon closer look, the above is only loosely correct; the token hierarchy is wrong (e.g. the parser attempts to produce a functionCall as one operand of an expr, etc.).
Also, my suggested fix will not work because of the ambiguity with the - sign which should be understood as a plain character when within a dbRef and as a plusOp when within an expr. This type of issue is common with parsers and there are ways to deal with this, though I'm not sure exactly how with pyparsing.
Found solution - the issue was that my grammar for dbRef was consuming some of the characters that were part of the function specification. New grammar that works correctly:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) + \
Optional(oneOf("_ -") + Word(alphas)))
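The charset problem can be reproduced with plain regular expressions, independent of pyparsing (the patterns below are my approximations of the Word expressions involved): Word(printables) behaves roughly like `\S+`, which happily consumes the closing bracket, while a restricted charset stops before it.

```python
import re

# Word(printables) after the colon acts like \S+ and swallows the "]",
# so the functionCall rule never finds its closing bracket.
greedy = re.match(r"[A-Za-z]+:\S+", "Person:Height] extra")
print(greedy.group())  # Person:Height]

# A charset that excludes the brackets stops where the field ends,
# while still allowing '-' inside names like Class:S-N.
safe = re.match(r"[A-Za-z]+:[A-Za-z0-9_./-]+", "Person:Height] extra")
print(safe.group())    # Person:Height
```

The same reasoning explains the "-" trouble: a hyphen allowed inside the dbRef charset is consumed by the field token, so it can never reach operatorPrecedence as a minus operator without extra disambiguation.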

Parsing nested function calls using pyparsing

I'm trying to use pyparsing to parse function calls in the form:
f(x, y)
That's easy. But since it's a recursive-descent parser, it should also be easy to parse:
f(g(x), y)
That's what I can't get. Here's a boiled-down example:
from pyparsing import Forward, Word, alphas, alphanums, nums, ZeroOrMore, Literal
lparen = Literal("(")
rparen = Literal(")")
identifier = Word(alphas, alphanums + "_")
integer = Word( nums )
functor = identifier
# allow expression to be used recursively
expression = Forward()
arg = identifier | integer | expression
args = arg + ZeroOrMore("," + arg)
expression << functor + lparen + args + rparen
print expression.parseString("f(x, y)")
print expression.parseString("f(g(x), y)")
And here's the output:
['f', '(', 'x', ',', 'y', ')']
Traceback (most recent call last):
File "tmp.py", line 14, in <module>
print expression.parseString("f(g(x), y)")
File "/usr/local/lib/python2.6/dist-packages/pyparsing-1.5.6-py2.6.egg/pyparsing.py", line 1032, in parseString
raise exc
pyparsing.ParseException: Expected ")" (at char 3), (line:1, col:4)
Why does my parser interpret the functor of the inner expression as a standalone identifier?
Nice catch on figuring out that identifier was masking expression in your definition of arg. Here are some other tips on your parser:
x + ZeroOrMore(',' + x) is a very common pattern in pyparsing parsers, so pyparsing includes a helper method delimitedList which allows you to replace that expression with delimitedList(x). Actually, delimitedList does one other thing - it suppresses the delimiting commas (or other delimiter if given using the optional delim argument), based on the notion that the delimiters are useful at parsing time, but are just clutter tokens when trying to sift through the parsed data afterwards. So you can rewrite args as args = delimitedList(arg), and you will get just the args in a list, no commas to have to "step over".
You can use the Group class to create actual structure in your parsed tokens. This will build your nesting hierarchy for you, without having to walk this list looking for '(' and ')' to tell you when you've gone down a level in the function nesting:
arg = Group(expression) | identifier | integer
expression << functor + Group(lparen + args + rparen)
Since your args are being Grouped for you, you can further suppress the parens, since like the delimiting commas, they do their job during parsing, but with grouping of your tokens, they are no longer necessary:
lparen = Literal("(").suppress()
rparen = Literal(")").suppress()
I assume 'h()' is a valid function call, just no args. You can allow args to be optional using Optional:
expression << functor + Group(lparen + Optional(args) + rparen)
Now you can parse "f(g(x), y, h())".
Welcome to pyparsing!
The definition of arg should be arranged with the item that starts with another at the left, so it is matched preferentially:
arg = expression | identifier | integer
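The ordering point can also be checked against a hand-rolled recursive-descent sketch (plain Python, no pyparsing; names and tuple shapes are mine): trying the call form before falling back to a bare identifier is exactly what listing expression first accomplishes.

```python
import re

def tokenize(text):
    # Identifiers/integers plus the punctuation we care about.
    return re.findall(r"\w+|[(),]", text)

def parse_arg(toks, i):
    # arg = expression | identifier | integer -- the call form is tried first.
    if i + 1 < len(toks) and toks[i + 1] == "(":
        return parse_call(toks, i)
    return toks[i], i + 1

def parse_call(toks, i):
    # expression = functor "(" [args] ")"
    name = toks[i]
    assert toks[i + 1] == "("
    i += 2
    args = []
    while toks[i] != ")":
        node, i = parse_arg(toks, i)
        args.append(node)
        if toks[i] == ",":
            i += 1
    return (name, args), i + 1

tree, _ = parse_call(tokenize("f(g(x), y, h())"), 0)
print(tree)  # ('f', [('g', ['x']), 'y', ('h', [])])
```

If parse_arg returned the bare identifier first, the inner g would be accepted as a plain argument and the following ( would be a syntax error -- the same failure the pyparsing grammar exhibited.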
Paul's answer helped a lot. For posterity, the same approach can be used to define block statements (an if/elif block is shown here; a simplified pseudo-parser, to show the structure):
from pyparsing import (
Forward, Group, Keyword, Literal, OneOrMore)
sep = Literal(';')
if_ = Keyword('if')
then_ = Keyword('then')
elif_ = Keyword('elif')
end_ = Keyword('end')
if_block = Forward()
do_block = Forward()
stmt = other | if_block  # 'other' stands in for non-block statements, elided here
stmts = OneOrMore(stmt + sep)
case = Group(guard + then_ + stmts)  # 'guard' stands in for a condition rule
cases = case + OneOrMore(elif_ + case)
if_block << if_ + cases + end_
