I'm trying to create a DSL to generate sequences. Here is what I have so far:
?start : expr
token : WORD
repeat_token : token ":" INT
tokens : (token | repeat_token)+
repeat : ":" INT
expr : "(" tokens | expr ")" repeat?
Here is what the DSL looks like:
(a b:2 (c d:3):2 ):3
[[a bb [[c ddd] [c ddd]] ] ... ]
I have a problem with an expr within an expr. This fails:
(a:2 (b))
How do you see fitting (a:2 (b)) into your grammar? It doesn't seem like you can. Here's my logic:
The outer level has to be an expr because of the parens. In that expr you have both a repeat_token and another expr. I don't see anywhere that lets you have a sequence of elements that includes both repeat_tokens and exprs. Because of that, your input can't be parsed with your grammar.
As it is, an expr can only be in another expr all by itself, which doesn't seem very useful in general; that could only lead to extra sets of parentheses, I think. What I think you need to do is allow an expr to be included in a tokens.
So then maybe:
?start : expr
token : WORD
repeat_token : token ":" INT
tokens : (token | repeat_token | expr)+
repeat : ":" INT
expr : "(" tokens ")" repeat?
I am trying to parse nested column type definitions such as:
1. string
2. struct<col_1:string,col_2:int>
3. row(col_1 string,array(col_2 string),col_3 boolean)
4. array<struct<col_1:string,col_2:int>,col_3:boolean>
5. array<struct<col_1:string,col2:int>>
Using nestedExpr works as expected for cases 1-4, but it throws a parse error on case 5. Adding a space between the double closing brackets, like "> >", seems to work, which might be explained by this quote from the author:
By default, nestedExpr will look for space-delimited words of printables
https://sourceforge.net/p/pyparsing/bugs/107/
I'm mostly looking for alternatives to pre- and post-processing the input string:
type_str = type_str.replace(">", "> ")
# parse string here
type_str = type_str.replace("> ", ">")
I've tried using infix_notation, but I haven't been able to figure out how to use it in this situation. I'm probably just using it the wrong way...
Code snippet:
import pyparsing as pp

array_keyword = pp.Keyword('array')
row_keyword = pp.Keyword('row')
struct_keyword = pp.Keyword('struct')
nest_open = pp.Word('<([')
nest_close = pp.Word('>)]')
col_name = pp.Word(pp.alphanums + '_')
col_type = pp.Forward()
col_type_delimiter = pp.Word(':') | pp.White(' ')
column = col_name('name') + col_type_delimiter + col_type('type')
col_list = pp.delimitedList(pp.Group(column))
struct_type = pp.nestedExpr(
    opener=struct_keyword + nest_open, closer=nest_close,
    content=col_list | col_type, ignoreExpr=None
)
row_type = pp.locatedExpr(pp.nestedExpr(
    opener=row_keyword + nest_open, closer=nest_close,
    content=col_list | col_type, ignoreExpr=None
))
array_type = pp.nestedExpr(
    opener=array_keyword + nest_open, closer=nest_close,
    content=col_type, ignoreExpr=None
)
col_type <<= struct_type('children') | array_type('children') | row_type('children') | scalar_type('type')
nestedExpr and infixNotation are not really appropriate for this project. nestedExpr is generally a short-cut expression for stuff you don't really want to go into details parsing, you just want to detect and step over some chunk of text that happens to have some nesting in opening and closing punctuation. infixNotation is intended for parsing expressions with unary and binary operators, usually some kind of arithmetic. You might be able to treat the punctuation in your grammar as operators, but it is a stretch, and definitely doing things the hard way.
For your project, you will really need to define the different elements, and it will be a recursive grammar (since the array and struct types will themselves be defined in terms of other types, which could also be arrays or structs).
I took a stab at a BNF, for a subset of your grammar using scalar types int, float, boolean, and string, and compound types array and struct, with just the '<' and '>' nesting punctuation. An array will take a single type argument, to define the type of the elements in the array. A struct will take one or more struct fields, where each field is an identifier:type pair.
scalar_type ::= 'int' | 'float' | 'string' | 'boolean'
array_type ::= 'array' '<' type_defn '>'
struct_type ::= 'struct' '<' struct_element (',' struct_element)... '>'
struct_element ::= identifier ':' type_defn
type_defn ::= scalar_type | array_type | struct_type
(If you later want to add a row definition also, think about what the row is supposed to look like, and how its elements would be defined, and then add it to this BNF.)
You look pretty comfortable with the basics of pyparsing, so I'll just start you off with some intro pieces, and then let you fill in the rest.
import pyparsing as pp

# define punctuation
LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword('array')
STRUCT = pp.Keyword('struct')
# create a Forward that will be used in other type expressions
type_defn = pp.Forward()
# here is the array type, you can fill in the other types following this model
# and the definitions in the BNF
array_type = pp.Group(ARRAY + LT + type_defn + GT)
...
# then finally define type_defn in terms of the other type expressions
type_defn <<= scalar_type | array_type | struct_type
Once you have that finished, try it out with some tests:
type_defn.runTests("""\
string
struct<col_1:string,col_2:int>
array<struct<col_1:string,col2:int>>
""", fullDump=False)
And you should get something like:
string
['string']
struct<col_1:string,col_2:int>
['struct', [['col_1', 'string'], ['col_2', 'int']]]
array<struct<col_1:string,col2:int>>
['array', ['struct', [['col_1', 'string'], ['col2', 'int']]]]
Once you have that, you can play around with extending it to other types, such as your row type, maybe unions, or arrays that take multiple types (if that was your intention in your posted example). Always start by updating the BNF - then the changes you'll need to make in the code will generally follow.
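If you get stuck on the exercise, here is one way the remaining pieces might look - a sketch that follows the BNF above; the exact spelling of scalar_type, identifier, and struct_element is my own choice, not the only option:

import pyparsing as pp

LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword('array')
STRUCT = pp.Keyword('struct')

# identifier for struct field names
identifier = pp.Word(pp.alphas + '_', pp.alphanums + '_')

type_defn = pp.Forward()

# scalar_type ::= 'int' | 'float' | 'string' | 'boolean'
scalar_type = pp.oneOf("int float string boolean")

# array_type ::= 'array' '<' type_defn '>'
array_type = pp.Group(ARRAY + LT + type_defn + GT)

# struct_element ::= identifier ':' type_defn
struct_element = pp.Group(identifier + COLON + type_defn)

# struct_type ::= 'struct' '<' struct_element (',' struct_element)... '>'
# the inner Group keeps the field list nested, matching the output shown above
struct_type = pp.Group(STRUCT + LT + pp.Group(pp.delimitedList(struct_element)) + GT)

type_defn <<= scalar_type | array_type | struct_type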
Digging deeper into grammars, and PEGs in particular, I wanted to have a DSL with the following syntax:
a OR (b AND c)
I am using parsimonious here with the following grammar:
from parsimonious.grammar import Grammar

grammar = Grammar(
    r"""
    expr = (term operator term)+
    term = (lpar term rpar) / (variable operator variable)
    operator = and / or
    or = _? "OR" _?
    and = _? "AND" _?
    variable = ~r"[a-z]+"
    lpar = "("
    rpar = ")"
    _ = ~r"\s*"
    """
)

print(grammar.parse('a OR (b AND c)'))
However, this fails for the above text with
parsimonious.exceptions.ParseError: Rule 'variable' didn't match at '(b AND c)' (line 1, column 6).
Why? Haven't I specified that a term can be either (lpar term rpar) or (variable operator variable)?
Why does it choose the variable rule instead (which of course fails)?
The first thing in expr is a term, so that's what the parser looks for.
A term in your grammar is either
( term )
or
variable operator variable
And the input is
a OR (b AND c)
That doesn't start with a ( so the only way it can be a term is if it matches variable operator variable. a is a variable; OR is an operator. So the next thing to match is variable.
Perhaps what you want is:
expr = term (operator term)*
term = (lpar expr rpar) / variable
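Putting that fix into the full grammar gives something like this sketch, which parses the original input:

from parsimonious.grammar import Grammar

grammar = Grammar(
    r"""
    expr = term (operator term)*
    term = (lpar expr rpar) / variable
    operator = and / or
    or = _? "OR" _?
    and = _? "AND" _?
    variable = ~r"[a-z]+"
    lpar = "("
    rpar = ")"
    _ = ~r"\s*"
    """
)

# now 'a' matches as a variable term, and '(b AND c)' as a parenthesized expr
print(grammar.parse('a OR (b AND c)'))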
I'm redoing a minilanguage I originally built in Perl (see Chessa# on github), but I'm running into a number of issues when I go to apply semantics.
Here is the grammar:
(* integers *)
DEC = /([1-9][0-9]*|0+)/;
int = /(0b[01]+|0o[0-7]+|0x[0-9a-fA-F]+)/ | DEC;
(* floats *)
pointfloat = /([0-9]*\.[0-9]+|[0-9]+\.)/;
expfloat = /([0-9]+\.?|[0-9]*\.)[eE][+-]?[0-9]+/;
float = pointfloat | expfloat;
list = '[' @+:atom {',' @+:atom}* ']';
(* atoms *)
identifier = /[_a-zA-Z][_a-zA-Z0-9]*/;
symbol = int |
         float |
         identifier |
         list;
(* functions *)
arglist = @+:atom {',' @+:atom}*;
function = identifier '(' [arglist] ')';
atom = function | symbol;
prec8 = '(' atom ')' | atom;
prec7 = [('+' | '-' | '~')] prec8;
prec6 = prec7 ['!'];
prec5 = [prec6 '**'] prec6;
prec4 = [prec5 ('*' | '/' | '%' | 'd')] prec5;
prec3 = [prec4 ('+' | '-')] prec4;
(* <| and >| are rotate-left and rotate-right, respectively. They assume the nearest C size. *)
prec2 = [prec3 ('<<' | '>>' | '<|' | '>|')] prec3;
prec1 = [prec2 ('&' | '|' | '^')] prec2;
expr = prec1 $;
The issue I'm running into is that the d operator is being pulled into the identifier rule when no whitespace exists between the operator and any following alphanumeric strings. While the grammar itself is LL(2), I don't understand where the issue is here.
For instance, 4d6 stops the parser because it's being interpreted as 4 d6, where d6 is an identifier. What should occur is that it's interpreted as 4 d 6, with the d being an operator. In an LL parser, this would indeed be the case.
A possible solution would be to disallow d from beginning an identifier, but this would disallow functions such as drop from being named as such.
In Perl, you can use Marpa, a general BNF parser, which supports generalized precedence with associativity (and much more) out of the box, e.g.:
:start ::= Script
Script ::= Expression+ separator => comma
comma ~ [,]
Expression ::=
    Number bless => primary
    | '(' Expression ')' bless => paren assoc => group
    || Expression '**' Expression bless => exponentiate assoc => right
    || Expression '*' Expression bless => multiply
    | Expression '/' Expression bless => divide
    || Expression '+' Expression bless => add
    | Expression '-' Expression bless => subtract
Full working example is here. As for programming languages, there is a C parser based on Marpa.
Hope this helps.
The problem with your example is that Grako has the nameguard feature enabled by default, and that won't allow parsing just the d when d6 is ahead.
To disable the feature, instantiate your own Buffer and pass it to an instance of the generated parser:
from grako.buffering import Buffer
from myparser import MyParser
# get the text
parser = MyParser()
parser.parse(Buffer(text, nameguard=False), 'expr')
The tip version of Grako in the Bitbucket repository adds a --no-nameguard command-line option to generated parsers.
I would like to use the excellent pyparsing package to parse a Python function call in its most general form. I read one post that was somewhat useful here, but still not general enough.
I would like to parse the following expression:
f(arg1,arg2,arg3,...,kw1=var1,kw2=var2,kw3=var3,...)
where:
arg1, arg2, arg3, ... are any kind of valid Python object (integer, real, list, dict, function, variable name, ...)
kw1, kw2, kw3, ... are valid Python keyword names
var1, var2, var3, ... are valid Python objects
I was wondering if a grammar could be defined for such a general template. I am perhaps asking too much... Would you have any idea?
thank you very much for your help
Eric
Is that all? Let's start with a simple informal BNF for this:
func_call ::= identifier '(' func_arg [',' func_arg]... ')'
func_arg ::= named_arg | arg_expr
named_arg ::= identifier '=' arg_expr
arg_expr ::= identifier | real | integer | dict_literal | list_literal | tuple_literal | func_call
identifier ::= (alpha|'_') (alpha|num|'_')*
alpha ::= some letter 'a'..'z' 'A'..'Z'
num ::= some digit '0'..'9'
Translating to pyparsing, work bottom-up:
from pyparsing import (Word, alphas, alphanums, Forward, Group,
                       delimitedList, quotedString, unicodeString)

identifier = Word(alphas+'_', alphanums+'_')

# definitions of real, integer, dict_literal, list_literal, tuple_literal go here
# see further text below

# define a placeholder for func_call - we don't have it yet, but we need it now
func_call = Forward()

string = quotedString | unicodeString

# try func_call before identifier, otherwise the name of a nested call would
# match as a plain identifier and leave its '(' unparsed
arg_expr = func_call | identifier | real | integer | string | dict_literal | list_literal | tuple_literal

named_arg = identifier + '=' + arg_expr

# to define func_arg, must first see if it is a named_arg
# why do you think this is?
func_arg = named_arg | arg_expr

# now define func_call using '<<' instead of '=', to "inject" the definition
# into the previously declared Forward
#
# Group each arg to keep its set of tokens separate, otherwise you just get one
# continuous list of parsed strings, which is almost as worthless as the
# original string
func_call << identifier + '(' + delimitedList(Group(func_arg)) + ')'
Those arg_expr elements could take a while to work through, but fortunately, you can get them off the pyparsing wiki's Examples page: http://pyparsing.wikispaces.com/file/view/parsePythonValue.py
from parsePythonValue import (integer, real, dictStr as dict_literal,
                              listStr as list_literal, tupleStr as tuple_literal)
You still might get args passed using *list_of_args or **dict_of_named_args notation. Expand arg_expr to support these:
deref_list = '*' + (identifier | list_literal | tuple_literal)
deref_dict = '**' + (identifier | dict_literal)
arg_expr = func_call | identifier | real | integer | string | dict_literal | list_literal | tuple_literal | deref_list | deref_dict
Write yourself some test cases now - start simple and work your way up to complicated:
sin(30)
sin(a)
hypot(a,b)
len([1,2,3])
max(*list_of_vals)
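One convenient way to run these is runTests - a sketch, assuming the literal definitions above have been filled in from parsePythonValue.py:

func_call.runTests("""\
    sin(30)
    sin(a)
    hypot(a,b)
    len([1,2,3])
    max(*list_of_vals)
    """)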
Additional argument types that will need to be added to arg_expr (left as further exercise for the OP):
indexed arguments : dictval['a'], divmod(10,3)[0], range(10)[::2]
object attribute references : a.b.c
arithmetic expressions : sin(30), sin(a+2*b)
comparison expressions : sin(a+2*b) > 0.5, 10 < a < 20
boolean expressions : a or b and not (d or c and b)
lambda expressions : lambda x : sin(x+math.pi/2)
list comprehensions
generator expressions
I am working with pyparsing and found it to be excellent for developing a simple DSL that allows me to extract data fields out of MongoDB and do simple arithmetic operations on them. I am now trying to extend my tools so that I can apply functions of the form Rank[Person:Height] to the fields, and potentially include simple expressions as arguments to the function calls. I am struggling to get the parsing syntax to work. Here is what I have so far:
# Define parser
from pyparsing import (Word, Combine, Keyword, Optional, Forward, nums,
                       alphas, printables, oneOf, operatorPrecedence, opAssoc)

# (EvalConstant, EvalDBref, EvalFunction, EvalSignOp, EvalMultOp, and
# EvalAddOp are my own evaluator classes, defined elsewhere)

expr = Forward()
integer = Word(nums).setParseAction(EvalConstant)
real = Combine(Word(nums) + "." + Word(nums)).setParseAction(EvalConstant)

# Handle database field references that are coming out of Mongo,
# accounting for the fact that some fields contain whitespace
dbRef = Combine(Word(alphas) + ":" + Word(printables) +
                Optional(" " + Word(alphas) + " " + Word(alphas)))
dbRef.setParseAction(EvalDBref)

# Handle function calls
functionCall = (Keyword("Rank") | Keyword("ZS") | Keyword("Ntile")) + "[" + expr + "]"
functionCall.setParseAction(EvalFunction)

operand = functionCall | dbRef | (real | integer)

signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')

# Use parse actions to attach Eval constructors to sub-expressions
expr << operatorPrecedence(operand,
    [
        (signop, 1, opAssoc.RIGHT, EvalSignOp),
        (multop, 2, opAssoc.LEFT, EvalMultOp),
        (plusop, 2, opAssoc.LEFT, EvalAddOp),
    ])
My issue is that when I test a simple expression like Rank[Person:Height] I am getting a parse exception:
ParseException: Expected "]" (at char 19), (line:1, col:20)
If I use a float or arithmetic expression as the argument, like Rank[3 + 1.1], the parsing works OK, and if I simplify the dbRef grammar to just Word(alphas) it also works. I cannot for the life of me figure out what's wrong with my full grammar. I have tried rearranging the order of operands as well as simplifying the functionCall grammar, to no avail. Can anyone see what I am doing wrong?
Once I get this working, I would then like to take a last step and introduce support for variable assignment in expressions.
EDIT: Upon further testing, if I remove the printables from dbRef grammar, things work ok:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) + \
Optional("_" + Word(alphas)))
HOWEVER, if I add the character "-" to dbRef (which I need for DB fields like "Class:S-N"), the parser fails again. I think the "-" is being consumed by the signop in my operatorPrecedence?
What appears to happen is that the ] character at the end of your test string (Rank[Person:Height]) gets consumed as part of the dbRef token, because the portion of the token past the initial : is declared as being made of Word(printables), and that character set unfortunately includes the square bracket characters.
The parser then tries to produce a functionCall but is missing the closing ], hence the error message.
A tentative fix is to use a character set that doesn't include the square brackets, maybe something more explicit like:
dbRef = Combine(Word(alphas) + ":" + Word(alphas, alphas+"-_./") + \
Optional(" " + Word(alphas) + " " + Word(alphas)))
Edit:
Upon closer look, the above is loosely correct, but the token hierarchy is wrong (e.g. the parser attempts to produce a functionCall as one operand of an expr, etc.).
Also, my suggested fix will not work, because of the ambiguity with the - sign: it should be understood as a plain character when within a dbRef, and as a plusop when within an expr. This type of issue is common with parsers, and there are ways to deal with it, though I'm not sure exactly how with pyparsing.
Found the solution - the issue was that my grammar for dbRef was consuming some of the characters that were part of the function specification. Here is the new grammar that works correctly:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) + \
Optional(oneOf("_ -") + Word(alphas)))
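A quick sanity check of the corrected dbRef against the troublesome field - a sketch; Class:S-N is the example field mentioned above:

import pyparsing as pp

dbRef = pp.Combine(pp.Word(pp.alphas) + pp.OneOrMore(":") + pp.Word(pp.alphanums) +
                   pp.Optional(pp.oneOf("_ -") + pp.Word(pp.alphas)))

print(dbRef.parseString("Class:S-N"))      # -> ['Class:S-N']
print(dbRef.parseString("Person:Height"))  # -> ['Person:Height']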