Rule precedence issue with grako - python

I'm redoing a minilanguage I originally built in Perl (see Chessa# on GitHub), but I'm running into a number of issues when I go to apply semantics.
Here is the grammar:
(* integers *)
DEC = /([1-9][0-9]*|0+)/;
int = /(0b[01]+|0o[0-7]+|0x[0-9a-fA-F]+)/ | DEC;
(* floats *)
pointfloat = /([0-9]*\.[0-9]+|[0-9]+\.)/;
expfloat = /([0-9]+\.?|[0-9]*\.)[eE][+-]?[0-9]+/;
float = pointfloat | expfloat;
list = '[' @+:atom {',' @+:atom}* ']';
(* atoms *)
identifier = /[_a-zA-Z][_a-zA-Z0-9]*/;
symbol = int |
float |
identifier |
list;
(* functions *)
arglist = @+:atom {',' @+:atom}*;
function = identifier '(' [arglist] ')';
atom = function | symbol;
prec8 = '(' atom ')' | atom;
prec7 = [('+' | '-' | '~')] prec8;
prec6 = prec7 ['!'];
prec5 = [prec6 '**'] prec6;
prec4 = [prec5 ('*' | '/' | '%' | 'd')] prec5;
prec3 = [prec4 ('+' | '-')] prec4;
(* <| and >| are rotate-left and rotate-right, respectively. They assume the nearest C size. *)
prec2 = [prec3 ('<<' | '>>' | '<|' | '>|')] prec3;
prec1 = [prec2 ('&' | '|' | '^')] prec2;
expr = prec1 $;
The issue I'm running into is that the d operator is being pulled into the identifier rule when no whitespace exists between the operator and any following alphanumeric strings. While the grammar itself is LL(2), I don't understand where the issue is here.
For instance, 4d6 stops the parser because it's being interpreted as 4 d6, where d6 is an identifier. What should occur is that it's interpreted as 4 d 6, with the d being an operator. In an LL parser, this would indeed be the case.
A possible solution would be to disallow d from beginning an identifier, but this would disallow functions such as drop from being named as such.

In Perl, you can use Marpa, a general BNF parser, which supports generalized precedence with associativity (and much more) out of the box, e.g.
:start ::= Script
Script ::= Expression+ separator => comma
comma ~ [,]
Expression ::=
Number bless => primary
| '(' Expression ')' bless => paren assoc => group
|| Expression '**' Expression bless => exponentiate assoc => right
|| Expression '*' Expression bless => multiply
| Expression '/' Expression bless => divide
|| Expression '+' Expression bless => add
| Expression '-' Expression bless => subtract
In this DSL, | separates alternatives of equal precedence, while || starts a new, looser precedence tier. A full working example is here. As for programming languages, there is a C parser based on Marpa.
Hope this helps.

The problem with your example is that Grako has the nameguard feature enabled by default, and that won't allow parsing just the d when d6 is ahead.
To disable the feature, instantiate your own Buffer and pass it to an instance of the generated parser:
from grako.buffering import Buffer
from myparser import MyParser
# get the text
parser = MyParser()
parser.parse(Buffer(text, nameguard=False), 'expr')  # 'expr' is the grammar's start rule
The tip version of Grako in the Bitbucket repository adds a --no-nameguard command-line option to generated parsers.

Related

DSL for generating sequences

I'm trying to create a DSL to generate sequences. Here is what I have so far:
?start : expr
token : WORD
repeat_token : token ":" INT
tokens : (token | repeat_token)+
repeat : ":" INT
expr : "(" tokens | expr ")" repeat?
Here is what the DSL looks like:
(a b:2 (c d:3):2 ):3
[[a bb [[c ddd] [c ddd]] ] ... ]
I have a problem with an expr within an expr. This fails:
(a:2 (b))
How do you see fitting (a:2 (b)) into your grammar? It doesn't seem like you can. Here's my logic:
The outer level has to be an expr because of the parens. In that expr you have both a repeat_token and another expr. I don't see anywhere that lets you have a sequence of elements that includes both repeat_tokens and exprs. Because of that, your input can't be parsed with your grammar.
As it is, an expr can only appear inside another expr all by itself, which doesn't seem very useful in general; it could only lead to extra sets of parentheses. What I think you need to do is allow an expr to be included in a tokens.
So then maybe:
?start : expr
token : WORD
repeat_token : token ":" INT
tokens : (token | repeat_token | expr)+
repeat : ":" INT
expr : "(" tokens ")" repeat?

pyparsing nestedExpr and double closing characters

I am trying to parse nested column type definitions such as
1 string
2 struct<col_1:string,col_2:int>
3 row(col_1 string,array(col_2 string),col_3 boolean)
4 array<struct<col_1:string,col_2:int>,col_3:boolean>
5 array<struct<col_1:string,col2:int>>
Using nestedExpr works as expected for cases 1-4, but throws a parse error on case 5. Adding a space between double closing brackets, as in "> >", seems to work, which might be explained by this quote from the author:
By default, nestedExpr will look for space-delimited words of printables
https://sourceforge.net/p/pyparsing/bugs/107/
I'm mostly looking for alternatives to pre- and post-processing the input string:
type_str = type_str.replace(">", "> ")
# parse string here
type_str = type_str.replace("> ", ">")
I've tried using infix_notation, but I haven't been able to figure out how to use it in this situation. I'm probably just using it the wrong way.
Code snippet
array_keyword = pp.Keyword('array')
row_keyword = pp.Keyword('row')
struct_keyword = pp.Keyword('struct')
nest_open = pp.Word('<([')
nest_close = pp.Word('>)]')
col_name = pp.Word(pp.alphanums + '_')
col_type = pp.Forward()
col_type_delimiter = pp.Word(':') | pp.White(' ')
column = col_name('name') + col_type_delimiter + col_type('type')
col_list = pp.delimitedList(pp.Group(column))
struct_type = pp.nestedExpr(
opener=struct_keyword + nest_open, closer=nest_close, content=col_list | col_type, ignoreExpr=None
)
row_type = pp.locatedExpr(pp.nestedExpr(
opener=row_keyword + nest_open, closer=nest_close, content=col_list | col_type, ignoreExpr=None
))
array_type = pp.nestedExpr(
opener=array_keyword + nest_open, closer=nest_close, content=col_type, ignoreExpr=None
)
col_type <<= struct_type('children') | array_type('children') | row_type('children') | scalar_type('type')
nestedExpr and infixNotation are not really appropriate for this project. nestedExpr is generally a shortcut expression for stuff you don't really want to parse in detail; you just want to detect and step over some chunk of text that happens to have some nesting in opening and closing punctuation. infixNotation is intended for parsing expressions with unary and binary operators, usually some kind of arithmetic. You might be able to treat the punctuation in your grammar as operators, but it is a stretch, and definitely doing things the hard way.
For your project, you will really need to define the different elements, and it will be a recursive grammar (since the array and struct types will themselves be defined in terms of other types, which could also be arrays or structs).
I took a stab at a BNF for a subset of your grammar, using scalar types int, float, boolean, and string, and compound types array and struct, with just the '<' and '>' nesting punctuation. An array takes a single type argument, which defines the type of the elements in the array. A struct takes one or more struct fields, where each field is an identifier:type pair.
scalar_type ::= 'int' | 'float' | 'string' | 'boolean'
array_type ::= 'array' '<' type_defn '>'
struct_type ::= 'struct' '<' struct_element (',' struct_element)... '>'
struct_element ::= identifier ':' type_defn
type_defn ::= scalar_type | array_type | struct_type
(If you later want to add a row definition also, think about what the row is supposed to look like, and how its elements would be defined, and then add it to this BNF.)
You look pretty comfortable with the basics of pyparsing, so I'll just start you off with some intro pieces, and then let you fill in the rest.
# define punctuation
LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword('array')
STRUCT = pp.Keyword('struct')
# create a Forward that will be used in other type expressions
type_defn = pp.Forward()
# here is the array type, you can fill in the other types following this model
# and the definitions in the BNF
array_type = pp.Group(ARRAY + LT + type_defn + GT)
...
# then finally define type_defn in terms of the other type expressions
type_defn <<= scalar_type | array_type | struct_type
Once you have that finished, try it out with some tests:
type_defn.runTests("""\
string
struct<col_1:string,col_2:int>
array<struct<col_1:string,col2:int>>
""", fullDump=False)
And you should get something like:
string
['string']
struct<col_1:string,col_2:int>
['struct', [['col_1', 'string'], ['col_2', 'int']]]
array<struct<col_1:string,col2:int>>
['array', ['struct', [['col_1', 'string'], ['col2', 'int']]]]
Once you have that, you can play around with extending it to other types, such as your row type, maybe unions, or arrays that take multiple types (if that was your intention in your posted example). Always start by updating the BNF - then the changes you'll need to make in the code will generally follow.
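For reference, here is one way the finished sketch might look (the grouping choices are mine, not necessarily the author's; where you place Group determines the exact nesting of the results):
import pyparsing as pp

# punctuation and keywords
LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword('array')
STRUCT = pp.Keyword('struct')

identifier = pp.Word(pp.alphas + '_', pp.alphanums + '_')
scalar_type = pp.oneOf("int float string boolean")

type_defn = pp.Forward()
# array '<' type_defn '>'
array_type = pp.Group(ARRAY + LT + type_defn + GT)
# struct '<' identifier ':' type_defn (',' identifier ':' type_defn)... '>'
struct_element = pp.Group(identifier + COLON + type_defn)
struct_type = pp.Group(STRUCT + LT + pp.Group(pp.delimitedList(struct_element)) + GT)
type_defn <<= scalar_type | array_type | struct_type
Note that case 5 from the question (the double '>>' in array<struct<...>>) is no problem here: each '>' is matched as its own suppressed literal, so no special handling is needed.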

Parsimonious ParseError

Digging deeper into grammars, and PEGs in particular, I wanted to have a DSL with the following syntax:
a OR (b AND c)
I am using parsimonious here with the following grammar:
from parsimonious.grammar import Grammar
grammar = Grammar(
"""
expr = (term operator term)+
term = (lpar term rpar) / (variable operator variable)
operator = and / or
or = _? "OR" _?
and = _? "AND" _?
variable = ~r"[a-z]+"
lpar = "("
rpar = ")"
_ = ~r"\s*"
"""
)
print(grammar.parse('a OR (b AND c)'))
However, this fails for the above text with
parsimonious.exceptions.ParseError: Rule 'variable' didn't match at '(b AND c)' (line 1, column 6).
Why? Haven't I specified a term as either ( term ) or variable operator variable?
Why does it choose the variable rule instead (which of course fails)?
The first thing in expr is a term, so that's what the parser looks for.
A term in your grammar is either
( term )
or
variable operator variable
And the input is
a OR (b AND c)
That doesn't start with a ( so the only way it can be a term is if it matches variable operator variable. a is a variable; OR is an operator. So the next thing to match is variable.
Perhaps what you want is:
expr = term (operator term)*
term = (lpar expr rpar) / variable
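Putting that together (a sketch; the remaining rules are carried over from the question unchanged):
from parsimonious.grammar import Grammar

grammar = Grammar(
    r"""
    expr = term (operator term)*
    term = (lpar expr rpar) / variable
    operator = and / or
    or = _? "OR" _?
    and = _? "AND" _?
    variable = ~r"[a-z]+"
    lpar = "("
    rpar = ")"
    _ = ~r"\s*"
    """
)

# now both plain variables and parenthesized sub-expressions are terms
print(grammar.parse('a OR (b AND c)'))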

Accessing attributes on literals work on all types, but not `int`; why? [duplicate]

I have read that everything in Python is an object, so I started to experiment with different types, invoking __str__ on them. At first I was really excited, but then I got confused.
>>> "hello world".__str__()
'hello world'
>>> [].__str__()
'[]'
>>> 3.14.__str__()
'3.14'
>>> 3..__str__()
'3.0'
>>> 123.__str__()
File "<stdin>", line 1
123.__str__()
^
SyntaxError: invalid syntax
Why does something.__str__() work for "everything" besides int?
Is 123 not an object of type int?
You need parens:
(4).__str__()
The problem is the lexer thinks "4." is going to be a floating-point number.
Also, this works:
x = 4
x.__str__()
123 is just as much an object as 3.14; the "problem" lies within the grammar rules of the language: the parser thinks we are about to define a float, not an int with a trailing method call.
We get the expected behavior if we wrap the number in parentheses, as shown below.
>>> (123).__str__()
'123'
Or if we simply add some whitespace after 123:
>>> 123 .__str__()
'123'
The reason it does not work for 123.__str__() is that the dot following the 123 is interpreted as the decimal point of a partially written floating-point literal.
>>> 123.__str__()
File "", line 1
123.__str__()
^
SyntaxError: invalid syntax
The parser tries to interpret __str__() as a sequence of digits, but obviously fails, and we get a SyntaxError basically saying that it stumbled upon something it did not expect.
Elaboration
When looking at 123.__str__(), the Python tokenizer could either use 3 characters and interpret them as an integer, or use 4 characters and interpret them as the start of a floating-point literal.
123.__str__()
^^^ - int
123.__str__()
^^^^- start of floating-point
Just as a little child would like as much cake as possible on their plate, the tokenizer is greedy and would like to swallow as much as it can all at once, even if this isn't always the best of ideas; as such, the latter ("longer") alternative is chosen.
When it later realizes that __str__() can in no way be interpreted as the decimals of a floating-point number, it is already too late: SyntaxError.
Note
123 .__str__() # works fine
In the above snippet, 123 (note the space) must be interpreted as an integer since no number can contain spaces. This means that it is semantically equivalent to (123).__str__().
Note
123..__str__() # works fine
The above also works because a number can contain at most one decimal point, meaning that it is equivalent to (123.).__str__().
For the language-lawyers
This section contains the lexical definition of the relevant literals.
Lexical analysis - 2.4.5 Floating point literals
floatnumber ::= pointfloat | exponentfloat
pointfloat ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart ::= digit+
fraction ::= "." digit+
exponent ::= ("e" | "E") ["+" | "-"] digit+
Lexical analysis - 2.4.4 Integer literals
integer ::= decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::= nonzerodigit digit* | "0"+
nonzerodigit ::= "1"..."9"
digit ::= "0"..."9"
octinteger ::= "0" ("o" | "O") octdigit+
hexinteger ::= "0" ("x" | "X") hexdigit+
bininteger ::= "0" ("b" | "B") bindigit+
octdigit ::= "0"..."7"
hexdigit ::= digit | "a"..."f" | "A"..."F"
bindigit ::= "0" | "1"
Add a space after the 4:
4 .__str__()
Otherwise, the lexer will split this expression into the tokens "4.", "__str__", "(" and ")", i.e. the first token is interpreted as a floating point number. The lexer always tries to build the longest possible token.
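You can watch the lexer do this with the standard-library tokenize module (a quick sketch):
import io
import tokenize

for src in ("123 .__str__()", "3.14.__str__()"):
    tokens = tokenize.generate_tokens(io.StringIO(src).readline)
    print(src, [(tokenize.tok_name[t.type], t.string) for t in tokens])
With the space, "123" is an integer token and "." a separate operator token; in "3.14.__str__()" the lexer grabs the longest number it can ("3.14") before handing over the dot.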
Actually (to increase unreadability...):
4..hex()
is valid, too. It gives '0x1.0000000000000p+2', but then it's a float, of course...

pyparsing to parse a python function call in its most general form

I would like to use the excellent pyparsing package to parse a python function call in its most general form. I read one post that was somewhat useful here but still not general enough.
I would like to parse the following expression:
f(arg1,arg2,arg3,...,kw1=var1,kw2=var2,kw3=var3,...)
where
arg1, arg2, arg3, ... are any kind of valid Python object (integer, real, list, dict, function, variable name, ...)
kw1, kw2, kw3, ... are valid Python keyword names
var1, var2, var3, ... are valid Python objects
I was wondering if a grammar could be defined for such a general template. I am perhaps asking too much... Would you have any idea?
Thank you very much for your help.
Eric
Is that all? Let's start with a simple informal BNF for this:
func_call ::= identifier '(' func_arg [',' func_arg]... ')'
func_arg ::= named_arg | arg_expr
named_arg ::= identifier '=' arg_expr
arg_expr ::= identifier | real | integer | dict_literal | list_literal | tuple_literal | func_call
identifier ::= (alpha|'_') (alpha|num|'_')*
alpha ::= some letter 'a'..'z' 'A'..'Z'
num ::= some digit '0'..'9'
Translating to pyparsing, work bottom-up:
from pyparsing import (Word, alphas, alphanums, Forward, Group,
                       delimitedList, quotedString, unicodeString)

identifier = Word(alphas+'_', alphanums+'_')
# definitions of real, integer, dict_literal, list_literal, tuple_literal go here
# see further text below
# define a placeholder for func_call - we don't have it yet, but we need it now
func_call = Forward()
string = quotedString | unicodeString
# try func_call before identifier, or a nested call like f(g(x)) would stop
# after matching 'g' as a plain identifier
arg_expr = func_call | identifier | real | integer | string | dict_literal | list_literal | tuple_literal
named_arg = identifier + '=' + arg_expr
# to define func_arg, must first see if it is a named_arg
# why do you think this is?
func_arg = named_arg | arg_expr
# now define func_call using '<<' instead of '=', to "inject" the definition
# into the previously declared Forward
#
# Group each arg to keep its set of tokens separate, otherwise you just get one
# continuous list of parsed strings, which is almost as worthless as the
# original string
func_call << identifier + '(' + delimitedList(Group(func_arg)) + ')'
Those arg_expr elements could take a while to work through, but fortunately, you can get them off the pyparsing wiki's Examples page: http://pyparsing.wikispaces.com/file/view/parsePythonValue.py
from parsePythonValue import (integer, real, dictStr as dict_literal,
listStr as list_literal, tupleStr as tuple_literal)
You still might get args passed using *list_of_args or **dict_of_named_args notation. Expand arg_expr to support these:
deref_list = '*' + (identifier | list_literal | tuple_literal)
deref_dict = '**' + (identifier | dict_literal)
# same ordering note as above: func_call before identifier
arg_expr = func_call | identifier | real | integer | string | dict_literal | list_literal | tuple_literal | deref_list | deref_dict
Write yourself some test cases now - start simple and work your way up to complicated:
sin(30)
sin(a)
hypot(a,b)
len([1,2,3])
max(*list_of_vals)
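A small harness to run these (a sketch, assuming the definitions above and the parsePythonValue imports are in scope):
for test in ["sin(30)", "sin(a)", "hypot(a,b)",
             "len([1,2,3])", "max(*list_of_vals)"]:
    print(test, '->', func_call.parseString(test))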
Additional argument types that will need to be added to arg_expr (left as a further exercise for the OP):
indexed arguments: dictval['a'], divmod(10,3)[0], range(10)[::2]
object attribute references: a.b.c
arithmetic expressions: sin(30), sin(a+2*b)
comparison expressions: sin(a+2*b) > 0.5, 10 < a < 20
boolean expressions: a or b and not (d or c and b)
lambda expressions: lambda x : sin(x+math.pi/2)
list comprehensions
generator expressions
