I am writing a parser for an existing language, using the textX Python library (based on the Arpeggio PEG parser).
But when I try to use it to parse a file, I get the exception:
RecursionError: maximum recursion depth exceeded while calling a Python object
Here is a minimal example that raises this exception:
#!/usr/bin/env python
from textx import metamodel_from_str
meta_model_string = "Expr: ( Expr '+' Expr ) | INT ;"
model_string = "1 + 1"
mm = metamodel_from_str(meta_model_string, debug=True)
m = mm.model_from_str(model_string, debug=True)
I tracked it down to Arpeggio's left-recursion issue, where it is stated that a rule like A := A B is unsupported and should be converted to a rule that avoids the recursion.
So my question is: Is it possible to rewrite the Expr := Expr '+' Expr rule above in a way that does not use left recursion? Note that the real Expr rule is much more complicated. A slightly less simplified version of it will be:
Expr: '(' Expr ')' | Expr '+' Expr | Expr '*' Expr | '!' Expr | INT | STRING ;
textX author here. In addition to Paul's excellent answer, there is an expression example which should give you a good start.
Top-down parsers in general cannot handle left-recursive rules without workarounds like this. If your language is going to be complex and heavily expression-oriented, it might be better to try a bottom-up parser that allows left recursion and provides declarative priority and associativity specification. If you liked textX, I suggest taking a look at parglare, which has similar design goals but uses a bottom-up parsing technique (specifically LR and GLR). Its quick intro example is the exact language you are building.
In this post I blogged about the rationale for starting the parglare project and its differences from textX/Arpeggio.
This is more typically written as:
multop: '*' | '/'
addop: '+' | '-'
Factor: INT | STRING | '(' Expr ')' ;
Term: Factor [multop Factor]... ;
Expr: Term [addop Term]... ;
Now the grammar only recurses back to Expr after first matching a leading '(' inside Factor, so there is no left recursion. You will also get groups that correspond to precedence of operations. (Note that the repetition for Expr and Term will end up producing groups like ['1', '+', '1', '+', '1'], when you might have expected [['1', '+', '1'], '+', '1'], which is what a left-recursive parser would give you.)
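For textX specifically, the same layering can be written in textX's own rule syntax; this is a sketch, assuming the built-in INT and STRING match rules and textX's `(...)* ` zero-or-more repetition:

```
Expr: Term (('+' | '-') Term)*;
Term: Factor (('*' | '/') Factor)*;
Factor: INT | STRING | '(' Expr ')';
```

Since Factor only reaches Expr again after consuming a '(' token, the left recursion is gone while the precedence layering is preserved.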
I've been writing a lexer/parser/interpreter for my own language and so far everything has been working. I've been following the examples over at Ruslan Spivak's blog (GitHub link to each article).
I wanted to extend my language grammar past what is written in the articles to include more operators like comparisons (<, >=, etc.) and also exponents (** or ^ in my language). I have this grammar:
expression : exponent ((ADD | SUB) exponent)*
exponent : term ((POWER) term)*
# this one is right-associative (powers **)
term : comparison ((MUL | DIV) comparison)*
comparison : factor ((EQUAL | L_EQUAL | LESS
             | N_EQUAL | G_EQUAL | GREATER) factor)*
# these are all binary operations
factor : NUM | STR | variable
| ADD factor | SUB factor
| LPAREN expr RPAREN
# different types of 'base' types like integers
# also contains parenthesised expressions which are evaluated first
In terms of parsing tokens, I use the same method as used in Ruslan's blog. Here is one that will parse the exponent line, which handles addition and subtraction despite its name, as the grammar says that expressions are parsed as
exponent_expr (+ / -) exponent_expr
def exponent(self):
    node = self.term()
    while self.current_token.type in (ADD, SUB):
        token = self.current_token
        if token.type == ADD:
            self.consume_token(ADD)
        elif token.type == SUB:
            self.consume_token(SUB)
        node = BinaryOperation(left_node=node,
                               operator=token,
                               right_node=self.term())
    return node
Now this parses left-associative tokens just fine (since the token stream comes left to right naturally), but I am stuck on how to parse right-associative exponents. Look at this expected in/out for reference:
>>> 2 ** 3 ** 2
# should be parsed as...
>>> 2 ** (3 ** 2)
# which is...
>>> 2 ** 9
# which returns...
512
# Mine, at the moment, parses it as...
>>> (2 ** 3) ** 2
# which is...
>>> 8 ** 2
# which returns...
64
To solve this, I tried switching the BinaryOperation() constructor's left and right nodes to make the current node the right and the new node the left, but this just makes 2**5 parse as 5**2 which gives me 25 instead of the expected 32.
Any approaches that I could try?
The fact that your exponent function actually parses expressions should have been a red flag. In fact, what you need is an expression function which parses expressions and an exponent function which parses exponentiations.
You've also mixed up the precedences of exponentiation and multiplication (and other operations), because 2 * x ** 4 does not mean (2 * x) ** 4 (which would be 16x⁴), but rather 2 * (x ** 4). By the same token, x * 3 < 17 does not mean x * (3 < 17), which is how your grammar will parse it.
Normally the precedences for arithmetics look something like this:
comparison      <, <=, ==, ...    (lowest precedence)
additive        +, -
multiplicative  *, /, %
unary           +, -
exponentiation  **
atoms           numbers, variables, parenthesized expressions, etc.
(If you had postfix operators like function calls, they would go in between exponentiation and atoms.)
Once you've reworked your grammar in this form, the exponent parser will look something like this:
def exponent(self):
    node = self.term()
    while self.current_token.type == POWER:
        token = self.current_token
        self.consume_token(POWER)
        node = BinaryOperation(left_node=node,
                               operator=token,
                               right_node=self.exponent())
    return node
The recursive call at the end produces right associativity. In this case recursion is acceptable because the left operand and the operator have already been consumed. Thus the recursive call cannot produce an infinite loop.
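To see the pattern in isolation, here is a minimal self-contained sketch; the names (tokenize, parse_power) and the tuple-based nodes are illustrative, not the question's actual classes:

```python
import re

def tokenize(text):
    # Only what this sketch needs: '**' and integer literals.
    return re.findall(r"\*\*|\d+", text)

def parse_power(tokens):
    """power : INT ('**' power)?  -- right-recursive, never left-recursive."""
    left = int(tokens.pop(0))      # left operand and operator are consumed
    if tokens and tokens[0] == "**":
        tokens.pop(0)              # ...before recursing, so no infinite loop
        return ("**", left, parse_power(tokens))
    return left

def evaluate(node):
    if isinstance(node, tuple):
        _, left, right = node
        return evaluate(left) ** evaluate(right)
    return node

tree = parse_power(tokenize("2 ** 3 ** 2"))
print(tree)            # ('**', 2, ('**', 3, 2)) -- grouped to the right
print(evaluate(tree))  # 512
```

The recursive call in parse_power plays the same role as self.exponent() above: everything to the right of the first ** becomes a single right operand.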
Let's take a look at the simplest arithmetic example in the pyparsing doc, here.
More specifically, I'm looking at the "+" operation that is defined as left associative and the first example test where we're parsing "9 + 2 + 3".
The outcome I would have expected is ((9+2)+3): first apply the infix binary operator to 9 and 2, then apply it to the result and 3. What I get, however, is (9+2+3), all on the same level, which is not all that helpful; after all, I now have to decide the order of evaluation myself, and yet the operator was defined to be left-associative. Why am I forced to parenthesize myself? What am I missing?
Conventionally 1e3 means 10**3.
>>> 1e3
1000.0
>>> 10**3
1000
A similar case is exp(3) compared to e**3.
>>> exp(3)
20.085536923187668
>>> e**3
20.085536923187664
However, now notice what happens when the exponent is a float value:
>>> exp(3.1)
22.197951281441636
>>> e**3.1
22.197951281441632
which is fine. Now for the first example:
>>> 1e3.1
File "<stdin>", line 1
1e3.1
^
SyntaxError: invalid syntax
>>> 10**3.1
1258.9254117941675
which shows that Python does not accept 1e3.1; neither does Fortran.
Regardless, is there a standard behind this, and why is it like that?
The notation with the e is a numeric literal, part of the lexical syntax of many programming languages, based on standard form/scientific notation.
The purpose of this notation is to allow you to specify very large/small numbers by shifting the point position. It's not intended to allow you to encode multiplication by some arbitrary power of 10 into numeric literals. Therefore, that point and the following digits aren't even recognised as part of the numeric literal token.
If you want arbitrary powers, as you've found, there are math functions and operators that do the job. Unlike a numeric literal, you even get to determine the parameter values at run-time.
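A quick sketch of that distinction (the run-time exponent value here is just an assumed example):

```python
# 1e3 is a single float literal fixed by the lexer; ** is an operator whose
# right-hand side can be any expression, evaluated at run-time.
compile_time = 1e3            # one token: the literal 1000.0
exponent = 3.1                # could come from input(), a file, etc.
run_time = 10 ** exponent     # arbitrary (even fractional) powers of 10
print(compile_time)           # 1000.0
print(run_time)               # 1258.9254117941675
```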
From the docs:
sign ::= '+' | '-'
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
indicator ::= 'e' | 'E'
digits ::= digit [digit]...
decimal-part ::= digits '.' [digits] | ['.'] digits
exponent-part ::= indicator [sign] digits #no dots allowed here
You seem to be conflating syntax for literals with operators. While you can claim that 1e3.1 follows your "convention", it should be quite clear that 1e3.1 is not a valid literal to the Python interpreter. The language has a defined standard grammar, and that grammar doesn't support floating-point expressions as "exponents" in its numeric literals.
The "e" in a Python numeric literal is not an operator (any more than the decimal point would be). So your expectation that Python's literal syntax should support some "convention" ... based on some pattern you've divined ... is not particularly reasonable.
PEP 8 doesn't mention the slice operator. From my understanding, unlike other operators, it should not be surrounded with whitespace:
spam[3:5] # OK
spam[3 : 5] # NOT OK
Does this hold when using complex expressions? That is, which one is considered better style:
1. spam[ham(66)//3:44+eggs()]
2. spam[ham(66) // 3: 44 + eggs()]
3. spam[ham(66) // 3 : 44 + eggs()]
4. something else?
As you already mentioned, PEP8 doesn't explicitly mention the slice operator in that format, but spam[3:5] is definitely more common and IMHO more readable.
If the pep8 checker is anything to go by, the space before : will be flagged up:
[me@home]$ pep8 <(echo "spam[3:44]") # no warnings
[me@home]$ pep8 <(echo "spam[3 : 44]")
/dev/fd/63:1:7: E203 whitespace before ':'
... but that's only because it assumes : is the operator for defining a literal dict, where no space is expected before the operator. spam[3: 44] passes for that reason, but that just doesn't seem right.
On that count, I'd stick to spam[3:44].
Nested arithmetic operations are a little trickier. Of your 3 examples, only the 2nd one passes PEP8 validation:
[me@home]$ pep8 <(echo "spam[ham(66)//3:44+eggs()]")
/dev/fd/63:1:13: E225 missing whitespace around operator
[me@home]$ pep8 <(echo "spam[ham(66) // 3:44 + eggs()]") # OK
[me@home]$ pep8 <(echo "spam[ham(66) // 3 : 44 + eggs()]")
/dev/fd/63:1:18: E203 whitespace before ':'
However, I find all of the above difficult to parse by eye at first glance.
For readability and compliance with PEP8, I'd personally go for:
spam[(ham(66) // 3):(44 + eggs())]
Or for more complicated operations:
s_from = ham(66) // 3
s_to = 44 + eggs()
spam[s_from:s_to]
I do see slicing used in PEP8:
- Use ''.startswith() and ''.endswith() instead of string slicing to check
for prefixes or suffixes.
startswith() and endswith() are cleaner and less error prone. For
example:
Yes: if foo.startswith('bar'):
No: if foo[:3] == 'bar':
I wouldn't call that definitive but it backs up your (and my) understanding:
spam[3:5] # OK
As far as which to use in the more complex situation, I'd use #3. I don't think the no-spaces-around-the-: method looks good in that case:
spam[ham(66) / 3:44 + eggs()] # looks like it has a time in the middle. Bad.
If you want the : to stand out more, don't sacrifice operator spacing; instead, add extra spaces around the ::
spam[ham(66) / 3 : 44 + eggs()] # Wow, it's easy to read!
I would not use #1 because I like operator spacing, and #2 looks too much like the dictionary key: value syntax.
I also wouldn't call it an operator. It's special syntax for constructing a slice object -- you could also do
spam[slice(3, 5)]
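That equivalence is easy to check (the sample string here is just for illustration):

```python
# The colon syntax inside brackets and an explicit slice object are the
# same thing by the time __getitem__ sees them.
spam = "abcdefgh"
assert spam[3:5] == spam[slice(3, 5)] == "de"
print(slice(3, 5))   # slice(3, 5, None)
```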
I agree with your first example. For the latter one: PEP 20, readability counts. The semantically most important part of your complex slice expression is the slice operator itself; it divides the expression into two parts that should be parsed (both by the human reader and the interpreter) separately. Therefore my intuition is that consistency with PEP 8 should be sacrificed in order to highlight the : operator, i.e. by surrounding it with whitespace as in example 3. The question is whether omitting the whitespace within the two sides of the expression increases readability or not:
1. spam[ham(66)/3 : 44+eggs()]
vs.
2. spam[ham(66) / 3 : 44 + eggs()]
I find 1. quicker to parse.
Suppose I would like to write a fairly simple programming language, and I want to implement operators with the usual precedence, so that for example 2 + 3 * 2 = 8.
What is the general way to implement things like this?
I'm not sure how much detail you're interested in, but it sounds like you're looking to implement a parser. There are typically two steps:
The lexer reads over the text and converts it to tokens. For example, it might read "2 + 3 * 2" and convert it to INTEGER PLUS INTEGER STAR INTEGER.
The parser reads in the tokens and tries to match them to rules. For example, you might have these rules:
Expr := Sum | Product | INTEGER;
Sum := Expr PLUS Expr;
Product := Expr STAR Expr;
It reads the tokens and tries to apply the rules such that the start rule maps to the tokens it has read in. In this case, it might do:
Expr := Sum
Expr := Expr PLUS Expr
Expr := INTEGER(2) PLUS Expr
Expr := INTEGER(2) PLUS Product
Expr := INTEGER(2) PLUS Expr STAR Expr
Expr := INTEGER(2) PLUS Integer(3) STAR Expr
Expr := INTEGER(2) PLUS Integer(3) STAR Integer(2)
There are many types of parsers. In this example I read from left to right, and started from the initial expression, working down until I'd replaced everything with a token, so this would be an LL parser. As it does this replacement, it can generate an abstract syntax tree that represents the data. The tree for this might look something like:

    Sum
    ├── INTEGER(2)
    ├── PLUS
    └── Product
        ├── INTEGER(3)
        ├── STAR
        └── INTEGER(2)

You can see that the Product rule is a child of the Sum rule, so it will end up happening first: 2 + (3 * 2). If the expression had been parsed differently we might've ended up with this tree:

    Product
    ├── Sum
    │   ├── INTEGER(2)
    │   ├── PLUS
    │   └── INTEGER(3)
    ├── STAR
    └── INTEGER(2)

Now we're calculating (2 + 3) * 2. It all comes down to which way the parser generates the tree.
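The effect of tree shape can be sketched with nested tuples standing in for AST nodes (a hypothetical representation, not any generated parser's actual output):

```python
# Each node is either a number or an (op, left, right) tuple; the order of
# evaluation falls out of the tree shape alone, with no precedence table.
def evaluate(node):
    if isinstance(node, tuple):
        op, left, right = node
        l, r = evaluate(left), evaluate(right)
        return l + r if op == "+" else l * r
    return node

precedence_tree = ("+", 2, ("*", 3, 2))   # Product under Sum: 2 + (3 * 2)
swapped_tree    = ("*", ("+", 2, 3), 2)   # Sum under Product: (2 + 3) * 2
print(evaluate(precedence_tree))  # 8
print(evaluate(swapped_tree))     # 10
```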
If you actually want to parse expressions, odds are you don't want to write the parser by hand. There are parser generators that take a configuration (called a grammar) similar to the one I used above, and generate the actual parser code. Parser generators will let you specify which rule should take priority, so for example:
Expr := Sum | Product | INTEGER;
Sum := Expr PLUS Expr; [2]
Product := Expr STAR Expr; [1]
I labeled the Product rule as priority 1, and Sum as priority 2, so given the choice the generated parser will favor Product. You can also design the grammar itself such that the priority is built-in (this is the more common approach). For example:
Expr := Sum | INTEGER;
Sum := Expr PLUS Product;
Product := INTEGER STAR INTEGER;
This forces the Products to be under the Sums in the AST. Naturally this grammar is very limited (for example, it wouldn't match 2 * 3 + 2), but a comprehensive grammar can be written that still embeds an order of operations automatically
You would need to write a parser for your fairly simple programming language. If you want to do this in Python, start by reading Ned Batchelder's blog post Python Parsing Tools.