Suppose I would like to write a fairly simple programming language, and I want to implement operators so that, for example, 2 + 3 * 2 = 8.
What is the general way to implement things like this?
I'm not sure how much detail you're interested in, but it sounds like you're looking to implement a parser. There are typically two steps:
The lexer reads over the text and converts it to tokens. For example, it might read "2 + 3 * 2" and convert it to INTEGER PLUS INTEGER STAR INTEGER
The parser reads in the tokens and tries to match them to rules. For example, you might have these rules:
Expr := Sum | Product | INTEGER;
Sum := Expr PLUS Expr;
Product := Expr STAR Expr;
It reads the tokens and tries to apply the rules such that the start rule maps to the tokens it has read in. In this case, it might do:
Expr := Sum
Expr := Expr PLUS Expr
Expr := INTEGER(2) PLUS Expr
Expr := INTEGER(2) PLUS Product
Expr := INTEGER(2) PLUS Expr STAR Expr
Expr := INTEGER(2) PLUS INTEGER(3) STAR Expr
Expr := INTEGER(2) PLUS INTEGER(3) STAR INTEGER(2)
There are many types of parsers. In this example I read from left to right, and started from the initial expression, working down until I'd replaced everything with a token, so this would be an LL parser. As it does this replacement, it can generate an abstract syntax tree that represents the data. The tree for this might look something like:
You can see that the Product rule is a child of the Sum rule, so it will end up happening first: 2 + (3 * 2). If the expression had been parsed differently we might've ended up with this tree:
Now we're calculating (2 + 3) * 2. It all comes down to which way the parser generates the tree
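To make the difference between the two trees concrete, here is a minimal sketch that writes each tree as nested tuples and evaluates it. The tuple layout `(rule, left, right)` and the `evaluate` helper are assumptions made for illustration; they are not part of any particular parser library.

```python
def evaluate(node):
    """Evaluate a tree made of ints and (rule, left, right) tuples."""
    if isinstance(node, int):
        return node
    rule, left, right = node
    if rule == "Sum":
        return evaluate(left) + evaluate(right)
    if rule == "Product":
        return evaluate(left) * evaluate(right)
    raise ValueError(f"unknown rule {rule!r}")

# Product is a child of Sum, so the multiplication happens first: 2 + (3 * 2)
sum_on_top = ("Sum", 2, ("Product", 3, 2))

# Sum is a child of Product, so the addition happens first: (2 + 3) * 2
product_on_top = ("Product", ("Sum", 2, 3), 2)

print(evaluate(sum_on_top))      # 8
print(evaluate(product_on_top))  # 10
```

The same tokens produce different results purely because of the tree shape, which is why the grammar has to pin down which shape is built.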
If you actually want to parse expressions, odds are you don't want to write the parser by hand. There are parser generators that take a configuration (called a grammar) similar to the one I used above, and generate the actual parser code. Parser generators will let you specify which rule should take priority, so for example:
Expr := Sum | Product | INTEGER;
Sum := Expr PLUS Expr; [2]
Product := Expr STAR Expr; [1]
I labeled the Product rule as priority 1, and Sum as priority 2, so given the choice the generated parser will favor Product. You can also design the grammar itself such that the priority is built-in (this is the more common approach). For example:
Expr := Sum | INTEGER;
Sum := Expr PLUS Product;
Product := INTEGER STAR INTEGER;
This forces the Products to be under the Sums in the AST. Naturally this grammar is very limited (for example, it wouldn't match 2 * 3 + 2), but a comprehensive grammar can be written that still embeds an order of operations automatically
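As a rough sketch of what a fuller grammar with built-in precedence could look like when written by hand, the code below lexes the input and then parses it with one function per precedence level. It assumes the grammar `Expr := Product (PLUS Product)*` and `Product := INTEGER (STAR INTEGER)*`, which is one possible way to extend the rules above, and the function names are made up for this example.

```python
import re

TOKEN_SPEC = [("INTEGER", r"\d+"), ("PLUS", r"\+"), ("STAR", r"\*"), ("SKIP", r"\s+")]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(text):
    """Turn '2 + 3 * 2' into [('INTEGER', '2'), ('PLUS', '+'), ...]."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        if match.lastgroup != "SKIP":  # whitespace is matched but not emitted
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

def parse_expr(tokens):
    # Expr := Product (PLUS Product)*
    node = parse_product(tokens)
    while tokens and tokens[0][0] == "PLUS":
        tokens.pop(0)
        node = ("Sum", node, parse_product(tokens))
    return node

def parse_product(tokens):
    # Product := INTEGER (STAR INTEGER)*
    node = parse_integer(tokens)
    while tokens and tokens[0][0] == "STAR":
        tokens.pop(0)
        node = ("Product", node, parse_integer(tokens))
    return node

def parse_integer(tokens):
    if not tokens or tokens[0][0] != "INTEGER":
        raise SyntaxError("expected INTEGER")
    return int(tokens.pop(0)[1])

print(parse_expr(lex("2 + 3 * 2")))  # ('Sum', 2, ('Product', 3, 2))
print(parse_expr(lex("2 * 3 + 2")))  # ('Sum', ('Product', 2, 3), 2)
```

Because `parse_expr` only ever asks `parse_product` for its operands, a Product can never end up above a Sum, so both inputs get the conventional order of operations.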
You would need to write a parser for your fairly simple programming language. If you want to do this in Python, start by reading Ned Batchelder's blog post Python Parsing Tools.
I've been programming in Python for years but something extremely trivial has surprised me:
>>> -1 ** 2
-1
Of course, squaring any negative real number should produce a positive result. Probably Python's math is not completely broken. Let's look at how it parsed this expression:
>>> ast.dump(ast.parse('-1 ** 2').body[0])
Expr(
    value=UnaryOp(
        op=USub(),
        operand=BinOp(
            left=Num(n=1),
            op=Pow(),
            right=Num(n=2)
        )
    )
)
Ok, so it is treating it as if I had written -(1 ** 2). But why is the - prefix to 1 being treated as a separate unary subtraction operator, instead of the sign of the constant?
Note that the expression -1 is not parsed as the unary subtraction of the constant 1, but just the constant -1:
>>> ast.dump(ast.parse('-1').body[0])
Expr(
    value=Num(n=-1)
)
The same goes for -1 * 2, even though it is syntactically nearly identical to the first expression.
>>> ast.dump(ast.parse('-1 * 2').body[0])
Expr(
    value=BinOp(
        left=Num(n=-1),
        op=Mult(),
        right=Num(n=2)
    )
)
This behavior turns out to be common to many languages, including Perl, PHP, and Ruby.
It behaves just like the docs explain here:
2.4.4. Numeric literals
[...] Note that numeric literals do not include a sign; a phrase like -1 is actually an expression composed of the unary operator - and the literal 1.
and here:
6.5. The power operator
The power operator binds more tightly than unary operators on its
left; [...]
See also the precedence table from this part. Here is the relevant part from that table:
Operator     | Description
-------------|---------------------------------
*            | Multiplication, ...
+x, -x, ~x   | Positive, negative, bitwise NOT
**           | Exponentiation
This explains why the parse tree is different between the ** and * examples.
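The precedence rule can be checked directly. The short sketch below only uses the standard `ast` module and plain arithmetic; the variable names are chosen for this example.

```python
import ast

minus_then_power = -1 ** 2     # parsed as -(1 ** 2)
power_of_negative = (-1) ** 2  # parentheses make -1 the base
negative_exponent = 2 ** -1    # on the right-hand side the unary minus is applied first

# Inspect the tree: the outermost node is the unary minus, and the power
# operation is its operand.
tree = ast.parse("-1 ** 2", mode="eval").body
outer_is_unary_minus = isinstance(tree, ast.UnaryOp) and isinstance(tree.op, ast.USub)
inner_is_power = isinstance(tree.operand, ast.BinOp) and isinstance(tree.operand.op, ast.Pow)

print(minus_then_power, power_of_negative, negative_exponent)
print(outer_is_unary_minus, inner_is_power)
```

Writing `(-1) ** 2` is the usual way to get the result the question expected.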
I am writing a parser for an existing language, using the TextX Python Library (based on the Arpeggio PEG parser)
But when I try to use it to parse a file, I get the exception:
RecursionError: maximum recursion depth exceeded while calling a Python object
Here is a minimal example that raises this exception:
#!/usr/bin/env python
from textx import metamodel_from_str
meta_model_string = "Expr: ( Expr '+' Expr ) | INT ;"
model_string = "1 + 1"
mm = metamodel_from_str(meta_model_string, debug=True)
m = mm.model_from_str(model_string, debug=True)
I tracked it down to Arpeggio's left recursion issue, which states that a rule like A := A B is unsupported and should be converted to a rule with no such recursion.
So my question is: Is it possible to rewrite the Expr := Expr '+' Expr rule above in a way that does not use left recursion? Note that the real Expr rule is much more complicated. A slightly less simplified version of it will be:
Expr: '(' Expr ')' | Expr '+' Expr | Expr '*' Expr | '!' Expr | INT | STRING ;
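The failure can be reproduced without textX. The sketch below is a hand-written top-down function for `Expr: Expr '+' Expr | INT`, with names invented for this example; it is not how Arpeggio is implemented, but it shows the same mechanism: the first thing the rule does is call itself at the same input position, so no token is ever consumed.

```python
def parse_expr_left_recursive(tokens, pos=0):
    # First alternative: Expr '+' Expr. Parsing the leading Expr means calling
    # this same function again at the same position, which never returns.
    left, pos = parse_expr_left_recursive(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_expr_left_recursive(tokens, pos + 1)
        return ("+", left, right), pos
    return left, pos

def fails_with_recursion_error(tokens):
    """Return True if parsing blows the Python call stack."""
    try:
        parse_expr_left_recursive(tokens)
    except RecursionError:
        return True
    return False

print(fails_with_recursion_error(["1", "+", "1"]))  # True
```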
textX author here. In addition to Paul's excellent answer, there is an expression example which should provide you with a good start.
Top-down parsers in general cannot handle left-recursive rules without hacks like this. If your language is going to be complex and heavily expression-oriented, it might be better to try a bottom-up parser that allows left recursion and provides declarative priority and associativity specification. If you liked textX, I suggest taking a look at parglare, which has similar design goals but uses a bottom-up parsing technique (specifically LR and GLR). Its quick intro example is the exact language you are building.
In this post I blogged about the rationale for starting the parglare project and its differences from textX/Arpeggio.
This is more typically written as:
multop: '*' | '/'
addop: '+' | '-'
Factor: INT | STRING | '(' Expr ')' ;
Term: Factor [multop Factor]... ;
Expr: Term [addop Term]... ;
Now Expr will not directly recurse to itself until first matching a leading '('. You will also get groups that correspond to precedence of operations. (Note that the repetition for Expr and Term will end up producing groups like ['1', '+', '1', '+', '1'], when you might have expected [['1', '+', '1'], '+', '1'] which is what a left-recursive parser will give you.)
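If the nested, left-associative grouping is what you need, the flat group can be folded after parsing. This is a small sketch in plain Python, independent of textX or any other parsing library; `nest_left` is a name made up for this example, and it assumes the flat list alternates operand, operator, operand.

```python
def nest_left(flat):
    """Fold ['1', '+', '1', '+', '1'] into [['1', '+', '1'], '+', '1']."""
    result = flat[0]
    # Walk the (operator, operand) pairs from left to right, wrapping what
    # has been built so far as the left operand of the next operation.
    for index in range(1, len(flat), 2):
        result = [result, flat[index], flat[index + 1]]
    return result

print(nest_left(["1", "+", "1", "+", "1"]))  # [['1', '+', '1'], '+', '1']
```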
I've been writing a lexer/parser/interpreter for my own language and so far all has been working. I've been following the examples over at Ruslan Spivak's blog (Github link to each article).
I wanted to extend my language grammar past what is written in the articles to include more operators like comparisons (<, >=, etc.) and also exponents (** or ^ in my language). I have this grammar:
expression : exponent ((ADD | SUB) exponent)*
exponent : term ((POWER) term)*
# this one is right-associative (powers **)
term : comparison ((MUL | DIV) comparison)*
comparison : factor ((EQUAL | L_EQUAL | LESS
           | N_EQUAL | G_EQUAL | GREATER) factor)*
# these are all binary operations
factor : NUM | STR | variable
| ADD factor | SUB factor
| LPAREN expression RPAREN
# different types of 'base' types like integers
# also contains parenthesised expressions which are evaluated first
In terms of parsing tokens, I use the same method as used in Ruslan's blog. Here is one that will parse the exponent line, which handles addition and subtraction despite its name, as the grammar says that expressions are parsed as
exponent_expr (+ / -) exponent_expr
def exponent(self):
    node = self.term()
    while self.current_token.type in (ADD, SUB):
        token = self.current_token
        if token.type == ADD:
            self.consume_token(ADD)
        elif token.type == SUB:
            self.consume_token(SUB)
        node = BinaryOperation(left_node=node,
                               operator=token,
                               right_node=self.term())
    return node
Now this parses left-associative tokens just fine (since the token stream comes left to right naturally), but I am stuck on how to parse right-associative exponents. Look at this expected in/out for reference:
>>> 2 ** 3 ** 2
# should be parsed as...
>>> 2 ** (3 ** 2)
# which is...
>>> 2 ** 9
# which returns...
512
# Mine, at the moment, parses it as...
>>> (2 ** 3) ** 2
# which is...
>>> 8 ** 2
# which returns...
64
To solve this, I tried switching the BinaryOperation() constructor's left and right nodes to make the current node the right and the new node the left, but this just makes 2**5 parse as 5**2 which gives me 25 instead of the expected 32.
Any approaches that I could try?
The fact that your exponent function actually parses expressions should have been a red flag. In fact, what you need is an expression function which parses expressions and an exponent function which parses exponentiations.
You've also mixed up the precedences of exponentiation and multiplication (and other operations), because 2 * x ** 4 does not mean (2 * x) ** 4 (which would be 16x⁴), but rather 2 * (x ** 4). By the same token, x * 3 < 17 does not mean x * (3 < 17), which is how your grammar will parse it.
Normally the precedences for arithmetics look something like this:
comparison      <, <=, ==, ...    (lowest precedence)
additive        +, -
multiplicative  *, /, %
unary           +, -
exponentiation  **
atoms           numbers, variables, parenthesized expressions, etc.
(If you had postfix operators like function calls, they would go in between exponentiation and atoms.)
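The table above can also drive a parser directly. The sketch below uses precedence climbing, which is a different technique from the one-function-per-level approach in the question; it is shown only to illustrate how the precedence levels interact. The numeric levels, the tuple-based tree, and all function names are assumptions made for this example, and only integer operands are supported.

```python
import re

# Binary operators: (precedence, is_right_associative). Higher binds tighter.
BINARY = {
    "<": (1, False), "<=": (1, False), "==": (1, False),
    "+": (2, False), "-": (2, False),
    "*": (3, False), "/": (3, False), "%": (3, False),
    "**": (5, True),
}
UNARY_PRECEDENCE = 4  # between multiplicative and exponentiation

TOKEN = re.compile(r"\s*(\d+|\*\*|<=|==|[-+*/%<()])")

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_expression(tokens, min_precedence=1):
    left = parse_prefix(tokens)
    while tokens and tokens[0] in BINARY and BINARY[tokens[0]][0] >= min_precedence:
        operator = tokens.pop(0)
        precedence, right_assoc = BINARY[operator]
        # A right-associative operator lets the same level recur on its right side.
        next_min = precedence if right_assoc else precedence + 1
        left = (operator, left, parse_expression(tokens, next_min))
    return left

def parse_prefix(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token in ("+", "-"):
        operand = parse_expression(tokens, UNARY_PRECEDENCE)
        return ("neg" if token == "-" else "pos", operand)
    if token == "(":
        inner = parse_expression(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("expected ')'")
        return inner
    return int(token)

print(parse_expression(tokenize("2 * 3 ** 4")))  # ('*', 2, ('**', 3, 4))
print(parse_expression(tokenize("2 * 3 < 17")))  # ('<', ('*', 2, 3), 17)
```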
Once you've reworked your grammar in this form, the exponent parser will look something like this:
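def exponent(self):
    node = self.term()
    # One check is enough: the recursive call below consumes any further POWER tokens.
    if self.current_token.type == POWER:
        token = self.current_token
        self.consume_token(POWER)
        node = BinaryOperation(left_node=node,
                               operator=token,
                               right_node=self.exponent())
    return node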
The recursive call at the end produces right associativity. In this case recursion is acceptable because the left operand and the operator have already been consumed. Thus the recursive call cannot produce an infinite loop.
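Here is a self-contained sketch of the same idea that can be run on its own. It borrows the `current_token` and `consume_token` names from the question, but the `Token` type, the tokenizer, the tuple-based tree, and the `exponent_left` method (included only for contrast) are assumptions made for this example. It handles nothing but integers and `**`.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", "type value")
NUM, POWER, EOF = "NUM", "POWER", "EOF"

def tokenize(text):
    # Note: re.findall silently skips anything that is not a number or '**'.
    tokens = []
    for number, power in re.findall(r"(\d+)|(\*\*)", text):
        tokens.append(Token(NUM, int(number)) if number else Token(POWER, power))
    tokens.append(Token(EOF, None))
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    @property
    def current_token(self):
        return self.tokens[self.pos]

    def consume_token(self, token_type):
        if self.current_token.type != token_type:
            raise SyntaxError(f"expected {token_type}, got {self.current_token.type}")
        self.pos += 1

    def atom(self):
        token = self.current_token
        self.consume_token(NUM)
        return token.value

    def exponent(self):
        # Right-associative: the right operand is itself a full exponent.
        node = self.atom()
        if self.current_token.type == POWER:
            self.consume_token(POWER)
            node = ("**", node, self.exponent())
        return node

    def exponent_left(self):
        # Left-associative loop, shown only to contrast with exponent().
        node = self.atom()
        while self.current_token.type == POWER:
            self.consume_token(POWER)
            node = ("**", node, self.atom())
        return node

def evaluate(node):
    if isinstance(node, int):
        return node
    _, left, right = node
    return evaluate(left) ** evaluate(right)

print(evaluate(Parser(tokenize("2 ** 3 ** 2")).exponent()))       # 512
print(evaluate(Parser(tokenize("2 ** 3 ** 2")).exponent_left()))  # 64
```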
While working in Python 3, I created a calculator that accepts string inputs such as "-28 + 4.0/3 * 5" and similar mathematical expressions. As an exercise, I wanted to support exponentiation with the '^' key, so that inputs like "5.23 * 2^4/3^2 -1.0" would work. However, implementing this in my current code has proven difficult. Not wanting to scrap my work, I realized that I could do it if I could take the original string and selectively evaluate the '^' operations, so that the aforementioned "5.23 * 2^4/3^2 -1.0" would become "5.23 * 16/9 -1.0", which I could then feed into the code I have already written. The only problem is that I am having trouble isolating these pieces of the expression, and I was hoping someone might be able to lend a hand.
Since these are binary infix operators, you could split the string into symbols (numbers and operators), assign a priority to each operator, and then rearrange the symbols into a stack in something like prefix notation.
Alternatively, split the input string at each exponent mark. The number at the end of one substring and the number at the beginning of the next can then be cut out, evaluated, and replaced with the result x: "6 * 4^3 +2" -> ["6 * 4", "3 +2"] -> "6 * " + x + " +2"
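A variation on the second suggestion is to let a regular expression find each `number ^ number` pair and substitute its value. The sketch below is one way to do that under some loudly stated assumptions: operands are unsigned decimal literals, parenthesized bases or exponents such as `(1+2)^3` are not handled, and the names `POWER`, `_fold`, and `expand_powers` are made up for this example. Chains such as `2^3^2` are folded from the right, which matches the usual right-associative convention.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"
# Match "a ^ b" only when b is a complete number that is not itself the base
# of another "^", so chains are reduced from the rightmost pair first.
POWER = re.compile(rf"(?<![\d.])({NUMBER})\s*\^\s*({NUMBER})(?![\d.]|\s*\^)")

def _fold(match):
    value = float(match.group(1)) ** float(match.group(2))
    # Write whole results without a trailing ".0" so "2^4" becomes "16".
    return str(int(value)) if value.is_integer() else repr(value)

def expand_powers(expression):
    """Replace every 'a^b' in the string with its computed value."""
    while True:
        expanded = POWER.sub(_fold, expression)
        if expanded == expression:
            return expanded
        expression = expanded

print(expand_powers("5.23 * 2^4/3^2 -1.0"))  # 5.23 * 16/9 -1.0
print(expand_powers("2^3^2"))                # 512
```

The resulting string contains no '^' and can be passed to the existing calculator code unchanged.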
Let's take a look at the simplest arithmetic example in the pyparsing doc, here.
More specifically, I'm looking at the "+" operation that is defined as left associative and the first example test where we're parsing "9 + 2 + 3".
The outcome of the parsing I would have expected would be ((9+2)+3), that is, first compute the infix binary operator on 9 and 2 and then compute the infix binary operator on the result and 3. What I get however is (9+2+3), all on the same level, which is really not all that helpful, after all I have now to decide the order of evaluation myself and yet it was defined to be left associative. Why am I forced to parenthesize myself? What am I missing?
Thanks & Regards