Inject new lexemes in a yacc Rule - python

I have the following grammar (this a simplified one):
S -> EXPR
EXPR -> ITEM addop EXPR
EXPR -> ITEM
ITEM -> num
ITEM -> ident
having:
num: a floating point number
ident: a string representing an identifier
addop: +
I am using the PLY library for Python and have the following code:
def p_L_S(self, p):
    ''' S : EXPR '''
    p[0] = p[1]

def p_L_EXPR_1(self, p):
    ''' EXPR : ITEM addop EXPR '''
    p[0] = p[1] + p[3]  # p[2] is the addop token itself

def p_L_EXPR_2(self, p):
    ''' EXPR : ITEM '''
    p[0] = p[1]

def p_L_ITEM_1(self, p):
    ''' ITEM : num '''
    p[0] = float(p[1])

def p_L_ITEM_2(self, p):
    ''' ITEM : ident '''
    p[0] = value_of_expr_associated_to_ident(p[1])
[...]
In the last function (p_L_ITEM_2) I would like to interpret the string associated with p[1] (which is an expression recognized by the grammar) without launching another parse.
Today, the function value_of_expr_associated_to_ident launches a new parse (calling the parse method) of the expression associated with the ident.
This works, but the performance is really poor.
Is there a way to send the parser the lexemes of the expression associated with the ident, to avoid having to start a new parse?
I don't know if this is clear, and if not I will try to clarify.
Thanks a lot.
Sam

If you are trying to do some sort of lazy evaluation in a functional language, read on. That's not as easy as it looks, and I haven't provided anything more than a rough idea about the approach.
If the values associated with identifiers are plain strings representing expressions in the language, then a recursive call to parse is what you need to do.
But it seems like it would be worthwhile at least caching the resulting parse tree (which means that your parser needs to create a parse tree, rather than doing an immediate evaluation). Alternatively, you could parse the string value into a parse tree when you assign it to the variable.
However you accomplish the recursive parse, you need to somehow deal with infinite recursion, as exemplified by the value of r being r (or r + 1).
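For concreteness, here is a minimal sketch of the caching idea combined with a cycle guard. It assumes a PLY parser object named parser and a dictionary bindings mapping each identifier to the source string bound to it; it caches whatever your parse call returns (a value today, a parse tree once you build one). All of these names are hypothetical stand-ins for whatever your program actually uses.

_parse_cache = {}     # ident -> result of the earlier recursive parse
_in_progress = set()  # idents currently being evaluated (cycle detection)

def value_of_expr_associated_to_ident(ident):
    if ident in _parse_cache:
        return _parse_cache[ident]  # reuse the earlier parse
    if ident in _in_progress:
        raise RecursionError("identifier %r is defined in terms of itself" % ident)
    _in_progress.add(ident)
    try:
        # `parser` and `bindings` are hypothetical stand-ins, see above
        result = parser.parse(bindings[ident])
    finally:
        _in_progress.discard(ident)
    _parse_cache[ident] = result
    return result

If an identifier can be reassigned, its cache entry must be invalidated on assignment.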

pyparsing optional parenthesis around an expression: pp.Opt(Suppress()) vs. nested_expr

QUESTIONS
This is a long post, so I will highlight my two main questions before giving the details:
How can one succinctly allow for optional matched parentheses/brackets around an expression?
How does one properly parse the content of nested_expr? This answer suggests that this function is not quite appropriate for this, and infix_notation is better, but that doesn't seem to fit my use case (I don't think).
DETAILS
I am working on a grammar to parse prolog strings. The data I have involves a lot of optional brackets or parentheses.
For example, both predicate([arg1, arg2, arg3]) and predicate(arg1, arg2, arg3) are legal and appear in the data.
My full grammar is a little complicated, and likely could be cleaned up, but I will paste it here for reproducibility. I have a couple versions of the grammar as I found new data that I had to account for. The first one works with the following example string:
pred(Var, arg_name1:arg#arg_type, arg_name2:(sub_arg1, sub_arg2))
For some visual clarity, I am turning the parsed strings into graphs.
Note that arg_name2:(sub_arg1, sub_arg2) is slightly idiosyncratic syntax, where the things inside the parens are supposed to be thought of as having an AND operator between them. The only thing indicating this is that the wrapped expression appears "naked" (i.e. it has no predicate name of its own; it's just some values lumped together with parens).
VERSION 1: works on the above string
# GRAMMAR VER 1
predication = pp.Forward()
join_predication = pp.Forward()
entity = pp.Forward()
args_list = pp.Forward()
# atoms are used either as predicate names or bottom level argument values
# str_atoms are just quoted strings which may also appear as arguments
atom = pp.Word(pp.alphanums + '_' + '.')
str_atom = pp.QuotedString("'")
# TYPICAL ARGUMENT: arg_name:ARG_VALUE, where the ARG_VALUE may be an entity, join_predication, predication, or just an atom.
# Note that the arg_name is optional and may not always appear
# EXAMPLES:
# with name: pred(arg1:val1, arg2:val2)
# without name: pred(val1, val2)
argument = pp.Group(pp.Opt(atom("arg_name") + pp.Suppress(":")) + (entity | join_predication | predication | atom("arg_value") | str_atom("arg_value")))
# List of arguments
args_list = pp.Opt(pp.Suppress("[")) + pp.delimitedList(argument) + pp.Opt(pp.Suppress("]"))
# As in the example string above, sometimes predications are grouped together in parentheses and are meant to be understood as having an AND operator between them when evaluating the truth of both together
# EXAMPLE: pred(arg1:(sub_pred1, subpred2))
# I am just treating it as an args_list inside obligatory parentheses
join_predication <<= pp.Group(pp.Suppress("(") + args_list("args_list") + pp.Suppress(")"))("join_predication")
# pred_name with optional arguments (though I've never seen one without arguments, just in case)
predication <<= pp.Group(atom("pred_name") + pp.Suppress("(") + pp.Opt(args_list)("args_list") + pp.Suppress(")"))("predication")
# ent_name with optional arguments and a #type
entity <<= (pp.Group(((atom("ent_name")
+ pp.Suppress("(") + pp.Opt(args_list)("args_list") + pp.Suppress(")"))
| str_atom("ent_name") | atom("ent_name"))
+ pp.Suppress("#") + atom("type"))("entity"))
# starter symbol
lf_fragment = entity | join_predication | predication
Although this works, I came across another very similar string which used brackets instead of parentheses for a join_predication:
pred(Var, arg_name1:arg#arg_type, arg_name2:[sub_arg1, sub_arg2])
This broke my parser, seemingly because brackets are used in other places and are often optional: a bracket could mistakenly be matched by the wrong parser element, since I am doing nothing to enforce that an opening bracket and its closing bracket must go together. For this I thought to turn to nested_expr, but that caused further problems: as mentioned in this answer, parsing the elements inside a nested_expr doesn't work very well, and I lost a lot of the substructure I need for the graphs I'm building.
VERSION 2: using nested_expr
# only including those expressions that have been changed
# args_list might not have brackets
args_list = pp.nested_expr("[", "]", pp.delimitedList(argument)) | pp.delimitedList(argument)
# join_predication is an args_list with obligatory wrapping parens/brackets
join_predication <<= pp.nested_expr("(", ")", args_list("args_list"))("join_predication") | pp.nested_expr("[", "]", args_list("args_list"))("join_predication")
I likely need to ensure matching for predication and entity, but haven't for now.
Using the above grammar, I can parse both example strings, but I lose the named structure that I had before.
In the original grammar, parse_results['predication']['args_list'] was a list of every argument, exactly as I expected. In the new grammar, it only contains the first argument, Var, in the example strings.
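To illustrate the kind of pairing I am after, here is a rough sketch (reusing the argument element from Version 1, with pp being pyparsing as above) that spells out the bracketed, parenthesised, and bare forms as separate alternatives, so that each opener requires its own matching closer; I don't know whether this is the idiomatic way:

bare_args = pp.delimitedList(argument)
# each alternative demands its own closer, so "[a, b)" can no longer match
args_list = (pp.Suppress("[") + bare_args + pp.Suppress("]")
             | pp.Suppress("(") + bare_args + pp.Suppress(")")
             | bare_args)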

Recursive regular expressions for defining syntax using 'fr' strings

When creating grammar rules for a language I am making, I would like to be able to check the syntax and step through it, instead of my current method, which often misses syntax errors.
I've started off using regular expressions to define the grammar like so:
add = r"(\+)"
sub = r"(-)"
mul = r"(\*)"
div = r"(/)"  # division is "/"; r"(\\)" would match a backslash
pow = r"(\^)"
bin_op = fr"({add}|{sub}|{mul}|{div}|{pow})"
open_br = r"(\()"
close_br = r"(\))"
open_sq = r"(\[)"
close_sq = r"(\])"
dot = r"(\.)"
short_id = r"([A-Za-z]\d*)" # i.e. "a1", "b1232", etc.
long_id = r"([A-Za-z0-9]+)" # i.e. "sin2", "tan", etc. for use in assignment
long_id_ref = r"('" + long_id + "')" # i.e. "'sin'", for referencing
#note that "'sin'" is one value while "sin" = "s" * "i" * "n"
id_assign = fr"({short_id}|{long_id})" # for assignment
id_ref = fr"({short_id}|{long_id_ref})" # for reference (apostrophes)
integer = r"(\d+)" # i.e 123
float = fr"(\d+{dot}\d+)" # i.e. 3.4
operand = fr"({float}|{integer}|{id_ref})"  # float before integer, so "3.4" is not matched as just "3"
Now the issue is that definitions may be recursive. For example:
expression = fr"{expression}{bin_op}{expression}|({open_br}{expression}{close_br})|({expression}{open_sq}{expression}{close_sq})|..."
As you can see, some of the possible expressions are recursive. The problem, of course, is that expression is not yet defined while defining expression, so an error would be raised.
It seems that (?R) would not work, since it would copy everything before it, not the whole pattern. Does Python's regex have a way of dealing with this, or do I have to create my own BNF or regex interpreter that supports recursion?
Alternatively would it be feasible to use regular expressions but not use any recursion?
I know that there are 3rd-party applications that can help with this but I'd like to be able to do it all myself without external code.
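To give an idea of what the hand-rolled alternative could look like: the standard-library re module has no recursion construct, and the third-party regex module's recursive patterns such as (?R) are ruled out here, so the recursion has to live in ordinary Python. Below is a minimal sketch of a recursive-descent checker built on the flat token patterns above; it handles only operands, parenthesised expressions, and binary operations, and parse_expr is a made-up name:

import re

def parse_expr(s, pos=0):
    """Return the index just past one expression starting at pos,
    raising SyntaxError on malformed input."""
    if pos < len(s) and s[pos] == "(":
        pos = parse_expr(s, pos + 1)  # recurse for "(" expression ")"
        if pos >= len(s) or s[pos] != ")":
            raise SyntaxError(f"expected ')' at position {pos}")
        pos += 1
    else:
        m = re.match(operand, s[pos:])  # base case: a single operand
        if not m:
            raise SyntaxError(f"expected operand at position {pos}")
        pos += m.end()
    m = re.match(bin_op, s[pos:]) if pos < len(s) else None
    if m:  # optional "expression bin_op expression" tail
        return parse_expr(s, pos + m.end())
    return pos

# parse_expr("3.4+(a1-2)") returns 10, i.e. the whole string is one expression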

Parse Python Code using PyParsing?

I'm trying to write PyParsing code capable of parsing any Python code (I know that the AST module exists, but that will just be a starting point - I ultimately want to parse more than just Python code.)
Anyways, I figure I'll just start by writing something able to parse the classic
print("Hello World!")
So here's what I wrote:
from pyparsing import (alphanums, alphas, delimitedList, Forward,
                       Optional, quotedString, removeQuotes, Suppress, Word)

expr = Forward()
string = quotedString.setParseAction(removeQuotes)
call = expr + Suppress('(') + Optional(delimitedList(expr)) + Suppress(')')
name = Word(alphas + '_', alphanums + '_')
expr <<= string | name | call
test = 'print("Hello World!")'
print(expr.parseString(test))
When I do that, though, it just spits out:
['print']
Which is technically a valid expr - you can type that into the REPL and there's no problem parsing it, even if it's useless.
So I thought maybe what I would want is to flip around name and call in my expr definition, so it would prefer returning calls to names, like this:
expr <<= string | call | name
Now I get a maximum recursion depth exceeded error. That makes sense, too:
Checks if it's an expr.
Checks if it's a string, it's not.
Checks if it's a call.
It must start with an expr, return to start of outer list.
So my question is... how can I define call and expr so that I don't end up with an infinite recursion, but also so that it doesn't just stop when it sees the name and ignore the arguments?
Is Python code too complicated for PyParsing to handle? If not, is there any limit to what PyParsing can handle?
(Note - I've included the general tags parsing, abstract-syntax-tree, and bnf, because I suspect this is a general recursive grammar definition problem, not something necessarily specific to pyparsing.)
Your grammar is left-recursive: expr expects a call, which expects an expr, which expects a call... Since PyParsing can't handle left recursion (at least not out of the box), you need to change the grammar into something PyParsing can work with.
One approach to removing direct left recursion is to change a grammar rule such as:
A = A b | c
into
A = c b*
In your case, the left recursion is indirect: it doesn't happen in expr itself, but in a sub-rule (call):
E = C | s | n
C = E x y z
To remove indirect left recursion you usually "lift" the definition of the sub-rule into the main rule. Unfortunately, this removes the offending sub-rule from the grammar -- in other words, you lose some structural expressiveness when you do that.
The previous example, with indirect recursion removed, would look like this:
E = E x y z | s | n
At this point, you have direct left recursion, which is easier to transform. When you deal with that, the result would be something like this -- in pseudo EBNF:
E = (s | n) (x y z)*
In your case, the definition of Expr would become:
Expr = (string | name) Args*
Args = "(" ExprList? ")"
ExprList = Expr ("," Expr)*
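Here is a rough pyparsing rendering of that transformed grammar, kept close to the fragment in the question (a sketch only, assuming the classic pyparsing names):

from pyparsing import (alphanums, alphas, delimitedList, Forward, Group,
                       Optional, quotedString, removeQuotes, Suppress, Word,
                       ZeroOrMore)

expr = Forward()
string = quotedString.setParseAction(removeQuotes)
name = Word(alphas + '_', alphanums + '_')
# Args = "(" ExprList? ")"
args = Group(Suppress('(') + Optional(delimitedList(expr)) + Suppress(')'))
# Expr = (string | name) Args*
expr <<= (string | name) + ZeroOrMore(args)

print(expr.parseString('print("Hello World!")'))  # -> ['print', ['Hello World!']]

As a side note, recent pyparsing releases (3.x) can handle some left-recursive grammars directly via ParserElement.enable_left_recursion(), so on a current version the rewrite may not be strictly necessary.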

python infix forward pipe

I'm trying to implement a forward pipe functionality, like bash's | or R's recent %>%. I've seen this implementation https://mdk.fr/blog/pipe-infix-syntax-for-python.html, but this requires that we define in advance all the functions that might work with the pipe. In going for something completely general, here's what I've thought of so far.
This function applies its first argument to its second (a function)
def function_application(a, b):
    return b(a)
So for example, if we have a squaring function
def sq(s):
    return s**2
we could invoke that function in this cumbersome way: function_application(5, sq). To get a step closer to a forward pipe, we want to use function_application with infix notation.
Drawing from this, we can define an Infix class so we can wrap functions in special characters such as |.
class Infix:
    def __init__(self, function):
        self.function = function
    def __ror__(self, other):
        return Infix(lambda x, self=self, other=other: self.function(other, x))
    def __or__(self, other):
        return self.function(other)
Now we can define our pipe, which is simply the infix version of function_application:
p = Infix(function_application)
So we can do things like this
5 |p| sq
25
or
[1,2,3,8] |p| sum |p| sq
196
After that long-winded explanation, my question is whether there is any way to get around the limitations on valid function names. Here, I've named the pipe p, but is it possible to overload a non-alphanumeric character? Can I name a function > so my pipe is |>|?
Quick answer:
You can't really use |>| in Python; at the bare minimum you need | * > * |, where each * is an identifier, number, string, or another expression.
Long answer:
Every line is a statement (simple or compound). A statement can be a couple of things, among them an expression, and an expression is the only construct that allows the use of the or operator | and the greater-than comparison > (or all operators and comparisons, for that matter: < > <= >= | ^ & >> << - + % / //). Every expression needs a left-hand side and a right-hand side, ultimately being of the form lhs op rhs. Both the left- and right-hand side can be another expression, but the base case is a primary (with the exception of the unary -, ~ and +, which need just a right-hand side). A primary boils down to an identifier, number or string. So, at the end of the day, you are required to have an identifier [a-zA-Z_][a-zA-Z_0-9]* alongside a |.
Have you considered a different approach, like a class that overrides the or operator instead of an Infix class (see the sketch below)? I have a tiny library that does piping that might interest you.
For reference, here is the full grammar:
https://docs.python.org/2/reference/grammar.html
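Something along these lines, as a rough sketch (Pipeline is just a made-up name):

class Pipeline:
    """Wraps a value so that `|` feeds it through the next function."""
    def __init__(self, value):
        self.value = value
    def __or__(self, func):
        return Pipeline(func(self.value))

def sq(s):
    return s**2

print((Pipeline([1, 2, 3, 8]) | sum | sq).value)  # 196

The trade-off is that the value must be wrapped explicitly at the start of the chain, but no special infix object is needed between the steps.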
I was looking for a way to do this too. So I created a Python library called Pypework.
You just add a prefix such as f. to the beginning of each function call to make it pipeable. Then you can chain them together using the >> operator, like so:
"Lorem Ipsum" >> f.lowercase >> f.replace(" ", "_") # -> "lorem_ipsum"
Or across multiple lines if wrapped in parentheses, like so:
(
    "Lorem Ipsum"
    >> f.lowercase
    >> f.replace(" ", "_")
)
# -> "lorem_ipsum"

Check if a formula is a term in Z3Py

In Z3Py, I need to check if something is a term according to the standard grammar term := const | var | f(t1,...,tn). I have written the following function to determine that, but my way of checking the n-ary function application case doesn't seem very optimal.
Is there a better way to do this? Utility functions such as is_term, is_atom, is_literal, etc. would be useful to have in Z3; I will put them in the contrib section.
CONNECTIVE_OPS = [Z3_OP_NOT, Z3_OP_AND, Z3_OP_OR, Z3_OP_IMPLIES, Z3_OP_IFF, Z3_OP_ITE]
REL_OPS = [Z3_OP_EQ, Z3_OP_LE, Z3_OP_LT, Z3_OP_GE, Z3_OP_GT]

def is_term(a):
    """
    term := const | var | f(t1,...,tn)
    """
    if is_const(a):
        return True
    else:
        r = (is_app(a) and
             a.decl().kind() not in CONNECTIVE_OPS + REL_OPS and
             all(is_term(c) for c in a.children()))
        return r
The function is reasonable. A few comments:
It depends on what you mean by "var" in your specification. Z3 has variables as de Bruijn indices. There is a function in z3py, is_var(a), to check if a is a variable index.
There is another Boolean connective, Z3_OP_XOR.
There are additional relational operations, such as operations that compare bit-vectors.
Depending on your intent and usage of the code, you could alternatively check if the sort of the expression is Boolean, and if it is, ensure that the head function symbol is uninterpreted.
is_const(a) is defined as return is_app(a) and a.num_args() == 0, so is_const is really handled by the default case.
Expressions that Z3 creates as a result of simplification, parsing or other transformations may have many shared sub-expressions, so a straightforward recursive descent can take time exponential in the DAG size of the expression. You can deal with this by maintaining a hash table of visited nodes. From Python you can use Z3_get_ast_id to retrieve a unique number for the expression and maintain these numbers in a set. The identifiers are unique as long as terms are not garbage collected, so you should just maintain such a set as a local variable.
So, something along the lines of:
def get_expr_id(e):
    return Z3_get_ast_id(e.ctx.ref(), e.ast)

def is_term_aux(a, seen):
    if get_expr_id(a) in seen:
        return True
    else:
        seen[get_expr_id(a)] = True
        r = (is_app(a) and
             a.decl().kind() not in CONNECTIVE_OPS + REL_OPS and
             all(is_term_aux(c, seen) for c in a.children()))
        return r

def is_term(a):
    return is_term_aux(a, {})
The "text book" definitions of term, atom and literal used in first-order logic cannot be directly applied to Z3 expressions. In Z3, we allow expressions such as f(And(a, b)) > 0 and f(ForAll([x], g(x) == 0)), where f is a function from Boolean to Integer. These extensions do not increase the expressivity, but they are very convenient when writing problems. The SMT 2.0 standard also allows "term" if-then-else expressions. This is another feature that lets us nest "formulas" inside "terms". Example: g(If(And(a, b), 1, 0)).
When implementing procedures that manipulate Z3 expressions, we sometimes need to distinguish between Boolean and non-Boolean expressions. In this case, a "term" is just an expression that does not have Boolean sort.
def is_term(a):
    return not is_bool(a)
In other instances, we want to process the Boolean connectives (And, Or, ...) in a special way; for example, when defining a CNF translator. In this case, we define an "atom" as any Boolean expression that is either a (free) variable or an application whose head is not one of the Boolean connectives.
def is_atom(a):
    return is_bool(a) and (is_var(a) or (is_app(a) and a.decl().kind() not in CONNECTIVE_OPS))
Once we have defined an atom, a literal can be defined as:
def is_literal(a):
    return is_atom(a) or (is_not(a) and is_atom(a.arg(0)))
Here is an example that demonstrates these functions (also available online at rise4fun):
x = Int('x')
p, q = Bools('p q')
f = Function('f', IntSort(), BoolSort())
g = Function('g', IntSort(), IntSort())
print(is_literal(Not(x > 0)))
print(is_literal(f(x)))
print(is_atom(Not(x > 0)))
print(is_atom(f(x)))
print(is_atom(x))
print(is_term(f(x)))
print(is_term(g(x)))
print(is_term(x))
print(is_term(Var(1, IntSort())))
