I would like to extract an abstract syntax tree for a very simple recursive grammar, for example
for a C function call statement. I have defined the grammar as:
name = Word(srange("[a-z]"), srange("[a-zA-Z0-9_]"))
func_args = Forward()
func_call = (name + "(" + func_args + ZeroOrMore(Word(",") + func_args) + ")").setParseAction(create_node)
func_args <<= (func_call | name)
res = func_call.parseString("func1(func2(v1,func3(v2,v3)))", True)
Parsing works, but I can't figure out the best way to create the AST.
What I want for this example string is this AST:
func1
- func2
  - v1
  - func3
    - v2
    - v3
Suppose I have a tree class. When the callback create_node is called for the innermost call, func3(v2,v3), I should create a node for func3 with children v2 and v3, and so on until the outer statement is parsed. What is the best way to do this? Thank you.
Your example code raises an error for me, but did you try naming the tokens, as shown below, and then calling dump() on the parse result?
...
func_call = (name + ...)('call')
func_args <<= (func_call | name)('func')
...
print(res.dump())
You'll have to adapt this code I'm afraid, but somewhere along those lines.
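Another route, closer to the create_node callback the question asks about, is to let each parse action return a tree node. The following is only a sketch: the Node class and the make_leaf and make_call helpers are hypothetical names, and the argument list is made optional here, which the original grammar does not do. Because pyparsing runs the inner parse actions first, the outer call receives the already-built child nodes as its tokens.

```python
from pyparsing import Forward, Optional, Suppress, Word, ZeroOrMore, srange


class Node:
    """Hypothetical tree node: a name plus a list of child nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def to_tuple(self):
        # Nested-tuple form, handy for printing and comparing trees.
        return (self.name, [child.to_tuple() for child in self.children])


def make_leaf(tokens):
    # A bare name becomes a leaf node.
    return Node(tokens[0])


def make_call(tokens):
    # tokens[0] is the function name (a plain string); the remaining
    # tokens are the Node objects built by the inner parse actions.
    return Node(tokens[0], tokens[1:])


name = Word(srange("[a-z]"), srange("[a-zA-Z0-9_]"))
arg = Forward()
args = arg + ZeroOrMore(Suppress(",") + arg)
func_call = name + Suppress("(") + Optional(args) + Suppress(")")
func_call.setParseAction(make_call)
# Try the call first; a plain name is the fallback and becomes a leaf.
arg <<= func_call | name.copy().setParseAction(make_leaf)

tree = func_call.parseString("func1(func2(v1,func3(v2,v3)))", parseAll=True)[0]
print(tree.to_tuple())
```

The parentheses and commas are suppressed so that only names and nodes reach the parse actions.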
QUESTIONS
This is a long post, so I will highlight my main two questions now before giving details:
How can one succinctly allow for optional matched parentheses/brackets around an expression?
How does one properly parse the content of nested_expr? This answer suggests that this function is not quite appropriate for this, and infix_notation is better, but that doesn't seem to fit my use case (I don't think).
DETAILS
I am working on a grammar to parse prolog strings. The data I have involves a lot of optional brackets or parentheses.
For example, both predicate([arg1, arg2, arg3]) and predicate(arg1, arg2, arg3) are legal and appear in the data.
My full grammar is a little complicated, and likely could be cleaned up, but I will paste it here for reproducibility. I have a couple versions of the grammar as I found new data that I had to account for. The first one works with the following example string:
pred(Var, arg_name1:arg#arg_type, arg_name2:(sub_arg1, sub_arg2))
For some visual clarity, I am turning the parsed strings into graphs, so this is what this one should look like:
Note that arg_name2:(sub_arg1, sub_arg2) is slightly idiosyncratic syntax: the things inside the parens are supposed to be thought of as having an AND operator between them. The only thing indicating this is the fact that this wrapped expression essentially appears "naked" (i.e. it has no predicate name of its own, it's just some values lumped together with parens).
VERSION 1: works on the above string
# GRAMMAR VER 1
predication = pp.Forward()
join_predication = pp.Forward()
entity = pp.Forward()
args_list = pp.Forward()
# atoms are used either as predicate names or bottom level argument values
# str_atoms are just quoted strings which may also appear as arguments
atom = pp.Word(pp.alphanums + '_' + '.')
str_atom = pp.QuotedString("'")
# TYPICAL ARGUMENT: arg_name:ARG_VALUE, where the ARG_VALUE may be an entity, join_predication, predication, or just an atom.
# Note that the arg_name is optional and may not always appear
# EXAMPLES:
# with name: pred(arg1:val1, arg2:val2)
# without name: pred(val1, val2)
argument = pp.Group(pp.Opt(atom("arg_name") + pp.Suppress(":")) + (entity | join_predication | predication | atom("arg_value") | str_atom("arg_value")))
# List of arguments
args_list = pp.Opt(pp.Suppress("[")) + pp.delimitedList(argument) + pp.Opt(pp.Suppress("]"))
# As in the example string above, sometimes predications are grouped together in parentheses and are meant to be understood as having an AND operator between them when evaluating the truth of both together
# EXAMPLE: pred(arg1:(sub_pred1, subpred2))
# I am just treating it as an args_list inside obligatory parentheses
join_predication <<= pp.Group(pp.Suppress("(") + args_list("args_list") + pp.Suppress(")"))("join_predication")
# pred_name with optional arguments (though I've never seen one without arguments, just in case)
predication <<= pp.Group(atom("pred_name") + pp.Suppress("(") + pp.Opt(args_list)("args_list") + pp.Suppress(")"))("predication")
# ent_name with optional arguments and a #type
entity <<= pp.Group(
    ((atom("ent_name") + pp.Suppress("(") + pp.Opt(args_list)("args_list") + pp.Suppress(")"))
     | str_atom("ent_name")
     | atom("ent_name"))
    + pp.Suppress("#") + atom("type")
)("entity")
# starter symbol
lf_fragment = entity | join_predication | predication
Although this works, I came across another very similar string which used brackets instead of parentheses for a join_predication:
pred(Var, arg_name1:arg#arg_type, arg_name2:[sub_arg1, sub_arg2])
This broke my parser seemingly because the brackets are used in other places and because they are often optional, it could mistakenly match one with the wrong parser element as I am doing nothing to enforce that they must go together. For this I thought to turn to nested_expr, but this caused further problems because as mentioned in this answer, parsing the elements inside of a nested_expr doesn't work very well, and I have lost a lot of the substructure I need for the graphs I'm building.
VERSION 2: using nested_expr
# only including those expressions that have been changed
# args_list might not have brackets
args_list = pp.nested_expr("[", "]", pp.delimitedList(argument)) | pp.delimitedList(argument)
# join_predication is an args_list with obligatory wrapping parens/brackets
join_predication <<= pp.nested_expr("(", ")", args_list("args_list"))("join_predication") | pp.nested_expr("[", "]", args_list("args_list"))("join_predication")
I likely need to ensure matching for predication and entity, but haven't for now.
Using the above grammar, I can parse both example strings, but I lose the named structure that I had before.
In the original grammar, parse_results['predication']['args_list'] was a list of every argument, exactly as I expected. In the new grammar, it only contains the first argument, Var, in the example strings.
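One possible way to enforce matched delimiters without nested_expr is to spell out each bracketed form as its own alternative, so that an opening "[" can only be closed by "]". The sketch below is deliberately simplified and is an assumption, not the full grammar above: arguments are plain atoms, and the parses helper is a hypothetical name used only for the demonstration. The named results stay intact because every alternative sits inside the same Group.

```python
import pyparsing as pp

atom = pp.Word(pp.alphanums + "_.")
items = atom + pp.ZeroOrMore(pp.Suppress(",") + atom)
# Each alternative carries both of its delimiters, so they must match.
args_list = pp.Group(
    pp.Suppress("[") + items + pp.Suppress("]")
    | pp.Suppress("(") + items + pp.Suppress(")")
    | items
)("args_list")
predication = pp.Group(
    atom("pred_name") + pp.Suppress("(") + args_list + pp.Suppress(")")
)("predication")


def parses(text):
    # Hypothetical helper: True if the whole string matches predication.
    try:
        predication.parseString(text, parseAll=True)
        return True
    except pp.ParseBaseException:
        return False


for text in ["pred(a, b, c)", "pred([a, b, c])"]:
    result = predication.parseString(text, parseAll=True)
    print(result["predication"]["args_list"].asList())

print(parses("pred([a, b)"))  # the unmatched "[" is rejected
```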
I am trying to create grammar which will parse the following expressions:
1. func()
2. func(a)
3. func(a) + func(b)
4. func(func(a) + func()) + func(b)
I implemented it for (1) and (2), but once I extended rvalue << (identifier | function_call) by operation, it stopped working due to:
Exception raised:Expected W:(ABCD...), found ')' (at char 5), (line:1, col:6)
Exception raised:maximum recursion depth exceeded
Can anyone of you explain why? As far as I understood in expression rvalue << (identifier | function_call | operation) function_call should be matched before operation and the recursion shouldn't take place.
Code:
from pyparsing import Forward, Optional, Word, Literal, alphanums, delimitedList
rvalue = Forward()
operation = rvalue + Literal('+') + rvalue
identifier = Word(alphanums + '_')('identifier')
function_args = delimitedList(rvalue)('function_args')
function_name = identifier('function_name')
function_call = (
(function_name + Literal("(") + Optional(function_args) + Literal(")"))
)('function_call')
rvalue << (identifier | function_call | operation)
function_call.setDebug()
def test_function_call_no_args():
    bdict = function_call.parseString("func()", parseAll=True).asDict()
    assert bdict['function_name'] == 'func'
    assert 'function_args' not in bdict
def test_function_call_one_arg():
    bdict = function_call.parseString("func(arg)", parseAll=True).asDict()
    assert bdict['function_name'] == 'func'
    assert 'function_args' in bdict
def test_function_call_many_args():
    bdict = function_call.parseString("func(arg1, arg2)", parseAll=True).asDict()
    assert bdict['function_name'] == 'func'
    assert 'function_args' in bdict
As far as I understood in expression rvalue << (identifier | function_call | operation) function_call should be matched before operation and the recursion shouldn't take place.
The recursion doesn't take place if one of the previous alternatives succeeds. But if both fail, operation is tried and you get infinite recursion.
For example, in test_function_call_no_args you try to parse func() using the function_call rule. This will parse func as the name of the function and ( as the beginning of the argument list. Then it will try to parse Optional(function_args), which will in turn try to parse delimitedList(rvalue). Now this will try to parse rvalue and since ) doesn't match the first two alternatives, it will try the last one, which will cause the infinite recursion.
When a rule is recursive, you must always consume input before the recursion is reached: it must not be possible to reach the recursion without consuming input. So having the recursion come last in an alternative isn't enough. There must actually be another non-optional rule (one that also doesn't match the empty string) that is successfully matched before it.
PS: rvalue as it is can actually never match a function call because function calls start with an identifier and you match identifier first.
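Putting both points together, one way to restructure the grammar might look like the sketch below. It is an assumption about what the asker wants rather than a drop-in replacement: term is a new, hypothetical rule name, function_call is tried before identifier, and the operation is expressed as repetition so that a term is always consumed before anything recurses.

```python
from pyparsing import (Forward, Group, Literal, Optional, Suppress, Word,
                       ZeroOrMore, alphanums)

rvalue = Forward()
identifier = Word(alphanums + "_")
function_call = Group(
    identifier("function_name")
    + Suppress("(")
    + Optional(Group(rvalue + ZeroOrMore(Suppress(",") + rvalue))("function_args"))
    + Suppress(")")
)
# Try the longer alternative first, otherwise identifier would always win.
term = function_call | identifier
# A term is consumed before the repetition, so there is no left recursion.
rvalue <<= term + ZeroOrMore(Literal("+") + term)

tests = ["func()", "func(a)", "func(a) + func(b)",
         "func(func(a) + func()) + func(b)"]
for text in tests:
    print(rvalue.parseString(text, parseAll=True).asList())
```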
Consider that we have the following input
formula = "(([foo] + [bar]) - ([baz]/2) )"
function_mapping = {
"foo" : FooFunction,
"bar" : BarFunction,
"baz" : BazFunction,
}
Is there any python library that lets me parse the formula and convert it into
a python function representation.
eg.
converted_formula = ((FooFunction() + BarFunction()) - (BazFunction()/2))
I am currently looking into something like
In [11]: ast = compiler.parse(formula)
In [12]: ast
Out[12]: Module(None, Stmt([Discard(Sub((Add((List([Name('foo')]), List([Name('bar')]))), Div((List([Name('baz')]), Const(2))))))]))
and then process this ast tree further.
Do you know of any cleaner alternate solution?
Any help or insight is much appreciated!
You could use the re module to do what you want via regular-expression pattern matching and relatively straight-forward text substitution.
import re
alias_pattern = re.compile(r'''(?:\[(\w+)\])''')
def mapper(mat):
    func_alias = mat.group(1)
    function = function_alias_mapping.get(func_alias)
    if not function:
        raise NameError(func_alias)
    return function.__name__ + '()'
# must be defined before anything can be mapped to them
def FooFunction(): return 15
def BarFunction(): return 30
def BazFunction(): return 6
function_alias_mapping = dict(foo=FooFunction, bar=BarFunction, baz=BazFunction)
formula = "(([foo] + [bar]) - ([baz]/2))" # Custom formula.
converted_formula = re.sub(alias_pattern, mapper, formula)
print('converted_formula = "{}"'.format(converted_formula))
# define contexts and function in which to evaluate the formula expression
global_context = dict(FooFunction=FooFunction,
BarFunction=BarFunction,
BazFunction=BazFunction)
local_context = {'__builtins__': None}
function = lambda: eval(converted_formula, global_context, local_context)
print('answer = {}'.format(function())) # call function
Output:
converted_formula = "((FooFunction() + BarFunction()) - (BazFunction()/2))"
answer = 42
You can use what's called string formatting to accomplish this.
function_mapping = {
"foo" : FooFunction(),
"bar" : BarFunction(),
"baz" : BazFunction(),
}
formula = "(({foo} + {bar}) - ({baz}/2) )".format( **function_mapping )
This substitutes the return value of each function into the formula string, in place of {foo}, {bar} and {baz}.
But the functions are then called as soon as the dictionary is built, so perhaps a better solution would be
function_mapping = {
"foo" : "FooFunction",
"bar" : "BarFunction",
"baz" : "BazFunction",
}
formula = "(({foo}() + {bar}()) - ({baz}()/2) )".format( **function_mapping )
This will give you the string '((FooFunction() + BarFunction()) - (BazFunction()/2) )' which you can then execute at any time with the eval function.
If you change the syntax used in the formulas slightly, another way to do this, as I mentioned in a comment, would be to use string.Template substitution.
Out of curiosity I decided to find out whether this other approach was viable, and was able to come up with a better answer: not only is it simpler than my other one, it is also a little more flexible, because it would be easy to add arguments to the functions being called, as noted in a comment below.
from string import Template
def FooFunction(): return 15
def BarFunction(): return 30
def BazFunction(): return 6
formula = "(($foo + $bar) - ($baz/2))"
function_mapping = dict(foo='FooFunction()', # note these calls could have args
bar='BarFunction()',
baz='BazFunction()')
converted_formula = Template(formula).substitute(function_mapping)
print('converted_formula = "{}"'.format(converted_formula))
# define contexts in which to evaluate the expression
global_context = dict(FooFunction=FooFunction,
BarFunction=BarFunction,
BazFunction=BazFunction)
local_context = dict(__builtins__=None)
function = lambda: eval(converted_formula, global_context, local_context)
answer = function() # call it
print('answer = {}'.format(answer))
As a final note, string.Template supports different kinds of advanced usage which would allow you to fine-tune the expression syntax even further, because internally it uses the re module (in a more sophisticated way than I did in my original answer).
For cases where the mapped functions all return values that can be represented as Python literals (like numbers) and aren't being called just for their side effects, you could make the following modification, which effectively caches (aka memoizes) the results:
function_cache = dict(foo=FooFunction(), # calls and caches function results
bar=BarFunction(),
baz=BazFunction())
def evaluate(formula):
    print('formula = {!r}'.format(formula))
    converted_formula = Template(formula).substitute(function_cache)
    print('converted_formula = "{}"'.format(converted_formula))
    return eval(converted_formula, global_context, local_context)
print('evaluate(formula) = {}'.format(evaluate(formula)))
Output:
formula = '(($foo + $bar) - ($baz/2))'
converted_formula = "((15 + 30) - (6/2))"
evaluate(formula) = 42
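The AST route mentioned in the question can also be followed without eval. The sketch below assumes Python 3.8 or later and uses the standard ast module instead of the old compiler module. The three functions are stand-ins with the same return values as in the answers above, and the OPERATORS table and evaluate function are hypothetical names. Only the node types listed are accepted, so arbitrary code in the formula is rejected rather than executed.

```python
import ast
import operator


# Stand-ins for the question's functions.
def FooFunction(): return 15
def BarFunction(): return 30
def BazFunction(): return 6


function_mapping = {"foo": FooFunction, "bar": BarFunction, "baz": BazFunction}

# Only these binary operators are accepted; anything else raises.
OPERATORS = {ast.Add: operator.add, ast.Sub: operator.sub,
             ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(node):
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
        return OPERATORS[type(node.op)](evaluate(node.left), evaluate(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    # "[foo]" parses as a one-element list holding the name foo.
    if (isinstance(node, ast.List) and len(node.elts) == 1
            and isinstance(node.elts[0], ast.Name)):
        return function_mapping[node.elts[0].id]()
    raise ValueError("unsupported syntax: " + ast.dump(node))


formula = "(([foo] + [bar]) - ([baz]/2) )"
result = evaluate(ast.parse(formula.strip(), mode="eval"))
print(result)
```

Because this uses true division, the result is a float in Python 3.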
I'm trying to write PyParsing code capable of parsing any Python code (I know that the AST module exists, but that will just be a starting point - I ultimately want to parse more than just Python code.)
Anyways, I figure I'll just start by writing something able to parse the classic
print("Hello World!")
So here's what I wrote:
from pyparsing import (alphanums, alphas, delimitedList, Forward, Optional,
                       quotedString, removeQuotes, Suppress, Word)
expr = Forward()
string = quotedString.setParseAction(removeQuotes)
call = expr + Suppress('(') + Optional(delimitedList(expr)) + Suppress(')')
name = Word(alphas + '_', alphanums + '_')
expr <<= string | name | call
test = 'print("Hello World!")'
print(expr.parseString(test))
When I do that, though, it just spits out:
['print']
Which is technically a valid expr - you can type that into the REPL and there's no problem parsing it, even if it's useless.
So I thought maybe what I would want is to flip around name and call in my expr definition, so it would prefer returning calls to names, like this:
expr <<= string | call | name
Now I get a maximum recursion depth exceeded error. That makes sense, too:
Checks if it's an expr.
Checks if it's a string, it's not.
Checks if it's a call.
It must start with an expr, return to start of outer list.
So my question is... how can I define call and expr so that I don't end up with an infinite recursion, but also so that it doesn't just stop when it sees the name and ignore the arguments?
Is Python code too complicated for PyParsing to handle? If not, is there any limit to what PyParsing can handle?
(Note - I've included the general tags parsing, abstract-syntax-tree, and bnf, because I suspect this is a general recursive grammar definition problem, not something necessarily specific to pyparsing.)
Your grammar is left recursive: expr expects a call which expects an expr which expects a call... If PyParsing can't handle left recursion, you need to change the grammar to something that PyParsing can work with.
One approach to remove direct left recursion is to change a grammar rule such as:
A = A b | c
into
A = c b*
In your case, left recursion is indirect: it doesn't happen in expr, but in a sub rule (call):
E = C | s | n
C = E x y z
To remove indirect left recursion you usually "lift" the definition of the sub-rule into the main rule. Unfortunately, this removes the offending sub-rule from the grammar; in other words, you lose some structural expressiveness when you do that.
The previous example, with indirect recursion removed, would look like this:
E = E x y z | s | n
At this point, you have direct left recursion, which is easier to transform. When you deal with that, the result would be something like this -- in pseudo EBNF:
E = (s | n) (x y z)*
In your case, the definition of Expr would become:
Expr = (string | name) Args*
Args = "(" ExprList? ")"
ExprList = Expr ("," Expr)*
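Translated back into pyparsing, the transformed rules might look like the following sketch. It reuses the names from the question; grouping each argument list is an added assumption, made so that the call structure remains visible in the result.

```python
from pyparsing import (Forward, Group, Optional, Suppress, Word, ZeroOrMore,
                       alphanums, alphas, quotedString, removeQuotes)

expr = Forward()
string = quotedString.copy().setParseAction(removeQuotes)
name = Word(alphas + "_", alphanums + "_")
# Args = "(" ExprList? ")", grouped so each argument list stays nested.
args = Group(
    Suppress("(")
    + Optional(expr + ZeroOrMore(Suppress(",") + expr))
    + Suppress(")")
)
# Expr = (string | name) Args*
expr <<= (string | name) + ZeroOrMore(args)

parsed = expr.parseString('print("Hello World!")', parseAll=True).asList()
print(parsed)
```

A name followed by one or more argument groups is then a call, and a name with no group after it is a plain name.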
I have a lot of class attributes that I want to create, so I decided to use a function to do so:
def make_index_variables(self):
    for index, label in enumerate(self.variable_labels):
        eval('self.' + label + '_index = ' + str(index))
If earlier, I defined:
self.variable_labels = ['x', 'y']
I get an error message like this:
eval('self.' + label + '_index = ' + str(index))
self.x_index = 0
^
SyntaxError: invalid syntax
I am beginning to realize that using setattr is probably better than using eval (but I am not sure). In any case, why does eval raise this error?
You want to do exec instead of eval
exec('self.' + label + '_index = ' + str(index))
eval evaluates an expression; it does not execute statements, which is what you want here.
Think of eval's argument as something you could put in the condition of an if statement.
Also, if you want to set attributes of a class, you should definitely use setattr instead.
Actually, 99% of time there are better options for what you want rather than using exec.
Try this one:
setattr(self, label + "_index", index)
eval evaluates an expression. Unlike in C, in Python an assignment is a statement, not an expression (you cannot write c = (a = b) == None, for example). The variant a = b = 3 is special syntax: it does not pass on the value assigned to b, but assigns the value on the right-hand side to each target (a subtle but important difference).
If it is just for an index, there may be better approaches which do not pollute the namespace, however.
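A minimal sketch of the setattr approach, with a dictionary as one possible way to keep the namespace clean. The Model class and the index_of name are hypothetical, and the last lines show eval rejecting an assignment.

```python
class Model:
    def __init__(self, variable_labels):
        self.variable_labels = variable_labels
        self.make_index_variables()

    def make_index_variables(self):
        for index, label in enumerate(self.variable_labels):
            # Same effect as "self.<label>_index = index", without eval/exec.
            setattr(self, label + "_index", index)


model = Model(["x", "y"])
print(model.x_index, model.y_index)

# Alternative that adds no attributes at all: a single lookup table.
index_of = {label: index for index, label in enumerate(model.variable_labels)}
print(index_of["y"])

# eval accepts expressions only, so an assignment fails to compile.
try:
    eval("a = 1")
    eval_rejected = False
except SyntaxError:
    eval_rejected = True
print(eval_rejected)
```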