Parsing arithmetic expressions with function calls - python

I am working with pyparsing and found it to be excellent for developing a simple DSL that allows me to extract data fields out of MongoDB and do simple arithmetic operations on them. I am now trying to extend my tools such that I can apply functions of the form Rank[Person:Height] to the fields and potentially include simple expressions as arguments to the function calls. I am struggling hard with getting the parsing syntax to work. Here is what I have so far:
# Define parser
expr = Forward()
integer = Word(nums).setParseAction(EvalConstant)
real = Combine(Word(nums) + "." + Word(nums)).setParseAction(EvalConstant)

# Handle database field references that are coming out of Mongo,
# accounting for the fact that some fields contain whitespace
dbRef = Combine(Word(alphas) + ":" + Word(printables) +
                Optional(" " + Word(alphas) + " " + Word(alphas)))
dbRef.setParseAction(EvalDBref)

# Handle function calls
functionCall = (Keyword("Rank") | Keyword("ZS") | Keyword("Ntile")) + "[" + expr + "]"
functionCall.setParseAction(EvalFunction)

operand = functionCall | dbRef | (real | integer)

signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')

# Use parse actions to attach Eval constructors to sub-expressions
expr << operatorPrecedence(operand,
    [
        (signop, 1, opAssoc.RIGHT, EvalSignOp),
        (multop, 2, opAssoc.LEFT, EvalMultOp),
        (plusop, 2, opAssoc.LEFT, EvalAddOp),
    ])
My issue is that when I test a simple expression like Rank[Person:Height] I am getting a parse exception:
ParseException: Expected "]" (at char 19), (line:1, col:20)
If I use a float or arithmetic expression as the argument, like Rank[3 + 1.1], the parsing works fine, and if I simplify the dbRef grammar to just Word(alphas) it also works. I cannot for the life of me figure out what's wrong with my full grammar. I have tried rearranging the order of operands as well as simplifying the functionCall grammar, to no avail. Can anyone see what I am doing wrong?
Once I get this working I would want to take a last step and introduce support for variable assignment in expressions ..
EDIT: Upon further testing, if I remove printables from the dbRef grammar, things work OK:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) +
                Optional("_" + Word(alphas)))
HOWEVER, if I add the character "-" to dbRef (which I need for DB fields like "Class:S-N"), the parser fails again. I think the "-" is being consumed by the signop in my operatorPrecedence?

What appears to happen is that the ] character at the end of your test string (Rank[Person:Height]) gets consumed as part of the dbRef token, because the portion of that token after the initial : is declared as Word(printables), and that character set unfortunately includes the square bracket characters.
The parser then tries to complete the functionCall but cannot find the closing ], hence the error message.
A tentative fix is to use a character set that doesn't include the square brackets, maybe something more explicit like:
dbRef = Combine(Word(alphas) + ":" + Word(alphas, alphas + "-_./") +
                Optional(" " + Word(alphas) + " " + Word(alphas)))
Edit:
Upon closer look, the above is only loosely correct, but the token hierarchy is wrong (e.g. the parser attempts to produce a functionCall as one operand of an expr, etc.).
Also, my suggested fix will not work because of the ambiguity around the - sign, which should be understood as a plain character inside a dbRef but as a plusop inside an expr. This type of issue is common with parsers, and there are ways to deal with it, though I'm not sure exactly how in pyparsing.
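This isn't from the original answer, but one possible way to sidestep the hyphen ambiguity in pyparsing (a sketch only, untested against the full grammar above) is to match the entire field reference with a single Regex, so the '-' is consumed inside the dbRef token and never seen by the expression-level operators:
from pyparsing import Regex

# Hypothetical alternative dbRef: one token per field reference, allowing
# '-' and '_' inside the field name (assumes fields look like "Class:S-N")
dbRef = Regex(r"[A-Za-z]+:[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*")
dbRef.setParseAction(EvalDBref)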

Found the solution: the issue was that my grammar for dbRef was consuming some of the characters that were part of the function specification. Here is the new grammar that works correctly:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) +
                Optional(oneOf("_ -") + Word(alphas)))
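For reference, here is a minimal, self-contained check of that grammar with the parse actions left out (the Eval* classes aren't shown above); this is only a sketch to confirm that the new dbRef and the operand ordering behave as described:
from pyparsing import (Word, Combine, Keyword, OneOrMore, Optional, Forward,
                       oneOf, alphas, alphanums, nums,
                       operatorPrecedence, opAssoc)

expr = Forward()
integer = Word(nums)
real = Combine(Word(nums) + "." + Word(nums))
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) +
                Optional(oneOf("_ -") + Word(alphas)))
functionCall = (Keyword("Rank") | Keyword("ZS") | Keyword("Ntile")) + "[" + expr + "]"
operand = functionCall | dbRef | real | integer

signop = oneOf("+ -")
multop = oneOf("* /")
plusop = oneOf("+ -")
# operatorPrecedence is named infixNotation in newer pyparsing releases
expr <<= operatorPrecedence(operand, [
    (signop, 1, opAssoc.RIGHT),
    (multop, 2, opAssoc.LEFT),
    (plusop, 2, opAssoc.LEFT),
])

print(expr.parseString("Rank[Person:Height]"))
print(expr.parseString("Rank[Class:S-N] * 2 + 1.5"))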


pyparsing nestedExpr and double closing characters

I am trying to parse nested column type definitions such as
1 string
2 struct<col_1:string,col_2:int>
3 row(col_1 string,array(col_2 string),col_3 boolean)
4 array<struct<col_1:string,col_2:int>,col_3:boolean>
5 array<struct<col_1:string,col2:int>>
Using nestedExpr works as expected for cases 1-4, but throws a parse error on case 5. Adding a space between the double closing brackets, like "> >", seems to work, which might be explained by this quote from the author:
By default, nestedExpr will look for space-delimited words of printables
https://sourceforge.net/p/pyparsing/bugs/107/
I'm mostly looking for alternatives to pre- and post-processing the input string:
type_str = type_str.replace(">", "> ")
# parse string here
type_str = type_str.replace("> ", ">")
I've tried using infix_notation, but I haven't been able to figure out how to use it in this situation. I'm probably just using it the wrong way...
Code snippet
array_keyword = pp.Keyword('array')
row_keyword = pp.Keyword('row')
struct_keyword = pp.Keyword('struct')
nest_open = pp.Word('<([')
nest_close = pp.Word('>)]')
col_name = pp.Word(pp.alphanums + '_')
col_type = pp.Forward()
col_type_delimiter = pp.Word(':') | pp.White(' ')
column = col_name('name') + col_type_delimiter + col_type('type')
col_list = pp.delimitedList(pp.Group(column))
struct_type = pp.nestedExpr(
    opener=struct_keyword + nest_open, closer=nest_close,
    content=col_list | col_type, ignoreExpr=None
)
row_type = pp.locatedExpr(pp.nestedExpr(
    opener=row_keyword + nest_open, closer=nest_close,
    content=col_list | col_type, ignoreExpr=None
))
array_type = pp.nestedExpr(
    opener=array_keyword + nest_open, closer=nest_close,
    content=col_type, ignoreExpr=None
)
col_type <<= struct_type('children') | array_type('children') | row_type('children') | scalar_type('type')
nestedExpr and infixNotation are not really appropriate for this project. nestedExpr is generally a short-cut expression for stuff you don't really want to go into details parsing, you just want to detect and step over some chunk of text that happens to have some nesting in opening and closing punctuation. infixNotation is intended for parsing expressions with unary and binary operators, usually some kind of arithmetic. You might be able to treat the punctuation in your grammar as operators, but it is a stretch, and definitely doing things the hard way.
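As a tiny illustration of that point (my example, not part of the original answer), nestedExpr just steps over a balanced chunk and hands back nested lists, without giving you any control over the structure inside:
import pyparsing as pp

print(pp.nestedExpr().parseString("(a (b c) d)"))
# -> [['a', ['b', 'c'], 'd']]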
For your project, you will really need to define the different elements, and it will be a recursive grammar (since the array and struct types will themselves be defined in terms of other types, which could also be arrays or structs).
I took a stab at a BNF for a subset of your grammar, using scalar types int, float, boolean, and string, and compound types array and struct, with just the '<' and '>' nesting punctuation. An array will take a single type argument, to define the type of the elements in the array. A struct will take one or more struct fields, where each field is an identifier:type pair.
scalar_type ::= 'int' | 'float' | 'string' | 'boolean'
array_type ::= 'array' '<' type_defn '>'
struct_type ::= 'struct' '<' struct_element (',' struct_element)... '>'
struct_element ::= identifier ':' type_defn
type_defn ::= scalar_type | array_type | struct_type
(If you later want to add a row definition also, think about what the row is supposed to look like, and how its elements would be defined, and then add it to this BNF.)
You look pretty comfortable with the basics of pyparsing, so I'll just start you off with some intro pieces, and then let you fill in the rest.
# define punctuation
LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword('array')
STRUCT = pp.Keyword('struct')
# create a Forward that will be used in other type expressions
type_defn = pp.Forward()
# here is the array type, you can fill in the other types following this model
# and the definitions in the BNF
array_type = pp.Group(ARRAY + LT + type_defn + GT)
...
# then finally define type_defn in terms of the other type expressions
type_defn <<= scalar_type | array_type | struct_type
Once you have that finished, try it out with some tests:
type_defn.runTests("""\
string
struct<col_1:string,col_2:int>
array<struct<col_1:string,col2:int>>
""", fullDump=False)
And you should get something like:
string
['string']
struct<col_1:string,col_2:int>
['struct', [['col_1', 'string'], ['col_2', 'int']]]
array<struct<col_1:string,col2:int>>
['array', ['struct', [['col_1', 'string'], ['col2', 'int']]]]
Once you have that, you can play around with extending it to other types, such as your row type, maybe unions, or arrays that take multiple types (if that was your intention in your posted example). Always start by updating the BNF - then the changes you'll need to make in the code will generally follow.
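In case it helps, here is one way the remaining pieces might be filled in, following the BNF above; this is my own completion rather than the original author's finished code, and the exact nesting of the results will shift depending on where you put pp.Group:
import pyparsing as pp

LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword('array')
STRUCT = pp.Keyword('struct')

ident = pp.Word(pp.alphas, pp.alphanums + '_')
scalar_type = pp.oneOf("int float string boolean")

type_defn = pp.Forward()

array_type = pp.Group(ARRAY + LT + type_defn + GT)
struct_element = pp.Group(ident + COLON + type_defn)
struct_type = pp.Group(STRUCT + LT + pp.Group(pp.delimitedList(struct_element)) + GT)

type_defn <<= scalar_type | array_type | struct_type

type_defn.runTests("""\
string
struct<col_1:string,col_2:int>
array<struct<col_1:string,col2:int>>
""", fullDump=False)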

Parsing text file in python using pyparsing

I am trying to parse the following text using pyparsing.
acp (SOLO1,
"solo-100",
"hi here is the gift"
"Maximum amount of money, goes",
430, 90)
jhk (SOLO2,
"solo-101",
"hi here goes the wind."
"and, they go beyond",
1000, 320)
I have tried the following code but it doesn't work.
flag = Word(alphas+nums+'_'+'-')
enclosed = Forward()
nestedBrackets = nestedExpr('(', ')', content=enclosed)
enclosed << (flag | nestedBrackets)
print list(enclosed.searchString (str1))
The comma (,) inside the quoted strings is producing undesired results.
Well, I might have oversimplified slightly in my comments - here is a more complete answer.
If you don't really have to deal with nested data items, then a single-level parenthesized data group in each section will look like this:
from pyparsing import (Suppress, Word, Group, OneOrMore, delimitedList,
                       quotedString, alphas, alphanums, nums)

LPAR, RPAR = map(Suppress, "()")
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)

# treat consecutive quoted strings as one combined string
quoted_string = OneOrMore(quotedString)
# add parse action to concatenate multiple adjacent quoted strings
quoted_string.setParseAction(lambda t: '"' +
                             ''.join(map(lambda s: s.strip('"\''), t)) +
                             '"' if len(t) > 1 else t[0])

data_item = ident | integer | quoted_string

# section defined with no nesting
section = ident + Group(LPAR + delimitedList(data_item) + RPAR)
I wasn't sure if it was intentional or not when you omitted the comma between two consecutive quoted strings, so I chose to implement logic like Python's compiler, in which two adjacent quoted strings are treated as just one longer string; that is, "AB CD " "EF" is the same as "AB CD EF". This was done with the definition of quoted_string, and by adding the parse action to quoted_string to concatenate the contents of the two or more component quoted strings.
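Here is a quick standalone check of that concatenation behavior (my example, reusing the same quoted_string definition as above):
from pyparsing import OneOrMore, quotedString

quoted_string = OneOrMore(quotedString)
quoted_string.setParseAction(lambda t: '"' +
                             ''.join(map(lambda s: s.strip('"\''), t)) +
                             '"' if len(t) > 1 else t[0])

print(quoted_string.parseString('"AB CD " "EF"'))   # -> ['"AB CD EF"']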
Finally, we create a parser for the overall group
results = OneOrMore(Group(section)).parseString(source)
results.pprint()
and get from your posted input sample:
[['acp',
['SOLO1',
'"solo-100"',
'"hi here is the giftMaximum amount of money, goes"',
'430',
'90']],
['jhk',
['SOLO2',
'"solo-101"',
'"hi here goes the wind.and, they go beyond"',
'1000',
'320']]]
If you do have nested parenthetical groups, then your section definition can be as simple as this:
# section defined with nesting
section = ident + nestedExpr()
Although, as you have already found, this will retain the separate commas as if they were significant tokens instead of just data separators.

Use pyparsing to parse expression starting with parenthesis

I'm trying to develop a grammar that can parse an expression starting with a parenthesis and ending with a parenthesis. There can be any combination of characters inside the parentheses. I've written the following code, following the Hello, World! example from pyparsing.
from pyparsing import *
select = Literal("select")
predicate = "(" + Word(printables) + ")"
selection = select + predicate
print (selection.parseString("select (a)"))
But this throws an error. I think it may be because printables also contains ( and ), which somehow conflicts with the explicit ( and ) in the grammar.
What is the correct way of doing this?
You could use alphas instead of printables.
from pyparsing import *
select = Literal("select")
predicate = "(" + Word(alphas) + ")"
selection = select + predicate
print (selection.parseString("select (a)"))
If using { } as the nesting characters:
from pyparsing import *
expr = Combine(Suppress('select ') + nestedExpr('{', '}'))
value = "select {a(b(c\somethinsdfsdf##!#$###$##))}"
print( expr.parseString( value ) )
output: [['a(b(c\\somethinsdfsdf##!#$###$##))']]
The problem with ( and ) is that they are nestedExpr's default opening and closing characters.
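If you do want to keep ( and ) themselves, two possible sketches (my own suggestions, not from the answers above): either exclude the parentheses from the Word's character set, or let nestedExpr use its default ( ) opener and closer:
from pyparsing import Literal, Word, printables, nestedExpr

select = Literal("select")

# Option 1: Word(printables) minus the parentheses (excludeChars is a standard Word argument)
predicate = "(" + Word(printables, excludeChars="()") + ")"
print((select + predicate).parseString("select (a)"))    # -> ['select', '(', 'a', ')']

# Option 2: nestedExpr with its default ( ) nesting characters
print((select + nestedExpr()).parseString("select (a)")) # -> ['select', ['a']]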

Function equivalent to prepending string with "r" in Python

It's great that I can write
s = r"some line\n"
but what is the functional equivalent to prepending with r? For example:
s = raw_rep( s )
There isn't one. The r is an integral part of the string literal token, and omitting it is a lossy operation.
For example, r'\n', r'\12' and r'\x0a' are three different strings. However, if you omit the r, they become identical, making it impossible to tell which of the three it was to begin with.
For this reason, there is no method that would reconstruct the original string 100% of the time.
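A quick illustration of that lossiness (my example):
>>> '\n' == '\12' == '\x0a'   # once the escapes are interpreted, all three are the same string
True
>>> r'\n', r'\12', r'\x0a'    # the raw forms, by contrast, are three different strings
('\\n', '\\12', '\\x0a')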
def raw_rep(s):
    quote = '"' if "'" in s else "'"
    return 'r' + quote + s + quote
>>> print raw_rep(r'some line\n')
r'some line\n'

Python: Using strings as an object argument?

Essentially my problem is as follows...
In Python, I have a function that will return an output string in the following form:
'union(symbol(a), symbol(b))'
The functions that appear in this string actually exist in a class called RegExTree. Further, this class contains a method, construct(), that builds a tree data structure, as shown below:
tree = RegExTree()
tree.construct(union(symbol(a), symbol(b)))
The above two lines of code would work normally, constructing a tree by parsing the arguments within the construct function. I want to pass in a string in a similar fashion; perhaps these lines of code illustrate what I want:
tree = RegExTree()
expression = 'union(' + 'symbol(' + 'a' + ')' + ', ' + 'symbol(' + 'b' + ')' + ')'
tree.construct(expression)
Right now, with the code written as above, it yields the following error (in the Linux terminal):
$ Attribute Error: 'str' object has no attribute 'value'
Can you coerce Python to interpret the string as a valid argument/line of code? In essence, not as a string, but as object constructors.
Is there a way to get Python to interpret a string as something that would have been parsed/compiled into objects, and have it construct the objects from the string as if it were a line of code describing the same end goal?
Is what I'm asking for some kind of back-door type conversion? Or is what I'm asking not possible in programming languages, specifically Python?
EDIT: Using Michael's solution posted below, which involves eval(), here is one way to hack this into working form:
tree = RegExTree()
a = 'a'
b = 'b'
expression = 'union(' + 'symbol(' + a + ')' + ', ' + 'symbol(' + b + ')' + ')'
tree.construct(eval(expression))
Is there a better way of doing this? Or is producing my output as a string that represents function calls just not a good idea?
[Thanks martineau for the correction for my solution edit!]
You can use Python's built-in eval() function.
A word of caution though... you do not want to run eval() on a string that's coming into your program as external input provided by the user. That could create a security hole where users of your program could run arbitrary Python code of their own design.
In your example it'd look something like this:
tree = RegExTree()
expression = 'union(' + 'symbol(' + 'a' + ')' + ', ' + 'symbol(' + 'b' + ')' + ')'
tree.construct( eval(expression) ) # Notice the eval() call here
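If you do go the eval() route, one small hardening step (a sketch of my own, assuming union and symbol are the RegExTree constructors you want to expose, and that a and b should evaluate to plain strings as in the EDIT above) is to pass eval() an explicit, minimal namespace so the string can only reference names you list; note this is not a full sandbox:
# Hypothetical: only expose the names the expression is allowed to use
namespace = {"__builtins__": {}, "union": union, "symbol": symbol,
             "a": "a", "b": "b"}
tree = RegExTree()
tree.construct(eval(expression, namespace))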
