I'm trying to use pyparsing to parse function calls in the form:
f(x, y)
That's easy. But since it's a recursive-descent parser, it should also be easy to parse:
f(g(x), y)
That's what I can't get. Here's a boiled-down example:
from pyparsing import Forward, Word, alphas, alphanums, nums, ZeroOrMore, Literal
lparen = Literal("(")
rparen = Literal(")")
identifier = Word(alphas, alphanums + "_")
integer = Word( nums )
functor = identifier
# allow expression to be used recursively
expression = Forward()
arg = identifier | integer | expression
args = arg + ZeroOrMore("," + arg)
expression << functor + lparen + args + rparen
print expression.parseString("f(x, y)")
print expression.parseString("f(g(x), y)")
And here's the output:
['f', '(', 'x', ',', 'y', ')']
Traceback (most recent call last):
File "tmp.py", line 14, in <module>
print expression.parseString("f(g(x), y)")
File "/usr/local/lib/python2.6/dist-packages/pyparsing-1.5.6-py2.6.egg/pyparsing.py", line 1032, in parseString
raise exc
pyparsing.ParseException: Expected ")" (at char 3), (line:1, col:4)
Why does my parser interpret the functor of the inner expression as a standalone identifier?
Nice catch on figuring out that identifier was masking expression in your definition of arg. Here are some other tips on your parser:
x + ZeroOrMore(',' + x) is a very common pattern in pyparsing parsers, so pyparsing includes a helper method delimitedList which allows you to replace that expression with delimitedList(x). Actually, delimitedList does one other thing - it suppresses the delimiting commas (or other delimiter if given using the optional delim argument), based on the notion that the delimiters are useful at parsing time, but are just clutter tokens when trying to sift through the parsed data afterwards. So you can rewrite args as args = delimitedList(arg), and you will get just the args in a list, no commas to have to "step over".
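As a minimal sketch of the difference (the input string and variable names here are illustrative, not from the original post), the hand-written pattern and delimitedList can be compared side by side:

```python
from pyparsing import Word, ZeroOrMore, alphas, delimitedList

x = Word(alphas)

# Hand-written list pattern: the commas stay in the parsed results.
manual = x + ZeroOrMore("," + x)
manual_tokens = manual.parseString("a, b, c").asList()

# delimitedList matches the same input but suppresses the delimiters.
helper_tokens = delimitedList(x).parseString("a, b, c").asList()

print(manual_tokens)  # ['a', ',', 'b', ',', 'c']
print(helper_tokens)  # ['a', 'b', 'c']
```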
You can use the Group class to create actual structure in your parsed tokens. This will build your nesting hierarchy for you, without having to walk this list looking for '(' and ')' to tell you when you've gone down a level in the function nesting:
arg = Group(expression) | identifier | integer
expression << functor + Group(lparen + args + rparen)
Since your args are being Grouped for you, you can further suppress the parens, since like the delimiting commas, they do their job during parsing, but with grouping of your tokens, they are no longer necessary:
lparen = Literal("(").suppress()
rparen = Literal(")").suppress()
I assume 'h()' is a valid function call, just no args. You can allow args to be optional using Optional:
expression << functor + Group(lparen + Optional(args) + rparen)
Now you can parse "f(g(x), y, h())".
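Putting these tips together, a self-contained sketch of the whole grammar might look like the following. It assumes the same names as the question and uses Suppress(...) in place of Literal(...).suppress(), which is equivalent.

```python
from pyparsing import (Forward, Group, Optional, Suppress, Word,
                       alphanums, alphas, delimitedList, nums)

lparen = Suppress("(")
rparen = Suppress(")")
identifier = Word(alphas, alphanums + "_")
integer = Word(nums)
functor = identifier

# Forward lets expression refer to itself through arg.
expression = Forward()
# Try the nested call first, so "g" in "g(x)" is not taken as a bare identifier.
arg = Group(expression) | identifier | integer
args = delimitedList(arg)
expression <<= functor + Group(lparen + Optional(args) + rparen)

result = expression.parseString("f(g(x), y, h())").asList()
print(result)  # ['f', [['g', ['x']], 'y', ['h', []]]]
```

Each function call shows up as its name followed by a nested list of arguments, so the nesting depth can be read straight off the result.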
Welcome to pyparsing!
The alternatives in arg should be ordered so that the one that begins with another alternative comes first. Pyparsing's | operator takes the first alternative that matches, so expression must be tried before identifier:
arg = expression | identifier | integer
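The effect of the ordering can be checked directly. This is a sketch only: build is a hypothetical helper that constructs the question's grammar with either ordering of arg.

```python
from pyparsing import (Forward, Literal, ParseException, Word, ZeroOrMore,
                       alphanums, alphas, nums)

def build(expression_first):
    """Build the question's grammar with either ordering of arg."""
    lparen, rparen = Literal("("), Literal(")")
    identifier = Word(alphas, alphanums + "_")
    integer = Word(nums)
    expression = Forward()
    if expression_first:
        arg = expression | identifier | integer
    else:
        arg = identifier | integer | expression
    args = arg + ZeroOrMore("," + arg)
    expression <<= identifier + lparen + args + rparen
    return expression

# identifier first: "g" is consumed as a plain identifier, then ")" is expected.
try:
    build(False).parseString("f(g(x), y)")
    original_fails = False
except ParseException:
    original_fails = True

# expression first: the nested call is recognised.
tokens = build(True).parseString("f(g(x), y)").asList()
print(original_fails)  # True
print(tokens)          # ['f', '(', 'g', '(', 'x', ')', ',', 'y', ')']
```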
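Paul's answer helped a lot. For posterity, the same technique can be used to define nested block statements such as if blocks or for loops, as follows (simplified pseudo-parser here, to show the structure; guard and other are placeholders for your own expressions):
from pyparsing import (
    Forward, Group, Keyword, Literal, OneOrMore, Word, ZeroOrMore, alphas)
sep = Literal(';')
if_ = Keyword('if')
then_ = Keyword('then')
elif_ = Keyword('elif')
end_ = Keyword('end')
# placeholders so the sketch runs; substitute your real grammar
guard = Word(alphas)
# a plain statement must not start with one of the block keywords
other = ~(if_ | then_ | elif_ | end_) + Word(alphas)
if_block = Forward()
do_block = Forward()  # a loop block would be defined the same way as if_block
stmt = other | if_block
stmts = OneOrMore(stmt + sep)
case = Group(guard + then_ + stmts)
# ZeroOrMore, so that an if block without any elif also parses
cases = case + ZeroOrMore(elif_ + case)
if_block << if_ + cases + end_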
I am trying to parse the following text using pyparsing.
acp (SOLO1,
"solo-100",
"hi here is the gift"
"Maximum amount of money, goes",
430, 90)
jhk (SOLO2,
"solo-101",
"hi here goes the wind."
"and, they go beyond",
1000, 320)
I have tried the following code but it doesn't work.
flag = Word(alphas+nums+'_'+'-')
enclosed = Forward()
nestedBrackets = nestedExpr('(', ')', content=enclosed)
enclosed << (flag | nestedBrackets)
print list(enclosed.searchString (str1))
The comma(,) within the quotation is producing undesired results.
Well, I might have oversimplified slightly in my comments - here is a more complete
answer.
If you don't really have to deal with nested data items, then a single-level parenthesized
data group in each section will look like this:
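from pyparsing import (Group, OneOrMore, Suppress, Word, alphanums, alphas,
                       delimitedList, nestedExpr, nums, quotedString)
LPAR,RPAR = map(Suppress, "()")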
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)
# treat consecutive quoted strings as one combined string
quoted_string = OneOrMore(quotedString)
# add parse action to concatenate multiple adjacent quoted strings
quoted_string.setParseAction(lambda t: '"' +
''.join(map(lambda s:s.strip('"\''),t)) +
'"' if len(t)>1 else t[0])
data_item = ident | integer | quoted_string
# section defined with no nesting
section = ident + Group(LPAR + delimitedList(data_item) + RPAR)
I wasn't sure if it was intentional or not when you omitted the comma between
two consecutive quoted strings, so I chose to implement logic like Python's compiler,
in which two quoted strings are treated as just one longer string, that is "AB CD " "EF" is
the same as "AB CD EF". This was done with the definition of quoted_string, and adding
the parse action to quoted_string to concatenate the contents of the 2 or more component
quoted strings.
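The concatenation step can be tried in isolation. In this sketch, join_adjacent is a hypothetical named function doing the same job as the lambda above.

```python
from pyparsing import OneOrMore, quotedString

def join_adjacent(tokens):
    """Merge two or more adjacent quoted strings into one quoted string."""
    if len(tokens) == 1:
        return tokens[0]
    return '"' + "".join(s.strip('"\'') for s in tokens) + '"'

quoted_string = OneOrMore(quotedString).setParseAction(join_adjacent)

merged = quoted_string.parseString('"AB CD " "EF"')[0]
single = quoted_string.parseString('"solo-100"')[0]
print(merged)  # "AB CD EF"
print(single)  # "solo-100"
```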
Finally, we create a parser for the overall group, where source holds the input text from the question:
results = OneOrMore(Group(section)).parseString(source)
results.pprint()
and get from your posted input sample:
[['acp',
['SOLO1',
'"solo-100"',
'"hi here is the giftMaximum amount of money, goes"',
'430',
'90']],
['jhk',
['SOLO2',
'"solo-101"',
'"hi here goes the wind.and, they go beyond"',
'1000',
'320']]]
If you do have nested parenthetical groups, then your section definition can be
as simple as this:
# section defined with nesting
section = ident + nestedExpr()
Although as you have already found, this will retain the separate commas as if they
were significant tokens instead of just data separators.
I'm trying to develop a grammar which can parse expression starting with parenthesis and ending parenthesis. There can be any combination of characters inside the parenthesis. I've written the following code, following the Hello World program from pyparsing.
from pyparsing import *
select = Literal("select")
predicate = "(" + Word(printables) + ")"
selection = select + predicate
print (selection.parseString("select (a)"))
But this throws an error. I think it may be because printables also contains ( and ), which conflicts with the literal ( and ) in the grammar.
What is the correct way of doing this?
You could use alphas instead of printables.
from pyparsing import *
select = Literal("select")
predicate = "(" + Word(alphas) + ")"
selection = select + predicate
print (selection.parseString("select (a)"))
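If the predicate needs more than letters, one possible alternative is to keep printables but exclude the parentheses from the word. This is a sketch that assumes the predicate contains no whitespace and no nested parentheses, and the test string is made up for illustration:

```python
from pyparsing import Literal, Word, printables

select = Literal("select")
# Any printable character except the parentheses themselves.
predicate = "(" + Word(printables, excludeChars="()") + ")"
selection = select + predicate

tokens = selection.parseString("select (a>=1)").asList()
print(tokens)  # ['select', '(', 'a>=1', ')']
```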
If using { } as the nesting characters:
from pyparsing import *
expr = Combine(Suppress('select ') + nestedExpr('{', '}'))
value = r"select {a(b(c\somethinsdfsdf##!#$###$##))}"
print( expr.parseString( value ) )
output: [['a(b(c\\somethinsdfsdf##!#$###$##))']]
The problem with ( ) is that they are nestedExpr's default opening and closing delimiters, so parentheses inside the content are treated as further nesting levels.
I am working with pyparsing and found it to be excellent for developing a simple DSL that allows me to extract data fields out of MongoDB and do simple arithmetic operations on them. I am now trying to extend my tools such that I can apply functions of the form Rank[Person:Height] to the fields and potentially include simple expressions as arguments to the function calls. I am struggling hard with getting the parsing syntax to work. Here is what I have so far:
# Define parser
expr = Forward()
integer = Word(nums).setParseAction(EvalConstant)
real = Combine(Word(nums) + "." + Word(nums)).setParseAction(EvalConstant)
# Handle database field references that are coming out of Mongo,
# accounting for the fact that some fields contain whitespace
dbRef = Combine(Word(alphas) + ":" + Word(printables) + \
Optional(" " + Word(alphas) + " " + Word(alphas)))
dbRef.setParseAction(EvalDBref)
# Handle function calls
functionCall = (Keyword("Rank") | Keyword("ZS") | Keyword("Ntile")) + "[" + expr + "]"
functionCall.setParseAction(EvalFunction)
operand = functionCall | dbRef | (real | integer)
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
# Use parse actions to attach Eval constructors to sub-expressions
expr << operatorPrecedence(operand,
[
(signop, 1, opAssoc.RIGHT, EvalSignOp),
(multop, 2, opAssoc.LEFT, EvalMultOp),
(plusop, 2, opAssoc.LEFT, EvalAddOp),
])
My issue is that when I test a simple expression like Rank[Person:Height] I am getting a parse exception:
ParseException: Expected "]" (at char 19), (line:1, col:20)
If I use a float or arithmetic expression as the argument, like Rank[3 + 1.1], the parsing works OK, and if I simplify the dbRef grammar so it's just Word(alphas) it also works. I cannot for the life of me figure out what's wrong with my full grammar. I have tried rearranging the order of operands as well as simplifying the functionCall grammar, to no avail. Can anyone see what I am doing wrong?
Once I get this working I would want to take a last step and introduce support for variable assignment in expressions ..
EDIT: Upon further testing, if I remove the printables from dbRef grammar, things work ok:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) + \
Optional("_" + Word(alphas)))
HOWEVER, if I add the character "-" to dbRef (which I need for DB fields like "Class:S-N"), the parser fails again. I think the "-" is being consumed by the signop in my operatorPrecedence?
What appears to happen is that the ] character at the end of your test string (Rank[Person:Height]) gets consumed as part of the dbRef token, because the portion of this token after the initial : is declared as Word(printables), and that character set unfortunately includes the square bracket characters.
Then the parser tries to produce a functionCall but is missing the closing ] hence the error message.
A tentative fix is to use a character set that doesn't include the square brackets, maybe something more explicit like:
dbRef = Combine(Word(alphas) + ":" + Word(alphas, alphas+"-_./") + \
Optional(" " + Word(alphas) + " " + Word(alphas)))
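The greedy match described above can be reproduced in isolation. In this sketch the Eval* parse actions are left out, and the excludeChars variant is only an illustration of bounding the character set, not the grammar the posters settled on:

```python
from pyparsing import Combine, Keyword, ParseException, Word, alphas, printables

# Word(printables) also matches "]", so it swallows the closing bracket.
greedy = Combine(Word(alphas) + ":" + Word(printables))
# Excluding the brackets stops the word before "]".
bounded = Combine(Word(alphas) + ":" + Word(printables, excludeChars="[]"))

greedy_tokens = greedy.parseString("Person:Height]").asList()
bounded_tokens = bounded.parseString("Person:Height]").asList()

try:
    (Keyword("Rank") + "[" + greedy + "]").parseString("Rank[Person:Height]")
    greedy_call_fails = False
except ParseException:
    greedy_call_fails = True

call_tokens = (Keyword("Rank") + "[" + bounded + "]").parseString(
    "Rank[Person:Height]").asList()

print(greedy_tokens)   # ['Person:Height]']
print(bounded_tokens)  # ['Person:Height']
print(call_tokens)     # ['Rank', '[', 'Person:Height', ']']
```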
Edit:
Upon closer look, the above is loosely correct, but the token hierarchy is wrong (e.g. the parser attempts to produce a functionCall as one operand of an expr, etc.).
Also, my suggested fix will not work because of the ambiguity with the - sign, which should be understood as a plain character when within a dbRef and as a plusop when within an expr. This type of issue is common with parsers and there are ways to deal with it, though I'm not sure exactly how with pyparsing.
Found solution - the issue was that my grammar for dbRef was consuming some of the characters that were part of the function specification. New grammar that works correctly:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) + \
Optional(oneOf("_ -") + Word(alphas)))
What is the difference between:
foo = TOKEN1 + TOKEN2
and
foo = Combine(TOKEN1 + TOKEN2)
Thanks.
UPDATE: Based on my experimentation, it seems like Combine() is for terminals, where you're trying to build an expression to match on, whereas plain + is for non-terminals. But I'm not sure.
Combine has 2 effects:
it concatenates all the tokens into a single string
it requires the matching tokens to all be adjacent with no intervening whitespace
If you create an expression like
realnum = Word(nums) + "." + Word(nums)
Then realnum.parseString("3.14") will return a list of 3 tokens: the leading '3', the '.', and the trailing '14'. But if you wrap this in Combine, as in:
realnum = Combine(Word(nums) + "." + Word(nums))
then realnum.parseString("3.14") will return '3.14' (which you could then convert to a float using a parse action). And since Combine suppresses pyparsing's default whitespace skipping between tokens, you won't accidentally find "3.14" in "The answer is 3. 14 is the next answer."
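Both effects can be seen in a short sketch (the variable names are illustrative; the sample sentence is the one quoted above):

```python
from pyparsing import Combine, Word, nums

loose = Word(nums) + "." + Word(nums)
tight = Combine(Word(nums) + "." + Word(nums))
# A parse action can convert the combined string to a float.
as_float = Combine(Word(nums) + "." + Word(nums)).setParseAction(
    lambda t: float(t[0]))

loose_tokens = loose.parseString("3.14").asList()     # ['3', '.', '14']
tight_tokens = tight.parseString("3.14").asList()     # ['3.14']
float_tokens = as_float.parseString("3.14").asList()  # [3.14]

text = "The answer is 3. 14 is the next answer."
loose_hits = loose.searchString(text).asList()  # [['3', '.', '14']]
tight_hits = tight.searchString(text).asList()  # []
```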
This code works:
from pyparsing import *
zipRE = r"\d{5}(?:[-\s]\d{4})?"
fooRE = r"^\!\s+.*"
zipcode = Regex( zipRE )
foo = Regex( fooRE )
query = ( zipcode | foo )
tests = [ "80517", "C6H5OH", "90001-3234", "! sfs" ]
for t in tests:
    try:
        results = query.parseString( t )
        print("%s -> %s" % (t, results))
    except ParseException as pe:
        print(pe)
I'm stuck on two issues:
1 - How to use a custom function to parse a token. For instance, if I wanted to use some custom logic instead of a regex to determine if a number is a zipcode.
Instead of:
zipcode = Regex( zipRE )
perhaps:
zipcode = MyFunc()
2 - How do I determine what a string parses TO. "80001" parses to "zipcode" but how do I determine this using pyparsing? I'm not parsing a string for its contents but simply to determine what kind of query it is.
You could use zipcode and foo separately, so that you know which one the string matches.
zipresults = zipcode.parseString( t )
fooresults = foo.parseString( t )
Your second question is easy, so I'll answer that first. Change query to assign results names to the different expressions:
query = ( zipcode("zip") | foo("foo") )
Now you can call getName() on the returned result:
print("%s -> %s %s" % (t, results, results.getName()))
Giving:
80517 -> ['80517'] zip
Expected Re:('\\d{5}(?:[-\\s]\\d{4})?') (at char 0), (line:1, col:1)
90001-3234 -> ['90001-3234'] zip
! sfs -> ['! sfs'] foo
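As a self-contained sketch of this classification step, with classify as a hypothetical helper and the foo pattern written without the leading ^:

```python
from pyparsing import ParseException, Regex

zipcode = Regex(r"\d{5}(?:[-\s]\d{4})?")
foo = Regex(r"!\s+.*")
query = zipcode("zip") | foo("foo")

def classify(text):
    """Return the results name of the matching alternative, or None."""
    try:
        return query.parseString(text).getName()
    except ParseException:
        return None

kinds = [classify(t) for t in ["80517", "C6H5OH", "90001-3234", "! sfs"]]
print(kinds)  # ['zip', None, 'zip', 'foo']
```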
If you are going to use the result's fooness or zipness to call another function, then you could do this at parse time by attaching a parse action to your foo and zipcode expressions:
# enclose zipcodes in '*'s, foos in '#'s
zipcode.setParseAction(lambda t: '*' + t[0] + '*')
foo.setParseAction(lambda t: '#' + t[0] + '#')
query = ( zipcode("zip") | foo("foo") )
Now gives:
80517 -> ['*80517*'] zip
Expected Re:('\\d{5}(?:[-\\s]\\d{4})?') (at char 0), (line:1, col:1)
90001-3234 -> ['*90001-3234*'] zip
! sfs -> ['#! sfs#'] foo
For your first question, I don't exactly know what kind of function you mean. Pyparsing provides many more parsing classes than just Regex (such as Word, Keyword, Literal, CaselessLiteral), and you compose your parser by combining them with '+', '|', '^', '~', '&' and '*' operators. For instance, if you wanted to parse for a US social security number, but not use a Regex, you could use:
ssn = Combine(Word(nums,exact=3) + '-' +
Word(nums,exact=2) + '-' + Word(nums,exact=4))
Word matches for contiguous "words" made up of the given characters in its constructor, Combine concatenates the matched tokens into a single token.
If you wanted to parse for a potential list of such numbers, delimited by '/'s, use:
delimitedList(ssn, '/')
or if there were between 1 and 3 such numbers, with no delimiters, use:
ssn * (1,3)
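A short sketch showing the three forms together (the sample numbers are made up for illustration):

```python
from pyparsing import Combine, Word, delimitedList, nums

ssn = Combine(Word(nums, exact=3) + "-" +
              Word(nums, exact=2) + "-" + Word(nums, exact=4))

one = ssn.parseString("123-45-6789").asList()
# '/'-delimited list: the delimiters are suppressed.
several = delimitedList(ssn, "/").parseString(
    "123-45-6789/987-65-4321").asList()
# Between 1 and 3 numbers, separated only by whitespace.
up_to_three = (ssn * (1, 3)).parseString(
    "123-45-6789 987-65-4321").asList()

print(one)          # ['123-45-6789']
print(several)      # ['123-45-6789', '987-65-4321']
print(up_to_three)  # ['123-45-6789', '987-65-4321']
```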
And any expression can have results names or parse actions attached to them, to further enrich the parsed results, or the functionality during parsing. You can even build recursive parsers, such as nested lists of parentheses, arithmetic expressions, etc. using the Forward class.
My intent when I wrote pyparsing was that this composition of parsers from basic building blocks would be the primary form for creating a parser. It was only in a later release that I added Regex as (what I thought was) the ultimate escape valve - if people couldn't build up their parser, they could fall back on regex's format, which has definitely proven its power over time.
Or, as one other poster suggests, you can open up the pyparsing source, and subclass one of the existing classes, or write your own, following their structure. Here is a class that would match for paired characters:
class PairOf(Token):
    """Token for matching words composed of a pair
    of characters in a given set.
    """
    def __init__( self, chars ):
        super(PairOf,self).__init__()
        self.pair_chars = set(chars)
    def parseImpl( self, instring, loc, doActions=True ):
        if (loc < len(instring)-1 and
            instring[loc] in self.pair_chars and
            instring[loc+1] == instring[loc]):
            return loc+2, instring[loc:loc+2]
        else:
            raise ParseException(instring, loc, "Not at a pair of characters")
So that:
punc = r"~!##$%^&*_-+=|\?/"
parser = OneOrMore(Word(alphas) | PairOf(punc))
print(parser.parseString("Does ** this match #### %% the parser?"))
Gives:
['Does', '**', 'this', 'match', '##', '##', '%%', 'the', 'parser']
(Note the omission of the trailing single '?')
I do not have the pyparsing module, but Regex must be a class, not a function.
What you can do is subclass from it and override methods as required to customize behaviour, then use your subclasses instead.