pyparsing nestedExpr and double closing characters - python

I am trying to parse nested column type definitions such as
1 string
2 struct<col_1:string,col_2:int>
3 row(col_1 string,array(col_2 string),col_3 boolean)
4 array<struct<col_1:string,col_2:int>,col_3:boolean>
5 array<struct<col_1:string,col2:int>>
Using nestedExpr works as expected for cases 1-4, but throws a parse error on case 5. Adding a space between the double closing brackets, like "> >", seems to work, and might be explained by this quote from the author:
By default, nestedExpr will look for space-delimited words of printables
https://sourceforge.net/p/pyparsing/bugs/107/
I'm mostly looking for alternatives to pre- and post-processing the input string:
type_str = type_str.replace(">", "> ")
# parse string here
type_str = type_str.replace("> ", ">")
I've tried using infix_notation, but I haven't been able to figure out how to use it in this situation. I'm probably just using it the wrong way...
Code snippet
array_keyword = pp.Keyword('array')
row_keyword = pp.Keyword('row')
struct_keyword = pp.Keyword('struct')
nest_open = pp.Word('<([')
nest_close = pp.Word('>)]')
col_name = pp.Word(pp.alphanums + '_')
col_type = pp.Forward()
col_type_delimiter = pp.Word(':') | pp.White(' ')
column = col_name('name') + col_type_delimiter + col_type('type')
col_list = pp.delimitedList(pp.Group(column))
struct_type = pp.nestedExpr(
    opener=struct_keyword + nest_open, closer=nest_close,
    content=col_list | col_type, ignoreExpr=None
)
row_type = pp.locatedExpr(pp.nestedExpr(
    opener=row_keyword + nest_open, closer=nest_close,
    content=col_list | col_type, ignoreExpr=None
))
array_type = pp.nestedExpr(
    opener=array_keyword + nest_open, closer=nest_close,
    content=col_type, ignoreExpr=None
)
col_type <<= struct_type('children') | array_type('children') | row_type('children') | scalar_type('type')

nestedExpr and infixNotation are not really appropriate for this project. nestedExpr is generally a short-cut expression for stuff you don't really want to go into details parsing, you just want to detect and step over some chunk of text that happens to have some nesting in opening and closing punctuation. infixNotation is intended for parsing expressions with unary and binary operators, usually some kind of arithmetic. You might be able to treat the punctuation in your grammar as operators, but it is a stretch, and definitely doing things the hard way.
For your project, you will really need to define the different elements, and it will be a recursive grammar (since the array and struct types will themselves be defined in terms of other types, which could also be arrays or structs).
I took a stab at a BNF, for a subset of your grammar using scalar types int, float, boolean, and string, and compound types array and struct, with just the '<' and '>' nesting punctuation. An array will take a single type argument, to define the type of the elements in the array. A struct will take one or more struct fields, where each field is an identifier:type pair.
scalar_type ::= 'int' | 'float' | 'string' | 'boolean'
array_type ::= 'array' '<' type_defn '>'
struct_type ::= 'struct' '<' struct_element (',' struct_element)... '>'
struct_element ::= identifier ':' type_defn
type_defn ::= scalar_type | array_type | struct_type
(If you later want to add a row definition also, think about what the row is supposed to look like, and how its elements would be defined, and then add it to this BNF.)
You look pretty comfortable with the basics of pyparsing, so I'll just start you off with some intro pieces, and then let you fill in the rest.
# define punctuation
LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword('array')
STRUCT = pp.Keyword('struct')
# create a Forward that will be used in other type expressions
type_defn = pp.Forward()
# here is the array type, you can fill in the other types following this model
# and the definitions in the BNF
array_type = pp.Group(ARRAY + LT + type_defn + GT)
...
# then finally define type_defn in terms of the other type expressions
type_defn <<= scalar_type | array_type | struct_type
Once you have that finished, try it out with some tests:
type_defn.runTests("""\
string
struct<col_1:string,col_2:int>
array<struct<col_1:string,col2:int>>
""", fullDump=False)
And you should get something like:
string
['string']
struct<col_1:string,col_2:int>
['struct', [['col_1', 'string'], ['col_2', 'int']]]
array<struct<col_1:string,col2:int>>
['array', ['struct', [['col_1', 'string'], ['col2', 'int']]]]
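In case you want to check your work, here is one possible completion of the sketch above. The identifier definition and the extra Group around the struct field list are my assumptions, chosen so the results match the output shown:

```python
import pyparsing as pp

# define punctuation and keywords
LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword('array')
STRUCT = pp.Keyword('struct')

scalar_type = pp.oneOf("int float string boolean")

# create a Forward that will be used in the other type expressions
type_defn = pp.Forward()

array_type = pp.Group(ARRAY + LT + type_defn + GT)

# assumed identifier form: letters/digits/underscores, starting with a letter or '_'
ident = pp.Word(pp.alphas + '_', pp.alphanums + '_')
struct_element = pp.Group(ident + COLON + type_defn)
# the inner Group collects all struct fields into one sub-list
struct_type = pp.Group(STRUCT + LT + pp.Group(pp.delimitedList(struct_element)) + GT)

type_defn <<= scalar_type | array_type | struct_type

print(type_defn.parseString("array<struct<col_1:string,col2:int>>").asList())
# -> [['array', ['struct', [['col_1', 'string'], ['col2', 'int']]]]]
```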
Once you have that, you can play around with extending it to other types, such as your row type, maybe unions, or arrays that take multiple types (if that was your intention in your posted example). Always start by updating the BNF - then the changes you'll need to make in the code will generally follow.

Related

Parsing a text file to store into class objects and attributes

Two-part question: how to parse text and save it as class objects/attributes, and the best way to rewrite text from those classes in a specific format.
I want to parse through a text file, extract sections of text, and create class objects with attributes. There will be several classes involved (Polygons, space, zone, system, schedule). In the original file, each "object" and its "attributes" are separated by '..'. An example of one is below.
"Office PSZ" = SYSTEM
TYPE = PSZ
HEAT-SOURCE = FURNACE
FAN-SCHEDULE = "HVAC Yr Schedule"
COOLING-EIR = 0.233207
..
I'd like to read this text and store it in class objects. So "Office PSZ" would be of the HVACsystem or SYSTEM class, haven't decided. 'SYSTEM' would be a class variable. For this instance ("Office PSZ"), self.TYPE would be PSZ, self.HEAT-SOURCE would equal FURNACE, etc.
I want to manipulate these objects based on their attributes. The end result though would be to write all the data that was manipulated back into a text file with the original format. End result for this instance may be.
"Office PSZ" = SYSTEM
TYPE = PSZ
HEAT-SOURCE = ELECTRIC
FAN-SCHEDULE = "Other Schedule"
COOLING-EIR = 0.200
..
Is there a way to print the attribute name/title (I don't know what to call it)? The attribute names (i.e. TYPE, HEAT-SOURCE) come from the original file, and it would be easier not to have to manually anticipate all of the attributes associated with every class.
I suppose I could create an array of all of the values on the left side of "=" and another array for the values on the right and loop through those as I'm writing/formatting a new text file. But I'm not sure if that's a good way to go.
I'm still quite the amateur so I might be overreaching but any suggestions on how I should proceed?
Pyparsing makes it easy to write custom parsers for data like this, and gives back the parsed data in a pyparsing data structure called ParseResults. ParseResults gives you access to your parsed values by position (like a list), by key (like a dict), or, for names that work as Python identifiers, by attribute (like an object).
I've simplified my parsing of your data to pretty much just take every key = value line and build up a structure using the key strings as keys. The '..' lines work great as terminators for each object.
A simple BNF for this might look like:
object ::= attribute+ end
attribute ::= key '=' value
key ::= word composed of letters 'A'..'Z' and '-', starting with 'A'..'Z',
or a quoted string
value ::= value_string | value_number | value_word
value_word ::= a string of non-whitespace characters
value_string ::= a string of any characters in '"' quotes
value_number ::= an integer or float numeric value
end ::= '..'
To implement a pyparsing parser, we work bottom up to define pyparsing sub-expressions.
Then we use Python '+' and '|' operators to assemble lower-level expressions to higher-level
ones:
import pyparsing as pp
END = pp.Suppress("..")
EQ = pp.Suppress('=')
pyparsing includes some predefined expressions for quoted strings and numerics;
the numerics will be automatically converted to ints or floats.
value_number = pp.pyparsing_common.number
value_string = pp.quotedString
value_word = pp.Word(pp.printables)
value = value_string | value_number | value_word
For our attribute key, we will use the two-argument form for Word. The first
argument is a string of allowable leading characters, and the second argument is a
string of allowable body characters. If we just wrote `Word(alphas + '-')`, then
our parser would accept '---' as a legal key.
key = pp.Word(pp.alphas, pp.alphas + '-') | pp.quotedString
An attribute definition is just a key, an '=' sign, and a value
attribute = key + EQ + value
Lastly we will use some of the more complex features of pyparsing. The simplest form
would just be "pp.OneOrMore(attribute) + END", but this would just give us back a
pile of parsed tokens with no structure. The Group class structures the enclosed expressions
so that their results will be returned as a sub-list. We will catch every attribute as
its own sub-list using Group. Dict will apply some naming to the results, using
the text from each key expression as the key for that group. Finally, the whole collection
of attributes will be Group'ed again, this time representing all the attributes for a
single object:
object_defn = pp.Group(pp.Dict(pp.OneOrMore(pp.Group(attribute)))) + END
To use this expression, we'll define our parser as:
parser = pp.OneOrMore(object_defn)
and parse the sample string using:
objs = parser.parseString(sample)
The objs variable we get back will be a pyparsing ParseResults, which will work like
a list of the grouped object attributes. We can view just the parsed attributes as a list
of lists using asList():
for obj in objs:
    print(obj.asList())
[['"Office PSZ"', 'SYSTEM'], ['TYPE', 'PSZ'], ['HEAT-SOURCE', 'FURNACE'],
['FAN-SCHEDULE', '"HVAC Yr Schedule"'], ['COOLING-EIR', 0.233207]]
If we had not used the Dict class, this would have all we would get, but since we
did use Dict, we can also see the attributes as a Python dict:
for obj in objs:
    print(obj.asDict())
{'COOLING-EIR': 0.233207, '"Office PSZ"': 'SYSTEM', 'TYPE': 'PSZ',
'FAN-SCHEDULE': '"HVAC Yr Schedule"', 'HEAT-SOURCE': 'FURNACE'}
We can even access named fields by name, if they work as Python identifiers. In your
sample, "TYPE" is the only legal identifier, so you can see how to print it here. There
is also a dump() method that will give the results in list form, followed by an
indented list of defined key pairs. (I've also shown how you can use list and dict
type access directly on the ParseResults object, without having to convert to list
or dict types):
for obj in objs:
    print(obj[0])
    print(obj['FAN-SCHEDULE'])
    print(obj.TYPE)
    print(obj.dump())
['"Office PSZ"', 'SYSTEM']
"HVAC Yr Schedule"
PSZ
[['"Office PSZ"', 'SYSTEM'], ['TYPE', 'PSZ'], ['HEAT-SOURCE', 'FURNACE'],
['FAN-SCHEDULE', '"HVAC Yr Schedule"'], ['COOLING-EIR', 0.233207]]
- "Office PSZ": 'SYSTEM'
- COOLING-EIR: 0.233207
- FAN-SCHEDULE: '"HVAC Yr Schedule"'
- HEAT-SOURCE: 'FURNACE'
- TYPE: 'PSZ'
Here is the full parser code for you to work from:
import pyparsing as pp

# sample input, taken from the question
sample = """\
"Office PSZ" = SYSTEM
   TYPE = PSZ
   HEAT-SOURCE = FURNACE
   FAN-SCHEDULE = "HVAC Yr Schedule"
   COOLING-EIR = 0.233207
   ..
"""

END = pp.Suppress("..")
EQ = pp.Suppress('=')
value_number = pp.pyparsing_common.number
value_string = pp.quotedString
value_word = pp.Word(pp.printables)
value = value_string | value_number | value_word
key = pp.Word(pp.alphas, pp.alphas + "-") | pp.quotedString
attribute = key + EQ + value
object_defn = pp.Group(pp.Dict(pp.OneOrMore(pp.Group(attribute)))) + END
parser = pp.OneOrMore(object_defn)

objs = parser.parseString(sample)
for obj in objs:
    print(obj.asList())
for obj in objs:
    print(obj.asDict())
for obj in objs:
    print(obj[0])
    print(obj['FAN-SCHEDULE'])
    print(obj.TYPE)
    print(obj.dump())

Pyparsing: dblQuotedString parsing differently in nestedExpr

I'm working on a grammar to parse search queries (not evaluate them, just break them into component pieces). Right now I'm working with nestedExpr, just to grab the different 'levels' of each term, but I seem to have an issue if the first part of a term is in double quotes.
A simple version of the grammar:
QUOTED = QuotedString(quoteChar = '“', endQuoteChar = '”', unquoteResults = False).setParseAction(remove_curlies)
WWORD = Word(alphas8bit + printables.replace("(", "").replace(")", ""))
WORDS = Combine(OneOrMore(dblQuotedString | QUOTED | WWORD), joinString = ' ', adjacent = False)
TERM = OneOrMore(WORDS)
NESTED = OneOrMore(nestedExpr(content = TERM))
query = '(dog* OR boy girl w/3 ("girls n dolls" OR friends OR "best friend" OR (friends w/10 enemies)))'
Calling NESTED.parseString(query) returns:
[['dog* OR boy girl w/3', ['"girls n dolls"', 'OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
The first dblQuotedString instance is split off from the rest of the term at the same nesting level, which doesn't happen with the second dblQuotedString instance, and also doesn't happen if the quoted bit is a QUOTED instance (with curly quotes) rather than a dblQuotedString with straight ones.
Is there something special about dblQuotedString that I'm missing?
NOTE: I know that operatorPrecedence can break up search terms like this, but I have some limits on what can be broken apart, so I'm testing if I can use nestedExpr to work within those limits.
nestedExpr takes an optional keyword argument ignoreExpr, to take an expression that nestedExpr should use to ignore characters that would otherwise be interpreted as nesting openers or closers, and the default is pyparsing's quotedString, which is defined as sglQuotedString | dblQuotedString. This is to handle strings like:
(this has a tricky string "string with )" )
Since the default ignoreExpr is quotedString, the ')' in quotes is not misinterpreted as the closing parenthesis.
However, your content argument also matches on dblQuotedString. The leading quoted string is matched internally by nestedExpr by way of skipping over quoted strings that may contain "()"s, then your content is matched, which also matches quoted strings. You can suppress nestedExpr's ignore expression using a NoMatch:
NESTED = OneOrMore(nestedExpr(content = TERM, ignoreExpr=NoMatch()))
which should now give you:
[['dog* OR boy girl w/3',
['"girls n dolls" OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
You'll find more details and examples at https://pythonhosted.org/pyparsing/pyparsing-module.html#nestedExpr
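To see the fix end to end, here is a minimal, self-contained reproduction (I've dropped alphas8bit, the curly-quote QUOTED expression, and the w/ operators from the original grammar to keep it short):

```python
import pyparsing as pp

# same shape as the question's grammar, minus the pieces noted above
WWORD = pp.Word(pp.printables.replace("(", "").replace(")", ""))
TERM = pp.OneOrMore(pp.Combine(pp.OneOrMore(pp.dblQuotedString | WWORD),
                               joinString=' ', adjacent=False))
# NoMatch() suppresses the default quotedString ignore expression
NESTED = pp.OneOrMore(pp.nestedExpr(content=TERM, ignoreExpr=pp.NoMatch()))

query = '(dog OR ("best friend" OR enemies))'
print(NESTED.parseString(query).asList())
# -> [['dog OR', ['"best friend" OR enemies']]]
```

Note that the leading quoted string in the inner group now stays joined with the rest of its term.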

Parsing text file in python using pyparsing

I am trying to parse the following text using pyparsing.
acp (SOLO1,
"solo-100",
"hi here is the gift"
"Maximum amount of money, goes",
430, 90)
jhk (SOLO2,
"solo-101",
"hi here goes the wind."
"and, they go beyond",
1000, 320)
I have tried the following code but it doesn't work.
flag = Word(alphas+nums+'_'+'-')
enclosed = Forward()
nestedBrackets = nestedExpr('(', ')', content=enclosed)
enclosed << (flag | nestedBrackets)
print list(enclosed.searchString (str1))
The comma(,) within the quotation is producing undesired results.
Well, I might have oversimplified slightly in my comments - here is a more complete
answer.
If you don't really have to deal with nested data items, then a single-level parenthesized
data group in each section will look like this:
LPAR,RPAR = map(Suppress, "()")
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)
# treat consecutive quoted strings as one combined string
quoted_string = OneOrMore(quotedString)
# add parse action to concatenate multiple adjacent quoted strings
quoted_string.setParseAction(lambda t: '"' +
                             ''.join(map(lambda s: s.strip('"\''), t)) +
                             '"' if len(t) > 1 else t[0])
data_item = ident | integer | quoted_string
# section defined with no nesting
section = ident + Group(LPAR + delimitedList(data_item) + RPAR)
I wasn't sure if it was intentional or not when you omitted the comma between
two consecutive quoted strings, so I chose to implement logic like Python's compiler,
in which two quoted strings are treated as just one longer string, that is "AB CD " "EF" is
the same as "AB CD EF". This was done with the definition of quoted_string, and adding
the parse action to quoted_string to concatenate the contents of the 2 or more component
quoted strings.
Finally, we create a parser for the overall group
results = OneOrMore(Group(section)).parseString(source)
results.pprint()
and get from your posted input sample:
[['acp',
['SOLO1',
'"solo-100"',
'"hi here is the giftMaximum amount of money, goes"',
'430',
'90']],
['jhk',
['SOLO2',
'"solo-101"',
'"hi here goes the wind.and, they go beyond"',
'1000',
'320']]]
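Putting it all together, a self-contained version of the single-level parser (using the first group from your posted input as the source string) might look like:

```python
import pyparsing as pp

# first group from the question's sample input
source = '''acp (SOLO1,
    "solo-100",
    "hi here is the gift"
    "Maximum amount of money, goes",
    430, 90)'''

LPAR, RPAR = map(pp.Suppress, "()")
ident = pp.Word(pp.alphas, pp.alphanums + "-_")
integer = pp.Word(pp.nums)

# treat consecutive quoted strings as one combined string
quoted_string = pp.OneOrMore(pp.quotedString)
quoted_string.setParseAction(
    lambda t: '"' + ''.join(s.strip('"\'') for s in t) + '"' if len(t) > 1 else t[0])

data_item = ident | integer | quoted_string
section = ident + pp.Group(LPAR + pp.delimitedList(data_item) + RPAR)

results = pp.OneOrMore(pp.Group(section)).parseString(source)
results.pprint()
```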
If you do have nested parenthetical groups, then your section definition can be
as simple as this:
# section defined with nesting
section = ident + nestedExpr()
Although as you have already found, this will retain the separate commas as if they
were significant tokens instead of just data separators.

Parsing arithmetic expressions with function calls

I am working with pyparsing and found it to be excellent for developing a simple DSL that allows me to extract data fields out of MongoDB and do simple arithmetic operations on them. I am now trying to extend my tools such that I can apply functions of the form Rank[Person:Height] to the fields and potentially include simple expressions as arguments to the function calls. I am struggling hard with getting the parsing syntax to work. Here is what I have so far:
# Define parser
expr = Forward()
integer = Word(nums).setParseAction(EvalConstant)
real = Combine(Word(nums) + "." + Word(nums)).setParseAction(EvalConstant)
# Handle database field references that are coming out of Mongo,
# accounting for the fact that some fields contain whitespace
dbRef = Combine(Word(alphas) + ":" + Word(printables) + \
                Optional(" " + Word(alphas) + " " + Word(alphas)))
dbRef.setParseAction(EvalDBref)
# Handle function calls
functionCall = (Keyword("Rank") | Keyword("ZS") | Keyword("Ntile")) + "[" + expr + "]"
functionCall.setParseAction(EvalFunction)
operand = functionCall | dbRef | (real | integer)
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
# Use parse actions to attach Eval constructors to sub-expressions
expr << operatorPrecedence(operand,
[
(signop, 1, opAssoc.RIGHT, EvalSignOp),
(multop, 2, opAssoc.LEFT, EvalMultOp),
(plusop, 2, opAssoc.LEFT, EvalAddOp),
])
My issue is that when I test a simple expression like Rank[Person:Height] I am getting a parse exception:
ParseException: Expected "]" (at char 19), (line:1, col:20)
If I use a float or arithmetic expression as the argument like Rank[3 + 1.1] the parsing works ok, and if I simplify the dbRef grammar so its just Word(alphas) it also works. Cannot for the life of me figure out whats wrong with my full grammar. I have tried rearranging the order of operands as well as simplifying the functionCall grammar to no avail. Can anyone see what I am doing wrong?
Once I get this working I would want to take a last step and introduce support for variable assignment in expressions ..
EDIT: Upon further testing, if I remove the printables from dbRef grammar, things work ok:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) + \
                Optional("_" + Word(alphas)))
HOWEVER, if I add the character "-" to dbRef (which I need for DB fields like "Class:S-N"), the parser fails again. I think the "-" is being consumed by the signop in my operatorPrecedence?
What appears to happen is that the ']' character at the end of your test string (Rank[Person:Height]) gets consumed as part of the dbRef token, because the portion of that token past the initial ':' is declared as being made of Word(printables) (and this character set, unfortunately, includes the square bracket characters).
The parser then tries to complete a functionCall but is missing the closing ']', hence the error message.
A tentative fix is to use a character set that doesn't include the square brackets, maybe something more explicit like:
dbRef = Combine(Word(alphas) + ":" + Word(alphas, alphas + "-_./") + \
                Optional(" " + Word(alphas) + " " + Word(alphas)))
Edit:
Upon closer look, the above is loosely correct, but the token hierarchy is wrong (e.g., the parser attempts to produce a functionCall as one operand of an expr, etc.).
Also, my suggested fix will not work because of the ambiguity with the '-' sign, which should be understood as a plain character when within a dbRef and as a plusop when within an expr. This type of issue is common with parsers, and there are ways to deal with it, though I'm not sure exactly how with pyparsing.
Found solution - the issue was that my grammar for dbRef was consuming some of the characters that were part of the function specification. New grammar that works correctly:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) + \
                Optional(oneOf("_ -") + Word(alphas)))
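For reference, here is a minimal sketch of the fixed grammar in isolation (the Eval* parse actions and the operatorPrecedence layer from the question are omitted, so this only shows that the closing ']' now survives):

```python
import pyparsing as pp

# the corrected dbRef: the body character set no longer includes ']'
dbRef = pp.Combine(pp.Word(pp.alphas) + pp.OneOrMore(":") + pp.Word(pp.alphanums) +
                   pp.Optional(pp.oneOf("_ -") + pp.Word(pp.alphas)))
functionCall = ((pp.Keyword("Rank") | pp.Keyword("ZS") | pp.Keyword("Ntile"))
                + "[" + dbRef + "]")

print(functionCall.parseString("Rank[Person:Height]").asList())
# -> ['Rank', '[', 'Person:Height', ']']
print(dbRef.parseString("Class:S-N")[0])
# -> Class:S-N
```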

Making some detail explanations in python string

In Java, I can split a string across lines and add a comment explaining each piece:
String x = "a" + // First
"b" + // Second
"c"; // Third
// x = "abc"
How can I do the equivalent in Python?
I can split the string, but I can't attach a comment to each piece like I can in Java.
x = "a" \
"b" \
"c"
I need this feature for explaining regular expression usage.
Pattern p = Pattern.compile("rename_method\\(" + // ignore 'rename_method('
"\"([^\"]*)\"," + // find '"....",'
This
x = ( "a" #foo
"b" #bar
)
will work.
The magic here is done by the parentheses: Python automatically continues lines inside any unterminated brackets ((, [, or {). Note that Python also automatically concatenates string literals placed next to each other (we don't even need the + operator!), which is really convenient.
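For example, the Java Pattern from the question translates directly: adjacent string literals inside the parentheses are concatenated, and each piece can carry its own comment (the sample input below is made up for illustration):

```python
import re

pattern = re.compile(
    r'rename_method\('   # ignore 'rename_method('
    r'"([^"]*)",'        # find '"....",'
)

m = pattern.search('rename_method("old_name", "new_name")')
print(m.group(1))
# -> old_name
```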
If you want to do it specifically for regular expressions, you can do it pretty easily with the re.VERBOSE flag. From the Python docs (scroll down a bit to see the documentation for the VERBOSE flag):
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)
