Python: Invalid Syntax with test data using pyparsing

Using pyparsing, I am trying to create a very simple parser for the S-expression language. I have written a very small grammar.
Here is my code:
from pyparsing import *
alphaword = Word(alphas)
integer = Word(nums)
sexp = Forward()
LPAREN = Suppress("(")
RPAREN = Suppress(")")
sexp << ( alphaword | integer | ( LPAREN + ZeroOrMore(sexp) + RPAREN)
tests = """\
red
100
( red 100 blue )
( green ( ( 1 2 ) mauve ) plaid () ) """.splitlines()
for t in tests:
    print t
    print sexp.parseString(t)
    print
Looking at examples of this kind of code, everything seems fine, but when I run it I get a syntax error pointing at this line:
tests = """\
^
I don't understand it. I would be grateful for any help.

The parentheses on this line are not closed:
sexp << ( alphaword | integer | ( LPAREN + ZeroOrMore(sexp) + RPAREN)
It needs one more closing ).
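With the missing parenthesis added, the line reads:
sexp << ( alphaword | integer | ( LPAREN + ZeroOrMore(sexp) + RPAREN) )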

Related

PLY Lex: ID could be anything

I have the following simple format:
BLOCK ID {
    SUBBLOCK ID {
        SUBSUBBLOCK ID {
            SOME STATEMENTS;
        };
    };
};
I configured PLY to work with this format, but the issue is that ID could be any string, including "BLOCK", "SUBBLOCK", etc.
In the lexer I define ID as:
@TOKEN(r'[a-zA-Z_][a-zA-Z_0-9]*')
def t_ID(self, t):
    t.type = self.keyword_map.get(t.value, "ID")
    return t
But this means that the word BLOCK will not be allowed as a block name.
How can I overcome this issue?
The easiest solution is to create a non-terminal name to be used instead of ID in productions which need a name, such as block : BLOCK name braced_statements:
# Docstring is added later
def p_name(self, p):
    p[0] = p[1]
Then you compute the productions for name and assign them to p_name's docstring by executing this before you generate the parser:
Parser.p_name.__doc__ = '\n| '.join(
    ['name : ID']
    + list(Lexer.keyword_map.values())
)
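For example, assuming the lexer's keyword_map maps keyword strings to their token names (a hypothetical shape based on the question), the join produces a docstring like this:
keyword_map = {'BLOCK': 'BLOCK', 'SUBBLOCK': 'SUBBLOCK', 'SUBSUBBLOCK': 'SUBSUBBLOCK'}  # assumed contents
print('\n| '.join(['name : ID'] + list(keyword_map.values())))
# name : ID
# | BLOCK
# | SUBBLOCK
# | SUBSUBBLOCK
With that docstring in place, every keyword is also accepted wherever a name is expected.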
I probably don't understand your question, but I think something like the following would work (it's been a long time since I messed with PLY, so keep in mind this is pseudocode):
FUNCTION_ARGS = zeroOrMore(lazy('STATEMENT'),sep=',')
FUNCTION_CALL = t_ID + lparen + FUNCTION_ARGS + rparen
STATEMENT= FUNCTION_CALL | t_Literal | t_ID
SUBBLOCK = Literal('SUBBLOCK') + t_ID + lbrace + STATEMENT + rbrace
BLOCK = Literal('BLOCK') + lbrace + oneOrMore(SUBBLOCK) + rbrace

Dealing with ZeroOrMore in pyparsing

I'm trying to parse the output of pactl list with pyparsing. So far most of the parse works correctly, but I cannot make ZeroOrMore behave as expected.
The input contains both foo: and foo: bar lines. I tried to handle that with ZeroOrMore, but it doesn't work; I had to add a special case for "Argument:" to match entries without a value. However, Argument: foo entries (with a value) also exist, so that special case will not work, and I expect other properties may appear without a value as well.
With this definition, and a fixed pactl list output:
#!/usr/bin/env python
#
# parsing pactl list
#
from pyparsing import *
import os
from subprocess import check_output
import sys
data = '''
Module #6
        Argument:
        Name: module-alsa-card
        Usage counter: 0
        Properties:
                module.author = "Lennart Poettering"
                module.description = "ALSA Card"
                module.version = "14.0-rebootstrapped"
'''
indentStack = [1]
stmt = Forward()
identifier = Word(alphanums+"-_.")
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums)))
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1)))))
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop = (prop_name + prop_section)
stmt << ( section | prop | ("Argument:") | value | prop_val )
syntax = OneOrMore(stmt)
parseTree = syntax.parseString(data)
parseTree.pprint()
This gets:
$ ./pactl.py
Module #6
        Argument:
        Name: module-alsa-card
        Usage counter: 0
        Properties:
                module.author = "Lennart Poettering"
                module.description = "ALSA Card"
                module.version = "14.0-rebootstrapped"
[[['Module'], ['6']],
[['Argument:'],
[[['Name'], ['module-alsa-card']]],
[[['Usage counter'], ['0']]],
['Properties:',
[[[['module.author'], ['"Lennart Poettering"']]],
[[['module.description'], ['"ALSA Card"']]],
[[['module.version'], ['"14.0-rebootstrapped"']]]]]]]
So far so good, but when the special case for Argument: is removed it runs into an error, because ZeroOrMore doesn't behave as expected:
#!/usr/bin/env python
#
# parsing pactl list
#
from pyparsing import *
import os
from subprocess import check_output
import sys
data = '''
Module #6
        Argument:
        Name: module-alsa-card
        Usage counter: 0
        Properties:
                module.author = "Lennart Poettering"
                module.description = "ALSA Card"
                module.version = "14.0-rebootstrapped"
'''
indentStack = [1]
stmt = Forward()
identifier = Word(alphanums+"-_.")
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums)))
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1))))).setDebug()
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop = (prop_name + prop_section)
stmt << ( section | prop | value | prop_val )
syntax = OneOrMore(stmt)
parseTree = syntax.parseString(data)
parseTree.pprint()
This results in:
$ ./pactl.py
Module #6
        Argument:
        Name: module-alsa-card
        Usage counter: 0
        Properties:
                module.author = "Lennart Poettering"
                module.description = "ALSA Card"
                module.version = "14.0-rebootstrapped"
Match Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) at loc 19(3,9)
Matched Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) -> [[['Argument'], ['Name']]]
Match Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) at loc 1(2,1)
Exception raised:Expected ":", found '#' (at char 8), (line:2, col:8)
Traceback (most recent call last):
File "/home/alberto/projects/node/pacmd_list_json/./pactl.py", line 55, in <module>
parseTree = syntax.parseString(partial)
File "/usr/local/lib/python3.9/site-packages/pyparsing.py", line 1955, in parseString
raise exc
File "/usr/local/lib/python3.9/site-packages/pyparsing.py", line 6336, in checkUnindent
raise ParseException(s, l, "not an unindent")
pyparsing.ParseException: Expected {{Group:({Group:(W:(ABCD...)) Suppress:("#") Group:(W:(0123...))}) indented block} | {"Properties:" indented block} | Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) | Group:({Group:(W:(ABCD...)) Suppress:("=") Group:(Combine:({{W:(ABCD...) | <SP><TAB>}}...))})}, found ':' (at char 41), (line:4, col:13)
As the setDebug output for the value expression shows, its ZeroOrMore is picking up tokens from the next line: [[['Argument'], ['Name']]].
I tried LineEnd() and other tricks, but none of them works.
Any idea how to make ZeroOrMore stop at a LineEnd(), or otherwise avoid special cases?
NOTE: Real output can be retrieved using:
env = os.environ.copy()
env['LANG'] = 'C'
data = check_output(
    ['pactl', 'list'], universal_newlines=True, env=env)
indentedBlock is not the easiest pyparsing element to work with. But there are a few things that you are doing that are getting in your way.
To debug this, I broke down some of your more complex expressions, used setName() to give them names, and then added .setDebug(), like this:
identifier = Word(alphas, alphanums+"-_.").setName("identifier").setDebug()
This will tell pyparsing to output a message whenever this expression is about to be matched, if it matched successfully, or if not, the exception that was raised.
Match identifier at loc 1(2,1)
Matched identifier -> ['Module']
Match identifier at loc 15(3,5)
Matched identifier -> ['Argument']
Match identifier at loc 15(3,5)
Matched identifier -> ['Argument']
Match identifier at loc 23(3,13)
Exception raised:Expected identifier, found ':' (at char 23), (line:3, col:13)
It looks like these expressions are messing up the indentedBlock matching, by processing whitespace that should be indentation space:
Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))
The " character in the Word and the whitespace lead me to believe you are trying to match quoted strings. I replaced this expression with:
Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString))
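quotedString is one of pyparsing's built-in expressions; as a quick check (a small illustration, not part of the original answer), it matches single- or double-quoted strings and keeps the quotes in the result:
from pyparsing import quotedString
print(quotedString.parseString('"Lennart Poettering"'))
# -> ['"Lennart Poettering"']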
You also need to take care not to read past the end of the line, or you'll also mess up the indentedBlock indentation tracking. I added this expression for a newline at the top:
NL = LineEnd()
and then used it as the stopOn argument to OneOrMore and ZeroOrMore:
prop_val_value = Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString(), stopOn=NL)).setName("prop_val_value")#.setDebug()
prop_val = Group(identifier + Suppress("=") + Group(prop_val_value)).setName("prop_val")#.setDebug()
Here is the parser I ended up with:
indentStack = [1]
stmt = Forward()
NL = LineEnd()
identifier = Word(alphas, alphanums+"-_.").setName("identifier").setDebug()
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums))).setName("sect_def")#.setDebug()
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
#~ value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1))))).setDebug()
value_label = originalTextFor(OneOrMore(identifier)).setName("value_label")#.setDebug()
value = Group(value_label
              + Suppress(":")
              + Optional(~NL + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_.') | quotedString(), stopOn=NL))))).setName("value")#.setDebug()
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
#~ prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop_val_value = Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString(), stopOn=NL)).setName("prop_val_value")#.setDebug()
prop_val = Group(identifier + Suppress("=") + Group(prop_val_value)).setName("prop_val")#.setDebug()
prop = (prop_name + prop_section).setName("prop")#.setDebug()
stmt << ( section | prop | value | prop_val )
Which gives this:
[[['Module'], ['6']],
[[['Argument']],
[['Name', ['module-alsa-card']]],
[['Usage counter', ['0']]],
['Properties:',
[[['module.author', ['"Lennart Poettering"']]],
[['module.description', ['"ALSA Card"']]],
[['module.version', ['"14.0-rebootstrapped"']]]]]]]
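To reproduce this result, the tail of the script is the same as in the question (a sketch, assuming the same indented data string):
syntax = OneOrMore(stmt)
parseTree = syntax.parseString(data)
parseTree.pprint()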

How to parse parenthetical trees in python?

I need help developing this algorithm that I'm working on. I have an input of a tree in the following format:
( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )
This corresponds to the following tree:
            Root
             |
      ________________
     AB               CD
     |                |
 __________      ___________
ABC      CBA    CDE       FGH
What the algorithm is supposed to do is read in the parenthetical format and produce the following output:
Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH
It lists the root and its children, and every other parent that has children.
I am not sure how to start on this. Can someone give me a hint, or some references or links?
Solution: use the Tree class from the nltk module
(aka the Natural Language Toolkit)
Doing the actual parsing
This is your input:
input = '( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )'
And you parse it very simply by doing:
from nltk import Tree
t = Tree.fromstring(input)
Playing with the parsed tree
>>> t.label()
'Root'
>>> len(t)
2
>>> t[0]
Tree('AB', [Tree('ABC', []), Tree('CBA', [])])
>>> t[1]
Tree('CD', [Tree('CDE', []), Tree('FGH', [])])
>>> t[0][0]
Tree('ABC', [])
>>> t[0][1]
Tree('CBA', [])
>>> t[1][0]
Tree('CDE', [])
>>> t[1][1]
Tree('FGH', [])
As you can see, you can treat each node as a list of subtrees.
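For instance, iterating over a node yields its direct subtrees (a small illustration using the tree parsed above):
>>> for child in t:
...     print(child.label())
...
AB
CD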
To pretty-print the tree
>>> t.pretty_print()
             Root
      ________|________
     AB                CD
  ____|____         ____|____
ABC       CBA     CDE       FGH
 |         |       |         |
...       ...     ...       ...
To obtain the output you want
from sys import stdout

def showtree(t):
    if (len(t) == 0):
        return
    stdout.write(t.label() + ' ->')
    for c in t:
        stdout.write(' ' + c.label())
    stdout.write('\n')
    for c in t:
        showtree(c)
Usage:
>>> showtree(t)
Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH
To install the module
pip install nltk
(Use sudo if required)
A recursive descent parser is a simple form of parser that can handle many grammars. While the entire theory of parsing is too large for a Stack Overflow answer, the most common approach involves two steps: first, tokenisation, which extracts the subwords of your string (here, words like 'Root' and 'ABC', or brackets like '(' and ')'), and then parsing using recursive functions.
This code parses input (like your example), producing a so-called parse tree, and also has a function 'show_children' which takes the parse tree, and produces the children view of the expression as your question asked.
import re

class ParseError(Exception):
    pass

# Tokenize a string.
# Tokens yielded are of the form (type, string)
# Possible values for 'type' are '(', ')' and 'WORD'
def tokenize(s):
    toks = re.compile(' +|[A-Za-z]+|[()]')
    for match in toks.finditer(s):
        s = match.group(0)
        if s[0] == ' ':
            continue
        if s[0] in '()':
            yield (s, s)
        else:
            yield ('WORD', s)

# Parse once we're inside an opening bracket.
def parse_inner(toks):
    ty, name = next(toks)
    if ty != 'WORD': raise ParseError
    children = []
    while True:
        ty, s = next(toks)
        if ty == '(':
            children.append(parse_inner(toks))
        elif ty == ')':
            return (name, children)

# Parse this grammar:
# ROOT ::= '(' INNER
# INNER ::= WORD ROOT* ')'
# WORD ::= [A-Za-z]+
def parse_root(toks):
    ty, _ = next(toks)
    if ty != '(': raise ParseError
    return parse_inner(toks)

def show_children(tree):
    name, children = tree
    if not children: return
    print '%s -> %s' % (name, ' '.join(child[0] for child in children))
    for child in children:
        show_children(child)

example = '( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )'
show_children(parse_root(tokenize(example)))
Try this :
def toTree(expression):
    tree = dict()
    msg = ""
    stack = list()
    for char in expression:
        if char == '(':
            stack.append(msg)
            msg = ""
        elif char == ')':
            parent = stack.pop()
            if parent not in tree:
                tree[parent] = list()
            tree[parent].append(msg)
            msg = parent
        else:
            msg += char
    return tree

expression = "(Root(AB(ABC)(CBA))(CD(CDE)(FGH)))"
print toTree(expression)
It returns a dictionary, where the root can be accessed with the key ''. You can then do a simple BFS to print the output.
OUTPUT :
{
'' : ['Root'],
'AB' : ['ABC', 'CBA'],
'Root': ['AB', 'CD'],
'CD' : ['CDE', 'FGH']
}
You will have to eliminate all the whitespace in the expression before you start, or ignore the irrelevant characters by adding the following as the very first thing in the for loop:
if char == <IRRELEVANT CHARACTER>:
    continue
The above code will run in O(n) time, where n is the length of the expression.
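A minimal sketch of the BFS printing mentioned above, assuming the dictionary layout shown in the OUTPUT block (the helper name printTreeBFS is made up for illustration):
from collections import deque

def printTreeBFS(tree):
    queue = deque(tree[''])  # the root is stored under the '' key
    while queue:
        node = queue.popleft()
        if node in tree:  # only parents with children appear as keys
            print('%s -> %s' % (node, ' '.join(tree[node])))
            queue.extend(tree[node])

printTreeBFS(toTree("(Root(AB(ABC)(CBA))(CD(CDE)(FGH)))"))
# Root -> AB CD
# AB -> ABC CBA
# CD -> CDE FGH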
EDIT
Here is the printing function :
def printTree(tree, node):
    if node not in tree:
        return
    print '%s -> %s' % (node, ' '.join(child for child in tree[node]))
    for child in tree[node]:
        printTree(tree, child)
The desired output can be achieved by the following:
expression = "(Root(AB(ABC)(CBA))(CD(CDE)(FGH)))"
tree = toTree(expression)
printTree(tree, tree[''][0])
Output
Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH
EDIT
If the node names are not unique, we just have to give the nodes new, unique names. This can be done using:
def parseExpression(expression):
    nodeMap = dict()
    counter = 1
    node = ""
    retExp = ""
    for char in expression:
        if char == '(' or char == ')':
            if len(node) > 0:
                nodeMap[str(counter)] = node
                retExp += str(counter)
                counter += 1
            retExp += char
            node = ""
        elif char == ' ':
            continue
        else:
            node += char
    return retExp, nodeMap
The printing function will now change to:
def printTree(tree, node, nodeMap):
    if node not in tree:
        return
    print '%s -> %s' % (nodeMap[node], ' '.join(nodeMap[child] for child in tree[node]))
    for child in tree[node]:
        printTree(tree, child, nodeMap)
The output can be obtained by using :
expression = " ( Root( SQ ( VBZ ) ( NP ( DT ) ( NN ) ) ( VP ( VB ) ( NP ( NN ) ) ) ))"
expression, nodeMap = parseExpression(expression)
tree = toTree(expression)
printTree(tree, tree[''][0], nodeMap)
Output :
Root -> SQ
SQ -> VBZ NP VP
NP -> DT NN
VP -> VB NP
NP -> NN
I think the most popular solution for parsing in Python is PyParsing. PyParsing comes with a grammar for parsing S-expressions, and you should be able to just use it. It is discussed in this Stack Overflow answer:
Parsing S-Expressions in Python
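If all you need are simple symbol/number S-expressions like the tree above, a minimal pyparsing sketch is enough (essentially the corrected grammar from the first question on this page; the exact result formatting is indicative):
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, alphanums, alphas, nums

LPAREN, RPAREN = map(Suppress, "()")
sexp = Forward()
atom = Word(alphas, alphanums) | Word(nums)
sexp <<= atom | Group(LPAREN + ZeroOrMore(sexp) + RPAREN)

print(sexp.parseString("( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )"))
# -> [['Root', ['AB', ['ABC'], ['CBA']], ['CD', ['CDE'], ['FGH']]]]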

pyparsing ParseException: Expected end of line -- general questions

I am a newbie to pyparsing and have been reading the examples, looking here and trying some things out.
I created a grammar and provided a buffer. I do however have a heavy background in lex/yacc from the old days.
I have a general question or two.
I'm currently seeing
ParseException: Expected end of line (at char 7024), (line 213, col:2)
and then it terminates
Because of the nature of my buffer, newlines have meaning, so I did:
ParserElement.setDefaultWhitespaceChars('') # <-- zero len string
Does this error mean that somewhere in my productions I have a rule that is looking for a LineEnd(), and that rule happens to somehow be 'last'?
The location where it is dying is the end of the file. I tried using parseFile, but my file contains chars > ord(127), so instead I am loading it into memory, filtering out all chars > ord(127), and then calling parseString.
I tried turning on verbose_stacktrace=True for some of the elements of my grammar where I thought the problem originated.
Is there a better way to track down the exact ParserElement it is trying to recognize when an error like this occurs? Or can I get a stack of the most recently recognized productions?
I didn't realize I could edit up here...
My crash is this:
[centos#new-host /tmp/sample]$ ./zooparser.py
!(zooparser.py) TEST test1: valid message type START
Ready to roll
Parsing This message: ( ignore leading>>> and trailing <<< ) >>>
ZOO/STATUS/FOOD ALLOCATION//
TOPIC/BIRD FEED IS RUNNING LOW//
FREE/WE HAVE DISCOVERED MOTHS INFESTED THE BIRDSEED AND IT IS NO
LONGER USABLE.//
<<<
Match {Group:({Group:({Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!##$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!##$...)}}) "//"}) Group:({LineEnd "TOPIC" {Group:({[LineEnd]... Group:({"/" {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)}) | Group:({{{"ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ'"}... | Group:({{"0123456789"}... ":"})} {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)})}}) | "-"}})})}... [LineEnd]... "//"})}) [Group:({LineEnd "FREE" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!##$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!##$...)}}) "//"})]...}) [LineEnd]... StringEnd} at loc 0(1,1)
Match Group:({Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!##$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!##$...)}}) "//"}) Group:({LineEnd "TOPIC" {Group:({[LineEnd]... Group:({"/" {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)}) | Group:({{{"ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ'"}... | Group:({{"0123456789"}... ":"})} {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)})}}) | "-"}})})}... [LineEnd]... "//"})}) at loc 0(1,1)
Match Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!##$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!##$...)}}) "//"}) at loc 0(1,1)
Exception raised:None
Exception raised:None
Exception raised:None
Traceback (most recent call last):
File "./zooparser.py", line 319, in <module>
test1(pgm)
File "./zooparser.py", line 309, in test1
test(pgm, zooMsg, 'test1: valid message type' )
File "./zooparser.py", line 274, in test
tokens = zg.getTokensFromBuffer(fileName)
File "./zooparser.py", line 219, in getTokensFromBuffer
tokens = self.text.parseString(filteredBuffer,parseAll=True)
File "/usr/local/lib/python2.7/site-packages/pyparsing-1.5.7-py2.7.egg/pyparsing.py", line 1006, in parseString
raise exc
pyparsing.ParseException: Expected end of line (at char 148), (line:8, col:2)
[centos#new-host /tmp/sample]$
source: see http://prj1.y23.org/zoo.zip
pyparsing takes a different view toward parsing than lex/yacc does. You have to let the classes do some of the work. Here's an example in your code:
self.columnHeader = OneOrMore(self.aucc) \
                    | OneOrMore(nums) \
                    | OneOrMore(self.blankCharacter) \
                    | OneOrMore(self.specialCharacter)
You are equating OneOrMore with the '+' character of a regex. In pyparsing, this is true for ParseElements, but at the character level, pyparsing uses the Word class:
self.columnHeader = Word(self.aucc + nums + self.blankCharacter + self.specialCharacter)
OneOrMore works with ParseElements, not characters. Look at:
OneOrMore(nums)
nums is the string "0123456789", so OneOrMore(nums) will match "0123456789", "01234567890123456789", etc., but not "123". That is what Word is for. OneOrMore will accept a string argument, but will implicitly convert it to a Literal.
This is a fundamental difference between using pyparsing and lex/yacc, and I think is the source of much of the complexity in your code.
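A quick illustration of the difference (a small example, not from the original answer):
from pyparsing import OneOrMore, Word, nums

print(Word(nums).parseString("123"))              # -> ['123']
print(OneOrMore(nums).parseString("0123456789"))  # -> ['0123456789'], the whole literal
# OneOrMore(nums).parseString("123") raises a ParseException (Expected "0123456789")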
Some other suggestions:
Your code has some premature optimizations in it - you write:
aucc = ''.join(set([alphas.upper(),"'"]))
Assuming that this will be used for defining Words, just do:
aucc = alphas.upper() + "'"
There is no harm in having duplicate characters in aucc, Word will convert this to a set internally.
Write a BNF for what you want to parse. It does not have to be as rigorous as you would need with lex/yacc. From your samples, it looks something like:
# sample
ZOO/STATUS/FOOD ALLOCATION//
TOPIC/BIRD FEED IS RUNNING LOW//
FREE/WE HAVE DISCOVERED MOTHS INFESTED THE BIRDSEED AND IT IS NO
LONGER USABLE.//
parser :: header topicEntry+
header :: "ZOO" sep namedValue
namedValue :: uppercaseWord sep valueBody
valueBody :: (everything up to //)
topicEntry :: topicHeader topicBody
topicHeader :: "TOPIC" sep valuebody
topicBody :: freeText
freeText :: "FREE" sep valuebody
sep :: "/"
Converting to pyparsing, this looks something like:
SEP = Literal("/")
BODY_TERMINATOR = Literal("//")
FREE_,TOPIC_,ZOO_ = map(Keyword,"FREE TOPIC ZOO".split())
uppercaseWord = Word(alphas.upper())
valueBody = SkipTo(BODY_TERMINATOR) # adjust later, but okay for now...
freeText = FREE_ + SEP + valueBody
topicBody = freeText
topicHeader = TOPIC_ + SEP + valueBody
topicEntry = topicHeader + topicBody
namedValue = uppercaseWord + SEP + valueBody
zooHeader = ZOO_ + SEP + namedValue
parser = zooHeader + OneOrMore(topicEntry)
(valueBody will have to get more elaborate when you add support for '://' embedded within a value, but save that for Round 2.)
Don't make things super complicated until you get at least some simple stuff working.

decompress name

What is the easiest way to decompress a data name?
For example, change compressed form:
abc[3:0]
into decompressed form:
abc[3]
abc[2]
abc[1]
abc[0]
preferably a one-liner :)
In Perl:
#!perl -w
use strict;
use 5.010;
my @abc = qw/ a b c d /;
say join( " ", reverse @abc[0..3] );
Or if you wanted them into separate variables:
my( $abc3, $abc2, $abc1, $abc0 ) = reverse @abc[0..3];
Edit: Per your clarification:
my $str = "abc[3:0]";
$str =~ /(abc)\[(\d+):(\d+)\]/;
my $base = $1;
my $from = ( $2 < $3 ? $2 : $3 );
my $to = ( $2 > $3 ? $2 : $3 );
my @strs;
foreach my $num ( $from .. $to ) {
    push @strs, $base . '[' . $num . ']';
}
This is a little pyparsing exercise that I've done in the past, adapted to your example (also supports multiple ranges and unpaired indexes, all separated by commas - see the last test case):
from pyparsing import (Suppress, Word, alphas, alphanums, nums, delimitedList,
                       Combine, Optional, Group)
LBRACK,RBRACK,COLON = map(Suppress,"[]:")
ident = Word(alphas+"_", alphanums+"_")
integer = Combine(Optional('-') + Word(nums))
integer.setParseAction(lambda t : int(t[0]))
intrange = Group(integer + COLON + integer)
rangedIdent = ident("name") + LBRACK + delimitedList(intrange|integer)("indexes") + RBRACK
def expandIndexes(t):
    ret = []
    for ind in t.indexes:
        if isinstance(ind, int):
            ret.append("%s[%d]" % (t.name, ind))
        else:
            offset = (-1, 1)[ind[0] < ind[1]]
            ret.extend(
                "%s[%d]" % (t.name, i) for i in range(ind[0], ind[1]+offset, offset)
            )
    return ret
rangedIdent.setParseAction(expandIndexes)
print rangedIdent.parseString("abc[0:3]")
print rangedIdent.parseString("abc[3:0]")
print rangedIdent.parseString("abc[0:3,7,14:16,24:20]")
Prints:
['abc[0]', 'abc[1]', 'abc[2]', 'abc[3]']
['abc[3]', 'abc[2]', 'abc[1]', 'abc[0]']
['abc[0]', 'abc[1]', 'abc[2]', 'abc[3]', 'abc[7]', 'abc[14]', 'abc[15]', 'abc[16]', 'abc[24]', 'abc[23]', 'abc[22]', 'abc[21]', 'abc[20]']
