How to parse parenthetical trees in python?

I need help developing an algorithm that I'm working on. I have an input tree in the following format:
( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )
This represents the following tree:
              Root
               |
        _______________
       AB             CD
       |               |
   _________       _________
  ABC     CBA     CDE     FGH
The algorithm is supposed to read the parenthetical format and produce the following output:
Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH
It lists the root with its children, followed by every other parent that has children.
I don't know how to get started on this. Can someone give me a hint, or point me to some references or links?

Solution: the Tree class from module nltk
(aka Natural Language Toolkit)
Making the actual parsing
This is your input:
input = '( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )'
And you parse it very simply by doing:
from nltk import Tree
t = Tree.fromstring(input)
Playing with the parsed tree
>>> t.label()
'Root'
>>> len(t)
2
>>> t[0]
Tree('AB', [Tree('ABC', []), Tree('CBA', [])])
>>> t[1]
Tree('CD', [Tree('CDE', []), Tree('FGH', [])])
>>> t[0][0]
Tree('ABC', [])
>>> t[0][1]
Tree('CBA', [])
>>> t[1][0]
Tree('CDE', [])
>>> t[1][1]
Tree('FGH', [])
As you can see, you can treat each node as a list of subtrees.
To pretty-print the tree
>>> t.pretty_print()
            Root
      _______|________
     AB               CD
  ___|___          ___|___
ABC     CBA      CDE     FGH
 |       |        |       |
...     ...      ...     ...
To obtain the output you want
from sys import stdout
def showtree(t):
    if len(t) == 0:
        return
    stdout.write(t.label() + ' ->')
    for c in t:
        stdout.write(' ' + c.label())
    stdout.write('\n')
    for c in t:
        showtree(c)
Usage:
>>> showtree(t)
Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH
To install the module
pip install nltk
(Use sudo if required)

A recursive descent parser is a simple form of parser that can parse many grammars. While the entire theory of parsing is too large for a stack-overflow answer, the most common approach to parsing involves two steps: first, tokenisation, which extracts subwords of your string (here, probably words like 'Root', and 'ABC', or brackets like '(' and ')'), and then parsing using recursive functions.
This code parses input (like your example), producing a so-called parse tree, and also has a function 'show_children' which takes the parse tree, and produces the children view of the expression as your question asked.
import re
class ParseError(Exception):
    pass
# Tokenize a string.
# Tokens yielded are of the form (type, string)
# Possible values for 'type' are '(', ')' and 'WORD'
def tokenize(s):
    toks = re.compile(' +|[A-Za-z]+|[()]')
    for match in toks.finditer(s):
        tok = match.group(0)
        if tok[0] == ' ':
            continue
        if tok[0] in '()':
            yield (tok, tok)
        else:
            yield ('WORD', tok)
# Parse once we're inside an opening bracket.
def parse_inner(toks):
    ty, name = next(toks)
    if ty != 'WORD': raise ParseError
    children = []
    while True:
        ty, s = next(toks)
        if ty == '(':
            children.append(parse_inner(toks))
        elif ty == ')':
            return (name, children)
        else:
            # A second WORD inside one pair of brackets is not in the grammar.
            raise ParseError
# Parse this grammar:
# ROOT ::= '(' INNER
# INNER ::= WORD ROOT* ')'
# WORD ::= [A-Za-z]+
def parse_root(toks):
    ty, _ = next(toks)
    if ty != '(': raise ParseError
    return parse_inner(toks)
def show_children(tree):
    name, children = tree
    if not children: return
    # print() with a single argument works on both Python 2 and Python 3.
    print('%s -> %s' % (name, ' '.join(child[0] for child in children)))
    for child in children:
        show_children(child)
example = '( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )'
show_children(parse_root(tokenize(example)))
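To make the two stages described above easier to see in isolation, here is a compact sketch that works on a plain token list instead of a generator. It is only an illustration, not a replacement for the code above: the names `tokens` and `parse` are hypothetical, it assumes node names consist of ASCII letters only, and it assumes the input is complete (a truncated input would raise IndexError).

```python
import re

def tokens(s):
    # Stage 1: split the input into '(', ')' and word tokens.
    # Whitespace matches neither alternative, so it is skipped.
    return re.findall(r'[()]|[A-Za-z]+', s)

def parse(toks, pos=0):
    # Stage 2: recursive descent over the token list.
    # Returns ((name, children), position_after_this_node).
    if toks[pos] != '(':
        raise ValueError('expected "(" at token %d' % pos)
    name = toks[pos + 1]
    if name in ('(', ')'):
        raise ValueError('expected a name at token %d' % (pos + 1))
    pos += 2
    children = []
    while toks[pos] == '(':
        child, pos = parse(toks, pos)
        children.append(child)
    if toks[pos] != ')':
        raise ValueError('expected ")" at token %d' % pos)
    return (name, children), pos + 1

toks = tokens('( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )')
tree, end = parse(toks)
print(toks[:4])
print(tree)
```

Keeping the position explicit makes it easy to check that the whole input was consumed, by comparing `end` with `len(toks)`.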

Try this :
def toTree(expression):
    tree = dict()
    msg = ""
    stack = list()
    for char in expression:
        if char == '(':
            # Remember the current name as the parent of what follows.
            stack.append(msg)
            msg = ""
        elif char == ')':
            # The name just read is a child of the name on top of the stack.
            parent = stack.pop()
            if parent not in tree:
                tree[parent] = list()
            tree[parent].append(msg)
            msg = parent
        else:
            msg += char
    return tree
expression = "(Root(AB(ABC)(CBA))(CD(CDE)(FGH)))"
print(toTree(expression))
It returns a dictionary, where the root can be accessed with the key ''. You can then do a simple BFS to print the output.
OUTPUT :
{
'' : ['Root'],
'AB' : ['ABC', 'CBA'],
'Root': ['AB', 'CD'],
'CD' : ['CDE', 'FGH']
}
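The breadth-first traversal mentioned above could look roughly like the following sketch. The function name `bfs_lines` is hypothetical, the dictionary is written out by hand so the example runs on its own, and, like the dictionary itself, it assumes node names are unique.

```python
from collections import deque

def bfs_lines(tree, root):
    """Return 'parent -> children' lines in breadth-first order."""
    lines = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = tree.get(node)
        if not children:
            continue  # leaves have no entry in the dictionary
        lines.append('%s -> %s' % (node, ' '.join(children)))
        queue.extend(children)
    return lines

# Same shape as the dictionary shown above.
tree = {
    '': ['Root'],
    'AB': ['ABC', 'CBA'],
    'Root': ['AB', 'CD'],
    'CD': ['CDE', 'FGH'],
}
for line in bfs_lines(tree, tree[''][0]):
    print(line)
```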
You will have to remove all whitespace from the expression before you start, or skip the irrelevant characters by adding the following as the very first lines of the for loop:
if char == <IRRELEVANT CHARACTER>:
    continue
The above code will run in O(n) time, where n is the length of the expression.
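As a small sketch of the whitespace clean-up step, one option is to strip the whitespace in a separate pass before parsing. The helper name `strip_whitespace` is hypothetical, and the approach assumes node names never contain spaces themselves.

```python
def strip_whitespace(expression):
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs and newlines alike.
    return ''.join(expression.split())

print(strip_whitespace('( Root ( AB ( ABC ) ( CBA ) ) )'))
```

This extra pass is also linear in the length of the expression, so the overall running time stays O(n).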
EDIT
Here is the printing function :
def printTree(tree, node):
    if node not in tree:
        return
    print('%s -> %s' % (node, ' '.join(child for child in tree[node])))
    for child in tree[node]:
        printTree(tree, child)
The desired Output can be achieved by the following :
expression = "(Root(AB(ABC)(CBA))(CD(CDE)(FGH)))"
tree = toTree(expression)
printTree(tree, tree[''][0])
Output
Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH
EDIT
Assuming the node names are not unique, we just have to give new names to the nodes. This can be done using :
def parseExpression(expression):
    nodeMap = dict()
    counter = 1
    node = ""
    retExp = ""
    for char in expression:
        if char == '(' or char == ')':
            if len(node) > 0:
                # Replace the name with a unique number and remember the mapping.
                nodeMap[str(counter)] = node
                retExp += str(counter)
                counter += 1
            retExp += char
            node = ""
        elif char == ' ':
            continue
        else:
            node += char
    return retExp, nodeMap
The print Function will now change to :
def printTree(tree, node, nodeMap):
    if node not in tree:
        return
    print('%s -> %s' % (nodeMap[node], ' '.join(nodeMap[child] for child in tree[node])))
    for child in tree[node]:
        printTree(tree, child, nodeMap)
The output can be obtained by using :
expression = " ( Root( SQ ( VBZ ) ( NP ( DT ) ( NN ) ) ( VP ( VB ) ( NP ( NN ) ) ) ))"
expression, nodeMap = parseExpression(expression)
tree = toTree(expression)
printTree(tree, tree[''][0], nodeMap)
Output :
Root -> SQ
SQ -> VBZ NP VP
NP -> DT NN
VP -> VB NP
NP -> NN
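An alternative to renaming the nodes, sketched here under the assumption that the input is well-formed (balanced brackets, names directly after each opening bracket), is to build nested nodes instead of a dictionary keyed by name. Duplicate names then need no special handling. The names `parse_nested` and `show` are hypothetical and not part of the code above.

```python
def parse_nested(expression):
    """Parse '(A (B) (C))' into nested [name, children] lists."""
    holder = ['', []]            # sentinel that collects the top-level node
    stack = [holder]
    for char in expression:
        if char == '(':
            node = ['', []]
            stack[-1][1].append(node)   # attach to the enclosing node
            stack.append(node)
        elif char == ')':
            stack.pop()
        elif not char.isspace():
            stack[-1][0] += char        # extend the current node's name
    return holder[1][0]

def show(node, out):
    """Append 'parent -> children' lines to out, depth first."""
    name, children = node
    if not children:
        return
    out.append('%s -> %s' % (name, ' '.join(child[0] for child in children)))
    for child in children:
        show(child, out)

lines = []
show(parse_nested(" ( Root( SQ ( VBZ ) ( NP ( DT ) ( NN ) ) ( VP ( VB ) ( NP ( NN ) ) ) ))"), lines)
print('\n'.join(lines))
```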

I think the most popular solution for parsing in Python is PyParsing. PyParsing comes with a grammar for parsing S-expressions and you should be able to just use it. Discussed in this StackOverflow answer:
Parsing S-Expressions in Python

Related

Dealing with ZeroOrMore in pyparsing

I'm trying to parse the output of pactl list with pyparsing. So far everything parses correctly, except that I cannot get ZeroOrMore to work as I expect.
A line can be either foo: or foo: bar, and I tried to handle both with ZeroOrMore, but it doesn't work. I have to add a special case for "Argument:" to match entries without a value, but Argument: foo entries (with a value) also exist, so the special case will not work in general, and I expect other properties to appear without a value as well.
With this definition, and a fixed pactl list output:
#!/usr/bin/env python
#
# parsing pactl list
#
from pyparsing import *
import os
from subprocess import check_output
import sys
data = '''
Module #6
    Argument:
    Name: module-alsa-card
    Usage counter: 0
    Properties:
        module.author = "Lennart Poettering"
        module.description = "ALSA Card"
        module.version = "14.0-rebootstrapped"
'''
indentStack = [1]
stmt = Forward()
identifier = Word(alphanums+"-_.")
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums)))
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1)))))
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop = (prop_name + prop_section)
stmt << ( section | prop | ("Argument:") | value | prop_val )
syntax = OneOrMore(stmt)
parseTree = syntax.parseString(data)
parseTree.pprint()
This gets:
$ ./pactl.py
Module #6
Argument:
Name: module-alsa-card
Usage counter: 0
Properties:
module.author = "Lennart Poettering"
module.description = "ALSA Card"
module.version = "14.0-rebootstrapped"
[[['Module'], ['6']],
[['Argument:'],
[[['Name'], ['module-alsa-card']]],
[[['Usage counter'], ['0']]],
['Properties:',
[[[['module.author'], ['"Lennart Poettering"']]],
[[['module.description'], ['"ALSA Card"']]],
[[['module.version'], ['"14.0-rebootstrapped"']]]]]]]
So far so good, but after removing the special case for Argument: parsing fails, because ZeroOrMore doesn't behave as I expected:
#!/usr/bin/env python
#
# parsing pactl list
#
from pyparsing import *
import os
from subprocess import check_output
import sys
data = '''
Module #6
    Argument:
    Name: module-alsa-card
    Usage counter: 0
    Properties:
        module.author = "Lennart Poettering"
        module.description = "ALSA Card"
        module.version = "14.0-rebootstrapped"
'''
indentStack = [1]
stmt = Forward()
identifier = Word(alphanums+"-_.")
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums)))
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1))))).setDebug()
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop = (prop_name + prop_section)
stmt << ( section | prop | value | prop_val )
syntax = OneOrMore(stmt)
parseTree = syntax.parseString(data)
parseTree.pprint()
This results in:
$ ./pactl.py
Module #6
Argument:
Name: module-alsa-card
Usage counter: 0
Properties:
module.author = "Lennart Poettering"
module.description = "ALSA Card"
module.version = "14.0-rebootstrapped"
Match Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) at loc 19(3,9)
Matched Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) -> [[['Argument'], ['Name']]]
Match Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) at loc 1(2,1)
Exception raised:Expected ":", found '#' (at char 8), (line:2, col:8)
Traceback (most recent call last):
File "/home/alberto/projects/node/pacmd_list_json/./pactl.py", line 55, in <module>
parseTree = syntax.parseString(partial)
File "/usr/local/lib/python3.9/site-packages/pyparsing.py", line 1955, in parseString
raise exc
File "/usr/local/lib/python3.9/site-packages/pyparsing.py", line 6336, in checkUnindent
raise ParseException(s, l, "not an unindent")
pyparsing.ParseException: Expected {{Group:({Group:(W:(ABCD...)) Suppress:("#") Group:(W:(0123...))}) indented block} | {"Properties:" indented block} | Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) | Group:({Group:(W:(ABCD...)) Suppress:("=") Group:(Combine:({{W:(ABCD...) | <SP><TAB>}}...))})}, found ':' (at char 41), (line:4, col:13)
The setDebug output for the value expression shows that ZeroOrMore is consuming tokens from the next line: [[['Argument'], ['Name']]]
I tried LineEnd() and other tricks, but none of them works.
Any idea how to make ZeroOrMore stop at LineEnd(), or otherwise avoid special cases?
NOTE: Real output can be retrieved using:
env = os.environ.copy()
env['LANG'] = 'C'
data = check_output(
['pactl', 'list'], universal_newlines=True, env=env)

indentedBlock is not the easiest pyparsing element to work with. But there are a few things that you are doing that are getting in your way.
To debug this, I broke down some of your more complex expressions, used setName() to give them names, and then added .setDebug(). Like this:
identifier = Word(alphas, alphanums+"-_.").setName("identifier").setDebug()
This will tell pyparsing to output a message whenever this expression is about to be matched, if it matched successfully, or if not, the exception that was raised.
Match identifier at loc 1(2,1)
Matched identifier -> ['Module']
Match identifier at loc 15(3,5)
Matched identifier -> ['Argument']
Match identifier at loc 15(3,5)
Matched identifier -> ['Argument']
Match identifier at loc 23(3,13)
Exception raised:Expected identifier, found ':' (at char 23), (line:3, col:13)
It looks like these expressions are messing up the indentedBlock matching, by processing whitespace that should be indentation space:
Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))
The " character in the Word and the whitespace lead me to believe you are trying to match quoted strings. I replaced this expression with:
Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString))
You also need to take care not to read past the end of the line, or you'll also mess up the indentedBlock indentation tracking. I added this expression for a newline at the top:
NL = LineEnd()
and then used it as the stopOn argument to OneOrMore and ZeroOrMore:
prop_val_value = Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString(), stopOn=NL)).setName("prop_val_value")#.setDebug()
prop_val = Group(identifier + Suppress("=") + Group(prop_val_value)).setName("prop_val")#.setDebug()
Here is the parser I ended up with:
indentStack = [1]
stmt = Forward()
NL = LineEnd()
identifier = Word(alphas, alphanums+"-_.").setName("identifier").setDebug()
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums))).setName("sect_def")#.setDebug()
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
#~ value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1))))).setDebug()
value_label = originalTextFor(OneOrMore(identifier)).setName("value_label")#.setDebug()
value = Group(value_label
+ Suppress(":")
+ Optional(~NL + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_.') | quotedString(), stopOn=NL))))).setName("value")#.setDebug()
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
#~ prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop_val_value = Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString(), stopOn=NL)).setName("prop_val_value")#.setDebug()
prop_val = Group(identifier + Suppress("=") + Group(prop_val_value)).setName("prop_val")#.setDebug()
prop = (prop_name + prop_section).setName("prop")#.setDebug()
stmt << ( section | prop | value | prop_val )
Which gives this:
[[['Module'], ['6']],
[[['Argument']],
[['Name', ['module-alsa-card']]],
[['Usage counter', ['0']]],
['Properties:',
[[['module.author', ['"Lennart Poettering"']]],
[['module.description', ['"ALSA Card"']]],
[['module.version', ['"14.0-rebootstrapped"']]]]]]]

Visualizing nested function calls in python

I need a way to visualize nested function calls in python, preferably in a tree-like structure. So, if I have a string that contains f(g(x,h(y))), I'd like to create a tree that makes the levels more readable. For example:
    f()
     |
    g()
   /   \
  x    h()
        |
        y
Or, of course, even better, a tree plot like the one that sklearn.tree.plot_tree creates.
This seems like a problem that someone has probably solved long ago, but it has so far resisted my attempts to find it. FYI, this is for the visualization of genetic programming output that tends to have very complex strings like this.
thanks!
update:
toytree and toyplot get pretty close, but just not quite there:
This is generated with:
import toytree, toyplot
mystyle = {"layout": 'down','node_labels':True}
s = '((x,(y)));'
toytree.tree(s).draw(**mystyle);
It's close, but the node labels aren't strings...
Update 2:
I found another potential solution that gets me closer in text form:
https://rosettacode.org/wiki/Visualize_a_tree#Python
tree2 = Node('f')([
    Node('g')([
        Node('x')([]),
        Node('h')([
            Node('y')([])
        ])
    ])
])
print('\n\n'.join([drawTree2(True)(False)(tree2)]))
This results in the following:
That's right, but I had to hand convert my string to the Node notation the drawTree2 function needs.

Here's a solution using pyparsing and asciitree. This can be adapted to parse just about anything and to generate whatever data structure is required for plotting. In this case, the code generates nested dictionaries suitable for input to asciitree.
#!/usr/bin/env python3
from collections import OrderedDict
from asciitree import LeftAligned
from pyparsing import Suppress, Word, alphas, Forward, delimitedList, ParseException, Optional
def grammar():
    lpar = Suppress('(')
    rpar = Suppress(')')
    identifier = Word(alphas).setParseAction(lambda t: (t[0], {}))
    function_name = Word(alphas)
    expr = Forward()
    function_arg = delimitedList(expr)
    function = (function_name + lpar + Optional(function_arg) + rpar).setParseAction(lambda t: (t[0] + '()', OrderedDict(t[1:])))
    expr << (function | identifier)
    return function
def parse(expr):
    g = grammar()
    try:
        parsed = g.parseString(expr, parseAll=True)
    except ParseException as e:
        print()
        print(expr)
        print(' ' * e.loc + '^')
        print(e.msg)
        raise
    return dict([parsed[0]])
if __name__ == '__main__':
    expr = 'f(g(x,h(y)))'
    tree = parse(expr)
    print(LeftAligned()(tree))
Output:
f()
 +-- g()
     +-- x
     +-- h()
         +-- y
Edit
With some tweaks, you can build an edge list suitable for plotting in your favorite graph library (igraph example below).
#!/usr/bin/env python3
import igraph
from pyparsing import Suppress, Word, alphas, Forward, delimitedList, ParseException, Optional
class GraphBuilder(object):
    def __init__(self):
        self.labels = {}
        self.edges = []
    def add_edges(self, source, targets):
        for target in targets:
            self.add_edge(source, target)
        return source
    def add_edge(self, source, target):
        x = self.labels.setdefault(source, len(self.labels))
        y = self.labels.setdefault(target, len(self.labels))
        self.edges.append((x, y))
    def build(self):
        g = igraph.Graph()
        g.add_vertices(len(self.labels))
        g.vs['label'] = sorted(self.labels.keys(), key=lambda l: self.labels[l])
        g.add_edges(self.edges)
        return g
def grammar(gb):
    lpar = Suppress('(')
    rpar = Suppress(')')
    identifier = Word(alphas)
    function_name = Word(alphas).setParseAction(lambda t: t[0] + '()')
    expr = Forward()
    function_arg = delimitedList(expr)
    function = (function_name + lpar + Optional(function_arg) + rpar).setParseAction(lambda t: gb.add_edges(t[0], t[1:]))
    expr << (function | identifier)
    return function
def parse(expr, gb):
    g = grammar(gb)
    g.parseString(expr, parseAll=True)
if __name__ == '__main__':
    expr = 'f(g(x,h(y)))'
    gb = GraphBuilder()
    parse(expr, gb)
    g = gb.build()
    layout = g.layout('tree', root=len(gb.labels)-1)
    igraph.plot(g, layout=layout, vertex_size=30, vertex_color='white')
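If the strings happen to be valid Python expressions made up only of names and calls, as f(g(x,h(y))) is, the standard library's ast module can stand in for a hand-written grammar. The following is a sketch under that assumption, not a drop-in replacement for the pyparsing version; the helper names `call_tree`, `render` and `edges` are hypothetical, and anything other than plain names and calls (numbers, operators, keyword arguments) is rejected or ignored.

```python
import ast

def call_tree(source):
    """Convert a call expression such as 'f(g(x,h(y)))' to (label, children)."""
    def convert(node):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            return (node.func.id + '()', [convert(arg) for arg in node.args])
        if isinstance(node, ast.Name):
            return (node.id, [])
        raise ValueError('unsupported syntax: %s' % ast.dump(node))
    return convert(ast.parse(source, mode='eval').body)

def render(tree, indent=0):
    """Return the tree as a list of lines, two spaces per nesting level."""
    label, children = tree
    lines = [' ' * indent + label]
    for child in children:
        lines.extend(render(child, indent + 2))
    return lines

def edges(tree):
    """Return (parent_label, child_label) pairs, e.g. for a graph library."""
    label, children = tree
    result = []
    for child in children:
        result.append((label, child[0]))
        result.extend(edges(child))
    return result

tree = call_tree('f(g(x,h(y)))')
print('\n'.join(render(tree)))
print(edges(tree))
```

As with the label-keyed GraphBuilder above, the edge list identifies nodes by label, so repeated names would be merged into one vertex.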

Preferential parsing of sentence elements using infixnotation in Pyparsing

I have some sentences that I need to convert to regex code, and I am trying to use pyparsing for it. The sentences are basically search rules, telling us what to search for.
Examples of sentences:
LINE_CONTAINS this is a phrase - an example search rule saying that the line being searched should contain the phrase this is a phrase
LINE_STARTSWITH However we - an example search rule saying that the line being searched should start with the phrase However we
The rules can be combined too, like: LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we
Now, I am trying to parse these sentences and then convert them to regex code. All lines start with one of the symbols mentioned above (call them line_directives). I want to recognize these line_directives and parse them appropriately, and do the same for the phrases that follow them, although those are parsed differently. Using help from Paul McGuire (here) and my own additions, I have the following code:
from pyparsing import *
import re
UPTO, AND, OR, WORDS = map(Literal, "upto AND OR words".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(Literal,
"""LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
BEFORE, AFTER, JOIN = map(Literal, "BEFORE AFTER JOIN".split())
word = ~keyword + Word(alphas)
phrase = Group(OneOrMore(word))
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)
class Node(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def generate(self):
        pass
class LiteralNode(Node):
    def generate(self):
        print (self.tokens[0], 20)
        for el in self.tokens[0]:
            print (el,type(el), 19)
        print (type(self.tokens[0]), 18)
        return "(%s)" %(' '.join(self.tokens[0])) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
    def __repr__(self):
        return repr(self.tokens[0])
class AndNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '.*'.join(t.generate() for t in tokens[::2]) # change this to the correct form of AND in regex
    def __repr__(self):
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])
class OrNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '|'.join(t.generate() for t in tokens[::2])
    def __repr__(self):
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])
class UpToNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        ret = tokens[0].generate()
        print (123123)
        word_re = r"\s+\S+"
        space_re = r"\s+"
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
            print(ret)
        return ret
    def __repr__(self):
        tokens = self.tokens[0]
        ret = repr(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
        return ret
phrase_expr = infixNotation(phrase,
[
((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
(AND, 2, opAssoc.LEFT,),
(OR, 2, opAssoc.LEFT),
],
lpar=Suppress('{'), rpar=Suppress('}')
) # structure of a single phrase with its operators
line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") +
Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = infixNotation(line_term,
[(AND, 2, opAssoc.LEFT,),
(OR, 2, opAssoc.LEFT),
]
) # grammar for the entire rule/sentence
phrase_expr = infixNotation(line_contents_expr.setParseAction(LiteralNode),
[
(upto_expr, 2, opAssoc.LEFT, UpToNode),
(AND, 2, opAssoc.LEFT, AndNode),
(OR, 2, opAssoc.LEFT, OrNode),
])
tests1 = """LINE_CONTAINS overexpressing gene AND other things""".splitlines()
for t in tests1:
    t = t.strip()
    if not t:
        continue
    # print(t, 12)
    try:
        parsed = phrase_expr.parseString(t)
    except ParseException as pe:
        print(' '*pe.loc + '^')
        print(pe)
        continue
    print (parsed[0], 14)
    print (type(parsed[0]))
    print(parsed[0].generate(), 15)
This simple code, on running gives the following error-
((['LINE_CONTAINS', ([(['overexpressing', 'gene'], {})], {})],
{'phrase': [(([(['overexpressing', 'gene'], {})], {}), 1)],
'line_directive': [('LINE_CONTAINS', 0)]}), 14)
((['LINE_CONTAINS', ([(['overexpressing', 'gene'], {})], {})],
{'phrase': [(([(['overexpressing', 'gene'], {})], {}), 1)],
'line_directive': [('LINE_CONTAINS', 0)]}), 20)
('LINE_CONTAINS', <, 19)
(([(['overexpressing', 'gene'], {})], {}), , 19)
(, 18)
TypeError: sequence item 1: expected string, ParseResults found (line
29)
(The error code is not completely correct, as angular brackets are not well supported in blockquote here)
So my question is: even though I have written the grammar (using infixNotation) so that it treats LINE_CONTAINS as a line_directive and parses the rest of the line accordingly, why does it not parse properly? What is a good way to parse such lines?
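Regarding the TypeError itself: the message says that item 1 passed to ' '.join is a ParseResults rather than a string, which matches the nested structure printed above. The sketch below only illustrates that one failure, using plain nested lists in place of ParseResults (an assumption made so that it runs without pyparsing; calling asList() on a ParseResults yields such lists). The helper name `flatten` is hypothetical, and this does not address how the grammar itself should be structured.

```python
def flatten(tokens):
    """Yield the strings from an arbitrarily nested list of tokens."""
    for tok in tokens:
        if isinstance(tok, str):
            yield tok
        else:
            # Nested group: descend instead of handing it to str.join.
            for inner in flatten(tok):
                yield inner

# Same shape as the parsed result shown in the output above.
nested = ['LINE_CONTAINS', [['overexpressing', 'gene']]]
print(' '.join(flatten(nested)))
```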

python difflib character diff with unifed contextual format

I need to display character difference per line in a unix unified diff like style. Is there a way to do that using difflib?
I can get "unified diff" and "character per line diff" separately using difflib.unified_diff and difflib.Differ() (ndiff) respectively, but how can I combine them?
This is what I am looking for:
#
# This is difflib.unified
#
>>> print ''.join(difflib.unified_diff('one\ntwo\nthree\n'.splitlines(1), 'ore\ntree\nemu\n'.splitlines(1), 'old', 'new'))
--- old
+++ new
@@ -1,3 +1,3 @@
-one
-two
-three
+ore
+tree
+emu
>>>
#
# This is difflib.Differ
#
>>> print ''.join(difflib.ndiff('one\ntwo\nthree\n'.splitlines(1), 'ore\ntree\nemu\n'.splitlines(1))),
- one
?  ^
+ ore
?  ^
- two
- three
?  -
+ tree
+ emu
>>>
#
# I want the merge of above two, something like this...
#
>>> print ''.join(unified_with_ndiff('one\ntwo\nthree\n'.splitlines(1), 'ore\ntree\nemu\n'.splitlines(1))),
--- old
+++ new
@@ -1,3 +1,3 @@
- one
?  ^
+ ore
?  ^
- two
- three
?  -
+ tree
+ emu
>>>

Found the answer on my own after digging into the source code of difflib.
'''
# mydifflib.py
#author: Amit Barik
#summary: Overrides difflib.Differ to present the user with unified format (for Python 2.7).
It's basically a merge of difflib.unified_diff() and difflib.Differ.compare()
'''
from difflib import SequenceMatcher
from difflib import Differ
class UnifiedDiffer(Differ):
    def unified_diff(self, a, b, fromfile='', tofile='', fromfiledate='',
                     tofiledate='', n=3, lineterm='\n'):
        r"""
        Compare two sequences of lines; generate the resulting delta, in unified
        format
        Each sequence must contain individual single-line strings ending with
        newlines. Such sequences can be obtained from the `readlines()` method
        of file-like objects. The delta generated also consists of newline-
        terminated strings, ready to be printed as-is via the writeline()
        method of a file-like object.
        Example:
        >>> a = 'context1\none\ntwo\nthree\ncontext2\n'.splitlines(1)
        >>> b = 'context1\nore\ntree\nemu\ncontext2\n'.splitlines(1)
        >>> print(''.join(UnifiedDiffer().unified_diff(a, b,
        ...     'old.txt', 'new.txt', 'old-date', 'new-date')))
        --- old.txt    old-date
        +++ new.txt    new-date
        @@ -1,5 +1,5 @@
          context1
        - one
        ?  ^
        + ore
        ?  ^
        - two
        - three
        ?  -
        + tree
        + emu
          context2
        """
        started = False
        for group in SequenceMatcher(None, a, b).get_grouped_opcodes(n):
            if not started:
                fromdate = '\t%s' % fromfiledate if fromfiledate else ''
                todate = '\t%s' % tofiledate if tofiledate else ''
                yield '--- %s%s%s' % (fromfile, fromdate, lineterm)
                yield '+++ %s%s%s' % (tofile, todate, lineterm)
                started = True
            i1, i2, j1, j2 = group[0][1], group[-1][2], group[0][3], group[-1][4]
            # Unified hunk headers are delimited by '@@'.
            yield "@@ -%d,%d +%d,%d @@%s" % (i1+1, i2-i1, j1+1, j2-j1, lineterm)
            for tag, i1, i2, j1, j2 in group:
                if tag == 'replace':
                    for line in a[i1:i2]:
                        g = self._fancy_replace(a, i1, i2, b, j1, j2)
                elif tag == 'equal':
                    for line in a[i1:i2]:
                        g = self._dump(' ', a, i1, i2)
                    if n > 0:
                        for line in g:
                            yield line
                    continue
                elif tag == 'delete':
                    for line in a[i1:i2]:
                        g = self._dump('-', a, i1, i2)
                elif tag == 'insert':
                    for line in b[j1:j2]:
                        g = self._dump('+', b, j1, j2)
                else:
                    # Call syntax works on both Python 2 and Python 3.
                    raise ValueError('unknown tag %r' % (tag,))
                for line in g:
                    yield line
def main():
    # Test
    a = 'context1\none\ntwo\nthree\ncontext2\n'.splitlines(1)
    b = 'context1\nore\ntree\nemu\ncontext2\n'.splitlines(1)
    x = UnifiedDiffer().unified_diff(a, b, 'old.txt', 'new.txt', 'old-date', 'new-date', n=1)
    print(''.join(x))
if __name__ == '__main__':
    main()
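The class above relies on the underscore-prefixed methods _fancy_replace and _dump of Differ. A variation that stays within difflib's public functions is sketched below: it takes the hunk boundaries from SequenceMatcher.get_grouped_opcodes and lets difflib.ndiff render each hunk. The function name `unified_ndiff` is hypothetical, file dates are left out, and the hunk header is simplified (it does not special-case empty ranges), so treat it as an illustration rather than a faithful unified-diff writer.

```python
import difflib

def unified_ndiff(a, b, fromfile='old', tofile='new', n=3):
    """Yield unified-style hunk headers followed by ndiff-style hunk bodies."""
    matcher = difflib.SequenceMatcher(None, a, b)
    started = False
    for group in matcher.get_grouped_opcodes(n):
        if not started:
            yield '--- %s\n' % fromfile
            yield '+++ %s\n' % tofile
            started = True
        # The first and last opcodes of a group delimit the hunk.
        i1, i2 = group[0][1], group[-1][2]
        j1, j2 = group[0][3], group[-1][4]
        yield '@@ -%d,%d +%d,%d @@\n' % (i1 + 1, i2 - i1, j1 + 1, j2 - j1)
        # ndiff re-compares just the lines of this hunk and adds '?' hints.
        for line in difflib.ndiff(a[i1:i2], b[j1:j2]):
            yield line

a = 'context1\none\ntwo\nthree\ncontext2\n'.splitlines(True)
b = 'context1\nore\ntree\nemu\ncontext2\n'.splitlines(True)
out = list(unified_ndiff(a, b, n=1))
print(''.join(out))
```

Because ndiff recomputes the comparison inside each hunk, the line pairing can occasionally differ from what the surrounding SequenceMatcher found; for small hunks this is usually not noticeable.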

Parsing significant whitespace with a PEG in Python (specificly Parsley)

I'm creating a syntax that supports significant whitespace (closer to the "Z" Lisp variant than to Python or YAML, but the same idea).
I came across this article on how to do significant-whitespace parsing in Pegasus, a PEG parser for C#.
But I've been less than successful at converting that to Parsley; it looks like the #STATE# variable in Pegasus follows backtracking in some way.
This is the closest I've gotten to a simple parser. If I use the version of indent with lookahead, it can't parse children, and if I use the version without, it can't parse siblings.
If this is a limitation of Parsley and I need to use PyPEG or Parsimonious or something else, I'm open to that, but it seems that if the internal indent variable could follow the PEG's internal backtracking, this would all work.
import parsley
def indent(s):
    s['i'] += 2
    print('indent i=%d' % s['i'])
def deindent(s):
    s['i'] -= 2
    print('deindent i=%d' % s['i'])
grammar = parsley.makeGrammar(r'''
id = <letterOrDigit+>
eol = '\n' | end
nots = anything:x ?(x != ' ')
node = I:i id:name eol !(fn_print(_state['i'], name)) -> i, name
#I = !(' ' * _state['i'])
I = (' '*):spaces ?(len(spaces) == _state['i'])
#indent = ~~(!(' ' * (_state['i'] + 2)) nots) -> fn_indent(_state)
#deindent = ~~(!(' ' * (_state['i'] - 2)) nots) -> fn_deindent(_state)
indent = -> fn_indent(_state)
deindent = -> fn_deindent(_state)
child_list = indent (ntree+):children deindent -> children
ntree = node:parent (child_list?):children -> parent, children
nodes = ntree+
''', {
'_state': {'i': 0},
'fn_indent': indent,
'fn_deindent': deindent,
'fn_print': print,
})
test_string = '\n'.join((
    'brother',
    '  brochild1',
    #'    gchild1',
    #'  brochild2',
    #'    grandchild',
    'sister',
    #'  sischild',
    #'brother2',
))
nodes = grammar(test_string).nodes()
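For comparison, and not as a Parsley solution, a common way to handle significant whitespace outside a PEG is to track indentation with an explicit stack, so that no parser state has to survive backtracking. The sketch below assumes two-space indentation (matching the += 2 above) and one identifier per line; the name `parse_indented` is hypothetical.

```python
def parse_indented(text, step=2):
    """Turn indented lines into nested (name, children) tuples."""
    root = ('', [])
    stack = [(-step, root)]            # (indent, node) pairs; sentinel first
    for line in text.splitlines():
        if not line.strip():
            continue                   # ignore blank lines
        indent = len(line) - len(line.lstrip(' '))
        node = (line.strip(), [])
        # Close every node that is not an ancestor of this line.
        while stack[-1][0] >= indent:
            stack.pop()
        if indent != stack[-1][0] + step:
            raise ValueError('unexpected indentation: %r' % line)
        stack[-1][1][1].append(node)   # attach to the nearest ancestor
        stack.append((indent, node))
    return root[1]

text = '\n'.join((
    'brother',
    '  brochild1',
    '    gchild1',
    '  brochild2',
    'sister',
    '  sischild',
))
print(parse_indented(text))
```

The same stack can instead emit explicit indent and dedent markers in a preprocessing pass, leaving the grammar itself free of whitespace state.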
