Preferential parsing of sentence elements using infixnotation in Pyparsing

I have some sentences that I need to convert to regex code, and I am trying to use Pyparsing for it. The sentences are search rules that tell us what to search for.
Examples of sentences:
LINE_CONTAINS this is a phrase
This rule says that the line being searched should contain the phrase this is a phrase.
LINE_STARTSWITH However we
This rule says that the line being searched should start with the phrase However we.
The rules can also be combined, for example: LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we
Now, I am trying to parse these sentences and then convert them to regex code. Every rule starts with one of the two symbols mentioned above (call them line_directives). I want to recognize these line_directives and parse them appropriately, and then parse the phrases that follow them, which need different handling. Using help from Paul McGuire (here) and my own inputs, I have the following code:
from pyparsing import *
import re
UPTO, AND, OR, WORDS = map(Literal, "upto AND OR words".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(Literal,
"""LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
BEFORE, AFTER, JOIN = map(Literal, "BEFORE AFTER JOIN".split())
word = ~keyword + Word(alphas)
phrase = Group(OneOrMore(word))
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)
class Node(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def generate(self):
        pass
class LiteralNode(Node):
    def generate(self):
        print (self.tokens[0], 20)
        for el in self.tokens[0]:
            print (el,type(el), 19)
        print (type(self.tokens[0]), 18)
        return "(%s)" %(' '.join(self.tokens[0])) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
    def __repr__(self):
        return repr(self.tokens[0])
class AndNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '.*'.join(t.generate() for t in tokens[::2]) # change this to the correct form of AND in regex
    def __repr__(self):
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])
class OrNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '|'.join(t.generate() for t in tokens[::2])
    def __repr__(self):
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])
class UpToNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        ret = tokens[0].generate()
        print (123123)
        word_re = r"\s+\S+"
        space_re = r"\s+"
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
        print(ret)
        return ret
    def __repr__(self):
        tokens = self.tokens[0]
        ret = repr(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
        return ret
phrase_expr = infixNotation(phrase,
[
((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
(AND, 2, opAssoc.LEFT,),
(OR, 2, opAssoc.LEFT),
],
lpar=Suppress('{'), rpar=Suppress('}')
) # structure of a single phrase with its operators
line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") +
Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = infixNotation(line_term,
[(AND, 2, opAssoc.LEFT,),
(OR, 2, opAssoc.LEFT),
]
) # grammar for the entire rule/sentence
phrase_expr = infixNotation(line_contents_expr.setParseAction(LiteralNode),
[
(upto_expr, 2, opAssoc.LEFT, UpToNode),
(AND, 2, opAssoc.LEFT, AndNode),
(OR, 2, opAssoc.LEFT, OrNode),
])
tests1 = """LINE_CONTAINS overexpressing gene AND other things""".splitlines()
for t in tests1:
    t = t.strip()
    if not t:
        continue
    # print(t, 12)
    try:
        parsed = phrase_expr.parseString(t)
    except ParseException as pe:
        print(' '*pe.loc + '^')
        print(pe)
        continue
    print (parsed[0], 14)
    print (type(parsed[0]))
    print(parsed[0].generate(), 15)
Running this simple code gives the following output and error:
((['LINE_CONTAINS', ([(['overexpressing', 'gene'], {})], {})],
{'phrase': [(([(['overexpressing', 'gene'], {})], {}), 1)],
'line_directive': [('LINE_CONTAINS', 0)]}), 14)
((['LINE_CONTAINS', ([(['overexpressing', 'gene'], {})], {})],
{'phrase': [(([(['overexpressing', 'gene'], {})], {}), 1)],
'line_directive': [('LINE_CONTAINS', 0)]}), 20)
('LINE_CONTAINS', <, 19)
(([(['overexpressing', 'gene'], {})], {}), , 19)
(, 18)
TypeError: sequence item 1: expected string, ParseResults found (line 29)
(The error output is not reproduced exactly, because angle brackets are not well supported in blockquotes here.)
So my question is: even though I have written the grammar (using infixNotation) so that it treats LINE_CONTAINS as a line_directive and parses the rest of the line accordingly, why does it not parse properly? What is a good way to parse such lines?
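For reference, the intended meaning of a single directive, as described in the examples above, can be sketched with the standard library alone. This only illustrates the target regexes and is not the pyparsing grammar: the function name directive_to_regex is hypothetical, and the pattern for LINE_ENDSWITH is an assumption made by analogy, since the question only describes LINE_CONTAINS and LINE_STARTSWITH.

```python
import re

def directive_to_regex(directive, phrase):
    """Build a regex source string for one line directive and its phrase."""
    # Escape every word and allow any run of whitespace between words.
    body = r"\s+".join(re.escape(word) for word in phrase.split())
    if directive == "LINE_CONTAINS":
        return body
    if directive == "LINE_STARTSWITH":
        return "^" + body
    if directive == "LINE_ENDSWITH":  # assumed by analogy, not described in the question
        return body + "$"
    raise ValueError("unknown directive: %s" % directive)

starts = re.compile(directive_to_regex("LINE_STARTSWITH", "However we"))
print(bool(starts.search("However we found nothing")))  # True
print(bool(starts.search("And However we found")))      # False
```

Combining such pieces with AND, OR, BEFORE and AFTER is the part that the grammar and the node classes are meant to handle.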

Related

Visualizing nested function calls in python

I need a way to visualize nested function calls in python, preferably in a tree-like structure. So, if I have a string that contains f(g(x,h(y))), I'd like to create a tree that makes the levels more readable. For example:
    f()
     |
    g()
   /   \
  x    h()
        |
        y
Or, of course, even better, a tree plot like the one that sklearn.tree.plot_tree creates.
This seems like a problem that someone has probably solved long ago, but it has so far resisted my attempts to find it. FYI, this is for the visualization of genetic programming output that tends to have very complex strings like this.
thanks!
update:
toytree and toyplot get pretty close, but just not quite there:
This is generated with:
import toytree, toyplot
mystyle = {"layout": 'down','node_labels':True}
s = '((x,(y)));'
toytree.tree(s).draw(**mystyle);
It's close, but the node labels aren't strings...
Update 2:
I found another potential solution that gets me closer in text form:
https://rosettacode.org/wiki/Visualize_a_tree#Python
tree2 = Node('f')([
    Node('g')([
        Node('x')([]),
        Node('h')([
            Node('y')([])
        ])
    ])
])
print('\n\n'.join([drawTree2(True)(False)(tree2)]))
This results in the following:
That's right, but I had to hand convert my string to the Node notation the drawTree2 function needs.

Here's a solution using pyparsing and asciitree. This can be adapted to parse just about anything and to generate whatever data structure is required for plotting. In this case, the code generates nested dictionaries suitable for input to asciitree.
#!/usr/bin/env python3
from collections import OrderedDict
from asciitree import LeftAligned
from pyparsing import Suppress, Word, alphas, Forward, delimitedList, ParseException, Optional
def grammar():
    lpar = Suppress('(')
    rpar = Suppress(')')
    identifier = Word(alphas).setParseAction(lambda t: (t[0], {}))
    function_name = Word(alphas)
    expr = Forward()
    function_arg = delimitedList(expr)
    function = (function_name + lpar + Optional(function_arg) + rpar).setParseAction(lambda t: (t[0] + '()', OrderedDict(t[1:])))
    expr << (function | identifier)
    return function
def parse(expr):
    g = grammar()
    try:
        parsed = g.parseString(expr, parseAll=True)
    except ParseException as e:
        print()
        print(expr)
        print(' ' * e.loc + '^')
        print(e.msg)
        raise
    return dict([parsed[0]])
if __name__ == '__main__':
    expr = 'f(g(x,h(y)))'
    tree = parse(expr)
    print(LeftAligned()(tree))
Output:
f()
 +-- g()
     +-- x
     +-- h()
         +-- y
Edit
With some tweaks, you can build an edge list suitable for plotting in your favorite graph library (igraph example below).
#!/usr/bin/env python3
import igraph
from pyparsing import Suppress, Word, alphas, Forward, delimitedList, ParseException, Optional
class GraphBuilder(object):
    def __init__(self):
        self.labels = {}
        self.edges = []
    def add_edges(self, source, targets):
        for target in targets:
            self.add_edge(source, target)
        return source
    def add_edge(self, source, target):
        x = self.labels.setdefault(source, len(self.labels))
        y = self.labels.setdefault(target, len(self.labels))
        self.edges.append((x, y))
    def build(self):
        g = igraph.Graph()
        g.add_vertices(len(self.labels))
        g.vs['label'] = sorted(self.labels.keys(), key=lambda l: self.labels[l])
        g.add_edges(self.edges)
        return g
def grammar(gb):
    lpar = Suppress('(')
    rpar = Suppress(')')
    identifier = Word(alphas)
    function_name = Word(alphas).setParseAction(lambda t: t[0] + '()')
    expr = Forward()
    function_arg = delimitedList(expr)
    function = (function_name + lpar + Optional(function_arg) + rpar).setParseAction(lambda t: gb.add_edges(t[0], t[1:]))
    expr << (function | identifier)
    return function
def parse(expr, gb):
    g = grammar(gb)
    g.parseString(expr, parseAll=True)
if __name__ == '__main__':
    expr = 'f(g(x,h(y)))'
    gb = GraphBuilder()
    parse(expr, gb)
    g = gb.build()
    layout = g.layout('tree', root=len(gb.labels)-1)
    igraph.plot(g, layout=layout, vertex_size=30, vertex_color='white')
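If the strings happen to be valid Python call expressions, which is an assumption about the genetic programming output rather than something stated in the question, the standard library's ast module can build the same kind of tree without a third-party parser. The following is a minimal sketch; the names call_tree and render are hypothetical, and only plain names and calls of plain names are handled.

```python
import ast

def call_tree(source):
    """Return nested (label, children) tuples for a call expression."""
    def build(node):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            return (node.func.id + "()", [build(arg) for arg in node.args])
        if isinstance(node, ast.Name):
            return (node.id, [])
        # Anything else (literals, operators, attribute calls) is out of scope here.
        raise ValueError("unsupported expression: " + ast.dump(node))
    return build(ast.parse(source, mode="eval").body)

def render(tree, indent=0):
    """Return the tree as a list of text lines, four spaces per level."""
    label, children = tree
    lines = [" " * indent + label]
    for child in children:
        lines.extend(render(child, indent + 4))
    return lines

print("\n".join(render(call_tree("f(g(x,h(y)))"))))
```

The nested tuples can be converted to whatever structure a plotting library expects, in the same way the pyparsing parse actions above produce dictionaries or edge lists.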

Is there a way to programmatically combine Korean unicode into one?

Using a Korean Input Method Editor (IME), it's possible to type 버리 + 어 and it will automatically become 버려.
Is there a way to programmatically do that in Python?
>>> x, y = '버리', '어'
>>> z = '버려'
>>> ord(z[-1])
47140
>>> ord(x[-1]), ord(y)
(47532, 50612)
Is there a way to compute that 47532 + 50612 -> 47140?
Here's some more examples:
가보 + 아 -> 가봐
끝나 + ㄹ -> 끝날

I'm Korean. First, if you type 버리 + 어, it becomes 버리어, not 버려. 버려 is a contraction of 버리어 and is not generated automatically. For the same reason, 가보아 does not automatically become 가봐 while typing.
Second, by contrast, 끝나 + ㄹ becomes 끝날 because 나 has no jongseong (종성). Note that one Hangul character is made of a choseong (초성), a jungseong (중성), and a jongseong. The choseong and jongseong are consonants; the jungseong is a vowel. See more at Wikipedia. So only when the last character has no jongseong while typing (like 끝나) can it take one (ㄹ).
If you want to turn 버리 + 어 into 버려, you need to implement some Korean grammar rules, in this case contraction of the jungseong: for example ㅣ + ㅓ = ㅕ and ㅗ + ㅏ = ㅘ, as in your examples. 한글 맞춤법, chapter 4, section 5 (I can't find English pages right now) defines contractions like this. It's possible, but not an easy job, especially for non-Koreans.
Next, if all you want is to turn 끝나 + ㄹ into 끝날, that is a relatively easy job, since there are libraries that handle composition and decomposition of choseong, jungseong, and jongseong. For Python, I found hgtk. You can try something like this (nonpractical code):
import hgtk

# hgtk methods take one character at a time
cjj1 = hgtk.letter.decompose('나') # ('ㄴ', 'ㅏ', '')
cjj2 = hgtk.letter.decompose('ㄹ') # ('ㄹ', '', '')
# 나 has no jongseong and ㄹ is a bare consonant, so ㄹ can fill the empty slot
if cjj1[2] == '' and cjj2[1] == '':
    cjj = (cjj1[0], cjj1[1], cjj2[0])
    cjj2 = None
Still, without proper knowledge of Hangul, it will be very hard to get it done.
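The arithmetic the question asks about can be sketched without any library, using the way Unicode lays out precomposed Hangul syllables: code point = 0xAC00 + (choseong index * 21 + jungseong index) * 28 + jongseong index. The sketch below is illustrative only. The function names and the CONTRACT table are hypothetical, and the table holds just the two vowel contractions mentioned above, not the full spelling rules.

```python
S_BASE = 0xAC00          # first precomposed syllable, 가
N_JUNG, N_JONG = 21, 28  # number of vowels and of final slots (including "none")
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
CHO_IEUNG = 11           # index of the silent initial ㅇ
# Illustrative contraction table: only the two pairs from the answer above.
CONTRACT = {("ㅣ", "ㅓ"): "ㅕ", ("ㅗ", "ㅏ"): "ㅘ"}

def split(syllable):
    """Return (choseong, jungseong, jongseong) indices of one syllable."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < 19 * N_JUNG * N_JONG:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    cho, rest = divmod(index, N_JUNG * N_JONG)
    jung, jong = divmod(rest, N_JONG)
    return cho, jung, jong

def join(cho, jung, jong=0):
    """Compose one syllable from its three indices."""
    return chr(S_BASE + (cho * N_JUNG + jung) * N_JONG + jong)

def attach(stem, suffix):
    """Attach suffix to stem, merging into the last syllable where possible."""
    cho, jung, jong = split(stem[-1])
    if suffix in JONG[1:] and jong == 0:
        # A bare consonant fills the empty final slot: 끝나 + ㄹ
        return stem[:-1] + join(cho, jung, JONG.index(suffix))
    if len(suffix) == 1 and S_BASE <= ord(suffix) <= 0xD7A3:
        s_cho, s_jung, s_jong = split(suffix)
        merged = CONTRACT.get((JUNG[jung], JUNG[s_jung]))
        if jong == 0 and s_cho == CHO_IEUNG and merged:
            # Vowel contraction: 버리 + 어, 가보 + 아
            return stem[:-1] + join(cho, JUNG.index(merged), s_jong)
    return stem + suffix

print(attach("버리", "어"), attach("가보", "아"), attach("끝나", "ㄹ"))
```

So 47532 and 50612 are not added directly; each code point is split into its three indices, the vowel indices are merged through the table, and the result is recomposed into 47140.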

You could use your own translation table.
The drawback is that you have to enter all pairs manually, or have a file to load them from.
For instance:
# Sample Korean chars to map
k = [[('버리', '어'), ('버려')], [('가보', '아'), ('가봐')], [('끝나', 'ㄹ'), ('끝날')]]
class Korean(object):
    def __init__(self):
        self.map = {}
        for m in k:
            key = m[0][0] + m[0][1]
            self.map[hash(key)] = m[1]
    def __getitem__(self, item):
        return self.map[hash(item)]
    def translate(self, s):
        return [ self.map[hash(token)] for token in s]
if __name__ == '__main__':
    k_map = Korean()
    k_chars = [ m[0][0] + m[0][1] for m in k]
    print('Input: %s' % k_chars)
    print('Output: %s' % k_map.translate(k_chars))
    one_char_3 = k[0][0][0] + k[0][0][1]
    print('%s = %s' % (one_char_3, k_map[ one_char_3 ]) )
Input: ['버리어', '가보아', '끝나ㄹ']
Output: ['버려', '가봐', '끝날']
버리어 = 버려
Tested with Python:3.4.2

Extract data from within parenthesis in python

I know there are many questions with the same title. My situation is a little different. I have a string like:
"Cat(Money(8)Points(80)Friends(Online(0)Offline(8)Total(8)))Mouse(Money(10)Points(10000)Friends(Online(10)Offline(80)Total(90)))"
(Notice that there are parentheses nested inside one another)
and I need to parse it into nested dictionaries, for example:
d["Cat"]["Money"] == 8
d["Cat"]["Points"] == 80
d["Mouse"]["Friends"]["Online"] == 10
and so on. I would like to do this without libraries or regex. If you choose to use these, please explain the code in great detail.
Thanks in advance!
Edit:
Although this code will not make any sense, this is what I have so far:
o_str = "Jake(Money(8)Points(80)Friends(Online(0)Offline(8)Total(8)))Mouse(Money(10)Points(10000)Friends(Online(10)Offline(80)Total(90)))"
spl = o_str.split("(")
def reverseIndex(str1, str2):
    try:
        return len(str1) - str1.rindex(str2)
    except Exception:
        return len(str1)
def app(arr,end):
    new_arr = []
    for i in range(0,len(arr)):
        if i < len(arr)-1:
            new_arr.append(arr[i]+end)
        else:
            new_arr.append(arr[i])
    return new_arr
spl = app(spl,"(")
ends = []
end_words = []
op = 0
cl = 0
for i in range(0,len(spl)):
    print(i)
    cl += spl[i].count(")")
    op += 1
    if cl == op-1:
        ends.append(i)
        end_words.append(spl[i])
        #break
    print(op)
    print(cl)
    print()
print(end_words)
The end words are the sections at the beginning of each statement. I plan on using recursion to do the rest.

Now that was interesting. You really nerd-sniped me on this one...
def parse(tokens):
    """ take iterator of tokens, parse to dictionary or atom """
    dictionary = {}
    # iterate tokens...
    for token in tokens:
        if token == ")" or next(tokens) == ")":
            # token is ')' -> end of dict; next is ')' -> 'leaf'
            break
        # add sub-parse to dictionary
        dictionary[token] = parse(tokens)
    # return dict, if non-empty, else token
    return dictionary or int(token)
Setup and demo:
>>> from pprint import pprint
>>> s = "Cat(Money(8)Points(80)Friends(Online(0)Offline(8)Total(8)))Mouse(Money(10)Points(10000)Friends(Online(10)Offline(80)Total(90)))"
>>> tokens = iter(s.replace("(", " ( ").replace(")", " ) ").split())
>>> pprint(parse(tokens))
{'Cat': {'Friends': {'Offline': 8, 'Online': 0, 'Total': 8},
         'Money': 8,
         'Points': 80},
 'Mouse': {'Friends': {'Offline': 80, 'Online': 10, 'Total': 90},
           'Money': 10,
           'Points': 10000}}
Alternatively, you could also use a series of string replacements to turn that string into an actual Python dictionary string and then evaluate that, e.g. like this:
as_dict = eval("{'" + s.replace(")", "'}, ")
                       .replace("(", "': {'")
                       .replace(", ", ", '")
                       .replace(", ''", "")[:-3] + "}")
This will wrap the leaves in singleton sets of strings, e.g. {'8'} instead of 8, but this should be easy to fix in a post-processing step.
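Since the question asks for an approach without libraries or regex and with a detailed explanation, here is a sketch of a character-by-character alternative that keeps an explicit stack instead of recursing. It is not the method above, just another way to get the same dictionaries. The name parse_nested is hypothetical, and the sketch assumes balanced parentheses and integer leaf values, as in the sample string.

```python
def parse_nested(text):
    """Parse 'Key(Sub(1)Other(2))' style text into nested dictionaries."""
    root = {}
    stack = [root]  # dictionaries that are currently "open"; the last one is innermost
    keys = []       # the key that opened each nested level
    buf = ""        # characters collected since the last parenthesis
    for ch in text:
        if ch == "(":
            # buf holds a key: open a new dictionary under it and descend into it
            child = {}
            stack[-1][buf] = child
            stack.append(child)
            keys.append(buf)
            buf = ""
        elif ch == ")":
            # close the innermost level
            stack.pop()
            key = keys.pop()
            if buf:
                # text directly before ')' is a leaf value such as the 8 in Money(8),
                # so replace the empty dictionary by the number
                stack[-1][key] = int(buf)
                buf = ""
        else:
            buf += ch
    return root

s = "Cat(Money(8)Points(80)Friends(Online(0)Offline(8)Total(8)))"
d = parse_nested(s)
print(d["Cat"]["Friends"]["Total"])  # 8
```

An unbalanced closing parenthesis makes keys.pop() raise IndexError, which is a reasonable failure mode for a sketch but would need a clearer error message in real use.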

Parsing significant whitespace with a PEG in Python (specifically Parsley)

I'm creating a syntax that supports significant whitespace (more like the "Z" Lisp variant than Python or YAML, but the same idea).
I came across this article on how to do significant-whitespace parsing in Pegasus, a PEG parser for C#.
But I've been less than successful at converting that to Parsley; it looks like the #STATE# variable in Pegasus follows backtracking in some way.
This is the closest I've gotten to a simple parser. If I use the version of indent with lookahead, it can't parse children, and if I use the version without, it can't parse siblings.
If this is a limitation of Parsley and I need to use PyPEG or Parsimonious or something else, I'm open to that, but it seems that if the internal indent variable could follow the PEG's internal backtracking, this would all work.
import parsley
def indent(s):
    s['i'] += 2
    print('indent i=%d' % s['i'])
def deindent(s):
    s['i'] -= 2
    print('deindent i=%d' % s['i'])
grammar = parsley.makeGrammar(r'''
id = <letterOrDigit+>
eol = '\n' | end
nots = anything:x ?(x != ' ')
node = I:i id:name eol !(fn_print(_state['i'], name)) -> i, name
#I = !(' ' * _state['i'])
I = (' '*):spaces ?(len(spaces) == _state['i'])
#indent = ~~(!(' ' * (_state['i'] + 2)) nots) -> fn_indent(_state)
#deindent = ~~(!(' ' * (_state['i'] - 2)) nots) -> fn_deindent(_state)
indent = -> fn_indent(_state)
deindent = -> fn_deindent(_state)
child_list = indent (ntree+):children deindent -> children
ntree = node:parent (child_list?):children -> parent, children
nodes = ntree+
''', {
'_state': {'i': 0},
'fn_indent': indent,
'fn_deindent': deindent,
'fn_print': print,
})
test_string = '\n'.join((
    'brother',
    '  brochild1',
    #'    gchild1',
    #'  brochild2',
    #'    grandchild',
    'sister',
    #'  sischild',
    #'brother2',
))
nodes = grammar(test_string).nodes()

pretty print assertEqual() for HTML strings

I want to compare two strings in a python unittest which contain html.
Is there a method which outputs the result in a human friendly (diff like) version?

A simple method is to strip whitespace from the HTML and split it into a list. Python 2.7's unittest (or the backported unittest2) then gives a human-readable diff between the lists.
import re
def split_html(html):
    return re.split(r'\s*\n\s*', html.strip())
def test_render_html(self):
    expected = ['<div>', '...', '</div>']
    got = split_html(render_html())
    self.assertEqual(expected, got)
If I'm writing a test for working code, I usually first set expected = [], insert a self.maxDiff = None before the assert and let the test fail once. The expected list can then be copy-pasted from the test output.
You might need to tweak how whitespace is stripped depending on what your HTML looks like.
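Outside a unittest assertion, the same split-into-lines idea can be combined with difflib from the standard library to get a diff string directly. This is a small sketch reusing the split_html helper from above; the name html_diff is hypothetical.

```python
import difflib
import re

def split_html(html):
    return re.split(r'\s*\n\s*', html.strip())

def html_diff(expected, got):
    """Return a unified diff of two HTML strings, ignoring indentation."""
    lines = difflib.unified_diff(split_html(expected), split_html(got),
                                 fromfile='expected', tofile='got', lineterm='')
    return '\n'.join(lines)

expected = "<div>\n  <p>old</p>\n</div>"
got = "<div>\n<p>new</p>\n</div>"
print(html_diff(expected, got))
```

An empty string means the two documents are equal after whitespace stripping, so the result can also be used as a failure message, for example in self.fail(html_diff(expected, got)).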

I submitted a patch to do this some years back. The patch was rejected, but you can still view it on the Python bug list.
I doubt you would want to hack your unittest.py to apply the patch (if it even still works after all this time), but here's the function for reducing two strings to a manageable size while still keeping at least part of what differs. As long as you don't need the complete differences, this might be what you want:
def shortdiff(x,y):
    '''shortdiff(x,y)
    Compare strings x and y and display differences.
    If the strings are too long, shorten them to fit
    in one line, while still keeping at least some difference.
    '''
    import difflib
    LINELEN = 79
    def limit(s):
        if len(s) > LINELEN:
            return s[:LINELEN-3] + '...'
        return s
    def firstdiff(s, t):
        span = 1000
        for pos in range(0, max(len(s), len(t)), span):
            if s[pos:pos+span] != t[pos:pos+span]:
                for index in range(pos, pos+span):
                    if s[index:index+1] != t[index:index+1]:
                        return index
    # integer division, so that left can be used as a slice index
    left = LINELEN//4
    index = firstdiff(x, y)
    if index > left + 7:
        x = x[:left] + '...' + x[index-4:index+LINELEN]
        y = y[:left] + '...' + y[index-4:index+LINELEN]
    else:
        x, y = x[:LINELEN+1], y[:LINELEN+1]
        left = 0
    cruncher = difflib.SequenceMatcher(None)
    xtags = ytags = ""
    cruncher.set_seqs(x, y)
    editchars = { 'replace': ('^', '^'),
                  'delete': ('-', ''),
                  'insert': ('', '+'),
                  'equal': (' ',' ') }
    for tag, xi1, xi2, yj1, yj2 in cruncher.get_opcodes():
        lx, ly = xi2 - xi1, yj2 - yj1
        edits = editchars[tag]
        xtags += edits[0] * lx
        ytags += edits[1] * ly
    # Include ellipsis in edits line.
    if left:
        xtags = xtags[:left] + '...' + xtags[left+3:]
        ytags = ytags[:left] + '...' + ytags[left+3:]
    diffs = [ x, xtags, y, ytags ]
    if max([len(s) for s in diffs]) < LINELEN:
        return '\n'.join(diffs)
    diffs = [ limit(s) for s in diffs ]
    return '\n'.join(diffs)

Maybe this is a rather verbose solution. You could add a new equality function for a user-defined type (e.g. HTMLString), which you have to define first:
class HTMLString(str):
    pass
Now you have to define a type equality function. unittest calls it with a msg keyword argument, so it has to accept one:
def assertHTMLStringEqual(first, second, msg=None):
    if first != second:
        message = ... # TODO here: format your message, e.g a diff
        raise AssertionError(message)
All you have to do is format your message as you like. You can also use a method of your specific TestCase as the type equality function. This gives you more options for formatting the message, since unittest.TestCase does this a lot.
Now you have to register this equality function in your unittest.TestCase:
...
    def __init__(self, *args, **kwargs):
        unittest.TestCase.__init__(self, *args, **kwargs)
        self.addTypeEqualityFunc(HTMLString, assertHTMLStringEqual)
The same for a class method:
...
    def __init__(self, *args, **kwargs):
        unittest.TestCase.__init__(self, *args, **kwargs)
        self.addTypeEqualityFunc(HTMLString, 'assertHTMLStringEqual')
And now you can use it in your tests:
    def test_something(self):
        htmlstring1 = HTMLString(...)
        htmlstring2 = HTMLString(...)
        self.assertEqual(htmlstring1, htmlstring2)
This should work well with python 2.7.
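Put together, the pieces above could look like the following self-contained sketch. The class name HTMLTestCase and the ndiff-based message are illustrative choices, not part of the answer, and registering in setUp is just one way to avoid overriding __init__. Note that assertEqual only dispatches to the registered function when both arguments have exactly the registered type.

```python
import difflib
import unittest

class HTMLString(str):
    """Marker type so assertEqual can dispatch to an HTML-aware comparison."""

class HTMLTestCase(unittest.TestCase):
    def setUp(self):
        # Registering here avoids having to override TestCase.__init__.
        self.addTypeEqualityFunc(HTMLString, self.assertHTMLStringEqual)

    def assertHTMLStringEqual(self, first, second, msg=None):
        if first != second:
            # Line-by-line diff as the failure message
            diff = '\n'.join(difflib.ndiff(first.splitlines(), second.splitlines()))
            raise self.failureException(msg or 'HTML differs:\n' + diff)

    def test_same(self):
        self.assertEqual(HTMLString('<p>a</p>'), HTMLString('<p>a</p>'))
```

Running the module with python -m unittest picks up test_same; a failing comparison reports the diff built in assertHTMLStringEqual instead of the default message.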

I (the one asking this question) use BeautifulSoup now:
def assertEqualHTML(string1, string2, file1='', file2=''):
    u'''
    Compare two unicode strings containing HTML.
    A human friendly diff goes to logging.error() if they
    are not equal, and an exception gets raised.
    '''
    from BeautifulSoup import BeautifulSoup as bs
    import difflib
    import logging
    def short(mystr):
        max=20
        if len(mystr)>max:
            return mystr[:max]
        return mystr
    p=[]
    for mystr, file in [(string1, file1), (string2, file2)]:
        if not isinstance(mystr, unicode):
            raise Exception(u'string is not unicode: %r %s' % (short(mystr), file))
        soup=bs(mystr)
        pretty=soup.prettify()
        p.append(pretty)
    if p[0]!=p[1]:
        for line in difflib.unified_diff(p[0].splitlines(), p[1].splitlines(), fromfile=file1, tofile=file2):
            logging.error(line)
        raise Exception('Not equal %s %s' % (file1, file2))
