Edit: I wrote a first version, and Eike helped me improve it quite a bit. I'm now stuck on a more specific problem, which I describe below. You can look at the original question in the edit history.
I'm using pyparsing to parse a small language used to request specific data from a database. It features numerous keywords, operators and datatypes, as well as boolean logic.
I'm trying to improve the error message shown to users when they make a syntax error, since the current one is not very useful. I designed a small example, similar to what I'm doing with the aforementioned language but much smaller:
#!/usr/bin/env python
from pyparsing import *
def validate_number(s, loc, tokens):
    if int(tokens[0]) != 0:
        raise ParseFatalException(s, loc, "number must be 0")
def fail(s, loc, tokens):
    raise ParseFatalException(s, loc, "Unknown token %s" % tokens[0])
def fail_value(s, loc, expr, err):
    raise ParseFatalException(s, loc, "Wrong value")
number = Word(nums).setParseAction(validate_number).setFailAction(fail_value)
operator = Literal("=")
error = Word(alphas).setParseAction(fail)
rules = MatchFirst([
    Literal('x') + operator + number,
])
rules = operatorPrecedence(rules | error , [
    (Literal("and"), 2, opAssoc.RIGHT),
])
def try_parse(expression):
    try:
        rules.parseString(expression, parseAll=True)
    except Exception as e:
        msg = str(e)
        print("%s: %s" % (msg, expression))
        print(" " * (len("%s: " % msg) + (e.loc)) + "^^^")
So basically, the only thing we can do with this language is write a series of x = 0 expressions, joined together with and and parentheses.
Now, there are cases, when and and parentheses are used, where the error reporting is not very good. Consider the following examples:
>>> try_parse("x = a and x = 0") # This one is actually good!
Wrong value (at char 4), (line:1, col:5): x = a and x = 0
^^^
>>> try_parse("x = 0 and x = a")
Expected end of text (at char 6), (line:1, col:1): x = 0 and x = a
^^^
>>> try_parse("x = 0 and (x = 0 and (x = 0 and (x = a)))")
Expected end of text (at char 6), (line:1, col:1): x = 0 and (x = 0 and (x = 0 and (x = a)))
^^^
>>> try_parse("x = 0 and (x = 0 and (x = 0 and (xxxxxxxx = 0)))")
Expected end of text (at char 6), (line:1, col:1): x = 0 and (x = 0 and (x = 0 and (xxxxxxxx = 0)))
^^^
It seems that if the parser can't parse (and "parse" here is important) something after an and, it no longer produces good error messages :(
And I do mean parse: if it can parse 5 but the "validation" fails in the parse action, it still produces a good error message. But if it can't parse a valid number (like a) or a valid keyword (like xxxxxx), it stops producing the right error messages.
Any ideas?
Pyparsing will always have somewhat bad error messages, because it backtracks. The error message is generated in the last rule that the parser tries. The parser can't know where the error really is, it only knows that there is no matching rule.
For good error messages you need a parser that gives up early. These parsers are less flexible than Pyparsing, but most conventional programming languages can be parsed with such parsers. (C++ and Scala IMHO can't.)
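To illustrate what "giving up early" can look like, here is a minimal hand-written sketch for the question's toy language. It does not use Pyparsing at all. The names ToyParseError, tokenize and parse are made up for this example, and the error texts merely imitate those in the question. Because the parser never backtracks, the first token that does not fit is reported at its own position.

```python
import re

class ToyParseError(Exception):
    """Error carrying the character offset of the offending token."""
    def __init__(self, message, loc):
        super().__init__("%s (at char %d)" % (message, loc))
        self.loc = loc

TOKEN = re.compile(r"\d+|[A-Za-z]+|[=()]")

def tokenize(text):
    """Return (token, offset) pairs, terminated by a (None, len(text)) marker."""
    tokens, pos = [], 0
    while True:
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        match = TOKEN.match(text, pos)
        if match is None:
            raise ToyParseError("Unexpected character", pos)
        tokens.append((match.group(), pos))
        pos = match.end()
    tokens.append((None, len(text)))
    return tokens

def parse(text):
    """Parse 'x = <number>' terms joined by 'and', with optional parentheses."""
    tokens = tokenize(text)
    index = 0

    def peek():
        return tokens[index][0]

    def expect(predicate, message):
        # Consume one token, or fail immediately at that token's offset.
        nonlocal index
        token, loc = tokens[index]
        if token is None or not predicate(token):
            raise ToyParseError(message, loc)
        index += 1
        return token

    def term():
        nonlocal index
        if peek() == "(":
            index += 1
            node = expression()
            expect(lambda t: t == ")", "Expected ')'")
            return node
        expect(lambda t: t == "x", "Unknown token")
        expect(lambda t: t == "=", "Expected '='")
        number = expect(str.isdigit, "Wrong value")
        return ("x", "=", int(number))

    def expression():
        nonlocal index
        left = term()
        if peek() == "and":
            index += 1
            return ("and", left, expression())
        return left

    tree = expression()
    token, loc = tokens[index]
    if token is not None:
        raise ToyParseError("Expected end of text", loc)
    return tree
```

With this sketch, parse("x = 0 and x = a") fails on the a itself (offset 14) rather than at the and.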
To improve error messages in Pyparsing, use the - operator. It works like the + operator, but it does not backtrack. You would use it like this:
assignment = Literal("let") - varname - "=" - expression
Here is a small article on improving error reporting, by Pyparsing's author.
Edit
You could also generate good error messages for the invalid numbers in the parse actions that do the validation. If the number is invalid you raise an exception that is not caught by Pyparsing. This exception can contain a good error message.
Parse actions can have three arguments [1]:
s = the original string being parsed (see note below)
loc = the location of the matching substring
toks = a list of the matched tokens, packaged as a ParseResults object
There are also three useful helper methods for creating good error messages [2]:
lineno(loc, string) - function to give the line number of the location within the string; the first line is line 1, newlines start new rows.
col(loc, string) - function to give the column number of the location within the string; the first column is column 1, newlines reset the column number to 1.
line(loc, string) - function to retrieve the line of text representing lineno(loc, string). Useful when printing out diagnostic messages for exceptions.
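As a rough illustration of what these three helpers compute, the following plain-Python sketch derives the same kind of information from a string and an offset. The functions describe_location and format_error are hypothetical names for this example, not part of Pyparsing, and edge cases (such as an offset that sits exactly on a newline) may differ from the library's behaviour.

```python
def describe_location(text, loc):
    """Return (line number, column, line text) for offset loc; both numbers are 1-based."""
    line_start = text.rfind("\n", 0, loc) + 1
    line_end = text.find("\n", loc)
    if line_end == -1:
        line_end = len(text)
    line_number = text.count("\n", 0, loc) + 1
    column = loc - line_start + 1
    return line_number, column, text[line_start:line_end]

def format_error(text, loc, message):
    """Build a three-line report: message, offending line, caret under the column."""
    line_number, column, line_text = describe_location(text, loc)
    pointer = " " * (column - 1) + "^"
    return "%s (line %d, column %d)\n%s\n%s" % (
        message, line_number, column, line_text, pointer)

print(format_error("x = 0\nx = a", 10, "Wrong value"))
```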
Your validating parse action would then be like this:
def validate_odd_number(s, loc, toks):
    value = toks[0]
    value = int(value)
    if value % 2 == 0:
        raise MyFatalParseException(
            "not an odd number. Line {l}, column {c}.".format(l=lineno(loc, s),
                                                              c=col(loc, s)))
[1] http://pythonhosted.org/pyparsing/pyparsing.pyparsing.ParserElement-class.html#setParseAction
[2] HowToUsePyparsing
Edit
Here [3] is an improved version of the question's current (2013-4-10) script. It gets the example errors right, but other errors are indicated at the wrong position. I believe there are bugs in my version of Pyparsing ('1.5.7'), but maybe I just don't understand how Pyparsing works. The issues are:
ParseFatalException does not always seem to be fatal. The script works as expected when I use my own exception.
The - operator does not seem to work.
[3] http://pastebin.com/7E4kSnkm
Related
Essentially, I'm making a slightly different test harness. My requirement is simple, but I have no idea whether it's possible in Python.
All I want is to make a function check such that if I do
check( "check the total matches", 123, a + call_something() * call_something_else() )
I get output like:
check the total matches failed...
expected: 123
evaluated: a + call_something() * call_something_else()
got: 122
That is, my question is about constructing that "evaluated" string. In other languages I've looked at the stack frame, worked out where the source code for that function is, searched for a matching assert line and parsed it. It feels like there might be an easier way in Python, because exceptions seem to contain that information.
So this is ugly, but it's working for me:
import re
import traceback
def show_expression( expression ):
    # the source line of the caller, as recorded in the stack trace
    caller = [x.line for x in traceback.extract_stack(limit = 2)][0]
    return re.findall( r"show_expression\(\s*(.*?)\s*\)\s*$", caller)[0]
show_expression( 1 + 2 )
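A different technique, which avoids inspecting the stack altogether, is to hand the expression to the checker as a string and let it evaluate the string itself. This is only a sketch under that assumption: the check signature below (with an explicit namespace argument) is invented for the example, and eval should only ever see trusted test code.

```python
def check(description, expected, expression, namespace):
    """Evaluate the expression string in namespace; return a report, or None on success."""
    actual = eval(expression, dict(namespace))
    if actual == expected:
        return None
    return "\n".join([
        "%s failed..." % description,
        "expected: %r" % (expected,),
        "evaluated: %s" % expression,
        "got: %r" % (actual,),
    ])

# Hypothetical values standing in for a, call_something() and call_something_else().
names = {"a": 2, "call_something": lambda: 10, "call_something_else": lambda: 12}
report = check("check the total matches", 123,
               "a + call_something() * call_something_else()", names)
print(report)
```

The trade-off is that the caller has to quote the expression and pass the names it uses, but the report no longer depends on the source file being available at run time.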
My specific concerns are
When parsing a parameter, is it intuitive for a future maintainer to understand code that depends on throwing an error?
Is it expensive to be throwing exceptions as a matter of course for the default case? (It seems like it might be, according to https://stackoverflow.com/a/9859202/776940 )
Context
I have a parameter counter that determines the name of a counter to increment, and optionally can increment by a positive or negative integer separated from the counter name by an =. If no increment value is provided, the default magnitude of the increment is 1. The function is fed by breaking up a comma delimited list of counters and increments, so a valid input to the whole process can look like:
"counter1,counter2=2,counter3=-1"
which would increment "counter1" by 1, increment "counter2" by 2 and decrement "counter3" by 1.
How I Originally Wrote It
counterDescriptor = counterValue.split('=')
if len(counterDescriptor) == 1:
    counterName = counterDescriptor[0]
    counterIncr = 1
elif len(counterDescriptor) == 2:
    counterName = counterDescriptor[0]
    counterIncr = int(counterDescriptor[1])
else:
    counterName, counterIncr = ('counterParsingError', 1)
which strikes me, as I recently came back to look at it, as overly verbose and clunky.
Is this a more or less Pythonic way to code that behavior?
def cparse(counter):
    try:
        desc,mag = counter.split('=')
    except ValueError:
        desc = counter
        mag = ''
    finally:
        if mag == '':
            mag = 1
        return desc, int(mag)
With these test cases, I see:
>>> cparse("byfour=4")
('byfour', 4)
>>> cparse("minusone=-1")
('minusone', -1)
>>> cparse("equalAndNoIncr=")
('equalAndNoIncr', 1)
>>> cparse("noEqual")
('noEqual', 1)
These test cases that would have been caught how I originally wrote it (above) won't get caught this way:
>>> cparse("twoEquals=2=3")
('twoEquals=2=3', 1)
>>> cparse("missingComma=5missingComma=-5")
('missingComma=5missingComma=-5', 1)
and this last test case doesn't get caught by either way of doing it. Both make the int() vomit:
>>> cparse("YAmissingComma=5NextCounter")
ValueError: invalid literal for int() with base 10: '5NextCounter'
I'm glad I discovered this problem by asking this question. The service that consumes this value would eventually choke on it. I suppose I could change the one line return desc, int(mag) of the function to this:
if desc.find("=")<0 and (mag=='0' or (mag if mag.find('..') > -1 else mag.lstrip('-+').rstrip('0').rstrip('.')).isdigit()):
    return desc, int(mag)
else:
    return 'counterParsingError: {}'.format(desc), 1
(hat tip to https://stackoverflow.com/a/9859202/776940 for figuring out that this was the fastest way offered in that discussion to determine if a string is an integer)
I would consider that pythonic, though you might perhaps prefer:
def cparse(counter):
    if "=" not in counter:
        # early exit for this expected case
        return (counter, 1)
    desc, mag = counter.split("=", maxsplit=1)
    # note the use of the optional maxsplit to prevent ValueErrors on "a=b=c"
    # and since we've already tested and short-circuited out of the "no equals" case
    # we can now consider this handled completely without nesting in a try block.
    try:
        mag = int(mag)
    except ValueError:
        # can't convert mag to an int, this is unexpected!
        mag = 1
    return (desc, mag)
You can tweak this to ensure you get the right output while parsing strings like a=b=c. If you expect to receive ('a', 1) then keep the code as-is. If you expect ('a=b', 1) you can use counter.rsplit instead of counter.split.
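If malformed input should be reported rather than silently mapped to an increment of 1, a stricter variant is possible. The sketch below is one way to do it, assuming that raising ValueError is acceptable to the caller. The names parse_counter and parse_counters are made up for this example.

```python
def parse_counter(text):
    """Return (name, increment); raise ValueError when the increment is not an integer."""
    name, separator, magnitude = text.partition("=")
    if not name:
        raise ValueError("missing counter name in %r" % text)
    if not separator or magnitude == "":
        # "counter" and "counter=" both mean "increment by 1"
        return name, 1
    try:
        return name, int(magnitude)
    except ValueError:
        raise ValueError("bad increment %r in %r" % (magnitude, text))

def parse_counters(spec):
    """Parse a comma-delimited list such as 'counter1,counter2=2,counter3=-1'."""
    return [parse_counter(item) for item in spec.split(",")]

print(parse_counters("counter1,counter2=2,counter3=-1"))
```

Because str.partition splits on the first = only, inputs such as twoEquals=2=3 or YAmissingComma=5NextCounter end up with a non-integer remainder and are rejected with a descriptive message.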
I'm in need of a function returning only the significant part of a value with respect to a given error. Meaning something like this:
def significant(value, error):
    """ This function takes a value and determines its significant
    accuracy from its error.
    It returns only the scientifically important part of a value and drops the rest. """
    # magic magic magic....
    # return the formatted value as a string
What I have written so far to show what I mean:
import numpy as np
def signigicant(value, error):
    """ Returns a number in a scientific format. Meaning a value has an error
    and that error determines how many digits of the
    value are significant. e.g. value = 12.345MHz,
    error = 0.1MHz => 12.3MHz because the error is at the first digit.
    (in reality drop the MHz, it's just to show why.)"""
    xx = "%E"%error # I assume this is most ineffective.
    xx = xx.split("E")
    xx = int(xx[1])
    if error <= value: # this should be the normal case
        yy = np.around(value, -xx)
        if xx >= 0: # Error is 1 or bigger
            return "%i"%yy
        else: # Error is smaller than 1
            string = "%."+str(-xx) +"f"
            return string%yy
    if error > value: # This should not be usual but it can happen.
        return "%g"%value
What I don't want is a function like NumPy's around or round. Those functions take a value and need to be told which part of the value is important. The point is that, in general, I don't know how many digits are significant. It depends on the size of the error of that value.
Another example:
value = 123, error = 12, => 120
One can drop the 3, because the error is of the order of 10. However, this behaviour is not so important, because some people still write 123 for the value. Here it is okay but not perfectly right.
For big numbers the "g" format specifier is a usable choice, but it is not always what I need. For example,
if the error is bigger than the value (this happens, e.g., when someone wants to measure something that does not exist):
value = 10, error = 100
I still wish to keep the 10 as the value because I don't know it any better. The function should then return 10 and not 0.
The code I have written does work more or less, but it's clearly not efficient or elegant in any way. I also assume this question concerns hundreds of people, because every scientist has to format numbers this way. So I'm sure there is a ready-to-use solution somewhere, but I haven't found it yet.
Probably my Google skills aren't good enough, but I wasn't able to find a solution in two days, so now I'm asking here.
For testing my code I used the following, but more is needed.
errors = [0.2,1.123,1.0, 123123.1233215,0.123123123768]
values = [12.3453,123123321.4321432, 0.000321 ,321321.986123612361236,0.00001233214 ]
for value, error in zip(values, errors):
    print "Tested value: ",value, "Error:", error
    print "Result: ", signigicant(value, error)
import math
def round_on_error(value, error):
    significant_digits = 10**math.floor(math.log(error, 10))
    return value // significant_digits * significant_digits
Example:
>>> errors = [0.2,1.123,1.0, 123123.1233215,0.123123123768]
>>> values = [12.3453,123123321.4321432, 0.000321 ,321321.986123612361236,0.00001233214 ]
>>> map(round_on_error, values, errors)
[12.3, 123123321.0, 0.0, 300000.0, 0.0]
And if you want to keep a value that is smaller than its error:
def round_on_error(value, error):
    if value < error:
        return value
    significant_digits = 10**math.floor(math.log(error, 10))
    return value // significant_digits * significant_digits
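Since the question asks for a formatted string, the same idea can be combined with string formatting. The following is only a sketch: format_with_error is a made-up name, it rounds to nearest with the built-in round instead of truncating with //, and it assumes error is positive.

```python
import math

def format_with_error(value, error):
    """Format value keeping only the digits that are significant given error (> 0)."""
    if error > abs(value):
        # The error exceeds the value: keep the value as it is.
        return "%g" % value
    exponent = int(math.floor(math.log10(error)))  # decimal position of the error's leading digit
    rounded = round(value, -exponent)
    decimals = max(0, -exponent)                   # no decimals when the error is 1 or bigger
    return "%.*f" % (decimals, rounded)

print(format_with_error(12.345, 0.1))   # 12.3
print(format_with_error(123, 12))       # 120
print(format_with_error(10, 100))       # 10
```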
I'm parsing multiple choice questions with multiple answers that look like this :
ParserElement.setDefaultWhitespaceChars(u""" \t""")
in_ = """1) first stem.
= option one one key
= option one two key
- option one three distractor
= option one four key
2) second stem ?
- option two one distractor
- option two two distractor
= option one three key
3) third stem.
- option three one key
= option three two distractor
"""
The equal sign represents a correct answer, the dash a distractor.
My grammar looks like this :
newline = Suppress(u"\n")
end_number = Suppress(oneOf(u') / ('))
end_stem = Suppress(oneOf(u"? .")) + newline
end_phrase = Optional(u'.').suppress() + newline
phrase = OneOrMore(Word(alphas)) + end_phrase
prefix = Word(u"-", max=1)('distractor') ^ Word(u"=", max=1)('key')
stem = Group(OneOrMore(Word(alphas))) + end_stem
number = Word(nums) + end_number
question = (number + stem('stem') +
            Group(OneOrMore(Group(prefix('prefix') + phrase('phrase'))))('options'))
And when I'm parsing the results:
for match, start, end in question.scanString(in_):
    for o in match.options:
        try:
            print('key', o.prefix.key)
        except:
            print('distractor', o.prefix.distractor)
I get :
AttributeError: 'unicode' object has no attribute 'distractor'
I'm pretty sure the result names are chainable. If so, what am I doing wrong ? I can easily work around this but it's unsatisfactory not knowing what I did wrong and what I misunderstood.
The problem is that o is actually the prefix. When you call o.prefix, you're going one level deeper than you need to, and are retrieving the string the prefix maps to, not the ParseResults object.
You can see this by modifying the code so that it prints out the parse tree:
for match, start, end in question.scanString(in_):
    for o in match.options:
        print o.asXML()
        try:
            print('key', o.prefix.key)
        except:
            print('distractor', o.prefix.distractor)
The code will then print out:
<prefix>
  <key>=</key>
  <phrase>option</phrase>
  <ITEM>one</ITEM>
  <ITEM>one</ITEM>
  <ITEM>key</ITEM>
</prefix>
Traceback (most recent call last):
  File "so07.py", line 37, in <module>
    print('distractor', o.prefix.distractor)
AttributeError: 'str' object has no attribute 'distractor'
The problem then becomes clear -- if o is the prefix, then it doesn't make sense to do o.prefix. Rather, you need to simply call o.key or o.distractor.
Also, it appears that if you try and call o.key where no key exists, then pyparsing will return an empty string rather than throwing an exception.
So, your fixed code should look like this:
for match, start, end in question.scanString(in_):
    for o in match.options:
        if o.key != '':
            print('key', o.key)
        else:
            print('distractor', o.distractor)
I'm currently transitioning from Java to Python and have taken on the task of trying to create a calculator that can carry out symbolic operations on infix-notated mathematical expressions (without using custom modules like Sympy). Currently, it's built to accept strings that are space delimited and can only carry out the (, ), +, -, *, and / operators. Unfortunately, I can't figure out the basic algorithm for simplifying symbolic expressions.
For example, given the string '2 * ( ( 9 / 6 ) + 6 * x )', my program should carry out the following steps:
2 * ( 1.5 + 6 * x )
3 + 12 * x
But I can't get the program to ignore the x when distributing the 2. In addition, how can I handle 'x * 6 / x' so it returns '6' after simplification?
EDIT: To clarify, by "symbolic" I meant that it will leave letters like "A" and "f" in the output while carrying out the remaining calculations.
EDIT 2: I (mostly) finished the code. I'm posting it here if anyone stumbles on this post in the future, or if any of you were curious.
def reduceExpr(useArray):
    # Use Python's native eval() to compute if no letters are detected.
    if (not hasLetters(useArray)):
        return [calculate(useArray)] # Different from eval() because it returns string version of result
    # Base case. Returns useArray if the list size is 1 (i.e., it contains one string).
    if (len(useArray) == 1):
        return useArray
    # Base case. Returns the space-joined elements of useArray as a list with one string.
    if (len(useArray) == 3):
        return [' '.join(useArray)]
    # Checks to see if parentheses are present in the expression & sets.
    # Counts number of parentheses & keeps track of first ( found.
    parentheses = 0
    leftIdx = -1
    # This try/except block is essentially an if/else block. Since useArray.index('(') raises a ValueError
    # if it can't find '(' in useArray, the next line is not carried out, and parentheses is not incremented.
    try:
        leftIdx = useArray.index('(')
        parentheses += 1
    except Exception:
        pass
    # If a ValueError was raised, leftIdx = -1 and rightIdx = parentheses = 0.
    rightIdx = leftIdx + 1
    while (parentheses > 0):
        if (useArray[rightIdx] == '('):
            parentheses += 1
        elif (useArray[rightIdx] == ')'):
            parentheses -= 1
        rightIdx += 1
    # Provided parentheses pair isn't empty, runs contents through again; else, removes the parentheses
    if (leftIdx > -1 and rightIdx - leftIdx > 2):
        return reduceExpr(useArray[:leftIdx] + [' '.join(['(',reduceExpr(useArray[leftIdx+1:rightIdx-1])[0],')'])] + useArray[rightIdx:])
    elif (leftIdx > -1):
        return reduceExpr(useArray[:leftIdx] + useArray[rightIdx:])
    # If operator is + or -, hold the first two elements and process the rest of the list first
    if isAddSub(useArray[1]):
        return reduceExpr(useArray[:2] + reduceExpr(useArray[2:]))
    # Else, if operator is * or /, process the first 3 elements first, then the rest of the list
    elif isMultDiv(useArray[1]):
        return reduceExpr(reduceExpr(useArray[:3]) + useArray[3:])
    # Just placed this so the function always has an explicit return (since this was called by yet another function).
    return None
You need much more processing before you go into operations on symbols. The form you want to get to is a tree of operations with values in the leaf nodes. First you need to do a lexer run on the string to get elements - although if you always have space-separated elements it might be enough to just split the string. Then you need to parse that array of tokens using some grammar you require.
If you need theoretical information about grammars and parsing text, start here: http://en.wikipedia.org/wiki/Parsing
If you need something more practical, go to https://github.com/pyparsing/pyparsing (you don't have to use the pyparsing module itself, but their documentation has a lot of interesting info) or http://www.nltk.org/book
From 2 * ( ( 9 / 6 ) + 6 * x ), you need to get to a tree like this:
        *
    2       +
        /       *
      9   6   6   x
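One possible way to get from the space-delimited string to such a tree is a small recursive-descent parser. The sketch below is an illustration under simplifying assumptions: it represents each operation as an (operator, left, right) tuple, treats anything that is not a number or an operator as a symbol, and has no unary minus. The function name parse is chosen for this example.

```python
def parse(tokens):
    """Turn a token list into nested (operator, left, right) tuples."""
    position = 0

    def parse_atom():
        # A number, a symbol, or a parenthesised sub-expression.
        nonlocal position
        if position >= len(tokens):
            raise ValueError("unexpected end of input")
        token = tokens[position]
        position += 1
        if token == "(":
            node = parse_sum()
            if position >= len(tokens) or tokens[position] != ")":
                raise ValueError("expected ')'")
            position += 1
            return node
        if token in ("+", "-", "*", "/", ")"):
            raise ValueError("unexpected token %r" % token)
        if token.replace(".", "", 1).isdigit():
            return float(token)
        return token  # anything else is kept as a symbol, e.g. 'x'

    def parse_product():
        # '*' and '/' bind tighter than '+' and '-', so they are handled one level lower.
        nonlocal position
        node = parse_atom()
        while position < len(tokens) and tokens[position] in ("*", "/"):
            operator = tokens[position]
            position += 1
            node = (operator, node, parse_atom())
        return node

    def parse_sum():
        nonlocal position
        node = parse_product()
        while position < len(tokens) and tokens[position] in ("+", "-"):
            operator = tokens[position]
            position += 1
            node = (operator, node, parse_product())
        return node

    tree = parse_sum()
    if position != len(tokens):
        raise ValueError("unexpected token %r" % tokens[position])
    return tree

tree = parse("2 * ( ( 9 / 6 ) + 6 * x )".split())
print(tree)
```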
Then you can visit each node and decide if you want to simplify it. Constant operations will be the simplest ones to eliminate: because all children of the "/" node are constants, just compute the result and replace that node with 1.5.
There are many strategies to continue, but essentially you need to find a way to go through the tree and modify it until there's nothing left to change.
If you then want to print the result, just walk the tree again and produce an expression which describes it.
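The simplest of these passes, folding constant subtrees and then printing the tree, could be sketched as follows. The tuple representation (operator, left, right) and the names fold and render are assumptions for this example. Distribution and cancelling terms such as x * 6 / x would need further rewrite rules that are not shown, and division by zero is not handled.

```python
import operator

OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def fold(node):
    """Replace every subtree whose two children are numbers with its computed value."""
    if not isinstance(node, tuple):
        return node  # leaf: a number or a symbol
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPERATIONS[op](left, right)
    return (op, left, right)

def render(node):
    """Walk the tree and produce a space-delimited, fully parenthesised expression."""
    if isinstance(node, tuple):
        op, left, right = node
        return "( %s %s %s )" % (render(left), op, render(right))
    if isinstance(node, (int, float)):
        return "%g" % node
    return node

tree = ("*", 2, ("+", ("/", 9, 6), ("*", 6, "x")))
folded = fold(tree)
print(render(folded))  # ( 2 * ( 1.5 + ( 6 * x ) ) )
```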
If you are parsing expressions in Python, you might consider Python syntax for the expressions and parse them using the ast module (AST = abstract syntax tree).
The advantages of using Python syntax: you don't have to make a separate language for the purpose, the parser is built in, and so is the evaluator. Disadvantages: there's quite a lot of extra complexity in the parse tree that you don't need (you can avoid some of it by using the built-in NodeVisitor and NodeTransformer classes to do your work).
>>> import ast
>>> a = ast.parse('x**2 + x', mode='eval')
>>> ast.dump(a)
"Expression(body=BinOp(left=BinOp(left=Name(id='x', ctx=Load()), op=Pow(),
right=Num(n=2)), op=Add(), right=Name(id='x', ctx=Load())))"
Here's an example class that walks a Python parse tree and does recursive constant folding (for binary operations), to show you the kind of thing you can do fairly easily.
from ast import *
class FoldConstants(NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.left, Num) and isinstance(node.right, Num):
            expr = copy_location(Expression(node), node)
            value = eval(compile(expr, '<string>', 'eval'))
            return copy_location(Num(value), node)
        else:
            return node
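The class above relies on ast.Num, which has been deprecated since Python 3.8 and was removed in Python 3.12 in favour of ast.Constant. A sketch of the same idea for newer Python versions might look like this. The name FoldNumericConstants and the operator table are choices made for this example, and only a handful of binary operators are covered.

```python
import ast
import operator

# Map AST operator node types to the functions that implement them.
BINARY_OPERATIONS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}

def is_number(node):
    """True for a Constant node holding an int or float (but not a bool)."""
    return (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool))

class FoldNumericConstants(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the children first
        operation = BINARY_OPERATIONS.get(type(node.op))
        if operation and is_number(node.left) and is_number(node.right):
            try:
                value = operation(node.left.value, node.right.value)
            except (ArithmeticError, ValueError):
                return node  # e.g. division by zero: leave the node unfolded
            return ast.copy_location(ast.Constant(value=value), node)
        return node

tree = ast.parse("3**2 - 5 + x", mode="eval")
folded = FoldNumericConstants().visit(tree)
print(ast.dump(folded))
```

Using an explicit operator table instead of compile and eval keeps the set of foldable operations visible and easy to restrict.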
>>> ast.dump(FoldConstants().visit(ast.parse('3**2 - 5 + x', mode='eval')))
"Expression(body=BinOp(left=Num(n=4), op=Add(), right=Name(id='x', ctx=Load())))"