How can I allow users to execute mathematical expressions in a safe way?
Do I need to write a full parser?
Is there something like ast.literal_eval(), but for expressions?
The examples provided with Pyparsing include several expression parsers:
https://github.com/pyparsing/pyparsing/blob/master/examples/fourFn.py is a conventional arithmetic infix notation parser/evaluator implementation using pyparsing. (Despite its name, this actually does 5-function arithmetic, plus several trig functions.)
https://github.com/pyparsing/pyparsing/blob/master/examples/simpleBool.py is a boolean infix notation parser/evaluator, using a pyparsing helper method operatorPrecedence, which simplifies the definition of infix operator notations.
https://github.com/pyparsing/pyparsing/blob/master/examples/simpleArith.py and https://github.com/pyparsing/pyparsing/blob/master/examples/eval_arith.py recast fourFn.py using operatorPrecedence. The first just parses and returns a parse tree; the second adds evaluation logic.
If you want a more pre-packaged solution, look at plusminus, a pyparsing-based extensible arithmetic parsing package.
What sort of expressions do you want? Variable assignment? Function evaluation?
SymPy aims to become a full-fledged Python CAS.
Few weeks ago I did similar thing, but for logical expressions (or, and, not, comparisons, parentheses etc.). I did this using Ply parser. I have created simple lexer and parser. Parser generated AST tree that was later use to perform calculations. Doing this in that way allow you to fully control what user enter, because only expressions that are compatible with grammar will be parsed.
Yes. Even if there were an equivalent of ast.literal_eval() for expressions, a Python expression can be lots of things other than just a pure mathematical expression, for example an arbitrary function call.
It wouldn't surprise me if there's already a good mathematical expression parser/evaluator available out there in some open-source module, but if not, it's pretty easy to write one of your own.
maths functions will consist of numeric and punctuation characters, possible 'E' or 'e' if you allow scientific notation for rational numbers, and the only (other) legal use of alpha characters will be if you allow/provide specific maths functions (e.g. stddev). So, should be trivial to run along the string for alpha characters and check the next little bit isn't suspicious, then simply eval the string in a try/except block.
Re the comments this reply has received... I agree this approach is playing with fire. Still, that doesn't mean it can't be done safely. I'm new to python (< 2 months), so may not know the workarounds to which this is vulnerable (and of course a new Python version could always render the code unsafe in the future), but - for what little it's worth (mainly my own amusement) - here's my crack at it:
def evalMaths(s):
i = 0
while i < len(s):
while s[i].isalpha() and i < len(s):
idn += s[i]
i += 1
if (idn and idn != 'e' and idn != 'abs' and idn != 'round'):
raise Exception("you naughty boy: don't " + repr(idn))
else:
i += 1
return eval(s)
I would be very interested to hear if/how it can be circumvented... (^_^) BTW / I know you can call functions like abs2783 or _983 - if they existed, but they won't. I mean something practical.
In fact, if anyone can do so, I'll create a question with 200 bounty and accept their answer.
Related
Is it possible to do the opposite of eval() in python which treats inputs as literal strings? For example f(Node(2)) -> 'Node(2)' or if there's some kind of higher-order operator that stops interpreting the input like lisp's quote() operator.
The answer to does Python have quote is no, Python does not have quote. There's no homoiconicity, and no native infrastructure to represent the languages own source code.
Now for what is the opposite of eval: you seem to be missing some part the picture here. Python's eval is not the opposite of quote. There's three types values can be represented as: as unparsed strings, as expressions, and as values (of any type).
quote and LISP-style "eval" convert between expressions and values.
Parsing and pretty printing form another pair between expressions and strings.
Python's actual eval goes directly from strings to values, and repr goes back (in the sense of printing a parsable value1).
Since SO does not support drawing diagrams:
So if you are looking for the opposite of eval, it's either trivially repr, but that might not really help you, or it would be the composition of quote and pretty priniting, if there were quoting in Python.
This composition with string functions a bit cumbersome, which is why LISPs let the reader deal with parsing once and for all, and from that point on only go between expressions and values. I don't know any language that has such a "quote + print" function, actually (C#'s nameof comes close, but is restricted to names).
Some caveats about the diagram:
It does not commute: you have deal with pure values or disregard side effects, for one, and most importantly quote is not really a function in some sense, of course. It's a special form. That's why repr(x) is not the same as prettyprint(quote(x)), in general.
It's not depicting pairs of isomorphisms. Multiple expressions can eval to the same value etc.
1That's not always the case in reality, but that's what it's suppsedly there for.
I want users to input math formula in my system. How can convert case1 formula to case2 formula using Python? In another word, I would like to change math order specifically for double asterisks.
#case1
3*2**3**2*5
>>>7680
#case2
3*(2**3)**2*5
>>>960
Not only is this not something that Python supports, but really, why would you want to? Modifying BIDMAS or PEMDAS (depending on your location), would not only give you incorrect answers, but also confuse the hell out of any devs looking at the code.
Just use brackets like in Case 2, it's what they're for.
If users are supposed to enter formulas into your program, I would suggest keeping it as is. The reason is that exponentiation in mathematics is right-associative, meaning the execution goes from the top level down. For example: a**b**c = a**(b**c), by convention.
There are some programs that use bottom-up resolution of the stacked exponentiation -- MS Excel and LibreOffice are some of them, however, it is against the regular convention, and always confused the hell out of me.
If you would like to override this behavior, and still be mathematically correct, you have to use brackets.
You can always declare your own power method that would resolve the way you want it -- something like numpy.pow(). You could overload the built-in, but that's too much hastle.
Read this
Below is the example to achieve this using re as:
expression = '3*2**3**2*5'
asterisk_exprs = re.findall(r"\d+\*\*\d+", expression) # List of all expressions matching the criterion
for expr in asterisk_exprs:
expression = expression.replace(expr, "({})".format(expr)) # Replace the expression with required expression
# Value of variable 'expression': '3*(2**3)**2*5'
In order to evaluate the mathematical value of str expression, use eval as:
>>> eval(expression)
960
I am teaching some neighborhood kids to program in Python. Our first project is to convert a string given as a Roman numeral to the Arabic value.
So we developed an function to evaluate a string that is a Roman numeral the function takes a string and creates a list that has the Arabic equivalents and the operations that would be done to evaluate to the Arabic equivalent.
For example suppose you fed in XI the function will return [1,'+',10]
If you fed in IX the function will return [10,'-',1]
Since we need to handle the cases where adjacent values are equal separately let us ignore the case where the supplied value is XII as that would return [1,'=',1,'+',10] and the case where the Roman is IIX as that would return [10,'-',1,'=',1]
Here is the function
def conversion(some_roman):
roman_dict = {'I':1,'V':5,'X':10,'L':50,'C':100,'D':500,'M',1000}
arabic_list = []
for letter in some_roman.upper():
if len(roman_list) == 0:
arabic_list.append(roman_dict[letter]
continue
previous = roman_list[-1]
current_arabic = roman_dict[letter]
if current_arabic > previous:
arabic_list.extend(['+',current_arabic])
continue
if current_arabic == previous:
arabic_list.extend(['=',current_arabic])
continue
if current_arabic < previous:
arabic_list.extend(['-',current_arabic])
arabic_list.reverse()
return arabic_list
the only way I can think to evaluate the result is to use eval()
something like
def evaluate(some_list):
list_of_strings = [str(item) for item in some_list]
converted_to_string = ''.join([list_of_strings])
arabic_value = eval(converted_to_string)
return arabic_value
I am a little bit nervous about this code because at some point I read that eval is dangerous to use in most circumstances as it allows someone to introduce mischief into your system. But I can't figure out another way to evaluate the list returned from the first function. So without having to write a more complex function.
The kids get the conversion function so even if it looks complicated they understand the process of roman numeral conversion and it makes sense. When we have talked about evaluation though I can see they get lost. Thus I am really hoping for some way to evaluate the results of the conversion function that doesn't require too much convoluted code.
Sorry if this is warped, I am so . . .
Is there a way to accomplish what eval does without using eval
Yes, definitely. One option would be to convert the whole thing into an ast tree and parse it yourself (see here for an example).
I am a little bit nervous about this code because at some point I read that eval is dangerous to use in most circumstances as it allows someone to introduce mischief into your system.
This is definitely true. Any time you consider using eval, you need to do some thinking about your particular use-case. The real question is how much do you trust the user and what damage can they do? If you're distributing this as a script and users are only using it on their own computer, then it's really not a problem -- After all, they don't need to inject malicious code into your script to remove their home directory. If you're planning on hosting this on your server, that's a different story entirely ... Then you need to figure out where the string comes from and if there is any way for the user to modify the string in a way that could make it untrusted to run. Hackers are pretty clever1,2 and so hosting something like this on your server is generally not a good idea. (I always assume that the hackers know python WAY better than I do).
1http://blog.delroth.net/2013/03/escaping-a-python-sandbox-ndh-2013-quals-writeup/
2http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html
The only implementation of a safe expression evalulator that I've come across is:
https://pypi.org/project/simpleeval/
It supports a lot of basic Python-ish expressions and is quite restricted in what it allows you to do (so you don't blow up the interpreter or do something evil). It uses the python ast module for parsing, and evaluates the result itself.
Example:
from simpleeval import simple_eval
simple_eval("21 + 21")
Then you can extend it and give it access to the parts of your program that you want to:
simple_eval("x + y", names={"x": 22, "y": 48})
or
simple_eval("do_thing(11)", functions={"do_thing": my_callback})
and so on.
I'm parsing a xml file in which I get basic expressions (like id*10+2). What I am trying to do is to evaluate the expression to actually get the value. To do so, I use the eval() method which works very well.
The only thing is the numbers are in fact hexadecimal numbers. The eval() method could work well if every hex number was prefixed with '0x', but I could not find a way to do it, neither could I find a similar question here. How would it be done in a clean way ?
Use the re module.
>>> import re
>>> re.sub(r'([\dA-F]+)', r'0x\1', 'id*A+2')
'id*0xA+0x2'
>>> eval(re.sub(r'([\dA-F]+)', r'0x\1', 'CAFE+BABE'))
99772
Be warned though, with an invalid input to eval, it won't work. There are also many risks of using eval.
If your hex numbers have lowercase letters, then you could use this:
>>> re.sub(r'(?<!i)([\da-fA-F]+)', r'0x\1', 'id*a+b')
'id*0xa+0xb'
This uses a negative lookbehind assertion to assure that the letter i is not before the section it is trying to convert (preventing 'id' from turning into 'i0xd'. Replace i with I if the variable is Id.
If you can parse expresion into individual numbers then I would suggest to use int function:
>>> int("CAFE", 16)
51966
Be careful with eval! Do not ever use it in untrusted inputs.
If it's just simple arithmetic, I'd use a custom parser (there are tons of examples out in the wild)... And using parser generators (flex/bison, antlr, etc.) is a skill that is useful and easily forgotten, so it could be a good chance to refresh or learn it.
One option is to use the parser module:
import parser, token, re
def hexify(ast):
if not isinstance(ast, list):
return ast
if ast[0] in (token.NAME, token.NUMBER) and re.match('[0-9a-fA-F]+$', ast[1]):
return [token.NUMBER, '0x' + ast[1]]
return map(hexify, ast)
def hexified_eval(expr, *args):
ast = parser.sequence2st(hexify(parser.expr(expr).tolist()))
return eval(ast.compile(), *args)
>>> hexified_eval('id*10 + BABE', {'id':0xcafe})
567466
This is somewhat cleaner than a regex solution in that it only attempts to replace tokens that have been positively identified as either names or numbers (and look like hex numbers). It also correctly handles more general python expressions such as id*10 + len('BABE') (it won't replace 'BABE' with '0xBABE').
OTOH, the regex solution is simpler and might cover all the cases you need to deal with anyway.
I have a source code in Fortran (almost irrelevant) and I want to parse the function names and arguments.
eg using
(\w+)\([^\(\)]+\)
with
a(b(1 + 2 * 2), c(3,4))
I get the following: (as expected)
b, 1 + 2 * 2
c, 3,4
where I would need
a, b(1 + 2 * 2), c(3,4)
b, 1 + 2 * 2
c, 3,4
Any suggestions?
Thanks for your time...
It can be done with regular expressions-- use them to tokenize the string, and work with the tokens. i.e. see re.Scanner. Alternatively, just use pyparsing.
This is a nonlinear grammar -- you need to be able to recurse on a set of allowed rules. Look at pyparsing to do simple CFG (Context Free Grammar) parsing via readable specifications.
It's been a while since I've written out CFGs, and I'm probably rusty, so I'll refer you to the Python EBNF to get an idea of how you can construct one for a subset of a language syntax.
Edit: If the example will always be simple, you can code a small state machine class/function that iterates over the tokenized input string, as #Devin Jeanpierre suggests.
You can take a look at PLY (Python Lex-Yacc), it's (in my opinion) very simple to use and well documented, and it comes with a calculator example which could be a good starting point.
I don't think this is a job for regular expressions... they can't really handle nested patterns.
This is because regexes are compiled into FSMs (Finite State Machines). In order to parse arbitrarily nested expressions, you can't use a FSM, because you need infinitely many states to keep track of the arbitrary nesting. Also see this SO thread.
You can't do this with regular expression only. It's sort of recursive. You should match first the most external function and its arguments, print the name of the function, then do the same (match the function name, then its arguments) with all its arguments. Regex alone are not enough.