Is it safe to parse the Abstract Syntax Trees of untrusted code? - python

Is it OK to use the ast module to parse and modify untrusted external Python code programmatically?
I will just parse the source code, extract some information from it (docstrings, function definitions, maybe more, I don't know yet) and leave it at that; I won't compile or run it.

If you're using the ast.parse function, then it should be safe. As the documentation says, this function will
Parse the source into an AST node. Equivalent to compile(source, filename, mode, ast.PyCF_ONLY_AST)
which simply parses the source, even if it contains malicious code; it doesn't do any sort of evaluation.
If your aim is to evaluate expressions, then you can use ast.literal_eval, which is safer than the built-in eval function.
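A quick sketch of both points (the os.system string is just a stand-in for something malicious):
import ast

# Parsing builds a tree that describes the source; nothing in it runs.
tree = ast.parse("import os; os.system('something nasty')")
print(ast.dump(tree))  # prints a Module(...) description of the code

# literal_eval evaluates literal expressions only...
print(ast.literal_eval("{'a': [1, 2], 'b': (3, 4)}"))

# ...and refuses anything else instead of executing it.
ast.literal_eval("os.system('ls')")  # raises ValueError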

"Unsafe" implies something bad could happen controlled by the artifact you are engaging. Since parsing only builds ASTs, and (assuming there isn't something malicious in the parsing and AST building code), then parsing an arbitrary bit of text can't hurt you.
Typically to get malicious behaviour from the outside, something (controlled by you) must essentially execute some supplied code. Clearly building a parse tree doesn't execute the outside program. However, if you built an interpreter that interpreted the parse tree and ran it, you might have a problem.

I believe so. No code is executed. In fact, parsing to an AST is exactly the first thing ast.literal_eval does, and that function is deemed safe.

Related

How to make a .py script readable by another Python script?

I want to build a program (in Python) which analyses other Python scripts. Therefore I need a way to make .py files readable for a Python program.
I thought about simply converting .py to .txt and then using the .startswith and .find string methods. Is there a way to convert .py to .txt?
Also feel free to suggest other ways of analysing. What's important is that structures like if-statements, loops, and indentation levels get figured out.
If you want to preserve this kind of structure in exactly the same way that Python would itself parse the file, you should use the standard library ast module (https://docs.python.org/3/library/ast.html). AST means "abstract syntax tree": the representation of the code as Python understands it.
The basic usage pattern is to call ast.parse (https://docs.python.org/3/library/ast.html#ast.parse) on the source text of the file you want to analyse. You'll get back an object (https://docs.python.org/3/library/ast.html#ast.AST) which is the top of a tree of AST nodes.
You may be interested in picking through the source code for black (https://github.com/psf/black), which is a Python code formatter that uses ast to validate that the formatted code has the exact same behavior as the code it was originally run on.
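A minimal sketch of that usage pattern (example.py is a placeholder path):
import ast

# Read the file's source text and parse it into a tree.
with open("example.py") as f:
    tree = ast.parse(f.read())

# Walk every node, picking out the structures mentioned above.
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        print("function", node.name, "at line", node.lineno,
              "docstring:", ast.get_docstring(node))
    elif isinstance(node, (ast.If, ast.For, ast.While)):
        print(type(node).__name__, "at line", node.lineno)
Each node also carries lineno and col_offset attributes, which is how you can recover nesting and indentation information.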

Add own realtime custom parser to Python to generate and compile AST

My task is to add a switch statement and remove the mandatory colons from functions, classes, and loops in Python.
Maybe also to add some other nice features from CoffeeScript.
The .py files with custom syntax must be importable by the Python interpreter, then parsed with a custom parser (just like the CoffeeScript compiler does).
(I already have a little experience from adding a Python-like "for" syntax to an existing custom parser and fixing several bugs there, but it takes a long time to read and understand all that code, so I decided to ask for advice first.)
I searched the internet for a long time and found several helpful answers, but I still don't know the best way to implement this.
Some from what I found:
Parse a .py file, read the AST, modify it, then write back the modified source code
Python's tokenize module
Python's ast module
Python's c-like preprocessor with import hook
What I think to do:
Rewrite the CoffeeScript parser or a Python parser in pure Python
Make an import hook that parses files to an AST with my own parser (roughly like the sketch after this list)
Continue the import (compile the AST and import it as a module)
(like CoffeeScript does it)
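Roughly, I picture the import-hook part like this sketch (MyParser is the parser I would still have to write, and the .cpy suffix is made up):
import importlib.machinery
import sys

class CustomSyntaxLoader(importlib.machinery.SourceFileLoader):
    def source_to_code(self, data, path):
        # Run my own parser instead of Python's, producing an ast.Module,
        # then compile that AST to bytecode as usual.
        tree = MyParser().parse(data.decode("utf-8"))  # MyParser is hypothetical
        return compile(tree, path, "exec")

# Register the loader for the custom file extension on sys.path.
details = (CustomSyntaxLoader, [".cpy"])
sys.path_hooks.insert(0, importlib.machinery.FileFinder.path_hook(details))
sys.path_importer_cache.clear()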
So I have these questions:
- Is there a Python parser written in Python (so that I don't have to rewrite the whole CoffeeScript parser)?
- Is there any way to build ast.AST nodes from my own parser without rewriting the ast library from C into Python?
- How can I do this better and more easily? (Modifying Python's sources is out; everything must happen at runtime and be fully compatible with all other Python interpreters.)
- Are there already libraries that help with modifying Python's syntax?
Thank you very much.
Best regards, Serj.

Evaluate string arithmetic expression using Python

How can I evaluate the following:
a=b=c=d=e=0 # initially
If the user enters "a=b=4" as a string, it should modify the existing values, so the result would be something like a=4, b=4.
If the user enters "a=(c=4)*2", it should evaluate it as an expression and update the values, so the result would be something like a=8, c=4.
The parentheses can be nested further.
Any help would be really appreciated. I am using python.
If this is a trivial console script where security is absolutely no concern, you can use exec() to execute mathematical statements. However, Python does not support code such as a=(c=4)*2, so it won't be possible to do natively.
However, exec() is a gaping security hole if this is running on, say, a web server. If untrusted and potentially malicious users can submit commands, you should look into either sanitizing and parsing the input yourself, or implementing sandboxing.
TL;DR: Since you're working with custom syntax that Python doesn't support anyway, you should write your own parser to handle and execute these commands, without ever worrying about executing untrusted code.
Probably the best way is writing a parser for this particular grammar (the worst is anything involving eval/exec; submitting user-provided content to these functions is a security can of worms).
Take a look at this example:
http://pyparsing.wikispaces.com/file/detail/SimpleCalc.py
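For a sense of scale, here is a rough hand-rolled sketch of a recursive-descent evaluator for exactly this grammar (chained and parenthesized assignment plus + - * /); every name in it is made up for the example:
import re

TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(\S))")

def tokenize(text):
    # Yield (kind, value) pairs: numbers, names, single-character operators.
    for num, name, op in TOKEN.findall(text):
        if num:
            yield ("num", int(num))
        elif name:
            yield ("name", name)
        else:
            yield ("op", op)
    yield ("end", None)

class Evaluator:
    def __init__(self, text, env):
        self.toks = list(tokenize(text))
        self.pos = 0
        self.env = env  # maps variable names to values

    def next(self):
        tok = self.toks[self.pos]
        self.pos += 1
        return tok

    def peek(self, ahead=0):
        return self.toks[self.pos + ahead]

    def run(self):
        value = self.assignment()
        if self.peek()[0] != "end":
            raise SyntaxError("trailing input")
        return value

    def assignment(self):
        # NAME '=' assignment; an assignment evaluates to the assigned value,
        # which is what makes both a=b=4 and a=(c=4)*2 work.
        if self.peek()[0] == "name" and self.peek(1) == ("op", "="):
            name = self.next()[1]
            self.next()  # consume '='
            self.env[name] = value = self.assignment()
            return value
        return self.expr()

    def expr(self):
        value = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            op = self.next()[1]
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in (("op", "*"), ("op", "/")):
            op = self.next()[1]
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        kind, val = self.next()
        if kind == "num":
            return val
        if kind == "name":
            return self.env[val]
        if (kind, val) == ("op", "("):
            value = self.assignment()  # parentheses may contain an assignment
            if self.next() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError("unexpected token %r" % (val,))

env = dict.fromkeys("abcde", 0)
Evaluator("a=b=4", env).run()
Evaluator("a=(c=4)*2", env).run()
print(env)  # a=8, b=4, c=4, d=0, e=0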

Partial evaluation for parsing

I'm working on a macro system for Python (as discussed here) and one of the things I've been considering are units of measure. Although units of measure could be implemented without macros or via static macros (e.g. defining all your units ahead of time), I'm toying around with the idea of allowing syntax to be extended dynamically at runtime.
To do this, I'm considering using a sort of partial evaluation on the code at compile-time. If parsing fails for a given expression, due to a macro for its syntax not being available, the compiler halts evaluation of the function/block and generates the code it already has with a stub where the unknown expression is. When this stub is hit at runtime, the function is recompiled against the current macro set. If this compilation fails, a parse error would be thrown because execution can't continue. If the compilation succeeds, the new function replaces the old one and execution continues.
The biggest issue I see is that you can't find parse errors until the affected code is run. However, this wouldn't affect many cases, e.g. group operators like [], {}, (), and `` still need to be paired (requirement of my tokenizer/list parser), and top-level syntax like classes and functions wouldn't be affected since their "runtime" is really load time, where the syntax is evaluated and their objects are generated.
Aside from the implementation difficulty and the problem I described above, what problems are there with this idea?
Here are a few possible problems:
You may find it difficult to provide the user with helpful error messages in case of a problem. This seems likely, as any compilation-time syntax error could be just a syntax extension.
Performance hit.
I was trying to find some discussion of the pluses, minuses, and/or implementation of dynamic parsing in Perl 6, but I couldn't find anything appropriate. However, you may find this quote from Niklaus Wirth (designer of Pascal and other languages) interesting:
The phantasies of computer scientists in the 1960s knew no bounds. Spurred by the success of automatic syntax analysis and parser generation, some proposed the idea of the flexible, or at least extensible language. The notion was that a program would be preceded by syntactic rules which would then guide the general parser while parsing the subsequent program. A step further: the syntax rules would not only precede the program, but they could be interspersed anywhere throughout the text. For example, if someone wished to use a particularly fancy private form of for statement, he could do so elegantly, even specifying different variants for the same concept in different sections of the same program. The concept that languages serve to communicate between humans had been completely blended out, as apparently everyone could now define his own language on the fly. The high hopes, however, were soon damped by the difficulties encountered when trying to specify what these private constructions should mean. As a consequence, the intriguing idea of extensible languages faded away rather quickly.
Edit: Here's Perl 6's Synopsis 6: Subroutines, unfortunately in markup form because I couldn't find an updated, formatted version; search within for "macro". Unfortunately, it's not too interesting, but you may find some things relevant, like Perl 6's one-pass parsing rule, or its syntax for abstract syntax trees. The approach Perl 6 takes is that a macro is a function that executes immediately after its arguments are parsed and returns either an AST or a string; Perl 6 continues parsing as if the source actually contained the return value. There is mention of generation of error messages, but they make it seem like if macros return ASTs, you can do alright.
Pushing this one step further, you could do "lazy" parsing and only ever parse enough to evaluate the next statement, like some kind of just-in-time parser. Syntax errors would then become normal runtime errors that raise an ordinary Exception which could be handled by surrounding code:
def fun():
    not implemented yet

try:
    fun()
except:
    pass
That would be an interesting effect, but whether it's useful or desirable is a different question. Generally it's good to know about errors even if you don't call the affected code at the moment.
Macros would not be evaluated until control reaches them, and naturally the parser would already know all previous definitions. The macro definitions could maybe even use variables and data that the program has calculated so far (like adding some syntax for all elements in a previously computed list). But it's probably a bad idea to start writing self-modifying programs for things that could usually be done just as well directly in the language. It could get confusing...
In any case, you should make sure to parse code only once, and if it is executed a second time, reuse the already parsed expression, so that this doesn't lead to performance problems.
Here are some ideas from my master's thesis, which may or may not be helpful.
The thesis was about robust parsing of natural language.
The main idea: given a context-free grammar for a language, try to parse a given text (or, in your case, a Python program). If parsing fails, you will have a partially generated parse tree. Use the tree structure to suggest new grammar rules that will better cover the parsed text.
I could send you my thesis, but unless you read Hebrew this will probably not be useful.
In a nutshell:
I used a bottom-up chart parser. This type of parser generates edges for productions from the grammar. Each edge is marked with the part of the tree that was consumed. Each edge gets a score according to how close it was to full coverage, for example:
S -> NP . VP
Has a score of one half (We succeeded in covering the NP but not the VP).
The highest-scored edges suggest a new rule (such as X->NP).
In general, a chart parser is less efficient than a common LALR or LL parser (the types usually used for programming languages) - O(n^3) instead of O(n) complexity, but then again you are trying something more complicated than just parsing an existing language.
If you can do something with the idea, I can send you further details.
I believe looking at natural language parsers may give you some other ideas.
Another thing I've considered is making this the default behavior across the board, but allowing languages (meaning a set of macros to parse a given language) to throw a parse error at compile-time. Python 2.5 in my system, for example, would do this.
Instead of the stub idea, simply recompile functions that couldn't be handled completely at compile-time when they're executed. This will also make self-modifying code easier, as you can modify the code and recompile it at runtime.
You'll probably need to delimit the bits of input text with unknown syntax, so that the rest of the syntax tree can be resolved, apart from some character sequences nodes which will be expanded later. Depending on your top level syntax, that may be fine.
You may find that the parsing algorithm and the lexer and the interface between them all need updating, which might rule out most compiler creation tools.
(The more usual approach is to use string constants for this purpose, which can then be parsed by a little interpreter at run time.)
I don't think your approach would work very well. Let's take a simple example written in pseudo-code:
define some syntax M1 with definition D1
if _whatever_:
    define M1 to do D2
else:
    define M1 to do D3
code that uses M1
So there is one example where, if you allow syntax redefinition at runtime, you have a problem (since by your approach the code that uses M1 would have been compiled with definition D1). Note that verifying whether syntax redefinition occurs is undecidable. An over-approximation could be computed by some kind of type system or some other kind of static analysis, but Python is not well known for this :D.
Another thing that bothers me is that your solution does not 'feel' right. I find it evil to store source code you can't parse just because you may be able to parse it at runtime.
Another example that jumps to mind is this:
...function definition fun1 that calls fun2...
define M1 (at runtime)
use M1
...function definition for fun2
Technically, when you reach the use of M1, you cannot parse it, so you need to keep the rest of the program (including the function definition of fun2) as source code. And when you run the entire program, you'll hit a call to fun2 that you cannot execute, even though it is defined, because its definition was never compiled.

Is "safe_eval" really safe?

I'm looking for a "safe" eval function, to implement spreadsheet-like calculations (using numpy/scipy).
The functionality to do this (the rexec module) has been removed from Python since 2.3, due to apparently unfixable security problems. There are several third-party hacks out there that purport to do this - the most thought-out solution that I have found is
this Python Cookbook recipe, "safe_eval".
Am I reasonably safe if I use this (or something similar), to protect from malicious code, or am I stuck with writing my own parser? Does anyone know of any better alternatives?
EDIT: I just discovered RestrictedPython, which is part of Zope. Any opinions on this are welcome.
Depends on your definition of safe I suppose. A lot of the security depends on what you pass in and what you are allowed to pass in the context. For instance, if a file is passed in, I can open arbitrary files:
>>> names['f'] = open('foo', 'w+')
>>> safe_eval.safe_eval("baz = type(f)('baz', 'w+')", names)
>>> names['baz']
<open file 'baz', mode 'w+' at 0x413da0>
Furthermore, the environment is very restricted (you cannot pass in modules); thus, you can't simply pass in a module of utility functions like re or random.
On the other hand, you don't need to write your own parser; you could just write your own evaluator for the Python AST:
>>> import compiler
>>> ast = compiler.parse("print 'Hello world!'")
That way, hopefully, you could implement safe imports. The other idea is to use Jython or IronPython and take advantage of the Java/.NET sandboxing capabilities.
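The compiler module shown above is Python 2 only; in today's standard library the same idea uses ast. Here is a rough sketch of such an evaluator that whitelists only arithmetic nodes (illustrative only, not a vetted sandbox):
import ast
import operator

# Map whitelisted AST operator types to their implementations.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def eval_arith(source):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        # Anything else (names, calls, attributes, ...) is rejected outright.
        raise ValueError("disallowed syntax: " + type(node).__name__)
    return walk(ast.parse(source, mode="eval"))

print(eval_arith("(1 + 2) * -3"))  # -9
eval_arith("__import__('os')")     # raises ValueError: disallowed syntax: Call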
Writing your own parser could be fun! It might be a better option because people are expecting to use the familiar spreadsheet syntax (Excel, etc) and not Python when they're entering formulas. I'm not familiar with safe_eval but I would imagine that anything like this certainly has the potential for exploitation.
If you simply need to write down and read back some data structure in Python, and don't need the actual capacity to execute custom code, this one is a better fit:
http://code.activestate.com/recipes/364469-safe-eval/
It guarantees that no code is executed; only static data structures are evaluated: strings, lists, tuples, dictionaries.
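For what it's worth, ast.literal_eval in the standard library (Python 2.6 and later) does the same job:
import ast

# Only literals (strings, numbers, tuples, lists, dicts, booleans, None)
# are accepted; anything executable raises ValueError instead of running.
data = ast.literal_eval("{'title': 'sheet1', 'rows': [1, 2, 3]}")
print(data["rows"])  # [1, 2, 3]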
Although that code looks quite secure, I've always held the opinion that any sufficiently motivated person could break it given adequate time. I do think it will take quite a bit of determination to get through that, but I'm relatively sure it could be done.
Daniel,
Jinja implements a sandboxed environment that may or may not be useful to you. From what I remember, it doesn't yet "comprehend" list comprehensions.
Sandbox info
The functionality you want is in the compiler language services, see
http://docs.python.org/library/language.html
If you define your app to accept only expressions, you can compile the input in expression mode and get an exception if it is not one, e.g. if it contains semicolons or statement forms.
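For example (a sketch; compiling in 'eval' mode never executes anything, it only checks that the input parses as a single expression):
# Statements are rejected before anything could possibly run.
source = "import os; os.remove('something')"
try:
    code = compile(source, "<user input>", "eval")
except SyntaxError:
    print("rejected: input is not a pure expression")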
