Are there any security exploits that could occur in this scenario:
eval(repr(unsanitized_user_input), {"__builtins__": None}, {"True":True, "False":False})
where unsanitized_user_input is a str object. The string is user-generated and could be nasty. Assuming our web framework hasn't failed us, it's a real honest-to-god str instance from the Python builtins.
If this is dangerous, can we do anything to the input to make it safe?
We definitely don't want to execute anything contained in the string.
The larger context which is (I believe) not essential to the question is that we have thousands of these:
repr([unsanitized_user_input_1,
unsanitized_user_input_2,
unsanitized_user_input_3,
unsanitized_user_input_4,
...])
in some cases nested:
repr([[unsanitized_user_input_1,
unsanitized_user_input_2],
[unsanitized_user_input_3,
unsanitized_user_input_4],
...])
which are themselves converted to strings with repr(), put in persistent storage, and eventually read back into memory with eval.
Eval deserialized the strings from persistent storage much faster than pickle and simplejson. The interpreter is Python 2.5 so json and ast aren't available. No C modules are allowed and cPickle is not allowed.
It is indeed dangerous and the safest alternative is ast.literal_eval (see the ast module in the standard library). You can of course build and alter an ast to provide e.g. evaluation of variables and the like before you eval the resulting AST (when it's down to literals).
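For instance, on Python 2.6+ it behaves like this:

>>> import ast
>>> ast.literal_eval("['a', ['b', 42]]")
['a', ['b', 42]]
>>> ast.literal_eval("__import__('os')")
Traceback (most recent call last):
  ...
ValueError: malformed string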
The possible exploit of eval starts with any object it can get its hands on (say True here) and goes via .__class__ to its type object, etc. up to object, then gets its subclasses... basically it can get to ANY object type and wreak havoc. I can be more specific but I'd rather not do it in a public forum (the exploit is well known, but considering how many people still ignore it, revealing it to wannabe script kiddies could make things worse... just avoid eval on unsanitized user input and live happily ever after!-).
If you can prove beyond doubt that unsanitized_user_input is a str instance from the Python built-ins with nothing tampered, then this is always safe. In fact, it'll be safe even without all those extra arguments, since eval(repr(astr)) == astr for all such string objects. You put in a string, you get back out a string. All you did was escape and unescape it.
This all leads me to think that eval(repr(x)) isn't what you want--no code will ever be executed unless someone gives you an unsanitized_user_input object that looks like a string but isn't, but that's a different question--unless you're trying to copy a string instance in the slowest way possible :D.
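A quick sanity check of that round-trip property (a Python 2 session, matching the question's environment):

>>> s = 'nasty "user" \ninput\x00\t'
>>> eval(repr(s), {"__builtins__": None}, {}) == s
True

The literal that repr produces needs no builtins at all to evaluate, which is why the restricted dictionaries don't get in the way.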
With everything as you describe, it is technically safe to eval repred strings, however, I'd avoid doing it anyway as it's asking for trouble:
There could be some weird corner case where your assumption that only repr'd strings are stored is violated (e.g. a bug or a different pathway into the storage that skips the repr instantly becomes a code-injection exploit where it might otherwise be unexploitable).
Even if everything is OK now, assumptions might change at some point, and unsanitised data may get stored in that field by someone unaware of the eval code.
Your code may get reused (or worse, copy+pasted) into a situation you didn't consider.
As Alex Martelli pointed out, in python2.6 and higher, there is ast.literal_eval which will safely handle both strings and other simple datatypes like tuples. This is probably the safest and most complete solution.
Another possibility however is to use the string-escape codec. This is much faster than eval (about 10 times according to timeit), available in earlier versions than literal_eval, and should do what you want:
>>> s = 'he\nllo\' wo"rld\0\x03\r\n\tabc'
>>> repr(s)[1:-1].decode('string-escape') == s
True
(The [1:-1] is to strip the outer quotes repr adds.)
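The ~10x figure is easy to re-measure; a rough benchmark along these lines (Python 2, exact numbers will vary with the machine and the strings):

import timeit
setup = "r = repr('he\\nllo\\' wo\"rld\\0\\x03\\r\\n\\tabc' * 50)"
print timeit.Timer("eval(r)", setup).timeit(10000)
print timeit.Timer("r[1:-1].decode('string-escape')", setup).timeit(10000)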
Generally, you should never allow anyone to post code.
So called "paid professional programmers" have a hard-enough time writing code that actually works.
Accepting code from the anonymous public -- without benefit of formal QA -- is the worst of all possible scenarios.
Professional programmers -- without good, solid formal QA -- will make a hash of almost any web site. Indeed, I'm reverse engineering some unbelievably bad code from paid professionals.
The idea of allowing a non-professional -- unencumbered by QA -- to post code is truly terrifying.
repr([unsanitized_user_input_1,
unsanitized_user_input_2,
...
... unsanitized_user_input is a str object
You shouldn't have to serialise strings to store them in a database.
If these are all strings, as you mentioned - why can't you just store the strings in a db.StringListProperty?
The nested entries might be a bit more complicated, but why is this the case? When you have to resort to eval to get data from the database, you're probably doing something wrong..
Couldn't you store each unsanitized_user_input_x as its own db.StringProperty row, and group them by a reference field?
Either of those may not be applicable, since I've no idea what you're trying to achieve, but my point is - can you not structure the data in a way where you don't have to rely on eval (and also on it not being a security issue)?
setattr and getattr kind of got into my style of programming (mainly scientific stuff; my knowledge of Python is self-taught).
Considering that exec and eval carry a potential danger, since in some cases they can lead to security issues, I was wondering whether the same argument applies to setattr. (About getattr I found this question, which contains some info - although the argumentation is not very convincing.)
From what I know, setattr can be used without worrying too much, but to be honest I don't trust my Python knowledge enough to be sure, and if I'm wrong I'd like to try and get rid of the habit of using setattr.
Any input is very much appreciated!
First, it can definitely make an existing security hole easier to exploit.
For example, let's say you have code that does exec, eval, SQL queries or URLs built via string formatting, etc. And let's say you're passing, say, locals() or a filtered __dict__ to the formatting command or as the eval context or whatever. Using setattr clearly widens the security hole, making it much easier for me to find ways to attack your code, because you can no longer be sure what you're going to be passing to those functions.
But what if you don't do anything else unsafe? Is setattr safe then?
Not as bad, but it's still not safe. If I can influence the names of the attributes you're setting, I can, e.g., replace any method I want on your objects.
You can try to protect against this by, e.g., first checking that the old value was not callable, or not a method-type descriptor, or whatever. In the same way you can try to protect against people calling functions in eval or putting quotes and semicolons in SQL parameters and so on. This is effectively the same as any of those other cases. And it's a lot harder to try to close all illegitimate paths to an open door, than to just not open the door in the first place.
What if the name never comes from anything that can be influenced by the user?
Well, in that case, why are you using setattr? There is no good reason to call setattr with a literal.
Anyway, when Lattyware said that there are often better ways to solve the problem at hand, he was almost certainly talking about readability, maintainability, and idiomaticness. But the side effect of using those better ways is that you also often avoid any security implications.
90% of the time, the solution is to use a dict instead of an object. Unlike Javascript, they're not the same thing in Python, and they're not meant to be used the same way. A dict doesn't have methods, or inheritance, or built-in special names, so you don't have to worry about any of that. It also has a more convenient syntax, where you can say d['foo'] = value instead of setattr(o, 'foo', value). And it's probably more efficient. And so on. But ultimately, the reason to use a dict is the conceptual reason: a dict is a named collection of values; a class instance is a representation of a model-space object, and those are not the same thing.
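A sketch of the contrast (the names here are invented for illustration):

# A field name that arrives at runtime.
field_name, value = "foo", 42

# On an object: setattr works, but field_name could shadow a method
# or special attribute.
class Record(object):
    pass
r = Record()
setattr(r, field_name, value)
print r.foo                  # 42

# As a dict: no methods or inheritance to collide with, and the
# syntax is more direct.
d = {}
d[field_name] = value
print d["foo"]               # 42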
So, why does setattr even exist?
It's there for the same basic reasons as other low-level features, like being able to access im_func or func_closure, or having modules like traceback and imp, or treating special methods just like any other methods, or for that matter exec and eval.
First, you can build higher-level things out of these low-level tools. For example, to build collections.namedtuple, you'd need either exec or setattr.
Second, you occasionally need to monkey-patch code at runtime because you can't modify it (or maybe even see it) at compile time, and tools like setattr can be essential to doing that.
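For instance, a minimal monkey-patch might look like this (the class and names are invented for the example):

# Hypothetical third-party class we can't edit at its source.
class Client(object):
    def fetch(self, url):
        raise IOError("network disabled in tests")

def fake_fetch(self, url):
    return "stubbed response for %s" % url

# The attribute name is a literal here, but a generic patching helper
# would receive it as a variable - hence setattr.
setattr(Client, "fetch", fake_fetch)
print Client().fetch("http://example.com")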
The setattr feature—much like eval—is often misused by people coming from Javascript, Tcl, or a few other languages. But as long as it can be used for good, you don't want to take it out of the language. (TOOWTDI shouldn't be taken so literally that only one program can ever be written.)
But that doesn't mean you should go around using this stuff whenever possible. You wouldn't write mylist.__getitem__(slice(1, 10, 2)) instead of mylist[1:10:2]. Sometimes, being able to call __getitem__ directly or build slice objects explicitly is a foundation to something that lets the rest of your code be more pythonic, or way to localize a workaround to avoid infecting the rest of your code. Otherwise, there are clearer and simpler ways to do it.
What is the easiest way to check if something is a list?
A method doSomething has the parameters a and b. In the method, it will loop through the list a and do something. I'd like a way to make sure a is a list, before looping through - thus avoiding an error or the unfortunate circumstance of passing in a string then getting back a letter from each loop.
This question must have been asked before - however my googles failed me. Cheers.
To enable more usecases, but still treat strings as scalars, don't check for a being a list, check that it isn't a string:
if not isinstance(a, basestring):
...
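Used in the question's setting, that might look like this (a sketch; the body of doSomething is invented):

def doSomething(a, b):
    if isinstance(a, basestring):
        a = [a]   # treat a lone string as one item, not as characters
    for item in a:
        print item, b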
Typechecking hurts the generality, simplicity, and maintainability of your code. It is seldom used in good, idiomatic Python programs.
There are two main reasons people want to typecheck:
To issue errors if the caller provides the wrong type.
This is not worth your time. If the user provides an incompatible type for the operation you are performing, an error will already be raised when the incompatibility is hit. It is worrisome that this might not happen immediately, but it typically doesn't take long at all, and foregoing the check results in code that is more robust, simple, efficient, and easier to write.
Oftentimes people insist on this with the hope they can catch all the dumb things a user can do. If a user is willing to do arbitrarily dumb things, there is nothing you can do to stop him. Typechecking mainly has the potential of keeping out a user who comes in with his own types that are drop-in replacements for the ones expected, or a user who recognizes that your function should actually be polymorphic and provides something different that can accept the same operations.
If I had a big system where lots of things made by lots of people should fit together right, I would use a system like zope.interface to make testing that everything fits together right.
To do different things based on the types of the arguments received.
This makes your code worse because your API is inconsistent. A function or method should do one thing, not fundamentally different things. This ends up being a feature not usually worth supporting.
One common scenario is to have an argument that can either be a foo or a list of foos. A cleaner solution is simply to accept a list of foos. Your code is simpler and more consistent. If it's an important, common use case only to have one foo, you can consider having another convenience method/function that calls the one that accepts a list of foos and lose nothing. Providing the first API would not only have been more complicated and less consistent, but it would break when the types were not the exact values expected; in Python we distinguish between objects based on their capabilities, not their actual types. It's almost always better to accept an arbitrary iterable or a sequence instead of a list and anything that works like a foo instead of requiring a foo in particular.
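A sketch of that pattern (Foo and frobnicate are stand-in names):

class Foo(object):
    def frobnicate(self):
        return "done"

def process_foos(foos):
    # Accepts any iterable of objects that support frobnicate();
    # no type check needed, and drop-in replacements just work.
    return [foo.frobnicate() for foo in foos]

def process_foo(foo):
    # Convenience wrapper for the common single-item case.
    return process_foos([foo])[0]

print process_foos([Foo(), Foo()])
print process_foo(Foo())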
As you can tell, I do not think either reason is compelling enough to typecheck under normal circumstances.
I'd like a way to make sure a is a list, before looping through
Document the function.
It's usually considered bad style to type-check in Python, but try
if isinstance(a, list):
...
(I think you may also check whether a has an __iter__ attribute, e.g. hasattr(a, '__iter__').)
I've got a few questions about best practices in Python. Not too long ago I would do something like this with my code:
...
junk_block = "".join(open("foo.txt","rb").read().split())
...
I don't do this anymore because I can see that it makes code harder to read, but would the code run slower if I split the statements up like so:
f_obj = open("foo.txt", "rb")
f_data = f_obj.read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
I also noticed that there's nothing keeping you from doing an 'import' within a function block, is there any reason why I should do that?
As long as you're inside a function (not at module top level), assigning intermediate results to local barenames has an essentially-negligible cost (at module top level, assigning to the "local" barenames implies churning on a dict -- the module's __dict__ -- and is measurably costlier than it would be within a function; the remedy is never to have "substantial" code at module top level... always stash substantial code within a function!-).
Python's general philosophy includes "flat is better than nested" -- and that includes highly "nested" expressions. Looking at your original example...:
junk_block = "".join(open("foo.txt","rb").read().split())
presents another important issue: when is that file getting closed? In CPython today, you need not worry -- reference counting in practice does ensure timely closure. But most other Python implementations (Jython on the JVM, IronPython on .NET, PyPy on all sorts of backends, pynie on Parrot, Unladen Swallow on LLVM if and when it matures per its published roadmap, ...) do not guarantee the use of reference counting -- many garbage collection strategies may be involved, with all sorts of other advantages.
Without any guarantee of reference counting (and even in CPython it's always been deemed an implementation artifact, not part of the language semantics!), you might be exhausting resources, by executing such "open but no close" code in a tight loop -- garbage collection is triggered by scarcity of memory, and does not consider other limited resources such as file descriptors. Since 2.6 (and 2.5, with an "import from the future"), Python has a great solution via the RAII ("resource acquisition is initialization") approach supported by the with statement:
with open("foo.txt","rb") as f:
junk_block = "".join(f.read().split())
is the least-"unnested" way that will ensure timely closure of the file across all compliant versions of Python. The stronger semantics make it preferable.
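(On 2.5 specifically, that takes the future import mentioned above; 2.6+ has the statement by default:)

from __future__ import with_statement

with open("foo.txt", "rb") as f:
    junk_block = "".join(f.read().split())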
Beyond ensuring the correct, and prudent;-), semantics, there's not that much to choose between nested and flattened versions of an expression such as this. Given the task "remove all runs of whitespace from the file's contents", I would be tempted to benchmark alternative approaches based on re and on the .translate method of strings (the latter, esp. in Python 2.*, is often the fastest way to delete all characters from a certain set!), before settling on the "split and rejoin" approach if it proves to be faster -- but that's really a rather different issue;-).
First of all, there's not really a reason you shouldn't use the first example - it's quite readable in that it's concise about what it does. No reason to break it up since it's just a linear combination of calls.
Second, import within a function block is useful if there's a particular library function that you only need within that function - since the scope of an imported symbol is only the block within which it is imported, if you only ever use something once, you can just import it where you need it and not have to worry about name conflicts in other functions. This is especially handy with from X import Y statements, since Y won't be qualified by its containing module name and thus might conflict with a similarly named function in a different module being used elsewhere.
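A small illustration of that locality (using hashlib, available since 2.5):

def digest(data):
    from hashlib import md5   # only this function needs md5
    return md5(data).hexdigest()

# Elsewhere in the module, the name md5 is simply not defined.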
from PEP 8 (which is worth reading anyway)
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants
That line has nearly the same result as this:
junk_block = open("foo.txt","rb").read().replace(' ', '')
In your example you are splitting the words of the text into a list of words, and then you are joining them back together with no spaces. The above example instead uses the str.replace() method. (Note the two aren't strictly identical: split() with no arguments removes all whitespace, including tabs and newlines, while replace(' ', '') removes only spaces.)
The differences:
Yours builds a file object into memory, builds a string into memory by reading it, builds a list into memory by splitting the string, builds a new string by joining the list.
Mine builds a file object into memory, builds a string into memory by reading it, builds a new string into memory by replacing spaces.
You can see a bit less RAM is used in the new variation but more processor is used. RAM is more valuable in some cases and so memory waste is frowned upon when it can be avoided.
Most of the memory will be garbage-collected immediately, but many simultaneous users will still drive up peak RAM usage.
If you want to know if your second code fragment is slower, the quick way to find out would be to just use timeit. I wouldn't expect there to be that much difference though, since they seem pretty equivalent.
You should also ask if a performance difference actually matters in the code in question. Often readability is of more value than performance.
I can't think of any good reasons for importing a module in a function, but sometimes you just don't know you'll need to do something until you see the problem. I'll have to leave it to others to point out a constructive example of that, if it exists.
I think the two codes are readable. I (and that's just a question of personal style) will probably use the first, adding a comment line, something like: "Open the file and convert the data inside into a list".
Also, there are times when I use the second, maybe not so separated, but something like
f_data = open("foo.txt", "rb").read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
But then I'm giving more entity to each operation, which could be important in the flow of the code. I think it's important that you're comfortable with the code and don't find it difficult to understand in the future.
Definitely, the code will not be (at least, much) slower, as the only "overhead" you're adding is assigning the results to names.
If I am evaluating a Python string using eval(), and have a class like:
class Foo(object):
a = 3
def bar(self, x): return x + self.a
What are the security risks if I do not trust the string? In particular:
Is eval(string, {"f": Foo()}, {}) unsafe? That is, can you reach os or sys or something unsafe from a Foo instance?
Is eval(string, {}, {}) unsafe? That is, can I reach os or sys entirely from builtins like len and list?
Is there a way to make builtins not present at all in the eval context?
There are some unsafe strings like "[0] * 100000000" I don't care about, because at worst they slow/stop the program. I am primarily concerned about protecting user data external to the program.
Obviously, eval(string) without custom dictionaries is unsafe in most cases.
eval() will allow malicious data to compromise your entire system, kill your cat, eat your dog and make love to your wife.
There was recently a thread about how to do this kind of thing safely on the python-dev list, and the conclusions were:
It's really hard to do this properly.
It requires patches to the python interpreter to block many classes of attacks.
Don't do it unless you really want to.
Start here to read about the challenge: http://tav.espians.com/a-challenge-to-break-python-security.html
What situation do you want to use eval() in? Are you wanting a user to be able to execute arbitrary expressions? Or are you wanting to transfer data in some way? Perhaps it's possible to lock down the input in some way.
You cannot secure eval with a blacklist approach like this. See Eval really is dangerous for examples of input that will segfault the CPython interpreter, give access to any class you like, and so on.
You can get to os using builtin functions: __import__('os').
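For example, with the default builtins in reach, one short string is enough:

>>> eval("__import__('os').system('echo owned')")
owned
0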
For python 2.6+, the ast module may help; in particular ast.literal_eval, although it depends on exactly what you want to eval.
Note that even if you pass empty dictionaries to eval(), it's still possible to segfault (C)Python with some syntax tricks. For example, try this on your interpreter: eval("()"*8**5)
You are probably better off turning the question around:
What sort of expressions are you wanting to eval?
Can you ensure that only strings matching some narrowly defined syntax are eval()d?
Then consider if that is safe.
For example, if you are wanting to let the user enter an algebraic expression for evaluation, consider limiting them to one letter variable names, numbers, and a specific set of operators and functions. Don't eval() strings containing anything else.
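A sketch of that whitelist idea (illustrative only - even a narrow grammar like this deserves scrutiny; e.g. the '.' allowed for floats also permits attribute access):

import re

_ALLOWED = re.compile(r'^[a-z0-9+\-*/(). \t]*$')

def algebra_eval(expr, variables):
    # Reject anything beyond lowercase names, numbers, arithmetic
    # operators, parentheses and whitespace before eval'ing.
    if not _ALLOWED.match(expr):
        raise ValueError("disallowed characters in expression")
    return eval(expr, {"__builtins__": None}, dict(variables))

print algebra_eval("x*2 + y", {"x": 3, "y": 4})   # 10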
There is a very good article on the un-safety of eval() in Mark Pilgrim's Dive into Python tutorial.
Quoted from this article:
In the end, it is possible to safely evaluate untrusted Python expressions, for some definition of “safe” that turns out not to be terribly useful in real life. It’s fine if you’re just playing around, and it’s fine if you only ever pass it trusted input. But anything else is just asking for trouble.
I'm looking for a "safe" eval function, to implement spreadsheet-like calculations (using numpy/scipy).
The functionality to do this (the rexec module) has been removed from Python since 2.3 due to apparently unfixable security problems. There are several third-party hacks out there that purport to do this - the most thought-out solution that I have found is this Python Cookbook recipe, "safe_eval".
Am I reasonably safe if I use this (or something similar) to protect against malicious code, or am I stuck with writing my own parser? Does anyone know of any better alternatives?
EDIT: I just discovered RestrictedPython, which is part of Zope. Any opinions on this are welcome.
Depends on your definition of safe I suppose. A lot of the security depends on what you pass in and what you are allowed to pass in the context. For instance, if a file is passed in, I can open arbitrary files:
>>> names['f'] = open('foo', 'w+')
>>> safe_eval.safe_eval("baz = type(f)('baz', 'w+')", names)
>>> names['baz']
<open file 'baz', mode 'w+' at 0x413da0>
Furthermore, the environment is very restricted (you cannot pass in modules); thus you can't simply pass in a module of utility functions like re or random.
On the other hand, you don't need to write your own parser, you could just write your own evaluator for the python ast:
>>> import compiler
>>> ast = compiler.parse("print 'Hello world!'")
That way, hopefully, you could implement safe imports. The other idea is to use Jython or IronPython and take advantage of Java/.Net sandboxing capabilities.
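A rough sketch of that whitelisting evaluator on top of the old compiler module (Python 2 only; the node list here is deliberately tiny and illustrative):

import compiler
from compiler import ast

ALLOWED = (ast.Expression, ast.Const, ast.Tuple, ast.List, ast.Dict)

def check(node):
    # Walk the parse tree and refuse any node type not on the whitelist.
    if not isinstance(node, ALLOWED):
        raise ValueError("disallowed node: %s" % node.__class__.__name__)
    for child in node.getChildNodes():
        check(child)

def safe_literal_eval(source):
    check(compiler.parse(source, mode="eval"))
    return eval(source, {"__builtins__": None}, {})

print safe_literal_eval("({'a': 1}, [2, 3])")   # fine
safe_literal_eval("__import__('os')")           # raises ValueError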
Writing your own parser could be fun! It might be a better option because people are expecting to use the familiar spreadsheet syntax (Excel, etc) and not Python when they're entering formulas. I'm not familiar with safe_eval but I would imagine that anything like this certainly has the potential for exploitation.
If you simply need to write down and read some data structure in Python, and don't need the actual capacity of executing custom code, this one is a better fit:
http://code.activestate.com/recipes/364469-safe-eval/
It guarantees that no code is executed; only static data structures are evaluated: strings, lists, tuples, dictionaries.
Although that code looks quite secure, I've always held the opinion that any sufficiently motivated person could break it given adequate time. I do think it will take quite a bit of determination to get through that, but I'm relatively sure it could be done.
Daniel,
Jinja implements a sandboxed environment that may or may not be useful to you. From what I remember, it doesn't yet "comprehend" list comprehensions.
Sandbox info
The functionality you want is in the compiler language services, see
http://docs.python.org/library/language.html
If you define your app to accept only expressions, you can compile the input as an expression and get an exception if it is not, e.g. if there are semicolons or statement forms.
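For example:

compile("1 + 2 * 3", "<input>", "eval")    # fine: a bare expression
compile("import os", "<input>", "eval")    # raises SyntaxError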