DO NOT DO THIS.
This question is still getting upvotes, so I wanted to add a warning to it. If you're using Python 3, just use the included json package. If you're using Python 2, do everything you can to move to Python 3. If you're prevented from using Python 3 (my condolences), use the simplejson package suggested by James Thompson.
Original question follows.
Best practices aside, is there a compelling reason not to do this?
I'm writing a post-commit hook for use with a Google Code project, which provides commit data via a JSON object. GC provides an HMAC authentication token along with the request (outside the JSON data), so by validating that token I gain high confidence that the JSON data is both benign (as there's little point in distrusting Google) and valid.
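For reference, the token check amounts to recomputing the HMAC over the raw request body and comparing it with what Google sent. A sketch (the digest choice, md5 here, is an assumption for illustration rather than something taken from the GC docs):

import hmac, md5  # the md5 module, since Python 2.4 predates hashlib

def is_authentic(raw_body, received_token, secret_key):
    # Recompute the HMAC over the raw body; a mismatch means the request
    # didn't come from someone holding the shared secret.
    expected = hmac.new(secret_key, raw_body, md5).hexdigest()
    return expected == received_token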
My own (brief) investigations suggest that JSON happens to be completely valid Python, with the exception of the "\/" escape sequence — which GC doesn't appear to generate.
So, as I'm working with Python 2.4 (i.e. no json module), eval() is looking really tempting.
Edit: For the record, I am very much not asking if this is a good idea. I'm quite aware that it isn't, and I very much doubt I'll ever use this technique for any future projects even if I end up using it for this one. I just wanted to make sure that I know what kind of trouble I'll run into if I do. :-)
If you're comfortable with your script working fine for a while, and then randomly failing on some obscure edge case, I would go with eval.
If it's important that your code be robust, I would take the time to add simplejson. You don't need the C portion for speedups, so it really shouldn't be hard to dump a few .py files into a directory somewhere.
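A common way to set that up (a sketch, assuming you've dropped simplejson's pure-Python files somewhere on your import path) is to prefer the stdlib json where it exists and fall back to simplejson on older interpreters:

try:
    import json                 # stdlib, Python 2.6+
except ImportError:
    import simplejson as json   # vendored pure-Python copy for 2.4/2.5

data = json.loads('{"a": 1, "b": 2}')   # {u'a': 1, u'b': 2}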
As an example of something that might bite you, JSON uses Unicode and simplejson returns Unicode, whereas eval returns str:
>>> simplejson.loads('{"a":1, "b":2}')
{u'a': 1, u'b': 2}
>>> eval('{"a":1, "b":2}')
{'a': 1, 'b': 2}
Edit: a better example of where eval() behaves differently. In a Python 2 byte string, \uabcd is not an escape sequence, so eval sees a literal backslash-u while simplejson decodes the JSON escape:
>>> simplejson.loads('{"X": "\uabcd"}')
{u'X': u'\uabcd'}
>>> eval('{"X": "\uabcd"}')
{'X': '\\uabcd'}
>>> simplejson.loads('{"X": "\uabcd"}') == eval('{"X": "\uabcd"}')
False
Edit 2: saw yet another problem today pointed out by SilentGhost: eval doesn't handle true -> True, false -> False, null -> None correctly.
>>> simplejson.loads('[false, true, null]')
[False, True, None]
>>> eval('[false, true, null]')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "<string>", line 1, in <module>
NameError: name 'false' is not defined
>>>
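You could paper over that particular failure by handing eval the missing names yourself, but this is an illustration of how the hacks pile up, not a recommendation -- it does nothing for \uXXXX escapes, \/, or hostile input:

# Supplying JSON's literals as variables makes this one case "work".
json_names = {'true': True, 'false': False, 'null': None}
print eval('[false, true, null]', json_names)   # [False, True, None]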
The point of best practices is that in most cases, it's a bad idea to disregard them. If I were you, I'd use a parser to parse JSON into Python. Try out simplejson, it was very straightforward for parsing JSON when I last tried it and it claims to be compatible with Python 2.4.
I disagree that there's little point in distrusting Google. I wouldn't distrust them, but I'd verify the data you get from them. The reason that I'd actually use a JSON parser is right in your question:
My own (brief) investigations suggest that JSON happens to be completely valid Python, with the exception of the "\/" escape sequence — which GC doesn't appear to generate.
What makes you think that Google Code will never generate an escape sequence like that?
Parsing is a solved problem if you use the right tools. If you try to take shortcuts like this, you'll eventually get bitten by incorrect assumptions, or you'll do something like trying to hack together a parser with regexes and boolean logic when a parser already exists for your language of choice.
One major difference is that a boolean in JSON is true|false, but Python uses True|False.
The most important reason not to do this can be generalized: eval should never be used to interpret external input since this allows for arbitrary code execution.
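To make that concrete, here is the kind of payload that a real JSON parser rejects but eval happily executes (don't actually run it):

payload = "__import__('os').system('rm -rf ~')"
# simplejson.loads(payload)  -> raises ValueError: not valid JSON
# eval(payload)              -> executes an arbitrary shell command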
evaling JSON is a bit like trying to run XML through a C++ compiler.
eval is meant to evaluate Python code. Although there are some syntactical similarities, JSON isn't Python code. Heck, not only is it not Python code, it's not code to begin with. Therefore, even if you can get away with it for your use-case, I'd argue that it's a bad idea conceptually. Python is an apple, JSON is orange-flavored soda.
Related
I am teaching some neighborhood kids to program in Python. Our first project is to convert a string given as a Roman numeral to the Arabic value.
So we developed a function to evaluate a string that is a Roman numeral. The function takes a string and creates a list containing the Arabic equivalents and the operations that would be applied to evaluate it to the Arabic value.
For example, suppose you feed in XI: the function will return [1,'+',10].
If you feed in IX, the function will return [10,'-',1].
Since we need to handle the cases where adjacent values are equal separately, let us ignore the case where the supplied value is XII (that would return [1,'=',1,'+',10]) and the case where the Roman is IIX (that would return [10,'-',1,'=',1]).
Here is the function
def conversion(some_roman):
    roman_dict = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
    arabic_list = []
    for letter in some_roman.upper():
        if len(arabic_list) == 0:
            # First numeral: just record its value.
            arabic_list.append(roman_dict[letter])
            continue
        previous = arabic_list[-1]
        current_arabic = roman_dict[letter]
        if current_arabic > previous:
            # A larger numeral following a smaller one means the smaller
            # is subtracted, e.g. IX -> [1, '-', 10] before the reverse.
            arabic_list.extend(['-', current_arabic])
            continue
        if current_arabic == previous:
            arabic_list.extend(['=', current_arabic])
            continue
        if current_arabic < previous:
            # A smaller numeral following a larger one is simply added.
            arabic_list.extend(['+', current_arabic])
    # Reverse so the list evaluates left to right, e.g. IX -> [10, '-', 1].
    arabic_list.reverse()
    return arabic_list
The only way I can think of to evaluate the result is to use eval(), something like:
def evaluate(some_list):
    list_of_strings = [str(item) for item in some_list]
    converted_to_string = ''.join(list_of_strings)
    arabic_value = eval(converted_to_string)
    return arabic_value
I am a little bit nervous about this code because at some point I read that eval is dangerous to use in most circumstances, as it allows someone to introduce mischief into your system. But I can't figure out another way to evaluate the list returned from the first function without having to write a more complex function.
The kids get the conversion function, so even if it looks complicated they understand the process of Roman numeral conversion and it makes sense. When we have talked about evaluation, though, I can see they get lost. Thus I am really hoping for some way to evaluate the results of the conversion function that doesn't require too much convoluted code.
Sorry if this is warped, I am so . . .
Is there a way to accomplish what eval does without using eval?
Yes, definitely. One option would be to convert the whole thing into an ast tree and parse it yourself (see here for an example).
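For the specific list format in the question, though, you don't even need ast: a small loop over operator functions does the job. A sketch that handles only '+' and '-', since the question treats the '=' (equal-adjacent) cases separately:

import operator

OPS = {'+': operator.add, '-': operator.sub}

def evaluate(some_list):
    # some_list looks like [10, '-', 1]: a value, then operator/value pairs.
    total = some_list[0]
    for op, value in zip(some_list[1::2], some_list[2::2]):
        total = OPS[op](total, value)
    return total

print(evaluate([10, '-', 1]))   # 9  (IX)
print(evaluate([1, '+', 10]))   # 11 (XI)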
I am a little bit nervous about this code because at some point I read that eval is dangerous to use in most circumstances as it allows someone to introduce mischief into your system.
This is definitely true. Any time you consider using eval, you need to do some thinking about your particular use-case. The real question is: how much do you trust the user, and what damage can they do? If you're distributing this as a script and users are only using it on their own computer, then it's really not a problem -- after all, they don't need to inject malicious code into your script to remove their home directory. If you're planning on hosting this on your server, that's a different story entirely. Then you need to figure out where the string comes from and whether there is any way for the user to modify it in a way that could make it unsafe to run. Hackers are pretty clever [1][2], so hosting something like this on your server is generally not a good idea. (I always assume that the hackers know Python WAY better than I do.)
[1] http://blog.delroth.net/2013/03/escaping-a-python-sandbox-ndh-2013-quals-writeup/
[2] http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html
The only implementation of a safe expression evaluator that I've come across is:
https://pypi.org/project/simpleeval/
It supports a lot of basic Python-ish expressions and is quite restricted in what it allows you to do (so you don't blow up the interpreter or do something evil). It uses the Python ast module for parsing, and evaluates the result itself.
Example:
from simpleeval import simple_eval
simple_eval("21 + 21")
Then you can extend it and give it access to the parts of your program that you want to:
simple_eval("x + y", names={"x": 22, "y": 48})
or
simple_eval("do_thing(11)", functions={"do_thing": my_callback})
and so on.
The Code Furies have turned their baleful glares upon me, and it's fallen to me to implement "Secure Transport" as defined by The Direct Project. Whether or not we internally use DNS rather than LDAP for sharing certificates, I'm obviously going to need to set up the former to test against, and that's what's got me stuck. Apparently, an X509 cert needs some massaging to be used in a CERT record, and I'm trying to work out how that's done.
The clearest thing I've found is a script on Videntity's blog, but not being versed in Python, I'm hitting a stumbling block. Specifically, this line crashes:
decoded_clean_pk = clean_pk.decode('base64', strict)
since it doesn't seem to like (or rather, to know) whatever 'strict' is supposed to represent. I'm making the semi-educated guess that the line is supposed to decode the base64 data, but I learned from the Debian OpenSSL debacle some years back that blindly diddling with crypto-related code is a Bad Thing(TM).
So I turn the illustrious python wonks on SO to ask if that line might be replaced by this one (with the appropriate import added):
decoded_clean_pk = base64.b64decode(clean_pk)
The script runs after that change, and produces correct-looking output, but I've got enough instinct to know that I can't necessarily trust my instincts here. :)
That line would have worked if you had called it like this:
decoded_clean_pk = clean_pk.decode('base64', 'strict')
Notice that strict has to be a string; otherwise the Python interpreter would try to look up a variable named strict and complain when it doesn't find one. The errors argument also has to be a valid error handler: anything other than 'strict', 'ignore', or 'replace' will be rejected as well.
Take a look at this code:
>>> import base64
>>> b = base64.b64encode('hello world')
>>> b.decode('base64')
'hello world'
>>> base64.b64decode(b)
'hello world'
Both decode and b64decode work the same when .decode is passed 'base64' as the encoding argument.
The difference is that str.decode takes a byte string and decodes it according to the encoding you pass as the first parameter. In this case you're telling it to handle a base64 string, so it does so correctly.
To answer your question: both work the same here, although b64decode/b64encode are meant to work only with base64, while str.decode can handle as many encodings as the codec registry is aware of.
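A quick way to convince yourself of the equivalence (a Python 2 sketch with a made-up payload):

import base64

pk = base64.b64encode('stand-in for the real DER bytes')  # made-up payload
assert pk.decode('base64') == base64.b64decode(pk)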
For further information, have a read of both doc sections: decode and b64decode.
UPDATE: Actually, and this is the most important example I guess :) take a look at the source code for encodings/base64_codec.py, which is what decode() uses:
def base64_decode(input, errors='strict'):
    """ Decodes the object input and returns a tuple (output
        object, length consumed).

        input must be an object which provides the bf_getreadbuf
        buffer slot. Python strings, buffer objects and memory
        mapped files are examples of objects providing this slot.

        errors defines the error handling to apply. It defaults to
        'strict' handling which is the only currently supported
        error handling for this codec.
    """
    assert errors == 'strict'
    output = base64.decodestring(input)
    return (output, len(input))
As you can see, it actually uses the base64 module to do it :)
Hope this clarifies your question in some way.
I'm building an application whose database contains data with Latin (latin2-encoded) symbols. Users are able to enter this data.
What I've been doing so far is encode('latin2') every user input and decode('latin2') at the very end when displaying data in the template.
This is a bit annoying and I'm wondering if there is any better way of handling this.
Python's unicode type is designed to be the "natural" representation for strings. Byte strings (str), on the other hand, are expected to be in some unspecified encoding, but there's no way to "tag" them with the encoding used, and Python will very insistently assume they are in ASCII or UTF-8. As such, you're probably asking for headaches if you write your whole program to assume that str means latin2. Encoding problems have a way of creeping in at odd places in the code and percolating through layers, sometimes putting bad data in your database, and ultimately causing odd behavior or nasty errors somewhere completely unrelated and impossible to debug.
I would recommend you see about converting your db data to UTF-8.
If you can't do that, I would strongly recommend moving your encoding/decoding calls right up to the moment you transmit data to/from the database. If you have any sort of database abstraction layer, you can probably configure it to handle that for you more or less automatically. Then you should make sure any user input is converted to the unicode type right away.
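A minimal sketch of what decoding at the boundary means, assuming the raw bytes coming out of the database really are latin2:

def from_db(raw_bytes):
    # Decode once, as data enters the program; work in unicode everywhere else.
    return raw_bytes.decode('latin2')

def to_db(text):
    # Encode once, just before the data goes back to the database.
    return text.encode('latin2')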
Using unicode types and explicitly encoding/decoding this way also has the advantage that if you do have encoding problems, you will probably notice sooner and you can just throw unicode-nazi at them to track them down (see How can you make python 2.x warn when coercing strings to unicode?).
For your markup problem: Flask and Jinja2 will by default escape any unsafe characters in your strings before rendering them into your HTML. To override the autoescaping, just use the safe filter:
<h1>More than just text!</h1>
<div>{{ html_data|safe }}</div>
See Flask Templates: Controlling Autoescaping for details, and use this with extreme caution since you're effectively loading code from the database and executing it. In real life, you'll probably want to scrub the data (see Python HTML sanitizer / scrubber / filter or Jinja2 escape all HTML but img, b, etc).
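If you go the scrubbing route, a whitelist-based sanitizer is the usual tool. A hedged sketch using the bleach library (one option among those linked above; the tag list is purely illustrative):

import bleach

# Everything outside the whitelist gets escaped rather than rendered.
clean_html = bleach.clean(
    html_data,   # the string pulled from the database
    tags=['b', 'i', 'em', 'strong', 'img'],
    attributes={'img': ['src', 'alt']},
)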
Try adding this to the top of your program:
import sys
reload(sys)
sys.setdefaultencoding('latin2')
We have to reload sys because:
>>> import sys
>>> sys.setdefaultencoding
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding
<built-in function setdefaultencoding>
I am trying to write a small Python 2.x API to support fetching a job by jobNumber, where jobNumber is provided as an integer. Sometimes the users provide a jobNumber as an integer literal beginning with 0, e.g. 037537. (This is because they have been coddled by R, a language that sanely considers 037537==37537.)
Python, however, considers integer literals starting with "0" to be OCTAL, thus 037537!=37537; instead 037537==16223. This strikes me as a blatant affront to the principle of least surprise, and thankfully it looks like this was fixed in Python 3; see PEP 3127.
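A quick interpreter session shows the trap:

>>> 037537            # leading zero: Python 2 reads this as octal
16223
>>> int("037537")     # string conversion defaults to base 10
37537
>>> 0o37537           # the explicit octal spelling (Python 2.6+ and 3)
16223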
But I'm stuck with Python 2.7 at the moment. So my users do this:
>>> fetchJob(037537)
and silently get the wrong job (16223), or this:
>>> fetchJob(038537)
File "<stdin>", line 1
fetchJob(038537)
^
SyntaxError: invalid token
where Python is rejecting the octal-incompatible digit.
There doesn't seem to be anything provided via __future__ to allow me to get the Py3K behavior; it would have to be built into Python in some manner, since it requires a change to the lexer at least.
Is anyone aware of how I could protect my users from getting the wrong job in cases like this? At the moment the best I can think of is to change that API so it takes a string instead of an int.
At the moment the best I can think of is to change that API so it takes a string instead of an int.
Yes, and I think this is a reasonable option given the situation.
Another option would be to make sure that all your job numbers contain at least one digit greater than 7 so that adding the leading zero will give an error immediately instead of an incorrect result, but that seems like a bigger hack than using strings.
A final option could be to educate your users. It will only take five minutes or so to explain not to add the leading zero and what can happen if you do. Even if they forget or accidentally add the zero due to old habits, they are more likely to spot the problem if they have heard of it before.
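Going back to the first option, here is a sketch of what the string-taking API might look like (fetch_job and the internal helper are hypothetical names):

def fetch_job(job_number):
    # Accept the job number as a string so that a caller's leading zero
    # can't be silently reinterpreted as an octal literal.
    if not job_number.isdigit():
        raise ValueError("job number must be a decimal string: %r" % job_number)
    return _lookup_job(int(job_number, 10))  # hypothetical internal lookup

fetch_job("037537")   # looks up job 37537, as the user intended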
Perhaps you could take the input as a string and convert it to an int? Conversion from string ignores leading zeros (note that stripping them with lstrip("0") would break on an all-zero input like "000"):
test = "001234505"
test = int(test, 10)  # 1234505; base 10, so leading zeros are harmless
Am I correct in thinking that Python doesn't have a direct equivalent for Perl's __END__?
print "Perl...\n";
__END__
End of code. I can put anything I want here.
One thought that occurred to me was to use a triple-quoted string. Is there a better way to achieve this in Python?
print "Python..."
"""
End of code. I can put anything I want here.
"""
The __END__ block in Perl dates from a time when programmers had to work with data from the outside world and liked to keep examples of it in the program itself.
Hard to imagine, I know.
It was useful, for example, if you had a moving target like a hardware log file with mutating messages due to firmware updates, where you wanted to compare old and new versions of a line. It was also handy for keeping notes not strictly related to the program's operation ("Code seems slow on day x of month every month"), or, as mentioned above, for a reference set of data to run the program against. Telcos are an example of an industry where this was a frequent requirement.
Lastly, Python's cult-like restrictiveness seems to have a real and tiresome effect on the mindset of its advocates; if your only response to a question is "Why would you want to do that when you could do X?" when X is not as useful, please keep quiet++.
The triple-quote form you suggested will still create a python string, whereas Perl's parser simply ignores anything after __END__. You can't write:
"""
I can put anything in here...
Anything!
"""
import os
os.system("rm -rf /")
Comments are more suitable in my opinion.
#__END__
#Whatever I write here will be ignored
#Woohoo !
What you're asking for does not exist.
Proof: http://www.mail-archive.com/python-list#python.org/msg156396.html
A simple solution is to escape any " as \" and use a normal multi-line string -- see the official docs: http://docs.python.org/tutorial/introduction.html#strings
(Also, atexit doesn't work: http://www.mail-archive.com/python-list#python.org/msg156364.html)
Hm, what about sys.exit(0)? (Assuming you do import sys above it, of course.)
As to why it would useful, sometimes I sit down to do a substantial rewrite of something and want to mark my "good up to this point" place.
By using sys.exit(0) in a temporary manner, I know nothing below that point will get executed, therefore if there's a problem (e.g., server error) I know it had to be above that point.
I like it slightly better than commenting out the rest of the file, just because there are more chances to make a mistake and uncomment something (stray key press at beginning of line), and also because it seems better to insert 1 line (which will later be removed), than to modify X-many lines which will then have to be un-modified later.
But yeah, this is splitting hairs; commenting works great too... assuming your editor supports easily commenting out a region, of course; if not, sys.exit(0) all the way!
I use __END__ all the time, for many of the reasons given. I've been doing it for so long now that I put it in by force of habit (usually preceded by an exit('0');), along with BEGIN {} / END {} routines. It is a shame that Python doesn't have an equivalent; I just comment out the lines at the bottom: extraneous, but that's about what you get with "one way to rule them all" languages.
Python does not have a direct equivalent to this.
Why do you want it? It doesn't sound like a really great thing to have when there are more consistent ways, like putting the text at the end as comments (that's how we include arbitrary text in Python source files; triple-quoted strings are for making multi-line strings, not for non-code-related text).
Your editor should be able to make using many lines of comments easy for you.