Writing a Domain Specific Language for selecting rows from a table

Writing a Domain Specific Language for selecting rows from a table - python

I'm writing a server that I expect to be run by many different people, not all of whom I will have direct contact with. The servers will communicate with each other in a cluster. Part of the server's functionality involves selecting a small subset of rows from a potentially very large table. The exact choice of what rows are selected will need some tuning, and it's important that it's possible for the person running the cluster (eg, myself) to update the selection criteria without getting each and every server administrator to deploy a new version of the server.
Simply writing the function in Python isn't really an option, since nobody is going to want to install a server that downloads and executes arbitrary Python code at runtime.
What I need are suggestions on the simplest way to implement a Domain Specific Language to achieve this goal. The language needs to be capable of simple expression evaluation, as well as querying table indexes and iterating through the returned rows. Ease of writing and reading the language is secondary to ease of implementing it. I'd also prefer not to have to write an entire query optimiser, so something that explicitly specifies what indexes to query would be ideal.
The interface that this will have to compile against will be similar in capabilities to what the App Engine datastore exports: You can query for sequential ranges on any index on the table (eg, less-than, greater-than, range and equality queries), then filter the returned row by any boolean expression. You can also concatenate multiple independent result sets together.
I realise this question sounds a lot like I'm asking for SQL. However, I don't want to require that the datastore backing this data be a relational database, and I don't want the overhead of trying to reimplement SQL myself. I'm also dealing with only a single table with a known schema. Finally, no joins will be required. Something much simpler would be far preferable.
Edit: Expanded description to clear up some misconceptions.

Building a DSL to be interpreted by Python.
Step 1. Build the run-time classes and objects. These classes will have all the cursor loops and SQL statements and all of that algorithmic processing tucked away in their methods. You'll make heavy use of the Command and Strategy design patterns to build these classes. Most things are a command, options and choices are plug-in strategies. Look at the design for Apache Ant's Task API -- it's a good example.
Step 2. Validate that this system of objects actually works. Be sure that the design is simple and complete. You're tests will construct the Command and Strategy objects, and then execute the top-level Command object. The Command objects will do the work.
At this point you're largely done. Your run-time is just a configuration of objects created from the above domain. [This isn't as easy as it sounds. It requires some care to define a set of classes that can be instantiated and then "talk among themselves" to do the work of your application.]
Note that what you'll have will require nothing more than declarations. What's wrong with procedural? One you start to write a DSL with procedural elements, you find that you need more and more features until you've written Python with different syntax. Not good.
Further, procedural language interpreters are simply hard to write. State of execution, and scope of references are simply hard to manage.
You can use native Python -- and stop worrying about "getting out of the sandbox". Indeed, that's how you'll unit test everything, using a short Python script to create your objects. Python will be the DSL.
["But wait", you say, "If I simply use Python as the DSL people can execute arbitrary things." Depends on what's on the PYTHONPATH, and sys.path. Look at the site module for ways to control what's available.]
A declarative DSL is simplest. It's entirely an exercise in representation. A block of Python that merely sets the values of some variables is nice. That's what Django uses.
You can use the ConfigParser as a language for representing your run-time configuration of objects.
You can use JSON or YAML as a language for representing your run-time configuration of objects. Ready-made parsers are totally available.
You can use XML, too. It's harder to design and parse, but it works fine. People love it. That's how Ant and Maven (and lots of other tools) use declarative syntax to describe procedures. I don't recommend it, because it's a wordy pain in the neck. I recommend simply using Python.
Or, you can go off the deep-end and invent your own syntax and write your own parser.

I think we're going to need a bit more information here. Let me know if any of the following is based on incorrect assumptions.
First of all, as you pointed out yourself, there already exists a DSL for selecting rows from arbitrary tables-- it is called "SQL". Since you don't want to reinvent SQL, I'm assuming that you only need to query from a single table with a fixed format.
If this is the case, you probably don't need to implement a DSL (although that's certainly one way to go); it may be easier, if you are used to Object Orientation, to create a Filter object.
More specifically, a "Filter" collection that would hold one or more SelectionCriterion objects. You can implement these to inherit from one or more base classes representing types of selections (Range, LessThan, ExactMatch, Like, etc.) Once these base classes are in place, you can create column-specific inherited versions which are appropriate to that column. Finally, depending on the complexity of the queries you want to support, you'll want to implement some kind of connective glue to handle AND and OR and NOT linkages between the various criteria.
If you feel like it, you can create a simple GUI to load up the collection; I'd look at the filtering in Excel as a model, if you don't have anything else in mind.
Finally, it should be trivial to convert the contents of this Collection to the corresponding SQL, and pass that to the database.
However: if what you are after is simplicity, and your users understand SQL, you could simply ask them to type in the contents of a WHERE clause, and programmatically build up the rest of the query. From a security perspective, if your code has control over the columns selected and the FROM clause, and your database permissions are set properly, and you do some sanity checking on the string coming in from the users, this would be a relatively safe option.

"implement a Domain Specific Language"
"nobody is going to want to install a server that downloads and executes arbitrary Python code at runtime"
I want a DSL but I don't want Python to be that DSL. Okay. How will you execute this DSL? What runtime is acceptable if not Python?
What if I have a C program that happens to embed the Python interpreter? Is that acceptable?
And -- if Python is not an acceptable runtime -- why does this have a Python tag?

Why not create a language that when it "compiles" it generates SQL or whatever query language your datastore requires ?
You would be basically creating an abstraction over your persistence layer.

You mentioned Python. Why not use Python? If someone can "type in" an expression in your DSL, they can type in Python.
You'll need some rules on structure of the expression, but that's a lot easier than implementing something new.

You said nobody is going to want to install a server that downloads and executes arbitrary code at runtime. However, that is exactly what your DSL will do (eventually) so there probably isn't that much of a difference. Unless you're doing something very specific with the data then I don't think a DSL will buy you that much and it will frustrate the users who are already versed in SQL. Don't underestimate the size of the task you'll be taking on.
To answer your question however, you will need to come up with a grammar for your language, something to parse the text and walk the tree, emitting code or calling an API that you've written (which is why my comment that you're still going to have to ship some code).
There are plenty of educational texts on grammars for mathematical expressions you can refer to on the net, that's fairly straight forward. You may have a parser generator tool like ANTLR or Yacc you can use to help you generate the parser (or use a language like Lisp/Scheme and marry the two up). Coming up with a reasonable SQL grammar won't be easy. But google 'BNF SQL' and see what you come up with.
Best of luck.

It really sounds like SQL, but perhaps it's worth to try using SQLite if you want to keep it simple?

It sounds like you want to create a grammar not a DSL. I'd look into ANTLR which will allow you to create a specific parser that will interpret text and translate to specific commands. ANTLR provides libraries for Python, SQL, Java, C++, C, C# etc.
Also, here is a fine example of an ANTLR calculation engine created in C#

A context-free grammar usually has a tree like structure and functional programs have a tree like structure too. I don't claim the following would solve all of your problems, but it is a good step in the direction if you are sure that you don't want to use something like SQLite3.
from functools import partial
def select_keys(keys, from_):
return ({k : fun(v, row) for k, (v, fun) in keys.items()}
for row in from_)
def select_where(from_, where):
return (row for row in from_
if where(row))
def default_keys_transform(keys, transform=lambda v, row: row[v]):
return {k : (k, transform) for k in keys}
def select(keys=None, from_=None, where=None):
"""
SELECT v1 AS k1, 2*v2 AS k2 FROM table WHERE v1 = a AND v2 >= b OR v3 = c
translates to
select(dict(k1=(v1, lambda v1, r: r[v1]), k2=(v2, lambda v2, r: 2*r[v2])
, from_=table
, where= lambda r : r[v1] = a and r[v2] >= b or r[v3] = c)
"""
assert from_ is not None
idfunc = lambda k, t : t
select_k = idfunc if keys is None else select_keys
if isinstance(keys, list):
keys = default_keys_transform(keys)
idfunc = lambda t, w : t
select_w = idfunc if where is None else select_where
return select_k(keys, select_w(from_, where))
How do you make sure that you are not giving users ability to execute arbitrary code. This framework admits all possible functions. Well, you can right a wrapper over it for security that expose a fixed list of function objects that are acceptable.
ALLOWED_FUNCS = [ operator.mul, operator.add, ...] # List of allowed funcs
def select_secure(keys=None, from_=None, where=None):
if keys is not None and isinstance(keys, dict):
for v, fun keys.values:
assert fun in ALLOWED_FUNCS
if where is not None:
assert_composition_of_allowed_funcs(where, ALLOWED_FUNCS)
return select(keys=keys, from_=from_, where=where)
How to write assert_composition_of_allowed_funcs. It is very difficult to do that it in python but easy in lisp. Let us assume that where is a list of functions to be evaluated in a lips like format i.e. where=(operator.add, (operator.getitem, row, v1), 2) or where=(operator.mul, (operator.add, (opreator.getitem, row, v2), 2), 3).
This makes it possible to write a apply_lisp function that makes sure that the where function is only made up of ALLOWED_FUNCS or constants like float, int, str.
def apply_lisp(where, rowsym, rowval, ALLOWED_FUNCS):
assert where[0] in ALLOWED_FUNCS
return apply(where[0],
[ (apply_lisp(w, rowsym, rowval, ALLOWED_FUNCS)
if isinstance(w, tuple)
else rowval if w is rowsym
else w if isinstance(w, (float, int, str))
else None ) for w in where[1:] ])
Aside, you will also need to check for exact types, because you do not want your types to be overridden. So do not use isinstance, use type in (float, int, str). Oh boy we have run into:
Greenspun's Tenth Rule of Programming: any sufficiently complicated
C or Fortran program contains an ad hoc informally-specified
bug-ridden slow implementation of half of Common Lisp.

Related

How do I sanitize a list comprehension given by a user?

I am working on an interface for a simulator that is meant to be friendly to people who prefer the command line to a GUI. To give the simulator the levels, the user types the information into a file, which is then parsed and generates design points, which are then sent to a main server.
I would like to be able to implement some sort of "range" feature so that the user will not need to type out all of the individual levels. More power is needed than a simple additive sequence. Since the parser and related code is already in Python, this seems like a perfect use case for list comprehensions. However, the list comprehension is user input and not guaranteed to be valid. Using eval seems too dangerous, and literal_eval does not support list comprehensions.
My current goal is for something like this to be valid and safe:
{"Factor 1": [1,2,3,7,8],
"Factor 2": "[2**x for x in range(5,20) if (x % 3) == 0]"}
The base format for files that the user types is JSON. I am looking to extend the language to have additional features (like range) to fill various user needs. "Data set 1" can be parsed in the existing system. The list comprehension will be evaluated on the user's machine, so simple attacks like 'x'*9**999999**99999 are self-destructive.
It seems relatively easy to sanitize the range function using a regex, but I'm not sure how to make sure that the other parts are safe. Are regexes sufficient for this task, or is there another approach I should be following?

Further analysis seems to show that eval is less dangerous than usual here. The parsing is all done client side, and the code of the project is open-source. Therefore, anything malicious that the user could do by exploiting an eval is either self-destructive or possible through less convoluted methods. Therefore, I can just use eval to generate the list of levels.
Of course, I will have to heavily document this decision due to the "eval is evil; kill it with fire" reaction that most people (including myself) have towards its use.

CGI - Good Design Practice: setattr/eval/if-then?

I have a question about good practice. In developing a back-end for my company, wherein our sales staff can manage clients remotely, I want them to simply be able to click "Edit Contact Info." (for instance) and have the python script "manageClients.py" call the "editContactInfo" function.
Obviously, the easiest way to do this would be to href manageClients.pyfunction=editContactInfo&client=ClientName
and in the script, after getting a fieldstorage instance:
if function:
eval(function)
However, everything I see says to not use eval if it can be avoided. I don't fully understand setattr(), and I don't really know how I would work it into such a simple script. I also think if function=="editClientInfo: (and so on) would be way too bulky.
Obviously the code/URL stuff listed above is simplistic so I could clearly ask the question. There are security measures in place, etc.
I just want a way to tackle this that is considered good practice, as this is my first big programming project, and I want to do it right.

If I understand correctly, you have a set of functions and you want to specify, by name, which one to use. I'd set up a dictionary mapping the name to the function, and then its just a simple lookup to get the function to use:
# Map names to functions. Note there is no need for the names
# to match the actual function name, and the functions can appear
# multiple times.
function_map = {
'editContactInfo': editContactInfo,
'editClientInfo': editClientInfo,
'editAlias': editContactInfo,
}
# Get name of function to run etc. I'm assuming that the function
# name is stored in the variable function_name, and any parameters
# call it with is in the dictionary parameters (i.e., the functions
# all take keyword arguments).
# Do a lookup to get the function.
try:
function = function_map[function_name]
except KeyError:
# Replace with your own error handling code.
print 'Function {0} does not exist.'.format(function_name)
# Call the function
result = function(**parameters)
This is the standard way to do a lookup, which is the what you seem to be asking for. However, you also asked about best practice, and I agree with RestRisiko: using a framework designed for web applications is the best practice in this case.
The only Python framework I have any real experience with is Django. This is a 'heavier' (more complex) framework than something like Flask or Bottle, but comes with a lot more inbuilt features. It really depends what your needs are, and I'd recommend checking them all out to see which fits your situation best.

CGI? This is an ancient culprit of the last century.
Please use a decent Python web-framework for implementing your functionality
in a decent and modern way.
Small framework like Flask or Bottle are much, much, much better than dealing with CGI
nowadays - no excuses please. There is no reason nowadays for using CGI in any way.
Even building something on top of the Python HTTPServer is a much better option.

programming language implemented in pure python

i am creating ( researching possibility of ) a highly customizable python client and would like to allow users to actually edit the code in another language to customize the running of program. ( analogous to browser which itself coded in c/c++ and run another language html/js ). so my question is , is there any programming language implemented in pure python which i can see as a reference ( or use directly ? ) -- i need simple language ( simple statements and ifs can do )
edit: sorry if i did not make myself clear but what i want is "a language to customize the running of program" , even though pypi seems a great option, what i am looking for is more simple which i can study and extend myself if need arise. my google searches pointing towards xml based langagues. ( BMEL , XForms etc ).

The question isn't completely clear on scope, but I have a hunch that PyPy, embedding other full languages, and similar solutions might be overkill. It sounds like iamgopal may really be interested in something more like Interpreter Pattern or Little Language.
If the language you want to support is really small (see the Interpreter Pattern link), then hand-coding this yourself in Python won't be too hard. You can write a simple parser (Google around; here's one example), then walk the AST and evaluate user expressions.
However, if you expect this to be used for a long time or by many people, it may be worth throwing a real language at the problem. (I'd recommend Python itself if your users are already familiar with basic Python syntax).

Ren'Py is a modification to Python syntax built on top of Python itself, using the language tools in the stdlib.

For your user's sake, don't use an XML based language - XML is an awful basis for a programming language and your users will hate you for it.
Here is a suggestion. Use a strict subset of Python for your language. Use the compiler module to convert their code into an abstract syntax tree and walk the tree to to validate that the code conforms to your subset before converting the AST into python bytecode.
N.B. I just checked the docs and see that the compiler package is deprecated in 2.6 and removed in Python 3.x. Does anyone know why that is?

Numerous template languages such as Cheetah, Django templates, Genshi, Mako, Mighty might serve as an example.

Why not Python itself? With some care you can use eval to run user code.
One of the good thing about interpreted scripting languages is that you don't need another extra scripting language!

PLY (Python Lex-Yacc)
is something of your interest.

Possibly Common Lisp (or any other Lisp) will be the best choice for that task. Because Lisp make it possible to easily extend host language with powerful macroses and construct DSL (domain specific language).

If all you need is simple if statements and expressions, I'm sure it wouldn't be an awful task to parse each line. Something like
if some flag
activate some feature
deactivate some feature
elif some other flag
activate some feature
activate some feature
else
logout
Just write a class which, while parsing takes the first word, checks if it's "if, elif, else," etc, and if so, check a flag and set a flag saying you either are or are not executing until the next conditional. If it's not a conditional, call a function based on the first keyword that would modify the program state in some way.
The class could store some local execution state (are we in an if statement? If so are we executing this branch?) and have another class containing some global application state (flags that are checkable by if statements, etc).
This is probably the wrong thing to do in your situation (it's very prone to bugs, it's dangerous if you don't treat the data in the scripts correctly), but it's at least a start if you do decide to interpret your own mini-language.
Seriously though, if you try this, be very, very, srs careful. Don't give the scripts any functionality that they don't definitely need, because you are almost certainly opening security holes by doing something like this.
Don't say I didn't warn you.

First Order Logic Engine

I'd like to create an application that can do simple reasoning using first order logic. Can anyone recommend an "engine" that can accept an arbitrary number of FOL expressions, and allow querying of those expressions (preferably accessible via Python)?

Don't query using first-order logic (FOL) unless you absolutely have to: first-order logic is not decidable, but only semi-decidable, and so queries will often, unavoidably not terminate.
Description logic is essentially a decidable fragment of first-order logic, reformulated in a manner that is good for talking about classes of entity and their interrelationships. There are many engines for description logic in Python, for example seth, based on OWL-DL.
If you are really sure that you need the vastness of FOL, then FLiP is worth a look. I've not used it (not really keen on Python, to be honest), but this is a good approach to making logic checking available to a programming language.

PyLog:
PyLog is a first order logic library
including a PROLOG engine in Python.

Recipe 303057: Pythologic -- Prolog syntax in Python / http://code.activestate.com/recipes/303057/

Can you do LINQ-like queries in a language like Python or Boo?

Take this simple C# LINQ query, and imagine that db.Numbers is an SQL table with one column Number:
var result =
from n in db.Numbers
where n.Number < 5
select n.Number;
This will run very efficiently in C#, because it generates an SQL query something like
select Number from Numbers where Number < 5
What it doesn't do is select all the numbers from the database, and then filter them in C#, as it might appear to do at first.
Python supports a similar syntax:
result = [n.Number for n in Numbers if n.Number < 5]
But it the if clause here does the filtering on the client side, rather than the server side, which is much less efficient.
Is there something as efficient as LINQ in Python? (I'm currently evaluating Python vs. IronPython vs. Boo, so an answer that works in any of those languages is fine.)

sqlsoup in sqlalchemy gives you the quickest solution in python I think if you want a clear(ish) one liner . Look at the page to see.
It should be something like...
result = [n.Number for n in db.Numbers.filter(db.Numbers.Number < 5).all()]

LINQ is a language feature of C# and VB.NET. It is a special syntax recognized by the compiler and treated specially. It is also dependent on another language feature called expression trees.
Expression trees are a little different in that they are not special syntax. They are written just like any other class instantiation, but the compiler does treat them specially under the covers by turning a lambda into an instantiation of a run-time abstract syntax tree. These can be manipulated at run-time to produce a command in another language (i.e. SQL).
The C# and VB.NET compilers take LINQ syntax, and turn it into lambdas, then pass those into expression tree instantiations. Then there are a bunch of framework classes that manipulate these trees to produce SQL. You can also find other libraries, both MS-produced and third party, that offer "LINQ providers", which basically pop a different AST processer in to produce something from the LINQ other than SQL.
So one obstacle to doing these things in another language is the question whether they support run-time AST building/manipulation. I don't know whether any implementations of Python or Boo do, but I haven't heard of any such features.

Look closely at SQLAlchemy. This can probably do much of what you want. It gives you Python syntax for plain-old SQL that runs on the server.

I believe that when IronPython 2.0 is complete, it will have LINQ support (see this thread for some example discussion). Right now you should be able to write something like:
Queryable.Select(Queryable.Where(someInputSequence, somePredicate), someFuncThatReturnsTheSequenceElement)
Something better might have made it into IronPython 2.0b4 - there's a lot of current discussion about how naming conflicts were handled.

Boo supports list generator expressions using the same syntax as python. For more information on that, check out the Boo documentation on Generator expressions and List comprehensions.

A key factor for LINQ is the ability of the compiler to generate expression trees.
I am using a macro in Nemerle that converts a given Nemerle expression into an Expression tree object.
I can then pass this to the Where/Select/etc extension methods on IQueryables.
It's not quite the syntax of C# and VB, but it's close enough for me.
I got the Nemerle macro via a link on this post:
http://groups.google.com/group/nemerle-dev/browse_thread/thread/99b9dcfe204a578e
It should be possible to create a similar macro for Boo. It's quite a bit of work however, given the large set of possible expressions you need to support.
Ayende has given a proof of concept here:
http://ayende.com/Blog/archive/2008/08/05/Ugly-Linq.aspx

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.