library for transforming a node tree

I'd like to be able to express a general transformation of one tree into another without writing a bunch of repetitive spaghetti code. Are there any libraries to help with this problem? My target language is Python, but I'll look at other languages as long as it's feasible to port to Python.
Example: I'd like to transform this node tree: (please excuse the S-expressions)
(A (B) (C) (D))
Into this one:
(C (B) (D))
As long as the parent is A and the second ancestor is C, regardless of context (there may be more parents or ancestors). I'd like to express this transformation in a simple, concise, and re-usable way. Of course this example is very specific. Please try to address the general case.
Edit: RefactoringNG is the kind of thing I'm looking for, although it introduces an entirely new grammar to solve the problem, which I'd like to avoid. I'm still looking for more and/or better examples.
Background:
I'm able to convert Python and Cheetah (don't ask!) files into tokenized tree representations, and in turn convert those into lxml trees. I plan to then reorganize the tree and write out the results in order to implement automated refactoring. XSLT seems to be the standard tool for rewriting XML, but the syntax is terrible (in my opinion, obviously) and nobody at our shop would understand it.
I could write some functions which simply use the lxml methods (.xpath and such) to implement my refactorings, but I'm worried that I will wind up with a bunch of purpose-built spaghetti code which can't be re-used.

Let's try this in Python code. I've used strings for the leaves, but this will work with any objects.
def lift_middle_child(in_tree):
    # Destructure the known shape (A (B) (C) (D)) and rebuild it as (C (B) (D)).
    (A, (B,), (C,), (D,)) = in_tree
    return (C, (B,), (D,))

print(lift_middle_child(('A', ('B',), ('C',), ('D',))))  # could use lists too
This sort of tree transformation is generally better performed in a functional style - if you create a bunch of these functions, you can explicitly compose them, or create a composition function to work with them in a point-free style.
Because you've used s-expressions, I assume you're comfortable representing trees as nested lists (or the equivalent - unless I'm mistaken, lxml nodes are iterable in that way). Obviously, this example relies on a known input structure, but your question implies that. You can write more flexible functions, and still compose them, as long as they have this uniform interface.
Here's the code in action: http://ideone.com/02Uv0i
Now, here's a function to reverse children, and using that and the above function, one to lift and reverse:
from functools import reduce  # a builtin in Python 2, lives in functools in Python 3

def compose2(a, b):  # might want to get this from the functional library
    return lambda *x: a(b(*x))

def compose(*funcs):  # compose(a,b,c)(x) = a(b(c(x))) - you might want to reverse that
    return reduce(compose2, funcs)

def reverse_children(in_tree):
    return in_tree[0:1] + in_tree[1:][::-1]  # slightly cryptic, but works for anything subscriptable

# Rightmost function is applied first - if you find this confusing, reverse the order in compose.
lift_and_reverse = compose(reverse_children, lift_middle_child)
print(lift_and_reverse(('A', ('B',), ('C',), ('D',))))
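The functions above assume a known input shape. As a rough sketch of the "more flexible functions" idea, a rule can check the shape itself and a generic walker can apply it at every node, so the rewrite happens wherever the pattern occurs. The names `rewrite` and `lift_middle_of_a` are made up for this illustration, and trees are assumed to be nested tuples of the form `(head, child, child, ...)`.

```python
def rewrite(tree, rule):
    """Apply `rule` to every node, children first, and return the new tree."""
    head, children = tree[0], tree[1:]
    rebuilt = (head,) + tuple(rewrite(child, rule) for child in children)
    return rule(rebuilt)

def lift_middle_of_a(tree):
    """Rewrite (A x (m) y) to (m x y); leave every other node unchanged."""
    if tree[0] == 'A' and len(tree) == 4 and len(tree[2]) == 1:
        _, first, middle, last = tree
        return (middle[0], first, last)
    return tree

nested = ('X', ('A', ('B',), ('C',), ('D',)), ('E',))
print(rewrite(nested, lift_middle_of_a))
```

Because every rule takes a tree and returns a tree, rules written this way can still be chained with the compose function.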

What you really want IMHO is a program transformation system, which allows you to parse and transform code using patterns expressed in the surface syntax of the source code (and even the target language) to express the rewrites directly.
You will find that even if you can get your hands on an XML representation of the Python tree, the effort to write an XSLT/XPath transformation is more than you expect; trees representing real code are messier than you'd expect, XSLT isn't that convenient a notation, and it cannot directly express common conditions on trees that you'd like to check (e.g., that two subtrees are the same). A final complication with XML: assume it has been transformed. How do you regenerate the source code syntax from which it came? You need some kind of prettyprinter.
A general problem regardless of how the code is represented is that without information about scopes and types (where you can get it), writing correct transformations is pretty hard. After all, if you are going to transform Python into a language that uses different operators for string concat and arithmetic (unlike Java which uses "+" for both), you need to be able to decide which operator to generate. So you need type information to decide. Python is arguably typeless, but in practice most expressions involve variables which have only one type for their entire lifetime. So you'll also need flow analysis to compute types.
Our DMS Software Reengineering Toolkit has all of these capabilities (parsing, flow analysis, pattern matching/rewriting, prettyprinting), and robust parsers for many languages including Python. (While it has flow analysis capability instantiated for C, COBOL, Java, this is not instantiated for Python. But then, you said you wanted to do the transformation regardless of context).
To express your rewrite in DMS on Python syntax close to your example (which isn't Python?)
domain Python;
rule revise_arguments(f:IDENTIFIER,A:expression,B:expression,
C:expression,D:expression):primary->primary
= " \f(\A,(\B),(\C),(\D)) "
-> " \f(\C,(\B),(\D)) ";
The notation above is the DMS rule-rewriting language (RSL). The "..." are metaquotes that separate Python syntax (inside those quotes, DMS knows it is Python because of the domain notation declaration) from the DMS RSL language. The \n inside the meta quote refers to the syntax variable placeholders of the named nonterminal type defined in the rule parameter list. Yes, (...) inside the metaquotes are Python ( ) ... they exist in the syntax trees as far as DMS is concerned, because they, like the rest of the language, are just syntax.
The above rule looks a bit odd because I'm trying to follow your example as closely as possible, and from an expression language point of view, your example is odd precisely because it does have unusual parentheses.
With this rule, DMS could parse Python (using its Python parser) like
foobar(2+3,(x-y),(p),(baz()))
build an AST, match the (parsed-to-AST) rule against that AST, rewrite it to another AST corresponding to:
foobar(p,(x-y),(baz()))
and then prettyprint the (valid) Python surface syntax back out.
If you intended your example to be a transformation on LISP code, you'd need a LISP grammar for DMS (not hard to build, but we don't have much call for this), and write corresponding surface syntax:
domain Lisp;
rule revise_form(A:form,B:form, C:form, D:form):form->form
= " (\A,(\B),(\C),(\D)) "
-> " (\C,(\B),(\D)) ";
You can get a better feel for this by looking at Algebra as a DMS domain.
If your goal is to implement all this in Python... I don't have much help.
DMS is a pretty big system, and it would be a lot of effort to replicate.

Related

Proper way to handle ambiguous tokens in PLY

I am implementing an existing scripting language in part as a toy project and in part so that I can write my own implementation of the program that uses the language. One of the issues I'm running into is that I have a few constructs that overlap in terms of specification but are more clear when used:
Variables - r'[A-Za-z0-9_]+' # Yes, '456' is a valid variable name
Numbers - r'-?[0-9]+(\.[0-9]+)?'
Macros - r'\#[A-Za-z0-9_]+'
Field Reference - r'(this\.)?([A-Za-z]+\.)*[A-Za-z]+'
Tag reference - r'[A-Za-z0-9_]+\.[A-Za-z0-9_]*\??'
This mostly works, but, for example, "456" could be a number or a variable. "34.567" could be a number or a tag reference (the documentation for the scripting language says that it's a bad idea to start identifiers with numbers, but doesn't outright forbid it). Is there a good way to handle the potential ambiguity of the tokens? Currently, I'm tokenizing the former as variable, and the latter as a number, and handling it later in the parser, but it feels very clumsy.

Is there any need for the tokenizer to distinguish between variables, numbers, field references and tag references? Presumably, the parser will be able to decide which of those categories a particular token falls into, by consulting its symbol table of declared variables and possibly by considering the context in which the token was used. If that's the case, then you can just return a single token for all four cases, which will simplify your lexer and probably your grammar.
There's a general principle of parser design, which is never sufficiently emphasised, so I'll put it in bold here:
Every parser component should do the absolute minimum amount of work necessary to distinguish between correct inputs.
In other words, if the only possibilities are a unique correct parse and an input error, and it's at all difficult to decide at that point which applies, then just pass the decision on to the next phases, where more information is available. Only do the work necessary to distinguish between two or more different correct inputs.
This applies, for example, to trying to do type-checking in the parser. That's a losing proposition; there isn't enough information to do it correctly until semantic analysis is complete and you know what all of the identifiers refer to. More importantly, it adds no benefit to the parser (or the lexer) because it does not affect how a correct input is parsed; all it does is let you identify certain (not all) incorrect inputs. By the above principle, you shouldn't try.
This principle comes up over and over again in parsing. There is always the temptation to try to make error detection "more precise" too early in the parse. Resist! Do error detection only when you have enough information to do it reliably. You'll have to do it at that point anyway, so you're not saving anything by trying to do some of it earlier. Early detection might shave a few microseconds off of a failed parse, but the speed of parsing incorrect inputs is not very important. Always optimise for correct inputs.
This also applies to writing grammars for syntaxes which are not easy to precisely shoehorn into a one-token lookahead grammar. It's OK to let an incorrect input sneak through the parse and then detect it during semantic analysis. For example, you could try to detect whether built-in function calls have the correct number of arguments. But why bother? Letting a call with too many or too few arguments go through to semantic analysis does not create any ambiguities. There are lots of other examples.
Other big benefits of letting errors trickle down to the semantic analysis are that it's much easier to generate accurate error messages, which are useful for the end user, and that it's much easier to do error recovery, so you can continue processing the input and provide multiple errors and warnings in a single run, another feature your users will appreciate.
There are exceptions to every guideline, so I'm not saying this is an absolute rule. In COBOL, for example, some operators have different parsing precedences depending on their datatype. (No sensible language designer would commit that barbarity today, I hope, but you do need to take it into account for legacy parsers.) You can only pass a decision down the line if it doesn't create ambiguities between correct inputs. But you should always try to keep this guideline in mind.
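As a minimal sketch of the "single token, decide later" idea, the lexer below emits one undifferentiated word token and a later phase classifies it using a set of declared names. It uses only the standard `re` module rather than PLY, leaves out macros and the leading minus sign for brevity, and the names `tokenize`, `classify`, and `declared` are invented for this illustration.

```python
import re

WORD_RE = re.compile(r"[A-Za-z0-9_.]+\??")           # one token type for everything
NUMBER_RE = re.compile(r"[0-9]+(?:\.[0-9]+)?")       # used only after lexing

def tokenize(text):
    """Split the input into words without deciding what each word is."""
    return WORD_RE.findall(text)

def classify(word, declared):
    """Decide the category later, when the declared names are known."""
    if word in declared:
        return "variable"
    if NUMBER_RE.fullmatch(word):
        return "number"
    if "." in word:
        return "reference"
    return "undeclared name"

declared = {"456", "total"}
for word in tokenize("456 34.567 this.name total foo"):
    print(word, "->", classify(word, declared))
```

Here "456" is treated as a variable only because it was declared; with an empty `declared` set the same token falls through to the number case, so the ambiguity never has to be resolved in the lexer.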

How do you robustly generate code, given a description of its exact behavior in another language?

I have been reverse engineering a specific black box equation that is part of a system I do not own (do not worry, it's white-hat), in which you can only measure the inputs (a large set of integers) and outputs (two integers).
This system can only be perfectly described as a program/function in which all the input integers are used, and so far I can perfectly describe the behavior by creating a data structure that has named "mathematical terms" in which each named input integer lives, and each term has an ordering for the inputs that it owns. I also have a function that takes the model description, and a set of named inputs, and outputs two integers. So the mapping of lists of input names to program behavior lives in here and in the model description in tandem.
I've been programming the reverse engineering utility in python, but ultimately I want to output a low level lua program that represents this function in a less abstract manner. When there were less terms in the model, it was simple to manually write a "transpiler" from this model (in python) to lua, but as the complexity grows it's painful to rewrite the code generator for new types of terms, especially in an ad-hoc manner.
From reading other questions about similar systems, it seems the very last two steps of this process would be: generating an abstract syntax tree representing my desired program, and giving the ast to a lua prettyprinter to generate the code. But I'm not sure if there's useful abstractions that I'm unaware of that help me generate a lua ast, given my current description of the model.

What you're looking for is an abstract syntax tree, which describes the behavior of a program as a tree of nodes. Since each component of an abstract syntax tree is highly compartmentalized (e.g., "Add", "Number Constant"...), it is straightforward to translate an abstract syntax tree back into a high-level programming language, such as Lua.
Abstract syntax trees are used in many compilers and transpilers, so you will not be digging long to find good examples.
CSharp.Lua does a similar thing to what you want: it transpiles C# to Lua using a simple abstract syntax tree and a slightly less simple code generator.
Speedy Web Compiler contains an excellent implementation of a JavaScript code generator.
ESBuild also has a well-done implementation for JavaScript.
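To make the idea concrete, here is a small sketch in Python of compartmentalized node types plus an emitter that turns them into Lua source text. The node classes (`Num`, `Var`, `BinOp`) and the functions `emit` and `emit_function` are hypothetical names for this illustration, not taken from any of the projects listed above, and only arithmetic expressions are covered.

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str          # a Lua binary operator such as "+" or "*"
    left: object
    right: object

def emit(node):
    """Return Lua source text for an expression node."""
    if isinstance(node, Num):
        return str(node.value)
    if isinstance(node, Var):
        return node.name
    if isinstance(node, BinOp):
        # Always parenthesize so operator precedence never matters.
        return f"({emit(node.left)} {node.op} {emit(node.right)})"
    raise TypeError(f"unknown node type: {type(node).__name__}")

def emit_function(name, params, body):
    """Wrap an expression in a Lua function definition."""
    return f"local function {name}({', '.join(params)})\n  return {emit(body)}\nend"

term = BinOp("+", Var("a"), BinOp("*", Var("b"), Num(2)))
print(emit_function("model", ["a", "b"], term))
```

Adding a new kind of term then means adding one node class and one branch in `emit`, rather than rewriting an ad-hoc generator.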

How to get a type of the variable from the Python's AST?

Suppose I want to get the type of all variables from the AST tree that I have generated from some source code -- how would I go about doing that?
For example, suppose in my source code I have something like i = 5. How would I determine, from the abstract syntax tree, that the type of i is integer?
I tried the type() function; however, it does not work in this situation.

As explained in other posts, there isn't an easy way to achieve this without heavy analysis of the syntax tree, for which Python's ast module provides no facilities.
You can still use logilab's astng, which is the basis for pylint and provides static inference capabilities.
Here is a quick example:
from logilab.astng.builder import ASTNGBuilder

builder = ASTNGBuilder()
astng = builder.string_build('i = 1', __name__, '<string>')
assnode = astng['i']  # look up the assignment node for the name i
print([(inf.value, type(inf.value)) for inf in assnode.infer()])
Of course you'll have to dig into the API for more real-life usage. You can still write to python-projects#lists.logilab.org for help on this.

As other posters have noted, this isn't so easy in a dynamically typed language. You can't just trace the assignment back to a static type declaration, as you can in C or Java.
However, one can often make a reasonable determination of the type.
Presumably the scoping rules allow one to determine which i (or which set of i's) might be accessed/updated/bound where the question is asked ("what is the type of i at this point in the code?"). Then one can do an analysis of all the values that might be assigned (a particularly trivial case is when i is bound only to a function definition). The upper bound in the type lattice on those types is the "type" of i. Yes, it might be "anything" in some cases, but in most well-written programs even dynamic variables have a "narrow" type intended by the programmer, and often it's a primitive language type (like, er, "int"). Otherwise the programmer wouldn't be able to reasonably write an algorithm (What, your array index isn't an integer sometimes?).
You need to do some kind of conservative analysis of the program to determine this upper-bound type. (You can obviously do the trivial analysis and conclude, uselessly, that a variable can be "any" type.) I think that's an unsatisfactory answer.
The machinery to do all this analysis is pretty complicated (you need global flow analysis and some determination of what can be dynamically loaded to do this really well) and I doubt if Python's AST package does it.
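For the simplest possible case, the "collect every value that might be assigned" step can be sketched with the standard ast module alone. This is only an illustration under strong assumptions: it looks at plain `name = literal` assignments, ignores scopes and control flow entirely, and reports "unknown" for anything that is not a literal. The function name `literal_types` is made up for this example, and `ast.Constant` assumes Python 3.8 or later.

```python
import ast
from collections import defaultdict

def literal_types(source):
    """Map each simple variable name to the set of type names assigned to it."""
    seen = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        if isinstance(node.value, ast.Constant):
            kind = type(node.value.value).__name__
        else:
            kind = "unknown"  # calls, names, arithmetic: real inference needed
        for target in node.targets:
            if isinstance(target, ast.Name):
                seen[target.id].add(kind)
    return dict(seen)

print(literal_types("i = 5\ns = 'x'\ni = 7\nj = f()"))
```

A name whose set holds a single entry has a "narrow" type in the sense described above; a set with several entries, or with "unknown", is where flow analysis would have to take over.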

You can't, because Python's variables don't have a type. Values have types.
That's how dynamic typing works.

Data Structures in Python

All the books I've read on data structures so far seem to use C/C++, and make heavy use of the "manual" pointer control that they offer. Since Python hides that sort of memory management and garbage collection from the user, is it even possible to implement efficient data structures in this language, and is there any reason to do so instead of using the built-ins?

Python gives you some powerful, highly optimized data structures, both as built-ins and as part of a few modules in the standard library (lists and dicts, of course, but also tuples, sets, arrays in module array, and some other containers in module collections).
Combinations of these data structures (and maybe some of the functions from helper modules such as heapq and bisect) are generally sufficient to implement most richer structures that may be needed in real-life programming; however, that's not invariably the case.
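As a brief illustration of combining a built-in list with those helper modules, the sketch below uses heapq to treat a list as a priority queue and bisect to keep another list sorted on insertion. The variable names and sample data are invented for the example.

```python
import bisect
import heapq

# A plain list used as a min-heap: the smallest priority comes out first.
tasks = []
heapq.heappush(tasks, (2, "write report"))
heapq.heappush(tasks, (1, "fix bug"))
heapq.heappush(tasks, (3, "refactor"))
first = heapq.heappop(tasks)

# A plain list kept in sorted order with binary-search insertion.
scores = [10, 30, 40]
bisect.insort(scores, 20)
position = bisect.bisect_left(scores, 30)  # index where 30 is found

print(first, scores, position)
```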
When you need something more than the rich library provides, consider the fact that an object's attributes (and items in collections) are essentially "pointers" to other objects (without pointer arithmetic), i.e., "reseatable references", in Python just like in Java. In Python, you normally use a None value in an attribute or item to represent what NULL would mean in C++ or null would mean in Java.
So, for example, you could implement binary trees via, e.g.:
class Node(object):
    __slots__ = 'payload', 'left', 'right'
    def __init__(self, payload=None, left=None, right=None):
        self.payload = payload
        self.left = left
        self.right = right
plus methods or functions for traversal and similar operations (the __slots__ class attribute is optional -- mostly a memory optimization, to avoid each Node instance carrying its own __dict__, which would be substantially larger than the three needed attributes/references).
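One possible traversal function for such a class is sketched below as a generator. The class is repeated so the snippet runs on its own, and the function name `in_order` is just a choice made for this example.

```python
class Node(object):
    __slots__ = 'payload', 'left', 'right'

    def __init__(self, payload=None, left=None, right=None):
        self.payload = payload
        self.left = left
        self.right = right

def in_order(node):
    """Yield payloads left subtree first, then the node, then the right subtree."""
    if node is not None:  # None plays the role of a null pointer
        yield from in_order(node.left)
        yield node.payload
        yield from in_order(node.right)

tree = Node(2, Node(1), Node(3, Node(2.5)))
print(list(in_order(tree)))
```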
Other examples of data structures that may best be represented by dedicated Python classes, rather than by direct composition of other existing Python structures, include tries (see e.g. here) and graphs (see e.g. here).

For some simple data structures (eg. a stack), you can just use the builtin list to get your job done. With more complex structures (eg. a bloom filter), you'll have to implement them yourself using the primitives the language supports.
You should use the builtins if they serve your purpose, since they have been debugged and optimised by a horde of people over a long time. Doing it from scratch by yourself will probably produce an inferior data structure.
If however, you need something that's not available as a primitive or if the primitive doesn't perform well enough, you'll have to implement your own type.
The details like pointer management etc. are just implementation talk and don't really limit the capabilities of the language itself.

C/C++ data structure books are only attempting to teach you the underlying principles behind the various structures - they are generally not advising you to actually go out and re-invent the wheel by building your own library of stacks and lists.
Whether you're using Python, C++, C#, Java, whatever, you should always look to the built in data structures first. They will generally be implemented using the same system primitives you would have to use doing it yourself, but with the advantage of having been tried and tested.
Only when the provided data structures do not allow you to accomplish what you need, and there isn't an alternative and reliable library available to you, should you be looking at building something from scratch (or extending what's provided).

How Python handles objects at a low level isn't too strange anyway. This document should disambiguate it a tad; it's basically all the pointer logic you're already familiar with.

With Python you have access to a vast assortment of library modules written and debugged by other people. Odds are very good that somewhere out there, there is a module that does at least part of what you want, and odds are even good that it might be implemented in C for performance.
For example, if you need to do matrix math, you can use NumPy, which was written in C and Fortran.
Python is slow enough that you won't be happy if you try to write some sort of really compute-intensive code (for example, a Fast Fourier Transform) in native Python. On the other hand, you can get a C-coded Fourier Transform as part of SciPy, and just use it.
I have never had a situation where I wanted to solve a problem in Python and said "darn, I just can't express the data structure I need."
If you are a pioneer, and you are doing something in Python for which there just isn't any library module out there, then you can try writing it in pure Python. If it is fast enough, you are done. If it is too slow, you can profile it, figure out where the slow parts are, and rewrite them in C using the Python C API. I have never needed to do this yet.

It's not possible to implement something like a C++ vector in Python, since you don't have array primitives the way C/C++ do. However, anything more complicated can be implemented (efficiently) on top of it, including, but not limited to: linked lists, hash tables, multisets, bloom filters, etc.

Partial evaluation with pyparsing

I need to be able to take a formula that uses the OpenDocument formula syntax, parse it into syntax that Python can understand, but without evaluating the variables, and then be able to evaluate the formula many times with changing values for the variables.
Formulas can be user input, so pyparsing allows me to both effectively handle the formula syntax, and clean user input. There are a number of good examples of pyparsing available, but all the mathematical ones seem to assume that one evaluates everything in the current scope immediately.
For context, I am working with a model of the industrial economy (life cycle assessment, or LCA), where these formulas represent the amount of material or energy exchanges between processes. The variable amount can be a function of several parameters, such as geographical location. The chain of formula and variable references is stored in a directed acyclic graph, so that formulas can always be simply evaluated. Formulas are stored as strings in a database.
My questions are:
Is it possible to parse a formula such that the parsed evaluation can also be stored in the database (as a string to be evaled, or something else)?
Are there alternatives to this approach? Bear in mind that the ideal solution is to parse/write once, and read many times. For example, partially parsing the formula, and then using the ast module, although I don't know how this could work with database storage.
Any examples of a project or library similar to this that I could look over? I am not a programmer, just a student trying to finish his thesis while making an open-source LCA software model in my spare time.
Is this approach too slow? I would like to be able to do substantial Monte Carlo runs, where each run could involve tens of thousands of formula evaluations (it is a big database).

1) Yes, it is possible to pickle the results from parsing your expression, and save that to a database. Then you can just fetch and unpickle the expression, rather than reparse the original again.
2) You can do a quick-and-dirty pass at this just using the compile and eval built-ins, as in the following interactive session:
>>> y = compile("m*x+b","","eval")
>>> m = 100
>>> x = 5
>>> b = 1
>>> eval(y)
501
Of course, this has the security pitfalls of any eval- or exec-based implementation, in that untrusted or malicious source strings can embed harmful system calls. But if this is your thesis and entirely within your scope of control, just don't do anything foolish.
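A small variation on the session above, offered only as a sketch: compile the formula once, then evaluate it repeatedly with the variables passed in a dictionary instead of living in the interpreter's global scope. The helper name `evaluate` is made up for this example. Emptying `__builtins__` here only keeps names like `open` out of casual reach; it is an assumption of convenience, not a sandbox, so the warning about untrusted input still applies.

```python
# Compile once; the code object can be evaluated as often as needed.
code = compile("m*x+b", "<formula>", "eval")

def evaluate(code, variables):
    """Evaluate a compiled expression against an explicit variable mapping."""
    return eval(code, {"__builtins__": {}}, dict(variables))

print(evaluate(code, {"m": 100, "x": 5, "b": 1}))
print([evaluate(code, {"m": 2, "x": x, "b": 0}) for x in range(4)])
```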
3) You can get an online example of parsing an expression into an "evaluatable" data structure at the pyparsing wiki's Examples page. Check out simpleBool.py and evalArith.py especially. If you're feeling flush, order a back issue of the May 2008 issue of Python magazine, which has my article "Writing a Simple Interpreter/Compiler with Pyparsing" with a more detailed description of the methods used, plus a description of how pickling and unpickling the parsed results works.
4) The slow part will be the parsing, so you are on the right track in preserving these results in some intermediate and repeatably-evaluatable form. The eval part should be fairly snappy. The second slow part will be in fetching these pickled structures from your database. During your MC run, I would package a single function that takes the selection parameters for an expression, fetches from the database, and unpickles and returns the evaluatable expression. Then once you have this working, use a memoize decorator to cache these query-results pairs, so that any given expression only needs to be fetched/unpickled once.
Good luck with your thesis!
