Partial evaluation for parsing - Python

I'm working on a macro system for Python (as discussed here), and one of the things I've been considering is units of measure. Although units of measure could be implemented without macros, or via static macros (e.g. defining all your units ahead of time), I'm toying with the idea of allowing syntax to be extended dynamically at runtime.
To do this, I'm considering using a sort of partial evaluation on the code at compile-time. If parsing fails for a given expression, due to a macro for its syntax not being available, the compiler halts evaluation of the function/block and generates the code it already has with a stub where the unknown expression is. When this stub is hit at runtime, the function is recompiled against the current macro set. If this compilation fails, a parse error would be thrown because execution can't continue. If the compilation succeeds, the new function replaces the old one and execution continues.
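Concretely, the stub might look something like this sketch; compile_with_macros and current_macro_set are placeholders for whatever the compiler would actually provide, not existing functions:
def compile_with_macros(source, macros):
    # Placeholder for the real macro-aware compiler entry point.
    return None   # pretend compilation still fails

def current_macro_set():
    return {}

class RecompileStub:
    """Emitted where an expression could not be parsed at compile time."""
    def __init__(self, fragment, owner_source, swap_in):
        self.fragment = fragment          # the text that failed to parse
        self.owner_source = owner_source  # full source of the enclosing function
        self.swap_in = swap_in            # callback that replaces the old function

    def __call__(self, *args, **kwargs):
        # Retry against whatever macros exist *now*, at run time.
        new_func = compile_with_macros(self.owner_source, current_macro_set())
        if new_func is None:
            raise SyntaxError("still cannot parse: %r" % self.fragment)
        self.swap_in(new_func)            # the recompiled function replaces the old one
        return new_func(*args, **kwargs)  # execution continues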
The biggest issue I see is that you can't find parse errors until the affected code is run. However, this wouldn't affect many cases; e.g. group operators like [], {}, (), and `` still need to be paired (a requirement of my tokenizer/list parser), and top-level syntax like classes and functions wouldn't be affected, since their "runtime" is really load time, where the syntax is evaluated and their objects are generated.
Aside from the implementation difficulty and the problem I described above, what problems are there with this idea?

Here are a few possible problems:
You may find it difficult to provide the user with helpful error messages in case of a problem. This seems likely, since any compile-time syntax error could really be a syntax extension that just hasn't been defined yet.
A performance hit from recompiling at runtime.
I was trying to find some discussion of the pluses, minuses, and/or implementation of dynamic parsing in Perl 6, but I couldn't find anything appropriate. However, you may find this quote from Niklaus Wirth (designer of Pascal and other languages) interesting:
The phantasies of computer scientists in the 1960s knew no bounds. Spurned by the success of automatic syntax analysis and parser generation, some proposed the idea of the flexible, or at least extensible language. The notion was that a program would be preceded by syntactic rules which would then guide the general parser while parsing the subsequent program. A step further: the syntax rules would not only precede the program, but they could be interspersed anywhere throughout the text. For example, if someone wished to use a particularly fancy private form of for statement, he could do so elegantly, even specifying different variants for the same concept in different sections of the same program. The concept that languages serve to communicate between humans had been completely blended out, as apparently everyone could now define his own language on the fly. The high hopes, however, were soon damped by the difficulties encountered when trying to specify what these private constructions should mean. As a consequence, the intriguing idea of extensible languages faded away rather quickly.
Edit: Here's Perl 6's Synopsis 6: Subroutines, unfortunately in markup form because I couldn't find an updated, formatted version; search within for "macro". Unfortunately, it's not too interesting, but you may find some things relevant, like Perl 6's one-pass parsing rule, or its syntax for abstract syntax trees. The approach Perl 6 takes is that a macro is a function that executes immediately after its arguments are parsed and returns either an AST or a string; Perl 6 continues parsing as if the source actually contained the return value. There is mention of generation of error messages, but they make it seem like if macros return ASTs, you can do alright.
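In Python-ish terms, the expansion step they describe is roughly this (a sketch; parse and the other names here stand in for the compiler's real machinery, not any actual Perl 6 API):
def parse(source):
    # Placeholder for the compiler's own parser.
    raise NotImplementedError

def expand_macro(macro, parsed_args, remaining_source):
    result = macro(*parsed_args)   # the macro runs as soon as its arguments are parsed
    if isinstance(result, str):
        # A string result is spliced back into the source and parsing continues.
        return parse(result + remaining_source)
    return result                  # an AST result is used in the tree directly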

Pushing this one step further, you could do "lazy" parsing and always only parse enough to evaluate the next statement. Like some kind of just-in-time parser. Then syntax errors could become normal runtime errors that just raise a normal Exception that could be handled by surrounding code:
def fun():
    not implemented yet

try:
    fun()
except:
    pass
That would be an interesting effect, but whether it's useful or desirable is a different question. Generally it's good to know about errors even if you don't call the code at the moment.
Macros would not be evaluated until control reaches them, and naturally the parser would already know all previous definitions. The macro definition could maybe even use variables and data that the program has calculated so far (like adding some syntax for all elements in a previously calculated list). But it's probably a bad idea to start writing self-modifying programs for things that could usually be done just as well directly in the language. This could get confusing...
In any case, you should make sure to parse code only once; if it is executed a second time, reuse the already parsed expression so that this doesn't lead to performance problems.
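For instance, a minimal memoization sketch, using ast.parse as a stand-in for the macro-aware parser:
import ast

_parse_cache = {}

def parse_once(source):
    # Re-executing the same source skips the parser entirely.
    tree = _parse_cache.get(source)
    if tree is None:
        tree = _parse_cache[source] = ast.parse(source)
    return tree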

Here are some ideas from my master's thesis, which may or may not be helpful.
The thesis was about robust parsing of natural language.
The main idea: given a context-free grammar for a language, try to parse a given text (or, in your case, a Python program). If parsing fails, you will have a partially generated parse tree. Use the tree structure to suggest new grammar rules that will better cover the parsed text.
I could send you my thesis, but unless you read Hebrew this will probably not be useful.
In a nutshell:
I used a bottom-up chart parser. This type of parser generates edges for productions from the grammar. Each edge is marked with the part of the tree that was consumed. Each edge gets a score according to how close it was to full coverage, for example:
S -> NP . VP
has a score of one half (we succeeded in covering the NP, but not the VP).
The highest-scored edges suggest a new rule (such as X->NP).
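A tiny sketch of the scoring idea (the names are mine, not from the thesis):
def edge_score(dot_position, rhs_symbols):
    # Fraction of the production's right-hand side covered so far.
    return dot_position / len(rhs_symbols)

# S -> NP . VP : the dot sits after 1 of the 2 right-hand-side symbols
print(edge_score(1, ["NP", "VP"]))  # 0.5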
In general, a chart parser is less efficient than a common LALR or LL parser (the types usually used for programming languages) - O(n^3) instead of O(n) complexity - but then again, you are trying something more complicated than just parsing an existing language.
If you can do something with the idea, I can send you further details.
I believe looking at natural language parsers may give you some other ideas.

Another thing I've considered is making this the default behavior across the board, but allowing languages (meaning a set of macros to parse a given language) to throw a parse error at compile-time. Python 2.5 in my system, for example, would do this.
Instead of the stub idea, simply recompile functions that couldn't be handled completely at compile-time when they're executed. This will also make self-modifying code easier, as you can modify the code and recompile it at runtime.

You'll probably need to delimit the bits of input text with unknown syntax, so that the rest of the syntax tree can be resolved, apart from some character-sequence nodes which will be expanded later. Depending on your top-level syntax, that may be fine.
You may find that the parsing algorithm and the lexer and the interface between them all need updating, which might rule out most compiler creation tools.
(The more usual approach is to use string constants for this purpose, which can be parsed by a little interpreter at run time.)
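For example, applied to the units-of-measure idea from the question, that usual string-constant approach might look like this (UNITS and quantity are illustrative names, not an existing API):
UNITS = {"m": 1.0, "km": 1000.0, "cm": 0.01}

def quantity(text):
    # Interpret "3 km" at run time instead of extending the syntax.
    value, unit = text.split()
    return float(value) * UNITS[unit]   # normalized to metres

print(quantity("3 km"))  # 3000.0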

I don't think your approach would work very well. Let's take a simple example written in pseudo-code:
define some syntax M1 with definition D1
if _whatever_:
    define M1 to do D2
else:
    define M1 to do D3
code that uses M1
So there is one example where, if you allow syntax redefinition at runtime, you have a problem (since with your approach, the code that uses M1 would be compiled against definition D1). Note that verifying whether syntax redefinition occurs is undecidable. An over-approximation could be computed by some kind of type system or some other kind of static analysis, but Python is not well known for this :D.
Another thing that bothers me is that your solution does not 'feel' right. I find it evil to store source code you can't parse just because you may be able to parse it at runtime.
Another example that jumps to mind is this:
...function definition fun1 that calls fun2...
define M1 (at runtime)
use M1
...function definition for fun2
Technically, when you reach the use of M1, you cannot parse it, so you need to keep the rest of the program (including the function definition of fun2) as source code. When you then run the program, you'll hit a call to fun2 that you cannot resolve, even though it is defined further down.

Related

How to solve caveats of ast.unparse?

I want to modify some constructs of Python source code (e.g. variable names). Working with the plain source text is troublesome, so I am using abstract syntax trees. Using ast (a built-in Python library) has worked out great for me, but the docs of ast.unparse() contain two warnings that I'm concerned about, since I don't want any uncontrolled modifications.
# small example
import ast

code = 'a = 0'
root = ast.parse(code)
for node in ast.walk(root):
    if isinstance(node, ast.Name):
        node.id = 'b'
code = ast.unparse(root)
print(code)
How to unparse ast without running into these problems?
Are there any alternatives to this method?
I don't know what the line about compiler optimizations is referring to, but basically the AST does not include comments, indentation is reduced to INDENT and DEDENT tokens, and other whitespace is removed altogether. unparse treats an indent as exactly four spaces and inserts a single space character between tokens where necessary. That can indeed be a problem if you are attempting to edit existing code.
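A quick demonstration of that information loss (ast.unparse requires Python 3.9+):
import ast

src = "x  =   1  # a comment"
print(ast.unparse(ast.parse(src)))  # prints "x = 1": the comment and extra spacing are gone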
If you want to preserve comments and whitespace, you'll have to use a different parsing strategy, not based on the built-in AST model. There are parsers which preserve comments and whitespace (for example, parsers used for syntax highlighting); if you feel you need one, you should be able to find one with an internet search.
As for the recursion depth warning, you'd need extremely deeply nested code to trigger a stack overflow. Practically no one writes code by hand that would trigger the problem, but it certainly can happen; mostly it happens with machine-generated code. Personally, I wouldn't worry about it until it happens to you, since there's a good chance it never will in your problem domain. (And if it does happen, you'll be informed by an exception rather than a dive into undefined behaviour, unlike in certain other programming languages.)

How to safely manipulate user code in Python?

If I were to make, for example, a program that takes as input a sorting algorithm and determines empirically whether or not the algorithm is partially correct, how would I convert the string input to an executable program? I have read other threads in which it was suggested using exec or eval, but all answers recommended against using this method due to security risks. Is there a way to create such a program that does not involve converting a string to executable code? Or will it inherently be a risky program no matter the implementation? Lastly, is there another programming language that would be a better alternative to define such a program?
Executing Arbitrary Code
No matter what language you choose, if you read code from the user and execute that code, it will be dangerous. No ifs, ands, or buts. You'll notice that the same caveats attached to Python's exec and eval are also noted for JavaScript, PHP, and many other languages.
Safely Executing Code from a String
There are safe ways to map strings to predefined functions, but there is no safe way to compile/interpret and execute arbitrary code.
One good example is the following on how to safely map functions to a string:
functions = {
    'print': print,
    'str': str,
    'int': int,
}
name = input('Choose from the above functions: ')
func = functions.get(name)
if func is not None:   # an unknown name would otherwise make this call None
    func()
Static Code Analysis
And for the final answer: potentially, but no. There would be ways of evaluating a sorting algorithm without running it, but they're unlikely to be effective, reproducible, or accurate without compiling the code or at least interpreting it. Static code analysis is difficult and can only go so far.
One simple example for how difficult static code analysis can be even with a single if statement is the following:
for index, value in enumerate(range(10)):
    if index and value - old == 1:
        print(value)
    old = value
Some libraries that do static code analysis (Pylint, for example) think this code will raise an error, because old appears to be used before it is defined. However, since bool(0) evaluates to False, the and short-circuits on the first iteration, so old is only ever checked after the first iteration, by which point it has already been assigned, and the code runs without issue.
Think of the complexity of inputs, the complexity of outputs, and the number of variations of possible sort algorithms that would all be equivalent. The easiest way to test code is to run it. Dynamic code analysis has its limitations, but by running the code on a given input and comparing the result to the desired output, you can get a good idea of whether the code works as it should, something that is very difficult with static analysis alone.
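As a sketch of that dynamic approach, assuming the candidate algorithm has already been obtained as a callable sort_fn by whatever careful means you choose (so no exec is shown here):
import random

def looks_like_a_sort(sort_fn, trials=100):
    # Empirically compare the candidate against the built-in sorted().
    for _ in range(trials):
        data = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        if sort_fn(list(data)) != sorted(data):
            return False   # a counterexample: definitely not correct
    return True            # no counterexample found; evidence, not proof

print(looks_like_a_sort(sorted))  # True
A failing trial proves the algorithm wrong; a hundred passing trials are only evidence, which is the usual trade-off of empirical testing.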

How do the for / while / print *things* work in python?

What I mean is: how is the syntax defined, i.e. how can I make my own constructs like these?
I realise that in a lot of languages, things like this will be built into the compiler/spec, and so are dealt with by the compiler (at least that's how I understand it to work).
But with Python, everything I've come across so far has been accessible to the programmer, so you more or less have the freedom to do whatever you want.
How would I go about writing my own version of for or while? Is it even possible?
I don't have any actual application for this, so the answer to any WHY?! questions is just "because why not?" or "curiosity".
No, you can't, not from within Python. You can't add new syntax to the language. (You'd have to modify the source code of Python itself to make your own custom version of Python.)
Note that the iterator protocol allows you to define objects that can be used with for in a custom way, which covers a lot of the possible use cases of writing your own iteration syntax.
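For example, a minimal sketch (Countdown is a made-up class) showing how far the iterator protocol gets you without any new syntax:
class Countdown:
    # Any object with __iter__ can drive a for loop.
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        n = self.start
        while n > 0:
            yield n
            n -= 1

for i in Countdown(3):
    print(i)   # 3, then 2, then 1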
Well, you have a couple of options for creating your own syntax:
Write a higher-order function, like map or reduce (see the sketch after this list).
Modify Python at the C level. This is, as you might expect, relatively easy compared with fiddling with many other languages. See this article for an example: http://eli.thegreenplace.net/2010/06/30/python-internals-adding-a-new-statement-to-python/
Fake it using the debug facilities, or the encodings facility. See this code: http://entrian.com/goto/download.html and http://timhatch.com/projects/pybraces/
Use a preprocessor. Here's one project that tries to make this easy: http://www.fiber-space.de/langscape/doc/index.html
Use the facilities built into Python to achieve a similar effect (decorators, metaclasses, and the like).
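To illustrate the first option, here is a small sketch of a while replacement written as an ordinary higher-order function (my_while and the surrounding names are made up):
def my_while(condition, body):
    # condition and body are zero-argument callables.
    while condition():
        body()

counter = {"n": 3}

def step():
    print(counter["n"])
    counter["n"] -= 1

my_while(lambda: counter["n"] > 0, step)   # prints 3, 2, 1
It works, but callers must wrap conditions and bodies in callables, which is exactly the syntactic clunkiness the built-in statement avoids.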
Obviously, none of this is quite what you're looking for, but Python, unlike Smalltalk or Lisp, isn't (necessarily) programmed in itself and doesn't guarantee to expose its own underlying execution and parsing mechanisms at runtime.
You can't make equivalent constructs. for, while, if etc. are statements, and they are built into the language with their own specific syntax. There are languages that do allow this sort of thing though (to some degree), such as Scala.
while, print, for etc. are keywords. That means they are handled while the source is read: the tokenizer splits the code into tokens, stripping redundant characters, and the parser then takes those tokens as input and builds a program tree, which is then executed by the interpreter. That said, those constructs are fixed syntax wired into that machinery, and as such they are not accessible from inside the code.
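You can watch the first half of that pipeline with the standard tokenize module; note that a keyword like for comes out as an ordinary NAME token, and it's the parser's grammar that gives it special meaning:
import io
import tokenize

src = "for i in range(3): print(i)"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))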

Would optional static typing benefit Python API-design or be a disadvantage? (type checking decorator example included)

I'm a long time Python developer and I really love the dynamic nature of the language, but I wonder if Python would benefit from optional static typing.
Would it be beneficial to be able to apply static typing to the API of a library, and what would the disadvantages of this be?
I quickly sketched up a decorator implementing runtime type checking on pastebin, and it works like this:
# A TypeError will be thrown if the argument "string" is not a "str" and if
# the returned value is not an "int"
@typed(int, string=str)
def getStringLength(string):
    return len(string)
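For illustration, a decorator of roughly that shape might look like this minimal sketch (an assumption about the pastebin code, which isn't reproduced here; it handles only straightforward signatures and exact types):
import functools
import inspect

def typed(return_type, **arg_types):
    def decorator(func):
        sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            # Check each declared argument type against the actual value.
            for name, expected in arg_types.items():
                if name in bound.arguments and not isinstance(bound.arguments[name], expected):
                    raise TypeError("%s must be %s" % (name, expected.__name__))
            result = func(*args, **kwargs)
            if not isinstance(result, return_type):
                raise TypeError("return value must be %s" % return_type.__name__)
            return result

        return wrapper
    return decorator
With that in place, getStringLength(42) raises a TypeError before len() ever runs.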
Would it be practical to use a decorator like this on the API functions of a library? In my view, type checking is not needed in the internal workings of a domain-specific module of a library, but on the connection points between the library and its client, a simple version of design by contract via type checking could be useful, especially as a kind of enforced documentation which clearly states to the client of the library what it expects and returns.
Take this example, where addObjectToQueue() and isObjectProcessed() are exposed for use by the client and processTheQueueAndDoAdvancedStuff() is an internal library function. I think type checking could be useful on the outward-facing functions, but it would only bloat and restrict the dynamic nature and usefulness of Python if used on the internal functions.
# some_library_module.py
import random

@typed(int, name=str)
def addObjectToQueue(name):
    return random.randint(0, 2**31)  # Some object id

def processTheQueueAndDoAdvancedStuff(arg_of_library_specific_type):
    ...  # Function body here

@typed(bool, object_id=int)
def isObjectProcessed(object_id):
    return True
What would the disadvantages of using this technique be?
What would the disadvantages of my naive implementation on pastebin be?
I don't want answers discussing the conversion of Python to a statically typed language, but thoughts about API design-specific pros/cons. (please move this to programmers.stackexchange.com if you consider it not a question)
Personally, I don't find this idea attractive for Python. This is all just my opinion, of course, but for context I'll tell you that Python and Haskell are probably my two favourite programming languages - I like languages at both extreme ends of the static vs dynamic typing spectrum.
I see the main benefits of static typing as follows:
Increased likelihood that your code is correct once the compiler has accepted it; if I know I've threaded my values through all the operations I invoked in such a way that the result type of one always matches the input type of another, and the final result type is the one I wanted, it increases the probability that I've selected the correct operations. This point is of deeply arguable value, since it only really matters if you're not testing very much, which would be bad. But it is true that, when programming in Haskell, when I sit back and say "there, done!" I am actually done a lot of the time, whereas that's almost never true of my Python code.
The compiler automatically points out most of the places that need changing when I make an incompatible change to a data structure or interface (most of the time). Again, tests are still needed to actually be sure you've caught all the implications, but most of the time the compiler's nagging is actually sufficient, in my experience, which deeply simplifies such refactoring; you can go straight from implementing the core of the refactoring to testing that the program still works okay, because the actual work of making all the flow-on changes is almost mechanical.
Efficient implementation. The compiler gets to use all the knowledge it has about types to do optimisation.
Your suggested system doesn't really provide any of these benefits.
Having written a program making use of your library, I still don't know if it contains any type-incorrect uses of your functions until I do extensive testing with full code coverage to see if any execution path contains a bad call.
When I refactor something, I need to go through many, many rounds of "run full test suite, look for exception, find where it came from, fix the code" to get anything at all like a static-typing compiler's problem detection.
Python will still be behaving as if those variables could be anything at any time.
And to get even that much, you've sacrificed the flexibility of Python's duck typing: it's not enough that I provide a sufficiently "list-like" object, I have to actually provide a list.
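To make the duck-typing point concrete (made-up function names):
def total_strict(items):
    if not isinstance(items, list):   # the decorator-style check
        raise TypeError("need a list")
    return sum(items)

def total_ducky(items):
    return sum(items)                 # anything iterable of numbers works

print(total_ducky((1, 2, 3)))   # 6: a tuple is fine here
# total_strict((1, 2, 3))       # TypeError, though the tuple would have worked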
To me, this sort of static typing is the worst of both worlds. The main dynamic typing argument is "you have to test your code anyway, so you may as well use those tests to catch type errors and free yourself from having to work around the type system when it doesn't help you". That may or may not be a good argument with respect to a really good static type system, but it absolutely is a compelling argument with respect to a weak partial static type system that only detects type errors at runtime. I don't think nicer error messages (which is all it really buys you most of the time; a type error not caught at the interface is almost certainly going to throw an exception deeper in the call stack) is worth the loss of flexibility.

Why is it not possible to create a practical Perl to Python source code converter?

It would be nice if there existed a program that automatically transforms Perl code to Python code, making the resultant Python program as readable and maintainable as the original one, not to mention working the same way.
The most obvious solution would just invoke perl via Python utils:
#!/usr/bin/python
import os
# Pipe everything after these four header lines to perl.
os.system("tail -n +5 " + __file__ + " | perl -")
...the rest of the file is the original Perl program...
However, the resultant code is hardly Python code; it's essentially Perl code. The potential converter should convert Perl constructs and idioms to easy-to-read Python code, it should retain variable and subroutine names (i.e. the result should not look obfuscated), and it should not shatter the workflow too much.
Such a conversion is obviously very hard. The hardness of the conversion depends on the number of Perl features and syntactical constructs which do not have easy-to-read, unobfuscated Python equivalents. I believe that the large number of such features renders such automatic conversion practically impossible (while a theoretical possibility exists).
So, could you please name Perl idioms and syntax features that can't be expressed in Python as concisely as in the original Perl code?
Edit: some people linked Python-to-Perl converters and deduced, on this basis, that it should be easy to write a Perl-to-Python converter as well. However, I'm sure that converting to Python is in greater demand; still, this converter has not yet been written, while the reverse already has! Which only makes my confidence in the impossibility of writing a good converter to Python more solid.
Your best Perl to Python converter is probably 23 years old, just graduated university and is looking for a job.
Why Perl is not Python.
1. Perl has statements which Python more-or-less totally lacks. While you can probably contrive matching statements, the syntax will be so utterly unlike Perl as to make it difficult to call it a "translation". You'd really have to cook up some fancy Python stuff to make it as terse as the original Perl.
2. Perl has run-time semantics which are so unlike Python's as to make translation very challenging. We'll look at just one example below.
3. Perl has data structures which are different enough from Python's that translation is hard.
4. Perl threads don't share data by default. Only selected data elements can be shared. Python threads have a more common "shared everything" data model.
One example of #2 should be enough.
Perl:
do_something || die()
Where do_something is any statement of any kind.
To automagically translate this into Python you'd have to wrap every || die() statement in
try:
    python_version_of_do_something
except OrdinaryStatementFailure:
    die()
    sys.exit()
Where the more common formulation
Perl
do_something
Would become this using simple -- unthinking -- translation of the source
try:
    python_version_of_do_something
except OrdinaryStatementFailure:
    pass
And, of course,
Perl
do_this || do_that || die()
Is even more complex to translate into Python.
And
Perl
do_this && do_that || die()
really pushes the envelope. My Perl is rusty, so I can't recall the precise semantics of this kind of thing. But you have to totally understand the semantics to work out a Pythonic implementation.
The Python examples are not good Python. To write good Python requires "thinking", something an automatic translator can't do.
And every Perl construct would have to be "wrapped" like that in order to get the original Perl semantics into a Pythonic form.
Now, do a similar analysis for every feature of Perl.
Just to expand on some of the other lists here, these are a few Perl constructs that are probably very clumsy in Python (if they are possible at all):
dynamic scope (via the local keyword)
typeglob manipulation (multiple variables with the same name)
formats (they have a syntax all their own)
closures over mutable variables
pragmas
lvalue subroutines (mysub() = 5; type code)
source filters
context (list vs scalar, and the way that called code can inspect this with wantarray)
type coercion / dynamic typing
any program that uses string eval
The list goes on and on, and someone could try to create a mapping between all of the analogous constructs, but in the end it will be a failure, for one simple reason.
Perl cannot be statically parsed. The definitions in Perl code (particularly those in BEGIN blocks) change the way the compiler is going to interpret the remaining code. So for non-trivial programs, conversion from Perl => Python suffers from the halting problem.
There is no way to know exactly how all of the program will be compiled until the program has finished running, and it is theoretically possible to create a Perl program that will compile differently every time it is run. This means that one Perl program could map to an infinite number of Python programs, the correct one of which is only known after running the original program in the perl interpreter.
It is not impossible, it would just take a lot of work.
By the way, there is Perthon, a Python-to-Perl translator. It just seems like nobody is willing to make one that goes the other way.
EDIT: I think I might have found the reason why a Python-to-Perl translator is much easier to implement: Python lets you fiddle with a script's AST. See the parser module.
Perl can experimentally be built to collect additional information (for instance, comments) during compilation of perl code and even emit the results as XML. There doesn't appear to be any documentation of this outside the source, except for: http://search.cpan.org/perldoc/perl5100delta#MAD
This should be helpful in building a translator. I'd expect you to get 80% of the way there fairly easily, 95% with great difficulty, and never much better than that. There are too many things that don't map well.
Fundamentally, these are two different languages. Converting from one to another and have the result be mostly readable would mean that the software would have to be able to recognize and generate code idioms, and be able to do some static analysis.
The meaning of a program may be exactly defined by the language definition, but the programmer did not necessarily require all the details. A C programmer testing whether the value printf() returned is negative is checking for an error condition, and doesn't typically care about the exact value. if (printf("%s","...") < 0) exit(); can be translated into Perl as print "..." or die();. These statements may not mean exactly the same thing, but they'll typically be what the programmer means, and to create idiomatic C or Perl code from idiomatic Perl or C code the translator must take this into account.
Since different computer languages tend to have slightly different semantics for similar things, it's typically impossible to translate one language into another and come up with the exact same meaning in readable form. To create readable code, the translator needs to understand what the programmer was intending to do, and that's really difficult.
In addition, it would be easier to translate from Python to Perl rather than Perl to Python. Python is intended as a straightforward language with clear standard ways to do things, while Perl is an unduly complex language with the motto "There's More Than One Way To Do It." Translating a Python expression into one of the innumerable corresponding Perl expressions is easier than figuring out what the Perl programmer meant and expressing it in Python.
Python scope and namespace are different from Perl.
In Python, everything is an object. In Perl, everything under the hood seems to be a list/hash/scalar/reference/function. This induces different design approaches and idioms.
Perl has anonymous code blocks and can generate closures on the fly with some branches. I am pretty sure that is not a Python feature.
I do think that a very smart chap could statically analyze the bulk of Perl and produce a program that takes small Perl programs and output Python programs that do the same job.
I am much more doubtful about the feasibility of large and/or gnarly Perl translation. Some of us write some really funky code at times.... :)
This is impossible just because you can't even properly parse perl code. See Perl Cannot Be Parsed: A Formal Proof for more details.
The B set of modules by Malcolm Beattie would be the only sane starting point for something like this, though I'm with other answers in that this would be a difficult problem to solve. In general, translating the sense of one high-level language into another high-level language requires a high-level translator, and, for the time being, that can mean only a human.
The difficulty of this problem, for any pair of languages, is due to fundamental differences in the nature of the languages in question, such as runtime semantics and common idioms, not to mention libraries.
The reason it is close to impossible to create a generic translator from one high-level language to another is that a program only describes HOW and not WHY (this is the reason for comments in source code).
In order to create a meaningful program in another high-level language, you (or the translator program) need to know WHY in order to create the best possible program. If you cannot do that, all you can do is essentially create a Python interpreter for the compiled version of the Perl program.
In other words, to do this properly you need to go outside the box, and this is very hard for a computer.
NullUserException basically summed it up - it certainly can be done; it would just be an enormous amount of effort to do so. Some language conversion utilities I've seen compile to an intermediate language (such as .NET's CIL) and then decompile that to the desired language. I have not seen any for Perl to Python. You can, however, find a Python to Perl converter here, though that's likely of little use to you unless you're trying to create your own, in which case it may provide some helpful reference.
Edit: if you just need the exact functionality in a Python script, PyPerl may be of some use to you.
Try my version of the Pythonizer: http://github.com/snoopyjc/pythonizer - it does a decent job
