best way to parse a language that's ALMOST Python?

I'm working on a domain-specific language implemented on top of Python. The grammar is so close to Python's that until now we've just been making a few trivial string transformations and then feeding it into ast. For example, indentation is replaced by #endfor/#endwhile/#endif statements, so we normalize the indentation while it's still a string.
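Roughly, the pre-processing we do today looks like this (a stripped-down sketch; the marker handling and helper name are illustrative, not our real code):

import ast

def normalize(dsl_source):
    """Turn #endfor/#endwhile/#endif markers back into Python indentation."""
    out, depth = [], 0
    for line in dsl_source.splitlines():
        stripped = line.strip()
        if stripped in ("#endfor", "#endwhile", "#endif"):
            depth -= 1
            continue
        out.append("    " * depth + stripped)
        if stripped.endswith(":"):
            depth += 1
    return "\n".join(out)

tree = ast.parse(normalize("for i in range(3):\nprint(i)\n#endfor"))
print(ast.dump(tree))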
I'm wondering if there's a better way? As far as I can tell, ast is hardcoded to parse the Python grammar and I can't really find any documentation other than http://docs.python.org/library/ast.html#module-ast (and the source itself, I suppose).
Does anyone have personal experience with PyParsing, ANTLR, or PLY?
There are vague plans to rewrite the interpreter into something that transforms our language into valid Python and feeds that into the Python interpreter itself, so I'd like something compatible with compile, but this isn't a deal breaker.
Update: It just occurred to me that
from __future__ import print_function, with_statement
changes the way Python parses the following source. However, PEP 236 suggests that this is syntactic window dressing for a compiler feature. Could someone confirm that trying to override/extend __future__ is not the correct solution to my problem?
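For what it's worth, my current understanding is that each __future__ feature is really just a compiler flag that compile() accepts; a minimal sketch of the mechanism (not a solution):

import __future__

src = "print('hello')"
# Compile as if the source had started with
# "from __future__ import print_function".
code = compile(src, "<dsl>", "exec", __future__.print_function.compiler_flag)
exec(code)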

PLY works. It's odd because it mimics lex/yacc in a way that's not terribly pythonic.
Both lex and yacc have an implicit interface that makes it possible to run the output from lex as a stand-alone program; PLY carefully preserves this "feature" in both its lex-like and yacc-like halves, including the weird, implicit stand-alone main program.
However, PLY as a lex/yacc-compatible toolset is quite nice. All your lex/yacc skills carry over.
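To give a feel for the style, a minimal PLY lexer/parser for arithmetic might look roughly like this (my own sketch):

import ply.lex as lex
import ply.yacc as yacc

# --- lexer: token names plus one rule per token ---
tokens = ("NUMBER", "PLUS", "TIMES")

t_PLUS = r"\+"
t_TIMES = r"\*"
t_ignore = " \t"

def t_NUMBER(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

# --- parser: one function per grammar rule, yacc style ---
def p_expr_plus(p):
    "expr : expr PLUS term"
    p[0] = p[1] + p[3]

def p_expr_term(p):
    "expr : term"
    p[0] = p[1]

def p_term_times(p):
    "term : term TIMES NUMBER"
    p[0] = p[1] * p[3]

def p_term_number(p):
    "term : NUMBER"
    p[0] = p[1]

def p_error(p):
    print("Syntax error at %r" % (p,))

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse("2 + 3 * 4"))   # -> 14

Note how lex.lex() and yacc.yacc() introspect the calling module for the token definitions and p_* functions; that is the implicit, lex/yacc-style interface described above.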
[Editorial Comment. "Fixing" Python's grammar will probably be a waste of time. Almost everyone can indent correctly without any help. Check C, Java, C++ and even Pascal code, and you'll see that almost everyone can indent really well. Indeed, people go to great lengths to indent Java where it's not needed. If indentation is unimportant in Java, why do people do such a good job of it?]

Related

ISO human-readable parser for Python in Python

I'm looking for a parser for Python (preferably v. 2.7) written in human-readable Python. Performance and flexibility are not important. The accuracy/correctness of the parsing and the clarity of the parser's code are far more important considerations here.
Searching online I've found a few parser generators that generate human-readable Python code, but I have not found the corresponding Python grammar to go with any of them (from what I could see, they all follow different grammar specification conventions). At any rate, even if I could find a suitable parser-generator/Python grammar combo, a readily available Python parser that fits my requirements (human-readable Python code) is naturally far more preferable.
Any suggestions?
Thanks!
PyPy is a Python implementation written entirely in Python. I am not an expert, but here's the link to their parser which - obviously - has been written in Python itself:
https://bitbucket.org/pypy/pypy/src/819faa2129a8/pypy/interpreter/pyparser
I think you should invest your effort in ast. An excerpt from the Python docs:
The ast module helps Python applications to process trees of the Python abstract syntax grammar. The abstract syntax itself might change with each Python release; this module helps to find out programmatically what the current grammar looks like.
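A minimal illustration of poking at the tree (my own sketch):

import ast

source = "x = 1 + 2\nprint(x)"
tree = ast.parse(source)

# Dump the tree to see the node names of the current grammar.
print(ast.dump(tree))

# List every node type that appears in the source.
print(sorted({type(node).__name__ for node in ast.walk(tree)}))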

Why is it not possible to create a practical Perl to Python source code converter?

It would be nice if there existed a program that automatically transforms Perl code to Python code, making the resultant Python program as readable and maintainable as the original one, not to mention working the same way.
The most obvious solution would be to just invoke perl from Python:
#!/usr/bin/python
import os
# Skip the Python header lines and pipe the rest of this file to perl.
os.system("tail -n +5 " + __file__ + " | perl -")
...the rest of the file is the original perl program...
However, the resultant code is hardly Python code; it's essentially Perl code. The potential converter should convert Perl constructs and idioms to easy-to-read Python code, it should retain variable and subroutine names (i.e. the result should not look obfuscated) and should not shatter the workflow too much.
Such a conversion is obviously very hard. The difficulty of the conversion depends on the number of Perl features and syntactical constructs which do not have easy-to-read, unobfuscated Python equivalents. I believe that the large number of such features renders such automatic conversion practically impossible (though it remains theoretically possible).
So, could you please name Perl idioms and syntax features that can't be expressed in Python as concisely as in the original Perl code?
Edit: some people linked Python-to-Perl converters and deduced, on that basis, that it should be easy to write a Perl-to-Python converter as well. However, I'm sure that converting to Python is in greater demand; still, this converter has not yet been written, while the reverse already has! Which only makes my confidence in the impossibility of writing a good converter to Python more solid.
Your best Perl to Python converter is probably 23 years old, just graduated university and is looking for a job.
Why Perl is not Python.
Perl has statements which Python more-or-less totally lacks. While you can probably contrive matching statements, the syntax will be so utterly unlike Perl as to make it difficult to call it a "translation". You'd really have to cook up some fancy Python stuff to make it as terse as the original Perl.
Perl has run-time semantics which are so unlike Python as to make translation very challenging. We'll look at just one example below.
Perl has data structures which are enough different from Python that translation is hard.
Perl threads don't share data by default. Only selected data elements can be shared. Python threads follow the more common "shared everything" model.
One example of #2 should be enough.
Perl:
do_something || die()
Where do_something is any statement of any kind.
To automagically translate this into Python you'd have to wrap every || die() statement in
try:
    python_version_of_do_something
except OrdinaryStatementFailure, e:
    die()
    sys.exit()
Where the more common formulation
Perl
do_something
Would become this using simple -- unthinking -- translation of the source
try:
    python_version_of_do_something
except OrdinaryStatementFailure, e:
    pass
And, of course,
Perl
do_this || do_that || die()
Is even more complex to translate into Python.
And
Perl
do_this && do_that || die()
really pushes the envelope. My Perl is rusty, so I can't recall the precise semantics of this kind of thing. But you have to totally understand the semantics to work out a Pythonic implementation.
The Python examples are not good Python. To write good Python requires "thinking", something an automatic translator can't do.
And every Perl construct would have to be "wrapped" like that in order to get the original Perl semantics into a Pythonic form.
Now, do a similar analysis for every feature of Perl.
Just to expand on some of the other lists here, these are a few Perl constructs that are probably very clumsy in Python (if they are possible at all).
dynamic scope (via the local keyword)
typeglob manipulation (multiple variables with the same name)
formats (they have a syntax all their own)
closures over mutable variables
pragmas
lvalue subroutines (mysub() = 5; type code)
source filters
context (list vs scalar, and the way that called code can inspect this with wantarray)
type coercion / dynamic typing
any program that uses string eval
The list goes on and on, and someone could try to create a mapping between all of the analogous constructs, but in the end it will be a failure for one simple reason.
Perl can not be statically parsed. The definitions in Perl code (particularly those in BEGIN blocks) change the way the compiler is going to interpret the remaining code. So for non-trivial programs, conversion from Perl => Python suffers from the halting problem.
There is no way to know exactly how all of the program will be compiled until the program has finished running, and it is theoretically possible to create a Perl program that will compile differently every time it is run. Meaning that one Perl program could map to an infinite number of Python programs, the correct one of which is only known after running the original program in the perl interpreter.
It is not impossible, it would just take a lot of work.
By the way, there is Perthon, a Python-to-Perl translator. It just seems like nobody is willing to make one that goes the other way.
EDIT: I think I might have found the reason why a Python to Perl translator is much easier to implement. It's because Python lets you fiddle with a script's AST. See the parser module.
Perl can experimentally be built to collect additional information (for instance, comments) during compilation of perl code and even emit the results as XML. There doesn't appear to be any documentation of this outside the source, except for: http://search.cpan.org/perldoc/perl5100delta#MAD
This should be helpful in building a translator. I'd expect you to get 80% of the way there fairly easily, 95% with great difficulty, and never much better than that. There are too many things that don't map well.
Fundamentally, these are two different languages. Converting from one to another and have the result be mostly readable would mean that the software would have to be able to recognize and generate code idioms, and be able to do some static analysis.
The meaning of a program may be exactly defined by the language definition, but the programmer did not necessarily require all the details. A C programmer testing if the value a printf() returned is negative is checking for an error condition, and doesn't typically care about the exact value. if (printf("%s","...") < 0) exit(); can be translated into Perl as print "..." or die();. These statements may not mean exactly the same thing, but they'll typically be what the programmer means, and to create idiomatic C or Perl code from idiomatic Perl or C code the translator must take this into account.
Since different computer languages tend to have slightly different semantics for similar things, it's typically impossible to translate one language into another and come up with the exact same meaning in readable form. To create readable code, the translator needs to understand what the programmer was intending to do, and that's really difficult.
In addition, it would be easier to translate from Python to Perl rather than Perl to Python. Python is intended as a straightforward language with clear standard ways to do things, while Perl is an unduly complex language with the motto "There's More Than One Way To Do It." Translating a Python expression into one of the innumerable corresponding Perl expressions is easier than figuring out what the Perl programmer meant and expressing it in Python.
Python scope and namespace are different from Perl.
In Python, everything is an object. In Perl, everything under the hood seems to be a list/hash/scalar/reference/function. This induces different design approaches and idioms.
Perl has anonymous code blocks and can generate closures on the fly with some branches. I am pretty sure that is not a python feature.
I do think that a very smart chap could statically analyze the bulk of Perl and produce a program that takes small Perl programs and outputs Python programs that do the same job.
I am much more doubtful about the feasibility of large and/or gnarly Perl translation. Some of us write some really funky code at times.... :)
This is impossible just because you can't even properly parse perl code. See Perl Cannot Be Parsed: A Formal Proof for more details.
The B set of modules by Malcolm Beattie would be the only sane starting point for something like this, though I'm with other answers in that this would be a difficult problem to solve. In general, translating the sense of one high-level language into another high-level language requires a high-level translator, and, for the time being, that can mean only a human.
The difficulty of this problem, for any pair of languages, is due to fundamental differences in the nature of the languages in question, such as runtime semantics and common idioms, not to mention libraries.
The reason it is close to impossible to create a generic translator from one high-level language to another is that the program only describes HOW and not WHY (this is the reason for comments in the source code).
In order to create a meaningful program in another high-level language you (or the translator program) need to know WHY, in order to create the best possible program. If you cannot do that, all you can do is essentially create a Python interpreter for the compiled version of the Perl program.
In other words, to do this properly you need to go outside the box, and this is very hard for a computer.
NullUserException basically summed it up - it certainly can be done; it would just be an enormous amount of effort to do so. Some language conversion utilities I've seen compile to an intermediate language (such as .NET's CIL) and then decompile that to the desired language. I have not seen any for Perl to Python. You can, however, find a Python to Perl converter here, though that's likely of little use to you unless you're trying to create your own, in which case it may provide some helpful reference.
Edit: if you just need the exact functionality in a Python script, PyPerl may be of some use to you.
Try my version of the Pythonizer: http://github.com/snoopyjc/pythonizer - it does a decent job

what next after pyparsing?

I have a huge grammar developed for pyparsing as part of a large, pure Python application.
I have reached the limit of performance tweaking and I'm at the point where the diminishing returns make me start to look elsewhere. Yes, I think I know most of the tips and tricks and I've profiled my grammar and my application to dust.
What next?
I hope to find a parser that gives me the same readability, usability (I'm using many advanced features of pyparsing such as parse-actions to start the post processing of the input which is being parsed) and python integration but at 10× the performance.
I love the fact that the grammar is pure Python.
All my basic blocks are regular expressions, so reusing them would be nice.
I know I can't have everything so I am willing to give up on some of the features I have today to get to the requested 10× performance.
Where do I go from here?
It looks like the pyparsing folks have anticipated your problem. From https://github.com/pyparsing/pyparsing/blob/master/docs/HowToUsePyparsing.rst :
Performance of pyparsing may be slow for complex grammars and/or large input strings. The psyco package can be used to improve the speed of the pyparsing module with no changes to grammar or program logic - observed improvements have been in the 20-50% range.
However, as Vangel noted in the comments below, psyco is an obsolete project as of March 2012. Its successor is the PyPy project, which starts from the same basic approach to performance: use a JIT native-code compiler instead of a bytecode interpreter. You should be able to achieve similar or greater gains with PyPy if switching Python implementations will work for you.
If you're really a speed demon, but want to keep some of the legibility and declarative syntax, I'd suggest having a look at ANTLR. Probably not the Python-generating backend; I'm skeptical whether that's mature or high-performance enough for your needs. I'm talking about the goods: the C backend that started it all.
Wrap a Python C extension module around the entry point to the parser, and turn it loose.
Having said that, you'll be giving up a lot in this transition: basically any Python you want to do in your parser will have to be done through the C API (not altogether pretty). Also, you'll have to get used to very different ways of doing things. ANTLR has its charms, but it's not based on combinators, so there's not the easy and fluid relationship between your grammar and your language that there is in pyparsing. Plus, it's its own DSL, much like lex/yacc, which can present a learning curve – but, because it's LL based, you'll probably find it easier to adapt to your needs.
Switch to a generated C/C++ parser (using ANTLR, flex/bison, etc.). If you can delay all the action rules until after you are done parsing, you might be able to build an AST with trivial code and then pass that back to your python code via something like SWIG and process on it with your current actions rules. OTOH, for that to give you a speed boost, the parsing has to be the heavy lifting. If your action rules are the big cost, then this will buy you nothing unless you write your action rules in C as well (but you might have to do it anyway to avoid paying for whatever impedance mismatch you get between the python and C code).
If you really want performance for large grammars, look no farther than SimpleParse (which itself relies on mxTextTools, a C extension). However, know now that it comes at the cost of being more cryptic and requiring that you be well-versed in EBNF.
It's definitely not the more Pythonic route, and you're going to have to start all over with an EBNF grammar to use SimpleParse.
A bit late to the party, but PLY (Python Lex-Yacc), has served me very well. PLY gives you a pure Python framework for constructing lex-based tokenizers, and yacc-based LR parsers.
I went this route when I hit performance issues with pyparsing.
Here is a somewhat old but still interesting article on Python parsing which includes benchmarks for ANTLR, PLY and pyparsing. PLY is roughly 4 times faster than pyparsing in this test.
There's no way to know what kind of benefit you'll get without just testing it, but it's within the range of possibility that you could get 10x benefit just from using Unladen Swallow if your process is long-running and repetitive. (Also, if you have many things to parse and you typically start a new interpreter for each one, Unladen Swallow gets faster - to a point - the longer you run your process, so while parsing one input might not show much gain, you might get significant gains on the 2nd and 3rd inputs in the same process).
(Note: pull the latest out of SVN - you'll get far better performance than the latest tarball)

programming language implemented in pure python

I am creating (researching the possibility of) a highly customizable Python client and would like to allow users to actually edit the code in another language to customize how the program runs (analogous to a browser, which is itself coded in C/C++ yet runs another language, HTML/JS). So my question is: is there any programming language implemented in pure Python which I can look at as a reference (or use directly)? I need a simple language (simple statements and ifs will do).
Edit: sorry if I did not make myself clear, but what I want is "a language to customize the running of the program". Even though PyPy seems a great option, what I am looking for is something simpler that I can study and extend myself if the need arises. My Google searches keep pointing towards XML-based languages (BMEL, XForms, etc.).
The question isn't completely clear on scope, but I have a hunch that PyPy, embedding other full languages, and similar solutions might be overkill. It sounds like iamgopal may really be interested in something more like Interpreter Pattern or Little Language.
If the language you want to support is really small (see the Interpreter Pattern link), then hand-coding this yourself in Python won't be too hard. You can write a simple parser (Google around; here's one example), then walk the AST and evaluate user expressions.
However, if you expect this to be used for a long time or by many people, it may be worth throwing a real language at the problem. (I'd recommend Python itself if your users are already familiar with basic Python syntax).
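To make the Interpreter Pattern idea concrete, here is a hand-rolled sketch of a tiny expression language (all names here are made up for illustration): tokenize, parse into a nested-tuple "AST", then walk it.

import re

TOKEN = re.compile(r"\s*(\d+|[+-])")

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError("bad input at %r" % text[pos:])
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(tokens):
    # Grammar: expr := NUMBER (('+'|'-') NUMBER)*
    node = ("num", int(tokens.pop(0)))
    while tokens:
        op = tokens.pop(0)
        rhs = ("num", int(tokens.pop(0)))
        node = (op, node, rhs)
    return node

def evaluate(node):
    if node[0] == "num":
        return node[1]
    lhs, rhs = evaluate(node[1]), evaluate(node[2])
    return lhs + rhs if node[0] == "+" else lhs - rhs

print(evaluate(parse(tokenize("1 + 2 - 3 + 10"))))   # -> 10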
Ren'Py is a modification to Python syntax built on top of Python itself, using the language tools in the stdlib.
For your user's sake, don't use an XML based language - XML is an awful basis for a programming language and your users will hate you for it.
Here is a suggestion. Use a strict subset of Python for your language. Use the compiler module to convert their code into an abstract syntax tree and walk the tree to validate that the code conforms to your subset, before converting the AST into Python bytecode.
N.B. I just checked the docs and see that the compiler package is deprecated in 2.6 and removed in Python 3.x. Does anyone know why that is?
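On Pythons where compiler is gone, the ast module plays the same role. A rough sketch of the whitelisting idea (the allowed-node set below is purely illustrative; ast.Constant needs Python 3.8+):

import ast

# Hypothetical whitelist: only simple assignments, expressions and if tests.
ALLOWED_NODES = (
    ast.Module, ast.If, ast.Compare, ast.Name, ast.Load, ast.Store,
    ast.Assign, ast.Expr, ast.Call, ast.Constant, ast.BoolOp, ast.And,
    ast.Or, ast.Eq, ast.NotEq,
)

def validate(source):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError("disallowed construct: %s" % type(node).__name__)
    return compile(tree, "<user-script>", "exec")

code = validate("flag = 1\nif flag == 1:\n    result = 2")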
Numerous template languages such as Cheetah, Django templates, Genshi, Mako, and Myghty might serve as examples.
Why not Python itself? With some care you can use eval to run user code.
One of the good things about interpreted scripting languages is that you don't need another extra scripting language!
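For example, a rough sketch of evaluating a user expression against a controlled namespace (eval is not a real sandbox, so treat this as illustrative only):

# User-supplied expression evaluated against a controlled namespace.
# Note: eval is NOT a security boundary; a determined user can escape this.
allowed_names = {"price": 10.0, "quantity": 3, "min": min, "max": max}

user_expression = "min(price * quantity, 25)"
result = eval(user_expression, {"__builtins__": {}}, allowed_names)
print(result)   # -> 25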
PLY (Python Lex-Yacc)
might be of interest to you.
Possibly Common Lisp (or any other Lisp) would be the best choice for that task, because Lisp makes it possible to easily extend the host language with powerful macros and to construct DSLs (domain-specific languages).
If all you need is simple if statements and expressions, I'm sure it wouldn't be an awful task to parse each line. Something like
if some flag
activate some feature
deactivate some feature
elif some other flag
activate some feature
activate some feature
else
logout
Just write a class which, while parsing, takes the first word of each line, checks whether it's "if", "elif", "else", etc., and if so checks a flag and sets a flag saying whether or not you are executing until the next conditional. If it's not a conditional, call a function based on the first keyword that would modify the program state in some way.
The class could store some local execution state (are we in an if statement? If so are we executing this branch?) and have another class containing some global application state (flags that are checkable by if statements, etc).
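Something along these lines (my own illustrative sketch; the flag and feature names are made up):

class MiniInterpreter(object):
    """Line-by-line interpreter for the tiny flag/feature language above."""

    def __init__(self, flags):
        self.flags = flags          # global application state (checkable flags)
        self.executing = True       # are we inside a branch that should run?
        self.branch_taken = False   # has some branch of the current if matched?

    def run(self, script):
        for line in script.splitlines():
            words = line.split()
            if not words:
                continue
            keyword = words[0]
            if keyword == "if":
                self.branch_taken = self.executing = self.flags.get(" ".join(words[1:]), False)
            elif keyword == "elif":
                cond = self.flags.get(" ".join(words[1:]), False)
                self.executing = cond and not self.branch_taken
                self.branch_taken = self.branch_taken or cond
            elif keyword == "else":
                self.executing = not self.branch_taken
            elif self.executing:
                self.do_command(keyword, words[1:])

    def do_command(self, keyword, args):
        # Dispatch to something that modifies program state; here we just print.
        print("%s(%s)" % (keyword, " ".join(args)))

MiniInterpreter({"some flag": True}).run(
    "if some flag\nactivate some feature\nelse\nlogout")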
This is probably the wrong thing to do in your situation (it's very prone to bugs, it's dangerous if you don't treat the data in the scripts correctly), but it's at least a start if you do decide to interpret your own mini-language.
Seriously though, if you try this, be very, very, srs careful. Don't give the scripts any functionality that they don't definitely need, because you are almost certainly opening security holes by doing something like this.
Don't say I didn't warn you.

How to parse *.py file with python?

I'd like to parse Python source in order to try making a basic source code converter from Python to Go.
What module should I use?
Should I proceed or not?
If I should proceed, how?
Have a look at the language services packages, particularly the ast.
My guess is that if you don't already have a solid grasp of both parsing as well as code generation techniques, this is going to be a difficult project to undertake.
good luck!
As for the 'should I go ahead or better not' question: why do you want to do this in the first place?
If it's a purely learning exercise, then you don't need to ask us whether it's worthwhile. You want to learn, so go right ahead.
If it's meant to be a practical tool, then my suggestion is to not do it. An industrial-strength tool to perform such conversions might be useful but I would guess that you're not going to go that far. With that in mind it's probably more fruitful to rewrite the Python code in Go manually.
That assumes there is any real benefit to compiling to Go; current testing suggests that you get better performance and similar code structure from using Stackless Python.
The Boo Solution
Are you trying to make a Python-like language that compiles into Go? This seems most sensible, as you will want to do Go-specific things (to take advantage of Go features).
Look at pyparsing. It includes an example of a complete python parser, but you probably don't want to do that.
You want to incrementally build your converter / translator, so you want to incrementally build the parser, otherwise you might choke on the AST. OK, you could parse everything and just ignore the stuff you don't understand, but that's not great behavior from a compiler.
You could start with parsing basic arithmetic.
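For instance, a tiny pyparsing grammar for arithmetic might look like this (a sketch under that assumption, not part of the python.py example):

from pyparsing import Word, nums, infixNotation, opAssoc

integer = Word(nums)
arith_expr = infixNotation(integer, [
    ("*", 2, opAssoc.LEFT),   # multiplication binds tighter
    ("+", 2, opAssoc.LEFT),
])

# Groups nest according to precedence: [['1', '+', ['2', '*', '3']]]
print(arith_expr.parseString("1 + 2 * 3"))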
The Pyrex Solution
This is similar to the Boo solution, but much harder. Get the Boo solution working first. Then learn to generate wrapper code, so your Go and python parts can work together.
The PyPy Solution
A complete Python-Go compiler? Good luck. You'll need it.
There's a good list of parsers rounded-up by Ned Batchelder which might help.
