Is there a python regex which will generically match a method definition (not just the declaration but also the method body) inside a python code file?
I did my share of googling but only found something similar for Java. Python is different in that scopes are entered through indentation rather than curly braces. What makes this problem hard is that indentation may drop inside the method body (blank lines, multiline strings, comments).
I also looked for DOM parsers but basically they're all aimed at XML or HTML.
Finally I am looking into introspection (How can I get the source code of a Python function?) but I still wonder if there is a nicer way for code analysis without execution.
EDIT: the question receives a bunch of downvotes but I think it's actually a valid and specific programming question. I elaborated the question a bit.
Err, you don't want to use regexes to parse Python. The 'nicer way for code analysis without execution' is to use the Python standard library parser and/or ast modules. Look under the heading Python Language Services, e.g. https://docs.python.org/2/library/language.html
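For instance, a minimal sketch (Python 3.8+ for ast.get_source_segment; the file name is a placeholder) that pulls out every function or method definition, body included, without executing anything:

import ast

source = open("example.py").read()  # placeholder file name
tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        # get_source_segment returns the exact source text the node
        # spans, declaration and body included.
        print(node.name, "at line", node.lineno)
        print(ast.get_source_segment(source, node))

The indentation bookkeeping (blank lines, multiline strings, comments inside the body) is handled by the parser, which is exactly why regexes aren't the right tool here.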
Related
I want to modify some constructs of Python source code (e.g. variable names). Working with the plain source text is troublesome, so I am using abstract syntax trees. Using ast (the built-in Python library) worked out great for me, but the docs of ast.unparse() contain two warnings that concern me, since I don't want any uncontrolled modifications.
# small example
import ast

code = 'a = 0'
root = ast.parse(code)
for node in ast.walk(root):
    if isinstance(node, ast.Name):
        node.id = 'b'
code = ast.unparse(root)
print(code)  # prints: b = 0
How to unparse ast without running into these problems?
Are there any alternatives to this method?
I don't know what the line about compiler optimizations is referring to, but basically the AST does not include comments, and indentation survives only as the block structure of the tree (the tokenizer reduces it to INDENT and DEDENT tokens), while other whitespace has been removed altogether. unparse renders every indent as exactly four spaces, and inserts a single space character between tokens if necessary. That indeed might be a problem if you are attempting to edit existing code.
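A quick demonstration of both effects (assuming Python 3.9+ for ast.unparse): the comment disappears, and the layout is re-rendered with four-space indents and minimal parentheses:

import ast

src = "x = 1  # set x\nif x:\n        y = (x +\n             2)\n"
print(ast.unparse(ast.parse(src)))
# x = 1
# if x:
#     y = x + 2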
If you want to preserve comments and whitespace, you'll have to use a different parsing strategy, not based on the built-in AST model. There are parsers which preserve comments and whitespace (for example, parsers used for syntax highlighting); if you feel you need one, you should be able to find one with an internet search.
As for the recursion depth warning, you'll need extremely deeply nested code to trigger a stack overflow. Practically no one hand-writes code that would trigger the problem, but it certainly can happen; mostly it happens with machine-generated code. Personally, I wouldn't worry about it until it happens to you, since there's a good chance it never will in your problem domain. (And if it does happen, you'll be informed by an exception, rather than diving into Undefined Behaviour like certain other programming languages.)
Struggling to find a Python library or script to tokenize C++ source (find specific tokens like function definition names, variable names, keywords etc.).
I have managed to find keywords, whitespaces etc. using something like this, but I found it quite a challenge for function/class definition names etc. I was hoping to use a pre-existing script; I explored Pygments with no success. Its lexer seems amazing for what I want, but I have no idea how to utilize it from Python and also get positions for each found token.
For example I am looking at doing something like that:
int fac(int n)
{
    return (n > 1) ? n * fac(n - 1) : 1;
}
from the source code above I would like to get:
function_name: 'fac' at position (x, y)
variable_name: 'n' at position (x, y+8)
EDITED:
Any suggestions will be appreciated, since I am in the dark here regarding tokenization and parsing in C++.
Eli Bendersky is a smart guy, and sometimes active here on SO. He's got a blog post on this issue which I'll refer you directly to: Parsing C++ in Python with Clang.
Because things disappear, here's the takeaway:
Eli Bendersky wrote a C language (not C++) parser in Python, called pycparser. People keep asking him if he's going to add support for C++. He is not. He recommends instead that people use the Python bindings for libclang to get access to "a C API that the Clang team vows to keep relatively stable, allowing the user to examine parsed code at the level of an abstract syntax tree (AST)".
You can find the bindings separately on PyPI here. Note though that you'll have to have clang installed, so you may just want to point your PYTHONPATH directly at the install location.
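A minimal sketch of those bindings (assuming libclang is installed and discoverable; the file name is a placeholder):

import clang.cindex

index = clang.cindex.Index.create()
tu = index.parse("example.cpp")  # placeholder file name
for node in tu.cursor.walk_preorder():
    if node.kind == clang.cindex.CursorKind.FUNCTION_DECL:
        print("function_name:", node.spelling,
              "at", (node.location.line, node.location.column))
    elif node.kind == clang.cindex.CursorKind.PARM_DECL:
        print("variable_name:", node.spelling,
              "at", (node.location.line, node.location.column))

If the shared library isn't found automatically, clang.cindex.Config.set_library_file can point at it.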
You're struggling to find a python library to do what you want because what you want is impossible to do, fundamentally.
I have managed to find keywords, whitespaces etc. using something like this but I found it quite a challenge for function/class definition names etc
You mean like this:
foo = 3
def foo(): pass
What is foo? All a tokenizer should/can tell you is that foo is an identifier. Its context tells you whether it's a variable or a function declaration. You need a parser to handle context-free grammars. Mathematically, context-free languages are strictly more powerful than the regular languages a standard lexer can tackle.
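You can see this with Python's own tokenize module: foo comes back as a plain NAME token in both snippets (keywords, too, are just NAMEs at this level):

import io
import tokenize

for src in ("foo = 3", "def foo(): pass"):
    tokens = tokenize.generate_tokens(io.StringIO(src).readline)
    print([t.string for t in tokens if t.type == tokenize.NAME])
# ['foo']
# ['def', 'foo', 'pass']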
Try a parser: here's one in python
Normally I'd try to provide you links here to distinguish between the topics, but this is too broad to provide a single good link for. If you're interested, start with any standard compiler text. Elsewhere on SE, we see this question pop up as a theoretical question and, in some form, as a famous question about HTML.
Once you realize that tokenizers are (usually) built (largely) on regular expressions, it becomes more obvious why your task is not going to end happily.
Now that you know the terminology, I think you'll find this SO article useful, which recommends gccxml. I don't know how up-to-date it is, but it's the type of program you're looking for.
I want to add new functionality to Python, purely for experimental purposes, where I would like to extend the decorator syntax. Currently decorators can be applied to functions and classes.
I would like to also use decorators on loops (a for loop, for example) and on blocks of code.
Example 1:
@foo
for i in range(20):
    # do something
    # and something more
Example 2:
@foo
# there's a block starting from an indent here.
# there's some code now
# do something
# and something more
Now although this is the basic idea, my requirement is to modify the body that the decorator is applied to.
For example, I want to change the loop a bit based on the decorator applied to it. I can use the AST module for this.
The problem is that I do not want to add a complete new syntax with a full implementation. I just want to parse the new syntax, access the parse tree and the decorated body, operate on it, and insert the result back into the program with the decorator removed, thus turning a program written in the extended syntax into one Python can run today.
Any idea on how I would go about doing this?
You can't do that without adding new syntax. Decorators themselves don't have a "body" as such. Decorators can apply to functions or to classes and that's it. See near the top of http://docs.python.org/2/reference/grammar.html :
decorated: decorators (classdef | funcdef)
If you want something else, it can't be a decorator, it has to be your own syntax that looks like a decorator.
You could write some kind of preprocessor that parses your syntax and transforms it into valid Python. One possibility is the parser module. It has facilities for parsing basic Python elements like suites (i.e., blocks); you can see a simple example in the documentation. The ast module also provides this functionality. But these modules don't provide a way to parse decorators independently of class/function defs; the decorators are essentially treated as part of the class/function definition.
Even if you manage to parse your particular construct, you will probably have to do substantial trickery to create the AST. The problem is that you can't just "access the parse tree" and "modify the AST" because the program as you've written won't have a normal Python parse tree since it can't be parsed as valid Python. So you'll have to try to stitch together your own AST by patching together your custom code with ordinary Python-parsed code.
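A rough sketch of that stitching (the @foo marker and the transformation hook are hypothetical): the pseudo-decorator line is stripped, the indented block after it is parsed on its own, handed to a transform, and unparsed back to plain Python (Python 3.9+ for ast.unparse; re-indentation of nested blocks is glossed over):

import ast
import textwrap

def preprocess(source, transform):
    lines = source.splitlines()
    out, i = [], 0
    while i < len(lines):
        if lines[i].strip() == "@foo" and i + 1 < len(lines):
            # Collect the decorated block: the next line plus every
            # following line that is blank or indented more deeply.
            base = len(lines[i + 1]) - len(lines[i + 1].lstrip())
            j = i + 2
            while j < len(lines) and (not lines[j].strip() or
                    len(lines[j]) - len(lines[j].lstrip()) > base):
                j += 1
            block = textwrap.dedent("\n".join(lines[i + 1:j]))
            tree = transform(ast.parse(block))  # rewrite the ast.For here
            out.append(ast.unparse(tree))
            i = j
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out)

Calling preprocess(src, lambda t: t) would just strip the marker; a real transform would edit the ast.For node before it is unparsed.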
What I mean is, how is the syntax defined, i.e. how can I make my own constructs like these?
I realise that in a lot of languages, things like this will be built into the compiler / spec, and so it's dealt with by the compiler (at least that's how I understand it to work).
But with Python, everything I've come across so far has been accessible to the programmer, and so you more or less have the freedom to do whatever you want.
How would I go about writing my own version of for or while? Is it even possible?
I don't have any actual application for this, so the answer to any WHY?! questions is just "because why not?" or "curiosity".
No, you can't, not from within Python. You can't add new syntax to the language. (You'd have to modify the source code of Python itself to make your own custom version of Python.)
Note that the iterator protocol allows you to define objects that can be used with for in a custom way, which covers a lot of the possible use cases of writing your own iteration syntax.
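For example, a hand-rolled countdown loop, no new syntax needed:

# The iterator protocol in action: `for` drives any object with
# __iter__, which covers many "custom loop" use cases.
class Countdown:
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        n = self.n
        while n > 0:
            yield n
            n -= 1

for i in Countdown(3):
    print(i)   # prints 3, 2, 1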
Well, you have a couple of options for creating your own syntax:
Write a higher-order function, like map or reduce (see the sketch after this list).
Modify python at the C level. This is, as you might expect, relatively easy as compared with fiddling with many other languages. See this article for an example: http://eli.thegreenplace.net/2010/06/30/python-internals-adding-a-new-statement-to-python/
Fake it using the debug facilities, or the encodings facility. See this code: http://entrian.com/goto/download.html and http://timhatch.com/projects/pybraces/
Use a preprocessor. Here's one project that tries to make this easy: http://www.fiber-space.de/langscape/doc/index.html
Use the facilities built into Python to achieve a similar effect (decorators, metaclasses, and the like).
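As a taste of the first option, a tiny sketch of a home-made while-like construct as a higher-order function:

# A custom "loop" as an ordinary function: no new syntax involved.
def repeat_until(done, step, state):
    while not done(state):
        state = step(state)
    return state

print(repeat_until(lambda n: n >= 5, lambda n: n + 1, 0))  # prints 5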
Obviously, none of this is quite what you're looking for, but Python, unlike Smalltalk or Lisp, isn't (necessarily) implemented in itself and doesn't guarantee to expose its own underlying execution and parsing mechanisms at runtime.
You can't make equivalent constructs. for, while, if etc. are statements, and they are built into the language with their own specific syntax. There are languages that do allow this sort of thing though (to some degree), such as Scala.
while, print, for etc. are keywords. They are recognized by Python's tokenizer while reading the code, which strips redundant characters and produces tokens. The parser then takes those tokens as input and builds a program tree, which is executed by the interpreter. That said, those constructs are baked into the underlying lexical and grammatical machinery, and as such are not visible from inside the code.
I'm working on a macro system for Python (as discussed here) and one of the things I've been considering are units of measure. Although units of measure could be implemented without macros or via static macros (e.g. defining all your units ahead of time), I'm toying around with the idea of allowing syntax to be extended dynamically at runtime.
To do this, I'm considering using a sort of partial evaluation on the code at compile-time. If parsing fails for a given expression, due to a macro for its syntax not being available, the compiler halts evaluation of the function/block and generates the code it already has with a stub where the unknown expression is. When this stub is hit at runtime, the function is recompiled against the current macro set. If this compilation fails, a parse error would be thrown because execution can't continue. If the compilation succeeds, the new function replaces the old one and execution continues.
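To make the scheme concrete, here is a toy illustration (the registry, the textual expansion, and all names are made up; a real implementation would expand structurally over a parse tree, not by string replacement):

MACROS = {}  # hypothetical runtime macro registry: token -> replacement

def make_stub(name, source, namespace):
    def stub(*args, **kwargs):
        code = source
        for token, replacement in MACROS.items():
            code = code.replace(token, replacement)
        # Still-unknown syntax surfaces here, at call time, as a SyntaxError.
        exec(compile(code, "<deferred>", "exec"), namespace)
        return namespace[name](*args, **kwargs)
    return stub

ns = {}
f = make_stub("f", "def f(x):\n    return x METRES\n", ns)
MACROS["METRES"] = "* 1.0"
print(f(3))  # compiles now that the 'METRES' macro exists -> 3.0

The failure mode is the same as described above: errors surface only when the stub actually runs.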
The biggest issue I see is that you can't find parse errors until the affected code is run. However, this wouldn't affect many cases, e.g. group operators like [], {}, (), and `` still need to be paired (requirement of my tokenizer/list parser), and top-level syntax like classes and functions wouldn't be affected since their "runtime" is really load time, where the syntax is evaluated and their objects are generated.
Aside from the implementation difficulty and the problem I described above, what problems are there with this idea?
Here are a few possible problems:
You may find it difficult to provide the user with helpful error messages in case of a problem. This seems likely, as any compile-time syntax error could just be a syntax extension that hasn't been loaded yet.
Performance hit.
I was trying to find some discussion of the pluses, minuses, and/or implementation of dynamic parsing in Perl 6, but I couldn't find anything appropriate. However, you may find this quote from Niklaus Wirth (designer of Pascal and other languages) interesting:
The phantasies of computer scientists in the 1960s knew no bounds. Spurred by the success of automatic syntax analysis and parser generation, some proposed the idea of the flexible, or at least extensible language. The notion was that a program would be preceded by syntactic rules which would then guide the general parser while parsing the subsequent program. A step further: the syntax rules would not only precede the program, but they could be interspersed anywhere throughout the text. For example, if someone wished to use a particularly fancy private form of for statement, he could do so elegantly, even specifying different variants for the same concept in different sections of the same program. The concept that languages serve to communicate between humans had been completely blended out, as apparently everyone could now define his own language on the fly. The high hopes, however, were soon damped by the difficulties encountered when trying to specify what these private constructions should mean. As a consequence, the intriguing idea of extensible languages faded away rather quickly.
Edit: Here's Perl 6's Synopsis 6: Subroutines, unfortunately in markup form because I couldn't find an updated, formatted version; search within for "macro". Unfortunately, it's not too interesting, but you may find some things relevant, like Perl 6's one-pass parsing rule, or its syntax for abstract syntax trees. The approach Perl 6 takes is that a macro is a function that executes immediately after its arguments are parsed and returns either an AST or a string; Perl 6 continues parsing as if the source actually contained the return value. There is mention of generation of error messages, but they make it seem like if macros return ASTs, you can do alright.
Pushing this one step further, you could do "lazy" parsing and always only parse enough to evaluate the next statement. Like some kind of just-in-time parser. Then syntax errors could become normal runtime errors that just raise a normal Exception that could be handled by surrounding code:
def fun():
    not implemented yet

try:
    fun()
except:
    pass
That would be an interesting effect, but whether it's useful or desirable is a different question. Generally it's good to know about errors even if you don't call the code at the moment.
Macros would not be evaluated until control reaches them, and naturally the parser would already know all previous definitions. The macro definition could maybe even use variables and data that the program has calculated so far (like adding some syntax for all elements in a previously calculated list). But it is probably a bad idea to start writing self-modifying programs for things that could usually be done just as well directly in the language. It could get confusing...
In any case you should make sure to parse code only once; if it is executed a second time, reuse the already parsed expression, so that this doesn't lead to performance problems.
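A minimal sketch of that caching (the names are illustrative): compile each deferred snippet once and reuse the code object on later executions.

_compiled = {}  # source text -> code object

def run_deferred(source, namespace):
    code = _compiled.get(source)
    if code is None:
        code = _compiled[source] = compile(source, "<deferred>", "exec")
    exec(code, namespace)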
Here are some ideas from my master's thesis, which may or may not be helpful.
The thesis was about robust parsing of natural language.
The main idea: given a context-free grammar for a language, try to parse a given text (or, in your case, a Python program). If parsing fails, you will have a partially generated parse tree. Use the tree structure to suggest new grammar rules that will better cover the parsed text.
I could send you my thesis, but unless you read Hebrew this will probably not be useful.
In a nutshell:
I used a bottom-up chart parser. This type of parser generates edges for productions from the grammar. Each edge is marked with the part of the tree that was consumed. Each edge gets a score according to how close it was to full coverage, for example:
S -> NP . VP
has a score of one half (we succeeded in covering the NP but not the VP).
The highest-scored edges suggest a new rule (such as X->NP).
In general, a chart parser is less efficient than a common LALR or LL parser (the types usually used for programming languages): O(n^3) instead of O(n) complexity. But then again, you are trying something more complicated than just parsing an existing language.
If you can do something with the idea, I can send you further details.
I believe looking at natural language parsers may give you some other ideas.
Another thing I've considered is making this the default behavior across the board, but allowing languages (meaning a set of macros to parse a given language) to throw a parse error at compile-time. Python 2.5 in my system, for example, would do this.
Instead of the stub idea, simply recompile functions that couldn't be handled completely at compile-time when they're executed. This will also make self-modifying code easier, as you can modify the code and recompile it at runtime.
You'll probably need to delimit the bits of input text with unknown syntax, so that the rest of the syntax tree can be resolved, apart from some character-sequence nodes which will be expanded later. Depending on your top-level syntax, that may be fine.
You may find that the parsing algorithm, the lexer, and the interface between them all need updating, which might rule out most compiler-creation tools.
(The more usual approach is to use string constants for this purpose, which are parsed by a little interpreter at run time.)
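For illustration, a toy version of that usual approach (the "repeat N: expr" mini-language here is made up):

def run(spec):
    # hypothetical mini-language: "repeat N: <python expression>"
    head, expr = spec.split(":", 1)
    count = int(head.split()[1])
    for _ in range(count):
        print(eval(expr))

run("repeat 3: 2 + 2")  # prints 4 three times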
I don't think your approach would work very well. Let's take a simple example written in pseudo-code:
define some syntax M1 with definition D1
if _whatever_:
    define M1 to do D2
else:
    define M1 to do D3
code that uses M1
So there is one example where, if you allow syntax redefinition at runtime, you have a problem (since by your approach the code that uses M1 would be compiled with definition D1). Note that verifying whether syntax redefinition occurs is undecidable. An over-approximation could be computed by some kind of type system or some other kind of static analysis, but Python is not well known for this :D.
Another thing that bothers me is that your solution does not 'feel' right. I find it evil to store source code you can't parse just because you may be able to parse it at runtime.
Another example that jumps to mind is this:
...function definition fun1 that calls fun2...
define M1 (at runtime)
use M1
...function definition for fun2
Technically, when you use M1, you cannot parse it, so you need to keep the rest of the program (including the function definition of fun2) as source code. When you run the entire program, you'll hit a call to fun2 that you cannot resolve, even though it is defined.