regular expression - function body extracting - python

In a Python script, for every method definition in some C++ code of the form:
return_value ClassName::MethodName(args)
{MethodBody}
I need to extract three parts: the class name, the method name and the method body for further processing. Finding and extracting the ClassName and MethodName is easy, but is there any simple way to extract the body of the method, with all possible '{' and '}' inside it? Or are regexes unsuitable for such a task?

>>> s = """return_value ClassName::MethodName(args)
{MehodBody {} } """
>>> re.findall(r'\b(\w+)::(\w+)\([^{]+\{(.+)}', s, re.S)
[('ClassName', 'MethodName', 'MehodBody {} ')]
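The greedy (.+) works in that snippet only because the sample ends right after the closing brace; with several methods in one file it will swallow everything up to the last '}' in the text. A more robust option is to match the signature with a regex and then count braces forward to find the matching close. A minimal sketch (the helper name extract_methods is mine, and it still won't understand braces inside string literals or comments):

import re

def extract_methods(source):
    """Yield (class_name, method_name, body) for each method definition."""
    signature = re.compile(r'\b(\w+)::(\w+)\([^)]*\)\s*\{')
    for m in signature.finditer(source):
        depth = 1
        i = m.end()
        # Walk forward, counting braces until the opening '{' is balanced again.
        while i < len(source) and depth:
            if source[i] == '{':
                depth += 1
            elif source[i] == '}':
                depth -= 1
            i += 1
        yield m.group(1), m.group(2), source[m.end():i - 1]

For the sample above, list(extract_methods(s)) gives [('ClassName', 'MethodName', 'MethodBody {} ')].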

I would recommend that you use the parser module rather than regexps, since it will handle things like multi-line functions and different indentation, and will abort on malformed input so that you can manage things better. "Avoid regexps if you can" is one of the rules I live by, since they're often more trouble than they're worth.
Edit:
Oh okay. I misread your question. I thought you wanted to parse Python code itself. I googled a little and found this, but it's C only. Perhaps you can extend that? The grammar for C++ is in "The C++ Programming Language" book.

Related

Parsing Python function declaration

In order to write a custom documentation generator for my Python code, I'd like to write a regular expression capable of matching the following:
def my_function(arg1,arg2,arg3):
    """
    DOC
    """
My current problem is that, using the following regex:
def (.+)\((?:(\w+),)*(\w+)\)
I can only match my_function, arg2 and arg3 (according to Pythex.org).
I don't understand what I'm doing wrong, since my (?:(\w+),)* should match as many arguments as possible, until the last one (here arg3). Could anyone explain?
Thanks
This isn't possible in a general sense, because Python function definitions don't form a regular language -- they can take on forms that can't be captured with regular expression syntax, especially in the context of OTHER Python structures. But take heart, there is a lot to learn from your question!
The fascinating thing is that although you said you're trying to learn regular expressions, you accidentally stumbled into the very heart of computer science itself, compiler theory!
I'm going to address less than a fraction of the tip of the iceberg in this post to help get you started, and then suggest a few free and paid resources to help you continue.
A python function by itself may take on several forms:
def foo(x):
    "docstring"
    <body>

def foo1(x):
    """doc
    string"""
    <body>

def foo2(x):
    <body>
Additionally, what comes before and after the function may not be another function!
This is what makes it impossible to autogenerate documentation using a regex by itself (well, not possible for me -- I'm not smart enough to write a single regular expression that can account for the entire Python language!).
What you need to look into is parsing (by the way, I'm using the term parsing very loosely to cover parsing, tokenizing, and lexing, just to keep things "simple"). Regular expressions are typically a very important part of parsing.
The general strategy would be to parse the file into syntactic constructs. Identify which of those constructs are functions. Isolate the text of that function. THEN you can use a regular expression to parse the text of that construct. OR you can parse one level further and break up the function into distinct syntactic constructions -- function name, parameter declaration, doc string, body, etc... at which point your problem will be solved.
I was attempting to write a regular expression for a standard function definition (without parsing) like foo or foo1, but I was struggling to do so even having written a few languages.
So just to be clear, the point at which I would think about parsing as opposed to a simple regex is any time your input spans multiple lines. Regex is most effective on single lines.
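For the def header itself, which fits on one line, a regex works fine; the reason your original pattern only kept arg3 is that a repeated capturing group like (?:(\w+),)* only remembers its last repetition. Capturing the whole argument list in one group and splitting it afterwards sidesteps that. A minimal sketch (it assumes no parentheses inside default values or annotations):

import re

HEADER = re.compile(r'def\s+(\w+)\s*\(([^)]*)\)\s*:')

def parse_header(line):
    # Capture the name and the full argument list, then split the list in Python.
    m = HEADER.match(line.strip())
    if not m:
        return None
    name, arglist = m.groups()
    args = [a.strip() for a in arglist.split(',') if a.strip()]
    return name, args

print(parse_header("def my_function(arg1,arg2,arg3):"))
# ('my_function', ['arg1', 'arg2', 'arg3'])

For anything that spans multiple lines, though, you're back to parsing.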
A parsing function looks like this:
def parse_fn_definition(definition):
    def parse_name(lines):
        <code>
    def parse_args(lines):
        <code>
    def parse_doc(lines):
        <code>
    def parse_body(lines):
        <code>
    ...
Now here's the real trick: Each of these parse functions returns two things:
0) The chunk of text it parsed
1) The REST of the lines
so for instance,
import re

def parse_name(lines):
    # Two groups: the function name and the rest of the def line.
    pattern = r'def\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*(.*)'
    for line in lines:
        m = re.match(pattern, line)
        if m:
            res, rest = m.groups()
            return res, [rest] + lines[1:]
        else:
            raise Exception("Line cannot be parsed by parse_name: {}".format(line))
So, once you've isolated the function text (that's a whole other set of tricks to do, usually involves creating something called a "grammar" -- don't worry, I set you up with some resources down below), you can parse the function text with the following technique:
def parse_fn(lines_of_text):
    name, rest = parse_name(lines_of_text)
    params, rest = parse_params(rest)
    doc_string, rest = parse_doc(rest)
    body, rest = parse_body(rest)
    function = [name, params, doc_string, body]
    res = function, rest
    return res
This function would return some data structure that represents the function (I just used a simple list for illustration) and the rest of the lines of text. That would get passed on to something that will appropriately catalog your function data and then classify and process the rest of the text!
Anyway, if this is something that interests you, don't give up! I would offer a few humble suggestions:
1) Start with an EASIER language to parse, like Scheme/LISP. These languages were designed to be easy to parse and manipulate! Then work your way up to more irregular languages.
2a) Peter Norvig has done some amazing and very accessible work on this. Check out Lispy!
2b) Peter Norvig's class CS212 (specifically unit 3 code) is very challenging but does an excellent job introducing fundamental language design concepts. Every job I've ever gotten, and my love for programming, is because of that course.
3) If you want to advance yourself even further and you can afford it, I would strongly recommend checking out Dave Beazley's workshops on compilers or interpreters. I've taken two courses from Dave, and while I can't promise this for everyone, my salary has literally doubled after each course, so I think it's a worthwhile investment.
4) Absolutely check out Structure and Interpretation of Computer Programs (the wizard book) and Compilers (the dragon book). They'll change your life.
5) DON'T GIVE UP! YOU GOT THIS!! Good luck to you!

Sublime Text syntax: Python 3.6 f-strings

I am trying to modify the default Python.sublime_syntax file to handle Python’s f-string literals properly. My goal is to have expressions in interpolated strings recognised as such:
f"hello {person.name if person else 'there'}"
-----------source.python----------
------string.quoted.double.block.python------
Within f-strings, ranges of text between a single { and another } (but terminating before format specifiers such as !r}, :<5}, etc—see PEP 498) should be recognised as expressions. As far as I know, that might look a little like this:
...
string:
  - match: "(?<=[^\{]\{)[^\{].*)(?=(!(s|r|a))?(:.*)?\})" # I'll need a better regex
    push: expressions
However, upon inspecting the built-in Python.sublime_syntax file, the string contexts especially are too unwieldy to even approach (~480 lines?) and I have no idea how to begin. Thanks heaps for any info.
There was an update to syntax highlighting in BUILD 3127 (which includes "Significant improvements to Python syntax highlighting").
However, a couple of users have stated that in BUILD 3176 syntax highlighting is still not set up to correctly highlight Python expressions located within f-strings. According to @Jollywatt, it is set to source.python f"string.quoted.double.block {constant.other.placeholder}" rather than f"string.quoted.double.block {source.python}"
It looks like Sublime uses this tool, PackageDev, "to ease the creation of snippets, syntax definitions, etc. for Sublime Text."

Python: splitting and splitting

I need some help;
I'm trying to program a sort of command prompt with Python.
I need to split a text file into lines, then split those lines into strings.
For example, splitting:
command1 var1 var2;
command2 (blah, bleh);
command3 blah (b bleh);
command4 var1(blah b(bleh * var2));
into:
line1=['command1','var1','var2']
line2=['command2']
line2_sub1=['blah','bleh']
line3=['blah']
line3_sub1=['b','bleh']
line4=['command4']
line4_sub1=['blah','b']
line4_sub2=['bleh','var2']
line4_sub2_operand=['*']
Would that be possible at all?
If so, could someone explain how, or give me a piece of code that would do it?
Thanks a lot,
It's been pointed out that there appears to be no consistent logic to your language. All I can do is point you to pyparsing, which is what I would use if I were solving a problem similar to this; here is a pyparsing example for the Python language.
Like everyone else is saying, your language is confusingly designed and you probably need to simplify it. But I'm going to give you what you're looking for and let you figure that out the hard way.
The standard python file object (returned by open()) is an iterator of lines, and the split() method of the python string class splits a string into a list of substrings. So you'll probably want to start with something like:
for line in command_file:
    words = line.split(' ')
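A slightly fuller version of the same idea, skipping blank lines and stripping the trailing ';' from your sample input (the file name here is just a placeholder):

with open("commands.txt") as command_file:   # placeholder file name
    for line in command_file:
        line = line.strip().rstrip(";")
        if not line:
            continue                         # skip blank lines
        words = line.split()                 # split on any run of whitespace
        print(words)
# "command1 var1 var2;"  ->  ['command1', 'var1', 'var2']

The nested, parenthesised parts of your other lines won't come apart this way; that's where a real parser (pyparsing, or a hand-written one) comes in.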
http://docs.python.org/3/library/string.html
You could use this code to read the file line by line and split it on the spaces between words.
a = True
f = open(filename)
while a:
    nextline = f.readline()
    wordlist = nextline.split()   # split on whitespace between words
    print(wordlist)
    if nextline == "":            # readline() returns "" at end of file
        a = False
f.close()
What you're talking about is writing a simple programming language. It's not extraordinarily difficult if you know what you're doing, but it is the sort of thing most people take a full semester class to learn. The fact that you've got multiple different types of lexical units with what looks to be a non-trivial, recursive syntax means that you'll need a scanner and a parser. If you really want to teach yourself to do this, this might not be a bad place to start.
If you simplify your grammar such that each command only has a fixed number of arguments, you can probably get away with using regular expressions to represent the syntax of your individual commands.
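For instance, if each command has a fixed shape, one pattern per command is enough. A rough sketch for the two simplest kinds of line in your example (the pattern names are mine, and they won't cope with the nested forms):

import re

patterns = {
    # command followed by exactly two bare arguments, e.g. "command1 var1 var2;"
    "simple": re.compile(r'(\w+)\s+(\w+)\s+(\w+);'),
    # command followed by a parenthesised pair, e.g. "command2 (blah, bleh);"
    "paired": re.compile(r'(\w+)\s*\((\w+),\s*(\w+)\);'),
}

line = "command1 var1 var2;"
for kind, pat in patterns.items():
    m = pat.fullmatch(line)
    if m:
        print(kind, list(m.groups()))   # simple ['command1', 'var1', 'var2']
        break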
Give it a shot. Just don't expect it to all work itself out overnight.

IronPython: Is there an alternative to significant whitespace?

For rapidly changing business rules, I'm storing IronPython fragments in XML files. So far this has been working out well, but I'm starting to get to the point where I need more than just one-line expressions.
The problem is that XML and significant whitespace don't play well together. Before I abandon it for another language, I would like to know if IronPython has an alternative syntax.
IronPython doesn't have an alternate syntax. It's an implementation of Python, and Python uses significant indentation (all languages use significant whitespace, not sure why we talk about whitespace when it's only indentation that's unusual in the Python case).
>>> from __future__ import braces
File "<stdin>", line 1
from __future__ import braces
^
SyntaxError: not a chance
All I want is something that will let my users write code like
Ummm... Don't do this. You don't actually want this. In the long run, this will cause endless little issues because you're trying to force too much content into an attribute.
Do this.
<Rule Name="Markup">
<Formula>(Account.PricingLevel + 1) * .05</Formula>
</Rule>
You should try not to have significant, meaningful stuff in attributes. As a general XML design policy, you should use tags and save attributes for names and ID's and the like. When you look at well-done XSD's and DTD's, you see that attributes are used minimally.
Having the body of the rule in a separate tag (not an attribute) saves much pain. And it allows a tool to provide correct CDATA sections. Use a tool like Altova's XML Spy to assure that your tags have space preserved properly.
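Reading the rule back out of element text is then straightforward. A minimal sketch using the standard-library ElementTree (the XML is the fragment above; how you evaluate the formula is up to your rules engine):

import xml.etree.ElementTree as ET

xml = """
<Rule Name="Markup">
    <Formula>(Account.PricingLevel + 1) * .05</Formula>
</Rule>
"""

rule = ET.fromstring(xml)
name = rule.get("Name")                      # "Markup"
formula = rule.find("Formula").text.strip()  # "(Account.PricingLevel + 1) * .05"
print(name, "->", formula)

Because the code lives in element content rather than an attribute, its newlines and indentation survive the round trip.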
I think you can set the xml:space="preserve" attribute or use a <![CDATA[ section to avoid other issues with, for example, quotes and greater-than/equals signs.
Apart from the already mentioned CDATA sections, there's pindent.py, which can, among other things, fix broken indentation based on comments a la "# end if" -- to quote the linked file:
When called as "pindent -r" it assumes its input is a Python program with block-closing comments but with its indentation messed up, and outputs a properly indented version.
...
A "block-closing comment" is a comment of the form '# end <keyword>' where is the keyword that opened the block. If the opening keyword is 'def' or 'class', the function or class name may be repeated in the block-closing comment as well. Here is an example of a program fully augmented with block-closing comments:
def foobar(a, b):
    if a == b:
        a = a+1
    elif a < b:
        b = b-1
        if b > a: a = a-1
        # end if
    else:
        print 'oops!'
    # end if
# end def foobar
It's bundled with CPython, but if IronPython doesn't have it, just grab it from the repository.

Sensible python source line wrapping for printout

I am working on a LaTeX document that will require typesetting significant amounts of Python source code. I'm using pygments (the Python module, not the online demo) to encapsulate this Python in LaTeX, which works well except in the case of long individual lines, which simply continue off the page. I could manually wrap these lines, except that this just doesn't seem an elegant solution to me, and I prefer spending time puzzling over crazy automated solutions to spending it on repetitive tasks.
What I would like is some way of processing the Python source code to wrap the lines to a certain maximum character length, while preserving functionality. I've had a play around with some Python, and the closest I've come is inserting \\\n at the last whitespace before the maximum line length -- but of course, if this ends up in strings or comments, things go wrong. Quite frankly, I'm not sure how to approach this problem.
So, is anyone aware of a module or tool that can process source code so that no lines exceed a certain length - or at least a good way to start to go about coding something like that?
You might want to extend your current approach a bit by using the tokenize module from the standard library to determine where to put your line breaks. That way you can see the actual tokens (COMMENT, STRING, etc.) of your source code rather than just the whitespace-separated words.
Here is a short example of what tokenize can do:
>>> from cStringIO import StringIO
>>> from tokenize import tokenize
>>>
>>> python_code = '''
... def foo(): # This is a comment
...     print 'foo'
... '''
>>>
>>> fp = StringIO(python_code)
>>>
>>> tokenize(fp.readline)
1,0-1,1: NL '\n'
2,0-2,3: NAME 'def'
2,4-2,7: NAME 'foo'
2,7-2,8: OP '('
2,8-2,9: OP ')'
2,9-2,10: OP ':'
2,11-2,30: COMMENT '# This is a comment'
2,30-2,31: NEWLINE '\n'
3,0-3,4: INDENT '    '
3,4-3,9: NAME 'print'
3,10-3,15: STRING "'foo'"
3,15-3,16: NEWLINE '\n'
4,0-4,0: DEDENT ''
4,0-4,0: ENDMARKER ''
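Building on that, one way to choose wrap points is to record which columns are covered by STRING and COMMENT tokens and only insert your \\\n at whitespace outside those spans. A rough sketch (Python 3 here, unlike the session above, and only handling tokens that start and end on the same line):

import io
import tokenize

def protected_spans(source):
    """Map line number -> [(start_col, end_col), ...] for strings and comments."""
    spans = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.STRING, tokenize.COMMENT):
            (srow, scol), (erow, ecol) = tok.start, tok.end
            if srow == erow:                  # ignore multi-line strings here
                spans.setdefault(srow, []).append((scol, ecol))
    return spans

def safe_to_break(spans, row, col):
    # True if column col on line row is not inside a string or comment.
    return all(not (start <= col < end) for start, end in spans.get(row, []))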
I use the listings package in LaTeX to insert source code; it does syntax highlighting, line breaks et al.
Put the following in your preamble:
\usepackage{listings}
%\lstloadlanguages{Python} % Load only these languages
\newcommand{\MyHookSign}{\hbox{\ensuremath\hookleftarrow}}
\lstset{
  % Language
  language=Python,
  % Basic setup
  %basicstyle=\footnotesize,
  basicstyle=\scriptsize,
  keywordstyle=\bfseries,
  commentstyle=,
  % Looks
  frame=single,
  % Linebreaks
  breaklines,
  prebreak={\space\MyHookSign},
  % Line numbering
  tabsize=4,
  stepnumber=5,
  numbers=left,
  firstnumber=1,
  %numberstyle=\scriptsize,
  numberstyle=\tiny,
  % Above and beyond ASCII!
  extendedchars=true
}
The package has hooks for inline code, for including entire files, for showing code as figures, ...
I'd check out the reformat tool in an editor like NetBeans.
When you reformat Java, it properly fixes the lengths of lines both inside and outside of comments; if the same algorithm were applied to Python, it would work.
For Java it allows you to set any wrapping width and a bunch of other parameters. I'd be pretty surprised if that didn't exist either natively or as a plugin.
Can't tell for sure just from the description, but it's worth a try:
http://www.netbeans.org/features/python/
