Parsing Python function declaration - python

In order to write a custom documentation generator for my Python code, I'd like to write a regular expression capable of matching the following:
def my_function(arg1,arg2,arg3):
    """
    DOC
    """
My current problem is that, using the following regex:
def (.+)\((?:(\w+),)*(\w+)\)
I can only match my_function, arg2 and arg3 (according to Pythex.org).
I don't understand what I'm doing wrong, since my (?:(\w+),)* should match as many arguments as possible, until the last one (here arg3). Could anyone explain?
Thanks

This isn't possible in a general sense because Python's syntax is not a regular language -- function definitions can take on forms that can't be captured with regular expression syntax, especially in the context of OTHER Python structures. But take heart, there is a lot to learn from your question!
The fascinating thing is that although you said you're trying to learn regular expressions, you accidentally stumbled into the very heart of computer science itself, compiler theory!
I'm going to address less than a fraction of the tip of the iceberg in this post to help get you started, and then suggest a few free and paid resources to help you continue.
A python function by itself may take on several forms:
def foo(x):
    "docstring"
    <body>
def foo1(x):
    """doc
    string"""
    <body>
def foo2(x):
    <body>
Additionally, what comes before and after the function may not be another function!
This is what makes it impossible to autogenerate documentation using a regex by itself (well, not possible for me. I'm not smart enough to write a single regular expression that can account for the entire Python language!).
What you need to look into is parsing (by the way, I'm using the term parsing very loosely to cover parsing, tokenizing, and lexing just to keep things "simple"). Regular expressions are typically a very important part of parsing.
The general strategy would be to parse the file into syntactic constructs. Identify which of those constructs are functions. Isolate the text of that function. THEN you can use a regular expression to parse the text of that construct. OR you can parse one level further and break up the function into distinct syntactic constructions -- function name, parameter declaration, doc string, body, etc... at which point your problem will be solved.
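If you don't want to write that parser yourself, one way to get the same breakdown is to let Python parse its own source. The sketch below is an assumption on my part, not something the question asked for: it uses the standard-library ast module, and SOURCE and describe_functions are made-up names for illustration.

```python
import ast

# Sample input: two functions with a non-function statement in between.
SOURCE = '''
def my_function(arg1, arg2, arg3):
    """
    DOC
    """
    return arg1

x = 1

def undocumented(a):
    return a
'''

def describe_functions(source):
    """Return (name, argument names, docstring) for every function in source."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        # Keep only the syntactic constructs that are function definitions.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            arg_names = [arg.arg for arg in node.args.args]
            found.append((node.name, arg_names, ast.get_docstring(node)))
    return found

for name, args, doc in describe_functions(SOURCE):
    print(name, args, repr(doc))
```

A function without a docstring comes back with None in the last slot, so the foo2 form above needs no special case.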
I attempted to write a regular expression for a standard function definition (without parsing) like foo or foo1, but I struggled to do so even though I have written a few languages.
So just to be clear, the point that I would think about parsing as opposed to simple regex is any time your input spans multiple lines. Regex is most effective on single lines.
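On a single line, the pattern from the question misses arg1 because a capturing group inside a repetition keeps only the text from its last iteration. Here is a minimal sketch of that behaviour, plus one possible workaround: capture the whole parameter list once and split it afterwards. The second pattern is my own and assumes plain comma-separated names, with no default values or annotations that contain parentheses.

```python
import re

line = "def my_function(arg1,arg2,arg3):"

# The question's pattern: group 2 sits inside a repeated group, so each
# repetition overwrites the previous capture and only the last one survives.
original = re.match(r"def (.+)\((?:(\w+),)*(\w+)\)", line)
print(original.groups())

# Workaround: capture the whole parameter list once, then split it.
simpler = re.match(r"def\s+(\w+)\s*\(([^)]*)\)", line)
name, raw_args = simpler.groups()
args = [a.strip() for a in raw_args.split(",") if a.strip()]
print(name, args)
```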
A parsing function looks like this:
def parse_fn_definition(definition):
    def parse_name(lines):
        <code>
    def parse_args(lines):
        <code>
    def parse_doc(lines):
        <code>
    def parse_body(lines):
        <code>
    ...
Now here's the real trick: Each of these parse functions returns two things:
0) The chunk of text that was parsed
1) The REST of the lines
so for instance,
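import re

def parse_name(lines):
    # Group 1 is the function name; group 2 is whatever follows it on the line.
    pattern = r'def\s+([a-zA-Z_][a-zA-Z0-9_]*)(.*)'
    line = lines[0]
    m = re.match(pattern, line)
    if m:
        res, rest = m.groups()
        # Hand back the unparsed remainder of this line plus the untouched lines.
        return res, [rest] + lines[1:]
    else:
        raise Exception("Line cannot be parsed by parse_name: {}".format(line))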
So, once you've isolated the function text (that's a whole other set of tricks to do, usually involves creating something called a "grammar" -- don't worry, I set you up with some resources down below), you can parse the function text with the following technique:
def parse_fn(lines_of_text):
    name, rest = parse_name(lines_of_text)
    params, rest = parse_args(rest)
    doc_string, rest = parse_doc(rest)
    body, rest = parse_body(rest)
    function = [name, params, doc_string, body]
    res = function, rest
    return res
This function would return some data structure that represents the function (I just used a simple list for illustration) and the rest of the lines of text. That would get passed on to something that will appropriately catalog your function data and then classify and process the rest of the text!
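To make the "return the parsed chunk and the rest" idea concrete, here is a small runnable sketch that covers only the name and the argument list. For brevity it works on one string instead of a list of lines. parse_signature is a hypothetical name, and the patterns assume simple comma-separated parameters.

```python
import re

def parse_name(text):
    """Consume 'def <name>' and return (name, remaining text)."""
    m = re.match(r"\s*def\s+([A-Za-z_]\w*)", text)
    if not m:
        raise ValueError("expected a function name: {!r}".format(text))
    return m.group(1), text[m.end():]

def parse_args(text):
    """Consume '(<a>, <b>, ...):' and return (argument names, remaining text)."""
    m = re.match(r"\s*\(([^)]*)\)\s*:", text)
    if not m:
        raise ValueError("expected an argument list: {!r}".format(text))
    args = [a.strip() for a in m.group(1).split(",") if a.strip()]
    return args, text[m.end():]

def parse_signature(text):
    # Each step consumes its piece and passes the remainder to the next one.
    name, rest = parse_name(text)
    args, rest = parse_args(rest)
    return [name, args], rest

function, rest = parse_signature("def foo1(x, y):\n    return x + y\n")
print(function)
print(repr(rest))
```

Whatever is left in rest is what a parse_doc or parse_body step would pick up next.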
Anyway, if this is something that interests you, don't give up! I would offer a few humble suggestions:
1) Start with an EASIER language to parse, like Scheme/LISP. These languages were designed to be easy to parse and manipulate! Then work your way up to more irregular languages.
2a) Peter Norvig has done some amazing and very accessible work on this. Check out Lispy!
2b) Peter Norvig's class CS212 (specifically unit 3 code) is very challenging but does an excellent job introducing fundamental language design concepts. Every job I've ever gotten, and my love for programming, is because of that course.
3) If you want to advance yourself even further and you can afford it, I would strongly recommend checking out Dave Beazley's workshops on compilers or interpreters. I've taken two courses from Dave, and while I can't promise this for everyone, my salary has literally doubled after each course, so I think it's a worthwhile investment.
4) Absolutely check out Structure and Interpretation of Computer Programs (the wizard book) and Compilers (the dragon book). They'll change your life.
5) DON'T GIVE UP! YOU GOT THIS!! Good luck to you!

Related

Making a lexical analyzer WITHOUT manually walking / checking

I'm making my own programming language and I'm on the lexer right now. My current approach is to manually walk through the code and check for valid keywords, then append a Token object to a tokens array. But it leaves me with a massive if/else statement that's not only ugly but slow too. I'm struggling to find any resources about this online, and I'm trying to find out if there's a better way to do this - Some regex pattern or something?
Here's the code
class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

    def __str__(self):
        return f'Token({self.type}, {self.value})'

    def __repr__(self):
        return self.__str__()

def lex(code):
    tokens = []
    for index in range(len(code)):
        pass # This is where the if/else statement goes
    return tokens
I don't want to use lex or anything. Thanks in advance for the help.
Parser generators can help you get started quickly by helping you define syntax trees and giving you a declarative syntax to describe the lexing & parsing steps.
"that's not only ugly but slow too"
This seems odd to me. Hand-rolled lexers are usually pretty performant, as long as your syntax doesn't require too much lookahead or backtracking.
Parser generators typically work based on automata; they build state tables so most of the work is just a loop that at each step looks up into those tables.
One trick that high-performance, hand-rolled lexers often do is to have a lookup-table that classifies each ASCII character. So the lexing loop looks like
while position < limit:
    code_point = read_codepoint(position)
    if code_point <= MAX_ASCII:
        # switch on CLASSIFICATION[code_point]
    else:
        # Do something else probably identifier related
where CLASSIFICATION stores info that lets you recognize that quote characters inevitably lead to parsing as a quoted string or character literal and space characters can be skipped over and 0-9 inevitably lead to parsing a numeric token.
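A runnable version of that loop might look like the sketch below. The character classes, the token names and the use of plain tuples instead of the question's Token class are all my assumptions, and string handling ignores escape sequences.

```python
MAX_ASCII = 127
OTHER, SPACE, DIGIT, QUOTE, IDENT = range(5)

# Build the lookup table once; the lexing loop then does one index per character.
CLASSIFICATION = [OTHER] * (MAX_ASCII + 1)
for ch in " \t\r\n":
    CLASSIFICATION[ord(ch)] = SPACE
for ch in "0123456789":
    CLASSIFICATION[ord(ch)] = DIGIT
for ch in "\"'":
    CLASSIFICATION[ord(ch)] = QUOTE
for ch in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_":
    CLASSIFICATION[ord(ch)] = IDENT

def classify(ch):
    code_point = ord(ch)
    # Non-ASCII characters are treated as identifier characters in this sketch.
    return CLASSIFICATION[code_point] if code_point <= MAX_ASCII else IDENT

def lex(code):
    tokens = []
    position, limit = 0, len(code)
    while position < limit:
        kind = classify(code[position])
        start = position
        if kind == SPACE:
            position += 1
        elif kind == DIGIT:
            while position < limit and classify(code[position]) == DIGIT:
                position += 1
            tokens.append(("NUMBER", code[start:position]))
        elif kind == IDENT:
            while position < limit and classify(code[position]) in (IDENT, DIGIT):
                position += 1
            tokens.append(("NAME", code[start:position]))
        elif kind == QUOTE:
            # Find the matching closing quote (no escape handling here).
            end = code.find(code[start], start + 1)
            if end == -1:
                raise SyntaxError("unterminated string at {}".format(start))
            position = end + 1
            tokens.append(("STRING", code[start:position]))
        else:
            # Anything unclassified becomes a one-character operator token.
            position += 1
            tokens.append(("OP", code[start:position]))
    return tokens

print(lex('x1 = 42 + "hi"'))
```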
"Some regex pattern or something?"
This can work if your lexical grammar is regular.
That probably isn't true if your syntax requires nesting tokens.
For example, JS has non-regularity because template strings can embed expressions:
`string stuff ${ expressionStuff } more string stuff`
so a JS lexer needs to keep state so it knows when a } transitions back into a string state or not.
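For a lexical grammar that is regular, the regex approach can be sketched as one master pattern built from named groups, along the lines of the tokenizer recipe in the Python re documentation. The token names and patterns here are placeholders for whatever your language actually needs.

```python
import re

# Order matters: earlier alternatives win, and MISMATCH catches everything else.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join("(?P<{}>{})".format(n, p) for n, p in TOKEN_SPEC))

def lex(code):
    tokens = []
    for m in MASTER.finditer(code):
        # lastgroup is the name of the alternative that matched.
        kind, value = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError("unexpected character {!r} at {}".format(value, m.start()))
        tokens.append((kind, value))
    return tokens

print(lex("total = price * 3.5"))
```

This replaces the big if/else with a table of patterns, but it carries no state between tokens, so constructs like the template strings above still need extra handling.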

How do I write and structure code in most efficient way possible?

A few weeks ago I needed a crawler for data collection and sorting so I started learning python.
Same day I wrote a simple crawler but the code looked ugly as hell. Mainly because I don't know how to do certain things and I don't know how to properly google them.
Example:
Instead of deleting [, ] and ' in one line I did
extra_nr = extra_nr.replace("'", '')
extra_nr = extra_nr.replace("[", '')
extra_nr = extra_nr.replace("]", '')
extra_nr = extra_nr.replace(",", '')
Because I couldn't operate on the list object directly, and when I did str(list object) it looked like ['this', 'and this'].
Now I'm creating discord bot that will upload data that I feed to it to google spreadsheet. The code is long and ugly. And it takes like 2-3 secs to start the bot (idk if this is normal, I think the more I write the more time it takes to start it which makes me think that code is garbage). Sometimes it works, sometimes it doesn't.
My question is how do I know that I wrote something good? And if I just keep adding stuff like in the example, how will it affect my program? If I have a really long code do I split it and call the parts of it only when they are needed or how does it work?
tl;dr to get good at Python and write good code, write a lot of Python and read other people's code. Learn multiple approaches to different problem types and get a feel for which to use and when. It's something that comes over time with a lot of practice. As far as resources, I highly recommend the book "Automate the Boring Stuff with Python".
As for your code sample, you could use translate for this:
def strip(my_string):
    bad_chars = [*"[],'"]
    return my_string.translate({ord(c): None for c in bad_chars})
translate does a character by character translation of the string given a translation table, so you create a small translation table with the characters you don't want set to None.
The list of characters you don't want is created by unpacking (splatting) a string of the characters.
>>> [*"abc"] == ["a", "b", "c"]
True
Another option would be using comprehensions:
def strip(my_string):
    bad_chars = [*"[],'"]
    return "".join(c for c in my_string if c not in bad_chars)
Here we use a generator expression, which has the same shape as the list comprehension [x for x in y], to produce the characters of my_string while dropping any character that appears in bad_chars. We then join the remaining characters into a string that doesn't have the specified characters in it.
You will definitely improve quickly from reading (or listening) up on Python best practices from resources like Real Python and Talk Python To Me.
Meanwhile, I'd recommend starting using some code analysers like pylint and bandit as part of your regular workflow.
In any case, welcome to the world of Python and enjoy! :-)
You can use maketrans() to define characters to remove (3rd parameter):
def clean(S): return S.translate(str.maketrans("","","[],'"))
clean("A['23']") # 'A23'
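If the brackets and quotes only appear because str() was called on a list, it may be simpler to skip the cleanup and join the elements directly. This is a sketch under the assumption that extra_nr started out as a list of strings; items is a stand-in name.

```python
items = ["this", "and this"]

# str() on a list yields its repr, brackets and quotes included.
as_repr = str(items)

# Joining the elements builds the text directly, with nothing to clean up.
joined = " ".join(items)

print(as_repr)
print(joined)
```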

Is there a way to accomplish what eval does without using eval in Python

I am teaching some neighborhood kids to program in Python. Our first project is to convert a string given as a Roman numeral to the Arabic value.
So we developed a function to evaluate a string that is a Roman numeral. The function takes a string and creates a list that has the Arabic equivalents and the operations that would be done to evaluate to the Arabic equivalent.
For example suppose you fed in XI the function will return [1,'+',10]
If you fed in IX the function will return [10,'-',1]
Since we need to handle the cases where adjacent values are equal separately let us ignore the case where the supplied value is XII as that would return [1,'=',1,'+',10] and the case where the Roman is IIX as that would return [10,'-',1,'=',1]
Here is the function
def conversion(some_roman):
    roman_dict = {'I':1,'V':5,'X':10,'L':50,'C':100,'D':500,'M':1000}
    arabic_list = []
    for letter in some_roman.upper():
        if len(arabic_list) == 0:
            arabic_list.append(roman_dict[letter])
            continue
        previous = arabic_list[-1]
        current_arabic = roman_dict[letter]
        # The list is reversed at the end, so a larger value following a
        # smaller one (IX) gets '-' here to come out as [10,'-',1].
        if current_arabic > previous:
            arabic_list.extend(['-',current_arabic])
            continue
        if current_arabic == previous:
            arabic_list.extend(['=',current_arabic])
            continue
        if current_arabic < previous:
            arabic_list.extend(['+',current_arabic])
    arabic_list.reverse()
    return arabic_list
the only way I can think to evaluate the result is to use eval()
something like
def evaluate(some_list):
    list_of_strings = [str(item) for item in some_list]
    # join() takes the list itself, not a list wrapped in another list
    converted_to_string = ''.join(list_of_strings)
    arabic_value = eval(converted_to_string)
    return arabic_value
I am a little bit nervous about this code because at some point I read that eval is dangerous to use in most circumstances as it allows someone to introduce mischief into your system. But I can't figure out another way to evaluate the list returned from the first function without having to write a more complex function.
The kids get the conversion function so even if it looks complicated they understand the process of roman numeral conversion and it makes sense. When we have talked about evaluation though I can see they get lost. Thus I am really hoping for some way to evaluate the results of the conversion function that doesn't require too much convoluted code.
Sorry if this is warped, I am so . . .
"Is there a way to accomplish what eval does without using eval"
Yes, definitely. One option would be to convert the whole thing into an ast tree and parse it yourself (see here for an example).
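As a rough idea of what that could look like for the lists in the question, here is a minimal sketch that parses the joined string with ast and walks the tree itself, allowing only integers, + and -. The _walk helper and the ALLOWED table are my own names. The '=' case from the question is not handled and raises a SyntaxError when parsed.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
ALLOWED = {ast.Add: operator.add, ast.Sub: operator.sub}

def evaluate(some_list):
    expression = "".join(str(item) for item in some_list)
    tree = ast.parse(expression, mode="eval")
    return _walk(tree.body)

def _walk(node):
    # A bare integer evaluates to itself.
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    # A binary operation is evaluated recursively, if its operator is allowed.
    if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED:
        return ALLOWED[type(node.op)](_walk(node.left), _walk(node.right))
    raise ValueError("unsupported expression")

print(evaluate([1, "+", 10]))
print(evaluate([10, "-", 1]))
```

Because nothing outside ALLOWED is ever executed, a string like "__import__('os')" ends in a ValueError instead of running.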
"I am a little bit nervous about this code because at some point I read that eval is dangerous to use in most circumstances as it allows someone to introduce mischief into your system."
This is definitely true. Any time you consider using eval, you need to do some thinking about your particular use-case. The real question is how much do you trust the user and what damage can they do? If you're distributing this as a script and users are only using it on their own computer, then it's really not a problem -- After all, they don't need to inject malicious code into your script to remove their home directory. If you're planning on hosting this on your server, that's a different story entirely ... Then you need to figure out where the string comes from and if there is any way for the user to modify the string in a way that could make it untrusted to run. Hackers are pretty clever [1][2] and so hosting something like this on your server is generally not a good idea. (I always assume that the hackers know python WAY better than I do).
[1] http://blog.delroth.net/2013/03/escaping-a-python-sandbox-ndh-2013-quals-writeup/
[2] http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html
The only implementation of a safe expression evaluator that I've come across is:
https://pypi.org/project/simpleeval/
It supports a lot of basic Python-ish expressions and is quite restricted in what it allows you to do (so you don't blow up the interpreter or do something evil). It uses the python ast module for parsing, and evaluates the result itself.
Example:
from simpleeval import simple_eval
simple_eval("21 + 21")
Then you can extend it and give it access to the parts of your program that you want to:
simple_eval("x + y", names={"x": 22, "y": 48})
or
simple_eval("do_thing(11)", functions={"do_thing": my_callback})
and so on.

regular expression - function body extracting

In a Python script, for every method definition in some C++ code of the form:
return_value ClassName::MethodName(args)
{MethodBody}
I need to extract three parts: the class name, the method name and the method body for further processing. Finding and extracting the ClassName and MethodName is easy, but is there any simple way to extract the body of the method? With all possible '{' and '}' inside it? Or are regexes unsuitable for such task?
>>> import re
>>> s = """return_value ClassName::MethodName(args)
{MethodBody {} } """
>>> re.findall(r'\b(\w+)::(\w+)\([^{]+\{(.+)}', s, re.S)
[('ClassName', 'MethodName', 'MethodBody {} ')]
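Note that the greedy (.+)} runs to the last closing brace in the whole string, so it only works when the input contains a single method. If there are several, one alternative is to find each header with a regex and then count braces to locate the end of the body. The following is a sketch, not a C++ parser: extract_methods is a made-up name, and braces inside string literals or comments would confuse it.

```python
import re

def extract_methods(source):
    """Return (class name, method name, body) for each ClassName::MethodName(...) { ... }."""
    results = []
    header = re.compile(r"\b(\w+)::(\w+)\s*\([^)]*\)\s*\{")
    position = 0
    while True:
        m = header.search(source, position)
        if m is None:
            break
        # The header consumed the opening brace, so nesting depth starts at 1.
        depth = 1
        index = m.end()
        while index < len(source) and depth > 0:
            if source[index] == "{":
                depth += 1
            elif source[index] == "}":
                depth -= 1
            index += 1
        if depth != 0:
            raise ValueError("unbalanced braces in {}::{}".format(m.group(1), m.group(2)))
        # index now sits just past the matching closing brace.
        results.append((m.group(1), m.group(2), source[m.end():index - 1]))
        position = index
    return results

s = """return_value ClassName::MethodName(args)
{MethodBody {} }
return_value ClassName::Other(args)
{ if (x) { y(); } }
"""
print(extract_methods(s))
```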
I would recommend that you use the parser module rather than regexps since it will handle things like multiple line functions, different indentations and will abort on malformed input so that you can manage things better. "Avoid regexps if you can" is one of the rules I live by since they're often more trouble than they're worth.
Edit:
Oh okay. I misread your question. I thought you wanted to parse Python code itself. I googled a little bit and found this but it's C only. Perhaps you can extend that? The grammar for C++ is there in the "C++ programming language book"

IronPython: Is there an alternative to significant whitespace?

For rapidly changing business rules, I'm storing IronPython fragments in XML files. So far this has been working out well, but I'm starting to get to the point where I need more than just one-line expressions.
The problem is that XML and significant whitespace don't play well together. Before I abandon it for another language, I would like to know if IronPython has an alternative syntax.
IronPython doesn't have an alternate syntax. It's an implementation of Python, and Python uses significant indentation (all languages use significant whitespace, not sure why we talk about whitespace when it's only indentation that's unusual in the Python case).
>>> from __future__ import braces
  File "<stdin>", line 1
    from __future__ import braces
    ^
SyntaxError: not a chance
All I want is something that will let my users write code like
Ummm... Don't do this. You don't actually want this. In the long run, this will cause endless little issues because you're trying to force too much content into an attribute.
Do this.
<Rule Name="Markup">
    <Formula>(Account.PricingLevel + 1) * .05</Formula>
</Rule>
You should try not to have significant, meaningful stuff in attributes. As a general XML design policy, you should use tags and save attributes for names and ID's and the like. When you look at well-done XSD's and DTD's, you see that attributes are used minimally.
Having the body of the rule in a separate tag (not an attribute) saves much pain. And it allows a tool to provide correct CDATA sections. Use a tool like Altova's XML Spy to assure that your tags have space preserved properly.
I think you can set the xml:space="preserve" attribute or use a <![CDATA[ ... ]]> section to avoid other issues, for example with quotes and greater-than or equals signs.
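To illustrate the CDATA route, here is a small sketch that stores a multi-line fragment in a Formula element and restores its indentation after loading. It makes several assumptions: it uses CPython's xml.etree.ElementTree (an IronPython host may read its XML through a different API), the Rules wrapper and the rule body are invented for the example, and load_rules is a hypothetical helper.

```python
import textwrap
import xml.etree.ElementTree as ET

# Hypothetical rule file: the Rule and Formula names follow the answer above,
# the multi-line formula body is made up for illustration.
RULES_XML = """<Rules>
  <Rule Name="Markup">
    <Formula><![CDATA[
        if level >= 2 and level < 5:
            result = (level + 1) * .05
        else:
            result = 0
    ]]></Formula>
  </Rule>
</Rules>"""

def load_rules(xml_text):
    """Map each rule name to its dedented Python source."""
    root = ET.fromstring(xml_text)
    rules = {}
    for rule in root.iter("Rule"):
        # CDATA keeps the line breaks; dedent removes the XML-level indentation.
        source = textwrap.dedent(rule.findtext("Formula")).strip()
        rules[rule.get("Name")] = source
    return rules

rules = load_rules(RULES_XML)
namespace = {"level": 3}
exec(rules["Markup"], namespace)  # only run rule files you trust
print(namespace["result"])
```

Inside the CDATA section the < and >= characters need no escaping, and the relative indentation of the fragment survives the round trip.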
Apart from the already mentioned CDATA sections, there's pindent.py which can, among others, fix broken indentation based on comments a la #end if - to quote the linked file:
When called as "pindent -r" it assumes its input is a Python program with block-closing comments but with its indentation messed up, and outputs a properly indented version.
...
A "block-closing comment" is a comment of the form '# end <keyword>' where <keyword> is the keyword that opened the block. If the opening keyword is 'def' or 'class', the function or class name may be repeated in the block-closing comment as well. Here is an example of a program fully augmented with block-closing comments:
def foobar(a, b):
    if a == b:
        a = a+1
    elif a < b:
        b = b-1
        if b > a: a = a-1
        # end if
    else:
        print 'oops!'
    # end if
# end def foobar
It's bundled with CPython, but if IronPython doesn't have it, just grab it from the repository.
