Making a lexical analyzer WITHOUT manually walking / checking - python

I'm making my own programming language and I'm on the lexer right now. My current approach is to manually walk through the code and check for valid keywords, then append a Token object to a tokens array. But it leaves me with a massive if/else statement that's not only ugly but slow too. I'm struggling to find any resources about this online, and I'm trying to find out if there's a better way to do this - Some regex pattern or something?
Here's the code
class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

    def __str__(self):
        return f'Token({self.type}, {self.value})'

    def __repr__(self):
        return self.__str__()

def lex(code):
    tokens = []
    for index in range(len(code)):
        pass  # This is where the if/else statement goes
    return tokens
I don't want to use lex or anything. Thanks in advance for the help.

Parser generators can get you started quickly: they help you define syntax trees and give you a declarative syntax for describing the lexing & parsing steps.
that's not only ugly but slow too
This seems odd to me. Hand-rolled lexers are usually quite fast, as long as your syntax doesn't require too much lookahead or backtracking.
Parser generators typically work based on automata; they build state tables, so most of the work is just a loop that at each step looks up into those tables.
One trick that high-performance, hand-rolled lexers often do is to have a lookup-table that classifies each ASCII character. So the lexing loop looks like
while position < limit:
    code_point = read_codepoint(position)
    if code_point <= MAX_ASCII:
        ...  # switch on CLASSIFICATION[code_point]
    else:
        ...  # do something else, probably identifier related
where CLASSIFICATION stores information that lets you recognize, for example, that quote characters inevitably lead to parsing a quoted string or character literal, that space characters can be skipped over, and that the digits 0-9 inevitably lead to parsing a numeric token.
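In Python, that table-driven approach might look something like the sketch below (a rough illustration, not a drop-in lexer: the categories, table contents, and token names are invented, and string/operator handling is only stubbed out). It reuses the Token class from the question.

WS, DIGIT, QUOTE, PUNCT, LETTER, OTHER = range(6)
MAX_ASCII = 127

# Build the per-character classification table once, up front.
CLASSIFICATION = [OTHER] * (MAX_ASCII + 1)
for ch in ' \t\r\n':
    CLASSIFICATION[ord(ch)] = WS
for ch in '0123456789':
    CLASSIFICATION[ord(ch)] = DIGIT
for ch in '"\'':
    CLASSIFICATION[ord(ch)] = QUOTE
for ch in '+-*/=<>(){}[],;:':
    CLASSIFICATION[ord(ch)] = PUNCT
for ch in 'abcdefghijklmnopqrstuvwxyz_':
    CLASSIFICATION[ord(ch)] = LETTER
    CLASSIFICATION[ord(ch.upper())] = LETTER

def classify(ch):
    code_point = ord(ch)
    # Anything outside ASCII is lumped in with identifier characters here.
    return CLASSIFICATION[code_point] if code_point <= MAX_ASCII else LETTER

def lex(code):
    tokens, pos = [], 0
    while pos < len(code):
        kind = classify(code[pos])
        if kind == WS:
            pos += 1                                   # whitespace: just skip it
        elif kind == DIGIT:
            start = pos
            while pos < len(code) and classify(code[pos]) == DIGIT:
                pos += 1
            tokens.append(Token('NUMBER', code[start:pos]))
        elif kind == LETTER:
            start = pos
            while pos < len(code) and classify(code[pos]) in (LETTER, DIGIT):
                pos += 1
            tokens.append(Token('IDENT', code[start:pos]))
        else:
            # QUOTE would kick off string scanning; here everything else
            # becomes a one-character token to keep the sketch short.
            tokens.append(Token('PUNCT', code[pos]))
            pos += 1
    return tokens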
Some regex pattern or something?
This can work if your lexical grammar is regular.
That probably isn't true if your syntax requires nesting tokens.
For example, JS has non-regularity because template strings can embed expressions:
`string stuff ${ expressionStuff } more string stuff`
so a JS lexer needs to keep extra state to know whether a } transitions back into a string state or not.
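If your token grammar really is regular (or you handle the non-regular bits with a little extra state), a compact way to avoid a giant if/else chain in Python is to join one pattern per token type into a single alternation of named groups, similar to the tokenizer recipe in the documentation for the re module. Here is a rough sketch, again reusing the Token class from the question (the token names and patterns are made up, not your grammar):

import re

TOKEN_SPEC = [
    ('NUMBER',   r'\d+(?:\.\d+)?'),
    ('IDENT',    r'[A-Za-z_]\w*'),
    ('STRING',   r'"[^"\n]*"'),
    ('OP',       r'[+\-*/=<>()]'),
    ('NEWLINE',  r'\n'),
    ('SKIP',     r'[ \t]+'),
    ('MISMATCH', r'.'),
]
MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def lex(code):
    tokens = []
    for m in MASTER.finditer(code):
        kind, value = m.lastgroup, m.group()
        if kind in ('SKIP', 'NEWLINE'):
            continue                       # or emit NEWLINE tokens if they matter
        if kind == 'MISMATCH':
            raise SyntaxError('Unexpected character %r' % value)
        tokens.append(Token(kind, value))  # Token class from the question
    return tokens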

Related

Parsing Python function declaration

In order to write a custom documentation generator for my Python code, I'd like to write a regular expression capable of matching the following:
def my_function(arg1,arg2,arg3):
    """
    DOC
    """
My current problem is that, using the following regex:
def (.+)\((?:(\w+),)*(\w+)\)
I can only match my_function, arg2 and arg3 (according to Pythex.org).
I don't understand what I'm doing wrong, since my (?:(\w+),)* should match as many arguments as possible, until the last one (here arg3). Could anyone explain?
Thanks
This isn't possible in a general sense because Python function definitions are not a regular language -- they can take on forms that can't be captured with regular expression syntax, especially in the context of OTHER Python structures. But take heart, there is a lot to learn from your question!
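As a quick aside about the immediate symptom you saw on Pythex: in Python's re module a repeated capturing group keeps only its last repetition, which is why (?:(\w+),)* reports arg2 and not arg1. If all you need from a single def line is the name and the argument list, one workaround (a small sketch, not a general solution) is to capture the whole parameter list in one group and split it afterwards:

import re

sig = 'def my_function(arg1,arg2,arg3):'
m = re.match(r'def\s+(\w+)\((.*?)\)', sig)
if m:
    name = m.group(1)                                          # 'my_function'
    args = [a.strip() for a in m.group(2).split(',') if a.strip()]
    print(name, args)                                          # my_function ['arg1', 'arg2', 'arg3']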
The fascinating thing is that although you said you're trying to learn regular expressions, you accidentally stumbled into the very heart of computer science itself, compiler theory!
I'm going to address less than a fraction of the tip of the iceberg in this post to help get you started, and then suggest a few free and paid resources to help you continue.
A Python function by itself may take on several forms:
def foo(x):
    "docstring"
    <body>

def foo1(x):
    """doc
    string"""
    <body>

def foo2(x):
    <body>
Additionally, what comes before and after the function may not be another function!
This is what makes it impossible to autogenerate documentation with a regex by itself (well, not possible for me; I'm not smart enough to write a single regular expression that can account for the entire Python language!).
What you need to look into is parsing (by the way, I'm using the term parsing very loosely to cover parsing, tokenizing, and lexing, just to keep things "simple"). Regular expressions are typically a very important part of parsing.
The general strategy would be to parse the file into syntactic constructs. Identify which of those constructs are functions. Isolate the text of that function. THEN you can use a regular expression to parse the text of that construct. OR you can parse one level further and break up the function into distinct syntactic constructions -- function name, parameter declaration, doc string, body, etc... at which point your problem will be solved.
I was attempting to write a regular expression for a standard function definition (without parsing) like foo or foo1, but I struggled to do so even having written a few languages.
So just to be clear: the point at which I would think about parsing, as opposed to a simple regex, is any time your input spans multiple lines. Regex is most effective on single lines.
A parsing function looks like this:
def parse_fn_definition(definition):
    def parse_name(lines):
        <code>
    def parse_args(lines):
        <code>
    def parse_doc(lines):
        <code>
    def parse_body(lines):
        <code>
    ...
Now here's the real trick: Each of these parse functions returns two things:
0) The chunk that the regex just parsed
1) The rest of the input
so for instance,
import re

def parse_name(lines):
    pattern = r'def\s+([a-zA-Z_][a-zA-Z0-9_]*)'
    for line in lines:
        m = re.match(pattern, line)
        if m:
            res = m.group(1)          # the function name
            rest = line[m.end():]     # the remainder of the def line
            return res, [rest] + lines[1:]
        else:
            raise Exception("Line cannot be parsed by parse_name: {}".format(line))
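With those fixes in place (my reading of the intended behavior), a call might look like this, using the def line from the question:

src = ['def my_function(arg1,arg2,arg3):', '    """', '    DOC', '    """']
name, rest = parse_name(src)
print(name)       # my_function
print(rest[0])    # (arg1,arg2,arg3):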
So, once you've isolated the function text (that's a whole other set of tricks, usually involving something called a "grammar" -- don't worry, I set you up with some resources down below), you can parse the function text with the following technique:
def parse_fn(lines_of_text):
    name, rest = parse_name(lines_of_text)
    params, rest = parse_params(rest)
    doc_string, rest = parse_doc(rest)
    body, rest = parse_body(rest)
    function = [name, params, doc_string, body]
    res = function, rest
    return res
This function would return some data structure that represents the function (I just used a simple list for illustration) and the rest of the lines of text. That would get passed on to something that will appropriately catalog your function data and then classify and process the rest of the text!
Anyway, if this is something that interests you, don't give up! I would offer a few humble suggestions:
1) Start with an EASIER language to parse, like Scheme/LISP. These languages were designed to be easy to parse and manipulate! Then work your way up to more irregular languages.
2a) Peter Norvig has done some amazing and very accessible work on this. Check out Lispy!
2b) Peter Norvig's class CS212 (specifically unit 3 code) is very challenging but does an excellent job introducing fundamental language design concepts. Every job I've ever gotten, and my love for programming, is because of that course.
3) If you want to advance yourself even further and you can afford it, I would strongly recommend checking out Dave Beazley's workshops on compilers or interpreters. I've taken two courses from Dave, and while I can't promise this for everyone, my salary has literally doubled after each course, so I think it's a worthwhile investment.
4) Absolutely check out Structure and Interpretation of Computer Programs (the wizard book) and Compilers (the dragon book). They'll change your life.
5) DON'T GIVE UP! YOU GOT THIS!! Good luck to you!

Pretty-print Lisp using Python

Is there a way to pretty-print a Lisp-style code string (in other words, a bunch of balanced parentheses and text within) in Python without reinventing the wheel?
Short answer
I think a reasonable approach, if you can, is to generate Python lists or custom objects instead of strings and use the pprint module, as suggested by @saulspatz.
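For example (a tiny sketch; the nested-list form is invented for illustration):

from pprint import pprint

# Represent the Lisp form as nested Python lists instead of a string...
form = ['*', ['+', 3, 'x'], ['f', 'x', 'y']]

# ...and let pprint worry about line breaking and indentation.
pprint(form, width=30)
# Prints something like:
# ['*',
#  ['+', 3, 'x'],
#  ['f', 'x', 'y']]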
Long answer
The whole question looks like an instance of an XY problem. Why? Because you are using Python (why not Lisp?) to manipulate strings (why not data structures?) representing generated Lisp-style code, where Lisp-style is defined as "a bunch of parentheses and text within".
To the question "how to pretty-print?", I would thus respond "I wouldn't start from here!".
The best way to not reinvent the wheel in your case, apart from using existing wheels, is to stick to a simple output format.
But first of all, why do you need to pretty-print? Who will look at the resulting code?
Depending on the exact Lisp dialect you are using and the intended usage of the code, you could format your code very differently. Think about newlines, indentation and maximum width of your text, for example. The Common Lisp pretty-printer is particularly sophisticated, and I doubt you want the same level of configurability.
If you used Lisp, a simple call to pprint would solve your problem, but you are using Python, so stick with the most reasonable output for the moment because pretty-printing is a can of worms.
If your code is intended for human readers, please:
don't put closing parentheses on their own lines
don't vertically align opening and closing parentheses
don't add spaces after opening parentheses
This is ugly:
( * ( + 3 x )
    (f
       x
       y
    )
)
This is better:
(* (+ 3 x)
   (f x y))
Or simply:
(* (+ 3 x) (f x y))
See here for more details.
But before printing, you have to parse your input string and make sure it is well-formed. Maybe you are sure it is well-formed, due to how you generate your forms, but I'd argue that the printer should ignore that and not make too many assumptions. If you passed the pretty-printer an AST represented by Python objects instead of just strings, this would be easier, as suggested in comments. You could build a data-structure or custom classes and use the pprint (python) module. That, as said above, seems to be the way to go in your case, if you can change how you generate your Lisp-style code.
With strings, you are supposed to handle any possible input and reject invalid ones.
This means checking that parenthesis and quotes are balanced (beware of escape characters), etc.
Actually, you don't need to really build an intermediate tree for printing (though it would probably help for other parts of your program), because Lisp-style code is made of forms that are easily nested and use a prefix notation: you can scan your input string from left-to-right and print as required when seeing parenthesis (open parenthesis: recurse; close parenthesis, return from recursion). When you first encounter an unescaped double-quote ", read until the next one ", ...
This, coupled with a simple printing method, could be sufficient for your needs.
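Here is a bare-bones sketch of that scan-and-recurse idea (no string or escape handling, and the layout is simply "start a new indented line at every open parenthesis" rather than the nicer style shown above):

def pretty(form_str, indent='    '):
    # Scan left to right: recurse on '(' and return from the recursion on ')'.
    out, _ = _pretty(form_str, 0, 0, indent)
    return out.strip()

def _pretty(s, pos, depth, indent):
    out = []
    while pos < len(s):
        ch = s[pos]
        if ch == '(':
            out.append('\n' + indent * depth + '(')
            inner, pos = _pretty(s, pos + 1, depth + 1, indent)
            out.append(inner)
        elif ch == ')':
            out.append(')')
            return ''.join(out), pos + 1
        else:
            out.append(ch)
            pos += 1
    return ''.join(out), pos

print(pretty('(* (+ 3 x) (f x y))'))
# (*
#     (+ 3 x)
#     (f x y))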
I think the easiest method would be to use triple quotations. If you say:
print """
(((This is some lisp code))) """
It should work.
You can format your code any way you like within the triple quotes and it will come out the way you want it to.
Best of luck and happy coding!
I made this rudimentary pretty printer once for prettifying CLIPS, which is based on Lisp. Might help:
import os

def clips_pprint(clips_str: str) -> str:
    """Pretty-prints a CLIPS string.

    Indents a CLIPS string for easier visual confirmation during development
    and verification.

    Assumes the CLIPS string is valid CLIPS, i.e. braces are paired.
    """
    LB = "("
    RB = ")"
    TAB = " " * 4
    formatted_clips_str = ""
    tab_count = 0
    for c in clips_str:
        if c == LB:
            formatted_clips_str += os.linesep
            for _i in range(tab_count):
                formatted_clips_str += TAB
            tab_count += 1
        elif c == RB:
            tab_count -= 1
        formatted_clips_str += c
    return formatted_clips_str.strip()

How to apply De Morgan's laws to a parsed string? (transforming the string or with parse actions)

I am trying to write a program that evaluates whether a propositional logic formula is valid or invalid using the semantic tree method.
So far I have managed to check whether a formula is well formed:
from pyparsing import *
from string import lowercase

def fbf():
    atom = Word(lowercase, max=1)   # lowercase letters
    op = oneOf('^ V => <=>')        # operators
    identOp = oneOf('( [ {')
    identCl = oneOf(') ] }')
    form = Forward()                # defined recursively
    # Grammar
    form << ( (Group(Literal('~') + form)) | ( Group(identOp + form + op + form + identCl) ) | ( Group(identOp + form + identCl) ) | (atom) )
    return form

# Main driver
entrada = raw_input("Entrada: ")
try:
    print fbf().parseString(entrada, parseAll=True)
except ParseException as error:     # handle the error
    print error.markInputline()
    print error
    print
Now I need to convert the negated formula ~(form) according to De Morgan's laws. The rewrite rules look something like this:
~((form) V (form)) = (~(form) ^ ~(form))
~((form) ^ (form)) = (~(form) V ~(form))
http://en.wikipedia.org/wiki/De_Morgans_laws
Parsing must be recursive; I was reading about parse actions, but I don't really understand them. I'm new to Python and very unskilled.
Can somebody help me on how to get this to work?
Juan Jose -
You are asking for a lot of work on the part of this audience, whether you realize it or not. Here are some suggestions on how to make progress on this problem:
Recognize that parsing the input is only the first step in this overall program. You can't just write any parser that gets through the input, and then declare yourself ready for the next step. You need to anticipate what you will do with the parsed output, and try to parse the data in such a way that it readies you to take the next step - which in your case is to do some logical transformations to apply De Morgan's laws. In fact, you may be best off working backwards - assume you have a parser: what would you need your transformation code to work with, how would an expression look, and how would you perform the transform itself? This will naturally structure your thinking toward the application domain, and give you a target result format when you start writing the parser.
When you start to write your parser, look at other pyparsing examples that do similar tasks, such as SimpleBool.py on the pyparsing wiki. See how they parse the input to create a set of evaluatable objects, which can then be acted upon in the application domain (whether it is to evaluate them, transform them, or whatever). Think about what kind of objects you want to create in your parser that will work with the transformation methods you outlined in the last step.
Take time to write a BNF for the syntax you will parse. Write out some sample test strings that you would parse to help you anticipate syntax issues. Is "~~p ^ q V r" a valid string? Can identifiers be multiple characters, or will you restrict to just single characters (single will be easier to work with at the beginning, and you can expand it later easily)? Keep your syntax simple if you can, such as just supporting ()'s for grouping, instead of any matched pair of ()'s, []'s, or {}'s.
When you implement your parser, start with simple test cases first and work your way up. You may have to backtrack a bit if you find that you made some assumptions early on that more complicated strings don't support, but that's pretty typical for most programming projects.
As an implementation tip, read up on using the operatorPrecedence helper, as it is specifically designed for these types of parsing jobs. Look at how it is used in SimpleBool.py to create an object hierarchy that mirrors the structure of the input string. Then think about what objects would do in your transformation process.
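As a very rough illustration of that last point (this is a sketch, not SimpleBool.py; the operator set only loosely follows your grammar, and operatorPrecedence is called infixNotation in newer pyparsing releases):

from pyparsing import Literal, Word, alphas, opAssoc, operatorPrecedence

atom = Word(alphas, exact=1)
formula = operatorPrecedence(atom, [
    (Literal('~'),  1, opAssoc.RIGHT),
    (Literal('^'),  2, opAssoc.LEFT),
    (Literal('V'),  2, opAssoc.LEFT),
    (Literal('=>'), 2, opAssoc.RIGHT),
])

result = formula.parseString('~(p ^ q)', parseAll=True)
print(result)
# Gives a nested structure along the lines of [['~', ['p', '^', 'q']]],
# which a recursive function (or per-operator parse action classes, as in
# SimpleBool.py) can rewrite into ~p V ~q to apply De Morgan's laws.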
Good luck!

Using a grammar parser for Python and constructing files from the tree

I have a custom-made grammar for an interpreted language and I am looking for advice on a parser which will create a tree which I can query. From the structure I would like to be able to generate code in the interpreted language. Most grammar parsers that I have seen validate already existing code. The second part of my question is: should the grammar be abstracted to the point that the Python code will substitute symbols in the tree for actual code terminology? Ideally, I would love to be able to query a root symbol and get back all the symbols which fall under that root, and so forth all the way to a terminal symbol.
Any advice on this process or my vocabulary regarding it would be very helpful. Thank you.
The vast majority of parser libraries will create an abstract syntax tree (AST) from whatever code it is you're generating; you can use whichever you like, e.g. pyparsing. To go from the AST to code, you might have to write functions manually to do that, but it's pretty easy to do recursively. For example:
def generate(ast):
    if not isinstance(ast, list):   # leaf node: a number or a name
        return str(ast)
    if ast[0] == '+':
        return generate(ast[1]) + " + " + generate(ast[2])
    elif ast[0] == 'for':
        return "for %s in %s:\n" % (ast[1], generate(ast[2])) + generate(ast[3])
    ...
assuming an AST structure that's just a list where the first element is a tag for the node name, followed by the trees for any arguments: ['+', 4, ['*', 'x', 5]]. Of course, you should use whatever your parser library uses, unless you're writing the parser yourself.
I don't understand what you mean by Python code substituting symbols in the tree for actual code terminology.
You could write an easy function to iterate over all the symbols under a root node:
def traverse_preorder(ast):
    if not isinstance(ast, list):   # leaf node: yield it as-is
        yield ast
        return
    yield ast[0]
    for arg in ast[1:]:
        for x in traverse_preorder(arg):
            yield x
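For example, with the sample list-shaped AST from above (and the leaf handling added):

tree = ['+', 4, ['*', 'x', 5]]
print(list(traverse_preorder(tree)))   # ['+', 4, '*', 'x', 5]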
On second thought, the variable name ast is maybe a poor choice because of the ast module.
I'd use ANTLR. Version 3 (current) supports generating Python code. It will generate an Abstract Syntax Tree (AST) automatically during parsing, which you can then traverse. An important part of this will be annotating your grammar with which tokens are to be treated as subtrees (e.g. operators).

IronPython: Is there an alternative to significant whitespace?

For rapidly changing business rules, I'm storing IronPython fragments in XML files. So far this has been working out well, but I'm starting to get to the point where I need more than just one-line expressions.
The problem is that XML and significant whitespace don't play well together. Before I abandon it for another language, I would like to know if IronPython has an alternative syntax.
IronPython doesn't have an alternate syntax. It's an implementation of Python, and Python uses significant indentation (all languages use significant whitespace, not sure why we talk about whitespace when it's only indentation that's unusual in the Python case).
>>> from __future__ import braces
  File "<stdin>", line 1
    from __future__ import braces
                                ^
SyntaxError: not a chance
All I want is something that will let my users write code like
Ummm... Don't do this. You don't actually want this. In the long run, this will cause endless little issues because you're trying to force too much content into an attribute.
Do this.
<Rule Name="Markup">
    <Formula>(Account.PricingLevel + 1) * .05</Formula>
</Rule>
You should try not to have significant, meaningful stuff in attributes. As a general XML design policy, you should use tags and save attributes for names and ID's and the like. When you look at well-done XSD's and DTD's, you see that attributes are used minimally.
Having the body of the rule in a separate tag (not an attribute) saves much pain. And it allows a tool to provide correct CDATA sections. Use a tool like Altova's XML Spy to assure that your tags have space preserved properly.
I think you can set the xml:space="preserve" attribute or use a <![CDATA[ section to avoid other issues with, for example, quotes and greater-than or equals signs.
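For example (the element names follow the earlier answer, and the multi-line rule body here is just an invented illustration):

<Rule Name="Markup">
    <Formula xml:space="preserve"><![CDATA[
level = Account.PricingLevel + 1
level * .05
]]></Formula>
</Rule>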
Apart from the already mentioned CDATA sections, there's pindent.py, which can, among other things, fix broken indentation based on block-closing comments a la # end if. To quote the linked file:
When called as "pindent -r" it assumes its input is a Python program with block-closing comments but with its indentation messed up, and outputs a properly indented version.
...
A "block-closing comment" is a comment of the form '# end <keyword>' where is the keyword that opened the block. If the opening keyword is 'def' or 'class', the function or class name may be repeated in the block-closing comment as well. Here is an example of a program fully augmented with block-closing comments:
def foobar(a, b):
    if a == b:
        a = a+1
    elif a < b:
        b = b-1
        if b > a: a = a-1
        # end if
    else:
        print 'oops!'
    # end if
# end def foobar
It's bundled with CPython, but if IronPython doesn't have it, just grab it from the repository.
