How do Python parsers handle indentation?

When parsing a freeform language like C, it is easy for the parser to determine when several statements are related to one another simply by looking at the tokens emitted by the scanner. For example, in the code
if (x == 5) {
    a = b;
    c = d;
}
The parser can tell that a = b; and c = d; are part of the same block statement because they're surrounded by braces. This could easily be encoded as a CFG using something like this:
STMT ::= IF_STMT | EXPR; | BLOCK_STMT | STMT STMT
IF_STMT ::= if ( EXPR ) STMT
BLOCK_STMT ::= { STMT }
In Python and other whitespace-sensitive languages, though, it's not as easy to do this because the structure of the statements can only be inferred from their absolute position, which I don't think can easily be encoded into a CFG. For example, the above code in Python would look like this:
if x == 5:
    a = b
    c = d
Try as I might, I can't see a way to write a CFG that would accept this, because I can't figure out how to encode "two statements at the same level of nesting" into a CFG.
How do Python parsers group statements as they do? Do they rely on a scanner that automatically inserts extra tokens denoting starts and ends of statements? Do they produce a rough AST for the program, then have an extra pass that assembles statements based on their indentation? Is there a clever CFG for this problem that I'm missing? Or do they use a more powerful parser than a standard LL(1) or LALR(1) parser that's able to take whitespace level into account?

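Indentation is handled with two "pseudo tokens", INDENT and DEDENT. The tokenizer keeps track of the current indentation level and emits an INDENT token when a line is indented further than the previous one and a DEDENT token when the indentation drops back, so the grammar can treat them much like opening and closing braces. The lexical analysis chapter of the Python language reference describes the details. For more information, you should look at the source for the Python tokenizer and parser.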
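To see the pseudo tokens for yourself, the standard library's tokenize module can be run over the snippet from the question. This is only a small illustration; the names SOURCE and token_names are made up for this example.

```python
import io
import tokenize

# The Python snippet from the question, with its indentation intact.
SOURCE = "if x == 5:\n    a = b\n    c = d\n"


def token_names(source):
    """Return the token type names the standard tokenizer produces for source."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


names = token_names(SOURCE)
print(names)
```

The printed list contains exactly one INDENT (after the NEWLINE that ends the if line) and one DEDENT (before the end marker), so the two assignments sit between them just as they would sit between braces in C.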

Related

Pretty-print Lisp using Python

Is there a way to pretty-print Lisp-style code string (in other words, a bunch of balanced parentheses and text within) in Python without re-inventing a wheel?

Short answer
I think a reasonable approach, if you can, is to generate Python lists or custom objects instead of strings and use the pprint module, as suggested by @saulspatz.
Long answer
The whole question looks like an instance of an XY problem. Why? Because you are using Python (why not Lisp?) to manipulate strings (why not data structures?) representing generated Lisp-style code, where Lisp-style is defined as "a bunch of parentheses and text within".
To the question "how to pretty-print?", I would thus respond "I wouldn't start from here!".
The best way to not reinvent the wheel in your case, apart from using existing wheels, is to stick to a simple output format.
But first of all, why do you need to pretty-print? Who will look at the resulting code?
Depending on the exact Lisp dialect you are using and the intended usage of the code, you could format your code very differently. Think about newlines, indentation and maximum width of your text, for example. The Common Lisp pretty-printer is particularly elaborate and I doubt you want to have the same level of configurability.
If you used Lisp, a simple call to pprint would solve your problem, but you are using Python, so stick with the most reasonable output for the moment because pretty-printing is a can of worms.
If your code is intended for human readers, please:
don't put closing parentheses on their own lines
don't vertically align opening and closing parentheses
don't add spaces between opening parentheses
This is ugly:
( * ( + 3 x )
    (f
       x
       y
    )
)
This is better:
(* (+ 3 x)
   (f x y))
Or simply:
(* (+ 3 x) (f x y))
See here for more details.
But before printing, you have to parse your input string and make sure it is well-formed. Maybe you are sure it is well-formed, due to how you generate your forms, but I'd argue that the printer should ignore that and not make too many assumptions. If you passed the pretty-printer an AST represented by Python objects instead of just strings, this would be easier, as suggested in comments. You could build a data-structure or custom classes and use the pprint (python) module. That, as said above, seems to be the way to go in your case, if you can change how you generate your Lisp-style code.
With strings, you are supposed to handle any possible input and reject invalid ones.
This means checking that parentheses and quotes are balanced (beware of escape characters), etc.
Actually, you don't really need to build an intermediate tree for printing (though it would probably help for other parts of your program), because Lisp-style code is made of forms that are easily nested and use a prefix notation: you can scan your input string from left to right and print as required when seeing parentheses (opening parenthesis: recurse; closing parenthesis: return from recursion). When you first encounter an unescaped double-quote ", read until the next unescaped one, and so on.
This, coupled with a simple printing method, could be sufficient for your needs.
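As a rough sketch of that idea, the code below checks that the input is well-formed and then prints it with a very simple layout rule: a form stays on one line if it fits, otherwise its remaining arguments are aligned under the first one. Unlike the tree-free scan described above, it builds nested lists first, because that keeps the width check simple. All names (lisp_tokens, parse, flat, pretty, pretty_print) and the width default are assumptions made for this example, and comments or reader macros are not handled.

```python
def lisp_tokens(text):
    """Split Lisp-style text into '(', ')', string literals and atoms."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        c = text[i]
        if c.isspace():
            i += 1
        elif c in "()":
            tokens.append(c)
            i += 1
        elif c == '"':
            j = i + 1
            while j < n and text[j] != '"':
                j += 2 if text[j] == "\\" else 1  # skip escaped characters
            if j >= n:
                raise ValueError("unterminated string literal")
            tokens.append(text[i:j + 1])
            i = j + 1
        else:
            j = i
            while j < n and not text[j].isspace() and text[j] not in '()"':
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens


def parse(tokens):
    """Build nested lists from tokens, rejecting unbalanced parentheses."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unexpected ')'")
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("missing ')'")
    return stack[0]


def flat(form):
    """Render a form on a single line."""
    if isinstance(form, list):
        return "(" + " ".join(flat(f) for f in form) + ")"
    return form


def pretty(form, indent=0, width=40):
    """Render a form starting at column indent, breaking lines if too wide."""
    one_line = flat(form)
    if not isinstance(form, list) or not form or indent + len(one_line) <= width:
        return one_line
    head = pretty(form[0], indent + 1, width)
    if isinstance(form[0], list) or len(form) == 1:
        child_indent = indent + 1
        lines, rest = [head], form[1:]
    else:
        # Keep the operator and first argument together; align the rest.
        child_indent = indent + len(head) + 2
        lines = [head + " " + pretty(form[1], child_indent, width)]
        rest = form[2:]
    lines += [pretty(f, child_indent, width) for f in rest]
    return "(" + ("\n" + " " * child_indent).join(lines) + ")"


def pretty_print(text, width=40):
    """Validate text and return it pretty-printed, one top-level form per line."""
    return "\n".join(pretty(f, 0, width) for f in parse(lisp_tokens(text)))


print(pretty_print("( * ( + 3 x ) (f x y ) )", width=12))
# (* (+ 3 x)
#    (f x y))
```

With the default width the same input comes out as the one-line form (* (+ 3 x) (f x y)), and unbalanced input raises ValueError instead of being printed.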

I think the easiest method would be to use triple quotations. If you say:
print("""
(((This is some lisp code))) """)
It should work.
You can format your code any way you like within the triple quotes and it will come out the way you want it to.
Best of luck and happy coding!

I made this rudimentary pretty printer once for prettifying CLIPS, which is based on Lisp. Might help:
import os

def clips_pprint(clips_str: str) -> str:
    """Pretty-prints a CLIPS string.

    Indents a CLIPS string for easier visual confirmation during development
    and verification.
    Assumes the CLIPS string is valid CLIPS, i.e. braces are paired.
    """
    LB = "("
    RB = ")"
    TAB = " " * 4
    formatted_clips_str = ""
    tab_count = 0
    for c in clips_str:
        if c == LB:
            # start every opening parenthesis on a new, indented line
            formatted_clips_str += os.linesep
            for _i in range(tab_count):
                formatted_clips_str += TAB
            tab_count += 1
        elif c == RB:
            tab_count -= 1
        formatted_clips_str += c
    return formatted_clips_str.strip()

Modgrammar defining nested strings in python

I need to define Strings for a custom language parser. I am using modgrammar to define this language parser. This language (sqf) has a datatype String which allows for nesting.
A string is denoted by the standard double quotes ("), and a string within a string is denoted by two sets of double quotes, for example:
"this is a string ""this is a string within a string"""
As far as I am aware, there is no limit to the levels of nesting.
So far i have tried the following to parse strings:
from modgrammar import *
class String (Grammar):
    grammar = (
        (L("\""), ANY_EXCEPT("\""), (L("\"")),
        (
            OPTIONAL((L("\""),
            REF("String"),
            (L("\""))
        )
    )
String.grammar_resolve_refs()
And
class String (Grammar):
    grammar = (
        (L("\""),
        ANY_EXCEPT("\""),
        (L("\"")
    )
class StringNested (Grammar):
    grammar = (String,OPTIONAL((L("\""),REF("StringNested"),(L("\""))
    )
and:
class StringBase (Grammar):
    grammar_greedy = True
    grammar = (REPEAT(WORD("A-Za-z0-9")))
class String (Grammar):
    # grammar =(OR(
    #     OR((StringBase,LITERAL('"'),StringBase, LITERAL('"')), (LITERAL('"'),StringBase,LITERAL('"'),StringBase) ),
    #     StringBase,
    # ))
    grammar = L('"'),OPTIONAL(L('"'),StringBase,L('"')), OPTIONAL(LITERAL('"'),L('"'),REF("String"), LITERAL('"'),L('"')), (StringBase),L('"')
None of these seems to work.
edit: using python 3.4 and modgrammar 0.10
edit 2: NOTE:
I have found that while modgrammar is powerful and good at what it does, it may not be the right solution to my problem. A hand-coded linear parser turned out to be much more efficient at parsing the data. In this instance the data is already program output, and is therefore unlikely to contain errors in a way that would require the extensive checking that modgrammar allows.

So I finally found a solution that seems to work (pending extensive testing).
The answer can be found here:
Google Groups: Parsing Escaped Characters
posted by Dan S (08/09/2012):
Found one way:
grammar = ('"', ZERO_OR_MORE(WORD('^"') | WORD('^\', '"', count=2)), '"')
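For comparison, the hand-coded linear scan mentioned in the question's second edit could look roughly like the sketch below. It does not use modgrammar at all, and the function name read_sqf_string and its return convention are assumptions made for this example. It relies only on the rule stated in the question: inside a string, a doubled double quote stands for one quote character.

```python
def read_sqf_string(text, start=0):
    """Read one double-quoted string literal beginning at text[start].

    Returns (value, next_index). Inside the literal, a doubled quote ("")
    is decoded to a single quote character.
    """
    if start >= len(text) or text[start] != '"':
        raise ValueError("expected opening quote at index %d" % start)
    parts = []
    i = start + 1
    while i < len(text):
        if text[i] == '"':
            if i + 1 < len(text) and text[i + 1] == '"':
                parts.append('"')  # doubled quote: keep one, skip both
                i += 2
                continue
            return "".join(parts), i + 1  # lone quote closes the literal
        parts.append(text[i])
        i += 1
    raise ValueError("unterminated string literal")


literal = '"this is a string ""this is a string within a string"""'
value, end = read_sqf_string(literal)
print(value)
# this is a string "this is a string within a string"
```

Deeper nesting needs no extra machinery in this sketch: if the decoded value is itself a quoted literal, the same function can be applied to it again.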

Triple-double quote vs. double quote

What is the preferred way to write a Python docstring?
""" or "
In the book Dive Into Python, the author provides the following example:
def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.

    Returns string."""
In another chapter, the author provides another example:
def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\00", "").strip()
Both syntaxes work. The only difference I can see is that """ allows us to write multi-line docstrings.
Are there any differences other than that?

From the PEP8 Style Guide:
PEP 257 describes good docstring conventions. Note that most
importantly, the """ that ends a multiline docstring should be on a
line by itself, e.g.:
"""Return a foobang
Optional plotz says to frobnicate the bizbaz first.
"""
For one liner docstrings, it's okay to keep the closing """ on the
same line.
PEP 257 recommends using triple quotes, even for one-line docstrings:
Triple quotes are used even though the string fits on one line. This
makes it easy to later expand it.
Note that not even the Python standard library itself follows these recommendations consistently. For example,
abcoll.py
ftplib.py
functools.py
inspect.py

They're both strings, so there is no difference. The preferred style is triple double quotes (PEP 257):
For consistency, always use """triple double quotes""" around docstrings.
Use r"""raw triple double quotes""" if you use any backslashes in your docstrings. For Unicode docstrings, use u"""Unicode triple-quoted strings""".
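A quick way to convince yourself that the quoting style makes no difference to the resulting docstring is to compare the __doc__ attributes. The function names below are invented for this illustration.

```python
def single():
    "strip whitespace and nulls"


def triple():
    """strip whitespace and nulls"""


def raw_doc():
    r"""Split the input on the \n escape sequence."""


# Both quoting styles produce the same string object contents.
print(single.__doc__ == triple.__doc__)

# In the raw docstring the backslash survives as a literal character.
print("\\n" in raw_doc.__doc__)
```

Both checks print True: the quotes only affect how the literal is written in the source, and the r prefix only matters when the docstring contains backslashes.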

No, not really. If you are writing to a file, triple quotes may be ideal, because you don't have to use "\n" to start a new line. Just make sure the quotes you start and end with are the same type (double or triple quotes). Here is a reliable resource if you have any more questions:
http://docs.python.org/release/1.5.1p1/tut/strings.html

You can also use triple-double quotes for a long SQL query, which keeps it readable and avoids scrolling right to see it:
query = """
    SELECT count(*)
    FROM (SELECT *
          FROM student
          WHERE grade = 2 AND major = 'Computer Science'
          FOR UPDATE)
    AS result;
"""
With ordinary double quotes, the same SQL query is harder to read and you need to scroll right to see all of it:
query = "SELECT count(*) FROM (SELECT * FROM student WHERE grade = 2 AND major = 'Computer Science' FOR UPDATE) AS result;"
In addition, you can use triple-double quotes for a GraphQL query as shown below:
query = """
{
  products(first: 5) {
    edges {
      node {
        id
        handle
      }
    }
  }
}"""

How to apply De Morgan's laws to a parsed string? (transforming the string or with parse actions)

I am trying to write a program that evaluates whether a propositional logic formula is valid or invalid using the semantic tree method.
So far I have managed to check whether a formula is well formed:
from pyparsing import *
from string import ascii_lowercase

def fbf():
    atom = Word(ascii_lowercase, max=1)  # lowercase alphabet
    op = oneOf('^ V => <=>')  # operators
    identOp = oneOf('( [ {')
    identCl = oneOf(') ] }')
    form = Forward()  # forward declaration so the grammar can be recursive
    # grammar
    form << ( (Group(Literal('~') + form)) | ( Group(identOp + form + op + form + identCl) ) | ( Group(identOp + form + identCl) ) | (atom) )
    return form

# read a formula and check it
entrada = input("Entrada: ")
try:
    print(fbf().parseString(entrada, parseAll=True))
except ParseException as error:  # handle the parse error
    print(error.markInputline())
    print(error)
print()
Now I need to convert a negated formula ~(form) according to De Morgan's laws. In BNF-like notation, the laws look something like this:
~((form) V (form)) = (~(form) ^ ~(form))
~((form) ^ (form)) = (~(form) V ~(form))
http://en.wikipedia.org/wiki/De_Morgans_laws
The transformation must be recursive. I was reading about parse actions, but I don't really understand them; I'm new to Python and very unskilled.
Can somebody help me on how to get this to work?

Juan Jose -
You are asking for a lot of work on the part of this audience, whether you realize it or not. Here are some suggestions on how to make progress on this problem:
Recognize that parsing the input is only the first step in this overall program. You can't just write any parser that gets through the input, and then declare yourself ready for the next step. You need to anticipate what you will do with the parsed output, and try to parse the data in such a way that it readies you to take the next step - which in your case is to do some logical transformations to apply De Morgan's laws. In fact, you may be best off working backwards - assume you have a parser, what would you need your transformation code to work with, how would an expression look, and how would you perform the transform itself? This will naturally structure your thinking to the application domain, and give you a target result format when you start writing the parser.
When you start to write your parser, look at other pyparsing examples that do similar tasks, such as SimpleBool.py on the pyparsing wiki. See how they parse the input to create a set of evaluatable objects, which can then be acted upon in the application domain (whether it is to evaluate them, transform them, or whatever). Think about what kind of objects you want to create in your parser that will work with the transformation methods you outlined in the last step.
Take time to write a BNF for the syntax you will parse. Write out some sample test strings that you would parse to help you anticipate syntax issues. Is "~~p ^ q V r" a valid string? Can identifiers be multiple characters, or will you restrict to just single characters (single will be easier to work with at the beginning, and you can expand it later easily)? Keep your syntax simple if you can, such as just supporting ()'s for grouping, instead of any matched pair of ()'s, []'s, or {}'s.
When you implement your parser, start with simple test cases first and work your way up. You may have to backtrack a bit if you find that you made some assumptions early on that more complicated strings don't support, but that's pretty typical for most programming projects.
As an implementation tip, read up on using the operatorPrecedence helper, as it is specifically designed for these types of parsing jobs. Look at how it is used in SimpleBool.py to create an object hierarchy that mirrors the structure of the input string. Then think about what objects would do in your transformation process.
Good luck!
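To make the "work backwards" advice concrete, here is one possible shape for the transformation step, written without pyparsing. It assumes the parser eventually produces nested tuples such as ("not", x), ("and", a, b) and ("or", a, b), with atoms as plain strings; that representation and the name push_negation are assumptions made for this sketch, not something the question's grammar produces today.

```python
# Each connective maps to the one De Morgan's laws swap it with.
DUAL = {"and": "or", "or": "and"}


def push_negation(formula):
    """Apply De Morgan's laws recursively, moving negations toward the atoms."""
    if isinstance(formula, str):  # an atom such as "p"
        return formula
    op = formula[0]
    if op == "not":
        inner = formula[1]
        if isinstance(inner, str):  # ~p is already as deep as it can go
            return formula
        if inner[0] == "not":  # ~~f  ->  f
            return push_negation(inner[1])
        if inner[0] in DUAL:  # ~(a V b) -> (~a ^ ~b) and vice versa
            negated = tuple(push_negation(("not", arg)) for arg in inner[1:])
            return (DUAL[inner[0]],) + negated
        # Other connectives (e.g. implication) are left under the negation.
        return ("not", push_negation(inner))
    return (op,) + tuple(push_negation(arg) for arg in formula[1:])


print(push_negation(("not", ("or", "p", "q"))))
# ('and', ('not', 'p'), ('not', 'q'))
```

Once the target structure is settled like this, the parser's job becomes clearer: its parse actions (or a post-processing step over the grouped tokens) only need to build these tuples.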

How to parse code (in Python)?

I need to parse some special data structures. They are in some somewhat-like-C format that looks roughly like this:
Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
         );
}
I can think of several ways to go about this. I could 'tokenize' the code using regular expressions. I could read the code one character at a time and use a state machine to construct my data structure. I could get rid of comma-linebreaks and read the thing line by line. I could write some conversion script that converts this code to executable Python code.
Is there a nice pythonic way to parse files like this?
How would you go about parsing it?
This is more a general question about how to parse strings and not so much about this particular file format.

Using pyparsing (Mark Tolonen, I was just about to click "Submit Post" when your post came thru), this is pretty straightforward - see comments embedded in the code below:
data = """Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
         );
} """
from pyparsing import *
# define basic punctuation and data types
LBRACE,RBRACE,LPAREN,RPAREN,SEMI = map(Suppress,"{}();")
GROUP = Keyword("Group")
ENTRY = Keyword("Entry")
# use parse actions to do parse-time conversion of values
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))
# parses a string enclosed in quotes, but strips off the quotes at parse time
string = QuotedString('"')
# define structure expressions
value = string | real | integer
entry = Group(ENTRY + LPAREN + Group(Optional(delimitedList(value)))) + RPAREN + SEMI
# since Groups can contain Groups, need to use a Forward to define recursive expression
group = Forward()
group << Group(GROUP + LPAREN + string("name") + RPAREN +
               LBRACE + Group(ZeroOrMore(group | entry))("body") + RBRACE)
# ignore C style comments wherever they occur
group.ignore(cStyleComment)
# parse the sample text
result = group.parseString(data)
# print out the tokens as a nice indented list using pprint
from pprint import pprint
pprint(result.asList())
Prints
[['Group',
  'GroupName',
  [['Group',
    'AnotherGroupName',
    [['Entry', ['some', 'variables', 0, 3.141]],
     ['Entry', ['other', 'variables', 1, 2.718]]]],
   ['Entry', ['linebreaks', 'allowed', 3, 1.4139999999999999]]]]]
(Unfortunately, there may be some confusion since pyparsing defines a "Group" class, for imparting structure to the parsed tokens - note how the value lists in an Entry get grouped because the list expression is enclosed within a pyparsing Group.)

Check out pyparsing. It has lots of parsing examples.

Depends on how often you need this and if the syntax stays the same. If the answers are "quite often" and "more or less yes" then I would look at a way to express the syntax and write a specific parser for that language with a tool like PyPEG or LEPL. Defining the parser rules is the big job, though, so unless you need to parse the same kind of files often it might not be worth the effort.
But if you look at the PyPEG page it tells you how to output the parsed data to XML so if that tool doesn't give enough power to you, you could use it to generate the XML and then use e.g. lxml to parse the xml.
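The first idea from the question, tokenizing with regular expressions, can also be done with the standard library alone. The sketch below pairs a regex tokenizer with a small recursive-descent parser for the sample format. Every name in it (TOKEN_RE, tokens_of, parse, SAMPLE) is invented for this example, and it assumes strings contain no escaped quotes.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/)            # C-style comment (dropped)
  | (?P<space>\s+)                    # whitespace (dropped)
  | (?P<string>"[^"]*")               # quoted string, escapes not handled
  | (?P<number>[+-]?\d+(?:\.\d*)?)    # integer or real
  | (?P<name>[A-Za-z_]\w*)            # keywords: Group / Entry
  | (?P<punct>[{}();,])               # punctuation
""", re.VERBOSE | re.DOTALL)


def tokens_of(text):
    """Return a list of (kind, text) pairs, ending with an ('eof', '') marker."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError("unexpected character %r at offset %d" % (text[pos], pos))
        if match.lastgroup not in ("comment", "space"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    tokens.append(("eof", ""))
    return tokens


def parse(text):
    """Parse one top-level Group or Entry into nested Python lists."""
    tokens = tokens_of(text)
    pos = 0

    def take(expected=None):
        nonlocal pos
        kind, tok = tokens[pos]
        if kind == "eof" or (expected is not None and tok != expected):
            raise SyntaxError("expected %r, found %r" % (expected, tok))
        pos += 1
        return kind, tok

    def parse_value():
        kind, tok = take()
        if kind == "string":
            return tok[1:-1]  # strip the surrounding quotes
        if kind == "number":
            return float(tok) if "." in tok else int(tok)
        raise SyntaxError("expected a value, found %r" % tok)

    def parse_item():
        _, word = take()
        take("(")
        if word == "Group":
            name = parse_value()
            take(")")
            take("{")
            body = []
            while tokens[pos][1] != "}":
                body.append(parse_item())  # groups nest, so recurse
            take("}")
            return ["Group", name, body]
        if word == "Entry":
            args = []
            if tokens[pos][1] != ")":
                args.append(parse_value())
                while tokens[pos][1] == ",":
                    take(",")
                    args.append(parse_value())
            take(")")
            take(";")
            return ["Entry", args]
        raise SyntaxError("unknown keyword %r" % word)

    tree = parse_item()
    if tokens[pos][0] != "eof":
        raise SyntaxError("unexpected trailing input %r" % tokens[pos][1])
    return tree


SAMPLE = '''Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
         );
}'''

tree = parse(SAMPLE)
print(tree)
```

This is more code than the pyparsing version and gives cruder error messages, so it mostly makes sense when adding a dependency is not an option.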
