How to parse code (in Python)?

I need to parse some special data structures. They are in some somewhat-like-C format that looks roughly like this:
Group("GroupName") {
/* C-Style comment */
Group("AnotherGroupName") {
Entry("some","variables",0,3.141);
Entry("other","variables",1,2.718);
}
Entry("linebreaks",
"allowed",
3,
1.414
);
}
I can think of several ways to go about this. I could 'tokenize' the code using regular expressions. I could read the code one character at a time and use a state machine to construct my data structure. I could get rid of comma-linebreaks and read the thing line by line. I could write some conversion script that converts this code to executable Python code.
Is there a nice pythonic way to parse files like this?
How would you go about parsing it?
This is more a general question about how to parse strings and not so much about this particular file format.

Using pyparsing (Mark Tolonen, I was just about to click "Submit Post" when your post came thru), this is pretty straightforward - see comments embedded in the code below:
data = """Group("GroupName") {
/* C-Style comment */
Group("AnotherGroupName") {
Entry("some","variables",0,3.141);
Entry("other","variables",1,2.718);
}
Entry("linebreaks",
"allowed",
3,
1.414
);
} """
from pyparsing import *
# define basic punctuation and data types
LBRACE,RBRACE,LPAREN,RPAREN,SEMI = map(Suppress,"{}();")
GROUP = Keyword("Group")
ENTRY = Keyword("Entry")
# use parse actions to do parse-time conversion of values
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))
# parses a string enclosed in quotes, but strips off the quotes at parse time
string = QuotedString('"')
# define structure expressions
value = string | real | integer
entry = Group(ENTRY + LPAREN + Group(Optional(delimitedList(value)))) + RPAREN + SEMI
# since Groups can contain Groups, need to use a Forward to define recursive expression
group = Forward()
group << Group(GROUP + LPAREN + string("name") + RPAREN +
LBRACE + Group(ZeroOrMore(group | entry))("body") + RBRACE)
# ignore C style comments wherever they occur
group.ignore(cStyleComment)
# parse the sample text
result = group.parseString(data)
# print out the tokens as a nice indented list using pprint
from pprint import pprint
pprint(result.asList())
Prints
[['Group',
'GroupName',
[['Group',
'AnotherGroupName',
[['Entry', ['some', 'variables', 0, 3.141]],
['Entry', ['other', 'variables', 1, 2.718]]]],
['Entry', ['linebreaks', 'allowed', 3, 1.4139999999999999]]]]]
(Unfortunately, there may be some confusion since pyparsing defines a "Group" class, for imparting structure to the parsed tokens - note how the value lists in an Entry get grouped because the list expression is enclosed within a pyparsing Group.)
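For comparison, the "tokenize with regular expressions" route from the question can be sketched with only the standard library. This is a rough sketch, not a replacement for the pyparsing grammar above: the token patterns are guessed from the sample data alone (no escapes inside strings, no exponent notation), and the names TOKEN_RE, tokenize and parse are made up for illustration.

```python
import re

# Token patterns are assumptions based only on the sample in the question.
TOKEN_RE = re.compile(r'''
    (?P<COMMENT>/\*.*?\*/)            # C-style comment (skipped)
  | (?P<STRING>"[^"]*")               # quoted string, no escape handling
  | (?P<NUMBER>[+-]?\d+(?:\.\d*)?)    # integer or real
  | (?P<NAME>[A-Za-z_]\w*)            # Group / Entry
  | (?P<PUNCT>[{}();,])               # punctuation
  | (?P<WS>\s+)                       # whitespace (skipped)
''', re.VERBOSE | re.DOTALL)

def tokenize(text):
    """Yield (kind, text) pairs, dropping comments and whitespace."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        pos = m.end()
        if m.lastgroup not in ("COMMENT", "WS"):
            yield m.lastgroup, m.group()

def parse(text):
    """Recursive-descent parser producing nested lists."""
    tokens = list(tokenize(text)) + [("EOF", "")]  # sentinel avoids IndexError
    pos = 0

    def expect(value):
        nonlocal pos
        if tokens[pos][1] != value:
            raise SyntaxError("expected %r, got %r" % (value, tokens[pos][1]))
        pos += 1

    def parse_value():
        nonlocal pos
        kind, tok = tokens[pos]
        pos += 1
        if kind == "STRING":
            return tok[1:-1]  # strip the surrounding quotes
        if kind == "NUMBER":
            return float(tok) if "." in tok else int(tok)
        raise SyntaxError("expected a value, got %r" % tok)

    def parse_item():
        nonlocal pos
        keyword = tokens[pos][1]
        if keyword not in ("Group", "Entry"):
            raise SyntaxError("expected Group or Entry, got %r" % keyword)
        pos += 1
        expect("(")
        if keyword == "Entry":
            values = []
            while tokens[pos][1] != ")":
                values.append(parse_value())
                if tokens[pos][1] == ",":
                    pos += 1
            expect(")")
            expect(";")
            return ["Entry", values]
        name = parse_value()
        expect(")")
        expect("{")
        body = []
        while tokens[pos][1] != "}":
            body.append(parse_item())  # groups may nest
        expect("}")
        return ["Group", name, body]

    return parse_item()

data = '''Group("GroupName") {
/* C-Style comment */
Group("AnotherGroupName") {
Entry("some","variables",0,3.141);
Entry("other","variables",1,2.718);
}
Entry("linebreaks",
"allowed",
3,
1.414
);
}'''

result = parse(data)
```

The result has the same nested-list shape as the pyparsing output. The hand-written version needs noticeably more code for the same grammar, which is the main argument for using a parsing library once the format grows.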

Check out pyparsing. It has lots of parsing examples.

Depends on how often you need this and if the syntax stays the same. If the answers are "quite often" and "more or less yes" then I would look at a way to express the syntax and write a specific parser to that language with a tool like PyPEG or LEPL. Defining the parser rules is the big job so unless you need to parse the same kind of files often it might not necessarily be effective, though.
But if you look at the PyPEG page it tells you how to output the parsed data to XML so if that tool doesn't give enough power to you, you could use it to generate the XML and then use e.g. lxml to parse the xml.

Related

Different strings in a single variable in python [duplicate]

I can create a multi-line string using this syntax:
string = str("Some chars "
"Some more chars")
This will produce the following string:
"Some chars Some more chars"
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
P.s: I just want to understand the internals. I know there are other ways to declare or create multi-line strings.

Read the reference manual, it's in there.
Specifically:
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings,
(emphasis mine)
This is why:
string = str("Some chars "
"Some more chars")
is exactly the same as: str("Some chars Some more chars").
This concatenation is performed wherever a string literal might appear: list initializations, function calls (as is the case with str above), et cetera.
The only caveat is when a string literal is not contained between one of the grouping delimiters (), {} or [] but, instead, spreads between two separate physical lines. In that case we can alternatively use the backslash character to join these lines and get the same result:
string = "Some chars " \
"Some more chars"
Of course, concatenation of strings on the same physical line does not require the backslash. (string = "Hello " "World" is just fine)
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
Python is. Exactly when Python does this is where things get interesting.
From what I could gather (take this with a pinch of salt, I'm not a parsing expert), this happens when Python transforms the parse tree (LL(1) parser) for a given expression to its corresponding AST (Abstract Syntax Tree).
You can get a view of the parsed tree via the parser module (note that this module was removed in Python 3.10, so the snippet needs an older interpreter):
import parser
expr = """
str("Hello "
"World")
"""
pexpr = parser.expr(expr)
parser.st2list(pexpr)
This dumps a pretty big and confusing list that represents the concrete syntax tree parsed from the expression in expr:
-- rest snipped for brevity --
[322,
[323,
[3, '"Hello "'],
[3, '"World"']]]]]]]]]]]]]]]]]],
-- rest snipped for brevity --
The numbers correspond to either symbols or tokens in the parse tree and the mappings from symbol to grammar rule and token to constant are in Lib/symbol.py and Lib/token.py respectively.
As you can see in the snipped version I added, you have two different entries corresponding to the two different str literals in the expression parsed.
Next, we can view the output of the AST tree produced by the previous expression via the ast module provided in the Standard Library:
import ast
p = ast.parse(expr)
ast.dump(p)
# this prints out the following:
"Module(body = [Expr(value = Call(func = Name(id = 'str', ctx = Load()), args = [Str(s = 'Hello World')], keywords = []))])"
The output is more user friendly in this case; you can see that the function call has a single argument, the concatenated string Hello World.
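A small check of the same point that avoids depending on the exact text of ast.dump, which differs between versions (on Python 3.8 and later the node is reported as Constant rather than Str). This is only a sketch; the variable names are arbitrary.

```python
import ast

# Two adjacent string literals inside the call parentheses.
expr = 'str("Hello "\n    "World")'

tree = ast.parse(expr, mode="eval")
call = tree.body                      # the ast.Call node for str(...)
args = call.args                      # positional arguments of the call

# literal_eval accepts a node and returns the Python value it denotes.
joined = ast.literal_eval(args[0])

print(len(args), repr(joined))
```

The call has exactly one argument node, so the two literals were already merged by the time the AST exists.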
In addition, I also stumbled upon a cool module that generates a visualization of the tree for ast nodes. Using it, the output of the expression expr is visualized like this:
[Image: AST visualization, cropped to show only the relevant part for the expression.]
As you can see, in the terminal leaf node we have a single str object, the joined string for "Hello " and "World", i.e "Hello World".
If you are feeling brave enough, dig into the source: the code for transforming expressions into a parse tree is located at Parser/pgen.c, while the code transforming the parse tree into an Abstract Syntax Tree is in Python/ast.c.
This information is for Python 3.5 and I'm pretty sure that unless you're using some really old version (< 2.5) the functionality and locations should be similar.
Additionally, if you are interested in the whole compilation step python follows, a good gentle intro is provided by one of the core contributors, Brett Cannon, in the video From Source to Code: How CPython's Compiler Works.

Eliminating " " from a JSON file so that they don't interrupt the string [duplicate]

While trying to parse JSON from an AJAX request, the string returned contains invalid JSON.
Although the best practice would be to change the server to reply with valid JSON, as suggested in multiple related answers, this is not an option.
Trying to solve this problem using python, I looked at regular expressions.
The main problem is elements like the following (which I currently use as a test string):
testStr = '{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}'
I currently use the following code:
jsonString = re.sub(r'(?<=\w)\"(?=[^\(\:\}\,])','\\"',testStr)
jsonString = re.sub(r'\"\"(?![,}:])','\"\\\"',jsonString)
with very limited success.
If I were using C, I would parse the string and simply escape all double quotes within the element (i.e., between all double quotes which are preceded by [:{},] ).
There must be a pythonic way to parse, without resorting to a for loop and looking ahead, and keeping history.
EDIT:
Assuming that strings do not contain: [ : { } ]
And also assuming that the unescaped double quotes are only within the value, and not in the key,
Then I assume that the following (or something similar) should solve the problem:
import re
re.sub(r'(?<![\[\:])\"(?![,\}),'\"',testString)
But it still does not work.

Seems I needed a break to solve this.
The following regular expression seems to replace only double quotes that are contained within the element string (with the assumptions I stated in the question).
output = re.sub(r'(?<![\[\:\{\,])\"(?![\:\}\,])','\\\"', stringName)
I have created a sandbox here: https://repl.it/vNK
Example Output:
Original String:
{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}
Modified String:
{"KEY1":"THIS IS \"AN\" ELEMENT","KEY2":"\"\"THIS IS ANOTHER \"ELEMENT\""}
Parsed JSON:
{
"KEY1": "THIS IS \"AN\" ELEMENT",
"KEY2": "\"\"THIS IS ANOTHER \"ELEMENT\""
}
Any suggestions are welcome.
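For reference, a self-contained version of the approach above that also feeds the repaired string to json.loads. It is a sketch under the same assumptions stated in the question (values never contain [ : { } or a comma next to a quote); the names test_str, pattern, fixed and parsed are only illustrative.

```python
import json
import re

test_str = '{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}'

# A quote is treated as "inside a value" when it is NOT preceded by one of
# [ : { ,  and NOT followed by one of : } ,
pattern = re.compile(r'(?<![\[:{,])"(?![:},])')

# The raw replacement \\" yields a literal backslash followed by a quote.
fixed = pattern.sub(r'\\"', test_str)

parsed = json.loads(fixed)
print(fixed)
print(parsed)
```

This stays a heuristic: a value such as "a","b" embedded inside a string would still be split incorrectly, so fixing the server remains the robust option whenever it becomes possible.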

How to apply De Morgan's laws to a parsed string? (transforming the string or with parse actions)

I am trying to write a program that evaluates whether a propositional logic formula is valid or invalid using the semantic tree method.
So far I have managed to check whether a formula is well formed:
from pyparsing import *
from string import ascii_lowercase
def fbf():
    atom = Word(ascii_lowercase, max=1)  # single lowercase letters
    op = oneOf('^ V => <=>')  # operators
    identOp = oneOf('( [ {')
    identCl = oneOf(') ] }')
    form = Forward()  # declared up front so the grammar can be recursive
    # grammar
    form << ( (Group(Literal('~') + form)) | ( Group(identOp + form + op + form + identCl) ) | ( Group(identOp + form + identCl) ) | (atom) )
    return form
# read a formula and try to parse it
entrada = input("Entrada: ")
try:
    print(fbf().parseString(entrada, parseAll=True))
except ParseException as error:  # report the parse error
    print(error.markInputline())
    print(error)
print()
Now I need to convert the negated formula ~(form) according to De Morgan's laws. The BNF of De Morgan's laws is something like this:
~((form) V (form)) = (~(form) ^ ~(form))
~((form) ^ (form)) = (~(form) V ~(form))
http://en.wikipedia.org/wiki/De_Morgans_laws
Parsing must be recursive. I was reading about parse actions, but I don't really understand them; I'm new to Python and very unskilled.
Can somebody help me on how to get this to work?

Juan Jose -
You are asking for a lot of work on the part of this audience, whether you realize it or not. Here are some suggestions on how to make progress on this problem:
Recognize that parsing the input is only the first step in this overall program. You can't just write any parser that gets through the input, and then declare yourself ready for the next step. You need to anticipate what you will do with the parsed output, and try to parse the data in such a way that it readies you to take the next step, which in your case is to do some logical transformations to apply De Morgan's laws. In fact, you may be best off working backwards: assume you have a parser, what would you need your transformation code to work with, how would an expression look, and how would you perform the transform itself? This will naturally structure your thinking to the application domain, and give you a target result format when you start writing the parser.
When you start to write your parser, look at other pyparsing examples that do similar tasks, such as SimpleBool.py on the pyparsing wiki. See how they parse the input to create a set of evaluatable objects, which can then be acted upon in the application domain (whether it is to evaluate them, transform them, or whatever). Think about what kind of objects you want to create in your parser that will work with the transformation methods you outlined in the last step.
Take time to write a BNF for the syntax you will parse. Write out some sample test strings that you would parse to help you anticipate syntax issues. Is "~~p ^ q V r" a valid string? Can identifiers be multiple characters, or will you restrict to just single characters (single will be easier to work with at the beginning, and you can expand it later easily)? Keep your syntax simple if you can, such as just supporting ()'s for grouping, instead of any matched pair of ()'s, []'s, or {}'s.
When you implement your parser, start with simple test cases first and work your way up. You may have to backtrack a bit if you find that you made some assumptions early on that more complicated strings don't support, but that's pretty typical for most programming projects.
As an implementation tip, read up on using the operatorPrecedence helper, as it is specifically designed for these types of parsing jobs. Look at how it is used in SimpleBool.py to create an object hierarchy that mirrors the structure of the input string. Then think about what objects would do in your transformation process.
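To make the "work backwards from the transformation" advice concrete, here is one possible shape for the transformation step, written without pyparsing. Everything in it is an assumption for illustration: the nested-tuple tree format, the names DUAL and push_negation, and the choice to also remove double negations. It is not what SimpleBool.py does.

```python
# Hypothetical tree format (not produced by the parser in the question):
#   atom:      'p'
#   negation:  ('~', operand)
#   binary:    (op, left, right)   with op one of '^', 'V', '=>', '<=>'

DUAL = {'^': 'V', 'V': '^'}  # De Morgan swaps conjunction and disjunction

def push_negation(node):
    """Recursively apply De Morgan's laws to every negated ^ / V formula."""
    if isinstance(node, str):
        return node                      # atom: nothing to do
    if node[0] == '~':
        inner = node[1]
        if isinstance(inner, str):
            return node                  # ~p stays as it is
        if inner[0] == '~':
            return push_negation(inner[1])   # ~~f becomes f (extra assumption)
        if inner[0] in DUAL:
            op, left, right = inner
            # ~(l op r) becomes (~l dual(op) ~r), then recurse into both sides
            return (DUAL[op],
                    push_negation(('~', left)),
                    push_negation(('~', right)))
        # => and <=> are left negated; only their operands are transformed
        return ('~', push_negation(inner))
    op, left, right = node
    return (op, push_negation(left), push_negation(right))

# ~(p V (q ^ r))
formula = ('~', ('V', 'p', ('^', 'q', 'r')))
transformed = push_negation(formula)
print(transformed)
```

Once the target tree format is settled like this, the parser's job becomes clear: its parse actions only need to build these tuples (or equivalent objects).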
Good luck!

How do Python parsers handle indentation?

When parsing a freeform language like C, it is easy for the parser to determine when several expressions are related to one another simply by looking at the symbols emitted by the parser. For example, in the code
if (x == 5) {
    a = b;
    c = d;
}
The parser can tell that a = b; and c = d; are part of the same block statement because they're surrounded by braces. This could easily be encoded as a CFG using something like this:
STMT ::= IF_STMT | EXPR; | BLOCK_STMT | STMT STMT
IF_STMT ::= if ( EXPR ) STMT
BLOCK_STMT ::= { STMT }
In Python and other whitespace-sensitive languages, though, it's not as easy to do this because the structure of the statements can only be inferred from their absolute position, which I don't think can easily be encoded into a CFG. For example, the above code in Python would look like this:
if x == 5:
    a = b
    c = d
Try as I might, I can't see a way to write a CFG that would accept this, because I can't figure out how to encode "two statements at the same level of nesting" into a CFG.
How do Python parsers group statements as they do? Do they rely on a scanner that automatically inserts extra tokens denoting starts and ends of statements? Do they produce a rough AST for the program, then have an extra pass that assembles statements based on their indentation? Is there a clever CFG for this problem that I'm missing? Or do they use a more powerful parser than a standard LL(1) or LALR(1) parser that's able to take whitespace level into account?

The indentations are handled with two "pseudo tokens" - INDENT and DEDENT. There are some details here. For more information, you should look at the source for the python tokeniser and parser.
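The pseudo tokens can be observed directly with the standard library's tokenize module, which is a pure-Python tokenizer for Python source. The snippet below is a small sketch; the variable names are arbitrary.

```python
import io
import tokenize

source = "if x == 5:\n    a = b\n    c = d\n"

# generate_tokens expects a readline callable that returns str lines.
tokens = tokenize.generate_tokens(io.StringIO(source).readline)

# Map each token's numeric type to its symbolic name.
names = [tokenize.tok_name[tok.type] for tok in tokens]
print(names)
```

The resulting list contains one INDENT after the NEWLINE that ends the if line and one DEDENT before ENDMARKER. From the grammar's point of view these behave much like the braces in the C example, so the block structure can be expressed in an ordinary context-free grammar.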
