Modgrammar defining nested strings in Python

I need to define strings for a custom language parser that I am building with modgrammar. The language (SQF) has a String data type which allows nesting.
A string is delimited by the standard double quote ("); a string within a string is denoted by doubling the quotes, for example:
"this is a string ""this is a string within a string"""
As far as I am aware there is no limit to the levels of nesting.
So far I have tried the following to parse strings:
from modgrammar import *

class String (Grammar):
    grammar = (
        L("\""), ANY_EXCEPT("\""), L("\""),
        OPTIONAL(
            L("\""),
            REF("String"),
            L("\"")
        )
    )

String.grammar_resolve_refs()
And
class String (Grammar):
    grammar = (
        L("\""),
        ANY_EXCEPT("\""),
        L("\"")
    )

class StringNested (Grammar):
    grammar = (String, OPTIONAL(L("\""), REF("StringNested"), L("\"")))
and:
class StringBase (Grammar):
    grammar_greedy = True
    grammar = (REPEAT(WORD("A-Za-z0-9")))

class String (Grammar):
    # grammar = (OR(
    #     OR((StringBase, LITERAL('"'), StringBase, LITERAL('"')), (LITERAL('"'), StringBase, LITERAL('"'), StringBase)),
    #     StringBase,
    # ))
    grammar = (L('"'), OPTIONAL(L('"'), StringBase, L('"')),
               OPTIONAL(LITERAL('"'), L('"'), REF("String"), LITERAL('"'), L('"')),
               StringBase, L('"'))
None of these attempts works.
Edit: I am using Python 3.4 and modgrammar 0.10.
Edit 2: Note: I have found that, while modgrammar is powerful and good at what it does, it may not be the right solution for my problem. A hand-coded linear parse was much more efficient at parsing the data in this instance; the data provided is already programmatic output and is therefore unlikely to contain the kind of errors that would require the extensive checking modgrammar allows.
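For reference, this is roughly what I mean by a hand-coded linear scan (a sketch only; the function name and return convention are my own):

def read_sqf_string(text, start):
    """Read an SQF string starting at text[start], which must be a double quote.

    Doubled quotes ("") inside the string stand for one embedded quote.
    Returns (contents, index_just_past_the_closing_quote).
    """
    assert text[start] == '"'
    chars = []
    i = start + 1
    while i < len(text):
        if text[i] == '"':
            if i + 1 < len(text) and text[i + 1] == '"':
                chars.append('"')                  # "" -> one literal quote
                i += 2
            else:
                return "".join(chars), i + 1       # closing quote
        else:
            chars.append(text[i])
            i += 1
    raise ValueError("unterminated string literal")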

So I finally found a solution that seems to work (pending extensive testing).
The answer can be found here:
Google Groups: Parsing Escaped Characters
posted by Dan S (08/09/2012):
Found one way:
grammar = ('"', ZERO_OR_MORE(WORD('^"') | WORD('^\', '"', count=2)), '"')
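Dan S's grammar targets backslash escapes; adapted to the doubled-quote convention described in the question, the same idea would look roughly like this (my own untested sketch, built only from the constructs already shown above, not a verified modgrammar 0.10 recipe):

from modgrammar import *

class String (Grammar):
    # A string is a quote, then any mix of non-quote characters and
    # doubled quotes ("" standing for one embedded quote), then a quote.
    grammar = (L('"'), ZERO_OR_MORE(WORD('^"') | L('""')), L('"'))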

Related

Different strings in a single variable in python [duplicate]

I can create a multi-line string using this syntax:
string = str("Some chars "
             "Some more chars")
This will produce the following string:
"Some chars Some more chars"
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
P.S.: I just want to understand the internals. I know there are other ways to declare or create multi-line strings.
Read the reference manual, it's in there.
Specifically:
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings,
(emphasis mine)
This is why:
string = str("Some chars "
             "Some more chars")
is exactly the same as: str("Some chars Some more chars").
This concatenation is performed wherever a string literal might appear: list initializations, function calls (as is the case with str above), et cetera.
The only caveat is when a string literal is not contained between one of the grouping delimiters (), {} or [] but, instead, spreads between two separate physical lines. In that case we can alternatively use the backslash character to join these lines and get the same result:
string = "Some chars " \
         "Some more chars"
Of course, concatenation of strings on the same physical line does not require the backslash. (string = "Hello " "World" is just fine)
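The "comments to parts of strings" use mentioned in the quoted documentation looks like this, for instance (my own small example):

pattern = (
    "[A-Za-z_]"       # first character: a letter or an underscore
    "[A-Za-z0-9_]*"   # then any number of word characters
)
# pattern == "[A-Za-z_][A-Za-z0-9_]*"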
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
Python is; now, when exactly Python does this is where things get interesting.
From what I could gather (take this with a pinch of salt, I'm not a parsing expert), this happens when Python transforms the parse tree (LL(1) parser) for a given expression to its corresponding AST (Abstract Syntax Tree).
You can get a view of the parsed tree via the parser module:
import parser
expr = """
str("Hello "
"World")
"""
pexpr = parser.expr(expr)
parser.st2list(pexpr)
This dumps a pretty big and confusing list that represents the concrete syntax tree parsed from the expression in expr:
-- rest snipped for brevity --
[322,
[323,
[3, '"hello"'],
[3, '"world"']]]]]]]]]]]]]]]]]],
-- rest snipped for brevity --
The numbers correspond to either symbols or tokens in the parse tree and the mappings from symbol to grammar rule and token to constant are in Lib/symbol.py and Lib/token.py respectively.
As you can see in the snipped version I added, you have two different entries corresponding to the two different str literals in the expression parsed.
Next, we can view the output of the AST tree produced by the previous expression via the ast module provided in the Standard Library:
import ast

p = ast.parse(expr)
ast.dump(p)
# this prints out the following:
"Module(body = [Expr(value = Call(func = Name(id = 'str', ctx = Load()), args = [Str(s = 'hello world')], keywords = []))])"
The output is more user friendly in this case; you can see that the args for the function call is the single concatenated string Hello World.
In addition, I also stumbled upon a cool module that generates a visualization of the tree for ast nodes. Using it, the output of the expression expr is visualized like this:
Image cropped to show only the relevant part for the expression.
As you can see, in the terminal leaf node we have a single str object, the joined string for "Hello " and "World", i.e. "Hello World".
If you are feeling brave enough, dig into the source, the source code for transforming expressions into a parse tree is located at Parser/pgen.c while the code transforming the parse tree into an Abstract Syntax Tree is in Python/ast.c.
This information is for Python 3.5 and I'm pretty sure that unless you're using some really old version (< 2.5) the functionality and locations should be similar.
Additionally, if you are interested in the whole compilation process Python follows, a good gentle intro is provided by one of the core contributors, Brett Cannon, in the video From Source to Code: How CPython's Compiler Works.

What metasyntax notation is Python using?

Full grammar specification for Python 3.6.3 is as follows: https://docs.python.org/3/reference/grammar.html
It looks like EBNF extended with some special constructs taken from regular expressions, for example ()* (repeat zero or more times?) and ()+ (repeat one or more times?).
What metasyntax is Python using, and where can its specification be found?
Update
Python's grammar is defined in this file (thanks @larsks). However, the question still stands - what notation is used?
The Python grammar is parsed by the parser in the Parser directory of the source. You can see this in Makefile.pre. This generates Include/graminit.[ch], which are used in, e.g., Python/ast.c as well as Modules/parsermodule.c.
The format of the grammar is described at the bottom of pgen.c:
Input is a grammar in extended BNF (using * for repetition, + for
at-least-once repetition, [] for optional parts, | for alternatives and
() for grouping).
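For example, a rule like this one from the linked Grammar file uses the quoted literals, the ()* repetition and the [] optional part of that notation:

while_stmt: 'while' test ':' suite ['else' ':' suite]
testlist: test (',' test)* [',']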

Pretty-print Lisp using Python

Is there a way to pretty-print a Lisp-style code string (in other words, a bunch of balanced parentheses and text within) in Python without reinventing the wheel?
Short answer
I think a reasonable approach, if you can, is to generate Python lists or custom objects instead of strings and use the pprint module, as suggested by @saulspatz.
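For instance (my own toy example, with nested Python lists standing in for Lisp forms), pprint already produces a readable indented layout:

from pprint import pprint

# (* (+ 3 x) (f x y)) represented as nested Python lists
form = ['*', ['+', 3, 'x'], ['f', 'x', 'y']]
pprint(form, width=20)   # breaks the outer form across lines, one sub-form per line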
Long answer
The whole question looks like an instance of an XY problem. Why? Because you are using Python (why not Lisp?) to manipulate strings (why not data structures?) representing generated Lisp-style code, where Lisp-style is defined as "a bunch of parentheses and text within".
To the question "how to pretty-print?", I would thus respond "I wouldn't start from here!".
The best way to not reinvent the wheel in your case, apart from using existing wheels, is to stick to a simple output format.
But first of all, why do you need to pretty-print? Who will look at the resulting code?
Depending on the exact Lisp dialect you are using and the intended usage of the code, you could format your code very differently. Think about newlines, indentation and maximum width of your text, for example. The Common Lisp pretty-printer is particularly evolved and I doubt you want to have the same level of configurability.
If you used Lisp, a simple call to pprint would solve your problem, but you are using Python, so stick with the most reasonable output for the moment because pretty-printing is a can of worms.
If your code is intended for human readers, please:
don't put closing parenthesis on their own lines
don't vertically align open and close parenthesis
don't add spaces between opening parenthesis
This is ugly:
( * ( + 3 x )
    (f
       x
       y
    )
)
This is better:
(* (+ 3 x)
   (f x y))
Or simply:
(* (+ 3 x) (f x y))
See here for more details.
But before printing, you have to parse your input string and make sure it is well-formed. Maybe you are sure it is well-formed, due to how you generate your forms, but I'd argue that the printer should ignore that and not make too many assumptions. If you passed the pretty-printer an AST represented by Python objects instead of just strings, this would be easier, as suggested in comments. You could build a data-structure or custom classes and use the pprint (python) module. That, as said above, seems to be the way to go in your case, if you can change how you generate your Lisp-style code.
With strings, you are supposed to handle any possible input and reject invalid ones.
This means checking that parenthesis and quotes are balanced (beware of escape characters), etc.
Actually, you don't really need to build an intermediate tree for printing (though it would probably help for other parts of your program), because Lisp-style code is made of forms that are easily nested and use a prefix notation: you can scan your input string from left to right and print as required when seeing parentheses (open parenthesis: recurse; close parenthesis: return from the recursion). When you first encounter an unescaped double quote ("), read until the next one ("), and so on.
This, coupled with a simple printing method, could be sufficient for your needs.
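A rough illustration of that left-to-right scan (my own sketch: it ignores escape characters, assumes well-formed input, and uses a deliberately simple layout):

import re

def pretty_lisp(source, indent="    "):
    # Tokenize into parens, double-quoted strings (no escapes) and atoms.
    tokens = re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', source)
    lines, current, depth = [], "", 0
    for tok in tokens:
        if tok == "(":
            if current:
                lines.append(current)
            current = indent * depth + "("   # each new form starts a new line
            depth += 1
        elif tok == ")":
            depth -= 1
            current += ")"                   # closing parens stay on the same line
        else:
            current += ("" if current.endswith("(") else " ") + tok
    if current:
        lines.append(current)
    return "\n".join(lines)

print(pretty_lisp('(* (+ 3 x) (f x y))'))
# (*
#     (+ 3 x)
#     (f x y))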
I think the easiest method would be to use triple quotations. If you say:
print """
(((This is some lisp code))) """
It should work.
You can format your code any way you like within the triple quotes and it will come out the way you want it to.
Best of luck and happy coding!
I made this rudimentary pretty printer once for prettifying CLIPS, which is based on Lisp. Might help:
import os

def clips_pprint(clips_str: str) -> str:
    """Pretty-prints a CLIPS string.

    Indents a CLIPS string for easier visual confirmation during development
    and verification.

    Assumes the CLIPS string is valid CLIPS, i.e. braces are paired.
    """
    LB = "("
    RB = ")"
    TAB = " " * 4

    formatted_clips_str = ""
    tab_count = 0
    for c in clips_str:
        if c == LB:
            formatted_clips_str += os.linesep
            for _i in range(tab_count):
                formatted_clips_str += TAB
            tab_count += 1
        elif c == RB:
            tab_count -= 1
        formatted_clips_str += c
    return formatted_clips_str.strip()
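For example, feeding it a small one-line CLIPS rule (my own sample input):

print(clips_pprint('(defrule hello (greeting) => (printout t "hi"))'))
# each parenthesised sub-form is pushed onto its own indented line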

Triple-double quote vs. double quote

What is the preferred way to write Python doc string?
""" or "
In the book Dive Into Python, the author provides the following example:
def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.

    Returns string."""
In another chapter, the author provides another example:
def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\00", "").strip()
Both syntaxes work. The only difference to me is that """ allows us to write multi-line docstrings.
Are there any differences other than that?
From the PEP8 Style Guide:
PEP 257 describes good docstring conventions. Note that most
importantly, the """ that ends a multiline docstring should be on a
line by itself, e.g.:
"""Return a foobang
Optional plotz says to frobnicate the bizbaz first.
"""
For one liner docstrings, it's okay to keep the closing """ on the
same line.
PEP 257 recommends using triple quotes, even for one-line docstrings:
Triple quotes are used even though the string fits on one line. This
makes it easy to later expand it.
Note that not even the Python standard library itself follows these recommendations consistently. For example,
abcoll.py
ftplib.py
functools.py
inspect.py
They're both strings, so there is no difference. The preferred style is triple double quotes (PEP 257):
For consistency, always use """triple double quotes""" around docstrings.
Use r"""raw triple double quotes""" if you use any backslashes in your docstrings. For Unicode docstrings, use u"""Unicode triple-quoted strings""".
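A quick illustration of the raw-docstring advice (my own example; without the r prefix, the \t and \n in the path below would be read as escape sequences):

def default_report_path():
    r"""Return the default report location, e.g. C:\temp\new_report.txt."""
    return "C:\\temp\\new_report.txt"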
No, not really. If you are writing to a file, using triple quotes may be ideal, because you don't have to use "\n" in order to go a line down. Just make sure the quotes you start and end with are the same type (double or triple quotes). Here is a reliable resource if you have any more questions:
http://docs.python.org/release/1.5.1p1/tut/strings.html
You can also use triple-double quotes for a long SQL query to improve the readability and not to scroll right to see it as shown below:
query = """
    SELECT count(*)
    FROM (SELECT *
          FROM student
          WHERE grade = 2 AND major = 'Computer Science'
          FOR UPDATE)
    AS result;
"""
And, if using double quotes for the SQL query above, the readability is worse and you will need to scroll right to see it as shown below:
query = "SELECT count(*) FROM (SELECT * FROM student WHERE grade = 2 AND major = 'Computer Science' FOR UPDATE) AS result;"
In addition, you can also use triple-double quotes for a GraphQL query as shown below:
query = """
{
  products(first: 5) {
    edges {
      node {
        id
        handle
      }
    }
  }
}"""

How to parse code (in Python)?

I need to parse some special data structures. They are in some somewhat-like-C format that looks roughly like this:
Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
    );
}
I can think of several ways to go about this. I could 'tokenize' the code using regular expressions. I could read the code one character at a time and use a state machine to construct my data structure. I could get rid of comma-linebreaks and read the thing line by line. I could write some conversion script that converts this code to executable Python code.
Is there a nice pythonic way to parse files like this?
How would you go about parsing it?
This is more a general question about how to parse strings and not so much about this particular file format.
Using pyparsing (Mark Tolonen, I was just about to click "Submit Post" when your post came thru), this is pretty straightforward - see comments embedded in the code below:
data = """Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
    );
} """
from pyparsing import *
# define basic punctuation and data types
LBRACE,RBRACE,LPAREN,RPAREN,SEMI = map(Suppress,"{}();")
GROUP = Keyword("Group")
ENTRY = Keyword("Entry")
# use parse actions to do parse-time conversion of values
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))
# parses a string enclosed in quotes, but strips off the quotes at parse time
string = QuotedString('"')
# define structure expressions
value = string | real | integer
entry = Group(ENTRY + LPAREN + Group(Optional(delimitedList(value)))) + RPAREN + SEMI
# since Groups can contain Groups, need to use a Forward to define recursive expression
group = Forward()
group << Group(GROUP + LPAREN + string("name") + RPAREN +
               LBRACE + Group(ZeroOrMore(group | entry))("body") + RBRACE)
# ignore C style comments wherever they occur
group.ignore(cStyleComment)
# parse the sample text
result = group.parseString(data)
# print out the tokens as a nice indented list using pprint
from pprint import pprint
pprint(result.asList())
Prints
[['Group',
  'GroupName',
  [['Group',
    'AnotherGroupName',
    [['Entry', ['some', 'variables', 0, 3.141]],
     ['Entry', ['other', 'variables', 1, 2.718]]]],
   ['Entry', ['linebreaks', 'allowed', 3, 1.4139999999999999]]]]]
(Unfortunately, there may be some confusion since pyparsing defines a "Group" class, for imparting structure to the parsed tokens - note how the value lists in an Entry get grouped because the list expression is enclosed within a pyparsing Group.)
Check out pyparsing. It has lots of parsing examples.
It depends on how often you need this and whether the syntax stays the same. If the answers are "quite often" and "more or less yes", then I would look at a way to express the syntax and write a specific parser for that language with a tool like PyPEG or LEPL. Defining the parser rules is the big job, so unless you need to parse the same kind of files often, it might not necessarily be effective, though.
But if you look at the PyPEG page, it tells you how to output the parsed data to XML, so if that tool doesn't give you enough power, you could use it to generate the XML and then parse that with, e.g., lxml.
