What metasyntax notation is Python using?

What metasyntax notation is Python using? - python

Full grammar specification for Python 3.6.3 is as follows: https://docs.python.org/3/reference/grammar.html
It looks like EBNF appended by some special constructs taken from regular expressions, for example: ()* (repeat zero or more times?) and ()+ (repeat one or more times?).
What metasyntax is Python using and where its specification can be found?
Update
Python's grammar is defined in this file (thanks #larsks). However, the question still stands - what notation is used?

The Python grammar is parsed by the parser in the Parser directory of the source. You can see this in Makefile.pre. This generates Include/graminit.[ch], which are used in, e.g., Python/ast.c as well as Modules/parsermodule.c.
The format of the grammar is described at the bottom of pgen.c:
Input is a grammar in extended BNF (using * for repetition, + for
at-least-once repetition, [] for optional parts, | for alternatives and
() for grouping).

Related

How is PLY's parsetab.py formatted?

I'm working on a project to convert MATLAB code to Python, and have been somewhat successful after building off others work. The tool uses PLY (an implementation of lex and yacc parsing tools for Python) to parse the MATLAB input. Unfortunately, it is a requirement that my code is written in Python 3, not Python 2. The tool runs without issue in Python 2, but I get a strange error in Python 3 (Assuming A is an array):
log_idx = A <= 16;
^
SyntaxError: Unexpected "=" (parser)
The MATLAB code I am trying to convert is:
idx = A <= 16;
which should convert to almost the same thing in Python 3:
idx = A <= 16
The only real difference between the Python 3 code and the Python 2 code is the PLY-generated parsetab.py file, which has substantial differences in the following variables:
_tabversion
_lr_signature
_lr_action_items
_lr_goto_items
I'm having trouble understanding the purpose of these variables and why they could be different when the only difference was the Python version used to generate the parsetab.py file.
I tried searching for documentation on this, but was unsuccessful. I originally suspected it could be a difference in the way strings are formatted between Python 2 and Python 3, but that didn't turn anything up either. Is there anyone familiar with PLY that could give some insight into how these variables are generated, or why the Python version is creating this difference?
Edit: I'm not sure if this would be useful to anyone because the file is very long and cryptic, but below is an example of part of the first lines of _lr_action_items and _lr_goto_items
Python 2:
_lr_action_items = {'DOTDIV':([6,9,14,20,22,24,32,34,36,42,46,47,52,54,56,57,60,71,72,73,74,75 ...
_lr_goto_items = {'lambda_args':([45,80,238,],[99,161,263,]),'unwind':([1,8,28,77,87,160,168,177 ...
Python 3:
_lr_action_items = {'END_STMT':([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26,27,39,41,48,50 ...
_lr_goto_items = {'top':([0,],[1,]),'stmt':([1,44,46,134,137,207,212,214,215,244,245,250 ...

I'm going to go out on a limb here, because you have provided practically no indication of what code you are actually using. So I'm just going to assume that you copied the lexer.py file from the github repository you linked to in your question.
There's an important clue in this error message:
log_idx = A <= 16;
^
SyntaxError: Unexpected "=" (parser)
Evidently, <= is not being scanned as a single token; otherwise, the parser would not see an = token at that point in the input. This can only mean that the scanner is returning two tokens, < and =, and if that's the case, it is most certainly a syntax error, as you would expect from
log_idx = A < = 16;
To figure out why the lexer would do this, it's important to understand how the Ply (default) lexer works. It gathers up all the lexer patterns from variables whose names start t_, which must be either functions or variables whose values are strings. It then sorts them as follows:
function docstrings, in order by line number in the source file.
string values, in reverse order by length.
See Specification of Tokens in the Ply manual.
That usually does the right thing, but not always. The intention of sorting in reverse order by length is that a prefix pattern will come after a pattern which matches a longer string. So if you had patterns '<' and '<=', '<=' would be tried first, and so in the case where the input had <=, the < pattern would never be tried. That's important, since if '<' is tried first, '<=' will never be recognised.
However, this simple heuristic does not always work. The fact that a regular expression is shorter does not necessarily mean that its match will be shorter. So if you expect "maximal munch" semantics, you sometimes have to be careful about your patterns. (Or you can supply them as docstrings, because then you have complete control over the order.)
And whoever created that lexer.py file was not careful about their patterns, because it includes (among other issues):
t_LE = r"<="
t_LT = r"\<"
Note that since these are raw strings, the backslash is retained in the second string, so both patterns are of length 2:
>>> len(r"\<")
2
>>> len(r"<=")
2
Since the two patterns have the same length, their relative order in the sort is unspecified. And it is quite possible that the two versions of Python produce different sort orders, either because of differences in the implementation of sort or because of differences in the order which the dictionary of variables is iterated, or some combination of the above.
< has no special significance in a Python regular expression, so there is no need to backslash-escape it in the definition of t_LT. (Clearly, since it is not backslash-escaped in t_LE.) So the simplest solution would be to make the sort order unambiguous by removing the backslash:
t_LE = r"<="
t_LT = r"<"
Now, t_LE is longer and will definitely be tried first.
That's not the only instance of this problem in the lexer file, so you might want to revise it carefully.
Note: You could also fix the problem by adding an unnecessary backslash to the t_LE pattern; there is an argument for taking the attitude, "When in doubt, escape." However, it is useful to know which characters need to be escaped in a Python regex, and the Python documentation for the re package contains a complete list. Also, consider using long raw strings for patterns which include quotes, since neither " nor ' need to be backslash escaped in a Python regex.

Sublime Text syntax: Python 3.6 f-strings

I am trying to modify the default Python.sublime_syntax file to handle Python’s f-string literals properly. My goal is to have expressions in interpolated strings recognised as such:
f"hello {person.name if person else 'there'}"
-----------source.python----------
------string.quoted.double.block.python------
Within f-strings, ranges of text between a single { and another } (but terminating before format specifiers such as !r}, :<5}, etc—see PEP 498) should be recognised as expressions. As far as I know, that might look a little like this:
...
string:
- match: "(?<=[^\{]\{)[^\{].*)(?=(!(s|r|a))?(:.*)?\})" # I'll need a better regex
push: expressions
However, upon inspecting the build-in Python.sublime_syntax file, the string contexts especially are to unwieldy to even approach (~480 lines?) and I have no idea how to begin. Thanks heaps for any info.

There was an update to syntax highlighting in BUILD 3127 (Which includes: Significant improvements to Python syntax highlighting).
However, a couple users have stated that in BUILD 3176 syntax highlighting still is not set to correctly highlight Python expressions that are located within f strings. According to #Jollywatt, it is set to source.python f"string.quoted.double.block {constant.other.placeholder}" rather than f"string.quoted.double.block {source.python}"
It looks like Sublime uses this tool, PackageDev, "to ease the creation of snippets, syntax definitions, etc. for Sublime Text."

Stack Overflow when Pyparsing Ada 2005 Scoped Identifiers using Reference Manual Grammar

I'm currently implementing an Ada 2005 parser using Pyparsing and the reference manual grammar rules. We need this in order to analyze and transform parts of our aging Ada-codebase to C/C++.
Most things work.
However, one little annoying problem remains:
The grammar rule name when parsing scoped identifiers (rule selected_component) such as the expression "Global_Types.Integer2" fails because it is part of a left-associative grammar rule cycle.
I believe this rule is incorrectly written: the sub-rule direct_name should be placed after the sub-rule direct_name. In fact it should be placed last in the list of alternatives. Otherwise direct_name and in turn name matches and "Global_Types" only and then expects the string to end after that. Not what I want.
Therefore I now move the rule direct_name to the end of name-alternatives...but then I instead get an Pyparsing infinite recursion and Python spits out maximum recursion depth exceeded.
I believe the problem is caused either by the fact that
the associativity of the grammar rule selected_component is right-to-left. I've search the reference manual of Pyparsing but haven't found anything relevant. Should we treat dot (.) as an operator with right-to-left associativity or can we solve it throught extensions and restructurings of the grammar rules?
or by the fact that there is no check in Pyparsing infinite recursions. I believe this wouldn't be too hard to implement. Use a map from currently active rules (functions) to source position/offset (getTokensEndLoc()) and always fail a rule if the current source input position/offset equals the position related to the rule just entered.
Recursive expressions with pyparsing may be related to my problem.
The problem also seems closely related to Need help in parsing part of python grammar
which unfortunately doesn't have answers yet.
Here's the Ada 2005 grammar rule cycle that causes infinite recursion:
name =>
selected_component =>
prefix =>
name
Note that this problem is not an Ada-specific issue but is related to all grammars containing left-recursive rules.

For reference, as noted in GNAT: The GNU Ada Compiler, §2.2 The Parser, "The Ada grammar given [in] the ARM is ambiguous, and a table-driven parser would be forced to modify the grammar to make it acceptable to LL (1) or LALR (1) techniques."

Using a grammar parser for Python and constructing files from the tree

I have a custom made grammar for an interpreted language and I am looking for advice on a parser which will create a tree which I can query. From the structure I would like to be able to generate code in the interpreted language. Most grammar parsers that I have seen validate already existing code. The second part of my question is should the grammar be abstracted to the point that the Python code will substitute symbols in the tree for actual code terminology? Ideally, I would love be be able to query a root symbol and have returned all the symbols which fall under that root and so forth all the way to a terminal symbol.
Any advice on this process or my vocabulary regarding it would be very helpful. Thank you.

The vast majority of parser libraries will create an abstract syntax tree (AST) from whatever code it is you're generating; you can use whatever, eg pyparsing. To go from the AST to code, you might have to write functions manually to do that, but it's pretty easy to do that recursively. For example:
def generate(ast):
if ast[0] == '+':
return generate(ast[1]) + " + " + generate(ast[2])
elif ast[0] == 'for':
return "for %s in %s:\n" % (ast[1], generate(ast[2])) + generate(ast[3])
...
assuming an AST structure that's just a list where the first element is a tag for the node name, followed by the trees for any arguments: [+, 4, [*, 'x', 5]]. Of course, you should use whatever your parser library uses, unless you're writing the parser yourself.
I don't understand what you mean by Python code substituting symbols in the tree for actual code terminology.
You could write an easy function to iterate over all the symbols under a root node:
def traverse_preorder(ast):
yield ast[0]
for arg in ast[1:]:
for x in traverse_preorder(arg):
yield x
On second thought, the variable name ast is maybe a poor choice because of the ast module.

I'd use ANTLR. Version 3 (current) supports generating Python code. It will generate an Abstract Syntax Tree (AST) automatically during parsing, which you can then traverse. An important part of this will be annotating your grammar with which tokens are to be treated as subtrees (e.g. operators).

Function Parser with RegEx in Python

I have a source code in Fortran (almost irrelevant) and I want to parse the function names and arguments.
eg using
(\w+)\([^\(\)]+\)
with
a(b(1 + 2 * 2), c(3,4))
I get the following: (as expected)
b, 1 + 2 * 2
c, 3,4
where I would need
a, b(1 + 2 * 2), c(3,4)
b, 1 + 2 * 2
c, 3,4
Any suggestions?
Thanks for your time...

It can be done with regular expressions-- use them to tokenize the string, and work with the tokens. i.e. see re.Scanner. Alternatively, just use pyparsing.

This is a nonlinear grammar -- you need to be able to recurse on a set of allowed rules. Look at pyparsing to do simple CFG (Context Free Grammar) parsing via readable specifications.
It's been a while since I've written out CFGs, and I'm probably rusty, so I'll refer you to the Python EBNF to get an idea of how you can construct one for a subset of a language syntax.
Edit: If the example will always be simple, you can code a small state machine class/function that iterates over the tokenized input string, as #Devin Jeanpierre suggests.

You can take a look at PLY (Python Lex-Yacc), it's (in my opinion) very simple to use and well documented, and it comes with a calculator example which could be a good starting point.

I don't think this is a job for regular expressions... they can't really handle nested patterns.
This is because regexes are compiled into FSMs (Finite State Machines). In order to parse arbitrarily nested expressions, you can't use a FSM, because you need infinitely many states to keep track of the arbitrary nesting. Also see this SO thread.

You can't do this with regular expression only. It's sort of recursive. You should match first the most external function and its arguments, print the name of the function, then do the same (match the function name, then its arguments) with all its arguments. Regex alone are not enough.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.